GeoETL v0.2.0: Streaming Architecture & 7.6x Performance Boost

Yogesh · GeoETL Maintainer · 8 min read

TL;DR: GeoETL v0.2.0 delivers on our performance promises with a production-ready streaming architecture. CSV processing achieves 2,266 MB/min throughput (7.6x faster than GeoJSON) while maintaining constant O(1) memory usage. Process 4.2 GB files in just 49.9 MB of memory.

Why This Release Matters

In v0.1.0, we made bold claims about delivering 5-10x faster performance than traditional tools. With v0.2.0, we're beginning to deliver on those promises.

This release is all about performance and streaming architecture. We've implemented a production-ready streaming system that:

  • Eliminates OOM errors - Process files larger than your RAM
  • Achieves production-grade throughput - 2,266 MB/min for CSV
  • Maintains constant memory - O(1) memory complexity regardless of file size
  • Enables performance tuning - Configurable batch sizes for your workload

Headline Features

🚀 Production-Ready Streaming CSV Architecture

Problem: Traditional geospatial tools often load entire datasets into memory, causing out-of-memory (OOM) errors on large files. Users had to split files or use machines with excessive RAM.

Solution: We've implemented a streaming architecture based on Apache DataFusion that processes data in configurable batches. Data flows through the system in chunks, never loading the entire file into memory.

Value: You can now process files 100x larger than your available RAM. A 4.2 GB CSV file with 129 million features processes in just 49.9 MB of memory - a 0.012x memory ratio.
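
If you're curious what this looks like in code, here's a minimal sketch of the batch-at-a-time pattern using DataFusion's streaming API. It's an illustration of the approach, not GeoETL's actual source; the file name and per-batch handling are placeholders.

```rust
use datafusion::prelude::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Batch size bounds peak memory: only one RecordBatch is live at a time.
    let config = SessionConfig::new().with_batch_size(8192);
    let ctx = SessionContext::new_with_config(config);

    // Registers the CSV lazily; no data is read yet.
    let df = ctx.read_csv("buildings.csv", CsvReadOptions::new()).await?;

    // execute_stream yields RecordBatches one at a time instead of
    // collecting the whole result set into memory.
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        let batch = batch?;
        // Transform/write this chunk, then drop it before pulling the next.
        println!("processed {} rows", batch.num_rows());
    }
    Ok(())
}
```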

Real-World Results (Microsoft Buildings Dataset: 129M features, 4.2 GB):

# Before: Would crash with OOM on most laptops
# After: Runs smoothly in ~50 MB memory

$ geoetl-cli convert \
    --input buildings.csv \
    --output buildings.geojson \
    --input-driver CSV \
    --output-driver GeoJSON

✅ Duration: 1.86 minutes (112 seconds)
✅ Memory: 49.9 MB peak (constant)
✅ Throughput: 2,266 MB/min (38.2 MB/s)
✅ CPU: 96.9% average utilization

Technical Details: See ADR 002: Streaming CSV Architecture for the complete architecture.

⚡ 7.6x Performance Gap Analysis & Path Forward

Problem: While our streaming architecture works, we discovered a significant performance gap between CSV (2,266 MB/min) and GeoJSON (297 MB/min) processing.

Solution: We implemented comprehensive benchmarking infrastructure and documented our findings in three Architecture Decision Records (ADRs). We know exactly where the bottleneck is: JSON parsing.

Value: Transparency and honesty. We're not hiding our performance gaps. Instead, we've:

  • Identified the root cause (JSON parsing accounts for 99.5% of CPU usage)
  • Documented the issue in ADR 003
  • Created a phased optimization roadmap targeting 3-7x improvement
  • Set a clear target: 1-2 GB/min throughput for GeoJSON

Benchmark Comparison (Microsoft Buildings dataset, 129M features):

| Format | File Size | Throughput | Duration | Memory | Status |
|--------|-----------|------------|----------|--------|--------|
| CSV | 4.2 GB | 2,266 MB/min | 1.86 min | 49.9 MB | ✅ Production-ready |
| GeoJSON | 14.5 GB | 297 MB/min | 49.95 min | 83.7 MB | ⚠️ Needs optimization |

Key Finding: CSV is 7.6x faster than GeoJSON because JSON parsing is inherently more complex than CSV parsing. We're investigating faster JSON libraries (simd-json, sonic-rs) for future releases.
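
To see the asymmetry in miniature: a CSV record is a flat delimiter scan, while a GeoJSON feature needs full tokenization (strings, escapes, nesting) before any value is usable. The snippet below is purely illustrative (it uses serde_json for the JSON side), not a benchmark:

```rust
// Illustrative micro-comparison: parsing one record in each format.
fn main() {
    // CSV: one flat delimiter scan per record.
    let record = "1,-122.4,37.8";
    let fields: Vec<&str> = record.split(',').collect();
    assert_eq!(fields, ["1", "-122.4", "37.8"]);

    // GeoJSON: a full tokenizer must handle strings, escapes, and nesting
    // before any value is available.
    let feature = r#"{"type":"Feature","properties":{"id":1},
                     "geometry":{"type":"Point","coordinates":[-122.4,37.8]}}"#;
    let parsed: serde_json::Value = serde_json::from_str(feature).unwrap();
    assert_eq!(parsed["geometry"]["type"], "Point");
}
```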

🎛️ Configurable Performance Tuning

Problem: Different workloads have different memory/speed tradeoffs. A data scientist with 64 GB RAM has different needs than a developer on a laptop with 8 GB.

Solution: We've added three new CLI parameters for performance tuning:

--batch-size <N>          # Features per batch (default: 8,192)
--read-partitions <N>     # Parallel reading threads
--write-partitions <N>    # Parallel writing threads

Value: You control the memory/speed tradeoff. Based on our benchmarking:

  • Conservative (8,192 batch size): Minimal memory, good performance
  • Optimal (262,144 batch size): 1.43x faster, slightly higher memory
  • Custom: Tune for your specific workload

Example - Tuning for Speed:

# High-performance conversion with large batches
$ geoetl-cli convert \
    --input large-dataset.csv \
    --output large-dataset.geojson \
    --input-driver CSV \
    --output-driver GeoJSON \
    --batch-size 262144

# Result: 1.43x faster throughput with minimal memory increase
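
For the curious, here's one plausible way such flags could map onto DataFusion's session configuration. This is an assumption for illustration, not GeoETL's actual plumbing:

```rust
use datafusion::prelude::*;

// Hypothetical plumbing (assumption, not GeoETL's code): how CLI tuning
// flags could translate into DataFusion session settings.
fn tuned_context(batch_size: usize, partitions: usize) -> SessionContext {
    let config = SessionConfig::new()
        .with_batch_size(batch_size)         // rows per RecordBatch
        .with_target_partitions(partitions); // degree of parallelism
    SessionContext::new_with_config(config)
}
```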

Warning: Some formats (CSV, GeoJSON) automatically override write partitions to 1 to ensure valid output. You'll see a warning message if this happens.

📊 GeoJSON Incremental Decoder - No More OOM!

Problem: Large GeoJSON files (10+ GB) caused out-of-memory errors because the entire file was loaded into memory for parsing.

Solution: We implemented a state machine-based incremental JSON parser that processes JSON in chunks, handling incomplete JSON across byte stream boundaries.

Value: Eliminated OOM errors on large GeoJSON files. You can now process GeoJSON files of any size.
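
Here's the core technique in simplified form. This is not GeoETL's actual decoder; it assumes well-formed input and features arriving as concatenated `{...}` objects (a real decoder must also peel the FeatureCollection wrapper), but it shows how an object split across chunk boundaries stays buffered until it completes:

```rust
/// Simplified illustration of chunk-boundary-safe JSON object extraction.
/// Byte-wise scanning is UTF-8 safe here because `{ } " \` are ASCII and
/// never appear inside multi-byte sequences.
struct IncrementalDecoder {
    buf: Vec<u8>,    // bytes carried over between chunks
    depth: usize,    // current brace nesting depth
    in_string: bool, // inside a JSON string literal?
    escaped: bool,   // previous byte was a backslash inside a string
    start: usize,    // offset where the current top-level object began
}

impl IncrementalDecoder {
    fn new() -> Self {
        Self { buf: Vec::new(), depth: 0, in_string: false, escaped: false, start: 0 }
    }

    /// Feed one chunk; returns every complete top-level object found.
    /// An object split across chunks stays buffered until it completes.
    fn feed(&mut self, chunk: &[u8]) -> Vec<String> {
        let scan_from = self.buf.len();
        self.buf.extend_from_slice(chunk);
        let mut objects = Vec::new();
        for i in scan_from..self.buf.len() {
            let b = self.buf[i];
            if self.in_string {
                if self.escaped { self.escaped = false; }
                else if b == b'\\' { self.escaped = true; }
                else if b == b'"' { self.in_string = false; }
            } else {
                match b {
                    b'"' => self.in_string = true,
                    b'{' => { if self.depth == 0 { self.start = i; } self.depth += 1; }
                    b'}' => {
                        self.depth -= 1;
                        if self.depth == 0 {
                            let s = String::from_utf8_lossy(&self.buf[self.start..=i]);
                            objects.push(s.into_owned());
                        }
                    }
                    _ => {}
                }
            }
        }
        if self.depth == 0 {
            self.buf.clear(); // everything emitted; separators can be dropped
        } else {
            self.buf.drain(..self.start); // keep only the unfinished object
            self.start = 0;
        }
        objects
    }
}

fn main() {
    let mut dec = IncrementalDecoder::new();
    // One feature split across two reads, as it would arrive from disk I/O.
    let chunks: [&[u8]; 2] = [br#"{"type":"Feature","geometry":{"ty"#, br#"pe":"Point"}}"#];
    for chunk in chunks {
        for feature in dec.feed(chunk) {
            println!("complete feature: {feature}");
        }
    }
}
```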

Proof (Microsoft Buildings: 129M features, 14.5 GB GeoJSON):

$ geoetl-cli convert \
    --input buildings.geojson \
    --output buildings.csv \
    --input-driver GeoJSON \
    --output-driver CSV

✅ Duration: 49.95 minutes
✅ Memory: 83.7 MB peak (constant for a 14.5 GB file!)
✅ No OOM errors

Technical Details: See ADR 001: Streaming GeoJSON Architecture.

Other Improvements & Fixes

Performance

  • Optimized GeoJSON batch size: Increased default from 8,192 to 262,144 for 1.29x performance improvement
  • Schema inference optimization: Reduced memory by sampling first 10 MB instead of entire file
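
To make the schema-sampling idea above concrete, here's a sketch of a bounded-prefix read. Only the 10 MB figure comes from this release; the function and its name are illustrative:

```rust
use std::fs::File;
use std::io::{Read, Result};

// Sketch of bounded-prefix sampling for schema inference (illustration
// only, not GeoETL's actual implementation).
const SAMPLE_BYTES: u64 = 10 * 1024 * 1024;

fn read_schema_sample(path: &str) -> Result<Vec<u8>> {
    let file = File::open(path)?;
    let mut sample = Vec::new();
    // `take` caps the read at SAMPLE_BYTES regardless of file size,
    // so inference memory stays constant even for multi-GB inputs.
    file.take(SAMPLE_BYTES).read_to_end(&mut sample)?;
    Ok(sample)
}
```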

Bug Fixes

  • Fixed CSV write partitioning: CSV now correctly enforces single-partition writes (prevents invalid multi-file output)
  • Fixed GeoJSON OOM: Streaming decoder eliminates out-of-memory errors on 15+ GB files

Infrastructure

  • Comprehensive benchmarking suite: Real-time monitoring scripts with CPU, memory, disk I/O tracking
  • Automated benchmark runner: JSON result output for performance regression testing
  • Microsoft Buildings dataset: Download scripts for 129M feature test dataset

Documentation

  • 3 new ADRs: Complete architecture documentation for streaming implementations
  • Benchmark results: Published real-world performance data in bench/README.md
  • Honest performance assessment: Clear status indicators (✅ Production-ready vs ⚠️ Needs work)

Performance Benchmarks

We've published comprehensive benchmark results using the Microsoft Buildings dataset (129M features). All tests run on real hardware with detailed monitoring.

CSV Performance (Production-Ready ✅)

Dataset: 129M features, 4.2 GB CSV

| Metric | Value |
|--------|-------|
| Duration | 1.86 minutes (112 seconds) |
| Throughput | 2,266 MB/min (38.2 MB/s) |
| Peak Memory | 49.9 MB (constant) |
| CPU Utilization | 96.9% average (efficient) |
| Disk Write Speed | 88.2 MB/s |
| Memory Ratio | 0.012x (49.9 MB for 4.2 GB) |

Status: ✅ Production-ready for performance-critical workloads

GeoJSON Performance (Needs Optimization ⚠️)

Dataset: 129M features, 14.5 GB GeoJSON

| Metric | Value |
|--------|-------|
| Duration | 49.95 minutes (2,997 seconds) |
| Throughput | 297 MB/min (12.0 MB/s) |
| Peak Memory | 83.7 MB (constant) |
| CPU Utilization | 99.5% (saturated; parsing bottleneck) |
| Disk Write Speed | 12.0 MB/s |
| Memory Ratio | 0.006x (83.7 MB for 14.5 GB) |

Status: ⚠️ Memory-efficient but needs 3-7x speed improvement

Batch Size Impact

| Batch Size | GeoJSON Throughput | Memory | Recommendation |
|------------|--------------------|--------|----------------|
| 8,192 | 230 MB/min | 83.7 MB | Conservative |
| 262,144 | 297 MB/min | 83.7 MB | Optimal |

Key Finding: Larger batches provide 1.29x speedup with no memory increase.

⚠️ Breaking Changes

None - this release is fully backward compatible with v0.1.x.

Community & Contributors

This release represents a major milestone in GeoETL's development. A special thank you to:

  • The Apache DataFusion team for their excellent streaming query engine
  • The GeoRust community for geospatial libraries
  • Everyone who provided feedback on v0.1.0

This project thrives on community contributions. We welcome your feedback, bug reports, and contributions!

The Future: What's Next?

We have a clear roadmap for the next releases:

Short-term (v0.3.0):

  • 🎯 GeoParquet format support - Modern columnar format for geospatial data
  • 🎯 Improved compression - Smaller file sizes
  • 🎯 Schema preservation - Maintain metadata across conversions

Medium-term (v0.4.0):

  • 🚀 GeoJSON optimization - Target 3-7x speedup (evaluate simd-json, sonic-rs)
  • 🚀 Parallel processing - Multi-threaded execution for supported formats
  • 🚀 Format auto-detection - No more manual driver specification

See our full Roadmap for details.

How to Upgrade

Installation

From source:

git clone https://github.com/geoyogesh/geoetl.git
cd geoetl
git checkout v0.2.0
cargo build --release

# Binary at: target/release/geoetl-cli

Pre-built binaries: Coming soon in v0.3.0

Verify Installation

$ geoetl-cli --version
geoetl-cli 0.2.0

$ geoetl-cli drivers
# Should show CSV and GeoJSON drivers

Get Started Today

Try the new streaming architecture:

# Download sample data (Microsoft Buildings)
# See: https://github.com/geoyogesh/geoetl/tree/main/bench

# Convert 4.2 GB CSV in ~2 minutes using only 50 MB memory
geoetl-cli convert \
    --input buildings.csv \
    --output buildings.geojson \
    --input-driver CSV \
    --output-driver GeoJSON \
    --batch-size 262144

# Monitor memory usage in another terminal
watch -n 1 'ps aux | grep geoetl-cli'

Get Involved

We'd love your help making GeoETL better: file a bug report, share feedback, or contribute code.


Download: GeoETL v0.2.0

Have questions or feedback? Join the discussion on GitHub!