GeoETL v0.2.0: Streaming Architecture & 7.6x Performance Boost

Yogesh · GeoETL Maintainer · 8 min read

TL;DR: GeoETL v0.2.0 delivers on our performance promises with a production-ready streaming architecture. CSV processing achieves 2,266 MB/min throughput (7.6x faster than GeoJSON) while maintaining constant O(1) memory usage. Process 4.2 GB files in just 49.9 MB of memory.

Why This Release Matters

In v0.1.0, we made bold claims about delivering 5-10x faster performance than traditional tools. With v0.2.0, we're beginning to deliver on those promises.

This release is all about performance and streaming architecture. We've implemented a production-ready streaming system that:

  • Eliminates OOM errors - Process files larger than your RAM
  • Achieves production-grade throughput - 2,266 MB/min for CSV
  • Maintains constant memory - O(1) memory complexity regardless of file size
  • Enables performance tuning - Configurable batch sizes for your workload

Headline Features

🚀 Production-Ready Streaming CSV Architecture

Problem: Traditional geospatial tools often load entire datasets into memory, causing out-of-memory (OOM) errors on large files. Users had to split files or use machines with excessive RAM.

Solution: We've implemented a streaming architecture based on Apache DataFusion that processes data in configurable batches. Data flows through the system in chunks, never loading the entire file into memory.

Value: You can now process files 100x larger than your available RAM. A 4.2 GB CSV file with 129 million features processes in just 49.9 MB of memory - a 0.012x memory ratio.
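
If you're curious what this looks like in code, here's a minimal sketch of the batch-at-a-time pattern using DataFusion's streaming API. It's an illustration of the approach, not GeoETL's actual source; the file name and per-batch handling are placeholders.

```rust
use datafusion::prelude::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Batch size bounds peak memory: only one RecordBatch is live at a time.
    let config = SessionConfig::new().with_batch_size(8192);
    let ctx = SessionContext::new_with_config(config);

    // Registers the CSV lazily; no data is read yet.
    let df = ctx.read_csv("buildings.csv", CsvReadOptions::new()).await?;

    // execute_stream yields RecordBatches one at a time instead of
    // collecting the whole result set into memory.
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        let batch = batch?;
        // Transform/write this chunk, then drop it before pulling the next.
        println!("processed {} rows", batch.num_rows());
    }
    Ok(())
}
```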

Real-World Results (Microsoft Buildings Dataset: 129M features, 4.2 GB):

# Before: Would crash with OOM on most laptops
# After: Runs smoothly in ~50 MB memory

$ geoetl-cli convert \
    --input buildings.csv \
    --output buildings.geojson \
    --input-driver CSV \
    --output-driver GeoJSON

✅ Duration: 1.86 minutes (112 seconds)
✅ Memory: 49.9 MB peak (constant)
✅ Throughput: 2,266 MB/min (38.2 MB/s)
✅ CPU: 96.9% average utilization

Technical Details: See ADR 002: Streaming CSV Architecture for the complete architecture.

⚡ 7.6x Performance Gap Analysis & Path Forward

Problem: While our streaming architecture works, we discovered a significant performance gap between CSV (2,266 MB/min) and GeoJSON (297 MB/min) processing.

Solution: We implemented comprehensive benchmarking infrastructure and documented our findings in three Architecture Decision Records (ADRs). We know exactly where the bottleneck is: JSON parsing.

Value: Transparency and honesty. We're not hiding our performance gaps. Instead, we've:

  • Identified the root cause (JSON parsing accounts for 99.5% of CPU usage)
  • Documented the issue in ADR 003
  • Created a phased optimization roadmap targeting 3-7x improvement
  • Set a clear target: 1-2 GB/min throughput for GeoJSON

Benchmark Comparison (Microsoft Buildings dataset, 129M features):

| Format | File Size | Throughput | Duration | Memory | Status |
|--------|-----------|------------|----------|--------|--------|
| CSV | 4.2 GB | 2,266 MB/min | 1.86 min | 49.9 MB | ✅ Production-ready |
| GeoJSON | 14.5 GB | 297 MB/min | 49.95 min | 83.7 MB | ⚠️ Needs optimization |

Key Finding: CSV is 7.6x faster than GeoJSON because JSON parsing is inherently more complex than CSV parsing. We're investigating faster JSON libraries (simd-json, sonic-rs) for future releases.
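
To see the asymmetry in miniature: a CSV record is a flat delimiter scan, while a GeoJSON feature needs full tokenization (strings, escapes, nesting) before any value is usable. The snippet below is purely illustrative (it uses serde_json for the JSON side), not a benchmark:

```rust
// Illustrative micro-comparison: parsing one record in each format.
fn main() {
    // CSV: one flat delimiter scan per record.
    let record = "1,-122.4,37.8";
    let fields: Vec<&str> = record.split(',').collect();
    assert_eq!(fields, ["1", "-122.4", "37.8"]);

    // GeoJSON: a full tokenizer must handle strings, escapes, and nesting
    // before any value is available.
    let feature = r#"{"type":"Feature","properties":{"id":1},
                     "geometry":{"type":"Point","coordinates":[-122.4,37.8]}}"#;
    let parsed: serde_json::Value = serde_json::from_str(feature).unwrap();
    assert_eq!(parsed["geometry"]["type"], "Point");
}
```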

🎛️ Configurable Performance Tuning

Problem: Different workloads have different memory/speed tradeoffs. A data scientist with 64 GB RAM has different needs than a developer on a laptop with 8 GB.

Solution: We've added three new CLI parameters for performance tuning:

--batch-size <N>          # Features per batch (default: 8,192)
--read-partitions <N>     # Parallel reading threads
--write-partitions <N>    # Parallel writing threads

Value: You control the memory/speed tradeoff. Based on our benchmarking:

  • Conservative (8,192 batch size): Minimal memory, good performance
  • Optimal (262,144 batch size): 1.43x faster, slightly higher memory
  • Custom: Tune for your specific workload

Example - Tuning for Speed:

# High-performance conversion with large batches
$ geoetl-cli convert \
    --input large-dataset.csv \
    --output large-dataset.geojson \
    --input-driver CSV \
    --output-driver GeoJSON \
    --batch-size 262144

# Result: 1.43x faster throughput with minimal memory increase
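
For the curious, here's one plausible way such flags could map onto DataFusion's session configuration. This is an assumption for illustration, not GeoETL's actual plumbing:

```rust
use datafusion::prelude::*;

// Hypothetical plumbing (assumption, not GeoETL's code): how CLI tuning
// flags could translate into DataFusion session settings.
fn tuned_context(batch_size: usize, partitions: usize) -> SessionContext {
    let config = SessionConfig::new()
        .with_batch_size(batch_size)         // rows per RecordBatch
        .with_target_partitions(partitions); // degree of parallelism
    SessionContext::new_with_config(config)
}
```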

Warning: Some formats (CSV, GeoJSON) automatically override write partitions to 1 to ensure valid output. You'll see a warning message if this happens.

📊 GeoJSON Incremental Decoder - No More OOM!

Problem: Large GeoJSON files (10+ GB) caused out-of-memory errors because the entire file was loaded into memory for parsing.

Solution: We implemented a state machine-based incremental JSON parser that processes JSON in chunks, handling incomplete JSON across byte stream boundaries.

Value: Eliminated OOM errors on large GeoJSON files. You can now process GeoJSON files of any size.
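
Here's the core technique in simplified form. This is not GeoETL's actual decoder; it assumes well-formed input and features arriving as concatenated `{...}` objects (a real decoder must also peel the FeatureCollection wrapper), but it shows how an object split across chunk boundaries stays buffered until it completes:

```rust
/// Simplified illustration of chunk-boundary-safe JSON object extraction.
/// Byte-wise scanning is UTF-8 safe here because `{ } " \` are ASCII and
/// never appear inside multi-byte sequences.
struct IncrementalDecoder {
    buf: Vec<u8>,    // bytes carried over between chunks
    depth: usize,    // current brace nesting depth
    in_string: bool, // inside a JSON string literal?
    escaped: bool,   // previous byte was a backslash inside a string
    start: usize,    // offset where the current top-level object began
}

impl IncrementalDecoder {
    fn new() -> Self {
        Self { buf: Vec::new(), depth: 0, in_string: false, escaped: false, start: 0 }
    }

    /// Feed one chunk; returns every complete top-level object found.
    /// An object split across chunks stays buffered until it completes.
    fn feed(&mut self, chunk: &[u8]) -> Vec<String> {
        let scan_from = self.buf.len();
        self.buf.extend_from_slice(chunk);
        let mut objects = Vec::new();
        for i in scan_from..self.buf.len() {
            let b = self.buf[i];
            if self.in_string {
                if self.escaped { self.escaped = false; }
                else if b == b'\\' { self.escaped = true; }
                else if b == b'"' { self.in_string = false; }
            } else {
                match b {
                    b'"' => self.in_string = true,
                    b'{' => { if self.depth == 0 { self.start = i; } self.depth += 1; }
                    b'}' => {
                        self.depth -= 1;
                        if self.depth == 0 {
                            let s = String::from_utf8_lossy(&self.buf[self.start..=i]);
                            objects.push(s.into_owned());
                        }
                    }
                    _ => {}
                }
            }
        }
        if self.depth == 0 {
            self.buf.clear(); // everything emitted; separators can be dropped
        } else {
            self.buf.drain(..self.start); // keep only the unfinished object
            self.start = 0;
        }
        objects
    }
}

fn main() {
    let mut dec = IncrementalDecoder::new();
    // One feature split across two reads, as it would arrive from disk I/O.
    let chunks: [&[u8]; 2] = [br#"{"type":"Feature","geometry":{"ty"#, br#"pe":"Point"}}"#];
    for chunk in chunks {
        for feature in dec.feed(chunk) {
            println!("complete feature: {feature}");
        }
    }
}
```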

Proof (Microsoft Buildings: 129M features, 14.5 GB GeoJSON):

$ geoetl-cli convert \
    --input buildings.geojson \
    --output buildings.csv \
    --input-driver GeoJSON \
    --output-driver CSV

✅ Duration: 49.95 minutes
✅ Memory: 83.7 MB peak (constant for a 14.5 GB file!)
✅ No OOM errors

Technical Details: See ADR 001: Streaming GeoJSON Architecture.

Other Improvements & Fixes

Performance

  • Optimized GeoJSON batch size: Increased default from 8,192 to 262,144 for 1.29x performance improvement
  • Schema inference optimization: Reduced memory by sampling first 10 MB instead of entire file
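
To make the schema-sampling idea above concrete, here's a sketch of a bounded-prefix read. Only the 10 MB figure comes from this release; the function and its name are illustrative:

```rust
use std::fs::File;
use std::io::{Read, Result};

// Sketch of bounded-prefix sampling for schema inference (illustration
// only, not GeoETL's actual implementation).
const SAMPLE_BYTES: u64 = 10 * 1024 * 1024;

fn read_schema_sample(path: &str) -> Result<Vec<u8>> {
    let file = File::open(path)?;
    let mut sample = Vec::new();
    // `take` caps the read at SAMPLE_BYTES regardless of file size,
    // so inference memory stays constant even for multi-GB inputs.
    file.take(SAMPLE_BYTES).read_to_end(&mut sample)?;
    Ok(sample)
}
```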

Bug Fixes

  • Fixed CSV write partitioning: CSV now correctly enforces single-partition writes (prevents invalid multi-file output)
  • Fixed GeoJSON OOM: Streaming decoder eliminates out-of-memory errors on 15+ GB files

Infrastructure

  • Comprehensive benchmarking suite: Real-time monitoring scripts with CPU, memory, disk I/O tracking
  • Automated benchmark runner: JSON result output for performance regression testing
  • Microsoft Buildings dataset: Download scripts for 129M feature test dataset

Documentation

  • 3 new ADRs: Complete architecture documentation for streaming implementations
  • Benchmark results: Published real-world performance data in bench/README.md
  • Honest performance assessment: Clear status indicators (✅ Production-ready vs ⚠️ Needs work)

Performance Benchmarks

We've published comprehensive benchmark results using the Microsoft Buildings dataset (129M features). All tests run on real hardware with detailed monitoring.

CSV Performance (Production-Ready ✅)

Dataset: 129M features, 4.2 GB CSV

| Metric | Value |
|--------|-------|
| Duration | 1.86 minutes (112 seconds) |
| Throughput | 2,266 MB/min (38.2 MB/s) |
| Peak Memory | 49.9 MB (constant) |
| CPU Utilization | 96.9% average (efficient) |
| Disk Write Speed | 88.2 MB/s |
| Memory Ratio | 0.012x (49.9 MB for 4.2 GB) |

Status: ✅ Production-ready for performance-critical workloads

GeoJSON Performance (Needs Optimization ⚠️)

Dataset: 129M features, 14.5 GB GeoJSON

| Metric | Value |
|--------|-------|
| Duration | 49.95 minutes (2,997 seconds) |
| Throughput | 297 MB/min (12.0 MB/s) |
| Peak Memory | 83.7 MB (constant) |
| CPU Utilization | 99.5% (saturated; parsing bottleneck) |
| Disk Write Speed | 12.0 MB/s |
| Memory Ratio | 0.006x (83.7 MB for 14.5 GB) |

Status: ⚠️ Memory-efficient but needs 3-7x speed improvement

Batch Size Impact

| Batch Size | GeoJSON Throughput | Memory | Recommendation |
|------------|--------------------|--------|----------------|
| 8,192 | 230 MB/min | 83.7 MB | Conservative |
| 262,144 | 297 MB/min | 83.7 MB | Optimal |

Key Finding: Larger batches provide 1.29x speedup with no memory increase.

⚠️ Breaking Changes

None - this release is fully backward compatible with v0.1.x.

Community & Contributors

This release represents a major milestone in GeoETL's development. A special thank you to:

  • The Apache DataFusion team for their excellent streaming query engine
  • The GeoRust community for geospatial libraries
  • Everyone who provided feedback on v0.1.0

This project thrives on community contributions. We welcome your feedback, bug reports, and contributions!

The Future: What's Next?

We have a clear roadmap for the next releases:

Short-term (v0.3.0):

  • 🎯 GeoParquet format support - Modern columnar format for geospatial data
  • 🎯 Improved compression - Smaller file sizes
  • 🎯 Schema preservation - Maintain metadata across conversions

Medium-term (v0.4.0):

  • 🚀 GeoJSON optimization - Target 3-7x speedup (evaluate simd-json, sonic-rs)
  • 🚀 Parallel processing - Multi-threaded execution for supported formats
  • 🚀 Format auto-detection - No more manual driver specification

See our full Roadmap for details.

How to Upgrade

Installation

From source:

git clone https://github.com/geoyogesh/geoetl.git
cd geoetl
git checkout v0.2.0
cargo build --release

# Binary at: target/release/geoetl-cli

Pre-built binaries: Coming soon in v0.3.0

Verify Installation

$ geoetl-cli --version
geoetl-cli 0.2.0

$ geoetl-cli drivers
# Should show CSV and GeoJSON drivers

Get Started Today

Try the new streaming architecture:

# Download sample data (Microsoft Buildings)
# See: https://github.com/geoyogesh/geoetl/tree/main/bench

# Convert 4.2 GB CSV in ~2 minutes using only 50 MB memory
geoetl-cli convert \
    --input buildings.csv \
    --output buildings.geojson \
    --input-driver CSV \
    --output-driver GeoJSON \
    --batch-size 262144

# Monitor memory usage in another terminal
watch -n 1 'ps aux | grep geoetl-cli'

Get Involved

We'd love your help making GeoETL better: file a bug report, share feedback, or contribute code.


Download: GeoETL v0.2.0

Have questions or feedback? Join the discussion on GitHub!