Skip to main content

Announcing GeoETL v0.1.0: A Modern Geospatial ETL Tool Built with Rust

· 9 min read
Yogesh
GeoETL Maintainer

TL;DR: We're excited to announce the first release of GeoETL, a next-generation geospatial ETL tool built from the ground up with Rust, Apache DataFusion, and Apache Arrow. Our goal is ambitious: deliver 5-10x faster performance than GDAL while providing seamless scalability from your laptop to a distributed cluster.

Why GeoETL?

For over two decades, GDAL has been the backbone of geospatial data processing. It's an incredible tool that has served the GIS community faithfully, and we owe a tremendous debt to its developers and contributors.

But the world of data has changed dramatically:

  • Hardware has evolved: Modern CPUs have 8, 16, or even 64+ cores, but many GDAL operations remain single-threaded.
  • Data has grown: We're now processing terabytes of geospatial data that won't fit on a single machine.
  • Cloud is everywhere: Object storage (S3, Azure Blob, GCS) has become the standard, but traditional tools weren't designed for it.
  • Memory safety matters: Spatial data processing involves complex algorithms where memory bugs can be catastrophic.

GeoETL is our answer to these challenges. We're building a modern geospatial ETL tool that embraces the cutting-edge technologies available today while learning from decades of GDAL's wisdom.

The Vision

Our mission is to become the modern standard for vector spatial data processing, empowering users with:

  • 5-10x faster performance than GDAL through vectorized, multi-threaded execution
  • Seamless scalability from a single laptop to a distributed cluster
  • Memory safety guaranteed by Rust's ownership model
  • Cloud-native architecture with first-class support for S3, Azure Blob, and GCS
  • Zero legacy dependencies - a clean, modern implementation without carrying decades of technical debt

What's in v0.1.0?

This is our foundation release (Phase 1 of our roadmap). Think of it as laying the cornerstone of a building - it's not the complete structure, but it's solid, well-engineered, and ready to build upon.

✅ What Works Today

Command-Line Interface:

# List all 68+ supported format drivers
geoetl-cli drivers

# Convert between geospatial formats
geoetl-cli convert \
-i cities.geojson \
-o cities.csv \
--input-driver GeoJSON \
--output-driver CSV

# With verbose logging
geoetl-cli -v convert -i input.csv -o output.geojson \
--input-driver CSV \
--output-driver GeoJSON

Currently Supported Formats:

  • CSV - Read and write CSV files with WKT geometries
  • GeoJSON - Full read/write support for GeoJSON FeatureCollections

Key Features:

  • 68+ Driver Registry: We've catalogued 68+ vector format drivers (GeoJSON, Shapefile, GeoPackage, PostGIS, etc.) with their capabilities
  • Robust Error Handling: Clear error messages when something goes wrong
  • Comprehensive Test Suite: Extensive end-to-end and unit tests ensure reliability
  • Professional CI/CD: Automated testing and releases via CircleCI
  • Docker Support: Consistent development environment

🚧 What's Coming Next

We're transparent about where we are: 2 of 68+ drivers are currently implemented. Here's what's on the immediate roadmap:

Phase 2 (Q2 2026):

  • Implement high-priority format drivers: GeoPackage, Shapefile, Parquet, FlatGeobuf
  • Driver auto-detection from file extensions
  • Complete the info command for dataset inspection
  • Core spatial operations (buffer, intersection, union)
  • CRS transformations
  • Performance benchmarking suite

Phase 3-4 (Q3-Q4 2026):

  • Advanced spatial algorithms
  • Ballista integration for distributed processing
  • Cloud storage support (S3, Azure Blob, GCS)
  • Horizontal scaling across clusters

The Technology Stack

GeoETL is built on a foundation of cutting-edge, battle-tested technologies:

Core Technologies

  • Rust: A systems programming language that guarantees memory safety without garbage collection. No more segfaults, data races, or undefined behavior.

  • Apache DataFusion: A blazing-fast SQL query engine written in Rust. DataFusion provides:

    • Vectorized execution: Operations on entire columns at once
    • Query optimization: Sophisticated cost-based optimizer
    • Extensibility: Custom data sources and functions
  • Apache Arrow: A language-agnostic columnar memory format that enables zero-copy data sharing and efficient analytical operations.

  • GeoArrow: An emerging standard for representing geospatial vector data in the Arrow columnar format.

The GeoRust Ecosystem

We're proud members of the GeoRust community, leveraging excellent Rust libraries:

  • geo: Geospatial algorithms (area, centroid, distance, etc.)
  • geozero: Zero-copy geospatial data streaming
  • proj: Coordinate reference system transformations
  • rstar: R-tree spatial indexing for fast queries

Why Rust + DataFusion?

You might be wondering: "Why not just improve GDAL or build this in Python?"

The Rust Advantage

Memory Safety Without Performance Cost: Rust's ownership model eliminates entire classes of bugs at compile time - no null pointer dereferences, no buffer overflows, no use-after-free errors. And unlike garbage-collected languages, you get this safety with zero runtime overhead.

Fearless Concurrency: Rust makes it impossible to have data races. This means we can parallelize operations across all your CPU cores without the complexity and bugs that plague multi-threaded C++ code.

Zero-Cost Abstractions: We can write high-level, readable code that compiles down to machine code as fast as hand-optimized C. You shouldn't have to choose between performance and maintainability.

The DataFusion Advantage

Vectorized Processing: DataFusion operates on entire columns of data at once using SIMD instructions. This is fundamentally faster than row-by-row processing for analytical workloads.

Distributed by Design: Through DataFusion Ballista, we get built-in distributed execution. Scale from one machine to hundreds without rewriting your code.

Proven in Production: DataFusion powers InfluxDB IOx, Cube.js, and other production systems processing billions of rows.

Performance Goals

We're making bold claims about performance. Here are our specific targets:

OperationTarget vs GDALMethod
Format conversion5-10x fasterVectorized processing, zero-copy
Spatial filtering5x fasterR-tree indexing, SIMD
Buffer operations3-5x fasterParallel execution
Spatial joins5x fasterPartition-based parallelism
Large datasets (1TB+)Linear scalingDistributed via Ballista

Important: These are targets, not claims about v0.1.0. We'll be publishing benchmark results as we implement each feature. Transparency is key.

Getting Started

Installation

Currently, GeoETL must be built from source (pre-built binaries coming soon):

# Prerequisites: Rust 1.90.0 or later
git clone https://github.com/geoyogesh/geoetl.git
cd geoetl
cargo build --release

# The binary will be at: target/release/geoetl-cli

Quick Start

# See all available format drivers
geoetl-cli drivers

# Convert GeoJSON to CSV
geoetl-cli convert \
-i cities.geojson \
-o cities.csv \
--input-driver GeoJSON \
--output-driver CSV

# Convert CSV with WKT geometries to GeoJSON
geoetl-cli convert \
-i points.csv \
-o points.geojson \
--input-driver CSV \
--output-driver GeoJSON \
--geometry-column wkt

# Enable verbose logging to see what's happening
geoetl-cli -v convert -i input.csv -o output.geojson \
--input-driver CSV \
--output-driver GeoJSON

Project Philosophy

Open & Transparent

We believe in radical transparency:

  • Architecture Decision Records: Every major design decision is documented in our ADR directory
  • Public Roadmap: Our VISION.md lays out our plans clearly
  • Honest About Limitations: We're upfront about what works and what doesn't
  • Open Development: All development happens in the open on GitHub

Community-Driven

GeoETL belongs to the community:

  • Open Source: Dual-licensed under MIT/Apache-2.0
  • Welcoming: We welcome contributors of all skill levels
  • Collaborative: Built on and contributing back to the GeoRust ecosystem
  • Responsive: We'll be active on GitHub issues and discussions

Sustainable & Professional

We're building for the long term:

  • Quality First: Comprehensive tests, CI/CD, code coverage
  • Well-Documented: Every function, every module, every decision
  • Stable Releases: Semantic versioning, automated releases
  • Professional Tooling: Docker, Makefile, pre-commit hooks

How You Can Help

GeoETL is a community effort, and we need your help to make it successful:

1. Try It Out

Download GeoETL, convert some files, and give us feedback. What works? What doesn't? What's confusing?

2. Contribute Code

We have issues tagged as "good first issue" for newcomers. Check our Development Guide to get started.

High-Priority Contributions:

  • Implement additional format drivers (GeoPackage, Shapefile, Parquet)
  • Add more spatial operations
  • Write documentation and tutorials
  • Create benchmark tests

3. Spread the Word

If you find GeoETL interesting:

  • ⭐ Star the repository on GitHub
  • 🐦 Share on Twitter/X with #GeoETL #Rust #GIS
  • 💬 Discuss on Reddit (r/rust, r/gis, r/programming)
  • 📝 Write about your experience using it

4. Report Issues

Found a bug? Have a feature request? Open an issue on GitHub.

5. Join the Conversation

  • GitHub Discussions: Share ideas and ask questions
  • Follow updates: Watch the repository for release notifications

The Road Ahead

We have an ambitious roadmap, but we're committed to incremental delivery. Every month, you'll see progress:

  • New format drivers being added
  • Performance benchmarks published
  • Features implemented from the roadmap
  • Documentation expanded with more examples

This is a marathon, not a sprint. We're building GeoETL to last for the next decade and beyond.

A Note on Realism

Let's be clear: GeoETL v0.1.0 is not a GDAL replacement today. GDAL has 20+ years of development, 250+ drivers, and battle-tested reliability. It will take us time to reach feature parity.

But we believe there's room for both tools:

  • GDAL remains the gold standard for comprehensive format support and mature, stable processing
  • GeoETL will excel at high-performance, cloud-native, distributed geospatial data processing

Think of GeoETL as complementary to GDAL, not competitive. Use the right tool for the job.

Acknowledgments

GeoETL stands on the shoulders of giants:

  • The GDAL team for decades of innovation in geospatial data processing
  • The GeoRust community for excellent geospatial libraries
  • The Apache Arrow project for DataFusion and the Arrow format
  • The GeoArrow ecosystem for geospatial standards
  • The Rust community for an amazing language and ecosystem

We're grateful to learn from and build upon your work.

Get Involved!

GitHub: https://github.com/geoyogesh/geoetl

Documentation:

Stay Updated:

Thank You

Thank you for your interest in GeoETL. We're excited about the future of geospatial data processing, and we hope you'll join us on this journey.

Whether you contribute code, report bugs, write documentation, or simply spread the word - every bit helps. Together, we can build something truly special.

Let's make geospatial data processing faster, safer, and more accessible for everyone.


Ready to get started? Check out our Getting Started Guide and join the discussion on GitHub.

Have questions or feedback? Drop a comment below or open a discussion on GitHub!