
GeoETL v0.3.0: GeoParquet Support

Yogesh
GeoETL Maintainer

GeoETL v0.3.0 adds full GeoParquet format support with production-ready performance.

Key highlights:

  • ✅ Full read/write support
  • ⚡ 3,315 MB/min processing throughput
  • 💾 Streaming architecture with minimal memory
  • 🚀 Handles 100M+ features efficiently

What's New

GeoParquet is now a fully supported format driver in GeoETL, joining CSV and GeoJSON.

Format: Apache Parquet with WKB-encoded geometries and GeoArrow types
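To make "WKB-encoded" concrete, here is a minimal sketch of how a single 2D point is laid out in Well-Known Binary: one byte-order byte, a 4-byte geometry type code, then two 8-byte doubles. The helper names `point_to_wkb` and `wkb_to_point` are made up for illustration and are not part of GeoETL.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    byte_order = 1     # 1 = little-endian (NDR)
    geometry_type = 1  # 1 = Point in the WKB geometry type table
    # "<" disables padding: 1 byte + 4 bytes + 8 bytes + 8 bytes
    return struct.pack("<BIdd", byte_order, geometry_type, x, y)


def wkb_to_point(data: bytes) -> tuple[float, float]:
    """Decode a little-endian 2D WKB point back to (x, y)."""
    byte_order, geometry_type, x, y = struct.unpack("<BIdd", data)
    if byte_order != 1 or geometry_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return x, y


wkb = point_to_wkb(-122.33, 47.61)
print(len(wkb))           # 21
print(wkb_to_point(wkb))  # (-122.33, 47.61)
```

A fixed binary layout like this is part of why a geometry column can be more compact than GeoJSON text, which spells out every coordinate as a decimal string.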

Use cases:

  • Large-scale geospatial data processing
  • Cloud storage (smaller files)
  • Analytics pipelines
  • Data archival

Performance Results

Benchmarked with the Microsoft Buildings dataset (up to 129M features):

Processing Speed (1M features)

| Operation | Throughput | Duration |
|---|---|---|
| GeoJSON → GeoJSON | 300 MB/min | 23s |
| CSV → CSV | 3,211 MB/min | 1s |
| GeoParquet → GeoParquet | 3,315 MB/min | 1s |
| GeoJSON → GeoParquet | 3,804 MB/min | 2s |
| CSV → GeoParquet | 3,211 MB/min | 1s |

Key finding: GeoJSON → GeoParquet conversion achieves the highest throughput (3,804 MB/min), so migrating existing GeoJSON data to GeoParquet is fast.

File Size (1M features)

| Format | Size | vs GeoJSON |
|---|---|---|
| GeoJSON | 114.13 MB | baseline |
| CSV | 32.11 MB | 3.5x smaller |
| GeoParquet | 16.86 MB | 6.8x smaller |

Memory Usage

All conversions use less than 250 MB of peak memory regardless of dataset size, confirming that the streaming architecture works at scale.
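Flat memory usage is what you would expect from batch-at-a-time processing. The following is a minimal sketch of that general idea in Python, not GeoETL's actual implementation: `read_batches`, `convert_in_batches`, the `batch_size` parameter, and the newline-delimited JSON input are all assumptions made for illustration.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator, TextIO


def read_batches(lines: Iterable[str], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most batch_size parsed records."""
    it = iter(lines)
    while True:
        batch = [json.loads(line) for line in islice(it, batch_size)]
        if not batch:
            return
        yield batch


def convert_in_batches(src: TextIO, dst: TextIO, batch_size: int = 1000) -> int:
    """Copy newline-delimited JSON features from src to dst, one batch at a time.

    Only one batch is held in memory at any moment, so peak memory depends
    on batch_size rather than on the total size of the input.
    """
    written = 0
    for batch in read_batches(src, batch_size):
        for feature in batch:
            dst.write(json.dumps(feature) + "\n")
        written += len(batch)
    return written


# Usage: 2,500 tiny features flow through in batches of 1,000.
src = io.StringIO("".join(json.dumps({"id": i}) + "\n" for i in range(2500)))
dst = io.StringIO()
print(convert_in_batches(src, dst, batch_size=1000))  # 2500
```

The trade-off in a design like this is batch size: larger batches amortize per-batch overhead, smaller batches lower the memory ceiling.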

Scalability Test (129M features)

| Format | Input Size | Processing Time | Peak Memory |
|---|---|---|---|
| GeoJSON | 14.5 GB | 50 minutes | 84 MB |
| GeoParquet | ~4 GB | ~2 minutes (projected) | <100 MB |
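The figures in this table can be sanity-checked with simple throughput arithmetic. The sketch below assumes 1 GB = 1,000 MB and that the 1M-feature GeoParquet throughput (3,315 MB/min) carries over to 129M features. Both are assumptions, and the function names are illustrative only.

```python
def throughput_mb_per_min(size_mb: float, minutes: float) -> float:
    """Observed throughput from input size and wall-clock time."""
    return size_mb / minutes


def minutes_to_process(size_mb: float, mb_per_min: float) -> float:
    """Projected duration at a given sustained throughput."""
    return size_mb / mb_per_min


# GeoJSON at 129M features: 14.5 GB in 50 minutes.
print(throughput_mb_per_min(14.5 * 1000, 50))         # 290.0

# GeoParquet projection: ~4 GB at the 1M-feature rate of 3,315 MB/min.
print(round(minutes_to_process(4 * 1000, 3315), 2))   # 1.21
```

The observed 290 MB/min for GeoJSON is close to the 300 MB/min measured at 1M features, which suggests throughput holds steady as the input grows. Under the same assumption, the GeoParquet projection comes out at a little over one minute, the same order of magnitude as the ~2 minutes quoted above.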

Getting Started

Installation

Download GeoETL v0.3.0: GitHub Releases

Basic Usage

# GeoJSON to GeoParquet
geoetl-cli convert \
  --input data.geojson \
  --output data.parquet \
  --input-driver GeoJSON \
  --output-driver GeoParquet

# CSV to GeoParquet
geoetl-cli convert \
  --input data.csv \
  --output data.parquet \
  --input-driver CSV \
  --output-driver GeoParquet \
  --geometry-column WKT

# GeoParquet to GeoJSON
geoetl-cli convert \
  --input data.parquet \
  --output data.geojson \
  --input-driver GeoParquet \
  --output-driver GeoJSON
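If you drive conversions from a script, the same commands can be assembled programmatically. This is a hypothetical Python wrapper, not something shipped with GeoETL. It uses only the flags shown above and assumes `geoetl-cli` is on your `PATH`.

```python
import subprocess
from typing import Optional


def build_convert_command(
    input_path: str,
    output_path: str,
    input_driver: str,
    output_driver: str,
    geometry_column: Optional[str] = None,
) -> list[str]:
    """Build the argument list for `geoetl-cli convert` (hypothetical helper)."""
    cmd = [
        "geoetl-cli", "convert",
        "--input", input_path,
        "--output", output_path,
        "--input-driver", input_driver,
        "--output-driver", output_driver,
    ]
    # The CSV example above passes the geometry column name explicitly.
    if geometry_column is not None:
        cmd += ["--geometry-column", geometry_column]
    return cmd


def convert(*args, **kwargs) -> None:
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(*args, **kwargs), check=True)


print(build_convert_command("data.csv", "data.parquet", "CSV", "GeoParquet", "WKT"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.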

Architecture

GeoETL's GeoParquet implementation:

  • Streaming: Constant O(1) memory regardless of file size
  • Native types: GeoArrow Point, LineString, Polygon, Multi*, etc.
  • Standard encoding: WKB (Well-Known Binary)
  • Metadata: CRS, bounding boxes, schema preservation
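For context on the metadata bullet: the GeoParquet specification stores geometry metadata as a JSON document under the `geo` key of the Parquet file metadata. The sketch below builds such a document with the standard library only. The field names follow the GeoParquet 1.0 specification, while the helper names, the `"1.0.0"` version string, and the sample coordinates are assumptions. It does not show what GeoETL itself writes.

```python
import json


def bbox_of(points: list[tuple[float, float]]) -> list[float]:
    """Return [xmin, ymin, xmax, ymax] for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [min(xs), min(ys), max(xs), max(ys)]


def geo_metadata(column: str, geometry_types: list[str], bbox: list[float]) -> dict:
    """Build a GeoParquet-style "geo" metadata document (illustrative only)."""
    return {
        "version": "1.0.0",        # assumed spec version
        "primary_column": column,
        "columns": {
            column: {
                "encoding": "WKB",
                "geometry_types": geometry_types,
                "bbox": bbox,
                # "crs" is optional in the spec and omitted in this sketch
            }
        },
    }


points = [(-122.33, 47.61), (-122.30, 47.65), (-122.35, 47.60)]
meta = geo_metadata("geometry", ["Point"], bbox_of(points))
print(json.dumps(meta, indent=2))
```

Readers can use a file-level bounding box like this one to skip files that fall outside a query area without decoding any geometries.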

Technical details: Architecture ADR 004

Documentation

What's Next

  • Next (v0.4.0): FlatGeobuf format support

See full Roadmap

Community


Download: GeoETL v0.3.0