How to Choose Between GeoParquet and FlatGeobuf
When deciding how to choose between GeoParquet and FlatGeobuf, align the format with your query pattern and infrastructure: use GeoParquet for cloud-native analytical pipelines, heavy columnar aggregations, and ML feature stores. Use FlatGeobuf for low-latency spatial filtering, HTTP range-request streaming, and web mapping backends. The decision hinges on workload type: batch analytics favors GeoParquet’s columnar compression and predicate pushdown, while interactive spatial serving requires FlatGeobuf’s embedded spatial index and partial-read efficiency. If your architecture runs Spark, DuckDB, or Polars aggregations over object storage, default to GeoParquet. If you serve OGC API Features, dynamic tile servers, or client-side spatial queries, default to FlatGeobuf.
Architecture & I/O Models
GeoParquet extends the Apache Parquet columnar specification by storing geometries as Well-Known Binary (WKB) in a dedicated column and attaching spatial metadata (CRS, bounding box, geometry type) via Parquet key-value metadata. Because it inherits Parquet’s row-group structure, dictionary encoding, and page-level compression (ZSTD, Snappy, LZ4), it minimizes I/O for analytical queries that project only a subset of attributes. Modern query engines push down predicates on non-geometry columns before deserializing WKB, drastically reducing memory pressure during large-scale joins or aggregations.
FlatGeobuf is a flat, row-ordered binary format with a packed Hilbert R-tree spatial index appended to the end of the file. The index maps spatial extents to byte offsets, enabling clients to issue HTTP Range requests that fetch only the index and relevant feature blocks. The format strictly aligns with the FlatGeobuf specification and GDAL/OGR drivers. Unlike columnar formats, FlatGeobuf reads features sequentially but skips irrelevant blocks via the spatial index, making it ideal for bounding-box filters, point-in-polygon checks, and streaming to browsers or lightweight clients.
Decision Matrix by Workload
| Workload Pattern | Recommended Format | Technical Rationale |
|---|---|---|
| Cloud data lake ETL / ML feature store | GeoParquet | Columnar pruning, ZSTD/Snappy compression, native Spark/DuckDB/Pandas support |
| Web tile serving / OGC API Features | FlatGeobuf | HTTP range requests, sub-second bounding-box filtering, low memory overhead |
| Heavy spatial joins & aggregations | GeoParquet | Vectorized columnar execution, predicate pushdown on non-geometry attributes |
| Client-side spatial queries (JS/Python) | FlatGeobuf | Partial downloads via Range, embedded index avoids full-file scan |
| Schema evolution & metadata tracking | GeoParquet | Parquet schema compatibility, versioned metadata, cross-tool interoperability |
| Strict GDAL/OGR ecosystem alignment | FlatGeobuf | First-class GDAL driver, seamless ogr2ogr conversion, legacy GIS compatibility |
Platform teams should evaluate data access patterns before standardizing. GeoParquet excels when queries touch fewer than 30% of columns but scan millions of rows. FlatGeobuf wins when queries are highly spatially selective, require sub-second response times, or must traverse bandwidth-constrained networks.
Implementation & Code Patterns
Below are production-ready Python workflows demonstrating read/write operations for both formats. Both examples assume geopandas and pyarrow are installed.
GeoParquet: Batch Analytics & Predicate Pushdown
import geopandas as gpd
import pyarrow.parquet as pq
# Write with explicit spatial metadata (GeoParquet v1.0.0 spec)
gdf = gpd.read_file("input.shp")
gdf.to_parquet(
"output.parquet",
compression="zstd",
index=False,
)
# Read with column projection & attribute filtering
# PyArrow pushes down non-geometry predicates to the row-group level
table = pq.read_table(
"output.parquet",
columns=["id", "population", "geometry"],
filters=[("population", ">", 50000)],
)
analytics_gdf = gpd.GeoDataFrame(table.to_pandas(), geometry="geometry", crs="EPSG:4326")
FlatGeobuf: Streaming & Spatial Filtering
import geopandas as gpd
# Write FlatGeobuf via GeoPandas (GDAL-backed)
gdf.to_file("output.fgb", driver="FlatGeobuf")
# Read with bbox spatial filter (uses the embedded Hilbert R-tree index)
# bbox=(minx, miny, maxx, maxy)
filtered_gdf = gpd.read_file(
"output.fgb",
bbox=(-122.5, 37.7, -122.3, 37.9),
)
For HTTP streaming (e.g., reading a FlatGeobuf from object storage via byte-range requests), use the JavaScript flatgeobuf library client-side or a server-side gateway that sets Accept-Ranges: bytes and serves the file unmodified from S3/GCS.
Key implementation notes:
- GeoParquet benefits from
row_group_sizetuning. Smaller row groups improve parallelism in distributed engines but increase metadata overhead. See Row Group Sizing Strategies for Parquet for guidance. - FlatGeobuf requires the spatial index to be built during write. GeoPandas builds the Hilbert R-tree index automatically when writing
.fgb. If appending to an existing file is required, the index must be rebuilt. - For cloud-native deployments, enable
Content-RangeandAccept-Ranges: byteson your object storage gateway to unlock FlatGeobuf’s streaming advantage.
Performance, Ecosystem & Next Steps
Benchmarking reveals that GeoParquet typically outperforms FlatGeobuf on wide-table aggregations and multi-column joins, while FlatGeobuf delivers 3–10x faster bounding-box filters on datasets under 500 MB. For deeper benchmark context, review the Comparing GeoParquet vs FlatGeobuf Performance analysis, which covers I/O latency, CPU utilization, and memory footprints across cloud and edge environments.
When designing microservices or data platforms, avoid forcing a single format across all layers. A common pattern stores raw/curated layers as GeoParquet in S3/GCS for batch processing, then materializes FlatGeobuf derivatives for API serving or client-side consumption. This dual-format strategy leverages the strengths of both without compromising query performance. Platform teams should also evaluate the Geospatial Storage Fundamentals & Format Comparison baseline before standardizing on a single format across services.
Final recommendation: Start with GeoParquet if your workload is compute-bound, schema-heavy, or integrated with modern data engines. Switch to FlatGeobuf if your workload is I/O-bound, latency-sensitive, or requires direct browser/client access. Both formats are mature, open, and interoperable; the right choice depends entirely on where your query execution happens and how your data moves through the stack.