Section: Geospatial Storage Formats Compared 5 min read

How to Choose Between GeoParquet and FlatGeobuf

When deciding how to choose between GeoParquet and FlatGeobuf, align the format with your query pattern and infrastructure: use GeoParquet for cloud-native analytical pipelines, heavy columnar aggregations, and ML feature stores. Use FlatGeobuf for low-latency spatial filtering, HTTP range-request streaming, and web mapping backends. The decision hinges on workload type: batch analytics favors GeoParquet’s columnar compression and predicate pushdown, while interactive spatial serving requires FlatGeobuf’s embedded spatial index and partial-read efficiency. If your architecture runs Spark, DuckDB, or Polars aggregations over object storage, default to GeoParquet. If you serve OGC API Features, dynamic tile servers, or client-side spatial queries, default to FlatGeobuf.

Architecture & I/O Models

GeoParquet extends the Apache Parquet columnar specification by storing geometries as Well-Known Binary (WKB) in a dedicated column and attaching spatial metadata (CRS, bounding box, geometry type) via Parquet key-value metadata. Because it inherits Parquet’s row-group structure, dictionary encoding, and page-level compression (ZSTD, Snappy, LZ4), it minimizes I/O for analytical queries that project only a subset of attributes. Modern query engines push down predicates on non-geometry columns before deserializing WKB, drastically reducing memory pressure during large-scale joins or aggregations.

FlatGeobuf is a flat, row-ordered binary format with a packed Hilbert R-tree spatial index appended to the end of the file. The index maps spatial extents to byte offsets, enabling clients to issue HTTP Range requests that fetch only the index and relevant feature blocks. The format strictly aligns with the FlatGeobuf specification and GDAL/OGR drivers. Unlike columnar formats, FlatGeobuf reads features sequentially but skips irrelevant blocks via the spatial index, making it ideal for bounding-box filters, point-in-polygon checks, and streaming to browsers or lightweight clients.

Decision Matrix by Workload

Workload Pattern	Recommended Format	Technical Rationale
Cloud data lake ETL / ML feature store	GeoParquet	Columnar pruning, ZSTD/Snappy compression, native Spark/DuckDB/Pandas support
Web tile serving / OGC API Features	FlatGeobuf	HTTP range requests, sub-second bounding-box filtering, low memory overhead
Heavy spatial joins & aggregations	GeoParquet	Vectorized columnar execution, predicate pushdown on non-geometry attributes
Client-side spatial queries (JS/Python)	FlatGeobuf	Partial downloads via `Range`, embedded index avoids full-file scan
Schema evolution & metadata tracking	GeoParquet	Parquet schema compatibility, versioned metadata, cross-tool interoperability
Strict GDAL/OGR ecosystem alignment	FlatGeobuf	First-class GDAL driver, seamless `ogr2ogr` conversion, legacy GIS compatibility

Platform teams should evaluate data access patterns before standardizing. GeoParquet excels when queries touch fewer than 30% of columns but scan millions of rows. FlatGeobuf wins when queries are highly spatially selective, require sub-second response times, or must traverse bandwidth-constrained networks.

Implementation & Code Patterns

Below are production-ready Python workflows demonstrating read/write operations for both formats. Both examples assume geopandas and pyarrow are installed.

GeoParquet: Batch Analytics & Predicate Pushdown

python

import geopandas as gpd
import pyarrow.parquet as pq

# Write with explicit spatial metadata (GeoParquet v1.0.0 spec)
gdf = gpd.read_file("input.shp")
gdf.to_parquet(
    "output.parquet",
    compression="zstd",
    index=False,
)

# Read with column projection & attribute filtering
# PyArrow pushes down non-geometry predicates to the row-group level
table = pq.read_table(
    "output.parquet",
    columns=["id", "population", "geometry"],
    filters=[("population", ">", 50000)],
)
analytics_gdf = gpd.GeoDataFrame(table.to_pandas(), geometry="geometry", crs="EPSG:4326")

import geopandas as gpd
import pyarrow.parquet as pq

# Write with explicit spatial metadata (GeoParquet v1.0.0 spec)
gdf = gpd.read_file("input.shp")
gdf.to_parquet(
    "output.parquet",
    compression="zstd",
    index=False,
)

# Read with column projection & attribute filtering
# PyArrow pushes down non-geometry predicates to the row-group level
table = pq.read_table(
    "output.parquet",
    columns=["id", "population", "geometry"],
    filters=[("population", ">", 50000)],
)
analytics_gdf = gpd.GeoDataFrame(table.to_pandas(), geometry="geometry", crs="EPSG:4326")

FlatGeobuf: Streaming & Spatial Filtering

python

import geopandas as gpd

# Write FlatGeobuf via GeoPandas (GDAL-backed)
gdf.to_file("output.fgb", driver="FlatGeobuf")

# Read with bbox spatial filter (uses the embedded Hilbert R-tree index)
# bbox=(minx, miny, maxx, maxy)
filtered_gdf = gpd.read_file(
    "output.fgb",
    bbox=(-122.5, 37.7, -122.3, 37.9),
)

import geopandas as gpd

# Write FlatGeobuf via GeoPandas (GDAL-backed)
gdf.to_file("output.fgb", driver="FlatGeobuf")

# Read with bbox spatial filter (uses the embedded Hilbert R-tree index)
# bbox=(minx, miny, maxx, maxy)
filtered_gdf = gpd.read_file(
    "output.fgb",
    bbox=(-122.5, 37.7, -122.3, 37.9),
)

For HTTP streaming (e.g., reading a FlatGeobuf from object storage via byte-range requests), use the JavaScript flatgeobuf library client-side or a server-side gateway that sets Accept-Ranges: bytes and serves the file unmodified from S3/GCS.

Key implementation notes:

GeoParquet benefits from row_group_size tuning. Smaller row groups improve parallelism in distributed engines but increase metadata overhead. See Row Group Sizing Strategies for Parquet for guidance.
FlatGeobuf requires the spatial index to be built during write. GeoPandas builds the Hilbert R-tree index automatically when writing .fgb. If appending to an existing file is required, the index must be rebuilt.
For cloud-native deployments, enable Content-Range and Accept-Ranges: bytes on your object storage gateway to unlock FlatGeobuf’s streaming advantage.

Performance, Ecosystem & Next Steps

Benchmarking reveals that GeoParquet typically outperforms FlatGeobuf on wide-table aggregations and multi-column joins, while FlatGeobuf delivers 3–10x faster bounding-box filters on datasets under 500 MB. For deeper benchmark context, review the Comparing GeoParquet vs FlatGeobuf Performance analysis, which covers I/O latency, CPU utilization, and memory footprints across cloud and edge environments.

When designing microservices or data platforms, avoid forcing a single format across all layers. A common pattern stores raw/curated layers as GeoParquet in S3/GCS for batch processing, then materializes FlatGeobuf derivatives for API serving or client-side consumption. This dual-format strategy leverages the strengths of both without compromising query performance. Platform teams should also evaluate the Geospatial Storage Fundamentals & Format Comparison baseline before standardizing on a single format across services.

Final recommendation: Start with GeoParquet if your workload is compute-bound, schema-heavy, or integrated with modern data engines. Switch to FlatGeobuf if your workload is I/O-bound, latency-sensitive, or requires direct browser/client access. Both formats are mature, open, and interoperable; the right choice depends entirely on where your query execution happens and how your data moves through the stack.

#How to Choose Between GeoParquet and FlatGeobuf

#Architecture & I/O Models

#Decision Matrix by Workload

#Implementation & Code Patterns

#GeoParquet: Batch Analytics & Predicate Pushdown

#FlatGeobuf: Streaming & Spatial Filtering

#Performance, Ecosystem & Next Steps