Preserving Metadata During GeoParquet Conversion
Modern geospatial analytics demand columnar storage for performance, yet migrating legacy vector datasets to GeoParquet frequently strips critical provenance, coordinate reference system (CRS) definitions, and custom attributes. Preserving metadata during GeoParquet conversion is a non-negotiable requirement for platform teams building compliant data lakes. Unlike shapefiles or GeoJSON, which embed metadata in sidecar files or inline JSON, GeoParquet relies on Parquet’s footer metadata and a strict geo JSON schema. Naïve conversions drop everything outside geometry and tabular columns, breaking downstream lineage tracking, spatial indexing, and regulatory compliance.
This guide provides a production-tested workflow for extracting, mapping, and injecting metadata during format transitions, ensuring full alignment with the official GeoParquet specification while maintaining pipeline reliability.
The Metadata Architecture Challenge
Traditional GIS formats store metadata heterogeneously. GeoPackage uses SQLite tables, Shapefile relies on .prj and .xml sidecars, and FlatGeobuf packs metadata in a header block. Parquet, by design, optimizes for analytical query performance and stores metadata as key-value pairs in the file footer. The GeoParquet standard bridges this gap by reserving a geo key containing a JSON object that defines geometry columns, CRS, bounding boxes, and encoding.
When designing Data Conversion & Migration Pipelines, engineers must account for three distinct metadata layers:
- Geospatial Core: CRS, geometry type, bounding box, and primary geometry column.
- Schema & Type Mapping: precision, scale, nullability constraints, and temporal formats.
- Custom/Provenance: source system identifiers, processing timestamps, and domain-specific tags.
Failing to serialize these layers correctly leads to silent data degradation. Some libraries silently drop unrecognized keys, while others raise exceptions on malformed CRS strings.
Prerequisites & Environment Configuration
A reliable conversion stack requires modern, actively maintained libraries that expose Parquet metadata APIs:
- Python 3.9+ (type hints and
pathlibrequired for pipeline orchestration) pyarrow>=14.0.0(direct Parquet metadata manipulation and schema enforcement)geopandas>=1.0.0(spatial DataFrame operations and CRS normalization)pyogrio>=0.7.0(GDAL-backed I/O with optimized vector reading)shapely>=2.0.0(geometry validation and topology checks)pandas>=2.0.0(tabular type coercion and null handling)
Install via pip:
pip install pyarrow geopandas pyogrio shapely pandas
For cloud deployments, ensure your IAM roles grant s3:PutObject and s3:GetObject permissions if writing directly to object storage.
Step 1: Extracting Source Metadata and CRS Definitions
The first phase involves reading the source dataset while explicitly capturing all embedded metadata. Many legacy formats store CRS information in non-standard locations, requiring careful parsing. Using pyogrio provides the fastest read path while exposing the raw metadata dictionary.
import pyogrio
from pathlib import Path
from typing import Any
def extract_source_metadata(source_path: Path) -> dict[str, Any]:
"""Extract CRS, bounding box, and custom metadata from source vector."""
meta = pyogrio.read_info(str(source_path))
# Normalize CRS to EPSG code if the string contains one
crs_raw = meta.get("crs") or ""
if "EPSG:" in crs_raw:
crs_epsg = crs_raw.split("EPSG:")[-1].split("]")[0].strip()
else:
crs_epsg = "unknown"
return {
"crs": crs_epsg,
"bbox": meta.get("total_bounds"), # pyogrio >= 0.7 key
"geometry_type": meta.get("geometry_type", "Unknown"),
"encoding": meta.get("encoding", "UTF-8"),
}
Step 2: Schema Mapping and Type Coercion
Parquet enforces strict typing, which conflicts with the loosely typed nature of many GIS formats. Integer fields stored as strings, mixed precision floats, and datetime formats must be explicitly coerced before serialization.
import geopandas as gpd
import pandas as pd
import pyarrow as pa
def map_and_coerce_schema(gdf: gpd.GeoDataFrame) -> pa.Table:
"""Normalize types and prepare for Parquet serialization."""
gdf = gdf.copy()
for col in gdf.columns:
if col == gdf.geometry.name:
continue
if pd.api.types.is_datetime64_any_dtype(gdf[col]):
# Use UTC-naive timestamps for cross-engine compatibility
gdf[col] = gdf[col].dt.tz_localize(None).astype("datetime64[ms]")
# Replace empty strings with NA to avoid Arrow type inference failures
gdf = gdf.replace("", pd.NA)
return pa.Table.from_pandas(gdf, preserve_index=False)
When migrating from legacy systems, column names often contain spaces, special characters, or reserved keywords. Standardizing these during the mapping phase prevents query failures in downstream engines like DuckDB or AWS Athena. Refer to Schema Mapping for Legacy to Modern Formats for comprehensive naming conventions and type coercion matrices.
Step 3: Constructing the GeoParquet Footer
GeoParquet requires a specific JSON structure attached to the geo key in the Parquet file metadata. This structure must conform to the specification exactly, or spatial engines will ignore the geometry column. The following function builds a compliant geo metadata payload and attaches it to the Arrow table.
import json
import pyarrow as pa
from typing import Any
def build_geoparquet_metadata(
table: pa.Table,
source_meta: dict[str, Any],
primary_column: str = "geometry",
) -> pa.Table:
"""Attach compliant GeoParquet 1.0 metadata to the Arrow table."""
crs_code = source_meta.get("crs", "unknown")
# Build a minimal PROJJSON CRS object; use EPSG:4326 as fallback
try:
epsg_int = int(crs_code)
except (ValueError, TypeError):
epsg_int = 4326
geo_meta = {
"version": "1.0.0",
"primary_column": primary_column,
"columns": {
primary_column: {
"encoding": "WKB",
"geometry_types": [source_meta.get("geometry_type", "Unknown")],
"crs": {
"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
"type": "GeographicCRS",
"name": f"EPSG:{epsg_int}",
"id": {"authority": "EPSG", "code": epsg_int},
},
"bbox": source_meta.get("bbox"),
}
},
}
geo_bytes = json.dumps(geo_meta, separators=(",", ":")).encode("utf-8")
existing_meta = table.schema.metadata or {}
updated_meta = {**existing_meta, b"geo": geo_bytes}
return table.replace_schema_metadata(updated_meta)
Note that the encoding field must be WKB or WKT. GeoParquet v1.0.0 mandates WKB for performance, and most modern engines expect this encoding. If you are migrating QGIS projects that rely on proprietary styling or layer metadata, consider Preserving QGIS Metadata in FlatGeobuf before converting to Parquet, as FlatGeobuf handles QGIS-specific tags more gracefully.
Step 4: Validation and Production Pipeline Integration
Writing the file is straightforward, but validation ensures downstream compatibility.
from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq
def write_geoparquet(table: pa.Table, output_path: Path) -> None:
"""Write validated GeoParquet with optimized compression."""
pq.write_table(
table,
str(output_path),
compression="zstd",
compression_level=3,
use_dictionary=True,
write_statistics=True,
row_group_size=100_000, # Optimize for cloud query engines
)
# Quick validation: read back metadata and verify geo key exists
read_schema = pq.read_schema(str(output_path))
if b"geo" not in (read_schema.metadata or {}):
raise RuntimeError("GeoParquet metadata missing in output file footer.")
For enterprise-scale operations, wrap these functions in an orchestration framework like Apache Airflow or Prefect. Implement retry logic, dead-letter queues for malformed geometries, and checksum verification. When scaling horizontally, partition datasets by spatial index or temporal buckets to avoid skew. Detailed orchestration patterns are covered in Building Batch Conversion Pipelines with Python.
Common Pitfalls and Fallback Strategies
Even with careful implementation, certain edge cases consistently break conversions:
- Multi-CRS Datasets: GeoParquet v1.0.0 supports only one CRS per geometry column. If your source contains mixed projections, you must normalize to a single target CRS (typically EPSG:4326 or EPSG:3857) before conversion.
- Large Geometry Blobs: WKB encoding can produce massive binary payloads for complex polygons. Enable ZSTD compression and consider tiling or simplifying geometries at scale.
- Missing CRS Definitions: shapefiles without
.prjfiles default to unknown CRS. Implement a fallback routing mechanism that quarantines these files for manual review rather than injecting a default projection silently. - Metadata Bloat: Parquet footers are loaded into memory during file open. Keep the
geoJSON lean and move extensive provenance logs to a companion metadata catalog (e.g., OpenMetadata or AWS Glue).
Always validate output files against the Apache Parquet Python documentation and run spatial queries in DuckDB or PostGIS to confirm geometry integrity before promoting to production.
Conclusion
Preserving metadata during format transitions is not a cosmetic requirement; it is foundational to data lineage, spatial accuracy, and regulatory compliance. By explicitly extracting source attributes, coercing types to Parquet standards, and injecting a compliant geo footer, platform teams can eliminate silent degradation and build resilient geospatial data lakes. Implement the extraction, mapping, and injection steps outlined above, integrate automated validation, and your pipelines will consistently produce query-ready, standards-compliant GeoParquet files.