Preserving Metadata During GeoParquet Conversion

Modern geospatial analytics demand columnar storage for performance, yet migrating legacy vector datasets to GeoParquet frequently strips critical provenance, coordinate reference system (CRS) definitions, and custom attributes. Preserving metadata during GeoParquet conversion is a non-negotiable requirement for platform teams building compliant data lakes. Unlike shapefiles or GeoJSON, which embed metadata in sidecar files or inline JSON, GeoParquet relies on Parquet’s footer metadata and a strict geo JSON schema. Naïve conversions drop everything outside geometry and tabular columns, breaking downstream lineage tracking, spatial indexing, and regulatory compliance.

This guide provides a production-tested workflow for extracting, mapping, and injecting metadata during format transitions, ensuring full alignment with the official GeoParquet specification while maintaining pipeline reliability.

The Metadata Architecture Challenge

Traditional GIS formats store metadata heterogeneously. GeoPackage uses SQLite tables, Shapefile relies on .prj and .xml sidecars, and FlatGeobuf packs metadata in a header block. Parquet, by design, optimizes for analytical query performance and stores metadata as key-value pairs in the file footer. The GeoParquet standard bridges this gap by reserving a geo key containing a JSON object that defines geometry columns, CRS, bounding boxes, and encoding.

When designing Data Conversion & Migration Pipelines, engineers must account for three distinct metadata layers:

  1. Geospatial Core: CRS, geometry type, bounding box, and primary geometry column.
  2. Schema & Type Mapping: precision, scale, nullability constraints, and temporal formats.
  3. Custom/Provenance: source system identifiers, processing timestamps, and domain-specific tags.

Failing to serialize these layers correctly leads to silent data degradation. Some libraries silently drop unrecognized keys, while others raise exceptions on malformed CRS strings.

Prerequisites & Environment Configuration

A reliable conversion stack requires modern, actively maintained libraries that expose Parquet metadata APIs:

  • Python 3.9+ (type hints and pathlib required for pipeline orchestration)
  • pyarrow>=14.0.0 (direct Parquet metadata manipulation and schema enforcement)
  • geopandas>=1.0.0 (spatial DataFrame operations and CRS normalization)
  • pyogrio>=0.7.0 (GDAL-backed I/O with optimized vector reading)
  • shapely>=2.0.0 (geometry validation and topology checks)
  • pandas>=2.0.0 (tabular type coercion and null handling)

Install via pip:

bash
pip install pyarrow geopandas pyogrio shapely pandas

For cloud deployments, ensure your IAM roles grant s3:PutObject and s3:GetObject permissions if writing directly to object storage.

Step 1: Extracting Source Metadata and CRS Definitions

The first phase involves reading the source dataset while explicitly capturing all embedded metadata. Many legacy formats store CRS information in non-standard locations, requiring careful parsing. Using pyogrio provides the fastest read path while exposing the raw metadata dictionary.

python
import pyogrio
from pathlib import Path
from typing import Any


def extract_source_metadata(source_path: Path) -> dict[str, Any]:
    """Extract CRS, bounding box, and custom metadata from source vector."""
    meta = pyogrio.read_info(str(source_path))

    # Normalize CRS to EPSG code if the string contains one
    crs_raw = meta.get("crs") or ""
    if "EPSG:" in crs_raw:
        crs_epsg = crs_raw.split("EPSG:")[-1].split("]")[0].strip()
    else:
        crs_epsg = "unknown"

    return {
        "crs": crs_epsg,
        "bbox": meta.get("total_bounds"),  # pyogrio >= 0.7 key
        "geometry_type": meta.get("geometry_type", "Unknown"),
        "encoding": meta.get("encoding", "UTF-8"),
    }

Step 2: Schema Mapping and Type Coercion

Parquet enforces strict typing, which conflicts with the loosely typed nature of many GIS formats. Integer fields stored as strings, mixed precision floats, and datetime formats must be explicitly coerced before serialization.

python
import geopandas as gpd
import pandas as pd
import pyarrow as pa


def map_and_coerce_schema(gdf: gpd.GeoDataFrame) -> pa.Table:
    """Normalize types and prepare for Parquet serialization."""
    gdf = gdf.copy()

    for col in gdf.columns:
        if col == gdf.geometry.name:
            continue
        if pd.api.types.is_datetime64_any_dtype(gdf[col]):
            # Use UTC-naive timestamps for cross-engine compatibility
            gdf[col] = gdf[col].dt.tz_localize(None).astype("datetime64[ms]")

    # Replace empty strings with NA to avoid Arrow type inference failures
    gdf = gdf.replace("", pd.NA)

    return pa.Table.from_pandas(gdf, preserve_index=False)

When migrating from legacy systems, column names often contain spaces, special characters, or reserved keywords. Standardizing these during the mapping phase prevents query failures in downstream engines like DuckDB or AWS Athena. Refer to Schema Mapping for Legacy to Modern Formats for comprehensive naming conventions and type coercion matrices.

GeoParquet requires a specific JSON structure attached to the geo key in the Parquet file metadata. This structure must conform to the specification exactly, or spatial engines will ignore the geometry column. The following function builds a compliant geo metadata payload and attaches it to the Arrow table.

python
import json
import pyarrow as pa
from typing import Any


def build_geoparquet_metadata(
    table: pa.Table,
    source_meta: dict[str, Any],
    primary_column: str = "geometry",
) -> pa.Table:
    """Attach compliant GeoParquet 1.0 metadata to the Arrow table."""
    crs_code = source_meta.get("crs", "unknown")

    # Build a minimal PROJJSON CRS object; use EPSG:4326 as fallback
    try:
        epsg_int = int(crs_code)
    except (ValueError, TypeError):
        epsg_int = 4326

    geo_meta = {
        "version": "1.0.0",
        "primary_column": primary_column,
        "columns": {
            primary_column: {
                "encoding": "WKB",
                "geometry_types": [source_meta.get("geometry_type", "Unknown")],
                "crs": {
                    "$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
                    "type": "GeographicCRS",
                    "name": f"EPSG:{epsg_int}",
                    "id": {"authority": "EPSG", "code": epsg_int},
                },
                "bbox": source_meta.get("bbox"),
            }
        },
    }

    geo_bytes = json.dumps(geo_meta, separators=(",", ":")).encode("utf-8")
    existing_meta = table.schema.metadata or {}
    updated_meta = {**existing_meta, b"geo": geo_bytes}

    return table.replace_schema_metadata(updated_meta)

Note that the encoding field must be WKB or WKT. GeoParquet v1.0.0 mandates WKB for performance, and most modern engines expect this encoding. If you are migrating QGIS projects that rely on proprietary styling or layer metadata, consider Preserving QGIS Metadata in FlatGeobuf before converting to Parquet, as FlatGeobuf handles QGIS-specific tags more gracefully.

Step 4: Validation and Production Pipeline Integration

Writing the file is straightforward, but validation ensures downstream compatibility.

python
from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq


def write_geoparquet(table: pa.Table, output_path: Path) -> None:
    """Write validated GeoParquet with optimized compression."""
    pq.write_table(
        table,
        str(output_path),
        compression="zstd",
        compression_level=3,
        use_dictionary=True,
        write_statistics=True,
        row_group_size=100_000,  # Optimize for cloud query engines
    )

    # Quick validation: read back metadata and verify geo key exists
    read_schema = pq.read_schema(str(output_path))
    if b"geo" not in (read_schema.metadata or {}):
        raise RuntimeError("GeoParquet metadata missing in output file footer.")

For enterprise-scale operations, wrap these functions in an orchestration framework like Apache Airflow or Prefect. Implement retry logic, dead-letter queues for malformed geometries, and checksum verification. When scaling horizontally, partition datasets by spatial index or temporal buckets to avoid skew. Detailed orchestration patterns are covered in Building Batch Conversion Pipelines with Python.

Common Pitfalls and Fallback Strategies

Even with careful implementation, certain edge cases consistently break conversions:

  1. Multi-CRS Datasets: GeoParquet v1.0.0 supports only one CRS per geometry column. If your source contains mixed projections, you must normalize to a single target CRS (typically EPSG:4326 or EPSG:3857) before conversion.
  2. Large Geometry Blobs: WKB encoding can produce massive binary payloads for complex polygons. Enable ZSTD compression and consider tiling or simplifying geometries at scale.
  3. Missing CRS Definitions: shapefiles without .prj files default to unknown CRS. Implement a fallback routing mechanism that quarantines these files for manual review rather than injecting a default projection silently.
  4. Metadata Bloat: Parquet footers are loaded into memory during file open. Keep the geo JSON lean and move extensive provenance logs to a companion metadata catalog (e.g., OpenMetadata or AWS Glue).

Always validate output files against the Apache Parquet Python documentation and run spatial queries in DuckDB or PostGIS to confirm geometry integrity before promoting to production.

Conclusion

Preserving metadata during format transitions is not a cosmetic requirement; it is foundational to data lineage, spatial accuracy, and regulatory compliance. By explicitly extracting source attributes, coercing types to Parquet standards, and injecting a compliant geo footer, platform teams can eliminate silent degradation and build resilient geospatial data lakes. Implement the extraction, mapping, and injection steps outlined above, integrate automated validation, and your pipelines will consistently produce query-ready, standards-compliant GeoParquet files.

Continue exploring