Converting Legacy Shapefiles to GeoParquet at Scale

Migrating multi-terabyte legacy shapefile archives into columnar GeoParquet storage requires deterministic pipeline orchestration, strict schema enforcement, and cold-storage-aware partitioning. The transition eliminates the 2 GB file-size ceiling, DBF attribute truncation, and unindexed spatial queries inherent to legacy formats. This guide details an exact, production-grade conversion workflow operating under the Format Conversion & Pipeline Automation framework, focusing on configuration tuning, validation gates, and edge-case resolution for data engineers, GIS archivists, and compliance teams.

Conversion Stages

Large archives convert in bounded chunks, partitioned and audited end-to-end:

flowchart LR
  A["Pre-flight validation"] --> B["Chunked read"]
  B --> C["H3 partition keys"]
  C --> D["Write partitioned GeoParquet"]
  D --> E["Row count + checksum audit"]

Pre-Flight Validation & Schema Enforcement

Shapefiles frequently fail during bulk ingestion due to implicit encoding mismatches, malformed .prj definitions, and untyped attribute columns. Execute a deterministic validation gate before triggering conversion jobs.

  1. Extract Metadata Deterministically:
ogrinfo -ro -json -al -geom=NO input.shp > manifest.json

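Parse featureCount, the geometry type, CRS, and field definitions from each layer in the JSON. Reject datasets where featureCount is -1 or missing; this usually indicates a missing or corrupted .shx index. Rebuild the index by rewriting the dataset with SHAPE_RESTORE_SHX enabled (ogr2ogr --config SHAPE_RESTORE_SHX YES regenerated.shp input.shp), then re-run the check against the regenerated file.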
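The gate described above could be scripted roughly as follows. This is a minimal sketch that assumes the manifest follows the layout emitted by recent GDAL releases for ogrinfo -json (a top-level layers array whose entries carry featureCount, fields, and geometryFields with a coordinateSystem object). The function name check_manifest and the rejection messages are illustrative; verify the key names against the output of your GDAL version.

```python
def check_manifest(manifest: dict) -> list[str]:
    """Return a list of reasons to reject the dataset (empty means pass)."""
    problems = []
    layers = manifest.get("layers", [])
    if not layers:
        problems.append("no layers found")
    for layer in layers:
        name = layer.get("name", "<unnamed>")
        count = layer.get("featureCount")
        # A missing or negative count is treated as an unreadable index.
        if not isinstance(count, int) or count < 0:
            problems.append(f"{name}: feature count unavailable")
        geom_fields = layer.get("geometryFields", [])
        if not geom_fields:
            problems.append(f"{name}: no geometry field")
        for geom in geom_fields:
            # No coordinateSystem entry means the .prj is absent or unreadable.
            if not geom.get("coordinateSystem"):
                problems.append(f"{name}: CRS missing")
        if not layer.get("fields"):
            problems.append(f"{name}: no attribute fields")
    return problems
```

Load manifest.json with json.load, pass the result to check_manifest, and stop the job if the returned list is not empty.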

  1. Enforce CRS Synchronization: Missing or legacy WKT1 .prj files cause downstream projection drift. Normalize explicitly:
gdalsrsinfo -o proj4 input.prj

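If the source CRS is missing or ambiguous, assign the known CRS explicitly with ogr2ogr -a_srs EPSG:XXXX (typically EPSG:4326 or a project-specific projected CRS); use -t_srs only to reproject data whose source CRS is already known. Store the resolved CRS in the GeoParquet geo metadata block; the specification expects it as PROJJSON, which carries the EPSG identifier. Do not rely on implicit CRS inference. Refer to CRS Synchronization in Pipelines for standardized projection registries.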

  1. Map DBF Types to Arrow Primitives: DBF stores Logical, Date, and Numeric values as fixed-width text and has no native timestamp or 64-bit integer type. Apply explicit type coercion during ingestion:

| DBF Type | Arrow Primitive | Coercion Logic |
|----------|----------------|----------------|
| String(254) | large_string | Truncate with audit log if >254 chars |
| Numeric(10,2) | float64 | Preserve precision; reject NaN unless explicitly allowed |
| Date(YYYYMMDD) | date32 | Parse via pd.to_datetime(..., format='%Y%m%d') |
| Logical | boolean | Map T/Y/1 to True and F/N/0 to False |

Log any field exceeding 254 characters to a compliance manifest before truncation. Reject implicit type promotion to prevent silent data loss.
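The Logical and Date rules in the table can be expressed as small pure functions. The sketch below uses only the standard library; the helper names (coerce_logical, coerce_date) and the choice to map blank or '?' values to None are assumptions for illustration, not part of the pipeline described here.

```python
from datetime import date, datetime
from typing import Optional

TRUE_TOKENS = {"T", "Y", "1"}
FALSE_TOKENS = {"F", "N", "0"}


def coerce_logical(raw: Optional[str]) -> Optional[bool]:
    """Map a DBF Logical token to bool; blank or '?' becomes None."""
    token = (raw or "").strip().upper()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    if token in {"", "?"}:
        return None
    # Anything else is rejected rather than silently promoted.
    raise ValueError(f"unexpected logical token: {raw!r}")


def coerce_date(raw: Optional[str]) -> Optional[date]:
    """Parse a DBF YYYYMMDD string; blank becomes None, bad input raises."""
    text = (raw or "").strip()
    if not text:
        return None
    return datetime.strptime(text, "%Y%m%d").date()
```

Raising on unknown tokens keeps the "reject implicit type promotion" rule enforceable: a failed row surfaces in the job log instead of becoming a silent null.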

Pipeline Architecture & Configuration Tuning

Monolithic ogr2ogr invocations exhaust memory and stall on terabyte-scale archives. Implement a chunked, parallelized pipeline with strict resource boundaries.
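Chunking comes down to covering the feature range with fixed-size windows. The helper below is a minimal sketch; the name chunk_windows is hypothetical, and the resulting (offset, length) pairs are assumed to feed whatever skip/limit arguments your reader exposes (for example skip_features and max_features in pyogrio.read_dataframe).

```python
from typing import Iterator


def chunk_windows(total: int, chunk_size: int) -> Iterator[tuple[int, int]]:
    """Yield (offset, length) windows that together cover `total` features."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, total, chunk_size):
        # The final window is shortened so it never runs past the end.
        yield offset, min(chunk_size, total - offset)
```

Because each window is independent, the windows can also be handed to separate worker processes, as long as each worker writes to distinct output files.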

GDAL Environment Configuration:

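export GDAL_NUM_THREADS=ALL_CPUS
export OGR_MAX_BUFFER_SIZE=512000000  # confirm your GDAL build recognizes this option; unknown options are ignored
export CPL_DEBUG=ON                   # verbose driver logging; switch OFF once the pipeline is stable
export SHAPE_ENCODING=UTF-8           # overrides .cpg/LDID detection; set only if the DBFs really are UTF-8
export GDAL_CACHEMAX=2048             # raster block cache size in MB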

Partitioning Strategy: GeoParquet performs best when partitioned by spatial index or administrative boundary. Generate H3 resolution 6 or S2 level 8 partition keys during ingestion. Write output under Hive-style prefixes such as s3://archive-bucket/year=YYYY/month=MM/h3_cell=XXXXXX/, with one or more .parquet files per cell. Enable ZSTD compression (compression=ZSTD, compression_level=3) to balance archival footprint and decompression latency for cold storage retrieval. For detailed partitioning heuristics, consult the GeoParquet Migration Workflows reference.
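A Hive-style layout of this kind can be generated with a small helper. The function name partition_prefix and the per-cell directory layout (one directory per H3 cell holding one or more Parquet files) are assumptions made for illustration; the H3 index would come from an H3 library and is passed in as a string here.

```python
import posixpath


def partition_prefix(root: str, year: int, month: int, h3_cell: str) -> str:
    """Build a Hive-style partition prefix under an object-store root."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    # posixpath keeps forward slashes regardless of the host operating system.
    return posixpath.join(
        root.rstrip("/"),
        f"year={year:04d}",
        f"month={month:02d}",
        f"h3_cell={h3_cell}",
    )
```

Zero-padding the month keeps prefixes in lexical order, which makes listing and lifecycle rules on object storage easier to reason about.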

Exact Conversion Workflow

Run the conversion chunk by chunk so that memory use during encoding and writing stays bounded. The following Python implementation uses pyogrio (with geopandas) for fast vector I/O, pyarrow for columnar serialization, and the h3 v4 API for partition keys.

import os
import json
import pandas as pd
import pyogrio
import pyproj
import pyarrow as pa
import pyarrow.parquet as pq
import h3


def convert_shapefile_to_geoparquet(
    src_path: str,
    dst_dir: str,
    chunk_size: int = 500_000,
    h3_res: int = 6,
):
    # 1. Schema & CRS extraction. Fail fast instead of guessing a CRS.
    info = pyogrio.read_info(src_path)
    if not info.get("crs"):
        raise ValueError(f"{src_path}: no CRS found; resolve the .prj first")
    crs_epsg = pyproj.CRS.from_user_input(info["crs"]).to_epsg()
    if crs_epsg is None:
        raise ValueError(f"{src_path}: CRS does not resolve to an EPSG code")

    # 2. Read bounded windows with skip_features/max_features so that only
    #    one chunk is held in memory at a time.
    total = info["features"]
    for chunk_idx, start in enumerate(range(0, total, chunk_size)):
        chunk = pyogrio.read_dataframe(
            src_path, skip_features=start, max_features=chunk_size
        )

        # 3. Spatial partition key: H3 needs lat/lng, so derive centroids in EPSG:4326.
        centroids = chunk.geometry.to_crs(4326).centroid
        chunk["h3_cell"] = [h3.latlng_to_cell(pt.y, pt.x, h3_res) for pt in centroids]
        geometry_types = sorted(chunk.geometry.geom_type.unique().tolist())

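        # 4. Encode geometry as WKB (the GeoParquet "geo" encoding) before Arrow.
        #    Build a plain DataFrame first so the bytes column is not treated as geometry.
        frame = pd.DataFrame(chunk.drop(columns="geometry"))
        frame["geometry"] = chunk.geometry.to_wkb()

        # 5. Type enforcement (example: map DBF logical tokens to nullable boolean).
        logical = {"T": True, "Y": True, "1": True, "F": False, "N": False, "0": False}
        for col in frame.select_dtypes(include=["object"]).columns:
            if col in ("geometry", "h3_cell"):
                continue  # never coerce the WKB bytes or the partition key
            values = frame[col].dropna()
            if not values.empty and values.isin(list(logical)).all():
                frame[col] = frame[col].map(logical).astype("boolean")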

        geo_meta = {
            "version": "1.0.0",
            "primary_column": "geometry",
            "columns": {
                "geometry": {
                    "encoding": "WKB",
                    "geometry_types": geometry_types,
                    # GeoParquet expects crs as PROJJSON, not an "EPSG:xxxx" string.
                    "crs": pyproj.CRS.from_epsg(crs_epsg).to_json_dict(),
                }
            },
        }
        geo_bytes = json.dumps(geo_meta).encode()

        # 6. Partitioned write: one file per distinct H3 cell in this chunk.
        for cell, part in frame.groupby("h3_cell"):
            table = pa.Table.from_pandas(part, preserve_index=False)
            table = table.replace_schema_metadata({b"geo": geo_bytes})
            partition_path = os.path.join(dst_dir, f"h3_cell={cell}")
            os.makedirs(partition_path, exist_ok=True)
            pq.write_table(
                table,
                os.path.join(partition_path, f"chunk_{chunk_idx:04d}.parquet"),
                compression="zstd",
                compression_level=3,
                row_group_size=100_000,
            )

CLI Fallback for Non-Python Environments:

ogr2ogr -f "Parquet" output.parquet input.shp \
  -lco COMPRESSION=ZSTD \
  -lco COMPRESSION_LEVEL=3 \
  -lco ROW_GROUP_SIZE=100000 \
  -lco GEOMETRY_ENCODING=WKB \
  -nln layer_name \
  -progress

Validate driver capabilities against the official GDAL Parquet Driver Documentation before deploying CLI pipelines.

Post-Conversion Validation & Integrity Gates

Never assume that a successful write equals data fidelity. Execute automated validation gates immediately after ingestion.

  1. Schema & Metadata Verification:
parquet-tools schema output.parquet

Confirm the geo metadata key exists, contains primary_column, and matches the WKB encoding standard defined in the GeoParquet Specification.
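The same check can be automated once the geo value has been decoded from the file's key-value metadata (for example with json.loads(pq.read_schema(path).metadata[b"geo"]) in pyarrow). The sketch below validates the decoded dictionary using only the standard library. The function name validate_geo_metadata and the selection of fields are illustrative, and they cover only part of what the GeoParquet specification requires.

```python
def validate_geo_metadata(geo: dict) -> list[str]:
    """Return problems found in a decoded GeoParquet 'geo' metadata dict."""
    problems = []
    for key in ("version", "primary_column", "columns"):
        if key not in geo:
            problems.append(f"missing key: {key}")
    columns = geo.get("columns", {})
    primary = geo.get("primary_column")
    if primary is not None and primary not in columns:
        problems.append(f"primary column {primary!r} not described")
    for name, meta in columns.items():
        if meta.get("encoding") != "WKB":
            problems.append(f"{name}: encoding is not WKB")
        if "geometry_types" not in meta:
            problems.append(f"{name}: geometry_types missing")
        # The specification expects PROJJSON (an object) or null for crs.
        if "crs" in meta and not isinstance(meta["crs"], (dict, type(None))):
            problems.append(f"{name}: crs is not PROJJSON")
    return problems
```

An empty list means the metadata passed these checks; any entry should fail the gate before the file is promoted to the archive.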

  1. Spatial Integrity Check:
import json
import pyarrow.parquet as pq
import geopandas as gpd

expected_epsg = 4326  # set to your archival target CRS

table = pq.read_table("output.parquet")
geo = json.loads(table.schema.metadata[b"geo"])
stored_crs = geo["columns"][geo["primary_column"]]["crs"]

df = table.to_pandas()
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df["geometry"], crs=stored_crs))
assert gdf.is_valid.all(), "Invalid geometries detected post-conversion"
assert gdf.crs.to_epsg() == expected_epsg, "CRS drift detected"
  1. Row Count & Checksum Audit: Compare featureCount from the pre-flight manifest against the converted row count from parquet-tools meta (or duckdb -c "SELECT count(*) FROM read_parquet('output.parquet')"). Generate SHA-256 hashes for raw .shp and converted .parquet files. Log discrepancies to an immutable compliance ledger.
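The audit step might be sketched as below, using only the standard library. The function names (sha256_file, audit_counts), the 1 MiB read size, and the idea of summing per-file row counts are assumptions for illustration; the row counts themselves would come from whichever tool you use to read the Parquet footers.

```python
import hashlib
from pathlib import Path


def sha256_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large archives never load fully."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def audit_counts(expected: int, part_counts: dict[str, int]) -> dict:
    """Compare the manifest feature count with the summed Parquet row counts."""
    actual = sum(part_counts.values())
    return {"expected": expected, "actual": actual, "match": expected == actual}
```

Hash the .shp, .shx, and .dbf sidecars as well as the .shp itself if the ledger needs to prove that the full source dataset was unchanged.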

Root-Cause Analysis for Conversion Failures

| Symptom | Root Cause | Resolution |
|---------|------------|------------|
| ArrowInvalid: Cannot convert string to large_string | DBF field contains null bytes or mixed encodings | Strip \x00 via df.replace(r'\x00', '', regex=True) before casting |
| CRS mismatch during spatial join | .prj missing or contains deprecated PROJ strings | Force gdalsrsinfo -o WKT2 and inject EPSG explicitly into geo metadata |
| MemoryError: Unable to allocate X GB | Chunk size exceeds available RAM or unbounded geometry complexity | Reduce chunk_size to 100_000, enable GDAL_CACHEMAX, and explode multi-part geometries pre-write |
| Invalid WKB: Unexpected end of buffer | Corrupted .shp vertex arrays or zero-length geometries | Filter df[df.geometry.notna() & df.geometry.is_valid] before serialization |
| Attribute truncation warnings | Legacy DBF 254-character hard limit | Split oversized text fields into a normalized lookup table or use large_string with explicit truncation logging |

Deploy these validation gates and configuration boundaries to guarantee deterministic, auditable, and cold-storage-optimized spatial archives.