Converting Legacy Shapefiles to GeoParquet at Scale
Migrating multi-terabyte legacy shapefile archives into columnar GeoParquet storage requires deterministic pipeline orchestration, strict schema enforcement, and cold-storage-aware partitioning. The transition eliminates the 2 GB file-size ceiling, DBF attribute truncation, and unindexed spatial queries inherent to legacy formats. This guide details an exact, production-grade conversion workflow operating under the Format Conversion & Pipeline Automation framework, focusing on configuration tuning, validation gates, and edge-case resolution for data engineers, GIS archivists, and compliance teams.
Conversion Stages
Large archives convert in bounded chunks, partitioned and audited end-to-end:
flowchart LR
    A["Pre-flight validation"] --> B["Chunked read"]
    B --> C["H3 partition keys"]
    C --> D["Write partitioned GeoParquet"]
    D --> E["Row count + checksum audit"]
Pre-Flight Validation & Schema Enforcement
Shapefiles frequently fail during bulk ingestion due to implicit encoding mismatches, malformed .prj definitions, and untyped attribute columns. Execute a deterministic validation gate before triggering conversion jobs.
- Extract Metadata Deterministically:
ogrinfo -ro -json -al -geom=NO input.shp > manifest.json
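Parse the feature count, geometry type, CRS, and field definitions from the JSON (the -json flag needs a recent GDAL release, 3.7 or later). Reject datasets where featureCount is -1 or missing; this usually indicates a damaged .shx index, which you regenerate by rewriting the dataset (ogr2ogr regenerated.shp input.shp). If the .shx file is absent altogether, set the SHAPE_RESTORE_SHX=YES configuration option so GDAL can rebuild it on open.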
- Enforce CRS Synchronization: Missing or legacy WKT1 .prj files cause downstream projection drift. Normalize explicitly:
gdalsrsinfo -o proj4 input.prj
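If the output is ambiguous, set the CRS explicitly. Use ogr2ogr -a_srs EPSG:XXXX to assign a CRS to data whose .prj is missing or wrong (coordinates are left untouched), and ogr2ogr -s_srs EPSG:XXXX -t_srs EPSG:YYYY to reproject into EPSG:4326 or a project-specific projected CRS; -t_srs on its own fails when the source CRS is unknown. Store the resolved CRS in the GeoParquet geo metadata block (the specification defines the crs field as PROJJSON). Do not rely on implicit CRS inference. Refer to CRS Synchronization in Pipelines for standardized projection registries.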
- Map DBF Types to Arrow Primitives: DBF stores every value as fixed-width text. It has Date and Logical field types but no datetime, native 64-bit integer, or long-text type, so apply explicit type coercion during ingestion:
| DBF Type | Arrow Primitive | Coercion Logic |
|----------|----------------|----------------|
| String(254) | large_string | Truncate with audit log if >254 chars |
| Numeric(10,2) | float64 | Preserve precision; reject NaN unless explicitly allowed |
| Date(YYYYMMDD) | date32 | Parse via pd.to_datetime(..., format='%Y%m%d') |
| Logical | boolean | Map T/F/Y/N/1/0 → True/False |
Log any field exceeding 254 characters to a compliance manifest before truncation. Reject implicit type promotion to prevent silent data loss.
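The metadata gate described in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names load_manifest and check_manifest are hypothetical, and the key layout (layers, featureCount, geometryFields, coordinateSystem, fields, width) is an assumption about the ogrinfo -json output that should be confirmed against a manifest produced by your own GDAL version.

```python
import json


def load_manifest(path: str) -> dict:
    """Load the JSON manifest written by `ogrinfo -json` (path is caller-supplied)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def check_manifest(manifest: dict) -> list[str]:
    """Return rejection reasons; an empty list means the gate passes.

    Assumed layout: manifest["layers"][i] holds "featureCount",
    "geometryFields" (each with "coordinateSystem") and "fields".
    """
    layers = manifest.get("layers", [])
    if not layers:
        return ["manifest contains no layers"]

    problems = []
    for layer in layers:
        name = layer.get("name", "<unnamed>")

        # A missing or negative count suggests the .shx index could not be read.
        count = layer.get("featureCount")
        if not isinstance(count, int) or count < 0:
            problems.append(f"{name}: feature count unknown ({count!r})")

        # Every geometry field must carry an explicit coordinate system.
        for geom in layer.get("geometryFields", []):
            if not geom.get("coordinateSystem"):
                problems.append(f"{name}: geometry field has no CRS")

        # Text fields at the DBF width limit are candidates for the audit log.
        for field in layer.get("fields", []):
            if field.get("type") == "String" and field.get("width", 0) >= 254:
                problems.append(
                    f"{name}.{field.get('name')}: width at 254-character limit"
                )
    return problems
```

A conversion job would call check_manifest(load_manifest("manifest.json")) and refuse to start if the returned list is non-empty, writing the reasons to the compliance manifest.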
Pipeline Architecture & Configuration Tuning
Monolithic ogr2ogr invocations exhaust memory and stall on terabyte-scale archives. Implement a chunked, parallelized pipeline with strict resource boundaries.
GDAL Environment Configuration:
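export GDAL_NUM_THREADS=ALL_CPUS
# Verify this option against the GDAL configuration options list for your version;
# GDAL silently ignores options it does not recognize.
export OGR_MAX_BUFFER_SIZE=512000000
# Verbose diagnostics; switch to OFF once the pipeline is stable.
export CPL_DEBUG=ON
# Overrides the encoding GDAL would detect from the .cpg file or DBF header.
# Set it to the real source encoding (for example CP1252), not the desired output.
export SHAPE_ENCODING=UTF-8
# Megabytes. This is the raster block cache, so expect limited effect on vector reads.
export GDAL_CACHEMAX=2048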
Partitioning Strategy: GeoParquet performs optimally when partitioned by spatial index or administrative boundary. Generate H3 resolution 6 or S2 level 8 partition keys during ingestion. Write output to s3://archive-bucket/year=YYYY/month=MM/h3_cell=XXXXXX.parquet. Enable ZSTD compression (compression=ZSTD, compression_level=3) to balance archival footprint and decompression latency for cold storage retrieval. For detailed partitioning heuristics, consult the GeoParquet Migration Workflows reference.
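The object layout named above can be generated with a small helper. The sketch below is illustrative only: partition_key is a hypothetical name, and taking year and month from a dataset acquisition date is an assumption, since the text does not say which date the partitions refer to. The check on the cell id only confirms that it is hexadecimal; it does not validate it as a real H3 index.

```python
import posixpath
from datetime import date


def partition_key(base_uri: str, acquired: date, h3_cell: str) -> str:
    """Build a key shaped like <base>/year=YYYY/month=MM/h3_cell=<cell>.parquet."""
    cell = h3_cell.lower()
    # Cheap sanity check only: H3 string ids are hexadecimal.
    if not cell or any(ch not in "0123456789abcdef" for ch in cell):
        raise ValueError(f"not a hexadecimal H3 cell id: {h3_cell!r}")
    return posixpath.join(
        base_uri.rstrip("/"),
        f"year={acquired.year:04d}",
        f"month={acquired.month:02d}",
        f"h3_cell={cell}.parquet",
    )
```

Zero-padding the month keeps keys lexicographically sortable, which matters when listing objects by prefix in cold storage.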
Exact Conversion Workflow
Execute the conversion using a streaming architecture to maintain a constant memory footprint. The following Python implementation uses pyogrio for fast vector I/O and pyarrow for columnar serialization.
import os
import json
import pandas as pd
import pyogrio
import pyproj
import pyarrow as pa
import pyarrow.parquet as pq
import h3

# DBF logical tokens that can be mapped to booleans without guessing.
BOOL_TOKENS = {"T": True, "F": False, "Y": True, "N": False, "1": True, "0": False}


def convert_shapefile_to_geoparquet(
    src_path: str,
    dst_dir: str,
    chunk_size: int = 500_000,
    h3_res: int = 6,
):
    # 1. Schema & CRS extraction. Fail fast instead of guessing a CRS.
    info = pyogrio.read_info(src_path)
    if not info.get("crs"):
        raise ValueError(f"{src_path}: no CRS found; resolve the .prj first")
    crs = pyproj.CRS.from_user_input(info["crs"])
    total = info["features"]
    if total < 0:
        raise ValueError(f"{src_path}: feature count unknown; rebuild the .shx index")

    # 2. Read bounded windows with skip_features/max_features so that only
    #    one chunk is held in memory at a time.
    for chunk_idx, start in enumerate(range(0, total, chunk_size)):
        chunk = pyogrio.read_dataframe(
            src_path, skip_features=start, max_features=chunk_size
        )

        # 3. Spatial partition key: take centroids in the native CRS, then
        #    reproject the points to EPSG:4326 because H3 needs lat/lng.
        #    Null or empty geometries are kept (row counts must match the
        #    audit) and routed to an "unlocated" partition.
        centroids = chunk.geometry.centroid.to_crs(4326)
        h3_cells = [
            "unlocated" if pt is None or pt.is_empty
            else h3.latlng_to_cell(pt.y, pt.x, h3_res)
            for pt in centroids
        ]
        geometry_types = sorted(chunk.geometry.geom_type.dropna().unique().tolist())

        # 4. Encode geometry as WKB (the GeoParquet "geo" encoding) before Arrow.
        frame = pd.DataFrame(chunk.drop(columns=chunk.geometry.name))
        frame["geometry"] = chunk.geometry.to_wkb()
        frame["h3_cell"] = h3_cells

        # 5. Type enforcement: convert a text column to boolean only when every
        #    non-null value is a known DBF logical token. For a stable schema
        #    across chunks, fix the list of such columns once up front.
        for col in frame.select_dtypes(include=["object"]).columns:
            if col in ("geometry", "h3_cell"):
                continue
            tokens = frame[col].dropna().map(lambda v: str(v).strip().upper())
            if len(tokens) and tokens.isin(list(BOOL_TOKENS)).all():
                frame[col] = frame[col].map(
                    lambda v: None if pd.isna(v) else BOOL_TOKENS[str(v).strip().upper()]
                ).astype("boolean")

        geo_meta = {
            "version": "1.0.0",
            "primary_column": "geometry",
            "columns": {
                "geometry": {
                    "encoding": "WKB",
                    "geometry_types": geometry_types,
                    # The GeoParquet spec defines crs as a PROJJSON object.
                    "crs": crs.to_json_dict(),
                }
            },
        }
        geo_bytes = json.dumps(geo_meta).encode()

        # 6. Partitioned write: one file per distinct H3 cell in this chunk.
        for cell, part in frame.groupby("h3_cell"):
            table = pa.Table.from_pandas(part, preserve_index=False)
            table = table.replace_schema_metadata({b"geo": geo_bytes})
            partition_path = os.path.join(dst_dir, f"h3_cell={cell}")
            os.makedirs(partition_path, exist_ok=True)
            pq.write_table(
                table,
                os.path.join(partition_path, f"chunk_{chunk_idx:04d}.parquet"),
                compression="zstd",
                compression_level=3,
                row_group_size=100_000,
            )
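To fan the function above out across an archive, a small batch driver is one option. The sketch below is an assumption about how such a driver might look, not part of the original workflow: convert_archive and its parameters are hypothetical names, and the executor class is injectable so the same code can run in threads for testing or in processes for real conversions.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import Callable, Iterable


def convert_archive(
    sources: Iterable[str],
    convert: Callable[[str], None],
    max_workers: int = 4,
    executor_cls=ProcessPoolExecutor,
) -> dict[str, str]:
    """Run `convert` on every source path; return {path: "ok" or error text}."""
    results: dict[str, str] = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(convert, src): src for src in sources}
        for future in as_completed(futures):
            src = futures[future]
            try:
                future.result()
                results[src] = "ok"
            except Exception as exc:
                # One bad shapefile must not stop the rest of the batch.
                results[src] = f"failed: {exc}"
    return results
```

With process pools, the callable has to be picklable, so a module-level wrapper or functools.partial around convert_shapefile_to_geoparquet works where a lambda would not. Give each source its own dst_dir: the chunk file names restart at chunk_0000.parquet for every shapefile, so two sources writing into the same partition directory would overwrite each other. Keep max_workers low enough that max_workers times the per-chunk memory stays within the host's RAM.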
CLI Fallback for Non-Python Environments:
ogr2ogr -f "Parquet" output.parquet input.shp \
-lco COMPRESSION=ZSTD \
-lco COMPRESSION_LEVEL=3 \
-lco ROW_GROUP_SIZE=100000 \
-lco GEOMETRY_ENCODING=WKB \
-nln layer_name \
-progress
Validate driver capabilities against the official GDAL Parquet Driver Documentation before deploying CLI pipelines.
Post-Conversion Validation & Integrity Gates
Never assume successful write equals data fidelity. Execute automated validation gates immediately after ingestion.
- Schema & Metadata Verification:
parquet-tools schema output.parquet
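Confirm that the geo key is present in the file's key-value metadata, that it names a primary_column, and that the column's encoding is WKB as defined in the GeoParquet Specification. If the schema listing does not show key-value metadata, inspect it with parquet-tools meta instead.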
- Spatial Integrity Check:
import json
import pyarrow.parquet as pq
import geopandas as gpd
expected_epsg = 4326 # set to your archival target CRS
table = pq.read_table("output.parquet")
geo = json.loads(table.schema.metadata[b"geo"])
stored_crs = geo["columns"][geo["primary_column"]]["crs"]
df = table.to_pandas()
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df["geometry"], crs=stored_crs))
assert gdf.is_valid.all(), "Invalid geometries detected post-conversion"
assert gdf.crs.to_epsg() == expected_epsg, "CRS drift detected"
- Row Count & Checksum Audit:
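Compare featureCount from the pre-flight manifest against the converted row count from parquet-tools meta (or duckdb -c "SELECT count(*) FROM read_parquet('output.parquet')"). Generate SHA-256 hashes for the raw .shp and the converted .parquet files; the two hashes will never match each other, so record them as provenance that lets you detect later tampering or bit rot in either file. Log row-count discrepancies to an immutable compliance ledger.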
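The gates above can be combined into a single audit record. This is a minimal sketch under stated assumptions: sha256_file, check_geo_metadata, and audit_record are hypothetical helper names, the caller is assumed to supply the row counts and the already-parsed geo dictionary, and the record format is illustrative rather than a prescribed ledger schema.

```python
import hashlib


def sha256_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large archives never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def check_geo_metadata(geo: dict) -> list[str]:
    """Check the parsed "geo" metadata for the fields the validation step names."""
    problems = []
    if "version" not in geo:
        problems.append("geo metadata has no version")
    primary = geo.get("primary_column")
    columns = geo.get("columns", {})
    if primary is None:
        problems.append("geo metadata has no primary_column")
    elif primary not in columns:
        problems.append(f"primary_column {primary!r} is not described in columns")
    elif columns[primary].get("encoding") != "WKB":
        problems.append(f"column {primary!r} is not WKB-encoded")
    return problems


def audit_record(src_path: str, dst_path: str,
                 expected_rows: int, actual_rows: int, geo: dict) -> dict:
    """Assemble one ledger entry: hashes for provenance, counts for fidelity."""
    problems = check_geo_metadata(geo)
    if expected_rows != actual_rows:
        problems.append(
            f"row count mismatch: expected {expected_rows}, got {actual_rows}"
        )
    return {
        "source": src_path,
        "source_sha256": sha256_file(src_path),
        "output": dst_path,
        "output_sha256": sha256_file(dst_path),
        "expected_rows": expected_rows,
        "actual_rows": actual_rows,
        "problems": problems,
        "passed": not problems,
    }
```

Hashing the .dbf, .shx, and .prj sidecars in the same way would extend the provenance record to the whole shapefile, since the .shp alone carries no attributes.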
Root-Cause Analysis for Conversion Failures
| Symptom | Root Cause | Resolution |
|---|---|---|
| ArrowInvalid: Cannot convert string to large_string | DBF field contains null bytes or mixed encodings | Strip \x00 via df.replace(r'\x00', '', regex=True) before casting |
| CRS mismatch during spatial join | .prj missing or contains deprecated PROJ strings | Force gdalsrsinfo -o WKT2 and inject EPSG explicitly into geo metadata |
| MemoryError: Unable to allocate X GB | Chunk size exceeds available RAM or unbounded geometry complexity | Reduce chunk_size to 100_000, enable GDAL_CACHEMAX, and explode multi-part geometries pre-write |
| Invalid WKB: Unexpected end of buffer | Corrupted .shp vertex arrays or zero-length geometries | Filter df[df.geometry.notna() & df.geometry.is_valid] before serialization |
| Attribute truncation warnings | Legacy DBF 254-character hard limit | Split oversized text fields into a normalized lookup table or use large_string with explicit truncation logging |
Deploy these validation gates and configuration boundaries to guarantee deterministic, auditable, and cold-storage-optimized spatial archives.