Handling Attribute Loss During Spatial Format Conversion

Attribute loss during spatial format conversion is a silent failure mode: rows survive, geometries render, and the job exits 0, yet columns are quietly truncated, retyped, or dropped before the data reaches cold storage. This guide is for data engineers, GIS archivists, and cloud architects migrating legacy shapefile, GeoJSON, or PostGIS exports into columnar GeoParquet or FlatGeobuf, and it explains why default converter settings fail for these migrations. Left at its defaults, a bare ogr2ogr call can infer types from the first feature, truncate field names to the 10-character DBF limit, coerce Float → Decimal and String → Integer without warning, and strip the coordinate reference system (CRS) and domain metadata that downstream analytics and regulators depend on. The fix is a deterministic profile, map, convert, validate procedure that runs as part of Schema Mapping & Attribute Validation before any batch is promoted to an archive bucket.

Attribute-Loss Failure Modes

Attribute loss originates at three predictable points. First, shapefile DBF restricts field names to 10 ASCII characters and caps numeric precision at 18 digits, so long names that share a 10-character prefix collide and can silently merge. Second, GeoJSON declares no column types, forcing converters to infer each column’s type from the data, in the worst case from the first row alone. Third, when the target is GeoParquet or FlatGeobuf, unhandled null semantics, mixed-type arrays, and an unregistered CRS can trigger silent column drops during serialization. The procedure below closes all three gaps in order.
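
The name-collision case can be checked before any conversion runs. The sketch below assumes naive prefix truncation to 10 characters; the helper name find_truncation_collisions and the sample field list are illustrative, and real writers may rename duplicates rather than merge them.

```python
from collections import defaultdict

DBF_NAME_LIMIT = 10  # field-name limit of the DBF attribute table


def find_truncation_collisions(names, limit=DBF_NAME_LIMIT):
    """Group field names that become identical once cut to `limit` characters."""
    buckets = defaultdict(list)
    for name in names:
        buckets[name[:limit]].append(name)
    # Only truncated names shared by two or more source fields are a problem.
    return {short: full for short, full in buckets.items() if len(full) > 1}


# Hypothetical source field list.
fields = ["parcel_id", "land_use_code", "land_use_class", "zoning_designation"]
collisions = find_truncation_collisions(fields)
print(collisions)
```

Any non-empty result means the alias map in Phase 2 has to resolve those names before a DBF round-trip is attempted.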

Step-by-Step Procedure

Phase 1 — Profile the source schema

Dump the exact source schema first so every later step has a baseline for parity. Use GDAL/OGR for shapefiles and DuckDB’s spatial extension for GeoJSON or PostGIS exports.

# Shapefile: field names, types, widths, and CRS in one pass
ogrinfo -al -so archives/parcels/2021/county_travis.shp

# GeoJSON / PostGIS dump: ST_Read exposes the inferred column types
duckdb -c "INSTALL spatial; LOAD spatial; \
  DESCRIBE SELECT * FROM ST_Read('archives/zoning/2021/austin_zoning.geojson');"
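
To make the baseline machine-readable, the ogrinfo summary can be parsed into a dictionary. This is a minimal sketch that assumes field lines follow the "name: Type (width.precision)" pattern; the sample text is invented for illustration, and fields reported with a subtype would need a wider pattern.

```python
import re

# Matches summary lines such as "parcel_id: Integer64 (18.0)".
FIELD_LINE = re.compile(
    r"^(?P<name>\w+): (?P<type>\w+) \((?P<width>\d+)\.(?P<precision>\d+)\)$"
)


def parse_ogrinfo_fields(summary_text):
    """Extract {field_name: {type, width, precision}} from ogrinfo -al -so text."""
    fields = {}
    for line in summary_text.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            fields[match["name"]] = {
                "type": match["type"],
                "width": int(match["width"]),
                "precision": int(match["precision"]),
            }
    return fields


# Illustrative summary text, not captured from a real run.
sample = """\
Layer name: county_travis
Geometry: Polygon
Feature Count: 412877
parcel_id: Integer64 (18.0)
assessed_v: Real (24.15)
zoning_des: String (80.0)
last_inspe: Date (10.0)
"""
baseline = parse_ogrinfo_fields(sample)
print(baseline)
```

Storing this dictionary next to the source gives the later validation steps something concrete to diff against.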

Phase 2 — Build an explicit cast and alias map

Reject implicit promotions and pin every column to a target type in a YAML manifest. Where the target (or any round-trip through DBF) imposes a name-length limit, derive a deterministic alias so the same long name always maps to the same short column. Cross-reference target types against the Apache Parquet type system.

# scripts/build_schema_map.py
import hashlib

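def alias_field(name: str, max_len: int = 64) -> str:
    """Stable, collision-resistant alias for over-length field names."""
    if len(name) <= max_len:
        return name
    if max_len < 9:
        raise ValueError("max_len must leave room for '_' plus an 8-char digest")
    # The digest depends only on the name, so the alias is stable across runs.
    digest = hashlib.md5(name.encode()).hexdigest()[:8]
    # Reserve 9 characters ("_" + digest) so the alias is exactly max_len long.
    return f"{name[:max_len - 9]}_{digest}"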

# Explicit cast rules — no String->Integer or Float->Decimal inference
CAST_MAP = {
    "parcel_id":           "BIGINT",
    "assessed_valuation":  "DOUBLE",   # never coerce to DECIMAL silently
    "zoning_designation":  "VARCHAR",
    "last_inspection_date": "DATE",
}
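
Rejecting implicit promotions is easier when values that cannot be cast are found before the converter sees them. The following pre-flight check is a sketch under assumptions: the parser table, the helper name find_cast_failures, and the sample rows are all hypothetical, and only the four target types from the cast map are covered.

```python
from datetime import date


def _to_bigint(value):
    number = int(str(value))  # rejects "12.5" and "A-17"
    if not -2**63 <= number < 2**63:
        raise ValueError("out of BIGINT range")
    return number


# One strict parser per target type used in the cast map.
PARSERS = {
    "BIGINT": _to_bigint,
    "DOUBLE": lambda v: float(str(v)),
    "VARCHAR": str,
    "DATE": lambda v: date.fromisoformat(str(v)),
}


def find_cast_failures(rows, cast_map):
    """Return (row_index, column, value) for each non-null value that fails its cast."""
    failures = []
    for index, row in enumerate(rows):
        for column, target in cast_map.items():
            value = row.get(column)
            if value is None:
                continue  # nulls are handled by the Phase 3 thresholds
            try:
                PARSERS[target](value)
            except ValueError:
                failures.append((index, column, value))
    return failures


# Hypothetical sample rows.
rows = [
    {"parcel_id": "1001", "assessed_valuation": "250000.50", "last_inspection_date": "2021-03-14"},
    {"parcel_id": "A-17", "assessed_valuation": "n/a", "last_inspection_date": None},
]
cast_map = {"parcel_id": "BIGINT", "assessed_valuation": "DOUBLE", "last_inspection_date": "DATE"}
failures = find_cast_failures(rows, cast_map)
print(failures)
```

A non-empty list is a reason to stop and either fix the source or widen the target type deliberately, rather than letting the writer decide.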

Phase 3 — Set null-tolerance and CRS gates

Apply two pre-conversion thresholds. A column with more than 15% nulls in the source must be declared NULLABLE in the target schema; a column with more than 85% nulls is dropped before conversion to keep the cold-storage footprint small. A missing EPSG code or WKT string can cause the geometry column to be dropped on write, so confirm CRS presence before continuing. The standardized projection handling lives in Automating CRS Transformations in ETL Pipelines.

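# Fail fast if the source has no spatial reference attached.
# -al is needed for layer details; a layer without a CRS prints "(unknown)".
if ogrinfo -ro -al -so archives/parcels/2021/county_travis.shp \
     | grep -A1 "Layer SRS WKT" | grep -q "(unknown)"; then
  echo "FATAL: no CRS on source, aborting conversion"; exit 1
fi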
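
The two null thresholds from this phase can be expressed as a small classifier. This is a sketch, assuming null rates are supplied as fractions between 0 and 1; the function name, the action labels, and the sample columns are illustrative.

```python
def classify_by_null_rate(null_fractions, nullable_above=0.15, drop_above=0.85):
    """Map each column to DROP, NULLABLE, or UNCHANGED from its source null fraction."""
    plan = {}
    for column, fraction in null_fractions.items():
        if fraction > drop_above:
            plan[column] = "DROP"        # more than 85% nulls
        elif fraction > nullable_above:
            plan[column] = "NULLABLE"    # more than 15% nulls
        else:
            plan[column] = "UNCHANGED"   # at or below the 15% threshold
    return plan


# Hypothetical null fractions taken from a Phase 1 profile.
profile = {"parcel_id": 0.0, "zoning_designation": 0.22, "legacy_notes": 0.91}
plan = classify_by_null_rate(profile)
print(plan)
```

Both comparisons are strict, so a column sitting exactly on a threshold takes the less aggressive action.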

Phase 4 — Convert with explicit casts (GDAL/OGR)

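Carry every column through an explicit CAST so nothing is inferred at write time. SHAPE_ENCODING tells the shapefile driver which encoding to assume for the DBF; it applies only to that driver, so set it to the encoding the file was actually written in (UTF-8 in this example).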

ogr2ogr -f "Parquet" \
  archives/parcels/2021/county_travis.parquet \
  archives/parcels/2021/county_travis.shp \
  --config SHAPE_ENCODING UTF-8 \
  -dialect SQLITE \
  -lco GEOMETRY_NAME=geometry \
  -lco COMPRESSION=ZSTD \
  -nln parcels_2021 \
  -sql "SELECT CAST(parcel_id AS INTEGER)         AS parcel_id,
               CAST(assessed_valuation AS REAL)   AS assessed_valuation,
               CAST(zoning_designation AS TEXT)   AS zoning_designation,
               geometry
        FROM county_travis"

Phase 5 — Convert with strict schema (DuckDB / PyArrow)

For cloud-native batches, register column types before the write so the inference engine is never consulted. This path also lets you emit the partition tree directly to an archive prefix.

import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")

query = """
    SELECT
        CAST(parcel_id AS BIGINT)            AS parcel_id,
        CAST(assessed_valuation AS DOUBLE)   AS assessed_valuation,
        CAST(zoning_designation AS VARCHAR)  AS zoning_designation,
        geom
    FROM ST_Read('archives/parcels/2021/county_travis.shp')
"""

# Strict schema + GeoParquet-compliant ZSTD write
con.execute(f"""
    COPY ({query})
    TO 'archives/parcels/2021/county_travis.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD)
""")

The orchestration that makes these jobs idempotent and rolls them back on a schema mismatch is covered under Format Conversion & Pipeline Automation. Picking the ZSTD level that balances archive footprint against cold-retrieval CPU is covered in Tuning ZSTD Compression for GeoParquet Archives.

Phase 6 — Write a sidecar manifest

Persist the resolved schema, row statistics, and CRS as a JSON sidecar next to the object so any deprecated attribute can be reconstructed during retrieval. Map regulator-required fields that fail target validation into a legacy_metadata JSON column rather than discarding them.

duckdb -json -c "DESCRIBE SELECT * FROM \
  read_parquet('archives/parcels/2021/county_travis.parquet');" \
  > archives/parcels/2021/county_travis.schema.json
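
The DESCRIBE dump covers the schema only. A fuller sidecar that also carries row count, CRS, and the alias map might look like the sketch below; the key names and the build_sidecar helper are assumptions, and the CRS value is a placeholder rather than the real projection of this dataset.

```python
import json


def build_sidecar(schema, row_count, crs, alias_map=None, legacy_metadata=None):
    """Assemble the sidecar as a plain dict so it serializes deterministically."""
    return {
        "schema": schema,
        "row_count": row_count,
        "crs": crs,
        "alias_map": alias_map or {},
        "legacy_metadata": legacy_metadata or {},
    }


sidecar = build_sidecar(
    schema={"parcel_id": "BIGINT", "assessed_valuation": "DOUBLE"},
    row_count=412877,
    crs="EPSG:4326",  # placeholder value for illustration only
    alias_map={"land_use_classification_code": "land_use_c"},
)
# sort_keys keeps the output byte-stable, which matters once the file is hashed.
sidecar_text = json.dumps(sidecar, indent=2, sort_keys=True)
print(sidecar_text)
```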

Validation & Verification

Validate immediately after the write, before the object is promoted to a colder tier. Run a four-check sequence and treat any mismatch as a hard failure.

1. Row parity — source feature count must equal target row count exactly:

src_rows=$(ogrinfo -ro -al -so archives/parcels/2021/county_travis.shp \
  | grep "Feature Count:" | awk '{print $NF}')
tgt_rows=$(duckdb -noheader -list -c \
  "SELECT count(*) FROM read_parquet('archives/parcels/2021/county_travis.parquet');")
[ "$src_rows" -eq "$tgt_rows" ] \
  && echo "OK parity: $tgt_rows rows" \
  || echo "ROW COUNT MISMATCH: $src_rows vs $tgt_rows"

Expected output — the counts match and parity is confirmed:

OK parity: 412877 rows

2. Schema delta and null drift — SUMMARIZE reports per-column type and null percentage in one pass, surfacing any unexpected promotion or injected null:

SUMMARIZE SELECT * FROM read_parquet('archives/parcels/2021/county_travis.parquet');

Confirm parcel_id is BIGINT (not the DECIMAL an inference engine would have chosen) and that null_percentage matches the source profile from Phase 1.
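
Once the SUMMARIZE output is loaded into dictionaries, the comparison against the cast map and the Phase 1 profile can be automated. This is a sketch that assumes the types and null percentages have already been extracted; schema_delta, null_drift, and the sample values are hypothetical.

```python
def schema_delta(cast_map, observed_types):
    """List every column whose target type is missing or differs from the cast map."""
    problems = []
    for column, expected in cast_map.items():
        actual = observed_types.get(column)
        if actual is None:
            problems.append(f"{column}: missing from target")
        elif actual.upper() != expected.upper():
            problems.append(f"{column}: expected {expected}, found {actual}")
    return problems


def null_drift(source_pct, target_pct, tolerance=0.01):
    """Return columns whose null percentage moved by more than `tolerance` points."""
    drifted = {}
    for column, before in source_pct.items():
        after = target_pct.get(column)
        if after is None or abs(after - before) > tolerance:
            drifted[column] = (before, after)
    return drifted


# Hypothetical values standing in for the cast map and SUMMARIZE results.
cast_map = {"parcel_id": "BIGINT", "assessed_valuation": "DOUBLE"}
observed = {"parcel_id": "BIGINT", "assessed_valuation": "DECIMAL(18,3)"}
type_problems = schema_delta(cast_map, observed)
drift = null_drift({"parcel_id": 0.0, "zoning_designation": 3.5},
                   {"parcel_id": 0.0, "zoning_designation": 12.0})
print(type_problems, drift)
```

Either result being non-empty is a mismatch, and under the rule above it should fail the job.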

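3. Geometry integrity — the geometry column is stored as WKB, so decode it before validating. The column is named geom in the Phase 5 output and geometry in the Phase 4 output, so adjust the query to match the writer that produced the file: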

duckdb -c "INSTALL spatial; LOAD spatial; \
  SELECT count(*) AS invalid FROM read_parquet('archives/parcels/2021/county_travis.parquet') \
  WHERE NOT ST_IsValid(ST_GeomFromWKB(geom));"

Expected output — zero invalid geometries survived serialization:

┌─────────┐
│ invalid │
│  int64  │
├─────────┤
│    0    │
└─────────┘

4. Checksum ledger — generate SHA-256 hashes for the source and target and append them to an immutable ledger governed by your Retention Policy Frameworks, so any later schema drift detected on retrieval can trigger a rollback to the last verified manifest.
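
A minimal sketch of the hashing and ledger append, assuming a JSON Lines file as the ledger format. The function names and entry keys are illustrative, and an append-only file does not enforce immutability by itself; that has to come from the storage layer.

```python
import hashlib
import json


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def append_ledger_entry(ledger_path, source_path, target_path):
    """Append one JSON line recording the source and target hashes."""
    entry = {
        "source": str(source_path),
        "source_sha256": sha256_of(source_path),
        "target": str(target_path),
        "target_sha256": sha256_of(target_path),
    }
    with open(ledger_path, "a", encoding="utf-8") as ledger:
        ledger.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

On retrieval, recomputing sha256_of for the target and comparing it with the ledger entry shows whether the object changed after it was verified.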

Troubleshooting

Symptom	Root cause	Fix
Two source columns collapse into one in the target	DBF 10-char name truncation merged `land_use_code` and `land_use_class` to `land_use_c`	Apply `alias_field()` from Phase 2 and write the long→short map into the sidecar manifest before converting
`assessed_valuation` arrives as `DECIMAL`, breaking downstream joins	Implicit `Float → Decimal` promotion during type inference	Pin the column with an explicit `CAST(... AS DOUBLE)` and reject any job whose `SUMMARIZE` type differs from the cast map
Geometry column missing from the output entirely	Source had no `.prj` / EPSG, so the writer dropped the geometry on serialization	Enforce the Phase 3 CRS gate; assign the known CRS explicitly before the write (for example with `ogr2ogr -a_srs`) and re-run
Row count matches but a text column is all nulls	Mixed-type GeoJSON column inferred from a numeric first row, nulling later string values	Force `CAST(col AS VARCHAR)` at read time instead of trusting first-row inference
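
The mixed-type symptom in the last row can be detected ahead of conversion by scanning every feature rather than the first one. This is a sketch using only the standard library; mixed_type_properties and the sample document are illustrative. It reports Python type names, so a column holding both 1 and 1.5 is also flagged as int plus float.

```python
import json
from collections import defaultdict


def mixed_type_properties(feature_collection):
    """Return properties whose non-null values span more than one JSON value type."""
    seen = defaultdict(set)
    for feature in feature_collection["features"]:
        for key, value in (feature.get("properties") or {}).items():
            if value is not None:
                seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items() if len(types) > 1}


# Hypothetical two-feature document with a numeric first row.
doc = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "geometry": null, "properties": {"zone": 101, "name": "A"}},
  {"type": "Feature", "geometry": null, "properties": {"zone": "R-2", "name": "B"}}
]}
""")
mixed = mixed_type_properties(doc)
print(mixed)
```

Every column this reports is a candidate for an explicit VARCHAR cast at read time.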

Operational Execution Checklist

Dump and archive the source schema (ogrinfo -so or DuckDB DESCRIBE) as the parity baseline.
Build an explicit cast map and deterministic alias table; reject all implicit type promotions.
Apply the 15% NULLABLE and 85% drop thresholds, and fail fast when the source carries no CRS.
Convert with explicit CAST statements via GDAL/OGR or DuckDB — never first-row inference.
Reconcile source-to-target row parity and confirm types with SUMMARIZE.
Decode WKB and confirm zero invalid geometries post-conversion.
Write the JSON sidecar manifest and record source/target SHA-256 hashes in the compliance ledger.

Related

Up: Schema Mapping & Attribute Validation — the parent reference for field-level parity rules this procedure enforces.
Converting Legacy Shapefiles to GeoParquet at Scale — companion workflow that applies these cast and alias maps across terabyte-scale batches.
Automating CRS Transformations in ETL Pipelines — deterministic projection handling for the CRS gate in Phase 3.
Tuning ZSTD Compression for GeoParquet Archives — cross-pillar guidance on the compression level applied during the Phase 4 and 5 writes.
