Tuning ZSTD Compression for GeoParquet Archives

Default ZSTD configurations applied to GeoParquet archives routinely produce suboptimal cold storage ratios and elevated decompression latency. The root cause is a mismatch between generic columnar compression defaults and the statistical distribution of coordinate arrays, CRS metadata, and high-cardinality GIS attributes. This document provides exact configuration steps, validation thresholds, and operational troubleshooting for data engineers, GIS archivists, cloud architects, and compliance teams managing spatial archival pipelines.

Tuning Workflow

GeoParquet ZSTD tuning proceeds from a baseline measurement to a benchmarked write:

flowchart LR
  A["Baseline profile"] --> B["Align row groups"]
  B --> C["Set compression_level = 9"]
  C --> D["Disable dict on geometry / CRS"]
  D --> E["Benchmark ratio + latency"]

Baseline Profiling & Target Metrics

Before modifying compression parameters, establish dataset-specific baselines. Run pyarrow.parquet.read_metadata() to extract row group counts, per-column compressed and uncompressed sizes, and existing compression codecs. Target the following operational thresholds for cold storage:

  • Compression ratio ≥ 3.5:1 for geometry columns
  • Decompression latency ≤ 120 ms per 100 MB row group on standard cloud VM instances (e.g., c6i.large / t3.medium)
  • Storage footprint reduction ≥ 28% compared to default zstd(level=3)

Geometry columns must be isolated from attribute columns during parameter assignment, as covered in Compression Tuning & Storage Optimization. Uniform compression policies degrade performance on mixed-type spatial datasets.

Execute baseline extraction:

import pyarrow.parquet as pq
import pandas as pd

# Read only the file footer; no data pages are decompressed here.
meta = pq.read_metadata("input_archive.parquet")
stats = []
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    for j in range(rg.num_columns):
        col = rg.column(j)
        stats.append({
            "row_group": i,
            "column": col.path_in_schema,
            "codec": col.compression,  # existing codec for this column chunk
            "total_compressed": col.total_compressed_size,
            "total_uncompressed": col.total_uncompressed_size,
            # Guard against zero-byte chunks when computing the ratio
            "ratio": col.total_uncompressed_size / max(col.total_compressed_size, 1)
        })
baseline_df = pd.DataFrame(stats)
# Mean per-row-group compression ratio for each column
print(baseline_df.groupby("column")["ratio"].mean())
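
The per-column ratios above address the first threshold; the 28% footprint target needs a comparison between two files. A minimal sketch, assuming the same table has been written once with default zstd level 3 and once with tuned settings. The function names and byte counts are illustrative and not part of PyArrow.

```python
def footprint_reduction(baseline_bytes: int, tuned_bytes: int) -> float:
    """Return the fractional size reduction of the tuned file versus the baseline."""
    if baseline_bytes <= 0:
        raise ValueError("baseline_bytes must be positive")
    return (baseline_bytes - tuned_bytes) / baseline_bytes


def meets_footprint_target(baseline_bytes: int, tuned_bytes: int,
                           target: float = 0.28) -> bool:
    """True when the tuned file is at least `target` smaller than the baseline."""
    return footprint_reduction(baseline_bytes, tuned_bytes) >= target


# Illustrative numbers only: a 1000 MB level-3 file versus a 700 MB tuned file.
baseline_size, tuned_size = 1_000_000_000, 700_000_000
reduction = footprint_reduction(baseline_size, tuned_size)
print(f"{reduction:.0%} smaller; target met: "
      f"{meets_footprint_target(baseline_size, tuned_size)}")
```

In practice the two byte counts would come from os.path.getsize() on the baseline and tuned files, or from the object store's reported sizes.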

Step 1: Align Row Group Boundaries with Spatial Extents

ZSTD dictionary effectiveness degrades when row groups split spatially contiguous geometries. Implementing precise Row Group Sizing Strategies ensures that coordinate sequences remain intact within compression contexts.

  1. Derive the target row count per group from the block-size formula in Calculating Optimal Row Group Size for Spatial Queries, then cap it (e.g. R_FINAL = min(R_opt, 1_000_000)) so groups don’t straddle spatial partitions.
  2. In PyArrow, enforce row_group_size (a row count) and disable dictionary encoding for float-based geometry columns during write operations:
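import pyarrow.parquet as pq

# Load the source table (same input file as the baseline step).
table = pq.read_table("input_archive.parquet")

# Disable dictionary encoding for geometry/CRS float columns; keep it elsewhere.
dict_map = {
    col: ("geometry" not in col and "crs" not in col)
    for col in table.column_names
}
# use_dictionary takes a bool or a list of column names, not a mapping,
# so pass only the columns that should stay dictionary-encoded.
dict_columns = [col for col, enabled in dict_map.items() if enabled]

pq.write_table(
    table,
    "tuned_archive.parquet",
    row_group_size=1_000_000,  # rows per group, tuned toward a ~256 MB target
    compression="zstd",
    compression_level=9,       # 1-22; applied to all zstd columns
    use_dictionary=dict_columns,
    write_statistics=True
)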
  3. Verify alignment using pyarrow.parquet.read_metadata(). Ensure no row group spans multiple spatial partitions. Misaligned groups trigger dictionary cache misses during cold retrieval.
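
The capped row count from the first item can be worked through numerically. This is a minimal sketch: the referenced block-size formula is not reproduced here, so the code assumes the simplest form, R_opt = target bytes divided by average uncompressed bytes per row. The helper name and the 320-byte row width are hypothetical.

```python
def capped_row_group_rows(target_bytes: int, avg_row_bytes: float,
                          cap: int = 1_000_000) -> int:
    """Rows per group: R_FINAL = min(R_opt, cap).

    R_opt = target_bytes / avg_row_bytes is an assumed form,
    not the formula from the referenced article.
    """
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    r_opt = int(target_bytes // avg_row_bytes)
    # Never return fewer than one row per group.
    return max(1, min(r_opt, cap))


# ~256 MB target with a hypothetical 320-byte average row.
rows = capped_row_group_rows(256 * 1024**2, 320)
print(rows)
```

An average row width can be estimated from existing metadata by dividing a row group's total_byte_size by its num_rows.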

Step 2: Configure ZSTD Parameters for Geometry Columns

Coordinate arrays exhibit high sequential redundancy but low cross-column correlation. The single most important lever PyArrow exposes is compression_level:

  • compression_level = 9 is the practical sweet spot for cold archival; levels above 11 yield <2% additional ratio for a 40%+ write-CPU penalty.
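
Because write_table accepts compression_level as either a single integer or a per-column mapping, geometry columns can be given a different level from attribute columns. A minimal sketch, assuming geometry columns are identifiable by the substring "geometry" in their names. The helper name and the attribute level of 3 are illustrative choices, not recommendations from this document.

```python
def build_level_map(column_names, geometry_level=9, attribute_level=3):
    """Map each column to a zstd level, giving geometry columns their own level."""
    return {
        name: geometry_level if "geometry" in name else attribute_level
        for name in column_names
    }


# Hypothetical column names for illustration.
levels = build_level_map(["geometry", "parcel_id", "land_use"])
print(levels)

# The mapping would then be handed to the writer, for example:
# pq.write_table(table, "tuned_archive.parquet",
#                compression="zstd", compression_level=levels)
```

Keeping the mapping in a plain function makes the level assignment easy to log and to reproduce across dataset versions.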

PyArrow does not expose ZSTD’s advanced frame parameters (window log, chain log, hash log, minimum match). If a workload genuinely needs them, compress the raw column buffers with the zstd CLI outside the Parquet writer, where they are configurable:

# Advanced ZSTD frame tuning lives in the zstd CLI, not PyArrow.
zstd --ultra -22 --long=27 -c coords.bin > coords.bin.zst

Validation & Decompression Benchmarking

Validate compression efficacy and cold-read latency using deterministic benchmarks. Do not rely on file size alone; measure actual I/O throughput.

import time
import pyarrow.parquet as pq

# 1. Verify compression ratios for geometry columns
meta = pq.read_metadata("tuned_archive.parquet")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        if "geometry" in col.path_in_schema:
            ratio = col.total_uncompressed_size / max(col.total_compressed_size, 1)
            assert ratio >= 3.5, f"Row group {i} geometry ratio {ratio:.2f} < 3.5:1"

# 2. Measure decompression latency, normalized to a 100 MB row group
pf = pq.ParquetFile("tuned_archive.parquet")
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    start = time.perf_counter()
    # Force full decompression by reading the row group into memory
    table = pf.read_row_group(i)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # RowGroupMetaData exposes column(j) and num_columns, not a .columns list
    rg_size_mb = sum(
        rg.column(j).total_compressed_size for j in range(rg.num_columns)
    ) / (1024**2)
    # Scale the measurement so groups of different sizes share one threshold
    ms_per_100mb = elapsed_ms / max(rg_size_mb, 1e-9) * 100
    print(f"RG {i} ({rg_size_mb:.1f} MB) decompressed in {elapsed_ms:.1f} ms "
          f"({ms_per_100mb:.1f} ms per 100 MB)")
    assert ms_per_100mb <= 120, f"RG {i}: {ms_per_100mb:.1f} ms per 100 MB exceeds 120 ms threshold"
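
A single timed read is sensitive to page-cache state and scheduler noise, so one common refinement is to repeat the measurement and report the median. The sketch below is generic Python rather than a PyArrow facility, and the helper name is hypothetical. Repeats after the first will typically hit the OS page cache, so they measure decode cost rather than true cold-read latency.

```python
import statistics
import time


def median_latency_ms(fn, repeats=5):
    """Call fn() `repeats` times and return the median wall-clock time in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Stand-in workload; with PyArrow this could be lambda: pf.read_row_group(0).
latency = median_latency_ms(lambda: sum(range(100_000)), repeats=5)
print(f"median latency: {latency:.2f} ms")
```

For a genuine cold-read figure, the first sample of a fresh process on an uncached file is the more relevant number than the median.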

Root-Cause Analysis & Operational Troubleshooting

| Symptom | Root Cause | Corrective Action |
|---|---|---|
| Compression ratio < 2.8:1 on geometry | compression_level too low; row groups split contiguous spatial features | Raise compression_level toward 11; recompute the row count so groups don’t straddle spatial partitions |
| Decompression latency > 180 ms | Oversized row groups force whole-group decode | Reduce row_group_size; align groups to typical query extents (ZSTD decode speed is largely level-independent) |
| High memory OOM during read | Dictionary encoding forced on high-cardinality float columns | Set use_dictionary=False for geometry/CRS; confirm column encodings via the parquet/parquet-tools CLI |
| Inconsistent ratios across partitions | Mixed CRS or varying coordinate precision within same column | Normalize CRS to EPSG:4326 or EPSG:3857 pre-write; round coordinates to 6 decimal places |
| Dictionary cache misses on cold retrieval | Row group boundaries misaligned with spatial partition extents | Re-run parquet-tools meta or a PyArrow metadata scan; rewrite with row_group_size capped to the partition extent |
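
The coordinate-rounding corrective action above can be illustrated on plain (x, y) tuples. This is a sketch only: real GeoParquet geometry is typically WKB-encoded and would need a geometry library to rewrite, and the helper name and sample coordinates are hypothetical. For degree-based CRSs such as EPSG:4326, six decimal places corresponds to roughly 0.11 m at the equator.

```python
def round_coords(coords, ndigits=6):
    """Round every (x, y) pair to a fixed number of decimal places."""
    return [(round(x, ndigits), round(y, ndigits)) for x, y in coords]


# Hypothetical vertices with more precision than the data likely supports.
ring = [(-122.41941550123, 37.77492950456), (-122.4194000009, 37.7749000004)]
print(round_coords(ring))
```

Uniform precision removes low-order noise digits, which is what makes ratios more consistent across partitions; for projected CRSs in meters, a smaller number of decimal places is usually appropriate.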

Operational compliance requires maintaining audit logs of compression parameters per dataset version. Store the dict_map and the chosen compression_level alongside dataset manifests. When migrating archives to object storage, verify that lifecycle policies do not trigger re-encoding on read, which invalidates tuned ZSTD contexts.
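
One way to keep that audit trail is a small JSON record written next to the dataset manifest. A minimal sketch: the field names, the helper name, and the version label are hypothetical, since this document does not prescribe a manifest schema.

```python
import json
from datetime import datetime, timezone


def compression_audit_record(dataset_version, compression_level, dict_map):
    """Build a JSON-serializable record of the settings used for one write."""
    return {
        "dataset_version": dataset_version,
        "codec": "zstd",
        "compression_level": compression_level,
        # Sorted so that records diff cleanly between versions.
        "use_dictionary": dict(sorted(dict_map.items())),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = compression_audit_record(
    "v1",  # hypothetical version label
    9,
    {"geometry": False, "parcel_id": True},
)
print(json.dumps(record, indent=2))
```

Storing the full per-column mapping, rather than only the list of dictionary-encoded columns, keeps the record readable even when the schema changes between versions.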