ZSTD Level Configuration for Spatial Files

Zstandard (ZSTD) compression in a geospatial archive is a calibrated control surface, not a static toggle, and the most common way teams waste money on it is by treating one level as correct for every dataset. Pin the level too high on hot, frequently rewritten data and write-time CPU dominates the ingest budget; pin it too low on a multi-year legal-hold archive and you pay to store and move bytes that a deep-archive level would cut by 50–70%. This page is for data engineers, GIS archivists, cloud architects, and compliance teams who already have columnar storage in place and now need to choose an exact ZSTD level per access tier, wire it into the writer, and prove the result is lossless before promoting it to cold storage. It maps levels to spatial workloads, gives production writer configurations for PyArrow, Spark, and DuckDB, and sets the validation gates that catch precision drift before it reaches an immutable bucket.

When a Single ZSTD Level Costs You Money

The failure mode this topic solves is uniform compression policy — applying one ZSTD level across an archive whose files have wildly different access patterns and entropy profiles. Three concrete symptoms follow from it:

CPU-bound ingest on hot data. Coordinate streams, change-data-capture feeds, and ephemeral staging tables are rewritten constantly. Compressing them at level 15 burns 3–5x the write CPU of level 3 for a storage saving that evaporates within hours when the file is overwritten. The compression bill shows up as throttled ingest workers, not as a line item.
Over-paying for storage on cold data. The inverse failure is leaving deep-archive geometry at level 3 because that was the pipeline default. A decade of LiDAR tiles or cadastral snapshots sitting at baseline compression carries bytes that levels 13–19 would cut by 50–70%, and on Glacier-class tiers that excess is billed every month for the life of the retention mandate.
Surprise retrieval latency that is actually a row-group problem. Teams blame “high compression” for slow cold reads, but ZSTD decompression speed is essentially level-independent — a level-19 archive decompresses at roughly the same throughput as a level-3 archive for the same byte volume. When retrieval is slow, the cause is almost always oversized row groups forcing whole-group decompression, not the level itself.

Choosing the level per access tier — rather than per pipeline — is what turns each of these from a recurring cost into a one-time configuration decision.

What You Need in Place First

ZSTD level selection is a downstream tuning knob; it only behaves predictably once the layout above it is settled. Confirm the following before touching a level parameter:

A columnar archive format. ZSTD here applies per column chunk inside GeoParquet (or a Parquet-backed Iceberg/Delta table), not to a whole opaque blob. If geometry is still in Shapefile or GeoPackage, run the GeoParquet Migration Workflows pipeline first so geometry encoding and CRS metadata land correctly — compressing un-migrated data just locks in a bad layout at a smaller size.
Access tiers defined. You must already know which datasets are hot, nearline, or deep-archive, because the level follows the tier. That classification comes from your storage-class design; settle it with Object Storage Selection for GIS Archives before mapping levels.
A row-group baseline. Level and row-group size interact: Parquet compresses each column chunk page by page, and the row-group size bounds how much data each chunk holds, so the group size you pick changes the ratio a given level delivers. Lock that with Row Group Sizing Strategies so you are tuning one variable at a time.

This page sits inside the broader Compression Tuning & Storage Optimization methodology; treat that reference as the parent decision spine and this page as the ZSTD-specific layer of it.

Choosing a Level by Access Pattern

Match the ZSTD level to how often the data is read and rewritten, not to a single house default.

ZSTD operates across levels 1–22. Each increment applies more aggressive match-finding, longer hash chains, and deeper entropy coding. For spatial files — coordinate arrays, Well-Known Binary (WKB) geometry, topology graphs, and attribute tables — the optimal level is dictated by data entropy, access frequency, and the compute window available at write time. Because decompression throughput is roughly constant across levels, the only real trade is write CPU now against bytes stored for years, which is exactly why the access tier is the deciding input.

ZSTD Level	Operational Tier	CPU Overhead (Write)	Storage Reduction vs. Level 3	Recommended Spatial Workloads
1–3	Hot / Streaming	Minimal	Baseline	Real-time ingestion, CDC streams, ephemeral staging
4–7	Nearline / ETL	10–15%	+10–20%	Daily batch loads, intermediate Parquet/GeoJSON outputs
8–12	Balanced Cold	25–40%	+30–50%	Quarterly access, compliance snapshots, analytical cold tier
13–19	Deep Archive	3–5x baseline	+50–70%	Legal hold, multi-year retention, retrieval SLA >24h
20–22	Ultra / Max	5–8x baseline	+5–10% over 19	Maximum-ratio static archives; levels 20–22 require the `--ultra` flag

Two thresholds matter most in practice. Levels 11–12 are the sweet spot for the balanced cold tier: they capture most of the achievable ratio while keeping write CPU within a normal nightly compute window. Level 19 is the practical ceiling for deep archive: the jump to 20–22 adds only single-digit extra reduction while multiplying CPU, so reserve those levels (`--ultra` on the `zstd` command line) for genuinely static, write-once datasets where the compute window is fully decoupled from production. For levels 13–19, provision burstable compute or schedule off-peak extraction jobs so compression never throttles live analytics.
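
The tier table can be encoded as a small lookup so that no writer hard-codes a level. This is a minimal sketch, assuming hypothetical tier names (`hot`, `nearline`, `balanced_cold`, `deep_archive`) and one representative level picked from each range above; neither the names nor the helper come from an existing library.

```python
# Hypothetical tier names; each level is one representative pick from the
# ranges in the table above, not a value mandated by any library.
ZSTD_LEVEL_BY_TIER = {
    "hot": 3,             # range 1-3: real-time ingestion, CDC, staging
    "nearline": 6,        # range 4-7: daily batch loads
    "balanced_cold": 11,  # range 8-12: quarterly access, compliance snapshots
    "deep_archive": 19,   # range 13-19: legal hold, multi-year retention
}


def zstd_level_for_tier(tier: str) -> int:
    """Return the pinned ZSTD level for an access tier, failing loudly on unknown tiers."""
    try:
        return ZSTD_LEVEL_BY_TIER[tier]
    except KeyError:
        known = ", ".join(sorted(ZSTD_LEVEL_BY_TIER))
        raise ValueError(f"unknown access tier {tier!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(zstd_level_for_tier("balanced_cold"))  # 11
```

Raising on an unknown tier, rather than falling back to a default, keeps a misclassified dataset from silently landing at the wrong level.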

Production Configurations & Engine Integration

Compression must be pinned explicitly at the writer layer. Engine-level defaults frequently override implicit settings, producing inconsistent archival footprints across otherwise identical jobs. The configurations below target the balanced cold tier (level 11) and use realistic archive paths.

PyArrow / GeoParquet writer

import pyarrow.parquet as pq

# Balanced cold tier: level 11 captures most of the ratio
# while staying inside a nightly compute window.
pq.write_table(
    table,
    "lidar/2024/region_north_cold.parquet",
    compression="zstd",
    compression_level=11,
    use_dictionary=True,      # lower entropy on categorical attrs before ZSTD
    write_statistics=True,    # per-column min/max enables predicate pushdown
    row_group_size=1_000_000,  # PyArrow counts rows here, not bytes; pick a count that lands near 128-256 MB
)

Apache Spark SQL (DataFrame API)

df.write \
  .option("compression", "zstd") \
  .option("parquet.compression.codec.zstd.level", "11") \
  .option("parquet.enable.dictionary", "true") \
  .mode("overwrite") \
  .parquet("s3://gis-cold-storage/cadastral/snapshots/2024q2/")

DuckDB (CLI / Python)

COPY spatial_dataset
  TO 'imagery/scenes/2024/archive_cold.parquet'
  (FORMAT PARQUET, COMPRESSION ZSTD, COMPRESSION_LEVEL 11, ROW_GROUP_SIZE 1000000);

Align the compression boundary with your row-group layout: oversized row groups force full-group decompression during predicate pushdown and negate the cold-storage cost benefit, so target 128 MB–256 MB groups per the Row Group Sizing Strategies thresholds. Categorical GIS attributes — land-use codes, sensor IDs, jurisdiction codes — should be dictionary-encoded before ZSTD applies match-finding; see Dictionary Encoding for GIS Attributes for the schema-level patterns that make level 8–12 pay off. For GeoParquet-specific column tuning that aligns ZSTD with geometry encoding, the Tuning ZSTD Compression for GeoParquet Archives walkthrough takes these defaults to the column level.
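
Because PyArrow's `row_group_size` and DuckDB's `ROW_GROUP_SIZE` are expressed in rows, the 128–256 MB target has to be converted into a row count. The sketch below does that arithmetic under one assumption: that you can estimate the dataset's total size and row count up front (for a PyArrow table, `table.nbytes` and `table.num_rows`). The helper name is hypothetical, and in-memory size only approximates encoded size, so treat the result as a starting point to refine against written files.

```python
def rows_per_row_group(total_bytes: int, num_rows: int,
                       target_bytes: int = 128 * 1024 * 1024) -> int:
    """Convert a byte target for one row group into a row count."""
    if total_bytes <= 0 or num_rows <= 0:
        raise ValueError("total_bytes and num_rows must be positive")
    # Integer arithmetic avoids float rounding: rows = target / (bytes per row).
    rows = target_bytes * num_rows // total_bytes
    # Never return 0 rows, and never exceed the dataset itself.
    return max(1, min(num_rows, rows))


if __name__ == "__main__":
    # Hypothetical dataset: 40 million rows occupying 8 GiB in memory.
    print(rows_per_row_group(8 * 1024**3, 40_000_000))  # 625000
```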

Validation Gate

ZSTD is lossless at the byte level, so a correct round trip must reproduce the input exactly. What can still go wrong is the spatial writer truncating coordinate precision before ZSTD ever sees the bytes — that loss is invisible to a compression test and only surfaces when a downstream join misaligns. Validate both the byte round trip and the geometry before promoting a dataset to cold storage.
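
To see why writer-side truncation matters, the sketch below pushes one longitude through a 32-bit float and back using only the standard library. The coordinate is an arbitrary illustrative value, and the metres-per-degree constant is an equatorial approximation, so the printed distance is indicative rather than exact.

```python
import struct

METERS_PER_DEGREE = 111_320  # approximate length of one degree at the equator


def through_float32(value: float) -> float:
    """Round-trip a 64-bit Python float through a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def through_float64(value: float) -> float:
    """Round-trip through a 64-bit float; this must be exact."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


if __name__ == "__main__":
    lon = -122.4194155  # arbitrary illustrative longitude
    drift_deg = abs(through_float32(lon) - lon)
    print(f"float64 exact: {through_float64(lon) == lon}")
    print(f"float32 drift: {drift_deg:.2e} degrees, "
          f"~{drift_deg * METERS_PER_DEGREE:.2f} m at the equator")
```

Any drift of this kind happens before compression, which is why a byte-level ZSTD round trip cannot detect it.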

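Confirm the codec that actually landed in the file (do not trust the job config; confirm the artifact). The Parquet footer records the codec for each column chunk but not the ZSTD level, so confirm the level separately by comparing the written size against a level-3 write of the same data. To inspect the codec: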

parquet-tools inspect lidar/2024/region_north_cold.parquet | grep -i "compression\|codec"

Expected output (every column chunk reports the codec you pinned):

  geometry: ... compression: ZSTD ...
  attributes: ... compression: ZSTD ...
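
The same inspection can be scripted so that it runs as a pipeline gate. This is a sketch, assuming PyArrow is installed; `codec_report` and `columns_off_codec` are hypothetical helper names, and the sample report in the demo is hand-built rather than read from a real file.

```python
from collections import defaultdict


def codec_report(path):
    """Map each column path to the set of codecs found across all row groups."""
    import pyarrow.parquet as pq  # local import: only this function needs PyArrow

    metadata = pq.ParquetFile(path).metadata
    report = defaultdict(set)
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            report[chunk.path_in_schema].add(chunk.compression)
    return dict(report)


def columns_off_codec(report, expected="ZSTD"):
    """Return sorted column paths where any chunk used a codec other than `expected`."""
    return sorted(col for col, codecs in report.items() if codecs != {expected})


if __name__ == "__main__":
    # Hand-built report standing in for codec_report("<your file>.parquet").
    sample = {"geometry": {"UNCOMPRESSED"}, "land_use": {"ZSTD"}}
    print(columns_off_codec(sample))  # ['geometry']
```

A gate would fail the promotion step whenever `columns_off_codec(codec_report(path))` is non-empty.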

Then verify the geometry survived the write losslessly by comparing the bounding box and vertex count before and after:

import geopandas as gpd
import shapely

before = gpd.read_parquet("staging/region_north.parquet")
after  = gpd.read_parquet("lidar/2024/region_north_cold.parquet")

# Bounding box must match to 9 decimal places.
assert before.total_bounds.round(9).tolist() == after.total_bounds.round(9).tolist()

# Vertex counts must match exactly; get_num_coordinates (Shapely 2.x) handles every geometry type.
assert shapely.get_num_coordinates(before.geometry.to_numpy()).sum() \
    == shapely.get_num_coordinates(after.geometry.to_numpy()).sum()
print("round-trip OK")

Most common failure — compression: UNCOMPRESSED on the geometry column. The root cause is almost always an engine default overriding the writer option: older Spark/Parquet builds ignore a codec set at the session level unless it is set per-write, and some GDAL-based writers fall back to Snappy when compression_level is passed but compression is not. Fix it by setting both the codec and the level on the write call itself, as in the configurations above, then re-inspect. A secondary failure — a bounding-box delta in the assertion — is never a ZSTD fault; it means the writer truncated coordinate precision, so check the geometry-encoding precision setting, not the compression level.

Cost & Performance Trade-offs

The economics of level selection are a balance between one-time write CPU and recurring storage and egress charges. The table below quantifies the trade for a representative 1 TB (level-3 baseline) spatial archive.

ZSTD Level	Stored Size (from 1 TB)	Write CPU (relative)	Decompression Speed	Best When
3	1.00 TB	1x	Fast	Data rewritten within hours
7	~0.85 TB	~1.4x	Fast	Daily-touched ETL outputs
11	~0.62 TB	~3x	Fast	Quarterly-access cold tier
19	~0.45 TB	~6x	Fast	Multi-year retention, rare reads

The decisive insight is that decompression speed barely moves down the column, so retrieval latency is not a reason to avoid high levels — read cost is governed by row-group scope and partition pruning, not by the compression level. The real constraint on going higher is the write-time compute window. On deep-archive tiers, the recurring monthly storage saving from level 19 typically repays the one-time CPU cost within the first quarter and then accrues for the life of the retention mandate, which is why write-once legal-hold data justifies the highest levels your compute schedule can absorb. Pair these numbers with the storage-class pricing in Object Storage Selection for GIS Archives to convert ratio into actual dollars per tier.
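
The payback claim reduces to simple arithmetic: one-time extra CPU cost divided by the monthly storage saving. The sketch below uses the stored sizes from the table; the storage price and CPU cost are placeholder assumptions, not real provider rates, and the function name is hypothetical.

```python
def payback_months(baseline_tb: float, compressed_tb: float,
                   price_per_tb_month: float, extra_cpu_cost: float) -> float:
    """Months until recurring storage savings cover the one-time extra write CPU cost."""
    monthly_saving = (baseline_tb - compressed_tb) * price_per_tb_month
    if monthly_saving <= 0:
        raise ValueError("compressed size must be smaller than the baseline")
    return extra_cpu_cost / monthly_saving


if __name__ == "__main__":
    # Sizes come from the table above (level 3 vs. level 19).
    # Both prices are placeholders; substitute your provider's rates.
    months = payback_months(baseline_tb=1.00, compressed_tb=0.45,
                            price_per_tb_month=4.00, extra_cpu_cost=5.00)
    print(f"payback in {months:.1f} months")
```

Running it with real per-tier prices shows quickly whether a given level repays itself inside the retention period.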

Failure Modes & Edge Cases

Four pitfalls account for most ZSTD problems in geospatial archives:

High level on incompressible data. Already-compressed payloads — JPEG-in-TIFF imagery, pre-compressed point clouds, encrypted columns — have near-maximum entropy, so level 19 spends 6x the CPU to shave a percent or two. Detect these columns and drop them to level 3; raising the level on high-entropy data is pure CPU waste.
Level set in config but lost at the artifact. A codec pinned at session or cluster scope is silently ignored by some writers, so the file lands at the engine default. This is the single most common surprise and is exactly why the validation gate above inspects the written file rather than trusting the job config.
Row-group/level mismatch inflating reads. A high level on an oversized row group means every predicate-pushdown read decompresses a huge group to return a small answer, erasing the cold-storage saving. Keep the group at 128–256 MB so decompression scope stays bounded to the queried extent.
Compression mistaken for an immutability control. A smaller file is still mutable and deletable. ZSTD level has no bearing on retention guarantees, so deep-archive data must be paired with WORM lifecycle controls (S3 Object Lock, equivalent bucket-lock policies) under a defined Retention Policy Frameworks design — the level decision and the retention decision are independent and both mandatory.
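
For the first pitfall, one way to detect incompressible columns is a fast trial compression of a sample before committing to a high level. The sketch below uses `zlib` from the standard library purely as a stand-in for ZSTD so that it runs anywhere; swap in your ZSTD binding for production use. The 0.95 ratio threshold and the helper names are assumptions to tune, not established values.

```python
import random
import zlib


def sample_ratio(sample: bytes) -> float:
    """Compressed size divided by original size for a fast trial compression."""
    if not sample:
        raise ValueError("sample must not be empty")
    return len(zlib.compress(sample, 1)) / len(sample)


def choose_level(sample: bytes, archive_level: int = 19,
                 fallback_level: int = 3, threshold: float = 0.95) -> int:
    """Use the archive level only when the sample is meaningfully compressible."""
    return fallback_level if sample_ratio(sample) >= threshold else archive_level


if __name__ == "__main__":
    rng = random.Random(0)
    # Random bytes stand in for encrypted or already-compressed payloads.
    noisy = bytes(rng.getrandbits(8) for _ in range(65_536))
    # Repeated text stands in for compressible geometry or attribute data.
    repetitive = b"POLYGON((0 0,1 0,1 1,0 1,0 0))" * 2_000
    print(choose_level(noisy), choose_level(repetitive))  # 3 19
```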

When deep-archive media is eventually rotated out, follow the NIST media-sanitization guidance referenced in your retention design rather than assuming compression obscures residual data.

Operational Execution Checklist

Related

Compression Tuning & Storage Optimization — the parent methodology this ZSTD decision sits inside; start here for the full optimization spine.
Row Group Sizing Strategies — set the group size that determines the ratio a given ZSTD level delivers.
Dictionary Encoding for GIS Attributes — lower categorical entropy so levels 8–12 actually pay off.
Tuning ZSTD Compression for GeoParquet Archives — column-level GeoParquet walkthrough that extends these defaults.
Object Storage Selection for GIS Archives — convert compression ratio into per-tier storage cost across providers.

Validate algorithmic parameter limits against the Zstandard compression manual, GeoParquet compliance against the OGC GeoParquet specification, and immutability configuration against the AWS S3 Object Lock documentation.
