Dictionary Encoding for GIS Attributes

Geospatial pipelines routinely ingest high-frequency categorical metadata — jurisdiction codes, land-use classifications, sensor identifiers, and regulatory compliance flags. Stored as raw UTF-8 strings across tens of millions of features, these attributes inflate storage footprints, throttle cold-tier I/O throughput, and complicate archival retention. Dictionary encoding resolves this by mapping each unique string value to a compact integer index, materializing the mapping once per column segment and referencing it repeatedly. Within the Compression Tuning & Storage Optimization architecture, it is the foundational entropy-reduction layer for columnar spatial formats — particularly GeoParquet and Arrow-backed pipelines — and the cheapest single change that improves cold-storage density before any page-level codec runs.

The Failure Mode: String Bloat and Silent Fallback

Two distinct failures motivate this technique. The first is obvious: a land_use_code column carrying 40 distinct CORINE values repeated across 50 million parcels stores roughly the same handful of strings 50 million times. Plain UTF-8 storage burns bytes proportional to row count and string length, and the resulting high-entropy pages resist even aggressive Zstandard match-finding. The second failure is subtler and more dangerous in archival workloads: applying dictionary encoding indiscriminately. When a writer dictionary-encodes a high-cardinality field — unique parcel IDs, free-text survey remarks, microsecond timestamps — the dictionary page grows until the engine hits its dictionary size ceiling, silently falls back to plain encoding mid-row-group, and leaves you paying for a populated dictionary page that buys nothing while inflating decode latency on read.

The cost surfaces during cold-to-hot promotion. A dataset that dictionary-encoded cleanly at write time can fall back inconsistently across row groups if cardinality drifts between partitions, producing mixed-encoding files that defeat predicate pushdown and balloon egress when an analyst scans them. The objective of this page is a deterministic, audited encoding decision per column — never an engine default left to chance.

Prerequisite Context

This technique assumes you have already converted source data out of Shapefile or GeoPackage and into a columnar format; if you have not, run the GeoParquet Migration Workflows pipeline first, because dictionary encoding is a column-chunk property that only exists in Parquet-family layouts. You should also have a working object-storage tiering scheme — the Hot/Warm/Cold Tier Design for Geospatial Data model — so that encoding decisions can be matched to retrieval frequency. Dictionary encoding sits one level below the parent Compression Tuning & Storage Optimization strategy: it reduces entropy so that downstream ZSTD Level Configuration for Spatial Files has low-entropy integer arrays to compress rather than raw text. Finally, schema stability matters — if attribute taxonomies are still churning, stabilize them through Schema Mapping & Attribute Validation before committing dictionary mappings to immutable cold storage.

When Dictionary Encoding Pays Off

Dictionary encoding helps low-cardinality categorical fields but backfires as cardinality or null rates climb:

Concept & Design Decisions

Dictionary encoding operates at the column-chunk level in Parquet-family formats. Every row group holds an independent dictionary page per encoded column, so encoding efficiency is tightly coupled to how you partition and size chunks — a point that directly couples this decision to Row Group Sizing Strategies. The encoded values resolve to the RLE_DICTIONARY encoding in the column-chunk metadata, where dictionary indices are run-length and bit-pack encoded; this is why a near-constant admin_level column collapses to a few bytes per page.

The governing parameter is cardinality ratio — distinct values divided by the row count inside a single row group. Use these thresholds as defaults and tune against measured page sizes:

Ratio below ~1% (e.g. CRS EPSG codes, land-use classes, sensor model IDs): always dictionary encode. Indices are tiny and equality predicates (WHERE land_use_code = 'A21') become integer comparisons.
Ratio 1–15% (e.g. municipality names, acquisition platform tags): dictionary encode only when the field is queried with equality or IN filters; otherwise the dictionary page overhead may exceed the savings.
Ratio above ~15% (e.g. parcel IDs, vertex-level timestamps, free text): bypass dictionary encoding and let the page codec handle it. Forcing a dictionary here triggers fallback and wastes a populated page.

Null handling is a second axis. Parquet stores nulls in a separate definition-level stream, so a column that is 95% null carries almost no dictionary indices regardless of the non-null cardinality — but if the non-null subset is itself high-cardinality, the small populated portion still bloats the dictionary. Score both null_rate and non-null cardinality, not cardinality alone.

The final design choice is segment alignment. Because dictionaries are per row group, align row-group boundaries with natural spatial or temporal partitions so that categorical values stay homogeneous within a segment. Partitioning by administrative boundary or acquisition date concentrates identical jurisdiction codes and sensor IDs into contiguous blocks, shrinking the per-segment dictionary and improving partition pruning. Misalignment forces the same value into many adjacent dictionaries, multiplying overhead — the inverse of the gain you are chasing.

Implementation

Pin dictionary encoding explicitly at the writer layer for every engine in the pipeline. Engine defaults differ and silently override implicit settings, producing inconsistent archival footprints across jobs.

PyArrow / GeoParquet writer — enable globally, or pass an explicit column list so high-cardinality fields are never accidentally encoded:

import pyarrow as pa
import pyarrow.parquet as pq

# Cast low-cardinality columns to dictionary type so encoding is deterministic,
# not left to the writer's heuristic dictionary-size cutoff.
table = table.set_column(
    table.schema.get_field_index("land_use_code"),
    "land_use_code",
    table.column("land_use_code").dictionary_encode(),
)

pq.write_table(
    table,
    "s3://gis-cold-archive/parcels/2024/region_north.parquet",
    compression="zstd",
    compression_level=11,
    use_dictionary=["land_use_code", "admin_level", "sensor_id"],  # explicit allow-list
    dictionary_pagesize_limit=1 << 20,   # 1 MiB cap; forces clean fallback, not silent bloat
    row_group_size=200_000,              # align with spatial partition granularity
    write_statistics=True,
)

GDAL / OGR Parquet driver — the driver enables dictionary encoding by default; pin row-group size and codec so the archival footprint is reproducible:

ogr2ogr \
  -f Parquet \
  /vsis3/gis-cold-archive/parcels/2024/region_north.parquet \
  datasets/vector/parcels_region_north.gpkg \
  -lco COMPRESSION=ZSTD \
  -lco COMPRESSION_LEVEL=11 \
  -lco ROW_GROUP_SIZE=200000

Apache Spark / Delta Lake — disable the dictionary fallback so a cardinality breach raises rather than silently degrading to plain encoding; this is the configuration that makes compliance audits meaningful:

spark.conf.set("spark.sql.parquet.dictionary.enabled", "true")
# Force explicit handling of cardinality breaches instead of silent plain-encoding:
spark.conf.set("spark.sql.parquet.dictionary.fallback.enabled", "false")
# Keep Arrow vectorized serialization so dictionary structures survive the write path:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

(df.repartition("admin_boundary")              # cluster identical jurisdiction codes
   .write
   .option("compression", "zstd")
   .option("parquet.compression.codec.zstd.level", "11")
   .mode("overwrite")
   .parquet("s3://gis-cold-archive/parcels/2024/"))

Polars / Dask — pre-cast categorical columns so serialization is dictionary-aware and avoids runtime type coercion:

import polars as pl

df = pl.scan_parquet("datasets/vector/parcels_staging/*.parquet").with_columns(
    pl.col("land_use_code").cast(pl.Categorical),
    pl.col("sensor_id").cast(pl.Categorical),
)
df.sink_parquet(
    "datasets/vector/parcels_encoded.parquet",
    compression="zstd",
    compression_level=11,
    row_group_size=200_000,
)

Validation Gate

Never trust the writer’s intent — verify the encoding actually landed in the file metadata before promoting to cold storage. Inspect the column-chunk encodings and confirm they resolve to RLE_DICTIONARY:

parquet-tools inspect \
  /vsis3/gis-cold-archive/parcels/2024/region_north.parquet \
  | grep -A2 -E 'land_use_code|admin_level|sensor_id'

Expected output for a correctly encoded column:

column: land_use_code
  encodings: RLE_DICTIONARY, PLAIN, RLE
  compression: ZSTD

A scripted gate over PyArrow metadata makes this enforceable in CI:

import pyarrow.parquet as pq

meta = pq.read_metadata("s3://gis-cold-archive/parcels/2024/region_north.parquet")
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(
        meta.schema.names.index("land_use_code"))
    assert "RLE_DICTIONARY" in col.encodings, \
        f"row group {rg}: dictionary fallback occurred — cardinality breach"
print("all row groups dictionary-encoded")

Most common failure: the column reports only PLAIN encoding despite use_dictionary being set. Root cause is almost always a cardinality breach — the distinct-value count exceeded the writer’s dictionary_pagesize_limit inside that row group, so it fell back to plain encoding for the remainder. Confirm by running a distinct count per row group; if the ratio exceeds ~15%, the field does not belong in the dictionary allow-list. The second most common cause is row groups sized far larger than the spatial partition, which sweeps heterogeneous categorical values into one segment and inflates the distinct count past the limit — re-align row-group size to the partition before blaming the data.

Cost & Performance Trade-offs

The decision is a three-way trade between storage footprint, write CPU, and decode latency. Indicative figures for a 50-million-row vector archive, ZSTD level 11, 200k-row groups:

Field profile	Cardinality ratio	Encoding choice	Storage vs plain	Decode latency / 10k rows	Notes
`land_use_code` (40 values)	<0.01%	Dictionary	-70 to -85%	<40 ms	Ideal; fast equality scans
`admin_level` (12 values)	<0.01%	Dictionary	-75%	<40 ms	Near-constant per partition
`municipality` (8k values)	~3%	Dictionary (if filtered)	-30 to -45%	~80 ms	Worth it only with equality/IN predicates
`acquisition_ts` (per-second)	~12%	Dictionary, conditional	-5 to -15%	~150 ms	Marginal; consider delta encoding instead
`parcel_id` (unique)	100%	Plain + ZSTD	0% (or worse)	>200 ms if forced	Never dictionary; triggers fallback
`survey_remarks` (free text)	~90%	Plain + ZSTD	negative if forced	>250 ms if forced	Dictionary page bloat, no gain

The asymmetry to internalize: dictionary encoding’s write-CPU cost is modest (a hash-table build per segment), but a wrong encoding decision on a high-cardinality column costs on both ends — a wasted dictionary page at write and elevated decode latency at every read for the archive’s entire retention life. In cold tiers where data may sit for years and be scanned rarely, that read penalty is paid precisely when egress is most expensive.

Failure Modes & Edge Cases

Cross-segment dictionary duplication. Oversized row groups dilute encoding efficiency; undersized groups multiply the per-segment dictionary. Both inflate cold-tier CPU during retrieval and raise egress cost. Align boundaries to spatial/temporal partitions and verify with a metadata scan before promotion — this is the single most impactful tuning lever and is shared with Row Group Sizing Strategies.
Taxonomy revision breaks forward joins. When a land-use or jurisdiction taxonomy is revised (codes deprecated, merged, or renumbered), a new dictionary mapping is materialized but historical archives keep the old one. Without a versioned mapping registry, joins across vintages silently mismatch. Maintain an immutable, versioned dictionary registry alongside spatial extents and enforce schema-evolution checks.
Corrupted dictionary page invalidates the whole row group. A dictionary page is the decode key for every index in its segment; a single corrupted page during a tier transition zeroes out the entire row group and surfaces as a data-loss event on cold-to-hot promotion. Write CRC32 or xxHash checksums on dictionary pages at archival time and verify on retrieval. This keeps categorical provenance auditable across multi-year lifecycles, satisfying INSPIRE and FGDC metadata retention expectations.
High null rate masking high cardinality. A column that is 95% null can pass a naive cardinality gate (few non-null distinct values relative to total rows) while its small populated subset is itself high-cardinality, bloating the dictionary on the rows that exist. Score null_rate and non-null distinct count separately.

Operational Execution Checklist

Profile each categorical column for distinct-value ratio and null rate per row group, not over the whole file.
Build an explicit dictionary allow-list; never rely on the writer’s implicit heuristic for archival data.
Cap dictionary_pagesize_limit so cardinality breaches fall back cleanly instead of silently mid-segment.
Align row-group boundaries with spatial/temporal partitions to keep categorical values homogeneous.
Assert RLE_DICTIONARY in column-chunk metadata in CI before any cold-tier promotion.
Pair encoding with an appropriate ZSTD level (3–5 for active tiers, 10+ for static archive).
Checksum dictionary pages on write and verify on retrieval across tier transitions.
Version every dictionary mapping in an immutable registry alongside spatial extents.

When to Use Dictionary Encoding for Categorical GIS Fields — the precise cardinality thresholds, field-selection criteria, and fallback mitigation behind the decision above.
ZSTD Level Configuration for Spatial Files — the page-level codec that compresses the low-entropy integer arrays dictionary encoding produces.
Row Group Sizing Strategies — segment sizing governs per-row-group dictionary overhead and partition pruning.
GeoParquet Migration Workflows — the cross-pillar conversion step that produces the columnar files dictionary encoding operates on.
Up one level: Compression Tuning & Storage Optimization — the parent strategy coordinating encoding, compression, partitioning, and indexing.

Dictionary Encoding for GIS Attributes

The Failure Mode: String Bloat and Silent Fallback #

Prerequisite Context #

When Dictionary Encoding Pays Off #

Concept & Design Decisions #

Implementation #

Validation Gate #

Cost & Performance Trade-offs #

Failure Modes & Edge Cases #

Operational Execution Checklist #

Related #

Explore this section

Related pages