When to Use Dictionary Encoding for Categorical GIS Fields

Categorical GIS attributes (LU_CODE, ADMIN_LEVEL, sensor model IDs, compliance status flags) routinely dominate metadata overhead in spatial archival pipelines. Applying dictionary encoding replaces repeated string literals with compact integer pointers, reducing I/O footprint and accelerating equality-based filtering. However, indiscriminate application degrades compression ratios, inflates dictionary pages, and introduces serialization bottlenecks during cold storage tiering. This protocol defines exact operational thresholds, implementation commands, and validation routines for enterprise spatial archival workflows.

Decision at a Glance

Encode only when cardinality and null rate both stay within budget; otherwise fall back to plain encoding:

flowchart TD
  A["Profile column"] --> B{"≤ 500 unique values?"}
  B -->|"No"| P["Disable dictionary (PLAIN)"]
  B -->|"Yes"| C{"Null rate under 40%?"}
  C -->|"No"| P
  C -->|"Yes"| D["Enable dictionary encoding"]
  D --> E["Cap dictionary page at 1 MB"]

Operational Decision Matrix

Deploy dictionary encoding only when all five quantitative conditions are met simultaneously. Fields failing any criterion must bypass encoding and route directly to page-level compression.

Threshold Operational Limit Failure Root-Cause Fallback Strategy
Cardinality-to-Volume ≤ 500 unique values per 10M rows Dictionary page exceeds 1MB, negating columnar compression gains Plain encoding + ZSTD level 3
Average String Length ≥ 4 characters Short codes (Y/N, 0/1) compress more efficiently via RLE or bit-packing Direct ZSTD or RLE
Repetition Density ≥ 60% of non-null values repeat ≥3× per row group Sparse dictionaries waste 4-byte pointer overhead per row Materialize numeric surrogate keys
Query Access Pattern =, IN, GROUP BY, dimension joins Range operators (>, <, BETWEEN) force full dictionary decode at query time Leave unencoded; index separately
Format Compatibility Parquet, GeoParquet v1.0+, Zarr v3 Shapefile/GeoJSON lack native dictionary page support Pre-serialize to Parquet before archival

Step-by-Step Implementation Protocol

1. Profile Source Data

Execute a cardinality and repetition scan before writer configuration. Record null rates, string length distribution, and per-partition repetition frequency.

# DuckDB CLI cardinality & repetition scan
duckdb -c "
SELECT
  land_use_code,
  COUNT(*) AS freq,
  AVG(LENGTH(land_use_code)) AS avg_len
FROM spatial_archive
GROUP BY land_use_code
ORDER BY freq DESC;
"
# PyArrow cardinality validation
import pyarrow as pa
import pyarrow.compute as pc

counts = pc.value_counts(table.column("land_use_code"))
unique_count = len(counts)
null_pct = pc.sum(pc.is_null(table.column("land_use_code"))).as_py() / len(table) * 100

assert unique_count <= 500, "Cardinality exceeds dictionary threshold"
assert null_pct < 40, "High null rate triggers fallback encoding"

2. Configure Writer Parameters

Explicitly enable dictionary encoding for target columns. Disable fallback to plain encoding unless null rates exceed 40%. Align dictionary page limits to prevent fragmentation.

import pyarrow.parquet as pq

pq.write_table(
    table,
    "archive.parquet",
    use_dictionary=["land_use_code", "admin_jurisdiction"],
    dictionary_pagesize_limit=1048576,  # 1 MB hard cap on the dictionary page
    data_page_size=1048576,
    write_statistics=True,
    compression="zstd"
)

For GDAL/OGR pipelines, set creation options (the Parquet driver enables dictionary encoding by default):

ogr2ogr -f "Parquet" output.parquet input.shp \
  -lco COMPRESSION=ZSTD \
  -lco ROW_GROUP_SIZE=1000000

3. Align Row Group Boundaries

Dictionary efficiency decays when row groups span heterogeneous spatial partitions. Coalesce data into uniform 128–256MB row groups before encoding. Mismatched boundaries cause dictionary duplication across groups, inflating cold storage retrieval costs.

# Cluster rows by the categorical key before write so each row group
# holds contiguous values, which keeps dictionaries compact.
table = table.sort_by("admin_level")

Validation & Diagnostics

Verify encoding application and measure storage impact before promoting to archival tiers.

  1. Inspect Metadata: Confirm dictionary pages exist and remain under the 1MB threshold.
parquet-tools meta archive.parquet | grep -A5 "Dictionary"
  1. Validate Pointer Compression: Ensure integer pointers replaced >85% of original string bytes.
meta = pq.read_metadata("archive.parquet")
for row_group in range(meta.num_row_groups):
    col = meta.row_group(row_group).column(0)
    # The dictionary page (when present) sits between its own offset and the
    # first data page, so its size is the difference of the two offsets.
    dpo = col.dictionary_page_offset
    dict_size = (col.data_page_offset - dpo) if dpo is not None else 0
    data_size = col.total_compressed_size
    print(f"Dict overhead: {dict_size / data_size:.2%}")
  1. Cold Storage Tiering Check: Verify dictionary pages fit within single S3 GET requests (<5MB). Fragmented dictionaries across multiple parts trigger multi-request latency penalties.

Root-Cause Troubleshooting

Symptom Diagnostic Command Root Cause Remediation
Storage bloat post-encoding parquet-tools meta → check dictionary page sizes High cardinality leak or string length variance > 32 chars Switch to PLAIN encoding + ZSTD level 4
Query latency spikes on range filters EXPLAIN ANALYZE in DuckDB/Trino Range operators force full dictionary decode at runtime Materialize numeric sort keys; keep strings unencoded
Cold retrieval timeouts aws s3api get-object latency profiling Dictionary pages split across multiple row groups Increase row_group_size; coalesce partitions pre-write
Serialization OOM during write Monitor pyarrow memory profile dictionary_pagesize_limit unset; unbounded dictionary growth Cap with dictionary_pagesize_limit=1048576; disable dictionary (use_dictionary=False) for the offending column

Fields that consistently fail validation should bypass dictionary encoding entirely. High-cardinality identifiers, UUIDs, free-text compliance notes, and already-numeric classification codes are better served by direct ZSTD compression. For broader pipeline tuning, reference Compression Tuning & Storage Optimization guidelines on page-size alignment and tiered lifecycle policies.

When implementing at scale, ensure row group boundaries align with spatial partitioning strategies to prevent dictionary duplication across cold storage objects. Validate all encoded outputs against the Apache Parquet Dictionary Encoding Specification to guarantee cross-engine compatibility. For GDAL-based workflows, consult the GDAL Parquet Driver Documentation for driver-specific creation option overrides.