When to Use Dictionary Encoding for Categorical GIS Fields
Categorical GIS attributes (LU_CODE, ADMIN_LEVEL, sensor model IDs, compliance status flags) routinely dominate metadata overhead in spatial archival pipelines. Applying dictionary encoding replaces repeated string literals with compact integer pointers, reducing I/O footprint and accelerating equality-based filtering. However, indiscriminate application degrades compression ratios, inflates dictionary pages, and introduces serialization bottlenecks during cold storage tiering. This protocol defines exact operational thresholds, implementation commands, and validation routines for enterprise spatial archival workflows.
Decision at a Glance
Encode only when cardinality and null rate both stay within budget; otherwise fall back to plain encoding:
flowchart TD
A["Profile column"] --> B{"≤ 500 unique values?"}
B -->|"No"| P["Disable dictionary (PLAIN)"]
B -->|"Yes"| C{"Null rate under 40%?"}
C -->|"No"| P
C -->|"Yes"| D["Enable dictionary encoding"]
D --> E["Cap dictionary page at 1 MB"]
Operational Decision Matrix
Deploy dictionary encoding only when all five quantitative conditions are met simultaneously. Fields failing any criterion must bypass encoding and route directly to page-level compression.
| Threshold | Operational Limit | Failure Root-Cause | Fallback Strategy |
|---|---|---|---|
| Cardinality-to-Volume | ≤ 500 unique values per 10M rows | Dictionary page exceeds 1MB, negating columnar compression gains | Plain encoding + ZSTD level 3 |
| Average String Length | ≥ 4 characters | Short codes (Y/N, 0/1) compress more efficiently via RLE or bit-packing |
Direct ZSTD or RLE |
| Repetition Density | ≥ 60% of non-null values repeat ≥3× per row group | Sparse dictionaries waste 4-byte pointer overhead per row | Materialize numeric surrogate keys |
| Query Access Pattern | =, IN, GROUP BY, dimension joins |
Range operators (>, <, BETWEEN) force full dictionary decode at query time |
Leave unencoded; index separately |
| Format Compatibility | Parquet, GeoParquet v1.0+, Zarr v3 | Shapefile/GeoJSON lack native dictionary page support | Pre-serialize to Parquet before archival |
Step-by-Step Implementation Protocol
1. Profile Source Data
Execute a cardinality and repetition scan before writer configuration. Record null rates, string length distribution, and per-partition repetition frequency.
# DuckDB CLI cardinality & repetition scan
duckdb -c "
SELECT
land_use_code,
COUNT(*) AS freq,
AVG(LENGTH(land_use_code)) AS avg_len
FROM spatial_archive
GROUP BY land_use_code
ORDER BY freq DESC;
"
# PyArrow cardinality validation
import pyarrow as pa
import pyarrow.compute as pc
counts = pc.value_counts(table.column("land_use_code"))
unique_count = len(counts)
null_pct = pc.sum(pc.is_null(table.column("land_use_code"))).as_py() / len(table) * 100
assert unique_count <= 500, "Cardinality exceeds dictionary threshold"
assert null_pct < 40, "High null rate triggers fallback encoding"
2. Configure Writer Parameters
Explicitly enable dictionary encoding for target columns. Disable fallback to plain encoding unless null rates exceed 40%. Align dictionary page limits to prevent fragmentation.
import pyarrow.parquet as pq
pq.write_table(
table,
"archive.parquet",
use_dictionary=["land_use_code", "admin_jurisdiction"],
dictionary_pagesize_limit=1048576, # 1 MB hard cap on the dictionary page
data_page_size=1048576,
write_statistics=True,
compression="zstd"
)
For GDAL/OGR pipelines, set creation options (the Parquet driver enables dictionary encoding by default):
ogr2ogr -f "Parquet" output.parquet input.shp \
-lco COMPRESSION=ZSTD \
-lco ROW_GROUP_SIZE=1000000
3. Align Row Group Boundaries
Dictionary efficiency decays when row groups span heterogeneous spatial partitions. Coalesce data into uniform 128–256MB row groups before encoding. Mismatched boundaries cause dictionary duplication across groups, inflating cold storage retrieval costs.
# Cluster rows by the categorical key before write so each row group
# holds contiguous values, which keeps dictionaries compact.
table = table.sort_by("admin_level")
Validation & Diagnostics
Verify encoding application and measure storage impact before promoting to archival tiers.
- Inspect Metadata: Confirm dictionary pages exist and remain under the 1MB threshold.
parquet-tools meta archive.parquet | grep -A5 "Dictionary"
- Validate Pointer Compression: Ensure integer pointers replaced >85% of original string bytes.
meta = pq.read_metadata("archive.parquet")
for row_group in range(meta.num_row_groups):
col = meta.row_group(row_group).column(0)
# The dictionary page (when present) sits between its own offset and the
# first data page, so its size is the difference of the two offsets.
dpo = col.dictionary_page_offset
dict_size = (col.data_page_offset - dpo) if dpo is not None else 0
data_size = col.total_compressed_size
print(f"Dict overhead: {dict_size / data_size:.2%}")
- Cold Storage Tiering Check: Verify dictionary pages fit within single S3
GETrequests (<5MB). Fragmented dictionaries across multiple parts trigger multi-request latency penalties.
Root-Cause Troubleshooting
| Symptom | Diagnostic Command | Root Cause | Remediation |
|---|---|---|---|
| Storage bloat post-encoding | parquet-tools meta → check dictionary page sizes |
High cardinality leak or string length variance > 32 chars | Switch to PLAIN encoding + ZSTD level 4 |
| Query latency spikes on range filters | EXPLAIN ANALYZE in DuckDB/Trino |
Range operators force full dictionary decode at runtime | Materialize numeric sort keys; keep strings unencoded |
| Cold retrieval timeouts | aws s3api get-object latency profiling |
Dictionary pages split across multiple row groups | Increase row_group_size; coalesce partitions pre-write |
| Serialization OOM during write | Monitor pyarrow memory profile |
dictionary_pagesize_limit unset; unbounded dictionary growth |
Cap with dictionary_pagesize_limit=1048576; disable dictionary (use_dictionary=False) for the offending column |
Fields that consistently fail validation should bypass dictionary encoding entirely. High-cardinality identifiers, UUIDs, free-text compliance notes, and already-numeric classification codes are better served by direct ZSTD compression. For broader pipeline tuning, reference Compression Tuning & Storage Optimization guidelines on page-size alignment and tiered lifecycle policies.
When implementing at scale, ensure row group boundaries align with spatial partitioning strategies to prevent dictionary duplication across cold storage objects. Validate all encoded outputs against the Apache Parquet Dictionary Encoding Specification to guarantee cross-engine compatibility. For GDAL-based workflows, consult the GDAL Parquet Driver Documentation for driver-specific creation option overrides.