Row Group Sizing Strategies for Spatial Data Archival

Row group boundaries in columnar storage are engineered thresholds, not framework defaults, and getting them wrong is the single most common reason a cold-archive query bill grows faster than the archive itself. For data engineers, GIS archivists, cloud architects, and compliance teams running spatial archival pipelines, the row group is the unit that dictates I/O patterns, compression efficiency, predicate pushdown, and cold-tier retrieval SLAs. This page sets the exact sizes, writer settings, and validation steps that keep coordinate-heavy geometry, attribute payloads, and regulatory retention mandates aligned with a deterministic physical layout.

When Default Row Group Sizing Fails

Library defaults are tuned for homogeneous analytical tables of fixed-width numerics, not for Well-Known Binary (WKB) geometries whose serialized size swings by three orders of magnitude between a survey point and a multipolygon coastline. Three failure modes recur in spatial archives:

Egress amplification on cold reads. Object-storage retrieval is priced per request and per byte retrieved. An ST_Intersects filter that should touch one metropolitan tile ends up issuing thousands of GET/HEAD calls against a fragmented file or, at the other extreme, forces a query engine to pull and decompress a 512 MB group to satisfy a 2 MB answer. Both inflate the bill; neither shows up until the dataset is already in Glacier.
Out-of-memory failures on spatial joins. Engines materialise a full row group per worker during deserialization. Oversized groups holding dense geometry buffers blow past executor heap limits during ST_Contains joins or reprojection passes, producing intermittent OOM kills that are hard to reproduce.
Non-deterministic, non-auditable layouts. Static row-count targets ignore WKB byte variance, so two writes of the “same” partition produce different group boundaries, different checksums, and a compaction job that can never prove byte-for-byte stability across an audit cycle.

Sizing the row group correctly is what turns each of these from a recurring incident into a one-time configuration decision.

What You Need in Place First

Row group sizing is a downstream control. It only behaves predictably once the upstream layout decisions are settled, so confirm the following before tuning a single parameter:

A columnar archive format. Data must already be written as GeoParquet (or a Parquet-backed Delta/Iceberg table), not Shapefile, GeoPackage, or raw WKT. Row groups are a columnar construct. If you are still migrating, run the GeoParquet Migration Workflows pipeline first so geometry encoding and CRS metadata land correctly.
Spatial partitioning chosen. Row group boundaries operate strictly within a partition file. Decide your partition scheme — H3 cells, administrative boundaries, or a Z-order/Hilbert curve — using Spatial Partitioning Techniques before sizing, so spatial locality is already aligned to physical files.
A compression baseline. Row group size and compression codec interact: Parquet compresses each page separately and builds dictionaries per column chunk, so the group and page sizes you pick change the ratio you get. Lock a baseline with ZSTD Level Configuration for Spatial Files first.

This topic sits inside the broader Compression Tuning & Storage Optimization methodology; treat that reference as the parent decision spine and this page as the row-group-specific layer of it.

Sizing Decisions: Parameters and Thresholds

Row group size is a balance: too small inflates metadata and request counts; too large forces wasteful decompression.

Columnar formats (Parquet, Delta, Iceberg) segment data into row groups, each containing column chunks with independent dictionaries, min/max statistics, and compression blocks. For spatial archives, calibrate three parameters explicitly rather than accepting writer defaults:

Target group size (compressed, on disk): 128–256 MB. This is the dominant lever. Below ~64 MB the per-group footer metadata and the object-storage request count dominate; above ~256 MB a single predicate forces decompression of a large irrelevant span. 128 MB is the safe default for mixed geometry; push toward 256 MB only for archival tiers that are read in full-scan batch jobs rather than point queries.
Row count cap, derived not fixed. Writers accept a row count, but your real target is a byte size, so derive the row count from the measured mean serialized row size: rows_per_group ≈ target_bytes / mean_row_bytes. Cap it at 1,000,000 rows to bound metadata regardless of how small individual geometries are. The derivation itself — including geometry profiling and memory-ceiling modelling — is worked end to end in Calculating Optimal Row Group Size for Spatial Queries.
Data page size: 1 MB. Pages are the sub-unit of a column chunk, and when the writer emits per-page min/max statistics (the Parquet page index) they let a reader skip pages inside a group as well as whole groups. 1 MB pages keep those per-page bbox statistics tight, sharpening predicate pushdown without materially raising footer overhead.

The reason these numbers differ from tabular defaults is geometry payload variance. Point datasets exhibit uniform row sizes, while cadastral parcels, hydrological networks, and administrative boundaries vary by orders of magnitude. A static row count that produces a tidy 128 MB group for points will produce a 6 GB group for coastlines. Always size by bytes, derive the row count, and cap it.
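As a minimal sketch of the byte-first derivation described above, the function below turns a byte target and a measured mean row size into a capped row count. The function name and the two sample row sizes are illustrative assumptions, not measurements from a real dataset, and mean_row_bytes should be measured on the same basis as the target (compressed, on disk).

```python
def rows_per_group(target_bytes: int, mean_row_bytes: float, row_cap: int = 1_000_000) -> int:
    """Derive a row count from a byte target, capped to bound metadata."""
    if mean_row_bytes <= 0:
        raise ValueError("mean_row_bytes must be positive")
    derived = int(target_bytes // mean_row_bytes)
    # Never return 0 rows, and never exceed the cap.
    return max(1, min(derived, row_cap))


TARGET = 128 * 1024 * 1024  # 128 MB target from the sizing guidance

# Hypothetical mean row sizes, for illustration only.
point_rows = rows_per_group(TARGET, mean_row_bytes=60)         # small rows hit the cap
coast_rows = rows_per_group(TARGET, mean_row_bytes=48 * 1024)  # large rows: far fewer per group

print(point_rows)  # 1000000
print(coast_rows)  # 2730
```

The cap only bites for small geometries; for large ones the byte target is the binding constraint, which is why the row count has to be recomputed per dataset.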

Production Writer Configuration

Implementation requires explicit writer-level overrides. Framework defaults assume homogeneous tabular data and will misallocate spatial payloads. Set the row group target, page size, and dictionary policy at write time; never rely on the compaction job to “fix” layout afterwards.

PyArrow / DuckDB baseline

import pyarrow.parquet as pq

pq.write_table(
    spatial_table,
    "s3://cold-archive/geospatial/v2/parcels/region_north.parquet",
    row_group_size=1_000_000,           # rows per group: the 1,000,000-row cap; pass the derived rows_per_group when it is lower
    data_page_size=1 * 1024 * 1024,     # 1 MB pages -> tighter per-page bbox statistics
    use_dictionary=False,               # False disables dictionaries on every column; pass a list of column names to keep them on categorical attributes
    compression="zstd",
    compression_level=3,
)

The 128 MB target balances cold-tier read cost against memory safety, and the 1 MB data pages improve the min/max bounding-box statistics that enable tight predicate pushdown. Dictionary encoding is disabled for the geometry column because WKB values are effectively unique — dictionaries there only bloat the file. Categorical GIS attributes are the opposite case and should be handled separately with Dictionary Encoding for GIS Attributes, which can apply per-column dictionary policy without touching geometry.

Spark SQL / Delta engine

-- Parquet writer properties use the parquet-mr names; confirm they are honoured by your Spark version.
SET parquet.block.size=134217728;                   -- 128 MB row group target
SET parquet.page.size=1048576;                      -- 1 MB
SET parquet.enable.dictionary=false;                -- applies to every column; set per-column where supported
SET spark.sql.parquet.compression.codec=zstd;
SET parquet.compression.codec.zstd.level=3;

When writing to Delta or Iceberg tables, enforce these settings at the session level before every INSERT or MERGE. Critically, compaction and OPTIMIZE jobs must inherit the identical row group target — otherwise compaction quietly rewrites files at the engine default and your carefully sized layout drifts after the first maintenance window.

Validation Gate

Never assume the writer honoured the target — inspect the actual physical layout after the write. The fastest cross-engine check reads the Parquet footer metadata directly with DuckDB:

duckdb -c "
  SELECT row_group_id,
         row_group_num_rows                                AS rows,
         round(sum(total_compressed_size) / 1048576.0, 1)  AS mb   -- compressed, on-disk bytes per group
  FROM parquet_metadata('s3://cold-archive/geospatial/v2/parcels/region_north.parquet')
  GROUP BY ALL ORDER BY row_group_id;
"

Expected output for a healthy 128 MB layout — group sizes clustered tightly around target, row counts varying to absorb geometry size variance:

┌──────────────┬─────────┬───────┐
│ row_group_id │  rows   │  mb   │
├──────────────┼─────────┼───────┤
│ 0            │ 712334  │ 131.4 │
│ 1            │ 698120  │ 129.8 │
│ 2            │ 705991  │ 130.2 │
│ ...          │ ...     │ ...   │
└──────────────┴─────────┴───────┘

Most common failure — every group reads as one giant block (e.g. a single 6 GB group, or mb values in the thousands). Root cause: the writer received a row count target but the geometry column’s mean serialized size was far larger than assumed, so a 1,000,000-row cap translated to gigabytes. Fix: profile mean WKB bytes on a sample, recompute rows_per_group = target_bytes / mean_row_bytes, and rewrite — do not patch with compaction, which inherits the same bad ratio. The full profiling routine is in Calculating Optimal Row Group Size for Spatial Queries.

A second-line check confirms the statistics that drive skipping actually exist:

duckdb -c "
  SELECT path_in_schema, stats_min IS NOT NULL AS has_min, stats_max IS NOT NULL AS has_max
  FROM parquet_metadata('s3://cold-archive/geospatial/v2/parcels/region_north.parquet')
  WHERE path_in_schema LIKE 'bbox%';
"

If has_min/has_max are false on the bbox columns, predicate pushdown is silently disabled and every query becomes a full scan regardless of row group size.
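The size check can also be automated as a gate. The sketch below assumes the per-group rows have already been fetched from the parquet_metadata query into tuples. The GroupStat type, the out_of_band helper, and the 64 to 256 MB band are illustrative choices drawn from the thresholds on this page, not part of any library.

```python
from typing import NamedTuple


class GroupStat(NamedTuple):
    row_group_id: int
    rows: int
    mb: float


def out_of_band(groups, min_mb=64.0, max_mb=256.0, ignore_last=True):
    """Return the groups whose size falls outside [min_mb, max_mb]."""
    ordered = sorted(groups, key=lambda g: g.row_group_id)
    # The final group holds the remainder rows, so it is often undersized.
    candidates = ordered[:-1] if ignore_last and len(ordered) > 1 else ordered
    return [g for g in candidates if not (min_mb <= g.mb <= max_mb)]


healthy = [GroupStat(0, 712334, 131.4), GroupStat(1, 698120, 129.8), GroupStat(2, 705991, 130.2)]
oversized = [GroupStat(0, 1_000_000, 6144.0)]

print(out_of_band(healthy))    # []
print(out_of_band(oversized))  # [GroupStat(row_group_id=0, rows=1000000, mb=6144.0)]
```

A non-empty result is a reason to fail the write job before the file is promoted to the cold tier.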

Cost & Performance Trade-offs

Cold-storage pricing models (AWS S3 Glacier Deep Archive, Azure Cool/Archive Blob) are heavily request-sensitive, and retrieval cost scales with the number of row groups a query has to touch. The table below models a 1 TB spatial archive at four group sizes, assuming a typical spatial filter that returns roughly 5% of features:

Row group size	Groups per 1 TB	Object reads, point query	Decompression scope per hit	Executor memory pressure	Best fit
64 MB	~16,000	High (many small GET requests)	Minimal	Low	Frequently-queried warm tier
128 MB	~8,000	Moderate	Moderate	Moderate	Default for mixed archives
256 MB	~4,000	Low	Large per hit	Elevated	Batch full-scan / deep archive
512 MB	~2,000	Very low	Wasteful on selective reads	High (OOM risk)	Avoid for point queries

The diagonal is the whole point: shrinking groups cuts decompression waste but multiplies request count and cost; growing them cuts request count but risks decompressing megabytes to answer a kilobyte and pushes executors toward OOM. 128 MB is the cost-minimising default across mixed query patterns; reserve 256 MB for archives that are genuinely only ever read in full-scan batch jobs.
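The group counts in the table follow from simple division, and the request side of the trade-off can be roughed out the same way. The sketch below rests on a strong assumption: matching features are perfectly clustered, so the share of groups touched equals the share of features returned. Real layouts touch more groups than this lower bound. The function names are illustrative.

```python
import math

TIB = 1024 ** 4
MIB = 1024 ** 2


def group_count(archive_bytes: int, group_bytes: int) -> int:
    """Number of row groups needed to hold the archive."""
    return math.ceil(archive_bytes / group_bytes)


def groups_touched(total_groups: int, selectivity: float) -> int:
    """Lower bound on groups read, assuming perfectly clustered matches."""
    return math.ceil(total_groups * selectivity)


for size_mb in (64, 128, 256, 512):
    total = group_count(TIB, size_mb * MIB)
    touched = groups_touched(total, 0.05)
    print(size_mb, total, touched)
```

Halving the group size doubles the lower bound on groups read for the same 5% filter, which is the request-count pressure the table describes.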

Failure Modes & Edge Cases

Compaction layout drift. OPTIMIZE and other compaction jobs that don't inherit the session row group target rewrite files at the engine default, silently undoing your sizing after the first maintenance window. (VACUUM only deletes unreferenced files and does not rewrite layout.) Pin the target in the compaction job config, not just the ingest job, and re-run the validation query after every maintenance cycle.
Mixed-geometry partitions skew the mean. A partition holding both survey points and coastline multipolygons has a bimodal byte distribution, so a single mean produces groups that are too big for the polygons and too small for the points. Where feasible, route geometry types to separate partitions before sizing, or size against the 90th-percentile row size rather than the mean.
Page size starves the statistics. Raising data_page_size well above the 1 MB that PyArrow and parquet-mr use by default coarsens the per-page bounding box so much that predicate pushdown can no longer skip within a group: the group is right-sized but every read still scans it end to end. Keep pages at 1 MB for spatial columns, and pin the value explicitly so an inherited job config cannot change it.
Dictionary overflow on geometry. Leaving use_dictionary=True on the WKB column makes the writer start a dictionary of near-unique values and then fall back to plain encoding once the dictionary grows too large, bloating the file and occasionally fragmenting groups below target. Disable dictionaries on geometry columns explicitly; apply them only to categorical attributes.
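For the mixed-geometry case, a short sketch shows how far a mean-based and a p90-based row count can diverge. The sample is hypothetical: 90 rows at 21 bytes (the size of a 2D WKB point) and 10 rows at 50,000 bytes standing in for coastline multipolygons. The helper names are illustrative.

```python
import statistics


def profile_row_bytes(sizes):
    """Return (mean, p90) of serialized row sizes in bytes."""
    mean = statistics.fmean(sizes)
    # quantiles(n=10) returns the 9 decile cut points; the last is the 90th percentile.
    p90 = statistics.quantiles(sizes, n=10)[-1]
    return mean, p90


def rows_for(target_bytes, row_bytes, row_cap=1_000_000):
    """Row count for a byte target, capped and never below 1."""
    return max(1, min(int(target_bytes // row_bytes), row_cap))


# Hypothetical bimodal sample: 90 survey points and 10 coastline multipolygons.
sample = [21] * 90 + [50_000] * 10
mean, p90 = profile_row_bytes(sample)

target = 128 * 1024 * 1024
print(rows_for(target, mean))  # row count sized against the mean
print(rows_for(target, p90))   # smaller row count sized against p90
```

Sizing against p90 yields a much smaller row count, which keeps polygon-heavy groups from overshooting the byte target at the cost of undersized point-heavy groups.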

Operational Execution Checklist

Confirm the dataset is GeoParquet/Parquet-backed and spatially partitioned before sizing.
Profile mean (and p90) serialized WKB bytes on a representative sample.
Derive rows_per_group = target_bytes / mean_row_bytes; cap at 1,000,000.
Set row_group_size, data_page_size=1 MB, use_dictionary=False on geometry, and ZSTD codec at write time.
Apply identical row group targets to compaction/OPTIMIZE jobs.
Run the parquet_metadata validation query; confirm group sizes land within 128–256 MB.
Confirm bbox min/max statistics are present for predicate pushdown.
Record the chosen target and checksums in the archive manifest for audit reproducibility.
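The last checklist item can be sketched with the standard library alone. The example below assumes a local copy of the written file. The manifest field names are illustrative, not a prescribed schema, and the throwaway file only stands in for a real Parquet object.

```python
import hashlib
import json
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entry(path, target_bytes, rows_per_group):
    """Build one manifest record; the field names here are illustrative."""
    return {
        "file": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256_of(path),
        "row_group_target_bytes": target_bytes,
        "rows_per_group": rows_per_group,
    }


# Demo on a throwaway file standing in for a written Parquet object.
with tempfile.NamedTemporaryFile(delete=False, suffix=".parquet") as tmp:
    tmp.write(b"placeholder bytes")
entry = manifest_entry(tmp.name, 128 * 1024 * 1024, 700_000)
print(json.dumps(entry, indent=2))
os.remove(tmp.name)
```

Storing the sizing inputs next to the checksum lets a later audit tell a layout change apart from a content change.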

Related

Up to the parent reference: Compression Tuning & Storage Optimization, the full cold-storage optimisation methodology this layout decision sits inside.
Calculating Optimal Row Group Size for Spatial Queries: the deterministic byte-derivation and Hilbert-clustering routine behind the numbers above.
Spatial Partitioning Techniques: set partition boundaries first, since row groups operate within them.
ZSTD Level Configuration for Spatial Files: the codec baseline that interacts with group size to set the final compression ratio.
Hot/Warm/Cold Tier Design for Geospatial Data: how group size feeds the retrieval-cost assumptions of each storage tier.

For cross-engine compatibility, follow the Apache Parquet file format specification, and validate spatial metadata against the OGC GeoParquet specification so row group statistics remain readable across archival tiers.
