Calculating Optimal Row Group Size for Spatial Queries

Data engineers, GIS archivists, and cloud architects who tier large vector archives to object storage hit the same wall: spatial predicate queries (ST_Intersects, ST_DWithin, ST_Contains) scan far more blocks than the filter geometry should touch, and cold-storage egress bills climb accordingly. The cause is that default columnar row-group sizing targets uniform tabular analytics — it assumes near-constant per-row byte width and no spatial locality. Serialized geometry violates both assumptions: WKB payloads vary by orders of magnitude between a survey point and a coastline multipolygon, and unsorted rows scatter neighbouring features across every block. This page gives a deterministic, execution-ready procedure for calculating row-group boundaries that preserve predicate pushdown, bound min/max envelope overlap, and keep ranged-GET retrieval costs low.

Sizing Workflow

The routine moves from payload profiling, through row-count derivation and spatial clustering, to a validated write.

This procedure assumes the source is already a columnar archive (Parquet/GeoParquet) and that a Compression Tuning & Storage Optimization baseline — codec and compression level — is in place; row-group sizing is tuned after the codec is fixed, because the expected compression ratio feeds directly into the row-count formula below.

Step 1: Profile Geometry Payload Distribution

Serialized spatial payloads exhibit high byte-size variance. Unchecked variance forces oversized row groups, triggering full-block decompression during spatial filtering and inflating cold-storage egress.

import pyarrow.parquet as pq
import numpy as np

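# Read only the geometry column. For very large archives, profile a
# representative sample of 10,000+ records instead of the full table.
table = pq.read_table("datasets/cadastre/raw/parcels_2024.parquet",
                      columns=["geometry_wkb"])
wkb_bytes = table.column("geometry_wkb").to_pylist()
# Skip null geometries so len() does not fail on None
sizes = np.array([len(b) for b in wkb_bytes if b is not None], dtype=np.float64)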

p50, p90, p99 = np.percentile(sizes, [50, 90, 99])
g_avg = sizes.mean()
sigma_g = sizes.std()
variance_ratio = sigma_g / g_avg

print(f"G_avg: {g_avg:.0f}B | sigma_G: {sigma_g:.0f}B | ratio: {variance_ratio:.2f}")

Validation gate: if variance_ratio > 0.6, halt archival promotion. Isolate high-complexity polygons (p99 > 500 KB) into a separate tier or apply geometry simplification before grouping. High variance directly correlates with false-positive block scans during ST_Intersects evaluation, and it is also the dominant cause of poor ratios when tuning ZSTD compression for GeoParquet archives — so resolving it here pays off twice.
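The gate above can be sketched as a small reusable check. This is a minimal illustration, not part of any library: the function name `profile_gate` and the constants are hypothetical, and it reads the 500 KB figure as a per-feature byte threshold for deciding which rows to route to a separate tier, which is an assumption about how the isolation is applied.

```python
import statistics

VARIANCE_RATIO_LIMIT = 0.6     # gate threshold from Step 1
OVERSIZED_BYTES = 500 * 1024   # assumed per-feature cut-off (500 KB)


def profile_gate(sizes):
    """Return the variance ratio, the halt decision, and oversized row indices."""
    mean = statistics.fmean(sizes)
    sigma = statistics.pstdev(sizes)  # population std, same as numpy's default
    ratio = sigma / mean
    oversized = [i for i, s in enumerate(sizes) if s > OVERSIZED_BYTES]
    return {
        "variance_ratio": ratio,
        "halt": ratio > VARIANCE_RATIO_LIMIT,
        "oversized_idx": oversized,
    }


# 99 small parcels plus one very large multipolygon
sample = [100] * 99 + [600 * 1024]
result = profile_gate(sample)
print(result["halt"], result["oversized_idx"])
```

Rows listed in `oversized_idx` are the candidates for a separate tier or for simplification; re-running the gate on the remainder shows whether the ratio has dropped below the limit.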

Step 2: Derive Target Row Count per Group

Optimal row-group size ($R_{opt}$) balances block-level I/O efficiency against spatial index granularity. Apply the deterministic formula:

$R_{opt} = \lfloor (T_{block} \times C_{ratio}) / (G_{avg} + A_{attr}) \rfloor$

Parameter definitions:

$T_{block}$: target compressed block size. Use 128 MB for standard object storage, 256 MB for deep-archive tiers.
$C_{ratio}$: expected compression ratio. Spatial WKB typically yields 1.8–3.2x with ZSTD; pull the exact figure for your codec from your ZSTD Level Configuration for Spatial Files baseline rather than guessing.
$G_{avg}$: average serialized geometry byte size (from Step 1).
$A_{attr}$: average serialized attribute payload per row (non-geometry columns).
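As a worked instance of the formula, take $T_{block}$ = 128 MB (counted as 128 × 1024 × 1024 = 134,217,728 bytes), $C_{ratio}$ = 2.5, and $A_{attr}$ = 45 bytes, with a hypothetical $G_{avg}$ of 355 bytes. The geometry size is an assumption chosen for illustration, not a measured value:

```latex
R_{opt}
  = \left\lfloor \frac{134{,}217{,}728 \times 2.5}{355 + 45} \right\rfloor
  = \left\lfloor \frac{335{,}544{,}320}{400} \right\rfloor
  = \lfloor 838{,}860.8 \rfloor
  = 838{,}860
```

With these inputs the result is below the 1,000,000-row cap, so the cap would not bind.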

target_block_mb = 128
c_ratio = 2.5
a_attr = 45  # bytes, measured from the non-geometry columns

r_opt = int((target_block_mb * 1024 * 1024 * c_ratio) / (g_avg + a_attr))

# Hard cap to prevent spatial-join materialization OOM
R_FINAL = min(r_opt, 1_000_000)
print(f"Calculated R_opt: {r_opt} | Enforced cap: {R_FINAL}")

Exceeding 1,000,000 rows per group introduces memory pressure during spatial-join materialization and increases bounding-box overlap probability. Where a single dataset spans wildly different geographic densities, split it along the same boundaries you use for spatial partitioning techniques before applying the cap, so no group straddles two partitions.
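To illustrate why splitting by density comes before the cap, the sketch below applies the same formula per partition. The partition names and their average geometry sizes are invented for the example, and `rows_per_group` is a hypothetical helper rather than an existing API.

```python
import math

ROW_CAP = 1_000_000


def rows_per_group(target_block_bytes, c_ratio, g_avg, a_attr, cap=ROW_CAP):
    """Apply the R_opt formula and enforce the hard row cap."""
    row_bytes = g_avg + a_attr
    if row_bytes <= 0:
        raise ValueError("average row size must be positive")
    r_opt = math.floor(target_block_bytes * c_ratio / row_bytes)
    return min(r_opt, cap)


# Hypothetical per-partition average geometry sizes in bytes
partitions = {"urban_core": 220.0, "rural": 1400.0, "coastline": 52000.0}

block_bytes = 128 * 1024 * 1024
plan = {
    name: rows_per_group(block_bytes, c_ratio=2.5, g_avg=g, a_attr=45)
    for name, g in partitions.items()
}
for name, rows in plan.items():
    print(f"{name}: {rows} rows per group")
```

Under these assumed sizes the dense, small-geometry partition hits the cap while the coastline partition gets only a few thousand rows per group, which is the spread a single dataset-wide $G_{avg}$ would hide.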

Step 3: Apply Spatial Clustering Prior to Grouping

Row groups must be spatially coherent. Unsorted data scatters geographic regions across blocks, defeating min/max statistics and forcing full-block decompression. DuckDB’s ST_Hilbert function takes a geometry and a BOX_2D extent and returns an unsigned 32-bit integer (UINTEGER) Hilbert-curve key, so the dataset extent must be computed first.

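-- 1. Compute the dataset extent as a BOX_2D. ST_Extent_Agg returns a GEOMETRY,
--    so wrap it in ST_Extent to get the box type that ST_Hilbert expects.
CREATE TEMPORARY TABLE dataset_extent AS
SELECT ST_Extent(ST_Extent_Agg(geometry)) AS ext FROM archive_source;

-- 2. Sort rows along the Hilbert curve, then write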
COPY (
  SELECT s.*
  FROM archive_source s, dataset_extent e
  ORDER BY ST_Hilbert(s.geometry, e.ext)
)
TO 'datasets/cadastre/cold/parcels_optimized.parquet'
-- Set ROW_GROUP_SIZE to the R_FINAL value derived in Step 2 (500000 shown here)
(FORMAT PARQUET, ROW_GROUP_SIZE 500000, COMPRESSION ZSTD);

Sorting by a Hilbert curve aligns physical storage with spatial locality. Each row group’s min/max bounding-box envelope then tightly encloses its contents, letting the query engine skip irrelevant blocks during ST_DWithin and ST_Contains evaluations. Without this step, spatial predicate pushdown degrades to sequential full-table scans regardless of how carefully $R_{opt}$ was chosen.
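For intuition about what the sort key encodes, here is a minimal sketch of the classic iterative Hilbert-index computation on a square grid. It is not DuckDB's implementation; the function names, the default grid `order`, and the clamping behaviour are assumptions made for the illustration.

```python
def hilbert_key(x, y, order=16):
    """Map integer cell (x, y) on a 2^order by 2^order grid to its Hilbert index."""
    n = 1 << order
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("cell lies outside the grid")
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def key_for_point(px, py, extent, order=16):
    """Scale a coordinate into the grid defined by extent, then encode it."""
    xmin, ymin, xmax, ymax = extent
    n = 1 << order

    def scale(v, lo, hi):
        if hi <= lo:
            return 0
        cell = int((v - lo) / (hi - lo) * n)
        return min(max(cell, 0), n - 1)  # clamp points on the upper edge

    return hilbert_key(scale(px, xmin, xmax), scale(py, ymin, ymax), order)


# Cells visited in key order on a 4 x 4 grid: each step moves to a neighbour
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: hilbert_key(c[0], c[1], order=2))
print(cells)
```

Because consecutive keys always land in adjacent cells, rows that are neighbours in the sorted file tend to be neighbours on the map, which is what keeps each row group's envelope compact.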

Validation & Verification

Run these gates against the written file before promoting it to a cold tier. Expected output is annotated inline.

import pyarrow.parquet as pq

meta = pq.read_metadata("datasets/cadastre/cold/parcels_optimized.parquet")

prev = None
overlaps = 0
for i in range(meta.num_row_groups):
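    rg = meta.row_group(i)
    # column 0 here is the X ordinate of the bbox; adapt to your schema
    stats = rg.column(0).statistics
    if stats is None or not stats.has_min_max:
        # No min/max statistics: the group cannot be pruned, so count it as a violation
        overlaps += 1
        prev = None
        continue
    min_x, max_x = stats.min, stats.max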
    if prev is not None:
        p_min, p_max = prev
        # 1-D overlap fraction along X as a fast proxy for envelope overlap
        inter = max(0.0, min(max_x, p_max) - max(min_x, p_min))
        union = max(max_x, p_max) - min(min_x, p_min)
        if union > 0 and inter / union > 0.10:
            overlaps += 1
    prev = (min_x, max_x)

print(f"Row groups: {meta.num_row_groups}")
print(f"Overlap violations: {overlaps}")
assert overlaps < meta.num_row_groups * 0.10, "FAIL: spatial coherence threshold breached"

Expected output on a correctly Hilbert-sorted archive:

Row groups: 84
Overlap violations: 3        # < 10% of 84 → PASS

If Overlap violations approaches the row-group count, the sort did not take effect (see Troubleshooting). Cross-check the three thresholds below against the row-group statistics exposed by DuckDB's parquet_metadata() table function:

Validation gate	Threshold	Check	Failure root cause
Bounding-box overlap	`< 10%` between adjacent envelopes	compare adjacent row-group min/max envelopes	insufficient clustering; Hilbert key collision or centroid skew
Block decompression ratio	`< 15%` of blocks scanned per query	`blocks_scanned` vs `blocks_returned`	oversized groups; variance > 0.6 bypassed
Attribute sparsity alignment	`NULL/empty < 5%` per group	per-column null stats from metadata	mixed geometry types in one group
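The block decompression gate in the table can be estimated offline before running real queries. The sketch below assumes row-group envelopes have already been extracted as `(xmin, ymin, xmax, ymax)` tuples; the function names and the sample envelopes are hypothetical.

```python
def envelopes_intersect(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes share any area or edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def scanned_fraction(envelopes, query_bbox):
    """Fraction of row groups a bbox filter could not skip."""
    if not envelopes:
        return 0.0
    hits = sum(1 for env in envelopes if envelopes_intersect(env, query_bbox))
    return hits / len(envelopes)


# Ten non-overlapping row-group envelopes laid out along the X axis
envelopes = [(i * 2.0, 0.0, i * 2.0 + 1.0, 1.0) for i in range(10)]
query = (0.2, 0.2, 0.8, 0.8)

fraction = scanned_fraction(envelopes, query)
print(f"Blocks scanned: {fraction:.0%}")
assert fraction < 0.15, "FAIL: block decompression threshold breached"
```

Running this over a handful of representative query boxes gives a rough picture of pruning effectiveness; it only models envelope-level skipping, not the exact geometry test the engine performs afterwards.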

Troubleshooting

Symptom	Root cause	Fix
`ST_Intersects` scans 100% of blocks despite a tight filter	row groups unsorted; envelopes span multiple regions	re-run the Hilbert sort and rewrite with `write_statistics=True` so min/max stats regenerate
Cold-storage retrieval cost spikes on monthly audits	groups exceed ~1.2M rows; reads spill to disk	enforce `R_FINAL = min(R_opt, 1_000_000)` and split by geographic partition first
Geometry-column compression drops below 1.2x	mixed topology types (points, lines, multipolygons) in one group	write each geometry type to its own row groups and apply type-specific encoding, following the guidance on dictionary encoding for GIS attributes
Query engine ignores spatial stats entirely	Parquet metadata not refreshed after the sort	rewrite with `write_statistics=True` (PyArrow) or re-`COPY` through DuckDB

For cloud-native cold retrieval, align row-group boundaries with your object store’s ranged-GET request sizing (typically 8–16 MB per request) to avoid partial-object retrieval penalties, and consult the Apache Parquet file format specification for the exact metadata layout.
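One way to reason about that alignment is to lay out the inclusive byte ranges a client would request for a single row group. This is a rough sketch: `ranged_get_plan` is a hypothetical helper, and the 8 MB request size is just the low end of the range quoted above.

```python
def ranged_get_plan(row_group_bytes, request_bytes):
    """Return inclusive (start, end) byte ranges covering one row group."""
    if row_group_bytes <= 0 or request_bytes <= 0:
        raise ValueError("sizes must be positive")
    return [
        (start, min(start + request_bytes, row_group_bytes) - 1)
        for start in range(0, row_group_bytes, request_bytes)
    ]


MIB = 1024 * 1024

aligned = ranged_get_plan(128 * MIB, 8 * MIB)   # exact multiple of the request size
ragged = ranged_get_plan(20 * MIB, 8 * MIB)     # last request is only partly filled

print(len(aligned), aligned[0], aligned[-1])
print(len(ragged), ragged[-1])
```

When the compressed row-group size is a whole multiple of the request size, every request is full; otherwise the final request carries a short tail, which is the case worth avoiding if requests are billed per call.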

Related

Up to the parent topic: Row Group Sizing Strategies frames how block sizing interacts with every columnar writer in a spatial archive.
Sibling procedure: Tuning ZSTD Compression for GeoParquet Archives sets the compression ratio that feeds this page’s row-count formula.
Sibling procedure: When to Use Dictionary Encoding for Categorical GIS Fields keeps attribute payload ($A_{attr}$) small without breaking group statistics.
Cross-topic: Hot/Warm/Cold Tier Design for Geospatial Data explains which tier these optimized files should land in and how retrieval pricing shapes the target block size.
