Attribute Schema Mapping for Spatial Datasets

Attribute schema mapping is the systematic process of aligning, transforming, and validating the non-geometric properties of spatial features as they move between ingestion, transformation, and publication stages. When spatial workflows treat geometry as the primary concern and attributes as secondary payloads, classification codes silently degrade, measurement units drift between formats, and regulatory reporting fails at the worst possible moment. For GIS analysts, QA engineers, data stewards, and compliance officers, a rigorous mapping framework prevents silent data degradation and preserves semantic fidelity across the full pipeline lifecycle — feeding directly into the standards described in Core Spatial QC Fundamentals & Standards.

Prerequisites

Before implementing an automated attribute mapping pipeline, confirm these baseline requirements. Skipping them produces brittle transformations that fail silently when upstream providers change field names or value domains.

Source and target schema definitions (versioned): Document field names, data types, cardinality, nullability, and value domains for both legacy inputs and target outputs. Maintain a YAML or JSON schema manifest in version control. Tools such as frictionless or a custom pyogrio-based header extractor can automate initial manifest generation.
Format awareness (geopandas 0.14+, pyogrio 0.7+, GDAL 3.4+): Understand how different formats encode attributes. Shapefiles impose a 10-character field name limit and offer only a handful of DBF field types, with no native datetime, nested values, or column constraints. GeoPackage (backed by SQLite) and GeoJSON support richer typing and nested structures. The OGC GeoPackage Standard documents how SQLite-based containers handle strict typing and table constraints.
Validation library selection: Choose declarative schema enforcement libraries — pydantic v2, jsonschema, or pandera — alongside spatial I/O libraries (geopandas, pyogrio, shapely). Avoid ad-hoc if/else chains; declarative schemas can be versioned and independently tested.
Coordinate Reference System (CRS) normalization complete: Attribute validation that involves measurement conversions (acres to hectares, feet to metres) assumes the layer is already in the target CRS. Ensure coordinate reference system precision standards have been applied before the mapping stage.
Metadata alignment target identified: Map custom attributes to ISO 19115 elements or INSPIRE data specifications so compliance auditing can be automated. Knowing your target standard upfront prevents costly schema redesigns.
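The measurement conversions mentioned in the CRS prerequisite (acres to hectares, feet to metres) can be sketched with decimal.Decimal so that rounding happens once, at the final step. The function names and the four-decimal output precision are illustrative assumptions; the conversion factors are the exact international definitions of the foot and the acre.

```python
from decimal import Decimal, ROUND_HALF_UP

# Exact international definitions: 1 ft = 0.3048 m, 1 acre = 4046.8564224 m^2.
FEET_TO_METRES = Decimal("0.3048")
ACRES_TO_HECTARES = Decimal("0.40468564224")


def acres_to_hectares(acres: str, places: str = "0.0001") -> Decimal:
    """Convert an acreage supplied as text, rounding only at the last step."""
    exact = Decimal(acres) * ACRES_TO_HECTARES
    return exact.quantize(Decimal(places), rounding=ROUND_HALF_UP)


def feet_to_metres(feet: str, places: str = "0.0001") -> Decimal:
    """Convert a length supplied as text, rounding only at the last step."""
    exact = Decimal(feet) * FEET_TO_METRES
    return exact.quantize(Decimal(places), rounding=ROUND_HALF_UP)


print(acres_to_hectares("2.5"))
print(feet_to_metres("100"))
```

Passing the source value as a string avoids inheriting binary floating-point error before the conversion even starts.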

Conceptual Foundation

Spatial datasets consist of two logically distinct but coupled components: a geometry column encoding shape and position, and an attribute table encoding descriptive metadata. Most format-agnostic validation libraries operate exclusively on one component or the other. The mapping problem arises at the boundary: attributes must remain semantically consistent as geometry migrates between projections, formats, and organizational schemas.

Three classes of inconsistency cause the most damage in production:

Type coercion ambiguity. A field carrying land-use codes may be stored as TEXT in a legacy Shapefile and expected as INTEGER in the target GeoPackage. Silent numeric parsing of "N/A" or "0" produces incorrect aggregations that are indistinguishable from valid data.
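A minimal sketch of a strict parser for this case follows. The name parse_strict_int is hypothetical, and the accepted grammar (optional sign plus ASCII digits) is an assumption that may need loosening for a given provider. Sentinel values such as 0 still parse and should be caught by a range or domain constraint downstream.

```python
import re


def parse_strict_int(raw: object) -> int:
    """Parse an integer code, raising ValueError instead of guessing."""
    if isinstance(raw, bool):
        # bool is a subclass of int; treat it as a data error, not as 0/1
        raise ValueError(f"boolean is not a valid code: {raw!r}")
    if isinstance(raw, int):
        return raw
    text = str(raw).strip()
    # Accept only an optional sign followed by ASCII digits
    if not re.fullmatch(r"[+-]?[0-9]+", text):
        raise ValueError(f"not an integer code: {raw!r}")
    return int(text)


for candidate in ["42", " 110 ", "N/A", "12.0", None]:
    try:
        print(repr(candidate), "->", parse_strict_int(candidate))
    except ValueError as exc:
        print(repr(candidate), "-> rejected:", exc)
```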

Enum drift. Classification systems evolve: the NLCD (National Land Cover Database) legend, for example, was revised substantially between the 1992 and 2001 releases. When lookup tables are not updated alongside data inflows, new codes pass through unmapped and collapse to nulls, corrupting land-area totals or zoning reports.
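One way to surface enum drift before it reaches a lookup is to count incoming codes that have no entry in the table. The sketch below assumes the codes are already parsed to integers; the lookup contents and the function name are hypothetical.

```python
from collections import Counter


def unmapped_code_counts(observed_codes, lookup) -> dict:
    """Count occurrences of each incoming code that has no lookup entry."""
    counts = Counter(observed_codes)
    return {code: n for code, n in sorted(counts.items()) if code not in lookup}


# Hypothetical lookup table and ingestion batch
lookup = {11: "open_water", 21: "developed_open_space"}
batch = [11, 21, 21, 24, 24, 95]

print(unmapped_code_counts(batch, lookup))
```

A non-empty result is a signal to quarantine the affected features and review the lookup table, rather than letting the codes fall through as nulls.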

Cross-dimensional mismatch. An attribute can be internally valid (a recognised zoning code) but spatially implausible when correlated with geometry — a parcel classified as water that intersects a residential zoning polygon, for example. Catching these requires running attribute rules in conjunction with the geometry validity checks that operate on the same feature set.

The ISO 19157-1:2023 quality framework formalises these concerns under thematic accuracy, logical consistency, and completeness — quality elements that every attribute mapping implementation should trace back to measurable thresholds.

Step-by-Step Implementation

Step 1: Schema Inventory and Gap Analysis

Extract field-level metadata from source datasets without loading full geometry arrays into memory. pyogrio and GDAL/OGR support schema-only reads that return column names, types, and feature counts efficiently across dozens of formats.

import json

import pyogrio

def extract_schema(path: str) -> dict:
    """Return field names, dtypes, feature count, geometry type, and CRS without reading features."""
    info = pyogrio.read_info(path)
    return {
        "fields": info["fields"].tolist(),
        "dtypes": [str(d) for d in info["dtypes"]],
        "feature_count": info["features"],
        "geometry_type": info["geometry_type"],
        "crs": info["crs"],
    }

source_schema = extract_schema("source_parcels.gpkg")
with open("target_schema.json") as f:
    target_schema = json.load(f)

# Identify missing and extra fields
source_fields = set(source_schema["fields"])
target_fields = set(target_schema["fields"])
missing = target_fields - source_fields
extra   = source_fields - target_fields

print(f"Missing in source: {missing}")
print(f"Extra in source:   {extra}")

Expected output: A list of unmapped fields on each side. Document these in a mapping matrix — a spreadsheet or YAML file that explicitly records source column → target column, type coercion rule, null handling strategy, and enum mapping table. This artefact is the single source of truth for every downstream transformation step.
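The mapping matrix can also be expressed directly in code. The sketch below is one possible shape, not a prescribed format: the FieldMapping class and apply_mapping function are hypothetical, and the source column names are the ones used in the Step 3 example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FieldMapping:
    source: str                        # column name in the source dataset
    target: str                        # column name in the target schema
    cast: Callable[[object], object]   # type coercion rule
    nullable: bool = False             # null handling strategy


MAPPING_MATRIX = [
    FieldMapping("PARCEL_NUM", "parcel_id", lambda v: str(v).strip().upper()),
    FieldMapping("LU_CODE", "land_use_code", int),
    FieldMapping("ZONING_CLASS", "zoning", str, nullable=True),
]


def apply_mapping(record: dict, matrix: list[FieldMapping]) -> dict:
    """Rename and cast one source record according to the matrix."""
    out = {}
    for m in matrix:
        raw = record.get(m.source)
        if raw is None or raw == "":
            if not m.nullable:
                raise ValueError(f"{m.source} is required but missing")
            out[m.target] = None
        else:
            out[m.target] = m.cast(raw)
    return out


print(apply_mapping({"PARCEL_NUM": " ab123456 ", "LU_CODE": "110"}, MAPPING_MATRIX))
```

Keeping the matrix as data rather than as branching logic means the same structure can be loaded from a YAML or JSON manifest and diffed in code review.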

Step 2: Constraint Definition and Type Enforcement

Define validation rules as declarative pydantic (v2) models. Each model field carries type, range, pattern, and nullability constraints. Separating schema definitions from transformation logic allows the same models to be reused in unit tests, CI gates, and documentation generation.

from pydantic import BaseModel, Field, field_validator
from typing import Optional
import re

class ParcelSchema(BaseModel):
    parcel_id:      str   = Field(..., min_length=8, max_length=12, pattern=r"^[A-Z0-9]+$")
    land_use_code:  int   = Field(..., ge=100, le=999)
    assessed_value: float = Field(..., ge=0.0)
    zoning:         Optional[str] = Field(None, pattern=r"^(R|C|I|M)-\d{2}$")
    survey_date:    Optional[str] = Field(None)

    @field_validator("survey_date", mode="before")
    @classmethod
    def parse_survey_date(cls, v: Optional[str]) -> Optional[str]:
        if v is None:
            return None
        # Reject ambiguous formats silently cast from legacy sources
        if not re.match(r"^\d{4}-\d{2}-\d{2}$", str(v).strip()):
            raise ValueError(f"survey_date must be ISO 8601 (YYYY-MM-DD), got: {v!r}")
        return v

For GeoJSON delivery targets, these same constraints map directly to properties validation. The child page on mapping attribute constraints to GeoJSON schemas covers translating pydantic models to JSON Schema draft-07 representations compatible with ajv and OGC API Features validators.
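As a rough illustration of that translation, pydantic v2 can emit a JSON Schema from a model with model_json_schema(). The trimmed ParcelProperties model below is a hypothetical stand-in for the full schema. Note that pydantic's documentation describes its output as following the 2020-12 dialect, so a draft-07 consumer may need the result adjusted.

```python
import json
from typing import Optional

from pydantic import BaseModel, Field


class ParcelProperties(BaseModel):
    """Trimmed, illustrative version of the parcel model."""
    parcel_id: str = Field(..., min_length=8, max_length=12, pattern=r"^[A-Z0-9]+$")
    land_use_code: int = Field(..., ge=100, le=999)
    zoning: Optional[str] = Field(None, pattern=r"^(R|C|I|M)-\d{2}$")


# Numeric bounds become minimum/maximum; string rules become
# minLength/maxLength/pattern; fields without defaults land in "required".
schema = ParcelProperties.model_json_schema()
print(json.dumps(schema, indent=2))
```

The resulting dictionary can be embedded under the properties member of a GeoJSON Feature schema.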

Step 3: Transformation and Value Mapping

Apply the mapping matrix to source data. Handle type casting, code translation, unit normalisation, and string sanitisation as discrete, logged operations. Combine these into a single validate_and_transform function that routes valid records forward and invalid records to a quarantine table — never dropping them silently.

import geopandas as gpd
import pandas as pd
from pydantic import ValidationError

# Illustrative source-code → target-code table. The target values are
# placeholders for this example, not an official NLCD crosswalk.
LAND_USE_XLAT = {
    11: 110,   # Open water
    21: 210,   # Developed, open space
    22: 220,   # Developed, low intensity
}

def validate_and_transform(
    gdf: gpd.GeoDataFrame,
) -> tuple[gpd.GeoDataFrame, list[dict]]:
    """
    Returns (valid_gdf, error_records).
    Geometry column is preserved on the valid path.
    """
    valid_indices: list = []
    validated_rows: list[dict] = []
    errors: list[dict] = []

    for idx, row in gdf.iterrows():
        try:
            # Parse inside the try block so that "N/A", NaN, or a missing
            # column quarantines the row instead of crashing the batch.
            raw_lu = int(row["LU_CODE"])
            record = ParcelSchema(
                parcel_id=str(row.get("PARCEL_NUM", "")).strip().upper(),
                # Unmapped codes pass through unchanged; the 100-999 range
                # constraint rejects them only when they fall outside it.
                land_use_code=LAND_USE_XLAT.get(raw_lu, raw_lu),
                assessed_value=float(row["ASSESS_VAL"]),
                zoning=row.get("ZONING_CLASS") or None,
                survey_date=row.get("SURV_DATE") or None,
            )
        except ValidationError as exc:
            # Drop context objects so the records serialise cleanly to JSON
            detail = exc.errors(include_url=False, include_context=False)
            errors.append({"row_index": idx, "parcel_num": row.get("PARCEL_NUM"), "errors": detail})
        except (KeyError, TypeError, ValueError) as exc:
            errors.append({"row_index": idx, "parcel_num": row.get("PARCEL_NUM"), "errors": [{"msg": str(exc)}]})
        else:
            valid_indices.append(idx)
            validated_rows.append(record.model_dump())

    if not validated_rows:
        # Empty frame that keeps the source columns and CRS
        return gdf.iloc[:0].copy(), errors

    # Build the attribute table first, then attach geometry aligned on the
    # original index; the CRS is inherited from the source GeoSeries.
    attrs = pd.DataFrame(validated_rows, index=valid_indices)
    result = gpd.GeoDataFrame(attrs, geometry=gdf.geometry.loc[valid_indices])
    return result, errors


gdf = gpd.read_file("source_parcels.gpkg", engine="pyogrio")
clean_gdf, quarantine = validate_and_transform(gdf)

print(f"Valid:      {len(clean_gdf)}")
print(f"Quarantine: {len(quarantine)}")

# Persist quarantine records for data steward review
if quarantine:
    pd.DataFrame(quarantine).to_json("quarantine.ndjson", orient="records", lines=True)

Verification: After the transform, inspect clean_gdf.dtypes and confirm all target columns carry their expected pandas types. The quarantine file is written only when the quarantine list is non-empty, so its absence after a run is the test oracle for a clean ingestion batch.

Step 4: Validation Gate — Attribute and Spatial Cross-Checks

Running attribute rules first removes a large fraction of defective records cheaply, before expensive spatial predicates execute. Once attributes are confirmed valid, run cross-dimensional consistency checks that correlate attribute values with geometry.

import geopandas as gpd
import pandas as pd

def cross_validate(gdf: gpd.GeoDataFrame) -> pd.DataFrame:
    """
    Returns a DataFrame of rows whose land-use and zoning attributes are
    mutually implausible. This simplified check compares two attribute
    columns; a full cross-dimensional rule would intersect the parcels
    with a separate zoning layer.
    """
    violations = []

    # Example: water-coded parcels must not carry a residential zoning class
    water_mask    = gdf["land_use_code"] == 110
    res_zoning    = gdf["zoning"].str.startswith("R-", na=False)
    conflict_mask = water_mask & res_zoning

    for idx, row in gdf[conflict_mask].iterrows():
        violations.append({
            "row_index": idx,
            "parcel_id": row["parcel_id"],
            "rule": "water_in_residential_zone",
            "severity": "blocker",
        })

    return pd.DataFrame(violations)

cross_issues = cross_validate(clean_gdf)
print(cross_issues)
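The function above compares two attribute columns on the same feature. Where zoning lives in a separate polygon layer, the same rule can be sketched as a spatial join. The layer contents, the zone column name, and the function name below are all hypothetical, and the intersects predicate also flags parcels that merely touch a zone boundary.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical parcel and zoning layers in the same CRS
parcels = gpd.GeoDataFrame(
    {"parcel_id": ["AAAA0001", "AAAA0002"], "land_use_code": [110, 210]},
    geometry=[box(0, 0, 1, 1), box(5, 5, 6, 6)],
    crs="EPSG:3857",
)
zoning = gpd.GeoDataFrame(
    {"zone": ["R-01"]},
    geometry=[box(0.5, 0.5, 2, 2)],
    crs="EPSG:3857",
)


def water_in_residential(parcels: gpd.GeoDataFrame, zoning: gpd.GeoDataFrame):
    """Return water-coded parcels that intersect any residential zone polygon."""
    water = parcels[parcels["land_use_code"] == 110]
    residential = zoning[zoning["zone"].str.startswith("R-", na=False)]
    hits = gpd.sjoin(water, residential, how="inner", predicate="intersects")
    return hits[["parcel_id", "zone"]].drop_duplicates().reset_index(drop=True)


print(water_in_residential(parcels, zoning))
```

Filtering to water-coded parcels before the join keeps the expensive spatial predicate off the bulk of the features.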

Integrate this gate with the framework described in Building Rule Engines with GeoPandas so that attribute and spatial rules share a unified severity taxonomy and emit the same structured error schema consumed by your reporting layer.

Step 5: Deployment and Continuous Drift Monitoring

Publish the validated dataset with an explicit schema version tag embedded in the file metadata or accompanying sidecar manifest. Implement schema comparison on every new ingestion batch to detect when upstream providers change field names, add codes, or alter value distributions without notice.

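import hashlib
import json
from pathlib import Path

# Fingerprint only structural keys. feature_count changes with every batch
# and would otherwise trigger false drift alerts.
STRUCTURAL_KEYS = ("fields", "dtypes", "geometry_type", "crs")

def schema_fingerprint(schema: dict) -> str:
    structural = {key: schema.get(key) for key in STRUCTURAL_KEYS}
    canonical = json.dumps(structural, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

current_fp  = schema_fingerprint(extract_schema("source_parcels.gpkg"))
baseline_fp = Path(".schema_baseline").read_text().strip()

if current_fp != baseline_fp:
    raise RuntimeError(
        f"Schema drift detected: expected {baseline_fp}, got {current_fp}. "
        "Review upstream provider changelog and update mapping matrix."
    )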

Store the baseline fingerprint in version control. Any CI run that detects drift fails the pipeline and generates a data quality ticket before a single transformed record reaches production.
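A fingerprint mismatch says that something changed but not what. If the baseline schema dictionary is stored alongside the fingerprint, a small diff can populate the ticket. This sketch assumes both dictionaries have the fields and dtypes lists returned by extract_schema in Step 1; the diff_schemas name and the sample data are hypothetical.

```python
def diff_schemas(baseline: dict, current: dict) -> dict:
    """Report fields that were added, removed, or changed type."""
    base = dict(zip(baseline["fields"], baseline["dtypes"]))
    curr = dict(zip(current["fields"], current["dtypes"]))
    shared = base.keys() & curr.keys()
    return {
        "added": sorted(curr.keys() - base.keys()),
        "removed": sorted(base.keys() - curr.keys()),
        "retyped": {
            name: (base[name], curr[name])
            for name in sorted(shared)
            if base[name] != curr[name]
        },
    }


# Hypothetical baseline and incoming schemas
baseline = {"fields": ["parcel_num", "lu_code", "zoning"], "dtypes": ["object", "int64", "object"]}
current = {"fields": ["parcel_num", "lu_code", "owner"], "dtypes": ["object", "object", "object"]}

print(diff_schemas(baseline, current))
```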

Common Failure Modes and Fixes

Error / Symptom	Root Cause	Fix
`"N/A"` becomes `0` or `NaN` in aggregations	Silent numeric coercion on ambiguous strings	Use strict parsers; reject non-numeric strings explicitly; log the raw value
New land-use code passes through as `null`	Lookup table not updated when provider released new codes	Add dynamic code discovery; quarantine unknown codes; alert data stewards
Floating-point truncation in area conversions	Standard `float` precision insufficient	Use Python `decimal.Decimal` for financial/measurement attributes; round at the last step
`Parcel_ID` and `parcel_id` treated as different fields	Case sensitivity differences across formats	Normalise all headers to lowercase `snake_case` at ingestion entry point
Valid zoning code on water parcel passes attribute checks	Cross-dimensional rule not implemented	Run spatial-attribute correlation audit after attribute validation; see OGC topology rules for spatial predicate patterns
Schema drift from provider not detected	No fingerprinting or drift alerting	Compute and compare schema fingerprints on every ingestion batch
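The header normalisation fix in the table can be sketched as follows. The function names are hypothetical, and the rule set (split camelCase, replace punctuation and spaces, lowercase) is one reasonable convention rather than a standard. The collision check matters because two source headers that differ only by case will map to the same target name.

```python
import re


def to_snake_case(name: str) -> str:
    """Normalise one header to lowercase snake_case."""
    name = name.strip()
    # Split camelCase boundaries: "parcelId" -> "parcel_Id"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse runs of non-alphanumeric characters into one underscore
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    return name.strip("_").lower()


def normalise_headers(headers: list[str]) -> list[str]:
    """Normalise all headers and refuse to continue on collisions."""
    normalised = [to_snake_case(h) for h in headers]
    collisions = sorted({h for h in normalised if normalised.count(h) > 1})
    if collisions:
        raise ValueError(f"headers collide after normalisation: {collisions}")
    return normalised


print(normalise_headers(["Parcel_ID", "LU_CODE", "assessVal", "Zoning Class"]))
```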

Performance and Scale Considerations

For datasets below ~500 k features, the pydantic-based row iteration pattern in Step 3 is acceptable. Beyond that threshold, the Python-level loop becomes the bottleneck. Migrate to vectorised approaches:

Vectorised type enforcement with pandera: pandera DataFrameSchema checks run as vectorised pandas/numpy operations rather than Python-level loops, which typically cuts attribute validation time substantially on large datasets. Benchmark on your own data before committing to a threshold.

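import pandera as pa

parcel_schema = pa.DataFrameSchema({
    "parcel_id":      pa.Column(str,   pa.Check.str_matches(r"^[A-Z0-9]{8,12}$")),
    "land_use_code":  pa.Column(int,   pa.Check.in_range(100, 999)),
    "assessed_value": pa.Column(float, pa.Check.greater_than_or_equal_to(0.0)),
})

attrs = clean_gdf.drop(columns=["geometry"])
try:
    # lazy=True collects every failure before raising instead of stopping at the first
    validated_df = parcel_schema.validate(attrs, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)   # one row per failing value
    raise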

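Chunked ingestion with pyogrio: Read large files in windows to bound memory use. gpd.read_file has no chunksize parameter, so page through the layer with pyogrio's skip_features and max_features arguments. Process each window through the validation pipeline independently and concatenate clean outputs.

import pandas as pd
import pyogrio

path = "large_parcels.gpkg"
chunk_size = 50_000
total = pyogrio.read_info(path)["features"]
clean_chunks = []

for offset in range(0, total, chunk_size):
    # Read one window of features; offsets are zero-based
    chunk_gdf = pyogrio.read_dataframe(path, skip_features=offset, max_features=chunk_size)
    clean, _ = validate_and_transform(chunk_gdf)   # persist the quarantine list in real use
    clean_chunks.append(clean)

# pd.concat on GeoDataFrames returns a GeoDataFrame and keeps the shared CRS
result = pd.concat(clean_chunks, ignore_index=True)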

Distributed execution: When ingestion batches exceed roughly 5 million rows (the same rule of thumb given in the FAQ below) or when validation includes cross-joins against large reference tables, move to asynchronous validation workflows using Celery or Dask. Attribute validation tasks parallelize cleanly because each row is independent; spatial cross-checks require partition-aware broadcasting of reference geometries.

Integration with the Validation Pipeline

Attribute schema mapping occupies the ingestion stage of the broader validation directed acyclic graph (DAG). It runs immediately after raw data lands in the staging area and before any spatial predicate or topology check executes. This ordering matters: spatial operations on malformed attribute data can silently propagate bad records through joins and dissolves without ever raising an exception.

The output of the mapping stage feeds into categorizing and prioritizing spatial errors — attributes carry their own severity dimension (blocker for missing required fields, warning for out-of-range values, informational for deprecated codes). Error records from the quarantine table should flow through the same severity classifier as geometry errors so that downstream reporting presents a unified quality scorecard.
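The three attribute severities named above can be sketched as a small classifier. The failure-kind labels and the classify function are hypothetical. Defaulting unknown kinds to blocker is a conservative assumption made for this example, not a rule stated by any standard.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFORMATIONAL = 1
    WARNING = 2
    BLOCKER = 3


# Hypothetical failure kinds, following the three examples in the text
SEVERITY_BY_KIND = {
    "missing_required": Severity.BLOCKER,
    "out_of_range": Severity.WARNING,
    "deprecated_code": Severity.INFORMATIONAL,
}


def classify(errors: list[dict]) -> list[dict]:
    """Attach a severity label to each error and sort most severe first."""
    tagged = [
        # Unknown kinds default to BLOCKER so nothing slips through unreviewed
        {**err, "severity": SEVERITY_BY_KIND.get(err["kind"], Severity.BLOCKER).name.lower()}
        for err in errors
    ]
    return sorted(tagged, key=lambda e: Severity[e["severity"].upper()], reverse=True)


sample = [
    {"row_index": 1, "kind": "deprecated_code"},
    {"row_index": 2, "kind": "missing_required"},
    {"row_index": 3, "kind": "out_of_range"},
]
for item in classify(sample):
    print(item)
```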

For organisations managing decades of legacy GIS data, attribute schema mapping must also be version-aware. Migration scripts that preserve historical attribute semantics while enforcing modern standards — and that can roll back if a transformation rule is later found incorrect — are as critical to data lineage as the spatial transformations they accompany. Treat schema manifests as infrastructure: version them, review them, and test them in CI.

Frequently Asked Questions

Why do attributes need their own validation stage separate from geometry checks?

Geometry checks operate on coordinate arrays and topological predicates. Attribute checks operate on typed tabular data — enums, date strings, numeric ranges, business keys. They fail differently and are remediated differently. Running attribute validation first is cheaper (no spatial index required) and removes defective records before expensive spatial predicates execute, preventing silent error propagation into joins and dissolves.

What is the safest way to handle unmapped enum codes from upstream providers?

Quarantine unknown codes immediately — never map them to a default value. Log the raw code, feature identifier, and ingestion timestamp in a dedicated error table. Alert data stewards through your monitoring integration. Silent coercion of unknown codes destroys audit traceability and can corrupt statistics that downstream systems have already computed.
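A dedicated error table of this kind can be sketched with the standard-library sqlite3 module. The table name, column names, and helper functions are hypothetical; a production pipeline would more likely write to its existing database.

```python
import sqlite3
from datetime import datetime, timezone


def open_quarantine(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the quarantine store and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS unmapped_codes (
               feature_id  TEXT NOT NULL,
               field       TEXT NOT NULL,
               raw_code    TEXT NOT NULL,
               ingested_at TEXT NOT NULL
           )"""
    )
    return conn


def quarantine_code(conn: sqlite3.Connection, feature_id: str, field: str, raw_code: object) -> None:
    """Record the raw code exactly as received, with a UTC timestamp."""
    conn.execute(
        "INSERT INTO unmapped_codes VALUES (?, ?, ?, ?)",
        (feature_id, field, str(raw_code), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


conn = open_quarantine()
quarantine_code(conn, "AAAA0001", "land_use_code", 24)
print(conn.execute("SELECT * FROM unmapped_codes").fetchall())
```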

How should I version my mapping matrix?

Store mapping matrices as JSON or YAML manifests in version control alongside the pipeline code. Tag each matrix with the source schema version, target schema version, and effective date. Use semantic versioning: patch bumps for lookup table additions, minor bumps for new field mappings, major bumps for structural changes that alter backward compatibility. Compute a schema fingerprint on every ingestion run and fail the pipeline when it deviates from the registered baseline.
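The bump rules above can be encoded so that a CI job applies them consistently. This is a minimal sketch; the change-type labels and the bump_version name are hypothetical.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next mapping-matrix version for a given change type."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "structural":           # breaks backward compatibility
        return f"{major + 1}.0.0"
    if change == "new_field_mapping":    # adds a field mapping
        return f"{major}.{minor + 1}.0"
    if change == "lookup_addition":      # adds entries to a lookup table
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump_version("1.4.2", "lookup_addition"))
print(bump_version("1.4.2", "new_field_mapping"))
print(bump_version("1.4.2", "structural"))
```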

When does attribute mapping need to move from single-node Python to a distributed engine?

When ingestion batches exceed roughly 5 million rows or when transformation includes expensive cross-joins against reference tables larger than available RAM, move to Dask or Apache Sedona. The threshold is not purely row count — complex regex validation and nested JSON parsing per row can saturate a single core well below 1 million features. Profile first; optimise vectorised pandas/pandera paths before introducing distributed overhead.

Related:

Mapping Attribute Constraints to GeoJSON Schemas — translating pydantic models to JSON Schema for GeoJSON validation
Geometry Validity Checks for Vector Data — running spatial predicate checks on the same feature set
Understanding OGC Topology Rules — spatial constraints that interact with attribute classifications
Coordinate Reference System Precision Standards — unit conversion requirements that attribute mapping depends on
Categorizing and Prioritizing Spatial Errors — unified severity model for attribute and geometry failures
