Skip to content

Derived Features & Privacy Boundary

Status & scope

  • Stage: IMPLEMENTED
  • Module: parallax/ops/fusion/derived_features.py
  • Milestone: M5 (EU AI Act Compliance)

Purpose

Raw PII must never cross a federation boundary. The EU AI Act (Articles 10, 12, 15), GDPR Article 25 (data protection by design and by default), and the data minimisation principle all require that entity resolution operates on the minimum data necessary.

Derived features solve this by applying one-way transforms to PII fields before wire exchange. A name becomes a Soundex code, a date of birth becomes a year, a phone number becomes a SHA-256 hash. The scoring engine then operates on these derived values using appropriate metrics. The result: federation-grade entity resolution without any federate seeing another's raw PII.

Three guarantees:

  1. No raw PII in derived vectors — every field is transformed before leaving the node.
  2. One-way transforms — derived values cannot be reversed to recover the original.
  3. Reduced precision — derived values deliberately discard detail (postcode area, birth year) to limit re-identification risk.

Architecture

Derived features sit between normalization and wire exchange in the pipeline:

Raw Records
    │
    ▼
Normalization (casefolding, alias expansion, whitespace)
    │
    ▼
┌─────────────────────────────────────┐
│  DERIVED FEATURE EXTRACTION         │  ← component.parallax.derived-features: this boundary
│  derive_features()                  │
│  - Applies one-way transform per    │
│    field based on derivation type   │
│  - Drops fields not in match func   │
│  - SHA-256 fallback for unknowns    │
└─────────────────────────────────────┘
    │
    ▼
Derived Feature Vectors (safe for wire exchange)
    │
    ▼
Blocking (already one-way: soundex, year keys)
    │
    ▼
Scoring (derived metrics: exact on hashes, levenshtein on prefixes)
    │
    ▼
Clustering → Results

In the three-phase federation protocol:

  • Phase 1 (counts only): No PII exchanged at all.
  • Phase 2 (targeted vectors): Derived features are exchanged, never raw values.
  • Phase 3 (consensus scoring): Scoring uses score_pair_derived() on derived vectors.

Derivation Type Registry

The DERIVATION_TYPE_REGISTRY maps each derivation type to a transform function, the metric to use on derived values, and the CDG XSD validation pattern for the output.

Type Function Produces CDG Pattern Derived Metric One-Way Property
soundex _derive_soundex(value: str) -> str Phonetic code, e.g. "Smith" -> "S530" [A-Z][0-9]{3} exact Many names map to same code (Smith, Smyth, Snead all -> S530). Cannot recover original.
year _derive_year(value: str) -> str 4-digit year, e.g. "1985-03-15" -> "1985" [0-9]{4} exact Month and day discarded. 365 dates map to one year value.
postcode_area _derive_postcode_area(value: str) -> str UK outcode prefix, e.g. "SW1A 1AA" -> "SW1A" [A-Z]{1,2}[0-9][0-9A-Z]? levenshtein Incode (house-level precision) discarded. Thousands of addresses share one outcode.
sha256 _derive_sha256(value: str) -> str Cryptographic hash, e.g. "07700900123" -> "a1b2c3..." (64 hex chars) [0-9a-f]{64} exact Computationally infeasible to reverse. Input is lowercased and stripped before hashing.
geohash _derive_geohash(value: str) -> str Spatial hash at precision 5 (~5km cells), e.g. "51.5074,-0.1278" -> "gcpvj" [0-9a-z]{4,8} geohash_match Precise coordinates reduced to ~5km cell. Cannot recover exact location.
temporal_bucket _derive_temporal_bucket(value: str) -> str YYYY-MM bucket, e.g. "2025-03-15T10:30:00" -> "2025-03" [0-9]{4}(-[0-9]{2})? exact Day and time discarded. All events in a month collapse to one value.
casefold _derive_casefold(value: str) -> str Lowercased text, e.g. "John SMITH" -> "john smith" .+ levenshtein Lowest assurance — text is still readable. Only useful when other derivations are too lossy. Not one-way in the cryptographic sense.
phonetic _derive_phonetic(value: str) -> str Metaphone code, e.g. "Smith" -> "SM0" [A-Z]{1,8} exact Broader phonetic grouping than Soundex. Many names share a code. Cannot recover original spelling.

Registry Structure

DERIVATION_TYPE_REGISTRY: dict[str, tuple[callable, str, str]] = {
    # type_name: (derive_function, derived_metric, cdg_xsd_pattern)
    "soundex":         (_derive_soundex,          "exact",         r"[A-Z][0-9]{3}"),
    "year":            (_derive_year,             "exact",         r"[0-9]{4}"),
    "postcode_area":   (_derive_postcode_area,    "levenshtein",   r"[A-Z]{1,2}[0-9][0-9A-Z]?"),
    "sha256":          (_derive_sha256,           "exact",         r"[0-9a-f]{64}"),
    "geohash":         (_derive_geohash,          "geohash_match", r"[0-9a-z]{4,8}"),
    "temporal_bucket": (_derive_temporal_bucket,   "exact",         r"[0-9]{4}(-[0-9]{2})?"),
    "casefold":        (_derive_casefold,         "levenshtein",   r".+"),
    "phonetic":        (_derive_phonetic,         "exact",         r"[A-Z]{1,8}"),
}

Legacy vs Lens-Driven Derivation

Two derivation modes exist for backward compatibility.

Legacy: DERIVATION_RULES (field-name-based)

The original implementation maps field names to derivation functions. This is hardcoded to the VRS screening lens field names.

DERIVATION_RULES: dict[str, callable] = {
    "full_name":      _derive_soundex,
    "date_of_birth":  _derive_year,
    "postcode":       _derive_postcode_area,
    "phone":          _derive_sha256,
    "email":          _derive_sha256,
}

DERIVED_METRICS: dict[str, str] = {
    "full_name":      "exact",
    "date_of_birth":  "exact",
    "postcode":       "levenshtein",
    "phone":          "exact",
    "email":          "exact",
}

Limitation: Only works for fields named exactly full_name, date_of_birth, etc. Cannot be reused for different domains or lenses with different field names.

Lens-Driven: derivation_map parameter

New lenses should pass a derivation_map dict to derive_features() and build_derived_match_function(). The map specifies {field_name: derivation_type} using types from DERIVATION_TYPE_REGISTRY.

derivation_map = {
    "full_name": "casefold",           # override: use casefold instead of soundex
    "date_of_birth": "temporal_bucket", # override: YYYY-MM instead of year-only
    "location": "geohash",             # new field: not in legacy rules
}

Precedence rules for derive_features():

  1. If derivation_map is provided and contains the field: use DERIVATION_TYPE_REGISTRY[derivation_map[field]].
  2. Else if field is in legacy DERIVATION_RULES: use the legacy function.
  3. Else: apply _derive_sha256 as safe default.

Public API

derive_features()

def derive_features(
    records: list[dict],
    match_function: list[dict],
    id_field: str = "local_id",
    derivation_map: dict[str, str] | None = None,
) -> list[dict]:

Transforms raw records into derived feature vectors. Only fields listed in match_function are included. Fields without a derivation rule get SHA-256 hashed. The id_field is preserved as-is (it is a local reference, not PII).

Args: - records — raw records with PII field values (post-normalization). - match_function — list of {field, metric, weight} dicts defining which fields to derive. - id_field — record ID field name (default "local_id"). - derivation_map — optional {field_name: derivation_type} from the lens. When provided, uses DERIVATION_TYPE_REGISTRY instead of legacy DERIVATION_RULES.

Returns: list of dicts with {id_field, derived_field_1, derived_field_2, ...}. No extra fields beyond id_field and match function fields.

build_derived_match_function()

def build_derived_match_function(
    match_function: list[dict],
    derivation_map: dict[str, str] | None = None,
) -> list[dict]:

Builds a match function suitable for scoring derived features. Replaces each field's metric with the derived-appropriate metric (e.g., jaro_winkler becomes exact for Soundex codes) while preserving weights.

Args: - match_function — original match function from the lens. - derivation_map — optional lens-driven derivation map for metric lookup.

Returns: list of {field, metric, weight} dicts with derived metrics.

score_pair_derived()

def score_pair_derived(
    derived_a: dict,
    derived_b: dict,
    derived_match_function: list[dict],
    null_penalty: float = 0.1,
    id_field: str = "local_id",
) -> ScoredPair:

Scores two derived feature vectors. Delegates to scorer.score_pair() with the derived match function. Derived vectors contain only one-way values — never raw PII.

Args: - derived_a, derived_b — derived feature vectors from derive_features(). - derived_match_function — from build_derived_match_function(). - null_penalty — penalty per null field (default 0.1). - id_field — record ID field name.

Returns: ScoredPair with confidence and per-field breakdown.

run_fusion_derived()

def run_fusion_derived(
    node_a: list[dict],
    node_b: list[dict],
    lens_path: str,
    threshold: float = 0.70,
    null_penalty: float = 0.1,
    id_field: str = "local_id",
    field_map_a: dict[str, str] | None = None,
    field_map_b: dict[str, str] | None = None,
) -> DerivedFusionResult:

Runs the full fusion pipeline using only derived features. Steps:

  1. Parse lens.
  2. Apply field maps and normalize.
  3. Build match function from raw data (for weight/field selection).
  4. Derive features — this is the federation boundary. After this step, only derived values exist.
  5. Blocking — uses normalized records because blocking keys are already one-way (soundex, year).
  6. Score using derived features only.

Returns: DerivedFusionResult dataclass:

@dataclass
class DerivedFusionResult:
    matches: list                             # FusionMatch instances above threshold
    total_candidates: int                     # candidate pairs after blocking
    total_pairs_possible: int                 # len(node_a) * len(node_b)
    derived_match_function: list[dict]        # match function with derived metrics
    raw_match_function: list[dict]            # original match function (for audit)
    all_scored: list[ScoredPair]              # all scored pairs (for analysis)

get_derivation()

def get_derivation(derivation_type: str) -> tuple[callable, str, str]:

Looks up a derivation by type name from DERIVATION_TYPE_REGISTRY.

Args: - derivation_type — one of the 8 registered type names.

Returns: (derive_fn, derived_metric, cdg_pattern) tuple.

Raises: KeyError if type not found, with message listing available types.

CDG Compatibility

The CDG (Common Data Gateway) validates exchanged data against XSD patterns. Derived features are designed so that every derivation type (except casefold) produces output matching a strict XSD pattern:

Derivation Type CDG-Safe? XSD Pattern Notes
soundex Yes [A-Z][0-9]{3} Fixed 4-character code
year Yes [0-9]{4} Fixed 4-digit year
postcode_area Yes [A-Z]{1,2}[0-9][0-9A-Z]? UK outcode format
sha256 Yes [0-9a-f]{64} Fixed 64-character hex
geohash Yes [0-9a-z]{4,8} Variable-length alphanumeric
temporal_bucket Yes [0-9]{4}(-[0-9]{2})? YYYY or YYYY-MM
casefold Low assurance .+ Free text — readable, not one-way
phonetic Yes [A-Z]{1,8} Variable-length uppercase code

The CDG validates the derived values, not raw PII. The XSD patterns are properties of the derivation type, not the field — so the same engine works for any domain.

Privacy Properties

1. No raw PII in derived vectors

derive_features() produces vectors containing only: - The id_field (a local reference, not PII). - Match function fields, each transformed by a derivation function. - No extra fields — fields not in the match function are dropped entirely.

2. One-way transforms

Transform Why it cannot be reversed
Soundex Lossy phonetic mapping. S530 could be Smith, Smyth, Snead, or hundreds of other names.
Year 365 dates collapse to one value. Month and day are gone.
Postcode area Thousands of full postcodes share one outcode.
SHA-256 Cryptographically one-way. No known preimage attack.
Geohash Precise lat/lon reduced to ~5km cell. Exact position lost.
Temporal bucket Day and time discarded. ~30 days map to one bucket.
Phonetic (Metaphone) Lossy phonetic mapping, broader grouping than Soundex.

Exception: casefold preserves the text content (lowercased). It is the lowest-assurance derivation and should only be used when other types are too lossy for acceptable recall.

3. Reduced precision

Derived values deliberately discard detail to limit re-identification risk: - Full name -> 4-character phonetic code - Full date of birth -> 4-digit year - Full UK postcode (house-level) -> outcode prefix (district-level) - Exact coordinates -> ~5km geohash cell - Full timestamp -> month-level bucket

4. Field minimisation

Fields not listed in the match function are never included in derived vectors. If a record has 20 fields but the match function references 5, the derived vector contains only those 5 plus the ID.

Safe Default

Fields without a derivation rule (neither in derivation_map nor in legacy DERIVATION_RULES) are hashed with SHA-256. This is the safe default:

# Fallback in derive_features():
field_derivers[field_name] = _derive_sha256

SHA-256 is the most conservative derivation — it never leaks any information about the input value. The trade-off is that only exact matches can be detected (binary equality of hashes). For fields where graduated similarity is needed, an explicit derivation type should be configured in the lens.

Test Coverage

TestDerivationRules — Unit tests for each derivation function

  • Soundex produces correct codes ("Smith" -> "S530")
  • Alias expansion before Soundex ("Jim" and "James" -> same code)
  • Year extraction from ISO dates
  • Postcode area extraction for UK formats ("SW1A 1AA" -> "SW1A", "E1 6AN" -> "E1", "E14 6AN" -> "E14")
  • SHA-256 determinism, case normalisation, correct length
  • Empty input handling for all derivation functions

TestDeriveFeatures — Integration tests for derive_features()

  • Derived vector contains no raw name, DOB, postcode, phone, or email
  • Name field matches Soundex pattern [A-Z]\d{3}
  • DOB contains only 4-digit year, no month or day
  • Phone and email are 64-character SHA-256 hashes
  • ID field preserved unchanged
  • Null fields produce empty strings
  • Full sweep of all VRS records: no raw PII appears in any derived field

TestDerivedMatchFunction — Metric replacement tests

  • jaro_winkler replaced with exact for Soundex-derived names
  • levenshtein used for postcode area
  • Weights preserved from original match function

TestScoreDerived — Derived scoring tests

  • Identical derived vectors score 1.0
  • Different Soundex codes produce low confidence

TestFullDerivedPipeline — End-to-end VRS test suite with derived features

  • Finds matches (non-empty result set)
  • At least 10 true positives at threshold 0.50
  • Zero false positives
  • P018 (two different John Smiths) correctly rejected
  • Precision >= 0.90
  • Derived vs raw comparison: recall delta < 0.40

TestEUAIActCompliance — Compliance-specific assertions

  • No raw PII value appears in any derived vector across all VRS test records
  • Derivations are one-way (Smith/Smyth same Soundex, different dates same year)
  • Postcode area loses house-level precision
  • Derived vectors contain only match function fields plus ID (no extra fields)

TestDerivationTypeRegistry — Registry structure tests

  • Registry contains exactly 8 types
  • Each entry is a (callable, str, str) tuple
  • get_derivation() returns correct entries
  • Unknown type raises KeyError with descriptive message

TestNewDerivationFunctions — Unit tests for geohash, temporal_bucket, casefold, phonetic

  • Geohash produces correct-length codes from lat/lon, handles empty and invalid input
  • Nearby coordinates share geohash prefix
  • Temporal bucket extracts YYYY-MM from ISO dates, falls back to year
  • Casefold handles Unicode ("Strasse" from "Strasse"), strips whitespace
  • Phonetic produces consistent codes, similar names share codes

TestDerivationMap — Lens-driven derivation tests

  • derivation_map overrides legacy rules (casefold instead of soundex)
  • Partial map: unmapped fields fall back to legacy rules
  • Unknown fields (not in map or legacy) get SHA-256
  • build_derived_match_function() uses map for metric lookup, preserves weights

Accuracy Impact

Derived features trade recall for privacy. The VRS POC benchmark:

Pipeline TP FP Precision Recall F1
Raw features 18/20 0 1.000 0.900 0.947
Derived features 18/20 0 1.000 0.900 0.947

The recall delta is within acceptable bounds. P018 (false positive trap) is correctly rejected by both pipelines.

Files

File Purpose
parallax/ops/fusion/derived_features.py All derivation functions, registries, public API, DerivedFusionResult
tests/test_derived_features.py 7 test classes, compliance assertions, end-to-end pipeline tests
parallax/ops/fusion/scorer.py score_pair() and ScoredPair — used by score_pair_derived()
parallax/ops/fusion/blocker.py generate_blocking_keys(), generate_candidates() — used by run_fusion_derived()
parallax/ops/fusion/normalizer.py normalize_records() — pre-derivation normalization
parallax/ops/fusion/transforms/text.py alias_expand() — used by _derive_soundex()
parallax/ops/fusion/transforms/temporal.py extract_year() — used by _derive_year() and _derive_temporal_bucket()
parallax/ops/fusion/transforms/geo.py geohash_encode() — used by _derive_geohash()
parallax/ops/fusion/pipeline.py _apply_field_map(), _build_match_function(), _build_blocking_key_sets(), FusionMatch — used by run_fusion_derived()
parallax/ops/fusion/lens_parser.py parse_lens() — used by run_fusion_derived()

Integration Points

  • component.parallax.lens-parser -> here: Lens parser provides field definitions and weights that drive derivation.
  • component.parallax.scoring-engine -> here: score_pair() and ScoredPair are reused for derived scoring with different metrics.
  • component.parallax.primitives-framework -> here: Normalization transforms (alias_expand, extract_year, geohash_encode) are composed into derivation functions.
  • Here -> Three-Phase Protocol: Phase 2 exchanges derived feature vectors, not raw PII. derive_features() is the federation privacy boundary.

Depends on: component.parallax.lens-parser, component.parallax.primitives-framework, component.parallax.scoring-engine

Realizes: product.fusion

Required by: component.parallax.cds-message-mapping, component.parallax.three-phase-protocol, component.parallax.vrs-rest-api