Derived Features & Privacy Boundary

Status & scope

Stage: IMPLEMENTED
Module: parallax/ops/fusion/derived_features.py
Milestone: M5 (EU AI Act Compliance)

Purpose

Raw PII must never cross a federation boundary. The EU AI Act (Articles 10, 12, 15), GDPR Article 25 (data protection by design and by default), and the data minimisation principle all require that entity resolution operates on the minimum data necessary.

Derived features solve this by applying one-way transforms to PII fields before wire exchange. A name becomes a Soundex code, a date of birth becomes a year, a phone number becomes a SHA-256 hash. The scoring engine then operates on these derived values using appropriate metrics. The result: federation-grade entity resolution without any federate seeing another's raw PII.

Three guarantees:

No raw PII in derived vectors — every field is transformed before leaving the node.
One-way transforms — derived values cannot be reversed to recover the original.
Reduced precision — derived values deliberately discard detail (postcode area, birth year) to limit re-identification risk.

Architecture

Derived features sit between normalization and wire exchange in the pipeline:

Raw Records
    │
    ▼
Normalization (casefolding, alias expansion, whitespace)
    │
    ▼
┌─────────────────────────────────────┐
│  DERIVED FEATURE EXTRACTION         │  ← component.parallax.derived-features: this boundary
│  derive_features()                  │
│  - Applies one-way transform per    │
│    field based on derivation type   │
│  - Drops fields not in match func   │
│  - SHA-256 fallback for unknowns    │
└─────────────────────────────────────┘
    │
    ▼
Derived Feature Vectors (safe for wire exchange)
    │
    ▼
Blocking (already one-way: soundex, year keys)
    │
    ▼
Scoring (derived metrics: exact on hashes, levenshtein on prefixes)
    │
    ▼
Clustering → Results

In the three-phase federation protocol:

Phase 1 (counts only): No PII exchanged at all.
Phase 2 (targeted vectors): Derived features are exchanged, never raw values.
Phase 3 (consensus scoring): Scoring uses score_pair_derived() on derived vectors.

Derivation Type Registry

The DERIVATION_TYPE_REGISTRY maps each derivation type to a transform function, the metric to use on derived values, and the CDG XSD validation pattern for the output.

Type	Function	Produces	CDG Pattern	Derived Metric	One-Way Property
`soundex`	`_derive_soundex(value: str) -> str`	Phonetic code, e.g. `"Smith"` -> `"S530"`	`[A-Z][0-9]{3}`	`exact`	Many names map to same code (Smith, Smyth, Snead all -> S530). Cannot recover original.
`year`	`_derive_year(value: str) -> str`	4-digit year, e.g. `"1985-03-15"` -> `"1985"`	`[0-9]{4}`	`exact`	Month and day discarded. 365 dates map to one year value.
`postcode_area`	`_derive_postcode_area(value: str) -> str`	UK outcode prefix, e.g. `"SW1A 1AA"` -> `"SW1A"`	`[A-Z]{1,2}[0-9][0-9A-Z]?`	`levenshtein`	Incode (house-level precision) discarded. Thousands of addresses share one outcode.
`sha256`	`_derive_sha256(value: str) -> str`	Cryptographic hash, e.g. `"07700900123"` -> `"a1b2c3..."` (64 hex chars)	`[0-9a-f]{64}`	`exact`	Computationally infeasible to reverse. Input is lowercased and stripped before hashing.
`geohash`	`_derive_geohash(value: str) -> str`	Spatial hash at precision 5 (~5km cells), e.g. `"51.5074,-0.1278"` -> `"gcpvj"`	`[0-9a-z]{4,8}`	`geohash_match`	Precise coordinates reduced to ~5km cell. Cannot recover exact location.
`temporal_bucket`	`_derive_temporal_bucket(value: str) -> str`	YYYY-MM bucket, e.g. `"2025-03-15T10:30:00"` -> `"2025-03"`	`[0-9]{4}(-[0-9]{2})?`	`exact`	Day and time discarded. All events in a month collapse to one value.
`casefold`	`_derive_casefold(value: str) -> str`	Lowercased text, e.g. `"John SMITH"` -> `"john smith"`	`.+`	`levenshtein`	Lowest assurance — text is still readable. Only useful when other derivations are too lossy. Not one-way in the cryptographic sense.
`phonetic`	`_derive_phonetic(value: str) -> str`	Metaphone code, e.g. `"Smith"` -> `"SM0"`	`[A-Z]{1,8}`	`exact`	Broader phonetic grouping than Soundex. Many names share a code. Cannot recover original spelling.
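
To make the many-to-one behaviour of the `soundex` row concrete, here is a minimal sketch of standard American Soundex. It is an illustration only: `soundex_sketch` is a hypothetical name, and the real `_derive_soundex()` also applies alias expansion first, which is omitted here.

```python
# Letter-to-digit table for American Soundex. Vowels, H, W and Y carry no code.
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex_sketch(name: str) -> str:
    """Return a 4-character Soundex code, or "" for input with no ASCII letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""  # assumption: empty input yields an empty derived value
    first = letters[0]
    digits = []
    prev = _SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    # Pad with zeros and truncate to the fixed 4-character form.
    return (first + "".join(digits) + "000")[:4]
```

Because several spellings collapse to one code, the derived value alone does not identify which name produced it.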

Registry Structure

DERIVATION_TYPE_REGISTRY: dict[str, tuple[callable, str, str]] = {
    # type_name: (derive_function, derived_metric, cdg_xsd_pattern)
    "soundex":         (_derive_soundex,          "exact",         r"[A-Z][0-9]{3}"),
    "year":            (_derive_year,             "exact",         r"[0-9]{4}"),
    "postcode_area":   (_derive_postcode_area,    "levenshtein",   r"[A-Z]{1,2}[0-9][0-9A-Z]?"),
    "sha256":          (_derive_sha256,           "exact",         r"[0-9a-f]{64}"),
    "geohash":         (_derive_geohash,          "geohash_match", r"[0-9a-z]{4,8}"),
    "temporal_bucket": (_derive_temporal_bucket,   "exact",         r"[0-9]{4}(-[0-9]{2})?"),
    "casefold":        (_derive_casefold,         "levenshtein",   r".+"),
    "phonetic":        (_derive_phonetic,         "exact",         r"[A-Z]{1,8}"),
}

Legacy vs Lens-Driven Derivation

Two derivation modes exist for backward compatibility.

Legacy: `DERIVATION_RULES` (field-name-based)

The original implementation maps field names to derivation functions. This is hardcoded to the VRS screening lens field names.

DERIVATION_RULES: dict[str, callable] = {
    "full_name":      _derive_soundex,
    "date_of_birth":  _derive_year,
    "postcode":       _derive_postcode_area,
    "phone":          _derive_sha256,
    "email":          _derive_sha256,
}

DERIVED_METRICS: dict[str, str] = {
    "full_name":      "exact",
    "date_of_birth":  "exact",
    "postcode":       "levenshtein",
    "phone":          "exact",
    "email":          "exact",
}

Limitation: Only works for fields named exactly full_name, date_of_birth, etc. Cannot be reused for different domains or lenses with different field names.

Lens-Driven: `derivation_map` parameter

New lenses should pass a derivation_map dict to derive_features() and build_derived_match_function(). The map specifies {field_name: derivation_type} using types from DERIVATION_TYPE_REGISTRY.

derivation_map = {
    "full_name": "casefold",           # override: use casefold instead of soundex
    "date_of_birth": "temporal_bucket", # override: YYYY-MM instead of year-only
    "location": "geohash",             # new field: not in legacy rules
}

Precedence rules for derive_features():

If derivation_map is provided and contains the field: use DERIVATION_TYPE_REGISTRY[derivation_map[field]].
Else if field is in legacy DERIVATION_RULES: use the legacy function.
Else: apply _derive_sha256 as safe default.

Public API

`derive_features()`

def derive_features(
    records: list[dict],
    match_function: list[dict],
    id_field: str = "local_id",
    derivation_map: dict[str, str] | None = None,
) -> list[dict]:

Transforms raw records into derived feature vectors. Only fields listed in match_function are included. Fields without a derivation rule get SHA-256 hashed. The id_field is preserved as-is (it is a local reference, not PII).

Args:
- records: raw records with PII field values (post-normalization).
- match_function: list of {field, metric, weight} dicts defining which fields to derive.
- id_field: record ID field name (default "local_id").
- derivation_map: optional {field_name: derivation_type} from the lens. Fields present in the map use DERIVATION_TYPE_REGISTRY; all other fields fall back to legacy DERIVATION_RULES, then to SHA-256.

Returns: list of dicts with {id_field, derived_field_1, derived_field_2, ...}. No extra fields beyond id_field and match function fields.
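
The shape of the returned vectors can be illustrated with a reduced sketch. It is not the real function: `derive_features_sketch` and its `derivers` argument are hypothetical, the per-field derivers are simplified, and mapping null values to empty strings follows the behaviour listed under Test Coverage.

```python
import hashlib


def _sha256(value: str) -> str:
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def _year(value: str) -> str:
    return value.strip()[:4]  # simplified stand-in: assumes an ISO date string


def derive_features_sketch(records, match_function, derivers, id_field="local_id"):
    """Build derived vectors containing only the ID and the match function fields."""
    fields = [entry["field"] for entry in match_function]
    derived = []
    for record in records:
        vector = {id_field: record[id_field]}  # ID is preserved unchanged
        for field in fields:
            raw = record.get(field)
            if raw is None or raw == "":
                vector[field] = ""  # null fields become empty strings
            else:
                # Unknown fields fall back to SHA-256 (the safe default).
                vector[field] = derivers.get(field, _sha256)(str(raw))
        derived.append(vector)
    return derived


records = [{
    "local_id": "A1",
    "date_of_birth": "1985-03-15",
    "phone": "07700900123",
    "email": None,
    "address": "10 Example Street",  # not in the match function, so it is dropped
}]
match_function = [
    {"field": "date_of_birth", "metric": "exact", "weight": 0.4},
    {"field": "phone", "metric": "exact", "weight": 0.3},
    {"field": "email", "metric": "exact", "weight": 0.3},
]
vectors = derive_features_sketch(records, match_function, {"date_of_birth": _year})
```

The `address` value never reaches the output because the loop iterates over match function fields, not over record keys.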

`build_derived_match_function()`

def build_derived_match_function(
    match_function: list[dict],
    derivation_map: dict[str, str] | None = None,
) -> list[dict]:

Builds a match function suitable for scoring derived features. Replaces each field's metric with the derived-appropriate metric (e.g., jaro_winkler becomes exact for Soundex codes) while preserving weights.

Args:
- match_function: original match function from the lens.
- derivation_map: optional lens-driven derivation map for metric lookup.

Returns: list of {field, metric, weight} dicts with derived metrics.
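
The metric replacement can be sketched as follows. The lookup tables mirror the metrics listed in this document, but `build_derived_match_function_sketch` and the table names are hypothetical, and defaulting unknown fields to `exact` is an assumption that follows from the SHA-256 fallback.

```python
# Derived metric per derivation type, copied from the registry table.
METRIC_BY_TYPE = {
    "soundex": "exact",
    "year": "exact",
    "postcode_area": "levenshtein",
    "sha256": "exact",
    "geohash": "geohash_match",
    "temporal_bucket": "exact",
    "casefold": "levenshtein",
    "phonetic": "exact",
}

# Legacy per-field metrics, copied from DERIVED_METRICS.
LEGACY_DERIVED_METRICS = {
    "full_name": "exact",
    "date_of_birth": "exact",
    "postcode": "levenshtein",
    "phone": "exact",
    "email": "exact",
}


def build_derived_match_function_sketch(match_function, derivation_map=None):
    """Swap each field's metric for its derived-appropriate one, keeping weights."""
    derivation_map = derivation_map or {}
    derived = []
    for entry in match_function:
        field = entry["field"]
        if field in derivation_map:
            metric = METRIC_BY_TYPE[derivation_map[field]]
        else:
            # Assumption: fields with no rule are hashed, so only "exact" applies.
            metric = LEGACY_DERIVED_METRICS.get(field, "exact")
        derived.append({**entry, "metric": metric})  # weight is carried over as-is
    return derived


raw_mf = [
    {"field": "full_name", "metric": "jaro_winkler", "weight": 0.5},
    {"field": "postcode", "metric": "exact", "weight": 0.2},
    {"field": "location", "metric": "levenshtein", "weight": 0.3},
]
derived_mf = build_derived_match_function_sketch(raw_mf, {"location": "geohash"})
```

Copying each entry with `{**entry, ...}` leaves the original match function untouched, so it stays available for audit.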

`score_pair_derived()`

def score_pair_derived(
    derived_a: dict,
    derived_b: dict,
    derived_match_function: list[dict],
    null_penalty: float = 0.1,
    id_field: str = "local_id",
) -> ScoredPair:

Scores two derived feature vectors. Delegates to scorer.score_pair() with the derived match function. Derived vectors contain only one-way values — never raw PII.

Args:
- derived_a, derived_b: derived feature vectors from derive_features().
- derived_match_function: from build_derived_match_function().
- null_penalty: penalty per null field (default 0.1).
- id_field: record ID field name.

Returns: ScoredPair with confidence and per-field breakdown.

`run_fusion_derived()`

def run_fusion_derived(
    node_a: list[dict],
    node_b: list[dict],
    lens_path: str,
    threshold: float = 0.70,
    null_penalty: float = 0.1,
    id_field: str = "local_id",
    field_map_a: dict[str, str] | None = None,
    field_map_b: dict[str, str] | None = None,
) -> DerivedFusionResult:

Runs the full fusion pipeline using only derived features. Steps:

Parse lens.
Apply field maps and normalize.
Build match function from raw data (for weight/field selection).
Derive features. This is the federation boundary: from this step on, only derived values are scored or exchanged.
Blocking. Candidate generation still reads the normalized records, which is acceptable because the blocking keys themselves are already one-way (soundex, year).
Score using derived features only.

Returns: DerivedFusionResult dataclass:

@dataclass
class DerivedFusionResult:
    matches: list                             # FusionMatch instances above threshold
    total_candidates: int                     # candidate pairs after blocking
    total_pairs_possible: int                 # len(node_a) * len(node_b)
    derived_match_function: list[dict]        # match function with derived metrics
    raw_match_function: list[dict]            # original match function (for audit)
    all_scored: list[ScoredPair]              # all scored pairs (for analysis)
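
As a usage illustration, the two counters in the result are enough to compute how much work blocking saved. The sketch below uses a stand-in dataclass with the same field names; `DerivedFusionResultSketch` and `blocking_reduction` are hypothetical helpers, not part of the module.

```python
from dataclasses import dataclass


@dataclass
class DerivedFusionResultSketch:
    """Stand-in mirroring the documented fields; element types are left loose."""
    matches: list
    total_candidates: int
    total_pairs_possible: int
    derived_match_function: list
    raw_match_function: list
    all_scored: list


def blocking_reduction(result) -> float:
    """Fraction of the full cross product that blocking avoided scoring."""
    if result.total_pairs_possible == 0:
        return 0.0
    return 1.0 - result.total_candidates / result.total_pairs_possible


# Example: 20 x 20 records, of which blocking kept 30 candidate pairs.
example = DerivedFusionResultSketch(
    matches=[],
    total_candidates=30,
    total_pairs_possible=20 * 20,
    derived_match_function=[],
    raw_match_function=[],
    all_scored=[],
)
reduction = blocking_reduction(example)
```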

`get_derivation()`

def get_derivation(derivation_type: str) -> tuple[callable, str, str]:

Looks up a derivation by type name from DERIVATION_TYPE_REGISTRY.

Args:
- derivation_type: one of the 8 registered type names.

Returns: (derive_fn, derived_metric, cdg_pattern) tuple.

Raises: KeyError if type not found, with message listing available types.
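
A lookup with this contract might look like the sketch below. The registry here is a two-entry stand-in and the exact wording of the error message is an assumption; only the behaviour (a `KeyError` that lists the available types) comes from the description above.

```python
import hashlib


def _year(value: str) -> str:
    return value.strip()[:4]  # simplified stand-in: assumes an ISO date string


def _sha256(value: str) -> str:
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical two-entry registry: (derive_fn, derived_metric, cdg_pattern).
REGISTRY_SKETCH = {
    "year": (_year, "exact", r"[0-9]{4}"),
    "sha256": (_sha256, "exact", r"[0-9a-f]{64}"),
}


def get_derivation_sketch(derivation_type: str):
    """Return the registry tuple, or raise KeyError naming the valid types."""
    try:
        return REGISTRY_SKETCH[derivation_type]
    except KeyError:
        available = ", ".join(sorted(REGISTRY_SKETCH))
        raise KeyError(
            f"Unknown derivation type {derivation_type!r}. Available: {available}"
        ) from None


derive_fn, metric, pattern = get_derivation_sketch("year")
```

Listing the valid names in the exception makes a typo in a lens file quick to diagnose.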

CDG Compatibility

The CDG (Common Data Gateway) validates exchanged data against XSD patterns. Derived features are designed so that every derivation type (except casefold) produces output matching a strict XSD pattern:

Derivation Type	CDG-Safe?	XSD Pattern	Notes
`soundex`	Yes	`[A-Z][0-9]{3}`	Fixed 4-character code
`year`	Yes	`[0-9]{4}`	Fixed 4-digit year
`postcode_area`	Yes	`[A-Z]{1,2}[0-9][0-9A-Z]?`	UK outcode format
`sha256`	Yes	`[0-9a-f]{64}`	Fixed 64-character hex
`geohash`	Yes	`[0-9a-z]{4,8}`	Variable-length alphanumeric
`temporal_bucket`	Yes	`[0-9]{4}(-[0-9]{2})?`	YYYY or YYYY-MM
`casefold`	Low assurance	`.+`	Free text — readable, not one-way
`phonetic`	Yes	`[A-Z]{1,8}`	Variable-length uppercase code

The CDG validates the derived values, not raw PII. The XSD patterns are properties of the derivation type, not the field — so the same engine works for any domain.
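
A pre-flight check against these patterns can be sketched with Python's `re` module. XSD patterns are implicitly anchored to the whole value, so `re.fullmatch` is the closest equivalent; the patterns below use only syntax common to both dialects. `is_cdg_valid` is a hypothetical helper, not a CDG API.

```python
import re

# XSD patterns per derivation type, copied from the table above.
CDG_PATTERNS = {
    "soundex": r"[A-Z][0-9]{3}",
    "year": r"[0-9]{4}",
    "postcode_area": r"[A-Z]{1,2}[0-9][0-9A-Z]?",
    "sha256": r"[0-9a-f]{64}",
    "geohash": r"[0-9a-z]{4,8}",
    "temporal_bucket": r"[0-9]{4}(-[0-9]{2})?",
    "casefold": r".+",
    "phonetic": r"[A-Z]{1,8}",
}


def is_cdg_valid(derivation_type: str, derived_value: str) -> bool:
    """True if the whole derived value matches the pattern for its type."""
    return re.fullmatch(CDG_PATTERNS[derivation_type], derived_value) is not None
```

A raw value that slips through underived, such as a full postcode in a `postcode_area` field, fails the check, which gives the gateway a second line of defence.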

Privacy Properties

1. No raw PII in derived vectors

derive_features() produces vectors containing only:
- The id_field (a local reference, not PII).
- Match function fields, each transformed by a derivation function.
- No extra fields: fields not in the match function are dropped entirely.

2. One-way transforms

Transform	Why it cannot be reversed
Soundex	Lossy phonetic mapping. `S530` could be Smith, Smyth, Snead, or hundreds of other names.
Year	365 dates collapse to one value. Month and day are gone.
Postcode area	Thousands of full postcodes share one outcode.
SHA-256	Cryptographically one-way. No known preimage attack.
Geohash	Precise lat/lon reduced to ~5km cell. Exact position lost.
Temporal bucket	Day and time discarded. ~30 days map to one bucket.
Phonetic (Metaphone)	Lossy phonetic mapping, broader grouping than Soundex.

Exception: casefold preserves the text content (lowercased). It is the lowest-assurance derivation and should only be used when other types are too lossy for acceptable recall.
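
The geohash row is a good example of a many-to-one transform. The sketch below implements the standard geohash bit-interleaving at precision 5; it is an illustration, not the project's `geohash_encode()`, and returning an empty string for unparseable input is an assumption.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_sketch(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a coordinate by repeatedly halving the lon/lat ranges."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def derive_geohash_sketch(value: str) -> str:
    """Parse "lat,lon" text; return "" when it cannot be parsed (assumption)."""
    try:
        lat_text, lon_text = value.split(",")
        return geohash_sketch(float(lat_text), float(lon_text))
    except ValueError:
        return ""
```

Every coordinate inside the same cell yields the same five characters, so the cell code cannot be turned back into the original point.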

3. Reduced precision

Derived values deliberately discard detail to limit re-identification risk:
- Full name -> 4-character phonetic code
- Full date of birth -> 4-digit year
- Full UK postcode (house-level) -> outcode prefix (district-level)
- Exact coordinates -> ~5km geohash cell
- Full timestamp -> month-level bucket
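
Three of these reductions are simple enough to sketch with regular expressions. The functions below are hypothetical stand-ins for `_derive_year()`, `_derive_temporal_bucket()` and `_derive_postcode_area()`; returning an empty string for unrecognised input is an assumption.

```python
import re


def year_sketch(value: str) -> str:
    """Keep only the leading 4-digit year of an ISO-style date."""
    match = re.match(r"\s*(\d{4})", value)
    return match.group(1) if match else ""


def temporal_bucket_sketch(value: str) -> str:
    """Keep YYYY-MM when a month is present, otherwise fall back to the year."""
    match = re.match(r"\s*(\d{4})-(\d{2})", value)
    if match:
        return f"{match.group(1)}-{match.group(2)}"
    return year_sketch(value)


# Outcode, optionally followed by the 3-character incode (digit + two letters).
_UK_POSTCODE = re.compile(r"([A-Z]{1,2}[0-9][0-9A-Z]?)([0-9][A-Z]{2})?")


def postcode_area_sketch(value: str) -> str:
    """Return the UK outcode, discarding the incode if one is present."""
    compact = "".join(value.upper().split())  # drop all whitespace
    match = _UK_POSTCODE.fullmatch(compact)
    return match.group(1) if match else ""
```

In each case the discarded part (day, time, incode) is simply absent from the output, so no later step can recover it.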

4. Field minimisation

Fields not listed in the match function are never included in derived vectors. If a record has 20 fields but the match function references 5, the derived vector contains only those 5 plus the ID.

Safe Default

Fields without a derivation rule (neither in derivation_map nor in legacy DERIVATION_RULES) are hashed with SHA-256. This is the safe default:

# Fallback in derive_features():
field_derivers[field_name] = _derive_sha256

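SHA-256 is the most conservative derivation: the digest exposes none of the input's structure (no prefix, length, or phonetic similarity), only whether two inputs are equal. Because the hash is deterministic, low-entropy values such as phone numbers can still be guessed by hashing candidate inputs, so the digest should be treated as pseudonymised rather than anonymous. The trade-off is that only exact matches can be detected (binary equality of hashes). For fields where graduated similarity is needed, an explicit derivation type should be configured in the lens.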
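
A minimal sketch of the fallback, using `hashlib` from the standard library. The lowercase-and-strip normalisation comes from the registry table; the function name and the empty-string result for empty input are assumptions.

```python
import hashlib


def derive_sha256_sketch(value: str) -> str:
    """Hash a normalised value; matching is then plain equality of digests."""
    normalised = value.strip().lower()
    if not normalised:
        return ""  # assumption: empty input stays empty instead of being hashed
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


digest_a = derive_sha256_sketch("Alice@Example.com ")
digest_b = derive_sha256_sketch("alice@example.com")
digest_c = derive_sha256_sketch("alice@example.org")
```

Here `digest_a` and `digest_b` are equal because of the normalisation, while `digest_c` differs completely, which is why only the `exact` metric is meaningful on hashed fields.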

Test Coverage

`TestDerivationRules` — Unit tests for each derivation function

Soundex produces correct codes ("Smith" -> "S530")
Alias expansion before Soundex ("Jim" and "James" -> same code)
Year extraction from ISO dates
Postcode area extraction for UK formats ("SW1A 1AA" -> "SW1A", "E1 6AN" -> "E1", "E14 6AN" -> "E14")
SHA-256 determinism, case normalisation, correct length
Empty input handling for all derivation functions

`TestDeriveFeatures` — Integration tests for `derive_features()`

Derived vector contains no raw name, DOB, postcode, phone, or email
Name field matches Soundex pattern [A-Z]\d{3}
DOB contains only 4-digit year, no month or day
Phone and email are 64-character SHA-256 hashes
ID field preserved unchanged
Null fields produce empty strings
Full sweep of all VRS records: no raw PII appears in any derived field

`TestDerivedMatchFunction` — Metric replacement tests

jaro_winkler replaced with exact for Soundex-derived names
levenshtein used for postcode area
Weights preserved from original match function

`TestScoreDerived` — Derived scoring tests

Identical derived vectors score 1.0
Different Soundex codes produce low confidence

`TestFullDerivedPipeline` — End-to-end VRS test suite with derived features

Finds matches (non-empty result set)
At least 10 true positives at threshold 0.50
Zero false positives
P018 (two different John Smiths) correctly rejected
Precision >= 0.90
Derived vs raw comparison: recall delta < 0.40

`TestEUAIActCompliance` — Compliance-specific assertions

No raw PII value appears in any derived vector across all VRS test records
Derivations are one-way (Smith/Smyth same Soundex, different dates same year)
Postcode area loses house-level precision
Derived vectors contain only match function fields plus ID (no extra fields)

`TestDerivationTypeRegistry` — Registry structure tests

Registry contains exactly 8 types
Each entry is a (callable, str, str) tuple
get_derivation() returns correct entries
Unknown type raises KeyError with descriptive message

`TestNewDerivationFunctions` — Unit tests for geohash, temporal_bucket, casefold, phonetic

Geohash produces correct-length codes from lat/lon, handles empty and invalid input
Nearby coordinates share geohash prefix
Temporal bucket extracts YYYY-MM from ISO dates, falls back to year
Casefold handles Unicode ("Straße" -> "strasse"), strips whitespace
Phonetic produces consistent codes, similar names share codes

`TestDerivationMap` — Lens-driven derivation tests

derivation_map overrides legacy rules (casefold instead of soundex)
Partial map: unmapped fields fall back to legacy rules
Unknown fields (not in map or legacy) get SHA-256
build_derived_match_function() uses map for metric lookup, preserves weights

Accuracy Impact

Derived features can trade recall for privacy. On the VRS POC benchmark, however, the two pipelines produce identical results:

Pipeline	TP	FP	Precision	Recall	F1
Raw features	18/20	0	1.000	0.900	0.947
Derived features	18/20	0	1.000	0.900	0.947

The recall delta on this benchmark is zero, well inside the < 0.40 bound asserted by TestFullDerivedPipeline. P018 (false positive trap) is correctly rejected by both pipelines.
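
The figures in the benchmark table can be reproduced from the raw counts. The sketch assumes that "18/20" means 18 of 20 true pairs were found, so 2 pairs were missed; `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard definitions, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 18 true positives, 0 false positives, 2 missed pairs (assumed from "18/20").
precision, recall, f1 = precision_recall_f1(tp=18, fp=0, fn=2)
rounded = (round(precision, 3), round(recall, 3), round(f1, 3))
```

With these counts the rounded values are 1.0, 0.9 and 0.947, matching both rows of the table.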

Files

File	Purpose
`parallax/ops/fusion/derived_features.py`	All derivation functions, registries, public API, `DerivedFusionResult`
`tests/test_derived_features.py`	9 test classes, compliance assertions, end-to-end pipeline tests
`parallax/ops/fusion/scorer.py`	`score_pair()` and `ScoredPair` — used by `score_pair_derived()`
`parallax/ops/fusion/blocker.py`	`generate_blocking_keys()`, `generate_candidates()` — used by `run_fusion_derived()`
`parallax/ops/fusion/normalizer.py`	`normalize_records()` — pre-derivation normalization
`parallax/ops/fusion/transforms/text.py`	`alias_expand()` — used by `_derive_soundex()`
`parallax/ops/fusion/transforms/temporal.py`	`extract_year()` — used by `_derive_year()` and `_derive_temporal_bucket()`
`parallax/ops/fusion/transforms/geo.py`	`geohash_encode()` — used by `_derive_geohash()`
`parallax/ops/fusion/pipeline.py`	`_apply_field_map()`, `_build_match_function()`, `_build_blocking_key_sets()`, `FusionMatch` — used by `run_fusion_derived()`
`parallax/ops/fusion/lens_parser.py`	`parse_lens()` — used by `run_fusion_derived()`

Integration Points

component.parallax.lens-parser -> here: Lens parser provides field definitions and weights that drive derivation.
component.parallax.scoring-engine -> here: score_pair() and ScoredPair are reused for derived scoring with different metrics.
component.parallax.primitives-framework -> here: Normalization transforms (alias_expand, extract_year, geohash_encode) are composed into derivation functions.
Here -> Three-Phase Protocol: Phase 2 exchanges derived feature vectors, not raw PII. derive_features() is the federation privacy boundary.

Depends on: component.parallax.lens-parser, component.parallax.primitives-framework, component.parallax.scoring-engine

Realizes: product.fusion

Required by: component.parallax.cds-message-mapping, component.parallax.three-phase-protocol, component.parallax.vrs-rest-api
