Derived Features & Privacy Boundary
Status & scope
- Stage: IMPLEMENTED
- Module: parallax/ops/fusion/derived_features.py
- Milestone: M5 (EU AI Act Compliance)
Purpose
Raw PII must never cross a federation boundary. The EU AI Act (Articles 10, 12, 15), GDPR Article 25 (data protection by design and by default), and the data minimisation principle all require that entity resolution operates on the minimum data necessary.
Derived features solve this by applying one-way transforms to PII fields before wire exchange. A name becomes a Soundex code, a date of birth becomes a year, a phone number becomes a SHA-256 hash. The scoring engine then operates on these derived values using appropriate metrics. The result: federation-grade entity resolution without any federate seeing another's raw PII.
Three guarantees:
- No raw PII in derived vectors — every field is transformed before leaving the node.
- One-way transforms — derived values cannot be reversed to recover the original.
- Reduced precision — derived values deliberately discard detail (postcode area, birth year) to limit re-identification risk.
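The reduced-precision guarantee can be illustrated with a minimal sketch. The function names below are hypothetical stand-ins, not the module's private `_derive_*` helpers, and the parsing rules (ISO-style dates, a three-character UK incode) are assumptions made for the example.

```python
import re


def derive_year(value: str) -> str:
    """Keep only a leading 4-digit year; month and day are discarded."""
    year = (value or "").strip()[:4]
    return year if len(year) == 4 and year.isascii() and year.isdigit() else ""


def derive_temporal_bucket(value: str) -> str:
    """Keep YYYY-MM when present, otherwise fall back to the year alone."""
    text = (value or "").strip()
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}", text[:7]):
        return text[:7]
    return derive_year(text)


def derive_postcode_area(value: str) -> str:
    """Drop the 3-character incode of a UK postcode and keep the outcode."""
    compact = "".join((value or "").upper().split())
    # Assumption: a full postcode has at least 5 characters once spaces are
    # removed; anything shorter is treated as an outcode already.
    return compact[:-3] if len(compact) >= 5 else compact


print(derive_year("1985-03-15"))
print(derive_temporal_bucket("2025-03-15T10:30:00"))
print(derive_postcode_area("SW1A 1AA"))
```

Each function maps many distinct inputs to the same output, which is what makes the derived value unsuitable for recovering the original.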
Architecture
Derived features sit between normalization and wire exchange in the pipeline:
Raw Records
│
▼
Normalization (casefolding, alias expansion, whitespace)
│
▼
┌─────────────────────────────────────┐
│ DERIVED FEATURE EXTRACTION          │ ← component.parallax.derived-features: this boundary
│ derive_features()                   │
│ - Applies one-way transform per     │
│   field based on derivation type    │
│ - Drops fields not in match func    │
│ - SHA-256 fallback for unknowns     │
└─────────────────────────────────────┘
│
▼
Derived Feature Vectors (safe for wire exchange)
│
▼
Blocking (already one-way: soundex, year keys)
│
▼
Scoring (derived metrics: exact on hashes, levenshtein on prefixes)
│
▼
Clustering → Results
In the three-phase federation protocol:
- Phase 1 (counts only): No PII exchanged at all.
- Phase 2 (targeted vectors): Derived features are exchanged, never raw values.
- Phase 3 (consensus scoring): Scoring uses score_pair_derived() on derived vectors.
Derivation Type Registry
The DERIVATION_TYPE_REGISTRY maps each derivation type to a transform function, the metric to use on derived values, and the CDG XSD validation pattern for the output.
| Type | Function | Produces | CDG Pattern | Derived Metric | One-Way Property |
|---|---|---|---|---|---|
| soundex | _derive_soundex(value: str) -> str | Phonetic code, e.g. "Smith" -> "S530" | [A-Z][0-9]{3} | exact | Many names map to same code (Smith, Smyth, Snead all -> S530). Cannot recover original. |
| year | _derive_year(value: str) -> str | 4-digit year, e.g. "1985-03-15" -> "1985" | [0-9]{4} | exact | Month and day discarded. 365 dates map to one year value. |
| postcode_area | _derive_postcode_area(value: str) -> str | UK outcode prefix, e.g. "SW1A 1AA" -> "SW1A" | [A-Z]{1,2}[0-9][0-9A-Z]? | levenshtein | Incode (house-level precision) discarded. Thousands of addresses share one outcode. |
| sha256 | _derive_sha256(value: str) -> str | Cryptographic hash, e.g. "07700900123" -> "a1b2c3..." (64 hex chars) | [0-9a-f]{64} | exact | Computationally infeasible to reverse. Input is lowercased and stripped before hashing. |
| geohash | _derive_geohash(value: str) -> str | Spatial hash at precision 5 (~5km cells), e.g. "51.5074,-0.1278" -> "gcpvj" | [0-9a-z]{4,8} | geohash_match | Precise coordinates reduced to ~5km cell. Cannot recover exact location. |
| temporal_bucket | _derive_temporal_bucket(value: str) -> str | YYYY-MM bucket, e.g. "2025-03-15T10:30:00" -> "2025-03" | [0-9]{4}(-[0-9]{2})? | exact | Day and time discarded. All events in a month collapse to one value. |
| casefold | _derive_casefold(value: str) -> str | Lowercased text, e.g. "John SMITH" -> "john smith" | .+ | levenshtein | Lowest assurance: text is still readable. Only useful when other derivations are too lossy. Not one-way in the cryptographic sense. |
| phonetic | _derive_phonetic(value: str) -> str | Metaphone code, e.g. "Smith" -> "SM0" | [A-Z]{1,8} | exact | Broader phonetic grouping than Soundex. Many names share a code. Cannot recover original spelling. |
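To make the many-to-one property of the soundex row concrete, here is a minimal sketch of the standard American Soundex rules. It is not the module's `_derive_soundex` (which, per the test notes, also applies alias expansion first); the function name `soundex` is an assumption for illustration.

```python
# Consonant groups of American Soundex; vowels, H, W and Y carry no digit.
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" for input with no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit:
            if digit != prev:  # adjacent letters with the same digit collapse
                code += digit
            prev = digit
        elif ch not in "HW":
            prev = ""  # a vowel (or Y) separates repeats; H and W do not
        if len(code) == 4:
            break
    return code.ljust(4, "0")  # pad short codes with zeros


for name in ("Smith", "Smyth", "Snead", "Robert"):
    print(name, soundex(name))
```

Smith, Smyth and Snead all collapse to S530, so a receiving node that sees S530 cannot tell which spelling produced it.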
Registry Structure
DERIVATION_TYPE_REGISTRY: dict[str, tuple[callable, str, str]] = {
    # type_name: (derive_function, derived_metric, cdg_xsd_pattern)
    "soundex": (_derive_soundex, "exact", r"[A-Z][0-9]{3}"),
    "year": (_derive_year, "exact", r"[0-9]{4}"),
    "postcode_area": (_derive_postcode_area, "levenshtein", r"[A-Z]{1,2}[0-9][0-9A-Z]?"),
    "sha256": (_derive_sha256, "exact", r"[0-9a-f]{64}"),
    "geohash": (_derive_geohash, "geohash_match", r"[0-9a-z]{4,8}"),
    "temporal_bucket": (_derive_temporal_bucket, "exact", r"[0-9]{4}(-[0-9]{2})?"),
    "casefold": (_derive_casefold, "levenshtein", r".+"),
    "phonetic": (_derive_phonetic, "exact", r"[A-Z]{1,8}"),
}
Legacy vs Lens-Driven Derivation
Two derivation modes exist for backward compatibility.
Legacy: DERIVATION_RULES (field-name-based)
The original implementation maps field names to derivation functions. This is hardcoded to the VRS screening lens field names.
DERIVATION_RULES: dict[str, callable] = {
    "full_name": _derive_soundex,
    "date_of_birth": _derive_year,
    "postcode": _derive_postcode_area,
    "phone": _derive_sha256,
    "email": _derive_sha256,
}
DERIVED_METRICS: dict[str, str] = {
    "full_name": "exact",
    "date_of_birth": "exact",
    "postcode": "levenshtein",
    "phone": "exact",
    "email": "exact",
}
Limitation: Only works for fields named exactly full_name, date_of_birth, etc. Cannot be reused for different domains or lenses with different field names.
Lens-Driven: derivation_map parameter
New lenses should pass a derivation_map dict to derive_features() and build_derived_match_function(). The map specifies {field_name: derivation_type} using types from DERIVATION_TYPE_REGISTRY.
derivation_map = {
    "full_name": "casefold",             # override: use casefold instead of soundex
    "date_of_birth": "temporal_bucket",  # override: YYYY-MM instead of year-only
    "location": "geohash",               # new field: not in legacy rules
}
Precedence rules for derive_features():
- If derivation_map is provided and contains the field: use DERIVATION_TYPE_REGISTRY[derivation_map[field]].
- Else if field is in legacy DERIVATION_RULES: use the legacy function.
- Else: apply _derive_sha256 as safe default.
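The three precedence rules can be sketched as a single lookup function. Everything here is a simplified stand-in: `TYPE_REGISTRY`, `LEGACY_RULES`, `resolve_deriver` and the toy derivers are hypothetical names, and the real registry stores tuples rather than bare functions.

```python
import hashlib
from typing import Callable, Optional

Deriver = Callable[[str], str]


def _sha256(value: str) -> str:
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def _year(value: str) -> str:
    return value.strip()[:4]


def _casefold(value: str) -> str:
    return value.strip().casefold()


# Stand-ins for DERIVATION_TYPE_REGISTRY and DERIVATION_RULES.
TYPE_REGISTRY: dict[str, Deriver] = {"year": _year, "casefold": _casefold, "sha256": _sha256}
LEGACY_RULES: dict[str, Deriver] = {"date_of_birth": _year}


def resolve_deriver(field: str, derivation_map: Optional[dict[str, str]] = None) -> Deriver:
    """Pick the transform for one field using the documented precedence."""
    if derivation_map and field in derivation_map:
        return TYPE_REGISTRY[derivation_map[field]]  # 1. lens-driven map wins
    if field in LEGACY_RULES:
        return LEGACY_RULES[field]                   # 2. legacy field-name rule
    return _sha256                                   # 3. safe default


print(resolve_deriver("date_of_birth")("1985-03-15"))
print(len(resolve_deriver("passport_no")("X1234567")))
```

A partial map therefore only affects the fields it names; every other field keeps its legacy rule or falls through to the hash.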
Public API
derive_features()
def derive_features(
    records: list[dict],
    match_function: list[dict],
    id_field: str = "local_id",
    derivation_map: dict[str, str] | None = None,
) -> list[dict]:
Transforms raw records into derived feature vectors. Only fields listed in match_function are included. Fields without a derivation rule get SHA-256 hashed. The id_field is preserved as-is (it is a local reference, not PII).
Args:
- records — raw records with PII field values (post-normalization).
- match_function — list of {field, metric, weight} dicts defining which fields to derive.
- id_field — record ID field name (default "local_id").
- derivation_map — optional {field_name: derivation_type} from the lens. When provided, uses DERIVATION_TYPE_REGISTRY instead of legacy DERIVATION_RULES.
Returns: list of dicts with {id_field, derived_field_1, derived_field_2, ...}. No extra fields beyond id_field and match function fields.
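A minimal sketch of the contract described above: only match-function fields survive, the ID is copied through, and unknown fields are hashed. The `derivers` argument and the function name are hypothetical simplifications (the real function takes a `derivation_map` of type names), and mapping null values to empty strings follows the behaviour listed under Test Coverage.

```python
import hashlib
from typing import Callable, Optional


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def derive_features_sketch(
    records: list[dict],
    match_function: list[dict],
    id_field: str = "local_id",
    derivers: Optional[dict[str, Callable[[str], str]]] = None,
) -> list[dict]:
    """Build derived vectors containing the ID plus match-function fields only."""
    derivers = derivers or {}
    fields = [entry["field"] for entry in match_function]
    vectors = []
    for record in records:
        vector = {id_field: record.get(id_field)}  # ID is a local reference
        for field in fields:
            raw = record.get(field)
            if raw is None or raw == "":
                vector[field] = ""  # null stays empty rather than being hashed
            else:
                vector[field] = derivers.get(field, sha256_hex)(str(raw))
        vectors.append(vector)
    return vectors


records = [{"local_id": "A1", "date_of_birth": "1985-03-15",
            "phone": "07700900123", "notes": "not in match function"}]
match_function = [{"field": "date_of_birth", "metric": "exact", "weight": 0.5},
                  {"field": "phone", "metric": "exact", "weight": 0.5}]
derived = derive_features_sketch(records, match_function,
                                 derivers={"date_of_birth": lambda v: v[:4]})
print(sorted(derived[0]))
```

The `notes` field never reaches the output because the loop iterates over match-function fields, not over the record's keys.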
build_derived_match_function()
def build_derived_match_function(
    match_function: list[dict],
    derivation_map: dict[str, str] | None = None,
) -> list[dict]:
Builds a match function suitable for scoring derived features. Replaces each field's metric with the derived-appropriate metric (e.g., jaro_winkler becomes exact for Soundex codes) while preserving weights.
Args:
- match_function — original match function from the lens.
- derivation_map — optional lens-driven derivation map for metric lookup.
Returns: list of {field, metric, weight} dicts with derived metrics.
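The metric swap can be sketched as follows. The two lookup tables copy the metric columns from the registry and the legacy DERIVED_METRICS shown earlier; the function name and the choice of "exact" for unmapped (hashed) fields are assumptions for the example.

```python
from typing import Optional

# Metric per derivation type, copied from the registry table.
TYPE_METRICS = {
    "soundex": "exact", "year": "exact", "postcode_area": "levenshtein",
    "sha256": "exact", "geohash": "geohash_match", "temporal_bucket": "exact",
    "casefold": "levenshtein", "phonetic": "exact",
}
# Metric per legacy field name, copied from DERIVED_METRICS.
LEGACY_METRICS = {
    "full_name": "exact", "date_of_birth": "exact", "postcode": "levenshtein",
    "phone": "exact", "email": "exact",
}


def build_derived_match_function_sketch(
    match_function: list[dict],
    derivation_map: Optional[dict[str, str]] = None,
) -> list[dict]:
    """Replace each field's metric with its derived metric, keeping weights."""
    derived = []
    for entry in match_function:
        field = entry["field"]
        if derivation_map and field in derivation_map:
            metric = TYPE_METRICS[derivation_map[field]]
        elif field in LEGACY_METRICS:
            metric = LEGACY_METRICS[field]
        else:
            metric = "exact"  # assumed: hashed fallback fields compare exactly
        derived.append({**entry, "metric": metric})
    return derived


raw = [{"field": "full_name", "metric": "jaro_winkler", "weight": 0.6},
       {"field": "postcode", "metric": "levenshtein", "weight": 0.4}]
print(build_derived_match_function_sketch(raw))
```

Copying each entry with `{**entry, "metric": metric}` leaves the input list untouched, so the raw match function remains available for audit.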
score_pair_derived()
def score_pair_derived(
    derived_a: dict,
    derived_b: dict,
    derived_match_function: list[dict],
    null_penalty: float = 0.1,
    id_field: str = "local_id",
) -> ScoredPair:
Scores two derived feature vectors. Delegates to scorer.score_pair() with the derived match function. Derived vectors contain only one-way values — never raw PII.
Args:
- derived_a, derived_b — derived feature vectors from derive_features().
- derived_match_function — from build_derived_match_function().
- null_penalty — penalty per null field (default 0.1).
- id_field — record ID field name.
Returns: ScoredPair with confidence and per-field breakdown.
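The source does not spell out how scorer.score_pair() combines fields, so the following is only an illustrative sketch under stated assumptions: a weighted mean of per-field similarities over non-null fields, minus null_penalty for each field that is empty on either side. It returns a plain float rather than a ScoredPair, and all names are hypothetical.

```python
def exact_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance / max_length, via the classic row-by-row DP."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


METRICS = {"exact": exact_similarity, "levenshtein": levenshtein_similarity}


def score_pair_sketch(a: dict, b: dict, match_function: list[dict],
                      null_penalty: float = 0.1) -> float:
    """Assumed scoring rule: weighted mean of similarities, minus null penalties."""
    weighted, total_weight, nulls = 0.0, 0.0, 0
    for entry in match_function:
        va, vb = a.get(entry["field"], ""), b.get(entry["field"], "")
        if not va or not vb:
            nulls += 1  # a missing value contributes no evidence either way
            continue
        weighted += entry["weight"] * METRICS[entry["metric"]](va, vb)
        total_weight += entry["weight"]
    base = weighted / total_weight if total_weight else 0.0
    return max(0.0, base - nulls * null_penalty)


mf = [{"field": "full_name", "metric": "exact", "weight": 0.6},
      {"field": "postcode", "metric": "levenshtein", "weight": 0.4}]
print(score_pair_sketch({"full_name": "S530", "postcode": "SW1A"},
                        {"full_name": "S530", "postcode": "SW1A"}, mf))
```

Whatever the real combination rule is, the inputs are the same: only derived values such as Soundex codes and outcodes are ever compared.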
run_fusion_derived()
def run_fusion_derived(
    node_a: list[dict],
    node_b: list[dict],
    lens_path: str,
    threshold: float = 0.70,
    null_penalty: float = 0.1,
    id_field: str = "local_id",
    field_map_a: dict[str, str] | None = None,
    field_map_b: dict[str, str] | None = None,
) -> DerivedFusionResult:
Runs the full fusion pipeline using only derived features. Steps:
- Parse lens.
- Apply field maps and normalize.
- Build match function from raw data (for weight/field selection).
- Derive features — this is the federation boundary. After this step, only derived values exist.
- Blocking — uses normalized records because blocking keys are already one-way (soundex, year).
- Score using derived features only.
Returns: DerivedFusionResult dataclass:
from dataclasses import dataclass

from parallax.ops.fusion.scorer import ScoredPair


@dataclass
class DerivedFusionResult:
    matches: list                       # FusionMatch instances above threshold
    total_candidates: int               # candidate pairs after blocking
    total_pairs_possible: int           # len(node_a) * len(node_b)
    derived_match_function: list[dict]  # match function with derived metrics
    raw_match_function: list[dict]      # original match function (for audit)
    all_scored: list[ScoredPair]        # all scored pairs (for analysis)
get_derivation()
def get_derivation(derivation_type: str) -> tuple[callable, str, str]:
Looks up a derivation by type name from DERIVATION_TYPE_REGISTRY.
Args:
- derivation_type — one of the 8 registered type names.
Returns: (derive_fn, derived_metric, cdg_pattern) tuple.
Raises: KeyError if type not found, with message listing available types.
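A minimal sketch of a lookup that fails with a descriptive message, as described above. The two-entry `REGISTRY`, the toy derivers and the exact wording of the error are assumptions; only the tuple shape and the KeyError behaviour come from the text.

```python
from typing import Callable

# Two-entry stand-in for DERIVATION_TYPE_REGISTRY:
# type_name -> (derive_fn, derived_metric, cdg_pattern)
REGISTRY: dict[str, tuple[Callable[[str], str], str, str]] = {
    "year": (lambda v: v.strip()[:4], "exact", r"[0-9]{4}"),
    "casefold": (lambda v: v.strip().casefold(), "levenshtein", r".+"),
}


def get_derivation_sketch(derivation_type: str) -> tuple[Callable[[str], str], str, str]:
    """Return the registry entry, or raise KeyError naming the valid types."""
    try:
        return REGISTRY[derivation_type]
    except KeyError:
        available = ", ".join(sorted(REGISTRY))
        raise KeyError(
            f"Unknown derivation type {derivation_type!r}; available: {available}"
        ) from None


derive_fn, metric, pattern = get_derivation_sketch("year")
print(derive_fn("1985-03-15"), metric, pattern)
```

Listing the available types in the message lets a lens author spot a misspelled derivation type without opening the module.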
CDG Compatibility
The CDG (Common Data Gateway) validates exchanged data against XSD patterns. Derived features are designed so that every derivation type (except casefold) produces output matching a strict XSD pattern:
| Derivation Type | CDG-Safe? | XSD Pattern | Notes |
|---|---|---|---|
| soundex | Yes | [A-Z][0-9]{3} | Fixed 4-character code |
| year | Yes | [0-9]{4} | Fixed 4-digit year |
| postcode_area | Yes | [A-Z]{1,2}[0-9][0-9A-Z]? | UK outcode format |
| sha256 | Yes | [0-9a-f]{64} | Fixed 64-character hex |
| geohash | Yes | [0-9a-z]{4,8} | Variable-length alphanumeric |
| temporal_bucket | Yes | [0-9]{4}(-[0-9]{2})? | YYYY or YYYY-MM |
| casefold | Low assurance | .+ | Free text: readable, not one-way |
| phonetic | Yes | [A-Z]{1,8} | Variable-length uppercase code |
The CDG validates the derived values, not raw PII. The XSD patterns are properties of the derivation type, not the field — so the same engine works for any domain.
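As an illustration, derived values can be checked against these patterns before exchange. XSD pattern facets match the whole value, which corresponds to `re.fullmatch` in Python. This is a sketch only: the helper name is hypothetical, and Python's regex dialect is assumed to agree with XSD for these simple patterns.

```python
import re

# Patterns copied from the table above, keyed by derivation type.
CDG_PATTERNS = {
    "soundex": r"[A-Z][0-9]{3}",
    "year": r"[0-9]{4}",
    "postcode_area": r"[A-Z]{1,2}[0-9][0-9A-Z]?",
    "sha256": r"[0-9a-f]{64}",
    "geohash": r"[0-9a-z]{4,8}",
    "temporal_bucket": r"[0-9]{4}(-[0-9]{2})?",
    "casefold": r".+",
    "phonetic": r"[A-Z]{1,8}",
}


def is_cdg_valid(derivation_type: str, value: str) -> bool:
    """True when the whole derived value matches the type's pattern."""
    return re.fullmatch(CDG_PATTERNS[derivation_type], value) is not None


print(is_cdg_valid("soundex", "S530"))         # derived code
print(is_cdg_valid("soundex", "Smith"))        # raw name is rejected
print(is_cdg_valid("year", "1985-03-15"))      # raw date is rejected
```

Because the patterns are strict, a raw value that slips through un-derived will usually fail validation; casefold is the exception, since `.+` accepts any non-empty text.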
Privacy Properties
1. No raw PII in derived vectors
derive_features() produces vectors containing only:
- The id_field (a local reference, not PII).
- Match function fields, each transformed by a derivation function.
- No extra fields — fields not in the match function are dropped entirely.
2. One-way transforms
| Transform | Why it cannot be reversed |
|---|---|
| Soundex | Lossy phonetic mapping. S530 could be Smith, Smyth, Snead, or hundreds of other names. |
| Year | 365 dates collapse to one value. Month and day are gone. |
| Postcode area | Thousands of full postcodes share one outcode. |
| SHA-256 | Cryptographically one-way. No known preimage attack. |
| Geohash | Precise lat/lon reduced to ~5km cell. Exact position lost. |
| Temporal bucket | Day and time discarded. ~30 days map to one bucket. |
| Phonetic (Metaphone) | Lossy phonetic mapping, broader grouping than Soundex. |
Exception: casefold preserves the text content (lowercased). It is the lowest-assurance derivation and should only be used when other types are too lossy for acceptable recall.
3. Reduced precision
Derived values deliberately discard detail to limit re-identification risk:
- Full name -> 4-character phonetic code
- Full date of birth -> 4-digit year
- Full UK postcode (house-level) -> outcode prefix (district-level)
- Exact coordinates -> ~5km geohash cell
- Full timestamp -> month-level bucket
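The coordinate case can be made concrete with a minimal sketch of the standard geohash encoding at precision 5. It is not the module's geohash_encode(); the function names, and returning an empty string for unparsable input, are assumptions for the example.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode_sketch(lat: float, lon: float, precision: int = 5) -> str:
    """Interleave longitude/latitude bisection bits, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | int(bit)
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def derive_geohash_sketch(value: str, precision: int = 5) -> str:
    """Parse "lat,lon" text; return "" when it cannot be parsed (assumed)."""
    try:
        lat_text, lon_text = value.split(",")
        lat, lon = float(lat_text), float(lon_text)
    except (AttributeError, ValueError):
        return ""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return ""
    return geohash_encode_sketch(lat, lon, precision)


print(derive_geohash_sketch("51.5074,-0.1278"))  # gcpvj
print(derive_geohash_sketch("51.5080,-0.1281"))  # gcpvj (same cell)
```

Two nearby points in central London land in the same cell, so the derived value supports proximity matching without revealing either exact position.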
4. Field minimisation
Fields not listed in the match function are never included in derived vectors. If a record has 20 fields but the match function references 5, the derived vector contains only those 5 plus the ID.
Safe Default
Fields without a derivation rule (neither in derivation_map nor in legacy DERIVATION_RULES) are hashed with SHA-256. This is the safe default:
# Fallback in derive_features():
field_derivers[field_name] = _derive_sha256
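SHA-256 is the most conservative derivation: the digest exposes nothing about the input beyond whether two values are equal. Because the hash is deterministic, a digest of a low-entropy field (for example a phone number) can in principle be recovered by hashing candidate inputs, so the one-way property is strongest for high-entropy values. The trade-off is that only exact matches can be detected (binary equality of hashes). For fields where graduated similarity is needed, an explicit derivation type should be configured in the lens.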
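A minimal sketch of the fallback using Python's hashlib, following the "lowercased and stripped before hashing" normalisation noted in the registry table. The function name is hypothetical, and UTF-8 encoding is an assumption.

```python
import hashlib


def derive_sha256_sketch(value: str) -> str:
    """Hash the stripped, lowercased value and return 64 hex characters."""
    normalised = value.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


a = derive_sha256_sketch("Alice@Example.com ")
b = derive_sha256_sketch("alice@example.com")
c = derive_sha256_sketch("alice@example.org")
print(a == b)  # same address after normalisation
print(a == c)  # a near-identical address shares nothing with it
```

The second comparison shows why only the exact metric makes sense on hashes: a one-character difference in the input yields an unrelated digest.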
Test Coverage
TestDerivationRules — Unit tests for each derivation function
- Soundex produces correct codes ("Smith" -> "S530")
- Alias expansion before Soundex ("Jim" and "James" -> same code)
- Year extraction from ISO dates
- Postcode area extraction for UK formats ("SW1A 1AA" -> "SW1A", "E1 6AN" -> "E1", "E14 6AN" -> "E14")
- SHA-256 determinism, case normalisation, correct length
- Empty input handling for all derivation functions
TestDeriveFeatures — Integration tests for derive_features()
- Derived vector contains no raw name, DOB, postcode, phone, or email
- Name field matches Soundex pattern [A-Z]\d{3}
- DOB contains only 4-digit year, no month or day
- Phone and email are 64-character SHA-256 hashes
- ID field preserved unchanged
- Null fields produce empty strings
- Full sweep of all VRS records: no raw PII appears in any derived field
TestDerivedMatchFunction — Metric replacement tests
- jaro_winkler replaced with exact for Soundex-derived names
- levenshtein used for postcode area
- Weights preserved from original match function
TestScoreDerived — Derived scoring tests
- Identical derived vectors score 1.0
- Different Soundex codes produce low confidence
TestFullDerivedPipeline — End-to-end VRS test suite with derived features
- Finds matches (non-empty result set)
- At least 10 true positives at threshold 0.50
- Zero false positives
- P018 (two different John Smiths) correctly rejected
- Precision >= 0.90
- Derived vs raw comparison: recall delta < 0.40
TestEUAIActCompliance — Compliance-specific assertions
- No raw PII value appears in any derived vector across all VRS test records
- Derivations are one-way (Smith/Smyth same Soundex, different dates same year)
- Postcode area loses house-level precision
- Derived vectors contain only match function fields plus ID (no extra fields)
TestDerivationTypeRegistry — Registry structure tests
- Registry contains exactly 8 types
- Each entry is a (callable, str, str) tuple
- get_derivation() returns correct entries
- Unknown type raises KeyError with descriptive message
TestNewDerivationFunctions — Unit tests for geohash, temporal_bucket, casefold, phonetic
- Geohash produces correct-length codes from lat/lon, handles empty and invalid input
- Nearby coordinates share geohash prefix
- Temporal bucket extracts YYYY-MM from ISO dates, falls back to year
- Casefold handles Unicode ("strasse" from "Straße"), strips whitespace
- Phonetic produces consistent codes, similar names share codes
TestDerivationMap — Lens-driven derivation tests
- derivation_map overrides legacy rules (casefold instead of soundex)
- Partial map: unmapped fields fall back to legacy rules
- Unknown fields (not in map or legacy) get SHA-256
- build_derived_match_function() uses map for metric lookup, preserves weights
Accuracy Impact
Derived features can trade recall for privacy, although on the VRS POC benchmark both pipelines score identically:
| Pipeline | TP | FP | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Raw features | 18/20 | 0 | 1.000 | 0.900 | 0.947 |
| Derived features | 18/20 | 0 | 1.000 | 0.900 | 0.947 |
The recall delta on this benchmark is zero, well inside the < 0.40 bound asserted by TestFullDerivedPipeline. P018 (false positive trap) is correctly rejected by both pipelines.
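The table's figures follow directly from the counts. As a worked check (the helper name is hypothetical), with 18 true positives, 0 false positives and 20 true matches in the benchmark:

```python
def precision_recall_f1(tp: int, fp: int, total_true: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / total_true if total_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=18, fp=0, total_true=20)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 1.000 0.900 0.947
```

F1 is 2 x 1.000 x 0.900 / 1.900 = 0.947, which matches both rows of the table.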
Files
| File | Purpose |
|---|---|
| parallax/ops/fusion/derived_features.py | All derivation functions, registries, public API, DerivedFusionResult |
| tests/test_derived_features.py | 9 test classes, compliance assertions, end-to-end pipeline tests |
| parallax/ops/fusion/scorer.py | score_pair() and ScoredPair, used by score_pair_derived() |
| parallax/ops/fusion/blocker.py | generate_blocking_keys(), generate_candidates(), used by run_fusion_derived() |
| parallax/ops/fusion/normalizer.py | normalize_records(), pre-derivation normalization |
| parallax/ops/fusion/transforms/text.py | alias_expand(), used by _derive_soundex() |
| parallax/ops/fusion/transforms/temporal.py | extract_year(), used by _derive_year() and _derive_temporal_bucket() |
| parallax/ops/fusion/transforms/geo.py | geohash_encode(), used by _derive_geohash() |
| parallax/ops/fusion/pipeline.py | _apply_field_map(), _build_match_function(), _build_blocking_key_sets(), FusionMatch, used by run_fusion_derived() |
| parallax/ops/fusion/lens_parser.py | parse_lens(), used by run_fusion_derived() |
Integration Points
- component.parallax.lens-parser -> here: Lens parser provides field definitions and weights that drive derivation.
- component.parallax.scoring-engine -> here: score_pair() and ScoredPair are reused for derived scoring with different metrics.
- component.parallax.primitives-framework -> here: Normalization transforms (alias_expand, extract_year, geohash_encode) are composed into derivation functions.
- Here -> Three-Phase Protocol: Phase 2 exchanges derived feature vectors, not raw PII. derive_features() is the federation privacy boundary.
Depends on: component.parallax.lens-parser, component.parallax.primitives-framework, component.parallax.scoring-engine
Realizes: product.fusion
Required by: component.parallax.cds-message-mapping, component.parallax.three-phase-protocol, component.parallax.vrs-rest-api