Fusion Binding as Project + Dataset

Status & scope

Stage: DRAFT — Architecture Decision
Supersedes: SCHEMA-BINDING.md §"Why Not Pre-Registered Datasets?"
Milestone: M2 (Production Integration)

Problem

SCHEMA-BINDING.md (§132-151) argued against using the existing Dataset object for fusion because the POC operates on CSV files with no platform infrastructure. That was correct for the POC.

In production, the architecture is:

Airflow → UDS/Elastic → Dask (operations) → Fusion → titan (federation) → Another Axonis

Data arrives from UDS already normalized (ISO dates, WKT locations, UTF-8 NFKC). But name formatting, title stripping, alias expansion, deduplication, null handling, and field-specific shaping still need to happen before fusion scoring. The platform already has 187 tested operation types (tabular, text, timeseries, image, transform, join) running on Dask, a workflow builder UI with 1069+ deployed nodes, and a Dataset object that captures query + operations + data_model + federation metadata.

Building a second normalization layer inside parallax's scorer (or extractor) would duplicate what the platform already does, violate "Beacon renders, Cortex thinks" (by putting intelligence in the wrong layer), and forgo the 187 operations that are already tested and deployed.

Decision

The Fusion Binding creates a Project. Each federate's binding is a branch in that Project's DAG. Each branch produces a Dataset. All Datasets converge at a common Fusion Engine node governed by a Lens.

The existing Dataset object (query + operations → shaped data) is the per-federate data preparation artifact. The existing Project DAG is the orchestration artifact. The Lens is the governing artifact. The only new node type is the Fusion Engine at the convergence point.

Architecture

                    ┌──────── LENS (shared, immutable) ────────┐
                    │   target_model: "customer"                │
                    │   Defines: model-qualified fields,        │
                    │   metrics, weights, blocking, thresholds  │
                    │                                            │
        ┌────────── PROJECT (common across all nodes) ──────────┐
        │           project_id: "fusion_vrs_2026_03"             │
        │           subtype: "fusion"                            │
        │                                                        │
   Binding A              Binding B              Binding C
   (Barclays)             (HSBC)                 (Lloyds)
   model: customer        model: client          model: customer
   field_map: ∅           field_map: {remap}     field_map: ∅
        │                     │                      │
   QUERY                 QUERY                  QUERY
   (uds.model:customer)  (uds.model:client)     (uds.model:customer)
        │                     │                      │
   TEXT_OP               TABULAR_OP             TEXT_OP
   (lowercase,           (replace comma         (lowercase)
    strip titles)         name format)               │
        │                     │                  TABULAR_OP
   TABULAR_OP            TEXT_OP                (dropna on
   (replace aliases:     (lowercase)             match fields)
    Bob→Robert)               │                      │
        │                     │                      │
   DATASET               DATASET                DATASET
   (fusion_input)        (fusion_input)          (fusion_input)
        │                     │                      │
        └─────────────────────┼──────────────────────┘
                              │
                       FUSION_ENGINE (new node type)
                       lens_id: vrs_vulnerability_v1
                       (blocking → scoring → clustering)
                              │
                        MATCH_RESULTS
                       (entity clusters with provenance)

Model-first resolution: Bindings A and C both use uds.model: customer, the same model the lens targets. Their field_map is empty because field names already match. Binding B uses uds.model: client, a different schema, so it provides a field_map to remap client.* fields to their customer.* equivalents.

Key Properties

One Project, N Bindings. The project is shared across all participating nodes. Each federate sees the same project definition but executes only its own binding branch locally.
Model-first field resolution. The Lens declares a target_model (e.g., customer) and references model-qualified fields directly (e.g., customer.full_name, customer.dob). When a federate uses the same uds.model, field names match automatically — no field_map needed. The field_map becomes an optional override, only required when a federate uses a different model (e.g., client instead of customer) and needs to remap fields.
Each Binding = QUERY + Operations + Field Map (optional). The binding is a full DAG branch definition. The query hits local UDS (filtering by uds.model). The operations run on local Dask. The shaped dataset stays on the local node. Only feature vectors and blocking keys cross boundaries. If the federate's model matches the lens target_model, field_map is empty; by the tier estimates under "Three-tier binding for scale", that is the common case (~80% of nodes).
The Lens governs. The Lens defines what model-qualified fields to match, which metrics to use, what weights to apply, and what thresholds to enforce. The binding's operations must produce data that conforms to the Lens's field vocabulary. The Lens does not know or care how each federate shapes their data.
Sovereignty by architecture. Each federate's query, operations, and raw data remain local. The project definition is common, but execution is distributed. The FUSION_ENGINE node is the only convergence point, and it operates on feature vectors, not raw records.
Existing infrastructure. The QUERY node, all 187 operation types, the DATASET node, the project DAG, the workflow builder UI, federation health tracking — all work as-is. No new operation types needed. The only new node type is FUSION_ENGINE.
Three-tier binding for scale. At 700+ nodes, manual per-node bindings are impractical. The model-first approach enables tiered binding:
Domain Template — auto-generated for all nodes sharing the lens target_model. Covers ~80% of nodes with zero manual work. Contains query template + standard operations chain.
Auto-Binding — LLM introspects a non-standard schema, maps fields to the target model, auto-approves if all confidences ≥ 0.90. Covers ~15% of nodes.
Manual Override — human reviews and edits bindings for edge cases (~5% of nodes).
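The three tiers can be read as a simple decision rule. The sketch below is illustrative only: ProposedBinding and select_tier are hypothetical names, not platform types, and the only values taken from this spec are the model comparison and the 0.90 auto-approval threshold.

```python
from dataclasses import dataclass, field

AUTO_APPROVE_THRESHOLD = 0.90  # auto-approval rule stated in this spec


@dataclass
class ProposedBinding:
    """Hypothetical container; not the platform's LensBinding type."""
    local_model: str
    # lens_field -> confidence of the proposed mapping (empty when models match)
    confidences: dict = field(default_factory=dict)


def select_tier(binding: ProposedBinding, target_model: str) -> str:
    """Return 'domain_template', 'auto_binding', or 'manual_override'."""
    if binding.local_model == target_model:
        return "domain_template"  # same model: no field mapping needed
    if binding.confidences and all(
        c >= AUTO_APPROVE_THRESHOLD for c in binding.confidences.values()
    ):
        return "auto_binding"  # every proposed mapping clears the threshold
    return "manual_override"  # a human reviews and edits the binding


hsbc = ProposedBinding("client", {"customer.full_name": 0.96, "customer.dob": 0.99})
print(select_tier(ProposedBinding("customer"), "customer"))  # domain_template
print(select_tier(hsbc, "customer"))                         # auto_binding
```

A binding with no proposed mappings and a non-matching model falls through to manual override here; that choice is an assumption of the sketch.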

Binding Format (Production)

The binding evolves from SCHEMA-BINDING.md's field-map-only format to include the full data preparation pipeline:

{
  "binding_id": "vrs_v1__barclays__2026_03",
  "lens_id": "vrs_vulnerability_v1",
  "lens_version": "1.0.0",
  "federate_id": "barclays.axonis.ai",
  "project_id": "fusion_vrs_2026_03",

  "query": {
    "query": {
      "function_score": {
        "query": {
          "bool": {
            "must": [
              {
                "query_string": {
                  "query": "uds.model: \"customer\" AND uds.source: \"retail_kyc\""
                }
              }
            ]
          }
        }
      }
    },
    "federation": {
      "federates": {
        "barclays.axonis.ai": 200
      }
    },
    "size": 100,
    "orient": "records"
  },

  "operations": [
    {
      "command": "nlp",
      "parameters": {
        "operations": {
          "lowercase": true,
          "remove_whitespace": true,
          "remove_punctuation": true
        },
        "text_attribute": "customer.full_name",
        "tokenizer_mode": "document"
      },
      "dataset": "0000_ds_QUERY_barclays_kyc",
      "return": "0001_ds_TEXT_OPERATION_name_clean"
    },
    {
      "command": "replace",
      "parameters": {
        "to_replace": {
          "Dr ": "", "Mr ": "", "Mrs ": "",
          "Prof ": "", "Sir ": "", "Dame ": ""
        },
        "features": ["customer.full_name"]
      },
      "dataset": "0001_ds_TEXT_OPERATION_name_clean",
      "return": "0002_ds_TABULAR_OPERATION_title_strip"
    },
    {
      "command": "replace",
      "parameters": {
        "to_replace": {
          "Bob ": "Robert ", "Jim ": "James ",
          "Bill ": "William ", "Dick ": "Richard ",
          "Mike ": "Michael ", "Tom ": "Thomas "
        },
        "features": ["customer.full_name"]
      },
      "dataset": "0002_ds_TABULAR_OPERATION_title_strip",
      "return": "0003_ds_TABULAR_OPERATION_alias_expand"
    },
    {
      "command": "dropna",
      "parameters": {
        "how": "any",
        "subset": ["customer.full_name", "customer.dob"]
      },
      "dataset": "0003_ds_TABULAR_OPERATION_alias_expand",
      "return": "0004_ds_TABULAR_OPERATION_clean"
    }
  ],

  "field_map": {
    "identity.full_name": "customer.full_name",
    "identity.date_of_birth": "customer.dob",
    "location.postcode": "customer.postal_code",
    "identity.phone": "customer.phone_hash",
    "identity.email": "customer.email_hash",
    "context.vulnerability_category": "customer.vuln_type"
  },

  "dataset_config": {
    "name": "Barclays VRS Fusion Input",
    "dataset": "barclays_vrs_fusion_input",
    "subtype": "fusion_input",
    "is_snapshot": false,
    "split_type": "horizontal"
  },

  "suppression_check": [
    {
      "suppressed_field": "vulnerability_status",
      "local_equivalent": "customer.vuln_status",
      "status": "confirmed_suppressed"
    },
    {
      "suppressed_field": "case_notes",
      "local_equivalent": "customer.case_narrative",
      "status": "confirmed_suppressed"
    }
  ],

  "review_status": "approved",
  "reviewed_by": "lens_admin@barclays.axonis.ai",
  "created_at": "2026-03-01T09:00:00Z",
  "expires_at": null
}
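In the example above, each operation reads the dataset that the previous operation returned. A minimal sketch of a check for that linkage follows; validate_chain is a hypothetical helper, and the spec does not say whether the platform performs an equivalent check.

```python
def validate_chain(query_dataset: str, operations: list) -> str:
    """Check that each op reads the previous op's output; return the final dataset name."""
    current = query_dataset
    for index, op in enumerate(operations):
        if op["dataset"] != current:
            raise ValueError(
                f"operation {index} ({op['command']}) reads {op['dataset']!r}, "
                f"expected {current!r}"
            )
        current = op["return"]
    return current


# First two operations of the Barclays binding, reduced to the linkage fields.
ops = [
    {"command": "nlp", "dataset": "0000_ds_QUERY_barclays_kyc",
     "return": "0001_ds_TEXT_OPERATION_name_clean"},
    {"command": "replace", "dataset": "0001_ds_TEXT_OPERATION_name_clean",
     "return": "0002_ds_TABULAR_OPERATION_title_strip"},
]
final = validate_chain("0000_ds_QUERY_barclays_kyc", ops)
print(final)  # 0002_ds_TABULAR_OPERATION_title_strip
```

The returned name is the dataset a FUSION_ENGINE input would reference through dataset_ref.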

Model-First Field Resolution

The Key Insight

The platform already has a lightweight semantic uds.model concept. Every record in UDS is tagged with its model type (e.g., uds.model: "customer", uds.model: "transaction", uds.model: "track_report"). Models define a shared vocabulary — all banks using uds.model: customer have fields like customer.full_name, customer.dob, customer.postal_code.

If the lens targets uds.model: customer and a federate's data uses uds.model: customer, the field names already match. No mapping needed. The binding collapses to just query + operations.

How It Works

LENS declares:
    target_model: "customer"
    match_function references: customer.full_name, customer.dob, customer.postal_code

FEDERATE A (Barclays):
    uds.model: "customer"
    → Fields match the lens target_model
    → field_map: {} (empty — no remapping needed)
    → Binding = query + operations only

FEDERATE B (HSBC):
    uds.model: "client"
    → Fields DON'T match (client.name_display ≠ customer.full_name)
    → field_map: { "customer.full_name": "client.name_display", ... }
    → Binding = query + operations + field_map

FEDERATE C (Lloyds):
    uds.model: "customer"
    → Fields match
    → field_map: {} (empty)
    → Binding = query + operations only

Resolution Order

When the fusion engine resolves a lens field (e.g., customer.full_name), it follows this order:

field_map override — if the binding has an explicit mapping for this field, use it
Model-qualified direct match — if the field exists in the shaped dataset under its exact model-qualified name, use it
Unqualified fallback — if the field exists without the model prefix (e.g., full_name), use it
Null — field unavailable at this federate, apply null penalty in scoring
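The resolution order above can be sketched as a small function. This is an illustration under assumptions, not the parallax resolver: resolve_field_sketch is a hypothetical name, field_map is taken to be a plain dict from lens field to local column, and columns stands for the column names of the shaped dataset.

```python
from __future__ import annotations


def resolve_field_sketch(
    lens_field: str,
    field_map: dict[str, str],
    columns: set[str],
) -> str | None:
    """Follow the four-step resolution order for one lens field."""
    # 1. Explicit field_map override (lens field -> local column).
    if lens_field in field_map:
        return field_map[lens_field]
    # 2. Model-qualified direct match in the shaped dataset.
    if lens_field in columns:
        return lens_field
    # 3. Unqualified fallback: drop the model prefix, e.g. "full_name".
    unqualified = lens_field.split(".", 1)[-1]
    if unqualified in columns:
        return unqualified
    # 4. Unavailable at this federate; the scorer applies its null penalty.
    return None


barclays_columns = {"customer.full_name", "customer.dob"}
hsbc_map = {"customer.full_name": "client.name_display"}
hsbc_columns = {"client.name_display", "client.birth_date"}

print(resolve_field_sketch("customer.full_name", {}, barclays_columns))    # customer.full_name
print(resolve_field_sketch("customer.full_name", hsbc_map, hsbc_columns))  # client.name_display
print(resolve_field_sketch("customer.postal_code", {}, barclays_columns))  # None
```

The sketch trusts a field_map entry without checking that the mapped column exists; whether the real resolver validates that is not stated here.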

Impact on Scale

For a 700-node VRS deployment where all banks use uds.model: customer:

Approach	Manual Bindings	Auto-Bindings	Total Effort
Without model-first	700 field_maps, each with ~6 mappings	0	4,200 field mappings to review
With model-first	~35 (5% edge cases)	~105 (15% non-standard)	~210 field mappings + 595 auto-approved

This is a 20x reduction in manually reviewed field mappings (4,200 down to ~210).
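The 20x figure can be reproduced from the table's own inputs. The arithmetic below counts only manually reviewed field mappings, which is an interpretation of the "Total Effort" column rather than something the table states outright.

```python
nodes = 700
mappings_per_binding = 6   # "~6 mappings" per field_map, from the table
manual_percent = 5         # manual override tier (edge cases)

without_model_first = nodes * mappings_per_binding       # every node needs a field_map
manual_bindings = nodes * manual_percent // 100          # bindings a human must write
with_model_first = manual_bindings * mappings_per_binding

print(without_model_first)                      # 4200
print(manual_bindings)                          # 35
print(with_model_first)                         # 210
print(without_model_first // with_model_first)  # 20
```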

Lens Changes

The identity_fusion section of the lens gains an optional target_model field:

identity_fusion:
  target_model: "customer"   # NEW: UDS model this lens targets
  match_function:
    name_match:
      weight: 0.30
      metric: jaro_winkler
      field_ref: customer.full_name    # Model-qualified field reference
    dob_match:
      weight: 0.25
      metric: exact
      field_ref: customer.dob
    postcode_match:
      weight: 0.20
      metric: exact
      field_ref: customer.postal_code

When target_model is set, all field_ref values are model-qualified. When target_model is absent (backward compat), the existing abstract field names + per-binding field_map behavior applies.
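To show how the weights in the lens excerpt might combine, here is a minimal scoring sketch. Two things are assumptions: difflib's SequenceMatcher stands in for jaro_winkler so the example stays within the standard library, and the null penalty is modeled by letting a missing field score zero while its weight stays in the denominator. The spec does not define the actual penalty.

```python
from difflib import SequenceMatcher

# Weights, metrics and field refs copied from the lens excerpt above.
MATCH_FUNCTION = {
    "name_match": {"weight": 0.30, "metric": "jaro_winkler", "field_ref": "customer.full_name"},
    "dob_match": {"weight": 0.25, "metric": "exact", "field_ref": "customer.dob"},
    "postcode_match": {"weight": 0.20, "metric": "exact", "field_ref": "customer.postal_code"},
}


def similarity(metric: str, a: str, b: str) -> float:
    if metric == "exact":
        return 1.0 if a == b else 0.0
    # Stand-in for jaro_winkler (assumption): difflib ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def score_pair(vec_a: dict, vec_b: dict) -> float:
    """Weighted aggregate; a field missing on either side contributes 0."""
    total_weight = sum(rule["weight"] for rule in MATCH_FUNCTION.values())
    score = 0.0
    for rule in MATCH_FUNCTION.values():
        a, b = vec_a.get(rule["field_ref"]), vec_b.get(rule["field_ref"])
        if a is None or b is None:
            continue  # weight stays in the denominator, so the pair is penalized
        score += rule["weight"] * similarity(rule["metric"], a, b)
    return score / total_weight


full = {"customer.full_name": "robert smith", "customer.dob": "1980-01-02",
        "customer.postal_code": "ec1a 1bb"}
no_dob = {k: v for k, v in full.items() if k != "customer.dob"}
print(score_pair(full, full))
print(score_pair(full, no_dob))
```

Dividing by the total weight keeps the score in [0, 1] even though the three weights shown sum to 0.75.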

Binding Format Changes

The binding gains an optional local_model field:

binding_id: vrs_v1__barclays__2026_03
lens_id: vrs_vulnerability_v1
federate_id: barclays.axonis.ai
local_model: "customer"          # NEW: this federate's UDS model

# When local_model == lens.target_model, field_mappings is OPTIONAL
# Operations chain still applies (data shaping is per-node regardless)

query:
  query:
    function_score:
      query:
        bool:
          must:
            - query_string:
                query: 'uds.model: "customer"'
  # ...

operations:
  # ... (data shaping still needed — model match doesn't eliminate normalization)

field_mappings: []   # Empty — model fields match the lens directly

For a non-matching model:

binding_id: vrs_v1__hsbc__2026_03
lens_id: vrs_vulnerability_v1
federate_id: hsbc.axonis.ai
local_model: "client"            # Different model — needs field_map

query:
  query:
    function_score:
      query:
        bool:
          must:
            - query_string:
                query: 'uds.model: "client"'

field_mappings:
  - lens_field: customer.full_name
    local_field: client.name_display
    confidence: 0.96
  - lens_field: customer.dob
    local_field: client.birth_date
    confidence: 0.99
  - lens_field: customer.postal_code
    local_field: client.address_postcode
    confidence: 0.97

FUSION_ENGINE Node Type

The only new node type. Receives N datasets (one per participating federate) plus a lens reference, and executes the parallax fusion pipeline.

Node Schema

{
  "type": "FUSION_ENGINE",
  "data": {
    "lens_id": "vrs_vulnerability_v1",
    "lens_version": "1.0.0",
    "project_id": "fusion_vrs_2026_03",
    "input_datasets": [
      {
        "federate_id": "barclays.axonis.ai",
        "dataset_ref": "0004_ds_TABULAR_OPERATION_clean",
        "binding_id": "vrs_v1__barclays__2026_03"
      },
      {
        "federate_id": "hsbc.axonis.ai",
        "dataset_ref": "0003_ds_TEXT_OPERATION_shaped",
        "binding_id": "vrs_v1__hsbc__2026_03"
      },
      {
        "federate_id": "lloyds.axonis.ai",
        "dataset_ref": "0002_ds_TABULAR_OPERATION_clean",
        "binding_id": "vrs_v1__lloyds__2026_03"
      }
    ],
    "execution_config": {
      "blocking_strategy": "from_lens",
      "scoring_threshold": "from_lens",
      "max_candidates_per_record": 100,
      "enable_provenance": true
    },
    "output": {
      "match_results_dataset": "fusion_vrs_matches_2026_03",
      "entity_clusters_dataset": "fusion_vrs_clusters_2026_03"
    }
  }
}

Execution Flow

FUSION_ENGINE receives:
    - N shaped Dask DataFrames (one per binding, via dataset_ref)
    - LensSpec (parsed from lens_id)
    - Field maps (from each binding)

Phase 1: BLOCKING
    For each federate:
        Local Dask worker computes blocking keys from shaped data
        Blocking keys transmitted to coordinator (metadata only)
    Coordinator intersects blocking keys → candidate pairs

Phase 2: FEATURE VECTORS
    For each candidate pair (record_a@federate_x, record_b@federate_y):
        Each federate extracts feature vector for their record locally
        Feature vectors transmitted to coordinator
    Raw records never leave the federate

Phase 3: SCORING
    Coordinator scores each candidate pair using lens match_function
    Metrics applied to feature vector values (already shaped by operations)
    Confidence = weighted aggregate with null penalty

Phase 4: CLUSTERING
    Union-find groups confirmed matches into entity clusters
    Contradiction detection (same federate appears twice in cluster)
    Provenance tracked: which pairs, which scores, which evidence

Output:
    MatchResults dataset: all scored pairs above threshold
    EntityClusters dataset: grouped entities with aggregate confidence
    FusionRun audit record: timing, counts, decisions, provenance
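Phase 4 can be sketched with a small union-find. The record identifier shape (federate_id, record_id) and the function names are hypothetical; the contradiction rule is the one stated above, namely that a federate appears more than once in a cluster.

```python
from collections import defaultdict


def cluster(pairs: list) -> list:
    """Group confirmed match pairs into entity clusters with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components

    groups = defaultdict(set)
    for record in list(parent):
        groups[find(record)].add(record)
    return list(groups.values())


def has_contradiction(entity: set) -> bool:
    """True when one federate contributes more than one record to the cluster."""
    federates = [federate for federate, _ in entity]
    return len(federates) != len(set(federates))


confirmed = [
    (("barclays", "b1"), ("hsbc", "h7")),
    (("hsbc", "h7"), ("lloyds", "l3")),
    (("barclays", "b9"), ("lloyds", "l4")),
]
clusters = cluster(confirmed)
print(sorted(len(c) for c in clusters))            # [2, 3]
print(any(has_contradiction(c) for c in clusters))  # False
```

Provenance tracking (which pairs and scores produced each cluster) is omitted to keep the sketch short.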

How This Changes the Binding Lifecycle

SCHEMA-BINDING.md defined a 7-step lifecycle. The production lifecycle extends it:

1. LENS PUBLISHED        Lens deployed to federation
       │
       ▼
2. PROJECT CREATED       Common fusion project instantiated
       │                 project_id assigned, lens_id linked
       │                 FUSION_ENGINE node created (convergence point)
       ▼
3. BINDING PROPOSED      Per federate, LLM-assisted or manual:
       │                 a. Schema introspection (ES mappings / Dask dtypes)
       │                 b. Field mapping with confidence scores
       │                 c. Operations chain designed (data shaping)
       │                 d. Query defined (which records to include)
       ▼
4. BINDING REVIEW        Human reviews (or auto-approve if all confidences ≥ 0.90)
       │                 Reviews: field_map, operations chain, suppression_check
       │                 Reject → binding blocked, lens author notified
       ▼
5. DATASET CREATED       Binding executes query + operations → shaped dataset
       │                 dataset.subtype = "fusion_input"
       │                 dataset.project_id = fusion project ID
       │                 Dataset registered in UDS with federation metadata
       ▼
6. FUSION READY          All participating federates have approved bindings
       │                 FUSION_ENGINE node has all input_datasets populated
       │                 Project status: ready_to_run
       ▼
7. FUSION EXECUTION      Triggered by: schedule, manual, or signal
       │                 parallax pipeline runs on shaped datasets
       │                 Results written as new datasets (match_results, clusters)
       ▼
8. INVALIDATION          Binding invalidated when:
                         - Lens version changes → all bindings regenerated
                         - Local schema changes → affected federate rebinds
                         - Operations chain modified → dataset reshapes
                         - Binding manually revoked
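Steps 6 and 8 of the lifecycle can be sketched as two checks. The schema_hash, operations_hash and revoked fields are hypothetical; the binding format in this spec does not define how schema or operations changes are detected, so treat this as one possible shape.

```python
def invalidation_reasons(binding: dict, current: dict) -> list:
    """Compare a stored binding against the current state at its federate."""
    reasons = []
    if binding["lens_version"] != current["lens_version"]:
        reasons.append("lens_version_changed")
    if binding["schema_hash"] != current["schema_hash"]:            # hypothetical field
        reasons.append("local_schema_changed")
    if binding["operations_hash"] != current["operations_hash"]:    # hypothetical field
        reasons.append("operations_chain_modified")
    if binding.get("revoked", False):                               # hypothetical field
        reasons.append("manually_revoked")
    return reasons


def fusion_ready(bindings: list, current_by_federate: dict) -> bool:
    """FUSION READY: every binding is approved and none is invalidated."""
    return all(
        b["review_status"] == "approved"
        and not invalidation_reasons(b, current_by_federate[b["federate_id"]])
        for b in bindings
    )


state = {"lens_version": "1.0.0", "schema_hash": "s1", "operations_hash": "o1"}
barclays = {"federate_id": "barclays.axonis.ai", "review_status": "approved", **state}
print(fusion_ready([barclays], {"barclays.axonis.ai": state}))  # True
print(invalidation_reasons(barclays, {**state, "lens_version": "1.1.0"}))
```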

What Changes in SCHEMA-BINDING.md

The existing SCHEMA-BINDING.md remains correct for the following sections, all of which still apply:
- LLM guardrails (§117-130)
- Auto-approval policy (§259-268)
- Failure modes (§294-302)
- Mapping hints (§304-320)

The following sections are superseded by this spec:

Section	Old	New
§132-151 "Why Not Pre-Registered Datasets?"	Argued against datasets	Datasets ARE the mechanism
§155-177 "Integration with Existing Architecture"	Binding feeds extractor directly	Binding creates project + dataset, dataset feeds extractor
§179-256 "New UDS Object Type"	LensBinding as standalone UDS object	LensBinding + Project + Dataset as linked UDS objects

The LensBinding UDS object type from §179-256 still exists but gains:
- project_id (links to fusion project)
- query (the ES query for this federate)
- operations (the shaping chain)
- dataset_config (dataset metadata)
- dataset_id (populated after dataset creation)

Relationship to Existing Dataset

The production Dataset object already has everything needed:

Dataset Field	Fusion Use
`queries[]`	ES query per federate (with federation filter)
`operations[]`	Ordered shaping chain (nlp, replace, dropna, etc.)
`data_model.features`	Field names available after shaping
`data_model.feature_types`	Field types (for metric dispatch)
`federation`	Which federates participate + record counts
`split_type`	`horizontal` (each federate has different records, same schema post-shaping)
`is_snapshot`	`true` for frozen point-in-time fusion inputs
`project_id`	Links to fusion project (currently null on all 5000 existing datasets)

New value: subtype: "fusion_input" distinguishes fusion datasets from ML training datasets. No schema change is required because the subtype field already exists (currently all values are "dataset").

Operations Chain for Common Fusion Scenarios

VRS (Vulnerable Person Resolution)

QUERY (customer records)
  → nlp: lowercase, remove_whitespace (on name fields)
  → replace: title stripping (Dr/Mr/Mrs/Prof → "")
  → replace: alias expansion (Bob→Robert, Jim→James)
  → dropna: require name + dob present
  → DATASET (fusion_input)
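As a rough, standalone approximation of the VRS chain (not the platform's nlp, replace or dropna operations), the shaping could look like the following. The title and alias tables are small illustrative samples, and the token-based matching is an assumption of this sketch.

```python
TITLES = {"dr", "mr", "mrs", "prof"}           # sample titles to strip
ALIASES = {"bob": "robert", "jim": "james"}    # sample alias expansions
REQUIRED = ("customer.full_name", "customer.dob")


def shape_name(name: str) -> str:
    """Lowercase, collapse whitespace, strip titles, expand aliases."""
    tokens = name.lower().replace(".", " ").split()
    tokens = [t for t in tokens if t not in TITLES]
    return " ".join(ALIASES.get(t, t) for t in tokens)


def shape(records: list) -> list:
    """Apply the chain to each record; drop records missing name or dob."""
    shaped = []
    for record in records:
        if any(not record.get(f) for f in REQUIRED):
            continue  # dropna: require name + dob present
        shaped.append({**record, "customer.full_name": shape_name(record["customer.full_name"])})
    return shaped


rows = [
    {"customer.full_name": "Dr. Bob  Smith", "customer.dob": "1980-01-02"},
    {"customer.full_name": "Jim Jones", "customer.dob": None},
]
print(shape(rows))
```

Because the sketch lowercases first, the title and alias tables are keyed in lowercase; any case-sensitive lookup that runs after a lowercase step needs the same treatment.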

AML (Anti-Money Laundering Entity Resolution)

QUERY (transaction parties)
  → nlp: lowercase, remove_punctuation, remove_stopwords (on entity name)
  → replace: legal suffix normalization (Ltd→Limited, Corp→Corporation)
  → fillna: default country code from federate config
  → drop_duplicates: on entity_id (keep latest)
  → DATASET (fusion_input)

PNT (Track Reports)

QUERY (track reports)
  → TIMESERIES_OP: resample to 1-minute intervals
  → TIMESERIES_OP: interpolate gaps (method: linear)
  → TIMESERIES_OP: handle_outliers (method: iqr, action: clip)
  → TABULAR_OP: dropna on lat/lon
  → DATASET (fusion_input)

INQ (Intelligence Query / SIGINT)

QUERY (intercept reports)
  → nlp: lowercase, remove_stopwords, lemmatize (on transcript)
  → TIMESERIES_OP: temporal decomposition (on intercept_time)
  → TABULAR_OP: drop classified fields (via suppression)
  → DATASET (fusion_input)

Impact on parallax (Standalone Library)

parallax's fusion compute code does not change. The scorer, blocker, metrics, and clustering remain pure functions operating on feature vectors. What changes is the input contract:

Layer	POC (Current)	Production (This Spec)
Input	`list[dict]` from CSV	Dask DataFrame from shaped Dataset
Normalization	parallax transforms (NM-01..NM-19)	Platform operations (nlp, replace, etc.)
Field resolution	Binding field_map always required	Model-first: field_map optional when models match
Data quality	Assumed clean (test data)	Enforced by operations chain (dropna, fillna, etc.)

Model-First in parallax

The resolver gains a new resolution path. When resolving a lens field to a local column:

def resolve_field_v2(lens_field: str, binding: LensBinding, target_model: str | None = None) -> str | None:
    """Resolve a lens field to a local column name.

    Resolution order:
    1. Explicit field_map override (existing behavior)
    2. Direct model-qualified match (NEW: if local_model == target_model, field exists as-is)
    3. Unqualified fallback (field exists without model prefix)
    4. None (unavailable)
    """

The Dask DataFrame adapter (from the previously approved prompt) bridges the data format: it accepts both list[dict] (for tests) and Dask DataFrames (for production). The parallax transforms (NM-01 through NM-19) remain available for standalone/POC use but are superseded by platform operations in production.

What Needs to Be Built

Already Exists (No Changes)

QUERY node type
All 187 operation types (tabular, text, timeseries, image, transform, join)
DATASET node type with query + operations + data_model + federation
Project DAG with workflow builder UI
Federation health tracking per dataset
Snapshot support (is_snapshot: true)
parallax fusion compute (blocker, scorer, metrics, clustering)

New (This Spec)

FUSION_ENGINE node type — new node in the project DAG. Takes N datasets + lens_id as input. Executes parallax pipeline. Outputs match_results and entity_clusters datasets.
Binding-to-Project creator — given a lens and a set of federate bindings, creates the fusion project with QUERY → operations → DATASET branches per federate, converging at a FUSION_ENGINE node.
subtype: fusion_input on Dataset — metadata-only, no schema change.
project_id population — link fusion datasets to their fusion project.
Fusion results as Datasets — match_results and entity_clusters stored as DATASET objects with subtype: "fusion_output", queryable, snapshotable, with full provenance.
Model-first field resolution — target_model on lens, local_model on binding, resolve_field_v2() in resolver with model-qualified direct match.
Domain template bindings — auto-generated binding templates for same-model federates: query template + standard operations chain + empty field_map.
Local task queue — bounded semaphore in titan for concurrent fusion runs at a single node. Prevents resource exhaustion when multiple lenses trigger simultaneously.

Deferred

Workflow builder UI extensions for fusion-specific operations (alias dictionaries, name token sort)
LLM-assisted binding creation with operations chain suggestion (auto-binding tier)
Scheduled/triggered fusion runs via Airflow
titan integration (RabbitMQ message types for federated execution)
Coordinator sharding for 500+ node deployments
PCA/FFT transforms for high-dimensional pattern matching domains (PNT, SIGINT)

Invariant Compliance

Invariant	Compliance
1. UDS is sole ABAC authority	Query executes through UDS. Operations run on UDS-governed data. FUSION_ENGINE reads shaped datasets via UDS. No bypass.
2. Events are append-only	Fusion results are new datasets (INSERT). Source datasets untouched. Binding changes create new versions.
3. Blocks are evidence	FUSION_ENGINE creates evidence blocks: query hash, operations chain hash, lens version, binding version, match provenance.
4. Frozen means frozen	Snapshot datasets (is_snapshot: true) are immutable. Lens version locked in binding. Results reference specific input snapshots.
5. Editions require frozen evidence	Match results reference frozen input datasets. Entity clusters reference specific fusion run.
6. AI assists, humans attest	LLM proposes bindings. Humans review and approve. Fusion computes matches. Humans decide what to do with results.
7. "No action" is a decision	Match results include non-matches (scored below threshold). P018-type false positive traps explicitly scored and rejected.
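Invariant 3 lists a query hash and an operations chain hash among the evidence. One way to compute stable hashes is to serialize with sorted keys before hashing, as sketched below. The choice of SHA-256 and the evidence_block layout are assumptions; the binding example has no explicit version field, so binding_id stands in for the binding version.

```python
import hashlib
import json


def canonical_hash(obj) -> str:
    """SHA-256 over a key-sorted, whitespace-free JSON encoding."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def evidence_block(binding: dict) -> dict:
    """Hypothetical evidence record derived from one binding."""
    return {
        "query_hash": canonical_hash(binding["query"]),
        "operations_hash": canonical_hash(binding["operations"]),
        "lens_version": binding["lens_version"],
        "binding_id": binding["binding_id"],  # stand-in for binding version
    }


binding = {
    "binding_id": "vrs_v1__barclays__2026_03",
    "lens_version": "1.0.0",
    "query": {"size": 100, "orient": "records"},
    "operations": [{"command": "dropna", "parameters": {"how": "any"}}],
}
block = evidence_block(binding)
print(block["binding_id"], len(block["query_hash"]))  # vrs_v1__barclays__2026_03 64
```

Sorting keys means two bindings that differ only in key order produce the same hash, while any change to the operations list order produces a different one.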

Cross-References

Document	Relationship
SCHEMA-BINDING.md	Extended by this spec. §132-151 superseded.
component.parallax.lens-parser (Lens Parser)	Lens drives the FUSION_ENGINE node configuration
component.parallax.feature-extraction (Feature Extraction)	Extractor consumes shaped datasets instead of raw CSV
component.parallax.blocking-engine (Blocking)	Blocking operates on shaped, field-mapped data
component.parallax.scoring-engine (Scoring)	Scorer receives pre-normalized values — no in-scorer normalization needed
component.parallax.primitives-framework (Primitives)	NM-01..NM-19 transforms are POC-only; platform operations replace them in production
UI_TEST_MATRIX.md	187 tested operations available for binding operations chains
datasets_audit.json	5000 existing datasets, 1742 with operations chains — pattern proven at scale

Depends on: component.parallax.blocking-engine, component.parallax.feature-extraction, component.parallax.lens-parser, component.parallax.scoring-engine

Realizes: product.fusion

Required by: component.parallax.correlation-persistence, component.parallax.fusion-governance-lifecycle