Node Scaling — 2000 Edge Nodes

Status & scope

  • Stage: Draft
  • Date: 2026-03-14
  • Requirement: VRS use case demands 2000 edge nodes reporting to a single hub.

Problem

Two scaling axes:

  1. Ingest capacity — Can the hub accept 2000 concurrent POSTs?
  2. Fusion capacity — Can the engine process 2000 nodes?

The Quadratic Problem

run_multi_fusion() uses itertools.combinations(node_ids, 2) for pairwise comparison:

  • 3 nodes → 3 pairs (current demo)
  • 100 nodes → 4,950 pairs
  • 500 nodes → 124,750 pairs
  • 2000 nodes → 1,999,000 pairs

Each node pair runs multi-pass blocking + scoring. Even if each pair takes only 0.1 ms, ~2M pairs take about 200 seconds. This is the expected bottleneck.
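The pair counts and the 200-second estimate above can be reproduced with a few lines. This is a minimal sketch: the 0.1 ms per-pair cost is the illustrative figure from the text, not a measurement, and the helper names are hypothetical.

```python
from math import comb


def pair_count(n_nodes: int) -> int:
    """Number of unordered node pairs, i.e. len(list(combinations(ids, 2)))."""
    return comb(n_nodes, 2)


def estimated_seconds(n_nodes: int, per_pair_ms: float = 0.1) -> float:
    """Projected wall time if every node pair costs per_pair_ms (assumed, not measured)."""
    return pair_count(n_nodes) * per_pair_ms / 1000.0


if __name__ == "__main__":
    for n in (3, 100, 500, 2000):
        print(f"{n:>5} nodes: {pair_count(n):>9,} pairs, ~{estimated_seconds(n):.1f} s")
```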

Mitigation: Pre-merge by Blocking Key

Most node pairs will share ZERO blocking keys and produce ZERO candidates. Instead of iterating all node pairs, we can:

  1. Build a GLOBAL blocking index: {blocking_key: [(node_id, record_id, record)...]}
  2. Only compare records that share a blocking key — regardless of which node they came from
  3. This converts O(nodes²) into O(blocks × records_per_block²)
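The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in pipeline.py: records arrive as (node_id, record_id, record) tuples, each record is a dict, and key_fn is a hypothetical function returning a single blocking key per record. With multi-pass blocking a record would have several keys, and candidate pairs would need de-duplicating across blocks.

```python
from collections import defaultdict
from itertools import combinations


def build_global_index(records, key_fn):
    """Group (node_id, record_id, record) tuples by blocking key."""
    index = defaultdict(list)
    for node_id, record_id, record in records:
        index[key_fn(record)].append((node_id, record_id, record))
    return index


def candidate_pairs(index, skip_same_node=False):
    """Yield record pairs that share a blocking key.

    skip_same_node is an assumption added for illustration: set it to True
    if two observations from the same node should never be fused.
    """
    for block in index.values():
        for a, b in combinations(block, 2):
            if skip_same_node and a[0] == b[0]:
                continue
            yield a, b
```

Usage: `candidate_pairs(build_global_index(records, lambda r: r["phenomenon_class"]))` visits only records inside the same block, so node pairs with no shared key cost nothing.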

For the island demo, blocking is on phenomenon_class, which has ~5 values. 2000 nodes × 5 obs = 10,000 records, so each phenomenon holds ~2,000 records. Candidate pairs within each block: 2000² / 2 ≈ 2M per phenomenon, × 5 phenomena ≈ 10M total. Still large.

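But with finer blocking (phenomenon_class + temporal bucketing), this drops sharply: pair count grows with the square of block size, so splitting each phenomenon evenly into k time buckets cuts the total candidates by roughly a factor of k.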
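A composite key and its effect on candidate counts can be sketched as below. Everything beyond phenomenon_class is an assumption made for illustration: the `timestamp` field (epoch seconds), the 300-second bucket width, and the even spread of records across blocks. Observations that fall on either side of a bucket boundary would not be compared under this simple scheme.

```python
from math import comb


def blocking_key(record, bucket_seconds=300):
    """Composite key: (phenomenon_class, time bucket). Field names are assumed."""
    bucket = int(record["timestamp"] // bucket_seconds)
    return (record["phenomenon_class"], bucket)


def candidates_total(n_records, n_blocks):
    """Total within-block pairs, assuming records spread evenly over blocks."""
    per_block = n_records // n_blocks
    return n_blocks * comb(per_block, 2)


if __name__ == "__main__":
    # 5 phenomenon classes only, as in the island demo estimate.
    print(candidates_total(10_000, 5))
    # 5 classes x 20 hypothetical time buckets = 100 blocks.
    print(candidates_total(10_000, 100))
```

Under these assumptions, going from 5 blocks to 100 blocks reduces the candidate count from 9,995,000 to 495,000, roughly the 20× suggested by the bucket count.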

Test Plan

Phase 1: Find the Breaking Point (local, no Docker)

Test    Nodes   Obs/Node   Total Obs   Expected
S-20a   10      5          50          Baseline
S-20b   50      5          250         Fast
S-20c   100     5          500         Should work
S-20d   500     5          2,500       May slow down
S-20e   1000    5          5,000       Stress
S-20f   2000    5          10,000      Target

Measure: ingest time, fusion time, memory, match count.
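A small harness for the Phase 1 matrix might look like the sketch below. The synthetic data generator, the placeholder phenomenon names, and the `measure` helper are all hypothetical; the real tests would pass the ingest and fusion entry points in place of the stand-in workload. Note that tracemalloc adds overhead, so timing and memory are better captured in separate runs when precise numbers matter.

```python
import random
import time
import tracemalloc

# Node counts from the Phase 1 table; 5 observations per node throughout.
SCALES = {"S-20a": 10, "S-20b": 50, "S-20c": 100,
          "S-20d": 500, "S-20e": 1000, "S-20f": 2000}
OBS_PER_NODE = 5
# Placeholder class names (assumption); the demo has ~5 phenomenon classes.
PHENOMENA = ["class_a", "class_b", "class_c", "class_d", "class_e"]


def make_nodes(n_nodes, obs_per_node=OBS_PER_NODE, seed=0):
    """Build {node_id: [observation, ...]} with reproducible synthetic data."""
    rng = random.Random(seed)
    return {
        f"node-{i}": [{"phenomenon_class": rng.choice(PHENOMENA)}
                      for _ in range(obs_per_node)]
        for i in range(n_nodes)
    }


def measure(fn, *args, **kwargs):
    """Run fn once and report its result, wall time, and peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak}


if __name__ == "__main__":
    for test_id, n_nodes in SCALES.items():
        nodes = make_nodes(n_nodes)
        # Stand-in workload; replace with the real fusion call.
        stats = measure(lambda: sum(len(obs) for obs in nodes.values()))
        print(test_id, n_nodes, stats["result"],
              f"{stats['seconds']:.6f}s", stats["peak_bytes"])
```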

Phase 2: Ingest Load Test (local, threading)

Test    Concurrent POSTs     Expected
S-20g   100 simultaneous     < 1s
S-20h   500 simultaneous     < 2s
S-20i   2000 simultaneous    < 5s
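One way to make the POSTs genuinely simultaneous with threading is to hold every worker at a barrier and release them together. The sketch below assumes a hypothetical `post_fn(payload)` callable, for example a thin wrapper around the test client's POST call; the hub endpoint and payload shape are not specified here, so they are left to the caller.

```python
import threading
import time


def fire_concurrently(post_fn, payloads, timeout=30.0):
    """Call post_fn(payload) from one thread per payload, released together.

    Returns (results, elapsed_seconds). A failed call stores the exception
    in its result slot instead of aborting the whole run.
    """
    n = len(payloads)
    if n == 0:
        return [], 0.0

    barrier = threading.Barrier(n)
    results = [None] * n

    def worker(i):
        try:
            barrier.wait(timeout)  # block until all n workers are ready
            results[i] = post_fn(payloads[i])
        except Exception as exc:  # includes BrokenBarrierError on timeout
            results[i] = exc

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start
```

The elapsed time includes thread start-up, so at 2000 threads it slightly overstates the hub's own latency; that errs on the safe side for the thresholds in the table.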

Phase 3: Optimize if Needed

If Phase 1 shows run_multi_fusion() is too slow at 2000 nodes, implement the global blocking index optimization. This is a performance fix in pipeline.py, not an architecture change.
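Because the optimization is meant to change speed only, a useful regression check is that both strategies yield the same cross-node candidate set. The sketch below is a simplified model of the two approaches, not the actual run_multi_fusion() code: it assumes a single blocking key per record, a `nodes` mapping of node_id to a list of records, and a hypothetical key_fn.

```python
from collections import defaultdict
from itertools import combinations


def pairs_by_node_iteration(nodes, key_fn):
    """Baseline model: visit every node pair, keep records sharing a key."""
    out = set()
    for a, b in combinations(sorted(nodes), 2):
        for i, rec_a in enumerate(nodes[a]):
            for j, rec_b in enumerate(nodes[b]):
                if key_fn(rec_a) == key_fn(rec_b):
                    out.add(((a, i), (b, j)))
    return out


def pairs_by_global_index(nodes, key_fn):
    """Optimized model: one global index, compare only within each block."""
    index = defaultdict(list)
    for node_id in sorted(nodes):  # sorted so pair order matches the baseline
        for i, rec in enumerate(nodes[node_id]):
            index[key_fn(rec)].append((node_id, i))
    out = set()
    for block in index.values():
        for x, y in combinations(block, 2):
            if x[0] != y[0]:  # cross-node pairs only, as in the baseline
                out.add((x, y))
    return out
```

A scaling test can assert `pairs_by_node_iteration(nodes, key_fn) == pairs_by_global_index(nodes, key_fn)` at a small node count before trusting the optimized path at 2000.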

File Plan

File                               Action
tests/test_node_scaling.py         New: all scaling tests
parallax/ops/fusion/pipeline.py    May need optimization for global blocking

Acceptance Criteria

  • 2000 nodes ingest in < 5s (local TestClient)
  • 2000 nodes fusion completes — establishing the baseline time is the first deliverable of this spec; the measured value becomes the regression budget
  • Accuracy unchanged from 3-node baseline where ground truth overlaps
  • No crashes or OOM at any scale

Depends on: component.parallax.three-phase-protocol

Realizes: product.fusion