Node Scaling — 2000 Edge Nodes
Status & scope
- Stage: Draft
- Date: 2026-03-14
- Requirement: VRS use case demands 2000 edge nodes reporting to a single hub.
Problem
Two scaling axes:
- Ingest capacity — Can the hub accept 2000 concurrent POSTs?
- Fusion capacity — Can the engine process 2000 nodes?
The Quadratic Problem
run_multi_fusion() uses itertools.combinations(node_ids, 2) for pairwise comparison:
- 3 nodes → 3 pairs (current demo)
- 100 nodes → 4,950 pairs
- 500 nodes → 124,750 pairs
- 2000 nodes → 1,999,000 pairs
Each pair runs multi-pass blocking + scoring. Even at 0.1 ms per pair, ~2M pairs take about 200 seconds. Pair enumeration is therefore the expected bottleneck.
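The pair counts and the 200-second estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the 0.1 ms per-pair cost is the illustrative figure from the text, not a measurement, and the helper names are hypothetical.

```python
import math


def pair_count(n_nodes: int) -> int:
    """Unordered node pairs, i.e. how many items combinations(node_ids, 2) yields."""
    return math.comb(n_nodes, 2)


def estimated_seconds(n_nodes: int, ms_per_pair: float = 0.1) -> float:
    """Projected fusion time if every pair costs ms_per_pair (assumed, not measured)."""
    return pair_count(n_nodes) * ms_per_pair / 1000.0


for n in (3, 100, 500, 2000):
    print(f"{n:>5} nodes -> {pair_count(n):>9,} pairs, ~{estimated_seconds(n):.1f} s")
```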
Mitigation: Pre-merge by Blocking Key
Most node pairs will share ZERO blocking keys and produce ZERO candidates. Instead of iterating all node pairs, we can:
- Build a GLOBAL blocking index: {blocking_key: [(node_id, record_id, record)...]}
- Only compare records that share a blocking key, regardless of which node they came from
- This converts O(nodes²) into O(blocks × records_per_block²)
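A minimal sketch of the index described in the list above, assuming records arrive as (node_id, record_id, record) tuples and that a caller-supplied key_fn extracts the blocking key. The function names are hypothetical and do not come from pipeline.py. Skipping same-node pairs is also an assumption, chosen to mirror the current node-pair behaviour; it can be switched off.

```python
from collections import defaultdict
from itertools import combinations


def build_global_index(records, key_fn):
    """Group (node_id, record_id, record) tuples by blocking key."""
    index = defaultdict(list)
    for node_id, record_id, record in records:
        index[key_fn(record)].append((node_id, record_id, record))
    return index


def candidate_pairs(index, cross_node_only=True):
    """Yield record pairs that share a blocking key.

    cross_node_only=True drops pairs from the same node (an assumption,
    not confirmed by the spec).
    """
    for block in index.values():
        for a, b in combinations(block, 2):
            if cross_node_only and a[0] == b[0]:
                continue
            yield a, b


records = [
    ("n1", "r1", {"phenomenon_class": "fog"}),
    ("n2", "r2", {"phenomenon_class": "fog"}),
    ("n3", "r3", {"phenomenon_class": "rain"}),
    ("n1", "r4", {"phenomenon_class": "fog"}),
]
index = build_global_index(records, key_fn=lambda r: r["phenomenon_class"])
pairs = list(candidate_pairs(index))
```

Here the "rain" block holds a single record and contributes no pairs, so no work is spent on node pairs that share no blocking key.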
For the island demo, blocking is on phenomenon_class, which has ~5 values. 2000 nodes × 5 obs = 10,000 records, so each phenomenon block holds ~2000 records. Candidate pairs per block: 2000² / 2 ≈ 2M; across 5 blocks that is ~10M total. Still large.
But with proper blocking (phenomenon_class + temporal bucketing), this should drop dramatically, because each block then holds only the records of one phenomenon within one time window.
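To make the effect of temporal bucketing concrete, here is a hedged sketch of a composite blocking key plus the resulting pair-count arithmetic. The field name observed_at (epoch seconds), the 300-second bucket width, and the uniform spread of records across buckets are all assumptions for illustration; the spec does not fix any of them.

```python
import math


def blocking_key(record, bucket_seconds=300):
    """Composite key: (phenomenon_class, time bucket). Field names are assumed."""
    bucket = int(record["observed_at"] // bucket_seconds)
    return (record["phenomenon_class"], bucket)


def pairs_with_buckets(records_per_class, n_classes, n_buckets):
    """Candidate pairs if each class's records spread evenly over n_buckets."""
    per_bucket = records_per_class // n_buckets
    return n_classes * n_buckets * math.comb(per_bucket, 2)


print(blocking_key({"phenomenon_class": "fog", "observed_at": 601}))
print(pairs_with_buckets(2000, 5, 1))   # phenomenon_class only
print(pairs_with_buckets(2000, 5, 20))  # plus 20 time buckets per class
```

Under these assumptions, 20 buckets per class cut the candidate count from roughly 10M to under 0.5M. One design caveat: two observations just either side of a bucket boundary land in different blocks, so a real implementation may need overlapping or neighbouring-bucket lookups.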
Test Plan
Phase 1: Find the Breaking Point (local, no Docker)
| Test | Nodes | Obs/Node | Total Obs | Expected |
|---|---|---|---|---|
| S-20a | 10 | 5 | 50 | Baseline |
| S-20b | 50 | 5 | 250 | Fast |
| S-20c | 100 | 5 | 500 | Should work |
| S-20d | 500 | 5 | 2,500 | May slow down |
| S-20e | 1000 | 5 | 5,000 | Stress |
| S-20f | 2000 | 5 | 10,000 | Target |
Measure: ingest time, fusion time, memory, match count.
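One possible way to capture wall-clock time and peak memory for each test row is a small wrapper like the one below. It is a sketch using only the standard library; the measure helper is hypothetical, and the call to the real run_multi_fusion() is left out because its signature is not given here.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and report its result, elapsed seconds, and peak traced bytes."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak}


# Stand-in workload; in the scaling tests fn would be the ingest or fusion call.
stats = measure(sum, range(1000))
print(stats)
```

tracemalloc adds overhead of its own, so timings taken with tracing enabled will be somewhat inflated; running time and memory measurements in separate passes may give cleaner numbers.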
Phase 2: Ingest Load Test (local, threading)
| Test | Concurrent POSTs | Expected |
|---|---|---|
| S-20g | 100 simultaneous | < 1s |
| S-20h | 500 simultaneous | < 2s |
| S-20i | 2000 simultaneous | < 5s |
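The "simultaneous" POSTs in the table could be driven by a thread-per-request harness like this sketch. post_fn is a hypothetical callable wrapping whatever performs the hub's ingest POST (for example a TestClient call); the endpoint and payload shape are not specified here, so a stub is used. A threading.Barrier holds every thread until all are ready, so the requests are released together.

```python
import threading
import time


def fire_concurrently(post_fn, payloads):
    """Call post_fn(payload) from one thread per payload, released together.

    Assumes payloads is non-empty. A slot stays None if post_fn raises.
    """
    barrier = threading.Barrier(len(payloads))
    results = [None] * len(payloads)

    def worker(i, payload):
        barrier.wait()  # block until every thread has started
        results[i] = post_fn(payload)

    threads = [
        threading.Thread(target=worker, args=(i, p))
        for i, p in enumerate(payloads)
    ]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - t0


# Stub standing in for the real POST; returns a fake status code.
results, elapsed = fire_concurrently(lambda payload: 200, list(range(20)))
print(len(results), f"{elapsed:.3f}s")
```

The elapsed time includes thread start-up, which becomes noticeable at 2000 threads, so it is an upper bound on the hub's own handling time.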
Phase 3: Optimize if Needed
If Phase 1 shows run_multi_fusion() is too slow at 2000 nodes, implement the global blocking index optimization. This is a performance fix in pipeline.py, not an architecture change.
File Plan
| File | Action |
|---|---|
| tests/test_node_scaling.py | New: all scaling tests |
| parallax/ops/fusion/pipeline.py | May need optimization for global blocking |
Acceptance Criteria
- 2000 nodes ingest in < 5s (local TestClient)
- 2000 nodes fusion completes — establishing the baseline time is the first deliverable of this spec; the measured value becomes the regression budget
- Accuracy unchanged from 3-node baseline where ground truth overlaps
- No crashes or OOM at any scale
Depends on: component.parallax.three-phase-protocol
Realizes: product.fusion