Island Fusion Robustness Tests
Status & scope
- Stage: Draft
- Author: Claude (from Chris's requirements)
- Date: 2026-03-14
- Depends on: Island two-container demo (island_hub, island_edge, docker-compose.island.yml)
Purpose
Prove the island fusion wire protocol is robust under realistic tactical conditions: concurrent access, large payloads, partial failures, repeated cycles, and mid-flight network disruption. These tests run locally (no Docker) using an httpx-based TestClient and threading.
Test Categories
R-01: Concurrent Edge Submission
Risk: Three edge nodes POST simultaneously. The ObservationStore uses a threading lock, but we haven't proven it under contention.
Tests:
| ID | Test | Acceptance |
|---|---|---|
| R-01a | 3 threads POST simultaneously to /ingest | All 15 observations stored, no data loss |
| R-01b | 10 threads POST simultaneously (stress) | All observations stored, total == sum of all posts |
| R-01c | Concurrent ingest + concurrent status reads | No deadlock, status always consistent (sum of per-node counts == total) |
| R-01d | Concurrent ingest + clear (race condition) | No crash. After clear completes, store is empty. Ingest during clear either succeeds or gets cleared — both acceptable |
Implementation: Use threading.Thread + threading.Barrier to synchronize start. Verify with assertions on final state.
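
The barrier pattern described above can be sketched as follows. This is a minimal illustration, not the real test: `InMemoryObservationStore` and `run_concurrent_ingest` are hypothetical stand-ins, and the actual tests would POST to /ingest through the TestClient rather than call a store directly.

```python
import threading


class InMemoryObservationStore:
    """Hypothetical stand-in for the hub's ObservationStore (not the real class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_node = {}

    def ingest(self, node_id, observations):
        # The lock guards the read-modify-write on the shared dict.
        with self._lock:
            self._by_node.setdefault(node_id, []).extend(observations)
            return len(observations)

    def status(self):
        with self._lock:
            per_node = {n: len(obs) for n, obs in self._by_node.items()}
        return {"nodes": per_node, "total": sum(per_node.values())}


def run_concurrent_ingest(store, n_threads, obs_per_thread):
    """Start n_threads workers that all ingest at the same moment."""
    barrier = threading.Barrier(n_threads)
    errors = []

    def worker(i):
        try:
            batch = [{"node": f"edge-{i}", "seq": k} for k in range(obs_per_thread)]
            barrier.wait(timeout=5)  # block until every worker is ready
            store.ingest(f"edge-{i}", batch)
        except Exception as exc:  # collect instead of losing errors in threads
            errors.append(exc)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return errors


# R-01a analogue: 3 edges, 5 observations each.
store = InMemoryObservationStore()
errors = run_concurrent_ingest(store, n_threads=3, obs_per_thread=5)
status = store.status()
```

Collecting exceptions in a list matters because an assertion that fails inside a worker thread does not fail the test on its own; the main thread has to check `errors` and the final `status` after joining.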
R-02: Large Payload
Risk: Tactical edge nodes may batch hundreds of observations. JSON serialization, HTTP transfer, and fusion engine must handle this without timeout or memory issues.
Tests:
| ID | Test | Acceptance |
|---|---|---|
| R-02a | 100 observations per node (300 total) | Ingest < 1s, fusion completes, matches > 0 |
| R-02b | 500 observations per node (1500 total) | Ingest < 2s, fusion completes < 10s |
| R-02c | Payload size validation | Response includes correct accepted count matching input |
| R-02d | Empty observations list | Returns accepted=0, no error |
Data generation: Duplicate existing 5-observation fixtures with randomized timestamps and jittered coordinates. Use deterministic seed for reproducibility.
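
One way to generate such batches reproducibly is sketched below. The fixture contents, field names (`obs_id`, `lat`, `lon`, `timestamp`), jitter size, and time window are all assumptions for illustration; the real 5-observation fixtures are not shown in this spec.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical base fixture standing in for the existing 5-observation fixtures.
BASE_FIXTURE = [
    {"obs_id": f"base-{i}", "lat": 34.0 + i * 0.01, "lon": -117.0 - i * 0.01}
    for i in range(5)
]


def generate_observations(node_id, count, seed, jitter_deg=0.001, window_s=600):
    """Cycle through the base fixture, jittering coordinates and timestamps."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    start = datetime(2026, 3, 14, 12, 0, 0, tzinfo=timezone.utc)
    observations = []
    for i in range(count):
        base = BASE_FIXTURE[i % len(BASE_FIXTURE)]
        offset = timedelta(seconds=rng.uniform(0, window_s))
        observations.append({
            "obs_id": f"{node_id}-{i}",
            "node_id": node_id,
            "lat": base["lat"] + rng.uniform(-jitter_deg, jitter_deg),
            "lon": base["lon"] + rng.uniform(-jitter_deg, jitter_deg),
            "timestamp": (start + offset).isoformat(),
        })
    return observations


# Same seed gives an identical batch, which keeps failures reproducible.
batch_a = generate_observations("edge-1", 100, seed=42)
batch_b = generate_observations("edge-1", 100, seed=42)
```

Using a per-call `random.Random(seed)` instead of `random.seed()` keeps the generator independent of test ordering and of any other code that draws from the global RNG.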
R-03: Partial Failure Recovery
Risk: Hub or network fails during operation. Edges must retry. Hub must not lose already-ingested data on failed fusion.
Tests:
| ID | Test | Acceptance |
|---|---|---|
| R-03a | Fusion fails (lens not loaded) | Returns 500/error, previously ingested observations still present |
| R-03b | Ingest after failed fusion | New observations accepted, fusion re-run succeeds |
| R-03c | Edge retry simulation: first 2 calls fail, 3rd succeeds | post_observations() returns success, attempts=3 |
| R-03d | Hub returns 500 on ingest (simulated) | Edge raises HTTPStatusError (not infinite retry) — 500 is not a transient error |
| R-03e | Partial ingest (2/3 nodes), run fusion, then 3rd node arrives, re-run fusion | Second run has more matches than first |
R-04: Multi-Cycle Operation
Risk: In tactical tempo, the hub runs multiple fusion cycles without restart. State from previous cycles must not leak into the next.
Tests:
| ID | Test | Acceptance |
|---|---|---|
| R-04a | Cycle 1: ingest 3 nodes → fuse → verify. Clear. Cycle 2: ingest 3 nodes → fuse → verify. | Both cycles produce identical results |
| R-04b | Clear between cycles truly resets | After clear, status shows 0 nodes, 0 observations |
| R-04c | Fusion results from cycle 1 still retrievable after cycle 2 | GET /fusion/results/{run_id_1} returns cycle 1 results |
| R-04d | 5 rapid cycles back-to-back | All 5 produce valid results, no state leakage, no memory growth pattern |
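
The cycle structure in R-04 can be sketched against an in-memory stand-in. `FakeHub` is hypothetical and its "fusion" is a deliberate placeholder (it only lists track IDs reported by more than one node); the real tests would drive the hub's HTTP endpoints and the real fusion engine.

```python
import uuid


class FakeHub:
    """Hypothetical in-memory hub; not the real island_hub API."""

    def __init__(self):
        self.observations = {}  # node_id -> list of observations
        self.results = {}       # run_id -> result, kept across clears

    def ingest(self, node_id, obs):
        self.observations.setdefault(node_id, []).extend(obs)

    def fuse(self):
        # Placeholder fusion: tracks seen by more than one node count as matches.
        seen = {}
        for node_id, obs in self.observations.items():
            for o in obs:
                seen.setdefault(o["track"], set()).add(node_id)
        matches = sorted(t for t, nodes in seen.items() if len(nodes) > 1)
        run_id = str(uuid.uuid4())
        self.results[run_id] = {"matches": matches}
        return run_id

    def clear(self):
        self.observations.clear()  # stored results are intentionally retained

    def status(self):
        total = sum(len(v) for v in self.observations.values())
        return {"nodes": len(self.observations), "observations": total}


def run_cycle(hub):
    """Ingest the same 3-node input and fuse; returns the run ID."""
    for node in ("edge-1", "edge-2", "edge-3"):
        hub.ingest(node, [{"track": f"t{k}"} for k in range(5)])
    return hub.fuse()


hub = FakeHub()
run_ids = []
for _ in range(5):  # R-04d analogue: 5 back-to-back cycles
    run_ids.append(run_cycle(hub))
    hub.clear()
    assert hub.status() == {"nodes": 0, "observations": 0}  # R-04b analogue

results = [hub.results[r]["matches"] for r in run_ids]
```

Comparing every cycle's result against the first is what exposes state leakage: with identical input, any difference between cycles has to come from state left over from an earlier run.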
R-05: Mid-Flight Network Disruption
Risk: Network degrades during an edge POST, not before. The edge must handle partial sends, connection resets, and timeouts gracefully.
Tests:
| ID | Test | Acceptance |
|---|---|---|
| R-05a | ConnectError on attempt 1, success on attempt 2 | Returns success, attempts=2 |
| R-05b | TimeoutException on attempt 1, success on attempt 2 | Returns success, attempts=2 |
| R-05c | RemoteProtocolError on attempt 1, success on attempt 2 | Returns success, attempts=2 |
| R-05d | Consecutive failures: fail, fail, success | Returns success, attempts=3 |
| R-05e | ReadError (connection reset mid-transfer) | Retries, eventually succeeds or exhausts retries |
| R-05f | Mixed error types: ConnectError, then Timeout, then success | Returns success, attempts=3 |
| R-05g | All retries exhausted with mixed errors | Returns failed status with correct attempt count |
Implementation: Mock httpx.post with side_effect lists. The edge_node must handle httpx.ReadError (not currently caught — new finding).
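
The side_effect approach can be sketched with the standard library alone. Everything here is an assumption for illustration: `post_with_retry` is a hypothetical stand-in for the edge node's retry loop (the real `post_observations()` signature may differ), the URL is a placeholder, and built-in exceptions stand in for the httpx ones so the sketch runs without httpx installed. In the real tests the tuple would hold httpx.ConnectError, httpx.TimeoutException, httpx.RemoteProtocolError, and httpx.ReadError.

```python
from unittest.mock import Mock

# Built-in stand-ins for the httpx transport errors named in this spec.
RETRYABLE = (ConnectionRefusedError, TimeoutError, ConnectionResetError)


def post_with_retry(post, url, payload, max_attempts=3, retry_on=RETRYABLE):
    """Hypothetical retry loop: retry transient errors, let others propagate."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = post(url, json=payload)
            return {"status": "success", "attempts": attempt, "response": response}
        except retry_on as exc:
            last_error = exc  # a real loop would likely back off before retrying
    return {"status": "failed", "attempts": max_attempts, "error": repr(last_error)}


# R-05f analogue: two different transient errors, then success.
# Exceptions in a side_effect list are raised; other values are returned.
post = Mock(side_effect=[
    ConnectionRefusedError("refused"),
    TimeoutError("timed out"),
    {"accepted": 5},
])
ok = post_with_retry(post, "http://hub.example/ingest", {"observations": []})

# R-05g analogue: every attempt fails, so retries are exhausted.
always_down = Mock(side_effect=[ConnectionResetError("reset")] * 3)
failed = post_with_retry(always_down, "http://hub.example/ingest", {"observations": []})
```

Because only the exceptions in `retry_on` are caught, a non-transient error propagates on the first attempt, which matches the R-03d expectation that a 500 response is not retried indefinitely.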
New Exception to Handle
While writing this spec, we found that httpx.ReadError (connection reset mid-transfer) is not caught by the edge node. It must be added to the retry exception tuple alongside ConnectError, TimeoutException, and RemoteProtocolError.
File Plan
| File | Action |
|---|---|
| tests/test_island_robustness.py | New: all R-01 through R-05 tests |
| island_edge/edge_node.py | Fix: add httpx.ReadError to retry exceptions |
| island_hub/services/obs_store.py | Verify: thread safety under R-01 scenarios |
Non-Goals
- Docker-level testing (covered by run_island_demo.sh)
- Toxiproxy integration in pytest (requires Docker)
- Performance benchmarking (separate concern)
- Data corruption detection (observations are immutable dicts, no mutation risk)
Acceptance Criteria
All tests in tests/test_island_robustness.py pass. No existing tests regress.
Depends on: component.parallax.three-phase-protocol
Realizes: product.fusion