
Island Fusion Robustness Tests

Status & scope

  • Stage: Draft
  • Author: Claude (from Chris's requirements)
  • Date: 2026-03-14
  • Depends on: Island two-container demo (island_hub, island_edge, docker-compose.island.yml)

Purpose

Prove the island fusion wire protocol is robust under realistic tactical conditions: concurrent access, large payloads, partial failures, repeated cycles, and mid-flight network disruption. These tests run locally (no Docker) using httpx TestClient and threading.

Test Categories

R-01: Concurrent Edge Submission

Risk: Three edge nodes POST simultaneously. The ObservationStore uses a threading lock, but we haven't proven it under contention.

Tests:

| ID | Test | Acceptance |
|---|---|---|
| R-01a | 3 threads POST simultaneously to /ingest (5 observations each) | All 15 observations stored, no data loss |
| R-01b | 10 threads POST simultaneously (stress) | All observations stored; total == sum of all posts |
| R-01c | Concurrent ingest + concurrent status reads | No deadlock; status always consistent (sum of per-node counts == total) |
| R-01d | Concurrent ingest + clear (race condition) | No crash. Once the clear and all ingests have finished, the store holds only batches ingested after the clear took effect (possibly none). An ingest that overlaps the clear either survives it or is cleared by it; both outcomes are acceptable |

Implementation: Use threading.Thread + threading.Barrier to synchronize start. Verify with assertions on final state.
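The Barrier pattern can be sketched with a stdlib-only stand-in for the store. `ObservationStore` below is hypothetical (assumed to guard a per-node dict with one lock); the real tests would POST through the TestClient rather than call `ingest` directly.

```python
import threading
from collections import defaultdict


class ObservationStore:
    """Hypothetical stand-in for island_hub's lock-guarded ObservationStore."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_node = defaultdict(list)

    def ingest(self, node_id, observations):
        with self._lock:
            self._by_node[node_id].extend(observations)
            return len(observations)

    def status(self):
        with self._lock:
            per_node = {n: len(obs) for n, obs in self._by_node.items()}
            return {"nodes": len(per_node), "per_node": per_node,
                    "total": sum(per_node.values())}


def run_concurrent_ingest(store, n_threads, obs_per_node=5):
    """Release n_threads ingesters at the same instant, then wait for all."""
    barrier = threading.Barrier(n_threads)
    errors = []

    def worker(i):
        batch = [{"node": f"edge-{i}", "seq": s} for s in range(obs_per_node)]
        barrier.wait()  # blocks until all n_threads workers have arrived
        try:
            store.ingest(f"edge-{i}", batch)
        except Exception as exc:  # surface worker failures to the main thread
            errors.append(exc)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    assert not any(t.is_alive() for t in threads), "possible deadlock"
    assert not errors, errors
    return store.status()


# R-01a: three edges, five observations each.
status = run_concurrent_ingest(ObservationStore(), n_threads=3)
assert status["total"] == 15
assert sum(status["per_node"].values()) == status["total"]
```

Joining with a timeout turns a deadlock into a failed assertion instead of a hung test run.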

R-02: Large Payload

Risk: Tactical edge nodes may batch hundreds of observations. JSON serialization, HTTP transfer, and fusion engine must handle this without timeout or memory issues.

Tests:

| ID | Test | Acceptance |
|---|---|---|
| R-02a | 100 observations per node (300 total) | Ingest < 1 s, fusion completes, matches > 0 |
| R-02b | 500 observations per node (1,500 total) | Ingest < 2 s, fusion completes in < 10 s |
| R-02c | Payload size validation | Response's accepted count matches the number of observations sent |
| R-02d | Empty observations list | Returns accepted=0, no error |

Data generation: Duplicate existing 5-observation fixtures with randomized timestamps and jittered coordinates. Use deterministic seed for reproducibility.
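A minimal sketch of seeded fixture expansion. The field names (`obs_id`, `lat`, `lon`, `timestamp`), the jitter bounds and `BASE_FIXTURE` itself are assumptions standing in for the existing 5-observation fixtures.

```python
import copy
import random
from datetime import datetime, timedelta

# Hypothetical stand-in for an existing 5-observation fixture.
BASE_FIXTURE = [
    {"obs_id": f"base-{i}", "lat": 34.0 + 0.01 * i, "lon": -117.0 - 0.01 * i,
     "timestamp": "2026-03-14T12:00:00+00:00"}
    for i in range(5)
]


def expand_fixture(base, count, node_id, seed=42,
                   max_jitter_deg=0.001, max_shift_s=300):
    """Clone `base` round-robin into `count` observations with jittered
    coordinates and shifted timestamps. Same seed, same output."""
    rng = random.Random(seed)  # private RNG: global random state is untouched
    out = []
    for i in range(count):
        obs = copy.deepcopy(base[i % len(base)])  # never mutate the fixture
        obs["obs_id"] = f"{node_id}-{i}"
        obs["lat"] += rng.uniform(-max_jitter_deg, max_jitter_deg)
        obs["lon"] += rng.uniform(-max_jitter_deg, max_jitter_deg)
        ts = datetime.fromisoformat(obs["timestamp"])
        shift = timedelta(seconds=rng.uniform(-max_shift_s, max_shift_s))
        obs["timestamp"] = (ts + shift).isoformat()
        out.append(obs)
    return out


# R-02a: 100 observations per node, one seed per node.
batches = {node: expand_fixture(BASE_FIXTURE, 100, node, seed=seed)
           for seed, node in enumerate(["edge-a", "edge-b", "edge-c"])}
```

Giving each node its own seed keeps the three batches distinct while the whole run stays reproducible.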

R-03: Partial Failure Recovery

Risk: Hub or network fails during operation. Edges must retry. Hub must not lose already-ingested data on failed fusion.

Tests:

| ID | Test | Acceptance |
|---|---|---|
| R-03a | Fusion fails (lens not loaded) | Returns 500/error; previously ingested observations still present |
| R-03b | Ingest after failed fusion | New observations accepted; re-running fusion succeeds |
| R-03c | Edge retry simulation: first 2 calls fail, 3rd succeeds | post_observations() returns success, attempts=3 |
| R-03d | Hub returns 500 on ingest (simulated) | Edge raises HTTPStatusError rather than retrying, because 500 is not treated as a transient error |
| R-03e | Partial ingest (2 of 3 nodes), run fusion; 3rd node arrives, re-run fusion | Second run has more matches than the first |

R-04: Multi-Cycle Operation

Risk: In tactical tempo, the hub runs multiple fusion cycles without restart. State from previous cycles must not leak into the next.

Tests:

| ID | Test | Acceptance |
|---|---|---|
| R-04a | Cycle 1: ingest 3 nodes → fuse → verify; clear. Cycle 2: ingest 3 nodes → fuse → verify | Both cycles produce identical fusion results (apart from run IDs) |
| R-04b | Clear between cycles truly resets state | After clear, status shows 0 nodes, 0 observations |
| R-04c | Fusion results from cycle 1 still retrievable after cycle 2 | GET /fusion/results/{run_id_1} returns cycle 1 results |
| R-04d | 5 rapid back-to-back cycles | All 5 produce valid results; no state leakage; memory does not grow steadily from cycle to cycle |
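One way to make R-04d's memory criterion concrete is to sample `tracemalloc` after each cycle and fail only when memory rises on every cycle by more than a tolerance. This is a sketch: `clean_cycle` is a hypothetical stand-in for ingest, fuse and clear against the hub, and the 64 KiB tolerance is an arbitrary assumption to tune against real runs.

```python
import gc
import tracemalloc


def memory_after_cycles(run_cycle, n_cycles=5, warmup=1):
    """Run `run_cycle` repeatedly and return traced memory (bytes) after each
    measured cycle. Warm-up cycles absorb one-time allocations."""
    for _ in range(warmup):
        run_cycle()
    tracemalloc.start()
    try:
        samples = []
        for _ in range(n_cycles):
            run_cycle()
            gc.collect()  # drop cyclic garbage so only live objects count
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
        return samples
    finally:
        tracemalloc.stop()


def assert_no_steady_growth(samples, tolerance_bytes=64 * 1024):
    """Leak signature: memory rises on every cycle AND by a non-trivial total.
    Small cycle-to-cycle noise is tolerated."""
    strictly_rising = all(b > a for a, b in zip(samples, samples[1:]))
    grew = samples[-1] - samples[0]
    assert not (strictly_rising and grew > tolerance_bytes), (
        f"memory rose on every cycle, {grew} bytes in total: {samples}")


# Hypothetical stand-in for one hub cycle: ingest 3 nodes, then clear.
_store = {}


def clean_cycle():
    for node in ("edge-a", "edge-b", "edge-c"):
        _store[node] = [{"seq": i} for i in range(500)]
    _store.clear()


assert_no_steady_growth(memory_after_cycles(clean_cycle))
```

Requiring both a strictly rising series and a total above the tolerance keeps the check from flaking on allocator noise while still catching state that accumulates across cycles.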

R-05: Mid-Flight Network Disruption

Risk: Network degrades during an edge POST, not before. The edge must handle partial sends, connection resets, and timeouts gracefully.

Tests:

| ID | Test | Acceptance |
|---|---|---|
| R-05a | ConnectError on attempt 1, success on attempt 2 | Returns success, attempts=2 |
| R-05b | TimeoutException on attempt 1, success on attempt 2 | Returns success, attempts=2 |
| R-05c | RemoteProtocolError on attempt 1, success on attempt 2 | Returns success, attempts=2 |
| R-05d | Consecutive failures: fail, fail, success | Returns success, attempts=3 |
| R-05e | ReadError (connection reset mid-transfer) | Retried like the other transient errors: succeeds on a later attempt, or exhausts retries and returns failed status |
| R-05f | Mixed error types: ConnectError, then TimeoutException, then success | Returns success, attempts=3 |
| R-05g | All retries exhausted with mixed errors | Returns failed status with correct attempt count |

Implementation: Mock httpx.post with side_effect lists. The edge_node must also handle httpx.ReadError, which it does not currently catch (new finding).

New Exception to Handle

While writing this spec, we identified that httpx.ReadError (raised on a connection reset mid-transfer) is not caught by the edge node. It must be added to the retry exception tuple alongside ConnectError, TimeoutException, and RemoteProtocolError.

File Plan

| File | Action |
|---|---|
| tests/test_island_robustness.py | New: all R-01 through R-05 tests |
| island_edge/edge_node.py | Fix: add httpx.ReadError to retry exceptions |
| island_hub/services/obs_store.py | Verify: thread safety under R-01 scenarios |

Non-Goals

  • Docker-level testing (covered by run_island_demo.sh)
  • Toxiproxy integration in pytest (requires Docker)
  • Performance benchmarking (separate concern)
  • Data corruption detection (observations are treated as immutable once ingested, so there is no mutation risk)

Acceptance Criteria

All tests in tests/test_island_robustness.py pass. No existing tests regress.


Depends on: component.parallax.three-phase-protocol

Realizes: product.fusion