Skip to content

Latest commit

 

History

History
199 lines (164 loc) · 8.82 KB

File metadata and controls

199 lines (164 loc) · 8.82 KB

ADR-062: QEMU ESP32-S3 Swarm Configurator

Field Value
Status Accepted
Date 2026-03-14
Authors RuView Team
Relates ADR-061 (QEMU testing platform), ADR-060 (channel/MAC filter), ADR-018 (binary frame), ADR-039 (edge intel)

Glossary

Term Definition
Swarm A group of N QEMU ESP32-S3 instances running simultaneously
Topology How nodes are connected: star, mesh, line, ring
Role Node function: sensor (collects CSI), coordinator (aggregates + forwards), gateway (bridges to host)
Scenario matrix Cross-product of topology × node count × NVS config × mock scenario
Health oracle Python process that monitors all node UART logs and declares swarm health

Context

ADR-061 Layer 3 provides a basic multi-node mesh test: N identical nodes with sequential TDM slots connected via a Linux bridge. This is useful but limited:

  1. All nodes are identical — real deployments have heterogeneous roles (sensor, coordinator, gateway)
  2. Single topology — only fully-connected bridge; no star, line, or ring topologies
  3. No scenario variation per node — all nodes run the same mock CSI scenario
  4. Manual configuration — each test requires hand-editing env vars and arguments
  5. No swarm-level health monitoring — validation checks individual nodes, not collective behavior
  6. No cross-node timing validation — TDM slot ordering and inter-frame gaps aren't verified

Real WiFi-DensePose deployments use 3-8 ESP32-S3 nodes in various topologies. A single coordinator aggregates CSI from multiple sensors. The firmware must handle TDM conflicts, missing nodes, role-based behavior differences, and network partitions — none of which ADR-061 Layer 3 tests.

Decision

Build a QEMU Swarm Configurator — a YAML-driven tool that defines multi-node test scenarios declaratively and orchestrates them under QEMU with swarm-level validation.

Architecture

┌─────────────────────────────────────────────────────┐
│                 swarm_config.yaml                     │
│  nodes: [{role: sensor, scenario: 2, channel: 6}]   │
│  topology: star                                       │
│  duration: 60s                                        │
│  assertions: [all_nodes_boot, tdm_no_collision, ...]  │
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │   qemu_swarm.py         │
          │   (orchestrator)        │
          └───┬────┬────┬───┬──────┘
              │    │    │   │
         ┌────▼┐ ┌▼──┐ ▼  ┌▼────┐
         │Node0│ │N1 │... │N(n-1)│   QEMU instances
         │sens │ │sen│    │coord │
         └──┬──┘ └─┬─┘    └──┬───┘
            │      │         │
         ┌──▼──────▼─────────▼──┐
         │  Virtual Network      │    TAP bridge / SLIRP
         │  (topology-shaped)    │
         └──────────┬───────────┘
                    │
         ┌──────────▼───────────┐
         │  Aggregator (Rust)    │    Collects frames
         └──────────┬───────────┘
                    │
         ┌──────────▼───────────┐
         │  Health Oracle        │    Swarm-level assertions
         │  (swarm_health.py)    │
         └──────────────────────┘

YAML Configuration Schema

# swarm_config.yaml
swarm:
  name: "3-sensor-star"
  duration_s: 60
  topology: star          # star | mesh | line | ring
  aggregator_port: 5005

nodes:
  - role: coordinator
    node_id: 0
    scenario: 0           # empty room (baseline)
    channel: 6
    edge_tier: 2
    is_gateway: true       # receives aggregated frames

  - role: sensor
    node_id: 1
    scenario: 2           # walking person
    channel: 6
    tdm_slot: 1           # TDM slot index (auto-assigned from node position if omitted)

  - role: sensor
    node_id: 2
    scenario: 3           # fall event
    channel: 6
    tdm_slot: 2

assertions:
  - all_nodes_boot
  - no_crashes
  - tdm_no_collision
  - all_nodes_produce_frames
  - coordinator_receives_from_all
  - fall_detected_by_node_2
  - frame_rate_above: 15    # Hz minimum per node
  - max_boot_time_s: 10

Topologies

Topology Network Description
star All sensors connect to coordinator; coordinator has TAP to each sensor Hub-and-spoke, most common
mesh All nodes on same bridge (existing Layer 3 behavior) Every node sees every other
line Node 0 ↔ Node 1 ↔ Node 2 ↔ ... Linear chain, tests multi-hop
ring Like line but last connects to first Circular, tests routing

Node Roles

Role Behavior NVS Keys
sensor Runs mock CSI, sends frames to coordinator node_id, tdm_slot, target_ip
coordinator Receives frames from sensors, runs edge aggregation node_id, tdm_slot=0, edge_tier=2
gateway Like coordinator but also bridges to host UDP node_id, target_ip=host, is_gateway=1

Assertions (Swarm-Level)

Assertion What It Checks
all_nodes_boot Every node's UART log shows boot indicators within timeout
no_crashes No Guru Meditation, assert, panic in any log
tdm_no_collision No two nodes transmit in the same TDM slot
all_nodes_produce_frames Every sensor node's log contains CSI frame output
coordinator_receives_from_all Coordinator log shows frames from each sensor's node_id
fall_detected_by_node_N Node N's log reports a fall detection event
frame_rate_above Each node produces at least N frames/second
max_boot_time_s All nodes boot within N seconds
no_heap_errors No OOM or heap corruption in any log
network_partitioned_recovery After deliberate partition, nodes resume communication (future)

Preset Configurations

Preset Nodes Topology Purpose
smoke 2 star Quick CI smoke test (15s)
standard 3 star Default 3-node (sensor + sensor + coordinator)
large-mesh 6 mesh Scale test with 6 fully-connected nodes
line-relay 4 line Multi-hop relay chain
ring-fault 4 ring Ring with fault injection mid-test
heterogeneous 5 star Mixed scenarios: walk, fall, static, channel-sweep, empty
ci-matrix 3 star CI-optimized preset (30s, minimal assertions)

File Layout

scripts/
├── qemu_swarm.py              # Main orchestrator (CLI entry point)
├── swarm_health.py            # Swarm-level health oracle
└── swarm_presets/
    ├── smoke.yaml
    ├── standard.yaml
    ├── large_mesh.yaml
    ├── line_relay.yaml
    ├── ring_fault.yaml
    ├── heterogeneous.yaml
    └── ci_matrix.yaml

.github/workflows/
└── firmware-qemu.yml          # MODIFIED: add swarm test job

Consequences

Benefits

  1. Declarative testing — define swarm topology in YAML, not shell scripts
  2. Role-based nodes — test coordinator/sensor/gateway interactions
  3. Topology variety — star/mesh/line/ring match real deployment patterns
  4. Swarm-level assertions — validate collective behavior, not just individual nodes
  5. Preset library — quick CI smoke tests and thorough manual validation
  6. Reproducible — YAML configs are version-controlled and shareable

Limitations

  1. Still requires root for TAP bridge topologies (star, line, ring); mesh can use SLIRP
  2. QEMU resource usage — 6+ QEMU instances use ~2GB RAM, may slow CI runners
  3. No real RF — inter-node communication is IP-based, not WiFi CSI multipath

References

  • ADR-061: QEMU ESP32-S3 firmware testing platform (Layers 1-9)
  • ADR-060: Channel override and MAC address filter provisioning
  • ADR-018: Binary CSI frame format (magic 0xC5110001)
  • ADR-039: Edge intelligence pipeline (biquad, vitals, fall detection)