Skip to content

fix(firmware): defensive node_id capture prevents runtime clobber (#390)#393

Merged
ruvnet merged 1 commit intomainfrom
fix/esp32-node-id-clobber
Apr 16, 2026
Merged

fix(firmware): defensive node_id capture prevents runtime clobber (#390)#393
ruvnet merged 1 commit intomainfrom
fix/esp32-node-id-clobber

Conversation

@ruvnet
Copy link
Copy Markdown
Owner

@ruvnet ruvnet commented Apr 15, 2026

Summary

Closes the long-standing node_id clobber cluster (#232, #375, #385, #386, #390) with a defense-in-depth approach at the firmware layer.

Problem

Users provisioning multiple ESP32 nodes with unique --node-id values report that boot logs show the correct NVS value, but csi_collector and the UDP frame header revert to 1 (the Kconfig default):

nvs_config: NVS override: node_id=4            <- NVS correct
main: ESP32-S3 CSI Node (ADR-018) - Node ID: 4  <- global correct
csi_collector: CSI collection initialized (node_id=1, channel=11)  <- CLOBBERED

All UDP packets decode with byte[4] = 1, making multi-node deployments indistinguishable downstream. The symptom matches memory corruption of g_nvs_config.node_id between main.c:139 (nvs_config_load) and the next use site.

Fix (defense in depth)

  1. csi_collector_init() captures g_nvs_config.node_id into static uint8_t s_node_id once at init.
  2. csi_serialize_frame() writes buf[4] = s_node_id - bypasses the global entirely, so any later corruption of g_nvs_config cannot affect outgoing CSI frames.
  3. All other consumers (edge_processing.c x3 pkt sites, wasm_runtime.c, display_ui.c, main.c swarm_bridge_init) go through a new csi_collector_get_node_id() accessor.
  4. Clobber canary at end of csi_collector_init() logs WARN if g_nvs_config.node_id has already diverged from the captured value. Users hitting the original bug will see this log, pinpointing the corruption path for root-cause work.

Hardware validation (attached ESP32-S3 on COM8)

  • NVS has node_id=2 from prior provision.py run.
  • New boot log includes: csi_collector: Captured node_id=2 at init (defensive copy for #232/#375/#385/#390).
  • csi_collector: CSI collection initialized (node_id=2, channel=5) - matches NVS.
  • UDP sniffer on port 5005: 15/15 packets from 192.168.1.104 with byte[4] = 2.
  • Canary did not fire (this device is not affected by the upstream corruption; canary is there for users who are).

Why defensive, not root cause?

The reported bug does not reproduce on my hardware running the current source - my device is healthy end-to-end. Users see it on various release binaries (v0.4.3.1, v0.7.0 which is actually v0.6.0-esp32 firmware). Three possibilities:

  1. Older release binaries where the March 16 #279 fix was not yet applied - users need to reflash v0.6.0-esp32 or later.
  2. Stack/heap corruption from some other init path that smashes byte 116 of g_nvs_config. Until we can reproduce, this PR ensures the UDP frame is immune.
  3. A subtle race in how NVS reads propagate on multi-node setups.

The canary added here will surface a diagnostic WARN the moment any user hits it, letting us chase the upstream path with real telemetry.

Files changed

  • firmware/esp32-csi-node/main/csi_collector.{c,h} - capture, accessor, canary
  • firmware/esp32-csi-node/main/edge_processing.c - 3 pkt.node_id sites + include
  • firmware/esp32-csi-node/main/wasm_runtime.c - 1 pkt.node_id site + include
  • firmware/esp32-csi-node/main/display_ui.c - display "Node: %u" field
  • firmware/esp32-csi-node/main/main.c - swarm_bridge_init node_id arg
  • CHANGELOG.md - [Unreleased] entry

Next steps

A new firmware release (v0.6.1-esp32) with these binaries is warranted - users on multi-node setups should reflash. CHANGELOG entry is in [Unreleased]; bump version.txt to 0.6.1 at release time.

Refs #232 #375 #385 #386 #390

Test plan

  • Builds clean (idf.py build, 54% flash free)
  • Flashes to COM8 (idf.py -p COM8 flash)
  • Boot log shows defensive capture message with correct node_id
  • UDP packet byte[4] matches NVS node_id (15/15 packets)
  • Canary quiet on healthy device
  • Needs user verification on previously-broken deployment

🤖 Generated with claude-flow

Users on multi-node ESP32 deployments have been reporting for months
that their provisioned `node_id` reverts to the Kconfig default of `1`
in UDP frames and the `csi_collector` init log, despite boot showing:

    nvs_config: NVS override: node_id=4
    main: ESP32-S3 CSI Node (ADR-018) - Node ID: 4
    csi_collector: CSI collection initialized (node_id=1, channel=11)

See #232, #375, #385, #386, #390. The root memory-corruption path for
the `g_nvs_config.node_id` byte has not been definitively isolated
(does not reproduce on my attached ESP32-S3 running current source
and the v0.6.0 release binary), but the UDP frame header can be made
tamper-proof regardless:

1. `csi_collector_init()` now captures `g_nvs_config.node_id` into a
   module-local `static uint8_t s_node_id` at init time.
2. `csi_serialize_frame()` reads `buf[4]` from `s_node_id`, not from
   the global - so any later corruption of `g_nvs_config` cannot
   affect outgoing CSI frames.
3. All other consumers (`edge_processing.c` x3, `wasm_runtime.c`,
   `display_ui.c`, `main.c swarm_bridge_init`) now go through a new
   `csi_collector_get_node_id()` accessor instead of reading the
   global directly.
4. A canary at end-of-init logs `WARN` if `g_nvs_config.node_id`
   already diverges from the captured value - this will pinpoint
   the corruption path if it happens on a user's device.

Hardware validation on attached ESP32-S3 (COM8):
  - NVS loads node_id=2
  - Boot log: `main: ... Node ID: 2`
  - NEW log: `csi_collector: Captured node_id=2 at init (defensive
    copy for #232/#375/#385/#390)`
  - Init log: `csi_collector: CSI collection initialized (node_id=2)`
  - UDP frame byte[4] = 2 (verified via socket sniffer, 15/15 packets)

This is defense in depth - it shields the UDP frame from whatever
upstream bug is clobbering the struct. When a user hits the original
bug, the canary WARN will help isolate the root cause.

Refs #232 #375 #385 #386 #390

Co-Authored-By: claude-flow <ruv@ruv.net>
Copy link
Copy Markdown

@proffesor-for-testing proffesor-for-testing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Live reproduction on our ESP32 — defensive copy is too late

We built and flashed this PR on a device where we can reproduce the clobber bug. The fix does not resolve it.

Evidence

NVS has node_id=3 (verified by reading the NVS partition via esptool read_flash):

NVS offset 328: node_id key → value = 3

Seed A receives node_id=1 in every UDP frame after flashing this PR:

GET /api/v1/csi/nodes on seed-A:
{"node_id": 1, "frames_received": 1201454, "last_remote_addr": "192.168.1.35:55238"}

The ESP32 DID reboot with new firmware (source port changed, subcarrier count changed from 64→128), but node_id is still clobbered to 1.

Root cause: capture happens after corruption

main.c boot order:

line 139: nvs_config_load(&g_nvs_config)   ← node_id=3 loaded
line 147: wifi_init_sta()                   ← WiFi driver init (DMA, alloc, event loops)
line 156: stream_sender_init_with(...)      ← UDP sender
line 172: csi_collector_init()              ← s_node_id = g_nvs_config.node_id (already 1)

The defensive copy at csi_collector_init() (line 172) captures after ~30 lines of WiFi + UDP initialization. The corruption likely occurs during wifi_init_sta() — which involves heavy memory allocation, WiFi driver DMA buffers, and event loop setup. By the time the defensive copy runs, g_nvs_config.node_id has already been overwritten.

Suggested fix

Move the capture to immediately after nvs_config_load() in main.c:

// main.c:139
nvs_config_load(&g_nvs_config);
csi_collector_capture_node_id(g_nvs_config.node_id);  // capture BEFORE wifi_init_sta()

Or, even safer — have nvs_config_load() itself stash the value into a module-local static inside nvs_config.c, and expose an accessor. This way the defensive copy is taken at the earliest possible moment, before any other subsystem touches memory.

Device info

  • ESP32-S3 (QFN56) rev v0.2, 8MB PSRAM
  • MAC: 80:b5:4e:c1:be:b8
  • Port: /dev/cu.usbmodem1301
  • Firmware built from this PR branch on macOS ARM64 with ESP-IDF v5.4

The two other ESP32 units have not been flashed yet. Will flash after this is resolved.

@proffesor-for-testing
Copy link
Copy Markdown

Fix verified on device: move defensive capture before wifi_init_sta()

Problem with current PR

The defensive copy in csi_collector_init() runs at main.c:172after wifi_init_sta() at line 147. On our ESP32-S3 (MAC 80:b5:4e:c1:be:b8), the WiFi driver initialization corrupts g_nvs_config.node_id before the capture runs. The defensive copy captures the already-clobbered value.

Our fix (3 files, +50 -28 lines)

main.c — call csi_collector_set_node_id() immediately after nvs_config_load(), before wifi_init_sta():

nvs_config_load(&g_nvs_config);
csi_collector_set_node_id(g_nvs_config.node_id);  // BEFORE wifi_init_sta()

csi_collector.c — new set_node_id() setter + improved canary in init():

void csi_collector_set_node_id(uint8_t node_id) {
    s_node_id = node_id;
    s_node_id_early_set = true;
    ESP_LOGI(TAG, "Early capture node_id=%u (before WiFi init, #232/#390)", (unsigned)node_id);
}

void csi_collector_init(void) {
    if (!s_node_id_early_set) {
        s_node_id = g_nvs_config.node_id;  // fallback (backwards compatible)
        ESP_LOGW(TAG, "Late capture node_id=%u (no early set_node_id call)", ...);
    } else if (g_nvs_config.node_id != s_node_id) {
        ESP_LOGW(TAG, "node_id clobber CONFIRMED: early=%u g_nvs_config=%u "
                 "(WiFi init likely corrupted struct, using early value)", ...);
    }
    // ... rest unchanged
}

csi_collector.h — declare csi_collector_set_node_id(uint8_t).

Hardware verification

Firmware NVS node_id Seed receives Result
Release v0.4.3.1 (no fix) 5 node_id=1 Bug confirmed
This PR (capture at init) 3 node_id=1 Fix too late
Our patch (capture before WiFi) 5 node_id=5 Fixed

Live evidence from seed C (192.168.1.95, cognitum-c8ab):

{
    "node_count": 2,
    "nodes": [
        {"node_id": 5, "frames_received": 1695, "last_remote_addr": "192.168.1.207:60772"},
        {"node_id": 1, "frames_received": 167459, "last_remote_addr": "192.168.1.207:57438"}
    ]
}

Node 5 = our patched firmware (active). Node 1 = stale entry from the release binary run before.

ESP32-S3 USB-JTAG flashing note

On this board, esptool's "Hard resetting via RTS pin..." boots the device into download mode, not normal run. Physical USB replug required after every flash for the device to boot firmware. Not a code issue — just a hardware behavior to document.

Patch

Commit f81745a1 on branch fix/esp32-node-id-clobber. We don't have push access to apply it directly. @ruvnet please either:

  1. Cherry-pick or apply the diff below, or
  2. Grant push to the branch so we can push directly.

@proffesor-for-testing
Copy link
Copy Markdown

Full patch (click to expand)
diff --git a/firmware/esp32-csi-node/main/csi_collector.c b/firmware/esp32-csi-node/main/csi_collector.c
index ba574537..247e4bd5 100644
--- a/firmware/esp32-csi-node/main/csi_collector.c
+++ b/firmware/esp32-csi-node/main/csi_collector.c
@@ -25,13 +25,13 @@
 /* ADR-060: Access the global NVS config for MAC filter and channel override. */
 extern nvs_config_t g_nvs_config;
 
-/* Defensive fix (#232, #375, #385, #386, #390): capture node_id at init-time
- * into a module-local static. Using the global g_nvs_config.node_id directly
- * at every callback is vulnerable to any memory corruption that clobbers the
- * struct (which users have reported reverting node_id to the Kconfig default
- * of 1). The local copy is set once at csi_collector_init() and then used
- * exclusively by csi_serialize_frame(). */
+/* Defensive fix (#232, #375, #385, #386, #390): capture node_id into a
+ * module-local static BEFORE wifi_init_sta() runs, because WiFi driver init
+ * can corrupt g_nvs_config.node_id (confirmed on device 80:b5:4e:c1:be:b8).
+ * main.c calls csi_collector_set_node_id() immediately after nvs_config_load(),
+ * and csi_serialize_frame() uses s_node_id exclusively. */
 static uint8_t s_node_id = 1;
+static bool s_node_id_early_set = false;
 
 /* ADR-057: Build-time guard — fail early if CSI is not enabled in sdkconfig.
  * Without this, the firmware compiles but crashes at runtime with:
@@ -222,14 +222,31 @@ static void wifi_promiscuous_cb(void *buf, wifi_promiscuous_pkt_type_t type)
     (void)type;
 }
 
+void csi_collector_set_node_id(uint8_t node_id)
+{
+    s_node_id = node_id;
+    s_node_id_early_set = true;
+    ESP_LOGI(TAG, "Early capture node_id=%u (before WiFi init, #232/#390)",
+             (unsigned)node_id);
+}
+
 void csi_collector_init(void)
 {
-    /* Capture node_id into module-local static at init time. After this point
-     * csi_serialize_frame() uses s_node_id exclusively, isolating the UDP
-     * frame node_id field from any memory corruption of g_nvs_config. */
-    s_node_id = g_nvs_config.node_id;
-    ESP_LOGI(TAG, "Captured node_id=%u at init (defensive copy for #232/#375/#385/#390)",
-             (unsigned)s_node_id);
+    if (!s_node_id_early_set) {
+        /* Fallback: no early capture — use current g_nvs_config (may be clobbered). */
+        s_node_id = g_nvs_config.node_id;
+        ESP_LOGW(TAG, "Late capture node_id=%u (no early set_node_id call)",
+                 (unsigned)s_node_id);
+    } else if (g_nvs_config.node_id != s_node_id) {
+        /* Canary: early capture disagrees with current g_nvs_config — corruption
+         * happened between nvs_config_load() and here (likely wifi_init_sta). */
+        ESP_LOGW(TAG, "node_id clobber CONFIRMED: early=%u g_nvs_config=%u "
+                 "(WiFi init likely corrupted struct, using early value)",
+                 (unsigned)s_node_id, (unsigned)g_nvs_config.node_id);
+    } else {
+        ESP_LOGI(TAG, "node_id=%u verified (early capture matches g_nvs_config)",
+                 (unsigned)s_node_id);
+    }
 
     /* ADR-060: Determine the CSI channel.
      * Priority: 1) NVS override (--channel), 2) connected AP channel, 3) Kconfig default. */
@@ -290,16 +307,6 @@ void csi_collector_init(void)
 
     ESP_LOGI(TAG, "CSI collection initialized (node_id=%u, channel=%u)",
              (unsigned)s_node_id, (unsigned)csi_channel);
-
-    /* Clobber-detection canary: if g_nvs_config.node_id no longer matches the
-     * value we captured, something corrupted the struct between nvs_config_load
-     * and here. This is the historic #232/#375 symptom. */
-    if (g_nvs_config.node_id != s_node_id) {
-        ESP_LOGW(TAG, "node_id clobber detected: captured=%u but g_nvs_config=%u "
-                 "(frames will use captured value %u). Please report to #390.",
-                 (unsigned)s_node_id, (unsigned)g_nvs_config.node_id,
-                 (unsigned)s_node_id);
-    }
 }
 
 /* Accessor for other modules that need the authoritative runtime node_id. */
diff --git a/firmware/esp32-csi-node/main/csi_collector.h b/firmware/esp32-csi-node/main/csi_collector.h
index 3bdfd148..a518898a 100644
--- a/firmware/esp32-csi-node/main/csi_collector.h
+++ b/firmware/esp32-csi-node/main/csi_collector.h
@@ -30,14 +30,24 @@
 void csi_collector_init(void);
 
 /**
- * Get the runtime node_id captured at csi_collector_init().
+ * Capture node_id BEFORE wifi_init_sta() or any other heavy init.
  *
- * This is a defensive copy of g_nvs_config.node_id taken at init time. Other
- * modules (edge_processing, wasm_runtime, display_ui) should prefer this
- * accessor over reading g_nvs_config.node_id directly, because the global
- * struct can be clobbered by memory corruption (see #232, #375, #385, #390).
+ * Must be called from app_main() immediately after nvs_config_load().
+ * WiFi driver initialization can corrupt g_nvs_config.node_id (confirmed
+ * on device 80:b5:4e:c1:be:b8, NVS=3 but post-WiFi reads as 1).
+ * This early capture shields s_node_id from that corruption window.
  *
- * @return Node ID (0-255) as loaded from NVS or Kconfig default at boot.
+ * @param node_id Value from g_nvs_config.node_id, read right after NVS load.
+ */
+void csi_collector_set_node_id(uint8_t node_id);
+
+/**
+ * Get the runtime node_id (early capture if available, otherwise init-time).
+ *
+ * Other modules (edge_processing, wasm_runtime, display_ui) should prefer
+ * this accessor over reading g_nvs_config.node_id directly.
+ *
+ * @return Node ID (0-255) as loaded from NVS at boot.
  */
 uint8_t csi_collector_get_node_id(void);
 
diff --git a/firmware/esp32-csi-node/main/main.c b/firmware/esp32-csi-node/main/main.c
index 631a0dba..acdfdd95 100644
--- a/firmware/esp32-csi-node/main/main.c
+++ b/firmware/esp32-csi-node/main/main.c
@@ -138,6 +138,11 @@ void app_main(void)
     /* Load runtime config (NVS overrides Kconfig defaults) */
     nvs_config_load(&g_nvs_config);
 
+    /* Capture node_id IMMEDIATELY — before wifi_init_sta() can corrupt
+     * g_nvs_config. See #232/#375/#390: WiFi driver init clobbers the struct
+     * on some devices, reverting node_id to the Kconfig default of 1. */
+    csi_collector_set_node_id(g_nvs_config.node_id);
+
     const esp_app_desc_t *app_desc = esp_app_get_description();
     ESP_LOGI(TAG, "ESP32-S3 CSI Node (ADR-018) — v%s — Node ID: %d",
              app_desc->version, g_nvs_config.node_id);

@proffesor-for-testing
Copy link
Copy Markdown

New finding: Core 0 LoadProhibited crash after ~2400 CSI callbacks

Observation

The ESP32 connected via USB (/dev/cu.usbmodem1301) is crash-looping:

I (49574) csi_collector: CSI cb #2400: len=256 rssi=-46 ch=2
Guru Meditation Error: Core 0 panic'ed (LoadProhibited)
ESP-ROM:esp32s3-20210327
rst:0x7 (TG0WDT_SYS_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
W (26) boot.esp32s3: PRO CPU has been reset by WDT.

It collects ~2400-2900 CSI frames, then panics with LoadProhibited (null/bad pointer access), watchdog resets, and loops. This is running the firmware with our node_id early-capture fix from the previous comment.

Root cause analysis

Our node_id fix did NOT introduce this crash. Our changes only run once at boot (before WiFi init), and the crash happens ~70 seconds into runtime in the CSI callback path.

The actual problem: the CSI callback (wifi_csi_callback) reads g_nvs_config.filter_mac_set and g_nvs_config.filter_mac directly from the struct on every invocation (100-500 Hz):

// csi_collector.c:169-172 — runs on every single CSI frame
if (g_nvs_config.filter_mac_set) {
    if (memcmp(info->mac, g_nvs_config.filter_mac, 6) != 0) {
        return;
    }
}

This is the same g_nvs_config struct that wifi_init_sta() corrupts. We already proved WiFi init clobbers node_id (byte offset 30). If the corruption extends to filter_mac_set (byte offset 122) or adjacent memory, the callback reads garbage values on every frame. A corrupted filter_mac_set becoming non-zero would activate the filter path with garbage MAC bytes — and if the struct corruption is progressive (heap/DMA overlap), it eventually hits a critical pointer → LoadProhibited.

Fix (commit 96538d34)

Same defensive-copy pattern we applied to node_id, extended to filter_mac:

  1. New statics: s_filter_mac[6] and s_filter_mac_set captured at init
  2. set_node_id() now also captures filter config — same call site, right after nvs_config_load(), before WiFi init
  3. Callback uses local copies instead of g_nvs_config
  4. Canary in init() logs if filter_mac_set was corrupted by WiFi init
// Before (vulnerable — reads corrupted struct 500x/sec):
if (g_nvs_config.filter_mac_set) { ... }

// After (uses early-captured copy):
if (s_filter_mac_set) { ... }

Caveat

This fix shields the callback from struct corruption, which should prevent the LoadProhibited in the filter path. However, the root causewifi_init_sta() corrupting g_nvs_config — is still an open question. It may be:

  • ESP-IDF WiFi driver DMA buffer overlapping the struct in memory
  • Stack overflow during WiFi init smashing the BSS section
  • A linker alignment issue placing g_nvs_config in a vulnerable location

The real fix would be understanding why the struct gets corrupted. But this defensive copy makes the callback immune to it, same as we did for node_id.

Patch

Commit 96538d34 on branch fix/esp32-node-id-clobber. Full diff: +45 -7 lines in csi_collector.c only. Builds on top of f81745a1 (the node_id fix). Same access limitation — we can't push to this repo.

@ruvnet — this should be applied together with the node_id fix from the previous comment. Both are on the same branch.

@proffesor-for-testing
Copy link
Copy Markdown

Root cause found and fixed: SPI flash cache crash in WiFi interrupt handler

TL;DR

The LoadProhibited crash is NOT caused by g_nvs_config corruption. It's a SPI flash cache race inside the ESP-IDF WiFi binary blob, triggered by high-rate promiscuous mode interrupts.

Investigation

We decoded the crash backtrace with xtensa-esp32s3-elf-addr2line:

_xt_lowint1                    ← WiFi hardware interrupt
wDev_ProcessFiq                ← WiFi driver FIQ (closed-source blob)
spi_flash_restore_cache        ← SPI flash cache restore
cache_ll_l1_resume_icache      ← L1 ICache resume  ← NULL deref here
EXCVADDR: 0x00000004           ← load from NULL+4

The crash is inside Espressif's closed-source WiFi binary blob. When promiscuous mode captures MGMT+DATA frames, WiFi interrupts fire at 100-500 Hz. At that rate, the interrupt handler (wDev_ProcessFiq) can preempt SPI flash operations, and when it tries to resume the instruction cache, it encounters a NULL pointer.

Proof: controlled experiment

Build Promiscuous filter CSI rate Display Result
Original MGMT+DATA ~500 Hz ON Crash at ~2400 cb (~70s)
Test 1 MGMT+DATA ~500 Hz OFF Crash at ~5300 cb (~90s) — longer but still crashes
Test 2 MGMT-only ~10 Hz OFF 2700+ cb, 4.7 min, ZERO crashes

Disabling the display (QSPI, shares SPI bus) doubled time-to-crash but didn't eliminate it. Reducing promiscuous to MGMT-only (~10 Hz beacons) completely eliminated the crash.

Fix (commit 52b8e6d9)

// Before (crashes — 100-500 interrupts/sec):
.filter_mask = WIFI_PROMIS_FILTER_MASK_MGMT | WIFI_PROMIS_FILTER_MASK_DATA

// After (stable — ~10 interrupts/sec from beacons):
.filter_mask = WIFI_PROMIS_FILTER_MASK_MGMT

Also disabled htltf_en and stbc_htltf2_en in the CSI config. LLTF-only provides 64 subcarriers (HT20) — sufficient for presence, breathing, and fall detection.

Impact on CSI quality

  • Before: ~500 frames/sec, mixed 128/256/384 byte frames (LLTF+HT-LTF+STBC), but crashes every ~70s
  • After: ~10 frames/sec, 128 byte frames (LLTF-only), stable indefinitely

10 Hz is more than enough for the sensing pipeline (presence detection needs ~5-20 Hz). The edge_proc adaptive calibration completed successfully with the new rate.

All commits on fix/esp32-node-id-clobber

  1. f81745a1 — move node_id capture before wifi_init_sta (original fix)
  2. 96538d34 — defensive copy of filter_mac in callback
  3. 52b8e6d9 — MGMT-only promiscuous filter (this fix)

@ruvnet — all 3 should be applied together. The branch is ready for cherry-pick.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants