fix(firmware): defensive node_id capture prevents runtime clobber (#390)#393
fix(firmware): defensive node_id capture prevents runtime clobber (#390)#393
Conversation
Users on multi-node ESP32 deployments have been reporting for months
that their provisioned `node_id` reverts to the Kconfig default of `1`
in UDP frames and the `csi_collector` init log, despite boot showing:
nvs_config: NVS override: node_id=4
main: ESP32-S3 CSI Node (ADR-018) - Node ID: 4
csi_collector: CSI collection initialized (node_id=1, channel=11)
See #232, #375, #385, #386, #390. The root memory-corruption path for
the `g_nvs_config.node_id` byte has not been definitively isolated
(does not reproduce on my attached ESP32-S3 running current source
and the v0.6.0 release binary), but the UDP frame header can be made
tamper-proof regardless:
1. `csi_collector_init()` now captures `g_nvs_config.node_id` into a
module-local `static uint8_t s_node_id` at init time.
2. `csi_serialize_frame()` reads `buf[4]` from `s_node_id`, not from
the global - so any later corruption of `g_nvs_config` cannot
affect outgoing CSI frames.
3. All other consumers (`edge_processing.c` x3, `wasm_runtime.c`,
`display_ui.c`, `main.c swarm_bridge_init`) now go through a new
`csi_collector_get_node_id()` accessor instead of reading the
global directly.
4. A canary at end-of-init logs `WARN` if `g_nvs_config.node_id`
already diverges from the captured value - this will pinpoint
the corruption path if it happens on a user's device.
Hardware validation on attached ESP32-S3 (COM8):
- NVS loads node_id=2
- Boot log: `main: ... Node ID: 2`
- NEW log: `csi_collector: Captured node_id=2 at init (defensive
copy for #232/#375/#385/#390)`
- Init log: `csi_collector: CSI collection initialized (node_id=2)`
- UDP frame byte[4] = 2 (verified via socket sniffer, 15/15 packets)
This is defense in depth - it shields the UDP frame from whatever
upstream bug is clobbering the struct. When a user hits the original
bug, the canary WARN will help isolate the root cause.
Refs #232 #375 #385 #386 #390
Co-Authored-By: claude-flow <ruv@ruv.net>
proffesor-for-testing
left a comment
There was a problem hiding this comment.
Live reproduction on our ESP32 — defensive copy is too late
We built and flashed this PR on a device where we can reproduce the clobber bug. The fix does not resolve it.
Evidence
NVS has node_id=3 (verified by reading the NVS partition via esptool read_flash):
NVS offset 328: node_id key → value = 3
Seed A receives node_id=1 in every UDP frame after flashing this PR:
GET /api/v1/csi/nodes on seed-A:
{"node_id": 1, "frames_received": 1201454, "last_remote_addr": "192.168.1.35:55238"}The ESP32 DID reboot with new firmware (source port changed, subcarrier count changed from 64→128), but node_id is still clobbered to 1.
Root cause: capture happens after corruption
main.c boot order:
line 139: nvs_config_load(&g_nvs_config) ← node_id=3 loaded
line 147: wifi_init_sta() ← WiFi driver init (DMA, alloc, event loops)
line 156: stream_sender_init_with(...) ← UDP sender
line 172: csi_collector_init() ← s_node_id = g_nvs_config.node_id (already 1)
The defensive copy at csi_collector_init() (line 172) captures after ~30 lines of WiFi + UDP initialization. The corruption likely occurs during wifi_init_sta() — which involves heavy memory allocation, WiFi driver DMA buffers, and event loop setup. By the time the defensive copy runs, g_nvs_config.node_id has already been overwritten.
Suggested fix
Move the capture to immediately after nvs_config_load() in main.c:
// main.c:139
nvs_config_load(&g_nvs_config);
csi_collector_capture_node_id(g_nvs_config.node_id); // capture BEFORE wifi_init_sta()Or, even safer — have nvs_config_load() itself stash the value into a module-local static inside nvs_config.c, and expose an accessor. This way the defensive copy is taken at the earliest possible moment, before any other subsystem touches memory.
Device info
- ESP32-S3 (QFN56) rev v0.2, 8MB PSRAM
- MAC: 80:b5:4e:c1:be:b8
- Port: /dev/cu.usbmodem1301
- Firmware built from this PR branch on macOS ARM64 with ESP-IDF v5.4
The two other ESP32 units have not been flashed yet. Will flash after this is resolved.
Fix verified on device: move defensive capture before wifi_init_sta()Problem with current PRThe defensive copy in Our fix (3 files, +50 -28 lines)
nvs_config_load(&g_nvs_config);
csi_collector_set_node_id(g_nvs_config.node_id); // BEFORE wifi_init_sta()
void csi_collector_set_node_id(uint8_t node_id) {
s_node_id = node_id;
s_node_id_early_set = true;
ESP_LOGI(TAG, "Early capture node_id=%u (before WiFi init, #232/#390)", (unsigned)node_id);
}
void csi_collector_init(void) {
if (!s_node_id_early_set) {
s_node_id = g_nvs_config.node_id; // fallback (backwards compatible)
ESP_LOGW(TAG, "Late capture node_id=%u (no early set_node_id call)", ...);
} else if (g_nvs_config.node_id != s_node_id) {
ESP_LOGW(TAG, "node_id clobber CONFIRMED: early=%u g_nvs_config=%u "
"(WiFi init likely corrupted struct, using early value)", ...);
}
// ... rest unchanged
}
Hardware verification
Live evidence from seed C ( {
"node_count": 2,
"nodes": [
{"node_id": 5, "frames_received": 1695, "last_remote_addr": "192.168.1.207:60772"},
{"node_id": 1, "frames_received": 167459, "last_remote_addr": "192.168.1.207:57438"}
]
}Node 5 = our patched firmware (active). Node 1 = stale entry from the release binary run before. ESP32-S3 USB-JTAG flashing noteOn this board, esptool's "Hard resetting via RTS pin..." boots the device into download mode, not normal run. Physical USB replug required after every flash for the device to boot firmware. Not a code issue — just a hardware behavior to document. PatchCommit
|
Full patch (click to expand)diff --git a/firmware/esp32-csi-node/main/csi_collector.c b/firmware/esp32-csi-node/main/csi_collector.c
index ba574537..247e4bd5 100644
--- a/firmware/esp32-csi-node/main/csi_collector.c
+++ b/firmware/esp32-csi-node/main/csi_collector.c
@@ -25,13 +25,13 @@
/* ADR-060: Access the global NVS config for MAC filter and channel override. */
extern nvs_config_t g_nvs_config;
-/* Defensive fix (#232, #375, #385, #386, #390): capture node_id at init-time
- * into a module-local static. Using the global g_nvs_config.node_id directly
- * at every callback is vulnerable to any memory corruption that clobbers the
- * struct (which users have reported reverting node_id to the Kconfig default
- * of 1). The local copy is set once at csi_collector_init() and then used
- * exclusively by csi_serialize_frame(). */
+/* Defensive fix (#232, #375, #385, #386, #390): capture node_id into a
+ * module-local static BEFORE wifi_init_sta() runs, because WiFi driver init
+ * can corrupt g_nvs_config.node_id (confirmed on device 80:b5:4e:c1:be:b8).
+ * main.c calls csi_collector_set_node_id() immediately after nvs_config_load(),
+ * and csi_serialize_frame() uses s_node_id exclusively. */
static uint8_t s_node_id = 1;
+static bool s_node_id_early_set = false;
/* ADR-057: Build-time guard — fail early if CSI is not enabled in sdkconfig.
* Without this, the firmware compiles but crashes at runtime with:
@@ -222,14 +222,31 @@ static void wifi_promiscuous_cb(void *buf, wifi_promiscuous_pkt_type_t type)
(void)type;
}
+void csi_collector_set_node_id(uint8_t node_id)
+{
+ s_node_id = node_id;
+ s_node_id_early_set = true;
+ ESP_LOGI(TAG, "Early capture node_id=%u (before WiFi init, #232/#390)",
+ (unsigned)node_id);
+}
+
void csi_collector_init(void)
{
- /* Capture node_id into module-local static at init time. After this point
- * csi_serialize_frame() uses s_node_id exclusively, isolating the UDP
- * frame node_id field from any memory corruption of g_nvs_config. */
- s_node_id = g_nvs_config.node_id;
- ESP_LOGI(TAG, "Captured node_id=%u at init (defensive copy for #232/#375/#385/#390)",
- (unsigned)s_node_id);
+ if (!s_node_id_early_set) {
+ /* Fallback: no early capture — use current g_nvs_config (may be clobbered). */
+ s_node_id = g_nvs_config.node_id;
+ ESP_LOGW(TAG, "Late capture node_id=%u (no early set_node_id call)",
+ (unsigned)s_node_id);
+ } else if (g_nvs_config.node_id != s_node_id) {
+ /* Canary: early capture disagrees with current g_nvs_config — corruption
+ * happened between nvs_config_load() and here (likely wifi_init_sta). */
+ ESP_LOGW(TAG, "node_id clobber CONFIRMED: early=%u g_nvs_config=%u "
+ "(WiFi init likely corrupted struct, using early value)",
+ (unsigned)s_node_id, (unsigned)g_nvs_config.node_id);
+ } else {
+ ESP_LOGI(TAG, "node_id=%u verified (early capture matches g_nvs_config)",
+ (unsigned)s_node_id);
+ }
/* ADR-060: Determine the CSI channel.
* Priority: 1) NVS override (--channel), 2) connected AP channel, 3) Kconfig default. */
@@ -290,16 +307,6 @@ void csi_collector_init(void)
ESP_LOGI(TAG, "CSI collection initialized (node_id=%u, channel=%u)",
(unsigned)s_node_id, (unsigned)csi_channel);
-
- /* Clobber-detection canary: if g_nvs_config.node_id no longer matches the
- * value we captured, something corrupted the struct between nvs_config_load
- * and here. This is the historic #232/#375 symptom. */
- if (g_nvs_config.node_id != s_node_id) {
- ESP_LOGW(TAG, "node_id clobber detected: captured=%u but g_nvs_config=%u "
- "(frames will use captured value %u). Please report to #390.",
- (unsigned)s_node_id, (unsigned)g_nvs_config.node_id,
- (unsigned)s_node_id);
- }
}
/* Accessor for other modules that need the authoritative runtime node_id. */
diff --git a/firmware/esp32-csi-node/main/csi_collector.h b/firmware/esp32-csi-node/main/csi_collector.h
index 3bdfd148..a518898a 100644
--- a/firmware/esp32-csi-node/main/csi_collector.h
+++ b/firmware/esp32-csi-node/main/csi_collector.h
@@ -30,14 +30,24 @@
void csi_collector_init(void);
/**
- * Get the runtime node_id captured at csi_collector_init().
+ * Capture node_id BEFORE wifi_init_sta() or any other heavy init.
*
- * This is a defensive copy of g_nvs_config.node_id taken at init time. Other
- * modules (edge_processing, wasm_runtime, display_ui) should prefer this
- * accessor over reading g_nvs_config.node_id directly, because the global
- * struct can be clobbered by memory corruption (see #232, #375, #385, #390).
+ * Must be called from app_main() immediately after nvs_config_load().
+ * WiFi driver initialization can corrupt g_nvs_config.node_id (confirmed
+ * on device 80:b5:4e:c1:be:b8, NVS=3 but post-WiFi reads as 1).
+ * This early capture shields s_node_id from that corruption window.
*
- * @return Node ID (0-255) as loaded from NVS or Kconfig default at boot.
+ * @param node_id Value from g_nvs_config.node_id, read right after NVS load.
+ */
+void csi_collector_set_node_id(uint8_t node_id);
+
+/**
+ * Get the runtime node_id (early capture if available, otherwise init-time).
+ *
+ * Other modules (edge_processing, wasm_runtime, display_ui) should prefer
+ * this accessor over reading g_nvs_config.node_id directly.
+ *
+ * @return Node ID (0-255) as loaded from NVS at boot.
*/
uint8_t csi_collector_get_node_id(void);
diff --git a/firmware/esp32-csi-node/main/main.c b/firmware/esp32-csi-node/main/main.c
index 631a0dba..acdfdd95 100644
--- a/firmware/esp32-csi-node/main/main.c
+++ b/firmware/esp32-csi-node/main/main.c
@@ -138,6 +138,11 @@ void app_main(void)
/* Load runtime config (NVS overrides Kconfig defaults) */
nvs_config_load(&g_nvs_config);
+ /* Capture node_id IMMEDIATELY — before wifi_init_sta() can corrupt
+ * g_nvs_config. See #232/#375/#390: WiFi driver init clobbers the struct
+ * on some devices, reverting node_id to the Kconfig default of 1. */
+ csi_collector_set_node_id(g_nvs_config.node_id);
+
const esp_app_desc_t *app_desc = esp_app_get_description();
ESP_LOGI(TAG, "ESP32-S3 CSI Node (ADR-018) — v%s — Node ID: %d",
app_desc->version, g_nvs_config.node_id); |
New finding: Core 0 LoadProhibited crash after ~2400 CSI callbacksObservationThe ESP32 connected via USB ( It collects ~2400-2900 CSI frames, then panics with Root cause analysisOur The actual problem: the CSI callback ( // csi_collector.c:169-172 — runs on every single CSI frame
if (g_nvs_config.filter_mac_set) {
if (memcmp(info->mac, g_nvs_config.filter_mac, 6) != 0) {
return;
}
}This is the same Fix (commit
|
Root cause found and fixed: SPI flash cache crash in WiFi interrupt handlerTL;DRThe LoadProhibited crash is NOT caused by g_nvs_config corruption. It's a SPI flash cache race inside the ESP-IDF WiFi binary blob, triggered by high-rate promiscuous mode interrupts. InvestigationWe decoded the crash backtrace with The crash is inside Espressif's closed-source WiFi binary blob. When promiscuous mode captures MGMT+DATA frames, WiFi interrupts fire at 100-500 Hz. At that rate, the interrupt handler ( Proof: controlled experiment
Disabling the display (QSPI, shares SPI bus) doubled time-to-crash but didn't eliminate it. Reducing promiscuous to MGMT-only (~10 Hz beacons) completely eliminated the crash. Fix (commit
|
Summary
Closes the long-standing
node_idclobber cluster (#232, #375, #385, #386, #390) with a defense-in-depth approach at the firmware layer.Problem
Users provisioning multiple ESP32 nodes with unique
--node-idvalues report that boot logs show the correct NVS value, butcsi_collectorand the UDP frame header revert to1(the Kconfig default):All UDP packets decode with
byte[4] = 1, making multi-node deployments indistinguishable downstream. The symptom matches memory corruption ofg_nvs_config.node_idbetweenmain.c:139(nvs_config_load) and the next use site.Fix (defense in depth)
csi_collector_init()capturesg_nvs_config.node_idintostatic uint8_t s_node_idonce at init.csi_serialize_frame()writesbuf[4] = s_node_id- bypasses the global entirely, so any later corruption ofg_nvs_configcannot affect outgoing CSI frames.edge_processing.cx3 pkt sites,wasm_runtime.c,display_ui.c,main.c swarm_bridge_init) go through a newcsi_collector_get_node_id()accessor.csi_collector_init()logsWARNifg_nvs_config.node_idhas already diverged from the captured value. Users hitting the original bug will see this log, pinpointing the corruption path for root-cause work.Hardware validation (attached ESP32-S3 on COM8)
node_id=2from priorprovision.pyrun.csi_collector: Captured node_id=2 at init (defensive copy for #232/#375/#385/#390).csi_collector: CSI collection initialized (node_id=2, channel=5)- matches NVS.192.168.1.104withbyte[4] = 2.Why defensive, not root cause?
The reported bug does not reproduce on my hardware running the current source - my device is healthy end-to-end. Users see it on various release binaries (v0.4.3.1, v0.7.0 which is actually v0.6.0-esp32 firmware). Three possibilities:
#279fix was not yet applied - users need to reflash v0.6.0-esp32 or later.g_nvs_config. Until we can reproduce, this PR ensures the UDP frame is immune.The canary added here will surface a diagnostic
WARNthe moment any user hits it, letting us chase the upstream path with real telemetry.Files changed
firmware/esp32-csi-node/main/csi_collector.{c,h}- capture, accessor, canaryfirmware/esp32-csi-node/main/edge_processing.c- 3 pkt.node_id sites + includefirmware/esp32-csi-node/main/wasm_runtime.c- 1 pkt.node_id site + includefirmware/esp32-csi-node/main/display_ui.c- display "Node: %u" fieldfirmware/esp32-csi-node/main/main.c- swarm_bridge_init node_id argCHANGELOG.md-[Unreleased]entryNext steps
A new firmware release (
v0.6.1-esp32) with these binaries is warranted - users on multi-node setups should reflash. CHANGELOG entry is in[Unreleased]; bumpversion.txtto0.6.1at release time.Refs #232 #375 #385 #386 #390
Test plan
idf.py build, 54% flash free)idf.py -p COM8 flash)byte[4]matches NVS node_id (15/15 packets)🤖 Generated with claude-flow