ESP32-S3 CSI crash: SPI flash cache race in wDev_ProcessFiq during promiscuous mode #396

@proffesor-for-testing

Summary

ESP32-S3 CSI nodes crash with LoadProhibited in the WiFi driver's interrupt handler when promiscuous mode captures frames at high rates. The crash is inside Espressif's closed-source binary blob and cannot be fixed at the application level without reducing the WiFi hardware interrupt rate.

Crash signature

Guru Meditation Error: Core 0 panic'ed (LoadProhibited)
EXCVADDR: 0x00000004

Decoded backtrace (xtensa-esp32s3-elf-addr2line):

_xt_lowint1                    ← WiFi hardware interrupt
wDev_ProcessFiq                ← WiFi driver FIQ (closed-source blob)
spi_flash_restore_cache        ← SPI flash cache restore
cache_ll_l1_resume_icache      ← L1 ICache resume ← NULL deref here
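One common way to arrive at EXCVADDR 0x00000004 on a load is reading a struct member at offset 4 through a NULL base pointer. The sketch below only illustrates that arithmetic; `hypothetical_cache_state` and its fields are invented for the example and are not the real layout used inside ESP-IDF or the WiFi blob.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only, NOT the real ESP-IDF struct. */
struct hypothetical_cache_state {
    uint32_t flags;         /* offset 0 */
    uint32_t icache_state;  /* offset 4 */
};

/* Address that a load of `icache_state` would touch for a given base pointer. */
static uintptr_t faulting_address(uintptr_t base)
{
    return base + offsetof(struct hypothetical_cache_state, icache_state);
}
```

With a NULL base (0), the load targets address 0x4, which matches the EXCVADDR pattern in the panic output.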

Root cause

The WiFi MAC hardware generates a Level 1 interrupt for every frame captured by the promiscuous filter. wDev_ProcessFiq handles these interrupts and at some point calls spi_flash_restore_cache, which calls cache_ll_l1_resume_icache. When the interrupt rate is high enough (crashes were observed at ~50 Hz and above in the tests below), this function dereferences a NULL pointer, likely because the SPI flash cache state has been left inconsistent by a concurrent flash operation (display QSPI, NVS write, etc.).

The crash is not in application code. It's inside the ESP-IDF WiFi binary blob (libpp.a).

Controlled experiments

All tests on ESP32-S3 (QFN56 rev v0.2, 8MB PSRAM, MAC 80:b5:4e:c1:be:b8), ESP-IDF v5.4, WiFi SSID Spiridonovi1 ch2.

| # | Build | Display | Promiscuous filter | Effective CSI rate | Crash point | Result |
|---|-------|---------|--------------------|--------------------|-------------|--------|
| 1 | Our build | ON | MGMT+DATA | ~500 Hz | ~2400 cb (~70s) | Crash |
| 2 | Our build | OFF | MGMT+DATA | ~500 Hz | ~5300 cb (~90s) | Crash (slower) |
| 3 | Our build | OFF | MGMT-only | ~10 Hz | 2700+ cb (4.7 min) | Stable |
| 4 | Our build | ON | MGMT-only | ~10 Hz | 2400+ cb (4 min+) | Stable |
| 5 | Our build + 50Hz callback gate | ON | MGMT+DATA | ~50 Hz | ~1300 cb | Crash |
| 6 | Ruv's v0.6.1-esp32 release | OFF | MGMT+DATA | ~100 Hz | ~8200 cb | Crash (19x in 2 min) |

Key findings:

  • Display OFF roughly doubles the callback count before the crash (~5300 vs ~2400 cb) but doesn't prevent it (test 2 vs 1, test 6)
  • Callback-level rate limiting does NOT help because the WiFi HW interrupt fires for every captured frame regardless of callback execution (test 5)
  • MGMT-only filter is the only fix that works — it reduces the hardware interrupt rate itself (test 3, 4)
  • v0.6.1-esp32 release crashes with the same bug (test 6)

What doesn't work

Callback rate limiting

A 50 Hz early gate in wifi_csi_callback that returns immediately for excess frames does not help. The crash occurs in wDev_ProcessFiq which runs before the callback is invoked. Reducing callback execution time has no effect on interrupt rate.
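A minimal sketch of such a gate, to make the limitation concrete. The names `csi_gate_t` and `csi_gate_allow` are hypothetical (the actual gate in our build is not reproduced here), and the timestamp is passed in so the logic can be exercised off-target; on the device it would typically come from `esp_timer_get_time()`. The `frames_seen` counter models the fact that the hardware has already interrupted for every frame by the time the gate runs.

```c
#include <stdbool.h>
#include <stdint.h>

#define CSI_GATE_INTERVAL_US 20000  /* 1 s / 50 Hz */

/* Hypothetical gate state; zero-initialize before use. */
typedef struct {
    int64_t  last_accept_us;    /* timestamp of the last accepted frame */
    bool     has_last;          /* false until the first frame is accepted */
    uint32_t frames_seen;       /* every frame that reached the callback */
    uint32_t frames_processed;  /* frames that passed the gate */
} csi_gate_t;

/* Returns true if the frame should be processed, false if it is dropped early. */
static bool csi_gate_allow(csi_gate_t *g, int64_t now_us)
{
    /* The WiFi interrupt (and wDev_ProcessFiq) has already run for this frame. */
    g->frames_seen++;

    if (g->has_last && (now_us - g->last_accept_us) < CSI_GATE_INTERVAL_US) {
        return false;  /* too soon after the last accepted frame */
    }
    g->last_accept_us = now_us;
    g->has_last = true;
    g->frames_processed++;
    return true;
}
```

Feeding this gate one frame every 2000 µs (500 Hz) for one second processes 50 frames, but all 500 still arrive, which is why the gate lowers the callback workload without lowering the interrupt rate.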

SPIRAM XIP (CONFIG_SPIRAM_FETCH_INSTRUCTIONS + CONFIG_SPIRAM_RODATA)

In theory this eliminates the SPI flash cache race entirely (instructions served from PSRAM, cache never suspended). In practice, manual sdkconfig edits with CONFIG_SPIRAM_MODE_QUAD=y produced an IllegalInstruction crash-loop from boot — likely because this board's PSRAM is Octal, not Quad. Needs proper idf.py menuconfig validation with the correct PSRAM mode for this hardware.

Additional IRAM options

CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y and CONFIG_ESP_WIFI_SLP_IRAM_OPT=y were tested as part of the SPIRAM build but couldn't be isolated due to the PSRAM misconfiguration.

What works

MGMT-only promiscuous filter

wifi_promiscuous_filter_t filt = {
    .filter_mask = WIFI_PROMIS_FILTER_MASK_MGMT,  // was MGMT | DATA
};

Reduces WiFi hardware interrupt rate from ~100-500 Hz to ~10 Hz (beacon/probe frames only). Tested stable for 4+ minutes with display ON, zero crashes.

Trade-off: CSI data rate drops from ~100-500 frames/sec to ~10 frames/sec. However, 10 Hz is sufficient for presence detection, breathing rate (10-30 BPM), and heart rate detection. Edge processing adaptive calibration completes successfully at this rate.
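The rate claim can be sanity-checked against the Nyquist criterion. This is a rough sketch only: the helper names are hypothetical, the 180 BPM heart-rate ceiling in the usage note is an assumption for illustration, and Nyquist is a necessary condition rather than proof of detection quality (beacon timing jitter and SNR are not modelled).

```c
#include <stdbool.h>

/* Convert beats (or breaths) per minute to Hz. */
static double bpm_to_hz(double bpm)
{
    return bpm / 60.0;
}

/* Nyquist check: the sample rate must exceed twice the highest signal frequency. */
static bool nyquist_ok(double sample_hz, double max_bpm)
{
    return sample_hz > 2.0 * bpm_to_hz(max_bpm);
}
```

At 10 Hz, `nyquist_ok(10.0, 30.0)` holds for breathing (30 BPM = 0.5 Hz) and `nyquist_ok(10.0, 180.0)` holds for an assumed 180 BPM heart rate (3 Hz); the limit is reached at 300 BPM (5 Hz).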

Proper fix path (not yet tested)

SPIRAM XIP should be the proper platform-level fix. With CONFIG_SPIRAM_FETCH_INSTRUCTIONS=y and CONFIG_SPIRAM_RODATA=y, the SPI flash cache should not need to be suspended during flash operations, which would eliminate the race entirely. This requires:

  1. Determine correct PSRAM mode for this board (Quad vs Octal) — check Waveshare ESP32-S3 datasheet
  2. Configure via idf.py menuconfig (not manual sdkconfig edits) to get all dependencies right
  3. Test with full MGMT+DATA promiscuous at 500 Hz

Also affected: node_id clobber

Separately from the crash, the g_nvs_config.node_id clobber (#390) is confirmed on our hardware. Ruv's v0.6.1 late capture at csi_collector_init() works on some boots but not all — we proved wifi_init_sta() corrupts the struct before the capture runs. Our early capture (csi_collector_set_node_id() called before wifi_init_sta()) is the reliable fix. See PR #393 comments.

Hardware

  • Board: Waveshare ESP32-S3 AMOLED 1.8" (SH8601 368x448 QSPI display)
  • Chip: ESP32-S3 (QFN56) rev v0.2, 8MB PSRAM (AP_3v3), 16MB flash (Boya)
  • ESP-IDF: v5.4
  • WiFi: Spiridonovi1 ch2, WPA2-PSK
