ci: add nxboot OTA resilience canary workflow #18519
neilberkman wants to merge 4 commits into apache:master
Add a weekly + on-demand workflow that tests nxboot power-loss resilience on nucleo-h743zi using Renode-based fault injection (tardigrade). Builds the nxboot-loader and nxboot-app configs from this repo, injects power-loss faults at write points during the OTA update path, and verifies the device recovers to a bootable state. Schedule-only (weekly) and workflow_dispatch — does not run on push or pull_request, so it never blocks normal development. Signed-off-by: Neil Berkman <neil@xuku.com>
lupyuen
left a comment
Sorry do we need a person to monitor the results of this Scheduled Job? If it fails, who will be fixing it?
We don't have any Scheduled Jobs right now. We have Scheduled Builds in the NuttX Mirror Repo (moved there due to overuse of GitHub Runners), and they are monitored through the NuttX Dashboard.
Do you have this running on your NuttX Repo? I'm curious to see how much of GitHub Runners it will use. Thanks!
lupyuen
left a comment
Zizmor Security Scanner found some issues in the workflow. Could you set persist-credentials: false and check that the workflow still runs OK? Thanks!
$ git clone https://github.com/neilberkman/nuttx --branch tardigrade-ci
$ zizmor nuttx/.github/workflows/ota-resilience-canary.yml
🌈 zizmor v1.22.0
INFO audit: zizmor: 🌈 completed ota-resilience-canary.yml
warning[artipacked]: credential persistence through GitHub Actions artifacts
--> ota-resilience-canary.yml:32:9
|
32 | - name: Checkout NuttX
| _________^
33 | | uses: actions/checkout@v4
34 | | with:
35 | | path: nuttx
| |_____________________^ does not set persist-credentials: false
|
= note: audit confidence → Low
= note: this finding has an auto-fix
warning[artipacked]: credential persistence through GitHub Actions artifacts
--> ota-resilience-canary.yml:37:9
|
37 | - name: Checkout NuttX apps
| _________^
38 | | uses: actions/checkout@v4
39 | | with:
40 | | repository: apache/nuttx-apps
41 | | path: nuttx-apps
| |__________________________^ does not set persist-credentials: false
|
= note: audit confidence → Low
= note: this finding has an auto-fix
warning[artipacked]: credential persistence through GitHub Actions artifacts
--> ota-resilience-canary.yml:43:9
|
43 | - name: Checkout tardigrade
| _________^
44 | | uses: actions/checkout@v4
45 | | with:
46 | | repository: neilberkman/tardigrade
47 | | ref: 819d143a56c83fd7860ccc8f76a414a717956d94
48 | | path: tardigrade
| |__________________________^ does not set persist-credentials: false
|
= note: audit confidence → Low
= note: this finding has an auto-fix
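For reference, the change the scanner asks for is a one-line addition to each checkout step. The following is a sketch reconstructed from the three steps quoted in the zizmor output above; only the `persist-credentials: false` lines are new, and the rest of the workflow file is not shown here.

```yaml
# Sketch only: step names, repositories, paths, and the pinned ref are
# copied from the zizmor output; persist-credentials is the sole addition.
- name: Checkout NuttX
  uses: actions/checkout@v4
  with:
    path: nuttx
    persist-credentials: false   # do not leave the token in .git/config

- name: Checkout NuttX apps
  uses: actions/checkout@v4
  with:
    repository: apache/nuttx-apps
    path: nuttx-apps
    persist-credentials: false

- name: Checkout tardigrade
  uses: actions/checkout@v4
  with:
    repository: neilberkman/tardigrade
    ref: 819d143a56c83fd7860ccc8f76a414a717956d94   # pinned by full commit SHA
    path: tardigrade
    persist-credentials: false
```

With this input set, `actions/checkout` does not store the job token in the checked-out repository's local git configuration, which is what the `artipacked` audit flags.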
linguini1
left a comment
Have you tested this locally or on GitHub workflows externally? Can you please provide some test output from that (your PR does not follow the template currently)
- name: Checkout tardigrade
  uses: actions/checkout@v4
  with:
    repository: neilberkman/tardigrade
This would make NuttX CI and testing reliant on your own project, which (from what I can see) is only four days old. With all due respect, this may be an unreliable test method/security risk if it's externally maintained.
Agree. If we want anything like this, in the best scenario it would be part of the NuttX repo :-) I guess this is still a work in progress and we may see it upstream when it's ready, right? :-)
Fixes zizmor artipacked warnings for all three actions/checkout@v4 steps in the OTA resilience canary workflow. Signed-off-by: Neil Berkman <neil@xuku.com>
I'd monitor it and fix any workflow breakage myself. That said, I understand the concern about adding a scheduled job to the main repo when existing scheduled builds have been moved to the mirror. Happy to run this on my own fork instead and file issues here if it catches anything (would manually review before reporting). If that's preferred I can withdraw the PR. But if there's appetite for it in the mirror repo or main repo down the road, the offer stands. Also open to other approaches if there's a better way to integrate this kind of testing.
Fixed, all three checkout steps now set persist-credentials: false.
Yes, ran it on my fork: run #22888336237. Single
Updated the PR description with Impact and Testing sections. Workflow tested on my fork, output and run link in the description.
Fair point. The public repo is new but the tool has been in development for a bit longer internally. Its value is specifically in automated power-loss testing during OTA updates, which as far as I know no other tool covers. The checkout is pinned by full commit SHA so the workflow can't be affected by later changes to the repo. But I understand the concern about depending on an external project. Happy to run this on my own fork instead if that's preferred, or open to other suggestions for how to integrate this kind of testing.
Yes, could we move this to your repo instead? We're having issues managing our GitHub Runners right now, and we'd hate to see them spike suddenly at an odd time :-) Also, I'm concerned that this job might linger forever in our repo without anybody monitoring it. Thanks for helping us :-)
Totally understand, makes sense. I'll run it on my own fork and report any findings here as issues after reviewing them. Closing this one out.
Big thank you @neilberkman, great idea!! Yes, with the current state of CI and its load, it would be best to spread the work: run this on an external repo, then report issues upstream!!
Thanks for addressing my concerns! This seems pretty reasonable. The reason I mentioned it is because there were similar concerns about NTFC, but repositories have since been made for it. Maybe eventually you might consider adding some minimal version to the upstream? Anyways, thanks for your hard work! :)
Summary
nxboot currently has no automated testing for power-loss resilience during OTA updates. A power cut at the wrong moment during a firmware update can leave a device permanently bricked — this is a known failure class in other bootloaders (MCUboot PRs #2100, #2109, #2199 all shipped bricking regressions that were only found after release).
This PR adds a weekly canary workflow that builds the nucleo-h743zi nxboot-loader and nxboot-app configs, then uses tardigrade (a Renode-based fault-injection harness) to inject power-loss faults at write points during the OTA update path and verify the device always recovers to a bootable state.
What it does: builds nxboot from this repo's configs, emulates an OTA update in Renode, interrupts it at ~64 points across the full write range, and checks that the bootloader recovers every time.
What it does not do: it does not trigger on push or pull_request. It runs only on a weekly schedule and workflow_dispatch, so it never blocks normal development or CI. If it ever becomes a nuisance, deleting the one YAML file has zero impact on the project.
Dependencies: requires the nucleo-h743zi nxboot board configs from #18509 to be merged first. Tardigrade is pinned by full commit SHA.
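The trigger setup described above could look roughly like the sketch below. The cron expression is a placeholder assumption (the actual schedule used in the PR is not shown in this conversation); the two trigger keys themselves are standard GitHub Actions syntax.

```yaml
# Sketch only: the cron time is hypothetical, chosen for illustration.
on:
  schedule:
    - cron: "0 3 * * 0"   # once a week, Sunday 03:00 UTC (assumed time)
  workflow_dispatch:       # allows manual, on-demand runs from the Actions tab
# There are no `push` or `pull_request` keys, so ordinary commits and
# pull requests never start this workflow.
```

Because the workflow only declares `schedule` and `workflow_dispatch`, it cannot appear as a required or failing check on a pull request.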
Impact
.github/workflows/ota-resilience-canary.yml)
Testing
Workflow run on fork (run #22888336237):
Total runtime: 7m17s on a single ubuntu-22.04 runner.
Signed-off-by: Neil Berkman <neil@xuku.com>