ci: add nxboot OTA resilience canary workflow by neilberkman · Pull Request #18519 · apache/nuttx

neilberkman · 2026-03-10T03:08:00Z

Summary

nxboot currently has no automated testing for power-loss resilience during OTA updates. A power cut at the wrong moment during a firmware update can leave a device permanently bricked — this is a known failure class in other bootloaders (MCUboot PRs #2100, #2109, #2199 all shipped bricking regressions that were only found after release).

This PR adds a weekly canary workflow that builds the nucleo-h743zi nxboot-loader and nxboot-app configs, then uses tardigrade (a Renode-based fault-injection harness) to inject power-loss faults at write points during the OTA update path and verify the device always recovers to a bootable state.

What it does: builds nxboot from this repo's configs, emulates an OTA update in Renode, interrupts it at ~64 points across the full write range, and checks that the bootloader recovers every time.

What it does not do: it does not trigger on push or pull_request, so it never blocks normal development or CI. It runs only on a weekly schedule and on manual workflow_dispatch. If it ever becomes a nuisance, deleting the one YAML file has zero impact on the project.
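The trigger behaviour described here could be expressed roughly as follows. This is a minimal sketch, not the PR's actual file: the workflow name and the cron slot (Mondays at 03:00 UTC) are assumptions, since the description only says "weekly".

```yaml
# Sketch of the trigger section for .github/workflows/ota-resilience-canary.yml.
# The workflow name and cron time are assumptions; the PR only states "weekly".
name: OTA resilience canary

on:
  schedule:
    # Hypothetical slot: every Monday at 03:00 UTC.
    - cron: "0 3 * * 1"
  # Allows a manual run from the Actions tab or the API.
  workflow_dispatch:

# No push or pull_request triggers are listed, so the workflow never runs
# as part of normal development or PR checks.
```

Note that GitHub runs scheduled workflows only from the repository's default branch, so a schedule like this has no effect while the file lives only on a feature branch.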

Dependencies: requires the nucleo-h743zi nxboot board configs from #18509 to be merged first. Tardigrade is pinned by full commit SHA.

Impact

Adds one new workflow file (.github/workflows/ota-resilience-canary.yml)
No impact on existing CI, builds, or code
Weekly schedule only — no push/PR triggers
Self-contained: no changes to any existing files

Testing

Workflow run on fork (run #22888336237):

Profile: nuttx_nxboot_canary
Verdict: PASS
Calibrated writes: 100000
Fault points: 34
Issues: 0
Bricks: 0
Control outcome: success
Control multi-boot: converged exec

Total runtime: 7m17s on a single ubuntu-22.04 runner.

Signed-off-by: Neil Berkman <neil@xuku.com>

Add a weekly + on-demand workflow that tests nxboot power-loss resilience on nucleo-h743zi using Renode-based fault injection (tardigrade). Builds the nxboot-loader and nxboot-app configs from this repo, injects power-loss faults at write points during the OTA update path, and verifies the device recovers to a bootable state. Schedule-only (weekly) and workflow_dispatch; does not run on push or pull_request, so it never blocks normal development.

Signed-off-by: Neil Berkman <neil@xuku.com>

lupyuen

Sorry do we need a person to monitor the results of this Scheduled Job? If it fails, who will be fixing it?

We don't have any Scheduled Jobs right now. We have Scheduled Builds in the NuttX Mirror Repo (moved there due to overuse of GitHub Runners), and they are monitored through the NuttX Dashboard.

lupyuen · 2026-03-10T03:25:28Z

Do you have this running on your NuttX Repo? I'm curious to see how much of GitHub Runners it will use. Thanks!

lupyuen

Zizmor Security Scanner found some issues in the workflow. Could you set persist-credentials: false and check that the workflow still runs OK? Thanks!

$ git clone https://github.com/neilberkman/nuttx --branch tardigrade-ci
$ zizmor nuttx/.github/workflows/ota-resilience-canary.yml
🌈 zizmor v1.22.0
 INFO audit: zizmor: 🌈 completed ota-resilience-canary.yml
warning[artipacked]: credential persistence through GitHub Actions artifacts
  --> ota-resilience-canary.yml:32:9
   |
32 |         - name: Checkout NuttX
   |  _________^
33 | |         uses: actions/checkout@v4
34 | |         with:
35 | |           path: nuttx
   | |_____________________^ does not set persist-credentials: false
   |
   = note: audit confidence → Low
   = note: this finding has an auto-fix

warning[artipacked]: credential persistence through GitHub Actions artifacts
  --> ota-resilience-canary.yml:37:9
   |
37 |         - name: Checkout NuttX apps
   |  _________^
38 | |         uses: actions/checkout@v4
39 | |         with:
40 | |           repository: apache/nuttx-apps
41 | |           path: nuttx-apps
   | |__________________________^ does not set persist-credentials: false
   |
   = note: audit confidence → Low
   = note: this finding has an auto-fix

warning[artipacked]: credential persistence through GitHub Actions artifacts
  --> ota-resilience-canary.yml:43:9
   |
43 |         - name: Checkout tardigrade
   |  _________^
44 | |         uses: actions/checkout@v4
45 | |         with:
46 | |           repository: neilberkman/tardigrade
47 | |           ref: 819d143a56c83fd7860ccc8f76a414a717956d94
48 | |           path: tardigrade
   | |__________________________^ does not set persist-credentials: false
   |
   = note: audit confidence → Low
   = note: this finding has an auto-fix

linguini1

Have you tested this locally or on GitHub workflows externally? Can you please provide some test output from that? (Your PR does not currently follow the template.)

linguini1 · 2026-03-10T04:10:12Z

.github/workflows/ota-resilience-canary.yml

+      - name: Checkout tardigrade
+        uses: actions/checkout@v4
+        with:
+          repository: neilberkman/tardigrade


This would make NuttX CI and testing reliant on your own project, which (from what I can see) is only four days old. With all due respect, this may be an unreliable test method/security risk if it's externally maintained.

Agreed. If we want anything like this, in the best scenario it would be part of the NuttX repo :-) I guess this is still a work in progress, and we may see it upstream when it's ready, right? :-)

neilberkman · 2026-03-10T05:41:34Z

@lupyuen: "Sorry do we need a person to monitor the results of this Scheduled Job? If it fails, who will be fixing it?"

I'd monitor it and fix any workflow breakage myself. That said, I understand the concern about adding a scheduled job to the main repo when existing scheduled builds have been moved to the mirror. Happy to run this on my own fork instead and file issues here if it catches anything (would manually review before reporting). If that's preferred I can withdraw the PR. But if there's appetite for it in the mirror repo or main repo down the road, the offer stands. Also open to other approaches if there's a better way to integrate this kind of testing.

@lupyuen: Zizmor persist-credentials: false

Fixed, all three checkout steps now set persist-credentials: false. Thanks for running zizmor on it.
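For reference, the three checkout steps flagged by zizmor would look roughly like this after the fix. Step names, repositories, and paths are taken from the zizmor output quoted earlier in the thread. The tardigrade ref is the one shown in that output; later commits in this PR updated the pin to a SHA that is not quoted here, so treat the value as illustrative.

```yaml
steps:
  - name: Checkout NuttX
    uses: actions/checkout@v4
    with:
      path: nuttx
      # Keep the workflow token out of the checked-out repo's git config.
      persist-credentials: false

  - name: Checkout NuttX apps
    uses: actions/checkout@v4
    with:
      repository: apache/nuttx-apps
      path: nuttx-apps
      persist-credentials: false

  - name: Checkout tardigrade
    uses: actions/checkout@v4
    with:
      repository: neilberkman/tardigrade
      # Pinned by full commit SHA (value as shown in the zizmor output).
      ref: 819d143a56c83fd7860ccc8f76a414a717956d94
      path: tardigrade
      persist-credentials: false
```

Since the workflow only reads from these repositories and never pushes, dropping the persisted credentials should not change its behaviour.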

@lupyuen: "Do you have this running on your NuttX Repo? I'm curious to see how much of GitHub Runners it will use."

Yes, ran it on my fork: run #22888336237. Single ubuntu-22.04 runner, 7m17s total. 34 fault points tested, 0 bricks, 0 issues.

@linguini1: "Have you tested this locally or on GitHub workflows externally?"

Updated the PR description with Impact and Testing sections. Workflow tested on my fork, output and run link in the description.

@linguini1: "This would make NuttX CI and testing reliant on your own project, which (from what I can see) is only four days old."

Fair point. The public repo is new but the tool has been in development for a bit longer internally. Its value is specifically in automated power-loss testing during OTA updates, which as far as I know no other tool covers. The checkout is pinned by full commit SHA so the workflow can't be affected by later changes to the repo. But I understand the concern about depending on an external project. Happy to run this on my own fork instead if that's preferred, or open to other suggestions for how to integrate this kind of testing.

lupyuen · 2026-03-10T05:45:18Z

@neilberkman: "Happy to run this on my own fork instead and file issues here if it catches anything (would manually review before reporting)."

Yes, could we move this to your repo instead? We're having issues managing our GitHub Runners right now, and we'd hate to see them spike suddenly at an odd time :-) Also, I'm concerned that this job might linger forever in our repo without anybody monitoring it. Thanks for helping us :-)

neilberkman · 2026-03-10T05:56:12Z

Totally understand, makes sense. I'll run it on my own fork and report any findings here as issues after reviewing them. Closing this one out.

cederom · 2026-03-10T09:12:45Z

Big thank you @neilberkman, great idea!! Yes, with the current state of CI and its load, it would be best to spread the work and run this on an external repo, then report issues upstream!!

linguini1 · 2026-03-10T12:23:46Z

@neilberkman: "Fair point. The public repo is new but the tool has been in development for a bit longer internally. Its value is specifically in automated power-loss testing during OTA updates, which as far as I know no other tool covers. The checkout is pinned by full commit SHA so the workflow can't be affected by later changes to the repo. But I understand the concern about depending on an external project. Happy to run this on my own fork instead if that's preferred, or open to other suggestions for how to integrate this kind of testing."

Thanks for addressing my concerns! This seems pretty reasonable. The reason I mentioned it is because there were similar concerns about NTFC, but repositories have since been made for it. Maybe eventually you might consider adding some minimal version to the upstream? Anyways, thanks for your hard work! :)

neilberkman requested review from lupyuen, raiden00pl and simbit18 as code owners March 10, 2026 03:08

github-actions bot added the Area: CI and Size: M (the size of the change in this PR is medium) labels Mar 10, 2026

lupyuen requested changes Mar 10, 2026

lupyuen requested review from cederom and linguini1 March 10, 2026 03:22

lupyuen requested changes Mar 10, 2026

linguini1 requested changes Mar 10, 2026

neilberkman added 3 commits March 9, 2026 21:42

ci: set persist-credentials: false on checkout steps

76945c3

Fixes zizmor artipacked warnings for all three actions/checkout@v4 steps in the OTA resilience canary workflow. Signed-off-by: Neil Berkman <neil@xuku.com>

ci: update tardigrade pin to handle pre-patched NuttX tree

91a9faa

Signed-off-by: Neil Berkman <neil@xuku.com>

ci: update tardigrade pin (resolve relative paths)

d3688c8

Signed-off-by: Neil Berkman <neil@xuku.com>

neilberkman closed this Mar 10, 2026

cederom requested a review from michallenc March 10, 2026 09:07

neilberkman wants to merge 4 commits into apache:master from neilberkman:tardigrade-ci

4 participants
