ci: add nxboot OTA resilience canary workflow #18519

Closed
neilberkman wants to merge 4 commits into apache:master from neilberkman:tardigrade-ci

@neilberkman neilberkman commented Mar 10, 2026

Summary

nxboot currently has no automated testing for power-loss resilience during OTA updates. A power cut at the wrong moment during a firmware update can leave a device permanently bricked — this is a known failure class in other bootloaders (MCUboot PRs #2100, #2109, #2199 all shipped bricking regressions that were only found after release).

This PR adds a weekly canary workflow that builds the nucleo-h743zi nxboot-loader and nxboot-app configs, then uses tardigrade (a Renode-based fault-injection harness) to inject power-loss faults at write points during the OTA update path and verify the device always recovers to a bootable state.

What it does: builds nxboot from this repo's configs, emulates an OTA update in Renode, interrupts it at ~64 points across the full write range, and checks that the bootloader recovers every time.

What it does not do: it runs on a weekly schedule and workflow_dispatch only. It does not trigger on push or pull_request, so it never blocks normal development or CI. If it ever becomes a nuisance, deleting the one YAML file has zero impact on the project.
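As a rough illustration of the trigger setup described above, a schedule-plus-manual workflow header could look like the sketch below. The workflow name and the cron expression are assumptions made for illustration; the PR text only states that the job runs weekly and on workflow_dispatch.

```yaml
# Sketch only: name and cron time are hypothetical, not taken from the PR.
name: OTA resilience canary

on:
  schedule:
    # Assumed slot: Sundays at 03:00 UTC. The PR only says "weekly".
    - cron: "0 3 * * 0"
  # Allows manual runs from the Actions tab.
  workflow_dispatch:

# No push or pull_request triggers are listed, so the job never runs
# as part of normal development CI.
```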

Dependencies: requires the nucleo-h743zi nxboot board configs from #18509 to be merged first. Tardigrade is pinned by full commit SHA.

Impact

  • Adds one new workflow file (.github/workflows/ota-resilience-canary.yml)
  • No impact on existing CI, builds, or code
  • Weekly schedule only — no push/PR triggers
  • Self-contained: no changes to any existing files

Testing

Workflow run on fork (run #22888336237):

Profile: nuttx_nxboot_canary
Verdict: PASS
Calibrated writes: 100000
Fault points: 34
Issues: 0
Bricks: 0
Control outcome: success
Control multi-boot: converged exec

Total runtime: 7m17s on a single ubuntu-22.04 runner.

Signed-off-by: Neil Berkman <neil@xuku.com>

Add a weekly + on-demand workflow that tests nxboot power-loss
resilience on nucleo-h743zi using Renode-based fault injection
(tardigrade). Builds the nxboot-loader and nxboot-app configs from
this repo, injects power-loss faults at write points during the OTA
update path, and verifies the device recovers to a bootable state.

Schedule-only (weekly) and workflow_dispatch — does not run on push
or pull_request, so it never blocks normal development.

Signed-off-by: Neil Berkman <neil@xuku.com>
@github-actions github-actions bot added the Area: CI and Size: M labels Mar 10, 2026
Member

@lupyuen lupyuen left a comment

Sorry do we need a person to monitor the results of this Scheduled Job? If it fails, who will be fixing it?

We don't have any Scheduled Jobs right now. We have Scheduled Builds in the NuttX Mirror Repo (moved there due to overuse of GitHub Runners), and they are monitored through the NuttX Dashboard.

@lupyuen lupyuen requested review from cederom and linguini1 March 10, 2026 03:22
@lupyuen
Member

lupyuen commented Mar 10, 2026

Do you have this running on your NuttX Repo? I'm curious to see how much of GitHub Runners it will use. Thanks!

Member

@lupyuen lupyuen left a comment

Zizmor Security Scanner found some issues in the workflow. Could you set persist-credentials: false and check that the workflow still runs OK? Thanks!

$ git clone https://github.com/neilberkman/nuttx --branch tardigrade-ci
$ zizmor nuttx/.github/workflows/ota-resilience-canary.yml
🌈 zizmor v1.22.0
 INFO audit: zizmor: 🌈 completed ota-resilience-canary.yml
warning[artipacked]: credential persistence through GitHub Actions artifacts
  --> ota-resilience-canary.yml:32:9
   |
32 |         - name: Checkout NuttX
   |  _________^
33 | |         uses: actions/checkout@v4
34 | |         with:
35 | |           path: nuttx
   | |_____________________^ does not set persist-credentials: false
   |
   = note: audit confidence → Low
   = note: this finding has an auto-fix

warning[artipacked]: credential persistence through GitHub Actions artifacts
  --> ota-resilience-canary.yml:37:9
   |
37 |         - name: Checkout NuttX apps
   |  _________^
38 | |         uses: actions/checkout@v4
39 | |         with:
40 | |           repository: apache/nuttx-apps
41 | |           path: nuttx-apps
   | |__________________________^ does not set persist-credentials: false
   |
   = note: audit confidence → Low
   = note: this finding has an auto-fix

warning[artipacked]: credential persistence through GitHub Actions artifacts
  --> ota-resilience-canary.yml:43:9
   |
43 |         - name: Checkout tardigrade
   |  _________^
44 | |         uses: actions/checkout@v4
45 | |         with:
46 | |           repository: neilberkman/tardigrade
47 | |           ref: 819d143a56c83fd7860ccc8f76a414a717956d94
48 | |           path: tardigrade
   | |__________________________^ does not set persist-credentials: false
   |
   = note: audit confidence → Low
   = note: this finding has an auto-fix

Contributor

@linguini1 linguini1 left a comment

Have you tested this locally or on GitHub workflows externally? Can you please provide some test output from that? (Your PR does not currently follow the template.)

      - name: Checkout tardigrade
        uses: actions/checkout@v4
        with:
          repository: neilberkman/tardigrade
Contributor

This would make NuttX CI and testing reliant on your own project, which (from what I can see) is only four days old. With all due respect, this may be an unreliable test method/security risk if it's externally maintained.

Contributor

Agree. If we want anything like this, in the best scenario it would be part of the NuttX repo :-) I guess this is still a work in progress and we may see it upstream when ready, right? :-)

Fixes zizmor artipacked warnings for all three actions/checkout@v4
steps in the OTA resilience canary workflow.

Signed-off-by: Neil Berkman <neil@xuku.com>
Signed-off-by: Neil Berkman <neil@xuku.com>
Signed-off-by: Neil Berkman <neil@xuku.com>
@neilberkman
Author

neilberkman commented Mar 10, 2026

@lupyuen: "Sorry do we need a person to monitor the results of this Scheduled Job? If it fails, who will be fixing it?"

I'd monitor it and fix any workflow breakage myself. That said, I understand the concern about adding a scheduled job to the main repo when existing scheduled builds have been moved to the mirror. Happy to run this on my own fork instead and file issues here if it catches anything (would manually review before reporting). If that's preferred I can withdraw the PR. But if there's appetite for it in the mirror repo or main repo down the road, the offer stands. Also open to other approaches if there's a better way to integrate this kind of testing.

@lupyuen: Zizmor persist-credentials: false

Fixed, all three checkout steps now set persist-credentials: false. Thanks for running zizmor on it.
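For reference, the three checkout steps with the fix applied would look roughly like this. The step names, repositories, paths, and pinned ref are taken from the zizmor excerpt earlier in the thread; the exact layout of the committed file is an assumption and may differ.

```yaml
# Reconstructed from the zizmor output; not copied from the final workflow file.
steps:
  - name: Checkout NuttX
    uses: actions/checkout@v4
    with:
      path: nuttx
      # Do not leave the token in the checked-out repo's git config.
      persist-credentials: false

  - name: Checkout NuttX apps
    uses: actions/checkout@v4
    with:
      repository: apache/nuttx-apps
      path: nuttx-apps
      persist-credentials: false

  - name: Checkout tardigrade
    uses: actions/checkout@v4
    with:
      repository: neilberkman/tardigrade
      # Pinned by full commit SHA, as stated in the PR description.
      ref: 819d143a56c83fd7860ccc8f76a414a717956d94
      path: tardigrade
      persist-credentials: false
```

Setting `persist-credentials: false` on each step is what the `artipacked` audit asks for: the checkout token is not written into the local git configuration, so it cannot leak through an uploaded artifact.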

@lupyuen: "Do you have this running on your NuttX Repo? I'm curious to see how much of GitHub Runners it will use."

Yes, ran it on my fork: run #22888336237. Single ubuntu-22.04 runner, 7m17s total. 34 fault points tested, 0 bricks, 0 issues.

@linguini1: "Have you tested this locally or on GitHub workflows externally?"

Updated the PR description with Impact and Testing sections. Workflow tested on my fork, output and run link in the description.

@linguini1: "This would make NuttX CI and testing reliant on your own project, which (from what I can see) is only four days old."

Fair point. The public repo is new but the tool has been in development for a bit longer internally. Its value is specifically in automated power-loss testing during OTA updates, which as far as I know no other tool covers. The checkout is pinned by full commit SHA so the workflow can't be affected by later changes to the repo. But I understand the concern about depending on an external project. Happy to run this on my own fork instead if that's preferred, or open to other suggestions for how to integrate this kind of testing.

@lupyuen
Member

lupyuen commented Mar 10, 2026

> Happy to run this on my own fork instead and file issues here if it catches anything (would manually review before reporting).

Yes, could we move this to your repo instead? We're having issues managing our GitHub Runners right now, and we'd hate to see usage spike suddenly at an odd time :-) Also, I'm concerned that this job might linger forever in our repo without anybody monitoring it. Thanks for helping us :-)

@neilberkman
Author

Totally understand, makes sense. I'll run it on my own fork and report any findings here as issues after reviewing them. Closing this one out.

@cederom cederom requested a review from michallenc March 10, 2026 09:07
@cederom
Contributor

cederom commented Mar 10, 2026

Big thank you @neilberkman, great idea!! Yes, with the current state of CI and its load, it would be best to spread the work: run this on an external repo, then report issues upstream!!

@linguini1
Contributor

> Fair point. The public repo is new but the tool has been in development for a bit longer internally. Its value is specifically in automated power-loss testing during OTA updates, which as far as I know no other tool covers. The checkout is pinned by full commit SHA so the workflow can't be affected by later changes to the repo. But I understand the concern about depending on an external project. Happy to run this on my own fork instead if that's preferred, or open to other suggestions for how to integrate this kind of testing.

Thanks for addressing my concerns! This seems pretty reasonable. The reason I mentioned it is because there were similar concerns about NTFC, but repositories have since been made for it. Maybe eventually you might consider adding some minimal version to the upstream? Anyways, thanks for your hard work! :)
