Skip to content

uffd: add support for UFFD_EVENT_REMOVE events#1896

Open
bchalios wants to merge 18 commits intomainfrom
feat/free-page-reporting
Open

uffd: add support for UFFD_EVENT_REMOVE events#1896
bchalios wants to merge 18 commits intomainfrom
feat/free-page-reporting

Conversation

@bchalios
Copy link
Copy Markdown
Contributor

@bchalios bchalios commented Feb 11, 2026

Enabling Firecracker free-page-reporting feature requires us to handle remove events (UFFD_EVENT_REMOVE) in our userfaultfd handler. These events are triggered whenever Firecracker calls madvise(MADV_DONTNEED) (or similar) on a range of guest memory addresses.

The main thing that changes on our logic is that page faults in a page that has previously been removed need to be served with a zero page rather than a page from the snapshot file.

This commit changes the page fault serving logic to:

  1. Introduce tracking of the state of every page in the guest's memory mappings.
  2. Add logic to handle the new UFFD_EVENT_REMOVE event
  3. Modify existing logic to take into account current state when deciding how to handle each page fault

This is dependent on the part of #1858 that enables free page reporting on the Firecracker side.


Note

High Risk
High risk because it changes low-level userfaultfd event handling and page population semantics (including new ioctls and concurrency/defer logic), which can affect VM stability and memory correctness. It also introduces a new template/sandbox configuration flag that changes runtime behavior based on Firecracker version and feature flags.

Overview
This PR upgrades the default Firecracker version to v1.14.1 and introduces opt-in free page reporting, wiring a new freePageReporting setting from template creation through template build/sandbox startup to Firecracker balloon configuration. To support this, the userfaultfd handler is reworked to read and process UFFD_EVENT_REMOVE events, track per-page state, zero-fill pages after removal, and defer/retry faults that race with removes, with tests expanded to validate remove, write-protect, and gated ordering scenarios.

Written by Cursor Bugbot for commit b35c9f7. This will update automatically on new commits. Configure here.

@bchalios bchalios force-pushed the feat/free-page-reporting branch 2 times, most recently from a1d0a02 to 03e2966 Compare February 11, 2026 21:20
@bchalios bchalios force-pushed the feat/free-page-reporting branch 5 times, most recently from b766eb3 to 4d197e4 Compare February 12, 2026 23:48
@bchalios bchalios force-pushed the feat/free-page-reporting branch 2 times, most recently from e639acf to 88420f2 Compare February 13, 2026 00:45
@bchalios bchalios force-pushed the feat/free-page-reporting branch 8 times, most recently from d2e5d09 to 8ea97e4 Compare February 17, 2026 15:49
@bchalios bchalios force-pushed the feat/free-page-reporting branch 4 times, most recently from 04b1f47 to cd591ff Compare March 31, 2026 10:37
span.AddEvent("prefault: page already faulted or write returned EAGAIN")
} else {
u.pageTracker.setState(addr, addr+u.pageSize, faulted)
u.prefetchTracker.Add(offset, block.Prefetch)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefault can overwrite removed state causing thread hangs

Medium Severity

Prefault runs concurrently with the Serve loop but is not part of the wg errgroup. The Serve loop calls wg.Wait() before processing REMOVE events, but this does not wait for in-flight Prefault calls. A race exists: Prefault reads state as unfaulted, then Serve processes a REMOVE (setting state to removed), then Prefault's faultPage succeeds and setState(faulted) overwrites the removed state. Subsequent page faults on that address see faulted, skip handling, and the faulting thread hangs indefinitely.

Additional Locations (1)
Fix in Cursor Fix in Web

Copy link
Copy Markdown
Contributor Author

@bchalios bchalios Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a valid point for concurrent removes. I think in practice pre-faulting finishes way before we start receiving on-demand events, but I think the safest thing to do here would be to only resume vCPUs after pre-faulting has completed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT @ValentaTomas ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok,I think I've found another work-around. We can reuse the settleRequests RW mutex. If we take it the write-lock when we 're handling remove events, I think we're safe. Both page fault handling and prefaulter Go routines take the read lock. So we can never have page/pre-faulting racing with remove event handling.

@bchalios bchalios force-pushed the feat/free-page-reporting branch from cd591ff to e6e8a19 Compare March 31, 2026 11:15
break
}
// And, finally, wake up the faulting thread
writeErr = u.fd.wake(addr, u.pageSize)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial zero+WP+wake failure loses write-protection permanently

Medium Severity

In the 4K read zero-fill path, if zero() succeeds but writeProtect() fails, the page is populated but not write-protected. The writeErr propagates to the EAGAIN/error checks. On retry via deferred, zero() returns EEXIST, and faultPage returns (true, nil), so setState(faulted) runs without the WP bit ever being set. This silently breaks dirty page tracking for that page.

Fix in Cursor Fix in Web

bchalios and others added 18 commits April 2, 2026 11:25
Bump the swagger file for Firecracker to v1.14 and regenerate
APIs/models.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Add support for Firecracker v1.14 and make it the default.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
As we're going to handle UFFD_EVENT_REMOVE events triggerred by
Firecracker, we need to keep track of the state of all the guest memory
pages.

Theis commit introduces 3 states:

* Unfaulted - A page that has not been faulted yet.
* Faulted   - A page that we have previously faulted in.
* Removed   - A page that we have received a remove event for and
              haven't faulted in since.

It also adds the necessary book keeping of page state in all the memory
regions of the guest, along with methods for retrieving and setting the
state of pages.

Co-authored-by: Tomas Valenta <valenta.and.thomas@gmail.com>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Import a few more bindings that we'll need for handling remove events.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Handle UFFD_EVENT_REMOVE events from the file descriptor. These events
are triggerred when Firecracker calls madvise() with MADV_DONTNEED on
some memory range that we are tracking. This Firecracker behaviour is
support with version 1.14.0 onwards using the free page reporting and
hinting features of balloon devices.

What this means for us is that, we need to track removed pages because
subsequent page faults need to be served with a zero page.

Co-authored-by: Tomas Valenta <valenta.and.thomas@gmail.com>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Do that, so we maintain the dirty page tracking on. Prefaulting does not
occur due to write faults, so tracking must be maintained. Writing will
clear the bit asynchronously in the kernel.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
userfaultfd_zeropage() does not understand UFFD_COPY_MODE_WP. When
zeroing a page we should always be passing 0 in mode (since we always
want to unblock the faulting thread).

For userfaultfd_copy() we want to provide UFFD_COPY_MODE_WP only when we
are copying due to a read. This includes read-triggerred page faults
from Firecracker and us pre-faulting.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
When guest memory is backed by 4K pages, Firecracker backs it with
anonymous pages. Anonymous pages can only be write-protected when there
is page entry present. This means that until we receive a fault for it a
page is not write-protected. This is fine. Not present (not faulted)
pages cannot be dirty, by definition.

When handling page faults for known pages, we use UFFDIO_COPY with
write-protection mode, which automatically sets write-protection for the
faulted page. However, imagine this sequence:

1. We resume from a snapshot
2. We receive a UFFD_EVENT_REMOVE for a range that was not previously
   faulted in.
3. We receive a page fault for a page within the removed region.

The correct way to handle this is providing the zero page and that's
what we do. However, UFFDIO_ZERO does not have a write-protected mode,
so the newly setup page is not write-protected tracked. Such a page will
always be reported dirty from Firecracker (present and !write-protected)
and it will be needlessly included in the snapshot.

Handle this by explicitly marking such pages as write-protected. Handle
the race condition by providing the zero page without waking up the
thread, marking it write protected and then waking up the thread.

Note, that huge pages don't have this issue because:
1. Hugetlbfs-backed pages are write-protected by Firecracker
2. we always handle huge page faults using UFFDIO_COPY with
   write-protection mode.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Balloon devices provide memory reclamation facilities through free page
reporting, as a more efficient mechanism (in terms of latency and CPU
time) than traditional ballooning.

Free page reporting instructs the guest to periodically report memory
that has been freed, so that we can reclaim it in the host side. It is
enabled before starting the sandbox and does not require any further
host-side orchestration.

Enable free page reporting for all new templates using Firecracker
versions >= v1.14.0. Also, allow users to optionally disable it for
these versions. Older Firecracker versions don't support the feature.
Trying to enable it for those, will return an error.

Co-authored-by: Valenta Tomas <valenta.and.tomas@gmail.com>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Test that if we always copy with WP on, write-protection will be handled
correctly by the kernel.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Add machinery for being able to coordinate the steps of the helper
process that operates the UFFD serve loop. Then add tests for various
scenarios of UFFD remove and page fault events.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Having 1M parallel operations sometimes causes intermittent issues in
tests. Use 10K instead as used in other tests.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
We are handling page fault events within go routines without much
synchronization across across iterations of the serve loop (apart from
when we are about to shut down the loop).

This works fine when we only have page fault events, but it might be
problematic with remove events in the mix. When we handle a remove
event, we need to make sure that no page fault handling Go routine is
still runing for the same page, as this cuases a race condition for the
book keeping of the page state.

Handle this, by ensuring that all outstanding Go routines have completed
before handling remove events. Do that only when we do have remove
events in the queue.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Previous commit added logic to wait for all Go routines handling page
faults from previous iterations before handling remove events. There is
one more race condition, though, between handling remove events and
prefaulting operations. The previous fix did not handle this one.

Re-use the settleRequests RW Mutex to instead of waiting the errorgroup.
All pagefault operations are taking the read lock, so page- and
pre-faulting can happen concurrently. Handling remove events takes the
write lock, so no pre/page-faulting can happen while we handle those.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
In crossProcessServe(), `cleanup` is initially set to a function that
signals the original fdExit and drains exitUffd (a buffered channel of
capacity 1).

When a gated test issues a pause ('P'), stopServe() calls the original
cleanup, which drains exitUffd. On resume ('R'), startServe() creates a
new fdExit and Serve goroutine, and reassigns both `stopServe` and
`cleanup` to a new stop function.

However, `defer cleanup()` captures the *value* of cleanup at the time
the defer statement executes — the original closure — not the variable
itself. So when SIGUSR1 arrives and crossProcessServe returns, the
deferred call invokes the original cleanup, which blocks forever on
<-exitUffd because that channel was already drained during the pause
step. The subprocess never exits, cmd.Wait() in the parent test hangs,
and the parallel test slot is never released, causing unrelated tests
waiting at t.Parallel() to time out.

Fix by deferring through the variable so the call always dispatches to
whatever cleanup currently points to at return time:

    defer func() { cleanup() }()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

return true
}

return false
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version checks will break for major version > 1

Low Severity

Both HasHugePages and HasFreePageReporting use Major() == 1 which will return false for any future Firecracker version with major > 1 (e.g., v2.0). The correct check for "version >= X.Y" is Major() > X || (Major() == X && Minor() >= Y). While Firecracker is currently at v1.14, this pattern is easy to copy-paste incorrectly for future version gates.

Additional Locations (1)
Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants