Catch error in CommandBuffer and poison the events by zcbenz · Pull Request #3523 · ml-explore/mlx

zcbenz · 2026-05-11T08:10:19Z

When we got error in CommandBuffer's completion handler, save it, and then throw the error later in a few safe points:

after synchronization;
before and after command encoding.

In this way the program would be able to catch the error, and could continue running safely after recovering from the error.

The downside is that the timing of error throwing would be delayed much later until the program run into the check points, which is actually quite similar to how CUDA handles errors.

angeloskath

I left a couple of comments.

I think the main issue is that check_error in the Event/Fence wait is not doing what it's meant to since the calling thread is practically never the thread that scheduled the work.

zcbenz · 2026-05-20T10:02:05Z

I changed the implementation to propagate the error from CommandBuffer to Event:

When error happened in CommandBuffer, set error in all the signaled Events.
When an Event is waited and then an error was set for it, pass the error to the CommandEncoder and all its signaled Events.

So error in one stream would be propagated to all other streams via events. (Except for CPU stream which currently just throws and crashes.)

There is no good way to test it though, I'm mostly just using mlx.launch --verbose -n 32 python python/tests/ring_test_distributed.py which reliably times out.

angeloskath · 2026-05-21T00:15:57Z

This honestly looks like an awesome cleanup!

I am curious about any overheads... probably overthinking it. I want to run some generations and small model training to verify no regression and then we can merge.

Love the fence cleanup in particular.

zcbenz · 2026-05-22T06:30:17Z

I did a simple benchmarking by running mlx_lm.benchmark --model meta-llama/Llama-3.1-8B-Instruct -n 10 -p 64 -g 512 twice for each branch which seemed to be able to reduce variabilities a lot.

For main:

$ mlx_lm.benchmark --model meta-llama/Llama-3.1-8B-Instruct -n 10 -p 64 -g 512
Timing with prompt_tokens=64, generation_tokens=512, batch_size=1.
Trial 1:  prompt_tps=371.523, generation_tps=24.235, peak_memory=16.175, total_time=21.350
Trial 2:  prompt_tps=369.580, generation_tps=24.232, peak_memory=16.175, total_time=21.354
Trial 3:  prompt_tps=354.408, generation_tps=24.237, peak_memory=16.175, total_time=21.357
Trial 4:  prompt_tps=353.003, generation_tps=24.240, peak_memory=16.176, total_time=21.355
Trial 5:  prompt_tps=368.560, generation_tps=24.238, peak_memory=16.176, total_time=21.350
Trial 6:  prompt_tps=370.512, generation_tps=24.241, peak_memory=16.176, total_time=21.345
Trial 7:  prompt_tps=354.048, generation_tps=24.238, peak_memory=16.176, total_time=21.357
Trial 8:  prompt_tps=351.635, generation_tps=24.239, peak_memory=16.177, total_time=21.357
Trial 9:  prompt_tps=371.316, generation_tps=24.239, peak_memory=16.177, total_time=21.347
Trial 10:  prompt_tps=368.632, generation_tps=24.239, peak_memory=16.177, total_time=21.348
Averages: prompt_tps=363.322, generation_tps=24.238, peak_memory=16.176

for this branch:

$ mlx_lm.benchmark --model meta-llama/Llama-3.1-8B-Instruct -n 10 -p 64 -g 512
Timing with prompt_tokens=64, generation_tokens=512, batch_size=1.
Trial 1:  prompt_tps=354.694, generation_tps=24.242, peak_memory=16.175, total_time=21.353
Trial 2:  prompt_tps=371.278, generation_tps=24.237, peak_memory=16.175, total_time=21.348
Trial 3:  prompt_tps=370.507, generation_tps=24.240, peak_memory=16.175, total_time=21.347
Trial 4:  prompt_tps=352.574, generation_tps=24.246, peak_memory=16.176, total_time=21.350
Trial 5:  prompt_tps=354.255, generation_tps=24.236, peak_memory=16.176, total_time=21.358
Trial 6:  prompt_tps=355.678, generation_tps=24.242, peak_memory=16.176, total_time=21.353
Trial 7:  prompt_tps=370.067, generation_tps=24.238, peak_memory=16.176, total_time=21.349
Trial 8:  prompt_tps=371.358, generation_tps=24.230, peak_memory=16.177, total_time=21.355
Trial 9:  prompt_tps=353.896, generation_tps=24.246, peak_memory=16.177, total_time=21.350
Trial 10:  prompt_tps=352.977, generation_tps=24.235, peak_memory=16.177, total_time=21.360
Averages: prompt_tps=360.728, generation_tps=24.239, peak_memory=16.176

the generation seemed to become slightly faster but it is probably just variability.

Logically I think this branch added overhead when waiting events, but also reduced overhead by reducing the number of completion handlers.

zcbenz mentioned this pull request May 11, 2026

fix(metal): defer command-buffer error reporting + poison-on-failure #3519

Closed

zcbenz force-pushed the metal-error branch 6 times, most recently from b13de7b to 1d33462 Compare May 12, 2026 08:27

davidkoski mentioned this pull request May 15, 2026

[Bug] check_error() throws std::runtime_error from completion handler → uncaught SIGABRT on iPhone 15 Pro / iOS 26.4 (Qwen3 1.7B 4bit) ml-explore/mlx-swift#407

Open

angeloskath reviewed May 19, 2026

View reviewed changes

Comment thread mlx/backend/metal/device.cpp Outdated

Comment thread mlx/backend/metal/event.cpp Outdated

Comment thread mlx/backend/metal/event.cpp Outdated

Comment thread mlx/backend/metal/fence.cpp Outdated

zcbenz force-pushed the metal-error branch from 1d33462 to fc0b418 Compare May 20, 2026 08:23

zcbenz changed the title ~~Handle CommandBuffer error in same thread~~ Catch error in CommandBuffer and poison the events May 20, 2026

Catch error in CommandBuffer and poison the events

cd32966

zcbenz force-pushed the metal-error branch from fc0b418 to cd32966 Compare May 20, 2026 10:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Catch error in CommandBuffer and poison the events#3523

Catch error in CommandBuffer and poison the events#3523
zcbenz wants to merge 1 commit into
ml-explore:mainfrom
zcbenz:metal-error

zcbenz commented May 11, 2026

Uh oh!

angeloskath left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zcbenz commented May 20, 2026

Uh oh!

angeloskath commented May 21, 2026

Uh oh!

zcbenz commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zcbenz commented May 11, 2026

Uh oh!

angeloskath left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zcbenz commented May 20, 2026

Uh oh!

angeloskath commented May 21, 2026

Uh oh!

zcbenz commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants