Skip to content

Catch error in CommandBuffer and poison the events#3523

Open
zcbenz wants to merge 1 commit into
ml-explore:mainfrom
zcbenz:metal-error
Open

Catch error in CommandBuffer and poison the events#3523
zcbenz wants to merge 1 commit into
ml-explore:mainfrom
zcbenz:metal-error

Conversation

@zcbenz
Copy link
Copy Markdown
Collaborator

@zcbenz zcbenz commented May 11, 2026

Close #2670.

When we got error in CommandBuffer's completion handler, save it, and then throw the error later in a few safe points:

  • after synchronization;
  • before and after command encoding.

In this way the program would be able to catch the error, and could continue running safely after recovering from the error.

The downside is that the timing of error throwing would be delayed much later until the program run into the check points, which is actually quite similar to how CUDA handles errors.

Copy link
Copy Markdown
Member

@angeloskath angeloskath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of comments.

I think the main issue is that check_error in the Event/Fence wait is not doing what it's meant to since the calling thread is practically never the thread that scheduled the work.

Comment thread mlx/backend/metal/device.cpp Outdated
Comment thread mlx/backend/metal/event.cpp Outdated
Comment thread mlx/backend/metal/event.cpp Outdated
Comment thread mlx/backend/metal/fence.cpp Outdated
@zcbenz zcbenz changed the title Handle CommandBuffer error in same thread Catch error in CommandBuffer and poison the events May 20, 2026
@zcbenz
Copy link
Copy Markdown
Collaborator Author

zcbenz commented May 20, 2026

I changed the implementation to propagate the error from CommandBuffer to Event:

  • When error happened in CommandBuffer, set error in all the signaled Events.
  • When an Event is waited and then an error was set for it, pass the error to the CommandEncoder and all its signaled Events.

So error in one stream would be propagated to all other streams via events. (Except for CPU stream which currently just throws and crashes.)

There is no good way to test it though, I'm mostly just using mlx.launch --verbose -n 32 python python/tests/ring_test_distributed.py which reliably times out.

@angeloskath
Copy link
Copy Markdown
Member

This honestly looks like an awesome cleanup!

I am curious about any overheads... probably overthinking it. I want to run some generations and small model training to verify no regression and then we can merge.

Love the fence cleanup in particular.

@zcbenz
Copy link
Copy Markdown
Collaborator Author

zcbenz commented May 22, 2026

I did a simple benchmarking by running mlx_lm.benchmark --model meta-llama/Llama-3.1-8B-Instruct -n 10 -p 64 -g 512 twice for each branch which seemed to be able to reduce variabilities a lot.

For main:

$ mlx_lm.benchmark --model meta-llama/Llama-3.1-8B-Instruct -n 10 -p 64 -g 512
Timing with prompt_tokens=64, generation_tokens=512, batch_size=1.
Trial 1:  prompt_tps=371.523, generation_tps=24.235, peak_memory=16.175, total_time=21.350
Trial 2:  prompt_tps=369.580, generation_tps=24.232, peak_memory=16.175, total_time=21.354
Trial 3:  prompt_tps=354.408, generation_tps=24.237, peak_memory=16.175, total_time=21.357
Trial 4:  prompt_tps=353.003, generation_tps=24.240, peak_memory=16.176, total_time=21.355
Trial 5:  prompt_tps=368.560, generation_tps=24.238, peak_memory=16.176, total_time=21.350
Trial 6:  prompt_tps=370.512, generation_tps=24.241, peak_memory=16.176, total_time=21.345
Trial 7:  prompt_tps=354.048, generation_tps=24.238, peak_memory=16.176, total_time=21.357
Trial 8:  prompt_tps=351.635, generation_tps=24.239, peak_memory=16.177, total_time=21.357
Trial 9:  prompt_tps=371.316, generation_tps=24.239, peak_memory=16.177, total_time=21.347
Trial 10:  prompt_tps=368.632, generation_tps=24.239, peak_memory=16.177, total_time=21.348
Averages: prompt_tps=363.322, generation_tps=24.238, peak_memory=16.176

for this branch:

$ mlx_lm.benchmark --model meta-llama/Llama-3.1-8B-Instruct -n 10 -p 64 -g 512
Timing with prompt_tokens=64, generation_tokens=512, batch_size=1.
Trial 1:  prompt_tps=354.694, generation_tps=24.242, peak_memory=16.175, total_time=21.353
Trial 2:  prompt_tps=371.278, generation_tps=24.237, peak_memory=16.175, total_time=21.348
Trial 3:  prompt_tps=370.507, generation_tps=24.240, peak_memory=16.175, total_time=21.347
Trial 4:  prompt_tps=352.574, generation_tps=24.246, peak_memory=16.176, total_time=21.350
Trial 5:  prompt_tps=354.255, generation_tps=24.236, peak_memory=16.176, total_time=21.358
Trial 6:  prompt_tps=355.678, generation_tps=24.242, peak_memory=16.176, total_time=21.353
Trial 7:  prompt_tps=370.067, generation_tps=24.238, peak_memory=16.176, total_time=21.349
Trial 8:  prompt_tps=371.358, generation_tps=24.230, peak_memory=16.177, total_time=21.355
Trial 9:  prompt_tps=353.896, generation_tps=24.246, peak_memory=16.177, total_time=21.350
Trial 10:  prompt_tps=352.977, generation_tps=24.235, peak_memory=16.177, total_time=21.360
Averages: prompt_tps=360.728, generation_tps=24.239, peak_memory=16.176

the generation seemed to become slightly faster but it is probably just variability.

Logically I think this branch added overhead when waiting events, but also reduced overhead by reducing the number of completion handlers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

when it crashed in the background, can not catch the exception

2 participants