Async DirectIO model loading on Linux #18012
Conversation
This results in a huge load speedup on DGX Spark and, at the end of the program, leaves the memory free instead of sitting in buff/cache.
Currently, the implementation is gated behind `defined(__linux__)`. Is this functionality generally supported across all Linux platforms? If I am reading this correctly, it boils down to `open()` supporting `O_DIRECT`.
Also, do we expect this change to have an effect on non-DGX Spark systems as well?
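For context, a minimal sketch (not the code from this PR) of how such a gate could fall back when `O_DIRECT` is unavailable: support is decided per filesystem at `open()` time, which typically fails with `EINVAL` on filesystems such as tmpfs.

```cpp
// Minimal sketch, not the PR's implementation: try O_DIRECT and fall back to a
// regular buffered open when the filesystem rejects it.
#define _GNU_SOURCE        // O_DIRECT is a GNU extension in <fcntl.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper: `direct` reports whether uncached reads are in effect.
static int open_model_file(const char * path, bool & direct) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    direct = (fd >= 0);
    if (fd < 0 && errno == EINVAL) {
        fd = open(path, O_RDONLY); // e.g. tmpfs, which has no O_DIRECT support
    }
    return fd;
}
```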
On my Strix Halo machine with btrfs, this is strictly worse than master or mmap: mmap shows the highest throughput while loading the model (~6 GByte/s), master is around 3 GByte/s, and this patch is 2 GByte/s.
IIRC, with Strix Halo and ROCm/HIP, loading a model into memory reserved for the GPU using mmap has a major performance issue, hanging basically indefinitely for larger models. Given that reserving memory for the GPU also means having less RAM available to the CPU, it would be great if this DirectIO path doesn't have that issue, as it would make ROCm/HIP more viable for larger models. Vulkan doesn't have this issue.
@ggerganov I have added a fallback. In my tests the first (cold) load time improved with every system configuration (PCIe 4.0/PCIe 5.0 SSD, RTX 5080/5090). On the second load …

@lemmi Which disk are you using? And I assume the 6 GB/s load with …
So, I ran a bunch of tests with the Vulkan backend to compare #18012 and #18047 against master.
Minisforum MS-S1 Max (AMD RYZEN AI MAX+ 395 w/ Radeon 8060S)
The … (EDIT: of course I was a good boy and ran …)
Thank you for testing this @lemmi! Looking at your numbers it seems like … Is …
```cpp
bool kv_unified = false;        // enable unified KV cache
// ...
bool input_prefix_bos = false;  // prefix BOS to user inputs, preceding input_prefix
bool use_mmap = true;           // use mmap for faster loads
```
Changing this to `false` by default results in a huge slowdown on macOS with default arguments:

```
time ./bin/llama-completion -m ../models/gpt-oss-120b/ggml-model-mxfp4.gguf -p "hello" -n 1 -no-cnv

# master
real    0m4.648s

# PR
real    0m17.957s
```

Not sure what the best way to handle this is. If we keep it `true`, then Linux users would not get the benefit of Direct IO. If we switch to `false`, Mac users will take the hit.
Would it be OK to set mmap depending on the platform?
We don't have such precedent atm for any of the parameters in common, so I would say it's not ideal.
On an M4 Pro with GPT-OSS-20B, a cold load takes 4.168s with `--no-mmap` and 6.3s with `--mmap`. A warm load, however, takes 2.1s with `--mmap` (`--no-mmap` is still ~4.1s).
Measured using `time ./llama-cli -m /Users/jtischbein/Documents/models/openai_gpt-oss-20b-MXFP4.gguf --no-mmap -p "bla" -n 0 --single-turn`, with the filesystem cache cleared using `purge`.
So the cold load is still faster with `--no-mmap`, but unfortunately not as fast as on Linux.
We can do the following:

- Add a new CLI argument `--direct-io, -dio` with the description: "Use DirectIO if available. Takes precedence over --mmap"
- Keep `use_mmap == true` and `use_direct_io == true`
- On Mac, the internal implementation will determine that DIO is not available, so it will fall back to mmap

Might want to do it in a separate PR as it would require changes in the libllama API. This PR should keep `use_mmap == true` by default.
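A rough sketch of how that precedence could look; the helper names here (`pick_load_method`, `platform_supports_direct_io`) are illustrative only and not part of the libllama API.

```cpp
// Hypothetical illustration of the proposed precedence: --direct-io wins when
// the platform supports it, otherwise fall back to mmap, then to buffered reads.
enum class load_method { direct_io, mmap, buffered };

static bool platform_supports_direct_io() {
#if defined(__linux__)
    return true;   // O_DIRECT may still be refused per-filesystem at open() time
#else
    return false;  // e.g. macOS: falling back keeps the current behavior
#endif
}

static load_method pick_load_method(bool use_mmap, bool use_direct_io) {
    if (use_direct_io && platform_supports_direct_io()) {
        return load_method::direct_io; // takes precedence over --mmap
    }
    if (use_mmap) {
        return load_method::mmap;
    }
    return load_method::buffered;
}
```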
Sounds good
The commit removes the branching in llama-model-loader.cpp and reduces code duplication in llama-mmap.cpp. Now DirectIO is easier to integrate on Windows and Mac.
ggerganov
left a comment
Let's restore `use_mmap` to `true` and we can merge.
I will file a PR which implements the argument proposed above.
Implements Direct I/O (uncached) file reading on Linux to improve model loading performance by bypassing the page cache. This is especially beneficial for large model files.
While mmap is fast when loading the same model multiple times, an uncached read provides consistent model loading times limited only by the sequential disk read speed. On DGX Spark, loading GPT-OSS-120B-MXFP4 using mmap takes ~110s on the first load and ~67s on subsequent loads. With these changes it consistently takes ~10.5s. The speedup depends on the model size, the disk read speed and, for sequential loading, the available RAM.
I would propose making uncached reads the default; Windows already has async uncached IO (PR).
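To illustrate the idea (simplified and synchronous here, whereas the PR reads asynchronously): `O_DIRECT` requires the buffer address and the requested read length to be block-aligned, so a loader typically reads into an aligned bounce buffer in large chunks and copies into the destination.

```cpp
// Simplified, synchronous sketch of an uncached read path; the real loader is
// asynchronous and more careful. This only shows the O_DIRECT alignment rules.
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <stdexcept>

static void read_direct(const char * path, void * dst, size_t size) {
    const size_t align = 4096;              // assumed logical block size
    const size_t chunk = 16u << 20;         // 16 MiB per request (arbitrary)

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        throw std::runtime_error("open failed");
    }

    void * buf = nullptr;                   // bounce buffer, block-aligned
    if (posix_memalign(&buf, align, chunk) != 0) {
        close(fd);
        throw std::runtime_error("posix_memalign failed");
    }

    size_t done = 0;
    while (done < size) {
        // With O_DIRECT the requested length must stay aligned, so always ask
        // for a full chunk; a read that crosses EOF simply returns fewer bytes.
        const ssize_t n = read(fd, buf, chunk);
        if (n <= 0) {
            break;                          // EOF or error
        }
        const size_t use = std::min((size_t) n, size - done);
        memcpy((char *) dst + done, buf, use);
        done += use;
    }

    free(buf);
    close(fd);
}
```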