Releases: ggml-org/llama.cpp

b7491

20 Dec 20:43
52ab19d

b7490

20 Dec 20:33
5182dd6

b7489

20 Dec 12:08
10b4f82

b7488

20 Dec 10:27
408616a

server : [easy] fix per round speculative decode logging (#18211)

Previously the drafted count was always logged as 0, because slot.drafted is cleared before the log statement runs.

To reproduce, run llama-server with Devstral-2 as the main model, Devstral-2-Small as the draft model (-md), and verbose logging:

% ./build/bin/llama-server -v  \
  -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
  -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
  -c 8192 2> /tmp/llama.cpp.debug

Check the log:

slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 741
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 746
slot update_slots: id  3 | task 0 | accepted 16/0 draft tokens, new n_tokens = 763
slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 775
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 778
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 783
slot update_slots: id  3 | task 0 | accepted 8/0 draft tokens, new n_tokens = 792
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 795
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 797
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 799
slot update_slots: id  3 | task 0 | accepted 0/0 draft tokens, new n_tokens = 800
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 803
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 805
slot update_slots: id  3 | task 0 | accepted 6/0 draft tokens, new n_tokens = 812
slot update_slots: id  3 | task 0 | accepted 3/0 draft tokens, new n_tokens = 816

After the fix, the per-round logging is correct:

slot update_slots: id  3 | task 0 | accepted 7/8 draft tokens, new n_tokens = 654
slot update_slots: id  3 | task 0 | accepted 1/2 draft tokens, new n_tokens = 656
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 659
slot update_slots: id  3 | task 0 | accepted 1/16 draft tokens, new n_tokens = 661
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 664
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 681
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 698
slot update_slots: id  3 | task 0 | accepted 3/4 draft tokens, new n_tokens = 702
slot update_slots: id  3 | task 0 | accepted 5/12 draft tokens, new n_tokens = 708
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 725
slot update_slots: id  3 | task 0 | accepted 1/1 draft tokens, new n_tokens = 727
slot update_slots: id  3 | task 0 | accepted 8/16 draft tokens, new n_tokens = 736

b7487

20 Dec 08:52
9e39a1e

server: support load model on startup, support preset-only options (#18206)

  • server: support autoload model, support preset-only options

  • add docs

  • load-on-startup

  • fix

  • Update common/arg.cpp

Co-authored-by: Pascal [email protected]

b7486

19 Dec 22:48
74e0513

b7484

19 Dec 19:19
ce734a8

Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (#17977)

  • feat: implement real Q8_0

  • feat: adding cmake option for configuring FP32 quantize group size

  • typo: set() shall be used


Co-authored-by: ngdxzy [email protected]

b7483

19 Dec 18:52
14931a8

arg: fix order to use short form before long form (#18196)

  • arg: fix order to use short form before long form

  • arg: update doc

  • arg: update test-arg-parser

  • arg: address review feedback from ngxson
    - simplified to check first.length() <= last.length() only
    - fixed: --sampler-seq, --rerank, --draft ordering
    - note: middle positions in 3+ arg sets are not verified

  • arg: update doc

b7482

19 Dec 15:47
f99ef53

llama : Changing off_t to size_t for Windows (#18204)

b7481

19 Dec 13:20
cc0a043

server: friendlier error msg when ctx < input (#18174)

  • llama-server: friendlier error msg when ctx < input

This PR adds formatted strings to the server's send_error function, so the error message can report the actual sizes involved.

  • llama-server: use string_format inline

  • fix test
