Releases: ggml-org/llama.cpp

b7491

20 Dec 20:43
52ab19d

b7490

20 Dec 20:33
5182dd6

b7489

20 Dec 12:08
10b4f82

b7488

20 Dec 10:27
408616a

server : [easy] fix per round speculative decode logging (#18211)

Previously the drafted count was always logged as 0, because slot.drafted is cleared before the log statement runs.

To reproduce, run llama-server with Devstral-2 as the main model, Devstral-2-Small as the draft model (-md), and verbose logging:

% ./build/bin/llama-server -v  \
  -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
  -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
  -c 8192 2> /tmp/llama.cpp.debug

Check the log:

slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 741
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 746
slot update_slots: id  3 | task 0 | accepted 16/0 draft tokens, new n_tokens = 763
slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 775
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 778
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 783
slot update_slots: id  3 | task 0 | accepted 8/0 draft tokens, new n_tokens = 792
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 795
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 797
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 799
slot update_slots: id  3 | task 0 | accepted 0/0 draft tokens, new n_tokens = 800
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 803
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 805
slot update_slots: id  3 | task 0 | accepted 6/0 draft tokens, new n_tokens = 812
slot update_slots: id  3 | task 0 | accepted 3/0 draft tokens, new n_tokens = 816

After the fix, the per-round logging is correct:

slot update_slots: id  3 | task 0 | accepted 7/8 draft tokens, new n_tokens = 654
slot update_slots: id  3 | task 0 | accepted 1/2 draft tokens, new n_tokens = 656
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 659
slot update_slots: id  3 | task 0 | accepted 1/16 draft tokens, new n_tokens = 661
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 664
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 681
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 698
slot update_slots: id  3 | task 0 | accepted 3/4 draft tokens, new n_tokens = 702
slot update_slots: id  3 | task 0 | accepted 5/12 draft tokens, new n_tokens = 708
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 725
slot update_slots: id  3 | task 0 | accepted 1/1 draft tokens, new n_tokens = 727
slot update_slots: id  3 | task 0 | accepted 8/16 draft tokens, new n_tokens = 736

b7487

20 Dec 08:52
9e39a1e

server: support load model on startup, support preset-only options (#18206)

  • server: support autoload model, support preset-only options

  • add docs

  • load-on-startup

  • fix

  • Update common/arg.cpp

Co-authored-by: Pascal [email protected]

b7486

19 Dec 22:48
74e0513

b7484

19 Dec 19:19
ce734a8

Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (#17977)

  • feat: implement real Q8_0

  • feat: adding cmake option for configuring FP32 quantize group size

  • typo: set() shall be used


Co-authored-by: ngdxzy [email protected]

b7483

19 Dec 18:52
14931a8

arg: fix order to use short form before long form (#18196)

  • arg: fix order to use short form before long form

  • arg: update doc

  • arg: update test-arg-parser

  • arg: address review feedback from ngxson
    - simplified to check first.length() <= last.length() only
    - fixed: --sampler-seq, --rerank, --draft ordering
    - note: middle positions in 3+ arg sets are not verified

  • arg: update doc

b7482

19 Dec 15:47
f99ef53

llama : Changing off_t to size_t for Windows (#18204)

b7481

19 Dec 13:20
cc0a043

server: friendlier error msg when ctx < input (#18174)

  • llama-server: friendlier error msg when ctx < input

This PR adds formatted strings to the server's send_error function, so the error message can report the actual sizes involved.

  • llama-server: use string_format inline

  • fix test
