ggml-blas: refactor BLAS backend #18027

taronaeo · 2025-12-14T12:01:23Z

I saw this comment and felt a bit sad since we use the BLAS backend a lot: #14909 (comment)

Currently, the BLAS backend dequantizes weight tensors on the fly, repeating the work on every matrix multiplication. This is not ideal: pre-dequantizing the weight tensors once and running the matrix multiplication on the dequantized buffers performs better.
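To make the difference concrete, here is a minimal sketch in Python (not the backend's C/C++ code). The `QuantizedWeight` class, its per-row scale format, and the `matvec` helper are all hypothetical stand-ins; the sketch caches on first use for brevity, whereas this PR dequantizes during tensor init.

```python
# Toy model of the change: names and the quantization scheme are illustrative,
# not ggml's actual API or formats.

class QuantizedWeight:
    """A matrix stored as small integers plus one scale per row (hypothetical format)."""

    def __init__(self, q_rows, scales):
        self.q_rows = q_rows        # list of rows of small ints
        self.scales = scales        # one float scale per row
        self.dequant_calls = 0      # how often the weight was expanded to floats
        self._cache = None          # dequantized copy, filled at most once

    def dequantize(self):
        self.dequant_calls += 1
        return [[q * s for q in row] for row, s in zip(self.q_rows, self.scales)]

    def dequantize_once(self):
        # Pre-dequantized path: pay the conversion cost only on first use.
        if self._cache is None:
            self._cache = self.dequantize()
        return self._cache


def matvec(rows, x):
    """Plain float matrix-vector product, standing in for the BLAS call."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def run(weight, inputs, cached):
    get = weight.dequantize_once if cached else weight.dequantize
    return [matvec(get(), x) for x in inputs]


inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

on_the_fly = QuantizedWeight([[1, 2], [3, 4]], [0.5, 0.25])
cached = QuantizedWeight([[1, 2], [3, 4]], [0.5, 0.25])

out_a = run(on_the_fly, inputs, cached=False)
out_b = run(cached, inputs, cached=True)
```

Both paths produce the same outputs, but the on-the-fly weight is dequantized once per multiplication while the cached one is dequantized a single time, at the cost of keeping a float copy in memory.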

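We currently see a 53.56% performance improvement for prompt processing and 107.79% for token generation; in the single-run benchmark below, pp512 rises from 127.74 to 221.20 t/s and tg128 from 23.29 to 77.75 t/s.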

Next on the roadmap, we will probably introduce the following:

GGML_OP_MUL_MAT_ID for MoE models
LibXSMM for smaller matrix multiplication

This PR, however, focuses on improving performance first.

Changes

Dequantization for weight tensors now runs only once, during tensor init.
test-backend-ops.cpp now marks weight tensors with GGML_BACKEND_BUFFER_USAGE_WEIGHTS instead of GGML_BACKEND_BUFFER_USAGE_ANY; the backend uses this usage flag to detect when to dequantize a tensor.
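The two changes above can be sketched together as follows. This is a hedged Python model, not the backend's real code: the `BufferUsage` enum only mirrors the names of ggml's usage flags, and `Buffer`, `init_tensor`, `should_predequantize`, and the type-name set are hypothetical.

```python
from enum import Enum


# Mirrors the names of ggml's buffer usage flags; the classes below are a
# hypothetical model, not the backend's real data structures.
class BufferUsage(Enum):
    ANY = 0
    WEIGHTS = 1
    COMPUTE = 2


QUANTIZED_TYPES = {"q4_K", "q8_0"}  # illustrative subset of type names


class Buffer:
    def __init__(self, usage=BufferUsage.ANY):
        self.usage = usage
        self.dequantized = {}  # tensor name -> float copy


def should_predequantize(buffer, tensor_type):
    """Only quantized tensors living in a weights buffer get a float copy."""
    return buffer.usage is BufferUsage.WEIGHTS and tensor_type in QUANTIZED_TYPES


def init_tensor(buffer, name, tensor_type, q_values, scale):
    # Stand-in for the tensor-init hook: runs once per tensor.
    if should_predequantize(buffer, tensor_type):
        buffer.dequantized[name] = [q * scale for q in q_values]


weights = Buffer(BufferUsage.WEIGHTS)
scratch = Buffer()  # defaults to ANY, like the previous test setup

init_tensor(weights, "w", "q4_K", [1, 2, 3], 0.5)
init_tensor(scratch, "w", "q4_K", [1, 2, 3], 0.5)
init_tensor(weights, "b", "f32", [1, 2, 3], 1.0)
```

Under this model, a tensor in a buffer left at the ANY usage never takes the pre-dequantized path, which is presumably why the tests had to start marking weight tensors explicitly.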

Performance Benchmark

$ build/bin/llama-bench -m ~/Library/Caches/llama.cpp/LiquidAI_LFM2-8B-A1B-GGUF_LFM2-8B-A1B-Q4_K_M.gguf -r 1 -t 8

This PR

model	size	params	backend	threads	test	t/s
lfm2moe 8B.A1B Q4_K - Medium	4.70 GiB	8.34 B	BLAS	8	pp512	221.20 ± 0.00
lfm2moe 8B.A1B Q4_K - Medium	4.70 GiB	8.34 B	BLAS	8	tg128	77.75 ± 0.00

build: 717531b (7364)

Upstream

model	size	params	backend	threads	test	t/s
lfm2moe 8B.A1B Q4_K - Medium	4.70 GiB	8.34 B	BLAS	8	pp512	127.74 ± 0.00
lfm2moe 8B.A1B Q4_K - Medium	4.70 GiB	8.34 B	BLAS	8	tg128	23.29 ± 0.00

build: 254098a (7399)
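For reference, the relative gain implied by the two tables can be computed as below. This is a small sketch; `improvement_pct` is a hypothetical helper, and the inputs are the t/s values reported above. Note that the runs use `-r 1` (a single repetition) and different builds, so the figures are indicative only.

```python
def improvement_pct(new_tps, old_tps):
    """Relative throughput gain of new over old, in percent."""
    return (new_tps / old_tps - 1.0) * 100.0


# t/s values copied from the "This PR" and "Upstream" tables.
pp512 = improvement_pct(221.20, 127.74)
tg128 = improvement_pct(77.75, 23.29)

print(f"pp512: +{pp512:.2f}%")
print(f"tg128: +{tg128:.2f}%")
```

With these inputs the script reports roughly +73.16% for pp512 and +233.83% for tg128.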

AI Declaration: AI has been used to modify the test-backend-ops.cpp code to set weight tensors usage to GGML_BACKEND_BUFFER_USAGE_WEIGHTS because I couldn't figure it out without it crashing. Modifications to the BLAS backend were written by a human.


taronaeo · 2025-12-14T15:24:47Z

Made the PR ready for review by accident - pressed the wrong button.


taronaeo added 8 commits December 11, 2025 20:51

ggml-blas: initial mmid impl

f682374

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: fully working mmid

19c8ec9

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: code clean up

1926e07

Signed-off-by: Aaron Teo <[email protected]>

tests: set tensor usage as weight for weight tensors

61ee32d

only for mul_mat and mul_mat_id ops

Signed-off-by: Aaron Teo <[email protected]>

ggml: rewrite ggml-blas

9a14a09

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: fix invalid data access

aae6d1e

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: add note

717531b

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: fix ne

4470579

Signed-off-by: Aaron Teo <[email protected]>

github-actions bot added the testing and ggml labels Dec 14, 2025

taronaeo added 6 commits December 14, 2025 21:57

ggml-blas: force dequant routine to use max logical cores

6dff031

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: move global blas n threads to set_n_threads

e481be6

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: bring back openmp

7998d08

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: clean up code

75e506f

Signed-off-by: Aaron Teo <[email protected]>

ggml-blas: more code formatting

10ce5e0

Signed-off-by: Aaron Teo <[email protected]>

CODEOWNERS: add @taronaeo to blas backend [no ci]

46dea5d

Signed-off-by: Aaron Teo <[email protected]>

taronaeo marked this pull request as ready for review December 14, 2025 15:23

taronaeo requested a review from ggerganov as a code owner December 14, 2025 15:23

taronaeo marked this pull request as draft December 14, 2025 15:23

ggml-blas: further cleanup

04ed19b

Signed-off-by: Aaron Teo <[email protected]>
