@taronaeo (Collaborator) commented Dec 14, 2025

I saw this comment and felt a bit sad since we use the BLAS backend a lot: #14909 (comment)

Currently, the BLAS backend dequantizes the weight tensors on the fly, so the dequantization runs again on every matrix multiplication. This is not ideal: dequantizing the weight tensors once up front and running the matrix multiplication on the dequantized buffers performs better.
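To make the difference concrete, here is a minimal, self-contained sketch of the two strategies. It is not the backend code from this PR: `QuantizedMatrix`, `dequantize`, `matvec`, `mul_mat_on_the_fly` and `PreDequantizedWeights` are hypothetical names, the int8-with-row-scale format is a toy stand-in for a real quantization type such as Q4_K, and `matvec` stands in for the BLAS call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy quantized weight matrix: int8 values with one float scale per row.
// This is NOT ggml's Q4_K layout; it only stands in for "a quantized format".
struct QuantizedMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::int8_t> q;  // rows * cols quantized values, row-major
    std::vector<float> scale;    // one scale per row
};

// Counts how often the dequantization step runs.
static int g_dequant_calls = 0;

// Expand the quantized values into a dense float buffer.
std::vector<float> dequantize(const QuantizedMatrix & w) {
    ++g_dequant_calls;
    assert(w.q.size() == w.rows * w.cols && w.scale.size() == w.rows);
    std::vector<float> out(w.rows * w.cols);
    for (std::size_t r = 0; r < w.rows; ++r) {
        for (std::size_t c = 0; c < w.cols; ++c) {
            out[r * w.cols + c] = w.scale[r] * static_cast<float>(w.q[r * w.cols + c]);
        }
    }
    return out;
}

// y = W * x on a dense float buffer (stand-in for the BLAS routine).
std::vector<float> matvec(const std::vector<float> & w, std::size_t rows,
                          std::size_t cols, const std::vector<float> & x) {
    assert(x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            y[r] += w[r * cols + c] * x[c];
        }
    }
    return y;
}

// On-the-fly strategy: dequantize inside every multiplication.
std::vector<float> mul_mat_on_the_fly(const QuantizedMatrix & w,
                                      const std::vector<float> & x) {
    return matvec(dequantize(w), w.rows, w.cols, x);
}

// Pre-dequantized strategy: dequantize once, then reuse the float buffer.
struct PreDequantizedWeights {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;

    explicit PreDequantizedWeights(const QuantizedMatrix & w)
        : rows(w.rows), cols(w.cols), data(dequantize(w)) {}

    std::vector<float> mul_mat(const std::vector<float> & x) const {
        return matvec(data, rows, cols, x);
    }
};
```

Calling `mul_mat_on_the_fly` N times runs `dequantize` N times, while a single `PreDequantizedWeights` object runs it once no matter how many multiplications follow. The trade-off in this sketch is memory: the float buffer is kept alive alongside the quantized data.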

In the benchmark below, this PR improves Prompt Processing (pp512) from 127.74 to 221.20 t/s (about 73%) and Token Generation (tg128) from 23.29 to 77.75 t/s (about 234%) over upstream.

Next on the roadmap, we will probably introduce the following:

  1. GGML_OP_MUL_MAT_ID for MoE models
  2. LibXSMM for smaller matrix multiplication

This PR, however, focuses on improving performance first.

Changes

  1. Dequantization for weight tensors now runs only once, during tensor init.
  2. test-backend-ops.cpp now sets weight tensors to GGML_BACKEND_BUFFER_USAGE_WEIGHTS instead of GGML_BACKEND_BUFFER_USAGE_ANY. The backend uses the GGML_BACKEND_BUFFER_USAGE_WEIGHTS flag to detect when to dequantize a tensor.
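
The two changes fit together roughly as in the sketch below. It is a hedged illustration, not the PR's code: `BufferUsage`, `Tensor` and `init_tensor` are hypothetical stand-ins (the real flags are the `GGML_BACKEND_BUFFER_USAGE_*` values named above), and the int8-with-scale payload is a toy format. The point is only that the init hook dequantizes a tensor when its usage is "weights" and leaves everything else untouched.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the usage flags named above; not the real ggml enum.
enum class BufferUsage { Any, Weights };

struct Tensor {
    BufferUsage usage = BufferUsage::Any;
    bool quantized = false;
    float scale = 1.0f;                     // toy per-tensor scale
    std::vector<std::int8_t> qdata;         // toy quantized payload
    std::optional<std::vector<float>> f32;  // filled at init for weight tensors only
};

// Called once per tensor at init time. Only quantized tensors flagged as
// weights get a pre-dequantized float copy; all other tensors are skipped.
void init_tensor(Tensor & t) {
    if (t.usage != BufferUsage::Weights || !t.quantized) {
        return;
    }
    std::vector<float> out;
    out.reserve(t.qdata.size());
    for (std::int8_t v : t.qdata) {
        out.push_back(t.scale * static_cast<float>(v));
    }
    t.f32 = std::move(out);
}
```

Under this assumption, a test that leaves its weight tensors at the "any" usage would never reach the pre-dequantized path, which is presumably why test-backend-ops.cpp had to change.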

Performance Benchmark

$ build/bin/llama-bench -m ~/Library/Caches/llama.cpp/LiquidAI_LFM2-8B-A1B-GGUF_LFM2-8B-A1B-Q4_K_M.gguf -r 1 -t 8

This PR

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | pp512 | 221.20 ± 0.00 |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | tg128 | 77.75 ± 0.00 |

build: 717531b (7364)

Upstream

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | pp512 | 127.74 ± 0.00 |
| lfm2moe 8B.A1B Q4_K - Medium | 4.70 GiB | 8.34 B | BLAS | 8 | tg128 | 23.29 ± 0.00 |

build: 254098a (7399)

AI Declaration: AI has been used to modify the test-backend-ops.cpp code to set weight tensors usage to GGML_BACKEND_BUFFER_USAGE_WEIGHTS because I couldn't figure it out without it crashing. Modifications to the BLAS backend were written by a human.

Signed-off-by: Aaron Teo <[email protected]>
only for mul_mat and mul_mat_id ops

@github-actions bot added the testing and ggml labels Dec 14, 2025
@taronaeo taronaeo marked this pull request as ready for review December 14, 2025 15:23
@taronaeo taronaeo requested a review from ggerganov as a code owner December 14, 2025 15:23
@taronaeo taronaeo marked this pull request as draft December 14, 2025 15:23
@taronaeo (Collaborator, Author) commented:

Made the PR ready for review by accident - pressed the wrong button.
