LP: vllm benchmarking with quantised models #3207

Open

almayne wants to merge 7 commits into ArmDeveloperEcosystem:main from almayne:vllm_bench_quantised

Conversation


@almayne almayne commented Apr 24, 2026

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>

@fadara01 fadara01 left a comment


Thank you for your work!

I added some initial comments.


## Set up access to Llama 3.1-8B models

To access the Llama models hosted on Hugging Face, you need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. Create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:
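A minimal sketch of the login step, assuming the huggingface_hub Python package (the Learning Path's exact commands may differ):

```python
# Minimal sketch: authenticate with Hugging Face from Python, assuming
# huggingface_hub is installed (for example via `pip install huggingface_hub`).
from huggingface_hub import login

# Prompts interactively for the access token created in your Hugging Face
# account settings. Gated models such as meta-llama also require accepting
# the licence on the model page before downloads will succeed.
login()
```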


Is it worth adding an instruction that you should also sign the licence agreement, etc., for the meta-llama model?

Author


Requesting access to the model is covered in the paragraph below. Is there an additional step I've forgotten?

* Accuracy: --limit mmlu=10,gsm8k=500

### Throughput ratios: INT8/BF16
| Requests/s | Total Tokens/s | Output Tokens/s |
|---|---|---|


Given that we ran a serving benchmark, I think we should report latency here too.
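For illustration, a hypothetical probe against a local vLLM OpenAI-compatible server that reports time-to-first-token alongside decode throughput (endpoint, port, and model name are assumptions, not the Learning Path's settings):

```python
# Hypothetical latency probe against a local vLLM server started with
# `vllm serve <model>` (OpenAI-compatible API assumed on port 8000).
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
chunks = 0  # streamed chunks, an approximation of output tokens

# Stream the response so the arrival of the first token can be timed.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "Summarise vLLM in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1
end = time.perf_counter()

if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.3f} s")
    if chunks > 1 and end > first_token_at:
        print(f"Decode rate: {(chunks - 1) / (end - first_token_at):.1f} tokens/s")
```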

Contributor

@nikhil-arm nikhil-arm left a comment


I think we need to redo the inference and benchmarking pages from scratch.
Also, I did not find any mention of Whisper, which was one of the requirements, if I understand correctly.

@nSircombe

> I think we need to redo the inference and benchmarking pages from scratch. Also, I did not find any mention of Whisper, which was one of the requirements, if I understand correctly.

Llama and/or Whisper, I think.
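For context on the Whisper requirement: recent vLLM releases can serve Whisper through the OpenAI-compatible transcription endpoint. A hypothetical sketch (model name and audio file are assumptions):

```python
# Hypothetical sketch of Whisper inference against a vLLM server, assuming it
# was started with something like `vllm serve openai/whisper-large-v3` and
# that the vLLM build includes audio/transcription support.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:  # assumed local audio file
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(result.text)
```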

almayne added 3 commits April 27, 2026 08:30
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
…page to use custom scripts and added whisper inference back in. Accuracy results in benchmarking page are now full runs.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>
```bash
python w8a8_quant.py
```

Where w8a8_quant.py contains:


Is this script taken from anywhere upstream (e.g. HF)?

Contributor


https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8#creation

Every Red Hat W8A8 model on Hugging Face includes its quantization script. Our script is more explicit about the parameters being used, but under the hood it is essentially doing the same thing.

I initially preferred pointing users directly to the HF scripts. However, in hindsight, that may introduce unnecessary back-and-forth, and we cannot control if or when the HF scripts change in the future.

Given that, do you think it is reasonable to include and use this script here?



Yes, but we need to be clear that this is a slightly refined version of the one on HF.
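For reference, the creation section of the model card linked above uses llm-compressor; a sketch in that style (parameter values are illustrative assumptions, not the exact contents of w8a8_quant.py):

```python
# Sketch of a W8A8 recipe in the style of the RedHatAI model card linked above
# (SmoothQuant followed by GPTQ via llm-compressor). Values here are
# illustrative assumptions; the Learning Path's w8a8_quant.py is a refined
# variant and may differ.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",  # assumed calibration dataset
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-w8a8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```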

who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B and Whisper, with and without quantisation, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

learning_objectives:
- Install a recent release of vLLM


Question: Is there a convention for not punctuating lists in the LPs? (This list and the others higher up don't have any punctuation at the end of lines.)

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space
Contributor


Do we really need 96 GB RAM for the models in the Learning Path?
If yes, I am assuming it's for the quantization step and not inference itself?

Author


That's correct.

layout: learningpathall
---

## Run inference on Llama 3.1-8B
Contributor

@nikhil-arm nikhil-arm May 14, 2026


Are we using BF16 model inference as a baseline validation step to confirm that the setup, vllm serve, and OpenAI API scripts are working correctly, before moving on to W8A8 inference?

I want to understand the reason for showing bf16 inference on this page.

Author


It's more for completeness, so we can say the user has run inference on both bf16 and w8a8; we later talk about comparing the two. If the focus should be more on w8a8, I can swap it around so the script uses the w8a8 model, and then state how to change the model name to use the bf16 weights instead.
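To make the swap concrete, a small sketch (model identifiers are assumptions based on the models discussed in this thread):

```python
# Sketch of the swap described above: the inference script is identical for
# both precisions; only the model identifier passed to `vllm serve` (and used
# in API requests) changes. Identifiers are assumptions.
BF16_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
W8A8_MODEL = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"

model_id = W8A8_MODEL  # set to BF16_MODEL to rerun the same steps unquantised
print(f"Serving {model_id}")
```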

almayne added 2 commits May 14, 2026 10:02
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>