LP: vllm benchmarking with quantised models #3207

Open

almayne wants to merge 7 commits into ArmDeveloperEcosystem:main from almayne:vllm_bench_quantised

Conversation


@almayne almayne commented Apr 24, 2026

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>

@fadara01 fadara01 left a comment


Thank you for your work!

I added some initial comments.


## Set up access to Llama 3.1-8B models

To access the Llama models hosted on Hugging Face, you need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. Create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:
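A minimal sketch of the login step, assuming the huggingface_hub Python package (the Learning Path's exact commands may differ):

```python
# Minimal sketch: authenticate with Hugging Face from Python, assuming
# huggingface_hub is installed (for example via `pip install huggingface_hub`).
from huggingface_hub import login

# Prompts interactively for the access token created in your Hugging Face
# account settings. Gated models such as meta-llama also require accepting
# the licence on the model page before downloads will succeed.
login()
```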


Is it worth adding an instruction that you should also sign the licence agreement, etc., for the meta-llama model?

Author


Requesting access to the model is covered in the paragraph below. Is there an additional step I've forgotten?

* Accuracy: --limit mmlu=10,gsm8k=500

### Throughput ratios: INT8/BF16
| Requests/s | Total Tokens/s | Output Tokens/s |
|---|---|---|


Given that we ran a serving benchmark, I think we should report latency here too.
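For illustration, a hypothetical probe against a local vLLM OpenAI-compatible server that reports time-to-first-token alongside decode throughput (endpoint, port, and model name are assumptions, not the Learning Path's settings):

```python
# Hypothetical latency probe against a local vLLM server started with
# `vllm serve <model>` (OpenAI-compatible API assumed on port 8000).
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
chunks = 0  # streamed chunks, an approximation of output tokens

# Stream the response so the arrival of the first token can be timed.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "Summarise vLLM in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1
end = time.perf_counter()

if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.3f} s")
    if chunks > 1 and end > first_token_at:
        print(f"Decode rate: {(chunks - 1) / (end - first_token_at):.1f} tokens/s")
```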

Contributor

@nikhil-arm nikhil-arm left a comment


I think we need to redo the inference and benchmarking pages from scratch.
Also, I did not find any mention of Whisper, which was one of the requirements, if I understand correctly.

@nSircombe

> I think we need to redo the inference and benchmarking pages from scratch. Also, I did not find any mention of Whisper, which was one of the requirements, if I understand correctly.

Llama and/or Whisper, I think.
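For context on the Whisper requirement: recent vLLM releases can serve Whisper through the OpenAI-compatible transcription endpoint. A hypothetical sketch (model name and audio file are assumptions):

```python
# Hypothetical sketch of Whisper inference against a vLLM server, assuming it
# was started with something like `vllm serve openai/whisper-large-v3` and
# that the vLLM build includes audio/transcription support.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:  # assumed local audio file
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(result.text)
```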

almayne added 3 commits April 27, 2026 08:30
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
…page to use custom scripts and added whisper inference back in. Accuracy results in benchmarking page are now full runs.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>
```bash
python w8a8_quant.py
```

Where w8a8_quant.py contains:


Is this script taken from anywhere upstream (e.g. HF)?

Contributor


https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8#creation

Every Red Hat W8A8 model on Hugging Face includes its quantization script. Our script is more explicit about the parameters being used, but under the hood it is essentially doing the same thing.

I initially preferred pointing users directly to the HF scripts. However, in hindsight, that may introduce unnecessary back-and-forth, and we cannot control if or when the HF scripts change in the future.

Given that, do you think it is reasonable to include and use this script here?



Yes, but we need to be clear that this is a slightly refined version of the one on HF.
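For reference, the creation section of the model card linked above uses llm-compressor; a sketch in that style (parameter values are illustrative assumptions, not the exact contents of w8a8_quant.py):

```python
# Sketch of a W8A8 recipe in the style of the RedHatAI model card linked above
# (SmoothQuant followed by GPTQ via llm-compressor). Values here are
# illustrative assumptions; the Learning Path's w8a8_quant.py is a refined
# variant and may differ.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",  # assumed calibration dataset
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-w8a8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```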

who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B and Whisper, with and without quantisation, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

learning_objectives:
- Install a recent release of vLLM


Question: Is there a convention for not punctuating lists in the LPs? (This list and the others higher up don't have any punctuation at the end of lines.)

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space
Contributor


Do we really need 96 GB RAM for the models in the Learning Path?
If yes, I am assuming it's for the quantization step and not inference itself?

Author


That's correct.

layout: learningpathall
---

## Run inference on Llama 3.1-8B
Contributor

@nikhil-arm nikhil-arm May 14, 2026


Are we using BF16 model inference as a baseline validation step to confirm that the setup, vllm serve, and OpenAI API scripts are working correctly, before moving on to W8A8 inference?

I want to understand the reason for showing bf16 inference on this page.

Author


It's more for completeness, so we can say the user has run inference on both bf16 and w8a8; we later talk about comparing the two. If the focus should be more on w8a8, I can swap it around so the script uses the w8a8 model, and then state how to change the model name to use the bf16 weights instead.
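To make the swap concrete, a small sketch (model identifiers are assumptions based on the models discussed in this thread):

```python
# Sketch of the swap described above: the inference script is identical for
# both precisions; only the model identifier passed to `vllm serve` (and used
# in API requests) changes. Identifiers are assumptions.
BF16_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
W8A8_MODEL = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"

model_id = W8A8_MODEL  # set to BF16_MODEL to rerun the same steps unquantised
print(f"Serving {model_id}")
```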

almayne added 2 commits May 14, 2026 10:02
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>