LP: vllm benchmarking with quantised models #3207
Conversation
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
fadara01 left a comment
Thank you for your work!
I added some initial comments.
> ## Set up access to Llama 3.1-8B models
>
> To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:
Is it worth adding an instruction that you should also sign the licence agreement etc. for the meta-llama model?
Requesting access to the model is covered in the paragraph below. Is there an additional step I've forgotten?
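As a side note, the same authentication can be scripted from Python with huggingface_hub; a minimal sketch (the gated repo ID and the use of an `HF_TOKEN` environment variable are illustrative assumptions, not what the Learning Path prescribes):

```python
# Minimal sketch: authenticate to Hugging Face from Python instead of the CLI.
# Assumes huggingface_hub is installed and HF_TOKEN holds your access token.
import os

from huggingface_hub import login, snapshot_download

# Equivalent to running `huggingface-cli login` interactively
login(token=os.environ["HF_TOKEN"])

# Once your licence request for the gated repo has been approved,
# the weights can be pulled ahead of time (repo ID shown as an example):
snapshot_download("meta-llama/Llama-3.1-8B-Instruct")
```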
> * Accuracy: --limit mmlu=10,gsm8k=500
>
> ### Throughput ratios: INT8/BF16
> | | Requests/s | Total Tokens/s | Output Tokens/s |
Given that we ran a serving benchmark, I think we should report latency here too.
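For illustration, one way the latency side could be reported is by pulling time-to-first-token and time-per-output-token out of the saved benchmark results; a rough sketch (the JSON filenames and field names below are assumptions about what the serving benchmark writes out, not verified output):

```python
# Sketch only: compare mean TTFT / TPOT between the INT8 and BF16 runs.
# Assumes both runs were saved as JSON containing "mean_ttft_ms" and
# "mean_tpot_ms" fields; adapt the keys to whatever the bench CLI emits.
import json

def load(path):
    with open(path) as f:
        return json.load(f)

bf16 = load("bf16_serving_results.json")   # hypothetical filenames
int8 = load("int8_serving_results.json")

for key, label in [("mean_ttft_ms", "Time to first token"),
                   ("mean_tpot_ms", "Time per output token")]:
    ratio = int8[key] / bf16[key]
    print(f"{label}: INT8/BF16 = {ratio:.2f}")
```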
nikhil-arm left a comment
I think we need to redo the inference and benchmarking pages from scratch.
Also, I did not find any mention of Whisper, which was one of the requirements, if I understand correctly.
Llama and/or Whisper, I think.
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
…page to use custom scripts and added Whisper inference back in. Accuracy results in the benchmarking page are now full runs. Signed-off-by: Anna Mayne <anna.mayne@arm.com>
> ```
> python w8a8_quant.py
> ```
>
> Where w8a8_quant.py contains:
Is this script taken from anywhere upstream (e.g. HF)?
https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8#creation
Every Red Hat W8A8 model on Hugging Face includes its quantization script. Our script is more explicit about the parameters being used, but under the hood it is essentially doing the same thing.
I initially preferred pointing users directly to the HF scripts. However, in hindsight, that may introduce unnecessary back-and-forth, and we cannot control if or when the HF scripts change in the future.
Given that, I think it is reasonable to include and use this script here?
Yes, but we need to be clear that this is a slightly refined version of the one in HF.
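For orientation, a minimal sketch of the kind of llm-compressor recipe such a W8A8 script is based on (the model ID, calibration dataset, sample counts, save directory, and import paths are illustrative assumptions and may differ from the actual w8a8_quant.py):

```python
# Illustrative sketch only -- a SmoothQuant + GPTQ W8A8 recipe, following the
# general pattern used for the Red Hat W8A8 checkpoints; not the LP's script.
from transformers import AutoModelForCausalLM, AutoTokenizer
# Note: in some llm-compressor releases this import is `from llmcompressor import oneshot`
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"   # assumed model ID
SAVE_DIR = "Llama-3.1-8B-Instruct-w8a8"         # assumed output directory

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Smooth activations, then apply GPTQ to all Linear layers except the LM head,
# producing INT8 weights and activations (W8A8).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",        # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,            # assumed calibration settings
    num_calibration_samples=512,
)

# Save the compressed checkpoint so vLLM can load it by path
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```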
> who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B and Whisper, with and without quantisation, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.
>
> learning_objectives:
> - Install a recent release of vLLM
Question: Is there a convention for not punctuating lists in the LPs? (This list and the others higher up don't have any punctuation at the end of lines.)
> Before you begin, make sure your environment meets these requirements:
>
> - Python 3.12 on Ubuntu 22.04 LTS or newer
> - At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space
Do we really need 96 GB of RAM for the models in this Learning Path?
If yes, I am assuming it's for the quantization step and not inference itself?
> layout: learningpathall
> ---
>
> ## Run inference on Llama 3.1-8B
Are we using BF16 model inference as a baseline validation step to confirm that the setup, vllm serve, and OpenAI API scripts are working correctly, before moving on to W8A8 inference?
I want to understand the reason for showing BF16 inference on this page.
It's more for completeness, so we can say the user has run inference on both BF16 and W8A8. Then we later talk about comparing the two. If the focus should be more on W8A8, I can swap it around: the script would use the W8A8 model, and then we would state what to change the model to in order to use the BF16 weights instead?
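For context, switching between the two is just a change of model ID in the client script; a minimal sketch of the kind of OpenAI-compatible call made against a running vllm serve instance (host, port, and model IDs below are assumptions):

```python
# Minimal sketch: query a running `vllm serve` instance through its
# OpenAI-compatible API. Endpoint and model IDs are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    # Swap this for the W8A8 checkpoint to compare quantised inference
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user",
               "content": "Summarise what vLLM does in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```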
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Before submitting a pull request for a new Learning Path, please review Create a Learning Path
Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.