
Adding Support for Attention Sinks to vLLM Code Path.#2923

Merged
copybara-service[bot] merged 1 commit into main from nicogrande/enable-gpt-oss-attention-vllm
Jan 12, 2026
Conversation

@NicoGrande
Collaborator

Description

This PR introduces support for attention sinks in the MaxText-on-vLLM code path, which enables support for the GPT-OSS family of models.
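For context, the attention-sink mechanism used by GPT-OSS can be thought of as a learned per-head "sink" logit that participates in the softmax normalization but contributes no value, draining probability mass away from the real keys. This is a minimal illustrative sketch of that idea in NumPy, not the code from this PR; all names and shapes here are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_sink(q, k, v, sink_logit):
    """Scaled dot-product attention with a per-head sink logit.

    The sink column joins the softmax denominator but is dropped before
    the value mixdown, so it only absorbs probability mass.
    q: (T, d), k: (S, d), v: (S, d), sink_logit: scalar.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (T, S)
    sink = np.full((scores.shape[0], 1), sink_logit)  # (T, 1)
    probs = softmax(np.concatenate([scores, sink], axis=-1))
    return probs[:, :-1] @ v                          # drop the sink column

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = attention_with_sink(q, k, v, sink_logit=0.0)
```

A very negative sink logit recovers ordinary attention, while a very large one drives the output toward zero, which is what lets the model "attend to nothing" on some heads.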

Tests

Tested locally on v6e-4 with the following command:

  python3 -m MaxText.vllm_decode \
    --model_name gpt-oss-20b \
    --hf_model_name openai/gpt-oss-20b \
    --hf_config_path src/MaxText/integration/vllm/maxtext_vllm_adapter \
    --load_parameters_path $CHECKPOINT_PATH \
    --ici_tensor_parallelism 4 \
    --gpu_memory_utilization 0.5 \
    --prompt "Suggest some famous landmarks in London."

Output:

Prompt: 'Suggest some famous landmarks in London.', Generated text: "\n\nLondon is home to a wealth of iconic landmarks that reflect its rich history and vibrant culture. Here are some of the most famous:\n\n1. **The Tower of London** - A historic castle on the north bank of the River Thames, known for its role as a royal palace, prison, and treasury.\n2. **Buckingham Palace** - The London residence and administrative headquarters of the monarch of the United Kingdom.\n3. **The British Museum** - One of the world's best museums, famous for its vast collection of art and antiquities from around the world.\n4. **The Houses of Parliament and Big Ben** - The iconic clock tower and the seat of the UK Parliament.\n5. **The London Eye** - A giant Ferris wheel on the South Bank of the River Thames, offering panoramic views of the city.\n6. **St. Paul’s Cathedral** - Known for its magnificent dome and historic significance.\n7. **The Shard** - The tallest building in the UK, offering spectacular views from its viewing platform.\n8. **The Tate Modern** - A leading contemporary art museum.\n9. **The National Gallery** - Home to a collection of paintings, sculpture, and prints.\n10. 
**The National Gallery** - The National Gallery** - Home to a collection of paintings, sculpture, and prints.\n\nHere are some of the\n\nHere are some of the\n\nHere are some famous landmarks in London\n\nHere\n\nHere\n\nHere\n\nHere\n\nLondon\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nLondon\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\n1\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nLondon\n\nHere\n\nLondon\n\nHere\n\nHere\n\nLondon\n\nHere\n\nHere\n\nLondon\n\nLondon\n\nHere\n\nLondon\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nand\n\nHere\n\n\n\nHere\n\nand\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\n"

Note: GPT-OSS uses the harmony tokenizer, which has a special end token that vLLM does not treat as a stop token by default. This is why the response ends in repeated text.
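Outside the scope of this PR, one way to handle this would be to register the harmony end token as a stop token (vLLM's `SamplingParams` accepts a `stop_token_ids` argument) or to truncate the generated ids at its first occurrence. A minimal sketch of the truncation approach, using a hypothetical token id (the real id would come from the harmony tokenizer):

```python
def truncate_at_stop(token_ids, stop_ids):
    """Return the prefix of token_ids up to (excluding) the first stop token."""
    for i, tok in enumerate(token_ids):
        if tok in stop_ids:
            return token_ids[:i]
    return token_ids

# Hypothetical id for illustration; look up the actual harmony
# end-of-turn id from the tokenizer in practice.
END_OF_TURN = 200002
trimmed = truncate_at_stop([10, 11, END_OF_TURN, 12], {END_OF_TURN})
print(trimmed)  # [10, 11]
```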

Checklist

Before submitting this PR, please make sure (put an X in the square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/MaxText/layers/attentions.py 0.00% 3 Missing ⚠️


Comment thread on src/MaxText/layers/attentions.py (outdated).
@NicoGrande force-pushed the nicogrande/enable-gpt-oss-attention-vllm branch from 6873337 to 27ce213 on January 9, 2026 19:38
@NicoGrande force-pushed the nicogrande/enable-gpt-oss-attention-vllm branch from 27ce213 to e6976ba on January 9, 2026 22:26
@gagika (Collaborator) left a comment


thanks

copybara-service[bot] merged commit 05a4a53 into main on Jan 12, 2026
38 of 47 checks passed
copybara-service[bot] deleted the nicogrande/enable-gpt-oss-attention-vllm branch on January 12, 2026 22:35


4 participants