[WIP] Fix OnlineDPO vLLM server completion handling#5516

Draft
JohnGiorgi wants to merge 1 commit into huggingface:main from JohnGiorgi:fix-online-dpo-vllm-server-completion-shape

@JohnGiorgi
Contributor

What does this PR do?

Fixes #5514

This removes an extra flattening step in OnlineDPOTrainer._generate_vllm_server() when using vLLM server mode.

trl/scripts/vllm_serve.py already returns completion_ids as one list[int] per completion. OnlineDPOTrainer was flattening that result a second time, which turned multi-token completions into one-token completions before decode.
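As a rough illustration of the shape problem described above, the sketch below uses made-up token IDs and a hypothetical `flatten_one_level` helper. It is not the trainer's actual code. It only shows how flattening a `list[list[int]]` one more time loses completion boundaries, so that each remaining element looks like a one-token completion.

```python
from itertools import chain

# Illustrative token IDs only; real values depend on the tokenizer.
# Shape assumed from the description above: one list[int] per completion.
server_completion_ids = [[11, 12, 13], [21, 22]]


def flatten_one_level(nested):
    """Hypothetical helper: flatten exactly one level of nesting."""
    return list(chain.from_iterable(nested))


# Keeping the server's shape preserves two multi-token completions.
kept = server_completion_ids

# An extra flatten drops the completion boundaries, leaving bare token IDs.
flattened = flatten_one_level(server_completion_ids)

# If each element is then treated as its own completion,
# every "completion" contains a single token.
as_completions = [[token_id] for token_id in flattened]

print(len(kept), len(as_completions))
```

In this sketch, two completions of three and two tokens become five single-token entries once the extra flatten is applied.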

This PR removes that second flatten and adds a regression test covering the server return shape.
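A shape check along the following lines is one way such a regression test could be structured. This is a minimal sketch, not the test added in this PR: `fake_server_generate` and `is_list_of_token_lists` are hypothetical names, and the fake stands in for the server response rather than calling any TRL or vLLM API.

```python
def is_list_of_token_lists(completion_ids):
    """Return True if completion_ids looks like one list[int] per completion."""
    return isinstance(completion_ids, list) and all(
        isinstance(completion, list)
        and all(isinstance(token_id, int) for token_id in completion)
        for completion in completion_ids
    )


def fake_server_generate(prompts):
    """Hypothetical stand-in for the server: three made-up token IDs per prompt."""
    return [[1, 2, 3] for _ in prompts]


def test_preserves_completion_lists():
    completion_ids = fake_server_generate(["prompt a", "prompt b"])
    # One entry per prompt, each still a multi-token list.
    assert is_list_of_token_lists(completion_ids)
    assert len(completion_ids) == 2
    assert all(len(completion) == 3 for completion in completion_ids)
```

A flattened result such as `[1, 2, 3]` fails `is_list_of_token_lists`, which is the property a test of this kind would guard.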

Validation:

  • make precommit
  • PYTHONPATH=/tmp/trl-upstream-fix /mnt/home/john/elms-ai/.venv/bin/python -m pytest tests/experimental/test_online_dpo_trainer.py -k preserves_completion_lists

Before submitting

AI writing disclosure

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone familiar with OnlineDPOTrainer server-mode generation or the trl vllm-serve response shape.

@JohnGiorgi JohnGiorgi changed the title from "Fix OnlineDPO vLLM server completion handling" to "[WIP] Fix OnlineDPO vLLM server completion handling" on Apr 10, 2026
@JohnGiorgi JohnGiorgi force-pushed the fix-online-dpo-vllm-server-completion-shape branch from fcf5d00 to a2ad425 on April 10, 2026 20:27
@JohnGiorgi JohnGiorgi force-pushed the fix-online-dpo-vllm-server-completion-shape branch from a2ad425 to be038f1 on April 13, 2026 22:49

Development

Successfully merging this pull request may close these issues.

OnlineDPOTrainer._generate_vllm_server() flattens vllm-serve completion_ids twice
