[WIP] Fix OnlineDPO vLLM server completion handling #5516
Draft
JohnGiorgi wants to merge 1 commit into huggingface:main from
Conversation
Force-pushed from fcf5d00 to a2ad425, then from a2ad425 to be038f1.
What does this PR do?
Fixes #5514
This removes an extra flattening step in `OnlineDPOTrainer._generate_vllm_server()` when using vLLM server mode. `trl/scripts/vllm_serve.py` already returns `completion_ids` as one `list[int]` per completion. `OnlineDPOTrainer` was flattening that result a second time, which turned multi-token completions into one-token completions before decode. This PR removes that second flatten and adds a regression test covering the server return shape.
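The shape bug can be illustrated without TRL itself; a minimal sketch, assuming a toy `completion_ids` payload standing in for the server response (the variable names are illustrative, not TRL's actual code):

```python
# What trl vllm-serve returns: one list[int] of token ids per completion.
completion_ids = [[101, 7592, 102], [101, 2088, 102]]  # 2 completions, 3 tokens each

# The redundant flatten assumed an extra level of nesting, so each
# "completion" it produced was really a single token id:
flattened_again = [ids for group in completion_ids for ids in group]
print(flattened_again)       # [101, 7592, 102, 101, 2088, 102]
print(len(flattened_again))  # 6 one-token "completions" instead of 2
```

Decoding each element of the flattened list then yields one-token strings, which is the symptom reported in #5514.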
Validation:

- `make precommit`
- `PYTHONPATH=/tmp/trl-upstream-fix /mnt/home/john/elms-ai/.venv/bin/python -m pytest tests/experimental/test_online_dpo_trainer.py -k preserves_completion_lists`

Before submitting
- OnlineDPOTrainer._generate_vllm_server() flattens vllm-serve completion_ids twice #5514
- No documentation changes are needed for this internal bug fix.
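A shape-preservation check in the spirit of the added regression test might look like the sketch below; `preserve_completions` and the test name are hypothetical stand-ins, not the PR's actual code:

```python
def preserve_completions(server_response):
    """The fixed behavior: pass the server's completion_ids through unchanged.

    trl/scripts/vllm_serve.py already returns list[list[int]], one inner
    list of token ids per completion, so no extra flatten is needed.
    """
    return server_response

def test_preserves_completion_lists():
    completion_ids = [[5, 6, 7, 8], [9, 10]]
    out = preserve_completions(completion_ids)
    # Each completion keeps all of its tokens; nothing is split per token.
    assert out == completion_ids
    assert [len(c) for c in out] == [4, 2]

test_preserves_completion_lists()
```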
AI writing disclosure
Who can review?
Anyone familiar with `OnlineDPOTrainer` server-mode generation or the `trl vllm-serve` response shape.