
[diffusion] feat: implement VAE parallel decode for Wan#16510

Closed
Songrui625 wants to merge 22 commits into sgl-project:main from Songrui625:vae-parallel-decode

Conversation

@Songrui625

@Songrui625 Songrui625 commented Jan 5, 2026

Motivation

Resolves #13191

Generating long or high-resolution videos demands more time and a larger VRAM footprint during VAE decoding.

This PR implements VAE parallel decode for Wan, which accelerates decoding and reduces peak VRAM usage during VAE decoding when using multiple GPUs.

Modifications

Basic Idea

The basic idea is that when performing convolutions across multiple GPUs, we follow the procedure below:

  1. Split the latents from the denoising stage along the height dimension.
  2. Perform a halo (ghost cell) exchange via P2P communication, so that rank N and rank N+1 share the rows at their common boundary.
  3. Perform an all-gather operation to assemble the complete output.
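The three steps above can be simulated in a single process to make the data flow concrete. The following is an illustrative sketch (the function name and shapes are mine, not the PR's API): halo rows are borrowed by slicing neighboring shards, whereas the real implementation moves them between ranks with P2P send/recv.

```python
import torch
import torch.nn.functional as F

def parallel_conv2d(x, weight, world_size=4, halo=1):
    # x: (N, C, H, W); split along the height dimension (step 1).
    shards = list(torch.chunk(x, world_size, dim=2))
    outs = []
    for rank, shard in enumerate(shards):
        # Halo exchange (step 2): borrow `halo` boundary rows from the
        # neighbors. Here this is plain slicing; distributed ranks would
        # use P2P send/recv instead.
        top = shards[rank - 1][:, :, -halo:, :] if rank > 0 else None
        bot = shards[rank + 1][:, :, :halo, :] if rank < world_size - 1 else None
        pieces = [p for p in (top, shard, bot) if p is not None]
        padded = torch.cat(pieces, dim=2)
        # Zero-pad only the global top/bottom edges; interior edges
        # already carry real halo data from the neighbors.
        pad_top = halo if rank == 0 else 0
        pad_bot = halo if rank == world_size - 1 else 0
        padded = F.pad(padded, (halo, halo, pad_top, pad_bot))
        outs.append(F.conv2d(padded, weight))
    # All-gather (step 3): concatenate the per-rank outputs along height.
    return torch.cat(outs, dim=2)

# Sanity check against an ordinary padded convolution over the full input.
x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)
ref = F.conv2d(F.pad(x, (1, 1, 1, 1)), w)
assert torch.allclose(ref, parallel_conv2d(x, w), atol=1e-4)
```

Because each rank convolves exactly the same rows the full convolution would see, the sharded result matches the reference up to floating-point rounding.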

Implementation Details

  • Implement WanDistCausalConv3d and WanDistConv2d, which perform a halo exchange to share boundary data across GPUs before the actual convolution.
  • When decoding each frame of latents, we first split the latents along the height dimension, proceed with decoding as usual, and finally all-gather the outputs.
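The WanDist* modules need a halo-exchange helper before convolving. A rough sketch using torch.distributed P2P ops might look like the following; the function name matches the reviewed diff, but the transport details and tensor layout are my assumptions, and it degrades to a no-op when no process group is initialized:

```python
import torch
import torch.distributed as dist

def halo_exchange(x: torch.Tensor, height_halo_size: int = 1) -> torch.Tensor:
    """Exchange boundary rows with neighboring ranks along the height dim.

    Hypothetical sketch; the PR's actual implementation may differ.
    x is assumed to be (..., H, W), so height is dim -2.
    """
    if not (dist.is_available() and dist.is_initialized()) or dist.get_world_size() == 1:
        return x  # single-process fallback: nothing to exchange
    rank, world = dist.get_rank(), dist.get_world_size()
    h = height_halo_size
    recv_top = torch.empty_like(x[..., :h, :])
    recv_bot = torch.empty_like(x[..., -h:, :])
    ops = []
    if rank > 0:  # exchange boundary rows with the rank above
        ops.append(dist.P2POp(dist.isend, x[..., :h, :].contiguous(), rank - 1))
        ops.append(dist.P2POp(dist.irecv, recv_top, rank - 1))
    if rank < world - 1:  # exchange boundary rows with the rank below
        ops.append(dist.P2POp(dist.isend, x[..., -h:, :].contiguous(), rank + 1))
        ops.append(dist.P2POp(dist.irecv, recv_bot, rank + 1))
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    parts = ([recv_top] if rank > 0 else []) + [x] + ([recv_bot] if rank < world - 1 else [])
    return torch.cat(parts, dim=-2)
```

After the exchange, each shard is `height_halo_size` rows taller on each interior edge, so the subsequent convolution produces exactly the rows that rank owns.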

Generated Videos

  • Wan2.2 720p video on a single H20. VAE decoding time: 26.2795s.
uv run sglang generate --model-path /data00/models/Wan-AI/Wan2.2-T2V-A14B-Diffusers --height 720 --width 1280 --seed 1024 --attention-backend sage_attn --prompt "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about."
sgl_wan22_vae_1gpu.mp4
  • Wan2.2 720p video with VAE parallel decode on 4 * H20. VAE decoding time: 9.1224s.
uv run sglang generate --model-path /data00/models/Wan-AI/Wan2.2-T2V-A14B-Diffusers --height 720 --width 1280 --seed 1024 --num-gpus 4 --ulysses-degree 4 --vae-config.use-parallel-decode --attention-backend sage_attn --prompt "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about."
sgl_wan22_vae_4gpus.mp4

Accuracy Tests

Benchmarking and Profiling

The benchmark baseline generates a 1280x720, 81-frame video from Wan2.1-T2V-14B-Diffusers with a single inference step.

| Number of GPUs | Max VRAM peak (MB) | VAE decode VRAM peak (GB) | VAE decoding time (s) |
| --- | --- | --- | --- |
| 1 * H20 | 51797.58 | 23.4 | 16.8490 |
| 2 * H20 | 39755.49 | 11.79 | 9.6863 |
| 4 * H20 | 35320.15 | 5.9 | 6.1343 |
| 8 * H20 | 35320.15 | 3.1 | 4.7050 |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@github-actions github-actions Bot added the diffusion SGLang Diffusion label Jan 5, 2026
@Songrui625 Songrui625 changed the title from "[diffusion] feat: implement VAE parallel decode" to "[diffusion] feat: implement VAE parallel decode for Wan" on Jan 5, 2026
```python
    )  # casting needed for mps since amp isn't supported
    return super().forward(x)

def halo_exchange(x: torch.Tensor, height_halo_size: int = 1) -> torch.Tensor:
```
Collaborator


How should we generalize this to other VAEs?

Author


I haven't checked the VAE code of HunyuanVideo yet. Would you mind if I turn this into a common function in a follow-up PR, e.g. when adding VAE parallel decode support for HunyuanVideo?

Comment thread: python/sglang/multimodal_gen/runtime/models/vaes/wanvae.py
@zyksir
Collaborator

zyksir commented Jan 6, 2026

@Songrui625 Please add a unit test to make sure the output of the distributed version equals that of the non-distributed one.

Collaborator

@mickqian mickqian left a comment


Please add a unittest, or I assume the parallel decode should be enabled automatically for multi-gpus?

@Songrui625
Author

@Songrui625 Please add a unit test to make sure the output of the distributed version equals that of the non-distributed one.

Thanks for pointing that out! Working on it.

@Songrui625
Author

Songrui625 commented Jan 7, 2026

Please add a unittest, or I assume the parallel decode should be enabled automatically for multi-gpus?

Use the new option --vae-config.use-parallel-decode to enable.

@mickqian
Collaborator

mickqian commented Jan 7, 2026

Please add a unittest, or I assume the parallel decode should be enabled automatically for multi-gpus?

Use the new option --vae-config.use-parallel-decode to enable.

And should we turn it on by default?

@zcnrex
Contributor

zcnrex commented Jan 8, 2026

Could you help add some sample generated videos?

@triple-mu
Contributor

Hello, would you be willing to first try using my PR to simplify the VAE logic?
The current VAE computation has quite complex caching logic along the temporal dimension. My PR significantly simplifies this caching logic while remaining computationally equivalent to the original implementation.
If this PR can be merged, I believe your PR would also become much simpler.

#15068

@Songrui625
Author

Songrui625 commented Jan 12, 2026

Hello, would you be willing to first try using my PR to simplify the VAE logic? The current VAE computation has quite complex caching logic along the temporal dimension. My PR significantly simplifies this caching logic while remaining computationally equivalent to the original implementation. If this PR can be merged, I believe your PR would also become much simpler.

#15068

Thanks, I will check later.

@Songrui625 Songrui625 closed this Jan 12, 2026
@Songrui625
Author

Songrui625 commented Jan 12, 2026

Please add a unittest, or I assume the parallel decode should be enabled automatically for multi-gpus?

Use the new option --vae-config.use-parallel-decode to enable.

And should we turn it on by default?

Sorry for the late response. I think we should keep it disabled by default until it is proven stable.

@Songrui625 Songrui625 reopened this Jan 12, 2026
@Songrui625
Author

Could you help add some sample generated videos?

OK. The sample video will be provided later along with the unit tests!

@Songrui625
Author

Still handling the case where the height dimension is not divisible by the GPU count; we need to deal with the numerical errors introduced by padding.
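One common way to handle an uneven height (a sketch, not necessarily the approach this PR lands on) is to pad the height so it divides evenly across ranks, then crop the padding from the gathered output:

```python
def padded_split(height: int, world_size: int) -> tuple[int, int]:
    """Return (rows per rank, padding rows) for an even height split.

    Hypothetical helper: pad `height` up to the next multiple of
    `world_size`, so every rank receives an equally sized shard; the
    `pad` rows are cropped again after the all-gather.
    """
    pad = (-height) % world_size
    return (height + pad) // world_size, pad

# e.g. 90 rows across 4 GPUs -> 23 rows per rank, 2 rows of padding
per_rank, pad = padded_split(90, 4)
assert (per_rank, pad) == (23, 2)
```

The subtlety the comment above refers to is that padded rows can leak into the convolution through the halo exchange, so the padding must be placed (and cropped) carefully to avoid perturbing the rows near the global bottom edge.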

@mickqian
Collaborator

@Songrui625 appreciate your continuous work!

@Songrui625 Songrui625 force-pushed the vae-parallel-decode branch 2 times, most recently from 3b7eb4d to 9c2c6cf on February 4, 2026
@Songrui625
Author

@Songrui625 appreciate your continuous work!

Hi Mick, I have pushed the unit test; please review again. To be clear, the distributed version of the convolution may have some rounding error compared to the non-distributed one, so I set atol and rtol to different values depending on the data type.

As the rounding errors accumulate, the distributed version of WanDecoder3d may show a larger difference from the non-distributed one. I think this is acceptable, so I set atol and rtol to 5e-2 when the data type is torch.bfloat16. CC @zyksir

@Songrui625
Author

This PR is ready for review again.

@Songrui625
Author

I had pushed comprehensive unit tests. It's disappointing that this PR was left hanging and eventually discarded by the reviewers. Even more frustrating is the feeling that my contributions might be overlooked because I'm not an official member of the team.

@Songrui625 Songrui625 closed this Feb 10, 2026
@BBuf
Collaborator

BBuf commented Feb 11, 2026

I had pushed comprehensive unit tests. It's disappointing that this PR was left hanging and eventually discarded by the reviewers. Even more frustrating is the feeling that my contributions might be overlooked because I'm not an official member of the team.

Hi, I'm sorry about this. The situation is that we're currently working on a technical report and have incorporated the changes from this PR, which is currently blocking a version release. So, we merged that PR while giving you proper credit.

It wasn't because you're not part of the Diffusion team (you are actually). You can see that we've separately acknowledged you in this blog post: lm-sys/lm-sys.github.io#310.

8d5ff62d-766c-49a4-a077-416885c3c5a0

Regarding your tests, we can resubmit a new PR and merge it into the main branch. Thank you very much for your contribution. Once again, I apologize for any inconvenience caused.


Labels

diffusion SGLang Diffusion


Development

Successfully merging this pull request may close these issues.

diffusion, parallelism: VAE Decode Parallel

6 participants