spmex commented Dec 20, 2025

Summary:

Context

Mixing SSD offloading and non-SSD TBE with VBE enabled can cause a ~30% QPS regression compared to the baseline (HBM+UVM) for the Jupiter V1 model. Investigation shows that the regression is due to the inefficient tensor splits in _merge_variable_batch_embeddings(). This function is invoked only when:

  1. VBE is enabled in the input KJT, and
  2. different types of TBEs are involved in the embedding lookup, so that the output needs to be split, permuted, and merged.

The second step above happens on the CPU, which leaves the GPU idle for a significant period of time.
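To make the cost concrete, here is a rough, purely illustrative sketch (not the actual _merge_variable_batch_embeddings() code) of the kind of per-feature splitting and interleaving such a merge performs; with variable batch sizes it produces many small slices, and doing this on the CPU stalls the GPU:

```python
def merge_variable_batch(outputs, batch_sizes_per_tbe):
    """Interleave per-feature slices from several TBE outputs.

    outputs[i] is the flat output list of TBE i; batch_sizes_per_tbe[i]
    lists TBE i's per-feature slice lengths (variable batch sizes).
    Each feature requires a small slice from every TBE, so the merge
    degenerates into many tiny split/copy operations.
    """
    merged = []
    cursors = [0] * len(outputs)  # read position inside each TBE's output
    num_features = len(batch_sizes_per_tbe[0])
    for f in range(num_features):
        for i, out in enumerate(outputs):
            n = batch_sizes_per_tbe[i][f]
            merged.extend(out[cursors[i]:cursors[i] + n])
            cursors[i] += n
    return merged
```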

To mitigate the regression, we propose a different path that works around the inefficient splits and merges the output at the FBGEMM op level:

  • Pre-allocate a 1D tensor to hold the output of all the different TBEs.
  • Calculate the embedding offsets for the TBEs so that each one knows where to place its embedding lookup results.
  • Each FBGEMM TBE takes its own input KJT and places the resulting embeddings at the correct positions of the output tensor, based on the offsets.
  • Chain the multiple TBE lookups so that autograd can correctly propagate the gradients in the backward pass.
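The first three steps can be sketched as follows. This is a hedged, minimal illustration in plain Python: the helper names compute_vbe_output_offsets and merged_lookup are invented here, and the real implementation operates on CUDA tensors inside the FBGEMM ops rather than Python lists:

```python
def compute_vbe_output_offsets(tbe_output_sizes):
    """Prefix-sum the per-TBE flattened output sizes into start offsets."""
    offsets = [0]
    for size in tbe_output_sizes:
        offsets.append(offsets[-1] + size)
    return offsets

def merged_lookup(tbe_outputs):
    """Write each TBE's flattened result into its slice of one pre-allocated
    1D buffer, avoiding the per-feature CPU splits/permutes/merges entirely."""
    sizes = [len(out) for out in tbe_outputs]
    offsets = compute_vbe_output_offsets(sizes)
    vbe_output = [0.0] * offsets[-1]  # stands in for the pre-allocated tensor
    for out, start in zip(tbe_outputs, offsets):
        vbe_output[start:start + len(out)] = out
    return vbe_output, offsets
```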

Changes

GroupedPooledEmbeddingsLookup:

  • Added logic to preallocate a 1D tensor, create the offsets, and pass them to the TBE instances (BaseBatchedEmbeddingBag and its child classes).
  • Refactored the forward function for better readability, and preserved the original path without pre-allocation.

KeyValueModelParallelTest:

  • Updated the baseline test (test_ssd_mixed_kernels_with_vbe) by assigning different sets of optimizer parameters to different dtypes, so that FP32 is tested with higher precision and FP16 with better numerical stability.

BaseBatchedEmbeddingBag and its child classes:

  • Updated the forward method to take two new optional parameters: vbe_output and vbe_output_offsets.
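As a rough illustration of how the new optional parameters might be consumed (this is not the actual BaseBatchedEmbeddingBag code; the lookup and slicing below are simplified stand-ins for the real FBGEMM kernels):

```python
class BatchedEmbeddingBagSketch:
    """Illustrative stand-in for a batched embedding bag that accepts the
    new optional VBE parameters; the real module operates on tensors."""

    def __init__(self, table):
        self.table = table  # maps feature id -> embedding value

    def forward(self, feature_ids, vbe_output=None, vbe_output_offsets=None):
        looked_up = [self.table[i] for i in feature_ids]  # flattened lookup
        if vbe_output is None or vbe_output_offsets is None:
            # Original path: the caller splits/permutes/merges the output later.
            return looked_up
        # New path: write directly into the pre-allocated 1D output buffer
        # at this TBE's assigned offsets.
        start, end = vbe_output_offsets
        vbe_output[start:end] = looked_up
        return vbe_output
```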

NOTE: QuantBatchedEmbeddingBag, used for inference, is a child class of BaseBatchedEmbeddingBag, so its forward() is also affected. Technically it should not have these VBE parameters; extra abstractions are needed to fully correct this.

Reviewed By: TroyGarden

Differential Revision: D67681796


meta-codesync bot commented Dec 20, 2025

@spmex has exported this pull request. If you are a Meta employee, you can view the originating Diff in D67681796.

spmex pushed a commit to spmex/torchrec that referenced this pull request Dec 23, 2025