spmex commented Dec 20, 2025

Summary:

Context

Mixing SSD offloading and non-SSD TBE with VBE enabled can cause a ~30% QPS regression compared to the baseline (HBM+UVM) for the Jupiter V1 model. Investigation shows that the regression is due to the inefficient tensor splits in _merge_variable_batch_embeddings(). This function is invoked only when:

  1. VBE is enabled in the input KJT, and
  2. different types of TBEs are involved in the embedding lookup, so that the output needs to be split, permuted, and merged.

The second step above happens on the CPU, which leaves the GPU idle for a significant period of time.
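To make the cost concrete, here is a rough, purely illustrative sketch (not the actual _merge_variable_batch_embeddings() code) of the kind of per-feature splitting and interleaving such a merge performs; with variable batch sizes it produces many small slices, and doing this on the CPU stalls the GPU:

```python
def merge_variable_batch(outputs, batch_sizes_per_tbe):
    """Interleave per-feature slices from several TBE outputs.

    outputs[i] is the flat output list of TBE i; batch_sizes_per_tbe[i]
    lists TBE i's per-feature slice lengths (variable batch sizes).
    Each feature requires a small slice from every TBE, so the merge
    degenerates into many tiny split/copy operations.
    """
    merged = []
    cursors = [0] * len(outputs)  # read position inside each TBE's output
    num_features = len(batch_sizes_per_tbe[0])
    for f in range(num_features):
        for i, out in enumerate(outputs):
            n = batch_sizes_per_tbe[i][f]
            merged.extend(out[cursors[i]:cursors[i] + n])
            cursors[i] += n
    return merged
```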

To mitigate the regression, we propose a different path that works around the inefficient splits and merges the output at the FBGEMM op level:

  • Pre-allocate a 1D tensor to hold the output of all the different TBEs.
  • Calculate the embedding offsets for the TBEs so that each one knows where to place its embedding lookup results.
  • Each FBGEMM TBE takes its own input KJT and places the resulting embeddings at the correct positions of the output tensor, based on the offsets.
  • Chain the multiple TBE lookups so that autograd can correctly propagate the gradients in the backward pass.
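The first three steps can be sketched as follows. This is a hedged, minimal illustration in plain Python: the helper names compute_vbe_output_offsets and merged_lookup are invented here, and the real implementation operates on CUDA tensors inside the FBGEMM ops rather than Python lists:

```python
def compute_vbe_output_offsets(tbe_output_sizes):
    """Prefix-sum the per-TBE flattened output sizes into start offsets."""
    offsets = [0]
    for size in tbe_output_sizes:
        offsets.append(offsets[-1] + size)
    return offsets

def merged_lookup(tbe_outputs):
    """Write each TBE's flattened result into its slice of one pre-allocated
    1D buffer, avoiding the per-feature CPU splits/permutes/merges entirely."""
    sizes = [len(out) for out in tbe_outputs]
    offsets = compute_vbe_output_offsets(sizes)
    vbe_output = [0.0] * offsets[-1]  # stands in for the pre-allocated tensor
    for out, start in zip(tbe_outputs, offsets):
        vbe_output[start:start + len(out)] = out
    return vbe_output, offsets
```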

Changes

GroupedPooledEmbeddingsLookup:

  • Added logic to preallocate a 1D tensor, create the offsets, and pass them to the TBE instances (BaseBatchedEmbeddingBag and its child classes).
  • Refactored the forward function for better readability, and preserved the original path without pre-allocation.

KeyValueModelParallelTest:

  • Updated the baseline test (test_ssd_mixed_kernels_with_vbe) by assigning different sets of optimizer parameters to different dtypes, so that FP32 is tested with higher precision and FP16 with better numerical stability.

BaseBatchedEmbeddingBag and its child classes:

  • Updated the forward method to take two new optional parameters: vbe_output and vbe_output_offsets.
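As a rough illustration of how the new optional parameters might be consumed (this is not the actual BaseBatchedEmbeddingBag code; the lookup and slicing below are simplified stand-ins for the real FBGEMM kernels):

```python
class BatchedEmbeddingBagSketch:
    """Illustrative stand-in for a batched embedding bag that accepts the
    new optional VBE parameters; the real module operates on tensors."""

    def __init__(self, table):
        self.table = table  # maps feature id -> embedding value

    def forward(self, feature_ids, vbe_output=None, vbe_output_offsets=None):
        looked_up = [self.table[i] for i in feature_ids]  # flattened lookup
        if vbe_output is None or vbe_output_offsets is None:
            # Original path: the caller splits/permutes/merges the output later.
            return looked_up
        # New path: write directly into the pre-allocated 1D output buffer
        # at this TBE's assigned offsets.
        start, end = vbe_output_offsets
        vbe_output[start:end] = looked_up
        return vbe_output
```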

NOTE: QuantBatchedEmbeddingBag, used for inference, is a child class of BaseBatchedEmbeddingBag, so its forward() is also affected. Technically it should not have these VBE parameters; extra abstractions are needed to fully correct this.

Reviewed By: TroyGarden

Differential Revision: D67681796


meta-codesync bot commented Dec 20, 2025

@spmex has exported this pull request. If you are a Meta employee, you can view the originating Diff in D67681796.

spmex pushed a commit to spmex/torchrec that referenced this pull request Dec 23, 2025