The following table provides a detailed breakdown of all MXFP8 kernels used in the forward and backward passes, with shapes representative of **DeepSeekV3 671B** (dim=7168, hidden_dim=2048, total_tokens=131072, groups=8, block_size=32).
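As context for the scale shapes in the table: MXFP8 produces one scale per `block_size=32` elements along the scaling dimension, so scale tensor shapes follow mechanically from the input shapes. A minimal sketch of this derivation, assuming the DeepSeekV3 671B shapes above (the helper name is illustrative, not part of the repo):

```python
# Hypothetical helper (not from the repo) showing how MXFP8 scale tensor
# shapes follow from input shapes: the scaling dimension is divided by
# block_size=32, one scale per 32-element block.
dim, hidden_dim, total_tokens, block_size = 7168, 2048, 131072, 32

def mxfp8_scale_shape(shape, scaled_dim, block_size=32):
    """Shape of the per-block scale tensor for an input of `shape`,
    scaled along `scaled_dim` in blocks of `block_size` elements."""
    out = list(shape)
    assert out[scaled_dim] % block_size == 0
    out[scaled_dim] //= block_size
    return tuple(out)

# e.g. activations of shape (total_tokens, dim) scaled along dim:
print(mxfp8_scale_shape((total_tokens, dim), scaled_dim=1))  # (131072, 224)
```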
Benchmark results were measured on an **NVIDIA B200** GPU at an 80% power limit (peak BW: 5888 GB/s, peak MXFP8: 2808 TFLOPS).
- **Efficiency** is reported as a percentage of peak achievable bandwidth (for memory-bound quantization kernels) or a percentage of peak TFLOPS (for compute-bound GEMM kernels); see the worked example after this list.
- Scale rearrangement kernels are neither conventionally memory-bandwidth-bound nor compute-bound, so we report absolute runtime only
- Scale tensor shapes are derived from input shapes divided by `block_size=32` along the scaling dimension
- Detailed kernel breakdown with timing for all kernels is available in the roofline plots above (generated by `roofline_unified.py`)
- All kernels can be benchmarked individually using the scripts in `benchmarks/prototype/moe_training/mxfp8/`
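To make the efficiency definition concrete, here is a hedged sketch of the calculation using the B200 peaks quoted above; the kernel numbers in the usage example are illustrative placeholders, not measured results:

```python
# Sketch of the efficiency calculation described in the notes above.
PEAK_BW_GBPS = 5888.0       # B200 peak bandwidth at 80% power limit
PEAK_MXFP8_TFLOPS = 2808.0  # B200 peak MXFP8 throughput

def bandwidth_efficiency(bytes_moved, runtime_s):
    """Memory-bound kernels: achieved GB/s as a fraction of peak BW."""
    achieved_gbps = bytes_moved / runtime_s / 1e9
    return achieved_gbps / PEAK_BW_GBPS

def compute_efficiency(flops, runtime_s):
    """Compute-bound kernels: achieved TFLOPS as a fraction of peak."""
    achieved_tflops = flops / runtime_s / 1e12
    return achieved_tflops / PEAK_MXFP8_TFLOPS

# e.g. a quantization kernel moving 2 GB in 500 us (placeholder numbers):
print(f"{bandwidth_efficiency(2e9, 500e-6):.1%}")  # ~67.9% of peak BW
```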