Refactor blockwise reduce lowering with DPP SubgroupReduceOp #2301
Open
stefankoncarevic wants to merge 6 commits into ROCm:develop from
Conversation
…per extraction

Restructure the blockwise reduce rewrite pattern in BlockwiseGemmToThreadwise.cpp to improve clarity, maintainability, and enable DPP-based reductions via gpu.SubgroupReduceOp.

Shuffle decision logic:
- Introduce has2DThreadLayout guard (mTidPerWave > 0 && nTidPerWave > 0) to clearly separate GEMM-style 2D thread layouts from general cases
- Path 1 (Shuffle+DPP): activates when blockSize > nrDimProduct and the per-thread subtile is [1,1] with rDim == 1, using gpu.shuffle to transpose data from the WMMA/MFMA strided layout into a contiguous DPP-compatible layout
- Path 2 (Serial XOR): activates when blockSize <= nrDimProduct, performing log2(rDim) XOR butterfly reduction steps within a wave at stride nTidPerWave
- Initial LDS store is deferred: only performed when neither shuffle path applies, avoiding unnecessary LDS traffic for shuffle-eligible configurations

Parallel reduction with DPP:
- Use gpu.SubgroupReduceOp with cluster_size for DPP-eligible reductions (power-of-2 active threads, cluster_size <= waveSize)
- Only the reduction group leader (rtid == 0) writes the result back to LDS, followed by a barrier and broadcast read
- Use bitwise AND/SHRU for thread ID decomposition (rtid, nrtid) on the DPP path and for power-of-2 non-reduction dimensions; fall back to DIV/REM for non-power-of-2 cases
- Force scalar accumulation (vectorLen = 1) during threadwise pre-reduction on the DPP path to ensure correct element-wise reduction before SubgroupReduceOp

Helper extraction:
- getPerWaveThreadCounts: promote to static member function; extracts m_tid and n_tid counts from the tid slice view's Merge transform
- shuffleRearrangeForDPP: encapsulates the gpu.shuffle-based transposition from strided WMMA/MFMA layout to contiguous DPP layout (sourceLane = (lane % clusterSize) * stride + lane / clusterSize)
- readReducedResultsFromLDS: consolidates the repeated pattern of barrier + ThreadwiseReadInto from LDS into output registers (and optional extra output)

Tree reduction path:
- Retained as fallback for non-DPP-eligible configurations (non-power-of-2 thread counts or cluster_size > waveSize)
- Scope ceilPowerOf2 computation and treeMaxActiveThreads naming to this path

New test: blockwise_reduce_dpp_cluster_sizes.mlir
- Integration test covering DPP reduction with cluster sizes 2, 4, 8, 16, 32, 64
- Validates both sum (rand=none, all ones) and max (rand=fixed) reductions
- All test configurations use blockSize <= waveSize to ensure single-wave execution on both RDNA (waveSize=32) and CDNA (waveSize=64)
- cluster_size=64 falls back to tree reduction on RDNA since 64 > waveSize=32
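The XOR butterfly in Path 2 can be simulated on the host. This is an illustrative sketch only, not the MLIR lowering: it models gpu.shuffle xor (each lane reads from lane ^ mask) and uses contiguous groups of rDim lanes for simplicity, whereas the actual lowering strides the butterfly at nTidPerWave.

```python
# Host-side simulation of the log2(rDim) XOR butterfly reduction.
# vals[lane] models the per-lane partial result.
def shuffle_xor(vals, mask):
    # Model of gpu.shuffle xor: each lane reads the value of lane ^ mask.
    return [vals[lane ^ mask] for lane in range(len(vals))]

def xor_butterfly_reduce(vals, r_dim, op=lambda a, b: a + b):
    # log2(r_dim) steps; afterwards every lane in a group of r_dim
    # consecutive lanes holds the full reduction of that group.
    stride = 1
    while stride < r_dim:
        partner = shuffle_xor(vals, stride)
        vals = [op(v, p) for v, p in zip(vals, partner)]
        stride *= 2
    return vals

wave = list(range(8))                      # one value per lane: 0..7
reduced = xor_butterfly_reduce(wave, r_dim=4)
# lanes 0..3 each hold 0+1+2+3 = 6, lanes 4..7 each hold 4+5+6+7 = 22
```

The same loop works for max reductions by passing op=max, which matches the sum/max pair exercised by the new integration test.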
Contributor
Pull request overview
This PR refactors the Rock BlockwiseBroadcastReduceOp lowering to use gpu.subgroup_reduce (with clustered reductions where applicable) and adds a Rock backend pass to lower gpu.subgroup_reduce to AMD DPP instructions, improving inter-thread reduction performance on supported architectures.
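For reference, a clustered gpu.subgroup_reduce reduces each group of cluster_size lanes independently, with every lane in a cluster receiving that cluster's result. A minimal host-side model of that semantics (assuming contiguous clusters, i.e. the default cluster stride of 1):

```python
def clustered_subgroup_reduce(vals, cluster_size, op=lambda a, b: a + b):
    # Every lane receives the reduction of its own cluster of
    # cluster_size contiguous lanes, mirroring gpu.subgroup_reduce
    # with a cluster_size attribute (cluster stride 1 assumed).
    out = []
    for lane in range(len(vals)):
        base = (lane // cluster_size) * cluster_size
        acc = vals[base]
        for v in vals[base + 1 : base + cluster_size]:
            acc = op(acc, v)
        out.append(acc)
    return out

# A wave of 8 lanes reduced in clusters of 4:
clustered_subgroup_reduce([1, 2, 3, 4, 5, 6, 7, 8], 4)
# → [10, 10, 10, 10, 26, 26, 26, 26]
```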
Changes:
- Update blockwise broadcast-reduce lowering to select between shuffle+DPP, serial XOR shuffle, and LDS tree fallback paths, with shared helper functions.
- Introduce the rock-subgroup-reduce-to-dpp pass and wire it into the backend pipeline before convert-gpu-to-rocdl.
- Extend/adjust tests and pipelines to cover the new DPP clustered reduction behavior and pass ordering.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Refactors reduction lowering, adds shuffle/DPP paths, helpers, and emits gpu.shuffle + gpu.subgroup_reduce. |
| mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp | New Rock pass to lower gpu.subgroup_reduce into AMD DPP sequences via GPU transform patterns. |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Adds the new pass source and links GPU transforms library. |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Inserts rock-subgroup-reduce-to-dpp into the backend pipeline after lowering affine. |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declares the new pass and its chip option. |
| mlir/include/mlir/Dialect/Rock/Passes.h | Adds the generated pass decl macro for the new pass. |
| mlir/test/rocmlir-driver/pipelines.mlir | Updates expected printed pipelines to include rock-subgroup-reduce-to-dpp{chip=...}. |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Updates lowering checks and parameterizes arch via token substitution. |
| mlir/test/Dialect/Rock/integration/reduce/blockwise_reduce/blockwise_reduce_dpp_cluster_sizes.mlir | New integration test covering multiple cluster_size cases and both sum/max reductions. |
Motivation
The BlockwiseBroadcastReduceOp lowering in BlockwiseGemmToThreadwise.cpp handles the reduction of partial results across threads within a workgroup. Previously, all inter-thread reductions used an LDS-based tree reduction loop that requires log2(N) barrier-synchronized LDS round-trips. This works correctly but leaves performance on the table for cases where hardware-accelerated subgroup (wave-level) reduction is available.
This PR restructures the lowering to:
- select among distinct reduction paths, with explicit conditions documented for each path
- use gpu.SubgroupReduceOp with cluster_size for DPP-eligible cases, replacing the tree reduction loop with a single hardware-accelerated operation
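The path selection can be summarized in pseudocode. This is a simplified host-side sketch of the conditions described in this PR (names mirror the source: has2DThreadLayout, blockSize, nrDimProduct, rDim, mTidPerWave), not the exact C++ implementation:

```python
def pick_reduction_path(block_size, nr_dim_product, r_dim,
                        m_tid_per_wave, n_tid_per_wave,
                        subtile_is_1x1):
    # Guard: GEMM-style 2D thread layout (the has2DThreadLayout check).
    has_2d_thread_layout = m_tid_per_wave > 0 and n_tid_per_wave > 0
    if has_2d_thread_layout:
        # Path 1: shuffle-transpose into DPP layout + gpu.subgroup_reduce.
        if block_size > nr_dim_product and subtile_is_1x1 and r_dim == 1:
            return "shuffle+dpp"
        # Path 2: log2(rDim) XOR butterfly within a wave.
        is_pow2 = r_dim > 0 and (r_dim & (r_dim - 1)) == 0
        if block_size <= nr_dim_product and is_pow2 and r_dim == m_tid_per_wave:
            return "serial-xor"
    # Path 3: store to LDS, then subgroup_reduce or the tree fallback.
    return "lds-fallback"
```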
Technical Details
Three reduction paths
The lowering now selects one of three paths based on thread layout and dimensions:
Path 1 — Shuffle + DPP (canUseShuffleOptimization):
- Condition: blockSize > nrDimProduct and the per-thread subtile is [1,1]
- gpu.shuffle IDX transposes data from the strided WMMA/MFMA layout to a contiguous DPP layout → gpu.SubgroupReduceOp → leader writes to LDS → broadcast
- Covers cluster sizes up to mTidPerWave (typically 16 on RDNA WMMA, 32 on CDNA MFMA)

Path 2 — Serial XOR (canUseSerialShuffle):
- Condition: blockSize <= nrDimProduct, power-of-2 rDim, rDim == mTidPerWave
- log2(rDim) XOR butterfly shuffle steps within a wave → store to LDS → broadcast
- Covers cluster_size == mTidPerWave: on RDNA (WMMA, mTid typically 2) this is cluster_size=2; on CDNA (MFMA, mTid typically 2 or 4) this is cluster_size=2 and cluster_size=4

Path 3 — LDS (fallback):
- Store to LDS → threadwise pre-reduce, then:
- if cluster_size is power-of-2 and <= waveSize: gpu.SubgroupReduceOp → leader write → broadcast
- otherwise: log2(N) LDS tree reduction loop → broadcast

DPP vs Tree thread layout
- DPP: a single gpu.SubgroupReduceOp; only the reduction group leader (rtid == 0) writes back to LDS
- Tree: log2(N) LDS reduction loop

Helper functions extracted
- shuffleRearrangeForDPP: gpu.shuffle IDX transposition logic (sourceLane = (lane % cluster) * stride + lane / cluster)
- readReducedResultsFromLDS: barrier + ThreadwiseReadInto from LDS → output registers (eliminates ~20 lines of duplication per call site)
- getPerWaveThreadCounts: promoted to static member; extracts m_tid/n_tid from the tid slice view's Merge transform
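The sourceLane formula used by shuffleRearrangeForDPP is an index transpose of a clusterSize × stride grid of lanes. A small host-side check (illustrative only) shows it is a permutation, i.e. every lane sources from a distinct lane, so values that were stride-separated become contiguous:

```python
def source_lane(lane, cluster_size, stride):
    # sourceLane = (lane % clusterSize) * stride + lane / clusterSize
    return (lane % cluster_size) * stride + lane // cluster_size

def mapping(cluster_size, stride):
    # The full lane → sourceLane map over cluster_size * stride lanes.
    n = cluster_size * stride
    return [source_lane(l, cluster_size, stride) for l in range(n)]

# For cluster_size=4, stride=8 over 32 lanes the map is a permutation:
m = mapping(4, 8)
assert sorted(m) == list(range(32))
```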
Test Plan
- lit test suite for reduce/blockwise_reduce/
- New blockwise_reduce_dpp_cluster_sizes.mlir test covers cluster sizes 2, 4, 8, 16, 32, 64 with both sum and max reductions
Test Result
Submission Checklist