Refactor blockwise reduce lowering with DPP SubgroupReduceOp #2301
Open
stefankoncarevic wants to merge 6 commits into ROCm:develop from
Conversation
…per extraction

Restructure the blockwise reduce rewrite pattern in BlockwiseGemmToThreadwise.cpp to improve clarity, maintainability, and enable DPP-based reductions via gpu.SubgroupReduceOp.

Shuffle decision logic:
- Introduce has2DThreadLayout guard (mTidPerWave > 0 && nTidPerWave > 0) to clearly separate GEMM-style 2D thread layouts from general cases
- Path 1 (Shuffle+DPP): activates when blockSize > nrDimProduct and the per-thread subtile is [1,1] with rDim == 1, using gpu.shuffle to transpose data from the WMMA/MFMA strided layout into a contiguous DPP-compatible layout
- Path 2 (Serial XOR): activates when blockSize <= nrDimProduct, performing log2(rDim) XOR butterfly reduction steps within a wave at stride nTidPerWave
- Initial LDS store is deferred: only performed when neither shuffle path applies, avoiding unnecessary LDS traffic for shuffle-eligible configurations

Parallel reduction with DPP:
- Use gpu.SubgroupReduceOp with cluster_size for DPP-eligible reductions (power-of-2 active threads, cluster_size <= waveSize)
- Only the reduction group leader (rtid == 0) writes the result back to LDS, followed by a barrier and broadcast read
- Use bitwise AND/SHRU for thread ID decomposition (rtid, nrtid) on the DPP path and for power-of-2 non-reduction dimensions; fall back to DIV/REM for non-power-of-2 cases
- Force scalar accumulation (vectorLen = 1) during threadwise pre-reduction on the DPP path to ensure correct element-wise reduction before SubgroupReduceOp

Helper extraction:
- getPerWaveThreadCounts: promote to static member function; extracts m_tid and n_tid counts from the tid slice view's Merge transform
- shuffleRearrangeForDPP: encapsulates the gpu.shuffle-based transposition from strided WMMA/MFMA layout to contiguous DPP layout (sourceLane = (lane % clusterSize) * stride + lane / clusterSize)
- readReducedResultsFromLDS: consolidates the repeated pattern of barrier + ThreadwiseReadInto from LDS into output registers (and optional extra output)

Tree reduction path:
- Retained as fallback for non-DPP-eligible configurations (non-power-of-2 thread counts or cluster_size > waveSize)
- Scope ceilPowerOf2 computation and treeMaxActiveThreads naming to this path

New test: blockwise_reduce_dpp_cluster_sizes.mlir
- Integration test covering DPP reduction with cluster sizes 2, 4, 8, 16, 32, 64
- Validates both sum (rand=none, all ones) and max (rand=fixed) reductions
- All test configurations use blockSize <= waveSize to ensure single-wave execution on both RDNA (waveSize=32) and CDNA (waveSize=64)
- cluster_size=64 falls back to tree reduction on RDNA since 64 > waveSize=32
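The XOR butterfly in Path 2 can be simulated on the host. This is an illustrative sketch only, not the MLIR lowering: it models gpu.shuffle xor (each lane reads from lane ^ mask) and uses contiguous groups of rDim lanes for simplicity, whereas the actual lowering strides the butterfly at nTidPerWave.

```python
# Host-side simulation of the log2(rDim) XOR butterfly reduction.
# vals[lane] models the per-lane partial result.
def shuffle_xor(vals, mask):
    # Model of gpu.shuffle xor: each lane reads the value of lane ^ mask.
    return [vals[lane ^ mask] for lane in range(len(vals))]

def xor_butterfly_reduce(vals, r_dim, op=lambda a, b: a + b):
    # log2(r_dim) steps; afterwards every lane in a group of r_dim
    # consecutive lanes holds the full reduction of that group.
    stride = 1
    while stride < r_dim:
        partner = shuffle_xor(vals, stride)
        vals = [op(v, p) for v, p in zip(vals, partner)]
        stride *= 2
    return vals

wave = list(range(8))                      # one value per lane: 0..7
reduced = xor_butterfly_reduce(wave, r_dim=4)
# lanes 0..3 each hold 0+1+2+3 = 6, lanes 4..7 each hold 4+5+6+7 = 22
```

The same loop works for max reductions by passing op=max, which matches the sum/max pair exercised by the new integration test.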
Contributor
Pull request overview
This PR refactors the Rock BlockwiseBroadcastReduceOp lowering to use gpu.subgroup_reduce (with clustered reductions where applicable) and adds a Rock backend pass to lower gpu.subgroup_reduce to AMD DPP instructions, improving inter-thread reduction performance on supported architectures.
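For reference, a clustered gpu.subgroup_reduce reduces each group of cluster_size lanes independently, with every lane in a cluster receiving that cluster's result. A minimal host-side model of that semantics (assuming contiguous clusters, i.e. the default cluster stride of 1):

```python
def clustered_subgroup_reduce(vals, cluster_size, op=lambda a, b: a + b):
    # Every lane receives the reduction of its own cluster of
    # cluster_size contiguous lanes, mirroring gpu.subgroup_reduce
    # with a cluster_size attribute (cluster stride 1 assumed).
    out = []
    for lane in range(len(vals)):
        base = (lane // cluster_size) * cluster_size
        acc = vals[base]
        for v in vals[base + 1 : base + cluster_size]:
            acc = op(acc, v)
        out.append(acc)
    return out

# A wave of 8 lanes reduced in clusters of 4:
clustered_subgroup_reduce([1, 2, 3, 4, 5, 6, 7, 8], 4)
# → [10, 10, 10, 10, 26, 26, 26, 26]
```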
Changes:
- Update blockwise broadcast-reduce lowering to select between shuffle+DPP, serial XOR shuffle, and LDS tree fallback paths, with shared helper functions.
- Introduce the rock-subgroup-reduce-to-dpp pass and wire it into the backend pipeline before convert-gpu-to-rocdl.
- Extend/adjust tests and pipelines to cover the new DPP clustered reduction behavior and pass ordering.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Refactors reduction lowering, adds shuffle/DPP paths, helpers, and emits gpu.shuffle + gpu.subgroup_reduce. |
| mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp | New Rock pass to lower gpu.subgroup_reduce into AMD DPP sequences via GPU transform patterns. |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Adds the new pass source and links GPU transforms library. |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Inserts rock-subgroup-reduce-to-dpp into the backend pipeline after lowering affine. |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declares the new pass and its chip option. |
| mlir/include/mlir/Dialect/Rock/Passes.h | Adds the generated pass decl macro for the new pass. |
| mlir/test/rocmlir-driver/pipelines.mlir | Updates expected printed pipelines to include rock-subgroup-reduce-to-dpp{chip=...}. |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Updates lowering checks and parameterizes arch via token substitution. |
| mlir/test/Dialect/Rock/integration/reduce/blockwise_reduce/blockwise_reduce_dpp_cluster_sizes.mlir | New integration test covering multiple cluster_size cases and both sum/max reductions. |
Motivation
The BlockwiseBroadcastReduceOp lowering in BlockwiseGemmToThreadwise.cpp handles the reduction of partial results across threads within a workgroup. Previously, all inter-thread reductions used an LDS-based tree reduction loop that requires log2(N) barrier-synchronized LDS round-trips. This works correctly but leaves performance on the table for cases where hardware-accelerated subgroup (wave-level) reduction is available.
This PR restructures the lowering to:
- select among distinct reduction paths, with explicit conditions documented for each path
- use gpu.SubgroupReduceOp with cluster_size for DPP-eligible cases, replacing the tree reduction loop with a single hardware-accelerated operation
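The path selection can be summarized in pseudocode. This is a simplified host-side sketch of the conditions described in this PR (names mirror the source: has2DThreadLayout, blockSize, nrDimProduct, rDim, mTidPerWave), not the exact C++ implementation:

```python
def pick_reduction_path(block_size, nr_dim_product, r_dim,
                        m_tid_per_wave, n_tid_per_wave,
                        subtile_is_1x1):
    # Guard: GEMM-style 2D thread layout (the has2DThreadLayout check).
    has_2d_thread_layout = m_tid_per_wave > 0 and n_tid_per_wave > 0
    if has_2d_thread_layout:
        # Path 1: shuffle-transpose into DPP layout + gpu.subgroup_reduce.
        if block_size > nr_dim_product and subtile_is_1x1 and r_dim == 1:
            return "shuffle+dpp"
        # Path 2: log2(rDim) XOR butterfly within a wave.
        is_pow2 = r_dim > 0 and (r_dim & (r_dim - 1)) == 0
        if block_size <= nr_dim_product and is_pow2 and r_dim == m_tid_per_wave:
            return "serial-xor"
    # Path 3: store to LDS, then subgroup_reduce or the tree fallback.
    return "lds-fallback"
```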
Technical Details
Three reduction paths
The lowering now selects one of three paths based on thread layout and dimensions:
Path 1 — Shuffle + DPP (canUseShuffleOptimization):
- Condition: blockSize > nrDimProduct and the per-thread subtile is [1,1]
- gpu.shuffle IDX transposes data from the strided WMMA/MFMA layout to a contiguous DPP layout → gpu.SubgroupReduceOp → leader writes to LDS → broadcast
- Covers cluster sizes up to mTidPerWave (typically 16 on RDNA WMMA, 32 on CDNA MFMA)

Path 2 — Serial XOR (canUseSerialShuffle):
- Condition: blockSize <= nrDimProduct, power-of-2 rDim, rDim == mTidPerWave
- log2(rDim) XOR butterfly shuffle steps within a wave → store to LDS → broadcast
- Covers cluster_size == mTidPerWave: on RDNA (WMMA, mTid typically 2) this is cluster_size=2; on CDNA (MFMA, mTid typically 2 or 4) this is cluster_size=2 and cluster_size=4

Path 3 — LDS (fallback):
- Store to LDS → threadwise pre-reduce, then:
- if cluster_size is power-of-2 and <= waveSize: gpu.SubgroupReduceOp → leader write → broadcast
- otherwise: log2(N) LDS tree reduction loop → broadcast

DPP vs Tree thread layout
- DPP: a single gpu.SubgroupReduceOp; only the reduction group leader (rtid == 0) writes back to LDS
- Tree: log2(N) LDS reduction loop

Helper functions extracted
- shuffleRearrangeForDPP: gpu.shuffle IDX transposition logic (sourceLane = (lane % cluster) * stride + lane / cluster)
- readReducedResultsFromLDS: barrier + ThreadwiseReadInto from LDS → output registers (eliminates ~20 lines of duplication per call site)
- getPerWaveThreadCounts: promoted to static member; extracts m_tid/n_tid from the tid slice view's Merge transform
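The sourceLane formula used by shuffleRearrangeForDPP is an index transpose of a clusterSize × stride grid of lanes. A small host-side check (illustrative only) shows it is a permutation, i.e. every lane sources from a distinct lane, so values that were stride-separated become contiguous:

```python
def source_lane(lane, cluster_size, stride):
    # sourceLane = (lane % clusterSize) * stride + lane / clusterSize
    return (lane % cluster_size) * stride + lane // cluster_size

def mapping(cluster_size, stride):
    # The full lane → sourceLane map over cluster_size * stride lanes.
    n = cluster_size * stride
    return [source_lane(l, cluster_size, stride) for l in range(n)]

# For cluster_size=4, stride=8 over 32 lanes the map is a permutation:
m = mapping(4, 8)
assert sorted(m) == list(range(32))
```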
Test Plan
- lit test suite for reduce/blockwise_reduce/
- New blockwise_reduce_dpp_cluster_sizes.mlir test covers cluster sizes 2, 4, 8, 16, 32, 64 with both sum and max reductions
Test Result
Submission Checklist