
Refactor blockwise reduce lowering with DPP SubgroupReduceOp #2301

Open
stefankoncarevic wants to merge 6 commits into ROCm:develop from stefankoncarevic:dpp-refactor-blockwise-reduce

Conversation

@stefankoncarevic
Contributor

Motivation

The BlockwiseBroadcastReduceOp lowering in BlockwiseGemmToThreadwise.cpp handles
the reduction of partial results across threads within a workgroup. Previously, all
inter-thread reductions used an LDS-based tree reduction loop requiring
log2(N) barrier-synchronized LDS round-trips. This works correctly but leaves
performance on the table for cases where hardware-accelerated subgroup (wave-level)
reduction is available.
This PR restructures the lowering to:

  1. Clearly separate the three reduction strategies (Shuffle+DPP, Serial XOR, LDS Tree)
    with explicit conditions documented for each path
  2. Introduce gpu.SubgroupReduceOp with cluster_size for DPP-eligible cases,
    replacing the tree reduction loop with a single hardware-accelerated operation
  3. Extract repeated code patterns into helper functions for readability
  4. Ensure correct operation on both CDNA (waveSize=64) and RDNA (waveSize=32) architectures

Technical Details

Three reduction paths

The lowering now selects one of three paths based on thread layout and dimensions:
Path 1 — Shuffle + DPP (canUseShuffleOptimization):

  • When: 2D GEMM thread layout, blockSize > nrDimProduct, per-thread subtile is [1,1]
  • What: gpu.shuffle IDX transposes data from strided WMMA/MFMA layout to
    contiguous DPP layout → gpu.SubgroupReduceOp → leader writes to LDS → broadcast
  • Skips initial LDS store entirely (data stays in registers until after DPP)
  • cluster_size limit: <= mTidPerWave (typically 16 on RDNA WMMA, 32 on CDNA MFMA)

Path 2 — Serial XOR (canUseSerialShuffle):

  • When: 2D GEMM thread layout, blockSize <= nrDimProduct, power-of-2 rDim, rDim == mTidPerWave
  • What: log2(rDim) XOR butterfly shuffle steps within a wave → store to LDS → broadcast
  • cluster_size limit: Must exactly equal mTidPerWave — on RDNA (WMMA, mTid typically 2)
    this covers cluster_size=2; on CDNA (MFMA, mTid typically 2 or 4) this covers
    cluster_size=2 and cluster_size=4
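The XOR butterfly in Path 2 can be sketched in plain Python (a hypothetical model of the algorithm, not the actual MLIR lowering): each of the log2(rDim) steps reads from the lane whose ID differs in one bit, so after the loop every lane in a cluster holds the full cluster reduction.

```python
# Hypothetical Python model of the serial XOR butterfly (Path 2).
# A list stands in for the wave's registers; `stride` models the
# nTidPerWave lane stride mentioned in the PR text. Names are
# illustrative, not taken from the implementation.
def xor_butterfly_reduce(values, r_dim, stride=1):
    assert r_dim & (r_dim - 1) == 0, "rDim must be a power of two"
    vals = list(values)
    offset = 1
    while offset < r_dim:
        # models gpu.shuffle xor: lane i reads from lane i ^ (offset * stride)
        shuffled = [vals[i ^ (offset * stride)] for i in range(len(vals))]
        vals = [a + b for a, b in zip(vals, shuffled)]
        offset *= 2
    return vals

# lanes 0-3 and lanes 4-7 each form a cluster of rDim=4; every lane
# ends up holding its cluster's sum
print(xor_butterfly_reduce([1, 2, 3, 4, 5, 6, 7, 8], r_dim=4))
# -> [10, 10, 10, 10, 26, 26, 26, 26]
```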

Path 3 — LDS Tree (fallback):

  • When: Neither shuffle path applies
  • Subpath DPP: If cluster_size is power-of-2 and <= waveSize
    store to LDS → threadwise pre-reduce → gpu.SubgroupReduceOp → leader write → broadcast
  • Subpath Tree: Otherwise → store to LDS → log2(N) LDS tree reduction loop → broadcast
  • No cluster_size limit — handles any configuration including non-power-of-2
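For contrast with the shuffle paths, the fallback tree loop can be modeled as follows (a simplified sketch: list indexing stands in for LDS, and the barrier between iterations is implicit).

```python
import math

# Simplified model of the LDS tree-reduction fallback (Path 3).
# `lds` is a list standing in for the LDS buffer; each while-iteration
# corresponds to one barrier-separated step, so a length-N buffer takes
# log2(ceilPowerOf2(N)) round-trips, matching the PR's log2(N) description.
def lds_tree_reduce(lds, op=lambda a, b: a + b):
    n = len(lds)
    if n == 1:
        return lds[0]
    stride = 1 << (math.ceil(math.log2(n)) - 1)  # ceilPowerOf2(n) / 2
    while stride >= 1:
        for tid in range(stride):                # active threads this step
            if tid + stride < n:                 # guard for non-power-of-2 N
                lds[tid] = op(lds[tid], lds[tid + stride])
        # a workgroup barrier would separate iterations on real hardware
        stride //= 2
    return lds[0]
```

Note how this handles the non-power-of-2 case that disqualifies the DPP subpath: the bounds guard simply deactivates the out-of-range pairs.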

DPP vs Tree thread layout

             DPP path                   Tree path
Reduction    gpu.SubgroupReduceOp       log2(N) LDS loop
LDS writes   Leader only (rtid == 0)    All active threads per iteration

Helper functions extracted

  • shuffleRearrangeForDPP: gpu.shuffle IDX transposition logic
    (sourceLane = (lane % cluster) * stride + lane / cluster)
  • readReducedResultsFromLDS: barrier + ThreadwiseReadInto from LDS → output registers
    (eliminates ~20 lines of duplication per call site)
  • getPerWaveThreadCounts: promoted to static member; extracts m_tid/n_tid from
    the tid slice Merge transform
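As a sketch of the first helper, the transposition formula above can be modeled on a plain list (hypothetical function names; the real helper emits gpu.shuffle idx ops):

```python
# Hypothetical model of shuffleRearrangeForDPP: the idx-shuffle
# permutation that gathers stride-separated WMMA/MFMA elements into
# contiguous lanes, so a clustered subgroup reduce sees its operands
# as neighbours. The formula is taken from the PR text.
def dpp_source_lane(lane, cluster_size, stride):
    return (lane % cluster_size) * stride + lane // cluster_size

def shuffle_rearrange(values, cluster_size, stride):
    # lane i reads the value currently held by its computed source lane
    return [values[dpp_source_lane(i, cluster_size, stride)]
            for i in range(len(values))]

# 16 lanes, cluster_size=4, stride=4: lanes 0-3 now hold the elements
# that previously sat at lanes 0, 4, 8, 12
print(shuffle_rearrange(list(range(16)), cluster_size=4, stride=4))
# -> [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```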

Test Plan

  • All existing integration tests pass (lit test suite for reduce/blockwise_reduce/)
  • New blockwise_reduce_dpp_cluster_sizes.mlir test covers cluster sizes 2, 4, 8, 16, 32, 64
    with both sum and max reductions

Test Result

Submission Checklist

…per extraction

Restructure the blockwise reduce rewrite pattern in BlockwiseGemmToThreadwise.cpp
to improve clarity, maintainability, and enable DPP-based reductions via
gpu.SubgroupReduceOp.

Shuffle decision logic:
- Introduce has2DThreadLayout guard (mTidPerWave > 0 && nTidPerWave > 0) to
  clearly separate GEMM-style 2D thread layouts from general cases
- Path 1 (Shuffle+DPP): activates when blockSize > nrDimProduct and the
  per-thread subtile is [1,1] with rDim == 1, using gpu.shuffle to transpose
  data from WMMA/MFMA strided layout into contiguous DPP-compatible layout
- Path 2 (Serial XOR): activates when blockSize <= nrDimProduct, performing
  log2(rDim) XOR butterfly reduction steps within a wave at stride nTidPerWave
- Initial LDS store is deferred: only performed when neither shuffle path applies,
  avoiding unnecessary LDS traffic for shuffle-eligible configurations
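The decision logic above can be condensed into a simplified predicate (argument names mirror the quantities in the PR text; the real code checks additional conditions such as power-of-two rDim for the XOR path):

```python
# Simplified sketch of the shuffle decision logic described above.
# This is a hypothetical model: names and the exact condition set are
# approximations of the PR description, not the implementation.
def select_reduction_path(m_tid_per_wave, n_tid_per_wave,
                          block_size, nr_dim_product,
                          subtile, r_dim):
    # has2DThreadLayout guard from the PR description
    has_2d_layout = m_tid_per_wave > 0 and n_tid_per_wave > 0
    if (has_2d_layout and block_size > nr_dim_product
            and subtile == [1, 1] and r_dim == 1):
        return "shuffle_dpp"   # Path 1: defer LDS store, shuffle + DPP
    if has_2d_layout and block_size <= nr_dim_product:
        return "serial_xor"    # Path 2: log2(rDim) XOR butterfly in-wave
    return "lds_tree"          # Path 3: store to LDS, tree fallback
```

The key property the deferred-store optimization relies on is visible here: only the third branch ever needs the initial LDS write.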

Parallel reduction with DPP:
- Use gpu.SubgroupReduceOp with cluster_size for DPP-eligible reductions
  (power-of-2 active threads, cluster_size <= waveSize)
- Only the reduction group leader (rtid == 0) writes the result back to LDS,
  followed by a barrier and broadcast read
- Use bitwise AND/SHRU for thread ID decomposition (rtid, nrtid) on the DPP
  path and for power-of-2 non-reduction dimensions; fall back to DIV/REM
  for non-power-of-2 cases
- Force scalar accumulation (vectorLen = 1) during threadwise pre-reduction
  on the DPP path to ensure correct element-wise reduction before SubgroupReduceOp
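The bitwise thread-ID decomposition mentioned above can be illustrated with a small sketch (hypothetical function name; the real lowering emits arith ops):

```python
# Illustrative sketch of the thread-ID split described above: for a
# power-of-two rDim, rtid/nrtid fall out of AND and a logical shift
# right (SHRU); otherwise the lowering keeps DIV/REM.
def decompose_tid(tid, r_dim):
    if r_dim & (r_dim - 1) == 0:                 # power of two
        rtid = tid & (r_dim - 1)                 # tid % r_dim
        nrtid = tid >> (r_dim.bit_length() - 1)  # tid / r_dim
    else:                                        # general fallback
        rtid, nrtid = tid % r_dim, tid // r_dim
    return rtid, nrtid

print(decompose_tid(37, 8))   # (5, 4) via AND/SHR
print(decompose_tid(37, 6))   # (1, 6) via REM/DIV
```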

Helper extraction:
- getPerWaveThreadCounts: promote to static member function; extracts m_tid and
  n_tid counts from the tid slice view Merge transform
- shuffleRearrangeForDPP: encapsulates the gpu.shuffle-based transposition from
  strided WMMA/MFMA layout to contiguous DPP layout
  (sourceLane = (lane % clusterSize) * stride + lane / clusterSize)
- readReducedResultsFromLDS: consolidates the repeated pattern of barrier +
  ThreadwiseReadInto from LDS into output registers (and optional extra output)

Tree reduction path:
- Retained as fallback for non-DPP-eligible configurations
  (non-power-of-2 thread counts or cluster_size > waveSize)
- Scope ceilPowerOf2 computation and treeMaxActiveThreads naming to this path
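The fallback condition can be summarized as a predicate (a hypothetical mirror of the eligibility rule stated above, not the actual code):

```python
# Hypothetical predicate mirroring the DPP eligibility rule described
# above: the DPP subpath needs a power-of-two cluster size that fits
# within one wave; anything else takes the LDS tree fallback.
def can_use_dpp(cluster_size, wave_size):
    is_pow2 = cluster_size > 0 and (cluster_size & (cluster_size - 1)) == 0
    return is_pow2 and cluster_size <= wave_size

# cluster_size=64 is DPP-eligible on CDNA (waveSize=64) but falls back
# to the tree path on RDNA (waveSize=32)
print(can_use_dpp(64, 64), can_use_dpp(64, 32))  # True False
```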

New test: blockwise_reduce_dpp_cluster_sizes.mlir
- Integration test covering DPP reduction with cluster sizes 2, 4, 8, 16, 32, 64
- Validates both sum (rand=none, all ones) and max (rand=fixed) reductions
- All test configurations use blockSize <= waveSize to ensure single-wave
  execution on both RDNA (waveSize=32) and CDNA (waveSize=64)
- cluster_size=64 falls back to tree reduction on RDNA since 64 > waveSize=32

Copilot AI left a comment


Pull request overview

This PR refactors the Rock BlockwiseBroadcastReduceOp lowering to use gpu.subgroup_reduce (with clustered reductions where applicable) and adds a Rock backend pass to lower gpu.subgroup_reduce to AMD DPP instructions, improving inter-thread reduction performance on supported architectures.

Changes:

  • Update blockwise broadcast-reduce lowering to select between shuffle+DPP, serial XOR shuffle, and LDS tree fallback paths, with shared helper functions.
  • Introduce rock-subgroup-reduce-to-dpp pass and wire it into the backend pipeline before convert-gpu-to-rocdl.
  • Extend/adjust tests and pipelines to cover the new DPP clustered reduction behavior and pass ordering.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:

  • mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp:
    Refactors reduction lowering, adds shuffle/DPP paths, helpers, and emits
    gpu.shuffle + gpu.subgroup_reduce.
  • mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp:
    New Rock pass to lower gpu.subgroup_reduce into AMD DPP sequences via GPU
    transform patterns.
  • mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt:
    Adds the new pass source and links the GPU transforms library.
  • mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp:
    Inserts rock-subgroup-reduce-to-dpp into the backend pipeline after
    lowering affine.
  • mlir/include/mlir/Dialect/Rock/Passes.td:
    Declares the new pass and its chip option.
  • mlir/include/mlir/Dialect/Rock/Passes.h:
    Adds the generated pass declaration macro for the new pass.
  • mlir/test/rocmlir-driver/pipelines.mlir:
    Updates expected printed pipelines to include
    rock-subgroup-reduce-to-dpp{chip=...}.
  • mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir:
    Updates lowering checks and parameterizes arch via token substitution.
  • mlir/test/Dialect/Rock/integration/reduce/blockwise_reduce/blockwise_reduce_dpp_cluster_sizes.mlir:
    New integration test covering multiple cluster_size cases and both
    sum/max reductions.


@stefankoncarevic force-pushed the dpp-refactor-blockwise-reduce branch from 8636b38 to 918f35c (April 3, 2026, 11:13)
