feat: Add bias for RightSemi, RightAnti, RightMark join orientation #22957

Open
neilconway wants to merge 3 commits into apache:main from neilconway:neilc/semi-join-swap-bias
Conversation

@neilconway

Contributor

Which issue does this PR close?

Rationale for this change

To evaluate a semi join, we support two orientations: LeftSemi or RightSemi (analogously for anti and mark joins; I'll just refer to semijoins here to simplify the discussion). Under RightSemi, we build the non-preserved ("filter") input and stream the preserved input; we do the inverse for LeftSemi. There are significant differences in evaluation behavior between these two orientations:

  • The build-side hash table has to be resident in memory; all else being equal, building the smaller join input is a good general rule, and that's the main rule we follow today.
  • RightSemi only needs to store the join keys for the build side; LeftSemi needs to store wider rows. By definition, the consumer of a semijoin can't be interested in any values from the filter side of the join. So even if the filter side has more rows than the preserved side, building the hash table on the filter side might still require less memory.
  • RightSemi preserves the partitioning of the preserved input, whereas LeftSemi + CollectLeft emits with UnknownPartitioning.
  • RightSemi works better with dynamic filter pushdown: I don't know the dynamic filter code super well, but I'd imagine that since RightSemi builds the filter side before streaming the preserved side, that gives us more information we can use to push down filters into the preserved-side scan.
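
To make the memory point concrete, here is a rough back-of-the-envelope sketch. The row counts and byte widths are invented for illustration, the helper names are hypothetical, and hash-table overhead is ignored; this is not how DataFusion estimates memory.

```rust
/// Rough build-side footprint: rows times bytes retained per row.
/// Ignores hash-table overhead; purely illustrative.
fn build_bytes(rows: u64, bytes_per_row: u64) -> u64 {
    rows * bytes_per_row
}

/// Hypothetical inputs: the preserved side has fewer but wider rows.
/// Returns (LeftSemi build bytes, RightSemi build bytes).
fn example_footprints() -> (u64, u64) {
    // LeftSemi builds the preserved side and must keep whole rows:
    // 1,000,000 rows at an assumed 100 bytes per row.
    let left_semi = build_bytes(1_000_000, 100);
    // RightSemi builds the filter side and keeps only the join key:
    // 2,000,000 rows at an assumed 8 bytes per key.
    let right_semi = build_bytes(2_000_000, 8);
    (left_semi, right_semi)
}
```

With these made-up numbers the filter side has twice as many rows, yet building it retains far fewer bytes than building the preserved side.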

The current optimizer rules don't reflect this:

  • LeftSemi and RightSemi are treated symmetrically; whichever semijoin input is predicted to be smaller is placed on the build side
  • If stats are absent, LeftSemi is the default orientation

This PR revises these rules as follows:

  • Prefer RightSemi over LeftSemi, unless the filter side is twice as large as the preserved side (the threshold is configurable via the semi_join_swap_bias configuration variable)
  • If stats are absent, prefer RightSemi
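
The revised rules can be sketched roughly as follows. This is an illustration rather than the PR's actual code: the function name, the enum, the use of row counts as the size measure, the inclusive comparison at the threshold, and the handling of a single missing statistic are all assumptions.

```rust
/// Orientation of a semi join, as described in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SemiOrientation {
    /// Build the preserved input, stream the filter input.
    LeftSemi,
    /// Build the filter input, stream the preserved input.
    RightSemi,
}

/// Hypothetical helper: pick an orientation from optional size estimates.
/// `bias` plays the role of semi_join_swap_bias (2.0 in the description).
fn choose_orientation(
    preserved_size: Option<u64>,
    filter_size: Option<u64>,
    bias: f64,
) -> SemiOrientation {
    match (preserved_size, filter_size) {
        // Both estimates known: fall back to LeftSemi only when the
        // filter side reaches `bias` times the preserved side.
        (Some(p), Some(f)) if (f as f64) >= bias * (p as f64) => {
            SemiOrientation::LeftSemi
        }
        // Otherwise, including when any estimate is missing (an
        // assumption of this sketch), prefer RightSemi.
        _ => SemiOrientation::RightSemi,
    }
}
```

For example, with a bias of 2.0, a filter side of 150 against a preserved side of 100 stays RightSemi, while a filter side of 300 flips to LeftSemi.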

What changes are included in this PR?

  • Add semi_join_swap_bias configuration variable
  • Refactor code to use a single routine when deciding whether to swap hash join inputs
  • Apply the configured bias to the hash join input-swap decision
  • Add unit tests
  • Update expected plans in SLT

Are these changes tested?

Yes.

Are there any user-facing changes?

Changes in plans for user queries.

@github-actions github-actions Bot added documentation Improvements or additions to documentation optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Jun 15, 2026
@github-actions

github-actions Bot commented Jun 15, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [  96.144s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.033s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [  96.794s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.036s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.581s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 194.980s] datafusion
    Building datafusion-common v54.0.0 (current)
       Built [  32.257s] (current)
     Parsing datafusion-common v54.0.0 (current)
      Parsed [   0.057s] (current)
    Building datafusion-common v54.0.0 (baseline)
       Built [  32.806s] (baseline)
     Parsing datafusion-common v54.0.0 (baseline)
      Parsed [   0.059s] (baseline)
    Checking datafusion-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.644s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field OptimizerOptions.semi_join_swap_bias in /home/runner/work/datafusion/datafusion/datafusion/common/src/config.rs:1267

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  66.830s] datafusion-common
    Building datafusion-physical-optimizer v54.0.0 (current)
       Built [  36.510s] (current)
     Parsing datafusion-physical-optimizer v54.0.0 (current)
      Parsed [   0.022s] (current)
    Building datafusion-physical-optimizer v54.0.0 (baseline)
       Built [  36.480s] (baseline)
     Parsing datafusion-physical-optimizer v54.0.0 (baseline)
      Parsed [   0.021s] (baseline)
    Checking datafusion-physical-optimizer v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.112s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  74.143s] datafusion-physical-optimizer
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  34.920s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.131s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  34.906s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.129s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.577s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  72.105s] datafusion-physical-plan
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 164.888s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 165.547s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.023s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.082s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 333.072s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 15, 2026
@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 16, 2026
Development

Successfully merging this pull request may close these issues.

Optimizer should bias to prefer RightSemi over LeftSemi
