Skip to content

Limit pushdown with filters + sort-aware reverse output#7

Open
ch-sc wants to merge 4 commits into
0.65.0-atlasfrom
limit-pushdown
Open

Limit pushdown with filters + sort-aware reverse output#7
ch-sc wants to merge 4 commits into
0.65.0-atlasfrom
limit-pushdown

Conversation

@ch-sc
Copy link
Copy Markdown
Collaborator

@ch-sc ch-sc commented May 19, 2026

Summary

Two related perf gaps surfaced while investigating reverse order scans:

  1. LIMIT is silently dropped when a filter is set. ScanBuilder bailed outright on filter + limit, and VortexOpener worked around it by gating with_limit on filter.is_none(). The result was that SELECT … WHERE x > k LIMIT n always scanned every file end-to-end.
  2. ORDER BY ts DESC against a table declared WITH ORDER (ts ASC) doesn't exploit reverse scanning. VortexSource::with_reversed already existed and ScanBuilder already supported reverse intra-file scans, but no FileSource::try_reverse_output hook was wired up. So DataFusion always inserted a full SortExec instead of letting the source absorb the reversal.

Together they meant that the canonical "latest N matching rows" query (e.g. SELECT * FROM t WHERE x > k ORDER BY ts DESC LIMIT 100 over time-partitioned files) read every file and every row. After the changes in this PR it typically reads only the last file, short-circuiting once the limit is satisfied.

SELECT * FROM t WHERE x > k ORDER BY ts DESC LIMIT 100 over time-partitioned
files now executes as:

  1. Planner sees ORDER BY ts DESC, table's output ordering is ts ASC. Calls VortexSource::try_pushdown_sort.
  2. We return Exact with reversed=true. DataFusion reverses file_groups (so the latest file comes first) and removes the SortExec.
  3. Planner pushes filter x > k via try_pushdown_filters (already in place).
  4. Planner pushes LIMIT 100 via with_fetch on DataSourceExec (already in place; passes through to base_config.limit).
  5. Opener pushes the limit unconditionally now (remove the filter.is_none() guard).
  6. Scan starts processing files in reverse order. First split's filter+projection produces, say, 60 rows. Atomic reserves 60, remaining = 40.
  7. Second split sees remaining = 40, reserves 40, exits early after enough rows. Done.
  8. Third+ splits / later files never start.

reversed is now surfaced in fmt_extra so plan rendering shows
reversed: true, which makes the pushdown observable in tests.

Edge cases.

  • Empty requested ordering or no declared orderings → Unsupported.
  • Multiple declared orderings → accept if any one matches in reverse.
  • Partial prefix matches → is_reverse is strict; we return Unsupported, the planner falls back to a full SortExec. Correct, just not optimal.
  • Currently-reversed source asked to reverse again → flip the flag back (reversed = !reversed).
  • Filter pushdown interactions — none. Sort pushdown is a separate optimizer stage; we just clone with a different flag.

Risks & limitations

Risk Mitigation
CAS livelock under heavy contention compare_exchange_weak allows spurious failures; bounded by physical concurrency. 32-thread stress test (reserve_up_to_concurrent_never_oversubscribes) exercises this.
Soft early-stop over-fetches on a slow split Outer LimitExec truncates. We pay extra IO, never wrong rows.
ORDER BY ts DESC, id ASC (multi-key with mixed direction) is_reverse is strict; we return Unsupported, planner falls back to SortExec. Correct, just not optimal.
DataFusion 52 leaves a SortExec even after reverse pushdown We return Inexact (Parquet does the same on 52). The TopK still benefits from fetch pushdown and from the source emitting locally-sorted streams.

API Changes

  • VortexSource::with_reversed(bool) -> Self
  • VortexSource::try_reverse_output(&self, &[PhysicalSortExpr], &EquivalenceProperties) -> Result<SortOrderPushdownResult<Arc<dyn FileSource>>>
  • ScanBuilder::with_reversed(bool) -> Self and ScanBuilder::reversed(&self) -> bool (already existed in the type, now stamped into the public-api lock).

public-api.lock files regenerated via cargo xtask public-api.

Testing

vortex-scan/src/tasks.rs (CAS unit tests):

  • reserve_up_to_drains_counter — sequential reservations drain to zero.
  • reserve_up_to_concurrent_never_oversubscribes — 32-thread stress proves the
    CAS loop never lets the counter underflow or hand out more than the initial
    budget.

vortex-file/src/tests.rs (filter + limit at the file level):

  • limit_with_filter_truncates_globally — multi-chunk scan with a filter and
    limit=7 returns exactly 7 rows. (Previously this combination would have
    bailed in ScanBuilder::prepare.)
  • limit_zero_with_filter_yields_no_rowslimit=0 short-circuits even when
    a filter is present.

vortex-datafusion/src/persistent/mod.rs (sort pushdown):

  • reverse_pushdown_swaps_file_orderWITH ORDER (ts ASC) + ORDER BY ts DESC LIMIT k produces the largest k values, the plan shows
    SortExec(TopK) and the source is marked reversed: true.
  • reverse_pushdown_rejected_for_unrelated_orderingWITH ORDER (a ASC),
    query asks ORDER BY b DESC. Plan keeps the Sort and source is not
    reversed.
  • reverse_pushdown_no_ordering_declared — no WITH ORDER → no reverse
    pushdown.
  • reverse_pushdown_then_filter — filter pushdown still works alongside
    reverse sort pushdown.
  • order_by_desc_limit_returns_largest_rows — end-to-end synergy: 5 files of
    100 rows each, filter + ORDER BY ts DESC LIMIT 10 returns ts [490..500]
    and the source is reversed in the plan.

ch-sc added 2 commits May 19, 2026 17:37
Signed-off-by: Christoph Schulze <christoph.schulze@polygon.io>
Signed-off-by: Christoph Schulze <christoph.schulze@polygon.io>
@ch-sc ch-sc requested a review from zhuqi-lucas May 20, 2026 08:01
@ch-sc ch-sc marked this pull request as ready for review May 20, 2026 09:16
Copy link
Copy Markdown

@zhuqi-lucas zhuqi-lucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, two things worth addressing before merge:

  1. Test gap — tasks.rs:182 (reserved < true_count branch where the LIMIT
    straddles a split mid-way and mask.limit(reserved) trims the mask) is not
    covered. order_by_desc_limit_returns_largest_rows only exercises the
    "first split fills the budget" fast path; later splits short-circuit at the
    top-of-function check (line 87-91) without entering the filter+reserve
    block. Worth a test that tunes filter selectivity so LIMIT crosses a split
    boundary — it's the most subtle correctness path in this PR.

  2. Doc nit on try_reverse_output — please pin the mental model: this is
    "scan emits rows already in target order, atomic counter caps emission,"
    not the dynamic-filter-pushdown shape used by ParquetSource for unsorted
    ORDER BY x LIMIT k. The SortExec(TopK) above the rebuilt source is a
    passthrough cap + cross-split merge, not the correctness guarantor. One
    sentence prevents a future reader coming from the Parquet path from
    misreading the design.

/// checker. The planner keeps the upstream `SortPreservingMergeExec` but the reversed
/// scan still produces locally-sorted streams in the requested order, which is what
/// enables filter + LIMIT pushdown to short-circuit on the first file.
fn try_reverse_output(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So actually vortex is using exact reverse, it supports reverse reading, we can add some comments about this.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the doc comment and added a todo for when we bump to DF 53 to switch to the exact hint.

ch-sc added 2 commits May 21, 2026 13:54
Signed-off-by: Christoph Schulze <christoph.schulze@polygon.io>
Signed-off-by: Christoph Schulze <christoph.schulze@polygon.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants