[SDTEST-3824] Split planner from runner and make fast discovery fallback-only by anmarchenko · Pull Request #73 · DataDog/ddtest

anmarchenko · 2026-06-04T12:16:13Z

What

Split planning state and behavior into internal/planner, leaving internal/runner focused on executing plan artifacts.
Make fast filesystem discovery a fallback-only source: when full test discovery succeeds, its discovered suites/files are authoritative.
Keep backend duration enrichment only for fast-discovery fallback files, so stale backend rows or filesystem-only files do not reintroduce suites that full discovery omitted.
Preserve CI-node worker subsplitting while keeping planner-owned weights behind planner APIs.

Why

A suite that is completely skipped by the test framework can still be found by filesystem glob discovery. Before this change, planning was too eager to merge fast-discovered files back into the runnable set after full discovery succeeded. That could cause files omitted by full discovery, for example an RSpec file containing only skipped examples, to appear in plan artifacts and receive default or backend durations.

Full discovery is the authoritative signal for runnable local tests whenever it succeeds. Fast discovery remains useful only as a safety fallback when full discovery fails or is cancelled.

E2E testing

Bug reproduction and verification:

In an RSpec repository, create or identify a spec file whose suite is completely skipped by the framework, for example a file containing only xit examples.
Ensure the file matches the fast discovery glob, such as spec/**/*_spec.rb. To reproduce the old failure more clearly, use a file that already has historical backend duration data, for example by first running it normally, then changing every example in the file to skipped.
Run ddtest plan with test optimization enabled and successful full discovery.
Before the fix, the skipped-only file could be re-added from fast filesystem discovery and show up in .testoptimization/runner/test-files or runner split artifacts, with either default or backend duration data.
After this fix, the skipped-only file is absent from runnable plan artifacts when full discovery succeeds. Fast-discovered-only files are ignored in that path, and backend duration rows do not reintroduce them.
Separately verify fallback behavior by forcing full discovery to fail while fast discovery succeeds. In that case, fast-discovered files should still be planned, and matching backend durations should still be used for scheduling.

Full regression testing for both planning and running test execution must be run because this PR heavily refactors planner/runner ownership and artifact loading.

Check that there are no regressions when the durations backend API returns 404.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f18e23896

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-04T12:20:00Z

+	planInfo, err := tr.planner.LoadPlan()
 	if err != nil {
-		return fmt.Errorf("failed to read parallel runners count from %s: %w", constants.ParallelRunnersOutputPath, err)
+		slog.Error("Test optimization plan is not available", "error", err)
+		return fmt.Errorf("test optimization plan is not available: %w", err)


Don't abort runs when only the plan cache is absent

In the existing-artifact path (parallel-runners.txt already present), a missing or corrupt .testoptimization/runner/cache/test_suite_durations.json now causes Run to return before reading test-files.txt or tests-split. That makes CI jobs fail to execute tests when they restore the runner file artifacts without the cache, even though the file lists are sufficient and DistributeTestFiles already has a default-weight fallback for an unavailable cache.

Useful? React with 👍 / 👎.

this is actually what we want to do now because running tests without a correct plan is not wise - it is better to abort to signal that the setup is broken

anmarchenko · 2026-06-04T12:42:22Z

E2E Test Report: SUCCESS

Tested by: Shepherd Agent (autonomous QA for Datadog Test Optimization)

Pull Request

PR: [SDTEST-3824] Split planner from runner and make fast discovery fallback-only #73
Title: SDTEST-3824 Split planner from runner and make fast discovery fallback-only
Branch tested: anmarchenko/fast-discovery-fallback
Revision tested: 5f18e2389667032e1d3e525cc645b285a02657c2
Environment: Local Shepherd E2E runs against mockdog on macOS/Darwin, using real Ruby playgrounds and the PR branch built/injected by crook
Datadog UI verification: Not applicable for these runs. The runs targeted local mockdog; backend/API behavior and emitted telemetry were verified from mockdog reports and ddtest debug logs.

Summary

This PR passes the requested E2E coverage. I verified the main behavioral fix and the related regression surfaces:

Full framework discovery is authoritative when it succeeds.
Fast filesystem discovery remains a fallback when full discovery fails.
Stale backend duration rows do not reintroduce files omitted by successful full discovery.
Backend duration data is still used for scheduling in the fallback path.
Plan + real worker execution still works after the planner/runner split.
CI-node mode with local worker subsplitting still works.
A 404 from the test-suite durations endpoint is non-fatal and falls back to default duration weighting.

Results Matrix

Area	Playground	Command shape	Result	Evidence
Successful full discovery excludes fast-only skipped file	`spree` core / RSpec	`crook run spree -c ddtest-plan-core --dep ddtest=anmarchenko/fast-discovery-fallback --scenario <temp skipped-only scenario> --debug` with `DD_TEST_OPTIMIZATION_RUNNER_TESTS_LOCATION=spec/helpers/*_spec.rb`	PASS	Fast discovery saw 7 files and the mock backend returned a stale duration row for `shepherd_skipped_only_spec.rb`; full discovery succeeded and final plan had `testFilesCount=6`, excluding the skipped-only file.
Full-discovery failure uses fast fallback and backend duration	`spree` core / RSpec	Same PR branch and scenario, narrowed to the intentionally broken temp file	PASS	Log showed `Full test discovery failed or was cancelled`, then `using fast test file discovery fallback`; final plan had `testFilesCount=1`, `expectedWallTime=12s`, and `Duration source: 1 known`.
Real plan + worker execution	`sidekiq` / Minitest	`crook run sidekiq -c ddtest --dep ddtest=anmarchenko/fast-discovery-fallback --scenario itr-discovery-only --debug` scoped to `test/logger_test.rb`	PASS	`9 runs, 48 assertions, 0 failures, 0 errors, 0 skips`; ddtest run report `Result: passed`; mockdog received 12 CI test-cycle events and valid telemetry.
CI-node worker subsplitting	`sidekiq` / Minitest	Same run, with `DD_TEST_OPTIMIZATION_RUNNER_CI_NODE=0` and `DD_TEST_OPTIMIZATION_RUNNER_CI_NODE_WORKERS=2`	PASS	Log showed `Running tests for CI node in parallel mode ciNode=0 ciNodeWorkers=2`; run report showed `Mode: CI node`, `Local workers: 2`, `Result: passed`; mockdog received 12 CI test-cycle events and valid telemetry.
Durations endpoint 404 is non-fatal	`sidekiq` / Minitest	`crook run sidekiq -c ddtest-plan --dep ddtest=anmarchenko/fast-discovery-fallback --scenario <temp 404 scenario> --debug`	PASS	mockdog returned 404 three times for `/api/v2/ci/ddtest/test_suite_durations`; ddtest logged `Test durations API errored` and still completed planning with `Duration source: 0 known, 1 default`.

Detailed Evidence

1. Successful full discovery stays authoritative

I created a temporary Spree spec file containing only skipped examples:

core/spec/helpers/shepherd_skipped_only_spec.rb

The temporary mockdog scenario returned a stale backend duration row for that file alongside the normal helper specs:

ShepherdSkippedOnlyHelper
source_file: core/spec/helpers/shepherd_skipped_only_spec.rb
p50: 12000000000

Important log evidence from debug_ddtest73_spree_success.log:

Discovered test files (fast) ... count=7
test_suite_durations responseBody=... "ShepherdSkippedOnlyHelper":{"source_file":"core/spec/helpers/shepherd_skipped_only_spec.rb","duration":{"p50":"12000000000","p90":"15000000000"}}
Discovered local tests ... count=97
Full test discovery succeeded; using full discovery results and ignoring fast-discovered-only files fastDiscoveredTestFilesCount=7
Test files prepared testFilesCount=6
Test execution planning completed ... testFilesCount=6
Duration source: 0 known, 6 default

Artifact evidence:

crook-data/runs/20260604-142857/spree/core/.testoptimization/runner/test-files.txt

spec/helpers/base_helper_spec.rb
spec/helpers/currency_helper_spec.rb
spec/helpers/images_helper_spec.rb
spec/helpers/locale_helper_spec.rb
spec/helpers/products_helper_spec.rb
spec/helpers/shipment_helper_spec.rb

The skipped-only file was absent from runner/test-files.txt and absent from every runner split, even though fast discovery found it and the backend returned duration data for it. This is the exact bug class described in the PR.

2. Fast discovery fallback still works when full discovery fails

To force a real full-discovery failure, I temporarily made the skipped-only Spree spec syntactically invalid and narrowed DD_TEST_OPTIMIZATION_RUNNER_TESTS_LOCATION to that file.

Important log evidence from debug_ddtest73_spree_fallback_single.log:

Discovered test files (fast) ... count=1
WARN Full test discovery failed or was cancelled error="open .testoptimization/tests-discovery/tests.json: no such file or directory"
Full test discovery did not run or failed; using fast test file discovery fallback fastDiscoveredTestFilesCount=1
Test files prepared testFilesCount=1
Test execution planning completed parallelRunners=1 expectedWallTime=12s ... testFilesCount=1
Duration source: 1 known, 0 default

Artifact evidence:

crook-data/runs/20260604-143354/spree/core/.testoptimization/runner/test-files.txt

spec/helpers/shepherd_skipped_only_spec.rb

This verifies the fallback side of the behavior: when full discovery cannot produce authoritative results, ddtest still plans from fast-discovered files and can attach matching backend durations for scheduling.

3. Real worker execution still works

I ran a full ddtest execution against Sidekiq's focused ddtest-logger path using the PR branch.

Important log evidence from debug_ddtest73_sidekiq_run.log:

Full test discovery succeeded; using full discovery results and ignoring fast-discovered-only files fastDiscoveredTestFilesCount=1
Test execution planning completed parallelRunners=1 expectedWallTime=9s ... testFilesCount=1
9 runs, 48 assertions, 0 failures, 0 errors, 0 skips
Result: passed
CI Test Cycle: 12 events (9 tests, 1 suites, 1 modules, 1 sessions) gzipped
Telemetry:     valid, 1 sessions, 9 tests, 2 payloads

Plan artifact:

crook-data/runs/20260604-143549/sidekiq/.testoptimization/runner/test-files.txt

test/logger_test.rb

This covers the runner side of the planner/runner refactor, not just planning.

4. CI-node local worker subsplitting still works

I then reran Sidekiq in CI-node mode with DD_TEST_OPTIMIZATION_RUNNER_CI_NODE=0 and DD_TEST_OPTIMIZATION_RUNNER_CI_NODE_WORKERS=2.

Important log evidence from debug_ddtest73_sidekiq_ci_node_run.log:

Running tests for CI node in parallel mode ciNode=0 ciNodeWorkers=2 testFilesCount=1
Assigned test files to CI node worker ciNode=0 workerIndex=0 testFiles=[test/logger_test.rb]
Running tests in worker nodeIndex=0 workerIndex=0 testFilesCount=1
9 runs, 48 assertions, 0 failures, 0 errors, 0 skips
Mode: CI node
CI node: 0
Local workers: 2
Result: passed
CI Test Cycle: 12 events (9 tests, 1 suites, 1 modules, 1 sessions) gzipped
Telemetry:     valid, 1 sessions, 9 tests, 2 payloads

This specifically exercises the CI-node worker path mentioned in the PR summary.

5. Durations endpoint 404 is non-fatal

I created a temporary mockdog scenario that returned 404 for test_suite_durations and ran a focused Sidekiq plan.

Important log evidence from debug_ddtest73_sidekiq_durations_404.log:

POST /api/v2/ci/ddtest/test_suite_durations 404
POST /api/v2/ci/ddtest/test_suite_durations 404
POST /api/v2/ci/ddtest/test_suite_durations 404
ERROR Test durations API errored ... error="fetching test suite durations: max retries exceeded: unexpected status code 404: "
Full test discovery succeeded; using full discovery results and ignoring fast-discovered-only files fastDiscoveredTestFilesCount=1
Test files prepared testFilesCount=1
Test execution planning completed parallelRunners=1 expectedWallTime=9s ... testFilesCount=1
Duration source: 0 known, 1 default

This confirms the requested 404 regression check: the endpoint error is logged, but planning still succeeds with default duration weighting.

Issues Found

No ddtest PR bugs were found.

One environment issue was encountered and handled:

A RuboCop full RSpec run failed before test execution during Bundler dependency injection because the local Bundler git cache for dd-trace-rb had conflicting refs (refs/heads/claude vs nested refs/heads/claude/...). This was not caused by the ddtest PR. I restored the failed Gemfile injection artifacts and used Sidekiq for real runner execution coverage.

Temporary test fixtures/scenarios used for the Spree and 404 checks were removed after the runs.

Conclusion

The PR behavior is verified end-to-end:

Successful full discovery is authoritative.
Fast discovery no longer re-adds full-discovery-omitted files.
Backend duration rows no longer revive stale/skipped-only suites in the successful-full-discovery path.
Fast discovery remains functional as a fallback and still uses backend duration data.
Normal worker execution and CI-node worker execution still pass.
Durations API 404 is non-fatal.

This E2E test was performed by Shepherd - autonomous QA agent for Datadog Test Optimization.

anmarchenko added 3 commits June 4, 2026 11:46

Use fast discovery only as fallback

fc9dc5b

split planning logic from the runner

e5ea018

planner refactoring

5f18e23

anmarchenko marked this pull request as ready for review June 4, 2026 12:16

anmarchenko requested a review from a team as a code owner June 4, 2026 12:16

anmarchenko changed the title ~~Split planner from runner and make fast discovery fallback-only~~ [SDTEST-3824] Split planner from runner and make fast discovery fallback-only Jun 4, 2026

chatgpt-codex-connector Bot reviewed Jun 4, 2026

View reviewed changes

juan-fernandez approved these changes Jun 4, 2026

View reviewed changes

anmarchenko merged commit dde79e8 into main Jun 4, 2026
3 checks passed

anmarchenko deleted the anmarchenko/fast-discovery-fallback branch June 4, 2026 15:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SDTEST-3824] Split planner from runner and make fast discovery fallback-only#73

[SDTEST-3824] Split planner from runner and make fast discovery fallback-only#73
anmarchenko merged 3 commits into
mainfrom
anmarchenko/fast-discovery-fallback

anmarchenko commented Jun 4, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Uh oh!

anmarchenko Jun 4, 2026

Uh oh!

anmarchenko commented Jun 4, 2026 •

edited by atlassian Bot

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

anmarchenko commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

E2E testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

anmarchenko Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

anmarchenko commented Jun 4, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Test Report: SUCCESS

Pull Request

Summary

Results Matrix

Detailed Evidence

1. Successful full discovery stays authoritative

2. Fast discovery fallback still works when full discovery fails

3. Real worker execution still works

4. CI-node local worker subsplitting still works

5. Durations endpoint 404 is non-fatal

Issues Found

Conclusion

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

anmarchenko commented Jun 4, 2026 •

edited

Loading

anmarchenko commented Jun 4, 2026 •

edited by atlassian Bot

Loading