Skip to content

feat(compile): IR-validated step authoring & runtime self-optimization#1061

Open
jamesadevine wants to merge 13 commits into
mainfrom
jamesadevine/self-optimization-and-step-validation
Open

feat(compile): IR-validated step authoring & runtime self-optimization#1061
jamesadevine wants to merge 13 commits into
mainfrom
jamesadevine/self-optimization-and-step-validation

Conversation

@jamesadevine

@jamesadevine jamesadevine commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

feat(compile): IR-validated step authoring & runtime self-optimization

Summary

Two new agent-driven capabilities that don't exist in ado-aw (or upstream gh-aw) today:

Flow A — Authoring-time step hoisting: When an authoring agent creates or updates a pipeline, it proactively scaffolds concrete steps: / post-steps: entries for deterministic, non-LLM work (clone, cache restore, CLI install). A new validate_steps MCP tool on the author-facing server gives the agent IR-level feedback before committing.

Flow B — Runtime self-optimization: Opt-in via self-optimization: enabled: true. The Stage-1 runtime agent recognises bash it ran successfully and proposes lifting it into front-matter steps via a new structured safe-output. Stage 2 cross-checks proposals against command history (anti-injection). Stage 3 IR-validates (Curated allow-list: bash + typed-factory tasks only) and either renders a 🎭 build-log preview (staged mode, the default) or opens a PR against the source .md (live mode, with idempotency dedup).

Self-audit prompt: New prompts/audit-ado-agentic-workflow.md teaches a Copilot CLI agent to systematically audit pipeline runs across 5 dimensions (cost, hoist candidates, reliability, safe-output quality, security) and produce actionable reports with concrete front-matter patches.

What shipped (13 commits)

Layer What
Shared core compile::ir::validate_step_block structural validator + CURATED_TASK_IDS allow-list
Flow A validate_steps MCP tool + 3 authoring prompt updates (hoist heuristic)
Front matter self-optimization: section (enabled, staged, max-proposals-per-run, allowed-sections)
Flow B Stage 1 propose-step-optimization safe-output + OPT_IN_GATED_TOOLS gating
Flow B Stage 2 Detection prompt extension (command-history cross-check)
Flow B Stage 3 Staged preview renderer + live-PR path (with idempotency dedup)
Telemetry AgentStats.propose_step_optimization_calls OTel counter
Self-audit prompts/audit-ado-agentic-workflow.md + dispatcher routing
Docs 8 doc files updated/created (front-matter, mcp-author, extending, safe-outputs, self-optimization, prompts×4, AGENTS.md)
Tests 2046 total tests pass; ~50 new tests across unit, integration, and fixture coverage

Configuration

self-optimization:
  enabled: true              # opt-in (default: false)
  staged: true               # preview-only (default: true)
  max-proposals-per-run: 3   # cap (default: 3, max: 50)
  allowed-sections:          # which sections proposals may target
    - steps                  # (default: [steps, post-steps])
    - post-steps

Security model

  • Curated allow-list: Only Bash steps and tasks with a typed factory in tasks.rs are accepted (today: ArchiveFiles@2, CopyFiles@2, DockerInstaller@0, DotNetCoreCLI@2, PublishTestResults@2).
  • Stage 2 cross-check: Detection agent verifies proposed bash against source_command_evidence — ungrounded proposals are flagged as prompt injection.
  • Staged-by-default: First runs only preview; authors flip staged: false once they trust the agent's judgement.
  • Section scoping: setup/teardown require explicit opt-in (different job identities).
  • OPT_IN_GATED_TOOLS: MCP layer strips the tool unless the compiler explicitly enables it.
  • Validator rejects safe-outputs.propose-step-optimization: — the tool is ONLY activated via the self-optimization: section.

Testing

cargo build        # 1 warning (allows_section unused — trivial)
cargo test         # 2046 passed, 0 failed, 1 ignored
cargo clippy       # clean

jamesadevine and others added 12 commits June 15, 2026 23:16
…t matter

* Add CURATED_TASK_IDS + is_curated_task() in src/compile/ir/tasks.rs.
  These will gate which TaskStep variants agent-proposed step blocks
  may reference (used by the IR fragment validator landing next).
  A new test asserts the const list stays in lock-step with every
  factory's emitted task identifier.

* Add SelfOptimizationConfig + StepSection in src/compile/types.rs
  and wire as the new self-optimization: front-matter section
  (opt-in; default off). Defaults: staged=true, max-proposals-per-run=3,
  allowed-sections=[steps, post-steps]. The sanitize impl clamps
  max-proposals-per-run at 50. Seven parsing tests lock the schema.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ocks

Add src/compile/ir/step_validation.rs exposing validate_step_block, the
shared core for two upcoming layers:

* Flow A — a validate_steps MCP tool on the author-facing MCP server
  so the authoring agent can get IR-aligned feedback on proposed
  steps: blocks without round-tripping through ado-aw compile.
* Flow B — the propose-step-optimization safe-output Stage 3 executor
  must IR-validate any agent-proposed block before applying it to the
  source .md.

The validator operates on serde_yaml::Value (mirroring how the
front-matter steps: field is treated today — opaque YAML passed to
ADO) rather than lowering to a typed Vec<Step> first; the IR has no
public Value -> Step parser. It enforces:

* Top-level must be a sequence; each entry a mapping with exactly
  one of bash/task/checkout/download/publish.
* Unknown step-level keys are rejected (the most common shape-
  injection vector in untrusted YAML).
* env: values must be string scalars (nested maps/sequences could
  smuggle ADO macros or template expressions).
* bash bodies are capped at 10 KB.
* task identifiers must match Name@Version. Curated mode also
  restricts tasks to CURATED_TASK_IDS (the tasks.rs typed factories).
* All errors collected, not short-circuited.

17 unit tests cover happy paths, every failure class, and the task
identifier parser.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The new tool exposes compile::ir::validate_step_block over MCP so an
authoring agent (Copilot Chat / Claude / Codex) can get IR-aligned
feedback on a proposed front-matter steps: block without
round-tripping through the full ado-aw compile flow.

Input: a JSON array of step entries (same shape ADO accepts in
YAML) plus an optional allow_list mode ("full" default,
"curated" to additionally restrict tasks to the tasks.rs typed-
factory set). The MCP transport speaks JSON; the tool round-trips
through a YAML text representation before handing to the validator
(serde_yaml::Value is a strict superset of serde_json::Value, so
the conversion is lossless for JSON-shaped inputs).

Output: structured response { ok, kinds } or { ok, errors }.
Errors are collected, not short-circuited, so the agent gets the
full picture in one round.

Adds Serialize derives on StepKind + StepValidationError to allow
the structured MCP response and the future propose-step-optimization
Stage 3 audit emission.

Three new tests cover the registration, the curated-mode rejection
of AzureCLI@2, and the invalid_params error on a bad allow_list.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…teps

* create-ado-agentic-workflow.md: Step 13 (Inline Steps) now leads
  with a 'hoist candidates' heuristic the authoring agent uses to
  decide what work belongs in steps:/post-steps: vs the prompt body,
  with concrete examples and a 'Validate before committing'
  subsection pointing at the new validate_steps MCP tool.
* update-ado-agentic-workflow.md: mirrors the heuristic for edit
  flows so the update agent re-examines hoist candidates before
  modifying prompt bodies.
* debug-ado-agentic-workflow.md: adds a slow-build diagnostic that
  surfaces hoist candidates from the audit command history.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* docs/front-matter.md: new 'Self-optimization (opt-in)' section
  documenting the self-optimization: block (enabled, staged,
  max-proposals-per-run, allowed-sections), with the canonical
  YAML example slotted in alongside execution-context.
* docs/mcp-author.md: new validate_steps entry covering input
  schema, full vs curated allow-list modes, and the structured
  success/error response shapes.
* docs/extending.md: pointer to compile::ir::validate_step_block as
  the shared structural validator for any component accepting an
  untrusted step block (always use StepKindAllow::Curated for
  agent-proposed input).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Stage-1 surface of Flow B for runtime self-optimization. When
the agent's front matter sets self-optimization.enabled: true, the
Stage-1 agent gets access to a structured safe-output that lets it
propose lifting deterministic bash work (clone, install, cache
restore, artifact download) out of its prompt body and into the
front-matter steps:/post-steps: section.

Params (validated at Stage 1):
  - section: which front-matter section to propose into (steps,
    post-steps, setup, teardown — kebab-case wire format mirroring
    StepSection in compile::types).
  - rationale: <= 2 KB, non-empty.
  - estimated_token_savings: optional hint.
  - steps: JSON array of ADO step entries (deep structural validation
    runs at Stage 3 via compile::ir::validate_step_block in Curated
    mode — bash + tasks.rs typed factories only).
  - source_command_evidence: bash commands the agent actually ran,
    capped at 64 entries / 10 KB each. Stage 2 cross-checks this
    against the audit command history; any bash in steps without a
    matching evidence entry is a prompt-injection signal.

Stage 3 placeholder: this commit records the proposal in
safe_outputs.ndjson for audit visibility but emits no preview or PR.
The staged-preview renderer and live-mode PR opener land in
subsequent commits.

Gating (parallel mechanism to ado-aw-debug's DEBUG_ONLY_TOOLS):
  - New OPT_IN_GATED_TOOLS const in src/safeoutputs/mod.rs listing
    propose-step-optimization.
  - MCP layer in src/mcp.rs strips the route unless explicitly
    listed in --enabled-tools (the same path used for debug-only
    tools, generalised from is_debug_only -> is_gated).
  - New self_optimization_enabled(fm) predicate in compile/common.rs
    drives whether generate_enabled_tools_args adds the tool to the
    --enabled-tools list.
  - New validate_self_optimization_config validator (wired in
    agentic_pipeline.rs) rejects safe-outputs.propose-step-optimization
    with a clear "use self-optimization: instead" message, and
    rejects an empty allowed-sections list when the feature is
    enabled.

Sanitisation: rationale runs through sanitize_config; steps and
source_command_evidence pass through unchanged because the Stage 3
IR validator enforces structure and mangling them would break the
Stage 2 command-history cross-check.

Tests: 10 unit tests in propose_step_optimization (round-trip,
section enum, validation rejection paths, dry-run summary,
sanitization invariants); 12 new compile-side tests
(self_optimization_enabled, generate_enabled_tools_args inclusion/
exclusion, validate_self_optimization_config rejection paths). All
2040 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion test

* src/data/threat-analysis.md: extend the threat-detection prompt
  with a fourth analysis item (Unsupported Step-Optimization
  Proposal). The detection agent now cross-checks every
  propose-step-optimization entry's params.steps against
  params.source_command_evidence and the agent's actual command
  history (from the audit MCP-call trail). Any bash in the proposal
  that does not appear in the agent's observed behaviour is flagged
  under prompt_injection: true. Setup/teardown proposals get extra
  scrutiny because they run as separate jobs with potentially
  different identities.

* src/agent_stats.rs: add propose_step_optimization_calls counter
  to AgentStats so authors can see, at a glance, how often the
  agent is finding hoist candidates. Defensive substring matching
  on the OTel span name (the Copilot CLI uses several shapes for
  MCP-tool spans across versions). All existing AgentStats
  constructor call sites updated; new test exercises the counter.

* tests/fixtures/self-optimization-agent.md + tests/compiler_tests.rs:
  end-to-end integration test confirming that a self-optimization-
  enabled fixture compiles cleanly AND emits the
  --enabled-tools propose-step-optimization flag. Counterpart to the
  unit tests in compile::common::tests::test_generate_enabled_tools_args_self_optimization_*.

All 2042 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The first end-to-end milestone for Flow B: when an agent calls
propose-step-optimization and self-optimization.staged is true
(the default), Stage 3 now:

1. Reads the self-optimization config from tool_configs (injected
   by main.rs from the front-matter self-optimization: block).
2. Validates the proposed section against allowed-sections; rejects
   proposals targeting sections the author hasn't opted in to.
3. IR-validates the proposed step block via validate_step_block
   with StepKindAllow::Curated (bash + typed-factory tasks only).
   IR-invalid proposals are rejected with structured error output.
4. Renders a formatted staged preview to the Stage 3 step log,
   showing section, rationale, token-savings estimate, and the
   proposed YAML in a copy-paste-ready format.

Authors can now opt in to self-optimization, watch the agent's
proposals accumulate in their build logs over a few runs, then
flip staged: false when they trust the proposals. The live-PR path
(stage3-live-pr-path todo) remains unimplemented — calling with
staged: false returns a clear failure message pointing at the
upcoming release.

Wiring changes:
- SelfOptimizationConfig gains Serialize (needed for serde_json
  round-trip into tool_configs).
- main.rs::build_execution_context injects the config into
  tool_configs["propose-step-optimization"] when self-optimization
  is enabled — parallel mechanism to the ado-aw-debug.create-issue
  config injection.
- execute.rs: registers ProposeStepOptimizationResult budget and
  dispatch route (new dispatch_opt_in_tools alongside
  dispatch_debug_tools).

All 2042 tests pass (unit + integration).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ication

* tests/bash_lint_tests.rs: add self-optimization-agent.md to the
  shellcheck fixture list.
* docs/safe-outputs.md: new Self-modification subsection documenting
  propose-step-optimization (opt-in activation, Stage 2 cross-check,
  Stage 3 staged/live behaviour).
* src/audit/analyzers/safe_outputs.rs: verified that existing
  proposed_count logic generically counts propose-step-optimization
  entries (no exclusion or special-casing needed — all NDJSON
  entries contribute to proposed_count regardless of tool type).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When self-optimization.staged is false, the executor now:

1. Reads the source .md from ctx.source_directory + the newly-added
   source_file_relative_path field on ExecutionContext (set in
   main.rs::run_execute from the --source CLI arg).
2. Patches the front matter via patch_front_matter(): finds the
   target section (steps/post-steps/setup/teardown), parses the
   YAML, appends the proposed entries, re-serializes.
3. Pushes a single-file edit commit to a new branch via ADO REST
   Pushes API (same endpoint and auth pattern as create-pull-request).
4. Opens a PR against the default branch via ADO REST Pull Requests
   API, with a structured body explaining the rationale, section,
   token savings, and that the proposal passed IR validation + Stage 2.

Branch naming: ado-aw/self-opt-<section>-<random-hex>.
Commit message: chore(ado-aw): self-optimize `<section>` steps.

Also adds:
- ExecutionContext.source_file_relative_path field (Option<String>)
  populated in main.rs from strip_prefix(source_directory) on the
  --source path.
- patch_front_matter() helper with 3 unit tests (insert into
  existing section, create absent section, reject missing fences).
- Fixes to two test files that needed the new field initialised.

All 2044 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e variants

* Idempotency: before opening a new PR in live mode, the executor
  now queries ADO for existing open PRs whose sourceRefName starts
  with `refs/heads/ado-aw/self-opt-<section>-`. If one exists, it
  returns success with a message pointing the author at the existing
  PR instead of opening a duplicate.

* New integration test (test_self_optimization_staged_false_compiles_identically)
  confirming that both staged: true and staged: false front-matter
  variants compile to the same pipeline YAML (staging is a Stage 3
  runtime decision, not a compile-time fork). Uses a new fixture
  (self-optimization-live-agent.md) with staged: false and extended
  allowed-sections: [steps, post-steps, setup].

All 2046 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
docs/self-optimization.md covers the opt-in self-optimization feature
end-to-end: configuration, three-stage flow, staged/live mode, allowed
step kinds, security model, hoist heuristic, troubleshooting.

AGENTS.md: add the new page to the Documentation Index.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Has a few real bugs and one security gap worth fixing before merge.


Findings

🐛 Bugs / Logic Issues

1. checkout, download, publish accepted in Curated mode — contradicts stated security model

src/compile/ir/step_validation.rsclassify_and_validate_step

The PR description and docs/self-optimization.md explicitly state the curated allow-list is "bash + typed-factory tasks only", but checkout, download, and publish steps pass StepKindAllow::Curated validation without restriction. The test accepts_multi_entry_mixed_step_block confirms this by calling validate_step_block(..., StepKindAllow::Curated) with a publish: entry and asserting success.

checkout: self with persistCredentials: true is particularly concerning in this mode — an agent could propose it to leave credentials available to subsequent steps. publish: could exfiltrate data as a build artifact.

Fix: add a StepKindAllow::Curated arm to classify_and_validate_step that rejects checkout, download, and publish with a clear error message pointing authors to StepKindAllow::Full (the authoring-time tool).


2. max_proposals_per_run is configured but never enforced at Stage 3

src/safeoutputs/propose_step_optimization.rsexecute_impl

SelfOptimizationConfig::max_proposals_per_run is validated, clamped at 50 in sanitize_config_fields, and documented in the front-matter grammar. But execute_impl never reads this field. Stage 3 processes every NDJSON entry independently, so a misbehaving agent could emit 100+ proposals regardless of the configured cap.

Stage 3 needs to count how many times this tool has already executed in the current run (e.g., by recording processed proposals in a sidecar file or checking the NDJSON archive) and return ExecutionResult::failure when the cap is exceeded.


3. patch_front_matter can split incorrectly on multiline YAML strings

src/safeoutputs/propose_step_optimization.rs:644

let Some(fence_end) = after_first_fence.find("\n---") else { ... };

find("\n---") returns the first occurrence of \n--- in the post-opening-fence content. If the front matter contains a multiline string value that includes --- on its own line — e.g.:

description: |
  ---
  some details here

...the split will land inside the YAML block, producing a truncated yaml_str and a corrupted body_with_fence. The resulting re-serialized file will be silently malformed. Change the search to "\n---\n" (or "\n---\r\n" for CRLF), which matches only a bare fence separator and won't fire on inline --- content.


🔒 Security Concerns

4. env: values in proposed steps not checked for ADO macro expansion syntax

src/compile/ir/step_validation.rscheck_env_string_values

The validator correctly rejects non-string env values, but it does not inspect whether string values contain ADO macro syntax ($(...)) or template expressions ($[...]). Stage 2 cross-checks bash bodies against source_command_evidence, but env values are outside that check. A proposal like:

- bash: ls -la   # agent did run this → evidence matches
  env:
    ADO_TOKEN: $(System.AccessToken)

would pass the IR validator, pass Stage 2 (bash body is in evidence), and — if merged — would expand $(System.AccessToken) in a future build and make the token available inside $ADO_TOKEN during the step. Consider rejecting values that match \$\( or \$\[ at the validator level as a defence-in-depth measure, consistent with the stated security model.


⚠️ Suggestions

5. Section comparison uses serde_yaml::to_string(s).trim() — fragile round-trip

src/safeoutputs/propose_step_optimization.rs:243–248

let section_allowed = config.allowed_sections.iter().any(|s| {
    serde_yaml::to_string(s)
        .unwrap_or_default()
        .trim()
        == section_wire
});

Comparing via a YAML serialization round-trip when a direct string conversion is already available elsewhere in this file (ProposalSection::as_wire_str) is unnecessarily fragile. Add as_wire_str() -> &'static str to StepSection (same approach as ProposalSection) and use that for the comparison. This eliminates the allocation and the implicit dependency on serde_yaml's enum formatting.


6. patch_front_matter silently strips YAML comments on re-serialisation

src/safeoutputs/propose_step_optimization.rs:651–679

Parsing the front matter into a serde_yaml::Value and re-serializing with serde_yaml::to_string will silently discard any # comments in the YAML block and may reorder keys. The live-mode PR will therefore contain spurious formatting changes beyond the intended step additions, making review noisier and potentially surprising authors who maintain documentation comments in their front matter. Worth calling out in the live-mode PR body ("Note: YAML comments in the front-matter block are not preserved across this rewrite") or in the docs.


✅ What Looks Good

  • The OPT_IN_GATED_TOOLS gating mechanism + validate_self_optimization_config rejecting safe-outputs.propose-step-optimization is a clean defence-in-depth layer.
  • Staged-by-default is the right choice — authors opt into real mutations explicitly.
  • validate_step_block collects all errors rather than short-circuiting; the comprehensive error message UX is good.
  • is_valid_task_identifier is simple, regex-free, and tests the edge cases well.
  • The source_command_evidence cross-check design for Stage 2 is sound.
  • SelfOptimizationConfig::sanitize_config_fields clamping max_proposals_per_run at 50 is the right call.

Generated by Rust PR Reviewer for issue #1061 · 367.8 AIC · ⌖ 12.7 AIC · ⊞ 33.7K ·

New prompt file teaching a Copilot CLI agent how to systematically
audit pipeline runs using the audit_build, logs, status, and
inspect_workflow MCP tools. Covers five analysis dimensions:

1. Cost & token efficiency (token trends, model choice, turns/output)
2. Hoist candidates (self-optimization opportunities from tool_usage)
3. Reliability & failure patterns (errors, timeouts, network blocks)
4. Safe-output quality (acceptance rate, noops, detection rejections)
5. Security posture (detection flags, firewall anomalies, MCP health)

Produces a structured markdown report with prioritized findings,
concrete front-matter change recommendations (including enabling
self-optimization), and optionally applies fixes via validate_steps
+ the update workflow.

Also updates:
- AGENTS.md: architecture tree + documentation index entry
- src/data/init-agent.md: dispatcher routing for "audit my pipeline"
  requests + use-case examples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Two findings need attention before merge — one security bug and one enforcement gap. The overall architecture and test coverage are strong.

Findings

🔒 Security Concerns

  • propose_step_optimization.rs:990-1011##vso[ pipeline-command injection via staged preview stdout

    In staged mode, proposed_yaml is printed directly to stdout without calling neutralize_pipeline_commands:

    // propose_step_optimization.rs ~line 990
    let proposed_yaml = serde_yaml::to_string(&yaml_value)
        .unwrap_or_else(|_| "<failed to render YAML>".to_string());
    // ...
    println!("{preview}");   // ← preview embeds proposed_yaml verbatim

    The SanitizeContent impl explicitly skips steps (justified for Stage 2 cross-check integrity), and validate_step_block checks structural shape and byte limits but does not strip ##vso[task.setvariable variable=X]Y sequences. Since Stage 3 runs as a regular ADO task, the ADO build agent processes ##vso[...] written to stdout as pipeline commands. A compromised agent could embed ##vso[task.setvariable variable=EXPLOIT]$(System.AccessToken) in a bash body (valid string ≤ 10KB), pass IR validation, and use Stage 3's staged preview to inject pipeline commands.

    The other println!/stdout paths in execute.rs correctly route through log_and_print_entry_result which calls neutralize_pipeline_commands(&result.message), but this println!("{preview}") call inside execute_impl bypasses that wrapper entirely.

    Fix: apply neutralize_pipeline_commands to proposed_yaml before embedding it in the preview string:

    use crate::sanitize::neutralize_pipeline_commands;
    let proposed_yaml_safe = neutralize_pipeline_commands(&proposed_yaml);
    // use proposed_yaml_safe in preview format string

🐛 Bugs / Logic Issues

  • propose_step_optimization.rs:920-1027max_proposals_per_run declared but never enforced in Stage 3

    execute_impl reads config.max_proposals_per_run from the tool config but never uses it:

    let config: crate::compile::types::SelfOptimizationConfig =
        ctx.get_tool_config("propose-step-optimization");
    
    // config.staged is used; config.allowed_sections is used;
    // config.max_proposals_per_run is ... never checked.

    execute_safe_outputs in execute.rs processes every NDJSON entry sequentially, dispatching each propose-step-optimization entry into a fresh execute_impl call with no shared counter. In staged mode this means unlimited previews. In live mode, the per-section idempotency check caps to one PR per section (max 4), but that bound is an incidental side-effect of idempotency, not enforcement of the documented max_proposals_per_run cap.

    The config promise in the docs and front-matter type should be enforced. One pattern used elsewhere: track counts in ExecutionContext via an Arc<Mutex<u32>> counter keyed on the tool name (similar to uploaded_pipeline_artifact_keys).

⚠️ Suggestions

  • propose_step_optimization.rs:925-930allows_section comparison via serde_yaml::to_string roundtrip

    let section_allowed = config.allowed_sections.iter().any(|s| {
        serde_yaml::to_string(s)
            .unwrap_or_default()
            .trim()
            == section_wire
    });

    This serializes each StepSection enum variant to YAML and string-compares. It works correctly today, but it's fragile — it relies on serde_yaml's enum-unit-variant format staying stable and matching the kebab-case wire strings that ProposalSection::as_wire_str() returns. StepSection already has #[serde(rename_all = "kebab-case")] so the values align, but the indirection is unnecessary. Adding fn as_wire_str(self) -> &'static str to StepSection (mirroring the identical method on ProposalSection) would let this be a direct == comparison and remove any serialization dependency from a security-critical allow-list check.

  • ir/mod.rs — stale #[allow(unused_imports)] comment

    The #[allow(unused_imports)] block says the downstream consumers "land in subsequent commits," but both consumers (validate_steps in mcp_author/mod.rs and propose-step-optimization Stage 3 in safeoutputs/propose_step_optimization.rs) are in this PR. The attribute is harmless since the items are used, but the comment is misleading. The #[allow] can be removed.

✅ What Looks Good

  • The curated task allow-list design (CURATED_TASK_IDS + is_curated_task + the lockstep test curated_task_ids_match_factories) is exactly the right pattern — the allow-list is the source of truth and the test enforces it stays in sync with the factories.
  • validate_step_block collecting all errors rather than short-circuiting is the right UX call for a validator that surfaces feedback to agents.
  • The OPT_IN_GATED_TOOLS / DEBUG_ONLY_TOOLS distinction in the MCP filter is a clean security boundary that makes the gating explicit and extensible.
  • Per-section idempotency check in live mode (prefix branch search before creating a new branch) is a thoughtful operator-friendliness touch.
  • Test coverage is comprehensive — the structural validator tests, section roundtrip tests, and patch_front_matter tests all cover the key failure modes.

Generated by Rust PR Reviewer for issue #1061 · 447.2 AIC · ⌖ 12.6 AIC · ⊞ 33.8K ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant