feat(compile): IR-validated step authoring & runtime self-optimization #1061
jamesadevine wants to merge 13 commits into
…t matter
* Add CURATED_TASK_IDS + is_curated_task() in src/compile/ir/tasks.rs. These will gate which TaskStep variants agent-proposed step blocks may reference (used by the IR fragment validator landing next). A new test asserts the const list stays in lock-step with every factory's emitted task identifier.
* Add SelfOptimizationConfig + StepSection in src/compile/types.rs and wire as the new self-optimization: front-matter section (opt-in; default off). Defaults: staged=true, max-proposals-per-run=3, allowed-sections=[steps, post-steps]. The sanitize impl clamps max-proposals-per-run at 50. Seven parsing tests lock the schema.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
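The defaults and the clamp described in this commit can be pictured with a minimal sketch. The type names `SelfOptimizationConfig` and `StepSection` come from the commit message; the field types, the `MAX_PROPOSALS_CAP` constant, and the exact shape of `sanitize` are assumptions made for illustration, not the code in src/compile/types.rs.

```rust
/// Front-matter sections a proposal may target (variant names assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StepSection {
    Steps,
    PostSteps,
    Setup,
    Teardown,
}

/// Hypothetical constant; the commit message only states the value 50.
pub const MAX_PROPOSALS_CAP: u32 = 50;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SelfOptimizationConfig {
    pub enabled: bool,
    pub staged: bool,
    pub max_proposals_per_run: u32,
    pub allowed_sections: Vec<StepSection>,
}

impl Default for SelfOptimizationConfig {
    fn default() -> Self {
        Self {
            enabled: false, // opt-in; default off
            staged: true,
            max_proposals_per_run: 3,
            allowed_sections: vec![StepSection::Steps, StepSection::PostSteps],
        }
    }
}

impl SelfOptimizationConfig {
    /// Clamp the per-run proposal budget to the cap.
    pub fn sanitize(&mut self) {
        self.max_proposals_per_run = self.max_proposals_per_run.min(MAX_PROPOSALS_CAP);
    }
}
```

In this sketch a front-matter value of 500 for max-proposals-per-run would be reduced to 50 by `sanitize`, while the default of 3 is left unchanged.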
…ocks
Add src/compile/ir/step_validation.rs exposing validate_step_block, the shared core for two upcoming layers:
* Flow A: a validate_steps MCP tool on the author-facing MCP server so the authoring agent can get IR-aligned feedback on proposed steps: blocks without round-tripping through ado-aw compile.
* Flow B: the propose-step-optimization safe-output Stage 3 executor must IR-validate any agent-proposed block before applying it to the source .md.
The validator operates on serde_yaml::Value (mirroring how the front-matter steps: field is treated today, as opaque YAML passed to ADO) rather than lowering to a typed Vec<Step> first; the IR has no public Value -> Step parser. It enforces:
* Top-level must be a sequence; each entry a mapping with exactly one of bash/task/checkout/download/publish.
* Unknown step-level keys are rejected (the most common shape-injection vector in untrusted YAML).
* env: values must be string scalars (nested maps/sequences could smuggle ADO macros or template expressions).
* bash bodies are capped at 10 KB.
* task identifiers must match Name@Version. Curated mode also restricts tasks to CURATED_TASK_IDS (the tasks.rs typed factories).
* All errors collected, not short-circuited.
17 unit tests cover happy paths, every failure class, and the task identifier parser.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
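To make the task-identifier rule and the "collect, don't short-circuit" behaviour concrete, here is a small std-only sketch. It deliberately works on plain strings rather than serde_yaml::Value, so it is not the real validate_step_block. The function names, the `AllowList` enum, and the assumption that a version is a run of ASCII digits are all illustrative; the curated IDs are the ones listed in the PR description.

```rust
/// Curated task IDs as listed in the PR description's security model.
const CURATED_TASK_IDS: &[&str] = &[
    "ArchiveFiles@2",
    "CopyFiles@2",
    "DockerInstaller@0",
    "DotNetCoreCLI@2",
    "PublishTestResults@2",
];

/// Hypothetical stand-in for the validator's allow-list mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllowList {
    Full,
    Curated,
}

/// Split `Name@Version`. Assumption: the name is non-empty ASCII
/// alphanumerics plus `_`, `-`, `.`, and the version is all digits.
pub fn parse_task_id(id: &str) -> Option<(&str, u32)> {
    let (name, version) = id.split_once('@')?;
    let name_ok = !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | '.'));
    let version_ok = !version.is_empty() && version.chars().all(|c| c.is_ascii_digit());
    if !name_ok || !version_ok {
        return None;
    }
    version.parse::<u32>().ok().map(|v| (name, v))
}

/// Check every task ID and return all problems found, not just the first.
pub fn validate_task_ids(ids: &[&str], mode: AllowList) -> Vec<String> {
    let mut errors = Vec::new();
    for (i, id) in ids.iter().enumerate() {
        if parse_task_id(id).is_none() {
            errors.push(format!("step {i}: task `{id}` is not of the form Name@Version"));
        } else if mode == AllowList::Curated && !CURATED_TASK_IDS.contains(id) {
            errors.push(format!("step {i}: task `{id}` is not in the curated allow-list"));
        }
    }
    errors
}
```

Under these assumptions, `AzureCLI@2` is accepted in `Full` mode and rejected in `Curated` mode, and a block with two bad entries yields two errors in a single pass.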
The new tool exposes compile::ir::validate_step_block over MCP so an
authoring agent (Copilot Chat / Claude / Codex) can get IR-aligned
feedback on a proposed front-matter steps: block without
round-tripping through the full ado-aw compile flow.
Input: a JSON array of step entries (same shape ADO accepts in
YAML) plus an optional allow_list mode ("full" default,
"curated" to additionally restrict tasks to the tasks.rs typed-
factory set). The MCP transport speaks JSON; the tool round-trips
through a YAML text representation before handing to the validator
(serde_yaml::Value is a strict superset of serde_json::Value, so
the conversion is lossless for JSON-shaped inputs).
Output: structured response { ok, kinds } or { ok, errors }.
Errors are collected, not short-circuited, so the agent gets the
full picture in one round.
Adds Serialize derives on StepKind + StepValidationError to allow
the structured MCP response and the future propose-step-optimization
Stage 3 audit emission.
Three new tests cover the registration, the curated-mode rejection
of AzureCLI@2, and the invalid_params error on a bad allow_list.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…teps
* create-ado-agentic-workflow.md: Step 13 (Inline Steps) now leads with a 'hoist candidates' heuristic the authoring agent uses to decide what work belongs in steps:/post-steps: vs the prompt body, with concrete examples and a 'Validate before committing' subsection pointing at the new validate_steps MCP tool.
* update-ado-agentic-workflow.md: mirrors the heuristic for edit flows so the update agent re-examines hoist candidates before modifying prompt bodies.
* debug-ado-agentic-workflow.md: adds a slow-build diagnostic that surfaces hoist candidates from the audit command history.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* docs/front-matter.md: new 'Self-optimization (opt-in)' section documenting the self-optimization: block (enabled, staged, max-proposals-per-run, allowed-sections), with the canonical YAML example slotted in alongside execution-context.
* docs/mcp-author.md: new validate_steps entry covering input schema, full vs curated allow-list modes, and the structured success/error response shapes.
* docs/extending.md: pointer to compile::ir::validate_step_block as the shared structural validator for any component accepting an untrusted step block (always use StepKindAllow::Curated for agent-proposed input).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Stage-1 surface of Flow B for runtime self-optimization. When
the agent's front matter sets self-optimization.enabled: true, the
Stage-1 agent gets access to a structured safe-output that lets it
propose lifting deterministic bash work (clone, install, cache
restore, artifact download) out of its prompt body and into the
front-matter steps:/post-steps: section.
Params (validated at Stage 1):
- section: which front-matter section to propose into (steps,
post-steps, setup, teardown — kebab-case wire format mirroring
StepSection in compile::types).
- rationale: <= 2 KB, non-empty.
- estimated_token_savings: optional hint.
- steps: JSON array of ADO step entries (deep structural validation
runs at Stage 3 via compile::ir::validate_step_block in Curated
mode — bash + tasks.rs typed factories only).
- source_command_evidence: bash commands the agent actually ran,
capped at 64 entries / 10 KB each. Stage 2 cross-checks this
against the audit command history; any bash in steps without a
matching evidence entry is a prompt-injection signal.
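The size limits in the parameter list above can be sketched as a single Stage-1 check. The limits (2 KB rationale, 64 evidence entries, 10 KB per entry) are taken from the commit message; the function name, the constant names, the byte-length interpretation of "KB", and the error strings are assumptions for illustration.

```rust
pub const MAX_RATIONALE_BYTES: usize = 2 * 1024;
pub const MAX_EVIDENCE_ENTRIES: usize = 64;
pub const MAX_EVIDENCE_ENTRY_BYTES: usize = 10 * 1024;

/// Hypothetical Stage-1 check over `rationale` and `source_command_evidence`.
/// All violations are reported together.
pub fn validate_params(rationale: &str, evidence: &[String]) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();
    if rationale.trim().is_empty() {
        errors.push("rationale must not be empty".to_string());
    }
    if rationale.len() > MAX_RATIONALE_BYTES {
        errors.push(format!("rationale exceeds {MAX_RATIONALE_BYTES} bytes"));
    }
    if evidence.len() > MAX_EVIDENCE_ENTRIES {
        errors.push(format!(
            "source_command_evidence has more than {MAX_EVIDENCE_ENTRIES} entries"
        ));
    }
    for (i, entry) in evidence.iter().enumerate() {
        if entry.len() > MAX_EVIDENCE_ENTRY_BYTES {
            errors.push(format!(
                "source_command_evidence[{i}] exceeds {MAX_EVIDENCE_ENTRY_BYTES} bytes"
            ));
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

The deep structural check of `steps` is intentionally absent here, since the commit defers it to Stage 3.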
Stage 3 placeholder: this commit records the proposal in
safe_outputs.ndjson for audit visibility but emits no preview or PR.
The staged-preview renderer and live-mode PR opener land in
subsequent commits.
Gating (parallel mechanism to ado-aw-debug's DEBUG_ONLY_TOOLS):
- New OPT_IN_GATED_TOOLS const in src/safeoutputs/mod.rs listing
propose-step-optimization.
- MCP layer in src/mcp.rs strips the route unless explicitly
listed in --enabled-tools (the same path used for debug-only
tools, generalised from is_debug_only -> is_gated).
- New self_optimization_enabled(fm) predicate in compile/common.rs
drives whether generate_enabled_tools_args adds the tool to the
--enabled-tools list.
- New validate_self_optimization_config validator (wired in
agentic_pipeline.rs) rejects safe-outputs.propose-step-optimization
with a clear "use self-optimization: instead" message, and
rejects an empty allowed-sections list when the feature is
enabled.
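The gating rule described above (a gated route is stripped unless it is explicitly listed in --enabled-tools) can be sketched as a filter. `OPT_IN_GATED_TOOLS` and `is_gated` are names from the commit message; `exposed_tools` and the assumption that non-gated tools always pass through are illustrative simplifications of whatever src/mcp.rs actually does.

```rust
/// Tools that are hidden unless explicitly enabled (list from the commit message).
pub const OPT_IN_GATED_TOOLS: &[&str] = &["propose-step-optimization"];

pub fn is_gated(tool: &str) -> bool {
    OPT_IN_GATED_TOOLS.contains(&tool)
}

/// Hypothetical helper: keep every non-gated tool, and keep a gated tool
/// only when it appears in the explicit `enabled` list.
pub fn exposed_tools<'a>(all: &[&'a str], enabled: &[&str]) -> Vec<&'a str> {
    all.iter()
        .copied()
        .filter(|tool| !is_gated(tool) || enabled.contains(tool))
        .collect()
}
```

With this shape, the compiler only has to decide whether to add the tool name to the enabled list; the MCP layer never needs to read the front matter itself.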
Sanitisation: rationale runs through sanitize_config; steps and
source_command_evidence pass through unchanged because the Stage 3
IR validator enforces structure and mangling them would break the
Stage 2 command-history cross-check.
Tests: 10 unit tests in propose_step_optimization (round-trip,
section enum, validation rejection paths, dry-run summary,
sanitization invariants); 12 new compile-side tests
(self_optimization_enabled, generate_enabled_tools_args inclusion/
exclusion, validate_self_optimization_config rejection paths). All
2040 tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion test
* src/data/threat-analysis.md: extend the threat-detection prompt with a fourth analysis item (Unsupported Step-Optimization Proposal). The detection agent now cross-checks every propose-step-optimization entry's params.steps against params.source_command_evidence and the agent's actual command history (from the audit MCP-call trail). Any bash in the proposal that does not appear in the agent's observed behaviour is flagged under prompt_injection: true. Setup/teardown proposals get extra scrutiny because they run as separate jobs with potentially different identities.
* src/agent_stats.rs: add propose_step_optimization_calls counter to AgentStats so authors can see, at a glance, how often the agent is finding hoist candidates. Defensive substring matching on the OTel span name (the Copilot CLI uses several shapes for MCP-tool spans across versions). All existing AgentStats constructor call sites updated; new test exercises the counter.
* tests/fixtures/self-optimization-agent.md + tests/compiler_tests.rs: end-to-end integration test confirming that a self-optimization-enabled fixture compiles cleanly AND emits the --enabled-tools propose-step-optimization flag. Counterpart to the unit tests in compile::common::tests::test_generate_enabled_tools_args_self_optimization_*.
All 2042 tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The first end-to-end milestone for Flow B: when an agent calls propose-step-optimization and self-optimization.staged is true (the default), Stage 3 now:
1. Reads the self-optimization config from tool_configs (injected by main.rs from the front-matter self-optimization: block).
2. Validates the proposed section against allowed-sections; rejects proposals targeting sections the author hasn't opted in to.
3. IR-validates the proposed step block via validate_step_block with StepKindAllow::Curated (bash + typed-factory tasks only). IR-invalid proposals are rejected with structured error output.
4. Renders a formatted staged preview to the Stage 3 step log, showing section, rationale, token-savings estimate, and the proposed YAML in a copy-paste-ready format.
Authors can now opt in to self-optimization, watch the agent's proposals accumulate in their build logs over a few runs, then flip staged: false when they trust the proposals.
The live-PR path (stage3-live-pr-path todo) remains unimplemented: calling with staged: false returns a clear failure message pointing at the upcoming release.
Wiring changes:
- SelfOptimizationConfig gains Serialize (needed for serde_json round-trip into tool_configs).
- main.rs::build_execution_context injects the config into tool_configs["propose-step-optimization"] when self-optimization is enabled, a parallel mechanism to the ado-aw-debug.create-issue config injection.
- execute.rs: registers ProposeStepOptimizationResult budget and dispatch route (new dispatch_opt_in_tools alongside dispatch_debug_tools).
All 2042 tests pass (unit + integration).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
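The order of checks in this commit (section allow-list, then IR validation, then staged preview or a "not yet implemented" failure) can be summarised in a small decision function. This is only a sketch of the control flow: `Stage3Outcome`, `stage3_decide`, the string-typed section, and the preview format are hypothetical, and the IR errors are passed in rather than computed.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Stage3Outcome {
    /// Proposal refused, with every reason collected.
    Rejected(Vec<String>),
    /// Text that would be written to the Stage 3 step log.
    StagedPreview(String),
    /// staged: false at this commit, before the live-PR path exists.
    LiveNotImplemented,
}

/// Hypothetical decision order for one proposal.
/// `ir_errors` stands in for the result of the IR validator.
pub fn stage3_decide(
    section: &str,
    allowed_sections: &[&str],
    ir_errors: Vec<String>,
    staged: bool,
    steps_yaml: &str,
) -> Stage3Outcome {
    if !allowed_sections.contains(&section) {
        return Stage3Outcome::Rejected(vec![format!(
            "section `{section}` is not in allowed-sections"
        )]);
    }
    if !ir_errors.is_empty() {
        return Stage3Outcome::Rejected(ir_errors);
    }
    if !staged {
        return Stage3Outcome::LiveNotImplemented;
    }
    Stage3Outcome::StagedPreview(format!("{section}:\n{steps_yaml}"))
}
```

Checking the section before running IR validation means a proposal aimed at a section the author never opted in to is refused without inspecting its contents.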
…ication
* tests/bash_lint_tests.rs: add self-optimization-agent.md to the shellcheck fixture list.
* docs/safe-outputs.md: new Self-modification subsection documenting propose-step-optimization (opt-in activation, Stage 2 cross-check, Stage 3 staged/live behaviour).
* src/audit/analyzers/safe_outputs.rs: verified that existing proposed_count logic generically counts propose-step-optimization entries (no exclusion or special-casing needed; all NDJSON entries contribute to proposed_count regardless of tool type).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When self-optimization.staged is false, the executor now:
1. Reads the source .md from ctx.source_directory + the newly-added source_file_relative_path field on ExecutionContext (set in main.rs::run_execute from the --source CLI arg).
2. Patches the front matter via patch_front_matter(): finds the target section (steps/post-steps/setup/teardown), parses the YAML, appends the proposed entries, re-serializes.
3. Pushes a single-file edit commit to a new branch via ADO REST Pushes API (same endpoint and auth pattern as create-pull-request).
4. Opens a PR against the default branch via ADO REST Pull Requests API, with a structured body explaining the rationale, section, token savings, and that the proposal passed IR validation + Stage 2.
Branch naming: ado-aw/self-opt-<section>-<random-hex>.
Commit message: chore(ado-aw): self-optimize `<section>` steps.
Also adds:
- ExecutionContext.source_file_relative_path field (Option<String>) populated in main.rs from strip_prefix(source_directory) on the --source path.
- patch_front_matter() helper with 3 unit tests (insert into existing section, create absent section, reject missing fences).
- Fixes to two test files that needed the new field initialised.
All 2044 tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
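Before the front matter can be patched, the two `---` fences have to be located. The sketch below shows one line-based way to do that, assuming a fence is a line consisting only of `---` (ignoring trailing whitespace). `split_front_matter` is a hypothetical helper, not the patch_front_matter() added by this commit, and it does not parse or re-serialize YAML.

```rust
/// Return `(front_matter, body)` when `src` starts with a `---` line and a
/// later line is exactly `---`. Indented `---` lines (for example inside a
/// YAML block scalar) are not treated as fences.
pub fn split_front_matter(src: &str) -> Option<(&str, &str)> {
    let mut offset = 0usize; // byte offset of the current line
    let mut fm_start: Option<usize> = None; // byte offset just after the opening fence
    for line in src.split_inclusive('\n') {
        let is_fence = line.trim_end() == "---";
        match fm_start {
            None => {
                // The very first line must be the opening fence.
                if !is_fence {
                    return None;
                }
                fm_start = Some(offset + line.len());
            }
            Some(start) if is_fence => {
                let body_start = offset + line.len();
                return Some((&src[start..offset], &src[body_start..]));
            }
            Some(_) => {}
        }
        offset += line.len();
    }
    None // no opening fence, or no closing fence
}
```

Compared with a substring search, matching whole lines keeps a `---` that is indented inside a multi-line scalar from being mistaken for the closing fence, and a document with no closing fence is rejected rather than truncated.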
…e variants
* Idempotency: before opening a new PR in live mode, the executor now queries ADO for existing open PRs whose sourceRefName starts with `refs/heads/ado-aw/self-opt-<section>-`. If one exists, it returns success with a message pointing the author at the existing PR instead of opening a duplicate.
* New integration test (test_self_optimization_staged_false_compiles_identically) confirming that both staged: true and staged: false front-matter variants compile to the same pipeline YAML (staging is a Stage 3 runtime decision, not a compile-time fork). Uses a new fixture (self-optimization-live-agent.md) with staged: false and extended allowed-sections: [steps, post-steps, setup].
All 2046 tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
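The dedup rule reduces to a prefix match on each open PR's sourceRefName. A minimal sketch, assuming the list of open source refs has already been fetched from ADO; both function names are hypothetical, and the REST call itself is out of scope here.

```rust
/// Branch-ref prefix for self-optimization PRs targeting `section`
/// (format taken from the commit message).
pub fn self_opt_branch_prefix(section: &str) -> String {
    format!("refs/heads/ado-aw/self-opt-{section}-")
}

/// Return the first open PR source ref that already targets `section`, if any.
pub fn find_existing_pr<'a>(open_source_refs: &[&'a str], section: &str) -> Option<&'a str> {
    let prefix = self_opt_branch_prefix(section);
    open_source_refs
        .iter()
        .copied()
        .find(|source_ref| source_ref.starts_with(&prefix))
}
```

Because the prefix includes the trailing hyphen after the section name, an open PR for `post-steps` does not suppress a new proposal for `steps`.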
docs/self-optimization.md covers the opt-in self-optimization feature end-to-end: configuration, three-stage flow, staged/live mode, allowed step kinds, security model, hoist heuristic, troubleshooting.
AGENTS.md: add the new page to the Documentation Index.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 Rust PR Review
Summary: Has a few real bugs and one security gap worth fixing before merge.
Findings
🐛 Bugs / Logic Issues
1.
The PR description and
Fix: add a
2.
Stage 3 needs to count how many times this tool has already executed in the current run (e.g., by recording processed proposals in a sidecar file or checking the NDJSON archive) and return
3.
let Some(fence_end) = after_first_fence.find("\n---") else { ... };
description: |
  ---
  some details here...
the split will land inside the YAML block, producing a truncated
🔒 Security Concerns
4.
The validator correctly rejects non-string env values, but it does not inspect whether string values contain ADO macro syntax (
- bash: ls -la # agent did run this → evidence matches
  env:
    ADO_TOKEN: $(System.AccessToken)
would pass the IR validator, pass Stage 2 (bash body is in evidence), and, if merged, would expand
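One possible shape for the missing check is a conservative scan of env string values for the opening markers of Azure Pipelines expressions: `$(` for macros, `${{` for template expressions, and `$[` for runtime expressions. This is a sketch under that assumption; the function names are hypothetical and the review does not say which fix was adopted.

```rust
/// True when `value` contains the opening marker of an ADO macro `$(...)`,
/// template expression `${{ ... }}`, or runtime expression `$[...]`.
/// Deliberately conservative: it also rejects values that merely look similar.
pub fn contains_ado_expression(value: &str) -> bool {
    ["$(", "${{", "$["]
        .iter()
        .any(|marker| value.contains(*marker))
}

/// Hypothetical helper: names of env entries whose values would be
/// expanded by the pipeline.
pub fn offending_env_keys<'a>(env: &[(&'a str, &str)]) -> Vec<&'a str> {
    env.iter()
        .filter(|(_, value)| contains_ado_expression(value))
        .map(|(key, _)| *key)
        .collect()
}
```

Applied to the example in this finding, the `ADO_TOKEN` entry would be reported while a plain value such as `fast` would pass.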
New prompt file teaching a Copilot CLI agent how to systematically audit pipeline runs using the audit_build, logs, status, and inspect_workflow MCP tools. Covers five analysis dimensions:
1. Cost & token efficiency (token trends, model choice, turns/output)
2. Hoist candidates (self-optimization opportunities from tool_usage)
3. Reliability & failure patterns (errors, timeouts, network blocks)
4. Safe-output quality (acceptance rate, noops, detection rejections)
5. Security posture (detection flags, firewall anomalies, MCP health)
Produces a structured markdown report with prioritized findings, concrete front-matter change recommendations (including enabling self-optimization), and optionally applies fixes via validate_steps + the update workflow.
Also updates:
- AGENTS.md: architecture tree + documentation index entry
- src/data/init-agent.md: dispatcher routing for "audit my pipeline" requests + use-case examples
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 Rust PR Review
Summary: Two findings need attention before merge: one security bug and one enforcement gap. The overall architecture and test coverage are strong.
Findings
🔒 Security Concerns
🐛 Bugs / Logic Issues
feat(compile): IR-validated step authoring & runtime self-optimization
Summary
Two new agent-driven capabilities that don't exist in ado-aw (or upstream gh-aw) today:
Flow A: Authoring-time step hoisting. When an authoring agent creates or updates a pipeline, it proactively scaffolds concrete steps:/post-steps: entries for deterministic, non-LLM work (clone, cache restore, CLI install). A new validate_steps MCP tool on the author-facing server gives the agent IR-level feedback before committing.
Flow B: Runtime self-optimization. Opt-in via self-optimization: enabled: true. The Stage-1 runtime agent recognises bash it ran successfully and proposes lifting it into front-matter steps via a new structured safe-output. Stage 2 cross-checks proposals against command history (anti-injection). Stage 3 IR-validates (Curated allow-list: bash + typed-factory tasks only) and either renders a 🎭 build-log preview (staged mode, the default) or opens a PR against the source .md (live mode, with idempotency dedup).
Self-audit prompt: New prompts/audit-ado-agentic-workflow.md teaches a Copilot CLI agent to systematically audit pipeline runs across 5 dimensions (cost, hoist candidates, reliability, safe-output quality, security) and produce actionable reports with concrete front-matter patches.
What shipped (13 commits)
- compile::ir::validate_step_block structural validator + CURATED_TASK_IDS allow-list
- validate_steps MCP tool + 3 authoring prompt updates (hoist heuristic)
- self-optimization: section (enabled, staged, max-proposals-per-run, allowed-sections)
- propose-step-optimization safe-output + OPT_IN_GATED_TOOLS gating
- AgentStats.propose_step_optimization_calls OTel counter
- prompts/audit-ado-agentic-workflow.md + dispatcher routing
Configuration
Security model
- tasks.rs are accepted (today: ArchiveFiles@2, CopyFiles@2, DockerInstaller@0, DotNetCoreCLI@2, PublishTestResults@2).
- source_command_evidence: ungrounded proposals are flagged as prompt injection.
- staged: false once they trust the agent's judgement.
- setup/teardown require explicit opt-in (different job identities).
- safe-outputs.propose-step-optimization: the tool is ONLY activated via the self-optimization: section.
Testing