Add live golden-path (Tier 2) pipeline for azd ai agent extension#8758
Add live golden-path (Tier 2) pipeline for azd ai agent extension#8758v1212 wants to merge 24 commits into
Conversation
Adds eng/pipelines/ext-azure-ai-agents-live.yml, an on-demand/weekly Azure DevOps pipeline that drives the real 'azd ai agent' CLI through tmux against live Azure (TME), exercising init -> provision -> deploy -> invoke -> down for both code and container deploy modes. This is the live counterpart to the PR-gate checks (Tier 0 offline + Tier 1 recording/playback in #8754). Per Azure SDK EngSys / SFI guidance, live access stays out of the automatic PR pipeline (trigger: none) and runs only via '/azp run ext-azure-ai-agents-live' or the weekly schedule. The Tier 2 tmux driver (test_full_e2e.py, test_tier2.py) is migrated from the #8692 prototype; CI auth detection is extended to recognize Azure DevOps (TF_BUILD) and an explicit E2E_USE_AZ_CLI_AUTH override.
…zSessionActive The azure-sdk-tests service connection uses Workload Identity Federation, whose az session is isolated to the task's private AZURE_CONFIG_DIR and expires after ~10 min. Running the ~50 min golden-path test (and the cleanup) as plain bash steps after a separate login step would fail auth on both counts. Run them inside AzureCLI@2 with keepAzSessionActive:true (matching build-cli.yml) so the session stays refreshed and reaches azd (auth.useAzCliAuth) through tmux, which inherits AZURE_CONFIG_DIR. Subscription/tenant are now read in-script via az account show instead of cross-step pipeline variables.
test_tier2.py always ran sequentially, but kept a tautological if-condition (len==1 or len>1), an unused concurrent.futures import, a no-op --serial flag, and a docstring/print claiming parallel execution. Simplify to an explicit sequential loop and update the docstring to match. Also fix test_full_e2e.py's module docstring to point at README.md (LOCAL-TEST-GUIDE.md does not exist).
There was a problem hiding this comment.
Pull request overview
This PR adds a dedicated Azure DevOps pipeline and supporting Python drivers/docs to run the Tier 2 live golden-path E2E for the azure.ai.agents azd extension (init → provision → deploy → invoke → down) against real Azure resources, outside of the GitHub PR gate.
Changes:
- Added an on-demand + weekly ADO pipeline (
ext-azure-ai-agents-live) to run Tier 2 live E2E for bothcodeandcontainerdeploy modes. - Added a tmux-driven Python Tier 2 runner (
test_tier2.py) and a full golden-path driver (test_full_e2e.py) adapted for ADO CI auth detection. - Added documentation for running the live Tier 2 tests locally and in CI.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| eng/pipelines/ext-azure-ai-agents-live.yml | New ADO pipeline wiring to build azd + extension, run Tier 2 live E2E under AzureCLI@2, and publish logs/cleanup. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/test_tier2.py | Tier 2 orchestrator: runs code + container golden paths sequentially with isolation and timeout handling. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/test_full_e2e.py | tmux-driven end-to-end golden path driver (init/provision/deploy/invoke/down) with CI/local auth switching. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/README.md | Docs for CI setup (ADO registration/service connection/secrets) and local WSL execution. |
- Use the ambient azure-sdk org secret `azuresdk-github-pat` for GH_TOKEN instead of an empty `GitHubPat` placeholder variable (mirrors eval-waza.yml); removes a misleading masked variable and the need for admin PAT setup. - Harden the AzureCLI@2 inline script: `set -euo pipefail` and assign-then-verify subscription/tenant so an `az account show` failure fails fast (a plain `export X=$(...)` would have masked the error from set -e). - Reword the extension-install comment to be self-contained (it no longer inaccurately claims to mirror lint-ext-azure-ai-agents.yml). - Clarify the test_full_e2e.py auth prerequisite: only local WSL runs leave auth.useAzCliAuth unset; CI auto-enables az CLI auth. - Clear tmux scrollback after env setup so the exported GH token cannot leak into capture() output on failures/timeouts. - _cleanup_leaked_resources now checks azd down's return code and reports failures instead of always printing "Cleanup complete".
- Stream child E2E output live with a watchdog-enforced hard timeout instead of buffering everything via capture_output - Shell-escape the GitHub token (shlex.quote) before exporting in tmux - Clean up the per-mode AZD_CONFIG_DIR temp copy unless E2E_KEEP_ARTIFACTS - Use sha256 instead of md5 for the agent-name uniqueness suffix - Derive the agent binary arch from uname -m instead of hard-coding amd64
- Shell-escape HOME/PATH/TENANT, the cd target, and the agent name with shlex.quote() (consistent with the earlier token fix) - On Tier 2 timeout, kill the child's detached tmux server so reused CI agents do not accumulate orphaned tmux sockets
pyflakes flagged two f-strings with no interpolation; they are static text, so the f-prefix is dead. Plain string literals keep the linter (and Copilot) clean. No behavior change.
The docstring and README claimed tmux >=3.4, but the live pipeline installs tmux via apt-get on ubuntu-22.04 (LINUXVMIMAGE), which ships 3.2a. The driver only uses basic verbs (send-keys -l, capture-pane -p, new-session -x/-y, clear-history, kill-session/server) available well before 3.2, so >=3.4 was inaccurate and misleading. Align both docs to >=3.2.
jongio
left a comment
There was a problem hiding this comment.
Solid Tier 2 pipeline addition. The timeout hierarchy (phase budgets < child watchdog 60m < step 130m < job 150m) with explicit margin documentation is particularly well-thought-out, and the three-layer resource cleanup (in-test teardown, parent watchdog cleanup, pipeline condition: always() safety net) gives good confidence that leaked resources won't accumulate.
A few minor observations (none blocking):
File size (test_full_e2e.py, 925 lines): If more live test scenarios are planned (e.g., Tier 2 for other extensions), the tmux interaction helpers (lines ~79-170: tmux(), send(), capture(), wait_for(), run_cmd(), etc.) could be extracted into a shared utility module. For a single test driver this is fine.
Signal handling: The __main__ block doesn't wrap the phase chain in try/finally for tmux cleanup. If an unexpected exception occurs after setup() but before the tmux("kill-session") at the bottom, the tmux session/socket persists on the agent. The parent's watchdog and pipeline cleanup step cover this in practice, so it's not urgent.
Invoke regex: Nice work on (?<![\w.])4(?!\.\d)(?!\w) for the "2+2" assertion. Correctly rejects "gpt-4o-mini", "4.1", "404" while matching standalone "4". Clean.
|
Request: switch the Tier 2 interactive driver from tmux/Python to Go ( Thanks for this — the live golden-path coverage and the safety nets (watchdog, Why
Parity — yes, it can do everything the tmux/Python driver doesI went through the current driver capability-by-capability:
The one thing to validate is that Note: today only PHASE 1 ( The bigger plan (future iterations) — a shared driver for all extensions and azdThis isn't a one-off ask. We're planning a shared, public test-support module (tracked in #8795) that exposes this interactive driver so every Go extension — and azd core — defines interactive tests without hand-rolling a per-extension harness. Sketch of the extension-facing API: sess := extframework.StartInteractive(t, "ai", "agent", "init") // PTY + vt10x + go-expect, auto-cleanup via t.Cleanup
sess.ExpectString("Select a subscription")
sess.Send(extframework.KeyDown, extframework.KeyEnter)
sess.ExpectString("Enter a name")
sess.SendLine("my-agent")
sess.ExpectString("added to your azd project successfully")
require.Contains(t, sess.Screen(), "Done")
out, code := extframework.Run(t, "provision", "--no-prompt") // non-interactive phases reuse the plain runnerThe same driver is intended to serve Tier 1 (recorded/playback) and Tier 2 (live), sharing Suggested path forward
If the timing of #8795 is a concern, we can land the Go driver locally in the extension first and migrate it into the shared module afterward — but I'd prefer to avoid merging a tmux/Python driver that we know we'll replace. Let me know if you'd like me to prototype the |
Per review feedback, replace the tmux-driven Python golden-path driver with a Go test so the live suite uses the same toolchain as the rest of the extension and needs no Python/tmux on the agent. - TestTier2Live drives the interactive `azd ai agent init` through a pseudo-terminal (Netflix/go-expect sends keys, hinshun/vt10x renders the screen, creack/pty provides the PTY); provision/deploy/invoke/down shell out to azd with --no-prompt. Per-mode 60m context drives the graceful azd-down teardown via t.Cleanup. - The PTY driver is tagged //go:build linux; the answer matcher and its table unit tests build and run on any platform. - Pipeline builds and runs `go test -run TestTier2Live` and drops the Python/tmux setup steps; promote go-expect/vt10x/creack-pty to direct requires; refresh README and cspell. - Remove test_tier2.py and test_full_e2e.py.
…d source Replace the init driver's fixed 3s polling with go-expect event-driven reads that block until the survey UI settles, then dispatch on the verbatim prompt strings the extension prints (each case annotated with its source file:line). Select deploy mode via the --deploy-mode flag, which init auto-resolves rather than prompts when a manifest is supplied. Document the dispatch-loop design and the live-only verification path in the README. Addresses review feedback on #8758.
jongio
left a comment
There was a problem hiding this comment.
The Go rewrite (commits c62a696 and 4d27988) directly addresses @vhvb1989's direction request. The interactive driver now uses go-expect + vt10x + creack/pty with event-driven synchronization (waitForQuiet) instead of fixed-interval polling. This is solid work.
What looks good:
- Event-driven sync:
promptQuiet/listSettlewaits for the UI to stop emitting before dispatching, rather than sleeping a fixed interval. Correct approach for flake-free interactive tests. - Timeout hierarchy: per-mode
runTimeout(60m) <go test -timeout(125m) < ADO step (130m) < job (150m). Each layer has room for graceful teardown before the next one kills it. - Cleanup safety:
assertSafeTestDirblocks dangerousos.RemoveAllon protected paths;t.Cleanup(r.teardown)guaranteesazd down --force --purgeruns even on failure; thecondition: always()pipeline step is a crash-path backstop. - Platform-portable assert logic:
assert.gobuilds on all platforms (no build tag), with the regex-free standalone-"4" detection explained clearly. - Pipeline arch detection:
uname -mmapping to GOARCH means the step won't break if the pool moves off amd64.
One observation for the conversation with @vhvb1989:
The latest commits already deliver what was requested. The old Python/tmux files are gone, the Go driver is in place, and it uses exactly the libraries and approach described in the direction-change comment. The driver is local to the extension's tests/e2e-live/ for now (as suggested), ready to lift into the shared module when #8795 lands.
Minor notes (non-blocking):
tier2_live_test.goat 859 lines is large for a single file, but splitting a sequential test flow (init/provision/deploy/invoke/down + all helpers) across multiple files would hurt readability. Acceptable.- The
dispatchPromptsource-line annotations (e.g.,init.go:722) are useful for maintenance but will become stale as the extension evolves. The README already documents this trade-off ("changes there can require updatingdispatchPrompt"). Consider a lint or generator if drift becomes a maintenance burden. findServiceNameuses line-by-line string matching to parseazure.yaml. This is pragmatic for the known scaffolded output format and avoids pulling in a YAML library for one test helper. Fine as-is.
|
Thanks @vhvb1989 — done. Ported the interactive Per "one thing to validate" note: I did a source-grounded spike and calibrated the prompt dispatch to the real |
|
Thanks @jongio. The driver changed — the Go rewrite removed the original Python/tmux code, so those first-review points are gone. The three notes on the new code are acknowledged trade-offs (the |
jongio
left a comment
There was a problem hiding this comment.
My prior analysis (two rounds against this same HEAD) still holds. The Go rewrite using go-expect + vt10x + creack/pty is solid: event-driven synchronization via waitForQuiet, clean timeout hierarchy, three-layer resource cleanup, and well-tested assert logic.
One code modernization note below. Also: the PR currently has merge conflicts that need resolving before merge.
No blocking issues from my side.
There was a problem hiding this comment.
The Go rewrite from tmux+Python to a PTY-based Go driver is a significant improvement: same toolchain as the rest of the extension, no Python/tmux dependency on the agent, and the event-driven dispatch loop (wait for survey UI to settle, then match the rendered prompt) is more robust than fixed polling.
A few observations on the final state:
-
findServiceNamenaive YAML walk (tier2_live_test.go): The line-by-line parser assumes a specific indentation structure. Sinceazd ai agent initalways scaffolds the same format this is fine today, but if the template output changes it will silently break. Consider a targetedgrep -m1approach or importing a minimal YAML decoder if the function gets reused. -
Timeout cascading is well-designed: per-mode 60m > sum-of-phases 50m (10m teardown margin), step 130m > go-test 125m, job 150m > step. This ensures graceful teardown always runs before hard kills, preventing resource leaks. The commit that adjusted these (db53b06) explains the invariant clearly.
-
responseHasExpectedAnswerrejects "4.0" by design (the4followed by.then digit rule). Worth a brief code comment noting this is intentional, since a future maintainer might wonder why a valid math answer is rejected. The existing unit tests cover this implicitly via the "version" case ("4.1") but an explicit "4.0" case in the table would make the design intent obvious. -
Dependency versions (
go-expect,creack/pty,vt10x) are the same 2022 pseudo-versions that AlecAivazis/survey uses in its own test suite. They are proven to work well together for this exact use case. No action needed, but worth noting in case a future dep audit flags them as stale.
Per AGENTS.md Go 1.26 guidance, switch exitCode() from errors.As to errors.AsType[*exec.ExitError]. Add an explicit "4.0" case to the assert table and document why a decimal answer is deliberately rejected.
# Conflicts: # cli/azd/extensions/azure.ai.agents/cspell.yaml
jongio
left a comment
There was a problem hiding this comment.
One minor nit on the safety-net cleanup step: the 2>/dev/null on line 213 swallows all diagnostic output from azd down. If a crash-path cleanup fails, the pipeline log won't show why. Since the task already sets continueOnError: true (so a non-zero exit can't fail the build), consider 2>&1 || true instead to keep diagnostic output visible for post-mortem investigation while still preventing the set -e from aborting the loop.
Not blocking, the current pattern is defensible for a quiet safety net.
Use 2>&1 instead of 2>/dev/null in the always() cleanup loop so a failed force-purge is visible for debugging a resource leak; keep || true so set -e doesn't abort the loop.
jongio
left a comment
There was a problem hiding this comment.
The Go rewrite of the Tier 2 driver is well-structured: event-driven prompt dispatch, proper timeout hierarchy (per-mode 60m < go test 125m < step 130m < job 150m), and LIFO cleanup ordering so teardown always runs before the config dir it depends on is removed.
One minor observation below on findServiceName; everything else looks solid after 12+ rounds of iteration.
jongio
left a comment
There was a problem hiding this comment.
The crash-path cleanup now surfaces stderr via 2>&1 instead of swallowing it, making failed force-purge attempts visible in the pipeline log. The || true guard still prevents the loop from stopping on individual azd down failures.
findServiceName (tier2_live_test.go:649) parses azure.yaml by string matching rather than a YAML library. This works for the predictable output of azd ai agent init, but could silently break on edge-case YAML (a comment containing services:, or a multi-line string value). Low risk since the generated files are predictable, but worth noting if the test starts failing on valid projects.
Replace the indentation/colon line scan with a minimal struct unmarshal (gopkg.in/yaml.v3, already a direct dep) so the service-name lookup tolerates comments and indentation changes in scaffolded azure.yaml.
jongio
left a comment
There was a problem hiding this comment.
One suggestion on context propagation in the test runner.
jongio
left a comment
There was a problem hiding this comment.
One suggestion on context propagation in the test runner.
jongio
left a comment
There was a problem hiding this comment.
Reviewed the full diff (8 files, +1550/-1). My prior feedback from yesterday was addressed in the latest commits (errors.AsType, stderr visibility in cleanup, struct-based YAML parsing, t.Context() propagation).
Fresh pass on the current state: the test architecture is well-designed. The event-driven prompt loop (go-expect waitForQuiet instead of fixed-interval polling), the bounded ring buffer for diagnostics, the LIFO cleanup ordering, and the tiered timeout hierarchy (per-phase < per-mode < go-test < ADO step < ADO job) are all solid engineering.
One non-blocking note for future maintenance: the dispatchPrompt cases are tightly coupled to the extension's prompt strings (each case annotated with source file:line). The README documents this, but consider adding a CI step or a grep-based check that flags when prompt-emitting lines in the extension source change without a corresponding update here. That would catch desync earlier than a full Tier 2 run.
jongio
left a comment
There was a problem hiding this comment.
Checked the Go PTY test driver, answer matcher logic, console helpers, pipeline YAML, and README.
Prior feedback (errors.AsType, t.Context(), yaml struct parsing, stderr redirect) is all addressed. CI is green across all checks.
The timeout hierarchy, LIFO cleanup ordering, assertSafeTestDir guard, and event-driven prompt loop with loop detection are correctly implemented. The standalone-4 checker handles edge cases well (model names, versions, status codes). Dependencies were already in go.sum as indirect deps; the go.mod change just promotes them to direct.
One non-blocking suggestion on the dispatchPrompt default case.
| r.enter() | ||
|
|
||
| default: | ||
| r.enter() |
There was a problem hiding this comment.
Non-blocking: when the default case fires, it sends Enter with no indication in the test log that the dispatch didn't match a specific case. The prompt text IS logged before dispatch (line 301), but adding something like r.t.Logf("unhandled prompt (default Enter): %s", truncate(prompt, 100)) here would make CI investigations easier when a new or changed prompt hits the catch-all, since it would distinguish "prompt matched and answered correctly" from "prompt fell through to default."
Fixes #8759
Summary
eng/pipelines/ext-azure-ai-agents-live.yml: an on-demand / weekly Azure DevOps pipeline that runs the Tier 2 live golden path (init → provision → deploy → invoke → down) for theazure.ai.agentsextension against a real Azure (TME) subscription, for bothcodeandcontainerdeploy modes.tier2_live_test.go,console_test.go,assert.go) undercli/azd/extensions/azure.ai.agents/tests/e2e-live/, drivingazd ai agentthrough a pseudo-terminal withgo-expect+vt10x+creack/pty(replacing the original tmux/Python prototype from test: add static E2E tests for azure.ai.agents extension #8692), with CI auth detection for Azure DevOps (TF_BUILD/E2E_USE_AZ_CLI_AUTH).README.mddocumenting how to run (CI + local) and the one-time ADO setup.Together with the PR-gate tests in #8754 (Tier 0 offline + Tier 1 recording/playback), this covers Tier 0/1/2 from the original prototype (#8692).
At a glance
What it tests — the full live golden path
init → provision → deploy → invoke → downofazd ai agent, driving a real CLI over a pseudo-terminal (Gogo-expect+vt10x+creack/pty), for bothcode(zip) andcontainer(ACR build) deploy modes, run sequentially. The interactiveinitphase is an event-driven prompt loop (wait for the survey UI to settle, read the rendered screen, dispatch on the verbatim prompt strings) rather than a fixed script, so it tolerates the live model/region/project branches. Theinvokestep asserts a live model answers "what is 2+2?" with a standalone4/four.How to trigger —
trigger: none/pr: none, so it never runs on PR push. Run it via: a PR comment/azp run ext-azure-ai-agents-live(requires repo write), the weekly Mon 07:00 UTC schedule, or a manual ADO queue (pickdeployModes=both/code/container).Identity / subscription / what it does — authenticates as the
azure-sdk-testsADO service connection (Workload Identity Federation / OIDC) into the shared TME test subscription, regioneastus2. It creates a fresh Foundry project +gpt-4o-minideployment (+ an ACR in container mode) and the agent service, exercises them, then tears them down.azlogs in via OIDC and azd reuses that session (auth.useAzCliAuth+keepAzSessionActive: true, so the WIF token survives the multi-minute run).Leftover resources? — No, by design. Each mode registers a
t.Cleanupthat runsazd down --force --purge, and a teardown failure fails the run (never green while leaking). Acondition: always()AzureCLI@2step is a second safety net: it force-purges any project dirs left under/tmp/e2e-tests/tier2-*/after a crash or timeout. (Crash-path cleanup is best-effort, so the TME subscription is worth an occasional spot-check.)Runs on Azure DevOps, not GitHub Actions — the YAML lives in this GitHub repo (
eng/pipelines/…) but is an ADO pipeline (extends: 1es-redirect,task: AzureCLI@2,$(Build.*),##vso[...]), executed by theazure-sdkADO project. GitHub CI for this PR stays green/inert; the live run only happens in ADO.Post-merge admin action — needs an Azure DevOps admin on the
azure-sdkproject (Azure SDK EngSys) — not a GitHub admin — to: (a) register the pipeline asext-azure-ai-agents-live(the exact name/azp runuses) against this repo + YAML path; (b) confirm theazure-sdk-testsservice connection maps to the TME subscription with RBAC to create Foundry projects and deploy models (Contributor + Azure AI Developer + Cognitive Services Contributor, or equivalent); (c) kick the first/azp runto validate thekeepAzSessionActiveauth path. Merging the PR itself only needs a GitHub maintainer approval. TheinitGitHub token is already wired via the ambientazuresdk-github-patorg secret, so no extra secret setup is needed.Testing
/azp run(the test is gated behindAZURE_AI_AGENTS_E2E_LIVE=1and never runs in PR CI).gofmt,go vet,golangci-lint(ubuntu + windows), and ago test -ccompile all clean on thelinuxbuild tag; the no-tagassert.goanswer-matching logic is covered byassert_test.go(runs on the host);README.mdpassescspell.t.Context()as the run-context parent sogo test -timeoutcancellation propagates; LIFOt.Cleanupordering so teardown runs before the copiedAZD_CONFIG_DIRis removed; a destructive-RemoveAllguardrail on the test dir; and a robust standalone-token check for the invoke result (accepts4orfour, ignores incidental4s such asgpt-4o-mini/4.1/404).runTimeout(60 min, acontextdeadline) > the sum of the per-phase budgets (init/provision/deploy/invoke ≈ 32 min) so a slow-but-healthy run is never preempted mid-phase; both modes fit undergo test -timeout 125m, and the ADOAzureCLI@2step (130 min) and job (150 min) caps sit strictly above that, so a run can never be hard-killed mid-azd down(which would leak live resources).