feat: add e2e triage CI workflow with Slack integration #741

Draft
alishakawaguchi wants to merge 21 commits into main from alisha/e2e-triage-ci-job

alishakawaguchi (Contributor) commented Mar 20, 2026

Context

When E2E tests fail on main, a Slack alert is posted but there's no easy way to kick off triage. This adds a one-click "Run Triage" link to the Slack alert that triggers the triage workflow via a Cloudflare Worker. Triage results post back to the same Slack thread.

E2E fails → bot posts alert to Slack (with "Run Triage" link)
  → user clicks link → Cloudflare Worker → GitHub API (workflow_dispatch)
    → e2e-triage.yml runs → posts results back to Slack thread
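The Worker-to-GitHub hop in the flow above could look roughly like the sketch below. It only builds the request for GitHub's "create a workflow dispatch event" REST endpoint and does not send it. The owner, repo, and input names (`run_url`, `slack_channel`, `slack_thread_ts`) are placeholders for illustration, not the names used by e2e-triage.yml, and the PR's actual Worker code is not shown here.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, workflow_file, ref, inputs, token):
    """Build (but do not send) a workflow_dispatch request for one workflow file."""
    url = (
        f"{GITHUB_API}/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # workflow_dispatch inputs are passed as a flat object of string values.
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values; none of these come from the PR.
example_request = build_dispatch_request(
    owner="example-org",
    repo="example-repo",
    workflow_file="e2e-triage.yml",
    ref="main",
    inputs={
        "run_url": "https://github.com/example-org/example-repo/actions/runs/123",
        "slack_channel": "C0123456789",
        "slack_thread_ts": "1711111111.000100",
    },
    token="placeholder-token",
)
```

Passing the request to `urllib.request.urlopen` would trigger the run; the token needs permission to dispatch workflows in the target repo.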

Summary

  • Switch Slack failure alert from incoming webhook to bot token (chat.postMessage) so we can capture channel and ts from the response and build the triage link with Slack thread context
  • Add clickable "Run Triage" link as a threaded follow-up to the failure alert
  • Remove repository_dispatch trigger from e2e-triage.yml — only workflow_dispatch remains
  • Remove cmd/e2e-triage-dispatch/ and internal/slacktriage/ (dispatch service code that was never deployed)
  • Update docs and README
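To illustrate the first bullet: a successful chat.postMessage response carries `ok`, `channel`, and `ts`, which is what makes it possible to thread follow-ups and build the triage link. The sketch below assumes a hypothetical Worker URL and hypothetical query parameter names; the PR does not specify either.

```python
from urllib.parse import urlencode

# Hypothetical Worker endpoint; the real URL is not given in the PR.
WORKER_URL = "https://triage-worker.example.com/run"


def build_triage_link(post_message_response, run_url):
    """Build a "Run Triage" link from a parsed chat.postMessage response."""
    if not post_message_response.get("ok"):
        error = post_message_response.get("error", "unknown error")
        raise ValueError(f"chat.postMessage failed: {error}")
    # Parameter names are assumptions made for this sketch.
    query = urlencode(
        {
            "run_url": run_url,
            "channel": post_message_response["channel"],
            "thread_ts": post_message_response["ts"],
        }
    )
    return f"{WORKER_URL}?{query}"


example_response = {"ok": True, "channel": "C0123456789", "ts": "1711111111.000100"}
example_link = build_triage_link(
    example_response,
    "https://github.com/example-org/example-repo/actions/runs/123",
)
```

An incoming webhook only acknowledges the post and returns no `channel` or `ts`, which is why the switch to a bot token is needed for threading.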

Secret/config changes needed

  • Add repo variable E2E_SLACK_CHANNEL — channel ID for failure alerts
  • Add repo secret SLACK_BOT_TOKEN — bot token with chat:write scope
  • Check E2E_SLACK_WEBHOOK_URL — still used by release.yml, don't remove yet

Slack app changes

  • Keep chat:write scope
  • Remove incoming-webhook, channels:history scopes
  • Disable Event Subscriptions
  • Reinstall to workspace

Test plan

  • mise run fmt && mise run lint && mise run test:ci — all pass
  • Deploy Cloudflare Worker first
  • Set SLACK_BOT_TOKEN secret and E2E_SLACK_CHANNEL variable
  • Merge, trigger E2E failure on main → confirm Slack alert has "Run Triage" link → click → confirm triage runs and posts results to thread

🤖 Generated with Claude Code


Note

Medium Risk
Introduces new GitHub Actions workflows plus an external Slack-to-GitHub dispatch service that handles signatures, tokens, and event validation; misconfiguration could trigger unexpected runs or leak metadata to Slack threads.

Overview
Enables Slack-triggered E2E triage: E2E failure alerts now include a machine-readable meta: line, and a new E2E Triage workflow can be started via repository_dispatch or manually via workflow_dispatch.

The new workflow builds an agent matrix (auto-detecting sha/failed agents from the run when needed), runs scripts/run-e2e-triage.sh to download CI artifacts and invoke the Claude /e2e:triage-ci command, uploads per-agent triage artifacts, and optionally posts start/completion summaries back to the Slack thread.

Adds a new cmd/e2e-triage-dispatch HTTP service plus internal/slacktriage helpers to validate Slack signatures, detect triage e2e thread replies, parse the parent alert’s meta: fields, and dispatch the GitHub repository_dispatch event; includes unit tests and new docs/README guidance.

Written by Cursor Bugbot for commit ba6611a.

alishakawaguchi and others added 15 commits March 17, 2026 15:15
Make sha and failed_agents optional for workflow_dispatch triggers.
When omitted, these values are derived from the run URL via the
GitHub API, reducing friction when triggering triage from the UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4a44db7b807d
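
As a rough illustration of deriving values from the run URL, the sketch below extracts the owner, repo, and run ID from a GitHub Actions run URL and builds the REST path from which the run's head SHA could be fetched. The workflow itself does this in shell with gh; the function names here are hypothetical.

```python
import re

# Matches https://github.com/OWNER/REPO/actions/runs/RUN_ID with an optional tail.
RUN_URL_RE = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
    r"/actions/runs/(?P<run_id>\d+)(?:[/?#].*)?$"
)


def parse_run_url(run_url):
    """Return (owner, repo, run_id) parsed from a GitHub Actions run URL."""
    match = RUN_URL_RE.match(run_url.strip())
    if match is None:
        raise ValueError(f"not a GitHub Actions run URL: {run_url!r}")
    return match["owner"], match["repo"], int(match["run_id"])


def run_api_path(run_url):
    """Return the REST API path for the run, usable with `gh api`."""
    owner, repo, run_id = parse_run_url(run_url)
    return f"repos/{owner}/{repo}/actions/runs/{run_id}"
```

The run object returned from that path includes `head_sha`, and appending `/jobs` lists the run's jobs, from which failed agents could be inferred if job names encode the agent.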
- Consolidate two gh API calls into one (headSha + jobs in single request)
- Extract duplicated CSV-to-JSON jq pattern into csv_to_json function
- Add "null" guard to agents_json validation
- Use shallow clone (fetch-depth: 1) for triage jobs
- Add server-side error logging in HTTP handler
- Fix gosec nolint placement and noctx lint errors in tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 0f803598ba36
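
The csv_to_json helper and the "null" guard mentioned above live in shell and jq in the workflow. A Python rendering of the same idea, assuming the guard exists because jq prints the literal string null for a missing value, might look like this. The validate_agents_json name and the second agent name are made up for the example.

```python
import json


def csv_to_json(csv):
    """Turn "a, b,,c" into a JSON array string, dropping empty entries."""
    items = [part.strip() for part in csv.split(",")]
    return json.dumps([item for item in items if item])


def validate_agents_json(agents_json):
    """Return the agent list, rejecting empty input and the literal "null"."""
    text = agents_json.strip()
    if text in ("", "null"):
        raise ValueError("no agents resolved for triage")
    agents = json.loads(text)
    if not isinstance(agents, list) or not agents:
        raise ValueError(f"expected a non-empty JSON array, got: {text}")
    return agents


# "vogon" is the canary agent named in this PR; "example-agent" is a placeholder.
example_agents = validate_agents_json(csv_to_json("vogon, example-agent,,"))
```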
Copilot AI review requested due to automatic review settings March 20, 2026 17:08
alishakawaguchi self-assigned this Mar 20, 2026

Copilot AI left a comment

Pull request overview

Adds a Slack-triggered E2E triage path that bridges Slack thread replies (triage e2e) to a new GitHub Actions triage workflow, so failing CI runs can be triaged and reported back to Slack with minimal manual steps.

Changes:

  • Introduce .github/workflows/e2e-triage.yml (workflow_dispatch + repository_dispatch) to run /e2e:triage-ci per failed agent and post Slack thread updates.
  • Add cmd/e2e-triage-dispatch/ HTTP service plus internal/slacktriage/ helpers to validate Slack events, parse parent alert metadata, and dispatch GitHub events.
  • Add machine-readable meta: data to E2E Slack alerts, plus docs and a runner script for the triage workflow.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/run-e2e-triage.sh | Runner script invoked by the triage workflow to execute the Claude E2E triage command and tee logs to artifacts. |
| internal/slacktriage/parent_message.go | Parses the meta: line from Slack alerts into structured metadata for dispatch. |
| internal/slacktriage/normalize.go | Normalizes Slack reply text and checks for the exact triage trigger phrase. |
| internal/slacktriage/dispatch.go | Builds the repository_dispatch payload from parsed metadata + Slack thread info. |
| internal/slacktriage/*_test.go | Unit tests for trigger normalization, parent metadata parsing, and dispatch payload creation. |
| cmd/e2e-triage-dispatch/main.go | Slack event receiver: verifies signatures, fetches parent message, parses metadata, dispatches to GitHub. |
| cmd/e2e-triage-dispatch/main_test.go | Handler + dispatcher unit tests (signature verification, ignore cases, dispatch path). |
| .github/workflows/e2e.yml | Adds machine-readable meta: metadata to the Slack failure alert. |
| .github/workflows/e2e-triage.yml | New triage workflow that validates payload, derives sha/agents when needed, runs triage, posts Slack updates, uploads artifacts. |
| docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md | Design doc describing the Slack→GitHub triage system and contract. |
| docs/architecture/slack-e2e-triage.md | Architecture/runbook-style overview for operating the Slack-triggered triage. |
| README.md | Documents Slack-triggered E2E triage and points to the architecture doc. |
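
For readers unfamiliar with the signature check that cmd/e2e-triage-dispatch/main.go is described as performing, here is a minimal sketch of Slack's documented v0 request signing scheme. It is not the Go code from this PR; the function names are invented for the example, and the five-minute replay window follows Slack's general guidance rather than anything stated here.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 60 * 5  # replay window; assumed, following Slack's guidance


def slack_signature(signing_secret, timestamp, body):
    """Compute the expected X-Slack-Signature value for a raw request body."""
    base = b"v0:" + timestamp.encode("utf-8") + b":" + body
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256)
    return "v0=" + digest.hexdigest()


def verify_slack_request(signing_secret, timestamp, body, signature, now=None):
    """Return True only for a fresh request whose signature matches."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > MAX_SKEW_SECONDS:
        return False  # too old or too far in the future: possible replay
    expected = slack_signature(signing_secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Here `timestamp` is the X-Slack-Request-Timestamp header and `body` is the raw, unparsed request body; verifying against a re-serialized body would produce a different digest.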

alishakawaguchi and others added 6 commits March 20, 2026 10:25
Adds push-triggered test mode that runs with the vogon canary agent
(no API costs) when workflow-related files change on this branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: eec73e0fab92
This reverts commit f8a82d6.

Entire-Checkpoint: 363c74b4a8c5
The triage workflow was checking out the failed run's SHA, which
doesn't contain the triage script. Now checks out the workflow's
own branch and passes the target SHA as an env var instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: d4ec0a1e350d
Use --allowedTools with explicit per-command scoping instead of
--dangerously-skip-permissions. Each gh command is locked to the
specific repo, workflow, and agent. No generic shell access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4004148a4e05
Instead of giving Claude shell access to gh/scripts, download
artifacts in the script before invoking Claude. Claude only gets
Read, Grep, and Glob — pure analysis, no shell execution.

Also improve job summary to show helpful message when log is empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: ca4f43d851a5
…aries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4fc3a7119ed8
@alishakawaguchi

bugbot run

cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!
