Skip to content

feat(eval): add plan-mode and HITL contract evals (D2 follow-up)#236

Open
zjshen14 wants to merge 1 commit into
mainfrom
feat/issue-234-plan-mode-hitl-contract-evals
Open

feat(eval): add plan-mode and HITL contract evals (D2 follow-up)#236
zjshen14 wants to merge 1 commit into
mainfrom
feat/issue-234-plan-mode-hitl-contract-evals

Conversation

@zjshen14

@zjshen14 zjshen14 commented Jun 7, 2026

Copy link
Copy Markdown
Owner

Summary

  • Extends RunTapeOptions in the replay harness with confirmFn? and forcesConfirmation? so contract evals can drive HITL decisions without an interactive terminal.
  • Adds a synthesised JSONL tape (plan-mode-write-block) whose single /plan-prefixed turn contains a write call, proving the executor's readOnly guard fires before the call reaches the registry.
  • Adds src/eval/replay/contract.test.ts with 9 tests covering two D2 contract surfaces: plan-mode write-blocking and the three HITL paths (user_denied, non_interactive, allow).

Related issue

Closes #234

Test plan

  • npm run typecheck && npm run lint && npm run format:check && npm test — all pass (713 tests, 9 new)
  • Plan-mode: tool_denied(plan_mode) fires, tool_exec_start count is 0, tape exhausted cleanly
  • HITL deny: tool_denied(user_denied) fires when confirmFn returns "deny"
  • HITL non-interactive: tool_denied(non_interactive) fires when no confirmFn is provided
  • HITL allow: no denial, tool_exec_start fires once when confirmFn returns "allow"

https://claude.ai/code/session_01HRHNetsbTUajSxK33fby1u


Generated by Claude Code

Extends the replay harness with two surfaces from the D2 contract-eval
backlog (#234):

**Plan-mode contract eval** — a synthesised JSONL tape whose single turn
is prefixed with `/plan`, causing agent.run() to run with readOnly=true.
The tape's write tool call is blocked by the executor's readOnly guard
before reaching TapeRegistry, emitting tool_denied(plan_mode). Asserts:
- tool_denied fires exactly once with reason="plan_mode"
- tool_exec_start / tool_exec_end counts are both 0
- tape exhausted cleanly, 0 unconsumed results

**HITL contract eval** — three programmatic tapes exercising the three
confirmation paths:
- confirmFn present + returns "deny" → tool_denied(user_denied)
- confirmFn absent (non-interactive context) → tool_denied(non_interactive)
- confirmFn present + returns "allow" → tool executes normally, no denial

RunTapeOptions gains confirmFn? and forcesConfirmation? to wire these
paths; runner.ts calls agent.setConfirmFn() and
agent.setForcesConfirmationFn() after construction.

Closes #234
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zjshen14

zjshen14 commented Jun 8, 2026

Copy link
Copy Markdown
Owner Author

Review: feat(eval): add plan-mode and HITL contract evals (D2 follow-up)

What's good: Solid approach — the synthesized JSONL tape for plan-mode, the dedicated contract.test.ts, and the three HITL paths (deny / non-interactive / allow) all read cleanly. The forcesConfirmation escape hatch for TapeRegistry tools is a pragmatic design. CI is green.


Overlap with #237

This PR and #237 were opened one day apart and both address the same #234 sub-items. Critically, both PRs make identical additions to runner.ts: adding confirmFn and forcesConfirmation to RunTapeOptions and wiring them into runTape. The difference is where the test code lands:

PR Test location Approach
#236 (this PR) New src/eval/replay/contract.test.ts + JSONL tape + expected JSON Dedicated contract eval module with file-based tapes
#237 (newer) Extends src/eval/replay/synthesized.test.ts Inline programmatic tapes in existing file

Merging both would produce a runner.ts conflict and duplicate coverage. Please close one before merging the other.

Design trade-offs:

Recommendation: Pick one and close the other — no code-quality blocker in either.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

D2 follow-up: sandbox / plan-mode / HITL contract evals

2 participants