Skip to content

hooks: ship each tool_use once per Stop turn (fix inflated policy-match counts)#157

Open
vigneshsubbiah16 wants to merge 1 commit into
mainfrom
fix/web-stop-event-duplicate-tool-use
Open

hooks: ship each tool_use once per Stop turn (fix inflated policy-match counts)#157
vigneshsubbiah16 wants to merge 1 commit into
mainfrom
fix/web-stop-event-duplicate-tool-use

Conversation

@vigneshsubbiah16

@vigneshsubbiah16 vigneshsubbiah16 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

The Unbound hook sender re-ships the same tool_uses to the proxy on every Stop event within a single user turn. Because Claude Code fires multiple Stop events per UserPromptSubmit (one per agent turn), a single file edit gets counted 1 + 2 + … + N times across N Stops. This is the root cause of inflated policy-match counts (one edit observed counted 14–35×).

This PR makes each tool_use ship exactly once per turn across all three hook flavors (claude-code, codex, cursor), while keeping the audit log intact so the user-prompt lookback keeps working.

Root cause

process_stop_event rebuilds the exchange window from the persisted audit log (~/.claude/hooks/agent-audit.log): it resets the window on each UserPromptSubmit and appends every event after. build_llm_exchange then emits one tool_use per PostToolUse entry in that window.

Commit 16d1ffe ("hooks: stop wiping the audit log on Stop; rely on rotation", 2026-05-13) removed the post-send prune of already-shipped events. It did so deliberately — the prune had been deleting every entry for the current session after a send, which broke the then-new get_recent_user_prompts_for_session lookback (no prior prompts left to find). But with the prune gone, nothing stops each subsequent Stop from re-shipping the whole accumulated window:

→ cumulative re-sends of the same tool_use entries.

Fix — restore the prune safely (mark, don't delete)

Instead of re-deleting shipped events (which would re-break the lookback 16d1ffe protected), entries are annotated with an unbound_shipped flag and left in the log:

  • claude-code — skip PostToolUse entries already flagged SHIPPED_FLAG when building the window; after a successful send, stamp the shipped tool entries and the turn's UserPromptSubmit anchor, then persist. A tool-free conversation turn still ships exactly once (via the prompt anchor), preserving prior behavior.
  • codex — sources tool_uses from the transcript per turn; the turn's UserPromptSubmit anchor is marked shipped after a successful send and a later Stop in the same turn skips the resend.
  • cursor — groups by generation_id; the generation's stop entry is marked shipped so a stop that fires twice for one generation doesn't double-ship.

_persist_*_flags re-read the log immediately before stamping, so a concurrent hook that appended new lines after we loaded the log isn't clobbered.

Before / after

For a turn with 2 edits and 3 Stop events:

tool_uses shipped
Before [/a.py, /a.py, /a.py, /b.py, /b.py, /b.py] (6 — triangular re-send)
After [/a.py, /b.py] (2 — once each)

The user-prompt lookback (get_recent_user_prompts_for_session) returns the same prior prompts in both cases — the entries are retained, only annotated.

Test plan

New test_stop_dedup.py in each hook dir (function-level, matching the repo's existing unittest + patch.object convention):

  • multiple Stops in one turn → each PostToolUse tool_use ships exactly once (idempotent)
  • a new tool arriving after an earlier Stop ships once, without re-shipping prior tools
  • a tool-free conversation turn ships exactly once across multiple Stops
  • get_recent_user_prompts_for_session still returns prior user prompts after Stop processing (the 16d1ffe protection)

Verified red without the fix (claude-code dedup tests fail, e.g. ['/a.py','/a.py','/a.py','/b.py','/b.py','/b.py'] != ['/a.py','/b.py']) and green with it. The full existing hook suites pass; the one pre-existing test_identity.test_keys_limited_to_identity_fields failure exists unchanged on main and is unrelated to this change. CI gate (py_compile + pyflakes undefined-name) is green.

Run:

cd claude-code/hooks && python3 -m unittest test_stop_dedup
cd codex/hooks     && python3 -m unittest test_stop_dedup
cd cursor          && python3 -m unittest test_stop_dedup

Follow-up (not in this PR)

The unbound-hook frozen binary (binary/unbound-hook.spec) vendors these hook sources (it bundles ../claude-code/hooks/unbound.py, ../cursor/unbound.py, and ../codex/hooks/unbound.py under vendored/), so devices running the binary won't pick up this fix until the binary is rebuilt and redeployed via MDM. That rebuild/redeploy is intentionally out of scope here and should be tracked/executed separately.

🤖 Generated with Claude Code

Greptile Summary

This PR fixes inflated policy-match counts caused by process_stop_event re-shipping the same tool_use entries on every Stop event within a single user turn. The fix introduces a SHIPPED_FLAG annotation (unbound_shipped) that is written to the audit log after a successful send, letting subsequent Stops in the same turn skip already-shipped entries while leaving them in the log so the user-prompt lookback continues to work.

  • claude-code: per-PostToolUse entry flag — each entry is individually marked, so tool_uses arriving between Stops are still captured on the next Stop without re-shipping earlier ones.
  • codex: turn-level flag on the UserPromptSubmit anchor — simpler but means any tool_use that reaches the transcript after Stop [WEB-3217] Adds --clear and --debug #1 is silently skipped by Stop [WEB-3237] Captures model from cursor #2+.
  • cursor: generation-level flag on the stop entry — clean fit since cursor already groups by generation_id.
  • All three flavors ship new regression tests (test_stop_dedup.py) covering idempotency, incremental arrivals, and lookback preservation.

Confidence Score: 4/5

Safe to merge for sequential Stop flows, which is the common case; the codex hook silently drops tool_uses that arrive between Stops.

The claude-code and cursor hooks are solid — entry-level flags with a re-read-before-write persist, backed by comprehensive regression tests. The codex hook marks the whole turn shipped on the first Stop, so any tool_use that reaches the transcript after Stop #1 will never be sent. Whether this happens in practice depends on codex's sub-turn ordering, but no test covers that scenario. The unused session_id parameter in _persist_shipped_flags is a cosmetic issue. The TOCTOU window for concurrent Stops is inherent to the file-based log approach and dramatically narrower than before the fix.

codex/hooks/unbound.py — the turn-level shipped flag could silently miss tool_uses added to the transcript between consecutive Stops in the same user turn

Important Files Changed

Filename Overview
claude-code/hooks/unbound.py Core dedup fix: adds SHIPPED_FLAG, _persist_shipped_flags (re-read-before-write), and a per-PostToolUse shipped check; unused session_id parameter in _persist_shipped_flags is dead code
codex/hooks/unbound.py Turn-level dedup: marks the UserPromptSubmit anchor shipped after the first successful send; any tool_uses added between Stop #1 and a later Stop for the same turn are silently skipped
cursor/unbound.py Generation-level dedup: stamps SHIPPED_FLAG on the generation's stop entry after a successful send; clean and straightforward
claude-code/hooks/test_stop_dedup.py New regression suite covering idempotent sends, incremental tool arrivals, tool-free turns, and lookback preservation; well structured with addCleanup teardown
codex/hooks/test_stop_dedup.py Covers turn-level idempotency and lookback preservation for the codex hook; stubs parse_codex_transcript_for_tools and parse_codex_transcript_for_usage correctly
cursor/test_stop_dedup.py Covers generation-level idempotency and lookback for the cursor hook; correct use of api_key keyword-only sentinel matching the cursor send_to_api signature

Sequence Diagram

sequenceDiagram
    participant CC as Claude Code
    participant Hook as process_stop_event
    participant Log as audit-log
    participant API as Proxy API

    CC->>Hook: "Stop #1 (turn T)"
    Hook->>Log: load_existing_logs()
    Log-->>Hook: [UserPromptSubmit, PostToolUse(/a.py), PostToolUse(/b.py)]
    Note over Hook: pending_tool_logs = [/a.py, /b.py], turn_already_shipped = False
    Hook->>Hook: build_llm_exchange(session_events)
    Hook->>API: send_to_api(exchange) → True
    Hook->>Log: _persist_shipped_flags() re-read → stamp SHIPPED_FLAG → write

    CC->>Hook: "Stop #2 (turn T, same session)"
    Hook->>Log: load_existing_logs()
    Log-->>Hook: [UserPromptSubmit✓, PostToolUse(/a.py)✓, PostToolUse(/b.py)✓]
    Note over Hook: pending_tool_logs = [], turn_already_shipped = True → early return
    Hook-->>CC: (no send)

    CC->>Hook: "Stop #1 (turn T+1, new UserPromptSubmit)"
    Hook->>Log: load_existing_logs()
    Log-->>Hook: [...✓ entries, UserPromptSubmit (new), PostToolUse(/c.py)]
    Note over Hook: Window resets, pending_tool_logs = [/c.py], turn_already_shipped = False
    Hook->>API: send_to_api(exchange) → True
    Hook->>Log: _persist_shipped_flags()
Loading

Reviews (1): Last reviewed commit: "hooks: ship each tool_use once per turn ..." | Re-trigger Greptile

Greptile also left 3 inline comments on this PR.

…nts)

Claude Code (and Codex) fire multiple Stop events per user turn — one per
agent turn. The Stop handler rebuilds the exchange window from the persisted
audit log, resetting on each UserPromptSubmit and appending every event after.

Commit 16d1ffe removed the post-send prune of already-shipped events (it had
broken the get_recent_user_prompts_for_session lookback by wiping prior
prompts). With the prune gone, every Stop re-shipped ALL PostToolUse events
accumulated since the last UserPromptSubmit, so a single file edit was counted
1 + 2 + ... + N times across N Stops — the root cause of inflated
policy-match counts (one edit seen 14-35x).

Fix — restore the prune safely (mark, don't delete):

- claude-code: skip PostToolUse audit entries already flagged SHIPPED_FLAG when
  building the window; after a successful send, stamp the shipped tool entries
  (and the turn's UserPromptSubmit anchor) and persist. Entries stay in the log
  so the user-prompt lookback keeps working. A tool-free conversation turn
  still ships exactly once via the prompt anchor.
- codex: ships tool_uses from the transcript per turn; mark the turn's
  UserPromptSubmit anchor shipped after a successful send and skip resend.
- cursor: groups by generation_id; mark the generation's stop entry shipped and
  skip if a stop fires twice for the same generation.

_persist_*_flags re-read the log before stamping to avoid clobbering entries a
concurrent hook appended.

Tests: new test_stop_dedup.py in each hook dir reproduces the bug (multiple
Stops in one turn) and asserts each tool_use ships exactly once, plus that the
get_recent_user_prompts_for_session lookback still returns prior prompts (what
16d1ffe protected). Verified red without the fix, green with it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vigneshsubbiah16 vigneshsubbiah16 requested a review from a team June 15, 2026 20:52
Comment on lines +1155 to +1162
def _persist_shipped_flags(session_id: Optional[str], shipped_logs: List[Dict]) -> None:
"""Re-read the audit log and stamp SHIPPED_FLAG on the entries that were
just shipped, then rewrite it. Re-reading (instead of saving our stale
in-memory copy) avoids clobbering entries a concurrent hook appended after
we loaded the log."""
if not shipped_logs:
return
shipped_ids = {_log_identity(log) for log in shipped_logs}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The session_id parameter is accepted but never referenced inside the function body. _log_identity already embeds the session ID in the identity tuple, so the match is implicitly session-scoped. The dead parameter adds confusion about whether session-scoping is intended as an extra guard and makes callers think they need to supply it. Consider dropping it from the signature (and the call-site).

Suggested change
def _persist_shipped_flags(session_id: Optional[str], shipped_logs: List[Dict]) -> None:
"""Re-read the audit log and stamp SHIPPED_FLAG on the entries that were
just shipped, then rewrite it. Re-reading (instead of saving our stale
in-memory copy) avoids clobbering entries a concurrent hook appended after
we loaded the log."""
if not shipped_logs:
return
shipped_ids = {_log_identity(log) for log in shipped_logs}
def _persist_shipped_flags(shipped_logs: List[Dict]) -> None:
"""Re-read the audit log and stamp SHIPPED_FLAG on the entries that were
just shipped, then rewrite it. Re-reading (instead of saving our stale
in-memory copy) avoids clobbering entries a concurrent hook appended after
we loaded the log."""
if not shipped_logs:
return
shipped_ids = {_log_identity(log) for log in shipped_logs}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines 1246 to 1258
usage=transcript_usage,
)

if exchange:
send_to_api(exchange, api_key)
if exchange and send_to_api(exchange, api_key):
# Persist the shipped markers so later Stops in this turn skip what was
# already sent. The entries stay in the log for the user-prompt lookback.
for log in pending_tool_logs:
log[SHIPPED_FLAG] = True
if prompt_log is not None:
prompt_log[SHIPPED_FLAG] = True
_persist_shipped_flags(session_id, pending_tool_logs + ([prompt_log] if prompt_log else []))


Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 TOCTOU race between concurrent Stops

The guard read (prompt_log.get(SHIPPED_FLAG)) and the flag write (_persist_shipped_flags) are not atomic. If two Stop events fire within the same small window — both load the log before either has finished writing — both will see an unshipped state, both pass the guard, and both call send_to_api, shipping the same exchange twice. The re-read inside _persist_shipped_flags only protects against a third process appending new lines, not against two racing Stops both clearing the guard simultaneously. For the common sequential-Stop case this is fully correct; the risk is limited to genuinely concurrent invocations, but it's worth documenting (or protecting with a lock/tmp-rename swap) so the behavior is explicit.

Comment thread codex/hooks/unbound.py
Comment on lines +1135 to +1140
# Idempotency guard: Codex fires a Stop per agent turn, so a single user
# turn produces several Stop events. Once this turn's exchange has shipped,
# re-sending would re-count the same tool_uses (root cause of inflated
# policy-match counts), so skip it.
if prompt_log is not None and prompt_log.get(SHIPPED_FLAG):
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Turn-level flag silently drops tool_uses added after the first Stop

Codex marks the UserPromptSubmit anchor shipped as soon as Stop #1 succeeds. If Stop #2 fires and the transcript now contains tool_uses that weren't present when Stop #1 ran (e.g., the agent performed a file edit between the two sub-turns), those new tool_uses are never shipped — the guard returns early without inspecting the transcript again. The claude-code hook avoids this by checking per-PostToolUse entries. If multi-Stop, multi-tool-use codex turns are possible in practice, consider tracking shipped transcript offsets or tool indices rather than a single anchor flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant