[otel-advisor] OTel improvement: surface token usage from agent-stdio.log when firewall-proxy logs are absent #33976

@github-actions

📡 OTel Instrumentation Improvement: surface token usage from agent-stdio.log when firewall-proxy logs are absent

Analysis Date: 2026-05-22
Priority: High
Effort: Small (< 2h)

Problem

gen_ai.usage.total_tokens (and the sibling gen_ai.usage.input_tokens / output_tokens / cache attributes) are missing from the majority of gh-aw.agent.conclusion spans in production telemetry. Token data is currently sourced only from /tmp/gh-aw/agent_usage.json, which is written by parse_token_usage.cjs from the firewall proxy log at /tmp/gh-aw/sandbox/firewall*/logs/api-proxy-logs/token-usage.jsonl. When the firewall proxy isn't active for a given engine — or when its log path differs from the two hard-coded paths — agent_usage.json is never written, and no token data ever reaches OTel. The Claude / Copilot stream-json result event already carries the same usage data on disk in agent-stdio.log, but readAgentRuntimeMetrics() in actions/setup/js/send_otlp_span.cjs reads only num_turns, total_cost_usd, stop_reason, and model from that event — it ignores the usage block entirely.

Why This Matters (DevOps Perspective)

Without gen_ai.usage.* on at least 2 of every 3 agent.conclusion spans, the following operational questions cannot be answered from telemetry alone:

  • "Which workflow consumed the most input/output tokens this week?" The query sum(gen_ai.usage.total_tokens) by gh-aw.workflow.name misses the ~72% of conclusion spans, across all engines, that carry no token data.
  • "Are we approaching token quotas for a model?" Model-level token aggregations are unreliable.
  • "What's the cost-per-trigger for our scheduled workflows?" Cost dashboards built on gen_ai.usage.* show only a sliver of real usage.
  • "Is engine X significantly more expensive than engine Y?" Engine comparisons are skewed: pi and gemini report 0 tokens, which makes them look "free".

For an on-call engineer triaging a cost spike or quota incident, the missing data forces a fallback to GitHub job logs and per-workflow firewall artifacts, which increases MTTR.

Current Behavior

In actions/setup/js/send_otlp_span.cjs, the agent stdio parser reads only num_turns, total_cost_usd, stop_reason, and model from {"type": "result", ...} events:

// actions/setup/js/send_otlp_span.cjs lines 1555–1567
if (parsed.type !== "result") {
  return;
}

if (typeof parsed.num_turns === "number" && parsed.num_turns >= 0) {
  metrics.turns = parsed.num_turns;
}
if (typeof parsed.total_cost_usd === "number" && Number.isFinite(parsed.total_cost_usd) && parsed.total_cost_usd >= 0) {
  metrics.estimatedCostUsd = parsed.total_cost_usd;
}
if (typeof parsed.stop_reason === "string" && parsed.stop_reason) {
  metrics.stopReason = parsed.stop_reason;
}

The Claude / Copilot result event additionally carries a usage object that the parser ignores:

{
  "type": "result",
  "subtype": "success",
  "num_turns": 12,
  "total_cost_usd": 0.42,
  "usage": {
    "input_tokens": 4120,
    "output_tokens": 870,
    "cache_creation_input_tokens": 1500,
    "cache_read_input_tokens": 2200
  }
}
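
To make the gap concrete, here is a minimal sketch of pulling that usage object out of stream-json output. It assumes agent-stdio.log holds one JSON event per line, possibly interleaved with plain log lines; extractResultUsage is a hypothetical name for illustration, not a function in the repository.

```javascript
// Sketch only: scan stream-json text and return the usage block of the
// last {"type": "result"} event, or null when none is found.
// Assumption: one JSON event per line, mixed with non-JSON log output.
function extractResultUsage(logText) {
  let usage = null;
  for (const line of logText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("{")) continue; // skip plain log output
    let parsed;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // skip truncated or malformed JSON lines
    }
    if (parsed && parsed.type === "result" && parsed.usage && typeof parsed.usage === "object") {
      usage = parsed.usage; // keep the last result event seen
    }
  }
  return usage;
}

// Made-up sample input for illustration.
const sampleLog = [
  "starting agent...",
  '{"type":"assistant","message":"..."}',
  '{"type":"result","subtype":"success","num_turns":12,"usage":{"input_tokens":4120,"output_tokens":870}}',
].join("\n");

console.log(extractResultUsage(sampleLog));
```

If the log contains no result event, for example because the run was cancelled, the function returns null and the caller would have nothing to fall back on.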

Downstream, sendJobConclusionSpan only reads token data from /tmp/gh-aw/agent_usage.json:

// actions/setup/js/send_otlp_span.cjs lines 2019–2040
const agentUsage = readJSONIfExists("/tmp/gh-aw/agent_usage.json") || {};
const usageAttrs = [];
if (typeof agentUsage.input_tokens === "number" && agentUsage.input_tokens > 0) {
  usageAttrs.push(buildAttr("gen_ai.usage.input_tokens", agentUsage.input_tokens));
}
// ...same for output_tokens, cache_read_tokens, cache_write_tokens, total_tokens

When agent_usage.json is absent (no firewall proxy log), usageAttrs stays empty and no gen_ai.usage.* attribute is emitted.

Proposed Change

Extend readAgentRuntimeMetrics() to also extract the usage block, and use it as a fallback in sendJobConclusionSpan when agent_usage.json is missing or has zero counts.

// 1) In readAgentRuntimeMetrics (actions/setup/js/send_otlp_span.cjs):
// extend the AgentRuntimeMetrics typedef with optional usage fields
//   @property {number | undefined} inputTokens
//   @property {number | undefined} outputTokens
//   @property {number | undefined} cacheReadTokens
//   @property {number | undefined} cacheWriteTokens

// after the `if (parsed.type !== "result") return;` guard (so it runs only for result events), add:
if (parsed.usage && typeof parsed.usage === "object") {
  const u = parsed.usage;
  if (typeof u.input_tokens === "number" && u.input_tokens >= 0) {
    metrics.inputTokens = u.input_tokens;
  }
  if (typeof u.output_tokens === "number" && u.output_tokens >= 0) {
    metrics.outputTokens = u.output_tokens;
  }
  if (typeof u.cache_read_input_tokens === "number" && u.cache_read_input_tokens >= 0) {
    metrics.cacheReadTokens = u.cache_read_input_tokens;
  }
  if (typeof u.cache_creation_input_tokens === "number" && u.cache_creation_input_tokens >= 0) {
    metrics.cacheWriteTokens = u.cache_creation_input_tokens;
  }
}

// 2) In sendJobConclusionSpan, after `const agentUsage = readJSONIfExists(...) || {};`:
// fall back to runtimeMetrics fields when agent_usage.json lacks the value.
const usage = {
  input_tokens: agentUsage.input_tokens || runtimeMetrics.inputTokens,
  output_tokens: agentUsage.output_tokens || runtimeMetrics.outputTokens,
  cache_read_tokens: agentUsage.cache_read_tokens || runtimeMetrics.cacheReadTokens,
  cache_write_tokens: agentUsage.cache_write_tokens || runtimeMetrics.cacheWriteTokens,
};

// then use `usage.*` in place of `agentUsage.*` when building usageAttrs.

The fallback path is non-destructive: when the firewall log is present, agent_usage.json wins (preserving today's behavior); when it's absent, the stream-json result event fills the gap.
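
The precedence rule can also be sketched as a standalone function that recomputes the total. This is illustrative only: resolveUsage is a hypothetical helper, and defining the total as the sum of the four counters is an assumption that should be checked against what parse_token_usage.cjs writes to agent_usage.json.

```javascript
// Hypothetical helper (not in the repository): resolve each counter from
// agent_usage.json first, then from the stream-json runtime metrics.
function resolveUsage(agentUsage, runtimeMetrics) {
  // Only positive finite numbers count as "present", mirroring the `> 0`
  // checks used when building usageAttrs.
  const usable = n => typeof n === "number" && Number.isFinite(n) && n > 0;
  const pick = (primary, fallback) => (usable(primary) ? primary : usable(fallback) ? fallback : undefined);

  const usage = {
    input_tokens: pick(agentUsage.input_tokens, runtimeMetrics.inputTokens),
    output_tokens: pick(agentUsage.output_tokens, runtimeMetrics.outputTokens),
    cache_read_tokens: pick(agentUsage.cache_read_tokens, runtimeMetrics.cacheReadTokens),
    cache_write_tokens: pick(agentUsage.cache_write_tokens, runtimeMetrics.cacheWriteTokens),
  };

  // Assumption: total = input + output + cache read + cache write.
  const recomputed = Object.values(usage).reduce((sum, n) => sum + (n || 0), 0);
  // Keep a reported total when there is one; otherwise use the recomputed sum.
  usage.total_tokens = pick(agentUsage.total_tokens, recomputed);
  return usage;
}

// Firewall log absent: everything comes from the result event.
console.log(resolveUsage({}, { inputTokens: 4120, outputTokens: 870, cacheReadTokens: 2200, cacheWriteTokens: 1500 }));

// Both sources present: agent_usage.json wins.
console.log(resolveUsage({ input_tokens: 100, output_tokens: 20, total_tokens: 120 }, { inputTokens: 999, outputTokens: 999 }));
```

Resolving per field means one span can mix counters from both sources. If that is undesirable, an all-or-nothing switch (use the result event only when agent_usage.json yields no counters at all) would be a reasonable alternative.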

Expected Outcome

After this change:

  • In Grafana / Tempo / Sentry: sum(gen_ai.usage.total_tokens) by gh-aw.workflow.name becomes meaningful. Coverage for engines that emit a stream-json result event (Claude, Copilot, Codex) should rise from 23–34% toward 95%+ on successful runs.
  • In the local /tmp/gh-aw/otel.jsonl mirror: agent-job spans on machines without the firewall proxy will carry usage attributes for the first time, enabling offline cost analysis from artifact downloads alone.
  • For on-call engineers: a single Sentry / Grafana query (sum(gen_ai.usage.total_tokens)) answers "which workflow burned tokens?" without cross-referencing per-job firewall artifacts.
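
As a rough illustration of the offline analysis mentioned above, the sketch below sums gen_ai.usage.total_tokens per workflow from JSONL text. The layout of /tmp/gh-aw/otel.jsonl is not described in this issue, so the code assumes each line is an OTLP/JSON export payload (resourceSpans → scopeSpans → spans); sumTokensByWorkflow and attrValue are hypothetical names, and the sample data is made up.

```javascript
// Read one attribute from an OTLP/JSON attribute list.
// Assumption: attributes look like [{ key, value: { stringValue | intValue | doubleValue } }].
function attrValue(attributes, key) {
  const hit = (attributes || []).find(a => a.key === key);
  if (!hit || !hit.value) return undefined;
  const v = hit.value;
  if (v.stringValue !== undefined) return v.stringValue;
  if (v.intValue !== undefined) return Number(v.intValue); // int64 may arrive as a string
  if (v.doubleValue !== undefined) return v.doubleValue;
  return undefined;
}

// Sum total tokens per workflow across all gh-aw.agent.conclusion spans.
function sumTokensByWorkflow(jsonlText) {
  const totals = new Map();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    let payload;
    try {
      payload = JSON.parse(line);
    } catch {
      continue; // ignore malformed lines
    }
    if (!payload || typeof payload !== "object") continue;
    for (const rs of payload.resourceSpans || []) {
      for (const ss of rs.scopeSpans || []) {
        for (const span of ss.spans || []) {
          if (span.name !== "gh-aw.agent.conclusion") continue;
          const workflow = attrValue(span.attributes, "gh-aw.workflow.name") || "(unknown)";
          const tokens = attrValue(span.attributes, "gen_ai.usage.total_tokens");
          if (typeof tokens === "number" && Number.isFinite(tokens)) {
            totals.set(workflow, (totals.get(workflow) || 0) + tokens);
          }
        }
      }
    }
  }
  return totals;
}

// Made-up sample: two export payloads for the same workflow.
const mkLine = tokens =>
  JSON.stringify({
    resourceSpans: [{ scopeSpans: [{ spans: [{
      name: "gh-aw.agent.conclusion",
      attributes: [
        { key: "gh-aw.workflow.name", value: { stringValue: "Daily MCP Tool Concurrency Analysis" } },
        { key: "gen_ai.usage.total_tokens", value: { intValue: String(tokens) } },
      ],
    }] }] }],
  });
const sampleJsonl = [mkLine(8690), mkLine(1310)].join("\n");

console.log(Object.fromEntries(sumTokensByWorkflow(sampleJsonl)));
```

In practice the text would come from something like fs.readFileSync on the downloaded artifact rather than an in-memory sample.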

Implementation Steps

  • Extend the AgentRuntimeMetrics typedef and readAgentRuntimeMetrics() parser in actions/setup/js/send_otlp_span.cjs to capture usage.input_tokens, usage.output_tokens, usage.cache_read_input_tokens, usage.cache_creation_input_tokens from {"type": "result", ...} events.
  • In sendJobConclusionSpan, prefer agent_usage.json values when present (truthy) but fall back to runtimeMetrics.*Tokens when they are missing or zero. Recompute totalTokens from the resolved values.
  • Update actions/setup/js/send_otlp_span.test.cjs with two new cases:
    1. agent_usage.json absent + agent-stdio.log contains a result event with a usage block → conclusion span carries gen_ai.usage.input_tokens / output_tokens / total_tokens.
    2. Both sources present → agent_usage.json wins (regression guard).
  • Run cd actions/setup/js && npx vitest run send_otlp_span to confirm tests pass.
  • Run make fmt and make test-unit from the repo root.
  • Open a PR referencing this issue.

Evidence from Live OTel Data (Sentry/Grafana)

Sentry — github / gh-aw, dataset spans, last 7 days, grouped by gh-aw.engine.id:

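| engine  | spans | spans with gen_ai.usage.total_tokens > 0 | missing % |
|---------|-------|------------------------------------------|-----------|
| copilot | 1,073 | 297                                      | 72%       |
| claude  | 324   | 111                                      | 66%       |
| codex   | 112   | 26                                       | 77%       |
| pi      | 22    | 0                                        | 100%      |
| gemini  | 16    | 0                                        | 100%      |
| total   | 1,547 | 434                                      | ~72%      |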

Query:

span.name:gh-aw.agent.conclusion
fields: gh-aw.engine.id, count(), count_if(gen_ai.usage.total_tokens, greater, 0)
statsPeriod: 7d
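
The missing % column can be rechecked from the raw counts in the table above. The snippet below is only an arithmetic check; missingPct is an illustrative helper name.

```javascript
// Counts copied from the Sentry table above.
const rows = [
  { engine: "copilot", spans: 1073, withTokens: 297 },
  { engine: "claude", spans: 324, withTokens: 111 },
  { engine: "codex", spans: 112, withTokens: 26 },
  { engine: "pi", spans: 22, withTokens: 0 },
  { engine: "gemini", spans: 16, withTokens: 0 },
];

// Share of spans without token data, rounded to a whole percent.
function missingPct(spans, withTokens) {
  return Math.round((1 - withTokens / spans) * 100);
}

for (const r of rows) {
  console.log(`${r.engine}: ${missingPct(r.spans, r.withTokens)}% missing`);
}

const totalSpans = rows.reduce((n, r) => n + r.spans, 0);
const totalWithTokens = rows.reduce((n, r) => n + r.withTokens, 0);
console.log(`total: ${totalSpans} spans, ${totalWithTokens} with tokens, ${missingPct(totalSpans, totalWithTokens)}% missing`);
// total: 1547 spans, 434 with tokens, 72% missing
```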

Grafana / Tempo (grafanacloud-traces) confirms the attribute keys: the span-scope tag list includes gh-aw.engine.id, gh-aw.workflow.name, and gh-aw.action_minutes, but does not include gh-aw.turns, gh-aw.estimated_cost_usd, or gen_ai.response.model. The existing result-event derived attributes are therefore also missing. Token data extracted from the same event would be emitted through the separate usageAttrs path proposed above, but it still depends on the result event being found and parsed, so result-event parsing may need follow-up debugging if coverage does not improve.

Representative trace: 1e395bf7dd92c4e6eee4162ff0b78906 (gh-aw.activation.setup → gh-aw.agent.setup → gh-aw.agent.conclusion). Engine copilot, workflow Daily MCP Tool Concurrency Analysis. The gh-aw.agent.conclusion span carries gh-aw.run.status=success, gh-aw.engine.id=copilot, gen_ai.system=github_models, but no gen_ai.usage.* attributes.

Related Files

  • actions/setup/js/send_otlp_span.cjs: readAgentRuntimeMetrics() (lines 1533–1612), sendJobConclusionSpan() token-attribute block (lines 2019–2040)
  • actions/setup/js/send_otlp_span.test.cjs — add new vitest cases for the fallback path
  • actions/setup/js/parse_token_usage.cjs — unchanged; remains the preferred source when firewall logs exist
  • actions/setup/js/action_conclusion_otlp.cjs — unchanged; sends the enriched span

Generated by the Daily OTel Instrumentation Advisor workflow

  • expires on May 29, 2026, 10:32 AM UTC
