📡 OTel Instrumentation Improvement: surface token usage from agent-stdio.log when firewall-proxy logs are absent
Analysis Date: 2026-05-22
Priority: High
Effort: Small (< 2h)
Problem
gen_ai.usage.total_tokens (and the sibling gen_ai.usage.input_tokens / output_tokens / cache attributes) are missing from the majority of gh-aw.agent.conclusion spans in production telemetry. Token data is currently sourced only from /tmp/gh-aw/agent_usage.json, which is written by parse_token_usage.cjs from the firewall proxy log at /tmp/gh-aw/sandbox/firewall*/logs/api-proxy-logs/token-usage.jsonl. When the firewall proxy isn't active for a given engine — or when its log path differs from the two hard-coded paths — agent_usage.json is never written, and no token data ever reaches OTel. The Claude / Copilot stream-json result event already carries the same usage data on disk in agent-stdio.log, but readAgentRuntimeMetrics() in actions/setup/js/send_otlp_span.cjs reads only num_turns, total_cost_usd, stop_reason, and model from that event — it ignores the usage block entirely.
Why This Matters (DevOps Perspective)
Without gen_ai.usage.* on at least 2 of every 3 gh-aw.agent.conclusion spans, the following operational questions cannot be answered from telemetry alone:
- "Which workflow consumed the most input/output tokens this week?" Today, sum(gen_ai.usage.total_tokens) by gh-aw.workflow.name has no token data for ~72% of spans across all engines, so it undercounts heavily.
- "Are we approaching token quotas for a model?" Model-level token aggregations are unreliable.
- "What's the cost-per-trigger for our scheduled workflows?" Cost dashboards built on gen_ai.usage.* show only a sliver of real usage.
- "Is engine X significantly more expensive than engine Y?" Engine comparisons are skewed (pi and gemini report 0 tokens, which makes them look "free").
For an on-call engineer triaging a cost spike or quota incident, the missing data forces a fallback to GitHub job logs and per-workflow firewall artifacts, which increases MTTR.
Current Behavior
In actions/setup/js/send_otlp_span.cjs, the agent stdio parser reads only num_turns, total_cost_usd, stop_reason, and model from {"type": "result", ...} events:
// actions/setup/js/send_otlp_span.cjs lines 1555–1567
if (parsed.type !== "result") {
  return;
}
if (typeof parsed.num_turns === "number" && parsed.num_turns >= 0) {
  metrics.turns = parsed.num_turns;
}
if (typeof parsed.total_cost_usd === "number" && Number.isFinite(parsed.total_cost_usd) && parsed.total_cost_usd >= 0) {
  metrics.estimatedCostUsd = parsed.total_cost_usd;
}
if (typeof parsed.stop_reason === "string" && parsed.stop_reason) {
  metrics.stopReason = parsed.stop_reason;
}
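The Claude / Copilot result event additionally carries a usage object that the parser ignores:
{
  "type": "result",
  "subtype": "success",
  "num_turns": 12,
  "total_cost_usd": 0.42,
  "usage": {
    "input_tokens": 4120,
    "output_tokens": 870,
    "cache_creation_input_tokens": 1500,
    "cache_read_input_tokens": 2200
  }
}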
Downstream, sendJobConclusionSpan only reads token data from /tmp/gh-aw/agent_usage.json:
// actions/setup/js/send_otlp_span.cjs lines 2019–2040
const agentUsage = readJSONIfExists("/tmp/gh-aw/agent_usage.json") || {};
const usageAttrs = [];
if (typeof agentUsage.input_tokens === "number" && agentUsage.input_tokens > 0) {
  usageAttrs.push(buildAttr("gen_ai.usage.input_tokens", agentUsage.input_tokens));
}
// ...same for output_tokens, cache_read_tokens, cache_write_tokens, total_tokens
When agent_usage.json is absent (no firewall proxy log), usageAttrs stays empty and no gen_ai.usage.* attribute is emitted.
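To make the gap concrete, the sketch below shows how little is needed to recover the unread usage data from agent-stdio.log. It is a hypothetical standalone helper (extractResultUsage is not a function in send_otlp_span.cjs) and it assumes one JSON event per line, that non-JSON lines can be skipped, and that the last result event is the one that matters. The real parser may differ on each of those points.

```javascript
// Hypothetical helper (not part of send_otlp_span.cjs): scan agent-stdio.log
// text for stream-json "result" events and return the usage block of the
// last one, or null when no result event carries a usage object.
function extractResultUsage(logText) {
  let usage = null;
  for (const line of logText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("{")) continue; // skip non-JSON log noise
    let parsed;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // tolerate truncated or malformed lines
    }
    if (parsed && parsed.type === "result" && parsed.usage && typeof parsed.usage === "object") {
      usage = parsed.usage; // assumption: the last result event wins
    }
  }
  return usage;
}

// Sample log built from the result event shown in this issue.
const sampleLog = [
  "starting agent...",
  '{"type":"assistant","message":"working"}',
  '{"type":"result","subtype":"success","num_turns":12,"total_cost_usd":0.42,' +
    '"usage":{"input_tokens":4120,"output_tokens":870,' +
    '"cache_creation_input_tokens":1500,"cache_read_input_tokens":2200}}',
].join("\n");

console.log(extractResultUsage(sampleLog));
```

Taking the input as a string keeps the sketch independent of the file system; a real caller would read the log file first and pass its contents in.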
Proposed Change
Extend readAgentRuntimeMetrics() to also extract the usage block, and use it as a fallback in sendJobConclusionSpan when agent_usage.json is missing or has zero counts.
// 1) In readAgentRuntimeMetrics (actions/setup/js/send_otlp_span.cjs):
//    extend the AgentRuntimeMetrics typedef with optional usage fields
//    @property {number | undefined} inputTokens
//    @property {number | undefined} outputTokens
//    @property {number | undefined} cacheReadTokens
//    @property {number | undefined} cacheWriteTokens
//    Then, after the `if (parsed.type !== "result") return;` guard, add:
if (parsed.usage && typeof parsed.usage === "object") {
  const u = parsed.usage;
  if (typeof u.input_tokens === "number" && u.input_tokens >= 0) {
    metrics.inputTokens = u.input_tokens;
  }
  if (typeof u.output_tokens === "number" && u.output_tokens >= 0) {
    metrics.outputTokens = u.output_tokens;
  }
  if (typeof u.cache_read_input_tokens === "number" && u.cache_read_input_tokens >= 0) {
    metrics.cacheReadTokens = u.cache_read_input_tokens;
  }
  if (typeof u.cache_creation_input_tokens === "number" && u.cache_creation_input_tokens >= 0) {
    metrics.cacheWriteTokens = u.cache_creation_input_tokens;
  }
}

// 2) In sendJobConclusionSpan, after `const agentUsage = readJSONIfExists(...) || {};`:
//    fall back to runtimeMetrics fields when agent_usage.json lacks the value.
//    `||` treats a missing or zero count in agent_usage.json as absent.
const usage = {
  input_tokens: agentUsage.input_tokens || runtimeMetrics.inputTokens,
  output_tokens: agentUsage.output_tokens || runtimeMetrics.outputTokens,
  cache_read_tokens: agentUsage.cache_read_tokens || runtimeMetrics.cacheReadTokens,
  cache_write_tokens: agentUsage.cache_write_tokens || runtimeMetrics.cacheWriteTokens,
};
//    Then use `usage.*` in place of `agentUsage.*` when building usageAttrs.
The fallback path is non-destructive: when the firewall log is present, agent_usage.json wins (preserving today's behavior); when it's absent, the stream-json result event fills the gap.
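The precedence rule can be checked in isolation with a small sketch. pickCount and resolveUsage are hypothetical names, not existing functions in send_otlp_span.cjs. The total is assumed here to be the plain sum of the four resolved counts whenever agent_usage.json does not supply a positive total_tokens; the definition used by parse_token_usage.cjs may differ and should be confirmed before adopting this.

```javascript
// Hypothetical helper: prefer a positive count from agent_usage.json,
// otherwise fall back to the value parsed from the stream-json result event.
function pickCount(primary, fallback) {
  if (typeof primary === "number" && primary > 0) return primary;
  if (typeof fallback === "number" && fallback > 0) return fallback;
  return undefined;
}

// Hypothetical helper: resolve each usage field with agent_usage.json taking
// precedence, then derive total_tokens.
function resolveUsage(agentUsage = {}, runtimeMetrics = {}) {
  const usage = {
    input_tokens: pickCount(agentUsage.input_tokens, runtimeMetrics.inputTokens),
    output_tokens: pickCount(agentUsage.output_tokens, runtimeMetrics.outputTokens),
    cache_read_tokens: pickCount(agentUsage.cache_read_tokens, runtimeMetrics.cacheReadTokens),
    cache_write_tokens: pickCount(agentUsage.cache_write_tokens, runtimeMetrics.cacheWriteTokens),
  };
  // Assumption: total is the sum of all four resolved counts unless
  // agent_usage.json already provides a positive total_tokens.
  const sum = Object.values(usage).reduce((acc, n) => acc + (n || 0), 0);
  usage.total_tokens = pickCount(agentUsage.total_tokens, sum);
  return usage;
}

// Firewall log absent: every value comes from the result event.
console.log(resolveUsage({}, { inputTokens: 4120, outputTokens: 870, cacheReadTokens: 2200, cacheWriteTokens: 1500 }));

// Firewall log present: agent_usage.json wins for the fields it provides.
console.log(resolveUsage({ input_tokens: 100, output_tokens: 50, total_tokens: 150 }, { inputTokens: 4120, outputTokens: 870 }));
```

Fields that resolve to undefined are simply skipped by the existing `typeof … === "number" && … > 0` checks when usageAttrs is built, so no extra guard is needed downstream.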
Expected Outcome
After this change:
- In Grafana / Tempo / Sentry: sum(gen_ai.usage.total_tokens) by gh-aw.workflow.name becomes meaningful. Coverage for engines that emit a stream-json result event (Claude, Copilot, Codex) should rise from 23–34% toward 95%+ on successful runs.
- In the local /tmp/gh-aw/otel.jsonl mirror: agent-job spans on machines without the firewall proxy will carry usage attributes for the first time, enabling offline cost analysis from artifact downloads alone.
- For on-call engineers: a single Sentry / Grafana query (sum(gen_ai.usage.total_tokens)) answers "which workflow burned tokens?" without cross-referencing per-job firewall artifacts.
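Implementation Steps
- Extend the AgentRuntimeMetrics typedef and readAgentRuntimeMetrics() parser in actions/setup/js/send_otlp_span.cjs to capture usage.input_tokens, usage.output_tokens, usage.cache_read_input_tokens, and usage.cache_creation_input_tokens from {"type": "result", ...} events.
- In sendJobConclusionSpan, prefer agent_usage.json values when present (truthy) but fall back to runtimeMetrics.*Tokens when they are missing or zero. Recompute totalTokens from the resolved values.
- Extend actions/setup/js/send_otlp_span.test.cjs with two new cases:
  - agent_usage.json absent + agent-stdio.log contains a result event with a usage block → conclusion span carries gen_ai.usage.input_tokens / output_tokens / total_tokens.
  - Both sources present → agent_usage.json wins (regression guard).
- Run cd actions/setup/js && npx vitest run send_otlp_span to confirm tests pass.
- Run make fmt and make test-unit from the repo root.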
Evidence from Live OTel Data (Sentry/Grafana)
Sentry — github / gh-aw, dataset spans, last 7 days, grouped by gh-aw.engine.id:
| engine | spans | spans with gen_ai.usage.total_tokens > 0 | missing % |
| --- | --- | --- | --- |
| copilot | 1,073 | 297 | 72% |
| claude | 324 | 111 | 66% |
| codex | 112 | 26 | 77% |
| pi | 22 | 0 | 100% |
| gemini | 16 | 0 | 100% |
| total | 1,547 | 434 | ~72% |
Query:
  span.name:gh-aw.agent.conclusion
  fields: gh-aw.engine.id, count(), count_if(gen_ai.usage.total_tokens, greater, 0)
  statsPeriod: 7d
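The missing-percentage figures follow directly from the two count columns. As a quick sanity check, the sketch below recomputes them from the numbers reported above; the rows array and the missingPercent helper are illustrative names only, and the counts are copied from the Sentry result rather than queried live.

```javascript
// Counts copied from the Sentry query result in this issue.
const rows = [
  { engine: "copilot", spans: 1073, withTokens: 297 },
  { engine: "claude", spans: 324, withTokens: 111 },
  { engine: "codex", spans: 112, withTokens: 26 },
  { engine: "pi", spans: 22, withTokens: 0 },
  { engine: "gemini", spans: 16, withTokens: 0 },
];

// Share of spans without gen_ai.usage.total_tokens > 0, rounded to a whole percent.
function missingPercent(spans, withTokens) {
  return Math.round((1 - withTokens / spans) * 100);
}

// Sum both columns to get the "total" row.
const totals = rows.reduce(
  (acc, r) => ({ spans: acc.spans + r.spans, withTokens: acc.withTokens + r.withTokens }),
  { spans: 0, withTokens: 0 }
);

for (const r of rows) {
  console.log(`${r.engine}: ${missingPercent(r.spans, r.withTokens)}% missing`);
}
console.log(`total: ${totals.spans} spans, ${missingPercent(totals.spans, totals.withTokens)}% missing`);
```

Note that this measures span coverage, not token volume: a span that lacks the attribute could represent a small or a very large run, so the true token undercount cannot be derived from these counts alone.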
Grafana / Tempo (grafanacloud-traces) — confirms attribute keys: the span-scope tag list includes gh-aw.engine.id, gh-aw.workflow.name, and gh-aw.action_minutes, but does not include gh-aw.turns, gh-aw.estimated_cost_usd, or gen_ai.response.model. This shows the existing result-event derived attributes are also missing — but token data extracted from the same event would still flow through the independent usageAttrs path proposed above, even if result-event parsing later needs follow-up debugging.
Representative trace: 1e395bf7dd92c4e6eee4162ff0b78906 (gh-aw.activation.setup → gh-aw.agent.setup → gh-aw.agent.conclusion). Engine copilot, workflow Daily MCP Tool Concurrency Analysis. The gh-aw.agent.conclusion span carries gh-aw.run.status=success, gh-aw.engine.id=copilot, gen_ai.system=github_models — but no gen_ai.usage.* attributes.
Related Files
actions/setup/js/send_otlp_span.cjs — readAgentRuntimeMetrics() (lines 1533–1612), sendJobConclusionSpan() token-attribute block (lines 2019–2040)
actions/setup/js/send_otlp_span.test.cjs — add new vitest cases for the fallback path
actions/setup/js/parse_token_usage.cjs — unchanged; remains the preferred source when firewall logs exist
actions/setup/js/action_conclusion_otlp.cjs — unchanged; sends the enriched span
Generated by the Daily OTel Instrumentation Advisor workflow