vercelAIIntegration double-counts cached input tokens with AI SDK 6 (gen_ai.usage.input_tokens inflated by cache reads) #21484

@raymondhechen

Is there an existing issue for this?

How do you use Sentry?

Sentry SaaS (sentry.io)

Which SDK are you using?

@sentry/node

SDK Version

10.49.0 (reproduced through 10.57.0 and current develop)

Framework Version

ai@6.0.184 (Vercel AI SDK 6)

Link to Sentry event

(org data; can share privately if needed)

Reproduction Example/SDK Setup

With AI SDK 6, every provider normalizes usage.inputTokens to be cache-inclusive, i.e. total = noCache + cacheRead + cacheWrite:

  • @ai-sdk/anthropic: total: inputTokens + cacheCreationTokens + cacheReadTokens
  • @ai-sdk/amazon-bedrock: total: inputTokens + cacheReadTokens + cacheWriteTokens
  • @ai-sdk/google: total: promptTokenCount (Google's promptTokenCount already includes cachedContentTokenCount)

The SDK's telemetry then emits both ai.usage.inputTokens (= cache-inclusive total; it even sets gen_ai.usage.input_tokens itself per OTel GenAI semconv) and ai.usage.cachedInputTokens (= cacheRead).

processVercelAiSpanAttributes (packages/core/src/tracing/vercel-ai/index.ts, the block commented // Input tokens is the sum of prompt tokens and cached input tokens) renames both and then does:

attributes[GEN_AI_USAGE_INPUT_TOKENS_ATTRIBUTE] =
  attributes[GEN_AI_USAGE_INPUT_TOKENS_ATTRIBUTE] + attributes[GEN_AI_USAGE_INPUT_TOKENS_CACHED_ATTRIBUTE];

That heuristic matches AI SDK 5 Anthropic semantics (where promptTokens excluded cache), but with AI SDK 6 it double-counts cache reads on every generation span.
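To make the double count concrete, here is a minimal arithmetic sketch. The helper names `sdk6InputTotal` and `legacySummation` are hypothetical and only mirror the arithmetic described above; they are not taken from the AI SDK or Sentry source.

```javascript
// Hypothetical helper: how an AI SDK 6 provider is described as computing inputTokens.
function sdk6InputTotal({ noCache, cacheRead, cacheWrite }) {
  return noCache + cacheRead + cacheWrite;
}

// Hypothetical helper: the summation the processor applies after renaming.
function legacySummation(inputTokens, cachedInputTokens) {
  return inputTokens + cachedInputTokens;
}

const usage = { noCache: 1000, cacheRead: 8000, cacheWrite: 500 };

const reported = sdk6InputTotal(usage);                     // already includes cacheRead
const emitted = legacySummation(reported, usage.cacheRead); // cacheRead added a second time

console.log(reported, emitted); // 9500 17500
```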

Minimal repro against the real processor:

import { addVercelAiProcessors } from '@sentry/core';

let processor;
addVercelAiProcessors({ on: () => () => {}, addEventProcessor: p => (processor = p) });

const event = processor({
  type: 'transaction',
  contexts: { trace: {} },
  spans: [{
    span_id: 'aaaaaaaaaaaaaaaa',
    origin: 'auto.vercelai.otel',
    op: 'gen_ai.generate_content',
    data: {
      'ai.operationId': 'ai.streamText.doStream',
      'operation.name': 'ai.streamText.doStream',
      'ai.usage.inputTokens': 9500,            // = 1000 noCache + 8000 cacheRead + 500 cacheWrite (SDK 6 total)
      'ai.usage.outputTokens': 300,
      'ai.usage.cachedInputTokens': 8000,
      'ai.usage.inputTokenDetails.noCacheTokens': 1000,
      'ai.usage.inputTokenDetails.cacheReadTokens': 8000,
      'ai.usage.inputTokenDetails.cacheWriteTokens': 500,
    },
  }],
});

console.log(event.spans[0].data['gen_ai.usage.input_tokens']);
// actual: 17500  — expected: 9500

Steps to Reproduce

  1. Use ai@6.x with any provider that reports prompt-cache usage (Anthropic / Bedrock / Google) and experimental_telemetry.isEnabled: true
  2. Enable vercelAIIntegration in @sentry/node
  3. Make a call that gets cache hits
  4. Inspect gen_ai.usage.input_tokens on the gen_ai.generate_content span

Expected Result

gen_ai.usage.input_tokens equals the SDK-reported inputTokens total (9500 above). For AI SDK 6 spans the summation should be skipped — ai.usage.inputTokenDetails.noCacheTokens being present is a reliable v6 marker, or gate on ai.operationId which is v6+.
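A rough sketch of the suggested gate, assuming the presence of ai.usage.inputTokenDetails.noCacheTokens is enough to identify a cache-inclusive total. `resolveInputTokens` is a hypothetical helper operating on the raw ai.usage.* span data, not the actual processor code.

```javascript
// Assumption: this key is only emitted when inputTokens is already cache-inclusive.
const NO_CACHE_TOKENS_KEY = 'ai.usage.inputTokenDetails.noCacheTokens';

// Hypothetical helper: returns the value to report as gen_ai.usage.input_tokens.
function resolveInputTokens(data) {
  const input = data['ai.usage.inputTokens'];
  const cached = data['ai.usage.cachedInputTokens'];
  if (typeof input !== 'number') return undefined;

  // Cache-inclusive total: report it unchanged.
  if (typeof data[NO_CACHE_TOKENS_KEY] === 'number') return input;

  // Older semantics (input excludes cache): keep the existing summation.
  return typeof cached === 'number' ? input + cached : input;
}

const v6Span = {
  'ai.usage.inputTokens': 9500,
  'ai.usage.cachedInputTokens': 8000,
  'ai.usage.inputTokenDetails.noCacheTokens': 1000,
};
const olderSpan = { 'ai.usage.inputTokens': 1000, 'ai.usage.cachedInputTokens': 8000 };

console.log(resolveInputTokens(v6Span), resolveInputTokens(olderSpan)); // 9500 9000
```

Keeping the summation for spans without the marker preserves current behavior for older SDK versions.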

Actual Result

gen_ai.usage.input_tokens = total + cacheRead = noCache + cacheWrite + 2×cacheRead (Anthropic/Bedrock) or promptTokenCount + cachedContent (Google). gen_ai.usage.total_tokens and the accumulated invoke_agent totals inherit the inflation, as do AI-cost views derived from these attributes. On agent traffic with high cache-hit rates (~85% of input cached) input tokens are overstated ~1.85×, which is how we noticed — Sentry token dashboards diverged ~2× from AWS Bedrock / Google Vertex billing consoles.
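The ~1.85× figure follows directly from the formula above. As a short derivation, writing $r$ for the fraction of the cache-inclusive total that is cache reads (a symbol introduced here for illustration):

```latex
\frac{\text{emitted}}{\text{total}}
  = \frac{\text{total} + \text{cacheRead}}{\text{total}}
  = 1 + r,
\qquad r = \frac{\text{cacheRead}}{\text{total}}
```

With $r \approx 0.85$ this gives a factor of about $1.85$; for the repro numbers, $r = 8000 / 9500 \approx 0.84$ and $17500 / 9500 \approx 1.84$.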

Workaround we ship: a beforeSendTransaction that recomputes input_tokens/total_tokens from ai.usage.inputTokenDetails.*.
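A minimal sketch of what such a hook could look like. `recomputeAiTokenUsage` is a hypothetical name, and the sketch assumes the ai.usage.inputTokenDetails.* attributes and gen_ai.usage.output_tokens are still present on the span data when the hook runs. It only touches spans that carry the detail attributes.

```javascript
const DETAIL_KEYS = [
  'ai.usage.inputTokenDetails.noCacheTokens',
  'ai.usage.inputTokenDetails.cacheReadTokens',
  'ai.usage.inputTokenDetails.cacheWriteTokens',
];

// Hypothetical hook body: rebuild input/total tokens from the per-category details.
function recomputeAiTokenUsage(event) {
  for (const span of event.spans ?? []) {
    const data = span.data;
    // Skip spans without the detail breakdown.
    if (!data || typeof data[DETAIL_KEYS[0]] !== 'number') continue;

    const input = DETAIL_KEYS.reduce(
      (sum, key) => sum + (typeof data[key] === 'number' ? data[key] : 0),
      0,
    );
    data['gen_ai.usage.input_tokens'] = input;

    const output = data['gen_ai.usage.output_tokens'];
    if (typeof output === 'number') {
      data['gen_ai.usage.total_tokens'] = input + output;
    }
  }
  return event;
}

// Intended wiring (not executed here):
// Sentry.init({ dsn: '...', beforeSendTransaction: recomputeAiTokenUsage });
```

Because it rewrites only spans with the breakdown, accumulated totals on parent spans would need separate handling.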
