whisper : map token timestamps through VAD offset table (#3754) by achyutbenz19 · Pull Request #3764 · ggml-org/whisper.cpp

achyutbenz19 · 2026-04-19T00:38:23Z

Summary

When VAD is enabled, whisper_full_get_token_data returns t0/t1 in the VAD-processed (speech-filtered) timeline, while whisper_full_get_segment_t0/t1 already map those values back to the original audio timeline via map_processed_to_original_time. The two timelines diverge whenever VAD strips a non-speech prefix: the segment says "speech starts at 7.04 s" while the first token inside that segment says "starts at 0.01 s".

This shows up most visibly in --output-json-full where token timestamps.from/timestamps.to are printed alongside segment timestamps. Callers using the C API directly see it as tokens whose t0/t1 are systematically earlier than their containing segment.

Scope of the change

src/whisper.cpp, 1 file, +16/-2.

In whisper_full_get_token_data_from_state, apply map_processed_to_original_time to token.t0 and token.t1 when state->has_vad_segments is true and the mapping table is populated. Non-VAD paths short-circuit: the mapping table is empty, so the guard returns the raw timestamps unchanged. The whisper_full_get_token_data ctx variant now forwards to the _from_state version, so the fix applies to both public entry points without duplication.
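For readers who want the shape of the guard without opening the diff, here is a minimal, self-contained sketch. It is not the code in src/whisper.cpp: the names time_mapping, map_to_original, token_times, and get_token_times are hypothetical stand-ins, the piecewise-linear lookup is an assumed mapping strategy, and timestamps are assumed to be integers in 10 ms units.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one row of the VAD offset table (units: 10 ms).
struct time_mapping {
    int64_t processed; // time on the speech-only (VAD-processed) timeline
    int64_t original;  // corresponding time in the original audio
};

// Assumed strategy: piecewise-linear interpolation over a table sorted by
// `processed`. The real map_processed_to_original_time may differ.
static int64_t map_to_original(int64_t t, const std::vector<time_mapping> & table) {
    if (table.empty()) {
        return t;
    }
    if (t <= table.front().processed) {
        return table.front().original;
    }
    if (t >= table.back().processed) {
        // Past the last entry: carry the last known offset forward.
        return table.back().original + (t - table.back().processed);
    }
    // First entry whose processed time is strictly greater than t.
    auto hi = std::upper_bound(table.begin(), table.end(), t,
        [](int64_t v, const time_mapping & m) { return v < m.processed; });
    auto lo = hi - 1;
    const int64_t span = hi->processed - lo->processed; // > 0 by construction
    return lo->original + (t - lo->processed) * (hi->original - lo->original) / span;
}

// Hypothetical stand-in for the t0/t1 fields of whisper_token_data.
struct token_times {
    int64_t t0;
    int64_t t1;
};

// The guard described above: map only when VAD ran and the table is populated,
// otherwise return the raw timestamps untouched.
static token_times get_token_times(token_times raw, bool has_vad_segments,
                                   const std::vector<time_mapping> & table) {
    if (has_vad_segments && !table.empty()) {
        raw.t0 = map_to_original(raw.t0, table);
        raw.t1 = map_to_original(raw.t1, table);
    }
    return raw;
}
```

With an illustrative table of {0, 704} and {300, 1004} (a constant 7.04 s offset, chosen for the example rather than taken from the run), a raw token at 1..41 maps to 705..745, and the same token passes through unchanged when has_vad_segments is false or the table is empty.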

Reproduction

On current master (166c20b), with ggml-base.bin, a fixture of 7 s of pink noise followed by 3 s of English speech, and ggml-silero-v5.1.2.bin for VAD:

whisper-cli -m ggml-base.bin --vad -vm ggml-silero-v5.1.2.bin -ojf music-prefix-speech.wav

Segment (correct):

[00:00:07.040 --> 00:00:10.370]   Hello World, this is a short test of the transcription system.

Token t0/t1 from the JSON (wrong):

[_BEG_]   00:00:00,000 -> 00:00:00,000
 Hello    00:00:00,010 -> 00:00:00,410
 World    00:00:00,410 -> 00:00:00,580
 ,        00:00:00,960 -> 00:00:01,000
 this     00:00:01,000 -> 00:00:01,240

With this patch, the same command produces:

[_BEG_]   00:00:07,040 -> 00:00:07,040
 Hello    00:00:07,050 -> 00:00:07,450
 World    00:00:07,450 -> 00:00:07,620
 ,        00:00:08,030 -> 00:00:08,070
 this     00:00:08,070 -> 00:00:08,310

Token timeline now aligns with the segment timeline.

Differential matrix

model=base, fixture ∈ {music-prefix-speech, speech-en, long-en-70s, en-speech+10s-silence}, vadmode ∈ {novad, vad}. 8 cells per build.

cells	target cells	target improved	target regressed	non-target unchanged	non-target changed
8	4	4	0	4	0

Target cells (all vadmode=vad) improve: token timestamps now lie within their containing segment's time range. Non-target cells (all vadmode=novad) are byte-identical in the full -ojf JSON, confirming the guard does not touch paths that do not use VAD.

The "target improved" verdict for the speech-en + vad cell might look surprising because that fixture has no music prefix. VAD still trims small boundaries around speech onset and offset, and the fix correctly propagates that trim into token timestamps. Before the patch, [_BEG_] for speech-en + VAD was emitted at 00:00:00,000 while the segment starts at 00:00:00,030. After the patch, [_BEG_] is at 00:00:00,030, matching the segment.
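The "improved" criterion used for the target cells (every token lies within its containing segment's time range) can be expressed as a small check. This is a sketch only, not the tooling used for the matrix: token_span, segment_span, and tokens_within_segment are hypothetical names, and timestamps are assumed to be integers in 10 ms units.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical plain-data view of a token's time range (units: 10 ms).
struct token_span {
    int64_t t0;
    int64_t t1;
};

// Hypothetical plain-data view of a segment and the tokens it contains.
struct segment_span {
    int64_t t0;
    int64_t t1;
    std::vector<token_span> tokens;
};

// True when every token is well-formed (t0 <= t1) and lies inside the
// segment's [t0, t1] range on the same timeline.
static bool tokens_within_segment(const segment_span & seg) {
    for (const token_span & tok : seg.tokens) {
        if (tok.t0 > tok.t1 || tok.t0 < seg.t0 || tok.t1 > seg.t1) {
            return false;
        }
    }
    return true;
}
```

Applied to the reproduction above, the segment 704..1037 with the pre-patch tokens (starting at 0) fails the check, while the post-patch tokens (704..704, 705..745, 745..762, 803..807, 807..831) pass it.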

What this does not do

Does not change non-VAD behavior. Confirmed by byte-for-byte JSON match on all 4 non-VAD cells.
Does not change segment-level timestamps. Those already went through map_processed_to_original_time; only token-level timestamps were missing the mapping.
Does not change whisper_full_get_token_p or any non-timestamp getter. The only fields affected are t0 and t1 on returned whisper_token_data.

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Numbers above come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. The reproducer fixture and regress config are available; happy to share.

whisper_full_get_token_data returned raw t0/t1 from the decoder, which are in the VAD-processed (speech-only) timeline. Segment timestamps already go through map_processed_to_original_time via whisper_full_get_segment_t0/t1, so the two diverge whenever VAD strips a non-speech prefix: segments report correct times in the original audio while tokens restart at 0. Apply the same mapping to token t0/t1 when state->has_vad_segments is set. Non-VAD paths and non-token-timestamp runs are unaffected since the mapping table is empty and the guard short-circuits. Fixes ggml-org#3754

achyutbenz19 wants to merge 1 commit into ggml-org:master from achyutbenz19:fix/3754-vad-token-timestamps
