whisper : map token timestamps through VAD offset table (#3754) #3764

Open
achyutbenz19 wants to merge 1 commit into ggml-org:master from achyutbenz19:fix/3754-vad-token-timestamps

@achyutbenz19

Summary

Fixes #3754.

When VAD is enabled, whisper_full_get_token_data returns t0/t1 in the VAD-processed (speech-filtered) timeline, while whisper_full_get_segment_t0/t1 already map those values back to the original audio timeline via map_processed_to_original_time. The two timelines diverge whenever VAD strips a non-speech prefix: the segment says "speech starts at 7.04 s" while the first token inside that segment says "starts at 0.01 s".

This shows up most visibly in --output-json-full where token timestamps.from/timestamps.to are printed alongside segment timestamps. Callers using the C API directly see it as tokens whose t0/t1 are systematically earlier than their containing segment.
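To illustrate how the two timelines relate, here is a minimal sketch of a processed-to-original lookup. It is not the body of map_processed_to_original_time; the MappingPoint struct, the map_to_original name, and the step-offset rule (no interpolation) are assumptions made for illustration, and the real function may handle points between table entries differently. Times are in 10 ms units, which is how whisper.cpp reports t0/t1. The second table row in the usage note is invented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical row of a VAD offset table: a point on the processed
// (speech-only) timeline and the matching point in the original audio.
// Both values are in 10 ms units.
struct MappingPoint {
    int64_t processed;
    int64_t original;
};

// Map a processed-timeline timestamp back to the original timeline.
// Assumption: the table is sorted by `processed`. An empty table means
// VAD was not used, so the timestamp is returned unchanged.
int64_t map_to_original(const std::vector<MappingPoint> & table, int64_t t) {
    if (table.empty()) {
        return t;
    }
    assert(std::is_sorted(table.begin(), table.end(),
        [](const MappingPoint & a, const MappingPoint & b) {
            return a.processed < b.processed;
        }));

    // Find the first point strictly after t, then step back to the last
    // point at or before t (or the first point if t precedes the table).
    auto it = std::upper_bound(table.begin(), table.end(), t,
        [](int64_t value, const MappingPoint & p) {
            return value < p.processed;
        });
    if (it != table.begin()) {
        --it;
    }
    // Carry the offset of that point over to t.
    return it->original + (t - it->processed);
}
```

With a table of {0, 704} and {300, 1500}, a token at processed time 1 (0.01 s) maps to 705 (7.05 s), matching the "Hello" token in the reproduction below.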

Scope of the change

src/whisper.cpp, 1 file, +16/-2.

In whisper_full_get_token_data_from_state, apply map_processed_to_original_time to token.t0 and token.t1 when state->has_vad_segments is true and the mapping table is populated. Non-VAD paths short-circuit: the mapping table is empty, so the guard returns the raw timestamps unchanged. The whisper_full_get_token_data ctx variant now forwards to the _from_state version, so the fix applies to both public entry points without duplication.
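The shape of the change can be sketched with stand-in types. This is not the patch itself: State, Context, TokenData, Point, map_time, and both getter names below are hypothetical simplifications, and map_time is a placeholder for the real mapping function. The sketch only shows the guard and the ctx-to-state forwarding described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real whisper.cpp types.
struct Point     { int64_t processed; int64_t original; };
struct TokenData { int64_t t0; int64_t t1; };  // other fields omitted

struct State {
    bool has_vad_segments = false;
    std::vector<Point>     mapping;  // empty when VAD is off
    std::vector<TokenData> tokens;   // timestamps on the processed timeline
};

struct Context { State state; };

// Placeholder mapping: use the offset of the last point at or before t.
// Precondition: `mapping` is non-empty (the caller's guard ensures this).
static int64_t map_time(const std::vector<Point> & mapping, int64_t t) {
    const Point * best = &mapping.front();
    for (const Point & p : mapping) {
        if (p.processed <= t) {
            best = &p;
        }
    }
    return best->original + (t - best->processed);
}

// State variant: returns a copy with t0/t1 mapped only when VAD was used.
TokenData get_token_data_from_state(const State & s, std::size_t i) {
    assert(i < s.tokens.size());
    TokenData tok = s.tokens[i];  // copy, so the stored token is untouched
    if (s.has_vad_segments && !s.mapping.empty()) {
        tok.t0 = map_time(s.mapping, tok.t0);
        tok.t1 = map_time(s.mapping, tok.t1);
    }
    return tok;
}

// Context variant: forwards to the state variant so both entry points
// share a single code path.
TokenData get_token_data(const Context & ctx, std::size_t i) {
    return get_token_data_from_state(ctx.state, i);
}
```

Returning a mapped copy rather than rewriting the stored token keeps the getter idempotent: calling it twice yields the same values in this sketch.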

Reproduction

On current master (166c20b), with ggml-base.bin, a fixture of 7 s pink noise then 3 s English speech, and ggml-silero-v5.1.2.bin for VAD:

whisper-cli -m ggml-base.bin --vad -vm ggml-silero-v5.1.2.bin -ojf music-prefix-speech.wav

Segment (correct):

[00:00:07.040 --> 00:00:10.370]   Hello World, this is a short test of the transcription system.

Token t0/t1 from the JSON (wrong):

[_BEG_]   00:00:00,000 -> 00:00:00,000
 Hello    00:00:00,010 -> 00:00:00,410
 World    00:00:00,410 -> 00:00:00,580
 ,        00:00:00,960 -> 00:00:01,000
 this     00:00:01,000 -> 00:00:01,240

With this patch, the same command produces:

[_BEG_]   00:00:07,040 -> 00:00:07,040
 Hello    00:00:07,050 -> 00:00:07,450
 World    00:00:07,450 -> 00:00:07,620
 ,        00:00:08,030 -> 00:00:08,070
 this     00:00:08,070 -> 00:00:08,310

The token timeline now aligns with the segment timeline.

Differential matrix

model=base, fixture ∈ {music-prefix-speech, speech-en, long-en-70s, en-speech+10s-silence}, vadmode ∈ {novad, vad}. 8 cells per build.

| cells | target cells | target improved | target regressed | non-target unchanged | non-target changed |
|---|---|---|---|---|---|
| 8 | 4 | 4 | 0 | 4 | 0 |

Target cells (all vadmode=vad) improve: token timestamps now lie within their containing segment's time range. Non-target cells (all vadmode=novad) are byte-identical in the full -ojf JSON, confirming the guard does not touch paths that do not use VAD.

The "target improved" verdict for the speech-en + vad cell might look surprising because that fixture has no music prefix. VAD still trims small boundaries around speech onset and offset, and the fix correctly propagates that trim into token timestamps. Before the patch, [_BEG_] for speech-en + VAD was emitted at 00:00:00,000 while the segment starts at 00:00:00,030. After the patch, [_BEG_] is at 00:00:00,030, matching the segment.
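The "improved" criterion (every token lies within its containing segment's time range) can be expressed as a small check. The sketch below is an assumption about how such a check might look, not the regress tooling used for the matrix; Token, Segment, and count_tokens_outside_segment are hypothetical names, and times are in 10 ms units.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical plain-data view of a segment and its tokens.
struct Token   { int64_t t0; int64_t t1; };
struct Segment { int64_t t0; int64_t t1; std::vector<Token> tokens; };

// Count tokens whose [t0, t1] range is inverted or falls outside the
// containing segment's [t0, t1] range. Zero means the timelines agree.
std::size_t count_tokens_outside_segment(const Segment & seg) {
    assert(seg.t0 <= seg.t1);
    std::size_t outside = 0;
    for (const Token & tok : seg.tokens) {
        const bool inverted = tok.t0 > tok.t1;
        const bool too_early = tok.t0 < seg.t0;
        const bool too_late  = tok.t1 > seg.t1;
        if (inverted || too_early || too_late) {
            ++outside;
        }
    }
    return outside;
}
```

For the music-prefix-speech segment (704 to 1037), the pre-patch tokens [_BEG_] (0, 0) and Hello (1, 41) both fall outside, while the post-patch values (704, 704) and (705, 745) fall inside.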

What this does not do

  • Does not change non-VAD behavior. Confirmed by byte-for-byte JSON match on all 4 non-VAD cells.
  • Does not change segment-level timestamps. Those already went through map_processed_to_original_time; only token-level timestamps were missing the mapping.
  • Does not change whisper_full_get_token_p or any non-timestamp getter. The only fields affected are t0 and t1 on returned whisper_token_data.

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Numbers above come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. The reproducer fixture and regress config are available; happy to share.

whisper_full_get_token_data returned raw t0/t1 from the decoder, which
are in the VAD-processed (speech-only) timeline. Segment timestamps
already go through map_processed_to_original_time via
whisper_full_get_segment_t0/t1, so the two diverge whenever VAD strips
a non-speech prefix: segments report correct times in the original
audio while tokens restart at 0.

Apply the same mapping to token t0/t1 when state->has_vad_segments is
set. Non-VAD paths and non-token-timestamp runs are unaffected since
the mapping table is empty and the guard short-circuits.

Fixes ggml-org#3754

Development

Successfully merging this pull request may close these issues.

VAD causes incorrect token timestamps when audio starts with music
