whisper : map token timestamps through VAD offset table (#3754) #3764
Open
achyutbenz19 wants to merge 1 commit into ggml-org:master
whisper_full_get_token_data returned raw t0/t1 from the decoder, which are in the VAD-processed (speech-only) timeline. Segment timestamps already go through map_processed_to_original_time via whisper_full_get_segment_t0/t1, so the two diverge whenever VAD strips a non-speech prefix: segments report correct times in the original audio while tokens restart at 0.

Apply the same mapping to token t0/t1 when state->has_vad_segments is set. Non-VAD paths and non-token-timestamp runs are unaffected since the mapping table is empty and the guard short-circuits.

Fixes ggml-org#3754
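The guard described in the commit message above can be sketched in isolation. This is a simplified stand-in, not the code in src/whisper.cpp: `TimeMapping`, `TokenTimes`, `map_to_original`, and `token_times_for_caller` are hypothetical names, times are assumed to be in centiseconds, and the lookup simply carries forward the offset of the nearest preceding table entry, which may differ from what `map_processed_to_original_time` actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical row of a VAD offset table: a time on the speech-only
// timeline and the matching time in the original audio (centiseconds assumed).
struct TimeMapping {
    int64_t processed;
    int64_t original;
};

// Hypothetical pair of token timestamps.
struct TokenTimes {
    int64_t t0;
    int64_t t1;
};

// Simplified lookup: take the last entry at or before `t` (or the first
// entry if `t` precedes the table) and apply that entry's offset.
int64_t map_to_original(int64_t t, const std::vector<TimeMapping> & table) {
    if (table.empty()) {
        return t;
    }
    assert(std::is_sorted(table.begin(), table.end(),
        [](const TimeMapping & a, const TimeMapping & b) { return a.processed < b.processed; }));
    auto it = std::upper_bound(table.begin(), table.end(), t,
        [](int64_t v, const TimeMapping & m) { return v < m.processed; });
    const TimeMapping & m = (it == table.begin()) ? *it : *(it - 1);
    return t + (m.original - m.processed);
}

// Guarded mapping: without VAD segments, or with an empty table,
// the raw decoder timestamps are returned unchanged.
TokenTimes token_times_for_caller(TokenTimes raw, bool has_vad_segments,
                                  const std::vector<TimeMapping> & table) {
    if (!has_vad_segments || table.empty()) {
        return raw;
    }
    return { map_to_original(raw.t0, table), map_to_original(raw.t1, table) };
}
```

With a single table entry mapping processed time 0 to original time 704 (7.04 s), a token that the decoder places at 1 to 25 comes back as 705 to 729, while the same call with `has_vad_segments == false` returns 1 to 25 untouched.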
Summary
Fixes #3754.
When VAD is enabled, whisper_full_get_token_data returns t0/t1 in the VAD-processed (speech-filtered) timeline, while whisper_full_get_segment_t0/t1 already map those values back to the original audio timeline via map_processed_to_original_time. The two timelines diverge whenever VAD strips a non-speech prefix: the segment says "speech starts at 7.04 s" while the first token inside that segment says "starts at 0.01 s".

This shows up most visibly in --output-json-full, where token timestamps.from/timestamps.to are printed alongside segment timestamps. Callers using the C API directly see it as tokens whose t0/t1 are systematically earlier than their containing segment.

Scope of the change
src/whisper.cpp, 1 file, +16/-2.

In whisper_full_get_token_data_from_state, apply map_processed_to_original_time to token.t0 and token.t1 when state->has_vad_segments is true and the mapping table is populated. Non-VAD paths short-circuit (the mapping table is empty, the guard returns the raw timestamps unchanged). The whisper_full_get_token_data ctx variant now forwards to the _from_state version so the fix applies to both public entry points without duplication.

Reproduction
On current master (166c20b), with ggml-base.bin, a fixture of 7 s pink noise then 3 s English speech, and ggml-silero-v5.1.2.bin for VAD:

Segment (correct):

Token t0/t1 from the JSON (wrong):

With this patch, the same command produces:

Token timeline now aligns with the segment timeline.
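The alignment claim can be checked mechanically. The sketch below is illustrative only: `Span` and `count_tokens_outside` are hypothetical helpers, times are assumed to be in centiseconds, and apart from the 7.04 s segment start and the 0.01 s token start quoted in the summary, the sample values are made up rather than taken from a real run.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical [t0, t1] interval, in centiseconds.
struct Span {
    int64_t t0;
    int64_t t1;
};

// Counts tokens whose interval is not fully contained in the segment's interval.
int count_tokens_outside(const Span & segment, const std::vector<Span> & tokens) {
    int outside = 0;
    for (const Span & tok : tokens) {
        if (tok.t0 < segment.t0 || tok.t1 > segment.t1) {
            ++outside;
        }
    }
    return outside;
}
```

For a segment spanning 704 to 1000, tokens reported on the speech-only timeline (for example 1 to 25 and 25 to 60) both count as outside, whereas the same tokens shifted by 704 fall inside the segment and the count drops to zero.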
Differential matrix
model=base, fixture ∈ {music-prefix-speech, speech-en, long-en-70s, en-speech+10s-silence}, vadmode ∈ {novad, vad}. 8 cells per build.

Target cells (all vadmode=vad) improve: token timestamps now lie within their containing segment's time range. Non-target cells (all vadmode=novad) are byte-identical in the full -ojf JSON, confirming the guard does not touch paths that do not use VAD.

The "target improved" verdict for the speech-en + vad cell might look surprising because that fixture has no music prefix. VAD still trims small boundaries around speech onset and offset, and the fix correctly propagates that trim into token timestamps. Before the patch, [_BEG_] for speech-en + VAD was emitted at 00:00:00,000 while the segment starts at 00:00:00,030. After the patch, [_BEG_] is at 00:00:00,030, matching the segment.

What this does not do
- Segment timestamps are unchanged: they already go through map_processed_to_original_time; only token-level timestamps were missing the mapping.
- No change to whisper_full_get_token_p or any non-timestamp getter. The only fields affected are t0 and t1 on returned whisper_token_data.

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure
I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Numbers above come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. The reproducer fixture and regress config are available; happy to share.