whisper : correct per-token timestamps in parallel processing merge (#3726) #3766
Open
achyutbenz19 wants to merge 2 commits into ggml-org:master from
The auto-detect call in whisper_full_with_state passed a hard-coded offset of 0 to whisper_lang_auto_detect_with_state, so language detection always analyzed the first window of audio regardless of the caller's offset_ms. On audio like "1 minute of French then 30 minutes of German" with offset_ms=60000, transcription correctly started at the 1-minute mark but language detection still returned French from the prefix. Pass params.offset_ms through. Auto-detect now reads the same window that decoding will start from. Fixes ggml-org#1831
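The effect of the hard-coded offset can be illustrated with a toy model. This is only a sketch: `Window` and `detect_language` are hypothetical stand-ins, not whisper.cpp APIs, and the per-window language labels are assumed to be known in advance.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in: one labelled window of audio.
// The real detector infers the language from audio features instead.
struct Window {
    int         start_ms; // where this window begins in the audio
    std::string lang;     // language spoken in this window
};

// Returns the language of the last window that starts at or before offset_ms.
// Assumes `windows` is non-empty and sorted by start_ms.
std::string detect_language(const std::vector<Window>& windows, int offset_ms) {
    assert(!windows.empty());
    std::string lang = windows.front().lang;
    for (const auto& w : windows) {
        if (w.start_ms <= offset_ms) {
            lang = w.lang;
        }
    }
    return lang;
}
```

With one minute of French followed by German, calling `detect_language(windows, 0)` mirrors the old behaviour and reports French, while passing the caller's offset of 60000 reports German.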
whisper_full_parallel applies a chunk offset to segment t0/t1 when merging worker results into the main state, but the token.t0/t1 inside each segment were left in chunk-relative time. Segments reported correct absolute times while token timestamps reset to zero at every split boundary. Extract the offset into a local, apply it to token t0/t1 as well. Fixes ggml-org#3726
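A minimal sketch of the merge correction, not the actual diff. `Token`, `Segment`, `chunk_offset` and `shift_to_absolute` are simplified, hypothetical stand-ins for the internal structures; only the offset expression is taken from the PR description, and timestamps are assumed to be in 10 ms units.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-ins (assumption: only the timestamp fields matter here).
struct Token   { int64_t t0; int64_t t1; };
struct Segment { int64_t t0; int64_t t1; std::vector<Token> tokens; };

constexpr int64_t kSampleRate = 16000; // mirrors WHISPER_SAMPLE_RATE

// Offset of the chunk handled by extra worker i (0-based).
// Worker i processes chunk i + 1; chunk 0 belongs to the main state.
int64_t chunk_offset(int i, int64_t n_samples_per_processor, int64_t offset_t) {
    return 100 * ((i + 1) * n_samples_per_processor) / kSampleRate + offset_t;
}

// Shift a worker's segments, and the tokens inside them, to absolute time.
void shift_to_absolute(std::vector<Segment>& segments, int64_t offset) {
    assert(offset >= 0);
    for (auto& seg : segments) {
        seg.t0 += offset;
        seg.t1 += offset;
        for (auto& tok : seg.tokens) {
            tok.t0 += offset; // the step the merge loop previously skipped
            tok.t1 += offset;
        }
    }
}
```

For example, with 560000 samples per processor (35 s at 16 kHz) and `offset_t` of 0, worker 0 gets an offset of 3500, so a token at chunk-relative 100..250 lands at 3600..3750.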
Summary
Fixes #3726.
whisper_full_parallel splits the input audio into N chunks and spawns one worker per chunk to call whisper_full_with_state. When merging worker results back into the main state, it offset-corrects each segment's t0 and t1 by the chunk's starting sample, but leaves the per-token timestamps (result.tokens[j].t0/.t1) in chunk-relative time. Token timestamps therefore reset to 00:00:00 at every chunk boundary while segment timestamps show correct absolute times, so the two disagree in every worker beyond the first.
Scope of the change
src/whisper.cpp, +11/-7 on the merge loop of whisper_full_parallel.
Extract 100 * ((i + 1) * n_samples_per_processor) / WHISPER_SAMPLE_RATE + offset_t into a local chunk_offset.
Apply chunk_offset to token.t0/token.t1 inside result.tokens before the segment is pushed into ctx->state->result_all.
Non-parallel (n_processors == 1) is unchanged: that branch returns early via whisper_full and never enters this merge loop.
Reproduction
On current master (166c20b), with ggml-base.bin and long-en-70s.wav:
Sample from out.json (before this patch):
All three tokens show chunk-relative times (relative to the start of worker 1's slice) while the segments are correctly absolute.
After this patch:
Token and segment timelines agree.
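One plausible way to check this agreement mechanically is sketched here. It assumes that "agree" means every token's span lies inside its segment's span; the types and the function `tokens_agree_with_segment` are hypothetical and not part of whisper.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-ins for one transcribed segment and its tokens.
struct Token   { int64_t t0; int64_t t1; };
struct Segment { int64_t t0; int64_t t1; std::vector<Token> tokens; };

// True when every token is well-formed and falls inside [seg.t0, seg.t1].
bool tokens_agree_with_segment(const Segment& seg) {
    assert(seg.t0 <= seg.t1);
    for (const auto& tok : seg.tokens) {
        if (tok.t0 > tok.t1 || tok.t0 < seg.t0 || tok.t1 > seg.t1) {
            return false;
        }
    }
    return true;
}
```

A segment at 3500..3750 whose tokens start again from 0 fails this check, while the same tokens shifted by the chunk offset pass it.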
Differential matrix
model=base, fixture ∈ {long-en-70s, long-en-55s, speech-en}, procs ∈ {1, 2, 3}. 9 cells per build.
Target cells (procs ∈ {2, 3}) improve: every -ojf output now has token timestamps agreeing with segment timestamps. Non-target cells (procs=1) are byte-identical across master and this patch, confirming single-worker transcriptions are untouched.
What this does not do
whisper_full, not whisper_full_parallel; VAD causes incorrect token timestamps when audio starts with music #3754 covered that path).
Tools used
git, cmake, whisper-cli, and audiokit for the differential matrix.
Disclosure
I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Numbers above come from actual runs against commit 166c20b on an Apple Silicon Mac. The regress config and raw per-cell outputs are available.