
whisper : correct per-token timestamps in parallel processing merge (#3726) #3766

Open
achyutbenz19 wants to merge 2 commits into ggml-org:master from achyutbenz19:fix/3726-multiproc-token-ts

@achyutbenz19

Summary

Fixes #3726.

whisper_full_parallel splits the input audio into N chunks and spawns one worker per chunk to call whisper_full_with_state. When merging worker results back into the main state, it offset-corrects each segment's t0 and t1 by the chunk's starting sample, but leaves the per-token timestamps (result.tokens[j].t0 / .t1) in chunk-relative time. Token timestamps therefore reset to 00:00:00 at every chunk boundary while segment timestamps show correct absolute times, so the two disagree in every worker beyond the first.

Scope of the change

src/whisper.cpp, +11/-7 on the merge loop of whisper_full_parallel.

  1. Hoist the repeated offset expression 100 * ((i + 1) * n_samples_per_processor) / WHISPER_SAMPLE_RATE + offset_t into a local chunk_offset.
  2. Apply the same offset to every token.t0 / token.t1 inside result.tokens before the segment is pushed into ctx->state->result_all.
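The two steps above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual diff: `Token`, `Segment`, and `merge_worker` are hypothetical stand-ins for the library's internal records and for the merge loop in whisper_full_parallel, and timestamps are assumed to be in 10 ms units, as the `100 * samples / WHISPER_SAMPLE_RATE` expression implies.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the library's token and segment records.
struct Token   { int64_t t0; int64_t t1; };
struct Segment { int64_t t0; int64_t t1; std::vector<Token> tokens; };

constexpr int64_t kSampleRate = 16000; // value of WHISPER_SAMPLE_RATE

// Shift one worker's results into absolute time and append them to `merged`.
// `i` is the zero-based worker index; worker i decodes the chunk that starts
// at sample (i + 1) * n_samples_per_processor. Timestamps are in 10 ms units.
void merge_worker(std::vector<Segment> & merged, std::vector<Segment> worker,
                  int i, int64_t n_samples_per_processor, int64_t offset_t) {
    // Step 1: compute the offset once instead of repeating the expression.
    const int64_t chunk_offset =
        100 * ((i + 1) * n_samples_per_processor) / kSampleRate + offset_t;

    for (auto & seg : worker) {
        seg.t0 += chunk_offset;
        seg.t1 += chunk_offset;
        // Step 2: apply the same offset to every token before the push.
        for (auto & tok : seg.tokens) {
            tok.t0 += chunk_offset;
            tok.t1 += chunk_offset;
        }
        merged.push_back(std::move(seg));
    }
}
```

Computing the offset once keeps the segment and token shifts from drifting apart if the expression is ever changed.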

Non-parallel (n_processors == 1) is unchanged: that branch returns early via whisper_full and never enters this merge loop.

Reproduction

On current master (166c20b), with ggml-base.bin and long-en-70s.wav:

whisper-cli -m ggml-base.bin --processors 2 -ojf -of out long-en-70s.wav

Sample from out.json (before this patch):

segment 00:00:38.600 -> 00:00:41.400 | token[0].t0 = 00:00:03,750
segment 00:00:41.400 -> 00:00:43.960 | token[0].t0 = 00:00:06,550
segment 00:01:08.440 -> 00:01:10.440 | token[0].t0 = 00:00:33,440

All three tokens show chunk-relative times (relative to the start of worker 1's slice) while the segments are correctly absolute.

After this patch:

segment 00:00:38.600 -> 00:00:41.400 | token[0].t0 = 00:00:38,750
segment 00:00:41.400 -> 00:00:43.960 | token[0].t0 = 00:00:41,550
segment 00:01:08.440 -> 00:01:10.440 | token[0].t0 = 00:01:08,440

Token and segment timelines agree.
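The 35-second shift between the two samples can be reproduced by hand. The sketch below assumes the fixture is exactly 70 s long and is split evenly across 2 processors, so worker 1's slice starts at sample 35 * 16000; `chunk_offset` and `format_ts` are hypothetical helpers written for this illustration, not functions from the library.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Offset of worker i's chunk in 10 ms units (same expression as the merge loop).
int64_t chunk_offset(int i, int64_t n_samples_per_processor, int64_t offset_t) {
    return 100 * ((i + 1) * n_samples_per_processor) / 16000 + offset_t;
}

// Format a timestamp given in 10 ms units as HH:MM:SS,mmm.
std::string format_ts(int64_t t) {
    int64_t msec = t * 10;
    const int64_t hr  = msec / (1000 * 60 * 60); msec -= hr  * (1000 * 60 * 60);
    const int64_t min = msec / (1000 * 60);      msec -= min * (1000 * 60);
    const int64_t sec = msec / 1000;             msec -= sec * 1000;
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d,%03d",
                  (int) hr, (int) min, (int) sec, (int) msec);
    return buf;
}
```

Under those assumptions `chunk_offset(0, 35 * 16000, 0)` is 3500, i.e. 35 s, and adding it to the chunk-relative values 375, 655, and 3344 gives the absolute token times listed after the patch.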

Differential matrix

model=base, fixture ∈ {long-en-70s, long-en-55s, speech-en}, procs ∈ {1, 2, 3}. 9 cells per build.

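| cells | target cells | target improved | target regressed | non-target unchanged | non-target changed |
|-------|--------------|-----------------|------------------|----------------------|--------------------|
| 9     | 6            | 6               | 0                | 3                    | 0                  |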

Target cells (procs ∈ {2, 3}) improve: every -ojf output now has token timestamps agreeing with segment timestamps. Non-target cells (procs=1) are byte-identical across master and this patch, confirming single-worker transcriptions are untouched.
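One way to express "token timestamps agree with segment timestamps" as a mechanical check is sketched below. This is an assumption about what such a check could look like, not the criterion used to score the matrix above: `Token`, `Segment`, and `tokens_within_segments` are hypothetical, and the tolerance parameter is there because token times need not line up exactly with segment boundaries.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for parsed -ojf output, timestamps in 10 ms units.
struct Token   { int64_t t0; int64_t t1; };
struct Segment { int64_t t0; int64_t t1; std::vector<Token> tokens; };

// Returns true if every token lies inside its segment's time range,
// allowing `tol` units of slack on either side.
bool tokens_within_segments(const std::vector<Segment> & segments, int64_t tol) {
    for (const auto & seg : segments) {
        for (const auto & tok : seg.tokens) {
            if (tok.t0 < seg.t0 - tol || tok.t1 > seg.t1 + tol) {
                return false;
            }
        }
    }
    return true;
}
```

A chunk-relative token such as t0 = 375 inside a segment spanning 3860 to 4140 fails this check, while the shifted value 3875 passes.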

What this does not do

  • Does not change segment-level timestamps, which were already correct.
  • Does not change decoder behavior, beam-search, or any audio processing. The only change is a post-decode offset applied to token metadata in one merge loop.
  • Does not interact with the VAD path (VAD runs inside whisper_full, not whisper_full_parallel; #3754, "VAD causes incorrect token timestamps when audio starts with music", covered that path).

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Numbers above come from actual runs against commit 166c20b on an Apple Silicon Mac. The regress config and raw per-cell outputs are available.

The auto-detect call in whisper_full_with_state passed a hard-coded
offset of 0 to whisper_lang_auto_detect_with_state, so language
detection always analyzed the first window of audio regardless of
the caller's offset_ms. On audio like "1 minute of French then 30
minutes of German" with offset_ms=60000, transcription correctly
started at the 1-minute mark but language detection still returned
French from the prefix.

Pass params.offset_ms through. Auto-detect now reads the same window
that decoding will start from.

Fixes ggml-org#1831

whisper_full_parallel applies a chunk offset to segment t0/t1 when
merging worker results into the main state, but the token.t0/t1
inside each segment were left in chunk-relative time. Segments
reported correct absolute times while token timestamps reset to
zero at every split boundary.

Extract the offset into a local, apply it to token t0/t1 as well.

Fixes ggml-org#3726

Development

Successfully merging this pull request may close these issues.

--processors / n_processors greater than 1 _still_ produces incorrect token timestamps