whisper : respect offset_ms in language auto-detection (#1831) #3765
Open
achyutbenz19 wants to merge 1 commit into ggml-org:master
The auto-detect call in whisper_full_with_state passed a hard-coded offset of 0 to whisper_lang_auto_detect_with_state, so language detection always analyzed the first window of audio regardless of the caller's offset_ms. On audio like "1 minute of French then 30 minutes of German" with offset_ms=60000, transcription correctly started at the 1-minute mark but language detection still returned French from the prefix.

Pass params.offset_ms through. Auto-detect now reads the same window that decoding will start from.

Fixes ggml-org#1831
Summary
Fixes #1831.
`whisper_full_with_state` calls `whisper_lang_auto_detect_with_state` with a hard-coded offset of `0` instead of `params.offset_ms`. This means language detection always runs on the first window of audio regardless of the caller's offset. On audio like "1 minute of French intro, then 30 minutes of German" with `offset_ms=60000`, transcription correctly starts at the 1-minute mark but `whisper_full_lang_id` still returns French.

The fix passes `params.offset_ms` through to the auto-detector. If the user-supplied offset would fall past the end of the mel spectrogram (e.g. `offset_ms=4000` on a 3-second file), the guard falls back to `0` so language detection still returns a valid language and the existing too-short-audio guard downstream can handle the empty-decode case cleanly.

Scope of the change
`src/whisper.cpp`, 1 file, +14/-1.

Inside the `auto_detect` branch of `whisper_full_with_state`, compute `detect_offset_ms` as `params.offset_ms` if it is positive and `params.offset_ms / 10 < state->mel.n_len_org`, otherwise `0`. Pass that into `whisper_lang_auto_detect_with_state` instead of the hard-coded `0`.

Reproduction
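The guard described under "Scope of the change" can be sketched in isolation. This is a minimal sketch, not the patch itself: the free function `detect_offset_ms_for` and its parameter names are hypothetical (the PR computes the value inline in `whisper_full_with_state`), and it assumes one mel frame per 10 ms, which is what the `/ 10` in the condition implies.

```cpp
// Hypothetical stand-alone version of the offset guard, for illustration only.
//   offset_ms : caller-supplied start offset (params.offset_ms in the PR)
//   n_len_org : number of mel frames in the original audio
//               (state->mel.n_len_org in the PR; assumed 10 ms per frame)
static int detect_offset_ms_for(int offset_ms, int n_len_org) {
    // Use the caller's offset only if it is positive and still inside the audio.
    const bool in_range = offset_ms > 0 && offset_ms / 10 < n_len_org;

    // Otherwise fall back to 0 so detection still sees real audio.
    return in_range ? offset_ms : 0;
}
```

Under the 10 ms assumption, a 3-second file has 300 frames, so `detect_offset_ms_for(4000, 300)` falls back to `0`, while a 70-second file (7000 frames) keeps the requested 4000 ms.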
Fixture built with audiokit:
Before the patch, auto-detect always runs at sample 0, so both `-ot 0` and `-ot 4000` report Russian:

With the patch:
Differential matrix
Ran with `audiokit regress check`. Axes: `model=small`, `fixture ∈ {ru-then-en, speech-en (3.36 s), speech-ru, long-en-70s}`, `offsetms ∈ {0, 4000}`. 8 cells per build. Measure: `grep -oE 'auto-detected language: [a-z]+'` (language code only, isolates semantic signal from noise like probability drift).

Non-target cells (7 of them) are byte-identical in language detection output between `master` and this patch. This includes `speech-en` and `speech-ru` with `offset=4000` (both files shorter than 4 s, so the offset falls past the audio and the guard correctly falls back to 0) and `long-en-70s` with `offset=4000` (valid offset, still detects the same language).

The one cell flagged as "equal-length change" is the target case itself: `auto-detected language: ru` is the same character length as `auto-detected language: en`, and the matrix's length-only heuristic cannot distinguish a correct replacement from a regression. Verified by hand: this is the intended before-and-after.

How audiokit was used
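The measure and the length-only comparison from the differential matrix can be approximated outside audiokit. The sketch below is an assumption-laden illustration, not audiokit's implementation: `detected_language`, `CellStatus`, and `classify_cell` are hypothetical names, and the three-way classification is only one plausible reading of what "equal-length change" means.

```cpp
#include <regex>
#include <string>

// Mirrors the measure `grep -oE 'auto-detected language: [a-z]+'`, but returns
// only the language code. Returns an empty string when no such line is found.
static std::string detected_language(const std::string & log) {
    static const std::regex re("auto-detected language: ([a-z]+)");
    std::smatch m;
    return std::regex_search(log, m, re) ? m[1].str() : std::string();
}

// Hypothetical cell classification for a before/after pair of measured outputs.
enum class CellStatus { Identical, EqualLengthChange, LengthChange };

static CellStatus classify_cell(const std::string & before, const std::string & after) {
    if (before == after) {
        return CellStatus::Identical;          // byte-identical output
    }
    if (before.size() == after.size()) {
        return CellStatus::EqualLengthChange;  // content differs, length does not
    }
    return CellStatus::LengthChange;
}
```

A length-based check alone sees `auto-detected language: ru` and `auto-detected language: en` as the same size, which is why the target cell still needs a manual look at the extracted language codes.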
- `audiokit repro fixtures --set all` produced `speech-ru.wav`, `speech-en.wav`, and `long-en-70s.wav`, which I used directly.
- I built `ru-then-en.wav` on top of those with `sox` (audiokit's `audio` subcommand also exposes `concat`).
- `audiokit regress init` scaffolded the matrix config. `audiokit regress check --report-md` ran 16 whisper-cli invocations (8 cells x 2 builds), hashed the measured output of each, and classified cells as target or non-target to surface any adjacent regressions.
- The inject subcommands (`audiokit inject device`, `audiokit inject web`) were not used for this PR because this is a file-based bug in a CLI path, not a desktop-app or browser bug. For my other audiokit-driven fixes (Zero-filled WAV give hallucination and wrong duration #1881, Hallucination on silence #1724, VAD causes incorrect token timestamps when audio starts with music #3754) I used the same fixture-plus-matrix slice.

What this does not do
- It does not change behavior when `offset_ms` is 0 (the default). Auto-detect still runs at sample 0, matching the pre-patch behavior.
- It does not address `params.language` being ignored. On current `master`, the gate in `whisper_full_with_state` short-circuits the auto-detect block cleanly when a non-auto, non-empty language is set and `detect_language` is false. If a specific binding is still triggering auto-detect despite those params, that is a separate defect and should be a separate patch.

Tools used
`git`, `cmake`, `whisper-cli`, `sox`, and `audiokit` for fixture generation plus the differential matrix.

Disclosure
I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Matrix numbers come from actual runs on an Apple Silicon Mac against commit `166c20b` of this repo and a patched build. Reproducer fixtures and regress config are available; happy to share.