whisper : respect offset_ms in language auto-detection (#1831) #3765

Open
achyutbenz19 wants to merge 1 commit into ggml-org:master from achyutbenz19:fix/1831-autodetect-offset

@achyutbenz19

Summary

Fixes #1831.

whisper_full_with_state calls whisper_lang_auto_detect_with_state with a hard-coded offset of 0 instead of params.offset_ms. This means language detection always runs on the first window of audio regardless of the caller's offset. On audio like "1 minute of French intro, then 30 minutes of German" with offset_ms=60000, transcription correctly starts at the 1-minute mark but whisper_full_lang_id still returns French.

The fix passes params.offset_ms through to the auto-detector. If the user-supplied offset would fall past the end of the mel spectrogram (e.g. offset_ms=4000 on a 3-second file), the guard falls back to 0 so language detection still returns a valid language and the existing too-short-audio guard downstream can handle the empty-decode case cleanly.

Scope of the change

src/whisper.cpp, 1 file, +14/-1.

Inside the auto_detect branch of whisper_full_with_state, compute detect_offset_ms as params.offset_ms if it is positive and params.offset_ms / 10 < state->mel.n_len_org, otherwise 0. Pass that into whisper_lang_auto_detect_with_state instead of the hardcoded 0.
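The guard described above can be sketched as a small standalone function. This is an illustration, not the patch itself: the helper name detect_offset_ms and its plain-int signature are hypothetical, and it assumes one mel frame covers 10 ms, which is what the offset_ms / 10 comparison against n_len_org implies.

```cpp
#include <cassert>

// Hypothetical helper (not part of whisper.cpp): returns the offset, in
// milliseconds, at which language auto-detection should start.
//   offset_ms : the caller's params.offset_ms
//   n_len_org : number of mel frames in the original audio
//               (state->mel.n_len_org), assuming 10 ms per frame
int detect_offset_ms(int offset_ms, int n_len_org) {
    assert(n_len_org >= 0);

    // Convert milliseconds to a mel frame index.
    const int offset_frames = offset_ms / 10;

    // Use the caller's offset only if it is positive and still inside the audio.
    if (offset_ms > 0 && offset_frames < n_len_org) {
        return offset_ms;
    }

    // Default offset, or an offset past the end of the audio: detect from 0.
    return 0;
}
```

Under these assumptions, a 3-second file has about 300 mel frames, so offset_ms=4000 (frame 400) falls back to 0, while the same offset on the ~6.2 s ru-then-en fixture (about 620 frames) is passed through unchanged.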

Reproduction

Fixture built with audiokit:

# generate language-labeled fixtures
audiokit repro fixtures --set all

# concatenate ru speech then en speech into a single WAV
sox ~/.audiokit/repro/fixtures/speech-ru.wav \
    ~/.audiokit/repro/fixtures/speech-en.wav \
    ru-then-en.wav
# total duration ~6.2 s: ~3.3 s ru, then ~2.9 s en

Before the patch, auto-detect always runs at sample 0, so both -ot 0 and -ot 4000 report Russian:

$ whisper-cli -m ggml-small.bin -l auto -ot 0    ru-then-en.wav 2>&1 | grep auto-detected
whisper_full_with_state: auto-detected language: ru (p = 0.812757)

$ whisper-cli -m ggml-small.bin -l auto -ot 4000 ru-then-en.wav 2>&1 | grep auto-detected
whisper_full_with_state: auto-detected language: ru (p = 0.812757)   # wrong, we skipped past the ru portion

With the patch:

$ whisper-cli -m ggml-small.bin -l auto -ot 0    ru-then-en.wav 2>&1 | grep auto-detected
whisper_full_with_state: auto-detected language: ru (p = 0.812757)

$ whisper-cli -m ggml-small.bin -l auto -ot 4000 ru-then-en.wav 2>&1 | grep auto-detected
whisper_full_with_state: auto-detected language: en (p = 0.997886)   # correct, now detecting en at the post-offset window

Differential matrix

Run with audiokit regress check. Axes: model=small, fixture ∈ {ru-then-en, speech-en (3.36 s), speech-ru, long-en-70s}, offsetms ∈ {0, 4000}, which gives 4 fixtures x 2 offsets = 8 cells per build. Measure: grep -oE 'auto-detected language: [a-z]+' (language code only, which isolates the semantic signal from noise such as probability drift).

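| cells | target cells | target improved | target equal-length change | non-target unchanged | non-target changed |
|-------|--------------|-----------------|----------------------------|----------------------|--------------------|
| 8     | 1            | 0               | 1                          | 7                    | 0                  |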

Non-target cells (7 of them) are byte-identical in language detection output between master and this patch. This includes speech-en and speech-ru with offset=4000 (both files shorter than 4 s, so the offset falls past the audio and the guard correctly falls back to 0) and long-en-70s with offset=4000 (valid offset, still detects the same language).

The one cell flagged as "equal-length change" is the target case itself: auto-detected language: ru is the same character length as auto-detected language: en, and the matrix's length-only heuristic cannot distinguish a correct replacement from a regression. Verified by hand: this is the intended before-and-after.
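To make the limitation concrete, here is a minimal sketch of a length-only comparison. It is purely illustrative: audiokit's actual classification logic is not shown in this PR, and the enum and function names below are hypothetical.

```cpp
#include <string>

// Hypothetical classification of one matrix cell, based only on whether the
// measured output changed and whether its length changed. This is an
// assumption about how a length-only heuristic could work, not audiokit's code.
enum class CellChange { Unchanged, EqualLengthChange, LengthChange };

CellChange classify(const std::string & before, const std::string & after) {
    if (before == after) {
        return CellChange::Unchanged;
    }
    // The content differs but the length is the same: the heuristic cannot
    // tell a correct replacement ("ru" -> "en") from a regression.
    if (before.size() == after.size()) {
        return CellChange::EqualLengthChange;
    }
    return CellChange::LengthChange;
}
```

With this sketch, "auto-detected language: ru" versus "auto-detected language: en" lands in EqualLengthChange, which is why the target cell needed a manual check.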

How audiokit was used

  • audiokit repro fixtures --set all produced speech-ru.wav, speech-en.wav, long-en-70s.wav which I used directly.
  • Composed ru-then-en.wav on top of those with sox (audiokit's audio subcommand also exposes concat).
  • audiokit regress init scaffolded the matrix config, audiokit regress check --report-md ran 16 whisper-cli invocations (8 cells x 2 builds), hashed the measured output of each, and classified cells as target or non-target to surface any adjacent regressions.
  • The injection primitives (audiokit inject device, audiokit inject web) were not used for this PR because this is a file-based bug in a CLI path, not a desktop-app or browser bug. For my other audiokit-driven fixes (Zero-filled WAV give hallucination and wrong duration #1881, Hallucination on silence #1724, VAD causes incorrect token timestamps when audio starts with music #3754) I used the same fixture-plus-matrix slice.

What this does not do

  • Does not change behavior when offset_ms is 0 (the default). Auto-detect still runs at sample 0, matching the pre-patch behavior.
  • Does not address the other aspect of the original issue report (a forced params.language being ignored). On current master, the gate in whisper_full_with_state skips the auto-detect block when a non-auto, non-empty language is set and detect_language is false. If a specific binding still triggers auto-detect despite those params, that is a separate defect and should be a separate patch.
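The gate described in the last point can be sketched as a predicate. This is a paraphrase of the condition as stated above, not a copy of the code on master: the function name should_auto_detect is hypothetical, and treating a null language pointer as "auto" is an assumption.

```cpp
#include <cstring>

// Hypothetical predicate (not part of whisper.cpp): true when the
// auto-detect block should run for the given parameters.
//   language        : params.language (null is assumed to mean "not set")
//   detect_language : params.detect_language
bool should_auto_detect(const char * language, bool detect_language) {
    return language == nullptr                   // assumption: unset language
        || std::strlen(language) == 0            // empty language
        || std::strcmp(language, "auto") == 0    // explicit "auto"
        || detect_language;                      // detection explicitly requested
}
```

For example, language "de" with detect_language set to false skips detection, while "auto", an empty string, or detect_language set to true all enter the auto-detect block.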

Tools used

git, cmake, whisper-cli, sox, and audiokit for fixture generation plus the differential matrix.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Matrix numbers come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. Reproducer fixtures and regress config are available; happy to share.

The auto-detect call in whisper_full_with_state passed a hard-coded
offset of 0 to whisper_lang_auto_detect_with_state, so language
detection always analyzed the first window of audio regardless of
the caller's offset_ms. On audio like "1 minute of French then 30
minutes of German" with offset_ms=60000, transcription correctly
started at the 1-minute mark but language detection still returned
French from the prefix.

Pass params.offset_ms through. Auto-detect now reads the same window
that decoding will start from.

Fixes ggml-org#1831


Development

Successfully merging this pull request may close these issues.

Trouble with language/detect_language and offset_ms params that are totally or partially ignored
