whisper : skip decoding of zero-filled chunks on forced-language path (#1724) #3763
Open
achyutbenz19 wants to merge 1 commit into ggml-org:master from
Conversation
When a specific language is forced (e.g. -l ru, -l es) and a 30-second decoder window is entirely zero-valued, whisper emits language-specific fallback tokens (bracketed music tags like [Música], fake subtitle-editor credits on -l ru). The auto-detect path handles silent chunks naturally.

Add a chunk-level zero-PCM check at the top of the seek loop inside whisper_full_with_state. When the current window is all-zero and the caller forced a language, emit a single [BLANK_AUDIO] segment for that chunk and advance without running the encoder or decoder. This matches the approach endorsed in the PR ggml-org#1588 review ("skip entire segments when silence is detected"), using zero-PCM as a stricter and language-independent signal than no_speech_prob.

The caller's original language intent is captured before the auto-detect block overwrites params.language, so the guard only fires when the user explicitly requested a specific language; auto-detect paths are unchanged.

Fixes ggml-org#1724 (residual hallucination on forced-language silence chunks not addressed by ggml-org#2629)
Summary
Fixes #1724 (residual forced-language silence hallucination not addressed by #2629).
When a specific language is forced (e.g. -l ru, -l pt, -l es) and the decoder processes a 30-second window that is entirely silent, whisper emits language-specific fallback tokens rather than the blank-audio signal the auto-detect path emits. Common failure modes:

- -l ru on trailing silence after real speech: fake subtitle-editor credit lines such as "Редактор субтитров А.Семкин Корректор А.Егорова" ("Subtitle editor A. Semkin, Proofreader A. Egorova")
- -l pt on trailing silence: [MÚSICA DE FUNDO] ("background music") or [Música]
- -l es on silence: [Música], [Musica], and similar bracketed tags

PR #2629 ("Fix hallucinations during silence") addressed one subset of this behavior (the single_timestamp_ending case), but the forced-language path on a fully silent chunk still bypasses that guard. Maintainer feedback on the earlier attempt at this scope (#1588, closed) was explicit: "The key to solving hallucination lies in finding a way to skip silence... OpenAI's Whisper checks the probability of the no-speech token and skips entire segments when silence probability is high."

This PR implements that "skip entire segments" idea using a zero-PCM check rather than no_speech_prob. Zero-PCM is a stricter and language-independent signal: on forced-language silent input, no_speech_prob can stay below the 0.6 threshold because the model confidently emits a language-specific fallback token, so a probability-based check does not fire.

Scope of the change
- src/whisper.cpp, 1 file, +45/-0.
- The check runs in whisper_full_with_state, before the auto-detect block overwrites params.language.
- On an all-zero window with a forced language, emit a [BLANK_AUDIO] segment spanning that window and advance seek by one chunk; skip the encoder and decoder for this window.
- The auto-detect path is deliberately untouched.
Reproduction
On current master (166c20b), with ggml-base.bin:

With this patch:
Real speech segments are preserved unchanged. Silent-chunk timestamps now reflect actual audio duration.
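For readers who want to recreate a fixture, the reproduction can be sketched as below; the ffmpeg fixture generation and the build/model paths are assumptions about a typical whisper.cpp checkout, not part of this PR:

```shell
# Generate 40 s of digital (all-zero) silence at whisper's 16 kHz mono
# input rate; anullsrc encoded as pcm_s16le should stay exactly zero:
#   ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 40 -c:a pcm_s16le silence40.wav

# Hypothetical paths -- adjust to your checkout.
model=models/ggml-base.bin
wav=silence40.wav

# Before the patch each forced language hallucinates on the silent window;
# after the patch the same invocations emit [BLANK_AUDIO] for that chunk.
for lang in ru pt es; do
    echo "./build/bin/whisper-cli -m $model -l $lang -f $wav"
done
```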
Differential matrix
Ran the patched build against master across (model) x (fixture) x (lang). Axes: model ∈ {base, small}, fixture ∈ {ru+30s, ru+10s, ru+3s, pt+30s, pt+10s, en+30s, en+10s, speech-en, speech-ru, speech-pt, long-en-70s, pink-noise-5min}, lang ∈ {auto, ru, en, es, pt}. Total 120 cells per build.

Non-target cells (96 of them) are byte-for-byte unchanged between master and this patch. That set covers: real speech (en/ru/pt), long-form (70 s), pink noise (5 min), speech followed by 10 s of silence in all three languages, speech followed by 3 s of silence, and every auto-detect cell.

The 15 cells flagged as "equal-length change" by the matrix's length-only heuristic are all length-shorter-or-equal replacements of a hallucinated chunk with [BLANK_AUDIO]; the tool cannot distinguish a correct replacement from a regression when the before-output happens to be longer. Spot-checked by hand:

  [00:00:30.000 --> 00:00:34.000] Редактор субтитров А.Семкин Корректор А.Егорова
  -> [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]

  [00:00:30.000 --> 00:00:38.000] [BLANK_AUDIO]
  -> [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]

  [00:00:30.000 --> 00:00:32.000] [Música]
  -> [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]

  [00:00:30.000 --> 00:00:34.000] Редактор субтитров ...
  -> [00:00:30.000 --> 00:00:33.490] [BLANK_AUDIO]

  [00:00:30.000 --> 00:00:32.000] [MÚSICA DE FUNDO]
  -> [00:00:30.000 --> 00:00:33.490] [BLANK_AUDIO]

  [00:00:30.000 --> 00:00:40.000] [BLANK_AUDIO]
  -> [00:00:30.000 --> 00:00:33.000] [BLANK_AUDIO]

(Remaining 9 cells follow the same two shapes: hallucination to
[BLANK_AUDIO], or wrong-duration [BLANK_AUDIO] to correct-duration [BLANK_AUDIO].)

What this does not do

- No new vocab entry is introduced: [BLANK_AUDIO] is emitted via the existing vocab token pattern.
- Only chunks in which every sample is exactly 0.0f are skipped; a chunk with any non-zero sample goes through the normal encode/decode path.
- It does not address decoder repetition loops (ref: #1724 also mentions this, but it is a separate repetition issue in the decoder state, not chunk-level silence handling).
- It does not replace #2629: the single_timestamp_ending guard that PR added still runs and handles the case where a chunk's decoded output is terminated by a single timestamp. This change is a complementary pre-decode guard.

Overlap with #3762
I have another open PR (#3762) that adds a whole-input zero-PCM guard for issue #1881. This PR's chunk-level guard subsumes that special case as a 1-chunk variant. If both end up merging, the #3762 guard is redundant and can be removed in a follow-up. If only this PR merges, #1881 is also fixed as a side effect.
Tools used
git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure
I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. The matrix numbers above come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. The regress config and raw per-cell outputs are available; happy to share.
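For reference, the matrix enumeration described above can be sketched as a shell loop; the binary, model, and fixture paths are illustrative, and the real regress config is what actually ran:

```shell
# Enumerate the (model) x (fixture) x (lang) axes from the matrix above.
models="base small"
fixtures="ru+30s ru+10s ru+3s pt+30s pt+10s en+30s en+10s \
speech-en speech-ru speech-pt long-en-70s pink-noise-5min"
langs="auto ru en es pt"

cells=0
for m in $models; do
  for f in $fixtures; do
    for l in $langs; do
      # A real runner would transcribe and capture each cell, e.g.:
      #   ./build/bin/whisper-cli -m models/ggml-$m.bin -l $l \
      #       -f "fixtures/$f.wav" > "out/$m.$f.$l.txt"
      cells=$((cells + 1))
    done
  done
done
echo "cells per build: $cells"   # 2 models x 12 fixtures x 5 langs = 120
```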