whisper : skip decoding of zero-filled chunks on forced-language path (#1724)#3763

Open
achyutbenz19 wants to merge 1 commit into ggml-org:master from achyutbenz19:fix/1724-chunk-silence-guard
Conversation

@achyutbenz19

Summary

Fixes #1724 (residual forced-language silence hallucination not addressed by #2629).

When a specific language is forced (e.g. -l ru, -l pt, -l es) and the decoder processes a 30-second window that is entirely silent, whisper emits language-specific fallback tokens instead of the blank-audio signal that the auto-detect path emits. Common failure modes:

  • -l ru on trailing silence after real speech: fake subtitle-editor credit lines like Редактор субтитров А.Семкин Корректор А.Егорова
  • -l pt on trailing silence: [MÚSICA DE FUNDO] or [Música]
  • -l es on silence: [Música], [Musica], and similar bracketed tags

PR #2629 ("Fix hallucinations during silence") addressed one subset of this behavior (the single_timestamp_ending case) but the forced-language path on a full silent chunk still bypasses that guard. Maintainer feedback on the earlier attempt at this scope (#1588, closed) was explicit: "The key to solving hallucination lies in finding a way to skip silence... OpenAI's Whisper checks the probability of the no-speech token and skips entire segments when silence probability is high."

This PR implements that "skip entire segments" idea using a zero-PCM check rather than no_speech_prob. Zero-PCM is a stricter and language-independent signal: on forced-language silent input, no_speech_prob can stay below the 0.6 threshold because the model confidently emits a language-specific fallback token, so a probability-based check does not fire.

Scope of the change

src/whisper.cpp, 1 file, +45/-0.

  1. Capture the caller's original language intent at the top of whisper_full_with_state, before the auto-detect block overwrites params.language.
  2. At the top of the chunk seek loop, if the caller forced a language and the current 30-second window is entirely zero-valued, emit a single [BLANK_AUDIO] segment spanning that window and advance seek by one chunk. Skip encoder and decoder for this window.

The auto-detect path is deliberately untouched.

Reproduction

On current master (166c20b), with ggml-base.bin:

whisper-cli -m ggml-base.bin -l ru -mc 0 ru-speech-plus-30s-silence.wav
[00:00:00.000 --> 00:00:02.800]   Привет, мир! Это простой тест транскрипции.
[00:00:30.000 --> 00:00:34.000]   Редактор субтитров А.Семкин Корректор А.Егорова

whisper-cli -m ggml-base.bin -l pt -mc 0 pt-speech-plus-30s-silence.wav
[00:00:00.000 --> 00:00:03.480]   Olá mundo, este é um teste simples de transcrição.
[00:00:30.000 --> 00:00:32.000]   [MÚSICA DE FUNDO]

With this patch:

whisper-cli -m ggml-base.bin -l ru -mc 0 ru-speech-plus-30s-silence.wav
[00:00:00.000 --> 00:00:02.800]   Привет, мир! Это простой тест транскрипции.
[00:00:30.000 --> 00:00:32.840]   [BLANK_AUDIO]

whisper-cli -m ggml-base.bin -l pt -mc 0 pt-speech-plus-30s-silence.wav
[00:00:00.000 --> 00:00:03.480]   Olá mundo, este é um teste simples de transcrição.
[00:00:30.000 --> 00:00:33.490]   [BLANK_AUDIO]

Real speech segments are preserved unchanged. Silent-chunk timestamps now reflect actual audio duration.

Differential matrix

Ran the patched build against master across (model) x (fixture) x (lang). Axes: model ∈ {base, small}, fixture ∈ {ru+30s, ru+10s, ru+3s, pt+30s, pt+10s, en+30s, en+10s, speech-en, speech-ru, speech-pt, long-en-70s, pink-noise-5min}, lang ∈ {auto, ru, en, es, pt}. Total 120 cells per build.

cells | target cells | target improved | target equal-length change | non-target unchanged | non-target changed
----- | ------------ | --------------- | -------------------------- | -------------------- | ------------------
120   | 24           | 9               | 15                         | 96                   | 0

Non-target cells (96 of them) are byte-for-byte unchanged between master and this patch. That set covers: real speech (en/ru/pt), long-form (70 s), pink-noise (5 min), speech followed by 10 s of silence on all three languages, speech followed by 3 s of silence, and every auto-detect cell.

The 15 cells flagged as "equal-length change" by the matrix's length-only heuristic are all length-shorter-or-equal replacements of a hallucinated chunk with [BLANK_AUDIO]; the tool cannot distinguish a correct replacement from a regression when the before-output happens to be longer. Spot-checked by hand:

model | fixture | lang | before | after
----- | ------- | ---- | ------ | -----
base | ru+30s | ru | [00:00:30.000 --> 00:00:34.000] Редактор субтитров А.Семкин Корректор А.Егорова | [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]
base | ru+30s | en | [00:00:30.000 --> 00:00:38.000] [BLANK_AUDIO] | [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]
base | ru+30s | pt | [00:00:30.000 --> 00:00:32.000] [Música] | [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]
base | pt+30s | ru | [00:00:30.000 --> 00:00:34.000] Редактор субтитров ... | [00:00:30.000 --> 00:00:33.490] [BLANK_AUDIO]
base | pt+30s | pt | [00:00:30.000 --> 00:00:32.000] [MÚSICA DE FUNDO] | [00:00:30.000 --> 00:00:33.490] [BLANK_AUDIO]
base | en+30s | en | [00:00:30.000 --> 00:00:40.000] [BLANK_AUDIO] | [00:00:30.000 --> 00:00:33.000] [BLANK_AUDIO]

(Remaining 9 cells follow the same two shapes: hallucination to [BLANK_AUDIO], or wrong-duration [BLANK_AUDIO] to correct-duration [BLANK_AUDIO].)

What this does not do

  • Does not touch the auto-detect path. A silent chunk under auto-detect still goes through normal decoding and emits [BLANK_AUDIO] via the vocab token pattern.
  • Does not change behavior on partially silent chunks. Only chunks where every PCM sample is exactly 0.0f are skipped; a chunk with any non-zero sample goes through the normal encode or decode.
  • Does not address the sentence-duplication failure mode across a mid-file silence gap (ref: #1724 also mentions this, but it is a separate repetition issue in the decoder state, not chunk-level silence handling).
  • Does not replace PR #2629 ("Fix hallucinations during silence"); the single_timestamp_ending guard that PR added still runs and handles the case where a chunk's decoded output is terminated by a single timestamp. This is a complementary pre-decode guard.

Overlap with #3762

I have another open PR (#3762) that adds a whole-input zero-PCM guard for issue #1881. This PR's chunk-level guard subsumes that special case as a 1-chunk variant. If both end up merging, the #3762 guard is redundant and can be removed in a follow-up. If only this PR merges, #1881 is also fixed as a side effect.

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. The matrix numbers above come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. The regress config and raw per-cell outputs are available; happy to share.

When a specific language is forced (e.g. -l ru, -l es) and a 30-second
decoder window is entirely zero-valued, whisper emits language-specific
fallback tokens (bracketed music tags like [Música], fake subtitle-editor
credits on -l ru). The auto-detect path handles silent chunks naturally.

Add a chunk-level zero-PCM check at the top of the seek loop inside
whisper_full_with_state. When the current window is all-zero and the
caller forced a language, emit a single [BLANK_AUDIO] segment for that
chunk and advance without running the encoder or decoder. Matches the
approach endorsed in PR ggml-org#1588 review ("skip entire segments when
silence is detected"), using zero-PCM as a stricter and language-
independent signal than no_speech_prob.

The caller's original language intent is captured before the auto-
detect block overwrites params.language, so the guard only fires when
the user explicitly requested a specific language; auto-detect paths
are unchanged.

Fixes ggml-org#1724 (residual hallucination on forced-language silence chunks
not addressed by ggml-org#2629)


Development

Successfully merging this pull request may close these issues.

Hallucination on silence

1 participant