whisper : skip decoding of zero-filled chunks on forced-language path (#1724)#3763

Open
achyutbenz19 wants to merge 1 commit into ggml-org:master from achyutbenz19:fix/1724-chunk-silence-guard
Conversation

@achyutbenz19

Summary

Fixes #1724 (residual forced-language silence hallucination not addressed by #2629).

When a specific language is forced (e.g. -l ru, -l pt, -l es) and the decoder processes a 30-second window that is entirely silent, whisper emits language-specific fallback tokens instead of the blank-audio signal that the auto-detect path emits. Common failure modes:

  • -l ru on trailing silence after real speech: fake subtitle-editor credit lines like Редактор субтитров А.Семкин Корректор А.Егорова
  • -l pt on trailing silence: [MÚSICA DE FUNDO] or [Música]
  • -l es on silence: [Música], [Musica], and similar bracketed tags

PR #2629 ("Fix hallucinations during silence") addressed one subset of this behavior (the single_timestamp_ending case) but the forced-language path on a full silent chunk still bypasses that guard. Maintainer feedback on the earlier attempt at this scope (#1588, closed) was explicit: "The key to solving hallucination lies in finding a way to skip silence... OpenAI's Whisper checks the probability of the no-speech token and skips entire segments when silence probability is high."

This PR implements that "skip entire segments" idea using a zero-PCM check rather than no_speech_prob. Zero-PCM is a stricter and language-independent signal: on forced-language silent input, no_speech_prob can stay below the 0.6 threshold because the model confidently emits a language-specific fallback token, so a probability-based check does not fire.

Scope of the change

src/whisper.cpp, 1 file, +45/-0.

  1. Capture the caller's original language intent at the top of whisper_full_with_state, before the auto-detect block overwrites params.language.
  2. At the top of the chunk seek loop, if the caller forced a language and the current 30-second window is entirely zero-valued, emit a single [BLANK_AUDIO] segment spanning that window and advance seek by one chunk. Skip encoder and decoder for this window.

The auto-detect path is deliberately untouched.

Reproduction

On current master (166c20b), with ggml-base.bin:

whisper-cli -m ggml-base.bin -l ru -mc 0 ru-speech-plus-30s-silence.wav
[00:00:00.000 --> 00:00:02.800]   Привет, мир! Это простой тест транскрипции.
[00:00:30.000 --> 00:00:34.000]   Редактор субтитров А.Семкин Корректор А.Егорова

whisper-cli -m ggml-base.bin -l pt -mc 0 pt-speech-plus-30s-silence.wav
[00:00:00.000 --> 00:00:03.480]   Olá mundo, este é um teste simples de transcrição.
[00:00:30.000 --> 00:00:32.000]   [MÚSICA DE FUNDO]

With this patch:

whisper-cli -m ggml-base.bin -l ru -mc 0 ru-speech-plus-30s-silence.wav
[00:00:00.000 --> 00:00:02.800]   Привет, мир! Это простой тест транскрипции.
[00:00:30.000 --> 00:00:32.840]   [BLANK_AUDIO]

whisper-cli -m ggml-base.bin -l pt -mc 0 pt-speech-plus-30s-silence.wav
[00:00:00.000 --> 00:00:03.480]   Olá mundo, este é um teste simples de transcrição.
[00:00:30.000 --> 00:00:33.490]   [BLANK_AUDIO]

Real speech segments are preserved unchanged. Silent-chunk timestamps now reflect actual audio duration.

Differential matrix

Ran the patched build against master across (model) x (fixture) x (lang). Axes: model ∈ {base, small}, fixture ∈ {ru+30s, ru+10s, ru+3s, pt+30s, pt+10s, en+30s, en+10s, speech-en, speech-ru, speech-pt, long-en-70s, pink-noise-5min}, lang ∈ {auto, ru, en, es, pt}. Total 120 cells per build.

cells | target cells | target improved | target equal-length change | non-target unchanged | non-target changed
----- | ------------ | --------------- | -------------------------- | -------------------- | ------------------
120   | 24           | 9               | 15                         | 96                   | 0

Non-target cells (96 of them) are byte-for-byte unchanged between master and this patch. That set covers: real speech (en/ru/pt), long-form (70 s), pink-noise (5 min), speech followed by 10 s of silence on all three languages, speech followed by 3 s of silence, and every auto-detect cell.

The 15 cells flagged as "equal-length change" by the matrix's length-only heuristic are all length-shorter-or-equal replacements of a hallucinated chunk with [BLANK_AUDIO]; the tool cannot distinguish a correct replacement from a regression when the before-output happens to be longer. Spot-checked by hand:

model | fixture | lang | before | after
----- | ------- | ---- | ------ | -----
base | ru+30s | ru | [00:00:30.000 --> 00:00:34.000] Редактор субтитров А.Семкин Корректор А.Егорова | [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]
base | ru+30s | en | [00:00:30.000 --> 00:00:38.000] [BLANK_AUDIO] | [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]
base | ru+30s | pt | [00:00:30.000 --> 00:00:32.000] [Música] | [00:00:30.000 --> 00:00:32.840] [BLANK_AUDIO]
base | pt+30s | ru | [00:00:30.000 --> 00:00:34.000] Редактор субтитров ... | [00:00:30.000 --> 00:00:33.490] [BLANK_AUDIO]
base | pt+30s | pt | [00:00:30.000 --> 00:00:32.000] [MÚSICA DE FUNDO] | [00:00:30.000 --> 00:00:33.490] [BLANK_AUDIO]
base | en+30s | en | [00:00:30.000 --> 00:00:40.000] [BLANK_AUDIO] | [00:00:30.000 --> 00:00:33.000] [BLANK_AUDIO]

(Remaining 9 cells follow the same two shapes: hallucination to [BLANK_AUDIO], or wrong-duration [BLANK_AUDIO] to correct-duration [BLANK_AUDIO].)

What this does not do

  • Does not touch the auto-detect path. A silent chunk under auto-detect still goes through normal decoding and emits [BLANK_AUDIO] via the vocab token pattern.
  • Does not change behavior on partially silent chunks. Only chunks where every PCM sample is exactly 0.0f are skipped; a chunk with any non-zero sample goes through the normal encode or decode.
  • Does not address the sentence-duplication failure mode across a mid-file silence gap (ref: #1724 also mentions this, but it is a separate repetition issue in the decoder state, not chunk-level silence handling).
  • Does not replace PR #2629 ("Fix hallucinations during silence"); the single_timestamp_ending guard that PR added still runs and handles the case where a chunk's decoded output is terminated by a single timestamp. This is a complementary pre-decode guard.

Overlap with #3762

I have another open PR (#3762) that adds a whole-input zero-PCM guard for issue #1881. This PR's chunk-level guard subsumes that special case as a 1-chunk variant. If both end up merging, the #3762 guard is redundant and can be removed in a follow-up. If only this PR merges, #1881 is also fixed as a side effect.

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. The matrix numbers above come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. The regress config and raw per-cell outputs are available; happy to share.

When a specific language is forced (e.g. -l ru, -l es) and a 30-second
decoder window is entirely zero-valued, whisper emits language-specific
fallback tokens (bracketed music tags like [Música], fake subtitle-editor
credits on -l ru). The auto-detect path handles silent chunks naturally.

Add a chunk-level zero-PCM check at the top of the seek loop inside
whisper_full_with_state. When the current window is all-zero and the
caller forced a language, emit a single [BLANK_AUDIO] segment for that
chunk and advance without running the encoder or decoder. Matches the
approach endorsed in PR ggml-org#1588 review ("skip entire segments when
silence is detected"), using zero-PCM as a stricter and language-
independent signal than no_speech_prob.

The caller's original language intent is captured before the auto-
detect block overwrites params.language, so the guard only fires when
the user explicitly requested a specific language; auto-detect paths
are unchanged.

Fixes ggml-org#1724 (residual hallucination on forced-language silence chunks
not addressed by ggml-org#2629)


Development

Successfully merging this pull request may close these issues.

Hallucination on silence

1 participant