I’m building an Android app with voice typing powered by whisper.cpp, running locally on the device (CPU only). I’m porting the logic from whisper_streaming (which uses faster-whisper in Python) to Kotlin + C++ (JNI) for Android.

Batch Mode (Record → Stop → Transcribe)
Works perfectly.
~5 seconds of audio transcribed in ~1–2 seconds.
Fast and accurate.

Live Streaming Mode (Record → Stream chunks → Transcribe)
Extremely slow.
~5–7 seconds to process ~1 second of new audio.
Latency keeps increasing (3s → 10s → 30s), eventually causing ANRs or process kills.

Setup
Engine: whisper.cpp (native C++ via JNI)
Model: Quantized tiny (q8_0), CPU only
Device: Android smartphone (ARM64)
VAD: Disabled (to isolate variables; inference continues even during silence)

Kotlin Layer
Captures audio in 1024-sample chunks (16 kHz PCM)
Accumulates chunks into a buffer
Implements a sliding window / buffer (ported from OnlineASRProcessor in whisper_streaming)
Calls transcribeNative() via JNI when a chunk threshold is reached (see the Kotlin sketch below)

C++ JNI Layer (whisper_jni.cpp)
Receives float[] audio data
Calls whisper_full using WHISPER_SAMPLING_GREEDY
Parameters: print_progress = false, no_context = true, n_threads = 4
Returns JSON segments
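To make the flow concrete, here is a minimal sketch of what my Kotlin streaming loop roughly looks like. Names such as StreamingBuffer, CHUNK_THRESHOLD_SAMPLES, and the commit helper are simplified placeholders, not the exact app code; the real implementation lives in my OnlineASRProcessor port and WhisperContextImpl.

// Minimal Kotlin sketch of the live-mode loop (illustrative names, not the exact app code).
class StreamingBuffer(private val contextPtr: Long) {

    private val samples = ArrayList<Float>()   // accumulated 16 kHz PCM samples
    private var committedText = ""             // text already confirmed by the sliding-window logic

    // Called from the audio callback with each 1024-sample chunk.
    fun onAudioChunk(chunk: FloatArray) {
        chunk.forEach { samples.add(it) }
        if (samples.size >= CHUNK_THRESHOLD_SAMPLES) {
            // Every call re-runs whisper_full over the whole (growing) buffer.
            val json = transcribeNative(contextPtr, samples.toFloatArray(), committedText)
            committedText = commitStablePrefix(json)
        }
    }

    // Local-agreement / trimming logic ported from whisper_streaming's OnlineASRProcessor.
    private fun commitStablePrefix(segmentsJson: String): String {
        // ... parse JSON segments, commit the stable prefix, trim the buffer ...
        return committedText
    }

    private external fun transcribeNative(contextPtr: Long, audioData: FloatArray, prompt: String): String

    companion object {
        private const val CHUNK_THRESHOLD_SAMPLES = 16_000  // roughly 1 s of audio at 16 kHz
    }
}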
What I’ve Tried and Verified
Quantization - Using quantized models (q8_0).
VAD - Suspected silence processing, but even with continuous speech, performance is still ~5× slower than real-time.
Batch vs Live Toggle (see the sketch after this list)
Batch: accumulate ~10s → call whisper_full once → fast
Live: call whisper_full repeatedly on a growing buffer → extremely slow
Hardware - The device is clearly capable; Batch mode proves this.
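To make the toggle concrete, the two call patterns differ roughly like this (simplified sketch; the function names are placeholders, and transcribeNative is the JNI entry point shown in the snippet below):

// Batch: a single whisper_full call over the complete recording after Stop.
fun transcribeBatch(contextPtr: Long, allSamples: FloatArray): String =
    transcribeNative(contextPtr, allSamples, prompt = "")

// Live: whisper_full is invoked every 1–2 s on a buffer that keeps growing,
// so each call re-processes everything accumulated so far.
fun transcribeLiveStep(contextPtr: Long, growingBuffer: FloatArray, committedText: String): String =
    transcribeNative(contextPtr, growingBuffer, prompt = committedText)

private external fun transcribeNative(contextPtr: Long, audioData: FloatArray, prompt: String): String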
My Hypothesis / Questions
If whisper_full is fast enough for batch processing, why does calling it repeatedly in a streaming loop destroy performance?
Is there a large overhead in repeatedly initializing or resetting whisper_full?
Am I misusing prompt / context handling?
In faster-whisper, previously committed text is passed as a prompt.
I’m doing the same in Kotlin (see the sketch below), but whisper.cpp seems to struggle with repeated re-evaluation.
Is whisper.cpp simply not designed for overlapping-buffer streaming on mobile CPUs?
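For context on the prompt question, this is roughly how I carry committed text between calls on the Kotlin side (simplified sketch; PromptTracker is a placeholder name, the real logic sits in my OnlineASRProcessor port):

// Sketch: previously confirmed words are joined and passed as the prompt of the
// next transcribeNative() call, mirroring how faster-whisper streaming wrappers
// feed confirmed text back in as initial_prompt.
class PromptTracker {
    private val committedWords = mutableListOf<String>()

    fun commit(words: List<String>) {
        committedWords.addAll(words)
    }

    fun promptForNextCall(): String = committedWords.joinToString(" ")
}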
Code Snippet (C++ JNI)
// Called repeatedly in Live Mode (for example, every 1–2 seconds)
extern "C" JNIEXPORT jstring JNICALL
Java_com_wikey_feature_voice_engines_whisper_WhisperContextImpl_transcribeNative(
        JNIEnv *env,
        jobject,
        jlong contextPtr,
        jfloatArray audioData,
        jstring prompt) {

    // ... setup context and audio buffer ...

    whisper_full_params params =
            whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.print_progress  = false;
    params.no_context      = true;   // Is this correct for streaming?
    params.single_segment  = false;
    params.n_threads       = 4;

    // Passing the previously confirmed text as prompt
    const char *promptStr = env->GetStringUTFChars(prompt, nullptr);
    if (promptStr) {
        params.initial_prompt = promptStr;
    }

    // This call takes ~5–7 seconds for ~1.5s of audio in Live Mode
    if (whisper_full(ctx, params, pcmf32.data(), pcmf32.size()) != 0) {
        return env->NewStringUTF("[]");
    }

    // ... parse and return JSON ...
}
Logs (Live Mode)
D/OnlineASRProcessor: ASR Logic: Words from JNI (count: 5): [is, it, really, translated, ?]
V/WhisperVoiceEngine: Whisper Partial: 'is it really translated?'
D/OnlineASRProcessor: ASR Process: Buffer=1.088s Offset=0.0s
D/OnlineASRProcessor: ASR Inference took: 6772ms
(~6.7s to process ~1s of audio)
Logs (Batch Mode – Fast)
D/WhisperVoiceEngine$stopListening: Processing Batch Audio: 71680 samples (~4.5s)
D/WhisperVoiceEngine$stopListening: Batch Result: '...'
(Inference time isn’t explicitly logged, but is perceptibly under 2s.)
Any insights into why whisper.cpp performs so poorly in this streaming loop, compared to batch processing or the Python faster-whisper implementation?