I’m building an Android app with voice typing powered by whisper.cpp, running locally on the device (CPU only). I’m porting the logic from whisper_streaming (which uses faster-whisper in Python) to Kotlin + C++ (JNI) for Android.

Batch Mode (Record → Stop → Transcribe)
Works perfectly.
~5 seconds of audio transcribed in ~1–2 seconds.
Fast and accurate.

Live Streaming Mode (Record → Stream chunks → Transcribe)
Extremely slow.
~5–7 seconds to process ~1 second of new audio.
Latency keeps increasing (3s → 10s → 30s), eventually causing ANRs or process kills.

Setup
Engine: whisper.cpp (native C++ via JNI)
Model: Quantized tiny (q8_0), CPU only
Device: Android smartphone (ARM64)
VAD: Disabled (to isolate variables; inference continues even during silence)

Kotlin Layer
Captures audio in 1024-sample chunks (16 kHz PCM)
Accumulates chunks into a buffer
Implements a sliding window / buffer (ported from OnlineASRProcessor in whisper_streaming)
Calls transcribeNative() via JNI when a chunk threshold is reached (see the Kotlin sketch below)

C++ JNI Layer (whisper_jni.cpp)
Receives float[] audio data
Calls whisper_full using WHISPER_SAMPLING_GREEDY
Parameters: print_progress = false, no_context = true, n_threads = 4
Returns JSON segments
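To make the flow concrete, here is a minimal sketch of what my Kotlin streaming loop roughly looks like. Names such as StreamingBuffer, CHUNK_THRESHOLD_SAMPLES, and the commit helper are simplified placeholders, not the exact app code; the real implementation lives in my OnlineASRProcessor port and WhisperContextImpl.

// Minimal Kotlin sketch of the live-mode loop (illustrative names, not the exact app code).
class StreamingBuffer(private val contextPtr: Long) {

    private val samples = ArrayList<Float>()   // accumulated 16 kHz PCM samples
    private var committedText = ""             // text already confirmed by the sliding-window logic

    // Called from the audio callback with each 1024-sample chunk.
    fun onAudioChunk(chunk: FloatArray) {
        chunk.forEach { samples.add(it) }
        if (samples.size >= CHUNK_THRESHOLD_SAMPLES) {
            // Every call re-runs whisper_full over the whole (growing) buffer.
            val json = transcribeNative(contextPtr, samples.toFloatArray(), committedText)
            committedText = commitStablePrefix(json)
        }
    }

    // Local-agreement / trimming logic ported from whisper_streaming's OnlineASRProcessor.
    private fun commitStablePrefix(segmentsJson: String): String {
        // ... parse JSON segments, commit the stable prefix, trim the buffer ...
        return committedText
    }

    private external fun transcribeNative(contextPtr: Long, audioData: FloatArray, prompt: String): String

    companion object {
        private const val CHUNK_THRESHOLD_SAMPLES = 16_000  // roughly 1 s of audio at 16 kHz
    }
}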
What I’ve Tried and Verified
Quantization - Using quantized models (q8_0).
VAD - Suspected silence processing, but even with continuous speech, performance is still ~5× slower than real-time.
Batch vs Live Toggle (see the sketch after this list)
Batch: accumulate ~10s → call whisper_full once → fast
Live: call whisper_full repeatedly on a growing buffer → extremely slow
Hardware - The device is clearly capable; Batch mode proves this.
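To make the toggle concrete, the two call patterns differ roughly like this (simplified sketch; the function names are placeholders, and transcribeNative is the JNI entry point shown in the snippet below):

// Batch: a single whisper_full call over the complete recording after Stop.
fun transcribeBatch(contextPtr: Long, allSamples: FloatArray): String =
    transcribeNative(contextPtr, allSamples, prompt = "")

// Live: whisper_full is invoked every 1–2 s on a buffer that keeps growing,
// so each call re-processes everything accumulated so far.
fun transcribeLiveStep(contextPtr: Long, growingBuffer: FloatArray, committedText: String): String =
    transcribeNative(contextPtr, growingBuffer, prompt = committedText)

private external fun transcribeNative(contextPtr: Long, audioData: FloatArray, prompt: String): String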
My Hypothesis / Questions
If whisper_full is fast enough for batch processing, why does calling it repeatedly in a streaming loop destroy performance?
Is there a large overhead in repeatedly initializing or resetting whisper_full?
Am I misusing prompt / context handling?
In faster-whisper, previously committed text is passed as a prompt.
I’m doing the same in Kotlin (see the sketch below), but whisper.cpp seems to struggle with repeated re-evaluation.
Is whisper.cpp simply not designed for overlapping-buffer streaming on mobile CPUs?
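For context on the prompt question, this is roughly how I carry committed text between calls on the Kotlin side (simplified sketch; PromptTracker is a placeholder name, the real logic sits in my OnlineASRProcessor port):

// Sketch: previously confirmed words are joined and passed as the prompt of the
// next transcribeNative() call, mirroring how faster-whisper streaming wrappers
// feed confirmed text back in as initial_prompt.
class PromptTracker {
    private val committedWords = mutableListOf<String>()

    fun commit(words: List<String>) {
        committedWords.addAll(words)
    }

    fun promptForNextCall(): String = committedWords.joinToString(" ")
}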
Code Snippet (C++ JNI)
// Called repeatedly in Live Mode (for example, every 1–2 seconds)
extern "C" JNIEXPORT jstring JNICALL
Java_com_wikey_feature_voice_engines_whisper_WhisperContextImpl_transcribeNative(
        JNIEnv *env,
        jobject,
        jlong contextPtr,
        jfloatArray audioData,
        jstring prompt) {

    // ... setup context and audio buffer ...

    whisper_full_params params =
            whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.print_progress  = false;
    params.no_context      = true;   // Is this correct for streaming?
    params.single_segment  = false;
    params.n_threads       = 4;

    // Passing the previously confirmed text as prompt
    const char *promptStr = env->GetStringUTFChars(prompt, nullptr);
    if (promptStr) {
        params.initial_prompt = promptStr;
    }

    // This call takes ~5–7 seconds for ~1.5s of audio in Live Mode
    if (whisper_full(ctx, params, pcmf32.data(), pcmf32.size()) != 0) {
        return env->NewStringUTF("[]");
    }

    // ... parse and return JSON ...
}
Logs (Live Mode)
D/OnlineASRProcessor: ASR Logic: Words from JNI (count: 5): [is, it, really, translated, ?]
V/WhisperVoiceEngine: Whisper Partial: 'is it really translated?'
D/OnlineASRProcessor: ASR Process: Buffer=1.088s Offset=0.0s
D/OnlineASRProcessor: ASR Inference took: 6772ms
(~6.7s to process ~1s of audio)
Logs (Batch Mode – Fast)
D/WhisperVoiceEngine$stopListening: Processing Batch Audio: 71680 samples (~4.5s)
D/WhisperVoiceEngine$stopListening: Batch Result: '...'
(Inference time isn’t explicitly logged, but is perceptibly under 2s.)
Any insights into why whisper.cpp performs so poorly in this streaming loop, compared to batch processing or the Python faster-whisper implementation?