⚠️ EXPERIMENTAL WIP – Made with care but with AI. Expect bugs, breaking changes, and rough edges.
Try it now: parakeetweb.olicorne.org — no installation required.
- Features
- Quick Start
- Dictation Mode
- Live Transcription
- Remote Microphone (Phone as Mic)
- Local Model Fallback
- Mobile debugging
- License
- Acknowledgments
- Credits
Browser-based speech-to-text running entirely client-side using NVIDIA's Parakeet TDT 0.6B v3 model (converted to ONNX format by istupakov) via WebGPU/WASM.
| Feature | Details |
|---|---|
| 🔒 100% Private | Runs entirely in your browser — no audio ever leaves your device |
| ⚡ WebGPU Accelerated | Fast GPU inference with automatic WASM fallback for compatibility |
| 🎙️ Phone as Mic | Use your phone as a wireless microphone via end-to-end encrypted WebRTC |
| ⏱️ Live Transcription | Optional streaming mode: text appears as you speak, dictation regex applied in real time |
| 📝 Dictation Mode | Post-processes transcriptions with regex rules (medical French vocabulary, punctuation, units) |
| 🕐 Word Timestamps | Per-word timestamps and confidence score heatmap |
| 📁 File or Mic | Transcribe uploaded audio files or record directly from your microphone |
| 📦 Quantization | Choose between fp32 (accurate) and int8 (faster/smaller) model variants |
| 🐳 Docker Ready | One-command self-hosted deployment |
```sh
# 1. Copy the example env file and edit it with your own values
cp docker/env.example docker/.env

# 2. Run the demo locally with Docker
sudo docker compose -f docker/docker-compose.yml up
```

Then visit http://localhost:5173.
Parakeet Web includes an experimental dictation mode that post-processes transcriptions using regex rules to clean up spoken punctuation, medical vocabulary, and unit abbreviations. This is especially useful for French medical dictation.
The regex rules are sourced from the murmure-regex repository by the non-profit interhop.org, originally created for the Murmure software. A single combined CSV file is automatically downloaded on container startup.
The rules are in French and cover categories like punctuation, unit abbreviations, clinical exam templates, medication name corrections, and medical vocabulary corrections.
This feature is very early and will improve rapidly.
- Docker: The entrypoint script downloads the single combined `regex.csv` file from the murmure-regex repository on every container start.
- Frontend: The app loads the CSV rules at startup via a manifest file and applies them as JavaScript `RegExp` replacements. After regex processing, each line is stripped of leading/trailing whitespace and its first letter is capitalized. Three display modes are available per transcription: Raw, Confidence (heatmap), and Dictation (regex-cleaned).
- Custom regex source: Set the `DICTATION_REGEX_SOURCE` environment variable to override the default Murmure URL. This can be a GitLab-compatible repo URL (e.g. `https://framagit.org/interhop/murmure-regex`) or a local folder path containing CSV regex files (e.g. `/path/to/my/regex-csvs`). This lets you iterate on regex rules locally without waiting for upstream changes.
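The post-processing step described above can be sketched as follows. The rule shape (`{ pattern, replacement }` objects), the `applyDictationRules` name, and the sample rules are illustrative assumptions, not the app's actual code or the real Murmure rule set:

```javascript
// Minimal sketch of the dictation post-processing step.
// Assumed rule shape: { pattern, replacement }, one per CSV row.
function applyDictationRules(text, rules) {
  let out = text;
  for (const { pattern, replacement } of rules) {
    // Each rule is applied as a global, case-insensitive JS regex.
    out = out.replace(new RegExp(pattern, "gi"), replacement);
  }
  // After regex processing: trim each line, capitalize its first letter.
  return out
    .split("\n")
    .map((line) => {
      const t = line.trim();
      return t ? t[0].toUpperCase() + t.slice(1) : t;
    })
    .join("\n");
}

// Example rules in the spirit of the French Murmure set (not the real ones):
const rules = [
  { pattern: "\\s*point virgule\\s*", replacement: " ; " },
  { pattern: "\\s*à la ligne\\s*", replacement: "\n" },
];
applyDictationRules("tension normale point virgule à la ligne suite", rules);
// → "Tension normale ;\nSuite"
```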
By default, transcription runs once when you stop recording. If you'd rather see the text appear as you speak, enable Live transcription in the settings panel. The model is then re-run every few seconds on a sliding window of recent audio, and the transcript updates incrementally during the recording. The dictation regex (if loaded) is applied to the entire visible text on every update, so corrections like "point virgule" → ";" happen live too.
This works for both the local microphone and the phone-as-mic path — the live transcriber consumes the same audio buffer either way.
Parakeet's encoder is non-streaming (it sees the whole window at once with self-attention), so accuracy depends heavily on having enough acoustic context. The live transcriber maintains a sliding context window of the last N seconds of audio and re-runs the model on it every few seconds. Words near the trailing edge of the window are "pending" (may be revised by the next, larger-context window) and words past a 3-second commit boundary are frozen for good. The result: every word is eventually transcribed with at least 3 seconds of right-context, while you still see updates as you speak.
When you hit stop, the canonical full-audio transcription pass runs as it always has, and its result replaces the live one — so the live mode never affects the final accuracy.
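The pending/committed split on each live tick can be sketched like this. The `splitWords` name, the word shape (`{ text, endTime }`), and the exact constant placement are hypothetical simplifications; only the 3-second commit boundary comes from the description above:

```javascript
// Sketch of the pending/committed split on each live-transcription pass.
// COMMIT_MARGIN_S is the 3-second boundary described above.
const COMMIT_MARGIN_S = 3;

function splitWords(words, windowEndS) {
  const commitBoundary = windowEndS - COMMIT_MARGIN_S;
  return {
    // Frozen for good: at least 3 s of right-context was available.
    committed: words.filter((w) => w.endTime <= commitBoundary),
    // May still be revised by the next, larger-context pass.
    pending: words.filter((w) => w.endTime > commitBoundary),
  };
}

const words = [
  { text: "hello", endTime: 1.2 },
  { text: "world", endTime: 6.8 },
];
const { committed, pending } = splitWords(words, 8);
// Boundary is at 5 s: "hello" is frozen, "world" is still pending.
```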
- Live transcription (off by default): toggle the streaming mode on or off.
- Context window: how many seconds of recent audio the encoder sees on each update.
- Auto (recommended): starts at 15 s and adapts itself between 10 s and 60 s based on how fast your machine actually transcribes. Faster machines get a larger window (more context, better accuracy); slower machines get a smaller one (so updates can keep up).
- Or pick a fixed value (10/15/20/30/45/60 s) if you want to override the auto-adapter — for example, choose 60 s on a fast desktop to maximize accuracy, or 10 s on a phone to keep latency low.
The cadence (how often the live transcript updates) is always auto-adapted: if a transcription pass takes longer than expected, updates back off so the queue never grows. Enable Display more details in settings to see the current window size, step interval, and per-tick processing time below the live transcript.
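The back-off idea can be sketched as: keep the next step interval a little above the last measured pass time, clamped to sane bounds. The function name, the 20% headroom factor, and the bounds are illustrative, not the app's real tuning:

```javascript
// Sketch of the auto-adapted cadence: if a transcription pass runs long,
// the next update is scheduled further out so the queue never grows.
// The 1.2x headroom and the 2 s / 15 s bounds are illustrative values.
function nextStepMs(lastPassMs, { minMs = 2000, maxMs = 15000 } = {}) {
  return Math.min(maxMs, Math.max(minMs, Math.round(lastPassMs * 1.2)));
}

nextStepMs(1000); // → 2000: a fast pass keeps the floor cadence
nextStepMs(6000); // → 7200: a slow pass backs off with 20% headroom
```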
This feature was implemented with Claude Code.
No local microphone? Use your phone as a wireless mic via WebRTC. Audio is end-to-end encrypted (ECDH P-256 + AES-GCM-256) — the server only relays encrypted data and never sees the plaintext audio.
- Click the Phone Mic button in the app
- A QR code appears — scan it with your phone
- Grant microphone permission on the phone
- Verify the short code that appears on both screens matches — read it aloud or compare visually. If the codes differ, click Codes differ – abort on either device. This step defends against a malicious signaling server that could otherwise swap encryption keys to MITM the supposedly end-to-end channel. The Confirm button is disabled for 3 seconds (with a visible countdown) so a reflexive Enter/Space press cannot auto-accept a tampered code without you actually reading it.
- Speak — encrypted audio streams to the computer in real time
- Click Stop on either device — the audio is transcribed normally
- Local network only: works out of the box with no extra config (STUN-only / direct P2P).
- Over the internet: requires a coturn TURN relay. A commented-out coturn service is included in `docker/docker-compose.yml`; uncomment it and set `TURN_SERVER`, `TURN_SECRET`, and `TURN_EXTERNAL_IP` in `docker/.env`. If you already run coturn (e.g. for WebSend or Nextcloud Talk), point to it and reuse the same `TURN_SECRET`.
See docker/env.example for all configuration options.
If HuggingFace is blocked or unreachable in your environment, you can serve
model weights directly from the container. Pick any host folder, populate
it with the ONNX files, bind-mount it into the container, and set
LOCAL_MODEL_PATH to the matching in-container path:
```sh
# 1. Populate any host folder with the ONNX files (flat layout):
hf download istupakov/parakeet-tdt-0.6b-v3-onnx \
  --local-dir /host/path/to/onnx-files
```

```yaml
# 2. In docker/docker-compose.yml, add a volume:
volumes:
  - /host/path/to/onnx-files:/models:ro
```

```sh
# 3. In docker/.env, set:
LOCAL_MODEL_PATH=/models
```

Caddy serves whatever is at `LOCAL_MODEL_PATH` under `/models/`. Setting `LOCAL_MODEL_PATH` automatically enables `VITE_LOCAL_MODEL_FALLBACK`. The container crashes at startup if `vocab.txt` is missing, so misconfigurations are caught early.
To troubleshoot the local-fallback path itself, set `VITE_FORCE_LOCAL_MODEL_FALLBACK=true`; the UI will skip HuggingFace entirely and load weights from `/models/` on the first try. This implies `VITE_LOCAL_MODEL_FALLBACK=true`.

The container runs as UID 1000. If your files end up unreadable to UID 1000, run `chmod -R a+rX /host/path/to/onnx-files` (or `chown -R 1000:1000 /host/path/to/onnx-files`).
Built with Claude Code.
Append ?debug=1 to any URL to load the in-page eruda
devtools — useful for inspecting console logs and network requests on a phone
where you cannot open desktop devtools. Eruda is vendored locally (served
from same-origin with SRI), so nothing is fetched from a CDN at runtime.
Examples:

- Main app: `https://your-host/?debug=1`
- Remote-mic page: `https://your-host/remote-mic.html?debug=1#ROOMID:SECRET` (the room info is in the hash fragment, so `?debug=1` goes before the `#`)
Without ?debug=1, no devtools surface is shipped to the user.
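The reason `?debug=1` must come before the `#` is that the flag lives in the query string, not the hash. A minimal sketch of such a gate (the helper name is illustrative):

```javascript
// Sketch of a ?debug=1 gate: only the query string is consulted, never the
// hash fragment, so "?debug=1" placed after "#" would be invisible to it.
function debugRequested(href) {
  return new URL(href).searchParams.get("debug") === "1";
}

debugRequested("https://your-host/?debug=1");                            // true
debugRequested("https://your-host/remote-mic.html?debug=1#ROOM:SECRET"); // true
debugRequested("https://your-host/remote-mic.html#ROOM:SECRET?debug=1"); // false
```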
AGPLv3 – See LICENSE file
- ysdede/parakeet.js – Original project this is forked from
- nvidia/parakeet-tdt-0.6b-v3 – The underlying ASR model by NVIDIA
- istupakov/parakeet-tdt-0.6b-v3-onnx – ONNX conversion of the model
- istupakov/onnx-asr – Python reference implementation
- ONNX Runtime Web – Makes browser inference possible
This fork is based on ysdede/parakeet.js – all the heavy lifting and original implementation credit goes there. This would not exist without their excellent work.
