Voice Chat

Agenties supports voice input and spoken responses so you can talk to your orchestrator hands-free. The system is built in four phases — browser speech recognition, Edge TTS neural voices, on-device Whisper transcription, and VAD conversational mode. Each phase has different prerequisites and privacy trade-offs.

Note: STT accuracy depends heavily on your microphone quality and ambient noise. Any speech recognition engine, including faster-whisper, can misinterpret words, especially short phrases, proper nouns, or commands spoken quietly. Always review transcripts before sending critical messages.

Voice phases overview

Phase 1 (Available): STT via Web Speech API
Uses the browser's built-in Speech Recognition API (available in the Chromium-based Tauri webview). No installation needed. Transcription happens in Google's cloud, so audio leaves your machine. Good enough for most casual use.

Phase 2 (Available): TTS via Edge TTS neural voices
Text-to-speech using high-quality Edge TTS neural voices (Spanish, English, French, German). Requires the edge-tts Python package installed in ~/.notebooklm-venv/. Falls back to the browser's built-in speechSynthesis API when edge-tts is not detected.

Phase 3 (Available): On-device Whisper (faster-whisper)
Local OpenAI Whisper transcription via faster-whisper running in ~/.notebooklm-venv/. Push-to-talk mode: hold the mic button, speak, release to transcribe. Fully private: no audio leaves your machine. Requires Python 3.8+ and ~140 MB disk space for the model.

Phase 4 (Available): VAD conversational mode
Voice Activity Detection for hands-free conversation. The app uses energy-level analysis (RMS) to detect when you start and stop speaking, then transcribes and submits turns automatically. Available as an MVP; microphone quality and ambient noise significantly affect reliability.

Prerequisites

Phases 2 and 3 both require Python and a shared virtual environment at ~/.notebooklm-venv/. Phase 1 (Web Speech API) has no prerequisites.

Python version

Python 3.8 or later is required. Python 3.10–3.12 is recommended for best compatibility with faster-whisper. Check your version with:

bash
python --version
# or
python3 --version
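
If you would rather script this check, the sketch below classifies an interpreter version against the ranges stated above (3.8 minimum, 3.10 to 3.12 recommended). It is illustrative only: the function name classify_python is hypothetical and is not part of Agenties.

```python
import sys

MINIMUM = (3, 8)                    # lowest supported version, per this page
RECOMMENDED = ((3, 10), (3, 12))    # inclusive range recommended for faster-whisper


def classify_python(version=None):
    """Return 'unsupported', 'supported', or 'recommended' for a (major, minor) pair.

    With no argument, the running interpreter is classified.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MINIMUM:
        return "unsupported"
    low, high = RECOMMENDED
    if low <= major_minor <= high:
        return "recommended"
    return "supported"


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {classify_python()}")
```

Run it with the same interpreter you plan to use for the virtual environment, since python and python3 can point to different installations.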

Virtual environment setup

Agenties expects both faster-whisper and edge-tts to be installed in the same virtual environment at ~/.notebooklm-venv/. If the venv already exists (e.g. from NotebookLM integration), add the packages to it. If not, create it first.

bash
# Step 1 — Create the virtual environment (skip if it already exists)
# (use python3 here if your system has no python command)
python -m venv ~/.notebooklm-venv

# Step 2 — Install both voice packages
# Linux / macOS:
~/.notebooklm-venv/bin/pip install faster-whisper edge-tts

# Windows (PowerShell):
~/.notebooklm-venv/Scripts/pip install faster-whisper edge-tts

Note: faster-whisper handles speech-to-text (Phases 3 and 4). edge-tts handles text-to-speech (Phase 2). You can install only what you need, but installing both at once is simpler.
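
The only difference between the two install commands is where the venv keeps its executables: bin/ on Linux and macOS, Scripts/ on Windows. The sketch below shows that mapping in Python. The helper names venv_executable and pip_install_command are hypothetical, and the .exe suffix on Windows is an assumption based on how Python's venv module normally lays out its files; nothing here is taken from the Agenties source.

```python
import os
from pathlib import Path

VENV_DIR = Path.home() / ".notebooklm-venv"  # location named on this page


def venv_executable(name, venv_dir=VENV_DIR, os_name=os.name):
    """Return the expected path of an executable inside the venv."""
    if os_name == "nt":  # Windows keeps executables in Scripts/
        return Path(venv_dir) / "Scripts" / f"{name}.exe"
    return Path(venv_dir) / "bin" / name  # Linux and macOS use bin/


def pip_install_command(packages, venv_dir=VENV_DIR, os_name=os.name):
    """Build (but do not run) the pip command that installs the given packages."""
    return [str(venv_executable("pip", venv_dir, os_name)), "install", *packages]
```

Passing the result of pip_install_command(["faster-whisper", "edge-tts"]) to subprocess.run would perform the same install as Step 2 above.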

Whisper model download

On first use, Agenties automatically downloads the base Whisper model (~140 MB) from Hugging Face and writes a server script to ~/.agenties/whisper_server.py. The app waits up to 30 seconds for the model to load — the first run may feel slow; subsequent starts reuse the cached model from disk.

Warning: The current implementation loads the base model on CPU with int8 quantization. There is no GPU acceleration and no option to change the model size from the UI. On low-spec machines, the first load can take up to 30 seconds; transcription of short phrases typically takes 1–3 seconds after the model is ready.

Language support — current limitation

The Whisper transcription path is currently configured for Spanish (language="es"). If your primary language is Spanish, this works as expected. For other languages, the Web Speech API (Phase 1) is the reliable fallback — it automatically uses your browser's configured locale.
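
For reference, the settings described in this section (base model, CPU, int8 quantization, Spanish) correspond roughly to the faster-whisper call sketched below. This is not the contents of whisper_server.py; the names WHISPER_SETTINGS and transcribe_file are hypothetical, and the sketch reloads the model on every call for simplicity, which a long-running server would avoid.

```python
# Settings as described on this page; the dictionary name is illustrative.
WHISPER_SETTINGS = {
    "model_size": "base",
    "device": "cpu",
    "compute_type": "int8",
    "language": "es",
}


def transcribe_file(audio_path, settings=WHISPER_SETTINGS):
    """Transcribe one audio file with faster-whisper and return the joined text."""
    # Imported lazily so this module loads even when faster-whisper is absent.
    from faster_whisper import WhisperModel

    model = WhisperModel(
        settings["model_size"],
        device=settings["device"],
        compute_type=settings["compute_type"],
    )
    # transcribe() returns a lazy generator of segments plus metadata;
    # the audio is only decoded as the generator is consumed.
    segments, _info = model.transcribe(audio_path, language=settings["language"])
    return " ".join(segment.text.strip() for segment in segments)
```

Run inside the venv, transcribe_file("clip.wav") would return the Spanish transcript of a local recording (clip.wav is a placeholder file name). Passing a different language code is possible in faster-whisper itself, but the app does not expose that setting in this build.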


Setup by phase

Phase 1 — Web Speech API (no setup)

No Python or installation required. Click the microphone icon in the chat input bar to start recording. The transcription appears in the input field. Release or click again to stop. This path sends audio to Google's speech API — see the Privacy section.

Phase 2 — Edge TTS neural voices

Once edge-tts is installed, the volume icon in the chat bar becomes active. Click it to enable spoken responses. Select your preferred voice from the dropdown that appears next to the volume icon. If edge-tts is not detected, Agenties falls back to the browser's built-in speechSynthesis voices automatically.

Available neural voices:

| Voice ID | Language | Label |
|---|---|---|
| es-ES-AlvaroNeural | Spanish (Spain) | Álvaro |
| es-ES-ElviraNeural | Spanish (Spain) | Elvira |
| es-MX-DaliaNeural | Spanish (Mexico) | Dalia |
| es-MX-JorgeNeural | Spanish (Mexico) | Jorge |
| en-US-AriaNeural | English (US) | Aria |
| en-US-JennyNeural | English (US) | Jenny |
| en-US-GuyNeural | English (US) | Guy |
| en-GB-SoniaNeural | English (UK) | Sonia |
| fr-FR-DeniseNeural | French | Denise |
| de-DE-KatjaNeural | German | Katja |

Responses are read aloud as they stream in. You can stop playback at any time by clicking the volume icon again.
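
To try one of these voices outside the app, the edge-tts package can be called directly from the venv. The sketch below is an illustration, not Agenties code: the names NEURAL_VOICES, voices_for, and synthesize are hypothetical, and it writes a complete audio file rather than streaming playback as the app does.

```python
# Voice IDs and labels copied from the table on this page.
NEURAL_VOICES = {
    "es-ES-AlvaroNeural": "Álvaro",
    "es-ES-ElviraNeural": "Elvira",
    "es-MX-DaliaNeural": "Dalia",
    "es-MX-JorgeNeural": "Jorge",
    "en-US-AriaNeural": "Aria",
    "en-US-JennyNeural": "Jenny",
    "en-US-GuyNeural": "Guy",
    "en-GB-SoniaNeural": "Sonia",
    "fr-FR-DeniseNeural": "Denise",
    "de-DE-KatjaNeural": "Katja",
}


def voices_for(language_prefix):
    """Return voice IDs whose locale starts with the prefix, e.g. 'es' or 'en-US'."""
    return [voice_id for voice_id in NEURAL_VOICES if voice_id.startswith(language_prefix)]


async def synthesize(text, voice_id, out_path):
    """Write `text` as spoken audio to `out_path` using edge-tts."""
    if voice_id not in NEURAL_VOICES:
        raise ValueError(f"unknown voice: {voice_id}")
    # Imported lazily so the voice table is usable without edge-tts installed.
    import edge_tts

    # The text is sent to Microsoft's TTS service; the audio comes back over the network.
    await edge_tts.Communicate(text, voice_id).save(out_path)
```

For example, asyncio.run(synthesize("Hola, mundo", "es-MX-DaliaNeural", "hola.mp3")) (after import asyncio) would save a short clip, where hola.mp3 is a placeholder file name. This requires a network connection, because synthesis happens on Microsoft's servers.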

Phase 3 — On-device Whisper

After installing faster-whisper (see Prerequisites above), Agenties will automatically detect the installation when you click the microphone button. If Whisper is available, the mic button switches to push-to-talk mode. If not, it falls back to the Web Speech API path.
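
How Agenties performs this detection is not documented here. One plausible approach, sketched below under that assumption, is to ask the venv's interpreter whether the faster_whisper module can be found and to pick the mic mode from the answer. The function names package_importable and choose_mic_mode, and the mode strings they return, are hypothetical.

```python
import subprocess
import sys


def package_importable(python_exe, module):
    """Return True if the interpreter at `python_exe` can locate `module`."""
    # find_spec() locates the module without importing it, so the check stays fast.
    code = (
        "import importlib.util, sys; "
        f"sys.exit(0 if importlib.util.find_spec({module!r}) else 1)"
    )
    try:
        result = subprocess.run(
            [str(python_exe), "-c", code],
            capture_output=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # interpreter missing or unresponsive: treat as unavailable
    return result.returncode == 0


def choose_mic_mode(whisper_available):
    """Mirror the documented fallback: push-to-talk with Whisper, else Web Speech API."""
    return "push-to-talk" if whisper_available else "web-speech-api"
```

You can use the same check to debug your setup: call package_importable with the path to the venv's Python executable and the module name "faster_whisper" (or "edge_tts" for Phase 2). A False result means the package is not visible to that interpreter.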

Phase 4 — VAD conversational mode

Click the circular mic icon (second mic button in the chat bar) to toggle conversational mode. Agenties will continuously monitor audio for speech activity and submit turns automatically when silence is detected. This mode works best with Phase 3 (Whisper) as the STT backend; it can also run on Phase 1 (Web Speech API).


Push-to-talk (Phase 3)

When Whisper is installed, the microphone button uses push-to-talk:

1. Press and hold the microphone button in the chat input bar
2. Speak your message clearly, close to the microphone
3. Release the button. Agenties stops recording and sends the audio to faster-whisper
4. Transcription takes 1–3 seconds depending on phrase length (longer on first run)
5. The transcription appears in the input field. Review it before sending

Tip: Always review the transcription before sending. faster-whisper can misinterpret short phrases, proper nouns, punctuation, and commands spoken softly or with background noise.

VAD conversational mode (Phase 4)

VAD mode removes the need to hold the mic button. Agenties continuously analyzes audio energy levels (RMS, root mean square) to detect when speech starts and when it ends, then automatically transcribes and submits each turn.

| Capability | Current behavior |
|---|---|
| Speech detection | Energy-level (RMS) analysis, not ML-based. Triggers when audio exceeds a fixed amplitude threshold. |
| Silence detection | After ~900 ms of silence following at least 650 ms of speech, the turn is submitted. |
| Transcription path | Uses faster-whisper (Phase 3) when installed; falls back to Web Speech API otherwise. |
| Minimum phrase length | Phrases shorter than ~650 ms are discarded to filter background noise. |
| Filler filtering | Common fillers ("eh", "um", "ah", "mmm", "ok") are filtered out to prevent accidental sends. |
| Interruptions | Speaking while a response is playing can interrupt and stop TTS playback. |

Warning: RMS-based VAD is sensitive to ambient noise. Loud environments, music playing nearby, or HVAC noise may trigger false detections. If VAD is unreliable in your environment, use push-to-talk (Phase 3) instead; it remains the most predictable input method.
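
To make the timing rules concrete, here is a minimal sketch of RMS-based turn detection using the documented 650 ms and 900 ms values. It is not the app's implementation. The threshold value, the frame format (lists of samples between -1.0 and 1.0), the whole-utterance filler check, and all function names are assumptions chosen for illustration.

```python
import math

SPEECH_THRESHOLD = 0.02   # hypothetical RMS threshold; the app's value is not documented
MIN_SPEECH_MS = 650       # shorter bursts are discarded as noise
END_SILENCE_MS = 900      # quiet time that closes a turn
FILLERS = {"eh", "um", "ah", "mmm", "ok"}


def rms(samples):
    """Root mean square of a sequence of samples in the range [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_turns(frames, frame_ms, threshold=SPEECH_THRESHOLD):
    """Return (start_ms, end_ms) for each turn that would be submitted."""
    turns = []
    start_ms = None   # when the current candidate turn began
    speech_ms = 0     # accumulated loud time in the candidate turn
    silence_ms = 0    # quiet time since the last loud frame
    for index, frame in enumerate(frames):
        now_ms = index * frame_ms
        if rms(frame) > threshold:
            if start_ms is None:
                start_ms = now_ms
            speech_ms += frame_ms
            silence_ms = 0
        elif start_ms is not None:
            silence_ms += frame_ms
            if silence_ms >= END_SILENCE_MS:
                if speech_ms >= MIN_SPEECH_MS:
                    # The turn ends where the trailing silence began.
                    turns.append((start_ms, now_ms + frame_ms - silence_ms))
                start_ms, speech_ms, silence_ms = None, 0, 0
    return turns


def is_filler_only(transcript):
    """Return True if the transcript contains nothing but filler words."""
    words = [w.strip(".,!?¡¿").lower() for w in transcript.split()]
    return all(w in FILLERS for w in words if w)
```

Because the threshold is a fixed amplitude, any steady noise louder than it counts as speech, which is why noisy rooms cause false turns. A turn that is still open when the audio ends is not returned by this sketch.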

Disabling voice / returning to normal mode

Voice features are controlled entirely from the chat input bar — there is no global voice toggle in Settings. To return to text-only mode:

1. Click the circular mic icon to turn off conversational / VAD mode (stops continuous listening)
2. Click the volume icon to disable TTS (stops spoken responses)
3. Avoid clicking the main mic button to prevent accidental push-to-talk triggers

Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| Mic button does not switch to push-to-talk | faster-whisper not found in ~/.notebooklm-venv/ | Run ~/.notebooklm-venv/bin/pip install faster-whisper (Linux/macOS) or ~/.notebooklm-venv/Scripts/pip install faster-whisper (Windows) and restart Agenties. |
| First transcription hangs for 20–30 seconds | Whisper model is downloading (~140 MB) or loading for the first time | Wait for the initial load. Subsequent uses will be faster. |
| Transcription comes out wrong or in Spanish unexpectedly | Whisper language is fixed to Spanish (es) in this build | Use the Web Speech API path (Phase 1) for non-Spanish input. |
| Edge TTS voice dropdown is missing | edge-tts not installed or not found in ~/.notebooklm-venv/ | Install it: pip install edge-tts inside the venv. The fallback is browser speechSynthesis. |
| VAD constantly triggers false turns | Background noise exceeds the RMS threshold | Move to a quieter environment, use a directional microphone, or switch to push-to-talk. |
| VAD never triggers (no turns submitted) | Microphone volume too low or wrong device selected | Check OS microphone settings and ensure the correct input device is selected. Speak louder or closer to the mic. |
| "python not found" or venv creation fails | Python is not installed or not on PATH | Install Python 3.10+ from python.org. On Windows, check "Add to PATH" during installation. |

Privacy

Voice privacy depends on which phase is active:

| Phase | Audio leaves device? | Notes |
|---|---|---|
| Phase 1: Web Speech API | Yes, sent to Google | Same as Chrome's built-in dictation feature |
| Phase 2: Edge TTS (output) | Text is sent to Microsoft TTS API | Audio playback is local; only the text of responses is transmitted |
| Phase 3: Whisper | No, fully on-device | Audio processed locally by faster-whisper; nothing leaves your machine |
| Phase 4: VAD | Depends on STT path | On-device when Whisper is configured; otherwise follows Phase 1 (Google cloud) |

Tip: For full on-device privacy, use Phase 3 (Whisper) for input and choose a browser speechSynthesis voice (not Edge TTS) for output. Both are available without sending any audio to external services.