Voice Chat
Agenties supports voice input and spoken responses so you can talk to your orchestrator hands-free. The system is built in four phases — browser speech recognition, Edge TTS neural voices, on-device Whisper transcription, and VAD conversational mode. Each phase has different prerequisites and privacy trade-offs.
Voice phases overview
Prerequisites
Phases 2 and 3 both require Python and a shared virtual environment at ~/.notebooklm-venv/. Phase 1 (Web Speech API) has no prerequisites.
Python version
Python 3.8 or later is required. Python 3.10–3.12 is recommended for best compatibility with faster-whisper. Check your version with:
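Running `python3 --version` (or `python --version` on Windows) in a terminal prints the installed version. As a minimal sketch, the same check can be scripted. `classify_python` is a hypothetical helper, not part of Agenties, and its thresholds simply mirror the versions stated above.

```python
import sys

MINIMUM = (3, 8)                   # lowest supported version stated above
RECOMMENDED = ((3, 10), (3, 12))   # recommended range stated above


def classify_python(version=None):
    """Return 'unsupported', 'supported', or 'recommended' for a version tuple."""
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MINIMUM:
        return "unsupported"
    low, high = RECOMMENDED
    if low <= major_minor <= high:
        return "recommended"
    return "supported"


if __name__ == "__main__":
    # Print the running interpreter's version and how it rates.
    print(sys.version.split()[0], "->", classify_python())
```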
Virtual environment setup
Agenties expects both faster-whisper and edge-tts to be installed in the same virtual environment at ~/.notebooklm-venv/. If the venv already exists (e.g. from NotebookLM integration), add the packages to it. If not, create it first.
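The setup described above can be sketched with the standard-library `venv` module and `pip`. This is an illustration, not the installer Agenties ships: `venv_python` and `ensure_voice_packages` are hypothetical names, and the only details taken from this page are the venv location and the two package names.

```python
import subprocess
import sys
from pathlib import Path

VENV_DIR = Path.home() / ".notebooklm-venv"   # location Agenties expects
PACKAGES = ["faster-whisper", "edge-tts"]     # STT and TTS packages


def venv_python(venv_dir, windows=None):
    """Path to the interpreter inside a venv (Scripts on Windows, bin elsewhere)."""
    if windows is None:
        windows = sys.platform.startswith("win")
    parts = ("Scripts", "python.exe") if windows else ("bin", "python")
    return Path(venv_dir).joinpath(*parts)


def ensure_voice_packages(venv_dir=VENV_DIR):
    """Create the venv if it is missing, then install both packages into it."""
    python = venv_python(venv_dir)
    if not python.exists():
        subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
    subprocess.run([str(python), "-m", "pip", "install", *PACKAGES], check=True)


if __name__ == "__main__":
    ensure_voice_packages()
```

Invoking pip as `python -m pip` through the venv's own interpreter avoids installing into the system Python by accident.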
faster-whisper handles speech-to-text (Phases 3 and 4). edge-tts handles text-to-speech (Phase 2). You can install only what you need, but installing both at once is simpler.
Whisper model download
On first use, Agenties automatically downloads the base Whisper model (~140 MB) from Hugging Face and writes a server script to ~/.agenties/whisper_server.py. The app waits up to 30 seconds for the model to load — the first run may feel slow; subsequent starts reuse the cached model from disk.
Whisper runs the base model on CPU with int8 quantization. There is no GPU acceleration and no option to change the model size from the UI. On low-spec machines, the first load can take up to 30 seconds; transcription of short phrases typically takes 1–3 seconds after the model is ready.
Language support: current limitation
The Whisper transcription path is currently configured for Spanish (language="es"). If your primary language is Spanish, this works as expected. For other languages, the Web Speech API (Phase 1) is the reliable fallback — it automatically uses your browser's configured locale.
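For reference, the configuration described above (base model, CPU, int8, Spanish) maps onto the faster-whisper API roughly as follows. This is a standalone sketch, not the contents of `~/.agenties/whisper_server.py`. `WHISPER_SETTINGS` and `transcribe` are illustrative names, and running it requires faster-whisper to be installed.

```python
# Settings as described on this page; the dict itself is an illustrative construct.
WHISPER_SETTINGS = {
    "model_size": "base",
    "device": "cpu",
    "compute_type": "int8",
    "language": "es",
}


def transcribe(audio_path, settings=WHISPER_SETTINGS):
    """Transcribe one audio file with faster-whisper and return the joined text."""
    from faster_whisper import WhisperModel  # lazy import; needs the package

    model = WhisperModel(
        settings["model_size"],
        device=settings["device"],
        compute_type=settings["compute_type"],
    )
    # transcribe() returns a lazy generator; iterating it performs the decoding.
    segments, _info = model.transcribe(audio_path, language=settings["language"])
    return " ".join(segment.text.strip() for segment in segments)
```

In a script of your own, passing `{**WHISPER_SETTINGS, "language": "en"}` would transcribe English audio. This does not change the language Agenties itself uses.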
Setup by phase
Phase 1 — Web Speech API (no setup)
No Python or installation required. Click the microphone icon in the chat input bar to start recording. The transcription appears in the input field. Release or click again to stop. This path sends audio to Google's speech API — see the Privacy section.
Phase 2 — Edge TTS neural voices
Once edge-tts is installed, the volume icon in the chat bar becomes active. Click it to enable spoken responses, then select your preferred voice from the dropdown that appears next to the volume icon. If edge-tts is not detected, Agenties automatically falls back to the browser's built-in speechSynthesis voices.
Available neural voices:
| Voice ID | Language | Label |
|---|---|---|
| es-ES-AlvaroNeural | Spanish (Spain) | Álvaro |
| es-ES-ElviraNeural | Spanish (Spain) | Elvira |
| es-MX-DaliaNeural | Spanish (Mexico) | Dalia |
| es-MX-JorgeNeural | Spanish (Mexico) | Jorge |
| en-US-AriaNeural | English (US) | Aria |
| en-US-JennyNeural | English (US) | Jenny |
| en-US-GuyNeural | English (US) | Guy |
| en-GB-SoniaNeural | English (UK) | Sonia |
| fr-FR-DeniseNeural | French | Denise |
| de-DE-KatjaNeural | German | Katja |
Responses are read aloud as they stream in. You can stop playback at any time by clicking the volume icon again.
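To try one of these voices outside the app, the edge-tts package can be called directly. The sketch below is an assumption-laden illustration: `VOICES`, `voices_for`, and `speak_to_file` are hypothetical names, and it writes a complete MP3 file rather than streaming audio the way Agenties does.

```python
import asyncio

# Voice IDs copied from the table above, mapped to their labels.
VOICES = {
    "es-ES-AlvaroNeural": "Álvaro",
    "es-ES-ElviraNeural": "Elvira",
    "es-MX-DaliaNeural": "Dalia",
    "es-MX-JorgeNeural": "Jorge",
    "en-US-AriaNeural": "Aria",
    "en-US-JennyNeural": "Jenny",
    "en-US-GuyNeural": "Guy",
    "en-GB-SoniaNeural": "Sonia",
    "fr-FR-DeniseNeural": "Denise",
    "de-DE-KatjaNeural": "Katja",
}


def voices_for(locale_prefix):
    """Return the voice IDs whose ID starts with a prefix such as 'es-MX'."""
    return sorted(v for v in VOICES if v.startswith(locale_prefix))


async def speak_to_file(text, voice="en-US-AriaNeural", out_path="reply.mp3"):
    """Synthesize text with Edge TTS and save it as an MP3 (needs network access)."""
    import edge_tts  # lazy import; needs the edge-tts package

    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    await edge_tts.Communicate(text, voice).save(out_path)


if __name__ == "__main__":
    asyncio.run(speak_to_file("Hello from Agenties."))
```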
Phase 3 — On-device Whisper
After installing faster-whisper (see Prerequisites above), Agenties will automatically detect the installation when you click the microphone button. If Whisper is available, the mic button switches to push-to-talk mode. If not, it falls back to the Web Speech API path.
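The detect-then-fall-back behavior can be pictured as below. How Agenties actually performs the check is not documented here, so this is only one plausible approach: ask the venv's interpreter whether it can import `faster_whisper`. The function names and the returned labels are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def whisper_available(venv_dir):
    """True if the venv's interpreter exists and can import faster_whisper."""
    parts = ("Scripts", "python.exe") if sys.platform.startswith("win") else ("bin", "python")
    python = Path(venv_dir).joinpath(*parts)
    if not python.exists():
        return False
    result = subprocess.run(
        [str(python), "-c", "import faster_whisper"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def choose_stt_backend(venv_dir=Path.home() / ".notebooklm-venv"):
    """Pick push-to-talk Whisper when available, else the Web Speech API path."""
    return "whisper-push-to-talk" if whisper_available(venv_dir) else "web-speech-api"
```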
Phase 4 — VAD conversational mode
Click the circular mic icon (the second mic button in the chat bar) to toggle conversational mode. Agenties continuously monitors audio for speech activity and submits each turn automatically when silence is detected. This mode works best with Phase 3 (Whisper) as the STT backend; if Whisper is not available, it uses Phase 1 (Web Speech API) instead.
Push-to-talk (Phase 3)
When Whisper is installed, the microphone button uses push-to-talk:
VAD conversational mode (Phase 4)
VAD mode removes the need to hold the mic button. Agenties continuously analyzes audio energy levels (RMS — Root Mean Square) to detect speech start and silence end, then automatically transcribes and submits each turn.
| Capability | Current behavior |
|---|---|
| Speech detection | Energy-level (RMS) analysis — not ML-based. Triggers when audio exceeds a fixed amplitude threshold. |
| Silence detection | After ~900 ms of silence following at least 650 ms of speech, the turn is submitted. |
| Transcription path | Uses faster-whisper (Phase 3) when installed; falls back to Web Speech API otherwise. |
| Minimum phrase length | Phrases shorter than ~650 ms are discarded to filter background noise. |
| Filler filtering | Common fillers ("eh", "um", "ah", "mmm", "ok") are filtered out to prevent accidental sends. |
| Interruptions | Speaking while a response is playing can interrupt and stop TTS playback. |
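The turn-detection rules in the table can be sketched as a small state machine over fixed-size audio frames. The 650 ms and 900 ms values and the filler list come from the table; the 50 ms frame size, the `0.02` threshold, and all function names are assumptions made for illustration, since the actual threshold is not documented.

```python
import math

FRAME_MS = 50            # assumed frame length
MIN_SPEECH_MS = 650      # shorter phrases are discarded
SILENCE_MS = 900         # silence that closes a turn
RMS_THRESHOLD = 0.02     # assumed value; samples are floats in [-1.0, 1.0]
FILLERS = {"eh", "um", "ah", "mmm", "ok"}


def rms(samples):
    """Root mean square of one frame of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_turns(frames, threshold=RMS_THRESHOLD):
    """Return (start_ms, end_ms) for each turn that would be submitted."""
    turns, start, speech_ms, silence_ms = [], None, 0, 0
    for i, frame in enumerate(frames):
        now = i * FRAME_MS
        loud = rms(frame) > threshold
        if start is None:
            if loud:  # speech begins
                start, speech_ms, silence_ms = now, FRAME_MS, 0
            continue
        if loud:
            speech_ms += FRAME_MS
            silence_ms = 0
        else:
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_MS:
                if speech_ms >= MIN_SPEECH_MS:  # drop short noise bursts
                    turns.append((start, now + FRAME_MS - silence_ms))
                start = None
    return turns


def is_filler(transcript):
    """True when the transcript consists only of filler words."""
    words = [w.strip(".,!?¡¿").lower() for w in transcript.split()]
    words = [w for w in words if w]
    return bool(words) and all(w in FILLERS for w in words)
```

Because detection relies on energy alone, any sustained sound above the threshold counts as speech, which is why noisy rooms can produce false turns.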
Disabling voice / returning to normal mode
Voice features are controlled entirely from the chat input bar — there is no global voice toggle in Settings. To return to text-only mode:
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Mic button does not switch to push-to-talk | faster-whisper not found in ~/.notebooklm-venv/ | Run ~/.notebooklm-venv/bin/pip install faster-whisper (Linux/macOS) or ~/.notebooklm-venv/Scripts/pip install faster-whisper (Windows) and restart Agenties. |
| First transcription hangs for 20–30 seconds | Whisper model is downloading (~140 MB) or loading for the first time | Wait for the initial load. Subsequent uses will be faster. |
| Transcription comes out wrong or in Spanish unexpectedly | Whisper language is fixed to Spanish (es) in this build | Use the Web Speech API path (Phase 1) for non-Spanish input. |
| Edge TTS voice dropdown is missing | edge-tts not installed or not found in ~/.notebooklm-venv/ | Install it: pip install edge-tts inside the venv. The fallback is browser speechSynthesis. |
| VAD constantly triggers false turns | Background noise exceeds the RMS threshold | Move to a quieter environment, use a directional microphone, or switch to push-to-talk. |
| VAD never triggers (no turns submitted) | Microphone volume too low or wrong device selected | Check OS microphone settings and ensure the correct input device is selected. Speak louder or closer to the mic. |
| "python not found" or venv creation fails | Python is not installed or not on PATH | Install Python 3.10+ from python.org. On Windows, check "Add to PATH" during installation. |
Privacy
Voice privacy depends on which phase is active:
| Phase | Audio leaves device? | Notes |
|---|---|---|
| Phase 1 — Web Speech API | Yes — sent to Google | Same as Chrome's built-in dictation feature |
| Phase 2 — Edge TTS (output) | Text is sent to Microsoft TTS API | Audio playback is local; only the text of responses is transmitted |
| Phase 3 — Whisper | No — fully on-device | Audio processed locally by faster-whisper; nothing leaves your machine |
| Phase 4 — VAD | Depends on STT path | On-device when Whisper is configured; otherwise follows Phase 1 (Google cloud) |
For fully local operation, use on-device Whisper (Phase 3) for input and the browser's built-in speechSynthesis voice (not Edge TTS) for output. Both are available without sending any audio to external services.