Voice Chat

Agenties supports voice input and spoken responses so you can talk to your orchestrator hands-free. The system is built in four phases — browser speech recognition, Edge TTS neural voices, on-device Whisper transcription, and VAD conversational mode. Each phase has different prerequisites and privacy trade-offs.

Note: STT accuracy depends heavily on your microphone quality and ambient noise. Any speech recognition engine, including faster-whisper, can misinterpret words, especially short phrases, proper nouns, or commands spoken quietly. Always review transcripts before sending critical messages.

Voice phases overview

Phase 1

Available

STT via Web Speech API

Uses the browser's built-in Speech Recognition API (available in the Chromium-based Tauri webview). No installation needed. Transcription happens in Google's cloud, so audio leaves your machine. Good enough for most casual use.

Phase 2

Available

TTS via Edge TTS neural voices

Text-to-speech using high-quality Edge TTS neural voices (Spanish, English, French, German). Requires the edge-tts Python package installed in ~/.notebooklm-venv/. Falls back to the browser's built-in speechSynthesis API when edge-tts is not detected.

Phase 3

Available

On-device Whisper (faster-whisper)

Local OpenAI Whisper transcription via faster-whisper running in ~/.notebooklm-venv/. Push-to-talk mode: hold the mic button, speak, release to transcribe. Fully private — no audio leaves your machine. Requires Python 3.8+ and ~140 MB disk space for the model.

Phase 4

Available

VAD conversational mode

Voice Activity Detection for hands-free conversation. The app uses energy-level analysis (RMS) to detect when you start and stop speaking, then transcribes and submits turns automatically. Available as an MVP — microphone quality and ambient noise significantly affect reliability.

Prerequisites

Phases 2 and 3 both require Python and a shared virtual environment at ~/.notebooklm-venv/. Phase 1 (Web Speech API) has no prerequisites.

Python version

Python 3.8 or later is required. Python 3.10–3.12 is recommended for best compatibility with faster-whisper. Check your version with:

bash

python --version
# or
python3 --version
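The same check can be scripted. The following is a minimal sketch: the function name and the three labels are illustrative, and only the 3.8 minimum and the 3.10–3.12 recommendation come from the text above.

```python
import sys

MINIMUM = (3, 8)                  # lowest version the voice features accept
RECOMMENDED = ((3, 10), (3, 12))  # range suggested for faster-whisper


def classify_python(version=None):
    """Return 'unsupported', 'supported', or 'recommended' for a (major, minor) pair."""
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MINIMUM:
        return "unsupported"
    low, high = RECOMMENDED
    if low <= major_minor <= high:
        return "recommended"
    return "supported"


if __name__ == "__main__":
    # Report the interpreter that is running this script.
    print(sys.version.split()[0], "->", classify_python())
```

Run it with the interpreter you plan to use for the virtual environment, since `python` and `python3` may point to different installations.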

Virtual environment setup

Agenties expects both faster-whisper and edge-tts to be installed in the same virtual environment at ~/.notebooklm-venv/. If the venv already exists (e.g. from NotebookLM integration), add the packages to it. If not, create it first.

bash

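# Step 1: Create the virtual environment (skip if it already exists)
# Linux / macOS (use python3 if python is not on your PATH):
python -m venv ~/.notebooklm-venv
# Windows (PowerShell): use $HOME, because PowerShell typically passes ~ unexpanded to native programs
python -m venv "$HOME\.notebooklm-venv"

# Step 2: Install both voice packages
# Linux / macOS:
~/.notebooklm-venv/bin/pip install faster-whisper edge-tts

# Windows (PowerShell):
~/.notebooklm-venv/Scripts/pip install faster-whisper edge-tts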

Note: faster-whisper handles speech-to-text (Phases 3 and 4). edge-tts handles text-to-speech (Phase 2). You can install only what you need, but installing both at once is simpler.

Whisper model download

On first use, Agenties automatically downloads the base Whisper model (~140 MB) from Hugging Face and writes a server script to ~/.agenties/whisper_server.py. The app waits up to 30 seconds for the model to load — the first run may feel slow; subsequent starts reuse the cached model from disk.

Warning: The current implementation loads the base model on CPU with int8 quantization. There is no GPU acceleration and no option to change the model size from the UI. On low-spec machines, the first load can take up to 30 seconds; transcription of short phrases typically takes 1–3 seconds after the model is ready.

Language support — current limitation

The Whisper transcription path is currently configured for Spanish (language="es"). If your primary language is Spanish, this works as expected. For other languages, the Web Speech API (Phase 1) is the reliable fallback — it automatically uses your browser's configured locale.
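To see what the base model, CPU, int8, and fixed-Spanish settings look like in faster-whisper terms, here is a minimal sketch. It is not the contents of ~/.agenties/whisper_server.py, which this page does not show. The helper names are illustrative, while `WhisperModel` and `transcribe` are the library's own entry points.

```python
def load_model():
    """Load the base model on CPU with int8 quantization, as described above."""
    # Imported lazily so the helpers below work without the package installed.
    from faster_whisper import WhisperModel

    return WhisperModel("base", device="cpu", compute_type="int8")


def join_segments(segments):
    """Concatenate segment texts into a single transcript string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe_file(model, audio_path, language="es"):
    """Transcribe one audio file with a fixed language, mirroring the current build."""
    # transcribe() returns a lazy iterable of segments plus an info object.
    segments, _info = model.transcribe(audio_path, language=language)
    return join_segments(segments)
```

From the venv interpreter, `transcribe_file(load_model(), "clip.wav")` would transcribe a recording, where `clip.wav` is a placeholder file name. Passing `language="en"` is how the library itself selects another language; the app does not currently expose that setting.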

Setup by phase

Phase 1 — Web Speech API (no setup)

No Python or installation required. Click the microphone icon in the chat input bar to start recording. The transcription appears in the input field. Release or click again to stop. This path sends audio to Google's speech API — see the Privacy section.

Phase 2 — Edge TTS neural voices

Once edge-tts is installed, the volume icon in the chat bar becomes active. Click it to enable spoken responses. Select your preferred voice from the dropdown that appears next to the volume icon. If edge-tts is not detected, Agenties falls back to the browser's built-in speechSynthesis voices automatically.

Available neural voices:

Voice ID	Language	Label
es-ES-AlvaroNeural	Spanish (Spain)	Álvaro
es-ES-ElviraNeural	Spanish (Spain)	Elvira
es-MX-DaliaNeural	Spanish (Mexico)	Dalia
es-MX-JorgeNeural	Spanish (Mexico)	Jorge
en-US-AriaNeural	English (US)	Aria
en-US-JennyNeural	English (US)	Jenny
en-US-GuyNeural	English (US)	Guy
en-GB-SoniaNeural	English (UK)	Sonia
fr-FR-DeniseNeural	French	Denise
de-DE-KatjaNeural	German	Katja

Responses are read aloud as they stream in. You can stop playback at any time by clicking the volume icon again.
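To test a voice outside the app, the edge-tts package can be called directly. The sketch below is an illustration only: the `VOICES` mapping and helper names are assumptions built from the table above, and it writes a file rather than streaming playback the way the app does.

```python
import asyncio

# Voice IDs copied from the table above, keyed by two-letter language code.
VOICES = {
    "es": ["es-ES-AlvaroNeural", "es-ES-ElviraNeural",
           "es-MX-DaliaNeural", "es-MX-JorgeNeural"],
    "en": ["en-US-AriaNeural", "en-US-JennyNeural",
           "en-US-GuyNeural", "en-GB-SoniaNeural"],
    "fr": ["fr-FR-DeniseNeural"],
    "de": ["de-DE-KatjaNeural"],
}


def default_voice(language):
    """Return the first listed voice for a language code, or None if unlisted."""
    voices = VOICES.get(language.lower()[:2])
    return voices[0] if voices else None


async def synthesize(text, voice, out_path):
    """Write spoken audio for `text` to `out_path` as MP3 using edge-tts."""
    import edge_tts  # lazy import: requires edge-tts in the active environment

    await edge_tts.Communicate(text, voice).save(out_path)


def speak_to_file(text, language, out_path):
    """Blocking convenience wrapper around synthesize()."""
    voice = default_voice(language)
    if voice is None:
        raise ValueError(f"no neural voice listed for {language!r}")
    asyncio.run(synthesize(text, voice, out_path))
```

For example, `speak_to_file("Hola, mundo", "es", "reply.mp3")` would produce an MP3 with the Álvaro voice; the output file name is a placeholder. The call needs network access, because the response text is sent to the TTS service.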

Phase 3 — On-device Whisper

After installing faster-whisper (see Prerequisites above), Agenties will automatically detect the installation when you click the microphone button. If Whisper is available, the mic button switches to push-to-talk mode. If not, it falls back to the Web Speech API path.
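The detect-then-fall-back behavior can be approximated outside the app. This page does not say how Agenties performs the check, so the sketch below is an assumption: it asks the venv interpreter whether the `faster_whisper` module imports, and the backend labels are illustrative.

```python
import os
import subprocess


def venv_python(venv_dir="~/.notebooklm-venv"):
    """Return the interpreter path inside the venv for the current platform."""
    parts = ("Scripts", "python.exe") if os.name == "nt" else ("bin", "python")
    return os.path.join(os.path.expanduser(venv_dir), *parts)


def can_import(python_exe, module):
    """True if `python_exe` exists and can import `module`."""
    if not os.path.isfile(python_exe):
        return False
    try:
        result = subprocess.run(
            [python_exe, "-c", f"import {module}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def choose_stt_backend(python_exe=None):
    """Pick Whisper push-to-talk when faster_whisper imports, else Web Speech."""
    python_exe = python_exe or venv_python()
    return "whisper" if can_import(python_exe, "faster_whisper") else "web-speech"
```

If `choose_stt_backend()` returns `"web-speech"` after you installed the package, the venv path or the interpreter used for installation is the first thing to check.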

Phase 4 — VAD conversational mode

Click the circular mic icon (second mic button in the chat bar) to toggle conversational mode. Agenties will continuously monitor audio for speech activity and submit turns automatically when silence is detected. This mode works best with Phase 3 (Whisper) as the STT backend; without Whisper, it uses Phase 1 (Web Speech API) instead.

Push-to-talk (Phase 3)

When Whisper is installed, the microphone button uses push-to-talk:

1. Press and hold the microphone button in the chat input bar

2. Speak your message clearly, close to the microphone

3. Release the button. Agenties stops recording and sends the audio to faster-whisper

4. Transcription takes 1–3 seconds depending on phrase length (longer on first run)

5. The transcription appears in the input field. Review it before sending

Tip: Always review the transcription before sending. faster-whisper can misinterpret short phrases, proper nouns, punctuation, and commands spoken softly or with background noise.

VAD conversational mode (Phase 4)

VAD mode removes the need to hold the mic button. Agenties continuously analyzes audio energy levels (RMS, Root Mean Square) to detect when speech starts and when it gives way to silence, then automatically transcribes and submits each turn.
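For readers unfamiliar with RMS, the measure is the square root of the mean of the squared samples in a short audio frame. The sketch below shows the calculation and a fixed-threshold decision. The threshold of 0.02 and the float sample range are illustrative assumptions, not values taken from the app.

```python
import math


def frame_rms(samples):
    """RMS of one audio frame; samples are floats in the range [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples, threshold=0.02):
    """Fixed-threshold decision; 0.02 is an illustrative value, not the app's."""
    return frame_rms(samples) > threshold
```

A full-scale sine wave has an RMS of about 0.707, while digital silence is 0.0. Steady background noise sits somewhere in between, which is why a fixed threshold can misfire in loud rooms.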

Capability	Current behavior
Speech detection	Energy-level (RMS) analysis — not ML-based. Triggers when audio exceeds a fixed amplitude threshold.
Silence detection	After ~900 ms of silence following at least 650 ms of speech, the turn is submitted.
Transcription path	Uses faster-whisper (Phase 3) when installed; falls back to Web Speech API otherwise.
Minimum phrase length	Phrases shorter than ~650 ms are discarded to filter background noise.
Filler filtering	Common fillers ("eh", "um", "ah", "mmm", "ok") are filtered out to prevent accidental sends.
Interruptions	Speaking while a response is playing can interrupt and stop TTS playback.
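The timing rules in the table can be read as a small state machine. The sketch below is one interpretation, not the app's code: it takes one RMS value per frame, counts only voiced frames toward the 650 ms minimum, and closes a turn after 900 ms of continuous silence. The frame length, threshold, and function names are assumptions.

```python
import re

MIN_SPEECH_MS = 650   # shorter bursts are discarded as noise
END_SILENCE_MS = 900  # continuous silence that closes a turn
FILLERS = {"eh", "um", "ah", "mmm", "ok"}


def segment_turns(rms_values, frame_ms=50, threshold=0.02):
    """Return (start, end) frame index pairs for turns worth transcribing.

    `end` is exclusive and points just past the last voiced frame.
    """
    turns = []
    start = None      # index of the first voiced frame of the open turn
    last_voiced = 0   # index of the most recent voiced frame
    speech_ms = 0
    silence_ms = 0
    for i, rms in enumerate(rms_values):
        voiced = rms > threshold
        if start is None:
            if voiced:  # a voiced frame opens a new turn
                start, last_voiced = i, i
                speech_ms, silence_ms = frame_ms, 0
            continue
        if voiced:
            last_voiced = i
            speech_ms += frame_ms
            silence_ms = 0  # a short pause does not end the turn
        else:
            silence_ms += frame_ms
            if silence_ms >= END_SILENCE_MS:
                if speech_ms >= MIN_SPEECH_MS:
                    turns.append((start, last_voiced + 1))
                start = None  # close the turn whether kept or discarded
    return turns


def is_filler(transcript):
    """True when the transcript is empty or consists only of filler words."""
    words = re.findall(r"\w+", transcript.lower())
    return all(word in FILLERS for word in words)
```

In this reading, a turn is transcribed only if `segment_turns` emits it, and submitted only if `is_filler` returns False for the resulting text.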

Warning: RMS-based VAD is sensitive to ambient noise. Loud environments, music playing nearby, or HVAC noise may trigger false detections. If VAD is unreliable in your environment, use push-to-talk (Phase 3) instead, as it remains the most predictable input method.

Disabling voice / returning to normal mode

Voice features are controlled entirely from the chat input bar — there is no global voice toggle in Settings. To return to text-only mode:

1. Click the circular mic icon to turn off conversational / VAD mode (stops continuous listening)

2. Click the volume icon to disable TTS (stops spoken responses)

3. Avoid clicking the main mic button to prevent accidental push-to-talk triggers

Troubleshooting

Symptom	Likely cause	Fix
Mic button does not switch to push-to-talk	faster-whisper not found in ~/.notebooklm-venv/	Run `~/.notebooklm-venv/bin/pip install faster-whisper` (Linux/macOS) or `~/.notebooklm-venv/Scripts/pip install faster-whisper` (Windows) and restart Agenties.
First transcription hangs for 20–30 seconds	Whisper model is downloading (~140 MB) or loading for the first time	Wait for the initial load. Subsequent uses will be faster.
Transcription comes out wrong or in Spanish unexpectedly	Whisper language is fixed to Spanish (es) in this build	Use the Web Speech API path (Phase 1) for non-Spanish input.
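Edge TTS voice dropdown is missing	edge-tts not installed or not found in ~/.notebooklm-venv/	Run `~/.notebooklm-venv/bin/pip install edge-tts` (Linux/macOS) or `~/.notebooklm-venv/Scripts/pip install edge-tts` (Windows). Until edge-tts is detected, output falls back to browser speechSynthesis.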
VAD constantly triggers false turns	Background noise exceeds the RMS threshold	Move to a quieter environment, use a directional microphone, or switch to push-to-talk.
VAD never triggers (no turns submitted)	Microphone volume too low or wrong device selected	Check OS microphone settings and ensure the correct input device is selected. Speak louder or closer to the mic.
"python not found" or venv creation fails	Python is not installed or not on PATH	Install Python 3.10+ from `python.org`. On Windows, check "Add to PATH" during installation.

Privacy

Voice privacy depends on which phase is active:

Phase	Audio leaves device?	Notes
Phase 1 — Web Speech API	Yes — sent to Google	Same as Chrome's built-in dictation feature
Phase 2 — Edge TTS (output)	Text is sent to Microsoft TTS API	Audio playback is local; only the text of responses is transmitted
Phase 3 — Whisper	No — fully on-device	Audio processed locally by faster-whisper; nothing leaves your machine
Phase 4 — VAD	Depends on STT path	On-device when Whisper is configured; otherwise follows Phase 1 (Google cloud)
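The table can also be expressed as a lookup, which is handy when deciding on a configuration. The backend labels below are illustrative names rather than settings exposed by the app, and the entries restate the table.

```python
# What each speech-to-text path sends off the machine (None means nothing).
STT_EGRESS = {
    "web-speech": "microphone audio sent to Google",
    "whisper": None,  # processed locally by faster-whisper
}

# What each text-to-speech path sends off the machine.
TTS_EGRESS = {
    "edge-tts": "response text sent to Microsoft",
    "speech-synthesis": None,  # browser voices, treated as local on this page
    "off": None,
}


def data_leaving_device(stt_backend, tts_backend):
    """List what leaves the machine for a given STT/TTS combination."""
    items = [STT_EGRESS[stt_backend], TTS_EGRESS[tts_backend]]
    return [item for item in items if item is not None]
```

For instance, `data_leaving_device("whisper", "speech-synthesis")` returns an empty list, which corresponds to the fully on-device setup.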

Tip: For full on-device privacy, use Phase 3 (Whisper) for input and choose a browser speechSynthesis voice (not Edge TTS) for output. Both are available without sending any audio to external services.
