Hermes Agent Voice Mode: Talk to Your AI Instead of Typing
A complete guide to Hermes Agent voice features, microphone input in the CLI, spoken replies on Telegram and Discord, and live voice conversations in Discord voice channels.

Typing Is Not the Only Way to Use AI
The dominant interface for AI tools in 2026 is still text. You type a message, you read a response. This works well for many tasks. But there are situations where voice is simply better:
- You are walking and want to think through a problem out loud
- You are cooking and want to ask about a recipe substitution
- You are driving and want your morning briefing read aloud
- You are in a Discord voice channel and want the agent to participate in the conversation
Hermes Agent has voice mode built in across three surfaces: the CLI, Telegram, and Discord. This is not a bolted-on text-to-speech wrapper. It is full voice interaction: you speak, and the agent listens, transcribes, processes, and responds with spoken audio.
Here is how each voice feature works, what it takes to set up, and what it is actually useful for.
Voice Mode Overview
Hermes supports three distinct voice interaction patterns:
| Feature | Where It Works | What It Does |
|---|---|---|
| Interactive Voice | CLI | Press Ctrl+B to record. Agent transcribes, processes, and displays the response. |
| Auto Voice Reply | Telegram, Discord | Agent sends spoken audio alongside text responses. Send a voice memo, get a voice reply. |
| Voice Channel | Discord | Bot joins a voice channel, listens to users speaking, and speaks replies back in real time. |
Each mode serves a different use case. Let's break them down.
CLI Voice Mode: Talk in the Terminal
The simplest voice feature. Inside a Hermes CLI session, press Ctrl+B to start recording. Speak your message. Press Ctrl+B again (or wait for silence detection) to stop. Hermes transcribes your speech, processes it as a normal message, and responds.
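To make the "wait for silence detection" step concrete, here is a minimal sketch of one common way to stop a recording automatically: measure the loudness (RMS) of each audio frame and stop once several quiet frames arrive in a row. This is an illustration only, not Hermes's actual implementation. The function names, the 0.02 threshold, and the three-frame window are all assumptions, and in a real setup the frames would come from the microphone rather than a list.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def record_until_silence(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.02,      # assumed loudness cutoff, not a Hermes default
    max_silent_frames: int = 3,   # assumed number of quiet frames that ends a recording
) -> List[Sequence[float]]:
    """Collect frames until `max_silent_frames` quiet frames arrive in a row.

    Silence before the first loud frame is ignored, so the recording
    does not stop before the speaker has started talking.
    """
    kept: List[Sequence[float]] = []
    silent_run = 0
    heard_speech = False
    for frame in frames:
        kept.append(frame)
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silent_frames:
                break
    return kept
```

Ignoring leading silence is the design choice worth noting: without it, a pause between pressing the key and starting to speak would end the recording immediately.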
What you need
pip install "hermes-agent[voice]"
This installs sounddevice and numpy for microphone capture and audio processing. You also need a working microphone connected to your machine.
When CLI voice is useful
- Hands-free brainstorming: Talk through a problem while pacing around your office. Hermes keeps up.
- Accessibility: If typing is difficult or slow, voice input removes the barrier.
- Long-form dictation: Describe a complex task verbally instead of typing a paragraph of instructions.
The CLI voice mode is the most "developer-oriented" voice feature. It is useful, but the real magic happens on messaging platforms.
Telegram Voice: Send a Voice Memo, Get a Voice Reply
This is where voice mode becomes genuinely useful for non-technical users. On Telegram:
- You send a voice memo (hold the microphone button, speak, release)
- Hermes transcribes your message
- Hermes processes it normally
- Hermes sends back a spoken audio message alongside the text response
You can have an entirely voice-based conversation with your agent on Telegram. No typing required.
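The four steps above amount to a small pipeline: speech-to-text, the normal agent turn, then text-to-speech. The sketch below shows that shape with the three stages passed in as plain functions. Everything here is hypothetical (`VoiceReply`, `handle_voice_memo`, and the stand-in lambdas are not Hermes APIs); it only illustrates how a voice memo can reuse the same processing path as a typed message.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceReply:
    text: str     # text response sent alongside the audio
    audio: bytes  # synthesized speech for the same response


def handle_voice_memo(
    memo: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> VoiceReply:
    """Run one voice memo through transcription, the agent, and TTS."""
    prompt = transcribe(memo)   # speech-to-text
    answer = respond(prompt)    # handled exactly like a typed message
    return VoiceReply(text=answer, audio=synthesize(answer))


# Stand-in stages so the sketch runs without any audio libraries.
reply = handle_voice_memo(
    b"fake-audio",
    transcribe=lambda audio: "what is on my schedule",
    respond=lambda prompt: f"You asked: {prompt}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Because the middle stage only ever sees text, swapping the transcription or TTS backend does not change how the agent itself behaves.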
What you need
pip install "hermes-agent[messaging]"
Plus the standard Telegram bot setup (bot token from BotFather, configured in config.yaml).
For higher-quality voice output, you can configure a premium TTS provider such as ElevenLabs:
pip install "hermes-agent[tts-premium]"
The Telegram voice experience in practice
Imagine this workflow:
- You are walking to work. You hold the mic button in Telegram and say: "What's on my schedule today? And remind me to call the dentist at 3pm."
- Hermes checks your context, sets the reminder, and sends back a voice message: "You have two meetings this morning, a standup at 10 and a product review at 11:30. I've set a reminder for the dentist call at 3pm."
The entire interaction is voice-based. You never open a keyboard.
Auto Voice Reply configuration
By default, Hermes sends both text and audio replies on Telegram when voice mode is enabled. You can configure this behavior:
- Always voice: Every response includes spoken audio
- Reply in kind: Voice messages get voice replies, text messages get text replies
- Text only: Disable voice output while keeping voice input
The "reply in kind" mode is the most natural. It matches the user's communication style automatically.
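The three behaviors reduce to a single decision: given the configured mode and the kind of message that came in, should the reply include audio? A minimal sketch of that decision follows. The `ReplyMode` enum and its value names are assumptions made for illustration, not the actual keys Hermes reads from its configuration.

```python
from enum import Enum


class ReplyMode(Enum):
    # Hypothetical names for the three behaviors described above.
    ALWAYS_VOICE = "always_voice"
    REPLY_IN_KIND = "reply_in_kind"
    TEXT_ONLY = "text_only"


def should_send_audio(mode: ReplyMode, incoming_was_voice: bool) -> bool:
    """Decide whether a reply should include spoken audio."""
    if mode is ReplyMode.ALWAYS_VOICE:
        return True
    if mode is ReplyMode.REPLY_IN_KIND:
        # Mirror the user: voice in, voice out; text in, text out.
        return incoming_was_voice
    # TEXT_ONLY: voice input is still accepted, but replies stay silent.
    return False
```

For example, `should_send_audio(ReplyMode.REPLY_IN_KIND, incoming_was_voice=False)` is false, so a typed question gets a text-only answer.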
Discord Voice Channel: Live Conversation
The most advanced voice feature. Hermes can join a Discord voice channel, listen to everyone speaking, and respond with spoken audio in real time.
This turns the agent into a voice participant in group conversations. Multiple users can ask questions, and the agent responds to each one.
What you need
pip install "hermes-agent[messaging]"
Discord voice requires discord.py[voice], which is included in the messaging extra. You also need the Discord bot configured with voice permissions in your server.
When Discord voice is useful
- Team brainstorming: The agent participates in a voice discussion, offering suggestions and answering questions in real time
- Study groups: Ask the agent to explain concepts during a live discussion
- Gaming and social servers: The agent can be a voice-enabled helper in community channels
- Accessibility: Users who cannot type can interact with the agent via voice
TTS Voice Options
Hermes supports multiple text-to-speech backends:
| Provider | Quality | Cost | Notes |
|---|---|---|---|
| System TTS | Basic | Free | Default, works everywhere |
| NeuTTS (local) | Good | Free | Runs locally, requires setup |
| ElevenLabs | Excellent | Paid | Premium quality, most natural sounding |
For personal use, the system TTS or NeuTTS is sufficient. If you want the agent to sound genuinely human, especially for customer-facing or content creation use cases, ElevenLabs is worth the cost.
To configure ElevenLabs, add your API key to ~/.hermes/.env:
ELEVENLABS_API_KEY=your_key_here
And install the premium TTS package:
pip install "hermes-agent[tts-premium]"
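If you script your own setup checks, one way to reason about the three backends is as a fallback order: prefer ElevenLabs when a key is present, then a local engine, then the system default. The sketch below is a hypothetical helper, not how Hermes chooses its provider (that is governed by its own configuration). Only the `ELEVENLABS_API_KEY` variable name comes from the setup above; the function name, the `neutts_installed` flag, and the returned labels are assumptions.

```python
import os
from typing import Mapping, Optional


def pick_tts_provider(
    env: Optional[Mapping[str, str]] = None,
    neutts_installed: bool = False,  # assumed flag; detecting NeuTTS is out of scope here
) -> str:
    """Return a label for the best available TTS backend (illustrative policy)."""
    env = os.environ if env is None else env
    # A blank or whitespace-only key is treated as "not configured".
    if env.get("ELEVENLABS_API_KEY", "").strip():
        return "elevenlabs"
    if neutts_installed:
        return "neutts"
    return "system"
```

Passing the environment in as a mapping keeps the helper easy to check without touching real credentials, for example `pick_tts_provider({"ELEVENLABS_API_KEY": "your_key_here"})`.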
Voice Input Languages
Hermes uses Whisper for speech recognition, which supports 99 languages. You can speak in Spanish, French, German, Mandarin, or most other languages, and the agent will transcribe and respond appropriately.
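The transcription quality depends on the Whisper model configuration. For best results with non-English languages, use one of the larger multilingual Whisper models; the smaller models are noticeably less accurate outside English, and the English-only (.en) variants are not suitable at all.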
Privacy Considerations
Voice data introduces privacy considerations that text does not:
- Audio recordings: Check whether your TTS/STT provider retains audio. Hermes itself processes audio locally when using local models.
- Voice messages on Telegram: Telegram stores voice messages on its servers. The bot downloads them for transcription, but the originals remain in the Telegram cloud.
- Discord voice: Discord voice data passes through Discord's infrastructure before reaching the bot.
If privacy is a primary concern, local Whisper transcription and local TTS (NeuTTS) keep all audio processing on your infrastructure.
The Non-Technical Appeal
Voice mode is the feature that makes Hermes accessible to people who would never use a terminal. If you set up a Hermes agent for a family member, friend, or small business owner, voice on Telegram is the interface they will actually use.
Think about it from their perspective: they don't need to learn a CLI, they don't need to understand model configuration, and they don't need to type. They press and hold a button in an app they already use (Telegram), speak naturally, and get a spoken response. That is the experience that bridges the gap between "powerful AI agent" and "tool my parents would use."
Setting Up Voice Mode
If you are running Hermes yourself:
- Install voice support: pip install "hermes-agent[voice,messaging]"
- Configure TTS in config.yaml (or use defaults)
- Start the gateway: hermes gateway start --detach
- Send a voice memo to your Telegram bot
If you are using Hermify, voice mode works out of the box once your Telegram bot is connected. No additional installation or configuration needed.