Hermes Agent Voice Mode: Talk to Your AI Instead of Typing

Typing Is Not the Only Way to Use AI

The dominant interface for AI tools in 2026 is still text. You type a message, you read a response. This works well for many tasks. But there are situations where voice is simply better:

You are walking and want to think through a problem out loud
You are cooking and want to ask about a recipe substitution
You are driving and want your morning briefing read aloud
You are in a Discord voice channel and want the agent to participate in the conversation

Hermes Agent has voice mode built in across three surfaces: the CLI, Telegram, and Discord. This is not a bolted-on text-to-speech wrapper. It is full voice interaction: you speak, and the agent listens, transcribes, processes, and responds with spoken audio.

Here is how each voice feature works, what it takes to set up, and what it is actually useful for.

Voice Mode Overview

Hermes supports three distinct voice interaction patterns:

Feature	Where It Works	What It Does
Interactive Voice	CLI	Press Ctrl+B to record. Agent transcribes, processes, and displays the response.
Auto Voice Reply	Telegram, Discord	Agent sends spoken audio alongside text responses. Send a voice memo, get a voice reply.
Voice Channel	Discord	Bot joins a voice channel, listens to users speaking, and speaks replies back in real time.

Each mode serves a different use case. Let's break them down.

CLI Voice Mode: Talk in the Terminal

The simplest voice feature. Inside a Hermes CLI session, press Ctrl+B to start recording. Speak your message. Press Ctrl+B again (or wait for silence detection) to stop. Hermes transcribes your speech, processes it as a normal message, and responds.
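The article does not show how Hermes detects silence, so the following is only a minimal sketch of one common approach: stop recording once a run of quiet frames follows speech, with loudness measured as root-mean-square (RMS). The names `rms` and `record_until_silence`, the threshold, and the frame counts are all assumptions chosen for illustration, not Hermes internals.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in the range [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def record_until_silence(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.02,      # hypothetical loudness cutoff
    max_silent_frames: int = 3,   # hypothetical number of quiet frames that ends a recording
) -> List[Sequence[float]]:
    """Collect frames until `max_silent_frames` consecutive quiet frames follow speech."""
    recorded: List[Sequence[float]] = []
    silent_run = 0
    heard_speech = False
    for frame in frames:
        recorded.append(frame)
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Only count silence after speech, so leading quiet does not end the recording.
            silent_run += 1
            if silent_run >= max_silent_frames:
                break
    return recorded


# Simulated microphone input: one quiet frame, two loud frames, then trailing silence.
loud = [0.5, -0.5] * 4
quiet = [0.0] * 8
captured = record_until_silence([quiet, loud, loud, quiet, quiet, quiet, loud])
```

In a real capture loop the frames would come from the microphone rather than a list. The same stopping rule applies either way.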

What you need

pip install "hermes-agent[voice]"

This installs sounddevice and numpy for microphone capture and audio processing. You also need a working microphone connected to your machine.

When CLI voice is useful

Hands-free brainstorming: Talk through a problem while pacing around your office. Hermes keeps up.
Accessibility: If typing is difficult or slow, voice input removes the barrier.
Long-form dictation: Describe a complex task verbally instead of typing a paragraph of instructions.

CLI voice mode is the most developer-oriented voice feature. It is useful, but voice offers more on messaging platforms, where the agent can also reply with spoken audio.

Telegram Voice: Send a Voice Memo, Get a Voice Reply

This is where voice mode becomes genuinely useful for non-technical users. On Telegram:

You send a voice memo (hold the microphone button, speak, release)
Hermes transcribes your message
Hermes processes it normally
Hermes sends back a spoken audio message alongside the text response

You can have an entirely voice-based conversation with your agent on Telegram. No typing required.
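The four steps above form a simple pipeline: transcribe, process, synthesize, reply. As a rough sketch of that flow, assuming nothing about Hermes's real internals, the stages can be passed in as plain callables. `VoiceReply`, `handle_voice_memo`, and the three stage parameters are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceReply:
    text: str     # text response sent alongside the audio
    audio: bytes  # synthesized speech for the same response


def handle_voice_memo(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],   # speech-to-text stage (hypothetical hook)
    run_agent: Callable[[str], str],      # normal message processing (hypothetical hook)
    synthesize: Callable[[str], bytes],   # text-to-speech stage (hypothetical hook)
) -> VoiceReply:
    """Run one voice memo through transcription, the agent, and speech synthesis."""
    transcript = transcribe(audio_in)
    reply_text = run_agent(transcript)
    reply_audio = synthesize(reply_text)
    return VoiceReply(text=reply_text, audio=reply_audio)


# Stub stages stand in for the real speech and agent components.
reply = handle_voice_memo(
    b"fake-audio",
    transcribe=lambda audio: "hello",
    run_agent=lambda text: text.upper(),
    synthesize=lambda text: text.encode("utf-8"),
)
```

Keeping the stages as injected callables makes it easy to swap a local transcriber or a different TTS provider without touching the flow itself.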

What you need

pip install "hermes-agent[messaging]"

Plus the standard Telegram bot setup (bot token from BotFather, configured in config.yaml).

For higher quality voice output, you can configure premium TTS providers like ElevenLabs:

pip install "hermes-agent[tts-premium]"

The Telegram voice experience in practice

Imagine this workflow:

You are walking to work. You hold the mic button in Telegram and say: "What's on my schedule today? And remind me to call the dentist at 3pm."
Hermes checks your context, sets the reminder, and sends back a voice message: "You have two meetings this morning, a standup at 10 and a product review at 11:30. I've set a reminder for the dentist call at 3pm."

The entire interaction is voice-based. You never open a keyboard.

Auto Voice Reply configuration

By default, Hermes sends both text and audio replies on Telegram when voice mode is enabled. You can configure this behavior:

Always voice: Every response includes spoken audio
Reply in kind: Voice messages get voice replies, text messages get text replies
Text only: Disable voice output while keeping voice input

The "reply in kind" mode is the most natural. It matches the user's communication style automatically.
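The three behaviors reduce to one decision per incoming message: should the reply include audio? A minimal sketch of that decision follows. The enum `ReplyMode`, its values, and `should_send_audio` are illustrative assumptions and not the actual Hermes configuration keys.

```python
from enum import Enum


class ReplyMode(Enum):
    # Hypothetical names for the three behaviors described above.
    ALWAYS_VOICE = "always_voice"
    REPLY_IN_KIND = "reply_in_kind"
    TEXT_ONLY = "text_only"


def should_send_audio(mode: ReplyMode, incoming_is_voice: bool) -> bool:
    """Decide whether a reply should include spoken audio."""
    if mode is ReplyMode.ALWAYS_VOICE:
        return True
    if mode is ReplyMode.REPLY_IN_KIND:
        # Mirror the user's own choice of voice or text.
        return incoming_is_voice
    return False  # TEXT_ONLY: voice input is still accepted, output stays text


voice_for_voice = should_send_audio(ReplyMode.REPLY_IN_KIND, incoming_is_voice=True)
voice_for_text = should_send_audio(ReplyMode.REPLY_IN_KIND, incoming_is_voice=False)
```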

Discord Voice Channel: Live Conversation

The most advanced voice feature. Hermes can join a Discord voice channel, listen to everyone speaking, and respond with spoken audio in real time.

This turns the agent into a voice participant in group conversations. Multiple users can ask questions, and the agent responds to each one.

What you need

pip install "hermes-agent[messaging]"

Discord voice requires discord.py[voice], which is included in the messaging extra. You also need the Discord bot configured with voice permissions in your server.

When Discord voice is useful

Team brainstorming: The agent participates in a voice discussion, offering suggestions and answering questions in real time
Study groups: Ask the agent to explain concepts during a live discussion
Gaming and social servers: The agent can be a voice-enabled helper in community channels
Accessibility: Users who cannot type can interact with the agent via voice

TTS Voice Options

Hermes supports multiple text-to-speech backends:

Provider	Quality	Cost	Notes
System TTS	Basic	Free	Default, works everywhere
NeuTTS (local)	Good	Free	Runs locally, requires setup
ElevenLabs	Excellent	Paid	Premium quality, most natural sounding

For personal use, the system TTS or NeuTTS is sufficient. If you want the agent to sound genuinely human, especially for customer-facing or content creation use cases, ElevenLabs is worth the cost.

To configure ElevenLabs, add your API key to ~/.hermes/.env:

ELEVENLABS_API_KEY=your_key_here

And install the premium TTS package:

pip install "hermes-agent[tts-premium]"
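How Hermes chooses among these backends is not documented here. One plausible rule, sketched below purely as an assumption, is to use ElevenLabs when `ELEVENLABS_API_KEY` is set and fall back to system TTS otherwise, with an opt-in for local synthesis. The function `pick_tts_backend`, the `prefer_local` flag, and the returned labels are hypothetical.

```python
import os
from typing import Mapping, Optional


def pick_tts_backend(
    env: Optional[Mapping[str, str]] = None,
    prefer_local: bool = False,
) -> str:
    """Return a backend label: 'neutts', 'elevenlabs', or 'system' (labels are illustrative)."""
    env = os.environ if env is None else env
    if prefer_local:
        # Keep synthesis on the local machine regardless of any API key.
        return "neutts"
    if env.get("ELEVENLABS_API_KEY", "").strip():
        return "elevenlabs"
    return "system"


# Passing an explicit mapping avoids depending on the real environment.
with_key = pick_tts_backend({"ELEVENLABS_API_KEY": "your_key_here"})
without_key = pick_tts_backend({})
```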

Voice Input Languages

Hermes uses Whisper for speech recognition, which supports 99 languages. You can speak in Spanish, French, German, Mandarin, or most other languages, and the agent will transcribe and respond appropriately.

The transcription quality depends on the Whisper model configuration. For best results with non-English languages, use one of the larger Whisper models, which are generally more accurate on non-English speech than the smallest ones.

Privacy Considerations

Voice data introduces privacy considerations that text does not:

Audio recordings: Check whether your TTS/STT provider retains audio. Hermes itself processes audio locally when using local models.
Voice messages on Telegram: Telegram stores voice messages on its servers. The bot downloads them for transcription, but the originals remain in the Telegram cloud.
Discord voice: Discord voice data passes through Discord's infrastructure before reaching the bot.

If privacy is a primary concern, local Whisper transcription and local TTS (NeuTTS) keep all audio processing on your infrastructure.

The Non-Technical Appeal

Voice mode is the feature that makes Hermes accessible to people who would never use a terminal. If you set up a Hermes agent for a family member, friend, or small business owner, voice on Telegram is the interface they will actually use.

Think about it from their perspective: they don't need to learn a CLI, they don't need to understand model configuration, and they don't need to type. They press and hold a button in an app they already use (Telegram), speak naturally, and get a spoken response. That is the experience that bridges the gap between "powerful AI agent" and "tool my parents would use."

Setting Up Voice Mode

If you are running Hermes yourself:

Install voice support: pip install "hermes-agent[voice,messaging]"
Configure TTS in config.yaml (or use defaults)
Start the gateway: hermes gateway start --detach
Send a voice memo to your Telegram bot

If you are using Hermify, voice mode works out of the box once your Telegram bot is connected. No additional installation or configuration needed.
