
Add OpenVoiceUI Voice Pipeline — full STT → LLM → TTS tutorial#606

Closed
MCERQUA wants to merge 2 commits into Shubhamsaboo:main from MCERQUA:feat/openvoiceui-voice-pipeline

Conversation


@MCERQUA MCERQUA commented Mar 18, 2026

What this adds

A self-contained voice AI agent tutorial in voice_ai_agents/openvoiceui_voice_pipeline/ that demonstrates the complete voice conversation loop:

Speech-to-Text → Language Model → Text-to-Speech

Inspired by the architecture behind OpenVoiceUI, an open-source voice AI platform.

Pipeline

| Step | What happens | API |
|------|--------------|-----|
| 🎤 STT | Browser mic recording transcribed | OpenAI Whisper |
| 🧠 LLM | Transcript sent for AI response | GPT-4o |
| 🔊 TTS | Response synthesized and played back | OpenAI TTS |
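The three rows above can be sketched as a single round-trip helper, assuming the standard OpenAI Python SDK. The function name `run_voice_turn` is illustrative (not from the PR), and the client is passed in as a parameter so the logic can be exercised with a stub:

```python
def run_voice_turn(client, audio_file, history, model="gpt-4o", voice="alloy"):
    """One pass through the STT -> LLM -> TTS loop.

    `client` is an openai.OpenAI-compatible object, `audio_file` is a
    file-like object holding the mic recording, and `history` is the
    running list of chat messages (mutated in place).
    """
    # 1. STT: transcribe the recording with Whisper
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    ).text

    # 2. LLM: append the user turn and ask the chat model for a reply
    history.append({"role": "user", "content": transcript})
    reply = client.chat.completions.create(
        model=model, messages=history
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    # 3. TTS: synthesize the reply into audio bytes for playback
    speech = client.audio.speech.create(model="tts-1", voice=voice, input=reply)
    return transcript, reply, speech.content
```

Injecting the client this way keeps the loop unit-testable without network access; the real app would construct `openai.OpenAI(api_key=...)` from the sidebar input.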

Files

  • voice_pipeline.py — Streamlit app (~180 lines), fully self-contained
  • requirements.txt — 3 dependencies: openai, streamlit, python-dotenv
  • README.md — setup instructions + what you'll learn

What learners take away

  • How to capture audio from a browser with st.audio_input()
  • How to call Whisper for real-time transcription
  • How to maintain multi-turn conversation state in Streamlit
  • How to synthesize TTS and autoplay audio responses

All credentials are entered interactively in the Streamlit sidebar; no .env file is required.

Mike added 2 commits March 18, 2026 15:35
Self-contained Streamlit app demonstrating the complete voice AI loop:
- STT: browser mic recording transcribed via OpenAI Whisper
- LLM: multi-turn conversation with GPT-4o
- TTS: response synthesized and played back via OpenAI TTS

Includes configurable voice, model, and system prompt.
…xpressive TTS

- Add pipeline_agents.py: VoiceAssistant (GPT-4o + WebSearchTool, Pydantic output)
  and TTSDirector (GPT-4o-mini, writes delivery instructions for TTS)
- Refactor voice_pipeline.py: two-agent async pipeline via Runner.run(),
  multi-turn context window (last 6 messages), gpt-4o-mini-tts with instructions
- Update requirements.txt to include openai-agents and pydantic
- Update README with agent architecture diagram and expanded learning outcomes
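The "multi-turn context window (last 6 messages)" from the refactor can be implemented as a small trimming helper. This is a sketch of one reasonable implementation (the name `trim_context` is not from the PR): keep the system prompt, drop everything but the most recent turns so the prompt stays small over a long session:

```python
def trim_context(messages, window=6):
    """Return the system prompt (if present) plus the last `window`
    conversation messages, dropping older turns."""
    # Preserve a leading system message, if the history has one
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    # Slicing with a negative index keeps at most `window` recent turns
    return system + rest[-window:]
```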
@awesomekoder
Contributor

Thanks for the clean submission! The code quality is solid and the two-agent pattern is well documented.

However, the STT → text LLM → TTS pipeline is now outdated. OpenAI's Realtime API and Gemini 3.1 Live API both support native voice-to-voice with lower latency, no transcription step, and the model can actually hear tone and emotion.

For a stronger submission, consider building a tutorial using one of these native audio approaches instead.

Would love to see a resubmission using the modern voice architecture.

