CONCEPT · STUB

Voice AI

Voice AI refers to AI systems that hold real-time conversations through audio input and audio output. The architecture typically comprises three layers: STT (Speech-to-Text), LLM-based reasoning, and TTS (Text-to-Speech). End-to-end latency, the delay between the user's audio input and the system's audio response, is the primary quality metric for user experience.
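
The three-layer flow and the latency metric can be sketched as follows. This is a minimal illustration, not any vendor's implementation: `fake_stt`, `fake_llm`, `fake_tts`, `TurnResult`, and `run_turn` are hypothetical names, and the stage functions are placeholders standing in for real services.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


# Hypothetical stand-ins for real STT, LLM, and TTS services.
def fake_stt(audio: bytes) -> str:
    return "hello"


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


@dataclass
class TurnResult:
    audio_out: bytes
    stage_ms: Dict[str, float] = field(default_factory=dict)

    @property
    def total_ms(self) -> float:
        # End-to-end latency of a strictly sequential pipeline.
        return sum(self.stage_ms.values())


def run_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str] = fake_stt,
    llm: Callable[[str], str] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> TurnResult:
    """Run one conversational turn and time each layer in milliseconds."""
    stage_ms: Dict[str, float] = {}
    value = audio_in
    for name, stage in (("stt", stt), ("llm", llm), ("tts", tts)):
        start = time.perf_counter()
        value = stage(value)
        stage_ms[name] = (time.perf_counter() - start) * 1000.0
    return TurnResult(audio_out=value, stage_ms=stage_ms)


result = run_turn(b"\x00\x01")
print(result.stage_ms, result.total_ms)
```

In this sequential form, total latency is simply the sum of the three stages. Production systems commonly stream partial results between layers so the stages overlap, which this sketch does not attempt to model.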

Between 2024 and 2025, major deployments including OpenAI’s Realtime API, xAI Grok Voice, and Google Gemini Live moved voice AI from an LLM add-on feature to an independent, competitive product category. Infrastructure depth, specifically real-time LLM pipeline efficiency and WebRTC stack design, now sets the quality ceiling.

Natural turn-taking (detecting when a speaker has finished versus paused) remains one of the core unsolved challenges.
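
To make the difficulty concrete, here is a naive baseline that declares the end of a turn after a fixed stretch of silence. It is an illustrative sketch only, not a method attributed to any of the products above. The function name `detect_end_of_turn`, the per-frame energy input, and the default values (20 ms frames, 0.01 energy threshold, 600 ms silence window) are all assumptions chosen for the example.

```python
from typing import Optional, Sequence


def detect_end_of_turn(
    frame_energies: Sequence[float],
    frame_ms: int = 20,               # assumed frame length
    energy_threshold: float = 0.01,   # assumed speech/silence cutoff
    min_silence_ms: int = 600,        # assumed silence needed to end a turn
) -> Optional[int]:
    """Return the frame index where the turn is judged finished, else None."""
    # Number of consecutive silent frames required (ceiling division).
    frames_needed = -(-min_silence_ms // frame_ms)
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0            # a short pause resets once speech resumes
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# Speech, a 200 ms pause, more speech, then a long silence.
frames = [0.5] * 10 + [0.0] * 10 + [0.5] * 5 + [0.0] * 40
print(detect_end_of_turn(frames))
```

A fixed threshold forces a trade-off: a short window cuts in on speakers who pause mid-thought, while a long window adds delay to every response. A silence timer alone cannot distinguish the two cases, which is why the problem remains open.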
