Voice AI Agents — How They Work, What They Cost, and How to Build a Phone Bot
A voice AI agent is a system that holds a natural phone or voice conversation by combining three components in a loop: STT (speech-to-text), LLM (response generation) and TTS (text-to-speech). The key is latency: the whole loop has to fit within ~600 ms, because that is the threshold below which the caller stops noticing they are talking to a machine. In 2026 you have two build paths: managed platforms (Vapi, Retell) for a fast start at a higher per-minute cost, or your own pipeline on open-source (LiveKit, Pipecat) for full control and lower cost but more work. The real all-in cost is $0.11–0.25 per minute on a managed platform and $0.05–0.15 per minute on your own pipeline, and the typical use cases are customer support, bookings, lead qualification and outbound campaigns.
The complete guide to voice AI agents: how the STT→LLM→TTS pipeline works, why latency under 600 ms decides whether it sounds human, how interruptions (barge-in) and end-of-turn detection work, which stack to choose in 2026, a comparison of the Vapi, Retell, LiveKit and Pipecat platforms, how to connect SIP/PSTN telephony, what it really costs per minute, and the business use cases with their ROI.
You call a business at 11 p.m. to reschedule an appointment. Instead of "our office is open 8 to 4" you hear a calm voice that understands your request, checks the calendar, offers three slots and confirms the change — in 40 seconds, no queue, no waiting for an agent. You do not realize you just spoke with AI. This is not the future — these are deployments that work today.
Voice AI agents are one of the fastest-growing automation categories, because the phone is still the main contact channel in many industries — and at the same time the most expensive and hardest to scale. This article shows how these systems work under the hood, why latency is paramount, which stack and platform to choose, how to connect telephony and what it really costs.
How a voice AI agent works — the STT→LLM→TTS pipeline
[Figure: Voice AI agent pipeline. STT → LLM → TTS in one conversation loop]
A voice agent is not one model but a pipeline of three specialized components that process every conversation turn:
- STT (Speech-to-Text): turns the caller's speech into text in real time. It is the foundation, because if transcription is wrong every downstream step fails. STT does two things in parallel: it transcribes, and it detects when the caller's turn has ended (end-of-turn detection).
- LLM (language model): receives the transcription along with the conversation context and generates a response. Voice applications use fast, lightweight models (e.g. Claude Haiku) because speed matters more than maximum intelligence.
- TTS (Text-to-Speech): turns the model's response into a natural voice. What matters is time to first audio, not the time to generate the whole utterance.
The trick that makes it sound natural is streaming at every stage. You do not wait for the caller to finish to start transcribing; you do not wait for the whole LLM response to start TTS. The components work in parallel and stream — when the LLM produces the first sentence, TTS is already speaking it while the model keeps generating. That is the difference between a robot reading from a script and a fluid conversation.
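The hand-off from LLM to TTS can be sketched as a small generator that cuts the token stream into sentences as they complete. This is a minimal illustration, not any platform's implementation: the function name `sentences_from_tokens` is hypothetical, the token stream is assumed to be a plain iterable of strings, and the sentence boundary rule (punctuation followed by whitespace) is a simplification that would also split after abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ., ! or ? followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream has completed it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # All parts except the last are finished sentences; the last part
        # is still being generated and stays in the buffer.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the model stops generating.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Sure", ", I can", " help.", " Which", " day", " works", " for you?"]
    for sentence in sentences_from_tokens(llm_tokens):
        # In a real pipeline this is where the sentence would be sent to TTS.
        print(sentence)
```

The first sentence is released as soon as the model moves on to the next one, so TTS can start speaking while the rest of the response is still being generated.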
Latency — the single most important parameter
In a text chatbot a 2-second delay is tolerable. In a voice conversation it is a chasm — in natural conversation people exchange turns with gaps of around 200 ms. That is why latency is the number-one parameter in voice AI, more important than the model's "intelligence".
| Latency metric | Target | What it means |
|---|---|---|
| Full loop (STT→LLM→TTS) | ~600 ms | The threshold where the caller stops noticing it is AI |
| Barge-in (interruption) | < 150 ms | From the start of the caller's speech to the agent's voice stopping |
| Gap between turns | 200–450 ms | From the caller finishing to the first audio of the agent's reply |
| End-of-turn detection | The long pole | Detecting the caller has finished — tuned to avoid false cuts |
Real production deployments achieve 580–620 ms across the full loop, which is right at the threshold where callers in tests stop noticing they are talking to AI. Each component has its budget: STT ~100–200 ms, LLM ~200–300 ms, TTS ~150–250 ms. The upper ends of those ranges add up to 750 ms, so the sum only fits when every stage runs near its lower bound. Choosing fast providers at each stage is therefore not an optimization; it is a precondition for working.
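The budget arithmetic can be checked with a few lines of Python. This is a sketch under one stated assumption: the stage latencies are simply added, which ignores any overlap gained from streaming. The names `BUDGET_MS`, `loop_range` and `fits` are illustrative only.

```python
# Per-stage budgets in milliseconds, taken from the ranges in the text.
BUDGET_MS = {"stt": (100, 200), "llm": (200, 300), "tts": (150, 250)}
TARGET_MS = 600  # full-loop target from the latency table


def loop_range(budget: dict) -> tuple:
    """Return (best case, worst case) for the full loop as a plain sum."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def fits(measured_ms: dict, target_ms: int = TARGET_MS) -> bool:
    """True if the measured per-stage latencies add up to the target or less."""
    return sum(measured_ms.values()) <= target_ms


if __name__ == "__main__":
    print(loop_range(BUDGET_MS))
    print(fits({"stt": 150, "llm": 250, "tts": 200}))
    print(fits({"stt": 200, "llm": 300, "tts": 250}))
```

The summed range is 450–750 ms, so a stack where every stage sits at the slow end of its budget misses the 600 ms target even though each stage is individually "within budget".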
End-of-turn detection is usually the hardest part of the whole system, because it pulls in two directions. If the agent reacts too fast, it interrupts the caller mid-sentence (when they pause for breath). Too slow, and the conversation drags with awkward silences. That is why modern systems use semantic VAD (voice activity detection), which judges whether a sentence is complete instead of only detecting silence.
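The trade-off can be illustrated with a toy endpointing rule that waits longer when the partial transcript looks unfinished. This is a deliberately crude stand-in: production semantic VAD relies on a trained model, not a word list. The thresholds (300 ms and 1200 ms), the word list and both function names are assumptions made for this sketch.

```python
# Hypothetical list of words that usually signal an unfinished thought.
INCOMPLETE_ENDINGS = {"and", "but", "or", "so", "because", "um", "uh", "the", "a", "to", "my"}


def silence_needed_ms(transcript: str, short_ms: int = 300, long_ms: int = 1200) -> int:
    """How much silence to require before treating the turn as finished."""
    words = transcript.lower().rstrip(".?!, ").split()
    if not words:
        return long_ms  # nothing heard yet: be patient
    if transcript.rstrip().endswith((".", "?", "!")):
        return short_ms  # STT punctuated a complete sentence: answer fast
    if words[-1] in INCOMPLETE_ENDINGS:
        return long_ms  # trailing "and", "my", ...: the caller is mid-thought
    return (short_ms + long_ms) // 2  # unclear: wait a middle amount


def turn_ended(transcript: str, silence_ms: int) -> bool:
    """Decide end-of-turn from the partial transcript and the current silence."""
    return silence_ms >= silence_needed_ms(transcript)


if __name__ == "__main__":
    print(turn_ended("I'd like to reschedule my", 500))
    print(turn_ended("I'd like to reschedule my appointment.", 500))
```

With the same 500 ms of silence, the unfinished sentence keeps the floor with the caller while the complete one hands it to the agent, which is the behaviour a silence-only threshold cannot provide.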
Interruptions and turn-taking — the secret to natural conversation
A voice agent that talks over the caller is a voice agent that loses the call. In real conversation we interrupt each other — "yes, exactly", "no, I meant..." — and the agent has to handle it. This is called barge-in: the ability to fall silent instantly when the caller starts speaking.
Barge-in mechanics have two sides:
- Detecting the interruption — semantic VAD on the STT side recognizes that the caller has started speaking while the agent is still talking
- Instant TTS stop — the agent's voice playback must cut off in < 150 ms from the start of the caller's speech; any delay makes the agent "talk over" and sound unnatural
Turn-taking is the conversational policy that decides who "holds the floor" at any moment. A good agent not only reacts to interruptions but also knows when to pause, when to acknowledge ("mhm", "I see"), and when to wait because the caller has not finished their thought. It is these details, not voice quality itself, that separate an agent that is pleasant to talk to from one that frustrates after 15 seconds.
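The two sides of barge-in, detecting the interruption and stopping playback at once, map naturally onto task cancellation. The sketch below uses only Python's standard `asyncio`; `play_tts` and `speak_with_barge_in` are hypothetical names, each chunk is assumed to represent about 20 ms of audio, and the `caller_started` event stands in for whatever signal the VAD would raise.

```python
import asyncio


async def play_tts(chunks, played):
    """Hypothetical playback loop: each chunk stands for ~20 ms of audio."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.02)


async def speak_with_barge_in(chunks, caller_started: asyncio.Event):
    """Play the chunks, but cut playback as soon as the caller starts speaking."""
    played = []
    playback = asyncio.create_task(play_tts(chunks, played))
    interrupt = asyncio.create_task(caller_started.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done and not playback.done()
    # Cancel whichever task is still running: playback on barge-in,
    # the interrupt watcher when the agent finished undisturbed.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted


async def demo():
    caller_started = asyncio.Event()

    async def caller():
        await asyncio.sleep(0.05)  # caller cuts in after ~50 ms
        caller_started.set()

    caller_task = asyncio.create_task(caller())
    played, interrupted = await speak_with_barge_in(list(range(50)), caller_started)
    await caller_task
    return played, interrupted


if __name__ == "__main__":
    played, interrupted = asyncio.run(demo())
    print(f"interrupted={interrupted}, chunks played={len(played)} of 50")
```

Because playback proceeds in small chunks, the worst-case stop delay in this sketch is one chunk length plus scheduling overhead, which is why short audio frames matter for staying under the 150 ms target.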
The 2026 technology stack
The choice of providers at each pipeline stage decides latency and quality. The proven "sweet spot" for 2026:
| Component | 2026 recommendation | Alternatives | Why |
|---|---|---|---|
| STT | Deepgram Nova-3 | AssemblyAI | Best streaming latency and accuracy |
| LLM | Claude Haiku 4.5 | GPT-4o-mini, Gemini Flash | Fast, cheap, intelligent enough for conversation |
| TTS | Cartesia Sonic-3 | ElevenLabs, Deepgram Aura-2 | Lowest time to first audio, natural voice |
This stack delivers a total latency of 550–700 ms. The key selection rule: in voice AI you do not pick the most intelligent LLM, you pick the fastest one that is good enough. A phone conversation rarely needs GPT-4o-level reasoning — it needs instant reaction. Claude Haiku or GPT-4o-mini respond in a fraction of the time of large models, and for most scenarios (bookings, FAQ, qualification) their capabilities are more than enough.
For languages other than English, pay special attention to STT and TTS — not all models handle them as well as English. Test transcription on real recordings from your industry (with slang, proper names, numbers) before choosing, because it is the foundation — an STT error ruins the whole conversation.
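One common way to compare STT providers on your own recordings is word error rate (WER): the word-level edit distance between a human reference transcript and the STT output, divided by the number of reference words. The function below is a minimal sketch with simple assumptions: lowercasing and whitespace splitting are the only normalization, and the sample sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "book me for tuesday at nine"
    stt_output = "book me for thursday at nine"
    print(f"WER: {word_error_rate(reference, stt_output):.3f}")
```

A single wrong word out of six looks small as a percentage, but here it is the day of the appointment, so it is worth inspecting which words fail (names, numbers, dates) and not only the aggregate score.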
Platforms: Vapi, Retell, LiveKit, Pipecat
[Figure: Vapi vs Retell vs LiveKit vs Pipecat, voice platforms compared]
You do not have to assemble the pipeline from scratch: orchestration platforms do it for you. They fall into two camps, managed (Vapi, Retell) and open-source (LiveKit, Pipecat):
- Vapi — a managed platform with a visual builder and API; a good balance of ease and control; watch the default endpointing of ~1450 ms, which needs tuning
- Retell — managed, valued for natural conversation; endpointing ~700 ms; good for customer support
- LiveKit — open-source with native SIP/WebRTC support; full control, ideal for telephony and custom deployments
- Pipecat — open-source in Python; the lowest latency (~300 ms endpointing); the choice for teams that value performance and developer control
The build vs buy decision:
- Choose a managed platform (Vapi, Retell) when you want to launch fast, have no team to maintain real-time infrastructure, and accept a higher per-minute cost in exchange for convenience
- Choose open-source (LiveKit, Pipecat) when you have an engineering team, care about the lowest latency and cost at scale, or need full control over data (e.g. self-hosting, compliance)
The rule: start with a managed platform to validate the business case in weeks, not months. Move to your own pipeline when scale makes per-minute cost and control matter more than time to deployment.
Telephony — SIP, PSTN and WebRTC
The AI pipeline alone is not everything — the agent has to connect to something. That is where the telephony layer comes in:
- PSTN (the public telephone network) — so the agent can call and answer on regular phone numbers
- SIP (Session Initiation Protocol) — the protocol you use to connect the agent to phone exchanges and carriers
- WebRTC — voice through a browser or app, without a phone number (e.g. a "call us" widget on a website)
For production telephony deployments, the SIP layer is provided by LiveKit, Vapi or carriers such as Twilio or Telnyx. A well-designed agent works across all three channels (PSTN, SIP, WebRTC), so you can connect it to both a hotline and a website widget. Integrating a phone number is usually a few configuration steps with a SIP provider — you do not build it from scratch.
What it really costs
A voice agent's cost is measured per minute of conversation and consists of several layers. Beware the marketing: platforms advertise the platform fee alone, not the all-in cost.
| Option | Advertised fee | Real all-in cost | Notes |
|---|---|---|---|
| Own pipeline (DIY) | — | $0.05–0.15/min | Full control, sum of STT+LLM+TTS+telephony |
| Vapi | $0.05/min (platform) | $0.11–0.25/min | Plus STT, LLM, TTS, telephony |
| Retell | $0.07/min (platform) | $0.11–0.25/min | As above |
| Bland | $0.09/min (platform) | $0.11–0.25/min | As above |
The real all-in cost for managed platforms lands between $0.11 and $0.25 per minute once you add STT, LLM, TTS and telephony. Your own pipeline comes to $0.05–0.15 per minute with full control, which is why at large scale (thousands of minutes a day) the per-minute savings of a self-built system repay the engineering team's cost. Compare that to the cost of a human agent: even $0.25 per minute is a fraction of what a call-center employee billed by the hour costs, and the AI agent works 24/7, without breaks, in parallel across hundreds of calls.
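The break-even point between the two paths can be estimated with simple arithmetic. In the sketch below the per-minute ranges come from the table above, while the $10,000 monthly engineering cost is a purely hypothetical figure chosen for illustration; the function names are illustrative as well.

```python
# All-in per-minute cost ranges in USD, from the cost table.
MANAGED_PER_MIN = (0.11, 0.25)
DIY_PER_MIN = (0.05, 0.15)


def midpoint(cost_range: tuple) -> float:
    """Middle of a (low, high) per-minute range."""
    return (cost_range[0] + cost_range[1]) / 2


def monthly_cost(minutes: float, per_min: float, fixed: float = 0.0) -> float:
    """Usage cost plus any fixed monthly cost (e.g. engineering time)."""
    return minutes * per_min + fixed


def break_even_minutes(managed_rate: float, diy_rate: float, diy_fixed_monthly: float) -> float:
    """Monthly minutes at which the DIY pipeline costs the same as managed."""
    saving_per_min = managed_rate - diy_rate
    if saving_per_min <= 0:
        raise ValueError("DIY rate must be lower than the managed rate")
    return diy_fixed_monthly / saving_per_min


if __name__ == "__main__":
    managed = midpoint(MANAGED_PER_MIN)  # 0.18 USD/min
    diy = midpoint(DIY_PER_MIN)          # 0.10 USD/min
    engineering = 10_000                 # hypothetical monthly cost, USD
    minutes = break_even_minutes(managed, diy, engineering)
    print(f"break-even: {minutes:,.0f} min/month (~{minutes / 30:,.0f} min/day)")
```

With the range midpoints and that assumed engineering cost, break-even lands at about 125,000 minutes a month, roughly 4,000 minutes a day. Below that volume the managed platform is the cheaper option under these assumptions.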
Business use cases and ROI
Voice agents excel where conversations are repetitive and volume is high:
- Customer service and support — answering frequent questions, order status, basic troubleshooting; the agent takes the routine, humans handle the hard cases
- Bookings and scheduling — checking the calendar, offering slots, confirmations and reminders; ideal for clinics, salons, workshops
- Lead qualification — the agent calls new contacts, asks qualifying questions and hands hot leads to a salesperson
- Outbound campaigns — payment reminders, satisfaction surveys, delivery confirmations — at a scale unreachable for a human team
- 24/7 hotline — answering calls after hours, routing urgent matters, collecting information before a human contact
ROI comes from three sources: the AI agent handles hundreds of calls in parallel (scale), works around the clock without overtime (availability) and costs a fraction of a human agent's hourly rate (cost). It pays back fastest where a company loses calls after hours or where human agents spend time on repetitive, simple conversations. Before a full rollout, analyze which conversations are repetitive enough for the AI agent to take over and which require a human.
Common mistakes and a deployment checklist
1. Measure full-loop latency: target ~600 ms; above it the conversation sounds artificial and callers hang up
2. Pick fast providers at each stage (STT, LLM, TTS): it is a latency precondition, not an optimization
3. Choose the fastest good-enough LLM, not the most intelligent one: Haiku/mini, not large models
4. Test STT on real recordings in your language, with proper names, numbers and industry slang
5. Implement barge-in with TTS stop in < 150 ms: the agent must fall silent when the caller starts speaking
6. Tune end-of-turn detection (semantic VAD): balance between interrupting and awkward silence
7. Start with a managed platform (Vapi/Retell) to validate the case in weeks
8. Move to open-source (LiveKit/Pipecat) at scale for lower per-minute cost and full control
9. Calculate the all-in cost, not just the platform fee: realistically $0.11–0.25/min on a managed platform
10. Plan human escalation: the agent must be able to hand off a hard case, not get stuck in a loop
11. Add guardrails and handling for unexpected questions: the agent must not hallucinate to a customer
12. Pick a repetitive, high-volume case to start: bookings or FAQ, not the whole support operation at once
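The human-escalation item can be made concrete with a small policy object. This is a sketch under stated assumptions, not a recommended rule set: the phrase list, the limit of two consecutive failed turns and the `agent_understood` flag (assumed to come from your own intent or confidence check) are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical phrases that signal the caller wants a person.
HANDOFF_PHRASES = ("human", "representative", "real person", "someone else")


@dataclass
class EscalationPolicy:
    """Hand off when the caller asks for a human or the agent keeps failing."""
    max_failed_turns: int = 2
    failed_turns: int = 0

    def should_escalate(self, caller_text: str, agent_understood: bool) -> bool:
        text = caller_text.lower()
        # An explicit request for a person always wins.
        if any(phrase in text for phrase in HANDOFF_PHRASES):
            return True
        # Count consecutive turns the agent could not handle; reset on success.
        self.failed_turns = 0 if agent_understood else self.failed_turns + 1
        return self.failed_turns >= self.max_failed_turns


if __name__ == "__main__":
    policy = EscalationPolicy()
    print(policy.should_escalate("I want to move my booking", agent_understood=True))
    print(policy.should_escalate("no, the other thing", agent_understood=False))
    print(policy.should_escalate("that is not what I said", agent_understood=False))
```

Counting consecutive failures, not total ones, keeps a single misheard turn from ending an otherwise successful call while still preventing the agent from looping on a case it cannot solve.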
Key takeaways
A voice AI agent is an STT→LLM→TTS pipeline in a conversation loop, where latency matters most: the whole loop must fit within ~600 ms, the naturalness threshold. You choose the fastest good-enough LLM, not the most intelligent one, and conversation quality is decided by the details: barge-in (< 150 ms), end-of-turn detection and turn-taking. The 2026 stack: Deepgram Nova-3 (STT), Claude Haiku 4.5 (LLM), Cartesia Sonic-3 (TTS). Build on a managed platform (Vapi, Retell) for a fast start, or on open-source (LiveKit, Pipecat) for control and lower cost at scale. The real all-in cost is $0.11–0.25/min on a managed platform and $0.05–0.15/min on your own pipeline. The best use cases are repetitive, high-volume conversations: customer support, bookings, lead qualification and outbound campaigns, with human escalation where empathy is needed.
---
I help companies design and deploy voice AI agents — from stack and platform selection, through latency optimization and language handling, to telephony integration, human escalation and ROI analysis. Get in touch — I start with a free 30-minute analysis of your use case.
