AI & Automation 15 min

Voice AI Agents — How They Work, What They Cost, and How to Build a Phone Bot

A voice AI agent is a system that holds a natural phone or voice conversation by combining three components in a loop: STT (speech-to-text), an LLM (response generation) and TTS (text-to-speech). The key is latency: the whole loop has to fit within ~600 ms, because that is the threshold below which the caller stops noticing they are talking to a machine. In 2026 you have two build paths: managed platforms (Vapi, Retell) for a fast start at a higher per-minute cost, or your own pipeline on open-source (LiveKit, Pipecat) for full control and lower cost but more work. The real all-in cost is $0.11–0.25 per minute on a managed platform ($0.05–0.15 for your own pipeline), and the typical use cases are customer support, bookings, lead qualification and outbound campaigns.

The complete guide to voice AI agents: how the STT→LLM→TTS pipeline works, why latency under 600 ms decides whether it sounds human, how interruptions (barge-in) and end-of-turn detection work, which stack to choose in 2026, a comparison of the Vapi, Retell, LiveKit and Pipecat platforms, how to connect SIP/PSTN telephony, what it really costs per minute, and the business use cases with their ROI.

You call a business at 11 p.m. to reschedule an appointment. Instead of "our office is open 8 to 4" you hear a calm voice that understands your request, checks the calendar, offers three slots and confirms the change — in 40 seconds, no queue, no waiting for an agent. You do not realize you just spoke with AI. This is not the future — these are deployments that work today.

Voice AI agents are one of the fastest-growing automation categories, because the phone is still the main contact channel in many industries — and at the same time the most expensive and hardest to scale. This article shows how these systems work under the hood, why latency is paramount, which stack and platform to choose, how to connect telephony and what it really costs.

How a voice AI agent works — the STT→LLM→TTS pipeline

/// VOICE AI AGENT PIPELINE

[Diagram: STT → LLM → TTS in one conversation loop. Speech → STT (Deepgram Nova-3, ~100–200 ms: turns speech into text and detects end of turn) → LLM (Claude Haiku 4.5, ~200–300 ms: generates the reply, streaming tokens so TTS can start early) → TTS (Cartesia Sonic-3, ~150–250 ms: time to first audio matters most) → Voice. Key figures: ~600 ms full-loop threshold, < 150 ms barge-in, 200–450 ms gap between turns.]

A voice agent is not one model but a pipeline of three specialized components that process every conversation turn:

  • STT (Speech-to-Text) — turns the caller's speech into text in real time; it is the foundation, because if transcription is wrong every downstream step fails. STT does two things in parallel: it transcribes and it detects the end of the turn (end-of-turn)
  • LLM (language model) — receives the transcription along with conversation context and generates a response; voice applications use fast, lightweight models (e.g. Claude Haiku) because speed matters more than maximum intelligence
  • TTS (Text-to-Speech) — turns the model's response into a natural voice; what matters is time to first audio, not generating the whole utterance

The trick that makes it sound natural is streaming at every stage. You do not wait for the caller to finish to start transcribing; you do not wait for the whole LLM response to start TTS. The components work in parallel and stream — when the LLM produces the first sentence, TTS is already speaking it while the model keeps generating. That is the difference between a robot reading from a script and a fluid conversation.
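The hand-off between the LLM and TTS can be sketched as follows. This is a minimal illustration, not any platform's actual implementation: `sentences_from_tokens` and `fake_llm_stream` are hypothetical names, and splitting on `.`, `!` and `?` is a simplification (a decimal such as "3.5" could be cut early).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush as soon as the buffer ends a sentence, so TTS can start early.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # trailing text without final punctuation


def fake_llm_stream() -> Iterator[str]:
    # Stand-in for a streaming LLM response (hypothetical tokens).
    yield from ["Sure", ".", " I", " have", " three", " slots", ".",
                " Which", " works", "?"]


for sentence in sentences_from_tokens(fake_llm_stream()):
    # A real pipeline would push each sentence into the TTS stream here.
    print("TTS <-", sentence)
```

The first sentence reaches TTS after two tokens, while the model is still producing the rest of the reply.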

Latency — the single most important parameter

In a text chatbot a 2-second delay is tolerable. In a voice conversation it is a chasm — in natural conversation people exchange turns with gaps of around 200 ms. That is why latency is the number-one parameter in voice AI, more important than the model's "intelligence".

| Latency metric | Target | What it means |
| --- | --- | --- |
| Full loop (STT→LLM→TTS) | ~600 ms | The threshold where the caller stops noticing it is AI |
| Barge-in (interruption) | < 150 ms | From the start of the caller's speech to the agent's voice stopping |
| Gap between turns | 200–450 ms | From the agent finishing to the first audio of the next turn |
| End-of-turn detection | The long pole | Detecting that the caller has finished; tuned to avoid false cuts |

Real production deployments achieve 580–620 ms across the full loop — and that is precisely the threshold at which tested callers stop noticing they are talking to AI. Each component has its budget: STT ~100–200 ms, LLM ~200–300 ms, TTS ~150–250 ms. The sum has to fit, so choosing fast providers at each stage is not an optimization — it is a precondition for working.
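The budget arithmetic is easy to check. The sketch below uses only the per-stage ranges quoted above and treats the stages as strictly sequential, which is an assumption: streaming overlap between stages is ignored. `within_target` is a hypothetical helper name.

```python
# Per-stage latency budget in milliseconds: (best case, worst case).
BUDGET_MS = {
    "stt": (100, 200),
    "llm": (200, 300),
    "tts": (150, 250),
}
TARGET_MS = 600

best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())


def within_target(measured_ms: dict[str, float], target_ms: float = TARGET_MS) -> bool:
    """Return True when the summed stage latencies fit the full-loop target."""
    return sum(measured_ms.values()) <= target_ms


print(f"best case: {best_case} ms, worst case: {worst_case} ms")
print(within_target({"stt": 150, "llm": 250, "tts": 200}))
```

With every stage at the slow end of its range the loop no longer fits the target, so a single slow provider is enough to break the budget.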

The end-of-turn paradox: it is usually the hardest part of the whole system. If the agent reacts too fast, it will interrupt the caller mid-sentence (when they pause for breath). Too slow, and the conversation drags with awkward silences. That is why modern systems use semantic VAD (voice activity detection) that understands whether a sentence is complete, not just detects silence.
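To make the trade-off concrete, here is a toy end-of-turn rule. It is only a sketch: production semantic VAD uses a trained model, whereas this version stands in for it with a hypothetical word list, and both wait thresholds are placeholder values chosen for illustration.

```python
# Words that usually signal the caller is mid-thought (illustrative list only).
INCOMPLETE_ENDINGS = ("and", "but", "so", "because", "or", "to", "the", "a", "um", "uh")


def looks_complete(transcript: str) -> bool:
    """Crude stand-in for a semantic completeness check."""
    words = transcript.lower().rstrip(" .,!?").split()
    return bool(words) and words[-1] not in INCOMPLETE_ENDINGS


def end_of_turn(transcript: str, silence_ms: int,
                short_wait_ms: int = 300, long_wait_ms: int = 1200) -> bool:
    """Wait briefly after a complete sentence, longer after an unfinished one."""
    wait_ms = short_wait_ms if looks_complete(transcript) else long_wait_ms
    return silence_ms >= wait_ms


print(end_of_turn("I'd like Tuesday.", silence_ms=350))
print(end_of_turn("I'd like to", silence_ms=350))
```

The same 350 ms of silence ends the turn after a finished sentence but not after "I'd like to", which is the behavior a pure silence detector cannot provide.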

Interruptions and turn-taking — the secret to natural conversation

A voice agent that talks over the caller is a voice agent that loses the call. In real conversation we interrupt each other — "yes, exactly", "no, I meant..." — and the agent has to handle it. This is called barge-in: the ability to fall silent instantly when the caller starts speaking.

Barge-in mechanics have two sides:

  • Detecting the interruption — semantic VAD on the STT side recognizes that the caller has started speaking while the agent is still talking
  • Instant TTS stop — the agent's voice playback must cut off in < 150 ms from the start of the caller's speech; any delay makes the agent "talk over" and sound unnatural
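The two sides of barge-in can be sketched with `asyncio`: playback runs as a task and is cancelled the moment an interruption event fires. This is an illustration under stated assumptions, not a platform API: `play_tts` simulates audio with sleeps, and the `caller_speaking` event stands in for whatever signal the detection side raises.

```python
import asyncio


async def play_tts(chunks: list[str], played: list[str]) -> None:
    for chunk in chunks:
        await asyncio.sleep(0.05)  # pretend each chunk is 50 ms of audio
        played.append(chunk)


async def speak_with_barge_in(chunks: list[str],
                              caller_speaking: asyncio.Event) -> list[str]:
    """Play chunks until done, or stop as soon as the caller starts speaking."""
    played: list[str] = []
    playback = asyncio.create_task(play_tts(chunks, played))
    interrupt = asyncio.create_task(caller_speaking.wait())
    done, _ = await asyncio.wait({playback, interrupt},
                                 return_when=asyncio.FIRST_COMPLETED)
    if interrupt in done and not playback.done():
        playback.cancel()  # cut the agent's voice immediately
    interrupt.cancel()
    # Collect both tasks so cancellations do not leak as warnings.
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played


async def demo() -> list[str]:
    caller_speaking = asyncio.Event()
    # Simulate the caller starting to talk 120 ms into the agent's reply.
    asyncio.get_running_loop().call_later(0.12, caller_speaking.set)
    return await speak_with_barge_in(
        ["one", "two", "three", "four", "five"], caller_speaking)


print(asyncio.run(demo()))
```

Only the chunks played before the interruption are returned, which is also what the agent should record as "actually said" in the conversation history.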

Turn-taking, in turn, is the conversational policy that decides who "holds the floor" at any moment. A good agent not only reacts to interruptions but also knows when to pause, when to acknowledge ("mhm", "I see"), and when to wait because the caller has not finished their thought. It is these details — not voice quality itself — that separate an agent that is pleasant to talk to from one that frustrates after 15 seconds.

The 2026 technology stack

The choice of providers at each pipeline stage decides latency and quality. The proven "sweet spot" for 2026:

| Component | 2026 recommendation | Alternatives | Why |
| --- | --- | --- | --- |
| STT | Deepgram Nova-3 | AssemblyAI | Best streaming latency and accuracy |
| LLM | Claude Haiku 4.5 | GPT-4o-mini, Gemini Flash | Fast, cheap, intelligent enough for conversation |
| TTS | Cartesia Sonic-3 | ElevenLabs, Deepgram Aura-2 | Lowest time to first audio, natural voice |

This stack delivers a total latency of 550–700 ms. The key selection rule: in voice AI you do not pick the most intelligent LLM, you pick the fastest one that is good enough. A phone conversation rarely needs GPT-4o-level reasoning — it needs instant reaction. Claude Haiku or GPT-4o-mini respond in a fraction of the time of large models, and for most scenarios (bookings, FAQ, qualification) their capabilities are more than enough.
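The selection rule can be written down as a small function. This is a sketch with hypothetical names (`Candidate`, `fastest_good_enough`) and placeholder numbers: the latencies and pass rates below are invented for illustration and should be replaced with measurements on your own test dialogues.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    first_token_ms: float   # measured on your own prompts
    task_pass_rate: float   # share of your test dialogues handled correctly


def fastest_good_enough(candidates: Sequence[Candidate],
                        min_pass_rate: float) -> Optional[Candidate]:
    """Filter by quality first, then pick the lowest latency among the survivors."""
    good = [c for c in candidates if c.task_pass_rate >= min_pass_rate]
    if not good:
        return None
    return min(good, key=lambda c: c.first_token_ms)


# Placeholder measurements, not benchmark results for any real model.
models = [
    Candidate("large-model", first_token_ms=900, task_pass_rate=0.98),
    Candidate("small-model", first_token_ms=250, task_pass_rate=0.95),
    Candidate("tiny-model", first_token_ms=120, task_pass_rate=0.80),
]
choice = fastest_good_enough(models, min_pass_rate=0.90)
print(choice.name if choice else "no model is good enough")
```

Quality acts as a gate and latency as the ranking criterion, so the most accurate model wins only when nothing faster clears the bar.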

For languages other than English, pay special attention to STT and TTS — not all models handle them as well as English. Test transcription on real recordings from your industry (with slang, proper names, numbers) before choosing, because it is the foundation — an STT error ruins the whole conversation.
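One common way to score such a test is word error rate (WER): the word-level edit distance between a human reference transcript and the STT output, divided by the reference length. The sketch below is a plain implementation of that metric; the sample sentences are made up, and real evaluations usually normalize punctuation and numbers first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Edit-distance table over words (substitutions, insertions, deletions),
    # kept one row at a time to save memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# A misheard proper name counts as one substitution out of five words.
print(word_error_rate("call doctor nowak at nine", "call doctor novak at nine"))
```

Running it per category (names, numbers, slang) shows where a given STT model fails, which an overall average tends to hide.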

Platforms: Vapi, Retell, LiveKit, Pipecat

/// VAPI vs RETELL vs LIVEKIT vs PIPECAT: VOICE PLATFORMS

| Platform | Type | Endpointing | Strength | Best for |
| --- | --- | --- | --- | --- |
| Vapi | Managed | ~1450 ms (default) | Visual build + API | Fast start, balance |
| Retell | Managed | ~700 ms | Natural conversation | Customer support |
| LiveKit | Open-source + SIP | Configurable | Full control, WebRTC | Custom, telephony |
| Pipecat | Open-source (Python) | ~300 ms | Lowest latency | Performance, dev control |

Two open-source options (LiveKit, Pipecat), two managed (Vapi, Retell); SIP/PSTN telephony via LiveKit or Vapi.

You do not have to assemble the pipeline from scratch: orchestration platforms do it for you. They fall into two camps, managed (Vapi, Retell) and open-source (LiveKit, Pipecat):

  • Vapi — a managed platform with a visual builder and API; a good balance of ease and control; watch the default endpointing of ~1450 ms, which needs tuning
  • Retell — managed, valued for natural conversation; endpointing ~700 ms; good for customer support
  • LiveKit — open-source with native SIP/WebRTC support; full control, ideal for telephony and custom deployments
  • Pipecat — open-source in Python; the lowest latency (~300 ms endpointing); the choice for teams that value performance and developer control

The build vs buy decision:

  • Choose a managed platform (Vapi, Retell) when you want to launch fast, have no team to maintain real-time infrastructure, and accept a higher per-minute cost in exchange for convenience
  • Choose open-source (LiveKit, Pipecat) when you have an engineering team, care about the lowest latency and cost at scale, or need full control over data (e.g. self-hosting, compliance)

The rule: start with a managed platform to validate the business case in weeks, not months. Move to your own pipeline when scale makes per-minute cost and control matter more than time to deployment.

Telephony — SIP, PSTN and WebRTC

The AI pipeline alone is not everything — the agent has to connect to something. That is where the telephony layer comes in:

  • PSTN (the public telephone network) — so the agent can call and answer on regular phone numbers
  • SIP (Session Initiation Protocol) — the protocol you use to connect the agent to phone exchanges and carriers
  • WebRTC — voice through a browser or app, without a phone number (e.g. a "call us" widget on a website)

For production telephony deployments, the SIP layer is provided by LiveKit, Vapi or carriers such as Twilio or Telnyx. A well-designed agent works across all three channels (PSTN, SIP, WebRTC), so you can connect it to both a hotline and a website widget. Integrating a phone number is usually a few configuration steps with a SIP provider — you do not build it from scratch.
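Much of that configuration comes down to addressing: a dialed number is normalized and wrapped in a SIP URI of the form `sip:user@host` that points at the provider's trunk. The sketch below is deliberately naive and rests on assumptions: `to_e164` and `sip_uri` are hypothetical helpers, `sip.example.com` is a placeholder host, and real number normalization has to handle national prefixes and per-country rules.

```python
import re


def to_e164(raw: str, default_country_code: str = "1") -> str:
    """Naive normalization: keep digits, add a country code if none was given."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    return "+" + default_country_code + digits


def sip_uri(number: str, trunk_host: str) -> str:
    """Build a SIP URI that targets a number on a given trunk host."""
    return f"sip:{to_e164(number)}@{trunk_host}"


# Placeholder trunk host; a real one comes from your SIP provider's settings.
print(sip_uri("+48 600 100 200", "sip.example.com"))
```

The agent itself never needs to know which channel a call arrived on; only this addressing layer differs between PSTN, SIP and WebRTC entry points.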

What it really costs

A voice agent's cost is measured per minute of conversation and consists of several layers. Beware the marketing: platforms advertise the platform fee alone, not the all-in cost.

| Model | Advertised fee | Real all-in cost | Notes |
| --- | --- | --- | --- |
| Own pipeline (DIY) | n/a | $0.05–0.15/min | Full control, sum of STT+LLM+TTS+telephony |
| Vapi | $0.05/min (platform) | $0.11–0.25/min | Plus STT, LLM, TTS, telephony |
| Retell | $0.07/min (platform) | $0.11–0.25/min | As above |
| Bland | $0.09/min (platform) | $0.11–0.25/min | As above |

The real all-in cost for managed platforms lands between $0.11 and $0.25 per minute once you add STT, LLM, TTS and telephony. Your own pipeline comes to $0.05–0.15 per minute with full control, which is why at large scale (thousands of minutes a day) a self-built system repays the engineering team's cost. Compare that with the cost of a human agent: even $0.25 per minute is a fraction of what a call-center employee billed by the hour costs, and the voice agent works 24/7, without breaks, in parallel across hundreds of calls.
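The layering and the break-even point can be worked through in a few lines. Everything numeric here is a placeholder: the component prices are not vendor quotes, the engineering cost is an assumed monthly figure, and the function names are hypothetical.

```python
def all_in_cost_per_minute(platform_fee: float, stt: float, llm: float,
                           tts: float, telephony: float) -> float:
    """Sum every per-minute layer, not just the advertised platform fee."""
    return round(platform_fee + stt + llm + tts + telephony, 4)


def break_even_minutes_per_month(managed_per_min: float, diy_per_min: float,
                                 engineering_cost_per_month: float) -> float:
    """Monthly minutes at which per-minute savings cover the engineering cost."""
    saving_per_min = managed_per_min - diy_per_min
    if saving_per_min <= 0:
        raise ValueError("own pipeline must be cheaper per minute to break even")
    return engineering_cost_per_month / saving_per_min


# Placeholder component prices in USD per minute.
managed = all_in_cost_per_minute(platform_fee=0.05, stt=0.01, llm=0.02,
                                 tts=0.03, telephony=0.01)
print(f"managed all-in: ${managed:.2f}/min")
print(f"break-even: {break_even_minutes_per_month(managed, 0.08, 8000):,.0f} min/month")
```

Below the break-even volume the managed platform is the cheaper option overall, even though its per-minute price is higher.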

Business use cases and ROI

Voice agents excel where conversations are repetitive and volume is high:

  • Customer service and support — answering frequent questions, order status, basic troubleshooting; the agent takes the routine, humans handle the hard cases
  • Bookings and scheduling — checking the calendar, offering slots, confirmations and reminders; ideal for clinics, salons, workshops
  • Lead qualification — the agent calls new contacts, asks qualifying questions and hands hot leads to a salesperson
  • Outbound campaigns — payment reminders, satisfaction surveys, delivery confirmations — at a scale unreachable for a human team
  • 24/7 hotline — answering calls after hours, routing urgent matters, collecting information before a human contact

ROI comes from three sources: the voice agent handles hundreds of calls in parallel (scale), works around the clock without overtime (availability) and costs a fraction of a human agent's hourly rate (cost). It pays back fastest where a company loses calls after hours or where human agents spend time on repetitive, simple conversations. Before a full rollout, analyze which conversations are repetitive enough for the voice agent to take over and which require a human.
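The cost side of that ROI can be estimated with simple arithmetic. This is a rough sketch under assumptions: the hourly rate, the share of paid time actually spent on calls and the automated volume are placeholder inputs, and the function names are hypothetical.

```python
def human_cost_per_minute(hourly_rate: float, talk_time_share: float) -> float:
    """Cost of one minute a human agent actually spends on a call."""
    if not 0 < talk_time_share <= 1:
        raise ValueError("talk_time_share must be in (0, 1]")
    return hourly_rate / (60 * talk_time_share)


def monthly_saving(automated_minutes: float, human_per_min: float,
                   voice_agent_per_min: float) -> float:
    """Saving from moving a given number of call minutes to the voice agent."""
    return automated_minutes * (human_per_min - voice_agent_per_min)


# Placeholder inputs: a $30/hour agent who is on calls half of paid time.
human = human_cost_per_minute(hourly_rate=30, talk_time_share=0.5)
print(f"human: ${human:.2f}/min")
print(f"saving: ${monthly_saving(10_000, human, 0.25):,.0f}/month")
```

Idle time matters: the lower the share of paid time spent on calls, the higher the effective human cost per conversation minute.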

Common mistakes and a deployment checklist

  1. Measure full-loop latency: target ~600 ms; above it the conversation sounds artificial and callers hang up
  2. Pick fast providers at each stage (STT, LLM, TTS): it is a latency precondition, not an optimization
  3. Choose the fastest good-enough LLM, not the most intelligent one: Haiku/mini, not large models
  4. Test STT on real recordings in your language, with proper names, numbers and industry slang
  5. Implement barge-in with TTS stop in < 150 ms: the agent must fall silent when the caller starts speaking
  6. Tune end-of-turn detection (semantic VAD): balance between interrupting and awkward silence
  7. Start with a managed platform (Vapi/Retell) to validate the case in weeks
  8. Move to open-source (LiveKit/Pipecat) at scale for lower per-minute cost and full control
  9. Calculate the all-in cost, not just the platform fee: realistically $0.11–0.25/min on a managed platform
  10. Plan human escalation: the agent must be able to hand off a hard case, not get stuck in a loop
  11. Add guardrails and handling for unexpected questions: the agent must not hallucinate to a customer
  12. Pick a repetitive, high-volume case to start: bookings or FAQ, not the whole support operation at once
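The escalation item can be sketched as a small policy object. This is one possible approach, not a prescribed design: `EscalationPolicy` is a hypothetical class, the confidence score is assumed to come from your intent classifier or LLM, and both thresholds are placeholders to tune.

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    max_failed_turns: int = 2     # consecutive low-confidence turns tolerated
    min_confidence: float = 0.6   # below this a turn counts as failed
    failed_turns: int = 0

    def should_escalate(self, intent_confidence: float,
                        caller_asked_for_human: bool = False) -> bool:
        """Hand off on explicit request or after repeated misunderstanding."""
        if caller_asked_for_human:
            return True
        if intent_confidence < self.min_confidence:
            self.failed_turns += 1
        else:
            self.failed_turns = 0  # a good turn resets the counter
        return self.failed_turns >= self.max_failed_turns


policy = EscalationPolicy()
for confidence in (0.9, 0.3, 0.4):
    print(confidence, policy.should_escalate(confidence))
```

Counting consecutive failures rather than total failures keeps one misheard word from ending an otherwise healthy call, while still breaking the loop when the agent is clearly stuck.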

Key takeaways

A voice AI agent is an STT→LLM→TTS pipeline in a conversation loop, where latency matters most: the whole loop must fit within ~600 ms, the naturalness threshold. You choose the fastest good-enough LLM, not the most intelligent one, and conversation quality is decided by the details: barge-in (< 150 ms), end-of-turn detection and turn-taking. The 2026 stack: Deepgram Nova-3 (STT), Claude Haiku 4.5 (LLM), Cartesia Sonic-3 (TTS). Build on a managed platform (Vapi, Retell) for a fast start, or on open-source (LiveKit, Pipecat) for control and lower cost at scale. The real all-in cost is $0.11–0.25/min on a managed platform and $0.05–0.15/min for your own pipeline, and the best use cases are repetitive, high-volume conversations: customer support, bookings, lead qualification and outbound campaigns, with human escalation where empathy is needed.

---

I help companies design and deploy voice AI agents — from stack and platform selection, through latency optimization and language handling, to telephony integration, human escalation and ROI analysis. Get in touch — I start with a free 30-minute analysis of your use case.

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

SEO & GEO Specialist & AI Engineer

SEO/GEO specialist (10 years) and AI engineer (3 years). I build search visibility, AI systems and automations that reduce costs and improve operational efficiency.
