How does a voice AI agent work?

A voice agent combines three components in a conversation loop: STT (speech-to-text) turns the caller's speech into text and detects the end of the turn; the LLM (language model) generates a response from the transcription and conversation context; TTS (text-to-speech) turns that response into a natural voice. The secret to sounding natural is streaming: the components run in parallel, so TTS starts speaking before the LLM has finished generating its reply, and STT transcribes while the caller is still talking. The whole loop must fit within ~600 ms, because that is the threshold below which the caller stops noticing they are talking to a machine.

Why is latency so important in voice AI?

Because in natural conversation people exchange turns with gaps of around 200 ms. A second of silence that goes unnoticed in a text chat sounds like a freeze in a voice conversation. Real production deployments achieve 580–620 ms across the full STT→LLM→TTS loop, and that is the threshold at which tested callers stop noticing they are talking to AI. Each component has a budget: STT ~100–200 ms, LLM ~200–300 ms, TTS ~150–250 ms. That is why latency is the number-one parameter in voice AI, more important than model intelligence: you choose the fastest good-enough model, not the smartest one.

What is barge-in and why is it crucial?

Barge-in is the agent's ability to fall silent instantly when the caller starts speaking — just as we interrupt each other in real conversation. An agent that talks over the caller is an agent that loses the call. The mechanics have two sides: semantic VAD on the STT side detects that the caller has started speaking, and TTS playback must cut off in under 150 ms. Without working barge-in the agent "talks over" the caller and sounds unnatural, frustrating them within seconds. This, together with good end-of-turn detection, separates an agent that is fluid to talk to from one that irritates.

Should I build my own pipeline or use a ready-made platform?

It depends on the team and scale. Managed platforms (Vapi, Retell) give a fast start without maintaining real-time infrastructure — choose them when you want to launch in weeks and accept a higher per-minute cost ($0.11–0.25 all-in) in exchange for convenience. Open-source (LiveKit, Pipecat) gives full control, the lowest latency and lower cost at scale ($0.05–0.15/min), but requires an engineering team. The practical path: start with a managed platform to validate the business case quickly, and move to your own pipeline when scale makes per-minute cost and control matter more than time to deployment. Pipecat has the lowest default endpointing (~300 ms), Vapi the highest (~1450 ms, to be tuned).

How much does a voice AI agent cost per minute?

Realistically $0.11–0.25 per minute all-in on a managed platform, and $0.05–0.15 with your own pipeline. Beware the marketing: platforms advertise the platform fee alone (Vapi $0.05, Retell $0.07, Bland $0.09), but that excludes STT, LLM, TTS and telephony — once you add those layers the real cost rises to $0.11–0.25. Your own pipeline is cheaper per minute but requires a team, so it pays off only at large scale. For context: even $0.25 per minute is a fraction of a call-center employee billed by the hour, and the agent works 24/7, in parallel across hundreds of calls, without overtime — and that is where the ROI comes from.

Will a voice AI agent handle my language well?

Yes, but it requires careful selection and testing. The most important components here are STT and TTS — not all models handle every language as well as English. STT is the foundation: if transcription confuses words, the whole conversation collapses because the LLM gets the wrong text. So before choosing, test transcription on real recordings from your industry — with proper names, numbers, terminology, possibly slang. On the TTS side, check whether the voice sounds natural, with correct intonation and accent. LLMs handle most languages well, so the bottleneck is usually STT and TTS — focus your testing there.

Which tasks are best suited to a voice agent?

The repetitive, high-volume ones where the conversation structure is predictable. The best cases: answering frequent questions and order status, bookings and scheduling (checking the calendar, confirmations, reminders), lead qualification (the agent calls, asks questions, hands over hot leads), outbound campaigns (payment reminders, surveys) and a 24/7 hotline after hours. ROI is highest where a company loses calls after hours or where human agents spend time on simple, repetitive conversations. What not to delegate fully: matters requiring empathy, negotiation or unusual decisions; plan human escalation there. Start with one narrow case, not the whole support operation at once.


2026-06-17 · AI & Automation · 15 min

Voice AI Agents — How They Work, What They Cost, and How to Build a Phone Bot

A voice AI agent is a system that holds a natural phone or voice conversation by combining three components in a loop: STT (speech-to-text), LLM (response generation) and TTS (text-to-speech). The key is latency: the whole loop has to fit within ~600 ms, because that is the threshold below which the caller stops noticing they are talking to a machine. In 2026 you have two build paths: managed platforms (Vapi, Retell) for a fast start at a higher per-minute cost, or your own pipeline on open-source (LiveKit, Pipecat) for full control and lower cost but more work. The real all-in cost is $0.11–0.25 per minute, and the typical use cases are customer support, bookings, lead qualification and outbound campaigns.

The complete guide to voice AI agents: how the STT→LLM→TTS pipeline works, why latency under 600 ms decides whether it sounds human, how interruptions (barge-in) and end-of-turn detection work, which stack to choose in 2026, a comparison of the Vapi, Retell, LiveKit and Pipecat platforms, how to connect SIP/PSTN telephony, what it really costs per minute, and the business use cases with their ROI.

You call a business at 11 p.m. to reschedule an appointment. Instead of "our office is open 8 to 4" you hear a calm voice that understands your request, checks the calendar, offers three slots and confirms the change — in 40 seconds, no queue, no waiting for an agent. You do not realize you just spoke with AI. This is not the future — these are deployments that work today.

Voice AI agents are one of the fastest-growing automation categories, because the phone is still the main contact channel in many industries — and at the same time the most expensive and hardest to scale. This article shows how these systems work under the hood, why latency is paramount, which stack and platform to choose, how to connect telephony and what it really costs.

How a voice AI agent works — the STT→LLM→TTS pipeline

Figure: voice AI agent pipeline (STT → LLM → TTS in one conversation loop, from speech in to voice out)

Stage	Latency budget	Example provider	Role
STT (Speech-to-Text)	~100–200 ms	Deepgram Nova-3	Turns speech into text and detects end of turn; the foundation, because an error here breaks everything downstream
LLM (language model)	~200–300 ms	Claude Haiku 4.5	Generates the reply; token streaming lets TTS start before the whole sentence is done
TTS (Text-to-Speech)	~150–250 ms	Cartesia Sonic-3	Turns text into a natural voice; what matters is time to first audio, not the whole clip

Key thresholds: ~600 ms is where callers stop noticing AI; < 150 ms for interruption (barge-in); 200–450 ms gap between turns.

A voice agent is not one model but a pipeline of three specialized components that process every conversation turn:

STT (Speech-to-Text): turns the caller's speech into text in real time; it is the foundation, because if transcription is wrong every downstream step fails. STT does two things in parallel: it transcribes and it detects when the caller's turn has ended (end-of-turn detection)
LLM (language model): receives the transcription along with conversation context and generates a response; voice applications use fast, lightweight models (e.g. Claude Haiku) because speed matters more than maximum intelligence
TTS (Text-to-Speech): turns the model's response into a natural voice; what matters is time to first audio, not the time to generate the whole utterance

The trick that makes it sound natural is streaming at every stage. You do not wait for the caller to finish to start transcribing; you do not wait for the whole LLM response to start TTS. The components work in parallel and stream — when the LLM produces the first sentence, TTS is already speaking it while the model keeps generating. That is the difference between a robot reading from a script and a fluid conversation.
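
The hand-off between the LLM and TTS can be sketched as a small buffer that releases each sentence the moment it is complete. This is a minimal sketch, not any provider's API: `sentences_from_tokens` and `fake_llm_stream` are hypothetical names, and the punctuation rule is deliberately naive (it would split on a decimal point, for example).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete, without waiting for the full reply."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush when the buffered text ends with sentence punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    yield from ["Sure", ",", " I", " can", " help", ".", " Which", " day", " works", "?"]


spoken = []
for sentence in sentences_from_tokens(fake_llm_stream()):
    # A real pipeline would hand the sentence to the TTS client here,
    # while the model keeps generating the next one.
    spoken.append(sentence)
```

Because the function is a generator, the first sentence is available to TTS while later tokens are still arriving, which is the overlap the paragraph above describes.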

Latency — the single most important parameter

In a text chatbot a 2-second delay is tolerable. In a voice conversation it is a chasm — in natural conversation people exchange turns with gaps of around 200 ms. That is why latency is the number-one parameter in voice AI, more important than the model's "intelligence".

Latency metric	Target	What it means
Full loop (STT→LLM→TTS)	~600 ms	The threshold where the caller stops noticing it is AI
Barge-in (interruption)	< 150 ms	From the start of the caller's speech to the agent's voice stopping
Gap between turns	200–450 ms	From the agent finishing to the first audio of the next turn
End-of-turn detection	The long pole	Detecting the caller has finished — tuned to avoid false cuts

Real production deployments achieve 580–620 ms across the full loop — and that is precisely the threshold at which tested callers stop noticing they are talking to AI. Each component has its budget: STT ~100–200 ms, LLM ~200–300 ms, TTS ~150–250 ms. The sum has to fit, so choosing fast providers at each stage is not an optimization — it is a precondition for working.
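
The budget arithmetic can be written down as a quick check. The sketch below uses only the per-stage figures quoted above; the function names are hypothetical, and summing the stages as if they ran strictly one after another is a simplification, since streaming lets them overlap.

```python
# Per-stage budgets in milliseconds, from the figures above: (best case, worst case).
BUDGET_MS = {
    "stt": (100, 200),
    "llm": (200, 300),
    "tts": (150, 250),
}
TARGET_MS = 600


def loop_range(budget: dict) -> tuple:
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def over_budget(measured_ms: dict, target_ms: int = TARGET_MS) -> float:
    """Return by how many ms a measured turn exceeds the target (0.0 if within it)."""
    return max(0.0, sum(measured_ms.values()) - target_ms)


best, worst = loop_range(BUDGET_MS)
typical_overrun = over_budget({"stt": 150, "llm": 250, "tts": 200})
worst_overrun = over_budget({"stt": 200, "llm": 300, "tts": 250})
```

The stage budgets add up to a range of 450–750 ms, so a turn in which every stage hits the top of its budget overshoots the ~600 ms target by 150 ms. That is the sense in which fast providers at each stage are a precondition rather than an optimization.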

End-of-turn detection is usually the hardest part of the whole system, because it is a trade-off. If the agent reacts too fast, it interrupts the caller mid-sentence (for example, when they pause for breath). Too slow, and the conversation drags with awkward silences. That is why modern systems use semantic VAD (voice activity detection), which judges whether a sentence is complete rather than merely detecting silence.
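
To illustrate the idea of combining silence with sentence completeness, here is a toy decision rule. It is a sketch under loud assumptions: production systems use a trained model rather than a word list, and the filler list, both thresholds and all function names are invented for illustration.

```python
# Words that usually signal an unfinished thought (illustrative, not exhaustive).
TRAILING_FILLERS = {"and", "but", "so", "because", "or", "um", "uh"}


def looks_complete(transcript: str) -> bool:
    """Crude stand-in for a semantic end-of-turn model."""
    words = transcript.lower().rstrip(".,?! ").split()
    if not words:
        return False
    return words[-1] not in TRAILING_FILLERS


def end_of_turn(transcript: str, silence_ms: int,
                short_wait_ms: int = 300, long_wait_ms: int = 1200) -> bool:
    """Decide whether the caller has finished speaking.

    A complete-looking utterance needs only a short pause; an
    incomplete-looking one gets a longer grace period before the agent replies.
    """
    if not transcript.strip():
        return False  # nothing said yet, so there is no turn to end
    required = short_wait_ms if looks_complete(transcript) else long_wait_ms
    return silence_ms >= required
```

With these placeholder thresholds, a 350 ms pause ends the turn after "I'd like to book for Tuesday" but not after "I'd like to book for Tuesday and", which is the behavior that silence-only detection cannot provide.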

Interruptions and turn-taking — the secret to natural conversation

A voice agent that talks over the caller is a voice agent that loses the call. In real conversation we interrupt each other — "yes, exactly", "no, I meant..." — and the agent has to handle it. This is called barge-in: the ability to fall silent instantly when the caller starts speaking.

Barge-in mechanics have two sides:

Detecting the interruption — semantic VAD on the STT side recognizes that the caller has started speaking while the agent is still talking
Instant TTS stop — the agent's voice playback must cut off in < 150 ms from the start of the caller's speech; any delay makes the agent "talk over" and sound unnatural
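
The two sides of barge-in can be sketched as a small controller: one method is called when VAD reports caller speech, and it silences playback and discards anything still queued for TTS. The class, its method names and the `stop_playback` hook are hypothetical; timestamps are passed in explicitly so the reaction time can be compared against the 150 ms target.

```python
BARGE_IN_BUDGET_MS = 150  # target from the text above


class BargeInController:
    """Tracks agent playback and cuts it when the caller starts speaking."""

    def __init__(self, stop_playback):
        self._stop_playback = stop_playback  # hypothetical hook that silences audio output
        self._queue = []                     # sentences still waiting for TTS
        self.agent_speaking = False

    def agent_says(self, sentences):
        """Queue sentences for playback and mark the agent as speaking."""
        self._queue.extend(sentences)
        self.agent_speaking = True

    def on_caller_speech(self, detected_at_ms, now_ms):
        """Handle a VAD speech-start event.

        Returns the reaction time in ms, or None if the agent was already silent.
        """
        if not self.agent_speaking:
            return None
        self._stop_playback()
        self._queue.clear()  # drop unsent sentences so they are not spoken later
        self.agent_speaking = False
        return now_ms - detected_at_ms


stopped = []
ctrl = BargeInController(stop_playback=lambda: stopped.append(True))
ctrl.agent_says(["Your appointment is on Tuesday.", "Would you like a reminder?"])
reaction_ms = ctrl.on_caller_speech(detected_at_ms=1000, now_ms=1090)
within_budget = reaction_ms is not None and reaction_ms < BARGE_IN_BUDGET_MS
```

Clearing the queue matters as much as stopping the audio: otherwise the agent resumes the interrupted sentence as soon as the caller pauses.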

Turn-taking is the broader conversational policy that decides who "holds the floor" at any moment. A good agent not only reacts to interruptions but also knows when to pause, when to acknowledge ("mhm", "I see"), and when to wait because the caller has not finished their thought. It is these details, not voice quality itself, that separate an agent that is pleasant to talk to from one that frustrates after 15 seconds.

The 2026 technology stack

The choice of providers at each pipeline stage decides latency and quality. The proven "sweet spot" for 2026:

Component	2026 recommendation	Alternatives	Why
STT	Deepgram Nova-3	AssemblyAI	Best streaming latency and accuracy
LLM	Claude Haiku 4.5	GPT-4o-mini, Gemini Flash	Fast, cheap, intelligent enough for conversation
TTS	Cartesia Sonic-3	ElevenLabs, Deepgram Aura-2	Lowest time to first audio, natural voice

This stack delivers a total latency of 550–700 ms. The key selection rule: in voice AI you do not pick the most intelligent LLM, you pick the fastest one that is good enough. A phone conversation rarely needs GPT-4o-level reasoning — it needs instant reaction. Claude Haiku or GPT-4o-mini respond in a fraction of the time of large models, and for most scenarios (bookings, FAQ, qualification) their capabilities are more than enough.
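
The selection rule can be expressed as a filter followed by a minimum. In this sketch the candidate names and all measurements are placeholders invented for illustration, not benchmarks of any real model; the idea is to plug in your own time-to-first-token and task-accuracy numbers.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    first_token_ms: float  # measured time to first token
    task_score: float      # share of test conversations handled correctly, 0..1


def fastest_good_enough(candidates: Sequence[Candidate],
                        min_score: float) -> Optional[Candidate]:
    """Return the lowest-latency candidate that meets the quality bar, if any."""
    eligible = [c for c in candidates if c.task_score >= min_score]
    return min(eligible, key=lambda c: c.first_token_ms, default=None)


# Placeholder measurements for illustration only; replace with your own tests.
candidates = [
    Candidate("small-model", 220, 0.93),
    Candidate("medium-model", 450, 0.96),
    Candidate("large-model", 900, 0.98),
]
choice = fastest_good_enough(candidates, min_score=0.90)
```

Quality acts as a threshold and latency as the objective, so raising `min_score` is the only thing that can push the choice toward a slower model.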

For languages other than English, pay special attention to STT and TTS — not all models handle them as well as English. Test transcription on real recordings from your industry (with slang, proper names, numbers) before choosing, because it is the foundation — an STT error ruins the whole conversation.

Platforms: Vapi, Retell, LiveKit, Pipecat

Vapi vs Retell vs LiveKit vs Pipecat: voice platforms

Platform	Type	Endpointing	Strength	Best for
Vapi	Managed	~1450 ms (default)	Visual build + API	Fast start, balance
Retell	Managed	~700 ms	Natural conversation	Customer support
LiveKit	Open-source + SIP	Configurable	Full control, WebRTC	Custom, telephony
Pipecat	Open-source (Python)	~300 ms	Lowest latency	Performance, dev control

Open-source: LiveKit, Pipecat. Managed: Vapi, Retell. SIP/PSTN telephony via LiveKit or Vapi.

You do not have to assemble the pipeline from scratch — orchestration platforms do it for you. They fall into two camps:

Vapi — a managed platform with a visual builder and API; a good balance of ease and control; watch the default endpointing of ~1450 ms, which needs tuning
Retell — managed, valued for natural conversation; endpointing ~700 ms; good for customer support
LiveKit — open-source with native SIP/WebRTC support; full control, ideal for telephony and custom deployments
Pipecat — open-source in Python; the lowest latency (~300 ms endpointing); the choice for teams that value performance and developer control

The build vs buy decision:

Choose a managed platform (Vapi, Retell) when you want to launch fast, have no team to maintain real-time infrastructure, and accept a higher per-minute cost in exchange for convenience
Choose open-source (LiveKit, Pipecat) when you have an engineering team, care about the lowest latency and cost at scale, or need full control over data (e.g. self-hosting, compliance)

The rule: start with a managed platform to validate the business case in weeks, not months. Move to your own pipeline when scale makes per-minute cost and control matter more than time to deployment.

Telephony — SIP, PSTN and WebRTC

The AI pipeline alone is not everything — the agent has to connect to something. That is where the telephony layer comes in:

PSTN (the public telephone network) — so the agent can call and answer on regular phone numbers
SIP (Session Initiation Protocol) — the protocol you use to connect the agent to phone exchanges and carriers
WebRTC — voice through a browser or app, without a phone number (e.g. a "call us" widget on a website)

For production telephony deployments, the SIP layer is provided by LiveKit, Vapi or carriers such as Twilio or Telnyx. A well-designed agent works across all three channels (PSTN, SIP, WebRTC), so you can connect it to both a hotline and a website widget. Integrating a phone number is usually a few configuration steps with a SIP provider — you do not build it from scratch.

What it really costs

A voice agent's cost is measured per minute of conversation and consists of several layers. Beware the marketing: platforms advertise the platform fee alone, not the all-in cost.

Model	Advertised fee	Real all-in cost	Notes
Own pipeline (DIY)	—	$0.05–0.15/min	Full control, sum of STT+LLM+TTS+telephony
Vapi	$0.05/min (platform)	$0.11–0.25/min	Plus STT, LLM, TTS, telephony
Retell	$0.07/min (platform)	$0.11–0.25/min	As above
Bland	$0.09/min (platform)	$0.11–0.25/min	As above

The real all-in cost for managed platforms lands between $0.11 and $0.25 per minute once you add STT, LLM, TTS and telephony. Your own pipeline gives $0.05–0.15 per minute with full control, which is why at large scale (thousands of minutes a day) a self-built system repays the engineering team's cost. Compare that to the cost of a human agent: even $0.25 per minute is a fraction of a call-center employee billed by the hour, and the voice agent works 24/7, without breaks, in parallel across hundreds of calls.
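
The managed-versus-DIY trade-off comes down to a break-even volume. The sketch below takes the per-minute ranges from the table above; the monthly engineering cost is a hypothetical placeholder, and the function names are illustrative.

```python
# Per-minute all-in cost ranges in USD, from the table above: (low, high).
MANAGED = (0.11, 0.25)
DIY = (0.05, 0.15)


def monthly_cost(minutes_per_day: float, rate_per_min: float, days: int = 30) -> float:
    """Total monthly spend at a given daily call volume."""
    return minutes_per_day * days * rate_per_min


def break_even_minutes_per_day(team_cost_per_month: float, managed_rate: float,
                               diy_rate: float, days: int = 30) -> float:
    """Daily volume at which DIY savings cover the engineering cost."""
    saving_per_min = managed_rate - diy_rate
    if saving_per_min <= 0:
        raise ValueError("DIY must be cheaper per minute than managed")
    return team_cost_per_month / (saving_per_min * days)


# Midpoints of the two ranges; the team cost below is a hypothetical placeholder.
managed_mid = sum(MANAGED) / 2
diy_mid = sum(DIY) / 2
threshold = break_even_minutes_per_day(12_000, managed_mid, diy_mid)
```

With the range midpoints and a placeholder team cost of $12,000 a month, the break-even lands at roughly 5,000 minutes a day, consistent with the "thousands of minutes a day" scale mentioned above. Below that volume the managed platform is cheaper overall even though its per-minute price is higher.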

Business use cases and ROI

Voice agents excel where conversations are repetitive and volume is high:

Customer service and support — answering frequent questions, order status, basic troubleshooting; the agent takes the routine, humans handle the hard cases
Bookings and scheduling — checking the calendar, offering slots, confirmations and reminders; ideal for clinics, salons, workshops
Lead qualification — the agent calls new contacts, asks qualifying questions and hands hot leads to a salesperson
Outbound campaigns — payment reminders, satisfaction surveys, delivery confirmations — at a scale unreachable for a human team
24/7 hotline — answering calls after hours, routing urgent matters, collecting information before a human contact

ROI comes from three sources: the agent handles hundreds of calls in parallel (scale), works around the clock without overtime (availability) and costs a fraction of a human agent's hourly rate (cost). It pays back fastest where a company loses calls after hours or where human agents spend time on repetitive, simple conversations. Before a full rollout, it is worth analyzing which conversations are repetitive enough for the voice agent to take over and which require a human.

Common mistakes and a deployment checklist

1. Measure full-loop latency: target ~600 ms; above it the conversation sounds artificial and callers hang up
2. Pick fast providers at each stage (STT, LLM, TTS): it is a latency precondition, not an optimization
3. Choose the fastest good-enough LLM, not the most intelligent one: Haiku/mini, not large models
4. Test STT on real recordings in your language, with proper names, numbers, industry slang
5. Implement barge-in with TTS stop in < 150 ms: the agent must fall silent when the caller starts speaking
6. Tune end-of-turn detection (semantic VAD): balance between interrupting and awkward silence
7. Start with a managed platform (Vapi/Retell) to validate the case in weeks
8. Move to open-source (LiveKit/Pipecat) at scale: lower per-minute cost and full control
9. Calculate the all-in cost, not just the platform fee: realistically $0.11–0.25/min on a managed platform
10. Plan human escalation: the agent must be able to hand off a hard case, not get stuck in a loop
11. Add guardrails and handling for unexpected questions: the agent must not hallucinate to a customer
12. Pick a repetitive, high-volume case to start: bookings or FAQ, not the whole support operation at once

Key takeaways

A voice AI agent is an STT→LLM→TTS pipeline in a conversation loop, where latency matters most — the whole loop must fit within ~600 ms, the naturalness threshold. You choose the fastest good-enough LLM, not the most intelligent one, and conversation quality is decided by the details: barge-in (< 150 ms), end-of-turn detection and turn-taking. The 2026 stack: Deepgram Nova-3 (STT), Claude Haiku 4.5 (LLM), Cartesia Sonic-3 (TTS). Build on a managed platform (Vapi, Retell) for a fast start, or on open-source (LiveKit, Pipecat) for control and lower cost at scale. The real cost is $0.11–0.25/min all-in, and the best use cases are repetitive, high-volume conversations: customer support, bookings, lead qualification and outbound campaigns — with human escalation where empathy is needed.

---

I help companies design and deploy voice AI agents — from stack and platform selection, through latency optimization and language handling, to telephony integration, human escalation and ROI analysis. Get in touch — I start with a free 30-minute analysis of your use case.


/// AUTHOR

Paweł Wiszniewski

SEO & GEO Specialist & AI Engineer

SEO/GEO specialist (10 years) and AI engineer (3 years). I build search visibility, AI systems and automations that reduce costs and improve operational efficiency.
