AI Voice Basics

How Do AI Voice Agents Work? The 9-Step Journey of a Real Phone Call

The 9-step anatomy of an AI-handled phone call — SIP to speech-to-text to LLM to function calls to TTS — with latency budgets, honest costs, and failure modes.

Jordan Reyes, Analytics Lead, MapleVoice · Jun 12, 2026 · 28 min read

AI voice agents work by streaming a caller's audio off the phone network into a chain of three AI systems: speech-to-text transcribes words as they're spoken, a large language model decides what to say and what actions to take (like booking an appointment through a function call), and text-to-speech converts the reply into a natural voice — fast enough, ideally around one second per turn, that the caller experiences an ordinary conversation.

That one-sentence version is where most explanations stop — leaving out everything that decides whether a voice agent actually works in production: the telephony plumbing that connects it to a real phone call, the millisecond budgets that separate natural from robotic, the function-calling mechanism that lets it do things instead of just talk, and the failure modes nobody puts in the demo. This article walks all nine steps, from carrier handoff to the summary landing in your CRM, with honest cost math along the way.

One scoping note: this is the mechanics deep-dive in our series. If you're still mapping the category — what these systems are, who they're for, how they differ from IVRs, chatbots, and answering services — start with the companion pillar at /blog/what-is-an-ai-voice-agent and come back.

First, the one-minute refresher — then the 9-step journey of a real phone call

An AI voice agent is software that answers (or places) real phone calls and holds an open-ended conversation to get something done: booking an appointment, qualifying a lead, taking an order, answering questions. It is not an IVR phone tree ("press 1 for sales"). The defining trait is agency: it takes actions in your business systems mid-call rather than just talking about them.

The businesses that benefit most share a profile: revenue arrives by phone, calls go unanswered (after hours, lunch rushes, every-tech-on-a-jobsite afternoons), and most calls are routine. Dental offices, home-services companies, restaurants, law firms, and real-estate teams are the classic cases. The problem solved is rarely "replace the front desk" — it's "stop sending revenue to voicemail."

The category is growing fast: according to AssemblyAI, citing Straits Research, the voice and speech recognition market was $14.8 billion in 2024, projected to exceed $61 billion by 2033. For the phone-tree comparison, see /blog/ai-voice-vs-ivr.

With the table set, here is the framework the rest of this article hangs on: every call passes through nine steps. Vendors usually show you steps 4 through 7, the AI sandwich in the middle, and skip the telephony up front (steps 1–2) and the exits and bookkeeping at the back (steps 8–9), which is where implementations succeed or fail. Timings are typical engineering targets, not guarantees.

  • Step 1 — The call hits your number (0–2 seconds). A customer dials your business number; the carrier routes it across the public phone network to a SIP endpoint at the voice platform. Artifact: a call record with caller ID, timestamp, and number dialed.
  • Step 2 — A media stream opens (~100–300 ms). The platform answers and opens a live two-way audio stream into the AI pipeline; the greeting fires immediately — a good agent answers before the second ring. Artifact: a live audio session, usually recorded from this moment.
  • Step 3 — Voice activity detection and endpointing (continuous). Software detects when the caller starts speaking and — much harder — when they've finished. Most systems treat roughly 500–800 ms of trailing silence as end-of-turn. Artifact: speech segments with timestamps.
  • Step 4 — Streaming speech-to-text (partials every ~100–300 ms). The caller's words become text while they're still talking, with a corrected final transcript at end-of-turn. Artifact: a turn transcript with confidence scores.
  • Step 5 — The language model thinks (~200–500 ms to first output). The transcript, conversation so far, and your business knowledge feed an LLM, which decides what to say — and whether an action is needed instead. Artifact: a reply, and sometimes a function call.
  • Step 6 — Function calling (200 ms to several seconds). If the caller wants something done, the model emits a structured request — check the calendar, look up the order, create the booking — and your real systems execute it. Artifact: an API call and its result, such as a confirmed 2:30 pm slot.
  • Step 7 — Streaming text-to-speech (~100–300 ms to first audio). The reply is spoken in a natural voice, streamed so playback starts before the sentence is fully generated — while the system keeps listening for interruptions. Artifact: the agent's audio turn.
  • Step 8 — Exits and edge routes (as needed). Calls that shouldn't end with the AI get routed out: a warm transfer with a whispered context summary, a text with a booking link, or a message for callback. Artifact: a transfer log or follow-up task.
  • Step 9 — Post-call processing (seconds after hangup). The recording is re-transcribed and summarized; the call gets a reason, outcome, and next step; everything is written to the CRM and analytics. Artifact: the permanent record your team actually reads.

Steps 1 and 2: the telephony layer nobody explains

A voice agent lives in a cloud platform, and the public switched telephone network (PSTN) has to deliver calls to it. There are two ways to connect them: forwarding and porting. Forwarding leaves your number with your current carrier and forwards calls — unconditionally, or only when busy or unanswered — to a number the platform controls. Porting moves the number itself to a voice-API carrier such as Twilio or Telnyx. Forwarding is reversible in minutes, so most businesses start there; porting takes days to weeks but removes one hop and one vendor from the chain.

Once the call arrives, it is set up over SIP, the signaling protocol behind modern internet phone calls, and the platform opens a media stream: a continuous two-way audio feed into the AI pipeline. Telephony audio is narrowband (traditionally sampled at 8 kHz, far below podcast quality), which is one reason speech recognition on phone calls is harder than on a laptop microphone.

Two details worth knowing. First, the greeting is usually a pre-rendered audio clip, not live AI generation — that's how a well-configured agent answers in under two seconds every time, including 2 am on a Sunday. Second, on outbound calls, US carriers use the STIR/SHAKEN framework to attest that caller ID isn't spoofed; a legitimate platform handles this for you, which helps keep calls from being flagged as spam.

Steps 3 and 4: how the agent hears — VAD, endpointing, and streaming transcription

Before any AI "understands" anything, two humble components decide what counts as speech. Voice activity detection (VAD) separates talking from silence, breathing, hold music, and the TV in the background. Endpointing decides the caller has finished their turn — the hardest small problem in voice AI. Wait too little and the agent interrupts people mid-sentence; wait too long and every exchange carries a dead-air gap. Most production systems treat roughly half a second to 800 ms of trailing silence as end-of-turn, often adjusted dynamically — a longer pause is tolerated mid-phone-number than after a "yes."
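The trailing-silence rule can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: it assumes a VAD has already labeled fixed-size audio frames as speech or silence, and the 20 ms frame size, the function name `find_end_of_turn`, and the example thresholds are all assumptions made for the example.

```python
from typing import Optional, Sequence

FRAME_MS = 20  # assumed frame duration; real VADs commonly use 10-30 ms frames


def find_end_of_turn(frames: Sequence[bool], silence_ms: int = 700) -> Optional[int]:
    """Return the index of the frame at which end-of-turn fires, or None.

    `frames` holds one VAD decision per frame: True = speech, False = silence.
    End-of-turn fires once the caller has spoken at least once and the
    trailing silence reaches `silence_ms`.
    """
    needed = silence_ms // FRAME_MS  # consecutive silent frames required
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# One second of speech followed by one second of silence, in 20 ms frames.
frames = [True] * 50 + [False] * 50
fast = find_end_of_turn(frames, silence_ms=500)  # e.g. after a quick "yes"
slow = find_end_of_turn(frames, silence_ms=800)  # e.g. mid-phone-number
print(fast, slow)
```

Raising the threshold from 500 ms to 800 ms delays the decision by 15 frames, or 300 ms of extra dead air on every turn. That trade-off is why the threshold tends to be adjusted by context rather than fixed.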

Speech-to-text (also called ASR) then converts audio into words, and the operative word is streaming: the transcriber emits partial results every few hundred milliseconds while the caller is still talking, then a corrected final transcript at end-of-turn. That overlap is the first big latency saving — the system isn't waiting for you to finish before it starts reading.

Accuracy is genuinely good now — AssemblyAI cites a NIST report finding top systems reach word error rates as low as 4.9 percent — but errors cluster exactly where business calls are most sensitive: names, street addresses, phone numbers, emails. That's why well-designed agents confirm by reading back: "That's K-A-T-E at 416-555-0192 — did I get that right?" If a vendor demo never reads anything back, the agent wasn't built for real calls.
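A read-back step might look roughly like the sketch below. It assumes 10-digit North American numbers and a transcript in which digits arrive as words or numerals; the helper names `spoken_to_digits` and `phone_read_back` are made up for illustration, and patterns such as "double five" are deliberately not handled.

```python
import re

SPOKEN_DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def spoken_to_digits(transcript: str) -> str:
    """Map spoken digit words (and literal digits) to a digit string."""
    tokens = re.findall(r"[a-z]+|\d", transcript.lower())
    # Unknown words are dropped; literal digits pass through unchanged.
    return "".join(SPOKEN_DIGITS.get(t, t if t.isdigit() else "") for t in tokens)


def phone_read_back(transcript: str) -> str:
    """Build the confirmation prompt, or ask again if the digit count is off."""
    digits = spoken_to_digits(transcript)
    if len(digits) != 10:  # assumption: 10-digit North American numbers
        return "Sorry, could you repeat that number slowly?"
    return f"That's {digits[:3]}-{digits[3:6]}-{digits[6:]}, correct?"


print(phone_read_back("Six-four-seven, five-five-five, oh-one-nine-eight."))
```

The digit-count check matters as much as the formatting: a dropped digit is caught before the agent reads back a number that cannot be right.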

Step 5: how the agent thinks — the language model's turn

The transcript lands in a large language model — the same technology behind ChatGPT and Claude — alongside three other inputs: a system prompt (the agent's standing instructions: who it is, what it can do, what it must never do), the conversation so far, and your business knowledge — hours, services, pricing, policies — retrieved on demand from a knowledge base, a technique called retrieval-augmented generation (RAG).

Grounding matters more than model brilliance. An ungrounded model asked "do you do same-day crowns?" will produce a fluent, confident answer that may be wrong. A grounded agent answers only from your documents and defers — "let me have someone confirm that" — when the answer isn't there. "Training" a voice agent (covered at /blog/how-to-train-an-ai-voice-agent) is mostly this: not retraining a neural network, but writing instructions and curating the knowledge the model may use.
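The answer-or-defer rule can be shown with a toy example. Keyword lookup stands in for real retrieval here, and in a production pipeline the retrieved passages would be handed to the language model rather than returned verbatim. The knowledge-base entries and the name `grounded_answer` are hypothetical.

```python
import re
from typing import Dict

# Hypothetical knowledge base: topic keyword -> approved answer.
KNOWLEDGE_BASE: Dict[str, str] = {
    "hours": "We're open Monday to Friday, 8 am to 5 pm.",
    "cleaning": "A standard cleaning takes about 45 minutes.",
}
DEFERRAL = "Let me have someone confirm that."


def grounded_answer(question: str) -> str:
    """Answer only from the knowledge base; defer when nothing matches."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for topic, answer in KNOWLEDGE_BASE.items():
        if topic in words:
            return answer
    return DEFERRAL


print(grounded_answer("What are your hours?"))
print(grounded_answer("Do you do same-day crowns?"))
```

The second question has no matching entry, so the agent defers instead of improvising. That default is the behavior worth testing for.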

The model's output isn't only words. On every turn it makes a quiet decision: is this a turn where I reply, or a turn where I act? That decision is function calling — the most important concept in this pipeline, and the one almost no popular explainer covers.

Step 6: function calling, explained for non-engineers

Function calling is the mechanism that turns a talking FAQ into an agent. The whole idea: at setup, the agent gets a menu of actions it's allowed to take, each with a name and required details. A booking action might be described as: book_appointment, requiring patient_name, phone, service, date, and time. The language model never touches your calendar directly — it just knows this action exists and what it needs.

Mid-conversation, when a caller says "can I get a cleaning Thursday afternoon?", the model recognizes that words alone can't answer — it needs facts. So instead of replying, it emits a structured request, essentially a small filled-in form: action: check_availability, service: cleaning, date: Thursday, time_window: afternoon. The platform calls your booking system's API and hands back the result: 2:30 and 4:00 are open. Only then does the model speak: "I have 2:30 or 4 o'clock on Thursday — which works better?"
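For readers who want to see the shape of it, here is a minimal sketch of the menu and the dispatch step. The action names `check_availability` and `book_appointment` come from the examples above; the request layout, the `dispatch` function, and the fake calendar result are assumptions for illustration, not any specific platform's API.

```python
from typing import Any, Callable, Dict, List

# The "menu": each action's name and the details it requires.
TOOL_MENU: Dict[str, List[str]] = {
    "check_availability": ["service", "date", "time_window"],
    "book_appointment": ["patient_name", "phone", "service", "date", "time"],
}


def check_availability(service: str, date: str, time_window: str) -> Dict[str, Any]:
    # Stand-in for a real booking-system API call (hypothetical data).
    return {"date": date, "open_slots": ["2:30 pm", "4:00 pm"]}


# Only actions wired to a handler can actually run.
HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "check_availability": check_availability,
}


def dispatch(request: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a model-emitted request against the menu, then run it."""
    action = request.get("action")
    if action not in TOOL_MENU:
        return {"error": f"unknown action: {action}"}
    args = request.get("arguments", {})
    missing = [f for f in TOOL_MENU[action] if f not in args]
    if missing:
        # The model should ask the caller for these before retrying.
        return {"error": "missing fields", "fields": missing}
    handler = HANDLERS.get(action)
    if handler is None:
        return {"error": f"no handler wired for: {action}"}
    # Pass only the fields the menu declares; extra keys are ignored.
    return handler(**{f: args[f] for f in TOOL_MENU[action]})


result = dispatch({
    "action": "check_availability",
    "arguments": {"service": "cleaning", "date": "Thursday", "time_window": "afternoon"},
})
print(result)
```

A request for an action that is not on the menu, such as a made-up `wipe_calendar`, never reaches your systems: `dispatch` rejects it before any handler runs.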

Three practical consequences. First, the agent can only do what's on the menu — it physically cannot wipe your calendar, because no such action exists for it to call. Second, integrations are where real setup effort goes; the conversation is the easy part, the calendar, CRM, and POS wiring is the work (see /integrations). Third, tools take time — a calendar API might respond in 300 ms, a creaky CRM in three seconds — so good agents mask the wait the way humans do: "Sure, let me pull up Thursday for you." That sentence isn't personality. It's latency engineering.
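Latency masking is straightforward to sketch with Python's asyncio. This is an illustration under stated assumptions: `slow_lookup` stands in for a real API call, and the timing parameters and the fallback sentence are placeholders you would tune.

```python
import asyncio
from typing import List

FILLER = "Sure, let me pull up Thursday for you."
FALLBACK = "I'll text you the confirmation."


async def slow_lookup(delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stand-in for a real API round trip
    return "I have 2:30 or 4 o'clock on Thursday."


async def call_with_masking(delay_s: float, mask_after_s: float, timeout_s: float) -> List[str]:
    """Return what the agent would say, in order, for one tool call."""
    spoken: List[str] = []
    task = asyncio.create_task(slow_lookup(delay_s))
    # Give the tool a brief head start; fast tools need no filler at all.
    done, _ = await asyncio.wait({task}, timeout=mask_after_s)
    if not done:
        spoken.append(FILLER)
        done, _ = await asyncio.wait({task}, timeout=timeout_s - mask_after_s)
    if done:
        spoken.append(task.result())
    else:
        task.cancel()  # give up on the tool and degrade gracefully
        spoken.append(FALLBACK)
    return spoken


print(asyncio.run(call_with_masking(delay_s=0.3, mask_after_s=0.1, timeout_s=1.0)))
```

The two-stage wait is the design choice worth noting: a tool that answers almost instantly gets no filler sentence, a slow one gets the filler, and a stalled one gets a spoken fallback instead of silence.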

Step 7: how the agent speaks — streaming TTS and barge-in

Text-to-speech converts the model's words into a voice, and again the operative word is streaming: TTS starts speaking the first words while the rest of the reply is still being generated, with first audio typically landing 100–300 ms after the first text reaches it. Without streaming, the caller would wait for the entire reply to be composed and rendered, easily two to four extra seconds of silence per turn.

The subtler skill is barge-in: what happens when the caller interrupts. Humans interrupt constantly — "actually, make it Friday" — and a system that bulldozes through its sentence feels like an IVR no matter how lifelike its voice. Barge-in works because VAD never stops running, even while the agent talks: caller speech during playback halts the audio within a few hundred milliseconds, discards the unspoken remainder, and starts a new turn — with the agent's memory updated to what it actually said, not what it planned to say.

Barge-in has a failure mode of its own: line echo and background noise can trigger false interruptions, making the agent stop mid-sentence for no reason. Production systems use echo cancellation and require sustained speech — not a cough — before yielding the floor. This tuning is invisible when done well and excruciating when not; listen for it in any demo call.
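Both halves of barge-in, the sustained-speech requirement and the truncated memory, fit in a short sketch. The 20 ms frame size, the 200 ms threshold, and the word-level timing estimate are assumptions for illustration; real systems work from audio timestamps rather than an average per-word duration.

```python
from typing import List, Optional, Sequence

FRAME_MS = 20       # assumed VAD frame duration
SUSTAINED_MS = 200  # assumed: caller speech must persist this long to count


def barge_in_frame(caller_frames: Sequence[bool]) -> Optional[int]:
    """Index of the frame where playback should halt, or None.

    A short burst (a cough, a click) is ignored; only a run of speech
    frames lasting SUSTAINED_MS triggers the interruption.
    """
    needed = SUSTAINED_MS // FRAME_MS
    run = 0
    for i, is_speech in enumerate(caller_frames):
        run = run + 1 if is_speech else 0
        if run >= needed:
            return i
    return None


def spoken_so_far(words: List[str], ms_per_word: int, halt_ms: int) -> str:
    """What the agent actually said before being cut off (rough estimate)."""
    return " ".join(words[: halt_ms // ms_per_word])


cough = [False] * 10 + [True] * 3 + [False] * 40       # 60 ms burst: ignored
interruption = [False] * 50 + [True] * 20              # sustained speech
halt_index = barge_in_frame(interruption)
halt_ms = (halt_index + 1) * FRAME_MS
reply = "I have 9:40 am or 1:15 pm with Dr. Osei".split()
print(barge_in_frame(cough), halt_ms, spoken_so_far(reply, 300, halt_ms))
```

Only the truncated text goes into the conversation history, so the model's next turn reasons from what the caller heard rather than from the full sentence it had planned.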

Step 8: the exits — transfers, voicemail, and fallbacks

Not every call should end with the AI; the difference between a good deployment and a bad one is mostly what happens at the edges. Routing policy belongs to you, not the vendor: many businesses start after-hours only, expand to overflow, and move to first-line answering once trust is earned (see /use-cases/after-hours-answering for that progression). The four standard exits:

  • Warm transfer to a human. The agent dials your staff line, bridges the calls, and passes context — on a screen or as a whispered summary: "I have Maria on the line; she's asking about a billing dispute from March." The caller never repeats themselves; a transfer without context is just a slower way to annoy someone.
  • Message plus text fallback. If no human is available, the agent takes a structured message and texts the caller a booking link or confirmation — turning a dead end into a task with a timestamp.
  • Scheduled callback. The agent books the callback into a calendar like any other appointment, so the promise has an owner.
  • Outbound-side detection. On outbound calls (reminders, missed-call recovery), the platform detects answering machines and leaves a compliant voicemail or schedules a retry — instead of letting the AI converse with a beep.

Step 9: after the hangup — the record, the metrics, and the QA loop

The call ends; the work product begins. Within seconds, the platform re-transcribes the recording — more accurately than real time allowed — and produces the structured record: a summary, the call reason, the outcome (booked, message taken, transferred, abandoned), and the next step with an owner.

Then it writes everything where work happens: the appointment is already on the calendar (that happened mid-call, in step 6), the contact and summary land in the CRM, follow-ups hit the task list, and analytics roll up volume, after-hours share, booking rate, and transfer rate. At MapleVoice, every call produces a recording, transcript, summary, call reason, outcome, and next step — that artifact set is the deliverable, not an add-on.

Step 9 is also where you find out whether the agent is actually working. AssemblyAI's implementation guide recommends tracking completion rate (conversations that achieve their goal), escalation rate (transfers to humans), average handling time, and user satisfaction — a sound starting set. Add the numbers an owner actually feels: booking rate on bookable calls, after-hours capture, and mid-call hang-ups. One caution: containment rate — the share of calls resolved with no human — is the industry's favorite headline number, but it varies so much by industry and call mix that any benchmark quoted without context is marketing. Measure against your own baseline instead: what share of these calls went to voicemail last month?
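Once each call carries a structured outcome, those rates reduce to simple counting. The sketch below uses five invented call records; the field names are assumptions, and for brevity it divides by all calls, whereas the booking rate described above would divide by bookable calls only.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical post-call records; outcome labels follow the examples above.
calls: List[Dict] = [
    {"outcome": "booked", "after_hours": True, "duration_s": 95},
    {"outcome": "transferred", "after_hours": False, "duration_s": 140},
    {"outcome": "message taken", "after_hours": True, "duration_s": 60},
    {"outcome": "abandoned", "after_hours": False, "duration_s": 20},
    {"outcome": "booked", "after_hours": False, "duration_s": 185},
]


def summarize(records: List[Dict]) -> Dict[str, float]:
    """Roll per-call outcomes up into the rates an owner would track."""
    n = len(records)
    outcomes = Counter(r["outcome"] for r in records)
    return {
        "booking_rate": outcomes["booked"] / n,
        "escalation_rate": outcomes["transferred"] / n,
        "abandon_rate": outcomes["abandoned"] / n,
        "after_hours_share": sum(r["after_hours"] for r in records) / n,
        "avg_handling_time_s": sum(r["duration_s"] for r in records) / n,
    }


metrics = summarize(calls)
print(metrics)
```

Run the same function over last month's voicemail log and you have the baseline comparison, with no industry benchmark required.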

The other half of quality control is testing — and treating every instruction change like a code change. A tweak that fixes one behavior can quietly break another, so disciplined operators (and good vendors) re-run a scripted test suite after every edit, not just at launch, and read five to ten transcripts a week, tagging the misses and turning them into instruction updates — a 15-minute ritual covered at /blog/call-summary-analytics-guide. In a done-for-you service this loop is the vendor's job; the per-call recording, transcript, and outcome exist precisely so someone can run it. The minimum suite worth scripting:

  • The digits test: a phone number spoken quickly, a name that needs spelling, an email address. Pass: a correct read-back on every field.
  • The correction test: change the day or the service halfway through a booking. Pass: the agent updates the pending booking instead of creating two.
  • The interruption test: talk over the agent twice mid-sentence. Pass: it stops, listens, and resumes coherently.
  • The off-script test: ask something that isn't in the knowledge base. Pass: an honest deferral — never an invented answer.
  • The escalation test: demand a human immediately, with irritation. Pass: a fast, graceful transfer with context attached.
  • The noise test: call from a car, a kitchen, or a speakerphone. Pass: confirmation loops kick in instead of silent guessing.

The latency budget: milliseconds decide whether it feels human

In ordinary human conversation, the gap between turns is a few hundred milliseconds — people begin formulating replies before the other person finishes. A voice agent has to approximate that rhythm with five systems chained together, which is why latency is the central engineering discipline of voice AI. The targets below are commonly used industry budgets, not the product of a single study; the point is the shape of the math.

Most of the budget is consumed before the AI says a single word — which is why every component streams and overlaps instead of running in sequence. MapleVoice's answers-in-under-2-seconds standard refers to picking up the call (step 2); per-turn latency is what the streaming architecture protects. When you evaluate any vendor, call the demo line and count the beats after you stop talking — your callers will.

Pipeline stage | Typical target | What the caller feels when it slips
--- | --- | ---
Endpointing (detecting you finished) | 500–800 ms of trailing silence | Too short: the agent talks over you. Too long: dead air after everything you say
Speech-to-text final transcript | 100–300 ms after end-of-turn | Extra silence before the "thinking" even begins
Language model first output | 200–500 ms | The classic "Hello? Are you still there?" pause
Function call (calendar, CRM lookup) | 200 ms–3 s, masked with speech | Unmasked: silences long enough that callers hang up
Text-to-speech first audio | 100–300 ms | The agent sounds hesitant starting every sentence
Total voice-to-voice turn | ~1 second is excellent; under 2 is acceptable | Past ~2 seconds, callers repeat themselves or talk over the agent
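Adding up the rows shows where the one-to-two-second target comes from. The sketch below sums the low and high ends of each stage from the table, under the simplifying assumption that stages run strictly in sequence; streaming overlap would pull real numbers below these sums.

```python
from typing import Dict, Tuple

# (low_ms, high_ms) per stage, taken from the table above.
BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "endpointing": (500, 800),
    "stt_final": (100, 300),
    "llm_first_output": (200, 500),
    "tts_first_audio": (100, 300),
}
FUNCTION_CALL_MS = (200, 3000)  # only on turns that need a tool


def turn_latency(with_function_call: bool = False) -> Tuple[int, int]:
    """Best-case and worst-case voice-to-voice latency, with no overlap."""
    low = sum(lo for lo, _ in BUDGET_MS.values())
    high = sum(hi for _, hi in BUDGET_MS.values())
    if with_function_call:
        low += FUNCTION_CALL_MS[0]
        high += FUNCTION_CALL_MS[1]
    return low, high


print(turn_latency())                         # plain conversational turn
print(turn_latency(with_function_call=True))  # turn that hits a slow tool
```

A plain turn lands between 900 ms and 1,900 ms, with endpointing alone accounting for roughly half. A turn with an unmasked three-second tool call can approach five seconds, which is the case spoken masking exists for.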

Cascading vs speech-to-speech: two ways to build the same pipeline

Everything described so far is the cascading architecture: separate systems for hearing, thinking, and speaking, chained together. It dominates production deployments because every link is observable and controllable — you can read the exact transcript the model saw, log every function call, and constrain answers to grounded knowledge. The cost is the latency tax of every handoff, plus lost nonverbal signal: a transcript flattens hesitation, sarcasm, and rising frustration into plain words.

The alternative is speech-to-speech, also called realtime or end-to-end: a single model — the GPT-4o Realtime and Gemini Live class of systems — that listens to audio and generates audio directly, with no intermediate text. These conversations can feel strikingly natural: the model hears tone, handles overlap gracefully, and responds with humanlike timing. The trade-offs are control and audit: there may be no canonical transcript of what the model "heard," constraining its answers is harder, and as of 2026 per-minute costs have generally run higher.

In practice the choice is pragmatic. Business agents that must take correct actions, follow policy, and produce auditable records — bookings, intake, payments, anything compliance-adjacent — are overwhelmingly cascading or hybrid (cascading control with realtime-grade voice and interruption handling). Speech-to-speech wins where naturalness itself is the product. Expect hybrids to become the default.

One call, annotated: watching the pipeline work

Here is a compressed after-hours dental call with the machinery exposed at each turn. This transcript is illustrative — written for this article, not taken from a customer call; hear real ones at /call-recordings.

📞 Call transcript
Agent: "Thanks for calling Maple Dental, this is the after-hours assistant. How can I help?" [Pre-rendered greeting; media stream opened about a second after the first ring.]
Caller: "Hey, yeah, I chipped a tooth at dinner and I'm hoping to get in tomorrow." [Streaming STT produced partials as they spoke; endpointing fired ~700 ms after "tomorrow."]
Agent: "Sorry to hear that. A chipped tooth, and you'd like to come in tomorrow. Let me check the schedule." [The model emitted check_availability; the last sentence is latency masking while the calendar API runs.]
Agent: "I have 9:40 am or 1:15 pm with Dr. Osei. Do either of those work?" [The function call returned two open slots.]
Caller: "Uh, actually, can we do the aftern-" and the agent stops talking. [Barge-in: caller speech during playback halted the audio within a few hundred milliseconds.]
Caller: "The afternoon one. And it's for my son, Theo." [New turn; the model updates the pending booking to a different patient.]
Agent: "Got it, 1:15 pm tomorrow for Theo. Can I get a phone number in case the doctor needs to reach you?" [The model spotted a missing required field on book_appointment.]
Caller: "Six-four-seven, five-five-five, oh-one-nine-eight."
Agent: "That's 647-555-0198, correct?" [Read-back confirmation; digit strings are where transcription errors hide.]
Caller: "Yep."
Agent: "You're booked: 1:15 pm tomorrow for Theo, for a chipped tooth. You'll get a text confirmation in a minute. Anything else?" [book_appointment executed; SMS confirmation queued.]

After the hangup: recording and transcript stored; summary written ("Parent booked urgent 1:15 pm visit for child; chipped tooth"); reason: urgent booking; outcome: booked; next step: none. The front desk sees it all before opening. [Step 9 in action.]

Total elapsed: about 90 seconds, at 11 pm, with no human awake. That — not the novelty of a talking computer — is the business case.

Same pipeline, different businesses: dental, HVAC, law firm

The nine steps never change. What changes by industry is the menu of functions (step 6), the compliance envelope, and the exit policy (step 8) — which is what "tuned for your industry" actually means, and why a generic demo agent and a production agent for your business behave so differently. Three sketches make it concrete.

The dental office you just heard. The function menu is check_availability and book_appointment against the practice-management calendar, plus an SMS confirmation. The compliance envelope is HIPAA: bookings reveal health information, so the vendor must sign a Business Associate Agreement. The exit policy: severe pain, swelling, or post-extraction bleeding never gets parked at tomorrow's 1:15 — it routes to the on-call line. More at /industries/dental.

An HVAC company runs the identical pipeline with a different menu: create_job in the field-service software, capturing address, equipment type, symptom, and urgency, with the agent quoting arrival windows instead of fixed times. Compliance is lighter — recording disclosure, plus TCPA consent if reminder calls go out later. The exit policy is seasonal triage: no heat in February pages the on-call tech now; a thermostat question books Thursday's window. See /industries/home-services.

A law firm flips the priorities. The functions are intake-shaped — matter type, jurisdiction, urgency, basic conflict screening, consultation booking — and the hard constraint is what the agent must never do: give legal advice. Its instructions confine it to logistics, explicitly. The exit policy: a caller who was just arrested or has a court date tomorrow transfers to an attorney immediately, at any hour. See /industries/legal.

This is why industry tuning matters more than model choice for most buyers: the pipeline is generic, the judgment isn't. MapleVoice ships tunings for 20 industries — the same nine steps with different menus, disclosures, and escalation rules. Browse them at /industries.

What it actually costs: the honest math

Cost is where this topic gets least honest, so let's reconcile the two numbers you'll see. Component pricing is real but misleading: AssemblyAI prices streaming transcription from $0.15 per hour and describes full stacks at $0.01–$0.05 per minute. But add the language model, voice synthesis, telephony, and platform margin, and pricing analyses published by Retell AI and independent reviewers put real-world all-in costs at roughly $0.15 to $0.31 per minute on DIY platforms, with the broader provider range running $0.05 to $1.00. Both numbers are "true." Only one resembles your invoice.

The worked example almost nobody publishes — 300 calls a month averaging 4 minutes, or 1,200 minutes — is in the table below. Two patterns to notice. Per-minute pricing makes busy months expensive and forecasting annoying — exactly why flat-rate pricing exists. And the dominant unpriced line item in DIY is time: testing, edge cases, and weekly transcript reviews never appear on an invoice. Fuller breakdown at /blog/how-much-does-an-ai-receptionist-cost; our flat pricing is at /pricing. One number matters more than any of these: what a missed call is worth to you. Run yours at /blog/missed-call-roi-calculator — for many service businesses, one recovered job a month pays for the system.

Approach | Usage math at 1,200 min/month | What's not in that number
--- | --- | ---
Raw APIs (build it yourself) | $12–$60 at advertised component rates; more once telephony and retries are added | Weeks of engineering plus maintenance; the real cost is salaried time
DIY voice-agent platform | Roughly $180–$370 at the $0.15–$0.31/min real-world figures, plus any subscription fee | Your hours for prompts, integrations, testing, and weekly tuning
Done-for-you managed service | Flat monthly rate, no per-minute meter (MapleVoice's model) | Less knob-level control; a vendor relationship instead of a toolkit
Human answering service | Billed per minute or per call; 24/7 human coverage is the most expensive option | Quality varies by operator; live booking often costs extra
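The usage figures in the table are simple multiplication, which makes them easy to rerun with your own volume. The per-minute rates below are the ones quoted in this section; swap in your own call count and average length.

```python
from typing import Tuple

CALLS_PER_MONTH = 300
AVG_MINUTES_PER_CALL = 4
MINUTES = CALLS_PER_MONTH * AVG_MINUTES_PER_CALL  # 1,200 minutes


def usage_range(low_per_min: float, high_per_min: float) -> Tuple[float, float]:
    """Monthly usage cost in dollars at the low and high per-minute rates."""
    return round(MINUTES * low_per_min, 2), round(MINUTES * high_per_min, 2)


component = usage_range(0.01, 0.05)  # advertised component rates
diy = usage_range(0.15, 0.31)        # real-world all-in DIY rates
print(MINUTES, component, diy)
```

At 1,200 minutes the component rates give $12 to $60 and the all-in DIY rates give $180 to $372, matching the table's rounded figures. Doubling call volume doubles both, which is the forecasting problem flat-rate pricing removes.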

Failure modes: what goes wrong, and what good systems do about it

AssemblyAI's own research found that nearly 95 percent of users have been frustrated by a voice agent at some point. That number is believable, and a vendor who won't discuss why doesn't deserve your phone number. The honest list follows.

And the honest boundary: some calls shouldn't be automated at all. A distraught family calling a funeral home, a legal client mid-crisis, a furious customer demanding a manager — the right design routes these to humans fast. If most of your calls are long, emotional, and high-stakes, a human answering service may genuinely serve you better than any AI. If you get six calls a week, do nothing — voicemail is free.

  • Misheard names, numbers, and addresses. Transcription errors concentrate in exactly the fields bookings depend on. Mitigation: mandatory read-back on every critical field, plus a spelling fallback ("could you spell that for me?").
  • Hallucinated answers. An ungrounded model will confidently invent your prices and policies. Mitigation: ground answers in a curated knowledge base, and instruct the agent to defer — "I'll have the office confirm that" — when the answer isn't there.
  • Dead air on slow tools. A three-second CRM lookup with no speech reads as a dropped call. Mitigation: spoken latency masking, plus timeouts that degrade gracefully to "I'll text you the confirmation."
  • Interruption failures. The agent either bulldozes through interruptions or false-triggers on background noise and stops mid-sentence. Mitigation: tuned barge-in with echo cancellation — judge by calling the demo line, not reading the spec sheet.
  • Accent and noise errors. Modern models are dramatically better than the IVRs people remember, but kitchens, jobsites, and speakerphones still degrade accuracy. Mitigation: confirmation loops, and handoff to a human when confidence drops.
  • Dead-end loops. The worst experience in the category: an agent that can't help and won't get out of the way. Mitigation: a hard rule that two failures to understand triggers an offer to transfer or take a message — never a third attempt.
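The two-failures rule for dead-end loops is small enough to show in full. In this sketch the counter tracks consecutive misunderstandings and resets when a turn succeeds; whether to count consecutive or cumulative misses is a design assumption, and the class and action names are illustrative.

```python
MAX_MISUNDERSTANDINGS = 2  # per the rule above: never a third attempt


class TurnGuard:
    """Counts consecutive turns the agent failed to understand."""

    def __init__(self) -> None:
        self.misses = 0

    def next_action(self, understood: bool) -> str:
        if understood:
            self.misses = 0  # a successful turn resets the count
            return "continue"
        self.misses += 1
        if self.misses >= MAX_MISUNDERSTANDINGS:
            return "offer_transfer_or_message"
        return "ask_to_rephrase"


guard = TurnGuard()
print(guard.next_action(False))  # first miss
print(guard.next_action(False))  # second miss
```

Keeping this check outside the language model, as plain code, means the exit fires even when the model itself is the thing that is confused.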

Build, rent, or done-for-you — and where MapleVoice honestly fits

The last question is who should operate the machine. Three viable answers, compared in the table below.

Our honest position: if you have engineers and an appetite, the DIY platforms are genuinely good and worth considering — /blog/how-to-build-an-ai-voice-agent walks that path step by step. If your calls are rare, or overwhelmingly complex and emotional, humans or plain voicemail may beat any AI, and we'll tell you so.

MapleVoice is built for the case in the middle, which is most local businesses: the phone is how revenue arrives, nobody wants to own prompt engineering, and what you need is the outcome. We run the entire pipeline in this article as a done-for-you service — live in about 48 hours, answering 24/7 in under two seconds, booking appointments, qualifying leads, and taking orders against your real calendar, CRM, or POS, tuned for 20 industries, at a flat monthly price with no per-minute meter. Every call comes back as a recording, transcript, summary, call reason, outcome, and next step.

Next step, whichever path you choose: listen to real calls at /call-recordings, see the setup process at /how-it-works, and check flat pricing at /pricing. If you're still mapping the category, the companion pillar at /blog/what-is-an-ai-voice-agent completes the picture.

Path | Best for | Time to live | You own
--- | --- | --- | ---
Build on raw APIs | Engineering teams with unusual requirements or very large volume | Weeks to months | Everything: telephony, prompts, integrations, monitoring, on-call
DIY platform (Vapi, Retell, Botpress class) | Technical operators who want knob-level control | Days to a working demo; longer to reliable | Prompts, integrations, testing, weekly tuning
Done-for-you managed service | Businesses where nobody's job is "maintain the AI" | About 48 hours, in MapleVoice's case | Reviewing summaries and requesting changes; the vendor owns the pipeline

Frequently asked questions

How much does an AI voice agent cost?

Expect roughly $0.15 to $0.31 per minute all-in on DIY platforms, per pricing analyses published by Retell AI — not the $0.01–$0.05 component figures vendors advertise. A 300-call month at four minutes per call runs about $180–$370 in usage; done-for-you services like MapleVoice charge a flat monthly rate instead.

Are AI voice calls legal?

Yes, with conditions. Inbound calls answered by AI are broadly legal in the US. Outbound is regulated: the FCC ruled in February 2024 that AI-generated voices count as artificial voices under the TCPA, so marketing calls require prior express written consent. Recording-consent rules also vary by state as of 2026.

How are AI voice agents different from chatbots?

A chatbot exchanges text; a voice agent holds a live phone conversation, which adds speech-to-text, text-to-speech, telephony plumbing, and a hard real-time constraint. A chatbot can take ten seconds to respond and nobody notices; a voice agent that takes ten seconds gets hung up on. The decision-making core is similar.

Do I need a team of AI experts to build a voice agent?

No. Modern platforms handle the models, so a technically comfortable operator can assemble a basic agent in days. Making it reliable — edge cases, integrations, interruption tuning — takes sustained effort. If nobody on your team wants that ongoing job, a done-for-you service is usually cheaper than learning it.

What infrastructure do I need to run one?

Almost none. The pipeline runs in your vendor's cloud, and your existing phone number simply forwards to it — no hardware to install, no phone system to replace. You keep your number, and you can route calls to the agent always, after hours only, or on overflow.

Can AI voice agents integrate with existing IVR or CRM systems?

Yes. Voice agents typically sit in front of or replace an IVR, and they write to CRMs, calendars, and POS systems through APIs — the function-calling mechanism explained in this guide. Most platforms ship prebuilt connectors for common tools; custom systems usually connect through webhooks or middleware without replacing anything.

How well do they handle background noise and accents?

Much better than the IVRs people remember. Modern speech models are trained on noisy, accented, real-world audio; AssemblyAI cites NIST testing showing top systems reach word error rates near 4.9 percent. Heavy noise and unusual accents still cause errors, which is why good agents read back names and numbers.

What about deepfakes and voice spoofing?

The risk runs both directions: fraudsters cloning voices to impersonate callers, and scammers spoofing your business's caller ID. Mitigations include never using voice alone to authenticate identity, verifying sensitive requests through a second channel, and STIR/SHAKEN attestation on outbound calls. A receptionist-style booking agent carries low exposure.

Will callers know they're talking to an AI?

Often yes — and good agents say so up front. A brief disclosure like "I'm the AI assistant for Maple Dental" builds trust, pairs with recording disclosure, and is emerging best practice ahead of likely disclosure regulation. Most callers care less about who answers than whether the problem gets solved.

What happens when the AI can't answer?

A well-built agent fails gracefully: it admits it doesn't know, offers a human transfer with full context, or takes a message and triggers a callback. The failure you should never accept is improvising an answer. Ask any vendor to demonstrate exactly what their agent does when it is stumped.
