Voice calls are real-time conversations through your agent’s phone numbers. Calls can be inbound (received) or outbound (initiated via API). Each call includes metadata like duration, status, and transcript. You can also stream transcripts in real time via Server-Sent Events.
How calls are handled depends on your agent’s voice mode.
voiceMode: "webhook" (default) — Caller speech is transcribed and sent to your webhook as agent.message events. Your server controls every response using any LLM, RAG, or custom logic.
voiceMode: "hosted" — Calls are handled end-to-end by a built-in LLM using your systemPrompt. No webhook or server needed.
Switch modes at any time via PATCH /v1/agents/:id. The backend automatically re-provisions voice infrastructure and rebinds phone numbers with no downtime.
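As a rough illustration of the mode switch, the sketch below builds (but does not send) the PATCH request. The base URL, the Bearer-token header, and the idea that systemPrompt can be sent in the same body are assumptions made for this example, not confirmed details of the API; build_voice_mode_request is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: placeholder base URL; substitute the real API host.
API_BASE = "https://api.example.com"


def build_voice_mode_request(agent_id, voice_mode, api_key, system_prompt=None):
    """Build a PATCH /v1/agents/:id request that changes voiceMode."""
    if voice_mode not in ("webhook", "hosted"):
        raise ValueError("voice_mode must be 'webhook' or 'hosted'")
    body = {"voiceMode": voice_mode}
    if system_prompt is not None:
        # Assumption: systemPrompt is accepted alongside voiceMode.
        body["systemPrompt"] = system_prompt
    return urllib.request.Request(
        url=f"{API_BASE}/v1/agents/{agent_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            # Assumption: Bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it; the request is built separately here so it can be inspected first.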
SMS is always webhook-based regardless of voice mode.
When voiceMode is "webhook":
We POST the transcript to your webhook with event: "agent.message" and channel: "voice", including recentHistory for context.
You process the transcript (e.g., send to your LLM) and return a response. We strongly recommend streaming NDJSON — TTS starts speaking on the first chunk.
When voiceMode is "hosted":
Both modes share the same low-latency engine:
For voice webhooks, your server must return a JSON object ({...}) telling the agent what to say. Non-object responses (numbers, strings, arrays) are ignored and the caller hears silence.
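A small pre-flight check along these lines can catch non-object responses before they reach the caller as silence. This is a minimal sketch: is_speakable_response is a hypothetical helper, and requiring text to be a string is an assumption based on the {"text": "..."} shape used elsewhere on this page.

```python
import json


def is_speakable_response(raw_body: str) -> bool:
    """Return True only if raw_body is a JSON object carrying a text string."""
    try:
        parsed = json.loads(raw_body)
    except json.JSONDecodeError:
        return False  # empty or malformed body
    if not isinstance(parsed, dict):
        return False  # numbers, strings, and arrays are ignored for voice
    return isinstance(parsed.get("text"), str)
```

Running a check like this in your own tests or logging is cheaper than discovering the problem on a live call.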
Return Content-Type: application/x-ndjson with newline-delimited JSON chunks. TTS starts speaking on the very first chunk while your server continues processing.
Mark interim chunks with "interim": true — the final chunk (without interim) closes the turn. Use this for tool calls, LLM token forwarding, or any time your response takes more than ~1 second.
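The chunk format described above can be sketched as a generator that yields one JSON object per line. The function name ndjson_turn is hypothetical; how the generator is attached to an HTTP response (and how the Content-Type: application/x-ndjson header is set) depends on your web framework and is not shown.

```python
import json
from typing import Iterable, Iterator


def ndjson_turn(interim_texts: Iterable[str], final_text: str) -> Iterator[bytes]:
    """Yield NDJSON lines: interim chunks first, then the closing chunk."""
    for text in interim_texts:
        # Interim chunks are flagged so the turn stays open.
        line = json.dumps({"text": text, "interim": True})
        yield (line + "\n").encode("utf-8")
    # The final chunk has no "interim" key, which closes the turn.
    yield (json.dumps({"text": final_text}) + "\n").encode("utf-8")
```

Because the generator is lazy, each line can be flushed to the client as soon as it is produced rather than after the whole reply is ready.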
Return a single JSON object for instant replies where no processing delay is expected.
Voice webhook requests have a 30-second default timeout (configurable from 5–120 seconds per webhook via the timeout field). If your server doesn’t start responding in time, the request is cancelled and the caller hears silence for that turn. This is especially important when your webhook calls external APIs or runs LLM tool calls — always stream an interim chunk immediately so the caller hears something while you process.
When your agent needs to call external APIs (databases, calendars, CRM, etc.) during a voice call, always stream an interim filler response first. This prevents the caller from hearing silence while your tools run.
The pattern is: stream an interim acknowledgement immediately → run your tools → stream the final answer.
Without the interim chunk, the caller hears dead silence while your LLM decides which tool to call, the external API responds, and the LLM summarises the result. With streaming, they hear “Let me check on that” within milliseconds — just like a human assistant would.
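The acknowledge, run tools, answer ordering can be sketched as a generator in which the slow work only starts after the filler chunk has been handed to the response. Both stream_with_tool and run_tool are hypothetical names; run_tool stands in for whatever LLM and external API calls your webhook makes.

```python
import json
from typing import Callable, Iterator


def stream_with_tool(
    run_tool: Callable[[], str],
    filler: str = "Let me check on that.",
) -> Iterator[str]:
    """Yield an interim filler line, then run the tool and yield the answer."""
    # 1. Acknowledge immediately so TTS has something to say.
    yield json.dumps({"text": filler, "interim": True}) + "\n"
    # 2. The slow work runs only once the consumer asks for the next chunk.
    answer = run_tool()
    # 3. The final chunk (no "interim" key) closes the turn.
    yield json.dumps({"text": answer}) + "\n"
```

The design point is ordering: the first yield happens before run_tool is called, so a streaming server can flush the filler while the tool is still working.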
Common issues and how to fix them.
Your webhook is too slow or not responding. Voice webhooks have a 30-second default timeout (configurable per webhook from 5–120 seconds). If your server doesn’t respond in time, the turn is dropped and the caller hears nothing.
Fix: Always stream an interim NDJSON chunk immediately (e.g. {"text": "One moment.", "interim": true}) before doing any slow work. This buys you time while keeping the caller engaged.
Common causes:
Your webhook isn’t configured or isn’t returning a valid JSON object. Voice responses must be a JSON object ({...}). Non-object responses (strings, arrays, numbers) are ignored.
Fix: Verify your webhook is returning {"text": "..."}. Use POST /v1/webhooks/test to confirm your endpoint is reachable and responding correctly.
You’re sending the entire response as a single large chunk. Long responses in a single chunk can cause TTS delays.
Fix: Use NDJSON streaming and break responses into natural sentences. Send each sentence as an interim chunk so TTS can start speaking immediately.
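One way to apply this fix is to split the reply on sentence boundaries and emit each sentence as its own NDJSON line, as sketched below. sentence_chunks is a hypothetical helper, and the regex split is deliberately naive (abbreviations such as "Dr." would be split too), so treat it as a starting point.

```python
import json
import re
from typing import Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(reply: str) -> Iterator[str]:
    """Yield one NDJSON line per sentence; only the last one is final."""
    sentences = [s for s in _SENTENCE_END.split(reply.strip()) if s]
    for index, sentence in enumerate(sentences):
        chunk = {"text": sentence}
        if index < len(sentences) - 1:
            chunk["interim"] = True  # more sentences follow
        yield json.dumps(chunk) + "\n"
```

An empty reply yields no lines at all, so callers may want to fall back to a short default sentence in that case.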
Your LLM is including tool-call markup in its response. Some LLMs emit <function_call> or similar tags.
Fix: Strip non-speech content from your LLM output before returning it. AgentPhone removes common patterns automatically, but your webhook should clean responses to be safe.
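A minimal cleaning pass might look like the following. It only targets the <function_call> tag named above; the patterns your LLM emits may differ, so the regexes here are assumptions to adapt, and strip_tool_markup is a hypothetical helper name.

```python
import re

# Whole <function_call>...</function_call> blocks, including their contents.
_TOOL_BLOCK = re.compile(
    r"<function_call\b[^>]*>.*?</function_call>", re.DOTALL | re.IGNORECASE
)
# Any unmatched opening or closing tag left behind.
_STRAY_TAG = re.compile(r"</?function_call\b[^>]*>", re.IGNORECASE)


def strip_tool_markup(text: str) -> str:
    """Remove tool-call markup and collapse the whitespace it leaves."""
    cleaned = _TOOL_BLOCK.sub(" ", text)
    cleaned = _STRAY_TAG.sub(" ", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()
```

Run this on the LLM output just before building the {"text": "..."} response, so nothing unspeakable reaches TTS.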
You’re returning a 200 OK with no body, or a non-JSON response for voice. SMS webhooks only need a 200 status — voice webhooks must return a JSON object with a text field.
Fix: Check the channel field in the webhook payload. For "voice", always return {"text": "..."}. For "sms", a 200 OK is sufficient.
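The channel check could be sketched as a small framework-agnostic dispatcher that returns a status, headers, and body. The names respond and make_reply are hypothetical; only the channel field and the {"text": "..."} response shape come from this page.

```python
import json
from typing import Callable, Dict, Tuple


def respond(
    payload: dict, make_reply: Callable[[dict], str]
) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) appropriate to the payload's channel."""
    if payload.get("channel") == "voice":
        # Voice turns need a JSON object with a text field.
        body = json.dumps({"text": make_reply(payload)})
        return 200, {"Content-Type": "application/json"}, body
    # For "sms", a bare 200 is sufficient.
    return 200, {}, ""
```

For example, respond({"channel": "sms"}, make_reply) never calls make_reply, which avoids spending an LLM call on a response body nobody reads.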
Call recording is an optional add-on that saves audio recordings of your voice calls. When enabled, completed calls include a recordingUrl field with a link to the audio file.
Enable recording from the Billing page in the dashboard. See Usage & Billing for pricing.
Recordings are captured automatically for all calls while the add-on is active. If you disable the add-on, existing recordings are preserved but recordingUrl will be null until you re-enable it.
List all calls for this project.
Get details of a specific call, including its full transcript.
Stream a call’s transcript in real time via Server-Sent Events. On connect the server replays all existing transcript turns from the database, then streams new turns as they arrive from the voice engine. Works for both live and completed calls — same URL either way.
The response is an SSE stream (Content-Type: text/event-stream) with the following event types:
connected
turn
ended
A : heartbeat comment is sent every 15 seconds to keep proxies and load balancers from closing the connection.
For a live call, existing turns are replayed first, then the stream stays open and delivers new turns in real time until the call ends. For a completed call, all turns are replayed immediately followed by an ended event and the stream closes. Your client code doesn’t need to differentiate — just handle the events.
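To make the "just handle the events" advice concrete, here is a minimal sketch of a client-side parser that works on already-received lines of the stream. It follows the standard SSE framing (event: and data: fields, blank line as delimiter, lines starting with : as comments). The shape of each event's data payload is not specified here, so the data is passed through as a raw string; parse_sse and collect_turns are hypothetical names.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from raw SSE lines, skipping comments."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch if it has data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, e.g. the periodic heartbeat
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def collect_turns(lines: Iterable[str]) -> List[str]:
    """Collect the data of every turn event until ended arrives."""
    turns = []
    for event, data in parse_sse(lines):
        if event == "turn":
            turns.append(data)
        elif event == "ended":
            break
    return turns
```

The same loop serves live and completed calls: for a completed call the replayed turns and the ended event simply arrive back to back.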
Initiate an outbound voice call from one of your agent’s phone numbers. The agent’s first assigned phone number is used as the caller ID.
Web calls let users talk to your agent directly from a browser, no phone number needed. Use the agentphone-web-sdk npm package on the frontend and mint access tokens from your backend.
How it works:
Call POST /v1/calls/web with agentId to get an access token.
Use agentphone-web-sdk to start the call with the token.
The call direction will be web (in addition to inbound and outbound).
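The token-minting step could be sketched from a Python backend as shown below. Only the endpoint path and the agentId field come from this page; the base URL, the Bearer-token header, and the helper name build_web_call_request are assumptions, and the response field that carries the token is not documented here, so the sketch stops at building the request.

```python
import json
import urllib.request

# Assumption: placeholder base URL; substitute the real API host.
API_BASE = "https://api.example.com"


def build_web_call_request(agent_id: str, api_key: str) -> urllib.request.Request:
    """Build a POST /v1/calls/web request asking for a web-call access token."""
    return urllib.request.Request(
        url=f"{API_BASE}/v1/calls/web",
        data=json.dumps({"agentId": agent_id}).encode("utf-8"),
        method="POST",
        headers={
            # Assumption: Bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Keeping this call on the backend means the API key never reaches the browser; only the short-lived access token is handed to the frontend.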
List all calls associated with a specific phone number.