
Vapi

Vapi is a hosted voice runtime: the orchestration loop (endpointing, turn-taking), the STT, the LLM call, and the TTS all run on Vapi’s cloud. You don’t embed it as a library — your process creates a call on Vapi’s API, opens a WebSocket to it, and bridges audio. The distinctive piece is tools: Vapi invokes them by calling back into your process over HTTP, so your agent has to expose a public webhook.

This is a different integration shape from the trace-native frameworks on the overview page. There’s no Python entry point for Veris to call into; instead the Veris actor reaches your agent over the voice_ws channel, your process bridges audio out to Vapi’s cloud, and tool visibility comes from events you report (see Tool calls and grading), not native trace ingestion.

Architecture

Your agent runs in the sandbox as one FastAPI app (app.main). The Veris actor connects to it over voice_ws; for each connection the app creates a Vapi call (transport.provider = vapi.websocket), opens an outbound WSS to Vapi’s cloud, and bridges the two audio streams. Tool calls do not come back on that audio socket — Vapi POSTs them as HTTP tool-calls webhooks to your tool URL. Because Vapi’s cloud must reach your pod, the app exposes that URL through an ngrok tunnel (see Concurrency and the ngrok tunnel).

What runs at Vapi: the agent loop, STT, the LLM call, TTS, and turn-taking. What runs in your pod: the audio bridge, your tool implementations (against postgres, mocked services, etc.), and an ngrok tunnel that gives Vapi’s cloud a public address to POST tool calls to. The network requirements are outbound WSS to api.vapi.ai and a public inbound URL for the tool webhook.

Channel: voice_ws

Declare the actor channel as voice_ws pointing at your agent’s WebSocket. The actor speaks raw PCM16 bytes at 24 kHz mono — the default binary framing (voice_ws also offers a json envelope); your Vapi call’s audioFormat must match exactly.

.veris/veris.yaml
version: "1.0"
mini-bcs-voice-vapi-env:
  services:
    - name: postgres
      config:
        SCHEMA_PATH: /agent/db/schema.sql
  actor:
    channels:
      - type: voice_ws
        url: ws://localhost:8008/voice
  agent:
    name: Mini BCS Voice (Vapi)
    code_path: /agent
    entry_point: uv run --no-sync uvicorn app.main:app --host 0.0.0.0 --port 8008
    environment:
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/veris
      # The agent POSTs to Vapi to create a call per /voice connection.
      VAPI_API_KEY: ${VAPI_API_KEY}
      # ngrok needs an authtoken to spawn the tunnel for Vapi -> /tool webhooks.
      NGROK_AUTHTOKEN: ${NGROK_AUTHTOKEN}

The agent speaks first: the call’s firstMessage is voiced as soon as the WebSocket opens. See the voice_ws reference for the full field list, protocol options (binary vs json), and audio contract.

The call

Unlike ElevenLabs’s persistent agent record, Vapi configuration is sent inline per call. For each /voice connection the agent POSTs to https://api.vapi.ai/call with the system prompt, model, voice, transcriber, tools, and a WebSocket transport, then dials the websocketCallUrl Vapi returns.

app/main.py
import os

import httpx


async def _create_vapi_call() -> dict:
    payload = {
        "transport": {
            "provider": "vapi.websocket",
            "audioFormat": {"format": "pcm_s16le", "container": "raw", "sampleRate": 24000},
        },
        "assistant": {
            "firstMessage": "Thanks for calling Acme Bank, this is Riley — how can I help?",
            "firstMessageMode": "assistant-speaks-first",
            "model": {
                "provider": "openai",
                "model": "gpt-4o-mini",
                "messages": [{"role": "system", "content": AGENT_PROMPT}],
                "tools": build_tools(TOOL_WEBHOOK_URL),  # each tool's server.url → /tool
            },
            "voice": {"provider": "openai", "voiceId": "alloy"},
            "transcriber": {"provider": "deepgram", "model": "nova-2", "language": "en"},
        },
    }
    async with httpx.AsyncClient(timeout=30.0) as client:
        r = await client.post(
            "https://api.vapi.ai/call",
            json=payload,
            headers={"Authorization": f"Bearer {os.environ['VAPI_API_KEY']}"},
        )
    r.raise_for_status()  # surface auth or validation errors instead of parsing an error body
    return r.json()  # transport.websocketCallUrl → open a WS and bridge audio

The audio format on both sides must agree: the actor speaks PCM16 / 24 kHz mono, so the call’s audioFormat is pcm_s16le / 24000. The LLM, STT, and TTS all run on Vapi’s managed inference, billed to your Vapi account — no separate provider key is required for the model itself.
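Because a mismatched audio format does not raise an error on its own, it can help to check the contract before the call is created. The following is a minimal sketch, assuming the payload layout shown in `_create_vapi_call`; `EXPECTED_AUDIO_FORMAT` and `check_audio_format` are hypothetical names, not part of Vapi or Veris.

```python
# Hypothetical guard: verify the call payload matches the actor's audio contract
# (PCM16, raw container, 24 kHz) before it is sent to Vapi.
EXPECTED_AUDIO_FORMAT = {"format": "pcm_s16le", "container": "raw", "sampleRate": 24000}


def check_audio_format(payload: dict) -> None:
    """Raise ValueError if transport.audioFormat differs from the expected contract."""
    actual = payload.get("transport", {}).get("audioFormat")
    if actual != EXPECTED_AUDIO_FORMAT:
        raise ValueError(
            f"audioFormat mismatch: expected {EXPECTED_AUDIO_FORMAT}, got {actual}"
        )
```

Calling `check_audio_format(payload)` just before the POST would turn a silent mismatch into an immediate, readable failure.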

Server tools

Tools are the architecturally distinctive piece. Each tool is a Vapi function tool carrying a server.url; when the LLM picks a tool, Vapi POSTs a tool-calls event to that URL and waits for the result in the HTTP response — there is no client-side execution.

app/tools.py
def build_tools(server_url: str) -> list[dict]:
    """Wrap each function as a Vapi tool that POSTs to server_url."""
    return [
        {"type": "function", "function": fn, "server": {"url": server_url}}
        for fn in TOOL_FUNCTIONS
    ]
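For illustration, an entry in TOOL_FUNCTIONS might look like the sketch below. The `get_account_balance` tool, its parameters, and the example URL are hypothetical, and the name/description/parameters layout assumes the OpenAI-style function schema.

```python
# Hypothetical tool definition; the real TOOL_FUNCTIONS list lives in app/tools.py.
TOOL_FUNCTIONS = [
    {
        "name": "get_account_balance",
        "description": "Look up the current balance for a customer account.",
        "parameters": {
            "type": "object",
            "properties": {"account_id": {"type": "string"}},
            "required": ["account_id"],
        },
    }
]


def build_tools(server_url: str) -> list[dict]:
    """Wrap each function as a Vapi tool that POSTs to server_url."""
    return [
        {"type": "function", "function": fn, "server": {"url": server_url}}
        for fn in TOOL_FUNCTIONS
    ]


tools = build_tools("https://example.invalid/tool")
print(tools[0]["function"]["name"], tools[0]["server"]["url"])
```

Every tool ends up with the same `server.url`, which is why a single /tool endpoint can dispatch all of them by name.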

Your webhook unpacks message.toolCallList, dispatches each call, and returns a results array keyed by toolCallId:

app/main.py
import json

from fastapi import Request
from fastapi.responses import JSONResponse


@app.post("/tool")
async def tool_webhook(request: Request) -> JSONResponse:
    body = await request.json()
    results = []
    for call in body["message"]["toolCallList"]:
        call_id, name, args = _extract_tool_call(call)
        output = dispatch(_TOOL_API, name, args)  # your impl: postgres, mocks, etc.
        # The result field is a STRING. Returning a dict is rejected by Vapi.
        results.append({"toolCallId": call_id, "result": json.dumps(output, default=str)})
        report_tool_call(name, args, output)  # make it visible to the grader
    return JSONResponse({"results": results})

The tool result must be a JSON string — json.dumps(...), never the raw dict. A non-string result is silently dropped: Vapi logs “No result returned” and the model continues with no observation at all, so the failure is invisible rather than loud. Keep the result a single line, and always return HTTP 200 (a non-200 response is ignored too). default=str covers enums and datetimes that aren’t natively JSON-serializable.
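A small helper can enforce the string contract in one place. This is a sketch under the constraints stated above; `encode_tool_result`, the `Status` enum, and the sample values are hypothetical.

```python
import json
from datetime import datetime
from enum import Enum


def encode_tool_result(call_id: str, output: object) -> dict:
    """Build one entry of the results array with a single-line JSON string result."""
    # default=str stringifies values json cannot serialize natively (datetimes, enums).
    # Without an indent argument, json.dumps emits no literal newlines.
    encoded = json.dumps(output, default=str)
    return {"toolCallId": call_id, "result": encoded}


class Status(Enum):
    OPEN = "open"


entry = encode_tool_result(
    "call_1", {"status": Status.OPEN, "at": datetime(2024, 1, 2, 3, 4, 5)}
)
print(type(entry["result"]).__name__, entry["result"])
```

Routing every result through one function like this makes it harder for a new tool to return a raw dict by accident.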

Tool calls and grading

Server tools execute in your process and never appear in the actor’s audio transcript, so unless you report them the grader can’t see them, and real, completed actions get flagged as fabricated. Report each call to the engine, at the single /tool dispatch point, so it lands in the graded trace:

app/main.py
import asyncio
import json
import logging
import os

import httpx

logger = logging.getLogger(__name__)

# SIMULATION_ID and ENGINE_URL are injected by the sandbox; outside a sim this
# is a no-op. Keep it fire-and-forget — never reshape data the model observes.
_ENGINE_URL = os.environ.get("ENGINE_URL", "http://localhost:6100")
_SIMULATION_ID = os.environ.get("SIMULATION_ID")
_report_tasks: set[asyncio.Task] = set()


def report_tool_call(name: str, args: dict, result: object) -> None:
    if not _SIMULATION_ID:
        return
    # asyncio.to_thread so the blocking POST never delays the synchronous /tool
    # response Vapi is waiting on. Keep a strong ref so the task isn't GC'd.
    task = asyncio.create_task(asyncio.to_thread(_emit_tool_event, name, args, result))
    _report_tasks.add(task)
    task.add_done_callback(_report_tasks.discard)


def _emit_tool_event(name: str, args: dict, result: object) -> None:
    body = json.dumps(
        {
            "service": "agent",
            "event_type": "agent_tool_call",
            "data": {"name": name, "arguments": args, "result": result},
        },
        default=str,  # enums/datetimes — same handling as the tool result
    )
    try:
        httpx.post(
            f"{_ENGINE_URL}/simulations/{_SIMULATION_ID}/events",
            content=body,
            headers={"Content-Type": "application/json"},
            timeout=2.0,
        )
    except Exception as exc:
        logger.warning("[tool] could not report %s to engine: %s", name, exc)

The event_type must be exactly agent_tool_call (the renderer keys on it). Report on both the success and error paths so the grader can tell a real failure from a fabricated success. This is the canonical voice-agent tool-reporting pattern — see Tool call reporting in the voice_ws reference.
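Reporting on both paths can be done by wrapping the dispatch. In this minimal sketch, `run_tool` is a hypothetical wrapper, `dispatch_fn` and `report_fn` stand in for this page's dispatch and report_tool_call, and the `{"error": ...}` result shape is an assumption rather than a format Vapi or Veris requires.

```python
from typing import Any, Callable


def run_tool(
    name: str,
    args: dict,
    dispatch_fn: Callable[[str, dict], Any],
    report_fn: Callable[[str, dict, Any], None],
) -> Any:
    """Run one tool call and report its outcome whether it succeeds or raises."""
    try:
        output = dispatch_fn(name, args)
    except Exception as exc:
        # Report the failure too, so the trace shows a real error rather than nothing.
        output = {"error": f"{type(exc).__name__}: {exc}"}
    report_fn(name, args, output)
    return output
```

The same `output` value goes to both the reporter and the caller, so the graded trace and the model's observation stay identical.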

Keep the actor’s VAD happy

The Veris actor commits a turn with server-side VAD (~1500 ms silence window). Vapi streams audio only while the agent is speaking, so after each turn you must pump trailing silence or the conversation deadlocks waiting for the actor to detect end-of-turn.

app/main.py
# ~1700 ms of PCM16 silence, padded past the actor's ~1500 ms VAD threshold.
END_OF_TURN_SILENCE = b"\x00\x00" * (24000 * 1700 // 1000)

Flush it when Vapi sends speech-update with status=stopped and role=assistant, so the actor’s VAD reliably fires end-of-turn.
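The flush condition can be sketched as a predicate over Vapi's control messages. This assumes the message arrives as JSON text with `type`, `status`, and `role` keys; only the speech-update, stopped, and assistant values come from the text above, while the key names and `should_flush_silence` are assumptions.

```python
import json

SAMPLE_RATE = 24000  # Hz, mono PCM16 (2 bytes per sample)
# ~1700 ms of silence, padded past the actor's ~1500 ms VAD window.
END_OF_TURN_SILENCE = b"\x00\x00" * (SAMPLE_RATE * 1700 // 1000)


def should_flush_silence(raw_message: str) -> bool:
    """True when a control message says the assistant has stopped speaking."""
    try:
        msg = json.loads(raw_message)
    except json.JSONDecodeError:
        return False
    if not isinstance(msg, dict):
        return False
    return (
        msg.get("type") == "speech-update"
        and msg.get("status") == "stopped"
        and msg.get("role") == "assistant"
    )
```

When the predicate is true, the bridge would write END_OF_TURN_SILENCE (81,600 bytes at 24 kHz PCM16) to the actor's socket.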

Concurrency and the ngrok tunnel

Vapi’s cloud must reach your pod to deliver tool calls, so the agent spawns ngrok http $PORT on the first /voice call and points the tools’ server.url at the tunnel. This works in isolation but has a hard concurrency limit.

Free-tier ngrok allows one agent session per authtoken. When several sims run at once, every pod tries to open a tunnel on the same account and all but one are refused. The agent retries the spawn a handful of times with backoff, so brief contention self-heals once a sibling tunnel tears down — but under sustained concurrency the losing pods exhaust their retries, can’t set up the call, and the actor sees callee_no_answer. A few parallel sims pass while a larger batch shows most calls failing to connect — looking like a flaky agent when it’s the tunnel.
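The retry-with-backoff behaviour described above might look roughly like this. It is a generic sketch, not the agent's actual code: `spawn_with_backoff`, the attempt count, and the delays are illustrative assumptions, and `spawn` stands in for whatever starts the tunnel and returns its public URL.

```python
import time
from typing import Callable, Optional


def spawn_with_backoff(
    spawn: Callable[[], str],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call spawn() until it succeeds, doubling the wait after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return spawn()
        except RuntimeError as exc:  # assumed: spawn raises RuntimeError when refused
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2**attempt)
    raise RuntimeError(f"tunnel spawn failed after {attempts} attempts") from last_error
```

With a bounded number of attempts, brief contention recovers on its own while sustained contention still ends in a failure, which matches the callee_no_answer symptom described above.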

Three ways to run a batch of Vapi calls:

  • Serialize — run one call at a time, so each gets the single free-tier tunnel in turn. Reliable, slow; fine for small batches.
  • A tunnel without the single-session cap — a paid ngrok plan or a (free) Cloudflare Tunnel removes the one-tunnel limit.
  • One shared webhook for the fleet (the production pattern) — point every call at a single stable public server.url set at the assistant level, and route inside it by call.id. Vapi correlates tool results purely by toolCallId, not by connection, so one stateless endpoint serves arbitrarily many concurrent calls. See the Vapi server-URL docs.
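Routing by call.id inside a shared endpoint can be sketched as a lookup keyed on the call id. The `message.call.id` path, the `CallRegistry` class, and the per-call context are assumptions for illustration; verify the payload path against a real webhook body.

```python
from typing import Any, Optional


class CallRegistry:
    """Maps a Vapi call id to per-call context (for example a sim id or DB handle)."""

    def __init__(self) -> None:
        self._contexts: dict[str, Any] = {}

    def register(self, call_id: str, context: Any) -> None:
        self._contexts[call_id] = context

    def unregister(self, call_id: str) -> None:
        self._contexts.pop(call_id, None)

    def route(self, body: dict) -> Optional[Any]:
        """Return the context for the call that produced this webhook, if known."""
        # Assumed path: body["message"]["call"]["id"].
        call_id = body.get("message", {}).get("call", {}).get("id")
        return self._contexts.get(call_id)
```

A call would be registered when it is created and unregistered when its socket closes; the in-memory dict is only suitable for a single process, so a fleet would need shared storage instead.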

Note there is no way to answer tool calls over the audio WebSocket — Vapi’s client→server messages on that socket carry only audio and call control, never a tool result. The webhook is required; the only question is how you expose it.

Runtime env vars

veris env vars set VAPI_API_KEY=... --secret
veris env vars set NGROK_AUTHTOKEN=... --secret

VAPI_API_KEY authenticates the POST /call that creates each call. NGROK_AUTHTOKEN lets the in-pod ngrok agent spawn its tunnel. ENGINE_URL and SIMULATION_ID are set by the sandbox in-sim; PUBLIC_BASE_URL, if set, is used as the tool webhook base instead of spawning ngrok (the escape hatch for the shared-endpoint pattern above).
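The precedence between PUBLIC_BASE_URL and the ngrok tunnel can be sketched as below. `resolve_tool_webhook_url` and `spawn_tunnel` are hypothetical names, and the /tool path is taken from the webhook route shown earlier on this page.

```python
import os
from typing import Callable, Mapping


def resolve_tool_webhook_url(
    spawn_tunnel: Callable[[], str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Prefer PUBLIC_BASE_URL; otherwise spawn a tunnel and use its public URL."""
    base = env.get("PUBLIC_BASE_URL", "").strip()
    if not base:
        base = spawn_tunnel()  # only reached when no stable public URL is configured
    return base.rstrip("/") + "/tool"
```

Passing the tunnel spawner in as a callable keeps ngrok out of the picture entirely when a shared endpoint is configured.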

Dockerfile

Standard thin layer on the Veris base image, plus the ngrok binary so the agent can spawn its tunnel. See the Dockerfile.sandbox reference.

.veris/Dockerfile.sandbox
ARG VERIS_BASE
FROM ${VERIS_BASE}

COPY pyproject.toml README.md /agent/
COPY app /agent/app
COPY db/schema.sql /agent/db/schema.sql
COPY agent_desc.txt /agent/agent_desc.txt

# ngrok — the agent spawns it on startup to expose the tool webhook.
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl ca-certificates gnupg \
    && curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
       | gpg --dearmor -o /etc/apt/keyrings/ngrok.gpg \
    && echo "deb [signed-by=/etc/apt/keyrings/ngrok.gpg] https://ngrok-agent.s3.amazonaws.com buster main" \
       > /etc/apt/sources.list.d/ngrok.list \
    && apt-get update \
    && apt-get install -y --no-install-recommends ngrok \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /agent
RUN uv sync --no-dev

Sharp edges

| Gotcha | Why it bites |
| --- | --- |
| Tool result must be a JSON string | A non-string result is silently dropped — Vapi logs “No result returned” and the model continues with no observation, so the failure is invisible. Wrap with json.dumps(result, default=str); keep it single-line and return HTTP 200. |
| Tools need a public inbound URL | Vapi POSTs tool-calls from its cloud to each tool’s server.url, so the pod must be publicly reachable. The agent spawns ngrok to expose /tool. |
| Free-tier ngrok = one agent session per authtoken | Concurrent sims contend for the single allowed session; losing pods retry with backoff and, if contention persists, give up → callee_no_answer. Serialize, use a paid/Cloudflare tunnel, or one shared server.url routed by call.id. |
| Trailing silence is mandatory | Vapi streams audio only while speaking; without ~1700 ms of trailing PCM silence after each turn the actor’s VAD never commits → the sim deadlocks. |
| Config is per-call, not a hosted record | Prompt, tools, voice, and model are sent in the POST /call body each time — there’s no agent_id to pin or reuse. |
| Audio format must match on both sides | The actor is PCM16 / 24 kHz mono, so the call’s audioFormat must be pcm_s16le / 24000; a wrong rate causes silent STT failure, not an error. |
