
ElevenLabs Conversational AI

ElevenLabs Conversational AI is a hosted voice runtime: the orchestration loop (endpointing, turn-taking, interruption detection), the STT, the LLM call, and their proprietary TTS all run on ElevenLabs’s servers. You don’t embed it as a library — you create an agent record on their platform and open a WebSocket to it. Your process is a thin bridge that forwards audio and hosts your tool implementations.

This is a different integration shape from the trace-native frameworks on the overview page. There’s no Python entry point for Veris to call into; instead the Veris actor reaches your agent over the voice_ws channel, and tool visibility comes from events you report (see Tool calls and grading), not native trace ingestion.

Architecture

Your agent runs in the sandbox as a small WebSocket server. The Veris actor connects to it over voice_ws; your server opens an outbound WSS to api.elevenlabs.io and bridges the two audio streams. Tool calls round-trip back over that same already-open WS — so there’s no inbound traffic to the pod and no tunnel.

What runs at ElevenLabs: the agent loop, STT, the LLM call, TTS, VAD, and turn-taking — keyed to a persistent agent record. What runs in your pod: the audio bridge and your tool implementations (against postgres, mocked services, etc.). The only network requirement is outbound HTTPS to api.elevenlabs.io, which the sandbox already allows.

Channel: voice_ws

Declare the actor channel as voice_ws pointing at your agent’s WebSocket. The actor speaks PCM16 at 24 kHz mono; your ElevenLabs audio formats must match exactly (see below).

.veris/veris.yaml
version: "1.0"
my-voice-agent:
  services:
    - name: postgres
      config:
        SCHEMA_PATH: /agent/db/schema.sql
  actor:
    channels:
      - type: voice_ws
        url: ws://localhost:8008/voice
  agent:
    name: My Voice Agent
    code_path: /agent
    entry_point: uv run --no-sync uvicorn app.main:app --host 0.0.0.0 --port 8008
    environment:
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/veris

The agent speaks first: the channel defaults to wait_for_callee_first: true, and the agent record’s first_message is voiced as soon as the connection opens. See the voice_ws reference for the full field list, protocol options (binary vs json), and audio contract.
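For sizing buffers against the 24 kHz mono PCM16 contract, the byte arithmetic can be sketched as below. The helper names are illustrative and not part of Veris or the ElevenLabs SDK.

```python
SAMPLE_RATE_HZ = 24_000  # voice_ws contract: 24 kHz
BYTES_PER_SAMPLE = 2     # PCM16 means 16-bit (2-byte) samples
CHANNELS = 1             # mono


def pcm16_bytes_for(duration_ms: int) -> int:
    """Number of raw PCM16 bytes that hold `duration_ms` of audio."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def pcm16_duration_ms(num_bytes: int) -> float:
    """Playback duration in milliseconds of `num_bytes` of raw PCM16."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return num_bytes / bytes_per_second * 1000
```

At this rate one second of audio is 48,000 bytes, so a 20 ms frame is 960 bytes.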

The agent record

ElevenLabs configuration lives on a persistent agent record, not in the per-call payload. Create it once with the system prompt, tools, voice, LLM, and audio formats; every call then just opens a WS referencing its agent_id.

app/agent_setup.py
client.conversational_ai.agents.create(
    name="My Voice Agent",
    conversation_config={
        "agent": {
            "first_message": "Thanks for calling Acme Bank, this is Riley — how can I help?",
            "language": "en",
            "prompt": {
                "prompt": system_prompt,
                "llm": "gpt-4o-mini",
                "tools": TOOLS,  # each entry type: "client"
                "ignore_default_personality": True,  # drop ElevenLabs's boilerplate persona
            },
        },
        "tts": {
            "model_id": "eleven_flash_v2",  # English ConvAI rejects v2_5 / v3
            "voice_id": "EXAVITQu4vr4xnSDxMaL",  # Sarah, from the voice library
            "agent_output_audio_format": "pcm_24000",
        },
        "asr": {
            "quality": "high",
            "user_input_audio_format": "pcm_24000",  # must match the voice_ws 24 kHz contract
        },
    },
)

Provision the record on first boot and pin its id so you don’t create a fresh one every container start:

agent_id = os.environ.get("AGENT_ID") or client.conversational_ai.agents.create(...).agent_id

The record is hosted state: the prompt, tool schemas, voice, and LLM choice all live at ElevenLabs keyed by agent_id until you delete it. Recreate it whenever the prompt or tools change, then pin AGENT_ID to the new id so later boots reuse the updated record.
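One way to notice that a pinned record has gone stale is to fingerprint the configuration you would send and compare it with the fingerprint stored when the record was created. This is a minimal sketch under that assumption: config_fingerprint and needs_recreate are hypothetical helpers, and where you store the fingerprint (for example next to AGENT_ID) is up to you. Nothing here calls ElevenLabs.

```python
import hashlib
import json
from typing import Optional


def config_fingerprint(conversation_config: dict) -> str:
    """Short stable hash of the config; it changes when prompt or tools change."""
    # sort_keys makes the hash independent of dict key order.
    canonical = json.dumps(
        conversation_config, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def needs_recreate(pinned_fingerprint: Optional[str], conversation_config: dict) -> bool:
    """True when nothing is pinned yet or the local config has drifted."""
    return pinned_fingerprint != config_fingerprint(conversation_config)
```

On boot, a needs_recreate result of True would be the signal to create a new record and pin both the new agent_id and the new fingerprint.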

Client tools

Client tools are the architecturally distinctive piece. The tool schema lives on the agent record; the tool implementation runs in your process. When the LLM picks a tool, ElevenLabs sends a client_tool_call event down the open WS, the SDK routes it to your registered handler, and your handler’s result goes back as a client_tool_result on the same connection — no inbound HTTP, no public URL.

app/main.py
import json

from elevenlabs.conversational_ai.conversation import ClientTools

tools = ClientTools(loop=loop)


def make_handler(name):
    def handler(parameters: dict):
        args = {k: v for k, v in parameters.items() if k != "tool_call_id"}
        result = dispatch(api, name, args)  # your impl: postgres, mocks, etc.
        # The result field is a STRING. Returning a dict trips a
        # `1008 policy violation` and drops the call.
        return json.dumps(result, default=str)  # default=str handles enums/datetimes

    return handler


for name in MY_TOOL_NAMES:
    tools.register(name, make_handler(name))

Each tool schema is a client-type entry on the agent record’s prompt.tools array:

app/tools.py
{
    "type": "client",
    "name": "display_card_info_by_last4",
    "description": "Find a card by the last 4 digits and return its details.",
    "parameters": {
        "type": "object",
        "required": ["last4"],
        "properties": {
            "last4": {"type": "string", "description": "Last 4 digits of the card."},
        },
    },
    "expects_response": True,
}

The SDK runs sync handlers in a thread pool, so blocking I/O (psycopg2, etc.) is fine. But the result must be a JSON string — return json.dumps(...), never the raw dict.
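With default=str, a plain Enum member is serialized as its qualified name (for example "CardStatus.ACTIVE") and a datetime as its str() form with a space separator. If the model should see enum values and ISO 8601 timestamps instead, a custom default is one option. The sketch below is an assumption about what you might want, not part of the SDK; to_tool_result and CardStatus are hypothetical names.

```python
import json
from datetime import date, datetime
from decimal import Decimal
from enum import Enum


def _json_default(obj):
    """Fallback for values json cannot encode natively."""
    if isinstance(obj, Enum):
        return obj.value            # "active" rather than "CardStatus.ACTIVE"
    if isinstance(obj, date):       # also covers datetime, a subclass of date
        return obj.isoformat()
    return str(obj)                 # Decimal, UUID, and anything else


def to_tool_result(value) -> str:
    """Serialize a handler's return value to the JSON string the result field needs."""
    return json.dumps(value, default=_json_default)


class CardStatus(Enum):  # hypothetical domain type for the example
    ACTIVE = "active"


row = {
    "last4": "4242",
    "status": CardStatus.ACTIVE,
    "issued": datetime(2024, 1, 2, 3, 4, 5),
    "limit": Decimal("12.50"),
}
payload = to_tool_result(row)
```

The handler would then end with return to_tool_result(result) in place of the inline json.dumps call.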

Tool calls and grading

Client tools execute in your process and never appear in the actor’s audio transcript, so the grader can’t see them — real actions get flagged as fabricated. Report each call to the engine so it lands in the graded trace:

app/main.py
# SIMULATION_ID and ENGINE_URL are injected by the sandbox; outside a sim this
# is a no-op. Keep it fire-and-forget, and never reshape data the model observes.
if SIMULATION_ID:
    httpx.post(
        f"{ENGINE_URL}/simulations/{SIMULATION_ID}/events",
        json={
            "service": "agent",
            "event_type": "agent_tool_call",
            "data": {"name": name, "arguments": args, "result": result},
        },
        timeout=2.0,
    )

This is the canonical voice-agent tool-reporting pattern; see Tool call reporting in the voice_ws reference.
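Because a failed or slow report should not break the tool call the model is waiting on, it may help to wrap the POST so that errors are logged and swallowed. The following is a sketch under that assumption: build_tool_event and report_tool_call are hypothetical helpers, the event shape is copied from the snippet above, and the post argument is injected so the logic can be exercised without a network.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def build_tool_event(name: str, arguments: dict, result: Any) -> dict:
    """Event body in the shape used by the reporting snippet above."""
    return {
        "service": "agent",
        "event_type": "agent_tool_call",
        "data": {"name": name, "arguments": arguments, "result": result},
    }


def report_tool_call(
    post: Callable[..., Any],
    engine_url: Optional[str],
    simulation_id: Optional[str],
    name: str,
    arguments: dict,
    result: Any,
) -> bool:
    """Best-effort report. Returns True only if the POST ran without raising."""
    if not simulation_id or not engine_url:
        return False  # outside a simulation this is a no-op
    try:
        post(
            f"{engine_url}/simulations/{simulation_id}/events",
            json=build_tool_event(name, arguments, result),
            timeout=2.0,
        )
        return True
    except Exception:
        # Reporting is diagnostic only; never let it fail the tool call.
        logger.warning("tool-call report failed", exc_info=True)
        return False
```

In the handler this could be called as report_tool_call(httpx.post, ENGINE_URL, SIMULATION_ID, name, args, result) after dispatch returns.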

Keep the actor’s VAD happy

The Veris actor commits a turn with server-side VAD (~1500 ms silence window). ElevenLabs streams audio only while the agent is speaking, so after each turn you must pump trailing silence or the conversation deadlocks waiting for the actor to detect end-of-turn.

app/main.py
# ~1700 ms of PCM16 silence, padded past the actor's ~1500 ms VAD threshold.
END_OF_TURN_SILENCE = b"\x00\x00" * (24000 * 1700 // 1000)

Flush it on a short debounce (~400 ms) after the last audio chunk of a turn, and cancel the pending trailer when ElevenLabs fires interrupt() on barge-in so you don’t tail silence into the caller’s next turn.
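The debounce-and-cancel behaviour described above could be sketched with an asyncio task that is restarted on every agent audio chunk. This is an illustrative sketch, not the reference implementation: SilenceTrailer and its method names are hypothetical, send stands in for whatever coroutine writes bytes to the actor's WebSocket, and the methods are assumed to be called on the event loop thread (from an SDK callback thread you would hop over with loop.call_soon_threadsafe).

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SilenceTrailer:
    """Sends one block of silence once agent audio has been quiet for `debounce_s`."""

    def __init__(
        self,
        send: Callable[[bytes], Awaitable[None]],
        silence: bytes,
        debounce_s: float = 0.4,
    ) -> None:
        self._send = send
        self._silence = silence
        self._debounce_s = debounce_s
        self._pending: Optional[asyncio.Task] = None

    def on_agent_audio(self) -> None:
        """Call for every agent audio chunk; restarts the debounce timer."""
        self.cancel()
        loop = asyncio.get_running_loop()
        self._pending = loop.create_task(self._flush_later())

    def cancel(self) -> None:
        """Call on barge-in to drop a trailer that has not been sent yet."""
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending = None

    async def _flush_later(self) -> None:
        # If another chunk or an interrupt arrives first, this task is
        # cancelled during the sleep and the silence is never sent.
        await asyncio.sleep(self._debounce_s)
        await self._send(self._silence)
```

A bridge would construct it once per call with the END_OF_TURN_SILENCE constant, call on_agent_audio() as each chunk is forwarded, and call cancel() from the interrupt path.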

Runtime env vars

veris env vars set ELEVENLABS_API_KEY=sk_... --secret
veris env vars set AGENT_ID=agent_...  # pin after the first boot logs it

The LLM runs through ElevenLabs’s managed inference and is billed through ElevenLabs, so no separate provider key is required for the model itself.

Dockerfile

Standard thin layer on the Veris base image — add the elevenlabs SDK to your dependencies and copy your code. See the Dockerfile.sandbox reference.

.veris/Dockerfile.sandbox
ARG VERIS_BASE
FROM ${VERIS_BASE}

COPY pyproject.toml README.md /agent/
COPY app /agent/app
COPY db/schema.sql /agent/db/schema.sql
COPY agent_desc.txt /agent/agent_desc.txt

WORKDIR /agent
RUN uv sync --no-dev

WORKDIR /agent

Sharp edges

| Gotcha | Why it bites |
| --- | --- |
| TTS model must be eleven_flash_v2 or eleven_turbo_v2 | English ConvAI agents reject eleven_v2_5 / v3 at agent-create time. |
| Tool result must be a JSON string | Returning a dict trips 1008 policy violation and drops the call. |
| Audio = pcm_24000, mono, raw PCM16, both sides | Must match the voice_ws 24 kHz mono contract on asr and tts; a wrong rate (16 kHz / 48 kHz) causes silent STT failure, not an error. |
| Trailing silence is mandatory | Without ~1700 ms of trailing PCM silence the actor’s VAD never commits end-of-turn → deadlock. |
| The agent record is hosted state | Prompt, tools, voice, and LLM live at ElevenLabs keyed by agent_id. Pin AGENT_ID to reuse; recreate the record when prompt or tools change. |
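Several of these gotchas can be caught before the record is created with a local check over the conversation_config dict. The sketch below encodes only the constraints listed on this page; check_conversation_config is a hypothetical helper, and the allowed-model set is taken from the table above rather than from an ElevenLabs API.

```python
from typing import List

ALLOWED_TTS_MODELS = {"eleven_flash_v2", "eleven_turbo_v2"}  # per the table above
REQUIRED_AUDIO_FORMAT = "pcm_24000"  # voice_ws contract: 24 kHz mono PCM16


def check_conversation_config(conversation_config: dict) -> List[str]:
    """Return human-readable problems; an empty list means no known gotcha was hit."""
    problems: List[str] = []
    agent = conversation_config.get("agent", {})
    tts = conversation_config.get("tts", {})
    asr = conversation_config.get("asr", {})

    # The model restriction is documented for English agents only.
    if agent.get("language") == "en" and tts.get("model_id") not in ALLOWED_TTS_MODELS:
        problems.append(f"tts.model_id {tts.get('model_id')!r} is not allowed for English agents")
    if tts.get("agent_output_audio_format") != REQUIRED_AUDIO_FORMAT:
        problems.append("tts.agent_output_audio_format must be pcm_24000")
    if asr.get("user_input_audio_format") != REQUIRED_AUDIO_FORMAT:
        problems.append("asr.user_input_audio_format must be pcm_24000")
    for tool in agent.get("prompt", {}).get("tools", []):
        if tool.get("type") != "client":
            problems.append(f"tool {tool.get('name')!r} is not type 'client'")
    return problems
```

Running it just before agents.create and failing fast on a non-empty list turns a silent STT failure into a startup error.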
