ElevenLabs Conversational AI
ElevenLabs Conversational AI is a hosted voice runtime: the orchestration loop (endpointing, turn-taking, interruption detection), the STT, the LLM call, and their proprietary TTS all run on ElevenLabs’s servers. You don’t embed it as a library — you create an agent record on their platform and open a WebSocket to it. Your process is a thin bridge that forwards audio and hosts your tool implementations.
This is a different integration shape from the trace-native frameworks on the overview page. There’s no Python entry point for Veris to call into; instead the Veris actor reaches your agent over the voice_ws channel, and tool visibility comes from events you report (see Tool calls and grading), not native trace ingestion.
Architecture
Your agent runs in the sandbox as a small WebSocket server. The Veris actor connects to it over voice_ws; your server opens an outbound WSS to api.elevenlabs.io and bridges the two audio streams. Tool calls round-trip back over that same already-open WS — so there’s no inbound traffic to the pod and no tunnel.
What runs at ElevenLabs: the agent loop, STT, the LLM call, TTS, VAD, and turn-taking — keyed to a persistent agent record. What runs in your pod: the audio bridge and your tool implementations (against postgres, mocked services, etc.). The only network requirement is outbound HTTPS to api.elevenlabs.io, which the sandbox already allows.
Channel: voice_ws
Declare the actor channel as voice_ws pointing at your agent’s WebSocket. The actor speaks PCM16 at 24 kHz mono; your ElevenLabs audio formats must match exactly (see below).
version: "1.0"
my-voice-agent:
  services:
    - name: postgres
      config:
        SCHEMA_PATH: /agent/db/schema.sql
  actor:
    channels:
      - type: voice_ws
        url: ws://localhost:8008/voice
  agent:
    name: My Voice Agent
    code_path: /agent
    entry_point: uv run --no-sync uvicorn app.main:app --host 0.0.0.0 --port 8008
    environment:
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/veris

The agent speaks first: the channel defaults to wait_for_callee_first: true, and the agent record’s first_message is voiced as soon as the connection opens. See the voice_ws reference for the full field list, protocol options (binary vs json), and audio contract.
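The 24 kHz mono PCM16 contract fixes how many bytes each millisecond of audio occupies, which is handy when sizing buffers or sanity-checking chunk lengths in the bridge. The helpers below are a minimal sketch of that arithmetic; the function and constant names are illustrative and not part of Veris or the ElevenLabs SDK.

```python
SAMPLE_RATE_HZ = 24_000  # voice_ws contract: 24 kHz
BYTES_PER_SAMPLE = 2     # PCM16 means 16-bit (2-byte) samples
CHANNELS = 1             # mono

BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def pcm_bytes(duration_ms: int) -> int:
    """Bytes of raw PCM16 mono audio needed to cover duration_ms."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def pcm_duration_ms(num_bytes: int) -> float:
    """Playback duration in milliseconds of num_bytes of raw PCM16 mono audio."""
    return num_bytes * 1000 / BYTES_PER_SECOND


print(pcm_bytes(20))           # size of a 20 ms chunk
print(pcm_duration_ms(48000))  # duration of 48000 bytes
```

At this rate one second of audio is 48,000 bytes, so a chunk whose length is not a multiple of 2 bytes, or whose duration looks off by a factor of 1.5 or 2, usually points at a sample-rate mismatch.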
The agent record
ElevenLabs configuration lives on a persistent agent record, not in the per-call payload. Create it once with the system prompt, tools, voice, LLM, and audio formats; every call then just opens a WS referencing its agent_id.
client.conversational_ai.agents.create(
    name="My Voice Agent",
    conversation_config={
        "agent": {
            "first_message": "Thanks for calling Acme Bank, this is Riley. How can I help?",
            "language": "en",
            "prompt": {
                "prompt": system_prompt,
                "llm": "gpt-4o-mini",
                "tools": TOOLS,  # each entry type: "client"
                "ignore_default_personality": True,  # drop ElevenLabs's boilerplate persona
            },
        },
        "tts": {
            "model_id": "eleven_flash_v2",  # English ConvAI rejects v2_5 / v3
            "voice_id": "EXAVITQu4vr4xnSDxMaL",  # Sarah, from the voice library
            "agent_output_audio_format": "pcm_24000",
        },
        "asr": {
            "quality": "high",
            "user_input_audio_format": "pcm_24000",  # must match the voice_ws 24 kHz contract
        },
    },
)

Provision the record on first boot and pin its id so you don’t create a fresh one every container start:
agent_id = os.environ.get("AGENT_ID") or client.conversational_ai.agents.create(...).agent_id

The record is hosted state: prompt, tool schemas, voice, and LLM choice all live at ElevenLabs keyed by agent_id until you delete it. Recreate it whenever the prompt or tools change.
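The pin-or-create step can be wrapped in a small helper that also reports whether a new record was made, so the first boot can log the id to pin. This is a sketch under assumptions: resolve_agent_id and the create_agent callable are hypothetical names, and the real creation call is whatever your app already uses.

```python
from typing import Callable, Mapping, Tuple


def resolve_agent_id(
    env: Mapping[str, str],
    create_agent: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (agent_id, created).

    Reuses env["AGENT_ID"] when it is set and non-blank; otherwise calls
    create_agent() exactly once and flags the id as newly created.
    """
    pinned = env.get("AGENT_ID", "").strip()
    if pinned:
        return pinned, False
    return create_agent(), True


# Usage sketch (create_record is a stand-in for the real agents.create call):
def create_record() -> str:
    return "agent_example"  # hypothetical id, for illustration only


agent_id, created = resolve_agent_id({}, create_record)
if created:
    print(f"Created agent record {agent_id}; pin it as AGENT_ID")
```

Passing the environment in as a mapping keeps the helper easy to exercise without touching os.environ; in the app you would pass os.environ itself.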
Client tools
Client tools are the architecturally distinctive piece. The tool schema lives on the agent record; the tool implementation runs in your process. When the LLM picks a tool, ElevenLabs sends a client_tool_call event down the open WS, the SDK routes it to your registered handler, and your handler’s result goes back as a client_tool_result on the same connection — no inbound HTTP, no public URL.
import json

from elevenlabs.conversational_ai.conversation import ClientTools

tools = ClientTools(loop=loop)


def make_handler(name):
    def handler(parameters: dict):
        # Strip the SDK-supplied call id before passing arguments to your impl.
        args = {k: v for k, v in parameters.items() if k != "tool_call_id"}
        result = dispatch(api, name, args)  # your impl: postgres, mocks, etc.
        # The result field is a STRING. Returning a dict trips a
        # `1008 policy violation` and drops the call.
        return json.dumps(result, default=str)  # default=str handles enums/datetimes
    return handler


for name in MY_TOOL_NAMES:
    tools.register(name, make_handler(name))

Each tool schema is a client-type entry on the agent record’s prompt.tools array:
{
    "type": "client",
    "name": "display_card_info_by_last4",
    "description": "Find a card by the last 4 digits and return its details.",
    "parameters": {
        "type": "object",
        "required": ["last4"],
        "properties": {
            "last4": {"type": "string", "description": "Last 4 digits of the card."},
        },
    },
    "expects_response": True,
}

The SDK runs sync handlers in a thread pool, so blocking I/O (psycopg2, etc.) is fine. The result must still be a JSON string: return json.dumps(...), never the raw dict.
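To make the string requirement concrete, the sketch below serializes a typical database-style row that contains values json.dumps cannot encode on its own. The CardStatus enum and the row contents are made up for illustration; only json.dumps(..., default=str) comes from the pattern above.

```python
import json
from datetime import datetime
from decimal import Decimal
from enum import Enum


class CardStatus(Enum):  # hypothetical enum, stands in for your own types
    ACTIVE = "active"


def to_tool_result(result: object) -> str:
    """Serialize a tool result to the JSON string the tool-result field expects."""
    # default=str is the fallback for anything json cannot encode natively.
    return json.dumps(result, default=str)


row = {
    "last4": "4242",
    "status": CardStatus.ACTIVE,
    "limit": Decimal("5000.00"),
    "opened": datetime(2024, 1, 15, 9, 30),
}
payload = to_tool_result(row)
print(type(payload).__name__, payload)
```

Note that default=str renders an enum member as its str() form (here CardStatus.ACTIVE), not its value. If the model should see "active", convert enums to their .value before serializing.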
Tool calls and grading
Client tools execute in your process and never appear in the actor’s audio transcript, so the grader can’t see them — real actions get flagged as fabricated. Report each call to the engine so it lands in the graded trace:
# SIMULATION_ID and ENGINE_URL are injected by the sandbox; outside a sim this
# is a no-op. Keep it fire-and-forget; never reshape data the model observes.
if SIMULATION_ID:
    httpx.post(
        f"{ENGINE_URL}/simulations/{SIMULATION_ID}/events",
        json={
            "service": "agent",
            "event_type": "agent_tool_call",
            "data": {"name": name, "arguments": args, "result": result},
        },
        timeout=2.0,
    )

This is the canonical voice-agent tool-reporting pattern; see Tool call reporting in the voice_ws reference.
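One way to keep reporting truly fire-and-forget is to send the event from a daemon thread and swallow any failure, so a slow or unreachable engine never delays or breaks the live call. The sketch below assumes this design; build_tool_event and report_tool_call are hypothetical helper names, and the post callable stands in for the httpx.post call shown above.

```python
import threading
from typing import Any, Callable


def build_tool_event(name: str, arguments: dict, result: Any) -> dict:
    """Build the event body in the shape used by the reporting call above."""
    return {
        "service": "agent",
        "event_type": "agent_tool_call",
        "data": {"name": name, "arguments": arguments, "result": result},
    }


def report_tool_call(
    post: Callable[[dict], Any],
    name: str,
    arguments: dict,
    result: Any,
) -> threading.Thread:
    """Send the event on a background thread; never raise into the caller."""
    event = build_tool_event(name, arguments, result)

    def _send() -> None:
        try:
            post(event)
        except Exception:
            pass  # reporting is best-effort and must not affect the call

    thread = threading.Thread(target=_send, daemon=True)
    thread.start()
    return thread


# Usage sketch: a list stands in for the HTTP call to the engine.
sent: list = []
report_tool_call(sent.append, "display_card_info_by_last4", {"last4": "4242"}, "{}").join()
print(sent[0]["event_type"])
```

The event is built from the same name, arguments, and result the handler already has, so nothing the model observes is reshaped.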
Keep the actor’s VAD happy
The Veris actor commits a turn with server-side VAD (~1500 ms silence window). ElevenLabs streams audio only while the agent is speaking, so after each turn you must pump trailing silence or the conversation deadlocks waiting for the actor to detect end-of-turn.
# ~1700 ms of PCM16 silence, padded past the actor's ~1500 ms VAD threshold.
END_OF_TURN_SILENCE = b"\x00\x00" * (24000 * 1700 // 1000)

Flush it on a short debounce (~400 ms) after the last audio chunk of a turn, and cancel the pending trailer when ElevenLabs fires interrupt() on barge-in so you don’t tail silence into the caller’s next turn.
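The debounce-and-cancel behaviour can be sketched with a single asyncio task that is restarted on every agent audio chunk. This is an illustrative sketch, not the SDK's mechanism: SilenceTrailer and its method names are hypothetical, send is whatever coroutine writes bytes to the actor's socket, and the methods are assumed to be called on the event loop thread (if your audio callback runs on another thread, hop over with loop.call_soon_threadsafe).

```python
import asyncio
from typing import Awaitable, Callable, Optional

# ~1700 ms of PCM16 silence at 24 kHz mono.
END_OF_TURN_SILENCE = b"\x00\x00" * (24000 * 1700 // 1000)


class SilenceTrailer:
    """Sends trailing silence once agent audio has been quiet for debounce_s."""

    def __init__(
        self,
        send: Callable[[bytes], Awaitable[None]],
        debounce_s: float = 0.4,
    ) -> None:
        self._send = send
        self._debounce_s = debounce_s
        self._pending: Optional[asyncio.Task] = None

    def on_agent_audio(self) -> None:
        """Call after forwarding each agent audio chunk; restarts the timer."""
        self.cancel()
        self._pending = asyncio.get_running_loop().create_task(self._fire())

    def cancel(self) -> None:
        """Call on barge-in to drop a trailer that has not been sent yet."""
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending = None

    async def _fire(self) -> None:
        await asyncio.sleep(self._debounce_s)
        await self._send(END_OF_TURN_SILENCE)


async def demo() -> list:
    sent: list = []

    async def send(chunk: bytes) -> None:
        sent.append(chunk)

    trailer = SilenceTrailer(send, debounce_s=0.01)
    trailer.on_agent_audio()   # first chunk of a turn
    trailer.on_agent_audio()   # later chunk restarts the timer
    await asyncio.sleep(0.05)  # quiet period: the trailer is sent once
    trailer.on_agent_audio()
    trailer.cancel()           # barge-in: the pending trailer is dropped
    await asyncio.sleep(0.05)
    return sent


chunks = asyncio.run(demo())
print(len(chunks), len(chunks[0]))
```

Restarting the timer on each chunk means the trailer is sent only once per turn, after the audio has actually stopped.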
Runtime env vars
veris env vars set ELEVENLABS_API_KEY=sk_... --secret
veris env vars set AGENT_ID=agent_... # pin after the first boot logs it

The LLM runs through ElevenLabs’s managed inference (billed to them), so no separate provider key is required for the model itself.
Dockerfile
Standard thin layer on the Veris base image — add the elevenlabs SDK to your dependencies and copy your code. See the Dockerfile.sandbox reference.
ARG VERIS_BASE
FROM ${VERIS_BASE}
COPY pyproject.toml README.md /agent/
COPY app /agent/app
COPY db/schema.sql /agent/db/schema.sql
COPY agent_desc.txt /agent/agent_desc.txt
WORKDIR /agent
RUN uv sync --no-dev
WORKDIR /agent
Sharp edges
| Gotcha | Why it bites |
|---|---|
| TTS model must be eleven_flash_v2 or eleven_turbo_v2 | English ConvAI agents reject eleven_v2_5 / v3 at agent-create time. |
| Tool result must be a JSON string | Returning a dict trips 1008 policy violation and drops the call. |
| Audio = pcm_24000, mono, raw PCM16, both sides | Must match the voice_ws 24 kHz mono contract on asr and tts; a wrong rate (16 kHz / 48 kHz) causes silent STT failure, not an error. |
| Trailing silence is mandatory | Without ~1700 ms of trailing PCM silence the actor’s VAD never commits end-of-turn → deadlock. |
| The agent record is hosted state | Prompt, tools, voice, and LLM live at ElevenLabs keyed by agent_id. Pin AGENT_ID to reuse; recreate the record when prompt or tools change. |
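Several of these gotchas can be caught before the record is created with a local check on the conversation_config dict. The function below is a minimal sketch that encodes only the rules listed in the table; check_conversation_config is a hypothetical helper, not part of the ElevenLabs SDK, and it does not replace validation done by the platform.

```python
from typing import List

ALLOWED_ENGLISH_TTS_MODELS = {"eleven_flash_v2", "eleven_turbo_v2"}
REQUIRED_AUDIO_FORMAT = "pcm_24000"  # voice_ws contract: 24 kHz mono PCM16


def check_conversation_config(cfg: dict) -> List[str]:
    """Return human-readable problems; an empty list means no known gotcha."""
    problems: List[str] = []
    agent = cfg.get("agent", {})
    tts = cfg.get("tts", {})
    asr = cfg.get("asr", {})

    model_id = tts.get("model_id")
    if agent.get("language") == "en" and model_id not in ALLOWED_ENGLISH_TTS_MODELS:
        problems.append(f"tts.model_id {model_id!r} is not accepted for English agents")

    if tts.get("agent_output_audio_format") != REQUIRED_AUDIO_FORMAT:
        problems.append("tts.agent_output_audio_format must be pcm_24000")

    if asr.get("user_input_audio_format") != REQUIRED_AUDIO_FORMAT:
        problems.append("asr.user_input_audio_format must be pcm_24000")

    return problems


good = {
    "agent": {"language": "en"},
    "tts": {"model_id": "eleven_flash_v2", "agent_output_audio_format": "pcm_24000"},
    "asr": {"user_input_audio_format": "pcm_24000"},
}
print(check_conversation_config(good))
```

Running the check at startup, before calling agents.create, turns the silent STT failure from a mismatched sample rate into an explicit message.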
What’s next
- voice_ws channel reference → voice-channels.md (audio contract, protocols, turn detection, tool reporting).
- Config reference → veris.yaml schema, Dockerfile.sandbox.
- ElevenLabs docs → Conversational AI, voice library.
- Full agent repos → the cookbook.