ElevenLabs Conversational AI

ElevenLabs Conversational AI is a hosted voice runtime: the orchestration loop (endpointing, turn-taking, interruption detection), the STT, the LLM call, and their proprietary TTS all run on ElevenLabs’s servers. You don’t embed it as a library — you create an agent record on their platform and open a WebSocket to it. Your process is a thin bridge that forwards audio and hosts your tool implementations.

This is a different integration shape from the trace-native frameworks on the overview page. There’s no Python entry point for Veris to call into; instead the Veris actor reaches your agent over the voice_ws channel, and tool visibility comes from events you report (see Tool calls and grading), not native trace ingestion.

Architecture

Your agent runs in the sandbox as a small WebSocket server. The Veris actor connects to it over voice_ws; your server opens an outbound WSS connection to api.elevenlabs.io and bridges the two audio streams. Tool calls round-trip over that same already-open WebSocket, so ElevenLabs never has to reach the pod: there is no inbound traffic from outside the sandbox and no tunnel.

What runs at ElevenLabs: the agent loop, STT, the LLM call, TTS, VAD, and turn-taking, all keyed to a persistent agent record. What runs in your pod: the audio bridge and your tool implementations (against postgres, mocked services, etc.). The only network requirement is outbound HTTPS/WSS to api.elevenlabs.io, which the sandbox already allows.
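As a rough illustration of the bridging role described above, the sketch below models the two audio directions as a pair of concurrent pumps. It uses plain asyncio queues as stand-ins for the actor socket and the ElevenLabs connection; the names `pump` and `bridge` are hypothetical and not part of either SDK.

```python
import asyncio


async def pump(src: asyncio.Queue, dst: asyncio.Queue) -> None:
    """Forward audio chunks from src to dst until a None sentinel arrives."""
    while True:
        chunk = await src.get()
        await dst.put(chunk)      # forward the sentinel too so the peer can close
        if chunk is None:
            return


async def bridge(
    actor_in: asyncio.Queue,      # audio received from the Veris actor
    eleven_out: asyncio.Queue,    # audio to send to ElevenLabs
    eleven_in: asyncio.Queue,     # audio received from ElevenLabs
    actor_out: asyncio.Queue,     # audio to send back to the actor
) -> None:
    """Run both directions concurrently until each side has sent its sentinel."""
    await asyncio.gather(
        pump(actor_in, eleven_out),
        pump(eleven_in, actor_out),
    )
```

In a real agent the queues would be replaced by the actual socket reads and writes; the point is only that the two directions run independently, so agent speech is never blocked on caller audio or the reverse.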

Channel: `voice_ws`

Declare the actor channel as voice_ws pointing at your agent’s WebSocket. The actor speaks PCM16 at 24 kHz mono; your ElevenLabs audio formats must match exactly (see below).
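To make the audio contract concrete, here is the byte arithmetic for raw PCM16 at 24 kHz mono. The helper names are illustrative only; they are not part of Veris or the ElevenLabs SDK.

```python
SAMPLE_RATE_HZ = 24_000   # voice_ws contract: 24 kHz
BYTES_PER_SAMPLE = 2      # PCM16 means 16-bit (2-byte) samples
CHANNELS = 1              # mono


def pcm_bytes_for_ms(duration_ms: int) -> int:
    """Number of raw PCM16 bytes that hold duration_ms of audio."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def pcm_ms_for_bytes(n_bytes: int) -> float:
    """Duration in milliseconds represented by n_bytes of raw PCM16."""
    frame = BYTES_PER_SAMPLE * CHANNELS
    if n_bytes % frame:
        raise ValueError("PCM16 mono data must be a whole number of 2-byte samples")
    return n_bytes / frame / SAMPLE_RATE_HZ * 1000
```

One second of audio is therefore 48,000 bytes. A chunk whose length is odd, or whose implied duration looks off by a factor of 1.5 or 2, is a quick hint that one side is not actually running at 24 kHz.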

.veris/veris.yaml


version: "1.0"
 
my-voice-agent:
  services:
    - name: postgres
      config:
        SCHEMA_PATH: /agent/db/schema.sql
 
  actor:
    channels:
      - type: voice_ws
        url: ws://localhost:8008/voice
 
  agent:
    name: My Voice Agent
    code_path: /agent
    entry_point: uv run --no-sync uvicorn app.main:app --host 0.0.0.0 --port 8008
    environment:
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/veris

The agent speaks first: the channel defaults to wait_for_callee_first: true, and the agent record’s first_message is voiced as soon as the connection opens. See the voice_ws reference for the full field list, protocol options (binary vs json), and audio contract.

The agent record

ElevenLabs configuration lives on a persistent agent record, not in the per-call payload. Create it once with the system prompt, tools, voice, LLM, and audio formats; every call then just opens a WS referencing its agent_id.

app/agent_setup.py


client.conversational_ai.agents.create(
    name="My Voice Agent",
    conversation_config={
        "agent": {
            "first_message": "Thanks for calling Acme Bank, this is Riley — how can I help?",
            "language": "en",
            "prompt": {
                "prompt": system_prompt,
                "llm": "gpt-4o-mini",
                "tools": TOOLS,                       # each entry type: "client"
                "ignore_default_personality": True,   # drop ElevenLabs's boilerplate persona
            },
        },
        "tts": {
            "model_id": "eleven_flash_v2",            # English ConvAI rejects v2_5 / v3
            "voice_id": "EXAVITQu4vr4xnSDxMaL",       # Sarah, from the voice library
            "agent_output_audio_format": "pcm_24000",
        },
        "asr": {
            "quality": "high",
            "user_input_audio_format": "pcm_24000",   # must match the voice_ws 24 kHz contract
        },
    },
)

Provision the record on first boot and pin its id so you don’t create a fresh one on every container start:


agent_id = os.environ.get("AGENT_ID") or client.conversational_ai.agents.create(...).agent_id

The record is hosted state — prompt, tool schemas, voice, and LLM choice all live at ElevenLabs keyed by agent_id until you delete it. Recreate it whenever the prompt or tools change.
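One way to notice that a pinned record has gone stale is to fingerprint the config you would send and compare it with a value stored next to the id. This is a minimal sketch under that assumption: the helper names and the idea of a pinned fingerprint (for example in a hypothetical `AGENT_CONFIG_HASH` variable) are not part of Veris or the ElevenLabs SDK.

```python
import hashlib
import json
from typing import Optional


def config_fingerprint(conversation_config: dict) -> str:
    """Stable short hash of the agent config; changes when prompt or tools change."""
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(conversation_config, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def needs_recreate(conversation_config: dict, pinned_fingerprint: Optional[str]) -> bool:
    """True when nothing is pinned yet or the config has drifted from the pin."""
    return pinned_fingerprint != config_fingerprint(conversation_config)
```

Logging the fingerprint alongside the agent id on first boot gives you both values to pin together.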

Client tools

Client tools are the architecturally distinctive piece. The tool schema lives on the agent record; the tool implementation runs in your process. When the LLM picks a tool, ElevenLabs sends a client_tool_call event down the open WS, the SDK routes it to your registered handler, and your handler’s result goes back as a client_tool_result on the same connection — no inbound HTTP, no public URL.

app/main.py


import json

from elevenlabs.conversational_ai.conversation import ClientTools

# `loop`: an asyncio event loop created elsewhere in your app (not shown).
tools = ClientTools(loop=loop)
 
def make_handler(name):
    def handler(parameters: dict):
        args = {k: v for k, v in parameters.items() if k != "tool_call_id"}
        result = dispatch(api, name, args)        # your impl: postgres, mocks, etc.
        # The result field is a STRING. Returning a dict trips a
        # `1008 policy violation` and drops the call.
        return json.dumps(result, default=str)    # default=str handles enums/datetimes
    return handler
 
for name in MY_TOOL_NAMES:
    tools.register(name, make_handler(name))

Each tool schema is a client-type entry on the agent record’s prompt.tools array:

app/tools.py


{
    "type": "client",
    "name": "display_card_info_by_last4",
    "description": "Find a card by the last 4 digits and return its details.",
    "parameters": {
        "type": "object",
        "required": ["last4"],
        "properties": {
            "last4": {"type": "string", "description": "Last 4 digits of the card."},
        },
    },
    "expects_response": True,
}

The SDK runs sync handlers in a thread pool, so blocking I/O (psycopg2, etc.) is fine. But the result must be a JSON string — return json.dumps(...), never the raw dict.
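The string requirement is easy to check in isolation. The sketch below wraps the serialization step in a small helper and runs it on a made-up payload; `to_tool_result` and `CardStatus` are hypothetical names used only for illustration.

```python
import json
from datetime import datetime
from enum import Enum


def to_tool_result(result: object) -> str:
    """Serialize a handler's return value to a JSON string."""
    # default=str is the fallback for values json can't encode natively.
    return json.dumps(result, default=str)


class CardStatus(Enum):          # hypothetical domain type
    ACTIVE = "active"


payload = to_tool_result({
    "last4": "4242",
    "status": CardStatus.ACTIVE,
    "issued_at": datetime(2024, 1, 2, 3, 4, 5),
})
```

Note that `default=str` renders an Enum member through `str()`, so the model would see `"CardStatus.ACTIVE"` here rather than `"active"`. Pass `.value` explicitly if the bare value is what the prompt expects.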

Tool calls and grading

Client tools execute in your process and never appear in the actor’s audio transcript, so the grader can’t see them, and actions the agent really performed get flagged as fabricated. Report each call to the engine so it lands in the graded trace:

app/main.py


import httpx

# SIMULATION_ID and ENGINE_URL are injected by the sandbox; outside a sim this
# is a no-op. Keep it fire-and-forget: never reshape data the model observes.
if SIMULATION_ID:
    try:
        httpx.post(
            f"{ENGINE_URL}/simulations/{SIMULATION_ID}/events",
            json={
                "service": "agent",
                "event_type": "agent_tool_call",
                "data": {"name": name, "arguments": args, "result": result},
            },
            timeout=2.0,
        )
    except httpx.HTTPError:
        pass  # a failed or timed-out report must never break the live call

This is the canonical voice-agent tool-reporting pattern — see Tool call reporting in the voice_ws reference.
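If a synchronous POST inside the tool handler adds too much latency to the turn, one option is to send the report from a background thread. The sketch below assumes exactly the event shape shown above; the `report_tool_call` helper and its injected `post` callable (for example `httpx.post`) are hypothetical, chosen so the logic can be exercised without a network.

```python
import threading
from typing import Any, Callable, Optional


def report_tool_call(
    post: Callable[..., Any],          # e.g. httpx.post, injected by the caller
    engine_url: Optional[str],
    simulation_id: Optional[str],
    name: str,
    args: dict,
    result: Any,
) -> Optional[threading.Thread]:
    """Report one tool call on a daemon thread; never raises into the handler."""
    if not (simulation_id and engine_url):
        return None                    # outside a simulation: no-op

    def _send() -> None:
        try:
            post(
                f"{engine_url}/simulations/{simulation_id}/events",
                json={
                    "service": "agent",
                    "event_type": "agent_tool_call",
                    "data": {"name": name, "arguments": args, "result": result},
                },
                timeout=2.0,
            )
        except Exception:
            pass                       # reporting must not affect the live call

    thread = threading.Thread(target=_send, daemon=True)
    thread.start()
    return thread
```

The handler can then return its JSON string immediately while the report is delivered in parallel. The returned thread is only useful in tests, where joining it makes the outcome deterministic.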

Keep the actor’s VAD happy

The Veris actor commits a turn with server-side VAD (~1500 ms silence window). ElevenLabs streams audio only while the agent is speaking, so after each turn you must pump trailing silence or the conversation deadlocks waiting for the actor to detect end-of-turn.

app/main.py


# ~1700 ms of PCM16 silence, padded past the actor's ~1500 ms VAD threshold.
END_OF_TURN_SILENCE = b"\x00\x00" * (24000 * 1700 // 1000)

Flush it on a short debounce (~400 ms) after the last audio chunk of a turn, and cancel the pending trailer when ElevenLabs fires interrupt() on barge-in so you don’t tail silence into the caller’s next turn.
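The debounce-and-cancel behaviour can be sketched with a single asyncio task that is re-armed on every agent audio chunk. The `SilenceTrailer` class and its `send` callback are hypothetical names. The sketch also assumes both methods are called on the event loop thread; if your SDK callbacks arrive on another thread, hop onto the loop first (for example with `loop.call_soon_threadsafe`).

```python
import asyncio
from typing import Awaitable, Callable, Optional

# ~1700 ms of PCM16 silence at 24 kHz mono.
END_OF_TURN_SILENCE = b"\x00\x00" * (24000 * 1700 // 1000)


class SilenceTrailer:
    """Debounced end-of-turn silence: re-armed per chunk, cancelled on barge-in."""

    def __init__(
        self,
        send: Callable[[bytes], Awaitable[None]],   # writes bytes to the actor socket
        debounce_s: float = 0.4,
    ) -> None:
        self._send = send
        self._debounce_s = debounce_s
        self._task: Optional[asyncio.Task] = None

    def on_agent_audio(self) -> None:
        """Call after forwarding each agent audio chunk; restarts the timer."""
        self.cancel()
        self._task = asyncio.get_running_loop().create_task(self._flush_later())

    def cancel(self) -> None:
        """Call on interruption so no silence trails into the caller's turn."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _flush_later(self) -> None:
        # Only fires if no newer chunk or interruption cancelled this task.
        await asyncio.sleep(self._debounce_s)
        await self._send(END_OF_TURN_SILENCE)
```

Because every chunk cancels the previous timer, the silence is sent once per turn, roughly `debounce_s` after the final chunk.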

Runtime env vars


veris env vars set ELEVENLABS_API_KEY=sk_... --secret
veris env vars set AGENT_ID=agent_...          # pin after the first boot logs it

The LLM runs through ElevenLabs’s managed inference and is billed through ElevenLabs, so no separate provider key is required for the model itself.
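At startup it can help to read both variables in one place and fail fast when the key is missing. This is a minimal sketch; the `Settings` dataclass and `load_settings` function are hypothetical names, not part of Veris or the ElevenLabs SDK.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: str
    agent_id: Optional[str]   # None means "create the record on this boot"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read runtime configuration, raising early if the API key is absent."""
    api_key = env.get("ELEVENLABS_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("ELEVENLABS_API_KEY is not set")
    # Treat an empty AGENT_ID the same as an unset one.
    return Settings(api_key=api_key, agent_id=env.get("AGENT_ID") or None)
```

An `agent_id` of `None` then signals the first-boot path, where the record is created and its id logged for pinning.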

Dockerfile

Standard thin layer on the Veris base image — add the elevenlabs SDK to your dependencies and copy your code. See the Dockerfile.sandbox reference.

.veris/Dockerfile.sandbox


ARG VERIS_BASE
FROM ${VERIS_BASE}
 
COPY pyproject.toml README.md /agent/
COPY app /agent/app
COPY db/schema.sql /agent/db/schema.sql
COPY agent_desc.txt /agent/agent_desc.txt
 
WORKDIR /agent
RUN uv sync --no-dev

Sharp edges

| Gotcha | Why it bites |
| --- | --- |
| TTS model must be `eleven_flash_v2` or `eleven_turbo_v2` | English ConvAI agents reject `eleven_v2_5` / `v3` at agent-create time. |
| Tool result must be a JSON string | Returning a dict trips `1008 policy violation` and drops the call. |
| Audio = `pcm_24000`, mono, raw PCM16, both sides | Must match the `voice_ws` 24 kHz mono contract on `asr` and `tts`; a wrong rate (16 kHz / 48 kHz) causes silent STT failure, not an error. |
| Trailing silence is mandatory | Without ~1700 ms of trailing PCM silence the actor’s VAD never commits end-of-turn → deadlock. |
| The agent record is hosted state | Prompt, tools, voice, and LLM live at ElevenLabs keyed by `agent_id`. Pin `AGENT_ID` to reuse; recreate the record when prompt or tools change. |

What’s next

voice_ws channel reference → voice-channels.md (audio contract, protocols, turn detection, tool reporting).
Config reference → veris.yaml schema, Dockerfile.sandbox.
ElevenLabs docs → Conversational AI, voice library.
Full agent repos → the cookbook.
