Reinforcement Learning (GRPO)

GRPO (Group Relative Policy Optimization) uses the Veris simulation environment as a live training ground. The model generates completions, executes them against mock services and simulated users, and learns from reward signals.

RL training currently supports single-agent setups only.

The typical workflow:

  1. Set up your environment and agent. Configure your environment with services and scenarios, and set up your agent for RL.
  2. Write a reward script. Define a Python function that scores simulation traces; this drives the training signal.
  3. Start training. From the Console, select a base model and scenario sets, and upload your reward script.
  4. Monitor and iterate. Track metrics like mean reward, KL divergence, and group signal quality.
  5. Deploy. Get an inference endpoint for your trained model and use it in your agent.

Agent Setup

For RL training, your agent’s LLM calls are routed through the Veris model router. The training orchestrator swaps in the model being trained and captures the full interaction.

Your agent must:

  • Use the OpenAI Chat Completions API format (no Responses API, no streaming)
  • Point to the Veris LLM proxy via the VERIS_LLM_BASE_URL environment variable
  • Train one agent at a time
```python
import os

from openai import AsyncOpenAI
from agents import Agent, OpenAIChatCompletionsModel, set_default_openai_api

set_default_openai_api("chat_completions")

veris_url = os.getenv("VERIS_LLM_BASE_URL")
model_name = os.getenv("MODEL_NAME", "qwen3-8b")

if veris_url:
    # During RL training — route through Veris
    client = AsyncOpenAI(api_key="not-needed", base_url=veris_url)
else:
    # Local dev / evaluation — use provider directly
    client = AsyncOpenAI()

model = OpenAIChatCompletionsModel(model=model_name, openai_client=client)

agent = Agent(
    name="my-agent",
    instructions="...",
    model=model,
    tools=[...],
)
```

VERIS_LLM_BASE_URL routes your agent’s LLM calls to the training model during RL. When not set, your agent should fall back to its default provider so the same code works for both training and evaluation.

How It Works

For each training step:

  1. The model generates multiple completions for a scenario (controlled by group_size)
  2. Each completion runs as a full simulation in the sandbox environment
  3. Your reward script scores each simulation
  4. The model is updated to favor higher-scoring behaviors

This is the same loop as evaluation, but instead of just measuring performance, the reward scores are used to update the model.
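For intuition, here is a minimal sketch of one such step. `run_simulation` and `update_policy` stand in for the training orchestrator's internals and are hypothetical, not part of the Veris API; the group-relative advantage computation is the core GRPO idea:

```python
# Minimal sketch of one GRPO training step. run_simulation() and
# update_policy() are hypothetical placeholders for orchestrator internals.
import statistics

def group_advantages(rewards):
    """Group-relative advantages: each completion is scored against
    its siblings generated for the same scenario."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean_r) / std_r for r in rewards]

def grpo_step(scenario, policy, reward_fn, group_size=4):
    # 1. Generate a group of completions, each run as a full simulation
    traces = [run_simulation(scenario, policy) for _ in range(group_size)]
    # 2. Score each simulation with the reward script
    rewards = [reward_fn(trace) for trace in traces]
    # 3. Convert raw rewards into within-group advantages
    advantages = group_advantages(rewards)
    # 4. Nudge the policy toward higher-advantage behavior, with the
    #    KL penalty keeping it close to the base model
    update_policy(policy, traces, advantages)
```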

Reward Scripts

A reward script is a Python function that takes a simulation trace and returns a score. This is the signal that drives training.

```python
def reward(trace):
    """Score a simulation trace. Return a float between 0 and 1."""
    turns = trace["turns"]

    # Extract tool call names
    tool_calls = []
    for t in turns:
        if t["role"] == "assistant" and t.get("tool_calls"):
            for tc in t["tool_calls"]:
                tool_calls.append(tc["name"])

    score = 0.0

    # Did the agent respond?
    agent_responses = [t for t in turns if t["role"] == "assistant" and t.get("content")]
    if agent_responses:
        score += 0.4

    # Did the agent use tools (but not excessively)?
    if 1 <= len(tool_calls) <= 6:
        score += 0.3

    # Was it efficient?
    if len(turns) <= 10:
        score += 0.3

    # Penalize repeated tool calls
    consecutive_repeats = sum(
        1 for i in range(1, len(tool_calls)) if tool_calls[i] == tool_calls[i - 1]
    )
    score -= consecutive_repeats * 0.1

    return max(0.0, min(score, 1.0))
```

What you get in trace

The trace is a dict with the normalized conversation:

| Field | Type | Description |
|---|---|---|
| turns | list | Conversation turns with role, content, tool_calls |
| tools | list | Tool/function definitions available to the agent |
| model | string | Model name used for generation |
| system_prompt | string | The agent's system prompt |
| agent_id | string | Agent identifier |

Example trace:

{ "agent_id": "credit-card-agent", "model": "qwen3-8b", "system_prompt": "You are a support agent for credit cards...", "tools": [ {"name": "lookup_card", "description": "Find card by last 4 digits", "parameters": {...}} ], "turns": [ {"role": "user", "content": "I need to cancel my card ending in 4532"}, {"role": "assistant", "tool_calls": [{"name": "lookup_card", "arguments": {"last4": "4532"}}]}, {"role": "tool", "name": "lookup_card", "content": "{\"id\": \"CRD-123\", \"status\": \"active\"}"}, {"role": "assistant", "content": "I found your card ending in 4532. I'll cancel it now."}, {"role": "assistant", "tool_calls": [{"name": "cancel_card", "arguments": {"card_id": "CRD-123"}}]}, {"role": "tool", "name": "cancel_card", "content": "{\"status\": \"cancelled\"}"}, {"role": "assistant", "content": "Your card has been cancelled. Is there anything else?"} ] }

Tips for writing reward scripts

  • Start simple. A reward based on assertion pass/fail is a strong baseline.
  • Combine signals. Weight tool usage, conversation length, and assertion results together.
  • Avoid sparse rewards. If most simulations score 0, the model has no gradient to learn from. Add partial credit.
  • Test first. Run your reward script against existing evaluation runs before starting a training run, as sketched below.
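As a concrete example of that last tip, here is a hypothetical local harness for sanity-checking a reward script against saved traces. The `exported_traces` directory and `my_reward` module are assumptions for illustration, not part of the Veris API:

```python
# Hypothetical harness: run reward() over traces exported as JSON files
# (one per simulation) and check that every score stays in range.
import json
from pathlib import Path

from my_reward import reward  # assumed module containing the reward() above

for path in sorted(Path("exported_traces").glob("*.json")):
    trace = json.loads(path.read_text())
    score = reward(trace)
    assert 0.0 <= score <= 1.0, f"{path.name}: score {score} out of range"
    print(f"{path.name}: {score:.2f}")
```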

Supported Models

| Model | Parameters |
|---|---|
| Qwen 3 8B | 8B |
| Qwen 3 32B | 32B |

Parameters

| Parameter | Default | Description |
|---|---|---|
| Epochs | 3 | Passes through all training scenarios |
| Group size | 4 | Completions per scenario per step |
| Batch size | 1 | Scenarios per training step |
| Learning rate | 1e-6 | Step size for optimization |
| KL penalty | 0.01 | Penalizes divergence from the base model |
| LoRA rank | 32 | Rank of LoRA adapters |
| Temperature | 1.0 | Sampling temperature during generation |
| Max tokens | 2048 | Maximum generation tokens per turn |
| Max steps | 30 | Maximum LLM calls per simulation |
| Eval every | 20 | Run evaluation every N training steps |
| Save every | 20 | Save checkpoint every N steps |
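For reference, here are the defaults above collected into a plain Python dict. The key names are illustrative only; in practice these parameters are set as form fields in the Console:

```python
# Illustrative only: the table's defaults as a Python dict. These keys are
# hypothetical; the Console exposes the same parameters as form fields.
training_config = {
    "epochs": 3,            # passes through all training scenarios
    "group_size": 4,        # completions per scenario per step
    "batch_size": 1,        # scenarios per training step
    "learning_rate": 1e-6,
    "kl_penalty": 0.01,     # penalizes divergence from the base model
    "lora_rank": 32,
    "temperature": 1.0,
    "max_tokens": 2048,     # generation tokens per turn
    "max_steps": 30,        # LLM calls per simulation
    "eval_every": 20,       # evaluate every N training steps
    "save_every": 20,       # checkpoint every N training steps
}
```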

Simulation parameters

| Parameter | Default | Description |
|---|---|---|
| Max turns | 10 | Maximum conversation turns per simulation |
| Simulation timeout | 600 | Timeout per simulation, in seconds |

Using the Console

  1. Navigate to Training and click New Training Run
  2. Select a base model
  3. Choose scenario sets to train on
  4. Upload a reward script (.py file)
  5. Configure parameters or use defaults
  6. Click Start Training

Monitoring

Active training runs show:

  • Mean reward per step (train and eval)
  • KL divergence from the base model
  • Entropy of the policy
  • Group signal quality (variance within groups; see the note below)
  • Training groups with per-scenario breakdowns and token masks

Logs and WandB integration are available for detailed debugging.
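Group signal quality is worth watching because GRPO computes advantages within each group: if every completion in a group receives the same reward, the advantages are all zero and that group contributes no learning signal. A quick illustrative check over scored groups:

```python
# Illustrative check: what fraction of groups actually carries signal?
import statistics

def informative_fraction(reward_groups):
    """Fraction of groups whose rewards differ, i.e. groups that
    produce a non-zero GRPO learning signal."""
    informative = [g for g in reward_groups if statistics.pstdev(g) > 0]
    return len(informative) / len(reward_groups)

# Two groups of 4 rewards; the second is uniform, so it carries no signal
print(informative_fraction([[0.4, 0.7, 1.0, 0.4], [0.0, 0.0, 0.0, 0.0]]))  # 0.5
```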