Reinforcement Learning (GRPO)
GRPO (Group Relative Policy Optimization) uses the Veris simulation environment as a live training ground. The model generates completions, executes them against mock services and simulated users, and learns from reward signals.
RL training currently supports single-agent setups only.
1. Set up your environment and agent. Configure your environment with services and scenarios, and set up your agent for RL.
2. Write a reward script. Define a Python function that scores simulation traces; this drives the training signal.
3. Start training. From the Console, select a base model and scenario sets, and upload your reward script.
4. Monitor and iterate. Track metrics like mean reward, KL divergence, and group signal quality.
5. Deploy. Get an inference endpoint for your trained model and use it in your agent.
Agent Setup
For RL training, your agent’s LLM calls are routed through the Veris model router. The training orchestrator swaps in the model being trained and captures the full interaction.
Your agent must:
- Use the OpenAI Chat Completions API format (no Responses API, no streaming)
- Point to the Veris LLM proxy via the VERIS_LLM_BASE_URL environment variable
- Train one agent at a time
import os
from openai import AsyncOpenAI
from agents import Agent, OpenAIChatCompletionsModel, set_default_openai_api

set_default_openai_api("chat_completions")

veris_url = os.getenv("VERIS_LLM_BASE_URL")
model_name = os.getenv("MODEL_NAME", "qwen3-8b")

if veris_url:
    # During RL training — route through Veris
    client = AsyncOpenAI(api_key="not-needed", base_url=veris_url)
else:
    # Local dev / evaluation — use provider directly
    client = AsyncOpenAI()

model = OpenAIChatCompletionsModel(model=model_name, openai_client=client)

agent = Agent(
    name="my-agent",
    instructions="...",
    model=model,
    tools=[...],
)

VERIS_LLM_BASE_URL routes your agent’s LLM calls to the training model during RL. When not set, your agent should fall back to its default provider so the same code works for both training and evaluation.
How It Works
For each training step:
- The model generates multiple completions for a scenario (controlled by group_size)
- Each completion runs as a full simulation in the sandbox environment
- Your reward script scores each simulation
- The model is updated to favor higher-scoring behaviors
This is the same loop as evaluation, but instead of only measuring performance, the reward signal is used to update the model.
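Schematically, one training step looks like the sketch below. This is an illustration of the group-relative update, not the orchestrator's actual code: run_simulation, reward, and update_policy are placeholder callables standing in for the sandbox rollout, your uploaded reward script, and the trainer's gradient update (which also applies the KL penalty).

import statistics

def grpo_step(scenario, run_simulation, reward, update_policy, group_size=4):
    """One training step, schematically. All callables are placeholders."""
    # 1. Generate group_size completions, each executed as a full simulation
    traces = [run_simulation(scenario) for _ in range(group_size)]

    # 2. Score every simulation with the reward script
    rewards = [reward(trace) for trace in traces]

    # 3. Group-relative advantages: each completion is compared only against
    #    the other completions generated for the same scenario
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    advantages = [(r - mean_r) / std_r for r in rewards]

    # 4. The model is updated to favor completions with positive advantage
    update_policy(traces, advantages)
    return rewards, advantages

Because advantages are normalized within each group, what matters is how a completion compares to the others generated for the same scenario, not its absolute reward.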
Reward Scripts
A reward script is a Python function that takes a simulation trace and returns a score. This is the signal that drives training.
def reward(trace):
    """Score a simulation trace. Return a float between 0 and 1."""
    turns = trace["turns"]

    # Extract tool call names
    tool_calls = []
    for t in turns:
        if t["role"] == "assistant" and t.get("tool_calls"):
            for tc in t["tool_calls"]:
                tool_calls.append(tc["name"])

    score = 0.0

    # Did the agent respond?
    agent_responses = [t for t in turns if t["role"] == "assistant" and t.get("content")]
    if agent_responses:
        score += 0.4

    # Did the agent use tools (but not excessively)?
    if 1 <= len(tool_calls) <= 6:
        score += 0.3

    # Was it efficient?
    if len(turns) <= 10:
        score += 0.3

    # Penalize repeated tool calls
    consecutive_repeats = sum(1 for i in range(1, len(tool_calls)) if tool_calls[i] == tool_calls[i - 1])
    score -= consecutive_repeats * 0.1

    return max(0.0, min(score, 1.0))

What you get in the trace
The trace is a dict with the normalized conversation:
| Field | Type | Description |
|---|---|---|
| turns | list | Conversation turns with role, content, tool_calls |
| tools | list | Tool/function definitions available to the agent |
| model | string | Model name used for generation |
| system_prompt | string | The agent’s system prompt |
| agent_id | string | Agent identifier |
Example trace:
{
  "agent_id": "credit-card-agent",
  "model": "qwen3-8b",
  "system_prompt": "You are a support agent for credit cards...",
  "tools": [
    {"name": "lookup_card", "description": "Find card by last 4 digits", "parameters": {...}}
  ],
  "turns": [
    {"role": "user", "content": "I need to cancel my card ending in 4532"},
    {"role": "assistant", "tool_calls": [{"name": "lookup_card", "arguments": {"last4": "4532"}}]},
    {"role": "tool", "name": "lookup_card", "content": "{\"id\": \"CRD-123\", \"status\": \"active\"}"},
    {"role": "assistant", "content": "I found your card ending in 4532. I'll cancel it now."},
    {"role": "assistant", "tool_calls": [{"name": "cancel_card", "arguments": {"card_id": "CRD-123"}}]},
    {"role": "tool", "name": "cancel_card", "content": "{\"status\": \"cancelled\"}"},
    {"role": "assistant", "content": "Your card has been cancelled. Is there anything else?"}
  ]
}

Tips for writing reward scripts
- Start simple. A reward based on assertion pass/fail is a strong baseline.
- Combine signals. Weight tool usage, conversation length, and assertion results together.
- Avoid sparse rewards. If most simulations score 0, the model has no gradient to learn from. Add partial credit.
- Test first. Run your reward script against existing evaluation runs before starting a training run.
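For example, a quick local check can load a trace saved from an earlier evaluation run and call your reward function on it. This is only a sketch: the file name and the module name for your reward script are hypothetical, and it assumes you have a trace saved as JSON in the format shown above.

import json

from reward_script import reward  # hypothetical module holding the reward function above

# "example_trace.json" is a hypothetical file name for an exported trace
with open("example_trace.json") as f:
    trace = json.load(f)

score = reward(trace)
print(f"reward = {score:.2f}")
assert 0.0 <= score <= 1.0, "reward should stay within [0, 1]"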
Supported Models
| Model | Parameters |
|---|---|
| Qwen 3 8B | 8B |
| Qwen 3 32B | 32B |
Parameters
| Parameter | Default | Description |
|---|---|---|
| Epochs | 3 | Passes through all training scenarios |
| Group size | 4 | Completions per scenario per step |
| Batch size | 1 | Scenarios per training step |
| Learning rate | 1e-6 | Step size for optimization |
| KL penalty | 0.01 | Penalizes divergence from the base model |
| LoRA rank | 32 | Rank of LoRA adapters |
| Temperature | 1.0 | Sampling temperature during generation |
| Max tokens | 2048 | Maximum generation tokens per turn |
| Max steps | 30 | Maximum LLM calls per simulation |
| Eval every | 20 | Run evaluation every N training steps |
| Save every | 20 | Save checkpoint every N steps |
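For a rough sense of scale: with, say, 50 training scenarios (a hypothetical count), the defaults above (3 epochs, batch size 1, group size 4) give 50 × 3 = 150 training steps and 150 × 4 = 600 simulations in total, since each step runs one group of 4 completions for one scenario.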
Simulation parameters
| Parameter | Default | Description |
|---|---|---|
| Max turns | 10 | Maximum conversation turns per simulation |
| Simulation timeout | 600 | Timeout per simulation in seconds |
Using the Console
- Navigate to Training and click New Training Run
- Select a base model
- Choose scenario sets to train on
- Upload a reward script (.py file)
- Configure parameters or use defaults
- Click Start Training
Monitoring
Active training runs show:
- Mean reward per step (train and eval)
- KL divergence from the base model
- Entropy of the policy
- Group signal quality (variance within groups; see the sketch below)
- Training groups with per-scenario breakdowns and token masks
Logs and WandB integration are available for detailed debugging.
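Group signal quality is worth watching because GRPO learns from reward differences within a group: if every completion for a scenario receives the same reward, that group contributes no learning signal. The snippet below is a minimal illustration of what the metric captures; the Console's exact computation may differ.

import statistics

def group_signal(rewards):
    """Within-group reward variance; zero means the group carries no learning signal."""
    return statistics.pvariance(rewards)

print(group_signal([0.7, 0.7, 0.7, 0.7]))  # 0.0 -> degenerate group, nothing to learn
print(group_signal([0.2, 0.5, 0.8, 1.0]))  # > 0  -> informative group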