Reinforcement Learning (GRPO)

GRPO (Group Relative Policy Optimization) uses the Veris simulation environment as a live training ground. The model generates completions, executes them against mock services and simulated users, and learns from reward signals.

RL training currently supports single-agent setups only.

The typical workflow:

  1. Set up your environment and agent. Configure your environment with services and scenarios, and set up your agent for RL.
  2. Write a reward script. Define a Python function that scores simulation traces; this drives the training signal.
  3. Start training. From the Console, select a base model and scenario sets, and upload your reward script.
  4. Monitor and iterate. Track metrics like mean reward, KL divergence, and group signal quality.
  5. Deploy. Get an inference endpoint for your trained model and use it in your agent.

Agent Setup

For RL training, your agent’s LLM calls are routed through the Veris model router. The training orchestrator swaps in the model being trained and captures the full interaction.

Your agent must:

  • Use the OpenAI Chat Completions API format (no Responses API, no streaming)
  • Point to the Veris LLM proxy via the VERIS_LLM_BASE_URL environment variable
  • Train one agent at a time
```python
import os

from openai import AsyncOpenAI
from agents import Agent, OpenAIChatCompletionsModel, set_default_openai_api

set_default_openai_api("chat_completions")

veris_url = os.getenv("VERIS_LLM_BASE_URL")
model_name = os.getenv("MODEL_NAME", "qwen3-8b")

if veris_url:
    # During RL training — route through Veris
    client = AsyncOpenAI(api_key="not-needed", base_url=veris_url)
else:
    # Local dev / evaluation — use provider directly
    client = AsyncOpenAI()

model = OpenAIChatCompletionsModel(model=model_name, openai_client=client)

agent = Agent(
    name="my-agent",
    instructions="...",
    model=model,
    tools=[...],
)
```

VERIS_LLM_BASE_URL routes your agent’s LLM calls to the training model during RL. When not set, your agent should fall back to its default provider so the same code works for both training and evaluation.

How It Works

For each training step:

  1. The model generates multiple completions for a scenario (controlled by group_size)
  2. Each completion runs as a full simulation in the sandbox environment
  3. Your reward script scores each simulation
  4. The model is updated to favor higher-scoring behaviors

This is the same loop as evaluation, but instead of just measuring performance, the reward scores are used to update the model.
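For intuition, here is a minimal sketch of one such step. `run_simulation` and `update_policy` stand in for the training orchestrator's internals and are hypothetical, not part of the Veris API; the group-relative advantage computation is the core GRPO idea:

```python
# Minimal sketch of one GRPO training step. run_simulation() and
# update_policy() are hypothetical placeholders for orchestrator internals.
import statistics

def group_advantages(rewards):
    """Group-relative advantages: each completion is scored against
    its siblings generated for the same scenario."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean_r) / std_r for r in rewards]

def grpo_step(scenario, policy, reward_fn, group_size=4):
    # 1. Generate a group of completions, each run as a full simulation
    traces = [run_simulation(scenario, policy) for _ in range(group_size)]
    # 2. Score each simulation with the reward script
    rewards = [reward_fn(trace) for trace in traces]
    # 3. Convert raw rewards into within-group advantages
    advantages = group_advantages(rewards)
    # 4. Nudge the policy toward higher-advantage behavior, with the
    #    KL penalty keeping it close to the base model
    update_policy(policy, traces, advantages)
```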

Reward Scripts

A reward script is a Python function that takes a simulation trace and returns a score. This is the signal that drives training.

```python
def reward(trace):
    """Score a simulation trace. Return a float between 0 and 1."""
    turns = trace["turns"]

    # Extract tool call names
    tool_calls = []
    for t in turns:
        if t["role"] == "assistant" and t.get("tool_calls"):
            for tc in t["tool_calls"]:
                tool_calls.append(tc["name"])

    score = 0.0

    # Did the agent respond?
    agent_responses = [t for t in turns if t["role"] == "assistant" and t.get("content")]
    if agent_responses:
        score += 0.4

    # Did the agent use tools (but not excessively)?
    if 1 <= len(tool_calls) <= 6:
        score += 0.3

    # Was it efficient?
    if len(turns) <= 10:
        score += 0.3

    # Penalize repeated tool calls
    consecutive_repeats = sum(
        1 for i in range(1, len(tool_calls)) if tool_calls[i] == tool_calls[i - 1]
    )
    score -= consecutive_repeats * 0.1

    return max(0.0, min(score, 1.0))
```

What you get in trace

The trace is a dict with the normalized conversation:

| Field | Type | Description |
|---|---|---|
| turns | list | Conversation turns with role, content, tool_calls |
| tools | list | Tool/function definitions available to the agent |
| model | string | Model name used for generation |
| system_prompt | string | The agent's system prompt |
| agent_id | string | Agent identifier |

Example trace:

{ "agent_id": "credit-card-agent", "model": "qwen3-8b", "system_prompt": "You are a support agent for credit cards...", "tools": [ {"name": "lookup_card", "description": "Find card by last 4 digits", "parameters": {...}} ], "turns": [ {"role": "user", "content": "I need to cancel my card ending in 4532"}, {"role": "assistant", "tool_calls": [{"name": "lookup_card", "arguments": {"last4": "4532"}}]}, {"role": "tool", "name": "lookup_card", "content": "{\"id\": \"CRD-123\", \"status\": \"active\"}"}, {"role": "assistant", "content": "I found your card ending in 4532. I'll cancel it now."}, {"role": "assistant", "tool_calls": [{"name": "cancel_card", "arguments": {"card_id": "CRD-123"}}]}, {"role": "tool", "name": "cancel_card", "content": "{\"status\": \"cancelled\"}"}, {"role": "assistant", "content": "Your card has been cancelled. Is there anything else?"} ] }

Tips for writing reward scripts

  • Start simple. A reward based on assertion pass/fail is a strong baseline.
  • Combine signals. Weight tool usage, conversation length, and assertion results together.
  • Avoid sparse rewards. If most simulations score 0, the model has no gradient to learn from. Add partial credit.
  • Test first. Run your reward script against existing evaluation runs before starting a training run, as sketched below.
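As a concrete example of that last tip, here is a hypothetical local harness for sanity-checking a reward script against saved traces. The `exported_traces` directory and `my_reward` module are assumptions for illustration, not part of the Veris API:

```python
# Hypothetical harness: run reward() over traces exported as JSON files
# (one per simulation) and check that every score stays in range.
import json
from pathlib import Path

from my_reward import reward  # assumed module containing the reward() above

for path in sorted(Path("exported_traces").glob("*.json")):
    trace = json.loads(path.read_text())
    score = reward(trace)
    assert 0.0 <= score <= 1.0, f"{path.name}: score {score} out of range"
    print(f"{path.name}: {score:.2f}")
```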

Supported Models

| Model | Parameters |
|---|---|
| Qwen 3 8B | 8B |
| Qwen 3 32B | 32B |

Parameters

| Parameter | Default | Description |
|---|---|---|
| Epochs | 3 | Passes through all training scenarios |
| Group size | 4 | Completions per scenario per step |
| Batch size | 1 | Scenarios per training step |
| Learning rate | 1e-6 | Step size for optimization |
| KL penalty | 0.01 | Penalizes divergence from the base model |
| LoRA rank | 32 | Rank of LoRA adapters |
| Temperature | 1.0 | Sampling temperature during generation |
| Max tokens | 2048 | Maximum generation tokens per turn |
| Max steps | 30 | Maximum LLM calls per simulation |
| Eval every | 20 | Run evaluation every N training steps |
| Save every | 20 | Save checkpoint every N steps |
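For reference, here are the defaults above collected into a plain Python dict. The key names are illustrative only; in practice these parameters are set as form fields in the Console:

```python
# Illustrative only: the table's defaults as a Python dict. These keys are
# hypothetical; the Console exposes the same parameters as form fields.
training_config = {
    "epochs": 3,            # passes through all training scenarios
    "group_size": 4,        # completions per scenario per step
    "batch_size": 1,        # scenarios per training step
    "learning_rate": 1e-6,
    "kl_penalty": 0.01,     # penalizes divergence from the base model
    "lora_rank": 32,
    "temperature": 1.0,
    "max_tokens": 2048,     # generation tokens per turn
    "max_steps": 30,        # LLM calls per simulation
    "eval_every": 20,       # evaluate every N training steps
    "save_every": 20,       # checkpoint every N training steps
}
```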

Simulation parameters

| Parameter | Default | Description |
|---|---|---|
| Max turns | 10 | Maximum conversation turns per simulation |
| Simulation timeout | 600 | Timeout per simulation, in seconds |

Using the Console

  1. Navigate to Training and click New Training Run
  2. Select a base model
  3. Choose scenario sets to train on
  4. Upload a reward script (.py file)
  5. Configure parameters or use defaults
  6. Click Start Training

Monitoring

Active training runs show:

  • Mean reward per step (train and eval)
  • KL divergence from the base model
  • Entropy of the policy
  • Group signal quality (variance within groups; see the note below)
  • Training groups with per-scenario breakdowns and token masks

Logs and WandB integration are available for detailed debugging.
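Group signal quality is worth watching because GRPO computes advantages within each group: if every completion in a group receives the same reward, the advantages are all zero and that group contributes no learning signal. A quick illustrative check over scored groups:

```python
# Illustrative check: what fraction of groups actually carries signal?
import statistics

def informative_fraction(reward_groups):
    """Fraction of groups whose rewards differ, i.e. groups that
    produce a non-zero GRPO learning signal."""
    informative = [g for g in reward_groups if statistics.pstdev(g) > 0]
    return len(informative) / len(reward_groups)

# Two groups of 4 rewards; the second is uniform, so it carries no signal
print(informative_fraction([[0.4, 0.7, 1.0, 0.4], [0.0, 0.0, 0.0, 0.0]]))  # 0.5
```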