Training: Reinforcement Learning

Use this page when you want to use Veris simulations as a training environment for reinforcement learning. This is a secondary workflow — most users should start with the development loop and only reach for RL once evals are stable and they want to push model behavior further.

See the existing RL training reference for the full setup: launching GRPO training jobs against a scenario set, wiring graders as reward signals, and monitoring training runs.
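To make the reward wiring concrete: GRPO samples a group of rollouts per scenario, grades each one, and normalizes each reward against its own group. A minimal sketch of that advantage computation, assuming grader scores on a 0–1 scale (function and variable names here are illustrative, not part of the Veris API):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's grader reward
    against the mean and std of its own group of rollouts."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# One scenario, four sampled rollouts, each graded on a 0-1 scale.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

The group-relative normalization is why grader quality matters so much: if every rollout in a group gets the same score, the advantages collapse to zero and the step teaches the model nothing.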

If your team wants to use RL but doesn’t have in-house expertise, reach out about enterprise support — we can partner on training setup, reward shaping, and interpreting runs.

When RL makes sense

  • Evaluations are stable and your graders reliably distinguish good from bad behavior.
  • The agent’s failures aren’t obvious prompt or tooling fixes; they’re judgment calls you can’t patch directly.
  • You have budget for a training run (simulations × gradient steps gets expensive fast).
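To see why the budget bullet bites, a back-of-the-envelope estimate: every gradient step replays the scenario set, and each scenario is sampled several times to form a GRPO group. All numbers below are hypothetical, not Veris pricing:

```python
def rl_run_cost(scenarios, rollouts_per_scenario, gradient_steps,
                cost_per_simulation):
    """Rough budget: total simulations is the product of scenario
    count, GRPO group size, and gradient steps."""
    simulations = scenarios * rollouts_per_scenario * gradient_steps
    return simulations, simulations * cost_per_simulation

# Hypothetical numbers: 100 scenarios, 8 rollouts per group,
# 200 gradient steps, $0.05 per simulated conversation.
sims, dollars = rl_run_cost(100, 8, 200, 0.05)
```

Even modest-looking settings multiply into six-figure simulation counts, which is why stabilizing evals first is cheaper than debugging them mid-run.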

When RL doesn’t make sense

  • Your graders are noisy, or so lenient that nearly every rollout passes; either way the reward signal won’t distinguish good behavior from bad.
  • Your scenario set is small (< 50 scenarios).
  • Your agent’s failures are classic bugs — missing a step, calling the wrong tool — that a prompt fix would catch.
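A quick way to check the grader bullet before committing to a run: score a handful of transcripts you already know are good and bad, and measure how often the grader ranks a good one strictly above a bad one. A sketch, with illustrative names and toy scores:

```python
from itertools import product

def grader_separability(good_scores, bad_scores):
    """Fraction of (good, bad) pairs where the grader ranks the
    known-good transcript strictly above the known-bad one.
    Near 1.0 means a clean signal; near 0.5 means noise."""
    pairs = list(product(good_scores, bad_scores))
    wins = sum(1 for g, b in pairs if g > b)
    return wins / len(pairs)

# Grader scores for transcripts hand-labeled good vs. bad.
sep = grader_separability([0.9, 0.8, 0.85], [0.2, 0.85, 0.75])
```

If this number isn’t close to 1.0 on transcripts you labeled yourself, fix the grader before spending anything on RL.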

See also