Training

Veris simulation data and environments can be used directly for model training. The same scenarios, transcripts, and grading results that you use for evaluation become training data — and the simulation environment itself becomes the RL training ground.

Training is currently in beta and is available in the Console under Training.

Why Train with Veris Data?

Agent behavior is shaped by the model behind it. While prompt engineering and tool configuration go a long way, fine-tuning the model on your specific task domain produces agents that are more reliable, more efficient, and better at following your workflows.

Veris provides both ingredients for training:

  • Data — Simulation transcripts with grading labels for supervised fine-tuning
  • Environment — The simulation sandbox as a live reward environment for reinforcement learning

Supervised Fine-Tuning (SFT)

SFT trains the model on high-quality examples of correct agent behavior.

How It Works

  1. Run simulations and evaluations across your scenario sets
  2. Filter for high-scoring transcripts (the agent behaved correctly)
  3. Convert transcripts into training examples (input/output pairs)
  4. Fine-tune a base model on these examples
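Steps 2 and 3 above can be sketched as a small filtering function. The transcript fields and the score threshold here are illustrative assumptions, not the actual Veris export schema:

```python
# Sketch of the SFT data pipeline: keep high-scoring transcripts and
# convert them into input/output training pairs.
# NOTE: the "grade", "turns", and "context" fields and the 0.9 threshold
# are illustrative assumptions, not the Veris export format.

def transcripts_to_sft_examples(transcripts, min_score=0.9):
    """Filter graded transcripts and emit (input, output) pairs."""
    examples = []
    for t in transcripts:
        if t["grade"] < min_score:
            continue  # drop runs where the agent misbehaved
        for turn in t["turns"]:
            if turn["role"] == "assistant":
                examples.append({
                    "input": turn["context"],   # conversation so far
                    "output": turn["content"],  # the correct agent reply
                })
    return examples
```

Each resulting pair maps the conversation context to the agent reply that graders scored as correct, which is the input/output format step 3 describes.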

Supported Base Models

Provider   Models
DeepSeek   DeepSeek-V3, DeepSeek-R1
Qwen       Qwen 2.5 (7B, 14B, 32B, 72B)
Llama      Llama 3.1 (8B, 70B), Llama 3.3 70B
Mistral    Mistral Large, Mistral Small

SFT Parameters

Parameter            Default  Description
Epochs               3        Number of training passes
Learning rate        2e-5     Step size for optimization
Batch size           4        Samples per training step
Max sequence length  4096     Maximum token length
Warmup ratio         0.1      Fraction of steps for LR warmup
LoRA rank            16       Rank of LoRA adapters
LoRA alpha           32       Scaling factor for LoRA
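The defaults in the table above can be captured in a config object. This dataclass is an illustrative sketch; the class and field names are not a Veris API:

```python
from dataclasses import dataclass

# Hypothetical config mirroring the SFT defaults in the table above.
# The class and field names are illustrative, not a Veris API.
@dataclass
class SFTConfig:
    epochs: int = 3               # number of training passes
    learning_rate: float = 2e-5   # optimizer step size
    batch_size: int = 4           # samples per training step
    max_seq_length: int = 4096    # maximum token length
    warmup_ratio: float = 0.1     # fraction of steps for LR warmup
    lora_rank: int = 16           # rank of LoRA adapters
    lora_alpha: int = 32          # scaling factor for LoRA
```

Overriding a single field (e.g. `SFTConfig(epochs=1)`) keeps the remaining defaults intact.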

Reinforcement Learning (GRPO)

GRPO (Group Relative Policy Optimization) uses the Veris simulation environment as a live training ground. The agent interacts with mock services and simulated users, and graders provide reward signals.

How It Works

  1. The model generates multiple completions for each scenario
  2. Each completion is executed in the simulation environment
  3. Graders and assertions score the outcomes
  4. The model is updated to favor higher-scoring behaviors
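The "group relative" part of GRPO means each completion's reward is compared against the other completions generated for the same scenario, so the update favors above-average behaviors. A minimal sketch of that normalization (function name is illustrative):

```python
import statistics

# Sketch of GRPO's group-relative advantage: normalize each completion's
# reward against the mean and std of its group, so above-average
# completions get positive advantage and below-average ones negative.
def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]
```

With four generations per prompt (the default below), a group of rewards like `[1.0, 0.0, 1.0, 0.0]` yields advantages of `[1.0, -1.0, 1.0, -1.0]`: the passing completions are reinforced, the failing ones suppressed.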

This is the same loop as evaluation, but instead of just measuring performance, the results actively improve the model.

Reward Models

Reward signals come from:

  • Grader scores — hallucination, tool execution, communication, procedural correctness
  • Assertion pass rates — did the agent achieve the defined success criteria?
  • Custom reward models — optional LLM-based reward models for domain-specific scoring
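These signals must ultimately be collapsed into one scalar reward per completion. A minimal sketch of such a combination; the weights and signal names here are assumptions for illustration, not the scheme Veris uses internally:

```python
# Sketch of collapsing multiple reward sources into one scalar.
# The equal 0.5/0.5 weighting and the signal names are illustrative
# assumptions, not the actual Veris reward formula.
def combined_reward(grader_scores, assertions_passed, assertions_total,
                    grader_weight=0.5, assertion_weight=0.5):
    grader = sum(grader_scores.values()) / len(grader_scores)  # mean grader score
    assertion_rate = assertions_passed / assertions_total      # pass rate
    return grader_weight * grader + assertion_weight * assertion_rate
```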

GRPO Parameters

Parameter              Default  Description
Epochs                 1        Number of training passes
Learning rate          5e-7     Step size (lower than SFT)
Batch size             4        Samples per training step
Num generations        4        Completions per prompt
Max prompt length      2048     Maximum prompt tokens
Max completion length  2048     Maximum completion tokens
Temperature            0.7      Sampling temperature
Beta                   0.04     KL penalty coefficient
LoRA rank              16       Rank of LoRA adapters
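The Beta parameter controls a KL penalty that keeps the trained policy from drifting too far from the reference model. A simplified per-token sketch of how it enters the objective (a surrogate form; the function and argument names are illustrative):

```python
import math

# Simplified sketch of a per-token GRPO objective: advantage-weighted
# log-probability minus a KL penalty toward the reference model.
# Uses the low-variance "k3" KL estimator; names are illustrative.
def grpo_token_objective(advantage, logp_policy, logp_ref, beta=0.04):
    diff = logp_ref - logp_policy
    kl = math.exp(diff) - diff - 1.0  # >= 0, zero when policies agree
    return advantage * logp_policy - beta * kl
```

When the policy and reference assign the same probability, the penalty vanishes; as they diverge, the penalty grows and pulls the update back toward the reference model.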

Using the Console

Navigate to Training in the sidebar.

Creating a Training Run

  1. Click New Training Run
  2. Select a base model from the supported list
  3. Choose the training method — SFT or GRPO
  4. For GRPO, optionally select a reward model
  5. Configure parameters (or keep the defaults; the Advanced toggle exposes the full parameter list)
  6. Select the simulation data to train on
  7. Click Start Training

Monitoring Progress

Active training runs show a progress bar, current epoch, loss metrics, and estimated time remaining. Completed runs show final metrics and a download link for the trained model weights.

From Evaluation to Training

RL path

The simulation environment is the live training ground. Graders and assertions provide reward signals after each completion, and the model is updated via GRPO to favor higher-scoring behaviors. Repeat until convergence.

SFT path

  1. Run simulations and evaluations across your scenario sets
  2. Filter for high-scoring transcripts where the agent behaved correctly
  3. Fine-tune a base model on these examples
  4. Deploy the improved model and re-evaluate to measure progress