
Evaluations

What is an evaluation?

An evaluation runs after simulations complete. For each simulation it produces:

  • Grader output — scores and findings from the grader attached to the scenario set (hallucination, tool-use correctness, communication quality, etc.).
  • Assertion results — PASS / FAIL for each of the scenario’s assertions, with reasoning per statement.

The simulation transcript is the input; the grader output and assertion results are the outputs.

When evaluations run

Evaluations start automatically when a simulation run completes; the --auto-evaluate flag on veris simulations create is on by default. Run veris evaluations create when you want to evaluate a run manually or re-evaluate it with a different grader.
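
For example, the default automatic path needs no extra flags (a minimal sketch; --auto-evaluate defaults to on):

# Evaluation starts automatically when this simulation run completes
veris simulations create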

Running an evaluation manually

Interactive:

veris evaluations create

Prompts for a completed run; the grader is auto-selected based on the scenario set.

With flags:

veris evaluations create \
  --sim-run-id run_abc123 \
  --grader-id grader_xyz789

You can run multiple graders against the same simulation run — useful when you want to score different aspects without re-running simulations.
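
For example, a sketch that scores the same run with two graders (grader_xyz789 as above; grader_comm456 is a hypothetical second grader ID):

# Score the run with the first grader
veris evaluations create --sim-run-id run_abc123 --grader-id grader_xyz789

# Score the same run again with a second grader (hypothetical ID)
veris evaluations create --sim-run-id run_abc123 --grader-id grader_comm456

Each invocation creates a separate evaluation run, so both sets of results appear when you list evaluations for run_abc123.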

What graders look for

A generated grader groups checks into 5–10 categories tailored to your agent’s responsibilities. Common ones:

Category                 What it checks
Information gathering    Did the agent collect required info before acting?
Tool execution           Did the agent use the right tools with correct parameters and sequence?
Data accuracy            Did the agent avoid hallucinating tool responses or data?
Error handling           Did the agent handle tool errors gracefully?
Scope management         Did the agent handle out-of-scope requests properly?
Consent & confirmation   Did the agent get user agreement before taking actions?
Communication            Did the agent provide the required information to the user?

Viewing results

# List evaluation runs for a simulation run
veris evaluations list <RUN_ID>

# Check progress (polls every 5 seconds)
veris evaluations status <RUN_ID> <EVAL_RUN_ID> --watch

# Open in the console
veris evaluations get <RUN_ID> <EVAL_RUN_ID>

In the console, the Evaluations page shows evaluation runs per simulation run. Click into one to see per-simulation grader scores, assertion results, and findings.

From evaluation to report

Evaluation output is per-simulation. To identify patterns across the whole run and get fix suggestions, generate a report:

veris reports create <RUN_ID>
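
Putting it together, a minimal end-to-end pass might look like this (the run, grader, and evaluation-run IDs are illustrative):

# 1. Simulate; evaluation starts automatically on completion
veris simulations create

# 2. Optionally re-evaluate with a different grader
veris evaluations create --sim-run-id run_abc123 --grader-id grader_xyz789

# 3. Watch evaluation progress (eval_def456 is a hypothetical evaluation run ID)
veris evaluations status run_abc123 eval_def456 --watch

# 4. Generate a report across the whole run
veris reports create run_abc123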

CLI Commands

# Trigger evaluation
veris evaluations create [--sim-run-id ID] [--grader-id ID]

# List evaluation runs
veris evaluations list [RUN_ID]

# Check evaluation status
veris evaluations status <RUN_ID> <EVAL_RUN_ID> [--watch]

# Open in console
veris evaluations get <RUN_ID> <EVAL_RUN_ID>