
Evaluations

What is an evaluation?

An evaluation runs after simulations complete. For each simulation it produces:

  • Grader output — scores and findings from the grader attached to the scenario set (hallucination, tool-use correctness, communication quality, etc.).
  • Assertion results — PASS / FAIL for each of the scenario’s assertions, with reasoning per statement.

The simulation transcript is the input; the grader output and assertion results are the outputs.

When evaluations run

Evaluations start automatically when a simulation run completes; the --auto-evaluate flag on veris simulations create is on by default. Run veris evaluations create when you want to evaluate a run manually or re-evaluate it with a different grader.
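
For example, the default automatic path needs no extra flags (a minimal sketch; --auto-evaluate defaults to on):

# Evaluation starts automatically when this simulation run completes
veris simulations create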

Running an evaluation manually

Interactive:

veris evaluations create

Prompts for a completed run; the grader is auto-selected based on the scenario set.

With flags:

veris evaluations create \
  --sim-run-id run_abc123 \
  --grader-id grader_xyz789

You can run multiple graders against the same simulation run — useful when you want to score different aspects without re-running simulations.
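
For example, a sketch that scores the same run with two graders (grader_xyz789 as above; grader_comm456 is a hypothetical second grader ID):

# Score the run with the first grader
veris evaluations create --sim-run-id run_abc123 --grader-id grader_xyz789

# Score the same run again with a second grader (hypothetical ID)
veris evaluations create --sim-run-id run_abc123 --grader-id grader_comm456

Each invocation creates a separate evaluation run, so both sets of results appear when you list evaluations for run_abc123.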

What graders look for

A generated grader groups checks into 5–10 categories tailored to your agent’s responsibilities. Common ones:

Category                 What it checks
Information gathering    Did the agent collect required info before acting?
Tool execution           Did the agent use the right tools with correct parameters and sequence?
Data accuracy            Did the agent avoid hallucinating tool responses or data?
Error handling           Did the agent handle tool errors gracefully?
Scope management         Did the agent handle out-of-scope requests properly?
Consent & confirmation   Did the agent get user agreement before taking actions?
Communication            Did the agent provide the required information to the user?

Viewing results

# List evaluation runs for a simulation run
veris evaluations list <RUN_ID>

# Check progress (polls every 5 seconds)
veris evaluations status <RUN_ID> <EVAL_RUN_ID> --watch

# Open in the console
veris evaluations get <RUN_ID> <EVAL_RUN_ID>

In the console, the Evaluations page shows evaluation runs per simulation run. Click into one to see per-simulation grader scores, assertion results, and findings.

From evaluation to report

Evaluation output is per-simulation. To identify patterns across the whole run and get fix suggestions, generate a report:

veris reports create <RUN_ID>
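
Putting it together, a minimal end-to-end pass might look like this (the run, grader, and evaluation-run IDs are illustrative):

# 1. Simulate; evaluation starts automatically on completion
veris simulations create

# 2. Optionally re-evaluate with a different grader
veris evaluations create --sim-run-id run_abc123 --grader-id grader_xyz789

# 3. Watch evaluation progress (eval_def456 is a hypothetical evaluation run ID)
veris evaluations status run_abc123 eval_def456 --watch

# 4. Generate a report across the whole run
veris reports create run_abc123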

CLI Commands

# Trigger evaluation
veris evaluations create [--sim-run-id ID] [--grader-id ID]

# List evaluation runs
veris evaluations list [RUN_ID]

# Check evaluation status
veris evaluations status <RUN_ID> <EVAL_RUN_ID> [--watch]

# Open in console
veris evaluations get <RUN_ID> <EVAL_RUN_ID>