Evaluations
What is an evaluation?
An evaluation runs after simulations complete. For each simulation it produces:
- Grader output — scores and findings from the grader attached to the scenario set (hallucination, tool-use correctness, communication quality, etc.).
- Assertion results — PASS / FAIL for each of the scenario’s assertions, with reasoning per statement.
The transcript is the input; these are the outputs.
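Concretely, once an evaluation run exists you can open its output from the CLI. The IDs below are illustrative placeholders:

```bash
# Open one evaluation run in the console to see per-simulation
# grader scores, assertion PASS/FAIL results, and findings
veris evaluations get run_abc123 eval_run_001
```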
When evaluations run
Evaluations start automatically when a simulation run completes; `--auto-evaluate` on `veris simulations create` is on by default. Run `veris evaluations create` when you want to evaluate a run manually, or re-evaluate it with a different grader.
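For example, with the default left on, the evaluation starts as soon as the run finishes, and you can follow it without triggering anything by hand. A minimal sketch, with placeholder IDs (`veris simulations create` may take additional arguments not shown here):

```bash
# Auto-evaluation is on by default, so no extra flag is needed
veris simulations create

# Once the run completes, an evaluation run appears automatically
veris evaluations list run_abc123

# Follow its progress (polls every 5 seconds)
veris evaluations status run_abc123 eval_run_001 --watch
```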
Running an evaluation manually
Interactive:

```bash
veris evaluations create
```

Prompts for a completed run; the grader is auto-selected based on the scenario set.
With flags:
```bash
veris evaluations create \
  --sim-run-id run_abc123 \
  --grader-id grader_xyz789
```

You can run multiple graders against the same simulation run, which is useful when you want to score different aspects without re-running simulations.
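As a sketch, assuming you have a second grader for a different aspect (both grader IDs below are placeholders):

```bash
# Score one aspect with the first grader (placeholder ID)
veris evaluations create --sim-run-id run_abc123 --grader-id grader_xyz789

# Score a different aspect on the same run with another grader (placeholder ID)
veris evaluations create --sim-run-id run_abc123 --grader-id grader_comm456
```

Both evaluation runs then appear under the same simulation run in `veris evaluations list`.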
What graders look for
A generated grader groups checks into 5–10 categories tailored to your agent’s responsibilities. Common ones:
| Category | What it checks |
|---|---|
| Information gathering | Did the agent collect required info before acting? |
| Tool execution | Did the agent use the right tools with correct parameters and sequence? |
| Data accuracy | Did the agent avoid hallucinating tool responses or data? |
| Error handling | Did the agent handle tool errors gracefully? |
| Scope management | Did the agent handle out-of-scope requests properly? |
| Consent & confirmation | Did the agent get user agreement before taking actions? |
| Communication | Did the agent provide the required information to the user? |
Viewing results
```bash
# List evaluation runs for a simulation run
veris evaluations list <RUN_ID>

# Check progress (polls every 5 seconds)
veris evaluations status <RUN_ID> <EVAL_RUN_ID> --watch

# Open in the console
veris evaluations get <RUN_ID> <EVAL_RUN_ID>
```

In the console, the Evaluations page shows evaluation runs per simulation run. Click into one to see per-simulation grader scores, assertion results, and findings.
From evaluation to report
Evaluation output is per-simulation. To identify patterns across the whole run and get fix suggestions, generate a report:
```bash
veris reports create <RUN_ID>
```

CLI Commands
```bash
# Trigger evaluation
veris evaluations create [--sim-run-id ID] [--grader-id ID]

# List evaluation runs
veris evaluations list [RUN_ID]

# Check evaluation status
veris evaluations status <RUN_ID> <EVAL_RUN_ID> [--watch]

# Open in console
veris evaluations get <RUN_ID> <EVAL_RUN_ID>
```