Development loop
Use this page when you’ve pushed your agent to Veris and run veris run at least once, and now you want to iterate. If you haven’t gotten that far, see the Quickstart.
This is the default path for most Veris users. The loop is:
- Run simulations against the current agent.
- Read the report.
- Fix the highest-leverage issue in your agent’s code.
- Push a new image tag.
- Re-run. Compare.
Repeat until the numbers stop moving.
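In shell terms, one pass through the loop looks roughly like this (a sketch that reuses the commands covered below; v1.1 is a placeholder tag):

```bash
veris run --report             # 1. simulate + evaluate + report against the current agent
# 2-3. read the report, then fix the top issue in your agent's source
veris env push --tag v1.1      # 4. push the fixed agent under a new tag
veris run --image-tag v1.1     # 5. re-run the same scenario set, then compare in the console
```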
Running the loop
The one-command form:
```bash
veris run --report
```

This chains simulations → evaluations → report and returns a markdown summary plus a link to the full report. For iteration work, always include --report — the report is where the value is.
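One optional habit that pays off later: keep the printed summary from each run next to the tag it was produced against. Assuming the markdown summary is written to stdout, plain shell redirection is enough (the veris-reports/ directory is just an illustrative choice):

```bash
# Save the run summary for later comparison; the path is arbitrary.
mkdir -p veris-reports
veris run --report | tee veris-reports/v1.0.md
```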
The exploded form (useful when debugging a specific step):
```bash
veris simulations create --scenario-set-id <SET_ID>
veris evaluations create --sim-run-id <RUN_ID>
veris reports create --eval-run-id <EVAL_RUN_ID>
```
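Because each step takes the previous step's ID, the exploded form also lets you reuse artifacts. For example, you can re-grade an existing simulation run without re-simulating (a sketch; the IDs are placeholders you copy from earlier output):

```bash
# Re-run only evaluation and reporting against simulations you already have.
veris evaluations create --sim-run-id <RUN_ID>
veris reports create --eval-run-id <EVAL_RUN_ID>
```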
Reading the report
A report has four sections. For iteration, focus on:
- Scenario pass rate — the headline number. Track it run-over-run.
- Top issues — ranked failure patterns with how many simulations each affected. Your attention should go to the top of this list.
- Suggested fixes — concrete diffs the report recommends. Each is tagged with which simulations it would help. Work the highest-impact one first.
Start with Top issues to orient, then move into Suggested fixes to act. Only dig into individual simulation transcripts (in the console under the relevant run) when something in the report doesn’t add up.
Treat the report like a code review comment, not a CI log.
Fixing the agent
Suggested fixes come in three types. Each applies to a specific layer of your agent:
- System prompt fix — edit the system prompt to change how the agent is instructed.
- Tool docstring fix — tighten a tool’s description so the model uses it correctly.
- Tool code fix — change the tool’s implementation.
Whatever you change, keep it in the agent’s source — not in the Veris config. The goal is to test the same code that runs in production.
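In practice that handoff is just a normal commit in the agent's repository. A sketch (the file path and commit message are purely illustrative):

```bash
# The fix lives in the agent's own source tree, not in Veris config.
git add agent/tools/search.py     # hypothetical path to the tool you edited
git commit -m "Tighten search tool docstring per report suggestion"
# Then push a new image tag to Veris (next section).
```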
Push a new tag
Tag every iteration so you can compare runs later:
```bash
veris env push --tag v1.1
```

Tags are cheap. Over-tag. Naming suggestion: use semantic tags for milestones (v1.0, v1.1) and short descriptive tags for experiments (no-confirmation-step, longer-system-prompt).
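Another convention that works well (a habit, not a Veris requirement, and assuming tag names can be arbitrary short strings): tag experiment images with the short commit SHA so every run maps back to an exact state of the agent's source:

```bash
# Tag the image with the current commit so runs are traceable to source.
veris env push --tag "$(git rev-parse --short HEAD)"
```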
Compare runs
Run against the same scenario set with the new tag:
```bash
veris run --image-tag v1.1
```

Open the Benchmarks page in the console to compare the runs side by side. It shows which failures went away, which are new, and which didn’t move.
Don’t regenerate scenarios between iterations. Same scenario set, different agent versions — otherwise you can’t compare fairly. Regenerate scenarios only when you’ve changed your agent’s services or public interface.
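Concretely, a fair comparison is two runs where only the image tag differs (v1.0 and v1.1 below stand in for whichever tags you pushed):

```bash
# Same scenario set, two agent versions; only the image tag changes.
veris run --image-tag v1.0
veris run --image-tag v1.1
# Then open Benchmarks in the console to compare the two runs.
```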
When to stop
Manual iteration hits a ceiling. Stop when:
- Each new fix moves the pass rate less than the last, and the remaining gap to your eval bar no longer justifies the effort.
- The remaining failures are in scenarios the current set wasn’t the right test for — regenerate a tighter set and decide whether to keep iterating.
Past that ceiling, gains come from training rather than prompting: RL uses your graders as a reward signal, and SFT distills your highest-scoring transcripts into a smaller, cheaper model.
See also
- Quickstart — get set up the first time
- CI/CD regression gating — prevent regressions after shipping
- How the console and CLI relate — know which surface to reach for
- Reports — what’s inside a report