# Use the CLI
You’ve written tasks and jobs by hand. The devals CLI can generate most of
that configuration for you — this page shows how, and what you’ll want to
customize afterward.
## Scaffolding commands

### devals init

Initializes a fresh project for evals:

```shell
cd ~/my-project
devals init
```
What it creates:

```
my-project/
├── devals.yaml              # marker file
└── evals/
    ├── tasks/
    │   └── get_started/
    │       └── task.yaml    # starter task
    └── jobs/
        └── local_dev.yaml   # ready-to-run job
```
What to customize:

- The starter task uses `func: analyze_codebase`. That's fine for a smoke test, but change `func` to match your eval type (`question_answer`, `bug_fix`, `code_gen`, etc.).
- The job defaults to `google/gemini-2.0-flash`. Update `models:` to the provider(s) you want to test.
- `files` points at `../../` (your project root). Update it if your workspace lives elsewhere.
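As a concrete sketch of the first and third edits, the customized starter task might look roughly like this. The `func` and `files` keys are named above, but the layout here is an assumption; the generated `task.yaml` is the source of truth for the exact schema:

```yaml
# evals/tasks/get_started/task.yaml -- illustrative sketch, not the exact schema
func: question_answer   # changed from analyze_codebase to match this eval type
files: ../../           # workspace root; update if yours lives elsewhere
```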
### devals create pipeline

An interactive walkthrough that creates a sample, task, and job in one go. Great for first-timers:

```shell
devals create pipeline
```
It prompts you for:

- A sample ID and prompt
- Which task function to use
- A job name and model selection
The result is a fully wired-up set of YAML files, ready for `devals run`.
### devals create task

Creates a new task directory with a starter `task.yaml`:

```shell
devals create task
```
Prompts for:

- Task ID (becomes the directory name under `tasks/`)
- Task function (selected from the Python registry)
- Optional system message
What to customize after:

- Add your `samples` (the generated file is a skeleton)
- Add `files` and `setup` if your task needs a workspace
- Add `metadata` with tags for filtering
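A filled-in task file might look roughly like this. Every top-level key (`func`, `samples`, `files`, `setup`, `metadata`) is mentioned on this page, but the nesting and values shown here are assumptions to compare against the generated skeleton:

```yaml
# evals/tasks/fix_pagination/task.yaml -- hypothetical shape, for illustration
func: bug_fix
samples:
  - id: off_by_one_page
    input: Fix the off-by-one error in the pagination helper.
files: ../../                          # only needed if the task uses a workspace
setup: pip install -r requirements.txt # hypothetical setup step
metadata:
  tags: [bug_fix, smoke]               # used later for filtering
```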
### devals create sample

Adds a new sample interactively:

```shell
devals create sample
```
Prompts for:

- Sample ID (snake_case)
- Difficulty level
- Whether a workspace is needed
What to customize after:

- Write a specific `input` prompt (the generated placeholder is generic)
- Write grading criteria in `target`
- Add `metadata.tags` for filtering
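After those edits, a single sample entry could read something like this. The `input`, `target`, and `metadata.tags` fields come from this page; the sample content itself is invented for illustration:

```yaml
# One sample entry -- illustrative sketch
- id: off_by_one_page              # snake_case sample ID
  input: >
    Fix the off-by-one error in paginate() in utils.py so the last
    page of results is no longer dropped.
  target: >
    paginate() returns the final partial page and existing tests pass.
  metadata:
    difficulty: easy
    tags: [bug_fix, pagination]
```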
### devals create job

Creates a new job YAML file:

```shell
devals create job
```
Prompts for:

- Job name
- Which models, variants, and tasks to include
What to customize after:

- Add or refine `variants` (the generated file may only include `baseline: {}`)
- Add `task_filters` or `sample_filters` if you want to target a subset
- Configure `inspect_eval_arguments` for retry, timeout, and limit settings
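A customized job file could end up looking like this. The key names (`models`, `variants`, `baseline`, `task_filters`, `inspect_eval_arguments`) follow this page, but the nesting and especially the argument names under `inspect_eval_arguments` are assumptions to verify against your devals version:

```yaml
# evals/jobs/nightly.yaml -- hypothetical sketch, not a verified schema
models:
  - google/gemini-2.0-flash
variants:
  baseline: {}
  strict_prompt:                   # hypothetical second variant
    system_message: Answer tersely and cite the files you changed.
task_filters:
  tags: [bug_fix]                  # run only tasks tagged bug_fix
inspect_eval_arguments:
  retry_attempts: 2                # argument names here are illustrative
  timeout: 600
  limit: 10
```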
## Running evals

### Basic run

```shell
devals run <job_name>
```
The CLI:

1. Reads `devals.yaml` to find the `evals/` directory
2. Resolves your YAML config into a JSON manifest
3. Passes the manifest to `run-evals` (the Python `dash_evals` runner)
4. `dash_evals` calls Inspect AI's `eval_set()`
5. Logs are written to `logs/`
### Dry run

Preview the resolved configuration without making API calls:

```shell
devals run <job_name> --dry-run
```
This prints every task × model × variant combination that would execute. Use it to verify your setup before spending API credits.
> [!TIP]
> Always dry-run after editing YAML config. It catches typos, missing files, and bad task references before they cost you money.
## Viewing results

```shell
devals view
```

Launches the Inspect AI log viewer, a local web UI. devals automatically finds your `logs/` directory from `devals.yaml`.

To view logs from a specific location:

```shell
devals view /path/to/logs
```
What to look for in the viewer:
| Section | What it shows |
|---|---|
| Runs | Each task × model × variant combination |
| Transcript | The full conversation, including every tool call |
| Score | Pass/fail, model-graded scores, test results |
| Metadata | Timing, token usage, cost |
## Troubleshooting

### devals doctor

Checks all prerequisites:

```shell
devals doctor
```
It verifies:

- **Dart SDK**: required for the CLI itself
- **Python 3.13+**: required for `dash_evals`
- **dash_evals**: the Python evaluation package
- **Podman/Docker**: container runtime for sandboxed tasks
- **Flutter SDK**: needed for Flutter-based eval tasks
- **API keys**: checks for configured provider keys
Fix any errors before running evals. Warnings (like a missing Flutter SDK) are safe to ignore if your evals don’t need that tool.
## Quick reference

| Command | What it does |
|---|---|
| `devals init` | Initialize a new dataset in the current directory |
| `devals doctor` | Check prerequisites |
| `devals create pipeline` | Interactive walkthrough: sample → task → job |
| `devals create task` | Create a new task directory |
| `devals create sample` | Create a new sample |
| `devals create job` | Create a new job file |
| `devals run <job_name>` | Run an evaluation |
| `devals run <job_name> --dry-run` | Preview without executing |
| `devals view` | Launch the Inspect AI log viewer |
## Next steps

You now know the full CLI workflow. Part 5 looks under the hood at the `dash_evals` Python package, which is useful if you ever want to write custom task logic.