Install and run evals#
By the end of this page you’ll have installed everything,
run an evaluation with Python and Inspect AI, and (optionally) seen how the
devals CLI wraps all of that into a single workflow.
Prerequisites#
Tool |
Details |
Notes |
|---|---|---|
Ver. 3.10+ |
Runs the |
|
Ver. 3.13+ |
Runs the |
|
API keys |
|
Requires at least one model provider key. |
*Dart isn’t required. It powers the CLI, which assists in authoring various YAML files and hides some complexity of the framework. However, the framework is entirely useable without the CLI.
1. Set up your project#
Create a directory for your evals work and set up a Python virtual environment:
mkdir my-evals && cd my-evals
python3 -m venv .venv
source .venv/bin/activate
Install dash_evals (Python — required)#
Install the evaluation runner and its config library from git:
pip install "dash-evals @ git+https://github.com/flutter/evals.git#subdirectory=packages/dash_evals"
pip install "dataset-config-python @ git+https://github.com/flutter/evals.git#subdirectory=packages/dataset_config_python"
This gives you dash_evals — the runtime that drives
Inspect AI to run evaluations. Its CLI entry
point is run-evals.
Install devals CLI (Dart — optional)#
If you have the Dart SDK installed, you can also install the CLI, which automates some of the configuration and eval authoring.
dart pub global activate devals --source git https://github.com/flutter/evals.git --git-path packages/devals_cli
devals resolves YAML configuration, scaffolds new tasks and jobs, and
wraps run-evals and inspect view into a single workflow. It reduces the
learning curve but is entirely optional — everything it does can be done with
vanilla Python and Inspect AI commands.
2. Configure an API key#
export GEMINI_API_KEY=your_key_here
Set at least one provider key. You can also add it to a .env file in your
project directory — dash_evals loads it automatically.
3. Create a minimal dataset#
The basic unit of evals is the sample. A sample is a single ‘test case’, which includes, at a minimum, an prompt (called ‘input’) and a target, which is used to grade the evaluation.
Create a file called my_first_sample.json in your project directory:
[
{
"input": "Explain the difference between `Future.then()` and `async/await` in Dart. When should you prefer one over the other?",
"target": "The answer should explain that both are mechanisms for handling asynchronous code; async/await is syntactic sugar over Futures. It should note that async/await is generally preferred for readability, while .then() can be useful for simple one-off transformations."
}
]
3.2 Run it#
run-evals \
--task question_answer \
--model google/gemini-2.5-flash \
--dataset ./my_first_sample.json
This runs that one sample. Here’s what just happened:
run-evalsloaded thequestion_answertask function fromdash_evals. A task is a Python function that desribes the logic required to run a sample. Tasks are the generic, reusable logic that know how to run your bespoke samples. Some other task examples are [generate_code][] and [bug_fix][].Your dataset, or collection of samples, was passed to the task, which executes its solver chain, the instructions given to the agent being evaluated.
Inspect AI drives the agent, collects the response, and scores it with a scorer. Scorers vary by task. In this case, the scorer is [
model_graded_fact][], a scorer provided by Inspect that asks a second agent to compare the generated response to our target response.Finally, A log file was written to
./logs/
3.3 View the results#
Inspect AI ships with a robust log viewer. Launch it:
inspect view
This opens a local web UI where you can browse the run, see the full conversation transcript, and check how the response was scored.
[!TIP]
inspect viewfinds logs in the current directory by default. Pass a path to point it elsewhere:inspect view ./path/to/logs.
4. That’s what devals wraps#
The commands above — run-evals, inspect view — are the raw building blocks.
The devals CLI wraps all of them, helps manage your runtime environment,
and manages the YAML configuration layer we’ve put on top of Inspect AI,
which replaces the Samples JSON and many configuration
options that are otherwise be passed in as CLI flags. All of the quality-of-life
improvements provided by the CLI are described in the [Using the CLI guide][].
Importantly, you can still use the Yaml configuration layer without Dart and the CLI, it’s just less automated and requires you writing a bit of python glue code.
Let’s try the devals workflow now.
As a reminder, the install script is:
dart pub global activate devals --source git https://github.com/flutter/evals.git --git-path packages/devals_cli
4.1 Check your environment#
devals doctor
This verifies Dart, Python, dash_evals, API keys, and optional tools like
Podman and Flutter. Fix any errors it reports; warnings are safe to ignore for now.
4.2 Initialize a dataset#
Run devals init from your project directory (the my-evals directory you
created in step 1):
devals init
This creates:
my-evals/
├── devals.yaml # marker file
└── evals/
├── tasks/
│ └── get_started/
│ └── task.yaml # starter task + sample
└── jobs/
└── local_dev.yaml # ready-to-run job
The starter task uses the analyze_codebase task function — it asks the model
to explore your project and suggest an improvement. It’s a good smoke test that
doesn’t require a sandbox.
4.3 Run the eval#
devals run local_dev
Behind the scenes, this:
Resolves your YAML config (job + tasks + samples) into a JSON manifest
Passes the manifest to
run-evals(the Pythondash_evalsrunner)dash_evalscalls Inspect AI’seval_set(), which sends prompts, scores results, and writes logs
To preview the resolved config without making API calls:
devals run local_dev --dry-run
4.4 View results#
devals view
This is the same Inspect AI log viewer from before, but devals automatically
finds your logs/ directory based on devals.yaml.
Recap#
You’ve now seen the two layers of the system:
Layer |
What it does |
|---|---|
|
The engine. Runs tasks, sends prompts, scores responses. |
|
The convenience layer. YAML config, scaffolding, log discovery. |
Next steps#
You’re set up and you’ve seen the framework in action. In Part 2, you’ll author a more complex, agentic evaluation from scratch.