Configure jobs#
In Part 1 and Part 2 you wrote tasks and jobs by following a recipe. Now let’s understand the full configuration model so you can build your own from scratch.
This page walks through every piece of the YAML configuration — building each file up incrementally.
The three config files#
Everything lives under your evals/ directory:
File |
Purpose |
|---|---|
|
Defines what to evaluate — the task function, samples, workspace, prompt |
|
Defines how to run it — models, variants, filters, sandbox, limits |
Context files (optional) |
Markdown files injected into the model’s prompt via variants |
The devals CLI resolves these into a single JSON manifest and hands it to the
Python runner. Most of the time, you’re just editing YAML.
Building a task.yaml#
Let’s build a task file from scratch, adding one concept at a time.
Start minimal#
The only required field is the task function:
func: question_answer
This is enough to define a task, but it has no samples — nothing to evaluate yet.
Add a sample#
Samples go under dataset.samples.inline. Each sample needs at minimum an id,
input (the prompt), and target (grading criteria):
func: question_answer
dataset:
samples:
inline:
- id: explain_null_safety
input: |
Explain Dart's sound null safety. How does it prevent
null reference errors at compile time?
target: |
Should explain nullable vs non-nullable types, the `?`
suffix, null-aware operators, and how the analyzer enforces
null checks at compile time.
Add a system message#
A system_message customizes the prompt sent to the model before your sample input:
func: question_answer
system_message: |
You are an expert Dart developer. Answer questions with code
examples where appropriate. Be concise.
dataset:
samples:
inline:
- id: explain_null_safety
# ...
Add files and setup#
For agentic tasks that run code in a sandbox, use files and setup:
func: bug_fix
# Copy a project into the sandbox — key = destination, value = source
files:
/workspace: ../../workspaces/my_dart_package
setup: "cd /workspace && dart pub get"
dataset:
samples:
inline:
- id: fix_the_bug
input: |
The tests are failing. Find and fix the bug.
target: |
All tests should pass after the fix.
files and setup at the task level are inherited by all samples. A sample
can override them:
dataset:
samples:
inline:
- id: fix_the_bug
files:
/workspace: ./custom_project # overrides task-level files
setup: "cd /workspace && pub get" # overrides task-level setup
input: ...
[!NOTE] File paths in
filesvalues are resolved relative to the task directory. Task-levelfilesstack with sample-levelfiles— on a key conflict, the sample wins.
Add metadata for filtering#
Samples can carry metadata with tags and difficulty. Jobs use these for filtering:
dataset:
samples:
inline:
- id: fix_the_bug
metadata:
difficulty: medium
tags: [dart, bug-fix, async]
input: ...
target: ...
Use external sample files#
For large datasets, you can keep samples in separate files and reference them with glob patterns:
func: question_answer
dataset:
samples:
paths:
- samples/*.yaml # loads every .yaml in the samples/ subdirectory
Each external file contains a list of sample objects in the same format as
dataset.samples.inline.
Building a job.yaml#
Jobs control how tasks run. Let’s build one up.
Start with models and tasks#
The bare minimum — which models and which tasks:
models:
- google/gemini-2.5-flash
tasks:
inline:
explain_null_safety: {} # run all samples with default settings
Add variants#
Variants let you test the same task under different conditions. Each variant is a named
map — an empty map {} means “no extras” (the baseline):
models:
- google/gemini-2.5-flash
variants:
baseline: {}
context_only:
files: [./context_files/dart_docs.md]
mcp_only:
mcp_servers:
- name: dart
command: dart
args: [mcp-server]
full:
files: [./context_files/dart_docs.md]
mcp_servers:
- name: dart
command: dart
args: [mcp-server]
tasks:
inline:
explain_null_safety: {}
This produces 4 runs per sample (one per variant) × however many models you list.
Variant sub-fields:
Field |
What it does |
|---|---|
|
Context files injected into the prompt |
|
MCP tool servers the model can call (stdio, HTTP, or Python ref) |
|
Skill directories copied into the sandbox |
|
Extra parameters merged into the task config at runtime |
Filter tasks and samples#
Use task_filters and sample_filters to select subsets by tag:
task_filters:
include_tags: [dart] # only tasks tagged "dart"
exclude_tags: [deprecated] # skip deprecated tasks
sample_filters:
include_tags: [bug-fix] # only samples tagged "bug-fix"
include_tags— an item must have all listed tags to be includedexclude_tags— an item is excluded if it has any listed tag
You can also filter per-task using include-samples and exclude-samples:
tasks:
inline:
fix_math_utils:
include-samples: [fix_factorial] # only run this sample
include-variants: [baseline] # only run this variant
Add sandbox configuration#
For tasks that need container execution:
sandbox: podman # or "docker"
You can also pass additional sandbox parameters:
sandbox:
environment: podman
image_prefix: us-central1-docker.pkg.dev/my-project/repo/
Add Inspect AI parameters#
The inspect_eval_arguments section passes settings through to Inspect AI’s
eval_set():
inspect_eval_arguments:
retry_attempts: 20
fail_on_error: 0.05
log_level: info
# Defaults applied to every task in this job
task_defaults:
time_limit: 600
message_limit: 50
Putting it all together#
Here’s a complete job file using everything above:
models:
- google/gemini-2.5-flash
- anthropic/claude-sonnet-4-20250514
sandbox: podman
max_connections: 15
variants:
baseline: {}
context_only:
files: [./context_files/dart_docs.md]
with_mcp:
mcp_servers:
- name: dart
command: dart
args: [mcp-server]
task_filters:
include_tags: [dart]
tasks:
inline:
fix_math_utils:
exclude-variants: [with_mcp] # MCP not relevant for this task
dart_question_answer: {}
inspect_eval_arguments:
retry_attempts: 10
task_defaults:
time_limit: 300
message_limit: 30
This will run:
2 models × 2 applicable variants × all matching samples in
fix_math_utils2 models × 3 variants × all matching samples in
dart_question_answer
Summary#
Concept |
Where it lives |
What it controls |
|---|---|---|
Task |
|
What to evaluate: function, prompt, workspace, samples |
Job |
|
How to run: models, variants, filters, sandbox, limits |
Variant |
Inside job YAML |
Different configurations for the agent being evaluated |
Sample |
Inside task YAML (or external files) |
Individual test cases with input/target pairs |
Context file |
Referenced by variants |
Extra information injected into the model’s prompt |
For the complete field-by-field reference, see YAML Configuration Fields.
Next steps#
Now that you understand the configuration model, Part 4
shows how the devals CLI can generate most of this config for you — and
what you need to customize in the output.