YAML Configuration Fields

This page provides a complete field-by-field reference for each YAML configuration file type, cross-referenced with the corresponding Dart and Python object field names.

Job

Job files define runtime settings for an evaluation run: sandbox configuration, rate limits, model selection, variant definitions, tag-based filtering, and pass-through parameters for Inspect AI’s eval_set() and Task constructors. Job files are located in eval/jobs/.

| Field name | YAML type | Optional | Dart field | Python field | Description |
|---|---|---|---|---|---|
| description | string | Y | description | description | Human-readable description of the job |
| log_dir | string | N | logDir | log_dir | Directory to write evaluation logs to |
| sandbox | string/object | Y | sandbox | sandbox | Sandbox configuration. String shorthand (e.g. podman) is equivalent to {environment: podman} |
| sandbox.environment | string | Y | | | Sandbox type: local, docker, or podman (default: local) |
| sandbox.parameters | object | Y | | | Pass-through parameters for sandbox plugin configuration |
| sandbox.image_prefix | string | Y | | | Registry prefix prepended to image names during sandbox resolution (e.g. us-central1-docker.pkg.dev/project/repo/) |
| max_connections | int | Y | maxConnections | max_connections | Maximum concurrent API connections (default: 10) |
| models | list | N | models | models | List of model identifiers to evaluate (required; at least one model must be specified) |
| variants | map | Y | variants | variants | Named variant definitions (keys are names, values are config maps). Can also be a list of paths to external variant files. |
| variants.<name>.files | list | Y | | | Paths or glob patterns to context files |
| variants.<name>.mcp_servers | list | Y | | | MCP server configurations. Each entry is one of: (1) an object with command/args for stdio/sandbox, (2) an object with url for HTTP, or (3) a ref: string pointing to a Python MCPServer object. Common sub-fields: name, transport. Stdio sub-fields: command, args, env, cwd. HTTP sub-fields: url, authorization, headers. |
| variants.<name>.skills | list | Y | | | Paths or glob patterns to skill directories |
| variants.<name>.task_parameters | object | Y | | | Optional parameters merged into the task config dict at runtime |
| task_filters | object | Y | taskFilters | task_filters | Tag-based task selection filter |
| task_filters.include_tags | list | Y | TagFilter.includeTags | TagFilter.include_tags | Only run tasks whose metadata tags include all of these |
| task_filters.exclude_tags | list | Y | TagFilter.excludeTags | TagFilter.exclude_tags | Exclude tasks whose metadata tags include any of these |
| sample_filters | object | Y | sampleFilters | sample_filters | Tag-based sample selection filter (same schema as task_filters) |
| task_paths | list | Y | taskPaths | task_paths | Glob patterns for discovering task directories (relative to dataset root) |
| tasks | object | Y | tasks | tasks | Per-task configurations with inline overrides |
| tasks.<task_id>.include-samples | list | Y | JobTask.includeSamples | JobTask.include_samples | Only run these sample IDs |
| tasks.<task_id>.exclude-samples | list | Y | JobTask.excludeSamples | JobTask.exclude_samples | Exclude these sample IDs |
| tasks.<task_id>.args | object | Y | JobTask.args | JobTask.args | Per-task argument overrides passed to the task function |
| tasks.<task_id>.include-variants | list | Y | JobTask.includeVariants | JobTask.include_variants | Only run these variant names for this task |
| tasks.<task_id>.exclude-variants | list | Y | JobTask.excludeVariants | JobTask.exclude_variants | Exclude these variant names for this task |
| save_examples | bool | Y | saveExamples | save_examples | Copy final workspace to <logDir>/examples/ after each sample (default: false) |
| inspect_eval_arguments | object | Y | inspectEvalArguments | inspect_eval_arguments | Pass-through dict of any valid Inspect AI eval_set() kwargs (e.g. retry_attempts, log_level, max_tasks, tags, task_defaults, eval_set_overrides, etc.). See Inspect AI docs for the full list of supported parameters. |
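Two behaviours in the Job fields above are easier to read as code: the sandbox string shorthand and the include/exclude tag semantics. The following is a minimal Python sketch of those rules, not the tool's actual implementation. The helper names normalize_sandbox and matches_tag_filter are hypothetical, and falling back to local when sandbox is omitted entirely is an assumption based on the documented default for sandbox.environment.

```python
from typing import Any, Iterable, Mapping


def normalize_sandbox(value: Any) -> dict:
    """Expand the sandbox string shorthand into the object form.

    Hypothetical helper: "podman" becomes {"environment": "podman"}.
    Assumption: a missing value falls back to the documented default, "local".
    """
    if value is None:
        return {"environment": "local"}
    if isinstance(value, str):
        return {"environment": value}
    # Object form: copy it and fill in the default environment if absent.
    normalized = dict(value)
    normalized.setdefault("environment", "local")
    return normalized


def matches_tag_filter(tags: Iterable[str], tag_filter: Mapping[str, Any]) -> bool:
    """Return True if an item carrying `tags` passes the filter.

    include_tags: every listed tag must be present on the item.
    exclude_tags: any listed tag being present rejects the item.
    """
    tag_set = set(tags)
    include = set(tag_filter.get("include_tags") or [])
    exclude = set(tag_filter.get("exclude_tags") or [])
    return include <= tag_set and not (exclude & tag_set)


# Example values below are made up for illustration.
task_filters = {"include_tags": ["dart", "async"], "exclude_tags": ["slow"]}

print(normalize_sandbox("podman"))
print(matches_tag_filter(["dart", "async", "io"], task_filters))
print(matches_tag_filter(["dart", "async", "slow"], task_filters))
```

Because sample_filters shares the task_filters schema, the same predicate would apply to sample tags under these assumptions.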

Task

Task files define a single evaluation task with its samples, prompt configuration, and optional Inspect AI Task parameter overrides. Each task file is located at eval/tasks/<task_id>/task.yaml.

Task-level Inspect AI Task parameters (model, limits, sandbox, etc.) are nested under inspect_task_args.

| Field name | YAML type | Optional | Dart field | Python field | Description |
|---|---|---|---|---|---|
| func | string | Y | func | func | Name of the @task function or module:function reference (defaults to directory name) |
| id | string | Y | | | Task identifier (defaults to directory name) |
| description | string | Y | description | description | Human-readable description |
| dataset | object | Y | | | Dataset configuration. Must contain exactly one of samples, json, or csv. |
| dataset.samples | object | Y | | | Inline/file-based sample definitions (see samples.inline and samples.paths below) |
| dataset.samples.inline | list | Y | | | Inline sample definitions (list of sample objects) |
| dataset.samples.paths | list | Y | | | Glob patterns for external sample YAML files (relative to task dir) |
| dataset.json | string | Y | | | Path or URL to a JSON/JSONL dataset file (maps to Inspect’s json_dataset()) |
| dataset.csv | string | Y | | | Path to a CSV dataset file (maps to Inspect’s csv_dataset()) |
| dataset.args | object | Y | Dataset.args | Dataset.args | Additional arguments passed through to the dataset constructor (e.g. auto_id, shuffle, delimiter) |
| system_message | string | Y | systemMessage | system_message | Custom system prompt for this task |
| files | object | Y | files | files | Files to copy into sandbox for all samples ({destination: source}). Task-level files stack with sample-level files (sample wins on key conflict). |
| setup | string | Y | setup | setup | Setup script to run in sandbox before evaluation (overridden by sample-level setup) |
| display_name | string | Y | displayName | display_name | Task display name (e.g. for plotting) |
| version | int | Y | version | version | Version of task spec |
| metadata | object | Y | metadata | metadata | Additional metadata to associate with the task |
| inspect_task_args | object | Y | | | Pass-through dict of any valid Inspect AI Task() kwargs (e.g. model, time_limit, message_limit, epochs, sandbox, etc.). See Inspect AI docs for the full list. |
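The Task fields above state three rules that interact with sample-level settings: dataset must name exactly one source, task-level files stack with sample-level files, and a sample-level setup replaces the task-level one. The following Python sketch illustrates those rules under stated assumptions. The function names dataset_source, resolve_files, and resolve_setup are hypothetical, and the paths in the example are made up.

```python
from typing import Mapping, Optional

DATASET_SOURCES = ("samples", "json", "csv")


def dataset_source(dataset: Mapping) -> str:
    """Return the single dataset source key, or raise if the rule is violated."""
    present = [key for key in DATASET_SOURCES if key in dataset]
    if len(present) != 1:
        raise ValueError(
            f"dataset must contain exactly one of {DATASET_SOURCES}, found {present}"
        )
    return present[0]


def resolve_files(task_files: Optional[Mapping], sample_files: Optional[Mapping]) -> dict:
    """Merge {destination: source} maps; the sample wins on a key conflict."""
    merged = dict(task_files or {})
    merged.update(sample_files or {})
    return merged


def resolve_setup(task_setup: Optional[str], sample_setup: Optional[str]) -> Optional[str]:
    """A sample-level setup script overrides the task-level one when present."""
    return sample_setup if sample_setup is not None else task_setup


# Illustrative values only.
task_files = {"pubspec.yaml": "shared/pubspec.yaml", "lib/main.dart": "shared/main.dart"}
sample_files = {"lib/main.dart": "samples/001/main.dart"}

print(dataset_source({"json": "data/cases.jsonl", "args": {"shuffle": True}}))
print(resolve_files(task_files, sample_files))
print(resolve_setup("dart pub get", None))
```

Note that dataset.args sits alongside the source key and does not count as a source in this sketch.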

Sample

Samples are individual test cases defined either inline in task.yaml under dataset.samples.inline, or in external YAML files referenced via dataset.samples.paths. Fields like difficulty and tags should be nested inside the sample’s metadata dict.

| Field name | YAML type | Optional | Dart field | Python field | Description |
|---|---|---|---|---|---|
| id | string | N | id | id | Unique sample identifier |
| input | string | N | input | input | The prompt given to the model |
| target | string | N | target | target | Expected output or grading criteria |
| metadata.difficulty | string | Y | | | easy, medium, or hard |
| metadata.tags | list | Y | | | Categories for filtering |
| metadata.system_message | string | Y | | | Override system prompt for this sample |
| choices | list | Y | choices | choices | Answer choices for multiple-choice evaluations |
| metadata | object | Y | metadata | metadata | Arbitrary metadata |
| sandbox | string/object | Y | sandbox | sandbox | Override sandbox environment for this sample |
| files | object | Y | files | files | Files to copy into sandbox ({destination: source}) |
| setup | string | Y | setup | setup | Setup script to run in sandbox before evaluation |
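To make the Sample shape concrete, the following Python sketch mirrors a sample as a plain dict and checks the rules described above: id, input, and target are required, and difficulty and tags belong under metadata. The check_sample helper and the sample values are hypothetical and only illustrate the documented layout; the real loader may validate differently.

```python
from typing import List

REQUIRED_SAMPLE_FIELDS = ("id", "input", "target")
DIFFICULTIES = {"easy", "medium", "hard"}


def check_sample(sample: dict) -> List[str]:
    """Return a list of problems found in a sample dict (empty if none)."""
    problems = []
    for field in REQUIRED_SAMPLE_FIELDS:
        if field not in sample:
            problems.append(f"missing required field: {field}")
    # difficulty and tags are expected inside metadata, not at the top level.
    for misplaced in ("difficulty", "tags"):
        if misplaced in sample:
            problems.append(f"{misplaced} should be nested under metadata")
    metadata = sample.get("metadata") or {}
    difficulty = metadata.get("difficulty")
    if difficulty is not None and difficulty not in DIFFICULTIES:
        problems.append(f"unknown difficulty: {difficulty}")
    return problems


# A sample as it might look after the YAML is parsed (values are made up).
sample = {
    "id": "sample_001",
    "input": "Write a function that reverses a string.",
    "target": "The function returns the reversed string for any input.",
    "metadata": {"difficulty": "easy", "tags": ["strings"]},
    "files": {"lib/main.dart": "fixtures/main.dart"},
}

print(check_sample(sample))
print(check_sample({"id": "sample_002", "tags": ["strings"]}))
```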