Configuration Reference#

This document describes the standard eval/ directory structure and YAML configuration files used by the evaluation framework.

Overview#

The evaluation framework uses the eval/ directory as its entry point. It contains:

  • Task definitions autodiscovered from tasks/*/task.yaml

  • Job files in jobs/ that control what to run

  • Shared resources (context files, sandboxes)

Configuration is parsed and resolved by the Dart dataset_config_dart package, which produces an EvalSet JSON manifest that the Python dash_evals package consumes.

See also: YAML Configuration Fields for a complete field-by-field reference with Dart and Python cross-references.

Directory Structure#

eval/
├── jobs/                    # Job files for different run configurations
│   ├── local_dev.yaml
│   └── ci.yaml
├── tasks/                   # Task definitions (autodiscovered)
│   ├── flutter_bug_fix/
│   │   ├── task.yaml        # Task config with inline samples
│   │   └── project/         # Project files (if applicable)
│   ├── dart_question_answer/
│   │   └── task.yaml
│   └── generate_flutter_app/
│       ├── task.yaml
│       └── todo_tests/      # Test files for a sample
├── context_files/           # Context files injected into prompts
│   └── flutter.md
└── sandboxes/               # Container configurations
    └── podman/
        ├── Containerfile
        └── compose.yaml

Task files#

Each subdirectory in tasks/ that contains a task.yaml is automatically discovered as a task. The directory name is the task ID.

# tasks/flutter_bug_fix/task.yaml
func: flutter_bug_fix
system_message: |
  You are an expert Flutter developer. Fix the bug and explain your changes.

# Task-level files copied into sandbox (inherited by all samples)
files:
  /workspace: ./project
setup: "cd /workspace && flutter pub get"

dataset:
  samples:
    inline:
      - id: flutter_bloc_cart_mutation_001
        input: |
          Fix the bug where adding items to cart doesn't update the total.
        target: |
          The fix should modify the BLoC to emit a new state instead of mutating.
        metadata:
          difficulty: medium
          tags: [bloc, state]

      - id: navigation_crash
        files:
          /workspace: ./nav_project    # Override task-level files
        input: |
          Fix the crash when navigating back from the detail screen.
        target: |
          The fix should handle the disposed controller properly.
        metadata:
          difficulty: hard
          tags: [navigation]

For the complete list of task fields (including Inspect AI Task parameters), see the Task fields table.
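The autodiscovery rule described in this section (each subdirectory of tasks/ that contains a task.yaml becomes a task, and the directory name is the task ID) can be sketched in Python. This is only an illustration of the rule, not the dataset_config_dart implementation; the helper name discover_tasks is hypothetical.

```python
from pathlib import Path


def discover_tasks(eval_root: str) -> dict[str, Path]:
    """Map each task ID (directory name) to its task.yaml path.

    Hypothetical helper for illustration only.
    """
    tasks_dir = Path(eval_root) / "tasks"
    found: dict[str, Path] = {}
    # Only direct subdirectories that contain a task.yaml count as tasks.
    for task_file in sorted(tasks_dir.glob("*/task.yaml")):
        found[task_file.parent.name] = task_file
    return found
```

Under this sketch, a directory such as tasks/flutter_bug_fix/ with a task.yaml yields the ID flutter_bug_fix, while a subdirectory without a task.yaml is skipped.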

Files and Setup#

# Copy a local directory into the sandbox
files:
  /workspace: ./project
setup: "cd /workspace && flutter pub get"

# Copy individual files
files:
  /workspace/lib/main.dart: ./fixtures/main.dart
  /workspace/test/widget_test.dart: ./fixtures/test.dart

[!NOTE] Paths in files values are resolved relative to the task directory (e.g., tasks/flutter_bug_fix/). Task-level files and setup are inherited by all samples. Sample-level files are merged on top of task-level files (the sample's entry wins when both define the same key), and sample-level setup replaces task-level setup.
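The inheritance rules in the note can be illustrated with a small Python sketch. It assumes the task and sample have already been loaded into plain dictionaries; resolve_sample is a hypothetical name, not a function of the framework.

```python
def resolve_sample(task: dict, sample: dict) -> dict:
    """Combine task-level and sample-level `files` and `setup` (illustrative)."""
    # Sample files stack on top of task files; the sample wins on key conflict.
    files = {**task.get("files", {}), **sample.get("files", {})}
    # Sample-level setup replaces task-level setup when the sample defines one.
    setup = sample.get("setup", task.get("setup"))
    return {"files": files, "setup": setup}


task = {
    "files": {"/workspace": "./project"},
    "setup": "cd /workspace && flutter pub get",
}
sample = {"files": {"/workspace": "./nav_project"}}  # overrides /workspace only
resolved = resolve_sample(task, sample)
```

Here resolved keeps the task-level setup command but maps /workspace to ./nav_project, mirroring the navigation_crash sample in the task example.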


Sample files#

A sample is a single test case containing an input prompt, expected output (grading target), and optional configuration. Samples are defined inline in task.yaml or in external YAML files referenced via paths.

# Inline in task.yaml
dataset:
  samples:
    inline:
      - id: dart_async_await_001
        input: |
          Explain the difference between Future.then() and async/await in Dart.
        target: |
          The answer should cover both approaches, explain that they are
          functionally equivalent, and note when each is preferred.
        metadata:
          difficulty: medium
          tags: [async, dart]
          added: 2025-02-04
          category: language_fundamentals

For the complete list of sample fields, see the Sample fields table.

Multiple Choice Example#

- id: dart_null_safety_quiz
  input: "Which of the following is NOT a valid way to handle null in Dart 3?"
  target: C
  choices:
    - "Use the null-aware operator ?."
    - "Use a null check with if (x != null)"
    - "Use the ! operator on every nullable variable"
    - "Use late initialization"
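In this example the target is a letter rather than the answer text. A minimal sketch of how such a letter could be mapped back to a choice, assuming choices are lettered A, B, C, ... in list order; choice_for_target is a hypothetical helper, and the actual grading is performed by the framework's scorer.

```python
import string


def choice_for_target(choices: list[str], target: str) -> str:
    """Return the choice text that a letter target refers to (illustrative)."""
    letter = target.strip().upper()
    if len(letter) != 1 or letter not in string.ascii_uppercase:
        raise ValueError(f"target must be a single letter, got {target!r}")
    index = string.ascii_uppercase.index(letter)
    if index >= len(choices):
        raise ValueError(f"target {letter} is out of range for {len(choices)} choices")
    return choices[index]


choices = [
    "Use the null-aware operator ?.",
    "Use a null check with if (x != null)",
    "Use the ! operator on every nullable variable",
    "Use late initialization",
]
answer = choice_for_target(choices, "C")
```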

Sandbox Files Example#

- id: flutter_fix_counter
  input: "Fix the bug in the counter app."
  target: "The fix should update the state correctly."
  sandbox: docker
  files:
    /workspace/lib/main.dart: ./fixtures/broken_counter.dart
    /workspace/test/widget_test.dart: ./fixtures/counter_test.dart
  setup: "cd /workspace && flutter pub get"

Job files#

Job files define what to run and can override built-in runtime defaults. They’re selected via devals run <job_name>. Multiple jobs can be run sequentially.

# jobs/local_dev.yaml
name: local_dev

# Sandbox configuration (string shorthand or object)
sandbox:
  environment: podman

# Override runtime defaults
max_connections: 15

# Save the agent's final workspace output to logs/<run>/examples/
# save_examples: true

# Filter what to run (required)
models:
  - google/gemini-2.5-flash

# Variants are defined as a named map.
# Each key is a variant name; the value is the variant configuration.
variants:
  baseline: {}
  context_only: { files: [./context_files/flutter.md] }
  mcp_only: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] }
  full: { files: [./context_files/flutter.md], mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] }

# Inspect AI eval_set() parameters (all optional, nested under inspect_eval_arguments)
inspect_eval_arguments:
  retry_attempts: 20
  fail_on_error: 0.05
  log_level: info
  tags: [nightly]

  # Default Task-level overrides applied to every task
  task_defaults:
    time_limit: 600
    message_limit: 50

  # Additional eval_set() parameters not covered above
  # eval_set_overrides:
  #   bundle_dir: ./bundle
  #   log_images: true

For the complete list of job fields (including all Inspect AI eval_set() parameters), see the Job fields table.

Pass-Through Sections#

task_defaults#

Default Task parameters applied to every task in this job. Per-task overrides from task.yaml take precedence. Nested under inspect_eval_arguments:

inspect_eval_arguments:
  task_defaults:
    time_limit: 600
    message_limit: 50
    cost_limit: 2.0
    epochs: 3

eval_set_overrides#

Arbitrary eval_set() kwargs for parameters not covered by the named fields above. Top-level inspect_eval_arguments fields take precedence over overrides. Nested under inspect_eval_arguments:

inspect_eval_arguments:
  eval_set_overrides:
    bundle_dir: ./bundle
    log_images: true
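The two precedence rules in the pass-through sections (per-task values from task.yaml beat task_defaults, and named inspect_eval_arguments fields beat eval_set_overrides) can be sketched as dictionary merges. The function names are hypothetical, and the assumption that task_defaults is removed from the eval_set() kwargs and applied per task is ours, not something the source states.

```python
def resolve_eval_set_kwargs(inspect_eval_arguments: dict) -> dict:
    """Flatten inspect_eval_arguments into eval_set() kwargs (illustrative)."""
    named = dict(inspect_eval_arguments)
    overrides = named.pop("eval_set_overrides", None) or {}
    # Assumption: task_defaults is applied per task, so it is dropped here.
    named.pop("task_defaults", None)
    # Named fields win over eval_set_overrides on conflict.
    return {**overrides, **named}


def resolve_task_params(task_defaults: dict, task_yaml_params: dict) -> dict:
    """Per-task values from task.yaml win over job-level task_defaults."""
    return {**task_defaults, **task_yaml_params}


job_args = {
    "retry_attempts": 20,
    "log_level": "info",
    "task_defaults": {"time_limit": 600, "epochs": 3},
    "eval_set_overrides": {"bundle_dir": "./bundle", "log_level": "debug"},
}
kwargs = resolve_eval_set_kwargs(job_args)
params = resolve_task_params(job_args["task_defaults"], {"time_limit": 900})
```

In this sketch log_level stays info because the named field wins over the override, and the task's own time_limit of 900 replaces the job default of 600.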

Tasks Object#

tasks:
  # Discover tasks via glob patterns (relative to dataset root)
  paths: [tasks/*]
  # Per-task overrides (keys must match directory names in tasks/)
  inline:
    flutter_bug_fix:
      include-variants: [baseline]     # Only run these variants for this task
      include-samples: [sample_001]    # Only run these samples
      exclude-samples: [slow_test]     # Exclude these samples
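A rough Python sketch of how include-samples and exclude-samples could narrow a task's samples. It assumes the include list is applied first and the exclude list second; the source does not state the order, and select_samples is a hypothetical helper.

```python
from typing import Optional


def select_samples(
    sample_ids: list[str],
    include: Optional[list[str]] = None,
    exclude: Optional[list[str]] = None,
) -> list[str]:
    """Filter sample IDs, preserving their original order (illustrative)."""
    # No include list means every sample is a candidate.
    selected = [s for s in sample_ids if include is None or s in include]
    # Assumption: exclusions are applied after inclusions.
    return [s for s in selected if exclude is None or s not in exclude]


all_ids = ["sample_001", "slow_test", "sample_002"]
kept = select_samples(all_ids, include=["sample_001", "slow_test"], exclude=["slow_test"])
```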

Variants#

Variants modify how tasks execute, controlling context injection, tool availability, and skill access. Variants are defined as named maps in job files.

variants:
  baseline: {}
  context_only: { files: [./context_files/flutter.md] }
  mcp_only: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] }
  full: { files: [./context_files/flutter.md], mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] }

Variant sub-fields (files, mcp_servers, skills, task_parameters) are documented in the Job fields table.

Jobs can restrict which variants apply to specific tasks via include-variants and exclude-variants on the tasks.inline.<task_id> object:

# job.yaml — only run baseline and mcp_only variants for flutter_bug_fix
tasks:
  inline:
    flutter_bug_fix:
      include-variants: [baseline, mcp_only]

Glob patterns (paths containing *, ?, or [) in variant path lists are expanded automatically. For skills, only directories containing SKILL.md are included.
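The expansion rule can be sketched as follows. This is a minimal illustration, assuming paths are resolved against some base directory and that the SKILL.md filter applies to glob matches; expand_paths and its parameters are hypothetical names.

```python
import glob
import os
from pathlib import Path

GLOB_CHARS = ("*", "?", "[")


def expand_paths(base_dir: str, entries: list[str], skills: bool = False) -> list[Path]:
    """Expand glob entries relative to base_dir; keep literal paths as-is."""
    resolved: list[Path] = []
    for entry in entries:
        if not any(ch in entry for ch in GLOB_CHARS):
            # Literal path: no expansion and no existence check in this sketch.
            resolved.append(Path(base_dir, entry))
            continue
        # Escape the base directory so only the entry is treated as a pattern.
        pattern = os.path.join(glob.escape(base_dir), entry)
        for match in sorted(glob.glob(pattern)):
            path = Path(match)
            # For skills, keep only directories that contain a SKILL.md file.
            if skills and not (path / "SKILL.md").is_file():
                continue
            resolved.append(path)
    return resolved
```

For example, a skills entry such as skills/* would keep skills/a/ when it contains SKILL.md and drop sibling directories or files that do not.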

MCP Server Modes#

MCP servers in variants support three modes:

variants:
  # 1. Declarative stdio/sandbox — command-based
  with_dart_mcp:
    mcp_servers:
      - name: dart
        command: dart
        args: [mcp-server]

  # 2. Declarative HTTP — url-based
  with_http_mcp:
    mcp_servers:
      - name: my-api
        url: https://mcp.example.com/api
        authorization: "bearer-token-here"    # optional OAuth Bearer token
        headers:                               # optional extra headers
          X-Custom-Header: value

  # 3. Python ref — import a pre-built MCPServer
  with_custom_mcp:
    mcp_servers:
      - ref: "my_package.mcp:staging_server"
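The three modes are distinguished by which key an entry carries: command, url, or ref. A small sketch of that dispatch, assuming each entry sets exactly one of those keys and that a ref has the form module:attribute; both function names are hypothetical and the mode labels are illustrative.

```python
MODE_BY_KEY = {"command": "stdio", "url": "http", "ref": "python_ref"}


def mcp_server_mode(entry: dict) -> str:
    """Classify an mcp_servers entry by its distinguishing key (illustrative)."""
    present = [key for key in MODE_BY_KEY if key in entry]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {list(MODE_BY_KEY)}, got {present}")
    return MODE_BY_KEY[present[0]]


def split_ref(ref: str) -> tuple[str, str]:
    """Split 'module:attribute' into its two parts (assumed format)."""
    module, sep, attribute = ref.partition(":")
    if not sep or not module or not attribute:
        raise ValueError(f"ref must look like 'module:attribute', got {ref!r}")
    return module, attribute


stdio_mode = mcp_server_mode({"name": "dart", "command": "dart", "args": ["mcp-server"]})
http_mode = mcp_server_mode({"name": "my-api", "url": "https://mcp.example.com/api"})
ref_parts = split_ref("my_package.mcp:staging_server")
```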

[!IMPORTANT] The skills feature requires a sandbox (docker/podman). Skill directories are copied into the sandbox filesystem by Inspect AI’s built-in skill() tool. Each skill directory must contain a SKILL.md file.


Context Files#

Context files are Markdown files with YAML frontmatter that provide additional context to the model.

---
title: "AI Rules for Flutter"
version: "1.0.0"
description: "Recommended patterns and best practices"
dart_version: "3.10.0"
flutter_version: "3.24.0"
updated: "2025-12-24"
---

## Flutter Best Practices

Content here is injected into the model's context when the variant
has files pointing to this file.

Field            Type    Required  Description
title            string  Yes       Context file title
version          string  Yes       Version identifier
description      string  Yes       Brief description
dart_version     string  No        Target Dart version
flutter_version  string  No        Target Flutter version
updated          string  No        Last update date
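The required fields listed above can be checked with a short Python sketch. To stay dependency-free it parses only flat key: "value" frontmatter lines, which is a simplification; a real loader would use a YAML parser. The function names are hypothetical.

```python
REQUIRED_FIELDS = ("title", "version", "description")


def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a context file into (frontmatter dict, Markdown body).

    Simplified: handles only flat `key: value` lines between '---' markers.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    meta: dict = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing marker is the Markdown body.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    raise ValueError("missing closing '---'")


def missing_required(meta: dict) -> list[str]:
    """List required frontmatter fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not meta.get(field)]


text = '---\ntitle: "AI Rules for Flutter"\nversion: "1.0.0"\ndart_version: "3.10.0"\n---\n\n## Flutter Best Practices\n'
meta, body = parse_frontmatter(text)
problems = missing_required(meta)
```

In this run problems reports description, since the sample text deliberately omits it.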


CLI Usage#

# Run a specific job
devals run local_dev
devals run ci

# Dry run — validate config without executing
devals run local_dev --dry-run

# Create a new task
devals create task

# Add a sample to an existing task
devals create sample

# Initialize a new dataset
devals init