About the framework#

You’ve been using built-in task functions like bug_fix and question_answer. This page explains how they work — useful if you want to write custom eval logic or just understand what happens when you run devals run.

Architecture overview#

When you run an eval, data flows through three layers:

YAML config → Dart resolver → JSON manifest → Python runner → Inspect AI

Layer	Package	What it does
YAML config	—	Your `task.yaml` and `job.yaml` files
Dart resolver	`dataset_config_dart`	Parses YAML, resolves globs and references, produces a JSON manifest
Hydration	`dataset_config_python`	Converts config dicts into Inspect AI objects (datasets, MCP servers, skills)
Python runner	`dash_evals`	Reads the manifest, builds Inspect AI `Task` objects, calls `eval_set()`
Inspect AI	`inspect_ai`	Runs solver chains, sends prompts, collects responses, runs scorers

The devals CLI (Dart) orchestrates steps 1–2, then hands off to run-evals (Python) for steps 3–4.

The `dash_evals` package#

Entry point#

The Python CLI entry point is run-evals, defined in dash_evals/main.py. It supports two modes:

# Mode 1: From a JSON manifest (what devals uses)
run-evals --json ./eval_set.json

# Mode 2: Direct CLI arguments (what you used in Part 1)
run-evals --task question_answer --model google/gemini-2.5-flash --dataset samples.json

JSON runner#

When using --json mode, json_runner.py does the heavy lifting:

Reads the manifest file
For each task definition, resolves the task function by name
Builds an Inspect AI MemoryDataset from the inline samples
Calls the task function with the dataset and config
Collects all Task objects and calls inspect_ai.eval_set()

Task resolution#

The func field in your task.yaml is resolved to a Python function. Three formats are supported:

Format	Example	How it resolves
Short name	`question_answer`	Looks up `dash_evals.runner.tasks.question_answer`
Colon syntax	`my_package.tasks:my_task`	Imports `my_package.tasks`, gets `my_task`
Dotted path	`my_package.tasks.my_task.my_task`	Last segment is the function name

Short names work for all built-in tasks. Use colon syntax or dotted paths for custom tasks in external packages.

Anatomy of a task function#

Every task function follows the same pattern. Here’s question_answer — the simplest built-in task:

from inspect_ai import Task, task
from inspect_ai.dataset import Dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought

@task
def question_answer(dataset: Dataset, config: dict) -> Task:
    system_msg = config.get("system_message") or DEFAULT_QA_SYSTEM_MESSAGE

    solver_chain = [
        add_system_message(system_msg),      # 1. Set the system prompt
        # context_injector(...)              # 2. Inject context files (if variant has them)
        chain_of_thought(),                   # 3. Ask for step-by-step reasoning
        # generate() or react(tools=...)     # 4. Get the model's response
    ]

    return Task(
        name=config["task_name"],
        dataset=dataset,
        solver=solver_chain,
        scorer=model_graded_fact(),
        time_limit=300,
    )

Key ingredients:

Part	Purpose
`@task`	Decorator that registers this function with Inspect AI
`dataset`	An Inspect `Dataset` built from your samples
`config`	A dict with everything from the JSON manifest — variant, system_message, sandbox_type, etc.
Solver chain	A list of steps that process the prompt and generate a response
Scorer	Evaluates the model’s output against the `target`

Solver chain patterns#

Most tasks build their solver chain from shared helpers in task_helpers.py:

def _build_solver(config, system_msg):
    chain = [add_system_message(system_msg)]

    # Inject context files from the variant
    append_context_injection(chain, config)

    # Add chain-of-thought reasoning
    chain.append(chain_of_thought())

    # If the variant has MCP servers → use react() agent
    # Otherwise → use plain generate()
    append_model_interaction(chain, config)

    return chain

This means that variants automatically affect the solver chain — if a variant defines mcp_servers, the task switches from a simple generate call to a full ReAct agent loop with tool access.

Agentic vs. non-agentic tasks#

Pattern	Tasks that use it	What happens
Non-agentic	`question_answer`, `code_gen`	System message → chain of thought → single generate
Agentic	`bug_fix`, `analyze_codebase`, `mcp_tool`	System message → ReAct loop with tools (bash, text editor, MCP)

Agentic tasks give the model tools (bash_session(), text_editor(), MCP servers) and run in a react() loop where the model can take multiple actions before calling submit().

Shared helpers#

The task_helpers.py module contains functions used across all tasks. Some of these are re-exported from dataset_config_python.hydrate — the shared config-interpretation layer that both dash_evals and external consumers (like yardstick) use to ensure consistent hydration of config into Inspect AI objects.

Helper	Source	What it does
`create_mcp_servers(configs, sandbox_type)`	`dataset_config_python`	Creates MCP server objects from variant config
`get_skill_tool(config)`	`dataset_config_python`	Creates a skill tool if the variant has `skills` configured
`build_task_metadata(config)`	`dataset_config_python`	Builds the metadata dict for the `Task` object
`append_context_injection(chain, config)`	`dash_evals`	Adds a `context_injector` solver if the variant has `files`
`append_model_interaction(chain, config)`	`dash_evals`	Adds `react()` (if tools exist) or `generate()` (if not)
`validate_sandbox_tools(config, tool_names)`	`dash_evals`	Checks that sandbox-requiring tools aren’t used on local

These helpers mean that most of the variant logic (context injection, MCP tools, skills) is handled automatically. You just need to define the core solver pattern for your task.

Writing your own task#

Create a file at packages/dash_evals/src/dash_evals/runner/tasks/your_task.py

Write the task function:

from inspect_ai import Task, task
from inspect_ai.dataset import Dataset
from inspect_ai.scorer import model_graded_fact

from .task_helpers import (
    append_context_injection,
    append_model_interaction,
    build_task_metadata,
)
from ..solvers import add_system_message

@task
def your_task(dataset: Dataset, config: dict) -> Task:
    chain = [add_system_message("You are a helpful assistant.")]
    append_context_injection(chain, config)
    append_model_interaction(chain, config)

    return Task(
        name=config["task_name"],
        dataset=dataset,
        solver=chain,
        scorer=model_graded_fact(),
        metadata=build_task_metadata(config),
    )

Export it from runner/tasks/__init__.py:
```
from .your_task import your_task
```
Reference it in task.yaml:
```
func: your_task
```
That’s it — the JSON runner resolves the short name automatically.

Built-in tasks#

Task function	Type	What it evaluates
`question_answer`	Non-agentic	Q&A knowledge and reasoning
`code_gen`	Non-agentic	Code generation with structured output
`flutter_code_gen`	Non-agentic	Flutter-specific code gen (wraps `code_gen`)
`bug_fix`	Agentic	Diagnosing and fixing bugs with bash + editor
`flutter_bug_fix`	Agentic	Flutter-specific bug fix (wraps `bug_fix`)
`analyze_codebase`	Agentic	Exploring and answering questions about code
`mcp_tool`	Agentic	Testing MCP tool usage
`skill_test`	Agentic	Testing skill file usage in sandboxes