Tasks#
Task implementations for different evaluation types.
Code Generation#
Code Generation Task (Generic)
Evaluates LLM ability to write working code from scratch. Language-specific behavior (response schema, system message, target path) is controlled via the config dict. Thin wrappers (e.g., flutter_code_gen) supply language defaults.
Evaluates by:
1. Generating code based on prompts
2. Executing the code in a sandbox
3. Running tests
4. Analyzing code quality
5. Scoring based on test results and code metrics
Uses structured output to get clean code without regex extraction.
- class dash_evals.runner.tasks.code_gen.FlutterCodeResponse[source]#
Bases: BaseModel
Structured response for Flutter code generation.
- class dash_evals.runner.tasks.code_gen.GenericCodeResponse[source]#
Bases: BaseModel
Generic structured response for code generation.
- dash_evals.runner.tasks.code_gen.parse_structured_code(code_field='main_dart', response_model=None)[source]#
Parse structured JSON output and store extracted code.
Reads the model's structured JSON response, extracts the code field, and stores it in state.store["extracted_code"] for write_to_sandbox.
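As a rough illustration, the core step can be sketched in isolation. This is a simplified stand-alone sketch, not the real solver (which operates on Inspect's TaskState); the function signature and the plain-dict store here are assumptions made for the example.

```python
import json

def parse_structured_code(raw_completion: str, code_field: str = "main_dart") -> dict:
    """Illustrative sketch: read a structured JSON completion, pull out
    the configured code field, and stash it under "extracted_code" so a
    later step can write it into the sandbox."""
    store = {}
    data = json.loads(raw_completion)           # structured output is a JSON object
    store["extracted_code"] = data[code_field]  # e.g. the Dart source for lib/main.dart
    return store

# Example: a structured response carrying Flutter code.
response = '{"main_dart": "void main() {}"}'
store = parse_structured_code(response)
```

Because the code arrives as a named JSON field, no regex extraction from free-form markdown is needed.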
- dash_evals.runner.tasks.code_gen.code_gen(dataset, config)[source]#
Generic task for evaluating LLM code generation.
The config dict controls language-specific behavior:
- system_message: Custom system prompt (optional)
- target_path: Where to write generated code (default: "lib/main.dart")
- response_schema_name: Key into RESPONSE_SCHEMAS (default: "generic")
- response_schema_description: Description for the schema (optional)
- code_field: Field name for code in structured output (default: "code")
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
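The defaults documented above can be pictured as a simple overlay of the manifest entry on top of built-in defaults. This is an illustrative sketch only; the helper name resolve_config is hypothetical, and the real task may resolve defaults differently.

```python
# Documented defaults for code_gen (values taken from the docstring above).
DEFAULTS = {
    "target_path": "lib/main.dart",
    "response_schema_name": "generic",
    "code_field": "code",
}

def resolve_config(config: dict) -> dict:
    """Overlay a manifest entry on the documented defaults; keys present
    in the entry win over the defaults."""
    return {**DEFAULTS, **config}

# A Flutter-style entry overrides the schema and code field but keeps target_path.
resolved = resolve_config({"response_schema_name": "flutter", "code_field": "main_dart"})
```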
- dash_evals.runner.tasks.code_gen.flutter_code_gen(dataset, config)[source]#
Flutter-specific code generation task.
Thin wrapper around code_gen() with Flutter defaults:
- Flutter system message
- FlutterCodeResponse schema
- target_path = "lib/main.dart"
- flutter_code_scorer
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
Bug Fix#
Bug Fix Task (Generic, Agentic)
Evaluates LLM ability to diagnose and fix bugs in existing code using an agentic approach where the model can explore files and make edits.
This task behaves similarly to real-world AI coding assistants like Claude Code and Cursor. Language-specific behavior (system message, scorers) is controlled via the config dict or thin wrappers (e.g., flutter_bug_fix).
- dash_evals.runner.tasks.bug_fix.bug_fix(dataset, config)[source]#
Generic task for evaluating LLM ability to diagnose and fix bugs.
The config dict controls language-specific behavior:
- system_message: Custom system prompt (optional)
- agent_name: Name for the react agent (default: "debugger")
- agent_description: Description for the react agent (optional)
- scorers: List of scorer instances (optional, defaults to dart analyzers)
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
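A manifest entry for this task might look like the following. The key names come from the docstring above; the values shown are placeholders, not defaults shipped with the package.

```json
{
  "variant": "flutter",
  "system_message": "You are a debugging assistant for Flutter apps.",
  "agent_name": "debugger",
  "agent_description": "Explores the workspace and fixes the reported bug"
}
```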
- dash_evals.runner.tasks.bug_fix.flutter_bug_fix(dataset, config)[source]#
Flutter-specific bug fix task.
Thin wrapper around bug_fix() with Flutter defaults:
- Flutter system message
- Flutter-specific scorers (dart_analyze, flutter_test, code_quality)
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
Question Answer#
QA tasks for evaluating model Q&A capabilities.
MCP Tool#
MCP Tool Usage Task (Unified)
Tests an agent’s ability to use a specific MCP server tool. Consolidates the former mcp_create_project and mcp_pub_dev_search tasks into a single configurable task.
- Config keys:
  - required_tools: list[str] – MCP tool names the agent should use (for scoring)
  - inject_temp_dir: bool – if True, replace {root_path} in sample inputs with a temp directory (needed for create_project-style tasks)
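The inject_temp_dir behavior can be sketched with the standard library. This is an assumption-laden illustration: the real task presumably performs this substitution inside its solver, and the function name here is made up for the example.

```python
import tempfile

def inject_temp_dir(sample_input: str) -> str:
    """Sketch of the documented inject_temp_dir behavior: swap the
    {root_path} placeholder for a fresh temp directory so that
    create_project-style tool calls have somewhere real to write."""
    root = tempfile.mkdtemp(prefix="mcp_eval_")
    return sample_input.replace("{root_path}", root)

prompt = inject_temp_dir("Create a new Flutter project under {root_path}/demo")
```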
- dash_evals.runner.tasks.mcp_tool.mcp_tool(dataset, config)[source]#
Unified task for evaluating MCP tool usage.
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with:
  - required_tools: list of MCP tool names the agent should call
  - inject_temp_dir: if True, replaces {root_path} in inputs
  - system_message: custom system prompt (optional)
- Return type:
Task
Analyze Codebase#
Analyze Codebase Task
Evaluates LLM ability to explore and answer questions about an existing codebase. The model gets read-only access to workspace files via bash commands, but is instructed not to modify any files.
- dash_evals.runner.tasks.analyze_codebase.analyze_codebase(dataset, config)[source]#
Task for evaluating LLM ability to explore and answer questions about a codebase.
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
Skill Test#
Skill Test Task
Evaluates whether the agent discovers and applies specific skills provided to it. The agent may use the skill tool when skills are provided, and the scorer checks that the tool was used appropriately when skills are available.
Requires a sandbox since Inspect AI’s skill() tool copies skill directories into the sandbox filesystem.
- dash_evals.runner.tasks.skill_test.skill_test(dataset, config)[source]#
Task for evaluating whether an agent correctly uses provided skills.
This task is specifically designed to test skill discovery and application. If skills are provided, the agent may:
- Use the skill tool to discover available skills
- Read and understand the skill instructions
- Apply the skill guidance to answer the user's question
- Submit an answer that reflects skill-based knowledge
The scorer checks both answer quality (model_graded_fact) and, if skills are present, whether the skill tool was actually used (skill_usage_scorer).
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
Task Helpers#
Shared utilities used across task implementations.
Shared helper functions for building task components.
These helpers encapsulate common patterns used across tasks:
- Creating MCP servers from variant config
- Building task metadata
- Appending variant-driven solvers (context injection, MCP tools, skills)
All helpers accept a config dict (from the run manifest) instead of TaskConfig, enabling the JSONL + manifest-based execution flow.
- dash_evals.runner.tasks.task_helpers.validate_sandbox_tools(config, tool_names)[source]#
Validate that the requested tools are compatible with the sandbox type.
- Parameters:
- Raises:
ValueError – If local sandbox is used with injection-requiring tools.
- Return type:
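The check can be sketched as follows. Note the assumptions: which tool names require injection is not documented here, so the INJECTION_REQUIRED set below is entirely hypothetical, as is the config key used to read the sandbox type.

```python
# Hypothetical set of tools that must inject files/binaries into the
# sandbox; a "local" sandbox cannot host these. Names are invented
# for illustration only.
INJECTION_REQUIRED = {"dart_mcp", "flutter_sdk"}

def validate_sandbox_tools(config: dict, tool_names: list[str]) -> None:
    """Raise ValueError if a local sandbox is combined with tools that
    need injection into the sandbox filesystem."""
    sandbox_type = config.get("sandbox", "local")  # key name is an assumption
    if sandbox_type == "local":
        bad = INJECTION_REQUIRED.intersection(tool_names)
        if bad:
            raise ValueError(f"local sandbox cannot host tools: {sorted(bad)}")

validate_sandbox_tools({"sandbox": "docker"}, ["dart_mcp"])  # OK: non-local sandbox
```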
- dash_evals.runner.tasks.task_helpers.create_mcp_servers(mcp_configs, sandbox_type='local')[source]#
Create MCP server objects from variant config.
Supports three modes per entry:
- Declarative stdio/sandbox: dict with command, args, etc.
- Declarative HTTP: dict with url, and optionally authorization / headers.
- Python ref: dict with a ref key pointing to a pre-built MCPServer.

Transport is auto-selected when not explicit:
- If url is present → mcp_server_http
- If sandbox is non-local → mcp_server_sandbox
- Otherwise → mcp_server_stdio
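The auto-selection rules above can be sketched as a small decision function. The returned strings mirror the factory names in the docs, but the function itself is an illustration of the documented logic, not the library's implementation.

```python
def select_transport(entry: dict, sandbox_type: str = "local") -> str:
    """Mirror the documented transport auto-selection for one MCP
    server entry when no transport is given explicitly."""
    if "ref" in entry:
        return "prebuilt"            # Python ref: entry carries an MCPServer already
    if "url" in entry:
        return "mcp_server_http"     # declarative HTTP
    if sandbox_type != "local":
        return "mcp_server_sandbox"  # run the stdio server inside the sandbox
    return "mcp_server_stdio"        # local stdio process

kind = select_transport({"command": "dart", "args": ["mcp-server"]}, sandbox_type="docker")
```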
- dash_evals.runner.tasks.task_helpers.create_mcp_server(config=None)[source]#
Create the default Dart MCP server (backwards-compatible alias).
- Parameters:
config (dict | None)
- dash_evals.runner.tasks.task_helpers.create_dart_mcp_server()[source]#
Create the standard Dart MCP server tool (backwards-compatible alias).
- dash_evals.runner.tasks.task_helpers.build_task_metadata(config)[source]#
Build task metadata dictionary from manifest config.
- dash_evals.runner.tasks.task_helpers.append_context_injection(solver_chain, config)[source]#
Append context injection solver if the variant has context files.
- dash_evals.runner.tasks.task_helpers.get_skill_tool(config)[source]#
Create the skill tool if the variant has skills configured.