Tasks#
Task implementations for different evaluation types.
Code Generation#
Code Generation Task (Generic)
Evaluates LLM ability to write working code from scratch. Language-specific behavior (response schema, system message, target path) is controlled via the config dict. Thin wrappers (e.g., flutter_code_gen) supply language defaults.
Evaluates by:
1. Generating code based on prompts
2. Executing the code in a sandbox
3. Running tests
4. Analyzing code quality
5. Scoring based on test results and code metrics
Uses structured output to get clean code without regex extraction.
- class dash_evals.runner.tasks.code_gen.FlutterCodeResponse[source]#
Bases: BaseModel
Structured response for Flutter code generation.
- class dash_evals.runner.tasks.code_gen.GenericCodeResponse[source]#
Bases: BaseModel
Generic structured response for code generation.
- dash_evals.runner.tasks.code_gen.parse_structured_code(code_field='main_dart', response_model=None)[source]#
Parse structured JSON output and store extracted code.
Reads the model's structured JSON response, extracts the code field, and stores it in state.store["extracted_code"] for write_to_sandbox.
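As a rough illustration, the core step can be sketched in isolation. This is a simplified stand-alone sketch, not the real solver (which operates on Inspect's TaskState); the function signature and the plain-dict store here are assumptions made for the example.

```python
import json

def parse_structured_code(raw_completion: str, code_field: str = "main_dart") -> dict:
    """Illustrative sketch: read a structured JSON completion, pull out
    the configured code field, and stash it under "extracted_code" so a
    later step can write it into the sandbox."""
    store = {}
    data = json.loads(raw_completion)           # structured output is a JSON object
    store["extracted_code"] = data[code_field]  # e.g. the Dart source for lib/main.dart
    return store

# Example: a structured response carrying Flutter code.
response = '{"main_dart": "void main() {}"}'
store = parse_structured_code(response)
```

Because the code arrives as a named JSON field, no regex extraction from free-form markdown is needed.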
- dash_evals.runner.tasks.code_gen.code_gen(dataset, config)[source]#
Generic task for evaluating LLM code generation.
The config dict controls language-specific behavior:
- system_message: Custom system prompt (optional)
- target_path: Where to write generated code (default: "lib/main.dart")
- response_schema_name: Key into RESPONSE_SCHEMAS (default: "generic")
- response_schema_description: Description for the schema (optional)
- code_field: Field name for code in structured output (default: "code")
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
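The defaults documented above can be pictured as a simple overlay of the manifest entry on top of built-in defaults. This is an illustrative sketch only; the helper name resolve_config is hypothetical, and the real task may resolve defaults differently.

```python
# Documented defaults for code_gen (values taken from the docstring above).
DEFAULTS = {
    "target_path": "lib/main.dart",
    "response_schema_name": "generic",
    "code_field": "code",
}

def resolve_config(config: dict) -> dict:
    """Overlay a manifest entry on the documented defaults; keys present
    in the entry win over the defaults."""
    return {**DEFAULTS, **config}

# A Flutter-style entry overrides the schema and code field but keeps target_path.
resolved = resolve_config({"response_schema_name": "flutter", "code_field": "main_dart"})
```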
- dash_evals.runner.tasks.code_gen.flutter_code_gen(dataset, config)[source]#
Flutter-specific code generation task.
Thin wrapper around code_gen() with Flutter defaults:
- Flutter system message
- FlutterCodeResponse schema
- target_path = "lib/main.dart"
- flutter_code_scorer
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
Bug Fix#
Bug Fix Task (Generic, Agentic)
Evaluates LLM ability to diagnose and fix bugs in existing code using an agentic approach where the model can explore files and make edits.
This task behaves similarly to real-world AI coding assistants like Claude Code and Cursor. Language-specific behavior (system message, scorers) is controlled via the config dict or thin wrappers (e.g., flutter_bug_fix).
- dash_evals.runner.tasks.bug_fix.bug_fix(dataset, config)[source]#
Generic task for evaluating LLM ability to diagnose and fix bugs.
The config dict controls language-specific behavior:
- system_message: Custom system prompt (optional)
- agent_name: Name for the react agent (default: "debugger")
- agent_description: Description for the react agent (optional)
- scorers: List of scorer instances (optional, defaults to dart analyzers)
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
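A manifest entry for this task might look like the following. The key names come from the docstring above; the values shown are placeholders, not defaults shipped with the package.

```json
{
  "variant": "flutter",
  "system_message": "You are a debugging assistant for Flutter apps.",
  "agent_name": "debugger",
  "agent_description": "Explores the workspace and fixes the reported bug"
}
```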
- dash_evals.runner.tasks.bug_fix.flutter_bug_fix(dataset, config)[source]#
Flutter-specific bug fix task.
Thin wrapper around bug_fix() with Flutter defaults:
- Flutter system message
- Flutter-specific scorers (dart_analyze, flutter_test, code_quality)
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
Question Answer#
QA tasks for evaluating model Q&A capabilities.
MCP Tool#
MCP Tool Usage Task (Unified)
Tests an agent’s ability to use a specific MCP server tool. Consolidates the former mcp_create_project and mcp_pub_dev_search tasks into a single configurable task.
- Config keys:
  - required_tools: list[str] – MCP tool names the agent should use (for scoring)
  - inject_temp_dir: bool – if True, replace {root_path} in sample inputs with a temp directory (needed for create_project-style tasks)
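The inject_temp_dir behavior can be sketched with the standard library. This is an assumption-laden illustration: the real task presumably performs this substitution inside its solver, and the function name here is made up for the example.

```python
import tempfile

def inject_temp_dir(sample_input: str) -> str:
    """Sketch of the documented inject_temp_dir behavior: swap the
    {root_path} placeholder for a fresh temp directory so that
    create_project-style tool calls have somewhere real to write."""
    root = tempfile.mkdtemp(prefix="mcp_eval_")
    return sample_input.replace("{root_path}", root)

prompt = inject_temp_dir("Create a new Flutter project under {root_path}/demo")
```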
- dash_evals.runner.tasks.mcp_tool.mcp_tool(dataset, config)[source]#
Unified task for evaluating MCP tool usage.
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with:
  - required_tools: list of MCP tool names the agent should call
  - inject_temp_dir: if True, replaces {root_path} in inputs
  - system_message: custom system prompt (optional)
- Return type:
Task
Analyze Codebase#
Analyze Codebase Task
Evaluates LLM ability to explore and answer questions about an existing codebase. The model gets read-only access to workspace files via bash commands, but is instructed not to modify any files.
- dash_evals.runner.tasks.analyze_codebase.analyze_codebase(dataset, config)[source]#
Task for evaluating LLM ability to explore and answer questions about a codebase.
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
Skill Test#
Skill Test Task
Evaluates whether the agent discovers and applies specific skills provided to it. The agent may use the skill tool when skills are provided, and the scorer checks that the tool was used appropriately when skills are available.
Requires a sandbox since Inspect AI’s skill() tool copies skill directories into the sandbox filesystem.
- dash_evals.runner.tasks.skill_test.skill_test(dataset, config)[source]#
Task for evaluating whether an agent correctly uses provided skills.
This task is specifically designed to test skill discovery and application. If skills are provided, the agent may:
- Use the skill tool to discover available skills
- Read and understand the skill instructions
- Apply the skill guidance to answer the user's question
- Submit an answer that reflects skill-based knowledge
The scorer checks both answer quality (model_graded_fact) and, if skills are present, whether the skill tool was actually used (skill_usage_scorer).
- Parameters:
- dataset (Dataset) – Inspect dataset loaded from JSONL.
- config (dict) – Task manifest entry with variant, system_message, etc.
- Return type:
Task
Task Helpers#
Shared utilities used across task implementations.
Shared helper functions for building task components.
These helpers encapsulate common patterns used across tasks:
- Creating MCP servers from variant config
- Building task metadata
- Appending variant-driven solvers (context injection, MCP tools, skills)
All helpers accept a config dict (from the run manifest) instead of TaskConfig, enabling the JSONL + manifest-based execution flow.
- dash_evals.runner.tasks.task_helpers.validate_sandbox_tools(config, tool_names)[source]#
Validate that the requested tools are compatible with the sandbox type.
- Parameters:
- Raises:
ValueError – If local sandbox is used with injection-requiring tools.
- Return type:
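The check can be sketched as follows. Note the assumptions: which tool names require injection is not documented here, so the INJECTION_REQUIRED set below is entirely hypothetical, as is the config key used to read the sandbox type.

```python
# Hypothetical set of tools that must inject files/binaries into the
# sandbox; a "local" sandbox cannot host these. Names are invented
# for illustration only.
INJECTION_REQUIRED = {"dart_mcp", "flutter_sdk"}

def validate_sandbox_tools(config: dict, tool_names: list[str]) -> None:
    """Raise ValueError if a local sandbox is combined with tools that
    need injection into the sandbox filesystem."""
    sandbox_type = config.get("sandbox", "local")  # key name is an assumption
    if sandbox_type == "local":
        bad = INJECTION_REQUIRED.intersection(tool_names)
        if bad:
            raise ValueError(f"local sandbox cannot host tools: {sorted(bad)}")

validate_sandbox_tools({"sandbox": "docker"}, ["dart_mcp"])  # OK: non-local sandbox
```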
- dash_evals.runner.tasks.task_helpers.create_mcp_servers(mcp_configs, sandbox_type='local')[source]#
Create MCP server objects from variant config.
Supports three modes per entry:
- Declarative stdio/sandbox: dict with command, args, etc.
- Declarative HTTP: dict with url, and optionally authorization / headers.
- Python ref: dict with a ref key pointing to a pre-built MCPServer.

Transport is auto-selected when not explicit:
- If url is present → mcp_server_http
- If sandbox is non-local → mcp_server_sandbox
- Otherwise → mcp_server_stdio
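The auto-selection rules above can be sketched as a small decision function. The returned strings mirror the factory names in the docs, but the function itself is an illustration of the documented logic, not the library's implementation.

```python
def select_transport(entry: dict, sandbox_type: str = "local") -> str:
    """Mirror the documented transport auto-selection for one MCP
    server entry when no transport is given explicitly."""
    if "ref" in entry:
        return "prebuilt"            # Python ref: entry carries an MCPServer already
    if "url" in entry:
        return "mcp_server_http"     # declarative HTTP
    if sandbox_type != "local":
        return "mcp_server_sandbox"  # run the stdio server inside the sandbox
    return "mcp_server_stdio"        # local stdio process

kind = select_transport({"command": "dart", "args": ["mcp-server"]}, sandbox_type="docker")
```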
- dash_evals.runner.tasks.task_helpers.create_mcp_server(config=None)[source]#
Create the default Dart MCP server (backwards-compatible alias).
- Parameters:
config (dict | None)
- dash_evals.runner.tasks.task_helpers.create_dart_mcp_server()[source]#
Create the standard Dart MCP server tool (backwards-compatible alias).
- dash_evals.runner.tasks.task_helpers.build_task_metadata(config)[source]#
Build task metadata dictionary from manifest config.
- dash_evals.runner.tasks.task_helpers.append_context_injection(solver_chain, config)[source]#
Append context injection solver if the variant has context files.
- dash_evals.runner.tasks.task_helpers.get_skill_tool(config)[source]#
Create the skill tool if the variant has skills configured.