Scorers#

Scorer implementations for evaluating task outputs.


Code Quality Scorer#

LLM-graded code quality scorer.

Reusable scorer that uses an LLM to evaluate subjective code quality aspects.

dash_evals.runner.scorers.code_quality.code_quality_scorer(rubric=None, model=None)[source]#

Score code quality using LLM judgment.

Uses a rubric to evaluate subjective aspects of code quality that static analysis can’t capture: minimality, elegance, robustness.

Parameters:
  • rubric (str | None) – Custom rubric prompt. If None, uses default Dart/Flutter rubric.

  • model (str | None) – Model to use for grading. If None, uses the task’s model.

Return type:

Scorer

Returns:

A Scorer that evaluates code quality on a 0-1 scale.


Dart Analyze Scorer#

Dart static analysis scorer.

Reusable scorer that runs dart analyze on auto-discovered project roots and scores based on output.

dash_evals.runner.scorers.dart_analyze.dart_analyze_scorer(strict=False, project_dir=None)[source]#

Score based on dart static analysis results.

Scoping behavior (in priority order):

  1. If project_dir argument is set, analyze only that subdirectory.

  2. If state.metadata["project_dir"] exists, use that.

  3. Fall back to auto-discovering all pubspec.yaml files.
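
The priority order above can be sketched as follows (a simplified illustration, not the scorer's actual implementation):

```python
from pathlib import Path

def resolve_projects(workspace, project_dir=None, metadata=None):
    # Sketch of the documented scoping priority; names are assumptions.
    workspace = Path(workspace)
    # 1. An explicit project_dir argument wins.
    if project_dir:
        return [workspace / project_dir]
    # 2. Next, project_dir from state.metadata.
    meta_dir = (metadata or {}).get("project_dir")
    if meta_dir:
        return [workspace / meta_dir]
    # 3. Otherwise auto-discover every pubspec.yaml under the workspace.
    return sorted(p.parent for p in workspace.rglob("pubspec.yaml"))
```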

Scores:

  • CORRECT if no errors in any project (and no warnings if strict=True)

  • INCORRECT if any project has errors

Parameters:
  • strict (bool) – If True, also fail on warnings. Default False (only errors fail).

  • project_dir (str | None) – Optional subdirectory to scope analysis to. Relative to the workspace root.

Return type:

Scorer

Returns:

A Scorer that evaluates Dart code quality via static analysis.


Flutter Code Scorer#

Scorer for Flutter code quality evaluation.

dash_evals.runner.scorers.flutter_code.flutter_code_scorer()[source]#

Custom scorer that evaluates Flutter code based on:

  1. Code analysis (flutter analyze)

  2. Test results (flutter test)

  3. Code structure validation

The final score is a weighted combination of these factors:

  • Analyzer: 30%

  • Tests: 50%

  • Structure: 20%

A score >= 0.7 is considered passing for the accuracy metric.

Return type:

Scorer

Returns:

A Scorer that evaluates Flutter code quality.
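
The weighted combination and pass threshold described above can be expressed as a small sketch (illustrative only; `combine` is not the package's API):

```python
# Weights mirror the documented 30/50/20 split.
WEIGHTS = {"analyzer": 0.3, "test": 0.5, "structure": 0.2}
PASS_THRESHOLD = 0.7

def combine(analyzer_score, test_score, structure_score):
    # Weighted sum of the three component scores, each in [0.0, 1.0].
    score = (WEIGHTS["analyzer"] * analyzer_score
             + WEIGHTS["test"] * test_score
             + WEIGHTS["structure"] * structure_score)
    # A score >= 0.7 counts as passing for the accuracy metric.
    return score, score >= PASS_THRESHOLD
```

Note that a perfect test run alone (0.5) is not enough to pass; at least some analyzer or structure credit is required to reach 0.7.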


Flutter Test Scorer#

Flutter test runner scorer.

Reusable scorer that runs flutter test and scores based on pass/fail.

dash_evals.runner.scorers.flutter_test.flutter_test_scorer(test_path='test/')[source]#

Score based on Flutter test results.

Runs flutter test on the specified path and scores:

  • CORRECT if all tests pass

  • INCORRECT if any tests fail

Parameters:

test_path (str) – Path to test directory or file. Default “test/”.

Return type:

Scorer

Returns:

A Scorer that evaluates code by running Flutter tests.
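
A plausible sketch of the underlying mechanics, assuming the scorer shells out to `flutter test` and maps the exit code to a verdict (the function names here are illustrative, and `runner` is parameterised only so the sketch can be exercised without Flutter installed):

```python
import subprocess

def score_from_exit(returncode: int) -> str:
    # CORRECT ("C") when every test passed, INCORRECT ("I") otherwise.
    return "C" if returncode == 0 else "I"

def run_flutter_tests(test_path="test/", runner=("flutter", "test")):
    # Run the test command and score based on its exit code.
    proc = subprocess.run([*runner, test_path], capture_output=True, text=True)
    return score_from_exit(proc.returncode), proc.stdout + proc.stderr
```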


Flutter Output Parser#

Parsers for Flutter command outputs.

class dash_evals.runner.scorers.flutter_output_parser.AnalyzerResult[source]#

Bases: object

Parsed flutter analyze output.

error_count: int#
warning_count: int#
info_count: int#
raw_output: str#
__init__(error_count, warning_count, info_count, raw_output)#
Parameters:
  • error_count (int)

  • warning_count (int)

  • info_count (int)

  • raw_output (str)

Return type:

None

class dash_evals.runner.scorers.flutter_output_parser.TestResult[source]#

Bases: object

Parsed flutter test output.

passed: bool#
raw_output: str#
passed_count: int = 0#
failed_count: int = 0#
__init__(passed, raw_output, passed_count=0, failed_count=0)#
Parameters:
  • passed (bool)

  • raw_output (str)

  • passed_count (int)

  • failed_count (int)

Return type:

None

dash_evals.runner.scorers.flutter_output_parser.parse_analyzer_output(output)[source]#

Parse flutter analyze output to count errors, warnings, and info messages.

Parameters:

output (str) – Combined stdout and stderr from flutter analyze

Return type:

AnalyzerResult

Returns:

AnalyzerResult with counts and raw output

Examples

>>> result = parse_analyzer_output("error • Something wrong\nwarning • Be careful")
>>> result.error_count
1
>>> result.warning_count
1
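
One plausible counting approach, consistent with the doctest above (this is a sketch, not necessarily the library's implementation):

```python
def count_issues(output: str) -> dict[str, int]:
    # Count analyzer lines by severity prefix: "error •", "warning •",
    # "info •" — the bullet-separated format flutter analyze emits.
    counts = {"error": 0, "warning": 0, "info": 0}
    for line in output.splitlines():
        stripped = line.strip().lower()
        for severity in counts:
            if stripped.startswith(f"{severity} •"):
                counts[severity] += 1
    return counts
```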
dash_evals.runner.scorers.flutter_output_parser.parse_test_output(output, success)[source]#

Parse flutter test output to determine test results.

Parameters:
  • output (str) – Combined stdout and stderr from flutter test

  • success (bool) – Whether the test command succeeded

Return type:

TestResult

Returns:

TestResult with pass/fail status and raw output

Examples

>>> result = parse_test_output("All tests passed!", success=True)
>>> result.passed
True

Flutter Scoring Utilities#

Scoring utilities for Flutter code evaluation.

dash_evals.runner.scorers.flutter_scoring.calculate_analyzer_score(result)[source]#

Calculate score from analyzer results.

Parameters:

result (AnalyzerResult) – Parsed analyzer output

Return type:

Tuple[float, str]

Returns:

Tuple of (score, explanation)

Examples

>>> from dash_evals.runner.scorers.flutter_output_parser import AnalyzerResult
>>> result = AnalyzerResult(0, 0, 0, "")
>>> score, explanation = calculate_analyzer_score(result)
>>> score
1.0
dash_evals.runner.scorers.flutter_scoring.calculate_test_score(result)[source]#

Calculate score from test results as a percentage of tests passed.

Parameters:

result (TestResult) – Parsed test output

Return type:

Tuple[float, str]

Returns:

Tuple of (score, explanation) where score is the percentage of tests passed (0.0-1.0)

Examples

>>> from dash_evals.runner.scorers.flutter_output_parser import TestResult
>>> result = TestResult(passed=True, raw_output="")
>>> score, explanation = calculate_test_score(result)
>>> score
1.0
dash_evals.runner.scorers.flutter_scoring.validate_code_structure(code, required_widgets)[source]#

Validate that code contains required structural elements.

Parameters:
  • code (str) – The generated Dart code

  • required_widgets (list) – List of required widget names from metadata

Return type:

Tuple[float, str]

Returns:

Tuple of (score, explanation)

Examples

>>> code = "class MyApp extends StatelessWidget { MaterialApp() }"
>>> score, explanation = validate_code_structure(code, ["TextField"])
>>> score >= 0.7
True
dash_evals.runner.scorers.flutter_scoring.calculate_final_score(analyzer_score, test_score, structure_score, weights=None)[source]#

Calculate weighted final score.

Parameters:
  • analyzer_score (float) – Code quality score (0.0-1.0)

  • test_score (float) – Test pass rate (0.0-1.0)

  • structure_score (float) – Code structure score (0.0-1.0)

  • weights (dict | None) – Optional custom weights (default: {“analyzer”: 0.3, “test”: 0.5, “structure”: 0.2})

Return type:

float

Returns:

Weighted final score (0.0-1.0)

Examples

>>> calculate_final_score(1.0, 1.0, 1.0)
1.0
>>> calculate_final_score(0.0, 1.0, 0.0)  # Test score is 50% of total
0.5

MCP Tool Usage Scorer#

Scorer for verifying MCP tool usage during evaluations.

dash_evals.runner.scorers.mcp_tool_usage.mcp_tool_usage(mcp_server_name='Dart', mcp_tool_names=None, required_tools=None)[source]#

Scorer that checks if an MCP tool from the specified server was called.

This scorer examines the message history to determine whether the model actually used an MCP tool (vs. answering from its training data).

Parameters:
  • mcp_server_name (str) – The name prefix of the MCP server tools. Tools matching “{mcp_server_name}_*” pattern will be identified as MCP tools.

  • mcp_tool_names (list[str] | None) – Optional list of specific tool names to identify as MCP tools. If not provided and mcp_server_name is “Dart”, defaults to the full DART_MCP_TOOLS list.

  • required_tools (list[str] | None) – Optional list of specific MCP tool names that MUST have been called for a passing score. If provided, the scorer checks that every tool in this list was used. If not provided, any MCP tool usage counts as a pass.

Return type:

Scorer

Returns:

A Scorer that returns “C” if MCP tool(s) were used as required, “I” otherwise.

Example:

from dash_evals.scorers import mcp_tool_usage

Task(
    dataset=my_dataset,
    solver=react(),
    tools=[dart_mcp_server],
    scorer=[
        includes(ignore_case=True),  # Check answer correctness
        mcp_tool_usage(),             # Uses DART_MCP_TOOLS by default
        # Or check specific tools:
        # mcp_tool_usage(required_tools=["create_project"]),
    ],
)
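
The detection logic can be sketched as follows, assuming the tool names have already been extracted from the message history (a simplified illustration, not the scorer's actual code):

```python
def mcp_usage_verdict(called, server_prefix="Dart", required=None):
    # `called`: tool names extracted from the message history.
    if required:
        # Every required tool must have been called at least once.
        return "C" if all(tool in called for tool in required) else "I"
    # Otherwise any tool matching "{server_prefix}_*" counts as a pass.
    return "C" if any(t.startswith(f"{server_prefix}_") for t in called) else "I"
```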

Export Workspace#

Scorer that exports the final workspace to the examples directory.

This scorer is a side-effect scorer: it copies the agent’s finished workspace to <log_dir>/examples/<task_id>:<variant>/<sample_id>/ so eval authors can inspect the exact code produced during a run.

It is only added to a Task’s scorer list when task_config.save_examples=True, so no runtime guard is needed.

How It Works#

Scorers run while the sandbox container is still alive. We use the sandbox API to create a tar archive of /workspace inside the container, read it out via sandbox().read_file(), then extract it on the host into examples_dir.

This scorer must run LAST in the scorer list so it captures the final state of the workspace after all code edits are complete.
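
The host-side extraction step can be sketched with the standard library's tarfile module (the in-container `tar` invocation and the sandbox read are assumptions based on the description above):

```python
import io
import tarfile
from pathlib import Path

def extract_workspace(tar_bytes: bytes, examples_dir: Path) -> None:
    # tar_bytes: archive of /workspace created inside the container
    # (e.g. `tar -C /workspace -cf - .`) and read out via the sandbox
    # API; this sketch covers only the host-side extraction.
    examples_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tf:
        tf.extractall(examples_dir)
```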

dash_evals.runner.scorers.export_workspace.export_workspace()[source]#

Copy the final workspace to the examples directory alongside logs.

Reads examples_dir and save_examples from state.metadata. Uses the sandbox API to tar the workspace inside the container and extract it on the host — works with any sandbox type (podman/docker).

The destination path is:

<examples_dir>/<task_id>:<variant>/<sample_id>/

This scorer is only added to the Task scorer list when task_config.save_examples=True, so no runtime check of that flag is needed here.

Return type:

Scorer


Skill Usage Scorer#

Scorer for verifying skill usage during evaluations.

dash_evals.runner.scorers.skill_usage.skill_usage_scorer()[source]#

Scorer that checks if the agent used the skill tool.

Examines the message history to determine whether the model actually called the skill tool to read/discover available skills, rather than answering from its training data alone.

Return type:

Scorer

Returns:

A Scorer that returns “C” if the skill tool was used, “I” otherwise.

Example:

from dash_evals.runner.scorers import skill_usage_scorer

Task(
    dataset=my_dataset,
    solver=react(tools=[skill_tool, bash(timeout=120)]),
    scorer=[
        model_graded_fact(),    # Check answer correctness
        skill_usage_scorer(),   # Check skill tool was used
    ],
)