Scorers#
Scorer implementations for evaluating task outputs.
Code Quality Scorer#
LLM-graded code quality scorer.
Reusable scorer that uses an LLM to evaluate subjective code quality aspects.
Dart Analyze Scorer#
Dart static analysis scorer.
Reusable scorer that runs dart analyze on auto-discovered project roots
and scores based on output.
- dash_evals.runner.scorers.dart_analyze.dart_analyze_scorer(strict=False, project_dir=None)[source]#
Score based on dart static analysis results.
Scoping behavior (in priority order):

1. If the project_dir argument is set, analyze only that subdirectory.
2. If state.metadata["project_dir"] exists, use that.
3. Fall back to auto-discovering all pubspec.yaml files.

Scores:
- CORRECT if no errors in any project (and no warnings if strict=True)
- INCORRECT if any project has errors
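A minimal sketch of the auto-discovery fallback, assuming the scorer simply treats every directory containing a pubspec.yaml as a project root (the function name and exact discovery logic are illustrative, not the library's internals):

```python
from pathlib import Path
import tempfile

def discover_project_roots(workspace: str) -> list[Path]:
    # Treat every directory containing a pubspec.yaml as a Dart project root.
    return sorted(p.parent for p in Path(workspace).rglob("pubspec.yaml"))

# Example: a workspace with an app and a nested package.
with tempfile.TemporaryDirectory() as ws:
    for proj in ("app", "app/packages/ui"):
        root = Path(ws) / proj
        root.mkdir(parents=True)
        (root / "pubspec.yaml").write_text("name: demo\n")
    roots = discover_project_roots(ws)
    print([r.name for r in roots])  # ['app', 'ui']
```

Each discovered root would then be analyzed independently, with any error in any project producing an INCORRECT grade.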
Flutter Code Scorer#
Scorer for Flutter code quality evaluation.
- dash_evals.runner.scorers.flutter_code.flutter_code_scorer()[source]#
Custom scorer that evaluates Flutter code based on:
Code analysis (flutter analyze)
Test results (flutter test)
Code structure validation
The final score is a weighted combination of these factors:
Analyzer: 30%
Tests: 50%
Structure: 20%
A score >= 0.7 is considered passing for the accuracy metric.
- Return type: Scorer
- Returns: A Scorer that evaluates Flutter code quality.
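The weighted combination above can be sketched as plain arithmetic; the function name and weights tuple here are illustrative, not the library's actual API:

```python
def combine_scores(analyzer: float, tests: float, structure: float,
                   weights: tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    # Weighted combination: analyzer 30%, tests 50%, structure 20%.
    wa, wt, ws = weights
    return wa * analyzer + wt * tests + ws * structure

score = combine_scores(analyzer=1.0, tests=0.8, structure=1.0)
print(round(score, 2))  # 0.9
print(score >= 0.7)     # True -> counts as passing for the accuracy metric
```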
Flutter Test Scorer#
Flutter test runner scorer.
Reusable scorer that runs flutter test and scores based on pass/fail.
- dash_evals.runner.scorers.flutter_test.flutter_test_scorer(test_path='test/')[source]#
Score based on Flutter test results.
Runs flutter test on the specified path and scores:
- CORRECT if all tests pass
- INCORRECT if any tests fail

- Parameters:
  - test_path (str) – Path to test directory or file. Default "test/".
- Return type: Scorer
- Returns: A Scorer that evaluates code by running Flutter tests.
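The pass/fail mapping can be sketched as a check on the exit status of flutter test, which exits 0 only when every test passes; the helper name is illustrative, and the commented-out sandbox call is an assumption about how the scorer invokes the command:

```python
def grade_flutter_test(returncode: int) -> str:
    # flutter test exits 0 when all tests pass; any failure is nonzero.
    return "C" if returncode == 0 else "I"  # CORRECT / INCORRECT

# Inside a scorer this would wrap something like:
#   proc = await sandbox().exec(["flutter", "test", test_path])
#   grade = grade_flutter_test(proc.returncode)
print(grade_flutter_test(0))  # C
print(grade_flutter_test(1))  # I
```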
Flutter Output Parser#
Parsers for Flutter command outputs.
- class dash_evals.runner.scorers.flutter_output_parser.AnalyzerResult[source]#
Bases: object
Parsed flutter analyze output.
- class dash_evals.runner.scorers.flutter_output_parser.TestResult[source]#
Bases: object
Parsed flutter test output.
- dash_evals.runner.scorers.flutter_output_parser.parse_analyzer_output(output)[source]#
Parse flutter analyze output to count errors, warnings, and info messages.
- Parameters:
  - output (str) – Combined stdout and stderr from flutter analyze
- Return type: AnalyzerResult
- Returns: AnalyzerResult with counts and raw output
Examples
>>> result = parse_analyzer_output("error • Something wrong\nwarning • Be careful")
>>> result.error_count
1
>>> result.warning_count
1
- dash_evals.runner.scorers.flutter_output_parser.parse_test_output(output, success)[source]#
Parse flutter test output to determine test results.
- Parameters:
  - output (str) – Combined stdout and stderr from flutter test
  - success (bool) – Whether the flutter test command exited successfully
- Return type: TestResult
- Returns: TestResult with pass/fail status and raw output
Examples
>>> result = parse_test_output("All tests passed!", success=True)
>>> result.passed
True
Flutter Scoring Utilities#
Scoring utilities for Flutter code evaluation.
- dash_evals.runner.scorers.flutter_scoring.calculate_analyzer_score(result)[source]#
Calculate score from analyzer results.
- Parameters:
  - result (AnalyzerResult) – Parsed analyzer output
- Return type: tuple[float, str]
- Returns: Tuple of (score, explanation)
Examples
>>> from dash_evals.runner.scorers.flutter_output_parser import AnalyzerResult
>>> result = AnalyzerResult(0, 0, 0, "")
>>> score, explanation = calculate_analyzer_score(result)
>>> score
1.0
- dash_evals.runner.scorers.flutter_scoring.calculate_test_score(result)[source]#
Calculate score from test results as a percentage of tests passed.
- Parameters:
  - result (TestResult) – Parsed test output
- Return type: tuple[float, str]
- Returns: Tuple of (score, explanation), where score is the fraction of tests passed (0.0-1.0)
Examples
>>> from dash_evals.runner.scorers.flutter_output_parser import TestResult
>>> result = TestResult(passed=True, raw_output="")
>>> score, explanation = calculate_test_score(result)
>>> score
1.0
- dash_evals.runner.scorers.flutter_scoring.validate_code_structure(code, required_widgets)[source]#
Validate that code contains required structural elements.
- Parameters:
  - code (str) – Source code to validate
  - required_widgets (list[str]) – Widget names that must appear in the code
- Return type: tuple[float, str]
- Returns: Tuple of (score, explanation)
Examples
>>> code = "class MyApp extends StatelessWidget { MaterialApp() }"
>>> score, explanation = validate_code_structure(code, ["TextField"])
>>> score >= 0.7
True
- dash_evals.runner.scorers.flutter_scoring.calculate_final_score(analyzer_score, test_score, structure_score, weights=None)[source]#
Calculate weighted final score.
- Parameters:
  - analyzer_score (float) – Score from static analysis (0.0-1.0)
  - test_score (float) – Score from test results (0.0-1.0)
  - structure_score (float) – Score from structure validation (0.0-1.0)
  - weights (tuple[float, float, float] | None) – Optional custom weights; defaults to 30% analyzer, 50% tests, 20% structure
- Return type: float
- Returns: Weighted final score (0.0-1.0)
Examples
>>> calculate_final_score(1.0, 1.0, 1.0)
1.0
>>> calculate_final_score(0.0, 1.0, 0.0)  # Test score is 50% of total
0.5
MCP Tool Usage Scorer#
Scorer for verifying MCP tool usage during evaluations.
- dash_evals.runner.scorers.mcp_tool_usage.mcp_tool_usage(mcp_server_name='Dart', mcp_tool_names=None, required_tools=None)[source]#
Scorer that checks if an MCP tool from the specified server was called.
This scorer examines the message history to determine whether the model actually used an MCP tool (vs. answering from its training data).
- Parameters:
  - mcp_server_name (str) – The name prefix of the MCP server tools. Tools matching the "{mcp_server_name}_*" pattern will be identified as MCP tools.
  - mcp_tool_names (list[str] | None) – Optional list of specific tool names to identify as MCP tools. If not provided and mcp_server_name is "Dart", defaults to the full DART_MCP_TOOLS list.
  - required_tools (list[str] | None) – Optional list of specific MCP tool names that MUST have been called for a passing score. If provided, the scorer checks that every tool in this list was used. If not provided, any MCP tool usage counts as a pass.
- Return type: Scorer
- Returns: A Scorer that returns "C" if MCP tool(s) were used as required, "I" otherwise.
Example:

    from dash_evals.scorers import mcp_tool_usage

    Task(
        dataset=my_dataset,
        solver=react(),
        tools=[dart_mcp_server],
        scorer=[
            includes(ignore_case=True),  # Check answer correctness
            mcp_tool_usage(),  # Uses DART_MCP_TOOLS by default
            # Or check specific tools:
            # mcp_tool_usage(required_tools=["create_project"]),
        ],
    )
Export Workspace#
Scorer that exports the final workspace to the examples directory.
This scorer is a side-effect scorer: it copies the agent’s finished workspace to <log_dir>/examples/<task_id>:<variant>/<sample_id>/ so eval authors can inspect the exact code produced during a run.
It is only added to a Task’s scorer list when task_config.save_examples=True, so it can assume it should always run — no runtime guard needed.
== How It Works ==
Scorers run while the sandbox container is still alive. We use the sandbox API to create a tar archive of /workspace inside the container, read it out via sandbox().read_file(), then extract it on the host into examples_dir.
This scorer must run LAST in the scorer list so it captures the final state of the workspace after all code edits are complete.
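The tar round-trip described above can be sketched with the standard library. Here the sandbox calls are replaced by an in-memory archive; the helper name and directory layout are illustrative:

```python
import io
import tarfile
import tempfile
from pathlib import Path

def extract_workspace(tar_bytes: bytes, examples_dir: Path) -> None:
    # Extract a tar archive (as produced inside the container) into examples_dir.
    examples_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tf:
        tf.extractall(examples_dir)

# Simulate the container side: tar up a stand-in for /workspace.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "main.dart").write_text("void main() {}\n")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        tf.add(src, arcname="workspace")
    # Host side: read the bytes back (stand-in for sandbox().read_file) and extract.
    dest = Path(dst) / "examples" / "task:variant" / "sample"
    extract_workspace(buf.getvalue(), dest)
    ok = (dest / "workspace" / "main.dart").exists()
    print(ok)  # True
```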
- dash_evals.runner.scorers.export_workspace.export_workspace()[source]#
Copy the final workspace to the examples directory alongside logs.
Reads examples_dir and save_examples from state.metadata. Uses the sandbox API to tar the workspace inside the container and extract it on the host, which works with any sandbox type (podman/docker).

The destination path is:

    <examples_dir>/<task_id>:<variant>/<sample_id>/

This scorer is only added to the Task scorer list when task_config.save_examples=True, so it runs unconditionally.

- Return type: Scorer
Skill Usage Scorer#
Scorer for verifying skill usage during evaluations.
- dash_evals.runner.scorers.skill_usage.skill_usage_scorer()[source]#
Scorer that checks if the agent used the skill tool.
Examines the message history to determine whether the model actually called the skill tool to read/discover available skills, rather than answering from its training data alone.
- Return type: Scorer
- Returns: A Scorer that returns "C" if the skill tool was used, "I" otherwise.
Example:

    from dash_evals.runner.scorers import skill_usage_scorer

    Task(
        dataset=my_dataset,
        solver=react(tools=[skill_tool, bash(timeout=120)]),
        scorer=[
            model_graded_fact(),  # Check answer correctness
            skill_usage_scorer(),  # Check skill tool was used
        ],
    )