# Contributing
Welcome to the Dart/Flutter LLM evaluation project! This repository contains tools for running and analyzing AI model evaluations on Dart and Flutter tasks.
## Table of Contents

- [Setup](#setup)
- [Write a New Eval](#write-a-new-eval)
- [Add Functionality to the Runner](#add-functionality-to-the-runner)
- [eval_explorer](#eval_explorer)
## Setup
### Prerequisites

- Python 3.13+
- Podman or Docker (for sandbox execution)
- API keys for the models you want to test

1. Create and activate a virtual environment

   ```bash
   cd packages/dash_evals
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```
2. Install dependencies

   ```bash
   pip install -e .          # Core dependencies
   pip install -e ".[dev]"   # Development dependencies (pytest, ruff, etc.)
   ```
3. Configure API keys

   You only need to configure keys for the models you plan to test.

   ```bash
   export GEMINI_API_KEY=your_key_here
   export ANTHROPIC_API_KEY=your_key_here
   export OPENAI_API_KEY=your_key_here
   ```
4. Verify installation

   ```bash
   run-evals --help
   ```

## Write a New Eval
The most common contribution is adding new evaluation samples. Each sample tests a specific capability or scenario.
### Add Your Sample to the Dataset
1. Decide which task your sample belongs to

   Review the available tasks in `dataset/tasks/` or run `devals create task` to see the available task functions:

   | Task | Purpose |
   |---|---|
   | `question_answer` | Q&A evaluation for Dart/Flutter knowledge |
   | `bug_fix` | Agentic debugging of code in a sandbox |
   | `flutter_bug_fix` | Flutter-specific bug fix (wraps `bug_fix`) |
   | `code_gen` | Generate code from specifications |
   | `flutter_code_gen` | Flutter-specific code gen (wraps `code_gen`) |
   | `mcp_tool` | Test MCP tool usage |
   | `analyze_codebase` | Evaluate codebase analysis |
   | `skill_test` | Test skill file usage in sandboxes |

2. Create your sample file

   Use `devals create sample` for interactive sample creation, or add a sample inline in the task's `task.yaml` file under `dataset/tasks/<task_name>/task.yaml`:

   ```yaml
   id: dart_your_sample_id
   input: |
     Your prompt to the model goes here.
   target: |
     Criteria for grading the response. This is used by the scorer
     to determine if the model's output is acceptable.
   metadata:
     added: 2025-02-04
     tags: [dart, async]  # Optional categorization
   ```
   For agentic tasks (bug fix, code gen), you'll also need a workspace:

   ```yaml
   id: flutter_fix_some_bug
   input: |
     The app crashes when the user taps the submit button. Debug and fix the issue.
   target: |
     The fix should handle the null check in the onPressed callback.
   workspace:
     template: flutter_app  # Use a reusable template
     # OR
     path: ./project        # Custom project relative to sample directory
   ```
3. Add your sample to the task's `task.yaml`

   Add your sample inline in the appropriate task's `samples` list:

   ```yaml
   # dataset/tasks/dart_question_answer/task.yaml
   func: question_answer
   samples:
     - id: dart_your_sample_id
       input: |
         Your prompt to the model goes here.
       target: |
         Criteria for grading the response.
   ```

### Edit the Config to Run Only Your New Sample
Before committing, test your sample by creating a job file. Use `devals create job` interactively, or manually create one in `dataset/jobs/`:
```yaml
# jobs/test_my_sample.yaml
name: test_my_sample

# Run only the task containing your sample
tasks:
  dart_question_answer:
    allowed_variants: [baseline]  # Start with baseline variant
    include-samples:
      - dart_your_sample_id  # Only run your specific sample

# Use a fast model for testing
models: [google/gemini-2.5-flash]
```
Then run with your job:
```bash
devals run test_my_sample
```
### Verify the Sample Works
Do a dry run first to validate the configuration without making API calls:

```bash
devals run test_my_sample --dry-run
```
Run the evaluation:
```bash
devals run test_my_sample
```
Check the output in the `logs/` directory (for example, with the log-reading sketch below) and verify that:

- The model received your prompt correctly
- The scorer evaluated the response appropriately
- No errors occurred during execution

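If you want to inspect a run programmatically instead of reading the raw files, the sketch below uses Inspect AI's `read_eval_log`. It is a minimal illustration, not code from this repository, and the log filename is hypothetical; substitute a real file from your `logs/` directory.

```python
# Minimal sketch (not project code): read an eval log produced by a run.
# The filename is hypothetical; use a real file from your logs/ directory.
from inspect_ai.log import read_eval_log

log = read_eval_log("logs/2025-02-04T12-00-00_dart_question_answer.eval")
print(log.status)  # "success" when the run completed without errors

for sample in log.samples or []:
    # Each sample records its id and the scores assigned by the scorer(s).
    print(sample.id, sample.scores)
```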
### What to Commit (and Not Commit!)
**Do commit:**

- Your updated task file(s) in `dataset/tasks/`
- Any new workspace templates or context files

**Do NOT commit:**

- Test job files in `dataset/jobs/` (if they were only for local testing)
- Log files in `logs/`
- API keys or `.env` files

Before submitting your PR, clean up any test job files you created:
```bash
git status  # Check for untracked/modified job files
```
## Add Functionality to the Runner
If you’re adding new task types, scorers, or solvers, this section is for you.
### Understand Tasks, Solvers, and Scorers
The `dash_evals` runner uses Inspect AI concepts:
| Component | Purpose | Location |
|---|---|---|
| Task | Defines what to evaluate; combines dataset, solver chain, and scorers | `src/dash_evals/runner/tasks/` |
| Solver | Processes inputs (e.g., injects context, runs agent loops) | |
| Scorer | Evaluates outputs (e.g., model grading, `dart analyze`, `flutter test`) | |

A typical task structure:
```python
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset


@task
def your_new_task(dataset: MemoryDataset, task_def: dict) -> Task:
    return Task(
        name=task_def.get("name", "your_new_task"),
        dataset=dataset,
        solver=[
            add_system_message("Your system prompt"),
            context_injector(task_def),
            # ... more solvers
        ],
        scorer=[
            model_graded_scorer(),
            dart_analyze_scorer(),
        ],
    )
```
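Solvers and scorers follow the same decorator pattern from Inspect AI. As a rough illustration only (these are not functions from this repository; `add_banner_message` and `contains_target_scorer` are hypothetical names), a custom solver and scorer could look like this:

```python
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import Generate, TaskState, solver


@solver
def add_banner_message(banner: str):
    # Hypothetical solver: prepends extra context to the user prompt
    # before handing the state to the model.
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        state.user_prompt.text = f"{banner}\n\n{state.user_prompt.text}"
        return await generate(state)

    return solve


@scorer(metrics=[accuracy()])
def contains_target_scorer():
    # Hypothetical scorer: marks the sample correct if the target text
    # appears verbatim in the model's completion.
    async def score(state: TaskState, target: Target) -> Score:
        found = target.text.strip() in state.output.completion
        return Score(value=1.0 if found else 0.0)

    return score
```

Real contributions should reuse the existing solvers and scorers in the runner where possible and follow their conventions.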
### Add a New Task
1. Create your task file at `src/dash_evals/runner/tasks/your_task.py`

2. Export it from `src/dash_evals/runner/tasks/__init__.py`:

   ```python
   from .your_task import your_new_task

   __all__ = [
       # ... existing tasks ...
       "your_new_task",
   ]
   ```

   Task functions are discovered dynamically via `importlib`. If the function name matches a module in `runner/tasks/`, it will be found automatically when referenced from a `task.yaml` file. No registry is needed (see the sketch after these steps).

3. Create a task directory in `dataset/tasks/`:

   ```
   dataset/tasks/your_task_id/
   └── task.yaml
   ```

   ```yaml
   # dataset/tasks/your_task_id/task.yaml
   func: your_new_task  # Must match the function name
   samples:
     - id: sample_001
       input: |
         Your prompt here.
       target: |
         Expected output or grading criteria.
   ```
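For intuition only, name-based discovery with `importlib` usually boils down to importing a module by name and pulling the function off it. This sketch is not the runner's actual implementation; `resolve_task_func` and the package path are assumptions made for illustration.

```python
# Illustrative sketch only; the real discovery logic lives in the dash_evals runner.
import importlib


def resolve_task_func(func_name: str):
    # Assumes (hypothetically) that a module named after the function exists
    # under dash_evals.runner.tasks and exports a function of the same name.
    module = importlib.import_module(f"dash_evals.runner.tasks.{func_name}")
    return getattr(module, func_name)
```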
### Test and Verify
Run the test suite:
```bash
cd packages/dash_evals
pytest
```
Run linting:
```bash
ruff check src/dash_evals
ruff format src/dash_evals
```
Test your task end-to-end:
```bash
devals run test_my_sample --dry-run  # Validate config
devals run test_my_sample            # Run actual evaluation
```
## eval_explorer
A Dart/Flutter application for exploring evaluation results, built with Serverpod.
> [!NOTE]
> The eval_explorer is under active development. Contribution guidelines coming soon!

The package is located in `packages/eval_explorer/` and consists of:

| Package | Description |
|---|---|
|  | Dart client package |
|  | Flutter web/mobile app |
|  | Serverpod backend |
|  | Shared models |
Shared models |