Glossary

Key terminology for understanding the evals framework.

Core Concepts

Term	Definition
Model	The LLM being evaluated (e.g., `google/gemini-2.5-pro`, `anthropic/claude-3-5-haiku`)
Task	An Inspect AI evaluation function that processes samples (e.g., `question_answer`, `bug_fix`, `code_gen`)
Sample	A single test case containing an input prompt and expected output (grading criteria)
Variant	Named configuration that modifies how a task runs — controls context injection, MCP tools, and skill availability
Eval Run	A complete execution of a task against one or more models, producing results for all samples

Term	Definition
task.yaml	Task definition file specifying the task function, samples, and optional variant restrictions
job.yaml	Runtime configuration defining what to run — filters tasks, variants, and models for a specific run
EvalSet JSON	Resolved configuration produced by the Dart `dataset_config_dart` package and consumed by the Python runner

Term	Definition
Context File	Markdown file with YAML frontmatter providing additional context injected into prompts
Workspace Template	Reusable project scaffold (e.g., a Flutter app or Dart package) mounted in the sandbox
Sandbox	Containerized environment (Podman/Docker) for safe code execution during evaluations
MCP Server	Model Context Protocol server providing tools/context to the model during evaluation
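
To make the Context File entry concrete, here is a minimal sketch of separating YAML frontmatter from the Markdown body. The `split_frontmatter` helper and the `title` field are hypothetical and chosen only for illustration; the framework's own loader and its frontmatter schema may differ.

```python
from typing import Tuple


def split_frontmatter(text: str) -> Tuple[str, str]:
    """Split a Markdown document into (frontmatter, body).

    Returns an empty frontmatter string when the document does not start
    with a '---' delimiter line or the closing delimiter is missing.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return frontmatter, body
    return "", text


# Hypothetical context file; the 'title' field is illustrative only.
example = """---
title: Example context
---
# Notes

Prefer const constructors where possible.
"""

frontmatter, body = split_frontmatter(example)
```

The sketch only separates the two parts. Turning the frontmatter text into structured data would be left to a YAML parser.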

Term	Definition
Scorer	Logic that determines if a model’s output is correct (e.g., model-graded semantic match)
Accuracy	Percentage of samples scored as correct in an eval run
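
As a rough illustration of the Accuracy definition above, the sketch below computes the percentage from a list of per-sample results. The `accuracy` helper and the boolean representation of scores are assumptions made for this example, not part of the framework's API.

```python
from typing import Sequence


def accuracy(scores: Sequence[bool]) -> float:
    """Return the percentage of samples scored as correct.

    `scores` is a hypothetical per-sample result list: True means the
    scorer judged the sample correct. An empty run yields 0.0.
    """
    if not scores:
        return 0.0
    return 100.0 * sum(scores) / len(scores)


# Three of four samples correct.
run_accuracy = accuracy([True, True, False, True])
print(f"{run_accuracy:.1f}%")  # 75.0%
```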

Package	Definition
dataset_config_dart	Dart package that parses dataset YAML and resolves it into EvalSet JSON via a layered Parser → Resolver → Writer pipeline
dash_evals	Python package that consumes EvalSet JSON (or direct CLI arguments) and executes evaluations using Inspect AI
devals_cli	Dart CLI (`devals`) for creating and managing tasks, samples, and jobs

Class	Definition
EvalSet	Top-level container representing a fully resolved evaluation configuration, serialized to JSON for the runner
Task	Inspect domain task definition with a name, task function reference, dataset, and configuration
Sample	An input/target test case with optional metadata, workspace, and sandbox configuration
Variant	Named variant configuration with context files, MCP servers, and skills
TaskInfo	Lightweight task metadata (name and function reference)
ParsedTask	Intermediate representation produced by parsers, consumed by the resolver
Job	Parsed job file with runtime overrides and task/variant/model filters
ConfigResolver	Facade providing a single-call convenience API for the full parse → resolve → write pipeline

Module	Definition
json_runner	Module that reads EvalSet JSON, resolves task functions via `importlib`, builds `inspect_ai.Task` objects, and calls `eval_set()`
args_runner	Module that builds a single task from direct CLI arguments (`--task`, `--model`, `--dataset`)
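
The `json_runner` entry mentions resolving task functions via `importlib`. The sketch below shows the general technique under stated assumptions: the `resolve_function` helper and the `module:function` reference format are hypothetical, and the actual runner may represent and resolve references differently.

```python
import importlib
from typing import Any, Callable


def resolve_function(reference: str) -> Callable[..., Any]:
    """Resolve a 'package.module:function' reference to a callable.

    The 'module:function' format is an assumption for this sketch; the
    actual reference format used in EvalSet JSON may differ.
    """
    module_name, _, function_name = reference.partition(":")
    if not module_name or not function_name:
        raise ValueError(f"Expected 'module:function', got {reference!r}")
    module = importlib.import_module(module_name)
    function = getattr(module, function_name)
    if not callable(function):
        raise TypeError(f"{reference!r} does not name a callable")
    return function


# Demonstrated with a standard-library function rather than a real task.
join = resolve_function("os.path:join")
```

Resolving by name at run time lets a JSON file refer to task functions as plain strings, at the cost of import errors surfacing only when the run starts.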

See the Configuration Reference for detailed configuration file documentation.