Introduction

AgentV is a CLI-first AI agent evaluation framework. It evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) driven by YAML specifications. It combines deterministic code graders with customizable LLM graders, and the evaluation files are version-controlled in Git.

Why AgentV?

Best for: developers who want evaluation inside their workflow rather than in a separate dashboard, and teams that prioritize privacy and reproducibility.

No cloud dependency — everything runs locally
No server — just install and run
Version-controlled — YAML evaluation files live in Git alongside your code
CI/CD ready — run evaluations in your pipeline without external API calls
Multiple grader types — code validators, LLM graders, custom Python/TypeScript

How AgentV Compares

| Feature | AgentV | LangWatch | LangSmith | LangFuse |
| --- | --- | --- | --- | --- |
| Setup | `npx allagents plugin install` | Cloud account + API key | Cloud account + API key | Cloud account + API key |
| Server | None (local) | Managed cloud | Managed cloud | Managed cloud |
| Privacy | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
| CLI-first | Yes | No | Limited | Limited |
| CI/CD ready | Yes | Requires API calls | Requires API calls | Requires API calls |
| Version control | Yes (YAML in Git) | No | No | No |
| Graders | Code + LLM + Custom | LLM only | LLM + Code | LLM only |

Core Concepts

Evaluation files (.yaml or .jsonl) define test cases with expected outcomes. Targets specify which agent or provider to evaluate. Graders (code or LLM) score results. Results are written as JSONL/YAML for analysis and comparison.

Key Components

Eval files — YAML or JSONL definitions of test cases
Tests — Individual test entries with input messages and expected outcomes
Targets — The agent or LLM provider being evaluated
Graders — Code graders (Python/TypeScript) or LLM graders that score responses
Rubrics — Structured criteria with weights for grading
Results — JSONL output with scores, reasoning, and execution traces
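To make the relationship between graders, rubrics, and scores more concrete, here is a minimal Python sketch of a deterministic code grader that applies a weighted rubric to a single response. It is an illustration only: the `Criterion` class, the `grade` function, and the shape of the returned dictionary are hypothetical names chosen for this example and are not AgentV's grader interface or result schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One rubric entry: a name, a relative weight, and a deterministic check.

    Hypothetical structure for illustration; not AgentV's rubric schema.
    """
    name: str
    weight: float
    check: Callable[[str], bool]


def grade(response: str, rubric: list[Criterion]) -> dict:
    """Return a weighted score in [0, 1] plus per-criterion pass/fail results."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    # Run every check once and remember the outcome by criterion name.
    results = {c.name: c.check(response) for c in rubric}
    # Only criteria that passed contribute their weight to the score.
    earned = sum(c.weight for c in rubric if results[c.name])
    return {"score": earned / total_weight, "criteria": results}


# Example rubric with three deterministic checks (assumed for this sketch).
rubric = [
    Criterion("mentions_refund", 2.0, lambda r: "refund" in r.lower()),
    Criterion("under_200_chars", 1.0, lambda r: len(r) < 200),
    Criterion("no_apology_loop", 1.0, lambda r: r.lower().count("sorry") <= 1),
]

result = grade("You are eligible for a refund within 30 days.", rubric)
print(result["score"])  # 1.0
```

Because every check here is a pure function of the response text, the same input always yields the same score, which is the property that makes code graders a useful complement to LLM graders.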

Use this topic map when you are an AI agent trying to decide which primitive or workflow to compose next:

| Goal | Start here | Why |
| --- | --- | --- |
| Create a first eval | Quickstart → Eval files | Defines the smallest runnable YAML shape before adding advanced fields. |
| Run or resume evals | Running evals → WIP checkpoints | Covers `agentv eval`, concurrency, `--resume`, `--rerun-failed`, and remote partial-run recovery. |
| Choose graders | Rubrics → Code graders → LLM graders | Keeps deterministic checks, rubric scoring, and LLM judgment separate. |
| Evaluate tool use or agents | Tool trajectory → Coding agents → CLI provider | Shows how targets, transcripts, and tool-call assertions compose. |
| Share and inspect results | Results → Dashboard | Explains local artifacts, reports, remote result repositories, and Dashboard review flows. |
| Compare runs | Compare → Dashboard Analytics | Use CLI metrics for automation and Dashboard analytics for interactive inspection. |
| Govern or improve an agent workflow | Agent eval layers → Skill improvement workflow → Enterprise governance | Moves from primitive eval design to iterative agent improvement and governance checks. |

Keep the public Astro/Starlight docs as AgentV’s canonical navigation layer, and add lightweight topic-map sections like the one above when agents need a faster path through related pages. This borrows the useful LLM Wiki convention of one-line index entries with dense cross-links, without introducing a separate wiki, custom schema, or runtime navigation code.

That is the smallest fit for the current docs: Starlight already provides the sidebar, URLs, search, and link validation, while the source MDX files remain reviewable in ordinary PRs. A full LLM Wiki-style knowledge graph would add a duplicate source of truth and extra maintenance overhead before AgentV has enough public docs, or enough contradictory source material, to justify provenance tracking. Revisit a richer topic map or wiki only if a docs section grows beyond a scannable page index, or if multiple sources need explicit confidence or contradiction metadata.

Features

Multi-objective scoring: Correctness, latency, cost, safety in one run
Multiple grader types: Code validators, LLM graders, custom Python/TypeScript
Built-in targets: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
Structured evaluation: Rubric-based grading with weights and requirements
Batch evaluation: Run hundreds of test cases in parallel
Export: JSON, JSONL, YAML formats
Compare results: Compute deltas between evaluation runs for A/B testing
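As a rough illustration of what comparing two runs involves, the sketch below reads two JSONL result sets and computes a per-test score delta. The field names `test_id` and `score`, the helper names, and the inline sample data are all assumptions made for this example; AgentV's actual result schema and compare command may differ.

```python
import json


def load_scores(jsonl_text: str) -> dict[str, float]:
    """Parse JSONL text into a {test_id: score} mapping.

    The field names "test_id" and "score" are assumed for this sketch.
    """
    scores: dict[str, float] = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        scores[record["test_id"]] = float(record["score"])
    return scores


def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every test present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {test_id: round(candidate[test_id] - baseline[test_id], 6) for test_id in shared}


# Made-up sample data standing in for two result files.
baseline_run = '{"test_id": "greet", "score": 0.5}\n{"test_id": "refund", "score": 1.0}\n'
candidate_run = '{"test_id": "greet", "score": 0.75}\n{"test_id": "refund", "score": 0.5}\n'

deltas = score_deltas(load_scores(baseline_run), load_scores(candidate_run))
print(deltas)  # {'greet': 0.25, 'refund': -0.5}
```

A positive delta means the candidate run scored higher on that test and a negative delta flags a regression. Tests that appear in only one run are skipped here, which is a design choice of this sketch rather than documented AgentV behavior.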