Introduction

AgentV is a CLI-first AI agent evaluation framework. It evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. It combines deterministic code graders with customizable LLM graders, with everything version-controlled in Git.

Best for: developers who want evaluation inside their workflow rather than in a separate dashboard, and teams that prioritize privacy and reproducibility.

  • No cloud dependency — everything runs locally
  • No server — just install and run
  • Version-controlled — YAML evaluation files live in Git alongside your code
  • CI/CD ready — run evaluations in your pipeline without external API calls
  • Multiple grader types — code validators, LLM graders, custom Python/TypeScript

| Feature | AgentV | LangWatch | LangSmith | LangFuse |
| --- | --- | --- | --- | --- |
| Setup | npx allagents plugin install | Cloud account + API key | Cloud account + API key | Cloud account + API key |
| Server | None (local) | Managed cloud | Managed cloud | Managed cloud |
| Privacy | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
| CLI-first | Yes | No | Limited | Limited |
| CI/CD ready | Yes | Requires API calls | Requires API calls | Requires API calls |
| Version control | Yes (YAML in Git) | No | No | No |
| Graders | Code + LLM + Custom | LLM only | LLM + Code | LLM only |

Evaluation files (.yaml or .jsonl) define test cases with expected outcomes. Targets specify which agent or provider to evaluate. Graders (code or LLM) score results. Results are written as JSONL/YAML for analysis and comparison.

  • Eval files — YAML or JSONL definitions of test cases
  • Tests — Individual test entries with input messages and expected outcomes
  • Targets — The agent or LLM provider being evaluated
  • Graders — Code graders (Python/TypeScript) or LLM graders that score responses
  • Rubrics — Structured criteria with weights for grading
  • Results — JSONL output with scores, reasoning, and execution traces
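
The concepts above can be sketched in a few lines of Python. This is an illustration only, not AgentV's schema or API: the names `Criterion`, `contains_grader`, `weighted_score`, and `to_jsonl_line`, along with the result field names, are hypothetical. They are chosen to show how a deterministic code grader, a weighted rubric, and a JSONL result line might relate.

```python
import json
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion (hypothetical shape, not AgentV's schema)."""
    name: str
    weight: float
    expected_substring: str


def contains_grader(response: str, criterion: Criterion) -> float:
    """Deterministic code grader: 1.0 if the expected text appears, else 0.0."""
    return 1.0 if criterion.expected_substring.lower() in response.lower() else 0.0


def weighted_score(response: str, rubric: list[Criterion]) -> float:
    """Combine per-criterion scores into one weight-normalized score."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight == 0:
        return 0.0
    earned = sum(c.weight * contains_grader(response, c) for c in rubric)
    return earned / total_weight


def to_jsonl_line(test_id: str, response: str, rubric: list[Criterion]) -> str:
    """Serialize one result as a single JSONL line (field names are assumptions)."""
    record = {
        "test_id": test_id,
        "score": weighted_score(response, rubric),
        "criteria": {c.name: contains_grader(response, c) for c in rubric},
    }
    return json.dumps(record)


rubric = [
    Criterion("mentions_refund", weight=3.0, expected_substring="refund"),
    Criterion("gives_timeline", weight=1.0, expected_substring="business days"),
]
line = to_jsonl_line("support-001", "We will issue a refund shortly.", rubric)
print(line)
```

In this sketch the response satisfies the criterion with weight 3 but not the one with weight 1, so the normalized score is 3 / 4. An LLM grader would slot in where `contains_grader` is called, returning a judged score instead of a deterministic one.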

Use this topic map when you are an AI agent trying to decide which primitive or workflow to compose next:

| Goal | Start here | Why |
| --- | --- | --- |
| Create a first eval | Quickstart, Eval files | Defines the smallest runnable YAML shape before adding advanced fields. |
| Run or resume evals | Running evals, WIP checkpoints | Covers agentv eval, concurrency, --resume, --rerun-failed, and remote partial-run recovery. |
| Choose graders | Rubrics, Code graders, LLM graders | Keeps deterministic checks, rubric scoring, and LLM judgment separate. |
| Evaluate tool use or agents | Tool trajectory, Coding agents, CLI provider | Shows how targets, transcripts, and tool-call assertions compose. |
| Share and inspect results | Results, Dashboard | Explains local artifacts, reports, remote result repositories, and Dashboard review flows. |
| Compare runs | Compare, Dashboard Analytics | Use CLI metrics for automation and Dashboard analytics for interactive inspection. |
| Govern or improve an agent workflow | Agent eval layers, Skill improvement workflow, Enterprise governance | Moves from primitive eval design to iterative agent improvement and governance checks. |

Keep the public Astro/Starlight docs as AgentV’s canonical navigation layer, and add lightweight topic-map sections like the one above when agents need a faster path through related pages. This borrows the useful LLM Wiki convention of one-line index entries with dense cross-links, without introducing a separate wiki, custom schema, or runtime navigation code.

That is the smallest fit for the current docs: Starlight already provides the sidebar, URLs, search, and link validation, while the source MDX files remain reviewable in ordinary PRs. A full LLM Wiki-style knowledge graph would add duplicate source-of-truth and maintenance overhead before AgentV has enough public docs or contradictory source material to justify provenance tracking. Revisit a richer topic-map or wiki only if a docs section grows beyond a scannable page index, or if multiple sources need explicit confidence/contradiction metadata.

  • Multi-objective scoring: Correctness, latency, cost, safety in one run
  • Multiple grader types: Code validators, LLM graders, custom Python/TypeScript
  • Built-in targets: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
  • Structured evaluation: Rubric-based grading with weights and requirements
  • Batch evaluation: Run hundreds of test cases in parallel
  • Export: JSON, JSONL, YAML formats
  • Compare results: Compute deltas between evaluation runs for A/B testing
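
To make the "compare results" idea concrete, here is a minimal sketch of computing per-test score deltas between two runs stored as JSONL. It is not AgentV's compare implementation: the `test_id` and `score` field names and the helper names `load_scores` and `score_deltas` are assumptions made for illustration.

```python
import json


def load_scores(jsonl_text: str) -> dict[str, float]:
    """Map test id -> score from JSONL text (field names are assumptions)."""
    scores = {}
    for raw in jsonl_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        record = json.loads(raw)
        scores[record["test_id"]] = float(record["score"])
    return scores


def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-test delta (candidate minus baseline) for tests present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {test_id: round(candidate[test_id] - baseline[test_id], 6) for test_id in shared}


# Two tiny in-memory runs standing in for result files on disk.
baseline_run = '{"test_id": "t1", "score": 0.5}\n{"test_id": "t2", "score": 1.0}\n'
candidate_run = '{"test_id": "t1", "score": 0.75}\n{"test_id": "t2", "score": 0.5}\n'

deltas = score_deltas(load_scores(baseline_run), load_scores(candidate_run))
mean_delta = sum(deltas.values()) / len(deltas)
print(deltas, mean_delta)
```

Restricting the comparison to tests present in both runs keeps the deltas meaningful when one run was partial or added new cases. A positive delta means the candidate run scored higher on that test.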