CONCEPT · STUB

Agent Evaluation

Agent Evaluation covers the benchmarks, harnesses, and infrastructure used to assess AI agents that carry out multi-step tasks, call tools, and make decisions autonomously. The Holistic Agent Leaderboard (HAL), accepted at ICLR 2026, established cost-aware evaluation, which reports what an agent spends alongside how accurately it performs, as the emerging standard for agentic benchmarking.
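To make "cost-aware" concrete, one common way to present it is an accuracy-versus-cost Pareto frontier: an agent run stays on the frontier only if no other run is at least as accurate and at most as costly while being strictly better on one of the two. The sketch below is an assumed illustration of that idea, not HAL's harness or API; the `AgentRun` schema, the `pareto_frontier` helper, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """One agent configuration's aggregate result on a benchmark (hypothetical schema)."""
    name: str
    accuracy: float  # fraction of tasks solved, 0.0 to 1.0
    cost_usd: float  # total inference cost of the benchmark run


def pareto_frontier(runs: list[AgentRun]) -> list[AgentRun]:
    """Return the runs that no other run dominates.

    Run b dominates run a if b is at least as accurate and at most as costly,
    and strictly better on at least one of the two axes.
    """
    frontier = []
    for a in runs:
        dominated = any(
            b.accuracy >= a.accuracy
            and b.cost_usd <= a.cost_usd
            and (b.accuracy > a.accuracy or b.cost_usd < a.cost_usd)
            for b in runs
        )
        if not dominated:
            frontier.append(a)
    # Cheapest first, so the list reads as "what extra accuracy does extra spend buy".
    return sorted(frontier, key=lambda r: r.cost_usd)


if __name__ == "__main__":
    # Illustrative numbers only; not results from HAL or any real leaderboard.
    runs = [
        AgentRun("agent-a", accuracy=0.42, cost_usd=12.0),
        AgentRun("agent-b", accuracy=0.45, cost_usd=95.0),
        AgentRun("agent-c", accuracy=0.40, cost_usd=30.0),
        AgentRun("agent-d", accuracy=0.51, cost_usd=140.0),
    ]
    for run in pareto_frontier(runs):
        print(f"{run.name}: accuracy={run.accuracy:.2f}, cost=${run.cost_usd:.2f}")
```

With these made-up numbers, `agent-c` drops off because `agent-a` is both cheaper and more accurate, while `agent-a`, `agent-b`, and `agent-d` remain as distinct cost/accuracy trade-offs. A single accuracy ranking would have hidden that `agent-a` reaches most of `agent-d`'s accuracy at a fraction of the cost.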


Mentioned in

COMPARE · 2026-05-05


AI Evaluation in 2026: Beyond MMLU — A Practical Guide from SWE-bench Pro to HLE