TAU-bench

TAU-bench (Tool-Agent-User benchmark) is an evaluation framework designed to test how effectively AI agents can interact with users and external tools in realistic, multi-turn scenarios. Developed by researchers at Sierra, it measures an agent’s ability to complete structured tasks, such as booking flights or managing online orders, while following domain policies, calling APIs correctly, and reasoning across multiple steps. A key innovation in TAU-bench is its pass^k metric, which evaluates not just whether an agent succeeds once, but whether it succeeds on every one of k independent trials of the same task, making reliability itself part of the score. The benchmark shows that even advanced models often struggle with consistency and proper tool use over extended interactions, with pass^k dropping sharply as k grows. TAU-bench has quickly become a standard for assessing the real-world robustness of autonomous and conversational AI systems.
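As a rough illustration of how such a consistency metric can be computed, the sketch below estimates per-task pass^k from repeated trials using the combinatorial estimator C(c, k) / C(n, k), where c of n trials succeeded, and then averages across tasks. The function names and the example numbers are illustrative, not taken from the TAU-bench codebase.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that all k
    independently drawn trials succeed, given c successes out of n trials."""
    if k > n:
        raise ValueError("k cannot exceed the number of trials n")
    return comb(c, k) / comb(n, k)

def benchmark_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks; each entry is (n_trials, n_successes)."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)

# Hypothetical example: an agent run 8 times on each of three tasks,
# succeeding 8, 6, and 4 times respectively.
trials = [(8, 8), (8, 6), (8, 4)]
print(benchmark_pass_hat_k(trials, k=1))  # ~0.75, the plain single-trial success rate
print(benchmark_pass_hat_k(trials, k=4))  # ~0.41, reliability drops as k increases
```

The point of the example is the gap between the two numbers: an agent that looks competent on a single attempt can still be far from dependable when asked to get the same task right every time.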