How do you evaluate an agent when the "correct path" might involve 5 different tool calls in any order?

Question

Accepted Answer

Evaluating the path is brittle because it is non-deterministic: several different tool sequences can all be valid. Evaluate the final outcome instead. Build a suite of test cases (say, 100), each with a defined "Success" state (e.g., "The user's password was reset"). If the agent reaches that state, the case passes, regardless of which tools it called or in what order.
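A minimal sketch of what such an outcome-based harness could look like in Python. Everything here is hypothetical and only for illustration: `EvalCase`, `run_suite`, the toy tools, and the two toy agents are made up, and the "environment" is just a dictionary. The point it shows is that the success check inspects only the final state and never the sequence of tool calls.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class EvalCase:
    """One test case: a starting state plus a check on the final state."""
    name: str
    initial_state: State
    success: Callable[[State], bool]  # looks at the outcome only, not the path


def run_suite(agent: Callable[[State], None], cases: List[EvalCase]) -> float:
    """Run the agent on every case and return the fraction that passed."""
    passed = 0
    for case in cases:
        state = dict(case.initial_state)  # fresh copy so cases stay independent
        agent(state)                      # agent mutates the state via its tools
        if case.success(state):
            passed += 1
    return passed / len(cases)


# Hypothetical tools: each one just mutates the shared state.
def verify_identity(state: State) -> None:
    state["identity_verified"] = True


def send_notification(state: State) -> None:
    state["notified"] = True


def reset_password(state: State) -> None:
    state["password_reset"] = True


def toy_agent(state: State) -> None:
    """Calls its tools in a random order to mimic a non-deterministic path."""
    tools = [verify_identity, send_notification, reset_password]
    random.shuffle(tools)
    for tool in tools:
        tool(state)


def broken_agent(state: State) -> None:
    """Stops early and never reaches the success state."""
    verify_identity(state)


cases = [
    EvalCase(
        name="password reset",
        initial_state={"password_reset": False},
        success=lambda s: s.get("password_reset") is True,
    )
]

print("toy_agent pass rate:", run_suite(toy_agent, cases))
print("broken_agent pass rate:", run_suite(broken_agent, cases))
```

Because the check ignores tool order, `toy_agent` passes this case on every run even though its path changes each time, while `broken_agent` fails it. In a real system the success predicate would presumably query the actual environment (a database row, an API response) rather than a dictionary.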
