Similar Questions in Reliability & Evaluation
Medium
How would you automate the process of trying to make your model "break" or "hallucinate"?
Easy
Why is a standard unit test (asserting that output == "expected") often a bad way to test an LLM? How do you handle a model that gives three different, but correct, answers to the same prompt?
Easy
What is a "Golden Dataset" (or Ground Truth set), and how many samples should it ideally contain before you can trust your evaluation metrics?