Similar Questions in Reliability & Evaluation
Easy
Why is a standard unit test (asserting that output == "expected") often a bad way to test an LLM? How do you handle a model that gives three different, but correct, answers to the same prompt?
View
Medium
You’ve updated your system prompt to fix a specific bug. How do you ensure this "fix" didn't break 10 other things the model was previously doing correctly?
View
Medium
Why is it harder to A/B test an LLM prompt than a UI button color? How do you account for the "non-deterministic" nature during the test?
View