Why is a standard unit test (asserting that output == "expected") often a bad way to test an LLM? How do you handle a model that gives three different, but correct, answers to the same prompt?

Question

Accepted Answer

Standard unit tests fail because LLMs are non-deterministic (they can say the same thing using different words). Instead of assert output == "Paris", you use semantic verification (checking if the embedding of the output is close to the expected answer) or contains-logic to see if the key information exists within the varied phrasing.

Why is a standard unit test (asserting that output == "expected") often a bad way to test an LLM? How do you handle a model that gives three different, but correct, answers to the same prompt?

Practice Your Response

Similar Questions in Reliability & Evaluation