Similar Questions in Reliability & Evaluation
Medium
How do you measure "Time to First Token" (TTFT) vs. "Total Runtime"? Which one matters more for user experience in a chatbot?
Easy
Why is a standard unit test (asserting that output == "expected") often a bad way to test an LLM? How do you handle a model that gives three different, but correct, answers to the same prompt?
Medium
If your retriever returns 5 documents but only 1 is actually relevant to the question, how do you penalize the retriever for the "noise"?