Similar Questions in Reliability & Evaluation
Medium
You’ve updated your system prompt to fix a specific bug. How do you ensure this "fix" didn't break 10 other things the model was previously doing correctly?
View
Medium
How do you calculate the ROI of a prompt change? If a new prompt is 5% more accurate but 50% more expensive in tokens, how do you decide if it’s worth it?
View
Easy
Define Exact Match (EM) vs. F1 Score in the context of an extraction task (e.g., extracting dates from a PDF). When should you use EM?
View