You’ve updated your system prompt to fix a specific bug. How do you ensure this "fix" didn't break 10 other things the model was previously doing correctly?

Question

Accepted Answer

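Treat the change like a code change and run a regression test: pass your "Golden Dataset" (a fixed set of representative inputs with known-good expected outputs) through the system every time you make a change. Compare the scores of the "fixed" system against the previous version, case by case as well as in aggregate, to see whether improving one edge case caused the model to lose accuracy on standard cases. The per-case comparison matters because an unchanged overall score can hide one case that now passes and another that now fails.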
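The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive harness: the golden cases, the exact-match scoring rule, and the two stand-in "systems" (plain lookup functions in place of a real model call with the old and new prompt) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical golden dataset: (case_id, input, expected_output).
GoldenCase = Tuple[str, str, str]

GOLDEN: List[GoldenCase] = [
    ("standard-1", "2+2", "4"),
    ("standard-2", "3+3", "6"),
    ("edge-1", "0/0", "undefined"),
]


def score_system(system: Callable[[str], str],
                 golden: List[GoldenCase]) -> Dict[str, bool]:
    """Run every golden case through `system` and record pass/fail per case.

    Exact string match is an assumption; a real evaluation might use a
    rubric or a grader model instead.
    """
    return {
        case_id: system(prompt).strip() == expected
        for case_id, prompt, expected in golden
    }


def accuracy(results: Dict[str, bool]) -> float:
    """Fraction of golden cases that passed."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(baseline: Dict[str, bool],
                     candidate: Dict[str, bool]) -> List[str]:
    """Cases that passed before the change and fail after it."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Stand-ins for "model + old prompt" and "model + new prompt" (hypothetical).
def old_system(text: str) -> str:
    return {"2+2": "4", "3+3": "6", "0/0": "0"}.get(text, "")


def new_system(text: str) -> str:
    # The edge case is fixed, but a standard case has silently broken.
    return {"2+2": "4", "3+3": "7", "0/0": "undefined"}.get(text, "")


baseline = score_system(old_system, GOLDEN)
candidate = score_system(new_system, GOLDEN)

print("baseline accuracy: ", accuracy(baseline))
print("candidate accuracy:", accuracy(candidate))
print("regressions:", find_regressions(baseline, candidate))
```

In this toy run both versions pass two of three cases, so the aggregate score alone would suggest nothing changed; the per-case diff is what surfaces `standard-2` as a regression.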

