How would you set up an experiment to prove that a new prompt version is actually 10% better than the old one? What metrics would you track?

Accepted Answer

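Track task success rate as the primary metric. If the goal is "Extract Date," run the same 1,000 test cases through Prompt A and Prompt B and measure each prompt's match rate against a "Golden Dataset" of known-correct answers. To support a "10% better" claim, decide up front whether that means 10 percentage points or a 10% relative gain, then check that the observed difference between the two match rates meets that bar and is larger than what chance alone would produce.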
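The comparison can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed method: the function names (`success_rate`, `paired_bootstrap_diff`), the exact-match scoring, the toy data, and the choice of a paired bootstrap to estimate uncertainty are all assumptions added here. It treats "better" as the absolute difference in match rate between Prompt B and Prompt A on the same test cases.

```python
import random


def success_rate(predictions, golden):
    """Fraction of predictions that exactly match the golden answers."""
    if len(predictions) != len(golden):
        raise ValueError("predictions and golden must have the same length")
    return sum(p == g for p, g in zip(predictions, golden)) / len(golden)


def paired_bootstrap_diff(preds_a, preds_b, golden, n_resamples=2000, seed=0):
    """Estimate rate(B) - rate(A) with a rough 95% bootstrap interval.

    Both prompts are scored on the same cases, so cases are resampled
    together (paired) rather than independently per prompt.
    """
    rng = random.Random(seed)
    n = len(golden)
    # Per-case delta: +1 if only B is correct, -1 if only A is correct, else 0.
    deltas = [(b == g) - (a == g) for a, b, g in zip(preds_a, preds_b, golden)]
    resampled = sorted(
        sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples)
    )
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return sum(deltas) / n, lower, upper


# Hypothetical toy data; a real run would use the full golden dataset.
golden = ["2024-01-05", "2023-12-31", "2022-07-04", "2021-03-15"]
preds_a = ["2024-01-05", "2023-12-31", "07/04/2022", "unknown"]
preds_b = ["2024-01-05", "2023-12-31", "2022-07-04", "unknown"]

rate_a = success_rate(preds_a, golden)
rate_b = success_rate(preds_b, golden)
diff, lower, upper = paired_bootstrap_diff(preds_a, preds_b, golden)
print(f"A={rate_a:.2f} B={rate_b:.2f} diff={diff:.2f} interval=({lower:.2f}, {upper:.2f})")
```

With only four cases the interval is far too wide to conclude anything, which is the reason for running something on the order of the 1,000 cases mentioned above. If the lower end of the interval sits above the agreed threshold, the improvement claim is on firmer ground.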
