How would you automate the process of trying to make your model "break" or "hallucinate"?

Question

Accepted Answer

This is "adversarial testing" or "red teaming". You use a separate model to generate thousands of "attack" prompts designed to make the system break its rules, leak data, or provide dangerous advice.

How would you automate the process of trying to make your model "break" or "hallucinate"?

Practice Your Response

Similar Questions in Reliability & Evaluation