Reliability & Evaluation
Medium

How would you automate the process of trying to make your model "break" or "hallucinate"?

Similar Questions in Reliability & Evaluation

Easy

Define Exact Match (EM) vs. F1 Score in the context of an extraction task (e.g., extracting dates from a PDF). When should you use EM?

Medium

Explain the concept of using a "Stronger" model (like GPT-4o or Claude 3.5 Sonnet) to grade a "Weaker" model’s output. What are the risks of "Self-Preference Bias" in this setup?

Medium

Instead of checking for exact words, how would you use BERTScore or Cosine Similarity of embeddings to evaluate if an LLM's summary is accurate?
