Similar Questions in Reliability & Evaluation
Easy
Define Exact Match (EM) vs. F1 Score in the context of an extraction task (e.g., extracting dates from a PDF). When should you use EM?
Medium
Explain the concept of using a "stronger" model (like GPT-4o or Claude 3.5 Sonnet) to grade a "weaker" model’s output. What are the risks of self-preference bias in this setup?
Medium
Instead of checking for exact word matches, how would you use BERTScore or the cosine similarity of embeddings to evaluate whether an LLM's summary is accurate?
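As a rough illustration of the embedding-based approach this question asks about, here is a minimal sketch of a cosine-similarity check between a reference summary and a model summary. The vectors are hypothetical toy values standing in for real embeddings, and the helper name `is_semantically_close` and its 0.8 threshold are assumptions made for illustration, not values from any library or benchmark. A high score suggests the two texts are semantically close; on its own it does not prove the summary is factually correct.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as "no similarity".
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_semantically_close(reference_vec, summary_vec, threshold=0.8):
    """Hypothetical helper: the threshold is an assumed value to tune per task."""
    return cosine_similarity(reference_vec, summary_vec) >= threshold


# Hypothetical toy embeddings; in practice these would come from an embedding model.
reference_vec = [0.2, 0.7, 0.1]
good_summary_vec = [0.25, 0.65, 0.05]
off_topic_vec = [0.9, -0.1, 0.3]

print(is_semantically_close(reference_vec, good_summary_vec))
print(is_semantically_close(reference_vec, off_topic_vec))
```

In a real evaluation the threshold would typically be calibrated against a small set of human-labeled summaries rather than fixed in advance.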