Ground Truth Adherence measures how closely a model response aligns with a known, authoritative reference answer (or “gold standard”). This metric is only appropriate for prompts that include one or more examples or reference answers for each important part of the model response.
0: Low Adherence
Response deviates from the reference answer.
1: High Adherence
Response matches the reference answer in content and meaning.
Understanding Ground Truth Adherence
Ground Truth vs. Other Metrics
Ground Truth Adherence: Measures alignment with a known ideal answer (gold standard).
Correctness: Measures factual accuracy regardless of whether it matches a specific reference.
Context Adherence: Measures whether the model stayed within the information provided in a context window.
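To make the distinction concrete, here is a minimal sketch that uses token overlap as a crude stand-in for adherence to a reference. It is an illustrative assumption only, not how the metric is actually computed: a real evaluator compares content and meaning, not just shared words. The helper names `tokenize` and `overlap_f1` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_f1(response: str, reference: str) -> float:
    """Token-overlap F1 between a response and a reference, in [0.0, 1.0].

    Hypothetical lexical proxy for illustration; it does not capture meaning.
    """
    resp, ref = Counter(tokenize(response)), Counter(tokenize(reference))
    shared = sum((resp & ref).values())  # tokens present in both, with counts
    if shared == 0:
        return 0.0
    precision = shared / sum(resp.values())
    recall = shared / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The Eiffel Tower is in Paris."
matching = "The Eiffel Tower is in Paris."
true_but_different = "Water boils at 100 degrees Celsius at sea level."

print(overlap_f1(matching, reference))
print(overlap_f1(true_but_different, reference))
```

The second response is factually accurate, so it could do well on Correctness, yet it shares nothing with the reference and so would score poorly on any reference-based measure.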
Addressing Low Ground Truth Adherence Scores
Improving Ground Truth Adherence
Analyze mismatch patterns: Identify which types of claims tend to diverge and refine prompts or model settings accordingly.
Use few-shot prompting: Embed examples that reflect the reference style and structure to increase alignment.
Evaluate your references: Ensure gold standards are internally consistent, accurate, and domain-relevant.
Reinforce structural cues: If specific phrasing, ordering, or format matters, enforce it through prompt scaffolding.
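As one possible illustration of the few-shot and scaffolding suggestions above, the sketch below assembles a prompt from reference-style examples. The function name `build_few_shot_prompt`, the `Q:`/`A:` layout, and the instruction sentence are all assumptions chosen for the example; adapt them to whatever format your references actually use.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Assemble a prompt whose examples mirror the reference style.

    `examples` holds (question, reference_answer) pairs. The layout is a
    hypothetical scaffold, not a required format.
    """
    parts = ["Answer in the same style and structure as the examples."]
    for example_question, reference_answer in examples:
        parts.append(f"Q: {example_question}\nA: {reference_answer}")
    # End with an open "A:" so the model continues in the demonstrated format.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    [("What is the capital of France?", "Paris.")],
    "What is the capital of Japan?",
)
print(prompt)
```

Drawing the examples from the same pool as the gold standards keeps phrasing, ordering, and format consistent between what the model sees and what it is scored against.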
Best Practices
Define Gold Standards Clearly
Ensure reference answers are unambiguous, domain-specific, and appropriately scoped for your evaluation goals.
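One simple ambiguity check is to look for identical inputs that carry different reference answers. The sketch below assumes a dataset of dictionaries with hypothetical `input` and `reference` keys; the field names and the normalization rule are illustrative choices, not a prescribed schema.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial differences are not flagged."""
    return " ".join(text.lower().split())


def find_conflicting_references(dataset: list[dict]) -> dict[str, set[str]]:
    """Return each input that maps to more than one distinct reference answer."""
    refs_by_input: dict[str, set[str]] = defaultdict(set)
    for row in dataset:
        refs_by_input[normalize(row["input"])].add(normalize(row["reference"]))
    return {key: refs for key, refs in refs_by_input.items() if len(refs) > 1}


dataset = [
    {"input": "Capital of France?", "reference": "Paris"},
    {"input": "capital of  france?", "reference": "Lyon"},
    {"input": "Capital of Japan?", "reference": "Tokyo"},
]
conflicts = find_conflicting_references(dataset)
print(conflicts)
```

Any input returned here has at least two competing gold standards and should be resolved before scores against it are trusted.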
Test Against Diverse Inputs
Evaluate across a wide range of inputs to confirm that high adherence scores generalize beyond templated examples.
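A lightweight way to check generalization is to break scores down by input type rather than reporting a single average. The sketch below assumes each result row has hypothetical `category` and `score` fields, with the score already produced by your evaluator.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_category(results: list[dict]) -> dict[str, float]:
    """Average adherence scores separately for each input category."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for row in results:
        grouped[row["category"]].append(row["score"])
    return {category: mean(scores) for category, scores in grouped.items()}


results = [
    {"category": "templated", "score": 1.0},
    {"category": "templated", "score": 0.5},
    {"category": "free_form", "score": 0.25},
]
by_category = mean_score_by_category(results)
print(by_category)
```

A large gap between categories suggests that high adherence on templated inputs is not carrying over to more varied ones.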
High Ground Truth Adherence reflects alignment with the reference, not necessarily quality. For a richer assessment of model output, pair it with Completeness to assess coverage and with Correctness to verify factual accuracy.