Skip to main content

Ground Truth Adherence measures how closely a model response aligns with a known, authoritative reference answer (or “gold standard”). This metric is only appropriate for prompts that include one or more examples or reference answers for each important part of the model response.
01
Low Adherence
Response deviates from the reference answer
High Adherence
Response matches the reference answer in content and meaning
This score captures how well a model output mirrors the expected, ideal response. A high score indicates a strong match with the reference; a low score suggests factual or semantic divergence. Results are delivered as a continuous score from 0 to 1 and can be viewed as a float or boolean depending on your workflow.

Understanding Ground Truth Adherence

Ground Truth vs. Other Metrics

It’s important to distinguish Ground Truth Adherence from related evaluation metrics:
Ground Truth Adherence: Measures alignment with a known ideal answer (gold standard).
Correctness: Measures factual accuracy regardless of whether it matches a specific reference.
Context Adherence: Measures whether the model stayed within the information provided in a context window.

Addressing Low Ground Truth Adherence Scores

Improving Ground Truth Adherence

To reduce discrepancies between generated responses and reference answers:
Analyze mismatch patterns: Identify which types of claims tend to diverge and refine prompts or model settings accordingly.
Use few-shot prompting: Embed examples that reflect the reference style and structure to increase alignment.
Evaluate your references: Ensure gold standards are internally consistent, accurate, and domain-relevant.
Reinforce structural cues: If specific phrasing, ordering, or format matters, enforce it through prompt scaffolding.

Best Practices

Define Gold Standards Clearly

Ensure reference answers are unambiguous, domain-specific, and appropriately scoped for your evaluation goals.

Test Against Diverse Inputs

Evaluate across a wide range of inputs to confirm that high adherence scores generalize beyond templated examples.
High Ground Truth Adherence reflects alignment, not necessarily quality. Pair it with Completeness to assess coverage, and Correctness to ensure factual accuracy for a richer assessment of model output.
I