Ground Truth Adherence
Measure how closely AI responses match reference answers using DeepRails Guardrail Metrics to evaluate semantic and factual alignment.
Ground Truth Adherence measures how closely a model response aligns with a known, authoritative reference answer (or “gold standard”).
Scores range from 0 to 1:
- 0 (Low Adherence): Response deviates from the reference answer
- 1 (High Adherence): Response matches the reference answer in content and meaning
This score captures how well a model output mirrors the expected, ideal response. A high score indicates a strong match with the reference; a low score suggests factual or semantic divergence. Results are delivered as a continuous score from 0 to 1 and can be viewed as a float or boolean depending on your workflow.
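As a minimal sketch of consuming the score either way, assuming a hypothetical response payload (the field name and the 0.8 threshold are illustrative, not part of the DeepRails API):

```python
# Hypothetical response payload; the field name is an illustrative assumption.
result = {"ground_truth_adherence": 0.92}

score = result["ground_truth_adherence"]  # continuous float in [0, 1]
passed = score >= 0.8                     # boolean view under an assumed threshold
```

A float supports fine-grained benchmarking and trend tracking, while a boolean suits pass/fail gates in CI or QA pipelines.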
Evaluation Method
DeepRails uses a structured, multi-step evaluation to compare a model response against a defined ground truth. The comparison is done at the claim level, enabling detailed audits.
Reference Identification
The provided reference answer (also known as the ground truth) is treated as the authoritative benchmark for comparison. It may be user-defined or system-generated.
Claim Extraction from AI Response
The model’s output is segmented into discrete factual or conceptual claims. Compound statements are broken down to enable accurate, fine-grained comparison.
Claim-Level Comparison
Each extracted claim is matched against the reference answer to determine whether it reflects the same meaning or fact. A binary verdict is assigned:
- Y if the claim matches or is semantically equivalent to the reference
- N if the claim deviates or introduces conflicting information
Step-by-step justification is provided for each verdict.
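As a rough illustration, each comparison might yield a record like the following; the structure and field names are assumptions for the sketch, not DeepRails' actual output schema:

```python
# Hypothetical claim-level verdict records; the shape is illustrative only.
verdicts = [
    {
        "claim": "The Eiffel Tower is in Paris.",
        "verdict": "Y",  # matches or is semantically equivalent to the reference
        "justification": "Reference states the tower is located in Paris.",
    },
    {
        "claim": "It was completed in 1900.",
        "verdict": "N",  # conflicts with the reference (completed in 1889)
        "justification": "Reference gives a completion date of 1889.",
    },
]
```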
Confidence Scoring
Each verdict is paired with a confidence level (Low, Medium, High, or Certain) reflecting how clearly the claim aligns with or diverges from the reference.
Score Consolidation
All individual verdicts and confidence scores are aggregated into a single Ground Truth Adherence score between 0 and 1, reflecting overall fidelity to the reference.
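One way such a consolidation could work, sketched under the assumption that confidence levels act as weights (DeepRails' actual aggregation may differ):

```python
# Hypothetical confidence weights; the real mapping is not documented here.
WEIGHTS = {"Low": 0.25, "Medium": 0.5, "High": 0.75, "Certain": 1.0}

def consolidate(verdicts):
    """Aggregate (verdict, confidence) pairs into a 0-1 adherence score."""
    total = sum(WEIGHTS[conf] for _, conf in verdicts)
    matched = sum(WEIGHTS[conf] for verdict, conf in verdicts if verdict == "Y")
    return matched / total if total else 0.0

score = consolidate([("Y", "Certain"), ("Y", "High"), ("N", "Medium")])
```

Under this scheme, confidently matched claims pull the score toward 1, while confident mismatches pull it toward 0.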
The result is a reliable, interpretable score that reflects how well the model’s response adheres to a predetermined gold standard—useful for benchmarking, regression testing, and QA validation.
Understanding Ground Truth Adherence
Ground Truth vs. Other Metrics
It’s important to distinguish Ground Truth Adherence from related evaluation metrics:
- Ground Truth Adherence: Measures alignment with a known correct answer (gold standard).
- Correctness: Measures factual accuracy regardless of whether it matches a specific reference.
- Context Adherence: Measures whether the model stayed within the information provided in a context window.
Addressing Low Ground Truth Adherence Scores
Improving Ground Truth Adherence
To reduce discrepancies between generated responses and reference answers:
- Analyze mismatch patterns: Identify which types of claims tend to diverge and refine prompts or model settings accordingly.
- Use few-shot prompting: Embed examples that reflect the reference style and structure to increase alignment.
- Evaluate your references: Ensure gold standards are internally consistent, accurate, and domain-relevant.
- Reinforce structural cues: If specific phrasing, ordering, or format matters, enforce it through prompt scaffolding.
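The few-shot approach above can be sketched as follows; the prompt wording and example pairs are illustrative assumptions, so substitute examples drawn from your own reference set:

```python
# Build a few-shot prompt whose examples mirror the reference answers' style.
# All strings here are illustrative assumptions.
examples = [
    ("What is the boiling point of water at sea level?",
     "Water boils at 100 °C (212 °F) at sea level."),
    ("What is the chemical symbol for gold?",
     "The chemical symbol for gold is Au."),
]

question = "What is the freezing point of water at sea level?"

prompt = "Answer in the same style as the examples.\n\n"
for q, a in examples:
    prompt += f"Q: {q}\nA: {a}\n\n"
prompt += f"Q: {question}\nA:"
```

Because the model sees answers phrased like your gold standards, its output is more likely to match them in structure and tone.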
Best Practices
Define Gold Standards Clearly
Ensure reference answers are unambiguous, domain-specific, and appropriately scoped for your evaluation goals.
Adapt Equivalence Thresholds
Decide whether you’re measuring exact reproduction or semantic equivalence, and tailor your evaluation rules accordingly.
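To make that choice concrete, here is a minimal sketch contrasting exact reproduction with a loose semantic check; the token-overlap (Jaccard) heuristic is a crude stand-in assumption, not how DeepRails judges equivalence:

```python
import re

def exact_match(response: str, reference: str) -> bool:
    """Strictest rule: the response must reproduce the reference verbatim."""
    return response.strip() == reference.strip()

def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def loose_semantic_match(response: str, reference: str,
                         threshold: float = 0.6) -> bool:
    """Crude stand-in for semantic equivalence: shared-token (Jaccard) ratio."""
    r, g = _tokens(response), _tokens(reference)
    return len(r & g) / len(r | g) >= threshold if r | g else True
```

A reworded but accurate answer fails the exact rule yet passes the loose one, which is why the threshold you pick should reflect whether wording or meaning is what you are grading.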
Test Against Diverse Inputs
Evaluate across a wide range of inputs to confirm that high adherence scores generalize beyond templated examples.
Prevent Reference Drift
Use DeepRails’ Ground Truth Adherence guardrail to catch when model outputs drift away from approved reference responses.
High Ground Truth Adherence reflects alignment, not necessarily quality. Pair it with Completeness to assess coverage and with Correctness to verify factual accuracy for a richer assessment of model output.