Ground Truth Adherence measures how closely a model response aligns with a known, authoritative reference answer (or “gold standard”).

0 (Low Adherence): Response deviates from the reference answer.

1 (High Adherence): Response matches the reference answer in content and meaning.

This score captures how well a model output mirrors the expected, ideal response. A high score indicates a strong match with the reference; a low score suggests factual or semantic divergence. Results are delivered as a continuous score from 0 to 1 and can be viewed as a float or boolean depending on your workflow.
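
For instance, here is a minimal Python sketch of consuming the score either as a float or as a boolean; the score value and the 0.8 cutoff are illustrative assumptions, not DeepRails defaults:

# Hypothetical adherence score returned by an evaluation run (0.0 to 1.0).
adherence_score = 0.83
# Read it as a float for dashboards or trend tracking.
print(f"Ground Truth Adherence: {adherence_score:.2f}")
# Or collapse it to a boolean pass/fail for gating; 0.8 is an illustrative
# threshold, not a DeepRails default.
ADHERENCE_THRESHOLD = 0.8
print("PASS" if adherence_score >= ADHERENCE_THRESHOLD else "FAIL")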


Evaluation Method

DeepRails uses a structured, multi-step evaluation to compare a model response against a defined ground truth. The comparison is done at the claim level, enabling detailed audits.

1. Reference Identification

The provided reference answer (also known as the ground truth) is treated as the authoritative benchmark for comparison. It may be user-defined or system-generated.

2. Claim Extraction from AI Response

The model’s output is segmented into discrete factual or conceptual claims. Compound statements are broken down to enable accurate, fine-grained comparison.

3. Claim-Level Comparison

Each extracted claim is matched against the reference answer to determine whether it reflects the same meaning or fact. A binary verdict is assigned:

  • Y if the claim matches or is semantically equivalent to the reference
  • N if the claim deviates or introduces conflicting information

Step-by-step justification is provided for each verdict.

4. Confidence Scoring

Each verdict is paired with a confidence level (Low, Medium, High, or Certain) reflecting how clearly the claim aligns with or diverges from the reference.

5. Score Consolidation

All individual verdicts and confidence scores are aggregated into a single Ground Truth Adherence score between 0 and 1, reflecting overall fidelity to the reference.

The result is a reliable, interpretable score that reflects how well the model’s response adheres to a predetermined gold standard—useful for benchmarking, regression testing, and QA validation.
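
As an illustration of the claim-level bookkeeping described above, here is a minimal Python sketch that consolidates per-claim verdicts and confidence labels into a single 0-1 score. The example claims, the confidence-to-weight mapping, and the aggregation rule are assumptions for illustration, not the exact internal DeepRails procedure:

# Illustrative per-claim results: a "Y"/"N" verdict plus a confidence label.
claim_results = [
    {"claim": "The capital of France is Paris", "verdict": "Y", "confidence": "Certain"},
    {"claim": "The city has roughly 5 million residents", "verdict": "N", "confidence": "Medium"},
    {"claim": "Paris sits on the Seine", "verdict": "Y", "confidence": "High"},
]
# One possible (assumed) mapping from confidence labels to numeric weights.
CONFIDENCE_WEIGHTS = {"Low": 0.25, "Medium": 0.5, "High": 0.75, "Certain": 1.0}
def consolidate(results):
    """Weight each verdict by its confidence and normalize to a 0-1 score."""
    total = sum(CONFIDENCE_WEIGHTS[r["confidence"]] for r in results)
    matched = sum(CONFIDENCE_WEIGHTS[r["confidence"]] for r in results if r["verdict"] == "Y")
    return matched / total if total else 0.0
print(f"Ground Truth Adherence (sketch): {consolidate(claim_results):.2f}")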


Understanding Ground Truth Adherence

Ground Truth vs. Other Metrics

It’s important to distinguish Ground Truth Adherence from related evaluation metrics:

Ground Truth Adherence: Measures alignment with a known correct answer (gold standard).

Correctness: Measures whether the response is factually accurate, regardless of whether it matches a specific reference.

Context Adherence: Measures whether the model stayed within the information provided in a context window.

Addressing Low Ground Truth Adherence Scores

Improving Ground Truth Adherence

To reduce discrepancies between generated responses and reference answers:

Analyze mismatch patterns: Identify which types of claims tend to diverge and refine prompts or model settings accordingly.

Use few-shot prompting: Embed examples that reflect the reference style and structure to increase alignment (a prompt sketch follows this list).

Evaluate your references: Ensure gold standards are internally consistent, accurate, and domain-relevant.

Reinforce structural cues: If specific phrasing, ordering, or format matters, enforce it through prompt scaffolding.
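
For example, here is a hypothetical few-shot prompt scaffold in Python whose embedded examples mirror the structure and phrasing of the reference answers; the questions and answers are placeholders to be replaced with items from your own gold standards:

# Hypothetical few-shot scaffold; examples follow the reference answer format.
FEW_SHOT_PROMPT = """Answer support questions using the exact structure of the examples: a one-sentence answer followed by a short rationale.
Q: How do I reset my password?
A: Use the "Forgot password" link on the sign-in page. A reset email is sent to your registered address and expires after 24 hours.
Q: Can I change my billing date?
A: Yes, from Settings > Billing. The new date applies from the next cycle.
Q: {user_question}
A:"""
prompt = FEW_SHOT_PROMPT.format(user_question="How do I delete my account?")
print(prompt)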

Best Practices

Define Gold Standards Clearly

Ensure reference answers are unambiguous, domain-specific, and appropriately scoped for your evaluation goals.

Adapt Equivalence Thresholds

Decide whether you’re measuring exact reproduction or semantic equivalence, and tailor your evaluation rules accordingly.
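
A rough sketch of the distinction: an exact-reproduction check compares normalized strings, while a semantic-equivalence check compares meaning, here via embedding cosine similarity. The normalization rule, the 0.85 threshold, and the embed parameter are all illustrative assumptions standing in for whatever your evaluation stack provides:

import math
def exact_match(response: str, reference: str) -> bool:
    """Strict check: identical text after trivial whitespace/case normalization."""
    normalize = lambda s: " ".join(s.lower().split())
    return normalize(response) == normalize(reference)
def semantically_equivalent(response, reference, embed, threshold=0.85):
    """Looser check: cosine similarity of embeddings above an assumed threshold.
    `embed` is a placeholder for your embedding function."""
    a, b = embed(response), embed(reference)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return (dot / norm) >= threshold if norm else False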

Test Against Diverse Inputs

Evaluate across a wide range of inputs to confirm that high adherence scores generalize beyond templated examples.

Prevent Reference Drift

Use DeepRails’ Ground Truth Adherence guardrail to catch when model outputs drift away from approved reference responses.

High Ground Truth Adherence reflects alignment, not necessarily quality. Pair it with Completeness to assess coverage and with Correctness to verify factual accuracy for a richer assessment of model output.