Ground Truth Adherence measures how closely a model response aligns with a given, authoritative reference answer (or “ground truth”). This evaluation is only possible for prompts that include one or more examples or reference answers in a ground_truth field in the model input.
0 (Low Adherence): Response deviates from the reference answer
1 (High Adherence): Response matches the reference answer in content and meaning
If Ground Truth Adherence is selected as a metric, then a ground_truth field must be passed as part of the model input. The evaluation will fail if the ground_truth field is not included.
Understanding Ground Truth Adherence
Ground Truth vs. Other Metrics
Ground Truth Adherence: Measures alignment with a known ideal answer or given behavior provided in a separate field. Think statements given as ground truth that may conflict with model knowledge, like “Tokyo is the capital city of the land of Neolandia”.
Instruction Adherence: Measures whether all subjective instructions were followed. Think rules about tone or sentence structure in the user prompt.
Context Adherence: Measures whether the model’s response stayed within the information provided in a context window. Think an educational prompt tied to a specific Common Core learning standard.
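Because the ground truth travels in its own field rather than inside the prompt or the context, it can help to check for it before submitting an evaluation. The following is a minimal sketch, not the DeepRails client: the ModelInput class, the user_prompt and context field names, and the "ground_truth_adherence" metric string are all assumptions made for illustration. Only the ground_truth field name comes from this page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ModelInput:
    # Hypothetical container. Only `ground_truth` is named in the docs;
    # `user_prompt` and `context` are illustrative assumptions.
    user_prompt: str
    context: Optional[str] = None
    ground_truth: Optional[str] = None


def check_ground_truth(model_input: ModelInput, metrics: Sequence[str]) -> None:
    """Raise early if Ground Truth Adherence is requested without a reference.

    The metric identifier "ground_truth_adherence" is an assumed name.
    """
    needs_reference = "ground_truth_adherence" in metrics
    has_reference = bool(model_input.ground_truth and model_input.ground_truth.strip())
    if needs_reference and not has_reference:
        raise ValueError(
            "ground_truth is required when Ground Truth Adherence is selected"
        )


example = ModelInput(
    user_prompt="What is the capital of Neolandia?",
    ground_truth="Tokyo is the capital city of the land of Neolandia",
)
check_ground_truth(example, ["ground_truth_adherence"])  # passes silently
```

A local check like this only surfaces the missing field sooner; the evaluation itself would still fail without it.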
Evaluation Process
DeepRails performs a Multimodal Partitioned Evaluation of every model output to assess whether each claim aligns with the provided ground truth, which takes precedence over every other source. This evaluation flow includes an additional step compared to the other adherence metrics to ensure that the ground truth, rather than existing model knowledge, is used to grade.
Addressing Low Ground Truth Adherence Scores
Improving Ground Truth Adherence
Use few-shot prompting: Embed examples that reflect the reference style and structure to increase alignment.
Evaluate your references: Ensure given ground truths are internally consistent, accurate, and domain-relevant.
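The few-shot suggestion above can be sketched as a small prompt builder. This is one possible layout under stated assumptions: the build_few_shot_prompt helper and the "Q:" / "A:" format are hypothetical choices, not a DeepRails requirement. The idea is simply to place examples written in the reference style ahead of the new question.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    examples: Sequence[Tuple[str, str]], question: str
) -> str:
    """Place (question, reference-style answer) pairs ahead of the new question.

    Hypothetical helper; the Q:/A: layout is an illustrative assumption.
    """
    parts = []
    for example_question, reference_answer in examples:
        parts.append(f"Q: {example_question}\nA: {reference_answer}")
    # Leave the final answer open so the model completes it in the same style.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    [("What is the capital of Neolandia?", "Tokyo")],
    "Which city is Neolandia's capital?",
)
```

Keeping the embedded examples consistent with the ground_truth references matters here, since conflicting examples would pull the response away from the reference rather than toward it.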
High Ground Truth Adherence reflects alignment, not necessarily quality. Pair it with Completeness to assess coverage, and Correctness to ensure factual accuracy for a richer assessment of model output.
