
Correctness evaluates the factual accuracy of an AI model’s response. It measures whether the information provided is accurate, truthful, and free from factual errors or hallucinations, independent of any specific documents or context.
Correctness is returned as a continuous metric ranging from 0 to 1:
Low Correctness (near 0): The response contains factual inaccuracies or hallucinations.
High Correctness (near 1): The response is factually accurate and free of errors.
This score reflects the probability that a response is free from factual inaccuracies. A higher score indicates greater factual accuracy and trustworthiness.
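Because the score is a continuous probability, downstream applications typically compare it against a threshold before acting on a response. The sketch below illustrates that pattern; the function name and the 0.8 cutoff are hypothetical choices for illustration, not part of any DeepRails interface:

```python
def gate_on_correctness(response: str, correctness: float,
                        threshold: float = 0.8) -> dict:
    """Route a response based on its correctness score in [0, 1].

    The 0.8 threshold is an illustrative choice, not a recommended default.
    """
    if not 0.0 <= correctness <= 1.0:
        raise ValueError("correctness must be between 0 and 1")
    if correctness >= threshold:
        return {"action": "deliver", "response": response}
    # Low-scoring responses are flagged rather than delivered as-is.
    return {"action": "review", "response": response, "score": correctness}
```

The right threshold depends on the application: a customer-facing assistant may tolerate far less factual risk than an internal brainstorming tool.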

Understanding Correctness

It’s important to distinguish Correctness from other related metrics:

Correctness vs. Context Adherence

Correctness: Measures whether a model response contains factually accurate information, regardless of whether that information appears in the provided context.
Context Adherence: Measures whether the model’s output is derived solely from the user-supplied context or documents.
Example: A model may state that “The Eiffel Tower is in Paris,” which is factually correct (high Correctness) even if that fact is not found in the provided context. However, if the model generates this information from its own prior knowledge rather than the given context, it may be rated low on Context Adherence because it did not rely on the source material it was instructed to use.

Correctness vs. Completeness

Correctness: Measures whether a model response’s factual information is accurate, regardless of whether that information fully answers the user’s prompt.
Completeness: Measures whether the model’s output includes all necessary statements to entirely fulfill the prompt.
Example: A model may state that “The Eiffel Tower is in Paris,” which is factually correct (high Correctness) regardless of whether it fully addresses the user’s prompt. However, if the prompt asked for more information about the Eiffel Tower, or if the Eiffel Tower is off topic entirely, this statement alone would be incomplete (low Completeness).

Addressing Low Correctness Scores

Improving Correctness

When responses score low on correctness, consider the following steps:
Pattern Analysis: Identify patterns in hallucinated or factually incorrect outputs using DeepRails’ claim-level audit logs.
Prompt Tuning: Modify prompts to encourage grounded responses and discourage speculation.
Model Selection: Experiment with different foundation models, as factuality varies widely across architectures.
Verification Workflow: Use automated or human-in-the-loop workflows to review and override responses with low correctness confidence.
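The verification-workflow step above can be sketched as a simple routing loop: score each response, deliver it if it clears a threshold, and otherwise queue it for human review. Everything here (the queue class, the `score_fn` callback, the 0.7 threshold) is a hypothetical assumption for illustration, not a DeepRails API:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ReviewQueue:
    """Collects low-confidence responses for human review (illustrative only)."""
    items: list = field(default_factory=list)

    def submit(self, prompt: str, response: str, score: float) -> None:
        self.items.append({"prompt": prompt, "response": response, "score": score})

def verify(prompt: str, response: str,
           score_fn: Callable[[str, str], float],
           queue: ReviewQueue, threshold: float = 0.7) -> Optional[str]:
    """Return the response if it clears the correctness threshold;
    otherwise enqueue it for human review and return None."""
    score = score_fn(prompt, response)
    if score >= threshold:
        return response
    queue.submit(prompt, response, score)
    return None
```

In production, `score_fn` would call whatever correctness evaluator is in use, and the queue would feed a human-review tool or an automated fallback (e.g. regeneration with a stricter prompt).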
While Correctness provides a powerful factuality lens, it’s important to combine it with other guardrails like Context Adherence and Completeness for robust production safety.
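One simple way to combine guardrails, shown below as a sketch, is to require every metric to clear its own threshold before a response ships. The metric names and threshold values are illustrative assumptions, not defaults from any library:

```python
def passes_guardrails(scores: dict, thresholds: dict) -> bool:
    """Return True only if every thresholded metric meets its cutoff.

    Both dicts map metric name -> float; a metric missing from
    `scores` is treated as 0.0 and therefore fails its cutoff.
    """
    return all(scores.get(name, 0.0) >= cutoff
               for name, cutoff in thresholds.items())

# Illustrative scores and cutoffs (not recommended values):
scores = {"correctness": 0.92, "context_adherence": 0.75, "completeness": 0.88}
thresholds = {"correctness": 0.8, "context_adherence": 0.7, "completeness": 0.8}
```

Requiring all metrics to pass (a logical AND) is conservative; some applications instead weight the scores or escalate only when multiple metrics fall below their cutoffs at once.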