Correctness evaluates the factual accuracy of an AI model’s response. It measures the approximate percentage of the response that is verifiably true, weighting gross errors more heavily than slight missteps.
Low Correctness
The response has very few verifiably true claims.
High Correctness
The response is completely accurate.
Understanding Correctness
It’s important to distinguish Correctness from other related metrics:
Correctness vs. Context Adherence
Correctness: Measures whether a model response contains factual information, regardless of whether that information is included in the provided context.
Context Adherence: Measures whether the model’s output is derived solely from the user-supplied context or documents.
Example: A model may state that “The Eiffel Tower is in Paris,” which is factually correct (high Correctness) even if that fact is not found in the provided context. However, if the model generates this information from its own prior knowledge rather than the given context, it may be rated low on Context Adherence because it did not rely on the source material it was instructed to use.
Correctness vs. Completeness
Correctness: Measures whether a model response’s factual information is accurate, regardless of whether that information fully answers the user’s prompt.
Completeness: Measures whether the model’s output includes all necessary statements to entirely fulfill the prompt.
Example: A model may state that “The Eiffel Tower is in Paris,” which is factually correct (high Correctness) regardless of whether it is related to the user’s prompt. However, if the user’s prompt requested more information about the Eiffel Tower, or if the Eiffel Tower is off topic for the request, the response would be incomplete (low Completeness).
Why We Evaluate Correctness
The most concerning and dangerous errors made by AI occur when outputs present verifiably false information that could inform important decisions or opinions. A dedicated evaluation of the factual accuracy of every claim made in AI outputs is critical for essentially every production LLM use case. As such, we built our Correctness evaluations to efficiently and accurately identify false or misleading claims.
Evaluation Process
DeepRails performs a Multimodal Partitioned Evaluation of every model output to assess the truth of each of its claims. A few core pieces of logic ensure that the evaluation is as thorough and accurate as possible.
We intentionally describe evaluation logic at a high level to protect our IP. Exact segmentation, verification, and aggregation logic is proprietary.
Segmentation and Claim Extraction
The model output is decomposed into segments, each containing a granular factual claim. The most important claims are intelligently selected for evaluation so that minor details don’t adversely impact the evaluation of complex outputs.
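To make the idea concrete, here is a deliberately simplified sketch of claim extraction. It is not DeepRails’ proprietary segmentation logic: it merely splits an output into sentence-level candidate claims and caps how many are kept, standing in for the intelligent selection step described above.

```python
import re

def extract_claims(output_text: str, max_claims: int = 10) -> list[str]:
    """Illustrative claim extraction: split a model output into
    sentence-level segments, each treated as one candidate claim.
    (The real selection of 'most important' claims is proprietary;
    this sketch simply keeps the first max_claims sentences.)"""
    sentences = re.split(r"(?<=[.!?])\s+", output_text)
    claims = [s.strip() for s in sentences if s.strip()]
    return claims[:max_claims]
```

For example, `extract_claims("The Eiffel Tower is in Paris. It opened in 1889.")` yields two candidate claims, one per sentence.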
Fact Verification and Confidence Assessment
Each claim is reviewed for factual accuracy using model knowledge and, if necessary, outside references. A binary correctness judgment is assigned, along with a confidence rating, to avoid over-assignment of partial credit.
Aggregate Scoring
All claim judgments are weighted by their confidence rating and consolidated into a final correctness score between 0 and 1.
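One straightforward way to realize confidence-weighted aggregation (a plausible sketch, not necessarily the exact proprietary formula) is a weighted average: each binary judgment contributes its confidence as weight, and the score is the weighted fraction of claims judged true.

```python
def aggregate_correctness(judgments: list[tuple[bool, float]]) -> float:
    """Confidence-weighted aggregation of binary claim judgments.

    judgments: list of (is_correct, confidence) pairs, confidence in (0, 1].
    Returns a correctness score between 0 and 1. This is an illustrative
    weighted average, not DeepRails' exact aggregation logic.
    """
    total_weight = sum(conf for _, conf in judgments)
    if total_weight == 0:
        return 0.0  # no evaluable claims
    correct_weight = sum(conf for ok, conf in judgments if ok)
    return correct_weight / total_weight
```

Under this scheme, a confidently false claim drags the score down more than a low-confidence one, matching the intent that judgments are weighted by confidence.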
Addressing Low Correctness Scores
Improving Correctness
Pattern Analysis and Prompt Tuning: Identify patterns in incorrect outputs using DeepRails’ Monitor function or evaluation logs, then update prompts to guard against these patterns of unreliability.
Add Context and Structure: Modify model inputs to include more information on topics models may get wrong or even include a separate context field.
Model Selection: Experiment with different underlying models, as factuality varies widely across architectures and providers.
Verification Workflow: Use workflows with additional verification steps, either automated or human-in-the-loop, to review and override responses with consistently low correctness.
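A verification workflow like the one described above can be sketched as a simple threshold gate. The function name, threshold value, and routing labels here are hypothetical illustrations, not part of the DeepRails API: responses scoring below the threshold are escalated for automated or human review instead of being delivered.

```python
def route_response(response_text: str, correctness_score: float,
                   threshold: float = 0.8) -> tuple[str, str]:
    """Hypothetical routing step for a verification workflow.

    Responses at or above the (illustrative) threshold are delivered;
    the rest are escalated to a review queue (automated or
    human-in-the-loop) where they can be corrected or overridden.
    """
    if correctness_score >= threshold:
        return ("deliver", response_text)
    return ("review", response_text)
```

For instance, a response scoring 0.55 would be routed to review rather than sent to the end user.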
While Correctness provides a powerful factuality lens, it’s important to combine it with other guardrails like Context Adherence and Completeness for robust production safety.
