Correctness evaluates the factual accuracy of an AI model's response. It measures whether the information provided is accurate, truthful, and free from factual errors or hallucinations, independent of any specific documents or context supplied to the model.

Correctness is returned as a continuous metric ranging from 0 to 1:

0 (Low Correctness): the response contains factual inaccuracies or hallucinations.

1 (High Correctness): the response is factually accurate and free of errors.

This score reflects the probability that a response is free from factual inaccuracies. A higher score indicates greater factual accuracy and trustworthiness.


Evaluation Method

DeepRails performs a structured, multi-step factual audit on every model output. Each response is broken down into the smallest evaluable units—typically atomic factual claims—and each unit is scored independently.

1. Claim Extraction and Segmentation

The model output is decomposed into granular factual claims. Multi-part or compound statements are split into individual assertions to allow precise verification.

2. Fact Verification and Confidence Assessment

Each claim is reviewed for factual accuracy. A binary correctness judgment (Y/N) is assigned, along with a confidence rating: Low, Medium, High, or Certain.

3. Aggregate Scoring

All judgments are consolidated into a final continuous correctness score between 0 and 1, reflecting the factual reliability of the entire response.

The result is a single, user-facing score that provides a clear signal of how trustworthy a model response is. Users can choose to view this score as a float (e.g. 0.83) or a boolean (Pass/Fail), depending on their workflow needs.
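The audit steps above can be sketched as a simple aggregation. The confidence weights, helper names, and the weighted-average scheme below are illustrative assumptions, not DeepRails' actual scoring formula:

```python
from dataclasses import dataclass

# Illustrative confidence weights; the real weighting is internal to DeepRails.
CONFIDENCE_WEIGHTS = {"Low": 0.25, "Medium": 0.5, "High": 0.75, "Certain": 1.0}

@dataclass
class ClaimJudgment:
    claim: str
    correct: bool     # the binary Y/N judgment
    confidence: str   # "Low", "Medium", "High", or "Certain"

def correctness_score(judgments: list[ClaimJudgment]) -> float:
    """Consolidate per-claim judgments into a 0-1 score using a
    confidence-weighted average (a hypothetical aggregation scheme)."""
    if not judgments:
        return 1.0  # no factual claims to fault
    total = sum(CONFIDENCE_WEIGHTS[j.confidence] for j in judgments)
    credit = sum(CONFIDENCE_WEIGHTS[j.confidence] for j in judgments if j.correct)
    return credit / total

def as_pass_fail(score: float, threshold: float = 0.8) -> str:
    """View the float score as a boolean Pass/Fail at a chosen threshold."""
    return "Pass" if score >= threshold else "Fail"
```

For example, a response with one correct claim judged Certain and one incorrect claim judged High would score below the 0.8 threshold under this scheme and be reported as Fail.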


Understanding Correctness

Correctness vs. Context Adherence

It’s important to distinguish Correctness from other related metrics:

Correctness: Measures whether a model response is factually accurate, regardless of whether that information is included in the provided context.

Context Adherence: Measures whether the model’s output is derived solely from the user-supplied context or documents.

Example: A model may state that “The Eiffel Tower is in Paris,” which is factually correct (high Correctness) even if that fact is not present in the context. However, because the model draws it from outside the given context, the response may score low on Context Adherence.

Addressing Low Correctness Scores

Improving Correctness

When responses score low on correctness, consider the following steps:

Pattern Analysis: Identify patterns in hallucinated or factually incorrect outputs using DeepRails’ claim-level audit logs.

Prompt Tuning: Modify prompts to encourage grounded responses and discourage speculation.

Model Selection: Experiment with different foundation models, as factuality varies widely across architectures.

Verification Workflow: Use automated or human-in-the-loop workflows to review and override responses with low correctness confidence.

Best Practices

Enforce Fact Checks

Establish automated checks for factual accuracy against trusted sources before surfacing responses to end users.

Use Confidence Thresholds

Incorporate confidence ratings per claim to drive override rules or fallback flows.
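One way to apply per-claim confidence thresholds is a routing rule like the sketch below. The claim-record fields (`correct`, `confidence`) are hypothetical and may not match DeepRails' actual audit-log schema:

```python
# Ordered from least to most confident, matching the ratings named above.
CONFIDENCE_ORDER = ["Low", "Medium", "High", "Certain"]

def route_response(claims: list[dict], min_confidence: str = "High") -> str:
    """Route a response based on per-claim judgments: serve it,
    escalate to human review, or trigger a fallback flow.

    Each item in `claims` is assumed to carry a 'correct' (bool)
    and a 'confidence' (str) field."""
    floor = CONFIDENCE_ORDER.index(min_confidence)
    if any(not c["correct"] for c in claims):
        return "fallback"       # at least one claim judged factually wrong
    if any(CONFIDENCE_ORDER.index(c["confidence"]) < floor for c in claims):
        return "human_review"   # all correct, but some judged with low confidence
    return "serve"
```

This keeps the override logic explicit: incorrect claims always block the response, while low-confidence but correct claims are escalated rather than silently served.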

Benchmark by Domain

Measure correctness across different knowledge domains to uncover systemic gaps.

Halt on Hallucination

Use DeepRails' Correctness guardrail to prevent clearly false information from ever reaching the user.
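A minimal gate of this kind might look like the following sketch, where the score is assumed to come from whatever evaluation client you use and the threshold and fallback message are illustrative choices:

```python
FALLBACK_MESSAGE = "I couldn't verify that answer. Please consult a trusted source."

def guarded_reply(response: str, correctness: float, halt_below: float = 0.5) -> str:
    """Halt on hallucination: suppress any response whose correctness
    score falls below the halt threshold and serve a safe fallback."""
    if correctness < halt_below:
        return FALLBACK_MESSAGE
    return response
```

The key design choice is that the gate sits between the model and the user, so a low-scoring response never reaches the end user at all.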

While Correctness provides a powerful factuality lens, it’s important to combine it with other guardrails like Context Adherence and Completeness for robust production safety.