
Completeness measures the degree to which an AI-generated response fully addresses every necessary aspect of the user’s query with sufficient detail. Off-topic deviations and incoherence result in deductions.
Completeness is returned as a continuous score ranging from 0 to 1:
  • Low Completeness (near 0): The response misses key parts of the query or goes off topic.
  • High Completeness (near 1): The response addresses all aspects of the query thoroughly.

Understanding Completeness

The Completeness metric evaluates each model response along five key dimensions:
  • Coverage: Does the response address all parts of the user’s query?
  • Detail and Depth: Does it consistently provide sufficient elaboration?
  • Factual Correctness: Is the information given sufficiently accurate?
  • Relevance: Is the content strictly pertinent to the query?
  • Logical Coherence: Is the response clearly organized and well structured?
A separate pass/fail assessment for each dimension is conducted before the evaluation model decides a final grade for Completeness.
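
As a rough sketch of how per-dimension pass/fail checks might feed a single grade, the Python below collapses five boolean checks with a simple average. The class name, field names, and the averaging rule are illustrative assumptions, not the DeepRails implementation, which relies on an evaluation model’s judgment.

```python
# Hypothetical sketch only: per-dimension pass/fail checks feeding a final
# Completeness grade. Names and the averaging rule are illustrative, not
# the DeepRails implementation.
from dataclasses import dataclass

@dataclass
class DimensionChecks:
    coverage: bool
    detail_and_depth: bool
    factual_correctness: bool
    relevance: bool
    logical_coherence: bool

def final_completeness(checks: DimensionChecks) -> float:
    """Collapse the five pass/fail checks into a 0-1 score.
    A real evaluation model weighs the dimensions with judgment;
    this naive average is only a stand-in."""
    results = [
        checks.coverage,
        checks.detail_and_depth,
        checks.factual_correctness,
        checks.relevance,
        checks.logical_coherence,
    ]
    return sum(results) / len(results)
```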

Example Scenario

Completeness detects when a response may be technically correct but fails to satisfy the full intent of the user’s query.
User query: “Describe how a bill becomes a law in the U.S., and give an example of a recent bill.”
Model response: “A bill becomes a law after being approved by Congress and signed by the President.”
Analysis: The response is factually correct and coherent, but it omits the requested example of a recent bill and lacks procedural detail. It would fail the Coverage and Detail and Depth checks, leading to a low overall Completeness score.
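
Purely for illustration (not actual DeepRails output), the snippet below shows how the five pass/fail checks might come out for this response; the dictionary keys and the naive average are assumptions carried over from the sketch above.

```python
# Illustrative only: hypothetical pass/fail checks for the bill-to-law
# response above, not actual DeepRails output.
checks = {
    "coverage": False,           # the requested example of a recent bill is missing
    "detail_and_depth": False,   # no committee, floor-vote, or veto detail
    "factual_correctness": True,
    "relevance": True,
    "logical_coherence": True,
}
# Two failed dimensions out of five drag the final grade down.
print(sum(checks.values()) / len(checks))  # 0.6 under a naive average; a judge model would likely score lower
```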

Details of the Five Dimensions

Coverage: This is the most important dimension, since missing a sub-request is the most common completeness error models make. The evaluation model enumerates each request in the user’s prompt and verifies whether the response satisfies each one (a rough sketch of this check follows these descriptions).
Detail and Depth: LLMs sometimes produce “fluffy” sentences with little substance, and such responses cannot be considered fully satisfactory. Each sentence in the response is analyzed, and the evaluation model fails the response if any of them could have included more detail.
Factual Correctness: At first glance this dimension may seem unnecessary, since DeepRails provides a Correctness metric separate from Completeness. However, during development of the evaluation prompt, we found that evaluation models struggled to assess completeness accurately without a component confirming that the response’s claims were true.
Relevance: Sometimes an AI response can be too complete. This dimension fails if any sentences in the response are not directly related to the user’s query.
Logical Coherence: A response’s completeness is not only about its content. A finished response has a clear logical flow and uses indentation, proper spacing, and bullet points as needed; the evaluation model checks the response for this structure.
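
The sketch below shows one way a Coverage-style check could enumerate sub-requests and verify each one with a judge model. The prompt wording and the call_judge_model helper are hypothetical assumptions, not the DeepRails prompt or API.

```python
# Hypothetical sketch of a Coverage-style check: list every sub-request in
# the user prompt, then ask a judge model whether the response satisfies
# each one. `call_judge_model` is an assumed helper, not a DeepRails API.
def coverage_passes(user_prompt: str, response: str, call_judge_model) -> bool:
    enumerate_prompt = (
        "List every distinct request contained in the following prompt, "
        f"one per line:\n\n{user_prompt}"
    )
    sub_requests = call_judge_model(enumerate_prompt).splitlines()

    for request in filter(None, (r.strip() for r in sub_requests)):
        verdict = call_judge_model(
            f"Does this response satisfy the request '{request}'? "
            f"Answer PASS or FAIL.\n\nResponse:\n{response}"
        )
        if "FAIL" in verdict.upper():
            return False  # any missed sub-request fails the Coverage dimension
    return True
```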

Evaluation Process

DeepRails performs a Multimodal Partitioned Evaluation (MPE) of every model output to assess its completeness along each of the five dimensions. This flow is run on two separate evaluation models, and their evaluations are averaged to arrive at the final score. See the MPE page for more information.
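
As a minimal sketch of the averaging step, assuming each evaluator is a callable that returns a 0 to 1 Completeness score, the helper below runs every evaluator over the same prompt and response and averages the results. The function and evaluator names are placeholders, not DeepRails APIs; the actual MPE flow is documented on the MPE page.

```python
# Minimal sketch of the "two evaluators, averaged" idea described above.
# Each evaluator is assumed to be a callable returning a 0-1 score.
def averaged_completeness(prompt: str, response: str, evaluators) -> float:
    """Run each evaluator model over the same prompt/response pair and
    average their Completeness scores."""
    scores = [evaluate(prompt, response) for evaluate in evaluators]
    return sum(scores) / len(scores)

# Usage sketch with two stand-in evaluator callables:
# score = averaged_completeness(prompt, response, [judge_model_a, judge_model_b])
```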

Improving Low Completeness Scores

To improve low Completeness scores:
  • Use prompt structure effectively: Tailor your prompts to emphasize structure and completeness in the response, and append a specific validation rule for each important part of your request to the end of the prompt (see the sketch after this list).
  • Guide elaboration: Instruct the model to use examples and detail in each of your prompts to reinforce its adherence to every aspect of your request.
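
As one way to apply these suggestions, the sketch below appends an explicit validation rule for each required part of the request to the end of a prompt. The helper name and rule wording are hypothetical, not a DeepRails API.

```python
# Illustrative prompt-construction helper (hypothetical): append explicit
# validation rules for each required part of the request so the model is
# less likely to skip one.
def build_prompt(task: str, required_parts: list[str]) -> str:
    rules = "\n".join(
        f"- The response MUST include: {part}" for part in required_parts
    )
    return (
        f"{task}\n\n"
        "Before finishing, verify the response satisfies every rule below:\n"
        f"{rules}"
    )

print(build_prompt(
    "Describe how a bill becomes a law in the U.S.",
    ["each procedural step from introduction to presidential signature",
     "an example of a recent bill"],
))
```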

Completeness evaluates the robustness of your AI outputs, but remember to combine it with Correctness, Instruction Adherence, or other metrics for a more holistic review of the response contents.