Completeness measures the degree to which an AI-generated response fully addresses every necessary aspect of the user’s query, with sufficient detail. Off-topic deviations and incoherence result in deductions.
Low Completeness
The response misses key parts of the query or goes off topic.
High Completeness
The response addresses all aspects of the query thoroughly.
Understanding Completeness
The Completeness metric evaluates each model response along five key dimensions (a rough scoring sketch follows the list):
- Coverage: Does the response address all parts of the user’s query?
- Detail and Depth: Does it consistently provide sufficient elaboration?
- Factual Correctness: Is the information given sufficiently accurate?
- Relevance: Is the content strictly pertinent to the query?
- Logical Coherence: Is the response clearly organized and well structured?
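To make the rubric concrete, here is a rough sketch of how per-dimension results could be recorded and rolled up into a single Completeness score. The class name, 0-to-1 scale, and equal weighting are illustrative assumptions, not the actual DeepRails schema or weighting.

```python
# Hypothetical representation of the five dimension scores; the field names,
# 0.0-1.0 scale, and equal-weight average are assumptions for illustration only.
from dataclasses import dataclass, fields

@dataclass
class CompletenessScores:
    coverage: float             # all parts of the query addressed?
    detail_and_depth: float     # sufficient elaboration throughout?
    factual_correctness: float  # claims in the response are accurate?
    relevance: float            # every sentence pertinent to the query?
    logical_coherence: float    # clear organization and structure?

def overall_completeness(scores: CompletenessScores) -> float:
    """Combine the five dimension scores (here, a simple unweighted average)."""
    values = [getattr(scores, f.name) for f in fields(scores)]
    return sum(values) / len(values)
```

In practice the dimensions are assessed by evaluation models rather than computed directly; the example scenario below shows how uneven dimension scores drag down the overall result.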
Example Scenario
User query: “Describe how a bill becomes a law in the U.S., and give an example of a recent bill.”
Model response: “A bill becomes a law after being approved by Congress and signed by the President.”
Analysis: The content given in the response is factually correct and coherent, but it fails to mention an example and lacks procedural detail. It would score poorly on the Coverage and Detail dimensions, leading to a low overall Completeness score.
Details of the Five Dimensions
Coverage
This aspect is the most important, since missing a sub-request is the most common completeness error models make. The evaluation model enumerates each request in the user prompt and verifies whether each one was met by the model response.
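As a rough illustration of that enumerate-and-verify idea, the sketch below computes a coverage fraction from a list of sub-requests. The `extract_sub_requests` and `is_addressed` callables are hypothetical placeholders for evaluation-model judgments; they are not DeepRails functions.

```python
# Minimal coverage sketch: list the sub-requests in the prompt, then count how
# many the response actually satisfies. Both callables are hypothetical
# stand-ins for evaluation-model judgments.
from typing import Callable

def coverage_score(
    prompt: str,
    response: str,
    extract_sub_requests: Callable[[str], list[str]],  # e.g. a model that enumerates requests
    is_addressed: Callable[[str, str], bool],           # e.g. a model that checks one request
) -> float:
    """Return the fraction of enumerated sub-requests met by the response."""
    sub_requests = extract_sub_requests(prompt)
    if not sub_requests:
        return 1.0  # nothing to cover
    met = sum(1 for req in sub_requests if is_addressed(req, response))
    return met / len(sub_requests)
```

For the bill example above, the sub-requests are roughly "describe the process" and "give a recent example", so a response that omits the example covers only half of them.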
Detail and Depth
LLMs sometimes use “fluffy” sentences with little substance, and those responses cannot be considered completely satisfactory. Each sentence in the response is analyzed, and the evaluation model fails the response if any sentence could have included more detail.
Factual Correctness
At first glance, this dimension may seem unnecessary, since DeepRails provides a Correctness metric separate from Completeness. However, during development of the evaluation prompt, we found that evaluation models struggled to assess completeness accurately without a component confirming that the response’s claims were true.
Relevance
Sometimes an AI response can be too complete. This dimension of the Completeness evaluation fails if any sentences in the response are not directly related to the user’s query.
Logical Coherence
A response’s completeness is not only about its content. A complete response has a clear logical flow and uses indentation, proper spacing, and bullet points as needed, and our evaluation model checks the response for all of this structure.
Evaluation Process
DeepRails performs a Multimodal Partitioned Evaluation of every model output to assess its completeness in each of the five dimensions, as visualized below. This flow is run on two separate models, and their evaluations are averaged to arrive at a final score. See the MPE page for more information.
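As a simplified illustration of that final averaging step, the sketch below runs a completeness judge on two evaluator models and averages the results. The judge callables are hypothetical placeholders, not the actual DeepRails evaluation pipeline; the real partitioned flow is described on the MPE page.

```python
# Illustrative averaging step: run the same completeness evaluation on two
# separate evaluator models and average their scores. The judges here are
# hypothetical callables standing in for the evaluator models.
from statistics import mean
from typing import Callable

Judge = Callable[[str, str], float]  # (user prompt, model response) -> score

def averaged_completeness(prompt: str, response: str, judges: list[Judge]) -> float:
    """Average the completeness scores produced by each evaluator model."""
    return mean(judge(prompt, response) for judge in judges)
```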
Improving Low Completeness Scores
Recommended Fixes
- Use prompt structure effectively: Tailor your prompts to emphasize structure and completeness in the response, and append specific validation rules for each important part of your request to the end of the prompt (see the example after this list).
- Guide elaboration: Instruct the model to include examples and specific detail in each of your prompts to reinforce its adherence to all aspects of your request.
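For instance, a prompt for the bill-becomes-a-law query might restate each required part as an explicit validation rule appended to the end; the wording below is purely illustrative.

```python
# Hypothetical example of appending validation rules so every part of the
# request becomes an explicit, checkable requirement.
base_request = (
    "Describe how a bill becomes a law in the U.S., "
    "and give an example of a recent bill."
)
validation_rules = (
    "\n\nBefore answering, confirm the response meets every rule below:\n"
    "1. Walk through each procedural step (introduction, committee review, "
    "floor votes, presidential action) with at least a sentence of detail each.\n"
    "2. Name one specific recently enacted bill as the example.\n"
    "3. Keep every sentence directly relevant to the question."
)
prompt = base_request + validation_rules
```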
Completeness evaluates the robustness of your AI outputs, but remember to combine it with Correctness, Instruction Adherence, or other metrics for a more holistic review of the response contents.
