DeepRails Metric Comparison
| Name | Description | When to Use | Example Use Case |
|---|---|---|---|
| Correctness | Measures factual accuracy by evaluating whether each claim in the output is true and verifiable. | When factual integrity is critical, especially in domains like healthcare, finance, or legal. | Verifying whether a model-generated drug interaction list contains any false or fabricated claims. |
| Completeness | Assesses whether the response addresses all necessary parts of the prompt with sufficient detail and relevance. | When ensuring that all user instructions or question components are covered in the answer. | Evaluating a customer support response to check if it fully answers a multi-part troubleshooting query. |
| Instruction Adherence | Checks whether the AI followed the explicit instructions in the prompt and system directives. | When prompt compliance is important—such as tone, structure, or style guidance. | Validating that a model-generated blog post adheres to formatting rules and brand tone instructions. |
| Context Adherence | Determines whether each factual claim is directly supported by the provided context. | When grounding responses in user-provided input or retrieved documents. | Ensuring that a RAG-based assistant only uses company documentation to answer internal HR questions. |
| Ground Truth Adherence | Measures how closely the output matches a known correct answer (gold standard). | When evaluating model outputs against a trusted reference, such as in benchmarking or grading tasks. | Comparing QA outputs against annotated gold answers during LLM fine-tuning experiments. |
| Comprehensive Safety | Detects and categorizes safety violations across areas like PII, CBRN, hate speech, self-harm, and more. | When filtering or flagging unsafe, harmful, or policy-violating content in LLM outputs. | Auditing completions for PII leakage and violent content before releasing model-generated transcripts. |
The Benefits of Continuous Scoring
One of the things that sets DeepRails apart from other LLM evaluation services is the granularity of our evaluations. In any given DeepRails evaluation, each selected metric receives a final score that can fall anywhere on a continuous scale from 0.0 to 1.0. The one exception to this rule is Comprehensive Safety, which is scored on an all-or-nothing basis, since any safety violation causes a failure regardless of severity. This fine-grained analysis requires more complex evaluation prompts, but it yields much greater accuracy: DeepRails is especially effective at identifying middling outputs compared to competitors. That accuracy allows users to confidently set the specific hallucination thresholds used in our Defend service.

To confirm the efficacy of our evaluations, DeepRails Correctness and Completeness evaluations were compared against Amazon Bedrock evaluations in the same categories across sixty input/output pairs, repeated three times for each service. The results showed that DeepRails aligned much more closely with human-assigned Correctness and Completeness grades for scores ranging between 40% and 80%.
DeepRails vs. Bedrock Comparison Study: 40-60% Scores

DeepRails vs. Bedrock Comparison Study: 60-80% Scores
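The threshold-gating pattern described above can be sketched in a few lines of Python. This is an illustrative sketch, not the DeepRails SDK: the function name `passes_guardrails`, the metric keys, and the score and threshold values are all hypothetical, chosen only to show how continuous scores combine with an all-or-nothing safety score.

```python
def passes_guardrails(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Return True only if every metric meets its threshold.

    Most metrics are continuous scores in [0.0, 1.0]. Comprehensive
    Safety is all-or-nothing: any violation yields 0.0, so its
    threshold is effectively 1.0. A missing score is treated as 0.0.
    """
    return all(scores.get(metric, 0.0) >= threshold
               for metric, threshold in thresholds.items())


# Hypothetical evaluation result: continuous scores, except safety.
scores = {
    "correctness": 0.82,
    "completeness": 0.67,
    "comprehensive_safety": 1.0,  # no violations detected
}

# Hypothetical per-metric thresholds a user might configure.
thresholds = {
    "correctness": 0.75,          # tolerate minor factual slips
    "completeness": 0.60,         # partial answers acceptable
    "comprehensive_safety": 1.0,  # any violation is a hard failure
}

print(passes_guardrails(scores, thresholds))  # True: every threshold is met
```

Because the scores are continuous rather than binary, thresholds like 0.75 or 0.60 are meaningful: a user can tune them per metric to decide exactly how strict the gate should be.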
