DeepRails offers a unified suite of Guardrail metrics built to diagnose, debug, and improve the behavior of large language models. Each guardrail covers its own critical aspect of LLM output quality. Because some guardrails, like Ground Truth Adherence, won't apply to every LLM use case, the set of metrics applied in each evaluation is customizable. Each metric uses its own refined evaluation logic and delivers a continuous score with clear diagnostic feedback. The table below summarizes each DeepRails Guardrail metric, how it works, and where it's most useful in your AI workflow.
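As a rough illustration of what per-evaluation metric selection might look like, here is a minimal Python sketch. The endpoint URL, payload shape, and field names below are assumptions made for illustration, not the documented DeepRails API; consult the API reference for the real interface.

```python
# Hypothetical sketch: choosing which guardrail metrics run in one evaluation.
# The URL, payload shape, and field names are illustrative assumptions,
# not the documented DeepRails API.
import requests

API_URL = "https://api.deeprails.example/v1/evaluations"  # placeholder URL

payload = {
    "model_input": "List the known interactions for warfarin.",
    "model_output": "Warfarin interacts with aspirin, ...",
    # Select only the guardrails relevant to this use case; Ground Truth
    # Adherence is omitted here because no gold-standard answer exists.
    "guardrail_metrics": [
        "correctness",
        "completeness",
        "context_adherence",
        "comprehensive_safety",
    ],
}

response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```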

DeepRails Metric Comparison

| Name | Description | When to Use | Example Use Case |
| --- | --- | --- | --- |
| Correctness | Measures factual accuracy by evaluating whether each claim in the output is true and verifiable. | When factual integrity is critical, especially in domains like healthcare, finance, or legal. | Verifying whether a model-generated drug interaction list contains any false or fabricated claims. |
| Completeness | Assesses whether the response addresses all necessary parts of the prompt with sufficient detail and relevance. | When ensuring that all user instructions or question components are covered in the answer. | Evaluating a customer support response to check if it fully answers a multi-part troubleshooting query. |
| Instruction Adherence | Checks whether the AI followed the explicit instructions in the prompt and system directives. | When prompt compliance is important, such as tone, structure, or style guidance. | Validating that a model-generated blog post adheres to formatting rules and brand tone instructions. |
| Context Adherence | Determines whether each factual claim is directly supported by the provided context. | When grounding responses in user-provided input or retrieved documents. | Ensuring that a RAG-based assistant only uses company documentation to answer internal HR questions. |
| Ground Truth Adherence | Measures how closely the output matches a known correct answer (gold standard). | When evaluating model outputs against a trusted reference, such as in benchmarking or grading tasks. | Comparing QA outputs against annotated gold answers during LLM fine-tuning experiments. |
| Comprehensive Safety | Detects and categorizes safety violations across areas like PII, CBRN, hate speech, self-harm, and more. | When filtering or flagging unsafe, harmful, or policy-violating content in LLM outputs. | Auditing completions for PII leakage and violent content before releasing model-generated transcripts. |

The Benefits of Continuous Scoring

One of the things that sets DeepRails apart from other LLM evaluation services is the granularity of our evaluations. In any given DeepRails evaluation, each selected metric is assigned a final score that can be any decimal from 0.0 to 1.0. The one exception is Comprehensive Safety, which is scored pass/fail: any safety violation causes a failure, regardless of severity. This fine-grained analysis requires more complex evaluation prompts, but it yields much greater accuracy. DeepRails excels at identifying middling outputs, especially compared to competitors, and that accuracy lets users confidently set the specific hallucination thresholds used in our Defend service.

To confirm the efficacy of our evaluations, we compared DeepRails Correctness and Completeness evaluations against Amazon Bedrock evaluations in the same categories on sixty input/output pairs, repeated three times for each service. The results showed that DeepRails aligned much more closely with human-assigned Correctness and Completeness grades for scores between 40% and 80%.
[Figure: DeepRails vs. Bedrock Comparison Study, 40-60% Scores]

[Figure: DeepRails vs. Bedrock Comparison Study, 60-80% Scores]
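To make the scoring model above concrete, here is a minimal sketch of how an application might act on guardrail scores. Only the scoring semantics come from this page (continuous 0.0 to 1.0 per metric, pass/fail Comprehensive Safety); the result shape, field names, and the example threshold are illustrative assumptions, not a DeepRails API.

```python
# Minimal sketch of acting on guardrail scores. The result shape and the
# 0.7 threshold are illustrative assumptions; only the scoring semantics
# (continuous 0.0-1.0 per metric, pass/fail safety) come from the docs.

HALLUCINATION_THRESHOLD = 0.7  # example threshold, tuned per use case

def should_release(evaluation: dict) -> bool:
    """Return True only if an output passes all selected guardrails."""
    # Comprehensive Safety is all-or-nothing: any violation is a failure.
    if not evaluation.get("comprehensive_safety_passed", False):
        return False
    # Every other metric is a continuous score in [0.0, 1.0].
    for metric, score in evaluation.get("scores", {}).items():
        if score < HALLUCINATION_THRESHOLD:
            return False
    return True

example = {
    "comprehensive_safety_passed": True,
    "scores": {"correctness": 0.92, "completeness": 0.64},
}
print(should_release(example))  # False: completeness falls below 0.7
```

A continuous score makes this kind of per-use-case thresholding possible; with an all-or-nothing score there would be no way to distinguish a marginal 0.64 from an outright 0.1 failure.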