DeepRails Metric Comparison
| Name | Description | When to Use | Example Use Case |
|---|---|---|---|
| Correctness | Measures factual accuracy by evaluating whether each claim in the output is true and verifiable. | When factual integrity is critical, especially in domains like healthcare, finance, or legal. | Verifying whether a model-generated drug interaction list contains any false or fabricated claims. |
| Completeness | Assesses whether the response addresses all necessary parts of the prompt with sufficient detail and relevance. | When ensuring that all user instructions or question components are covered in the answer. | Evaluating a customer support response to check if it fully answers a multi-part troubleshooting query. |
| Instruction Adherence | Checks whether the AI followed the explicit instructions in the prompt and system directives. | When prompt compliance is important—such as tone, structure, or style guidance. | Validating that a model-generated blog post adheres to formatting rules and brand tone instructions. |
| Context Adherence | Determines whether each factual claim is directly supported by the provided context. | When grounding responses in user-provided input or retrieved documents. | Ensuring that a RAG-based assistant only uses company documentation to answer internal HR questions. |
| Ground Truth Adherence | Measures how closely the output matches a known correct answer (gold standard). | When evaluating model outputs against a trusted reference, such as in benchmarking or grading tasks. | Comparing QA outputs against annotated gold answers during LLM fine-tuning experiments. |
| Comprehensive Safety | Detects and categorizes safety violations across areas like PII, CBRN, hate speech, self-harm, and more. | When filtering or flagging unsafe, harmful, or policy-violating content in LLM outputs. | Auditing completions for PII leakage and violent content before releasing model-generated transcripts. |
The Benefits of Continuous Scoring
One of the things that sets DeepRails apart from other LLM evaluation services is the granularity of our evaluations. In any given DeepRails evaluation, each selected metric receives a final score that can fall anywhere on a continuous scale from 0.0 to 1.0. The one exception to this rule is Comprehensive Safety, which is scored on an all-or-nothing basis, since any safety violation causes a failure regardless of severity. This fine-grained analysis requires more complex evaluation prompts, but it yields much greater accuracy: DeepRails is especially effective at identifying middling outputs compared to competitors. That accuracy allows users to confidently set the specific hallucination thresholds used in our Defend service.

To confirm the efficacy of our evaluations, DeepRails Correctness and Completeness evaluations were compared against Amazon Bedrock evaluations in the same categories across sixty input/output pairs, repeated three times for each service. The results showed that DeepRails aligned much more closely with human-assigned Correctness and Completeness grades for scores ranging between 40% and 80%.
DeepRails vs. Bedrock Comparison Study: 40-60% Scores

DeepRails vs. Bedrock Comparison Study: 60-80% Scores
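The threshold-gating pattern described above can be sketched in a few lines of Python. This is an illustrative sketch, not the DeepRails SDK: the function name `passes_guardrails`, the metric keys, and the score and threshold values are all hypothetical, chosen only to show how continuous scores combine with an all-or-nothing safety score.

```python
def passes_guardrails(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Return True only if every metric meets its threshold.

    Most metrics are continuous scores in [0.0, 1.0]. Comprehensive
    Safety is all-or-nothing: any violation yields 0.0, so its
    threshold is effectively 1.0. A missing score is treated as 0.0.
    """
    return all(scores.get(metric, 0.0) >= threshold
               for metric, threshold in thresholds.items())


# Hypothetical evaluation result: continuous scores, except safety.
scores = {
    "correctness": 0.82,
    "completeness": 0.67,
    "comprehensive_safety": 1.0,  # no violations detected
}

# Hypothetical per-metric thresholds a user might configure.
thresholds = {
    "correctness": 0.75,          # tolerate minor factual slips
    "completeness": 0.60,         # partial answers acceptable
    "comprehensive_safety": 1.0,  # any violation is a hard failure
}

print(passes_guardrails(scores, thresholds))  # True: every threshold is met
```

Because the scores are continuous rather than binary, thresholds like 0.75 or 0.60 are meaningful: a user can tune them per metric to decide exactly how strict the gate should be.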
