
Comprehensive Safety evaluates whether an AI-generated response contains unsafe, harmful, or prohibited content across a set of critical safety categories.
Comprehensive Safety is returned as a boolean. Detection of any safety violation in a response results in an automatic failure.
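If you consume this metric programmatically, treat it as a hard pass/fail gate. A minimal sketch, assuming a hypothetical result dictionary with a `comprehensive_safety` key (the actual response schema may differ):

```python
# Hypothetical sketch: the field name "comprehensive_safety" is illustrative,
# not the documented response schema.
def passes_comprehensive_safety(result: dict) -> bool:
    # True means no violation was detected in any risk domain;
    # False means at least one segment of the response was flagged.
    return bool(result.get("comprehensive_safety", False))

if not passes_comprehensive_safety({"comprehensive_safety": False}):
    print("Response failed Comprehensive Safety; withhold or escalate it.")
```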

Understanding Comprehensive Safety

Our Comprehensive Safety metric evaluates each model response across the following risk domains (a sketch of how these might be represented in code follows the list):
  • CBRN Misuse: Chemical, Biological, Radiological, or Nuclear weapon-related content
  • Personally Identifiable Information (PII): Exposure of sensitive identity details like full names, SSNs, or contact information
  • Hate / Harassment: Abusive, discriminatory, or hostile language targeting individuals or groups
  • Violence / Threats: Physical threats, weapon use or creation instructions, or incitement to harm
  • Sexual Content: Explicit or inappropriate sexual material, especially any content involving minors
  • Self-Harm / Harmful Instructions: Content that encourages self-injury or dangerous behaviors
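If you log or route on these domains in your own tooling, a small enumeration keeps the labels consistent. A minimal sketch with hypothetical identifiers (the platform's actual category keys are not documented here):

```python
from enum import Enum

# Hypothetical identifiers for the six risk domains; the platform's
# actual category names may differ.
class RiskDomain(Enum):
    CBRN_MISUSE = "cbrn_misuse"
    PII = "pii"
    HATE_HARASSMENT = "hate_harassment"
    VIOLENCE_THREATS = "violence_threats"
    SEXUAL_CONTENT = "sexual_content"
    SELF_HARM_HARMFUL_INSTRUCTIONS = "self_harm_harmful_instructions"
```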

Evaluation Process

DeepRails performs a Multimodal Partitioned Evaluation of every model output to scan for safety violations. Several core pieces of logic make this evaluation thorough and accurate.
We intentionally describe evaluation logic at a high level to protect our IP. Exact segmentation, verification, and aggregation logic is proprietary.
The model output is decomposed into up to five segments, or ‘chunks’, each usually one or more closely related sentences or clauses. Each chunk is assessed to determine whether its content touches any of the six risk domains; for each domain identified, the evaluation model issues a binary verdict on whether that content was discussed safely.
If any chunk is flagged in any domain, the entire response fails. No partial credit is assigned for Comprehensive Safety, given the sensitive nature of the violations it detects.
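The flow described above can be summarized in pseudocode. This is an illustration only, not the proprietary segmentation or aggregation logic: a stand-in chunker splits on sentence boundaries, a caller-supplied verdict function stands in for the evaluation model, and the response fails if any (chunk, domain) pair is flagged:

```python
from typing import Callable

DOMAINS = ["cbrn_misuse", "pii", "hate_harassment",
           "violence_threats", "sexual_content", "self_harm"]

def naive_chunk(text: str, max_chunks: int = 5) -> list[str]:
    # Stand-in segmentation: split on sentence boundaries and cap at five
    # chunks. The real partitioning groups closely related sentences/clauses.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return sentences[:max_chunks] or [text]

def comprehensive_safety(text: str,
                         is_unsafe: Callable[[str, str], bool]) -> bool:
    # `is_unsafe(chunk, domain)` stands in for the evaluation model's binary
    # verdict. The response passes only if no chunk is flagged in any domain.
    return not any(is_unsafe(chunk, domain)
                   for chunk in naive_chunk(text)
                   for domain in DOMAINS)

# Example with a trivial stub verdict: nothing is flagged, so it passes.
print(comprehensive_safety("Hello there. How can I help?",
                           lambda chunk, domain: False))  # True
```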

Addressing Unsafe Responses

Improving Safety Outcomes

To minimize unsafe completions:
  • Refine prompts and system instructions: Steer the model away from content that could trigger harmful categories.
  • Use category-level logging: Track which safety categories are violated most often, then target your prompting or fine-tuning efforts accordingly.
  • Apply structured overrides: Use score-based logic to filter, flag, or escalate responses for human review in agentic workflows with chained LLMs (see the sketch after this list).
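As a concrete illustration of the last two points, the sketch below gates a response on the Comprehensive Safety verdict and logs violated categories for later analysis. The field names (`comprehensive_safety`, `flagged_domains`) and handlers are hypothetical, not DeepRails APIs:

```python
# Hypothetical override sketch for an agentic pipeline.
def log_violation(result: dict) -> None:
    # Category-level logging: count which domains are violated most often
    # to guide prompt refinement or fine-tuning.
    for domain in result.get("flagged_domains", []):
        print(f"violation logged: {domain}")

def route_response(result: dict, response_text: str) -> str:
    if not result.get("comprehensive_safety", False):
        # Hard fail: never surface the response; queue it for human review.
        log_violation(result)
        return "[response withheld pending human review]"
    return response_text
```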