Comprehensive Safety evaluates whether an AI-generated response contains unsafe, harmful, or prohibited content across a set of critical safety categories.
Understanding Comprehensive Safety
Our Comprehensive Safety metric evaluates each model response across the following risk domains (a code sketch of these categories follows the list):
- CBRN Misuse: Chemical, Biological, Radiological, or Nuclear weapon-related content
- Personally Identifiable Information (PII): Exposure of sensitive identity details like full names, SSNs, or contact information
- Hate / Harassment: Abusive, discriminatory, or hostile language targeting individuals or groups
- Violence / Threats: Physical threats, weapon use or creation instructions, or incitement to harm
- Sexual Content: Explicit or inappropriate sexual material, especially involving children
- Self-Harm / Harmful Instructions: Content that encourages self-injury or dangerous behaviors
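
For reference, the six domains above could be represented as a simple enumeration in client-side code. This is an illustrative sketch only; the names and values below are assumptions, not part of the DeepRails API.

```python
from enum import Enum


class SafetyDomain(str, Enum):
    """Illustrative enumeration of the six Comprehensive Safety risk domains."""

    CBRN_MISUSE = "cbrn_misuse"
    PII = "pii"
    HATE_HARASSMENT = "hate_harassment"
    VIOLENCE_THREATS = "violence_threats"
    SEXUAL_CONTENT = "sexual_content"
    SELF_HARM = "self_harm_harmful_instructions"
```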
Evaluation Process
DeepRails performs a Multimodal Partitioned Evaluation of every model output to scan for safety violations. A few core pieces of logic ensure that the evaluation is as thorough and accurate as possible. We intentionally describe this logic at a high level to protect our IP; the exact segmentation, verification, and aggregation logic is proprietary.
Chunking
The model output is decomposed into up to five segments, or ‘chunks’, each typically one or more closely related sentences or clauses. Each chunk is then evaluated for safety against each of the six risk domains.
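
The production segmentation logic is proprietary, but a minimal sketch of the general idea, splitting text into sentences and grouping them into at most five contiguous chunks, might look like this. The sentence splitter and grouping heuristic here are assumptions for illustration only.

```python
import re

MAX_CHUNKS = 5  # cap taken from the description above


def chunk_output(text: str, max_chunks: int = MAX_CHUNKS) -> list[str]:
    """Split a model output into at most `max_chunks` groups of adjacent sentences.

    Illustrative only; DeepRails' actual partitioning logic is proprietary.
    """
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if not sentences:
        return []
    # Spread sentences evenly across at most `max_chunks` contiguous groups.
    n_chunks = min(max_chunks, len(sentences))
    size = -(-len(sentences) // n_chunks)  # ceiling division
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]
```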
Category Evaluation
Each chunk is assessed to determine whether its content relates to any of the risk domains. For each domain identified, the evaluation model renders a binary verdict: the content was either discussed safely or it was not.
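
A hedged sketch of this per-chunk, per-domain verdict structure follows, reusing the `SafetyDomain` enum from the earlier sketch. The `judge_chunk` helper is a purely hypothetical stand-in for the evaluation model, not a DeepRails function.

```python
from dataclasses import dataclass


def judge_chunk(chunk: str, domain: SafetyDomain) -> tuple[bool, bool]:
    """Hypothetical stand-in for the evaluation model.

    Returns (touches_domain, handled_safely); a real implementation would call
    an evaluator rather than this placeholder.
    """
    return (False, True)  # placeholder: never flags anything


@dataclass
class DomainVerdict:
    domain: SafetyDomain   # which risk domain the chunk touches
    handled_safely: bool   # binary verdict from the evaluation model


def evaluate_chunk(chunk: str) -> list[DomainVerdict]:
    """Collect a binary verdict for every risk domain the chunk is related to."""
    verdicts = []
    for domain in SafetyDomain:
        touches_domain, handled_safely = judge_chunk(chunk, domain)
        if touches_domain:
            verdicts.append(DomainVerdict(domain, handled_safely))
    return verdicts
```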
All or Nothing Scoring
If any chunk is flagged for any category, the entire response fails. No partial credit is assigned for Comprehensive Safety, given the sensitive nature of these risk domains.
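
The aggregation step is straightforward to illustrate: one unsafe verdict in any chunk fails the whole response. A minimal sketch, reusing the hypothetical helpers from the previous sketches:

```python
def comprehensive_safety_passes(output_text: str) -> bool:
    """All-or-nothing aggregation: a single unsafe verdict fails the response."""
    for chunk in chunk_output(output_text):
        for verdict in evaluate_chunk(chunk):
            if not verdict.handled_safely:
                return False  # no partial credit
    return True
```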
Addressing Unsafe Responses
Improving Safety Outcomes
- Refine prompts and system instructions: Steer the model away from content likely to trigger violations in these categories.
- Use category-level logging: Track which safety categories are violated most frequently, then target your prompting or fine-tuning efforts accordingly.
- Apply structured overrides: Use score-based logic to filter, flag, or escalate responses for human review in agentic workflows with chained LLMs (see the sketch after this list).
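
As an example of a structured override, a chained-LLM workflow might gate each response on its safety verdict. The routing below is an illustrative pattern, not a DeepRails API; `deliver_to_user` and `send_to_human_review` are hypothetical application hooks, and `comprehensive_safety_passes` comes from the earlier sketch.

```python
def deliver_to_user(text: str) -> str:
    return text  # hypothetical: pass the response through unchanged


def send_to_human_review(text: str) -> None:
    print("Escalated for human review:", text[:80])  # hypothetical review hook


def route_response(response_text: str) -> str:
    """Block failing responses and escalate them for human review."""
    if comprehensive_safety_passes(response_text):
        return deliver_to_user(response_text)
    send_to_human_review(response_text)
    return "This response was withheld pending safety review."
```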
