Scoring Methodology
How DeepRails aggregates granular evaluations into actionable scores
Our evaluation methodology differs fundamentally from traditional approaches: instead of judging a response as a whole, we decompose it into its atomic components and evaluate at the claim level. This granular analysis catches subtle errors, hallucinations, and quality issues that holistic, response-level evaluations miss.
The Challenge: From Claims to Scores
When evaluating AI outputs, our guardrail metrics follow a systematic decomposition approach:
Correctness
Breaks responses down into their smallest evaluable claims, verifying each factual assertion independently
Completeness
Evaluates each dimension (coverage, depth, accuracy, relevance, coherence) as discrete components
Adherence
Segments responses to match against individual instructions or context elements
Safety
Analyzes each segment for specific safety category violations
This decomposition creates a challenge: How do we aggregate micro-evaluations into a single, interpretable score?
Our Scoring Methodology
Decomposition
Each AI response is broken into granular units:
- Factual claims for correctness evaluation
- Instructional segments for adherence metrics
- Topical dimensions for completeness assessment
This aligns with research showing that fine-grained evaluation significantly improves error detection (Liu et al., 2023).
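A minimal sketch of what this decomposition might look like in practice is shown below. The `Claim` dataclass and `extract_claims` helper are illustrative assumptions, not the DeepRails pipeline; a production system would use a model to split compound sentences into single assertions rather than naive sentence splitting.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """One atomic, independently verifiable unit of a response."""
    text: str
    metric: str  # which guardrail metric this unit feeds, e.g. "correctness"

def extract_claims(response: str) -> list[Claim]:
    """Hypothetical decomposition step.

    Naive sentence splitting stands in for the real extraction model here,
    purely so the sketch is runnable.
    """
    return [
        Claim(text=sentence.strip(), metric="correctness")
        for sentence in response.split(".")
        if sentence.strip()
    ]

claims = extract_claims("The Eiffel Tower is in Paris. It was completed in 1887.")
# -> two Claim objects, each verified independently in the next step
```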
Individual Evaluation
For each granular unit, evaluators provide:
- Binary verdict: Y (criterion satisfied) or N (criterion violated)
- Confidence level: Capturing evaluator certainty
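One way to represent these per-unit judgments in code is sketched here; the class and field names are assumptions for illustration, not the DeepRails API.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ClaimVerdict:
    """Judgment for a single granular unit."""
    claim: str
    verdict: Literal["Y", "N"]  # Y = criterion satisfied, N = criterion violated
    confidence: Literal["low", "medium", "high", "certain"]

verdicts = [
    ClaimVerdict("The Eiffel Tower is in Paris.", "Y", "certain"),
    ClaimVerdict("It was completed in 1887.", "N", "high"),
]
```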
HyperChainpoll
Multiple evaluations are run using our HyperChainpoll technique.
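HyperChainpoll's internals are not documented here, but the general polling idea, running several independent evaluation passes over the same claim and combining their verdicts, can be sketched as follows. The `evaluate_once` function is a stand-in for a single evaluator call, and the agreement-rate confidence proxy is an assumption for illustration.

```python
import random
from collections import Counter

def evaluate_once(claim: str) -> str:
    """Stand-in for a single evaluator pass returning "Y" or "N".

    In a real system this would be one LLM judgment with chain-of-thought
    reasoning; a biased coin flip is used here only so the sketch runs.
    """
    return "Y" if random.random() < 0.8 else "N"

def poll_claim(claim: str, n_passes: int = 5) -> tuple[str, float]:
    """Run several independent evaluations of the same claim and return the
    majority verdict plus the agreement rate, a simple confidence proxy."""
    votes = Counter(evaluate_once(claim) for _ in range(n_passes))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / n_passes

verdict, agreement = poll_claim("The Eiffel Tower is in Paris.")
# e.g. ("Y", 1.0) -- unanimous agreement across passes signals high confidence
```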
Confidence-Weighted Aggregation
Scores are computed using our confidence weighting system:
| Confidence | Weight | Interpretation |
|---|---|---|
| Low | 0 | Claim cannot be reliably evaluated |
| Medium | 0.5-0.67 | Moderate evidence available |
| High | 1.0 | Strong supporting evidence |
| Certain | 1.0 | Conclusive verification possible |
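Putting the table above into code, a simplified aggregation might look like this. The weights come directly from the table; everything else, including picking 0.6 as a representative value from the documented 0.5-0.67 Medium range and treating the score as the weighted share of satisfied claims, is an assumption made for the sketch.

```python
# Weights from the confidence table above; Medium is taken as 0.6, a
# representative point in the documented 0.5-0.67 range (an assumption).
CONFIDENCE_WEIGHTS = {"low": 0.0, "medium": 0.6, "high": 1.0, "certain": 1.0}

def aggregate_score(verdicts: list[tuple[str, str]]) -> float:
    """Confidence-weighted aggregation of (verdict, confidence) pairs.

    One plausible formulation: the score is the weighted share of satisfied
    ("Y") claims, where each claim contributes in proportion to its
    confidence weight. Low-confidence claims (weight 0) drop out entirely.
    """
    total_weight = sum(CONFIDENCE_WEIGHTS[c] for _, c in verdicts)
    if total_weight == 0:
        return 0.0  # nothing could be reliably evaluated
    satisfied = sum(CONFIDENCE_WEIGHTS[c] for v, c in verdicts if v == "Y")
    return satisfied / total_weight

score = aggregate_score([
    ("Y", "certain"),  # verified claim, full weight
    ("N", "high"),     # refuted claim, full weight
    ("Y", "low"),      # unverifiable claim, excluded by its zero weight
])
# -> 0.5: one supported and one refuted claim; the unverifiable one is ignored
```

Note how the zero-weighted claim has no effect on the result, which is exactly the behavior described in the next section.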
Why This Works
Granular Precision
Research demonstrates that decomposing complex judgments into atomic decisions improves accuracy by 23-31% compared to holistic evaluation (Chen et al., 2022), and that claim-level evaluation catches 2.5x more hallucinations than response-level evaluation (Min et al., 2023). By evaluating each claim independently, we prevent single errors from distorting the full evaluation, pinpoint exactly where failures occur, and empower prompt engineers to apply focused, efficient fixes.
Confidence-Aware Aggregation
Not all evaluations carry equal certainty. Our confidence weighting ensures:
- Uncertain evaluations have minimal impact on final scores
- High-confidence judgments drive the overall assessment
- Ambiguous cases are transparently reflected in scores
Academic Foundation
Our granular scoring methodology also builds on established research:
- Claim extraction: Decomposing text into verifiable units improves fact-checking accuracy (Thorne et al., 2018)
- Confidence calibration: Properly weighted uncertainties lead to better decisions (Ovadia et al., 2019)
- Ensemble aggregation: Multiple evaluators reduce bias and variance (Sagi & Rokach, 2018)