Our evaluation methodology fundamentally differs from traditional approaches by decomposing AI responses into their atomic components and evaluating them at the claim level rather than the response level. This granular analysis surfaces the subtle errors, hallucinations, and quality issues that holistic, response-level evaluations miss.

The Challenge: From Claims to Scores

When evaluating AI outputs, our guardrail metrics follow a systematic decomposition approach:

Correctness

Breaks responses down into their smallest evaluable claims, verifying each factual assertion independently

Completeness

Evaluates each dimension (coverage, depth, accuracy, relevance, coherence) as a discrete component

Adherence

Segments responses to match against individual instructions or context elements

Safety

Analyzes each segment for specific safety category violations
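
As a rough sketch of what this decomposition yields, a single response can be pictured as a set of per-metric units. The structure and field names below are purely illustrative, not the actual schema:

# Purely illustrative: one response decomposed into per-metric units.
decomposed = {
    "correctness": [                      # atomic factual claims
        "The Earth orbits the Sun in 365 days",
        "The Moon orbits the Earth in about 27 days",
    ],
    "completeness": {                     # discrete quality dimensions
        "coverage": None, "depth": None, "accuracy": None,
        "relevance": None, "coherence": None,
    },
    "adherence": [                        # segments paired with instructions
        ("Respond in one paragraph", "response segment 1"),
    ],
    "safety": [                           # segments screened per safety category
        ("response segment 1", ["violence", "self-harm", "hate"]),
    ],
}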

This decomposition creates a challenge: How do we aggregate micro-evaluations into a single, interpretable score?

Our Scoring Methodology

1. Decomposition

Each AI response is broken into granular units:

  • Factual claims for correctness evaluation
  • Instructional segments for adherence metrics
  • Topical dimensions for completeness assessment

This aligns with research showing that fine-grained evaluation significantly improves error detection (Liu et al., 2023).
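
As a minimal sketch of the claim-extraction step, assuming a naive sentence splitter stands in for the real extractor (which is not specified here):

import re

def split_into_claims(response: str) -> list[str]:
    # Naive stand-in for claim extraction: treat each sentence as one claim.
    # The production extractor is more sophisticated; this only shows the
    # shape of the decomposition step.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if s]

claims = split_into_claims(
    "The Earth orbits the Sun in 365 days. The Moon has no substantial atmosphere."
)
# -> ['The Earth orbits the Sun in 365 days.',
#     'The Moon has no substantial atmosphere.']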

2. Individual Evaluation

For each granular unit, evaluators provide:

  • Binary verdict: Y (criterion satisfied) or N (criterion violated)
  • Confidence level: Capturing evaluator certainty
Claim: "The Earth orbits the Sun in 365 days"
Verdict: Y
Confidence: High
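
One natural way to represent each micro-evaluation is a small record pairing the unit with its verdict and confidence; the class and field names here are illustrative rather than an actual schema:

from dataclasses import dataclass
from typing import Literal

@dataclass
class ClaimEvaluation:
    # One micro-evaluation: a single unit, a binary verdict, and a confidence level.
    claim: str
    verdict: Literal["Y", "N"]
    confidence: Literal["Low", "Medium", "High", "Certain"]

example = ClaimEvaluation(
    claim="The Earth orbits the Sun in 365 days",
    verdict="Y",
    confidence="High",
)
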
3. HyperChainpoll

Multiple evaluations are run using our HyperChainpoll technique.
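
The details of HyperChainpoll are not spelled out here, but ChainPoll-style techniques poll an LLM judge multiple times on the same unit and pool the verdicts. The sketch below assumes that shape, with judge_once as a hypothetical stand-in for the actual evaluator call:

def hyper_chainpoll(claim: str, judge_once, n_polls: int = 5) -> float:
    # Assumed shape: poll the judge n_polls times on the same claim and pool
    # the binary verdicts into a fraction of "Y" answers. The real technique
    # may also vary prompts or models; this is only an illustrative sketch.
    verdicts = [judge_once(claim) for _ in range(n_polls)]
    return sum(1 for v in verdicts if v == "Y") / n_polls

# Example with a trivial stand-in judge that always answers "Y".
support = hyper_chainpoll("The Earth orbits the Sun in 365 days",
                          judge_once=lambda claim: "Y")
# -> 1.0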

4. Confidence-Weighted Aggregation

Scores are computed using our confidence weighting system:

Confidence   Weight     Interpretation
Low          0          Claim cannot be reliably evaluated
Medium       0.5-0.67   Moderate evidence available
High         1.0        Strong supporting evidence
Certain      1.0        Conclusive verification possible

Final Score = Σ(Y × weight) / Σ(all weights)

where the numerator sums the weights of units judged Y and the denominator sums the weights of all evaluated units.
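
A minimal sketch of this aggregation, mapping the confidence labels above to weights (using 0.5 as a representative value from the Medium range) and applying the formula; the function and variable names are illustrative:

# Weights taken from the table above; Medium uses 0.5 from its 0.5-0.67 range.
CONFIDENCE_WEIGHTS = {"Low": 0.0, "Medium": 0.5, "High": 1.0, "Certain": 1.0}

def aggregate_score(evaluations) -> float:
    # evaluations: iterable of (verdict, confidence) pairs, e.g. ("Y", "High").
    # Final Score = sum of weights for units judged Y / sum of all weights.
    total_weight = 0.0
    yes_weight = 0.0
    for verdict, confidence in evaluations:
        weight = CONFIDENCE_WEIGHTS[confidence]
        total_weight += weight
        if verdict == "Y":
            yes_weight += weight
    return yes_weight / total_weight if total_weight else 0.0

score = aggregate_score([("Y", "High"), ("N", "Medium"), ("Y", "Low")])
# -> 1.0 / 1.5 ≈ 0.67: the Low-confidence claim contributes nothing.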

Why This Works

Granular Precision

Research demonstrates that decomposing complex judgments into atomic decisions improves accuracy by 23-31% compared to holistic evaluation (Chen et al., 2022), and that claim-level evaluation catches 2.5x more hallucinations than response-level evaluation (Min et al., 2023). By evaluating each claim independently, we prevent single errors from distorting the full evaluation, provide precise insight into where failures occur, and empower prompt engineers to apply focused, efficient fixes.

Confidence-Aware Aggregation

Not all evaluations carry equal certainty. Our confidence weighting ensures:

  • Uncertain evaluations have minimal impact on final scores
  • High-confidence judgments drive the overall assessment
  • Ambiguous cases are transparently reflected in scores
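
Continuing the aggregation sketch above, the weighting behaves exactly this way: an uncertain verdict leaves the score unchanged, while the same verdict at high confidence moves it.

# Reusing aggregate_score from the sketch above.
base           = aggregate_score([("Y", "High"), ("Y", "High")])                 # -> 1.0
with_uncertain = aggregate_score([("Y", "High"), ("Y", "High"), ("N", "Low")])   # -> 1.0 (Low weighs 0)
with_confident = aggregate_score([("Y", "High"), ("Y", "High"), ("N", "High")])  # -> 0.67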

Academic Foundation

Our granular scoring methodology also builds on established research:

  • Claim extraction: Decomposing text into verifiable units improves fact-checking accuracy (Thorne et al., 2018)
  • Confidence calibration: Properly weighted uncertainties lead to better decisions (Ovadia et al., 2019)
  • Ensemble aggregation: Multiple evaluators reduce bias and variance (Sagi & Rokach, 2018)