How DeepRails aggregates granular evaluations into actionable scores
Our evaluation methodology differs from traditional approaches by decomposing AI responses into their atomic components: we evaluate at the claim level, not the response level. This granular analysis is designed to catch the subtle errors, hallucinations, and quality issues that holistic, response-level evaluations often miss.
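To make the claim-level structure concrete, here is a minimal Python sketch. It is illustrative only, not DeepRails' implementation: the ClaimEvaluation type, the evaluate_claims function, and the toy_verify stand-in verifier are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClaimEvaluation:
    claim: str          # one atomic, independently checkable statement
    supported: bool     # whether the verifier judged the claim correct
    confidence: float   # verifier confidence in [0.0, 1.0]

def evaluate_claims(
    claims: list[str],
    verify: Callable[[str], tuple[bool, float]],
) -> list[ClaimEvaluation]:
    """Score each atomic claim independently of the others."""
    return [ClaimEvaluation(c, *verify(c)) for c in claims]

# Toy verifier standing in for a real fact-checking model (hypothetical).
def toy_verify(claim: str) -> tuple[bool, float]:
    return ("Paris" in claim, 0.9)

results = evaluate_claims(
    ["The Eiffel Tower is in Paris.", "The Eiffel Tower is in Rome."],
    toy_verify,
)
for r in results:
    print(r)
```

Because each claim is verified in isolation, a mistake in one claim cannot mask or contaminate the verdicts on the others.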
Research shows that decomposing complex judgments into atomic decisions improves accuracy by 23-31% compared to holistic evaluation (Chen et al., 2023). Evaluating each claim independently prevents a single error from distorting the full evaluation, pinpoints exactly where failures occur, and lets prompt engineers apply focused, efficient fixes.
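Continuing the hypothetical sketch above, the aggregation step might look like the following: each claim contributes equally to the score, so one failed claim lowers the score proportionally and is reported for a targeted fix instead of invalidating the entire response. Again, this is an assumed illustration, not DeepRails' production scoring.

```python
def aggregate(results: list[ClaimEvaluation]) -> dict:
    """Roll independent per-claim verdicts up into an actionable score.

    The score is the fraction of supported claims, and the failing
    claims are listed so fixes can target the exact point of failure.
    """
    if not results:
        return {"score": None, "failures": []}
    supported = sum(r.supported for r in results)
    return {
        "score": supported / len(results),
        "failures": [r.claim for r in results if not r.supported],
    }

summary = aggregate(results)
print(f"score={summary['score']:.2f}, failures={summary['failures']}")
```

Running this on the two-claim example above yields a score of 0.50 with the Rome claim flagged, rather than a single pass/fail verdict for the whole response.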
References

Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W. T., Koh, P. W., … & Hajishirzi, H. (2023). FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., … & Snoek, J. (2019). Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32.
Sagi, O., & Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1249.
Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). FEVER: A large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355.