We intentionally describe MPE at a high level to protect our IP. Exact routing, partitioning, and aggregation heuristics are proprietary.
What is MPE?
Multimodel Partitioned Evaluation (MPE) is an advanced member of a larger class of AI evaluations known as LLM-as-a-Judge (LLMJ): MPE uses several other models to evaluate the output of an LLM. It stands out from other forms of LLMJ by combining partitioned judging, dual-model consensus, and confidence calibration. Because LLMs perform better on more granular tasks, MPE breaks evaluations down in several ways: partitioning the input into smaller parts, evaluating in parallel with two or more models, and assigning confidence values rather than having models internally aggregate scores.

Why LLM-as-a-Judge Is Effective
The LLM-as-a-Judge evaluation approach combines the nuanced understanding of human evaluators with the scalability and consistency of automated systems. By leveraging the reasoning capabilities of LLMs to evaluate outputs across a range of tasks and quality dimensions, it allows for holistic assessments that align closely with human judgment. LLMJ systems can also process vast amounts of data quickly, providing the timely feedback that is crucial for iterative development.

“[Human graded evals are] often expensive or not always practical”
— Shyamal Hitesh Anadkat, Applied AI Engineer, OpenAI

The adoption of the LLMJ approach by leading AI labs underscores its effectiveness. OpenAI employs its most advanced models to evaluate the outputs of new models, guiding release decisions and performance benchmarks. Similarly, Anthropic integrates judge-style evaluations as a “pillar of safe scaling” and actively supports an external ecosystem developing LLMJ tools and protocols.
Why We Built MPE
Single-judge evaluators often miss subtle errors, inherit model-specific biases, and struggle with complex prompts. MPE solves this by breaking evaluations into smaller, checkable units, judging each unit with two different LLMs in parallel, and calibrating by confidence before producing a final, interpretable score.

- Always two judges in parallel: Every evaluation uses two distinct LLMs to reduce bias and increase reliability.
- Partitioned judging: Big problems are split into focused checks aligned to each guardrail metric.
- Confidence-aware aggregation: Judges estimate their own confidence and total scores are weighted accordingly.
- Reasoned evaluation: Judging prompts promote structured, stepwise reasoning for better fidelity.
- Repetition: Evaluations are run several times per model and averaged to minimize the impact of evaluation hallucinations.
- Plan-agnostic: Available on all plans and across all APIs. Model selection follows your chosen Run Mode.
The four pillars of MPE
Partitioned Reasoning
Large input/output pairs are segmented into smaller, verifiable units per guardrail metric (e.g., claims for correctness, context checks for adherence). This reduces distraction and makes scoring traceable.
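As a rough illustration of what partitioning can look like in code (the real segmentation heuristics are proprietary and more sophisticated), the sketch below splits an answer into sentence-level claims for a correctness check and builds a single context-vs-answer unit for adherence. The `CheckUnit` class and `partition_output` function are hypothetical names introduced only for this example.

```python
import re
from dataclasses import dataclass

@dataclass
class CheckUnit:
    metric: str   # guardrail metric this unit belongs to
    content: str  # the small, verifiable piece a judge will score

def partition_output(output_text: str, context: str) -> list[CheckUnit]:
    """Toy stand-in for a partitioning step; MPE's real heuristics are proprietary."""
    units = []
    # Correctness: treat each sentence as a candidate claim to verify.
    for sentence in re.split(r"(?<=[.!?])\s+", output_text.strip()):
        if sentence:
            units.append(CheckUnit(metric="correctness", content=sentence))
    # Context adherence: one unit pairing the full answer with its context.
    units.append(CheckUnit(
        metric="context_adherence",
        content=f"Context:\n{context}\n\nAnswer:\n{output_text}",
    ))
    return units

for unit in partition_output(
    output_text="The Eiffel Tower is in Paris. It was completed in 1889.",
    context="A short article about Parisian landmarks.",
):
    print(unit.metric, "->", unit.content[:40])
```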
Dual-Model Consensus
Two different LLMs (often cross-provider) judge each unit in parallel. The output model never judges itself. This mitigates single-model bias and improves stability.
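A minimal sketch of dual-model judging under simplifying assumptions: `call_judge` is a hypothetical stand-in for a real LLM API call, and the judge pool and selection logic are illustrative rather than MPE's actual routing. It shows the two constraints named above: two distinct judges, and never the model that produced the output.

```python
from concurrent.futures import ThreadPoolExecutor

def call_judge(judge_model: str, unit_text: str) -> dict:
    """Hypothetical placeholder: a real system would call an LLM API here
    and parse its structured verdict."""
    return {"model": judge_model, "score": 1.0, "confidence": 0.9,
            "rationale": f"{judge_model} checked: {unit_text[:30]}..."}

def pick_judges(output_model: str, judge_pool: list[str]) -> list[str]:
    # The model under evaluation never judges itself.
    return [m for m in judge_pool if m != output_model][:2]

def judge_unit(unit_text: str, output_model: str, judge_pool: list[str]) -> list[dict]:
    judges = pick_judges(output_model, judge_pool)
    # Both judges evaluate the same unit independently and in parallel.
    with ThreadPoolExecutor(max_workers=len(judges)) as executor:
        futures = [executor.submit(call_judge, judge, unit_text) for judge in judges]
        return [f.result() for f in futures]

print(judge_unit("The Eiffel Tower is in Paris.",
                 output_model="model-a",
                 judge_pool=["model-a", "model-b", "model-c"]))
```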
Confidence Calibration
Each judge self-reports confidence for every sub-check. MPE aggregates with confidence-aware weighting so the final score reflects how certain the judges were, not just what they decided.
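One simple way to picture confidence-aware weighting is a confidence-weighted average, sketched below. This is illustrative only; MPE's actual calibration and aggregation heuristics are proprietary and need not reduce to a plain weighted mean.

```python
def confidence_weighted_score(verdicts: list[dict]) -> float:
    """Combine sub-check verdicts, weighting each score by the judge's
    self-reported confidence (illustrative only)."""
    total_weight = sum(v["confidence"] for v in verdicts)
    if total_weight == 0:
        return 0.0
    return sum(v["score"] * v["confidence"] for v in verdicts) / total_weight

print(confidence_weighted_score([
    {"score": 1.0, "confidence": 0.9},  # confident pass
    {"score": 0.0, "confidence": 0.3},  # hesitant fail
]))
```

Here the confident pass outweighs the hesitant fail, so the combined score is 0.75 rather than the unweighted mean of 0.5.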
Reasoned Judging & Parallel Runs
Prompts that encourage structured, chain-of-thought style reasoning are run multiple times. This improves faithfulness on complex, multi-step tasks and minimizes noise from hallucinations.
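The repetition half of this pillar can be pictured as running the same judging prompt several times and averaging the verdicts, as in the hedged sketch below. `noisy_judge` is a hypothetical placeholder for a structured LLM judging call; MPE's real run counts and aggregation rules are proprietary.

```python
import random
import statistics

def noisy_judge(unit_text: str) -> dict:
    """Hypothetical judge that occasionally returns an outlier verdict,
    standing in for an LLM judging call."""
    return {"score": random.choice([1.0, 1.0, 1.0, 0.0]),
            "confidence": random.uniform(0.6, 0.95)}

def repeated_judgement(judge_fn, unit_text: str, runs: int = 3) -> dict:
    """Run the same judging prompt several times and average the results
    to damp one-off hallucinated verdicts."""
    results = [judge_fn(unit_text) for _ in range(runs)]
    return {"score": statistics.mean(r["score"] for r in results),
            "confidence": statistics.mean(r["confidence"] for r in results)}

print(repeated_judgement(noisy_judge, "The Eiffel Tower is in Paris."))
```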
How MPE runs (conceptual)
1. Partition: The request is decomposed into focused checks per selected guardrail metrics (e.g., correctness claims, instruction adherence, safety).
2. Parallel Judging: Two different LLMs evaluate each partition independently, producing a score, rationale, and self-reported confidence.
3. Confidence-Aware Consensus: MPE aggregates the per-partition results with confidence-calibrated weighting to form metric-level scores.
4. Final Scores & Rationale: Scores, rationales, and metadata are returned and visualized in the Console, ready for monitoring, auditing, and optimization.
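To tie the four steps together, here is a compact, self-contained sketch of the conceptual flow under toy assumptions: `call_judge` stands in for an LLM API call, partitioning is reduced to naive sentence splitting, and consensus is a plain confidence-weighted average. None of this reflects MPE's proprietary routing, partitioning, or aggregation heuristics; it only mirrors the shape of the four steps above.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def call_judge(judge_model: str, metric: str, unit: str) -> dict:
    """Hypothetical placeholder for a structured LLM judging call."""
    return {"score": 1.0, "confidence": 0.85}

def evaluate(output_text: str, metrics: list[str],
             output_model: str, judge_pool: list[str]) -> dict:
    # 1. Partition: one toy check per sentence, per metric (naive stand-in heuristic).
    units = [(metric, sentence.strip())
             for metric in metrics
             for sentence in output_text.split(".") if sentence.strip()]
    # 2. Parallel judging: two judges per unit, never the model being evaluated.
    judges = [m for m in judge_pool if m != output_model][:2]
    with ThreadPoolExecutor() as executor:
        futures = {(metric, unit, judge): executor.submit(call_judge, judge, metric, unit)
                   for metric, unit in units for judge in judges}
        verdicts = {key: f.result() for key, f in futures.items()}
    # 3. Confidence-aware consensus per metric.
    totals = defaultdict(lambda: [0.0, 0.0])  # metric -> [weighted score sum, weight sum]
    for (metric, _unit, _judge), verdict in verdicts.items():
        totals[metric][0] += verdict["score"] * verdict["confidence"]
        totals[metric][1] += verdict["confidence"]
    # 4. Final metric-level scores (rationales and metadata omitted for brevity).
    return {metric: (weighted_sum / weight if weight else 0.0)
            for metric, (weighted_sum, weight) in totals.items()}

print(evaluate("Paris is in France. It uses the euro.",
               metrics=["correctness", "context_adherence"],
               output_model="model-a",
               judge_pool=["model-a", "model-b", "model-c"]))
```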
