
Multimodal Partitioned Evaluation (MPE) is the engine that powers all DeepRails evaluations across Evaluate, Monitor, and Defend. MPE is not a separate product—it is the way our evals run under the hood for every guardrail metric.
We intentionally describe MPE at a high level to protect our IP. Exact routing, partitioning, and aggregation heuristics are proprietary.

Why we built MPE

Single-judge evaluators often miss subtle errors, inherit model-specific biases, and struggle with complex prompts. MPE solves this by breaking evaluations into smaller, checkable units, judging each unit with two different LLMs in parallel, and calibrating by confidence before producing a final, interpretable score.
  • Always two judges in parallel: Every evaluation uses two distinct LLMs to reduce bias and increase reliability.
  • Partitioned judging: Big problems are split into focused checks aligned to each guardrail metric.
  • Confidence-aware aggregation: Judges estimate their own confidence; consensus weights reflect it.
  • Reasoned evaluation: Judging prompts promote structured, stepwise reasoning for better fidelity.
  • Plan-agnostic: Available on all plans and across all APIs. Model selection follows your chosen Run Mode.

The four pillars of MPE

Partitioned Reasoning

Large input/output pairs are segmented into smaller, verifiable units per guardrail metric (e.g., claims for correctness, context checks for adherence). This reduces distraction and makes scoring traceable.
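The exact segmentation heuristics are proprietary, but the shape of the idea can be sketched. In the hypothetical helper below, partition_output stands in for the real metric-aware splitter and simply breaks a response into sentence-level claims for a correctness-style check.

```python
# Illustrative sketch only: DeepRails' actual partitioning heuristics are proprietary.
# This splits a model output into sentence-level "claims" that can each be judged
# independently instead of scoring the whole answer at once.
import re
from dataclasses import dataclass

@dataclass
class Partition:
    metric: str   # guardrail metric this unit belongs to (e.g., "correctness")
    unit: str     # the small, verifiable piece of text to judge

def partition_output(output_text: str, metric: str = "correctness") -> list[Partition]:
    # A naive sentence split stands in for the real, metric-aware segmentation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", output_text) if s.strip()]
    return [Partition(metric=metric, unit=s) for s in sentences]

claims = partition_output("The Eiffel Tower is in Paris. It was completed in 1889.")
# -> two claims, each checked separately, so scoring stays traceable per claim
```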

Dual-Model Consensus

Two different LLMs (often cross-provider) judge each unit in parallel. The output model never judges itself. This mitigates single-model bias and improves stability.
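As a rough sketch, with a made-up judge pool and a placeholder judge_with_model call standing in for real provider APIs, dual judging could be wired up as below; the actual judge selection follows your Run Mode.

```python
# Hypothetical sketch of dual-model judging; the real judge pool and providers
# are not public, and selection follows your configured Run Mode.
from concurrent.futures import ThreadPoolExecutor

JUDGE_POOL = ["provider-a/judge", "provider-b/judge", "provider-c/judge"]

def judge_with_model(model: str, unit: str) -> dict:
    # Stand-in for a real LLM call; a production judge would return its score,
    # rationale, and self-reported confidence for this unit.
    return {"model": model, "score": 1.0, "confidence": 0.5, "rationale": "placeholder"}

def pick_judges(output_model: str) -> list[str]:
    # The output model never judges itself; two distinct judges are always used.
    return [m for m in JUDGE_POOL if m != output_model][:2]

def dual_judge(unit: str, output_model: str) -> list[dict]:
    judges = pick_judges(output_model)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both judges evaluate the same unit in parallel.
        return list(pool.map(lambda m: judge_with_model(m, unit), judges))
```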

Confidence Calibration

Each judge self-reports confidence for every sub-check. MPE aggregates with confidence-aware weighting to dampen spurious votes and surface true agreement.
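One simple way to picture confidence-aware weighting is to treat each judge's self-reported confidence as the weight on its score. The function below is an illustrative sketch, not the proprietary aggregation.

```python
# Sketch of confidence-aware weighting. Each judgement carries a score in [0, 1]
# and a self-reported confidence in [0, 1]; low-confidence votes are dampened.
def weighted_consensus(judgements: list[dict]) -> float:
    total_weight = sum(j["confidence"] for j in judgements)
    if total_weight == 0:
        # Fall back to an unweighted mean if no judge reports confidence.
        return sum(j["score"] for j in judgements) / len(judgements)
    return sum(j["score"] * j["confidence"] for j in judgements) / total_weight

# A low-confidence dissenting vote barely moves the result:
print(weighted_consensus([
    {"score": 0.9, "confidence": 0.95},
    {"score": 0.2, "confidence": 0.10},
]))  # ~0.83
```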

Reasoned Judging & Prompt Scaffolds

Carefully engineered, extensively tested prompts encourage structured, chain-of-thought style reasoning. This improves faithfulness on complex, multi-step tasks.
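The production prompts are not published, but a hypothetical scaffold of the same shape shows the structure they promote: restate the unit, check it, then report a score, confidence, and rationale.

```python
# Hypothetical judging scaffold; DeepRails' engineered prompts are internal.
# This only illustrates the structured, stepwise format described above.
JUDGE_PROMPT = """You are evaluating one unit of a model response.

Unit to check: {unit}
Guardrail metric: {metric}

Reason step by step:
1. Restate what the unit claims or does.
2. Check it against the provided context and instructions.
3. Note any errors, omissions, or unsupported statements.

Then answer with:
SCORE: a number from 0 to 1
CONFIDENCE: a number from 0 to 1
RATIONALE: one or two sentences
"""
```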

How MPE runs (conceptual)

1. Partition: The request is decomposed into focused checks for each selected guardrail metric (e.g., correctness claims, instruction adherence, safety).
2. Parallel Judging: Two different LLMs evaluate each partition independently, producing a score, rationale, and self-reported confidence.
3. Confidence-Aware Consensus: MPE aggregates the per-partition results with confidence-calibrated weighting to form metric-level scores.
4. Final Scores & Rationale: Scores, rationales, and metadata are returned and visualized in the Console—ready for monitoring, auditing, and optimization.
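Putting the four steps together, a heavily simplified, self-contained sketch of one MPE run might look like the following. Every heuristic shown (the naive split, the caller-supplied judges callables, the weighting) is a stand-in for the proprietary versions described above.

```python
# End-to-end conceptual sketch of one MPE run, with simplified stand-ins for
# the proprietary partitioning, judge selection, and aggregation heuristics.
def consensus(votes: list[tuple[float, float]]) -> float:
    # votes: (score, self-reported confidence) pairs from the two judges.
    total_conf = sum(conf for _, conf in votes)
    if total_conf == 0:
        return sum(score for score, _ in votes) / len(votes)
    return sum(score * conf for score, conf in votes) / total_conf

def run_mpe(output_text: str, metric: str, judges: list) -> dict:
    # 1. Partition the output into small, checkable units (naive split here).
    units = [s.strip() for s in output_text.split(".") if s.strip()]

    per_unit = []
    for unit in units:
        # 2. Each unit is judged by two distinct models; `judges` holds callables
        #    returning (score, confidence). In practice they run in parallel.
        votes = [judge(unit, metric) for judge in judges]
        # 3. Confidence-aware consensus per unit.
        per_unit.append({"unit": unit, "score": consensus(votes)})

    # 4. Metric-level score plus per-unit detail for auditing and monitoring.
    metric_score = sum(u["score"] for u in per_unit) / len(per_unit)
    return {"metric": metric, "score": metric_score, "units": per_unit}
```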