Evaluate is your prompt engineer’s favorite new tool! It enables quick, ad-hoc assessments of your prompts and model outputs so you can iterate faster, test smarter, and build safer AI.

Designed for pre-production, Evaluate helps you measure your prompt performance during development and iteration using DeepRails’ research-backed Guardrail metrics. Gain confidence in your model’s outputs before they ever reach your users.

Core Features

  • Prompt-Centric Evaluation
    Designed specifically for prompt engineers to evaluate prompt quality, response behavior, and edge cases.

  • High-Fidelity Guardrail Metrics
    Access research-backed Guardrail metrics for correctness, completeness, safety, and adherence. Continuously updated to match emerging best practices.

  • Model-Aware Evaluation
    DeepRails Guardrail metrics are informed by performance patterns of 100+ LLMs, enabling context-sensitive evaluations. Compare prompt and model performance side-by-side to uncover your most effective configurations.

  • Fast Feedback Loop
    Get detailed feedback on your completions instantly, without leaving your workflow.

  • API-First with Deep Console Visibility
    Trigger evaluations programmatically via API, then leverage the Krino Console to explore eval history, compare runs, and debug granular performance issues—ensuring full observability across your development workflow.

An Evaluation Run in Krino Console

The Workflow

1. Generate Completions

Use your preferred LLM and prompt combination to generate model outputs (completions).
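
For example, here is a minimal sketch of this step using the OpenAI Python SDK. The provider, model name, and prompt are placeholders; use whichever LLM and prompt combination you are iterating on.

```python
# Sketch: generate a completion with your chosen LLM.
# The provider (OpenAI), model name, and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Summarize the attached support ticket in two sentences."
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": prompt}],
)
completion = response.choices[0].message.content
```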

2. Submit Your Run to DeepRails

Log your completions and metadata (prompt, output, model) with DeepRails using our Evaluate API or Console. Specify the Guardrail metrics you want to run evaluations against, such as correctness, completeness, comprehensive safety, and groundedness.
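
As a rough illustration, the sketch below submits a single completion with an HTTP call. The endpoint path, payload fields, metric identifiers, and auth header are assumptions for illustration only, not the actual Evaluate API schema; consult the API reference for the real request format.

```python
# Sketch only: the endpoint, payload fields, metric names, and auth header are
# illustrative assumptions, not the actual Evaluate API contract.
import os
import requests

payload = {
    "prompt": prompt,            # the prompt you ran in step 1
    "completion": completion,    # the model output from step 1
    "model": "gpt-4o-mini",      # metadata about the generating model
    "guardrail_metrics": [       # Guardrail metrics to evaluate against
        "correctness",
        "completeness",
        "comprehensive_safety",
        "groundedness",
    ],
}

resp = requests.post(
    "https://api.deeprails.com/evaluate",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {os.environ['DEEPRAILS_API_KEY']}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
evaluation = resp.json()
```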

3. Analyze Results

View instant evaluations in the Krino Console. Drill into underperforming prompts, compare runs, and form hypotheses.

4. Debug, Fix, and Run Again

Iterate on your prompt or model choice, rerun your eval, and track improvements until quality targets are met.
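
One way to make "quality targets" concrete is a small threshold check over each run's per-metric scores. The 0-1 scores and thresholds below are purely illustrative placeholders, not real output; substitute the scores returned for your own run.

```python
# Sketch: compare one run's Guardrail scores against quality targets.
# Scores and thresholds are illustrative placeholders.
QUALITY_TARGETS = {"correctness": 0.90, "completeness": 0.85, "groundedness": 0.90}

run_scores = {"correctness": 0.93, "completeness": 0.81, "groundedness": 0.95}  # example values

failing = {
    metric: (score, QUALITY_TARGETS[metric])
    for metric, score in run_scores.items()
    if score < QUALITY_TARGETS[metric]
}

if failing:
    print("Iterate again; below target:", failing)  # e.g. tweak the prompt or model and rerun
else:
    print("All quality targets met")
```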

Getting Started

Quickstart