Evaluate Overview
Experiment and iterate with fast, automated AI evaluations across key quality dimensions
Evaluate is your prompt engineer's favorite new tool! It enables quick, ad-hoc assessments of your prompts and model outputs so you can iterate faster, test smarter, and build safer AI.
Designed for pre-production, Evaluate helps you measure your prompt performance during development and iteration using DeepRails’ research-backed Guardrail metrics. Gain confidence in your model’s outputs before they ever reach your users.
Core Features
- Prompt-Centric Evaluation: Designed specifically for prompt engineers to evaluate prompt quality, response behavior, and edge cases.
- High-Fidelity Guardrail Metrics: Access research-backed Guardrail metrics for correctness, completeness, safety, and adherence, continuously updated to match emerging best practices.
- Model-Aware Evaluation: DeepRails Guardrail metrics are informed by performance patterns of 100+ LLMs, enabling context-sensitive evaluations. Compare prompt and model performance side by side to uncover your most effective configurations.
- Fast Feedback Loop: Get detailed feedback on your completions instantly, without leaving your workflow.
- API-First with Deep Console Visibility: Trigger evaluations programmatically via the API, then use the Krino Console to explore eval history, compare runs, and debug granular performance issues, giving you full observability across your development workflow.
An Evaluation Run in Krino Console
The Workflow
1. Generate Completions
Use your preferred LLM and prompt combination to generate model outputs (completions).
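As a concrete example, here is a minimal sketch of the generation step, assuming the OpenAI Python SDK as the completion provider; any LLM client you already use works the same way. The prompt text and model name are placeholders.

```python
# Generation step, using the OpenAI Python SDK as one example provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_completion(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Generate a single completion for the prompt under test."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

completion = generate_completion(
    "Summarize our refund policy for a customer who missed the 30-day window."
)
```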
2. Submit Your Run to DeepRails
Log your completions and metadata (prompt, output, model) with DeepRails using our Evaluate API or Console. Specify the Guardrail metrics you want to run evaluations against, such as correctness, completeness, comprehensive safety, and groundedness.
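A sketch of the submission step is below. The endpoint URL, payload fields, and auth header are illustrative assumptions rather than the documented Evaluate API contract; refer to the Evaluate API reference for the exact request shape.

```python
# Illustrative sketch of logging a completion and its metadata to DeepRails Evaluate.
# The endpoint URL, payload fields, and auth header below are assumptions for
# illustration only; check the Evaluate API reference for the exact contract.
import os
import requests

DEEPRAILS_API_KEY = os.environ["DEEPRAILS_API_KEY"]

def submit_eval_run(prompt: str, completion: str, model: str) -> dict:
    """Log one completion and request Guardrail evaluations against it."""
    payload = {
        "model": model,            # which LLM produced the completion
        "prompt": prompt,          # the prompt under test
        "completion": completion,  # the model output to evaluate
        "guardrail_metrics": [     # metrics to evaluate against
            "correctness",
            "completeness",
            "comprehensive_safety",
            "groundedness",
        ],
    }
    response = requests.post(
        "https://api.deeprails.com/v1/evaluate",  # hypothetical endpoint
        json=payload,
        headers={"Authorization": f"Bearer {DEEPRAILS_API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

Each submitted run then appears in the Krino Console, where you can drill into the per-metric results.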
3. Analyze Results
View instant evaluations in the Krino Console. Drill into underperforming prompts, compare runs, and form hypotheses.
4. Debug, Fix, & Run Again
Iterate on your prompt or model choice, rerun your eval, and track improvements until quality targets are met.
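Tying the steps together, here is a hedged sketch of what that iteration loop can look like in code. It reuses the hypothetical generate_completion and submit_eval_run helpers from the sketches above and assumes the evaluation response exposes per-metric scores on a 0-1 scale.

```python
# Illustrative iteration loop: try prompt variants until quality targets are met.
# Reuses the hypothetical generate_completion / submit_eval_run helpers above;
# the "scores" field and 0-1 scale are assumptions for illustration only.
PROMPT_VARIANTS = [
    "Summarize the refund policy.",
    "Summarize the refund policy in plain language and cite the policy sections you used.",
]
QUALITY_TARGET = 0.85  # minimum acceptable score per Guardrail metric (assumed 0-1 scale)

for prompt in PROMPT_VARIANTS:
    completion = generate_completion(prompt)
    run = submit_eval_run(prompt=prompt, completion=completion, model="gpt-4o-mini")
    scores = run.get("scores", {})  # assumed response field
    if scores and all(score >= QUALITY_TARGET for score in scores.values()):
        print(f"Quality targets met with prompt: {prompt!r}")
        break
    print(f"Below target for {prompt!r}: {scores} -- revising the prompt and rerunning")
```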