POST /evaluate
Create an evaluation
from deeprails import Deeprails

DEEPRAILS_API_KEY = "YOUR_API_KEY"

client = Deeprails(
    api_key=DEEPRAILS_API_KEY,
)

# Submit a prompt/response pair for evaluation against the selected guardrail metrics
evaluation_response = client.evaluate.create(
    model_input={
        "user_prompt": "What color is the sky?",
    },
    model_output="The sky is dark blue.",
    guardrail_metrics=["correctness", "completeness", "instruction_adherence"],
    run_mode="precision",
)

# The response includes an eval_id for tracking progress and, once complete, the results
print(evaluation_response.eval_id)
print(evaluation_response.evaluation_result)
{
  "eval_id": "<string>",
  "evaluation_status": "in_progress",
  "guardrail_metrics": [
    "correctness"
  ],
  "model_used": "<string>",
  "run_mode": "precision_plus",
  "model_input": {
    "system_prompt": "<string>",
    "user_prompt": "<string>",
    "ground_truth": "<string>"
  },
  "model_output": "<string>",
  "nametag": "<string>",
  "progress": 50,
  "evaluation_result": {},
  "evaluation_total_cost": 123,
  "created_at": "2023-11-07T05:31:56Z",
  "error_message": "<string>",
  "error_timestamp": "2023-11-07T05:31:56Z",
  "start_timestamp": "2023-11-07T05:31:56Z",
  "end_timestamp": "2023-11-07T05:31:56Z",
  "modified_at": "2023-11-07T05:31:56Z"
}
The request body must include the model_input dictionary (containing at least a system_prompt or user_prompt field), the model_output string to be evaluated, the guardrail_metrics to evaluate against, and the run_mode, which balances the speed, accuracy, and cost of the models used for the evaluation. Optionally, specify the model_used to generate the output (e.g. gpt-5-mini) and a human-readable nametag to organize and filter your evaluations.
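
For example, a request that also sets the optional fields might look like the following sketch (the parameter names mirror the request body fields documented below; the model name and nametag are placeholder values):

# Placeholder values for the optional model_used and nametag fields
evaluation_response = client.evaluate.create(
    model_input={
        "system_prompt": "You are a concise assistant.",
        "user_prompt": "What color is the sky?",
    },
    model_output="The sky is blue.",
    guardrail_metrics=["correctness", "completeness"],
    run_mode="smart",
    model_used="gpt-5-mini",
    nametag="sky-color-checks",
)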

The run mode determines which models power the evaluation:
- precision_plus - Maximum accuracy using the most advanced models
- precision - High accuracy with optimized performance
- smart - Balanced speed and accuracy (default)
- economy - Fastest evaluation at lowest cost

Available guardrail metrics include correctness, completeness, instruction_adherence, context_adherence, ground_truth_adherence, and comprehensive_safety.
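
To evaluate against ground_truth_adherence, include a ground_truth field in model_input, as noted in the field description below. A minimal sketch using the same SDK call:

# ground_truth_adherence compares the output against a reference answer,
# so model_input must include a ground_truth field
evaluation_response = client.evaluate.create(
    model_input={
        "user_prompt": "What color is the sky?",
        "ground_truth": "The sky appears blue during the day.",
    },
    model_output="The sky is blue.",
    guardrail_metrics=["ground_truth_adherence"],
    run_mode="smart",
)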

When you create an evaluation, you’ll receive an evaluation ID. Use this ID to track the evaluation’s progress and retrieve the results.
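
As a sketch of tracking progress, assuming the SDK exposes a client.evaluate.retrieve(eval_id) method (an assumption; check the endpoint for retrieving an evaluation for the exact call):

import time

# Assumption: the SDK exposes evaluate.retrieve(eval_id); verify the exact
# method name against the endpoint for retrieving an evaluation.
eval_id = evaluation_response.eval_id
while True:
    evaluation = client.evaluate.retrieve(eval_id)
    if evaluation.evaluation_status in ("completed", "failed", "canceled"):
        break
    time.sleep(2)  # progress ranges from 0 to 100
print(evaluation.evaluation_result)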

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
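
As a sketch of the raw HTTP request, assuming the API base URL is https://api.deeprails.com (an assumption; substitute your actual base URL), the Bearer header can be supplied with Python's requests library:

import requests

BASE_URL = "https://api.deeprails.com"  # assumption: replace with the actual DeepRails API host
DEEPRAILS_API_KEY = "YOUR_API_KEY"

response = requests.post(
    f"{BASE_URL}/evaluate",
    headers={
        # Bearer authentication header of the form "Bearer <token>"
        "Authorization": f"Bearer {DEEPRAILS_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model_input": {"user_prompt": "What color is the sky?"},
        "model_output": "The sky is dark blue.",
        "guardrail_metrics": ["correctness"],
        "run_mode": "smart",
    },
)
print(response.json()["eval_id"])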

Body

application/json
model_input
object
required

A dictionary of inputs sent to the LLM to generate output. The dictionary must contain at least a user_prompt or system_prompt field. For the ground_truth_adherence guardrail metric, ground_truth should also be provided.

model_output
string
required

Output generated by the LLM to be evaluated.

run_mode
enum<string>
required

Run mode for the evaluation. The run mode allows the user to optimize for speed, accuracy, and cost by determining which models are used to evaluate the event. Available run modes include precision_plus, precision, smart, and economy. Defaults to smart.

Available options:
precision_plus,
precision,
smart,
economy
model_used
string

Model ID used to generate the output, like gpt-4o or o3.

guardrail_metrics
enum<string>[]

An array of guardrail metrics that the model input and output pair will be evaluated on. For non-enterprise users, these will be limited to the allowed guardrail metrics.

nametag
string

An optional, user-defined tag for the evaluation.

Response

Evaluation created successfully

eval_id
string
required

A unique evaluation ID.

evaluation_status
enum<string>
required

Status of the evaluation.

Available options:
in_progress,
completed,
canceled,
queued,
failed
run_mode
enum<string>
required

Run mode for the evaluation. The run mode allows the user to optimize for speed, accuracy, and cost by determining which models are used to evaluate the event.

Available options:
precision_plus,
precision,
smart,
economy
model_input
object
required

A dictionary of inputs sent to the LLM to generate output. The dictionary must contain at least a user_prompt or system_prompt field. For the ground_truth_adherence guardrail metric, ground_truth should also be provided.

model_output
string
required

Output generated by the LLM to be evaluated.

guardrail_metrics
enum<string>[]

An array of guardrail metrics that the model input and output pair will be evaluated on.

model_used
string

Model ID used to generate the output, like gpt-4o or o3.

nametag
string

An optional, user-defined tag for the evaluation.

progress
integer

Evaluation progress. Values range between 0 and 100; 100 corresponds to a completed evaluation_status.

Required range: 0 <= x <= 100
evaluation_result
object

Evaluation result consisting of average scores and rationales for each of the evaluated guardrail metrics.

evaluation_total_cost
number

Total cost of the evaluation.

created_at
string<date-time>

The time the evaluation was created in UTC.

error_message
string

Description of the error causing the evaluation to fail, if any.

error_timestamp
string<date-time>

The time the error causing the evaluation to fail was recorded.

start_timestamp
string<date-time>

The time the evaluation started in UTC.

end_timestamp
string<date-time>

The time the evaluation completed in UTC.

modified_at
string<date-time>

The most recent time the evaluation was modified in UTC.
