Provide the `model_input` dictionary (containing at least a `system_prompt` or `user_prompt` field), the `model_output` string to be evaluated, the `guardrail_metrics` to evaluate against, and the `run_mode` to optimize for speed, accuracy, and cost of the models used for evaluations. Optionally, specify the `model_used` to generate the output (e.g. `gpt-5-mini`) and a human-readable `nametag` to organize and filter your evaluations.

The run mode determines which models power the evaluation:

- `precision_plus` - Maximum accuracy using the most advanced models
- `precision` - High accuracy with optimized performance
- `smart` - Balanced speed and accuracy (default)
- `economy` - Fastest evaluation at lowest cost

Available guardrail metrics include `correctness`, `completeness`, `instruction_adherence`, `context_adherence`, `ground_truth_adherence`, and `comprehensive_safety`.

When you create an evaluation, you’ll receive an evaluation ID. Use this ID to track the evaluation’s progress and retrieve the results.
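The sketch below shows what a create request might look like in Python with the `requests` library. The base URL, endpoint path, and the `id` key read from the response are assumptions for illustration; only the body field names are taken from this page.

```python
import requests

# Placeholder endpoint and token -- substitute the actual values for your account.
API_URL = "https://api.example.com/v1/evaluations"
AUTH_TOKEN = "YOUR_AUTH_TOKEN"

payload = {
    "model_input": {
        "system_prompt": "You are a concise assistant.",
        "user_prompt": "Summarize the release notes in two sentences.",
    },
    "model_output": "The release adds streaming support and fixes two login bugs.",
    "guardrail_metrics": ["correctness", "completeness", "instruction_adherence"],
    "run_mode": "smart",         # precision_plus | precision | smart (default) | economy
    "model_used": "gpt-5-mini",  # optional: model that generated the output
    "nametag": "release-notes-summaries",  # optional, for organizing evaluations
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
evaluation_id = response.json()["id"]  # key name assumed; see the Response fields below
print(f"Created evaluation {evaluation_id}")
```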
Authorizations
Bearer authentication header of the form `Bearer <token>`, where `<token>` is your auth token.
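For example, assuming your auth token is stored in `token`, the header could be constructed as:

```python
token = "YOUR_AUTH_TOKEN"  # your auth token
headers = {"Authorization": f"Bearer {token}"}
```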
Body
A dictionary of inputs sent to the LLM to generate output. The dictionary must contain at least a `user_prompt` or `system_prompt` field. For the `ground_truth_adherence` guardrail metric, `ground_truth` should also be provided.
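As a hypothetical illustration of accepted `model_input` shapes, including `ground_truth` when `ground_truth_adherence` is requested:

```python
# With only a system prompt
model_input = {"system_prompt": "You are a helpful math tutor."}

# With only a user prompt
model_input = {"user_prompt": "What is 17 * 24?"}

# With ground_truth included for the ground_truth_adherence metric
model_input = {
    "user_prompt": "What is 17 * 24?",
    "ground_truth": "408",
}
```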
Output generated by the LLM to be evaluated.
Run mode for the evaluation. The run mode allows the user to optimize for speed, accuracy, and cost by determining which models are used to evaluate the event. Available run modes include `precision_plus`, `precision`, `smart`, and `economy`. Defaults to `smart`.
`precision_plus`, `precision`, `smart`, `economy`
Model ID used to generate the output, like `gpt-4o` or `o3`.
An array of guardrail metrics that the model input and output pair will be evaluated on. For non-enterprise users, these will be limited to the allowed guardrail metrics.
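For reference, the full set of metrics named on this page could be requested as follows (a sketch; non-enterprise accounts may be limited to a subset):

```python
guardrail_metrics = [
    "correctness",
    "completeness",
    "instruction_adherence",
    "context_adherence",
    "ground_truth_adherence",  # requires ground_truth in model_input
    "comprehensive_safety",
]
```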
An optional, user-defined tag for the evaluation.
Response
Evaluation created successfully
A unique evaluation ID.
Status of the evaluation.
`in_progress`, `completed`, `canceled`, `queued`, `failed`
Run mode for the evaluation. The run mode allows the user to optimize for speed, accuracy, and cost by determining which models are used to evaluate the event.
`precision_plus`, `precision`, `smart`, `economy`
A dictionary of inputs sent to the LLM to generate output. The dictionary must contain at least a `user_prompt` or `system_prompt` field. For the `ground_truth_adherence` guardrail metric, `ground_truth` should also be provided.
Output generated by the LLM to be evaluated.
An array of guardrail metrics that the model input and output pair will be evaluated on.
Model ID used to generate the output, like `gpt-4o` or `o3`.
An optional, user-defined tag for the evaluation.
Evaluation progress. Values range between 0 and 100; 100 corresponds to a completed `evaluation_status`.
0 <= x <= 100
Evaluation result consisting of average scores and rationales for each of the evaluated guardrail metrics.
Total cost of the evaluation.
The time the evaluation was created in UTC.
Description of the error causing the evaluation to fail, if any.
The time the error causing the evaluation to fail was recorded.
The time the evaluation started in UTC.
The time the evaluation completed in UTC.
The most recent time the evaluation was modified in UTC.
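A minimal polling sketch using the response fields above is shown below. The retrieval endpoint path and the `progress` and `result` key names are assumptions for illustration; the `evaluation_status` values and the 0-100 progress range come from this page.

```python
import time

import requests

evaluation_id = "your-evaluation-id"  # ID returned when the evaluation was created
AUTH_TOKEN = "YOUR_AUTH_TOKEN"
STATUS_URL = f"https://api.example.com/v1/evaluations/{evaluation_id}"  # hypothetical path

while True:
    evaluation = requests.get(
        STATUS_URL,
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        timeout=30,
    ).json()
    status = evaluation["evaluation_status"]  # in_progress, completed, canceled, queued, failed
    print(f"{status}: {evaluation.get('progress', 0)}% complete")
    if status in ("completed", "canceled", "failed"):
        break
    time.sleep(5)

if status == "completed":
    # Average score and rationale per evaluated guardrail metric (key name assumed)
    for metric, result in evaluation.get("result", {}).items():
        print(metric, result)
```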