Provide the `model_input` dictionary (containing at least a `system_prompt` or `user_prompt` field), the `model_output` string to be evaluated, the `guardrail_metrics` to evaluate against, and the `run_mode` to optimize for speed, accuracy, and cost of the models used for evaluations. Optionally, specify the `model_used` to generate the output (e.g. `gpt-5-mini`) and a human-readable `nametag` to organize and filter your evaluations.

The run mode determines which models power the evaluation:

- `precision_plus` - Maximum accuracy using the most advanced models
- `precision` - High accuracy with optimized performance
- `smart` - Balanced speed and accuracy (default)
- `economy` - Fastest evaluation at lowest cost

Available guardrail metrics include `correctness`, `completeness`, `instruction_adherence`, `context_adherence`, `ground_truth_adherence`, and `comprehensive_safety`.

When you create an evaluation, you’ll receive an evaluation ID. Use this ID to track the evaluation’s progress and retrieve the results.
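The sketch below shows what a create request might look like in Python with the `requests` library. The base URL, endpoint path, and the `id` key read from the response are assumptions for illustration; only the body field names are taken from this page.

```python
import requests

# Placeholder endpoint and token -- substitute the actual values for your account.
API_URL = "https://api.example.com/v1/evaluations"
AUTH_TOKEN = "YOUR_AUTH_TOKEN"

payload = {
    "model_input": {
        "system_prompt": "You are a concise assistant.",
        "user_prompt": "Summarize the release notes in two sentences.",
    },
    "model_output": "The release adds streaming support and fixes two login bugs.",
    "guardrail_metrics": ["correctness", "completeness", "instruction_adherence"],
    "run_mode": "smart",         # precision_plus | precision | smart (default) | economy
    "model_used": "gpt-5-mini",  # optional: model that generated the output
    "nametag": "release-notes-summaries",  # optional, for organizing evaluations
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
evaluation_id = response.json()["id"]  # key name assumed; see the Response fields below
print(f"Created evaluation {evaluation_id}")
```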
Authorizations
Bearer authentication header of the form `Bearer <token>`, where `<token>` is your auth token.
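For example, assuming your auth token is stored in `token`, the header could be constructed as:

```python
token = "YOUR_AUTH_TOKEN"  # your auth token
headers = {"Authorization": f"Bearer {token}"}
```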
Body
A dictionary of inputs sent to the LLM to generate output. The dictionary must contain at least a `user_prompt` or `system_prompt` field. For the `ground_truth_adherence` guardrail metric, `ground_truth` should also be provided.
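As a hypothetical illustration of accepted `model_input` shapes, including `ground_truth` when `ground_truth_adherence` is requested:

```python
# With only a system prompt
model_input = {"system_prompt": "You are a helpful math tutor."}

# With only a user prompt
model_input = {"user_prompt": "What is 17 * 24?"}

# With ground_truth included for the ground_truth_adherence metric
model_input = {
    "user_prompt": "What is 17 * 24?",
    "ground_truth": "408",
}
```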
Output generated by the LLM to be evaluated.
Run mode for the evaluation. The run mode allows the user to optimize for speed, accuracy, and cost by determining which models are used to evaluate the event. Available run modes include `precision_plus`, `precision`, `smart`, and `economy`. Defaults to `smart`.
`precision_plus`, `precision`, `smart`, `economy`
Model ID used to generate the output, like `gpt-4o` or `o3`.
An array of guardrail metrics that the model input and output pair will be evaluated on. For non-enterprise users, these will be limited to the allowed guardrail metrics.
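For reference, the full set of metrics named on this page could be requested as follows (a sketch; non-enterprise accounts may be limited to a subset):

```python
guardrail_metrics = [
    "correctness",
    "completeness",
    "instruction_adherence",
    "context_adherence",
    "ground_truth_adherence",  # requires ground_truth in model_input
    "comprehensive_safety",
]
```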
An optional, user-defined tag for the evaluation.
Response
Evaluation created successfully
A unique evaluation ID.
Status of the evaluation.
`in_progress`, `completed`, `canceled`, `queued`, `failed`
Run mode for the evaluation. The run mode allows the user to optimize for speed, accuracy, and cost by determining which models are used to evaluate the event.
`precision_plus`, `precision`, `smart`, `economy`
A dictionary of inputs sent to the LLM to generate output. The dictionary must contain at least a `user_prompt` or `system_prompt` field. For the `ground_truth_adherence` guardrail metric, `ground_truth` should also be provided.
Output generated by the LLM to be evaluated.
An array of guardrail metrics that the model input and output pair will be evaluated on.
Model ID used to generate the output, like `gpt-4o` or `o3`.
An optional, user-defined tag for the evaluation.
Evaluation progress. Values range between 0 and 100; 100 corresponds to a completed `evaluation_status`.
0 <= x <= 100
Evaluation result consisting of average scores and rationales for each of the evaluated guardrail metrics.
Total cost of the evaluation.
The time the evaluation was created in UTC.
Description of the error causing the evaluation to fail, if any.
The time the error causing the evaluation to fail was recorded.
The time the evaluation started in UTC.
The time the evaluation completed in UTC.
The most recent time the evaluation was modified in UTC.
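A minimal polling sketch using the response fields above is shown below. The retrieval endpoint path and the `progress` and `result` key names are assumptions for illustration; the `evaluation_status` values and the 0-100 progress range come from this page.

```python
import time

import requests

evaluation_id = "your-evaluation-id"  # ID returned when the evaluation was created
AUTH_TOKEN = "YOUR_AUTH_TOKEN"
STATUS_URL = f"https://api.example.com/v1/evaluations/{evaluation_id}"  # hypothetical path

while True:
    evaluation = requests.get(
        STATUS_URL,
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        timeout=30,
    ).json()
    status = evaluation["evaluation_status"]  # in_progress, completed, canceled, queued, failed
    print(f"{status}: {evaluation.get('progress', 0)}% complete")
    if status in ("completed", "canceled", "failed"):
        break
    time.sleep(5)

if status == "completed":
    # Average score and rationale per evaluated guardrail metric (key name assumed)
    for metric, result in evaluation.get("result", {}).items():
        print(metric, result)
```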