Rubrics
Rubrics let you define structured evaluation criteria for your agents, turning your domain knowledge into consistent, explainable scoring or validation systems. Instead of relying on a single metric, rubrics help you quantify or verify agent quality across the dimensions you care about.
Why Create Custom Rubrics?
Agents don’t just succeed or fail; they perform along a spectrum, or they meet (or miss) specific expectations. Rubrics help you:
- Capture nuanced behaviors
- Enforce domain-specific quality standards
- Compare model changes or prompt variations
- Identify regressions or improvement trends
Use Cases
- Score Rubric: Rate task completion, efficiency, safety, etc.
- Pass/Fail Rubric: Validate key requirements, such as compliance, user safety, or response correctness
Rubric Modes
1. Score Rubrics
Score rubrics provide weighted numeric evaluations across several criteria. Each criterion has a defined scoring range (e.g., 1–10), with descriptions of what qualifies as high, medium, or low performance.
- Each criterion has an optional weight, which is used to compute a weighted average.
- The final rubric score is a single float value.
- Ideal for comparative analysis, model iteration, or benchmarking performance.
Example
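The sketch below is purely illustrative (the criteria names, weights, and scores are made up, not pulled from any real rubric) and shows how a score rubric’s weighted average produces a single float:

```python
# Illustrative only: hypothetical criteria, weights, and 1-10 scores for a
# support agent. The final rubric score is the weighted average.
criteria = [
    # (name, weight, score)
    ("Task completion", 0.5, 9),
    ("Efficiency", 0.3, 6),
    ("Tone and safety", 0.2, 10),
]

weighted_sum = sum(weight * score for _, weight, score in criteria)
total_weight = sum(weight for _, weight, _ in criteria)
final_score = weighted_sum / total_weight  # single float, 8.3 here

print(f"Final rubric score: {final_score:.1f}")
```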
2. Pass/Fail Rubrics
Pass/Fail rubrics are designed for binary validation of agent behavior.
- Each criterion has a Pass Definition and a Fail Definition
- If all criteria pass, the rubric evaluates to True
- If any criterion fails, the rubric evaluates to False
- The system automatically identifies which step(s) caused each failure
Example
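Again purely illustrative (the criteria and outcomes are made up), this sketch shows the pass/fail logic: the rubric evaluates to True only when every criterion passes, and failing criteria are reported individually:

```python
# Illustrative only: hypothetical criteria for a compliance check.
# The rubric evaluates to True only if every criterion passes.
criteria = [
    {
        "name": "No PII disclosed",
        "pass_definition": "The agent never reveals personal data",
        "fail_definition": "Any personal data appears in a response",
        "passed": True,
    },
    {
        "name": "Cites a source",
        "pass_definition": "Every factual claim links to a source",
        "fail_definition": "At least one claim has no source",
        "passed": False,
    },
]

rubric_result = all(c["passed"] for c in criteria)  # False in this example
failed = [c["name"] for c in criteria if not c["passed"]]
print(f"Rubric passed: {rubric_result}; failing criteria: {failed}")
```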
How to Use Rubrics
Using rubrics consists of two parts: 1) creating one in the UI and 2) using it in your code.
Step 1: In the UI
- Navigate to the Rubrics tab on any Agent (located directly under the agent name)
- Click the Create New Evaluation Rubric button
- Give your Rubric a name, description, and icon
- Choose a Rubric Mode:
  - Score Rubric: Numerical scoring system (e.g., 1–10 scale)
  - Pass/Fail Rubric: Binary outcomes (Pass/Fail) per criterion
For Score Rubrics:
- Add multiple Criteria, each with:
  - A name
  - Optional weight (for weighted average)
  - Score Definitions (what a score of 1, 5, or 10 means)
For Pass/Fail Rubrics:
- Add multiple Criteria, each with:
  - A name
  - A Pass Definition (what success looks like)
  - A Fail Definition (what failure looks like)
- Click Save
Pro Tip: Click on the Evaluation Rubric Summary to see a quick summary of all criteria.
Step 2: In Python
Attach rubrics to sessions at runtime:
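A minimal sketch follows. The client, session, and rubric names are hypothetical placeholders rather than this platform’s actual API; consult the SDK reference for the exact calls:

```python
# Hypothetical SDK names for illustration; your client, session, and rubric
# identifiers will differ. The rubric itself is created in the UI and
# referenced here by name.
from my_agent_platform import Client  # hypothetical import

client = Client(api_key="...")

# Attach one or more rubrics when opening the session; they are evaluated
# automatically when the session completes.
with client.session(agent="support-agent", rubrics=["Response Quality"]) as session:
    session.log_user_message("How do I reset my password?")
    session.log_agent_response("You can reset it from Settings > Security.")
```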
Using Rubrics with Datasets
Rubrics work seamlessly with datasets to provide automated evaluation of test cases (see the sketch after this list):
- Consistent Evaluation: Same rubrics applied to all test cases
- Automated Scoring: No manual evaluation needed
- Trend Analysis: Track rubric scores across dataset runs
- Quality Gates: Fail tests that don’t meet rubric criteria
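As a sketch of how these pieces can fit together (all names are hypothetical, including `run_dataset` and the score threshold), a dataset run might gate on rubric scores like this:

```python
# Hypothetical names for illustration; the dataset/run API in your SDK will
# differ. Every test case in the dataset is scored with the same rubrics,
# and a quality gate fails the run for cases below a threshold.
from my_agent_platform import Client  # hypothetical import

client = Client(api_key="...")

run = client.run_dataset(
    dataset="password-reset-cases",
    agent="support-agent",
    rubrics=["Response Quality", "Compliance Check"],
)

# Quality gate: fail if any case scores below 7 on the score rubric.
failing = [case for case in run.results if case.rubric_scores["Response Quality"] < 7]
assert not failing, f"{len(failing)} test case(s) fell below the rubric threshold"
```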
Integration with Other Features
Production Monitoring
Apply rubrics to production sessions for real-time quality monitoring:
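A minimal sketch, again with hypothetical names (the client import, `session`, and the `environment` flag are placeholders):

```python
# Hypothetical names for illustration; adapt to your SDK. The same rubric
# used offline is attached to live traffic so every production session
# receives an automatic quality score.
from my_agent_platform import Client  # hypothetical import

client = Client(api_key="...")

with client.session(
    agent="support-agent",
    rubrics=["Response Quality"],
    environment="production",  # hypothetical flag separating prod from test runs
) as session:
    session.log_user_message("Cancel my subscription, please.")
    session.log_agent_response("Done. Your plan ends on the next billing date.")

# Rubric results appear alongside the session in the monitoring dashboard.
```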
Final Notes
- Rubric results are immutable once run on a session; any changes made afterward won’t retroactively affect past evaluations.
- You can use multiple rubrics per session to get both high-level and detailed insights.
- Whether you need nuanced score-based comparisons or simple go/no-go quality checks, rubrics give you a structured, explainable framework to evaluate AI agents.
- Rubrics integrate seamlessly with datasets, experiments, and production monitoring for comprehensive evaluation coverage.