Rubric Evaluation
Rubrics allow you to define custom, structured evaluations for your agents, enabling deeper analysis beyond just “did the task succeed?” Each rubric turns your domain expertise into a repeatable scoring system by breaking down session quality into criteria, each with clear definitions and optional weighting.
Why Rubrics?
We believe agent builders know more than we do about what a good agent run looks like. That’s why we built a system that’s:
- Standardized in structure
- Flexible in content
- Designed to encode your intuition
Rubric Structure
A Rubric contains:
- One or more Criteria
- Optional weighting per criterion
- One or more Score Definitions per criterion (e.g. what a 1 vs 10 means)
Example
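As an illustration of the structure described above, a rubric could be modeled as plain data. This is a minimal sketch, not the product's actual schema: the class names (`Rubric`, `Criterion`, `ScoreDefinition`), their fields, and the sample support rubric are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreDefinition:
    score: int        # the numeric score this definition describes
    description: str  # what earning this score means


@dataclass
class Criterion:
    name: str
    score_definitions: list[ScoreDefinition]
    weight: float = 1.0  # optional; assumed to default to equal weighting


@dataclass
class Rubric:
    name: str
    criteria: list[Criterion] = field(default_factory=list)


# Hypothetical rubric for a customer-support agent.
support_rubric = Rubric(
    name="Support quality",
    criteria=[
        Criterion(
            name="Accuracy",
            weight=5,
            score_definitions=[
                ScoreDefinition(1, "Answer is wrong or misleading"),
                ScoreDefinition(10, "Answer is fully correct and complete"),
            ],
        ),
        Criterion(
            name="Tone",
            weight=2,
            score_definitions=[
                ScoreDefinition(1, "Rude or dismissive"),
                ScoreDefinition(10, "Polite and empathetic"),
            ],
        ),
    ],
)
```

Each criterion carries its own score definitions, so the meaning of a 1 versus a 10 stays attached to the thing being scored.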
How Scores Are Computed
Each rubric criterion is evaluated and assigned a score from your defined set (e.g. 1–10). The final rubric score is a weighted average across all criteria.
- If weights are not defined, all criteria are weighted equally
- If weights are provided, they are used relatively (e.g. weights of 2 and 5 mean roughly 28.6% vs 71.4%)
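The weighted average can be sketched in a few lines of Python. The function name `rubric_score` and the dictionary inputs are hypothetical, and treating a criterion with no listed weight as weight 1 is an assumption, not documented behavior.

```python
def rubric_score(scores, weights=None):
    """Return the weighted average of per-criterion scores.

    scores:  dict mapping criterion name -> assigned score
    weights: optional dict mapping criterion name -> relative weight
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    weights = weights or {}
    # Assumption: a criterion without an explicit weight counts as 1.
    resolved = {name: weights.get(name, 1.0) for name in scores}
    total_weight = sum(resolved.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * resolved[name] for name in scores) / total_weight


scores = {"Accuracy": 8, "Tone": 6}

# No weights: plain average, (8 + 6) / 2
equal = rubric_score(scores)

# Relative weights 5 and 2: (8*5 + 6*2) / 7
weighted = rubric_score(scores, {"Accuracy": 5, "Tone": 2})

print(equal)               # 7.0
print(round(weighted, 2))  # 7.43
```

Because weights are relative, only their ratio matters: weights of 2 and 5 give the same result as 4 and 10.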
How to Use Rubrics
In the UI
- Define rubrics under the Rubrics tab for any Agent
- Apply rubrics manually to Sessions
- View breakdowns of scores per criterion
In the SDK
Run a rubric at the end of a session:
Reusability
- Rubrics are attached to Agents
- Any Session from that Agent can be scored using the attached rubrics
- You can define multiple rubrics per Agent
- All rubrics applied to a session will appear in the results panel
This makes it easy to apply consistent evaluation logic across your runs — and evolve it over time.
Display and Visualization
- Rubric scores are visible on each Session page
- Each criterion’s score is broken down individually
- Future updates will include:
  - Visual bar charts
  - Score overlays on graphs
  - Rubric score heatmaps
Design Philosophy
We’re opinionated about structure, but flexible about content.
- You define what matters
- We provide a way to measure it consistently
- This balance allows for scalable agent evaluation across teams and use cases
What’s Next
- Better rubric analytics across Mass Sims
- Graph overlays based on rubric scores
- Export/import rubric templates
- Maybe: rubric suggestion starter packs (if we see common patterns)