Rubric Evaluation
Evaluate your agent using structured, customizable rubrics that turn qualitative judgment into actionable scoring.
Rubrics let you define custom, structured evaluations for your agents, so you can analyze a run in more depth than “did the task succeed?”
Each rubric turns your domain expertise into a repeatable scoring system by breaking down session quality into criteria, each with clear definitions and optional weighting.
🧠 Why Rubrics?
We believe agent builders know more than we do about what a good agent run looks like. That’s why we built a system that’s:
- Standardized in structure
- Flexible in content
- Designed to encode your intuition
Whether you care about navigation efficiency, repeated actions, time per step, or something unique to your agent — rubrics let you define and quantify that.
🧬 Rubric Structure
A Rubric contains:
- One or more Criteria
- Optional weighting per criterion
- One or more Score Definitions per criterion (e.g. what a 1 vs 10 means)
Example
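As a rough illustration of the structure listed above, a rubric could be modeled like this in Python. This is only a sketch: the class names (`Rubric`, `Criterion`, `ScoreDefinition`), the field names, and the sample criteria text are assumptions made for the example, not the platform's actual schema or SDK types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoreDefinition:
    """Hypothetical: describes what one score value means for a criterion."""
    score: int
    description: str


@dataclass
class Criterion:
    """Hypothetical: one dimension of session quality, with optional weight."""
    name: str
    score_definitions: List[ScoreDefinition]
    weight: Optional[float] = None  # None means "no weight defined"


@dataclass
class Rubric:
    """Hypothetical: a named collection of one or more criteria."""
    name: str
    criteria: List[Criterion]


# Illustrative rubric with two weighted criteria (all text is made up).
rubric = Rubric(
    name="Navigation quality",
    criteria=[
        Criterion(
            name="Navigation efficiency",
            weight=2,
            score_definitions=[
                ScoreDefinition(1, "Wanders; many unnecessary steps"),
                ScoreDefinition(10, "Takes the shortest reasonable path"),
            ],
        ),
        Criterion(
            name="Repeated actions",
            weight=5,
            score_definitions=[
                ScoreDefinition(1, "Repeats the same action many times"),
                ScoreDefinition(10, "No redundant actions"),
            ],
        ),
    ],
)

print(len(rubric.criteria))
```

Each criterion carries its own score definitions, so two reviewers reading the same rubric should agree on what a 1 and a 10 mean.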
🧪 How Scores Are Computed
Each rubric criterion is evaluated and assigned a score from your defined set (e.g. 1–10). The final rubric score is a weighted average across all criteria.
- If weights are not defined, all criteria are weighted equally
- If weights are provided, each criterion counts in proportion to its share of the total weight (e.g. weights of 2 and 5 mean roughly 28.6% vs 71.4%)
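The weighted average described above can be sketched in a few lines of Python. The function name `rubric_score`, the dictionary-based inputs, and the sample criterion names are assumptions for illustration, not the platform's implementation. The sketch also assumes weights are given either for every criterion or for none.

```python
from typing import Dict, Optional


def rubric_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Return the weighted average of per-criterion scores.

    Hypothetical helper: if `weights` is None, every criterion is
    weighted equally; otherwise weights are used relative to their sum.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


scores = {"navigation_efficiency": 8, "repeated_actions": 4}

# Equal weighting: (8 + 4) / 2
print(rubric_score(scores))

# Relative weighting with weights 2 and 5: (8*2 + 4*5) / (2 + 5) = 36/7
print(rubric_score(scores, {"navigation_efficiency": 2, "repeated_actions": 5}))
```

Because only the ratio between weights matters, weights of 2 and 5 give the same result as weights of 20 and 50.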
⚙️ How to Use Rubrics
In the UI
- Define rubrics under the Rubrics tab for any Agent
- Apply rubrics manually to Sessions
- View breakdowns of scores per criterion
In the SDK
Run a rubric at the end of a session:
Or run multiple rubrics after the fact via the UI.
♻️ Reusability
- Rubrics are attached to Agents
- Any Session from that Agent can be scored using the attached rubrics
- You can define multiple rubrics per Agent
- All rubrics applied to a session will appear in the results panel
📊 Display and Visualization
- Rubric scores are visible on each Session page
- Each criterion’s score is broken down individually
- Future updates will include:
  - Visual bar charts
  - Score overlays on graphs
  - Rubric score heatmaps
🧠 Design Philosophy
We’re opinionated about structure, but flexible about content.
- You define what matters
- We provide a way to measure it consistently
- This balance allows for scalable agent evaluation across teams and use cases
🔮 What’s Next
- Better rubric analytics across Mass Sims
- Graph overlays based on rubric scores
- Export/import rubric templates
- Maybe: rubric suggestion starter packs (if we see common patterns)