📏 Rubric Evaluation

Rubrics allow you to define custom, structured evaluations for your agents — enabling deeper analysis beyond just “did the task succeed?”

Each rubric turns your domain expertise into a repeatable scoring system by breaking down session quality into criteria, each with clear definitions and optional weighting.


🧠 Why Rubrics?

We believe agent builders know more than we do about what a good agent run looks like. That’s why we built a system that’s:

  • Standardized in structure
  • Flexible in content
  • Designed to encode your intuition

Whether you care about navigation efficiency, repeated actions, time per step, or something unique to your agent — rubrics let you define and quantify that.


🧬 Rubric Structure

A Rubric contains:

  • One or more Criteria
  • Optional weighting per criterion
  • One or more Score Definitions per criterion (e.g. what a 1 vs 10 means)

Example

```yaml
rubric: "Browser Navigation Quality"

criteria:
  - name: "Visiting Unique Sites"
    weight: 2
    scores:
      10: "Never visited a site twice"
      5: "Occasionally revisited sites"
      1: "Constantly visited the same site"

  - name: "Average Step Time"
    weight: 1
    scores:
      10: "< 1s per step"
      5: "~2s per step"
      1: "> 5s per step"
```
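In code, the same structure could be modeled with a couple of plain dataclasses. This is an illustrative sketch only; the `Criterion` and `Rubric` names here are not the product's actual SDK types:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """One scoring dimension, with a definition for what each score means."""
    name: str
    scores: dict[int, str]   # e.g. {10: "Never visited a site twice", ...}
    weight: float = 1.0      # optional; criteria default to equal weighting

@dataclass
class Rubric:
    """A named collection of weighted criteria attached to an agent."""
    name: str
    criteria: list[Criterion] = field(default_factory=list)

nav_quality = Rubric(
    name="Browser Navigation Quality",
    criteria=[
        Criterion(
            name="Visiting Unique Sites",
            scores={10: "Never visited a site twice",
                    5: "Occasionally revisited sites",
                    1: "Constantly visited the same site"},
            weight=2,
        ),
        Criterion(
            name="Average Step Time",
            scores={10: "< 1s per step",
                    5: "~2s per step",
                    1: "> 5s per step"},
            weight=1,
        ),
    ],
)
```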

🧪 How Scores Are Computed

Each rubric criterion is evaluated and assigned a score from your defined set (e.g. 1–10). The final rubric score is a weighted average across all criteria.

  • If weights are not defined, all criteria are weighted equally
  • If weights are provided, they are applied relative to one another
    • (e.g. weights 2 and 5 contribute 2/7 ≈ 28.6% and 5/7 ≈ 71.4%)

⚙️ How to Use Rubrics

In the UI

  • Define rubrics under the Rubrics tab for any Agent
  • Apply rubrics manually to Sessions
  • View breakdowns of scores per criterion

In the SDK

Run a rubric at the end of a session:

```python
session.log_eval(rubric="browser_navigation_quality")
```

Or run multiple rubrics after the fact via the UI.


♻️ Reusability

  • Rubrics are attached to Agents
  • Any Session from that Agent can be scored using the attached rubrics
  • You can define multiple rubrics per Agent
  • All rubrics applied to a session will appear in the results panel

📊 Display and Visualization

  • Rubric scores are visible on each Session page
  • Each criterion’s score is broken down individually
  • Future updates will include:
    • Visual bar charts
    • Score overlays on graphs
    • Rubric score heatmaps

🧠 Design Philosophy

We’re opinionated about structure, but flexible about content.

  • You define what matters
  • We provide a way to measure it consistently
  • This balance allows for scalable agent evaluation across teams and use cases

🔮 What’s Next

  • Better rubric analytics across Mass Sims
  • Graph overlays based on rubric scores
  • Export/import rubric templates
  • Maybe: rubric suggestion starter packs (if we see common patterns)