📏 Rubric Evaluation

Rubrics allow you to define custom, structured evaluations for your agents — enabling deeper analysis beyond just “did the task succeed?”

Each rubric turns your domain expertise into a repeatable scoring system by breaking down session quality into criteria, each with clear definitions and optional weighting.


🧠 Why Rubrics?

We believe agent builders know more than we do about what a good agent run looks like. That’s why we built a system that’s:

  • Standardized in structure
  • Flexible in content
  • Designed to encode your intuition

Whether you care about navigation efficiency, repeated actions, time per step, or something unique to your agent — rubrics let you define and quantify that.


🧬 Rubric Structure

A Rubric contains:

  • One or more Criteria
  • Optional weighting per criterion
  • One or more Score Definitions per criterion (e.g. what a 1 vs 10 means)

Example

```yaml
rubric: "Browser Navigation Quality"

criteria:
  - name: "Visiting Unique Sites"
    weight: 2
    scores:
      10: "Never visited a site twice"
      5: "Occasionally revisited sites"
      1: "Constantly visited the same site"

  - name: "Average Step Time"
    weight: 1
    scores:
      10: "< 1s per step"
      5: "~2s per step"
      1: "> 5s per step"
```
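In code, the same structure could be modeled with a couple of plain dataclasses. This is an illustrative sketch only; the `Criterion` and `Rubric` names here are not the product's actual SDK types:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """One scoring dimension, with a definition for what each score means."""
    name: str
    scores: dict[int, str]   # e.g. {10: "Never visited a site twice", ...}
    weight: float = 1.0      # optional; criteria default to equal weighting

@dataclass
class Rubric:
    """A named collection of weighted criteria attached to an agent."""
    name: str
    criteria: list[Criterion] = field(default_factory=list)

nav_quality = Rubric(
    name="Browser Navigation Quality",
    criteria=[
        Criterion(
            name="Visiting Unique Sites",
            scores={10: "Never visited a site twice",
                    5: "Occasionally revisited sites",
                    1: "Constantly visited the same site"},
            weight=2,
        ),
        Criterion(
            name="Average Step Time",
            scores={10: "< 1s per step",
                    5: "~2s per step",
                    1: "> 5s per step"},
            weight=1,
        ),
    ],
)
```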

🧪 How Scores Are Computed

Each rubric criterion is evaluated and assigned a score from your defined set (e.g. 1–10). The final rubric score is a weighted average across all criteria.

  • If weights are not defined, all criteria are weighted equally
  • If weights are provided, they are applied relative to one another
    • (e.g. weights 2 and 5 contribute 2/7 ≈ 28.6% and 5/7 ≈ 71.4%)

⚙️ How to Use Rubrics

In the UI

  • Define rubrics under the Rubrics tab for any Agent
  • Apply rubrics manually to Sessions
  • View breakdowns of scores per criterion

In the SDK

Run a rubric at the end of a session:

```python
session.log_eval(rubric="browser_navigation_quality")
```

Or run multiple rubrics after the fact via the UI.


♻️ Reusability

  • Rubrics are attached to Agents
  • Any Session from that Agent can be scored using the attached rubrics
  • You can define multiple rubrics per Agent
  • All rubrics applied to a session will appear in the results panel

📊 Display and Visualization

  • Rubric scores are visible on each Session page
  • Each criterion’s score is broken down individually
  • Future updates will include:
    • Visual bar charts
    • Score overlays on graphs
    • Rubric score heatmaps

🧠 Design Philosophy

We’re opinionated about structure, but flexible about content.

  • You define what matters
  • We provide a way to measure it consistently
  • This balance allows for scalable agent evaluation across teams and use cases

🔮 What’s Next

  • Better rubric analytics across Mass Sims
  • Graph overlays based on rubric scores
  • Export/import rubric templates
  • Maybe: rubric suggestion starter packs (if we see common patterns)