Rubrics
Rubrics let you define structured evaluation criteria for your agents, turning your domain knowledge into consistent, explainable scoring or validation systems. Instead of relying on a single metric, rubrics help you quantify or verify agent quality across the dimensions you care about.
Why Create Custom Rubrics?
Agents don’t just succeed or fail; they perform along a spectrum, or they meet (or miss) specific expectations. Rubrics help you:
- Capture nuanced behaviors
- Enforce domain-specific quality standards
- Compare model changes or prompt variations
- Identify regressions or improvement trends
Use Cases
- Score Rubric: Rate task completion, efficiency, safety, etc.
- Pass/Fail Rubric: Validate key requirements, such as compliance, user safety, or response correctness
Rubric Modes
1. Score Rubrics
Score rubrics provide weighted numeric evaluations across several criteria. Each criterion has a defined scoring range (e.g., 1–10), with descriptions of what qualifies as high, medium, or low performance.
- Each criterion has an optional weight, which is used to compute a weighted average.
- The final rubric score is a single float value.
- Ideal for comparative analysis, model iteration, or benchmarking performance.
Example
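The sketch below is purely illustrative (the criteria names, weights, and scores are made up, not pulled from any real rubric) and shows how a score rubric’s weighted average produces a single float:

```python
# Illustrative only: hypothetical criteria, weights, and 1-10 scores for a
# support agent. The final rubric score is the weighted average.
criteria = [
    # (name, weight, score)
    ("Task completion", 0.5, 9),
    ("Efficiency", 0.3, 6),
    ("Tone and safety", 0.2, 10),
]

weighted_sum = sum(weight * score for _, weight, score in criteria)
total_weight = sum(weight for _, weight, _ in criteria)
final_score = weighted_sum / total_weight  # single float, 8.3 here

print(f"Final rubric score: {final_score:.1f}")
```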
2. Pass/Fail Rubrics
Pass/Fail rubrics are designed for binary validation of agent behavior.
- Each criterion has a Pass Definition and a Fail Definition
- If all criteria pass, the rubric evaluates to True
- If any criterion fails, the rubric evaluates to False
- The system automatically identifies which step(s) caused each failure
Example
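Again purely illustrative (the criteria and outcomes are made up), this sketch shows the pass/fail logic: the rubric evaluates to True only when every criterion passes, and failing criteria are reported individually:

```python
# Illustrative only: hypothetical criteria for a compliance check.
# The rubric evaluates to True only if every criterion passes.
criteria = [
    {
        "name": "No PII disclosed",
        "pass_definition": "The agent never reveals personal data",
        "fail_definition": "Any personal data appears in a response",
        "passed": True,
    },
    {
        "name": "Cites a source",
        "pass_definition": "Every factual claim links to a source",
        "fail_definition": "At least one claim has no source",
        "passed": False,
    },
]

rubric_result = all(c["passed"] for c in criteria)  # False in this example
failed = [c["name"] for c in criteria if not c["passed"]]
print(f"Rubric passed: {rubric_result}; failing criteria: {failed}")
```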
How to Use Rubrics
Using rubrics consists of two parts: 1) creating one in the UI and 2) using it in your code.
Step 1: In the UI
- Navigate to the Rubrics tab on any Agent (located directly under the agent name)
- Click the Create New Evaluation Rubric button
- Give your Rubric a name, description, and icon
- Choose a Rubric Mode:
  - Score Rubric: Numerical scoring system (e.g., 1–10 scale)
  - Pass/Fail Rubric: Binary outcomes (Pass/Fail) per criterion
For Score Rubrics:
- Add multiple Criteria, each with:
  - A name
  - Optional weight (for weighted average)
  - Score Definitions (what a score of 1, 5, or 10 means)
For Pass/Fail Rubrics:
- Add multiple Criteria, each with:
  - A name
  - A Pass Definition (what success looks like)
  - A Fail Definition (what failure looks like)
- Click Save
Pro Tip: Click on the Evaluation Rubric Summary to see a quick summary of all criteria.
Step 2: In Python
Attach rubrics to sessions at runtime:
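A minimal sketch follows. The client, session, and rubric names are hypothetical placeholders rather than this platform’s actual API; consult the SDK reference for the exact calls:

```python
# Hypothetical SDK names for illustration; your client, session, and rubric
# identifiers will differ. The rubric itself is created in the UI and
# referenced here by name.
from my_agent_platform import Client  # hypothetical import

client = Client(api_key="...")

# Attach one or more rubrics when opening the session; they are evaluated
# automatically when the session completes.
with client.session(agent="support-agent", rubrics=["Response Quality"]) as session:
    session.log_user_message("How do I reset my password?")
    session.log_agent_response("You can reset it from Settings > Security.")
```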
Using Rubrics with Datasets
Rubrics work seamlessly with datasets to provide automated evaluation of test cases (see the sketch after this list):
- Consistent Evaluation: Same rubrics applied to all test cases
- Automated Scoring: No manual evaluation needed
- Trend Analysis: Track rubric scores across dataset runs
- Quality Gates: Fail tests that don’t meet rubric criteria
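As a sketch of how these pieces can fit together (all names are hypothetical, including `run_dataset` and the score threshold), a dataset run might gate on rubric scores like this:

```python
# Hypothetical names for illustration; the dataset/run API in your SDK will
# differ. Every test case in the dataset is scored with the same rubrics,
# and a quality gate fails the run for cases below a threshold.
from my_agent_platform import Client  # hypothetical import

client = Client(api_key="...")

run = client.run_dataset(
    dataset="password-reset-cases",
    agent="support-agent",
    rubrics=["Response Quality", "Compliance Check"],
)

# Quality gate: fail if any case scores below 7 on the score rubric.
failing = [case for case in run.results if case.rubric_scores["Response Quality"] < 7]
assert not failing, f"{len(failing)} test case(s) fell below the rubric threshold"
```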
Integration with Other Features
Production Monitoring
Apply rubrics to production sessions for real-time quality monitoring:
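A minimal sketch, again with hypothetical names (the client import, `session`, and the `environment` flag are placeholders):

```python
# Hypothetical names for illustration; adapt to your SDK. The same rubric
# used offline is attached to live traffic so every production session
# receives an automatic quality score.
from my_agent_platform import Client  # hypothetical import

client = Client(api_key="...")

with client.session(
    agent="support-agent",
    rubrics=["Response Quality"],
    environment="production",  # hypothetical flag separating prod from test runs
) as session:
    session.log_user_message("Cancel my subscription, please.")
    session.log_agent_response("Done. Your plan ends on the next billing date.")

# Rubric results appear alongside the session in the monitoring dashboard.
```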
Final Notes
- Rubric results are immutable once run on a session; any changes made afterward won’t retroactively affect past evaluations.
- You can use multiple rubrics per session to get both high-level and detailed insights.
- Whether you need nuanced score-based comparisons or simple go/no-go quality checks, rubrics give you a structured, explainable framework to evaluate AI agents.
- Rubrics integrate seamlessly with datasets, experiments, and production monitoring for comprehensive evaluation coverage.