📦 Sessions

A Session represents a single run of an AI agent as it attempts to complete a task. It captures the full trace of the agent's behavior, from initial state to final outcome, and provides deep visibility into every decision, action, and result along the way.


🧠 What is a Session?

A Session is a top-level unit of observability. It begins when an agent starts a task and ends when the task is completed, failed, or interrupted. Learn how to create a Session here.
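
For example, a minimal session lifecycle with the Python SDK might look like the sketch below. Only end_session appears later in this guide, so lai.init and its session_name parameter are assumptions; check the SDK reference for exact signatures.

import lucidicai as lai

# Start a session when the agent begins its task.
# (lai.init and session_name are assumptions; see the SDK reference.)
lai.init(session_name="checkout-agent-run")

# ... the agent works on its task, producing Steps and Events ...

# End the session once the task is completed, failed, or interrupted.
lai.end_session(is_successful=True)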

Within each Session view in Lucidic, which we call the workflow sandbox, you'll find:

  • A timeline of Steps (agent states and transitions)
  • Fine-grained Events (e.g. LLM calls, API responses) inside each Step
  • A workflow graph visualizing the session’s trajectory
  • A GIF/video replay of the agent’s interaction (if available)
  • Evaluation scores, cost metrics, metadata, and more

🧬 Session Structure

Step and Event Hierarchy

A Session is composed of ordered Steps, each representing a distinct state and the action the agent took from it. Each Step contains multiple Events, which capture granular, atomic operations such as LLM calls or API requests.

Session
└── Step
    ├── Event
    ├── Event
    └── ...
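
As a sketch of how this hierarchy maps onto SDK calls: the create_step, end_step, and create_event names and parameters below are assumptions, so verify them against the SDK reference before use.

import lucidicai as lai

lai.init(session_name="example-run")  # assumed initializer

# One Step: a single agent state and the action taken from it.
lai.create_step(state="On search results page", action="Open first result")  # assumed API

# Events inside the Step: granular, atomic operations.
lai.create_event(description="LLM call: decide which result to open")  # assumed API
lai.create_event(description="GET https://example.com/result/1")       # assumed API

lai.end_step()  # assumed API
lai.end_session(is_successful=True)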

📺 Session Dashboard or Workflow Sandbox

Each Session in the workflow sandbox includes:

  • Overview metrics: duration, step count, cost, completion status
  • Visual replay: full session video or GIF (if captured)
  • Interactive workflow graph: clickable nodes for each step
  • Step-by-step explorer: deep dive into state/action/event logs
  • Evaluation scores: human or LLM-generated

📊 Evaluation and Scoring

Sessions can be evaluated via multiple systems:

1. User-Provided Eval Score

Passed in at runtime using the Python SDK. Useful for customized or domain-specific scoring.

import lucidicai as lai

lai.end_session(
    is_successful=True,
    session_eval="4.5",  # your custom score (free-form, not tied to a rubric)
    session_eval_reason="Agent completed primary objectives but took longer than expected"
)

2. Rubric-Based Evaluation

Define structured, weighted rubrics with multiple criteria and score definitions.

criteria:
  - name: "Repeated Site Visits"
    weight: 0.4
    scores:
      10: "Never visited a site twice"
      1: "Frequently revisited pages"

  - name: "Average Step Time"
    weight: 0.6
    ...

These rubrics produce both per-criterion and composite scores.
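
To illustrate the weighting, if the composite is a weighted average of per-criterion scores (an assumption; the exact formula may differ), the rubric above combines like this:

# Hypothetical per-criterion scores on the 1-10 scale defined above
scores  = {"Repeated Site Visits": 9, "Average Step Time": 7}
weights = {"Repeated Site Visits": 0.4, "Average Step Time": 0.6}

# Weighted composite: 0.4 * 9 + 0.6 * 7 = 7.8
composite = sum(weights[name] * scores[name] for name in scores)
print(composite)  # 7.8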

3. Default Evaluation

Our built-in system outputs:

  • A score (0 or 1) based on task completion
  • A list of “wrong” or suboptimal steps (automatically identified)
  • Graph highlights of failure points
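
For intuition, that output can be pictured as a record like the following. This shape is purely illustrative, not the actual response schema:

# Illustrative only -- not the real schema.
default_eval = {
    "score": 1,                      # 0 or 1, based on task completion
    "suboptimal_steps": [3, 7],      # automatically identified "wrong" steps
    "graph_highlights": ["step-7"],  # failure points highlighted in the graph
}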

🕵️‍♂️ Debugging with Sessions

Sessions give you rich insight into how your agent operates and where it goes wrong:

  • Replay stuck behavior via video
  • Identify high-cost operations
  • Trace decision-making back to specific prompts or events
  • Group failing steps into unified nodes in the graph
  • Track revisited states and looping patterns

📁 Metadata and Tagging

Each Session contains rich metadata including:

  • Task description
  • Agent identity and version
  • Cost
  • Timestamps
  • Link to parent Mass Simulation (if applicable)
  • Replay data and graph reference
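
Pictured as a record, that metadata might look like the following. The keys and values here are illustrative placeholders, not the actual schema:

# Illustrative only -- not the real schema.
session_metadata = {
    "task": "Purchase the cheapest USB-C cable",
    "agent": {"name": "shopping-agent", "version": "1.4.2"},
    "cost_usd": 0.37,
    "started_at": "2025-01-15T10:03:22Z",
    "ended_at": "2025-01-15T10:05:48Z",
    "mass_simulation_id": None,  # set when the run is part of a Mass Simulation
    "replay": None,              # replay data, if captured
    "graph": None,               # reference to the workflow graph
}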

Planned: semantic tagging, search, and session filtering based on LLM-labeled themes.