Imagine someone messages the AI: "My flight just got cancelled. How do I get to Paris by tomorrow morning?"
The agent has three tools available. Stop and ask yourself: which one would you reach for first?
- 🔍 Search: look up alternative flights right now?
- 🧠 Memory: check the user's itinerary, budget, passport details?
- 🧮 Calculator: compare cost vs. travel time tradeoffs?
Search sounds right, but search for what? Without knowing where the user is flying from, their budget, or their flexibility, any search result is useless. The correct first step is Memory. An untrained agent doesn't know this.
The core problem: The right tool depends on context only the agent can discover step by step. There's no rule you can write in advance, and even a human expert pauses before answering.
Enter Reinforcement Learning
Instead of programming the right answer, we give the agent a score after each attempt. Right answer = reward. Wrong answer = nothing.
Run training rounds and watch the agent figure out, through trial and error, that Memory must be called first, before searching for anything.
This is reinforcement learning in a nutshell: try → score → adjust → repeat.
Key insight: Nobody told the agent "check Memory first." It discovered the right order on its own, purely from the reward signal: "did the user actually get to Paris?"
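That loop can be sketched as a toy trainer over the three tools. Everything here is invented for illustration (the environment, the reward, the 10% exploration rate); it just shows try → score → adjust → repeat converging on Memory-first:

```python
import random

TOOLS = ["memory", "search", "calculator"]

def reward(first_tool):
    """Toy environment: the episode only succeeds if Memory is called first."""
    return 1.0 if first_tool == "memory" else 0.0

# Start with no preference: equal scores for every tool.
scores = {t: 0.0 for t in TOOLS}
counts = {t: 0 for t in TOOLS}

random.seed(0)
for step in range(500):
    # Explore 10% of the time, otherwise exploit the best-scoring tool.
    if random.random() < 0.1:
        tool = random.choice(TOOLS)
    else:
        tool = max(scores, key=scores.get)
    r = reward(tool)                                    # try -> score
    counts[tool] += 1
    scores[tool] += (r - scores[tool]) / counts[tool]   # adjust (running mean)

print(max(scores, key=scores.get))  # the agent converges on "memory"
```

No rule ever says "Memory first": the preference emerges purely from which tool accumulated reward.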
The Next Level
The agent makes one decision: which tool to call. Standard RL handles this fine: one action, one reward signal.
Agent Lightning is built specifically for the multi-step agent world: it knows which step in a long tool-calling chain deserves credit when the final answer is right.
Now that we've seen why RL exists, let's formalise what the agent is actually learning. The next section introduces the Markov Decision Process: the mathematical structure that turns a flight conversation into a trainable sequence of decisions.
Core Insight: Remember the cancelled-flight scenario? Agent Lightning is built specifically for that world: multi-step tool-using agents where the right action ordering matters. It formalizes the entire chain as a structured MDP and decomposes reward back to each step that deserved credit.
At every moment, the agent sees a state: its current snapshot of the world. It's everything the agent can observe right now.
For our cancelled-flight agent, the state at step 1 is just the user's message. At step 2, the state includes the user's message plus the Memory tool result. With each step, the state grows richer.
Key point: The agent cannot go back in time. It only acts on the current state. So the state must contain everything relevant, which is why it accumulates tool results as the conversation progresses.
Given the state, the agent picks an action: which tool to call, or what response to give. This is the decision the policy π learns to make.
An untrained agent might call 🔍 Search immediately. A well-trained agent learns to call 🧠 Memory first, because without knowing the user's origin city and budget, any search result is meaningless.
The policy π is just a function: given state s, output a probability distribution over actions. Training changes these probabilities: good actions get higher probability, bad actions lower.
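As a sketch, a policy over the three tools can be as small as a logit table plus a softmax. The logit values below are illustrative, not trained numbers:

```python
import math

ACTIONS = ["memory", "search", "calculator"]

def policy(logits):
    """Softmax: map raw scores to a probability distribution over actions."""
    exps = [math.exp(x - max(logits)) for x in logits]  # subtract max for stability
    total = sum(exps)
    return {a: e / total for a, e in zip(ACTIONS, exps)}

untrained = policy([0.0, 0.0, 0.0])   # uniform: roughly 33% each
trained   = policy([2.0, 0.0, -1.0])  # illustrative post-training logits

print(untrained)
print(trained["memory"] > trained["search"])  # True: Memory now dominates
```

Training never replaces this function; it only moves the logits so that good actions absorb more probability mass.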
After each action, the agent gets a reward: a number that says "was that useful?" High reward → do that more. Low reward → do that less.
Over thousands of training iterations, the rewards shape the policy until the agent reliably picks the right action order, without anyone ever writing "check Memory first" as an explicit rule.
The credit problem: In a multi-step task, the final reward (+1 for correct answer) arrives at step 5, but step 1 (calling Memory) might have been the most important decision. Standard RL doesn't know this. Agent Lightning's structured MDP fixes it.
The "Structured" Part
Now that you know S, A, R, here's what makes Agent Lightning's MDP structured. Toggle between the two approaches to see the difference:
This is exactly what Agent Lightning's Unified Data Interface does: it takes any multi-turn agent trajectory and structures it into a list of (input, output, reward) transitions. Once structured, any RL algorithm (PPO, GRPO, REINFORCE++) can train on it without needing to understand the agent internals at all.
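A rough sketch of that structuring step. The field names here are invented, but the output shape, a list of (input, output, reward) tuples, matches the description above:

```python
def to_transitions(trajectory):
    """Flatten a multi-turn agent run into (input, output, reward) tuples.

    `trajectory` is a list of steps; each step records the context the LLM
    saw, the text or tool call it produced, and the reward assigned to it.
    """
    return [(step["context"], step["llm_output"], step["reward"])
            for step in trajectory]

run = [
    {"context": "user: flight cancelled", "llm_output": "call Memory",  "reward": 0.3},
    {"context": "... + memory result",    "llm_output": "call Search",  "reward": 0.2},
    {"context": "... + search result",    "llm_output": "final answer", "reward": 1.0},
]

transitions = to_transitions(run)
print(len(transitions))  # 3 training samples from one task
```

Once the data is in this shape, the RL algorithm never needs to know whether the run came from LangChain, AutoGen, or hand-rolled code.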
Which algorithm does Agent Lightning use?
Simplest. Good for quick experiments and smaller models. High variance but easy to implement.
Stable. Clips updates so the policy can't swing too far in one step. Great for continuous improvement.
Default for LLM agents. No critic network: compares trajectories against each other as the baseline.
With the MDP structure clear, the next three sections show how REINFORCE++, PPO, and GRPO each use it differently to update the agent's policy: same flight scenario, three different approaches to the same problem.
Flight agent scenario: The agent tries a 3-step trajectory: 🧠 Memory → 🔍 Search → ✍️ Answer. Each step gets an immediate reward. REINFORCE++ uses these rewards to compute a discounted return G_t for each step, then adjusts the policy so good sequences become more likely.
These are the rewards the agent received at each step of the flight trajectory. Drag sliders to see how changing one step's reward ripples through the discounted returns G_t.
• r_t = immediate reward at step t | • G_t = discounted sum of future rewards | • gradient arrows = policy update direction
How Credit Flows Backwards
When the final answer scores +1.0, that credit flows backwards through the trajectory, discounted by γ at each step. Watch how the γ value controls how much earlier steps "share" in the terminal reward.
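The backward flow of credit is a one-line recurrence, G_t = r_t + γ·G_{t+1}; a minimal sketch:

```python
def discounted_returns(rewards, gamma=0.9):
    """G_t = r_t + gamma * G_{t+1}, computed right-to-left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Flight trajectory: small rewards for Memory and Search, +1.0 terminal.
print(discounted_returns([0.3, 0.2, 1.0], gamma=0.9))
# Step 0 "shares" the terminal reward twice-discounted: 0.3 + 0.9*0.2 + 0.81*1.0
```

With γ close to 1, early steps inherit almost all of the terminal credit; with γ near 0, only immediate rewards matter.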
- Simple to implement
- Unbiased gradient estimate
- Works with any differentiable policy
- High variance: can swing wildly
- Sample inefficient
- No trust region: unstable updates
- Subtract baseline b(s_t) to reduce variance
- Advantage: A_t = G_t − b(s_t)
- Gradient clipping for stability
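The baseline fix from the list above, sketched with the mean return standing in for b(s_t) (real implementations often learn a state-dependent baseline instead):

```python
def advantages(returns):
    """A_t = G_t - b, using the batch mean of returns as a simple baseline."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

print(advantages([1.29, 1.1, 1.0]))  # centred around zero; variance drops
```

Subtracting a baseline leaves the expected gradient unchanged but shrinks its variance, which is exactly why it stabilises REINFORCE-style training.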
REINFORCE++ is the simplest option in Agent Lightning, useful for quick experiments and smaller models. After each flight-agent trajectory, it assigns each step a return G_t and nudges the policy: if Memory → Search → Answer led to a good outcome, all three actions become slightly more likely. Simple but effective for a first pass.
Flight agent scenario: REINFORCE++ saw that Memory-first worked well and wants to make it far more likely. But if it swings the policy too hard, the agent might never try Search again, even when Memory returns empty. PPO clips the update: it can improve Memory-first, but only up to the ε boundary per step.
Adjust ε to see how tightly PPO constrains the update. The agent wants to shift more probability toward 🧠 Memory, but the clip prevents an extreme swing.
Small ε = cautious updates. Large ε = aggressive updates. Agent Lightning default: ε = 0.2
When A>0 (good action like Memory-first): PPO won't let the ratio r_t exceed 1+ε, preventing over-reinforcement. When A<0 (bad action like Search-first): won't let the ratio go below 1−ε, preventing over-penalisation.
Without clipping, one lucky Memory-first trajectory could make the agent always call Memory regardless of context, even when the user hasn't provided their details yet. PPO ensures improvements are gradual and reversible.
PPO is supported in Agent Lightning for stable training. Per-token advantage estimation extends naturally to LightningRL's transition-based format. Each flight-agent step is its own transition โ PPO updates each one independently with its own clip.
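The clip itself is one line. A sketch of a single step's contribution to the PPO objective, using the standard clipped surrogate (not Agent Lightning's exact implementation):

```python
def ppo_term(ratio, advantage, eps=0.2):
    """One step's contribution to the PPO objective.

    ratio = pi_new(a|s) / pi_old(a|s); the pessimistic min keeps the
    effective update inside [1 - eps, 1 + eps] of the old policy.
    """
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Good action (A > 0): the benefit of pushing the ratio past 1.2 is capped.
print(ppo_term(1.5, advantage=+1.0))  # 1.2, not 1.5
# Bad action (A < 0): the surrogate is clipped at the 0.8 boundary.
print(ppo_term(0.5, advantage=-1.0))  # -0.8, not -0.5
```

Because the min always picks the more conservative value, one lucky trajectory can only move the policy a bounded amount per update.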
Flight agent scenario: Given the same prompt ("My flight got cancelled, how do I reach Paris?"), the agent generates 4 different trajectories in parallel. GRPO compares them against each other: trajectories above average get positive advantage (do more of this), below average get negative (do less). No critic needed: the group is the baseline.
Click any trajectory card to resample its reward and watch the normalized advantages update in real time.
Positive advantage = above group average → increase probability. Negative = below average → decrease probability. The advantage measures how much better (or worse) a trajectory was compared to the group average; it is the signal that drives the policy update. The baseline is the group mean, not a separate neural network.
The policy shifts: strategies with positive advantage become more probable, strategies with negative advantage become less probable. After enough training rounds, Memory-first dominates: the agent has learned the right tool order without anyone writing a rule.
- No value/critic network needed
- Lower GPU memory footprint
- Advantage estimated from group peers
- Naturally handles open-ended generation
- Ideal for LLM agents: each prompt gets its own group baseline
Default algorithm for math, coding, and tool-use agent tasks in Agent Lightning experiments. The group size G is configurable โ larger G gives a better baseline estimate but costs more compute. Agent Lightning uses G=4 to G=8 in its paper experiments.
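A sketch of the group-relative advantage: normalise each trajectory's reward against its group's mean and standard deviation. This is a common GRPO formulation; the exact normalisation may differ from Agent Lightning's:

```python
def group_advantages(rewards):
    """Advantage of each trajectory relative to its group (G samples, same prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four trajectories for the same cancelled-flight prompt (G = 4).
advs = group_advantages([1.0, 0.2, 0.0, 0.2])
print(advs)  # Memory-first (reward 1.0) gets the largest positive advantage
```

The advantages always sum to zero across the group: the peers are the baseline, so no critic network is needed.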
The KL Penalty: Keeping the Policy Grounded
The GRPO objective includes −β·KL(π_θ ‖ π_ref), a penalty that prevents the new policy from drifting too far from the original pretrained model. Without it, the policy could collapse: always calling Memory regardless of context. β controls the trade-off between learning speed and stability.
Agent Lightning uses β = 0.04 in most experiments: small enough to let the policy improve, large enough to prevent collapse. The paper notes β interacts with learning rate: larger models typically need smaller β to avoid destabilising the pretrained capabilities.
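As a sketch, here is the penalty for a single state with a discrete action distribution, using the exact categorical KL (LLM training uses per-token estimators in practice):

```python
import math

def kl_penalty(p_theta, p_ref, beta=0.04):
    """beta * KL(p_theta || p_ref) for two categorical distributions."""
    kl = sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref) if p > 0)
    return beta * kl

# A collapsed policy (always Memory) pays a large penalty vs. the reference.
print(kl_penalty([0.98, 0.01, 0.01], [0.34, 0.33, 0.33]))
print(kl_penalty([0.40, 0.30, 0.30], [0.34, 0.33, 0.33]))  # mild drift, tiny penalty
```

The penalty grows sharply as the policy concentrates all mass on one action, which is exactly the collapse mode it guards against.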
You've now seen all three algorithms in action on the same flight scenario. The next section helps you decide which one to use for your own agent, and what trade-offs to expect in production.
| Property | REINFORCE++ | PPO | GRPO |
|---|---|---|---|
| Needs critic / value network? | ✗ No | ✓ Yes | ✗ No |
| Training stability | Moderate | High | High |
| Compute overhead | Low | Medium | Medium |
| Implementation complexity | Simple | Moderate | Moderate |
| Requires multiple samples/prompt? | ✗ No | ✗ No | ✓ Recommended |
| KL regularization? | ✗ None | Clip only | ✓ β·KL term |
With the algorithm chosen, the next section shows how Agent Lightning's Unified Data Interface converts any framework's raw trajectories into the structured MDP transitions these algorithms need, without touching your agent code.
AIR reads the tool's return status. Non-null / status 200 → positive reward. The weight w_tool is configurable per tool type (Memory: 0.30, Search: 0.20, etc.).
Exception, timeout, or null return → negative reward of the same magnitude. The agent learns to avoid tool calls that fail, without any manual reward shaping.
The last step uses a task-specific evaluator (exact match, F1, LLM-as-judge). This is the only reward that exists in the sparse baseline; r₁ and r₂ are purely AIR's contribution.
Zero agent code changes. The Data Interface works by intercepting the LLM API calls externally: it reads the request/response logs the agent already generates. Switch from LangChain to AutoGen? The Data Interface adapts; your RL training code stays identical.
⚡ Lightning Server (Policy)
🤖 Lightning Client (Agent)
Three clients run rollouts in parallel, each streaming trajectories to the server as they complete. The server updates the policy continuously; no client ever waits for another.
This is the disaggregation: the policy server and the agent clients are fully decoupled processes. They communicate only via the gRPC API.
Multiple clients run rollouts concurrently. The server updates the policy whenever a batch is ready, without blocking clients.
Works with LangChain, OpenAI Agents SDK, AutoGen, and any framework that makes LLM API calls: single-LLM, tool-augmented, ReAct, multi-agent, or hierarchical agents.
Scale to hundreds of parallel clients. The server handles batching, prioritized replay, and gradient accumulation automatically.
Sync vs Async: A Visual Comparison
In synchronous mode the server waits for all workers to finish before updating the policy; workers sit idle while others catch up. In async mode (Agent Lightning) every worker runs continuously and the server updates the moment a batch arrives.
Synchronous: workers must wait for the slowest peer before the server can update. GPU utilization ~40–60%.
The Async Trade-off
More workers = more throughput. But in async mode, by the time a worker's trajectory reaches the server, the policy may have updated several times. That data is now stale: it was collected under an older policy. Drag the slider to see the trade-off.
Mitigation: Agent Lightning bounds staleness by discarding trajectories collected more than K policy updates ago. K is configurable (typically K = 4). This keeps training data fresh without blocking workers or reducing throughput.
Credit assignment tells us which steps to reinforce. AIR automates the reward signal that makes this possible; the next section shows exactly how tool return statuses become training signal, automatically.
Click the status badge on each step to flip it between ✓ success and ✗ failure. Watch how the reward signal changes: with AIR, failures get immediate penalties instead of waiting for the terminal reward.
Without AIR: only step 5 carries a reward signal. Steps 1–4 get zero gradient. The policy has no way to know that step 1 (Memory) was the key decision.
The agent makes 5 decisions. Only the last one gets a reward. Steps 1–4 receive zero gradient: the optimizer literally has nothing to learn from them.
This is why multi-step agent training is slow: most of the work the agent does is invisible to the learning signal.
Tool call return status is automatically converted to reward: successful call → small positive reward, failed call → small penalty. No manual reward shaping needed.
Every step now teaches the policy something. Training converges faster and intermediate behaviors (like always checking Memory first) emerge naturally.
AIR formula: r_t = α · status(tool_call_t) + (1 − α) · r_terminal, where status is +1 for success, −1 for failure, and α controls how much weight to give intermediate vs terminal rewards. Agent Lightning uses α ≈ 0.1–0.3 in its experiments.
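The formula translated directly into code, with α = 0.2 picked from inside the paper's reported range:

```python
def air_reward(tool_ok, r_terminal=0.0, alpha=0.2):
    """r_t = alpha * status + (1 - alpha) * r_terminal, status in {+1, -1}."""
    status = 1.0 if tool_ok else -1.0
    return alpha * status + (1 - alpha) * r_terminal

print(air_reward(True))                  # successful intermediate call: +0.2
print(air_reward(False))                 # failed call: -0.2
print(air_reward(True, r_terminal=1.0))  # final step with a correct answer: 1.0
```

Intermediate steps (r_terminal = 0) get small immediate signals either way, so failures are penalised the moment they happen rather than five steps later.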
The simplest approach: one LLM call produces one training sample. Clean and straightforward, but ignores multi-step structure entirely.
Works well for basic QA and single-step tasks. But for tool-using agents that make 3–10 LLM calls per task, this throws away most of the signal: only the final call gets trained.
Concatenate all turns into one long sequence. Mask out system prompts, tool results, and context; only train on agent output tokens.
Problems (from the paper):
- Disrupts RoPE positional encodings: token positions become meaningless across masked gaps
- Tight coupling with agent code: must modify agent internals to emit masks
- Complex masks slow GPU kernels: irregular patterns prevent efficient batching
- Hard to maintain across agent frameworks
Decompose the full trajectory into clean individual transitions. Each transition is a fresh (input, output, reward) tuple. No masking needed at all.
- No masking: token positions always valid
- Zero agent code modification needed
- Works with any framework (LangChain, AutoGen, etc.)
- Standard RL algorithms work unchanged
- AIR provides intermediate rewards automatically
Data Efficiency
Standard RL produces 1 training sample per completed task. LightningRL produces N: one per LLM call in the trajectory. Adjust the slider to see the cumulative difference over 100 tasks.
In the paper's Text-to-SQL experiment, the 3-agent workflow (SQL writer + checker + rewriter) produces ~3–5 LLM calls per task. LightningRL extracts a transition from each: 3–5× more gradient updates from the same compute budget, with no extra inference cost.
The algorithm is defined, the architecture is built. The next section shows the three real-world experiments from the paper (Text-to-SQL, multi-hop RAG, and Math QA) where all of this comes together on a live Llama-3.2-3B model.
Reward design determines what behaviour the agent learns. Adjust the weights below and watch which flight-agent strategy becomes the most rewarded; you may be surprised what a poorly tuned reward incentivises.
Reward for a correct final answer
Reward per successful tool return
Bonus for well-structured output
All experiments use Llama-3.2-3B-Instruct. The paper reports relative reward improvement curves; no absolute benchmark numbers are claimed. Training consistently improves across all tasks and agent frameworks tested.
What Breaks Without Each Feature?
Toggle each of Agent Lightning's three contributions off and watch what happens to the training curve. This shows why each feature is necessary, not just nice-to-have.
| Feature | Agent Lightning | TRL | OpenRLHF | veRL | RLlib |
|---|---|---|---|---|---|
The paper reports relative reward improvement curves (not absolute metrics) across 3 experiments. Training consistently improves across all tasks and agent frameworks tested: Text-to-SQL (LangChain/Spider), RAG (OpenAI Agents SDK/MuSiQue), and Math QA (AutoGen/Calc-X).
Takes any multi-turn agent trajectory, from any framework, and structures it into a list of (input_t, output_t, r_t) MDP transitions automatically.
Works with LangChain, AutoGen, and the OpenAI Agents SDK with zero agent code changes; every other RL framework requires you to rewrite the agent from scratch.
Trains on individual MDP transitions instead of full sequences. No loss masking, no padding: each transition is self-contained and carries its own reward.
Cleaner gradient signal, naturally supports AIR (each step has its own reward), and works on trajectories of any length without waste.
Separates the policy server (training) from the agent clients (rollout). Clients stream data asynchronously; the server updates the policy continuously.
Scale to hundreds of parallel workers without GPU idle time. Synchronous frameworks block training while waiting for every rollout to finish.
Same prompt, same tools, same base model: before RL training and after. Watch what changes.
Nothing about the prompt or tools changed; training only shifted the policy distribution. After RL training, Memory-first has ~85% probability; before, all three tools sat near 33% each. This is what Agent Lightning trains.
Watch how a single flight question flows through the entire Agent Lightning pipeline: from raw prompt to an improved policy.
Agent Lightning: Scalable RL for Agentic AI via Training-Agent Disaggregation (arXiv 2025)
arXiv:2508.03680