Imagine someone messages the AI: "My flight just got cancelled. How do I get to Paris by tomorrow morning?"
The agent has three tools available. Stop and ask yourself: which one would you reach for first?
- 🔍 Search: look up alternative flights right now?
- 🧠 Memory: check the user's itinerary, budget, passport details?
- 🧮 Calculator: compare cost vs. travel time tradeoffs?
Search sounds right, but search for what? Without knowing where the user is flying from, their budget, or their flexibility, any search result is useless. The correct first step is Memory. An untrained agent doesn't know this.
The core problem: The right tool depends on context only the agent can discover step by step. There's no rule you can write in advance, and even a human expert pauses before answering.
Enter Reinforcement Learning
Instead of programming the right answer, we give the agent a score after each attempt. Right answer = reward. Wrong answer = nothing.
Run training rounds and watch the agent figure out, through trial and error, that Memory must be called first, before searching for anything.
This is reinforcement learning in a nutshell: try → score → adjust → repeat.
Key insight: Nobody told the agent "check Memory first." It discovered the right order on its own, purely from the reward signal: "did the user actually get to Paris?"
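That loop can be sketched as a toy trainer over the three tools. Everything here is invented for illustration (the environment, the reward, the 10% exploration rate); it just shows try → score → adjust → repeat converging on Memory-first:

```python
import random

TOOLS = ["memory", "search", "calculator"]

def reward(first_tool):
    """Toy environment: the episode only succeeds if Memory is called first."""
    return 1.0 if first_tool == "memory" else 0.0

# Start with no preference: equal scores for every tool.
scores = {t: 0.0 for t in TOOLS}
counts = {t: 0 for t in TOOLS}

random.seed(0)
for step in range(500):
    # Explore 10% of the time, otherwise exploit the best-scoring tool.
    if random.random() < 0.1:
        tool = random.choice(TOOLS)
    else:
        tool = max(scores, key=scores.get)
    r = reward(tool)                                    # try -> score
    counts[tool] += 1
    scores[tool] += (r - scores[tool]) / counts[tool]   # adjust (running mean)

print(max(scores, key=scores.get))  # the agent converges on "memory"
```

No rule ever says "Memory first": the preference emerges purely from which tool accumulated reward.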
The Next Level
The agent makes one decision: which tool to call. Standard RL handles this fine: one action, one reward signal.
Agent Lightning is built specifically for the multi-step agent world: it knows which step in a long tool-calling chain deserves credit when the final answer is right.
Now that we've seen why RL exists, let's formalise what the agent is actually learning. The next section introduces the Markov Decision Process: the mathematical structure that turns a flight conversation into a trainable sequence of decisions.
Core Insight: Remember the cancelled-flight scenario? Agent Lightning is built specifically for that world: multi-step tool-using agents where the right action ordering matters. It formalizes the entire chain as a structured MDP and decomposes reward back to each step that deserved credit.
At every moment, the agent sees a state: its current snapshot of the world. It's everything the agent can observe right now.
For our cancelled-flight agent, the state at step 1 is just the user's message. At step 2, the state includes the user's message plus the Memory tool result. With each step, the state grows richer.
Key point: The agent cannot go back in time. It only acts on the current state. So the state must contain everything relevant, which is why it accumulates tool results as the conversation progresses.
Given the state, the agent picks an action: which tool to call, or what response to give. This is the decision the policy π learns to make.
An untrained agent might call 🔍 Search immediately. A well-trained agent learns to call 🧠 Memory first, because without knowing the user's origin city and budget, any search result is meaningless.
The policy π is just a function: given state s, output a probability distribution over actions. Training changes these probabilities: good actions get higher probability, bad actions lower.
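As a sketch, a policy over the three tools can be as small as a logit table plus a softmax. The logit values below are illustrative, not trained numbers:

```python
import math

ACTIONS = ["memory", "search", "calculator"]

def policy(logits):
    """Softmax: map raw scores to a probability distribution over actions."""
    exps = [math.exp(x - max(logits)) for x in logits]  # subtract max for stability
    total = sum(exps)
    return {a: e / total for a, e in zip(ACTIONS, exps)}

untrained = policy([0.0, 0.0, 0.0])   # uniform: roughly 33% each
trained   = policy([2.0, 0.0, -1.0])  # illustrative post-training logits

print(untrained)
print(trained["memory"] > trained["search"])  # True: Memory now dominates
```

Training never replaces this function; it only moves the logits so that good actions absorb more probability mass.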
After each action, the agent gets a reward: a number that says "was that useful?" High reward → do that more. Low reward → do that less.
Over thousands of training iterations, the rewards shape the policy until the agent reliably picks the right action order, without anyone ever writing "check Memory first" as an explicit rule.
The credit problem: In a multi-step task, the final reward (+1 for correct answer) arrives at step 5, but step 1 (calling Memory) might have been the most important decision. Standard RL doesn't know this. Agent Lightning's structured MDP fixes it.
The "Structured" Part
Now that you know S, A, R, here's what makes Agent Lightning's MDP structured. Toggle between the two approaches to see the difference:
This is exactly what Agent Lightning's Unified Data Interface does: it takes any multi-turn agent trajectory and structures it into a list of (input, output, reward) transitions. Once structured, any RL algorithm (PPO, GRPO, REINFORCE++) can train on it without needing to understand the agent internals at all.
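A rough sketch of that structuring step. The field names here are invented, but the output shape, a list of (input, output, reward) tuples, matches the description above:

```python
def to_transitions(trajectory):
    """Flatten a multi-turn agent run into (input, output, reward) tuples.

    `trajectory` is a list of steps; each step records the context the LLM
    saw, the text or tool call it produced, and the reward assigned to it.
    """
    return [(step["context"], step["llm_output"], step["reward"])
            for step in trajectory]

run = [
    {"context": "user: flight cancelled", "llm_output": "call Memory",  "reward": 0.3},
    {"context": "... + memory result",    "llm_output": "call Search",  "reward": 0.2},
    {"context": "... + search result",    "llm_output": "final answer", "reward": 1.0},
]

transitions = to_transitions(run)
print(len(transitions))  # 3 training samples from one task
```

Once the data is in this shape, the RL algorithm never needs to know whether the run came from LangChain, AutoGen, or hand-rolled code.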
Which algorithm does Agent Lightning use?
Simplest. Good for quick experiments and smaller models. High variance but easy to implement.
Stable. Clips updates so the policy can't swing too far in one step. Great for continuous improvement.
Default for LLM agents. No critic network: compares trajectories against each other as the baseline.
With the MDP structure clear, the next three sections show how REINFORCE++, PPO, and GRPO each use it differently to update the agent's policy: same flight scenario, three different approaches to the same problem.
Flight agent scenario: The agent tries a 3-step trajectory: 🧠 Memory → 🔍 Search → ✍️ Answer. Each step gets an immediate reward. REINFORCE++ uses these rewards to compute a discounted return G_t for each step, then adjusts the policy so good sequences become more likely.
These are the rewards the agent received at each step of the flight trajectory. Drag sliders to see how changing one step's reward ripples through the discounted returns G_t.
• r_t = immediate reward at step t | • G_t = discounted sum of future rewards | • gradient arrows = policy update direction
How Credit Flows Backwards
When the final answer scores +1.0, that credit flows backwards through the trajectory, discounted by γ at each step. Watch how the γ value controls how much earlier steps "share" in the terminal reward.
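The backward flow of credit is a one-line recurrence, G_t = r_t + γ·G_{t+1}; a minimal sketch:

```python
def discounted_returns(rewards, gamma=0.9):
    """G_t = r_t + gamma * G_{t+1}, computed right-to-left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Flight trajectory: small rewards for Memory and Search, +1.0 terminal.
print(discounted_returns([0.3, 0.2, 1.0], gamma=0.9))
# Step 0 "shares" the terminal reward twice-discounted: 0.3 + 0.9*0.2 + 0.81*1.0
```

With γ close to 1, early steps inherit almost all of the terminal credit; with γ near 0, only immediate rewards matter.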
- Simple to implement
- Unbiased gradient estimate
- Works with any differentiable policy
- High variance: can swing wildly
- Sample inefficient
- No trust region: unstable updates
- Subtract baseline b(s_t) to reduce variance
- Advantage: A_t = G_t − b(s_t)
- Gradient clipping for stability
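The baseline fix from the list above, sketched with the mean return standing in for b(s_t) (real implementations often learn a state-dependent baseline instead):

```python
def advantages(returns):
    """A_t = G_t - b, using the batch mean of returns as a simple baseline."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

print(advantages([1.29, 1.1, 1.0]))  # centred around zero; variance drops
```

Subtracting a baseline leaves the expected gradient unchanged but shrinks its variance, which is exactly why it stabilises REINFORCE-style training.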
REINFORCE++ is the simplest option in Agent Lightning, useful for quick experiments and smaller models. After each flight-agent trajectory, it assigns each step a return G_t and nudges the policy: if Memory → Search → Answer led to a good outcome, all three actions become slightly more likely. Simple but effective for a first pass.
Flight agent scenario: REINFORCE++ saw that Memory-first worked well and wants to make it far more likely. But if it swings the policy too hard, the agent might never try Search again, even when Memory returns empty. PPO clips the update: it can improve Memory-first, but only up to the ε boundary per step.
Adjust ε to see how tightly PPO constrains the update. The agent wants to shift more probability toward 🧠 Memory, but the clip prevents an extreme swing.
Small ε = cautious updates. Large ε = aggressive updates. Agent Lightning default: ε = 0.2
When A>0 (good action like Memory-first): PPO won't let the ratio r_t exceed 1+ε, preventing over-reinforcement. When A<0 (bad action like Search-first): won't let the ratio go below 1−ε, preventing over-penalisation.
Without clipping, one lucky Memory-first trajectory could make the agent always call Memory regardless of context, even when the user hasn't provided their details yet. PPO ensures improvements are gradual and reversible.
PPO is supported in Agent Lightning for stable training. Per-token advantage estimation extends naturally to LightningRL's transition-based format. Each flight-agent step is its own transition โ PPO updates each one independently with its own clip.
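The clip itself is one line. A sketch of a single step's contribution to the PPO objective, using the standard clipped surrogate (not Agent Lightning's exact implementation):

```python
def ppo_term(ratio, advantage, eps=0.2):
    """One step's contribution to the PPO objective.

    ratio = pi_new(a|s) / pi_old(a|s); the pessimistic min keeps the
    effective update inside [1 - eps, 1 + eps] of the old policy.
    """
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Good action (A > 0): the benefit of pushing the ratio past 1.2 is capped.
print(ppo_term(1.5, advantage=+1.0))  # 1.2, not 1.5
# Bad action (A < 0): the surrogate is clipped at the 0.8 boundary.
print(ppo_term(0.5, advantage=-1.0))  # -0.8, not -0.5
```

Because the min always picks the more conservative value, one lucky trajectory can only move the policy a bounded amount per update.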
Flight agent scenario: Given the same prompt ("My flight got cancelled, how do I reach Paris?"), the agent generates 4 different trajectories in parallel. GRPO compares them against each other: trajectories above average get positive advantage (do more of this), below average get negative (do less). No critic needed: the group is the baseline.
Click any trajectory card to resample its reward and watch the normalized advantages update in real time.
Positive advantage = above group average → increase probability. Negative = below average → decrease probability. The advantage measures how much better (or worse) a trajectory was compared to the group average; it is the signal that drives the policy update. The baseline is the group mean, not a separate neural network.
The policy shifts: strategies with positive advantage become more probable, strategies with negative advantage become less probable. After enough training rounds, Memory-first dominates: the agent has learned the right tool order without anyone writing a rule.
- No value/critic network needed
- Lower GPU memory footprint
- Advantage estimated from group peers
- Naturally handles open-ended generation
- Ideal for LLM agents: each prompt gets its own group baseline
Default algorithm for math, coding, and tool-use agent tasks in Agent Lightning experiments. The group size G is configurable โ larger G gives a better baseline estimate but costs more compute. Agent Lightning uses G=4 to G=8 in its paper experiments.
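A sketch of the group-relative advantage: normalise each trajectory's reward against its group's mean and standard deviation. This is a common GRPO formulation; the exact normalisation may differ from Agent Lightning's:

```python
def group_advantages(rewards):
    """Advantage of each trajectory relative to its group (G samples, same prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four trajectories for the same cancelled-flight prompt (G = 4).
advs = group_advantages([1.0, 0.2, 0.0, 0.2])
print(advs)  # Memory-first (reward 1.0) gets the largest positive advantage
```

The advantages always sum to zero across the group: the peers are the baseline, so no critic network is needed.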
The KL Penalty: Keeping the Policy Grounded
The GRPO objective includes −β·KL(π_θ ‖ π_ref), a penalty that prevents the new policy from drifting too far from the original pretrained model. Without it, the policy could collapse: always calling Memory regardless of context. β controls the trade-off between learning speed and stability.
Agent Lightning uses β = 0.04 in most experiments: small enough to let the policy improve, large enough to prevent collapse. The paper notes β interacts with learning rate: larger models typically need smaller β to avoid destabilising the pretrained capabilities.
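As a sketch, here is the penalty for a single state with a discrete action distribution, using the exact categorical KL (LLM training uses per-token estimators in practice):

```python
import math

def kl_penalty(p_theta, p_ref, beta=0.04):
    """beta * KL(p_theta || p_ref) for two categorical distributions."""
    kl = sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref) if p > 0)
    return beta * kl

# A collapsed policy (always Memory) pays a large penalty vs. the reference.
print(kl_penalty([0.98, 0.01, 0.01], [0.34, 0.33, 0.33]))
print(kl_penalty([0.40, 0.30, 0.30], [0.34, 0.33, 0.33]))  # mild drift, tiny penalty
```

The penalty grows sharply as the policy concentrates all mass on one action, which is exactly the collapse mode it guards against.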
You've now seen all three algorithms in action on the same flight scenario. The next section helps you decide which one to use for your own agent, and what trade-offs to expect in production.
| Property | REINFORCE++ | PPO | GRPO |
|---|---|---|---|
| Needs critic / value network? | ✗ No | ✓ Yes | ✗ No |
| Training stability | Moderate | High | High |
| Compute overhead | Low | Medium | Medium |
| Implementation complexity | Simple | Moderate | Moderate |
| Requires multiple samples/prompt? | ✗ No | ✗ No | ✓ Recommended |
| KL regularization? | ✗ None | Clip only | ✓ β·KL term |
With the algorithm chosen, the next section shows how Agent Lightning's Unified Data Interface converts any framework's raw trajectories into the structured MDP transitions these algorithms need, without touching your agent code.
AIR reads the tool's return status. Non-null / status 200 → positive reward. The weight w_tool is configurable per tool type (Memory: 0.30, Search: 0.20, etc.).
Exception, timeout, or null return → negative reward of the same magnitude. The agent learns to avoid tool calls that fail, without any manual reward shaping.
The last step uses a task-specific evaluator (exact match, F1, LLM-as-judge). This is the only reward that exists in the sparse baseline; r₁ and r₂ are purely AIR's contribution.
Zero agent code changes. The Data Interface works by intercepting the LLM API calls externally: it reads the request/response logs the agent already generates. Switch from LangChain to AutoGen? The Data Interface adapts; your RL training code stays identical.
⚡ Lightning Server (Policy)
🤖 Lightning Client (Agent)
Three clients run rollouts in parallel, each streaming trajectories to the server as they complete. The server updates the policy continuously; no client ever waits for another.
This is the disaggregation: the policy server and the agent clients are fully decoupled processes. They communicate only via the gRPC API.
Multiple clients run rollouts concurrently. The server updates the policy whenever a batch is ready, without blocking clients.
Works with LangChain, OpenAI Agents SDK, AutoGen, and any framework that makes LLM API calls: single-LLM, tool-augmented, ReAct, multi-agent, or hierarchical agents.
Scale to hundreds of parallel clients. The server handles batching, prioritized replay, and gradient accumulation automatically.
Sync vs Async: A Visual Comparison
In synchronous mode the server waits for all workers to finish before updating the policy; workers sit idle while others catch up. In async mode (Agent Lightning) every worker runs continuously and the server updates the moment a batch arrives.
Synchronous: workers must wait for the slowest peer before the server can update. GPU utilization ~40–60%.
The Async Trade-off
More workers = more throughput. But in async mode, by the time a worker's trajectory reaches the server, the policy may have updated several times. That data is now stale: it was collected under an older policy. Drag the slider to see the trade-off.
Mitigation: Agent Lightning bounds staleness by discarding trajectories collected more than K policy updates ago. K is configurable (typically K = 4). This keeps training data fresh without blocking workers or reducing throughput.
Credit assignment tells us which steps to reinforce. AIR automates the reward signal that makes this possible; the next section shows exactly how tool return statuses become training signal, automatically.
Click the status badge on each step to flip it between ✓ success and ✗ failure. Watch how the reward signal changes: with AIR, failures get immediate penalties instead of waiting for the terminal reward.
Without AIR: only step 5 carries a reward signal. Steps 1–4 get zero gradient. The policy has no way to know that step 1 (Memory) was the key decision.
The agent makes 5 decisions. Only the last one gets a reward. Steps 1–4 receive zero gradient: the optimizer literally has nothing to learn from them.
This is why multi-step agent training is slow: most of the work the agent does is invisible to the learning signal.
Tool call return status is automatically converted to reward: successful call → small positive reward, failed call → small penalty. No manual reward shaping needed.
Every step now teaches the policy something. Training converges faster and intermediate behaviors (like always checking Memory first) emerge naturally.
AIR formula: r_t = α · status(tool_call_t) + (1 − α) · r_terminal, where status is +1 for success, −1 for failure, and α controls how much weight to give intermediate vs terminal rewards. Agent Lightning uses α ≈ 0.1–0.3 in its experiments.
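The formula translated directly into code, with α = 0.2 picked from inside the paper's reported range:

```python
def air_reward(tool_ok, r_terminal=0.0, alpha=0.2):
    """r_t = alpha * status + (1 - alpha) * r_terminal, status in {+1, -1}."""
    status = 1.0 if tool_ok else -1.0
    return alpha * status + (1 - alpha) * r_terminal

print(air_reward(True))                  # successful intermediate call: +0.2
print(air_reward(False))                 # failed call: -0.2
print(air_reward(True, r_terminal=1.0))  # final step with a correct answer: 1.0
```

Intermediate steps (r_terminal = 0) get small immediate signals either way, so failures are penalised the moment they happen rather than five steps later.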
The simplest approach: one LLM call produces one training sample. Clean and straightforward, but ignores multi-step structure entirely.
Works well for basic QA and single-step tasks. But for tool-using agents that make 3–10 LLM calls per task, this throws away most of the signal: only the final call gets trained.
Concatenate all turns into one long sequence. Mask out system prompts, tool results, and context; only train on agent output tokens.
Problems (from the paper):
- Disrupts RoPE positional encodings: token positions become meaningless across masked gaps
- Tight coupling with agent code: must modify agent internals to emit masks
- Complex masks slow GPU kernels: irregular patterns prevent efficient batching
- Hard to maintain across agent frameworks
Decompose the full trajectory into clean individual transitions. Each transition is a fresh (input, output, reward) tuple. No masking needed at all.
- No masking: token positions always valid
- Zero agent code modification needed
- Works with any framework (LangChain, AutoGen, etc.)
- Standard RL algorithms work unchanged
- AIR provides intermediate rewards automatically
Data Efficiency
Standard RL produces 1 training sample per completed task. LightningRL produces N: one per LLM call in the trajectory. Adjust the slider to see the cumulative difference over 100 tasks.
In the paper's Text-to-SQL experiment, the 3-agent workflow (SQL writer + checker + rewriter) produces ~3–5 LLM calls per task. LightningRL extracts a transition from each: 3–5× more gradient updates from the same compute budget, with no extra inference cost.
The algorithm is defined, the architecture is built. The next section shows the three real-world experiments from the paper (Text-to-SQL, multi-hop RAG, and Math QA) where all of this comes together on a live Llama-3.2-3B model.
Reward design determines what behaviour the agent learns. Adjust the weights below and watch which flight-agent strategy becomes the most rewarded; you may be surprised what a poorly tuned reward incentivises.
Reward for a correct final answer
Reward per successful tool return
Bonus for well-structured output
All experiments use Llama-3.2-3B-Instruct. The paper reports relative reward improvement curves; no absolute benchmark numbers are claimed. Training consistently improves across all tasks and agent frameworks tested.
What Breaks Without Each Feature?
Toggle each of Agent Lightning's three contributions off and watch what happens to the training curve. This shows why each feature is necessary, not just nice-to-have.
| Feature | Agent Lightning | TRL | OpenRLHF | veRL | RLlib |
|---|---|---|---|---|---|
The paper reports relative reward improvement curves (not absolute metrics) across 3 experiments. Training consistently improves across all tasks and agent frameworks tested: Text-to-SQL (LangChain/Spider), RAG (OpenAI Agents SDK/MuSiQue), and Math QA (AutoGen/Calc-X).
Takes any multi-turn agent trajectory, from any framework, and structures it into a list of (input_t, output_t, r_t) MDP transitions automatically.
Works with LangChain, AutoGen, and the OpenAI Agents SDK with zero agent code changes; every other RL framework requires you to rewrite the agent from scratch.
Trains on individual MDP transitions instead of full sequences. No loss masking, no padding: each transition is self-contained and carries its own reward.
Cleaner gradient signal, naturally supports AIR (each step has its own reward), and works on trajectories of any length without waste.
Separates the policy server (training) from the agent clients (rollout). Clients stream data asynchronously; the server updates the policy continuously.
Scale to hundreds of parallel workers without GPU idle time. Synchronous frameworks block training while waiting for every rollout to finish.
Same prompt, same tools, same base model: before RL training and after. Watch what changes.
Nothing about the prompt or tools changed; training only shifted the policy distribution. After RL training, Memory-first has ~85% probability; before, all three tools sat near 33% each. This is what Agent Lightning trains.
Watch how a single flight question flows through the entire Agent Lightning pipeline: from raw prompt to an improved policy.
Agent Lightning: Scalable RL for Agentic AI via Training-Agent Disaggregation (arXiv 2025)
arXiv:2508.03680