Bandits, Exploration & the Agent-Environment Loop · Visual Summary
Incorrect password. Try again.
RL Properties
⟶
Agent Loop
⟶
Bandits
⟶
Strategies
⟶
Simulator
Introduction
Why Reinforcement Learning, Why Now?
On 5 March 2025, Sutton & Barto won the 2024 ACM Turing Award for "the conceptual and algorithmic foundations of reinforcement learning." Their 1998 textbook has been cited 75,000+ times. Then every major LLM shifted to RL post-training.
75K+
Citations of Sutton & Barto
1990s
TD-Gammon, first RL milestone
2016
AlphaGo beats Lee Sedol
2025
Turing Award + agentic RL
RL Milestones
1994 — TD-Gammon
Gerald Tesauro's TD-Gammon learns backgammon at near-expert level using temporal difference learning — RL's first major proof of concept.
2013 — DQN
DeepMind's DQN combines Q-learning with deep neural networks, achieving superhuman Atari performance from raw pixels.
2016 — AlphaGo
AlphaGo defeats Lee Sedol 4-1. Monte Carlo Tree Search + deep RL. Go's 10170 states had been considered unsolvable.
2022–2024 — RLHF era
ChatGPT, GPT-4, Claude, Gemini — all use RLHF or DPO for alignment. RL becomes the defining post-training step for LLMs.
2025 — Agentic RL
DeepSeek-R1 trains reasoning chains via GRPO. RL is now used to teach thinking — not just helpfulness. Turing Award for Sutton & Barto.
Overview
What Is Reinforcement Learning?
RL is the third paradigm of ML. An agent interacts with an environment, takes actions, and learns from scalar reward signals — with no teacher providing correct answers.
Supervised Learning
Input Feature vector
Label Correct answer given
Signal Error vs ground truth
Learns mapping X → Y
Unsupervised Learning
Input Feature vector
Label No labels
Signal Reconstruction / density
Learns structure in X
Reinforcement Learning
Input State observation
Label Reward (evaluative)
Signal Cumulative future reward
Learns policy π: S → A
The core RL loop: The agent observes state St, chooses action At, the environment transitions to St+1 and emits reward Rt+1. The agent's goal is to maximize the expected sum of future rewards.
What Makes RL Different
4 Properties That Define RL
These four characteristics distinguish RL from supervised and unsupervised learning — they also explain why RL is fundamentally harder.
1
Evaluative, Not Instructive
In SL, the teacher tells you the correct answer: "the label is cat." In RL, the environment only says "that was worth +1" or "-5" — it evaluates what you did, without telling you what you should have done. The agent must infer which action was actually best.
SL: "correct answer is A"
RL: "reward = +3"
2
Non-i.i.d. Data Distribution
In SL, training samples are independent and identically distributed (i.i.d.) — the distribution is fixed. In RL, the agent's own actions determine which states it visits next. Policy changes the data distribution. A policy that explores widely collects different data than a greedy one.
The training data is a function of the agent's current behavior — this creates a feedback loop absent in SL.
3
Delayed Consequences
Actions have consequences that only become clear later. A chess move might look neutral but set up a winning position 10 moves ahead. This is the credit assignment problem — which of the past actions actually caused the reward?
Example — Chess
Move 1 → Move 2 → ... → Move 40 → Win (+1) Which moves contributed to the win? The credit assignment problem.
4
Exploration vs. Exploitation
Should the agent use what it already knows (exploit) or try new things to discover better options (explore)? Pure exploitation locks in suboptimal actions. Pure exploration wastes time on arms you've already evaluated.
Exploit only → Miss better options
Balance → Long-run optimum
Property
Supervised Learning
Reinforcement Learning
Feedback type
Instructive (correct label given)
Evaluative (reward only)
Data distribution
i.i.d., fixed
Non-i.i.d., policy-dependent
Temporal structure
No (each sample independent)
Yes (actions affect future states)
Exploration needed?
No
Yes — fundamental challenge
Formalism
The Agent-Environment Loop
The formal skeleton of every RL problem. At each timestep t, the agent observes a state, takes an action, and receives a reward and next state.
Agent
Has policy π
Picks action At
At
St, Rt
Environment
Emits Rt+1
Transitions to St+1
Key Definitions
State St
The agent's observation of the world at time t. May be partial (partial observability = POMDP).
Action At
The choice the agent makes. Discrete (chess moves) or continuous (robot joint torques).
Reward Rt+1
Scalar signal from the environment after action At. The only feedback the agent gets.
Policy π
Mapping from states to actions (or action distributions): At ~ π(· | St). The thing being learned.
Trajectory & Return
A trajectory (or episode) is the sequence generated by one run:
τ = S₀, A₀, R₁, S₁, A₁, R₂, S₂, ...
The agent's objective is to maximise the return — discounted cumulative reward:
Gt = Rt+1 + γ·Rt+2 + γ²·Rt+3 + ...
γ ∈ [0,1] is the discount factor. γ=0 = myopic (only next reward). γ=1 = infinite horizon (all future rewards equally).
The Agent-Environment Boundary
The boundary is drawn at the limits of the agent's direct control, not physical limits. A robot arm's controller is the agent; the arm mechanics and physics are the environment — even if both run in the same process.
Interactive
Agent-Environment Loop: Step Through It
Watch a simulated agent navigate a grid. Each step shows St → At → Rt+1 → St+1 explicitly.
Trajectory
—
State St
—
Action At
—
Reward Rt+1
0
Return Gt
Multi-Armed Bandits
The Multi-Armed Bandit Problem
The simplest RL setting that isolates exploration vs. exploitation. No states, no dynamics — just k arms with unknown reward distributions. The purest version of the exploration problem.
The Casino Analogy
You walk into a casino with k slot machines ("one-armed bandits"). Each machine has an unknown payout distribution. You want to maximise total reward over n pulls. You don't know which machine is best — you have to learn by pulling.
Formal Setup
k arms, each with unknown true value q*(a)
Agent picks action at ∈ {1,...,k} each step
Receives reward Rt ~ N(q*(at), 1)
Goal: maximise cumulative reward Σ Rt
# Bandit setup — true values unknown to agentq_star = np.random.randn(k) # drawn once from N(0,1), hiddenreward = np.random.normal(q_star[a], 1) # observed per pull
The challenge: You can't compute q*(a) — only estimate it from noisy samples. Pull arm 3 five times and get rewards [0.2, 1.4, -0.3, 0.8, 1.1]. The true q*(3) could be anywhere. How confident are you?
Why Bandits Matter for LLMs
RLHF Reward Model
Each response is an "arm". The reward model evaluates quality — like a bandit signal for alignment training.
Beam Search
Choosing which partial sequences to expand is a bandit-style tradeoff: exploit known good prefixes vs explore new paths.
A/B Testing
Multi-armed bandit algorithms power adaptive A/B tests — allocating traffic to versions while learning which is better.
Value Estimation
Estimating Action Values
Since q*(a) is unknown, we estimate it. The sample average is the natural estimator — and its incremental form is efficient and online-updatable.
Sample Average Estimate
Q_t(a) = (sum of rewards when a was chosen)
─────────────────────────────────────
(number of times a was chosen)
By the Law of Large Numbers: Qt(a) → q*(a) as Nt(a) → ∞. More pulls = better estimate.
Incremental Update Rule
Recomputing the full average each time is wasteful. Derive the online update:
Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
This is a stochastic gradient descent step toward the new observation Rn, with step size 1/n. No need to store all past rewards.
Why the Incremental Form Matters
1
Memory O(1)
Store only Qn and N. No list of past rewards needed.
2
Constant time per step
O(1) update regardless of how many times the arm has been pulled.
3
Generalizes to non-stationary
Replace 1/n with constant α for non-stationary bandits (recent rewards count more).
4
Same pattern everywhere in RL
TD-learning, Q-learning, and policy gradients all use this "nudge toward target" structure.
The term (Target − OldEstimate) is the error. The step size controls how fast we move toward the new target. This pattern appears throughout RL.
Exploration Strategies
4 Action-Selection Strategies
Each strategy answers "given my current estimates Qt(a), which arm should I pull next?" differently — trading off exploration and exploitation in different ways.
GreedyPure exploitation
Greedy Selection
A_t = argmax_a Q_t(a)
Always pick the arm with the highest current estimate. Zero exploration after initialization.
+ Fastest short-term gains
− Locks in on first good arm found. Misses globally optimal arms. Gets stuck in local optima.
On the 10-armed testbed: only finds the best arm ~33% of runs — just by chance on first pull.
ε-GreedySimplest exploration
Epsilon-Greedy
With prob ε: random arm
With prob 1-ε: argmax Q_t(a)
With probability ε explore uniformly at random; otherwise exploit. ε=0.1 means 10% random exploration.
+ Simple, robust, guaranteed exploration
− Explores uniformly — wastes pulls on clearly bad arms. Fixed ε wastes even late in learning.
ε=0.1 reaches ~80% optimal action rate at step 1000 on the testbed.
Optimistic InitExploration via optimism
Optimistic Initial Values
Q_0(a) = +5 (all arms)
then: Greedy selection
Initialize all estimates high. The agent is "disappointed" by every arm it pulls (actual rewards are lower), forcing it to try all arms early. The optimism fades as more samples arrive.
+ Automatic early exploration without ε parameter
− Only works for stationary bandits. Q₀ must be tuned. No adaptation when environment changes.
Exploration is front-loaded — strong early, then purely greedy.
UCBPrincipled exploration
Upper Confidence Bound
A_t = argmax_a [ Q_t(a) + c√(ln(t)/N_t(a)) ]
Select the arm that could plausibly be best. The bonus c√(ln(t)/N_t(a)) is large for rarely-pulled arms (high uncertainty) and shrinks as we gather more data.
− Harder to tune (c parameter). Slower start than ε-greedy on early steps.
Best long-run performance on the testbed. Focuses exploration on uncertain arms, not random ones.
Strategy
Exploration
Convergence
Non-stationary?
Parameters
Greedy
None (after init)
Suboptimal
No
None
ε-Greedy
Uniform random
ε=0.1 → ~80% opt
Yes (explore adapts)
ε
Optimistic Init
Forced early
Fast start, then greedy
No
Q₀
UCB
Uncertainty-weighted
Near-optimal
Partial
c
UCB Formula Explorer
Upper Confidence Bound — Deep Dive
UCB is built on a simple insight: when choosing, optimistically assume each arm is as good as it could plausibly be. The "upper confidence bound" is the optimistic ceiling.
At = argmaxa [ Qt(a) + c · √( ln(t) / Nt(a) ) ]
Qt(a) = current estimate+ exploration bonusc = confidence levelln(t) = time pressure growsNt(a) = pulls so far
Why ln(t) in the Numerator?
ln(t) grows slowly — exploration pressure increases over time but at a sub-linear rate. This means every arm gets pulled infinitely often, but the most uncertain arms get pulled much more. It's the "optimism in the face of uncertainty" principle.
Why Nt(a) in the Denominator?
As you pull arm a more, the bonus shrinks. Well-sampled arms need less exploration because you're confident in Qt(a). Under-sampled arms get a larger bonus — they're prioritized for exploration automatically.
UCB Score Explorer
Adjust parameters to see how the UCB score changes.
Current estimate Qt(a)1.0
Confidence c2.0
Total steps t100
Times pulled Nt(a)5
UCB Score
—
Key insight: UCB selects the arm at the top of a credible interval, not the arm with the best point estimate. This is why it handles uncertainty correctly — an arm pulled once with reward 0.5 gets a bigger bonus than an arm pulled 200 times with estimate 0.3.
Interactive Simulator
Live Multi-Armed Bandit Simulator
Pull arms manually or let a strategy play automatically. The true values q*(a) are hidden — can you find the best arm?
Strategy:
True best: Arm ?
0.0
Total Reward
0
Steps
0.0
Avg Reward
0%
Optimal %
Classic Benchmark
The 10-Armed Testbed
The canonical benchmark from Sutton & Barto Ch. 2. 2,000 runs × 1,000 steps each. Four strategies compared head-to-head.
Setup
k=10
Arms
N(0,1)
q*(a) distribution
2000
Independent runs
1000
Steps per run
Average Reward Over Time
Greedy
ε=0.01
ε=0.1
UCB c=2
Results at Step 1000
Strategy
Avg Reward
% Optimal Action
Exploration
Verdict
Greedy
~1.00
~33%
None
Fails — locks in early
ε=0.01
~1.35
~60%
Too little
Slow — misses optimal
ε=0.1
~1.45
~80%
Good balance
Strong — simple & robust
UCB c=2
~1.55
~92%
Smart
Best — principled exploration
The signature of exploration-exploitation: Greedy rises fastest early (pure exploitation) then stalls. ε-greedy rises slower but keeps improving. UCB starts slower still (tries all arms) but eventually dominates. The strategy best right now is rarely best in 1,000 steps.
Summary
Full Strategy Comparison
When to use each strategy — the practical guide.
Decision Guide
?
Do you have infinite time?
If not, greedy is too risky. Any exploration is better than none.
?
Is the environment stationary?
Stationary → optimistic init works well. Non-stationary → ε-greedy or UCB with constant α.
?
Do you need simplicity?
ε-greedy: one parameter, easy to tune, surprisingly robust in practice.
?
Want provable guarantees?
UCB: achieves Lai-Robbins lower bound on regret. Recommended when asymptotic optimality matters.
The Big Picture
Bandits are the simplest possible RL problem — no state transitions. But they capture the core tension of RL. Every algorithm in full RL generalizes from bandit intuitions.
Concept
In Bandits
In Full RL
Value estimation
Qt(a)
Q(s,a) or V(s)
Exploration
Arm selection
Policy entropy, UCB
Update rule
Q ← Q + α(R-Q)
TD error δ = R + γV(s') - V(s)
Credit assignment
Immediate
Temporal difference
Optimal policy
argmax Q*(a)
π*(s) = argmax Q*(s,a)
What's next: Part 2 of this RL series will introduce the full Markov Decision Process (MDP), value functions, Bellman equations, and dynamic programming — the foundations for Q-learning, policy gradients, and modern deep RL.