🎰
Foundations of RL
Bandits, Exploration & the Agent-Environment Loop · Visual Summary
Incorrect password. Try again.
RL Properties
Agent Loop
Bandits
Strategies
Simulator

Why Reinforcement Learning, Why Now?

On 5 March 2025, Sutton & Barto won the 2024 ACM Turing Award for "the conceptual and algorithmic foundations of reinforcement learning." Their 1998 textbook has been cited 75,000+ times. Then every major LLM shifted to RL post-training.

75K+
Citations of
Sutton & Barto
1990s
TD-Gammon,
first RL milestone
2016
AlphaGo beats
Lee Sedol
2025
Turing Award
+ agentic RL
RL Milestones
1994 — TD-Gammon
Gerald Tesauro's TD-Gammon learns backgammon at near-expert level using temporal difference learning — RL's first major proof of concept.
2013 — DQN
DeepMind's DQN combines Q-learning with deep neural networks, achieving superhuman Atari performance from raw pixels.
2016 — AlphaGo
AlphaGo defeats Lee Sedol 4-1. Monte Carlo Tree Search + deep RL. Go's 10170 states had been considered unsolvable.
2022–2024 — RLHF era
ChatGPT, GPT-4, Claude, Gemini — all use RLHF or DPO for alignment. RL becomes the defining post-training step for LLMs.
2025 — Agentic RL
DeepSeek-R1 trains reasoning chains via GRPO. RL is now used to teach thinking — not just helpfulness. Turing Award for Sutton & Barto.

What Is Reinforcement Learning?

RL is the third paradigm of ML. An agent interacts with an environment, takes actions, and learns from scalar reward signals — with no teacher providing correct answers.

Supervised Learning
Input Feature vector
Label Correct answer given
Signal Error vs ground truth
Learns mapping X → Y
Unsupervised Learning
Input Feature vector
Label No labels
Signal Reconstruction / density
Learns structure in X
Reinforcement Learning
Input State observation
Label Reward (evaluative)
Signal Cumulative future reward
Learns policy π: S → A
The core RL loop: The agent observes state St, chooses action At, the environment transitions to St+1 and emits reward Rt+1. The agent's goal is to maximize the expected sum of future rewards.

4 Properties That Define RL

These four characteristics distinguish RL from supervised and unsupervised learning — they also explain why RL is fundamentally harder.

1
Evaluative, Not Instructive
In SL, the teacher tells you the correct answer: "the label is cat." In RL, the environment only says "that was worth +1" or "-5" — it evaluates what you did, without telling you what you should have done. The agent must infer which action was actually best.
SL: "correct answer is A"
RL: "reward = +3"
2
Non-i.i.d. Data Distribution
In SL, training samples are independent and identically distributed (i.i.d.) — the distribution is fixed. In RL, the agent's own actions determine which states it visits next. Policy changes the data distribution. A policy that explores widely collects different data than a greedy one.
The training data is a function of the agent's current behavior — this creates a feedback loop absent in SL.
3
Delayed Consequences
Actions have consequences that only become clear later. A chess move might look neutral but set up a winning position 10 moves ahead. This is the credit assignment problem — which of the past actions actually caused the reward?
Example — Chess
Move 1 → Move 2 → ... → Move 40 → Win (+1)
Which moves contributed to the win? The credit assignment problem.
4
Exploration vs. Exploitation
Should the agent use what it already knows (exploit) or try new things to discover better options (explore)? Pure exploitation locks in suboptimal actions. Pure exploration wastes time on arms you've already evaluated.
Exploit only → Miss better options
Balance → Long-run optimum
PropertySupervised LearningReinforcement Learning
Feedback typeInstructive (correct label given)Evaluative (reward only)
Data distributioni.i.d., fixedNon-i.i.d., policy-dependent
Temporal structureNo (each sample independent)Yes (actions affect future states)
Exploration needed?NoYes — fundamental challenge

The Agent-Environment Loop

The formal skeleton of every RL problem. At each timestep t, the agent observes a state, takes an action, and receives a reward and next state.

Agent
Has policy π
Picks action At
At
St, Rt
Environment
Emits Rt+1
Transitions to St+1
Key Definitions
State St
The agent's observation of the world at time t. May be partial (partial observability = POMDP).
Action At
The choice the agent makes. Discrete (chess moves) or continuous (robot joint torques).
Reward Rt+1
Scalar signal from the environment after action At. The only feedback the agent gets.
Policy π
Mapping from states to actions (or action distributions): At ~ π(· | St). The thing being learned.
Trajectory & Return
A trajectory (or episode) is the sequence generated by one run:
τ = S₀, A₀, R₁, S₁, A₁, R₂, S₂, ...
The agent's objective is to maximise the return — discounted cumulative reward:
Gt = Rt+1 + γ·Rt+2 + γ²·Rt+3 + ...
γ ∈ [0,1] is the discount factor. γ=0 = myopic (only next reward). γ=1 = infinite horizon (all future rewards equally).
The Agent-Environment Boundary
The boundary is drawn at the limits of the agent's direct control, not physical limits. A robot arm's controller is the agent; the arm mechanics and physics are the environment — even if both run in the same process.

Agent-Environment Loop: Step Through It

Watch a simulated agent navigate a grid. Each step shows St → At → Rt+1 → St+1 explicitly.

Trajectory
State St
Action At
Reward Rt+1
0
Return Gt

The Multi-Armed Bandit Problem

The simplest RL setting that isolates exploration vs. exploitation. No states, no dynamics — just k arms with unknown reward distributions. The purest version of the exploration problem.

The Casino Analogy
You walk into a casino with k slot machines ("one-armed bandits"). Each machine has an unknown payout distribution. You want to maximise total reward over n pulls. You don't know which machine is best — you have to learn by pulling.
Formal Setup
k arms, each with unknown true value q*(a)
Agent picks action at ∈ {1,...,k} each step
Receives reward Rt ~ N(q*(at), 1)
Goal: maximise cumulative reward Σ Rt
# Bandit setup — true values unknown to agent q_star = np.random.randn(k) # drawn once from N(0,1), hidden reward = np.random.normal(q_star[a], 1) # observed per pull
The challenge: You can't compute q*(a) — only estimate it from noisy samples. Pull arm 3 five times and get rewards [0.2, 1.4, -0.3, 0.8, 1.1]. The true q*(3) could be anywhere. How confident are you?
Why Bandits Matter for LLMs
RLHF Reward Model
Each response is an "arm". The reward model evaluates quality — like a bandit signal for alignment training.
Beam Search
Choosing which partial sequences to expand is a bandit-style tradeoff: exploit known good prefixes vs explore new paths.
A/B Testing
Multi-armed bandit algorithms power adaptive A/B tests — allocating traffic to versions while learning which is better.

Estimating Action Values

Since q*(a) is unknown, we estimate it. The sample average is the natural estimator — and its incremental form is efficient and online-updatable.

Sample Average Estimate
Q_t(a) = (sum of rewards when a was chosen) ───────────────────────────────────── (number of times a was chosen)
By the Law of Large Numbers: Qt(a) → q*(a) as Nt(a) → ∞. More pulls = better estimate.
Incremental Update Rule
Recomputing the full average each time is wasteful. Derive the online update:
Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
This is a stochastic gradient descent step toward the new observation Rn, with step size 1/n. No need to store all past rewards.
Why the Incremental Form Matters
1
Memory O(1)
Store only Qn and N. No list of past rewards needed.
2
Constant time per step
O(1) update regardless of how many times the arm has been pulled.
3
Generalizes to non-stationary
Replace 1/n with constant α for non-stationary bandits (recent rewards count more).
4
Same pattern everywhere in RL
TD-learning, Q-learning, and policy gradients all use this "nudge toward target" structure.
General Update Form
NewEstimate ← OldEstimate + StepSize × (Target − OldEstimate)
The term (Target − OldEstimate) is the error. The step size controls how fast we move toward the new target. This pattern appears throughout RL.

4 Action-Selection Strategies

Each strategy answers "given my current estimates Qt(a), which arm should I pull next?" differently — trading off exploration and exploitation in different ways.

Greedy Pure exploitation
Greedy Selection
A_t = argmax_a Q_t(a)
Always pick the arm with the highest current estimate. Zero exploration after initialization.
+ Fastest short-term gains
− Locks in on first good arm found. Misses globally optimal arms. Gets stuck in local optima.
On the 10-armed testbed: only finds the best arm ~33% of runs — just by chance on first pull.
ε-Greedy Simplest exploration
Epsilon-Greedy
With prob ε: random arm With prob 1-ε: argmax Q_t(a)
With probability ε explore uniformly at random; otherwise exploit. ε=0.1 means 10% random exploration.
+ Simple, robust, guaranteed exploration
− Explores uniformly — wastes pulls on clearly bad arms. Fixed ε wastes even late in learning.
ε=0.1 reaches ~80% optimal action rate at step 1000 on the testbed.
Optimistic Init Exploration via optimism
Optimistic Initial Values
Q_0(a) = +5 (all arms) then: Greedy selection
Initialize all estimates high. The agent is "disappointed" by every arm it pulls (actual rewards are lower), forcing it to try all arms early. The optimism fades as more samples arrive.
+ Automatic early exploration without ε parameter
− Only works for stationary bandits. Q₀ must be tuned. No adaptation when environment changes.
Exploration is front-loaded — strong early, then purely greedy.
UCB Principled exploration
Upper Confidence Bound
A_t = argmax_a [ Q_t(a) + c√(ln(t)/N_t(a)) ]
Select the arm that could plausibly be best. The bonus c√(ln(t)/N_t(a)) is large for rarely-pulled arms (high uncertainty) and shrinks as we gather more data.
+ Smart, targeted exploration. Provably near-optimal (Lai-Robbins lower bound)
− Harder to tune (c parameter). Slower start than ε-greedy on early steps.
Best long-run performance on the testbed. Focuses exploration on uncertain arms, not random ones.
StrategyExplorationConvergenceNon-stationary?Parameters
GreedyNone (after init)SuboptimalNoNone
ε-GreedyUniform randomε=0.1 → ~80% optYes (explore adapts)ε
Optimistic InitForced earlyFast start, then greedyNoQ₀
UCBUncertainty-weightedNear-optimalPartialc

Upper Confidence Bound — Deep Dive

UCB is built on a simple insight: when choosing, optimistically assume each arm is as good as it could plausibly be. The "upper confidence bound" is the optimistic ceiling.

At = argmaxa [ Qt(a) + c · √( ln(t) / Nt(a) ) ]
Qt(a) = current estimate + exploration bonus c = confidence level ln(t) = time pressure grows Nt(a) = pulls so far
Why ln(t) in the Numerator?
ln(t) grows slowly — exploration pressure increases over time but at a sub-linear rate. This means every arm gets pulled infinitely often, but the most uncertain arms get pulled much more. It's the "optimism in the face of uncertainty" principle.
Why Nt(a) in the Denominator?
As you pull arm a more, the bonus shrinks. Well-sampled arms need less exploration because you're confident in Qt(a). Under-sampled arms get a larger bonus — they're prioritized for exploration automatically.
UCB Score Explorer
Adjust parameters to see how the UCB score changes.
Current estimate Qt(a) 1.0
Confidence c 2.0
Total steps t 100
Times pulled Nt(a) 5
UCB Score
Key insight: UCB selects the arm at the top of a credible interval, not the arm with the best point estimate. This is why it handles uncertainty correctly — an arm pulled once with reward 0.5 gets a bigger bonus than an arm pulled 200 times with estimate 0.3.

Live Multi-Armed Bandit Simulator

Pull arms manually or let a strategy play automatically. The true values q*(a) are hidden — can you find the best arm?

Strategy:
True best: Arm ?
0.0
Total Reward
0
Steps
0.0
Avg Reward
0%
Optimal %

The 10-Armed Testbed

The canonical benchmark from Sutton & Barto Ch. 2. 2,000 runs × 1,000 steps each. Four strategies compared head-to-head.

Setup
k=10
Arms
N(0,1)
q*(a) distribution
2000
Independent runs
1000
Steps per run
Average Reward Over Time
Greedy
ε=0.01
ε=0.1
UCB c=2
Results at Step 1000
StrategyAvg Reward% Optimal ActionExplorationVerdict
Greedy ~1.00 ~33% None Fails — locks in early
ε=0.01 ~1.35 ~60% Too little Slow — misses optimal
ε=0.1 ~1.45 ~80% Good balance Strong — simple & robust
UCB c=2 ~1.55 ~92% Smart Best — principled exploration
The signature of exploration-exploitation: Greedy rises fastest early (pure exploitation) then stalls. ε-greedy rises slower but keeps improving. UCB starts slower still (tries all arms) but eventually dominates. The strategy best right now is rarely best in 1,000 steps.

Full Strategy Comparison

When to use each strategy — the practical guide.

Decision Guide
?
Do you have infinite time?
If not, greedy is too risky. Any exploration is better than none.
?
Is the environment stationary?
Stationary → optimistic init works well. Non-stationary → ε-greedy or UCB with constant α.
?
Do you need simplicity?
ε-greedy: one parameter, easy to tune, surprisingly robust in practice.
?
Want provable guarantees?
UCB: achieves Lai-Robbins lower bound on regret. Recommended when asymptotic optimality matters.
The Big Picture
Bandits are the simplest possible RL problem — no state transitions. But they capture the core tension of RL. Every algorithm in full RL generalizes from bandit intuitions.
ConceptIn BanditsIn Full RL
Value estimationQt(a)Q(s,a) or V(s)
ExplorationArm selectionPolicy entropy, UCB
Update ruleQ ← Q + α(R-Q)TD error δ = R + γV(s') - V(s)
Credit assignmentImmediateTemporal difference
Optimal policyargmax Q*(a)π*(s) = argmax Q*(s,a)
What's next: Part 2 of this RL series will introduce the full Markov Decision Process (MDP), value functions, Bellman equations, and dynamic programming — the foundations for Q-learning, policy gradients, and modern deep RL.
← Previous Post
Post 44 — LLM Training Pipeline
Next Post →
Post 46 — Hermes Agent