Introduction

Why Reinforcement Learning, Why Now?

On 5 March 2025, Sutton & Barto won the 2024 ACM Turing Award for "the conceptual and algorithmic foundations of reinforcement learning." Their 1998 textbook has been cited 75,000+ times. Then every major LLM shifted to RL post-training.

75K+

Citations of
Sutton & Barto

1990s

TD-Gammon,
first RL milestone

2016

AlphaGo beats
Lee Sedol

2025

Turing Award
+ agentic RL

RL Milestones

1994 — TD-Gammon

Gerald Tesauro's TD-Gammon learns backgammon at near-expert level using temporal difference learning — RL's first major proof of concept.

2013 — DQN

DeepMind's DQN combines Q-learning with deep neural networks, achieving superhuman Atari performance from raw pixels.

2016 — AlphaGo

AlphaGo defeats Lee Sedol 4-1. Monte Carlo Tree Search + deep RL. Go's 10¹⁷⁰ states had been considered unsolvable.

2022–2024 — RLHF era

ChatGPT, GPT-4, Claude, Gemini — all use RLHF or DPO for alignment. RL becomes the defining post-training step for LLMs.

2025 — Agentic RL

DeepSeek-R1 trains reasoning chains via GRPO. RL is now used to teach thinking — not just helpfulness. Turing Award for Sutton & Barto.

Overview

What Is Reinforcement Learning?

RL is the third paradigm of ML. An agent interacts with an environment, takes actions, and learns from scalar reward signals — with no teacher providing correct answers.

Supervised Learning

Input Feature vector

Label Correct answer given

Signal Error vs ground truth

Learns mapping X → Y

Unsupervised Learning

Input Feature vector

Label No labels

Signal Reconstruction / density

Learns structure in X

Reinforcement Learning

Input State observation

Label Reward (evaluative)

Signal Cumulative future reward

Learns policy π: S → A

The core RL loop: The agent observes state S_t, chooses action A_t, the environment transitions to S_t+1 and emits reward R_t+1. The agent's goal is to maximize the expected sum of future rewards.

What Makes RL Different

4 Properties That Define RL

These four characteristics distinguish RL from supervised and unsupervised learning — they also explain why RL is fundamentally harder.

1

Evaluative, Not Instructive

In SL, the teacher tells you the correct answer: "the label is cat." In RL, the environment only says "that was worth +1" or "-5" — it evaluates what you did, without telling you what you should have done. The agent must infer which action was actually best.

SL: "correct answer is A"

RL: "reward = +3"

2

Non-i.i.d. Data Distribution

In SL, training samples are independent and identically distributed (i.i.d.) — the distribution is fixed. In RL, the agent's own actions determine which states it visits next. Policy changes the data distribution. A policy that explores widely collects different data than a greedy one.

The training data is a function of the agent's current behavior — this creates a feedback loop absent in SL.

3

Delayed Consequences

Actions have consequences that only become clear later. A chess move might look neutral but set up a winning position 10 moves ahead. This is the credit assignment problem — which of the past actions actually caused the reward?

Example — Chess

Move 1 → Move 2 → ... → Move 40 → Win (+1)
Which moves contributed to the win? The credit assignment problem.

4

Exploration vs. Exploitation

Should the agent use what it already knows (exploit) or try new things to discover better options (explore)? Pure exploitation locks in suboptimal actions. Pure exploration wastes time on arms you've already evaluated.

Exploit only → Miss better options

Balance → Long-run optimum

Property	Supervised Learning	Reinforcement Learning
Feedback type	Instructive (correct label given)	Evaluative (reward only)
Data distribution	i.i.d., fixed	Non-i.i.d., policy-dependent
Temporal structure	No (each sample independent)	Yes (actions affect future states)
Exploration needed?	No	Yes — fundamental challenge

Formalism

The Agent-Environment Loop

The formal skeleton of every RL problem. At each timestep t, the agent observes a state, takes an action, and receives a reward and next state.

Agent

Has policy π

Picks action A_t

A_t

S_t, R_t

Environment

Emits R_t+1

Transitions to S_t+1

Key Definitions

State S_t

The agent's observation of the world at time t. May be partial (partial observability = POMDP).

Action A_t

The choice the agent makes. Discrete (chess moves) or continuous (robot joint torques).

Reward R_t+1

Scalar signal from the environment after action A_t. The only feedback the agent gets.

Policy π

Mapping from states to actions (or action distributions): A_t ~ π(· | S_t). The thing being learned.

Trajectory & Return

A trajectory (or episode) is the sequence generated by one run:

τ = S₀, A₀, R₁, S₁, A₁, R₂, S₂, ...

The agent's objective is to maximise the return — discounted cumulative reward:

G_t = R_t+1 + γ·R_t+2 + γ²·R_t+3 + ...

γ ∈ [0,1] is the discount factor. γ=0 = myopic (only next reward). γ=1 = infinite horizon (all future rewards equally).

The Agent-Environment Boundary

The boundary is drawn at the limits of the agent's direct control, not physical limits. A robot arm's controller is the agent; the arm mechanics and physics are the environment — even if both run in the same process.

Interactive

Agent-Environment Loop: Step Through It

Watch a simulated agent navigate a grid. Each step shows S_t → A_t → R_t+1 → S_t+1 explicitly.

Trajectory

—

State S_t

—

Action A_t

—

Reward R_t+1

0

Return G_t

Multi-Armed Bandits

The Multi-Armed Bandit Problem

The simplest RL setting that isolates exploration vs. exploitation. No states, no dynamics — just k arms with unknown reward distributions. The purest version of the exploration problem.

The Casino Analogy

You walk into a casino with k slot machines ("one-armed bandits"). Each machine has an unknown payout distribution. You want to maximise total reward over n pulls. You don't know which machine is best — you have to learn by pulling.

Formal Setup

k arms, each with unknown true value q*(a)

Agent picks action a_t ∈ {1,...,k} each step

Receives reward R_t ~ N(q*(a_t), 1)

Goal: maximise cumulative reward Σ R_t

# Bandit setup — true values unknown to agent
q_star = np.random.randn(k)          # drawn once from N(0,1), hidden
reward = np.random.normal(q_star[a], 1)  # observed per pull

The challenge: You can't compute q*(a) — only estimate it from noisy samples. Pull arm 3 five times and get rewards [0.2, 1.4, -0.3, 0.8, 1.1]. The true q*(3) could be anywhere. How confident are you?

Why Bandits Matter for LLMs

RLHF Reward Model

Each response is an "arm". The reward model evaluates quality — like a bandit signal for alignment training.

Beam Search

Choosing which partial sequences to expand is a bandit-style tradeoff: exploit known good prefixes vs explore new paths.

A/B Testing

Multi-armed bandit algorithms power adaptive A/B tests — allocating traffic to versions while learning which is better.

Value Estimation

Estimating Action Values

Since q*(a) is unknown, we estimate it. The sample average is the natural estimator — and its incremental form is efficient and online-updatable.

Sample Average Estimate

Q_t(a) = (sum of rewards when a was chosen)
       ─────────────────────────────────────
       (number of times a was chosen)

By the Law of Large Numbers: Q_t(a) → q*(a) as N_t(a) → ∞. More pulls = better estimate.

Incremental Update Rule

Recomputing the full average each time is wasteful. Derive the online update:

Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)

This is a stochastic gradient descent step toward the new observation R_n, with step size 1/n. No need to store all past rewards.

Why the Incremental Form Matters

1

Memory O(1)

Store only Q_n and N. No list of past rewards needed.

2

Constant time per step

O(1) update regardless of how many times the arm has been pulled.

3

Generalizes to non-stationary

Replace 1/n with constant α for non-stationary bandits (recent rewards count more).

4

Same pattern everywhere in RL

TD-learning, Q-learning, and policy gradients all use this "nudge toward target" structure.

General Update Form

NewEstimate ← OldEstimate
         + StepSize × (Target − OldEstimate)

The term (Target − OldEstimate) is the error. The step size controls how fast we move toward the new target. This pattern appears throughout RL.

Exploration Strategies

4 Action-Selection Strategies

Each strategy answers "given my current estimates Q_t(a), which arm should I pull next?" differently — trading off exploration and exploitation in different ways.

Greedy Pure exploitation

Greedy Selection

A_t = argmax_a Q_t(a)

Always pick the arm with the highest current estimate. Zero exploration after initialization.

+ Fastest short-term gains

− Locks in on first good arm found. Misses globally optimal arms. Gets stuck in local optima.

On the 10-armed testbed: only finds the best arm ~33% of runs — just by chance on first pull.

ε-Greedy Simplest exploration

Epsilon-Greedy

With prob ε: random arm With prob 1-ε: argmax Q_t(a)

With probability ε explore uniformly at random; otherwise exploit. ε=0.1 means 10% random exploration.

+ Simple, robust, guaranteed exploration

− Explores uniformly — wastes pulls on clearly bad arms. Fixed ε wastes even late in learning.

ε=0.1 reaches ~80% optimal action rate at step 1000 on the testbed.

Optimistic Init Exploration via optimism

Optimistic Initial Values

Q_0(a) = +5 (all arms) then: Greedy selection

Initialize all estimates high. The agent is "disappointed" by every arm it pulls (actual rewards are lower), forcing it to try all arms early. The optimism fades as more samples arrive.

+ Automatic early exploration without ε parameter

− Only works for stationary bandits. Q₀ must be tuned. No adaptation when environment changes.

Exploration is front-loaded — strong early, then purely greedy.

UCB Principled exploration

Upper Confidence Bound

A_t = argmax_a [ Q_t(a) + c√(ln(t)/N_t(a)) ]

Select the arm that could plausibly be best. The bonus c√(ln(t)/N_t(a)) is large for rarely-pulled arms (high uncertainty) and shrinks as we gather more data.

+ Smart, targeted exploration. Provably near-optimal (Lai-Robbins lower bound)

− Harder to tune (c parameter). Slower start than ε-greedy on early steps.

Best long-run performance on the testbed. Focuses exploration on uncertain arms, not random ones.

Strategy	Exploration	Convergence	Non-stationary?	Parameters
Greedy	None (after init)	Suboptimal	No	None
ε-Greedy	Uniform random	ε=0.1 → ~80% opt	Yes (explore adapts)	ε
Optimistic Init	Forced early	Fast start, then greedy	No	Q₀
UCB	Uncertainty-weighted	Near-optimal	Partial	c

UCB Formula Explorer

Upper Confidence Bound — Deep Dive

UCB is built on a simple insight: when choosing, optimistically assume each arm is as good as it could plausibly be. The "upper confidence bound" is the optimistic ceiling.

A_t = argmax_a [ Q_t(a) + c · √( ln(t) / N_t(a) ) ]

Q_t(a) = current estimate + exploration bonus c = confidence level ln(t) = time pressure grows N_t(a) = pulls so far

Why ln(t) in the Numerator?

ln(t) grows slowly — exploration pressure increases over time but at a sub-linear rate. This means every arm gets pulled infinitely often, but the most uncertain arms get pulled much more. It's the "optimism in the face of uncertainty" principle.

Why N_t(a) in the Denominator?

As you pull arm a more, the bonus shrinks. Well-sampled arms need less exploration because you're confident in Q_t(a). Under-sampled arms get a larger bonus — they're prioritized for exploration automatically.

UCB Score Explorer

Adjust parameters to see how the UCB score changes.

Current estimate Q_t(a) 1.0

Confidence c 2.0

Total steps t 100

Times pulled N_t(a) 5

UCB Score

—

Key insight: UCB selects the arm at the top of a credible interval, not the arm with the best point estimate. This is why it handles uncertainty correctly — an arm pulled once with reward 0.5 gets a bigger bonus than an arm pulled 200 times with estimate 0.3.

Interactive Simulator

Live Multi-Armed Bandit Simulator

Pull arms manually or let a strategy play automatically. The true values q*(a) are hidden — can you find the best arm?

Strategy:

True best: Arm ?

0.0

Total Reward

0

Steps

0.0

Avg Reward

0%

Optimal %

Classic Benchmark

The 10-Armed Testbed

The canonical benchmark from Sutton & Barto Ch. 2. 2,000 runs × 1,000 steps each. Four strategies compared head-to-head.

Setup

k=10

Arms

N(0,1)

q*(a) distribution

2000

Independent runs

1000

Steps per run

Average Reward Over Time

Greedy

ε=0.01

ε=0.1

UCB c=2

Results at Step 1000

Strategy	Avg Reward	% Optimal Action	Exploration	Verdict
Greedy	~1.00	~33%	None	Fails — locks in early
ε=0.01	~1.35	~60%	Too little	Slow — misses optimal
ε=0.1	~1.45	~80%	Good balance	Strong — simple & robust
UCB c=2	~1.55	~92%	Smart	Best — principled exploration

The signature of exploration-exploitation: Greedy rises fastest early (pure exploitation) then stalls. ε-greedy rises slower but keeps improving. UCB starts slower still (tries all arms) but eventually dominates. The strategy best right now is rarely best in 1,000 steps.

Summary

Full Strategy Comparison

When to use each strategy — the practical guide.

Decision Guide

?

Do you have infinite time?

If not, greedy is too risky. Any exploration is better than none.

?

Is the environment stationary?

Stationary → optimistic init works well. Non-stationary → ε-greedy or UCB with constant α.

?

Do you need simplicity?

ε-greedy: one parameter, easy to tune, surprisingly robust in practice.

?

Want provable guarantees?

UCB: achieves Lai-Robbins lower bound on regret. Recommended when asymptotic optimality matters.

The Big Picture

Bandits are the simplest possible RL problem — no state transitions. But they capture the core tension of RL. Every algorithm in full RL generalizes from bandit intuitions.

Concept	In Bandits	In Full RL
Value estimation	Q_t(a)	Q(s,a) or V(s)
Exploration	Arm selection	Policy entropy, UCB
Update rule	Q ← Q + α(R-Q)	TD error δ = R + γV(s') - V(s)
Credit assignment	Immediate	Temporal difference
Optimal policy	argmax Q*(a)	π(s) = argmax Q(s,a)

What's next: Part 2 of this RL series will introduce the full Markov Decision Process (MDP), value functions, Bellman equations, and dynamic programming — the foundations for Q-learning, policy gradients, and modern deep RL.