Section 01 — Overview
The Predictive Mind
Before acting, intelligent agents form an internal model of how the world works — predicting consequences before they happen. World Models (Ha & Schmidhuber, 2018) gives this intuition a concrete neural architecture.
Key figures: 3 components (V·M·C) · 32 latent dimensions (z) · 256 hidden units (h) · 906 Car Racing score
The Core Idea
An agent builds a compressed model of its environment. It learns what happens next given current state and action — enabling planning and imagination without interacting with the real world.
Why Compress?
Raw pixels (64×64×3 = 12,288 dims) are too high-dimensional for efficient learning. A VAE distills each frame into 32 meaningful numbers — keeping what matters, discarding noise.
Why Dream?
Real environment rollouts are slow and expensive. Once you have a world model, you can run thousands of imagined rollouts per second — training policies entirely inside the agent's head.
How is this related to human cognition?
Neuroscientists have long argued that the brain is a predictive machine. The visual cortex doesn't passively receive signals — it actively predicts incoming sensory data, updating only on prediction errors. The World Models framework mirrors this: the V model encodes raw observations, the M model predicts what comes next (analogous to cortical prediction), and the C controller acts on compressed, predicted representations rather than raw input. Karl Friston's Free Energy Principle and the Predictive Coding hypothesis both align with this view of intelligence as minimising surprise through world modelling.
What environments were tested?
Two main environments: Car Racing v0 (OpenAI Gym) — a top-down racing game where the agent must navigate randomly generated tracks, receiving pixel observations at 64×64. The agent controls steering, gas, and brakes. VizDoom (take_cover scenario) — a first-person shooter where the agent must dodge fireballs. VizDoom is used to demonstrate "dream training" — training the controller entirely inside the world model without touching the real environment.
Explore the Architecture →
Section 02 — Architecture
V-M-C: Three Components, One Agent
The agent is deliberately split into a large, expressive world model (V+M) and a tiny, fast controller (C). Most complexity lives in the world model — the controller is intentionally minimal.
Data Flow: At each timestep, the raw observation o_t (64×64 pixels) enters the Vision model (V), which encodes it to a compact latent vector z_t. The Memory model (M) receives z_t and outputs a hidden state h_t. The tiny Controller (C) takes [z_t, h_t] and outputs action a_t. The environment responds with o_{t+1}, completing the loop.
-- At each step t:
z_t = V.encode(o_t) # 64×64 pixels → 32-dim latent
h_t = M.step(z_t, a_{t-1}) # LSTM hidden state (256 units)
a_t = C([z_t, h_t]) # Linear controller: (32+256) → 3 actions
o_{t+1} = environment.step(a_t) # Real or dreamed environment
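The loop above can be sketched as runnable Python with placeholder models. The stub functions below (`v_encode`, `m_step`, `c_act`) are hypothetical stand-ins with the paper's dimensions (32-dim z, 256-dim h, 3 actions), not the trained networks; only the wiring between V, M, and C is the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM, H_DIM, A_DIM = 32, 256, 3

def v_encode(obs):
    """Stub VAE encoder: 64x64x3 pixels -> 32-dim latent (random placeholder)."""
    return rng.standard_normal(Z_DIM)

def m_step(z, a, h):
    """Stub LSTM update: new 256-dim hidden state from (z, a, h)."""
    return np.tanh(h + 0.1 * rng.standard_normal(H_DIM))

# Linear controller weights, as in the paper: a = W[z; h] + b
W = rng.standard_normal((A_DIM, Z_DIM + H_DIM)) * 0.01
b = np.zeros(A_DIM)

def c_act(z, h):
    """Linear controller on the concatenated [z, h] vector."""
    return np.tanh(W @ np.concatenate([z, h]) + b)

h = np.zeros(H_DIM)
obs = np.zeros((64, 64, 3))       # placeholder first observation
for t in range(5):
    z = v_encode(obs)             # V: compress observation
    a = c_act(z, h)               # C: act on [z, h]
    h = m_step(z, a, h)           # M: update temporal context
    obs = np.zeros((64, 64, 3))   # a real environment.step(a) would go here

print(a.shape)  # (3,)
```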
V — Vision
Variational Autoencoder. Encodes 64×64×3 observations to a 32-dim Gaussian posterior (μ, σ). Trained with reconstruction + KL loss. Runs in ≈1 ms per frame.
M — Memory
MDN-RNN: LSTM with 256 hidden units + Mixture Density Network output. Predicts the distribution of z_{t+1} given z_t, a_t, h_t. Captures temporal dynamics and uncertainty.
C — Controller
Single linear layer: a_t = W[z_t; h_t] + b. Only 867 parameters (Car Racing). Optimised by CMA-ES (Evolution Strategies) — not gradient descent. Deliberately kept tiny.
The design principle: Keep the controller simple so the world model does the heavy lifting. A powerful world model + simple controller outperforms a complex controller with weak world model.
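The controller's 867-parameter figure follows directly from the stated shapes, which a few lines of arithmetic confirm:

```python
# Controller parameter count for Car Racing:
# a_t = W [z_t; h_t] + b, with z in R^32, h in R^256, and 3 actions.
z_dim, h_dim, a_dim = 32, 256, 3
n_weights = a_dim * (z_dim + h_dim)  # 3 x 288 = 864 entries in W
n_bias = a_dim                       # 3 entries in b
print(n_weights + n_bias)            # 867, matching the paper
```

A search space this small is exactly why gradient-free CMA-ES is practical here.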
Explore the VAE →
Section 03 — Vision Model
Vision: Variational Autoencoder
The V model learns to compress high-dimensional observations into a dense latent space, and reconstruct them back. Each dimension of z encodes a distinct visual concept.
Explore Latent Dimensions
Each slider perturbs one latent dimension. In practice, different z dims encode track curvature, distance to edge, car orientation, road texture, and horizon tilt.
-- VAE objective (ELBO):
L = E[log p(o|z)] - β · KL[q(z|o) || p(z)]
↑ reconstruction ↑ regularisation
-- Encoding:
μ, log σ² = Encoder(o) # Conv layers → 32 × 2 outputs
z ~ N(μ, σ²) # Reparametrisation trick
= μ + σ · ε, ε ~ N(0, I)
-- Decoding:
ô = Decoder(z) # Deconv layers → 64×64×3
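The encoding step above can be made concrete in numpy. The encoder outputs here are random placeholders standing in for the conv stack; the reparameterisation and the closed-form Gaussian KL term are the parts being illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM = 32

# Hypothetical encoder outputs for one frame (placeholders for Encoder(o)).
mu = rng.standard_normal(Z_DIM)
log_var = rng.standard_normal(Z_DIM) * 0.1

# Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients can flow through mu and sigma during training.
eps = rng.standard_normal(Z_DIM)
z = mu + np.exp(0.5 * log_var) * eps

# Closed-form KL divergence of N(mu, sigma^2) from the prior N(0, I):
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(z.shape, kl >= 0)
```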
Why variational, not just a standard autoencoder?
A standard autoencoder maps each input to a single point in latent space. This creates holes — regions with no training data — which produce garbage when sampled. A VAE maps each input to a distribution (Gaussian). The KL term forces distributions to overlap and cover latent space uniformly. This makes z smooth and interpolable — essential for dream generation, where M samples new z values that must decode sensibly.
VAE architecture details
Encoder: 4 conv layers (32, 64, 128, 256 channels), stride 2, ReLU activation → flatten → two linear heads for μ and log σ². Decoder: Linear → reshape → 4 transposed conv layers, stride 2 → 64×64×3 output. Input images are collected from 10,000 random rollouts in the environment. Training uses β = 1, i.e. a standard VAE rather than a β-VAE.
Explore the MDN-RNN →
Section 04 — Memory Model
Memory: MDN-RNN
The M model is an LSTM augmented with a Mixture Density Network head — predicting not just the next latent state, but a full probability distribution over possible futures.
Temperature Control (τ)
Temperature τ scales the mixture component variances. Higher τ = more stochastic predictions. Used during dream training to prevent adversarial exploitation of model imperfections.
τ = 0.50: Low uncertainty. Predictions are tight. Risk: agent can exploit deterministic hallucinations in dreams that don't transfer to real environments.
-- MDN-RNN:
h_t = LSTM(z_t, a_t, h_{t-1}) # 256 hidden units
-- Mixture Density output (K=5 components):
{π_k, μ_k, σ_k}_{k=1..K} = Linear(h_t)
-- Probability of next latent:
p(z_{t+1} | z_t, a_t, h_t) = Σ_k π_k · N(z_{t+1}; μ_k, σ_k²·τ²)
-- Sampling for dream rollout:
k* ~ Categorical(π_1, ..., π_K)
z_{t+1} ~ N(μ_{k*}, σ_{k*}²·τ²)
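The sampling steps above translate directly to numpy. The MDN head outputs here are random placeholders for one latent dimension (the real model emits one mixture per z dimension from `Linear(h_t)`); the two-stage sample and the τ scaling of the component standard deviations follow the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5        # mixture components
tau = 1.15   # temperature; tau > 1 injects extra uncertainty in dreams

# Hypothetical MDN outputs for a single latent dimension (placeholders).
logits = rng.standard_normal(K)
mu = rng.standard_normal(K)
sigma = np.exp(rng.standard_normal(K) * 0.1)

# Mixture weights via a numerically stable softmax.
pi = np.exp(logits - logits.max())
pi /= pi.sum()

# Two-stage sample: pick a component, then draw from its Gaussian,
# with the std dev scaled by tau as in the formula above.
k = rng.choice(K, p=pi)
z_next = rng.normal(mu[k], sigma[k] * tau)

print(float(z_next))
```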
Why MDN?
Environments are stochastic. A single Gaussian prediction says "something around here." An MDN says "probably here, but sometimes here or here." Multimodal — essential for branching futures.
Why LSTM?
The agent's z_t snapshot doesn't capture history. The LSTM's hidden state h_t accumulates temporal context — "I've been turning right for 5 steps" — enabling prediction of dynamics invisible in a single frame.
Role of τ
At τ=1.0, predictions match training distribution. At τ<1.0, the model is overconfident. At τ>1.0, it is more uncertain — forcing the controller to learn robust policies that work despite noisy predictions.
Explore Dream Training →
Section 05 — Dream Training
Training in Dreams
The key insight of the paper: once V and M are trained, the controller C can be evolved entirely inside the world model — never touching the real environment. The world model becomes a trainable simulator.
Real vs Dream: On the left, the agent receives raw pixel observations from the real environment (slow, one frame at a time). On the right, the world model generates hallucinated observations — running thousands of imagined rollouts per second. Both use the same policy.
VizDoom Dream Training: For the VizDoom take_cover task, Ha & Schmidhuber trained the controller entirely in dreams — zero real environment interaction during controller training. When deployed in the real game, the resulting policy survived for more than 1000 timesteps.
How does dream training work step-by-step?
Step 1: Collect random rollouts in real environment (no policy needed).
Step 2: Train VAE (V) on collected frames — pure unsupervised learning.
Step 3: Train MDN-RNN (M) on sequences of encoded z values + actions.
Step 4: Freeze V and M. Run CMA-ES inside the dream: sample controller weights → imagine rollout using M → evaluate reward → update CMA-ES.
Step 5: Deploy trained controller in real environment.
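The five-step pipeline can be sketched end to end. Everything below is a placeholder: the dream dynamics and reward are random stand-ins for the frozen MDN-RNN, and plain random search stands in for CMA-ES; only the structure (sample controller weights → imagine rollout → evaluate → keep the best) mirrors Step 4.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM, H_DIM, A_DIM = 32, 256, 3
N_PARAMS = A_DIM * (Z_DIM + H_DIM) + A_DIM  # 867 controller parameters

def dream_rollout(params, horizon=100):
    """Imagined rollout: no real environment is touched.
    The latent dynamics and reward here are random placeholders
    standing in for the frozen V and M models."""
    W = params[:-A_DIM].reshape(A_DIM, Z_DIM + H_DIM)
    b = params[-A_DIM:]
    z, h = rng.standard_normal(Z_DIM), np.zeros(H_DIM)
    total = 0.0
    for _ in range(horizon):
        a = np.tanh(W @ np.concatenate([z, h]) + b)
        z = rng.standard_normal(Z_DIM)      # stand-in for an MDN-RNN sample
        h = np.tanh(h + 0.1 * z.mean())     # stand-in for the LSTM update
        total += -np.sum(a**2)              # placeholder reward signal
    return total

# Step 4 stand-in: random search where the paper uses CMA-ES.
best, best_score = None, -np.inf
for _ in range(20):
    cand = rng.standard_normal(N_PARAMS) * 0.1
    score = dream_rollout(cand)
    if score > best_score:
        best, best_score = cand, score

print(best.shape)  # (867,)
```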
The Exploit Problem: Agents trained in dreams discover adversarial policies — actions that maximise reward in the imperfect world model but fail catastrophically in reality. The model has blind spots the controller learns to abuse.
How is the exploit prevented?
The temperature parameter τ is increased during dream training. Higher τ widens the MDN's mixture components, injecting controlled uncertainty into the hallucinated rollouts. The controller can no longer rely on exploiting specific predicted states because those predictions are now noisier. This forces it to learn policies robust to model uncertainty — policies that generalise to the real environment.
See the Results →
Section 06 — Results
Benchmark Results
World Models set a new state-of-the-art on Car Racing v0 and demonstrated successful dream-trained transfer on VizDoom — using far fewer real environment interactions than competing methods.
Car Racing v0 — Score Comparison
Key result: World Models (dream training) scored 906 ± 21 — solving the environment (threshold: 900) and exceeding all prior methods. The world model trained C needed only a fraction of the real environment steps used by model-free baselines.
Sample Efficiency
Model-free RL (A3C, PPO) typically requires 100M+ environment steps to converge on Car Racing. World Models needs only ~10,000 random rollouts to train V and M, then evolves C entirely in dreams — orders of magnitude more sample efficient.
Controller Simplicity
The Car Racing controller has only 867 parameters. The entire learning capacity is in the world model — not the policy. CMA-ES can search this tiny space efficiently without gradients.
See the Dreamer Evolution →
Section 07 — Evolution
From World Models to DreamerV3
Ha & Schmidhuber's framework sparked a research lineage. The Dreamer series (Hafner et al.) extended world model RL to achieve human-level Atari and conquer Minecraft.
2018
World Models — Ha & Schmidhuber
V-M-C architecture, MDN-RNN, dream training, Car Racing 906±21. Introduced the concept of training policies entirely in learned world models. CMA-ES controller, temperature trick for robustness.
VAE (V)
MDN-RNN (M)
CMA-ES (C)
Dream Training
2019–2020
DreamerV1 — Hafner et al.
Introduced RSSM (Recurrent State Space Model): separates deterministic path (GRU) from stochastic path (Gaussian latent). Replaced CMA-ES with backpropagation through imagination (BPTT). Achieved state-of-the-art on 20 DeepMind Control Suite tasks.
RSSM
BPTT
DMControl ×20
2021
DreamerV2 — Hafner et al.
Switched to discrete latent representations (straight-through gradients). First model-based RL agent to achieve human-level performance on Atari 55-game benchmark. Matched Rainbow/IQN using 200M frames vs their 1.8B frames. Also solved humanoid stand-up and walking from pixels.
Discrete Latents
Atari ×55
Human-Level
2023
DreamerV3 — Hafner et al.
Single fixed configuration across 150+ diverse tasks. Introduced symlog transformations, per-return normalisation, and free bits to stabilise training across varying reward scales. First algorithm to collect diamonds in Minecraft from scratch — without human demonstrations, reward shaping, or curriculum. Outperforms domain-specific methods on Atari, Crafter, DMLab, Robosuite, BSuite, and more.
150+ Tasks
Minecraft Diamonds
One Config
The RSSM Insight
DreamerV1 split the world model into two paths: a deterministic GRU (carries the past faithfully) and a stochastic Gaussian (models uncertainty). This lets the model handle both predictable dynamics and surprising events without conflating them.
Discrete ≥ Continuous
DreamerV2's discovery: discrete latents (categorical via Gumbel-softmax / straight-through) outperform Gaussian latents on Atari. Hypothesis: visual frames contain discrete structure (objects, edges) that discrete codes capture more efficiently.
Scale Without Tuning
DreamerV3's symlog transformation normalises rewards and values into a consistent range regardless of their true scale. This single change lets one set of hyperparameters work across tasks where rewards differ by 10,000×.
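The symlog transform and its inverse are two one-liners, shown here as a minimal sketch; the bijection sign(x)·ln(|x| + 1) squashes rewards spanning four orders of magnitude into single digits while remaining exactly invertible.

```python
import numpy as np

def symlog(x):
    """DreamerV3's symlog: sign(x) * ln(|x| + 1)."""
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return np.sign(x) * (np.exp(np.abs(x)) - 1.0)

# Rewards differing by 10,000x land in a narrow, symmetric range.
rewards = np.array([-10000.0, -1.0, 0.0, 1.0, 10000.0])
squashed = symlog(rewards)
print(squashed)           # roughly [-9.21, -0.69, 0, 0.69, 9.21]
print(symexp(squashed))   # recovers the original rewards
```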
From 2 environments to 150+: In five years, world model RL went from two toy benchmarks to the most general RL algorithm ever demonstrated — with a single configuration. The core intuition from 2018 (compress → predict → act in imagination) remains unchanged.