Visual Summary
Defeating Nondeterminism in LLM Inference: Interactive Explorer
Same Prompt. Same Settings. Different Answer.
When you set an AI tool to its most "precise" mode, the mode that is supposed to always pick the single most likely answer, you'd expect the same response every time. Same prompt, same result. That's not what happens in practice.
TL;DR — The Short Version (No Jargon)

The problem: AI services handle many users at once and group requests together to save computation. The group size varies depending on how busy the server is at that moment.


Why it causes variation: Different group sizes change the order in which the AI does its internal arithmetic. Computers can't represent most decimals exactly, so rounding happens at every step, and different addition orders accumulate different rounding errors, enough to change which word the AI picks next.

The fix: Rewrite the three most sensitive calculation routines to always use the same arithmetic order, no matter how many requests are running. Cost: ~1.6× slower. Result: perfectly consistent answers every time.

Same Prompt. Temp=0. Different Outputs.
Prompt (sent 5 times, temperature=0)
Summarize the key contributions of the transformer architecture in one sentence.
highlighted tokens = words that changed vs. Run 1 — same prompt, temperature=0, but different output
80
Unique completions from 1,000 runs
8%
Variance rate (Qwen3-235B, temp=0)
0
Unique completions after the fix

This isn't a temperature problem. temp=0 means greedy sampling: the model always picks the top token. No randomness is involved. Something lower in the stack is causing the variation. And most explanations you've heard about why this happens are wrong.

Next: what's actually causing this The Root Cause →
Why Your Requests Get Different Answers
Here's the plain-English version: an AI service handles many users at once. It groups requests together to save computation. At 9am yours might travel alone; at 9pm yours shares a group with 24 others. That grouping, invisible to you, changes which numbers get added together inside the model, and in computer arithmetic the order you add numbers in can produce slightly different results.
👁 Everyday Analogy

Imagine 10 people each adding the same long list of numbers, but each person splits the list into their own groups to divide the work. Because computers round at every step, different groupings accumulate different rounding errors. The totals don't exactly match, even though everyone started with identical numbers.

Inside an AI model, this happens across millions of calculations in every response. Your request gets "grouped" with different co-requests depending on server load, and those different groupings compound into a measurably different final answer.
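The analogy maps directly to code. Here is an illustrative Python sketch (real kernels operate on GPU tiles, not Python lists) that sums the same numbers under different group sizes, the way different batch sizes regroup a reduction:

```python
import random

def grouped_sum(values, group_size):
    # Each "person" (work group) sums its own slice, then the partial
    # totals are combined -- rounding happens at every single addition.
    partials = [sum(values[i:i + group_size])
                for i in range(0, len(values), group_size)]
    return sum(partials)

random.seed(42)
numbers = [random.uniform(-1e6, 1e6) for _ in range(100_000)]

# Identical numbers, different groupings: the totals agree to many
# significant digits but are usually not bit-for-bit identical.
for size in (1_000, 4_096, 25_000):
    print(size, repr(grouped_sum(numbers, size)))
```

The totals differ only in the last few bits, which is exactly why the problem is easy to dismiss until those bits flip a token choice.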

The Full Causal Chain — Click any step to jump to its section
Step-by-Step: How It Happens
📥 Identical Request

You send the exact same prompt at two different times of day. The tokens, the model, the temperature (0): everything is identical. From your perspective, the requests are bitwise equal.

request = { model: "qwen3-235b", prompt: "Explain...", temperature: 0 }
// sent at 9am and 9pm

"The primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch size) nondeterministically varies."

Source: Thinking Machines Lab, "Defeating Nondeterminism in LLM Inference"

Next: the common explanation, and why it's wrong The Atomic Add Myth →
The Myth: Atomic Adds Are the Culprit
Ask most ML engineers why LLM inference is nondeterministic at temp=0 and they'll say: "atomic add race conditions." This explanation is pervasive, plausible-sounding, and largely incorrect.
โŒ The Myth
⚡

Multiple GPU threads simultaneously write partial sums back to the same memory location using atomic add instructions. Because these concurrent writes happen in unpredictable order, the accumulated total varies each run, creating race-condition nondeterminism.

This sounds compelling. It's also the explanation you'll find in many blog posts, Stack Overflow answers, and framework docs.
✓ The Reality
🔬

LLM forward passes don't require atomic add operations for most kernels. The batch dimension provides enough independent parallelism that work can be tiled across cores without any thread needing to write to a shared memory address. Atomic adds are largely unnecessary and not the source of variance.

The nondeterminism has a different cause entirely, one that requires a different fix.
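To make the "no atomics needed" point concrete, here is a toy illustration (plain Python threads, nothing like a real GPU kernel): each worker owns exactly one output slot, so no two workers ever touch the same accumulator and no atomic writes are required, yet the work is still fully parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def row_sum(row):
    # Each worker reduces its own row; it never reads or writes
    # another worker's data, so there is nothing to race on.
    return sum(row)

batch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# pool.map preserves input order, so each result lands in its own slot.
with ThreadPoolExecutor() as pool:
    out = list(pool.map(row_sum, batch))

print(out)  # [6.0, 15.0, 24.0]
```

This is the structure of most LLM forward-pass kernels: the batch and row dimensions supply enough independent work that shared accumulators are unnecessary.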
Why the Myth Persists
🧩
It's partially true in other domains

Atomic add race conditions do cause nondeterminism in some GPU workloads, particularly sparse operations and custom reductions. Generalizing from there to LLM inference is an understandable but incorrect leap.

🔧
The fix sounds the same

Both the myth and the reality lead to "control the order of operations." The atomic add story and the batch-variance story both suggest fixing kernel execution order, but they point at different operations to fix.

📊
The real cause is less obvious

The actual culprit, batch size variation, requires understanding how inference servers work under load. It's a systems-level cause, not a kernel-level bug.

Animation: Two Mechanisms Side-by-Side
Left: atomic add race conditions. Right: batch variance causing different tiling.

The real culprit isn't concurrency within a single operation. It's that the batch size itself changes, and with it, which operations run together and in what floating-point order.

Next: the math that explains why batch size matters The Math Deep Dive →
Floating-Point Math Isn't What You Think
This section explains the exact mathematical mechanism behind batch variance. You don't need it to understand the fix, but if you want to know precisely why different addition orders produce different results, this is the deep dive.
🎯 The Core Insight

In everyday math, 1 + 2 + 3 = 6 no matter which two numbers you add first. Computers use a compressed number format (called floating-point) where this stops being true: rounding happens at each intermediate step, so different addition orders accumulate different rounding errors and produce slightly different totals.

The interactive demo below makes this concrete. Try the numbers yourself, then explore how GPUs run thousands of these additions in parallel, and why that parallelism is exactly where the variance creeps in.

Interactive: See Non-Associativity Yourself
a =
b =
c =
Order 1: (a + b) + c
Order 2: a + (b + c)
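If you are reading a static copy of this page, the same experiment runs in a few lines of Python. These particular values are chosen so the effect is guaranteed in 64-bit floats:

```python
a, b, c = 0.1, 0.2, 0.3

# None of these decimals is exactly representable in binary floating
# point, so each intermediate sum gets rounded differently.
order1 = (a + b) + c   # 0.6000000000000001
order2 = a + (b + c)   # 0.6

print(order1 == order2)  # False: same numbers, different grouping
```

The difference is about one part in 10^16, invisible on its own, but an LLM forward pass performs millions of such additions per token.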

How This Plays Out on a GPU
Why GPU Parallelism Makes This Worse

A GPU executes thousands of floating-point additions in parallel across CUDA cores. These additions are divided into work groups, and which values end up in the same work group depends on how the scheduler assigns work at runtime. When two runs of the same kernel assign work to cores in slightly different orders, the partial sums are grouped differently.

Both trees add the same four values. The tree structure (which pairs get added first) differs between left and right. Due to floating-point non-associativity, the final sums can differ.
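The two trees can be reproduced in code. In this sketch the four values are cherry-picked (one huge value plus small ones) so the two tree shapes provably disagree in float64, rather than only sometimes:

```python
def tree_sum(values):
    """Pairwise (tree) reduction: repeatedly add adjacent pairs,
    the shape a parallel GPU reduction typically uses."""
    while len(values) > 1:
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
    return values[0]

vals = [1e16, 1.0, 1.0, 1.0]

sequential = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # left-to-right
pairwise = tree_sum(vals)  # (1e16 + 1.0) + (1.0 + 1.0)

# Left-to-right, each lone 1.0 is rounded away against 1e16.
# In the tree, 1.0 + 1.0 = 2.0 is big enough to survive the final add.
print(sequential)             # 1e+16
print(pairwise)               # 1.0000000000000002e+16
print(pairwise - sequential)  # 2.0
```

Real kernels sum values of similar magnitude, so the discrepancy is far smaller, but it is the same mechanism: the tree shape determines the result.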

GPU Architecture: The Full Picture (for the technically curious)
Optional: skip to the fix if you prefer

Every AI response is the product of millions of floating-point additions. Different batch sizes change which additions happen together, and those tiny ordering differences compound through hundreds of layers until the "most likely next word" ranking flips. That's the complete mechanism.
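The final step, a ranking flip, can be shown directly. This toy sketch (the token names and exaggerated values are hypothetical; real logit gaps are tiny) sums the same per-layer contributions to one token's logit in two different orders:

```python
# Two runs sum the SAME contributions to one logit, in different orders.
# Values are chosen so float64 provably disagrees between the orders.
contribs = [1e16, 1.0, 1.0, 1.0, -1e16]

run1 = (((contribs[0] + contribs[1]) + contribs[2]) + contribs[3]) + contribs[4]
run2 = contribs[0] + ((contribs[1] + contribs[2]) + (contribs[3] + contribs[4]))

print(run1, run2)  # 0.0 2.0 -- same inputs, different rounding

# If a rival token's logit sits between the two results, greedy
# (temperature=0) decoding picks a different word in each run.
rival_logit = 1.0
print("run1 picks:", "rival" if rival_logit > run1 else "original")  # rival
print("run2 picks:", "rival" if rival_logit > run2 else "original")  # original
```

Once one token differs, every subsequent token is conditioned on a different prefix, so the completions diverge from that point on.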

Next: how to actually fix this The Fix →
The Fix: Lock the Arithmetic Order
The fix is conceptually simple: always do the arithmetic in the same order, no matter how many requests are running at the same time. In practice, this requires rewriting three specific calculation routines inside the model. The explorer below shows what changes in each one — and why those specific three matter most.
Kernel Fix Explorer: Select an Operation

These changes guarantee bitwise-identical outputs for the same input regardless of batch size, or whether batching happens at all. The same floating-point operations happen in the same order every single time.
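The idea can be sketched in a few lines. This is an illustrative Python model of the strategy, not the actual Thinking Machines kernels: the naive version picks its reduction split based on load, while the batch-invariant version fixes the split once and for all:

```python
import random

def load_dependent_sum(values, batch_size):
    """What a throughput-optimized kernel effectively does: pick the
    reduction split based on how much other work is in flight."""
    chunk = max(1, len(values) // batch_size)
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

def batch_invariant_sum(values, batch_size):
    """The fix: the split is a constant, so the addition order never
    depends on batch size and the result is identical run to run."""
    chunk = 256  # fixed once, regardless of load; batch_size is ignored
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

random.seed(0)
row = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

naive = {load_dependent_sum(row, b) for b in (1, 8, 25)}
fixed = {batch_invariant_sum(row, b) for b in (1, 8, 25)}

print(len(naive))  # may exceed 1: the result drifts with batch size
print(len(fixed))  # 1: bitwise identical for every batch size
```

The cost in this sketch mirrors the real trade-off: a fixed chunk size is not always the fastest choice for a given load, which is where the latency overhead comes from.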

Next: does it actually work? Results →
From 8% Variance to Zero
The team at Thinking Machines tested batch-invariant kernels on Qwen3-235B with identical prompts at temperature=0, running 1,000 inference calls and counting unique completions.
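The shape of that experiment is easy to replicate against any endpoint. A minimal harness follows; `generate` is a placeholder for your own temperature=0 inference call, and the stub below exists only to make the sketch runnable:

```python
from collections import Counter

def count_unique_completions(generate, prompt, n=1000):
    """Call a generation function n times and tally distinct outputs.
    A fully deterministic stack yields exactly one unique completion."""
    return Counter(generate(prompt) for _ in range(n))

# Stand-in for a real temperature=0 API call (hypothetical, not a real SDK).
def fake_generate(prompt):
    return f"completion for: {prompt}"

counts = count_unique_completions(fake_generate, "Summarize transformers", n=100)
print(len(counts))  # 1 for this deterministic stub
```

Run the same harness against a production endpoint and the length of `counts` is the number the article reports: 80 before the fix, 1 after.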
100%
Determinism achieved
0
Unique completions (was 80)
1.6×
Latency overhead (optimized)
Unique Completions From 1,000 Runs

Performance Cost
Latency Comparison (Qwen3-235B, same prompt)

Batch-invariant kernels impose a latency penalty: forcing fixed tile sizes means GPU utilization isn't always optimal. But engineering effort can claw back most of the loss.

Initial implementation: ~2.1× slowdown (26s → 55s). After optimization: ~1.6× slowdown (~42s). Further optimization is possible.

What Did Those 80 Variants Look Like?
Completion Diversity — Hover over each chip to read it

From 1,000 runs of the same prompt at temp=0, 80 distinct responses emerged: 8% of runs gave a different answer. Each chip represents one unique completion. Green = nearly identical wording to the most common answer, yellow = noticeably different phrasing, red = substantially different response. Hover to read each.


A 1.6× slowdown is real. But for use cases that require reproducibility — RL training, compliance, debugging — it's not a trade-off. It's the cost of correctness. And unlike nondeterminism, latency can be optimized further.

Next: why this matters beyond reproducibility Implications →
Why Determinism Unlocks Real Progress
Deterministic inference isn't just a convenience for debugging. It unblocks several important research and production use cases that have been silently broken by nondeterminism.
🔬
Reproducibility

Same prompt → same output, every time, on any machine. Essential for scientific reproducibility, compliance audits, and comparing model versions without confounding variance.

🤖
True On-Policy RL

RL training samples rollouts from a policy. If inference is nondeterministic, training samples differ from deployment samples, implicitly introducing off-policy corrections. Deterministic inference means rollouts at training time are bitwise identical to rollouts at deploy time.

๐Ÿ›
Debuggability

Reproduce a failure exactly. No more "it only happens sometimes under load." When a model gives a wrong answer, you can replay the exact computation that produced it and trace the error to its source.


Is Batch-Invariant Inference Right for Your Use Case?
Decision Guide โ€” Click any node to explore

Nondeterminism in LLM inference isn't an inherent property of the technology; it's an engineering choice about whether to control floating-point reduction order. This article shows it can be fixed. The question is only whether the 1.6× latency cost is worth it for your application.
