The problem: AI services handle many users at once and group requests together to save computation. The group size varies depending on how busy the server is at that moment.
Why it causes variation: Different group sizes change the order in which the AI does its internal arithmetic. Computers can't represent most decimals exactly, so rounding happens at every step; different addition orders accumulate different rounding errors, enough to change which word the AI picks next.
The fix: Rewrite the three most sensitive calculation routines to always use the same arithmetic order, no matter how many requests are running. Cost: ~1.6× slower. Result: perfectly consistent answers every time.
This isn't a temperature problem. temp=0 means greedy sampling: the model always picks the top token. No randomness is involved. Something in how the computation is executed, below the sampling step, is causing the variation. And most explanations you've heard about why this happens are wrong.
Imagine 10 people each adding the same long list of numbers, but each person splits the list into their own groups to divide the work. Because computers round at every step, different groupings accumulate different rounding errors. The totals don't exactly match, even though everyone started with identical numbers.
Inside an AI model, this happens across millions of calculations in every response. Your request gets "grouped" with different co-requests depending on server load, and those different groupings compound into a measurably different final answer.
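The 10-person analogy is easy to reproduce. The sketch below is a toy in NumPy, not an actual GPU kernel: it sums the same float32 list with two different group sizes, using values deliberately chosen so the rounding shows up at full scale.

```python
import numpy as np

# The same list of numbers, summed with two different group sizes.
# Values are crafted so float32 rounding is visible: 1e8 + 1 rounds
# back to 1e8, because float32 can't resolve a gap of 1 at that scale.
values = np.array([1e8, 1.0, -1e8, 1.0] * 1000, dtype=np.float32)

def grouped_sum(x, group_size):
    """Sum each group left to right, then sum the partial totals."""
    partials = []
    for i in range(0, len(x), group_size):
        part = np.float32(0.0)
        for v in x[i : i + group_size]:
            part = np.float32(part + v)
        partials.append(part)
    total = np.float32(0.0)
    for p in partials:
        total = np.float32(total + p)
    return float(total)

print(grouped_sum(values, group_size=2))  # 0.0
print(grouped_sum(values, group_size=4))  # 1000.0 -- same numbers!
```

With groups of 4, each +1e8/-1e8 pair cancels before the small 1.0 values are lost to rounding; with groups of 2, every 1.0 is absorbed. Identical inputs, wildly different totals.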
You send the exact same prompt at two different times of day. The tokens, the model, the temperature (0): everything is identical. From your perspective, the requests are bitwise equal.
// sent at 9am and 9pm
"The primary reason nearly all LLM inference endpoints are nondeterministic is that the load โ and thus batch size โ nondeterministically varies."
โ Thinking Machines AI, "Defeating Nondeterminism in LLM Inference"
Multiple GPU threads simultaneously write partial sums back to the same memory location using atomic add instructions. Because these concurrent writes happen in unpredictable order, the accumulated total varies each run โ creating race-condition nondeterminism.
LLM forward passes don't require atomic add operations for most kernels. The batch dimension provides enough independent parallelism that work can be tiled across cores without any thread needing to write to a shared memory address. Atomic adds are largely unnecessary and not the source of variance.
Atomic add race conditions do cause nondeterminism in some GPU workloads, particularly sparse operations and custom reductions. Generalizing from there to LLM inference is an understandable but incorrect leap.
Both the myth and the reality lead to "control the order of operations." The atomic add story and the batch-variance story both suggest fixing kernel execution order, but they point at different operations to fix.
The actual culprit, batch size variation, requires understanding how inference servers work under load. It's a systems-level cause, not a kernel-level bug.
The real culprit isn't concurrency within a single operation. It's that the batch size itself changes, and with it, which operations run together and in what floating-point order.
In everyday math, 1 + 2 + 3 = 6 no matter which two numbers you add first. Computers use a compressed number format (called floating-point) where this stops being true: rounding happens at each intermediate step, so different addition orders accumulate different rounding errors and produce slightly different totals.
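A one-line check in any language shows the effect; here in Python, with ordinary 64-bit floats:

```python
# Floating-point addition is not associative: rounding at each
# intermediate step makes the grouping matter.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a)       # 0.6000000000000001
print(b)       # 0.6
print(a == b)  # False
```

Neither 0.1, 0.2, nor 0.3 is exactly representable in binary, so each addition rounds, and the two groupings round differently.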
The interactive demo below makes this concrete. Try the numbers yourself, then explore how GPUs run thousands of these additions in parallel, and why that parallelism is exactly where the variance creeps in.
How This Plays Out on a GPU
A GPU executes thousands of floating-point additions in parallel across CUDA cores. These additions are divided into work groups, and which values end up in the same work group depends on how the scheduler assigns work at runtime. When two runs of the same kernel assign work to cores in slightly different orders, the partial sums are grouped differently.
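A minimal simulation of that effect: three hypothetical "cores" each produce a partial sum, and the final answer depends on the order in which those partials get combined. The 1e8 values are crafted so float32 rounding is visible; real kernels show far subtler differences that compound across layers.

```python
import numpy as np

# Partial sums from three hypothetical cores; crafted so that
# float32 rounding makes the combination order observable.
partials = [np.float32(1e8), np.float32(1.0), np.float32(-1e8)]

def combine(finish_order):
    """Accumulate the partial sums in the order the cores 'finished'."""
    total = np.float32(0.0)
    for i in finish_order:
        total = np.float32(total + partials[i])
    return float(total)

print(combine([0, 1, 2]))  # 1e8 + 1 rounds back to 1e8; -1e8 -> 0.0
print(combine([0, 2, 1]))  # 1e8 - 1e8 = 0 exactly; then +1 -> 1.0
```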
GPU Architecture: The Full Picture (for the technically curious)
Every AI response is the product of millions of floating-point additions. Different batch sizes change which additions happen together, and those tiny ordering differences compound through hundreds of layers until the "most likely next word" ranking flips. That's the complete mechanism.
These changes guarantee bitwise-identical outputs for the same input regardless of batch size, or whether the batch exists at all. The same floating-point operations happen in the same order every single time.
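As a sketch of what batch invariance means (a toy NumPy reduction, not the real matmul/attention kernels): a batch-adaptive kernel picks its tile size from the batch size, so the same row reduces differently under different loads, while a batch-invariant kernel fixes the tile size up front. The `8 // batch_size` heuristic is invented for illustration, and the row values are crafted so float32 rounding exposes the tile size.

```python
import numpy as np

# Within a tile of 4, the +1e8/-1e8 pair cancels before the small
# 1.0 values are lost to float32 rounding; within a tile of 2 it doesn't.
ROW = np.array([1e8, 1.0, -1e8, 1.0] * 256, dtype=np.float32)

def tiled_sum(row, tile):
    """Sum `row` tile by tile, left to right, entirely in float32."""
    total = np.float32(0.0)
    for i in range(0, len(row), tile):
        part = np.float32(0.0)
        for v in row[i : i + tile]:
            part = np.float32(part + v)
        total = np.float32(total + part)
    return float(total)

def adaptive_sum(row, batch_size):
    # Hypothetical adaptive kernel: tile size depends on batch size.
    return tiled_sum(row, tile=8 // batch_size)

def invariant_sum(row, batch_size):
    # Batch-invariant kernel: tile size fixed, batch size ignored.
    return tiled_sum(row, tile=4)

print(adaptive_sum(ROW, batch_size=2))   # tile=4 -> 256.0
print(adaptive_sum(ROW, batch_size=4))   # tile=2 -> 0.0
print(invariant_sum(ROW, batch_size=2) == invariant_sum(ROW, batch_size=4))  # True
```

The invariant version pays for its fixed tiling in utilization, which is exactly where the ~1.6× latency cost discussed below comes from.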
Performance Cost
Batch-invariant kernels impose a latency penalty: forcing fixed tile sizes means GPU utilization isn't always optimal. But engineering effort can claw back most of the loss.
What Did Those 80 Variants Look Like?
From 1,000 runs of the same prompt at temp=0, 80 distinct responses emerged. Each chip represents one unique completion. Green = nearly identical wording to the most common answer, yellow = noticeably different phrasing, red = substantially different response. Hover to read each.
A 1.6× slowdown is real. But for use cases that require reproducibility — RL training, compliance, debugging — it's not a trade-off. It's the cost of correctness. And unlike nondeterminism, latency can be optimized further.
Same prompt → same output, every time, on any machine. Essential for scientific reproducibility, compliance audits, and comparing model versions without confounding variance.
RL training samples rollouts from a policy. If inference is nondeterministic, training samples differ from deployment samples, implicitly introducing off-policy corrections. Deterministic inference means rollouts at training time are bitwise identical to rollouts at deploy time.
Reproduce a failure exactly. No more "it only happens sometimes under load." When a model gives a wrong answer, you can replay the exact computation that produced it and trace the error to its source.
Is Batch-Invariant Inference Right for Your Use Case?
Nondeterminism in LLM inference isn't an inherent property of the technology; it's an engineering choice about whether to control floating-point reduction order. This article shows it can be fixed. The question is only whether the 1.6× latency cost is worth it for your application.