“Non-Determinism in GPT-4 Is Caused by Sparse MoE”, 152334H, 2023-08-05:

It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0. This is odd behavior if you’re used to dense decoder-only models, where temp=0 should imply greedy sampling, which should imply full determinism, because the logits for the next token should be a pure function of the input sequence & the model weights… When asked about this behavior at the developer roundtables during OpenAI’s World Tour, the responses of the members of technical staff were something along the lines of:

Honestly, we’re confused as well. We think there might be some bug in our systems, or some non-determinism in optimized floating point calculations…
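The floating-point explanation is at least mechanically possible on its own: float addition is not associative, so a GPU kernel whose reduction order varies between runs can return different results for identical inputs. A minimal stdlib-Python illustration of that order-dependence (the values are contrived to make the rounding visible):

```python
# Float addition is not associative: summing the same values in a different
# order can give a different result, because each step rounds to a double.
vals = [1e16, 0.5, -1e16, 0.5]

left_to_right = 0.0
for v in vals:
    left_to_right += v   # (1e16 + 0.5) rounds back to 1e16, so one 0.5 is lost

reordered = 0.0
for v in sorted(vals):   # [-1e16, 0.5, 0.5, 1e16]
    reordered += v       # here BOTH 0.5s are absorbed before the cancellation
```

Here `left_to_right` ends up as `0.5` while `reordered` ends up as `0.0`: same inputs, different reduction order, different sum. The post’s argument, though, is that this effect alone is too small to explain the observed behavior.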

…The number of unique completions from GPT-4 is ridiculously high—practically always non-deterministic with longer outputs. This almost certainly confirms that something is up with GPT-4…3 years, and this couldn’t be fixed?…In the recent Soft MoE paper, there was an interesting blurb in §2.2 that sparked a connection:
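The unique-completions experiment can be sketched as follows. The sampling loop against the real API is left as a comment (it needs an API key; the 2023-era `openai.ChatCompletion` endpoint is shown), and `count_unique` is a hypothetical helper for the tallying step, exercised here on illustrative stand-in data:

```python
from collections import Counter

def count_unique(completions):
    """Tally distinct completions across repeated temperature=0 runs."""
    tally = Counter(completions)
    return len(tally), tally.most_common(3)

# Hypothetical sampling loop (not run here; requires an API key):
# import openai
# completions = [
#     openai.ChatCompletion.create(
#         model="gpt-4", temperature=0.0, max_tokens=256,
#         messages=[{"role": "user", "content": PROMPT}],
#     )["choices"][0]["message"]["content"]
#     for _ in range(30)
# ]

# A deterministic dense model at temperature=0 should give exactly 1 unique
# completion across 30 trials:
n_dense, _ = count_unique(["same answer"] * 30)

# Whereas the post reports GPT-4 returning many distinct completions, e.g.
# (stand-in data, 11 distinct strings across 30 trials):
n_gpt4, _ = count_unique([f"variant {i % 11}" for i in range(30)])
```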

Per-sequence determinism: Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens often compete against each other for available spots in expert buffers. As a consequence, the model is no longer deterministic at the sequence-level, but only at the batch-level, as some input sequences may affect the final prediction for other inputs.

Models using larger groups tend to provide more freedom to the routing algorithm and usually perform better, while their computational cost is also higher. On the other hand, when groups contain tokens from a single sequence, the model is forced to use every expert on every input sequence. This may lead to more generalist experts. Moreover, changing the group size between training and inference can be problematic due to the potential distributional shift in token-to-expert assignments. We explore these aspects in §3.5.

[Our new approach] Soft MoE gracefully sidesteps all these [Sparse MoE] challenges. Since it combines all tokens in each input sequence, we just set the group size to be a single sequence. Every expert does handle tokens from every input, maybe somewhat limiting the amount of high-level specialization. Yet, this also implies that it is per-example deterministic and fast, while typical instances of Sparse MoEs are not.

It is currently public knowledge that GPT-4 is a Mixture of Experts model. Given that GPT-4 was trained before Q2 2022, and that Sparse Mixture-of-Experts models had existed long before that, I think the following hypothesis is justified:

The GPT-4 API is hosted with a backend that does batched inference. Although some of the randomness may be explained by other factors, the vast majority of non-determinism in the API is explainable by its Sparse MoE architecture failing to enforce per-sequence determinism.
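The hypothesized mechanism can be sketched with a toy top-1 router under a per-expert capacity limit. Everything below is illustrative (two experts, scalar “tokens”, a made-up scoring rule), not GPT-4’s actual routing; the point is only that a sequence’s expert assignments depend on which other sequences share its group:

```python
# Toy top-1 routing with a per-expert capacity, to show batch-dependence.
CAPACITY = 2  # each expert accepts at most 2 tokens per routing group

def route(batch):
    """batch: list of (seq_id, token_value). Top-1 choice: expert 0 if the
    token value is negative, else expert 1. Tokens are processed in priority
    order (|value| as router confidence); once an expert's buffer is full,
    overflow tokens fall through to the other expert."""
    load = {0: 0, 1: 0}
    assign = {}
    for seq_id, tok in sorted(batch, key=lambda t: -abs(t[1])):
        pref = 0 if tok < 0 else 1
        expert = pref if load[pref] < CAPACITY else 1 - pref
        load[expert] += 1
        assign[(seq_id, tok)] = expert
    return assign

# Sequence A's tokens, routed alone vs. co-batched with a sequence B whose
# tokens outcompete them for the same expert:
seq_a = [("A", 0.5), ("A", 0.7)]
seq_b = [("B", 0.9), ("B", 0.8)]

alone   = {k: v for k, v in route(seq_a).items() if k[0] == "A"}
batched = {k: v for k, v in route(seq_a + seq_b).items() if k[0] == "A"}
```

Alone, both of A’s tokens land in expert 1; co-batched with B, expert 1 fills up first and A’s tokens overflow to expert 0. Since the token-to-expert path changes with batch composition, the same prompt can take a different computational path run-to-run: determinism at the batch level, not the sequence level, exactly as the Soft MoE paper describes.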

GPT-3.5-Turbo may be MoE too: I heard a rumour, once, about 3.5-turbo sharing the same architecture as GPT-4, just with far fewer parameters than it, or even than GPT-3. And, when I heard it, I was thinking: Nah, that sounds too complicated for a small public model. Why wouldn’t they just use a dense one? Fits on one GPU, no complexity overhead, really simple to optimize…

Fast forward to now, and we’re still stuck in a regime where it takes 70b parameters to match Turbo’s performance—a number which just doesn’t make sense for how much traffic OpenAI is handling, and how much speed they get.

It’s also easy to notice that Turbo is the only other model in the API that has its logprobs restricted from public view.