"Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass", Ethan Shen, Alan Fan, Sarah M. Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati, 2024-05-28:

Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, a language model produces each draft with its own autoregressive inference pass, so providing k drafts to the user requires running an expensive language model k times.

To alleviate the computation cost of running k inference passes, we propose Superposed Decoding, a new decoding algorithm that generates k drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the k most recent token embeddings from the drafts as input to the next decoding step of the language model. At every inference step, we combine the k drafts with the top-k tokens to get k² new drafts and cache the k most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations.
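The sketch below illustrates this loop under simplifying assumptions; it is not the authors' implementation. The callables `model_logprobs`, `embed`, and `ngram_logprob`, the interpolation weight `alpha`, and the use of a softmax over cumulative draft scores as the superposition weights are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


def superposed_decoding(model_logprobs, embed, prompt_ids, k=3, steps=20,
                        ngram_logprob=None, alpha=0.0):
    """Sketch of Superposed Decoding with a hypothetical model interface.

    model_logprobs(embeddings) -> log-probabilities over the vocabulary for
        the next token, given a list of input embeddings (one inference pass).
    embed(token_id)            -> embedding vector for a single token.
    ngram_logprob(draft, tok)  -> optional n-gram log-probability, interpolated
        with the model score to filter incoherent drafts.
    """
    prompt_emb = [embed(t) for t in prompt_ids]
    generated_emb = []  # one superposed embedding per generated position

    # Seed the k drafts from the top-k continuations of the prompt.
    logprobs = model_logprobs(prompt_emb)
    top = np.argsort(logprobs)[::-1][:k]
    drafts = [[int(t)] for t in top]            # token ids per draft
    scores = [float(logprobs[t]) for t in top]  # cumulative log-prob per draft

    for _ in range(steps - 1):
        # Superpose the most recent token embeddings of the k drafts into a
        # single input embedding, so one forward pass serves all drafts.
        weights = softmax(np.array(scores))
        superposed = sum(w * embed(d[-1]) for w, d in zip(weights, drafts))
        generated_emb.append(superposed)
        logprobs = model_logprobs(prompt_emb + generated_emb)  # single pass

        # Expand each draft with the top-k next tokens -> k*k candidates,
        # optionally interpolating model and n-gram scores.
        top = np.argsort(logprobs)[::-1][:k]
        candidates = []
        for d, s in zip(drafts, scores):
            for t in top:
                lp = float(logprobs[t])
                if ngram_logprob is not None:
                    lp = (1 - alpha) * lp + alpha * ngram_logprob(d, int(t))
                candidates.append((s + lp, d + [int(t)]))

        # Cache only the k most likely drafts for the next step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        scores = [s for s, _ in candidates[:k]]
        drafts = [d for _, d in candidates[:k]]

    return drafts, scores
```

In this sketch the only model call per step is the single `model_logprobs` pass over the superposed sequence; the k² expansion and top-k pruning operate on scalar scores and are therefore cheap relative to the forward pass.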

Our experiments show that k drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least 2.44× faster for k ≥ 3. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling.

Code and more examples are open-sourced on GitHub.