“MUX-PLMs: Pre-Training Language Models With Data Multiplexing”, Vishvak Murahari, Ameet Deshpande, Carlos E. Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, Karthik Narasimhan (2023-02-24):

Data multiplexing is a recently proposed method for improving a model’s inference efficiency by processing multiple instances simultaneously using an ordered representation mixture. Prior work on data multiplexing only used task-specific Transformers without any pre-training, which limited their accuracy and generality.
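
The abstract does not spell out the mechanics, but the core idea can be illustrated with a toy sketch: N instances are mixed into a single sequence of representations, the shared encoder runs one forward pass, and an index-conditioned module recovers a representation per instance. The `Multiplexer`/`Demultiplexer` modules and the fixed orthogonal slot transforms below are illustrative assumptions, not the paper's actual architecture.

```python
# Toy sketch of data multiplexing (assumed module designs, not MUX-PLMs' exact ones).
import torch
import torch.nn as nn

class Multiplexer(nn.Module):
    """Combine N instances into one sequence via an ordered representation mixture."""
    def __init__(self, num_instances: int, hidden: int):
        super().__init__()
        # One fixed random orthogonal transform per slot keeps the mixture "ordered",
        # i.e. each instance occupies a recoverable position in the mixture (an assumption here).
        self.transforms = nn.Parameter(
            torch.stack([nn.init.orthogonal_(torch.empty(hidden, hidden))
                         for _ in range(num_instances)]),
            requires_grad=False,
        )

    def forward(self, embs: torch.Tensor) -> torch.Tensor:
        # embs: (N, seq_len, hidden) -> one mixed sequence of shape (seq_len, hidden)
        mixed = torch.einsum("nsh,nhd->nsd", embs, self.transforms)
        return mixed.mean(dim=0)

class Demultiplexer(nn.Module):
    """Recover a per-instance representation from the shared encoder output."""
    def __init__(self, num_instances: int, hidden: int):
        super().__init__()
        self.index_emb = nn.Embedding(num_instances, hidden)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, shared: torch.Tensor, index: int) -> torch.Tensor:
        # shared: (seq_len, hidden) -> representation for instance `index`
        idx = self.index_emb(torch.tensor(index)).expand_as(shared)
        return self.mlp(torch.cat([shared, idx], dim=-1))

# Usage: N instances share a single encoder forward pass.
N, seq_len, hidden = 4, 16, 64
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=2)
mux, demux = Multiplexer(N, hidden), Demultiplexer(N, hidden)

instance_embs = torch.randn(N, seq_len, hidden)           # embedded inputs for N instances
shared_out = encoder(mux(instance_embs).unsqueeze(0)).squeeze(0)  # one forward pass
per_instance = [demux(shared_out, i) for i in range(N)]   # N recovered representations
```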

In this paper, we develop pre-trained multiplexed language models (MUX-PLMs) that can be fine-tuned on any downstream task. Our approach includes a three-stage training procedure and novel multiplexing and demultiplexing modules for improving throughput and downstream task accuracy.
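
The abstract does not detail the three training stages, but the fine-tuning idea can be continued from the toy sketch above: N labeled examples share one encoder pass, and each demultiplexed representation feeds a task head. The classification head and first-token pooling below are illustrative assumptions.

```python
# Continuing the toy sketch: a hypothetical fine-tuning step where N examples
# share one encoder pass and each receives its own prediction and loss term.
import torch.nn.functional as F

classifier = nn.Linear(hidden, 2)                          # e.g. a binary GLUE-style task
labels = torch.randint(0, 2, (N,))

shared_out = encoder(mux(instance_embs).unsqueeze(0)).squeeze(0)
logits = torch.stack([classifier(demux(shared_out, i)[0])  # first-token pooling (assumed)
                      for i in range(N)])
loss = F.cross_entropy(logits, labels)
loss.backward()                                            # one backward pass updates encoder, demux, and head
```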

We demonstrate our method on the BERT and ELECTRA pre-training objectives; our MUX-BERT and MUX-ELECTRA models achieve a 2×/5× inference speedup with a 2–4% drop in absolute performance on GLUE and a 1–2% drop on token-level tasks.