“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira (2021-07-30):

[code; Hugging Face] The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics.

Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes.

The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization, and achieves state-of-the-art performance on Sintel optical flow estimation.

Figure 2: The Perceiver IO architecture. Perceiver IO maps arbitrary input arrays to arbitrary output arrays in a domain-agnostic process. The bulk of the computation happens in a latent space whose size is typically smaller than the inputs and outputs, which makes the process computationally tractable even for very large inputs & outputs.

…The Perceiver IO architecture relies on the same primitives as Transformers: so why aren’t Transformers all you need? The answer is that Transformers scale very poorly in both compute and memory [82]. A Transformer deploys attention modules homogeneously throughout its architecture, using its full input to generate queries and keys at every layer. As discussed in [35], this means each layer scales quadratically in compute and memory, which currently makes it impossible to apply Transformers on high-dimensional data like images without some form of preprocessing. Even on domains like language where Transformers shine, preprocessing (eg. tokenization) is often needed to scale beyond short input sequences. On the other hand, Perceiver IO uses attention non-homogeneously, first using it to map inputs to a latent space, then using it to process in that latent space, and finally using it to map to an output space. The resulting architecture has no quadratic dependence on the input or output size: encoder and decoder attention modules depend linearly on the input and output size (respectively), while latent attention is independent of both input and output sizes. Because of this structure, and the corresponding reduction in compute and memory requirements, Perceivers scale to much larger inputs and outputs. While Transformers are typically used in settings with inputs and outputs of at most a few thousand dimensions [9, 63], we show good results on domains with hundreds of thousands of input and output dimensions.
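The three-stage attention structure described above (encode → latent process → decode) can be sketched in a few lines of numpy. This is a minimal single-head illustration, not the paper’s implementation: the array sizes, the plain dot-product attention, and the absence of learned projections, layer norm, and MLPs are all simplifications for showing where the quadratic term does and does not appear.

```python
import numpy as np

def attend(q, kv):
    """Single-head scaled dot-product attention: rows of q attend to rows of kv."""
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)                        # (len(q), len(kv))
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over kv rows
    return weights @ kv                                   # (len(q), d)

rng = np.random.default_rng(0)
M, N, O, D = 20_000, 256, 5_000, 64   # input size, latent size, output size, channels

inputs  = rng.normal(size=(M, D))     # large input array
latents = rng.normal(size=(N, D))     # small (learned, in the real model) latent array
queries = rng.normal(size=(O, D))     # output queries, one per desired output element

z = attend(latents, inputs)           # encode: O(M*N) — linear in input size
for _ in range(4):                    # latent transformer: O(N^2) per layer,
    z = attend(z, z)                  #   independent of both M and O
outputs = attend(queries, z)          # decode: O(O*N) — linear in output size

print(outputs.shape)                  # (5000, 64)
```

The key point is that no step ever forms an M×M or O×O attention matrix; the input and output sizes only ever appear multiplied by the (small, fixed) latent size N.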

…Because of this structure, this architecture can be applied to inputs of any shape or spatial layout and even to inputs or outputs which don’t share the same spatial structure (eg. sound and video). However, in contrast to the latent spaces used elsewhere in vision (eg. [67]), the latent does not explicitly share the structure (spatial or otherwise) of the inputs. To decode this information, we query for it using cross-attention.
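Because the latent carries no spatial layout of its own, structure is reintroduced only through the decoder queries — for example, position-encoded queries, one per output location. A hypothetical 1-D sketch (the `fourier_positions` helper and its band count are illustrative, not the paper’s exact featurization) shows how the same latent could be queried at different output resolutions:

```python
import numpy as np

def fourier_positions(n, num_bands=8):
    """Hypothetical 1-D Fourier position features, one row per output location."""
    pos = np.linspace(-1.0, 1.0, n)[:, None]          # (n, 1) positions in [-1, 1]
    freqs = 2.0 ** np.arange(num_bands)[None, :]      # (1, num_bands) frequency bands
    angles = np.pi * pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# The latent has no spatial structure; the queries carry it instead, so the
# same latent array can be cross-attended at any chosen output resolution.
low_res  = fourier_positions(100)    # 100 output positions
high_res = fourier_positions(400)    # 400 output positions, same latent

print(low_res.shape, high_res.shape)  # (100, 16) (400, 16)
```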

4.4 StarCraft II: To further demonstrate Perceiver IO’s capabilities on discrete modalities and to serve as a drop-in replacement for Transformers, we use Perceiver IO to replace the Transformer in AlphaStar, the state-of-the-art system for the complex game of StarCraft II. At its core, AlphaStar [89] represents the units in the game as a discrete, unordered set of symbols (the “units”). These units are represented by a vector of properties such as unit type, position, health, etc. At each timestep, the architecture encodes up to 512 unit “tokens” with a vanilla Transformer. This representation is used both as a summary of the state (after pooling) and as a rich representation of the 512 units. This representation is used by a pointer network [90] to assign a probability to each possible unit selection, effectively parameterizing the agent’s unit selection policy (see [89] and Appendix §G for more details). We replaced the Transformer that inputs and outputs 512 units with Perceiver IO with a latent size of 32. Without tuning any additional parameters, we observed that the resulting agent reached the same level of performance as the original AlphaStar agent, reaching an 87% win-rate versus the Elite bot after behavioral cloning [61] on human data.
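The savings from the swap described above can be seen with back-of-the-envelope arithmetic (a rough per-layer count of attention interactions only, ignoring projection and MLP costs): a vanilla Transformer over 512 unit tokens computes 512 × 512 pairwise interactions per layer, while encoding those 512 tokens into a latent of size 32, processing there, and decoding back out is linear in 512.

```python
units, latent = 512, 32

# Pairwise attention interactions (rough per-layer count):
transformer_per_layer = units * units      # full self-attention over all unit tokens
encode  = units * latent                   # cross-attention: units -> latent
process = latent * latent                  # one latent self-attention layer
decode  = units * latent                   # cross-attention: latent -> unit outputs

print(transformer_per_layer)               # 262144
print(encode + process + decode)           # 33792
```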