“Fast Feedforward Networks”, Peter Belcak, Roger Wattenhofer (2023-08-28):

[Github; Warning: apparently irreproducible & may be fake] We break the linear link between the layer size and its inference cost by introducing the fast feedforward (FFF) architecture, a log-time alternative to feedforward networks.
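
A minimal sketch of the conditional-execution idea behind such a layer, assuming the binary-tree reading of the abstract: a balanced tree of single-neuron routers whose 2^depth leaves are small feedforward blocks, hard-routed at inference so each input touches only `depth` routing neurons and a single leaf. The class name, initializations, and ReLU leaves are illustrative choices, not the paper’s reference implementation (see the warning above):

```python
# Sketch of an FFF-style layer: log-time hard routing through a binary
# tree of single-neuron routers, then one leaf feedforward block.
# All names/shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class FastFeedforward(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, depth: int):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** depth - 1   # internal routing neurons
        n_leaves = 2 ** depth      # leaf feedforward blocks
        self.router = nn.Parameter(torch.randn(n_nodes, in_dim) * in_dim ** -0.5)
        self.w1 = nn.Parameter(torch.randn(n_leaves, in_dim, hidden_dim) * in_dim ** -0.5)
        self.w2 = nn.Parameter(torch.randn(n_leaves, hidden_dim, out_dim) * hidden_dim ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim). Descend the tree with heap-style indexing
        # (children of node i are 2i+1 and 2i+2): `depth` decisions per input.
        batch = x.shape[0]
        node = torch.zeros(batch, dtype=torch.long, device=x.device)
        for _ in range(self.depth):
            score = (x * self.router[node]).sum(-1)   # one routing neuron per input
            node = 2 * node + 1 + (score > 0).long()  # branch on the sign
        leaf = node - (2 ** self.depth - 1)           # map node id to leaf id
        # Evaluate only the selected leaf block for each input.
        h = torch.relu(torch.einsum('bi,bih->bh', x, self.w1[leaf]))
        return torch.einsum('bh,bho->bo', h, self.w2[leaf])
```

During training, the routing decision would instead be a sigmoid-weighted mixture of both subtrees, which is presumably the “noiseless conditional execution” the abstract contrasts with noisy mixture-of-experts gating; the sketch shows only the hard inference path that yields the log-time cost.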

We demonstrate that FFFs are up to 220× faster than feedforward networks, up to 6× faster than mixture-of-experts networks, and exhibit better training properties than mixtures of experts thanks to noiseless conditional execution.

Pushing FFFs to the limit, we show that they can use as little as 1% of layer neurons for inference in vision transformers while preserving 94.2% of predictive performance.
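
For intuition on the 1% figure (our arithmetic, not a configuration stated in the abstract): if the layer’s neurons sit in the leaves of a depth-$d$ tree, hard routing evaluates one of the $2^d$ leaf blocks per input, so the fraction of leaf neurons used is

$$2^{-d}, \qquad \text{e.g.}\; d = 7 \;\Rightarrow\; 2^{-7} \approx 0.8\%.$$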