“Towards Smaller, Faster Decoder-Only Transformers: Architectural Variants and Their Implications”, Sathya Krishnan Suresh, Shunmugapriya P, 2024-04-22:

In recent years, research on Large Language Models (LLMs) has grown rapidly, focusing predominantly on models built on the transformer architecture established by Vaswani et al. (2017) and further developed through the decoder-only variants of Radford et al. (2020). Contemporary efforts in this field aim primarily to enhance model capabilities by scaling up both the architecture and the volume of data used during training.
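The decoder-only design referenced above is distinguished by causal self-attention, in which each token may attend only to itself and earlier positions. As a rough illustration (not the paper's implementation; a single-head NumPy sketch with hypothetical weight names `Wq`, `Wk`, `Wv`):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention, the core of a decoder-only block.

    x: (seq_len, d_model) token representations.
    Wq, Wk, Wv: (d_model, d_head) projection matrices (illustrative names).
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: position i may only attend to positions j <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The causal mask is what lets such models be trained on next-token prediction: perturbing a later token cannot change the outputs at earlier positions.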

However, exploration into reducing these model sizes while preserving their efficacy remains scant. In this study, we introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt).

These variants demonstrate performance comparable to the conventional architecture in language generation while benefiting from reduced model sizes and faster training.

We open-source the model weights and the complete codebase for these implementations for further research.