“Smaller, Faster, Cheaper, Lighter: Introducing DistilBERT, a Distilled Version of BERT”, Victor Sanh 2019-08-28:

[HuggingFace has released a distilled transformer model, “DistilBERT”, which mirrors the BERT architecture with only 66 million parameters (instead of 110 million) while keeping 95% of BERT’s performance on GLUE. It is available in their ‘pytorch-transformers’ repository alongside 7 other transformer models. It uses knowledge distillation with a cross-entropy loss over the teacher’s soft targets to train a much smaller, faster version of BERT with similar performance. Previously: Sanh et al 2019]
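The distillation objective mentioned above can be sketched in a few lines. This is a minimal NumPy illustration, not HuggingFace’s actual training code: a cross-entropy between the student’s and teacher’s temperature-softened output distributions, where the temperature value and the toy logits are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives softer distributions."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened predictions against the
    teacher's softened output distribution (the 'soft targets')."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    # Mean over the batch of -sum_i p_teacher[i] * log p_student[i]
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Toy example: a student that agrees with the teacher incurs a lower loss
# than one that disagrees (logit values here are made up for illustration).
teacher = np.array([[5.0, 1.0, -2.0]])
student_good = np.array([[4.5, 1.2, -1.8]])
student_bad = np.array([[-2.0, 5.0, 1.0]])
assert distillation_loss(student_good, teacher) < distillation_loss(student_bad, teacher)
```

In practice this soft-target loss is combined with the usual hard-label training loss; training against the teacher’s full output distribution is what lets the smaller student recover most of the teacher’s behavior.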

“We train DistilBERT on eight 16GB V100 GPUs for ~3.5 days using the concatenation of Toronto Book Corpus and English Wikipedia (same data as original BERT)…As shown in the following table, DistilBERT’s performances compare favorably with the baselines while having respectively about half and one third the number of parameters (more on this below). Among the 9 tasks, DistilBERT is always on par or improving over the ELMo baseline (up to 14 points of accuracy on STS-B). DistilBERT also compares surprisingly well to BERT: we are able to retain more than 95% of the performance while having 40% fewer parameters. In terms of inference time, DistilBERT is more than 60% faster and smaller than BERT and 120% faster and smaller than ELMo+BiLSTM.”
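The “40% fewer parameters” figure follows directly from the parameter counts given in the summary (66 million vs. 110 million); a one-line check:

```python
# Quick arithmetic check of the parameter-count claim quoted above.
bert_params = 110e6     # BERT-base parameter count (from the summary)
distil_params = 66e6    # DistilBERT parameter count
reduction = 1 - distil_params / bert_params
print(f"{reduction:.0%} fewer parameters")  # prints "40% fewer parameters"
```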