“Transformers Are a Very Exciting Family of Machine Learning Architectures”, Peter Bloem:

Many good tutorials exist, but in the last few years transformers have mostly become simpler, so it is now much more straightforward to explain how modern architectures work.

This post is an attempt to explain directly [in PyTorch] how modern transformers work, and why, without some of the historical baggage.