“GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, 2018-06-08:
[paper]
We used a byte-pair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [37], with w = 0.01 on all non-bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work.
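The GELU activation mentioned above can be sketched in a few lines. This is a pure-Python illustration (not the authors' code): the exact form x·Φ(x), where Φ is the standard normal CDF, alongside the common tanh approximation from the GELU paper.

```python
import math

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Widely used tanh approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
```

Unlike ReLU, GELU weights inputs by their probability under a standard normal rather than gating them by sign, so it is smooth and nonzero for small negative inputs.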
We used the ftfy library to clean the raw text in BookCorpus, standardized some punctuation and whitespace, and used the spaCy tokenizer.