Part 1: Best Practices for Finetuning Large Transformer Language models

Part 2: How I (almost) replicated OpenAI’s GPT-2 (124M version)


small update (01/25/21): I posted this to HN and it ended up at the top of the front page for a couple of hours :O

TL;DR


A few months ago I started working on a research project trying to pretrain my own, more efficient language model from scratch. I got access to a 128-core TPUv3 pod from the Tensorflow Research Cloud and used it to pretrain a 124M parameter GPT-2 model to a perplexity pretty close to OpenAI’s results (my pretrained model was trained for about 1/8th of the number of iterations that OpenAI trained their model for and got 21 ppl on OpenWebText compared to 17 ppl for OpenAI’s model). I then pretrained an ALBERT-style GPT-2 (that I’m calling ALGPT-2) language model with a factorized input embedding and layer-wise parameter sharing to reduce the number of parameters in the model from 124M to around 12M.

Unfortunately, ALGPT-2 doesn’t perform as well as GPT-2 (ALGPT-2 gets 31 ppl on OpenWebText compared to 21 ppl for my pretrained GPT-2 model), but I’m writing this series of blog posts to go through everything I’ve learned over the last few months.


The Idea


The main thing I wanted to get out of this sort-of “research project” that I was working on by myself this spring was to develop and train a more efficient version of the 124M parameter version of GPT-2. I wanted to pretrain the 1.5B parameter version of GPT-2, but since I only got access to the TPU pod for a week, I had to choose a model that would train in time. A 100K iteration training run took about 20 hours, which gave me plenty of time to run multiple experiments. In contrast, following OpenAI’s training procedure exactly and training for the full 800K iterations would take up almost an entire week and use up most of my quota.

I was able to almost replicate the 124M parameter version of GPT-2 by pretraining it to a perplexity pretty close to OpenAI’s results: my pretrained model was trained for about 1/8th of the number of iterations that OpenAI trained their model for and got 21 perplexity (ppl) on the standard OpenWebText dataset, compared to 17 ppl for OpenAI’s model.

My idea of making a more efficient transformer didn’t really work out, since my pretrained transformer ended up being about 20 ppl worse than an equivalent GPT-2 model, but I wanted to write up what I learned over the two or three months that I was working on this anyway.


Background


A little bit about myself: I’m an incoming software engineering student at the University of Waterloo, and this post is supposed to be a writeup of an NLP research project that I worked on from around March to May of 2020 (right in the middle of the first Covid-19 lockdown; I’m currently writing this on July 15, 2020 while waiting for my Pix2PixHD model to train for a few hundred epochs on Colab for a new project that I’m working on).

I was pretty lucky that I started learning NLP right before transformers exploded in popularity. I remember when word2vec and LSTMs were SOTA on a lot of NLP tasks, and it has been really interesting to see how much the field of NLP has changed in just a few years: from when LSTMs with only a handful of layers and somewhere on the order of 512 units were considered large networks and computationally expensive to train, to LSTMs with attention layers on top, to the original transformer encoder/decoder networks, to ULMFiT and ELMo, then BERT, RoBERTa, GPT-2, and T5, to just a few months ago with the explosion of new, more efficient replacements for self-attention like the Sparse Transformer, the Reformer, and Synthesizers, and now GPT-3, which IMO has the potential to really change the whole field of NLP.

Just a few years ago we trained shallow recurrent networks on task-specific datasets; then we moved to pretraining large transformer language models on large datasets and finetuning them on task-specific datasets. Now the whole idea of just training a gigantic language model on a huge dataset and then conditioning the model in a form of few-shot learning, by prepending a few examples of a certain task to the input, feels like it could make NLP models a lot more accessible and easier to productionize, as well as making human-chatbot interactions a lot more realistic than they are today.

I’ve rambled on for long enough, let’s get to the main topic of this post.


GPT-2 and ALBERT


GPT-2 is a transformer decoder.

The embedding layer at the root of the model maps a one-hot vector of a given token’s index (all the GPT-2 models use a vocabulary size of 50,257) to a 768-dimensional vector (all GPT-2 numbers in this blog post will be for the 124M parameter version of GPT-2).

The embedding matrix is followed by a stack of self-attention and feed-forward layers that each output a 768-dimensional vector (keeping the number of outputs for each layer constant), which makes up the main part of the transformer.

The stack of self-attention layers is then followed by an output embedding (the weights of the input and output embeddings are tied to make training easier) that maps the 768-dimensional output of the last transformer layer to a 50,257-dimensional vector representing the probability of each token in the vocabulary being the next token in the sequence.
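To make the weight tying concrete, here’s a minimal PyTorch sketch (not the actual GPT-2 code, just an illustration using the 124M model’s dimensions) of an input embedding and an output projection sharing the same weight matrix:

import torch.nn as nn

class TiedEmbeddingLMHead(nn.Module):
    # illustrative only: the input embedding and output projection share one matrix
    def __init__(self, vocab_size=50257, d_model=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)                # token ids -> 768-dim vectors
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # 768-dim vectors -> vocab logits
        self.lm_head.weight = self.wte.weight                       # tie the two weight matrices

    def forward(self, hidden_states):
        # hidden_states: output of the last transformer layer, shape (batch, seq_len, 768)
        return self.lm_head(hidden_states)                          # logits over the 50,257-token vocabulary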

Take a look at The Illustrated GPT-2 for a more in-depth look into GPT-2.

ALBERT (A Lite BERT) is a paper that takes a look at BERT and identifies four major ways to make it more efficient and reduce the number of parameters in the model: a factorized embedding, layer-wise parameter sharing, a sentence-order-prediction auxiliary loss, and removing dropout.


Factorized embedding


GPT-2’s embedding has a lot of parameters. It’s really just a matrix of dimensions 50257 × 768. That means that the input embedding alone uses up almost 50257 × 768 ≈ 38,000,000 parameters, which is a pretty big chunk of the 124M total parameters in the model.

The ALBERT authors propose a factorized embedding with an intermediate embedding size of 128: one embedding of size 50257 × 128 and another embedding of size 128 × 768. By breaking up the large embedding matrix into two smaller matrices, the total number of parameters used in the embedding goes from about 38M to about 6M.

50257 × 128 ≈ 6,000,000

128 × 768 ≈ 100,000

The authors try different intermediate embedding sizes and settle on 128 as a good tradeoff between parameters and performance.
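As a quick sanity check of that arithmetic (an illustration in PyTorch, not the ALBERT code), the factorized version replaces one big embedding with a small embedding followed by a linear projection up to the model dimension:

import torch.nn as nn

vocab_size, d_model, d_embed = 50257, 768, 128

full = nn.Embedding(vocab_size, d_model)          # 50257 x 768
factorized = nn.Sequential(
    nn.Embedding(vocab_size, d_embed),            # 50257 x 128
    nn.Linear(d_embed, d_model, bias=False),      # 128 x 768
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(full))        # 38597376 (~38.6M)
print(n_params(factorized))  # 6531200  (~6.5M)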


Layer-wise parameter sharing


In a normal transformer model, the transformer layers are created something like this:

class BERT(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        # ...
        # one separate Block (with its own parameters) per layer
        self.blocks = nn.ModuleList([Block() for _ in range(n_layers)])
        # ...

    def forward(self, x):
        # ...
        for block in self.blocks:
            x = block(x)
        # ...
        return x

ALBERT shares all parameters across the transformer layers something like this:

class ALBERT(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        # ...
        # a single Block whose parameters are shared by every layer
        self.n_layers = n_layers
        self.block = Block()
        # ...

    def forward(self, x):
        # ...
        for _ in range(self.n_layers):
            x = self.block(x)
        # ...
        return x

By only defining one transformer block and looping around it n_layers times, ALBERT saves the GPU memory that would be used to store the parameters for all the layers.

Since we normally use 32-bit floats to store parameters on the GPU, storing the 1.5B parameter GPT-2 on the GPU will use up about 6GB of the GPU’s memory. That’s a pretty big chunk of the 16GB of memory on a normal V100 GPU already being used up before taking into account the memory needed to store the model’s activations as well as any momentum parameters needed by the optimizer. In contrast, if you share parameters across all transformer layers in the 1.5B parameter GPT-2, the resulting model will only have about 37M parameters, and the parameter-sharing version would only use up around 148MB of GPU memory.
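Spelling out the back-of-the-envelope math (fp32 only, ignoring activations and optimizer state):

bytes_per_param = 4                                 # fp32

full_gpt2_xl = 1.5e9 * bytes_per_param              # ~6.0e9 bytes  -> ~6 GB
shared_layers = 37e6 * bytes_per_param              # ~1.48e8 bytes -> ~148 MB

print(f"{full_gpt2_xl / 1e9:.1f} GB, {shared_layers / 1e6:.0f} MB")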

The authors try applying parameter sharing to BERT and see that it reduces performance but makes it easier to train larger and larger models.

In a machine learning framework like JAX, which by default unrolls and inlines loops when compiling your code with XLA, an unrolled and inlined loop over dozens of layers would make the computation graph really large and take a long time to compile. This is why you’re recommended to use something like lax.scan() in these situations.
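As a rough illustration of what that looks like (a toy block standing in for a real transformer layer, not code from this project), lax.scan applies the same parameters at every step without unrolling the loop in the compiled graph:

import jax
import jax.numpy as jnp

def block(params, x):
    # toy stand-in for a transformer block with shared parameters
    return jnp.tanh(x @ params["w"] + params["b"])

def shared_layers(params, x, n_layers=12):
    # one traced copy of `block`, applied n_layers times
    def step(carry, _):
        return block(params, carry), None
    x, _ = jax.lax.scan(step, x, None, length=n_layers)
    return x

params = {"w": 0.01 * jnp.ones((768, 768)), "b": jnp.zeros(768)}
out = jax.jit(shared_layers)(params, jnp.ones((4, 768)))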


Sentence-order-prediction auxiliary loss

The ALBERT authors add a sentence-order-prediction auxiliary loss to help training. Since language modelling is usually done autoregressively, I didn’t use this for my custom model.


Removing dropout

The ALBERT authors remove all dropout from BERT and see that it significantly improves performance.


That’s pretty much what my idea was: Take GPT-2, add a factorized embedding, share parameters across all transformer layers, remove dropout (I actually missed the part about ALBERT removing dropout until I was pretty far into my work, but I did run one or two runs without dropout to see how that works), and pretrain on a large dataset for a few hundred thousand iterations.

There’s no way that I could pretrain something like GPT-2 by myself, so I applied to the Tensorflow Research Cloud (TFRC).

The TFRC puts an emphasis on wanting to help researchers from non-traditional backgrounds, which makes it an amazing resource for anyone who isn’t a “traditional” machine learning researcher. They were willing to give me, a 17-year-old with no formal education or credentials (not even a high school diploma :/), access to an extremely powerful cluster of TPUs at no cost. Being a part of this program was really helpful to me, especially since I don’t have access to a dedicated GPU and usually rely on Colab’s free GPU to train my models.

I emailed the TFRC team to ask if I could get upgraded from 5 individual TPUv3’s (with 8 cores each) to a TPU pod to pretrain a large language model. The very next day (!) I got an email back saying that I could get access to a preemptible 128-core TPUv3 pod for 7 days, which unfortunately wasn’t long enough for me to pretrain the 1.5B parameter model but was enough to train a few runs on the 124M model.


Setup


In this section I’ll go through all the steps that I took to set up my VM and TPU pod and preprocess the dataset.

When I was working on this project, I set up two VMs: one with a lot of RAM and CPU cores to process the data quickly, and another small instance to run the TPU training script. One of the nice things about training on TPUs and TPU pods is that as long as your data has been preprocessed into a set of TFRecord files, you don’t need a really powerful VM instance, which saves you a lot of money/compute credits.

You can look at this for a full list of every command that I used to setup the VM and preprocess the dataset.


OpenWebText


I used an n1-standard-16 instance with TF 2.1 to process the OpenWebText dataset. Make sure that you use an instance with an SSD instead of the default HDD, because processing the dataset involves handling a lot of very small text files and is mostly limited by your drive’s I/O speed. I made the mistake of using an HDD, and just extracting the dataset’s TAR archives took about 7 hours. I put all the data in a folder at ~/data/openwebtext/, so modify the commands if you want to download the data elsewhere.

TIL: most common Linux utilities (like ls, mv, and cat) aren’t really optimized for working with almost 10 million files like in OpenWebText. Just counting the number of text files in the dataset could take several minutes.

Download the OpenWebText dataset (which is really just a tar archive of a bunch of tar archives that contain a lot of text files) and extract it:

gdown https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx
tar -xf openwebtext.tar.xz
cat *.xz | tar -J -xf - -i

The dataset is about 12GB compressed and 53GB uncompressed and has just about 8 million text files.

I moved the first 100,000 files in the dataset to a separate directory to create a validation set:

ls -f | head -100000 | xargs -i mv {} ../openwebtext-valid/

Tokenization


I trained a byte-level BPE tokenizer with a vocabulary size of 50,257 (the same as GPT-2) on a 1M file subset of the training set (I’m not sure if GPT-2 trains its tokenizer on the entire dataset or just a subset, but I know that the CTRL paper trains its tokenizer on a 5% split of the training set). I used Huggingface’s fast Rust-based Tokenizers library and its ByteLevelBPETokenizer.

You can use my script here and run

python3 train_tokenizer.py --train_path ./data/openwebtext/ --save_path ./tokenizer/ \
    --vocab_size 50257 --n_files 1000000

to train the tokenizer, or just take a look at this for the main details (It just trains a tokenizer and saves it as well as a configuration file to disk):

import os
import glob
import json

from tokenizers import ByteLevelBPETokenizer

# train on a 1M file subset of OpenWebText
paths = glob.glob(os.path.join('./data/openwebtext', '*'))[:1000000]

tok = ByteLevelBPETokenizer()
# in the full script these values come from CLI args (--vocab_size, --control_codes)
tok.train(files=paths, vocab_size=50257, special_tokens=['<|endoftext|>'])
tok.save('./tokenizer/')

# transformers looks for this config file when loading the tokenizer
tokenizer_config = {
    "max_len": 1024
}

with open(os.path.join('./tokenizer/', "tokenizer_config.json"), 'w') as fp:
    json.dump(tokenizer_config, fp)

TFRecords


TPU pods expect your data to be available as a set of TFRecord files in a GCP cloud bucket; the files get streamed to the powerful VM built into each TPU board, which takes care of deserializing the files and feeding them to the TPU chips. Make sure that your GCP bucket and your TPU pod are in the same compute zone, otherwise you’ll quickly rack up a lot of charges by transferring hundreds of GBs of data across compute zones.

Here’s something that’s not very well documented when working with TPU pods (this doesn’t really apply to single TPUs as much): TPU pods create a lot (hundreds of GBs) of logs that get sent to Stackdriver, where you get charged about 50 cents for each GiB of logs ingested beyond a certain limit (I think it’s around 50GiB/month). In just a few days of training, I ended up being charged about $100 IIRC. Luckily, I still had most of the free GCP credits, so this didn’t end up being a major problem for me, but make sure to turn off ingesting logs for TPUs.

I ran into a problem early on when I got access to the TPU pod where my code would work perfectly on a single TPU but would throw an Out of range: End of sequence error when running on the TPU pod. I struggled with this for a pretty long time until I took a look at this Kaggle discussion post, which says that each TPU board (8 cores) expects to get its own TFRecord file. Until that point, I was splitting the train set into 8 TFRecord files when I should’ve been splitting it into 16 (128 cores / 8 cores per board).
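The fix itself is just a sharding detail; something along these lines (a simplified sketch, not my actual make_tfrecords.py script) writes one TFRecord shard per 8-core board:

import tensorflow as tf

def write_shards(examples, n_cores=128, prefix='train'):
    # one TFRecord file per TPU board (8 cores each), so 16 shards for a 128-core pod
    n_shards = n_cores // 8
    for shard in range(n_shards):
        path = f'{prefix}-{shard:03d}-of-{n_shards:03d}.tfrecord'
        with tf.io.TFRecordWriter(path) as writer:
            for tokens in examples[shard::n_shards]:
                feature = {'input_ids': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=tokens))}
                ex = tf.train.Example(features=tf.train.Features(feature=feature))
                writer.write(ex.SerializeToString())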

TPUs are awesome for scaling to huge models and huge datasets, but there’s a lot of TPU-specific information (especially for TPU pods) that you need to know and that’s not covered in the documentation and isn’t easy to find.

You can use my script here and run

python3 make_tfrecords.py --path ./data/openwebtext/ --save_path ./train/ --files_per_tfrecord 500000 \
    --use_control_codes --seq_len 1024 --min_seq_len --tokenizer ./tokenizer/
python3 make_tfrecords.py --path ./data/openwebtext-valid/ --save_path ./val/ --files_per_tfrecord 50000 \
    --use_control_codes --seq_len 1024 --min_seq_len --tokenizer ./tokenizer/

to convert the raw text files from the train and validation splits into two sets of 16 TFRecord files.

I ran a quick analysis on the lengths of the text files in the dataset: 67% of files have less than 1024 tokens, 35% of files have less than 512 tokens, and only 10% of files have less than 256 tokens. This means that if I wanted to make the dataset as clean as possible and have each input sequence to the model be a single contiguous stream of 1024 tokens, the dataset’s size would be a lot smaller. For this reason, everyone prepends a token like <|endoftext|> to the beginning of each sequence and concatenates together sequences with lengths smaller than 1024. The specifics of how exactly you do that (e.g. do you treat the dataset as a single stream of tokens and just break it up into sequences of length 1024, or do you keep track of sequences smaller than 1024 and concatenate them together into a single sequence) really shouldn’t make too big of a difference in your model’s performance, but you can take a look at my implementation here.
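For reference, here’s a minimal sketch of the “single token stream” variant (not my exact implementation; 50256 is the id GPT-2’s tokenizer uses for <|endoftext|>):

def pack_examples(tokenized_docs, seq_len=1024, eot_id=50256):
    # prepend <|endoftext|> to each document, concatenate everything into one
    # stream of token ids, and slice the stream into fixed-length examples
    examples, buffer = [], []
    for tokens in tokenized_docs:
        buffer += [eot_id] + tokens
        while len(buffer) >= seq_len:
            examples.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return examples  # any leftover tokens in `buffer` are dropped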

My version doesn’t take full advantage of the fast, multithreaded batch_encode_plus() way of tokenizing large datasets in parallel, since it only keeps the first context_len tokens in each line of the files, which makes dealing with files with more or fewer than 1024 tokens harder. Because of this, tokenizing the dataset takes about 8 hours, which is something I want to improve.

The train set comes out to about 26GB and consists of about 8M text files that have been transformed into just under 7M TFRecord examples, each with 1024 tokens (same as GPT-2). The validation set comes out to about 300MB and consists of about 100K text files that have been transformed into just about 90K TFRecord examples, each with 1024 tokens (also the same as GPT-2).


Coding it up


Since I’m using TPUs, the only library that you can really use right now is TensorFlow. I didn’t want to have to go through the learning curve of writing custom training loops and the like in TF2, so I just stuck to using Keras. You can take a look at my training script (it’s pretty short) here. It’s pretty simple, so I’m not going to copy over the entire training script, but I will talk about a few small code snippets.

I usually like to add a ptvsd breakpoint to my script so I can debug my training script locally with VS Code before pushing it up to my VM:

if args.debug:
    import ptvsd
    ptvsd.enable_attach(address=('localhost', 5678),
                        redirect_output=True)
    ptvsd.wait_for_attach()
    breakpoint()

I’m using Weights & Biases to keep track of my experiments and save checkpoints:

wandb.login()
wandb.init(project='lm-finetuning', config=args, tags=args.tags)

...

wandb_callback = WandbCallback(save_model=False)

Usually when you’re using a TPU with Keras, you pass in the IP address and port of the TPU to TPUClusterResolver, but you pass the name of the TPU itself to the resolver when using a TPU Pod.

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=args.tpu)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
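After connecting to the cluster, the model itself gets built under a TPU distribution strategy’s scope so that Keras can replicate it across all the cores. Roughly (create_model() is a placeholder for the model-building code, the optimizer and loss are just examples rather than my exact settings, and in TF 2.1 the strategy class still lives under tf.distribute.experimental):

strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
    model = create_model()  # placeholder: build the Keras GPT-2 model here
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )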

Replicating GPT-2


I tried to use as many of the original hyperparameters that OpenAI used as I could when replicating their 124M parameter version of GPT-2, but I had to modify a few things so I could train everything in time.

Note: For some reason, the authors of the GPT-2 paper don’t state exactly what learning rates they used for training their models and instead just state “The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText”.

OpenAI trains their models for a total of 800K iterations at a batch size of 512 (which comes out to around a total of 60 epochs through the training set).

I trained my GPT-2 model for 1/8th the number of iterations that OpenAI trained theirs for (a total of around 100K iterations) since each 100K iteration training run took about 20 hours to run on my 128-core TPU pod. If I wanted to train GPT-2 for the same number of iterations as OpenAI, a single training run would have used up most of my one week of access to the pod.

Since my TPU pod was preemptible and reset every 24 hours, I usually had to resume my training run at least once, which is why all of these graphs usually have two or more training runs on them.


Replicating GPT-2


So here’s my model that came really close to replicating GPT-2: the training perplexity is about 21.5 at the end of the almost 90K training iterations. For comparison, GPT-2 gets a training perplexity of about 17.5 ppl after about 800K training iterations, so a difference of only about 4 ppl.

I made a colab notebook showing how to use my pretrained GPT-2 model to generate text.
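If you want to do the same outside the notebook, generation with the Huggingface Transformers API looks roughly like this (the checkpoint path is a placeholder for wherever you’ve saved the pretrained weights):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('./checkpoint/')  # placeholder path
model = GPT2LMHeadModel.from_pretrained('./checkpoint/')

input_ids = tokenizer.encode('The TPU pod had', return_tensors='pt')
output = model.generate(input_ids, max_length=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))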


AdamW vs Adafactor


I wanted to use the memory-saving Adafactor optimizer to make it easier to train larger language models, but all of my Adafactor training runs were a lot (~5 ppl IIRC) worse than using AdamW (this may be due to not using Adafactor’s momentum parameter or relative update scale, so it’s something I want to look into more soon).


Learning rates


I started out using Adam’s default learning rate of 1e-4, but I quickly figured out that I could train my models a lot faster by using a higher learning rate like 1e-3.

Section 2 of the GPT-3 paper lists the learning rates the OpenAI team used for different sized models when training GPT-3. They use a learning rate of 6e-4 for the 124M version of their model and decrease the learning rate with model size.

You can take a look at this partial training run to see the difference between training at different learning rates.


Pretraining ALGPT-2


Since I was using the Huggingface Transformers repository’s implementations of GPT-2 and ALBERT, I just forked the repository and modified a few files to implement my ALGPT-2 model. You can take a look at all the changes that I had to make here. Most of the changes are only there to make ALGPT-2 compatible with the Transformers library and to be able to use the useful abstractions that it gives you; most of the important code is in the modelling_algpt2.py file, in which I just copied over the contents of modelling_gpt2.py and changed a few parts of the code. I’m only showing the changes that I made to the Pytorch version of ALGPT-2 here; the changes in the TF version are pretty similar and can be seen here.


Implementing parameter sharing


Implementing parameter sharing only involves changing a few lines of code:

class ALGPT2Model(ALGPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        ...

        # self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True)
        #     for _ in range(config.n_layer)])
        self.h = Block(config.n_ctx, config, scale=True)

        ...

    def forward(self, ...):

        ...

        if past is None:
            past_length = 0

            # past = [None] * len(self.h)
            past = [None] * self.config.n_layer

        ...

        # for i, (block, layer_past) in enumerate(zip(self.h, past)):
        for i in range(self.config.n_layer):

            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)

            # outputs = block(
            outputs = self.h(
                hidden_states,
                layer_past=past[i],  # `past` is indexed directly since the loop no longer unpacks it
                attention_mask=attention_mask,
                head_mask=head_mask[i],
                use_cache=use_cache,
            )
        ...


Implementing a factorized embedding


Adding a factorized embedding is a little more work:

In the config.json that you use for your ALGPT-2 model, you need to specify that you want to use the ALGPT-2 architecture, and you need to specify the dimension of the factorized embedding that you want to use:

{
+	"architectures": ["ALGPT2LMHeadModel"],
	"attn_pdrop": 0.1,
	"bos_token_id": 50256,
	"embd_pdrop": 0.1,
	"eos_token_id": 50256,
	"initializer_range": 0.02,
	"layer_norm_epsilon": 1e-5,
+	"model_type": "algpt2",
	"n_ctx": 1024,
	"n_embd": 768,
	"n_head": 12,
	"n_layer": 12,
	"n_positions": 1024,
	"resid_pdrop": 0.1,
	"summary_activation": null,
	"summary_first_dropout": 0.1,
	"summary_proj_to_labels": true,
	"summary_type": "cls_index",
	"summary_use_proj": true,
	"vocab_size": 50257,
+	"embedding_size": 128
}

Back in modelling_algpt2.py, define the two factorized embedding matrices (the second matrix is really just a simple linear layer):

class ALGPT2Model(ALGPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        ...

        # self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        # self.wpe = nn.Embedding(config.n_positions, config.n_embd)
        self.wte = nn.Embedding(config.vocab_size, config.embedding_size)
        self.wpe = nn.Embedding(config.n_positions, config.embedding_size)

+       self.projection_layer = nn.Linear(config.embedding_size, config.n_embd)


        ...

    def forward(self, ...):

        ...

        hidden_states = inputs_embeds + position_embeds + token_type_embeds

+       hidden_states = self.projection_layer(hidden_states)

        ...


class ALGPT2LMHeadModel(ALGPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        ...

        # self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.dense = nn.Linear(config.n_embd, config.embedding_size)
        self.lm_head = nn.Linear(config.embedding_size, config.vocab_size, bias=False)

    def forward(self, ...):

        ...

        # lm_logits = self.lm_head(hidden_states)
        dense = self.dense(hidden_states)
        lm_logits = self.lm_head(dense)
        ...

Effect of layer-wise parameter sharing


This version of ALGPT-2 has about 47M parameters while GPT-2 has 124M. This ALGPT-2 model with parameter sharing trains a lot faster than GPT-2 (9 hours vs 20 hours for a 90K iteration training run), but is consistently about 10 ppl worse than GPT-2 (31 vs 21 ppl).

This difference is quite a bit larger than the difference between ALBERT and BERT, but might be explained by masked language modelling being an easier task than autoregressive language modelling. Increasing the size of the ALGPT-2 model might make it more competitive with GPT-2.


Effect of removing dropout


I ran a partial training run to check the effect of removing dropout from ALGPT-2. I didn’t run it for very long, but it looks like removing dropout gives you a slight improvement (~3 ppl).


Effect of factorized embeddings


I ran three experiments for 90K iterations with three different values for the factorized embedding size (128, 256, and 512), as well as the baseline version without a factorized embedding.

Model        ALGPT-2   ALGPT-2 512   ALGPT-2 256   ALGPT-2 128
Parameters   47M       34M           20M           13M
Time         ~9H       ~9H           ~9H           ~9H
Perplexity   31        31            34            38

There was practically no difference in the loss curves between the baseline and the 512 run, since the change in the number of parameters wasn’t that great. However, the training runs with factorized embeddings of sizes 256 and 128 were significantly worse than the baseline: 34 and 38 ppl respectively, a pretty big difference from the baseline of 31 ppl.


Effect of model size


I only had the time to run one more full training run, with ALGPT-2-medium (this one is comparable to the 345M version of GPT-2). ALGPT-2-medium has about 66M parameters and took twice as long as ALGPT-2 to train (a little more than 20 hours). The larger model size made quite a big difference in performance: the training perplexity decreased by 5 ppl, from 31 to 26 ppl.


Conclusion and next steps


Well that’s pretty much everything that I did. After my TPU pod’s quota was used up, I started working on a few other things over the summer and just kept delaying writing up what I did for a couple of months until now.

There are a lot of things that I still want to work on or look into:

  • Training larger versions of ALGPT-2
  • Removing or replacing the normalization layers in transformers
  • Working on distilling/shrinking language models with billions of parameters to make them more accessible
  • Applying something like PPLM to condition language models for few-shot inference (kinda like what GPT-3 does)

Thanks for reading through all this. If you think there are any mistakes or inaccuracies in this post, please let me know.