How To Fine-Tune GPT-NeoX

The Forefront Team
February 21, 2022

Less than two weeks ago, EleutherAI announced their latest open source language model, GPT-NeoX-20B. Today, we’re excited to announce that Forefront is the first platform where you can fine-tune GPT-NeoX, enabling our customers to train the largest open source language model on any natural language processing or understanding task.

Start fine-tuning GPT-NeoX for free.

The same fine-tuning experience our customers have come to know with GPT-J will be offered for GPT-NeoX, including free fine-tuning, JSON Lines and text file support, test prompts, Weights & Biases integration, and control over hyperparameters like epochs and checkpoints. We look forward to seeing all the ways our customers will fine-tune GPT-NeoX models to solve complex NLP problems at scale. Let’s take a closer look at fine-tuning.

What is fine-tuning?

Recent research in Natural Language Processing (NLP) has led to the release of multiple large transformer-based language models (LLMs) like OpenAI’s GPT-2 and GPT-3, EleutherAI’s GPT-Neo and GPT-J, and most recently, EleutherAI’s GPT-NeoX-20B, a 20 billion parameter language model. One of the most impactful outcomes of this research has been the finding that the performance of LLMs scales predictably as a power law with the number of parameters; the downside of scaling parameters is the increased cost of fine-tuning and inference. For those not impressed by tunable parameter counts now in the tens of billions, the value starts to show in the performance these models can achieve on a variety of tasks after fine-tuning for just a few epochs on as few as 100 training examples.

Fine-tuning refers to the practice of further training a language model on a dataset to achieve better performance on a specific task. This practice can enable a model to outperform one 10x its size on virtually any task. As such, fine-tuned models make up the majority of models deployed in production on the Forefront platform and are where businesses get the most value.

Until now, one had to choose between GPT-J’s 6 billion parameters and GPT-3 Davinci’s 175 billion parameters. The former is small enough to fine-tune and run inference on cost-efficiently, but not big enough to perform well on complex tasks. The latter is big enough to perform well on complex tasks, but incredibly expensive to fine-tune and run inference on. Enter GPT-NeoX-20B, and solving many more complex NLP tasks at scale starts to look doable. Let’s look at how GPT-NeoX fine-tuned on various tasks compares to vanilla GPT-NeoX and GPT-3 Davinci.

Text summarization

Summarize text into a few sentences.

[Image: GPT-NeoX text summarization comparisons]

Emotion classification

Classify text as an emotion.

[Image: GPT-NeoX emotion classification comparisons]

Question answering

Answer natural language questions about provided text.

[Image: GPT-NeoX question answering comparison]

[Image: another GPT-NeoX question answering comparison]

Chat summarization

Summarize dialogue and transcripts.

[Image: GPT-NeoX chat summarization example]

Content generation

Write a paragraph based on a topic and bullet point.

[Image: GPT-NeoX content generation comparisons]

Question answering with context

Answer natural language questions based on the provided information and scenario.

[Image: GPT-NeoX question answering with context example]

Chatbot with personality

Imitate Elon Musk in a conversation.

[Image: GPT-NeoX chatbot with personality comparisons]

Blog idea generation

Generate blog ideas based on a company name and product description.

[Image: GPT-NeoX blog idea generation comparisons]

Blog outline

Provide a blog outline based on a topic.

[Image: GPT-NeoX blog outline comparisons]

How to fine-tune GPT-NeoX on Forefront

The first (and most important) step in fine-tuning a model is preparing a dataset. A fine-tuning dataset can be in one of two formats on Forefront: JSON Lines or a plain text file (UTF-8 encoded). For the purpose of this example, we’ll format our dataset as JSON Lines, where each example is a prompt-completion pair. Here are some example dataset formats for the emotion classification, text summarization, question answering, and chat summarization use cases above.
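
To make the format concrete, here is a minimal sketch of what a few JSON Lines records might look like for the emotion classification and text summarization tasks. The `prompt` and `completion` field names follow the prompt-completion pairing described above, but they are an assumption for illustration; check our docs for the exact schema the uploader expects.

```jsonl
{"prompt": "Text: I can't believe my flight got cancelled again.\nEmotion:", "completion": " anger"}
{"prompt": "Text: We just got the keys to our first house!\nEmotion:", "completion": " joy"}
{"prompt": "Summarize the following text into a few sentences.\n\nText: <long passage>\n\nSummary:", "completion": " <a few sentence summary>"}
```

Each line is one complete JSON object; a leading space in the completion is a common convention for GPT-style BPE tokenizers.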

After uploading your dataset, you can set the number of epochs your model will train for. Epochs refer to the number of complete passes through a training dataset, or, put another way, how many times a model will “see” each training example in your dataset. For example, training on a 500-example dataset for 3 epochs means each example is seen 3 times, for 1,500 examples processed in total. A range of 2-4 epochs is typically recommended depending on the size of your dataset.

[Image: How to fine-tune on Forefront]

Next, you’ll set a number of checkpoints. Checkpoints refer to how many model versions will be saved throughout training. Training a model for the optimal amount of time is incredibly important, and checkpoints let you easily find the optimal time by comparing performance between models saved at different points during training. Performance is compared by setting test prompts.

Test prompts are a simple method to validate the performance of your model checkpoints. You add prompts and generation parameters, and each model checkpoint provides completions for them. After training, you can review the completions from each checkpoint to find the best-performing model.
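
If you’d rather compare checkpoints programmatically, the idea reduces to a simple loop: send the same held-out prompts to each deployed checkpoint and review the completions side by side. The snippet below is a hypothetical sketch, not the Forefront API; the endpoint URLs, request fields, and response shape are illustrative placeholders, so consult our docs for the real interface.

```python
import requests

# Hypothetical endpoints for two deployed checkpoints of the same
# fine-tune; the real URLs and auth come from your Forefront account.
CHECKPOINTS = {
    "checkpoint-1": "https://example.com/models/my-fine-tune/checkpoint-1",
    "checkpoint-2": "https://example.com/models/my-fine-tune/checkpoint-2",
}

# Held-out prompts: not in the training set, but representative of
# what the model will see in production.
TEST_PROMPTS = [
    "Text: My package arrived two weeks late and damaged.\nEmotion:",
    "Text: She surprised me with tickets to the concert!\nEmotion:",
]

for name, url in CHECKPOINTS.items():
    for prompt in TEST_PROMPTS:
        # Request body fields are illustrative; real parameter names may differ.
        resp = requests.post(url, json={"prompt": prompt, "max_tokens": 5, "temperature": 0.0})
        resp.raise_for_status()
        print(f"{name} | {prompt!r} -> {resp.json()}")
```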

Alternative ways to fine-tune GPT-NeoX

Alternatively, you can fine-tune GPT-NeoX on your own infrastructure. To do this, you’ll need at least 8 NVIDIA A100s, A40s, or A6000s, and you’ll use the GPT-NeoX GitHub repo to preprocess your dataset and run the training script with the various degrees of parallelism that EleutherAI’s repo supports.
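
At a high level, the self-hosted workflow looks roughly like the sketch below. It’s based on the tooling documented in EleutherAI’s repo at the time of writing; exact flags, tokenizer files, and config names change between versions, so treat the arguments here as assumptions and follow the repo’s README for the current invocation.

```bash
# Clone EleutherAI's training repo and install its requirements.
git clone https://github.com/EleutherAI/gpt-neox
cd gpt-neox
pip install -r requirements/requirements.txt

# Tokenize and binarize a JSONL dataset into the repo's training format.
# (Flag names follow tools/preprocess_data.py; the 20B tokenizer file
# ships alongside the released weights.)
python tools/preprocess_data.py \
  --input my_dataset.jsonl \
  --output-prefix my_dataset \
  --vocab 20B_tokenizer.json \
  --tokenizer-type HFTokenizer \
  --dataset-impl mmap \
  --append-eod

# Launch training across your GPUs via the DeepSpeed wrapper, with a
# YAML config pointing at the preprocessed data and the 20B weights
# you want to fine-tune from.
./deepy.py train.py ./configs/20B.yml
```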

Helpful Tips

  1. Prioritize high-quality training examples for your chosen task over a large dataset.
  2. Use Weights & Biases. Enter your W&B API key in Settings → General and monitor training progress for your models.
  3. Train for 3-4 epochs on smaller datasets (<1MB). Train for 2-3 epochs on larger datasets (>1MB). Recommendations are provided in-app.
  4. Save 5-10 checkpoints. It’s important to find the optimal training time, and the more checkpoints you compare with test prompts, the more likely one of them was trained for the optimal amount of time.
  5. Set test prompts that are not included in your dataset and that resemble what your model will likely see in production.
  6. You can deploy multiple checkpoints and conduct further testing in your Playground.
  7. For more detailed information on fine-tuning and preparing your dataset, refer to our docs.

These tips are meant as loose guidelines, and experimentation is encouraged.

At Forefront, we believe building a simple, free experience for fine-tuning will lower the cost of experimenting with large language models, enabling businesses to solve a variety of complex NLP problems. If you have any ideas on how we can further improve the fine-tuning experience, please get in touch with our team. Don’t have access to the Forefront platform? Get access.