“Building a Large Japanese Web Corpus for Large Language Models”, 2024-04-27:
Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not built with the quality of Japanese text in mind. This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots of ~63.4 billion pages crawled from 2020 to 2023).
This corpus consists of ~312.1 billion characters (~173 million pages), which is the largest of all available training corpora for Japanese LLMs, surpassing CC-100 (~25.8 billion characters), mC4 (~239.7 billion characters) and OSCAR 23.10 (~74 billion characters).
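The abstract does not spell out the refinement pipeline, but the kind of quality heuristic such corpus construction typically relies on is a Japanese-character-ratio filter over extracted pages. Below is a minimal, hypothetical sketch (thresholds and function names are assumptions, not the paper's actual rules):

```python
import re

# Hiragana, katakana, and CJK unified ideograph ranges.
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Japanese."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if JAPANESE_CHAR.match(c)) / len(chars)

def keep_page(text: str, min_ratio: float = 0.5, min_chars: int = 400) -> bool:
    """Hypothetical thresholds; the actual corpus applies its own refinement rules."""
    return len(text) >= min_chars and japanese_ratio(text) >= min_ratio
```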
To confirm the quality of the corpus, we performed continual pre-training on Llama 2 7B, 13B, 70B, Mistral-7B v0.1, and Mixtral 8×7B Instruct as base LLMs and obtained consistent improvements of 6.6–8.1 points on Japanese benchmark datasets. We also demonstrate that the improvement on Llama 2 13B brought by the presented corpus was the largest among those from other existing corpora.
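For readers unfamiliar with the setup, continual pre-training here means resuming causal language modeling of an existing base model on the new Japanese corpus. A rough sketch using Hugging Face Transformers is shown below; the model name, data path, and hyperparameters are placeholders, not the authors' actual configuration:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder base model; the paper also uses larger Llama 2 and Mixtral variants.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Hypothetical location of the refined Japanese web corpus as plain-text shards.
dataset = load_dataset("text", data_files={"train": "japanese_corpus/*.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-7b-ja-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=1e-4,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```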