Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages.
We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation.
Further analysis revealed a variety of error modes arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial dataset covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
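As an illustration of the wordlist-based tunable-precision filtering idea, here is a minimal sketch (not the filters released with this work; the function name, toy wordlist, and threshold are invented for the example). Raising the threshold trades recall for precision, which is what makes the filter "tunable":

```python
def wordlist_precision_filter(sentences, wordlist, min_in_vocab_frac=0.2):
    """Keep only sentences whose fraction of tokens found in a curated
    wordlist meets a tunable threshold (illustrative sketch)."""
    wordlist = {w.lower() for w in wordlist}
    kept = []
    for sent in sentences:
        tokens = sent.lower().split()
        if not tokens:
            continue
        in_vocab = sum(t in wordlist for t in tokens)
        if in_vocab / len(tokens) >= min_in_vocab_frac:
            kept.append(sent)
    return kept

# Toy example with a tiny invented "wordlist":
wl = {"na", "wetin", "dey", "go", "pikin"}
sents = ["wetin dey happen", "completely unrelated English text here"]
print(wordlist_precision_filter(sents, wl, 0.5))  # → ['wetin dey happen']
```

With a strict threshold (here 0.5), the off-language sentence is dropped even though a LangID classifier might have accepted it.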
3.2 General Internet Noise and Creativity: Our efforts in scaling the LangID models in our web crawl to hundreds of languages uncovered greater depths of internet noise, alongside even more creative uses of text. Because of the sheer size of the web, any small pathology of a LangID model is hugely magnified: we observed that our models tend to pick up on particular genres of internet noise for each language, resulting in corpora for some languages that mostly showcase a rich array of particular types of oddities.
For example, in our initial crawls, what purported to be the corpus for Varhadi picked up large amounts of badly-encoded PDFs; Aymara and Turkmen were made up mostly of misrendered non-Unicode text; Dimli had mostly invalid HTML; Dogri offered a rich array of Zalgo-like ornamentation; Fula was awash in URLs; Ilocano caught vast amounts of garbled JavaScript; and Zhuang captured German sentences involving the Unicode SOFT HYPHEN character. In each of these cases, the majority of the crawled corpus consisted of whatever class of noise the LangID classifier had assigned to that language, drowning out any in-language sentences.
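Several of these noise classes can be caught by crude surface heuristics before LangID is even applied. A minimal sketch, with function name and thresholds invented for illustration (these are not the filters used in our pipeline):

```python
import re

def looks_like_noise(sentence, max_url_frac=0.3, max_nonalpha_frac=0.5):
    """Flag sentences dominated by URLs or by non-alphabetic characters,
    a crude proxy for misrendered encodings, markup remnants, and
    Zalgo-style ornamentation (illustrative sketch only)."""
    tokens = sentence.split()
    if not tokens:
        return True
    url_frac = sum(
        bool(re.match(r"https?://|www\.", t)) for t in tokens
    ) / len(tokens)
    chars = sentence.replace(" ", "")
    nonalpha_frac = sum(not c.isalpha() for c in chars) / max(len(chars), 1)
    return url_frac > max_url_frac or nonalpha_frac > max_nonalpha_frac

print(looks_like_noise("visit www.example.com http://spam.example now"))  # → True
print(looks_like_noise("This is an ordinary sentence."))  # → False
```

Heuristics like these are cheap but blunt; they would not, for instance, catch fluent sentences in the wrong language.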
In another interesting twist, one might expect that languages written in scripts not used by any other language would have clean corpora, since the unique connection between script and language means that any LangID model achieves 100% F1 on development sets. However, this underestimates the creativity of the internet: the Cherokee syllabary, for example, contains characters that look similar to Latin characters and are consequently repurposed to give words in other languages an aesthetic effect (see example in Table 2), while other scripts, such as Balinese, are commonly used for purely decorative purposes alongside content in entirely unrelated languages. Some script-unique languages like Divehi do yield high-precision corpora right from the get-go, but they are the lucky few.
Table 2: Examples of several representative classes of noise in our initial web-crawl corpora.
3.3 Artifacts from Character n-gram Modeling: Many error modes seem to be direct consequences of n-gram-count-based models, and are also common in public corpora crawled using n-gram models like FastText (Grave et al., 2017); Appendix E explores these phenomena in the OSCAR (Ortiz Suárez et al., 2019) corpus. Here are a few important classes of pathologies we discovered; see Table 2 for examples of each, and Appendix C for frequency statistics:
Unlucky overlap of frequent n-grams with high-prevalence languages: Token frequencies in natural text follow a power-law distribution (Zipf, 1935), so the most common n-grams in a language will be present in a majority of its sentences. If one of these common n-grams happens to occur in a sentence in a different language, LangID models can over-trigger. We observed this with Oromo, where 50% of the crawled dataset was actually English sentences containing the word “essay” at least three times, misleading the model due to high counts for the n-grams “essa”, “ess”, “sa”, “a”, “e”, “s”, and “y”, all of which are top Oromo n-grams (see Appendix Table 12).
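A toy additive character-n-gram scorer makes this failure mode concrete: because scoring sums per-n-gram mass, every extra occurrence of a shared n-gram like “essa” pushes an English sentence further toward Oromo. All counts below are invented for illustration; this is not our production model:

```python
from collections import Counter

def char_ngrams(text, n=4):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Invented "Oromo" n-gram table: the n-grams inside the English word
# "essay" happen to carry high counts (values are illustrative only).
oromo_counts = Counter({"essa": 50, "ssay": 2, "kan ": 40, "itti": 30})

def toy_score(sentence, counts, n=4):
    # Naive additive scoring: every matching n-gram adds its count,
    # so each repetition of "essay" pushes the score higher.
    return sum(counts.get(g, 0) for g in char_ngrams(sentence, n))

en = "my essay about an essay on writing an essay"
print(toy_score(en, oromo_counts))  # → 156 (three "essay"s × (50 + 2))
```

A sentence with no overlap scores zero, so under a naive argmax this clearly English sentence would be attracted toward the wrong language.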
Repeated n-graaaaaaaaams: Repeating an n-gram sequence arbitrarily many times, which is rare in clean training text but common on the internet, can ramp up the class probability of a language even when that language is clearly wrong; cf. adversarial examples (Goodfellow et al., 2015).
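The same additive-scoring view explains the repetition pathology: under a toy scorer with invented weights (not our actual model), the score grows without bound as a single character is repeated:

```python
def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Invented per-n-gram weights for some language (illustrative only).
weights = {"aaa": 1.0, "haa": 0.5}

def ngram_weight_score(text, n=3):
    # Additive n-gram scoring: repetition keeps re-triggering the
    # same high-weight n-gram, so the score ramps up with length.
    return sum(weights.get(g, 0.0) for g in char_ngrams(text, n))

print(ngram_weight_score("haaa"))          # → 1.5
print(ngram_weight_score("h" + "a" * 40))  # → 38.5
```

Even length-normalized variants remain vulnerable, since the repeated n-gram dominates the average as the run grows.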
A N T S P E A K: A surprisingly common internet phenomenon is ‘ant speak’ text with space-separated characters, l i k e t h i s (Channing, 2020). Standard n-gram models, and even SentencePiece models (Kudo & Richardson, 2018), cannot handle this without special-casing. This affects about one to two languages per major script: we found that most of our “Chechen” data was actually R u s s i a n, most of our “Lambadi” was T e l u g u, our “Santali” was B e n g a l i, and some of our “Sepedi” was E n g l i s h.
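One possible form of special-casing, sketched here with invented names and thresholds (not necessarily what a production system would do), is to detect runs of single-character tokens and collapse them before LangID. Note that collapsing loses word boundaries, which is acceptable for identification though not for corpus building:

```python
import re

def is_ant_speak(sentence, min_frac=0.6):
    """Heuristic: flag sentences where most whitespace-separated tokens
    are single characters, as in 'l i k e t h i s'."""
    tokens = sentence.split()
    if len(tokens) < 4:
        return False
    single = sum(len(t) == 1 for t in tokens)
    return single / len(tokens) >= min_frac

def collapse_runs(sentence, min_run=4):
    """Join runs of at least min_run single-character tokens so that
    downstream n-gram LangID sees ordinary-looking words."""
    return re.sub(
        r"\b(?:\S ){%d,}\S\b" % (min_run - 1),
        lambda m: m.group(0).replace(" ", ""),
        sentence,
    )

print(is_ant_speak("l i k e t h i s"))           # → True
print(collapse_runs("written l i k e t h i s"))  # → written likethis
```

After collapsing, an n-gram model at least sees the characters of the underlying language adjacent to one another, rather than a sea of isolated letters.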
3.4 Languages with High-Prevalence Cousins: This is a specific and quite common case of the class-imbalance problem, but one that requires somewhat different techniques to mitigate (see §4). Crawling the web for a low-resource language (the “target language”) that is closely related to a language highly prevalent on the internet (the “distractor language”) can yield a dataset consisting mostly of the distractor language. A particularly salient example is Nigerian Pidgin (i.e., Naija, pcm) and English (en), which are similar enough (see Appendix Table 11 for examples) that typical LangID models have high false-positive rates between the two. Because of the prevalence of English on the internet, combined with this high degree of confusability, building a high-precision web-crawled text corpus for languages like Nigerian Pidgin is exceedingly difficult.
3.5 Languages with Out-of-Model Cousins: A variant of the above involves languages that are not supported by the LangID model but interfere with related languages that are. For example, a majority of our Uyghur crawl was actually Kazakh and Kyrgyz in the Arabic script; our model had been trained to recognize Kazakh and Kyrgyz, but only in the Cyrillic alphabet. Table 2 gives an example Kazakh sentence that was labeled as Uyghur.
3.6 Unrepresentative Training Data: Sometimes training data is too clean to prepare a model for out-of-domain, noisy web data; other times it is too noisy, too homogeneous, or contains systematic biases. For example, for some languages, training data (especially data sourced from Wikipedia) had high quantities of special characters and templated data (especially from censuses). Templated data may be harmful for n-gram models by skewing token distributions away from those of normal text, though there is some evidence that neural models may be less affected by token distributions than by latent structure (Papadimitriou & Jurafsky, 2020). Other training data had different issues; for instance, in our elicited Chechen data, the CYRILLIC LETTER PALOCHKA (not found on many keyboards) was represented with the ASCII digit “1”. Our model may therefore not handle Chechen text containing the correct code point, or other substitutes, very well.
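One way to address such substitutions is to normalize surrogates to the canonical code point before training and inference. A sketch of this idea (the surrogate list and Cyrillic-adjacency heuristic are illustrative, not our actual preprocessing; note that a crude rule like this would mis-handle genuine Cyrillic І in, e.g., Ukrainian text):

```python
PALOCHKA = "\u04C0"  # CYRILLIC LETTER PALOCHKA

# Characters commonly typed in place of the palochka: the ASCII digit
# "1", Latin I/l, and the Cyrillic letters І/і (list is illustrative).
SURROGATES = {"1", "I", "l", "\u0406", "\u0456"}

def normalize_palochka(text):
    """Replace palochka surrogates with U+04C0 when they occur adjacent
    to a character in the Cyrillic block -- a crude heuristic meant for
    text already believed to be Chechen."""
    out = []
    for i, ch in enumerate(text):
        prev_cyr = i > 0 and "\u0400" <= text[i - 1] <= "\u04FF"
        next_cyr = i + 1 < len(text) and "\u0400" <= text[i + 1] <= "\u04FF"
        if ch in SURROGATES and (prev_cyr or next_cyr):
            out.append(PALOCHKA)
        else:
            out.append(ch)
    return "".join(out)

print(normalize_palochka("до1а"))        # → доӀа (digit "1" → U+04C0)
print(normalize_palochka("page 1 of 2"))  # → page 1 of 2 (unchanged)
```

Applying the same normalization to both training data and crawled text at least makes the model's view of the language internally consistent, whichever surrogate a writer used.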