“The First AI Model That Translates 100 Languages without Relying on English Data”, Angela Fan (2020-10-19)⁠:

…In a culmination of many years of MT research at Facebook, we’re excited to announce a major milestone: the first single massive MMT model that can directly translate between any pair of 100 languages, in any direction, without relying solely on English-centric data. Our single multilingual model performs as well as traditional bilingual models and achieves a 10 BLEU point improvement over English-centric multilingual models. Using novel mining strategies to create translation data, we built the first truly “many-to-many” data set with 7.5 billion sentences for 100 languages. We used several scaling techniques to build a universal model with 15 billion parameters, which captures information from related languages and reflects a more diverse range of language scripts and morphology.
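One common way a single model serves every direction is to tag each example with language tokens, so one set of parameters learns all pairs, including non-English directions such as French to Chinese. The sketch below illustrates that scheme; the function and token format are hypothetical, not the actual M2M-100 preprocessing code.

```python
# Hypothetical sketch of language-token tagging for a many-to-many model.
# Tagging source and target sides lets one shared model learn every
# direction directly, rather than pivoting through English.

def make_example(src_tokens, tgt_tokens, src_lang, tgt_lang):
    """Prefix source and target token lists with language tokens
    (one common tagging scheme for multilingual MT)."""
    src = [f"__{src_lang}__"] + src_tokens
    tgt = [f"__{tgt_lang}__"] + tgt_tokens
    return src, tgt

# A direct French -> Chinese example, no English pivot involved.
src, tgt = make_example(["bonjour", "le", "monde"], ["ni", "hao"], "fr", "zh")
print(src)  # ['__fr__', 'bonjour', 'le', 'monde']
print(tgt)  # ['__zh__', 'ni', 'hao']
```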

…It’s a lot easier to find translations for Chinese to English and English to French than for, say, French to Chinese. What’s more, the volume of data required for training grows quadratically with the number of languages that we support. For instance, if we need 10M sentence pairs for each direction, then we need to mine 1B sentence pairs for 10 languages and 100B sentence pairs for 100 languages.
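The quadratic growth can be checked directly: with n languages there are n × (n − 1) ordered translation directions, and at 10M mined pairs per direction the totals round to the 1B and 100B figures quoted above.

```python
# Back-of-the-envelope check of the quadratic data growth described above.
PAIRS_PER_DIRECTION = 10_000_000  # 10M sentence pairs per direction

def total_pairs(n_languages):
    """Total mined sentence pairs for n languages, all ordered directions."""
    directions = n_languages * (n_languages - 1)
    return directions * PAIRS_PER_DIRECTION

print(f"{total_pairs(10):,}")   # 900,000,000 (~1B)
print(f"{total_pairs(100):,}")  # 99,000,000,000 (~100B)
```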

We took on the ambitious challenge of building the most diverse many-to-many MMT data set to date: 7.5 billion sentence pairs across 100 languages. This was made possible by combining complementary data mining resources that have been years in the making, including ccAligned, ccMatrix, and LASER. As part of this effort, we created a new LASER 2.0 and improved fastText language identification, which improve the quality of mining and include open-sourced training and evaluation scripts. All of our data mining resources leverage publicly available data and are open-sourced.
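The mining pipelines above rest on the same core idea: embed sentences from different languages into a shared multilingual space (as LASER does), then keep cross-lingual nearest neighbors as candidate translation pairs. The sketch below illustrates that retrieval step with toy vectors standing in for real LASER embeddings; the margin-based scoring used in practice is simplified here to a plain cosine-similarity threshold.

```python
# Illustrative sketch of embedding-space bitext mining in the spirit of
# LASER/ccMatrix. Toy 2-D vectors stand in for real sentence embeddings.
import numpy as np

def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Return (src_idx, tgt_idx) pairs whose cosine similarity beats threshold."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # cosine-similarity matrix
    best = sims.argmax(axis=1)         # nearest target for each source sentence
    return [(i, int(j)) for i, j in enumerate(best) if sims[i, j] > threshold]

# Toy example: source sentence 0 aligns with target 1, and 1 with target 0.
src_emb = np.array([[1.0, 0.1], [0.1, 1.0]])
tgt_emb = np.array([[0.0, 1.0], [1.0, 0.0]])
print(mine_pairs(src_emb, tgt_emb))  # [(0, 1), (1, 0)]
```

Real pipelines score candidates with a margin criterion (similarity relative to each sentence's other near neighbors), which is more robust than a fixed threshold, but the retrieval structure is the same.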