“The First AI Model That Translates 100 Languages without Relying on English Data”, 2020-10-19 (; similar):
Facebook AI is introducing, M2M-100 the first multilingual machine translation (MMT) model that translates between any pair of 100 languages without relying on English data. It’s open sourced here.
When translating, say, Chinese to French, previous best multilingual models train on Chinese to English and English to French, because English training data is the most widely available. Our model directly trains on Chinese to French data to better preserve meaning. It outperforms English-centric systems by 10 points on the widely used BLEU metric for evaluating machine translations.
M2M-100 is trained on a total of 2,200 language directions—or 10× more than previous best, English-centric multilingual models. Deploying M2M-100 will improve the quality of translations for billions of people, especially those who speak low-resource languages.
This milestone is a culmination of years of Facebook AI’s foundational work in machine translation. Today, we’re sharing details on how we built a more diverse MMT training data set and model for 100 languages. We’re also releasing the model, training, and evaluation setup to help other researchers reproduce and further advance multilingual models.
…In a culmination of many years of MT research at Facebook, we’re excited to announce a major milestone: the first single massive MMT model that can directly translate 100×100 languages in any direction without relying on only English-centric data. Our single multilingual model performs as well as traditional bilingual models and achieved a 10 BLEU point improvement over English-centric multilingual models. Using novel mining strategies to create translation data, we built the first truly “many-to-many” data set with 7.5 billion sentences for 100 languages. We used several scaling techniques to build an universal model with 15 billion parameters, which captures information from related languages and reflects a more diverse script of languages and morphology.
…It’s a lot easier to find translations for Chinese to English and English to French, than, say, French to Chinese. What’s more, the volume of data required for training grows quadratically with the number of languages that we support. For instance, if we need 10M sentence pairs for each direction, then we need to mine 1B sentence pairs for 10 languages and 100B sentence pairs for 100 languages.
We took on this ambitious challenge of building the most diverse many-to-many MMT data set to date: 7.5 billion sentence pairs across 100 languages. This was possible by combining complementary data mining resources that have been years in the making, including ccAligned, ccMatrix, and LASER. As part of this effort, we created a new LASER 2.0 and improved fastText language identification, which improves the quality of mining and includes open sourced training and evaluation scripts. All of our data mining resources leverage publicly available data and are open sourced.