“XLM-R: State-of-the-Art Cross-Lingual Understanding through Self-Supervision”, 2019-11-07:
XLM-R is a new model that uses self-supervised training techniques to achieve state-of-the-art performance in cross-lingual understanding, a task in which a model is trained in one language and then applied to other languages without additional training data. The model improves on previous multilingual approaches by incorporating more training data and more languages, including so-called low-resource languages, which lack extensive labeled and unlabeled data sets.
XLM-R has achieved the best results to date on four cross-lingual understanding benchmarks, including gains of 4.7 percent average accuracy on the XNLI cross-lingual natural language inference data set, 8.4 percent average F1 score on the recently introduced MLQA question answering data set, and 2.1 percent F1 score on NER. Through extensive experiments and ablation studies, we show that XLM-R is the first multilingual model to outperform traditional monolingual baselines that rely on pretrained models.
In addition to sharing our results, we’re releasing the code and models that we used for this research. Those resources can be found in our fairseq, PyText, and XLM repositories on GitHub.
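As a concrete illustration of using the released models, the fairseq release exposes XLM-R through PyTorch Hub. The sketch below loads the base-size model and extracts contextual features for sentences in two languages; the hub entry point and method names follow fairseq’s published interface, and the first call downloads the pretrained weights.

```python
import torch

# Load the released XLM-R (base) model via PyTorch Hub; the pretrained
# weights are downloaded from the fairseq release on first use.
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.base')
xlmr.eval()  # disable dropout so feature extraction is deterministic

# A single shared SentencePiece vocabulary covers all languages, so the
# same encode/extract calls work regardless of the input language.
en_tokens = xlmr.encode('Hello, world!')
fr_tokens = xlmr.encode('Bonjour le monde !')

en_features = xlmr.extract_features(en_tokens)  # shape: (1, seq_len, 768)
fr_features = xlmr.extract_features(fr_tokens)
```

Because the encoder is shared across languages, the English and French features live in the same representation space, which is what allows a classifier trained on one language to transfer to the others.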
…With people on Facebook posting content in more than 160 languages, XLM-R represents an important step toward our vision of providing the best possible experience on our platforms for everyone, regardless of what language they speak. Potential applications include serving highly accurate models for identifying hate speech and other policy-violating content across a wide range of languages. As this work helps us transition toward a one-model-for-many-languages approach—as opposed to one model per language—it will also make it easier to continue launching high-performing products in multiple languages at once.