@AccountForAI and I trained a better multilingual encoder aligned with the OpenAI CLIP ViT-L/14 image encoder. github.com/FreddeFrallan/Mul… 1/6
The retrieval metrics on COCO github.com/FreddeFrallan/Mul… are at a similar or better level than those of the English text encoder: around 90% recall@10. 2/6
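Recall@10 here means: for each text query, the fraction of cases where the matching image appears among the 10 most similar images. A minimal sketch of the metric with synthetic embeddings (all names hypothetical, not the thread's actual evaluation code):

```python
import numpy as np

def recall_at_k(text_emb, image_emb, k=10):
    """Fraction of text queries whose matching image (same row index)
    appears among the k most similar images by cosine similarity."""
    # Normalize rows so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ i.T                           # (n_texts, n_images)
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of k best images per query
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()

# Sanity check: identical embeddings rank every query's own match first.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))
print(recall_at_k(emb, emb, k=10))  # → 1.0
```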
Thanks to stability.ai/ and laion.ai/ for providing the compute! 3/6
It’s a huggingface.co/xlm-roberta-l… model trained to predict the CLIP English text embeddings; the loss is MSE. See the full report at github.com/FreddeFrallan/Mul… 4/6
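The objective is just mean-squared error between the student's output and the frozen CLIP English text embedding of the same sentence. A toy sketch of that teacher-student regression with a linear student and synthetic data (numpy stand-in; the real student is XLM-R with a projection head, and these names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d_student, d_clip, n = 16, 8, 256

# Student features (stand-in for pooled XLM-R outputs for n sentences).
features = rng.normal(size=(n, d_student))
# Frozen "teacher" targets: CLIP English text embeddings (synthetic here,
# generated from a hidden linear map so the student can actually fit them).
W_true = rng.normal(size=(d_student, d_clip))
teacher = features @ W_true

# Train a linear projection head with plain gradient descent on MSE.
W = np.zeros((d_student, d_clip))
lr = 0.01
for _ in range(2000):
    pred = features @ W
    grad = features.T @ (pred - teacher) * (2 / n)  # d(MSE)/dW
    W -= lr * grad

mse = np.mean((features @ W - teacher) ** 2)
print(mse)  # close to zero after training
```

In the real setup the teacher (CLIP's English text tower) stays frozen and only the multilingual student is updated, so the student lands in the same embedding space as the image encoder.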
It makes it possible to do multilingual text-to-image retrieval using an existing knn image index that was built using CLIP ViT-L/14 embeddings. Test it yourself at rom1504.github.io/clip-retri… 5/6
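Because the multilingual encoder lands in the same space as CLIP ViT-L/14, an existing image index can be queried directly with its text embeddings. A hedged sketch of that lookup with a brute-force numpy index (real deployments use an approximate-nn library; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend index: precomputed, L2-normalized CLIP ViT-L/14 image embeddings.
image_index = rng.normal(size=(1000, 8))
image_index /= np.linalg.norm(image_index, axis=1, keepdims=True)

def search(query_emb, k=5):
    """Return indices of the k nearest images by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = image_index @ q          # one dot product per indexed image
    return np.argsort(-sims)[:k]    # best k, highest similarity first

# A real query embedding would come from the multilingual text encoder;
# here we query with a stored vector itself to show the nearest hit.
top = search(image_index[42], k=5)
print(top[0])  # → 42
```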

Jun 2, 2022 · 11:43 PM UTC

It also makes it possible to do CLIP guiding (AI art!) in other languages. We also trained and released OpenAI B/32 and OpenCLIP B/16+ mCLIP models; the OpenCLIP B/16+ version gets the best retrieval metrics. 6/6