@AccountForAI and I trained a better multilingual encoder aligned with the OpenAI CLIP ViT-L/14 image encoder. github.com/FreddeFrallan/Mul… 1/6
The retrieval metrics on COCO github.com/FreddeFrallan/Mul… are at a similar or better level than those of the English text encoder: around 90% recall@10. 2/6
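Recall@10 here means: for each text query, the fraction of cases where the matching image appears among the 10 most similar images. A minimal sketch of the metric with synthetic embeddings (all names hypothetical, not the thread's actual evaluation code):

```python
import numpy as np

def recall_at_k(text_emb, image_emb, k=10):
    """Fraction of text queries whose matching image (same row index)
    appears among the k most similar images by cosine similarity."""
    # Normalize rows so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ i.T                           # (n_texts, n_images)
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of k best images per query
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()

# Sanity check: identical embeddings rank every query's own match first.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))
print(recall_at_k(emb, emb, k=10))  # → 1.0
```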
Thanks to stability.ai/ and laion.ai/ for providing the compute! 3/6
It’s a huggingface.co/xlm-roberta-l… model trained to predict the CLIP English text embeddings; the loss is MSE. See the full report at github.com/FreddeFrallan/Mul… 4/6
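The objective is just mean-squared error between the student's output and the frozen CLIP English text embedding of the same sentence. A toy sketch of that teacher-student regression with a linear student and synthetic data (numpy stand-in; the real student is XLM-R with a projection head, and these names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d_student, d_clip, n = 16, 8, 256

# Student features (stand-in for pooled XLM-R outputs for n sentences).
features = rng.normal(size=(n, d_student))
# Frozen "teacher" targets: CLIP English text embeddings (synthetic here,
# generated from a hidden linear map so the student can actually fit them).
W_true = rng.normal(size=(d_student, d_clip))
teacher = features @ W_true

# Train a linear projection head with plain gradient descent on MSE.
W = np.zeros((d_student, d_clip))
lr = 0.01
for _ in range(2000):
    pred = features @ W
    grad = features.T @ (pred - teacher) * (2 / n)  # d(MSE)/dW
    W -= lr * grad

mse = np.mean((features @ W - teacher) ** 2)
print(mse)  # close to zero after training
```

In the real setup the teacher (CLIP's English text tower) stays frozen and only the multilingual student is updated, so the student lands in the same embedding space as the image encoder.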
It makes it possible to do multilingual text-to-image retrieval using an existing knn image index that was built using CLIP ViT-L/14 embeddings. Test it yourself at rom1504.github.io/clip-retri… 5/6
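Because the multilingual encoder lands in the same space as CLIP ViT-L/14, an existing image index can be queried directly with its text embeddings. A hedged sketch of that lookup with a brute-force numpy index (real deployments use an approximate-nn library; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend index: precomputed, L2-normalized CLIP ViT-L/14 image embeddings.
image_index = rng.normal(size=(1000, 8))
image_index /= np.linalg.norm(image_index, axis=1, keepdims=True)

def search(query_emb, k=5):
    """Return indices of the k nearest images by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = image_index @ q          # one dot product per indexed image
    return np.argsort(-sims)[:k]    # best k, highest similarity first

# A real query embedding would come from the multilingual text encoder;
# here we query with a stored vector itself to show the nearest hit.
top = search(image_index[42], k=5)
print(top[0])  # → 42
```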

Jun 2, 2022 · 11:43 PM UTC

It also makes it possible to do CLIP guiding (AI art!) in other languages. We also trained and released OpenAI B/32 and OpenCLIP B/16+ mCLIP models; the OpenCLIP B/16+ version gets the best retrieval metrics. 6/6