@AccountForAI and I trained a better multilingual text encoder aligned with the OpenAI CLIP ViT-L/14 image encoder. github.com/FreddeFrallan/Mul… 1/6
The retrieval metrics on COCO (github.com/FreddeFrallan/Mul…) are at a similar or better level than those of the English text encoder: around 90% recall at 10. 2/6
Thanks to stability.ai/ and laion.ai/ for providing the compute! 3/6
It’s a huggingface.co/xlm-roberta-l… model trained to predict the CLIP English text embeddings. The loss is MSE. See the full report at github.com/FreddeFrallan/Mul… 4/6
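A toy sketch of that training objective, not the actual training code: a student is fit so that its outputs match frozen teacher embeddings under an MSE loss. Everything here is an assumption for illustration: the random arrays stand in for text features and CLIP text embeddings, the student is a single linear map rather than XLM-RoBERTa, and the names `mse_loss` and `train_linear_student` are hypothetical.

```python
import numpy as np


def mse_loss(student_emb, teacher_emb):
    """Mean squared error between student and teacher embeddings."""
    return float(np.mean((student_emb - teacher_emb) ** 2))


def train_linear_student(features, teacher_emb, lr=0.5, steps=200):
    """Fit a linear map W so that features @ W approximates teacher_emb.

    Plain gradient descent on the MSE loss; the teacher embeddings stay fixed.
    """
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(features.shape[1], teacher_emb.shape[1]))
    losses = []
    for _ in range(steps):
        pred = features @ W
        losses.append(mse_loss(pred, teacher_emb))
        # Gradient of mean((features @ W - teacher)^2) with respect to W.
        grad = 2.0 * features.T @ (pred - teacher_emb) / pred.size
        W -= lr * grad
    return W, losses


# Toy data (assumption): random stand-ins for student features
# and for the frozen teacher (CLIP English text) embeddings.
rng = np.random.default_rng(42)
features = rng.normal(size=(64, 16))
teacher = rng.normal(size=(64, 8))

W, losses = train_linear_student(features, teacher)
print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Only the student is updated; the target embeddings never change, which is why the resulting text encoder stays compatible with the existing image encoder.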
It makes it possible to do multilingual text-to-image retrieval using an existing kNN image index that was built using CLIP ViT-L/14 embeddings. Test it yourself here: rom1504.github.io/clip-retri… 5/6
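A minimal sketch of why this works, assuming a brute-force cosine-similarity search in place of a real kNN index: because the text embedding lives in the same space as the image embeddings, the index does not need to be rebuilt. The random vectors are stand-ins for CLIP ViT-L/14 image embeddings and for a multilingual text embedding, and `knn_search` is a hypothetical helper, not part of clip-retrieval.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit L2 norm along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def knn_search(query_emb, image_embs, k=3):
    """Return indices and cosine similarities of the k closest images."""
    q = normalize(query_emb)
    db = normalize(image_embs)
    sims = db @ q                      # cosine similarity to every image
    top = np.argsort(-sims)[:k]        # highest similarity first
    return top, sims[top]


# Toy data (assumption): 100 random "image embeddings" and a query
# that is a slightly perturbed copy of image 17, mimicking a text
# embedding that lands close to its matching image.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 32))
query = image_embs[17] + 0.01 * rng.normal(size=32)

idx, scores = knn_search(query, image_embs, k=3)
print(idx, scores)
```

A production index would use an approximate nearest-neighbour structure instead of the full matrix product, but the query side is the same: embed the text, normalize, and look up neighbours.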
Jun 2, 2022 · 11:43 PM UTC