“Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers”, 2021-01-31:
Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study 3 important factors that can affect the quality of the learned representations: pretraining data, the attention mechanism, and loss functions.
By pretraining models on 6 datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms.
Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.
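
For readers unfamiliar with the contrastive losses mentioned above, the following is a minimal sketch of a generic symmetric InfoNCE-style loss over paired image and text embeddings. It is an illustration only and is not taken from the paper: the function name `info_nce_loss`, the `temperature` default, and the use of one pooled embedding vector per image and per caption are all assumptions, and the paper's own loss formulations may differ.

```python
import numpy as np


def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to form a
    matching pair; every other row in the batch acts as a negative.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (batch, batch).
    logits = image_emb @ text_emb.T / temperature

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The correct match for row i sits in column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))


# Toy check: perfectly aligned pairs versus deliberately shifted pairs.
aligned = info_nce_loss(np.eye(4), np.eye(4))
mismatched = info_nce_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

In this toy setup `aligned` is close to zero and `mismatched` is much larger. Under the same assumptions, the similarity matrix could also be used for zero-shot retrieval by ranking the images in each caption's column.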