https://dl.fbaipublicfiles.com/imagebind/imagebind_video.mp4
https://ai.meta.com/blog/imagebind-six-modalities-binding-ai/
Hierarchical Text-Conditional Image Generation with CLIP Latents
https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf
https://mtg.upf.edu/system/files/publications/Font-Roma-Serra-ACMM-2013.pdf
InfoNCE: Representation Learning with Contrastive Predictive Coding (CPC)
Detecting Twenty-thousand Classes using Image-level Supervision
Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
Reproducible scaling laws for contrastive language-image learning
Wikipedia Bibliography: