- See Also
- Links
- “HiCLIP: Contrastive Language-Image Pretraining With Hierarchy-aware Attention”, Et Al 2023
- “BLIP-2: Bootstrapping Language-Image Pre-training With Frozen Image Encoders and Large Language Models”, Et Al 2023
- “MUG: Vision Learners Meet Web Image-Text Pairs”, Et Al 2023
- “Reaching 80% Zero-Shot Accuracy With OpenCLIP: VIT-G/14 Trained On LAION-2B”, 2023
- “Reproducible Scaling Laws for Contrastive Language-image Learning”, Et Al 2022
- “CLIP Itself Is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy With ViT-B and ViT-L on ImageNet”, Et Al 2022
- “A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others”, Et Al 2022
- “Scaling Language-Image Pre-training via Masking”, Et Al 2022
- “Retrieval-Augmented Multimodal Language Modeling”, Et Al 2022
- “Videogenic: Video Highlights via Photogenic Moments”, Et Al 2022
- “ClipCrop: Conditioned Cropping Driven by Vision-Language Model”, Et Al 2022
- “MaskDistill: A Unified View of Masked Image Modeling”, 2022
- “I Can’t Believe There’s No Images! Learning Visual Tasks Using Only Language Data”, Et Al 2022
- “Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces”, Et Al 2022
- “AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, Et Al 2022
- “EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Et Al 2022
- “3DALL·E: Integrating Text-to-Image AI in 3D Design Workflows”, Et Al 2022
- “Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Et Al 2022
- “ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training”, Et Al 2022
- “Incorporating Natural Language into Vision Models Improves Prediction and Understanding of Higher Visual Cortex”, Et Al 2022
- “Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest”, Et Al 2022
- “Fast Text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators”, Et Al 2022
- “What Does a Platypus Look Like? Generating Customized Prompts for Zero-shot Image Classification (CuPL)”, Et Al 2022
- “Decoding Speech from Non-invasive Brain Recordings”, Et Al 2022
- “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, Et Al 2022
- “CLIP-based Neural Neighbor Style Transfer for 3D Assets”, 2022
- “EVL: Frozen CLIP Models Are Efficient Video Learners”, Et Al 2022
- “Adversarial Attacks on Image Generation With Made-Up Words”, 2022
- “LaTTe: Language Trajectory TransformEr”, Et Al 2022
- “X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Et Al 2022
- “TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment”, Et Al 2022
- “NewsStories: Illustrating Articles With Visual Summaries”, Et Al 2022
- “Text-Guided Synthesis of Artistic Images With Retrieval-Augmented Diffusion Models”, Et Al 2022
- “MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training”, Et Al 2022
- “Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models”, 2022
- “Don’t Stop Learning: Towards Continual Learning for the CLIP Model”, Et Al 2022
- “Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning”, Et Al 2022
- “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Et Al 2022
- “LM-Nav: Robotic Navigation With Large Pre-Trained Models of Language, Vision, and Action”, Et Al 2022
- “CLAP: Learning Audio Concepts From Natural Language Supervision”, Et Al 2022
- “Improved Vector Quantized Diffusion Models”, Et Al 2022
- “ADAPT: Vision-Language Navigation With Modality-Aligned Action Prompts”, Et Al 2022
- “CyCLIP: Cyclic Contrastive Language-Image Pretraining”, Et Al 2022
- “Fine-grained Image Captioning With CLIP Reward”, Et Al 2022
- “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Et Al 2022
- “AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars”, Et Al 2022
- “CoCa: Contrastive Captioners Are Image-Text Foundation Models”, Et Al 2022
- “Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)”, Et Al 2022
- “Semi-Parametric Neural Image Synthesis”, Et Al 2022
- “Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?”, Et Al 2022
- “Opal: Multimodal Image Generation for News Illustration”, Et Al 2022
- “VQGAN-CLIP: Open Domain Image Generation and Editing With Natural Language Guidance”, Et Al 2022
- “No Token Left Behind: Explainability-Aided Image Classification and Generation”, Et Al 2022
- “Semantic Exploration from Language Abstractions and Pretrained Representations”, Et Al 2022
- “Unified Contrastive Learning in Image-Text-Label Space”, Et Al 2022
- “Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality”, Et Al 2022
- “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Et Al 2022
- “Learning to Generate Line Drawings That Convey Geometry and Semantics”, Et Al 2022
- “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Et Al 2022
- “CLIP on Wheels (CoW): Zero-Shot Object Navigation As Object Localization and Exploration”, Et Al 2022
- “Bamboo: Building Mega-Scale Vision Dataset Continually With Human-Machine Synergy”, Et Al 2022
- “CLIP Models Are Few-shot Learners: Empirical Studies on VQA and Visual Entailment”, Et Al 2022
- “The Unsurprising Effectiveness of Pre-Trained Vision Models for Control”, Et Al 2022
- “Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment”, Et Al 2022
- “RuCLIP—new Models and Experiments: a Technical Report”, Et Al 2022
- “Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework”, Et Al 2022
- “CLIPasso: Semantically-Aware Object Sketching”, Et Al 2022
- “Can Wikipedia Help Offline Reinforcement Learning?”, Et Al 2022
- “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Et Al 2022
- “SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, Et Al 2022
- “CM3: A Causal Masked Multimodal Model of the Internet”, Et Al 2022
- “LSeg: Language-driven Semantic Segmentation”, Et Al 2022
- “Detecting Twenty-thousand Classes Using Image-level Supervision”, Et Al 2022
- “Design Guidelines for Prompt Engineering Text-to-Image Generative Models”, 2022
- “A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision”, Et Al 2021
- “High-Resolution Image Synthesis With Latent Diffusion Models”, Et Al 2021
- “RegionCLIP: Region-based Language-Image Pretraining”, Et Al 2021
- “More Control for Free! Image Synthesis With Semantic Diffusion Guidance”, Et Al 2021
- “MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning”, Et Al 2021
- “CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions”, Et Al 2021
- “FuseDream: Training-Free Text-to-Image Generation With Improved CLIP+GAN Space Optimization”, Et Al 2021
- “Zero-Shot Text-Guided Object Generation With Dream Fields”, Et Al 2021
- “DenseCLIP: Extract Free Dense Labels from CLIP”, Et Al 2021
- “MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Et Al 2021
- “CRIS: CLIP-Driven Referring Image Segmentation”, Et Al 2021
- “Blended Diffusion for Text-driven Editing of Natural Images”, Et Al 2021
- “Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic”, Et Al 2021
- “LAFITE: Towards Language-Free Training for Text-to-Image Generation”, Et Al 2021
- “Florence: A New Foundation Model for Computer Vision”, Et Al 2021
- “BASIC: Combined Scaling for Open-Vocabulary Image Classification”, Et Al 2021
- “Simple but Effective: CLIP Embeddings for Embodied AI”, Et Al 2021
- “ClipCap: CLIP Prefix for Image Captioning”, Et Al 2021
- “INTERN: A New Learning Paradigm Towards General Vision”, Et Al 2021
- “LiT: Zero-Shot Transfer With Locked-image Text Tuning”, Et Al 2021
- “Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling”, Et Al 2021
- “StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, Et Al 2021
- “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, Et Al 2021
- “Projected GANs Converge Faster”, Et Al 2021
- “Telling Creative Stories Using Generative Visual Aids”, 2021
- “Image-Based CLIP-Guided Essence Transfer”, Et Al 2021
- “Wav2CLIP: Learning Robust Audio Representations From CLIP”, Et Al 2021
- “Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)”, Et Al 2021
- “MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training”, Et Al 2021
- “CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation”, Et Al 2021
- “CLOOB: Modern Hopfield Networks With InfoLOOB Outperform CLIP”, Et Al 2021
- “DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models”, 2021
- “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Et Al 2021
- “VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Et Al 2021
- “CLIPort: What and Where Pathways for Robotic Manipulation”, Et Al 2021
- “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, 2021
- “THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, 2021
- “Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts”, 2021
- “What Vision-Language Models ‘See’ When They See Scenes”, Et Al 2021
- “EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling”, Et Al 2021
- “Zero-Shot Open Set Detection by Extending CLIP”, Et Al 2021
- “Robust Fine-tuning of Zero-shot Models”, Et Al 2021
- “What Users Want? WARHOL: A Generative Model for Recommendation”, Et Al 2021
- “LAION-400-Million Open Dataset”, 2021
- “Contrastive Language-Image Pre-training for the Italian Language”, Et Al 2021
- “Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications”, Et Al 2021
- “StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators”, Et Al 2021
- “Segmentation in Style: Unsupervised Semantic Image Segmentation With StyleGAN and CLIP”, Et Al 2021
- “Language Grounding With 3D Objects”, Et Al 2021
- “FairyTailor: A Multimodal Generative Framework for Storytelling”, Et Al 2021
- “How Much Can CLIP Benefit Vision-and-Language Tasks?”, Et Al 2021
- “CLIP-It! Language-Guided Video Summarization”, Et Al 2021
- “Small In-distribution Changes in 3D Perspective and Lighting Fool Both CNNs and Transformers”, Et Al 2021
- “CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”, Et Al 2021
- “AudioCLIP: Extending CLIP to Image, Text and Audio”, Et Al 2021
- “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Et Al 2021
- “A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods”, Et Al 2021
- “Partial Success in Closing the Gap between Human and Machine Vision”, Et Al 2021
- “ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation”, Et Al 2021
- “Exploring the Limits of Out-of-Distribution Detection”, Et Al 2021
- “Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, 2021
- “Generative Art Using Neural Visual Grammars and Dual Encoders”, Et Al 2021
- “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Et Al 2021
- “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Et Al 2021
- “CLIPScore: A Reference-free Evaluation Metric for Image Captioning”, Et Al 2021
- “Paint by Word”, Et Al 2021
- “WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training”, Et Al 2021
- “Multimodal Neurons in Artificial Neural Networks [CLIP]”, Et Al 2021
- “Zero-Shot Text-to-Image Generation”, Et Al 2021
- “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, Et Al 2021
- “Generating Images from Caption and vice Versa via CLIP-Guided Generative Latent Space Search”, Et Al 2021
- “Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers”, Et Al 2021
- “Scoring Images from TADNE With CLIP”, Nagolinc 2021
- “DALL·E 1: Creating Images from Text: We’ve Trained a Neural Network Called DALL·E That Creates Images from Text Captions for a Wide Range of Concepts Expressible in Natural Language”, Et Al 2021
- “CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the “Zero-shot” Capabilities of GPT-2 and GPT-3”, Et Al 2021
- “Learning Transferable Visual Models From Natural Language Supervision”, Et Al 2021
- “Transformers in Vision: A Survey”, Et Al 2021
- “Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”, Et Al 2020
- “M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training”, Et Al 2020
- “Learning to Scale Multilingual Representations for Vision-Language Tasks”, Et Al 2020
- “The Messy, Secretive Reality behind OpenAI’s Bid to save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the inside Story of How Competitive Pressure Eroded That Idealism”, 2020
- “MULE: Multimodal Universal Language Embedding”, Et Al 2019
- “This Anime Does Not Exist, Search: This Notebook Uses the Precomputed CLIP Feature Vectors for 100k Images from TADNE”
- “CLIPIT PixelDraw Demo”
- “Clustering-laion400m: Script and Models for Clustering LAION-400m CLIP Embeddings. Models Were Fit on the First Million or so Image Embeddings.”
- “The Bouba/Kiki Effect And Sound Symbolism In CLIP”
- “[P] List of Sites/programs/projects That Use OpenAI’s CLIP Neural Network for Steering Image/video Creation to Match a Text Description”
- “Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations”
- “New AI Tools CLIP+VQ-GAN Can Create Impressive Works of Art Based on Just a Few Words of Input”
- “Apple or IPod? Easy Fix for Adversarial Textual Attacks on OpenAI’s CLIP Model!”
- Miscellaneous
- Link Bibliography
See Also
Links
“HiCLIP: Contrastive Language-Image Pretraining With Hierarchy-aware Attention”, Et Al 2023
“HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention”, 2023-03-06 (similar)
“BLIP-2: Bootstrapping Language-Image Pre-training With Frozen Image Encoders and Large Language Models”, Et Al 2023
“BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models”, 2023-01-30 ( ; similar; bibliography)
“MUG: Vision Learners Meet Web Image-Text Pairs”, Et Al 2023
“MUG: Vision Learners Meet Web Image-Text Pairs”, 2023-01-17 ( ; similar; bibliography)
“Reaching 80% Zero-Shot Accuracy With OpenCLIP: VIT-G/14 Trained On LAION-2B”, 2023
“Reaching 80% Zero-Shot Accuracy With OpenCLIP: VIT-G/14 Trained On LAION-2B”, 2023 (similar; bibliography)
“Reproducible Scaling Laws for Contrastive Language-image Learning”, Et Al 2022
“Reproducible scaling laws for contrastive language-image learning”, 2022-12-14 ( ; backlinks; similar; bibliography)
“CLIP Itself Is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy With ViT-B and ViT-L on ImageNet”, Et Al 2022
“CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet”, 2022-12-12 (similar; bibliography)
“A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others”, Et Al 2022
“A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others”, 2022-12-09 ( ; similar)
“Scaling Language-Image Pre-training via Masking”, Et Al 2022
“Scaling Language-Image Pre-training via Masking”, 2022-12-01 ( ; similar)
“Retrieval-Augmented Multimodal Language Modeling”, Et Al 2022
“Retrieval-Augmented Multimodal Language Modeling”, 2022-11-22 ( ; similar; bibliography)
“Videogenic: Video Highlights via Photogenic Moments”, Et Al 2022
“Videogenic: Video Highlights via Photogenic Moments”, 2022-11-22 ( ; similar)
“ClipCrop: Conditioned Cropping Driven by Vision-Language Model”, Et Al 2022
“ClipCrop: Conditioned Cropping Driven by Vision-Language Model”, 2022-11-21 (similar)
“MaskDistill: A Unified View of Masked Image Modeling”, 2022
“MaskDistill: A Unified View of Masked Image Modeling”, 2022-11-17 ( ; similar; bibliography)
“I Can’t Believe There’s No Images! Learning Visual Tasks Using Only Language Data”, Et Al 2022
“I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data”, 2022-11-17 ( ; similar)
“Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces”, Et Al 2022
“Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces”, 2022-11-14 ( ; backlinks; similar; bibliography)
“AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, Et Al 2022
“AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, 2022-11-12 ( ; similar; bibliography)
“EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Et Al 2022
“eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers”, 2022-11-02 ( ; similar; bibliography)
“3DALL·E: Integrating Text-to-Image AI in 3D Design Workflows”, Et Al 2022
“3DALL·E: Integrating Text-to-Image AI in 3D Design Workflows”, 2022-10-20 ( ; similar)
“Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Et Al 2022
“Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, 2022-10-17 ( ; similar)
“ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training”, Et Al 2022
“ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training”, 2022-10-04 (similar)
“Incorporating Natural Language into Vision Models Improves Prediction and Understanding of Higher Visual Cortex”, Et Al 2022
“Incorporating natural language into vision models improves prediction and understanding of higher visual cortex”, 2022-09-29 ( ; similar; bibliography)
“Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest”, Et Al 2022
“Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest”, 2022-09-13 ( ; similar)
“Fast Text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators”, Et Al 2022
“Fast text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators”, 2022-09-08 ( ; similar; bibliography)
“What Does a Platypus Look Like? Generating Customized Prompts for Zero-shot Image Classification (CuPL)”, Et Al 2022
“What does a platypus look like? Generating customized prompts for zero-shot image classification (CuPL)”, 2022-09-07 ( ; similar; bibliography)
“Decoding Speech from Non-invasive Brain Recordings”, Et Al 2022
“Decoding speech from non-invasive brain recordings”, 2022-08-25 ( ; similar; bibliography)
“Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, Et Al 2022
“Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, 2022-08-10 ( ; similar; bibliography)
“CLIP-based Neural Neighbor Style Transfer for 3D Assets”, 2022
“CLIP-based Neural Neighbor Style Transfer for 3D Assets”, 2022-08-08 (similar)
“EVL: Frozen CLIP Models Are Efficient Video Learners”, Et Al 2022
“EVL: Frozen CLIP Models are Efficient Video Learners”, 2022-08-06 ( ; similar; bibliography)
“Adversarial Attacks on Image Generation With Made-Up Words”, 2022
“Adversarial Attacks on Image Generation With Made-Up Words”, 2022-08-04 ( ; similar)
“LaTTe: Language Trajectory TransformEr”, Et Al 2022
“LaTTe: Language Trajectory TransformEr”, 2022-08-04 ( ; similar)
“X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Et Al 2022
“X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, 2022-08-04 ( ; similar)
“TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment”, Et Al 2022
“TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment”, 2022-07-29 (similar; bibliography)
“NewsStories: Illustrating Articles With Visual Summaries”, Et Al 2022
“NewsStories: Illustrating articles with visual summaries”, 2022-07-26 ( ; similar; bibliography)
“Text-Guided Synthesis of Artistic Images With Retrieval-Augmented Diffusion Models”, Et Al 2022
“Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models”, 2022-07-26 ( ; backlinks; similar)
“MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training”, Et Al 2022
“MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training”, 2022-07-26 (similar; bibliography)
“Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models”, 2022
“Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models”, 2022-07-23 ( ; similar)
“Don’t Stop Learning: Towards Continual Learning for the CLIP Model”, Et Al 2022
“Don’t Stop Learning: Towards Continual Learning for the CLIP Model”, 2022-07-19 (backlinks; similar)
“Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning”, Et Al 2022
“Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning”, 2022-07-15 (similar; bibliography)
“X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Et Al 2022
“X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, 2022-07-15 ( ; similar; bibliography)
“LM-Nav: Robotic Navigation With Large Pre-Trained Models of Language, Vision, and Action”, Et Al 2022
“LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action”, 2022-07-10 ( ; backlinks; similar; bibliography)
“CLAP: Learning Audio Concepts From Natural Language Supervision”, Et Al 2022
“CLAP: Learning Audio Concepts From Natural Language Supervision”, 2022-06-09 ( ; similar)
“Improved Vector Quantized Diffusion Models”, Et Al 2022
“Improved Vector Quantized Diffusion Models”, 2022-05-31 ( ; similar; bibliography)
“ADAPT: Vision-Language Navigation With Modality-Aligned Action Prompts”, Et Al 2022
“ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts”, 2022-05-31 ( ; similar)
“CyCLIP: Cyclic Contrastive Language-Image Pretraining”, Et Al 2022
“CyCLIP: Cyclic Contrastive Language-Image Pretraining”, 2022-05-28 (similar; bibliography)
“Fine-grained Image Captioning With CLIP Reward”, Et Al 2022
“Fine-grained Image Captioning with CLIP Reward”, 2022-05-26 ( ; similar)
“VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Et Al 2022
“VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners”, 2022-05-22 ( ; similar; bibliography)
“AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars”, Et Al 2022
“AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars”, 2022-05-17 ( ; similar; bibliography)
“CoCa: Contrastive Captioners Are Image-Text Foundation Models”, Et Al 2022
“CoCa: Contrastive Captioners are Image-Text Foundation Models”, 2022-05-04 ( ; similar; bibliography)
“Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)”, Et Al 2022
“Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)”, 2022-05-03 ( ; similar; bibliography)
“Semi-Parametric Neural Image Synthesis”, Et Al 2022
“Semi-Parametric Neural Image Synthesis”, 2022-04-25 ( ; similar)
“Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?”, Et Al 2022
“Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?”, 2022-04-23 ( ; similar)
“Opal: Multimodal Image Generation for News Illustration”, Et Al 2022
“Opal: Multimodal Image Generation for News Illustration”, 2022-04-19 ( ; similar)
“VQGAN-CLIP: Open Domain Image Generation and Editing With Natural Language Guidance”, Et Al 2022
“VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance”, 2022-04-18 ( ; similar)
“No Token Left Behind: Explainability-Aided Image Classification and Generation”, Et Al 2022
“No Token Left Behind: Explainability-Aided Image Classification and Generation”, 2022-04-11 (similar)
“Semantic Exploration from Language Abstractions and Pretrained Representations”, Et Al 2022
“Semantic Exploration from Language Abstractions and Pretrained Representations”, 2022-04-08 ( ; similar; bibliography)
“Unified Contrastive Learning in Image-Text-Label Space”, Et Al 2022
“Unified Contrastive Learning in Image-Text-Label Space”, 2022-04-07 (similar; bibliography)
“Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality”, Et Al 2022
“Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality”, 2022-04-07 ( ; similar)
“Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Et Al 2022
“Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language”, 2022-04-01 ( ; similar; bibliography)
“Learning to Generate Line Drawings That Convey Geometry and Semantics”, Et Al 2022
“Learning to generate line drawings that convey geometry and semantics”, 2022-03-23 ( ; similar)
“CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Et Al 2022
“CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning”, 2022-03-21 ( ; similar; bibliography)
“CLIP on Wheels (CoW): Zero-Shot Object Navigation As Object Localization and Exploration”, Et Al 2022
“CLIP on Wheels (CoW): Zero-Shot Object Navigation as Object Localization and Exploration”, 2022-03-20 ( ; similar)
“Bamboo: Building Mega-Scale Vision Dataset Continually With Human-Machine Synergy”, Et Al 2022
“Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy”, 2022-03-15 ( ; similar)
“CLIP Models Are Few-shot Learners: Empirical Studies on VQA and Visual Entailment”, Et Al 2022
“CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment”, 2022-03-14 (similar)
“The Unsurprising Effectiveness of Pre-Trained Vision Models for Control”, Et Al 2022
“The Unsurprising Effectiveness of Pre-Trained Vision Models for Control”, 2022-03-07 ( ; similar)
“Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment”, Et Al 2022
“Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment”, 2022-03-01 ( ; similar)
“RuCLIP—new Models and Experiments: a Technical Report”, Et Al 2022
“RuCLIP—new models and experiments: a technical report”, 2022-02-22 ( ; similar)
“Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework”, Et Al 2022
“Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework”, 2022-02-14 ( ; similar; bibliography)
“CLIPasso: Semantically-Aware Object Sketching”, Et Al 2022
“CLIPasso: Semantically-Aware Object Sketching”, 2022-02-11 (similar)
“Can Wikipedia Help Offline Reinforcement Learning?”, Et Al 2022
“Can Wikipedia Help Offline Reinforcement Learning?”, 2022-01-28 ( ; similar)
“BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Et Al 2022
“BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, 2022-01-28 ( ; similar; bibliography)
“SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, Et Al 2022
“SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, 2022-01-20 ( ; similar; bibliography)
“CM3: A Causal Masked Multimodal Model of the Internet”, Et Al 2022
“CM3: A Causal Masked Multimodal Model of the Internet”, 2022-01-19 ( ; similar)
“LSeg: Language-driven Semantic Segmentation”, Et Al 2022
“LSeg: Language-driven Semantic Segmentation”, 2022-01-10 (similar)
“Detecting Twenty-thousand Classes Using Image-level Supervision”, Et Al 2022
“Detecting Twenty-thousand Classes using Image-level Supervision”, 2022-01-07 (similar; bibliography)
“Design Guidelines for Prompt Engineering Text-to-Image Generative Models”, 2022
“Design Guidelines for Prompt Engineering Text-to-Image Generative Models”, 2022-01-07 ( ; similar; bibliography)
“A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision”, Et Al 2021
“A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision”, 2021-12-27 ( ; similar)
“High-Resolution Image Synthesis With Latent Diffusion Models”, Et Al 2021
“High-Resolution Image Synthesis with Latent Diffusion Models”, 2021-12-20 ( ; backlinks; similar; bibliography)
“RegionCLIP: Region-based Language-Image Pretraining”, Et Al 2021
“RegionCLIP: Region-based Language-Image Pretraining”, 2021-12-16 (similar; bibliography)
“More Control for Free! Image Synthesis With Semantic Diffusion Guidance”, Et Al 2021
“More Control for Free! Image Synthesis with Semantic Diffusion Guidance”, 2021-12-10 ( ; similar; bibliography)
“MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning”, Et Al 2021
“MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning”, 2021-12-09 ( ; backlinks; similar)
“CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions”, Et Al 2021
“CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions”, 2021-12-09 ( ; similar)
“FuseDream: Training-Free Text-to-Image Generation With Improved CLIP+GAN Space Optimization”, Et Al 2021
“FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization”, 2021-12-02 ( ; similar; bibliography)
“Zero-Shot Text-Guided Object Generation With Dream Fields”, Et Al 2021
“Zero-Shot Text-Guided Object Generation with Dream Fields”, 2021-12-02 ( ; similar)
“DenseCLIP: Extract Free Dense Labels from CLIP”, Et Al 2021
“DenseCLIP: Extract Free Dense Labels from CLIP”, 2021-12-02 (similar; bibliography)
“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Et Al 2021
“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, 2021-12-01 ( ; similar)
“CRIS: CLIP-Driven Referring Image Segmentation”, Et Al 2021
“CRIS: CLIP-Driven Referring Image Segmentation”, 2021-11-30 (similar)
“Blended Diffusion for Text-driven Editing of Natural Images”, Et Al 2021
“Blended Diffusion for Text-driven Editing of Natural Images”, 2021-11-29 ( ; similar)
“Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic”, Et Al 2021
“Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic”, 2021-11-29 ( ; similar)
“LAFITE: Towards Language-Free Training for Text-to-Image Generation”, Et Al 2021
“LAFITE: Towards Language-Free Training for Text-to-Image Generation”, 2021-11-27 ( ; similar)
“Florence: A New Foundation Model for Computer Vision”, Et Al 2021
“Florence: A New Foundation Model for Computer Vision”, 2021-11-22 ( ; similar; bibliography)
“BASIC: Combined Scaling for Open-Vocabulary Image Classification”, Et Al 2021
“BASIC: Combined Scaling for Open-Vocabulary Image Classification”, 2021-11-19 ( ; similar; bibliography)
“Simple but Effective: CLIP Embeddings for Embodied AI”, Et Al 2021
“Simple but Effective: CLIP Embeddings for Embodied AI”, 2021-11-18 ( ; similar)
“ClipCap: CLIP Prefix for Image Captioning”, Et Al 2021
“ClipCap: CLIP Prefix for Image Captioning”, 2021-11-18 ( ; similar; bibliography)
“INTERN: A New Learning Paradigm Towards General Vision”, Et Al 2021
“INTERN: A New Learning Paradigm Towards General Vision”, 2021-11-16 ( ; similar)
“LiT: Zero-Shot Transfer With Locked-image Text Tuning”, Et Al 2021
“LiT: Zero-Shot Transfer with Locked-image Text Tuning”, 2021-11-15 ( ; similar; bibliography)
“Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling”, Et Al 2021
“Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling”, 2021-11-06 (similar; bibliography)
“StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, Et Al 2021
“StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, 2021-11-04 ( ; similar; bibliography)
“LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, Et Al 2021
“LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, 2021-11-03 ( ; similar; bibliography)
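The “CLIP-Filtered” in the LAION-400M title refers to keeping only those crawled image-text pairs whose CLIP image and text embeddings agree; the LAION release describes a cosine-similarity cutoff of roughly 0.3 with the ViT-B/32 model. A minimal sketch of that filtering step, assuming the `openai/CLIP` package; the threshold and the candidate pairs below are illustrative, not the exact pipeline:

```python
# Sketch of LAION-style CLIP-similarity filtering: keep an image-text pair only
# if its CLIP embeddings are similar enough. The ~0.3 cutoff and the example
# pairs are assumptions for illustration.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def keep_pair(image_path: str, caption: str, threshold: float = 0.3) -> bool:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item() >= threshold

# Hypothetical crawl results: mismatched captions fall below the cutoff and are dropped.
pairs = [("cat.jpg", "a tabby cat asleep on a sofa"), ("banner.jpg", "click here for free prizes")]
kept = [(path, cap) for path, cap in pairs if keep_pair(path, cap)]
```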
“Projected GANs Converge Faster”, Et Al 2021
“Projected GANs Converge Faster”, 2021-11-01 ( ; backlinks; similar; bibliography)
“Telling Creative Stories Using Generative Visual Aids”, 2021
“Telling Creative Stories Using Generative Visual Aids”, 2021-10-27 ( ; similar)
“Image-Based CLIP-Guided Essence Transfer”, Et Al 2021
“Image-Based CLIP-Guided Essence Transfer”, 2021-10-24 (similar)
“Wav2CLIP: Learning Robust Audio Representations From CLIP”, Et Al 2021
“Wav2CLIP: Learning Robust Audio Representations From CLIP”, 2021-10-21 ( ; similar)
“Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)”, Et Al 2021
“Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)”, 2021-10-11 ( ; similar; bibliography)
“MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training”, Et Al 2021
“MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training”, 2021-10-06 (similar; bibliography)
“CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation”, Et Al 2021
“CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation”, 2021-10-06 (similar)
“CLOOB: Modern Hopfield Networks With InfoLOOB Outperform CLIP”, Et Al 2021
“CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP”, 2021-10-05 ( ; similar; bibliography)
“DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models”, 2021
“DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models”, 2021-10-05 ( ; similar)
“OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Et Al 2021
“OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation”, 2021-10-05 ( ; similar; bibliography)
“VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Et Al 2021
“VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, 2021-09-28 ( ; similar)
“CLIPort: What and Where Pathways for Robotic Manipulation”, Et Al 2021
“CLIPort: What and Where Pathways for Robotic Manipulation”, 2021-09-24 ( ; similar)
“ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, 2021
“ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation”, 2021-09-24 ( ; similar; bibliography)
“THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, 2021
“THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, 2021-09-22 ( ; similar; bibliography)
“Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts”, 2021
“Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts”, 2021-09-18 (similar; bibliography)
“What Vision-Language Models ‘See’ When They See Scenes”, Et Al 2021
“What Vision-Language Models ‘See’ when they See Scenes”, 2021-09-15 (similar)
“EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling”, Et Al 2021
“EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling”, 2021-09-10 ( ; similar)
“Zero-Shot Open Set Detection by Extending CLIP”, Et Al 2021
“Zero-Shot Open Set Detection by Extending CLIP”, 2021-09-06 (similar)
“Robust Fine-tuning of Zero-shot Models”, Et Al 2021
“Robust fine-tuning of zero-shot models”, 2021-09-04 (similar)
“What Users Want? WARHOL: A Generative Model for Recommendation”, Et Al 2021
“What Users Want? WARHOL: A Generative Model for Recommendation”, 2021-09-02 ( ; similar)
“LAION-400-Million Open Dataset”, 2021
“LAION-400-Million Open Dataset”, 2021-08-20 ( ; similar; bibliography)
“Contrastive Language-Image Pre-training for the Italian Language”, Et Al 2021
“Contrastive Language-Image Pre-training for the Italian Language”, 2021-08-19 ( ; similar)
“Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications”, Et Al 2021
“Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications”, 2021-08-05 (similar)
“StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators”, Et Al 2021
“StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators”, 2021-08-02 (similar)
“Segmentation in Style: Unsupervised Semantic Image Segmentation With StyleGAN and CLIP”, Et Al 2021
“Segmentation in Style: Unsupervised Semantic Image Segmentation with StyleGAN and CLIP”, 2021-07-26 (similar; bibliography)
“Language Grounding With 3D Objects”, Et Al 2021
“Language Grounding with 3D Objects”, 2021-07-26 ( ; similar)
“FairyTailor: A Multimodal Generative Framework for Storytelling”, Et Al 2021
“FairyTailor: A Multimodal Generative Framework for Storytelling”, 2021-07-13 (similar)
“How Much Can CLIP Benefit Vision-and-Language Tasks?”, Et Al 2021
“How Much Can CLIP Benefit Vision-and-Language Tasks?”, 2021-07-13 (similar; bibliography)
“CLIP-It! Language-Guided Video Summarization”, Et Al 2021
“CLIP-It! Language-Guided Video Summarization”, 2021-07-01 ( ; similar)
“Small In-distribution Changes in 3D Perspective and Lighting Fool Both CNNs and Transformers”, Et Al 2021
“Small in-distribution changes in 3D perspective and lighting fool both CNNs and Transformers”, 2021-06-30 (similar; bibliography)
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”, Et Al 2021
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”, 2021-06-28 (similar)
“AudioCLIP: Extending CLIP to Image, Text and Audio”, Et Al 2021
“AudioCLIP: Extending CLIP to Image, Text and Audio”, 2021-06-24 ( ; backlinks; similar; bibliography)
“CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Et Al 2021
“CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, 2021-06-21 ( ; similar; bibliography)
“A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods”, Et Al 2021
“A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods”, 2021-06-16 (similar)
“Partial Success in Closing the Gap between Human and Machine Vision”, Et Al 2021
“Partial success in closing the gap between human and machine vision”, 2021-06-14 ( ; backlinks; similar; bibliography)
“ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation”, Et Al 2021
“ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation”, 2021-06-10 ( ; similar)
“Exploring the Limits of Out-of-Distribution Detection”, Et Al 2021
“Exploring the Limits of Out-of-Distribution Detection”, 2021-06-06 ( ; similar; bibliography)
“Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, 2021
“Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters”, 2021-06-01 ( ; similar; bibliography)
“Generative Art Using Neural Visual Grammars and Dual Encoders”, Et Al 2021
“Generative Art Using Neural Visual Grammars and Dual Encoders”, 2021-05-01 (similar)
“Zero-Shot Detection via Vision and Language Knowledge Distillation”, Et Al 2021
“Zero-Shot Detection via Vision and Language Knowledge Distillation”, 2021-04-28 ( ; similar; bibliography)
“Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Et Al 2021
“Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation”, 2021-04-18 ( ; similar; bibliography)
“CLIPScore: A Reference-free Evaluation Metric for Image Captioning”, Et Al 2021
“CLIPScore: A Reference-free Evaluation Metric for Image Captioning”, 2021-04-18 (similar)
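CLIPScore is reference-free: a candidate caption is scored directly by its CLIP similarity to the image, with no human reference captions, as a rescaled and clipped cosine similarity (the paper uses a rescaling constant of 2.5). A small sketch under that definition, with the model choice and file path as placeholder assumptions:

```python
# CLIPScore sketch: score a caption by rescaled, clipped cosine similarity
# between CLIP image and caption embeddings (w = 2.5 per Hessel et al 2021).
# Model choice and image path are illustrative.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clipscore(image_path: str, caption: str, w: float = 2.5) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    cos = torch.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)  # clip negative similarities to 0, then rescale

print(clipscore("photo.jpg", "a man riding a bicycle down a city street"))
```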
“Paint by Word”, Et Al 2021
“Paint by Word”, 2021-03-19 ( ; similar)
“WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training”, Et Al 2021
“WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training”, 2021-03-11 ( ; backlinks; similar)
“Multimodal Neurons in Artificial Neural Networks [CLIP]”, Et Al 2021
“Multimodal Neurons in Artificial Neural Networks [CLIP]”, 2021-03-04 ( ; similar; bibliography)
“Zero-Shot Text-to-Image Generation”, Et Al 2021
“Zero-Shot Text-to-Image Generation”, 2021-02-24 ( ; similar)
“ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, Et Al 2021
“ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, 2021-02-11 ( ; similar; bibliography)
“Generating Images from Caption and vice Versa via CLIP-Guided Generative Latent Space Search”, Et Al 2021
“Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search”, 2021-02-02 ( ; similar; bibliography)
“Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers”, Et Al 2021
“Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers”, 2021-01-31 ( ; similar)
“Scoring Images from TADNE With CLIP”, Nagolinc 2021
“Scoring images from TADNE with CLIP”, 2021-01-20 ( ; backlinks; similar; bibliography)
“DALL·E 1: Creating Images from Text: We’ve Trained a Neural Network Called DALL·E That Creates Images from Text Captions for a Wide Range of Concepts Expressible in Natural Language”, Et Al 2021
“DALL·E 1: Creating Images from Text: We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language”, 2021-01-05 ( ; backlinks; similar; bibliography)
“CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the “Zero-shot” Capabilities of GPT-2 and GPT-3”, Et Al 2021
“CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3”, 2021-01-05 ( ; backlinks; similar; bibliography)
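The zero-shot recipe described in the announcement—scoring an image against a text prompt for each candidate category and picking the best match—takes only a few lines with the released `openai/CLIP` package. A minimal sketch; the label set, prompt template, and image path are examples rather than anything prescribed by the post:

```python
# Minimal CLIP zero-shot classification sketch using the released openai/CLIP
# package; labels, prompt template, and image path are illustrative.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["dog", "cat", "bird"]
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

with torch.no_grad():
    # Temperature-scaled cosine similarities between the image and each prompt.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(labels, probs[0])))  # the highest-probability label is the prediction
```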
“Learning Transferable Visual Models From Natural Language Supervision”, Et Al 2021
“Learning Transferable Visual Models From Natural Language Supervision”, 2021-01-05 ( ; backlinks; similar; bibliography)
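The training objective behind the paper is a symmetric contrastive loss over a batch of matched image-text pairs (the paper includes equivalent NumPy-style pseudocode): each image should score highest against its own caption, and vice versa. A sketch of that loss, assuming the two encoders already produce batched embeddings; the fixed temperature stands in for the paper's learned one:

```python
# Symmetric image-text contrastive loss in the style of the CLIP paper's
# pseudocode; `img_emb` and `txt_emb` are assumed [N, d] outputs of the two
# encoders for N matched pairs.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                    # [N, N] similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)  # matched pairs lie on the diagonal
    loss_img = F.cross_entropy(logits, targets)                     # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)                 # text -> image direction
    return (loss_img + loss_txt) / 2

# Example with random embeddings standing in for encoder outputs:
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```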
“Transformers in Vision: A Survey”, Et Al 2021
“Transformers in Vision: A Survey”, 2021-01-04 ( ; similar)
“Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”, Et Al 2020
“Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, 2020-09-28 ( ; similar; bibliography)
“M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training”, Et Al 2020
“M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training”, 2020-06-04 ( ; similar)
“Learning to Scale Multilingual Representations for Vision-Language Tasks”, Et Al 2020
“Learning to Scale Multilingual Representations for Vision-Language Tasks”, 2020-04-09 ( ; similar)
“The Messy, Secretive Reality behind OpenAI’s Bid to save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the inside Story of How Competitive Pressure Eroded That Idealism”, 2020
“The messy, secretive reality behind OpenAI’s bid to save the world: The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism”, 2020-02-17 ( ; backlinks; similar; bibliography)
“MULE: Multimodal Universal Language Embedding”, Et Al 2019
“MULE: Multimodal Universal Language Embedding”, 2019-09-08 ( ; similar)
“This Anime Does Not Exist, Search: This Notebook Uses the Precomputed CLIP Feature Vectors for 100k Images from TADNE”
“CLIPIT PixelDraw Demo”
“Clustering-laion400m: Script and Models for Clustering LAION-400m CLIP Embeddings. Models Were Fit on the First Million or so Image Embeddings.”
“The Bouba/Kiki Effect And Sound Symbolism In CLIP”
“[P] List of Sites/programs/projects That Use OpenAI’s CLIP Neural Network for Steering Image/video Creation to Match a Text Description”
“Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations”
“New AI Tools CLIP+VQ-GAN Can Create Impressive Works of Art Based on Just a Few Words of Input”
“Apple or IPod? Easy Fix for Adversarial Textual Attacks on OpenAI’s CLIP Model!”
Miscellaneous
- https://colab.research.google.com/drive/1N8Cc9yYzNR4M9J2NtE3n3jL2Jy25KAl_
- https://creator.nightcafe.studio/vqgan-clip-keyword-modifier-comparison
- https://github.com/EleutherAI/vqgan-clip/tree/main/notebooks
- https://github.com/LAION-AI/laion-datasets/blob/main/laion-aesthetic.md
- https://github.com/christophschuhmann/4MC-4M-Image-Text-Pairs-with-CLIP-embeddings
- https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup
- https://nitter.moomoo.me/NicholasBardy/status/1530461357048418304
- https://old.reddit.com/r/MachineLearning/comments/nq4es7/d_unreal_engine_trick_with_vqgan_clip/
- https://stanislavfort.github.io/blog/OpenAI_CLIP_adversarial_examples/
- https://stanislavfort.github.io/blog/OpenAI_CLIP_stickers_and_adversarial_examples/
- https://web.media.mit.edu/~echu/assets/projects/evolving-views/paper.pdf
- https://www.unlimiteddreamco.xyz/writing-good-prompts-part-1/
- https://www.unlimiteddreamco.xyz/writing-good-prompts-part-2/
- https://www.unlimiteddreamco.xyz/writing-good-prompts-part-3/
Link Bibliography
- https://arxiv.org/abs/2301.12597#salesforce: “BLIP-2: Bootstrapping Language-Image Pre-training With Frozen Image Encoders and Large Language Models”, Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
- https://arxiv.org/abs/2301.07088#bytedance: “MUG: Vision Learners Meet Web Image-Text Pairs”, Bingchen Zhao, Quan Cui, Hao Wu, Osamu Yoshie, Cheng Yang
- https://laion.ai/blog/giant-openclip/: “Reaching 80% Zero-Shot Accuracy With OpenCLIP: VIT-G/14 Trained On LAION-2B”, Mitchell Wortsman
- https://arxiv.org/abs/2212.07143: “Reproducible Scaling Laws for Contrastive Language-image Learning”
- https://arxiv.org/abs/2212.06138#microsoft: “CLIP Itself Is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy With ViT-B and ViT-L on ImageNet”
- https://arxiv.org/abs/2211.12561#facebook: “Retrieval-Augmented Multimodal Language Modeling”
- https://openreview.net/forum?id=wmGlMhaBe0: “MaskDistill: A Unified View of Masked Image Modeling”, Anonymous
- https://arxiv.org/abs/2211.07292: “Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces”, Dominic Rampas, Pablo Pernias, Elea Zhong, Marc Aubreville
- https://arxiv.org/abs/2211.06679#baai: “AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu
- https://arxiv.org/abs/2211.01324#nvidia: “EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”
- https://www.biorxiv.org/content/10.1101/2022.09.27.508760.full: “Incorporating Natural Language into Vision Models Improves Prediction and Understanding of Higher Visual Cortex”, Aria Y. Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, Leila Wehbe
- https://arxiv.org/abs/2209.03953: “Fast Text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators”, Xiaodan Du, Raymond A. Yeh, Nicholas Kolkin, Eli Shechtman, Greg Shakhnarovich
- https://arxiv.org/abs/2209.03320: “What Does a Platypus Look Like? Generating Customized Prompts for Zero-shot Image Classification (CuPL)”, Sarah Pratt, Rosanne Liu, Ali Farhadi
- https://arxiv.org/abs/2208.12266#facebook: “Decoding Speech from Non-invasive Brain Recordings”, Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, Jean-Rémi King
- https://arxiv.org/abs/2208.05516: “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, Ludwig Schmidt
- https://arxiv.org/abs/2208.03550: “EVL: Frozen CLIP Models Are Efficient Video Learners”, Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
- https://arxiv.org/abs/2207.14525: “TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment”, Tejas Srinivasan, Xiang Ren, Jesse Thomason
- https://arxiv.org/abs/2207.13061: “NewsStories: Illustrating Articles With Visual Summaries”, Reuben Tan, Bryan A. Plummer, Kate Saenko, J. P. Lewis, Avneesh Sud, Thomas Leung
- https://arxiv.org/abs/2207.12661#microsoft: “MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training”, Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan
- https://arxiv.org/abs/2207.07635: “Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning”, Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, Tatsunori Hashimoto
- https://arxiv.org/abs/2207.07285#alibaba: “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji
- https://arxiv.org/abs/2207.04429: “LM-Nav: Robotic Navigation With Large Pre-Trained Models of Language, Vision, and Action”, Dhruv Shah, Blazej Osinski, Brian Ichter, Sergey Levine
- https://arxiv.org/abs/2205.16007#microsoft: “Improved Vector Quantized Diffusion Models”, Zhicong Tang, Shuyang Gu, Jianmin Bao, Dong Chen, Fang Wen
- https://arxiv.org/abs/2205.14459: “CyCLIP: Cyclic Contrastive Language-Image Pretraining”, Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan A. Rossi, Vishwa Vinay, Aditya Grover
- https://arxiv.org/abs/2205.10747: “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”
- https://arxiv.org/abs/2205.08535: “AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars”, Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, Ziwei Liu
- https://arxiv.org/abs/2205.01917#google: “CoCa: Contrastive Captioners Are Image-Text Foundation Models”, Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
- https://arxiv.org/abs/2205.01397: “Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)”, Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt
- https://arxiv.org/abs/2204.05080#deepmind: “Semantic Exploration from Language Abstractions and Pretrained Representations”
- https://arxiv.org/abs/2204.03610#microsoft: “Unified Contrastive Learning in Image-Text-Label Space”, Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao
- https://arxiv.org/abs/2204.00598#google: “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”
- https://arxiv.org/abs/2203.11096: “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer
- https://arxiv.org/abs/2202.06767#huawei: “Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework”
- https://arxiv.org/abs/2201.12086#salesforce: “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
- https://arxiv.org/abs/2201.08371#facebook: “SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”
- https://arxiv.org/abs/2201.02605#facebook: “Detecting Twenty-thousand Classes Using Image-level Supervision”, Xingyi Zhou, Rohit Girdhar, Armand Joulin, Phillip Krähenbühl, Ishan Misra
- 2022-liu.pdf: “Design Guidelines for Prompt Engineering Text-to-Image Generative Models”, Vivian Liu, Lydia B. Chilton
- https://arxiv.org/abs/2112.10752: “High-Resolution Image Synthesis With Latent Diffusion Models”, Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
- https://arxiv.org/abs/2112.09106#microsoft: “RegionCLIP: Region-based Language-Image Pretraining”
- https://arxiv.org/abs/2112.05744: “More Control for Free! Image Synthesis With Semantic Diffusion Guidance”
- https://arxiv.org/abs/2112.01573: “FuseDream: Training-Free Text-to-Image Generation With Improved CLIP+GAN Space Optimization”, Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, Qiang Liu
- https://arxiv.org/abs/2112.01071: “DenseCLIP: Extract Free Dense Labels from CLIP”, Chong Zhou, Chen Change Loy, Bo Dai
- https://arxiv.org/abs/2111.11432#microsoft: “Florence: A New Foundation Model for Computer Vision”
- https://arxiv.org/abs/2111.10050#google: “BASIC: Combined Scaling for Open-Vocabulary Image Classification”
- https://arxiv.org/abs/2111.09734: “ClipCap: CLIP Prefix for Image Captioning”, Ron Mokady, Amir Hertz, Amit H. Bermano
- https://arxiv.org/abs/2111.07991#google: “LiT: Zero-Shot Transfer With Locked-image Text Tuning”, Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer
- https://arxiv.org/abs/2111.03930: “Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling”, Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li
- https://arxiv.org/abs/2111.03133: “StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, Peter Schaldenbrand, Zhixuan Liu, Jean Oh
- https://arxiv.org/abs/2111.02114#laion: “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”
- https://arxiv.org/abs/2111.01007: “Projected GANs Converge Faster”, Axel Sauer, Kashyap Chitta, Jens Müller, Andreas Geiger
- https://arxiv.org/abs/2110.05208: “Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)”, Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan
- https://openreview.net/forum?id=ROteIE-4A6W: “MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training”, Haoxuan You, Luowei Zhou, Bin Xiao, Noel C Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan
- https://openreview.net/forum?id=qw674L9PfQE: “CLOOB: Modern Hopfield Networks With InfoLOOB Outperform CLIP”
- https://openreview.net/forum?id=G89-1yZLFHk: “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
- https://arxiv.org/abs/2109.12066: “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Johnathan Xie, Shuai Zheng
- https://www.frontiersin.org/articles/10.3389/fninf.2021.679838/full: “THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, Lukas Muttenthaler, Martin N. Hebart
- https://arxiv.org/abs/2109.08857#google: “Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts”, Yingtao Tian, David Ha
- https://laion.ai/blog/laion-400-open-dataset/: “LAION-400-Million Open Dataset”, Christoph Schuhmann
- https://arxiv.org/abs/2107.12518: “Segmentation in Style: Unsupervised Semantic Image Segmentation With StyleGAN and CLIP”, Daniil Pakhomov, Sanchit Hira, Narayani Wagle, Kemar E. Green, Nassir Navab
- https://arxiv.org/abs/2107.06383: “How Much Can CLIP Benefit Vision-and-Language Tasks?”, Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
- https://arxiv.org/abs/2106.16198: “Small In-distribution Changes in 3D Perspective and Lighting Fool Both CNNs and Transformers”, Spandan Madan, Tomotake Sasaki, Tzu-Mao Li, Xavier Boix, Hanspeter Pfister
- https://arxiv.org/abs/2106.13043: “AudioCLIP: Extending CLIP to Image, Text and Audio”, Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel
- https://arxiv.org/abs/2106.11097: “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
- https://arxiv.org/abs/2106.07411: “Partial Success in Closing the Gap between Human and Machine Vision”
- https://arxiv.org/abs/2106.03004#google: “Exploring the Limits of Out-of-Distribution Detection”, Stanislav Fort, Jie Ren, Balaji Lakshminarayanan
- https://en.pingwest.com/a/8693#baai: “Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Chen Du
- https://arxiv.org/abs/2104.13921#google: “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
- https://arxiv.org/abs/2104.08945#facebook: “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Ruizhe Cheng, Bichen Wu, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
- https://distill.pub/2021/multimodal-neurons/#openai: “Multimodal Neurons in Artificial Neural Networks [CLIP]”
- https://arxiv.org/abs/2102.05918#google: “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”
- https://arxiv.org/abs/2102.01645: “Generating Images from Caption and vice Versa via CLIP-Guided Generative Latent Space Search”, Federico A. Galatolo, Mario G. C. A. Cimino, Gigliola Vaglini
- https://github.com/nagolinc/notebooks/blob/main/TADNE_and_CLIP.ipynb: “Scoring Images from TADNE With CLIP”, nagolinc
- https://openai.com/blog/dall-e/: “DALL·E 1: Creating Images from Text: We’ve Trained a Neural Network Called DALL·E That Creates Images from Text Captions for a Wide Range of Concepts Expressible in Natural Language”
- https://openai.com/blog/clip/: “CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the “Zero-shot” Capabilities of GPT-2 and GPT-3”, Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal
- https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf: “Learning Transferable Visual Models From Natural Language Supervision”
- https://openreview.net/forum?id=YicbFdNTTy#google: “Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”
- https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/: “The Messy, Secretive Reality behind OpenAI’s Bid to save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the inside Story of How Competitive Pressure Eroded That Idealism”, Karen Hao