Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7× Faster Pre-training on Web-scale Image-Text Data
Towards Generated Image Provenance Analysis Via Conceptual-Similar-Guided-SLIP Retrieval
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
TextCraftor: Your Text Encoder Can be Image Quality Controller
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Discovering Universal Semantic Triggers for Text-to-Image Synthesis
Grounded language acquisition through the eyes and ears of a single child
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
StarVector: Generating Scalable Vector Graphics Code from Images
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
Are Vision Transformers More Data Hungry Than Newborn Visual Systems?
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery
Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback
One-for-All: Towards Universal Domain Translation with a Single StyleGAN
Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?
From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions
LLaVA-1.5: Improved Baselines with Visual Instruction Tuning
Investigating the Existence of ‘Secret Language’ in Language Models
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model
Improving neural network representations using human similarity judgments
Artificial intelligence and art: Identifying the esthetic judgment factors that distinguish human & machine-generated artwork
On Evaluating Adversarial Robustness of Large Vision-Language Models
Generalizable Synthetic Image Detection via Language-guided Contrastive Learning
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
DINOv2: Learning Robust Visual Features without Supervision
ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification
KD-DLGAN: Data Limited Image Generation via Knowledge Distillation
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?
Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Reaching 80% Zero-Shot Accuracy With OpenCLIP: VIT-G/14 Trained On LAION-2B
Reproducible scaling laws for contrastive language-image learning
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data
Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Text-Only Training for Image Captioning using Noise-Injected CLIP
3DALL·E: Integrating Text-to-Image AI in 3D Design Workflows
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training
Incorporating natural language into vision models improves prediction and understanding of higher visual cortex
Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest
Fast text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators
What does a platypus look like? Generating customized prompts for zero-shot image classification (CuPL)
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP
X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition
Adversarial Attacks on Image Generation With Made-Up Words
TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment
MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models
Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models
Don’t Stop Learning: Towards Continual Learning for the CLIP Model
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
CLAP: Learning Audio Concepts From Natural Language Supervision
ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars
CoCa: Contrastive Captioners are Image-Text Foundation Models
Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
Retrieval-Augmented Diffusion Models: Semi-Parametric Neural Image Synthesis
Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?
VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
DALL·E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents § 7. Limitations and Risks
No Token Left Behind: Explainability-Aided Image Classification and Generation
Semantic Exploration from Language Abstractions and Pretrained Representations
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Learning to generate line drawings that convey geometry and semantics
CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
CLIP on Wheels (CoW): Zero-Shot Object Navigation as Object Localization and Exploration
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
The Unsurprising Effectiveness of Pre-Trained Vision Models for Control
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models
Design Guidelines for Prompt Engineering Text-to-Image Generative Models
Detecting Twenty-thousand Classes using Image-level Supervision
A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision
High-Resolution Image Synthesis with Latent Diffusion Models
More Control for Free! Image Synthesis with Semantic Diffusion Guidance
CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions
MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning
FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
Blended Diffusion for Text-driven Editing of Natural Images
LAFITE: Towards Language-Free Training for Text-to-Image Generation
BASIC: Combined Scaling for Open-Vocabulary Image Classification
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)
MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training
OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation
DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models
CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation
THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks
Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts
EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling
What Users Want? WARHOL: A Generative Model for Recommendation
Contrastive Language-Image Pre-training for the Italian Language
Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
Segmentation in Style: Unsupervised Semantic Image Segmentation with StyleGAN and CLIP
FairyTailor: A Multimodal Generative Framework for Storytelling
Small in-distribution changes in 3D perspective and lighting fool both CNNs and Transformers
CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders
A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods
Partial success in closing the gap between human and machine vision
ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters
Generative Art Using Neural Visual Grammars and Dual Encoders
Zero-Shot Detection via Vision and Language Knowledge Distillation
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
CLIP: Learning Transferable Visual Models From Natural Language Supervision
CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3 (see the zero-shot sketch following this list)
DALL·E 1: Creating Images from Text: We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language
Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
Learning to Scale Multilingual Representations for Vision-Language Tasks
The messy, secretive reality behind OpenAI’s bid to save the world: The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism
What A Long, Strange Trip It's Been: EleutherAI One Year Retrospective
This Anime Does Not Exist, Search: This Notebook Uses the Precomputed CLIP Feature Vectors for 100k Images from TADNE
Christophschuhmann/4MC-4M-Image-Text-Pairs-With-CLIP-Embeddings: I Have Created a Dataset of Image-Text-Pairs by Using the Cosine Similarity of the CLIP Embeddings of the Image & Its Caption Derived from YFCC100M. I Have Also Added Probabilities from an NSFW Detector & More (see the similarity-filter sketch following this list)
CLIP (Contrastive Language–Image Pre-Training) for Italian
Clustering-Laion400m: Script and Models for Clustering LAION-400m CLIP Embeddings. Models Were Fit on the First Million or so Image Embeddings.
Robgon-Art/CLIPandPASTE: CLIP and PASTE: Using AI to Create Photo Collages from Text Prompts
sam2_hierarch: Unsupervised Human-Friendly Online Object Categorization
Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations
Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders
[P] List of Sites/programs/projects That Use OpenAI’s CLIP Neural Network for Steering Image/video Creation to Match a Text Description
Writing Good VQGAN+CLIP Prompts Part One – Basic Prompts and Style Modifiers
Writing Good VQGAN+CLIP Prompts Part Two – Artist and Genre Modifiers
Writing Good VQGAN+CLIP Prompts Part Three – Environmental Modifiers
New AI Tools CLIP+VQ-GAN Can Create Impressive Works of Art Based on Just a Few Words of Input
Apple or iPod? Easy Fix for Adversarial Textual Attacks on OpenAI's CLIP Model!
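The “CLIP: Connecting Text and Images” entry above describes applying CLIP to any classification benchmark by simply naming the visual categories to be recognized. A minimal sketch of that zero-shot recipe, using the official `clip` package (`pip install git+https://github.com/openai/CLIP`); the class names and image path are placeholder assumptions, not taken from any entry above:

```python
# Zero-shot image classification with CLIP: score an image against
# free-text category names, no task-specific training required.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "Providing the names of the visual categories to be recognized":
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a platypus"]
text = clip.tokenize(class_names).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity = dot product of L2-normalized embeddings,
    # scaled and softmaxed into per-class probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

Swapping in a different benchmark is just a matter of replacing `class_names`, which is what the “zero-shot ImageNet accuracy” entries above (OpenCLIP, CLIPA-v2, BASIC) are measuring at scale.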
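Several entries above (LAION-400M, the Christophschuhmann 4MC dataset) describe filtering web-scraped image-text pairs by the cosine similarity of the CLIP embeddings of an image and its caption. A sketch of that filter, assuming the 0.3 ViT-B/32 cutoff reported for LAION-400M; the file names and captions are invented for illustration:

```python
# CLIP-similarity filtering of candidate image-text pairs: keep a pair only
# if the image embedding and caption embedding are sufficiently aligned.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and its caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb /= img_emb.norm(dim=-1, keepdim=True)
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

# Hypothetical scraped pairs; mismatched captions score low and are dropped.
pairs = [("cat.jpg", "a cat sleeping on a sofa"), ("banner.jpg", "BUY NOW!!!")]
kept = [(path, cap) for path, cap in pairs if clip_similarity(path, cap) >= 0.3]
```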
Figure (Ross 2023-01-01): CLIP ViT-bigG/14 (LAION-2B, 39B samples seen, batch 160k) benchmark performance compared to previous open-source SOTA model
Figure (Ross 2023-01-01): OpenCLIP scaling for CLIP ViT-bigG/14 (LAION-2B, 39B samples seen, batch 160k)
Figure 6 (Girdhar et al 2023): ImageBind scaling of performance with increasing CLIP image-encoder size
Screenshot (Armstrong 2022-11-22): Soot image organizer, “person swimming in water” query example
Screenshot (Armstrong 2022-11-22): Soot image organizer, “person swimming in water” query results
Figure 1a (Cherti et al 2022): OpenCLIP compute vs. zero-shot classification scaling curve
Figure 1b (Cherti et al 2022): OpenCLIP compute vs. zero-shot retrieval scaling curve
Figure 1 (Dong et al 2022): ablating improvements to CLIP fine-tuning tricks for ImageNet transfer
Image (RiversHaveWings 2021-04-22): CLIP+VQGAN generation, “the shadowy hacker group Eleuther”
Image (nagolinc 2021-01-20): TADNE CLIP-based generation, “a girl with a pink hat”
Figure 2 (Muttenthaler et al 2021): correlation of fMRI brain activations with various neural networks
https://colab.research.google.com/drive/189LHTpYaefMhKNIGOzTLHHavlgmoIWg9
https://colab.research.google.com/drive/1N8Cc9yYzNR4M9J2NtE3n3jL2Jy25KAl_
https://colab.research.google.com/drive/1c6VccMPsOMAUQCKU4BVDRd5Y32qkozmK
https://colab.research.google.com/github/kvfrans/clipdraw/blob/main/clipdraw.ipynb
https://creator.nightcafe.studio/vqgan-clip-keyword-modifier-comparison
https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup
https://jxmo.notion.site/The-Weird-and-Wonderful-World-of-AI-Art-b9615a2e7278435b98380ff81ae1cf09
https://stanislavfort.com/2021/01/12/OpenAI_CLIP_adversarial_examples.html
https://stanislavfort.github.io/blog/OpenAI_CLIP_adversarial_examples/
https://stanislavfort.github.io/blog/OpenAI_CLIP_stickers_and_adversarial_examples/
https://tech.pic-collage.com/distillation-of-clip-model-and-other-experiments-f8394b7321ce
https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA
https://web.media.mit.edu/~echu/assets/projects/evolving-views/paper.pdf
https://www.lesswrong.com/posts/cqGEQeLNbcptYsifz/this-week-in-fashion
https://www.lesswrong.com/posts/kobJymvvcvhbjWFKe/laying-the-foundations-for-vision-and-multimodal-mechanistic
https://www.reddit.com/r/MachineLearning/comments/nq4es7/d_unreal_engine_trick_with_vqgan_clip/
https://www.reddit.com/r/MediaSynthesis/comments/p5nw28/clip_vqgan_keyword_comparison_by_kingdomakrillic/
https://www.unum.cloud/blog/2023-02-20-efficient-multimodality
https://research.google/blog/taking-medical-imaging-embeddings-3d/