“‘CLIP’ Tag”, 2020-01-13:
Bibliography for tag ai/nn/transformer/clip, most recent first: 3 related tags, 253 annotations, & 47 links.
- See Also
- Gwern
- Links
- “CT Foundation: Taking Medical Imaging Embeddings 3D”, 2024
- “Explore the Limits of Omni-Modal Pretraining at Scale”, et al 2024
- “Sakuga-42M Dataset: Scaling Up Cartoon Research”, et al 2024
- “ImageInWords: Unlocking Hyper-Detailed Image Descriptions”, et al 2024
- “CatLIP: CLIP-Level Visual Recognition Accuracy With 2.7× Faster Pre-Training on Web-Scale Image-Text Data”, et al 2024
- “Towards Generated Image Provenance Analysis Via Conceptual-Similar-Guided-SLIP Retrieval”, et al 2024
- “Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies”, et al 2024
- “TextCraftor: Your Text Encoder Can Be Image Quality Controller”, et al 2024
- “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-Training”, et al 2024
- “Discovering Universal Semantic Triggers for Text-To-Image Synthesis”, et al 2024
- “Grounded Language Acquisition through the Eyes and Ears of a Single Child”, et al 2024
- “TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones”, et al 2023
- “Parrot Captions Teach CLIP to Spot Text”, et al 2023
- “StarVector: Generating Scalable Vector Graphics Code from Images”, et al 2023
- “Vision-Language Models As a Source of Rewards”, et al 2023
- “Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding”, et al 2023
- “ECLIPSE: A Resource-Efficient Text-To-Image Prior for Image Generations”, et al 2023
- “Alpha-CLIP: A CLIP Model Focusing on Wherever You Want”, et al 2023
- “Are Vision Transformers More Data Hungry Than Newborn Visual Systems?”, et al 2023
- “BioCLIP: A Vision Foundation Model for the Tree of Life”, et al 2023
- “Rethinking FID: Towards a Better Evaluation Metric for Image Generation”, et al 2023
- “SatCLIP: Global, General-Purpose Location Embeddings With Satellite Imagery”, et al 2023
- “Test-Time Adaptation of Discriminative Models via Diffusion Generative Feedback”, et al 2023
- “One-For-All: Towards Universal Domain Translation With a Single StyleGAN”, et al 2023
- “Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?”, et al 2023
- “From Scarcity to Efficiency: Improving CLIP Training via Visual-Enriched Captions”, et al 2023
- “LLaVA-1.5: Improved Baselines With Visual Instruction Tuning”, et al 2023
- “Data Filtering Networks”, et al 2023
- “Vision Transformers Need Registers”, et al 2023
- “Demystifying CLIP Data”, et al 2023
- “Multimodal Neurons in Pretrained Text-Only Transformers”, et al 2023
- “Investigating the Existence of ‘Secret Language’ in Language Models”, et al 2023
- “InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and Generation”, et al 2023
- “PIGEON: Predicting Image Geolocations”, et al 2023
- “CLIPMasterPrints: Fooling Contrastive Language-Image Pre-Training Using Latent Variable Evolution”, et al 2023
- “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis”, et al 2023
- “CLIPA-V2: Scaling CLIP Training With 81.1% Zero-Shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy”, et al 2023
- “SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality”, et al 2023
- “Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model”, et al 2023
- “ChessGPT: Bridging Policy Learning and Language Modeling”, et al 2023
- “Rosetta Neurons: Mining the Common Units in a Model Zoo”, et al 2023
- “Image Captioners Are Scalable Vision Learners Too”, et al 2023
- “Improving Neural Network Representations Using Human Similarity Judgments”, et al 2023
- “Artificial Intelligence and Art: Identifying the Esthetic Judgment Factors That Distinguish Human & Machine-Generated Artwork”, 2023
- “On Evaluating Adversarial Robustness of Large Vision-Language Models”, et al 2023
- “Generalizable Synthetic Image Detection via Language-Guided Contrastive Learning”, et al 2023
- “TorToise: Better Speech Synthesis through Scaling”, 2023
- “An Inverse Scaling Law for CLIP Training”, et al 2023
- “ImageBind: One Embedding Space To Bind Them All”, et al 2023
- “Pick-A-Pic: An Open Dataset of User Preferences for Text-To-Image Generation”, et al 2023
- “A Cookbook of Self-Supervised Learning”, et al 2023
- “DINOv2: Learning Robust Visual Features without Supervision”, et al 2023
- “ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification”, et al 2023
- “KD-DLGAN: Data Limited Image Generation via Knowledge Distillation”, et al 2023
- “MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks”, et al 2023
- “Sigmoid Loss for Language Image Pre-Training”, et al 2023
- “HiCLIP: Contrastive Language-Image Pretraining With Hierarchy-Aware Attention”, et al 2023
- “When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?”, et al 2023
- “Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery”, et al 2023
- “BLIP-2: Bootstrapping Language-Image Pre-Training With Frozen Image Encoders and Large Language Models”, et al 2023
- “MUG: Vision Learners Meet Web Image-Text Pairs”, et al 2023
- “Reaching 80% Zero-Shot Accuracy With OpenCLIP: ViT-G/14 Trained On LAION-2B”, 2023
- “Reproducible Scaling Laws for Contrastive Language-Image Learning”, et al 2022
- “CLIP Itself Is a Strong Fine-Tuner: Achieving 85.7% and 88.0% Top-1 Accuracy With ViT-B and ViT-L on ImageNet”, et al 2022
- “A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others”, et al 2022
- “Scaling Language-Image Pre-Training via Masking”, et al 2022
- “Videogenic: Video Highlights via Photogenic Moments”, et al 2022
- “Retrieval-Augmented Multimodal Language Modeling”, et al 2022
- “ClipCrop: Conditioned Cropping Driven by Vision-Language Model”, et al 2022
- “I Can’t Believe There’s No Images! Learning Visual Tasks Using Only Language Data”, et al 2022
- “MaskDistill: A Unified View of Masked Image Modeling”, 2022
- “Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces”, et al 2022
- “AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, et al 2022
- “EDiff-I: Text-To-Image Diffusion Models With an Ensemble of Expert Denoisers”, et al 2022
- “Text-Only Training for Image Captioning Using Noise-Injected CLIP”, et al 2022
- “3DALL·E: Integrating Text-To-Image AI in 3D Design Workflows”, et al 2022
- “Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends”, et al 2022
- “ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training”, et al 2022
- “Incorporating Natural Language into Vision Models Improves Prediction and Understanding of Higher Visual Cortex”, et al 2022
- “Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption Contest”, et al 2022
- “Fast Text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators”, et al 2022
- “What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification (CuPL)”, et al 2022
- “Efficient Vision-Language Pretraining With Visual Concepts and Hierarchical Alignment”, et al 2022
- “Decoding Speech from Non-Invasive Brain Recordings”, et al 2022
- “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, et al 2022
- “CLIP-Based Neural Neighbor Style Transfer for 3D Assets”, 2022
- “EVL: Frozen CLIP Models Are Efficient Video Learners”, et al 2022
- “X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, et al 2022
- “LaTTe: Language Trajectory TransformEr”, et al 2022
- “Adversarial Attacks on Image Generation With Made-Up Words”, 2022
- “TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment”, et al 2022
- “MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-Training”, et al 2022
- “Text-Guided Synthesis of Artistic Images With Retrieval-Augmented Diffusion Models”, et al 2022
- “NewsStories: Illustrating Articles With Visual Summaries”, et al 2022
- “Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models”, 2022
- “Don’t Stop Learning: Towards Continual Learning for the CLIP Model”, et al 2022
- “X-CLIP: End-To-End Multi-Grained Contrastive Learning for Video-Text Retrieval”, et al 2022
- “Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning”, et al 2022
- “LM-Nav: Robotic Navigation With Large Pre-Trained Models of Language, Vision, and Action”, et al 2022
- “CLAP: Learning Audio Concepts From Natural Language Supervision”, et al 2022
- “ADAPT: Vision-Language Navigation With Modality-Aligned Action Prompts”, et al 2022
- “Improved Vector Quantized Diffusion Models”, et al 2022
- “CyCLIP: Cyclic Contrastive Language-Image Pretraining”, et al 2022
- “Fine-Grained Image Captioning With CLIP Reward”, et al 2022
- “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, et al 2022
- “AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars”, et al 2022
- “CoCa: Contrastive Captioners Are Image-Text Foundation Models”, et al 2022
- “Data Determines Distributional Robustness in Contrastive Language Image Pre-Training (CLIP)”, et al 2022
- “Retrieval-Augmented Diffusion Models: Semi-Parametric Neural Image Synthesis”, et al 2022
- “Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?”, et al 2022
- “Opal: Multimodal Image Generation for News Illustration”, et al 2022
- “VQGAN-CLIP: Open Domain Image Generation and Editing With Natural Language Guidance”, et al 2022
- “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, et al 2022 (page 16)
- “No Token Left Behind: Explainability-Aided Image Classification and Generation”, et al 2022
- “Semantic Exploration from Language Abstractions and Pretrained Representations”, et al 2022
- “Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality”, et al 2022
- “Unified Contrastive Learning in Image-Text-Label Space”, et al 2022
- “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, et al 2022
- “Learning to Generate Line Drawings That Convey Geometry and Semantics”, et al 2022
- “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-Shot Transfer Learning”, et al 2022
- “CLIP on Wheels (CoW): Zero-Shot Object Navigation As Object Localization and Exploration”, et al 2022
- “Bamboo: Building Mega-Scale Vision Dataset Continually With Human-Machine Synergy”, et al 2022
- “CLIP Models Are Few-Shot Learners: Empirical Studies on VQA and Visual Entailment”, et al 2022
- “Democratizing Contrastive Language-Image Pre-Training: A CLIP Benchmark of Data, Model, and Supervision”, et al 2022
- “Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy without Increasing Inference Time”, et al 2022
- “The Unsurprising Effectiveness of Pre-Trained Vision Models for Control”, et al 2022
- “Unsupervised Vision-And-Language Pre-Training via Retrieval-Based Multi-Granular Alignment”, et al 2022
- “RuCLIP—New Models and Experiments: a Technical Report”, et al 2022
- “Wukong: 100 Million Large-Scale Chinese Cross-Modal Pre-Training Dataset and A Foundation Framework”, et al 2022
- “CLIPasso: Semantically-Aware Object Sketching”, et al 2022
- “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation”, et al 2022
- “Can Wikipedia Help Offline Reinforcement Learning?”, et al 2022
- “SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, et al 2022
- “CM3: A Causal Masked Multimodal Model of the Internet”, et al 2022
- “LSeg: Language-Driven Semantic Segmentation”, et al 2022
- “Design Guidelines for Prompt Engineering Text-To-Image Generative Models”, 2022b
- “Detecting Twenty-Thousand Classes Using Image-Level Supervision”, et al 2022
- “A Fistful of Words: Learning Transferable Visual Models from Bag-Of-Words Supervision”, et al 2021
- “High-Resolution Image Synthesis With Latent Diffusion Models”, et al 2021
- “RegionCLIP: Region-Based Language-Image Pretraining”, et al 2021
- “More Control for Free! Image Synthesis With Semantic Diffusion Guidance”, et al 2021
- “CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions”, et al 2021
- “MAGMA—Multimodal Augmentation of Generative Models through Adapter-Based Finetuning”, et al 2021
- “DenseCLIP: Extract Free Dense Labels from CLIP”, et al 2021
- “Zero-Shot Text-Guided Object Generation With Dream Fields”, et al 2021
- “FuseDream: Training-Free Text-To-Image Generation With Improved CLIP+GAN Space Optimization”, et al 2021
- “MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, et al 2021
- “CRIS: CLIP-Driven Referring Image Segmentation”, et al 2021
- “Zero-Shot Image-To-Text Generation for Visual-Semantic Arithmetic”, et al 2021
- “Blended Diffusion for Text-Driven Editing of Natural Images”, et al 2021
- “LAFITE: Towards Language-Free Training for Text-To-Image Generation”, et al 2021
- “Florence: A New Foundation Model for Computer Vision”, et al 2021
- “BASIC: Combined Scaling for Open-Vocabulary Image Classification”, et al 2021
- “ClipCap: CLIP Prefix for Image Captioning”, et al 2021
- “Simple but Effective: CLIP Embeddings for Embodied AI”, et al 2021
- “INTERN: A New Learning Paradigm Towards General Vision”, et al 2021
- “LiT: Zero-Shot Transfer With Locked-Image Text Tuning”, et al 2021
- “Tip-Adapter: Training-Free CLIP-Adapter for Better Vision-Language Modeling”, et al 2021
- “StyleCLIPDraw: Coupling Content and Style in Text-To-Drawing Synthesis”, et al 2021
- “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, et al 2021
- “Projected GANs Converge Faster”, et al 2021
- “Telling Creative Stories Using Generative Visual Aids”, 2021
- “Image-Based CLIP-Guided Essence Transfer”, et al 2021
- “Wav2CLIP: Learning Robust Audio Representations From CLIP”, et al 2021
- “Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-Training Paradigm (DeCLIP)”, et al 2021
- “CLIP-Forge: Towards Zero-Shot Text-To-Shape Generation”, et al 2021
- “MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-Training”, et al 2021
- “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, et al 2021
- “DiffusionCLIP: Text-Guided Image Manipulation Using Diffusion Models”, 2021
- “CLOOB: Modern Hopfield Networks With InfoLOOB Outperform CLIP”, et al 2021
- “VideoCLIP: Contrastive Pre-Training for Zero-Shot Video-Text Understanding”, et al 2021
- “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, 2021
- “CLIPort: What and Where Pathways for Robotic Manipulation”, et al 2021
- “THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, 2021
- “Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts”, 2021
- “What Vision-Language Models ‘See’ When They See Scenes”, et al 2021
- “EfficientCLIP: Efficient Cross-Modal Pre-Training by Ensemble Confident Learning and Language Modeling”, et al 2021
- “Zero-Shot Open Set Detection by Extending CLIP”, et al 2021
- “Robust Fine-Tuning of Zero-Shot Models”, et al 2021
- “What Users Want? WARHOL: A Generative Model for Recommendation”, et al 2021
- “LAION-400-Million Open Dataset”, 2021
- “Contrastive Language-Image Pre-Training for the Italian Language”, et al 2021
- “Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications”, et al 2021
- “StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators”, et al 2021
- “Language Grounding With 3D Objects”, et al 2021
- “Segmentation in Style: Unsupervised Semantic Image Segmentation With StyleGAN and CLIP”, et al 2021
- “How Much Can CLIP Benefit Vision-And-Language Tasks?”, et al 2021
- “FairyTailor: A Multimodal Generative Framework for Storytelling”, et al 2021
- “CLIP-It! Language-Guided Video Summarization”, et al 2021
- “Small In-Distribution Changes in 3D Perspective and Lighting Fool Both CNNs and Transformers”, et al 2021
- “CLIPDraw: Exploring Text-To-Drawing Synthesis through Language-Image Encoders”, et al 2021
- “AudioCLIP: Extending CLIP to Image, Text and Audio”, et al 2021
- “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, et al 2021
- “A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods”, et al 2021
- “Partial Success in Closing the Gap between Human and Machine Vision”, et al 2021
- “ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation”, et al 2021
- “Exploring the Limits of Out-Of-Distribution Detection”, et al 2021
- “Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, 2021
- “Generative Art Using Neural Visual Grammars and Dual Encoders”, et al 2021
- “Zero-Shot Detection via Vision and Language Knowledge Distillation”, et al 2021
- “CLIPScore: A Reference-Free Evaluation Metric for Image Captioning”, et al 2021
- “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, et al 2021
- “Paint by Word”, et al 2021
- “WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training”, et al 2021
- “Multimodal Neurons in Artificial Neural Networks [CLIP]”, et al 2021
- “Zero-Shot Text-To-Image Generation”, et al 2021
- “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, et al 2021
- “Generating Images from Caption and vice Versa via CLIP-Guided Generative Latent Space Search”, et al 2021
- “Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers”, et al 2021
- “Scoring Images from TADNE With CLIP”, nagolinc 2021
- “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, et al 2021
- “CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the ‘Zero-Shot’ Capabilities of GPT-2 and GPT-3”, et al 2021
- “DALL·E 1: Creating Images from Text: We’ve Trained a Neural Network Called DALL·E That Creates Images from Text Captions for a Wide Range of Concepts Expressible in Natural Language”, et al 2021
- “Transformers in Vision: A Survey”, et al 2021
- “Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”, et al 2020
- “M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-Training”, et al 2020
- “Learning to Scale Multilingual Representations for Vision-Language Tasks”, et al 2020
- “The Messy, Secretive Reality behind OpenAI’s Bid to save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the inside Story of How Competitive Pressure Eroded That Idealism”, 2020
- “MULE: Multimodal Universal Language Embedding”, et al 2019
- “What A Long, Strange Trip It’s Been: EleutherAI One Year Retrospective”
- “CLIP: Zero-Shot Jack of All Trades”
- “This Anime Does Not Exist, Search: This Notebook Uses the Precomputed CLIP Feature Vectors for 100k Images from TADNE”
- “CLIPIT PixelDraw Demo”
- “Vqgan-Clip/notebooks”
- “Combination of OpenAI GLIDE and Latent Diffusion”
- “LAION-AI/laion-Datasets”
- “CLIP Implementation for Russian Language”
- “Christophschuhmann/4MC-4M-Image-Text-Pairs-With-CLIP-Embeddings: I Have Created a Dataset of Image-Text-Pairs by Using the Cosine Similarity of the CLIP Embeddings of the Image & Its Caption Derived from YFCC100M. I Have Also Added Probabilities from an NSFW Detector & More” (a minimal sketch of this CLIP similarity scoring appears at the end of this page)
- “CLIP (Contrastive Language–Image Pre-Training) for Italian”
- “Crowsonkb/simulacra-Aesthetic-Models”
- “Neural Image Generation”
- “An Open Source Implementation of CLIP”
- “CLIP/data/yfcc100m.md”
- “StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery”
- “Clustering-Laion400m: Script and Models for Clustering LAION-400m CLIP Embeddings. Models Were Fit on the First Million or so Image Embeddings.”
- “Rinongal/StyleGAN-Nada”
- “Simple Image Captioning Model”
- “Robgon-Art/CLIPandPASTE: CLIP and PASTE: Using AI to Create Photo Collages from Text Prompts”
- “sam2_hierarch: Unsupervised Human-Friendly Online Object Categorization”, Utility 2024
- “AI-Powered Command-Line Photo Search Tool”
- “Alien Dreams: An Emerging Art Scene”
- “The Bouba/Kiki Effect And Sound Symbolism In CLIP”
- “Image Captioning”
- “Same Energy”
- “Guidance: a Cheat Code for Diffusion Models”
- “Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations”
- “Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders”
- “[P] List of Sites/programs/projects That Use OpenAI’s CLIP Neural Network for Steering Image/video Creation to Match a Text Description”
- “Writing Good VQGAN+CLIP Prompts Part One – Basic Prompts and Style Modifiers”
- “Writing Good VQGAN+CLIP Prompts Part Two – Artist and Genre Modifiers”
- “Writing Good VQGAN+CLIP Prompts Part Three – Environmental Modifiers”
- “New AI Tools CLIP+VQ-GAN Can Create Impressive Works of Art Based on Just a Few Words of Input”
- “Apple or IPod? Easy Fix for Adversarial Textual Attacks on OpenAI’s CLIP Model!”
- Sort By Magic
- Miscellaneous
- Bibliography
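
Many of the links above (the LAION datasets, the YFCC100M pair-filtering repo, zero-shot classification demos) reduce to the same primitive: embed an image and a caption with CLIP and compare them by cosine similarity. A minimal sketch using the open-source `open_clip` package; the model name, checkpoint tag, and file name below are illustrative assumptions, not prescriptions:

```python
# Minimal sketch: scoring image-text similarity with CLIP embeddings,
# the primitive behind LAION-style dataset filtering and zero-shot
# classification. Assumes `open_clip` (pip install open_clip_torch);
# the checkpoint tag and "photo.jpg" are hypothetical examples.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # 1×3×224×224
labels = ["a photo of a cat", "a photo of a dog", "a diagram"]
text = tokenizer(labels)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    # Cosine similarity = dot product of L2-normalized embeddings.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)

# Zero-shot classification takes the best-matching caption;
# dataset filtering instead thresholds a single image-caption score.
for label, sim in zip(labels, sims.tolist()):
    print(f"{sim:.3f}  {label}")
```

The same scores serve both uses: ranked against a set of candidate captions they give zero-shot classification, while thresholded on a single image-caption pair they give the filtering criterion used to build CLIP-curated datasets such as LAION-400M.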