Bibliography:

  1. ‘Transformer’ tag

  2. ‘CLIP samples’ tag

  3. ‘video analysis’ tag

  4. ‘neuroscience’ tag

  5. Utext: Rich Unicode Documents

  6. PaliGemma 2: A Family of Versatile VLMs for Transfer

  7. CT Foundation: Taking medical imaging embeddings 3D

  8. Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness

  9. Explore the Limits of Omni-modal Pretraining at Scale

  10. Sakuga-42M Dataset: Scaling Up Cartoon Research

  11. ImageInWords: Unlocking Hyper-Detailed Image Descriptions

  12. CatLIP: CLIP-level Visual Recognition Accuracy with 2.7× Faster Pre-training on Web-scale Image-Text Data

  13. Towards Generated Image Provenance Analysis Via Conceptual-Similar-Guided-SLIP Retrieval

  14. Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

  15. TextCraftor: Your Text Encoder Can be Image Quality Controller

  16. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

  17. Discovering Universal Semantic Triggers for Text-to-Image Synthesis

  18. Grounded language acquisition through the eyes and ears of a single child

  19. TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

  20. Parrot Captions Teach CLIP to Spot Text

  21. StarVector: Generating Scalable Vector Graphics Code from Images

  22. Vision-Language Models as a Source of Rewards

  23. Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

  24. ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

  25. Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

  26. Are Vision Transformers More Data Hungry Than Newborn Visual Systems?

  27. BioCLIP: A Vision Foundation Model for the Tree of Life

  28. Rethinking FID: Towards a Better Evaluation Metric for Image Generation

  29. SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery

  30. Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback

  31. One-for-All: Towards Universal Domain Translation with a Single StyleGAN

  32. Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?

  33. From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions

  34. LLaVA-1.5: Improved Baselines with Visual Instruction Tuning

  35. Data Filtering Networks

  36. Vision Transformers Need Registers

  37. Demystifying CLIP Data

  38. Multimodal Neurons in Pretrained Text-Only Transformers

  39. Investigating the Existence of ‘Secret Language’ in Language Models

  40. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

  41. PIGEON: Predicting Image Geolocations

  42. CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution

  43. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

  44. CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy

  45. SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

  46. Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model

  47. ChessGPT: Bridging Policy Learning and Language Modeling

  48. Rosetta Neurons: Mining the Common Units in a Model Zoo

  49. Image Captioners Are Scalable Vision Learners Too

  50. Improving neural network representations using human similarity judgments

  51. Artificial intelligence and art: Identifying the esthetic judgment factors that distinguish human & machine-generated artwork

  52. On Evaluating Adversarial Robustness of Large Vision-Language Models

  53. Generalizable Synthetic Image Detection via Language-guided Contrastive Learning

  54. TorToise: Better speech synthesis through scaling

  55. An Inverse Scaling Law for CLIP Training

  56. ImageBind: One Embedding Space To Bind Them All

  57. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

  58. A Cookbook of Self-Supervised Learning

  59. DINOv2: Learning Robust Visual Features without Supervision

  60. ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification

  61. KD-DLGAN: Data Limited Image Generation via Knowledge Distillation

  62. MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

  63. Sigmoid Loss for Language Image Pre-Training

  64. HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention

  65. When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?

  66. Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery

  67. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

  68. MUG: Vision Learners Meet Web Image-Text Pairs

  69. Reaching 80% Zero-Shot Accuracy With OpenCLIP: VIT-G/14 Trained On LAION-2B

  70. Reproducible scaling laws for contrastive language-image learning

  71. CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet

  72. A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others

  73. Scaling Language-Image Pre-training via Masking

  74. Videogenic: Video Highlights via Photogenic Moments

  75. Retrieval-Augmented Multimodal Language Modeling

  76. ClipCrop: Conditioned Cropping Driven by Vision-Language Model

  77. I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data

  78. MaskDistill: A Unified View of Masked Image Modeling

  79. Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces

  80. AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

  81. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

  82. Text-Only Training for Image Captioning using Noise-Injected CLIP

  83. 3DALL·E: Integrating Text-to-Image AI in 3D Design Workflows

  84. Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

  85. ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training

  86. Incorporating natural language into vision models improves prediction and understanding of higher visual cortex

  87. Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption Contest

  88. Fast text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators

  89. What does a platypus look like? Generating customized prompts for zero-shot image classification (CuPL)

  90. Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

  91. Decoding speech from non-invasive brain recordings

  92. Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

  93. CLIP-based Neural Neighbor Style Transfer for 3D Assets

  94. EVL: Frozen CLIP Models are Efficient Video Learners

  95. X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition

  96. LaTTe: Language Trajectory TransformEr

  97. Adversarial Attacks on Image Generation With Made-Up Words

  98. TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment

  99. MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

  100. Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models

  101. NewsStories: Illustrating articles with visual summaries

  102. Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models

  103. Don’t Stop Learning: Towards Continual Learning for the CLIP Model

  104. X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

  105. Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning

  106. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

  107. CLAP: Learning Audio Concepts From Natural Language Supervision

  108. ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts

  109. Improved Vector Quantized Diffusion Models

  110. CyCLIP: Cyclic Contrastive Language-Image Pretraining

  111. Fine-grained Image Captioning with CLIP Reward

  112. VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

  113. AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars

  114. CoCa: Contrastive Captioners are Image-Text Foundation Models

  115. Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

  116. Retrieval-Augmented Diffusion Models: Semi-Parametric Neural Image Synthesis

  117. Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?

  118. Opal: Multimodal Image Generation for News Illustration

  119. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

  120. DALL·E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents § 7. Limitations and Risks

  121. No Token Left Behind: Explainability-Aided Image Classification and Generation

  122. Semantic Exploration from Language Abstractions and Pretrained Representations

  123. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

  124. Unified Contrastive Learning in Image-Text-Label Space

  125. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

  126. Learning to generate line drawings that convey geometry and semantics

  127. CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning

  128. CLIP on Wheels (CoW): Zero-Shot Object Navigation as Object Localization and Exploration

  129. Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy

  130. CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment

  131. Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

  132. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

  133. The Unsurprising Effectiveness of Pre-Trained Vision Models for Control

  134. Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

  135. RuCLIP—new models and experiments: a technical report

  136. Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework

  137. CLIPasso: Semantically-Aware Object Sketching

  138. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

  139. Can Wikipedia Help Offline Reinforcement Learning?

  140. SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models

  141. CM3: A Causal Masked Multimodal Model of the Internet

  142. LSeg: Language-driven Semantic Segmentation

  143. Design Guidelines for Prompt Engineering Text-to-Image Generative Models

  144. Detecting Twenty-thousand Classes using Image-level Supervision

  145. A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision

  146. High-Resolution Image Synthesis with Latent Diffusion Models

  147. RegionCLIP: Region-based Language-Image Pretraining

  148. More Control for Free! Image Synthesis with Semantic Diffusion Guidance

  149. CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

  150. MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning

  151. DenseCLIP: Extract Free Dense Labels from CLIP

  152. Zero-Shot Text-Guided Object Generation with Dream Fields

  153. FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization

  154. MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

  155. CRIS: CLIP-Driven Referring Image Segmentation

  156. Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

  157. Blended Diffusion for Text-driven Editing of Natural Images

  158. LAFITE: Towards Language-Free Training for Text-to-Image Generation

  159. Florence: A New Foundation Model for Computer Vision

  160. BASIC: Combined Scaling for Open-Vocabulary Image Classification

  161. ClipCap: CLIP Prefix for Image Captioning

  162. Simple but Effective: CLIP Embeddings for Embodied AI

  163. INTERN: A New Learning Paradigm Towards General Vision

  164. LiT: Zero-Shot Transfer with Locked-image Text Tuning

  165. Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

  166. StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis

  167. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

  168. Projected GANs Converge Faster

  169. Telling Creative Stories Using Generative Visual Aids

  170. Image-Based CLIP-Guided Essence Transfer

  171. Wav2CLIP: Learning Robust Audio Representations From CLIP

  172. Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)

  173. CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation

  174. MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training

  175. OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation

  176. DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models

  177. CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

  178. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

  179. ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation

  180. CLIPort: What and Where Pathways for Robotic Manipulation

  181. THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks

  182. Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts

  183. What Vision-Language Models ‘See’ when they See Scenes

  184. EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

  185. Zero-Shot Open Set Detection by Extending CLIP

  186. Robust fine-tuning of zero-shot models

  187. What Users Want? WARHOL: A Generative Model for Recommendation

  188. LAION-400-Million Open Dataset

  189. Contrastive Language-Image Pre-training for the Italian Language

  190. Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications

  191. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

  192. Language Grounding with 3D Objects

  193. Segmentation in Style: Unsupervised Semantic Image Segmentation with StyleGAN and CLIP

  194. How Much Can CLIP Benefit Vision-and-Language Tasks?

  195. FairyTailor: A Multimodal Generative Framework for Storytelling

  196. CLIP-It! Language-Guided Video Summarization

  197. Small in-distribution changes in 3D perspective and lighting fool both CNNs and Transformers

  198. CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders

  199. AudioCLIP: Extending CLIP to Image, Text and Audio

  200. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

  201. A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods

  202. Partial success in closing the gap between human and machine vision

  203. ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation

  204. Exploring the Limits of Out-of-Distribution Detection

  205. Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters

  206. Generative Art Using Neural Visual Grammars and Dual Encoders

  207. Zero-Shot Detection via Vision and Language Knowledge Distillation

  208. CLIPScore: A Reference-free Evaluation Metric for Image Captioning

  209. Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

  210. Paint by Word

  211. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

  212. Multimodal Neurons in Artificial Neural Networks [CLIP]

  213. Zero-Shot Text-to-Image Generation

  214. ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

  215. Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

  216. Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

  217. Scoring images from TADNE with CLIP

  218. CLIP: Learning Transferable Visual Models From Natural Language Supervision

  219. CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3
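
    As a minimal sketch of the zero-shot recipe this entry describes (assuming the reference `openai/CLIP` Python package; the model variant, class names, and image path are illustrative, not taken from the entry):

    ```python
    # pip install git+https://github.com/openai/CLIP.git
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)  # hypothetical variant choice

    # Zero-shot classification: embed the image and one prompt per class name,
    # then score the image against every class prompt at once.
    class_names = ["cat", "dog", "bird"]  # illustrative labels
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)  # scaled image-text cosine similarities
        probs = logits_per_image.softmax(dim=-1)  # distribution over the class names

    print(dict(zip(class_names, probs[0].tolist())))
    ```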

  220. DALL·E 1: Creating Images from Text: We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language

  221. Transformers in Vision: A Survey

  222. Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

  223. M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

  224. Learning to Scale Multilingual Representations for Vision-Language Tasks

  225. The messy, secretive reality behind OpenAI’s bid to save the world: The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism

  226. MULE: Multimodal Universal Language Embedding

  227. What A Long, Strange Trip It’s Been: EleutherAI One Year Retrospective

  228. CLIP: Zero-Shot Jack of All Trades

  230. ‘This Anime Does Not Exist’ Search: This Notebook Uses the Precomputed CLIP Feature Vectors for 100k Images from TADNE

  231. CLIPIT PixelDraw Demo

  232. VQGAN-CLIP/notebooks

  233. Combination of OpenAI GLIDE and Latent Diffusion

  234. LAION-AI/laion-datasets

  235. CLIP Implementation for Russian Language

  236. Christophschuhmann/4MC-4M-Image-Text-Pairs-With-CLIP-Embeddings: I Have Created a Dataset of Image-Text-Pairs by Using the Cosine Similarity of the CLIP Embeddings of the Image & Its Caption Derived from YFCC100M. I Have Also Added Probabilities from an NSFW Detector & More
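
    A sketch of the filtering idea this entry describes (and which the LAION datasets below used at scale): embed each image and its caption with CLIP, and keep the pair only if the cosine similarity of the two embeddings clears a threshold. The paths, captions, and cutoff below are illustrative (0.3 with ViT-B/32 is the threshold reported for LAION-400M):

    ```python
    # pip install git+https://github.com/openai/CLIP.git
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def clip_similarity(image_path: str, caption: str) -> float:
        """Cosine similarity between the CLIP embeddings of an image and a caption."""
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        text = clip.tokenize([caption], truncate=True).to(device)
        with torch.no_grad():
            img_emb = model.encode_image(image)
            txt_emb = model.encode_text(text)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        return (img_emb @ txt_emb.T).item()

    pairs = [("img1.jpg", "a dog playing fetch")]  # illustrative image-text pairs
    kept = [p for p in pairs if clip_similarity(*p) > 0.3]  # drop low-similarity pairs
    ```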

  237. CLIP (Contrastive Language–Image Pre-Training) for Italian

  238. crowsonkb/simulacra-aesthetic-models

  239. Neural Image Generation

  240. An Open Source Implementation of CLIP

  241. CLIP/data/yfcc100m.md

  242. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

  243. Clustering-Laion400m: Script and Models for Clustering LAION-400m CLIP Embeddings. Models Were Fit on the First Million or so Image Embeddings.
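
    A sketch of the clustering step this entry describes, under stated assumptions: `embeddings.npy` is a hypothetical N×512 array of precomputed CLIP ViT-B/32 image embeddings, and scikit-learn’s k-means stands in for whatever the repository actually used:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    emb = np.load("embeddings.npy").astype(np.float32)
    # L2-normalize so Euclidean k-means approximates clustering by cosine similarity.
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)

    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(emb)
    print(np.bincount(kmeans.labels_))  # cluster sizes
    ```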

  244. rinongal/StyleGAN-nada

  245. Simple Image Captioning Model

  246. robgon-art/CLIPandPASTE: CLIP and PASTE: Using AI to Create Photo Collages from Text Prompts

  247. sam2_hierarch: Unsupervised Human-Friendly Online Object Categorization

  248. AI-Powered Command-Line Photo Search Tool

  249. Alien Dreams: An Emerging Art Scene

  250. The Bouba/Kiki Effect And Sound Symbolism In CLIP

  252. Image Captioning

  253. Same Energy

  254. Guidance: a Cheat Code for Diffusion Models

  256. Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations

  257. Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders

  258. [P] List of Sites/programs/projects That Use OpenAI’s CLIP Neural Network for Steering Image/video Creation to Match a Text Description

  260. Writing Good VQGAN+CLIP Prompts Part One – Basic Prompts and Style Modifiers

  262. Writing Good VQGAN+CLIP Prompts Part Two – Artist and Genre Modifiers

  264. Writing Good VQGAN+CLIP Prompts Part Three – Environmental Modifiers

  266. New AI Tools CLIP+VQ-GAN Can Create Impressive Works of Art Based on Just a Few Words of Input

  267. Apple or iPod? Easy Fix for Adversarial Textual Attacks on OpenAI’s CLIP Model!

  268. design#future-tag-features

  269. Ross (2023-01-01): CLIP ViT-bigG/14 LAION-2B (39B samples seen, batch 160K) benchmark performance compared to previous open-source SOTA model [image]

  270. Ross (2023-01-01): OpenCLIP scaling for CLIP ViT-bigG/14 LAION-2B (39B samples seen, batch 160K) [image]

  271. Girdhar et al 2023, Figure 6: ImageBind scaling of performance with increasing CLIP image-encoder size [image]

  272. Armstrong (2022-11-22): screenshot of the Soot image organizer, ‘person swimming in water’ query example [image]

  273. Armstrong (2022-11-22): screenshot of the Soot image organizer, ‘person swimming in water’ query results [image]

  274. Rombach (2022-04-04): CompVis txt2img preview [image]

  275. Cherti et al 2022, Figure 1a: OpenCLIP compute vs zero-shot classification scaling curve [image]

  276. Cherti et al 2022, Figure 1b: OpenCLIP compute vs zero-shot retrieval scaling curve [image]

  277. Dong et al 2022, Figure 1: ablating improvements to CLIP fine-tuning tricks for ImageNet transfer [image]

  278. Singh et al 2022, Figure 3: scaling model and dataset sizes [image]

  279. RiversHaveWings (2021-04-22): CLIP+VQGAN, ‘the shadowy hacker group Eleuther’ [image]

  280. nagolinc (2021-01-20): TADNE CLIP-based generation, ‘a girl with a pink hat’ [image]

  281. Muttenthaler et al 2021, Figure 2: correlation of fMRI brain activations with various neural networks [image]

  282. Radford et al 2021 (CLIP), Figure 13: CLIP robustness [image]

  283. Radford et al 2021 (CLIP), Figure 21: zero-shot performance across 36 different tasks [image]

  284. Radford et al 2021 (CLIP), Figure 4: prompt engineering [image]

  285. Radford et al 2021 (CLIP), Figure 5: CLIP zero-shot vs fully-supervised ResNet [image]

  286. Radford et al 2021 (CLIP), Figure 9: CLIP compute scaling [image]

  287. https://colab.research.google.com/drive/189LHTpYaefMhKNIGOzTLHHavlgmoIWg9

  288. https://colab.research.google.com/drive/1N8Cc9yYzNR4M9J2NtE3n3jL2Jy25KAl_

  289. https://colab.research.google.com/drive/1c6VccMPsOMAUQCKU4BVDRd5Y32qkozmK

  290. https://colab.research.google.com/github/kvfrans/clipdraw/blob/main/clipdraw.ipynb

  291. https://creator.nightcafe.studio/vqgan-clip-keyword-modifier-comparison

  293. https://github.com/MaartenGr/Concept

  294. https://github.com/lucidrains/big-sleep

  295. https://github.com/nostalgebraist/improved-diffusion

  296. https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup

  297. https://joelsimon.net/new-words.html

  298. https://jxmo.notion.site/The-Weird-and-Wonderful-World-of-AI-Art-b9615a2e7278435b98380ff81ae1cf09

  300. https://laion.ai/blog/coca/

  302. https://laion.ai/blog/large-openclip/

  304. https://replicate.com/methexis-inc/img2prompt

  305. https://rom1504.github.io/clip-retrieval/

  306. https://stanislavfort.com/2021/01/12/OpenAI_CLIP_adversarial_examples.html

  307. https://stanislavfort.github.io/blog/OpenAI_CLIP_adversarial_examples/

  309. https://stanislavfort.github.io/blog/OpenAI_CLIP_stickers_and_adversarial_examples/

  311. https://tech.pic-collage.com/distillation-of-clip-model-and-other-experiments-f8394b7321ce

  313. https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA

  315. https://web.media.mit.edu/~echu/assets/projects/evolving-views/paper.pdf

  317. https://www.lesswrong.com/posts/cqGEQeLNbcptYsifz/this-week-in-fashion

  318. https://www.lesswrong.com/posts/kobJymvvcvhbjWFKe/laying-the-foundations-for-vision-and-multimodal-mechanistic

  319. https://www.reddit.com/r/MachineLearning/comments/nq4es7/d_unreal_engine_trick_with_vqgan_clip/

  321. https://www.reddit.com/r/MediaSynthesis/comments/p5nw28/clip_vqgan_keyword_comparison_by_kingdomakrillic/

  322. https://www.unum.cloud/blog/2023-02-20-efficient-multimodality

  324. https://x.com/NicholasBardy/status/1530461357048418304

  325. https://x.com/davisblalock/status/1559802005928808448

  326. https://x.com/rom1504/status/1532508153513971712
