Bibliography:

  1. ‘AI video’ tag

  2. ‘CLIP’ tag

  3. ‘masked autoencoder’ tag

  4. CT Foundation: Taking medical imaging embeddings 3D (https://research.google/blog/taking-medical-imaging-embeddings-3d/)

  5. Long-Term Tracking of Social Structure in Groups of Rats

  6. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

  7. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

  8. Test-Time Training on Video Streams (https://arxiv.org/abs/2307.05014)

  9. Magenta Green Screen: Spectrally Multiplexed Alpha Matting with Deep Colorization

  10. PaLI-X: On Scaling up a Multilingual Vision and Language Model

  11. ImageBind: One Embedding Space To Bind Them All (https://arxiv.org/abs/2305.05665#facebook)

  12. Scaling Vision Transformers to 22 Billion Parameters (https://arxiv.org/abs/2302.05442#google)

  13. VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners (https://arxiv.org/abs/2212.04979#google)

  14. VindLU: A Recipe for Effective Video-and-Language Pretraining (https://arxiv.org/abs/2212.05051)

  15. Videogenic: Video Highlights via Photogenic Moments

  16. AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies

  17. Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

  18. TVLT: Textless Vision-Language Transformer (https://arxiv.org/abs/2209.14156)

  19. EVL: Frozen CLIP Models are Efficient Video Learners (https://arxiv.org/abs/2208.03550)

  20. X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition

  21. X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval (https://arxiv.org/abs/2207.07285#alibaba)

  22. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (https://arxiv.org/abs/2206.11795#openai)

  23. OmniMAE: Single Model Masked Pretraining on Images and Videos (https://arxiv.org/abs/2206.08356#facebook)

  24. LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling (https://arxiv.org/abs/2206.07160#microsoft)

  25. MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

  26. Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

  27. Revisiting the "Video" in Video-Language Understanding

  28. VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners (https://arxiv.org/abs/2205.10747)

  29. Masked Autoencoders As Spatiotemporal Learners (https://arxiv.org/abs/2205.09113#facebook)

  30. Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning

  31. ViS4mer: Long Movie Clip Classification with State-Space Video Models

  32. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language (https://arxiv.org/abs/2204.00598#google)

  33. Reinforcement Learning with Action-Free Pre-Training from Videos

  34. CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning (https://arxiv.org/abs/2203.11096)

  35. Robot peels banana with goal-conditioned dual-action deep imitation learning

  36. Hierarchical Perceiver

  37. MuZero with Self-competition for Rate Control in VP9 Video Compression

  38. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (https://arxiv.org/abs/2201.12086#salesforce)

  39. MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

  40. CAST: Character labeling in Animation using Self-supervision by Tracking

  41. AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

  42. Noether Networks: Meta-Learning Useful Conserved Quantities

  43. MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

  44. MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

  45. Florence: A New Foundation Model for Computer Vision (https://arxiv.org/abs/2111.11432#microsoft)

  46. Scaling ASR Improves Zero and Few Shot Learning

  47. ADOP: Approximate Differentiable One-Pixel Point Rendering

  48. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

  49. Perceiver IO: A General Architecture for Structured Inputs & Outputs (https://arxiv.org/abs/2107.14795#deepmind)

  50. CLIP-It! Language-Guided Video Summarization (https://arxiv.org/abs/2107.00650)

  51. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP (https://arxiv.org/abs/2106.11097)

  52. Revisiting ResNets: Improved Training and Scaling Strategies (https://arxiv.org/abs/2103.07579#google)

  53. Learning from videos to understand the world (https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/)

  54. Perceiver: General Perception with Iterative Attention (https://arxiv.org/abs/2103.03206#deepmind)

  55. Video Transformer Network

  56. Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning

  57. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

  58. CLIP: Learning Transferable Visual Models From Natural Language Supervision (https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf)

  59. Transformers in Vision: A Survey

  60. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures (https://arxiv.org/abs/2012.08508#deepmind)

  61. Accuracy and Performance Comparison of Video Action Recognition Approaches (https://arxiv.org/abs/2008.09037)

  62. Self-supervised learning through the eyes of a child

  63. Gesticulator: A framework for semantically-aware speech-driven gesture generation

  64. SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective

  65. Axial Attention in Multidimensional Transformers

  66. CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

  67. CLEVRER: CoLlision Events for Video REpresentation and Reasoning

  68. Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos

  69. A Short Note on the Kinetics-700 Human Action Dataset

  70. Billion-scale semi-supervised learning for image classification (https://arxiv.org/abs/1905.00546#facebook)

  71. VideoBERT: A Joint Model for Video and Language Representation Learning

  72. Real-time Continuous Transcription with Live Transcribe

  73. CCNet: Criss-Cross Attention for Semantic Segmentation (https://arxiv.org/abs/1811.11721)

  74. Evolving Space-Time Neural Architectures for Videos

  75. Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

  76. A Short Note about Kinetics-600 (https://arxiv.org/abs/1808.01340#deepmind)

  77. Large-Scale Visual Speech Recognition

  78. Playing hard exploration games by watching YouTube

  79. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

  80. The Sound of Pixels

  81. One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning

  82. Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition

  83. Reinforced Video Captioning with Entailment Rewards

  84. Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning

  85. Learning to Learn from Noisy Web Videos

  86. Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset (https://arxiv.org/abs/1705.07750#deepmind)

  87. The Kinetics Human Action Video Dataset

  88. Dense-Captioning Events in Videos

  89. Time-Contrastive Networks: Self-Supervised Learning from Video

  90. LipNet: End-to-End Sentence-level Lipreading

  91. Deep Visual Foresight for Planning Robot Motion

  92. Temporal Convolutional Networks: A Unified Approach to Action Segmentation

  93. Clockwork Convnets for Video Semantic Segmentation (https://arxiv.org/abs/1608.03609)

  94. Artistic style transfer for videos

  95. YFCC100M: The New Data in Multimedia Research

  96. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

  97. https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/

  98. https://github.com/facebookresearch/jepa

  99. https://openai.com/research/vpt

  100. https://rmozone.com/snapshots/2021/11/gentle-history/

  101. https://simonwillison.net/2024/Feb/21/gemini-pro-video

  102. https://www.csm.ai/commonsim-1-generating-3d-worlds

  103. https://www.nature.com/articles/s42003-023-05098-1

  110. CT Foundation: Taking medical imaging embeddings 3D

  111. https%253A%252F%252Fresearch.google%252Fblog%252Ftaking-medical-imaging-embeddings-3d%252F.html

  112. Test-Time Training on Video Streams

  113. Yu Sun

  114. https%253A%252F%252Farxiv.org%252Fabs%252F2307.05014.html

  115. ImageBind: One Embedding Space To Bind Them All

  116. Zhuang Liu’s Homepage

  117. https%253A%252F%252Farxiv.org%252Fabs%252F2305.05665%2523facebook.html

  118. Scaling Vision Transformers to 22 Billion Parameters

  119. Robert Geirhos

  120. Lucas Beyer

  121. Yi Tay

  122. Neil Houlsby

  123. https%253A%252F%252Farxiv.org%252Fabs%252F2302.05442%2523google.html

  124. VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

  125. https%253A%252F%252Farxiv.org%252Fabs%252F2212.04979%2523google.html

  126. VindLU: A Recipe for Effective Video-and-Language Pretraining

  127. Mohit Bansal

  128. https%253A%252F%252Farxiv.org%252Fabs%252F2212.05051.html

  129. TVLT: Textless Vision-Language Transformer

  130. Mohit Bansal

  131. https%253A%252F%252Farxiv.org%252Fabs%252F2209.14156.html

  132. EVL: Frozen CLIP Models are Efficient Video Learners

  133. https%253A%252F%252Farxiv.org%252Fabs%252F2208.03550.html

  134. X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

  135. https%253A%252F%252Farxiv.org%252Fabs%252F2207.07285%2523alibaba.html

  136. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

  137. Jeff Clune—Professor—Computer Science—University of British Columbia

  138. https%253A%252F%252Farxiv.org%252Fabs%252F2206.11795%2523openai.html

  139. OmniMAE: Single Model Masked Pretraining on Images and Videos

  140. https%253A%252F%252Farxiv.org%252Fabs%252F2206.08356%2523facebook.html

  141. LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

  142. https%253A%252F%252Farxiv.org%252Fabs%252F2206.07160%2523microsoft.html

  143. VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

  144. Mohit Bansal

  145. https%253A%252F%252Farxiv.org%252Fabs%252F2205.10747.html

  146. Masked Autoencoders As Spatiotemporal Learners

  147. https%253A%252F%252Farxiv.org%252Fabs%252F2205.09113%2523facebook.html

  148. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

  149. https%253A%252F%252Farxiv.org%252Fabs%252F2204.00598%2523google.html

  150. CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning

  151. https%253A%252F%252Farxiv.org%252Fabs%252F2203.11096.html

  152. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

  153. Caiming Xiong—Home Page

  154. https%253A%252F%252Farxiv.org%252Fabs%252F2201.12086%2523salesforce.html

  155. Florence: A New Foundation Model for Computer Vision

  156. Jianfeng Gao at Microsoft Research

  157. https%253A%252F%252Farxiv.org%252Fabs%252F2111.11432%2523microsoft.html

  158. Perceiver IO: A General Architecture for Structured Inputs & Outputs

  159. https%253A%252F%252Farxiv.org%252Fabs%252F2107.14795%2523deepmind.html

  160. CLIP-It! Language-Guided Video Summarization

  161. https%253A%252F%252Farxiv.org%252Fabs%252F2107.00650.html

  162. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

  163. https%253A%252F%252Farxiv.org%252Fabs%252F2106.11097.html

  164. Revisiting ResNets: Improved Training and Scaling Strategies

  165. Aravind Srinivas

  166. Barret Zoph

  167. https%253A%252F%252Farxiv.org%252Fabs%252F2103.07579%2523google.html

  168. Learning from videos to understand the world

  169. Polina Kuznetsova

  170. https%253A%252F%252Fai.facebook.com%252Fblog%252Flearning-from-videos-to-understand-the-world%252F.html

  171. Perceiver: General Perception with Iterative Attention

  172. https%253A%252F%252Farxiv.org%252Fabs%252F2103.03206%2523deepmind.html

  173. CLIP: Learning Transferable Visual Models From Natural Language Supervision

  174. Alec Radford

  175. Jong Wook Kim

  176. Aditya A. Ramesh

  177. Sandhini Agarwal

  178. About Me

  179. https://jack-clark.net/about/

  180. Gretchen Krueger

  181. https%253A%252F%252Fcdn.openai.com%252Fpapers%252FLearning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf.html

  182. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures

  183. Language Understanding Grounded in Perception and Action

  184. https%253A%252F%252Farxiv.org%252Fabs%252F2012.08508%2523deepmind.html

  185. Accuracy and Performance Comparison of Video Action Recognition Approaches

  186. https%253A%252F%252Farxiv.org%252Fabs%252F2008.09037.html

  187. Billion-scale semi-supervised learning for image classification

  188. https%253A%252F%252Farxiv.org%252Fabs%252F1905.00546%2523facebook.html

  189. CCNet: Criss-Cross Attention for Semantic Segmentation

  190. https%253A%252F%252Farxiv.org%252Fabs%252F1811.11721.html

  191. A Short Note about Kinetics-600

  192. https%253A%252F%252Farxiv.org%252Fabs%252F1808.01340%2523deepmind.html

  193. Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset

  194. https%253A%252F%252Farxiv.org%252Fabs%252F1705.07750%2523deepmind.html

  195. Clockwork Convnets for Video Semantic Segmentation

  196. https%253A%252F%252Farxiv.org%252Fabs%252F1608.03609.html