Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Magenta Green Screen: Spectrally Multiplexed Alpha Matting with Deep Colorization
PaLI-X: On Scaling up a Multilingual Vision and Language Model
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
VindLU: A Recipe for Effective Video-and-Language Pretraining
AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
OmniMAE: Single Model Masked Pretraining on Images and Videos
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning
ViS4mer: Long Movie Clip Classification with State-Space Video Models
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Reinforcement Learning with Action-Free Pre-Training from Videos
CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
Robot peels banana with goal-conditioned dual-action deep imitation learning
MuZero with Self-competition for Rate Control in VP9 Video Compression
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
CAST: Character labeling in Animation using Self-supervision by Tracking
AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Noether Networks: Meta-Learning Useful Conserved Quantities
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video
ADOP: Approximate Differentiable One-Pixel Point Rendering
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Revisiting ResNets: Improved Training and Scaling Strategies
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
CLIP: Learning Transferable Visual Models From Natural Language Supervision
Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures
Accuracy and Performance Comparison of Video Action Recognition Approaches
Gesticulator: A framework for semantically-aware speech-driven gesture generation
SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos
Billion-scale semi-supervised learning for image classification
VideoBERT: A Joint Model for Video and Language Representation Learning
Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow
BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning
One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning
Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition
Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning
Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset
Time-Contrastive Networks: Self-Supervised Learning from Video
Temporal Convolutional Networks: A Unified Approach to Action Segmentation
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
[Figure: Baker et al. 2022 (VPT), Figure 8: success rate of crafting items, scaling with pretraining dataset size.]
https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
https://research.google/blog/taking-medical-imaging-embeddings-3d/
https://arxiv.org/abs/2305.05665#facebook
https://arxiv.org/abs/2302.05442#google
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
https://arxiv.org/abs/2212.04979#google
VindLU: A Recipe for Effective Video-and-Language Pretraining
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
https://arxiv.org/abs/2207.07285#alibaba
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
Jeff Clune—Professor—Computer Science—University of British Columbia
https://arxiv.org/abs/2206.11795#openai
OmniMAE: Single Model Masked Pretraining on Images and Videos
https://arxiv.org/abs/2206.08356#facebook
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
https://arxiv.org/abs/2206.07160#microsoft
VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
https://arxiv.org/abs/2205.09113#facebook
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
https://arxiv.org/abs/2204.00598#google
CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
https://arxiv.org/abs/2201.12086#salesforce
https://arxiv.org/abs/2111.11432#microsoft
Perceiver IO: A General Architecture for Structured Inputs & Outputs
https://arxiv.org/abs/2107.14795#deepmind
Revisiting ResNets: Improved Training and Scaling Strategies
https://arxiv.org/abs/2103.07579#google
https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/
https://arxiv.org/abs/2103.03206#deepmind
CLIP: Learning Transferable Visual Models From Natural Language Supervision
https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf
Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures
https://arxiv.org/abs/2012.08508#deepmind
Accuracy and Performance Comparison of Video Action Recognition Approaches
Billion-scale semi-supervised learning for image classification
https://arxiv.org/abs/1905.00546#facebook
https://arxiv.org/abs/1808.01340#deepmind
Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset
https://arxiv.org/abs/1705.07750#deepmind