See Also
Links
- “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023-02-10
- “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Cheng et al 2022-12-09
- “Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners”, Yan et al 2022-12-09
- “Videogenic: Video Highlights via Photogenic Moments”, 2022-11-22
- “AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, 2022-11-10
- “Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Gan et al 2022-10-17
- “TVLT: Textless Vision-Language Transformer”, Tang et al 2022-09-28
- “EVL: Frozen CLIP Models are Efficient Video Learners”, Lin et al 2022-08-06
- “X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022-08-04
- “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022-07-15
- “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Baker et al 2022-06-23
- “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Girdhar et al 2022-06-16
- “LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling”, Li et al 2022-06-14
- “MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing”, Qiu et al 2022-06-13
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs”, Zhu et al 2022-06-09
- “Revisiting the ‘Video’ in Video-Language Understanding”, Buch et al 2022-06-03
- “VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners”, Wang et al 2022-05-22
- “Masked Autoencoders As Spatiotemporal Learners”, Feichtenhofer et al 2022-05-18
- “Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-time Planning”, 2022-04-07
- “ViS4mer: Long Movie Clip Classification with State-Space Video Models”, Islam & Bertasius 2022-04-04
- “Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language”, Zeng et al 2022-04-01
- “Reinforcement Learning with Action-Free Pre-Training from Videos”, Seo et al 2022-03-25
- “CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning”, Taesiri et al 2022-03-21
- “Robot peels banana with goal-conditioned dual-action deep imitation learning”, Kim et al 2022-03-18
- “Hierarchical Perceiver”, Carreira et al 2022-02-22
- “MuZero with Self-competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022-02-14
- “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Li et al 2022-01-28
- “MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022-01-20
- “CAST: Character labeling in Animation using Self-supervision by Tracking”, 2022-01-19
- “AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Shi et al 2022-01-05
- “Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021-12-06
- “MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021-12-01
- “MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, 2021-11-24
- “Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021-11-22
- “Scaling ASR Improves Zero and Few Shot Learning”, 2021-11-10
- “ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021-10-13
- “VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Xu et al 2021-09-28
- “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021-07-30
- “CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021-07-01
- “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021-06-21
- “Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021-03-13
- “Learning from videos to understand the world”, Zweig et al 2021-03-12
- “Perceiver: General Perception with Iterative Attention”, Jaegle et al 2021-03-04
- “Video Transformer Network”, Neimark et al 2021-02-01
- “Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, 2021-01-26
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021-01-17
- “Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021-01-05
- “Transformers in Vision: A Survey”, Khan et al 2021-01-04
- “Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures”, Ding et al 2020-12-15
- “Accuracy and Performance Comparison of Video Action Recognition Approaches”, 2020-08-20
- “Self-supervised learning through the eyes of a child”, Orhan et al 2020-07-31
- “Gesticulator: A framework for semantically-aware speech-driven gesture generation”, Kucherenko et al 2020-01-25
- “SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective”, Sullivan et al 2020-01-14
- “Axial Attention in Multidimensional Transformers”, Ho et al 2019-12-20
- “CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019-10-10
- “CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Yi et al 2019-10-03
- “A Short Note on the Kinetics-700 Human Action Dataset”, Carreira et al 2019-07-15
- “Billion-scale semi-supervised learning for image classification”, Yalniz et al 2019-05-02
- “VideoBERT: A Joint Model for Video and Language Representation Learning”, Sun et al 2019-04-03
- “Real-time Continuous Transcription with Live Transcribe”, 2019-02-04
- “CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018-11-28
- “Evolving Space-Time Neural Architectures for Videos”, Piergiovanni et al 2018-11-26
- “Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Peng et al 2018-10-01
- “A Short Note about Kinetics-600”, Carreira et al 2018-08-03
- “Large-Scale Visual Speech Recognition”, Shillingford et al 2018-07-13
- “Playing hard exploration games by watching YouTube”, Aytar et al 2018-05-29
- “BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018-05-12
- “The Sound of Pixels”, Zhao et al 2018-04-09
- “One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Yu et al 2018-02-05
- “Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition”, 2017-12-14
- “Reinforced Video Captioning with Entailment Rewards”, Pasunuru & Bansal 2017-08-07
- “Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning”, Supančič III & Ramanan 2017-07-17
- “Learning to Learn from Noisy Web Videos”, Yeung et al 2017-06-09
- “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017-05-22
- “The Kinetics Human Action Video Dataset”, Kay et al 2017-05-19
- “Dense-Captioning Events in Videos”, Krishna et al 2017-05-02
- “Time-Contrastive Networks: Self-Supervised Learning from Video”, Sermanet et al 2017-04-23
- “LipNet: End-to-End Sentence-level Lipreading”, Assael et al 2016-12-16
- “Deep Visual Foresight for Planning Robot Motion”, Finn & Levine 2016-10-03
- “Artistic style transfer for videos”, Ruder et al 2016-04-28
- “YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015-03-05
- “UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012-12-03
Link Bibliography
- https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”
- https://arxiv.org/abs/2212.05051: “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
- https://arxiv.org/abs/2212.04979#google: “Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners”, Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu
- https://arxiv.org/abs/2209.14156: “TVLT: Textless Vision-Language Transformer”, Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal
- https://arxiv.org/abs/2208.03550: “EVL: Frozen CLIP Models are Efficient Video Learners”, Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
- https://arxiv.org/abs/2207.07285#alibaba: “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji
- https://arxiv.org/abs/2206.11795#openai: “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”
- https://arxiv.org/abs/2206.08356#facebook: “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
- https://arxiv.org/abs/2206.07160#microsoft: “LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling”, Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang
- https://arxiv.org/abs/2205.10747: “VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners”
- https://arxiv.org/abs/2205.09113#facebook: “Masked Autoencoders As Spatiotemporal Learners”, Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
- https://arxiv.org/abs/2204.00598#google: “Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language”
- https://arxiv.org/abs/2203.11096: “CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning”, Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer
- https://arxiv.org/abs/2201.12086#salesforce: “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
- https://arxiv.org/abs/2111.11432#microsoft: “Florence: A New Foundation Model for Computer Vision”
- https://arxiv.org/abs/2107.14795#deepmind: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”
- https://arxiv.org/abs/2106.11097: “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
- https://arxiv.org/abs/2103.07579#google: “Revisiting ResNets: Improved Training and Scaling Strategies”
- https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/: “Learning from videos to understand the world”, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan
- https://arxiv.org/abs/2103.03206#deepmind: “Perceiver: General Perception with Iterative Attention”, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
- https://arxiv.org/abs/2102.00719: “Video Transformer Network”, Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann
- https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf: “Learning Transferable Visual Models From Natural Language Supervision”
- https://arxiv.org/abs/2012.08508#deepmind: “Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures”, David Ding, Felix Hill, Adam Santoro, Matt Botvinick
- https://arxiv.org/abs/2008.09037: “Accuracy and Performance Comparison of Video Action Recognition Approaches”
- https://arxiv.org/abs/1905.00546#facebook: “Billion-scale semi-supervised learning for image classification”, I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, Dhruv Mahajan
- https://arxiv.org/abs/1811.11721: “CCNet: Criss-Cross Attention for Semantic Segmentation”, Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang
- https://arxiv.org/abs/1808.01340#deepmind: “A Short Note about Kinetics-600”, Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman
- https://arxiv.org/abs/1705.07750#deepmind: “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Joao Carreira, Andrew Zisserman