- See Also
- Links
- “InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
- “Test-Time Training on Video Streams”, Wang et al 2023
- “PaLI-X: On Scaling up a Multilingual Vision and Language Model”, Chen et al 2023
- “ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
- “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Cheng et al 2022
- “Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Yan et al 2022
- “Videogenic: Video Highlights via Photogenic Moments”, Lin et al 2022
- “AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, Siyao et al 2022
- “Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Gan et al 2022
- “TVLT: Textless Vision-Language Transformer”, Tang et al 2022
- “EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022
- “X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022
- “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022
- “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Baker et al 2022
- “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Girdhar et al 2022
- “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
- “MLP-3D: A MLP-like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
- “Revisiting the "Video" in Video-Language Understanding”, Buch et al 2022
- “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022
- “Masked Autoencoders As Spatiotemporal Learners”, Feichtenhofer et al 2022
- “Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-time Planning”, Qi et al 2022
- “ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
- “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022
- “Reinforcement Learning With Action-Free Pre-Training from Videos”, Seo et al 2022
- “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Taesiri et al 2022
- “Robot Peels Banana With Goal-conditioned Dual-action Deep Imitation Learning”, Kim et al 2022
- “Hierarchical Perceiver”, Carreira et al 2022
- “MuZero With Self-competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022
- “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Li et al 2022
- “MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022
- “CAST: Character Labeling in Animation Using Self-supervision by Tracking”, Nir et al 2022
- “AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Shi et al 2022
- “Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021
- “MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
- “MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021
- “Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021
- “Scaling ASR Improves Zero and Few Shot Learning”, Xiao et al 2021
- “ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021
- “VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Xu et al 2021
- “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
- “CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021
- “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021
- “Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021
- “Learning from Videos to Understand the World”, Zweig et al 2021
- “Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
- “Video Transformer Network”, Neimark et al 2021
- “Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
- “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
- “Transformers in Vision: A Survey”, Khan et al 2021
- “Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
- “Accuracy and Performance Comparison of Video Action Recognition Approaches”, Hutchinson et al 2020
- “Self-supervised Learning through the Eyes of a Child”, Orhan et al 2020
- “Gesticulator: A Framework for Semantically-aware Speech-driven Gesture Generation”, Kucherenko et al 2020
- “SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020
- “Axial Attention in Multidimensional Transformers”, Ho et al 2019
- “CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019
- “CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Yi et al 2019
- “Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos”, Lin et al 2019
- “A Short Note on the Kinetics-700 Human Action Dataset”, Carreira et al 2019
- “Billion-scale Semi-supervised Learning for Image Classification”, Yalniz et al 2019
- “VideoBERT: A Joint Model for Video and Language Representation Learning”, Sun et al 2019
- “Real-time Continuous Transcription With Live Transcribe”, Savla 2019
- “CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
- “Evolving Space-Time Neural Architectures for Videos”, Piergiovanni et al 2018
- “Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Peng et al 2018
- “A Short Note about Kinetics-600”, Carreira et al 2018
- “Large-Scale Visual Speech Recognition”, Shillingford et al 2018
- “Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018
- “BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018
- “The Sound of Pixels”, Zhao et al 2018
- “One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Yu et al 2018
- “Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017
- “Reinforced Video Captioning With Entailment Rewards”, Pasunuru & Bansal 2017
- “Tracking As Online Decision-Making: Learning a Policy from Streaming Videos With Reinforcement Learning”, Supančič & Ramanan 2017
- “Learning to Learn from Noisy Web Videos”, Yeung et al 2017
- “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017
- “The Kinetics Human Action Video Dataset”, Kay et al 2017
- “Dense-Captioning Events in Videos”, Krishna et al 2017
- “Time-Contrastive Networks: Self-Supervised Learning from Video”, Sermanet et al 2017
- “LipNet: End-to-End Sentence-level Lipreading”, Assael et al 2016
- “Deep Visual Foresight for Planning Robot Motion”, Finn & Levine 2016
- “Temporal Convolutional Networks: A Unified Approach to Action Segmentation”, Lea et al 2016
- “Clockwork Convnets for Video Semantic Segmentation”, Shelhamer et al 2016
- “Artistic Style Transfer for Videos”, Ruder et al 2016
- “YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015
- “UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012
- Sort By Magic
- Miscellaneous
- Link Bibliography
See Also
Links
“InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
“Test-Time Training on Video Streams”, Wang et al 2023
“PaLI-X: On Scaling up a Multilingual Vision and Language Model”, Chen et al 2023
“ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023
“Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
“VindLU: A Recipe for Effective Video-and-Language Pretraining”, Cheng et al 2022
“Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Yan et al 2022
“Videogenic: Video Highlights via Photogenic Moments”, Lin et al 2022
“AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, Siyao et al 2022
“Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Gan et al 2022
“TVLT: Textless Vision-Language Transformer”, Tang et al 2022
“EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022
“X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022
“X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022
“Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Baker et al 2022
“OmniMAE: Single Model Masked Pretraining on Images and Videos”, Girdhar et al 2022
“LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
“MLP-3D: A MLP-like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022
“Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
“Revisiting the "Video" in Video-Language Understanding”, Buch et al 2022
“VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022
“Masked Autoencoders As Spatiotemporal Learners”, Feichtenhofer et al 2022
“Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-time Planning”, Qi et al 2022
“ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
“Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022
“Reinforcement Learning With Action-Free Pre-Training from Videos”, Seo et al 2022
“CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Taesiri et al 2022
“Robot Peels Banana With Goal-conditioned Dual-action Deep Imitation Learning”, Kim et al 2022
“Hierarchical Perceiver”, Carreira et al 2022
“MuZero With Self-competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022
“BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Li et al 2022
“MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022
“CAST: Character Labeling in Animation Using Self-supervision by Tracking”, Nir et al 2022
“AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Shi et al 2022
“Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021
“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
“MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021
“Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021
“Scaling ASR Improves Zero and Few Shot Learning”, Xiao et al 2021
“ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021
“VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Xu et al 2021
“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
“CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021
“CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021
“Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021
“Learning from Videos to Understand the World”, Zweig et al 2021
“Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
“Video Transformer Network”, Neimark et al 2021
“Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021
“MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
“CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
“Transformers in Vision: A Survey”, Khan et al 2021
“Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
“Accuracy and Performance Comparison of Video Action Recognition Approaches”, Hutchinson et al 2020
“Self-supervised Learning through the Eyes of a Child”, Orhan et al 2020
“Gesticulator: A Framework for Semantically-aware Speech-driven Gesture Generation”, Kucherenko et al 2020
“SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020
“Axial Attention in Multidimensional Transformers”, Ho et al 2019
“CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019
“CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Yi et al 2019
“Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos”, Lin et al 2019
“A Short Note on the Kinetics-700 Human Action Dataset”, Carreira et al 2019
“Billion-scale Semi-supervised Learning for Image Classification”, Yalniz et al 2019
“VideoBERT: A Joint Model for Video and Language Representation Learning”, Sun et al 2019
“Real-time Continuous Transcription With Live Transcribe”, Savla 2019
“CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
“Evolving Space-Time Neural Architectures for Videos”, Piergiovanni et al 2018
“Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Peng et al 2018
“A Short Note about Kinetics-600”, Carreira et al 2018
“Large-Scale Visual Speech Recognition”, Shillingford et al 2018
“Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018
“BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018
“The Sound of Pixels”, Zhao et al 2018
“One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Yu et al 2018
“Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017
“Reinforced Video Captioning With Entailment Rewards”, Pasunuru & Bansal 2017
“Tracking As Online Decision-Making: Learning a Policy from Streaming Videos With Reinforcement Learning”, Supančič & Ramanan 2017
“Learning to Learn from Noisy Web Videos”, Yeung et al 2017
“Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017
“The Kinetics Human Action Video Dataset”, Kay et al 2017
“Dense-Captioning Events in Videos”, Krishna et al 2017
“Time-Contrastive Networks: Self-Supervised Learning from Video”, Sermanet et al 2017
“LipNet: End-to-End Sentence-level Lipreading”, Assael et al 2016
“Deep Visual Foresight for Planning Robot Motion”, Finn & Levine 2016
“Temporal Convolutional Networks: A Unified Approach to Action Segmentation”, Lea et al 2016
“Clockwork Convnets for Video Semantic Segmentation”, Shelhamer et al 2016
“Artistic Style Transfer for Videos”, Ruder et al 2016
“YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015
“UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses each annotation’s embedding to find its nearest-neighbor annotations, producing a progression of topics. For more details, see the link. (A minimal sketch of this kind of embedding-based ordering follows the tag list below.)
imitationlearning
multimodal
perceiverio
actiondataset
videolearning
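As an illustration, here is a minimal sketch of this kind of embedding-based ordering (not the site’s actual implementation): starting from the newest annotation, it greedily walks to the nearest unvisited neighbor in embedding space. The titles and random placeholder vectors are hypothetical stand-ins; a real pipeline would embed each annotation’s text with a text-embedding model and add a clustering/auto-labeling step to produce tags like those listed above.

```python
# Sketch: greedy nearest-neighbor ordering of annotation embeddings,
# beginning with the newest annotation, so related topics end up adjacent.
# Embeddings are random placeholders standing in for real text embeddings.
import numpy as np

rng = np.random.default_rng(0)
titles = ["VideoCLIP", "Perceiver IO", "Kinetics-700", "LipNet", "BLIP"]
embeddings = rng.normal(size=(len(titles), 384))                 # placeholder vectors
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize for cosine similarity

order = [0]                              # index 0 = newest annotation
remaining = set(range(1, len(titles)))
while remaining:
    last = embeddings[order[-1]]
    # pick the unvisited annotation most similar to the one just placed
    nxt = max(remaining, key=lambda i: float(embeddings[i] @ last))
    order.append(nxt)
    remaining.remove(nxt)

print([titles[i] for i in order])        # a topic-ordered progression of annotations
```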
Miscellaneous
Link Bibliography
- https://arxiv.org/abs/2307.05014: “Test-Time Training on Video Streams”, Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
- https://arxiv.org/abs/2305.05665#facebook: “ImageBind: One Embedding Space To Bind Them All”, Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
- https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”
- https://arxiv.org/abs/2212.05051: “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
- https://arxiv.org/abs/2212.04979#google: “Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu
- https://arxiv.org/abs/2209.14156: “TVLT: Textless Vision-Language Transformer”, Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal
- https://arxiv.org/abs/2208.03550: “EVL: Frozen CLIP Models Are Efficient Video Learners”, Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
- https://arxiv.org/abs/2207.07285#alibaba: “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji
- https://arxiv.org/abs/2206.11795#openai: “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”
- https://arxiv.org/abs/2206.08356#facebook: “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
- https://arxiv.org/abs/2206.07160#microsoft: “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang
- https://arxiv.org/abs/2205.10747: “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”
- https://arxiv.org/abs/2205.09113#facebook: “Masked Autoencoders As Spatiotemporal Learners”, Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
- https://arxiv.org/abs/2204.00598#google: “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”
- https://arxiv.org/abs/2203.11096: “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer
- https://arxiv.org/abs/2201.12086#salesforce: “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
- https://arxiv.org/abs/2111.11432#microsoft: “Florence: A New Foundation Model for Computer Vision”
- https://arxiv.org/abs/2107.14795#deepmind: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”
- https://arxiv.org/abs/2106.11097: “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
- https://arxiv.org/abs/2103.07579#google: “Revisiting ResNets: Improved Training and Scaling Strategies”
- https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/: “Learning from Videos to Understand the World”, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan
- https://arxiv.org/abs/2103.03206#deepmind: “Perceiver: General Perception With Iterative Attention”, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
- https://arxiv.org/abs/2102.00719: “Video Transformer Network”, Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann
- https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf: “CLIP: Learning Transferable Visual Models From Natural Language Supervision”
- https://arxiv.org/abs/2012.08508#deepmind: “Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, David Ding, Felix Hill, Adam Santoro, Matt Botvinick
- https://arxiv.org/abs/2008.09037: “Accuracy and Performance Comparison of Video Action Recognition Approaches”
- https://arxiv.org/abs/1905.00546#facebook: “Billion-scale Semi-supervised Learning for Image Classification”, I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, Dhruv Mahajan
- https://arxiv.org/abs/1811.11721: “CCNet: Criss-Cross Attention for Semantic Segmentation”, Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang
- https://arxiv.org/abs/1808.01340#deepmind: “A Short Note about Kinetics-600”, Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman
- https://arxiv.org/abs/1705.07750#deepmind: “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Joao Carreira, Andrew Zisserman
- https://arxiv.org/abs/1608.03609: “Clockwork Convnets for Video Semantic Segmentation”, Evan Shelhamer, Kate Rakelly, Judy Hoffman, Trevor Darrell