- See Also
- Links
- “InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
- “Test-Time Training on Video Streams”, Wang et al 2023
- “PaLI-X: On Scaling up a Multilingual Vision and Language Model”, Chen et al 2023
- “ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
- “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Cheng et al 2022
- “Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Yan et al 2022
- “Videogenic: Video Highlights via Photogenic Moments”, Lin et al 2022
- “AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, Siyao et al 2022
- “Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Gan et al 2022
- “TVLT: Textless Vision-Language Transformer”, Tang et al 2022
- “EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022
- “X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022
- “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022
- “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Baker et al 2022
- “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Girdhar et al 2022
- “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
- “MLP-3D: A MLP-like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
- “Revisiting the "Video" in Video-Language Understanding”, Buch et al 2022
- “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022
- “Masked Autoencoders As Spatiotemporal Learners”, Feichtenhofer et al 2022
- “Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-time Planning”, Qi et al 2022
- “ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
- “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022
- “Reinforcement Learning With Action-Free Pre-Training from Videos”, Seo et al 2022
- “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Taesiri et al 2022
- “Robot Peels Banana With Goal-conditioned Dual-action Deep Imitation Learning”, Kim et al 2022
- “Hierarchical Perceiver”, Carreira et al 2022
- “MuZero With Self-competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022
- “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Li et al 2022
- “MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022
- “CAST: Character Labeling in Animation Using Self-supervision by Tracking”, Nir et al 2022
- “AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Shi et al 2022
- “Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021
- “MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
- “MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021
- “Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021
- “Scaling ASR Improves Zero and Few Shot Learning”, Xiao et al 2021
- “ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021
- “VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Xu et al 2021
- “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
- “CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021
- “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021
- “Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021
- “Learning from Videos to Understand the World”, Zweig et al 2021
- “Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
- “Video Transformer Network”, Neimark et al 2021
- “Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
- “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
- “Transformers in Vision: A Survey”, Khan et al 2021
- “Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
- “Accuracy and Performance Comparison of Video Action Recognition Approaches”, Hutchinson et al 2020
- “Self-supervised Learning through the Eyes of a Child”, Orhan et al 2020
- “Gesticulator: A Framework for Semantically-aware Speech-driven Gesture Generation”, Kucherenko et al 2020
- “SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020
- “Axial Attention in Multidimensional Transformers”, Ho et al 2019
- “CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019
- “CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Yi et al 2019
- “Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos”, Lin et al 2019
- “A Short Note on the Kinetics-700 Human Action Dataset”, Carreira et al 2019
- “Billion-scale Semi-supervised Learning for Image Classification”, Yalniz et al 2019
- “VideoBERT: A Joint Model for Video and Language Representation Learning”, Sun et al 2019
- “Real-time Continuous Transcription With Live Transcribe”, Savla 2019
- “CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
- “Evolving Space-Time Neural Architectures for Videos”, Piergiovanni et al 2018
- “Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Peng et al 2018
- “A Short Note about Kinetics-600”, Carreira et al 2018
- “Large-Scale Visual Speech Recognition”, Shillingford et al 2018
- “Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018
- “BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018
- “The Sound of Pixels”, Zhao et al 2018
- “One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Yu et al 2018
- “Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017
- “Reinforced Video Captioning With Entailment Rewards”, Pasunuru & Bansal 2017
- “Tracking As Online Decision-Making: Learning a Policy from Streaming Videos With Reinforcement Learning”, Supančič & Ramanan 2017
- “Learning to Learn from Noisy Web Videos”, Yeung et al 2017
- “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017
- “The Kinetics Human Action Video Dataset”, Kay et al 2017
- “Dense-Captioning Events in Videos”, Krishna et al 2017
- “Time-Contrastive Networks: Self-Supervised Learning from Video”, Sermanet et al 2017
- “LipNet: End-to-End Sentence-level Lipreading”, Assael et al 2016
- “Deep Visual Foresight for Planning Robot Motion”, Finn & Levine 2016
- “Temporal Convolutional Networks: A Unified Approach to Action Segmentation”, Lea et al 2016
- “Clockwork Convnets for Video Semantic Segmentation”, Shelhamer et al 2016
- “Artistic Style Transfer for Videos”, Ruder et al 2016
- “YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015
- “UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012
- Sort By Magic
- Miscellaneous
- Link Bibliography
See Also
Links
“InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
“Test-Time Training on Video Streams”, Wang et al 2023
“PaLI-X: On Scaling up a Multilingual Vision and Language Model”, Chen et al 2023
“ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023
“Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
“VindLU: A Recipe for Effective Video-and-Language Pretraining”, Cheng et al 2022
“Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Yan et al 2022
“Videogenic: Video Highlights via Photogenic Moments”, Lin et al 2022
“AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, Siyao et al 2022
“Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Gan et al 2022
“TVLT: Textless Vision-Language Transformer”, Tang et al 2022
“EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022
“X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022
“X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022
“Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Baker et al 2022
“OmniMAE: Single Model Masked Pretraining on Images and Videos”, Girdhar et al 2022
“LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
“MLP-3D: A MLP-like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022
“Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
“Revisiting the "Video" in Video-Language Understanding”, Buch et al 2022
“VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022
“Masked Autoencoders As Spatiotemporal Learners”, Feichtenhofer et al 2022
“Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-time Planning”, Qi et al 2022
“ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
“Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022
“Reinforcement Learning With Action-Free Pre-Training from Videos”, Seo et al 2022
“CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Taesiri et al 2022
“Robot Peels Banana With Goal-conditioned Dual-action Deep Imitation Learning”, Kim et al 2022
“Hierarchical Perceiver”, Carreira et al 2022
“MuZero With Self-competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022
“BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Li et al 2022
“MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022
“CAST: Character Labeling in Animation Using Self-supervision by Tracking”, Nir et al 2022
“AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Shi et al 2022
“Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021
“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
“MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021
“Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021
“Scaling ASR Improves Zero and Few Shot Learning”, Xiao et al 2021
“ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021
“VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Xu et al 2021
“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
“CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021
“CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021
“Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021
“Learning from Videos to Understand the World”, Zweig et al 2021
“Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
“Video Transformer Network”, Neimark et al 2021
“Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021
“MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
“CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
“Transformers in Vision: A Survey”, Khan et al 2021
“Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
“Accuracy and Performance Comparison of Video Action Recognition Approaches”, Hutchinson et al 2020
“Self-supervised Learning through the Eyes of a Child”, Orhan et al 2020
“Gesticulator: A Framework for Semantically-aware Speech-driven Gesture Generation”, Kucherenko et al 2020
“SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020
“Axial Attention in Multidimensional Transformers”, Ho et al 2019
“CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019
“CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Yi et al 2019
“Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos”, Lin et al 2019
“A Short Note on the Kinetics-700 Human Action Dataset”, Carreira et al 2019
“Billion-scale Semi-supervised Learning for Image Classification”, Yalniz et al 2019
“VideoBERT: A Joint Model for Video and Language Representation Learning”, Sun et al 2019
“Real-time Continuous Transcription With Live Transcribe”, Savla 2019
“CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
“Evolving Space-Time Neural Architectures for Videos”, Piergiovanni et al 2018
“Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Peng et al 2018
“A Short Note about Kinetics-600”, Carreira et al 2018
“Large-Scale Visual Speech Recognition”, Shillingford et al 2018
“Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018
“BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018
“The Sound of Pixels”, Zhao et al 2018
“One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Yu et al 2018
“Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017
“Reinforced Video Captioning With Entailment Rewards”, Pasunuru & Bansal 2017
“Tracking As Online Decision-Making: Learning a Policy from Streaming Videos With Reinforcement Learning”, Supančič & Ramanan 2017
“Learning to Learn from Noisy Web Videos”, Yeung et al 2017
“Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017
“The Kinetics Human Action Video Dataset”, Kay et al 2017
“Dense-Captioning Events in Videos”, Krishna et al 2017
“Time-Contrastive Networks: Self-Supervised Learning from Video”, Sermanet et al 2017
“LipNet: End-to-End Sentence-level Lipreading”, Assael et al 2016
“Deep Visual Foresight for Planning Robot Motion”, Finn & Levine 2016
“Temporal Convolutional Networks: A Unified Approach to Action Segmentation”, Lea et al 2016
“Clockwork Convnets for Video Semantic Segmentation”, Shelhamer et al 2016
“Artistic Style Transfer for Videos”, Ruder et al 2016
“YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015
“UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses each annotation’s embedding to find its nearest-neighbor annotations, producing a progression of topics. For more details, see the link. (A minimal sketch of this kind of embedding-based ordering follows the tag list below.)
imitationlearning
multimodal
perceiverio
actiondataset
videolearning
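As an illustration, here is a minimal sketch of this kind of embedding-based ordering (not the site’s actual implementation): starting from the newest annotation, it greedily walks to the nearest unvisited neighbor in embedding space. The titles and random placeholder vectors are hypothetical stand-ins; a real pipeline would embed each annotation’s text with a text-embedding model and add a clustering/auto-labeling step to produce tags like those listed above.

```python
# Sketch: greedy nearest-neighbor ordering of annotation embeddings,
# beginning with the newest annotation, so related topics end up adjacent.
# Embeddings are random placeholders standing in for real text embeddings.
import numpy as np

rng = np.random.default_rng(0)
titles = ["VideoCLIP", "Perceiver IO", "Kinetics-700", "LipNet", "BLIP"]
embeddings = rng.normal(size=(len(titles), 384))                 # placeholder vectors
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize for cosine similarity

order = [0]                              # index 0 = newest annotation
remaining = set(range(1, len(titles)))
while remaining:
    last = embeddings[order[-1]]
    # pick the unvisited annotation most similar to the one just placed
    nxt = max(remaining, key=lambda i: float(embeddings[i] @ last))
    order.append(nxt)
    remaining.remove(nxt)

print([titles[i] for i in order])        # a topic-ordered progression of annotations
```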
Miscellaneous
Link Bibliography
- https://arxiv.org/abs/2307.05014: “Test-Time Training on Video Streams”, Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
- https://arxiv.org/abs/2305.05665#facebook: “ImageBind: One Embedding Space To Bind Them All”, Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
- https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”
- https://arxiv.org/abs/2212.05051: “VindLU: A Recipe for Effective Video-and-Language Pretraining”, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
- https://arxiv.org/abs/2212.04979#google: “Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu
- https://arxiv.org/abs/2209.14156: “TVLT: Textless Vision-Language Transformer”, Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal
- https://arxiv.org/abs/2208.03550: “EVL: Frozen CLIP Models Are Efficient Video Learners”, Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
- https://arxiv.org/abs/2207.07285#alibaba: “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji
- https://arxiv.org/abs/2206.11795#openai: “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”
- https://arxiv.org/abs/2206.08356#facebook: “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
- https://arxiv.org/abs/2206.07160#microsoft: “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang
- https://arxiv.org/abs/2205.10747: “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”
- https://arxiv.org/abs/2205.09113#facebook: “Masked Autoencoders As Spatiotemporal Learners”, Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
- https://arxiv.org/abs/2204.00598#google: “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”
- https://arxiv.org/abs/2203.11096: “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer
- https://arxiv.org/abs/2201.12086#salesforce: “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
- https://arxiv.org/abs/2111.11432#microsoft: “Florence: A New Foundation Model for Computer Vision”
- https://arxiv.org/abs/2107.14795#deepmind: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”
- https://arxiv.org/abs/2106.11097: “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
- https://arxiv.org/abs/2103.07579#google: “Revisiting ResNets: Improved Training and Scaling Strategies”
- https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/: “Learning from Videos to Understand the World”, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan
- https://arxiv.org/abs/2103.03206#deepmind: “Perceiver: General Perception With Iterative Attention”, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
- https://arxiv.org/abs/2102.00719: “Video Transformer Network”, Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann
- https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf: “CLIP: Learning Transferable Visual Models From Natural Language Supervision”
- https://arxiv.org/abs/2012.08508#deepmind: “Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, David Ding, Felix Hill, Adam Santoro, Matt Botvinick
- https://arxiv.org/abs/2008.09037: “Accuracy and Performance Comparison of Video Action Recognition Approaches”
- https://arxiv.org/abs/1905.00546#facebook: “Billion-scale Semi-supervised Learning for Image Classification”, I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, Dhruv Mahajan
- https://arxiv.org/abs/1811.11721: “CCNet: Criss-Cross Attention for Semantic Segmentation”, Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang
- https://arxiv.org/abs/1808.01340#deepmind: “A Short Note about Kinetics-600”, Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman
- https://arxiv.org/abs/1705.07750#deepmind: “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Joao Carreira, Andrew Zisserman
- https://arxiv.org/abs/1608.03609: “Clockwork Convnets for Video Semantic Segmentation”, Evan Shelhamer, Kate Rakelly, Judy Hoffman, Trevor Darrell