“Accuracy and Performance Comparison of Video Action Recognition Approaches”, Matthew Hutchinson, Siddharth Samsi, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Michael Houle, Matthew Hubbell, Michael Jones, Jeremy Kepner, Andrew Kirby, Peter Michaleas, Lauren Milechin, Julie Mullen, Andrew Prout, Antonio Rosa, Albert Reuther, Charles Yee, Vijay Gadepally (2020-08-20):

Over the past few years, there has been substantial interest in video action recognition systems and models. However, direct comparison of accuracy and computational performance results remains clouded by differing training environments, hardware specifications, hyperparameters, pipelines, and inference methods. This article provides a direct comparison between fourteen off-the-shelf and state-of-the-art models by ensuring consistency in these training characteristics, in order to provide readers with a meaningful comparison across different types of video action recognition algorithms. Accuracy of the models is evaluated using standard Top-1 and Top-5 accuracy metrics, in addition to a proposed new accuracy metric. Additionally, we compare the computational performance of distributed training on two to sixty-four GPUs on a state-of-the-art HPC system.

[Keywords: action recognition, neural network, deep learning, accuracy metrics, computational performance]

[Jack Clark’s summary:

Which is the best system for video action recognition? Simple 2D convnets, says survey:

…Richard Sutton’s ‘bitter lesson’ strikes again…

Researchers with MIT have analyzed the performance of fourteen different models used for video action recognition—correctly labeling the action in a video, a generically useful AI capability. The results show that simple techniques tend to beat complex ones. Specifically, the researchers benchmark a range of 2D convolutional networks (C2Ds) against temporal segment networks (TSNs), Long-term Recurrent Convolutional Networks (LRCNs), and Temporal Shift Modules (TSMs). They find the simple stuff—2D convnets—performs best.

The bitter lesson results: Convolutional net models “significantly outperform” the other models they test. Specifically, the Inception-ResNet-v2, ResNet50, DenseNet201, and MobileNetv2 are all top performers. These results also highlight some of the ideas in Sutton’s ‘bitter lesson’ essay—namely that simpler things that scale better tend to beat the smart stuff. “2D approaches can yield results comparable to their more complex 3D counterparts, and model depth, rather than input feature scale, is the critical component to an architecture’s ability to extract a video’s semantic action information”, they write.]
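To make the “simple 2D convnet” idea concrete: the general recipe such approaches follow is to run an ordinary 2D image classifier on each frame independently and then average the per-frame class scores into one video-level prediction (late fusion). The sketch below is a minimal numpy illustration of that recipe under stated assumptions—`toy_classifier` is a hypothetical linear stand-in for a real backbone like ResNet50, not the paper’s actual pipeline:

```python
import numpy as np

def classify_video_2d(frames, frame_classifier):
    """Per-frame 2D approach with late fusion: apply a 2D image
    classifier to each frame, then average the per-frame class
    scores into a single video-level score vector."""
    per_frame_scores = np.stack([frame_classifier(f) for f in frames])  # (T, C)
    return per_frame_scores.mean(axis=0)  # (C,)

# Hypothetical stand-in for a 2D convnet head (e.g. ResNet50):
# a fixed random linear projection from flattened pixels to 5 toy classes.
rng = np.random.default_rng(0)
W = rng.standard_normal((32 * 32 * 3, 5))

def toy_classifier(frame):
    return frame.reshape(-1) @ W  # logits over 5 toy classes

video = rng.standard_normal((16, 32, 32, 3))  # 16 frames of 32x32 RGB
video_scores = classify_video_2d(video, toy_classifier)
pred = int(np.argmax(video_scores))
```

The variants the paper compares mostly differ in what replaces this simple averaging step: TSNs sample segments and fuse segment-level predictions, LRCNs feed per-frame features to a recurrent net, and TSMs shift channels across time inside the 2D backbone.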