"Accuracy and Performance Comparison of Video Action Recognition Approaches", 2020-08-20:
Over the past few years, there has been substantial interest in video action recognition systems and models. However, direct comparison of accuracy and computational performance results remains clouded by differing training environments, hardware specifications, hyperparameters, pipelines, and inference methods. This article provides a direct comparison between fourteen off-the-shelf and state-of-the-art models by ensuring consistency in these training characteristics, in order to provide readers with a meaningful comparison across different types of video action recognition algorithms. Accuracy of the models is evaluated using standard Top-1 and Top-5 accuracy metrics, in addition to a proposed new accuracy metric. Additionally, we compare the computational performance of distributed training from two to sixty-four GPUs on a state-of-the-art HPC system.
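The Top-1 and Top-5 metrics mentioned in the abstract are standard: a prediction counts as correct if the true label is the highest-scoring class (Top-1) or among the five highest-scoring classes (Top-5). A minimal sketch of the general Top-k computation, with made-up scores for illustration:

```python
from typing import Sequence

def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores for this sample
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Hypothetical scores for two samples over four classes
scores = [[0.1, 0.6, 0.2, 0.1],   # true label 1: Top-1 hit
          [0.5, 0.1, 0.3, 0.1]]   # true label 2: Top-1 miss, Top-2 hit
labels = [1, 2]
print(top_k_accuracy(scores, labels, 1))  # 0.5
print(top_k_accuracy(scores, labels, 2))  # 1.0
```

Top-5 is simply `k=5` over however many classes the dataset defines.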
[Keywords: action recognition, neural network, deep learning, accuracy metrics, computational performance]
Which is the best system for video action recognition? Simple 2D convnets, says survey:
…Richard Sutton's "bitter lesson" strikes again…
Researchers with MIT have analyzed the performance of fourteen different models used for video action recognition (correctly labeling an action in a video, a generically useful AI capability). The results show that simple techniques tend to beat complex ones. Specifically, the researchers benchmark a range of 2D convolutional networks (C2Ds) against temporal segment networks (TSNs), Long-term Recurrent Convolutional Networks (LRCNs), and Temporal Shift Modules (TSMs). They find that the simple stuff, plain 2D convnets, performs best.
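The usual way a 2D convnet is applied to video is to classify frames independently and aggregate the per-frame scores over time (for instance, by averaging), rather than modeling time explicitly as 3D or recurrent architectures do. A minimal sketch of that aggregation step, using hypothetical logits:

```python
import numpy as np

def video_prediction_from_frames(frame_logits: np.ndarray) -> int:
    """Collapse per-frame class scores (shape: frames x classes) into a single
    video-level label by averaging over time, then taking the argmax."""
    return int(frame_logits.mean(axis=0).argmax())

# Hypothetical logits for 3 frames over 4 action classes
logits = np.array([[0.2, 0.7, 0.1, 0.0],
                   [0.1, 0.8, 0.1, 0.0],
                   [0.6, 0.3, 0.1, 0.0]])
print(video_prediction_from_frames(logits))  # 1
```

The temporal pooling choice (mean, max, or learned weighting) is one of the few video-specific decisions such a 2D pipeline has to make.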
The bitter lesson results: Convolutional net models "significantly outperform" the other models they test. Specifically, Inception-ResNet-v2, ResNet50, DenseNet201, and MobileNetv2 are all top performers. These results also highlight some of the ideas in Sutton's "bitter lesson" essay, namely that simpler things that scale better tend to beat the smart stuff. "2D approaches can yield results comparable to their more complex 3D counterparts, and model depth, rather than input feature scale, is the critical component to an architecture's ability to extract a video's semantic action information", they write.