Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?

04/10/2020
by Hirokatsu Kataoka, et al.

How can we collect and use a video dataset to further improve spatiotemporal 3D Convolutional Neural Networks (3D CNNs)? To answer this open question in video recognition positively, we conducted an exploration study using several large-scale video datasets and 3D CNNs. In the early era of deep neural networks, 2D CNNs were better than 3D CNNs for video recognition. Recent studies revealed that 3D CNNs trained on a large-scale video dataset can outperform 2D CNNs. However, we still rely heavily on architecture exploration rather than dataset consideration. Therefore, in the present paper, we conduct an exploration study to improve spatiotemporal 3D CNNs, with the following findings: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy. We reveal that a carefully annotated dataset (e.g., Kinetics-700) effectively pre-trains a video representation for a video classification task. (ii) We confirm the relationships between #category/#instance and video classification accuracy. The results show that, when constructing a video dataset, #category should be fixed first and #instance should then be increased. (iii) To extend a video dataset in practice, we simply concatenate publicly available datasets, such as the Kinetics-700 and Moments in Time (MiT) datasets. Compared with Kinetics-700 pre-training, the merged dataset further enhances spatiotemporal 3D CNNs, e.g., by +0.9, +3.4, and +1.1 on the UCF-101, HMDB-51, and ActivityNet datasets, respectively, after fine-tuning. (iv) In terms of recognition architecture, models pre-trained on Kinetics-700 and on the merged dataset improve recognition performance as depth increases up to 200 layers with the Residual Network (ResNet), whereas the Kinetics-400 pre-trained model cannot successfully optimize the 200-layer architecture.
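The dataset concatenation in point (iii) can be pictured with a small sketch. The abstract does not say how the label spaces are combined, so the code below assumes the simplest option: the class lists are treated as disjoint and the class indices of each later dataset are shifted by the number of classes already merged. The function name `merge_datasets`, the tuple layout, and the toy class names and video identifiers are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Sample = Tuple[str, int]                      # (video identifier, class index)
Dataset = Tuple[str, List[str], List[Sample]]  # (name, class names, samples)


def merge_datasets(datasets: List[Dataset]) -> Tuple[List[str], List[Sample]]:
    """Concatenate datasets, assuming their label spaces are disjoint.

    Class indices of each dataset are shifted by the number of classes
    merged so far, so every original class keeps its own output unit.
    """
    merged_classes: List[str] = []
    merged_samples: List[Sample] = []
    for name, class_names, samples in datasets:
        offset = len(merged_classes)
        # Prefix class names with the dataset name to avoid collisions.
        merged_classes.extend(f"{name}/{c}" for c in class_names)
        merged_samples.extend((vid, label + offset) for vid, label in samples)
    return merged_classes, merged_samples


# Toy stand-ins for two video datasets (hypothetical data).
kinetics = ("kinetics", ["abseiling", "zumba"], [("k_0001", 0), ("k_0002", 1)])
mit = ("mit", ["running", "cooking", "opening"], [("m_0001", 2), ("m_0002", 0)])

classes, samples = merge_datasets([kinetics, mit])
print(len(classes))
print(samples)
```

Under this assumption, the pre-training classifier would have one output per entry in `classes`. Whether semantically overlapping categories across datasets should instead be merged is a separate design choice that this sketch does not address.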


