The success of many deep learning systems relies on supervised learning with a very large amount of labeled data. However, labeling data is expensive in terms of both time and money. Even with the most advanced crowdsourcing techniques, it requires a significant amount of effort, and the results are still not guaranteed to be accurate.
In contrast, merely collecting large amounts of data is fairly trivial, as it is readily available online; e.g., several hundred hours of video are uploaded to YouTube every minute (youtubestats). For this reason, developing improved unsupervised learning methods is of particular interest, as they can leverage large amounts of unlabeled data and extract meaningful information without supervision. However, developing effective methods to do this is not at all trivial.
Since the introduction of greedy layer-wise training (hinton2006fast), there have been numerous attempts that use simple (Jain2010) and advanced (dundar2015convolutional) clustering techniques, introduce surrogate classes (Dosovitskiy2014), use Generative Adversarial Networks (GANs) (Radford2015), or use auto-encoders (Masci2011; Baldi2012; Bengio2013; Yang2015; Zhao2015).
In particular, using unsupervised techniques as pre-training for a later classification task is a long-known approach (Bengio2007; Erhan2009; Glorot2010). However, despite the evident advantages of unsupervised pre-training (erhan2010does), common machine learning experience and recent work suggest that training for reconstruction first and for classification later might not be the best idea in all cases (alberti2017). This is due to the inherently different nature of the two tasks (reconstruction and classification), which leads deep neural networks to learn different features for solving them.
In this paper we choose to approach unsupervised learning from a different direction: instead of pre-training for reconstruction, we pre-train the network to memorize randomly-assigned labels for all the samples in a dataset. This allows us to train a classification task on datasets that do not have any labels.
This work is inspired by recent intriguing findings about the capacity of deep neural networks to memorize the training set (Zhang2016) and the possibility of measuring the intrinsic dimension of the objective landscape (Li2018). The former rigorously shows that deep neural networks are capable of overfitting to a training set even when there is no correlation between the labels and the data, i.e., when the training labels are shuffled. This suggests that some type of features has to be learned by the network to succeed in this task, although these features could be arbitrarily specialized to identify some sort of noise or bias in the input images (alberti2018tampering). In the latter, however, the authors observe that there is generalization from one part of the training set to another, even though the labels are random. Furthermore, they hypothesize that training on random labels forces the network to set up a kind of base infrastructure for memorization, and that this infrastructure can then be used to make further memorization more efficient (taken from the video at https://www.youtube.com/watch?v=uSZWeRADTFI). For these reasons we believe that pre-training with random labels might lead the network to learn useful features that can be used as a starting point (we are only pre-training, after all) for further supervised training.
In this work we introduce a novel approach for performing unsupervised pre-training in the form of training for classification with random labels. Our preliminary experiments show that it is possible to learn useful representations by leveraging large amounts of unlabeled data, which calls for further research in this direction.
Figure: The blue line shows the baseline performance of C3D, the green line shows the upper bound of transfer learning from the complete Sports-1M dataset, and the orange line shows the improved performance from our unsupervised pre-training method. Note that only a subset of the Sports-1M dataset was used for the random-label pre-training due to computational constraints.
2 Experimental Setting
In this section we explain the experimental setup, i.e., the task performed, the datasets and the model used, as well as the training procedure, so that our experiments can be reproduced.
We consider the task of identifying actions performed in videos for two reasons. First, it has gained much interest in the computer vision community for its many applications in a variety of domains, such as intelligent video surveillance and shopping behaviour analysis. Second, due to the abundance of smartphones and social media, the number of videos recorded and uploaded has been increasing and will most likely continue to increase, thus expanding the already large video datasets. As the growth rate of videos exceeds the growth rate of other types of data (such as images), tasks on videos are inherently more interesting to tackle.
In this work we used the following datasets:
Kinetics: approximately 300,000 video clips, covering 400 human action classes (kay2017kinetics).
Sports-1M: 1 million YouTube videos belonging to 487 classes (Karpathy2014). We used roughly 40% of this dataset as the full size was prohibitive for us to handle.
UCF101: 13,320 videos from 101 action categories (soomro2012ucf101).
HMDB51: 6,766 video clips extracted from a wide range of sources with 51 distinct action categories (kuehne2011hmdb).
2.3 Model Architecture: C3D
We used the PyTorch implementation (https://github.com/DavideA/c3d-pytorch) of C3D (Tran2015), which was originally implemented in a modified version of BVLC Caffe (Jia2014) that supports 3-dimensional convolutional networks (Ji2013). This architecture resembles popular CNN architectures, except that 3D convolutions replace the 2D convolutions.
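To illustrate what replacing 2D convolutions with 3D ones means for tensor shapes, the following minimal sketch computes the output size of a convolution along each dimension using the standard formula floor((n + 2p - k) / s) + 1. The helper `conv_output_shape`, the clip size (16 frames of 112x112 pixels), and the 3x3x3 kernel with stride 1 and padding 1 are assumptions chosen for illustration; they are not stated in this text.

```python
def conv_output_shape(input_shape, kernel, stride, padding):
    """Output size of a convolution along each convolved dimension.

    Works for 2D (height, width) and 3D (frames, height, width) inputs;
    channel and batch dimensions are deliberately left out.
    """
    if not (len(input_shape) == len(kernel) == len(stride) == len(padding)):
        raise ValueError("all arguments must have the same number of dimensions")
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(input_shape, kernel, stride, padding)
    )


# A 2D convolution slides over a single frame (height, width) ...
frame_shape = conv_output_shape((112, 112), (3, 3), (1, 1), (1, 1))
# ... while a 3D convolution also slides along the temporal axis of a clip.
clip_shape = conv_output_shape((16, 112, 112), (3, 3, 3), (1, 1, 1), (1, 1, 1))

print(frame_shape)  # (112, 112)
print(clip_shape)   # (16, 112, 112)
```

With these assumed settings the temporal dimension is preserved by the convolution, so the network can aggregate motion information across neighbouring frames rather than treating each frame independently.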
2.4 Training Procedure
We test the hypothesis that pre-training on a randomly-labeled dataset will improve learning performance on the HMDB51 and UCF-101 datasets. In order to test the hypothesis, we initially pre-train the network on a randomly-labeled version of one of the datasets, and then use this pre-trained model to train on the correctly labeled version of the other dataset. The experimental procedure is as follows:
Pick a dataset from Sports-1M, UCF-101 or HMDB51.
Relabel all training instances of the selected dataset with randomly chosen labels.
Train a 3D-CNN on this dataset for a set number of epochs with a given initial learning rate, momentum, and learning rate decay governed by a patience measured in epochs.
Fine-tune this pre-trained model on the other dataset with the correct labels.
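The relabeling step of the procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `assign_random_labels`, the clip identifiers, the number of random classes, and the use of a uniform distribution are all assumptions, since the text does not specify how the random labels are drawn.

```python
import random


def assign_random_labels(sample_ids, num_classes, seed=0):
    """Map every sample id to a label drawn uniformly at random.

    The true labels (if any) are never consulted, so the resulting
    classification task carries no semantic supervision.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    return {sample_id: rng.randrange(num_classes) for sample_id in sample_ids}


# Hypothetical clip identifiers; a real dataset would list its video files here.
clip_ids = [f"clip_{i:04d}" for i in range(1000)]
# The number of random classes (51 here) is an arbitrary illustrative choice.
random_labels = assign_random_labels(clip_ids, num_classes=51)
```

The labels must stay fixed across epochs during pre-training, which is why the sketch stores them in a dictionary keyed by sample id instead of drawing a new label each time a sample is loaded.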
This procedure is repeated for all the datasets in turn. However, HMDB51 and UCF-101 are cross pre-trained on each other, while the network evaluated on Kinetics is pre-trained on the much larger Sports-1M dataset.
To evaluate the effectiveness of the proposed pre-training method, we initially compare the performance of the C3D network, with and without pre-training, on the UCF-101 and HMDB51 datasets. However, as both these datasets are fairly small (relative to state-of-the-art datasets, which are several orders of magnitude bigger), we also consider pre-training the network on Sports-1M and evaluating it on the Kinetics dataset. This allows us to measure the impact of scaling up the quantity of data used for pre-training. As all the datasets are used for video action recognition, the task is formulated as a classification problem and is therefore evaluated using the accuracy metric.
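For reference, the accuracy metric is the fraction of samples whose predicted class matches the ground-truth class. A minimal sketch, with a hypothetical helper name and toy inputs:

```python
def accuracy(predictions, targets):
    """Fraction of predictions that equal the corresponding target label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy example: three of the four predicted classes are correct.
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.75
```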
3.1 HMDB51 and UCF101
Following the experimental procedure from Section 2.4, we evaluate the effectiveness of cross pre-training on the HMDB51 and UCF-101 datasets. In Table 1 and Figure 6, we can see that cross pre-training results in a performance gain for both HMDB51 and UCF-101.
As seen in Table 2 and Figure 9, we evaluate both the C3D network without pre-training and the network pre-trained on the correctly-labeled version of Sports-1M. These establish a baseline on the Kinetics dataset and an upper bound for what is possible when using transfer learning from the Sports-1M dataset. The network pre-trained on the randomly-labeled subset of Sports-1M (only part of the data is used due to computational constraints) improves on the baseline, which corresponds to a relative reduction of the error rate.
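The relative reduction of the error rate mentioned above is conventionally computed as (e_base - e_new) / e_base, where e = 1 - accuracy. The sketch below shows this computation; the function name and the accuracy values are purely hypothetical and are not the results reported in the paper.

```python
def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Relative reduction of the error rate when moving from baseline to improved."""
    baseline_error = 1.0 - baseline_accuracy
    improved_error = 1.0 - improved_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - improved_error) / baseline_error


# Hypothetical accuracies: the error drops from 0.50 to 0.45,
# which is a relative error reduction of about 0.10 (10%).
reduction = relative_error_reduction(0.50, 0.55)
print(f"{reduction:.2%}")
```

Reporting the relative error reduction alongside the absolute accuracy gain makes results comparable across datasets with very different baseline accuracies.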
3.3 Memorization Infrastructure
As previously mentioned (see Section 1), previous work suggests that training on random labels forces the network to set up infrastructure for memorization, and that this infrastructure can then be used to make further memorization more efficient (Li2018). We tested this hypothesis by training a network for memorization and, after a fixed number of epochs, re-shuffling the labels, hence forcing the network to start the process over. As shown in Figure 1, subsequent memorization processes are much faster than the initial one, indicating that the network is in fact building up a memorization infrastructure of some sort.
In this paper we introduced a novel approach for performing unsupervised pre-training in the form of training for classification with random labels. Our preliminary experiments suggest that it is possible to learn useful representations by leveraging a large amount of unlabeled data in this way, although the improvements in accuracy are limited. We believe that further research in this direction could provide a larger margin of improvement and allow us to gain a better understanding of deep neural networks.
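The label schedule of this re-shuffling experiment can be sketched as below. Only the schedule is shown; the actual training step is omitted, and the generator name `reshuffled_labels`, the reshuffling interval, and the dataset size are assumptions made for illustration.

```python
import random


def reshuffled_labels(initial_labels, num_epochs, reshuffle_every, seed=0):
    """Yield (epoch, labels) pairs, permuting the labels every `reshuffle_every` epochs.

    Permuting (rather than redrawing) keeps the class histogram unchanged,
    so only the sample-to-label assignment differs between phases.
    """
    if reshuffle_every <= 0:
        raise ValueError("reshuffle_every must be positive")
    rng = random.Random(seed)
    labels = list(initial_labels)
    for epoch in range(num_epochs):
        if epoch > 0 and epoch % reshuffle_every == 0:
            labels = labels[:]  # new list, so earlier phases stay untouched
            rng.shuffle(labels)
        yield epoch, labels


# Hypothetical setup: 100 samples, 10 random classes, reshuffle every 3 epochs.
label_rng = random.Random(42)
initial = [label_rng.randrange(10) for _ in range(100)]
for epoch, labels in reshuffled_labels(initial, num_epochs=6, reshuffle_every=3):
    # A real experiment would train for one epoch on `labels` here and record
    # how quickly the training accuracy recovers after each reshuffle.
    pass
```

Comparing the number of epochs needed to reach a given training accuracy in each phase is one way to quantify whether later memorization phases are indeed faster than the first.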
4.1 Future Work
We plan to further investigate the dynamics of memorization by determining which stages of the network (early layers close to the input or later ones close to the final features) are most responsible for it. We speculate that, even though regularization hinders the memorization capability of the network, it might help the network learn more useful features for a later classification task. Finally, we want to inspect what the network is looking at in the input to succeed in memorizing every training sample (for example with global average pooling layers (zhou2016learning)).
The work presented in this paper has been partially supported by the HisDoc III project funded by the Swiss National Science Foundation with the grant number _.