Leveraging Random Label Memorization for Unsupervised Pre-Training

11/05/2018 ∙ by Vinaychandran Pondenkandath, et al.

We present a novel approach to leverage large unlabeled datasets by pre-training state-of-the-art deep neural networks on randomly-labeled datasets. Specifically, we train the neural networks to memorize arbitrary labels for all the samples in a dataset and use these pre-trained networks as a starting point for regular supervised learning. Our assumption is that the "memorization infrastructure" learned by the network during the random-label training proves to be beneficial for the conventional supervised learning as well. We test the effectiveness of our pre-training on several video action recognition datasets (HMDB51, UCF101, Kinetics) by comparing the results of the same network with and without the random label pre-training. Our approach yields an improvement - ranging from 1.5 classification accuracy, which calls for further research in this direction.




1 Introduction

Figure 1: Training accuracy of the same network while sequentially re-shuffling the set of labels, hence forcing the network to start over with the memorization process. The first time, starting from a random weight initialization, it learns very slowly. Later on, subsequent memorization processes are much faster than the initial one, thus indicating that the network is in fact building up a memorization infrastructure of some sort.

The success of many deep learning systems relies on supervised learning with a very large amount of labeled data. However, labeling data is an expensive process in terms of both time and money. Even with the most advanced crowdsourcing techniques, labeling requires a significant amount of effort, and the results are still not guaranteed to be accurate.

In contrast, the mere collection of large amounts of data is fairly trivial as it is easily available online, e.g., several hundred hours of videos are uploaded to YouTube every minute (youtubestats). For this reason, developing improved unsupervised learning methods is of particular interest, as they can leverage large amounts of unlabeled data and extract meaningful information without supervision. However, developing effective methods to do this is not at all trivial.

After the introduction of greedy layer-wise training (hinton2006fast), there have been numerous attempts which use simple (Jain2010) and advanced (dundar2015convolutional) clustering techniques, introduce surrogate classes (Dosovitskiy2014), use Generative Adversarial Networks (GAN) (Radford2015), or use Auto-Encoders (Masci2011; Baldi2012; Bengio2013; Yang2015; Zhao2015).

In particular, using unsupervised techniques as pre-training for a later classification task is a long-known approach (Bengio2007; Erhan2009; Glorot2010). However, despite the evident advantages of using unsupervised pre-training (erhan2010does), common machine learning experience and recent work suggest that training for reconstruction first and for classification later might not be the best idea in all cases (alberti2017). This is due to the inherently different nature of the two tasks (reconstruction and classification), which leads deep neural networks to learn different features for solving them.

In this paper we choose to approach unsupervised learning from a different direction: instead of pre-training for reconstruction, we pre-train the network to memorize randomly-assigned labels for all the samples in a dataset. This allows us to train a classification task on datasets that do not have any labels.

This work is inspired by recent intriguing findings about the capacity of deep neural networks to memorize the training set (Zhang2016) and the possibility of measuring the intrinsic dimension of the objective landscape (Li2018). The former work rigorously shows that deep neural networks are capable of overfitting to a training set even when there is no correlation between the labels and the data, i.e., when the training labels are shuffled. This suggests that some type of features has to be learned by the network to succeed in this task, although they could be arbitrarily specialized to identify some sort of noise or bias in the input images (alberti2018tampering). In the latter, however, the authors observe that there is generalization from one part of the training set to another, even though the labels are just random. Furthermore, they hypothesize that training on some random labels forces the network to set up a kind of base infrastructure for memorization, and that this infrastructure can then be used to make further memorization more efficient (this has been extracted from https://www.youtube.com/watch?v=uSZWeRADTFI). For these reasons we believe that pre-training with random labels might lead the network to learn useful features that can be used as a starting point (we are only pre-training after all) for further supervised training.


In this work we introduce a novel approach for performing unsupervised pre-training in the form of training for classification with random labels. Our preliminary experiments show that it is possible to learn useful representations by leveraging large amounts of unlabeled data, which calls for further research in this direction.

Figure 6: Accuracy curves on the HMDB51 (a. and b.) and UCF-101 (c. and d.) datasets, where C3D networks have been either randomly initialized (blue line) or cross pre-trained on a randomly-labeled variant of the complementary dataset (orange line). For example, in a./b. the orange line refers to a C3D network that was pre-trained on a randomly-labeled variant of the UCF-101 dataset. In these plots, we can see that pre-training on random labels proves to be beneficial, improving accuracy on both HMDB51 and UCF-101.

Figure 9: Accuracy curves on the Kinetics dataset comparing the performance of C3D networks that have been: randomly initialized (blue line), pre-trained on a randomly-labeled subset of the Sports-1M dataset (orange line) or pre-trained on the Sports-1M dataset with the correct labels (green line). The blue line shows the baseline performance of C3D, the green line shows the upper bound of transfer learning from the complete Sports-1M dataset and the orange line shows the improved performance from our unsupervised pre-training method. It is important to note that only a subset of the Sports-1M dataset was used for the random-label pre-training due to computational constraints.

2 Experimental Setting

In this section we explain the experimental setup, i.e., the task performed, the datasets and the model used, as well as the training procedure, so that our experiments can be reproduced.

2.1 Task

We consider the task of identifying actions performed in videos for two reasons. First, it has gained much interest in the computer vision community for its many applications in a variety of domains, such as intelligent video surveillance and shopping behaviour analysis. Second, due to the abundance of smartphones and social media, the number of videos recorded and uploaded has been increasing and most likely will continue to increase, and thus expand the already large video datasets. As the growth rate of videos exceeds the growth rate of other types of data (such as images), tasks on videos are inherently more interesting to tackle.

2.2 Datasets

In this work we used the following datasets:

Kinetics: approximately 300,000 video clips, covering 400 human action classes (kay2017kinetics).
Sports-1M: 1 million YouTube videos belonging to 487 classes (Karpathy2014). We used roughly 40% of this dataset as the full size was prohibitive for us to handle.
UCF101: 13,320 videos from 101 action categories (soomro2012ucf101).
HMDB51: 6,766 video clips extracted from a wide range of sources with 51 distinct action categories (kuehne2011hmdb).

2.3 Model Architecture: C3D

We used the PyTorch implementation (https://github.com/DavideA/c3d-pytorch) of C3D (Tran2015), which was originally implemented in a modified version of BVLC Caffe (Jia2014) that supports 3-dimensional convolutional networks (Ji2013). This architecture looks similar to popular CNN architectures, except that 3D convolutions replace the 2D convolutions.
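To make the difference between 2D and 3D convolutions concrete, the following minimal sketch computes the output size of a single 3D convolution along the temporal and the two spatial axes, using the standard per-axis formula floor((n + 2p - k) / s) + 1. The function name, the clip size (16 frames of 112x112 pixels) and the 3x3x3 kernel are assumptions chosen for illustration and are not taken from the text above.

```python
def conv3d_output_shape(in_shape, kernel, stride=(1, 1, 1), padding=(0, 0, 0)):
    """Return (frames, height, width) after one 3D convolution.

    Applies the standard formula floor((n + 2p - k) / s) + 1 on each axis.
    A 2D convolution is the same computation without the temporal axis.
    """
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(in_shape, kernel, stride, padding)
    )


# Hypothetical input clip: 16 frames of 112x112 pixels (illustrative values).
clip = (16, 112, 112)

# A 3x3x3 kernel with padding 1 keeps all three axes unchanged.
same = conv3d_output_shape(clip, kernel=(3, 3, 3), padding=(1, 1, 1))

# Without padding, every axis (including time) shrinks by kernel size - 1.
valid = conv3d_output_shape(clip, kernel=(3, 3, 3))

print(same)
print(valid)
```

The point of the sketch is that the kernel also slides along the frame axis, so the network can combine information across neighbouring frames instead of treating each frame independently.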

2.4 Training Procedure

We test the hypothesis that pre-training on a randomly-labeled dataset will improve learning performance on the HMDB51 and UCF-101 datasets. In order to test the hypothesis, we initially pre-train the network on a randomly-labeled version of one of the datasets, and then use this pre-trained model to train on the correctly labeled version of the other dataset. The experimental procedure is as follows:

  1. Pick a dataset from Sports-1M, UCF-101 or HMDB51.

  2. Relabel all training instances of the selected dataset with randomly chosen labels.

  3. Train a 3D-CNN on this randomly-labeled dataset for a fixed number of epochs, using an initial learning rate, momentum, and a learning rate decay governed by a patience (measured in epochs).

  4. Fine-tune this pre-trained model on the other dataset with the correct labels.

This procedure is repeated for all the datasets in turn. Specifically, HMDB51 and UCF-101 are cross pre-trained on each other, while Kinetics is pre-trained on the much larger Sports-1M dataset.
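The relabeling step and the pairing of datasets described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the training code used for the experiments; the function names, the uniform sampling of labels, and the number of random classes are assumptions, since the text does not specify how the random labels are drawn.

```python
import random


def assign_random_labels(num_samples, num_classes, seed=0):
    """Draw one label per sample uniformly at random.

    The true classes are never consulted, so no annotation is required.
    Uniform sampling and the value of num_classes are assumptions.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_classes) for _ in range(num_samples)]


def cross_pretraining_plan(pairs):
    """Expand (pre-training dataset, target dataset) pairs into ordered stages."""
    plan = []
    for pretrain_name, target_name in pairs:
        plan.append((pretrain_name, "random labels"))
        plan.append((target_name, "correct labels"))
    return plan


# Toy example: 10 samples, 51 hypothetical random classes.
labels = assign_random_labels(num_samples=10, num_classes=51, seed=0)

# Dataset pairings as described in the text.
plan = cross_pretraining_plan([
    ("UCF-101", "HMDB51"),
    ("HMDB51", "UCF-101"),
    ("Sports-1M", "Kinetics"),
])

for dataset, labeling in plan:
    print(f"train on {dataset} with {labeling}")
```

Fixing the seed keeps the random assignment stable across epochs, which matters because the network is asked to memorize one fixed sample-to-label mapping.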

3 Results

To evaluate the effectiveness of the proposed pre-training method, we initially compare the performance of the C3D network (with and without pre-training) on the UCF-101 and HMDB51 datasets. However, as both these datasets are fairly small (in relation to state-of-the-art datasets, which are several orders of magnitude bigger), we also consider pre-training the network on Sports-1M and evaluating it on the Kinetics dataset. This allows us to measure the impact of scaling up the quantity of the data used for pre-training. As all the datasets are used for video action recognition, the task is formulated as a classification problem, and as such evaluated using the accuracy metric.

3.1 HMDB51 and UCF101

Following the experimental procedure from Section 2.4, we evaluate the effectiveness of cross pre-training on the HMDB51 and UCF-101 datasets. In Table 1 and Figure 6, we can see that the cross pre-training results in a performance gain for both HMDB51 and UCF-101.

Method | UCF-101 | HMDB51
No pre-training | |
With pre-training | 34.6% | 18.2%
Table 1: Comparison of validation accuracy of C3D, with and without pre-training, on UCF-101 and HMDB51. In this scenario pre-training has been performed on a randomly-labeled version of UCF-101 for HMDB51 and vice-versa for UCF-101.

3.2 Kinetics

As seen in Table 2 and Figure 9, the C3D network without pre-training scores 35.0% and the network pre-trained on the correctly-labeled version of Sports-1M scores 69.8%. These allow us to establish a baseline on the Kinetics dataset and an upper bound for what is possible when using transfer learning from the Sports-1M dataset. The network that has been pre-trained on the randomly-labeled subset of the Sports-1M dataset (only roughly 40% of the data is used due to computational constraints, see Section 2.2) scores 40.2%, which corresponds to a relative reduction of the error rate of about 8% (from 65.0% to 59.8%).

Pre-training | Accuracy
None | 35.0%
Randomly-labeled Sports-1M | 40.2%
Correctly-labeled Sports-1M | 69.8%
Table 2: Comparison of validation accuracies of the C3D network with: no pre-training, pre-training on the randomly-labeled variant of Sports-1M and pre-training on the correctly-labeled Sports-1M.
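The relative reduction of the error rate can be recomputed directly from the accuracies in Table 2. The short sketch below does this arithmetic; the function name is hypothetical, and the only inputs are the two accuracies reported in the table.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error that is removed (accuracies in percent)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Accuracies taken from Table 2 (Kinetics validation set).
no_pretraining = 35.0
random_label_pretraining = 40.2

absolute_gain = random_label_pretraining - no_pretraining
relative_gain = relative_error_reduction(no_pretraining, random_label_pretraining)

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative error reduction: {relative_gain:.1%}")
```

Reporting both numbers is useful here: the absolute gain describes the change in accuracy, while the relative reduction expresses how much of the remaining error of the baseline is removed.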

3.3 Memorization Infrastructure

As previously mentioned (see Section 1), previous work suggests that training on random labels forces the network to set up infrastructure for memorization, and that this infrastructure can then be used to make further memorization more efficient (Li2018). We verified this hypothesis by training a network for memorization and, after a fixed number of epochs, re-shuffling the labels, hence forcing the network to start over with the process. As shown in Figure 1, subsequent memorization processes are much faster than the initial one, thus indicating that the network is in fact building up a memorization infrastructure of some sort.
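One way to quantify "faster memorization" in such a re-shuffling experiment is to count how many epochs each phase needs to reach a given training accuracy. The sketch below shows the two bookkeeping pieces this requires: re-shuffling the labels between phases and reading the epoch count off an accuracy curve. The function names and the threshold are assumptions, and the curve used in the example consists of toy values, not measurements from the experiments.

```python
import random


def reshuffle_labels(labels, seed):
    """Return a shuffled copy of the labels.

    The label distribution is preserved, but the sample-to-label mapping that
    the network has memorized so far is destroyed.
    """
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def epochs_to_reach(accuracy_curve, threshold):
    """1-based index of the first epoch whose accuracy meets the threshold.

    Returns None if the threshold is never reached within the phase.
    """
    for epoch, accuracy in enumerate(accuracy_curve, start=1):
        if accuracy >= threshold:
            return epoch
    return None


# Toy values for illustration only; NOT results from the paper.
toy_labels = [0, 0, 1, 1, 2, 2]
toy_curve = [0.10, 0.40, 0.85, 0.97]

new_labels = reshuffle_labels(toy_labels, seed=1)
needed = epochs_to_reach(toy_curve, threshold=0.95)

print(new_labels)
print(needed)
```

Comparing the value returned by `epochs_to_reach` for the first phase against the values for the later phases gives a single number per phase that summarizes the trend visible in Figure 1.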

4 Conclusion

In this paper we introduced a novel approach for performing unsupervised pre-training in the form of training for classification with random labels. Our preliminary experiments suggest that it is possible to learn useful representations by leveraging a large amount of unlabeled data in this way, although the improvement in accuracy is limited to a few percentage points. We believe that further research in this direction could provide a larger margin of improvement and allow us to gain a better understanding of deep neural networks.

4.1 Future Work

We plan to further investigate the dynamics of memorization by determining which stages of the network (early layers close to the input or later ones close to the final features) are most responsible for it. We speculate that, even though regularization hinders the memorization capability of the network, it might help the network learn more useful features for a later classification task. Finally, we want to inspect what the network is looking at in the input to succeed in memorizing every training sample (for example with global average pooling layers (zhou2016learning)).


The work presented in this paper has been partially supported by the HisDoc III project funded by the Swiss National Science Foundation with the grant number _.