Noise-as-targets representation learning for cifar10. Implementation based on the paper "Unsupervised Learning by Predicting Noise" by Bojanowski and Joulin.
Convolutional neural networks provide visual features that perform remarkably well in many computer vision applications. However, training these networks requires significant amounts of supervision. This paper introduces a generic framework to train deep networks, end-to-end, with no supervision. We propose to fix a set of target representations, called Noise As Targets (NAT), and to constrain the deep features to align to them. This domain agnostic approach avoids the standard unsupervised learning issues of trivial solutions and collapsing of features. Thanks to a stochastic batch reassignment strategy and a separable square loss function, it scales to millions of images. The proposed approach produces representations that perform on par with state-of-the-art unsupervised methods on ImageNet and Pascal VOC.
In recent years, convolutional neural networks, or convnets (Fukushima, 1980; LeCun et al., 1989) have pushed the limits of computer vision (Krizhevsky et al., 2012; He et al., 2016), leading to important progress in a variety of tasks, like object detection (Girshick, 2015) or image segmentation (Pinheiro et al., 2015). Key to this success is their ability to produce features that easily transfer to new domains when trained on massive databases of labeled images (Razavian et al., 2014; Oquab et al., 2014) or weakly-supervised data (Joulin et al., 2016). However, human annotations may introduce unforeseen bias that could limit the potential of learned features to capture subtle information hidden in a vast collection of images.
Several strategies exist to learn deep convolutional features with no annotation (Donahue et al., 2016). They either try to capture a signal from the source as a form of self-supervision (Doersch et al., 2015; Wang & Gupta, 2015) or learn the underlying distribution of images (Vincent et al., 2010; Goodfellow et al., 2014). While some of these approaches obtain promising performance in transfer learning (Donahue et al., 2016; Wang & Gupta, 2015), they do not explicitly aim to learn discriminative features. Some attempts were made with retrieval based approaches (Dosovitskiy et al., 2014) and clustering (Yang et al., 2016; Liao et al., 2016), but they are hard to scale and have only been tested on small datasets. Unfortunately, as in the supervised case, a lot of data is required to learn good representations.
In this work, we propose a novel discriminative framework designed to learn deep architectures on massive amounts of data. Our approach is general, but we focus on convnets since they require millions of images to produce good features. Similar to self-organizing maps (Kohonen, 1982; Martinetz & Schulten, 1991), we map deep features to a set of predefined representations in a low dimensional space. As opposed to these approaches, we aim to learn the features in an end-to-end fashion, which traditionally suffers from a feature collapsing problem. Our approach deals with this issue by fixing the target representations and aligning them to our features. These representations are sampled from an uninformative distribution, and we refer to them as Noise As Targets (NAT). Our approach also shares some similarities with standard clustering approaches like k-means (Lloyd, 1982) or discriminative clustering (Bach & Harchaoui, 2007).
In addition, we propose an online algorithm able to scale to massive image databases like ImageNet (Deng et al., 2009). Importantly, our approach is barely less efficient to train than standard supervised approaches and can re-use any optimization procedure designed for them. This is achieved by using a quadratic loss as in (Tygert et al., 2017) and a fast approximation of the Hungarian algorithm. We show the potential of our approach by training end-to-end on ImageNet a standard architecture, namely AlexNet (Krizhevsky et al., 2012) with no supervision.
We test the quality of our features on several image classification problems, following the setting of Donahue et al. (2016). We are on par with state-of-the-art unsupervised and self-supervised learning approaches while being much simpler to train and to scale.
Several approaches have been recently proposed to tackle the problem of deep unsupervised learning (Coates & Ng, 2012; Mairal et al., 2014; Dosovitskiy et al., 2014). Some of them are based on a clustering loss (Xie et al., 2016; Yang et al., 2016; Liao et al., 2016), but they are not tested at a scale comparable to that of supervised convnet training. Coates & Ng (2012) use k-means to pre-train convnets, by learning each layer sequentially in a bottom-up fashion. In our work, we train the convnet end-to-end with a loss that shares similarities with k-means. Closer to our work, Dosovitskiy et al. (2014) propose to train convnets by solving a retrieval problem. They assign a class per image and its transformation. In contrast to our work, this approach can hardly scale to more than a few hundred thousand images, and requires a custom-tailored architecture while we use a standard AlexNet.
Another traditional approach for learning visual representations in an unsupervised manner is to define a parametrized mapping between a predefined random variable and a set of images. Traditional examples of this approach are variational autoencoders (Kingma & Welling, 2013), generative adversarial networks (Goodfellow et al., 2014), and to a lesser extent, noisy autoencoders (Vincent et al., 2010). In our work, we do the opposite; that is, we map images to a predefined random variable. This allows us to re-use standard convolutional networks and greatly simplifies the training.
Among those approaches, generative adversarial networks (GANs) (Goodfellow et al., 2014; Denton et al., 2015; Donahue et al., 2016) share another similarity with our approach, namely they are explicitly minimizing a discriminative loss to learn their features. While these models cannot learn an inverse mapping, Donahue et al. (2016) recently proposed to add an encoder to extract visual features from GANs. Like ours, their encoder can be any standard convolutional network. However, their loss aims at differentiating real and generated images, while we are aiming directly at differentiating between images. This makes our approach much simpler and faster to train, since we do not need to learn the generator nor the discriminator.
Recently, a lot of work has explored self-supervision: leveraging supervision contained in the input signal (Doersch et al., 2015; Noroozi & Favaro, 2016; Pathak et al., 2016). In the same vein as word2vec (Mikolov et al., 2013), Doersch et al. (2015) show that spatial context is a strong signal to learn visual features. Noroozi & Favaro (2016) have further extended this work. Others have shown that temporal coherence in videos also provides a signal that can be used to learn powerful visual features (Agrawal et al., 2015; Jayaraman & Grauman, 2015; Wang & Gupta, 2015). In particular, Wang & Gupta (2015) show that such features provide promising performance on ImageNet. In contrast to our work, these approaches are domain dependent since they require explicit derivation of weak supervision directly from the input.
Many have also used autoencoders with a reconstruction loss (Bengio et al., 2007; Ranzato et al., 2007; Masci et al., 2011). The idea is to encode and decode an image, while minimizing the loss between the decoded and original images. Once trained, the encoder produces image features and the decoder can be used to generate images from codes. The decoder is often a fully connected network (Ranzato et al., 2007) or a deconvolutional network (Masci et al., 2011; Zhao et al., 2016) but can be more sophisticated, like a PixelCNN network (van den Oord et al., 2016).
Self-organizing maps are a family of unsupervised methods that aim at learning a low dimensional representation of the data that preserves certain topological properties (Kohonen, 1982; Vesanto & Alhoniemi, 2000). In particular, Neural Gas (Martinetz & Schulten, 1991) aligns feature vectors to the input data. Each input datum is then assigned to one of these vectors in a winner-takes-all manner. These feature vectors are in spirit similar to our target representations and we use a similar assignment strategy. In contrast to our work, the target vectors are not fixed and are aligned to the input vectors. Since we primarily aim at learning the input features, we do the opposite.
Bach & Harchaoui (2007) show that the ridge regression loss can be used to learn discriminative clusters. It has been successfully applied to several computer vision applications, like object discovery (Joulin et al., 2010; Tang et al., 2014) or video/text alignment (Bojanowski et al., 2013, 2014; Ramanathan et al., 2014). In this work, we show that a similar framework can be designed for neural networks. As opposed to Xu et al. (2004), we address the empty assignment problem by restricting the set of possible reassignments to permutations rather than using global linear constraints on the assignments. Our assignments can be updated online, allowing our approach to scale to very large datasets.
In this section, we present our model and discuss its relations with several clustering approaches including k-means. Figure 1 shows an overview of our approach. We also show that it can be trained on massive datasets using an online procedure. Finally, we provide all the implementation details.
We are interested in learning visual features with no supervision. These features are produced by applying a parametrized mapping $f_\theta$ to the images. In the presence of supervision, the parameters $\theta$ are learned by minimizing a loss function between the features produced by this mapping and some given targets, e.g., labels. In the absence of supervision, there are no clear target representations and we thus need to learn them as well. More precisely, given a set of $n$ images $x_i$, we jointly learn the parameters $\theta$ of the mapping $f_\theta$ and some target vectors $y_i$:

$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \min_{y_i \in \mathbb{R}^d} \ell\big(f_\theta(x_i), y_i\big), \tag{1}$$

where $d$ is the dimension of the target vectors and $\ell$ is the loss function. In the rest of the paper, we use matrix notation, i.e., we denote by $Y$ the matrix whose rows are the target representations $y_i$, and by $X$ the matrix whose rows are the images $x_i$. With a slight abuse of notation, we denote by $f_\theta(X)$ the $n \times d$ matrix of features whose rows are obtained by applying the function $f_\theta$ to each image independently.
In the supervised setting, a popular choice for the loss is the softmax function. However, computing this loss is linear in the number of targets, making it impractical for large output spaces (Goodman, 2001). While there are workarounds to scale these losses to large output spaces, Tygert et al. (2017) have recently shown that using a squared distance works well in many supervised settings, as long as the final activations are unit normalized. This loss only requires access to a single target per sample, making its computation independent of the number of targets. This leads to the following problem:

$$\min_{\theta} \; \min_{Y \in \mathbb{R}^{n \times d}} \; \frac{1}{2n} \big\| f_\theta(X) - Y \big\|_F^2, \tag{2}$$

where we still denote by $f_\theta(X)$ the unit normalized features.
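As a minimal sketch of this loss, the NumPy snippet below normalizes each feature row to unit length and averages half the squared distance to its target. The function names and the 1/(2n) scaling are assumptions made for illustration. It also checks numerically that, for unit vectors, this loss equals one minus the mean cosine similarity, which is why only the single assigned target per sample is needed.

```python
import numpy as np

def unit_normalize(features, eps=1e-12):
    """Project each row onto the unit sphere (eps guards against zero rows)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

def square_loss(features, targets):
    """Assumed form: (1 / 2n) * sum_i ||normalize(f_i) - y_i||^2."""
    f = unit_normalize(features)
    return 0.5 / f.shape[0] * np.sum((f - targets) ** 2)

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 8))                      # toy unnormalized features
targets = unit_normalize(rng.normal(size=(4, 8)))  # toy unit-norm targets

loss = square_loss(raw, targets)
# For unit vectors, ||f - y||^2 = 2 - 2 <f, y>, so the loss is 1 - mean cosine.
cosine = np.sum(unit_normalize(raw) * targets, axis=1)
print(loss, 1.0 - cosine.mean())
```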
Directly solving the problem defined in Eq. (2) would lead to a representation collapsing problem: all the images would be assigned to the same representation (Xu et al., 2004). We avoid this issue by fixing a set of $k$ predefined target representations and matching them to the visual features. More precisely, the matrix $Y$ is defined as the product of a $k \times d$ matrix $C$ containing these representations and an assignment matrix $P$ in $\{0,1\}^{n \times k}$, i.e.,

$$Y = PC. \tag{3}$$

Note that we can assume that $k$ is greater than $n$ with no loss of generality (by duplicating representations otherwise). Each image is assigned to a different target and each target can only be assigned once. This leads to a set $\mathcal{P}$ of constraints for the assignment matrices:

$$\mathcal{P} = \big\{ P \in \{0,1\}^{n \times k} \;\big|\; P 1_k = 1_n, \; P^\top 1_n \le 1_k \big\}. \tag{4}$$

This formulation forces the visual features to be diversified, avoiding the collapsing issue at the cost of fixing the target representations. Predefining these targets is an issue if their number $k$ is small, which is why we are interested in the case where $k$ is at least as large as the number $n$ of images.
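To make the assignment constraints concrete, here is a small sketch that builds a binary assignment matrix for toy sizes and checks the two properties stated above: every image gets exactly one target, and no target is used twice. The variable names `P`, `C`, `Y` and the sizes are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 5, 7, 3                      # toy sizes: n images, k >= n targets, d dims

# Fixed target representations, one per row, on the unit sphere.
C = rng.normal(size=(k, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)

# Give each image a distinct target index.
assigned = rng.permutation(k)[:n]
P = np.zeros((n, k), dtype=int)
P[np.arange(n), assigned] = 1

Y = P @ C                              # row i of Y is the target of image i

print(P.sum(axis=1))                   # images: one target each
print(P.sum(axis=0))                   # targets: used at most once
```

In practice one would store only the index vector `assigned` rather than the full matrix, since `P @ C` is the same as `C[assigned]`.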
Until now, we have not discussed the set of target representations stored in $C$. A simple choice for the targets would be to take $k$ elements of the canonical basis of $\mathbb{R}^d$. If $d$ is larger than $n$, this formulation would be similar to the framework of Dosovitskiy et al. (2014), and is impractical for large $n$. On the other hand, if $d$ is smaller than $n$, this formulation is equivalent to the discriminative clustering approach of Bach & Harchaoui (2007). Choosing such targets makes very strong assumptions on the nature of the underlying problem. Indeed, it assumes that each image belongs to a unique class and that all classes are orthogonal. While this assumption might be true for some classification datasets, it does not generalize to huge image collections nor capture subtle similarities between images belonging to different classes.
Since our features are unit normalized, another natural choice is to uniformly sample target vectors on the unit sphere. Note that the dimension $d$ will then directly influence the level of correlation between representations, i.e., the correlation is inversely proportional to the square root of $d$. Using this Noise As Targets (NAT), Eq. (2) is now equivalent to:

$$\max_{\theta} \; \max_{P \in \mathcal{P}} \; \mathrm{Tr}\big( P C f_\theta(X)^\top \big). \tag{5}$$

This problem can be interpreted as mapping deep features to a uniform distribution over a manifold, namely the $d$-dimensional sphere. Using predefined representations is a discrete approximation of this manifold that justifies the restriction of the mapping matrices to the set $\mathcal{P}$ of assignment matrices. In some sense, we are optimizing a crude approximation of the earth mover's distance between the distribution of deep features and a given target distribution (Rubner et al., 1998).
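The sampling step and the dependence of the correlation on the dimension can be illustrated with a short sketch. It assumes the standard construction of uniform points on the sphere, namely normalizing isotropic Gaussian samples; the helper names and toy sizes are hypothetical.

```python
import numpy as np

def sample_sphere(num_targets, dim, rng):
    """Uniform samples on the unit sphere, obtained by normalizing Gaussian draws."""
    z = rng.normal(size=(num_targets, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def correlation_std(C):
    """Standard deviation of the dot products between distinct targets."""
    gram = C @ C.T
    off_diagonal = gram[~np.eye(len(C), dtype=bool)]
    return off_diagonal.std()

rng = np.random.default_rng(0)
for dim in (16, 64, 256):
    C = sample_sphere(1000, dim, rng)
    # The empirical spread should track 1 / sqrt(dim).
    print(dim, correlation_std(C), 1.0 / np.sqrt(dim))
```

A larger dimension gives targets that are closer to mutually orthogonal, while a smaller one forces more correlation between them.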
Using the same notations as in Eq. (5), several clustering approaches share similarities with our method. In the linear case, spherical k-means optimizes the same objective w.r.t. $P$ and $C$, i.e.,

$$\max_{C} \; \max_{P \in \mathcal{Q}} \; \mathrm{Tr}\big( P C X^\top \big). \tag{6}$$

A key difference is the set $\mathcal{Q}$ of assignment matrices:

$$\mathcal{Q} = \big\{ P \in \{0,1\}^{n \times k} \;\big|\; P 1_k = 1_n \big\}. \tag{7}$$

This set only guarantees that each data point is assigned to a single target representation. Once we jointly learn the features and the assignment, this set does not prevent the collapsing of the data points to a single target representation.
Another similar clustering approach is Diffrac (Bach & Harchaoui, 2007). Their loss is equivalent to ours in the case of unit normalized features. Their set $\mathcal{M}$ of assignment matrices, however, is different:

$$\mathcal{M} = \big\{ P \in \{0,1\}^{n \times k} \;\big|\; P^\top 1_n \ge c\, 1_k \big\},$$

where $c > 0$ is some fixed parameter. While restricting the assignment matrices to this set prevents the collapsing issue, it introduces global constraints that are not suited for online optimization. This makes their approach hard to scale to large datasets.
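The difference between these assignment sets can be seen on a toy example. In the sketch below, all features have collapsed to nearly the same vector. A k-means-style rule, where each point independently picks its closest target, sends every point to one target, whereas a one-to-one matching still spreads the points over distinct targets. The sizes and the use of `scipy.optimize.linear_sum_assignment` for the matching are choices made for this illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, d = 6, 4                                    # toy sizes, as many targets as points

C = rng.normal(size=(n, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)  # fixed unit-norm targets

# Collapsed features: every point sits almost exactly on the first target.
f = np.tile(C[0], (n, 1)) + 1e-3 * rng.normal(size=(n, d))
f /= np.linalg.norm(f, axis=1, keepdims=True)

similarity = f @ C.T                           # entry (i, j) is <f_i, c_j>

# Each point picks its closest target independently (no constraint on targets).
nearest = similarity.argmax(axis=1)

# One-to-one matching that maximizes the total similarity.
rows, cols = linear_sum_assignment(similarity, maximize=True)

print("targets used, independent choice:", np.unique(nearest).size)
print("targets used, one-to-one matching:", np.unique(cols).size)
```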
In this section, we describe how to efficiently optimize the cost function described in Eq. (5). In particular, we explore approximated updates of the assignment matrix that are compatible with online optimization schemes, like stochastic gradient descent (SGD).
Directly solving for the optimal assignment requires evaluating the distances between all the $n$ features and the $k$ representations. In order to efficiently solve this problem, we first reduce the number $k$ of representations to $n$. This limits the set $\mathcal{P}$ to the set of permutation matrices, i.e.,

$$\mathcal{P} = \big\{ P \in \{0,1\}^{n \times n} \;\big|\; P 1_n = 1_n, \; P^\top 1_n = 1_n \big\}.$$

Even with this restriction, solving the assignment problem exactly over all $n$ images with the Hungarian algorithm costs $O(n^3)$, which is prohibitive for large datasets. Instead, we perform stochastic updates of the matrix. Given a batch of samples, we optimize the assignment matrix on its restriction to this batch. Given a subset of $b$ distinct images, we only update the $b \times b$ square submatrix obtained by restricting $P$ to these images and their corresponding targets. In other words, each image can only be re-assigned to a target that was previously assigned to another image in the batch. This procedure has a complexity of $O(b^3)$ per batch, leading to an overall complexity of $O(n b^2)$, which is linear in the number of data points. We perform this update before updating the parameters of our features, in an online manner. Note that this simple procedure would not have been possible if $k > n$; we would have had to also consider the unassigned representations.
Apart from the update of the assignment matrix $P$, we use the same optimization scheme as standard supervised approaches, i.e., SGD with batch normalization (Ioffe & Szegedy, 2015). As noted by Tygert et al. (2017), batch normalization plays a crucial role when optimizing the square loss, as it avoids exploding gradients. For each batch of images, we first perform a forward pass to compute the distance between the images and the corresponding subset of target representations. The Hungarian algorithm is then used on these distances to obtain the optimal reassignments within the batch. Once the assignments are updated, we use the chain rule in order to compute the gradients of all our parameters. Our optimization algorithm is summarized in Algorithm 1.
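The procedure described above can be sketched end to end on toy data. This is not the original implementation: the convnet is replaced by a hypothetical linear map followed by unit normalization, batch normalization is omitted, assignments are refreshed at every step, and all sizes and the step size are arbitrary. The matching within a batch uses `scipy.optimize.linear_sum_assignment`, which solves the same linear assignment problem as the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def forward(X, W):
    """Hypothetical linear 'network' followed by projection onto the unit sphere."""
    Z = X @ W
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / norms, norms

def reassign_batch(F, targets, assignment, idx):
    """Permute, in place, the targets currently held by the images in `idx`."""
    held = assignment[idx]                      # target indices owned by the batch
    # cost[i, j]: squared distance between feature i and the j-th held target
    cost = ((F[:, None, :] - targets[held][None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)    # optimal matching within the batch
    assignment[idx[rows]] = held[cols]

def grad_W(Xb, F, norms, Y):
    """Gradient of (1/2b) sum_i ||f_i - y_i||^2 w.r.t. W, through the normalization."""
    G = F - Y
    G = (G - (G * F).sum(axis=1, keepdims=True) * F) / norms
    return Xb.T @ G / len(Xb)

def mean_loss(X, W, targets, assignment):
    F, _ = forward(X, W)
    return 0.5 * ((F - targets[assignment]) ** 2).sum(axis=1).mean()

rng = np.random.default_rng(0)
n, p, d, b, lr = 64, 10, 5, 16, 0.5             # toy sizes, chosen for illustration
X = rng.normal(size=(n, p))
targets = rng.normal(size=(n, d))
targets /= np.linalg.norm(targets, axis=1, keepdims=True)
assignment = rng.permutation(n)                 # random initial one-to-one assignment
W = 0.1 * rng.normal(size=(p, d))

initial_loss = mean_loss(X, W, targets, assignment)
for epoch in range(20):
    order = rng.permutation(n)
    for start in range(0, n, b):
        idx = order[start:start + b]
        F, norms = forward(X[idx], W)
        reassign_batch(F, targets, assignment, idx)   # update P before the parameters
        W -= lr * grad_W(X[idx], F, norms, targets[assignment[idx]])
final_loss = mean_loss(X, W, targets, assignment)
print(initial_loss, final_loss)
```

Because each update only permutes targets already held by the batch, the global assignment stays one-to-one without ever looking at images outside the batch.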
Our experiments solely focus on learning visual features with convnets. All the details required to train these architectures with our approach are described below. Most of them are standard tricks used in the usual supervised setting.
To ensure a fair empirical comparison with previous work, we follow Wang & Gupta (2015) and use an AlexNet architecture. We train it end to end using our unsupervised loss function. We subsequently test the quality of the learned visual features by re-training a classifier on top. During transfer learning, we consider the output of the last convolutional layer as our features, as in Razavian et al. (2014). We use the same multi-layer perceptron (MLP) as in Krizhevsky et al. (2012) for the classifier.
We observe in practice that pre-processing the images greatly helps the quality of our learned features. As in Ranzato et al. (2007), we use image gradients instead of the images to avoid trivial solutions like clustering according to colors. Using this preprocessing is not surprising since most hand-made features like SIFT or HoG are based on image gradients (Lowe, 1999; Dalal & Triggs, 2005). In addition to this pre-processing, we also perform all the standard image transformations that are commonly applied in the supervised setting (Krizhevsky et al., 2012), such as random cropping and flipping of images.
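One possible form of this gradient pre-processing is sketched below: the image is converted to grayscale and filtered with horizontal and vertical Sobel kernels. The luminance weights, the edge padding and the two-channel output layout are assumptions made for the example, not details confirmed by the text.

```python
import numpy as np

def to_grayscale(img):
    """(H, W, 3) float image -> (H, W) luminance, using ITU-R BT.601 weights."""
    return img @ np.array([0.299, 0.587, 0.114])

def sobel_gradients(gray):
    """Return an (H, W, 2) array of horizontal and vertical Sobel responses."""
    g = np.pad(gray, 1, mode="edge")
    # Horizontal derivative: right column minus left column, weighted 1-2-1.
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    # Vertical derivative: bottom row minus top row, weighted 1-2-1.
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    return np.stack([gx, gy], axis=-1)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                 # stand-in for an RGB image in [0, 1]
grads = sobel_gradients(to_grayscale(image))
print(grads.shape)
```

A uniformly colored image maps to all zeros under this transform, which is the sense in which it discourages clustering by color.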
We project the output of the network on the sphere as in Tygert et al. (2017). The network is trained with SGD with a batch size of . During the first batches, we use a constant step size. After batches, we use a linear decay of the step size, i.e., . Unless mentioned otherwise, we permute the assignments within batches every epochs. For the transfer learning experiments, we follow the guideline described in Donahue et al. (2016).
We perform several experiments to validate different design choices in NAT. We then evaluate the quality of our features by comparing them to state-of-the-art unsupervised approaches on several auxiliary supervised tasks, namely object classification on ImageNet and object classification and detection on Pascal VOC 2007 (Everingham et al., 2010).
In order to measure the quality of our features, we measure their performance on transfer learning. We freeze the parameters of all the convolutional layers and overwrite the parameters of the MLP classifier with random Gaussian weights. We precisely follow the training and testing procedure that is specific to each of the datasets following Donahue et al. (2016).
We use the training set of ImageNet to learn our convolutional network (Deng et al., 2009). This dataset is composed of images that belong to object categories. For the transfer learning experiments, we also consider Pascal VOC 2007. In addition to fully supervised approaches (Krizhevsky et al., 2012), we compare our method to several unsupervised approaches, i.e., autoencoder, GAN and BiGAN as reported in Donahue et al. (2016). We also compare to self-supervised approaches, i.e., Agrawal et al. (2015); Doersch et al. (2015); Pathak et al. (2016); Wang & Gupta (2015) and Zhang et al. (2016). Finally we compare to state-of-the-art hand-made features, i.e., SIFT with Fisher Vectors (SIFT+FV) (Sánchez et al., 2013). They reduce the Fisher Vectors to a dimensional vector with PCA, and apply an unit -layer MLP on top.
In this section, we validate some of our design choices, like the loss function, representations and the influences of some parameters on the quality of our features. All the experiments are run on ImageNet.
Table 1 compares the performance of an AlexNet trained with a softmax and a square loss. We report the accuracy on the validation set. The square loss requires the features to be unit normalized to avoid exploding gradients. As previously observed by Tygert et al. (2017), the performances are similar, hence validating our choice of loss function.
In supervised classification, image pre-processing is not frequently used, and transformations that remove information are usually avoided. In the unsupervised case, however, we observe that it is preferable to work with simpler inputs as it avoids learning trivial features. In particular, we observe that using grayscale image gradients greatly helps our method, as mentioned in Sec. 3. In order to verify that this pre-processing does not destroy crucial information, we propose to evaluate its effect on supervised classification. We also compare with high-pass filtering. Table 2 shows the impact of these pre-processing methods on the accuracy of an AlexNet on the validation set of ImageNet. None of them degrades performance significantly, meaning that the information related to gradients is sufficient for object classification. This experiment confirms that such pre-processing does not lead to a significant drop in the upper bound performance for our model.
We compare our choice for the target vectors to those commonly used for clustering, i.e., elements of the canonical basis of a k-dimensional space. Such discrete representations make a strong assumption on the underlying structure of the problem, namely that it can be linearly separated into different classes. This assumption holds for ImageNet, giving a fair advantage to this discrete representation. We test this representation with k in , which is a range well-suited for the classes of ImageNet. The matrix contains replications of elements of the canonical basis. This assumes that the clusters are balanced, which is verified on ImageNet.
We compare these cluster-like representations to our continuous target vectors on the transfer task on ImageNet. Using discrete targets achieves an accuracy of , which is significantly worse than our best performance, i.e., . A possible explanation is that binary vectors induce sharp discontinuous distances between representations. Such distances are hard to optimize over and may result in early convergence to poorer local minima.
In this experiment, we are interested in understanding how the quality of our features evolves with the optimization of our cost function. During the unsupervised training, we freeze the network every 20 epochs and learn an MLP classifier on top. We report the accuracy on the validation set of ImageNet. Figure 2 shows the evolution of the performance on this transfer task as we optimize for our unsupervised approach. The training performance improves monotonically with the epochs of the unsupervised training. This suggests that optimizing our objective function correlates with learning transferable features, i.e., our features do not destroy useful class-level information. On the other hand, the test accuracy seems to saturate after a hundred epochs. This suggests that the MLP is overfitting rapidly on pre-trained features.
Assigning images to their target representations is a crucial feature of our approach. In this experiment, we are interested in understanding how frequently we should update this assignment. Indeed, updating the assignment, even partially, is relatively costly and may not be required to achieve good performance. Figure 2 shows the transfer accuracies on ImageNet as a function of the frequency of these updates. The model is quite robust to the choice of frequency, with a test accuracy always above . Interestingly, the accuracy actually degrades slightly with high frequency. A possible explanation is that the network overfits rapidly to its own output, leading to relatively worse features. In practice, we observe that updating the assignment matrix every epochs offers a good trade-off between training cost and accuracy.
Figure 4 shows a comparison between the first convolutional layer of an AlexNet trained with and without supervision. Both take grayscale gradient images as input. The visualizations are obtained by composing the Sobel filtering with the filters of the first layer of the AlexNet. Unsupervised filters are slightly less sharp than their supervised counterparts, but still maintain edge and orientation information.
Our loss optimizes a distance between features and fixed vectors. This means that looking at the distance between features should provide some information about the type of structure that our model captures. Given a query image $x$, we compute its feature $f_\theta(x)$ and search for its nearest neighbors according to the $\ell_2$ distance. Figure 3 shows images and their nearest neighbors.
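A nearest-neighbor query of this kind can be sketched in a few lines. The feature matrix here is random stand-in data, and the function name and the choice of five neighbors are illustrative assumptions. Since the features are unit normalized, ranking by Euclidean distance gives the same order as ranking by cosine similarity.

```python
import numpy as np

def nearest_neighbors(query, features, num_neighbors=5):
    """Indices of the rows of `features` closest to `query` in Euclidean distance."""
    distances = np.linalg.norm(features - query, axis=1)
    return np.argsort(distances)[:num_neighbors]

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))           # stand-in for learned image features
features /= np.linalg.norm(features, axis=1, keepdims=True)

query = features[3]                             # use one database image as the query
neighbors = nearest_neighbors(query, features)
print(neighbors)                                # the query itself comes first
```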
The features capture relatively complex structures in images. Objects with distinctive structures, like trunks or fruits, are well captured by our approach. However, this information is not always related to true labels. For example, the image of a bird over the sea is matched to images capturing information about the sea or the sky rather than the bird.
We report results on the transfer task both on ImageNet and Pascal VOC 2007. In both cases, the model is trained on ImageNet.
In this experiment, we evaluate the quality of our features for the object classification task of ImageNet. Note that in this setup, we build the unsupervised features on images that correspond to predefined image categories. Even though we do not have access to category labels, the data itself is biased towards these classes. In order to evaluate the features, we freeze the layers up to the last convolutional layer and train the classifier with supervision. This experimental setting follows Noroozi & Favaro (2016).
We compare our model with several self-supervised approaches (Wang & Gupta, 2015; Doersch et al., 2015; Zhang et al., 2016) and an unsupervised approach, i.e., Donahue et al. (2016). Note that self-supervised approaches use losses specifically designed for visual features. Like BiGANs (Donahue et al., 2016), NAT makes no assumption about the domain, only about the structure of its features. Table 3 compares NAT with these approaches.
| Method | Accuracy (%) |
|---|---|
| Random (Noroozi & Favaro, 2016) | 12.0 |
| SIFT+FV (Sánchez et al., 2013) | 55.6 |
| Wang & Gupta (2015) | 29.8 |
| Doersch et al. (2015) | 30.4 |
| Zhang et al. (2016) | 35.2 |
| Noroozi & Favaro (2016) | 38.1 |
| BiGAN (Donahue et al., 2016) | 32.2 |
Among unsupervised approaches, NAT compares favorably to BiGAN (Donahue et al., 2016). Interestingly, the performance of NAT is slightly better than that of self-supervised methods, even though we do not explicitly use domain-specific clues in images or videos to guide the learning. While all the models provide performance in a similar range, it is not clear if they all learn the same features. Finally, all the unsupervised deep features are outperformed by hand-made features, in particular Fisher Vectors with SIFT descriptors. This baseline uses a slightly bigger MLP for the classifier and its performance can be further improved by bagging these models. This difference in accuracy shows that unsupervised deep features are still quite far from the state of the art among all unsupervised features.
We carry out a second transfer experiment on the Pascal VOC dataset, on the classification and detection tasks. The model is trained on ImageNet. Depending on the task, we finetune all layers in the network, or solely the classifier, following Donahue et al. (2016). In all experiments, the parameters of the convolutional layers are initialized with the ones obtained with our unsupervised approach. The parameters of the classification layers are initialized with Gaussian weights. We get rid of batch normalization layers and use a data-dependent rescaling of the parameters (Krähenbühl et al., 2015). Table 4 shows the comparison between our model and other unsupervised approaches. The results for other methods are taken from Donahue et al. (2016) except for Zhang et al. (2016).
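| Method | Classification (classifier only) | Classification (all layers) | Detection (all layers) |
|---|---|---|---|
| Agrawal et al. (2015) | 31.0 | 54.2 | 43.9 |
| Pathak et al. (2016) | 34.6 | 56.5 | 44.5 |
| Wang & Gupta (2015) | 55.6 | 63.1 | 47.4 |
| Doersch et al. (2015) | 55.1 | 65.3 | 51.1 |
| Zhang et al. (2016) | 61.5 | 65.6 | 46.9 |
| BiGAN (Donahue et al., 2016) | 52.3 | 60.1 | 46.9 |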
As with the ImageNet classification task, our performance is on par with self-supervised approaches, for both detection and classification. Among purely unsupervised approaches, we outperform standard approaches like autoencoders or GANs by a large margin. Our model also performs slightly better than the best performing BiGAN model (Donahue et al., 2016). These experiments confirm our findings from the ImageNet experiments. Despite its simplicity, NAT learns features that are as good as those obtained with more sophisticated and data-specific models.
This paper presents a simple unsupervised framework to learn discriminative features. By aligning the output of a neural network to low-dimensional noise, we obtain features on par with state-of-the-art unsupervised learning approaches. Our approach explicitly aims at learning discriminative features, while most unsupervised approaches target surrogate problems, like image denoising or image generation. As opposed to self-supervised approaches, we make very few assumptions about the input space. This makes our approach very simple and fast to train. Interestingly, it also shares some similarities with traditional clustering approaches as well as retrieval methods. While we show the potential of our approach on visual data, it will be interesting to try other domains. Finally, this work only considers simple noise distributions and alignment methods. A possible direction of research is to explore target distributions and alignments that are more informative. This would also strengthen the relation between NAT and methods based on distribution matching like the earth mover's distance.
We greatly thank Hervé Jégou for his help throughout the development of this project. We also thank Allan Jabri, Edouard Grave, Iasonas Kokkinos, Léon Bottou, Matthijs Douze and the rest of FAIR for their support and helpful discussion. Finally, we thank Richard Zhang, Jeff Donahue and Florent Perronnin for their help.
Coates, A. and Ng, A. Y. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade. Springer, 2012.
Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193–202, 1980.
Masci, J., Meier, U., Cireşan, D., and Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature extraction. In ICANN, 2011.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(Dec):3371–3408, 2010.
Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In ECCV, 2016.