Whitening for Self-Supervised Representation Learning

Recent literature on self-supervised learning is based on the contrastive loss, where image instances which share the same semantic content ("positives") are contrasted with instances extracted from other images ("negatives"). However, in order for the learning to be effective, a lot of negatives should be compared with a positive pair. This is not only computationally demanding, but it also requires that the positive and the negative representations are kept consistent with each other over a long training period. In this paper we propose a different direction and a new loss function for self-supervised learning which is based on the whitening of the latent-space features. The whitening operation has a "scattering" effect on the batch samples, which compensates for the lack of a large number of negatives, avoiding degenerate solutions where all the sample representations collapse to a single point. We empirically show that our loss accelerates self-supervised training and that the learned representations are much more effective for downstream tasks than those of previously published work.



1 Introduction

One of the current main bottlenecks in Deep Network training is the dependence on large annotated training datasets, and this motivates the recent surge of interest in unsupervised methods. Specifically, in self-supervised representation learning, a network is (pre-)trained without any form of manual annotation, thus providing a means to extract information from unlabeled-data sources (e.g., text corpora, videos, images from the Internet, etc.). In self-supervision, label information is replaced by asking the network to make predictions using some form of context or using a pretext task. Pioneering work in this direction was done in Natural Language Processing (NLP), in which the co-occurrence of words in a sentence is used to learn a language model [28, 29, 8]. In Computer Vision, typical contexts or pretext tasks are based on: (1) the temporal consistency in videos [42, 31, 14], (2) the spatial order of patches in still images [32, 30, 21] or (3) simple image transformation techniques [26, 19, 43]. The intuitive idea behind most of these methods is to collect pairs of positive and negative samples: Two positive samples should share the same semantics, while negatives should be perceptually different. A triplet loss [38, 36, 22, 42, 31] can then be used to learn a metric space which should represent the human perceptual similarity. However, most of the recent studies use a contrastive loss [17] or one of its variants [16, 40, 23], while in [39] the authors show the relation between the triplet loss and the contrastive loss.

It is worth noticing that the success of both kinds of losses is strongly affected by the number and the quality of the negative samples. For instance, in the case of the triplet loss, a common practice is to select hard/semi-hard negatives [36, 22]. On the other hand, recent works have shown that the contrastive loss needs a large number of negatives to be competitive [23]. However, this implies using batches with a large size, which is computationally demanding, especially with high-resolution images. In order to alleviate this problem, Wu et al. [43] use a memory bank of negatives, which is composed of feature-vector representations of all the training samples. He et al. [19] conjecture that the use of large and fixed-representation vocabularies is one of the keys to the success of self-supervision in NLP [8, 29]. The solution proposed in [19] extends [43] using a memory-efficient queue of the last visited negatives, together with a momentum encoder which preserves the intra-queue representation consistency. However, the raw images used in Vision are much more variable than the fixed symbols (e.g., the vocabulary words) used in NLP. As a consequence, when an image is represented by the encoder part of the trained network, this representation changes drastically during training, which makes it hard to compare it with image samples observed much earlier in training.

In this paper we propose a different direction and a new self-supervised loss function which first scatters all the sample representations in a spherical distribution (here and in the following, by “spherical distribution” we mean a distribution with zero mean and an identity covariance matrix) and then penalizes the positive pairs which are far from each other. In more detail, given a set of samples corresponding to the current mini-batch of images, we first project the elements of this set onto a spherical distribution using a whitening transform [37]. The whitened representations are then used to compute a standard Mean Squared Error (MSE) loss which accumulates the error taking into account only positive pairs. We do not need to contrast positives against negatives as in the contrastive loss or in the triplet loss because the optimization process shrinks the distance between positive pairs and, indirectly, scatters the other samples in order to satisfy the overall spherical-distribution constraint.

We empirically show that our Whitening MSE (W-MSE) loss outperforms the commonly adopted contrastive loss when measured using different standard classification protocols. Additionally, we show that W-MSE brings complementary information with respect to a standard contrastive loss and can be combined with the latter using multiple projection heads. Specifically, given an encoder which extracts a representation vector from an image, we use two simple and separate MultiLayer Perceptrons (MLPs) which project the common representation onto two different latent-space representations. The two projection heads work collaboratively and are trained with two different losses, using our W-MSE for the first and a (suitably normalized) contrastive loss for the second. We test the full method, which we call Collaborative Projections With Whitening, on common self-supervised benchmarks, significantly outperforming other unsupervised methods which use similar-capacity backbone networks. The full method, tested with a finetuning-based protocol on STL-10 [4], establishes a new state-of-the-art result on this dataset.

In summary, our contributions are the following:

  • We propose a new loss function (W-MSE) for self-supervised training. W-MSE constrains the batch samples to lie in a spherical distribution and it is an alternative to positive-negative instance contrasting methods.

  • Differently from most previous work, in which only one loss function is used, we can combine our W-MSE with other common losses. Specifically, we propose two different nonlinear projection heads and the use of W-MSE and the contrastive loss in the two corresponding latent spaces.

  • We show that W-MSE outperforms other loss functions and that our full method is competitive with respect to state-of-the-art self-supervised methods.

2 Background and Related Work

A typical self-supervised method is composed of two main components: a pretext task, which exploits some a-priori knowledge about the domain in order to automatically extract supervision from data, and a loss function. In this section we briefly review both aspects, and we additionally analyse the recent literature concerning feature whitening.

Pretext Tasks. The temporal consistency in a video provides an intuitive form of self-supervision: temporally-close frames usually contain a similar semantic content [42, 40]. In [31] this idea is extended using the relative temporal order of 3 frames, while in [14] self-supervision is given by a temporal cycle consistency, which is based on comparing two videos sharing the same semantics and computing inter-video frame-to-frame nearest neighbour assignments.

When dealing with still images, the most common pretext task is instance discrimination [43]: an input image is transformed into a new view using a (composition of) data-augmentation technique(s), such as image cropping, rotation, color jittering, Sobel filtering, etc., and then the learner is required to discriminate the transformed image from other samples [26, 19, 43].

Denoising auto-encoders [41] add random noise to the input image and try to recover the original image. More sophisticated pretext tasks consist in predicting the spatial order of image patches [32, 30] or in reconstructing large masked regions of the image [34]. In [23, 1] the holistic representation of an input image is compared with a patch of the same image. A similar idea is used in [21], where the comparison depends on the patch order: the appearance of a given patch should be predicted given the appearance of the patches which lie above it in the image.

In this paper we use standard data augmentation techniques on still images to obtain positive pairs, which is a simple method to get self-supervision [26, 19, 43] and does not require a pretext-task specific network architecture [23, 1, 21].

Loss functions. Denoising auto-encoders [41] use a reconstruction loss which compares the generated image with the input image before adding noise. Other generative methods use an adversarial loss in which a discriminator provides supervisory information to the generator [10, 11].

Early self-supervised (deep) discriminative methods used a triplet loss [42, 31]: given two positive images and a negative one (Sec. 1), together with their corresponding latent-space representations, this loss penalizes those cases in which a positive and the negative are closer to each other than the two positives plus a margin.
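As an illustration of this description, a minimal NumPy sketch of a hinge-style triplet loss on squared Euclidean distances follows. The function name `triplet_loss`, the argument names and the margin value are assumptions made here for illustration; the exact form used in the cited works may differ (for example, it may be based on similarities rather than distances).

```python
import numpy as np

def triplet_loss(z_anchor, z_pos, z_neg, margin=1.0):
    """Hinge-style triplet loss on squared Euclidean distances.

    The loss is zero only when the negative is farther from the anchor
    than the positive by at least `margin` (an assumed, illustrative value).
    """
    d_pos = np.sum((z_anchor - z_pos) ** 2)   # anchor-positive distance
    d_neg = np.sum((z_anchor - z_neg) ** 2)   # anchor-negative distance
    return max(d_pos - d_neg + margin, 0.0)

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
far_negative = np.array([3.0, 0.0])
near_negative = np.array([0.2, 0.0])

loss_far = triplet_loss(anchor, positive, far_negative)    # margin satisfied
loss_near = triplet_loss(anchor, positive, near_negative)  # margin violated
```

With the far negative the margin is satisfied and nothing is penalized, while the near negative produces a positive loss.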


Most of the recent self-supervised discriminative methods are based on some contrastive loss [17] variant, in which the representations of a positive pair are contrasted against a set of negative pairs. In the common formulation proposed in [40] (Eq. 2), a temperature hyperparameter must be set manually, and the sum in the denominator of the loss runs over a set of negative samples. Usually the size of this set is the size of the current batch, i.e., twice the number of collected positive pairs. However, as shown in [23], the InfoNCE loss requires a large number of negative samples to be competitive. In [43, 19] a set of negatives much larger than the current batch is used, by using pre-computed latent-space representations of old samples.
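A minimal NumPy sketch of an InfoNCE-style loss computed over a batch is given below. It is an illustration under stated assumptions, not the formulation of [40]: the function name `info_nce`, the `pair_index` convention, the temperature value and the choice of letting the denominator run over all the other samples of the batch are made up for this example.

```python
import numpy as np

def info_nce(z, pair_index, temperature=0.5, normalize=True):
    """InfoNCE-style loss over a batch.

    z           : (B, d) array of latent representations.
    pair_index  : (B,) integer array; pair_index[i] is the row of the
                  positive matched to row i.
    temperature : assumed, illustrative value.
    normalize   : if True, L2-normalize each row first.
    """
    if normalize:
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature            # pairwise similarities
    np.fill_diagonal(logits, -np.inf)         # a sample is never compared with itself
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    rows = np.arange(z.shape[0])
    return -log_prob[rows, pair_index].mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
# Four positive pairs: row k and row k + 4 are two views of the same sample.
z = np.concatenate([base, base + 0.01 * rng.normal(size=(4, 8))])
pair_index = np.array([4, 5, 6, 7, 0, 1, 2, 3])
wrong_index = np.array([5, 6, 7, 4, 1, 2, 3, 0])   # deliberately mismatched

loss_matched = info_nce(z, pair_index)
loss_mismatched = info_nce(z, wrong_index)
```

The loss is low when each sample is most similar to its own positive, and it grows when the declared positives are in fact unrelated samples.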

In this paper we propose a different loss which is highly competitive with respect to other alternatives and does not require a large number of samples. Moreover, our loss formulation is also simpler since it does not require tuning the temperature hyperparameter in Eq. 2 or the margin in Eq. 1. Finally, while many recent works [40, 21, 23, 1, 35] draw a relation between the contrastive loss and an estimate of the mutual information between latent-space image representations, Tschannen et al. [39] argue that the success of this loss is likely related to learning a metric space, similarly to what happens with a triplet loss.

Feature Whitening. We adopt the efficient and stable whitening transform based on the Cholesky decomposition [7] proposed in [37] to project our latent-space vectors into a spherical distribution (more details in Sec. 3). Note that in [24, 37] whitening transforms are used in the intermediate layers of the network for a completely different task: extending Batch Normalization [25] to a multivariate batch normalization.

Figure 1: A schematic representation of the optimization process driven by our W-MSE loss. Positive pairs are indicated with similar shapes and colors. (a) A representation of the feature batch when training starts. (b) The distribution of the batch elements after whitening. (c) The MSE computed over the whitened elements encourages the network to move the positive pair representations closer to each other. (d)-(f) The subsequent iterations move the positive pairs closer and closer, while the relative layout of the other samples is forced to lie in a spherical distribution.

3 The Whitening MSE Loss

Given an image, we extract an embedding using a parametrized network (more details below). We require that: (1) the image embeddings are drawn from a non-degenerate distribution (a degenerate distribution being one where, e.g., all the representations collapse to a single point), and (2) positive image pairs, which share similar semantics, are clustered close to each other. We formulate this problem as a constrained minimization: the objective (Eq. (3)) penalizes the distance between the embeddings of a positive pair of images, while the constraint (Eq. (4)) requires the covariance matrix of the embeddings to be equal to the identity matrix. With Eq. (4), we constrain the distribution of the embedding values to be non-degenerate, hence preventing all the probability mass from being concentrated in a single point. Moreover, Eq. (4) makes all the components of the embedding uncorrelated with each other, which encourages the different dimensions of the embedding to represent different semantic content. We provide below the details on how positive image samples are collected, how they are encoded and how Eq. (3)-(4) are implemented.
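In symbols, and under an assumed notation that is not necessarily the authors' ($z_i$, $z_j$ for the embeddings of a positive pair, $\theta$ for the network parameters, $\mathrm{dist}$ for a distance between embeddings, $I$ for the identity matrix), the two requirements above can be sketched as an objective in the role of Eq. (3) and a constraint in the role of Eq. (4):

```latex
\min_{\theta} \; \mathbb{E}\left[ \mathrm{dist}(z_i, z_j) \right]
\quad \text{s.t.} \quad \mathrm{cov}(z_i, z_i) = I
```

The constraint rules out the trivial minimizer in which every embedding is mapped to the same point, since a collapsed distribution has a zero covariance matrix.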

First, similarly to [26, 19, 43], we obtain a pair of positive images sharing the same semantics starting from a single image and using standard image transformation techniques. Specifically, we use a composition of image cropping and color jittering transformations, whose parameters are selected uniformly at random and independently of each other, and we apply it twice to the same image in order to obtain a pair of positive samples (see Fig. 3 for some examples). Two samples of the current batch are said to be matched to each other when they share the same semantics, i.e., when they are obtained from the same image.
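The construction of a positive pair can be sketched as follows. This is a simplified illustration: the function names `random_view` and `positive_pair`, the crop-scale range and the jitter strength are assumptions, the color jitter is reduced to a brightness change, and the crops are not resized back to a common resolution as a real pipeline would do.

```python
import numpy as np

def random_view(image, rng, min_scale=0.5, max_jitter=0.4):
    """One random view of an (H, W, 3) float image with values in [0, 1].

    min_scale and max_jitter are illustrative values, not the paper's.
    """
    h, w, _ = image.shape
    scale = rng.uniform(min_scale, 1.0)                 # fraction of each side kept
    ch, cw = max(1, int(h * scale)), max(1, int(w * scale))
    top = rng.integers(0, h - ch + 1)                   # random crop position
    left = rng.integers(0, w - cw + 1)
    crop = image[top:top + ch, left:left + cw]
    brightness = 1.0 + rng.uniform(-max_jitter, max_jitter)  # simplified color jitter
    return np.clip(crop * brightness, 0.0, 1.0)

def positive_pair(image, rng):
    """Two independently transformed views of the same image."""
    return random_view(image, rng), random_view(image, rng)

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))
view_a, view_b = positive_pair(image, rng)
```

Because the two calls draw their parameters independently, the two views generally differ in crop position, crop size and brightness while still depicting the same image content.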

For representation learning, we use a backbone encoder network. The encoder, trained without human supervision, will be used in Sec. 5 for evaluation using standard protocols. Similarly to [19], we use a standard ResNet [20] as the encoder, and its representation is the output of the average-pooling layer. This choice has the advantage of being simple and easy to reproduce, in contrast to other methods which use encoder architectures specific to a given pretext task [23, 1, 21]. Since this representation is a high-dimensional vector, following [23, 1] we use a nonlinear projection head to project it into a lower-dimensional space; the projection head is implemented with a simple MLP with one hidden layer. The whole network is given by the composition of the projection head with the encoder (see Fig. 2 (a)).

Given a batch of images composed of positive pairs, let the corresponding batch of features be obtained as described above. The proposed W-MSE loss (Eq. (5)) is the Mean Squared Error computed over the positive pairs, where constraint (4) is satisfied by reparameterizing the features with whitened variables. Each whitened variable (Eq. (6)) is obtained by subtracting from the corresponding feature the mean of the elements in the feature batch (Eq. (7)) and multiplying the result by a whitening matrix. The whitening matrix is such that its transpose multiplied by itself equals the inverse of the covariance matrix of the feature batch (Eq. (8)).

For more details on the whitening transform, we refer to [37]. This transformation performs the full whitening of each feature vector [37], and the resulting set of vectors lies in a zero-centered distribution with a covariance matrix equal to the identity matrix (Fig. 1).
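A Cholesky-based whitening of a feature batch can be sketched in NumPy as follows. This is a minimal illustration, not the implementation of [37]: the function name `whiten`, the variable names and the small ridge term `eps` added to the covariance are assumptions introduced here, and a triangular solve would normally be preferred to an explicit matrix inverse.

```python
import numpy as np

def whiten(v, eps=1e-6):
    """Whiten a batch of features v with shape (N, d).

    Returns z with (approximately) zero mean and identity covariance,
    together with the whitening matrix w.
    """
    n, d = v.shape
    mu = v.mean(axis=0)                          # batch mean
    centered = v - mu
    sigma = centered.T @ centered / (n - 1)      # covariance estimate
    sigma = sigma + eps * np.eye(d)              # assumed small ridge for stability
    chol = np.linalg.cholesky(sigma)             # sigma = chol @ chol.T
    w = np.linalg.inv(chol)                      # so that w.T @ w = inv(sigma)
    z = centered @ w.T                           # row-wise: z_n = w @ (v_n - mu)
    return z, w

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.3, 0.0],
                   [0.0, 0.0, 1.5, 0.2],
                   [0.0, 0.0, 0.0, 1.0]])
v = rng.normal(size=(256, 4)) @ mixing + 5.0     # correlated, non-centered features
z, w = whiten(v)
```

After the transform, `np.cov(z, rowvar=False)` is close to the identity matrix and `z.mean(axis=0)` is close to zero, whatever the mean and the correlations of the input features were.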

The intuition behind the proposed loss is that Eq. (5) penalizes positives which are far apart from each other, thus shrinking the inter-positive distances. On the other hand, since the whitened representations must lie in a spherical distribution, the other samples should be “moved” and rearranged in order to satisfy constraint (4) (see Fig. 1).
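The following NumPy sketch illustrates this intuition under stated assumptions: `whiten` is a simplified Cholesky-based whitening with an assumed ridge term, `w_mse` and the `pair_index` convention are names made up for this example, and the normalization constant of the loss may differ from the one in Eq. (5).

```python
import numpy as np

def whiten(v, eps=1e-6):
    """Cholesky-based whitening of a (N, d) batch (eps is an assumed ridge)."""
    centered = v - v.mean(axis=0)
    sigma = centered.T @ centered / (len(v) - 1) + eps * np.eye(v.shape[1])
    w = np.linalg.inv(np.linalg.cholesky(sigma))
    return centered @ w.T

def w_mse(v, pair_index):
    """Mean squared distance between whitened positives.

    v          : (N, d) batch of projected features.
    pair_index : (N,) array; pair_index[i] is the row matched to row i.
    """
    z = whiten(v)
    diff = z - z[pair_index]
    return np.mean(np.sum(diff ** 2, axis=1))

rng = np.random.default_rng(0)
base = rng.normal(size=(32, 4))
v = np.concatenate([base, base + 0.05 * rng.normal(size=(32, 4))])
pair_index = np.concatenate([np.arange(32, 64), np.arange(0, 32)])

loss_matched = w_mse(v, pair_index)                  # true positive pairs
loss_random = w_mse(v, np.roll(np.arange(64), 1))    # mostly unrelated pairs
loss_scaled = w_mse(10.0 * v, pair_index)            # same features, rescaled
```

Two properties are visible here. The loss is small only when the declared positives are actually close after whitening, and rescaling all the features leaves the loss essentially unchanged, so uniformly shrinking the features towards a point does not reduce it.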

Batch Slicing. The estimation of the Mean Squared Error in Eq. (5) depends on the whitening matrix, which may have a high variance over consecutive iteration batches. For this reason, inspired by resampling methods [15], at each iteration, given a batch, we randomly slice it into several non-overlapping sub-batches and we compute the whitening matrix independently for each sub-batch. We repeat this random slicing four times in order to get a more robust estimate of Eq. (5).
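One possible reading of this procedure is sketched below. Several details are assumptions made for the example and are not stated in the text: the two views of a pair are kept in the same sub-batch, row `k` and row `k + K` of the batch form the `k`-th positive pair, and the per-slice losses are simply averaged. The names `sliced_w_mse` and `n_pairs_per_slice` are hypothetical.

```python
import numpy as np

def whiten(v, eps=1e-6):
    """Cholesky-based whitening of a (N, d) batch (eps is an assumed ridge)."""
    centered = v - v.mean(axis=0)
    sigma = centered.T @ centered / (len(v) - 1) + eps * np.eye(v.shape[1])
    w = np.linalg.inv(np.linalg.cholesky(sigma))
    return centered @ w.T

def sliced_w_mse(v, n_pairs_per_slice, n_repeats=4, seed=0):
    """Average W-MSE over random non-overlapping sub-batches.

    v has shape (2K, d); rows k and k + K are assumed to be a positive pair.
    """
    rng = np.random.default_rng(seed)
    k = v.shape[0] // 2
    losses = []
    for _ in range(n_repeats):                     # repeat the random slicing
        perm = rng.permutation(k)                  # shuffle pairs, never split one
        for start in range(0, k, n_pairs_per_slice):
            idx = perm[start:start + n_pairs_per_slice]
            sub = np.concatenate([v[idx], v[idx + k]])   # both views of each pair
            z = whiten(sub)                        # whitening matrix per sub-batch
            m = len(idx)
            losses.append(np.mean(np.sum((z[:m] - z[m:]) ** 2, axis=1)))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 4))
v = np.concatenate([base, base + 0.05 * rng.normal(size=(64, 4))])
loss = sliced_w_mse(v, n_pairs_per_slice=16)
```

Each sub-batch must contain clearly more samples than the embedding dimension, otherwise its covariance matrix is close to singular and the whitening becomes unstable.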

3.1 Discussion

In a common instance-discrimination task (Sec. 2), e.g., solved using Eq. (2), the similarity of a positive pair is contrasted with the similarity computed with respect to all the other samples in the batch. However, two samples extracted from different image instances can occasionally share the same semantics (e.g., they are two different image instances of the “cat” class). Conversely, the proposed W-MSE loss does not force all the instance samples to lie far from each other, but it only imposes a soft constraint (Eq. (4)), which avoids degenerate distributions.

Note that previous work [19, 21] highlighted that Batch Normalization (BN) [25] may be harmful for learning semantically meaningful representations because the network can “cheat” and exploit the batch statistics in order to find a trivial solution to Eq. (2). However, our whitening transform (Eq. (6)) is applied only to the very last layer of the network (see Fig. 2) and it is not used in the intermediate layers, which is instead the case of BN. Hence, our network cannot learn to exploit subtle inter-sample dependencies introduced by batch statistics, because there are no other learnable layers on top of the whitened features.

4 Multiple-Head Projections

Our W-MSE loss can be used in conjunction with different losses in order to increase the self-supervision signal provided to the encoder. In our experiments we used the InfoNCE loss (Eq. 2) but other losses (e.g., the triplet loss, Eq. 1) may be used as well. Specifically, the output of the encoder is fed to a second projection head, an MLP with the same number of layers and neurons as the first one, which produces a second set of embeddings. Note that the encoder is shared by the two heads (see Fig. 2 (b)). Before applying the InfoNCE loss, we L2-normalize all the elements of this second set, dividing each embedding by its norm. This normalization is also used in [19], and it is important in order to make the average magnitudes of the gradients backpropagated from the two heads roughly similar to each other. In Sec. 5.1.1 we show that the normalized InfoNCE loss, when used in isolation, is much better than its unnormalized version. It is worth noticing that the L2 normalization is per-element: differently from whitening (Eq. 6), the normalized embedding depends only on the embedding itself, and not on the other elements of the batch.

Finally, the normalized elements are used to compute Eq. 2, and the gradients of the two losses are merged in the encoder with equal relative weight. The embeddings produced by the two heads have the same dimension, and the intermediate layer in both MLP heads has the same dimension as its input.
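The difference between a per-element normalization and whitening can be checked numerically. In the sketch below, which assumes an L2 normalization and a simplified Cholesky-based whitening with a small ridge term, only the first row of the batch is kept fixed while all the other rows are replaced; the helper names and the data are made up for the example.

```python
import numpy as np

def l2_normalize(u):
    """Divide each row of u by its own Euclidean norm."""
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def whiten(v, eps=1e-6):
    """Cholesky-based whitening of a (N, d) batch (eps is an assumed ridge)."""
    centered = v - v.mean(axis=0)
    sigma = centered.T @ centered / (len(v) - 1) + eps * np.eye(v.shape[1])
    w = np.linalg.inv(np.linalg.cholesky(sigma))
    return centered @ w.T

rng = np.random.default_rng(0)
batch_a = rng.normal(size=(16, 4))
batch_b = batch_a.copy()
batch_b[1:] = rng.normal(size=(15, 4)) * 3.0   # change every row except the first

# L2 normalization: the first row is unaffected by the rest of the batch.
same_after_l2 = np.allclose(l2_normalize(batch_a)[0], l2_normalize(batch_b)[0])
# Whitening: the first row changes because the batch mean and covariance changed.
same_after_whitening = np.allclose(whiten(batch_a)[0], whiten(batch_b)[0])
```

Here `same_after_l2` is true and `same_after_whitening` is false. With equal relative weight, the total training signal would then simply be the sum of the two head losses.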

Figure 2: A schematic representation of our architecture with one head (a) and two heads (b).

5 Experiments

We test our method and its variants on the following datasets.

  • CIFAR10 and CIFAR100 [27], two small-scale datasets composed of 32×32 images with 10 and 100 classes, respectively.

  • ImageNet [6] is a large-scale dataset with 1.3M training images and 50K test images, spanning over 1000 classes.

  • Tiny ImageNet [27], a reduced version of ImageNet, composed of 200 classes with images scaled down to 64×64. The total number of images is: 100K (training) and 10K (testing).

  • STL-10 [4], also derived from ImageNet, with 96×96 resolution images. While CIFAR10, CIFAR100, Tiny ImageNet and ImageNet are fully labeled, STL-10 is composed of 5K labeled training samples (500 per class) and 100K unlabeled training examples of irrelevant or distractor classes. There are an additional 8K labeled test images.

Figure 3: Image transformation examples. The first row shows the original image taken from the STL-10 dataset. The second and the third row show the corresponding randomly augmented pairs used for self-supervised training.

Setting. For a fair comparison, we split all the experiments according to the capacity of the encoder networks we compare with. Specifically, for our encoder, we use ResNet-18, ResNet-34, ResNet-50 or AlexNet, with about 11M, 21M, 24M and 58M parameters, respectively. Unless otherwise specified, the results of the other state-of-the-art methods we report are based on the same backbone networks or on networks with roughly the same capacity. InfoNCE and Normalized InfoNCE refer to our reproduction of contrastive loss variants which are based on the encoder followed by a single projection head, without or with feature normalization, respectively (more details below).

Encoder Epochs Learning rate LR drop
ResNet-18 200 -
ResNet-34 500
ResNet-50 1000
AlexNet 500
Table 1: Training details of experiments according to backbone networks for CIFAR10, CIFAR100, Tiny ImageNet and STL10. Last column denotes the learning rate drop in the last 50 epochs.

We use the Adam optimizer for CIFAR10, CIFAR100, Tiny ImageNet and STL10. The number of epochs, the learning rate and the drop are presented in Tab. 1. We use a mini-batch size of pairs ( samples). Finally, we use an embedding size of 32 for CIFAR10 and CIFAR100, and an embedding size of 64 for STL-10, Tiny ImageNet and ImageNet (recall that the embeddings of the two projection heads have the same dimension).

As a common practice when using ResNet-like architectures for small-size image resolutions, in all the experiments we have a first convolutional layer with kernel size 3, stride 1 and padding 1. Additionally, in case of CIFAR10 and CIFAR100, we remove the first max pooling layer.

For ImageNet experiments we use ResNet-50 with SGD optimizer and momentum equal to 0.9. We use a learning rate and cosine learning rate decay. We train with mini-batch of size pairs, embedding size of and for 200 epochs. Additionally we apply Gaussian blur as a data augmentation.
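A cosine learning-rate decay can be sketched as below. The base learning rate used here is a placeholder chosen for illustration, not the value used in the experiments, and decaying all the way to zero at the last epoch is an assumption.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr):
    """Cosine learning-rate decay from base_lr (epoch 0) down to 0 (last epoch)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Placeholder base learning rate (hypothetical), 200 training epochs.
schedule = [cosine_lr(e, total_epochs=200, base_lr=0.1) for e in range(201)]
```

The rate starts at the base value, passes through half of it at mid-training and decreases smoothly afterwards, with the slowest change at the very beginning and at the very end.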

Image Transformation Details. In Fig. 3 we show some examples of positive pairs extracted from the same image instance (Sec. 3). We extract crops with random size from to of the original area and a random aspect ratio from to of the original aspect ratio, which is a commonly used data-augmentation technique. We also apply horizontal mirroring with probability . Finally, we apply color jittering with probability and grayscaling with probability .

5.1 Linear Classification Protocol

The most common evaluation protocol for unsupervised feature learning is based on freezing the network encoder after unsupervised pre-training, and then training a supervised linear classifier on top of it. Specifically, the linear classifier is a fully-connected layer followed by softmax, which is placed on top of the encoder after removing both projection heads.

In all the experiments we train the linear classifier for 500 epochs using the Adam optimizer and the labeled training set of each specific dataset, without data augmentation. The learning rate is exponentially decayed from to . The weight decay is . In ImageNet experiments we use the evaluation protocol of [19].
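An exponentially decayed learning rate over a fixed number of epochs can be sketched as follows. The start and end values are placeholders chosen for illustration only, and the function name `exponential_lr` is hypothetical.

```python
def exponential_lr(epoch, total_epochs, lr_start, lr_end):
    """Learning rate decayed geometrically from lr_start to lr_end."""
    ratio = lr_end / lr_start
    return lr_start * ratio ** (epoch / (total_epochs - 1))

# Placeholder endpoints (hypothetical), 500 epochs as for the linear classifier.
schedule = [exponential_lr(e, 500, lr_start=1e-2, lr_end=1e-6) for e in range(500)]
```

With this schedule the ratio between the learning rates of two consecutive epochs is constant, so the rate drops by the same factor at every epoch.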

5.1.1 Ablation Study.

In Tab. 2 we compare different loss functions with each other on CIFAR10 and CIFAR100. In all the experiments we use a ResNet-18 as the encoder, which is always trained for 200 unsupervised epochs. The batch size is (where positive pairs, see Sec. 3). InfoNCE refers to the contrastive loss version shown in Eq. (2), largely used in many recent self-supervised works. Normalized InfoNCE is the L2-normalized version of InfoNCE used in [19] and in our full method as well (Sec. 4). W-MSE is our whitening-based loss, introduced in Sec. 3. Finally, Collaborative Projections With Whitening is our full method, which combines Normalized InfoNCE with W-MSE using two dedicated projection heads (see Sec. 4).

In the case of InfoNCE and Normalized InfoNCE, a temperature parameter must be set (Sec. 2). For these two losses, we performed a separate grid search on the temperature, using common value ranges reported in the literature, and choosing, independently for each loss, the value which achieves the best result across all the datasets (i.e., the temperature is kept fixed for both datasets). In Tab. 2 we report the results corresponding to the best temperature value we found, which is for InfoNCE and for Normalized InfoNCE. For the second head of our full method, we use the same temperature setting as in Normalized InfoNCE. These temperature values are then used in all the other experiments (Tab. 3-7). Note that W-MSE has no loss-specific hyperparameters to tune.

Figure 4: Comparison of different self-supervised losses on CIFAR10 and CIFAR100. The encoder is a ResNet-18, which is trained with a different number of epochs (shown in the x-axis).

Method                                     CIFAR10   CIFAR100
InfoNCE                                    79.98     54.27
Normalized InfoNCE                         85.53     56.96
W-MSE                                      86.08     57.47
Collaborative Projections With Whitening   86.91     57.79
Table 2: Classification accuracy (top 1) results of different loss functions on CIFAR10 and CIFAR100. The encoder is a ResNet-18.

Tab. 2 shows that the proposed loss, W-MSE, is significantly better than InfoNCE. The results we obtained with other datasets, encoders and evaluation protocols confirm this finding (see Tab. 3-7). With a lower margin, W-MSE also outperforms Normalized InfoNCE. The combination of W-MSE and Normalized InfoNCE, obtained using our two-head solution, further improves the classification accuracy on both datasets, showing that these two losses bring a partially complementary supervision signal to the encoder.

In Fig. 4 we plot the linear classification accuracy on CIFAR10 and CIFAR100, as a function of the number of epochs used to pre-train the corresponding encoder. These plots show that both W-MSE and our full method can accelerate the unsupervised training: their accuracy curves are higher than those of Normalized InfoNCE and InfoNCE almost consistently over all the epochs, and especially in the initial epochs.

5.1.2 Comparison with the State of the art.

We use CIFAR10, CIFAR100, STL-10, Tiny ImageNet and ImageNet to compare our methods, W-MSE and our full method, against various unsupervised approaches. The results are reported in Tab. 3, 4, 5 and 6, split according to the datasets and the capacity of the backbone networks. In all the experiments, W-MSE and our full method significantly outperform all the other methods. Specifically, in Tab. 3 we also outperform the very recent results reported in [1] (AMDIM), where a specific network with a higher capacity (about 32M parameters) is used. For reference, in Tab. 3 and 5 we also report fully-supervised results. It is worth mentioning that in the large-scale experiments on ImageNet (Tab. 6), our full method also shows significant improvements over the recent methods [19, 21].

Method             Accuracy
Fully-supervised   93.62
DIM [23]           80.95
CPC [40]           77.45
AMDIM [1]          89.5
InfoNCE            86.27
W-MSE              90.11
Table 3: Classification accuracy (top 1) results on CIFAR10. All the methods are based on a ResNet-50 encoder, except AMDIM, which is based on a customized, bigger architecture. CPC result is reported in [23]. Fully-supervised result is published at

Method                    Accuracy
K-means Network [5] (a)   60.1
HMP [2] (a)               64.5
Stacked AE [45] (a)       74.33
Exemplar [12] (a)         75.4
Ye et al. [44]            77.9
InfoNCE                   78.18
W-MSE                     83.05
Table 4: Classification accuracy (top 1) results on STL-10. Ye et al. [44], InfoNCE and our methods are based on a ResNet-18 encoder. (a) indicates results reported in [44].

Method                                     STL10   Tiny ImageNet
Fully-supervised                           68.7    36.60
DIM [23]                                   70.00   38.09
InfoNCE                                    83.14   35.34
W-MSE                                      85.04   40.76
Collaborative Projections With Whitening   85.69   42.00
Table 5: Classification accuracy (top 1) results on STL-10 and Tiny ImageNet with AlexNet-based encoder. Fully-supervised results are reported in [23].

Method                                     Top 1   Top 5
MoCo [19]                                  60.6    -
CPC [21]                                   63.8    85.3
W-MSE                                      64.48   85.97
Collaborative Projections With Whitening   66.29   86.94
Table 6: Classification accuracy results on ImageNet. All listed methods are based on a ResNet-50 architecture.

Method                     Accuracy
Dundar et al. [13] (a)     74.1
Cutout [9] (a)             87.3
Oyallon et al. [33] (a)    87.6
DeepCluster [3] (a)        73.4
ADC [18] (a)               56.7
DIM [23] (a)               77.0
IIC [26] (a)               88.8
InfoNCE                    87.09
W-MSE                      88.28
Table 7: Fully and semi-supervised classification on STL10. (a) indicates results reported in [26]. indicates fully supervised method.

5.2 Semi-Supervised Finetuning

In this section we use the finetuning protocol presented in IIC [26] in order to show the potential of our learned representations when trained in a semi-supervised fashion. Specifically, the encoder (ResNet-34) is pre-trained as usual, using all the 105K training samples of STL-10 (see Sec. 5). Note that in this phase we also use the 5K labeled training images but we do not use their corresponding labels, pretending these images are unlabeled. In the second, supervised stage, following the protocol described in [26], we train an MLP on top of the encoder, simultaneously finetuning the encoder. For this stage, only the 5K labeled training images are used, together with the specific data-augmentation procedure described in [26]. The evaluation is reported in Tab. 7. Our full method establishes a new state-of-the-art semi-supervised result on this dataset, significantly outperforming IIC [26] (which is also based on a ResNet-34 encoder) and previous work as well. In this case, InfoNCE achieves inferior results with respect to the previous state of the art [26].

6 Conclusions

In this paper we proposed a new self-supervised loss, W-MSE, which is an alternative to common loss functions used in the field. Differently from the triplet loss and the contrastive loss, both of which are based on comparing an instance-level similarity against other samples, W-MSE computes only the intra-positive distances, while using a whitening transform to avoid degenerate solutions. We empirically show that W-MSE achieves results consistently better than InfoNCE, the most common version of the contrastive loss, and can be jointly used with the latter using dedicated latent spaces. Our full method outperforms other state-of-the-art unsupervised approaches on different datasets and using different evaluation protocols.


  • [1] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In NeurIPS, Cited by: §2, §2, §2, §3, §5.1.2, Table 3.
  • [2] L. Bo, X. Ren, and D. Fox (2013) Unsupervised feature learning for rgb-d based object recognition. In Experimental robotics, pp. 387–402. Cited by: Table 4.
  • [3] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. Lecture Notes in Computer Science, pp. 139–156. Cited by: Table 7.
  • [4] A. Coates, A. Y. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In AISTATS, Cited by: §1, 4th item.
  • [5] A. Coates and A. Y. Ng (2011) Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems 24, pp. 2528–2536. Cited by: Table 4.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: 2nd item.
  • [7] D. Dereniowski and K. Marek (2004) Cholesky factorization of matrices in parallel and ranking of graphs. In 5th Int. Conference on Parallel Processing and Applied Mathematics, Cited by: §2.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL,, Cited by: §1, §1.
  • [9] T. DeVries and G. W. Taylor (2017)

    Improved regularization of convolutional neural networks with cutout

    External Links: 1708.04552 Cited by: Table 7.
  • [10] J. Donahue, P. Krähenbühl, and T. Darrell (2017) Adversarial feature learning. In ICLR, Cited by: §2.
  • [11] J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. In NeurIPS, Cited by: §2.
  • [12] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems 27, pp. 766–774. Cited by: Table 4.
  • [13] A. Dundar, J. Jin, and E. Culurciello (2015) Convolutional clustering for unsupervised learning. External Links: 1511.06241 Cited by: Table 7.
  • [14] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman (2019) Temporal cycle-consistency learning. In CVPR, Cited by: §1, §2.
  • [15] B. Efron (1982) The jackknife, the bootstrap, and other resampling plans. Vol. 38, Siam. Cited by: §3.
  • [16] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Cited by: §1.
  • [17] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In CVPR, Cited by: §1, §2.
  • [18] P. Häusser, J. Plapp, V. Golkov, E. Aljalbout, and D. Cremers (2018) Associative deep clustering: training a classification network with no labels. In Pattern Recognition - 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings, Lecture Notes in Computer Science, Vol. 11269, pp. 18–32. Cited by: Table 7.
  • [19] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722. Cited by: §1, §1, §2, §2, §2, §3.1, §3, §3, §4, §5.1.1, §5.1.2, §5.1, Table 6.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.
  • [21] O. J. Hénaff, A. Razavi, C. Doersch, S. M. A. Eslami, and A. van den Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv:1905.09272. Cited by: §1, §2, §2, §2, §3.1, §3, §5.1.2, Table 6.
  • [22] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv:1703.07737. Cited by: §1, §1.
  • [23] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In ICLR, Cited by: §1, §1, §2, §2, §2, §2, §3, Table 3, Table 5, Table 7.
  • [24] L. Huang, D. Yang, B. Lang, and J. Deng (2018) Decorrelated batch normalization. In CVPR, Cited by: §2.
  • [25] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §2, §3.1.
  • [26] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In ICCV, Cited by: §1, §2, §2, §3, §5.2, Table 7.
  • [27] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical Report. Cited by: 1st item, 3rd item.
  • [28] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781. Cited by: §1.
  • [29] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, Cited by: §1, §1.
  • [30] I. Misra and L. van der Maaten (2019) Self-supervised learning of pretext-invariant representations. arXiv:1912.01991. Cited by: §1, §2.
  • [31] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, Cited by: §1, §2, §2.
  • [32] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §1, §2.
  • [33] E. Oyallon, E. Belilovsky, and S. Zagoruyko (2017) Scaling the scattering transform: deep hybrid networks. In ICCV, Cited by: Table 7.
  • [34] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §2.
  • [35] M. Ravanelli and Y. Bengio (2018) Learning speaker representations with mutual information. arXiv:1812.00271. Cited by: §2.
  • [36] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: A unified embedding for face recognition and clustering. In CVPR, Cited by: §1, §1.
  • [37] A. Siarohin, E. Sangineto, and N. Sebe (2019) Whitening and Coloring Batch Transform for GANs. In ICLR, Cited by: §1, §2, §3.
  • [38] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In NIPS, Cited by: §1.
  • [39] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2019) On mutual information maximization for representation learning. arXiv:1907.13625. Cited by: §1, §2.
  • [40] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748. Cited by: §1, §2, §2, §2, Table 3.
  • [41] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In ICML, Cited by: §2, §2.
  • [42] X. Wang and A. Gupta (2015) Unsupervised learning of visual representations using videos. In ICCV, Cited by: §1, §2, §2.
  • [43] Z. Wu, Y. Xiong, S. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance-level discrimination. arXiv:1805.01978. Cited by: §1, §1, §2, §2, §2, §3.
  • [44] M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In CVPR, Cited by: Table 4.
  • [45] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun (2015) Stacked what-where auto-encoders. External Links: 1506.02351 Cited by: Table 4.