
Unsupervised Representation Learning by Discovering Reliable Image Relations

by   Timo Milbich, et al.

Learning robust representations that allow relations between images to be established reliably is of paramount importance for virtually all of computer vision. Annotating the quadratic number of pairwise relations between training images is simply not feasible, while unsupervised inference is prone to noise, leaving the vast majority of these relations unreliable. To nevertheless find those relations which can be reliably utilized for learning, we follow a divide-and-conquer strategy: we find reliable similarities by extracting compact groups of images and reliable dissimilarities by partitioning these groups into subsets, converting the complicated overall problem into a few reliable local subproblems. For each of the subsets we obtain a representation by learning a mapping to a target feature space so that their reliable relations are kept. Transitivity relations between the subsets are then exploited to consolidate the local solutions into a concerted global representation. While iterating between grouping, partitioning, and learning, we can successively use more and more reliable relations which, in turn, improves our image representation. In experiments, our approach shows state-of-the-art performance on unsupervised classification on ImageNet with 46.0% accuracy and performs comparably to the state of the art on transfer learning tasks on PASCAL VOC.



1 Introduction

The driving force of deep learning has been supervised training using vast amounts of tediously labeled training samples, such as object bounding boxes for visual recognition. Since easily accessible visual data is growing exponentially, manual labeling of training samples constitutes a bottleneck to utilizing all this valuable data. Consequently, there has recently been great interest in weakly supervised Zhang et al. (2018), self-supervised Noroozi and Favaro (2016), and unsupervised Xiong et al. (2017); Milbich et al. (2017) approaches to representation learning. Fundamental computer vision problems like classification Zhang et al. (2019a); Rubio et al. (2015), object detection Zhang et al. (2019b) and image segmentation Randrianasoa et al. (2018) all directly depend on such learned representations to find similar objects or group related image areas.
To learn a characteristic representation of images and the distances between them, different degrees of supervision can be considered: (i) supervised learning using samples with class labels Ma et al. (2019), (ii) user feedback providing weakly supervised side information in terms of pairwise constraints Sohn (2016), (iii) problem-specific surrogate tasks such as colorization Larsson et al. (2017), permutations Noroozi and Favaro (2016), or transitivity Wang et al. (2017), and (iv) unsupervised feature learning Bojanowski and Joulin (2017). Regardless of the training signal, be it unaries such as class labels Sanakoyeu et al. (2018), binary similarity constraints between samples Sohn (2016) or sample ordering constraints Ren et al. (2019), a dataset of training samples gives rise to pairwise relations, exploitable for learning our representation. In the absence of supervisory information, these relations need to be automatically inferred during training. However, the vast majority of these inferred pairwise relations turn out to be unreliable, as discussed in Sect. 3, Fig. 2, and Fig. 3. Despite the danger of diminished performance due to learning from spurious relations, recent approaches to unsupervised representation learning Caron et al. (2018); Bojanowski and Joulin (2017) nevertheless do not question the reliability of these relations. Now, assuming that only a small fraction of correct relations per sample can be identified reliably (i.e. we are left with at most class labels or pairwise link constraints), how can we discover those few reliable relations, when no label or guiding side information is available?

Figure 1: Overview of our iterative learning procedure. We first find reliable similarity constraints by forming compact groups. To avoid unreliable dissimilarities, we partition the data into sets of mutually dissimilar groups. Based on these (dis-)similarity constraints between/within groups, we learn a local representation for each subset. Finally, we exploit sparse couplings between the local representations to arrive at a consolidated global representation. This iterative procedure improves the overall representation by successively adding reliable constraints into the learning process.

In this work we propose a novel approach to visual representation learning that explicitly identifies and leverages reliable image relations without the need for annotations, supervision, problem-specific surrogate tasks for self-supervision, or pre-training. By extracting compact groups of images we are able to harness reliable similarities. Subsequently, we divide these compact groups constituting the overall learning problem into smaller (potentially overlapping) subproblems, such that each contains only reliable dissimilarities between its groups. Thus, whereas the complicated global problem suffers from many of the relations not being reliable, we ensure that the samples in each subproblem are either reliably similar or reliably dissimilar. Optimization is then performed by learning a mapping from the images into a dedicated target space, built to reflect the structure and distribution estimated from the reliable relations for each subset. Next, coupling the local subproblems by utilizing transitivity between their samples allows us to consolidate the learned individual representations into a concerted global representation. Finally, by alternating between extracting reliable relations and learning, we successively incorporate more reliable relations and in turn more data, which ultimately improves our image representation (cf. Fig. 1).

We evaluate our model on challenging benchmarks and achieve state-of-the-art performance on the ImageNet dataset, thus demonstrating the scalability of our approach. Further, our approach performs comparably to the state of the art in transfer learning on PASCAL VOC, indicating its general applicability. By performing ablation and analysis studies, we finally provide insights into our learning procedure.

2 Related Work

Leveraging explicit relationships between training images for representation learning is widely studied using different orders of constraints (e.g. binary similarities Sohn (2016), ranking constraints Ren et al. (2019)). All these methods use label information or strong pretraining, as the performance of such tasks depends heavily on the quality of the pairwise constraints. However, densely labeling all pairwise constraints is infeasible.
Due to the difficulty of extracting reliable relations from data, many recent label-free approaches resort to generic prior assumptions on the data distribution or to single image- and class-based tasks. (Deep) clustering methods Caron et al. (2018); Sanakoyeu et al. (2018) rely on a predefined number of pseudo-classes, which is typically estimated by heuristics, and further focus on similarity constraints only. Our model explicitly models both similarity and dissimilarity constraints estimated from the data itself. Bojanowski and Joulin (2017) find a mapping between images and a uniformly discretized target space, thus enforcing their representation to resemble a distribution of pairwise relationships independent of the actual data structure. Sanakoyeu et al. (2018) cluster data into small surrogate classes to perform a global classification task, however, not considering that similar images may end up in competing, different classes. Thus, inferred relationships during training suffer from contradicting training signals. DeepCluster Caron et al. (2018) also follows this strategy based on disjoint k-means clustering, thus enforcing clear, distinct boundaries which potentially disagree with the real data distribution. In contrast, our grouping process does not enforce hard class boundaries and is able to adapt to the data structure. Moreover, by splitting groups into reliable subproblems and constructing a learning problem following their distance distribution, our groups corroborate their training signal. Dosovitskiy et al. (2016) cast distance learning as an exemplar classification task utilizing heavy data augmentations at the cost of poor scalability to large data collections.
Self-supervised learning approaches aim to leverage the data itself, typically by solving surrogate tasks based on temporal Wang and Gupta (2015) and spatial Noroozi and Favaro (2016) coherence. These approaches are either domain specific or operate on images independently, thus missing out on their relationships. Our work, in contrast, explicitly models relationships between images. Gidaris et al. (2018) exploit image geometry and classify rotations applied to input images. Even though they report good results on image classification tasks on large datasets, this task is conceptually dependent on large variations in the underlying data distribution to avoid trivial image representations, potentially missing out on fine-grained relationships between images.

Generative models based on GAN-like Goodfellow et al. (2014) and VAE-like Kingma and Welling (2013) architectures recently became a popular choice for unsupervised learning. These approaches learn mappings between images and a latent space driven by a generative task and thus implicitly learn an image representation. Unfortunately, such approaches typically suffer from limited applicability to large datasets with high variance due to the difficult optimization problem. Further, their training is regularized using data-independent priors, e.g. by enforcing the learned feature space to follow a Gaussian distribution. Our approach, on the other hand, explicitly learns an image representation using data-dependent constraints and scales to large datasets.

3 Approach

Figure 2: Nearest neighbour distance ratios. For most of the pairwise relations, the ratio of the distances to consecutive sorted neighbors is close to 1. Only a small fraction have a robust ordering. This analysis is based on samples from the STL-10 Coates et al. (2011) dataset using Euclidean distances based on an unsupervised representation.

Let us now learn a representation that allows us to relate image samples to one another. This is equivalent to learning a distance, i.e. learning a representation such that given image relations are reflected and preserved in the embedding space. Thus, learning is propelled by pairwise relationships between images: in supervised training, these relationships are typically defined on the basis of manually provided class labels, weak user feedback, problem-specific surrogate tasks, dense triplet ranking constraints, or other sparse partial ordering constraints.

3.1 Reliable Relations for Learning a Representation:

Regardless of the origin of these relations, only a small number of all possible pairwise relations (quadratic in the number of training samples) may be feasibly provided by manual annotation. For the particular case of unsupervised learning, only a small number of the pairwise relations can be inferred correctly for training with high confidence. Let us consider a triplet of images in which the ground-truth distance between the first and the second image is small and the distance between the first and the third image is large. A learned distance is correct if it obeys these ground-truth constraints. Furthermore, to ensure robustness to noise, these constraints should be obeyed reliably by a clear margin, i.e. the ratio of the smaller to the larger learned distance should be clearly below 1. In Fig. 2 we plot this ratio for consecutive nearest neighbors. This is the same ratio used in Lowe (2004) to measure the reliability of a matching similar image. We observe that for only a small fraction of pairwise distances the ratio is significantly smaller than 1. All other relations would not even have an opportunity to exhibit the above correctness constraints reliably and, thus, would be falsely identified to be correct rather than spurious. Further, in Fig. 3 we plot the sorted similarities for a query image. Observe from Fig. 2 that the above triplet constraints can only be fulfilled where there is significant slope in Fig. 3. Thus, only relations from both ends, where we have strong (dis-)similarities, can be considered to be reliable. These relations are significantly less susceptible to change under noise than the vast majority, as analyzed in Fig. 4. However, recent work on unsupervised learning has nevertheless simply relied on all pairwise relations Bojanowski and Joulin (2017) inferred during training at the cost of incorporating corrupted relations. In contrast, we now present an approach for unsupervised representation learning which explicitly aims at extracting and leveraging these reliable relations.
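To make the ratio test concrete, the following is a minimal sketch of how ratios of consecutive sorted neighbour distances could be computed for one query. It is an illustration only: the function names, the Euclidean metric on toy Gaussian features, and the `margin` value are assumptions and are not taken from the paper.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbor_distance_ratios(query, samples):
    """Ratios d_i / d_(i+1) of the distances to consecutive sorted neighbours."""
    dists = sorted(euclidean(query, s) for s in samples)
    return [d1 / d2 for d1, d2 in zip(dists, dists[1:]) if d2 > 0]


def reliable_fraction(ratios, margin=0.2):
    """Fraction of neighbour pairs ordered by a clear margin (ratio < 1 - margin)."""
    if not ratios:
        return 0.0
    return sum(r < 1.0 - margin for r in ratios) / len(ratios)


# Toy features standing in for an unsupervised representation (assumption).
random.seed(0)
points = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
ratios = neighbor_distance_ratios(points[0], points[1:])
print(f"fraction with robust ordering: {reliable_fraction(ratios):.3f}")
```

Because the distances are sorted in ascending order, every ratio lies in [0, 1]; ratios close to 1 mark neighbour pairs whose ordering could flip under small perturbations.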

Figure 3: Sorted pairwise similarities for a query image based on different representations and distance metrics resulting from supervised and unsupervised training on STL-10 Coates et al. (2011). Only the strong (dis-)similarities at both ends are reliable enough to provide a robust ordering.

3.2 Outline of our Iterative Representation Learning:

We first decompose the training set into subsets of images by extracting (Sec. 3.3) and dividing (Sec. 3.4) potentially overlapping compact groups of images, exhibiting reliable mutual similarity, based on the representation from the previous training iteration. Within each subset all mutual image relations are reliable. In each iteration, learning a representation (Sec. 3.5) then proceeds as follows: for each subset we seek an embedding. To learn it, we randomly sample target points such that the distribution of their pairwise distances matches that of the subset. Learning the mapping from images to targets then yields the local representations (Sec. 3.5.2). In a final step, all these local representations are merged into a single overall representation by exploiting transitivity relations between those samples which are shared among subsets (Sec. 3.5.3). Based on this overall representation, reliable relations are then extracted again to serve as input for the next iteration. Since no annotations are provided, the first iteration of our training starts from scratch with a randomly initialized CNN and random assignments.

3.3 Compact Groups for Finding Reliable Similarities

Figure 4: Sorted pairwise image similarities for an STL-10 query image. Color indicates the amount of Gaussian noise needed for each image to change its rank w.r.t. the query.

Every training iteration builds upon the distances learned in the previous round. Since the majority of inferred relations between our training samples is not reliable, how can we find the ones we can rely upon without annotations? In Fig. 2 to Fig. 4 we empirically demonstrate that only a few nearest neighbours (NN) can robustly be identified. Unfortunately, even these do not always give rise to correct relations. For instance, two samples may spuriously have a small distance, or a sample may be an outlier, thus corrupting the nearest neighbors for a given sample. This issue can however be alleviated by forming compact groups of images. Fig. 5 (a) shows that considering dense groups of samples increases the chance of inferring correct similarity relations, following the intuition that in dense areas of our feature space pairwise relations do not arise accidentally but due to actual commonalities. For different group sizes on ImageNet, the plot illustrates the average percentage of correct image relations in a group based on their ground-truth labels (blue). As we observe, the chance that samples appear erroneously close to one another (forming a compact group) becomes increasingly unlikely as the group size increases. On the other hand, large compact groups are scarce and therefore only a small number of samples can be covered for large group sizes (red).
Let the largest pairwise distance between members of a group represent its compactness. When building groups of random samples, it is highly unlikely to form a group with a small compactness value, i.e. a group of correctly mutually close samples. Thus, we demand the compactness of each group to be smaller than a fixed percentile of the compactness of a set of randomly built groups of equal size. Consequently, we extract reliable groups by starting with a seed sample and adding its nearest neighbors as long as the resulting compactness is below this percentile of the compactness of the randomly sampled groups of equal size. Thus, we grow each group to be as large as possible. We denote the resulting set of all possible (potentially overlapping) groups as the set of groups with reliable pairwise similarity relationships.
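The group-growing rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `grow_group` and `random_compactness_threshold`, the percentile value, the number of random trials, the `max_size` cap, and the toy data are all hypothetical, and brute-force neighbour search replaces the FAISS retrieval mentioned in Sec. 4.1.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compactness(group, data):
    """Largest pairwise distance between the members of a group (indices into data)."""
    return max((euclidean(data[i], data[j])
                for i in group for j in group if i < j), default=0.0)


def random_compactness_threshold(data, size, percentile, trials, rng):
    """Given percentile of the compactness of randomly built groups of equal size."""
    values = sorted(compactness(rng.sample(range(len(data)), size), data)
                    for _ in range(trials))
    return values[min(len(values) - 1, int(percentile / 100.0 * len(values)))]


def grow_group(seed, data, percentile=1.0, trials=200, max_size=8, rng=None):
    """Add nearest neighbours of the seed while the group stays unusually compact."""
    rng = rng or random.Random(0)
    order = sorted((i for i in range(len(data)) if i != seed),
                   key=lambda i: euclidean(data[seed], data[i]))
    group = [seed]
    for nn in order[:max_size - 1]:
        candidate = group + [nn]
        threshold = random_compactness_threshold(
            data, len(candidate), percentile, trials, rng)
        if compactness(candidate, data) >= threshold:
            break  # the candidate is no more compact than random groups
        group = candidate
    return group


# Toy data (assumption): 5 tightly clustered points plus 300 scattered points.
rng = random.Random(0)
data = [[rng.gauss(0.0, 0.01) for _ in range(8)] for _ in range(5)]
data += [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(300)]
group = grow_group(0, data, rng=rng)
print(group)
```

The `max_size` cap is only a convenience to bound the runtime of the sketch; the text above grows each group for as long as the percentile criterion holds.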

Figure 5: (a) Group correctness vs. data coverage w.r.t. group size. Blue: average correctness of relations in a group. Red: data coverage for groups of different size extracted on ImageNet using our representation. We call a relation correct if it links images with the same label. Note that we only use labels for evaluation in this plot, but not for extracting our groups. Coverage: fraction of all samples that can be covered by compact groups of a given size relative to the overall number of extracted groups. (b) Data vs. target distribution. Distribution of all pairwise intra-group and inter-group distances (blue) based on a fully supervised trained representation, of points uniformly sampled from a unit sphere (orange), and of a Gaussian distribution (yellow). Evidently, most data points are far apart, approximated by the orange mode. But there are also characteristic compact hubs approximated by the yellow mode (which has been magnified for the purpose of this illustration). Note that the sampled distributions can approximate the data distribution.

3.4 Reliable Dissimilarity by Partitioning Groups

Finding Reliable Dissimilarities:

The extracted compact groups provide small distances that we can reliably use for learning a representation. However, the relationships between the groups can be arbitrary and many are unlikely to be reliable, as discussed above and in Fig. 2. Therefore, to further increase reliability, instead of using all relations between the groups for learning our representation, we partition the groups into subsets. By distributing overlapping groups across different subsets while maximizing the distance between groups within a subset, we gather the reliable dissimilarity relationships from the tail of the distance distributions shown in Fig. 3. Thus, all relations in each subset are as reliable as possible.

Optimization Problem:

Formally, we partition the set of groups using the following criteria: (i) all groups within a subset should be mutually distant, (ii) partially overlapping groups should be distributed across different subsets to establish couplings between the subproblems exploitable for transitivity relations, and (iii) the union of all subsets should cover as much of the set of groups as possible to maximize the usage of training samples. Using these constraints, we formulate the partitioning process as the following optimization problem.
Let an assignment matrix encode the membership of samples in groups. Furthermore, a column vector indicates the groups assigned to a given subset, so that stacking these vectors yields the assignments of groups to all subsets. Moreover, a distance matrix contains the mean pairwise distances between any two groups.

The objective then becomes


where one term, which is regularized for the diagonal elements, enforces constraint (i), i.e. maximal distance between groups in a subset, a second term maximizes (ii), i.e. the distribution of groups across the subsets, and a third term enforces (iii), i.e. maximal coverage of the set of groups. Weighting terms adjust the relative impact of the individual constraints. This optimization problem can be efficiently solved following CliqueCNN Sanakoyeu et al. (2018). That work addressed a similar problem to find a discrete partitioning of images into surrogate classes for optimizing a standard classification task, similar to DeepCluster Caron et al. (2018). However, both Sanakoyeu et al. (2018) and Caron et al. (2018) use all resulting surrogate classes without considering their reliability. Thus, they are inevitably prone to introducing noise into the learning process. In contrast, we seek a partitioning of the set of already extracted groups into subsets to further increase the reliability of our relations, which we then use to construct a dedicated learning problem.
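For intuition about criteria (i) and (ii), the following is a deliberately simple greedy heuristic. It is not the optimization procedure of the paper (which follows CliqueCNN) and it assigns every group to at most one subset; the function names and the toy distance matrix are assumptions made for illustration.

```python
def overlaps(group_a, group_b):
    """True if two groups share at least one sample."""
    return bool(set(group_a) & set(group_b))


def greedy_partition(groups, dist, num_subsets):
    """Greedily place each group into the subset where it is farthest from the rest.

    groups: list of lists of sample indices
    dist:   dist[g][h] is the mean pairwise distance between groups g and h
    """
    subsets = [[] for _ in range(num_subsets)]
    for g in range(len(groups)):
        # Criterion (ii): overlapping groups must end up in different subsets.
        allowed = [k for k in range(num_subsets)
                   if not any(overlaps(groups[g], groups[h]) for h in subsets[k])]
        if not allowed:
            continue  # group stays unassigned, which lowers coverage (iii)

        # Criterion (i): prefer the subset whose closest member is farthest away.
        def score(k):
            return min((dist[g][h] for h in subsets[k]), default=float("inf"))

        subsets[max(allowed, key=score)].append(g)
    return subsets


# Toy example (assumption): groups 0 and 1 overlap in sample 2.
groups = [[0, 1, 2], [2, 3], [10, 11], [20, 21]]
dist = [[0, 1, 5, 9],
        [1, 0, 6, 2],
        [5, 6, 0, 7],
        [9, 2, 7, 0]]
partition = greedy_partition(groups, dist, num_subsets=2)
print(partition)  # [[0, 3], [1, 2]]
```

A greedy pass like this depends on the order in which groups are visited, which is one reason a joint optimization over all assignments, as described above, is preferable.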

3.5 Unsupervised Representation Learning

Figure 6: Learning feature vectors by sampling a target space. Large distances between groups are captured by sampling their centroids uniformly from the surface of a hypersphere. Then target feature points for the members of each group are sampled from a Gaussian around the group's centroid to represent the compact group.

Now we learn a representation that preserves all the reliable relations in a subset. To this end, we build a target space by randomly sampling target points such that the distribution of their pairwise distances matches the distances in the subset. Finding a representation for the data points then corresponds to solving an assignment problem to find their targets while simultaneously learning the corresponding mappings.

3.5.1 Constructing the target spaces

We sample the targets from the surface of a high-dimensional sphere with local Gaussian hubs accounting for the compact groups. More precisely, we first sample centroids uniformly from the surface of the D-dimensional unit hypersphere (where D is the dimensionality of the representation). Then we sample the targets from Gaussians around these centroids with a fixed covariance matrix, as illustrated in Fig. 6. Note that this sampling process is derived from and approximates the actual data distribution, as illustrated in Fig. 5 (b). Using the representation of a fully supervised model as a proxy for ground-truth image relations (for this experiment only), we observe that the distribution of pairwise intra-group and inter-group distances (blue distribution) exhibits two distinctive modes, as intuitively expected: a large mode representing mediocre to large inter-group distances and a minor second mode of small intra-group distances reflecting dense neighbourhoods of mutually similar samples. Sampling based on Gaussians uniformly distributed on a hypersphere explicitly approximates this distribution (orange and yellow distributions). In contrast, other approaches such as Bojanowski and Joulin (2017) rely on a data-independent prior which does not sufficiently account for dense neighbourhoods of highly similar data points and consequently diminishes the expressiveness of the representation to be learned.
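The two-stage sampling described above can be sketched in a few lines. The sketch assumes an isotropic covariance (a single standard deviation `sigma`), which is one possible choice of a fixed covariance matrix; the function names, the dimensionality, and the value of `sigma` are illustrative assumptions.

```python
import math
import random


def sample_unit_sphere(dim, rng):
    """Uniform sample from the surface of the unit hypersphere.

    Normalizing a standard Gaussian vector yields a uniformly distributed direction.
    """
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sample_targets(group_sizes, dim, sigma, rng):
    """For each group: one centroid on the sphere plus Gaussian targets around it."""
    targets = []
    for size in group_sizes:
        centroid = sample_unit_sphere(dim, rng)
        members = [[c + rng.gauss(0.0, sigma) for c in centroid]
                   for _ in range(size)]
        targets.append((centroid, members))
    return targets


rng = random.Random(1)
targets = sample_targets(group_sizes=[3, 5], dim=16, sigma=0.05, rng=rng)
for centroid, members in targets:
    spread = max(math.dist(centroid, m) for m in members)
    print(len(members), round(spread, 3))
```

With a small `sigma`, distances between targets of the same group stay small while centroids of different groups are typically far apart, which mirrors the two modes of the distance distribution discussed above.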

3.5.2 Learning the Representations

Learning is now formulated as establishing correspondences between the images from a group and the targets. This requires minimizing the distances between the image representations and their assigned targets. We obtain the assignments and the representation by jointly minimizing the optimization problem


We solve this problem by alternating between two steps: (i) Find optimal assignments based on the current representation. Since global solvers for such an assignment problem typically exhibit a prohibitive cost that is cubic in the number of input samples, we adopt the efficient algorithm of Bojanowski and Joulin (2017), which uses stochastic local updates to approximate the Hungarian method. (ii) Given an assignment, we optimize the representation by learning a CNN whose weights minimize the distances between the image representations and their assigned targets. By alternating between these two steps (we re-assign images to targets every 3 epochs), the model needs to reason about which targets apply to which images and which images should be assigned next to each other. Thus it needs to find an optimal arrangement of groups on the target space preserving their relations and further needs to infer meaningful relations between groups. At initialization, we start with a randomly initialized CNN and, thus, random initial weights and random assignments.
Using this learning process, we obtain a representation for each of our subsets, each inducing its own distances under the metric used in our implementation.
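Step (i) can be illustrated with a much simplified local update: repeatedly pick two samples at random and swap their targets whenever the swap lowers the summed squared distance. This is only a stand-in for the stochastic updates of Bojanowski and Joulin (2017), not a reproduction of their algorithm, and step (ii), the CNN update, is omitted. All names and the toy data are assumptions.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_cost(features, targets, assign):
    """Sum of squared distances between each feature and its assigned target."""
    return sum(sq_dist(f, targets[assign[i]]) for i, f in enumerate(features))


def local_reassign(features, targets, assign, num_swaps, rng):
    """Pairwise swaps of assigned targets; a swap is kept only if it lowers the cost."""
    assign = list(assign)
    n = len(assign)
    for _ in range(num_swaps):
        i, j = rng.sample(range(n), 2)
        keep = (sq_dist(features[i], targets[assign[i]])
                + sq_dist(features[j], targets[assign[j]]))
        swap = (sq_dist(features[i], targets[assign[j]])
                + sq_dist(features[j], targets[assign[i]]))
        if swap < keep:
            assign[i], assign[j] = assign[j], assign[i]
    return assign


# Toy features and targets (assumption); the initial assignment is the identity.
rng = random.Random(0)
features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
targets = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
initial = list(range(50))
refined = local_reassign(features, targets, initial, num_swaps=2000, rng=rng)
print(total_cost(features, targets, initial), total_cost(features, targets, refined))
```

Each accepted swap changes only two terms of the cost, so an update costs constant time per pair, in contrast to a global solver over all samples.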

3.5.3 Coupling Subproblems to Consolidate their Representations

Figure 7: Triplet constraints for consolidating subproblems. A reliable relation between two subsets couples them locally: the solid black arrow links two samples, one from each subset, which appear in the same group (green group). Further, we find samples in the second subset which are reliably dissimilar to the linked sample (dashed black lines). We now have a triplet relating these samples, which was previously unknown to the first subset.

We now have an ensemble of representations, each representing a different subset of the data. The goal is now to consolidate their underlying learned relationships into a global representation reflecting all of the data. Therefore, we look for reliable relations between different subsets to establish local links that allow us to locally transfer relations from one subset into the other. Groups which are (partially) shared among subsets, due to their overlap and their distribution to different subsets by the partitioning, act as anchors for such links, as illustrated in Fig. 7. Using transitivity we thus find triplets across subsets, which allows us to transfer information about previously unknown relations from one local representation to another.
Let a first sample be a member of one subset but not of a second subset, and let a second and a third sample be members of the second subset. Assume that the first and the second sample appear in the same group, thus providing a reliable similarity between them and consequently between both subsets. Similarly, we assume the second and the third sample to have a reliable dissimilarity. Using transitivity, the triplet thus implies an ordering constraint under the representation of the first subset, i.e., we want the first sample to be closer to the second than to the third. Note that these are additional relations imputed from the second subset, which were previously not present in the first. Consider the set of all such triplets deduced by transitivity between the two subsets. We now incorporate this additional information by refining the local representation, solving the triplet ranking problem Wu et al. (2017)


The margin parameter controls the margin between the similar and the dissimilar sample with respect to the anchor sample. Note that the relations between a subset and the other subsets are only sparse. Thus the potentially large computational complexity of the triplet ranking loss does not dominate the complexity of the overall approach.
However, optimizing this objective alone would ignore and potentially forget about the dense relationships within the subset exploited by the assignment objective of Sec. 3.5.2. Thus we combine both objectives,


Optimizing (4) retains the constraints from within the subset while incorporating couplings to other subsets to improve its local representation. Due to the additional inter-set relations, the local representation effectively now covers more of the data than before the refinement.
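The deduction of triplets by transitivity and a margin-based ranking penalty can be sketched as follows. The data layout (groups as sets of sample ids, subsets as lists of group names), the function names, and the standard hinge form `max(0, d(anchor, positive) - d(anchor, negative) + margin)` are assumptions made for illustration; the exact objective of the paper is not reproduced here.

```python
import math


def members(subset, groups):
    """All sample ids covered by the groups of one subset."""
    covered = set()
    for g in subset:
        covered |= groups[g]
    return covered


def transitive_triplets(subset_k, subset_l, groups):
    """Triplets (anchor, positive, negative) that couple subset_k to subset_l.

    anchor:   sample that belongs only to subset_k
    positive: sample shared by a group of subset_k and an overlapping group of subset_l
    negative: sample from a different group of subset_l (reliably dissimilar)
    """
    only_k = members(subset_k, groups) - members(subset_l, groups)
    triplets = []
    for gk in subset_k:
        for gl in subset_l:
            for pos in sorted(groups[gk] & groups[gl]):
                for anchor in sorted(groups[gk] & only_k):
                    for other in subset_l:
                        if other == gl:
                            continue
                        for neg in sorted(groups[other] - groups[gl]):
                            triplets.append((anchor, pos, neg))
    return triplets


def triplet_hinge_loss(embedding, triplets, margin):
    """Sum of hinge penalties; zero when every positive is closer by the margin."""
    loss = 0.0
    for a, p, n in triplets:
        d_pos = math.dist(embedding[a], embedding[p])
        d_neg = math.dist(embedding[a], embedding[n])
        loss += max(0.0, d_pos - d_neg + margin)
    return loss


# Toy example (assumption): groups A and B overlap in sample 2.
groups = {"A": {0, 1, 2}, "B": {2, 3}, "C": {7, 8}}
triplets = transitive_triplets(["A"], ["B", "C"], groups)
print(triplets)  # [(0, 2, 7), (0, 2, 8), (1, 2, 7), (1, 2, 8)]
```

Only samples in overlapping groups generate triplets, so the number of triplets stays small compared with the dense relations inside a subset.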
In the last training iteration, one local representation becomes the final global representation. We conducted experiments using different aggregation strategies (averaging, random selection, etc.) and observed that after the final iteration all local representations capture the dataset nearly equally well, allowing us to randomly select one.

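Input: set of training images
Parameters: number of training iterations; number of subsets
Output: global representation
Train initial representation:
        sample one target per image from the uniform distribution on the unit hypersphere
        learn the initial representation by optimizing Eq. 2
Iterative learning:
for each training iteration do
        Extract reliable image relations:
                SampleRandomGroups() // different sizes
                for each seed sample do
                        grow a compact group from its nearest neighbors // Sec. 3.3
                PartitionGroups() // Sec. 3.4
        Train local representations: // Sec. 3.5.2
                for each subset do
                        sample centroids from the uniform distribution on the unit hypersphere
                        sample targets from a Gaussian around each centroid
                        while not converged do
                                alternate between re-assigning targets and updating the CNN
        Refine local representations: // Sec. 3.5.3
                for each subset do
                        sample centroids from the uniform distribution on the unit hypersphere
                        sample targets from a Gaussian around each centroid
                        while not converged do
                                optimize the combined assignment and triplet ranking objective
        Update the global representation by randomly choosing one local representation
Algorithm 1: Unsupervised Representation Learning by Extracting Reliable Image Relations.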

As our overall iterative approach starts from a random network initialization, initially we have no representation from which to extract reliable (dis-)similarities for the grouping process in Sect. 3.3. Therefore, we train the first iteration by optimizing only the assignment problem (Eq. 2) based on the whole training set and randomly sample individual target points for each image, yielding our first representation. The iterative approach then gradually learns stronger representations from iteration to iteration by capturing more and more reliable relationships in our dataset.

3.5.4 Pseudo-Code for summary

For additional clarity we now present a pseudo-code overview for our iterative approach (cf. Algorithm 1).

4 Experiments

In the following, we evaluate the performance of our model on large-scale datasets and the usability of our learned representation for the tasks of classification, detection and segmentation. Further, we present ablation and analysis experiments providing insights into our iterative learning process.

Method                                        Acc@1
---------------------------------------------------------
RGB input
Supervised Bojanowski and Joulin (2017)       59.7
Random Noroozi and Favaro (2016)              12.0
Colorization Zhang et al. (2016)              35.2
Jigsaw Puzzles Noroozi and Favaro (2016)      38.1
BiGAN Donahue et al. (2017)                   32.2
RotNet Gidaris et al. (2018)                  43.8
---------------------------------------------------------
Gradient (Sobel) input
Supervised Bojanowski and Joulin (2017)       57.4
NAT Bojanowski and Joulin (2017)              36.0
DeepCluster Caron et al. (2018)               44.0
Ours (Initialization)                         30.0
Ours (Round 1)                                39.1
Ours (Round 2)                                44.6
Ours (Round 3)                                45.8
Ours (Round 4)                                46.0
Ours (mean ± std)                             45.8 ± 0.3

Table 1: Comparison of our method to other state-of-the-art unsupervised learning approaches on the ImageNet dataset. We report classification accuracy (%).

Method                                     Class (%mAP)   Det (%mAP)    Seg (%mIoU)
-----------------------------------------------------------------------------------
Trained layers                             all            all           all
Supervised                                 79.9           56.8          48.0
Random                                     57.0           44.5          30.1
Colorization Zhang et al. (2016)           65.6           46.9          35.6
BiGAN Donahue et al. (2017)                60.1           46.9          34.9
Jigsaw Puzzle Noroozi and Favaro (2016)    67.6           53.2          37.6
NAT Bojanowski and Joulin (2017)           65.3           49.4          -
Split-Brain Zhang et al. (2017)            67.1           46.7          36.0
Counting Noroozi et al. (2017)             67.7           51.4          36.6
RotationNet Gidaris et al. (2018)          73.0           54.4          39.1
DeepCluster Caron et al. (2018)            73.7           55.4          45.1
Ours                                       74.2           55.6          44.6
Ours (mean ± std)                          74.1 ± 0.3     55.5 ± 0.2    44.5 ± 0.3

Table 2: Comparing our model to state-of-the-art unsupervised approaches on PASCAL VOC 2007 classification, detection, and segmentation (measured in mean average precision and mean intersection over union).

4.1 Implementation Details and Benchmarking

We now evaluate our learned image representation on the ImageNet and PASCAL VOC datasets. We test on different vision tasks, namely image classification, object detection, and semantic segmentation, and compare our approach against state-of-the-art unsupervised methods. As preprocessing, we convert our images to gradient images (obtained using a Sobel filter) to avoid trivial solutions based on color, thus following the protocol of recent approaches Bojanowski and Joulin (2017); Caron et al. (2018). These works also conducted experiments on supervised ImageNet classification and concluded that gradient images yield performance similar to RGB inputs, which indicates a fair comparison. If not stated otherwise, for all experiments the number of subsets used for training is fixed to . In the grouping stage we set to be the percentile. We set the dimensionality of to and choose the parameters using cross-validation on the training set. The margin parameter is set to as suggested in various works Wu et al. (2017). We train our model using stochastic gradient descent with an initial learning rate of and momentum of . Building groups can be efficiently performed using the FAISS Johnson et al. (2017) library for fast nearest-neighbor retrieval with GPU support, thus leading to no significant computational overhead.
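The gradient-image preprocessing mentioned above can be illustrated with a minimal NumPy sketch. It is not the implementation used in our experiments: the function name `sobel_gradient_magnitude`, the edge padding, and the choice to return the gradient magnitude (rather than the two directional responses as separate channels) are assumptions made for illustration only.

```python
import numpy as np


def sobel_gradient_magnitude(gray):
    """Return the Sobel gradient magnitude of a 2-D grayscale image.

    Assumption: border pixels are handled by replicating the edge values.
    """
    gray = np.asarray(gray, dtype=np.float64)
    # Horizontal Sobel kernel; the vertical kernel is its transpose.
    kx = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])
    ky = kx.T
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 cross-correlation one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)


# Toy image with a vertical step edge between columns 2 and 3.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
edges = sobel_gradient_magnitude(image)

# A uniformly colored image carries no gradient information.
flat = sobel_gradient_magnitude(np.full((4, 4), 7.0))
```

In this toy example only the two columns adjacent to the step respond, while constant regions map to zero, which is why such inputs discourage solutions based purely on color.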


We evaluate the ability of our model to capture differences between objects and its ability to scale to large image collections on the ImageNet dataset Deng et al. (2009). This dataset is composed of 1.2M images distributed over 1,000 categories including subtle category boundaries (such as dog breeds). For fair comparison with other methods we use the AlexNet Krizhevsky et al. (2012) architecture with batch normalization. Our evaluation follows the standard protocol for unsupervised ImageNet pretraining: First, we perform our unsupervised training on a randomly initialized network without using any labels. Afterwards, the convolutional layers are fixed while the last layers are randomly reinitialized and trained using ImageNet labels. Table 1 compares our results with other state-of-the-art unsupervised approaches. Our method converges to its final performance of after training iterations (excluding initialization). Hence, we significantly improve upon all other unsupervised approaches including DeepCluster Caron et al. (2018) by , which is trained on all training data at the cost of also incorporating unreliable noisy information. In contrast, our model successfully leverages more and more reliable image relations over the iterations, thus alleviating the issue of noise. Additionally, we report mean and standard deviation of our approach (Ours (mean ± std)) over 5 runs. Note that all of the other methods report only their best run.

PASCAL VOC:

We now illustrate the generalization capability of our learned representation on different transfer learning tasks. We utilize our representation trained on ImageNet without labels (using the same architecture as above) and fine-tune it on the PASCAL VOC 2007 Everingham et al. (2007) classification, detection, and segmentation tasks (VOC 2012). For transfer learning we use the framework of Krähenbühl et al. (2016) for classification experiments, the Fast R-CNN Girshick (2015) framework for object detection, and the method of Shelhamer et al. (2017) for semantic segmentation. Our results are summarized in Table 2. On all transfer learning tasks, i.e. classification, detection, and segmentation, our approach achieves results comparable to the unsupervised state-of-the-art, which further demonstrates the expressive power of our representation. Additionally, we report mean and standard deviation of our approach (Ours (mean ± std)) over 5 runs. Note that all of the other methods report only their best run.

Method Acc (%)
Target coding Yang and Tang (2015) (supervised)
Wang et al. Wang and Tan (2016)
CliqueCNN Sanakoyeu et al. (2018)
Exemplar CNN Dosovitskiy et al. (2016)
Discr. Attr. Huang et al. (2016)
Chang et al. Chang et al. (2018)
Ours 75.3
Table 3: Comparison to other approaches based on average classification accuracy on STL-10, using the unlabeled split for training.

4.2 Analysis and Ablation

We now show ablation experiments and analyses to evaluate the individual parts of our approach and to provide insights into the iterative learning procedure.

4.2.1 Ablation studies

Ablation studies are conducted on the STL-10 dataset and summarized in Tab. 3 and Tab. 4.

STL-10 Performance:

We contrast our approach to state-of-the-art methods on STL-10. For a fair comparison, we use the same network architecture as Dosovitskiy et al. (2016) and train our model on the unlabeled split of the dataset. For this experiment we set . Tab. 3 shows that our proposed approach achieves performance competitive with unsupervised state-of-the-art approaches, namely ExemplarCNN Dosovitskiy et al. (2016), Discriminative Attributes Huang et al. (2016), and Chang et al. (2018). However, in contrast to ExemplarCNN Dosovitskiy et al. (2016) and Discriminative Attributes Huang et al. (2016), whose learning procedures rely on dense, costly (instance-level) exemplar classification tasks, our approach leverages only a sparse set of reliable constraints and thus scales to large datasets, as shown on ImageNet. Chang et al. (2018) leverage individual images in combination with strong augmentations, thus neglecting valuable information from direct relations between training samples. Further, they operate on a more powerful network architecture, which biases the comparison in their favor. Note that our approach outperforms CliqueCNN Sanakoyeu et al. (2018) by , which also learns from relations inferred by a grouping of images, however, without considering their reliability. This strongly indicates that the concept of reliability actually helps to reduce the amount of noise introduced into the learning process.

Method Acc. (%)
Ours (Initialization)
Ours (Round 1)
Ours (Round 2)
Ours (Round 3)
Ours (Round 4)
No decomposition into (Round 1)
Triplets Only (Round 1)
Table 4: Average classification accuracy on STL-10 for ablations of our approach.

The performance of our representation increases with each iteration and improves over our initial representation by , cf. Tab. 4. Now, to evaluate the effect of incorporating reliable dissimilarities, we train our model using only a single target space (No decomposition) for all groups for one iteration after initialization. As a result, there are only reliable relations within the , but many unreliable relationships between them. Due to the former, performance slightly increases over the initialization by . The latter explains the lower performance compared to our divide-and-conquer strategy. This highlights the importance of modelling reliable dissimilarities when learning visual representations.

Full model vs. triplet learning:

In this ablation experiment (Tab. 4, Triplets Only) we use our extracted reliable (dis-)similarity relationships to mine triplet constraints as input to a standard triplet-loss framework Wu et al. (2017). For a sample that serves as triplet anchor, reliable similarity relations act as positive constraints and reliable dissimilarity relations act as negative constraints. The aim of this experiment is to contrast our learning objective, Eq. (4), against popular ranking loss approaches. The massive drop of in performance can be explained by the dependence of such frameworks on hard-negative mining strategies. Since reliable constraints are only based on high similarities and dissimilarities, the ranking framework has no access to hard constraints, which are very difficult to find reliably without supervision Wu et al. (2017) or strong pretraining Ren et al. (2019). Note that we use triplet constraints only for transferring already learned information to refine our representations (Sec. 3.5.3) rather than as the driving overall learning signal.
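The lack of hard constraints described above can be made concrete with a small NumPy sketch. It uses a generic hinge-style triplet loss, not the exact formulation of Wu et al. (2017), and the helper names `mine_triplets` and `triplet_margin_loss`, the toy embeddings, and the margin value of 1.0 are all hypothetical choices for illustration; they are not the settings of our experiments.

```python
from itertools import product

import numpy as np


def mine_triplets(similar, dissimilar):
    """Build (anchor, positive, negative) index triplets.

    similar / dissimilar map an anchor index to the indices that are
    reliably similar / reliably dissimilar to it (hypothetical format).
    """
    triplets = []
    for anchor, positives in similar.items():
        negatives = dissimilar.get(anchor, [])
        triplets.extend((anchor, p, n) for p, n in product(positives, negatives))
    return triplets


def triplet_margin_loss(anchor, positive, negative, margin):
    """Mean hinge loss max(0, d(a, p) - d(a, n) + margin) over a batch."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())


# Toy 2-D embeddings: sample 1 is close to the anchor 0, samples 2 and 3
# are far away (reliably dissimilar), sample 4 is a nearby "hard" negative.
emb = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 4.0], [0.0, 1.5]])
similar = {0: [1]}
dissimilar = {0: [2, 3]}

triplets = mine_triplets(similar, dissimilar)
a, p, n = (np.array(idx) for idx in zip(*triplets))

# Margin 1.0 is an arbitrary illustration value.
easy_loss = triplet_margin_loss(emb[a], emb[p], emb[n], margin=1.0)
hard_loss = triplet_margin_loss(emb[[0]], emb[[1]], emb[[4]], margin=1.0)
```

In this toy setting the reliably dissimilar negatives already satisfy the margin, so their hinge terms are inactive and provide no gradient, whereas the hard negative, which a reliability criterion would not select, still contributes to the loss.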

Number of subsets:

To evaluate the sensitivity of our approach with respect to the number of subsets used for training, we train multiple models using different values of . Tab. 5 illustrates that training performance saturates for subsets. This indicates that further partitioning of the data, and thus further maximizing the dissimilarity between groups in , has no effect, since this has already been achieved to a sufficient degree.

4.2.2 Analysis of the model:

We now further analyze the effect of iterative training based on the model used for the ImageNet benchmark. The following experiments highlight the ability of our model to turn reliable (dis-)similarities into increasingly correct sample relations while simultaneously harnessing more and more data.

Correctness of image relations in groups :

In Sec. 3.3 and Sec. 3.4 we gather reliable relations driving the learning of our representation. Fig. 5 (a) illustrates that our procedure of extracting groups for reliable relations increases the probability of finding correct relationships. Moreover, successive iterations of training improve the learned image relations. Fig. 8 confirms this by measuring how many relations within a group disagree. One can see that in consecutive iterations, our groups exhibit more and more correct relationships. This shows that our model is able to extend the reliable relations learned from a group to relations which so far could not be identified as reliable. Consequently, the performance of our model increases.
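The correctness measure reported in Fig. 8 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each group is given as a list of sample indices whose first entry is the seed, and correctness is measured relative to the seed's ground-truth label. Both the data layout and the function names are hypothetical, and the labels are used for analysis only, never for training.

```python
import numpy as np


def group_correctness(group, labels):
    """Fraction of non-seed members that share the seed's label.

    Assumption: group[0] is the seed and the group has at least two members.
    """
    seed_label = labels[group[0]]
    members = group[1:]
    return float(np.mean([labels[m] == seed_label for m in members]))


def average_correctness(groups, labels):
    """Mean correctness over all groups."""
    return float(np.mean([group_correctness(g, labels) for g in groups]))


# Toy ground-truth labels for seven samples.
labels = np.array([0, 0, 0, 1, 1, 1, 2])
# Two toy groups; the first contains one member with a differing label.
groups = [[0, 1, 2, 3], [4, 5, 3]]

per_group = [group_correctness(g, labels) for g in groups]
avg = average_correctness(groups, labels)
```

Here the first group has correctness 2/3 and the second 1, giving an average of 5/6; tracking this quantity per group size and per training iteration yields a curve of the kind shown in Fig. 8.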

Number of Subsets Acc. (%)
no partitioning
2 subsets
5 subsets
10 subsets
20 subsets
Table 5: Average classification accuracy on STL-10 dataset as number of subsets is varied. Results are for one iteration of training after initialization.
Data coverage:

In our proposed method we deliberately explore an approach orthogonal to simply using all training samples: we reduce the noise introduced into the training process by using only reliable relations for learning. Consequently, there is a trade-off between exploring more data and introducing more noise due to unreliable relations. Fig. 9 shows the amount of training data covered in each training iteration. Overall and per subset, our grouping process covers more and more data samples, since representations improve and more relations become reliable. Between iteration and iteration the amount of data available for training is almost doubled to . Moreover, Fig. 8 shows that the additional data is meaningful and not just noise, since overall data quality improves simultaneously. This demonstrates that our model not only reinforces already available reliable relations but is able to generalize its representation to previously unused data, i.e., to discover new reliable relationships. Further, the result that we achieve state-of-the-art performance using 60% of the available data shows that (i) there exists an alternative to using all, and therefore noisy, relations between training samples, (ii) we are able to successfully find and exploit reliable samples, and (iii) the idea of considering data reliability for unsupervised representation learning is not fully solved, thus opening new promising directions for future research.
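The two coverage quantities plotted in Fig. 9 (overall coverage and mean coverage per subset) can be computed as in the following sketch. It assumes that each subset is represented simply as a collection of the sample indices contained in its groups; the function name `data_coverage` and the toy numbers are hypothetical and unrelated to our ImageNet results.

```python
def data_coverage(subsets, num_samples):
    """Return (overall coverage, mean per-subset coverage) as fractions.

    subsets: list of iterables of sample indices (assumed representation).
    num_samples: size of the full training set.
    """
    covered_overall = set()
    per_subset = []
    for subset in subsets:
        members = set(subset)
        # A sample counts once for the overall coverage even if it
        # appears in several subsets.
        covered_overall |= members
        per_subset.append(len(members) / num_samples)
    overall = len(covered_overall) / num_samples
    mean_per_subset = sum(per_subset) / len(per_subset)
    return overall, mean_per_subset


# Toy example: 8 training samples, two subsets sharing sample 2.
overall, mean_per_subset = data_coverage([[0, 1, 2], [2, 3, 4, 5]], num_samples=8)
```

In the toy example six of the eight samples are covered overall (0.75), while the subsets cover 0.375 and 0.5 of the data individually (mean 0.4375). The overall coverage can therefore exceed the per-subset mean considerably when subsets contain mostly different samples.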

Fixing the seed of groups

In Fig. 10 and Fig. 11 we present examples for groups of size and while training on the ImageNet dataset. To allow for a detailed comparison, we fix the seed elements and show how the constituents of a group change over the iterations. We observe that over the iterations the constituents of each group share more and more meaningful visual features. At the beginning the representation focuses on coarse visual commonalities like rough shape and scene. At the end relationships between constituents are dominated by true (intra-)class-specific features and similar pose.

Figure 8: Average correctness for groups of different size and in different training iterations. Average correctness is the fraction of members with the same ground-truth ImageNet class label (not used for training).
Typical sources of incorrectness

To explain common sources of incorrect relationships within groups, i.e., group constituents having different labels, we show typical examples in Fig. 12. As one can see, disagreements often arise due to subtle differences between the constituents’ classes (such as different dog breeds, hedgehogs vs. sea urchins, etc.) and misleading scene settings (such as a buildup of flags imitating a ship’s shape, a dog’s head above the surface while swimming looking like a duck, etc.).

Analysis of computational cost

As deriving exact computational complexities is often difficult, it is common practice for deep learning based methods to compare their computational cost based on their training times. For our model, learning one representation (including the refinement step) for a subset takes approx. 16h on a Titan X (Pascal), resulting in 14 days of training in total for the ImageNet dataset. Thus, our overall training time is comparable to the current state-of-the-art approach DeepCluster Caron et al. (2018) (12 days). Further, Tab. 6 compares the computational cost of our method with recent unsupervised representation learning approaches in relation to their performance. As we observe, peak performances on ImageNet are computationally costly. Further, we observe that RotNet Gidaris et al. (2018) offers the best trade-off between training efficiency and performance.

Figure 9: Fraction of training data covered by subsets in successive training iterations. Solid: overall data coverage. Dashed: mean data coverage per subset.

5 Conclusion and Future Work:

In our manuscript we have presented a novel iterative approach for unsupervised learning of visual feature representations. Experimental evaluation shows that our method yields a representation of competitive or even superior performance on the tasks of unsupervised classification and transfer learning compared to the state-of-the-art. We propose a novel technique for finding reliable relations (i.e. dis-/similarities) between training images which are likely to agree with the ground-truth. This reduces the amount of noise due to erroneous relations introduced into training, which is typically an issue when all possible training data relations are directly used. As using all relations is common practice in the vast majority of unsupervised learning literature, our work offers a new direction for future research: Instead of solely looking for more powerful surrogate tasks to compensate for missing supervision, we also address the question of which training samples and relations can be trusted and which are likely to obstruct learning. This question naturally leads to a trade-off between covering all available training data and exploiting only reliable relations for learning. Hence, future work should focus on further improving the estimation of reliability to increase data coverage while maintaining a high reliability for efficient learning. Further, as the concept of reliable relations is of general value and can be decoupled from the learning itself, we investigate its applicability in unsupervised learning in general.
We presented a technique for identifying reliable relations resulting in a set of local sub-problems. For each subproblem the data samples are either reliably similar or reliably dissimilar. Further, the learning process for each subproblem is formulated to preserve their reliable relations and approximate the actual distribution of distances of the training data. This stands in contrast to previous approaches whose feature space relies on data-independent prior assumptions which potentially disagree with the training data at hand. We then optimize each problem individually before using transitivity relations between them to efficiently merge their learned local representations into a single concerted representation. To increase the efficiency of our learning procedure, another line of future work addresses the trade-off between peak performance and computational cost. To this end, we investigate possible approximations of the introduced optimization and learning problem and analyze their implications on the learned representation.

Method Running time Acc@1
Supervised days (Titan X)
NAT Bojanowski and Joulin (2017) days (Titan X)
RotNet Gidaris et al. (2018) days (Titan X)
Jigsaw Puzzle Noroozi and Favaro (2016) days (Titan X)
BiGAN Donahue et al. (2017) days (Titan X)
DeepCluster Caron et al. (2018) days (Titan X)
Ours days (Titan X)
Table 6: Training time vs. performance on the ImageNet dataset. Comparison of training times is based on the reported timings in the manuscripts of each method using a single NVIDIA Titan X (Pascal). Additionally, their performance for Acc@1 on ImageNet is reported.
Figure 10: Groups of size for fixed seed image (pink) over the iterations while training the ImageNet model. Each column represents a group .
Figure 11: Groups of size for fixed seed images (pink) over the iterations while training the ImageNet model. Each column represents a group .
Figure 12: Examples of groups with incorrect constituent relationships (i.e. different labels) taken from the last iteration while training on ImageNet. Orange highlights the source of disagreement. Each column represents a group .


This work has been supported in part by DFG grant OM81/1-1 and a hardware donation from NVIDIA Corporation.

Author Biographies

Timo Milbich received his master's degree in Scientific Computing from Ruprecht-Karls-University, Heidelberg, in 2014. He is currently a Ph.D. candidate at the Heidelberg Collaboratory for Image Processing at Heidelberg University. His current research interests include computer vision focusing on deep representation and metric learning.
Omair Ghori received his master's degree in Information and Communication Engineering from TU Darmstadt. He is currently a senior data scientist at AGT International and is also pursuing his PhD at the Heidelberg Collaboratory for Image Processing at Heidelberg University. His research interests include action recognition and deep representation learning.
Ferran Diego received his master's degree and PhD in Computer Science from Autonomous University of Barcelona. He was a postdoctoral researcher in the IAL group at Heidelberg University, Germany. He is currently a Research Scientist at Telefonica R&D. His research interests include computer vision, machine learning, deep learning and neuroscience.

Björn Ommer received a diploma in computer science from the University of Bonn, Germany and a Ph.D. from ETH Zurich. After holding a postdoctoral position at the University of California at Berkeley he joined the Department of Mathematics and Computer Science at Heidelberg University as a professor, where he is heading the computer vision group. His research interests include computer vision, machine learning, and cognitive science.


  • P. Bojanowski and A. Joulin (2017) Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning, Cited by: §1, §2, §3.1, §3.5.1, §3.5.2, §4.1, Table 1, Table 2, Table 6.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision, Cited by: §1, §2, §3.4, §4.1, §4.1, §4.2.2, Table 1, Table 2, Table 6.
  • J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan (2018) Deep unsupervised learning with consistent inference of latent representations. Pattern Recognition (PR) 77, pp. 438 – 453. Cited by: §4.2.1, Table 3.
  • A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, Cited by: Figure 2, Figure 3.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
  • J. Donahue, P. Krähenbühl, and T. Darrell (2017) Adversarial feature learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: Table 1, Table 2, Table 6.
  • A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38 (9), pp. 1734–1747. Cited by: §2, §4.2.1, Table 3.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2007) The PASCAL Visual Object Classes Challenge 2007 (VOC2007). Cited by: §4.1.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, Cited by: §2, §4.2.2, Table 1, Table 2, Table 6.
  • R. Girshick (2015) Fast R-CNN. In International Conference on Computer Vision (ICCV), Cited by: §4.1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), Cited by: §2.
  • C. Huang, C. C. Loy, and X. Tang (2016) Unsupervised learning of discriminative attributes and visual representations. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2.1, Table 3.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with GPUs. Cited by: §4.1.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell (2016) Data-dependent initializations of convolutional neural networks. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §4.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, Cited by: §4.1.
  • G. Larsson, M. Maire, and G. Shakhnarovich (2017) Colorization as a proxy task for visual understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60. Cited by: §3.1.
  • Q. Ma, C. Bai, J. Zhang, Z. Liu, and S. Chen (2019) Supervised learning based discrete hashing for image retrieval. Pattern Recognition (PR) 92, pp. 156 – 164. Cited by: §1.
  • T. Milbich, M. Bautista, E. Sutter, and B. Ommer (2017) Unsupervised video understanding by reconciliation of posture similarities. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In The IEEE European Conference on Computer Vision (ECCV), Cited by: §1, §2, Table 1, Table 2, Table 6.
  • M. Noroozi, H. Pirsiavash, and P. Favaro (2017) Representation learning by learning to count. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 2.
  • J. F. Randrianasoa, C. Kurtz, É. Desjardin, and N. Passat (2018) Binary partition tree construction from multiple features for image segmentation. Pattern Recognition (PR) 84, pp. 237 – 250. Cited by: §1.
  • C. Ren, J. Li, P. Ge, and X. Xu (2019) Deep metric learning via subtype fuzzy clustering. Pattern Recognition (PR) 90, pp. 210 – 219. Cited by: §1, §2, §4.2.1.
  • J. C. Rubio, A. Eigenstetter, and B. Ommer (2015) Generative regularization with latent topics for discriminative object recognition. Pattern Recognition 48 (12). Cited by: §1.
  • A. Sanakoyeu, M. A. Bautista, and B. Ommer (2018) Deep unsupervised learning of visual similarities. Pattern Recognition (PR) 78, pp. 331 – 343. Cited by: §1, §2, §3.4, §4.2.1, Table 3.
  • E. Shelhamer, J. Long, and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §4.1.
  • K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems 29, Cited by: §1, §2.
  • D. Wang and X. Tan (2016) Unsupervised feature learning with c-svddnet. Pattern Recognition (PR) 60, pp. 473 – 485. Cited by: Table 3.
  • X. Wang and A. Gupta (2015) Unsupervised learning of visual representations using videos. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • X. Wang, K. He, and A. Gupta (2017) Transitive invariance for self-supervised visual representation learning. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  • C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl (2017) Sampling matters in deep embedding learning. In ICCV, Cited by: §3.5.3, §4.1, §4.2.1.
  • W. Xiong, L. Zhang, B. Du, and D. Tao (2017) Combining local and global: rich and robust feature pooling for visual recognition. Pattern Recognition (PR) 62, pp. 225 – 235. Cited by: §1.
  • S. Yang and X. Tang (2015) Deep representation learning with target coding. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI), Cited by: Table 3.
  • J. Zhang, K. Mei, Y. Zheng, and J. Fan (2019a) Learning multi-layer coarse-to-fine representations for large-scale image classification. Pattern Recognition (PR) 91, pp. 175 – 189. Cited by: §1.
  • P. Zhang, W. Liu, Y. Lei, and H. Lua (2019b) Hyperfusion-net: hyper-densely reflective feature fusion for salient object detection. Pattern Recognition (PR) 93, pp. 521 – 533. Cited by: §1.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In The IEEE European Conference on Computer Vision (ECCV), Cited by: Table 1, Table 2.
  • R. Zhang, P. Isola, and A. A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 2.
  • Y. Zhang, Y. Bai, M. Ding, Y. Li, and B. Ghanem (2018) Weakly-supervised object detection via mining pseudo ground truth bounding-boxes. Pattern Recognition (PR) 84, pp. 68 – 81. Cited by: §1.