Unsupervised Discovery of Object Landmarks via Contrastive Learning

06/26/2020 ∙ by Zezhou Cheng, et al. ∙ University of Massachusetts Amherst

Given a collection of images, humans are able to discover landmarks of the depicted objects by modeling the shared geometric structure across instances. This idea of geometric equivariance has been widely used for unsupervised discovery of object landmark representations. In this paper, we develop a simple and effective approach based on contrastive learning of invariant representations. We show that when a deep network is trained to be invariant to geometric and photometric transformations, representations from its intermediate layers are highly predictive of object landmarks. Furthermore, by stacking representations across layers in a hypercolumn their effectiveness can be improved. Our approach is motivated by the phenomenon of the gradual emergence of invariance in the representation hierarchy of a deep network. We also present a unified view of existing equivariant and invariant representation learning approaches through the lens of contrastive learning, shedding light on the nature of invariances learned. Experiments on standard benchmarks for landmark discovery, as well as a challenging one we propose, show that the proposed approach surpasses prior state-of-the-art.


1 Introduction

Learning in the absence of labels is a challenge for existing machine learning and computer vision systems. Despite recent advances, the performance of unsupervised learning remains far below that of supervised learning, especially for few-shot image understanding tasks Misra (2019). This paper considers the task of unsupervised learning of object landmarks from a collection of images. The goal is to learn representations from a large number of unlabeled images such that they allow accurate prediction of landmarks such as eyes and noses when provided with a few labeled examples.

One way of inferring structure is to reason about the global appearance in terms of disentangled factors such as geometry and texture. This is the basis of alignment-based Miller et al. (2000); Huang et al. (2012) and generative-modeling-based approaches for landmark discovery Zhang et al. (2018); Wiles et al. (2018); Shu et al. (2018); Jakab et al. (2018, 2020). An alternative is to learn a representation that geometrically deforms in the same way as the object, a property called geometric equivariance (Fig. 1a). However, a drawback of these approaches is that useful invariances may not be learned (e.g., the raw pixel representation itself is equivariant). As a result, these approaches are not robust to nuisance factors such as background clutter, occlusion, and other inter-image variations, limiting their applicability.

A different line of work has proposed contrastive learning as an unsupervised objective Hadsell et al. (2006); Dosovitskiy et al. (2014); Hjelm et al. (2019); Bachman et al. (2019); Lin (2018); Isola (2019); He et al. (2019); Chen et al. (2020a); Oord et al. (2018); Zhuang et al. (2019); Hénaff et al. (2019). This is commonly formulated as an objective defined over pairs of data. The goal is to learn a representation f that has higher similarity between an image x and its transformation t(x) than with a different image x′, i.e., sim(f(x), f(t(x))) > sim(f(x), f(x′)), as illustrated in Fig. 1b. Transformations are obtained using a combination of geometric (e.g., cropping and scaling) and photometric (e.g., color jittering and blurring) transformations, encouraging the representation to be invariant to these factors while being distinctive across images. Recent work He et al. (2019); Chen et al. (2020a) has shown that contrastive learning is effective, even outperforming ImageNet Deng et al. (2009) pretraining for a range of tasks and domains. However, in order to predict landmarks, a representation cannot be invariant to geometric transformations of an image. Moreover, the effectiveness of these methods in the few-shot setting has not been sufficiently studied. This paper asks the question: are equivariant losses necessary for unsupervised landmark discovery? In particular, do representations predictive of object landmarks automatically emerge in intermediate layers of a deep network trained to be invariant to image transformations, including geometric ones? While empirical evidence Zhou et al. (2016); Gonzalez-Garcia et al. (2018) suggests the emergence of semantic parts in deep networks trained on supervised tasks, is it also the case for unsupervised learning?

Figure 1: Equivariant and invariant learning. (a) Equivariant learning requires representations across locations to be invariant to a geometric transformation while being distinctive across locations. (b) Invariant learning encourages the representations to be invariant to transformations while being distinctive across images. Thus both can be seen as instances of contrastive learning. (c) Invariance increases while geometric equivariance decreases with depth when a deep network is trained to be invariant to both geometric and photometric transformations. (d) A hypercolumn representation allows a linear model to select the right trade-off between desired invariances and equivariance.

This work aims to address these by presenting a unified view of the two approaches. We show that when a deep network is trained to be invariant to geometric and photometric transformations, its intermediate layer representations are highly predictive of landmarks (Fig. 1b). The emergence of invariance and the loss of geometric equivariance is gradual in the representation hierarchy, a phenomenon that has been studied empirically Zeiler and Fergus (2014); Lenc and Vedaldi (2015) and theoretically Tishby et al. (2000); Tishby and Zaslavsky (2015); Achille and Soatto (2018) (Fig. 1c). As a result, its intermediate layer offers a better trade-off between desired invariances and equivariances for the landmark prediction task. This observation also motivates a hypercolumn representation Hariharan et al. (2015), resulting in more accurate landmark predictions (Fig. 1d). We also observe that objectives used in equivariant learning can be seen as a contrastive loss between pixel representations at different locations within the same image, unlike invariant learning where the contrastive loss is applied across images (Fig. 1a vs 1b). This sheds light on the nature of invariances learned by these techniques.

To validate these claims, we present experiments using off-the-shelf Residual Networks He et al. (2016) trained with Momentum Contrast (MoCo) He et al. (2019) on several benchmarks for landmark detection. We present a quantitative evaluation in a linear evaluation setting by varying the number of labeled examples. We also present a comparison by learning on a challenging dataset of birds from the iNaturalist dataset Van Horn et al. (2018) and evaluating on the CUB dataset Wah et al. (2011). Our approach is simple, yet it offers consistent improvements over prior approaches Thewlis et al. (2017b, a); Zhang et al. (2018); Jakab et al. (2018); Thewlis et al. (2019). While the hypercolumn representation leads to a larger embedding dimension, this comes at a modest cost: our approach outperforms the prior state-of-the-art Thewlis et al. (2019) with as few as 50 annotated training examples on the AFLW benchmark Koestinger et al. (2011) (Fig. 3). In fact, we observe that the hypercolumns of randomly initialized networks also offer good generalization, suggesting that both the architecture and the learning objective play an important role.

In summary, our contributions are as follows. We present a unified view of existing equivariant and the proposed invariant landmark representation learning approaches through the lens of contrastive learning — the former encourages intra-image invariance while the latter encourages inter-image invariance. We show that invariant representation learning using contrastive losses leads to more effective landmark representations than the prior state-of-the-art on several benchmarks. The improvements are pronounced on a more challenging dataset we propose. We analyze and motivate the effectiveness of our approach via the phenomenon of the gradual emergence of invariance in the representation hierarchy, and propose a hypercolumn representation which provides further benefits.

2 Related Work

Background. A representation Φ is said to be equivariant (or covariant) with a transformation g of the input x if there exists a map M_g such that Φ(g(x)) = M_g(Φ(x)). In other words, the representation transforms in a predictable manner given the input transformation. For natural images, the transformations can be geometric (e.g., translation, scaling, and rotation), photometric (e.g., color changes), or more complex (e.g., occlusion, viewpoint, or instance variations). Note that a sufficient condition for equivariance is that Φ is invertible, since one can take M_g = Φ ∘ g ∘ Φ⁻¹. Invariance is a special case of equivariance where M_g is the identity function, i.e., Φ(g(x)) = Φ(x). There is a rich history in computer vision on the design of covariant (e.g., SIFT Lowe (2004)) and invariant representations (e.g., HOG Dalal and Triggs (2005) and Bag-of-Visual-Words Sivic and Zisserman (2003)).
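To make these definitions concrete, here is a minimal PyTorch sketch (ours, not from the paper) that checks translation equivariance of a convolution and translation invariance of its globally pooled output; the layer sizes and the use of circular padding are illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A convolution (with circular padding, so shifts wrap around) is translation-equivariant:
# shifting the input shifts the feature map by the same amount, i.e., Phi(g(x)) = M_g(Phi(x)).
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular")
x = torch.randn(1, 3, 32, 32)
g = lambda t: torch.roll(t, shifts=(5, 7), dims=(2, 3))   # the transformation g: shift by (5, 7)

print(torch.allclose(conv(g(x)), g(conv(x)), atol=1e-5))  # True: equivariance

# Global average pooling discards the spatial index, so the pooled representation
# is invariant to the same transformation: Phi(g(x)) = Phi(x).
pool = lambda t: t.mean(dim=(2, 3))
print(torch.allclose(pool(conv(g(x))), pool(conv(x)), atol=1e-5))  # True: invariance
```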

Deep representations. Invariance and equivariance in deep network representations result from both the architecture (e.g., convolutions lead to translational equivariance, while pooling leads to translational invariance) and learning (e.g., invariance to categorical variations). Lenc and Vedaldi (2015) showed that early-layer representations of deep networks trained on ImageNet are nearly equivariant, as they can be “inverted” to recover the input, while later layers are more invariant. Similar observations have been made by visualizing these representations Mahendran and Vedaldi (2016); Zeiler and Fergus (2014). The gradual emergence of invariance can also be understood theoretically in terms of an “information bottleneck” in the feed-forward hierarchy Achille and Soatto (2018); Tishby et al. (2000); Tishby and Zaslavsky (2015). While equivariance to geometric transformations is relevant for landmark representations, the notion can be generalized to other transformation groups Gens and Domingos (2014); Cohen and Welling (2016).

Landmark discovery. Empirical evidence Zhou et al. (2016); Oquab et al. (2015) suggests that semantic parts emerge when deep networks are trained on supervised tasks. This has inspired architectures for image classification that encourage part-based reasoning, such as those based on texture representations Lin et al. (2015); Cimpoi et al. (2015); Arandjelovic et al. (2016) or spatial attention Sermanet et al. (2014); Xiao et al. (2015); Fu et al. (2017). In contrast, our work shows that parts also emerge when models are trained in an unsupervised manner. When no labels are available, equivariance to geometric transformations provides a natural self-supervisory signal. The equivariance constraint requires Φ_u(x), the representation of x at location u, to be invariant to a geometric transformation g of the image, i.e., Φ_u(x) = Φ_{g(u)}(g(x)) (Fig. 1a). This alone is not sufficient, since degenerate solutions, such as a constant representation or the raw pixels themselves, satisfy this property. Constraints based on locality Thewlis et al. (2017a, 2019) and diversity Thewlis et al. (2017b) have been proposed to avoid this pathology. Yet, inter-image invariance is not directly enforced. Another line of work is based on a generative modeling approach Zhang et al. (2018); Wiles et al. (2018); Shu et al. (2018); Jakab et al. (2018); Lorenz et al. (2019); Jakab et al. (2020). These methods implicitly incorporate equivariance constraints by modeling objects as deformations (or flows) of a shape template together with appearance variation in a disentangled manner. In contrast, our work shows that learning representations invariant to both geometric and photometric transformations is an effective strategy. These invariances emerge at different rates in the representation hierarchy, and one can select the appropriate ones with a small amount of supervision for the downstream task.

Unsupervised learning. Recent work has shown that unsupervised objectives based on density modeling Dosovitskiy et al. (2014); Hjelm et al. (2019); Bachman et al. (2019); Lin (2018); Isola (2019); He et al. (2019); Chen et al. (2020a); Oord et al. (2018) outperform unsupervised (or self-supervised) learning based on pretext tasks such as colorization Efros (2016a), rotation prediction Komodakis (2018), jigsaw puzzles Noroozi and Favaro (2016), and inpainting Efros (2016b). These contrastive learning objectives Hadsell et al. (2006) are often expressed in terms of noise-contrastive estimation (NCE) Gutmann and Hyvärinen (2010) (or maximizing mutual information Oord et al. (2018); Hjelm et al. (2019)) between different views obtained by geometrically and photometrically transforming an image. The learned representations thus encode invariances to these transformations while preserving information relevant for downstream tasks. However, the effectiveness of unsupervised learning depends on how well these invariances relate to those desired for end tasks. Despite recent advances, existing methods for unsupervised learning significantly lag behind their supervised counterparts in the few-shot setting Misra (2019). Moreover, their effectiveness for landmark discovery has not been sufficiently studied in the literature. (Note that MoCo He et al. (2019) was evaluated on pose estimation; however, the model was trained with 150K labeled examples and the entire network was fine-tuned.)

In part, it is not clear why invariance to geometric transformations might be useful for landmark prediction since we require the representation to carry some spatial information about the image. Understanding these trade-offs and improving the effectiveness of contrastive learning for landmark prediction is one of the goals of the paper.

3 Method

Denote by x an image of an object and by u its pixel coordinates. The goal of unsupervised landmark discovery is to learn an encoder Φ that generates a representation Φ_u(x) at each spatial location u of the input x that is predictive of object landmarks. We aim to learn a high-dimensional representation of landmarks. This is similar to Thewlis et al. (2019), which learns a local descriptor for each landmark, and unlike approaches that represent landmarks as a discrete set Zhang et al. (2015), or on a planar (Thewlis et al. (2017b); Zhang et al. (2018)) or spherical (Thewlis et al. (2017a)) coordinate system. In other words, the representation should be predictive of landmarks (as measured by a linear regressor), without requiring compactness or a particular topology in the embedding space.

We describe commonly used equivariance constraints for unsupervised landmark discovery Thewlis et al. (2017b, a, 2019), followed by models based on invariant learning Oord et al. (2018); He et al. (2019). We then present a unified view of the two approaches motivating our approach, which is to simply train a deep network to be invariant to all transformations and use features from intermediate layers as the landmark representation.

3.1 Equivariant and invariant representation learning

Equivariant learning. The equivariance constraint requires Φ_u(x), the representation of x at location u, to be invariant to the geometric deformation of the image (Fig. 1a). Given a geometric warping function g, the representation of x at u should be the same as the representation of the transformed image g(x) at g(u), that is, Φ_u(x) = Φ_{g(u)}(g(x)). This constraint can be captured by the loss:

L_eq(x; g) = Σ_{u∈Ω} ‖Φ_u(x) − Φ_{g(u)}(g(x))‖²    (1)

A diversity (or locality) constraint is necessary to encourage the representation to be distinctive across locations. For example, Thewlis et al. (2017a) proposed the following:

L_div(x; g) = Σ_{u∈Ω} ‖g(u) − argmax_{v∈Ω} ⟨Φ_u(x), Φ_v(g(x))⟩‖    (2)

which they replace by a probabilistic version that combines both losses:

L(x; g) = Σ_{u∈Ω} Σ_{v∈Ω} ‖g(u) − v‖ p(v|u)    (3)

Here p(v|u) is the probability of pixel u in image x matching pixel v in image x′ = g(x), with Φ as the encoder shared by x and x′, computed as below, and τ is a scale parameter:

p(v|u) = exp(⟨Φ_u(x), Φ_v(x′)⟩ / τ) / Σ_{t∈Ω} exp(⟨Φ_u(x), Φ_t(x′)⟩ / τ)    (4)
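As an illustration of Eqs. 1–4, the following PyTorch sketch computes the probabilistic equivariance loss for a single image pair given dense feature maps and a known correspondence map g(u); the temperature value, feature normalization, and tensor shapes are our assumptions for the sketch rather than the exact settings of Thewlis et al.

```python
import torch
import torch.nn.functional as F

def equivariance_loss(feat_x, feat_gx, grid_g, tau=0.1):
    """Probabilistic equivariance loss (Eqs. 3-4) for one image pair.

    feat_x:  (D, H, W) dense features of x
    feat_gx: (D, H, W) dense features of g(x)
    grid_g:  (H, W, 2) pixel coordinates g(u) in g(x) for every u in x
    """
    D, H, W = feat_x.shape
    f_x = F.normalize(feat_x.reshape(D, H * W), dim=0)    # (D, HW), unit-norm per pixel
    f_gx = F.normalize(feat_gx.reshape(D, H * W), dim=0)  # (D, HW)

    # p(v | u): softmax over inner products between pixel u of x and all pixels v of g(x)
    p = torch.softmax(f_x.t() @ f_gx / tau, dim=1)        # (HW, HW)

    # Coordinates v of every pixel in g(x), and targets g(u) for every u in x
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()   # (HW, 2)
    targets = grid_g.reshape(-1, 2).float()                         # (HW, 2)

    # Eq. 3: expected distance between g(u) and the matched location v
    dist = torch.cdist(targets, coords)                   # (HW, HW), ||g(u) - v||
    return (p * dist).sum(dim=1).mean()
```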

Invariant learning. Contrastive representation learning is based on similarity over pairs of inputs (Fig. 1b). Given an image x, its transformation x⁺ = t(x), and a set of other images {x⁻_i}, the InfoNCE loss Oord et al. (2018) minimizes:

L(x) = −log [ exp(⟨f(x), f(x⁺)⟩ / τ) / ( exp(⟨f(x), f(x⁺)⟩ / τ) + Σ_i exp(⟨f(x), f(x⁻_i)⟩ / τ) ) ]    (5)

where f denotes the image-level encoder and τ is a temperature parameter.

The objective encourages representations to be invariant to transformations while being distinctive across images. To address the computational bottleneck in evaluating the denominator, Momentum Contrast (MoCo) He et al. (2019) maintains the negative examples in a dictionary queue and updates the encoder that produces them with a momentum update of the parameters.
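For reference, a minimal sketch of the InfoNCE loss in Eq. 5, as it is typically implemented with a set of negatives (e.g., a MoCo queue), is shown below; it omits the momentum encoder and queue updates, and the embedding dimension and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss (Eq. 5) for a batch of images.

    query:     (B, D) embeddings f(x)
    positive:  (B, D) embeddings f(t(x)) of the transformed images
    negatives: (K, D) embeddings of other images (e.g., a MoCo queue)
    """
    q = F.normalize(query, dim=1)
    k_pos = F.normalize(positive, dim=1)
    k_neg = F.normalize(negatives, dim=1)

    l_pos = (q * k_pos).sum(dim=1, keepdim=True)   # (B, 1) similarities with positives
    l_neg = q @ k_neg.t()                          # (B, K) similarities with negatives

    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings (128-dim, 4096 negatives are placeholders).
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(4096, 128))
print(loss.item())
```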

Transformations. The space of transformations used to generate image pairs plays an important role in learning. A common approach is to apply a combination of geometric transformations, such as cropping, resizing, and thin-plate spline warps, as well as photometric transformations, such as color jittering, JPEG noise, and PCA color augmentation. However, transformations can also denote different color channels of an image or different modalities such as depth and color Isola (2019).
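As an example of such a pipeline, a torchvision-style composition of geometric and photometric transformations is sketched below; the specific parameter values are illustrative (roughly MoCo-style) and not necessarily those used in our experiments.

```python
from torchvision import transforms

# Two random draws from this pipeline yield the pair (t1(x), t2(x)) used by the contrastive loss.
augment = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.2, 1.0)),   # geometric: crop and rescale
    transforms.RandomHorizontalFlip(),                    # geometric: flip
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),           # photometric: color jitter
    transforms.RandomGrayscale(p=0.2),                    # photometric: grayscale
    transforms.ToTensor(),
])
```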

Hypercolumns. A deep network of n layers (or blocks; due to skip-connections, we cannot decompose the encoding over individual layers, but we can across blocks) can be written as Φ = Φ_n ∘ ⋯ ∘ Φ_2 ∘ Φ_1. The representation Φ^i(x) produced by the i-th block can be spatially interpolated to the input size to produce a pixel representation Φ^i_u(x) at location u. The hypercolumn representation for a set of layers {i₁, …, iₖ} is obtained by concatenating the interpolated features from the corresponding layers, that is, Φ_u(x) = [Φ^{i₁}_u(x); …; Φ^{iₖ}_u(x)].
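A sketch of hypercolumn extraction from a torchvision ResNet is given below: each block's output is bilinearly interpolated to a common 48×48 grid and concatenated along the channel dimension. The helper name and the use of an untrained ResNet18 are for illustration; in practice the MoCo-pretrained backbone would be loaded.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def hypercolumn(resnet, x, out_size=(48, 48)):
    """Concatenate interpolated activations of conv2_x..conv5_x (Hariharan et al., 2015)."""
    feats = []
    h = resnet.conv1(x)
    h = resnet.maxpool(resnet.relu(resnet.bn1(h)))
    for block in [resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4]:
        h = block(h)
        feats.append(F.interpolate(h, size=out_size, mode="bilinear", align_corners=False))
    return torch.cat(feats, dim=1)   # (B, 64+128+256+512, 48, 48) for ResNet18

resnet = models.resnet18(weights=None)   # torchvision >= 0.13; load MoCo weights in practice
x = torch.randn(2, 3, 96, 96)
print(hypercolumn(resnet, x).shape)      # torch.Size([2, 960, 48, 48])
```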

Our approach. Given a large unlabeled dataset, we first train invariant representations using contrastive learning He et al. (2019), with a combination of geometric and photometric transformations applied to images to generate pairs. We then extract single-layer or hypercolumn representations from intermediate layers to represent landmarks (Fig. 1b and Fig. 1d).

3.2 A unified view of the two approaches

Commonalities and differences. Equivariance is necessary but not sufficient for an effective landmark representation. It also needs to be distinctive or invariant to nuisance factors. This is enforced in the equivariance objective (Eq. 3) as a contrastive term over locations within the same image, since the loss is minimized when p(v|u) is maximized at v = g(u). This encourages intra-image invariance, unlike the objective of contrastive learning (Eq. 5), which encourages inter-image invariance. However, a single image may contain enough variety to provide some invariance. This is supported by its empirical performance and by recent work showing that representation learning is possible even from a single image YM. et al. (2020). Nevertheless, our experiments suggest that on challenging datasets, with objects in clutter, occlusion, and wider pose variation, inter-image invariance can be more effective.

Why do landmarks emerge during invariant learning? For this, we point to the phenomenon of the gradual emergence of invariance in the representation hierarchy Lenc and Vedaldi (2015); Achille and Soatto (2018); Tishby et al. (2000); Tishby and Zaslavsky (2015). In particular, equivariance to geometric transformations reduces with depth due to pooling operations, while invariance to nuisance factors increases with depth (Fig. 1c). Thus, picking an intermediate layer in the representation hierarchy may offer the desired trade-off between the geometric equivariance and invariances for predicting landmarks. Our experiments support this, and we find that it is possible to select the right layer with just a few labeled examples from the downstream task. The scheme can be further improved by stacking representations across layers in a hypercolumn (Sec. 4).

Is there any advantage of one approach over the other? Our experiments show that for a deep network of the same size, invariant representation learning can be just as effective (Tab. 1). Moreover, invariant learning is conceptually simpler and scales better than equivariance-based approaches, as the latter must maintain high-resolution feature maps across the hierarchy. Using a deeper network (e.g., ResNet50 vs. ResNet18) results in consistent improvements, outperforming DVE Thewlis et al. (2019) on four out of five datasets, as shown in Tab. 1. A drawback of our approach is that the representation is not directly interpretable or compact, which results in lower performance in the extreme few-shot case. However, as seen in Fig. 3a, this disadvantage disappears with as few as 50 training examples on the AFLW benchmark Koestinger et al. (2011). Moreover, invariant learning makes better use of data, achieving the same performance with half the unlabeled examples, as seen in Fig. 3c. We describe these experiments in detail in the next section.

4 Experiments

We first outline the datasets and implementation details of the proposed method (Sec. 4.1). We then evaluate our model and provide comparisons to the existing methods qualitatively and quantitatively on the landmark detection benchmarks, followed by a thorough ablation study (Sec. 4.2).

4.1 Benchmarks and implementation details

Human faces. We first compare the proposed model with prior art on existing human face landmark detection benchmarks. Following DVE Thewlis et al. (2019), we train our model on the CelebA dataset Liu et al. (2015) and evaluate on MAFL Zhang et al. (2015), AFLW Koestinger et al. (2011), and 300W Sagonas et al. (2013). Images overlapping with MAFL are excluded from CelebA. MAFL comprises 19,000 training images and 1000 testing images annotated with 5 face landmarks. Two versions of AFLW are used: AFLW_M, which contains 10,122 training images and 2995 testing images that are crops from MTFL Zhang et al. (2014), and AFLW_R, which contains tighter crops of the face images, with 10,122 for training and 2991 for testing. 300W provides 68 annotated face landmarks with 3148 training images and 689 testing images. We apply the same image pre-processing procedures as in DVE, the current state-of-the-art, for a direct comparison.

Birds. We collect a challenging dataset of birds where objects appear in clutter, under occlusion, and exhibit wider pose variation. We randomly select 100K images of birds from the iNaturalist 2017 dataset Van Horn et al. (2018) under the “Aves” class to train unsupervised representations. To evaluate performance in the few-shot setting, we collect a subset of the CUB dataset Wah et al. (2011) containing 35 species of the Passeroidea superfamily (the largest Aves taxon in iNaturalist), each image annotated with 15 landmarks. We sample at most 60 images per class, which results in 1241 images as our training set, 382 as the validation set, and 383 as the testing set (see Appendix E for details).

Evaluation. We use landmark regression as the end task for evaluation. Following Thewlis et al. (2017a, 2019), we train a linear regressor to map the representations to landmark annotations, keeping the representation frozen. The landmark regressor is a linear regressor per target landmark: it applies 1×1 filters on top of the d-dimensional representation to generate 50 intermediate heatmaps, which are converted to spatial coordinates by a soft-argmax operation; these 50 coordinates are finally converted to the target landmark by a linear layer (see Appendix A for details). We report errors as a percentage of the inter-ocular distance on the face benchmarks and the percentage of correct keypoints (PCK) on CUB. A prediction is considered correct according to the PCK metric if its distance to the ground truth is within 5% of the longer side of the image. Occluded landmarks are ignored during evaluation. We did not find fine-tuning to be uniformly beneficial but include a comparison in Appendix A.
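A minimal sketch of such a regressor is shown below: 1×1 convolutions produce 50 intermediate heatmaps per landmark, a soft-argmax converts each heatmap into expected coordinates, and a linear layer maps these coordinates to the target landmark. For simplicity the final linear layer is shared across landmarks here, whereas the paper fits a separate regressor per landmark; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class LandmarkRegressor(nn.Module):
    """Linear landmark regressor on frozen features (one head per landmark)."""

    def __init__(self, feat_dim, num_landmarks, num_maps=50):
        super().__init__()
        # 1x1 convolutions: num_maps intermediate heatmaps per landmark.
        self.to_heatmaps = nn.Conv2d(feat_dim, num_landmarks * num_maps, kernel_size=1)
        # Linear layer mapping the num_maps soft-argmax coordinates to one (x, y) landmark.
        # (Shared across landmarks in this sketch; the paper uses one regressor per landmark.)
        self.to_landmark = nn.Linear(num_maps * 2, 2)
        self.num_landmarks, self.num_maps = num_landmarks, num_maps

    @staticmethod
    def soft_argmax(heatmaps):
        # heatmaps: (B, C, H, W) -> expected (x, y) coordinates in [0, 1], shape (B, C, 2)
        B, C, H, W = heatmaps.shape
        probs = heatmaps.reshape(B, C, -1).softmax(dim=-1).reshape(B, C, H, W)
        ys = torch.linspace(0, 1, H, device=heatmaps.device).view(1, 1, H, 1)
        xs = torch.linspace(0, 1, W, device=heatmaps.device).view(1, 1, 1, W)
        return torch.stack([(probs * xs).sum(dim=(2, 3)), (probs * ys).sum(dim=(2, 3))], dim=-1)

    def forward(self, feats):                      # feats: (B, feat_dim, H, W), kept frozen
        maps = self.to_heatmaps(feats)             # (B, K * num_maps, H, W)
        coords = self.soft_argmax(maps)            # (B, K * num_maps, 2)
        coords = coords.reshape(-1, self.num_landmarks, self.num_maps * 2)
        return self.to_landmark(coords)            # (B, K, 2) predicted landmarks

regressor = LandmarkRegressor(feat_dim=3840, num_landmarks=5)
print(regressor(torch.randn(2, 3840, 48, 48)).shape)   # torch.Size([2, 5, 2])
```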

Implementation details. We use MoCo He et al. (2019) to train our models on CelebA or iNat Aves for 800 epochs with a batch size of 256 and a dictionary size of 4096. ResNet18 or ResNet50 He et al. (2016) is used as the backbone. We extract hypercolumns Hariharan et al. (2015) per pixel by stacking activations from the second (conv2_x) to the last convolutional block (conv5_x). We resize the feature maps from the selected convolutional blocks to the same spatial size as DVE Thewlis et al. (2019) (i.e., 48×48). We also follow DVE (with the Hourglass network) in resizing and center-cropping the input images for the face datasets; images are resized without any cropping on the bird dataset. For the comparison with DVE on the CUB dataset we use their publicly available implementation.

4.2 Results on landmark detection and ablation studies

Quantitative results. Tab. 1 presents a quantitative evaluation on multiple benchmarks. On faces, the proposed model with a ResNet18 significantly outperforms several prior works Thewlis et al. (2017b, a) and DVE with SmallNet. Our model with a ResNet50 achieves state-of-the-art results on all benchmarks except 300W. On CUB (with representations learned on iNat Aves), the approach outperforms the prior state-of-the-art Thewlis et al. (2019) by a large margin, suggesting improved invariance to nuisance factors.

Method # Params (M) Unsuper. MAFL AFLW_M AFLW_R 300W CUB
Inter-ocular distance (%, lower is better) on faces; PCK (%, higher is better) on CUB
TCDCN Zhang et al. (2015) 7.95 7.65 5.54
RAR Xiao et al. (2016) 7.23 4.94
MTCNN Zhang et al. (2014) 5.39 6.90
Wing Loss Feng et al. (2018) 4.04
Generative modeling based
Structural Repr. Zhang et al. (2018) 3.15 6.58
FAb-Net Wiles et al. (2018) 3.44 5.71
Deforming AE Shu et al. (2018) 5.45
ImGen. Jakab et al. (2018) 2.54 6.31
ImGen.++ Jakab et al. (2020) 5.12
Equivariance based
Sparse Thewlis et al. (2017b) 6.67 10.53 7.97
Dense 3D Thewlis et al. (2017a) 4.02 10.99 10.14 8.23
DVE SmallNet Thewlis et al. (2019) 0.35 3.42 8.60 7.79 5.75
DVE Hourglass Thewlis et al. (2019) 12.61 2.86 7.53 6.54 4.65 61.91
Invariance based
Ours (ResNet18) 11.24 2.57 8.59 7.38 5.78 62.24
Ours (ResNet50) 23.77 2.44 (14.7) 6.99 (7.2) 6.27 (4.1) 5.22 (-12.3) 68.63 (17.6)
Table 1: Results on landmark detection. Comparison on the face benchmarks (MAFL, AFLW_M, AFLW_R, and 300W) and the CUB dataset. We report the error as a percentage of the inter-ocular distance on the human face datasets (lower is better) and the percentage of correct keypoints (PCK) on the CUB dataset (higher is better). Our approach outperforms prior work on four out of five benchmarks. Values in parentheses in the last row denote the relative (%) error reduction over DVE Hourglass Thewlis et al. (2019).

Qualitative results. Fig. 2 shows qualitative results of landmark regression on human faces and birds. We notice that both DVE and our model are able to localize the foreground object accurately. However, our model localizes many keypoints better (e.g., on the tails of the birds) and is more robust to background clutter (e.g., the last column of Fig. 2b).

Figure 2: Detected landmarks on (a) human faces from MAFL, AFLW, and 300W (blue: predictions, green: ground truth) and (b) CUB birds, comparing ground truth, DVE, and our method. Notice that our method localizes the tails of birds (circled) much better. Zoom in for details.

Limited annotations. Fig. 3a and 3b compare our model with DVE Thewlis et al. (2019) using a limited number of annotations on the AFLW and CUB datasets, respectively. Our performance is better as soon as a modest number of training examples is available (e.g., 50 on AFLW and 250 on CUB). The gap in the extreme few-shot regime can be attributed to the higher-dimensional embedding of the hypercolumn representation, and it can be reduced by using a single-layer representation, as shown by the yellow line. Note that all unsupervised learning models (including DVE and ours) outperform the randomly initialized baseline on both the human face and bird datasets. We provide the numbers corresponding to Fig. 3 in Appendix F.

(a) Limited anno. on AFLW (b) Limited anno. on CUB (c) Unlabeled CelebA images
Figure 3: The effect of dataset size. (a) A comparison of our model with DVE Thewlis et al. (2019) by varying the number of annotations for landmark regression on the AFLW dataset. Random-SmallNet is a randomly initialized “small network” taken from Thewlis et al. (2019). Ours-ResNet50 curves are based on hypercolumn or fourth-layer representations trained using contrastive learning. (b) Similar results on the CUB dataset. Random-ResNet18 is trained from scratch on the CUB dataset. (c) Results of landmark regression on AFLW using different numbers of unlabeled images from CelebA for training.

Limited unlabeled data. Fig. 3c shows that our model matches the performance of DVE on AFLW using only 40% of the images on the CelebA dataset.

Hypercolumns. Tab. 2 compares the performance of individual-layer and hypercolumn representations. The activation from the fourth convolutional block consistently outperforms those from the other layers. For an input of size 96×96, the spatial dimension of the representation is 48×48 at Layer #1 and 3×3 at Layer #5, reducing by a factor of two at each successive layer. Thus, while the representation loses geometric equivariance with depth, contrastive learning encourages invariance, leaving Layer #4 with the best trade-off for this task. While the best layer can be selected with some labeled validation data, the hypercolumn representation provides further benefits everywhere except the very small data regime (Tab. 2 and Fig. 3a).

Dataset | #1 (64) | #2 (256) | #3 (512) | #4 (1024) | #5 (2048) | #4-#5 (3072) | #3-#5 (3584) | #2-#5 (3840) | #1-#5 (3904)
MAFL | 5.77 | 4.58 | 3.03 | 2.73 | 3.66 | 2.73 | 2.65 | 2.44 | 2.51
AFLW_M | 24.20 | 21.34 | 11.95 | 8.83 | 11.55 | 8.14 | 8.31 | 6.99 | 7.40
AFLW_R | 16.27 | 14.15 | 9.66 | 7.37 | 8.83 | 6.95 | 6.24 | 6.27 | 6.34
300W | 16.45 | 13.08 | 7.66 | 6.01 | 7.70 | 5.68 | 5.28 | 5.22 | 5.21
Table 2: Landmark detection using single layer and hypercolumn representations.

The error is reported as a percentage of the inter-ocular distance using linear regression over individual layers (left) and hypercolumn combinations (right), with a ResNet50. The embedding dimension of each representation is shown in parentheses. Layer #4 performs the best across datasets, while hypercolumns offer a further improvement.

Effectiveness of unsupervised learning. Tab. 3 compares representations in the linear evaluation setting for randomly initialized, ImageNet-pretrained, and contrastively trained networks using a hypercolumn representation. Contrastive learning provides significant improvements over ImageNet-pretrained models, which is perhaps unsurprising since the domain of ImageNet images is quite different from faces. Interestingly, random networks are competitive with some prior work in Tab. 1. For example, Thewlis et al. (2017a) achieve 4.02% on MAFL, while a randomly initialized ResNet18 with hypercolumns achieves 4.00%.

ResNet18
Supervision | MAFL | AFLW_M | AFLW_R | 300W
Random | 4.00 | 14.20 | 10.11 | 9.88
ImageNet | 2.85 | 8.76 | 7.03 | 6.66
Contrastive | 2.57 | 8.59 | 7.38 | 5.78

ResNet50
Supervision | MAFL | AFLW_M | AFLW_R | 300W
Random | 4.72 | 16.74 | 11.23 | 11.70
ImageNet | 2.98 | 8.88 | 7.34 | 6.88
Contrastive | 2.44 | 6.99 | 6.27 | 5.22
Table 3: Effectiveness of unsupervised learning. Error using randomly initialized, ImageNet-pretrained, and contrastively trained ResNet18 (top) and ResNet50 (bottom) for landmark detection. Frozen hypercolumn representations with a linear regressor were used for all methods.

5 Conclusion

We show that landmark representations extracted from intermediate layers of a deep network trained to be invariant outperform existing unsupervised landmark representation learning approaches. We also show that several equivariant learning approaches can be viewed through the lens of contrastive learning — they learn intra-image invariances which result in weaker generalization than inter-image invariances on a more challenging benchmark we propose. However, the two approaches are complementary and may be combined for further benefits.

Broader Impact

This work falls in the general category of unsupervised learning, which aims to discover structure from unlabeled data. In particular, this work reduces the annotation costs associated with training object landmark detectors, potentially enabling novel applications in domains where no large-scale labeled datasets exist. On the other hand, without expert supervision, the methods may inherit and amplify biases in the data, causing poor performance on downstream applications.

This project is supported in part by Grants #1661259 and #1749833 from the National Science Foundation of the United States. Our experiments were performed on the University of Massachusetts Amherst GPU cluster obtained under the Collaborative Fund managed by the Massachusetts Technology Collaborative. We would also like to thank Erik Learned-Miller, Dan Sheldon, and Rui Wang for discussion and feedback on the draft.

References

  • A. Achille and S. Soatto (2018) Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19 (1), pp. 1947–1980. Cited by: §1, §2, §3.2.
  • R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5297–5307. Cited by: §2.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509–15519. Cited by: §1, §2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020a) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1, §2.
  • X. Chen, H. Fan, R. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §D.
  • M. Cimpoi, S. Maji, and A. Vedaldi (2015) Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3828–3836. Cited by: §2.
  • T. Cohen and M. Welling (2016) Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §2.
  • N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Vol. 1, pp. 886–893. Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.
  • A. A. Efros (2016a) Colorful image colorization. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • A. A. Efros (2016b) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • Z. Feng, J. Kittler, M. Awais, P. Huber, and X. Wu (2018) Wing loss for robust facial landmark localisation with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2235–2245. Cited by: Table 1.
  • J. Fu, H. Zheng, and T. Mei (2017) Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4438–4446. Cited by: §2.
  • R. Gens and P. M. Domingos (2014) Deep symmetry networks. In Advances in neural information processing systems, pp. 2537–2545. Cited by: §2.
  • A. Gonzalez-Garcia, D. Modolo, and V. Ferrari (2018) Do semantic parts emerge in convolutional neural networks?. International Journal of Computer Vision 126 (5), pp. 476–494. Cited by: §1.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §2.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE conference on computer vision and pattern recognition, Vol. 2, pp. 1735–1742. Cited by: §1, §2.
  • B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik (2015) Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 447–456. Cited by: §1, §4.1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §D, §1, §1, §2, §3.1, §3.1, §3, §4.1, footnote 2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
  • O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §1.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller (2012) Learning to align from scratch. In Advances in neural information processing systems, pp. 764–772. Cited by: §1.
  • P. Isola (2019) Contrastive multiview coding. arXiv:1906.05849. Cited by: §D, §1, §2, §3.1.
  • T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi (2018) Unsupervised learning of object landmarks through conditional image generation. In Advances in Neural Information Processing Systems, pp. 4016–4027. Cited by: §1, §1, §2, Table 1.
  • T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi (2020) Self-supervised learning of interpretable keypoints from unlabelled videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, Table 1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §D.
  • M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof (2011) Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In 2011 IEEE international conference on computer vision workshops (ICCV workshops), pp. 2144–2151. Cited by: §1, §3.2, §4.1.
  • N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • K. Lenc and A. Vedaldi (2015) Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 991–999. Cited by: §1, §2, §3.2.
  • D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • T. Lin, A. RoyChowdhury, and S. Maji (2015) Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision, pp. 1449–1457. Cited by: §2.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: §4.1.
  • D. Lorenz, L. Bereska, T. Milbich, and B. Ommer (2019) Unsupervised part-based disentangling of object shape and appearance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10955–10964. Cited by: §2.
  • D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §2.
  • A. Mahendran and A. Vedaldi (2016) Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision 120 (3), pp. 233–255. Cited by: §2.
  • E. G. Miller, N. E. Matsakis, and P. A. Viola (2000) Learning from one example through shared densities on transforms. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition., Vol. 1, pp. 464–471. Cited by: §1.
  • I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §2, §3.1, §3.
  • M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2015) Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 685–694. Cited by: §2.
  • C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic (2013) 300 faces in-the-wild challenge: the first facial landmark localization challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 397–403. Cited by: §4.1.
  • P. Sermanet, A. Frome, and E. Real (2014) Attention for fine-grained categorization. arXiv preprint arXiv:1412.7054. Cited by: §2.
  • Z. Shu, M. Sahasrabudhe, R. Alp Guler, D. Samaras, N. Paragios, and I. Kokkinos (2018) Deforming autoencoders: unsupervised disentangling of shape and appearance. In Proceedings of the European Conference on Computer Vision, pp. 650–665. Cited by: §1, §2, Table 1.
  • J. Sivic and A. Zisserman (2003) Video google: a text retrieval approach to object matching in videos. In International Conference on Computer Vision (ICCV), pp. 1470. Cited by: §2.
  • J. Thewlis, S. Albanie, H. Bilen, and A. Vedaldi (2019) Unsupervised learning of landmarks by descriptor vector exchange. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6361–6371. Cited by: §A, §C, §D, §D, Table 5, Table 6, Table 7, Table 8, §1, §2, §3.2, §3, §3, Figure 3, §4.1, §4.1, §4.1, §4.2, §4.2, Table 1.
  • J. Thewlis, H. Bilen, and A. Vedaldi (2017a) Unsupervised learning of object frames by dense equivariant image labelling. In Advances in neural information processing systems, pp. 844–855. Cited by: §1, §2, §3.1, §3, §3, §4.1, §4.2, §4.2, Table 1.
  • J. Thewlis, H. Bilen, and A. Vedaldi (2017b) Unsupervised learning of object landmarks by factorized spatial embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5916–5925. Cited by: §1, §2, §3, §3, §4.2, Table 1.
  • N. Tishby, F. C. Pereira, and W. Bialek (2000) The information bottleneck method. arXiv preprint physics/0004057. Cited by: §1, §2, §3.2.
  • N. Tishby and N. Zaslavsky (2015) Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. Cited by: §1, §2, §3.2.
  • G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie (2018) The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §E, §1, §4.1.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §1, §4.1.
  • O. Wiles, A. Koepke, and A. Zisserman (2018) Self-supervised learning of a facial attribute embedding from video. In British Machine Vision Conference, Cited by: §1, §2, Table 1.
  • S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim (2016) Robust facial landmark detection via recurrent attentive-refinement networks. In European conference on computer vision, pp. 57–72. Cited by: Table 1.
  • T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang (2015) The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 842–850. Cited by: §2.
  • A. YM., R. C., and V. A. (2020) A critical analysis of self-supervision, or what we can learn from a single image. In International Conference on Learning Representations (ICLR), Cited by: §3.2.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §1, §2.
  • Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee (2018) Unsupervised discovery of object landmarks as structural representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2694–2703. Cited by: §1, §1, §2, §3, Table 1.
  • Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2014) Facial landmark detection by deep multi-task learning. In European conference on computer vision, pp. 94–108. Cited by: §4.1, Table 1.
  • Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2015) Learning deep representation for face alignment with auxiliary attributes. IEEE transactions on pattern analysis and machine intelligence 38 (5), pp. 918–930. Cited by: §3, §4.1, Table 1.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: §1, §2.
  • C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6002–6012. Cited by: §1.

Appendix

A Fine-tuning the network 

Tab. 4 presents the effect of fine-tuning all layers of our model end-to-end on multiple face landmark regression benchmarks. In this experiment, we use the output of the fourth convolutional block of a ResNet50 as the representation (the optimal single-layer representation). We did not find fine-tuning to be uniformly beneficial: fine-tuning is worse than training only the linear regressor on the MAFL dataset, while it is better on the AFLW datasets. We speculate that this is due to the domain gap between the training sets used for unsupervised learning (contrastive learning) and supervised learning (the linear regressor), combined with the small training set sizes. The domain gap between CelebA and AFLW is larger than that between CelebA and MAFL, as also noted in DVE Thewlis et al. [2019]. The DVE approach fine-tuned the feature extractor on AFLW during unsupervised representation learning; however, we did not observe improvements from such unsupervised fine-tuning in our experiments.

Setting | MAFL | AFLW_M | AFLW_R | 300W
w/o fine-tuning | 2.73 | 8.83 | 7.37 | 6.01
w/ fine-tuning | 2.81 | 7.80 | 6.99 | 5.94
Table 4: The effect of fine-tuning for landmark regression. We use a ResNet50 backbone and the output of the fourth convolutional block as the representation.

B Visualization of feature embeddings

To better understand the information encoded in the learned representations, we visualize the first few (uncentered) PCA components in Fig. 4. Specifically, we sample hypercolumns from 32 MAFL images using our contrastively pretrained ResNet50, treat each spatial location as a separate sample, and compute the PCA basis vectors. We then project the hypercolumns onto each basis vector and visualize the projection as a spatial map. We observe that the first few PCA bases encode the background, foreground, and landmark regions (e.g., eyes, nose, and mouth) of human faces. As a reference, we provide the PCA visualization of a randomly initialized ResNet50 with the hypercolumn representation.
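A sketch of this procedure is given below, assuming per-pixel hypercolumn features have already been extracted (e.g., with a helper like the one sketched in Sec. 3.1); the uncentered PCA is computed with a plain SVD and each projection is reshaped back into a spatial map. In practice one may subsample pixels before the SVD to keep it tractable.

```python
import torch

def pca_maps(feats, num_components=5):
    """Project per-pixel hypercolumns onto the top uncentered PCA bases.

    feats: (B, D, H, W) hypercolumn features for a batch of images.
    Returns (B, num_components, H, W) spatial projection maps.
    """
    B, D, H, W = feats.shape
    cols = feats.permute(0, 2, 3, 1).reshape(-1, D)        # every spatial location is a sample
    # Uncentered PCA: right singular vectors give the principal directions.
    _, _, vh = torch.linalg.svd(cols, full_matrices=False) # vh rows are principal directions
    proj = cols @ vh[:num_components].t()                  # (B*H*W, num_components)
    return proj.reshape(B, H, W, num_components).permute(0, 3, 1, 2)
```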

Figure 4: PCA visualization of the hypercolumn representation. From left to right: input image, and the projection of hypercolumns on the first five PCA bases. For each example we visualize the PCA components of a contrastively trained (top) and a randomly initialized ResNet50 (bottom).

C Memory efficiency

Tab. 5 compares the memory efficiency of DVE Thewlis et al. [2019] and our model. To compute the equivariance loss, DVE needs to maintain high-resolution feature maps across the network hierarchy. By comparison, our contrastive learning loss is computed on a global image representation, which requires less memory per image during training.

Method | Network | # Params (M) | Network Size (MB) | Memory (MB)
DVE Thewlis et al. [2019] | Hourglass | 12.61 | 48.10 | 491.85
Ours | ResNet18 | 11.24 | 42.89 | 11.54
Ours | ResNet50 | 23.77 | 90.68 | 52.65
Table 5: Memory efficiency. We compare the networks used by DVE Thewlis et al. [2019] and our method in terms of number of network parameters (# Params), memory required for storing the network (Network Size), and the memory usage of a forward and backward pass on a single image (Memory).

D Other implementation details

Training details of unsupervised learning models.

We use MoCo He et al. [2019] as our contrastive learning model. We train MoCo for 800 epochs with a batch size of 256 and the cosine learning rate schedule proposed in Chen et al. [2020b]. However, we did not observe improvements on our task from the other tricks in Chen et al. [2020b], such as adding Gaussian blur to the data augmentation and using an MLP as the projection network. We use the public implementation of MoCo from Isola [2019] (https://github.com/HobbitLong/CMC). For the comparison with the DVE model Thewlis et al. [2019] on the proposed bird dataset, we use their publicly available implementation (https://github.com/jamt9000/DVE).

Training details of linear regression.

For the limited-annotation experiments on the human face benchmarks, we apply thin-plate spline warps as data augmentation with the same deformation hyperparameters as DVE Thewlis et al. [2019]. We do not use any data augmentation for the experiments on the CUB dataset. To avoid excessive hyper-parameter tuning, we train the linear regressor for 120, 45, and 80 epochs on the MAFL, AFLW, and 300W datasets respectively, and keep these fixed for all ablation studies. On CUB, the results are reported from the checkpoint selected on the validation set. In the ablation study of the effectiveness of unsupervised learning (Tab. 3), for the ImageNet-pretrained and randomly initialized networks, we report the best performance on the test set within 2000 training epochs. We use an initial learning rate of 0.01 and a weight decay of 0.05 when only limited annotations are available (Fig. 3a,b); the two hyperparameters are 0.001 and 0.0005 respectively when all annotations are used (Tab. 1). We use Adam Kingma and Ba [2014] as the optimizer and apply a cosine learning rate schedule.
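For completeness, a sketch of this regressor optimization with the stated hyperparameters (Adam, cosine schedule, learning rate 0.001, weight decay 0.0005) is shown below; the regressor, the synthetic data, and the choice of an MSE loss are placeholders and assumptions, not the exact training code.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-ins for the frozen-feature regressor of Sec. 4.1 and its data; shapes are dummies.
regressor = nn.Linear(3840, 10)                       # e.g., 5 landmarks x (x, y)
feats = torch.randn(64, 3840)                         # frozen features, placeholder
targets = torch.randn(64, 10)                         # landmark annotations, placeholder

epochs = 45                                           # 120 / 45 / 80 for MAFL / AFLW / 300W
optimizer = Adam(regressor.parameters(), lr=0.001, weight_decay=0.0005)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    loss = nn.functional.mse_loss(regressor(feats), targets)   # loss choice is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```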

E Birds benchmark 

To test the performance of models in a challenging setting, we train representations in an unsupervised manner on iNaturalist Aves and evaluate on the CUB dataset. We randomly sample 100K images of birds from the iNaturalist 2017 dataset Van Horn et al. [2018] under the “Aves” class. Figure 5 (top) shows images from the iNaturalist Aves dataset, which contains birds in background clutter, under occlusion, and with a wider range of pose variations and sizes than the frontal faces in the human face benchmarks (e.g., MAFL). Some images even contain multiple birds. To test performance in the few-shot setting, we sample a subset of the CUB dataset containing species similar to those in iNaturalist. Specifically, we sample 35 species of the Passeroidea superfamily, each image annotated with 15 landmarks. We sample at most 60 images per class and split the samples of each species into training, validation, and test sets in a ratio of 3:1:1. These splits are then combined, resulting in 1241 training images, 382 validation images, and 383 testing images. Figure 5 (bottom) shows images from the CUB dataset, which are usually more object-centric and contain only one bird per image.

Figure 5: Images from the bird datasets. Images in the CUB dataset (bottom) are iconic, with birds more frequently in canonical poses and a single instance per image. On the other hand, iNaturalist images (top) are community-driven and less curated; often multiple birds appear in a single image, and they may be far away. This makes learning and transfer more challenging.

F Tables for Figure 3

Tab. 6, 7, and 8 present the numbers corresponding to Fig. 3a, b, and c respectively, which describe the effect of dataset size on landmark regression and unsupervised learning.

Self-supervision | # of annotations: 1, 5, 10, 20, 50, 100, 10122
None (SmallNet) Thewlis et al. [2019] | 28.87 32.85 22.31 21.13 14.25
DVE (Hourglass) Thewlis et al. [2019] | 14.23±1.54 | 12.04±2.03 | 12.25±2.42 | 11.46±0.83 | 12.76±0.53 | 11.88±0.16 | 7.53
Ours (ResNet50 + hypercol.) | 42.69±5.10 | 25.74±2.33 | 17.61±0.75 | 13.35±0.33 | 10.67±0.35 | 9.24±0.35 | 6.99
Ours (ResNet50 + conv4) | 43.74±2.78 | 21.25±1.14 | 16.51±1.43 | 12.45±0.66 | 10.03±0.21 | 9.95±0.17 | 8.05
Ours (ResNet18 + hypercol.) | 47.15±6.88 | 24.99±3.21 | 17.40±0.37 | 13.87±0.66 | 11.04±0.92 | 9.93±0.39 | 8.59
Ours (ResNet18 + conv4) | 38.05±5.25 | 21.71±1.57 | 16.60±0.61 | 14.48±0.69 | 12.20±0.36 | 11.02±0.06 | 10.61
Table 6: Landmark regression with limited annotations on the AFLW benchmark. Results are reported as a percentage of the inter-ocular distance (lower is better; ± denotes the variation across runs).
Self-supervision | # of annotations: 10 | 50 | 100 | 250 | 500 | 1241
None (ResNet18) | 2.97 | 10.07 | 11.31 | 24.82 | 38.86 | 52.64
DVE (Hourglass) Thewlis et al. [2019] | 37.82 | 51.64 | 54.58 | 56.78 | 58.64 | 61.91
Ours (ResNet18) | 13.41 | 25.91 | 34.02 | 51.70 | 56.77 | 62.24
Ours (ResNet50) | 13.87 | 29.28 | 40.86 | 57.96 | 64.55 | 68.63
Table 7: Landmark regression on the bird dataset. Results are reported using the percentage of correct keypoints (PCK) metric (higher is better).
Dataset | Unlabeled training set size: 5% | 10% | 25% | 50% | 100% | DVE Thewlis et al. [2019]
MAFL | 3.87 | 3.29 | 3.09 | 2.43 | 2.44 | 2.86
AFLW_M | 13.26 | 9.12 | 7.82 | 7.22 | 6.99 | 7.53
AFLW_R | 11.47 | 8.69 | 7.71 | 6.61 | 6.27 | 6.54
300W | 9.23 | 6.92 | 6.38 | 5.13 | 5.22 | 4.65
Table 8: The effect of training set size on unsupervised learning. Results are reported as a percentage of the inter-ocular distance. The performance on MAFL and 300W saturates with more than 50% of the unlabeled data. One possible reason is that MAFL and 300W provide many more annotations than AFLW. Notice that although 300W contains only 3148 training images, there are 68 annotated landmarks per image.
Table 8: The effect of training set size on unsupervised learning models. The results are reported as percentage of the inter-ocular distance. The performance on MAFL and 300W are saturated given more than 50% unlabeled data. One possible reason is that MAFL and 300W provide many more annotations than AFLW. Notice that although 300W contains only 3148 training images, there are 68 annotations per image.