Unsupervised object landmark discovery
Given a collection of images, humans are able to discover landmarks of the depicted objects by modeling the shared geometric structure across instances. This idea of geometric equivariance has been widely used for unsupervised discovery of object landmark representations. In this paper, we develop a simple and effective approach based on contrastive learning of invariant representations. We show that when a deep network is trained to be invariant to geometric and photometric transformations, representations from its intermediate layers are highly predictive of object landmarks. Furthermore, by stacking representations across layers in a hypercolumn their effectiveness can be improved. Our approach is motivated by the phenomenon of the gradual emergence of invariance in the representation hierarchy of a deep network. We also present a unified view of existing equivariant and invariant representation learning approaches through the lens of contrastive learning, shedding light on the nature of invariances learned. Experiments on standard benchmarks for landmark discovery, as well as a challenging one we propose, show that the proposed approach surpasses prior state-of-the-art.
Learning in the absence of labels is a challenge for existing machine learning and computer vision systems. Despite recent advances, the performance of unsupervised learning remains far below that of supervised learning, especially for few-shot image understanding tasks Misra (2019). This paper considers the task of unsupervised learning of object landmarks from a collection of images. The goal is to learn representations from a large number of unlabeled images such that they allow accurate prediction of landmarks such as eyes and noses when provided with a few labeled examples.
One way of inferring structure is to reason about the global appearance in terms of disentangled factors such as geometry and texture. This is the basis of alignment based Miller et al. (2000); Huang et al. (2012) and generative modeling based approaches for landmark discovery Zhang et al. (2018); Wiles et al. (2018); Shu et al. (2018); Jakab et al. (2018, 2020). An alternative is to learn a representation that geometrically deforms in the same way as the object, a property called geometric equivariance (Fig. 1a). However, a drawback of these approaches is that useful invariances may not be learned (e.g., the raw pixel representation itself is equivariant). As a result, these approaches are not robust to nuisance factors such as background clutter, occlusion, and other inter-image variations, limiting their applicability.
A different line of work has proposed contrastive learning as an unsupervised objective Hadsell et al. (2006); Dosovitskiy et al. (2014); Hjelm et al. (2019); Bachman et al. (2019); Lin (2018); Isola (2019); He et al. (2019); Chen et al. (2020a); Oord et al. (2018); Zhuang et al. (2019); Hénaff et al. (2019). This is commonly formulated as an objective defined over pairs of data. The goal is to learn a representation $\Phi$ that assigns higher similarity to an image $\mathbf{x}$ and its transformation $t(\mathbf{x})$ than to a different image $\mathbf{x}'$, i.e., $\mathrm{sim}(\Phi(\mathbf{x}), \Phi(t(\mathbf{x}))) > \mathrm{sim}(\Phi(\mathbf{x}), \Phi(\mathbf{x}'))$, as illustrated in Fig. 1b. Transformations are obtained using a combination of geometric (e.g., cropping and scaling) and photometric (e.g., color jittering and blurring) transformations, encouraging the representation to be invariant to these factors while being distinctive across images. Recent work He et al. (2019); Chen et al. (2020a)
has shown that contrastive learning is effective, even outperforming ImageNet Deng et al. (2009) pretraining for a range of tasks and domains. However, in order to predict landmarks, a representation cannot be fully invariant to geometric transformations of an image. Moreover, the effectiveness of these methods in the few-shot setting has not been sufficiently studied. This paper asks: are equivariant losses necessary for unsupervised landmark discovery? In particular, do representations predictive of object landmarks automatically emerge in intermediate layers of a deep network trained to be invariant to image transformations, including geometric ones? While empirical evidence Zhou et al. (2016); Gonzalez-Garcia et al. (2018) suggests the emergence of semantic parts in deep networks trained on supervised tasks, is this also the case for unsupervised learning?
This work aims to address these by presenting a unified view of the two approaches. We show that when a deep network is trained to be invariant to geometric and photometric transformations, its intermediate layer representations are highly predictive of landmarks (Fig. 1b). The emergence of invariance and the loss of geometric equivariance is gradual in the representation hierarchy, a phenomenon that has been studied empirically Zeiler and Fergus (2014); Lenc and Vedaldi (2015) and theoretically Tishby et al. (2000); Tishby and Zaslavsky (2015); Achille and Soatto (2018) (Fig. 1c). As a result, its intermediate layer offers a better trade-off between desired invariances and equivariances for the landmark prediction task. This observation also motivates a hypercolumn representation Hariharan et al. (2015), resulting in more accurate landmark predictions (Fig. 1d). We also observe that objectives used in equivariant learning can be seen as a contrastive loss between pixel representations at different locations within the same image, unlike invariant learning where the contrastive loss is applied across images (Fig. 1a vs 1b). This sheds light on the nature of invariances learned by these techniques.
To validate these claims, we present experiments using off-the-shelf Residual Networks He et al. (2016) and Momentum Contrast (MoCo) He et al. (2019) learning on several benchmarks for landmark detection. We present a quantitative evaluation in the linear evaluation setting while varying the number of labeled examples. We also present a comparison by learning on a challenging dataset of birds from the iNaturalist dataset Van Horn et al. (2018) and evaluating on the CUB dataset Wah et al. (2011). Our approach is simple, yet it offers consistent improvements over prior approaches Thewlis et al. (2017b, a); Zhang et al. (2018); Jakab et al. (2018); Thewlis et al. (2019). While the hypercolumn representation has a larger embedding dimension, this comes at a modest cost: our approach outperforms the prior state of the art Thewlis et al. (2019) with as few as 50 annotated training examples on the AFLW benchmark Koestinger et al. (2011) (Fig. 3). In fact, we observe that hypercolumns of randomly initialized networks also generalize well, suggesting that both the architecture and the learning objective play an important role.
In summary, our contributions are as follows. We present a unified view of existing equivariant and the proposed invariant landmark representation learning approaches through the lens of contrastive learning: the former encourages intra-image invariance while the latter encourages inter-image invariance. We show that invariant representation learning using contrastive losses leads to more effective landmark representations than the prior state of the art on several benchmarks; the improvements are pronounced on a more challenging dataset we propose. We analyze and motivate the effectiveness of our approach via the phenomenon of the gradual emergence of invariance in the representation hierarchy, and propose a hypercolumn representation which provides further benefits.
Background. A representation $\Phi$ is said to be equivariant (or covariant) with a transformation $g$ for input $\mathbf{x}$ if there exists a map $M_g$ such that $\Phi(g(\mathbf{x})) = M_g(\Phi(\mathbf{x}))$. In other words, the representation transforms in a predictable manner given the input transformation. For natural images, the transformations can be geometric (e.g., translation, scaling, and rotation), photometric (e.g., color changes), or more complex (e.g., occlusion, viewpoint or instance variations). Note that a sufficient condition for equivariance is when $\Phi$ is invertible, since then $M_g = \Phi \circ g \circ \Phi^{-1}$. Invariance is a special case of equivariance where $M_g$ is the identity function, i.e., $\Phi(g(\mathbf{x})) = \Phi(\mathbf{x})$. There is a rich history in computer vision on the design of covariant (e.g., SIFT Lowe (2004)) and invariant representations (e.g., HOG Dalal and Triggs (2005) and Bag-of-Visual-Words Sivic and Zisserman (2003)).
Deep representations. Invariance and equivariance in deep network representations result from both the architecture (e.g., convolutions lead to translational equivariance, while pooling leads to translational invariance), and learning (e.g., invariance to categorical variations). Lenc and Vedaldi (2015) showed that early-layer representations of deep networks trained on ImageNet are nearly equivariant as they can be “inverted” to recover the input, while later layers are more invariant. Similar observations have been made by visualizing these representations Mahendran and Vedaldi (2016); Zeiler and Fergus (2014). The gradual emergence of invariance can also be theoretically understood in terms of an “information bottleneck” in the feed-forward hierarchy Achille and Soatto (2018); Tishby et al. (2000); Tishby and Zaslavsky (2015). While equivariance to geometric transformations is relevant for landmark representations, the notion can be generalized to other transformation groups Gens and Domingos (2014); Cohen and Welling (2016).
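The architectural part of this claim can be illustrated with a toy example (our own sketch, not from the paper): a circular convolution is exactly equivariant to circular translation, while global max-pooling of its output is exactly invariant to it.

```python
import numpy as np

def circ_conv2d(x, k):
    """Circular 2D cross-correlation: translation-equivariant by construction."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += x[(i + a) % H, (j + b) % W] * k[a, b]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))   # toy "image"
k = rng.normal(size=(3, 3))   # toy filter

gx = np.roll(x, 2, axis=1)    # circularly translate the input

# Equivariance: the feature map of the shifted image is the shifted feature map.
assert np.allclose(circ_conv2d(gx, k), np.roll(circ_conv2d(x, k), 2, axis=1))

# Invariance: global max-pooling discards location, so it is unchanged.
assert np.isclose(circ_conv2d(gx, k).max(), circ_conv2d(x, k).max())
```

Real networks use zero padding rather than circular boundaries, so these identities hold only approximately away from image borders.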
Landmark discovery. Empirical evidence Zhou et al. (2016); Oquab et al. (2015) suggests that semantic parts emerge when deep networks are trained on supervised tasks. This has inspired architectures for image classification that encourage part-based reasoning, such as those based on texture representations Lin et al. (2015); Cimpoi et al. (2015); Arandjelovic et al. (2016) or spatial attention Sermanet et al. (2014); Xiao et al. (2015); Fu et al. (2017). In contrast, our work shows that parts also emerge when models are trained in an unsupervised manner. When no labels are available, equivariance to geometric transformations provides a natural self-supervisory signal. The equivariance constraint requires $\Phi_u(\mathbf{x})$, the representation of $\mathbf{x}$ at location $u$, to track the geometric transformation of the image, i.e., $\Phi_u(\mathbf{x}) = \Phi_{g(u)}(g(\mathbf{x}))$ (Fig. 1a). This alone is not sufficient since trivial solutions, such as a constant representation, satisfy this property. Constraints based on locality Thewlis et al. (2017a, 2019) and diversity Thewlis et al. (2017b) have been proposed to avoid this pathology. Yet, inter-image invariance is not directly enforced. Another line of work is based on a generative modeling approach Zhang et al. (2018); Wiles et al. (2018); Shu et al. (2018); Jakab et al. (2018); Lorenz et al. (2019); Jakab et al. (2020). These methods implicitly incorporate equivariant constraints by modeling objects as deformation (or flow) of a shape template together with appearance variation in a disentangled manner. In contrast, our work shows that learning representations invariant to both geometric and photometric transformations is an effective strategy. These invariances emerge at different rates in the representation hierarchy, and one can select the appropriate ones with a small amount of supervision for the downstream task.
Unsupervised learning. Recent work has shown that unsupervised objectives based on density modeling Dosovitskiy et al. (2014); Hjelm et al. (2019); Bachman et al. (2019); Lin (2018); Isola (2019); He et al. (2019); Chen et al. (2020a); Oord et al. (2018)
outperform unsupervised (or self-supervised) learning based on pretext tasks such as colorization Efros (2016a), rotation prediction Komodakis (2018), jigsaw puzzles Noroozi and Favaro (2016), and inpainting Efros (2016b). These contrastive learning objectives Hadsell et al. (2006)
are often expressed in terms of noise-contrastive estimation (NCE) Gutmann and Hyvärinen (2010) (or maximizing mutual information Oord et al. (2018); Hjelm et al. (2019)) between different views obtained by geometrically and photometrically transforming an image. The learned representations thus encode invariances to these transformations while preserving information relevant for downstream tasks. However, the effectiveness of unsupervised learning depends on how well these invariances relate to those desired for end tasks. Despite recent advances, existing methods for unsupervised learning significantly lag behind their supervised counterparts in the few-shot setting Misra (2019). Moreover, their effectiveness for landmark discovery has not been sufficiently studied in the literature. (Note that MoCo He et al. (2019) was evaluated on pose estimation; however, their method was trained with 150K labeled examples and the entire network was fine-tuned.) In part, it is not clear why invariance to geometric transformations should be useful for landmark prediction, since we require the representation to carry some spatial information about the image. Understanding these trade-offs and improving the effectiveness of contrastive learning for landmark prediction is one of the goals of this paper.
Denote $\mathbf{x} \in \mathbb{R}^{3 \times H \times W}$ as an image of an object, and $u \in \Omega = \{1, \dots, H\} \times \{1, \dots, W\}$ as pixel coordinates. The target of unsupervised landmark discovery is to learn an encoder $\Phi$ that generates a representation $\Phi_u(\mathbf{x}) \in \mathbb{R}^d$ at spatial location $u$ of input $\mathbf{x}$ which is predictive of object landmarks. We assume $d \gg 2$, aiming to learn a high-dimensional representation of landmarks. This is similar to Thewlis et al. (2019), which learns a local descriptor for each landmark, and unlike those that represent them as a discrete set Zhang et al. (2015), or in a planar ($\mathbb{R}^2$) Thewlis et al. (2017b); Zhang et al. (2018) or spherical ($S^2$) Thewlis et al. (2017a) coordinate system. In other words, the representation should be predictive of landmarks (measured using a linear regressor), without requiring compactness or topology in the embedding space.
We describe commonly used equivariance constraints for unsupervised landmark discovery Thewlis et al. (2017b, a, 2019), followed by models based on invariant learning Oord et al. (2018); He et al. (2019). We then present a unified view of the two approaches motivating our approach, which is to simply train a deep network to be invariant to all transformations and use features from intermediate layers as the landmark representation.
Equivariant learning. The equivariance constraint requires $\Phi_u(\mathbf{x})$, the representation of $\mathbf{x}$ at location $u$, to track the geometric deformation of the image (Fig. 1a). Given a geometric warping function $g$, the representation of $\mathbf{x}$ at $u$ should be the same as the representation of the transformed image $g(\mathbf{x})$ at $g(u)$, that is, $\Phi_u(\mathbf{x}) = \Phi_{g(u)}(g(\mathbf{x}))$. This constraint can be captured by the loss:

$\mathcal{L}_{\text{equiv}}(\mathbf{x}) = \frac{1}{|\Omega|} \sum_{u \in \Omega} \big\| \Phi_u(\mathbf{x}) - \Phi_{g(u)}(g(\mathbf{x})) \big\|^2$
A diversity (or locality) constraint is necessary to encourage the representation to be distinctive across locations. For example, Thewlis et al. (2017a) proposed requiring the best match of $\Phi_u(\mathbf{x})$ within $g(\mathbf{x})$ to lie at the warped location:

$g(u) = \arg\max_{v \in \Omega} \, \langle \Phi_u(\mathbf{x}), \Phi_v(g(\mathbf{x})) \rangle$
which they replace by a probabilistic version that combines both losses:

$\mathcal{L}(\mathbf{x}) = \frac{1}{|\Omega|} \sum_{u \in \Omega} \sum_{v \in \Omega} \| v - g(u) \| \; p(v \mid u)$

Here $p(v \mid u)$ is the probability of pixel $u$ in image $\mathbf{x}$ matching pixel $v$ in image $g(\mathbf{x})$, with $\Phi$ as the encoder shared by both images and $\tau$ a scale parameter:

$p(v \mid u) = \frac{\exp\!\big( \langle \Phi_u(\mathbf{x}), \Phi_v(g(\mathbf{x})) \rangle / \tau \big)}{\sum_{t \in \Omega} \exp\!\big( \langle \Phi_u(\mathbf{x}), \Phi_t(g(\mathbf{x})) \rangle / \tau \big)}$
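This probabilistic matching can be sketched in NumPy (our own illustration; the function names `match_prob` and `equiv_loss` are ours, and the exact normalization in the original papers may differ):

```python
import numpy as np

def match_prob(f1, f2, tau=0.1):
    """p(v|u): softmax over scaled inner products of per-pixel descriptors.
    f1, f2: (N, d) descriptors for the N pixels of the two images."""
    logits = f1 @ f2.T / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)       # each row sums to 1

def equiv_loss(f1, f2, coords2, warped_coords, tau=0.1):
    """Distance-weighted matching loss: mean_u sum_v ||v - g(u)|| p(v|u)."""
    p = match_prob(f1, f2, tau)                                    # (N, N)
    d = np.linalg.norm(warped_coords[:, None, :] - coords2[None, :, :], axis=-1)
    return (p * d).sum(axis=1).mean()

# Perfectly matching descriptors at aligned locations give near-zero loss.
f = np.eye(4)                                  # 4 pixels, orthogonal descriptors
coords = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
loss = equiv_loss(f, f, coords, coords, tau=0.01)
```

With a low temperature and distinctive descriptors, each pixel matches its warped counterpart and the distance-weighted penalty vanishes; a constant representation would instead spread `p` uniformly and be penalized, which is the point of the diversity term.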
Invariant learning. Contrastive representation learning is based on similarity over pairs of inputs (Fig. 1b). Given an image $\mathbf{x}$ and its transformation $\mathbf{x}^+ = t(\mathbf{x})$, as well as other images $\{\mathbf{x}_i^-\}_{i=1}^{K}$, the InfoNCE Oord et al. (2018) loss minimizes:

$\mathcal{L}_{\text{NCE}} = -\log \frac{\exp\!\big( \mathrm{sim}(\Phi(\mathbf{x}), \Phi(\mathbf{x}^+)) / \tau \big)}{\exp\!\big( \mathrm{sim}(\Phi(\mathbf{x}), \Phi(\mathbf{x}^+)) / \tau \big) + \sum_{i=1}^{K} \exp\!\big( \mathrm{sim}(\Phi(\mathbf{x}), \Phi(\mathbf{x}_i^-)) / \tau \big)}$
The objective encourages representations to be invariant to transformations while being distinctive across images. To address the computational bottleneck in evaluating the denominator, Momentum Contrast (MoCo) He et al. (2019) replaces the loss over negative examples using a dictionary queue and updates the parameters based on momentum.
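For concreteness, the InfoNCE objective for a single query embedding can be sketched as follows (a minimal NumPy sketch under our own naming; MoCo's actual implementation differs in batching and in its momentum key encoder):

```python
import numpy as np

def info_nce(z, z_pos, z_negs, tau=0.07):
    """-log( exp(s+/tau) / (exp(s+/tau) + sum_i exp(s_i^-/tau)) ).
    z, z_pos: (d,) L2-normalized embeddings; z_negs: (K, d) negatives."""
    logits = np.concatenate([[z @ z_pos], z_negs @ z]) / tau
    m = logits.max()                                   # log-sum-exp stability
    log_denom = m + np.log(np.exp(logits - m).sum())
    return -(logits[0] - log_denom)

z = np.array([1.0, 0.0])              # query embedding
z_pos = np.array([1.0, 0.0])          # the transformed view agrees with z
z_negs = np.array([[0.0, 1.0],        # negatives are orthogonal to z
                   [0.0, -1.0]])
loss = info_nce(z, z_pos, z_negs, tau=0.01)   # near zero: positive dominates
```

When the positive pair is far more similar than every negative, the loss approaches zero; this is exactly the invariance-plus-distinctiveness pressure described above.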
Transformations. The space of transformations used to generate image pairs plays an important role in learning. A common approach is to apply a combination of geometric transformations, such as cropping, resizing, and thin-plate spline warps, as well as photometric transformations, such as color jittering, JPEG noise, and PCA color augmentation. However, transformations can also denote different color channels of an image or different modalities such as depth and color Isola (2019).
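A pair-generation pipeline of this kind can be sketched as follows (our own minimal NumPy version; real pipelines use richer transformations such as color jittering and thin-plate spline warps):

```python
import numpy as np

def random_view(img, rng, crop=24):
    """One stochastic view of an image: random crop (geometric), random
    horizontal flip (geometric), and brightness scaling (photometric)."""
    H, W, _ = img.shape
    i = int(rng.integers(0, H - crop + 1))
    j = int(rng.integers(0, W - crop + 1))
    v = img[i:i + crop, j:j + crop].astype(float)
    if rng.random() < 0.5:
        v = v[:, ::-1]                        # horizontal flip
    v = v * rng.uniform(0.8, 1.2)             # brightness jitter
    return np.clip(v, 0, 255)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3))  # a mock image
v1, v2 = random_view(img, rng), random_view(img, rng)   # a positive pair
```

The two views `v1` and `v2` form a positive pair for the contrastive loss, while views of other images serve as negatives.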
Hypercolumns. A deep network of $n$ layers (or blocks; due to skip-connections, we cannot decompose the encoding over layers, but can across blocks) can be written as a composition $\Phi = \Phi_n \circ \dots \circ \Phi_2 \circ \Phi_1$. A representation $\Phi^i(\mathbf{x})$ of size $d_i \times h_i \times w_i$ can be spatially interpolated to the input size $H \times W$ to produce a pixel representation $\Phi^i_u(\mathbf{x})$. The hypercolumn representation of layers $i_1, \dots, i_k$ is obtained by concatenating the interpolated features from the corresponding layers, that is, $\Phi_u(\mathbf{x}) = \big[ \Phi^{i_1}_u(\mathbf{x}); \dots; \Phi^{i_k}_u(\mathbf{x}) \big]$.
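A minimal version of this stacking operation (our own sketch using nearest-neighbor interpolation for brevity; bilinear resizing is the more common choice):

```python
import numpy as np

def upsample_nearest(f, H, W):
    """Resize a (d, h, w) feature map to (d, H, W) by nearest neighbor."""
    d, h, w = f.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return f[:, rows][:, :, cols]

def hypercolumn(feats, H, W):
    """Concatenate interpolated feature maps along channels -> (sum d_i, H, W).
    The column at pixel u stacks every layer's feature at (roughly) u."""
    return np.concatenate([upsample_nearest(f, H, W) for f in feats], axis=0)

# Mock feature maps from three layers of decreasing spatial resolution.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 16, 16)),
         rng.normal(size=(16, 8, 8)),
         rng.normal(size=(32, 4, 4))]
hc = hypercolumn(feats, 16, 16)     # shape: (8 + 16 + 32, 16, 16)
```

Each pixel of `hc` is the concatenation of that pixel's features across layers, combining the geometric precision of early layers with the invariance of late ones.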
Our approach. Given a large unlabeled dataset, we first train invariant representations using contrastive learning He et al. (2019), with a combination of geometric and photometric transformations applied to images to generate pairs. We then extract single-layer or hypercolumn representations from intermediate layers to represent landmarks (Fig. 1b and Fig. 1d).
Commonalities and differences. Equivariance is necessary but not sufficient for an effective landmark representation. It also needs to be distinctive, or invariant to nuisance factors. This is enforced in the equivariance objective (Eq. 3) as a contrastive term over locations within the same image, as the loss is minimized when $p(v \mid u)$ is maximized at $v = g(u)$. This encourages intra-image invariance, unlike the objective of contrastive learning (Eq. 5), which encourages inter-image invariance. A single image may nonetheless contain enough variety to provide some invariance, which is supported by the empirical performance of equivariant methods and by recent work showing that representation learning is possible even from a single image YM. et al. (2020). However, our experiments suggest that on challenging datasets, with objects in clutter, occlusion, and wider pose variation, inter-image invariance is more effective.
Why do landmarks emerge during invariant learning? For this, we point to the phenomenon of the gradual emergence of invariance in the representation hierarchy Lenc and Vedaldi (2015); Achille and Soatto (2018); Tishby et al. (2000); Tishby and Zaslavsky (2015). In particular, equivariance to geometric transformations reduces with depth due to pooling operations, while invariance to nuisance factors increases with depth (Fig. 1c). Thus, picking an intermediate layer in the representation hierarchy may offer the desired trade-off between the geometric equivariance and invariances for predicting landmarks. Our experiments support this, and we find that it is possible to select the right layer with just a few labeled examples from the downstream task. The scheme can be further improved by stacking representations across layers in a hypercolumn (Sec. 4).
Is there any advantage of one approach over the other? Our experiments show that for a deep network of the same size, invariant representation learning can be just as effective (Tab. 1). Moreover, invariant learning is conceptually simpler and scales better than equivariant approaches, as the latter maintain high-resolution feature maps across the hierarchy. Using a deeper network (e.g., ResNet50 vs. ResNet18) results in consistent improvements, outperforming DVE Thewlis et al. (2019) on four out of five datasets, as shown in Tab. 1. A drawback of our approach is that the representation is not directly interpretable or compact, which results in lower performance in the extreme few-shot case. However, as seen in Fig. 3a, this disadvantage disappears with as few as 50 training examples on the AFLW benchmark Koestinger et al. (2011). Moreover, invariant learning makes better use of data, achieving the same performance with half the unlabeled examples, as seen in Fig. 3c. We describe these experiments in detail in the next section.
We first outline the datasets and implementation details of the proposed method (Sec. 4.1). We then evaluate our model and provide comparisons to the existing methods qualitatively and quantitatively on the landmark detection benchmarks, followed by a thorough ablation study (Sec. 4.2).
Human faces. We first compare the proposed model with prior art on existing human face landmark detection benchmarks. Following DVE Thewlis et al. (2019), we train our model on the CelebA dataset Liu et al. (2015) and evaluate on MAFL Zhang et al. (2015), AFLW Koestinger et al. (2011), and 300W Sagonas et al. (2013). Images overlapping with MAFL are excluded from CelebA. MAFL comprises 19,000 training images and 1,000 testing images with annotations for 5 face landmarks. Two versions of AFLW are used: AFLW-M, which contains 10,122 training images and 2,995 testing images cropped from MTFL Zhang et al. (2014), and AFLW-R, which contains tighter crops of face images, with 10,122 for training and 2,991 for testing. 300W provides 68 annotated face landmarks with 3,148 training images and 689 testing images. We apply the same image pre-processing procedures as DVE, the current state of the art, for a direct comparison.
Birds. We collect a challenging dataset of birds where objects appear in clutter, under occlusion, and exhibit wider pose variation. We randomly select 100K images of birds from the iNaturalist 2017 dataset Van Horn et al. (2018) under the “Aves” class to train unsupervised representations. To evaluate the few-shot setting, we collect a subset of the CUB dataset Wah et al. (2011) containing 35 species of the Passeroidea superfamily (the largest Aves taxon in iNaturalist), each annotated with 15 landmarks. We sample at most 60 images per class, which results in 1,241 images as our training set, 382 as the validation set, and 383 as the testing set (see Appendix E for details).
Evaluation. We use landmark regression as the end task for evaluation. Following Thewlis et al. (2017a, 2019), we train a linear regressor to map the representations to landmark annotations, keeping the representation frozen. The landmark regressor is a linear regressor per target landmark. Each regressor consists of 50 filters of size 1×1 applied to the $d$-dimensional representation to generate 50 intermediate heatmaps, which are converted to spatial coordinates by a soft-argmax operation. These 50 coordinates are finally converted to the target landmark by a linear layer (see Appendix A for details). We report errors as the percentage of inter-ocular distance on face benchmarks and the percentage of correct keypoints (PCK) on CUB. A prediction is considered correct under the PCK metric if its distance to the ground truth is within 5% of the longer side of the image. Occluded landmarks are ignored during evaluation. We did not find fine-tuning to be uniformly beneficial but include a comparison in Appendix A.
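The soft-argmax step, which converts a heatmap to coordinates differentiably, can be sketched as follows (our own NumPy version; `beta` is a hypothetical sharpness parameter not specified in the paper):

```python
import numpy as np

def soft_argmax(heatmap, beta=25.0):
    """Softmax-weighted average of pixel coordinates -> (x, y).
    A differentiable surrogate for argmax over the heatmap."""
    h, w = heatmap.shape
    p = np.exp(beta * (heatmap - heatmap.max()))   # stable softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return np.array([(p * xs).sum(), (p * ys).sum()])

hm = np.zeros((48, 48))
hm[10, 30] = 1.0                  # a single confident peak
x, y = soft_argmax(hm)            # close to (30, 10) for large beta
```

Because the output is a weighted average rather than a hard index, gradients flow from the coordinate loss back into the heatmap, which is what makes end-to-end training of the regressor possible.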
Implementation details. We use MoCo He et al. (2019) to train our models on CelebA or iNat Aves for 800 epochs with a batch size of 256 and a dictionary size of 4096. ResNet18 or ResNet50 He et al. (2016) are used as our backbones. We extract hypercolumns Hariharan et al. (2015) per pixel by stacking activations from the second (conv2_x) to the last convolutional block (conv5_x). We resize the feature maps from the selected convolutional blocks to the same spatial size as DVE Thewlis et al. (2019) (i.e., 48×48). We also follow DVE (with the Hourglass network) in resizing the input image and center-cropping it to 96×96 for the face datasets; images are resized without any cropping for the bird dataset. For a comparison with DVE on the CUB dataset we used their publicly available implementation.
Quantitative results. Tab. 1 presents a quantitative evaluation on multiple benchmarks. On faces, the proposed model with ResNet18 significantly outperforms several prior works Thewlis et al. (2017b, a) and DVE with SmallNet. Our model with a ResNet50 achieves state-of-the-art results on all benchmarks except 300W. When trained on iNat Aves and evaluated on CUB, the approach outperforms the prior state of the art Thewlis et al. (2019) by a large margin, suggesting improved invariance to nuisance factors.
| Method | #Params (M) | Unsup. | MAFL | AFLW-M | AFLW-R | 300W | CUB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Supervised* | | | | | | | |
| TCDCN Zhang et al. (2015) | – | | 7.95 | 7.65 | – | 5.54 | – |
| RAR Xiao et al. (2016) | – | | – | 7.23 | – | 4.94 | – |
| MTCNN Zhang et al. (2014) | – | | 5.39 | 6.90 | – | – | – |
| Wing Loss Feng et al. (2018) | – | | – | – | – | 4.04 | – |
| *Generative modeling based* | | | | | | | |
| Structural Repr. Zhang et al. (2018) | – | ✓ | 3.15 | – | 6.58 | – | – |
| FAb-Net Wiles et al. (2018) | – | ✓ | 3.44 | – | – | 5.71 | – |
| Deforming AE Shu et al. (2018) | – | ✓ | 5.45 | – | – | – | – |
| ImGen. Jakab et al. (2018) | – | ✓ | 2.54 | – | 6.31 | – | – |
| ImGen.++ Jakab et al. (2020) | – | ✓ | – | – | – | 5.12 | – |
| *Equivariance based* | | | | | | | |
| Sparse Thewlis et al. (2017b) | – | ✓ | 6.67 | 10.53 | – | 7.97 | – |
| Dense 3D Thewlis et al. (2017a) | – | ✓ | 4.02 | 10.99 | 10.14 | 8.23 | – |
| DVE SmallNet Thewlis et al. (2019) | 0.35 | ✓ | 3.42 | 8.60 | 7.79 | 5.75 | – |
| DVE Hourglass Thewlis et al. (2019) | 12.61 | ✓ | 2.86 | 7.53 | 6.54 | 4.65 | 61.91 |
| Ours (ResNet50) | 23.77 | ✓ | 2.44 (+14.7%) | 6.99 (+7.2%) | 6.27 (+4.1%) | 5.22 (−12.3%) | 68.63 (+17.6%) |

Face columns report inter-ocular distance error (%, lower is better); CUB reports PCK (%, higher is better). Parenthesized numbers for our model denote relative change (%).
Qualitative results. Fig. 2 shows the qualitative results of landmark regression on human faces and birds. We notice that both DVE and our model are able to localize the foreground object accurately. However, our model localizes many keypoints better (e.g., on the tails of the birds) and is more robust to the background clutter (e.g., the last column of Fig. 2b).
Fig. 2: qualitative results on (a) human faces and (b) birds.
Limited annotations. Fig. 3a and 3b compare our model with DVE Thewlis et al. (2019) using a limited number of annotations on the AFLW and CUB datasets respectively. Our performance is better as soon as a modest number of training examples is available (e.g., 50 on AFLW and 250 on CUB). The initial gap can be attributed to the higher-dimensional embedding of the hypercolumn representation; in this regime, the scheme can be improved by using a single-layer representation instead, shown as the yellow line. Note that all unsupervised learning models (including DVE and ours) outperform the randomly initialized baseline on both the human face and bird datasets. We provide the numbers corresponding to Fig. 3 in Appendix F.
Fig. 3: (a) limited annotations on AFLW; (b) limited annotations on CUB; (c) varying numbers of unlabeled CelebA images.
Limited unlabeled data. Fig. 3c shows that our model matches the performance of DVE on AFLW using only 40% of the images on the CelebA dataset.
Hypercolumns. Tab. 2 compares the performance of individual-layer and hypercolumn representations. The activations from the fourth convolutional block consistently outperform those from the other layers. For an input of size 96×96, the spatial dimension of the representation is 48×48 at Layer #1 and 3×3 at Layer #5, reducing by a factor of two at each successive layer. Thus, while the representation loses geometric equivariance with depth, contrastive learning encourages invariance, leaving Layer #4 with the optimal trade-off for this task. While the best layer can be selected with some labeled validation data, the hypercolumn representation provides further benefits everywhere except the very-small-data regime (Tab. 2 and Fig. 3a).
Tab. 2: error (% of inter-ocular distance) of linear regression over individual layers #1–#5 (left) and hypercolumn combinations #4–#5, #3–#5, #2–#5, and #1–#5 (right), with a ResNet50. The embedding dimension of each is shown in parentheses. Layer #4 performs the best across datasets, while hypercolumns offer a further improvement.
Effectiveness of unsupervised learning. Tab. 3 compares representations in the linear evaluation setting for randomly initialized, ImageNet-pretrained, and contrastively learned networks, all using a hypercolumn representation. Contrastive learning provides significant improvements over ImageNet-pretrained models, which is perhaps unsurprising since the domain of ImageNet images is quite different from faces. Interestingly, random networks are competitive with some prior work in Tab. 1: for example, Thewlis et al. (2017a) achieve 4.02% on MAFL, while a randomly initialized ResNet18 with hypercolumns achieves 4.00%.
We show that landmark representations extracted from intermediate layers of a deep network trained to be invariant outperform existing unsupervised landmark representation learning approaches. We also show that several equivariant learning approaches can be viewed through the lens of contrastive learning — they learn intra-image invariances which result in weaker generalization than inter-image invariances on a more challenging benchmark we propose. However, the two approaches are complementary and may be combined for further benefits.
This work falls in the general category of unsupervised learning, which aims to discover structure from unlabeled data. In particular, this work reduces the annotation costs associated with training object landmark detectors, potentially enabling novel applications in domains without existing large-scale labeled datasets. On the other hand, without expert supervision, these methods may inherit and amplify biases in the data, causing poor performance in downstream applications.
The project is supported in part by Grants #1661259 and #1749833 from the National Science Foundation of the United States. Our experiments were performed on the University of Massachusetts, Amherst GPU cluster obtained under the Collaborative Fund managed by the Massachusetts Technology Collaborative. We would also like to thank Erik Learned-Miller, Dan Sheldon, and Rui Wang for discussions and feedback on the draft.
R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307, 2016.
A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
Z. Shu, M. Sahasrabudhe, R. A. Güler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: unsupervised disentangling of shape and appearance. In Proceedings of the European Conference on Computer Vision, pp. 650–665, 2018.
J. Thewlis, S. Albanie, H. Bilen, and A. Vedaldi. Unsupervised learning of landmarks by descriptor vector exchange. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6361–6371, 2019.
T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 842–850, 2015.
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Tab. 4 presents the effect of fine-tuning all layers of our model end-to-end on multiple face landmark regression benchmarks. In this experiment, we use the output of the fourth convolutional block of ResNet50 as the representation (the optimal single-layer representation). We did not find fine-tuning to be uniformly beneficial: fine-tuning is worse than training only the linear regressor on the MAFL dataset, while it is better on AFLW. We speculate that this is due to the domain gap between the training sets used for unsupervised learning (contrastive learning) and supervised learning (the linear regressor), together with the small training set sizes. CelebA and AFLW have a larger domain gap than CelebA and MAFL, which was also noticed in DVE Thewlis et al. (2019). The DVE approach fine-tuned the feature extractor on AFLW during unsupervised representation learning; however, we did not observe improvements from such unsupervised fine-tuning in our experiments.
To better understand the information encoded in the learned representations, we visualize the first few PCA (uncentered) components in Fig. 4. Specifically, we sample hypercolumns from 32 MAFL images using our contrastively pretrained ResNet50, treat each spatial location as an independent sample, and compute the PCA basis vectors. We then project the hypercolumns onto each basis vector and visualize the projection as a spatial map. We observe that the first few PCA bases encode the background, foreground, and landmark regions (e.g., eyes, nose, mouth) of human faces. As a reference, we provide the PCA visualization of a randomly initialized ResNet50 with hypercolumn representations.
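The projection step above can be sketched in a few lines; a minimal numpy sketch assuming hypercolumn features have already been extracted (the feature dimensions and sample counts here are illustrative placeholders, not the ones used in our experiments):

```python
import numpy as np

def uncentered_pca_maps(hypercolumns, n_components=4):
    """Project per-pixel hypercolumns onto the top uncentered-PCA bases.

    hypercolumns: array of shape (N, H, W, C) -- one C-dim hypercolumn
    per spatial location of each image.
    Returns an array of shape (N, H, W, n_components): one spatial
    projection map per basis vector.
    """
    n, h, w, c = hypercolumns.shape
    # Treat every spatial location as an independent sample.
    x = hypercolumns.reshape(-1, c)
    # Uncentered PCA: right singular vectors of X, without mean subtraction.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    basis = vt[:n_components]          # (n_components, C)
    proj = x @ basis.T                 # (N*H*W, n_components)
    return proj.reshape(n, h, w, n_components)

# Toy example: 32 "images" with 8x8 feature maps and 64-dim hypercolumns.
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 8, 8, 64)).astype(np.float32)
maps = uncentered_pca_maps(feats, n_components=4)
print(maps.shape)  # (32, 8, 8, 4)
```

Each channel of the returned array can then be rendered as a heatmap to reproduce a visualization in the style of Fig. 4.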
[Fig. 4 panels: contrastively trained model vs. randomly initialized model]
Tab. 5 compares the memory efficiency of DVE Thewlis et al. [2019] and our model. To compute the equivariance loss, DVE needs to maintain high-resolution feature maps across the network hierarchy. By comparison, our contrastive learning loss is computed on a global image representation, which requires less memory per image during training.
|Method||Network||# Params (M)||Network Size (MB)||Memory (MB)|
|DVE Thewlis et al. [2019]||Hourglass||12.61||48.10||491.85|
We use MoCo He et al. [2020] as our contrastive learning model. We train MoCo for 800 epochs with a batch size of 256 and the cosine learning rate schedule proposed in Chen et al. [2020b]. However, we did not observe improvements on our task from the other tricks in Chen et al. [2020b], such as adding Gaussian blur to the data augmentation or using an MLP as the projection network. We use the public implementation of MoCo from Tian et al. (https://github.com/HobbitLong/CMC). For the comparison with the DVE model Thewlis et al. [2019] on the proposed bird dataset, we use their publicly available implementation (https://github.com/jamt9000/DVE).
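The cosine schedule referenced above follows the usual half-cosine decay convention; a small generic sketch, not the authors' code (the base learning rate of 0.03 is MoCo's default for a batch size of 256 and is an assumption here):

```python
import math

def cosine_lr(epoch, total_epochs=800, base_lr=0.03):
    """Half-cosine decay: base_lr at epoch 0, annealed to 0 at total_epochs."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

# The learning rate starts at base_lr and decays smoothly to zero.
print(cosine_lr(0))    # 0.03
print(cosine_lr(400))  # half-way: ~0.015
print(cosine_lr(800))  # 0.0
```

In practice the schedule is evaluated once per epoch (or per iteration) and the result is written into the optimizer's learning rate.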
For the limited-annotation experiments on the human face benchmarks, we apply thin-plate spline warping as the data augmentation, with the same deformation hyperparameters as DVE Thewlis et al. [2019]. We do not use any data augmentation for the experiments on the CUB dataset. To avoid excessive hyperparameter tuning, we train the linear regressor for 120, 45, and 80 epochs on the MAFL, AFLW, and 300W datasets respectively, and keep these settings fixed for all ablation studies. On CUB, results are reported from the checkpoint selected on the validation set. In the ablation study on the effectiveness of unsupervised learning (Tab. 3), for the ImageNet-pretrained and randomly initialized networks we report the best performance on the test set within 2000 training epochs. We use an initial learning rate of 0.01 and a weight decay of 0.05 when only limited annotations are available (Fig. 3a,b); these two hyperparameters are 0.001 and 0.0005 respectively when all annotations are available (Tab. 1). We use Adam Kingma and Ba [2015] as the optimizer and apply the cosine learning rate schedule.
To test the performance of models in a challenging setting, we train representations in an unsupervised manner on iNaturalist Aves and evaluate on the CUB dataset. We randomly sample 100K bird images from the "Aves" class of the iNaturalist 2017 dataset Van Horn et al. [2018]. Figure 5 (top) shows images from the iNaturalist Aves dataset, which contains birds under background clutter and occlusion, with a far wider range of poses and sizes than the frontal faces in the human face benchmarks (e.g., MAFL); some images even contain multiple birds. To test performance in the few-shot setting, we sample a subset of the CUB dataset containing species similar to those in iNaturalist. Specifically, we sample 35 species of the Passeroidea superfamily, each annotated with 15 landmarks. We sample at most 60 images per class and split the images of each species into training, validation, and test sets in a ratio of 3:1:1. Combining these per-species splits yields 1241 training images, 382 validation images, and 383 test images. Figure 5 (bottom) shows images from the CUB dataset, which are typically more object-centric and contain a single bird per image.
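The per-species 3:1:1 splitting described above can be sketched as follows; a hypothetical helper, not our release script (the species names and file lists in the example are placeholders):

```python
import random

def split_per_species(images_by_species, ratio=(3, 1, 1),
                      max_per_class=60, seed=0):
    """Split each species' images 3:1:1 into train/val/test, then combine."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    total = sum(ratio)
    for species, images in sorted(images_by_species.items()):
        imgs = list(images)
        rng.shuffle(imgs)
        imgs = imgs[:max_per_class]            # cap at 60 images per class
        n_train = len(imgs) * ratio[0] // total
        n_val = len(imgs) * ratio[1] // total
        train += imgs[:n_train]
        val += imgs[n_train:n_train + n_val]
        test += imgs[n_train + n_val:]
    return train, val, test

# Toy example: 2 species with 60 and 10 images each.
data = {"sparrow": [f"s{i}.jpg" for i in range(60)],
        "finch": [f"f{i}.jpg" for i in range(10)]}
tr, va, te = split_per_species(data)
print(len(tr), len(va), len(te))  # 42 14 14
```

Splitting within each species before combining keeps every species represented in all three sets, which matters for the few-shot evaluation.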
|Self-supervision||# of annotations|
|None (SmallNet) Thewlis et al. ||28.87||32.85||22.31||21.13||–||–||14.25|
|DVE (Hourglass) Thewlis et al. ||14.23||12.04||12.25||11.46||12.76||11.88||7.53|
|Ours (ResNet50 + hypercol.)||42.69||25.74||17.61||13.35||10.67||9.24||6.99|
|Ours (ResNet50 + conv4)||43.74||21.25||16.51||12.45||10.03||9.95||8.05|
|Ours (ResNet18 + hypercol.)||47.15||24.99||17.40||13.87||11.04||9.93||8.59|
|Ours (ResNet18 + conv4)||38.05||21.71||16.60||14.48||12.20||11.02||10.61|
|Self-supervision||# of annotations|
|DVE (Hourglass) Thewlis et al. ||37.82||51.64||54.58||56.78||58.64||61.91|
|Dataset||Training set size||DVE Thewlis et al. |