Our study of unsupervised landmark detection began with the question of whether it is possible to store image landmarks within the bottleneck of an auto-encoder. How can we steer the auto-encoder so that its bottleneck contains only the information about the image landmarks, and what kind of regularization would be required for that? These questions, along with the fact that landmarks of the same class typically look similar (e.g., key-points extracted from different faces resemble each other), led us to the idea of comparing the post-encoder features with some 'average' landmarks.
Using auto-encoders to extract landmarks is akin to the principle of unsupervised segmentation, because it is in the bottleneck that the features of the segmentation contours could be distilled. Image segmentation has been a sought-after task in deep learning over the last five years 1; 2; 3; chowdhury2017blood. Today, the state-of-the-art algorithms show impressive results but often require massive amounts of annotated data, which is not always available 2; 3; chowdhury2017blood. Reducing the amount of labelled data required by segmentation algorithms is a task of pressing demand, which has been the other motivation for our study. Instead of complex segmentation patterns, we begin with the unsupervised detection of key-points in human faces, which is the main focus of this paper.
If an 'average' landmark pattern needs to be computed, it makes sense to calculate it as the barycenter with respect to the optimal transport distance bc_coords_proj. Wasserstein barycenters have risen in popularity in recent years, as they preserve common topological properties of sets of geometric figures in a wide range of tasks bc_coords_proj; bc_gar; shvetsov2020unsupervised.
In this work, we extend their use to regularization in the landmark detection problem. Once computed, the barycenter of landmarks can be used to calculate distances to the rest of the predicted (encoded) landmarks.
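As an illustration of the idea (a simplified sketch, not the barycenter algorithm of bc_coords_proj), an 'average' landmark pattern for equally-sized uniform point sets can be approximated by a fixed-point iteration that alternates optimal one-to-one matching with averaging of the matched points; for uniform discrete measures with equal numbers of points, the optimal transport plan is a permutation, so the Hungarian algorithm gives the exact matching:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def landmark_barycenter(point_sets, n_iter=10):
    """Fixed-point iteration for a simple 'average' of equally-sized
    2D landmark sets under optimal one-to-one matching."""
    bary = point_sets[0].copy()              # initialise with the first set
    for _ in range(n_iter):
        acc = np.zeros_like(bary)
        for pts in point_sets:
            # squared-distance cost matrix: barycenter points -> landmarks
            cost = ((bary[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
            row, col = linear_sum_assignment(cost)
            acc[row] += pts[col]             # accumulate matched partners
        bary = acc / len(point_sets)         # update = mean of matched points
    return bary
```

Each iteration re-matches every landmark set to the current estimate and averages, so the result is invariant to the ordering of points within each set.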
The principal architecture is illustrated in Figure 1. Besides the barycenter regularization, we employ regularization by geometric transforms DVE, which allows us to control deviations from the barycenter and to synchronize the coordinates of the image and the landmarks. For decoding, we use a generative neural network. More details of the method are described in Section 2 below. Overall, the contributions of this paper are the following:
The first method that predicts comprehensible landmarks in an unsupervised and interpretable way; it also works in the semi-supervised scenario, outperforming state-of-the-art methods;
Regularization by the Wasserstein distance to the barycenter, factorized by three types of transformations;
A new type of cyclic GAN architecture cyclegan that trains with data from only one domain and decomposes images into landmarks and style;
An extension of the stylegan2 architecture Stylegan2 to conditional image generation.
Traditionally, algorithms for unsupervised segmentation extract latent representations via deep autoencoders 6; 8; 7. These methods attempt to form clusters of the latent pixels which correspond to correlated parts of the initial image. To guarantee a direct correspondence between image and segmentation coordinates, the authors in 8 suggested the idea of regularization using geometric transformations, expressed as the equivariance condition S(g(I)) = g(S(I)), where I is the image, g is some deformation, and S is a segmentation mapping. Generative adversarial networks, such as SEIGAN 9, were also proposed for unsupervised segmentation, relying on the latent space representation, segment painting, and object embedding into another background (with the constraint that the image must remain realistic).
There are many landmark detection approaches, especially for faces. Initially, they were based on various statistical approaches, preprocessing, and enhancements AAM; SSDMD; AFLCLM. The rapid progress of deep learning then instigated a series of supervised methods: a cascade of CNNs DCNCFPD, multi-task learning (pose, smile, the presence of glasses, sex of the person) TCDCN, and recurrent attentive-refinement via Long Short-Term Memory networks (LSTMs) RAR. Special loss functions (e.g., the wing loss WingLoss) were shown to further improve the accuracy of CNN-based facial landmark localisation.
Unsupervised pre-training has seen major interest in the community with the advent of data-hungry deep networks. A classic approach to such a task is the use of autoencoders, comprising different variations of embeddings kp_next_im_gen; SPARSE; DVE. Papers kp_next_im_gen and FabNet use a reconstruction condition together with an additional sparsity condition on the landmark heatmap. If two images have different landmarks but the same style (e.g., sequential frames from a video), the network can generate one image from the other and the landmarks of the target. Method UDIT is similar, but it has an additional discriminator network that compares the distribution of the predicted landmarks to that of the landmarks in the dataset. A big drawback of these methods is the necessity of paired-image or video datasets. Another class of methods, such as SPARSE, Dense3D, and DVE, does employ geometric transformations for regularization. However, these works lose landmark interpretability in the process. Besides, their unsupervised nature was only implemented as a pretraining step for landmark detection.
In our work, we have chosen the Wasserstein distance because it can establish a pairwise correspondence between the predicted landmarks and the key-points of the barycenter. This makes the regularizer more flexible, enabling the comparison of landmark sets of different size and order. We refer to the work bc_coords_proj, which describes the projection of 3D figures into barycentric coordinates, and to the theoretical studies of barycenters in bc_gar; shvetsov2020unsupervised.
The training process of the unsupervised landmark detector, illustrated in Figure 1, consists of two main steps: conditional GAN training cyclegan; prokopenko2019unpaired and the actual landmark detection optimisation. These two steps are repeated every training iteration. When the encoder network predicts the heatmaps of landmarks for a batch of images, one applies the optimisation routine to the generator and discriminator networks. A mapping network maps the Gaussian noise vector to a random style, and a style encoder maps images (either fake or real) to their style.
The training paradigm of Algorithm 1 starts similarly to MUNIT MUNIT, where one encodes the style and the landmarks from an input image. Then, we separately restore the image and generate the fake. This makes GAN training more stable and allows us to decompose the image into content and style, with the role of content being played by the landmarks. The loss function for the discriminator and generator extends that of stylegan2 Stylegan2, where the penalty on the discriminator enforces a smoother separation of classes (fake and real images). In the generator optimization, we combine the original generator loss with the loss between the restored and initial images. We also add a term that computes the cross-entropy between the landmarks of the fake and those of the initial image, which introduces a constraint important for conditional generation only (in addition to the conditional discriminator). Random geometric transforms improve the coordinate synchronisation of the image, the landmarks, and the fake. We train the GAN together with the style encoder. The last loss element is necessary for the style adjustments, assuring that the style of the fake looks similar both to the generated style and to that of the transformed fake.
Such encoder training is enabled every third iteration. The resulting landmarks, produced by the encoder, form a heatmap whose entropy we want to reduce so that it concentrates around the mean values (minimization of the cross-entropy yields a mixture of Gaussian measures, with their centers located at the coordinates of the landmarks). The encoder is also optimised through the landmarks' participation in the generation of the fake and the restored images. The regularizers (the distance to the barycenter, the geometric-transform synchronisation, and the influence of the GAN) are described in Section 4.
The coefficients (hyperparameters) of the landmark-encoder loss are tuned on a validation dataset. The tuning procedure uses gradient-free optimization: we choose a random uniform direction in the hyperparameter space and minimize the validation score (defined in Section 5) along this direction using the golden-section search polyak, repeating this multiple times. If a given direction does not improve the score at the initial two points of the golden-section search, new directions are sampled several times, with the best one being selected. Gradient-free optimization was chosen thanks to its relatively low computational cost in this problem and its ability to aggregate updates of the network weights over more than one iteration.
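The tuning loop can be sketched as follows (a simplified illustration under our assumptions: a generic `score` callable plays the role of the validation score, and the direction-resampling heuristic from the text is reduced to an accept/reject test):

```python
import numpy as np

INVPHI = (np.sqrt(5) - 1) / 2  # inverse golden ratio

def golden_section(f, a, b, tol=1e-5):
    """Minimise a unimodal 1-D function on the interval [a, b]."""
    c, d = b - INVPHI * (b - a), a + INVPHI * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - INVPHI * (b - a)
        else:
            a, c = c, d
            d = a + INVPHI * (b - a)
    return (a + b) / 2

def tune(score, theta, n_rounds=40, radius=1.0, rng=None):
    """Random-direction line search over a hyperparameter vector theta."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_rounds):
        u = rng.normal(size=theta.shape)
        u /= np.linalg.norm(u)               # random unit direction
        t = golden_section(lambda s: score(theta + s * u), -radius, radius)
        if score(theta + t * u) < score(theta):
            theta = theta + t * u            # keep only improving steps
    return theta
```

Each round performs an exact 1-D minimisation along a random direction, so the search never needs gradients of the validation score.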
3 Architecture description
To build the conditional GAN for our purpose, we enhance the stylegan2 architecture Stylegan2 by introducing the following modifications to the original generator and discriminator (see Fig. 2).
Generator. We consecutively upsample and downsample the heatmap of the landmarks to match the resolutions of the generator, and then concatenate them with the outputs from the progression of the modulated convolution blocks. At the same time, the noise passes through a series of linear layers to place the style on a manifold, which helps it acquire the same topological properties as the images in the dataset. These styles are further used in the ModulatedConvBlock to obtain the corresponding weights.
Discriminator. In the discriminator, we concatenate the landmarks channel-wise with an intermediate layer obtained from the input images.
Landmark Encoder. Our landmark encoder consists of two principal parts. Owing to the recent success of the stacked hourglass model Hourglass, we have integrated it into our landmark encoder, which produces a heatmap with separate channels corresponding to key-point probabilities. Applying softmax to each channel makes the total probability of each point equal to one.
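A minimal numpy sketch of this output stage, assuming raw logits of shape (K, H, W) and a soft-argmax read-out of the coordinates (the read-out is our assumption; the text only specifies the channel-wise softmax):

```python
import numpy as np

def heatmap_to_coords(logits):
    """Turn per-keypoint heatmap logits (K, H, W) into expected 2D
    coordinates via a channel-wise softmax (each channel sums to 1)."""
    K, H, W = logits.shape
    flat = logits.reshape(K, -1)
    flat = flat - flat.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    p = p.reshape(K, H, W)
    ys = (p.sum(axis=2) * np.arange(H)).sum(axis=1)   # E[y] per channel
    xs = (p.sum(axis=1) * np.arange(W)).sum(axis=1)   # E[x] per channel
    return np.stack([xs, ys], axis=1)                 # (K, 2) as (x, y)
```

The expectation over the softmax distribution keeps the coordinate read-out differentiable, which is what makes such heatmaps usable inside end-to-end losses.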
Style Encoder. The style encoder is a regular CNN.
4 Loss function
The loss function of our architecture contains six components, each with its own physical meaning. We will now discuss each of them in detail.
Barycenter regularizer. If the landmark encoder predicts the landmarks Ψ, the transport path from Ψ to the barycenter B entails two principal kinds of transformations: the linear (affine) and the nonlinear (warping). Hence, the transport mapping is expressed as a sequential translation, two other affine transformations (rotation and scaling), and a nonlinear elastic deformation¹:

R_B(Ψ) = c_T W₂(T(Ψ), B) + c_A W₂(A ∘ T(Ψ), B) + c_E W₂(E ∘ A ∘ T(Ψ), B),

where W₂ is the Wasserstein-2 distance bc_coords_proj, T is the translation operator, A is the affine operator (rotation and scaling), and E is the elastic deformation. Each of these three terms is included with its own coefficient to vary the strength of the regularization accordingly. Naturally, simpler deformations are preferred, yielding lower coefficients for the corresponding terms.

¹We find expressing the translation term separately from the other affine transformations to be useful due to its bigger impact on the regularization.
The translation T is determined by the difference between the centers of mass of Ψ and B. Conventionally, it is tempting to express the affine matrix A via the covariance matrices of the two point sets; however, that would only determine A up to an arbitrary orthogonal matrix, because the covariance is invariant to an orthogonal transformation injected between Ψ and B. It is possible to resolve this ambiguity by establishing a pairwise correspondence between the source and the destination of the linear transport mapping. Namely, if P is a probability matrix of the complete transport plan, such that P_ij is the probability that the i-th point of Ψ moves to the j-th point of B, one can find the matrix of the affine operator by solving the following optimization problem:

A* = argmin_A Σ_ij P_ij ‖A x_i − y_j‖²,

where x_i are the (translated) points of Ψ and y_j are the key-points of B.
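The optimization problem above (finding the affine operator from a fixed transport plan P) has a closed-form solution via weighted least squares: setting the gradient to zero gives the normal equations A Σ_i p_i x_i x_iᵀ = Σ_ij P_ij y_j x_iᵀ with p_i = Σ_j P_ij. A minimal numpy sketch, with an assumed function name and with the translation omitted (it is handled separately in the text):

```python
import numpy as np

def affine_from_plan(X, Y, P):
    """Least-squares affine map A minimising sum_ij P_ij ||A x_i - y_j||^2.

    X: (n, d) source points, Y: (m, d) target points,
    P: (n, m) transport plan (P_ij = mass moved from x_i to y_j).
    """
    p = P.sum(axis=1)                  # source marginals p_i
    M = X.T @ (p[:, None] * X)         # (d, d) weighted Gram matrix
    C = Y.T @ P.T @ X                  # (d, d) cross term sum_ij P_ij y_j x_i^T
    return C @ np.linalg.inv(M)        # solve the normal equations
```

Because the plan P pins down which source mass goes to which target point, the orthogonal ambiguity of the covariance-based approach disappears.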
Geometric regularizer. To guarantee the correspondence of the coordinates of the landmarks to those in the image, we add a proper geometric regularization. The geometric regularization assures that the same affine and elastic transformations are applied both to the original image and to the predicted landmarks. After applying the deformations to the image, we use the encoder to predict a new set of landmarks in the transformed image. This loss component minimizes the cross-entropy between the transformed original landmarks and the landmarks predicted on the transformed image, along with the distance between their coordinates.
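A minimal numpy sketch of such a consistency penalty, assuming per-channel probability heatmaps and landmark coordinates as inputs (the function name and the unweighted sum are our illustrative assumptions, not the paper's exact loss):

```python
import numpy as np

def geometric_consistency(coords_a, coords_b, hm_a, hm_b, eps=1e-8):
    """Penalty between landmarks predicted on a deformed image (a) and
    deformed landmarks of the original image (b): mean squared coordinate
    distance plus cross-entropy of the corresponding heatmaps."""
    l2 = np.mean(np.sum((coords_a - coords_b) ** 2, axis=-1))
    # cross-entropy of hm_a under hm_b, summed per channel, averaged over K
    ce = -np.mean(np.sum(hm_b * np.log(hm_a + eps), axis=(1, 2)))
    return l2 + ce
```

Note that the cross-entropy term is bounded below by the entropy of hm_b, so it does not vanish even for a perfectly equivariant encoder; only the coordinate term reaches zero.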
Landmarks and style decorrelation. Obviously, the generator, fed with exactly the same set of landmarks but with different styles, should create semantically similar images. Hence, we generate two fakes for the same set of landmarks with two different styles, and then minimize the norm of the difference between the landmarks of the produced fakes.
Reconstruction. This term compares the original image to the reconstructed one, produced by the generator from the given set of landmarks and the encoded style.
Discriminator loss. We further increase the correlation between the image and the landmarks by chaining the landmark encoder with the GAN generator losses.
Style consistency. We make the encoded style consistent with the style generated from noise and make it invariant to the geometric transformations.
5 Experiments
We consider two popular face datasets: CelebA celebadataset and 300-W w300dataset. The former lacks ground-truth landmarks, while the latter contains 68 annotated key-points per face, with 3k training pairs in total. These datasets are quite similar, but the scale of the images differs somewhat, providing a good setting for testing the robustness and universality of our method. As described above, we aspire to accomplish a completely unsupervised extraction of landmarks using the BRULÉ framework; but we also demonstrate efficacy in the semi-supervised training scenario so that we can compare to the state-of-the-art methods.
Unsupervised experiments. In the unsupervised scenario, we first compute the barycenter using the given landmarks from the 300-W dataset. We stress that this initialisation does not contradict the criteria for being unsupervised, because by doing so we essentially only 'show' the object of interest to the model. Once computed, the barycenter is kept fixed for all experiments. The training follows the recipe from Section 2, with fixed values for the Algorithm 1 hyperparameters, the initial values of the automatically tuned Algorithm 2 hyperparameters, and the coefficients of the barycenter regularizer. The error values of the landmark prediction were evaluated on the standard test set of 300-W, using a conventional metric, the inter-ocular distance (IOD) interocular_distance. However, during training, instead of the Euclidean norm used in other works, we used the Wasserstein distance bc_coords_proj between the predicted and the ground-truth landmarks (divided by the distance between the key-points corresponding to the outer eye corners), because the key-points lack a pairwise correspondence. The chosen test-set metric (IOD) was identical for all methods in our comparison.
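For two uniform point clouds of equal size, the optimal transport plan is a permutation, so the Wasserstein-2 distance between predicted and ground-truth key-point sets can be computed exactly with a Hungarian matching. A small sketch (the function name and the square-root normalisation are our assumptions, not the paper's):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_points(P, Q):
    """Wasserstein-2 distance between two uniform point clouds of equal
    size; the optimal plan is then a one-to-one matching."""
    cost = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # squared distances
    r, c = linear_sum_assignment(cost)                     # optimal matching
    return np.sqrt(cost[r, c].mean())                      # sqrt of W2^2
```

Unlike a per-index Euclidean norm, this distance is insensitive to the ordering of the key-points, which is exactly why it suits sets that lack a pairwise correspondence.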
|Method|Training|IOD error|
|Sparse SPARSE|un. pretr.|7.97|
|Fab-Net FabNet|un. pretr.|5.71|
|UDIT UDIT|un. pretr.|5.37|
|Dense 3D Dense3D|un. pretr.|8.23|
|DVE HG DVE|un. pretr.|4.65|
Performance of our method is demonstrated in Figure 5, and the comparison against the state-of-the-art methods is summarized in Table 4. Training the model takes three days on three Tesla V100 GPUs, which is of the same order of magnitude as for the other models in Table 4, as well as for other architectures that utilize stylegan-like frameworks. Additional details about training and evaluation are given in the Supplementary materials.
Semi-supervised experiments. In the 'semi-supervised' case, we have trained the GAN part (i.e., Algorithm 1) on CelebA and the landmark encoder on 300-W. The barycenter regularisation has been excluded from the loss function because it is meaningless to complement the true known landmarks with the 'average' ones that come from the barycenter.² Intuitively, the barycenter regularizer is efficient only on limited training sets with missing annotations. If the true labels are available, it has a limited effect, and the BRULÉ pipeline acts as a generic regularizer, which increases training stability and reduces overfitting, as shown in Figures 4 and 6.

²Yet, the negligible effect of the barycenter on the supervised training was confirmed experimentally.
Discussion. The face landmarks predicted by our unsupervised method, as illustrated in Figure 5, are accurate, especially in the examples with faces photographed frontally. Because this is the predominant orientation of the face in both datasets, the barycenter indeed 'looks' like a frontal photograph. Some problems emerge when the testing images are acquired from a side view, probably requiring larger warp deformations or more complex affine transformations to compensate for the 'ruined' orientation. In this first version of BRULÉ, precise detection of the landmarks seems to be possible only in the local vicinity of the barycenter because of the weights assigned to the different terms in the loss function. We present a comprehensive study of these weight coefficients in the Supplementary material.³ However, the encoder captures all the landmark transformations only in the 2D image plane, because the geometric transforms used in the regularizer and in the discriminator are both 2D transforms. Hence, future work will entail an extension of these transforms into the 3D domain, which is expected to significantly boost performance on faces that look up, down, or sideways, or that have an unusual viewing angle.

³Also: the extra meaning behind the terms in the loss function, e.g., how to avoid the collapse of landmarks.
Moreover, Figure 6 shows the first adaptation of stylegan2 Stylegan2 to conditional generation. In the semi-supervised case, it splits the landmarks and the style data exceptionally well, so that we can generate fake photos of a person (the fixed style) from different sets of landmarks. We find BRULÉ to be well positioned for regularization in active learning (AL) frameworks to gain efficient annotation strategies shelmanov2019active. AL and multi-class barycenters are both obvious extensions for future work. This effort paves the way towards the vision of complete image understanding with no supervision.
6 Conclusion
Complete understanding of a scene in an image is the ultimate goal of 'smart' computer vision algorithms. To move in the direction of that vision, it is natural to begin comprehension by extracting the smaller scene components which constitute the image. Carrying out such a decomposition into individual parts (a.k.a. segmentation) in a completely unsupervised manner is what motivated our work. Herein, we have solved this task for a somewhat simpler problem: unsupervised landmark detection, which can be considered the first step towards the bigger goal.
We demonstrate the efficiency of the method on datasets that contain images of faces, with the segmentation implying extraction of the facial landmarks (key-points). One of our most striking 'firsts' is that this key-point extraction is 100% interpretable. An immediate practical impact is expected in the areas of video and image editing, where the detection/selection of the contours of an object can be done automatically. Painting over an object, changing the style of an object in a scene, combining various objects from different sources in one picture – all of these areas have acquired a powerful tool for their arsenal.
As another important impact of our work, the BRULÉ approach will become extremely useful in the biomedical field. In the clinical setting, where one requires high accuracy of per-pixel image predictions to support diagnostic decisions, only specialists from the field can perform the essential annotations, significantly slowing down the creation of an annotated dataset. Our method, however, relies on the use of barycenters (the 'average image' values), which effectively allows us to address this problem: because the same organ is quite similar among different patients, computing the barycenter is a sensible way to generalize across large patient cohorts. Our method will change the way anatomic segmentation is approached when there is a limited amount of annotated data.
We thank Ivan Oseledets and Victor Lempitsky for helpful discussions.