DenseReg
Code repository for DenseReg.
In this paper we propose to learn a mapping from image pixels into a dense template grid through a fully convolutional network. We formulate this task as a regression problem and train our network by leveraging manually annotated facial landmarks "in-the-wild". We use such landmarks to establish a dense correspondence field between a three-dimensional object template and the input image, which then serves as the ground-truth for training our regression system. We show that we can combine ideas from semantic segmentation with regression networks, yielding a highly accurate "quantized regression" architecture. Our system, called DenseReg, allows us to estimate dense image-to-template correspondences in a fully convolutional manner. As such, our network can provide useful correspondence information as a stand-alone system, while, when used as an initialization for Statistical Deformable Models, it yields landmark localization results that largely outperform the current state-of-the-art on the challenging 300W benchmark. We thoroughly evaluate our method on a host of facial analysis tasks and also provide qualitative results for dense human body correspondence. We make our code available at http://alpguler.com/DenseReg.html along with supplementary materials.
Non-planar object deformations, e.g. due to facial pose or expression, result in challenging but also informative signal variations. Our objective in this paper is to recover this information in a feedforward manner by employing a discriminatively trained convolutional network. Motivated by the gap between discriminatively trained systems for detection and category-level deformable models, we propose a system that combines the merits of both.
In particular, discriminative learning-based approaches typically pursue invariance to shape deformations, for instance by employing local ‘max-pooling’ operations to elicit responses that are invariant to local translations. As such, these models can reliably detect patterns irrespective of their deformations through efficient, feedforward algorithms. At the same time, however, this discards useful shape-related information and only delivers a single categorical decision per position. Several recent works in deep learning have aimed at enriching deep networks with information about shape by explicitly modelling the effect of similarity transformations [33] or non-rigid deformations [20, 18, 9]; several of these have found success in classification [33], fine-grained recognition [20], and also face detection [9]. There are works [24, 36] that model the deformation via optimization procedures, whereas we obtain it in a feedforward manner and in a single shot. In these works, shape is treated as a nuisance, while we treat it as the goal in itself. Recent works on 3D surface correspondence [31, 4] have shown the merit of CNN-based unary terms for correspondence. There are works that address the problem of establishing dense correspondence for the human body from static RGBD images [43, 37, 49]. In our case we tackle the much more challenging task of establishing a 2D to 3D correspondence in the wild by leveraging recent advances in semantic segmentation [10]. To the best of our knowledge, the task of explicitly recovering dense correspondence in the wild has not been addressed yet in the context of deep learning.
By contrast, approaches that rely on Statistical Deformable Models (SDMs), such as Active Appearance Models (AAMs) or 3D Morphable Models (3DMMs), aim at explicitly recovering dense correspondences between a deformation-free template and the observed image, rather than trying to discard them. This makes it possible both to represent shape-related information (e.g. for facial expression analysis) and to obtain invariant decisions after registration (e.g. for identification). Explicitly representing shape can have substantial performance benefits, as is witnessed in the majority of facial analysis tasks requiring detailed face information, e.g. landmark localisation [40], 3D pose estimation, as well as 3D face reconstruction “in-the-wild” [22], where SDMs constitute the current state of the art.
However, SDM-based methods are limited in two respects. Firstly, they require an initialization from external systems, which can become increasingly challenging for elaborate SDMs: both AAMs and 3DMMs require at least a bounding box as initialization, and 3DMMs may further require the positions of specific facial landmarks. Secondly, SDM fitting requires iterative, time-demanding optimization algorithms, especially when the initialisation is far from the solution. The advent of Deep Learning has made it possible to replace the iterative optimization task with iterative regression problems [44], but this does not alleviate the need for initialization and multiple iterations.
In this work we aim at bridging these two approaches, and introduce a discriminatively trained network to obtain, in a fully-convolutional manner, dense correspondences between an input image and a deformation-free template coordinate system.
In particular, we exploit the availability of manual facial landmark annotations “in-the-wild” in order to fit a 3D template; this provides us with a dense correspondence field from the image domain to the two-dimensional parameterization of the face surface. We then train a fully convolutional network that densely regresses from the image pixels to this coordinate space.
This provides us with dense and fine-grained correspondence information, as in the case of SDMs, while at the same time being independent of any initialization procedure, as in the case of discriminatively trained ‘fully-convolutional’ networks. We demonstrate that the performance of certain tasks, such as facial landmark localisation or semantic part segmentation, is largely improved by using the proposed network.
Even though the methodology is general, this paper is mainly concerned with human faces. The architecture for the case of human face is described in Fig. 1.
Our approach can be seen in two complementary manners: first, it provides a stand-alone, feedforward alternative to the combination of initialization with iterative fitting typically used in SDMs. This allows us to have a feedforward system that solves both the detection and correspondence problems at approximate frames per second for an input image. Secondly, our approach can also be understood as an initialization procedure for SDMs which gets them started from a much more accurate position than the bounding box or landmark-based initializations currently employed in the face analysis literature. When taking this approach we observe substantial gains over the current state-of-the-art systems.
We can summarize our contributions as follows:
We introduce the task of dense shape regression in the setting of CNNs, and exploit the SDM-based notion of a deformation-free UV-space to construct target ground-truth signals (Sec.2).
We propose a carefully-designed fully-convolutional shape regression system that exploits ideas from semantic segmentation and dense regression networks. Our quantized regression architecture (Sec.3) is shown to substantially outperform simpler baselines that consider the task as a plain regression problem.
We use dense shape regression to jointly tackle a multitude of problems, such as landmark localization or semantic segmentation. In particular, the template coordinates allow us to ‘copy’ multiple annotations constructed on a single template system, and thereby tackle multiple problems in a single go.
We use the regressed shape coordinates for the initialization of SDMs; systematic evaluations on facial analysis benchmarks show that this yields substantial performance improvements on tasks ranging from landmark localization to semantic segmentation.
We demonstrate the generic nature of the method by applying it to the task of estimating dense correspondences for other deformable surfaces, such as the human body and the human ear.
Following the deformable template paradigm [53, 17], we consider that object instances are obtained by deforming a prototypical object, or ‘template’, through dense deformation fields. This makes it possible to factor object variability within a category into variations that are associated to deformations, generally linked to the object’s 2D/3D shape, and variations that are associated to appearance (or, ‘texture’ in graphics), e.g. due to facial hair, skin color, or illumination.
This factorization largely simplifies the modelling task. SDMs use it as a stepping stone for the construction of parametric models of deformation and appearance. For instance, in AAMs a combination of Procrustes Analysis, Thin-Plate Spline warping and PCA is the standard pipeline for learning a low-dimensional linear subspace that captures category-specific shape variability [13]. Even though we have a common starting point, rather than trying to construct a linear generative model of deformations, we treat the image-to-template correspondence as a vector field that our network tries to regress.
In particular, we start from a template $X = [x_1^\top, x_2^\top, \ldots, x_m^\top]^\top \in \mathbb{R}^{3m}$, where each $x_j \in \mathbb{R}^3$ is a vertex location of the mesh in 3D space.
This template could be any 3D facial mesh, but in practice it is most useful to use a topology that is in correspondence with a 3D statistical shape model such as [2] or [35]. We compute a bijective mapping $\psi$ from the template mesh $X$ to the 2D canonical space $U \in \mathbb{R}^{2 \times m}$, such that
$$\psi(x_j) \mapsto u_j \in U, \qquad \psi^{-1}(u_j) \mapsto x_j. \tag{1}$$
The mapping $\psi$ is obtained via the cylindrical unwrapping described in [3]. Thanks to the cylindrical unwrapping, we can interpret these coordinates as being the horizontal and vertical coordinates while moving on the face surface: $u^h_j \in [0, 1]$ and $u^v_j \in [0, 1]$. Note that this semantically meaningful parameterization has no effect on the operation of our method.
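As a rough illustration of how a cylindrical unwrapping can assign normalized coordinates to mesh vertices, the sketch below maps each vertex to an angle around a vertical axis and a height along it, then rescales both to [0, 1]. The axis convention (y up, face looking along +z), the function name `cylindrical_unwrap`, and the min-max normalization are assumptions made for this example; the unwrapping of [3] may differ in its details.

```python
import math

def cylindrical_unwrap(vertices):
    """Map 3D vertices (x, y, z) to normalized (u_h, u_v) pairs in [0, 1].

    Assumption: the cylinder axis is the y axis and the face looks along +z.
    """
    # The angle around the vertical axis gives the horizontal coordinate.
    angles = [math.atan2(x, z) for x, _, z in vertices]
    # The height along the axis gives the vertical coordinate.
    heights = [y for _, y, _ in vertices]

    def normalize(values):
        lo, hi = min(values), max(values)
        span = hi - lo
        if span == 0:
            return [0.0 for _ in values]
        return [(v - lo) / span for v in values]

    return list(zip(normalize(angles), normalize(heights)))

# Three hypothetical template vertices: left, centre-top, right-bottom.
template = [(-1.0, 0.0, 1.0), (0.0, 1.0, 1.5), (1.0, -1.0, 1.0)]
uv = cylindrical_unwrap(template)
```

Each vertex keeps its index, so the inverse lookup from a UV coordinate back to its vertex is a table lookup as long as no two vertices share the same angle and height.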
We exploit the availability of landmark annotations “in the wild” to fit the template face to the image by obtaining a coordinate transformation for each vertex $x_j$. We use the fittings provided by [55], which were obtained using a modified 3DMM implementation [39]. However, for the purpose of this paper, we require a per-pixel estimate of the location in UV space on our template mesh and thus do not require an estimate of the projection or model parameters as required by other 3D landmark recovery methods [22, 55]. The per-pixel UV coordinates are obtained through rasterization of the fitted mesh, and non-visible vertices are culled via z-buffering.
As illustrated in Fig. 2, once the transformation from the template face vertices to the morphed vertices is established, the coordinates of each visible vertex on the canonical face can be transferred to the image space. This establishes the ground truth signal for our subsequent regression task.
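The transfer of template coordinates to visible image pixels can be sketched as follows. This is a simplified illustration, not the rasterizer used in the paper: it splats individual vertices instead of filling triangles, it assumes the vertices are already projected to pixel coordinates, and it assumes that a smaller depth value means closer to the camera. The function name `splat_uv_with_zbuffer` is hypothetical.

```python
def splat_uv_with_zbuffer(projected, uv, width, height):
    """Build a per-pixel UV map from projected vertices, keeping the nearest one.

    projected: list of (col, row, depth) per vertex, in pixel units.
    uv:        list of (u_h, u_v) per vertex on the template.
    Assumption: smaller depth means closer to the camera.
    """
    depth_buffer = [[float("inf")] * width for _ in range(height)]
    uv_map = [[None] * width for _ in range(height)]  # None marks background
    for (col, row, depth), coords in zip(projected, uv):
        c, r = int(round(col)), int(round(row))
        if not (0 <= c < width and 0 <= r < height):
            continue  # vertex projects outside the image
        if depth < depth_buffer[r][c]:
            # This vertex is in front of whatever was stored before.
            depth_buffer[r][c] = depth
            uv_map[r][c] = coords
    return uv_map

# Two vertices land on pixel (row 1, col 1); the nearer one wins.
projected = [(1.0, 1.0, 5.0), (1.2, 0.9, 2.0), (3.0, 0.0, 4.0)]
uv = [(0.1, 0.1), (0.6, 0.4), (0.9, 0.2)]
uv_map = splat_uv_with_zbuffer(projected, uv, width=4, height=2)
```

Pixels that receive no vertex stay `None`, which plays the role of the non-face label in the supervision signal.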
Having described how we establish our supervision signal, we now turn to the task of estimating it through a convolutional neural network (CNN). Our aim is to estimate, at any image pixel that belongs to a face region, the values of $u = [u^h, u^v]$. We also need to identify non-face pixels, e.g. by predicting a ‘dummy’ output.
One can phrase this problem as a generic regression task and attack it with the powerful machinery of CNNs. Unfortunately, the best performance that we could obtain this way was quite underwhelming, apparently due to the task’s complexity. Our approach is to quantize $u$ and estimate the quantization error separately for each quantized value. Instead of directly regressing $u$, the quantized regression approach lets us solve a set of easier sub-problems, yielding improved regression results.
In particular, instead of using a CNN as a ‘black box’ regressor, we draw inspiration from the success of recent works on semantic part segmentation [45, 11], and landmark classification [5, 6]. These works have shown that CNNs can deliver remarkably accurate predictions when trained to predict categorical variables, indicating for instance the facial part or landmark corresponding to each pixel.
Building on these successes, we propose a hybrid method that combines a classification with a regression problem. Intuitively, we first identify a coarser face region that can contain each pixel, and then obtain a refined, region-specific prediction of the pixel’s correspondence field. As we will describe below, this yields substantial gains in performance when compared to the baseline of a generic regression system.
We identify facial regions by using a simple geometric approach. We tessellate the template’s surface with a cartesian grid, by uniformly and separately quantizing the $u^h$ and $u^v$ coordinates into $K$ bins, where $K$ is a design parameter. For any image that is brought into correspondence with the template domain, this induces a discrete labelling, which can be recovered by training a CNN for classification.
In Fig. 4, the tessellations of different granularities are visualized. For a sufficiently large value of $K$, even a plain classification result could provide a reasonable estimate of the pixel’s correspondence field, albeit with some staircasing effects. The challenge here is that as the granularity of these discrete labels increases, the amount of available training data per label decreases and label complexity increases. A more detailed analysis of the effect of label-space granularity on segmentation performance is provided in the supplementary materials.
We propose to combine powerful classification results with a regression problem that will yield a refined correspondence estimate. For this, we compute the residual between the desired and quantized coordinates and add a separate module that tries to regress it. We train a separate regressor per facial region, and at any pixel only penalize the regressor loss for the responsible face region. We can interpret this form as a ‘hard’ version of a mixture of regression experts [21]. This interpretation is further elaborated upon in the supplementary material.
The horizontal and vertical components of the correspondence field are predicted separately. This results in a substantial reduction in computational and sample complexity: for $K$ distinct U and V bins we have $K^2$ regions; the classification is obtained by combining 2 $K$-way classifiers. Similarly, the regression mapping involves $K^2$ regions, but only uses $2K$ one-dimensional regression units. The pipeline for quantized face shape regression is provided in Fig. 3.
We now detail the training and testing of this network; for simplicity we only describe the horizontal component of the mapping. From the ground truth construction, every image position is associated with a scalar ground-truth value $u^h$. Rather than trying to predict $u^h$ as is, we transform it into a pair of discrete and continuous values, $q^h$ and $r^h$, encoding the quantization and residual respectively:
$$q^h = \left\lfloor \frac{u^h}{d} \right\rfloor, \qquad r^h = u^h - q^h d, \tag{2}$$
where $d = 1/K$ is the quantization step size (we consider $u^h, u^v$ coordinates to lie in $[0, 1]$).
Given a common CNN trunk, we use two classification branches to predict $q^h, q^v$ and two regression branches to predict $r^h, r^v$ as convolution layers with kernel size $1 \times 1$. As mentioned earlier, we employ separate regression functions per region, which means that at any position we have $K$ estimates of the horizontal residual, $\hat{r}^h_i$, $i = 0, \ldots, K-1$.
At test time, we let the network predict the discrete bin $\hat{q}^h$ associated with every input position, and then use the respective regressor output $\hat{r}^h_{\hat{q}^h}$ to obtain an estimate of $u^h$:
$$\hat{u}^h = \hat{q}^h d + \hat{r}^h_{\hat{q}^h}. \tag{3}$$
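The encoding of a coordinate into a bin and a residual, and the test-time decoding from the most likely bin and that bin's own residual estimate, can be sketched for a single pixel and a single component as below. The function names are hypothetical, and clamping a coordinate of exactly 1 into the last bin is an assumption of this sketch rather than something stated in the text.

```python
import math

def quantize(u, num_bins):
    """Split u in [0, 1] into a bin index q and a residual r with u = q*d + r."""
    d = 1.0 / num_bins
    # Assumption: clamp so that u == 1.0 falls into the last bin.
    q = min(int(math.floor(u / d)), num_bins - 1)
    r = u - q * d
    return q, r

def decode(class_scores, residuals, num_bins):
    """Pick the most likely bin and add that bin's own residual estimate."""
    d = 1.0 / num_bins
    q_hat = max(range(num_bins), key=lambda k: class_scores[k])
    return q_hat * d + residuals[q_hat]

K = 4
q, r = quantize(0.6, K)             # bin 2, residual about 0.1
scores = [0.05, 0.10, 0.80, 0.05]   # hypothetical classifier output
per_bin_residuals = [0.0, 0.2, r, 0.1]
u_hat = decode(scores, per_bin_residuals, K)
```

Only the residual belonging to the selected bin is read at decoding time; the other entries of `per_bin_residuals` have no influence on the estimate.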
For $q^h$ and $q^v$, which are modeled as categorical distributions, we use softmax followed by the cross entropy loss. For estimating $r^h$ and $r^v$, we use a normalized version of the smooth $\ell_1$ loss [16]. The normalization is obtained by dividing the loss by the number of pixels that contribute to the loss.
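A minimal sketch of how these two losses could be combined for one component is given below. It follows the ‘hard’ mixture-of-experts idea described earlier: at each pixel only the regressor of the ground-truth bin is penalized. The smooth L1 penalty uses the standard definition from [16]; averaging the cross entropy over pixels and the `reg_weight` parameter are assumptions of this example, not values from the paper.

```python
import math

def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def cross_entropy(scores, target):
    """Softmax followed by the negative log-likelihood of the target bin."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]

def quantized_regression_loss(pixels, reg_weight=1.0):
    """pixels: list of (scores, residual_preds, q_true, r_true) for face pixels.

    Only the regressor of the ground-truth bin is penalized at each pixel.
    """
    if not pixels:
        return 0.0
    cls = sum(cross_entropy(s, q) for s, _, q, _ in pixels)
    reg = sum(smooth_l1(preds[q] - r) for _, preds, q, r in pixels)
    n = len(pixels)  # normalize by the number of contributing pixels
    return cls / n + reg_weight * reg / n

# One pixel, two bins: the large residual of the wrong bin (9.0) is ignored.
loss = quantized_regression_loss([([0.0, 0.0], [0.3, 9.0], 0, 0.1)])
```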
Compared to plain regression of the coordinates, the proposed method achieves much better results. In Fig. 5 we report results of an experiment that evaluates the contribution of the q-r branches separately for different granularities. The results for the quantized branch are evaluated by transforming the discrete horizontal/vertical label into the center of the region corresponding to the quantized horizontal/vertical value respectively. The results show the merit of adopting the classification branch, as the finely quantized results (K=40, 60) yield better coordinate estimates than the non-quantized alternative (K=1). After K=40, we observe an increase in the failure rate for the quantized branch. The experiment reveals that the proposed quantized regression outperforms both the non-quantized and the best of the only-quantized alternatives.
Herein, we evaluate the performance of the proposed method (referred to as DenseReg) on various tasks. In the following sections, we first describe the training setup (Sec. 4.1) and then present extensive quantitative and qualitative results on (i) semantic segmentation (Sec. 4.2), (ii) landmark localization on static images (Sec. 4.3), (iii) deformable tracking (Sec. 4.4), (iv) monocular depth estimation (Sec. 4.5), (v) dense correspondence on human bodies (Sec. 4.6), and (vi) human ear landmark localization (Sec. 4.7).
Training Databases. We train our system using the 3DDFA data of [55]. The 3DDFA data provides projection and 3DMM model parameters for the Basel [35] + FaceWarehouse [7] model for each image of the 300W database. We use the topology defined by this model to define our UV space and rasterize the images to obtain per-pixel ground truth UV coordinates. Our training set consists of the LFPW trainset, Helen trainset and AFW, thus 3148 images that are captured under completely unconstrained conditions and exhibit large variations in pose, expression, illumination, age, etc. Many of these images contain multiple faces, some of which are not annotated. We deal with this issue by employing the out-of-the-box DPM face detector of Mathias et al. [32] to obtain the regions that contain a face for all of the images. The detected regions that do not overlap with the ground truth landmarks do not contribute to the loss. For training and testing, we have rescaled the images such that their largest side is 800 pixels.
CNN Training. For the dense regression network, we adopt a ResNet101 [19] architecture with dilated (atrous) convolutions [10, 29], such that the stride of the CNN is . We use bilinear interpolation to upscale both the q and r branches before the losses. The losses are applied at the input image scale and back-propagated through interpolation. We apply a weight to the smooth $\ell_1$ loss layers to balance their contribution. In our experiments, we have used a weight of for quantized () and a weight of for non-quantized regression, which are determined by a coarse cross validation. We initialize the training with a network pre-trained for the MS COCO segmentation task [27]. The new layers are initialized with random weights drawn from Gaussian distributions. Large weights of the regression losses can be problematic at initialization even with moderate learning rates. To cope with this, we use initial training with a lower learning rate for a warm start for a few iterations. We then use a base learning rate of with a polynomial decay policy for iterations with a batch size of images. During training, each sample is randomly scaled with one of the ratios and cropped to form a fixed input image.
As discussed in Sec. 2, any labelling function defined on the template shape can be transferred to the image domain using the regressed coordinates. One application that can be naturally represented on the template shape is semantic segmentation of facial parts. To this end, we manually defined a segmentation mask of 8 classes (right/left eye, right/left eyebrow, upper/lower lip, nose, other) on the template shape, as shown in Fig. 1.
We compare against a state-of-the-art semantic part segmentation system (DeepLab-v2) [11] which is based on the same ResNet-101 architecture as our proposed DenseReg. We train DeepLab-v2 on the same training images (i.e. LFPW trainset, Helen trainset and AFW). We generate the ground-truth segmentation labels for both training and testing images by transferring the segmentation mask using the ground-truth deformation-free coordinates explained in Sec. 2. We employ the Helen testset [26] for the evaluation.
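The label transfer used here, both for predictions and for generating ground-truth masks, amounts to looking up a template label at each pixel's UV coordinate. The sketch below assumes the template mask is stored as a small 2D grid over the UV square and uses a nearest-cell lookup; the grid representation, the function name `transfer_labels`, and the background id are assumptions of this example.

```python
def transfer_labels(uv_map, template_labels, background=0):
    """Look up a template label for every pixel from its UV coordinates.

    uv_map:          rows of (u_h, u_v) tuples in [0, 1], or None for non-face pixels.
    template_labels: 2D grid of class ids sampled over the UV square,
                     indexed as template_labels[v_index][u_index].
    """
    rows, cols = len(template_labels), len(template_labels[0])
    out = []
    for uv_row in uv_map:
        out_row = []
        for uv in uv_row:
            if uv is None:
                out_row.append(background)
                continue
            u, v = uv
            # Nearest-cell lookup; clamp so that a coordinate of 1.0 stays in range.
            ui = min(int(u * cols), cols - 1)
            vi = min(int(v * rows), rows - 1)
            out_row.append(template_labels[vi][ui])
        out.append(out_row)
    return out

# A toy 2x2 template mask and a 2x2 image whose last pixel is background.
template_labels = [[1, 2],
                   [3, 4]]
uv_map = [[(0.1, 0.1), (0.9, 0.1)],
          [(0.1, 0.9), None]]
segmentation = transfer_labels(uv_map, template_labels)
```

The same lookup works for any annotation defined once on the template, which is what allows several tasks to share a single regressed coordinate field.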
Table 1 reports evaluation results using the intersection-over-union (IoU) ratio. Additionally, Fig. 6 shows some qualitative results for both methods, along with the ground-truth segmentation labels. The results indicate that DenseReg outperforms DeepLab-v2 on every class except the right eye, where the two are nearly tied (73.53 vs. 73.67). The reported improvement is substantial for several parts, such as eyebrows and lips. We believe that this result is significant given that DenseReg is not optimized for the specific task-at-hand, as opposed to DeepLab-v2 which was trained for semantic segmentation. This performance difference can be explained by the fact that DenseReg was exposed to a richer label structure during training, which reflects the underlying variability and structure of the problem.
Class | DenseReg IoU (%) | DeepLab-v2 IoU (%) |
Left Eyebrow | 48.35 | 40.57 |
Right Eyebrow | 46.89 | 41.85 |
Left Eye | 75.06 | 73.65 |
Right Eye | 73.53 | 73.67 |
Upper Lip | 69.52 | 62.04 |
Lower Lip | 75.18 | 70.71 |
Nose | 87.71 | 86.76 |
Other | 99.44 | 99.37 |
Average | 71.96 | 68.58 |
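For reference, the per-class intersection-over-union reported in Table 1 can be computed from two label maps as in the sketch below. How the paper aggregates over images (per image or over all pixels of the test set) is not stated here, so this example simply scores one pair of label maps; the function name is hypothetical.

```python
def per_class_iou(predicted, ground_truth, num_classes):
    """Return the IoU of each class for two label maps of equal shape."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # Pixel belongs to both masks of class p.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel belongs to the predicted mask of p and the true mask of g.
                union[p] += 1
                union[g] += 1
    return [i / u if u else float("nan") for i, u in zip(intersection, union)]

pred = [[0, 1, 1],
        [0, 0, 1]]
gt = [[0, 1, 0],
      [0, 1, 1]]
ious = per_class_iou(pred, gt, num_classes=2)
```

A class that appears in neither map yields `nan`, so it can be excluded when averaging.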
DenseReg can be readily used for the task of facial landmark localization on static images. Given the landmarks’ locations on the template shape, it is straightforward to estimate the closest points in the deformation-free coordinates on the images. The local minima of the Euclidean distance between the estimated coordinates and the landmark coordinates are considered as detected landmarks. In order to find the local minima, we simply analyze the connected components separately. Even though more sophisticated methods for covering “touching shapes” can be used, we found that this simplistic approach is sufficient for the task.
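A simplified version of this lookup is sketched below. It assumes a single face in the image and therefore takes the global minimum of the squared Euclidean distance in UV space for each landmark, instead of analysing connected components for local minima as described above; the function name `locate_landmarks` is hypothetical.

```python
def locate_landmarks(uv_map, landmark_uvs):
    """For each template landmark, return the (row, col) of the closest pixel in UV space.

    uv_map: rows of (u_h, u_v) tuples, or None for background pixels.
    """
    located = []
    for lu, lv in landmark_uvs:
        best, best_dist = None, float("inf")
        for r, uv_row in enumerate(uv_map):
            for c, uv in enumerate(uv_row):
                if uv is None:
                    continue  # background pixels carry no coordinates
                dist = (uv[0] - lu) ** 2 + (uv[1] - lv) ** 2
                if dist < best_dist:
                    best, best_dist = (r, c), dist
        located.append(best)
    return located

# Toy 2x3 coordinate field whose first column is background.
uv_map = [[None, (0.2, 0.2), (0.5, 0.2)],
          [None, (0.2, 0.6), (0.5, 0.6)]]
landmarks = locate_landmarks(uv_map, [(0.45, 0.55), (0.25, 0.15)])
```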
Note that, among all visible pixels, the one whose deformation-free coordinates are closest to a landmark point is not necessarily the correct corresponding landmark. This phenomenon is called “landmark marching” [56] and mostly affects the jaw landmarks, which are dependent on changes in head pose. It should be noted that we do not use any explicit supervision for landmark detection nor focus on ad-hoc methods to cope with this issue. Errors on jaw landmarks due to invisible coordinates and improvements thanks to deformable models can be observed in Fig. 8.
Herein, we evaluate the landmark localization performance of DenseReg as well as the performance obtained by employing DenseReg as an initialization for deformable models [34, 46, 1, 44] trained for the specific task. In the second scenario, we provide a slightly improved initialization with very small computational cost by reconstructing the detected landmarks with a PCA shape model that is constructed from ground-truth annotations.
We present experimental results using the very challenging 300W benchmark. This is the testing database that was used in the 300W competition [41, 40] - the most important facial landmark localization challenge. The error is measured using the point-to-point RMS error normalized with the interocular distance and reported in the form of Cumulative Error Distribution (CED). Figure 7 (bottom) presents some self-evaluations in which we compare the quality of initialization for deformable modelling between DenseReg and two other standard face detection techniques (HOG-SVM [23], DPM [32]). The employed deformable models are the popular generative approach of patch-based Active Appearance Models (AAM) [34, 46, 1], as well as the current state-of-the-art approach of Mnemonic Descent Method (MDM) [44]. It is interesting to notice that the performance of DenseReg without any additional deformable model on top, already outperforms even HOG-SVM detection combined with MDM. Especially when DenseReg is combined with MDM, it greatly outperforms all other combinations.
Method | AUC | Failure Rate (%) |
DenseReg + MDM | 0.5219 | 3.67 |
DenseReg | 0.3605 | 10.83 |
Fan et al. [15] | 0.4802 | 14.83 |
Deng et al. [14] | 0.4752 | 5.5 |
Martinez et al. [30] | 0.3779 | 16.0 |
Cech et al. [8] | 0.2218 | 33.83 |
Uricar et al. [48] | 0.2109 | 32.17 |
We outperform all competitors by a large margin. It should be noted that the participants of the competition did not have any restrictions on the amount of training data employed and some of them are industrial companies (e.g. Fan et al. [15]), which further illustrates the effectiveness of our approach. Finally, Table 2 reports the area under the curve (AUC) of the CED curves, as well as the failure rate for a maximum RMS error of . Apart from the accuracy improvement shown by the AUC, we believe that the reported failure rate of 3.67% for DenseReg + MDM is remarkable and highlights the robustness of DenseReg.
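The two summary measures can be computed from per-image errors as sketched below. The sketch takes the RMS error as the square root of the mean squared point-to-point distance, normalizes the AUC so that a perfect method scores 1, and leaves the error threshold as a parameter; these conventions, the function names, and the example threshold of 0.08 are assumptions, since the exact protocol values are not given in the text.

```python
import math

def normalized_error(pred, gt, norm_distance):
    """Point-to-point RMS error over landmarks, divided by a normalizing distance."""
    squared = [(px - gx) ** 2 + (py - gy) ** 2
               for (px, py), (gx, gy) in zip(pred, gt)]
    return math.sqrt(sum(squared) / len(squared)) / norm_distance

def ced_auc_and_failure_rate(errors, max_error, steps=1000):
    """Area under the CED curve up to max_error (scaled to [0, 1]) and failure rate in %."""
    n = len(errors)
    area = 0.0
    for i in range(steps):
        # Midpoint rule over the error axis.
        threshold = max_error * (i + 0.5) / steps
        area += sum(1 for e in errors if e <= threshold) / n
    auc = area / steps
    failure_rate = 100.0 * sum(1 for e in errors if e > max_error) / n
    return auc, failure_rate

# Three hypothetical per-image errors; the last one counts as a failure.
errors = [0.02, 0.04, 0.2]
auc, failure_rate = ced_auc_and_failure_rate(errors, max_error=0.08)
```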
Method | AUC | Failure Rate (%) |
DenseReg + MDM | 0.5937 | 4.57 |
DenseReg | 0.4320 | 8.1 |
Yang et al. [52] | 0.5832 | 4.66 |
Xiao et al. [51] | 0.5800 | 9.1 |
Rajamanoharan et al. [38] | 0.5154 | 9.68 |
Wu et al. [50] | 0.4887 | 15.39 |
Uricar et al. [47] | 0.4059 | 16.7 |
For the challenging task of deformable face tracking on lengthy videos, we employ the testing database of the 300VW challenge [42, 12] - the only existing benchmark for deformable tracking “in-the-wild”. The benchmark consists of videos ( frames in total) and includes videos captured in totally arbitrary conditions (severe occlusions and extreme illuminations).
The tracking is performed based on sparse landmark points, thus we follow the same strategy as in the case of landmark localization in Sec. 4.3.
We compare the output of DenseReg, as well as DenseReg+MDM which was the best performing combination for landmark localization in static images (Sec. 4.3), against the participants of the 300VW challenge.
Table 3 reports the AUC and Failure Rate measures. DenseReg combined with MDM demonstrates better performance than the winner of the 300VW competition. It should be highlighted that our approach is not fine-tuned for the task-at-hand, as opposed to the rest of the methods, which were trained on video sequences and most of which employ some kind of temporal modelling. Finally, similar to the 300W case, the participants were allowed to use unlimited training data (apart from the provided training sequences), as opposed to DenseReg (and MDM), which were trained only on the images mentioned in Sec. 4.1. Please refer to the supplementary material for a more detailed presentation of the tracking results.
The fitted template shapes also provide the depth from the image plane. We transfer this information to the visible pixels on the image using the same z-buffering operation used for the deformation-free coordinates (detailed in Sec. 2 of the paper). We adopt this as an additional supervision signal and add another branch to our network to estimate the depth along with the deformation-free coordinates. To our knowledge, there are no existing results in the literature that would allow a quantitative comparison. We provide example reconstructions using estimated monocular depth fields in Fig. 9. We observe that this additional branch does not affect the performance of other branches and adds little to the complexity, since it is just a 1x1 convolution layer after the final shared convolutional layer.
To show that the DenseReg system can be used for articulated shapes of complex topology, we present results on the human shape. We use the recently proposed “Unite the People” (UP) dataset [25], which provides a 3D deformable human shape model [28] in correspondence with images from several publicly available datasets. We handle the complex geometry of the human shape by manually partitioning the surface into patches. We unwrap each patch using multidimensional scaling. The partitioning replaces the quantization and the rest of the system remains the same. Since there are no dense correspondence results between a 3D human model and image pixels in the literature, we demonstrate the performance of our system through visual results from our test-set partition of the UP dataset in Fig. 10.
We have also performed experiments on the human ear. We employ the images and sparse landmark annotations that were generated in a semi-supervised manner by Zhou et al. [54]. Due to the lack of a 3D model of the human ear, we apply Thin Plate Splines to bring the images into dense correspondence and obtain the deformation-free space. We perform landmark localization following the same procedure as in Sec. 4.3. We split the images in for training and for testing.
Given the lack of state-of-the-art deformable models on the human ear, we compare DenseReg with DenseReg+AAM and DenseReg+MDM. We also trained a DPM detector in order to compare the initialization quality with DenseReg. Figure 12 reports the CED curves based on the 55 landmark points using the RMS point-to-point error normalized by the bounding box average edge length. In Table 4, we provide the failure rate and the Area Under Curve (AUC) measures. Once again, the results are accurate even without refining DenseReg with a deformable model: DenseReg alone reaches an AUC comparable to DPM + MDM (0.4150 vs. 0.4160) with a much lower failure rate (1.96% vs. 15.69%), and clearly outperforms DPM + AAM. Examples for dense human ear correspondence estimated by our system are presented in Fig. 11.
Method | AUC | Failure Rate (%) |
DenseReg + MDM | 0.4842 | 0.98 |
DenseReg | 0.4150 | 1.96 |
DenseReg + AAM | 0.4263 | 0.98 |
DPM + MDM | 0.4160 | 15.69 |
DPM + AAM | 0.3283 | 22.55 |
We propose a fully-convolutional regression approach for establishing dense correspondence fields between objects in natural images and three-dimensional object templates. We demonstrate that the correspondence information can successfully be utilised on problems that can be geometrically represented on the template shape. Throughout the paper, we focus on face shapes, where applications are abundant and benchmarks allow a fair comparison. We show that using our dense regression method out-of-the-box outperforms a state-of-the-art semantic segmentation approach for the task of face-part segmentation, while, when used as an initialisation for SDMs, it yields state-of-the-art results on the challenging 300W landmark localization benchmark. We demonstrate the generality of our method by performing experiments on the human body and human ear shapes. We believe that our method will find wide use, since it can be readily applied to face and human-body related tasks and can be easily integrated into many other correspondence problems.
Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition. IEEE.
Supervised transformer network for efficient face detection. In European Conference on Computer Vision, 2016.