TensorFlow implementation of Recombinator Networks
Deep neural networks with alternating convolutional, max-pooling and decimation layers are widely used in state of the art architectures for computer vision. Max-pooling purposefully discards precise spatial information in order to create features that are more robust, and typically organized as lower resolution spatial feature maps. On some tasks, such as whole-image classification, max-pooling derived features are well suited; however, for tasks requiring precise localization, such as pixel level prediction and segmentation, max-pooling destroys exactly the information required to perform well. Precise localization may be preserved by shallow convnets without pooling but at the expense of robustness. Can we have our max-pooled multi-layered cake and eat it too? Several papers have proposed summation and concatenation based methods for combining upsampled coarse, abstract features with finer features to produce robust pixel level predictions. Here we introduce another model --- dubbed Recombinator Networks --- where coarse features inform finer features early in their formation such that finer features can make use of several layers of computation in deciding how to use coarse features. The model is trained once, end-to-end and performs better than summation-based architectures, reducing the error from the previous state of the art on two facial keypoint datasets, AFW and AFLW, by 30% and beating the current state-of-the-art on 300W without using extra data. We improve performance even further by adding a denoising prediction model based on a novel convnet formulation.
Recent progress in computer vision has been driven by the use of large convolutional neural networks. Such networks benefit from alternating convolution and pooling layers [16, 23, 22, 29, 24, 27, 42] where the pooling layers serve to summarize small regions of the layer below. The operations of convolution, followed by max-pooling, then decimation cause features in subsequent layers of the network to be increasingly translation invariant, more robust, and to more coarsely summarize progressively larger regions of the input image. As a result, features in the fourth or fifth convolutional layer serve as more robust detectors of more global, but spatially imprecise high level patterns like text or human faces. In practice these properties are critical for many visual tasks, and they have been particularly successful at enabling whole image classification [16, 29, 24]. However, for other types of vision tasks these architectural elements are not as well suited. For example on tasks requiring pixel-precise localization or labeling, features arising from max-pooling and decimation operations can only provide approximate localization, as in the process of creating them, the network has already thrown out precise spatial information by design. If we wish to generate features that preserve accurate localization, we may do so using shallow networks without max-pooling, but shallow networks without pooling cannot learn robust, invariant features. What we would like is to have our cake and eat it too: to combine the best of both worlds, merging finely-localized information from shallow, non-pooled networks with robust, coarsely-localized features computed by deep, pooled networks.
Several recently proposed approaches [17, 13, 31] address this by adding or concatenating the features obtained across multiple levels. We use this approach in our baseline model termed SumNet for our task of interest: facial keypoint localization. To the best of our knowledge this is the first time this general approach has been applied to the problem of facial keypoint localization, and even our baseline is capable of yielding state of the art results. A possible weakness of these approaches however is that all detection paths, from coarsely to finely localized features, only become aggregated at the very end of the feature processing pipeline. As a thought experiment to illustrate this approach’s weakness, imagine that we have a photo of a boat floating in the ocean and would like to train a convnet to predict with single pixel accuracy a keypoint corresponding to the tip of the boat’s bow. Coarsely localized features (from now on we use the shorthand fine/coarse features to mean finely/coarsely localized features) could highlight the rough region of the bow of the boat, and finely localized features could be tuned to find generic boat edges, but the fine features must remain generic, being forced to learn boat edge detectors for all possible ocean and boat color combinations. This would be difficult, because boat and ocean pixels could take similar colors and textures. Instead, we would like a way for the coarse features which contain information about the global scene structure (perhaps that the water is dark blue and the boat is bright blue) to provide information to the fine feature detectors earlier in their processing pipeline. Without such information, the fine feature detectors would be unable to tell which half of a light blue/dark blue edge was ocean and which was boat. In the Recombinator Networks proposed in this paper, the finely localized features are conditioned on higher level, more coarsely localized information.
This results in a model which is deeper but – interestingly – trains faster than the summation baseline and yields more precise localization predictions. In summary, this work makes the following contributions:
We propose a novel architecture — the Recombinator Networks — for combining information over different spatial localization resolutions (Section 3).
We show how a simple denoising model may be used to enhance model predictions (Section 4).
We provide an in-depth empirical evaluation of a wide variety of relevant architectural variants (Section 5.1).
We show state of the art performance on two widely used and competitive evaluations for facial keypoint localization (Section 5.2).
Precise facial keypoint localization is often an essential preprocessing step for face recognition and detection. Recent face verification models like DeepFace and DeepID2 also include keypoint localization as the first step. There have been many other approaches to general keypoint localization, including active appearance models [8, 43], constrained local models [10, 21, 11, 2], active shape models, point distribution models, structured model prediction [3, 31], tree structured face models, group sparse learning based methods, shape regularization models that combine multiple datasets, feature voting based landmark localization [26, 36] and convolutional neural network based models [41, 28, 42]. Two other related models are a multi-resolution model with dual coarse/fine paths and tied filters, and a cascaded architecture that refines predictions over several stages. Both of these latter models make hard decisions using coarse information halfway through the model.
Approaches that combine features across multiple levels: Several recent models — including fully convolutional networks (FCNs), the Hypercolumn model, and the localization model of Tompson et al. — generate features or predictions at multiple resolutions, upsample the coarse features to the fine resolution, and then add or concatenate the features or predictions together. This approach has generally worked well, improving on previous state of the art results in detection, segmentation, and human-body pose estimation [13, 17, 31]. In this paper we create a baseline model similar to these approaches, which we refer to as SumNet, in which a network aggregates information from features across different levels in the hierarchy of a conv-pool-decimate network, using concatenation followed by a weighted sum over feature maps prior to final layer softmax predictions. Our goal in this paper is to improve upon this architecture. Differences between the Recombinator Networks and related architectures are summarized in Table 5. U-Net is another model that merges features across multiple levels and has a very similar architecture to Recombinator Networks. The two models were developed independently and designed for different problems (for keypoint localization, we apply the softmax spatially, i.e. across possible spatial locations, whereas for segmentation [13, 17, 19] it is applied across all possible classes for each pixel). Note that none of these models use a learned denoising post-processing as we do (see Section 4).
In this section we describe our baseline SumNet model, based on a common architectural design where information from different levels of granularity is merged just prior to predictions being made. We contrast this with the Recombinator Networks architecture.
The SumNet architecture, shown in Figure 1 (left), adds to the usual bottom-to-top convolution and spatial pooling, or “trunk”, a horizontal left-to-right “branch” at each resolution level. While spatial pooling progressively reduces the resolution as we move “up” the network along the trunk, the horizontal branches contain only full convolutions and element-wise non-linearities, with no spatial pooling, so that they preserve the spatial resolution at that level while doing further processing. The output of the finest resolution branch goes only through convolutional layers. The finest resolution layers keep positional information and use it to disambiguate locations within the patch over which the coarser layers cannot express any preference, while the coarser resolution layers help the finer layers get rid of false positives.
The architecture then combines the rightmost low resolution output of all horizontal branches into a single high resolution prediction, by first up-sampling them all to the model’s input image resolution (upsampling can be performed either by tiling values or by using bilinear interpolation; we found bilinear interpolation degraded performance in some cases, so we instead used the simpler tiling approach) and then taking a weighted sum to yield the pre-softmax values. Finally, a softmax function is applied to yield the final location probability map for each keypoint. Formally, given an input image $I$, define the trunk of the network as a sequence of blocks of traditional groups of convolution, pooling and decimation operations. Starting from the layer yielding the coarsest scale feature maps, we call the outputs of such blocks $T_1, \dots, T_B$. At each level $b$ of the trunk we have a horizontal branch that takes $T_b$ as its input and consists of a sequence of convolutional layers with no subsampling. The output of such a branch is a stack of $K$ feature maps, one for each of the $K$ target keypoints, at the same resolution as its input $T_b$, and we denote this output as $F_b$. It is then upsampled by some factor which returns the feature map to the original resolution of the input image. Let these upsampled maps be $U_b = (U_b^{(1)}, \dots, U_b^{(K)})$, where $U_b^{(k)}$ is the score map given by branch $b$ to keypoint $k$ (left eye, right eye, $\dots$). Each such map is a matrix of the same resolution as the image fed as input. The score ascribed by branch $b$ for keypoint $k$ being at coordinate $(i, j)$ is given by $U_b^{(k)}[i, j]$. The final probability map for the location of keypoint $k$ is given by a softmax over all possible locations. We can therefore write the model as

$$P\big(y_k = (i,j) \mid I\big) = \operatorname{softmax}_{(i,j)}\Big(\sum_{b=1}^{B} W_b^{(k)} \odot U_b^{(k)}\Big)$$
where $W_b^{(k)}$ is a 2D matrix that gives a weight to every pixel location of keypoint $k$ in branch $b$. The weighted sum of features over all branches taken here is equivalent to concatenating the features of all branches and multiplying them by a set of weights, which results in one feature map per keypoint. This architecture is trained globally using gradient backpropagation to minimize the sum of negated conditional log probabilities of all training (input-image, keypoint-locations) pairs $(I^{(t)}, \{y_k^{(t)}\})$, for all keypoints $k$, with an additional regularization term for the weights; i.e. we search for network parameters $\theta$ that minimize

$$\mathcal{L}(\theta) = -\sum_{t=1}^{M} \sum_{k=1}^{K} \log P\big(y_k = y_k^{(t)} \mid I^{(t)}\big) + \lambda \lVert W \rVert^2$$

(We also tried an L2 distance cost between true and estimated keypoints, as a regression problem, and got worse results. This may be due to the fact that a softmax probability map can be multimodal, while an L2 distance implicitly corresponds to the likelihood of a unimodal isotropic Gaussian.)
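A minimal NumPy sketch of the SumNet merging step, under assumed shapes; all names here are illustrative rather than taken from the released code. Each branch contributes upsampled per-keypoint score maps, the pre-softmax map is an element-wise weighted sum over branches, and the softmax runs spatially over all locations:

```python
import numpy as np

def spatial_softmax(scores):
    """Softmax over all spatial locations of an (H, W) score map."""
    flat = scores.ravel()
    flat = flat - flat.max()           # subtract max for numerical stability
    p = np.exp(flat)
    p /= p.sum()
    return p.reshape(scores.shape)

def sumnet_merge(U, W):
    """U, W: arrays of shape (B, K, H, W) holding per-branch upsampled
    score maps and per-pixel weights. Returns (K, H, W) probability maps,
    one spatial softmax per keypoint."""
    pre = (W * U).sum(axis=0)          # weighted sum over the B branches
    return np.stack([spatial_softmax(pre[k]) for k in range(pre.shape[0])])

# Toy usage: 3 branches, 5 keypoints, 8x8 input resolution.
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5, 8, 8))
W = rng.normal(size=(3, 5, 8, 8))
P = sumnet_merge(U, W)
```

Each of the resulting maps sums to one over its spatial locations, so the argmax of `P[k]` gives the predicted location of keypoint `k`.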
In the SumNet model, different branches can only communicate through the updates received from the output layer, and the features are merged linearly through summation. In the Recombinator Networks (RCN) architecture, shown in Figure 1 (right), instead of taking a weighted sum of the upsampled feature maps in each branch and then passing them to a softmax, the output of each branch is upsampled and then concatenated with the branch at the next level, which has one degree finer resolution. In contrast to the SumNet model, each branch does not end in per-keypoint score maps; the information stays in the form of keypoint-independent feature maps. It is only at the end of the finest branch that feature maps are converted into a per-keypoint scoring representation with the same resolution as the input image, on which a softmax is then applied. As a result of RCN’s different architecture, branches pass more information to each other during training: convolutional layers in the finer branches get inputs from both coarse and fine layers, letting the network learn how to combine them non-linearly to maximize the log likelihood of the keypoints given the input images. The whole network is trained end-to-end by backprop. Following the previous conventions and defining a concatenation operator $\oplus$ on feature maps, we can write the model as

$$F_1 = \mathrm{conv}(T_1), \qquad F_b = \mathrm{conv}\big(U(F_{b-1}) \oplus T_b\big) \ \text{for } b > 1, \qquad P\big(y_k = (i,j) \mid I\big) = \operatorname{softmax}_{(i,j)}\big(F_B^{(k)}\big)$$

where $U(\cdot)$ denotes upsampling and $\mathrm{conv}(\cdot)$ a branch’s sequence of convolutional layers.
We also explore RCN with skip connections, where the features of each branch are concatenated with upsampled features of not only the one-level-coarser branch, but all coarser branches; the last branch therefore computes $F_B = \mathrm{conv}\big(U(F_1) \oplus U(F_2) \oplus \dots \oplus U(F_{B-1}) \oplus T_B\big)$. In practice, the information flow between different branches makes RCN converge faster and also perform better compared to the SumNet model.
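The core Recombinator step, upsampling the coarser branch and concatenating it with the finer one along the channel dimension, can be sketched in NumPy as follows (tiling upsampling as described for the branches above; function names and shapes are illustrative):

```python
import numpy as np

def upsample_tile(x, factor):
    """Nearest-neighbour (tiling) upsampling of (C, H, W) feature maps,
    the scheme the paper found more reliable than bilinear interpolation."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def rcn_recombine(coarse, fine):
    """One Recombinator step: bring the coarser branch's feature maps to
    the finer branch's resolution and concatenate along channels. In the
    real network, the finer branch's convolutions then process the result."""
    factor = fine.shape[1] // coarse.shape[1]
    return np.concatenate([upsample_tile(coarse, factor), fine], axis=0)

# Toy usage: 48-channel maps at 4x4 (coarse) and 8x8 (fine).
coarse = np.zeros((48, 4, 4))
fine = np.zeros((48, 8, 8))
merged = rcn_recombine(coarse, fine)
```

The concatenated stack doubles the channel count, which is why the convolutions in each finer branch see both coarse, global evidence and fine, local evidence.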
Convolutional networks are excellent edge detectors. If there are few samples with occlusion in the training sets, convnets have trouble detecting occluded keypoints and instead select nearby edges (see samples in Figures 3, 5). Moreover, the convnet predictions, especially on datasets with many keypoints, do not always correspond to a plausible keypoint distribution, and some keypoints jump off the curve (e.g. on the face contour or eyebrows) irrespective of the other keypoints’ positions (see samples in Figure 7). This type of error can be addressed by using a structured output predictor on top of the convnet that takes into account how likely the location of a keypoint is relative to the other keypoints. Our approach is to train another convolutional network that captures useful aspects of the prior keypoint distribution (not conditioned on the image). We train it to predict the position of a random subset of keypoints, given the positions of the other keypoints. More specifically, we train the convolutional network as a denoising model, similar to the denoising auto-encoder, by completely corrupting the location of a randomly chosen subset of the keypoints and learning to accurately predict their correct location given that of the other keypoints. This network receives as input, not the image, but only keypoint locations represented as one-hot 2D maps (one 2D map per keypoint, with a 1 at the position of the keypoint and zeros elsewhere). It is composed of convolutional layers with large receptive fields (to get to see nearby keypoints), ReLU nonlinearities and no subsampling (see Figure 2). The network outputs probability maps for the location of all keypoints; however, its training criterion uses only prediction errors of the corrupted ones. The cost being optimized is similar to Eq. (2) but includes only the corrupted keypoints.
Once this denoising model is trained, the output of RCN (the predicted most likely location, in one-hot binary 2D location map format) is fed to the denoising model. We then simply sum the pre-softmax values of both the RCN and the denoising model and pass them through a softmax to generate the final output probability maps. The joint model is depicted in Figure 2. The joint model combines the RCN’s predicted conditional distribution $P(y_k \mid I)$ for keypoint $k$ given the image with the denoising model’s distribution $P(y_k \mid \{y_j\}_{j \neq k})$ of the location of that keypoint given the other keypoints, to yield an estimate of keypoint $k$’s location given both the image and the other keypoint locations. The choice of convolutional networks for the denoising model allows it to be easily combined with RCN in a unified deep convolutional architecture.
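At test time, combining the two models amounts to summing their pre-softmax score maps per keypoint and renormalising with a spatial softmax. A minimal NumPy sketch, with illustrative names and assumed map shapes:

```python
import numpy as np

def joint_prediction(rcn_pre, denoiser_pre):
    """Sum the two models' pre-softmax (H, W) score maps for one keypoint
    and renormalise with a spatial softmax, as in the joint model."""
    pre = rcn_pre + denoiser_pre
    flat = pre.ravel() - pre.max()     # stabilise the exponentials
    p = np.exp(flat)
    return (p / p.sum()).reshape(pre.shape)

# Toy usage on random 8x8 score maps.
rng = np.random.default_rng(1)
rcn_pre = rng.normal(size=(8, 8))
den_pre = rng.normal(size=(8, 8))
p = joint_prediction(rcn_pre, den_pre)
i, j = np.unravel_index(p.argmax(), p.shape)  # predicted keypoint location
```

Because both terms enter additively before the softmax, a keypoint location only scores highly when both the image evidence and the keypoint-configuration prior support it.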
We evaluate our model (our models and code are publicly available at https://github.com/SinaHonari/RCN) on the following datasets, with evaluation protocols defined by previous literature:
AFLW and AFW datasets: Similar to TCDCN, we trained our models on the MTFL dataset (MTFL consists of 10,000 training images: 4,151 images from LFW and 5,849 images from the web), which we split into 9,000 images for training and 1,000 for validation. We evaluate our models on the same subsets of AFLW and AFW used in prior work, consisting of 2,995 and 377 images, respectively, each labeled with 5 facial keypoints.
300W dataset: 300W standardizes multiple datasets into one common dataset with 68 keypoints. The training set is composed of 3148 images (337 AFW, 2000 Helen, and 811 LFPW). The test set is composed of 689 images (135 IBUG, 224 LFPW, and 330 Helen). IBUG is referred to as the challenging subset, and the union of the LFPW and Helen test sets is referred to as the common subset. We shuffle the training set and split it into a 90% train-set (2834 images) and a 10% valid-set (314 images).
One challenging issue in these datasets is that the test set examples are significantly different from, and more difficult than, the training sets. In other words, the train and test set images are not drawn from the same distribution. In particular, the AFLW and AFW test sets contain many samples with occlusion and more extreme rotation and expression cases than the training set. The IBUG subset of 300W contains more extreme poses and expressions than the other subsets.
Error Metric: The Euclidean distance between the true and estimated landmark positions, normalized by the distance between the eyes (interocular distance), is used:

$$\text{error} = \frac{1}{MK} \sum_{m=1}^{M} \sum_{k=1}^{K} \frac{\lVert \hat{p}_{m,k} - p_{m,k} \rVert_2}{D_m}$$

where $K$ is the number of keypoints, $M$ is the total number of images, and $D_m$ is the interocular distance in image $m$. $p_{m,k}$ and $\hat{p}_{m,k}$ represent the true and estimated coordinates for keypoint $k$ in image $m$, respectively.
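As a sanity check, this metric takes only a few lines of NumPy (names are illustrative):

```python
import numpy as np

def mean_normalized_error(y_true, y_pred, interocular):
    """y_true, y_pred: (M, K, 2) keypoint coordinates for M images and
    K keypoints; interocular: (M,) per-image interocular distances.
    Returns the mean per-image-normalised Euclidean error."""
    d = np.linalg.norm(y_true - y_pred, axis=2)        # (M, K) distances
    return (d / interocular[:, None]).mean()

# Toy check: every keypoint off by 3 pixels, interocular distance 30,
# so the normalised error should be 3/30 = 0.1.
y_true = np.zeros((2, 5, 2))
y_pred = y_true + np.array([3.0, 0.0])
err = mean_normalized_error(y_true, y_pred, np.full(2, 30.0))
```

Note that reported percentages (e.g. 5.60 on AFLW) correspond to this quantity multiplied by 100.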
We evaluate RCN on the 5-keypoint test sets.
To avoid overfitting and improve performance,
we applied online data augmentation to the 9,000-image MTFL train set using random scale, rotation, and translation jittering (we jittered data separately in each epoch, with parameters uniformly sampled in the following ranges, selected based on validation set performance: translation and scaling in [-10%, +10%] of the face bounding box size; rotation in [-40, +40] degrees). We preprocessed images by converting them to gray-scale and applying local contrast normalization (RGB images performed worse in our experiments). In Figure S1, we show a visualization of the contribution of each branch of the SumNet to the final predictions: the coarsest layer provides robust but blurry keypoint locations, while the finest layer gives detailed face information but suffers from many false positives. However, the sum of branches in SumNet and the finest branch in RCN make precise predictions.
Since the test sets contain more extreme occlusion and lighting conditions compared to the train set, we applied a preprocessing to the train set to bring it closer to the test set distribution. In addition to the jittering, we found it helpful to occlude images in the training set with randomly placed black rectangles at each training iteration (each image was occluded with one black, i.e. all-zeros, rectangle, whose side lengths were drawn uniformly in the range [20, 50] pixels; its location was drawn uniformly over the entire image). This trick forced the convnet models to use more global facial components to localize the keypoints and not rely as much on the features around the keypoints, which in turn made them more robust to occlusion and lighting contrast in the test set. Figure 3 shows the effects of this occlusion when used to train the SumNet and RCN models on randomly drawn samples. The samples show that the models make good predictions for most of the test set examples. Figure 4 shows some hand-picked examples from the test sets, to show extreme expression, occlusion and contrast cases that are not captured in the random samples of Figure 3. Figure 5 similarly uses some manually selected examples to show the benefits of using occlusion.
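The occlusion trick described above can be sketched as follows; the rectangle size range and uniform placement follow the text, while the function name and the 80x80 image size are illustrative assumptions:

```python
import numpy as np

def occlude(img, rng, min_size=20, max_size=50):
    """Paste one black (zeros) rectangle at a uniformly random position,
    with side lengths drawn uniformly in [min_size, max_size] pixels,
    applied freshly at each training iteration."""
    out = img.copy()
    h = rng.integers(min_size, max_size + 1)
    w = rng.integers(min_size, max_size + 1)
    top = rng.integers(0, img.shape[0])
    left = rng.integers(0, img.shape[1])
    out[top:top + h, left:left + w] = 0.0   # slicing clips at the border
    return out

# Toy usage on an all-ones 80x80 "image".
rng = np.random.default_rng(0)
img = np.ones((80, 80))
occ = occlude(img, rng)
```

Because the rectangle position is drawn over the whole image, it may extend past the border; the slice then simply clips, which matches drawing the location uniformly over the entire image.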
To evaluate how much each branch contributes to the overall performance of the model, we trained models excluding some branches and report the results in Table 1. The finest layer on its own does a poor job due to many false positives, while the coarsest layer on its own does a reasonable job, but still lacks high accuracy. One notable result is that using only the coarsest and finest branches together produces reasonable performance. However, the best performance is achieved by using all branches, merging four resolutions of coarse, medium, and fine information.
[Table 1: branch-ablation results for the configurations (1,0,0,0), (0,1,0,0), (1,1,0,0), (0,0,1,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,0,0,1), (1,1,1,1), where each tuple indicates which of the four branches, coarsest to finest, is enabled; error values omitted.]
We also experimented with adding extra branches, getting to a coarser resolution of 5×5 in the 5-branch model, 2×2 in the 6-branch model and 1×1 in the 7-branch model. In each branch, the same number of convolutional layers with the same kernel size is applied (a single exception is that when the 5×5 resolution map is reduced to 2×2, we apply 3×3 pooling with stride 2 instead of the usual 2×2 pooling, to keep the resulting map left-right symmetric), and all new layers have 48 channels. The best performing model, as shown in Table 2
, is RCN with 6 branches. Comparing RCN and SumNet training, RCN converges faster. Using early stopping and without occlusion pre-processing, RCN requires on average 200 epochs to converge (about 4 hours on a NVidia Tesla K20 GPU), while SumNet needs on average more than 800 epochs (almost 14 hours). RCN’s error on both test sets drops below 7% on average after only 15 epochs (about 20 minutes), while SumNet needs on average 110 epochs (almost 2 hours) to get to this error. Using occlusion preprocessing increases these times slightly but results in lower test error. At test time, a feedforward pass on a K20 GPU takes 2.2ms for SumNet and 2.5ms for RCN per image in Theano. Table 2 shows occlusion pre-processing significantly helps boost the accuracy of RCN, while slightly helping SumNet. We believe this is due to global information flow from coarser to finer branches in RCN.
| Model | AFLW | AFW |
|---|---|---|
| SumNet (4 branch) | 6.44 | 6.78 |
| SumNet (5 branch) | 6.42 | 6.53 |
| SumNet (6 branch) | 6.34 | 6.48 |
| SumNet (5 branch - occlusion) | 6.29 | 6.34 |
| SumNet (6 branch - occlusion) | 6.27 | 6.33 |
| RCN (4 branch) | 6.37 | 6.43 |
| RCN (5 branch) | 6.11 | 6.05 |
| RCN (6 branch) | 6.00 | 5.98 |
| RCN (7 branch) | 6.17 | 6.12 |
| RCN (5 branch - occlusion) | 5.65 | 5.44 |
| RCN (6 branch - occlusion) | 5.60 | 5.36 |
| RCN (7 branch - occlusion) | 5.76 | 5.55 |
| RCN (6 branch - occlusion - skip) | 5.63 | 5.56 |
| TCDCN baseline (our implementation) | 7.60 | 7.87 |
| SumNet (FCN/HC) baseline (this) | 6.27 | 6.33 |
AFLW and AFW datasets: We first re-implemented the TCDCN model, the current state of the art on the 5-keypoint AFLW and AFW sets, and applied the same pre-processing as in our other experiments. Through hyper-parameter search, we even improved upon the originally reported AFLW and AFW results. Table 3 compares RCN with other models. In particular, RCN improves on the SumNet baseline, which is equivalent to the FCN and Hypercolumn models, and it also converges faster. The SumNet baseline is also provided by this paper, and to the best of our knowledge this is the first application of any such coarse-to-fine convolutional architecture to the facial keypoint problem. Figure 6 compares TCDCN with the SumNet and RCN models on some previously reported difficult samples.
300W dataset: The RCN model that achieved the best result on the validation set contains 5 branches with 64 channels in all layers (higher capacity is needed to extract features for more keypoints) and 2 extra convolutional layers in the finest branch right before applying the softmax. We compare different models on all keypoints (68) and on a previously reported subset of keypoints (49).
The denoising model is trained by randomly choosing 35 keypoints in each image and jittering them (changing their location uniformly to any place in the 2D map). It improves RCN’s predictions by considering how the locations of different keypoints are inter-dependent. Figure 7 compares the output of RCN, the denoising model and the joint model, showing how keypoint distribution modeling can reduce the error.
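The corruption step used to train the denoising model can be sketched in NumPy (function and variable names are illustrative, not from the released code):

```python
import numpy as np

def corrupt_keypoints(kpts, n_corrupt, map_size, rng):
    """Move a random subset of keypoints to uniformly random positions
    on the 2D map: the corruption used to train the denoising model.
    Returns the corrupted keypoints and the indices of the corrupted
    subset, which are the only ones the training loss is applied to."""
    out = kpts.copy()
    idx = rng.choice(len(kpts), size=n_corrupt, replace=False)
    out[idx] = rng.integers(0, map_size, size=(n_corrupt, 2))
    return out, idx

# Toy usage: corrupt 35 of 68 keypoints on an 80x80 map.
rng = np.random.default_rng(0)
kpts = np.zeros((68, 2), dtype=int)
noisy, idx = corrupt_keypoints(kpts, 35, 80, rng)
```

The corrupted coordinates would then be rendered as one-hot 2D maps, one per keypoint, to form the denoising network's input.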
We only trained RCN on the 2834 images in the train-set. No extra data was used to pre-train or fine-tune the model (we only jittered the train-set images by random scaling, translation and rotation, similar to the 5-keypoint dataset; TCDCN uses 20,000 extra images for pre-training). The current state-of-the-art model without any extra data is CFSS. We reduce the error by 15% on the IBUG subset compared to CFSS.
[Results on 300W for RCN + denoising keypoint model (this): 2.59, 4.81, 3.76 on the 49-keypoint evaluation and 4.67, 8.44, 5.41 on the 68-keypoint evaluation; column labels for the 300W subsets omitted.]
In this paper we have introduced the Recombinator Networks architecture for combining coarse maps of pooled features with fine non-pooled features in convolutional neural networks. The model improves upon previous summation-based approaches by feeding coarser branches into finer branches, allowing the finer resolutions to learn upon the features extracted by coarser branches. We find that this new architecture leads to both reduced training time and increased facial keypoint prediction accuracy. We have also proposed a denoising model for keypoints which involves explicit modeling of valid spatial configurations of keypoints. This allows our complete approach to deal with more complex cases such as those with occlusions.
We would like to thank the Theano developers, particularly F. Bastien and P. Lamblin, for their help throughout this project. We appreciate the feedback of Y. Bengio and H. Larochelle, and the help of L. Yao, F. Ahmed and M. Pezeshki on this project. We also thank Compute Canada and Calcul Quebec for providing computational resources. Finally, we would like to thank the Fonds de Recherche du Québec – Nature et Technologies (FRQNT) for a doctoral research scholarship (B2) during 2014 and 2015 (SH) and the NASA Space Technology Research Fellowship (JY).
| Features \ Models | Efficient Localization | Deep Cascade | Hypercolumns | FCN | RCN (this) |
|---|---|---|---|---|---|
| Coarse features: hard crop or soft combination? | Hard | Hard | Soft | Soft | Soft |
| Learned coarse features fed into finer branches? | No | No | No | No | Yes |
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.