1 Introduction
Locating visual landmarks, such as human body joints [37] and facial key points [41], is an important yet challenging problem. Stacked UNets, e.g. hourglasses (HGs) [23], are widely used in landmark localization. Generally speaking, their success can be attributed to two design patterns: 1) within each UNet, connect the top-down and bottom-up feature blocks to encourage gradient flow; and 2) stack multiple UNets in a cascade to refine the prediction stage by stage.
However, the shortcut connection exists only “locally” inside each UNet [32]. There is no “global” connection across UNets except the cascade. Blocks in different UNets cannot share features, which may impede the information flow and lead to redundant parameters.
We propose densely connected UNets (DUNet) to address this issue. The key idea is to directly connect blocks of the same semantic meanings, i.e. having the same resolution in either the top-down or bottom-up context, from any UNet to all subsequent UNets. Please refer to Fig. 1 for an illustration. The dense connectivity is similar to DenseNet [14] but generalizes the design philosophy from the feature level to the semantic level. It encourages information flow as well as feature reuse "globally" across the stacked UNets, yielding improved localization accuracy.
Yet there are critical issues in designing DUNet: 1) The number of parameters would have a quadratic growth, since n stacked UNets could generate O(n^2) connections. 2) A naive implementation may allocate new memory for every connection, making the training highly expensive and limiting the maximum depth of DUNets.
Our solution to these efficiency issues is threefold. First, instead of connecting all stacked UNets, we only connect a UNet to its K successors. We name this the order-K connectivity, which aims to balance fitting accuracy and parameter efficiency by cutting off long-distance connections. Second, we employ a memory-efficient implementation in training. The key idea is to reuse a pre-allocated memory so that all connected blocks share the same memory. Compared with the naive implementation, this strategy makes it possible to train a much deeper DUNet. Third, to further improve the efficiency, we investigate an iterative design that may reduce the model size to one half. More specifically, the output of the first pass of the DUNet is used as the input of the second pass, where a detection or regression loss is applied as supervision.
Besides shrinking the number of network parameters, we also study how to further quantize each parameter. This is motivated by the ubiquitous mobile applications. Although current mobile devices could carry models of dozens of MBs, deploying such networks requires high-end GPUs. However, quantized models can be accelerated by specifically designed low-cost hardware. Beyond only deploying models on mobile devices [18], training deep neural networks on distributed mobile devices has also emerged recently [22]. To this end, we also try to quantize not only the model parameters but also the inputs (intermediate features) and gradients in training. This is the first attempt to investigate training landmark localizers using quantized inputs and gradients.

In summary, our key contributions are:

To the best of our knowledge, we are the first to propose quantized densely connected UNets for visual landmark localization, which largely improves the information flow and feature reuse at the semantic level.

We propose the order-K connectivity to balance accuracy and efficiency. It decreases the growth of model size from quadratic to linear by removing trivial connections. Experiments show it could reduce about 70% of the parameters of state-of-the-art landmark localizers.

Very deep UNets can be trained using a memory-efficient implementation, where pre-allocated memory is reused by all connected blocks.

We further investigate an iterative refinement that may cut down half of the model size, by forwarding DUNet twice using either detection or regression supervision.

Different from previous efforts that quantize only the model parameters, we are the first to also quantize the inputs and gradients for better training efficiency on landmark localization tasks. By choosing appropriate quantization bitwidths for weights, inputs and gradients, the quantized DUNet achieves a 75% training memory saving with comparable performance.

Exhaustive experiments are performed to validate DUNet in different aspects. In both human pose estimation and face alignment, DUNet demonstrates localization accuracy comparable with state-of-the-art methods while using only about 2% of their model size.
2 Related Work
In this section, we review the recent developments on designing convolutional network architectures, quantizing the neural networks, human pose estimation and facial landmark localization.
Network Architecture. The identity mappings make it possible to train very deep ResNets [12]. The popular stacked UNets [23] are designed based on residual modules. More recently, DenseNet [14] outperforms ResNet in the image classification task, benefitting from its dense connections. We generalize the dense connectivity to multiple stacked UNets.
Network Quantization. Training deep neural networks usually consumes a large amount of computational resources, which makes it hard to deploy them on mobile devices. Recently, network quantization approaches [9, 19, 47, 40, 31] offer an efficient solution to reduce the size of a network by cutting down high-precision operations and operands. In the recent binarized convolutional landmark localizer (BCLL) [5] architecture, XNOR-Net [31] was utilized for network binarization. However, BCLL only quantizes weights for inference and brings in real-value scaling factors. Due to its high-precision demand in training, it cannot save training memory or improve training efficiency. To this end, we explore quantizing our DUNet in training and inference simultaneously.
Human Pose Estimation. Starting from DeepPose [37], CNN-based approaches [39, 6, 4, 28, 15, 20, 3, 46] have become the mainstream in human pose estimation and prediction. Recently, the stacked hourglasses architecture [23] has clearly surpassed all previous ones in terms of usability and accuracy. Therefore, all recent state-of-the-art methods [8, 42, 7, 26] build on its architecture. They replace the residual modules with more sophisticated ones, add graphical models for better inference, or use an additional network to provide adversarial supervision or perform adversarial data augmentation [26]. In contrast, we design a simple yet very effective connectivity pattern for stacked UNets.
Facial Landmark Localization. Similarly, CNNs have largely reshaped the field of facial landmark localization. Traditional methods could be easily outperformed by CNN-based ones [44, 45, 21, 24, 25]. In the recent Menpo Facial Landmark Localisation Challenge [43], stacked hourglasses [23] achieve state-of-the-art performance. The proposed order-K connected UNets could produce even better results with much fewer parameters.
3 Our Method
In this section, we first introduce the DUNet after recapping the stacked UNets [23]. Then we present the order-K connectivity to improve its parameter efficiency, a memory-efficient implementation to reduce its training memory, and an iterative refinement to make it even more parameter efficient. Finally, network quantization is utilized to further reduce training memory and model size.
3.1 DUNet
A UNet contains top-down blocks, bottom-up blocks, and skip connections between them. Suppose n UNets are stacked together. For the l-th top-down and bottom-up blocks in the n-th UNet, we use D_n^l and U_n^l to denote their non-linear transformations, whose outputs are represented by d_n^l and u_n^l. D_n^l and U_n^l comprise operations of Convolution (Conv), Batch Normalization (BN) [16], rectified linear units (ReLU) [11], and pooling.
Stacked UNets. The feature transitions at the top-down and bottom-up blocks of the n-th UNet are:

d_n^l = D_n^l(d_n^{l-1}),   u_n^l = U_n^l(u_n^{l+1} + d_n^l).   (1)
The skip connections only exist locally within each UNet, which may restrict information flow across UNets.
DUNet. To make information flow efficiently across stacked UNets, we propose a global connectivity pattern: blocks at the same locations of different UNets have direct connections. Hence, we refer to this densely connected UNets architecture as DUNet. Figure 1 gives an illustration. Mathematically, the feature transitions at the top-down and bottom-up blocks of the n-th UNet can be formulated as:

d_n^l = D_n^l([d_n^{l-1}, d_1^l, ..., d_{n-1}^l]),   u_n^l = U_n^l([u_n^{l+1}, d_n^l, u_1^l, ..., u_{n-1}^l]),   (2)

where d_1^l, ..., d_{n-1}^l are the outputs of the l-th top-down blocks in all preceding UNets. Similarly, u_1^l, ..., u_{n-1}^l represent the outputs of the l-th bottom-up blocks. [·] denotes feature concatenation, which could make information flow more efficiently than the summation operation in Equation 1.
According to Equation 2, a block receives features not only from connected blocks in the current UNet but also the output features of the same semantic blocks from all its preceding UNets. Note that this semantic level dense connectivity is a generalization of the dense connectivity in DenseNet [14] that connects layers only within each block.
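As a rough sketch of how the concatenation in Equation 2 drives block widths, the snippet below counts the channels entering a block under dense vs. order-K feature reuse. The helper name and the base/growth channel numbers (128 and 32, taken from the bottleneck configuration described in Section 4) are illustrative assumptions, not the paper's code:

```python
def block_input_channels(n, k=None, base=128, growth=32):
    """Channels entering a top-down block of the n-th UNet (1-indexed):
    the within-UNet input (base) concatenated with growth-channel outputs
    reused from min(n - 1, k) preceding UNets (k=None means fully dense)."""
    preceding = (n - 1) if k is None else min(k, n - 1)
    return base + preceding * growth

# Fully dense: block width grows with n, so parameters grow quadratically.
dense = [block_input_channels(n) for n in (1, 2, 4, 8)]        # [128, 160, 224, 352]
# Order-1: width saturates, keeping every UNet lightweight.
order1 = [block_input_channels(n, k=1) for n in (1, 2, 4, 8)]  # [128, 160, 160, 160]
```

The saturating width under small orders is what the order-K connectivity of the next subsection exploits.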
3.2 Order-K Connectivity
In the above formulation of DUNet, we connect blocks with the same semantic meanings across all UNets, so the number of connections grows quadratically with depth. To make DUNet parameter efficient, we propose to cut off some trivial connections. For compensation, we add an intermediate supervision at the end of each UNet. The intermediate supervisions, like the skip connections, could also alleviate the gradient vanishing problem. Mathematically, the features d_n^l and u_n^l in Equation 2 turn into:

d_n^l = D_n^l([d_n^{l-1}, d_{n-K}^l, ..., d_{n-1}^l]),   (3)
u_n^l = U_n^l([u_n^{l+1}, d_n^l, u_{n-K}^l, ..., u_{n-1}^l]),   (4)

where the order K represents how many preceding nearby UNets connect with the current one. K = 0 or K = n - 1 would result in the stacked UNets or the fully densely connected UNets, respectively. A medium order could reduce the growth of DUNet parameters from quadratic to linear. Therefore, it largely improves the parameter efficiency of DUNet and could make DUNet grow several times deeper.
The proposed order-K connectivity has a similar philosophy to the Variable Order Markov (VOM) models [2]. Each UNet can be viewed as a state in the Markov model. The current UNet depends on a fixed number of preceding nearby UNets, instead of on either only one or all preceding UNets. In this way, the long-range connections are cut off. Figure 3 illustrates the connections of three different orders, whose patterns above and below the central axes correspond to VOM patterns of increasing order.
Dense connectivity is a special case of order-K connectivity in the limit of K = n - 1. For small K, order-K connectivity is much more parameter efficient. But fewer connections may affect the prediction accuracy of a very deep DUNet. To make DUNet have both high parameter efficiency and prediction accuracy, we propose to use order-K connectivity in conjunction with intermediate supervisions. In contrast, DenseNet [14] has only one supervision at the end; thus, it cannot effectively take advantage of order-K connectivity.
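The quadratic-to-linear reduction can be sanity-checked with a small counting sketch (hypothetical helper; one "connection" here is one cross-UNet feature reuse at a single semantic location):

```python
def cross_unet_connections(n_unets, k):
    """Cross-UNet connections at one semantic location: the i-th UNet
    (1-indexed) receives features from min(i - 1, k) predecessors."""
    return sum(min(i - 1, k) for i in range(1, n_unets + 1))

dense_16 = cross_unet_connections(16, 15)   # fully dense: 0 + 1 + ... + 15 = 120
order1_16 = cross_unet_connections(16, 1)   # order-1: 15, linear in depth
```

At 16 UNets, order-1 already uses 8x fewer cross-UNet connections than full density, which is why the medium orders studied in Section 4.2 remain accurate while staying compact.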
3.3 Memory Efficient Implementation
Benefitting from the order-K connectivity, our DUNet is quite parameter efficient. However, a naive implementation would prevent training very deep DUNets, since every connection would make a copy of its input features. To reduce the training memory, we follow the efficient implementation of [29]. More specifically, the concatenation operations of the same semantic blocks in all UNets share one memory allocation, and their subsequent batch norm operations share another. Suppose a DUNet includes n UNets, each of which has l top-down and l bottom-up blocks; we then need to pre-allocate two memory spaces for each of the 2l semantic block locations. For the top-down blocks at a given location, the concatenated features of all UNets share the same memory space; similarly, the concatenated features in the bottom-up blocks at a given location share the same memory space.
In one shared memory allocation, later produced features overwrite the former features. Thus, the concatenations and their subsequent batch norm operations need to be re-computed in the backward phase. Figure 3 illustrates the naive and efficient implementations.
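A pure-Python caricature of the shared-storage idea (not the actual PyTorch implementation of [29]): all same-semantic concatenations write into one pre-allocated region, so an earlier result must be cheaply re-computed when its backward pass needs it, instead of being stored:

```python
class SharedConcat:
    """Caricature of one pre-allocated region shared by all same-semantic
    concatenation ops: later UNets overwrite earlier results in place, so a
    forward result is re-computed (not stored) for the backward pass."""
    def __init__(self):
        self.buffer = None                                # allocated once, then reused

    def forward(self, feature_groups):
        flat = [x for xs in feature_groups for x in xs]   # "concatenate"
        if self.buffer is None:
            self.buffer = flat                            # first allocation
        else:
            self.buffer[:len(flat)] = flat                # overwrite in place
            del self.buffer[len(flat):]
        return self.buffer

shared = SharedConcat()
a = shared.forward([[1, 2], [3]])           # UNet 1's concat
b = shared.forward([[1, 2], [3], [4]])      # UNet 2 reuses (and clobbers) the region
recomputed = shared.forward([[1, 2], [3]])  # backward of UNet 1 re-runs the concat
```

The trade is the standard one: a small amount of re-computation in exchange for memory that no longer grows with the number of connections.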
3.4 Iterative Refinement
In order to further improve the parameter efficiency of DUNet, we consider an iterative refinement. It uses only half of a DUNet but may achieve comparable performance. In the iterative refinement, a DUNet has two forward passes. After the first pass, we concatenate the inputs of the first and last UNets and merge them in a small dense block. Then the refined input is fed through the DUNet again. A better output is expected because of the refined input.
In this iterative pipeline, the DUNet has two groups of supervisions, one in each iteration. Both detection and regression supervisions [4] have been used in landmark localization tasks. However, there is no investigation of how they compare with each other. To this end, we try different combinations of detection and regression supervisions for the two iterations. Our comparison could give some guidance for future research.
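The two-pass pipeline can be sketched as the control flow below; the function and argument names are illustrative, not the paper's API:

```python
def iterative_forward(dunet, refine, x, passes=2):
    """Sketch of the two-pass refinement described above. `dunet` returns the
    prediction plus the inputs of its first and last UNets; `refine` stands
    for the small dense block that merges them into a refined input."""
    predictions = []
    for _ in range(passes):
        pred, first_in, last_in = dunet(x)
        predictions.append(pred)       # each pass receives its own supervision
        x = refine(first_in, last_in)  # refined input for the next pass
    return predictions

# Toy stand-ins just to show the control flow:
toy_dunet = lambda x: (2 * x, x, x + 1)
toy_refine = lambda a, b: a + b
preds = iterative_forward(toy_dunet, toy_refine, 1)  # [2, 6]
```

In the real model, a detection or regression loss would be attached to each element of `predictions`, matching the supervision combinations compared in Section 4.4.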
3.5 Network Quantization
We aim at cutting down high-precision operations and parameters in both the training and inference stages of DUNet. The bitwidth of weights can be reduced to one or two bits through a sign function or a symmetrical threshold, whereas the layer-wise gradients and inputs are quantized with a linear mapping. In the previous XNOR-Net [31], a scaling factor α was introduced to approximate the real-value weights. However, calculating this float factor costs additional computational resources. To further decrease memory usage and model size, we try to remove the scaling factor and follow WAGE [40] to quantize the dataflow during training. More specifically, weights are binarized to -1 and +1 by the following equation:
w_b = sign(w) = { +1, if w >= 0; -1, otherwise },   (5)
or ternarized to -1, 0 and +1 by a positive threshold δ as presented in [19], where δ ≈ 0.7·E(|w|), provided that w is initialized by Gaussian distributions. The dataflows, i.e. gradients and inputs, are quantized to k-bit values by the following linear mapping function:

q(x, k) = clip( σ(k) · round(x / σ(k)), -1 + σ(k), 1 - σ(k) ),   (6)

where the unit distance is calculated by σ(k) = 2^(1-k). In the following experiments, we explore different combinations of bitwidths to balance performance and memory consumption.
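A minimal sketch of the three quantizers described above, assuming the sign rule of Equation 5, the 0.7·E(|w|) threshold of [19], and the WAGE-style mapping of Equation 6 with unit distance σ(k) = 2^(1-k):

```python
def binarize(w):
    """Eq. (5): sign binarization to {-1, +1}; zero maps to +1."""
    return 1.0 if w >= 0 else -1.0

def ternarize(weights):
    """Ternarize to {-1, 0, +1} with the positive threshold of [19]:
    delta = 0.7 * E(|w|)."""
    delta = 0.7 * sum(abs(w) for w in weights) / len(weights)
    return [1.0 if w > delta else -1.0 if w < -delta else 0.0 for w in weights]

def quantize_k(x, k):
    """Eq. (6): k-bit linear mapping with unit distance sigma(k) = 2**(1 - k),
    clipped to [-1 + sigma, 1 - sigma] as in WAGE [40]."""
    sigma = 2.0 ** (1 - k)
    q = sigma * round(x / sigma)
    return max(-1.0 + sigma, min(1.0 - sigma, q))
```

For example, with k = 2 the unit distance is 0.5, so `quantize_k(0.3, 2)` snaps to 0.5 and any magnitude beyond the clip range saturates at ±0.5.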
4 Experiments
In this section, we first demonstrate the effectiveness of DUNet through its comparison with the stacked UNets. Then we explore the relation between the prediction accuracy and the order-K connectivity. After that, we evaluate the iterative refinement that halves the DUNet parameters. Finally, we test the network quantization, trying different combinations of bitwidths to find appropriate ones that balance accuracy, model size and memory consumption. The general comparisons with state-of-the-art methods are given at last. Some qualitative results are shown in Figure 6.
Network. The input resolution is normalized to 256×256. Before the DUNet, a 7×7 Conv filter with stride 2 and a max pooling produce 128 features with resolution 64×64. Hence, the maximum resolution of DUNet is 64×64. Each block in DUNet has a bottleneck structure, as shown on the right side of Figure 1. At the beginning of each bottleneck, features from different connections are concatenated and stored in a shared memory. Then the concatenated features are compressed by a 1×1 Conv to 128 features. At last, a 3×3 Conv further produces 32 new features. Batch norm and ReLU are used before the convolutions.
Training.
We implement the DUNet using PyTorch and train it with the RMSprop optimizer. When training human pose estimators, the initial learning rate is decayed once after 100 epochs; the whole training takes 200 epochs. The facial landmark localizers are easier to train: starting from the same initial learning rate, it is divided by 5, 2 and 2 at epochs 30, 60 and 90, respectively. The above settings remain the same for the quantized DUNet. In order to match the pace of dataflow, we set the same bitwidth for gradients and inputs. We quantize dataflows and parameters all over the DUNet except the first and last convolutional layers, since localization is a fine-grained task that requires high-precision heatmaps.
Human Pose Datasets. We use two benchmark human pose estimation datasets: MPII Human Pose [1] and Leeds Sports Pose (LSP) [17]. MPII is collected from YouTube videos with a broad range of human activities. It has 25K images and 40K annotated persons, which are split into a training set of 29K and a test set of 11K. Following [35], 3K samples are chosen from the training set as a validation set. Each person has 16 labeled joints. The LSP dataset contains images from many sport scenes. Its extended version has 11K training samples and 1K testing samples. Each person in LSP has 14 labeled joints. Since there are usually multiple people in one image, we crop around each person and resize the crop to 256×256. We also use scaling (0.75–1.25), rotation (−30°/+30°) and random flip to augment the data.
Facial Landmark Datasets. The experiments on facial landmark localization are conducted on the composite of HELEN, AFW, LFPW and IBUG, which are re-annotated in the 300W challenge [33]. Each face has 68 landmarks. Following [48] and [21], we use the training images of HELEN and LFPW and all images of AFW, 3148 images in total, as the training set. The testing is done on the common subset (testing images of HELEN and LFPW), the challenging subset (all images from IBUG) and their union. We use the provided bounding boxes from the 300W challenge to crop faces. The same augmentations of scaling and rotation as in human pose estimation are applied.
Metric. We use the standard metrics in both human pose estimation and face alignment. Specifically, the Percentage of Correct Keypoints (PCK) is used to evaluate approaches for human pose estimation, and the normalized mean error (NME) is employed to measure the performance of localizing facial landmarks. Following the convention of the 300W challenge, we use the inter-ocular distance to normalize the mean error. For network quantization, we propose the balance index (BI) to examine the trade-off between performance and efficiency.
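For concreteness, minimal versions of the two metrics might look like the following (simplified single-sample forms; the benchmark protocols add per-joint visibility handling and dataset-specific normalizers):

```python
import math

def pck(pred, gt, normalizer, thresh=0.5):
    """Percentage of Correct Keypoints: a joint is correct when its distance
    to ground truth is within thresh * normalizer (head size for PCKh)."""
    correct = sum(1 for p, g in zip(pred, gt)
                  if math.dist(p, g) <= thresh * normalizer)
    return 100.0 * correct / len(gt)

def nme(pred, gt, interocular):
    """Normalized Mean Error: mean landmark distance divided by the
    inter-ocular distance, following the 300W convention."""
    mean_err = sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
    return 100.0 * mean_err / interocular
```

Here one predicted joint within half the normalizer counts as correct, and NME is reported in percent, matching the tables below.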
4.1 DUNet vs. Stacked UNets
To demonstrate the advantages of DUNet, we first compare it with the traditional stacked UNets. This experiment is done on the MPII validation set. All DUNets use the order-1 connectivity and intermediate supervisions. Table 2 shows three pairs of comparisons with 4, 8 and 16 UNets, reporting both their PCKh and number of convolution parameters. We observe that, with the same number of UNets, DUNet obtains comparable or even better accuracy. More importantly, the number of parameters in DUNet is decreased by about 70% compared with the stacked UNets. The feature reuse across UNets makes each UNet in DUNet lightweight. Besides, the high parameter efficiency makes it possible to train 16 order-1 connected UNets on a 12GB GPU with batch size 16. In contrast, training 16 stacked UNets under the same setting is infeasible. Thus, order-1 connectivity together with intermediate supervisions makes DUNet obtain accurate predictions as well as high parameter efficiency, compared with stacked UNets.
4.2 Evaluation of Order-K Connectivity
The proposed order-K connectivity is key to improving the parameter efficiency of DUNet. In this experiment, we investigate how the PCKh and the number of convolution parameters change along with the order value. Figure 5 gives the results on the MPII validation set. The left and right figures show results of DUNet with 8 and 16 UNets, respectively. It is clear that the number of convolution parameters increases as the order becomes larger. However, both PCKh curves have a similar shape of first increasing and then decreasing. Order-1 connectivity is always better than order-0.
However, very dense connections may not be a good choice, which is somewhat counter-intuitive. This is because the intermediate supervisions already provide additional gradients; too dense connections make gradients accumulate too much, causing overfitting on the training set. Further evidence of overfitting is shown in Table 4: the order-7 connectivity has higher training PCKh than order-1 in all training epochs, but its validation PCKh is a little lower in the last training epochs. Thus, small orders are recommended for DUNet.
4.3 Evaluation of Efficient Implementation
The memory-efficient implementation makes it possible to train very deep DUNets. Figure 5 shows the training memory consumption of both the naive and the memory-efficient implementations of DUNet with order-1 connectivity. The linear growth of training memory with the number of UNets is due to the fixed-order connectivity, but the memory growth of the efficient implementation is much slower than that of the naive one. With batch size 16, we could train a DUNet with 16 UNets on a 12GB GPU. Under the same setting, the naive implementation could accommodate only 9 UNets.
4.4 Evaluation of Iterative Refinement
The iterative refinement is designed to make DUNet more parameter efficient. First, experiments are done on the 300W dataset using DUNet(4). Results are shown in Table 2. For both detection and regression supervisions, adding an iteration lowers the localization errors, demonstrating the effectiveness of the iterative refinement. Meanwhile, the model parameters increase by only 0.2M, making DUNet even more parameter efficient. Besides, the regression supervision outperforms the detection one in both the iterative and non-iterative settings, making it a better choice for landmark localization.
Further, we compare the iterative DUNet(4) with the non-iterative DUNet(8). Table 4 gives the comparison. We find that the iterative DUNet(4) obtains comparable NME to DUNet(8). However, DUNet(8) has double the parameters of DUNet(4), whereas the iterative DUNet(4) adds only 0.2M parameters to DUNet(4).
4.5 Evaluation of Network Quantization
Through network quantization, high-precision operations and parameters can be efficiently represented by a few discrete values. In order to find appropriate choices of bitwidths, we try a series of bitwidth combinations on the 300W dataset based on the order-1 DUNet(4). The performance and balance ability of these combinations are shown in Table 5, where DUNet(4) is the DUNet with 4 UNets; BW and TW respectively represent binarized and ternarized weights without the scaling factor α; BW-α is binarized weights with the float scaling factor α; and the suffix QIG means quantized inputs and gradients.
For mobile devices with limited computational resources, a slight performance drop is tolerable provided that it brings a correspondingly large efficiency enhancement. For evaluation purposes, we propose a balance index (BI) to better examine the trade-off between performance and efficiency:
BI = ε² · m · s,   (7)

where ε is the NME and m and s are respectively short for the training-memory and model-size compression ratios with respect to the original network without quantization. The square of ε is used in the above formula to emphasize the prior importance of performance. For BI, the smaller the value, the better the balance.
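Equation 7 can be checked directly against Table 5; for instance, the full-precision DUNet(4) row gives 3.38² × 1.00 × 1.00 ≈ 11.4:

```python
def balance_index(nme, mem_ratio, size_ratio):
    """Eq. (7): BI = NME^2 * (training-memory ratio) * (model-size ratio).
    NME is squared to weight accuracy over efficiency; smaller is better."""
    return nme ** 2 * mem_ratio * size_ratio

full_precision = balance_index(3.38, 1.00, 1.00)  # DUNet(4) row: ~11.4
best_balance = balance_index(4.30, 0.25, 0.03)    # BW-QIG(818) row: ~0.14
```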
According to Table 5, BW-QIG(818) achieves the best balance between performance and model efficiency among all the combinations. BW-QIG(818) could reduce training memory by more than 4× and model size by more than 32× while reaching a better performance than TSR [21]. Besides, BW-α-QIG(818), BW-QIG(616) and TW-QIG(626) also have small balance indices. Among all the combinations, the binarized network with scaling factor α, i.e. BW-α, gets the closest error to the original network DUNet(4).
For BW-α-QIG(818), the performance is not better than BW-QIG(818). This is mainly because BW-α relies heavily on the parameter α, and the quantization of the dataflow reduces the approximation ability of α. TW and TW-QIG usually get better results than BW and BW-QIG, since they have more choices of weight values. The above results prove the effectiveness of network quantization, yet a correct combination of bitwidths is a crucial factor.
Method     Bits (inputs/weights/gradients)  NME(%) Full set  NME(%) Easy set  NME(%) Hard set  Training Memory  Model Size  Balance Index
DUNet(4)   32/32/32   3.38   2.95   5.13   1.00   1.00   11.4
BW-QIG     6/1/6      5.93   5.10   9.34   0.17   0.03   0.18
BW-QIG     8/1/8      4.30   3.67   6.86   0.25   0.03   0.14
BW-α-QIG   8/1/8      4.47   3.75   7.40   0.25   0.03   0.15
BW         32/1/32    3.75   3.20   5.99   1.00   0.03   0.42
BW-α       32/1/32    3.58   3.12   5.45   1.00   0.03   0.38
TW         32/2/32    3.73   3.21   5.85   1.00   0.06   0.83
TW-QIG     6/2/6      4.27   3.70   6.59   0.17   0.06   0.19
TW-QIG     8/2/8      4.13   3.55   6.50   0.25   0.06   0.26
Method        Yang et al. [42]  Wei et al. [39]  Bulat et al. [4]  Chu et al. [8]  Newell et al. [23]  Order-1 DUNet(16)  Order-1 DUNet-BW(16)
# Parameters  28.0M             29.7M            58.1M             58.1M           25.5M               15.9M              15.9M
Model Size    110.2MB           116.9MB          228.7MB           228.7MB         100.5MB             62.6MB             2.0MB

4.6 Comparison with Stateoftheart Methods
Human Pose Estimation. Tables 7 and 9 show comparisons of human pose estimation on the MPII and LSP test sets. The order-1 DUNet-BW(16) achieves performance comparable with the state of the art. In contrast, as shown in Table 6, it has only 27%–62% of the parameters and less than 2% of the model size of other recent state-of-the-art methods. The DUNet is concise and simple, while other state-of-the-art methods use stacked UNets with either sophisticated modules [42], graphical models [8] or adversarial networks [7].
Facial Landmark Localization. The DUNet is also compared with other state-of-the-art facial landmark localization methods on 300W; please refer to Table 8. We use a smaller network, order-1 DUNet(8), than in human pose estimation, since localizing facial landmarks is easier. The order-1 DUNet-BW(8) gets errors comparable to the state-of-the-art method [23], yet it has only 2% of the model size.
Method  Head  Sho.  Elb.  Wri.  Hip  Knee  Ank.  Mean 
Pishchulin et al. ICCV’13 [27]  74.3  49.0  40.8  34.1  36.5  34.4  35.2  44.1 
Tompson et al. NIPS’14 [36]  95.8  90.3  80.5  74.3  77.6  69.7  62.8  79.6 
Carreira et al. CVPR’16 [6]  95.7  91.7  81.7  72.4  82.8  73.2  66.4  81.3 
Tompson et al. CVPR’15 [35]  96.1  91.9  83.9  77.8  80.9  72.3  64.8  82.0 
Hu et al. CVPR’16 [13]  95.0  91.6  83.0  76.6  81.9  74.5  69.5  82.4 
Pishchulin et al. CVPR’16 [28]  94.1  90.2  83.4  77.3  82.6  75.7  68.6  82.4 
Lifshitz et al. ECCV’16 [20]  97.8  93.3  85.7  80.4  85.3  76.6  70.2  85.0 
Gkioxary et al. ECCV’16 [10]  96.2  93.1  86.7  82.1  85.2  81.4  74.1  86.1 
Rafi et al. BMVC’16 [30]  97.2  93.9  86.4  81.3  86.8  80.6  73.4  86.3 
Belagiannis et al. FG’17 [3]  97.7  95.0  88.2  83.0  87.9  82.6  78.4  88.1 
Insafutdinov et al. ECCV’16 [15]  96.8  95.2  89.3  84.4  88.4  83.4  78.0  88.5 
Wei et al. CVPR’16 [39]  97.8  95.0  88.7  84.0  88.4  82.8  79.4  88.5 
Bulat et al. ECCV’16 [4]  97.9  95.1  89.9  85.3  89.4  85.7  81.7  89.7 
Newell et al. ECCV’16 [23]  98.2  96.3  91.2  87.1  90.1  87.4  83.6  90.9 
Chu et al. CVPR’17 [8]  98.5  96.3  91.9  88.1  90.6  88.0  85.0  91.5 
Order-1 DUNet(16)  97.4  96.4  92.1  87.7  90.2  87.7  84.3  91.2 
Order-1 DUNet-BW(16)  97.6  96.4  91.7  87.3  90.4  87.3  83.8  91.0 
Method       CFAN [44]  Deep Reg [34]  CFSS [48]  TCDCN [45]  MDM [38]  TSR [21]  HGs(4) [23]  Order-1 DUNet(8)  Order-1 DUNet-BW(8)
Easy subset  5.50       4.51           4.73       4.80        4.83      4.36      2.90         2.82              3.00
Hard subset  16.78      13.80          9.98       8.60        10.14     7.56      5.15         5.07              5.36
Full set     7.69       6.31           5.76       5.54        5.88      4.99      3.35         3.26              3.46
Method  Head  Sho.  Elb.  Wri.  Hip  Knee  Ank.  Mean 
Belagiannis et al. FG’17 [3]  95.2  89.0  81.5  77.0  83.7  87.0  82.8  85.2 
Lifshitz et al. ECCV’16 [20]  96.8  89.0  82.7  79.1  90.9  86.0  82.5  86.7 
Pishchulin et al. CVPR’16 [28]  97.0  91.0  83.8  78.1  91.0  86.7  82.0  87.1 
Insafutdinov et al. ECCV’16 [15]  97.4  92.7  87.5  84.4  91.5  89.9  87.2  90.1 
Wei et al. CVPR’16 [39]  97.8  92.5  87.0  83.9  91.5  90.8  89.9  90.5 
Bulat et al. ECCV’16 [4]  97.2  92.1  88.1  85.2  92.2  91.4  88.7  90.7 
Chu et al. CVPR’17 [8]  98.1  93.7  89.3  86.9  93.4  94.0  92.5  92.6 
Newell et al. ECCV’16 [23]  98.2  94.0  91.2  87.2  93.5  94.5  92.6  93.0 
Yang et al. ICCV’17 [42]  98.3  94.5  92.2  88.9  94.4  95.0  93.7  93.9 
Order-1 DUNet(16)  97.5  95.0  92.5  90.1  93.7  95.2  94.2  94.0 
Order-1 DUNet-BW(16)  97.8  94.3  91.8  89.3  93.1  94.9  94.4  93.6 

5 Conclusion
We have generalized the dense connectivity into the stacked UNets, resulting in a novel, simple and effective DUNet. It connects blocks with the same semantic meanings in different UNets. The order-K connectivity is proposed to improve its parameter efficiency. An iterative refinement is also introduced to make it more parameter efficient: it could halve a DUNet while achieving comparable accuracy. Through network quantization, the training memory consumption and model size can be further reduced simultaneously. Experiments show that DUNet achieves state-of-the-art performance like other landmark localizers with only about 30% of the parameters, 2% of the model size and 25% of the training memory.
6 Acknowledgment
This work is partly supported by the Air Force Office of Scientific Research (AFOSR) under the Dynamic DataDriven Application Systems Program, NSF 1763523, 1747778, 1733843 and 1703883 Awards.
References
 [1] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: CVPR (2014)

 [2] Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order markov models. Journal of Artificial Intelligence Research (2004)
 [3] Belagiannis, V., Zisserman, A.: Recurrent human pose estimation. In: FG (2017)
 [4] Bulat, A., Tzimiropoulos, G.: Human pose estimation via convolutional part heatmap regression. In: ECCV (2016)
 [5] Bulat, A., Tzimiropoulos, G.: Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In: ICCV (2017)
 [6] Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. In: CVPR (2016)
 [7] Chen, Y., Shen, C., Wei, X.S., Liu, L., Yang, J.: Adversarial posenet: A structure-aware convolutional network for human pose estimation. In: ICCV (2017)
 [8] Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A., Wang, X.: Multi-context attention for human pose estimation. In: CVPR (2017)
 [9] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv (2016)
 [10] Gkioxari, G., Toshev, A., Jaitly, N.: Chained predictions using convolutional neural networks. In: ECCV (2016)
 [11] Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: AISTATS (2011)
 [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
 [13] Hu, P., Ramanan, D.: Bottom-up and top-down reasoning with hierarchical rectified gaussians. In: CVPR (2016)
 [14] Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. In: CVPR (2017)
 [15] Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In: ECCV (2016)
 [16] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
 [17] Johnson, S., Everingham, M.: Clustered pose and nonlinear appearance models for human pose estimation. In: BMVC (2010)
 [18] Li, D., Wang, X., Kong, D.: Deeprebirth: Accelerating deep neural network execution on mobile devices. AAAI (2018)
 [19] Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv (2016)
 [20] Lifshitz, I., Fetaya, E., Ullman, S.: Human pose estimation using deep consensus voting. In: ECCV (2016)
 [21] Lv, J., Shao, X., Xing, J., Cheng, C., Zhou, X.: A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In: CVPR (2017)
 [22] McMahan, H.B., Moore, E., Ramage, D., Hampson, S., et al.: Communication-efficient learning of deep networks from decentralized data. arXiv (2016)
 [23] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV (2016)
 [24] Peng, X., Feris, R.S., Wang, X., Metaxas, D.N.: A recurrent encoder-decoder network for sequential face alignment. In: ECCV (2016)
 [25] Peng, X., Feris, R.S., Wang, X., Metaxas, D.N.: Red-net: A recurrent encoder-decoder network for video-based face alignment. IJCV (2018)
 [26] Peng, X., Tang, Z., Yang, F., Feris, R.S., Metaxas, D.: Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In: CVPR (2018)
 [27] Pishchulin, L., Andriluka, M., Gehler, P., Schiele, B.: Strong appearance and expressive spatial models for human pose estimation. In: ICCV (2013)
 [28] Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. In: CVPR (2016)
 [29] Pleiss, G., Chen, D., Huang, G., Li, T., van der Maaten, L., Weinberger, K.Q.: Memory-efficient implementation of densenets. arXiv (2017)
 [30] Rafi, U., Leibe, B., Gall, J., Kostrikov, I.: An efficient convolutional network for human pose estimation. In: BMVC (2016)

[31] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: ECCV (2016)
 [32] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
 [33] Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: The first facial landmark localization challenge. In: ICCVW (2013)
 [34] Shi, B., Bai, X., Liu, W., Wang, J.: Deep regression for face alignment. arXiv (2014)
 [35] Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. In: CVPR (2015)
 [36] Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: NIPS (2014)
 [37] Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. In: CVPR (2014)
 [38] Trigeorgis, G., Snape, P., Nicolaou, M.A., Antonakos, E., Zafeiriou, S.: Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In: CVPR (2016)
 [39] Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: CVPR (2016)
 [40] Wu, S., Li, G., Chen, F., Shi, L.: Training and inference with integers in deep neural networks. In: ICLR (2018)
 [41] Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: CVPR (2013)
 [42] Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. In: ICCV (2017)
 [43] Zafeiriou, S., Trigeorgis, G., Chrysos, G., Deng, J., Shen, J.: The menpo facial landmark localisation challenge: A step towards the solution. In: CVPRW (2017)
 [44] Zhang, J., Shan, S., Kan, M., Chen, X.: Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In: ECCV (2014)
 [45] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: ECCV (2014)
 [46] Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.: Learning to forecast and refine residual motion for imagetovideo generation. In: ECCV (2018)
 [47] Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv (2016)
 [48] Zhu, S., Li, C., Change Loy, C., Tang, X.: Face alignment by coarse-to-fine shape searching. In: CVPR (2015)