P^2 Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation

The target of human pose estimation is to determine the body part or joint locations of each person in an image. This is a challenging problem with wide applications. To address it, this paper proposes an augmented parallel-pyramid net with an attention partial module and differentiable auto data augmentation. Technically, a parallel pyramid structure is proposed to compensate for the loss of information. We adopt the parallel structure for reverse compensation, while the overall computational complexity does not increase. We further define an Attention Partial Module (APM) operator to extract weighted features from the different-scale feature maps generated by the parallel pyramid structure. Compared with refining through an upsampling operator, APM better captures the relationships between channels. Finally, we propose a differentiable auto data augmentation method to further improve estimation accuracy. We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component. Experiments corroborate the effectiveness of our proposed method. Notably, our method achieves the top-1 accuracy on the challenging COCO keypoint benchmark and state-of-the-art results on the MPII dataset.






1 Introduction

Multi-person pose-estimation has been intensely investigated in computer vision due to its wide applications. There are many challenges in multi-person pose-estimation, such as occluded, self-occluded and invisible keypoints. For example, the visibility of keypoints is greatly affected by clothing, posture, viewing angle and background. With the advances of deep learning, there is growing interest in developing deep neural networks for multi-person pose-estimation [9, 3, 17, 33, 26].

Figure 1: The image is fed into a detector to generate bounding boxes, and each detected person is fed into the network to generate one heatmap per keypoint channel. The maximum of each heatmap indicates the position of the corresponding keypoint.

Multi-person pose-estimation models can be categorized into top-down and bottom-up methods. As shown in Figure 1, top-down methods first detect each person with some detection algorithm (e.g., FPN [34], Mask R-CNN [10], TridentNet [18]) and then generate keypoints within these bounding-boxes. Representative methods include Simple-Baseline [37], RMPE [9], CPN [4] and HRNet [33]. In contrast, bottom-up methods consist of keypoint detection and clustering: they first detect all keypoints in the image and then cluster them into different individuals. PAF [3], Associative Embedding [24], Part Segmentation [36] and Mid-Range Offsets [26] are representative methods.

Although great progress has been made, accurate pose estimation in the wild still faces many challenges. In unconstrained conditions, the visibility of keypoints is greatly affected by clothing, posture, viewing angle and background, and large pose variations further increase the difficulty of detection. In particular, when parts of the human body overlap, the occluded keypoints are extremely difficult to detect. To increase the diversity of training data, data augmentation strategies are commonly used: reasonable augmentation improves the robustness of the model, whereas irrational augmentation parameters introduce noise into the network. Data augmentation therefore plays an important role in pose estimation. Existing augmentation strategies are usually hand-crafted or based on tuning techniques [11, 13]. Due to the complexity of multi-person pose estimation, manually designing the data augmentation sequences is a non-trivial task. Hence, finding an efficient strategy automatically becomes urgent.

FPN [21] exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. Due to the down-sampling in this structure, we obtain more semantic information at different scales in the deep layers, but the spatial resolution of the deep feature maps becomes lower, resulting in a loss of spatial information.

To address these problems, we propose a novel two-stage network structure which consists of ParallelNet and RefineNet. ParallelNet adopts parallel structures to learn high-resolution representations that compensate for the information loss. On the one hand, our ParallelNet extracts sufficient context information for inferring occluded and invisible keypoints. On the other hand, it effectively preserves both the spatial and the semantic information of the pyramid feature network. Based on the pyramid features, our RefineNet explicitly addresses the hard keypoint detection problem by optimizing an online hard keypoints mining loss [31]. The usual refining operation is to upsample and then concatenate, which ignores the relationships between feature maps at different scales. Instead, we maintain the spatial resolution of the features by employing a dilated bottleneck structure, and, to make RefineNet focus on more informative regions, we add an Attention Partial Module (APM) to the output of ParallelNet. Our work maintains high-resolution feature maps while keeping a large receptive field, both of which are important for pose detection and estimation.

We address the multi-person pose-estimation problem with a top-down pipeline: a human detector is first adopted to generate a set of human bounding-boxes, followed by our network for keypoint localization within each bounding-box. In addition, we also explore various factors that might affect the performance of multi-person pose-estimation, including the choice of person detector.

We propose an automated approach (named Auto-Pose) to search for data augmentation strategies instead of hand-crafting them. Our approach is explicitly suited to the pose-estimation task: Auto-Pose designs a new pose search space that encodes the sequences of data augmentations commonly used in pose-estimation, and it adopts a differentiable search method instead of reinforcement learning, making the searched results particularly suitable for pose-estimation. We summarize our contributions as follows:

  1. First, we improve the upper part of the pyramid structure, retaining both the global and the local information effectively. We propose a repeated multi-scale fusions operation and acquire high-resolution representations to compensate for the information losses incurred by the single pyramid structure. Furthermore, we implement Attention Partial Module (APM) to probe the relationships between feature maps with different scales. Combining these techniques mentioned above, the predicted keypoints heatmap and spatial position can be more accurate.

  2. Second, to the best of our knowledge, we are the very first to search the sequence of data augmentations for human pose-estimation, replacing the laborious manual design of data augmentation. We propose a novel pose search space where the sequences are formulated as a trainable and operational CNN component.

Figure 2: The architecture of the proposed pose-estimation network. The network adopts the idea of a cascaded pyramid network and is divided into three sub-networks: PyramidNet, ParallelNet and RefineNet. The PyramidNet part (purple area) is a pyramid structure, and the ParallelNet (red area) is a parallel network structure. In ParallelNet, we perform repeated multi-scale fusions using parallel structures, producing high-resolution representations that fully preserve the information of the pyramid network; the predicted keypoint heatmaps and spatial positions can therefore be more accurate. RefineNet (yellow area) corrects keypoints that are difficult to locate for ParallelNet. Before each element-wise sum, we adopt 1×1 convolutions to ensure the same number of channels.

2 Related Works

Human pose-estimation has been widely used in action recognition, human body structure generation, etc. With the development of Convolutional Neural Networks (CNNs), the mainstream methods have evolved from HOG [8] and deformable part models, which rely on handcrafted features and graphical models, to CNNs that significantly improve the performance of pose-estimation. In general, there are two mainstream approaches: Top-Down and Bottom-Up.

Top-Down Methods Top-down methods resemble human perception: keypoints are generated after a person is first located. These methods consist of two stages. First, a detection network such as FPN [21], YOLO [29] or SSD [23] detects each person in the image and predicts human bounding-boxes; then keypoints are predicted within each bounding-box. Recent pose-estimation methods apply object detection before making predictions. The U-shaped stacked hourglass network [25] stacks several hourglass modules to generate predictions. Belagiannis and Zisserman apply RNN-like architectures to sequentially refine the results [2]. These methods depend on the reliability of the person detector, so they often face difficulty in recovering poses that are obscured or hard to locate. Thus, the output predictions of top-down methods can potentially benefit from an additional refinement step.

Bottom-Up Methods Bottom-up methods first predict all body joints and then group them into full poses. Instead of applying person detection, these methods rely on context information and inter-body-joint relationships, and are therefore often used in real-time scenarios. Pishchulin et al. [15] proposed a bottom-up approach that jointly labels part detection candidates and associates them to individual people. Cao et al. [3] mapped the relationships between keypoints and assembled the detected keypoints into the poses of different people. Newell et al. [24] simultaneously produced score maps and pixel-wise embeddings to group candidate keypoints into different people and obtain the final multi-person pose estimates. However, it is often difficult to model the joint relationships, which leads to failure cases where the poses of different people cannot be disambiguated, or body parts of the same person are grouped into different clusters.

Auto Data Augmentations Recently, some researchers adopted the AutoAugment search space with improved optimization algorithms to seek more efficient policies [12, 20]. Most AutoML approaches search CNNs on a small proxy task and automatically tune hyperparameters. Our work is motivated by recent research on AutoML [7] and Neural Architecture Search (NAS) [1, 28, 40], while we focus on searching for a high-performance pose estimation model. Inspired by the efficiency of NAS and AutoML algorithms, we develop new automatic methods for the pose-estimation problem. In particular, we focus on searching for differentiable sequences of data augmentations in the network. Experimental results demonstrate that the sequence information affects estimation results and validate that the automatic strategy significantly outperforms simply adopting hand-crafted sequences of data augmentations based on engineering experience.

3 Proposed Method

We introduce the proposed top-down pose estimation method. As shown in Figure 2, our P^2 Net involves three sub-networks: PyramidNet, ParallelNet and RefineNet. A human detector is first applied to an image to generate a set of human bounding-boxes, and then the keypoints of each person are located by our proposed pose-estimation network.

3.1 Human Detector

FPN [21] uses a pyramid structure to maintain the balance between spatial resolution and semantic information. In FPN, large objects are generated and predicted within deeper layers. Since FPN adopts a top-down, inverted-pyramid structure, the boundaries of these objects may be too blurry for an accurate regression. FPN predicts small objects in shallower layers. However, shallow layers carry only low-level semantic information, which may be insufficient to recognize the category of the object instances. Therefore, detectors must enhance their classification capability by involving context cues of high-level representations from the deeper layers. Hence, FPN is often enhanced by adopting a bottom-up pathway. However, the absence of small objects in deeper layers still leads to lost context cues.

To address these problems, we use the feature activations output by the last residual block of each stage in ResNet, denoting the output of conv_i as C_i (i = 2, 3, 4, 5). The detector has exactly the same number of stages as the reference detector (e.g., FPN), but we keep the scales of the stages after C_4 the same, fixing the spatial resolution at 16× downsampling after C_4. We then apply the bottleneck with dilation [39] as a basic network block to maintain the receptive field efficiently. Since dilated convolution is still time-consuming, we keep the number of channels fixed at C = 256. In Section 4 we also discuss the impact of different detection networks on pose estimation.
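To see why the dilated bottleneck preserves the receptive field without further downsampling, a small helper (our illustration, not part of the paper) computes the receptive field of a stack of convolutions:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    Each layer is a tuple (kernel_size, stride, dilation).
    """
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1      # effective kernel size under dilation
        rf += (k_eff - 1) * jump     # grow the receptive field
        jump *= s                    # accumulate stride
    return rf

# Three plain 3x3 convs (stride 1) vs. the same stack with dilation 2:
plain = receptive_field([(3, 1, 1)] * 3)     # -> 7
dilated = receptive_field([(3, 1, 2)] * 3)   # -> 13
```

With dilation 2 the same three-layer stack sees a 13×13 window instead of 7×7, at the same spatial resolution, which is the trade-off the dilated bottleneck exploits.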

3.2 Pyramid Network With Information Compensation

Figure 3: (a) represents the overall architecture of the proposed parallel structures. Before each element-wise sum, we adopt 1×1 convolutions to ensure the same number of channels. (b)(c)(d) represent how the exchange unit aggregates the information for high, medium and low resolutions from the left to the right, respectively.

ParallelNet - Making up for information loss. We design our network based on the ResNet backbone. We denote the outputs of the last residual blocks as C_r (r = 2, 3, 4, 5) for the conv2, conv3, conv4 and conv5 feature maps respectively. Convolution filters (kernel size 3×3) are applied on C_r (r = 2, 3, 4, 5) to generate the heatmaps of keypoints. As we know, shallow features have high spatial resolution for localization but low semantic information for recognition, while deep feature layers have more semantic information but insufficient spatial resolution. Therefore, we apply the pyramid structure in PyramidNet. The advantage of the pyramid structure is that it effectively preserves both the semantic information and the spatial resolution of the feature maps. The scale of each upper layer is twice that of the next layer; we use the connection method depicted in Figure 3. Using R_r to represent the spatial resolution of C_r, the resolutions satisfy:

    R_{r+1} = R_r / 2,  r = 2, 3, 4.
In the parallel structure shown in Figure 3, F_r^s denotes the feature map of stage s at resolution index r.

Existing networks for pose estimation are built by connecting high-resolution to low-resolution subnetworks in series, where each subnetwork, forming a stage, is composed of a sequence of convolutions, and a down-sampling layer across adjacent subnetworks halves the resolution. In contrast, we keep the spatial resolution of each subnetwork unchanged across stages to ensure sufficient feature expression:

    R_r^{s+1} = R_r^s.

The fused map F_r is the result of aggregating the various scales through element-wise summation. As shown in Figure 3 (b-d), the operation in (c), which aggregates information at the medium resolution, can be interpreted as:

    F'_m = D(F_h) + F_m + U(F_l),

where D represents one strided 3×3 convolution (downsampling) and U represents the upsampling operation.

One strided 3×3 convolution with stride 2 is used for 2× downsampling, and two consecutive strided 3×3 convolutions with stride 2 are used for 4× downsampling. For upsampling, we adopt simple nearest-neighbour sampling followed by a 1×1 convolution to align the number of channels.
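The exchange unit of Figure 3(c) can be sketched in NumPy with nearest-neighbour upsampling and average pooling standing in for the strided 3×3 convolution (an illustration under the assumption that all branches already share the channel count, so the 1×1 channel-aligning convolutions are omitted):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling over a (C, H, W) feature map
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    # 2x average pooling as a stand-in for the strided 3x3 convolution
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def fuse_medium(high, mid, low):
    # Exchange unit for the medium branch: downsample the high-resolution
    # map, upsample the low-resolution map, fuse by element-wise sum.
    return downsample2x(high) + mid + upsample2x(low)

high = np.ones((16, 32, 32))
mid = np.ones((16, 16, 16))
low = np.ones((16, 8, 8))
fused = fuse_medium(high, mid, low)   # shape (16, 16, 16)
```

The high and low exchange units of Figure 3(b) and (d) differ only in which branches are resampled before the sum.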

Figure 4: (a) Attention Partial Module. (b) Dilated Bottleneck with DilatedConv.

3.3 Dilated Attention Partial Module

RefineNet - Better capturing the relationship between channels. Based on the feature pyramid representations generated by ParallelNet, we attach a RefineNet to refine the keypoints. To improve efficiency, our RefineNet transmits information across different levels and finally integrates the information of different levels via upsampling and concatenation. Our RefineNet concatenates all the pyramid features rather than simply using the upsampled features at the end of an hourglass module.

In order to selectively gather features from the shallow and the deep layers, we design an attention module, the Attention Partial Module (APM). As shown in Figure 4(a), we denote the input feature maps as X with C channels and spatial size H×W. Applying global average pooling to X gives an output z of dimension C, where C denotes the number of channels. The c-th channel of z can be expressed as:

    z_c = (1 / (H·W)) Σ_{i=1..H} Σ_{j=1..W} x_c(i, j).
z is then reorganized by a 1×1 convolution layer with the same number of channels as X. The sigmoid function σ is applied to activate the convolution result, constraining the values of the weight vector w to (0, 1). We then perform an outer product of w and X, reweighting each channel, and the final output Y can be expressed as:

    w = σ(Conv_{1×1}(z)),    y_c = w_c · x_c.
In addition, we stack more dilated bottleneck blocks, as shown in Figure 4(b), into the deeper layers, which use small spatial-size convolutions to achieve a good trade-off between receptive field and efficiency.
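The APM computation above (global average pooling, a 1×1 convolution, sigmoid gating, channel-wise reweighting) can be sketched in NumPy; the dense matrix standing in for the 1×1 convolution and the random inputs are our simplifications, not the paper's code:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def attention_partial_module(x, w1x1):
    """Channel attention in the spirit of APM (illustrative sketch).

    x    : feature map of shape (C, H, W)
    w1x1 : (C, C) weight matrix standing in for the 1x1 convolution
    """
    z = x.mean(axis=(1, 2))          # global average pooling -> (C,)
    w = sigmoid(w1x1 @ z)            # channel weight vector in (0, 1)
    return w[:, None, None] * x      # reweight each channel of x

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1x1 = rng.standard_normal((8, 8))
y = attention_partial_module(x, w1x1)   # same shape as x
```

Because the sigmoid bounds every channel weight to (0, 1), the module can only attenuate channels relative to one another, which is what lets RefineNet emphasise the more informative scales.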

3.4 Differentiable Sequences for Pose Data Augmentations

Figure 5: Four examples of learned sub-policies applied to one example image batch. (a) Each differently colored image batch corresponds to a different random sample of the corresponding sub-policy. Each step of an augmentation sub-policy consists of a triplet: the operation, the probability of application and a magnitude measure. (b) Operations on the edges are initially unknown. We take a continuous relaxation of the search space by placing a mixture of candidate operations on each edge, and jointly optimize the mixing probabilities and the network weights by solving the bi-level optimization problem. (c) We induce the final architecture from the learned mixing probabilities.

More efficient. Relying on engineering experience to manually set data augmentations is laborious and suboptimal. Since the input to our network is a cropped bounding-box image, we use methods such as geometric transformation, color transformation and scale transformation to increase data diversity. We treat data augmentation search as a discrete optimization problem. Following previous work, we define an augmentation policy in our search space as an unordered set of K sub-policies. During training, one of the K sub-policies is selected at random and applied to the current image batch. Each sub-policy has N image transformations drawn from the 9 operations shown in Table 1.

Operation Name Description
TranslateX(Y) Translate the image and the bounding boxes in the horizontal (vertical) direction by M pixels.
Rotate Rotate the image and the bounding boxes by M degrees.
Equalize Equalize the image histogram.
Solarize Invert all pixels above a threshold value M.
SolarizeAdd For each pixel in the image that is less than 128, add an additional amount decided by M.
Brightness Adjust the brightness of the image. M = 0 gives a black image, whereas M = 1 gives the original image.
Sharpness Adjust the sharpness of the image. M = 0 gives a blurred image, whereas M = 1 gives the original image.
Cutout Set random square patches of pixels with side-length M to gray.
Scale Scale the image with magnitude M.
Table 1: Table of the possible transformations that can be applied to an image. These are the transformations that are available to the controller during the search process; M denotes the magnitude of an operation.
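The policy structure described here, sub-policies of (operation, probability, magnitude) triplets with one sub-policy drawn at random per batch, can be sketched as follows. This is a toy illustration: the brightness and solarize stand-ins and all constants are our assumptions, not the paper's implementation:

```python
import numpy as np

# Toy stand-ins for two of the Table 1 operations (simplified semantics).
def brightness(img, m):
    # m in [0, 10]; m = 10 keeps the original image, m = 0 gives black
    return img * (m / 10.0)

def solarize(img, m):
    # invert pixels above a threshold derived from the magnitude
    thresh = 255 * (m / 10.0)
    return np.where(img > thresh, 255 - img, img)

def apply_sub_policy(img, sub_policy, rng):
    # Each step is (operation, probability, magnitude); every step fires
    # independently with its own probability.
    for op, prob, mag in sub_policy:
        if rng.random() < prob:
            img = op(img, mag)
    return img

def augment(img, policy, rng):
    # A policy is K sub-policies; one is drawn at random per image batch.
    sub = policy[rng.integers(len(policy))]
    return apply_sub_policy(img, sub, rng)

policy = [
    [(brightness, 0.8, 9), (solarize, 0.5, 7)],
    [(solarize, 0.3, 5), (brightness, 0.9, 8)],
    [(brightness, 1.0, 10), (solarize, 0.0, 4)],
]
img = np.full((4, 4), 128.0)
out = augment(img, policy, np.random.default_rng(0))
```

The search described next decides which operations, probabilities and magnitudes fill these slots.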

As shown in Figure 5, we turn the problem of searching for a learned augmentation policy into a discrete optimization problem by creating a search space. The search space consists of K = 3 sub-policies, with each sub-policy consisting of N = 2 operations applied in sequence to a single image. At the same time, we set two hyperparameters for each operation: a probability P and a magnitude M. The probability parameter introduces a notion of stochasticity into the augmentation policy, whereby the selected augmentation operation is applied to the image with the specified probability. Because the range of M differs across augmentations, such as the range of random scaling (0.7–1.35) versus the range of rotation (in degrees), we normalize M to 0–10. Existing methods for addressing such discrete optimization problems include reinforcement learning, evolutionary methods and sequential model-based optimization. In this paper, we apply a differentiable search algorithm instead. Let O be the set of candidate data augmentation operations, where each operation o ∈ O represents a function to be applied to an image batch x. We relax the categorical choice of a particular operation to a softmax over all possible operations:

    ō(x) = Σ_{o∈O} ( exp(α_o) / Σ_{o'∈O} exp(α_{o'}) ) · o(x),
where the operation mixing weights for a pair of image batches are parameterized by a vector α of dimension |O|. The task of sequence search then reduces to learning the set of continuous variables α. At the end of the search, a discrete sequence of data augmentations is obtained by replacing each mixed operation with the most likely operation, o* = argmax_{o∈O} α_o. Let L_train and L_val denote the training loss and the validation loss, respectively. Both losses are determined by the data augmentation sequence parameters α and the weights w of the network. The goal becomes the following alternating (bi-level) optimization:

    min_α L_val(w*(α), α)   s.t.   w*(α) = argmin_w L_train(w, α).
Since solving the inner optimization exactly is expensive, we adopt an approximation scheme:

    ∇_α L_val(w*(α), α) ≈ ∇_α L_val(w − ξ ∇_w L_train(w, α), α),

where w denotes the current weights maintained by the algorithm, and ξ is the learning rate for a step of inner optimization.
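The relaxation and discretization steps can be sketched in a few lines of NumPy (an illustration of the DARTS-style mechanism, not the paper's implementation; the toy operations and logits are ours):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_operation(x, ops, alpha):
    """Continuous relaxation: a softmax-weighted sum over candidate ops."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

def discretize(ops, alpha):
    """After the search, keep only the most likely operation."""
    return ops[int(np.argmax(alpha))]

ops = [lambda x: x, lambda x: 2 * x, lambda x: -x]   # toy candidate ops
alpha = np.array([0.1, 2.0, -1.0])                   # learned mixing logits
x = np.array([1.0, 3.0])
mixed = mixed_operation(x, ops, alpha)   # dominated by the 2*x candidate
chosen = discretize(ops, alpha)          # the argmax operation: 2*x
```

During the search, gradients flow through `mixed_operation` into `alpha`, which is what the one-step approximation above updates; `discretize` corresponds to replacing each mixed operation with its most likely member at the end.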

4 Experiments and Analysis

We evaluate our proposed network on two pose-estimation datasets. The quantitative performance is presented in this section. The overall results demonstrate that our framework achieves state-of-the-art accuracy on multi-person pose estimation.

4.1 COCO Keypoint Detection


Dataset

Our models are trained on the MSCOCO [22] train set (57K images and 150K person instances, each annotated with 17 keypoints) and validated on the MSCOCO val set (5K images). The testing set is test-dev (20K images).

Evaluation Method

We report the standard average precision (AP) and recall (AR) scores: AP (the mean of AP scores at 10 OKS thresholds, OKS = 0.50, 0.55, …, 0.90, 0.95), AP^50 (OKS = 0.50), AP^75 (OKS = 0.75), AP^M for medium objects and AP^L for large objects. Object Keypoint Similarity (OKS) is defined as:

    OKS = Σ_i exp(−d_i² / (2 s² k_i²)) δ(v_i > 0) / Σ_i δ(v_i > 0),

where d_i is the Euclidean distance between the i-th predicted keypoint and its ground truth, s is the scale factor (the square root of the human body area), and k_i is a per-keypoint constant. Only keypoints marked as visible (v_i = 1) are included in the evaluation.
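The OKS formula translates directly into code; the example keypoints, area and per-keypoint constants below are illustrative, not taken from the paper:

```python
import numpy as np

def oks(pred, gt, vis, area, k):
    """Object Keypoint Similarity (COCO-style, illustrative sketch).

    pred, gt : (N, 2) predicted / ground-truth keypoint coordinates
    vis      : (N,) visibility flags (keypoints with vis > 0 are scored)
    area     : object area; s = sqrt(area) is the scale factor
    k        : (N,) per-keypoint falloff constants
    """
    d2 = ((pred - gt) ** 2).sum(axis=1)   # squared Euclidean distances
    s2 = area                              # s**2 with s = sqrt(area)
    e = np.exp(-d2 / (2 * s2 * k ** 2))
    m = vis > 0
    return e[m].sum() / m.sum()

gt = np.array([[10.0, 10.0], [20.0, 20.0]])
k = np.array([0.079, 0.079])
perfect = oks(gt, gt, np.array([1, 1]), area=100.0, k=k)   # -> 1.0
```

A perfect prediction scores 1.0, and the score decays toward 0 as keypoints drift from the ground truth, faster for small people and for keypoints with small k_i.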

Training Details

For each bounding-box, we crop the box from the image and resize it to a fixed size of 384×288. Then we adopt the data augmentation sequences searched beforehand. To obtain the sequence, we train from scratch with a global batch size of 32, an image size of 640×640, a learning rate of 0.08, a weight decay of 1e-4, and α = 0.25 and γ = 1.5 for the focal loss parameters.

We train for 100 epochs, using stepwise decay where the learning rate is reduced by a factor of 10 at epochs 70 and 90. All models are trained on 4 GPUs over 5 days. Meanwhile, to save computing resources and speed up the search, we only use 5K randomly chosen COCO train images when searching for data augmentation sequences. The reward signal for the controller is the mAP on the MS COCO val set of 5,000 images.

The pose estimation models are trained using the Adam algorithm with an initial learning rate of 5. Note that we also decrease the learning rate by a factor of 2 every iterations. We use a weight decay of 1 and set the training batch size to 32. Batch normalization is used in our network. The training of ResNet-101 models takes about 1.5 days on 4 GPUs. Our models are initialized with weights of the ImageNet-pretrained model.

During training, ParallelNet adopts an L2 loss over all keypoints, while RefineNet adopts an online hard keypoints mining loss: we select hard keypoints online based on the training loss (the pixel-level heatmap L2 loss) and only keep the top 10 of all N keypoints.
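The online hard keypoints mining loss can be sketched as ranking the per-keypoint L2 losses and averaging only the hardest ones (an illustration; the heatmap sizes and values are ours):

```python
import numpy as np

def ohkm_l2_loss(pred_heatmaps, gt_heatmaps, top_k=10):
    """Online hard keypoints mining over per-keypoint heatmap L2 losses.

    The per-keypoint pixel-level L2 losses are ranked, and the mean of the
    top_k largest is returned (top_k = 10 in the setting described above).
    """
    per_kpt = ((pred_heatmaps - gt_heatmaps) ** 2).mean(axis=(1, 2))  # (N,)
    k = min(top_k, per_kpt.size)
    hardest = np.sort(per_kpt)[-k:]
    return hardest.mean()

gt = np.zeros((17, 8, 8))
pred = np.zeros((17, 8, 8))
pred[0] += 1.0                                 # one badly predicted keypoint
loss_all = ((pred - gt) ** 2).mean()           # plain L2 dilutes the error
loss_hard = ohkm_l2_loss(pred, gt, top_k=1)    # mining focuses on it
```

Averaging over all keypoints dilutes the error of the single hard keypoint by a factor of 17, while the mined loss keeps its gradient signal at full strength.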

Testing Details

We apply a Gaussian filter to the predicted heatmaps and compute the final heatmap by averaging the heatmaps of the original and flipped images. Each keypoint position is predicted by taking the highest-response location, shifted by a quarter pixel in the direction from the highest response to the second-highest response. We take the product of the box score and the average score of all keypoints as the final pose score of a person instance.
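The quarter-pixel shift used in decoding can be sketched as follows (an illustration of this common heatmap-decoding refinement, not the paper's code):

```python
import numpy as np

def decode_keypoint(heatmap):
    """Peak location shifted 0.25 pixel toward the second-highest response."""
    flat = heatmap.ravel().copy()
    top = np.argmax(flat)
    flat[top] = -np.inf                 # mask the peak to find the runner-up
    second = np.argmax(flat)
    y1, x1 = np.unravel_index(top, heatmap.shape)
    y2, x2 = np.unravel_index(second, heatmap.shape)
    direction = np.sign(np.array([y2 - y1, x2 - x1], dtype=float))
    return np.array([y1, x1], dtype=float) + 0.25 * direction

hm = np.zeros((5, 5))
hm[2, 2] = 1.0      # peak response
hm[2, 3] = 0.6      # second-highest response, to the right
pt = decode_keypoint(hm)   # -> [2.0, 2.25]
```

The shift compensates for the quantization of the heatmap grid: when the true keypoint lies between two cells, the runner-up response indicates on which side.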

Method Backbone Input size #Params GFLOPs AP AP50 AP75 APM APL AR
Mask-RCNN [10] ResNet-50-FPN - - - 63.1 87.3 68.7 57.8 71.4 -
G-RMI [27] ResNet-101 353×257 42.6M 57.0 64.9 85.5 71.3 62.3 70.0 69.7
CPN [4] ResNet-Inception 384×288 - - 72.1 91.4 80.0 68.7 77.2 78.5
RMPE [9] PyraNet 320×256 28.1M 26.7 72.3 89.2 79.1 68.0 78.6 -
CFN [14] - - - - 72.6 86.1 69.7 78.3 64.1 -
CPN [4] (ensemble) ResNet-Inception 384×288 - - 73.0 91.7 80.9 69.5 78.1 79.0
SimpleBaseline [37] ResNet-152 384×288 68.6M 35.6 73.7 91.9 81.1 70.3 80.0 79.0
HRNet-W32 [33] HRNet-W32 384×288 28.5M 16.0 74.9 92.5 82.8 71.3 80.9 80.1
Ours ResNet101 384×288 47.5M - 76.5 92.1 83.7 73.2 82.2 82.5
Table 2: Comparisons on COCO test-dev. #Params and GFLOPs are calculated for the pose estimation network.
Method Hea Sho Elb Wri Hip Kne Ank Total
Stacked Hourglass [25] 98.2 96.3 91.2 87.1 90.1 87.4 83.6 90.9
Sun et al. [32] 98.1 96.2 91.2 87.2 89.8 87.4 84.1 91.0
Chu et al. [6] 98.5 96.3 91.9 88.1 90.6 88.0 85.0 91.5
Chou et al. [5] 98.2 96.8 92.2 88.0 91.3 89.1 84.9 91.8
Yang et al. [38] 98.5 96.7 92.5 88.7 91.1 88.6 86.0 92.0
Ke et al. [16] 98.5 96.8 92.7 88.4 90.6 89.3 86.3 92.1
Tang et al. [35] 98.4 96.9 92.6 88.7 91.8 89.4 86.2 92.3
SimpleBaseline [37] 98.8 96.6 91.9 87.6 91.1 88.1 84.1 91.5
HRNet-W32 [33] 98.6 96.9 92.8 89.0 91.5 89.0 85.7 92.3
Ours 98.7 97.1 92.9 89.2 90.1 90.5 85.8 92.4
Table 3: Performance comparisons on the MPII test set (PCKh@0.5).

4.2 MPII Keypoint Detection

The MPII dataset contains 25K images with 40K person instances, of which 12K subjects are used for testing and the remaining subjects for training. The data augmentation and the training strategy are the same as for MS COCO. The standard metric, the PCKh (head-normalized probability of correct keypoint) score, is used: a keypoint is correct if it falls within α = 0.5 times the head segment length of the ground-truth position. We report the PCKh@0.5 score.
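The PCKh metric can be sketched as follows (an illustrative implementation; the example keypoints and head size are ours):

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """PCKh: a keypoint is correct if its distance to the ground truth is
    within alpha * head size (alpha = 0.5 for PCKh@0.5).

    pred, gt   : (P, N, 2) keypoint coordinates for P people
    head_sizes : (P,) head segment lengths
    """
    d = np.linalg.norm(pred - gt, axis=2)        # (P, N) distances
    ok = d <= alpha * head_sizes[:, None]        # per-keypoint correctness
    return ok.mean() * 100.0                     # percentage of correct kpts

gt = np.zeros((1, 4, 2))
pred = np.zeros((1, 4, 2))
pred[0, 0] = [10.0, 0.0]                  # one keypoint 10 px off
score = pckh(pred, gt, np.array([8.0]))   # 3 of 4 within 0.5 * 8 = 4 px
```

Normalizing by head size makes the threshold scale with the person, so near and far subjects are scored consistently.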

4.3 Experiment Results

Tables 2 and 3 show the pose-estimation performance of our method. We obtain state-of-the-art results on both datasets, MS COCO test-dev and MPII. Our method performs better than SimpleBaseline while using considerably fewer parameters.

4.4 Ablation Study

Components of the Network. We analyze the importance of each component of our network, evaluated with the ResNet101 backbone.

  • Detector Network. Table 4 shows the relationship between detection AP and the corresponding keypoint AP for three detector networks: Faster R-CNN [30], DetNet [19] and TridentNet [18]. As the table shows, increasing the detection AP brings almost no increase in keypoint AP. Therefore, the more important task for pose-estimation is to enhance the accuracy of the keypoints rather than to involve more boxes.

    Detector Backbone mAP AP50 APs APm APl AP (kpt)
    Faster R-CNN ResNet50 37.4 59.0 18.3 41.7 52.9 76.2
    DetNet59 ResNet50 40.2 61.7 23.9 43.2 52.0 76.3
    TridentNet ResNet101 40.6 61.8 23.0 45.5 55.9 76.3
    Table 4: Comparison with different detector networks. AP (kpt) denotes the keypoint performance of our network based on the specific detector network; the remaining columns are detection metrics.
  • Attention Partial Module. As shown in Table 5, comparing our approach to the same network without APM, the performance improves notably. This indicates that APM gathers features more selectively and efficiently from shallow and deep layers.

    Method Backbone AP AP50 AP75 APM APL AR
    Ours ResNet101 76.5 92.1 83.7 73.2 82.2 82.5
    Ours w/o APM ResNet101 76.2 92.5 83.1 72.2 82.2 81.4
    Table 5: Comparison with the result without the Attention Partial Module. "Ours w/o APM" is our method without the attention module, still adopting the auto data augmentations.

5 Conclusion

This paper has proposed a top-down pose-estimation framework with automatic data augmentation. In the human detection stage of our framework, DetNet has been employed to enhance the coherence between the ImageNet-pretrained model and the detection model, resulting in better representation ability of the feature maps. The detected human images are then augmented by the searched policies: a differentiable method is used to search among different augmentation policies, which are regarded as different routines to be selected. Finally, the ParallelNet is designed to fuse feature maps of different levels, such that both high- and low-level information can be utilized. Our proposed framework pays attention to all stages of pose estimation and can therefore significantly reduce estimation errors. The experimental results demonstrate that our proposed framework outperforms state-of-the-art human pose-estimation methods.


  • [1] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §2.
  • [2] V. Belagiannis and A. Zisserman (2017) Recurrent human pose estimation. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 468–475. Cited by: §2.
  • [3] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299. Cited by: §1, §1, §2.
  • [4] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112. Cited by: §1, Table 2.
  • [5] C. Chou, J. Chien, and H. Chen (2018) Self adversarial training for human pose estimation. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 17–30. Cited by: Table 3.
  • [6] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang (2017) Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1831–1840. Cited by: Table 3.
  • [7] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Cited by: §2.
  • [8] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. Cited by: §2.
  • [9] H. Fang, S. Xie, Y. Tai, and C. Lu (2017) Rmpe: regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343. Cited by: §1, §1, Table 2.
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, Table 2.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [12] D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen (2019) Population based augmentation: efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393. Cited by: §2.
  • [13] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1.
  • [14] S. Huang, M. Gong, and D. Tao (2017) A coarse-fine network for keypoint localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3028–3037. Cited by: Table 2.
  • [15] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele (2016) Deepercut: a deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pp. 34–50. Cited by: §2.
  • [16] L. Ke, M. Chang, H. Qi, and S. Lyu (2018) Multi-scale structure-aware network for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 713–728. Cited by: Table 3.
  • [17] M. Kocabas, S. Karagoz, and E. Akbas (2018) Multiposenet: fast multi-person pose estimation using pose residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 417–433. Cited by: §1.
  • [18] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. arXiv preprint arXiv:1901.01892. Cited by: §1, 1st item.
  • [19] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2018) Detnet: a backbone network for object detection. arXiv preprint arXiv:1804.06215. Cited by: 1st item.
  • [20] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim (2019) Fast autoaugment. arXiv preprint arXiv:1905.00397. Cited by: §2.
  • [21] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1, §2, §3.1.
  • [22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.
  • [23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2.
  • [24] A. Newell, Z. Huang, and J. Deng (2017) Associative embedding: end-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pp. 2277–2287. Cited by: §1, §2.
  • [25] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In European conference on computer vision, pp. 483–499. Cited by: §2, Table 3.
  • [26] G. Papandreou, T. Zhu, L. Chen, S. Gidaris, J. Tompson, and K. Murphy (2018) Personlab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–286. Cited by: §1.
  • [27] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy (2017) Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4903–4911. Cited by: Table 2.
  • [28] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §2.
  • [29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §2.
  • [30] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: 1st item.
  • [31] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 761–769. Cited by: §1.
  • [32] K. Sun, C. Lan, J. Xing, W. Zeng, D. Liu, and J. Wang (2017) Human pose estimation using global and local normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5599–5607. Cited by: Table 3.
  • [33] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212. Cited by: §1, Table 2, Table 3.
  • [34] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1.
  • [35] W. Tang, P. Yu, and Y. Wu (2018) Deeply learned compositional models for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 190–206. Cited by: Table 3.
  • [36] F. Xia, P. Wang, X. Chen, and A. L. Yuille (2017) Joint multi-person pose estimation and semantic part segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6769–6778. Cited by: §1.
  • [37] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481. Cited by: §1, Table 2, Table 3.
  • [38] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang (2017) Learning feature pyramids for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1281–1290. Cited by: Table 3.
  • [39] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §3.1.
  • [40] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §2.