Considerable effort has been devoted to the challenging problem of 3D pose estimation, and existing approaches can be divided into two categories. One-stage approaches learn 3D poses directly from monocular RGB images. Early investigations based on convolutional neural networks (CNNs) use a multi-task framework that jointly trains pose regression and body part detectors. Several subsequent approaches consider volumetric prediction and monocular models based on semantic representations. Two-stage approaches, on the other hand, first estimate 2D poses and then lift them to 3D poses. These approaches are motivated by the finding that 2D pose information has a significant influence on 3D pose estimation. Among them, simple yet effective residual networks [3] that estimate 3D poses directly from estimated 2D poses show state-of-the-art performance despite their simple architecture.
In this paper, we propose a top-down two-stage 3D pose estimation framework for the ‘3D Human Pose Estimation within the ECCV 2018 PoseTrack Challenge.’ Fig. 1 shows the overall flow of the proposed framework. Our two-stage method achieves a mean per joint position error (MPJPE) of 42.39 on the validation set of the 3D human pose estimation challenge.
2 Proposed method
2.1 2D pose estimation
We estimate 2D poses with a top-down pipeline. Since the challenge dataset does not include 2D pose labels, a subset of the Human3.6M dataset [4, 5] is used to train the 2D pose estimator. Given monocular images, we first perform human detection using the Single Shot MultiBox Detector (SSD) [6]. We then estimate 2D poses using a cascaded pyramid network (CPN) [7], which consists of GlobalNet and RefineNet.
GlobalNet, which is based on feature pyramid networks, first localizes the keypoints in the detected bounding box. Its U-shaped structure with intermediate supervision helps to maintain both spatial resolution and semantic information. To estimate occluded or invisible keypoints precisely, we apply RefineNet, trained with an online hard keypoint mining loss. RefineNet transmits information across different levels and then integrates it. Both GlobalNet and RefineNet generate one probability heatmap per joint, i.e., 17 heatmaps for Human3.6M. Finally, for each joint we take the location of the maximum value in its heatmap as the estimated position.
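The final decoding step can be illustrated with a minimal NumPy sketch. It assumes the heatmaps are stored as an array of shape (joints, height, width); the function name `decode_heatmaps` and the 64x48 map size are hypothetical choices for illustration, not details given in this paper.

```python
import numpy as np


def decode_heatmaps(heatmaps):
    """Return peak locations (x, y) and peak scores for heatmaps of shape (J, H, W)."""
    num_joints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_joints, -1)
    # Index of the maximum value in each flattened heatmap.
    flat_idx = flat.argmax(axis=1)
    # Convert flat indices back to (row, column) positions.
    ys, xs = np.unravel_index(flat_idx, (height, width))
    scores = flat[np.arange(num_joints), flat_idx]
    keypoints = np.stack([xs, ys], axis=1)  # (J, 2), in heatmap pixel units
    return keypoints, scores


# Toy input: 17 joints; the 64x48 map size is an assumption for illustration.
heatmaps = np.zeros((17, 64, 48), dtype=np.float32)
heatmaps[0, 10, 20] = 0.9  # joint 0 peaks at row 10, column 20
keypoints, scores = decode_heatmaps(heatmaps)
```

The returned coordinates are in heatmap pixels, so they would still need to be scaled to the network input resolution before any further processing.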
Once the 2D detector is trained, we use it to obtain 2D keypoints for the challenge dataset. As shown in Fig. 1, the subject is tightly cropped in the Human3.6M images, whereas images in the challenge dataset contain a significant amount of background. To account for the difference in relative subject size between the two datasets, we add a crop and resize (CR) module before the CPN and its inverse after the CPN at inference time. Concretely, the CR module generates a square whose side is the longer of the width and height of the detected bounding box. We add a small margin to this side to prevent the subject from being cropped too tightly. The cropped area is then resized to 224x224 and fed into the CPN. The output of the CPN is mapped back to the original scale by applying the CR steps in reverse order. We define this as inverse crop and resize (ICR).
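The coordinate mapping behind CR and ICR can be sketched as follows. This is an illustrative sketch only: the 10% margin, the choice to center the square on the bounding box, and all function names are assumptions, since the text specifies only "a little margin" and the 224x224 target size. Image cropping itself is omitted; only point coordinates are transformed.

```python
def square_crop_box(bbox, margin=0.1):
    """Build a square crop (left, top, side) from bbox = (x_min, y_min, x_max, y_max).

    The margin value and the centering on the box are assumptions.
    """
    x_min, y_min, x_max, y_max = bbox
    # Side of the square: the longer bounding-box side, enlarged by the margin.
    side = max(x_max - x_min, y_max - y_min) * (1.0 + margin)
    center_x = (x_min + x_max) / 2.0
    center_y = (y_min + y_max) / 2.0
    return center_x - side / 2.0, center_y - side / 2.0, side


def crop_resize_point(point, crop, out_size=224):
    """CR: map a point from image coordinates into the resized crop."""
    left, top, side = crop
    scale = out_size / side
    return (point[0] - left) * scale, (point[1] - top) * scale


def inverse_crop_resize_point(point, crop, out_size=224):
    """ICR: map a point from the resized crop back to image coordinates."""
    left, top, side = crop
    scale = side / out_size
    return point[0] * scale + left, point[1] * scale + top


crop = square_crop_box((100, 50, 200, 250))
in_crop = crop_resize_point((150.0, 150.0), crop)
restored = inverse_crop_resize_point(in_crop, crop)
```

Because ICR undoes the scaling and then the translation, applying it to a CR-transformed point recovers the original image coordinates up to floating-point error.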
2.2 2D to 3D pose estimation
Given a 2D pose from the input image, we aim to learn a mapping function

$$f: X \rightarrow Y,$$

where $X \in \mathbb{R}^{N \times 2J}$ is a batch of 2D poses, $Y \in \mathbb{R}^{N \times 3J}$ is the corresponding batch of 3D poses, $J$ is the number of joints, and $N$ is the number of samples in the batch. Following [3], we focus on deep neural networks based on residual blocks with batch normalization. As a preprocessing step, we apply standard normalization to the 2D inputs and 3D outputs by subtracting the mean and dividing by the standard deviation. We also zero-center both 2D and 3D poses around the hip joint. To stabilize training, we apply a max-norm constraint on the weights of each layer, which is effective when coupled with batch normalization.
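The preprocessing and the weight constraint can be sketched in NumPy as below. The hip joint index, the small `eps` added to the standard deviation, the max-norm threshold of 1.0, and the choice to constrain each row of a weight matrix are all assumptions made for illustration; the text does not specify them.

```python
import numpy as np

HIP = 0  # assumed index of the hip (root) joint


def center_on_hip(poses, hip_index=HIP):
    """Subtract the hip joint from every joint; poses has shape (N, J, D)."""
    return poses - poses[:, hip_index:hip_index + 1, :]


def fit_normalizer(poses, eps=1e-8):
    """Per-coordinate mean and standard deviation over the training poses."""
    flat = poses.reshape(len(poses), -1)
    # eps avoids division by zero for the hip, which is constant after centering.
    return flat.mean(axis=0), flat.std(axis=0) + eps


def normalize(poses, mean, std):
    """Flatten poses to (N, J * D) and standardize each coordinate."""
    return (poses.reshape(len(poses), -1) - mean) / std


def apply_max_norm(weights, max_norm=1.0):
    """Rescale each row so that its L2 norm does not exceed max_norm."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return weights * scale


rng = np.random.default_rng(0)
poses_2d = rng.normal(size=(8, 17, 2))      # toy batch of 2D poses
centered = center_on_hip(poses_2d)
mean, std = fit_normalizer(centered)
inputs = normalize(centered, mean, std)     # shape (8, 34)
layer_weights = apply_max_norm(rng.normal(size=(4, 34)) * 5.0)
```

In a training loop, a constraint like `apply_max_norm` would typically be applied to each layer's weights after every optimizer step, and the stored mean and standard deviation would be reused to un-normalize predicted 3D poses.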
3.1 Experimental setup
We evaluate the proposed method on the ‘3D Human Pose Estimation within the ECCV 2018 PoseTrack Challenge’. This challenge dataset consists of a training set (35,832), a validation set (19,312) and a test set (24,416).
3.2 Quantitative results
We performed an ablation study to better understand the impact of each module in the proposed framework. Table 1 shows the mean per joint position error (MPJPE) on the validation set. The change in performance with respect to network capacity is shown in Fig. 2. The experimental results confirm that a single residual block with 1024 dimensions is sufficient.
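For reference, MPJPE is the Euclidean distance between predicted and ground-truth joint positions, averaged over all joints and samples. The sketch below assumes both arrays have shape (samples, joints, 3) and are already expressed in the same coordinate frame and units; any alignment the challenge applies before scoring is not covered here.

```python
import numpy as np


def mpjpe(pred, target):
    """Mean per joint position error for arrays of shape (N, J, 3)."""
    # Per-joint Euclidean distance, then the mean over joints and samples.
    return float(np.linalg.norm(pred - target, axis=-1).mean())


target = np.zeros((2, 17, 3))
pred = target + np.array([3.0, 4.0, 0.0])  # every joint is off by a 3-4-5 offset
error = mpjpe(pred, target)
```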
3.3 Qualitative results
We show some qualitative results on the challenge set in Fig. 3. As can be seen in the figure, in most cases the proposed method shows promising results in both 2D and 3D pose estimation. However, our top-down approach depends on the detector, so detection errors affect the subsequent 2D and 3D pose estimation, as can also be seen in the figure. In future work, we plan to change the proposed method to a bottom-up approach.
We proposed a 3D pose estimator based on a two-stage strategy composed of cascaded pyramid networks for 2D pose estimation and residual blocks for 2D-to-3D estimation. GlobalNet and RefineNet in the cascaded pyramid network were used to find occluded or invisible 2D joints, while the residual-block-based estimator was used to lift 2D joints to 3D joints effectively. Our method achieves promising results, with an MPJPE of 42.39 on the validation set of the 3D human pose estimation challenge.
- [1] Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3D pose and shape estimation of multiple people in natural scenes–the importance of multiple scene constraints. CVPR (2018)
- [2] Park, S., Kwak, N.: 3D human pose estimation with relational networks. BMVC (2018)
- [3] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. ICCV (2017)
- [4] Ionescu, C., Li, F., Sminchisescu, C.: Latent structured models for human pose estimation. ICCV (2011)
- [5] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI (2014)
- [6] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. ECCV (2016)
- [7] Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. CVPR (2018)