Multi-person pose estimation aims at detecting and localizing keypoints for all persons in an image, which is useful for many applications like human action recognition and in-vehicle video recognition. It is a challenging task since it requires accurate localization of the keypoints of an unknown number of persons in situations where there may be a variety of lighting, clothing, human poses and occlusion due to crowds. In addition, algorithms are required to be both accurate and computationally efficient for practical applications.
Recently, along with the significant progress of convolutional neural networks (CNNs) [13, 24, 26], the performance of pose estimation algorithms has also greatly improved. Cao et al. proposed a pose estimation algorithm that uses a CNN to estimate confidence maps of human keypoints and Part Affinity Fields (PAFs), which represent the existence of 'limbs' (in this paper, meaning keypoint pairs). Final pose estimation results are obtained by a parsing process that picks keypoint candidates from the confidence maps and associates them using the PAFs. This process requires little computational cost even if the number of people in an image increases. While conventional approaches associate keypoints mainly based on geometric relationships such as distance and angle [19, 9], Cao's approach defines PAFs and proposes a way to learn them directly with a CNN, which enables the confidence for keypoint association to be computed while considering the context of the image.
While many algorithms for accurate multi-person pose estimation have been proposed, little research has explored the training datasets for this task. Human-annotated labels are sometimes inappropriate for training models, which is a fundamental problem in many vision tasks. Although some works on learning with noisy labels exist [21, 10, 27], they mainly focus on classification. To the best of our knowledge, learning with noisy labels has not been discussed for the pose estimation task. For instance, Cao's approach has issues regarding PAFs. Since a PAF is defined over a connection between a pair of keypoints, if one keypoint of the pair lies outside the image as in Figure 1a, the PAF for that limb cannot be generated (Figure 1b). Predictions of limbs without labels are penalized as false positives, which decreases the performance of the model.
In this paper, we point out the existence of inappropriate labels generated from human annotations, mainly targeting Cao's pose estimation algorithm, and propose a novel method to correct such incomplete labels. A trained model can generate complemental labels that do not exist in the human annotations, as in Figure 1c, because the model has generalized to the features of keypoints and limbs. We therefore propose to correct incomplete labels with a teacher model that can predict such complemental labels. Specifically, we correct the labels by taking an element-wise max of each label and the corresponding output of the teacher model, supplementing the missing parts. Since the proposed method uses the soft output of the teacher model to correct labels, it also speeds up training, similar to what is observed in training with knowledge distillation. In our experiments on the COCO dataset, we show that our method improves the performance of the model trained with the corrected labels.
Our contribution can be summarized as follows:
We point out multiple types of inappropriate labels generated from human annotations using Cao’s pose estimation algorithm  as an example.
We propose a novel method for correcting such incomplete labels using the output of the teacher model. To the best of our knowledge, this is the first attempt to use a trained model to correct noisy labels in a pose estimation task.
In experiments, we compare the performance of the models trained with knowledge distillation  and proposed label correction to show the superiority of our approach.
(a) Keypoint protrusions, (b) Keypoint occlusion, (c) Missed keypoints, (d) Missed masks
2 Related Work
2.1 Multi-Person Pose Estimation
Human pose estimation is the task of estimating the keypoint coordinates of persons in an image, which is very challenging due to the wide variation in human appearance present in images [25, 11]. Due to its high applicability, the research topic has attracted significant attention in recent years. Pose estimation is categorized into single person pose estimation and multi-person pose estimation. While single person pose estimation makes the assumption that there is a single person in an image and locates his/her keypoints, multi-person pose estimation requires the detection and localization of all the keypoints of an unspecified number of people individually. The latter becomes highly challenging especially in crowded scenes. Both single-person and multi-person pose estimation algorithms depend on CNNs for their performance. In this paper, we focus on multi-person pose estimation, which is closer to real world situations. Multi-person pose estimation methods are roughly divided into top-down approaches and bottom-up approaches, and the difference between them will be described below.
Top-down approaches. Top-down approaches first detect bounding boxes of people in an image and then locate all the keypoints of each person. These approaches estimate each person's keypoints separately, so the computational cost increases in proportion to the number of people in the image. Many algorithms of this type first predict bounding boxes with a human detector and then apply a single-person pose estimator to each person, so they rely heavily on the human detector for their performance. Many CNN-based single-person pose estimation algorithms predict heatmaps representing the existence of a person's keypoints. The Convolutional Pose Machine enforces intermediate supervision on the outputs of each stage of the network to address the problem of vanishing gradients. The stacked hourglass network is based on successive steps of pooling and upsampling that enlarge the receptive field. The Cascaded Pyramid Network adopts an architecture like the Feature Pyramid Network to produce feature maps of different resolutions, which are integrated to achieve multi-scale detection. Papandreou et al. predict both heatmaps and offsets to the ground-truth keypoint locations, and combine them to obtain the final keypoint predictions. Mask R-CNN extends Faster R-CNN to multi-task detection - bounding box regression and human keypoint estimation - in an end-to-end fashion.
Bottom-up approaches. Bottom-up approaches first detect all the keypoints of multiple persons in an image without distinguishing individuals, and then assemble the keypoints of each person to obtain the final pose estimation results. DeepCut interprets the problem of distinguishing different persons as an Integer Linear Program (ILP) and partitions part detection candidates into person clusters. The final pose estimation results are obtained when the person clusters are combined with labeled body parts. DeeperCut improves DeepCut by using ResNet and employing image-conditioned pairwise terms. The above two methods need to solve an ILP to partition part candidates, which has a high computational cost, so they are not suitable for realtime applications. Newell et al. simultaneously produce score maps and pixel-wise embeddings that group the candidate keypoints into different people to obtain final pose estimation results. The method proposed by Cao et al. offers a better trade-off between performance and speed, since the increase in computational cost is small even if the number of people in an image increases. Due to its high applicability, we use this algorithm as our basic pose estimation approach and briefly describe it in the next section.
2.2 Pose Estimation with Part Affinity Fields
This pose estimation method takes an image of size $w \times h$ as input to a CNN that predicts confidence maps $S$ of body part locations and Part Affinity Fields (PAFs) $L$, which represent the existence of limbs. The set $S = (S_1, S_2, \ldots, S_J)$ has $J$ confidence maps, one per part, where $S_j \in \mathbb{R}^{w' \times h'}$, $j \in \{1, \ldots, J\}$. The ground-truth label $S^*_j$ for each part $j$ is generated by Gaussian distributions whose mean positions are the annotated keypoint locations of all the persons in the image, regardless of human identity. The PAFs $L = (L_1, L_2, \ldots, L_C)$ have $C$ vector fields, one per limb, where $L_c \in \mathbb{R}^{w' \times h' \times 2}$, $c \in \{1, \ldots, C\}$. The ground-truth label $L^*_c$ for each limb $c$ is a unit vector that points from one keypoint of the corresponding pair to the other inside a rectangular area defined between them, and the zero vector at all other points. Given model predictions $S_j, L_c$ and ground-truth labels $S^*_j, L^*_c$, the model is trained using the mean squared error defined as

$$f = \sum_{j=1}^{J} \sum_{\mathbf{p}} W(\mathbf{p}) \, \lVert S_j(\mathbf{p}) - S^*_j(\mathbf{p}) \rVert_2^2 + \sum_{c=1}^{C} \sum_{\mathbf{p}} W(\mathbf{p}) \, \lVert L_c(\mathbf{p}) - L^*_c(\mathbf{p}) \rVert_2^2, \tag{1}$$

where $\mathbf{p}$ is a 2D coordinate in the image and $W$ is a binary mask with $W(\mathbf{p}) = 0$ for regions that should be ignored, such as crowds.
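The label generation described above can be sketched in numpy as follows. This is a minimal sketch, not the paper's implementation: the function names and the `sigma` and `limb_width` values are illustrative assumptions.

```python
import numpy as np

def confidence_map(keypoints, shape, sigma=7.0):
    """Gaussian confidence map for one part over all annotated persons.

    keypoints: list of (x, y) annotated locations of this part.
    Overlapping persons are combined with a pixel-wise max so each
    person's peak is preserved.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cmap = np.zeros(shape, dtype=np.float32)
    for (x, y) in keypoints:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        cmap = np.maximum(cmap, g)
    return cmap

def paf_field(p1, p2, shape, limb_width=5.0):
    """PAF for one limb: a unit vector pointing from p1 to p2 inside a
    rectangle around the segment, and the zero vector elsewhere."""
    h, w = shape
    paf = np.zeros((2, h, w), dtype=np.float32)
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return paf
    u = v / norm                                # unit vector along the limb
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - p1[0], ys - p1[1]             # displacement from p1
    along = u[0] * dx + u[1] * dy               # projection along the limb
    across = np.abs(u[1] * dx - u[0] * dy)      # distance from the limb axis
    mask = (along >= 0) & (along <= norm) & (across <= limb_width)
    paf[0][mask], paf[1][mask] = u[0], u[1]
    return paf
```

Note how a keypoint pair with one missing annotation produces no call to `paf_field` at all, which is exactly the failure mode discussed in the introduction.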
At the parsing step, a set of bipartite matchings is performed to associate body part candidates. First, keypoint candidates are detected at the local maximum positions of the confidence maps. Then, limb confidence scores are obtained by computing line integrals of the PAFs between all corresponding keypoint pairs. Final pose estimation results are obtained by greedily associating keypoint pairs using these scores.
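The limb scoring step can be sketched as a sampled line integral; this is an illustrative sketch (the function name and `n_samples` are our own), not the paper's code.

```python
import numpy as np

def limb_score(paf, p1, p2, n_samples=10):
    """Approximate the line integral of a PAF along a candidate limb:
    the mean dot product between the field and the limb's unit direction,
    sampled at n_samples points on the segment between two keypoint
    candidates. paf: (2, H, W) field; p1, p2: (x, y) candidates."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return 0.0
    u = v / norm
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = np.round(p1 + t * v).astype(int)   # nearest pixel on the segment
        score += paf[0, y, x] * u[0] + paf[1, y, x] * u[1]
    return score / n_samples
```

A pair whose connecting segment runs along a well-predicted PAF scores near 1, while a pair crossing empty or misaligned field scores near 0, which is what makes greedy association by score work.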
2.3 Knowledge Distillation
Hinton et al. proposed a model compression method called knowledge distillation. They use the final output of a trained teacher network as a soft target for training a small student network, transferring the information acquired by the teacher. Since soft targets contain information obtained by the teacher network that is not present in hard targets generated from human annotations, they function as strong regularizers for training. Romero et al. and Yim et al. extend knowledge distillation by using intermediate outputs of a teacher network as training signals to improve the performance of the student network [23, 29]. Gao et al. distilled the knowledge of multiple teacher networks into a single student network to enable it to classify 100K object categories. Although knowledge distillation was first proposed as a model compression method aimed at improving the performance of a small student network, Furlanello et al. showed that student networks parameterized identically to their teacher networks can outperform them. For the tasks of object detection and human pose estimation, Radosavovic et al. proposed data distillation, which ensembles the predictions of a single model over multiple transformations of unlabeled data to increase the training data. Data distillation differs from other distillation methods in that it aims at generating annotations for unlabeled data, which can then be used for training as hard targets.
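Hinton-style distillation for classification can be sketched as follows; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values from any of the cited works.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Cross-entropy against hard labels mixed with cross-entropy against
    the teacher's temperature-softened distribution (the soft target)."""
    n = len(labels)
    hard = -np.log(softmax(student_logits)[np.arange(n), labels] + 1e-12).mean()
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    soft = -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean()
    return alpha * hard + (1 - alpha) * soft
```

The soft term is what carries the teacher's "dark knowledge" about class similarities that the hard one-hot labels cannot express.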
In this paper, as opposed to data distillation, we aim to improve the existing labels in the training data of a pose estimation task, and propose an approach that distills the knowledge of a teacher network to correct the labels. Similar to Furlanello's approach, we train a student network with an architecture identical to the teacher's, using the corrected labels to improve its performance.
3 Proposed Method
In this section, we first indicate the existence of inappropriate labels in the pose estimation method proposed by Cao et al. . Then, we describe the detailed method of our approach to improve those labels.
3.1 Problems with Labels
Datasets used for pose estimation such as the COCO dataset  and the MPII dataset  have annotations of keypoint locations for each person in the images. The multi-person pose estimation method proposed by Cao et al. mentioned in the previous section generates labels based on these annotations. In this section, we indicate that there are inappropriate labels for training in their approach.
Figure 2 shows some examples of inappropriate labels generated using the COCO dataset. PAFs are generated from pairs of keypoint annotations. Keypoints existing outside the image (Figure 2a) or under severe occlusion (Figure 2b) have no annotations. In those cases, the labels for the corresponding limbs cannot be generated, even though the limbs are inside the images. The datasets also contain images with missing annotations, as in Figure 2c. Furthermore, the datasets have masks for ignoring regions where it is hard to annotate keypoints, such as crowded areas. We train a model ignoring the masked regions (Equation 1), but these masks are sometimes missing. The predictions of a model in regions without appropriate labels are penalized as false positives, even if the keypoints and limbs are correctly predicted. This lowers the performance of the trained model.
The failure patterns of the labels can be summarized as follows:
There are no annotations for keypoints outside images.
There are no annotations for keypoints with severe occlusion.
Annotations for visible keypoints are missing.
Mask annotations for ignoring regions are missing.
3.2 Label Correction
As shown in Figure 1c, there are cases where the output of a trained model is more appropriate than the labels generated from annotations, even though the model was trained on data including inappropriate labels. In this example, the model estimates PAFs that cannot be generated from the annotations. In addition, the PAFs predicted by the model run smoothly along the limbs, although each ground-truth PAF is generated in a rectangular area defined between the pair of corresponding keypoints. In this way, the trained model generalizes to the features of keypoints and limbs and adequately estimates their existence. Based on these observations, we propose to improve incomplete labels by correcting them with the output of a trained teacher model.
Let $\hat{S}_j, \hat{L}_c$ be the predictions of a teacher model trained using Equation 1 with the ground-truth labels $S^*_j, L^*_c$, which include incomplete labels. We correct the labels by applying the following element-wise max operations at each pixel:

$$S'_j(\mathbf{p}) = \max\left(S^*_j(\mathbf{p}), \hat{S}_j(\mathbf{p})\right), \tag{2}$$
$$L'_c(\mathbf{p}) = \max\left(L^*_c(\mathbf{p}), \hat{L}_c(\mathbf{p})\right). \tag{3}$$
Figure 3 shows some examples of corrected PAFs. The labels for missing limbs are supplemented and have become better than the originals. By adopting the max operation for correcting labels, existing labels are left unchanged while missing labels are supplemented by the output of the teacher. In addition, even if the teacher cannot predict the existence of some keypoints or limbs, this has no negative effect on the final labels.
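The correction itself is a one-liner in numpy; this sketch (the function name is ours) makes explicit why existing labels survive the operation.

```python
import numpy as np

def correct_labels(gt, teacher_pred):
    """Correct incomplete label maps by a pixel-wise max with the
    teacher's soft output: existing labels can only stay or grow, and
    missing labels (zeros) inherit the teacher's prediction instead of
    later penalizing the student's correct predictions as false
    positives. The same element-wise max is applied to PAF maps."""
    return np.maximum(gt, teacher_pred)
```

For example, a confidence map pixel with ground truth 0.9 stays 0.9 no matter what the teacher outputs, while a pixel missing its label (0.0) takes whatever confidence the teacher assigns.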
When we train a new model using the corrected labels $S'_j, L'_c$, the mean squared error

$$f' = \sum_{j=1}^{J} \sum_{\mathbf{p}} W(\mathbf{p}) \, \lVert S_j(\mathbf{p}) - S'_j(\mathbf{p}) \rVert_2^2 + \sum_{c=1}^{C} \sum_{\mathbf{p}} W(\mathbf{p}) \, \lVert L_c(\mathbf{p}) - L'_c(\mathbf{p}) \rVert_2^2 \tag{4}$$

is used as the loss function, where $S_j, L_c$ are the outputs of the student model during training. We use a student model with an architecture identical to the teacher model, differently from many existing works on knowledge distillation.
Label correction with knowledge distillation. The proposed label correction can be used together with knowledge distillation. Knowledge distillation was first proposed for classification tasks, but it is applicable to the pose estimation task in the same way, using the weighted average of the losses whose targets are the ground-truth labels and the predictions of the teacher model,

$$f_{KD} = \lambda f(S^*, L^*) + (1 - \lambda) f(\hat{S}, \hat{L}), \tag{5}$$

where $\lambda$ is a parameter representing the mixing ratio of the loss functions and $f(\cdot, \cdot)$ denotes the masked mean squared error of Equation 1 computed against the given targets. When using label correction and knowledge distillation at the same time, we use the loss function that replaces the former term of Equation 5 with the corrected labels,

$$f_{KD+LC} = \lambda f(S', L') + (1 - \lambda) f(\hat{S}, \hat{L}). \tag{6}$$
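The family of objectives compared in this paper can be sketched in a single function; the names, the default `lam`, and the single-map signature (per-map rather than summed over parts and limbs) are illustrative simplifications.

```python
import numpy as np

def mse_loss(pred, target, mask):
    """Masked sum-of-squares error over one label map (Equation 1 style)."""
    return float(np.sum(mask * (pred - target) ** 2))

def kd_lc_loss(student, gt, teacher, mask, lam=0.5, correct=True):
    """Weighted sum of a loss against (optionally corrected) labels and a
    loss against the teacher's soft output. With correct=False this is
    plain knowledge distillation; with correct=True and lam=1.0 it is
    pure label correction; with correct=False and lam=1.0 it reduces to
    baseline training on unchanged labels."""
    target = np.maximum(gt, teacher) if correct else gt   # label correction
    return (lam * mse_loss(student, target, mask)
            + (1 - lam) * mse_loss(student, teacher, mask))
```

Seen this way, baseline training, KD, LC, and KD + LC are all points in one two-parameter family, which is how the experiments below compare them.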
In our experiments, we evaluate knowledge distillation, as well as label correction combined with knowledge distillation, as comparison methods for label correction.
Iterative learning of label correction. Since the proposed label correction yields labels appropriate for training, it might be possible to further improve the performance of a model by making the student model the next teacher model and repeating the learning. We evaluate the effectiveness of this iterative learning in our experiments.
(a) Normal training, (b) Knowledge Distillation, (c) Label Correction, (d) Data Distillation
Relationship to other methods. We now explain the relationship between the proposed label correction and related methods. Figures 4a-4d show the training schemata of normal supervised learning with unchanged labels (Equation 1), knowledge distillation (Equation 5), label correction (Equation 4), and data distillation, respectively. Whereas knowledge distillation trains a model with a weighted average of the losses computed from annotation-generated labels and from the teacher's outputs, label correction revises the labels directly with the teacher's outputs. Although data distillation may seem equivalent to knowledge distillation with the mixing ratio set so that only the teacher's outputs are used, it differs in that it aims to generate labels for unlabeled data to enlarge the training set, producing hard keypoint annotations rather than making direct use of the soft outputs of a teacher model.
4 Experiments

This section describes experimental results that verify the effectiveness of our approach. After explaining the dataset used in the experiments and the implementation details, we report quantitative and qualitative results.
4.1 Dataset and Evaluation Metric
We use the COCO 2017 dataset for evaluation. This dataset consists of real-world images with a variety of lighting, clothing, and occlusion, making it highly challenging. We train models on the training set and report evaluation results on the validation set. As evaluation metrics, we use the Average Precision (AP), AP^50, AP^75, AP^M, and AP^L used in the COCO keypoint challenge. The primary challenge metric AP is the mean AP over 10 Object Keypoint Similarity (OKS) thresholds, where OKS is calculated from the scale of the person and the distances between predicted keypoints and ground-truth keypoints. AP^50 and AP^75 are the AP at OKS thresholds of 0.50 and 0.75, respectively. AP^M and AP^L are the AP for medium and large persons, respectively.
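The OKS underlying these metrics can be sketched directly from its published definition; this is an illustrative implementation (function name and argument layout are ours).

```python
import numpy as np

def oks(pred, gt, visible, area, kappa):
    """Object Keypoint Similarity between one person's predicted and
    ground-truth keypoints. pred, gt: (K, 2) arrays of (x, y); visible:
    (K,) bool flags for labeled keypoints; area: the person's scale
    (object segment area, s^2 in the COCO definition); kappa: (K,)
    per-keypoint falloff constants defined by the COCO evaluation."""
    d2 = np.sum((pred - gt) ** 2, axis=1)              # squared distances
    e = d2 / (2.0 * area * kappa ** 2 + np.finfo(float).eps)
    return float(np.sum(np.exp(-e) * visible) / max(np.sum(visible), 1))
```

A perfect prediction scores 1.0 per visible keypoint; AP then thresholds this similarity at 0.50, 0.55, ..., 0.95 and averages the resulting precisions.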
4.2 Data Augmentation
Data augmentation is critical for learning scale and rotation invariance. After applying a random flip, random rotation, and random scaling, we crop the image to the model input size at a random position.
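A minimal sketch of such a pipeline (rotation omitted for brevity; the scale range and default crop size are illustrative assumptions, not the paper's values) might look like this. Keypoints are transformed jointly with the image so the label maps stay consistent.

```python
import numpy as np

def augment(image, keypoints, out_size=368, rng=None):
    """Random horizontal flip, random scale (nearest-neighbour resize),
    and random crop to out_size x out_size. keypoints: (K, 2) array of
    (x, y) coordinates, transformed with the image."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    kps = keypoints.astype(float).copy()
    if rng.random() < 0.5:                       # horizontal flip
        image = image[:, ::-1]
        kps[:, 0] = w - 1 - kps[:, 0]
    s = rng.uniform(0.7, 1.3)                    # random scale (assumed range)
    new_h, new_w = int(h * s), int(w * s)
    yi = (np.arange(new_h) / s).astype(int).clip(0, h - 1)
    xi = (np.arange(new_w) / s).astype(int).clip(0, w - 1)
    image = image[yi][:, xi]                     # nearest-neighbour resize
    kps *= s
    pad_h, pad_w = max(out_size - new_h, 0), max(out_size - new_w, 0)
    image = np.pad(image, ((0, pad_h), (0, pad_w)) + ((0, 0),) * (image.ndim - 2))
    y0 = rng.integers(0, image.shape[0] - out_size + 1)  # random crop origin
    x0 = rng.integers(0, image.shape[1] - out_size + 1)
    kps -= [x0, y0]
    return image[y0:y0 + out_size, x0:x0 + out_size], kps
```

In practice a library resize (with interpolation) and a bounded random rotation would replace the index-sampling resize used here to keep the sketch dependency-free.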
4.3 Implementation Details
We report evaluation results for two models: CMU-Pose (52.3M parameters), proposed by Cao et al., and small NN (5.2M parameters), a relatively small model consisting of a stack of 19 convolution layers. For label correction and knowledge distillation, CMU-Pose was used as the teacher network in all experiments. All trained models, whether teacher or student, differ only in the labels used for training and the accompanying loss functions; all other training settings were the same.
For optimization we use Adam. All the models were trained for a fixed number of update iterations at an initial learning rate, followed by additional iterations at a reduced learning rate. In all experiments, the first 10 layers of VGG-19 in CMU-Pose were initialized with the weights of the ImageNet pretrained model.
(a) CMU-Pose (excerpt)

| Method | AP | AP^50 | AP^75 | AP^M | AP^L |
| KD + LC | 56.8 | 81.9 | 59.8 | 53.7 | 63.0 |
| KD + LC (iteration 2) | 56.3 | 82.0 | 57.7 | 52.8 | 63.2 |

(b) small NN (excerpt)

| Method | AP | AP^50 | AP^75 | AP^M | AP^L |
| KD + LC | 45.8 | 77.4 | 45.2 | 43.8 | 51.2 |
4.4 Main Results
In this section, we report evaluation results for CMU-Pose and small NN trained with different labels and loss functions. We compare the performance of a model trained with unchanged labels using Equation 1 (Baseline), knowledge distillation with Equation 5 (KD), label correction with Equation 4 (LC), simultaneous label correction and knowledge distillation with Equation 6 (KD + LC), and iterative learning that uses the trained student model as the next teacher model. Note that when training with label correction or knowledge distillation, the baseline model is used as the teacher model.
Results. Table 1a shows the results for CMU-Pose. When models are trained with knowledge distillation, the AP increases by 1.0 point over the baseline model for one setting of the mixing ratio, but the performance drops for another. This shows the effectiveness of using both the loss term from the unchanged labels and that from the output of the teacher model, and is consistent with the effectiveness of combining hard and soft targets in knowledge distillation for classification. It is also consistent with the finding that, in classification, a student model with an architecture identical to its teacher can exceed the teacher's performance through knowledge distillation.
The model trained with label correction exceeds the baseline model by 2.3 AP points, the largest performance gain in our experiments. The proposed label correction appears to adequately improve the quality of the training data, as intended. In training with knowledge distillation, the former loss term of Equation 5, calculated on data that includes inappropriate labels, still seems to affect training negatively. In contrast, training with corrected labels modifies such labels directly, mitigating this negative effect; we believe this is why the model trained with label correction exceeds the performance of the model trained with knowledge distillation.
(a) Loss curve, (b) AP curve
Training simultaneously with label correction and knowledge distillation results in almost the same performance as label correction alone. This seems to be because the corrected labels partly include the soft output of the teacher model, which makes label correction behave similarly to knowledge distillation. No performance gain was observed from iterative learning, which implies that a single round of label correction is sufficient to obtain proper labels.
Table 1b shows the results for small NN. As with CMU-Pose, the performance gain from label correction is larger than that from knowledge distillation, but in this case performance is best when label correction and knowledge distillation are used simultaneously. Just as knowledge distillation is effective for model compression, directly using the soft output of a teacher model as a training signal may be effective for learning a small model with limited expressive ability.
Figure 5 shows the learning curves of each model. When training with the corrected labels, the loss curve lies between those of the baseline and knowledge distillation. Moreover, label correction not only improves the performance of the trained model but also speeds up its convergence. These facts imply that label correction improves the model by correcting inappropriate labels while retaining the properties observed in knowledge distillation.
| Method | AP | AP^50 | AP^75 | AP^M | AP^L |
| LC (confidence maps only) | 55.5 | 80.6 | 56.9 | 52.4 | 62.4 |
4.5 Ablation Study
To confirm that the performance gain comes from correcting both the confidence maps and the PAFs, we train models with only one of the two corrected. Table 2 shows the evaluation results of models trained with unchanged labels, with only the confidence maps or only the PAFs corrected, and with both corrected. When only the confidence maps or only the PAFs are corrected, the performance is not as good as when both are corrected. This shows that both types of corrected labels contribute to improving the performance of the model.
4.6 Qualitative Results
Figure 6 shows some qualitative results of the proposed label correction. It can be seen that the labels of data with keypoint protrusions, occlusion, and missed annotations are appropriately corrected. Common failure cases include false estimations by the teacher model on objects resembling people, and misalignment between the labels and the teacher output.
5 Conclusion

In this paper, we first pointed out the existence of several patterns of inappropriate labels in a pose estimation method, in particular those used for part confidence maps and part affinity fields. We showed that such labels result from the label-map generation method based on keypoint annotations and from missing annotations in the datasets. To improve such labels, we then proposed a novel correction method that uses a teacher model trained on data including these inappropriate labels. In experiments on the COCO dataset, we showed that training with labels corrected by the proposed method improves the performance of the trained model. The model also converges faster, as is observed in training with knowledge distillation. Label correction using a teacher model could also be applied to other pose estimation algorithms and to other vision tasks such as object detection and semantic segmentation.
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
-  Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. arXiv preprint arXiv:1711.07319, 2017.
-  T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born again neural networks. In NIPS Workshop on Meta Learning, 2017.
-  J. Gao, Z. Li, R. Nevatia, et al. Knowledge concentration: Learning 100k object classifiers in a single cnn. arXiv preprint arXiv:1711.07607, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
-  I. Jindal, M. Nokleby, and X. Chen. Learning deep networks from noisy labels with dropout regularization. In Data Mining (ICDM), 2016.
-  S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, 2010.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
-  T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
-  G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
-  L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
-  I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. arXiv preprint arXiv:1712.04440, 2017.
-  S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 2015.
-  B. Sapp and B. Taskar. Modec: Multimodal decomposable models for human pose estimation. In CVPR, 2013.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, 2018.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
-  J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.