Human Pose Estimation (HPE) is the task of localizing human body keypoints (also referred to as joints) in an image. It serves as a fundamental technique for numerous applications, including action recognition, pedestrian tracking, and virtual/augmented reality. Recently, deep convolutional neural networks (DCNNs) [toshev2014deeppose, newell2016stacked, newell2017associative] have achieved substantial improvements on standard benchmark datasets. To fully exploit the power of DCNNs, a large amount of training data is indispensable for obtaining satisfactory performance in human pose estimation.
However, existing human pose estimation datasets do not uniformly represent all possible human poses in real life. We take the MS-COCO dataset [lin2014microsoft] as an example to analyze the distribution of human poses, as shown in Fig. 1. We normalize the poses and cluster them into 20 categories. We observe that the distribution is long-tailed: a few common pose categories (e.g., standing and walking) occupy a large portion of the dataset, while unusual posture types (e.g., squatting and jumping) account for a much smaller portion. We also find that although current state-of-the-art data-driven methods achieve good performance on common poses, they still suffer performance degradation on some unusual poses, since the long-tailed categories have neither enough training samples nor enough diversity.
Due to the high cost of collecting and annotating examples with rare poses, a feasible way to tackle this problem is data augmentation. Previous methods augment the human pose mainly by global image-level transformations [peng2018jointly, chen2017adversarial, newell2016stacked, xiao2018simple, wang2021human] (e.g., scaling and rotating) or local object-level transformations [bin2020adversarial, peng2018jointly, fang2019instaboost] (e.g., copy-paste and occluding). Since these methods neither increase the diversity of poses nor alleviate the long-tailed distribution, they contribute little to recognizing diverse rare poses.
In this paper, we propose a simple yet effective data augmentation approach, termed Pose Transformation (PoseTrans), to tackle the aforementioned challenges. PoseTrans consists of a Pose Transformation Module (PTM) with a pose discriminator, and a Pose Clustering Module (PCM). During training, PTM applies affine transformations to the original pose of the training sample and generates a pool of diverse new poses. The pre-trained pose discriminator evaluates the plausibility of the generated samples and filters out unnatural ones. PCM is based on the Gaussian Mixture Model (GMM), which normalizes and clusters the human poses in the dataset. The rare types of poses are represented by the Gaussian components that have small weights. PCM evaluates the components’ density for each candidate pose and selects the “rarest” one (i.e., the one with the minimal weighted sum of components’ density) as the final augmented training sample. By transforming the existing poses, PoseTrans generates diverse, plausible poses through PTM and alleviates the long-tailed distribution problem through PCM. We also design a metric that focuses on rare poses, called balanced AP/AR, and observe larger performance gains on this metric. Our method is simple to implement and can be easily integrated into the training pipeline of existing pose estimation models.
We summarize our contributions as follows:
We present a simple yet effective data augmentation method, termed PoseTrans. To tackle the problem of limited diversity of unusual human poses, we propose a novel Pose Transformation Module (PTM) with a pose discriminator to generate new training samples with diverse and plausible poses.
We propose the Pose Clustering Module (PCM) to measure pose rarity and select rare poses for data augmentation, which helps to balance the long-tailed distribution of the training set.
Extensive experiments on various pose estimation datasets show that PoseTrans consistently improves the performance of various state-of-the-art pose estimators, especially on rare poses.
2 Related Works
2.1 2D Human Pose Estimation
In recent years, 2D human pose estimation has shown remarkable performance advancement. DeepPose [toshev2014deeppose] first applied deep neural networks to human pose estimation by directly regressing the 2D coordinates of key points from the input image. Since then, deep learning-based methods started to dominate this area. Recent multi-person human pose estimation approaches can be divided into bottom-up and top-down approaches. Bottom-up approaches [Insafutdinov2016DeeperCut, cao2017realtime, papandreou2018personlab, newell2017associative, jin2020differentiable, cheng2020higherhrnet, kreiss2019pifpaf, jin2019multi] first detect all the key points of every person in images and then group them into individuals. Top-down methods [he2017mask, chen2018cascaded, xiao2018simple, sun2019deep] first detect the bounding boxes and then predict the human body key points in each box.
Recent works mainly focus on designing powerful network architectures to improve the performance of pose estimation [newell2016stacked, xiao2018simple, sun2019deep, chen2018cascaded, jin2020whole, xu2021vipnas, zeng2022not]. However, current state-of-the-art models often suffer performance drops on rare poses due to the long-tailed distribution problem in human pose data. In this work, we focus on tackling this important but overlooked problem. Building on these well-designed network structures, we propose a novel data augmentation method to generate diverse rare poses.
2.2 Data Augmentation
Data augmentation has been widely utilized to improve model generalization. For image classification, popular augmentation methods include information dropping [zhong2020random, chen2020gridmask, devries2017improved], multi-image information mixing [zhang2017mixup, yun2019cutmix], and automatic augmentation [cubuk2018autoaugment]. For human pose estimation, data augmentation mainly focuses on global image-level transformations [peng2018jointly, chen2017adversarial, newell2016stacked, xiao2018simple, wang2021human] (e.g., scaling, rotating, and flipping) and local object-level transformations [bin2020adversarial, peng2018jointly] (e.g., copy-paste and occluding). These common data augmentation schemes enhance global translational invariance and robustness to occlusion but do little to improve robustness to rare poses. Recently, some augmentation methods [fang2019instaboost, fang2021decaug] propose to jitter instances to improve the generalization of the model, but they change neither the instance itself nor the distribution of instances. Different from the existing data augmentation strategies, we propose a novel, simple, and effective PoseTrans augmentation scheme that directly generates diverse rare poses.
2.3 Long-tailed Distribution
In visual recognition, long-tailed training set distributions pose a challenging problem: a small portion of classes have massive numbers of training samples, while classes in the distribution tail have few samples [zhang2021deep]. Over-sampling [chawla2002smote] and re-weighting [elkan2001foundations] are two popular methods to tackle the problem. The over-sampling method raises the frequency of the minor classes by repeating their data samples during training. The re-weighting method assigns higher loss weights to these minor classes and thus increases their importance. However, such approaches do not increase the diversity of the data and tend to suffer from over-fitting, which leads to a performance drop. Other approaches include metric learning that enforces inter-class margins [huang2016learning] and meta-learning that learns to regress many-shot model parameters from few-shot model parameters [wang2017learning], but they are designed only for visual recognition. In human pose estimation, we encounter a similar problem. For many human pose estimation datasets [lin2014microsoft, andriluka14cvpr, li2019crowdpose], e.g., the MS-COCO dataset [lin2014microsoft], the distribution of human poses is highly biased and does not uniformly represent human poses in real life. These dataset biases lead to poor generalization and degraded detection accuracy on these “long-tailed” poses. To address this issue, we propose a simple yet effective PoseTrans approach to create the needed diverse poses.
To increase the diversity of poses and alleviate the long-tailed distribution problem, we propose Pose Transformation (PoseTrans) to generate new training samples with diverse poses, as shown in Fig. 2. PoseTrans consists of a Pose Transformation Module (PTM) with a pose discriminator and a Pose Clustering Module (PCM). Given a training sample consisting of a single human image and its keypoint annotation, PTM aims to create a new training sample by applying affine transformations to the limbs of the human; the image is described by its height and width, and the annotation by its number of keypoints. To ensure plausibility, we leverage the pose discriminator to filter out implausible samples. PoseTrans applies PTM repeatedly until a candidate pose pool of plausible generated poses is formed. PCM clusters human poses into categories and, for each generated pose, evaluates the probability of belonging to each cluster in order to select the rarest pose in the pool as the new training sample. After each training epoch, we re-fit the PCM using the original training set and all the selected augmented samples.
3.2 Pose Transformation Module (PTM) and Pose Discriminator
By clustering the human poses in the existing dataset, it can be observed that many clusters have only a few examples. The lack of training examples of rare poses further leads to a lack of diversity among rare poses, which results in the inferior performance of current data-driven methods on these types of poses. To tackle this issue, we devise the Pose Transformation Module (PTM) and a pose discriminator to create plausible new poses based on the existing training samples. The details of PTM are shown in Fig. 3.
Modeling the body part movement. The body kinematic skeleton is constructed as a pose graph, where the human body is partitioned into several parts, i.e., the head, the torso, the left/right arm, and the left/right leg. In this work, we mainly focus on the angular movement of the arms and legs. Angular movements (flexion and extension) take place at the shoulder, hip, elbow, knee, and wrist. Flexion decreases the angle between the bones (bending of the joint), while extension increases the angle and straightens the joint. These body part movements in the image plane can be modeled by applying an affine transformation to a rigid body part segment. In our implementation, the affine transformation is composed of rotation and scaling.
We define a limb as a single rigid body part connecting two naturally adjacent joints, a source joint and a destination joint, each given by its image coordinates. We define eight limbs for each instance: the lower arm, the upper arm, the lower leg, and the upper leg on both sides of the body.
Pose transformation. With human parsing results obtained through the DensePose [alp2018densepose] model, PTM first erases the original limbs in the image using an efficient inpainting method [bertalmio2001navier]. After that, each limb is transformed separately by its own affine transformation matrix. To increase diversity, each limb is transformed only with a given probability. The transformed limbs and the inpainted image are composited to form the new augmented image, and the pose annotations are transformed accordingly to obtain the new keypoint annotation.
Specifically, the angular movement of each limb is modeled by an affine transformation matrix determined by the scale and rotation of that limb and by the coordinates of its rotation center. For the lower arm, the upper arm, the lower leg, and the upper leg, the rotation centers are the elbow, the shoulder, the knee, and the hip, respectively. To ensure the diversity of augmented poses, the scale and rotation parameters are randomly sampled from a normal distribution in the neighborhood of the identity transformation. They are also restricted to a certain range in our implementation to ensure that the majority of the randomly generated poses are plausible. Note that limbs that do not appear in the image or are occluded are not transformed.
According to the kinematic skeleton hierarchy, the movement of the upper arm/leg affects that of its lower part. Consider a lower arm/leg and its corresponding upper part. Taking the combined effect into account, the total movement of the lower limb is modeled by matrix multiplication, i.e., by the product of the transformation matrices of the two limbs.
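The limb transformation described above can be sketched in a few lines. This is a minimal illustration, assuming a homogeneous 3x3 matrix that scales and rotates about the rotation center, and assuming the lower limb's own transform is applied first and the upper limb's transform second. The function names, the joint coordinates, and any sampling ranges passed to `sample_clipped` are hypothetical and chosen only for illustration.

```python
import math
import random

def limb_matrix(scale, angle_deg, center):
    """Homogeneous 3x3 matrix that scales and rotates about `center`."""
    cx, cy = center
    a = scale * math.cos(math.radians(angle_deg))
    b = scale * math.sin(math.radians(angle_deg))
    # Equivalent to p' = scale * R(angle) * (p - center) + center.
    return [[a, -b, cx - a * cx + b * cy],
            [b, a, cy - b * cx - a * cy],
            [0.0, 0.0, 1.0]]

def matmul(m, n):
    """Product of two 3x3 matrices (n is applied first, then m)."""
    return [[sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, point):
    """Apply a homogeneous 3x3 matrix to a 2D point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

def sample_clipped(mean, std, low, high, rng=random):
    """Draw from a normal distribution and clip the value to [low, high]."""
    return min(max(rng.gauss(mean, std), low), high)

# Hypothetical joints: shoulder, elbow, and wrist on a horizontal line.
shoulder, elbow, wrist = (0.0, 0.0), (10.0, 0.0), (20.0, 0.0)
upper = limb_matrix(1.0, 90.0, shoulder)  # upper arm rotates about the shoulder
lower = limb_matrix(1.0, 90.0, elbow)     # lower arm rotates about the elbow
total_lower = matmul(upper, lower)        # assumed order: lower first, then upper
new_elbow = apply(upper, elbow)
new_wrist = apply(total_lower, wrist)
```

With this ordering the elbow moves only with the upper arm, while the wrist receives both rotations, so the two limb segments stay connected at the elbow.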
Pose discriminator for the plausibility check. Generating poses purely at random may result in implausible poses that violate the biomechanical structure of the human body. Some other augmentation methods [li2020cascaded, chen2016synthesizing] rely on pre-defined rules to ensure plausibility, which however limits the diversity of generated poses. Inspired by [gong2021poseaug], we design a pose discriminator that suits our task to avoid implausible poses that have unnatural joint angles or unreasonable positions in the scene. For each augmented sample, the discriminator is trained to predict its plausibility. We adopt the LS-GAN loss [mao2017least] to train the discriminator before training the pose estimator.
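For reference, the standard least-squares GAN discriminator objective of [mao2017least], with the common target coding of 1 for real samples and 0 for generated samples, can be written as below. The symbols are assumed here for illustration: $D$ denotes the discriminator, $x$ an original training sample, and $x'$ an augmented sample.

```latex
\mathcal{L}_{D} \;=\; \tfrac{1}{2}\,\mathbb{E}_{x}\!\left[\big(D(x) - 1\big)^{2}\right] \;+\; \tfrac{1}{2}\,\mathbb{E}_{x'}\!\left[D(x')^{2}\right]
```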
With the pre-trained discriminator, PoseTrans efficiently filters out augmented samples whose plausibility is less than a pre-defined threshold, and fills the candidate pose pool with samples that are plausible and diverse.
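The filtering loop can be sketched as follows. This is a minimal sketch under stated assumptions: `transform` and `plausibility` are hypothetical callables standing in for PTM and the pre-trained discriminator, and `max_tries` is a safeguard added here so the loop always terminates.

```python
import random

def build_candidate_pool(sample, transform, plausibility, threshold,
                         pool_size, max_tries=1000):
    """Repeat the transformation until `pool_size` plausible candidates exist."""
    pool = []
    for _ in range(max_tries):
        candidate = transform(sample)
        # Keep only candidates the discriminator scores at or above the threshold.
        if plausibility(candidate) >= threshold:
            pool.append(candidate)
            if len(pool) == pool_size:
                break
    return pool

# Toy usage with stand-in functions (a "pose" is a single float here).
rng = random.Random(0)
toy_pool = build_candidate_pool(
    sample=0.0,
    transform=lambda s: s + rng.uniform(-1.0, 1.0),
    plausibility=lambda c: 1.0 - abs(c),
    threshold=0.5,
    pool_size=4,
)
```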
3.3 Pose Clustering Module (PCM)
After gaining the ability to create new human poses by PTM, we propose the Pose Clustering Module (PCM) to measure the pose rarity and select the needed poses for data augmentation.
Fitting the PCM. Our PCM is built upon the Gaussian Mixture Model (GMM). As a soft clustering method, it predicts the probability that a pose belongs to each category. Before pose clustering, human poses in the training set are first normalized: we crop every human instance from the image and re-scale the cropped image to a common height and width, and the corresponding keypoint coordinates are normalized accordingly. We fit the PCM using the normalized human poses in the training set. After fitting, the distribution of a given pose is modeled as a weighted sum of Gaussian components, where each component has its own weight, mean, and covariance.
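In standard notation, a Gaussian mixture density of this kind takes the following form. The symbols are assumed here for illustration: $P$ is the normalized pose vector, $K$ the number of Gaussian components, $\pi_k$ the weight of the $k$-th component, and $\mu_k$, $\Sigma_k$ its mean and covariance.

```latex
p(P) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(P \mid \mu_k, \Sigma_k\right),
\qquad \sum_{k=1}^{K} \pi_k = 1
```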
By predicting the probability of belonging to each Gaussian component, a human pose is assigned to the component with the maximum probability. We visualize the probability vectors of all examples using t-SNE [van2008visualizing], as shown in Fig. 4. With PCM, we cluster the human poses into categories, where Gaussian components with small weights (i.e., few examples) indicate the categories of rare poses. We observe the long-tailed problem: frontal standing accounts for a significant portion of the data, while squatting and lateral postures account for small percentages.
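The normalization and hard-assignment steps can be illustrated with a short sketch. It assumes a bounding box given as (x, y, width, height) and a hypothetical target size of 64 x 64; the actual target size is not specified here.

```python
def normalize_pose(keypoints, bbox, out_w, out_h):
    """Map keypoints from image coordinates into a crop resized to out_w x out_h."""
    x0, y0, w, h = bbox
    return [((x - x0) * out_w / w, (y - y0) * out_h / h) for x, y in keypoints]

def hard_assign(probabilities):
    """Index of the component with the maximum predicted probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

# Toy usage: two keypoints at opposite corners of the box.
normalized = normalize_pose([(10.0, 20.0), (30.0, 60.0)],
                            bbox=(10.0, 20.0, 20.0, 40.0), out_w=64, out_h=64)
category = hard_assign([0.1, 0.7, 0.2])
```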
Pose selection from the candidate pose pool. PoseTrans repeats PTM to build a candidate pose pool with a fixed number of samples for each training sample, and selects the rarest one among them using the probabilities of belonging to the Gaussian components predicted by the fitted PCM. We consider the transformed sample with the minimal weighted sum of the components’ density as the rarest and select it as the new training sample.
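A minimal sketch of this selection rule is given below. It reads "weighted sum of the components' density" as the mixture density, i.e., the sum over components of weight times Gaussian density; this reading, the diagonal covariances, and all parameter values are assumptions made to keep the example short. A variant based on posterior membership probabilities would replace `rarity_score` only.

```python
import math

def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def rarity_score(x, weights, means, variances):
    """Weighted sum of component densities; lower means rarer."""
    return sum(w * diag_gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def select_rarest(candidates, weights, means, variances):
    """Pick the candidate pose with the minimal weighted density."""
    return min(candidates,
               key=lambda x: rarity_score(x, weights, means, variances))

# Toy mixture: a heavy component at the origin and a light one at (5, 5).
weights = [0.9, 0.1]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
candidates = [[0.0, 0.0], [5.0, 5.0], [0.5, 0.2]]
rarest = select_rarest(candidates, weights, means, variances)
```

In this toy case the candidate near the light component is selected, which mirrors the intent of favoring poses from low-weight clusters.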
4.1 Datasets and Evaluation
Datasets. To verify the effectiveness of our proposed data augmentation approach, we conduct extensive experiments on popular datasets. (1) MS-COCO [lin2014microsoft] pose estimation dataset. Our models are trained on the train set only and evaluated on the val set and the test-dev set. DensePose [alp2018densepose] provides a small portion of human parsing annotations for the MS-COCO dataset. To verify the performance on rare poses, both the traditional evaluation metrics (i.e., AP/AR) and the newly designed metrics (balanced AP/AR) are used for evaluation. The base learning rate is 1e-3, and it decays to 1e-4 and 1e-5 at the 170th and 200th epochs, respectively. (2) PoseTrack’18 [andriluka2018posetrack] dataset. Following common settings [mmpose2020], we pre-train the model on the MS-COCO dataset and fine-tune it on the PoseTrack’18 dataset for 20 epochs. The base learning rate is 1e-4 and drops to 1e-5 at epoch 10 and then to 1e-6 at epoch 15. We test the model on the PoseTrack’18 validation set using the ground truth bounding boxes, and evaluate the AP on the whole body and also on different parts of the human. Due to limited space, the results of some experiments are placed in the supplementary material.
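The two step schedules described above can be expressed with one small helper. This is a sketch that assumes a decay factor of 0.1 per milestone (consistent with the listed values) and that the new rate takes effect from the milestone epoch onward.

```python
import math

def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate after multiplying by `gamma` at every milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# MS-COCO schedule: 1e-3, then 1e-4 at epoch 170 and 1e-5 at epoch 200.
coco_rates = [step_lr(e, 1e-3, (170, 200)) for e in (0, 170, 200)]
# PoseTrack'18 fine-tuning: 1e-4, then 1e-5 at epoch 10 and 1e-6 at epoch 15.
posetrack_rates = [step_lr(e, 1e-4, (10, 15)) for e in (0, 10, 15)]
```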
Evaluation metrics. We follow [lin2014microsoft] to use Average Precision (AP) and Average Recall (AR) for evaluation on MS-COCO [lin2014microsoft]. They are based on object keypoint similarity (OKS), which measures the distance between predicted keypoints and ground-truth keypoints normalized by the scale of the object. AP50 (AP at OKS = 0.5), AP75 (AP at OKS = 0.75), APM (AP for medium objects), and APL (AP for large objects) are reported.
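For concreteness, OKS for a single instance can be computed as in the sketch below, which follows the MS-COCO keypoint evaluation definition: each labeled keypoint contributes exp(-d^2 / (2 s^2 k^2)), where d is the distance to the ground truth, s^2 is the object area, and k is a per-keypoint constant. The numeric values in the usage lines are hypothetical.

```python
import math

def oks(pred, gt, visible, area, kappas):
    """Object keypoint similarity averaged over labeled keypoints."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visible, kappas):
        if v > 0:  # only keypoints with a ground-truth label are scored
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2.0 * area * k ** 2))
            count += 1
    return total / count if count else 0.0

# Toy usage: one keypoint off by 5 pixels, one unlabeled keypoint ignored.
score = oks(pred=[(3.0, 4.0), (50.0, 50.0)],
            gt=[(0.0, 0.0), (0.0, 0.0)],
            visible=[1, 0], area=100.0, kappas=[0.5, 0.5])
```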
Balanced AP/AR. Since existing datasets mostly suffer from the long-tailed distribution problem, simply calculating the AP/AR tends to ignore the minor pose categories. To solve this problem, we design the balanced AP/AR. We first classify the ground-truth poses into categories based on the fitted PCM. Then we calculate the standard AP/AR separately for each category and average the precision/recall over categories instead of over samples. Therefore, the balanced AP and the balanced AR assign the same weight to all pose categories, which is helpful for analyzing the “unbiased” performance.
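The averaging step can be sketched as follows, assuming the per-category AP values have already been computed by the standard evaluation. The second function is only an illustration of how a sample-dominated average lets frequent categories dominate; standard AP is not literally computed this way. All numbers are hypothetical.

```python
import math

def balanced_average(per_category_scores):
    """Average over categories, giving every pose category the same weight."""
    return sum(per_category_scores) / len(per_category_scores)

def sample_weighted_average(per_category_scores, counts):
    """Illustrative contrast: categories weighted by their sample counts."""
    total = sum(counts)
    return sum(s * c for s, c in zip(per_category_scores, counts)) / total

# Hypothetical example: a common category and a rare category.
scores, counts = [0.8, 0.4], [90, 10]
balanced = balanced_average(scores)                   # rare category counts equally
weighted = sample_weighted_average(scores, counts)    # common category dominates
```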
4.2 Implementation Details
PoseTrans can be integrated into the training pipeline of any existing pose estimator together with other common data augmentation strategies. Except for the small portion of images that have human parsing annotations, we leverage the DensePose [alp2018densepose] model for human parsing, which segments humans into 14 semantic parts. In PCM, we cluster the poses into 20 categories. We implement PoseTrans with scaling (), rotating (]), and apply it with the probability for every limb in the training examples. We filter out the implausible samples whose plausibility is less than and repeat PTM until the candidate pose pool has augmented samples.
Table 1: Results on the MS-COCO val and test-dev sets.
|Method||val AP||val AP50||val AP75||val APM||val APL||val AR||test-dev AP||test-dev AP50||test-dev AP75||test-dev APM||test-dev APL||test-dev AR|
|Bottom-up methods w/o multi-scale test|
|AE[newell2017associative] + HRNet-W32[sun2019deep]||64.4||86.3||72.0||57.1||75.6||71.0||64.1||86.3||70.4||57.4||73.9||70.4|
|+ PoseTrans (Ours)||66.2||86.4||72.1||59.3||76.5||71.6||65.4||87.6||72.1||58.8||74.7||71.0|
|+ PoseTrans (Ours)|
|Bottom-up methods with multi-scale test [, , ]|
|AE[newell2017associative] + HRNet-W32[sun2019deep]|
|+ PoseTrans (Ours)|
|+ PoseTrans (Ours)|
|+ PoseTrans (Ours)||72.3||89.9||80.0||68.3||79.2||77.8||71.5||91.8||80.0||68.1||77.3||77.0|
|+ PoseTrans (Ours)||72.7||90.0||80.7||69.5||78.8||78.3||71.8||91.6||80.3||68.3||77.5||77.3|
|+ PoseTrans (Ours)||75.5||91.0||82.9||71.8||82.2||80.7||74.2||92.4||82.5||70.8||79.6||79.4|
|HRNet-W32[sun2019deep] + Dark[zhang2020distribution]|
|+ PoseTrans (Ours)||76.0||90.8||83.0||72.1||83.2||81.1||75.0||92.5||82.9||71.5||80.6||80.1|
|+ PoseTrans (Ours)||76.5||90.9||83.3||72.5||83.3||81.5||75.4||92.5||83.0||71.6||81.1||80.4|
|+ PoseTrans (Ours)||76.8||91.0||83.1||72.7||83.7||81.6||75.7||92.6||83.4||72.0||81.7||80.6|
Table 2: Results on the PoseTrack’18 validation set.
|Method||Head||Sho.||Elb.||Wri.||Hip||Knee||Ank.||Total AP|
|+ PoseTrans (Ours)||87.8||89.3||84.7||77.7||82.3||81.6||75.4||83.0|
|+ PoseTrans (Ours)||88.6||90.0||86.2||80.3||83.1||84.9||79.8||84.9|
|+ PoseTrans (Ours)||88.9||90.3||87.4||81.8||83.5||85.5||80.6||85.7|
For bottom-up methods, PoseTrans is applied to every instance in the image separately. The experimental settings are the same as in [cheng2020higherhrnet]. We apply image-level random scaling (), random rotation ([, ]), random translation ([px, px]), and random flipping. The models are trained for epochs using the Adam optimizer [kingma2014adam]. The base learning rate is 1e-3, and it decreases to 1e-4 and 1e-5 at the -th and -th epochs respectively. For top-down approaches, the experimental settings are the same as in [sun2019deep]. We use the detected bounding boxes provided by Xiao et al. [xiao2018simple]. The detection boxes are first extended to a fixed aspect ratio (i.e., height:width = 4:3) and then enlarged by a factor of to include some context. We apply random scaling (), random rotation ([, ]), random flipping, and half-body crops. The models are trained on GPUs for epochs. We use the Adam optimizer [kingma2014adam] for training. All networks are pre-trained on the ImageNet dataset [russakovsky2015imagenet].
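The box preprocessing for top-down models can be sketched as below. The box format (x, y, width, height), the choice to keep the box center fixed, and the `enlarge` value used in the example are assumptions for illustration; the enlargement factor itself is not specified here.

```python
import math

def fit_box(x, y, w, h, aspect_h_over_w=4.0 / 3.0, enlarge=1.0):
    """Extend a box to a fixed height:width ratio, then enlarge about its center."""
    cx, cy = x + w / 2.0, y + h / 2.0
    if h / w < aspect_h_over_w:
        h = w * aspect_h_over_w  # box is too wide: grow the height
    else:
        w = h / aspect_h_over_w  # box is too tall (or exact): grow the width
    w, h = w * enlarge, h * enlarge
    return cx - w / 2.0, cy - h / 2.0, w, h

# A wide 30 x 20 box becomes 30 x 40 (height:width = 4:3) around the same center.
extended = fit_box(0.0, 0.0, 30.0, 20.0)
# Hypothetical enlargement factor of 1.25 on a box that already has ratio 4:3.
enlarged = fit_box(0.0, 0.0, 30.0, 40.0, enlarge=1.25)
```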
4.3 Improvement of state-of-the-art methods by PoseTrans
Improvement of AP/AR. Table 1 reports the performance improvement of AP and AR on the MS-COCO val and MS-COCO test-dev sets, where PoseTrans is applied to recent state-of-the-art pose estimators, i.e., SBL [xiao2018simple], HRNet [sun2019deep], and HigherHRNet [cheng2020higherhrnet]. Table 2 shows the performance improvement on the PoseTrack dataset. PoseTrans consistently boosts the performance of both top-down and bottom-up approaches on various datasets.
Improvement of balanced AP/AR. The results on balanced AP/AR are reported in Table 3(a). To calculate the new metrics, we use the bounding boxes and keypoint annotations to determine the category of predicted poses. Thanks to the design of PCM and PTM, PoseTrans increases the diversity of rare poses and balances the distribution, which enables PoseTrans to bring larger improvements on the newly proposed balanced AP/AR than on the traditional AP/AR.
4.4 Comparisons with other data augmentation techniques
In Table 3(b), we compare PoseTrans with other data augmentation techniques, including non-learning [devries2017improved, chen2020gridmask, bochkovskiy2020yolov4] and learning/strategy-based methods [wang2021human, fang2019instaboost].
For non-learning methods, Cutout [devries2017improved] randomly selects a rectangular region around the keypoint and fills it with random values. GridMask [chen2020gridmask] evenly replaces multiple rectangular regions in an image with zeros. For Photometric Distortion, we follow [bochkovskiy2020yolov4] to adjust the brightness, contrast, hue, saturation, and noise of an image. These general data augmentation techniques have proven effective for image classification. However, they do not bring significant improvements for human pose estimation. Similar conclusions have also been reached by previous works [pytel2021tilting]. This is probably because such techniques introduce undesirable artifacts and do not increase the diversity of human poses.
For learning/strategy-based methods, AdvMix [wang2021human] applies adversarial training to learn to mix up augmented samples generated by GridMask [chen2020gridmask] and AutoAugment [cubuk2018autoaugment]. InstaBoost [fang2019instaboost] is a recently proposed data augmentation technique which is originally designed for instance segmentation. It conducts crop-paste augmentation guided by the appearance consistency heatmaps. However, the improvements of AdvMix and InstaBoost are only marginal. ASDA [bin2020adversarial] also employs human parsing and augments images by pasting the segmented body parts. PoseTrans outperforms all these approaches, which validates the importance of increasing the diversity of the human body poses.
Note that PoseTrans is also complementary to other techniques. Effectively combining these techniques may further improve the final performance. As shown in the third row from the bottom in Table 1, combining PoseTrans with DarkPose [zhang2020distribution] yields further improvements.
4.5 Ablation Studies
Effect of PTM. Without using the PTM, we apply the over-sampling [chawla2002smote] and re-weighting [elkan2001foundations] strategies, which are two popular methods to tackle the long-tailed problem. The over-sampling method raises the frequency of the minor categories by duplicating the long-tailed data samples during model training. The re-weighting method assigns higher loss weights to rare samples and thus increases their importance. Based on the clustering results of PCM, we implement these methods as baselines, as shown in Table 4(a). By increasing the importance of long-tailed training samples, both over-sampling and re-weighting marginally improve the balanced metric. However, such approaches do not increase the diversity of the data, which leads to slight performance drops on AP and AR. With the design of PTM, our proposed PoseTrans creates diverse long-tailed samples and significantly outperforms the baseline methods.
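Generic forms of the two baselines can be sketched from per-sample cluster labels. These are common textbook variants (inverse-frequency loss weights and duplication up to the largest category), shown only to illustrate the idea; the exact weighting and sampling rules used for the baselines in Table 4(a) may differ.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Loss weight per category, inversely proportional to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def oversample_indices(labels):
    """Sample indices with minor categories duplicated up to the largest one."""
    counts = Counter(labels)
    target = max(counts.values())
    indices = []
    for c in counts:
        members = [i for i, label in enumerate(labels) if label == c]
        repeats = -(-target // len(members))  # ceiling division
        indices.extend((members * repeats)[:target])
    return indices

# Toy usage: six samples in a common cluster (0), two in a rare cluster (1).
labels = [0, 0, 0, 0, 0, 0, 1, 1]
loss_weights = inverse_frequency_weights(labels)
resampled = oversample_indices(labels)
```

Both strategies only re-use existing samples, which is why they cannot add new pose variations.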
Effect of PCM. Without PCM, PoseTrans randomly samples a transformed pose obtained from PTM as the training sample, instead of picking the “rarest” pose in the candidate pose pool. Note that “w/o PCM” is equivalent to using a candidate pose pool that contains a single sample. The studies of w/o PCM and of the candidate pool size are shown in Table 4(b). By providing a simple disturbance to the training data, w/o PCM improves the generalization of the model, which leads to some performance improvements. With the aid of PCM, our full model alleviates the long-tailed distribution problem of the training set by selecting among the transformed poses, which brings greater performance gains, especially on balanced AP/AR. Also, a larger candidate pose pool leads to better performance. However, increasing the pool size beyond a certain point brings no further performance boost.
Effect of pose discriminator. Without the pose discriminator, some implausible poses lead to performance degradation, as shown in Table 5(a). Since the scale and rotation parameters are sampled from a normal distribution and are restricted to bounded ranges in the implementation, a majority of the randomly generated poses are plausible. In this situation, the PTM without the pose discriminator can still benefit the model.
Comparison with the adversarial learning variant. Inspired by recent works [wang2021human, peng2018jointly, bin2020adversarial] on adversarial data augmentation, we also build an adversarial training variant of PoseTrans, which we refer to as PoseTrans-Adv. PoseTrans-Adv has an additional generator that predicts the rotation and scale for a given single human image. During training, the generator is asked to confuse the pose estimation model by maximizing the loss of the pose estimator. However, we observe that the generator soon learns to choose the maximum rotation and scale for every training sample, which actually decreases the diversity of the training set. This leads to performance degradation in all the evaluation metrics, as shown in the first row of Table 5(b).
Comparison with PoseTrans-Par on the MS-COCO dataset. As mentioned above, DensePose [alp2018densepose] provides a small portion of human parsing annotations for the MS-COCO dataset. Here, we compare with the PoseTrans-Par variant that replaces the human annotations with the pseudo-labels obtained from the parsing model. As shown in the second row of Table 5(b), without human annotations, the performance of PoseTrans-Par is comparable with PoseTrans.
Visualizations of the augmented samples. In Fig. 5, we visualize the original image and the augmented sample by PoseTrans. It can be observed that our proposed method generates diverse and plausible body postures that facilitate the model training and improve its generalization ability.
Visualizations of pose estimation results. In Fig. 6, we visualize pose estimation results obtained by HRNet [sun2019deep]. We observe that vanilla HRNet is easily confused by infrequent and difficult poses, e.g., upside-down postures and severe occlusions. By generating training samples with diverse rare poses, our PoseTrans improves the performance in these challenging cases.
Limitations. Our limitations mainly lie in the artifacts produced by the inpainting method and the accuracy of the human parsing model. We choose a simple non-data-driven inpainting method in pose transformation for efficiency. An advanced inpainting and parsing model with higher resolution inputs may bring more improvements in pose estimation.
In this paper, we study the performance degradation caused by unbalanced data distribution in human pose estimation. To tackle this issue, we propose PoseTrans with PTM, PCM, and a pose discriminator to create diverse and plausible training samples that have infrequent poses. Comprehensive experiments on public benchmarks demonstrate the effectiveness of our method, especially on rare poses. Our implementation of PoseTrans is simple and efficient, and it can be easily integrated into the training pipeline of existing pose estimators. We hope our work will draw the community’s attention to the long-tail problem in human pose estimation and provide inspiration on how to tackle it for other tasks.
Acknowledgement. This work is supported in part by the National Natural Science Foundation of China under Grant 62122010 and Grant 61876177, in part by the Fundamental Research Funds for the Central Universities, and in part by the Key Research and Development Program of Zhejiang Province under Grant 2022C01082. Ping Luo is supported by the General Research Fund of HK No.27208720, No.17212120, and No.17200622.