PRTR: Pose Recognition with Cascade Transformers
In this paper, we present a regression-based pose recognition method using cascade Transformers. Existing approaches in this domain can be categorized as 1) heatmap-based or 2) regression-based. In general, heatmap-based methods achieve higher accuracy but are subject to various heuristic designs (mostly not end-to-end), whereas regression-based approaches attain relatively lower accuracy but have fewer intermediate non-differentiable steps. Here we utilize the encoder-decoder structure in Transformers to perform regression-based person and keypoint detection that is general-purpose and requires less heuristic design compared with the existing approaches. We demonstrate the keypoint hypothesis (query) refinement process across different self-attention layers to reveal the recursive self-attention mechanism in Transformers. In the experiments, we report competitive results for pose recognition when compared with the competing regression-based methods.
We tackle the 2D human pose recognition problem [lin2014microsoft, andriluka20142d, toshev2014deeppose, newell2016stacked], where keypoints (head, shoulders, knees, etc.) for multiple people in an RGB image are to be detected and localized. This is an important problem in computer vision that can be adopted in a variety of downstream tasks including tracking, security, animation, human-computer interaction, computer games, and robotics.
There has been steady progress in 2D human pose recognition [andriluka20142d, toshev2014deeppose, wei2016convolutional, newell2016stacked, kreiss2019pifpaf, cao2017realtime, papandreou2017towards, sun2018integral, papandreou2018personlab, cheng2020higherhrnet, chen2018cascaded, sun2019deep, zhou2019objects, nie2019single], with systems becoming increasingly practical without strong constraints (e.g., in the presence of multiple people of varying sizes). However, pose recognition is a challenging problem that remains unsolved. The difficulty lies in various aspects such as large pose/shape variation, inter-person and self occlusion, large appearance variation, and background clutter.
For multiple people in an input image [lin2014microsoft], the task of pose recognition is to localize the human keypoints (17 in the experiments) for the individual persons. This can be achieved by a two-stage process in which individual persons are detected first, followed by keypoint detection from the detected image region/patch; this is called a top-down process [sun2019deep]. An alternative strategy is called a bottom-up process where human keypoints are detected directly from the image without an explicit object detection stage [cheng2020higherhrnet]. A discussion about the top-down and bottom-up approaches can be found in [cheng2020higherhrnet].
Another way to divide the existing literature in pose recognition is based on the choice of using heatmap or regression. Heatmap-based approaches [xiao2018simple, sun2019deep] perform dense keypoint detection followed by subsequent processes for clustering and grouping; they deliver strong performance but are also subject to many heuristic designs that are mostly not end-to-end learnable. Regression-based methods [sun2018integral, zhou2019objects, wei2020point] regress the keypoints directly, which involves fewer intermediate stages and specifications. Regression-based methods typically perform worse than heatmap-based ones, but can be made end-to-end and readily integrated with other downstream tasks. Both families therefore have their place: heatmap-based methods are adopted when accuracy is the priority, whereas regression-based approaches can be considered as a convenient plug-and-play module.
Generally, heatmap-based methods adopt handcrafted or heuristic pre/post-processing to encode ground truth to heatmaps and decode heatmaps to predict keypoints. These steps introduce design challenges and biases, making the methods sub-optimal and hard to update and adapt. In detail, SimpleBaseline [xiao2018simple] and HRNet [sun2019deep] adopt the standard coordinate decoding method, designed empirically according to model performance in [newell2016stacked], which shifts the coordinates by a 0.25 offset from the maximum activation toward the second maximum in the heatmap. DARK [zhang2020distribution] presents Taylor-expansion based coordinate decoding and unbiased sub-pixel centered coordinate encoding. UDP [Huang_2020_CVPR] even discovered a considerable accuracy decrease when using one-pixel flip shift in heatmap-based paradigms. For general-purpose regression methods, we aim at removing unnecessary designs by making the training objective and target output direct and transparent: coordinates should be output directly, and the loss should be calculated straightforwardly from the predicted and ground-truth coordinates.
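To make the heuristic nature of heatmap decoding concrete, here is a minimal sketch of the empirical quarter-offset decoding step as it is commonly implemented. The function name `decode_heatmap` and the list-of-lists heatmap layout are illustrative assumptions, not code from the cited works.

```python
def decode_heatmap(heatmap):
    """Return (x, y) of the peak, shifted 0.25 px toward the higher neighbour."""
    h, w = len(heatmap), len(heatmap[0])
    # Locate the maximum activation.
    py, px = max(((r, c) for r in range(h) for c in range(w)),
                 key=lambda rc: heatmap[rc[0]][rc[1]])
    x, y = float(px), float(py)
    # Quarter-pixel shift along each axis toward the larger adjacent value.
    if 0 < px < w - 1:
        diff = heatmap[py][px + 1] - heatmap[py][px - 1]
        x += 0.25 * ((diff > 0) - (diff < 0))
    if 0 < py < h - 1:
        diff = heatmap[py + 1][px] - heatmap[py - 1][px]
        y += 0.25 * ((diff > 0) - (diff < 0))
    return x, y


# Hypothetical 3x4 heatmap with its peak at column 1, row 1.
toy = [[0.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.6, 0.0],
       [0.0, 0.2, 0.0, 0.0]]
print(decode_heatmap(toy))
```

A regression head, by contrast, outputs the coordinate pair directly, so no such hand-tuned offset is involved.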
Bearing this in mind, we present a top-down regression-based 2D human pose recognition method using cascade Transformers consisting of a person detection Transformer and a keypoint detection Transformer. Two alternatives have been developed, one being a two-stage process (shown in Figure 2) with the two Transformers learned sequentially and the other being a sequential process (shown in Figure 3) with the two Transformers learned jointly in an end-to-end fashion. We name our method Pose Regression TRansformers (PRTR). We apply multi-scale features in the keypoint detection Transformer. Visualization for the keypoint queries across different attention layers in the decoder is given to illustrate the internal detection process. PRTR is a general-purpose approach for keypoint regression and we show competitive results in pose recognition when compared with the existing regression-based methods in the literature. The contributions of our work include:
We propose a regression-based human pose recognition method by building cascade Transformers, based on a general-purpose object detector, end-to-end object detection Transformer (DETR) [carion2020end]. Our method, named pose recognition Transformer (PRTR), enjoys the tokenized representation in Transformers with layers of self-attention to capture the joint spatial and appearance modeling for the keypoints.
Two types of cascade Transformers have been developed: 1) a two-stage one with the second Transformer taking image patches detected from the first Transformer, as shown in Figure 2; and 2) a sequential one using spatial Transformer network (STN) [jaderberg2015spatial] to create an end-to-end framework, shown in Figure 3.
We visualize the distribution of keypoint queries in various aspects to unfold the internal process of the Transformer for the gradual refinement of the detection.
On the COCO 2D human pose recognition dataset[lin2014microsoft], competitive results have been observed when compared with the regression-based methods.
Given an image $I$, the goal of pose recognition is to predict a possibly empty set of persons, $\{p_i\}_{i=1}^{N}$, where $N$ is the number of persons in the image. For each person, we need to predict its bounding box position, $b_i$, as well as its skeleton coordinates, $\{(x_{ij}, y_{ij})\}_{j=1}^{J}$, where $J$ is the number of joints pre-defined in each dataset.
We discuss related work from several aspects. The field of human pose regression has witnessed continuing progress [andriluka20142d, toshev2014deeppose, wei2016convolutional, newell2016stacked, kreiss2019pifpaf, cao2017realtime, papandreou2017towards, sun2018integral, papandreou2018personlab, cheng2020higherhrnet, chen2018cascaded, sun2019deep, zhou2019objects, nie2019single], in particular with the advances in deep learning technologies [krizhevsky2012imagenet, goodfellow2016deep, he2016deep]. One notable development in pose recognition is the creation of the HRNet family of models [sun2019deep, cheng2020higherhrnet], which introduces a new convolutional neural network (CNN) architecture targeting the modeling of high-resolution feature responses. HRNet [sun2019deep] has shown its particular advantage in advancing the state-of-the-art for 2D human pose recognition/estimation.
Heatmap-based approaches include [cao2017realtime, he2017mask, papandreou2017towards, newell2017associative, kreiss2019pifpaf, papandreou2018personlab, cheng2020higherhrnet, chen2018cascaded, xiao2018simple, sun2019deep, zhang2020distribution, Yang_2017_ICCV, Tang_2018_ECCV]
where various techniques have been developed to perform multi-class keypoint classification. The classifiers produce dense heatmaps (classification maps), followed by clustering and grouping processes. On one hand, heatmap-based methods leverage fine-grained detection for the keypoints by densely scanning all the pixels; on the other hand, heatmaps create a disconnection from the overall estimation of the keypoints, making the intermediate clustering and grouping process not directly integrable into end-to-end learning frameworks.
Regression-based methods [Carreira2016HumanPE, zhou2019objects, nie2019single, wei2020point, sun2018integral] approach keypoint detection by directly minimizing a loss between predicted and ground-truth coordinates; hence, they can be more easily integrated into an end-to-end learning framework. However, holistic regression can be intrinsically more difficult to optimize due to the high precision needed by pose recognition. Furthermore, regression-based approaches typically have a recursive procedure [dollar2010cascaded] that skips a large number of candidate locations, creating a performance gap with the heatmap-based methods. Our work follows the line of regressive pose estimation, and formulates the process of step-by-step regression [dollar2010cascaded, Carreira2016HumanPE] implicitly in a layered Transformer way.
Transformers and self-attention. The attention mechanism [xu2015show, vaswani2017attention, devlin2018bert]
has greatly advanced the field of representation learning in machine learning. The introduction of Transformers[vaswani2017attention] to object detection gives another leap-forward in building end-to-end object detection framework that is free of proposal, anchor, and post processing (non-maximum suppression). Here, we build cascade Transformers based on the DETR [carion2020end] framework to perform regression-based pose recognition. Our system, named PRTR, aims towards a general-purpose keypoint regression solution without specific heuristic-driven designs.
Recently, Transformer architecture and self-attention have seen increasing application in computer vision tasks [Parmar2018ImageT, carion2020end, Dosovitskiy2020AnII], yet there are limited visualization works compared with those done on language application [Coenen2019VisualizingAM, Vig2019AnalyzingTS]. As far as we know, we are the first to visualize the dynamic decoding process in Transformer decoder, which brings significant insights to future Transformer designs.
We argue that the attention mechanism in Transformer can act as a general-purpose inference engine for regression in vision tasks by writing visual perception as a Bayesian inference with $p(y \mid x) \propto p(x \mid y)\, p(y)$. Here, Transformer for regression performs direct learning and inference by capturing complex joint relations between input $x$ and prediction hypotheses (queries) $y$, $p(x \mid y)$, through cross-attention, and modeling the prior on configuration of $y$, $p(y)$, via hypothesis (query) self-attention. See Figure 1.
In this section, we instantiate this idea as Pose Recognition with TRansformer (PRTR) for multi-person pose recognition. The overall architecture is shown in Figure 2. We first introduce a cascaded double Transformer architecture for person and keypoint detection, then an end-to-end variant to streamline the entire model.
We tackle the multi-person pose recognition problem in a top-down manner, and adopt a Transformer architecture [vaswani2017attention] following DEtection TRansformer (DETR) [carion2020end] as the backbone for the first-stage person detection. In the encoder stage, image features generated by a CNN are flattened and fed into a Transformer encoder to produce contextualized image features; in the decoder stage, given a fixed set of learned query embeddings as input, the Transformer decoder reasons about the relations between objects under the context of image features, and outputs all the object queries in parallel. Finally, a classification head is used to classify each object as person or background ($\varnothing$), and a 4-channel regression head is used to predict the bounding boxes.
After getting the bounding boxes, we crop the RGB image and use another CNN backbone to get feature maps per person. Because only matched queries are involved in calculating the loss for the keypoint-detection Transformer, we filter out unmatched ones. As in person detection, we use the encoder-decoder architecture of the Transformer to predict in a parallel fashion, but with another set of queries (quantity denoted $Q$). Finally, a classification head predicts among $J$ types of joints and background ($\varnothing$), and a 2-channel regression head outputs the coordinates of each keypoint.
Since PRTR infers a fixed number of predictions ($Q$) that is larger than the number of ground-truth keypoints ($J$), we need to find a matching between them to calculate the loss. We formulate this as an optimal bipartite matching problem, which can be solved efficiently by the Hungarian algorithm [Stewart2016EndtoEndPD]. Specifically, we seek an injective function $\sigma: \{1, \dots, J\} \to \{1, \dots, Q\}$ that minimizes the total matching cost:
$$\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{J} \mathcal{C}_i\big(\sigma(i)\big), \quad (1)$$
where $\sigma(i)$ indexes the prediction to be matched with the $i$-th keypoint.
At training stage, we match our queries using a mixture of classification probabilities and coordinate deviation. The cost for the $i$-th keypoint and its matched query is:
$$\mathcal{C}_i\big(\sigma(i)\big) = -\hat{p}_{\sigma(i)}(c_i) + \lVert \mathbf{k}_i - \hat{\mathbf{k}}_{\sigma(i)} \rVert, \quad (2)$$
where $\hat{p}_{\sigma(i)}$ is the class probabilities of the query, $c_i$ is the class label for the $i$-th keypoint, and $\mathbf{k}_i$ and $\hat{\mathbf{k}}_{\sigma(i)}$ are the ground-truth and predicted coordinates. However, at inference stage, we do not have access to the ground-truth keypoint coordinates, thus we match prototype keypoints to queries using only the classification probabilities. Therefore the matching cost for the $i$-th keypoint is simply:
$$\mathcal{C}_i\big(\sigma(i)\big) = -\hat{p}_{\sigma(i)}(c_i). \quad (3)$$
After running the bipartite matching algorithm, we return the matched keypoints as our prediction.
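The matching step described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the helper `match_keypoints` is hypothetical, the coordinate deviation is taken to be an L1 distance, and a brute-force search over injective assignments stands in for the Hungarian algorithm (it finds the same optimum on toy inputs but does not scale).

```python
from itertools import permutations


def match_keypoints(probs, gt_classes, pred_xy=None, gt_xy=None):
    """Return sigma, where sigma[i] is the query index matched to keypoint i.

    probs[q][c]   : class probability of query q for class c
    gt_classes[i] : class label of the i-th keypoint
    pred_xy/gt_xy : coordinates; pass both for the training-time cost,
                    omit them for the inference-time cost.
    """
    num_queries = len(probs)

    def cost(i, q):
        c = -probs[q][gt_classes[i]]  # negative class probability
        if pred_xy is not None and gt_xy is not None:
            # Coordinate deviation; using the L1 distance is an assumption.
            c += abs(pred_xy[q][0] - gt_xy[i][0]) + abs(pred_xy[q][1] - gt_xy[i][1])
        return c

    # Brute force over injective assignments (stand-in for the Hungarian algorithm).
    return min(permutations(range(num_queries), len(gt_classes)),
               key=lambda sigma: sum(cost(i, q) for i, q in enumerate(sigma)))


# Hypothetical toy problem: 4 queries, 2 keypoint types.
probs = [[0.1, 0.8], [0.7, 0.2], [0.6, 0.3], [0.2, 0.1]]
pred_xy = [[0.5, 0.5], [0.9, 0.9], [0.1, 0.1], [0.0, 0.0]]
gt_xy = [[0.1, 0.1], [0.5, 0.5]]
print(match_keypoints(probs, [0, 1]))                  # inference: probabilities only
print(match_keypoints(probs, [0, 1], pred_xy, gt_xy))  # training: plus deviation
```

In the toy problem the two costs select different queries for the first keypoint, which shows why the coordinate term matters during training. For realistic numbers of queries, an assignment solver would replace the brute-force search.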
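The loss function of the model is obtained by replacing the negative probability in Equation 2 with the negative log-likelihood for matched queries: the classification term for the $i$-th keypoint becomes $-\log \hat{p}_{\sigma(i)}(c_i)$, where $\hat{p}_{\sigma(i)}(c_i)$ is the probability that the matched query assigns to the keypoint's class $c_i$, while the coordinate deviation term is kept. For unmatched queries we only backpropagate the classification loss. To address the class imbalance caused by the background ($\varnothing$) class, as in [carion2020end], we set the weight of its log-probability term to 0.1.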
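As a rough sketch of how such a loss could be assembled for one person: matched queries contribute a negative log-likelihood plus a coordinate term, and unmatched queries contribute only a down-weighted background classification term. The function `prtr_loss`, the L1 coordinate distance, the background target for unmatched queries (following DETR's convention), and the unnormalized sum are all assumptions made for illustration.

```python
import math


def prtr_loss(probs, pred_xy, gt_classes, gt_xy, sigma, bg_class, bg_weight=0.1):
    """Sum the loss over all queries of one person.

    sigma[i] is the query matched to the i-th ground-truth keypoint.
    """
    matched = set(sigma)
    total = 0.0
    for i, q in enumerate(sigma):
        # Matched query: negative log-likelihood of the keypoint's class ...
        total += -math.log(probs[q][gt_classes[i]])
        # ... plus the coordinate deviation (L1 here, as an assumption).
        total += abs(pred_xy[q][0] - gt_xy[i][0]) + abs(pred_xy[q][1] - gt_xy[i][1])
    for q in range(len(probs)):
        if q not in matched:
            # Unmatched query: classification only, with background as the
            # target and its log-probability term weighted by 0.1.
            total += -bg_weight * math.log(probs[q][bg_class])
    return total


# Hypothetical example: one keypoint type (class 0) plus background (class 1).
loss = prtr_loss(probs=[[0.5, 0.5], [0.25, 0.75]],
                 pred_xy=[[0.2, 0.4], [0.0, 0.0]],
                 gt_classes=[0], gt_xy=[[0.1, 0.4]],
                 sigma=(0,), bg_class=1)
print(loss)
```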
In the previous section, we introduced a two-stage pipeline. However, under an end-to-end philosophy, it is desirable for the model to be end-to-end tunable so as to exploit the synergy between the person detection and keypoint recognition tasks. To this end, we incorporate the Spatial Transformer Network (STN) [Fang2019LocalityConstrainedST] to crop out image features needed by the keypoint-detection Transformer directly from the feature map generated by the first CNN backbone. This cropping operation is differentiable with respect to not only the feature maps, but also the bounding box coordinates.
For instance, an grid generated by can be formulated by:
, where is relative to the original image, and is the desired feature map size for the keypoint-detection Transformer.
To mitigate the resolution challenge commonly seen in keypoint recognition, we apply the grid to feature maps of different scales generated at different intermediate layers of the CNN backbone using a bilinear kernel. Denoting the original feature map by , the differentiable sampling process can be formulated as:
After getting a series of image features of the same spatial size, we concatenate them into a single feature map for the keypoint-detection Transformer. This multi-layer cropping variant is illustrated in Figure 3.
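The cropping idea can be illustrated with a small pure-Python sketch: a regular grid is laid over the bounding box and each grid point is read from the feature map with a bilinear kernel. The box format `(x_min, y_min, x_max, y_max)` in normalized image coordinates, the edge-to-edge grid spacing, and the function names are assumptions for illustration; the paper's exact grid formula is not reproduced here.

```python
def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D map `feat` at continuous pixel position (x, y)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def crop_features(feat, box, out_h, out_w):
    """Sample an out_h x out_w grid inside `box` = (x_min, y_min, x_max, y_max).

    Box coordinates are normalized to [0, 1] relative to the image, so the
    same box can be applied to feature maps of any resolution.
    """
    h, w = len(feat), len(feat[0])
    x_min, y_min, x_max, y_max = box
    out = []
    for i in range(out_h):
        # Grid rows and columns are evenly spaced from one box edge to the other.
        ty = i / (out_h - 1) if out_h > 1 else 0.5
        y = (y_min + ty * (y_max - y_min)) * (h - 1)
        row = []
        for j in range(out_w):
            tx = j / (out_w - 1) if out_w > 1 else 0.5
            x = (x_min + tx * (x_max - x_min)) * (w - 1)
            row.append(bilinear_sample(feat, x, y))
        out.append(row)
    return out


# Two hypothetical feature maps of different resolution, cropped with one box
# to the same output size and then stacked as channels.
fine = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
coarse = [[0, 2], [6, 8]]
box = (0.25, 0.25, 0.75, 0.75)
channels = [crop_features(f, box, 2, 2) for f in (fine, coarse)]
print(channels)
```

Because every output value is a weighted sum of feature values with weights that depend continuously on the box coordinates, gradients can flow to both the features and the box in an autodiff framework.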
We validate our proposed method on the COCO Keypoint Detection task and MPII Human Pose Dataset.
Datasets. We used two human pose estimation datasets, COCO and MPII. The COCO dataset [lin2014microsoft] contains over 200,000 images and 250,000 person instances. Each person instance is labelled with 17 joints. We train our model on COCO train2017 dataset with 57K images, and evaluate our approach on the standard val2017 and test-dev2017 split, containing 5K and 20K images respectively. The MPII single person dataset [andriluka20142d] consists of around 25K images and 40K well-separated person instances. We follow the standard train/val split.
Evaluation metrics. We follow the common practice in [sun2019deep] and use Object Keypoint Similarity (OKS) for COCO and Percentage of Correct Keypoints (PCK) for MPII to evaluate the performance.
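For readers unfamiliar with the two metrics, the following sketch computes OKS for a single predicted pose and a PCKh-style score, following the commonly used definitions. The function names and the toy numbers are illustrative assumptions; official evaluation toolkits add further details (matching across instances, thresholds, per-joint statistics) that are omitted here.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """Object Keypoint Similarity between one predicted and one ground-truth pose."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, s in zip(pred, gt, visible, sigmas):
        if v > 0:  # only labelled keypoints count
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            k = 2 * s  # per-keypoint constant k_i = 2 * sigma_i
            total += math.exp(-d2 / (2 * area * k ** 2))
            count += 1
    return total / count if count else 0.0


def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of keypoints within alpha * head_size of the ground truth."""
    hits = sum(math.hypot(px - gx, py - gy) <= alpha * head_size
               for (px, py), (gx, gy) in zip(pred, gt))
    return hits / len(gt)


# Hypothetical two-keypoint pose: one exact prediction, one off by 5 pixels.
pred, gt = [(0, 0), (3, 4)], [(0, 0), (0, 0)]
print(oks(pred, gt, visible=[1, 1], area=25, sigmas=[0.5, 0.5]))
print(pckh(pred, gt, head_size=8))
```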
Person-detection Transformer finetuning. We first tune a person detector by initializing from weights provided by DETR [carion2020end]. We keep all weights except prototype vectors for non-person class in the classifier. The tuning lasts for 10 epochs with a learning rate of 1e-7 for the ResNet-50 backbone and 5e-6 for the rest. For the pose recognition task, people without any visible keypoints should not be detected; these people have a common characteristic of being small in area. In fact, all people with a segmentation area less than do not contain keypoints. Given this, we skipped person annotations without visible keypoints at this stage for both training and evaluation. After tuning, the person detector scores an mAP of on the pruned val2017 set, and an mAP of on the standard val2017 set.
Two-stage variant. For the two-stage version of our model, we extend the human detection bounding box in height or width to a fixed aspect ratio ( for COCO). A patch is cropped using the box and then resized to a fixed size, or for COCO. The data augmentation follows [xiao2018simple], including random rotation (), random scale (), and flipping. The data pre-processing remains the same for MPII, except for aspect ratio set to and input size available in or . For the Transformer part, number of encoder layers, decoder layers and keypoint queries are set to 6, 6, 100 respectively.
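The box-extension step can be sketched as below. The helper `extend_to_aspect_ratio`, the `(cx, cy, w, h)` box format, the interpretation of the ratio as width divided by height, and the value 0.75 used in the example are assumptions for illustration only; the ratio used in the experiments is not recoverable from the text above.

```python
def extend_to_aspect_ratio(box, ratio):
    """Grow `box` = (cx, cy, w, h) in width or height so that w / h == ratio."""
    cx, cy, w, h = box
    if w > ratio * h:
        h = w / ratio  # box too wide: extend the height
    else:
        w = h * ratio  # box too tall: extend the width
    return cx, cy, w, h


# Hypothetical detections, extended to an assumed width/height ratio of 0.75.
print(extend_to_aspect_ratio((50, 50, 30, 80), 0.75))
print(extend_to_aspect_ratio((50, 50, 90, 80), 0.75))
```

Only one side ever grows, so the original detection is always contained in the extended box.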
We use the AdamW optimizer [Loshchilov2019DecoupledWD]. The base learning rate is 1e-5 for the ResNet backbone and 1e-4 for the rest, with weight decay 1e-4. A multi-step learning rate schedule is used, which halves the learning rate at the 120th and 140th epochs. The training process terminates within 200 epochs for both datasets.
Testing. At test time, we use the person detection results from the tuned person detector (with AP on COCO val2017 set) for both COCO val and test-dev set. Inspired by the common practice of flip-test [chen2018cascaded, newell2016stacked, xiao2018simple] used in heatmap paradigms, we compute the keypoint coordinates by averaging the outputs of original and flipped images.
End-to-end variant. For the end-to-end variant, we use ground truth to match predicted people after person-detection Transformer, and discard unmatched queries because they will not be contributing to training keypoint-detection Transformer. For images with more than 5 people, we randomly sample 5 matched queries to reduce computational cost. Bounding boxes predicted by person-detection Transformer are enlarged by at both the height and width dimension before sampling image features from backbone features, which helps predicting keypoints at the margin by taking in more contextual information.
We used the same data augmentation as DETR [carion2020end] except randomly resizing the image to having its shortest side being to while not exceeding . Optimizer settings follow the two-stage variant, except for halving the learning rate at the 25th and 60th epoch instead.
Results on the COCO dataset. Table 1 and Table 2 compare pose estimation results on COCO val and test-dev set respectively. Qualitative results are given in Figure 6. The end-to-end variant surpasses competing fully end-to-end methods like CenterNet [zhou2019objects] and DirectPose [Tian2019DirectPoseDE]. The two-stage variant of our approach outperforms the competing baselines in the regression-based category. Our model with ResNet-101 backbone is comparable to PointSetNet [wei2020point], which leverages a more complex backbone (HRNet-W48). Our model benefits from larger input size and stronger feature backbones. By enlarging input size from to , PRTR with ResNet-50 and ResNet-101 receives improvements of 2.2 and 1.9 respectively. Our best model, achieving 72.1 AP, is able to emulate the heatmap-based HigherHRNet [cheng2020higherhrnet].
Results on the MPII val dataset. Since only MPII val is publicly available, we report the performance of our model trained on the entire MPII train set, as shown in Table 3. Our best model achieves an 89.5 PCKh@0.5 score, comparable to that of SimpleBaseline [xiao2018simple]. Because MPII does not need a person detection stage, we do not evaluate the end-to-end variant on it.
Non class-specific queries. We let the queries of the Transformer decoder predict both keypoint coordinates and classes, and then select the required points from all the queries via class probabilities. This way, we do not enforce a fixed correspondence between keypoint types and queries. Therefore, the queries are not class-specific and can be used to predict different types of keypoints each time. Here, we focus on two design questions: a) the number of queries used; b) whether queries need to be non class-specific when the number of queries equals the number of required points. From Table 4, it is clear that the 100-query version only has a small advantage over its 50- and 17-query counterparts. However, using class-specific queries greatly hampers the performance of the model, resulting in a large drop in AP (11.4). This illustrates the necessity of letting each query dynamically predict its preferred keypoint type and reading out the best estimate through Hungarian matching during inference.
Exclusion of background prediction during inference. During inference, we exclude the logits of the background class () before normalizing class probabilities to provide more keypoint candidates for the Hungarian matcher. From Table 5, we observe that including the logits of the background class results in a 0.9 to 1.5 drop in AP.
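The effect of dropping the background logit before normalization can be seen in a short sketch. The helper `keypoint_probs` and the toy logits are hypothetical; the point is only that a query dominated by the background class still yields usable keypoint probabilities once that logit is removed from the softmax.

```python
import math


def keypoint_probs(logits, bg_index):
    """Softmax over keypoint classes only, ignoring the background logit."""
    kept = [z for c, z in enumerate(logits) if c != bg_index]
    m = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in kept]
    s = sum(exps)
    return [e / s for e in exps]


# Hypothetical query with two keypoint logits and a large background logit (last).
print(keypoint_probs([0.0, 0.0, 5.0], bg_index=2))
```

With the background logit kept, both keypoint classes in this example would receive a probability below 0.01, leaving the matcher with little signal for ranking candidates.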
Flip test. Flipping is a common test-time augmentation used in heatmap paradigms: the input image is horizontally flipped and fed to the model, and the predicted heatmaps are then flipped back, aligned, and averaged with the original ones to increase accuracy. The same technique applies to regression models as well, with results obtained by directly averaging the predicted keypoint coordinates. Since regression operates on continuous coordinate space, one advantage is that it does not suffer from the inaccuracy caused by alignment errors in heatmap paradigms, as described in [Huang_2020_CVPR]. From Table 5, flip test offers a consistent performance boost for our model.
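For a regression model, the flip test reduces to a few coordinate operations, sketched below. The function `flip_test_average`, the continuous-coordinate mirroring rule `x -> width - x`, and the three-keypoint example are assumptions for illustration.

```python
def flip_test_average(pred, pred_flipped, width, flip_pairs):
    """Average keypoints predicted on an image and on its horizontal mirror.

    pred, pred_flipped : lists of (x, y), one entry per keypoint type
    flip_pairs         : (left, right) index pairs that swap under mirroring
    """
    restored = [(width - x, y) for x, y in pred_flipped]  # mirror x back
    for left, right in flip_pairs:  # undo the left/right swap
        restored[left], restored[right] = restored[right], restored[left]
    return [((x + fx) / 2, (y + fy) / 2)
            for (x, y), (fx, fy) in zip(pred, restored)]


# Hypothetical keypoints: nose, left shoulder, right shoulder.
pred = [(50, 10), (30, 40), (70, 40)]
pred_flipped = [(52, 10), (30, 42), (70, 40)]
print(flip_test_average(pred, pred_flipped, width=100, flip_pairs=[(1, 2)]))
```

No heatmap alignment is involved: the mirrored predictions are mapped back analytically and averaged in continuous space.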
Oracle results. We also explore the room for improvement by replacing the bounding boxes predicted by the person detector with ground truth (GT) ones, as in Table 5. It is evident that GT boxes improve AP by 2 to 2.5, indicating the potential benefit of a stronger person detector.
We visualize the position and class distribution for keypoint predictions by the queries. Different queries are observed to bias towards different keypoints (in our model 92.3% of the predictions by the 89th query are nose keypoints). We also observe that queries dedicated to certain keypoints are biased to specific locations (the query focusing on the nose tends to predict positions in the upper part of the images) while the points predicted by queries focusing on background are uniformly distributed.
In Figure 4, we explore and visualize query output results in different decoder layers during inference. The first row shows the queries selected by the Hungarian algorithm and demonstrates how their predictions move and refine from lower to higher decoder layers. Initially, the predictions are randomly located in the image. After passing some decoder layers, the queries' predictions gradually approach the proper locations. It is noteworthy that if a query's prediction is close to the ground truth in lower layers, its prediction barely changes in higher layers.
The second row shows the spatial probabilities of a certain type of keypoint. For visualization, Gaussian heatmaps are first generated around the predicted keypoint locations, with their peak values proportional to class probabilities; then the heatmaps of all queries are stacked to form a single probability map. Note that the initial query embedding (the first column) produces an equivocal keypoint distribution. There exists confusion of keypoint locations in the first several layers of decoder, yet as the decoder layer goes deeper, the refinement proceeds and eventually yields a salient keypoint probability map (the last column).
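The probability-map construction used for this visualization can be approximated as follows. The function `probability_map`, the Gaussian width `sigma`, and the choice to combine per-query heatmaps by summation are assumptions; the text only states that the heatmaps are stacked.

```python
import math


def probability_map(points, probs, height, width, sigma=1.0):
    """Stack one Gaussian per query, with peak value equal to its class probability."""
    grid = [[0.0] * width for _ in range(height)]
    for (px, py), p in zip(points, probs):
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                grid[y][x] += p * math.exp(-d2 / (2 * sigma ** 2))
    return grid


# Two hypothetical queries predicting the same keypoint type with
# probabilities 0.8 and 0.2 at different locations.
grid = probability_map([(1, 1), (3, 1)], [0.8, 0.2], height=3, width=5)
for row in grid:
    print(["%.2f" % v for v in row])
```

A confident, well-localized set of queries produces a single sharp peak, while early decoder layers with scattered, low-probability predictions produce a diffuse map.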
In this paper, we have presented Pose Regression TRansformer (PRTR), a new design for regression-based multi-person pose recognition based on the Transformer structure [vaswani2017attention, carion2020end]. It treats the pose recognition task as a regression task, removes complex pre/post-processing procedures, and requires fewer heuristic designs compared with existing heatmap-based approaches. Our method includes two alternatives, a two-stage one and an end-to-end one. PRTR achieves state-of-the-art performance compared with other existing regression-based methods on the challenging COCO dataset. The distribution and refinement visualizations of keypoint queries take a step toward revealing the inner mechanisms of the Transformer decoder. In the future, we would like to investigate more powerful backbone networks and combine regression-based human detection and pose recognition in a more flexible manner.