Swin-Pose: Swin Transformer Based Human Pose Estimation

01/19/2022
by   Zinan Xiong, et al.

Convolutional neural networks (CNNs) have been widely utilized in many computer vision tasks. However, CNNs have a fixed receptive field and lack the ability of long-range perception, which is crucial to human pose estimation. Due to its capability to capture long-range dependencies between pixels, the transformer architecture has recently been adopted in computer vision applications and has proven to be highly effective. We are interested in exploring its capability in human pose estimation, and thus propose a novel model based on the transformer architecture, enhanced with a feature pyramid fusion structure. More specifically, we use a pre-trained Swin Transformer as our backbone to extract features from input images, and we leverage a feature pyramid structure to extract feature maps from different stages. By fusing these features together, our model predicts the keypoint heatmap. The experimental results of our study demonstrate that the proposed transformer-based model can achieve better performance compared to state-of-the-art CNN-based models.

1 Introduction

Figure 1: Swin-Pose Architecture Overview. The proposed Swin-Pose utilises a multi-method approach combining the Swin Transformer blocks from Swin-L[11] with feature pyramid fusion. The feature pyramid fusion produces the output for keypoint heatmap regression.

Human pose estimation is one of the key tasks in the field of computer vision; it aims to detect persons and locate their body joints, such as the neck, shoulders, and knees, in input images or videos. It has several important and promising applications, including action recognition, human-computer interaction, and gaming. Many challenges remain, brought by partial or full joint occlusions and differences in body shapes and clothing. Moreover, it shares most of the difficulties of object detection and thus remains a difficult problem to solve. A traditional method for estimating an articulated pose relies on pictorial structures, where body parts are arranged in a deformable configuration attached to the main object, and a part detector is then used to learn the varying poses and appearances[1]. A large amount of work has been devoted to obtaining better feature representations and distinguishing the correct pose. However, the computational cost is high, and the ability of such methods to adapt to all poses is still limited.

Recently, deep convolutional neural networks (DCNNs) have proved their ability in computer vision tasks. Among these approaches, two mainstream methods exist: keypoint position regression[2, 3], and keypoint heatmap estimation followed by choosing the location with the highest score[4, 5, 6]. The former treats pose estimation as a joint position regression problem and regresses the location of each joint keypoint directly. The latter places a 2D Gaussian kernel on each keypoint to construct ground-truth heatmaps, and utilizes these heatmaps to supervise the prediction with an L2 loss. Heatmap regression is relatively simpler to implement than keypoint regression, and it achieves both high accuracy and efficiency; thus, most state-of-the-art methods are based on this approach.
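As an illustration of the heatmap-based formulation, the following numpy sketch builds a ground-truth heatmap for a single keypoint and scores a prediction with an L2 loss; the heatmap size and Gaussian sigma are illustrative values, not the paper's settings.

```python
import numpy as np

def gaussian_heatmap(height, width, center_x, center_y, sigma=2.0):
    """Place a 2D Gaussian at (center_x, center_y) on a height x width grid."""
    xs = np.arange(width)
    ys = np.arange(height)[:, None]
    return np.exp(-((xs - center_x) ** 2 + (ys - center_y) ** 2) / (2.0 * sigma ** 2))

# Ground-truth heatmap for one keypoint on a 64x48 grid, and an L2 (MSE) loss
# against a predicted heatmap of the same shape.
gt = gaussian_heatmap(64, 48, center_x=20, center_y=30)
pred = np.zeros_like(gt)            # stand-in for a network prediction
l2_loss = np.mean((pred - gt) ** 2)
print(gt.shape, l2_loss)
```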

The attention mechanism was first proposed to solve the problem that recurrent neural networks (RNNs) cannot remember long sentences[7]. It calculates the similarity and weights between the current input and all other inputs, so that the model puts more effort into the context with higher weights. The transformer adopts an encoder-decoder structure, and it has been demonstrated that replacing the original design with multi-head self-attention leads to better performance[9]. However, the traditional transformer structure cannot be applied to the human pose estimation task directly. First, its computational cost is tremendously high: it increases sharply with the size of the feature map, and self-attention consumes excessive memory. In addition, the traditional transformer only generates output feature maps at a single scale, which limits its performance on images containing multi-scale objects.
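For concreteness, the snippet below is a minimal sketch of the scaled dot-product self-attention being described, written for a single head over a small token sequence; the dimensions and random projection matrices are illustrative only.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence x.

    x: (num_tokens, dim); w_q / w_k / w_v: (dim, dim) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 1) / (q.shape[-1] ** 0.5)   # pairwise similarity
    weights = torch.softmax(scores, dim=-1)                  # attention weights
    return weights @ v                                       # weighted sum of values

tokens, dim = 16, 32
x = torch.randn(tokens, dim)
w = [torch.randn(dim, dim) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (16, 32): every token now mixes information from all tokens
```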

To leverage the transformer's ability to capture long-range dependencies while avoiding the excessive memory consumption of the traditional transformer, we choose Swin Transformer[11] as our backbone, which reduces the computational cost and makes the model practical to implement. We also integrate it with a top-down feature pyramid structure to add scale invariance. With this Swin Transformer structure and two different fusion methods, we propose a novel human pose estimation model, Swin-Pose, which achieves outstanding results compared with HRNet-W48. Additionally, we perform a number of experiments investigating the effects of input resolution, model size, and fusion method, and present the results in Section 5.

2 Related Work

Human Pose Estimation: CNNs have demonstrated their powerful ability in the field of computer vision, owing to their local connectivity and weight sharing, and have thus been widely applied to human pose estimation. Convolutional Pose Machines generate 2D belief maps for the location of each part through several convolutional networks at each stage[8], which encode the spatial uncertainty of the locations and capture the spatial relationships between the parts. SimpleBaseline[21] uses ResNet[23] as the backbone to extract features; three deconvolutional layers with batch normalization and ReLU are added after the last convolution stage to generate high-resolution feature maps and heatmaps.

Instead of following a single route, HRNet[22] recovers high-resolution representations from low-resolution sub-networks and maintains a high-resolution representation through the whole process. It gradually adds lower-resolution branches in parallel, fuses the outputs from all branches to exchange information, passes the fused information through all branches, and finally generates the keypoint estimation. It has been established that HRNet can predict keypoints more accurately and delivers outstanding results.

Figure 2: Structure of Transformer Block

Attention Mechanism and Transformer: Originally, transformer models were applied to natural language processing (NLP). The transformer stacks an encoder and a decoder together, both of which consist of self-attention and feed-forward layers. The self-attention layer is the core of the transformer: it maps a query, key, and value to an output, computed as a weighted sum of the values. A multi-head mechanism is also implemented to jointly attend to information from different positions, achieving remarkable results[9].

Vision Transformer (ViT) is the first model to get rid of all standard convolutions. It splits the input images into patches of fixed size, then applies linear embeddings to each patch and sends them through the transformer as tokens[10]. A learnable classification embedding is prepended to the patch sequence, and position embeddings are added to keep the positional information. TransPose was proposed to combine the translation equivariance of CNNs with the transformer's ability to capture long-range dependencies[25]. It uses a convolutional neural network as its backbone, followed by a transformer encoder layer that captures global dependencies in the high-level feature maps.

The structure of the Swin Transformer is similar to that of ResNet and consists of four stages. By limiting self-attention to non-overlapping local windows, rather than conducting global self-attention among all tokens, the Swin Transformer reduces the computational cost by a large margin[11], making it applicable to downstream tasks. A shifted window partitioning approach is also applied to enable information exchange between the non-overlapping windows.
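The sketch below illustrates what non-overlapping window partitioning (and the cyclic shift used for shifted windows) can look like; the feature-map and window sizes are illustrative, not the paper's configuration.

```python
import torch

def window_partition(x, window_size):
    """Split a (H, W, C) feature map into non-overlapping windows.

    Returns a tensor of shape (num_windows, window_size*window_size, C); attention
    is then computed independently inside each window instead of over all H*W tokens.
    """
    H, W, C = x.shape
    x = x.view(H // window_size, window_size, W // window_size, window_size, C)
    x = x.permute(0, 2, 1, 3, 4).contiguous()
    return x.view(-1, window_size * window_size, C)

feat = torch.randn(56, 56, 96)
windows = window_partition(feat, window_size=7)     # (64, 49, 96)
# Shifted windows: cyclically roll the map by half a window before partitioning,
# so information can flow across window boundaries in the next block.
shifted = torch.roll(feat, shifts=(-3, -3), dims=(0, 1))
print(windows.shape, window_partition(shifted, 7).shape)
```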

Image Feature Fusion: Previous studies of feature fusion methods have established that multi-level features can be extracted and fused[14]. HyperNet[15] and YOLO9000[16] concatenate the representations before prediction, while Spatial Pyramid Pooling[17] and CFENet[18] investigated multi-scale filters. Furthermore, the image pyramid is well known for its ability to combine and utilize features extracted at different scales and for its scale invariance. Felzenszwalb et al.[13] proposed an efficient and accurate system based on the pictorial structures framework; they obtained feature pyramids by repeated smoothing and sub-sampling at different levels, and then computed scores at different positions and scales.

In 2017, Lin et al.[19] described Feature Pyramid Networks for object detection. This innovative and seminal work pioneered a new approach to exploiting the inherent multi-scale features by adopting a pyramidal hierarchy of convolutional networks. They built a feature pyramid with strong semantics from low to high levels by leveraging the pyramidal shape of the ConvNet's[20] feature hierarchy. The pyramid includes a bottom-up pathway, a top-down pathway, and lateral connections that merge feature maps of the same spatial size.

3 Method and Approach

We propose the Swin-Pose model, which utilises a multi-method approach combining Swin Transformer blocks and feature pyramid fusion. The overview of our proposed model architecture is shown in Fig. 1, which follows the design of the large version of the Swin Transformer (Swin-L). The advantage of the Swin Transformer is that it allows us to extract important information and long-range dependencies between joints from the images. Before obtaining the embeddings with the linear layer, the input images are split into non-overlapping 4 × 4 patches. Each patch is treated as a "token", whose feature is the concatenation of its raw pixel RGB values. After the embeddings pass through the four stages, feature maps at different scales are generated.
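A minimal PyTorch sketch of this patch-splitting and linear-embedding step is given below, assuming the standard Swin configuration of 4 × 4 patches; the embedding dimension and input size shown are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each patch.

    Equivalent to a conv with kernel size == stride == patch_size: each 4x4x3
    patch (48 raw RGB values) is projected to an embed_dim-dimensional token.
    """
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, num_tokens, embed_dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # (1, 3136, 96): 56*56 tokens ready for the transformer stages
```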

Furthermore, the feature pyramid is one of the most practical ways of examining the inherent multi-scale features. It is applied to fuse the feature maps and produce the output for keypoint heatmap regression.

Figure 3: Architecture of the Feature Fusion Module. Left: four feature maps with different resolutions are generated by the down-sampling operations. Right: in order to fuse feature maps from different layers, bilinear up-sampling and 1 × 1 convolution are adopted to keep the channels and resolutions consistent at all times.

3.1 Swin-Pose Transformer Module

In our study, the vanilla Swin-Pose follows the settings of the Swin Transformer, setting the patch size to 4 × 4. The tokens are sent to a linear embedding layer, with a classification token prepended and position information encoded, and their dimension is projected from the 48 raw values per patch (4 × 4 × 3) to an arbitrary value C. A number of Swin Transformer blocks are applied to the embedded tokens and, together with the aforementioned linear embedding layer, make up "Stage 1". In order to obtain a hierarchical representation, the output of "Stage 1" is then sent through a patch merging layer, where neighbouring patches are merged together. The resolution is down-sampled by a factor of 2, the number of tokens is reduced by a factor of 4, and a linear layer changes the dimension from C to 2C. As shown in Fig. 1, Swin Transformer blocks connected by patch merging layers constitute three further stages of identical structure, whose output resolutions are H/8 × W/8, H/16 × W/16, and H/32 × W/32, respectively.
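To make the patch merging step concrete, the sketch below shows one way it can be implemented, assuming the standard Swin behaviour of concatenating 2 × 2 neighbouring tokens and projecting the 4C channels down to 2C; the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each group of 2x2 neighbouring tokens between stages.

    Resolution drops by 2x, token count by 4x, and a linear layer maps the
    concatenated 4*C features down to 2*C, as described for the Swin stages.
    """
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                      # x: (B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(x)                                      # (B, H/2, W/2, 2C)

out = PatchMerging(dim=96)(torch.randn(1, 56, 56, 96))
print(out.shape)   # (1, 28, 28, 192)
```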

The proposed transformer block is derived from the original swin transformer block. It consists of a LayerNorm (LN) layer, a multi-head self-attention layer, a 2-layer multilayer perceptron with Gaussian Error Linear Unit (GELU) activation, and residual connections. More importantly, the standard multi-head self-attention layer is replaced by window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA) in consecutive blocks, as shown in Fig. 2.
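The block structure just described can be sketched as follows; the attention module here is a generic multi-head attention standing in for W-MSA / SW-MSA, so this is an assumption-laden skeleton rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Skeleton of one block: LN -> (shifted-)window MSA -> residual,
    then LN -> 2-layer MLP with GELU -> residual.
    """
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # placeholder for W-MSA / SW-MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                                     # x: (B, num_tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # attention + residual
        return x + self.mlp(self.norm2(x))                    # MLP + residual

print(SwinBlockSketch(dim=96, num_heads=3)(torch.randn(1, 49, 96)).shape)
```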

After extracting features from the vanilla Swin-Pose transformer backbone, a heatmap regression layer is applied to estimate the locations of all keypoints. This vanilla Swin-Pose structure was initially trained without an ImageNet pre-trained model. Surprisingly, our experiments obtained weak results, about 18% lower than previously reported models. This result may be explained by the fact that transformers require a large-scale dataset to achieve competitive performance. Although the Swin Transformer is powerful at extracting features from the connections between patches at different locations, the human pose estimation task heavily depends on the local features near each keypoint. To improve the local feature extraction performance of the Swin-Pose Transformer Module, we propose a feature fusion module before the heatmap regression layer, whose details are explained in the following subsection.

Figure 4: Top: Element-wise Sum. Bottom: Concatenation

3.2 Feature Fusion Module

There are 4 stages in the structure of the Swin Transformer, and the sizes of their outputs are H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32, respectively, where H and W are the height and width of the input image. In order to avoid situations where the model overfits to a specific scale and loses its generalization, we use a top-down pathway and lateral connections to fuse feature maps from different resolution levels, combining outputs with different resolutions and semantic information, as shown in Fig. 3. During the fusion procedure, a 1 × 1 convolution is first used to change the channels of the output from the upper stage. Then bilinear up-sampling is used to change the resolution while maintaining the number of channels, so that both the channels and the resolution match the output of the lower stage extracted through the lateral connection. We then combine them, repeat the procedure of 1 × 1 convolution and up-sampling twice more, obtain the final output with the highest resolution, and use it for the subsequent heatmap regression.
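A hedged sketch of one such fusion step is shown below: a 1 × 1 convolution aligns the channels of the upper-stage output, bilinear up-sampling aligns the resolution, and the two maps are then combined; the channel counts follow the Swin-L configuration but are otherwise illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_top_down(upper, lower, out_channels):
    """Fuse a coarse feature map (upper stage) into a finer one (lower stage).

    A 1x1 conv aligns the channel count, bilinear upsampling aligns the
    resolution, and the two maps are then combined (element-wise sum here).
    """
    align = nn.Conv2d(upper.shape[1], out_channels, kernel_size=1)   # randomly initialised for the sketch
    upper = align(upper)
    upper = F.interpolate(upper, size=lower.shape[-2:], mode="bilinear", align_corners=False)
    return upper + lower

# Illustrative stage outputs for a 384x384 input: (B, 1536, 12, 12) and (B, 768, 24, 24).
stage4 = torch.randn(1, 1536, 12, 12)
stage3 = torch.randn(1, 768, 24, 24)
print(fuse_top_down(stage4, stage3, out_channels=768).shape)   # (1, 768, 24, 24)
```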

As shown in Fig. 4, we adopt two different methods to examine the effects of multi-scale feature fusion. Method A is the element-wise sum: it generates new features by summing the original features, but some information from the original features is lost in the process. Method B concatenates the feature maps directly along the channel dimension, so all information is kept.
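The two operators can be illustrated in a couple of lines; the shapes here are arbitrary.

```python
import torch

a = torch.randn(1, 256, 24, 24)   # two feature maps with matching channels and resolution
b = torch.randn(1, 256, 24, 24)

summed = a + b                         # Method A: element-wise sum, channels stay at 256
concat = torch.cat([a, b], dim=1)      # Method B: concatenation, channels grow to 512
print(summed.shape, concat.shape)
```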

3.3 Analysis

We argue that the transformer structure is more suitable for human pose estimation tasks than convolutional neural networks, because CNNs use fixed-size filters to extract features from input images. Thus, they usually have only a fixed-size receptive field and lack the ability to capture relationships between far-apart pixels. However, this long-range dependency information is essential to human pose estimation tasks: it can help locate body joints that are not close to each other or are even occluded. With the help of the attention mechanism, the transformer calculates the similarity and relation between different pixels within its range and assigns weights accordingly, thus gaining the ability to extract long-range dependency information.

However, transformer models are generally larger than CNN models, and they require more training data to achieve comparable results. Besides the model we trained with pre-trained weights, we also trained this transformer-based model from scratch to determine how large the difference would be. From the experimental results, the AP of the model trained from scratch can be as much as 20% worse than that of the pretrained one, which proves the importance of pretraining.

4 Experiments

Dataset: The COCO person keypoint detection dataset[12] contains over 200,000 images and 250,000 person instances. In this dataset, the person instances are all labelled with 17 keypoints, consisting of 5 face landmarks and 12 body joints. Each keypoint is annotated with three values: the x coordinate, the y coordinate, and a flag indicating whether the keypoint is visible and labelled. Most of the persons are at medium or large scale, and a large number of them are only partially visible, which brings challenges to our task. The training set train2017 contains more than 150,000 people, and the evaluation set val2017 contains 5,000 images.

Evaluation Metric: The main objective of human pose estimation is to predict keypoints as close as possible to the ground truth, so a precise method to evaluate the predictions is needed. We adopt Object Keypoint Similarity (OKS)[12] to evaluate our model, which is inspired by object detection metrics and commonly used to evaluate human pose estimation models. It calculates the degree of match between the predicted and ground-truth keypoints, normalized by the person scale. The result lies in the range 0 to 1; a higher value means the prediction is closer to the ground truth.

We report the average precision (AP) and average recall (AR) scores following the standard COCO format, where AP is calculated as the mean average precision over 10 OKS thresholds (0.50, 0.55, 0.60, …, 0.90, 0.95); AP50, AP75, APM, APL, and AR are also recorded.
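The sketch below follows the standard COCO definition of OKS as we understand it; the per-keypoint constants and the example coordinates are illustrative, not values from the paper.

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity between predicted and ground-truth keypoints.

    pred, gt: (num_keypoints, 2) arrays of (x, y); visible: boolean mask;
    area: object segment area (person scale squared); k: per-keypoint constants.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)       # squared distances per keypoint
    per_kpt = np.exp(-d2 / (2.0 * area * k ** 2))
    return per_kpt[visible].mean()              # average over labelled keypoints only

# AP at a single OKS threshold is the precision of detections with OKS above it;
# COCO AP averages this over the thresholds 0.50, 0.55, ..., 0.95.
gt = np.array([[10.0, 10.0], [20.0, 30.0]])
pred = gt + 1.0
print(oks(pred, gt, visible=np.array([True, True]), area=900.0, k=np.array([0.026, 0.079])))
```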

Method  Backbone  Pretrain  Input size  #Params  GFLOPs  AP  AP50  AP75  APM  APL  AR
8-stage Hourglass  8-stage Hourglass  N  256 × 192  25.1M  14.3  66.9  −  −  −  −  −
CPN  ResNet-50  Y  256 × 192  27.0M  6.20  68.6  −  −  −  −  −
CPN + OHKM  ResNet-50  Y  256 × 192  27.0M  6.20  69.4  −  −  −  −  −
SimpleBaseline  ResNet-50  Y  256 × 192  34.0M  8.90  70.4  88.6  78.3  67.1  77.2  76.3
SimpleBaseline  ResNet-101  Y  256 × 192  53.0M  12.4  71.4  89.3  79.3  68.1  78.1  77.1
SimpleBaseline  ResNet-152  Y  256 × 192  68.6M  15.7  72.0  89.3  79.8  68.7  78.9  77.8
HRNet-W32  HRNet-W32  N  256 × 192  28.5M  7.10  73.4  89.5  80.7  70.2  80.1  78.9
HRNet-W32  HRNet-W32  Y  256 × 192  28.5M  7.10  74.4  90.5  81.9  70.8  81.0  79.8
HRNet-W48  HRNet-W48  Y  256 × 192  63.6M  14.6  75.1  90.6  82.2  71.5  81.8  80.4
SimpleBaseline  ResNet-152  Y  384 × 288  68.6M  35.6  74.3  89.6  81.1  70.5  79.7  79.7
HRNet-W32  HRNet-W32  Y  384 × 288  28.5M  16.0  75.8  90.6  82.7  71.9  82.8  81.0
HRNet-W48  HRNet-W48  Y  384 × 288  63.6M  32.9  76.3  90.8  82.9  72.3  83.4  81.2
Ours  Swin-L with concatenation  Y  384 × 384  197.9M  204.5  76.2  93.4  83.3  72.6  81.5  84.7
Ours  Swin-L with element-wise sum  Y  384 × 384  196.4M  202.6  76.3  93.5  83.4  72.5  81.7  84.7
Table 1: Comparison on the COCO validation set.

Settings: There are four Swin Transformer variants of different sizes, namely Swin Tiny (Swin-T), Swin Small (Swin-S), Swin Base (Swin-B), and Swin Large (Swin-L). In order to achieve the best result, we choose Swin-L here, with weights pretrained on the ImageNet-22K dataset, which contains 14.2 million images and 22K classes. The input image size is set to 384 × 384, the embedding dimension for the first stage is set to 192, the window size is set to 12, and the numbers of blocks and heads for the four stages are set to {2, 2, 18, 2} and {6, 12, 24, 48}, respectively. The total model size and computational complexity are considerably larger than those of the Swin-B model (see Table 2). We set the batch size to 10, which is limited by memory, and the number of epochs to 240. The Adam optimizer is used, with the initial learning rate experimentally set to 5e-5; the learning rate is reduced by a factor of 10 when the epoch reaches 60, 120, and 160.
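The training schedule described above could be configured roughly as follows in PyTorch; `model` is a placeholder and the per-batch training loop is elided, so this is a sketch of the optimizer and learning-rate schedule only.

```python
import torch

# Mirrors the schedule described above: Adam with lr 5e-5, decayed by 10x at
# epochs 60, 120, and 160, for 240 epochs in total.
model = torch.nn.Linear(10, 10)             # placeholder for the actual Swin-Pose network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.1)

for epoch in range(240):
    # ... one training epoch over COCO train2017 with batch size 10 ...
    optimizer.step()                        # stand-in for the per-batch updates
    scheduler.step()                        # step the schedule once per epoch
```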

Results: We compare our results with previous state-of-the-art human pose estimation models, including SimpleBaseline, Cascaded Pyramid Network (CPN), Hourglass, and HRNet, and show the results in Table 1. As the table shows, our two models, with concatenation and element-wise sum respectively, both achieve better results on AP50 and AP75, and obtain a 3.5-point improvement on AR. Besides that, Swin-L with element-wise sum achieves the same AP as HRNet-W48.

5 Ablation Study

Input Resolution: An ablation study was conducted to determine how input resolution affects performance. The results of our model based on Swin-L are presented in Table 2, with two fusion methods and input resolutions set to 224 × 224 and 384 × 384, respectively. As shown in the table, comparing experiments 5, 6, 7 and 8, increasing the input resolution improves the average precision by 3.4 points for both fusion methods, at the price of significantly larger GFLOPs.

Model Size: Experiments were also conducted with backbones of different sizes, Swin-B and Swin-L. As shown in Table 2, comparing experiments 3 and 5, the larger backbone contributes only limited improvements. Similarly, no significant increase was found in experiment 7 compared with experiment 4. In addition, the computation cost and model size increase significantly, as reflected in the growing FLOPs and number of parameters. Overall, these results indicate a trade-off between precision and computation cost: a smaller backbone should be selected if the computation cost is of higher priority.

ID Backbone Pretrain1 Fusion Method2 Input size #Params GFLOPs AP AR
1 Swin-S 1K Concat 224 × 224 49.5M 11.9 71.6 75.2
2 Swin-S 1K Sum 224 × 224 49.1M 11.7 71.7 75.2
3 Swin-B 22K Concat 224 × 224 88.0M 15.9 72.6 75.9
4 Swin-B 22K Sum 224 × 224 87.3M 15.7 72.2 75.7
5 Swin-L 22K Concat 224 × 224 197.9M 24.2 72.8 76.1
6 Swin-L 22K Concat 384 × 384 197.9M 204.5 76.2 84.7
7 Swin-L 22K Sum 224 × 224 196.4M 23.6 72.9 76.3
8 Swin-L 22K Sum 384 × 384 196.4M 202.6 76.3 84.7
  • Pretrain: models pretrained on ImageNet-1K or ImageNet-22K.

  • Fusion Method: Concat (Concatenation), Sum (Element-wise Sum)

Table 2: Results of different fusion methods, input resolution, and model size

Fusion Approach: We study how the two fusion methods, namely element-wise sum and concatenation along channels, affect the performance of our models. It has been demonstrated that concatenation retains as much information as possible during the fusion process. However, the number of parameters for concatenation is greater than for element-wise sum, which results in a higher computational cost. An independent ablation experiment was therefore performed to determine whether there was a difference between the two fusion methods.

In summary, the data in Table 2 show that the computational cost of concatenation is slightly higher than that of element-wise sum, while the two methods achieve comparable accuracy.

6 Conclusion

In this paper, we have explored the application of transformers to human pose estimation and presented two slightly different versions of our model, both based on the Swin-L Transformer structure. The hierarchical structure of the Swin Transformer inspired us to adopt a feature pyramid, fusing the feature maps output by different stages to build a model suitable for objects of different scales. The two proposed models achieve competitive results compared to the well-known HRNet-W48 and other CNN-based models.

Although the results demonstrate that transformer models can be successfully applied to human pose estimation tasks with competitive performance, the main challenge is that transformer models are huge, with almost two times more parameters and six times more FLOPs. Further studies need to be carried out to reduce the computation cost while maintaining outstanding performance compared with CNN models.

References