A Context-and-Spatial Aware Network for Multi-Person Pose Estimation

05/14/2019 ∙ by Dongdong Yu, et al. ∙ ByteDance Inc.

Multi-person pose estimation is a fundamental yet challenging task in computer vision. Both rich context information and spatial information are required to precisely locate the keypoints for all persons in an image. In this paper, a novel Context-and-Spatial Aware Network (CSANet), which integrates both a Context Aware Path and Spatial Aware Path, is proposed to obtain effective features involving both context information and spatial information. Specifically, we design a Context Aware Path with structure supervision strategy and spatial pyramid pooling strategy to enhance the context information. Meanwhile, a Spatial Aware Path is proposed to preserve the spatial information, which also shortens the information propagation path from low-level features to high-level features. On top of these two paths, we employ a Heavy Head Path to further combine and enhance the features effectively. Experimentally, our proposed network outperforms state-of-the-art methods on the COCO keypoint benchmark, which verifies the effectiveness of our method and further corroborates the above proposition.


1 Introduction

Multi-person pose estimation aims at locating body keypoints (eyes, ears, nose, shoulders, elbows, wrists, hips, knees, ankles, etc.) for all persons from an image. It is fundamental and important to a variety of computer vision applications, such as human action recognition [25] and human re-identification [13].

Figure 1: An occluded example from MS-COCO test-dev2017 dataset. (a) is the original input image. (b) is the prediction result of our CSANet.

With the help of deep convolutional neural networks, remarkable progress has been made in multi-person pose estimation [8, 24, 4, 28, 16, 27, 29, 18, 17, 9, 1, 5]. Nevertheless, many challenging cases remain, such as occluded keypoints, invisible keypoints, changes of viewpoint, and crowded backgrounds. Both rich context information and spatial information are essential to locate keypoints accurately. For example, we can capture the global context of the image by enlarging the receptive field and fusing different context information. The context information represents the global position of the human and indicates the contextual relationship between keypoints, and thus holds potential for accurately estimating occluded and invisible keypoints, e.g. the left knee of the man in Figure 1. Adding spatial information provides details that are useful for refining the positions of keypoints.

Figure 2: Illustration of our framework. (a) ResNet backbone. (b) Components of the Context Aware Path (CAP). (c) Components of the Spatial Aware Path (SAP). (d) Components of the Heavy Head Path (HHP). Note that the reduction operation is implemented by a convolution layer.

With these observations, we aim at leveraging both the context information and spatial information to improve multi-person pose estimation. Toward this end, we present a novel Context-and-Spatial Aware Network (CSANet) to extract effective context information and spatial information, as shown in Figure 2. Built on a backbone network, our architecture has three parts: the Context Aware Path (CAP), the Spatial Aware Path (SAP), and the Heavy Head Path (HHP). For the Context Aware Path, we design a structure supervision module to learn part-aware context information and adopt the Atrous Spatial Pyramid Pooling (ASPP) module [2] to capture context information of different receptive fields. Thus, sufficient context information is extracted to infer the occluded and invisible keypoints. In the Spatial Aware Path, we preserve the spatial size and encode rich spatial information for accurate localization, which also shortens the information path from low-level features to high-level features. On top of these two paths, the Heavy Head Path (HHP) adaptively fuses the context information with the spatial information and employs a small fully convolutional network to recalibrate the fused features.

Based on our Context-and-Spatial Aware Network (CSANet), we address the multi-person pose estimation problem in a top-down pipeline. First, we apply a human detection network to obtain human bounding boxes. Then, the CSANet is adopted to locate body keypoints for each human bounding box. Next, ablation studies are conducted to demonstrate the effectiveness of the CAP, SAP, and HHP paths. Finally, we evaluate our proposed network on the COCO keypoint benchmark [15], and the experimental results show that our proposed CSANet outperforms existing state-of-the-art methods.

In summary, there are five contributions in our paper:

  • We design a Context Aware Path to learn part-aware context information and context information of different receptive fields for inferring challenging keypoints.

  • We design a Spatial Aware Path to preserve spatial detail information for refining the position of keypoints.

  • We design a Heavy Head Path to adaptively fuse the context information and the spatial information.

  • Based on the Context Aware Path, Spatial Aware Path and Heavy Head Path, we propose a Context-and-Spatial Aware Network which can make full use of the context information and spatial information.

  • We evaluate our method on the COCO keypoint benchmark, and achieve state-of-the-art performance in multi-person pose estimation.

2 Related Work

Recently, lots of approaches based on Convolution Neural Network (CNN) have achieved high performance on different benchmarks of multi-person pose estimation [8, 24, 4, 28, 16, 27, 29, 18, 17, 9, 1]. Several principles proposed for designing networks in scene parsing are also effective for our work, in which we pay more attention to the issue of context information extraction and spatial information preservation [3, 31, 2, 26, 30].

Multi-person Pose Estimation  Recently, significant progress has been made in multi-person pose estimation with the development of CNNs. In [1], a real-time Convolutional Pose Machine (CPM) is proposed to locate the body keypoints and assemble them into individuals in the image using learned part affinity fields (PAFs). Based on the ResNet backbone, the Simple Baseline Network (SBN) [28] employs a deconvolution head network to predict human keypoints. However, the spatial detail information, which is useful for refining keypoint localization, is inevitably lost along the information propagation path in CPM and SBN. To avoid this problem, Newell et al. [16] integrate associative embedding with a stacked hourglass network to produce joint score heatmaps and embedded tags for grouping joints into individual people. The Cascaded Pyramid Network (CPN) [4] adopts the GlobalNet to learn a good feature representation and the RefineNet to further recalibrate the feature representation for accurate keypoint localization. The Hourglass network and the Cascaded Pyramid Network preserve spatial features at each resolution by adding skip layers and capture sufficient context information for accurately inferring both simple and challenging keypoints.

As mentioned in [4, 27], context information represents the global position of the human and indicates the contextual relationship between keypoints, while spatial information provides details that are useful for refining the positions of keypoints. Thus, a well-designed network should take both context information and spatial information into account. Several principles proposed for designing networks in scene parsing (e.g. preserving spatial information and capturing diverse context information) can also be effective for the multi-person pose estimation task.

Context Information  Generally, as the network goes deeper, the high-level features hold potential to capture context information with a large receptive field. Alternatively, Atrous Spatial Pyramid Pooling (ASPP) [2] and the Pyramid Pooling Module (PPM) [31] are widely used to extract abundant context information in scene parsing. The ASPP module employs atrous convolutions with different dilation rates together with a global pooling module to capture diverse context information. The PPM module fuses features under different pyramid pooling scales to obtain global contextual prior information.

Spatial Information  Consecutive down-sampling or pooling operations in the convolution neural network may lose the spatial information which is crucial to predicting the detailed output in scene parsing and pose estimation tasks. Some existing methods [3, 31, 2, 26] use the dilated convolution to preserve spatial size of the feature map. Other methods employ the feature pyramid network [11], U-shape method [21], Hourglass network [17] to shorten the information path between low-level features and high-level features. By using such skip-connected network structure, we can recover a certain extent of spatial information.

In our paper, we aim at leveraging both the context information and spatial information to improve multi-person pose estimation. Compared with existing methods, we design a Structure Supervision module to capture part-aware context information and adopt the ASPP module to capture context information of different receptive fields in the Context Aware Path. We preserve abundant spatial information by adding skip layers from the low-level features to the high-level features with the Spatial Aware Path. Experimentally, we find that adding the global pooling features of low-level features can further help accurately locate the keypoints. Moreover, we propose a simple yet effective Heavy Head Path to fuse the context aware features and spatial aware features.

3 Method

In this section, we propose a novel Context-and-Spatial Aware Network (CSANet) to make full use of the context information and spatial information. An overview of the proposed CSANet is illustrated in Figure 2. We first briefly review the structure of Simple Baseline Network. Then, we introduce Context-Aware-Path, Spatial-Aware-Path, and Heavy Head Path in detail. Finally, we describe the complete network architecture of Context-and-Spatial Aware Network, as well as training and inference details.

3.1 Revisiting Simple Baseline Network

ResNet is the most commonly used backbone network for image classification, scene parsing, and human pose estimation. Simple Baseline Network (SBN) attaches a DeconvHead (consisting of three deconvolution layers) after the last convolution stage of the ResNet; each deconvolution layer has 256 filters and a stride of 2. After the DeconvHead, a convolution layer is added to predict the heatmaps for all keypoints.

Figure 3: Illustration of the Context Aware Path. (a) is the Structure Supervision (SS) module. (b) is the Atrous Spatial Pyramid Pooling (ASPP) module. Face GT denotes the ground-truth score maps of the face related keypoints. Upper GT represents the ground-truth score maps of the upper limb related keypoints. Lower GT indicates the ground-truth score maps of the lower limb related keypoints. The three auxiliary losses shown in the figure are the losses of the face, upper limb, and lower limb related keypoints, respectively.

3.2 Context Aware Path

Motivation  In the task of multi-person pose estimation, most of the modern methods tackle it as a dense regression issue. Due to the lack of abundant context information, the regression could not handle the prediction of invisible keypoints, occluded keypoints, and other complex situations. To this end, we design a Context Aware Path to extract abundant context information which represents the global position of human and the contextual relationship of body keypoints.

The Context Aware Path contains two modules: Structure Supervision module and Atrous Spatial Pyramid Pooling module, as shown in Figure 3. The Structure Supervision (SS) module is to encode part-aware context information, and the Atrous Spatial Pyramid Pooling (ASPP) module is to capture diverse context information of different receptive fields.

Structure Supervision  Body structural priors can provide valuable cues to infer the locations of the hidden body parts from the visible ones. Motivated by this, we perform multi-part supervision at each part prediction branch to obtain part-aware features. Compared with the Simple Baseline Network, we replace the DeconvHead module with the Structure Supervision module. In this paper, we divide human body into three parts for COCO keypoint dataset: face part (ears, eyes, and nose), upper limb part (shoulders, elbows, and wrists) and lower-limb part (hips, knees, and ankles). Then, we combine the face aware features, upper limb aware features, and lower limb aware features with the hybrid context features for capturing diverse part-aware context information.
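The three-part split described above can be sketched in a few lines. This is only an illustration: the keypoint names follow the standard COCO annotation order, while `PART_KEYWORDS` and `split_into_parts` are hypothetical names introduced here and are not taken from the paper.

```python
# Keypoint names in the standard COCO annotation order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip", "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

# Hypothetical lookup: which joint types belong to which supervised part.
PART_KEYWORDS = {
    "face": ("nose", "eye", "ear"),
    "upper_limb": ("shoulder", "elbow", "wrist"),
    "lower_limb": ("hip", "knee", "ankle"),
}


def split_into_parts(names):
    """Return {part: [keypoint indices]} for the face / upper / lower split."""
    parts = {part: [] for part in PART_KEYWORDS}
    for index, name in enumerate(names):
        for part, keywords in PART_KEYWORDS.items():
            if any(name.endswith(keyword) for keyword in keywords):
                parts[part].append(index)
                break
        else:
            raise ValueError(f"keypoint {name!r} does not belong to any part")
    return parts


parts = split_into_parts(COCO_KEYPOINTS)
print({part: len(indices) for part, indices in parts.items()})
```

With the COCO ordering, each part covers a contiguous block of channels, so each auxiliary branch can be supervised with a simple slice of the full ground-truth score maps.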

Atrous Spatial Pyramid Pooling  Atrous convolution is a powerful operation for adjusting the field-of-view in order to capture multi-scale information. The Atrous Spatial Pyramid Pooling module, which adopts atrous convolutions with different dilation rates to extract diverse context information, has been widely used in scene parsing. In the CAP path, we simply add this module after the structure supervision module to capture context information of different receptive fields. In our experiment, we set the dilation rates to 1, 6, 12, and 18.

As shown in Figure 3, three branches are employed to extract part-aware context features, which are supervised by the face, upper limb, and lower limb ground-truth score maps, respectively. Then the ASPP module is adopted to further recalibrate the fusion of the part-aware features and the hybrid context features.
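To see why the dilation rates 1, 6, 12, and 18 give different receptive fields, the spatial extent covered by a single atrous kernel can be computed with the standard relation k_eff = k + (k - 1)(r - 1). The sketch below assumes a 3x3 kernel, which is the usual choice in ASPP but is not stated in the text; `effective_kernel_size` is a hypothetical helper name.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one atrous convolution kernel.

    A kernel with `kernel_size` taps and `dilation` - 1 holes between
    neighbouring taps spans k + (k - 1) * (r - 1) input positions.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Dilation rates from the text; the 3x3 kernel is an assumption of this sketch.
for rate in (1, 6, 12, 18):
    extent = effective_kernel_size(3, rate)
    print(f"dilation {rate:2d}: covers {extent} x {extent} positions")
```

Each branch keeps the same number of weights while looking at a progressively wider window, which is what lets the module mix context from several scales at the same cost per branch.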

3.3 Spatial Aware Path

Motivation  In the task of multi-person pose estimation, spatial information can provide detailed information which is useful for refining the positions of keypoints. Some existing methods [1, 8, 18] attempt to estimate keypoints from the heatmaps of which the resolution is 1/8 of the input image. Yet, higher resolution information should be added to provide more spatial details. To this end, we extract the spatial aware features from the lower stages of the backbone network to preserve abundant spatial information of which the resolution is 1/4 of the input resolution.

In our proposed network, we use ResNet [7] as the backbone model. According to the size of the feature maps, the ResNet can be divided into five stages. The ResNet encodes more detailed spatial information in the lower stages, whereas it extracts stronger context information in the higher stages. Based on this observation, we design our Spatial Aware Path to capture finer spatial information, as shown in Figure 4. First, we use a convolution path of two convolution layers to recalibrate the last feature maps of the second stage and obtain the spatial features, denoted as Conv2 Features. Conv3 Features are captured by another convolution path (same as for Conv2 Features) on the last feature maps of the third stage and resized to the resolution of Conv2 Features. Next, we use a series of operations (global pooling, two convolution layers, and a resize to the resolution of Conv2 Features) to generate Conv2GP Features. Finally, we concatenate the Conv2 Features, Conv3 Features, and Conv2GP Features, and reduce the concatenated features to 256-dimensional feature maps with a convolution layer. Experimentally, we find that adding the Conv2GP Features can further help accurately locate the keypoints.
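The resolutions involved in the Spatial Aware Path can be traced with simple shape bookkeeping. The sketch below relies on the standard ResNet output strides (2, 4, 8, 16, 32 for the five stages) and makes two assumptions that are not from the text: every branch is taken to output 256 channels, and the 256x192 input is chosen only for illustration. `stage_size` and `sap_shapes` are hypothetical helper names.

```python
# Standard ResNet output stride of each of the five stages.
STAGE_STRIDES = {1: 2, 2: 4, 3: 8, 4: 16, 5: 32}


def stage_size(height, width, stage):
    """Feature-map size at a ResNet stage (input assumed divisible by 32)."""
    stride = STAGE_STRIDES[stage]
    return height // stride, width // stride


def sap_shapes(height, width, branch_channels=256):
    """Trace (channels, height, width) through the Spatial Aware Path.

    branch_channels is an assumption of this sketch, not a value from the text.
    """
    conv2_hw = stage_size(height, width, 2)   # 1/4 of the input resolution
    conv3_hw = stage_size(height, width, 3)   # 1/8, resized to 1/4 afterwards
    return {
        "conv2": (branch_channels, *conv2_hw),
        "conv3_before_resize": (branch_channels, *conv3_hw),
        "conv3": (branch_channels, *conv2_hw),
        "conv2gp_pooled": (branch_channels, 1, 1),  # after global pooling
        "conv2gp": (branch_channels, *conv2_hw),    # resized back to 1/4
        "concat": (3 * branch_channels, *conv2_hw),
        "reduced": (256, *conv2_hw),                # 256 channels, as in the text
    }


for name, shape in sap_shapes(256, 192).items():
    print(f"{name:20s} {shape}")
```

All three branches end at the resolution of the Conv2 Features, so concatenation along the channel axis is well defined before the final reduction to 256 channels.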

Figure 4: Illustration of the Spatial Aware Path. (a) is the branch to obtain Conv2 Features. (b) is the branch to obtain Conv2GP Features. (c) is the branch to obtain Conv3 Features.

3.4 Heavy Head Path

Heavy head, namely a stack of convolution layers, is quite effective for bounding box prediction [14]. In our paper, we find it is also useful in the dense regression of keypoints’ score maps. This path first concatenates the context aware features extracted by the Context Aware Path and the spatial aware features captured by the Spatial Aware Path, and then uses a small fully convolutional network (FCN) to regress the body keypoints’ ground-truth score maps. After concatenating the context aware features and spatial aware features, the fusion parameters of these features can be adaptively learned by the network. The FCN consists of N + 1 convolution layers: the first N layers are intermediate convolution layers, and the last convolution layer outputs K score maps, one per keypoint. According to the number of keypoints in the COCO keypoint benchmark [15], K is set to 17 in our paper.

3.5 Network Architecture, Training and Inference

With the Context Aware Path, Spatial Aware Path, and Heavy Head Path, we propose a novel Context-and-Spatial Aware Network (CSANet) for multi-person pose estimation as illustrated in Figure 2.

Network Architecture  We use the pre-trained ResNet as our backbone network. First, we operate the Context Aware Path (CAP) on the last feature maps of the final stage to capture part-aware context information and diverse context information of different receptive fields. Then, we employ the Spatial Aware Path (SAP) on the last feature maps of the second and third stages to encode spatial information features. In our paper, the CAP path encodes abundant context information, while the SAP path provides rich spatial information. They are complementary to each other for higher performance on keypoint localization. Given the different feature representations of the CAP path and the SAP path, we concatenate these features, rather than simply summing them, to fuse the context information features and spatial information features. Finally, the Heavy Head Path (HHP) is operated on the concatenated features, which encode both rich context information and spatial information, to accurately predict the keypoints’ heatmaps.

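Network Training  In our paper, we use score maps to represent the locations of body keypoints. For each person, the ground-truth locations are labeled as $z = \{z_k\}_{k=1}^{K}$, where $z_k = (x_k, y_k)$ denotes the coordinate of the $k$th keypoint ($z_1, \dots, z_5$ denote the keypoints of the face, $z_6, \dots, z_{11}$ denote the keypoints of the upper limb, and $z_{12}, \dots, z_{17}$ denote the keypoints of the lower limb) of the person. The ground-truth score map $G_k$ is defined as,

$$G_k(p) = \exp\left(-\frac{\lVert p - z_k \rVert_2^2}{2\sigma^2}\right) \tag{1}$$

in which $p$ denotes the location, and $\sigma$ is set to 2 for the smaller input size and to 3 for the larger input size. In our Context Aware Path, we use three auxiliary loss functions to supervise the learning of part-aware context information. The face aware branch, upper limb aware branch, and lower limb aware branch predict the face related keypoints’ heatmaps (i.e., $\hat{G}^{f}_k$, $k = 1, \dots, 5$), the upper limb related keypoints’ heatmaps (i.e., $\hat{G}^{u}_k$, $k = 6, \dots, 11$), and the lower limb related keypoints’ heatmaps (i.e., $\hat{G}^{l}_k$, $k = 12, \dots, 17$), respectively. The Heavy Head Path predicts the holistic body keypoints’ heatmaps (i.e., $\hat{G}_k$, $k = 1, \dots, 17$). Then, the loss of our CSANet is,

$$L = L_{body} + \lambda_f L_{face} + \lambda_u L_{upper} + \lambda_l L_{lower} \tag{2}$$
$$L_{body} = \frac{1}{M} \sum_{m=1}^{M} \sum_{k=1}^{17} \lVert \hat{G}_k^{(m)} - G_k^{(m)} \rVert_2^2 \tag{3}$$
$$L_{face} = \frac{1}{M} \sum_{m=1}^{M} \sum_{k=1}^{5} \lVert \hat{G}_k^{f,(m)} - G_k^{(m)} \rVert_2^2 \tag{4}$$
$$L_{upper} = \frac{1}{M} \sum_{m=1}^{M} \sum_{k=6}^{11} \lVert \hat{G}_k^{u,(m)} - G_k^{(m)} \rVert_2^2 \tag{5}$$
$$L_{lower} = \frac{1}{M} \sum_{m=1}^{M} \sum_{k=12}^{17} \lVert \hat{G}_k^{l,(m)} - G_k^{(m)} \rVert_2^2 \tag{6}$$

where $M$ is the number of samples, and $\lambda_f$, $\lambda_u$, and $\lambda_l$ are the loss weight parameters.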
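The training objective can be sketched for a single sample as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the score map is taken to be an unnormalised Gaussian centred on the keypoint, the loss is taken to be a summed squared error, and the names `gaussian_heatmap`, `squared_error`, `combined_loss`, and `PART_SLICES` are hypothetical.

```python
import math

# Channel ranges of the three supervised parts in COCO keypoint order.
PART_SLICES = {"face": slice(0, 5), "upper": slice(5, 11), "lower": slice(11, 17)}


def gaussian_heatmap(height, width, center_x, center_y, sigma):
    """Score map with an unnormalised Gaussian centred on the keypoint (assumed form)."""
    return [
        [
            math.exp(-((x - center_x) ** 2 + (y - center_y) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def squared_error(pred_maps, gt_maps):
    """Sum of squared differences over two equally shaped lists of 2-D maps."""
    total = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total += (p - g) ** 2
    return total


def combined_loss(pred, gt_maps, weights=(1.0, 1.0, 1.0)):
    """Body loss plus weighted face / upper-limb / lower-limb auxiliary losses.

    pred maps "body" to 17 score maps and "face", "upper", "lower" to the
    score maps of the corresponding part branch; gt_maps holds 17 score maps.
    """
    loss = squared_error(pred["body"], gt_maps)
    for weight, part in zip(weights, ("face", "upper", "lower")):
        loss += weight * squared_error(pred[part], gt_maps[PART_SLICES[part]])
    return loss


# Toy usage: 17 keypoints on an 8x6 grid, sigma = 2 as for the smaller input.
gt = [gaussian_heatmap(8, 6, k % 6, k % 8, 2.0) for k in range(17)]
perfect = {"body": gt, **{part: gt[s] for part, s in PART_SLICES.items()}}
print(combined_loss(perfect, gt))
```

Setting all three weights to zero leaves only the body term, which corresponds to training without the part supervision.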

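Network Inference  During inference, we obtain the predicted body keypoint locations from the score maps predicted by the Heavy Head Path by taking the location with the maximum score:

$$\hat{z}_k = \arg\max_{p} \hat{G}_k(p), \quad k = 1, \dots, K \tag{7}$$

where $\hat{G}_k$ is the predicted score map of the $k$th keypoint and $\hat{z}_k$ is its predicted location.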
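Taking the location with the maximum score amounts to a 2-D argmax over each predicted score map. A minimal sketch, with `argmax_2d` as a hypothetical helper name and score maps represented as nested lists:

```python
def argmax_2d(score_map):
    """Return (x, y, score) of the highest-scoring location in a 2-D score map.

    Ties are resolved in favour of the first location in row-major order.
    """
    best_x, best_y, best_score = 0, 0, score_map[0][0]
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


def decode_keypoints(score_maps):
    """Apply the argmax to every keypoint channel."""
    return [argmax_2d(score_map)[:2] for score_map in score_maps]


toy_maps = [
    [[0.1, 0.2, 0.0], [0.0, 0.9, 0.3]],  # peak at x=1, y=1
    [[0.7, 0.1, 0.0], [0.0, 0.2, 0.3]],  # peak at x=0, y=0
]
print(decode_keypoints(toy_maps))
```

The returned coordinates are in score-map pixels and still have to be mapped back to the original image through the crop and resize applied to the person box.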

4 Experiment

This section is organized in accordance with the progress of our experiments. Firstly, we describe the experimental setup. Then, we decompose our proposed network to reveal the effect of each component on MS-COCO val2017 dataset. Last but not least, we compare our network with previous state-of-the-art methods on MS-COCO val2017 dataset and MS-COCO test-dev2017 dataset.

4.1 Experimental Setup

Dataset and Evaluation Metric

  We train and evaluate our Context-and-Spatial Aware Network (CSANet) on MS-COCO 2017 dataset [12]. Our models are only trained on the MS-COCO train2017 dataset including 57K images and 150K person instances, no extra data involved. There are 5000 images (MS-COCO val2017 dataset) for validation and 20K images (MS-COCO test-dev2017 dataset) for testing. Following previous work [1, 4, 28], evaluation is conducted using the Object Keypoints Similarity (OKS) based mAP, where OKS defines the difference between predicted person keypoints and ground-truth person keypoints.
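For reference, the Object Keypoint Similarity between one predicted pose and one ground-truth pose can be sketched as below. The per-keypoint constants are the ones published with the COCO keypoint evaluation; the function name `object_keypoint_similarity` and the plain-list data layout are assumptions of this sketch, and the small epsilon that the official evaluation code adds to the area is omitted.

```python
import math

# Per-keypoint constants of the COCO keypoint evaluation, in COCO keypoint order.
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
]


def object_keypoint_similarity(pred, gt, visibility, area, sigmas=COCO_SIGMAS):
    """OKS between predicted and ground-truth keypoints of one person.

    pred, gt:    lists of (x, y) pairs
    visibility:  COCO visibility flags; keypoints with flag 0 are ignored
    area:        ground-truth object area, used as the squared scale (must be > 0)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), flag, sigma in zip(pred, gt, visibility, sigmas):
        if flag <= 0:
            continue
        squared_distance = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        total += math.exp(-squared_distance / (2.0 * area * k * k))
        count += 1
    return total / count if count else 0.0


gt_pose = [(10.0 + i, 20.0 + i) for i in range(17)]
shifted = [(x + 2.0, y) for x, y in gt_pose]
print(object_keypoint_similarity(gt_pose, gt_pose, [2] * 17, area=900.0))
print(object_keypoint_similarity(shifted, gt_pose, [2] * 17, area=900.0))
```

OKS plays the role that IoU plays in object detection: AP is computed by thresholding OKS at several values and averaging the resulting precision.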

Cropping Strategy  The person ground-truth box (or detection box) is adjusted to a fixed aspect ratio, e.g. height : width = 4 : 3. Then, we crop the image and resize it to a fixed resolution. By default, the network uses the smaller of the two input resolutions evaluated in this paper.
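Adjusting a box to a fixed aspect ratio can be sketched as follows. The text does not say whether the box is expanded or shrunk; this sketch assumes the shorter side is expanded about the box centre so that no part of the person is cut off, and `fix_aspect_ratio` is a hypothetical helper name.

```python
def fix_aspect_ratio(x, y, width, height, ratio_h=4.0, ratio_w=3.0):
    """Expand a box (top-left x, y, width, height) to height : width = ratio_h : ratio_w.

    The box centre is kept fixed and only the too-short side grows
    (an assumption of this sketch).
    """
    center_x = x + width / 2.0
    center_y = y + height / 2.0
    target = ratio_w / ratio_h  # desired width / height
    if width > target * height:
        height = width / target  # box too wide: grow the height
    else:
        width = target * height  # box too tall: grow the width
    return center_x - width / 2.0, center_y - height / 2.0, width, height


print(fix_aspect_ratio(0, 0, 30, 80))  # tall box: width grows to 60
print(fix_aspect_ratio(0, 0, 90, 40))  # wide box: height grows to 120
```

After this step every crop has the same aspect ratio as the network input, so resizing does not distort the person.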

Data Augmentation Strategy  We use random flip, random scale, and random rotation in training. Each image is flipped with a probability of 0.5. The rotation angle and the scale factor are each sampled at random from a fixed range.

Person Detector  For the MS-COCO val2017 dataset, we use the human detection boxes provided by [28] to make a fair comparison. These detection boxes are generated by a Faster-RCNN detector [20] with a human detection AP of 56.4 on MS-COCO val2017. For the MS-COCO test-dev2017 dataset, we adopt the SNIPER detector [23] with a human detection AP of 58.1 on MS-COCO test-dev2017.

Method                                AP    AP.5  AP.75  AP(M)  AP(L)
ResNet-50+DeconvHead (SBN)            70.6  -     -      -      -
ResNet-50+CAP                         71.1  88.8  78.6   67.7   77.6
ResNet-50+CAP+SAP                     71.7  88.8  78.8   68.2   78.4
ResNet-50+CAP+SAP+HHP (Our CSANet)    72.5  89.4  79.4   69.1   79.4
Table 1: Results on the MS-COCO val2017 dataset. Based on the ResNet-50, we gradually add the Context Aware Path (CAP), Spatial Aware Path (SAP), and Heavy Head Path (HHP) for the ablation study. The first row is the performance of the Simple Baseline Network (SBN), the state-of-the-art network on the COCO keypoint benchmark.

Method                        AP
ResNet-50+DeconvHead (SBN)    70.6
ResNet-50+SS                  71.0
ResNet-50+SS+ASPP             71.1
Table 2: Ablation study on our proposed Context Aware Path. SBN: Simple Baseline Network. SS: Structure Supervision module. ASPP: Atrous Spatial Pyramid Pooling.

Training Details  We train our proposed CSANet using the Adam [10] algorithm with a mini-batch of 128 (32 per GPU) for 140 epochs. The initial learning rate is 1e-3 and is divided by 10 at the 90th epoch and again at the 120th epoch. Training a ResNet-50 based model takes about 52 hours on four NVIDIA Titan V100 GPUs. All code is implemented in PyTorch [19]. In this paper, our network is trained with ResNet-50, ResNet-101, and ResNet-152 backbones, which are initialized with the publicly released models pre-trained on ImageNet [22]. We also conduct experiments with two different input resolutions.
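The step schedule described above can be written as a small function. The values (initial rate 1e-3, drops at epochs 90 and 120, 140 epochs in total) come from the text; reading "dropped by 10" as a division by 10, counting epochs from 0, and the helper name `learning_rate` are assumptions of this sketch.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(90, 120), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` at every milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


for epoch in (0, 89, 90, 119, 120, 139):
    print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.0e}")
```

In PyTorch, which the paper reports using, the same behaviour is usually obtained with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90, 120], gamma=0.1)`.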

Testing Details  A top-down pipeline is adopted for estimating the multi-person pose. First, we use a person detector to generate the human bounding boxes. Then, we apply our CSANet to generate the pose prediction heatmaps for each bounding box. Following previous work [28, 4], we average the heatmaps of the original image and the heatmaps of the flipped image to get the final prediction. A quarter offset in the direction from the highest response to the second highest response is used to obtain the final location.
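The two test-time steps can be sketched as below. This is a literal reading of the description rather than the authors' code: the flipped prediction is assumed to be mirrored back and to have its left/right channels swapped before averaging, and the quarter offset is applied as a shift of 0.25 pixels along the unit vector from the highest to the second-highest response. All function names are hypothetical.

```python
import math


def average_with_flip(heatmaps, flipped_heatmaps, flip_pairs):
    """Average a prediction with the prediction made on the mirrored image.

    The flipped maps are mirrored back horizontally and left/right channels
    are swapped first (an assumed, but standard, alignment step).
    """
    restored = [[row[::-1] for row in score_map] for score_map in flipped_heatmaps]
    for left, right in flip_pairs:
        restored[left], restored[right] = restored[right], restored[left]
    return [
        [[(a + b) / 2.0 for a, b in zip(row_a, row_b)] for row_a, row_b in zip(map_a, map_b)]
        for map_a, map_b in zip(heatmaps, restored)
    ]


def top_two(score_map):
    """Locations (x, y) of the highest and second-highest responses."""
    cells = [(score, x, y) for y, row in enumerate(score_map) for x, score in enumerate(row)]
    cells.sort(key=lambda cell: cell[0], reverse=True)
    (_, x1, y1), (_, x2, y2) = cells[0], cells[1]
    return (x1, y1), (x2, y2)


def quarter_offset_location(score_map, offset=0.25):
    """Shift the peak by `offset` pixels towards the second-highest response."""
    (x1, y1), (x2, y2) = top_two(score_map)
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy)  # non-zero: the two locations are distinct cells
    return x1 + offset * dx / norm, y1 + offset * dy / norm


toy_map = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.6], [0.0, 0.0, 0.0]]
print(quarter_offset_location(toy_map))
```

Many public pose-estimation codebases approximate the same idea by comparing the two neighbours of the peak along each axis and shifting by a quarter pixel towards the larger one.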

4.2 Ablation Study

In this subsection, we decompose our proposed CSANet step by step to reveal the effect of each component. In the following experiments, all comparisons are evaluated on the MS-COCO val2017 dataset. Unless otherwise specified, the backbone is ResNet-50 and all models use the default input size.

4.2.1 Component Analysis

In Table 1, we show our ablation study, starting from the Simple Baseline Network [28] (SBN, the state-of-the-art network) and gradually incorporating all components. When we replace the DeconvHead of SBN with our Context Aware Path (CAP), the AP improves from 70.6 to 71.1. When we further add the Spatial Aware Path, the AP reaches 71.7. Finally, we adopt the Heavy Head Path (HHP) to fuse the context aware information and spatial aware information to predict the pose heatmaps. After adding the HHP module, the AP further improves from 71.7 to 72.5.

4.2.2 Ablation Study on Context Aware Path

Unlike SBN, we replace the DeconvHead (consisting of three deconvolution layers) with our Context Aware Path. The CAP path consists of two modules: the Structure Supervision module and the ASPP module.

Ablation for Structure Supervision  We use the Structure Supervision module, which performs multi-part supervision, to extract part-aware context information. As shown in Table 2, this module improves the AP from 70.6 to 71.0, a gain of 0.4 AP. In our paper, the three loss weight parameters are all set to 1. We also conduct an experiment that sets all the loss weight parameters to 0, in which case the AP is 70.8.

Ablation for Atrous Spatial Pyramid Pooling  To capture diverse context information of different receptive fields, we apply the ASPP module on the features extracted by Structure Supervision module. As shown in Table 2, this further improves the performance by 0.1.

Method                                AP
ResNet-50+CAP                         71.1
ResNet-50+CAP+Conv2                   71.4
ResNet-50+CAP+Conv2+Conv3             71.5
ResNet-50+CAP+Conv2+Conv2GP           71.4
ResNet-50+CAP+Conv2+Conv3+Conv2GP     71.7
Table 3: Ablation study on our proposed Spatial Aware Path. Conv2: features captured from the second ResNet stage. Conv3: features captured from the third ResNet stage. Conv2GP: features captured from global pooling of the second ResNet stage.

4.2.3 Ablation Study on Spatial Aware Path

While the Context Aware Path pays attention to the context information, the Spatial Aware Path focuses on the spatial information, which provides details for refining the positions of keypoints. By integrating the CAP path and the SAP path, the AP improves from 71.1 to 71.7, as shown in Table 3.

Design Choices of Spatial Aware Path  Here, we compare different design strategies for the SAP path, as shown in Table 3. We compare the following implementations: 1) Conv2 Features. 2) Conv2 Features + Conv3 Features. 3) Conv2 Features + Conv2GP Features. 4) Conv2 Features + Conv3 Features + Conv2GP Features. The Conv2 Features, Conv2GP Features, and Conv3 Features are described in detail in Section 3.3.

In each case, we reduce the spatial aware features to 256 dimensions and integrate them with the context aware features to predict the person pose. As shown in Table 3, adding spatial detail information with Conv2 Features, Conv3 Features, and Conv2GP Features together yields a gain of 0.6 AP.

Figure 5: Some results from MS-COCO test-dev 2017 dataset of our method.

4.2.4 Ablation Study on Heavy Head Path

This path first concatenates the context aware features extracted by the Context Aware Path and the spatial aware features captured by the Spatial Aware Path, then uses a small fully convolutional network (FCN) to regress the keypoints’ ground-truth score maps. The FCN consists of N convolution layers followed by one final convolution layer that outputs the score maps. As shown in Table 4, this improves the AP from 71.7 to 72.5 when N is 3, 5, or 6. In our CSANet, N is set to 3 to reduce computation.

N     0     1     2     3     4     5     6
AP    71.7  71.9  72.0  72.5  72.4  72.5  72.5
Table 4: Ablation study on our proposed Heavy Head Path. N denotes the number of convolution layers used in this module. The baseline is the ResNet-50+CAP+SAP model.

Method        Input Size    Backbone     AP
Our CSANet                  ResNet-50    72.5
Our CSANet                  ResNet-50    74.1
Table 5: Ablation study of different input sizes.

Method        Backbone      Input Size    AP
Our CSANet    ResNet-50                   74.1
Our CSANet    ResNet-101                  74.4
Our CSANet    ResNet-152                  75.1
Table 6: Ablation study of different backbone networks.

Method               Backbone     Input Size    AP
8-stage Hourglass    ResNet-50                  66.9
8-stage Hourglass    ResNet-50                  67.1
CPN                  ResNet-50                  69.4
CPN                  ResNet-50                  71.6
SBN                  ResNet-50                  70.6
SBN                  ResNet-50                  72.2
Our CSANet           ResNet-50                  72.5
Our CSANet           ResNet-50                  74.1
Table 7: Comparison with Hourglass [17], CPN [4], and SBN [28] on the MS-COCO val2017 dataset. Hourglass: a classical model. CPN: Cascaded Pyramid Network, COCO2017 keypoint winner. SBN: Simple Baseline Network, the state-of-the-art network.

4.2.5 Ablation Study on Data Pre-processing

Here, we investigate the performance of our CSANet with different input sizes. As the input image size increases, more spatial information is fed into our network. This improves the AP from 72.5 at the smaller input size to 74.1 at the larger input size, a gain of 1.6 AP, as shown in Table 5.

4.2.6 Ablation Study on Backbone Network

As in most computer vision tasks, a deeper backbone model performs better. We conduct experiments with ResNet-50, ResNet-101, and ResNet-152 backbones at the larger input size. Table 6 shows that the AP increases by 0.3 from ResNet-50 to ResNet-101 and by 1.0 from ResNet-50 to ResNet-152.

Method        Backbone          Input Size    AP    AP.5  AP.75  AP(M)  AP(L)  AR
CMU-Pose      -                 -             61.8  84.9  67.5   57.1   68.2   66.5
Mask-RCNN     ResNet-50-FPN     -             63.1  87.3  68.7   57.8   71.4   -
G-RMI         ResNet-101                      64.9  85.5  71.3   62.3   70.0   69.7
CPN           ResNet-Inception                72.1  91.4  80.0   68.7   77.2   78.5
CPN+          ResNet-Inception                73.0  91.7  80.9   69.5   78.1   79.0
SBN           ResNet-50                       70.2  90.9  78.3   67.1   75.9   75.8
SBN           ResNet-50                       71.3  91.0  78.5   67.3   77.9   76.6
SBN           ResNet-101                      71.1  91.1  79.3   68.3   76.7   76.8
SBN           ResNet-101                      73.2  91.4  80.9   69.7   79.5   78.6
SBN           ResNet-152                      71.9  91.4  80.1   68.9   77.4   77.5
SBN           ResNet-152                      73.8  91.7  81.2   70.3   80.0   79.1
Our CSANet    ResNet-50                       71.9  91.0  79.9   68.7   77.5   78.7
Our CSANet    ResNet-50                       73.5  91.4  80.8   69.9   79.4   79.7
Our CSANet    ResNet-101                      72.3  91.2  80.2   69.3   77.6   79.1
Our CSANet    ResNet-101                      74.1  91.6  81.6   70.7   79.8   80.4
Our CSANet    ResNet-152                      72.8  91.4  80.9   69.8   78.3   79.6
Our CSANet    ResNet-152                      74.5  91.7  82.1   71.2   80.2   80.7
Table 8: Comparisons on the MS-COCO test-dev2017 dataset. Top: methods in the literature, trained only on the COCO training dataset. CMU-Pose: COCO2016 keypoint winner [1]. Mask-RCNN: a classical model [6]. G-RMI: a classical model [18]. CPN: Cascaded Pyramid Network, COCO2017 keypoint winner [4]. SBN: Simple Baseline Network, the state-of-the-art network [28]. "+" means the method uses ensemble models. Bottom: our single-model results, trained only on the COCO training dataset.

4.3 Comparison with State-of-the-art Methods

In this subsection, we compare our proposed CSANet with state-of-the-art methods on MS-COCO val2017 dataset and MS-COCO test-dev2017 dataset.

Results on MS-COCO val2017  As shown in Table 7, we compare our network with an 8-stage Hourglass (a classical model), CPN (Cascaded Pyramid Network, COCO2017 winner), and SBN (Simple Baseline Network, the state-of-the-art network). All these methods use a top-down pipeline. For generating human bounding boxes, the person detection AP of Hourglass and CPN is 55.3. The person detection AP of SBN is 56.4, and we use the human bounding boxes provided by SBN to make a fair comparison.

Compared with Hourglass [17], our CSANet improves the AP by 5.6 points at the smaller input size. Our network outperforms CPN [4] by 3.1 AP at the smaller input size and by 2.5 AP at the larger input size. Compared with SBN [28], our CSANet improves the AP from 70.6 to 72.5 at the smaller input size and from 72.2 to 74.1 at the larger input size. Our method thus improves on the previous best results by 1.9 AP at both input sizes.
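The margins quoted above follow directly from Table 7 and can be re-derived with a short check. The AP values are copied from Table 7; the labels "smaller" and "larger" for the two input sizes of each method and the helper name `gain_over` are assumptions of this sketch.

```python
# AP on MS-COCO val2017 from Table 7, keyed by method and input-size label.
TABLE_7_AP = {
    "Hourglass": {"smaller": 66.9, "larger": 67.1},
    "CPN": {"smaller": 69.4, "larger": 71.6},
    "SBN": {"smaller": 70.6, "larger": 72.2},
    "CSANet": {"smaller": 72.5, "larger": 74.1},
}


def gain_over(baseline, size, table=TABLE_7_AP):
    """AP gain of CSANet over a baseline at one input size, rounded to 0.1."""
    return round(table["CSANet"][size] - table[baseline][size], 1)


print("vs Hourglass (smaller):", gain_over("Hourglass", "smaller"))
print("vs CPN:", gain_over("CPN", "smaller"), gain_over("CPN", "larger"))
print("vs SBN:", gain_over("SBN", "smaller"), gain_over("SBN", "larger"))
```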

Results on MS-COCO test-dev 2017  Table 8 lists the results of modern state-of-the-art methods in the literature on the MS-COCO test-dev2017 dataset. For generating human bounding boxes, CPN uses a human detector with a person detection AP of 62.9 on the COCO minival split. SBN adopts a human detector with a person detection AP of 60.9 on the COCO test-dev dataset. We use the SNIPER detector with a person detection AP of 58.1 on the COCO test-dev dataset.

Compared with CMU-Pose [1], G-RMI [18], and Mask-RCNN [6], our method achieves significant improvement. Even though CPN [4] uses a stronger ResNet-Inception backbone, our CSANet’s single model (ResNet-152) achieves 74.5 AP at the larger input size and outperforms CPN’s single model by 2.4 AP. As mentioned before, SBN [28] uses a more powerful human detector with a person detection AP of 60.9 on the COCO test-dev dataset, which is 2.8 higher than that of our human detector. Yet, our model still improves on SBN by 0.7 AP in multi-person pose estimation at the larger input size. Figure 5 illustrates some results generated using our method.

5 Conclusion

Aiming at fully leveraging both context information and spatial information to improve multi-person pose estimation, we propose a novel Context-and-Spatial Aware Network (CSANet) in this paper. From the architecture perspective, we design a Context Aware Path to capture part-aware information and diverse context information of different receptive fields, which indicates the contextual relationship between keypoints. Then, we propose a Spatial Aware Path to preserve detail information for refining the positions of keypoints. Next, a Heavy Head Path is proposed to further combine and recalibrate the context aware features and spatial aware features. These modules are trained as a whole to maximally complement each other. We also conduct a series of ablation studies to validate the effectiveness of each module. Finally, our experimental results show that our proposed CSANet can significantly improve the performance on the COCO keypoint benchmark.

References