View Invariant Human Body Detection and Pose Estimation from Multiple Depth Sensors

by Walid Bekhtaoui, et al.

Point cloud based methods have produced promising results in areas such as 3D object detection in autonomous driving. However, most recent point cloud work focuses on single depth sensor data, whereas less work has been done on indoor monitoring applications, such as operating room monitoring in hospitals or indoor surveillance. In these scenarios multiple cameras are often used to tackle occlusion problems. We propose an end-to-end multi-person 3D pose estimation network, Point R-CNN, using multiple point cloud sources. We conduct extensive experiments to simulate challenging real world cases, such as individual camera failures, various target appearances, and complex cluttered scenes, with the CMU Panoptic dataset and the MVOR operating room dataset. Unlike most previous methods, which attempt to use multiple sensor information by building complex fusion models that often generalize poorly, we take advantage of the efficiency of concatenating point clouds to fuse the information at the input level. We also show that our end-to-end network greatly outperforms cascaded state-of-the-art models.




1 Introduction

Point cloud analysis has been studied widely due to its important applications in autonomous driving [1, 2, 3], augmented reality [4], and medicine [5]. However, most point cloud based work has focused on applications with only one depth sensor, or with multiple sensors facing outwards and covering non-overlapping regions. Less work has been reported that tackles the 3D object detection problem in indoor multi-camera settings, which is a very common and important scenario in applications such as operating room monitoring [6], indoor social interaction studies [7], and indoor surveillance.

Using multiple cameras reduces the amount of occlusion, but fusing information from multiple sensors is very challenging and still an open problem in the community. Traditionally, fusing 2D and/or 3D images coming from multiple sensors is handled by complex models that either process each sensor separately and fuse the decisions at a later stage [8, 9, 10], or fuse the information earlier at the feature level [11]. Such algorithms tend to suffer from poor generalization due to the complexity of the model and heavy assumptions. In comparison, we argue that using point clouds from multiple sources is a more straightforward and natural alternative, which provides better generalization under various challenging real world scenarios.

In this work we study the multi-person 3D pose estimation problem for indoor multi-camera settings using point clouds and propose an end-to-end multi-person 3D pose estimation network, “Point R-CNN”. We test our method by simulating scenarios such as camera failures and camera view changes on the CMU Panoptic dataset [7], which was collected in the CMU Panoptic studio to capture various social interaction events with 10 depth cameras. We further test our method on the challenging real world dataset MVOR [12], which was captured in hospital operating rooms with 3 depth cameras. The experiments demonstrate the robustness of the algorithm in challenging scenes and show good generalization ability on multi-sensor point clouds. Furthermore, we show that the proposed end-to-end network outperforms the baseline cascaded model by a large margin.

The contributions of our work are as follows:

  1. We propose to use point clouds as the only source for fusing data from multiple cameras and show that our proposed method is efficient and generalizes well to various challenging scenarios.

  2. We propose an end-to-end multi-person 3D pose estimation network, Point R-CNN, based solely on point clouds. Through extended experiments, we show that the proposed network outperforms the cascaded state-of-the-art models.

  3. We present extensive experimental results simulating challenging indoor multi-camera application problems, such as repeated camera failures and view changes.

2 Related work

2.1 Point cloud based approaches

Processing point cloud data is challenging due to its unstructured nature. In order to use Convolutional Neural Network (CNN) like methods, point clouds are usually pre-processed and projected to some ordered space, such as in [13, 14, 15]. Projecting point clouds is also helpful for more complex localization tasks such as 3D object detection. For example, Zhou et al. [1] proposed to use voxelized point clouds to detect 3D objects using a Region Proposal Network, which simultaneously produces voxel class labels and regression results for bounding box “anchors”.

Alternatively, more effort has been put into directly using point clouds. For example, Qi et al. proposed PointNet [16] and PointNet++ [17] to classify and segment point clouds in their native space. These networks are the building blocks for many later point cloud based algorithms. Recently, Ge et al. [18] proposed to directly use point clouds for hand pose estimation. We further discuss this approach in section 2.3.

Besides the above mentioned trends, on-the-fly point cloud transformations have also been explored. Su et al. [19] proposed using Bilateral Convolutional Layers to filter the point cloud and further process the data without explicitly pre-processing the input point cloud. Li et al. [20] proposed to use an $\mathcal{X}$-transform to gradually transform the point clouds into a higher order representation without pre-processing steps.

Our method is inspired by the first two approaches: we detect people in an ordered 3D space using voxelized input, and we detect human body joints on the segmented point cloud.

2.2 Multi-sensor applications

As discussed earlier, combining the information of multiple sensors is challenging. Conventionally the information is fused at a later stage, either by fusing the decision or fusing the feature space. Hedge et al. [21] proposed to use one network per sensor and fuse the result at a later stage to perform object recognition. Xu et al. proposed PointFusion[11] to fuse point cloud features and 2D images to detect 3D bounding boxes.

However, complex fusion models present weaknesses in terms of generalization. For example, in indoor surveillance systems, the number of cameras and their positions vary widely, making it challenging to generalize. Hence we argue that combining multiple point cloud sources is a more natural and effective alternative, where the effort of fusing information is very low in comparison. Furthermore, in case of camera failures, the structure of the input does not change; only the density, i.e. the number of points, does. This can easily be accounted for by training the point cloud network on input clouds with variable density.

2.3 3D human pose estimation

3D human pose estimation is a challenging problem. Most recent works focused on using depth images or combining 2D and 3D information to detect landmarks for a single or for multiple persons[22, 23, 24, 25].

Rhodin et al. [24] proposed to use weakly-supervised training methods to circumvent the annotation problem. This is achieved by assuming consistency in the pose across different views. During testing the pose is estimated based on a single camera input. More recently, Moon et al. [26] proposed to use voxelized depth images to detect 3D hands and estimate single human pose from depth images. Haque et al. [27] describe a viewpoint invariant 3D human pose estimation network, where the input depth image is either a top view or a front view. They refine the landmark positions using a Recurrent Neural Network and tested view transfer by training on the front view and testing on the side view. While this relates to our problem and presents encouraging results, the assumption of having fixed viewpoints does not apply to our application.

3 Approach

The work we are presenting is built upon the VoxelNet paradigm [1]. Our end-to-end framework can be split into two parts: (1) instance detection and (2) instance processing.

Figure 2: Overview of the Point R-CNN framework.

3.1 Architecture

Our framework is outlined in Figure 2. The architecture can be split into several modules, which are (1) per-voxel feature extraction, (2) voxel features aggregation, (3) instance detection, and finally (4) instance-wise processing, i.e. in our case point to point regression. In the following sections we describe these modules in detail.

3.2 Input preprocessing

Figure 3: The input of our architecture is the concatenation of the point clouds from several sensors. During this step we associate each point of our scene with a voxel, which allows us to easily access a point and its neighborhood.

The input to our algorithm is the unstructured point cloud $P = \bigcup_c P_c$, where $P_c$ denotes the point cloud acquired from sensor $c$. The point clouds acquired from all the sensors are assumed to be time-synchronized and registered within the same world coordinate system. We further assume our world coordinate system to be axis-aligned with the ground plane, with one coordinate axis being the ground normal.

As a first step we define an axis-aligned cuboid working space resting on top of the ground plane. All points outside this volume are discarded at this time.
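The input-level fusion and cropping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `fuse_and_crop`, the 4x4 sensor-to-world matrices in `extrinsics`, and the box corners are hypothetical, and the text only states that the clouds are already registered in one world frame.

```python
import numpy as np

def fuse_and_crop(clouds, extrinsics, box_min, box_max):
    """Concatenate per-sensor clouds in world coordinates and crop to a box.

    clouds:     list of (N_c, 3) arrays, each in its own sensor frame
    extrinsics: list of (4, 4) sensor-to-world transforms (assumed to be known)
    box_min, box_max: opposite corners of the axis-aligned working space
    """
    world = []
    for pts, T in zip(clouds, extrinsics):
        # p_world = R @ p_sensor + t, written for row vectors
        world.append(pts @ T[:3, :3].T + T[:3, 3])
    fused = np.concatenate(world, axis=0)
    # Discard every point outside the working volume
    inside = np.all((fused >= box_min) & (fused < box_max), axis=1)
    return fused[inside]
```

Because fusion is a plain concatenation, a missing sensor only shortens the list of clouds; the shape of the data passed on stays the same.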

In order to reduce the impact of variable point cloud densities across the scene and to speed up processing, we down-sample the point cloud within our working space using a voxel grid filter. This filter merges all points that fall within one voxel into their centroid, such that after filtering each voxel contains at most one point.
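A voxel grid filter of this kind can be sketched with NumPy as follows. The function name `voxel_grid_filter` and the parameter `leaf_size` are illustrative choices, and the sketch assumes a non-empty input cloud.

```python
import numpy as np

def voxel_grid_filter(points, leaf_size):
    """Replace all points that share a voxel of edge `leaf_size` by their centroid."""
    keys = np.floor(points / leaf_size).astype(np.int64)
    # `inverse` maps every point to the row of its voxel in the unique key list
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    np.add.at(sums, inverse, points)            # accumulate coordinates per voxel
    counts = np.bincount(inverse, minlength=n_voxels)
    return sums / counts[:, None]               # one centroid per occupied voxel
```

After filtering, regions seen by many sensors and regions seen by only one end up with comparable point density.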

We then subdivide the working space into a different regular axis-aligned grid of (larger) voxels as follows. The origin of our working space is denoted by $(x_0, y_0, z_0)$, the dimensions of each voxel by $v_x$, $v_y$ and $v_z$, and the number of voxels along each axis by $n_x$, $n_y$ and $n_z$. We also choose the maximum number of points $T$ to be considered per voxel.

Each point $p = (x, y, z)$ of the point cloud is now assigned to the voxel it falls in, denoted by the directional voxel indices $i_x$, $i_y$ and $i_z$, where $i_x = \lfloor (x - x_0) / v_x \rfloor$ etc.

The voxel grid is then flattened by assigning a linear index $id$ to each voxel, computed from $(i_x, i_y, i_z)$ and the grid dimensions $(n_x, n_y, n_z)$. After this grouping, each point is assigned the $id$ of the corresponding voxel.

Using the $id$ we previously computed for each point, we can find the list of unique $id$s of every voxel containing at least one point. Since we already sampled and shuffled the whole point cloud, we just have to take for each voxel the first $T$ points with this $id$ and put those points in a tensor of size $K \times T \times 3$, 3 being our input dimension and $K$ being the total number of voxels in the scene.

Instead of using the world coordinates of each point, we use its position relative to its corresponding voxel. This prevents the network from learning global aspects of the scene as opposed to the desired local structures.

We pad with zeros the voxels which do not have enough points.
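The grouping steps above can be sketched as follows. Several details are assumptions made for illustration only: the flattening order of the linear index (x fastest), measuring relative positions from the voxel's minimum corner, and the names `group_points` and `max_pts`. The sketch also assumes all points already lie inside the working space.

```python
import numpy as np

def group_points(points, origin, voxel_size, grid_shape, max_pts, seed=0):
    """Group points into a zero-padded (K, max_pts, 3) tensor of voxel-relative coordinates."""
    origin = np.asarray(origin, dtype=float)
    voxel_size = np.asarray(voxel_size, dtype=float)
    n_x, n_y, n_z = grid_shape
    rng = np.random.default_rng(seed)
    points = points[rng.permutation(len(points))]   # shuffle once up front

    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    # Assumed flattening order (x fastest); the original formula is not reproduced here
    vid = idx[:, 0] + n_x * (idx[:, 1] + n_y * idx[:, 2])
    # Assumed reference point: the minimum corner of the voxel
    rel = points - (origin + idx * voxel_size)

    n_voxels = n_x * n_y * n_z
    out = np.zeros((n_voxels, max_pts, 3))          # zeros act as padding
    fill = np.zeros(n_voxels, dtype=np.int64)
    for v, r in zip(vid, rel):
        if fill[v] < max_pts:                       # keep the first max_pts points per voxel
            out[v, fill[v]] = r
            fill[v] += 1
    return out, fill
```

The returned `fill` array holds the number of real points per voxel, which is convenient if padded entries need to be masked later.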

3.3 Instance detection

Now that our scene is defined, we can regress the bounding cylinder for each instance (i.e. person) in the scene. Our approach to do this is inspired by previous work on 2D instance detection and segmentation [28, 29].

Per-voxel feature extraction.

Figure 4: Per-voxel feature extraction pipeline. Each point is processed using VFE [1] and receives information from the other points in the same voxel. Finally, element-wise max pooling lets us retrieve a feature vector per voxel.

The first part of our architecture uses 2 stacked voxel feature encoding (VFE) layers [1] in order to learn a voxel-wise set of features which is invariant to point permutation. Those VFE layers are efficiently implemented as 2D convolutions with kernels of size 1. We use a similar notation as in [1] for the VFE layers. To represent the i-th VFE layer, we use VFE-i($c_{in}$, $c_{out}$), with $c_{in}$ and $c_{out}$ being respectively the dimension of the input features and the dimension of the output features of the layer. As in [1], the VFE layer will transform its input into a feature vector of dimension $c_{out}/2$ before doing the point-wise concatenation, which then yields the output of dimension $c_{out}$.

The VFE layers are VFE-1(3, 32) and VFE-2(32, 64), followed by a fully connected layer with input and output dimension 64 right before the element-wise max pooling. Having the VFE layers instead of solely using PointNet [16] helps add neighborhood information to the points (the neighborhood being defined as every point in the same voxel). After the max pooling of the last layer, the output size is $K \times 64$, with $K$ the number of voxels. We can thus reshape our output in order to retrieve a 3D image of size $n_x \times n_y \times n_z \times 64$, where $n_x$, $n_y$ and $n_z$ are the numbers of voxels along each axis.

In this work we process every voxel. However, this could be sped up as shown in [1] by only processing non-empty voxels.
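To make the data flow of the stacked VFE layers concrete, here is a simplified NumPy sketch with random weights. It is not the network used in the experiments: batch normalization, the final 64-to-64 fully connected layer, and any masking of zero-padded points are left out, and the names `vfe_layer` and `encode` are illustrative.

```python
import numpy as np

def vfe_layer(x, weight, bias):
    """One simplified voxel feature encoding step.

    x:      (K, T, c_in) features of T points in each of K voxels
    weight: (c_in, c_out // 2) point-wise linear map shared by all points
    bias:   (c_out // 2,)
    returns (K, T, c_out)
    """
    pointwise = np.maximum(x @ weight + bias, 0.0)       # linear + ReLU per point
    pooled = pointwise.max(axis=1, keepdims=True)        # element-wise max over the voxel
    tiled = np.broadcast_to(pooled, pointwise.shape)
    return np.concatenate([pointwise, tiled], axis=-1)   # point-wise concatenation

def encode(voxels, params):
    (w1, b1), (w2, b2) = params
    h = vfe_layer(voxels, w1, b1)    # VFE-1(3, 32)
    h = vfe_layer(h, w2, b2)         # VFE-2(32, 64)
    return h.max(axis=1)             # one 64-dimensional feature vector per voxel

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 16)), np.zeros(16)),
          (rng.normal(size=(32, 32)), np.zeros(32))]
voxels = rng.normal(size=(8, 5, 3))  # K = 8 voxels with T = 5 points each
features = encode(voxels, params)    # shape (8, 64)
```

Since every operation is either applied per point or is a maximum over the points of a voxel, reordering the points inside a voxel leaves `features` unchanged.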

In order to aggregate information from the scene and learn multi-scale features, we use a DenseUNet [30]. The first step makes our network invariant to point permutation. Then, by working on voxels instead of point clouds, we go from an unstructured representation to a structured one, to which classic deep learning methods are easier to apply. With these two steps we provide additional neighborhood information to each point at voxel and scene level. This is more efficient than processing the whole point cloud and looking for the k closest neighbors, and it gives better results than just processing the whole point cloud as a set of patches.

Instance detection.

After the feature aggregation part we take the output and feed it to two parallel heads: one for the per-voxel classification (doing the detection) and one for the per-voxel bounding cylinder regression.

With the classification branch we want to find the voxels which should be used for the bounding cylinder regression. This is done by classifying each voxel into one of two classes: containing the top of a cylinder or not. At the moment we assume that each voxel can only have one bounding cylinder. This could be extended by having several bounding cylinders per voxel, or a second network for refinement.

Since there are not many point cloud datasets where the instance segmentation and/or detection are provided together with the 3D joints of each instance we decided to work on datasets that provide at least the point cloud and the joint positions. The cylinders are then defined based on the joint positions for each person.

We define each cylinder axis as being aligned with the person’s neck. The top of the cylinder is at the same height as the joint which is furthest from the ground (for this person). The radius of the cylinder is determined by the distance of the joint furthest from the neck axis of this person.
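The cylinder construction from joint annotations can be sketched as follows. The function name `bounding_cylinder` is hypothetical, and the sketch assumes the ground normal is the z axis (index 2), which is a choice made here for illustration.

```python
import numpy as np

def bounding_cylinder(joints, neck_index, up_axis=2):
    """Return (centre, top, radius) of a vertical cylinder around one person.

    joints:  (J, 3) joint positions in world coordinates
    up_axis: index of the ground-normal axis (assumed to be z here)
    """
    ground_axes = [a for a in range(3) if a != up_axis]
    centre = joints[neck_index, ground_axes]     # the axis passes through the neck
    top = joints[:, up_axis].max()               # joint furthest from the ground
    # Largest horizontal distance of any joint from the neck axis
    radius = np.linalg.norm(joints[:, ground_axes] - centre, axis=1).max()
    return centre, top, radius
```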

During training we use the ground truth of the classification to mask the voxels that should be used for back propagation, while at test and inference time we use the output of the classification branch to retrieve the voxels which contain the desired bounding cylinders, as in the R-CNN family of frameworks.

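The loss for this part is defined as:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

Here $p_i$ represents the output of the head of our network doing the classification and $t_i$ represents the output of the head doing the bounding cylinder regression. Their respective ground truth is represented by $p_i^*$ and $t_i^*$. $L_{cls}$ is a cross-entropy loss doing the classification of voxels. For the regression loss, we use $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where $R$ is the robust loss function (smooth L1).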


In the case of human detection only a relatively small number of voxels contain a cylinder to regress, so we need to re-sample our data. Instead of working with the entire output of our network, we use (i) every voxel containing an object across the batch, (ii) the same number of non-empty voxels chosen randomly across the batch, and (iii) the same number of voxels chosen randomly among all voxels in the batch. If there is no human in the whole batch, we just pick 32 random voxels. That way we maintain a bias towards "empty" voxels, we are sure to use the voxels with data in them, and we do not have to compute the loss on every empty voxel in our batch.
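One possible reading of this sampling scheme is sketched below. The function name `sample_voxels` and its arguments are hypothetical; whether voxels picked twice are de-duplicated is not stated in the text, so this sketch simply keeps them.

```python
import numpy as np

def sample_voxels(has_object, non_empty, rng, fallback=32):
    """Return the indices of the voxels that enter the loss.

    has_object: (V,) bool, voxel holds the top of a ground-truth cylinder
    non_empty:  (V,) bool, voxel holds at least one point
    """
    n_total = len(has_object)
    positives = np.flatnonzero(has_object)
    if len(positives) == 0:
        # No person in the whole batch: fall back to a few random voxels
        return rng.choice(n_total, size=min(fallback, n_total), replace=False)
    n = len(positives)
    occupied = np.flatnonzero(non_empty)
    from_occupied = rng.choice(occupied, size=min(n, len(occupied)), replace=False)
    from_all = rng.choice(n_total, size=min(n, n_total), replace=False)
    return np.concatenate([positives, from_occupied, from_all])
```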

The two terms of the loss are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$. In our current implementation, $N_{cls}$ and $N_{reg}$ are both set to the number of voxels used after the sampling, while $\lambda$ is set to 1.
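For reference, a detection loss with this structure can be sketched in NumPy. The smooth L1 function follows its usual definition (quadratic below 1, linear above). The binary form of the cross-entropy and the names `smooth_l1` and `detection_loss` are assumptions made for illustration.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 where |x| < 1, |x| - 0.5 elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """p: (N,) predicted probabilities, p_star: (N,) 0/1 labels,
    t, t_star: (N, D) predicted and target cylinder parameters."""
    eps = 1e-12
    cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    reg = smooth_l1(t - t_star).sum(axis=1)
    n = len(p)   # both normalizers equal the number of sampled voxels
    # Regression only counts for voxels whose label is positive
    return cls.sum() / n + lam * (p_star * reg).sum() / n
```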

We use a normalization analogous to the one presented in [1, Equation 1], i.e. the cylinder coordinates are normalized relative to the voxel size. By transforming the cylinders back into our world coordinate system we can extract a point cloud consisting of all points within each cylinder.
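Extracting the points of one detected instance then reduces to a simple geometric test. The sketch below assumes a vertical cylinder along the z axis whose bottom rests on the ground at height 0; both are illustrative assumptions, as is the name `points_in_cylinder`.

```python
import numpy as np

def points_in_cylinder(points, centre_xy, top, radius, bottom=0.0):
    """Select the points inside a vertical cylinder (z is assumed to be the up axis)."""
    horizontal = np.linalg.norm(points[:, :2] - np.asarray(centre_xy), axis=1)
    mask = (horizontal <= radius) & (points[:, 2] >= bottom) & (points[:, 2] <= top)
    return points[mask]
```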

3.4 Instance wise processing

Figure 5: In this last part, we use the predictions from our two branches to retrieve the bounding cylinder and extract the point cloud of each instance. These point clouds are then fed into a PointNet [16] in order to regress the joints.

Using the extracted point cloud for each cylinder, we can set up a batch of instances to be processed by the last stage of our framework. During training we take at most 32 persons across the frames in the minibatch, and for each person we sample at most 1024 points. If a person has too few points (the minimum is 32 in our case), we discard this instance.

If there are no extracted point clouds at all, i.e. there was no detection at all, we skip this whole part and just compute the loss previously mentioned. In our use case, we are doing joint regression from the point cloud using PointNet [16]. This framework could of course be modified to replace what should be regressed and how it is regressed. The only normalization we do on those point clouds is to put them in a sphere located at the middle of the cylinder with a radius of half the height of the cylinder. The final loss of our whole framework is:

$$L_{final} = L(\{p_i\}, \{t_i\}) + \gamma \frac{1}{N_{joint}} \sum_k L_{joint}(j_k, j_k^*)$$

Here, we take the loss of the first part, $L(\{p_i\}, \{t_i\})$, and add the loss of the regression part. The loss for the joint regression, $L_{joint}$, is the Mean Square Error loss calculated between the predicted joints $j_k$ and the ground truth joints $j_k^*$. As stated before, we sample instances across our batch for the regression part, at most 32 during training.

If there are too many persons we pick the persons to regress randomly across the batch. If there are not enough persons, we pad this batch with zeros. During testing, we use every detected bounding box instead. We add another balancing term $\gamma$ that we set to 1, $\lambda$ being set to 1 too.

$N_{joint}$ is the total number of joints to regress across our instances. We show that using end-to-end training on this whole framework improves the result compared to a two-stage solution trained separately.
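The per-instance preparation described in this section (point budget, minimum point count, sphere normalization) can be sketched as follows. The name `prepare_instance`, the z-up convention, and a cylinder bottom at ground height 0 are assumptions made for illustration.

```python
import numpy as np

def prepare_instance(points, centre_xy, top, bottom=0.0,
                     n_points=1024, min_points=32, rng=None):
    """Sample and normalize one person's points; return None if there are too few."""
    if len(points) < min_points:
        return None                      # instance is discarded
    if rng is None:
        rng = np.random.default_rng()
    if len(points) > n_points:
        keep = rng.choice(len(points), size=n_points, replace=False)
        points = points[keep]
    half_height = 0.5 * (top - bottom)
    # Sphere centred at the middle of the cylinder, radius = half the cylinder height
    centre = np.array([centre_xy[0], centre_xy[1], bottom + half_height])
    return (points - centre) / half_height
```

With this scaling, points on the cylinder axis map to heights between -1 and 1 regardless of how tall the person is.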

4 Experiments

To simulate real world scenarios we designed four challenging experiments and compared our results with the baseline method. In particular, we train stage-wise state-of-the-art 3D point cloud based object detection via VoxelNet[1] and a state-of-the-art point cloud based regression network, namely PointNet[16].

To the best of our knowledge no end-to-end solution is available for pure point cloud based 3D multi-person pose estimation. Thus for comparison we cascade the above mentioned state-of-the-art algorithms to perform each part, i.e. 3D person detection and per-person joint detection. Each part is trained separately and the best model is chosen for final evaluation.

In order to simulate camera failures in real life scenarios, the first three experiments are conducted with a random number of camera inputs, i.e. we randomly drop one or more of the camera inputs. Details are discussed in the respective experiment sections.
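A camera failure simulation of this kind can be sketched with the standard library. The function name `drop_cameras` and the `min_drop` parameter are hypothetical; the sketch assumes at least one camera is always kept, which the text does not state explicitly.

```python
import random

def drop_cameras(camera_ids, rng, min_drop=1):
    """Return the cameras that remain after randomly dropping some of them."""
    max_drop = len(camera_ids) - 1            # assumption: always keep at least one camera
    n_drop = rng.randint(min(min_drop, max_drop), max_drop)
    dropped = set(rng.sample(list(camera_ids), n_drop))
    return [c for c in camera_ids if c not in dropped]

train_cameras = [0, 1, 2, 3, 4, 6, 7, 9]
kept = drop_cameras(train_cameras, random.Random(0))
```

Calling the function once per frame gives every training sample its own subset of sensors, so the network sees a range of point densities.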

The first three experiments are conducted on the CMU Panoptic dataset[31], which was created to capture social interaction using several modalities. Here several actors are interacting in a closed environment and captured by a multitude of RGB and RGBD cameras. For our experiments we use only the depth images from the ten Kinect 2 RGBD sensors and show that point clouds alone are sufficient to accurately detect people and identify their pose.

The cameras in the CMU Panoptic dataset are placed in various positions on a dome over the room, all pointing towards the center. The dataset contains several different social scenes with 3D joint annotations of the actors. For our experiments we randomly choose four scenes to conduct our evaluation, namely “160224_haggling1”, “160226_haggling1”, “160422_haggling1” and “160906_pizza1”. For more information about the dataset, readers are referred to [31].

The last experiment is conducted on the MVOR dataset [32], which was acquired over 4 days in a hospital operating room with a varying number of medical personnel visible in the scene and recorded by three RGBD cameras. The dataset captures real world operating room scenarios, including cluttered backgrounds, various medical devices, a varying number of people at any given time, and ubiquitous occlusion caused by medical devices. The point clouds were cropped to a 4x2x3 m cuboid and sampled using a voxel grid filter with a 2.5x2.5x2.5 cm grid size.

4.1 Metrics

To evaluate the overall performance of the algorithm we use two kinds of metrics. For the instance detection part we use the 3D object detection metric Average Precision (AP) at Intersection over Union (IoU) > 0.5, as used in the KITTI Vision Benchmark Suite [33]. For the joints we use per joint mean distance metrics, denoted in the following sections as DIST for per joint distance and ACC for accuracy under a 10 cm threshold. When calculating the distance, we only count the joints in true positive detections. If there are duplicate detections, we only calculate the joint differences for the highest scored detection.
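The two joint metrics can be sketched as follows, assuming predictions have already been matched to ground truth so that only true positive detections are passed in. The function name `joint_metrics` is illustrative.

```python
import numpy as np

def joint_metrics(pred, gt, threshold_cm=10.0):
    """pred, gt: (N, J, 3) joint positions in cm for N matched detections.

    Returns the per-joint mean distance (DIST, in cm) and the per-joint
    percentage of detections closer than the threshold (ACC)."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, J) Euclidean distances
    return dist.mean(axis=0), 100.0 * (dist < threshold_cm).mean(axis=0)
```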

4.2 View Generalization

In this experiment, we simulate cameras being placed in different locations in the room to demonstrate the view generalization of the algorithm. We achieve this by using different cameras at training and test time. We show that our algorithm is robust when changing the view or location of the camera.

For our training dataset we randomly choose 8 camera inputs, namely cameras 0, 1, 2, 3, 4, 6, 7 and 9. The remaining cameras 5 and 8 are used for testing. Note that in both training and testing, we choose a random number of cameras at any given time to simulate camera failures.

The first experiment is conducted on the scenes “160224_haggling1”, “160226_haggling1” and “160906_pizza1”. The training dataset has 6858 frames, uniformly down-sampled by a factor of 3. The testing dataset has 897 frames, down-sampled by a factor of 30.

As we can see from Table 1, even with severe camera view changes and camera failures, the proposed method performs well. Meanwhile, our end-to-end solution outperforms the baseline by a large margin, both from detection perspective and joint regression perspective.

Joint         [1] + [16]              Point R-CNN
              DIST [cm]   ACC [%]     DIST [cm]   ACC [%]
Neck           5.47       94.8         4.67       96.7
Headtop       10.55       52.1         9.28       63.5
BodyCenter     7.14       85.5         6.89       85.6
Lshoulder     19.00        5.2        17.00       22.9
Lhip          11.55       40.4        10.61       52.8
Lknee         13.41       33.8        12.80       40.8
Lankle        16.36       28.3        15.75       30.5
Rshoulder     18.16        3.4        16.28       25.7
Rhip          12.29       32.1        11.42       45.6
Rknee         14.41       24.7        13.52       34.8
Rankle        17.05       24.4        15.95       28.2
Mean          13.20       38.6        12.20       47.9
AP            81.24                   81.37
Table 1: View generalization: per joint distance in cm and accuracy < 10 cm.

4.3 Actor generalization

In this experiment, we explore the generalization problem in terms of the objects and scenes. In particular, we train the network on the scenes “160224_haggling1”, “160226_haggling1” and “160906_pizza1” and test on the scene “160422_haggling1” with the same camera views, i.e. both training and testing data use cameras 0, 1, 2, 3, 4, 6, 7 and 9 with random camera failures. The training dataset has 6858 frames, uniformly down-sampled by a factor of 3. The testing dataset has 430 frames, down-sampled by a factor of 30. Table 2 shows that the proposed method outperforms the baseline method by a large margin, especially for shoulders, hips and knees.

Joint         [1] + [16]              Point R-CNN
              DIST [cm]   ACC [%]     DIST [cm]   ACC [%]
Neck           7.28       81.0         5.65       89.3
Headtop       12.10       38.2         8.40       75.1
BodyCenter     7.24       81.7         6.50       86.5
Lshoulder     19.83        0.4        12.96       42.7
Lhip          12.26       28.8         9.24       63.6
Lknee         13.12       30.3        10.23       56.6
Lankle        16.49       25.8        13.67       37.4
Rshoulder     19.00        5.4        12.49       44.8
Rhip          12.85       27.8         9.42       64.4
Rknee         13.95       27.2        10.22       56.6
Rankle        17.71       17.6        13.82       36.3
Mean          13.80       33.1        10.24       59.4
AP            80.65                   90.54
Table 2: Actor generalization: per joint distance in cm and accuracy < 10 cm.

4.4 View and Actor generalization

In this study, we evaluate under both view changes and actor changes. We train the network on the scenes “160224_haggling1”, “160226_haggling1” and “160906_pizza1” using cameras 0, 1, 2, 3, 4, 6, 7 and 9, and test on the scene “160422_haggling1” using cameras 5 and 8, both with random camera failures. The training dataset has 6858 frames, uniformly down-sampled by a factor of 3. The testing dataset has 432 frames, down-sampled by a factor of 30. From Table 3 we can see that the proposed method outperforms the baseline on all joints, some by large margins, for example in the accuracy of the headtop and the left and right shoulder joints.

Joint         [1] + [16]              Point R-CNN
              DIST [cm]   ACC [%]     DIST [cm]   ACC [%]
Neck           7.88       71.1         6.32       88.4
Headtop        9.99       54.9         8.64       71.5
BodyCenter     7.20       80.2         6.29       87.0
Lshoulder     13.57       37.2        11.50       48.7
Lhip           9.18       64.8         8.22       73.6
Lknee          9.28       63.2         8.12       74.1
Lankle        12.15       45.9        10.99       52.7
Rshoulder     14.37       31.5        11.71       48.1
Rhip          10.51       53.3         8.77       69.5
Rknee         11.29       51.0         9.49       64.1
Rankle        14.43       35.9        12.62       44.2
Mean          10.90       53.6         9.33       65.64
AP            71.25                   89.94
Table 3: View and actor generalization: per joint distance in cm and accuracy < 10 cm.

4.5 Handling background and cluttered scenes

In this experiment, we use the MVOR dataset, which shows a very cluttered room with walls in the background. Unlike the previous dataset, this is a real world operating room scenario, which is naturally more complex. Furthermore, due to the size of the dataset, it is not feasible to train the network from scratch, as mentioned in the original paper [12]. In this experiment we therefore fine-tuned our previous model on the first 3 days of data and tested on the data from the 4th day, with all 3 cameras used. Since the number of joints is different from the previous dataset, we only fine-tuned the joints common to both datasets' annotations. The training dataset has 513 frames and the testing dataset has 113 frames. Table 4 shows that our algorithm outperforms the baseline.

Joint         [1] + [16]              Point R-CNN
              DIST [cm]   ACC [%]     DIST [cm]   ACC [%]
Head          13.06       37.5         9.74       61.9
Neck          12.05       41.3         9.24       65.0
Lshoulder     19.02       15.3        15.52       28.9
Rshoulder     16.90       21.0        15.12       27.4
Lhip          27.93        7.5        20.27       14.2
Rhip          32.34        8.2        18.50       19.3
Lelb          27.60        6.2        22.03       12.7
Relb          25.87        6.1        20.41       14.2
Mean          24.96       15.4        17.70       26.1
AP            57.84                   64.73
Table 4: MVOR dataset: per joint distance in cm and accuracy < 10 cm.
Figure 6: The mean per joint distance accuracy using different thresholds for the above four experiments. Each dotted line represents the baseline method and the solid line represents the proposed method. Experiments are color coded.
Figure 7: Person detection and pose estimation results on CMU Panoptic (rows 1, 2) and MVOR (row 3). The red cylinder and skeleton are the prediction result and the green cylinder and skeleton indicate the ground truth.

4.6 Qualitative Evaluation and Discussion

Apart from the experiments we mentioned above, we evaluated the mean of per joint distance accuracy under different thresholds. As shown in Figure 6, the proposed method consistently outperforms the baseline method under various thresholds in all four experiments. Sample testing results from all experiments are shown in Figure 7, where the ground truth is color coded green and the prediction red.

5 Conclusion

In this work we have demonstrated through extended experiments that point clouds are a natural and straightforward alternative for multi-sensor indoor systems, where they make the fusion of multi-sensor information efficient. Unlike conventional methods that use complex fusion models to combine information, which tend to generalize poorly, we show through various challenging real world scenarios that the proposed algorithm generalizes well. Furthermore, we propose an end-to-end multi-person 3D pose estimation network, “Point R-CNN”, and show that the proposed network outperforms the simply cascaded model by large margins in various experiments. The study shows that using an end-to-end network greatly improves both object detection and joint regression performance.