Single-view human detection and body pose estimation have enjoyed a great deal of attention over the last decades in the field of computer vision because of their importance for various applications, ranging from activity recognition to human-computer interaction. More recently, the emergence of deep learning has pushed the boundaries in many fields, including computer vision. The combination of deep learning with the availability of large datasets, such as MPII Pose (Andriluka et al., 2014) and MS COCO (Lin et al., 2014), has spawned many promising approaches for single-view human detection and pose estimation (Wei et al., 2016; Newell et al., 2017; Cao et al., 2017). However, the presence of clutter and occlusions degrades their performance. Capturing an environment from complementary views reduces the risk of occlusions, especially in busy environments, as shown in Figure 1. In addition, the availability of calibrated multi-view data greatly facilitates the process of lifting 2D scenes into 3D, which is important for many applications such as augmented reality.
Despite the inherent benefits of capturing an environment from multiple views, multi-view approaches have not achieved the same level of maturity as single-view approaches, mostly for two reasons. Firstly, multi-view datasets are generally recorded in controlled environments in order to use motion capture systems to acquire precise 3D location data. This removes the need for the tedious and error-prone manual annotation of the abundant number of frames coming from all views for generating ground truth 3D poses. Even though there are large multi-view datasets such as Human3.6M (Ionescu et al., 2014) and HumanEva (Sigal et al., 2009), the simple backgrounds and tight clothes required by motion capture systems make these datasets trivial for 2D pose estimation methods. Monocular pose estimation approaches report low 2D body part localization errors even without finetuning (Chen and Ramanan, 2017; Martinez et al., 2017). For these reasons, single- and multi-view pose estimation models trained on datasets captured in such controlled laboratory environments do not generalize well to real world data, which is often visually much more complex due to occlusions, clutter and the presence of multiple persons in the scene. Secondly, current multi-view approaches (Sigal et al., 2012; Dogan et al., 2017) learn model parameters that are specific to each multi-view camera setup. In other words, to apply these approaches to a new multi-view scenario, it is required to collect new annotated data that includes both multi-view images and their corresponding 3D ground truth poses for the same camera setup. On the one hand, generating synthetic datasets for these approaches would require not only the generation of 3D body poses, but also the photo-realistic rendering of humans with different shapes, textures and backgrounds to allow generalization to the real world, which is not a trivial task. 
On the other hand, generating such training data using either motion capture systems or manual annotations, especially in the case of data-hungry deep learning methods, is not always feasible in uncontrolled environments and very tedious. We therefore propose an approach that benefits from existing multi-view datasets to perform multi-view 3D pose estimation in new multi-view setups.
Our approach formulates the problem of multi-view 3D pose estimation in a two-step framework: (1) single-view pose detection and (2) multi-view 3D pose regression. We separate these two steps for two reasons. First, we can better exploit available single-view and multi-view datasets for the right task. Single-view datasets, such as MPII Pose (Andriluka et al., 2014) and MS COCO (Lin et al., 2014), include diverse and challenging frames from everyday activities or movies originating from amateur to professional recordings. Therefore, models trained on these datasets can better cope with real world challenges and generalize to new environments. However, these single-view datasets lack 3D annotations, contrary to multi-view datasets, which often come with accurate 3D body poses. As multi-view datasets are however much simpler for the task of 2D pose estimation (Chen and Ramanan, 2017; Martinez et al., 2017), researchers have proposed methods to jointly use both single- and multi-view datasets in order to construct more robust 3D pose estimation models from multiple views (Amin et al., 2014; Belagiannis et al., 2016). Changes in camera setups however require the retraining of the model on training data from the same camera setup. This strictly limits the deployment of the models to environments where such training data exists. The second reason for our two-step approach is that we can better generalize to new multi-view environments by assuming that lifting 2D body poses into 3D is independent of the images given the 2D pose detections. This assumption implies that we do not need to collect 2D image data for training the 3D regression function and that any set of plausible 3D body poses can be used instead by computing body pose projections into 2D.
To learn a multi-view 3D regression function, we propose a method that relies on a multi-stage neural network. The input of this network is a set of corresponding multi-view 2D detections for each individual person. At test time, they are collected using a state-of-the-art single-view detector. We assume that the camera system is fully calibrated and can therefore use epipolar geometry to establish the multi-view correspondences per person. This process also allows us to detect the number of persons per multi-view frame, where a multi-view frame is defined as the set of all images captured from all views at the same time step. This is in contrast to current multi-view RGB approaches, which tackle either single-person scenarios (Gall et al., 2010; Hofmann and Gavrila, 2011) or multi-person scenarios where the number of persons is known a priori (Luo et al., 2010; Belagiannis et al., 2014b).
The proposed network consists of a series of blocks of fully-connected layers with intermediate supervision at each block. The input to each block is the raw network input, i.e. the concatenated 2D poses, and the output from previous block if it exists. The network can therefore build a high dimensional function and refine the output of the previous block to achieve a more reliable regression function. In order to generalize to new multi-view setups, we do not use images during training but construct training data solely by projecting Human3.6M’s 3D poses. We use Human3.6M because it is the largest publicly available multi-view dataset and it includes men and women of different sizes. The projected 2D poses are generated according to the camera parameters used at test time. In practice, 2D poses are detected at test time using a 2D pose detector that may be noisy and inaccurate. In order to cope with these inaccuracies, we propose to perturb during training the 2D locations of the body joints by random noise that is generated based on the characteristics of the 2D detector. We also propose to incorporate a detection confidence for each body joint, computed based on the amount of noise added during training. This provides a representation for the detection confidence generated by the detector at test time. Therefore, the approach can take into account not only joint locations, but also detection precision to build a robust regression function.
We use two datasets to perform quantitative and qualitative evaluations and compare with state-of-the-art results on these datasets. We first report results on the Human3.6M dataset (Ionescu et al., 2014) to characterize the properties and the performance of our approach. This dataset includes recordings of several actions performed by professional actors of different genders. This dataset has been recorded by a fully calibrated four-view camera system and a motion capture system to collect ground truth 3D positions of the body joints. We also evaluate our approach on a challenging multi-view dataset (Kadkhodamohammadi et al., 2017) to show the generalization ability of our approach. This dataset is generated from real surgery recordings obtained in an operating room (OR) using a three-view camera system and hence is called Multi-view OR (MVOR) in the following. Our approach achieves comparative results on Human3.6M and significantly improves the localization error on the multi-view OR dataset without using any training data from this dataset.
The main contributions of the paper are twofold. First, we present a simple and yet accurate multi-view 3D pose estimation approach that can generalize well to new multi-view environments. In contrast to current state-of-the-art methods, the approach exploits an existing multi-view dataset to build models for new multi-view environments without any need for new annotation. Second, this is the first multi-view RGB approach that has been quantitatively evaluated on data captured in an unconstrained environment.
2 Related Work
Multi-view segmentation-based 3D pose estimation. Hofmann and Gavrila (2011) use foreground segmentation to estimate body silhouettes per view. Then, 3D pose candidates are obtained by matching a library of exemplars. Texture information and shape similarity across all views combined with temporal information are used to compute the final 3D poses. Similarly, Gall et al. (2010) propose a two-layer framework that iteratively improves foreground segmentation and retrieved body poses by incorporating both multi-view and temporal information. Other approaches have deployed optical flow estimation (Chen et al., 2008), 2D as well as 3D motion cues (Sundaresan and Chellappa, 2009) and low-rank multi-view feature fusion combined with sparse spectral embedding (Yu and Hong, 2017) to estimate 3D poses. In contrast to our work, these approaches are only evaluated on single-person datasets. More importantly, it is not always possible to compute foreground in cluttered environments, such as in operating rooms. Therefore, these approaches can only be evaluated on data recorded in environments with simple backgrounds.
Multi-view part-based 3D pose estimation. Several multi-view 3D pose estimation approaches (Burenius et al., 2013; Amin et al., 2013, 2014; Belagiannis et al., 2014a, 2016; Kadkhodamohammadi et al., 2017) have been proposed that rely on a part-based framework (Felzenszwalb et al., 2010). This part-based framework provides an elegant formalism to optimize over different potential functions for incorporating image features, multi-view cues, temporal information and body physical constraints. Burenius et al. (2013) propose an approach that extends pictorial structures (Fischler and Elschlager, 1973; Felzenszwalb and Huttenlocher, 2005) to multiple views and performs exact 3D inference by using simple binary pairwise potential functions. Instead, Amin et al. (2013, 2014) use 2D inference with more complex pairwise potentials, multi-view cues and triangulation to estimate 3D poses. Belagiannis et al. (2014a) have also deployed different pairwise potentials for incorporating both body physical constraints and multi-view features. Their approach performs approximate 3D inference by selecting a limited number of hypotheses per individual. It has been extended to incorporate temporal information (Belagiannis et al., 2014b) and to use a deep neural network based body part detector (Belagiannis et al., 2016).
In contrast to our work, all these approaches have only been evaluated on datasets recorded in constrained laboratory environments and also require the number of persons to be known a priori. MVDeep3DPS presented in (Kadkhodamohammadi et al., 2017) is an exception, but this approach relies on multi-view RGB-D input to estimate 3D body poses. Additionally, all these approaches generally need to learn model parameters on data from the same camera setup. Moreover, optimizing these energy functions is demanding, especially in 3D, which makes these approaches unsuitable for real-time applications. In our work, we do not require images with pose annotations from the camera setup used at test time and learn model parameters by using existing datasets. Furthermore, our approach performs both human detection and pose estimation. As our regression function uses a multi-layer neural network, it runs faster than real time on a single consumer GPU card.
Single-view 3D pose estimation. Recently, many deep learning based approaches have been proposed to directly regress body poses in 3D from a monocular image or an image sequence. Pavlakos et al. (2017) use a stack of fully convolutional networks (Newell et al., 2016) to iteratively compute 3D heatmaps per body part. Tekin et al. (2016a) propose to learn an auto-encoder that maps 3D body joints into a high-dimensional latent space for discovering joint dependencies and then to learn a convolutional network that maps an image into this high-dimensional pose space. In (Tekin et al., 2016b), motion compensation is used to align several consecutive frames and construct a rectified spatiotemporal volume that is then fed into a 3D regression function. Other approaches have built deep pose grammar representations (Fang et al., 2017), skeleton maps (Wan et al., 2017) and multitask objectives (Rogez et al., 2017; Luvizon et al., 2018) to enforce more constraints and obtain a more accurate 3D regression function. These approaches are trained on images with accurate 3D ground truth poses. The main issue is that to generate such accurate 3D annotations, motion capture systems are used in controlled laboratory environments with simple backgrounds. Models trained on such image data do not generalize well to real world scenes.
Another line of work relies on two-stage methods, where 2D body parts are first predicted using 2D pose detectors (Wei et al., 2016; Newell et al., 2016; Cao et al., 2017) and then 3D body part locations are computed by relying on these predictions (Moreno-Noguer, 2017; Chen and Ramanan, 2017; Martinez et al., 2017). In comparison with direct 3D regression approaches, these approaches benefit from the diverse, challenging and real world datasets, e.g. MS COCO and MPII Pose, to train reliable 2D pose detector models that generalize well. To compute 3D body locations, exemplar-based approaches are used by matching lower and upper body parts separately (Jiang, 2010) and by matching the whole skeleton (Chen and Ramanan, 2017). More recently, Moreno-Noguer (2017) proposed to regress from 2D Euclidean distance matrices (EDM) to 3D EDM instead of using traditional 2D-to-3D regression in the Cartesian coordinate system (Radwan et al., 2013; Ionescu et al., 2014). The regression is performed using a fully convolutional network and 3D poses are recovered via a multidimensional scaling algorithm (Biswas et al., 2006). Martinez et al. (2017) showed that a simple fully connected network to regress from 2D to 3D outperforms (Moreno-Noguer, 2017) and achieves state-of-the-art results on Human3.6M. We also adopt a two-stage framework in our multi-view approach and use a fully connected network as a 2D-to-3D regression function. The single-view model in (Martinez et al., 2017) was however trained on the output of the 2D detector used during test time. In contrast, our approach relies solely on ground truth during training and instead generates training samples that comply with the behavior of the 2D detector used at test time. This is an interesting property of our approach, which enables us to train our network on Human3.6M and test on a completely different multi-view dataset.
In this section, we present our proposed approach for multi-view 3D pose estimation. We assume that we have a calibrated multi-view system recording an environment from a set of complementary views. Our objective is to detect and predict human body poses in 3D given images captured from all views. In a probabilistic formulation, we want to compute the joint probability distribution over (1) the 3D body pose $X = \{x_1, \dots, x_N\}$, where $N$ is the number of body joints and $x_i \in \mathbb{R}^3$ is a body joint location in 3D; (2) the 2D body poses $P = \{p_1, \dots, p_V\}$, where $V$ is the number of viewpoints and $p_v$ is the tuple of pixel coordinates indicating the body joints of a 2D pose in view $v$; and (3) all 2D images $\mathcal{I} = \{I_1, \dots, I_V\}$, where $I_v$ is the image taken from the $v$-th viewpoint. Such a formulation makes no limiting assumption and indicates that a 3D body pose is jointly dependent on its appearance in all individual views. However, learning such a model requires collecting training data from the same multi-view setup that we want to apply the model to.
Without loss of generality, we can rewrite the joint probability distribution as:
$$p(X, P, \mathcal{I}) = p(X \mid P, \mathcal{I}) \, p(P \mid \mathcal{I}) \, p(\mathcal{I}). \tag{1}$$
To build a multi-view pose estimation approach that can generalize to new environments, we make two conditional independence assumptions. Firstly, the 3D pose $X$ is assumed conditionally independent of the images $\mathcal{I}$ given the 2D poses $P$. Obviously, this is not always correct, as one can find different 3D skeletons that have similar 2D projections due to the 3D-2D perspective effect. The likelihood of such cases however decreases dramatically in a multi-view setup, where a working volume has been captured from complementary views.
Secondly, we assume that given an image observation $I_v$ for a view $v$, the 2D poses $p_v$ in this view are conditionally independent of detections in the other views and of the other image observations. One can see that this assumption does not hold in case of occlusions. However, we believe this assumption is reasonable for three reasons: (1) there exist challenging single-view datasets, e.g. MS COCO and MPII Pose, which can be used to train robust single-view pose detection models; (2) recent deep neural network based approaches have achieved very promising results on unseen data and reliably discriminate occluded joints from visible ones (Cao et al., 2017; Newell et al., 2016, 2017); and (3) it yields an interesting model that allows us to train a 2D pose detector independently. Considering these two assumptions, we can rewrite the joint probability as:
$$p(X, P, \mathcal{I}) = p(X \mid P) \, \Big[ \prod_{v=1}^{V} p(p_v \mid I_v) \Big] \, p(\mathcal{I}). \tag{2}$$
This equation indicates that a 2D pose detector is applied in each view independently and that the 3D pose regression function is solely dependent upon 2D pose detections. We model the first term using a multi-view 3D regression function, described in Section 3.4. The input for this function is provided by concatenating 2D detections for each individual person across all views, which is presented in Section 3.2. The second term is the single-view pose detector explained next.
3.1 Single-view 2D Pose Detector
The relaxation assumption mentioned above allows us to use arbitrarily complex models to detect and localize 2D body poses given single-view images. We therefore use the deep convolutional network of Cao et al. (2017) as single-view pose detector. This approach is currently the state of the art for multi-person 2D pose estimation. In addition to its reliable multi-person pose estimation performance, the approach runs in nearly real time. Given an image, the model generates a set of 2D poses, where each body pose is specified by a collection of 18 body parts. For each body part, the model provides its pixel coordinates and a detection confidence. The confidence values are in the range $[0, 1]$, where zero indicates undetected body parts.
3.2 Concatenating Detection Across all Views
Given the detected poses per view, we need to find correspondences across the views. As we assume that the camera system is fully calibrated (i.e. both camera intrinsic and extrinsic parameters are available), we use epipolar geometry to find correspondences (Hartley and Zisserman, 2000). Let us assume that for each pair of cameras the camera parameters are given with respect to the first one:
$$M_1 = K_1 \, [\, \mathrm{Id} \mid 0 \,], \qquad M_2 = K_2 \, [\, R \mid t \,], \tag{3}$$
where $K_1$ and $K_2$ are the camera intrinsic parameters, $\mathrm{Id}$ is the $3 \times 3$ identity matrix and $[R \mid t]$ indicates the extrinsic parameters. We can compute the fundamental matrix $F$ by:
$$F = K_2^{-\top} \, [t]_{\times} \, R \, K_1^{-1}, \tag{4}$$
where $[\cdot]_{\times}$ is the skew matrix operator. The fundamental matrix encapsulates all camera parameters and allows us to compute the corresponding epipolar line for a point in the other view, as illustrated in Figure 2.
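As an illustration, the fundamental matrix for a camera pair of the form $K_1[\mathrm{Id} \mid 0]$ and $K_2[R \mid t]$ and the distance of a point to an epipolar line can be sketched in NumPy as follows. This is a minimal sketch based on the standard two-view relation $F = K_2^{-\top}[t]_{\times} R K_1^{-1}$; the function names are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def skew(t):
    """Return the 3x3 skew-symmetric matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """Fundamental matrix for cameras K1 [Id | 0] and K2 [R | t]."""
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

def epipolar_distance(F, x1, x2):
    """Pixel distance from point x2 (second view) to the epipolar line of x1 (first view)."""
    line = F @ np.array([x1[0], x1[1], 1.0])   # line a*u + b*v + c = 0 in the second view
    residual = abs(line @ np.array([x2[0], x2[1], 1.0]))
    return residual / np.hypot(line[0], line[1])
```

For a noise-free pair of projections of the same 3D point, `epipolar_distance` is zero up to floating-point error; with real detections it grows with the localization error of the detector.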
Here, we use the fundamental matrix to compute average distances between detected skeletons for all pairs of views. For each possible pair of detections from two distinct views, this distance is the average, over the subset of body joints detected in both skeletons, of the distance between each joint in one view and the epipolar line of the corresponding joint from the other view. We then collect the 2D skeletons of each person across two views by finding disjoint pairs of skeletons with the lowest average distance. We exclude pairs for which the average distance is larger than 20 pixels. We then use the matched skeletons to establish multi-view correspondences per individual person. One should note that despite the availability of the correspondences, we cannot rely on triangulation because inaccurate detections lead to high error in 3D and, more importantly, joints might be detected in fewer than two views, especially in cluttered environments. We therefore use a regression function to compute the 3D positions of the body joints.
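The pairwise matching between two views can be sketched as below. The text does not state how the disjoint pairs with the lowest average distance are selected, so the greedy selection used here is an assumption (an optimal assignment solver would be an alternative). The function names and the `(J, 3)` skeleton layout of `(u, v, confidence)` rows are likewise illustrative.

```python
import numpy as np

def skeleton_distance(F, skel_a, skel_b):
    """Mean pixel distance from joints of skel_b to the epipolar lines of skel_a.

    skel_a, skel_b: (J, 3) arrays of (u, v, confidence); confidence 0 means undetected.
    Returns np.inf when no joint is detected in both skeletons.
    """
    both = (skel_a[:, 2] > 0) & (skel_b[:, 2] > 0)
    if not both.any():
        return np.inf
    ones = np.ones((int(both.sum()), 1))
    pa = np.hstack([skel_a[both, :2], ones])   # homogeneous joints in view A
    pb = np.hstack([skel_b[both, :2], ones])   # homogeneous joints in view B
    lines = pa @ F.T                           # one epipolar line in view B per joint
    residual = np.abs(np.sum(lines * pb, axis=1))
    return float(np.mean(residual / np.hypot(lines[:, 0], lines[:, 1])))

def match_skeletons(F, skels_a, skels_b, max_dist=20.0):
    """Greedily select disjoint (i, j) pairs in order of increasing distance (assumed strategy)."""
    candidates = sorted(
        (skeleton_distance(F, a, b), i, j)
        for i, a in enumerate(skels_a)
        for j, b in enumerate(skels_b)
    )
    used_a, used_b, pairs = set(), set(), []
    for dist, i, j in candidates:
        if dist > max_dist:                    # 20-pixel rejection threshold from the text
            break
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return pairs
```

Applying `match_skeletons` to every pair of views and chaining the resulting pairs gives the per-person correspondences; skeletons left unmatched in all views would be treated as persons seen from a single view.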
To prepare the input for the regression function, we concatenate skeletons across all views. If a person is not detected in a view, we fill the corresponding entry with zeros. Each body part is represented by three channels: two channels indicating pixel location and the third channel indicating the detection confidence.
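A minimal sketch of this input construction is given below. The function name is hypothetical, and the default of 18 joints follows the detector output described in Section 3.1; the joint subset actually fed to the network may differ.

```python
import numpy as np

def build_regression_input(skeletons_per_view, num_joints=18):
    """Concatenate one person's per-view skeletons into a flat input vector.

    skeletons_per_view: list of length V; each entry is a (num_joints, 3) array of
    (u, v, confidence), or None when the person is not detected in that view.
    """
    blocks = []
    for skel in skeletons_per_view:
        if skel is None:
            skel = np.zeros((num_joints, 3))   # missing view is filled with zeros
        blocks.append(np.asarray(skel, dtype=float).reshape(-1))
    return np.concatenate(blocks)
```

With three views and 18 joints, the resulting vector has 3 x 18 x 3 = 162 entries, and a person missing from one view contributes a block of 54 zeros.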
3.3 Training Data Generation
As mentioned in the introduction, we generate training samples by projecting 3D skeletons into 2D. The model can therefore be trained on data generated from existing datasets or any set of valid 3D poses. The projected 2D skeletons are computed based on the camera setup used at test time. Since the single-view 2D pose detector used at test time can provide noisy detections, the model needs to be trained on similar data to be able to generalize. We therefore evaluate our 2D pose detector on an existing dataset, which contains both images and ground truth 2D poses, to characterize its performance. We use these evaluation results to design a normally distributed noise model for each body joint. This noise is used to perturb training data. We then compute the joint confidence as:
Here, $n$ is the amount of additive noise, which is sampled from a normal distribution with zero mean and standard deviation $\sigma$, and $\lambda$ is a coefficient. We use this coefficient to set the confidence of a joint to zero, i.e. undetected, based on the relative amount of added noise with respect to the standard deviation. As shown by the experiments, perturbing the training data and incorporating the confidence value are important for the method to generalize well to unseen data.
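The data generation step can be sketched as follows. The exact confidence formula is not recoverable from the text here, so the linear form below, which reaches zero once the noise magnitude exceeds the coefficient times the standard deviation, is only an assumption that agrees with the description. Zeroing the coordinates of joints labeled as undetected is also an assumption, and all function and parameter names are hypothetical.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project (J, 3) points to (J, 2) pixel coordinates with the camera K [R | t]."""
    cam = points_3d @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def noisy_sample(points_3d, K, R, t, sigma, coeff=2.0, rng=None):
    """Return a (J, 3) array of perturbed (u, v) and a synthetic detection confidence.

    sigma: (J,) per-joint noise standard deviation in pixels, estimated from the
    detector evaluation. The confidence formula is an assumed form.
    """
    rng = np.random.default_rng() if rng is None else rng
    uv = project(points_3d, K, R, t)
    noise = rng.normal(0.0, 1.0, size=uv.shape) * sigma[:, None]
    magnitude = np.linalg.norm(noise, axis=1)
    # Assumed: confidence decays linearly and is zero beyond coeff * sigma.
    conf = np.clip(1.0 - magnitude / (coeff * sigma), 0.0, 1.0)
    # Assumed: joints labeled as undetected get zeroed coordinates.
    uv_noisy = np.where(conf[:, None] > 0, uv + noise, 0.0)
    return np.hstack([uv_noisy, conf[:, None]])
```

Calling `noisy_sample` once per view with that view's camera parameters, and concatenating the results, yields one synthetic training input for a given 3D pose.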
3.4 Multi-view 3D Regression Function
As mentioned earlier, the regression function relies solely on the detections provided by the single-view 2D pose detector. In contrast to (Tekin et al., 2016a; Fang et al., 2017; Luvizon et al., 2018), we do not need to model a complex function to directly map image pixel intensities into body part locations in 3D. Similar to Martinez et al. (2017), we model the 3D regression function using a simple multi-stage multilayer neural network.
The network architecture is illustrated in Figure 3. The network consists of several stages, where each stage is made of four fully connected (FC) layers. The first stage takes the multi-view 2D detections as input, described in Section 3.2. Every stage in this network is trained to regress the desired output. This provides intermediate supervision at each stage and alleviates the problem of vanishing gradients that occurs when there are many intermediate layers between the network input and output layers (Cao et al., 2017). We can therefore build deep neural networks by stacking several stages. The stage-wise supervision is provided by computing the loss between the output of the last layer in each stage and the desired output ($y$):
$$L_s = \frac{1}{B} \sum_{b=1}^{B} \ell\big(\hat{y}_s^{\,b}, y^{b}\big), \tag{6}$$
where $L_s$ is the average loss computed over all $B$ training samples used in this iteration, $\hat{y}_s^{\,b}$ is the output of the last layer at stage $s$ for sample $b$, and $\ell$ denotes the per-sample regression loss. The network is optimized by computing the overall network loss as a sum of the losses from all $S$ stages, defined as:
$$L = \sum_{s=1}^{S} L_s. \tag{7}$$
Since we need to retrain the model for new multi-view setups, we use batch normalization in order to reduce sensitivity to network initialization and learning rate (Ioffe and Szegedy, 2015). We have also used dropout to avoid overfitting (Srivastava et al., 2014) and rectified linear units to achieve non-linearity (Nair and Hinton, 2010).
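The multi-stage structure with intermediate supervision can be sketched in plain NumPy as below. This is a forward-pass illustration only: batch normalization and dropout are omitted for brevity, the squared L2 error is an assumed choice of per-sample loss, and feeding the raw input together with the previous stage output into each later stage follows the description in the introduction. All function names are hypothetical.

```python
import numpy as np

def init_stage(in_dim, out_dim, hidden, rng):
    """Weights and biases for one stage of four fully connected layers."""
    dims = [in_dim, hidden, hidden, hidden, out_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def build_network(in_dim, out_dim, num_stages=3, hidden=1024, rng=None):
    """Later stages receive the raw input concatenated with the previous output."""
    rng = np.random.default_rng(0) if rng is None else rng
    stages = [init_stage(in_dim, out_dim, hidden, rng)]
    for _ in range(num_stages - 1):
        stages.append(init_stage(in_dim + out_dim, out_dim, hidden, rng))
    return stages

def run_stage(layers, x):
    """ReLU after every layer except the last, which is a linear regression output."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def forward(stages, x):
    """Return the list of per-stage predictions for a batch x of shape (B, in_dim)."""
    outputs, prev = [], None
    for layers in stages:
        inp = x if prev is None else np.concatenate([x, prev], axis=1)
        prev = run_stage(layers, inp)
        outputs.append(prev)
    return outputs

def total_loss(outputs, target):
    """Sum over stages of the batch-averaged squared L2 error (assumed loss)."""
    return sum(np.mean(np.sum((out - target) ** 2, axis=1)) for out in outputs)
```

At test time only the output of the last stage would be used as the 3D pose estimate, while during training every stage contributes to the loss.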
In this section, we present the evaluation on two multi-view datasets and compare with state-of-the-art results.
4.1 Implementation Details
We implement our approach using TensorFlow (Abadi et al., 2015). In each stage of the network, the sizes of the first and last layers are set based on the input and output dimensions and the size of the intermediate layers is set to 1024. Our network is trained using the Adam optimizer. We set the starting learning rate to 0.001 and use exponential decay. The batch size is set to 512 and we train our network for 200 epochs. We observe that the performance of the network reaches a plateau when more than three stages are used. We therefore use three-stage networks throughout our experiments. A forward pass takes less than 1 ms on a 1080Ti GPU. The computation time of our multi-view regression model is therefore almost negligible compared to that of the 2D detector.
Human3.6M. Human3.6M is currently the largest multi-view human pose estimation dataset. The dataset includes around 3.6 million images collected from 15 actions performed by seven professional actors in a laboratory environment (Ionescu et al., 2014). The actions have been recorded by a four-view RGB camera system and camera parameters, including both intrinsic and extrinsic parameters, are available. Full-body 3D ground truth annotations are generated using a motion capture system. Following the standard evaluation protocol used in the literature, five subjects (S1, S5, S6, S7, S8) are used for training and two subjects (S9, S11) for testing (Chen and Ramanan, 2017; Pavlakos et al., 2017; Martinez et al., 2017). Mean per joint position error (MPJPE) in millimeters is used as evaluation metric and test results are collected per action.
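For reference, MPJPE can be computed as in the following sketch. The function name is illustrative, and any alignment applied before the evaluation (for example root-joint centering) is protocol-dependent and not shown.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground truth joints.

    pred, gt: arrays of shape (..., J, 3) expressed in the same unit (millimeters here).
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))
```

Averaging this value over the frames of one action gives the per-action numbers reported in the result tables.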
Multi-view OR. The multi-view OR (MVOR) dataset is, to the best of our knowledge, the first multi-view pose estimation dataset that is generated from recordings in an uncontrolled environment. All activities in an operating room have been recorded for four days using a three-view camera system (Kadkhodamohammadi et al., 2017). The dataset has been manually annotated to provide both 2D and 3D upper-body poses. The dataset includes around 700 multi-view frames and 1100 persons. The presence of multiple persons and clutter makes this dataset much more challenging than Human3.6M, as can be seen in Figure 1. To report 2D body part localization on this dataset, we use the probability of correct keypoints (PCK) metric that is commonly used for evaluating multi-person pose estimation (Kadkhodamohammadi et al., 2017; Cao et al., 2017). MPJPE is used to report 3D body part localization.
4.3 2D Detection Results
|(Cao et al., 2017)||92.8||90.1||75.6||75.9||58.9||78.6|
In this section, we evaluate the 2D detection model of (Cao et al., 2017) on both datasets to assess its performance on such unseen data. In addition, we use the results on Human3.6M to model the characteristics of the 2D detector, which are required by our data generation model presented in Section 3.3.
In Table 1, we present the results of the single-view 2D pose detector (Cao et al., 2017) on the Human3.6M train set. We should note that the detector has not seen any data from this dataset during training. We use MPJPE in pixels to compute body part localization errors. The results for each body part are reported per camera. The results for head and neck localization are not presented, as the annotations for these body parts differ between Human3.6M and MS COCO, which is used to train the detector. The detector is applied on the whole image, i.e. no bounding box is provided, in contrast to previous work that relies either on ground truth (Ionescu et al., 2014; Moreno-Noguer, 2017; Martinez et al., 2017) or on person detectors (Tekin et al., 2016b) to obtain bounding boxes. In total, of the joints are not detected and the detector achieves an average MPJPE of 11 pixels. It is worth mentioning that the detector performs similarly on the test set. Table 2 presents the results of the 2D detector on the MVOR dataset. The model attains an average PCK of 78.9% on this dataset. We have also reported the performance of Deep3DPS (Kadkhodamohammadi et al., 2017), which is the state-of-the-art model on this dataset. In contrast to (Cao et al., 2017), which is trained on the RGB images of MS COCO, the Deep3DPS model uses both color and depth images and has been trained on MPII Pose and then finetuned on a single-view OR dataset. The 2D pose detector of (Cao et al., 2017) outperforms Deep3DPS. These results show that the detector achieves fairly promising results on both datasets even without finetuning. Comparing the performance of the 2D detector on these two datasets also indicates that the MVOR dataset is much more complex, as the number of undetected joints is much higher ( vs. ).
For generating the training data, the evaluation results on the train set of Human3.6M, which are reported in Table 1, are used to set the parameters of the noise model. The train set from Human3.6M is chosen to avoid any overlap between train and test sets. The coefficient in (5) is set to two. As a result, of the joints will be labeled as undetected, which is on par with the percentage of undetected joints in Human3.6M.
4.4 3D Localization Results
Table 3: MPJPE (mm) on Human3.6M, per action and on average.

| Method | Directions | Discussion | Eating | Greeting | Phoning | Photo | Posing | Purchases | Sitting | SittingDown | Smoking | Waiting | WalkDog | Walking | WalkTogether | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Tekin et al., 2016b) | 102.4 | 147.2 | 88.8 | 125.3 | 118.0 | 182.7 | 112.4 | 129.2 | 138.9 | 224.9 | 118.4 | 138.8 | 126.3 | 55.1 | 65.8 | 125.0 |
| (Chen and Ramanan, 2017) | 89.9 | 97.6 | 89.9 | 107.9 | 107.3 | 139.2 | 93.6 | 136.0 | 133.1 | 240.1 | 106.6 | 106.2 | 87.0 | 114.0 | 90.5 | 114.1 |
| (Pavlakos et al., 2017) | 67.4 | 71.9 | 66.7 | 69.1 | 72.0 | 77.0 | 65.0 | 68.3 | 83.7 | 96.5 | 71.7 | 65.8 | 74.9 | 59.1 | 63.2 | 71.9 |
| (Martinez et al., 2017) | 51.8 | 56.2 | 58.1 | 59.0 | 69.5 | 78.4 | 55.2 | 58.1 | 74.0 | 94.6 | 62.3 | 59.1 | 65.1 | 49.5 | 52.4 | 62.9 |
| SV, (Newell et al., 2016) | 53.4 | 58.6 | 62.1 | 63.2 | 86.2 | 83.3 | 56 | 58.1 | 81.2 | 101.2 | 68.4 | 64.1 | 67.4 | 51 | 54.2 | 67.2 |
| SV, (Cao et al., 2017) | 69.5 | 75.5 | 67.6 | 76.8 | 84.6 | 94.9 | 69.8 | 68.4 | 92.2 | 113.7 | 77.1 | 75.1 | 77.2 | 59.0 | 64.2 | 77.7 |
| SV, Noisy GT | 69.7 | 78.8 | 69.8 | 77.5 | 84.4 | 97.6 | 64.9 | 86.5 | 103.3 | 125.8 | 81.8 | 80.4 | 83.3 | 59.9 | 62.6 | 81.8 |
| MV, (Cao et al., 2017) | 39.4 | 46.9 | 41.0 | 42.7 | 53.6 | 54.8 | 41.4 | 50.0 | 59.9 | 78.8 | 49.8 | 46.2 | 51.1 | 40.5 | 41.0 | 49.1 |
| MV, Noisy GT | 47.1 | 60.5 | 48.7 | 53.5 | 63.5 | 71.1 | 48.7 | 57.8 | 72.2 | 81.7 | 59.0 | 55.9 | 60.6 | 43.4 | 44.3 | 57.9 |
Human3.6M. As Human3.6M is a fairly new dataset and no multi-view pose estimation result is reported on this dataset, we compare our approach with recent state-of-the-art models for single-view 3D pose estimation on Human3.6M. For the sake of comparison, we have therefore trained a variant of our proposed regression function that relies solely on single-view input. Table 3 reports evaluation results of our approach with different configurations. Models that rely on single-view input are denoted by SV and multi-view ones by MV. These models are trained either on ground truth (GT) 2D poses, on Noisy GT 2D poses as described in Section 3.3, or on 2D detections provided by either (Newell et al., 2016) or (Cao et al., 2017) for comparison. Even though Human3.6M is a single-person dataset, note that in (Tekin et al., 2016b; Pavlakos et al., 2017) the input images are cropped using bounding boxes around the persons and that the 2D pose detector models of (Newell et al., 2016) and (Wei et al., 2016) used in (Chen and Ramanan, 2017) and (Martinez et al., 2017) are applied on bounding boxes around the persons obtained from ground truth.
Our single-view 3D pose regression model trained on 2D detections provided by (Newell et al., 2016) achieves an average localization error of 67.2 mm. We should note that our result for this model improves slightly over the result reported by (Martinez et al., 2017) for the same experimental setup (67.5 mm), where the same 2D pose detector trained on MPII Pose is used without any finetuning on Human3.6M. Martinez et al. (2017) showed that the results can be improved by finetuning the model on Human3.6M (62.9 vs. 67.5), which is in line with the results reported in Chen and Ramanan (2017). However, in order to easily generalize to new environments, we do not finetune 2D pose detectors, as this would require annotated data. Except for the model SV, (Newell et al., 2016), which uses the same 2D pose detector during both training and testing for the sake of a fair comparison with (Martinez et al., 2017), all our models use 2D detections provided by (Cao et al., 2017) during testing. We should note that even though our single-view 3D regression model trained on the 2D detections provided by Newell et al. (2016) performs better than the other variants of our single-view model, we decided to use the model of Cao et al. (2017) instead, as it allows us to detect and estimate 2D body poses in multi-person scenarios, e.g. the MVOR dataset.
The evaluation results show that our single-view model trained on ground truth 2D poses and the model of Chen and Ramanan (2017) perform similarly. This indicates that our regression function, when trained on perfect GT data, eventually works similarly to the lookup table used in (Chen and Ramanan, 2017). One can therefore conclude that if perfect 2D detections are obtained, a 2D-to-3D regression function and a lookup table would work similarly. In practice, however, the 2D detections are not perfect. Therefore, by incorporating detection noise during training as described in Section 3.3, we have constructed a model, SV, Noisy GT, that copes better with noisy detections (81.8 vs. 119.6). We observe that if we train the model on 2D detections from the same 2D detector used during testing, i.e. (Cao et al., 2017), the average MPJPE is improved by only four millimeters. These results indicate that our data generation model presented in Section 3.3 has properly incorporated the detector’s characteristics and generalizes well to test data.
We also present the evaluation results of our multi-view regression function in Table 3. Training the model MV, (Cao et al., 2017) on 2D pose detections from the same detector as the one used at test time achieves an average MPJPE of 49 millimeters. This is the lowest MPJPE that our MV regression model can obtain on Human3.6M using this single-view pose detector. During our experiments, we observed that even though our multi-view regression models generally converged to lower training losses than the single-view ones, both single-view and multi-view models trained on ground truth poses achieve similar performance (119.6 vs. 118.5). We believe that because this multi-view model is only trained on perfect ground truth 2D poses, it always expects the exact projections of a 3D pose in all views. Since the 2D pose detector provides noisy detections, this is not always possible at test time. The last row shows the results of our multi-view regression model trained using 2D poses generated from 3D ground truth by incorporating the 2D detector’s characteristics. This model reduces the error by more than 60 mm compared to the same model trained on ground truth data only (57.9 vs. 118.5). The model also improves the localization results by 23.9 mm compared to the single-view model SV, Noisy GT (57.9 vs. 81.8), indicating that it properly incorporates 2D body part locations across all views to regress their 3D positions. These results also confirm our hypothesis that incorporating the characteristics of the detector during training enables developing models that are robust to the inaccuracy and failure of the detector at test time.
|One view||Two views||Three views|
Multi-view OR. In order to assess the ability of our approach to generalize to new multi-view environments, we evaluate its performance on the multi-view OR (MVOR) dataset. We use the 3D poses from Human3.6M, the camera calibration parameters of MVOR and the data generation model described in Section 3.2 to train a multi-view 3D regression model. The evaluation results of this model on MVOR are presented in Table 4. We use 3D MPJPE in centimeters as the evaluation metric. Following the convention in MVDeep3DPS (Kadkhodamohammadi et al., 2017), MPJPE is computed for the same set of body parts and is reported per number of supporting views. Our model achieves an average MPJPE of 17 cm on this dataset. The results show a significant improvement in the localization of the body parts as the number of supporting views increases. The average MPJPE is improved by 12 cm for persons who are detected in three views compared to those who are only detected in one view. This clearly indicates the benefit of observing an environment from multiple complementary views and the ability of our regression model to leverage such data for predicting 3D body poses even when some body parts are invisible.
Table 4 also compares the performance of our model with the MVDeep3DPS model (Kadkhodamohammadi et al., 2017). We should note that MVDeep3DPS requires both color and depth images, in contrast to our approach, which relies solely on color images. Our approach, which only uses Human3.6M data, improves the results over MVDeep3DPS, which is trained on a dataset recorded in the same OR as the one used to capture MVOR. These evaluation results demonstrate that our approach can exploit existing datasets to easily generalize to new multi-view setups without any need for new annotation.
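As an illustration of the metric reported in Table 4, the following is a minimal sketch of how MPJPE can be computed and averaged per number of supporting views. The function names, array shapes and the grouping helper are assumptions made for this example; they are not taken from the evaluation code used for the paper.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error for one pose.

    pred, gt: arrays of shape (num_joints, 3), expressed in the same unit.
    Returns the mean Euclidean distance over joints.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def mpjpe_by_support(preds, gts, num_views):
    """Average MPJPE over persons, grouped by number of supporting views.

    preds, gts: arrays of shape (num_persons, num_joints, 3).
    num_views:  integer array of shape (num_persons,), the number of views
                in which each person was detected (hypothetical bookkeeping).
    Returns a dict {number_of_views: average MPJPE}.
    """
    preds = np.asarray(preds, dtype=float)
    gts = np.asarray(gts, dtype=float)
    num_views = np.asarray(num_views)
    # Per-person error: mean over joints of the per-joint Euclidean distance.
    per_person = np.linalg.norm(preds - gts, axis=-1).mean(axis=-1)
    return {int(v): float(per_person[num_views == v].mean())
            for v in np.unique(num_views)}
```

With predictions and ground truth expressed in centimeters, `mpjpe_by_support` returns one average per group of persons seen in one, two or three views, which mirrors the layout of Table 4.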
4.5 Qualitative Results
In Figures 4 and 5, we show qualitative results on both Human3.6M and MVOR. Please note that for generating the qualitative images, the predicted 3D poses are transferred to the room reference frame using an offset computed as the relative difference between the neck location in the ground truth and the neck location in the predicted skeleton. Each row shows a multi-view frame. The predicted 3D poses are shown in the last column and the overlaid 2D poses are obtained by projecting the 3D poses into the views. Figure 4 demonstrates the high quality of the predicted 3D body poses. For example, the frame presented in the last row shows that our approach can successfully incorporate evidence across all views to localize the occluded body parts.
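The visualization procedure described above (shifting the predicted skeleton by the neck offset, then projecting it into each view) can be sketched as follows. This is a minimal sketch under stated assumptions: the joint index `NECK`, the function names and the use of a 3x4 pinhole projection matrix are hypothetical choices for this example, and lens distortion is ignored.

```python
import numpy as np

NECK = 1  # hypothetical index of the neck joint in the skeleton


def align_to_room(pred, gt, neck=NECK):
    """Translate a predicted pose so that its neck coincides with the GT neck.

    pred, gt: arrays of shape (num_joints, 3).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    offset = gt[neck] - pred[neck]  # neck-to-neck translation
    return pred + offset


def project(points_3d, P):
    """Project 3D points into a view with a 3x4 projection matrix P.

    points_3d: array of shape (num_joints, 3) in the room reference frame.
    Returns pixel coordinates of shape (num_joints, 2).
    """
    points_3d = np.asarray(points_3d, dtype=float)
    homo = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    uvw = homo @ np.asarray(P, dtype=float).T
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division
```

Calling `project(align_to_room(pred, gt), P)` once per camera yields the 2D overlays for one multi-view frame. The alignment only translates the skeleton, so it does not change the relative positions of the joints.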
We also show some frames from the multi-view OR dataset in Figure 5. As can be seen in this figure, this dataset is much more complex due to the similar appearance of the objects and the people, and due to the presence of many objects and multiple persons in the scene. Our approach predicts fairly accurate 3D body poses and always correctly detects the left and right side labels, even though it has not seen any data from this dataset or any other data collected in such an OR environment at the training stage. More qualitative results generated by our model on both datasets are available at https://youtu.be/Cx_kTRzqqzA.
The complexity of this dataset also allows us to identify some of the limitations of the proposed approach. For example, we observe that the elbow and wrist localizations are less accurate than those of other body parts, which is in line with the results presented in Tables 1 and 2. We envision that enforcing appearance consistency among the projections of a body part across all views can be used to update and improve the 2D body joint detections. The improved 2D detections could then be fed into our multi-view regression model to obtain a more accurate localization of the body parts in 3D. In the last row of Figure 5, we have highlighted a 3D body pose where the right arm configuration is infeasible because of physical body constraints. We believe that since our training data generation model described in Section 3.2 perturbs 3D poses randomly and does not take the body constraints into account, it may generate such a training sample. Therefore, it would be interesting to combine our data generation model with a model like the one used in (Vondrak et al., 2008) to enforce and verify the physical plausibility of the generated 3D poses.
4.6 Ablation Study
We performed several experiments on Human3.6M to study the impact of each component of our approach. We first observe that removing the stage-wise supervision always degrades performance; for example, the average MPJPE increases from 57.9 to 77.2 for our MV, Noisy GT model. Removing batch normalization leads to a substantial increase in the error (from 57.9 to 175). We also observe that the use of dropout during the training of single-view models and multi-view models on perfect ground truth data is important to obtain more robust models, as it reduces the errors by mm. However, deactivating dropout for our multi-view models trained on (Cao et al., 2017)’s detections or Noisy GT decreases localization errors by and mm, respectively. We believe that this is due to the fact that 2D detection inputs are constructed from single-view poses that have been independently affected by noise in each view, either by the detector inaccuracy or by our data generation model. This independent noise can therefore work as a regularizer that forces neurons to detect the most relevant information across all views, thereby removing the need for dropout.
Following (Moreno-Noguer, 2017) and (Martinez et al., 2017), we perform a series of experiments to evaluate the performance of our approach under different levels of noise at test time. For a fair comparison, we evaluate our single-view model trained on Noisy GT and add different levels of Gaussian noise to ground truth 2D poses at test time. The evaluation results are presented in Table 5 and are compared with EDM (Moreno-Noguer, 2017) and SimpBase (Martinez et al., 2017). Even though the average localization error of SimpBase is lower than our model’s error by one centimeter when tested on perfect ground truth 2D poses, our model achieves lower localization errors as the noise increases. This indicates that incorporating the detector’s characteristics during training allows our model to better cope with noise at test time.
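The test-time perturbation used in this experiment can be sketched as below. This is a minimal sketch: the function name, the use of isotropic zero-mean noise applied independently to each coordinate, and the noise levels in the usage note are assumptions for illustration, not the exact protocol behind Table 5.

```python
import numpy as np


def add_gaussian_noise(poses_2d, sigma, rng=None):
    """Add zero-mean Gaussian noise to 2D joint locations.

    poses_2d: array of shape (..., num_joints, 2), e.g. in pixels.
    sigma:    standard deviation of the noise, in the same unit.
    rng:      optional numpy random Generator for reproducibility.
    """
    rng = np.random.default_rng() if rng is None else rng
    poses_2d = np.asarray(poses_2d, dtype=float)
    # Independent noise for every coordinate of every joint.
    return poses_2d + rng.normal(0.0, sigma, size=poses_2d.shape)
```

A robustness curve can then be obtained by evaluating the same trained model on `add_gaussian_noise(gt_2d, sigma)` for a range of hypothetical levels such as `sigma in (0, 5, 10, 15, 20)` and reporting the 3D MPJPE for each level.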
In a multi-view setup, a 3D body pose can have completely different projections in the views depending on the orientation of the person with respect to the reference coordinate system. We therefore need to construct our multi-view regression model in a way that is robust to these changes in the orientation of the person, as our model only relies on these 2D projections to compute 3D body poses. For this reason, we propose to augment the training data by rotating each 3D pose in Human3.6M w.r.t. the reference frame. Figure 6 shows the effect of this data augmentation. We report the results of our multi-view model MV, Noisy GT on the MVOR dataset as a function of the number of rotations applied to each 3D pose in Human3.6M. The results show that applying up to three random rotations decreases the error, but applying more random rotations does not lead to any further improvement. Apart from the evaluation results reported in Figure 6, for all the other evaluations on MVOR we always use our multi-view model trained on the train set of Human3.6M, which is augmented by applying three random rotations to each 3D pose.
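The rotation-based augmentation can be sketched as follows. This is a minimal sketch under assumptions that the text does not specify: rotations are taken about the vertical axis of the reference frame, the vertical axis is assumed to be z, angles are drawn uniformly from [0, 2π), and the original poses are kept alongside the rotated copies. All function names are hypothetical.

```python
import numpy as np


def rotate_z(pose_3d, angle):
    """Rotate a (num_joints, 3) pose by `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0],
                  [s, c, 0.0],
                  [0.0, 0.0, 1.0]])
    return np.asarray(pose_3d, dtype=float) @ R.T


def augment_with_rotations(poses_3d, num_rotations=3, rng=None):
    """Return the original poses plus `num_rotations` rotated copies of each.

    poses_3d: array of shape (num_poses, num_joints, 3).
    Output shape: (num_poses * (1 + num_rotations), num_joints, 3).
    """
    rng = np.random.default_rng() if rng is None else rng
    poses_3d = np.asarray(poses_3d, dtype=float)
    out = [poses_3d]
    for _ in range(num_rotations):
        # One independent random angle per pose.
        angles = rng.uniform(0.0, 2.0 * np.pi, size=len(poses_3d))
        out.append(np.stack([rotate_z(p, a)
                             for p, a in zip(poses_3d, angles)]))
    return np.concatenate(out, axis=0)
```

The rotated 3D poses would then be projected with the camera parameters of the target setup to produce the 2D training inputs. Setting `num_rotations=3` corresponds to the configuration used for the MVOR evaluations.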
We present an easily generalizable approach for estimating 3D body poses using multi-view data. We propose a two-step framework to tackle this problem, which separates single-view pose detection from multi-view 3D pose regression. The proposed approach makes it possible to effectively exploit existing datasets to generalize to new multi-view environments. We use a multi-stage neural network as the regression function to estimate 3D poses. Our model is trained on data generated from a set of valid 3D poses by projecting the 3D poses using the camera parameters used at test time and by incorporating the characteristics of the single-view pose detector. Our evaluation results indicate the effectiveness and importance of incorporating the detector’s characteristics during training, as it significantly reduces the localization error and achieves results on par with models trained on the output of the detector. We have also evaluated the generalization of our approach on the multi-person MVOR dataset by using only the camera configuration parameters from this dataset during training, but no image data. Our approach yields fairly accurate results and outperforms the state-of-the-art model on this dataset. The results also show that the localization error dramatically decreases as the number of supporting views increases. This highlights the benefit of our approach in leveraging multi-view data to obtain a reliable model for crowded and cluttered environments. To the best of our knowledge, this is also the first multi-view RGB approach that has been quantitatively evaluated on a real world dataset for the task of 3D body part localization.
This work was supported by French state funds managed within the Investissements d’Avenir program by BPI France (project CONDOR) and by the ANR (references ANR-11-LABX-0004 and ANR-10-IAHU-02). The authors would also like to acknowledge the support of NVIDIA with the donation of a GPU used in this research.
- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X., 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. URL: https://www.tensorflow.org/. Software available from tensorflow.org.
- Amin et al. (2013) Amin, S., Andriluka, M., Rohrbach, M., Schiele, B., 2013. Multi-view pictorial structures for 3d human pose estimation, in: British Machine Vision Conference (BMVC).
- Amin et al. (2014) Amin, S., Müller, P., Bulling, A., Andriluka, M., 2014. Test-time adaptation for 3d human pose estimation, in: Pattern Recognition. Springer. volume 8753 of Lecture Notes in Computer Science, pp. 253–264.
- Andriluka et al. (2014) Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B., 2014. 2D human pose estimation: New benchmark and state of the art analysis, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 3686–3693.
- Belagiannis et al. (2014a) Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., Ilic, S., 2014a. 3d pictorial structures for multiple human pose estimation, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 1669–1676.
- Belagiannis et al. (2014b) Belagiannis, V., Wang, X., Schiele, B., Fua, P., Ilic, S., Navab, N., 2014b. Multiple human pose estimation with temporally consistent 3D pictorial structures, in: ChaLearn Looking at People Workshop, European Conference on Computer Vision (ECCV2014), IEEE. pp. 742–754.
- Belagiannis et al. (2016) Belagiannis, V., Wang, X., Shitrit, H.B.B., Hashimoto, K., Stauder, R., Aoki, Y., Kranzfelder, M., Schneider, A., Fua, P., Ilic, S., Feussner, H., Navab, N., 2016. Parsing human skeletons in an operating room. Machine Vision and Applications, 1–12.
- Biswas et al. (2006) Biswas, P., Liang, T.C., Toh, K.C., Ye, Y., Wang, T.C., 2006. Semidefinite programming approaches for sensor network localization with noisy distance measurements. IEEE Transactions on Automation Science and Engineering 3, 360–371.
- Burenius et al. (2013) Burenius, M., Sullivan, J., Carlsson, S., 2013. 3d pictorial structures for multiple view articulated pose estimation, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 3618–3625.
- Cao et al. (2017) Cao, Z., Simon, T., Wei, S.E., Sheikh, Y., 2017. Realtime multi-person 2D pose estimation using part affinity fields, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1302–1310.
- Chen and Ramanan (2017) Chen, C.H., Ramanan, D., 2017. 3D human pose estimation = 2D pose estimation + matching, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5759–5767.
- Chen et al. (2008) Chen, D., Chou, P.C., Fookes, C.B., Sridharan, S., 2008. Multi-view human pose estimation using modified five-point skeleton model, in: International Conference on Signal Processing and Communication Systems, pp. 17–19.
- Dogan et al. (2017) Dogan, E., Eren, G., Wolf, C., Lombardi, E., Baskurt, A., 2017. Multi-view pose estimation with mixtures-of-parts and adaptive viewpoint selection. IET Computer Vision.
- Fang et al. (2017) Fang, H., Xu, Y., Wang, W., Liu, X., Zhu, S., 2017. Learning knowledge-guided pose grammar machine for 3d human pose estimation. CoRR abs/1710.06513. URL: http://arxiv.org/abs/1710.06513.
- Felzenszwalb et al. (2010) Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D., 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1627–1645.
- Felzenszwalb and Huttenlocher (2005) Felzenszwalb, P.F., Huttenlocher, D.P., 2005. Pictorial structures for object recognition. International Journal of Computer Vision 61, 55–79.
- Fischler and Elschlager (1973) Fischler, M.A., Elschlager, R.A., 1973. The representation and matching of pictorial structures. IEEE Transactions on Computers 22, 67–92.
- Gall et al. (2010) Gall, J., Rosenhahn, B., Brox, T., Seidel, H.P., 2010. Optimization and filtering for human motion capture. International Journal of Computer Vision 87, 75–92.
- Hartley and Zisserman (2000) Hartley, R.I., Zisserman, A., 2000. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049.
- Hofmann and Gavrila (2011) Hofmann, M., Gavrila, D.M., 2011. Multi-view 3D human pose estimation in complex environment. International Journal of Computer Vision 96, 103–124.
- Ioffe and Szegedy (2015) Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pp. 448–456.
- Ionescu et al. (2014) Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C., 2014. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1325–1339.
- Jiang (2010) Jiang, H., 2010. 3d human pose reconstruction using millions of exemplars, in: 2010 20th International Conference on Pattern Recognition, pp. 1674–1677.
- Kadkhodamohammadi et al. (2017) Kadkhodamohammadi, A., Gangi, A., de Mathelin, M., Padoy, N., 2017. A multi-view RGB-D approach for human pose estimation in operating rooms, in: Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 363–372.
- Lin et al. (2014) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common Objects in Context. Springer. pp. 740–755.
- Luo et al. (2010) Luo, X., Berendsen, B., Tan, R.T., Veltkamp, R.C., 2010. Human pose estimation for multiple persons based on volume reconstruction, in: International Conference on Pattern Recognition, pp. 3591–3594.
- Luvizon et al. (2018) Luvizon, D.C., Picard, D., Tabia, H., 2018. 2D/3D pose estimation and action recognition using multitask deep learning, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE.
- Martinez et al. (2017) Martinez, J., Hossain, R., Romero, J., Little, J.J., 2017. A simple yet effective baseline for 3d human pose estimation, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2659–2668.
- Moreno-Noguer (2017) Moreno-Noguer, F., 2017. 3d human pose estimation from a single image via distance matrix regression, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1561–1570.
- Nair and Hinton (2010) Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning, pp. 807–814.
- Newell et al. (2017) Newell, A., Huang, Z., Deng, J., 2017. Associative embedding: End-to-end learning for joint detection and grouping, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 2277–2287.
- Newell et al. (2016) Newell, A., Yang, K., Deng, J., 2016. Stacked Hourglass Networks for Human Pose Estimation. Springer. pp. 483–499.
- Pavlakos et al. (2017) Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K., 2017. Coarse-to-fine volumetric prediction for single-image 3D human pose, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1263–1272.
- Radwan et al. (2013) Radwan, I., Dhall, A., Goecke, R., 2013. Monocular image 3d human pose estimation under self-occlusion, in: 2013 IEEE International Conference on Computer Vision, pp. 1888–1895.
- Rogez et al. (2017) Rogez, G., Weinzaepfel, P., Schmid, C., 2017. LCR-Net: Localization-classification-regression for human pose, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1216–1224.
- Sigal et al. (2009) Sigal, L., Balan, A.O., Black, M.J., 2009. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision 87, 4–27.
- Sigal et al. (2012) Sigal, L., Isard, M., Haussecker, H., Black, M.J., 2012. Loose-limbed people: Estimating 3D human pose and motion using non-parametric belief propagation. International Journal of Computer Vision 98, 15–48.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958.
- Sundaresan and Chellappa (2009) Sundaresan, A., Chellappa, R., 2009. Multicamera tracking of articulated human motion using shape and motion cues. IEEE Transactions on Image Processing 18, 2114–2126.
- Tekin et al. (2016a) Tekin, B., Katircioglu, I., Salzmann, M., Lepetit, V., Fua, P., 2016a. Structured prediction of 3d human pose with deep neural networks, in: Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016.
- Tekin et al. (2016b) Tekin, B., Rozantsev, A., Lepetit, V., Fua, P., 2016b. Direct prediction of 3d body poses from motion compensated sequences, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 991–1000.
- Vondrak et al. (2008) Vondrak, M., Sigal, L., Jenkins, O.C., 2008. Physical simulation for probabilistic motion tracking, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
- Wan et al. (2017) Wan, Q., Zhang, W., Xue, X., 2017. Deepskeleton: Skeleton map for 3D human pose regression. CoRR abs/1711.10796. URL: http://arxiv.org/abs/1711.10796.
- Wei et al. (2016) Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y., 2016. Convolutional pose machines, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 4724–4732.
- Yu and Hong (2017) Yu, J., Hong, C., 2017. Exemplar-based 3d human pose estimation with sparse spectral embedding. Neurocomputing 269, 82–89.