Real-time marker-less multi-person 3D pose estimation in RGB-Depth camera networks

by   Marco Carraro, et al.
Università di Padova

This paper proposes a novel system to estimate and track the 3D poses of multiple persons in calibrated RGB-Depth camera networks. The multi-view 3D pose of each person is computed by a central node which receives the single-view outcomes from each camera of the network. Each single-view outcome is computed by using a CNN for 2D pose estimation and extending the resulting skeletons to 3D by means of the sensor depth. The proposed system is marker-less, multi-person, independent of the background and does not make any assumption on people's appearance and initial pose. The system provides real-time outcomes, which makes it well suited for applications requiring user interaction. Experimental results show the effectiveness of this work with respect to a baseline multi-view approach in different scenarios. To foster research and applications based on this work, we released the source code in OpenPTrack, an open source project for RGB-D people tracking.



I Introduction

The human body pose is rich in information. Many algorithms and applications, such as Action Recognition [1, 2, 3], People Re-identification [4], Human-Computer Interaction (HCI) [5] and Industrial Robotics [6, 7, 8], rely on this type of data. The recent availability of smart cameras [9, 10, 11] and affordable RGB-Depth sensors, such as the first- and second-generation Microsoft Kinect, makes it possible to estimate and track body poses in a cost-efficient way. However, a single sensor is often not reliable enough because of occlusions and Field-of-View (FOV) limitations. For this reason, a common solution is to take advantage of camera networks. Nowadays, the most reliable way to perform human Body Pose Estimation (BPE) is to use marker-based motion capture systems. These systems show great results in terms of accuracy (less than 1 mm), but they are very expensive and require the users to wear many markers, which heavily limits their diffusion. Moreover, these systems usually require offline computations in complicated scenarios with many markers and people, while the system we propose provides immediate results. A real-time response is usually needed in security applications, where a person's actions should be detected in time, or in industrial applications, where human motion is predicted to prevent collisions with robots in shared workspaces. Motivated by these reasons, research on marker-less motion capture systems has been particularly active in recent years.

Fig. 1: The output provided by the system we are proposing. In this example, five persons are seen from a network composed of four Microsoft Kinect v2.
Fig. 2: The system overview. The camera network is composed of several RGB-D sensors (from 1 to N). Each single-view detector takes the RGB and Depth images as input and computes the 3D skeletons of the people in the scene as the output using the calibration parameters K. The information is then sent to the multi-view central node, which is in charge of computing the final pose estimation for each person in the scene. First, a data association is performed to determine which pose detection belongs to which pose track; then, a filtering step is performed to update the pose track given the detection.

In this work, we propose a novel system to estimate the 3D human body pose in real time. To the best of our knowledge, this is the first open-source and real-time solution to the multi-view, multi-person 3D body pose estimation problem. Figure 1 depicts our system output. The system relies on the feed of multiple RGB-D sensors (from 1 to N) placed in the scene and on an extrinsic calibration of the network. In this work, this calibration is performed with the calibration_toolkit [12]. The multi-view poses are obtained by fusing the single-view outcomes of each detector, which runs a state-of-the-art 2D body pose estimator [13, 14] and extends its output to 3D by means of the sensor depth. The contribution of the paper is two-fold: i) we propose a novel system to fuse and update 3D body poses of multiple persons in the scene and ii) we enrich a state-of-the-art single-view 2D pose estimation algorithm to provide 3D poses. As a further contribution, the code of the project has been released as open source as part of the OpenPTrack [15, 16] repository. The proposed system is:

  • multi-view: The fused poses are computed taking into account the different poses of the single-view detectors;

  • asynchronous: The fusion algorithm does not require the different sensors to be synchronous or to have the same frame rate. This allows the user to choose the computing node of each detector according to their needs and possibilities;

  • multi-person: The system does not make any assumption on the number of persons in the scene. The overhead due to the different number of persons is negligible;

  • scalable: No assumptions are made on the number or positions of the cameras. The only request is an offline one-time extrinsic calibration of the network;

  • real-time: The final pose framerate grows linearly with the number of cameras in the network. In our experiments, a single-camera network can provide from 5 fps to 15 fps depending on the Graphical Processing Unit (GPU) exploited by the detector. The final framerate of a camera network composed of N nodes is the sum of their single-view framerates;

  • low-cost: The system relies on affordable low-cost RGB-D sensors controlled by consumer GPU-enabled computers. No specific hardware is required.

The remainder of the paper is organized as follows: in Section II we review the literature regarding human BPE from single and multiple views, while Section III describes our system and the approach used to solve the problem. In Section IV experimental results are presented and, finally, in Section V we present our conclusions.

II Related Work

II-A Single-view body pose estimation

There has long been great interest in single-view human BPE, in particular for gaming purposes or avatar animation. Recently, the advent of affordable RGB-D sensors boosted the research in this and other Computer Vision fields. Shotton et al. [17] proposed the skeletal tracking system licensed by Microsoft and used by the Xbox console with the first-generation Kinect. This approach used a random forest classifier to classify the different pixels as belonging to the different body parts. This work inspired an open-source approach that was released by Buys et al. [18]. This same work was then improved by adding the OpenPTrack people detector module as a preprocessing step [19]. Still, the performance of the detector remained very poor for non-frontal persons. In recent years, many challenging Computer Vision problems have been tackled successfully with Convolutional Neural Network (CNN) solutions. Single-view BPE has also benefited greatly from these techniques [20, 21, 22, 14]. The impressive pose estimation quality provided by those solutions is usually paid for in terms of computational time. Nevertheless, this limitation is being alleviated by newer network layouts and Graphical Processing Unit (GPU) architectures, as proved by some recent works [22, 14]. In particular, the work of Cao et al. [14] was one of the first to implement a CNN solution to solve people BPE in real time using a bottom-up approach. The authors were able to compute 2D poses for all the people in the scene with a single forward pass of their CNN. This work has been adopted here as part of our single-view detectors.

Fig. 3: The single-view pipeline followed for each sensor. At each new frame composed of a color image (RGB), a depth image and the calibration parameters, the 3D pose of each person in the scene is computed from the 2D one. Then, the results are sent to the central computer which will compute the multi-view result.

II-B Multi-view body pose estimation

Multiple views can be exploited to be more robust against occlusions, self-occlusions and FOV limitations. In [23] a Convolutional Neural Network (CNN) approach is proposed to estimate the body poses of people by using a low number of cameras, also in outdoor scenarios. The solution combines a generative and a discriminative approach, since the authors use a CNN to compute the poses, which are driven by an underlying model. For this reason, the collaboration of the users is required for the initialization phase. In our previous work [19], we solved the single-person human BPE by fusing the data of the different sensors and by applying an improved version of [18] to a virtual depth image of the frontalized person. In this way, the skeletonization is only performed once, on the virtual depth map of the person in frontal pose. In [24], a 3D model is registered to the point clouds of two Kinects. The work provides very accurate results, but it is computationally expensive and not scalable to multiple persons. The authors of [25] proposed a purely geometric approach to infer the multi-view pose from a synchronous set of 2D single-view skeletons obtained using [26]. The third dimension is computed by imposing a set of algebraic constraints from the triangulation of the multiple views. The final skeleton is then computed by solving a least-squares problem. While the method is computationally promising (a skeleton is computed in 1 s per set of synchronized images with an unoptimized version of the code), it does not scale with the number of persons in the scene. In [27] common RGB cameras and RGB-D sensors are used together to record a dance motion performed by a user. The fusion is obtained by selecting the best match among the different skeletons by means of a probabilistic approach with a particle filter. The system performs well enough for its goal, but it does not scale to multiple people and requires an expensive setup. In [28] the skeletons obtained from the single images are enriched with a 3D model computed with the visual hull technique. In [29] two orthogonal Kinects are used to improve the single-view outcome of both sensors. The authors used a constrained optimization framework with the bone lengths as hard constraints. While the work provides a real-time solution and there are no hard assumptions on the Kinect positions, it was tested with just one person and two orthogonal Kinect sensors. Similarly to many recent works [25, 28, 27], we use a state-of-the-art single-view body pose estimator, but we augment its result with 3D data and then combine the multiple views to improve the overall quality.

III System Design

Figure 2 shows an overview of the proposed system. It can be split into two parts: i) the single-view part, which is the same for each sensor and is executed locally, and ii) the multi-view part, which is executed only by the master computer. In the single-view part (see Figure 3), each detector estimates the 2D body pose of each person in the scene using an open-source state-of-the-art single-view body pose estimator. In this work, we use the OpenPose [13, 14] library, but the overall system is totally independent of the single-view algorithm used. The last operation made by the detector is to compute the 3D position of each joint returned by OpenPose. This step is done by exploiting the depth information coming from the RGB-D sensor used. The 3D skeleton is then sent to the master computer for the fusion phase. The fusion is performed by means of multiple Unscented Kalman Filters applied to the detection feeds, as explained in Section III-C.

III-A Camera Network setup

The camera network can be composed of several RGB-D sensors. In order to know the relative position of each camera, we calibrate the system using a solution similar to our previous works [16, 15]. From this step we fix a common world reference frame $W$ and we obtain a transformation $T_n^W$, for each camera $n = 1, \dots, N$ in the network, which transforms points from the camera coordinate system to the world reference system.

III-B Single-view Estimation of 3D Poses

Fig. 4: The human model used in this work.


Input:

  • $\mathcal{D}$ - a new detection set from sensor $n$ in the world reference frame

  • $\mathcal{T}$ - the current set of tracked persons' poses

  • $\epsilon$ - maximum distance for a detection to be considered for the association

Output:

  • $M$ - the association between the tracked poses and the new observations

  • $\mathcal{D}_u$ - the detections without an association. They will initialize a new track.

  • $\mathcal{T}_u$ - the tracks without an associated observation. They will be considered for removal.

1:procedure DATA_ASSOCIATION($\mathcal{D}$, $\mathcal{T}$)
4:     for each $\tau \in \mathcal{T}$ do
5:         for each $d \in \mathcal{D}$ do
7:               *v that $\tau$ would have if $d$ were associated to it*
8:               *prediction step of the centroid Kalman filter of $\tau$*
13:     for each $\tau \in \mathcal{T}$ do
14:         for each $d \in \mathcal{D}$ do
15:              if $X_{\tau d} = 1$ and $C_{\tau d} < \epsilon$ then
17:                  *update $\tau$ with $d$*
20:     return $M$, $\mathcal{D}_u$, $\mathcal{T}_u$
Algorithm 1 The algorithm performed by the master computer to decide the association between the different skeletons in a detection and the current tracks.

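Each node in the network is composed of an RGB-D sensor and a computer to process the images. Let $F = \{C, D\}$ be a frame captured by the detector $n$ and composed of the color image C and the depth image D, both expressed in the reference frame of camera $n$. The color and depth images in $F$ are considered as synchronized. We then apply OpenPose to $C$, obtaining the raw two-dimensional skeletons $S = \{S_1, \dots, S_k\}$. Each $S_i$ is a set of 2D joints which follows the human model depicted in Figure 4. The goal of the single-view detector is to transform $S$ into the set of skeletons $\bar{S} = \{\bar{S}_1, \dots, \bar{S}_k\}$, where each $\bar{S}_i$ is a three-dimensional skeleton. Given the RGB image $C$, let us consider a point $p = (x, y)$ and its corresponding depth $d = D(x, y)$. Considering $(f_x, f_y)$ and $(c_x, c_y)$ respectively the focal length and the optical center of the sensor, the relationship to compute the 3D point $P$ in the camera reference system R is given in Equation 1.

$$P = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} (x - c_x)\, d / f_x \\ (y - c_y)\, d / f_y \\ d \end{pmatrix} \tag{1}$$

Since the depth data is potentially noisy or missing, we compute the depth $\bar{d}$ associated to the point $p$ by applying a median to the set $\mathcal{N}(p)$ of valid depth values in a small window of half-size $r$ around $p$, as shown in Equations 2 and 3.

$$\mathcal{N}(p) = \{\, D(x', y') : |x' - x| \le r,\ |y' - y| \le r,\ D(x', y') \text{ valid} \,\} \tag{2}$$

$$\bar{d} = \operatorname{median}\, \mathcal{N}(p) \tag{3}$$

Given $\bar{d}$, we then proceed to the calculation of $P$ as shown in Equation 4.

$$P = \begin{pmatrix} (x - c_x)\, \bar{d} / f_x \\ (y - c_y)\, \bar{d} / f_y \\ \bar{d} \end{pmatrix} \tag{4}$$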
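The single-view lifting step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the released C++ implementation: the function names, the square window of half-size `radius`, and the convention that a depth of 0 marks a missing reading are all hypothetical choices made for the example.

```python
from statistics import median


def median_depth(depth, x, y, radius=1):
    """Median of the valid depths in a square window centred on pixel (x, y).

    `depth` is a row-major list of rows. Assumption: 0 encodes a missing reading.
    Returns None when the window holds no valid depth.
    """
    height, width = len(depth), len(depth[0])
    values = [
        depth[v][u]
        for v in range(max(0, y - radius), min(height, y + radius + 1))
        for u in range(max(0, x - radius), min(width, x + radius + 1))
        if depth[v][u] > 0
    ]
    return median(values) if values else None


def back_project(x, y, d, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (x, y) with depth d to camera coordinates."""
    return ((x - cx) * d / fx, (y - cy) * d / fy, d)


def joint_to_3d(depth, joint_2d, intrinsics, radius=1):
    """Lift one 2D joint to 3D using the median depth around it.

    `intrinsics` is the tuple (fx, fy, cx, cy). Returns None if no depth is available.
    """
    x, y = joint_2d
    d = median_depth(depth, round(x), round(y), radius)
    if d is None:
        return None
    return back_project(x, y, d, *intrinsics)
```

A joint for which `joint_to_3d` returns `None` would simply be reported as invalid for that frame; how the actual detector handles such joints is not specified in the text.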

Network | Method | r-shoulder | r-elbow | r-wrist | l-shoulder | l-elbow | l-wrist | r-hip | r-knee | r-ankle | l-hip | l-knee | l-ankle
single-camera network | Baseline | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100
single-camera network | Baseline | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100 | >100
single-camera network | Ours | 54.9 ± 58.6 | 42.4 ± 47.4 | 42.4 ± 40.0 | 77.7 ± 74.4 | 79.1 ± 82.7 | 70.0 ± 61.8 | 51.7 ± 43.7 | 54.5 ± 31.0 | 63.3 ± 34.2 | 97.8 ± 30.3 | 57.5 ± 38.9 | 69.2 ± 37.6
2-camera network | Baseline | 62.0 ± 33.0 | 62.9 ± 32.0 | 63.1 ± 34.5 | 83.3 ± 33.4 | 85.8 ± 37.8 | 94.8 ± 45.4 | 76.4 ± 30.6 | 75.9 ± 27.4 | 88.3 ± 35.6 | >100 | 85.4 ± 35.5 | 93.3 ± 37.0
2-camera network | Baseline | 83.7 ± 41.8 | 84.0 ± 40.9 | 83.1 ± 43.7 | >100 | >100 | >100 | 99.2 ± 40.4 | 96.3 ± 38.0 | >100 | >100 | >100 | >100
2-camera network | Ours | 20.7 ± 17.2 | 21.0 ± 17.5 | 24.3 ± 17.5 | 32.1 ± 23.0 | 33.4 ± 26.3 | 39.8 ± 35.1 | 22.4 ± 16.7 | 42.8 ± 17.2 | 59.7 ± 28.6 | 98.3 ± 21.2 | 39.9 ± 18.3 | 58.6 ± 27.1
4-camera network | Baseline | 28.7 ± 16.4 | 31.0 ± 16.9 | 32.2 ± 22.5 | 41.5 ± 17.9 | 39.9 ± 19.6 | 44.7 ± 29.5 | 40.2 ± 15.0 | 48.7 ± 12.8 | 58.6 ± 21.2 | 94.1 ± 26.1 | 52.1 ± 17.8 | 57.8 ± 27.9
4-camera network | Baseline | 38.4 ± 21.2 | 40.8 ± 21.7 | 41.6 ± 26.3 | 53.0 ± 23.2 | 52.7 ± 24.6 | 57.6 ± 33.1 | 50.7 ± 19.4 | 56.2 ± 16.7 | 66.0 ± 24.5 | 96.6 ± 30.8 | 61.2 ± 23.1 | 67.5 ± 31.6
4-camera network | Ours | 22.7 ± 18.9 | 21.3 ± 18.5 | 26.3 ± 19.9 | 22.5 ± 22.1 | 26.7 ± 25.9 | 31.8 ± 29.7 | 23.9 ± 18.0 | 46.5 ± 19.7 | 55.9 ± 25.1 | 95.4 ± 22.0 | 45.1 ± 20.5 | 49.1 ± 25.2

Table I: The results of the experiments. Each cell reports the mean ± the standard deviation of the reprojection error on a reference camera (see Equation 5).



III-C Multi-view fusion of 3D poses

The master computer is in charge of fusing the information it receives from the single-view detectors in the network. One of the common limitations of motion capture systems is the need for synchronized cameras. Moreover, off-the-shelf RGB-D sensors, such as the Microsoft Kinect v2, do not offer the possibility to trigger the image acquisition. In order to overcome this limitation, our solution merges the different data streams asynchronously. This allows the system to work also with other RGB-D sensors or other low-cost embedded machines. At time $t$, the master computer maintains a set of tracks $\mathcal{T} = \{\tau_1, \dots, \tau_l\}$, where each tracked pose $\tau_i$ is composed of the set of states of $m + 1$ different Kalman Filters, one per joint plus one for the centroid, i.e., $\tau_i = \{x_{i,1}, \dots, x_{i,m+1}\}$. The additional Kalman Filter is maintained for the data association algorithm. At time $t + 1$, a detection $\mathcal{D}^n$ may arrive from the sensor $n$ of the network. The master computer first refers the detection to the common world coordinate system by means of the transformation $T_n^W$ from the frame of camera $n$ to the world frame (see Section III-A):

$$\mathcal{D}^W = T_n^W \, \mathcal{D}^n$$
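As a minimal sketch of this change of reference frame, the snippet below applies a 4x4 homogeneous camera-to-world transform to every joint of every skeleton in a detection. The data layout (skeletons as lists of `(x, y, z)` tuples, with `None` for joints that could not be lifted to 3D) and the function names are assumptions made for the example, not the message format used by the released code.

```python
def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T (nested row-major lists) to a 3D point p."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def detection_to_world(T_cam_to_world, skeletons):
    """Express every joint of every skeleton in the world frame.

    Joints that are None (hypothetical marker for an invalid joint) are kept as None.
    """
    return [
        [None if joint is None else transform_point(T_cam_to_world, joint)
         for joint in skeleton]
        for skeleton in skeletons
    ]
```

Because the transform is fixed after the offline extrinsic calibration, it is computed once per camera and reused for every incoming detection.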

Then, it associates the different skeletons in $\mathcal{D}^W$ as new observations for the different tracks in $\mathcal{T}$ if they belong to them, or initializes new tracks if some of the skeletons do not belong to any $\tau_i$. At this stage, the system also decides if a track is old and has to be removed from $\mathcal{T}$. This step is important to prevent $\mathcal{T}$ from growing large and causing computation-time problems in systems that run for hours. We refer to this phase as data association. Algorithm 1 shows how it is performed. The data association is done by considering the centroid of each skeleton contained in the detection $\mathcal{D}^W$. The centroid is taken to be the chest joint, if this is valid; otherwise it is replaced with a weighted mean of the neighboring joints. Lines 6-9 of Algorithm 1 refer to the calculation of the cost of associating the detection pose $d$ with the track $\tau$. To calculate this, we consider the Mahalanobis distance defined by the likelihood vector $\tilde{z}$ at time $t + 1$ and $S$, the covariance matrix of the Kalman filter associated to the centroid of $\tau$. At this point, computing the optimal association between tracks and detections amounts to solving the assignment problem defined by the cost matrix $C$; Line 11 refers to the use of the Munkres (Hungarian) algorithm, which efficiently computes the optimal matrix $X$ with a 1 on the associated couples. Nevertheless, this algorithm does not consider a maximum distance between tracks and detections. Thus, it may happen that a couple is wrongly associated in the optimal assignment. For this reason, when inserting the couples in $M$, we also check that the cost of the couple in the initial cost matrix $C$ is below a threshold $\epsilon$.
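The association step can be illustrated with the following sketch. It makes several simplifying assumptions that are not taken from the paper: the innovation covariance of each centroid filter is treated as diagonal (so the squared Mahalanobis distance is a weighted sum of squares), the cost is the squared distance, and the optimal assignment is found by exhaustive search, which gives the same optimum as the Munkres algorithm but is only practical for a handful of people. All names are hypothetical.

```python
from itertools import permutations


def mahalanobis_sq(innovation, variances):
    """Squared Mahalanobis distance for a diagonal covariance (assumption)."""
    return sum(v * v / s for v, s in zip(innovation, variances))


def cost_matrix(track_predictions, detections):
    """Rows are tracks, columns are detections.

    Each track prediction is (predicted_centroid, innovation_variances);
    each detection is the 3D centroid of a detected skeleton.
    """
    return [
        [mahalanobis_sq([d[a] - pred[a] for a in range(3)], var) for d in detections]
        for pred, var in track_predictions
    ]


def optimal_assignment(cost):
    """Minimum-cost one-to-one assignment by exhaustive search.

    Returns (track_index, detection_index) pairs. Stands in for Munkres here.
    """
    n_t = len(cost)
    n_d = len(cost[0]) if cost else 0
    if n_t == 0 or n_d == 0:
        return []
    if n_t <= n_d:
        best = min(permutations(range(n_d), n_t),
                   key=lambda p: sum(cost[t][p[t]] for t in range(n_t)))
        return list(enumerate(best))
    best = min(permutations(range(n_t), n_d),
               key=lambda p: sum(cost[p[d]][d] for d in range(n_d)))
    return [(best[d], d) for d in range(n_d)]


def data_association(track_predictions, detections, max_cost):
    """Gate the optimal assignment with a maximum cost.

    Returns (matches, unmatched detection indices, unmatched track indices).
    """
    cost = cost_matrix(track_predictions, detections)
    matches = [(t, d) for t, d in optimal_assignment(cost) if cost[t][d] < max_cost]
    matched_tracks = {t for t, _ in matches}
    matched_dets = {d for _, d in matches}
    new_tracks = [d for d in range(len(detections)) if d not in matched_dets]
    stale_tracks = [t for t in range(len(track_predictions)) if t not in matched_tracks]
    return matches, new_tracks, stale_tracks
```

The gating after the assignment mirrors the check described in the text: a pair chosen by the optimal assignment is still rejected when its cost exceeds the threshold, so the detection starts a new track and the track becomes a candidate for removal.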

Once the data association problem is solved, we can assign the track IDs to the different skeletons. Indeed, we know which detections at the current time belong to the tracks in the system and, additionally, we know which tracks need to be created (i.e., new detections with no associated track) and which tracks to consider for removal. Let $l$ be the number of people in the scene; we use a set of Unscented Kalman Filters $\{K_{i,j}\}$, where the generic filter $K_{i,j}$ is in charge of computing the new position of the joint $j$ of the person $i$ at time $t + 1$, given the new detection received from one of the detectors at time $t + 1$ and the prediction of the filter computed from the previous position of the same joint at time $t$.

The state of each Kalman Filter is the three-dimensional position of the joint. We used a constant-velocity motion model, since it predicts joint movements well over the short time interval between two good detections of that joint.
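To make the per-joint filtering concrete, the sketch below implements a constant-velocity filter for one coordinate of one joint, driven by measurement timestamps so that detections from unsynchronized cameras can be fed in arrival order. It is a deliberately simplified stand-in: a plain linear Kalman filter replaces the Unscented Kalman Filter, the state explicitly holds position and velocity, the three axes are filtered independently, and all noise parameters are arbitrary illustrative values.

```python
class ConstantVelocityAxisFilter:
    """1D constant-velocity Kalman filter; one instance per axis of a joint."""

    def __init__(self, position, timestamp, pos_var=1.0, vel_var=1.0,
                 accel_noise=1.0, meas_var=0.01):
        self.x = [position, 0.0]                   # state: position, velocity
        self.P = [[pos_var, 0.0], [0.0, vel_var]]  # state covariance
        self.q = accel_noise                       # white-acceleration noise intensity
        self.r = meas_var                          # measurement noise variance
        self.t = timestamp                         # time of the last prediction

    def predict(self, timestamp):
        """Propagate the state to `timestamp` with F = [[1, dt], [0, 1]]."""
        dt = timestamp - self.t
        if dt <= 0:
            return  # assumption: late or simultaneous data does not move the state
        p, v = self.x
        (a, b), (c, d) = self.P
        self.x = [p + v * dt, v]
        # P <- F P F^T + Q, with Q for continuous white-noise acceleration
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q * dt ** 3 / 3,
             b + dt * d + self.q * dt ** 2 / 2],
            [c + dt * d + self.q * dt ** 2 / 2,
             d + self.q * dt],
        ]
        self.t = timestamp

    def update(self, measurement):
        """Correct the state with a position measurement (H = [1, 0])."""
        innovation = measurement - self.x[0]
        s = self.P[0][0] + self.r                  # innovation variance
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s
        self.x = [self.x[0] + k0 * innovation, self.x[1] + k1 * innovation]
        (a, b), (c, d) = self.P
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]
```

In use, each incoming detection would trigger `predict(stamp)` followed by `update(value)` on the three axis filters of every associated joint, regardless of which camera produced it. This is what lets cameras with different frame rates contribute to the same track.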

IV Experiments

The algorithm described in this paper does not require any synchronization between the cameras in the network. This fact makes it particularly difficult to find a fair comparison between our proposed system and other state-of-the-art works. Thus, in order to provide a useful indication of how our system performs, we recorded and manually annotated a set of RGB-D frames while a person was freely moving in the field of view of a 4-sensor camera network. We compare our algorithm with a baseline method called MAF (Moving Average Filter), in which the outcome of the generic joint at time $t$ is computed as an average of the last $k$ frames. In order to be as fair as possible, we fixed $k$ so as to provide comparable results in terms of smoothness. We also demonstrated the effectiveness of the multi-view fusion by comparing our results with the poses obtained by considering just one and two cameras of the same network. In this comparison, we report the average reprojection error with respect to one of the cameras, $r$. Equation 5 shows how this error is calculated, with $j$ as the generic joint expressed in the world reference system and $\hat{j}$ as the corresponding ground truth:

$$e_{repr} = \lVert \pi_r(T_W^r \, j) - \hat{j} \rVert \tag{5}$$

where $T_W^r$ maps world coordinates to the frame of camera $r$ and $\pi_r$ denotes the projection onto its image plane.
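The reprojection error used as the evaluation metric can be sketched as follows, assuming an ideal pinhole camera without lens distortion and a 4x4 world-to-camera transform for the reference camera. The function names are hypothetical, and the use of the population standard deviation in the summary is an assumption, since the text does not state which estimator was used.

```python
import math
from statistics import mean, pstdev


def world_to_camera(T, p):
    """Apply a 4x4 homogeneous world-to-camera transform T to a 3D point p."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point onto the image plane (pixels)."""
    X, Y, Z = point_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)


def reprojection_error(joint_world, gt_pixel, T_world_to_cam, intrinsics):
    """Euclidean pixel distance between the projected joint and its annotation."""
    u, v = project(world_to_camera(T_world_to_cam, joint_world), *intrinsics)
    return math.hypot(u - gt_pixel[0], v - gt_pixel[1])


def error_statistics(errors):
    """Mean and (population) standard deviation of a list of per-frame errors."""
    return mean(errors), pstdev(errors)
```

Applying `error_statistics` to the per-frame errors of one joint would give one mean and standard deviation pair of the kind reported in each cell of Table I.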


Table I shows the results we achieved. As depicted, the proposed method outperforms the baseline in all the cases: single-view, 2-camera network and 4-camera network. In the first two cases (single and 2-camera network) the improvement is from 50% to 60%, while, when multiple views are available, it is from 18% to 32%. It is also interesting to note that the noisiest joints are those of the legs, as confirmed by other state-of-the-art works [14, 20, 21].
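For reference, the MAF baseline used in the comparison amounts to a sliding-window mean over the last few observations of a joint. The sketch below is one straightforward way to write it; the class name and the choice to average over fewer samples while the window is still filling are assumptions for illustration, and the window length `k` is left as a parameter.

```python
from collections import deque


class MovingAverageJoint:
    """Moving Average Filter over the last k observed 3D positions of one joint."""

    def __init__(self, k):
        self.window = deque(maxlen=k)  # oldest sample is dropped automatically

    def update(self, position):
        """Add a new (x, y, z) observation and return the current average."""
        self.window.append(position)
        n = len(self.window)
        return tuple(sum(p[a] for p in self.window) / n for a in range(3))
```

Unlike the Kalman-based fusion, this filter has no motion model, so a larger `k` gives smoother output at the price of a larger lag behind the true joint position.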

IV-A Implementation Details

The system has been implemented and tested on the Ubuntu 14.04 and Ubuntu 16.04 operating systems using the Robot Operating System (ROS) [30] middleware. The code is entirely written in C++ using the Eigen, OpenCV and PCL libraries.

V Conclusions and Future Works

In this paper we presented a framework to compute the 3D body pose of each person in an RGB-D camera network using only its extrinsic calibration as a prior. The system does not make any assumption on the number of cameras, on the number of persons in the scene, or on their initial poses or clothes, and does not require the cameras to be synchronous. In our experimental setup we demonstrated the validity of our system over both single-view and multi-view approaches. In order to provide the best service to the Computer Vision community and to provide a future baseline method to other researchers, we released the source code under the BSD license as part of the OpenPTrack library. As future work, we plan to add a human dynamic model to guide the prediction of the Kalman Filters, to further improve the performance achievable by our system (in particular for the lower joints), and to further validate the proposed system on a new RGB-Depth dataset annotated with the ground truth of the single links of the persons' body pose. The ground truth will be provided by a marker-based commercial motion capture system.


This work was partially supported by U.S. National Science Foundation award IIS-1629302


  • [1] F. Han, X. Yang, C. Reardon, Y. Zhang, and H. Zhang, “Simultaneous feature and body-part learning for real-time robot awareness of human behaviors,” 2017.
  • [2] M. Zanfir, M. Leordeanu, and C. Sminchisescu, “The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2752–2759, 2013.
  • [3] C. Wang, Y. Wang, and A. L. Yuille, “An approach to pose-based action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 915–922, 2013.
  • [4] S. Ghidoni and M. Munaro, “A multi-viewpoint feature-based re-identification system driven by skeleton keypoints,” Robotics and Autonomous Systems, vol. 90, pp. 45–54, 2017.
  • [5] A. Jaimes and N. Sebe, “Multimodal human–computer interaction: A survey,” Computer vision and image understanding, vol. 108, no. 1, pp. 116–134, 2007.
  • [6] C. Morato, K. N. Kaipa, B. Zhao, and S. K. Gupta, “Toward safe human robot collaboration by using multiple kinects based real-time human tracking,” Journal of Computing and Information Science in Engineering, vol. 14, no. 1, p. 011006, 2014.
  • [7] S. Michieletto, F. Stival, F. Castelli, M. Khosravi, A. Landini, S. Ellero, R. Landò, N. Boscolo, S. Tonello, B. Varaticeanu, C. Nicolescu, and E. Pagello, “Flexicoil: Flexible robotized coils winding for electric machines manufacturing industry,” in ICRA workshop on Industry of the future: Collaborative, Connected, Cognitive, 2017.
  • [8] F. Stival, S. Michieletto, and E. Pagello, “How to deploy a wire with a robotic platform: Learning from human visual demonstrations,” in FAIM 2017, 2017.
  • [9] Z. Zivkovic, “Wireless smart camera network for real-time human 3d pose reconstruction,” Computer Vision and Image Understanding, vol. 114, no. 11, pp. 1215–1222, 2010.
  • [10] M. Carraro, M. Munaro, and E. Menegatti, “A powerful and cost-efficient human perception system for camera networks and mobile robotics,” in International Conference on Intelligent Autonomous Systems, pp. 485–497, Springer, Cham, 2016.
  • [11] M. Carraro, M. Munaro, and E. Menegatti, “Cost-efficient rgb-d smart camera for people detection and tracking,” Journal of Electronic Imaging, vol. 25, no. 4, pp. 041007–041007, 2016.
  • [12] F. Basso, R. Levorato, and E. Menegatti, “Online calibration for networks of cameras and depth sensors,” in OMNIVIS: The 12th Workshop on Non-classical Cameras, Camera Networks and Omnidirectional Vision-2014 IEEE International Conference on Robotics and Automation (ICRA 2014), 2014.
  • [13] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in CVPR, 2016.
  • [14] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.
  • [15] M. Munaro, A. Horn, R. Illum, J. Burke, and R. B. Rusu, “Openptrack: People tracking for heterogeneous networks of color-depth cameras,” in IAS-13 Workshop Proceedings: 1st Intl. Workshop on 3D Robot Perception with Point Cloud Library, pp. 235–247, 2014.
  • [16] M. Munaro, F. Basso, and E. Menegatti, “Openptrack: Open source multi-camera calibration and people tracking for rgb-d camera networks,” Robotics and Autonomous Systems, vol. 75, pp. 525–538, 2016.
  • [17] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition in parts from single depth images,” Communications of the ACM, vol. 56, no. 1, pp. 116–124, 2013.
  • [18] K. Buys, C. Cagniart, A. Baksheev, T. De Laet, J. De Schutter, and C. Pantofaru, “An adaptable system for rgb-d based human body detection and pose estimation,” Journal of visual communication and image representation, vol. 25, no. 1, pp. 39–52, 2014.
  • [19] M. Carraro, M. Munaro, A. Roitberg, and E. Menegatti, “Improved skeleton estimation by means of depth data fusion from multiple depth cameras,” in International Conference on Intelligent Autonomous Systems, pp. 1155–1167, Springer, Cham, 2016.
  • [20] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “Deepercut: A deeper, stronger, and faster multi-person pose estimation model,” in European Conference on Computer Vision, pp. 34–50, Springer, 2016.
  • [21] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for multi person pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4929–4937, 2016.
  • [22] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose estimation with iterative error feedback,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [23] A. Elhayek, E. de Aguiar, A. Jain, J. Thompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele, and C. Theobalt, “Marconi—convnet-based marker-less motion capture in outdoor and indoor scenes,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 3, pp. 501–514, 2017.
  • [24] Z. Gao, Y. Yu, Y. Zhou, and S. Du, “Leveraging two kinect sensors for accurate full-body motion capture,” Sensors, vol. 15, no. 9, pp. 24297–24317, 2015.
  • [25] M. Lora, S. Ghidoni, M. Munaro, and E. Menegatti, “A geometric approach to multiple viewpoint human body pose estimation,” in Mobile Robots (ECMR), 2015 European Conference on, pp. 1–6, IEEE, 2015.
  • [26] Y. Yang and D. Ramanan, “Articulated human detection with flexible mixtures of parts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2878–2890, 2013.
  • [27] Y. Kim, “Dance motion capture and composition using multiple rgb and depth sensors,” International Journal of Distributed Sensor Networks, vol. 13, no. 2, p. 1550147717696083, 2017.
  • [28] A. Kanaujia, N. Haering, G. Taylor, and C. Bregler, “3d human pose and shape estimation from multi-view imagery,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pp. 49–56, IEEE, 2011.
  • [29] K.-Y. Yeung, T.-H. Kwok, and C. C. Wang, “Improved skeleton tracking by duplex kinects: a practical approach for real-time applications,” Journal of Computing and Information Science in Engineering, vol. 13, no. 4, p. 041007, 2013.
  • [30] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “Ros: an open-source robot operating system,” in ICRA workshop on open source software, vol. 3, p. 5, Kobe, 2009.