1 Introduction
Multi-human 3D pose estimation from videos has a wide range of applications, including action recognition, sports analysis, and human-computer interaction. With the rapid development of deep neural networks, most recent efforts in this area have been devoted to monocular 3D pose estimation [25, 26]. However, despite much progress, the single-camera setting is still far from being solved due to the large variations of human poses and partial occlusion in monocular views. A natural solution to these problems is to recover the 3D poses from multiple camera views. Recent multi-view approaches generally employ the detected 2D body joints from multiple views as inputs, benefiting from advances in 2D human pose estimation [9, 11, 35], and address 3D pose estimation in a two-step formulation [2, 13]. Specifically, the 2D joints of the same person are first matched and associated across views; the 3D location of each joint is subsequently determined by a multi-view reconstruction method. In this formulation, the challenge comes from three parts: 1) the detected 2D joints are noisy and inaccurate since the pose estimation is imperfect; 2) the cross-view association is ambiguous when multiple people interact with each other in crowded scenes; 3) the computational complexity explodes as the number of people and the number of cameras increase.
To tackle the problem of cross-view association, the 3D pictorial structure model (3DPS) is widely used in previous methods [2, 8], where the 3D poses are recovered from 2D joints in a discretized 3-space. In this formulation, the likelihood of a joint belonging to a spatial bin is given by the geometric consistency [16], along with a predefined body structure model. A severe problem of 3DPS is its expensive computational cost due to the huge state space with multiple people in multiple views. As an improvement, Dong et al. [13] propose solving the cross-view association problem at the body level before applying 3DPS. They associate 2D poses of the same person from different views into clusters and estimate 3D poses from the clusters via 3DPS. Nevertheless, matching 2D poses between all pairs of views still makes the computational complexity explode as the number of cameras increases.
In contrast to previous methods that process inputs from multiple cameras simultaneously, we propose a new solution with an iterative processing strategy. Specifically, we propose exploiting the temporal consistency in videos to match 2D poses of each view with 3D poses directly in 3-space, where the 3D poses are retained and updated iteratively by cross-view multi-human tracking. There are two advantages to our formulation. First, regarding accuracy, matching in 3-space is expected to be robust to partial occlusion and inaccurate 2D localization, as the 3D poses consist of multi-view information. Second, regarding efficiency, processing camera views iteratively makes the computational complexity vary only linearly as the number of cameras changes, enabling applications on large-scale camera systems. To verify the effectiveness, we compare our method with state-of-the-art approaches on several widely-used public datasets, and moreover, we test it on a self-collected dataset with more than 12 cameras, as shown in Figure 1. With the proposed solution, we are able to estimate 3D poses accurately from 12 cameras at over 100 FPS.
Below, we review related work in multi-human 3D pose estimation and multi-view tracking, and then we present the details of our new approach, which contains an efficient geometric affinity measurement for tracking in 3-space, along with a novel 3D reconstruction algorithm designed for iterative processing in videos. In the experimental section, we perform the evaluation on three public datasets: Campus [2], Shelf [2], and CMU Panoptic [18], demonstrating both the state-of-the-art accuracy and efficiency of our method. We also propose a new dataset collected from large-scale camera systems, to verify the scalability of our method for real-world applications as the number of cameras increases.
2 Related work
Multi-human 3D pose estimation.
The problem of 3D human pose estimation has been studied from both monocular [26, 1, 21, 25, 12] and multi-view perspectives [8, 4, 13, 32].
Most existing monocular solutions are designed for the single-person case [28, 21, 12], where the estimated poses are expressed relative to the pelvis joint, and the absolute locations in the environment are unknown. Such a relative coordinate setting limits the application of these methods in surveillance scenarios.
To estimate multiple 3D poses from a monocular view, Mehta et al. [22] use location-maps [23] to infer 3D joint positions at the respective 2D joint pixel locations. Moon et al. [25] propose a root localization network to estimate the camera-centered coordinates of the human roots. Despite much recent progress in this area, the task of monocular 3D pose estimation is inherently ambiguous, as multiple 3D poses can map to the same 2D joints. The mapping result, unfortunately, often has a large deviation in practice, especially when occlusion or motion blur occurs in images.
On the other hand, multi-camera systems are becoming progressively available in the context of various applications such as sports analysis and video surveillance. Given images from multiple camera views, most previous methods [27, 29, 8, 2] are based on the 3D pictorial structure model (3DPS) [8], which discretizes the 3-space with a grid and assigns each joint to one of the bins (hypotheses). The cross-view association and reconstruction are solved by minimizing the geometric error [16] between the estimated 3D poses and the 2D inputs among all the hypotheses. Considering all joints of multiple people in all cameras simultaneously, these methods are generally computationally expensive due to the huge state space. Recent work from Dong et al. [13] proposes to solve the cross-view association problem at the body level first. 3DPS is subsequently applied to each cluster of 2D poses of the same person from different views. The state space is therefore reduced, as each person is processed individually. Nevertheless, the computational cost of cross-view association in this method is still too high to achieve real-time speed.
Multi-view tracking for 3D pose estimation.
Multi-view tracking for 3D pose estimation is not a new topic in computer vision. However, it is still non-trivial to combine these two tasks for fast and robust multi-human 3D pose estimation, given the challenges mentioned above.
Markerless motion capture, aiming at 3D motion capture for a single person, has been studied for a decade [33, 14, 34]. Tracking in these early works is developed for joint localization and motion estimation. With the recent progress in deep neural networks, temporal information has also been investigated with recurrent neural networks [30, 20] or convolutional neural networks [28] for single-view 3D pose estimation. However, these approaches are generally designed for well-aligned single-person cases, where the critical cross-view association problem is neglected. As for the multi-human case, Belagiannis et al. [4] propose employing cross-view tracking results to assist 3D pose estimation under the framework of 3DPS. They introduce temporal consistency from an off-the-shelf cross-view tracker [5] to reduce the state space of 3DPS. This approach separates tracking and pose estimation into two tasks and runs at 1 FPS, which is far from applicable to time-critical applications. There is also a very recent tracking approach [7] that uses the estimated 3D poses as inputs of the tracker to improve the tracking quality, while the pose estimation barely benefits from the tracking results. Tang et al. [32] propose to jointly perform multi-view 2D tracking and pose estimation for 3D scene reconstruction. The 2D detections are associated using a ground plane assumption, which is efficient but limits the accuracy. In contrast, we couple cross-view tracking and multi-human 3D pose estimation in a unified framework, making these two tasks benefit from each other in both accuracy and efficiency.
3 Method
In this section, we first give an overview of our framework with iterative processing; then we detail the two components of our framework, that is, cross-view tracking in 3-space with geometric affinity measurement, and incremental 3D pose reconstruction in videos.
3.1 Iterative processing for 3D pose estimation
Given an unknown number of people interacting with each other in a scene covered by multiple calibrated cameras, our approach takes the detected 2D body joints as inputs. We aim at estimating the 3D locations of a fixed set of body joints for each person in the scene. In particular, our approach differs from previous methods in the way frames from different cameras are processed. In contrast to taking all camera views at a time in a batch mode, we assume each camera streams frames independently, where the frames are collected in chronological order and fed into the framework one by one iteratively.
With iterative processing, the overall computational cost increases only linearly as the number of cameras increases, and strict synchronization between cameras is no longer required, giving the solution the potential to be applied to large-scale camera systems. Such a modification is straightforward, but not easy to achieve, as the cross-view association is generally ambiguous, especially when only one view is observed at a time. Another challenge, in this case, is to reconstruct 3D poses from different cameras when these cameras are not strictly synchronized.
To solve these problems, we construct our framework from two components: 1) cross-view tracking for body joint association, and 2) incremental 3D pose reconstruction for unsynchronized frames. Given a frame from a particular camera, the task of tracking is to associate the detected 2D human bodies with tracked targets. Here, we represent the targets in 3-space using historically estimated 3D poses. The cross-view association is therefore performed between 2D joints and 3D poses in 3-space, as detailed in Section 3.2. Subsequently, based on the association results, each 2D human body is assigned to a target or labeled as unmatched. The 3D pose of each target is incrementally updated by combining the newly observed and previously retained 2D joints. Since these joints are from different times, conventional reconstruction methods such as triangulation [16] are prone to inaccurate 3D locations. To deal with the unsynchronized frames, we present our incremental triangulation algorithm in Section 3.3.
3.2 Cross-view tracking with geometric affinity
In multi-view geometry, reconstructing the location of a point in 3-space requires knowing the 2D locations of the point in at least two views. Thus in our case, in order to estimate the 3D poses, we have to associate the detected 2D joints across views first. Similar to [13], we associate the joints at the body level, but not only across views, also across time. This forms the cross-view tracking problem, as discussed in this section.
Problem statement.
We retain the historical states of people in the scene as tracked targets; the problem then becomes associating these targets with the newly detected human bodies, while the detections come from a different camera in every iteration. Here, we begin with some notations and definitions. We use $x$ to represent a 2D point in camera coordinates, and $X$ for a 3D point in global coordinates. For a frame from camera $c$ at time $t$, a detected human body $D$ is denoted as 2D points $x_k(t)$ of a fixed set of human joints with indices $k \in K$. Meanwhile, a target $T$ is represented in 3-space using points $X_k(t_k)$ of the same set of human joints, where $t_k$ stands for the last updated time of joint $k$. The historical 2D joints are also retained in the corresponding targets.
Then, supposing there are $N$ detections in the new frame, we need to associate these detections with the $M$ tracked targets, and afterwards update the 3D locations of the targets based on the matching results. Technically, this is a weighted bipartite graph matching problem, where the graph is determined by the affinity matrix $A \in \mathbb{R}^{M \times N}$ between targets and detections. Once the graph is determined, the problem can be solved efficiently with the Hungarian algorithm [19]. Therefore, our major challenge is to measure the affinity of each pair of target and detection accurately and efficiently.

Affinity measurement.
Given a pair of target $T_j$ and detection $D_i$, the affinity is measured from both 2D and 3D geometric correspondences:

$$A(T_j, D_i) = \sum_{k \in K} A_{2D}\big(x^c_k(t'),\, x_k(t)\big) + A_{3D}\big(X_k(t_k),\, x_k(t)\big) \qquad (1)$$

where $x^c_k(t')$ is the last matched joint of the target from camera $c$. For each type of human joint the correspondence is computed independently, thus we omit the joint index $k$ in the following discussion for notational simplicity.
As shown in Figure 1(a), the 2D correspondence is computed based on the distance between the detected joint $x(t)$ and the previously retained joint $x^c(t')$ in the camera coordinates:

$$A_{2D}\big(x^c(t'),\, x(t)\big) = w_{2D}\left(1 - \frac{\lVert x(t) - x^c(t') \rVert}{\alpha_{2D}\,\Delta t}\right) e^{-\lambda_a \Delta t}, \qquad \Delta t = t - t' \qquad (2)$$

There are three hyper-parameters $w_{2D}$, $\alpha_{2D}$, and $\lambda_a$, standing for the weight of the 2D correspondence, the threshold of 2D velocity, and the penalty rate of the time interval, respectively. Note that $\Delta t > 0$ since frames are processed in chronological order. $A_{2D} > 0$ indicates that these two joints may come from the same person, and vice versa. The magnitude represents the confidence of the indication, which decreases exponentially as the time interval increases.
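A minimal sketch of this 2D correspondence, assuming the velocity-thresholded, exponentially decaying form $w_{2D}\,(1 - \lVert x - x' \rVert / (\alpha_{2D}\Delta t))\,e^{-\lambda_a \Delta t}$ implied by the description above; the function name and default hyper-parameter values are illustrative placeholders, not tuned settings:

```python
import math

def affinity_2d(x_new, x_old, dt, w_2d=0.5, alpha_2d=60.0, lambda_a=5.0):
    """2D correspondence: positive when the implied image-space velocity
    ||x_new - x_old|| / dt stays below the threshold alpha_2d, with the
    confidence decaying exponentially in the time interval dt."""
    dist = math.hypot(x_new[0] - x_old[0], x_new[1] - x_old[1])
    return w_2d * (1.0 - dist / (alpha_2d * dt)) * math.exp(-lambda_a * dt)
```

A small displacement over a short interval yields a positive affinity, while a large jump yields a negative one, matching the sign convention described above.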
2D correspondence is the most basic affinity measurement, exploited by single-view tracking methods. In order to track people across views, a 3D correspondence is introduced, as illustrated in Figure 1(b). We suppose that the cameras are well calibrated and the projection matrix of camera $c$ is provided as $P_c$. We first back-project the detected 2D point $x(t)$ into 3-space as a ray:

$$\tilde{X}(\mu) = P_c^{+}\,\tilde{x}(t) + \mu\,\tilde{C}_c \qquad (3)$$

where $P_c^{+}$ is the pseudo-inverse of $P_c$ and $C_c$ is the 3D location of the camera center. A symbol with a superscript tilde denotes the corresponding homogeneous coordinate. The 3D correspondence is then defined as:

$$A_{3D}\big(X(t_l),\, x(t)\big) = w_{3D}\left(1 - \frac{d_l\big(\hat{X}(t),\, \tilde{X}(\mu)\big)}{\alpha_{3D}}\right) e^{-\lambda_a (t - t_l)} \qquad (4)$$

where $d_l(\cdot)$ denotes the point-to-line distance in 3-space and $\alpha_{3D}$ is the threshold of distance. Note that in this formulation, the detected point is compared with a predicted point $\hat{X}(t)$ at the same time $t$. A linear motion model is introduced to predict the 3D location at time $t$:

$$\hat{X}(t) = X(t_l) + V(t_l)\,(t - t_l) \qquad (5)$$

where $X(t_l)$ is the last estimated 3D location at time $t_l$ and $V(t_l)$ is the 3D velocity estimated via a linear least-squares method.
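The geometric pieces of this 3D correspondence (back-projecting a detection into a ray, the point-to-line distance, and the linear motion prediction) can be sketched as below. Function names are illustrative; the ray construction follows the standard pseudo-inverse formulation, with the camera center recovered as the right null vector of the projection matrix:

```python
import numpy as np

def backproject_ray(P, x):
    """Back-project 2D point x through 3x4 projection matrix P into a 3D
    ray: returns the camera center and a unit direction along the ray."""
    x_h = np.array([x[0], x[1], 1.0])
    Xh = np.linalg.pinv(P) @ x_h            # a point on the ray (homogeneous)
    C = np.linalg.svd(P)[2][-1]             # camera center: solves P C = 0
    C = C[:3] / C[3]
    # A homogeneous point at infinity (fourth value ~0) directly gives the
    # ray direction; otherwise take the direction from the center.
    d = Xh[:3] / Xh[3] - C if abs(Xh[3]) > 1e-9 else Xh[:3]
    return C, d / np.linalg.norm(d)

def point_to_line_distance(X, origin, direction):
    """Distance from 3D point X to the ray (origin, unit direction)."""
    v = X - origin
    return np.linalg.norm(v - (v @ direction) * direction)

def predict_linear(X_last, V_last, dt):
    """Linear motion model: constant-velocity prediction dt seconds ahead."""
    return X_last + V_last * dt
```

The affinity of Equation 4 then compares `predict_linear(...)` against the ray returned by `backproject_ray(...)` via `point_to_line_distance(...)`.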
Here, for the purpose of verifying the iterative processing strategy, we only employ geometric consistency in the affinity measurement for simplicity. This baseline formulation already achieves state-of-the-art performance for both human body association and 3D pose estimation, as we demonstrate in the experiments. The key contribution comes from Equation 4, where we match the detected 2D joints with targets directly in 3-space.
Compared with matching in pairs of views in the camera coordinates [13], our formulation has three advantages: 1) matching in 3-space is robust to partial occlusion and inaccurate 2D localization, as the 3D pose combines the information from multiple views; 2) motion estimation in 3-space is more feasible and reliable than that in 2D camera coordinates; 3) the computational cost is significantly reduced, since only one comparison is required in 3-space for each pair of target and detection. To verify this, a quantitative comparison is further conducted in the ablation study.
Target update and initialization.
With the affinity measurement above, this section describes how we update and initialize targets in a particular iteration. First, we compute the affinity matrix between targets and detections using Equation 1 and solve the association problem via bipartite graph matching. Each detection is either assigned to a target or labeled as unmatched based on the association results. In the former case, if a detection is assigned to a target, the 3D pose of the target is updated gradually with the new detection, as 2D information is observed over time. Thus, 3D pose reconstruction in our framework is an incremental process, as detailed in Section 3.3.
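The assignment step above can be sketched as follows, assuming the affinity matrix has already been computed via Equation 1; the function name `associate` and the rejection threshold are illustrative, and SciPy's Hungarian solver stands in for the algorithm of [19]:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(affinity, threshold=0.0):
    """Match targets (rows) to detections (columns) by maximizing total
    affinity; pairs whose affinity is not above `threshold` stay unmatched.
    Returns (matches, unmatched_detections)."""
    rows, cols = linear_sum_assignment(-affinity)  # negate: maximize affinity
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if affinity[r, c] > threshold]
    matched_cols = {c for _, c in matches}
    unmatched = [c for c in range(affinity.shape[1]) if c not in matched_cols]
    return matches, unmatched
```

Detections left in `unmatched` are the candidates for target initialization.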
As for target initialization, we collect unmatched detections from different cameras and associate them across views using the epipolar constraint [16]. Here, for each camera, only the most recent frame is retained, thus we assume all detections are from very similar times and can be matched directly. In particular, we solve the association problem via weighted graph partitioning [31, 10], to comply with the cycle-consistency constraint across multiple cameras [13]. The body pose of a new target is initialized in 3-space from the detections when at least two views are matched. The overall procedure of cross-view tracking is shown in Algorithm 1.
3.3 Incremental 3D pose reconstruction
Generally, given 2D poses of the same person at the same time in different views, the 3D pose can be reconstructed using triangulation. However, with iterative processing, 2D poses in our framework may come from different times, raising the incremental triangulation problem.
Supposing the new frame is from camera $c$ at time $t$, for a target with the matched detection we collect 2D points from different cameras for each type of human joint:

$$\mathcal{X}_k = \{\, x_k^{c}(t) \,\} \cup \{\, x_k^{c'}(t_{c'}) \mid c' \neq c \,\} \qquad (6)$$

where $x_k^{c}(t)$ is the new point in camera $c$, and $x_k^{c'}(t_{c'})$ denotes the last observed point in camera $c'$. For each joint, the 3D location is estimated independently, thus we omit the joint index $k$ in the following discussion for clarity. Here we aim at estimating the 3D location $X(t)$ from the point collection $\mathcal{X}$, where the points are from different times.
We first briefly introduce the linear algebraic triangulation algorithm and then explain our improvement designed for this problem. For each camera $c$, the relationship between the 2D point and the 3D point can be written as:

$$\tilde{x}_c \times (P_c \tilde{X}) = 0 \qquad (7)$$

where $\times$ is the cross product, $\tilde{x}_c$ and $\tilde{X}$ are the homogeneous coordinates, and $P_c$ denotes the projection matrix. Writing Equation 7 out for multiple cameras gives an equation of the form:

$$A \tilde{X} = 0 \qquad (8)$$

with

$$A = \begin{bmatrix} \vdots \\ u_c\, p_c^{3\top} - p_c^{1\top} \\ v_c\, p_c^{3\top} - p_c^{2\top} \\ \vdots \end{bmatrix} \qquad (9)$$

where $(u_c, v_c)$ denotes the 2D point $x_c$, and $p_c^{i\top}$ is the $i$-th row of $P_c$. If there are at least two views, Equation 8 is overdetermined and can be solved via singular value decomposition (SVD). The final non-homogeneous coordinate $X$ is obtained by dividing the homogeneous coordinate $\tilde{X}$ by its fourth value: $X = \tilde{X}_{1:3} / \tilde{X}_4$.

The conventional triangulation algorithm assumes that the 2D points of different views are from the same time and independent of each other. However, in our case the points are collected from different times (Equation 6). The time difference between points varies from 0 to 300 ms in practice, according to the frame rate and temporary occlusion.
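The conventional linear algebraic triangulation just described can be written as a short, runnable sketch (two rows of $A$ per camera, SVD null vector, then dehomogenization); the function name is illustrative:

```python
import numpy as np

def triangulate(points_2d, projections):
    """Direct linear triangulation: for each camera, stack the rows
    u*p3 - p1 and v*p3 - p2 of Equation 9, then take the right null
    vector of A via SVD as the homogeneous 3D point."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.stack(rows))
    Xh = Vt[-1]                      # null vector = homogeneous 3D point
    return Xh[:3] / Xh[3]
```

With exact, synchronized observations from two or more views this recovers the 3D point up to numerical precision.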
Aiming at estimating the 3D point for the newest time $t$, we argue that points from different times should have different importance when solving Equation 8. To this end, we add weights to the coefficients of $A$ corresponding to different cameras:

$$(W \odot A)\, \tilde{X} = 0 \qquad (10)$$

where $W$ stacks the per-camera weights $w_c$ and $\odot$ denotes the Hadamard product. This is a formulation similar to that in [17], where $W$ is estimated by a convolutional neural network to capture the confidences of the 2D points. Differently, our method is designed for incremental processing on time series:

$$w_c = \frac{e^{-\lambda_t (t - t_c)}}{\lVert A_c \rVert} \qquad (11)$$

where $\lambda_t$ is the penalty rate, $t_c$ is the timestamp of the point, and $A_c$ denotes the rows of $A$ corresponding to camera $c$. In this case, the importance of a point increases as its timestamp approaches the latest time, making the estimated 3D point closer to the actual joint location at time $t$. The norm in the denominator eliminates the bias arising from different 2D locations in different views, as introduced in Equation 9.
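Extending the plain triangulation with the per-camera weights of Equations 10 and 11 gives the incremental variant sketched below; the function name and the default penalty rate are illustrative placeholders:

```python
import numpy as np

def triangulate_incremental(points_2d, projections, timestamps, t_now,
                            lambda_t=10.0):
    """Weighted triangulation: each camera's two rows of A are scaled by
    exp(-lambda_t * (t_now - t_c)) / ||A_c||, so newer observations dominate
    the solution and the per-view magnitude bias is normalized away."""
    rows = []
    for (u, v), P, t_c in zip(points_2d, projections, timestamps):
        A_c = np.stack([u * P[2] - P[0], v * P[2] - P[1]])
        w_c = np.exp(-lambda_t * (t_now - t_c)) / np.linalg.norm(A_c)
        rows.append(w_c * A_c)
    _, _, Vt = np.linalg.svd(np.concatenate(rows))
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]
```

When all timestamps are equal, the weights reduce to pure row normalization and the result coincides with conventional triangulation.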
4 Experiments
We perform the evaluation on three widely-used public datasets: Campus [2], Shelf [2], and CMU Panoptic [18], and compare our method with previous works in terms of both accuracy and efficiency. We also propose a new dataset with 12 to 28 camera views, to verify the scalability of our method as the numbers of cameras and people increase.
4.1 Datasets
We first briefly introduce the public datasets and the evaluation metric for multi-human 3D pose estimation. Then we present the details of our proposed dataset and compare it with existing public datasets.
Campus and Shelf. Campus is a small-scale dataset captured by three calibrated cameras. It consists of three people interacting with each other on an open outdoor square. The Shelf dataset is captured by five cameras with a more complex setting, where four people are interacting and disassembling a shelf in a small indoor area. The joint annotations of these two datasets are provided by Belagiannis et al. [2] for evaluation. We follow the same evaluation protocol as previous works [2, 3, 15, 13] and compute PCP (percentage of correctly estimated parts) scores to measure the accuracy of 3D pose estimation.
CMU Panoptic. The CMU Panoptic dataset [18] is captured in a closed studio with 480 VGA cameras and 31 HD cameras. The hundreds of cameras are distributed over the surface of a geodesic sphere of about 5 meters in width and 4 meters in height. The studio is designed to simulate and capture the social activities of multiple people, and therefore the space inside the sphere is built without obstacles. Due to the lack of ground truth 3D poses of multiple people, only qualitative results are presented on this dataset. In contrast to previous works [13, 17] that only exploit a few camera views (about two to five) for 3D pose estimation, we analyze our approach with different numbers of cameras in the ablation study.
Our dataset. Our dataset, named the Store dataset, is captured inside two kinds of simulated stores with 12 and 28 cameras, respectively. Different from CMU Panoptic, which uses hundreds of cameras in a small closed area, we evenly arrange the cameras on the ceiling of the store to simulate a real-world environment. Each camera works independently without strict synchronization, as discussed in Section 3.1. Moreover, there are many shelves inside the second store, serving as obstacles and making the scene more complex than previous datasets. A detailed comparison is presented in Table 1. We use the Store dataset along with the CMU Panoptic dataset to verify the scalability of our method on large-scale camera systems.
4.2 Comparison with state-of-the-art
We first present the quantitative comparison with other state-of-the-art methods in Table 2. Belagiannis et al. introduced 3DPS for multi-view multi-human 3D pose estimation in [2]. Afterwards, they extended 3DPS to videos by exploiting temporal consistency in [4]. These early works have a huge state space with a very expensive computational cost. Dong et al. [13] propose to cluster joints at the body level to reduce the state space. An appearance model [36] is also investigated in their work to mitigate the ambiguity of the body-level association. Their approach takes about 25 ms on a dedicated GPU to extract appearance features, 20 ms for the body association, and 60 ms for the 3D reconstruction in 3DPS. Without bells and whistles, our geometric-only method outperforms previous 3DPS-based models and achieves competitive accuracy with the very recent work [13], while our method is much faster with only a single laptop CPU. Note that, for a fair comparison, we use the same 2D pose detections in the experiments as [13], which are provided by an off-the-shelf 2D pose estimation method [11].
Table 1: Comparison of datasets.

Dataset  Cameras  People  Area  Obstacle
Campus  3  3  43  None
Shelf  5  4  19  Shelf
CMU Panoptic  480+31  7  17  None
Store layout1 (ours)  12  4  12  None
Store layout2 (ours)  28  16  23  Shelves
Table 2: Quantitative comparison on the Campus and Shelf datasets, measured in PCP (%).

Campus  Actor1  Actor2  Actor3  Average  FPS
CVPR14 [2]  82.0  72.4  73.7  75.8  -
ECCVW14 [4]  83.0  73.0  78.0  78.0  1
TPAMI16 [3]  93.5  75.7  85.4  84.5  -
MTA18 [15]  94.2  92.9  84.6  90.6  -
CVPR19 [13]  97.6  93.3  98.0  96.3  9.5
Ours  97.1  94.1  98.6  96.6  617

Shelf  Actor1  Actor2  Actor3  Average  FPS
CVPR14 [2]  66.1  65.0  83.2  71.4  -
ECCVW14 [4]  75.0  67.0  86.0  76.0  1
TPAMI16 [3]  75.3  69.7  87.6  77.5  -
MTA18 [15]  93.3  75.9  94.8  88.0  -
CVPR19 [13]  98.8  94.1  97.8  96.9  9.5
Ours  99.6  93.2  97.5  96.8  325
4.3 Ablation study
To further verify the effectiveness of our solution, an ablation study is conducted to answer the following questions: 1) Does matching in 3-space achieve better results than its 2D counterparts? 2) How much does incremental triangulation contribute, and is it really necessary? 3) What is the speed of our method on large-scale camera systems, and how much does iterative processing contribute? 4) How good is the tracking quality?
Matching in 2D or 3D?
As described in Section 3.2, we argue that matching in 3-space leads to more accurate association results, since it is robust to partial occlusion and inaccurate 2D localization. To verify this, instead of comparing the final PCP scores, we measure the association accuracy directly and compare our method with four baselines, as shown in Figure 3. The association accuracy is computed for each camera based on the degree of agreement between clustered 2D poses and annotations. This formulation removes the impact of different reconstruction algorithms. The first baseline matches joints in pairs of views in the 2D camera coordinates via the epipolar constraint. The other three baselines are taken from the official implementation of [13], which employs geometric information and human appearance features for matching 2D poses between camera views. As seen in the figure, all these approaches achieve good performance on Camera1 and Camera2 of the Campus dataset, while the gap is revealed in the more difficult Camera3, which is placed closer to the people and suffers more from occlusion. Our method, which matches in 3-space, outperforms the baselines by 32%, 5.2%, 9.2%, and 4.6% in association accuracy in Camera3, respectively.
Different 3D reconstruction methods.
Cross-view association is the first step of 3D pose estimation, but 3D reconstruction is also critical. Here, we retain the association results of our method and estimate the 3D poses using different reconstruction algorithms. As presented in Table 3, four algorithms are considered: 3DPS, conventional triangulation, incremental triangulation without normalization, and our proposed method. We select the torso, upper arm, and lower arm for comparison because these body parts have different motion amplitudes, covering different cases. All four reconstruction algorithms achieve good performance on the torso, as it has a small range of motion and is easy to detect. As for the lower arm, which generally moves quickly, our incremental triangulation improves the PCP score by about 3% to 5% compared with conventional triangulation.
To further verify whether incremental triangulation has the ability to handle unsynchronized frames, we analyze the performance drop as the input frame rate decreases. The original Shelf dataset was captured at 25 FPS. We construct datasets with different frame rates by sampling one frame from every $n$ frames in each camera. The comparison between incremental and conventional triangulation is shown in Figure 4. The average time differences within each 2D joint collection are also recorded in the figure. As the input frame rate decreases and the time differences increase, the performance of conventional triangulation drops significantly, while that of ours remains stable, indicating the effectiveness of our method in handling unsynchronized frames. Therefore, we confirm that incremental triangulation is essential for iterative processing.
Table 3: PCP (%) of different 3D reconstruction methods on the Campus and Shelf datasets.

Campus  Torso  Upper arm  Lower arm  Whole
3DPS  100.0  99.1  82.5  96.0
Triangulation  100.0  95.4  79.1  94.4
Ours, w/o norm  100.0  95.6  81.7  95.4
Ours, proposed  100.0  98.6  84.6  96.6

Shelf  Torso  Upper arm  Lower arm  Whole
3DPS  100.0  98.1  88.4  96.6
Triangulation  100.0  97.0  84.5  94.8
Ours, w/o norm  100.0  98.7  87.7  96.9
Ours, proposed  100.0  98.7  87.7  96.8
Speed on large-scale camera systems.
As already seen in Table 2, our method is about 50 times faster than others on the small-scale Campus and Shelf datasets. We further test the proposed method on the large-scale Store dataset, as demonstrated in Figure 5. It achieves 154 FPS for 12 cameras with 4 people and 34 FPS for 28 cameras with 16 people. Note that when measuring the running speed, we follow the common practice that one frame means all cameras are updated once.
Indeed, different implementations and hardware environments affect the running speed considerably. Our algorithm is implemented in C++ without multi-processing and evaluated on a laptop with an Intel i7 2.20 GHz CPU. In order to verify the efficiency more fairly and understand the contribution of iterative processing, we construct a baseline method that matches joints in pairs of views in the camera coordinates within the same testing environment. The comparison is conducted on the CMU Panoptic dataset with its 31 HD cameras, as the cameras are all placed in a small closed area, so changing the number of cameras does not affect the number of people observed. As shown in Figure 6, the running time of the baseline method explodes as the number of cameras increases, while that of ours varies almost linearly. The result verifies the effectiveness of the iterative processing strategy and demonstrates the ability of our method to work with large-scale camera systems in real-world applications.
Tracking quality.
The quality of tracking is measured in each camera view using the Shelf dataset. In particular, we project the estimated 3D poses onto each camera and follow the same evaluation protocol as the MOTChallenge [24]. We compare our result with a simple single-view tracking baseline [6], as shown in Table 4. In some easy cases, e.g. Camera2 and Camera3, the baseline single-view tracker achieves performance similar to cross-view tracking. But in difficult cases such as Camera4 and Camera5, which contain severe occlusion, our cross-view tracking outperforms its single-view counterpart significantly. The experimental result verifies that, in our framework, multi-human tracking can also be boosted by multi-view 3D pose estimation.
Table 4: Tracking quality on the Shelf dataset.

Method  Camera  MOTA  IDF1  FP  FN  IDS
Single-view  Camera1  86.7  81.7  32  34  2
Single-view  Camera2  97.6  63.9  4  4  4
Single-view  Camera3  97.3  98.6  7  7  0
Single-view  Camera4  68.8  41.8  77  79  3
Single-view  Camera5  79.0  69.0  51  51  5
Cross-view  Camera1  98.8  99.4  3  3  0
Cross-view  Camera2  99.2  99.6  1  1  2
Cross-view  Camera3  98.4  99.2  4  4  0
Cross-view  Camera4  97.6  98.8  6  6  0
Cross-view  Camera5  97.6  98.8  6  6  0
5 Conclusion
We have presented a novel solution for multi-human 3D pose estimation from multiple camera views. By exploiting the temporal consistency in videos, we propose to match the 2D inputs with 3D poses directly in 3-space, where the 3D poses are retained and iteratively updated by cross-view tracking. In experiments, we have achieved state-of-the-art accuracy and efficiency on three public datasets. A comprehensive ablation study demonstrates the effectiveness of each component of our framework. Given its simple formulation and efficiency, our solution can be easily extended with other techniques, such as appearance features, and applied directly to other high-level tasks. In addition, we propose a new large-scale Store dataset to simulate real-world scenarios, which verifies the scalability of our solution and may also benefit future research in this area.
6 Supplementary Material
6.1 Detail of Target Initialization
Here, we present details of our target initialization algorithm, including the epipolar constraint, cycle-consistency, and the formulation we utilize for graph partitioning.
When two cameras observe a 3D point from two distinct views, the epipolar constraint [16] relates the two projected 2D points in camera coordinates, as illustrated in Figure S1. Suppose $x_1$ is the projected 2D point in the left view; then the corresponding projected point $x_2$ in the right view must lie on the epipolar line $F x_1$:

$$x_2^\top F x_1 = 0 \qquad (12)$$

where $F$ is the fundamental matrix, determined by the intrinsic parameters and relative poses of the two cameras. Therefore, given two points $x_1$ and $x_2$ from two views, we can measure the correspondence between them based on the symmetric point-to-line distance in camera coordinates:

$$D(x_1, x_2) = \frac{1}{2}\left( d\left(x_1, F^\top x_2\right) + d\left(x_2, F x_1\right) \right) \qquad (13)$$

where $d(x, l)$ denotes the distance from point $x$ to line $l$.
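The symmetric point-to-line distance of Equation 13 can be computed with a few lines of numpy. The following is a minimal sketch under the standard epipolar-geometry conventions, not the paper's implementation; the function names are illustrative:

```python
import numpy as np

def point_line_distance(point, line):
    """Distance from 2D point (x, y) to homogeneous line (a, b, c):
    |a*x + b*y + c| / sqrt(a^2 + b^2)."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / np.hypot(a, b)

def epipolar_distance(x1, x2, F):
    """Symmetric point-to-epipolar-line distance between points x1 (left
    view) and x2 (right view), given the fundamental matrix F. Small values
    indicate a likely cross-view correspondence."""
    x1_h = np.array([x1[0], x1[1], 1.0])
    x2_h = np.array([x2[0], x2[1], 1.0])
    line_in_right = F @ x1_h   # epipolar line of x1, drawn in the right view
    line_in_left = F.T @ x2_h  # epipolar line of x2, drawn in the left view
    return 0.5 * (point_line_distance(x2, line_in_right)
                  + point_line_distance(x1, line_in_left))
```

For rectified stereo pairs, where the fundamental matrix constrains corresponding points to share a row, the distance reduces to the vertical pixel offset between the two detections.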
Given a set of unmatched detections $\{d_i\}$ from different cameras, we compute the affinity matrix $A = (a_{ij})$ using Equation 13. The problem then becomes associating these detections across camera views. Note that, with more than two cameras, the association problem cannot be formulated as simple bipartite graph partitioning, and the matching result must satisfy the cycle-consistency constraint, i.e., $d_i$ and $d_k$ must be matched if $d_i$, $d_j$ are matched and $d_j$, $d_k$ are matched. To this end, we formulate the problem as general graph partitioning and solve it via binary integer programming [10, 31]:

$$\max_{y \in \mathcal{Y}} \sum_{i,j} a_{ij}\, y_{ij} \qquad (14)$$

subject to

$$y_{ij} \in \{0, 1\}, \quad \forall\, i, j \qquad (15)$$

$$y_{ij} + y_{jk} \le 1 + y_{ik}, \quad \forall\, i, j, k \qquad (16)$$

where $a_{ij}$ is the affinity between $d_i$ and $d_j$, $y_{ij}$ indicates whether $d_i$ and $d_j$ are assigned to the same person, and $\mathcal{Y}$ is the set of all possible assignments to the binary variables. The cycle-consistency constraint is ensured by Equation 16.
6.2 Baseline Method in the Ablation Study
To verify the effectiveness of our solution, we construct, as the baseline in the ablation study, a method that matches joints in pairs of views using the epipolar constraint. The procedure of the baseline method is detailed in Algorithm 2. For each frame, it takes the 2D poses from all cameras as inputs and associates them across views using the epipolar constraint and graph partitioning. Afterwards, 3D poses are estimated from the matching results via triangulation.
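For intuition, the cycle-consistent association objective of Equations 14–16 can be reproduced on toy inputs by exhaustively enumerating partitions; any partition is transitive by construction, so cycle-consistency holds automatically. This is an illustrative brute-force sketch with exponential complexity, not the binary-integer-programming solvers used in practice [10, 31]:

```python
import itertools
import numpy as np

def partition_detections(affinity, threshold=0.0):
    """Enumerate all partitions of n detections and return the one that
    maximizes the total within-group affinity (pairs below `threshold`
    contribute negatively, so unrelated detections stay separated)."""
    n = affinity.shape[0]

    def all_partitions(items):
        # Standard recursion: place the first item into each existing
        # group of a partition of the rest, or into a new singleton group.
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for smaller in all_partitions(rest):
            for i in range(len(smaller)):
                yield smaller[:i] + [smaller[i] + [first]] + smaller[i + 1:]
            yield [[first]] + smaller

    best_score, best = -np.inf, None
    for part in all_partitions(list(range(n))):
        score = sum(affinity[i, j] - threshold
                    for group in part
                    for i, j in itertools.combinations(group, 2))
        if score > best_score:
            best_score, best = score, part
    return best
```

With an affinity matrix in which detections 0, 1 and 2, 3 are mutually consistent but cross-pairs have negative affinity, the maximizer recovers the two groups, matching what Equations 14–16 would select.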
6.3 Parameter Selection
In this work, we have six parameters: the two weights of the affinity measurements, the two corresponding thresholds, and the two time penalty rates for the affinity calculation and the incremental triangulation, respectively. In Table S1, we first show experimental results with different affinity weights on the Campus dataset. As seen in the table, the 3D correspondence is critical in our framework, but the performance is robust to the combination of weights. Therefore, we fix the weights for all datasets and select the other parameters for each dataset empirically, as shown in Table S2. The basic intuition is to adjust the thresholds according to the image resolution and to change the time penalty rates based on the input frame rate, since different datasets are captured at different frame rates, e.g. the first three public datasets at 25 FPS and the Store dataset at 10 FPS.
2D weight | 3D weight | Association Accuracy (%) | PCP (%)
1.0 | 0.0 | 45.69 | 62.29
0.8 | 0.2 | 96.22 | 96.58
0.6 | 0.4 | 96.30 | 96.61
0.4 | 0.6 | 96.38 | 96.63
0.2 | 0.8 | 96.38 | 96.63
0.0 | 1.0 | 96.38 | 96.49
Dataset | 2D threshold (pixel) | 3D threshold (m) | Affinity penalty rate (1/s) | Triangulation penalty rate (1/s)
Campus | 25 | 0.10 | 5 | 10
Shelf | 60 | 0.15 | 5 | 10
CMU Panoptic | 60 | 0.15 | 5 | 10
Store (layout 1) | 70 | 0.25 | 3 | 5
Store (layout 2) | 70 | 0.25 | 3 | 5
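To make the role of the triangulation-side time penalty rate concrete, the following is a hypothetical sketch of incremental triangulation in which older 2D observations are down-weighted before solving the linear (DLT) system. The exponential decay form and the names `weighted_triangulate` and `decay_rate` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def weighted_triangulate(points_2d, projections, timestamps, t_now, decay_rate):
    """Triangulate one 3D joint from time-stamped 2D observations.

    points_2d:   list of (x, y) pixel observations
    projections: list of 3x4 camera projection matrices
    timestamps:  acquisition time of each observation (seconds)
    Each observation's DLT rows are scaled by exp(-decay_rate * dt),
    so stale observations contribute less to the estimate.
    """
    rows = []
    for (x, y), P, t in zip(points_2d, projections, timestamps):
        w = np.exp(-decay_rate * (t_now - t))  # older observations count less
        rows.append(w * (x * P[2] - P[0]))
        rows.append(w * (y * P[2] - P[1]))
    A = np.stack(rows)
    # Homogeneous least squares: the right singular vector associated with
    # the smallest singular value minimizes ||A X|| subject to ||X|| = 1.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

Under this sketch, a higher decay rate (as used for the 25 FPS datasets) makes the estimate forget old observations faster, which matches the intuition of tying the penalty rates to the input frame rate.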
6.4 Qualitative Results
Here, we present more qualitative results of our solution on public datasets in Figure S2, Figure S3, and Figure S4. A recorded video is also provided at https://youtu.be/4wTcGjHZq8.
References

[1] (2010) Monocular 3D pose estimation and tracking by detection. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 623–630. Cited by: §2.
[2] (2014) 3D pictorial structures for multiple human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1669–1676. Cited by: §1, §1, §1, §2, §4.1, §4.2, Table 2, §4.
[3] (2016) 3D pictorial structures revisited: multiple human pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1929–1942. Cited by: §4.1, Table 2.
[4] (2014) Multiple human pose estimation with temporally consistent 3D pictorial structures. In European Conference on Computer Vision Workshop, pp. 742–754. Cited by: §2, §2, §4.2, Table 2.
[5] (2011) Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (9), pp. 1806–1819. Cited by: §2.
[6] (2016) Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468. Cited by: §4.3.
[7] (2019) Multi-person 3D pose estimation and tracking in sports. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Cited by: §2.
[8] (2013) 3D pictorial structures for multiple view articulated pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3618–3625. Cited by: §1, §2, §2.
[9] (2017) Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299. Cited by: §1.
[10] (2019) Aggregate tracklet appearance features for multi-object tracking. IEEE Signal Processing Letters 26 (11), pp. 1613–1617. Cited by: §3.2, §6.1.
[11] (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112. Cited by: §1, §4.2.
[12] (2019) Occlusion-aware networks for 3D human pose estimation in video. In ICCV. Cited by: §2, §2.
[13] (2019) Fast and robust multi-person 3D pose estimation from multiple views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7792–7801. Cited by: §1, §1, §2, §2, §3.2, §3.2, §3.2, §4.1, §4.1, §4.2, §4.3, Table 2.
[14] (2015) Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3810–3818. Cited by: §2.
[15] (2018) Multiple human 3D pose estimation from multiview images. Multimedia Tools and Applications 77 (12), pp. 15573–15601. Cited by: §4.1, Table 2.
[16] (2003) Multiple view geometry in computer vision. Cambridge University Press. Cited by: §1, §2, §3.1, §3.2, §6.1.
[17] (2019) Learnable triangulation of human pose. In ICCV. Cited by: §3.3, §4.1.
[18] (2015) Panoptic Studio: a massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3334–3342. Cited by: §1, §4.1, §4.
[19] (1955) The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1–2), pp. 83–97. Cited by: §3.2.
[20] (2018) Propagating LSTM: 3D pose estimation based on joint interdependency. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 119–135. Cited by: §2.
[21] (2017) A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649. Cited by: §2, §2.
[22] (2018) Single-shot multi-person 3D pose estimation from monocular RGB. In 2018 International Conference on 3D Vision (3DV), pp. 120–130. Cited by: §2.
[23] (2017) VNect: real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–14. Cited by: §2.
[24] (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: §4.3.
[25] (2019) Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In ICCV. Cited by: §1, §2, §2.
[26] (2017) 3D human pose estimation from a single image via distance matrix regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2823–2832. Cited by: §1, §2.
[27] (2017) Harvesting multiple views for marker-less 3D human pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6988–6997. Cited by: §2.
[28] (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7753–7762. Cited by: §2, §2.
[29] (2019) Cross view fusion for 3D human pose estimation. In ICCV. Cited by: §2.
[30] (2018) Exploiting temporal information for 3D human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 68–84. Cited by: §2.
[31] (2014) Tracking multiple people online and in real time. In Asian Conference on Computer Vision, pp. 444–459. Cited by: §3.2, §6.1.
[32] (2018) Joint multi-view people tracking and pose estimation for 3D scene reconstruction. In 2018 IEEE International Conference on Multimedia and Expo (ICME). Cited by: §2, §2.
[33] (2010) Dynamical binary latent variable models for 3D human pose tracking. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 631–638. Cited by: §2.
[34] (2018) Rethinking pose in 3D: multi-stage refinement and recovery for markerless motion capture. In 2018 International Conference on 3D Vision (3DV), pp. 474–483. Cited by: §2.
[35] (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481. Cited by: §1.
[36] (2018) Camera style adaptation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5157–5166. Cited by: §4.2.