Affordable markerless motion capture technology is becoming increasingly pervasive in applications of human-computer and human-machine interaction, entertainment, healthcare, communication, surveillance and others. Although the methods for capturing and extracting human pose from image data have been around for several years, the advances in sensor technologies (e.g., infrared sensors) and computing power (e.g., GPUs) have facilitated new systems that provide robust and relatively accurate markerless acquisition of human movement. An important milestone for wide adoption of these technologies was the release of the Microsoft Kinect camera [Zhang_2012] for the Xbox 360 gaming console in 2010, followed by the release of Kinect for Windows with the accompanying Software Development Kit (SDK) in 2011. The Kinect SDK for Windows featured real-time full-body tracking of human limbs based on the algorithm by Shotton et al. [Shotton_2011]. Several other technology makers followed suit by releasing their own 3D cameras that focused on capture of human motion for interactive applications (Xtion by Asus, RealSense by Intel). Many researchers and commercial developers embraced the Kinect in a wide range of applications that took advantage of its real-time 3D acquisition capabilities and skeletal tracking, such as physical therapy and rehabilitation [Hondori2014], fall detection [Stone2011] and exercise in the elderly [Webster2014, Ofli2015], ergonomics [Diego-Mas2013, Plantard2015], anthropometry [Robinson2013, Han2013], and many others. In 2013 the second generation of the Kinect camera was released as part of the Xbox One gaming console. In 2014 a standalone version of Kinect for Windows (k4w) was officially released featuring a wider camera angle, higher resolution of depth and color images, improved skeletal tracking, and detection of facial expressions.
In this paper we focus on the evaluation of accuracy and performance of skeletal tracking in the two Kinect systems (referred to as Kinect 1 and Kinect 2 in the remainder of this paper) compared to a marker-based motion capture system. Several publications have previously addressed the accuracy of the skeletal tracking of Kinect 1 for various applications; however, the accuracy of Kinect 2 has been reported only to a limited extent in the research literature. Furthermore, concurrent comparison of the two systems has not yet been published to the best of our knowledge. Although both Kinect systems employ similar methodology for human body segmentation and tracking based on the depth data, the underlying technology for acquisition of the depth differs between the two. We report the accuracy rates of the skeletal tracking and the corresponding error distributions in a set of exercise motions that include standing and sitting body configurations. Such an extensive performance assessment of the technology is intended to assist researchers who rely on Kinect as a measurement device in the studies of human motion.
II Related Work
In this section we review several publications related to evaluation of the Kinect systems. Kinect 1 has been extensively investigated in terms of 3D depth map acquisition as well as body tracking accuracy for various applications. Khoshelham and Elberink [Khoshelham_2013] examined the accuracy of depth acquisition in Kinect 1, and found that the depth error ranges from a few millimeters up to about 4 cm at the maximum range. They recommended that the data for mapping applications should be acquired within 1-3 m distance. Smisek et al. [Smisek_2011] proposed a geometrical model and calibration method to improve the accuracy of Kinect 1 for 3D measurements. Kinect 1 and Kinect 2 were jointly evaluated by Gonzalez-Jorge et al. [Gonzalez_2015] who reported that the precision of both systems is similar (about 8 mm) in the range of under 1 m, while Kinect 2 outperforms Kinect 1 at the range of 2 m with the error values of up to 25 mm. They also reported that the precision of Kinect 1 decreases rapidly following a second order polynomial, while Kinect 2 exhibits a more stable behavior inside its work range. The 3D accuracy of Kinect 2 was recently evaluated by Yang et al. [Yang_2015] who reported on the spatial distribution of the depth accuracy with regard to the vertical and horizontal displacement.
Skeletal tracking of Kinect was examined primarily in the context of biomechanical and exercise performance analyses. In this review, we limit ourselves only to the evaluations of skeletal tracking based on the official Microsoft Kinect SDK. Obdržálek et al. [Obdrzalek_2012] performed accuracy and robustness analysis of the Kinect skeletal tracking in six exercises for elderly population. Their paper reports on the error bounds for particular joints obtained from the comparison with an optical motion capture system. The authors conclude that employing a more anthropometric kinematic model with fixed limb lengths could improve the performance. Clark et al. [Clark_2012] examined the clinical feasibility of using Kinect for postural control assessment. The evaluation with Kinect and motion capture performed in 20 healthy subjects included three postural tests: forward reach, lateral reach, and single-leg eyes-closed standing balance assessment. The authors found high inter-trial reliability and excellent concurrent validity for the majority of the measurements. The study, however, revealed the presence of proportional biases for some of the outcome measures, in particular for sternum and pelvis evaluations. The authors proposed the use of calibration equations that could potentially correct for such biases. Several other works have examined the body tracking accuracy for specific applications in physical therapy, such as upper extremity function evaluation [Kurillo_2013], assessment of balance disorders [Funaya_2013], full-body functional assessment [Bonnechere_2013], and movement analysis in Parkinson’s disease [Galna_2014]. Plantard et al. [Plantard_2015] performed an extensive evaluation of Kinect 1 skeletal tracking accuracy for ergonomic assessment. By using a virtual mannequin, they generated a synthetic depth map that was input into the Kinect SDK algorithm to predict potential accuracy of joint locations in a large number of skeletal configurations and camera positions. The simulation results were validated by a small number of real experiments. The authors concluded that the kinematic information obtained by the Kinect is generally accurate enough for ergonomic assessment.
To the best of our knowledge, the publication by Xu and McGorry [Xu_2015] is to date the only work that reported on the evaluation of Kinect 2 skeletal tracking alongside an optical motion capture system. In their study the authors examined 8 standing and 8 sitting static poses of daily activities. Similar poses were also captured with Kinect 1; however, the two data sets were not obtained concurrently. The authors reported that the average static error across all the participants and all Kinect-identified joint centers was 76 mm for Kinect 1 and 87 mm for Kinect 2. They further concluded that there was no significant difference between the two Kinects. This conclusion, however, is of limited validity as the comparison was done indirectly with two different sets of subjects.
Since the Kinect 1 system is being replaced by the Kinect 2 system in many applications, it is important to evaluate the performance of the new camera and software for tracking of dynamic human activities. This is especially relevant since the depth estimation in the two systems is based on two different physical principles. Side-by-side comparison can thus provide a better understanding of the performance improvements as well as potential drawbacks. In this paper, we report on the experimental evaluation of the joint tracking accuracy of the two Kinects in comparison to an optical motion capture system. We analyze the results for 12 different activities that include standing and sitting poses as well as slow and fast movements. Furthermore we examine the performance of pose estimation with respect to three different horizontal orientation angles. We provide the error bounds for joint positions and extracted limb lengths for both systems. We also analyze the distribution of joint localization errors by fitting a mixture of Gaussian and uniform distribution models to determine the outliers in the motion data.
III Acquisition Systems
In this section we provide more details on the experimental setup and a brief overview of the technology behind each Kinect system. For the experimental evaluation, the movements were simultaneously captured by Kinect 1, Kinect 2, and a marker-based optical motion capture system which served as a baseline. The two Kinects were secured together and mounted on a tripod at the height of about 1.5 m. All three systems were geometrically calibrated and synchronized prior to the data collection using the procedure described below.
III-A Motion Capture System (MoCap)
The motion capture data were acquired using the PhaseSpace (San Leandro, CA, USA) Impulse X2 system with 8 infrared stereo cameras. The cameras were positioned around the capture space of about 4 m by 4 m. The system provides the 3D position of LED markers with sub-millimeter accuracy and a frequency of up to 960 Hz. A capture rate of 480 Hz was selected for this study. For each subject 43 markers were attached at standard body landmarks to a motion capture suit using velcro. To obtain the skeleton from the marker data, a rigid kinematic structure was dynamically fitted to the 3D point cloud. We used PhaseSpace Recap2 software to obtain the skeleton for each subject based on collected calibration data which consisted of a sequence of individual joint rotations. The built-in algorithm determines the length of the body segments based on the set of markers associated with different parts of the body and generates a skeleton with 29 joint positions. Once an individual’s kinematic model is calibrated, the skeletal sequence can be extracted for any motion of that person.
III-B Kinect 1
The Kinect 1 sensor features acquisition rates of up to 30 Hz for the color and depth data with the resolution of 640 × 480 pixels and 320 × 240 pixels, respectively. The depth data are obtained using a structured light approach, where a pseudo-random infrared dot pattern is projected onto the scene while being captured by an infrared camera. Stereo triangulation is used to obtain the 3D position of the points from their projections. This approach provides a robust 3D reconstruction even in low-light conditions. The depth error increases with the square of the distance, with typical accuracy ranging from about 1 cm to 4 cm in the range of 1-4 m [Smisek_2011]. To obtain a dense depth map, surface interpolation is applied based on the acquired depth values at the data points. The fixed density of the points limits the accuracy when moving away from the camera as the points become sparser. The boundaries of surfaces in the distance are thus often jagged.
Real-time skeletal tracking provided by the Microsoft Kinect SDK operates on the depth data using the body part estimation algorithm based on random decision forests proposed by Shotton et al. [Shotton_2011]. The algorithm estimates candidate body parts after training on a large set of synthetically generated depth images of humans of many different shapes and sizes in various poses taken from a motion capture database [Zhang_2012]. The Kinect 1 SDK can track up to two users, providing the 3D location of 20 joints for each tracked skeleton.
III-C Kinect 2
The Kinect 2 sensor features high definition color (1920 × 1080 pixels) and higher resolution depth data (512 × 424 pixels) as compared to Kinect 1. The depth acquisition is based on the time-of-flight (ToF) principle where the distance to points on the surface is measured by computing the phase-shift distance of modulated infrared light. The intensity of the captured image is thus proportional to the distance of the points in 3D space. The ToF technology, as opposed to the structured light, inherently provides a dense depth map; however, the results can suffer from various artifacts caused by the reflections of the light signal from the scene geometry and the reflectance properties of observed materials. The depth accuracy of Kinect 2 is relatively constant within a specific capture volume; however, it depends on the vertical and horizontal displacement as the light pulses are scattered away from the center of the camera [Yang_2015]. The reported average depth accuracy is under 2 mm in the central viewing cone and increases to 2-4 mm in the range of up to 3.5 m. The maximal range captured by Kinect 2 is 4.0 m where the average error typically increases beyond 4 mm.
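As a rough illustration of the phase-shift principle, the textbook relation for continuous-wave ToF sensing is given below. It is a generic sketch and not a specification of the Kinect 2 hardware; the modulation frequency $f_{\mathrm{mod}}$ is an assumed symbol that is not reported here, $\Delta\varphi$ is the measured phase shift, and $c$ is the speed of light.

```latex
d = \frac{c\,\Delta\varphi}{4\pi f_{\mathrm{mod}}},
\qquad
d_{\max} = \frac{c}{2 f_{\mathrm{mod}}}
```

The second expression is the largest distance that can be measured without phase wrapping, obtained by setting $\Delta\varphi = 2\pi$.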
The skeletal tracking method implemented in Kinect SDK v2.0 has not been fully disclosed by Microsoft; however, it appears to follow similar methodology as for Kinect 1 while taking advantage of GPU computation to reduce the latency and to improve the performance. The Kinect SDK v2.0 features skeletal tracking of up to 6 users with 3D locations of 25 joints for each skeleton. In comparison to Kinect 1, the skeleton includes additional joints at the hand tip, thumb tip and neck. The arrangement of the joints, i.e. the kinematic structure of the model, is comparable to standard motion capture skeleton. Kinect 2 includes some additional features, such as detection of hand opening/closing and tracking of facial features.
III-D Calibration and Data Acquisition
For the capture of the database, we connected the two Kinect cameras to a single PC running Windows 8.1, with Kinect 1 connected via USB 2.0 and Kinect 2 connected via USB 3.0 on a separate PCI bus. Such arrangement allowed for both sensors to capture at the full frame rate of 30 Hz. The skeletal data for both cameras were extracted in real time via Microsoft Kinect SDK v1.8 and Kinect for Windows SDK v2.0 for Kinect 1 and Kinect 2, respectively.
The temporal synchronization of the captured data was performed using Network Time Protocol (NTP). The motion capture server provided the time stamps for the Kinect PC over the local area network. Meinberg NTP Client Software (Meinberg Radio Clocks GmbH, Bad Pyrmont, Germany) was installed on the Windows computer to obtain more precise clock synchronization.
Prior to the data acquisition, we first calibrated the motion capture system using provided calibration software. The coordinate frames of the Kinect cameras were then aligned with the motion capture coordinates using the following procedure. A planar checkerboard with three motion capture markers attached to corners of the board was placed in three different positions in front of the Kinects. In each configuration, marker position, color and depth data were recorded. Next, the 3D positions of the corners were extracted from the depth data using the intrinsic parameters of the Kinect and corresponding depth pixel values. Finally, a rigid transformation matrix that maps 3D data captured by each Kinect into the motion capture coordinate system was determined by minimizing the squared distance between the Kinect acquired points and the corresponding marker locations.
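The last calibration step can be sketched with the standard SVD-based (Kabsch) least-squares solution. This is a minimal illustration under the assumption that corresponding point pairs are already available; the function name `fit_rigid_transform` and the array layout are hypothetical, and the solver actually used in the study is not specified.

```python
import numpy as np

def fit_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) with dst ≈ src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points, e.g. Kinect
    checkerboard corners (src) and motion capture markers (dst).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

With three markers in three board positions, `src` and `dst` would each hold nine points; a Kinect skeleton is then mapped with `joints @ R.T + t`.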
III-E Data Processing
Collected marker data were first processed in Recap2 to obtain the skeletal sequence for each subject, and then exported to BVH file format. The rest of the analysis was performed in MATLAB (MathWorks, Natick, MA). First, the skeletal sequences from the Kinect cameras were mapped to the motion capture coordinate space using the rigid transformation obtained from the calibration. Next, we aligned the sequences using the time stamps, and re-sampled all the data points to the time stamps of Kinect 2 in order to compare the joint localization at the same time instances. Fig. 1 demonstrates the three skeletal configurations projected into the motion capture coordinate space after the calibration.
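The re-sampling step can be illustrated as follows. The interpolation scheme is not stated in the text, so linear interpolation is assumed here; `resample_joints` and its argument layout are hypothetical names chosen for this sketch.

```python
import numpy as np

def resample_joints(t_src, joints_src, t_ref):
    """Linearly interpolate a joint sequence onto reference time stamps.

    t_src:      (F,) increasing time stamps of the source sequence
    joints_src: (F, J, 3) joint positions sampled at t_src
    t_ref:      (K,) time stamps to resample to (e.g. those of Kinect 2)
    Returns a (K, J, 3) array. Values outside the range of t_src are
    clamped to the first/last sample by np.interp.
    """
    joints_src = np.asarray(joints_src, dtype=float)
    n_frames, n_joints, n_dims = joints_src.shape
    flat = joints_src.reshape(n_frames, -1)
    out = np.empty((len(t_ref), flat.shape[1]))
    for k in range(flat.shape[1]):
        out[:, k] = np.interp(t_ref, t_src, flat[:, k])
    return out.reshape(len(t_ref), n_joints, n_dims)
```

Because the motion capture stream is sampled much faster than the Kinects, interpolating it down to the Kinect 2 time stamps loses little information under this assumption.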
After the spatial transformation and temporal alignment, we obtained three sequences of 3D joint positions for Kinect 1, Kinect 2, and motion capture. Since the three skeletal configurations have different arrangements and number of joints, we selected 20 joints that are common to all the three systems. The remaining joints were ignored in this analysis. Next, we evaluated the position accuracy by calculating the distance between the corresponding joints in each time frame. When the Kinect skeletal tracking loses track of the body parts for certain joints (e.g. due to occlusions), such frames can be flagged as outliers. Since the data of the outliers can be assigned arbitrary values, we use a uniform distribution to model the distribution of the outliers. The distribution of the valid (on-track) data samples is on the other hand modeled by a Gaussian distribution with the mean representing the average offset of that joint. The overall distribution of the joint offset data, $f(x)$, can thus be modeled by a mixture model of Gaussian and uniform distributions as follows:

$$f(x) = \rho\,\mathcal{N}(x \mid \mu, \sigma) + (1 - \rho)\,\mathcal{U}(x \mid a, b) \qquad (1)$$

In equation (1), $\mu$ and $\sigma$ denote the parameters (mean and standard deviation) of the Gaussian distribution, $\mathcal{N}$. $a$ and $b$ denote the parameters (lower and upper bound) of the uniform distribution, $\mathcal{U}$. $\rho$ denotes the weight of the Gaussian distribution. In this paper, we use the maximum-likelihood method to estimate these parameters with the input data samples. After estimating the mixture model, the data are classified into either on-track or off-track state. The off-track data are then excluded from the accuracy evaluation.
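One way to carry out such a fit is expectation-maximization (EM), sketched below. The text only states that maximum likelihood is used, so the EM scheme, the choice to fix the uniform support to the observed data range, the 0.5 posterior threshold, and the names `fit_gauss_uniform` and `is_on_track` are all assumptions made for illustration.

```python
import numpy as np

def fit_gauss_uniform(x, n_iter=500, tol=1e-9):
    """Fit a Gaussian + uniform mixture to 1D joint offsets with EM.

    Returns (mu, sigma, rho, a, b), where rho is the Gaussian weight.
    """
    x = np.asarray(x, dtype=float)
    a, b = x.min(), x.max()      # assumption: uniform support = data range
    u = 1.0 / (b - a)            # constant density of the uniform part
    mu, sigma, rho = np.median(x), x.std(), 0.9  # crude starting values
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each sample is on-track
        g = rho * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        total = g + (1.0 - rho) * u
        resp = g / total
        # M-step: re-estimate weight, mean and standard deviation
        rho = resp.mean()
        mu = np.sum(resp * x) / np.sum(resp)
        sigma = max(np.sqrt(np.sum(resp * (x - mu) ** 2) / np.sum(resp)), 1e-6)
        ll = np.log(total).sum()  # log-likelihood before this update
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma, rho, a, b

def is_on_track(x, mu, sigma, rho, a, b):
    """True where the Gaussian component is more probable than the uniform one."""
    x = np.asarray(x, dtype=float)
    g = rho * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return g > (1.0 - rho) / (b - a)
```

The fit is run independently on the offsets of each joint, and the resulting boolean mask separates the on-track samples from the off-track ones.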
Another important parameter for the accuracy assessment of human pose tracking is the variability of the limb lengths. The human skeleton is typically modeled as a kinematic chain with rigid body segments. The Kinect skeletal tracking, however, does not explicitly constrain the length of body segments. In the paper, we thus report on the variability of the bone lengths by calculating the distance between two end-joints of each bone for the Kinect systems. For motion capture, the bone length is extracted from the segment length parameters in the BVH file.
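The bone length computation described above amounts to a per-frame Euclidean distance between two joints. The following sketch assumes a `(frames, joints, 3)` array layout; the function name and the joint indices are hypothetical.

```python
import numpy as np

def bone_length_stats(joints, idx_a, idx_b, reference_length):
    """Mean and SD of the bone length difference over a sequence.

    joints:           (F, J, 3) Kinect joint positions in mm
    idx_a, idx_b:     indices of the two end-joints of the bone
    reference_length: calibrated segment length from motion capture in mm
    """
    joints = np.asarray(joints, dtype=float)
    lengths = np.linalg.norm(joints[:, idx_b] - joints[:, idx_a], axis=1)
    diff = lengths - reference_length
    return diff.mean(), diff.std(ddof=1)
```

A negative mean indicates that the Kinect bone is on average shorter than the calibrated motion capture segment.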
IV Experimental Protocol
In this section we describe the experimental protocol for the data accuracy evaluation. As described in Section III, the motion data were captured by the setup consisting of Kinect 1, Kinect 2, and the motion capture system. We selected 12 different exercises (Table I, Fig. 2), consisting of six sitting (and sit-to-stand) exercises and six standing exercises. In the first set of exercises, subjects were interacting with the chair, while no props were used in the second set. We analyze the two sets of exercises separately.
|Exercise||Set||Description|
|1. Shallow Squats||Sitting||Stand-to-sit movements without sitting.|
|2. Chair Stands||Sitting||Sit-to-stand movements.|
|3. Buddha’s Prayer||Sitting||Vertical hand lifts with palms together.|
|4. Cops & Robbers||Sitting||Shoulder rotation and forward arm extension.|
|5. Abs in, Knee Lifts||Sitting||Alternating knee lifts.|
|6. Lateral Stepping||Sitting||Alternating front and side stepping.|
|7. Pickup & Throw||Standing||Step forward, pick up off the floor and throw.|
|8. Jogging||Standing||Running in place.|
|9. Clapping||Standing||Wide hand clapping while standing still.|
|10. Punching||Standing||Alternating forward punching.|
|11. Line Stepping||Standing||Alternating forward foot tapping.|
|12. Pendulum||Standing||Alternating leg adduction.|
We captured the motion data in 10 subjects (mean age: 27). Before starting the recording, each subject was instructed on how to perform the exercise via a video. We first recorded the motion capture calibration sequence for the subsequent skeleton fitting. Each exercise recording consisted of five repetitions, except for Jogging, which required subjects to perform ten jogging steps. The recording of the 12 exercises was repeated for three different orientation angles of the subjects with respect to the Kinect cameras, i.e. at 0° with the subject facing the cameras and at 30° and 60° with the subject rotated to the left of the cameras. Figs. 2 and 3 show the video snapshots and the corresponding motion capture skeletons of the key poses for the 12 exercises in one of the subjects.
After the data acquisition, the joint coordinates of Kinect 1 and Kinect 2 were transformed into the global coordinate system of the motion capture. Additionally, the temporal data were synchronized according to the time stamp of the sequence captured by Kinect 2, as described in Section III-D.
For the analysis of joint position accuracy, we selected 20 joints that are common between the three systems. These joints and their abbreviated names are shown in Fig. 4. In addition to the joint position accuracy, we also evaluated the accuracy of the bone lengths for the upper and lower extremities. Those bones and their abbreviated names are also shown in Fig. 4.
V Results and Discussion
In this section, we present a detailed analysis of the pose tracking accuracy in Kinect 1 and Kinect 2 in comparison to the motion capture system which we use as a baseline. All the reported results are the average values across all the subjects. The values in the sitting or standing pose represent the mean values of all the exercises in the sitting or standing set.
V-A Joint Position Accuracy
Tables II and III summarize the mean offsets for all the joints in the sitting and standing sets of exercises in three different viewpoint directions. The mean offset represents the average distance between the corresponding joint position of Kinect 1 or Kinect 2 as compared to the location identified from the motion capture data.
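The mean offset defined above can be computed as sketched below, assuming both sequences are already expressed in the motion capture coordinate frame and sampled at the same time stamps. The array layout and the function name `joint_offset_stats` are hypothetical.

```python
import numpy as np

def joint_offset_stats(kinect, mocap):
    """Per-joint mean and SD of the Euclidean offset over a sequence.

    kinect, mocap: (F, J, 3) joint positions in mm for the same J joints.
    Returns two (J,) arrays: mean offset and sample standard deviation.
    """
    kinect = np.asarray(kinect, dtype=float)
    mocap = np.asarray(mocap, dtype=float)
    offsets = np.linalg.norm(kinect - mocap, axis=2)  # (F, J) distances
    return offsets.mean(axis=0), offsets.std(axis=0, ddof=1)
```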
In the sitting set of exercises (Table II), the majority of the mean joint offsets range between 50 mm and 100 mm for both Kinect systems. The largest offset in Kinect 1 is consistently observed in the pelvic area which includes the following three joints: ROOT, HIP_L, and HIP_R. Kinect 2 on the other hand has smaller offsets for these particular joints. In Kinect 2, the largest offsets are observed in the following four joints of the lower extremities: ANK_L, ANK_R, FOO_L, and FOO_R. These joints typically have a large vertical offset from the ground plane, while the same is not observed in Kinect 1. Similar observations can be made in the standing set of exercises (Table III) where the largest offsets in Kinect 1 are again in the pelvic area and the largest offsets in Kinect 2 are found in the lower extremities. These observations are also clearly visible in Fig. 1.
Tables II and III also summarize the standard deviation (SD) of the joint position offsets which reflects the variability of a particular joint tracking. For most of the joints, the SD ranges between 10 mm and 50 mm. The joints that exhibit considerable motion during an exercise have much higher variability, and thus SD, typically greater than 50 mm. In most cases, the SDs of the joint positions in Kinect 2 are considerably smaller than those in Kinect 1. This is most likely due to the increased resolution and reduced noise level of Kinect 2 depth maps.
Furthermore, we can observe that the mean offset and SD of the joints that are most active in a particular exercise are both increasing with the viewpoint angle. This is especially noticeable on the side of the skeleton that is turned further away from the camera as the occlusion of joints increases the uncertainty of the pose detection. In our experiments, the left side of the skeleton was turning away from the camera with the increasing viewpoint angle.
In order to examine whether Kinect 2 is performing better than Kinect 1, we performed statistical analysis of the accuracy for each of the 20 joints in comparison to the motion capture. We used pair-wise t-test for the joint position analysis. Our hypothesis was that the position of a particular joint is not significantly different between a Kinect and the motion capture. The results of the analysis are shown in Tables II and III where the joints with significant difference are denoted with the * symbol when p < 0.05 and the ** symbol when p < 0.01.
The results of the t-test analyses show that the mean position offsets of the joints ROOT, SPINE, HIP_L, HIP_R, ANK_L, ANK_R, FOO_L, and FOO_R in Kinect 2 are significantly different from those in Kinect 1. For the other joints, no significant difference in the mean offsets was found.
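The significance markers can be illustrated with a paired t-test on per-subject mean offsets. Exactly which samples were paired is not spelled out in the text, so pairing the 10 subjects' offsets for Kinect 1 against Kinect 2 is an assumption of this sketch, as are the function names. The hard-coded values are the standard two-sided critical values of Student's t for 9 degrees of freedom.

```python
import numpy as np

# Two-sided critical values of Student's t for 9 degrees of freedom
# (10 paired samples, one per subject).
T_CRIT_DF9 = {0.05: 2.262, 0.01: 3.250}

def paired_t_statistic(a, b):
    """t statistic of the paired differences a - b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

def significance_marker(a, b):
    """Return '**' (p < 0.01), '*' (p < 0.05) or '' for 10 paired samples."""
    if len(a) != 10 or len(b) != 10:
        raise ValueError("critical values are tabulated for 10 pairs only")
    t = abs(paired_t_statistic(a, b))
    if t > T_CRIT_DF9[0.01]:
        return "**"
    if t > T_CRIT_DF9[0.05]:
        return "*"
    return ""
```

For other sample sizes the critical values would need to be replaced, or the p-value computed from the t distribution directly.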
Similar conclusions can be drawn for the standing set of exercises (Table III). For example, the SDs of the joint position offset in Kinect 2 are usually smaller than those of Kinect 1. The variances in the more active joints are typically increasing with the viewpoint angle. Statistically significant differences in the accuracy of Kinect 1 vs. Kinect 2 can be found in the following joints: ROOT, SPINE, HIP_L, HIP_R, ANK_L, ANK_R, FOO_L, and FOO_R. Overall, the means and SDs of the joint position offsets in the standing poses are usually larger than those in the sitting poses. In the sitting poses there is a higher number of static joints, which in general have smaller variability.
|Kinect 1||Kinect 2|
|Mean (mm)||SD (mm)||Mean (mm)||SD (mm)|
|ROOT||256||262||263||25||20||25||100||102||106||17||18||16|
* t-test, p < 0.05; ** t-test, p < 0.01
|Kinect 1||Kinect 2|
|Mean (mm)||SD (mm)||Mean (mm)||SD (mm)|
|ROOT||245||256||267||23||25||25||76||81||93||24||19||18|
* t-test, p < 0.05; ** t-test, p < 0.01
Figs. 5, 6, 7, and 8 demonstrate the means and SDs of the joint position offsets for the exercises Cops & Robbers, Lateral Stepping, Jogging, and Punching, respectively. In the figures, the skeleton in magenta represents one of the key poses in the exercise sequence as captured by the motion capture system. The blue or black lines on the other hand represent the corresponding skeletons generated from the mean joint position offsets that were observed in either Kinect 1 or Kinect 2, respectively. The ellipsoids at each joint denote the SDs in the 3D space, analyzed for each axis of the local coordinate system attached to the corresponding segment. A larger ellipsoid indicates a larger SD of the joint position in Kinect compared with the joint position captured by the motion capture.
Such visualization of results provides a quick and intuitive way for comparison of accuracy of different joints. The results show that the overall SDs are larger in Kinect 1 as compared to Kinect 2. The variability of offsets is also increasing with the viewpoint angle. In more dynamic movements, such as Jogging (Fig. 7) and Punching (Fig. 8), the end-points, such as feet and hands, have considerably larger SD with increasing viewpoint angle. Finally, we can observe that certain joints have consistently large offsets from the baseline skeleton, such as ROOT, HIP_L, and HIP_R in Kinect 1 and ANK_L, ANK_R, FOO_L, and FOO_R in Kinect 2.
The joint position offsets in general depend on various sources of error, such as systematic errors (e.g. offset of hips in Kinect 1), noise from depth computation, occlusions, loss of tracking, etc. In our further analysis, we analyze the error distribution to discriminate between the random errors and the errors due to tracking loss. We expect that the random errors follow a Gaussian distribution while the errors due to tracking loss can be treated as outliers belonging to a uniform distribution. As an example, we show the histogram of the joint position offsets for the right elbow and right knee as captured in the exercises Cops & Robbers and Jogging from different viewpoint angles (Figs. 9 and 10, respectively). These two joints are among the more active joints in these two exercises. The histograms support our assumption about the error distribution where the joint position offsets are mainly concentrated close to zero with a long tail to the right side. In order to determine the outliers in the tracking data, we use a mixture model of a Gaussian distribution and a uniform distribution to approximate the distribution of the joint position offsets, as defined in equation (1). Fig. 11 demonstrates the distribution fitting results for the right elbow in the exercise Cops & Robbers. The results show the mixture model of the Gaussian and uniform distributions overlaid on the data histograms.
After applying the mixture model fitting for each joint independently, we can classify the data into either on-track state or off-track state. Table IV shows the percentage of the average on-track ratio for each joint defined as the ratio between the number of on-track samples and the total number of frames. The results show that in most joints the on-track ratio is above 90%. For the frontal view, the on-track ratio of all the joints is relatively similar. In the viewpoints of 30° and 60°, the active joints which are further away from the camera typically have lower ratios than the joints closer to the camera. If all the joints in a frame are on-track, that frame is marked as a valid frame for the data accuracy evaluation. The last row in Table IV summarizes the percentage of valid frames. The percentage of valid frames is typically higher for Kinect 2 than Kinect 1. Furthermore, the percentages of valid frames in the viewpoints of 30° and 60° drop by 10% and 15% compared to those in the frontal view, respectively. Finally, in Tables V and VI we show the mean and SD of the joint position offsets after the removal of outliers.
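The on-track ratio and the valid-frame percentage defined above reduce to simple averages over a boolean mask. The sketch below assumes a `(frames, joints)` mask produced by the per-joint classification; the function name `tracking_ratios` is hypothetical.

```python
import numpy as np

def tracking_ratios(on_track):
    """Per-joint on-track ratio and percentage of valid frames.

    on_track: (F, J) boolean array, True where a joint is on-track.
    Returns (per_joint_percent, valid_frame_percent); a frame is valid
    only if every joint in it is on-track.
    """
    on_track = np.asarray(on_track, dtype=bool)
    per_joint = 100.0 * on_track.mean(axis=0)
    valid_frames = 100.0 * on_track.all(axis=1).mean()
    return per_joint, valid_frames
```

Since a single off-track joint invalidates the whole frame, the valid-frame percentage can never exceed the lowest per-joint ratio.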
|Sitting (%)||Standing (%)|
|Kinect 1||Kinect 2||Kinect 1||Kinect 2|
Compared with the results in Tables II and III, both the mean and SD of most joints in Tables V and VI are reduced since the outliers are excluded from the analysis. Table VII summarizes the average reduction of the mean and SD of the joint position offsets after excluding the outliers. The results demonstrate that the data accuracy can be significantly improved by fitting the data into the mixture model.
|Kinect 1||Kinect 2|
|Mean (mm)||SD (mm)||Mean (mm)||SD (mm)|
|ROOT||251||259||261||15||13||17||100||100||107||14||12||14|
* t-test, p < 0.05; ** t-test, p < 0.01
|Kinect 1||Kinect 2|
|Mean (mm)||SD (mm)||Mean (mm)||SD (mm)|
|ROOT||244||257||267||16||17||20||67||80||91||10||11||12|
* t-test, p < 0.05; ** t-test, p < 0.01
|Kinect 1||Kinect 2|
|Mean (%)||SD (%)||Mean (%)||SD (%)|
V-B Bone Length Accuracy
Another important parameter for evaluation of Kinect pose tracking performance is the bone length. As mentioned previously, the Kinect tracking algorithm does not specifically pre-define or calibrate for the anthropometric values of the body/bone segments. On the other hand, the human skeleton can be approximated as a kinematic structure with rigid segments, so we expect that the bone lengths should stay relatively constant. The size of the variance (and SD) of the bone length over time can thus be interpreted as a measure of robustness of the extracted kinematic model.
For the Kinect skeleton, we define the bone length as the distance between the positions of two subsequent joints. The bone length for the motion capture is on the other hand determined during the calibration phase and remains constant during the motion sequence. Figs. 12 and 13 show the means and SDs of the bone length difference of Kinect 1 and Kinect 2, respectively, as compared to the bone length calibrated from the motion capture data across all the subjects. The mean bone length difference does not change too much between different exercises. The SDs are typically increasing with larger viewpoint angle. We can observe that the bone lengths in Kinect 1 usually have larger offsets and SDs as compared to Kinect 2, especially for the upper legs due to the large vertical offset of the hip joints.
Tables VIII and IX summarize the mean and SD of the bone length differences in Kinect 1 and Kinect 2 in the three viewpoints for sitting and standing exercises, respectively. We can observe that the mean differences in the bone lengths and SDs are smaller in Kinect 2, suggesting that the kinematic structure of its skeleton is more robust.
|Kinect 1||Kinect 2|
|Mean (mm)||SD (mm)||Mean (mm)||SD (mm)|
|ARM_UP_L||-76||-74||-81||17||18||21||-67||-62||-68||14||17||20|
* t-test, p < 0.05; ** t-test, p < 0.01
|Kinect 1||Kinect 2|
|Mean (mm)||SD (mm)||Mean (mm)||SD (mm)|
|ARM_UP_L||-60||-62||-65||19||22||23||-54||-54||-51||15||18||20|
* t-test, p < 0.05; ** t-test, p < 0.01
V-C Summary of Findings
Based on the experimental results reported in this paper, we can make the following observations:
As reported by other researchers, the hip joints in Kinect 1 are located much higher than normal with the offset of about 200 mm. The offsets should be considered when calculating knee and hip angles, in particular in sitting position. On the other hand, the skeleton in Kinect 2 is in general more anthropometric with smaller offsets.
The foot and ankle joints of Kinect 2 are offset from the ground plane by about 100 mm or more. The orientation of the feet is thus unreliable. Once the foot is lifted off the ground, the tracking of the joints is more accurate. The unreliable foot position may originate from ToF artifacts that generate large amounts of noise close to large planar surfaces.
Overall accuracy of joint positions in Kinect 2 is better than in Kinect 1, except for the location of the feet. The average offsets are typically between 50 mm and 100 mm.
The analysis of the distribution mixture shows that Kinect 2 has a smaller uniform distribution component (i.e. fewer outliers), suggesting that the tracking of joints is more robust. Kinect 2 also tracks human movement more reliably even with partial body occlusions.
The difference and variance of the actual limb lengths are smaller in Kinect 2 than in Kinect 1.
The skeleton tracking in Kinect 2 has much smaller latency as compared to Kinect 1 which is noticeable especially during fast actions (e.g. exercises Jogging and Punching).
VI Conclusion
In this paper, we compared the human pose estimation for the first and second generations of Microsoft Kinect with standard motion capture technology. The results of our analysis showed that overall Kinect 2 has better accuracy in joint estimation while providing skeletal tracking that is more robust to occlusions and body rotation. Only the lower legs were tracked with large offsets, possibly due to ToF artifacts. This phenomenon was not observed in Kinect 1 which employs structured light for depth acquisition. For Kinect 1, the largest offsets were observed in the pelvic area as also noted by others. Our analyses show that Kinect 1 can be exchanged with Kinect 2 for the majority of motions. Furthermore, by applying a mixture of Gaussian and uniform distribution models we were able to evaluate the robustness of the pose tracking. We showed that the SDs of the joint positions can be reduced by 30% to 40% by employing the classification with a mixture distribution model. This finding suggests that by excluding the outliers from the data and compensating for the offsets, more accurate human motion analysis can be achieved.