3D Human Pose Estimation using Deep Learning
Marker-based and marker-less optical skeletal motion-capture methods use an outside-in arrangement of cameras placed around a scene, with viewpoints converging on the center. They often cause discomfort through the marker suits they may require, and their recording volume is severely restricted and often constrained to indoor scenes with controlled backgrounds. Alternative suit-based systems use several inertial measurement units or an exoskeleton to capture motion. This makes capturing independent of a confined volume, but requires substantial, often constraining, and hard to set up body instrumentation. We therefore propose a new method for real-time, marker-less and egocentric motion capture which estimates the full-body skeleton pose from a lightweight stereo pair of fisheye cameras that are attached to a helmet or virtual reality headset. It combines the strength of a new generative pose estimation framework for fisheye views with a ConvNet-based body-part detector trained on a large new dataset. Our inside-in method captures full-body motion in general indoor and outdoor scenes, and also crowded scenes with many people in close vicinity. The captured user can freely move around, which enables reconstruction of larger-scale activities and is particularly useful in virtual reality to freely roam and interact, while seeing the fully motion-captured virtual body.
Traditional optical skeletal motion-capture methods – both marker-based and marker-less – use several cameras typically placed around a scene in an outside-in arrangement, with camera views approximately converging in the center of a confined recording volume. This greatly constrains the spatial extent of motions that can be recorded; simply enlarging the recording volume by using more cameras, for instance to capture an athlete, is not scalable. Outside-in arrangements also constrain the type of scene that can be recorded, even if it fits into a confined space. If a recording location is too small, cameras can often not be placed sufficiently far away. In other cases, a scene may be cluttered with objects or furniture, or other dynamic scene elements, such as people in close interaction, may obstruct a motion-captured person in the scene or create unwanted dynamics in the background. In such cases, even state-of-the-art outside-in marker-less optical methods that succeed with just a few cameras, and are designed for less controlled and outdoor scenes [Elhayek et al., 2015], quickly fail. Scenes with dense social interaction were previously captured with outside-in camera arrays of a few hundred sensors [Joo et al., 2015], a very complex and difficult to scale setup.
These strong constraints on recording volume and scene density prevent the use of optical motion capture in the majority of real-world scenes. This problem can partly be bypassed with inside-in motion-capture methods that use body-worn sensors exclusively Menache , such as the Xsens MVN inertial measurement unit suit. However, the special suit and cabling are obstructive and require tedious calibration. Shiratori et al.  propose to wear 16 cameras placed on body parts facing inside-out, and capture the skeletal motion through structure-from-motion relative to the environment. This clever solution requires instrumentation, calibration and a static background, but allows free roaming. This design was inspirational for our egocentric approach.
We propose EgoCap: an egocentric motion-capture approach that estimates full-body pose from a pair of optical cameras carried by lightweight headgear (see the teaser figure). The body-worn cameras are oriented such that their field of view covers the user’s body entirely, forming an arrangement that is independent of external sensors – an optical inside-in method, if you will. We show that our optical full-body approach overcomes many limitations of existing outside-in, inside-out and IMU-based inside-in methods. It reduces the setup effort, enables free roaming, and minimizes body instrumentation. EgoCap decouples the estimation of local body pose with respect to the headgear cameras and global headgear position, which we infer by inside-out structure-from-motion on the scene background.
Our first contribution is a new egocentric inside-in sensor rig with only two head-mounted, downward-facing commodity video cameras with fisheye lenses (see the teaser figure). While head-mounted cameras might pose a problem with respect to social acceptance and ergonomics in some scenarios, performances have not been hindered during our recordings and VR tests. The rig can be attached to a helmet or a head-mounted VR display, and, hence, requires less instrumentation and calibration than other body-worn systems. The stereo fisheye optics keep the whole body in view in all poses, despite the cameras’ proximity to the body. We prefer conventional video cameras over IR-based RGB-D cameras, which were for example used for egocentric hand tracking [Sridhar et al., 2015], as video cameras work indoors and outdoors, have lower energy consumption and are easily fitted with the required fisheye optics.
Our second contribution is a new marker-less motion capture algorithm tailored to the strongly distorted egocentric fisheye views. It combines a generative model-based skeletal pose estimation approach (Section 4) with evidence from a trained ConvNet-based body part detector (Section 4.3). The approach features an analytically differentiable objective energy that can be minimized efficiently, is designed to work with unsegmented frames and general backgrounds, succeeds even on poses exhibiting notable self-occlusions (e.g. when walking), as the part detector predicts occluded parts, and enables recovery from tracking errors after severe occlusions.
Our third contribution is a new approach for automatically creating body part detection training datasets. We record test subjects in front of a green screen with an existing outside-in marker-less motion capture system to get ground-truth skeletal poses, which are reprojected into the simultaneously recorded head-mounted fisheye views to get 2D body part annotations. We augment the training images by replacing the green screen with random background images, and vary the appearance in terms of color and shading by intrinsic recoloring [Meka et al.]. With this technique, we annotate a total of 100,000 egocentric images of eight people in different clothing (Section 4.3.1), with 75,000 images from six people used for training. We publish the dataset for research purposes [EgoCap].
We designed and extensively tested two system prototypes featuring (1) cameras fitted to a bike helmet, and (2) small cameras attached to an Oculus Rift headset. We show reliable egocentric motion capture, both off-line and in real time. The egocentric tracking meets the accuracy of outside-in approaches using 2–3 cameras; additional advances are necessary to match the accuracy of many-camera systems. In our egocentric setup, reconstructing the lower body is more challenging due to its larger distance and frequent occlusions, and is less accurate compared to the upper body in our experiments. Nevertheless, we succeed in scenes that are challenging for outside-in approaches, such as close interaction with many people, as well as outdoor and indoor scenes in cluttered environments with frequent occlusions, for example when working in a kitchen or at a desk. We also show successful capturing in large volumes, for example of the skeletal motion of a cyclist. The lightweight Oculus Rift gear is designed for egocentric motion capture for virtual reality, where the user can move in the real world to roam and interact in a virtual environment seen through a head-mounted display, while perceiving increased immersion thanks to the rendering of the motion-captured body, which is not obtained with current HMD head pose tracking.
Marker-based optical systems use a suit with passive retro-reflective spheres (e.g. Vicon) or active LEDs (e.g. PhaseSpace). Skeleton motion is reconstructed from observed marker positions in multiple cameras (usually 10 or more) in an outside-in arrangement, producing highly accurate sparse motion data, even of soft tissue [Park and Hodgins, 2008, Loper et al., 2014], but the external cameras severely restrict the recording volume. For character animation purposes, where motions are restricted, use of motion sub-spaces can reduce requirements to six markers and two cameras [Chai and Hodgins, 2005], or a single foot pressure-sensor pad [Yin and Pai], which greatly improves usability. For hand tracking, a color glove and one camera [Wang and Popović, 2009] is highly practical. Inertial measurement units (IMUs) fitted to a suit (e.g. Xsens MVN) allow free roaming and high reliability in cluttered scenes by inside-in motion capture, i.e. without requiring external sensors [Tautges et al., 2011]. Combinations with ultrasonic distance sensors [Vlasic et al.], video input [Pons-Moll et al., 2010, 2011], and pressure plates [Ha et al.] suppress the drift inherent to IMU measurements and reduce the number of required IMUs. Besides drift, the instrumentation with IMU sensors is the largest drawback, causing long setup times and intrusion. Exoskeleton suits (e.g. METAmotion Gypsy) avoid drift, but require more cumbersome instrumentation. Turning the standard outside-in capturing approach on its head, Shiratori et al. attach 16 cameras to body segments in an inside-out configuration, and estimate skeletal motion from the position and orientation of each camera as computed with structure-from-motion. This clever solution – which was inspirational for our egocentric approach – allows free roaming although it requires instrumentation and a static background.
Recent years have seen great advances in marker-less optical motion-capture algorithms that track full-body skeletal motions, reaching and outperforming the reconstruction quality of suit- and marker-based approaches [Bregler and Malik, 1998, Theobalt et al., 2010, Moeslund et al., 2011, Holte et al., 2012]. Marker-less approaches also typically use an outside-in camera setup, and were traditionally limited to controlled studio environments, or scenes with static, easy-to-segment background, using 8 or more cameras [e.g. Urtasun et al., 2006, Gall et al., 2010, Sigal et al., 2010, 2012, Stoll et al., 2011]. Recent work is moving towards less controlled environments and outdoor scenes, also using fewer cameras [Amin et al., 2009, Burenius et al., 2013, Elhayek et al., 2015, Rhodin et al., 2015], but still in an outside-in configuration. These approaches are well-suited for static studio setups, but share the limitation of constrained recording volumes, and reach their limits in dense, crowded scenes. Joo et al.  use a camera dome with 480 outside-in cameras for motion capture of closely interacting people, but domes do not scale to larger natural scenes.
3D pose estimation is highly accurate and reliable when using multiple RGB-D cameras Zhang et al. , and even feasible from a single RGB-D camera in real time [e.g. Shotton et al., 2011, Baak et al., 2011, Wei et al., 2012]. However, many active IR-based depth cameras are unsuitable for outdoor capture, have high energy consumption, and equipping them with fisheye optics needed for our camera placement is hard.
In the past, egocentric inside-in camera placements were used for tracking or model learning of certain parts of the body, for example of the face with a helmet-mounted camera or rig [Jones et al., 2011, Wang et al., 2016], of fingers from a wrist-worn camera [Kim et al., 2012], or of eyes and eye gaze from cameras in a head-mounted rig [Sugano and Bulling, 2015]. Rogez et al.  and Sridhar et al.  track articulated hand motion from body- or chest-worn RGB-D cameras. Using a body-worn depth camera, Yonemoto et al.  extrapolate arm and torso poses from arm-only RGB-D footage. Jiang and Grauman  attempted full-body pose estimation from a chest-worn camera view by analyzing the scene, but without observing the user directly and at very restricted accuracy. Articulated full-body motion capture with a lightweight head-mounted camera pair was not yet attempted.
A complementary research branch analyses the environment from first-person, i.e. body-worn outward-facing cameras, for activity recognition [e.g. Fathi et al., 2011, Kitani et al., 2011, Ohnishi et al., 2016, Ma et al., 2016], for learning engagement and saliency patterns of users when interacting with the real world [e.g. Park et al., 2012, Su and Grauman, 2016], and for understanding the utility of surrounding objects Rhinehart and Kitani . Articulated full-body tracking, or even only arm tracking, is not their goal, but synergies of both fields appear promising.
Traditionally, 2D human pose estimation from monocular images is a two-stage process where coherent body pose is inferred from local image evidence Yang and Ramanan , Johnson and Everingham . Convolutional networks (ConvNets) brought a major leap in performance Chen and Yuille , Jain et al. [2014, 2015], Tompson et al. , Toshev and Szegedy  and recent models demonstrated that end-to-end prediction is possible due to the large receptive fields capturing the complete pose context [Pishchulin et al., 2016]. Pfister et al.  and Wei et al.  allow for increased depth and learning of spatial dependencies between body parts by layering multiple ConvNets. We adopt the network architecture of Insafutdinov et al. , which builds on the recent success of residual networks [He et al., 2016, Newell et al., 2016], which further facilitate an increase in network depth. Recently, direct 3D pose estimation has emerged by lifting 2D poses to 3D Yasin et al. , using mid-level posebit descriptors Pons-Moll et al. , and motion compensation in videos [Tekin et al., 2016], but estimates are still coarse. Existing detection methods use simplified body models with few body parts to reduce the enormous cost of creating sufficiently large, annotated training datasets, do not generalize to new camera geometry and viewpoints, such as egocentric views, and results usually exhibit jitter over time.
We designed a mobile egocentric camera setup to enable human motion capture within a virtually unlimited recording volume. We attach two fisheye cameras rigidly to a helmet or VR headset, such that their field of view captures the user’s full body, see Figure 1. The wide field of view makes it possible to observe interactions in front of and beside the user, irrespective of their global motion and head orientation, and without requiring additional sensors or suits. The stereo setup ensures that most actions are observed by at least one camera, despite substantial self-occlusions of arms, torso and legs in such an egocentric setup. A baseline of 30–40 cm proved to be best in our experiments. The impact of the headgear on the user’s motion is limited as it is lightweight: our prototype camera rig for VR headsets (see the teaser figure, bottom left) only adds about 65 grams of weight.
Our egocentric setup separates human motion capture into two subproblems: (1) local skeleton pose estimation with respect to the camera rig, and (2) global rig pose estimation relative to the environment. Global pose is estimated with existing structure-from-motion techniques, see Section 6.3. We formulate skeletal pose estimation as an analysis-by-synthesis-style optimization problem in the pose parameters $\mathbf{p}^t$ that maximizes the alignment of a projected 3D human body model (Section 4.1) with the human in the left and the right stereo fisheye views, at each video time step $t$. We use a hybrid alignment energy combining evidence from a generative image-formation model, as well as from a discriminative detection approach. Our generative ray-casting-based image formation model is inspired by light transport in volumetric translucent media, and enables us to formulate a color-based alignment term $E_{\text{color}}$ that is analytically differentiable and features an analytically differentiable formulation of 3D visibility (Section 4.2). This model facilitates generative pose estimation with only two cameras, and we adapt it to the strongly distorted fisheye views. Our energy also employs constraints from one-shot joint-location predictions in the form of $E_{\text{detection}}$. These predictions are found with a new ConvNet-based 2D joint detector for head-mounted fisheye views, which is learned from a large corpus of annotated training data, and which generalizes to different users and cluttered scenes (Section 4.3). The combined energy that we optimize takes the following form:

$$E(\mathbf{p}^t) = E_{\text{color}}(\mathbf{p}^t) + E_{\text{detection}}(\mathbf{p}^t) + E_{\text{pose}}(\mathbf{p}^t) + E_{\text{smooth}}(\mathbf{p}^t). \tag{1}$$

Here, $E_{\text{pose}}$ is a regularizer that penalizes violations of anatomical joint-angle limits as well as poses deviating strongly from the rest pose ($\mathbf{p} = \mathbf{0}$):

$$E_{\text{pose}}(\mathbf{p}^t) = \lambda_{\text{limit}} \left( \max(0, \mathbf{p}^t - \mathbf{l}_{\text{upper}})^2 + \max(0, \mathbf{l}_{\text{lower}} - \mathbf{p}^t)^2 \right) + \lambda_{\text{pose}} \, \mathrm{huber}(\mathbf{p}^t), \tag{2}$$

where $\mathbf{l}_{\text{lower}}$ and $\mathbf{l}_{\text{upper}}$ are lower and upper joint-angle limits, and $\mathrm{huber}(a) = \sqrt{1 + a^2} - 1$ is the Pseudo-Huber loss function. $E_{\text{smooth}}$ is a temporal smoothness term:

$$E_{\text{smooth}}(\mathbf{p}^t) = \lambda_{\text{smooth}} \, \mathrm{huber}\left( \mathbf{p}^{t-1} + \zeta \, (\mathbf{p}^{t-1} - \mathbf{p}^{t-2}) - \mathbf{p}^t \right), \tag{3}$$

where $\zeta = 0.25$ is a damping factor. The total energy in Equation 1 is optimized for every frame, as described in Section 4.4. In the following, we describe the generative and discriminative terms in more detail, while omitting the temporal dependency $t$ in the notation for better readability.
The regularization and smoothness terms are weighted by $\lambda_{\text{limit}}$, $\lambda_{\text{pose}}$ and $\lambda_{\text{smooth}}$.
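As an illustration of the regularization and smoothness terms described above, the following minimal sketch evaluates them for a pose given as a plain list of joint angles. The function names, the unit-scale Pseudo-Huber form, the squared limit penalty and the weight arguments are assumptions made for this example; only the damping factor of 0.25 is taken from the text.

```python
import math

def pseudo_huber(a):
    """Pseudo-Huber loss with unit scale: quadratic near 0, linear for large |a|."""
    return math.sqrt(1.0 + a * a) - 1.0

def pose_regularizer(pose, lower, upper, w_limit, w_pose):
    """Penalize joint-limit violations and deviation from the rest pose (all zeros).

    The weights w_limit and w_pose are hypothetical parameters of this sketch.
    """
    limit = sum(max(0.0, p - hi) ** 2 + max(0.0, lo - p) ** 2
                for p, lo, hi in zip(pose, lower, upper))
    rest = sum(pseudo_huber(p) for p in pose)
    return w_limit * limit + w_pose * rest

def smoothness(pose, prev, prev2, w_smooth, damping=0.25):
    """Penalize deviation from a damped constant-velocity prediction."""
    total = 0.0
    for p, p1, p2 in zip(pose, prev, prev2):
        predicted = p1 + damping * (p1 - p2)
        total += pseudo_huber(predicted - p)
    return w_smooth * total
```

A pose inside its limits and at rest incurs no regularization cost, and a pose that exactly continues the damped motion of the two previous frames incurs no smoothness cost.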
We model the 3D body shape and pose of humans using the approach proposed by Rhodin et al., which represents the body volumetrically as a set of isotropic Gaussian density functions distributed in 3D space. Each Gaussian $G_q$ is parametrized by its standard deviation $\sigma_q$, location $\boldsymbol{\mu}_q$ in 3D space, density $c_q$ and color $\mathbf{a}_q$, which define the Gaussian shape parameters. The combined density field of the Gaussians, $\sum_q c_q G_q$, smoothly describes the volumetric occupancy of the human in 3D space, see Figure 1. Each Gaussian is rigidly attached to one of the bones of an articulated skeleton with 17 joints, whose pose is parameterized by 37 twist pose parameters [Murray et al., 1994].
Shape and skeleton bone lengths need to be personalized to the tracked user prior to capturing. Commercial systems often use a dedicated initialization sequence at the start. Research papers on marker-less motion capture often treat initialization as a separate problem, and initialize models manually, which we could also do. However, we propose a much more automated initialization procedure to reduce setup time and effort. To this end, we adapt the approach of Rhodin et al. , who personalize a 3D parametric human shape model of Gaussian density and skeleton dimensions by fitting it to multi-view images using a volumetric contour alignment energy. We adapt this to our stereo fisheye setting. In our egocentric setup 3–4 different user poses, showing the bending of knees, elbows and wrists without any occlusion, were sufficient for automatic shape and skeleton personalization, and only the automatically inferred Gaussian colors are manually corrected on body parts viewed at acute angles.
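For color-based model-to-image similarity, we use the ray-casting image formation model of the previously described volumetric body model [Rhodin et al., 2015]. We first describe image formation assuming a standard pinhole model, as in Rhodin et al., and then describe how we modify it for fisheye views. A ray is cast from the camera center $\mathbf{o}$ in direction $\mathbf{n}$ of an image pixel. The visibility of a particular 3D Gaussian $G_q$ along the ray ($\mathbf{o} + s\mathbf{n}$) is computed via

$$V_q(\mathbf{o}, \mathbf{n}, \mathbf{p}) = \int_0^\infty \exp\left( -\int_0^s \sum_i c_i \, G_i(\mathbf{o} + \tau\mathbf{n}) \, d\tau \right) c_q \, G_q(\mathbf{o} + s\mathbf{n}) \, ds, \tag{4}$$

where $c_i$ is the density of Gaussian $G_i$. This formulation of visibility and color of a 3D Gaussian from the camera view is based on a model of light transport in heterogeneous translucent media [Cerezo et al.]. $V_q$ is the fraction of light along the ray that is absorbed by Gaussian $G_q$. We use this image-formation model in an energy term that computes the agreement of model and observation by summing the visibility-weighted color dissimilarity $d(\cdot, \cdot)$, which we describe in Appendix A, between image pixel color $\mathcal{I}(u, v)$ and the Gaussian’s color $\mathbf{a}_q$:

$$E_{\text{color}}(\mathbf{p}, \mathcal{I}) = \sum_{(u,v)} \sum_q V_q(\mathbf{o}, \mathbf{n}(u, v), \mathbf{p}) \, d(\mathcal{I}(u, v), \mathbf{a}_q). \tag{5}$$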
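To make the visibility model more concrete, the following sketch integrates the absorbed light fraction of each Gaussian numerically along a single ray. This is only an illustration: the paper relies on an analytic, differentiable formulation, whereas this sketch uses a simple midpoint quadrature. The tuple layout of the Gaussians, the step count, and the squared Euclidean color distance (standing in for the dissimilarity of Appendix A) are assumptions.

```python
import math

def gaussian(x, mu, sigma):
    """Unnormalized isotropic Gaussian evaluated at the 3D point x."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, mu))
    return math.exp(-d2 / (2.0 * sigma * sigma))

def visibilities(origin, direction, blobs, s_max=10.0, steps=2000):
    """Fraction of light absorbed by each blob along the ray origin + s * direction.

    blobs: list of (mu, sigma, density, color) tuples (layout assumed for this sketch).
    """
    ds = s_max / steps
    optical_depth = 0.0  # accumulated density between the camera and the sample
    vis = [0.0] * len(blobs)
    for k in range(steps):
        s = (k + 0.5) * ds
        x = [o + s * n for o, n in zip(origin, direction)]
        dens = [c * gaussian(x, mu, sigma) for (mu, sigma, c, _color) in blobs]
        total = sum(dens)
        # Transmittance at the middle of the current step.
        transmittance = math.exp(-(optical_depth + 0.5 * total * ds))
        for q, d in enumerate(dens):
            vis[q] += transmittance * d * ds
        optical_depth += total * ds
    return vis

def squared_distance(c1, c2):
    """Stand-in color dissimilarity (assumption, not the measure of Appendix A)."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def ray_color_energy(pixel_color, origin, direction, blobs):
    """Visibility-weighted color dissimilarity for one pixel ray."""
    vis = visibilities(origin, direction, blobs)
    return sum(v * squared_distance(pixel_color, color)
               for v, (_mu, _sigma, _c, color) in zip(vis, blobs))
```

With two dense Gaussians placed one behind the other on the ray, the front one absorbs almost all of the light, so the pixel energy is low only if the pixel color matches the front Gaussian.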
Note that this formulation has several key advantages over previous generative models for image-based pose estimation. It enables analytic derivatives of the pose energy, including a smooth analytically differentiable visibility model everywhere in pose space. This makes it perform well with only a few camera views. Previous methods often used fitting energies that are non-smooth or even lacking a closed-form formulation, requiring approximate recomputation of visibility (e.g. depth testing) inside an iterative optimization loop. Rhodin et al.’s formulation forms a good starting point for our egocentric tracking setting, as non-stationary backgrounds and occlusions are handled well. However, it applies only to static cameras, does not support the distortion of fisheye lenses, and it does not run in real time.
In our egocentric camera rig, the cameras move rigidly with the user’s head. In contrast to commonly used skeleton configurations, where the hip is taken as the root joint, our skeleton hierarchy is rooted at the head. Like a puppet, the lower body parts are then relative to the head motion, see Figure 1. This formulation factors out the user’s global motion, which can be estimated independently, see Section 6.3, and reduces the dimensionality of the pose estimation by 6 degrees of freedom. By attaching the cameras to the skeleton root, the movable cameras are reduced to a static camera formulation such that Equation 4 applies without modification.
Simply undistorting the fisheye images before optimization is impractical as resolution at the image center reduces and pinhole cameras cannot capture fields of view approaching 180 degrees – their image planes would need to be infinitely large. To apply the ray-casting formulation described in the previous section to our egocentric motion-capture rig, with its 180° field of view, we replace the original pinhole camera model with the omnidirectional camera model of Scaramuzza et al. The ray direction of a pixel $(u, v)$ is then given by $\mathbf{n}(u, v) = [u, v, f(\rho)]^\top$, where $f$ is a polynomial of the distance $\rho$ of $(u, v)$ to the estimated image center. We combine the energy terms for the two cameras (Equation 5) in our egocentric camera rig using

$$E_{\text{color}}(\mathbf{p}) = E_{\text{color}}(\mathbf{p}, \mathcal{I}_{\text{left}}) + E_{\text{color}}(\mathbf{p}, \mathcal{I}_{\text{right}}). \tag{6}$$
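The polynomial ray model can be sketched in a few lines. The calibration values below (image center and polynomial coefficients) are hypothetical and chosen only so that the example is easy to check; a real calibration would also include the affine sensor correction of the omnidirectional model, which this sketch omits.

```python
import math

def fisheye_ray(u, v, center, coeffs):
    """Unit ray direction for pixel (u, v) under a polynomial omnidirectional model.

    center: estimated image center (cu, cv).
    coeffs: polynomial coefficients (a0, a1, a2, ...) of f(rho); hypothetical values.
    """
    x = u - center[0]
    y = v - center[1]
    rho = math.hypot(x, y)  # distance of the pixel to the image center
    z = sum(a * rho ** k for k, a in enumerate(coeffs))
    norm = math.sqrt(x * x + y * y + z * z)
    return (x / norm, y / norm, z / norm)

# Hypothetical calibration: f(rho) = -200 + 0.002 * rho^2.
CENTER = (640.0, 512.0)
COEFFS = (-200.0, 0.0, 0.002)
```

With these example coefficients, the center pixel looks along the optical axis, while pixels at a radius of about 316 pixels, where $f(\rho) = 0$, look perpendicular to it. A field of view of 180° is therefore covered by a finite image, which a pinhole model cannot achieve.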
These extensions also generalize the contour model of Rhodin et al.  to enable egocentric body model initialization.
We combine the generative model-based alignment from the previous section with evidence from the discriminative joint-location detector of Insafutdinov et al. , trained on annotated egocentric fisheye images. The discriminative component dramatically improves the quality and stability of reconstructed poses, provides efficient recovery from tracking failures, and enables plausible tracking even under notable self-occlusions. To apply Insafutdinov et al.’s body-part detector, which has shown state-of-the-art results on human pose estimation from outside-in RGB images, to the top-down perspective and fisheye distortion of our novel egocentric camera setup, the largest burden is to gather and annotate a training dataset that is sufficiently large and varied, containing tens of thousands of images. As our camera rig is novel, there are no existing public datasets, and we therefore designed a method to automatically annotate real fisheye images by outside-in motion capture and to augment appearance with the help of intrinsic image decomposition.
We propose a novel approach for semi-automatically creating large, realistic training datasets for body-part detection that comprise tens of thousands of camera images annotated with the joint locations of a kinematic skeleton and other body parts such as the hands and feet. To avoid the tedious and error-prone manual annotation of locations in thousands of images, as in previous work, we use a state-of-the-art marker-less motion capture system (Captury Studio of The Captury) to estimate the skeleton motion in 3D from eight stationary cameras placed around the scene. We then project the skeleton joints into the fisheye images of our head-mounted camera rig. The projection requires tracking the rigid motion of our head-mounted camera rig relative to the stationary cameras of the motion-capture system, for which we use a large checkerboard rigidly attached to our camera rig (Figure 2). We detect the checkerboard in all stationary cameras in which it is visible, and triangulate the 3D positions of its corners to estimate the pose and orientation of the camera rig. Using Scaramuzza et al.’s camera distortion model, we then project the 3D joint locations into the fisheye images recorded by our camera rig.
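The annotation step hinges on expressing the outside-in 3D joint positions in the coordinate frame of the moving camera rig before applying the fisheye projection. A minimal sketch of that change of coordinates is given below, assuming that the checkerboard tracking yields a rotation matrix (rig to world) and a translation (rig origin in world coordinates); the function name and the data layout are assumptions of this example.

```python
def world_to_rig(point, rotation, translation):
    """Express a world-space 3D point in the rig frame: R^T (x - t).

    rotation: 3x3 nested list mapping rig coordinates to world coordinates.
    translation: position of the rig origin in world coordinates.
    """
    d = [p - t for p, t in zip(point, translation)]
    # Multiply by the transpose of R: dot each column of R with d.
    return [sum(rotation[i][j] * d[i] for i in range(3)) for j in range(3)]

# Hypothetical rig pose: rotated by 90 degrees about the z-axis, shifted by (1, 2, 3).
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
T = [1.0, 2.0, 3.0]
joint_in_rig = world_to_rig([1.0, 3.0, 3.0], R, T)
```

The resulting rig-frame joint position would then be mapped to a pixel with the calibrated fisheye camera model to obtain the 2D annotation.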
We record video sequences of eight subjects performing various motions in a green-screen studio. For the training set, we replace the background of each video frame, using chroma keying, with a random, floor-related image from Flickr, as our fisheye cameras mostly see the ground below the tracked subject. Please note that training with real backgrounds could give the CNN additional context, but is prone to overfitting to a (necessarily) small set of recorded real backgrounds. In addition, we augment the appearance of subjects by varying the colors of clothing, while preserving shading effects, using intrinsic recoloring Meka et al. . This is, to our knowledge, the first application of intrinsic recoloring for augmenting datasets. We also apply a random gamma curve () to simulate changing lighting conditions. We furthermore exploit the shared plane of symmetry of our camera rig and the human body to train a single detector on a dataset twice the size by mirroring the images and joint-location annotations of the right-hand camera to match those of the left-hand camera during training, and vice versa during motion capture. Thanks to the augmentation, both background and clothing colors are different for every frame (see Figure 3), which prevents overfitting to the limited variety of the captured appearances. This results in a training set of six subjects and ~75,000 annotated fisheye images. Two additional subjects are captured and prepared for validation purposes.
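The augmentation steps above can be sketched on tiny images stored as nested lists of RGB tuples in [0, 1]. The green-dominance threshold, the joint naming scheme with `left_`/`right_` prefixes, and the swapping of left and right labels when mirroring are assumptions of this example; intrinsic recoloring is not covered.

```python
def replace_green(image, background, margin=0.2):
    """Chroma keying: use the background wherever green clearly dominates."""
    out = []
    for row, bg_row in zip(image, background):
        out.append([bg if g - max(r, b) > margin else (r, g, b)
                    for (r, g, b), bg in zip(row, bg_row)])
    return out

def apply_gamma(image, gamma):
    """Apply a gamma curve to every color channel."""
    return [[tuple(c ** gamma for c in px) for px in row] for row in image]

def mirror_sample(image, joints):
    """Flip the image horizontally and mirror the joint annotations.

    joints: dict mapping a joint name to its pixel location (u, v).
    Left and right labels are swapped, since a mirrored left limb looks like a right one.
    """
    width = len(image[0])
    flipped = [list(reversed(row)) for row in image]
    mirrored = {}
    for name, (u, v) in joints.items():
        if name.startswith("left_"):
            name = "right_" + name[len("left_"):]
        elif name.startswith("right_"):
            name = "left_" + name[len("right_"):]
        mirrored[name] = (width - 1 - u, v)
    return flipped, mirrored
```

In a training pipeline along these lines, the background image and the gamma value would be drawn at random for every frame.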
Our starting point for learning an egocentric body-part detector for fisheye images is the 101-layer residual network [He et al., 2016] trained by Insafutdinov et al. on the MPII Human Pose dataset [Andriluka et al., 2014], which contains ~19,000 internet images that were manually annotated in a crowd-sourced effort, and the Leeds Sports Extended dataset [Johnson and Everingham, 2011] of 10,000 images. We remove the original prediction layers and replace them with ones that output 18 body-part heat maps (we jointly learn heat maps for the head and neck, plus the left and right shoulders, elbows, wrists, hands, hips, knees, ankles and feet). The input video frames are scaled to a resolution of 640×512 pixels; the predicted heat maps are of 8× coarser resolution. We then fine-tune the ConvNet on our fisheye dataset for 220,000 iterations with a learning rate of 0.002, and drop it to 0.0002 for 20,000 additional iterations. The number of training iterations is chosen based on performance on the validation set. We randomly scale images during training by up to 15% to be more robust to variations in user size. Figure 4 (center) visualizes the computed heat maps for selected body parts. We demonstrate generalization capability to a large variety of backgrounds, changing illumination and clothing colors in Section 5.3.
Inspired by Elhayek et al., who exploit detections in outside-in motion capture, we integrate the learned detections, in the form of heat maps as shown in Figure 4, into the objective energy (Equation 1) as a soft constraint. For each detection label, the location with maximum confidence, $(\hat{u}, \hat{v})$, is selected and an associated 3D Gaussian is attached to the corresponding skeleton body part. This association can be thought of as giving a distinct color to each body-part label. The Gaussian is used to compute the spatial agreement of the detection and body-part location in the same way as in the color similarity $E_{\text{color}}$, only the color distance in Equation 5 is replaced with the predicted detection confidence at $(\hat{u}, \hat{v})$. For instance, if a light green Gaussian is placed at the right knee and is associated with the light green knee detection heat map at $(\hat{u}, \hat{v})$, then their agreement is maximal when the Gaussian’s center projects onto $(\hat{u}, \hat{v})$. By this definition, $E_{\text{detection}}$ forms the sum over the detection agreements of all body parts and in both cameras. We weight its influence by $\lambda_{\text{detection}}$.
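Selecting the most confident location of a heat map and mapping it back to image coordinates can be sketched as follows. The heat map is a nested list of confidences, the stride of 8 reflects the coarser heat-map resolution mentioned earlier, and placing the detection at the center of the corresponding image cell is an assumption of this example.

```python
def heatmap_peak(heatmap, stride=8):
    """Return ((u, v), confidence) for the most confident heat-map cell.

    heatmap: rows of confidence values; stride: image pixels per heat-map cell.
    """
    conf, row, col = max((val, r, c)
                         for r, line in enumerate(heatmap)
                         for c, val in enumerate(line))
    # Map the cell index to the center of the corresponding image region.
    return ((col + 0.5) * stride, (row + 0.5) * stride), conf

# Toy 2x2 heat map for a single body part.
location, confidence = heatmap_peak([[0.1, 0.2],
                                     [0.9, 0.3]])
```

The returned location and confidence are the quantities that enter the detection term for the associated body part.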
Rhodin et al.’s volumetric ray-casting method models occlusion as a smooth phenomenon by integrating the visibility computations within the objective function instead of applying a depth test once before optimization. While this is beneficial for optimizing disocclusions, it introduces dense pairwise dependencies between all Gaussians: the visibility (Equation 4) of a single Gaussian can be evaluated in linear time in terms of the number of Gaussians, $N_q$, but $E_{\text{color}}$ – and its gradient with respect to all Gaussians – has quadratic complexity in $N_q$.
To nevertheless reach real-time performance, we introduce a new parallel stochastic optimization approach. The ray-casting formulation allows a natural parallelization of the $E_{\text{color}}$ and $E_{\text{detection}}$ terms and their gradient computation across pixels $(u, v)$ and Gaussians $G_q$. We also introduce a traversal step, which determines the Gaussians that are close to each ray, and excludes distant Gaussians with negligible contribution to the energy. These optimizations lead to significant run-time improvements, particularly when executed on a GPU, but only enable interactive frame rates.
We achieve further reductions in run times by introducing a statistical optimization approach that is tailored to the ray-casting framework. The input image pixels are statistically sampled for each gradient iteration step, as proposed by Blanz and Vetter . In addition, we sample the volumetric body model by excluding Gaussians from the gradient computation at random, individually for each pixel, which improves the optimization time to 10 fps and more.
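The idea of sampling both pixels and Gaussians can be illustrated with a generic Monte Carlo estimate of a double sum. This sketch simplifies the actual method in several ways: it estimates an energy rather than a gradient, it treats the per-pixel, per-Gaussian contributions as independent (ignoring the coupling through visibility), and the rescaling that keeps the estimate comparable to the full sum is an assumption of this example.

```python
import random

def sample_indices(count, fraction, rng):
    """Draw a random subset of indices, keeping at least one."""
    k = max(1, int(round(count * fraction)))
    return rng.sample(range(count), k)

def stochastic_energy(pixels, num_gaussians, term, pixel_fraction,
                      gaussian_fraction, rng):
    """Estimate the sum of term(pixel, q) over all pixels and Gaussians.

    term: hypothetical callback returning the contribution of Gaussian q at a pixel.
    Gaussians are resampled individually for each selected pixel.
    """
    chosen_pixels = sample_indices(len(pixels), pixel_fraction, rng)
    total = 0.0
    for i in chosen_pixels:
        chosen_q = sample_indices(num_gaussians, gaussian_fraction, rng)
        partial = sum(term(pixels[i], q) for q in chosen_q)
        total += partial * num_gaussians / len(chosen_q)
    return total * len(pixels) / len(chosen_pixels)

rng = random.Random(0)
pixels = list(range(10))
estimate = stochastic_energy(pixels, 8, lambda px, q: 1.0, 0.3, 0.5, rng)
full = stochastic_energy(pixels, 8, lambda px, q: float(px + q), 1.0, 1.0, rng)
```

Each iteration touches only a fraction of the pixel and Gaussian pairs, which is what reduces the cost per gradient step; with both fractions set to one, the estimate reduces to the full sum.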
We show the two EgoCap prototypes used in this work in the teaser figure (left). EgoRig1 consists of two fisheye cameras attached to a standard bike helmet. It is robust and well-suited for capturing outdoor activities and sports. EgoRig2 builds on a lightweight wooden rig that holds two consumer cameras and is glued to an Oculus VR headset. It weighs only 65 grams and adds minimal discomfort for the user. Both prototypes are equipped with 180° fisheye lenses and record with a resolution of 1280×1024 pixels at 30 Hz. Note that the checkerboard attached to EgoRig1 in several images is not used for tracking (only used in training and validation dataset recordings).
For egocentric tracking of unconstrained motions, the full 180° field of view is essential. We evaluate the visibility of selected body parts from our egocentric rig with different (virtual) field-of-view angles in Figure 5. Only at 180 degrees are almost all body parts captured; otherwise, even small motions of the head can cause the hand to leave the recording volume. The limited field of view of existing active depth sensors of 60–80 degrees restricts their applicability to egocentric motion capture, in addition to their higher energy consumption and interference with other light sources.
For most tracking results, we use a resolution of 128×128 pixels and 200 gradient-descent iterations. Our CPU implementation runs at ten seconds per frame on a Xeon E5-1620 3.6 GHz, which is similar to run times reported by Rhodin et al. Straightforward parallelization on the GPU reduces run times to two seconds per frame. The body-part detector runs on a separate machine, and processes 6 images per second on an Nvidia Titan GPU and a Xeon E5-2643 3.30 GHz.
For some experiments (see Section 6.3), we use a resolution of 120×100 pixels and enable stochastic optimization. Then, purely color-based optimization reaches 10 to 15 fps for 50 gradient iterations (2–3 ms per iteration), i.e. close to real-time performance. Our body-part detector is not optimized for speed and cannot yet run at this frame rate, but its implementation could be optimized for real-time processing, so a real-time end-to-end approach would be feasible without algorithmic changes.
We first evaluate the learned body-part detectors, irrespective of generative components, using the percentage of correct keypoints (PCK) metric [Sapp and Taskar, 2013, Tompson et al., 2014]. We evaluate on a validation set, Validation2D, of 1000 images from a 30,000-frame sequence of two subjects that are not part of the training set and wear dissimilar clothing. Validation2D is augmented with random backgrounds using the same procedure as for the training set, such that the difficulty of the detection task matches the real-world sequences. We further validated that overfitting to augmentation is minimal, by testing on green-screen background, with equivalent results.
|Training dataset setting|Head|Shoulder|Elbow|Wrist|Hip|Knee|Ankle|PCK|AUC|
|---|---|---|---|---|---|---|---|---|---|
|+ background augmentation|84.7|87.5|90.9|89.1|97.7|94.2|86.4|89.5|56.9|
|+ intrinsic recoloring|86.2|96.1|93.6|90.1|99.1|95.8|90.9|92.5|59.4|

[Figure 6: (a) Arm joints, (b) Leg joints]
Table 1 presents the evaluation of the proposed data augmentation strategies. Background augmentation during training brings a clear improvement, because it provides a variety of challenging negative samples for training the detector. Performance is further boosted by employing intrinsic video for clothing recoloring, which additionally increases the diversity of training samples. The improvement of about two percent is consistent across all body parts.
Figure 6 contains the plots of PCK at different distance thresholds for arms and legs evaluated on sequence Validation2D. We achieve high accuracy, with slightly lower detection reliability of terminal limbs (wrists, feet). This can either be due to more articulation or, in the case of the feet, due to higher occlusion by knees and their small appearance due to the strong fisheye distortion. The 2D detection accuracy of feet and wrists is comparable, even though feet are further away; a similar pixel error hence translates to larger 3D errors, as evaluated in the next section. We additionally evaluated the training set size. We found that subject variation is important: using only three out of six subjects, the PCK performance dropped by 2.5 percentage points. Moreover, using a random subset of 10% of the original database size reduces the PCK by 2 points, i.e. using more than three frames per second is beneficial. Using a 50% subset did not degrade performance, showing that consecutive frames are not crucial for our per-frame model, but could be beneficial for future research, such as for temporal models.
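As a rough illustration of the metric, the following sketch computes PCK at a given distance threshold and a normalized area under the PCK curve. It assumes thresholds measured in pixels and an AUC normalized by the threshold range; the exact thresholds and normalization used for Table 1 are not stated here, and the function names and toy data are hypothetical.

```python
import numpy as np

def pck(pred, gt, threshold):
    """Fraction of keypoints whose 2D distance to ground truth is <= threshold."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dist <= threshold))

def pck_auc(pred, gt, thresholds):
    """Area under the PCK-vs-threshold curve (trapezoidal rule),
    divided by the threshold range so that a perfect detector scores 1."""
    t = np.asarray(thresholds, dtype=float)
    v = np.array([pck(pred, gt, x) for x in t])
    area = np.sum(0.5 * (v[1:] + v[:-1]) * np.diff(t))
    return float(area / (t[-1] - t[0]))

# Toy example: four keypoints with pixel errors of 0, 3, 5 and 12.
gt = [[0, 0], [10, 10], [20, 20], [30, 30]]
pred = [[0, 0], [13, 10], [23, 24], [30, 42]]
print(pck(pred, gt, threshold=5))
print(pck_auc(pred, gt, thresholds=[0, 5, 10, 15]))
```

Evaluating `pck` over a range of thresholds yields a curve of the kind plotted in Figure 6, and `pck_auc` summarizes that curve in a single number.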
Our main objective is to infer 3D human pose from the egocentric views, despite occlusions and strong fisheye image distortions. We quantitatively evaluate the 3D body pose accuracy of our approach on two sequences, ValidationWalk and ValidationGest. Ground-truth data is obtained with the Captury Studio, a state-of-the-art marker-less commercial multi-view solution with eight video cameras and 1–2 cm accuracy. The two systems are used simultaneously and their relative transformation is estimated with a reference checkerboard, see Figure 2. We experimented with raw green-screen and with randomly replaced background. Error values are estimated as the average Euclidean 3D distance over 17 joints, including all joints with detection labels, except the head. Reconstruction errors on green and replaced backgrounds are both 7±1 cm for a challenging 250-frame walking sequence with occlusions, and 7±1 cm on a long sequence of 750 frames of gesturing and interaction. During gesturing, where arms are close to the camera, upper-body (shoulder, elbow, wrist, finger) joint accuracy is higher than for the lower body (hip, knee, ankle, and toe), with 6 cm and 8 cm average error, respectively. During walking, upper- and lower-body errors are similar at 7 cm. Please note that slight differences in skeleton topology between ground truth and EgoCap exist, which might bias the errors.
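The error measure described above can be sketched as follows. The sketch assumes joint positions stored as arrays of shape (frames, joints, 3) in centimeters, already expressed in a common coordinate frame, and it reports the spread as the standard deviation of the per-frame mean error, which is an assumption. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Average Euclidean 3D joint distance.

    pred, gt: arrays of shape (frames, joints, 3), same units (e.g. cm).
    Returns (mean, std) of the per-frame mean joint error.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    per_joint = np.linalg.norm(pred - gt, axis=-1)  # (frames, joints)
    per_frame = per_joint.mean(axis=1)              # (frames,)
    return float(per_frame.mean()), float(per_frame.std())

# Synthetic example: 2 frames, 17 joints, constant offsets of 6 cm and 8 cm.
gt = np.zeros((2, 17, 3))
pred = gt.copy()
pred[0, :, 0] += 6.0
pred[1, :, 1] += 8.0
mean_cm, std_cm = mean_joint_error(pred, gt)
print(mean_cm, std_cm)
```

Restricting the joint axis to upper-body or lower-body indices before averaging gives per-region errors of the kind quoted in the text.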
Despite the difficult viewing angle and image distortion of our egocentric setup, the overall 3D reconstruction error is comparable to state-of-the-art results of outside-in approaches [Rhodin et al.; Elhayek et al.; Amin et al.; Sigal et al.; Belagiannis et al.], which reach 5–7 cm accuracy from two or more cameras, but only in small and open recording volumes, and for static cameras. In contrast, our algorithm scales to very narrow and cluttered scenes (see Figure 8) as well as to wide unconstrained performances (see Figure 7). No existing algorithm is directly applicable to these conditions and the strong distortions of the fisheye cameras, precluding a direct comparison. Closest to our approach is the fundamentally off-line inside-out method of Shiratori et al., who use 16 body-worn cameras facing outwards, reporting a mean joint position error of 2 cm on a slowly performed indoor walking sequence. Visually, their outdoor results show similar quality to our reconstructions, although we require fewer cameras and can handle crowded scenes. Whether head gear or body-worn cameras impair the user’s performance less depends on the application.
Our objective energy consists of detection, color, smoothness, and pose prior terms. Disabling the smoothness term increases the reconstruction error on the validation sequences by 3 cm. Without the color term, accuracy is reduced by 0.5 cm. We demonstrate in the supplemental video that the influence of the color term is more significant in the outdoor sequences for motions that are very dissimilar to the training set. Disabling the detection term removes the ability to recover from tracking failures, which are usually unavoidable for fully automatic motion capture of long sequences with challenging motions. High-frequency noise is filtered with a Gaussian low-pass filter of window size 5.
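The final filtering step can be sketched as a temporal Gaussian low-pass filter applied to each pose parameter independently. The window size of 5 follows the text; the standard deviation `sigma`, the edge padding, and the function names are assumptions made for this illustration.

```python
import numpy as np

def gaussian_kernel(window=5, sigma=1.0):
    """Normalized 1D Gaussian kernel with `window` taps (window should be odd)."""
    offsets = np.arange(window) - (window - 1) / 2.0
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def smooth_trajectory(traj, window=5, sigma=1.0):
    """Low-pass filter a pose trajectory of shape (frames, dims) along time.

    The sequence is edge-padded so the output has the same length as the input.
    """
    traj = np.asarray(traj, dtype=float)
    kernel = gaussian_kernel(window, sigma)
    pad = window // 2
    padded = np.pad(traj, ((pad, pad), (0, 0)), mode="edge")
    columns = [np.convolve(padded[:, d], kernel, mode="valid")
               for d in range(traj.shape[1])]
    return np.stack(columns, axis=1)

# A jittery one-dimensional trajectory alternating between +1 and -1.
noisy = np.array([[1.0], [-1.0]] * 5)
print(np.abs(smooth_trajectory(noisy)).max())
```

Because the kernel is normalized, constant signals pass through unchanged while frame-to-frame jitter is attenuated.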
We further evaluate our approach in three application scenarios with seven sequences of lengths of up to 1500 frames using EgoRig1, in addition to the three quantitative evaluation sequences. The captured users wear clothes not present in the training set. The qualitative results are best observed in the supplemental video.
We captured a Basketball sequence outdoors, which shows quick motions, large steps on a steep staircase, and close interaction of arms, legs and the basketball (supplemental video). We also recorded an outdoor Walk sequence with frequent arm-leg self-occlusions (Figure 7, right). With EgoCap, users can even motion capture themselves while riding a bike in a larger volume of space (Bike sequence, Figure 7, left and center). The pedaling motion of the legs is nicely captured, despite frequent self-occlusions; the steering motion of the arms and the torso is also reconstructed. Even for very fast absolute motions, like this one on a bike, our egocentric rig with cameras attached to the body produces little motion blur, whereas such blur challenges outside-in optical systems. All this would have been difficult with alternative motion-capture approaches.
Note that our outdoor sequences also show the resilience of our method to different appearance and lighting conditions, as well as the generalization of our detector to a large range of scenes.
We also tested EgoCap with EgoRig1 for motion capture on the Crowded sequence, where many spectators are interacting and occluding the tracked user from the outside (Figure 8). In such a setting, as well as in settings with many obstacles and narrow sections, outside-in motion capture, even with a dense camera system, would be difficult. In contrast, EgoCap captures the skeletal motion of the user in the center with only two head-mounted cameras.
The egocentric camera placement is well-suited for capturing human-object interactions too, such as the juggling performance Juggler (Figure 9). Fast throwing motions as well as occlusions are handled well. The central camera placement ensures that objects that are manipulated by the user are always in view.
We also performed an experiment to show how EgoCap could be used in immersive virtual reality (VR) applications. To this end, we use EgoRig2 attached to an Oculus VR headset and track the motion of a user wearing it. We built a real-time demo application running at up to 15 fps, showing that real-time performance is feasible with additional improvements on currently unoptimized code. In this Live test, we only use color-based tracking of the upper body, without detections, as the detector code is not yet optimized for speed. The Live sequence shows that body motions are tracked well, and that egocentric motion capture is feasible with such an even more lightweight capture rig geared for HMD-based VR. In the supplemental video, we show an additional application sequence ‘VR’, in which the user can look down at their virtual self while sitting down on a virtual sofa. Current HMD-based systems only track the pose of the display; our approach adds motion capture of the wearer’s full body, which enables a much higher level of immersion.
For free roaming, the global rig pose can be tracked independently of external devices using structure-from-motion in the fisheye views. We demonstrate combined local and global pose estimation on the Biking, Walk, and VR sequences, using the structure-from-motion implementation of Moulon et al. provided in the OpenMVG library (see Figure 10 and the accompanying video). Such complete motion capture paves the way for immersive roaming in a fully virtual 3D environment.
We developed the first stereo egocentric motion-capture approach for indoor and outdoor scenes that also works well for very crowded scenes. The combination of generative and detection-based pose estimation makes it fare well even under poses with notable self-occlusions. As with outside-in optical methods, occlusions by objects in the environment, such as a table, may lead to tracking failures. However, the detections enable our tracker to quickly recover from such occlusion failures. Interestingly, the egocentric fisheye camera setup provides stronger perspective cues for motion towards and away from the camera than normal optics. The perspective effect of the same motion increases with proximity to the camera. For instance, bending an arm is a subtle motion when observed from an external camera, but when observed in proximity, the same absolute motion causes large relative motion, manifesting in large displacements and scaling of the object in motion.
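The proximity effect can be made concrete with a small numerical sketch. For simplicity it uses an ideal pinhole projection rather than the fisheye model of the paper, and the focal length of 500 pixels, the distances, and the function names are hypothetical values chosen for illustration.

```python
def project(x, z, focal_px=500.0):
    """Pinhole projection of a lateral offset x (m) at depth z (m), in pixels."""
    return focal_px * x / z

def image_shift(dx, z, focal_px=500.0):
    """Image displacement caused by a lateral motion dx at depth z."""
    return project(dx, z, focal_px) - project(0.0, z, focal_px)

def scale_change(z_before, z_after):
    """Relative change in apparent size when an object moves in depth."""
    return z_before / z_after

# The same 10 cm arm motion, seen from 0.3 m (egocentric) and 3 m (external).
print(image_shift(0.1, 0.3), image_shift(0.1, 3.0))
print(scale_change(0.3, 0.2), scale_change(3.0, 2.9))
```

Under these assumptions, the same lateral motion produces a ten times larger image displacement at 0.3 m than at 3 m, and a 10 cm motion towards the camera changes the apparent size by a factor of 1.5 up close compared to roughly 1.03 from afar.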
The algorithm in this paper focuses on an entirely new way of capturing the full egocentric skeletal body pose, that is decoupled from global pose and rotation relative to the environment. Global pose can be inferred separately by structure-from-motion from the fisheye cameras or is provided by HMD tracking in VR applications. Fisheye cameras keep the whole body in view, but cause distortions reducing the image resolution of distant body parts such as the legs. Therefore, tracking accuracy of the upper body is slightly higher than that of the lower body. Also, while overall tracking accuracy of our research prototype is still lower than with commercial outside-in methods, it shows a new path towards more unconstrained capture in the future. Currently, we have no real-time end-to-end prototype. We are confident that this would be feasible without algorithm redesign, yet felt that real-time performance is not essential to demonstrate the algorithm and its general feasibility.
Our current prototype systems may still be a bit bulky, but much stronger miniaturization becomes feasible in mass production; the design of EgoRig2 shows this possibility. Some camera extension is required for lower-body tracking and might pose a problem with respect to social acceptance and ergonomics for some applications; however, we did not encounter practical issues during our recordings and VR tests, as users naturally keep the area in front of their head clear so as not to impair their vision. Moreover, handling changing illumination is still an open problem for motion capture in general and is not the focus of our work. For dynamic illumination, the color model would need to be extended. However, the CNN performs one-shot estimation and does not suffer from illumination changes. The training data also contains shadowing from the studio illumination, although extreme directional light might still cause inaccuracies. Additionally, loose clothing, such as a skirt, is not part of the training dataset and is hence likely to reduce pose accuracy.
We presented EgoCap, the first approach for marker-less egocentric full-body motion capture with a head-mounted fisheye stereo rig. It is based on a pose optimization approach that jointly employs two components. The first is a new generative pose estimation approach based on a ray-casting image formation model enabling an analytically differentiable alignment energy and visibility model. The second component is a new ConvNet-based body-part detector for fisheye cameras that was trained on the first automatically annotated real-image training dataset of egocentric fisheye body poses. EgoCap’s lightweight on-body capture strategy bears many advantages over other motion-capture methods. It enables motion capture of dense and crowded scenes, and reconstruction of large-scale activities that would not fit into the constrained recording volumes of outside-in motion-capture methods. It requires far less instrumentation than suit-based or exoskeleton-based approaches. EgoCap is particularly suited for HMD-based VR applications; two cameras attached to an HMD enable full-body pose reconstruction of your own virtual body to pave the way for immersive VR experiences and interactions.
We thank all reviewers for their valuable feedback, Dushyant Mehta, James Tompkin, and The Foundry for license support. This research was funded by the ERC Starting Grant project CapReal (335545).
International Journal of Computer Vision 87, 1–2, 75–92.
MoDeep: A deep learning framework using motion features for human pose estimation. In ACCV.
DeepPose: Human pose estimation via deep neural networks. In CVPR.
For measuring the dissimilarity of model color and image pixel color in Equation 5, we use the HSV color space (with all dimensions normalized to unit range) and combine three dissimilarity components:
For saturated colors, the color dissimilarity is computed using the squared (minimum angular) hue distance. Using the hue channel alone gains invariance to illumination changes.
For dark colors, the color dissimilarity is computed as twice the squared value difference. Hue and saturation are ignored as they are unreliable for dark colors.
For gray colors, the distance is computed as the sum of absolute value and saturation difference. Hue is unreliable and thus ignored.
We weight these three dissimilarity components by , and respectively, where normalizes the sum of these weights to unity. The total dissimilarity is computed by where is a smooth step function. We employ a two-sided energy, i.e. can be negative: For dissimilar colors, and approaches for similar colors.
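The three dissimilarity components described in words above can be sketched as follows, assuming HSV channels normalized to the unit range. The function names are hypothetical, any rescaling of the hue distance is an assumption, and the weighting and smooth-step combination are deliberately left out because their exact form is not given in the text.

```python
import colorsys

def hue_dissimilarity(h_model, h_image):
    """Squared minimum angular hue distance (hue is cyclic on [0, 1))."""
    d = abs(h_model - h_image) % 1.0
    d = min(d, 1.0 - d)
    return d * d

def dark_dissimilarity(v_model, v_image):
    """Twice the squared value difference; hue and saturation are ignored."""
    return 2.0 * (v_model - v_image) ** 2

def gray_dissimilarity(s_model, v_model, s_image, v_image):
    """Sum of absolute value and saturation differences; hue is ignored."""
    return abs(v_model - v_image) + abs(s_model - s_image)

# Example: compare a model color and an image pixel given as RGB in [0, 1].
h_m, s_m, v_m = colorsys.rgb_to_hsv(0.9, 0.1, 0.1)
h_i, s_i, v_i = colorsys.rgb_to_hsv(0.8, 0.1, 0.2)
print(hue_dissimilarity(h_m, h_i))
print(dark_dissimilarity(v_m, v_i))
print(gray_dissimilarity(s_m, v_m, s_i, v_i))
```

The modulo and `min` in `hue_dissimilarity` account for hue wrapping around, so that hues of 0.95 and 0.05 are treated as close rather than far apart.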