Continuous advancements in the capabilities of Augmented Reality (AR) headsets promise new forms of entertainment, communication, healthcare, and productivity, and point towards a revolution in how we interact with the world and communicate with each other. Egocentric vision is a key building block for these emerging capabilities, as AR experiences can benefit from an accurate understanding of the user’s perception, attention, and actions. Substantial progress has been made in understanding human-object interaction [poleg2016compact, fathi2011understanding, Damen2018EPICKITCHENS, li2015delving, Li_2018_ECCV, furnari2019rulstm, li2020eye, liu2019forecasting, ego-topo] from egocentric videos. Additional works have investigated social interactions by leveraging egocentric videos to reason about the social signals of the second person [chong2018connecting, ye2015detecting, fathi2012social, soo2015social, yonetani2016recognizing, yagi2018future, Chong_2020_CVPR]. However, these works are largely limited to the analysis of head pose, gaze behavior, and simple gestures. Future intelligent AR headsets should also have the capacity to capture the subtle nuances of second-person body pose, or even to generate plausible interactive 3D avatars grounded in the 3D scene captured from the egocentric point of view. To this end, we introduce the novel task of 4D second-person full body capture from egocentric videos. As shown in our teaser figure, we seek to reconstruct a time series of motion-plausible 3D second-person body meshes that are grounded in the 3D scene captured from the egocentric perspective.
3D human body capture from videos is a key challenge in computer vision, which has received substantial attention over the years [Kanazawa_2019_CVPR, VIBE:CVPR:2020, MuVS:3DV:2017, von2018recovering]. However, none of the previous works has considered the challenging setting of reconstructing the 3D second-person human body from the egocentric perspective. (We note that another branch of prior work addresses the related but quite different task of predicting the 3D body pose of the camera wearer from egocentric video [ng2020you2me, jiang2017seeing, yuan20183d, tome2019xr].) The unique viewpoints and embodied camera motions that arise in egocentric video create formidable technical obstacles to 3D body estimation, causing previous state-of-the-art methods for video-based motion capture to fail. For example, the close interpersonal distances that characterize social interactions result in partial observation of the second person, as body parts move in and out of frame. The drastic camera motion also poses an additional barrier to human kinematic estimation, as the second-person motion is entangled with the embodied movement of the camera wearer.
To address the challenging artifacts of egocentric videos, we propose a novel optimization-based method that jointly considers a time series of 2D observations and 3D scene information. Our key insight is that combining the 2D observations from the entire video sequence provides additional evidence for estimating human body models from frames with only partial observation, and that the 3D scene further constrains the human body pose and motion. Our approach begins with the use of Structure-from-Motion (SfM) to estimate the camera trajectory and to reconstruct the 3D environment. Note that 3D scene and body reconstruction from monocular videos is only determined up to a scale. Therefore, directly projecting the 3D body meshes into the reconstructed 3D scene and enforcing human-scene contact will result in unrealistic human-scene interaction. To overcome this challenge, we carefully design the optimization method so that it not only encourages human-scene contact, but also estimates the scale difference between the 3D human body and the scene reconstruction. We further enforce temporal coherency by combining the time series of body models with a temporal prior to recover more plausible global human motion, even when the second-person body captured by the egocentric view is only partially observable.
To study this challenging problem of reconstructing 4D second-person body pose and shape from egocentric videos, and to validate our proposed approach, we introduce a new egocentric video dataset – EgoMoCap. This dataset captures various human social behaviors in outdoor environments, and serves as an ideal vehicle for studying the problem of second-person human body reconstruction from the egocentric perspective. We conduct detailed ablation studies on this dataset to show the benefits of our method. We further compare our approach with a previous state-of-the-art method for human motion capture from monocular videos, and show that our method can address the challenging cases where the second-person human body is only partially observable. Besides improving body reconstruction accuracy, we also demonstrate that our method resolves the relative scale difference between the 3D scene reconstruction and the 3D human body reconstruction from monocular videos, and thereby produces more realistic human-scene interaction.
In summary, our work has the following contributions:
We introduce a new problem of reconstructing time series of second-person poses and shapes from egocentric videos. To the best of our knowledge, we are also the first to address capturing global human motion grounded on the 3D environment.
We propose a novel optimization-based approach that jointly considers time series of 2D observation and 3D scene context for accurate 4D human body capture. In addition, our approach seeks to address the scale ambiguity of 3D reconstruction from monocular videos.
We present a new egocentric dataset – EgoMoCap – that captures human social interactions in outdoor environments. We conduct detailed experiments on the EgoMoCap dataset and show that our approach reconstructs more accurate 4D second-person human bodies and encourages more realistic human-scene interaction.
2 Related Work
The most relevant works to ours are those investigations on 4D human body reconstruction and human-scene interaction. Our work is also related to recent efforts on reasoning about social interaction from egocentric perspective. Furthermore, we compare our EgoMoCap dataset with other egocentric human interaction datasets.
4D Human Body Reconstruction. A rich set of literature has covered the topic of human body reconstruction. Previous approaches [Bogo:ECCV:2016, SMPL-X:2019, kolotouros2019learning, kanazawa2018end, anguelov2005scape, martinez2017simple, sun2017compositional, romero2017embodied] have demonstrated great success on inferring 3D human pose and shape from a single image. Here, we focus on discussing those works on inferring time series of 3D human body poses and shapes from videos. Alldieck et al. [alldieck2017optical] proposed to use optical flow to estimate temporal coherent human bodies from monocular videos. Tung et al. [tung2017self]
introduced a self-supervised learning method that uses optical flow, silhouettes, and keypoints to estimate SMPL human body parameters from two consecutive video frames. [kanazawa2019learning, pavllo20193d] used fully convolutional networks to predict 3D human poses from 2D image sequences. Kocabas et al. [VIBE:CVPR:2020] proposed an adversarial learning framework to produce realistic and accurate human pose and motion from video sequences. Shimada et al. [PhysCapTOG2020]
used a physics engine to capture physically plausible and temporally stable global 3D human motion. All those deep learning based methods assumed a fixed camera view and a fully observable human body. Those assumptions do not hold in the egocentric setting. Several optimization-based methods [von2018recovering, MuVS:3DV:2017, wang2017outdoor] considered moving-camera scenarios. [von2018recovering] proposed to jointly optimize the camera pose and human body model, yet their method requires additional IMU sensor data. [MuVS:3DV:2017] enforced temporal coherence to reconstruct reasonable body poses from monocular videos with a moving camera. Wang et al. [wang2017outdoor] proposed to utilize multiple cameras for outdoor human motion capture. Those methods only targeted local human kinematic motion, without reasoning about the 3D scene context. In contrast, we seek to estimate global human motion grounded in the 3D scene from only monocular egocentric videos.
Human-Scene Interaction. Several investigations on human-scene interaction seek to reason about environment affordances [ego-topo, gupta20113d, grabner2011makes, delaitre2012scene, wang2017binge, koppula2015anticipating, chen2018subjects, nagarajan2018grounded]. Our work is more relevant to those efforts that use environment cues to better capture the 3D human body. Savva et al. [savva2016pigraphs] proposed to learn a probabilistic model that captures how humans interact with indoor scenes from RGB-D sensors. Li et al. [li2019estimating] cast estimating 3D person-object interactions as an optimal control problem, and used contact constraints to recover human motion and contact forces from monocular videos. Zhang et al. [zhang2020phosa] proposed an optimization-based framework that incorporates a scale loss to jointly reconstruct the 3D spatial arrangement and shape of humans and objects in the scene from a single image. Hassan et al. [PROX:2019] made use of 3D scene context – obtained from a 3D scan – to estimate more accurate human pose and shape from a single image. Zhang et al. [zhang2020generatingnew, zhang2020generating] further studied the problem of generating plausible human bodies grounded in a 3D scene prior. Despite this progress on using scene information to estimate 3D human body model parameters, none of these works considered egocentric camera motion, 3D scene context from monocular videos, and global human motion grounded in the 3D scene in one shot, as our proposed approach does.
Egocentric Social Interaction. Understanding human social interaction has been the subject of many recent efforts in egocentric vision. Several previous works studied human attention during social interaction. Ye et al. [ye2015detecting] proposed to use a pose-dependent appearance model to estimate the eye contact of children. Chong et al. [chong2018connecting] introduced a novel multi-task learning method to predict gaze directions from various kinds of datasets. Park et al. [soo2015social] considered the challenging problem of social saliency prediction. Fathi et al. [fathi2012social] utilized face, attention, and head motion to recognize social interactions. More recently, a few works considered novel vision problems in egocentric social interaction. Yagi et al. [yagi2018future] addressed the task of localizing the future position of a target person in egocentric videos. Yonetani et al. [yonetani2016recognizing] proposed to use features from both the first-person and second-person points of view for recognizing micro-actions and reactions during social interaction. Ng et al. [ng2020you2me] proposed to use the second-person body pose as an additional cue for predicting the egocentric body pose during human interaction. These previous works studied various signals during human social interaction; however, none of them targeted second-person full body capture. Our work seeks to bridge this gap and points to new research directions in egocentric social interaction.
Egocentric Human Interaction Datasets. Several egocentric datasets target the analysis of human social behavior during naturalistic interactions. Fathi et al. [fathi2012social] presented an egocentric dataset for the detection and recognition of fixed categories of conversational interactions within a social group. The NUS Dataset [narayan2014action] and JPL Dataset [ryoo2013first] support more general human interaction classification tasks. Yonetani et al. [yonetani2016recognizing] collected a paired egocentric human interaction dataset to study human actions and reactions. While prior datasets focused on social interaction recognition, Park et al. introduced an RGB-D egocentric dataset – EgoMotion [soo2016egocentric] – for forecasting a walking trajectory based on interaction with the environment. More recently, the You2Me dataset [ng2020you2me] was proposed to study the problem of egocentric body pose prediction. However, none of those datasets were designed to study second-person body pose, which is the focus and contribution of our work. In prior datasets, the majority of captured second-person bodies are either largely occluded by objects or frequently truncated by the camera frustum, which makes using them for full body capture infeasible. In contrast, our EgoMoCap dataset focuses on outdoor social interaction scenarios that have less foreground occlusion of the second-person body.
We denote an input monocular egocentric video as $V = \{I_t\}_{t=1}^{T}$, with frame $I_t$ indexed by time $t$. We estimate the second-person body pose and shape at each time step $t$ from the input $V$. Due to the unique viewpoint of egocentric video, the captured second-person body is only partially observable within a time window. In addition, the second-person body motion is entangled with the camera motion, which creates an additional barrier to enforcing temporal coherency. To address these challenges, we propose a novel optimization method that jointly considers the 2D observations of the entire video sequence and the 3D scene for more accurate 4D human body reconstruction. We illustrate our method in Fig. 1. Specifically, we first recover the 3D human body at each time instant $t$ from the 2D observations of frame $I_t$. We then use Structure from Motion (SfM) to project the sequence of 3D body meshes into the 3D world coordinate frame, and further adopt a contact term to encourage plausible human-scene interaction. In addition, we combine the 2D cues from the entire video sequence to reconstruct a temporally coherent time series of body poses using a human dynamics prior. In the following sections, we introduce each component of our method.
3.1 Human Body Model
To better capture the various signals present during social interaction, we use the differentiable body model SMPL-X [SMPL-X:2019] to jointly model the human body, hands, and facial expression. SMPL-X produces a body mesh of a fixed topology with 10,475 vertices, using a compact set of body configuration parameters. Specifically, the shape parameter $\beta$ represents how individuals vary in height, weight, and body proportions; $\theta$ encodes the 3D body pose, hand pose, and facial expression information; and $\gamma$ denotes the body translation. Formally, the SMPL-X function is defined as $M(\beta, \theta, \gamma)$. It outputs a 3D body mesh $M_b = (V_b, F_b)$, where $V_b$ and $F_b$ denote the body vertices and triangular faces, respectively.
Similar to [SMPL-X:2019, Bogo:ECCV:2016], we cast fitting the SMPL-X model to each video frame as an optimization problem. Formally, we optimize $(\beta, \theta, \gamma)$ by minimizing:
$$E(\beta, \theta, \gamma) = E_J + \lambda_{\beta} E_{\beta} + \lambda_{\theta} E_{\theta},$$
where $K$ denotes the intrinsic camera parameters; the shape prior term $E_{\beta}$ is learned from the SMPL-X body shape training data, and the pose prior term $E_{\theta}$ is learned from the CMU MoCap dataset [cmumocap]; $\lambda_{\beta}$ and $\lambda_{\theta}$ denote the weights of $E_{\beta}$ and $E_{\theta}$; and $E_J$ refers to the energy function that minimizes the weighted robust distance between the 2D projections of the body joints, hand joints, and face landmarks, and the corresponding 2D joint estimates from OpenPose [cao2017realtime, wei2016cpm]. $E_J$ is given by:
$$E_J = \sum_{i} w_i \, \omega_i \, \rho\big(\Pi_K(R_{\theta,\gamma}(J(\beta))_i) - J_{est,i}\big),$$
where $J(\beta)$ returns the 3D joint locations based on the embedded shape parameters $\beta$, and $R_{\theta,\gamma}$ transforms the joints along the kinematic tree according to the pose $\theta$ and body translation $\gamma$; $\Pi_K$ is the 3D-to-2D projection function based on the intrinsic parameters $K$; $J_{est,i}$ refers to the $i$-th 2D joint estimate from OpenPose; $w_i$ is the 2D joint detection confidence score, which accounts for the noise in the 2D joint estimates; $\omega_i$ is the per-joint weight for annealed optimization as in [SMPL-X:2019]; and $\rho$ denotes a robust Geman-McClure error function [geman1987statistical]
that downweights outliers, which is given by:
$$\rho(e) = \frac{e^2}{e^2 + \sigma^2},$$
where $e$ is the residual error, and $\sigma$ is a robustness constant chosen empirically.
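As a concrete sketch, the robust reprojection term above can be written in a few lines of NumPy. The function names (`gmof`, `reprojection_energy`) and the default value of the robustness constant are illustrative, not taken from the authors' implementation:

```python
import numpy as np

def gmof(residual, sigma):
    """Geman-McClure robust error: rho(e) = e^2 / (e^2 + sigma^2).

    Saturates toward 1 for large residuals, so outliers are down-weighted."""
    sq = residual ** 2
    return sq / (sq + sigma ** 2)

def reprojection_energy(joints_2d_proj, joints_2d_det, conf, joint_weights, sigma=100.0):
    """Confidence-weighted robust distance between projected and detected 2D joints.

    joints_2d_proj, joints_2d_det: (N, 2) arrays of projected / detected joints.
    conf:          (N,) 2D detection confidence scores.
    joint_weights: (N,) per-joint weights for annealed optimization."""
    residual = np.linalg.norm(joints_2d_proj - joints_2d_det, axis=-1)
    return np.sum(conf * joint_weights * gmof(residual, sigma))
```

In a full fitting pipeline, this term would be summed with the shape and pose priors and minimized with respect to the SMPL-X parameters.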
3.2 Egocentric Camera Representation
To capture 4D second-person bodies that are grounded in the 3D scene from egocentric videos, we need to take the embodied camera motion into consideration. Here we elaborate on the egocentric camera representation adopted in our method. Formally, we denote $T_{b}^{c}$ as the transformation from the human body coordinate frame to the egocentric camera coordinate frame, and $T_{c}^{w}$ as the transformation from the egocentric camera coordinate frame to the world coordinate frame. Note that $T_{b}^{c}$ is derived from the translation parameter $\gamma$ of the SMPL-X model fitting introduced in the previous section, while $T_{c}^{w}$ is returned by COLMAP Structure from Motion (SfM) [schoenberger2016sfm]. In order to utilize the 3D scene context and enforce temporal coherency on the reconstructed human body meshes, we project the 3D second-person body vertices into the world coordinate frame using the body-to-world transformation, which is given by:
$$V_b^w(t) = T_{c}^{w}(t) \, T_{b}^{c}(t) \, \tilde{V}_b(t),$$
where $\tilde{V}_b(t)$ refers to the body vertices at time step $t$, represented in homogeneous coordinates.
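This amounts to chaining two rigid transforms over homogenized vertices; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def body_to_world(vertices, T_body_to_cam, T_cam_to_world):
    """Map (N, 3) body vertices into world coordinates via the egocentric camera.

    T_body_to_cam comes from the SMPL-X fitting translation, T_cam_to_world
    from SfM; both are (4, 4) homogeneous transforms."""
    n = vertices.shape[0]
    homo = np.hstack([vertices, np.ones((n, 1))])          # (N, 4) homogeneous coords
    world = (T_cam_to_world @ T_body_to_cam @ homo.T).T    # chain the two transforms
    return world[:, :3]                                    # back to (N, 3)
```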
3.3 Optimization with 3D Scene
3D Scene Representation. The 3D scene conveys useful information about human behavior, and therefore plays an important role in 3D human body recovery. As human-scene interaction is often grounded on surfaces, we adopt a mesh representation for the 3D scene. Formally, we denote the 3D scene mesh as $M_s = (V_s, F_s)$, where $V_s$ denotes the vertices of the scene representation, and $F_s$ denotes the corresponding triangular faces. We use the dense environment reconstruction from COLMAP to represent $M_s$.
Human-Scene Contact. Note that the 3D scene reconstructed from a monocular video is only determined up to a scale. To address this scale ambiguity, we design a novel energy function that not only encourages contact between the human body and the 3D scene, but also estimates the scale difference between the 3D scene mesh $M_s$ and the 3D body mesh $M_b$. Specifically, we make use of the annotation from [hassan2019resolving], which provides a candidate set $V_c \subset V_b$ of SMPL-X mesh vertices that are likely to be in contact with the world. We then multiply the human body vertices by an optimizable scale parameter $s$ during optimization. The energy function for enforcing human-scene contact is therefore given by:
$$E_C = \sum_{v \in V_c} \min_{v_s \in V_s} \rho_C\big(\| T_{b}^{w}(s \, v) - v_s \|\big),$$
where $\rho_C$ is the robust Geman-McClure error function introduced in Eq. 3, and $T_{b}^{w}$ is the body-to-world transformation introduced in Eq. 4. Note that the scale factor $s$ is shared across the video sequence. This is because we estimate a consistent 3D shape parameter $\beta$ from the entire sequence by taking the median of all the shape parameters obtained from the per-frame SMPL-X model fitting.
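The contact term thus reduces to a robust nearest-neighbor distance between the scaled, world-transformed contact vertices and the scene vertices. A brute-force sketch follows; names, the default robustness constant, and the exhaustive nearest-neighbor search are illustrative (a real implementation would use a KD-tree or a GPU Chamfer distance):

```python
import numpy as np

def gmof(residual, sigma):
    """Geman-McClure robust error: rho(e) = e^2 / (e^2 + sigma^2)."""
    sq = residual ** 2
    return sq / (sq + sigma ** 2)

def contact_energy(contact_verts, scene_verts, scale, T_body_to_world, sigma=0.05):
    """Robust distance from scaled body contact vertices to nearest scene vertices.

    contact_verts:   (N, 3) candidate SMPL-X contact vertices (body frame).
    scene_verts:     (M, 3) COLMAP scene vertices (world frame).
    scale:           shared scalar resolving the monocular scale ambiguity.
    T_body_to_world: (4, 4) homogeneous body-to-world transform."""
    n = contact_verts.shape[0]
    homo = np.hstack([scale * contact_verts, np.ones((n, 1))])
    world = (T_body_to_world @ homo.T).T[:, :3]
    # distance from each contact vertex to its nearest scene vertex (brute force)
    dists = np.linalg.norm(world[:, None, :] - scene_verts[None, :, :], axis=-1)
    nearest = dists.min(axis=1)
    return gmof(nearest, sigma).sum()
```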
3.3.1 Human Dynamics Prior
Fitting the SMPL-X human body model to each video frame independently incurs notable temporal inconsistency. Due to the drastic camera motion, this problem is further amplified in egocentric scenarios. Here, we propose to use an empirical human dynamics prior to enforce temporal coherency on the human body models in the world coordinate frame. Formally, we have the following energy function:
$$E_S = \sum_{t} \sum_{i} (1 - w_i^t) \, \rho_S\big( J_i^w(t-1) + J_i^w(t+1) - 2 J_i^w(t) \big),$$
where $J_i^w(t)$ is the 3D position of body joint $i$ at time step $t$, transformed into the world coordinate frame as in Eq. 4; $\rho_S$ is another robust Geman-McClure error function that accounts for possible outliers; and $w_i^t$ is the confidence score of the 2D human keypoint estimate. As shown in Eq. 6, we design this energy function to focus on body parts that do not have reliable 2D observations due to the unique egocentric viewpoint. Notably, we assume a zero-acceleration motion prior. We show that this naive prior can effectively capture human motion in outdoor environments.
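The zero-acceleration prior penalizes the discrete second difference of the world-frame joint trajectories, weighted toward joints whose 2D detections are unreliable. A sketch with illustrative names and an assumed robustness constant:

```python
import numpy as np

def gmof(residual, sigma):
    """Geman-McClure robust error: rho(e) = e^2 / (e^2 + sigma^2)."""
    sq = residual ** 2
    return sq / (sq + sigma ** 2)

def dynamics_prior_energy(joints_world, conf, sigma=0.1):
    """Zero-acceleration motion prior on world-frame joints.

    joints_world: (T, J, 3) joint positions in world coordinates.
    conf:         (T, J) 2D detection confidences; low-confidence joints get
                  larger weight, since they lack reliable 2D evidence."""
    # discrete acceleration: J(t-1) + J(t+1) - 2 J(t)
    accel = joints_world[:-2] + joints_world[2:] - 2.0 * joints_world[1:-1]
    accel_mag = np.linalg.norm(accel, axis=-1)   # (T-2, J)
    weight = 1.0 - conf[1:-1]                    # focus on unreliable joints
    return np.sum(weight * gmof(accel_mag, sigma))
```

A constant-velocity trajectory incurs zero energy under this prior, which is exactly the zero-acceleration assumption stated above.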
Putting everything together, we have the following energy function for our optimization method:
$$E = \sum_{t} E_F^t + \lambda_C E_C + \lambda_S E_S,$$
where $E_F^t$ denotes the SMPL-X model fitting energy function for video frame $t$; and $\lambda_C$ and $\lambda_S$ represent the weights of the human-scene contact term and the human dynamics prior term, respectively. We optimize Eq. 7 using the gradient-based optimizer Adam [kingma2014adam] with respect to the SMPL-X body parameters $(\beta, \theta, \gamma)$, the scale parameter $s$, and the camera-to-world transformation $T_c^w$. Note that SfM already provides an initialization of $T_c^w$; making it optimizable can further smooth the global second-person human motion.
Note that $E_F$ performs model fitting at each time step, while $E_C$ and $E_S$ optimize the time series of body models. In addition, since both $E_C$ and $E_S$ seek to optimize the human body parameters in the world coordinate frame, the scale ambiguity would cause the gradients of the contact term to shift the global body position in the wrong direction. Therefore, we carefully design a multi-stage optimization strategy. Specifically, in the first stage we set $\lambda_C$ and $\lambda_S$ to zero, so that the optimizer only considers the 2D observations. We then set $\lambda_C$ to 0.1, keep $\lambda_S$ at zero, and freeze the remaining parameters, so that the optimizer focuses on recovering the scale parameter $s$. In the final stage, we set $\lambda_S$ to 0.1 and enable the gradients of all parameters to enforce temporal coherency. Our method is implemented in PyTorch and will be made publicly available.
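The staging can be summarized as a schedule of loss weights and trainable parameter groups. The sketch below is a toy illustration of the three stages, not the authors' implementation; the weight values follow the text, while the parameter-group names are assumptions:

```python
def stage_schedule(stage):
    """Per-stage loss weights and trainable parameter groups (illustrative).

    Stage 1: 2D observations only.
    Stage 2: recover the scale with the contact term, other parameters frozen.
    Stage 3: all terms active, all parameters optimized."""
    if stage == 1:
        return {"lambda_C": 0.0, "lambda_S": 0.0, "train": {"body"}}
    if stage == 2:
        return {"lambda_C": 0.1, "lambda_S": 0.0, "train": {"scale"}}
    if stage == 3:
        return {"lambda_C": 0.1, "lambda_S": 0.1,
                "train": {"body", "scale", "cam_to_world"}}
    raise ValueError("stage must be 1, 2, or 3")
```

In an Adam-based loop, each stage would set `requires_grad` on the listed parameter groups and weight the contact and dynamics terms accordingly before stepping.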
In this section, we discuss our experiments and results. To begin with, we introduce our dataset and evaluation metrics. We then present detailed ablation studies to validate our model design, and compare our approach with state-of-the-art on 3D body recovery from monocular videos. Finally, we provide a discussion of our method.
4.1 Dataset and Metrics
Datasets. To study the problem of second-person human body reconstruction, we present a new egocentric social interaction dataset – EgoMoCap. The dataset consists of 36 video sequences from 4 participants. Each recording scenario involves two participants interacting in the wild. The camera wearer is equipped with a head-mounted GoPro camera, and the other participant is asked to interact with the camera wearer in a natural manner. The dataset captures 4 types of outdoor human social interactions: Greeting, Touring, Jogging Together, and Throw and Catch.
Evaluation Metrics. For our experiments, we evaluate the human body reconstruction accuracy, motion smoothness, and the plausibility of human-scene interaction.
• Human Body Reconstruction Accuracy: We acknowledge that 3D ground truth of human bodies can be obtained from RGB-D data [PROX:2019] or motion capture systems [SIP, mahmood2019amass]. However, all those systems adopt constrained capture environments and may result in unnatural social interactions. Our work focuses on outdoor social interaction, where 3D human body ground truth is extremely difficult to capture. To evaluate the accuracy of human body reconstruction, we annotate our dataset with 2D human keypoints and evaluate the reconstruction quality using per-joint 2D projection error (PJE) on the image plane, as in [yuan20183d]. We report the PJE on both uniformly sampled frames (PJE-U) and frames where the second-person body is only partially observable (PJE-P). Note that we focus on evaluating human body poses, even though our method is capable of reconstructing 3D hands and faces. This is because the primary goal of this work is to explore how environmental factors affect 4D human body capture, while the 3D scene context has minor influence on facial expression and hand pose in outdoor social interaction.
• Motion Smoothness: We adopt a physics-based metric [yuan20183d] that uses the average magnitude of joint accelerations to measure the smoothness of the estimated pose sequence. A lower value indicates that the time series of body meshes exhibits more consistent human motion. Note that motion smoothness is evaluated on 3D human joints projected into the world coordinate frame. For fair comparison, we normalize the scale factor when reporting the results.
• Plausibility of Human-Scene Interaction: To evaluate whether our method leads to more realistic human-scene interaction, we transform the human body meshes into the 3D world coordinate frame, render the results as video sequences, and upload them to Amazon Mechanical Turk (AMT) for a user study. Specifically, we put the rendered results of all compared methods and our method side by side, and ask the AMT workers to choose the instance with the most realistic human-scene interaction.
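The two automatic metrics can be computed directly from joint arrays; a minimal sketch with illustrative names (the scale normalization mentioned above is omitted):

```python
import numpy as np

def per_joint_projection_error(pred_2d, gt_2d):
    """PJE: mean Euclidean distance between predicted and annotated 2D joints.

    pred_2d, gt_2d: (T, J, 2) arrays over frames and joints."""
    return float(np.mean(np.linalg.norm(pred_2d - gt_2d, axis=-1)))

def motion_smoothness(joints_world):
    """Average magnitude of discrete joint accelerations; lower is smoother.

    joints_world: (T, J, 3) joint positions in world coordinates."""
    accel = joints_world[:-2] + joints_world[2:] - 2.0 * joints_world[1:-1]
    return float(np.mean(np.linalg.norm(accel, axis=-1)))
```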
4.2 Quantitative Results
We now introduce our quantitative experiment results. We first present detailed ablation studies, and then compare our method with state-of-the-art for 3D human body reconstruction from monocular videos.
|Method||PJE-U / PJE-P||Smoothness||User Study|
|$E_F$||22.19 / 73.14||5.33||7.4|
|$E_F + E_C$||30.09 / 87.74||5.72||23.2|
|$E_F + E_S$||23.93 / 75.14||2.23||13.7|
|$E_F + E_C + E_S$ (Ours)||24.03 / 66.03||1.82||55.7|
Ablation Study. Here we analyze the function of each term in Eq. 7. The results are summarized in Table 1. $E_F$ refers to the baseline method that performs per-frame fitting with 2D observations, as in SMPLify-X [SMPL-X:2019]. $E_F$ achieves 22.19 in PJE-U, yet has undesirable performance on motion smoothness and the human-scene interaction user study. In the second row ($E_F + E_C$), we report the method that makes use of both the human-scene contact term and the 2D observations. Though adding the contact term alone leads to more realistic human-scene interaction, it compromises the performance on 2D projection error and motion smoothness by a notable margin. $E_F + E_S$ in the third row refers to the method that optimizes the 2D observations together with the human dynamics prior term $E_S$. Not surprisingly, $E_S$ significantly improves motion smoothness. In the last row, we present the results of our full optimization approach. Our method achieves the best performance on motion smoothness and plausibility of human-scene interaction. An interesting observation is that ours outperforms $E_F + E_S$ by a notable margin on motion smoothness. We speculate that this is because the physical human-scene constraints narrow down the solution space of model fitting, and thereby lead to better temporal coherency. We note that our model performs slightly worse on PJE-U. This is because PJE is a 2D metric, and therefore favors the method that adopts only the 2D projection error as the objective function during optimization. However, when the 2D observations cannot be robustly estimated due to partial observation, our method outperforms the other baselines by a significant margin (66.03 vs. 73.14 in PJE-P). These results support our claim that our method can address the challenge of partially observable human bodies and estimate plausible global human motion grounded in the 3D scene.
|Method||PJE-U / PJE-P||Smoothness||User Study|
|VIBE [VIBE:CVPR:2020]||22.45 / 75.91||4.79||17.2|
|Ours||24.03 / 66.03||1.85||82.8|
Comparison to SOTA Method. In Table 2, we compare our approach with the state-of-the-art method for 3D body recovery from monocular videos – VIBE [VIBE:CVPR:2020]. Since VIBE does not model human-scene constraints, simply projecting its human body meshes into the 3D scene results in unrealistic human-scene interaction. Moreover, the egocentric camera motion causes VIBE to fail to capture temporally coherent human bodies. In contrast, our method outperforms VIBE on motion smoothness and human-scene interaction plausibility by a large margin. Though VIBE performs slightly better on PJE-U (22.45 vs. 24.03), it lags far behind our method on PJE-P (75.91 vs. 66.03). We re-emphasize that the 2D projection error cannot reflect the true performance improvement of our method. This is because 2D keypoint annotations are only available for visible human body parts, and therefore the 2D per-joint projection error does not penalize a method that fits a wrong 3D body model to partial 2D observations. Take the VIBE result shown in the third row of Fig. 3, for instance: the 2D projection error may appear decent, even though the reconstructed 3D human body is completely wrong.
4.3 Qualitative Results
We now present the qualitative results of our method. As shown in Fig. 2, we visualize the results of both the baseline and our method in the world coordinate frame. By examining the SMPLify-X baseline results, we can observe an obvious scale mismatch between the 3D reconstructions of the human body and the environment, which results in unrealistic human-scene interaction. In contrast, our method produces more plausible human body motion grounded in the 3D scene by resolving the scale ambiguity of 3D reconstruction from monocular videos. In Fig. 3, we visualize our results on the 2D image plane. Specifically, we choose instances where the second-person human body is only partially observable. Notably, both SMPLify-X and VIBE fail substantially in those challenging cases. Our method, on the other hand, makes use of the 2D cues from the entire video sequence and the 3D scene to reconstruct a temporally coherent time series of body poses, and therefore can successfully reconstruct the human body even when it is only partially observable. In the supplementary materials, we provide additional video demos to demonstrate the benefits of our approach.
4.4 Remarks and Discussion
The previous sections have demonstrated, via detailed experimental evaluation and comparisons, that our method can capture more accurate second-person human bodies, and produce more realistic human-scene interaction, compared to prior works. However, our method also has certain limitations. A key issue is the need to retrieve the camera trajectory and 3D scene only from monocular RGB videos via Structure from Motion (SfM). Therefore, our method has the same bottleneck as SfM: Challenging factors such as dynamic scenes, featureless surfaces, changing illumination, etc., may cause visual feature matching to fail. We note that the camera and environment information can be more robustly estimated using additional sensors (Lidar, Depth Camera, Matterport etc.). Incorporating those sensors into the egocentric capture setting is a very interesting and promising future direction. In addition, our naive human motion prior (zero acceleration), may result in unrealistic motions in some cases. More effort in learning motion priors could potentially address this issue. We believe our efforts constitute an important step forward for a largely unexplored egocentric vision task, and we hope our work can inspire the community to make further investments.
In this work, we introduce a novel task: reconstructing a time series of second-person 3D human body meshes grounded in the 3D scene from monocular egocentric videos. We propose a novel optimization-based method that addresses the challenges of egocentric capture by exploiting the 2D observations of the entire video sequence together with the 3D scene information. In addition, we introduce a new egocentric video dataset – EgoMoCap – and provide extensive quantitative and qualitative analysis to demonstrate that our method can effectively reconstruct partially observable second-person human bodies and produce more realistic human-scene interaction.