Daily life activities take place in a 3D space, and humans effortlessly leverage spatial and semantic cues, such as the 3D surface shape and semantics of familiar objects and environments, in performing daily tasks. For example, suppose the 3D environment rendered in a top-down view in Fig. 1 (a) were your apartment, where you have a rough idea of the 3D layout of the furniture and appliances. Based on such knowledge, and even without a single image, you can easily imagine where your routine actions might happen in the 3D space, such as drawing a picture while sitting on the sofa. Can we design vision models to exploit the prior knowledge of a known 3D environment for recognizing and localizing our actions and activities?
A significant amount of prior work in computer vision has addressed the interaction between human activities and the 3D environment using fixed cameras [10, 27, 34, 52, 17]. Recently, first-person vision has emerged to offer a new perspective on activity understanding. Egocentric video captures a mobile user's actions in the context of the environment as the camera wearer explores the space. Yet most existing work on egocentric activity recognition does not address scene context [25, 69, 37, 33, 44, 39, 46, 31]. More recently, contextual cues such as a 2D ground plane or a topological map have been considered for understanding environmental functionalities, such as the common locations at which activities occur, from egocentric videos.
In contrast to these prior efforts, this paper introduces a new research topic on the joint recognition and 3D localization of egocentric activities using the prior context provided by an annotated 3D environment map (see Fig. 1 (b)). We use a Hierarchical Volumetric Representation (HVR) to describe the semantic and geometric information of the 3D environment map (see Fig. 2 (a) and Sec 3.1 for explanation). To illustrate our approach, consider the task of determining, based on an egocentric video, whether a person is placing a book on a shelf or picking up a book from the floor. In both cases, the subject is holding the book and moving their body, and differentiating the activities based on video alone could be challenging. However, the ability to localize the head-worn camera in the 3D scene relative to the 3D voxels for the bookshelf and floor is a potentially powerful cue for disambiguation. Picking up the book requires the subject's head to approach the horizontal surface of the floor, while placing the book requires the upright head to approach the vertical bookcase. Methods that use ground plane representations of the environment or merely detect the presence of objects as context may lack the specificity provided by 3D proximity. The ability to localize the camera wearer in the 3D scene and leverage spatial and semantic context provides a means to infer cues which otherwise would not be visible within the egocentric video.
Three major challenges exist for recognizing and localizing actions in 3D using egocentric videos. First, standard architectures for activity recognition from egocentric video are not designed to incorporate a 3D volumetric environment representation, requiring the design of a novel recognition architecture. Second, accurate 3D locations of activities are difficult to obtain: camera registration sometimes provides erroneous estimates of camera pose, and thus of action locations, due to rapid camera motion and varying lighting conditions. Thus, our solution must address potential errors in action locations during training and testing.
Third, no existing egocentric activity datasets contain both 3D environment models and corresponding videos of egocentric activities.
To address the first challenge, we present a novel deep model that takes as input an egocentric video and our Hierarchical Volumetric Representation (HVR) of the 3D environment, and outputs the 3D action location as well as the action category. Our model consists of two branches. The environment branch makes use of a 3D convolutional network to extract environmental features from the HVR. Similarly, the video branch uses a 3D convolutional network to extract visual features from the input egocentric video. The environmental and visual features are further combined to estimate the 3D action location, supervised by the results of camera registration. The second challenge of noisy localization is addressed by using stochastic units to account for potential error. The predicted 3D action location, in the form of a probability distribution, is then interpreted as a 3D attention map to select local environmental features relevant to the current video. Finally, these features are further fused with video features to recognize the actions.
In the experiments section, we show that our model can not only recognize and localize actions in a known 3D environment, but can also generalize to unseen environments not present in the training set, provided their 3D maps and object labels are known. These results are evaluated on a newly-collected egocentric video dataset with photo-realistic 3D environment reconstructions and 3D object annotations, recorded in multiple naturalistic household environments. We demonstrate strong results on action recognition and 3D action localization for both seen and unseen environments. When evaluated on seen environments, our model outperforms a strong baseline of 2D video-based action recognition methods by 4.2% in mean class accuracy, and beats baselines on 3D action localization by 9.3% in F1 score. We believe our method provides a solid step forward in understanding actions and activities in the context of their 3D environments.
2 Related Work
Our work is related to several ongoing topics in computer vision. We first discuss the most relevant works on egocentric vision. We then review several previous efforts on human-scene interaction and 3D scene representation.
Egocentric Vision. There is a rich body of literature aiming at understanding human activity from an egocentric perspective. Prior works have made great progress in recognizing and anticipating egocentric actions based on 2D videos [13, 55, 26, 69, 37, 33, 44, 39, 46], and in predicting gaze and locomotion [31, 30, 20, 58, 42, 66, 43, 45, 54]. Far fewer works have considered environmental factors and the spatial grounding of egocentric activity. Guan et al. and Rhinehart et al. jointly considered trajectory forecasting and egocentric activity anticipation with online inverse reinforcement learning. The most relevant works to ours are recent efforts on learning affordances for egocentric action understanding [41, 50]. Nagarajan et al. introduced a topological map environment representation for long-term activity forecasting and affordance prediction. Rhinehart et al. considered the novel problem of learning "Action Maps" from egocentric videos. In contrast to the prior use of topological maps and 2D ground planes, our focus is on exploiting the geometric and semantic information in the HVR map to address our novel task of joint egocentric action recognition and 3D localization.
Human-Scene Interaction. Human-scene constraints have been proven effective in reasoning about human body pose [68, 18, 67]. The most relevant prior works focus on understanding environment affordance. Grabner et al. predict object functionality by hallucinating an actor interacting with the scene. A similar idea was also explored in [23, 24]. Savva et al. predicted action heat maps that highlight the likelihood of an action in the scene by partitioning scanned 3D scenes into disjoint sets of segments and learning a segment dictionary. Gupta et al. presented a human-centric scene representation for predicting the afforded human body poses. Delaitre et al. [7, 11] introduced a statistical descriptor of person-object interactions for object recognition and human body pose prediction. Fang et al. proposed to learn object affordances from demonstrative videos. Nagarajan et al. proposed to use backward attention to approximate the interaction hotspots of future actions. These previous efforts were limited to the analysis of environment functionality [23, 24, 7, 11], constrained human action and body pose, or hand-object interaction on the 2D image plane [9, 40]. In contrast, we are the first to utilize the rich geometric and semantic information of the 3D environment for naturalistic human activity recognition and 3D localization.
3D Scene Representation. Many recent works have explored various forms of 3D representation for 3D vision tasks, including 3D object detection [65, 70, 4] and embodied visual navigation [14, 19, 29]. Deep models have been developed for point clouds [47, 63, 62, 48] with great success in object recognition, semantic segmentation, and scene flow estimation. However, using point clouds to describe a large-scale 3D scene results in high computational and memory cost. To address this challenge, many approaches rasterize point clouds into a 3D voxel grid, with each voxel represented by either handcrafted features [60, 8, 56, 57] or learning-based features. Signed-distance values and the Chamfer distance between the 3D scene and the 3D human body have also been used to enforce more plausible human-scene contact [68, 18, 67, 36]. Building on this prior work, we utilize a 3D Hierarchical Volumetric Representation (HVR) that encodes the local geometric and semantic context of a 3D environment for egocentric action understanding.
3 Method

We consider a trimmed input egocentric video with frames indexed by time. In addition, we assume a global 3D environment prior, associated with each input video, is available at both training and inference time. This prior is environment specific, e.g., the 3D map of an apartment. Our goal is to jointly predict the action category of the video and its action location on the 3D map. The action location is parameterized as a 3D saliency map, whose value at each voxel represents the likelihood of the action clip happening at that spatial location. (For tractability, we associate the entire activity with a single 3D location and do not model the change in location over the course of an activity. This is a valid assumption for the activities we address, such as sitting down, playing the keyboard, etc.) The action location thereby defines a proper probability distribution in 3D space.
In this section, we first introduce our proposed joint model of action recognition and 3D localization, which leverages a 3D representation of the environment. We then describe the key components of our model, our training and inference scheme, and our network architecture.
3.1 Joint Modeling with a 3D Environment Prior
3D Environment Representation. We seek to design a representation that not only encodes the 3D geometric and semantic information of the 3D environment, but is also effective for 3D action localization and recognition.
To this end, we introduce a Hierarchical Volumetric Representation (HVR) of the 3D environment. We provide an illustration of our method in Fig. 2(a). We assume the 3D environment reconstruction with object labels is given in advance as a 3D mesh (see Sec. 4 for details). We first divide the 3D mesh into parent voxels, which define all possible action locations. We then divide each parent voxel into multiple child voxels at a fixed resolution and further assign an object label to each child voxel based on the object annotation. Specifically, the object label of each child voxel is determined by the majority vote of the vertices that lie inside that child voxel. Note that we only consider static objects and treat empty space as a specially-designated "object" category. Therefore, the child voxels compose a semantic occupancy map that encodes both the 3D geometry and semantic meaning of the parent voxel.
We further vectorize the semantic occupancy map and use the resulting vector as a feature descriptor of the parent voxel. The 3D environment representation can then be expressed as a 4D tensor: three dimensions index the parent voxels, and the fourth holds the vectorized semantic occupancy map. Note that a higher child-voxel resolution better approximates the 3D shape of the environment. Our proposed representation is thus a flexible environment representation that jointly considers the candidate 3D action locations and the geometric and semantic information of the 3D environment.
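As a concrete illustration, the construction above can be sketched in numpy as follows. The grid resolutions, scene bounds, and tie-breaking of the majority vote here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def build_hvr(verts, labels, bounds, parent_res=(4, 4, 2), child_res=2):
    """Hierarchical Volumetric Representation sketch.
    verts: (N, 3) mesh vertex positions; labels: (N,) object ids, 0 = empty space.
    Returns a 4D tensor (Px, Py, Pz, child_res**3) of per-child majority labels."""
    lo = np.asarray(bounds[0], dtype=float)
    hi = np.asarray(bounds[1], dtype=float)
    grid = np.asarray(parent_res) * child_res          # child-voxel grid per axis
    # map each vertex to the child voxel that contains it
    idx = np.floor((verts - lo) / (hi - lo) * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)
    # majority vote of vertex labels inside each child voxel
    votes = {}
    for key, lab in zip(map(tuple, idx), labels):
        votes.setdefault(key, []).append(int(lab))
    child = np.zeros(grid, dtype=int)                  # empty space keeps label 0
    for key, labs in votes.items():
        child[key] = np.bincount(labs).argmax()
    # regroup child voxels under their parent voxel and flatten the occupancy map
    px, py, pz = parent_res
    c = child_res
    hvr = child.reshape(px, c, py, c, pz, c).transpose(0, 2, 4, 1, 3, 5)
    return hvr.reshape(px, py, pz, c ** 3)
```

Each parent-voxel descriptor is simply its flattened grid of child-voxel labels, so 3D geometry (occupancy) and semantics (object identity) are encoded jointly.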
Joint Learning of Action Category and Action Location. We present an overview of our model in Fig. 2(b). Specifically, we adopt a two-pathway network architecture. The video pathway extracts video features with an I3D backbone network, while the environment pathway extracts global 3D environment features with a 3D convolutional network. Visual and environmental features are jointly considered for predicting the 3D action location. We then adopt stochastic units to generate a sampled action location for selecting the local environment features relevant to the actions. These local environment features are further fused with the video features for action recognition.
Our key idea is thus to make use of the 3D environment representation for jointly learning the action label and the 3D action location of the input video clip. We consider the action location as a probabilistic variable, and model the action label given the input video and environment representation using a latent variable model. Denoting the video as V, the environment representation as E, the action label as y, and the latent 3D action location as L, the conditional probability is given by

p(y | V, E) = Σ_L p(y | L, V, E) p(L | V, E).
Notably, our proposed joint model has two key components. First, the localization term p(L | V, E) models the 3D action location L from the video input V and the 3D environment representation E. Second, the recognition term p(y | L, V, E) utilizes L to select a region of interest (ROI) from the environment representation E, and combines the selected environment features with the video features from V for action classification. During training, our model receives the ground-truth 3D action location and action label as supervisory signals. At inference time, our model jointly predicts both the 3D action location and the action label. We now provide additional technical details on modeling p(L | V, E) and p(y | L, V, E).
3.2 3D Action Localization
We first introduce our 3D action localization module, defined by the conditional probability p(L | V, E). Given the video pathway features f_v and the environment pathway features f_e, we learn a mapping function to predict the location L, which is defined on a 3D grid of candidate action locations. (The 3D grid is defined globally over the 3D environment scan.) The mapping function is composed of 3D convolution operations and a softmax function, so p(L | V, E) is given by

p(L | V, E) = softmax(Conv3D([f_v ; f_e])),

where [· ; ·] denotes concatenation along the channel dimension. The resulting action location is therefore a proper probability distribution normalized over 3D space, with each entry reflecting the probability of the video clip happening at the corresponding spatial location of the 3D environment.
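The localization head can be sketched as follows, with a per-voxel linear map standing in for the learned 3D convolutions (the weights are random placeholders, purely illustrative):

```python
import numpy as np

def action_location(video_feat, env_feat, rng=None):
    """Score each parent voxel from channel-concatenated features, then apply
    a softmax over the whole 3D grid to obtain a proper distribution.
    video_feat: (C1, X, Y, Z); env_feat: (C2, X, Y, Z)."""
    if rng is None:
        rng = np.random.default_rng(0)
    f = np.concatenate([video_feat, env_feat], axis=0)   # (C1+C2, X, Y, Z)
    w = rng.normal(size=f.shape[0])                      # placeholder 1x1x1 conv weights
    logits = np.einsum('c,cxyz->xyz', w, f)              # per-voxel linear score
    e = np.exp(logits - logits.max())                    # numerically stable softmax
    return e / e.sum()
```

In the full model these scores come from learned 3D convolutions over both pathways; here only the grid-wide softmax normalization is faithful to the text.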
In practice, we do not have access to the precise ground-truth 3D action location and must rely on camera registration results as a proxy. Using a categorical distribution for L thus models the ambiguity of 2D-to-3D registration. We follow [32, 35] in adopting stochastic units in our model. Specifically, we use the Gumbel-Softmax reparameterization trick [22, 38] for differentiable sampling:

L̃ = softmax((log p(L | V, E) + G) / τ),

where G is drawn from a Gumbel distribution, enabling sampling from a discrete distribution. This Gumbel-Softmax trick produces a "soft" sample L̃ that allows gradients to propagate to the video pathway network and the environment pathway network. τ is the temperature parameter that controls the shape of the soft sample distribution and is fixed in our model.
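The sampling step can be sketched in numpy as below; the temperature value is illustrative, since the paper's setting was lost in extraction:

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Draw a differentiable 'soft' sample from a categorical distribution
    (e.g., over voxel locations) via the Gumbel-Softmax relaxation."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max()                  # numerically stable softmax
    e = np.exp(y)
    return e / e.sum()
```

Lower temperatures push the soft sample toward a one-hot vector; higher temperatures keep it close to the underlying distribution.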
3.3 Action Recognition with Environment Prior
Our model further captures p(y | L, V, E) with a mapping function that jointly considers the action location L, the video input V, and the 3D environment representation E for action recognition. Formally, the conditional probability can be modeled as

p(y | L, V, E) = softmax(W [f_v ; AvgPool(L̃ ⊙ f_e)]),

where [· ; ·] denotes concatenation along the channel dimension, ⊙ denotes element-wise multiplication, and W is a linear mapping. Specifically, our method uses the sampled action location L̃ for selectively aggregating the environment features f_e, and combines the aggregated environment features with the video features f_v for action recognition. AvgPool denotes the average pooling operation that maps the 3D feature to a 2D feature.
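The attention-weighted selection of environment features can be sketched as:

```python
import numpy as np

def select_env_features(env_feat, loc):
    """Use the (sampled) 3D action location as an attention map over the
    environment features, then pool to a single context vector.
    env_feat: (C, X, Y, Z); loc: (X, Y, Z), non-negative and summing to 1."""
    attended = env_feat * loc[None]        # element-wise re-weighting per voxel
    return attended.sum(axis=(1, 2, 3))    # (C,) pooled local context
```

The pooled context vector can then be concatenated with the video features before the final linear classifier; the sum-pooling over a normalized attention map shown here plays the role of the weighted average pooling in the text.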
3.4 Training and Inference
We now present our training and inference scheme. At training time, we assume a prior distribution p₀(L) of the action location is given as a supervisory signal; it is obtained by registering the egocentric camera into the 3D environment (see Sec. 4 for details). Note that we treat the action location L as a latent variable; based on the Evidence Lower Bound (ELBO), the resulting deep latent variable model has the following loss function:

ℒ = −log p(y | L̃, V, E) + D_KL(p(L | V, E) ∥ p₀(L)),
where the first term is the cross-entropy loss for action classification and the second term is the KL divergence that matches the predicted 3D action location distribution to the prior distribution. During training, we draw a single 3D action location sample for each input within the mini-batch.
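A minimal sketch of this loss on a single example follows. The KL direction shown is one plausible reading (the paper's exact equation was lost in extraction), and any weighting factor between the two terms is omitted:

```python
import numpy as np

def training_loss(action_probs, y, loc_pred, loc_prior, eps=1e-8):
    """Cross-entropy on the action label plus a KL term pulling the predicted
    3D location distribution toward the camera-registration prior.
    action_probs: predicted class distribution; y: true class index;
    loc_pred, loc_prior: flattened 3D location distributions."""
    ce = -np.log(action_probs[y] + eps)
    kl = np.sum(loc_pred * (np.log(loc_pred + eps) - np.log(loc_prior + eps)))
    return ce + kl
```

When the predicted location distribution equals the prior, the KL term vanishes and only the classification loss remains.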
Theoretically, at inference time our model should draw multiple samples from the same input and average the predictions. To avoid such dense sampling, we choose to directly plug the deterministic (expected) action location into Eq. 4. Note that the recognition function g is composed of a linear mapping function and a softmax function, and is therefore convex. By Jensen's inequality, we have

g(E[L]) ≤ E[g(L)] = p(y | V, E).

That is, plugging in the expected action location provides an empirical lower bound of p(y | V, E), and therefore a valid approximation of dense sampling.
3.5 Network Architecture
For the video pathway, we adopt an I3D-Res50 backbone pre-trained on Kinetics. For the environment pathway, we make use of a lightweight network (denoted as EnvNet), which has four 3D convolutional operations. The video features from the 3rd convolutional block of I3D-Res50 and the environment features after the 2nd 3D convolutional operation in EnvNet are concatenated for 3D action location prediction. We then use 3D max pooling operations to match the size of the action location map to the size of the feature map of the 4th convolution of EnvNet for the weighted pooling in Eq. 2.
4 Dataset and Annotation
Dataset. We utilize a newly-developed egocentric dataset, in which camera wearers are asked to conduct daily activities in living rooms that have photo-realistic 3D reconstructions. Note that existing egocentric video datasets (e.g., EGTEA and EPIC-Kitchens) did not explicitly capture the 3D environment, and Structure-from-Motion fails substantially on those datasets due to the drastic head motion. This is the first activity recognition dataset to include both egocentric videos and high-quality 3D environment reconstructions.
The dataset contains 60 hours of video from 105 different video sequences captured by Vuzix Blade Smart Glasses at a resolution of 1920×1080 and 24 Hz. It captures 34 different indoor activities in 3 real-world living rooms, resulting in a large set of trimmed action clips. Similar to prior work, we consider both seen and unseen environment splits. In the seen environment split, each environment appears in both the training and testing sets. In the unseen split, all sequences from the same environment are placed entirely in either training or testing.
3D Environment Reconstruction & Object Annotations. We use a state-of-the-art dense reconstruction system to obtain the photo-realistic 3D reconstruction of the environment. Specifically, the environment is scanned with a customized capture rig, which contains a high-resolution RGB camera, an IR projector and IR sensor, two fisheye monochrome cameras, an IMU, and AprilTag markers. To reconstruct the 3D model, the camera trajectory is first tracked using the monochrome images and IMU signals. The dense depth measurements are then fused into a volumetric Truncated Signed Distance Field (TSDF) representation, from which meshes are extracted with the Marching Cubes algorithm, followed by texturing with the HDR RGB images. Once the 3D model is reconstructed, we further annotate the mesh by painting an object instance label over the mesh polygons. We refer to the original system description for more details. In this work, 35 object categories plus a background class label are used in annotation. Note that the object annotations could be automated with state-of-the-art 3D object detection algorithms.
Prior Distribution of 3D Action Location. To obtain the ground truth of the activity location for each trimmed activity video clip, we first register the egocentric camera in the 3D environment using a RANSAC-based feature matching method. Specifically, we build a base map from the monochrome camera streams used for 3D environment reconstruction via Structure from Motion [12, 53]. The pre-built base map is a dense point cloud associated with 3D feature points. We then estimate the camera pose of each video frame using active search. Note that registering 2D egocentric video frames in a 3D environment is fundamentally challenging due to drastic head rotation, featureless surfaces, and changing illumination. Therefore, we only consider key-frame camera registration, where enough inliers were matched with RANSAC. As introduced in Sec. 3, the action location is defined as a probability distribution in 3D space. Thus, we map the key-frame camera location into an index of the 3D action location tensor, with its value representing the likelihood of the given action happening in the corresponding parent voxel. To account for the uncertainty of 2D-to-3D camera registration, we further apply a Gaussian distribution around the registered location to generate the final 3D action location ground truth.
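The last step can be sketched as turning one registered key-frame camera position into a normalized Gaussian heatmap over the parent-voxel grid (grid size and sigma are illustrative):

```python
import numpy as np

def location_prior(center, grid=(8, 8, 4), sigma=1.0):
    """Ground-truth action-location prior sketch: a 3D Gaussian centered at
    the registered camera position (given as a voxel index), normalized so
    the heatmap is a proper probability distribution."""
    axes = [np.arange(n) for n in grid]
    X, Y, Z = np.meshgrid(*axes, indexing='ij')
    d2 = (X - center[0])**2 + (Y - center[1])**2 + (Z - center[2])**2
    h = np.exp(-d2 / (2 * sigma**2))
    return h / h.sum()
```

The spread sigma controls how much registration uncertainty the prior admits around the estimated camera position.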
5 Experiments and Results
In this section, we first describe the experimental setting and present our main results on action recognition and 3D action localization, followed by detailed ablation studies to verify our model design. We further show how our model generalizes to a novel environment, and discuss our results.
Evaluation Protocol: For all experiments, we evaluate the performance of both action recognition and 3D action localization, following the protocols below.
• 3D Action Localization. We consider 3D action localization as binary classification over the 3D grid. Therefore, we report Precision, Recall, and F1 score on a downsampled 3D heatmap (downsampled by a factor of 4 along the X and Y directions and 2 along the Z direction).
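This metric can be sketched as binarizing the predicted and ground-truth heatmaps on the downsampled grid (the threshold is illustrative):

```python
import numpy as np

def localization_prf(pred, gt, thresh=0.5):
    """Precision/Recall/F1 for 3D action localization, treating each grid
    cell as a binary classification of whether the action occurs there."""
    p = pred >= thresh
    g = gt >= thresh
    tp = np.logical_and(p, g).sum()          # cells predicted and labeled positive
    prec = tp / max(p.sum(), 1)
    rec = tp / max(g.sum(), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-8)
    return prec, rec, f1
```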
[Table 1. Per-method comparison on the seen environment split: Mean Cls Acc and Top-1 Acc for action recognition; Prec, Recall, and F1 for 3D action localization.]
5.1 Action Understanding in Seen Environments
Our method is the first to utilize 3D environment information for egocentric action recognition and 3D localization. Previous works have considered various environment contexts for other tasks, including 3D object detection, affordance prediction, etc. Therefore, we adapt previously proposed contextual cues into our proposed joint model and design the following strong baselines:
• I3D-Res50 refers to the I3D backbone network pre-trained on Kinetics. We also use the network features from I3D-Res50 for 3D action localization by adopting the KL loss.
• I3D+Obj uses object detection results from a pre-trained object detector as contextual cues. This representation is essentially an object-centric feature that describes the attended environment (where the camera wearer is facing); therefore, the 3D action location cannot be used for selecting surrounding environment features.
• I3D+2DGround projects the object information from the 3D environment onto the 2D ground plane. A similar representation has also been considered in prior work. Note that the predicted 3D action location is also projected onto the 2D ground plane to select local environment features.
• I3D+SemVoxel uses the semantic probability distribution of all the vertices within each voxel as a feature descriptor. The resulting environment representation is a 4D tensor whose first three dimensions index the spatial grid and whose last dimension equals the number of object labels from the 3D environment mesh annotation introduced in Sec. 4.
• I3D+Affordance uses the afforded action distribution as the feature descriptor for each voxel. The resulting representation is a 4D tensor whose last dimension equals the number of action classes. The afforded action distribution is derived from the training set.
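The I3D+SemVoxel descriptor above can be sketched as a per-voxel label histogram normalized into a distribution (grid size and class count are illustrative):

```python
import numpy as np

def sem_voxel(verts_idx, labels, grid=(8, 8, 4), n_classes=36):
    """Per-voxel semantic distribution: a normalized histogram of the object
    labels of the vertices falling inside each voxel.
    verts_idx: (N, 3) integer voxel indices; labels: (N,) object ids."""
    feat = np.zeros(tuple(grid) + (n_classes,))
    for (x, y, z), lab in zip(verts_idx, labels):
        feat[x, y, z, lab] += 1                     # count labels per voxel
    norm = feat.sum(axis=-1, keepdims=True)
    return feat / np.maximum(norm, 1)               # empty voxels stay all-zero
```

Unlike the HVR, this descriptor discards the within-voxel spatial layout, which is one way to read the ablation results that follow.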
Results. Our results on the seen environment split are listed in Table 1. Our method outperforms the I3D-Res50 baseline by a large margin on both Mean Cls Acc and Top-1 Acc for action recognition. We attribute this significant performance gain to explicitly modeling the 3D environment context. As for 3D action localization, our method outperforms I3D-Res50 with a relative improvement of 69%. Notably, predicting the 3D action location from the video sequence alone is error-prone. Our method, on the other hand, explicitly models the 3D environment and thus improves the performance of 3D action localization. In subsequent sections, we show that the performance improvement does not simply come from the additional input modality of the 3D environment, but is attributable to the careful design of the 3D representation and the probabilistic joint modeling.
Comparison of environment representations. We now compare HVR with other forms of environment representation. As shown in Table 1, I3D+Obj yields only a minor improvement in overall performance, while I3D+2DGround, I3D+SemVoxel, and I3D+Affordance improve both action recognition and 3D localization by a notable margin. These results suggest that environment context (even in 2D space) plays an important role in egocentric action understanding. More importantly, our method outperforms all of these baselines on both action recognition and 3D action localization. These results suggest that our proposed HVR is superior to a 2D ground plane representation, and demonstrate that using the semantic occupancy map as the environment descriptor better facilitates egocentric action understanding.
[Table 2. Ablation on environment representation: Mean Cls Acc and Top-1 Acc for action recognition; Prec, Recall, and F1 for 3D action localization.]
5.2 Ablation Studies
We now present detailed ablation studies of our method on the seen split. To begin with, we analyze the role of semantic and geometric information in our hierarchical volumetric representation. We then present an experiment to verify whether fine-grained environment context is necessary for egocentric action understanding. Furthermore, we show the benefits of jointly modeling the action and its 3D location.
Semantic Meaning and 3D Geometry. The semantic occupancy map carries both geometric and semantic information about the local environment. To show how each component contributes to the performance boost, we compare Ours with I3D+SemVoxel, which considers only semantic information, in Table 2. Ours outperforms I3D+SemVoxel by a notable margin for action recognition and by a large margin for 3D localization. These results suggest that the semantic occupancy map is more expressive than semantic information alone for action understanding, yet it has a smaller impact on action recognition than on 3D action localization.
Granularity of 3D Information. We further study what level of 3D environment granularity is needed for egocentric action understanding. By the definition of the occupancy map, increasing the resolution of the child voxels better approximates the actual 3D shape of the environment. We therefore report results of our method with different occupancy map resolutions in Table 2. Not surprisingly, a low occupancy map resolution lags behind Ours on both action recognition and 3D action localization, which again shows the necessity of incorporating 3D geometric cues. Another interesting observation is that a higher resolution slightly increases 3D action localization accuracy, yet decreases action recognition performance. These results suggest that a fine-grained 3D shape of the environment is not necessary for action recognition. In fact, a higher resolution dramatically increases the feature dimension of the environment representation, and thereby makes the network harder to train.
[Table 3. Ablation on joint and probabilistic modeling: Mean Cls Acc and Top-1 Acc for action recognition; Prec, Recall, and F1 for 3D action localization.]
Joint Learning of Action Label and 3D Location. We denote as I3D+GlobalEnv a baseline model that directly fuses global environment features, extracted by the same 3D convolutional network adopted in our method, with video features for action grounding. The results are presented in Table 3. Interestingly, I3D+GlobalEnv decreases the action recognition performance of the I3D-Res50 backbone network and brings only a marginal improvement for 3D action localization. We speculate that this is because only 3 distinct environment representations are available for training, which may lead to overfitting. In contrast, our method makes use of the learned 3D action location to select the environment features associated with the action. As the action location varies among input videos, our method can utilize the 3D environment context without running into the pitfall of overfitting, and therefore outperforms I3D+GlobalEnv on both action recognition and 3D action localization.
Probabilistic Modeling of 3D Action Location. As introduced in Sec. 4, considerable uncertainty lies in the prior distribution of the 3D action location, due to the challenges of 2D-to-3D camera registration. To verify that probabilistic modeling can account for this uncertainty in the ground truth, we compare our method with a deterministic version of our model, denoted DetEnv. DetEnv adopts the same inputs and network architecture as our method, except for the differentiable sampling with the Gumbel-Softmax trick. As shown in Table 3, Ours outperforms DetEnv on both action recognition and 3D action localization. These results demonstrate the benefits of the stochastic units adopted in our method.
Remarks. To summarize, our key finding is that both 3D geometric and semantic context convey important information for action recognition and 3D localization. Another important take-home message is that egocentric action understanding requires only a sparse encoding of geometric information. Moreover, without a careful model design, the 3D environment representation brings only minor improvement to (or even decreases) the overall performance, as reported in Table 3.
5.3 Generalization to Novel Environment
[Table 4. Results on the unseen environment split: Mean Cls Acc and Top-1 Acc for action recognition; Prec, Recall, and F1 for 3D action localization.]
We further present experimental results on the unseen environment split in Table 4. Ours outperforms I3D-Res50 and I3D+2DGround by a notable margin on both action recognition and 3D action localization. These results suggest that explicitly modeling the 3D environment context improves generalization to unseen environments with known 3D maps. However, the performance gap is smaller compared to the boost on the seen environment split. We speculate that this is because only two different environments are available for training, and therefore the risk of overfitting is further amplified on the unseen split.
5.4 Discussion

We further visualize our results, and discuss the limitations of our method and potential future work.
Visualization of Action Location. We visualize our results on the seen environment split. Specifically, we project the 3D saliency map of the action location onto the top-down view of the 3D environments. As shown in Fig. 3, our model can effectively localize the coarse action location and thereby select the region of interest from the global environment features for action recognition. Examining the failure cases, we found that the model may fail when the video features are not discriminative enough (e.g., when the camera wearer is standing close to a white wall).
Limitations and Future Work. One limitation of our method is the requirement of a high-quality 3D environment reconstruction with object annotations. However, we conjecture that 3D object detection algorithms, semantic structure from motion, and 3D scene graphs could be used to replace the human annotation, since our current volumetric representation only adopts a low-resolution semantic occupancy map as the environment descriptor. We plan to explore this direction in future work.
Another limitation is the potential error in 2D-to-3D camera registration, as discussed in Sec. 4. Currently, only camera poses from key video frames can be robustly estimated, so our method does not model the location shift within an action. We argue that camera registration could be drastically improved with the help of additional sensors (e.g., an IMU or a depth camera), and incorporating such sensors into the egocentric capture setting is an exciting future direction. In addition, our method does not consider the camera orientation; we leave this for future work.
We introduced a novel deep model that makes use of egocentric videos and a 3D map to address the task of joint action recognition and 3D localization. The key innovation of our model is to characterize the 3D action location as a latent variable, which is used to select the surrounding local environment features for action recognition. Our key insight is that the 3D geometric and semantic context of the surrounding environment provides critical information that complements video features for action understanding. We believe our work provides a critical first step towards understanding actions in the context of a 3D environment, and points to exciting future directions connecting egocentric vision, action recognition, and 3D scene understanding.
- (2012) FREAK: fast retina keypoint. In CVPR.
- (2019) 3D scene graph: a structure for unified semantics, 3D space, and camera. In ICCV.
- (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.
- (2017) Multi-view 3D object detection network for autonomous driving. In CVPR.
- (2020) The EPIC-KITCHENS dataset: collection, challenges and baselines. TPAMI.
- (2018) Scaling egocentric vision: the EPIC-KITCHENS dataset. In ECCV.
- (2012) Scene semantics from long-term observation of people. In ECCV.
- (2017) Vote3Deep: fast object detection in 3D point clouds using efficient convolutional neural networks. In ICRA.
- (2018) Demo2Vec: reasoning object affordances from online videos. In CVPR.
- (2008) Multi-camera human activity monitoring. Journal of Intelligent and Robotic Systems 52 (1), pp. 5–43.
- (2014) People watching: human actions as a cue for single view geometry. IJCV.
- (2010) Building Rome on a cloudless day. In ECCV.
- (2019) What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In ICCV.
- (2019) SplitNet: sim2sim and task2task transfer for embodied visual navigation. In ICCV.
- (2011) What makes a chair a chair? In CVPR.
- (2020) Generative hybrid representations for activity forecasting with no-regret learning. In CVPR.
- (2011) From 3D scene geometry to human workspace. In CVPR.
- (2019) Resolving 3D human pose ambiguities with 3D scene constraints. In ICCV.
- (2018) MapNet: an allocentric spatial memory for mapping environments. In CVPR.
- (2018) Predicting gaze in egocentric video by learning task-dependent attention transition. In ECCV.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML.
- (2017) Categorical reparameterization with Gumbel-Softmax. In ICLR.
- (2013) Hallucinated humans as the hidden context for labeling 3D scenes. In CVPR.
- (2012) Learning object arrangements in 3D scenes using human context. In ICML.
- (2019) EPIC-Fusion: audio-visual temporal binding for egocentric action recognition. In ICCV.
- (2019) Time-conditioned action anticipation in one shot. In CVPR.
- (2013) Learning human activities and object affordances from RGB-D videos. The International Journal of Robotics Research 32 (8), pp. 951–970.
- (2014) Joint semantic segmentation and 3D reconstruction from monocular video. In ECCV.
- (2020) Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In CVPR.
- (2013) Learning to predict gaze in egocentric video. In ICCV.
- (2018) In the eye of beholder: joint learning of gaze and actions in first person video. In ECCV.
- (2021) In the eye of the beholder: gaze and actions in first person video. TPAMI.
- (2015) Delving into egocentric actions. In CVPR.
- (2019) NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. TPAMI.
- (2020) Forecasting human object interaction: joint prediction of motor attention and actions in first person video. In ECCV.
- (2020) 4D human body capture from egocentric video via 3D scene grounding. arXiv preprint arXiv:2011.13341.
- (2016) Going deeper into first-person activity recognition. In CVPR.
- (2017) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR.
- (2017) Trespassing the boundaries: labeling temporal bounds for object interactions in egocentric video. In ICCV.
- (2019) Grounded human-object interaction hotspots from video. In ICCV.
- (2020) EGO-TOPO: environment affordances from egocentric video. In CVPR.
- (2020) You2Me: inferring body pose in egocentric video via first and second person interactions. In CVPR.
- (2012) 3D social saliency from head-mounted cameras. In NeurIPS.
- (2012) Detecting activities of daily living in first-person camera views. In CVPR.
- (2014) Head motion signatures from egocentric videos. In ACCV.
- (2014) Temporal segmentation of egocentric videos. In CVPR.
- (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR.
- (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS.
- (2017) First-person activity forecasting with online inverse reinforcement learning. In ICCV.
- (2016) Learning action maps of large environments via first-person vision. In CVPR.
- (2012) Improving image-based localization by active correspondence search. In ECCV.
- (2014) SceneGrok: inferring action maps in 3D environments. TOG.
- (2016) Structure-from-motion revisited. In CVPR.
- (2013) Hand segmentation for gesture recognition in ego-vision. In Proceedings of the 3rd ACM International Workshop on Interactive Multimedia on Mobile & Portable Devices, pp. 31–36.
- (2018) Egocentric activity prediction via event modulated attention. In ECCV.
- (2014) Sliding shapes for 3D object detection in depth images. In ECCV.
- (2016) Deep sliding shapes for amodal 3D object detection in RGB-D images. In CVPR.
- (2016) Egocentric future localization. In CVPR.
- (2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797.
- (2015) Voting for voting in online point cloud object detection. In RSS.
- (2018) Non-local neural networks. In CVPR.
- (2019) PointConv: deep convolutional networks on 3D point clouds. In CVPR.
- (2020) PointPWC-Net: cost volume on point clouds for (self-)supervised scene flow estimation. In ECCV.
- (2019) Detectron2. https://github.com/facebookresearch/detectron2.
- (2018) PIXOR: real-time 3D object detection from point clouds. In CVPR.
- (2017) Deep future gaze: gaze anticipation on egocentric videos using adversarial networks. In CVPR.
- (2020) PLACE: proximity learning of articulation and contact in 3D environments. In 3DV.
- (2020) Generating 3D people in scenes without people. In CVPR.
- (2016) Cascaded interactional targeting network for egocentric video analysis. In CVPR.
- (2018) VoxelNet: end-to-end learning for point cloud based 3D object detection. In CVPR.
G Implementation Details
Data Processing. We resize all video frames so that the short edge is 256 pixels. For the coarse 3D map, we adopt a coarse resolution for the parent voxels and a finer resolution for the children voxels. For training, our model takes an input of 8 frames (temporal sampling rate of 8). For inference, our model samples 30 clips from each video (3 spatial crops by 10 temporal locations), where each clip has 8 frames. We average the scores of all sampled clips for the video-level prediction.
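The multi-clip inference scheme can be sketched as follows; this is a minimal Python sketch, where the uniform spacing of temporal starts and the function names are illustrative assumptions, not the paper's released code:

```python
import numpy as np

def sample_clip_starts(num_frames: int, clip_len: int = 8,
                       stride: int = 8, num_clips: int = 10):
    """Uniformly spaced temporal start indices for multi-clip inference.
    Each clip covers clip_len * stride frames of the video."""
    span = clip_len * stride
    max_start = max(num_frames - span, 0)
    return [int(round(i * max_start / max(num_clips - 1, 1)))
            for i in range(num_clips)]

def video_level_prediction(clip_scores: np.ndarray) -> int:
    """Average per-clip class scores (e.g. 30 clips = 3 spatial crops
    x 10 temporal locations) and return the predicted class index."""
    return int(clip_scores.mean(axis=0).argmax())

starts = sample_clip_starts(num_frames=300)  # 10 starts covering the video
```

Averaging scores across spatial and temporal clips is the standard video-classification inference protocol; it smooths out clips where the action is partially out of view.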
Training Details. Our model is trained using SGD with momentum 0.9 and batch size 64 on 4 GPUs. The initial learning rate is 0.0375 with cosine decay. We set weight decay to 1e-4 and use batch normalization. To avoid overfitting, we adopt several data augmentation techniques, including random flipping, rotation, cropping, and color jittering.
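The cosine decay schedule mentioned above has a standard closed form; a minimal sketch (assuming decay to zero over the full training run, which the text does not specify):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 0.0375) -> float:
    """Cosine learning-rate decay: starts at base_lr, ends at 0."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# The rate equals base_lr at step 0, half of base_lr at the midpoint,
# and 0 at the final step.
```

Frameworks expose this directly (e.g. PyTorch's `CosineAnnealingLR`); the formula is shown here only to make the schedule concrete.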
H Dataset Details
This section introduces details of our new egocentric video dataset. We show additional rendered images of the 3D reconstructions, and present additional qualitative results of the key frame camera registration. Finally, we plot the distribution of activity categories in our dataset.
Camera Registration. We provide qualitative results of key-frame egocentric camera registration in Fig. 4. Specifically, we use the estimated camera pose to render the egocentric view of the 3D environment. When camera registration fails (insufficient inliers matched by RANSAC), we assign a dummy result at the world origin, which produces a completely wrong rendered view of the 3D environment, as in the last row of Fig. 4. We also visualize the matched feature points to help readers better interpret the RANSAC-based matching method for camera registration introduced in Sec. 4 of the main paper. Notably, registering an egocentric camera into a 3D environment based on images alone remains a major challenge. Motion blur, foreground occlusion, illumination changes, and featureless surfaces in indoor capture are the main causes of failure for the 2D-to-3D registration. To mitigate these failure cases, we only consider camera relocalization at key frames to approximate the ground truth of 3D action locations, and do not model how the location might shift over time within the same action. Note that if an entire action clip has no robust key-frame registration result, we adopt a uniform distribution for the 3D action location prior during training.
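The fallback logic for building the location prior can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: the inlier threshold, the Gaussian bump around the registered voxel, and the grid shape are all assumptions.

```python
import numpy as np

def location_prior(inlier_count, voxel_pose, grid_shape=(4, 14, 14),
                   min_inliers=30, sigma=1.0):
    """Build a 3D action-location prior for one key frame.

    If RANSAC found enough inliers, place a Gaussian bump at the voxel
    containing the registered camera (voxel_pose is a (z, y, x) index);
    otherwise fall back to a uniform distribution over the voxel grid,
    modeling registration failure."""
    n = int(np.prod(grid_shape))
    if voxel_pose is None or inlier_count < min_inliers:
        return np.full(grid_shape, 1.0 / n)  # uniform fallback
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in grid_shape],
                             indexing="ij")
    d2 = ((zz - voxel_pose[0]) ** 2 + (yy - voxel_pose[1]) ** 2
          + (xx - voxel_pose[2]) ** 2)
    prior = np.exp(-d2 / (2.0 * sigma ** 2))
    return prior / prior.sum()  # normalized distribution over voxels
```

Returning a proper distribution in both branches lets the same training loss supervise the predicted heatmap whether or not registration succeeded.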
3D Environment Reconstruction. Our dataset captures 3D reconstructions of three different living rooms using a state-of-the-art dense reconstruction system. We provide more rendered images of the 3D reconstructions in Fig. 5.
Activity Distribution. We present the distribution of egocentric activities in our dataset in Fig. 6. Similar to [6, 32], our dataset has a "long-tailed" distribution that characterizes naturalistic human behavior. For example, the activity category of "Pick up Book from Floor" occurs 1400 times, while the tail category "Put Painting" occurs only 9 times. Mean Class Accuracy is not biased towards frequently occurring categories, and is thus better suited than Top-1 accuracy for activity recognition on our dataset. We highlight that our model outperforms the I3D baseline on Mean Class Accuracy.
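Mean Class Accuracy averages per-class accuracies so that each class contributes equally, regardless of its frequency; a minimal sketch (the function name is ours):

```python
from collections import defaultdict

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies. Unlike Top-1 accuracy, a rare
    class counts as much as a frequent one."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# With a 9:1 class imbalance, always predicting the majority class gives
# 90% Top-1 accuracy but only 50% mean class accuracy.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
```

This is why a long-tailed dataset can make Top-1 accuracy look deceptively strong: the metric is dominated by the head classes.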
I Analysis of Table 1 in Main Paper
We provide additional analysis of our method. In Fig. 7, we specifically compare the activity classes where our model significantly outperforms I3D and I3D+2DGround. Interestingly, our model is better at classes where the video features may not be discriminative enough for recognition, e.g., "Pick up Book from Floor" vs. "Put Book on Shelf", or "Pick up Poster" vs. "Stamp Poster". Moreover, the contextual features from the 2D ground plane provide limited information for understanding these egocentric activities. We conjecture that our method makes use of environment features surrounding the predicted 3D action location to complement the video features for activity recognition.
J Additional Qualitative Results
Finally, we provide additional qualitative results. As shown in Fig. 8, we present the predicted 3D action locations and action labels; the figure follows the same format as Fig. 3 in the main paper. These results suggest that our model can effectively localize the action location and thereby more accurately predict the action labels.
Another interesting observation is that the model may output a "diffused" heatmap when foreground active objects take up the majority of the video frames (right column of Fig. 8). This is because the model receives a uniform prior as the supervisory signal when camera registration fails for an action clip. In these cases, our model opts for a diffused heatmap of the action location to avoid missing important environment features. In doing so, our model may still successfully predict the action labels despite the failure of camera registration.
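Why a uniform prior yields a diffused prediction can be made concrete with the supervision loss. The sketch below uses a cross-entropy between the predicted location distribution and the prior; this is an illustrative choice (the paper may use a different divergence), but any such loss is minimized by matching the uniform target, i.e., by a maximally diffused heatmap:

```python
import numpy as np

def location_loss(pred_logits: np.ndarray, prior: np.ndarray) -> float:
    """Cross-entropy between the predicted location distribution
    (softmax of pred_logits) and the supervisory prior."""
    logits = pred_logits.ravel() - pred_logits.max()  # stable softmax
    log_p = logits - np.log(np.exp(logits).sum())
    return float(-(prior.ravel() * log_p).sum())

# Under a uniform prior, equal logits (a diffused heatmap) give a lower
# loss than a sharply peaked prediction.
uniform = np.full(16, 1.0 / 16)
diffuse = np.zeros(16)               # equal logits -> uniform prediction
peaked = np.zeros(16)
peaked[0] = 10.0                     # sharply peaked prediction
```

So the diffused heatmaps in Fig. 8 are the expected optimum of the training signal when registration fails, not a model pathology.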
| ID | Branch | Type | Kernel Size THW, (C) | Stride THW | Output Size THWC | Comments (Loss) |
|---|---|---|---|---|---|---|
| 1 | Backbone (input: 8×224×224×3) | Conv3D | 5×7×7, 64 | 1×2×2 | 8×112×112×64 | |
| 11 | EnvNet (input: 8×28×28×64) | | | | 4×14×14×1 | Sampling 3D action location |
| 17 | | Weighted Avg Pooling | | | 4×14×14×1024 | |
| 22 | Recognition Network | Weighted Avg Pooling | 4×7×7 | 4×7×7 | 1×1×1×1024 | |
| 24 | | Softmax | | | 1×1×1×N | Cross-entropy loss |

Network architecture of our two-pathway network. We omit the residual connections in the backbone ResNet-50 for simplicity.