Egocentric Activity Recognition and Localization on a 3D Map

by Miao Liu, et al.

Given a video captured from a first person perspective and recorded in a familiar environment, can we recognize what the person is doing and identify where the action occurs in the 3D space? We address this challenging problem of jointly recognizing and localizing actions of a mobile user on a known 3D map from egocentric videos. To this end, we propose a novel deep probabilistic model. Our model takes as input a Hierarchical Volumetric Representation (HVR) of the environment and an egocentric video, infers the 3D action location as a latent variable, and recognizes the action based on the video and contextual cues surrounding its potential locations. To evaluate our model, we conduct extensive experiments on a newly collected egocentric video dataset, in which both naturalistic human actions and photo-realistic 3D environment reconstructions are captured. Our method demonstrates strong results on both action recognition and 3D action localization across seen and unseen environments. We believe our work points to an exciting research direction at the intersection of egocentric vision and 3D scene understanding.





1 Introduction

Daily life activities take place in a 3D space, and humans effortlessly leverage spatial and semantic cues, such as the 3D surface shape and semantics of familiar objects and environments, in performing daily tasks. For example, suppose the 3D environment rendered in a top-down view in Fig. 1 (a) was from your apartment, where you have a rough idea about the 3D layout of the furniture and appliances. Based on such knowledge and even without a single image, you can easily imagine where your routine actions might happen in the 3D space, e.g., drawing a picture while sitting on the sofa. Can we design vision models to exploit the prior knowledge of a known 3D environment for recognizing and localizing our actions and activities?

A significant amount of prior work in computer vision has addressed the interaction between human activities and 3D environment using fixed cameras [10, 27, 34, 52, 17]. Recently, first person vision has emerged to offer a new perspective for activity understanding. Egocentric video captures a mobile user’s actions in the context of the environment, as the camera wearer explores the space. Yet, most existing works on egocentric activity recognition did not address the scene context [25, 69, 37, 33, 44, 39, 46, 31]. More recently, contextual cues such as a 2D ground plane [50] or a topological map [41] have been considered for understanding environmental functionalities, such as the common locations at which activities occur, from egocentric videos.

In contrast to these prior efforts, this paper introduces a new research topic on the joint recognition and 3D localization of egocentric activities using the prior context provided by an annotated 3D environment map (see Fig. 1 (b)). We use a Hierarchical Volumetric Representation (HVR) to describe the semantic and geometric information of the 3D environment map (see Fig. 2 (a) and Sec 3.1 for explanation). To illustrate our approach, consider the task of determining, based on an egocentric video, whether a person is placing a book on a shelf or picking up a book from the floor. In both cases, the subject is holding the book and moving their body, and differentiating the activities based on video alone could be challenging. However, the ability to localize the head-worn camera in the 3D scene relative to the 3D voxels for the bookshelf and floor is a potentially powerful cue for disambiguation. Picking up the book requires the subject’s head to approach the horizontal surface of the floor, while placing the book requires the upright head to approach the vertical bookcase. Methods that use ground plane representations of the environment [50] or merely detect the presence of objects as context [41] may lack the specificity provided by 3D proximity. The ability to localize the camera wearer in the 3D scene and leverage spatial and semantic context provides a means to infer cues which otherwise would not be visible within the egocentric video.

Three major challenges exist for recognizing and localizing actions in 3D using egocentric videos. First, standard architectures for activity recognition from egocentric video are not designed to incorporate a 3D volumetric environment representation, requiring the design of a novel recognition architecture. Second, accurate 3D locations of activities are difficult to obtain. Even with a photo-realistic 3D environment reconstruction (as in Fig. 1(a)) and high resolution videos, we observed that state-of-the-art camera registration methods [51] sometimes provide erroneous estimates of camera pose and thus action locations, due to rapid camera motion and varying lighting conditions. Thus, our solution must address potential errors in action locations during training and testing. Third, no existing egocentric activity datasets contain both 3D environment models and corresponding videos of egocentric activities.

To address the first challenge, we present a novel deep model that takes the inputs of egocentric videos and our Hierarchical Volumetric Representation (HVR) of the 3D environment, and outputs the 3D action location, as well as the action categories. Our model consists of two branches. The environment branch makes use of a 3D convolutional network to extract environmental features from HVR. Similarly, the video branch uses a 3D convolutional network to extract visual features from the input egocentric video. The environmental and visual features are further combined to estimate the 3D action location, supervised by the results of camera registration. The second challenge of noisy localizations is addressed by using stochastic units to account for potential error. The predicted 3D action location, in the form of a probabilistic distribution, is then interpreted as a 3D attention map to select local environmental features relevant to the current video. Finally, these features are further fused with video features to recognize the actions.

In the experiment section, we show that our model can not only recognize and localize actions in a known 3D environment, but can also generalize to unseen environments not present in the training set yet with known 3D maps and object labels. These results are evaluated on a newly-collected egocentric video dataset with photo-realistic 3D environment reconstructions and 3D object annotations, recorded at multiple naturalistic household environments. We demonstrate strong results on action recognition and 3D action localization for both seen and unseen environments. When evaluated on seen environments, our model outperforms a strong baseline of 2D video-based action recognition methods by 4.2% in mean class accuracy, and beats baselines on 3D action localization by 9.3% in F1 scores. We believe our method provides a solid step forward to understand actions and activities in the context of their 3D environments.

2 Related Work

Our work is related to several ongoing topics in computer vision. We first discuss the most relevant works on egocentric vision. We then review several previous efforts on human-scene interaction and 3D scene representation.

Egocentric Vision. There is a rich literature aiming at understanding human activity from an egocentric perspective. Prior works have made great progress in recognizing and anticipating egocentric actions based on 2D videos [13, 55, 26, 69, 37, 33, 44, 39, 46], and predicting gaze and locomotion [31, 30, 20, 58, 42, 66, 43, 45, 54]. Far fewer works have considered environmental factors and spatial grounding of egocentric activity. Guan et al. [16] and Rhinehart et al. [49] jointly considered trajectory forecasting and egocentric activity anticipation with online inverse reinforcement learning. The most relevant works to ours are recent efforts on learning affordances for egocentric action understanding [41, 50]. Nagarajan et al. [41] introduced a topological map environment representation for long-term activity forecasting and affordance prediction. Rhinehart et al. [50] considered a novel problem of learning “Action Maps” from egocentric videos. In contrast to the prior use of topological maps [41] and 2D ground planes [50], our focus is on exploiting the geometric and semantic information in the HVR map to address our novel task of joint egocentric action recognition and 3D localization.

Human-Scene Interaction. Human-scene constraints have proven effective in reasoning about human body pose [68, 18, 67]. The most relevant prior works focus on understanding environment affordance. Grabner et al. [15] predict object functionality by hallucinating an actor interacting with the scene. A similar idea was also explored in [23, 24]. Savva et al. [52] predicted action heat maps that highlight the likelihood of an action in the scene by partitioning scanned 3D scenes into disjoint sets of segments and learning a segment dictionary. Gupta et al. [17] presented a human-centric scene representation for predicting the afforded human body poses. Delaitre et al. [7, 11] introduced a statistical descriptor of person-object interactions for object recognition and human body pose prediction. Fang et al. [9] proposed to learn object affordances from demonstrative videos. Nagarajan et al. [40] proposed to use backward attention to approximate the interaction hotspots of future action. Those previous efforts were limited to the analysis of environment functionality [23, 24, 7, 11], constrained human action and body pose [52], or hand-object interaction on the 2D image plane [9, 40]. In contrast, we are the first to utilize the rich geometric and semantic information of the 3D environment for naturalistic human activity recognition and 3D localization.

3D Scene Representation. Many recent works explored various forms of 3D representations for 3D vision tasks, including 3D object detection [65, 70, 4] and embodied visual navigation [14, 19, 29]. Deep models have been developed for point clouds [47, 63, 62, 48] with great success in object recognition, semantic segmentation, and scene flow estimation. However, using point clouds to describe a large-scale 3D scene results in high computational and memory cost [70]. To address this challenge, many approaches used rasterized point clouds in a 3D voxel grid, with each voxel represented by either handcrafted features [60, 8, 56, 57] or learning-based features [70]. Signed distance values and the Chamfer distance between the 3D scene and the 3D human body have also been used to enforce more plausible human-scene contact [68, 18, 67, 36]. Building on this prior work, we utilize a 3D Hierarchical Volumetric Representation (HVR) that encodes the local geometric and semantic context of a 3D environment for egocentric action understanding.

3 Method

We denote a trimmed input egocentric video as $x = (x^1, \ldots, x^T)$ with frames $x^t$ indexed by time $t$. In addition, we assume a global 3D environment prior $e$, associated with each input video, is available at both training and inference time. $e$ is environment specific, e.g., the 3D map of an apartment. Our goal is to jointly predict the action category $y$ of $x$ and the action location $r$ on the 3D map. $r$ is parameterized as a 3D saliency map, where the value of $r(w, d, h)$ represents the likelihood of action clip $x$ happening in spatial location $(w, d, h)$ (see Footnote 1). The action location $r$ thereby defines a proper probability distribution in 3D space.

Footnote 1: For tractability, we associate the entire activity with a specific 3D location and do not model the change in location over the course of an activity. This is a valid assumption for the activities we address, such as sitting down, playing keyboards, etc.

In this section, we first introduce our proposed joint model of action recognition and 3D localization, leveraging a 3D representation of the environment. We then describe key components of our model, training and inference schema, as well as our network architecture.

Figure 2: (a) Hierarchical Volumetric Representation (HVR). We rasterize the semantic 3D environment mesh into two levels of 3D voxels. Each parent voxel corresponds to a possible action location, while the children voxels compose a semantic occupancy map that describes their parent voxel. (b) Overview of our model. Our model takes video clips and the associated 3D environment representation as inputs. We adopt an I3D backbone network to extract video features and a 3D convolutional network to extract environment features. We then make use of stochastic units to generate a sampled action location for selecting local 3D environment features for action recognition. Note that $\otimes$ represents weighted average pooling, while $\oplus$ denotes concatenation along the channel dimension.

3.1 Joint Modeling with a 3D Environment Prior

3D Environment Representation. We seek to design a representation that not only encodes the 3D geometric and semantic information of the 3D environment, but is also effective for 3D action localization and recognition.

To this end, we introduce a Hierarchical Volumetric Representation (HVR) of the 3D environment. We provide an illustration of our method in Fig. 2(a). We assume the 3D environment reconstruction with object labels is given in advance as a 3D mesh (see Sec. 4 for details). We first divide the 3D mesh into parent voxels that define all possible action locations. We then divide each parent voxel into multiple child voxels at a fixed resolution and further assign an object label to each child voxel based on the object annotation. Specifically, the object label of each child voxel is determined by the majority vote of the vertices that lie inside that child voxel. Note that we only consider static objects and treat empty space as a specially-designated “object” category. Therefore, the child voxels compose a semantic occupancy map that encodes both the 3D geometry and semantic meaning of the parent voxel.

We further vectorize the semantic occupancy map and use the resulting vector as a feature descriptor of the parent voxel. The 3D environment representation $e$ can then be represented as a 4D tensor with dimension $W \times D \times H \times C$, where $W$, $D$, $H$ index the grid of parent voxels and $C$ is the length of the vectorized semantic occupancy map, i.e., the number of child voxels per parent voxel. Note that a higher child voxel resolution can better approximate the 3D shape of the environment. Our proposed representation is thus a flexible environment representation that jointly considers the 3D action location candidates and the geometric and semantic information of the 3D environment.
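The rasterization described above can be illustrated with a minimal sketch in plain Python. It assumes the annotated mesh is available as a list of vertex positions with one integer object label per vertex; the function name `build_hvr`, the `EMPTY` label value, the dictionary layout, and the flattening order of child voxels are all hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict

EMPTY = 0  # hypothetical label reserved for empty space (its own "object" category)

def build_hvr(vertices, labels, origin, parent_size, child_res):
    """Rasterize labeled mesh vertices into parent voxels with child-voxel descriptors.

    vertices: list of (x, y, z) positions; labels: one integer object label per vertex.
    Returns {parent_index: flat list of child_res**3 child-voxel labels}.
    Parent voxels that contain no vertices are simply absent (all empty).
    """
    child_size = parent_size / child_res
    votes = defaultdict(Counter)  # (parent index, child index) -> label counts
    for position, label in zip(vertices, labels):
        # Global child-voxel coordinates of this vertex.
        cx, cy, cz = (int((v - o) // child_size) for v, o in zip(position, origin))
        parent = (cx // child_res, cy // child_res, cz // child_res)
        child = (cx % child_res, cy % child_res, cz % child_res)
        votes[(parent, child)][label] += 1

    hvr = {}
    for (parent, (i, j, k)), counter in votes.items():
        descriptor = hvr.setdefault(parent, [EMPTY] * child_res ** 3)
        # Majority vote over the vertices that fall inside this child voxel.
        descriptor[(i * child_res + j) * child_res + k] = counter.most_common(1)[0][0]
    return hvr

# Toy example: five labeled vertices, 1 m parent voxels, 2 x 2 x 2 child voxels.
verts = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (0.3, 0.2, 0.1), (0.9, 0.9, 0.9), (1.5, 0.2, 0.2)]
labs = [3, 3, 5, 7, 2]
hvr = build_hvr(verts, labs, origin=(0.0, 0.0, 0.0), parent_size=1.0, child_res=2)
print(hvr[(0, 0, 0)])  # [3, 0, 0, 0, 0, 0, 0, 7]
```

Each dictionary value plays the role of the vectorized semantic occupancy map of one parent voxel; stacking these vectors over the parent grid would give the 4D tensor described in the text.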

Joint Learning of Action Category and Action Location. We present an overview of our model in Fig. 2(b). Specifically, we adopt a two-pathway network architecture. The video pathway extracts video features with an I3D backbone network $\phi$, while the environment pathway extracts the global 3D environment features with a 3D convolutional network $\psi$. Visual and environmental features are jointly considered for predicting the 3D action location $r$. We then adopt stochastic units to generate a sampled action location $\tilde{r}$ for selecting the local environment features relevant to the actions. The selected local environment features are further fused with the video features for action recognition.

Our key idea is thus to make use of the 3D environment representation $e$ for jointly learning the action label $y$ and 3D action location $r$ of video clip $x$. We consider the action location $r$ as a probabilistic variable, and model the action label $y$ given input video $x$ and environment representation $e$ using a latent variable model. Therefore, the conditional probability $p(y \mid x, e)$ is given by:

$$p(y \mid x, e) = \int_{r} p(y \mid r, x, e)\, p(r \mid x, e)\, dr. \tag{1}$$

Notably, our proposed joint model has two key components. First, $p(r \mid x, e)$ models the 3D action location $r$ from the video input $x$ and the 3D environment representation $e$. Second, $p(y \mid r, x, e)$ utilizes $r$ to select a region of interest (ROI) from the environment representation $e$, and combines the selected environment features with the video features from $x$ for action classification. During training, our model receives the ground truth 3D action location and action label as supervisory signals. At inference time, our model jointly predicts both the 3D action location $r$ and action label $y$. We now provide additional technical details in modeling $p(r \mid x, e)$ and $p(y \mid r, x, e)$.

3.2 3D Action Localization

We first introduce our 3D action localization module, defined by the conditional probability $p(r \mid x, e)$. Given the video pathway features $\phi(x)$ and the environment pathway features $\psi(e)$, we learn a mapping function to predict the location $r$, which is defined on a 3D grid of candidate action locations (see Footnote 2). The mapping function is composed of 3D convolution operations with parameters $w_r$ and a softmax function. Thus, $p(r \mid x, e)$ is given by:

$$p(r \mid x, e) = \mathrm{softmax}\left(w_r^{\top}\left(\phi(x) \oplus \psi(e)\right)\right), \tag{2}$$

where $\oplus$ denotes concatenation along the channel dimension. Therefore, the resulting action location $r$ is a proper probability distribution normalized in 3D space, with $r(w, d, h)$ reflecting the probability of video clip $x$ happening in the spatial location $(w, d, h)$ of the 3D environment.

Footnote 2: The 3D grid is defined globally over the 3D environment scan.

In practice, we do not have access to the precise ground truth 3D action location and must rely on camera registration results as a proxy. Using a categorical distribution for $p(r \mid x, e)$ thus models the ambiguity of 2D to 3D registration. We follow [32, 35] to adopt stochastic units in our model. Specifically, we use the Gumbel-Softmax and reparameterization trick [22, 38] for differentiable sampling:

$$\tilde{r}_{w,d,h} = \frac{\exp\left((\log r_{w,d,h} + G_{w,d,h}) / \theta\right)}{\sum_{w',d',h'} \exp\left((\log r_{w',d',h'} + G_{w',d',h'}) / \theta\right)}, \tag{3}$$

where $G$ is a Gumbel distribution used for sampling from a discrete distribution. This Gumbel-Softmax trick produces a “soft” sample $\tilde{r}$ that allows gradients to propagate to the video pathway network $\phi$ and the environment pathway network $\psi$. $\theta$ is the temperature parameter that controls the shape of the soft sample distribution. We use a fixed value of $\theta$ for our model.
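As a rough illustration of the sampling step, the following plain-Python sketch draws a soft sample from a categorical distribution over a flattened grid of candidate voxels using Gumbel noise. The helper names, the toy logits, and the temperature value of 1.0 are assumptions made for this example only; the temperature used by the authors is not reproduced here, and a practical model would use a tensor library with automatic differentiation.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def gumbel_softmax_sample(probs, temperature, rng):
    """Draw a soft sample from a categorical distribution over voxels.

    probs: location probabilities (flattened 3D grid), summing to 1.
    temperature: assumed hyperparameter; smaller values give sharper samples.
    """
    noisy = []
    for p in probs:
        u = max(rng.random(), 1e-12)        # keep u away from 0 so the logs are finite
        gumbel = -math.log(-math.log(u))    # Gumbel(0, 1) noise via inverse transform
        noisy.append((math.log(max(p, 1e-12)) + gumbel) / temperature)
    return softmax(noisy)

rng = random.Random(0)
location = softmax([0.0, 1.0, 2.0, 0.5])  # stand-in for the predicted location map
sample = gumbel_softmax_sample(location, temperature=1.0, rng=rng)
print(len(sample), round(sum(sample), 6))
```

The sample is itself a valid distribution over voxels, which is what allows it to be used as a soft attention map in the recognition step.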

3.3 Action Recognition with Environment Prior

Our model further models $p(y \mid r, x, e)$ with a mapping function that jointly considers the action location $r$, video input $x$, and 3D environment representation $e$ for action recognition. Formally, the conditional probability $p(y \mid r, x, e)$ can be modeled as:

$$p(y \mid r, x, e) = \mathrm{softmax}\left(w_p^{\top}\, \Sigma\left(\phi(x) \oplus \left(\tilde{r} \otimes \psi(e)\right)\right)\right), \tag{4}$$

where $\oplus$ denotes concatenation along the channel dimension, and $\otimes$ denotes element-wise multiplication. Specifically, our method uses the sampled action location $\tilde{r}$ for selectively aggregating environment features $\psi(e)$ and combines the aggregated environment features with video features $\phi(x)$ for action recognition. $\Sigma$ denotes the average pooling operation that maps 3D features to 2D features, and $w_p$ denotes the parameters of the linear classifier that maps the feature vector to prediction logits.
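The recognition step can be sketched in plain Python as follows. This is a simplified illustration that assumes the video feature has already been pooled to a single vector and that the environment features are stored as one vector per candidate voxel; the function name `recognize`, the `bias` term, and the toy weights are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def recognize(video_feat, env_feats, location, weights, bias):
    """Classify an action from video features and location-weighted environment features.

    video_feat: pooled video feature vector (assumed already pooled).
    env_feats: one feature vector per candidate voxel (flattened 3D grid).
    location: sampled location weights, one per voxel, summing to 1.
    weights, bias: parameters of a linear classifier (one row per action class).
    """
    dim = len(env_feats[0])
    # Weighted pooling: the location map acts as a 3D attention map over voxels.
    pooled_env = [sum(w * feat[c] for w, feat in zip(location, env_feats)) for c in range(dim)]
    fused = list(video_feat) + pooled_env  # concatenation along the channel dimension
    logits = [sum(wi * xi for wi, xi in zip(row, fused)) + b for row, b in zip(weights, bias)]
    return softmax(logits)

# Toy example: two voxels, two environment channels, one video channel, two classes.
probs = recognize(
    video_feat=[1.0],
    env_feats=[[1.0, 0.0], [0.0, 1.0]],
    location=[0.0, 1.0],  # all attention on the second voxel
    weights=[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    bias=[0.0, 0.0],
)
print(probs[1] > probs[0])  # True
```

Because the location weights sum to one, the weighted sum over voxels behaves like a weighted average of the environment features around the likely action location.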

3.4 Training and Inference

We now present our training and inference schema. At training time, we assume a prior distribution of action location $q(r \mid x, e)$ is given as a supervisory signal. $q(r \mid x, e)$ is obtained by registering the egocentric camera into the 3D environment (see more details in Sec. 4). Note that we model $r$ as a latent variable, and based on the Evidence Lower Bound (ELBO), the resulting deep latent variable model has the following loss function:

$$\mathcal{L} = -\mathbb{E}_{\tilde{r} \sim p(r \mid x, e)}\left[\log p(y \mid \tilde{r}, x, e)\right] + \mathrm{KL}\left[p(r \mid x, e)\,\|\,q(r \mid x, e)\right], \tag{5}$$

where the first term is the cross entropy loss for action classification and the second term is the KL-Divergence that matches the predicted 3D action location distribution $p(r \mid x, e)$ to the prior distribution $q(r \mid x, e)$. During training, a single 3D action location sample $\tilde{r}$ is drawn for each input within the mini-batch.
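A minimal plain-Python sketch of this two-term objective for a single training example is given below. It assumes the class probabilities were already computed from one sampled location, and that both location maps are flattened lists summing to one; the function names and the small `eps` constant are illustrative assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as flat lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)

def joint_loss(class_probs, label, predicted_loc, prior_loc):
    """Cross entropy on the action label plus KL between predicted and prior locations.

    class_probs: class probabilities computed from a single sampled location.
    label: index of the ground truth action class.
    predicted_loc, prior_loc: flattened location distributions over the voxel grid.
    """
    cross_entropy = -math.log(class_probs[label])
    return cross_entropy + kl_divergence(predicted_loc, prior_loc)

# When the predicted location matches the prior, only the classification term remains.
loss = joint_loss([0.25, 0.75], label=1, predicted_loc=[0.5, 0.5], prior_loc=[0.5, 0.5])
print(abs(loss + math.log(0.75)) < 1e-9)  # True
```

Drawing one location sample per input gives a single-sample estimate of the expectation in the first term of the loss.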

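Theoretically, our model should sample from the same input multiple times and take the average of the predictions at inference time. To avoid such dense sampling, we choose to directly plug the deterministic action location $r$ into Eq. 4. Note that the recognition function is composed of a linear mapping function and a softmax function, and is therefore treated as convex. By Jensen’s Inequality, and taking the expected sampled location $\mathbb{E}[\tilde{r}]$ to be $r$, we have

$$\mathbb{E}\left[p(y \mid \tilde{r}, x, e)\right] \geq p\left(y \mid \mathbb{E}[\tilde{r}], x, e\right) = p(y \mid r, x, e). \tag{6}$$

That is, $p(y \mid r, x, e)$ provides an empirical lower bound of $\mathbb{E}\left[p(y \mid \tilde{r}, x, e)\right]$, and therefore provides a valid approximation of dense sampling.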

3.5 Network Architecture

For the video pathway, we adopt the I3D-Res50 network [3, 61] pre-trained on Kinetics as the backbone. For the environment pathway, we make use of a lightweight network (denoted as EnvNet), which has four 3D convolutional operations. The video features from the 3rd convolutional block of I3D-Res50 and the environment features after the 2nd 3D convolutional operation in EnvNet are concatenated for 3D action location prediction. We then use 3D max pooling operations to match the size of the action location map to the size of the feature map of the 4th convolution of EnvNet for the weighted pooling in Eq. 4.
4 Dataset and Annotation

Dataset. We utilize a newly-developed egocentric dataset, where the camera wearers are asked to conduct daily activities in living rooms that have photo-realistic 3D reconstructions. Note that existing egocentric video datasets (e.g., EGTEA [32] and EPIC-Kitchens [5]) did not explicitly capture the 3D environment, and Structure-from-Motion [53] fails substantially on those datasets due to the drastic head motion. This is the first activity recognition dataset to include both egocentric videos and high-quality 3D environment reconstructions.

The dataset contains 60 hours of video from 105 different video sequences captured by Vuzix Blade Smart Glasses with a resolution of 1920×1080 at 24Hz. It captures 34 different indoor activities from 3 real-world living rooms, resulting in action clips. Similar to [5], we consider both seen and unseen environment splits. In the seen environment split, each environment is seen in both training and testing sets ( instances for training, and instances for testing). In the unseen split, all sequences from the same environment are either in training or testing ( instances for training, and instances for testing).

3D Environment Reconstruction & Object Annotations. We use the state-of-the-art dense reconstruction system [59] to obtain the photo-realistic 3D reconstruction of the environment. Specifically, the environment is scanned with a customized capture rig, which contains a high-resolution RGB camera, an IR projector and an IR sensor, two fisheye monochrome cameras, an IMU, and AprilTags. To reconstruct the 3D model, the camera trajectory is first tracked using the monochrome images and IMU signals. Then the dense depth measurements are fused into a volumetric Truncated Signed Distance Field (TSDF) representation, which is extracted into meshes by the Marching Cubes algorithm, followed by texturing with the HDR RGB images. Once the 3D model is reconstructed, we further annotate the mesh by painting an object instance label over the mesh polygons. See [59] for a more detailed description. In this work, 35 object categories plus a background class label are used in annotation. Note that the object annotations can be automated with state-of-the-art 3D object detection algorithms.

Prior Distribution of 3D Action Location. To obtain the ground truth of the activity location for each trimmed activity video clip, we first register the egocentric camera in the 3D environment using a RANSAC-based feature matching method. Specifically, we first build a base map from the monochrome camera streams for 3D environment reconstruction using Structure from Motion [12, 53]. The pre-built base map is a dense point cloud associated with 3D feature points. We then estimate the camera pose of the video frame using active search [51]. Note that registering the 2D egocentric video frames in a 3D environment is fundamentally challenging, due to the drastic head rotation, featureless surfaces, and changing illumination. Therefore, we only consider the key frame camera registrations, where enough inliers were matched with RANSAC. As introduced in Sec. 3, the action location is defined as a probability distribution in 3D space. Thus, we map the key frame camera location into the index of the 3D action location tensor, with its value representing the likelihood of the given action happening in the corresponding parent voxel. To account for the uncertainty of 2D to 3D camera registration, we further impose a Gaussian distribution to generate the final 3D action location ground truth.
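One way to picture this construction is the plain-Python sketch below, which places an isotropic Gaussian around each registered key frame position and normalizes over the parent voxel grid. The function name `location_prior`, the choice of an isotropic Gaussian in metric space, the summation over key frames, and the value of `sigma` are all assumptions for illustration; the paper does not specify these details here.

```python
import math

def location_prior(camera_positions, origin, voxel_size, grid_shape, sigma):
    """Build a normalized location prior over parent voxels.

    camera_positions: registered key frame camera positions as (x, y, z) tuples.
    origin, voxel_size, grid_shape: assumed description of the parent voxel grid.
    sigma: assumed standard deviation (same units as the positions).
    Returns {(i, j, k): probability}.
    """
    width, depth, height = grid_shape
    prior = {}
    for i in range(width):
        for j in range(depth):
            for k in range(height):
                center = [(idx + 0.5) * voxel_size + o for idx, o in zip((i, j, k), origin)]
                # Sum of Gaussian bumps, one per registered key frame.
                prior[(i, j, k)] = sum(
                    math.exp(-sum((c - p) ** 2 for c, p in zip(center, pos)) / (2.0 * sigma ** 2))
                    for pos in camera_positions
                )
    total = sum(prior.values())
    return {index: value / total for index, value in prior.items()}

prior = location_prior([(0.5, 0.5, 0.5)], origin=(0.0, 0.0, 0.0),
                       voxel_size=1.0, grid_shape=(3, 3, 2), sigma=1.0)
print(max(prior, key=prior.get))  # (0, 0, 0)
```

Spreading probability mass over neighboring voxels reflects the registration uncertainty discussed above, rather than committing to a single voxel.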

5 Experiments and Results

In this section, we first describe the experiment setting, and present our main results on action recognition and 3D action localization, followed by detailed ablation studies to verify our model design. We further show how our model generalizes to a novel environment, and discuss our results.

Evaluation Protocol: For all experiments, we evaluate the performance of both action recognition and 3D action localization, following the protocols below.

• Action Recognition. We follow [5, 32] to report both Mean Class Accuracy and Top-1 Accuracy.

• 3D Action Localization. We consider 3D action localization as binary classification over the 3D grid. Therefore, we report the Precision, Recall, and F1 score on a downsampled 3D heatmap (by a factor of 4 in the X and Y directions, and 2 in the Z direction) as in [31].
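The localization protocol above can be sketched in plain Python as follows. This is an illustrative reading, not the authors' evaluation code: the sum-pooling used for downsampling, the `threshold` used to binarize the heatmaps, and the dictionary representation of the grid are assumptions.

```python
def downsample(heatmap, factors):
    """Sum-pool a {(i, j, k): value} heatmap by integer factors along each axis."""
    fx, fy, fz = factors
    pooled = {}
    for (i, j, k), value in heatmap.items():
        key = (i // fx, j // fy, k // fz)
        pooled[key] = pooled.get(key, 0.0) + value
    return pooled

def precision_recall_f1(pred, target, threshold):
    """Treat localization as binary classification over grid cells."""
    cells = set(pred) | set(target)
    pred_pos = {c for c in cells if pred.get(c, 0.0) > threshold}
    true_pos = {c for c in cells if target.get(c, 0.0) > threshold}
    hits = len(pred_pos & true_pos)
    precision = hits / len(pred_pos) if pred_pos else 0.0
    recall = hits / len(true_pos) if true_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: factors of 4 in X and Y and 2 in Z, as in the protocol.
pred = downsample({(0, 0, 0): 0.6, (4, 0, 0): 0.4}, (4, 4, 2))
target = downsample({(1, 1, 1): 1.0}, (4, 4, 2))
print(precision_recall_f1(pred, target, threshold=0.1))
```

In the toy example, one of the two predicted cells matches the single ground truth cell, so precision is 0.5 and recall is 1.0.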

| Method | Action Recognition: Mean Cls Acc | Action Recognition: Top-1 Acc | 3D Action Localization: Prec | 3D Action Localization: Recall | 3D Action Localization: F1 |
|---|---|---|---|---|---|
| I3D-Res50 | 37.48 | 55.15 | 8.14 | 38.73 | 13.45 |
| I3D+Obj | 37.66 | 55.11 | 10.04 | 35.08 | 15.61 |
| I3D+2DGround | 38.69 | 55.37 | 10.88 | 36.19 | 16.73 |
| I3D+SemVoxel | 39.23 | 56.07 | 11.26 | 38.77 | 17.45 |
| I3D+Affordance | 39.95 | 55.82 | 11.55 | 35.35 | 17.41 |
| Ours (HVR) | 41.64 | 56.94 | 16.71 | 35.55 | 22.73 |

Table 1: Comparison with other forms of environment context (all values in %). Our Hierarchical Volumetric Representation (HVR) outperforms other methods by a significant margin on both action recognition and 3D action localization.

5.1 Action Understanding in Seen Environments

Our method is the first to utilize the 3D environment information for egocentric action recognition and 3D localization. Previous works have considered various environment contexts for other tasks, including 3D object detection and affordance prediction. Therefore, we adapt previously proposed contextual cues into our proposed joint model and design the following strong baselines:

• I3D-Res50 refers to the backbone network from [61]. We also use the network feature from I3D-Res50 for 3D action localization by adopting the KL loss.

• I3D+Obj uses object detection results from a pre-trained object detector [64] as contextual cues as in [13]. This representation is essentially an object-centric feature that describes the attended environment (where the camera wearer is facing), and therefore the 3D action location cannot be used for selecting surrounding environment features.

• I3D+2DGround projects the object information from the 3D environment to the 2D ground plane. A similar representation is also considered in [50]. Note that the predicted 3D action location will also be projected to the 2D ground plane to select local environment features.

• I3D+SemVoxel is inspired by [70], where we use the semantic probability distribution of all the vertices within each voxel as a feature descriptor. Therefore, the resulting environment representation is a 4D tensor with dimension $W \times D \times H \times K$, where $W$, $D$, $H$ represent the spatial dimensions, and $K$ denotes the number of object labels from the 3D environment mesh annotation introduced in Sec. 4.

• I3D+Affordance follows [41] to use the afforded action distribution as the feature descriptor for each voxel. The resulting representation is a 4D tensor with dimension $W \times D \times H \times A$, where $A$ denotes the number of action classes. The afforded action distribution is derived from the training set.

Results. Our results on the seen environment split are listed in Table 1. Our method outperforms the I3D-Res50 baseline by a large margin (+4.16%/+1.79% on Mean Cls Acc/Top-1 Acc) on action recognition. We attribute this significant performance gain to explicitly modeling the 3D environment context. As for 3D action localization, our method outperforms I3D-Res50 by 9.28% in F1 score, a relative improvement of 69%. Notably, predicting the 3D action location based on the video sequence alone is error-prone. Our method, on the other hand, explicitly models the 3D environment factor and thus improves the performance of 3D action localization. In subsequent sections, we will show that the performance improvement does not simply come from the additional input modality of the 3D environment, but is attributable to a careful design of the 3D representation and probabilistic joint modeling.
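The margins quoted for the seen split follow directly from the rows of Table 1. The short Python check below recomputes them; the variable names are illustrative, and the tuples simply copy the Mean Cls Acc, Top-1 Acc, and F1 entries for I3D-Res50 and Ours (HVR).

```python
# (Mean Cls Acc, Top-1 Acc, F1) copied from Table 1.
baseline = (37.48, 55.15, 13.45)  # I3D-Res50
ours = (41.64, 56.94, 22.73)      # Ours (HVR)

# Absolute gains in percentage points, rounded to two decimals.
absolute_gain = [round(o - b, 2) for o, b in zip(ours, baseline)]
# Relative F1 improvement over the baseline, in percent.
relative_f1_gain = round(100 * (ours[2] - baseline[2]) / baseline[2])

print(absolute_gain)     # [4.16, 1.79, 9.28]
print(relative_f1_gain)  # 69
```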

Comparison on environment representation. We now compare HVR with other forms of environment representation. As shown in Table 1, I3D+Obj yields a minor improvement in overall performance, while I3D+2DGround, I3D+SemVoxel and I3D+Affordance improve the performance of action recognition and 3D localization by a notable margin. Those results suggest that the environment context (even in 2D space) plays an important role in egocentric action understanding. More importantly, our method outperforms all previous methods by at least 1.69% in Mean Cls Acc for action recognition and 5.28% in F1 score for 3D action localization. These results suggest that our proposed HVR is superior to a 2D ground plane representation, and demonstrate that using the semantic occupancy map as the environment descriptor can better facilitate egocentric understanding.

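| Method | Action Recognition: Mean Cls Acc | Action Recognition: Top-1 Acc | 3D Action Localization: Prec | 3D Action Localization: Recall | 3D Action Localization: F1 |
|---|---|---|---|---|---|
| I3D-Res50 | 37.48 | 55.15 | 8.14 | 38.73 | 13.45 |
| I3D+SemVoxel | 39.23 | 56.07 | 11.26 | 38.77 | 17.45 |
| Ours (lower resolution) | 39.04 | 56.26 | 12.19 | 36.82 | 18.32 |
| Ours (default resolution) | 41.64 | 56.94 | 16.71 | 35.55 | 22.73 |
| Ours (higher resolution) | 40.06 | 56.04 | 16.13 | 39.84 | 22.96 |

Table 2: Ablation study for the 3D representation (all values in %). We present the results of our method with different semantic occupancy map resolutions.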

5.2 Ablation Studies

We now present detailed ablation studies of our method on the seen split. To begin with, we analyze the role of semantic and geometric information in our hierarchical volumetric representation. We then present an experiment to verify whether fine-grained environment context is necessary for egocentric action understanding. Furthermore, we show the benefits of joint modeling of action and 3D action location.

Semantic Meaning and 3D Geometry. The semantic occupancy map carries both geometric and semantic information of the local environment. To show how each component contributes to the performance boost, we compare Ours with I3D+SemVoxel, where only semantic meaning is considered, in Table 2. Ours outperforms I3D+SemVoxel by a notable margin for action recognition (+2.41% in Mean Cls Acc) and a large margin for 3D localization (+5.28% in F1 score). These results suggest that the semantic occupancy map is more expressive than semantic information alone for action understanding, yet it has a smaller impact on action recognition than on 3D action localization.

Granularity of 3D Information. We further show what level of 3D environment granularity is needed for egocentric action understanding. By the definition of the occupancy map, increasing the resolution of the children voxels will better approximate the actual 3D shape of the environment. Therefore, we report results of our method with different occupancy map resolutions in Table 2. Not surprisingly, the lowest occupancy map resolution lags behind Ours for action recognition by 2.60% in Mean Cls Acc and for 3D action localization by 4.41% in F1 score, which again shows the necessity of incorporating the 3D geometric cues. Another interesting observation is that a higher resolution can slightly increase the 3D action localization F1 score by 0.23%, yet decreases the Mean Cls Acc on action recognition by 1.58%. These results suggest that the fine-grained 3D shape of the environment is not necessary for action recognition. In fact, a higher resolution dramatically increases the feature dimension of the environment representation, and thereby makes learning more difficult for the network.

| Method | Mean Cls Acc (Action Recognition) | Top-1 Acc (Action Recognition) | Prec (3D Action Localization) | Recall (3D Action Localization) | F1 (3D Action Localization) |
|---|---|---|---|---|---|
| I3D-Res50 | 37.48 | 55.15 | 8.14 | 38.73 | 13.45 |
| I3D+GlobalEnv | 35.99 | 54.93 | 8.82 | 36.40 | 14.20 |
| I3D+DetEnv | 39.37 | 55.88 | 14.11 | 32.66 | 19.71 |
| Ours | 41.64 | 56.94 | 16.71 | 35.55 | 22.73 |

Table 3: Ablation study for joint modeling of action category and 3D action location. Our proposed probabilistic joint modeling consistently benefits the performance on action recognition and 3D action localization.
Figure 3: Visualization of predicted 3D action location (projected on a top-down view of the reconstructed 3D environment) and action labels (captions above the video frames). We present both successful (green) and failure (red) examples. We also show the "zoom-in" spatial region of the action location to help readers better interpret our action localization results.
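As a reading aid for the localization columns of Table 3, the sketch below assumes that the F1 column is the harmonic mean of the reported precision and recall. The function name `f1_score` is illustrative and not taken from the paper's code; the precision and recall values are copied from Table 3.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 3.
table3 = {
    "I3D-Res50": (8.14, 38.73),
    "I3D+GlobalEnv": (8.82, 36.40),
    "I3D+DetEnv": (14.11, 32.66),
    "Ours": (16.71, 35.55),
}

for method, (prec, rec) in table3.items():
    print(f"{method}: F1 = {f1_score(prec, rec):.2f}")
```

Under this assumption the recomputed values agree with the F1 column of Table 3 to two decimals, which also makes clear why a method with high recall but very low precision (such as I3D-Res50) still obtains a low F1.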

Joint Learning of Action Label and 3D Location. We denote as I3D+GlobalEnv a baseline model that directly fuses global environment features, extracted by the same 3D convolutional network adopted in our method, with video features for action grounding. The results are presented in Table 3. Interestingly, I3D+GlobalEnv decreases the performance of the I3D-Res50 backbone network on action recognition (by 1.49 points in mean class accuracy and 0.22 points in Top-1 accuracy) and yields only a marginal improvement for 3D action localization (+0.75 F1). We speculate that this is because only three distinct environments are available for training, which may lead to overfitting. In contrast, our method makes use of the learned 3D action location to select relevant environment features associated with the action. As the action location varies among different input videos, our method can utilize the 3D environment context without running into the pitfall of overfitting, and therefore outperforms I3D+GlobalEnv by 5.65 points in mean class accuracy and 2.01 points in Top-1 accuracy for action recognition, and by 8.53 F1 for 3D action localization.
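The selection of environment features by action location can be illustrated with a weighted average pooling, which Table 5 lists as the operation guided by the sampled 3D action location. The following is a minimal sketch, assuming the environment features are flattened into one vector per map cell; the function name and the toy data are hypothetical and not the authors' implementation.

```python
def weighted_average_pool(features, weights):
    """Pool per-cell feature vectors into one vector using location weights.

    features: list of N feature vectors (one per map cell), each of length C.
    weights:  list of N non-negative weights, e.g. a sampled action location map.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    channels = len(features[0])
    pooled = [0.0] * channels
    for feat, w in zip(features, weights):
        for c in range(channels):
            # Normalize so that the weights act as a distribution over cells.
            pooled[c] += (w / total) * feat[c]
    return pooled


# Toy example: three map cells with two-channel features.
cell_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(weighted_average_pool(cell_features, [0.0, 0.0, 1.0]))  # picks the third cell
print(weighted_average_pool(cell_features, [1.0, 1.0, 1.0]))  # plain global average
```

With uniform weights the operation reduces to global average pooling, which corresponds to using the whole environment as context; a peaked location map instead restricts the context to the cells around the predicted action location.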

Probabilistic Modeling of 3D Action Location. As introduced in Sec. 4, considerable uncertainty lies in the prior distribution of the 3D action location, due to the challenging artifacts of 2D-to-3D camera registration. To verify that probabilistic modeling can account for the uncertainty of the 3D action location ground truth, we compare our method with a deterministic version of our model, denoted as DetEnv. DetEnv adopts the same inputs and network architecture as our method, except for the differentiable sampling with the Gumbel-Softmax trick. As shown in Table 3, Ours outperforms DetEnv by 2.27 points in mean class accuracy and 1.06 points in Top-1 accuracy for action recognition, and by 3.02 F1 for 3D action localization. These results demonstrate the benefits of the stochastic units adopted in our method.
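For readers unfamiliar with the Gumbel-Softmax trick [22, 38], the sketch below shows the generic sampling step on a flat list of location logits. It is a minimal illustration in plain Python, not the authors' code: the function name, the temperature value, and the toy logits are assumptions, and a real model would apply the same idea to the location map inside an automatic differentiation framework.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw a relaxed one-hot sample over len(logits) candidate locations."""
    eps = 1e-20  # guards against log(0)
    perturbed = []
    for logit in logits:
        u = rng.random()  # uniform sample in [0, 1)
        gumbel_noise = -math.log(-math.log(u + eps) + eps)
        perturbed.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed logits.
    max_val = max(perturbed)
    exps = [math.exp(p - max_val) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: three candidate locations, the last one most likely.
rng = random.Random(0)
weights = gumbel_softmax_sample([0.0, 1.0, 2.0], temperature=0.5, rng=rng)
print(weights)
```

Each call returns a different set of non-negative weights that sum to one; lower temperatures push the sample towards a one-hot vector. A deterministic variant in the spirit of DetEnv would skip the noise term and use the softmax of the logits directly.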

Remarks. To summarize, our key finding is that both 3D geometric and semantic contexts convey important information for action recognition and 3D localization. Another important takeaway is that egocentric action understanding only requires a sparse encoding of geometric information. Moreover, without a careful model design, the 3D environment representation yields only minor improvement in (or even decreases) the overall performance, as reported in Table 3.

5.3 Generalization to Novel Environment

| Method | Mean Cls Acc (Action Recognition) | Top-1 Acc (Action Recognition) | Prec (3D Action Localization) | Recall (3D Action Localization) | F1 (3D Action Localization) |
|---|---|---|---|---|---|
| I3D-Res50 | 29.24 | 52.22 | 6.20 | 45.14 | 10.90 |
| I3D+2DGround | 30.06 | 53.87 | 6.95 | 41.27 | 11.90 |
| Ours | 30.89 | 54.93 | 7.26 | 45.83 | 12.54 |

Table 4: Experimental results on the unseen environment split. Our model shows the capacity to generalize better to an unseen environment with a known 3D map.

We further present experimental results on the unseen environment split in Table 4. Ours outperforms I3D-Res50 and I3D+2DGround on both action recognition and 3D action localization. These results suggest that explicitly modeling the 3D environment context can improve the generalization ability to unseen environments with known 3D maps. However, the performance gap is smaller than the performance boost on the seen environment split. We speculate that this is because we only have two different environments for training, and therefore the risk of overfitting is amplified on the unseen environment split.

5.4 Discussion

We further visualize our results, and discuss the limitation of our method and potential future work.

Visualization of Action Location. We visualize our results on the seen environment split. Specifically, we project the 3D saliency map of the action location onto the top-down view of the 3D environments. As shown in Fig. 3, our model can effectively localize the coarse action location and thereby select the region of interest from the global environment features for action recognition. By examining the failure cases, we found that the model tends to fail when the video features are not discriminative enough (for example, when the camera wearer is standing close to a white wall).

Limitation and Future Work. One limitation of our method is the requirement of a high-quality 3D environment reconstruction with object annotations. However, we conjecture that 3D object detection algorithms [70], semantic structure from motion [28] and 3D scene graphs [2] can be used to replace the human annotation, since our current volumetric representation only adopts a low-resolution semantic occupancy map as the environment descriptor. We plan to explore this direction in future work.

Another limitation is the potential error in 2D-to-3D camera registration, as discussed in Sec. 4. Currently, only camera poses from key video frames can be robustly estimated. Our method thus does not model the location shift within the same action. We argue that camera registration can be drastically improved with the help of additional sensors (IMU or depth camera). Incorporating those sensors into the egocentric capture setting is an exciting future direction. In addition, our method does not consider the camera orientation. We leave this for future efforts.

6 Conclusion

We introduced a novel deep model that makes use of egocentric videos and a 3D map to address the task of joint action recognition and 3D localization. The key innovation of our model is to characterize the 3D action location as a latent variable, which is used to select the surrounding local environment features for action recognition. Our key insight is that the 3D geometric and semantic context of the surrounding environment provides critical information that complements video features for action understanding. We believe our work provides a critical first step towards understanding actions in the context of a 3D environment, and points to exciting future directions connecting egocentric vision, action recognition, and 3D scene understanding.


  • [1] A. Alahi, R. Ortiz, and P. Vandergheynst (2012) Freak: fast retina keypoint. In CVPR, Cited by: §H.
  • [2] I. Armeni, Z. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese (2019) 3D scene graph: a structure for unified semantics, 3d space, and camera. In ICCV, Cited by: §5.4.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, Cited by: §3.5.
  • [4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In CVPR, Cited by: §2.
  • [5] D. Damen, H. Doughty, G. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. (2020) The epic-kitchens dataset: collection, challenges and baselines. IEEE Computer Architecture Letters (01), pp. 1–1. Cited by: §4, §4, §5.
  • [6] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2018) Scaling egocentric vision: the epic-kitchens dataset. In ECCV, Cited by: §H.
  • [7] V. Delaitre, D. F. Fouhey, I. Laptev, J. Sivic, A. Gupta, and A. A. Efros (2012) Scene semantics from long-term observation of people. In ECCV, Cited by: §2.
  • [8] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner (2017) Vote3deep: fast object detection in 3d point clouds using efficient convolutional neural networks. In ICRA, Cited by: §2.
  • [9] K. Fang, T. Wu, D. Yang, S. Savarese, and J. J. Lim (2018) Demo2vec: reasoning object affordances from online videos. In CVPR, Cited by: §2.
  • [10] L. Fiore, D. Fehr, R. Bodor, A. Drenner, G. Somasundaram, and N. Papanikolopoulos (2008) Multi-camera human activity monitoring. Journal of Intelligent and Robotic Systems 52 (1), pp. 5–43. Cited by: §1.
  • [11] D. F. Fouhey, V. Delaitre, A. Gupta, A. A. Efros, I. Laptev, and J. Sivic (2014) People watching: human actions as a cue for single view geometry. IJCV. Cited by: §2.
  • [12] J. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y. Jen, E. Dunn, B. Clipp, S. Lazebnik, et al. (2010) Building rome on a cloudless day. In ECCV, Cited by: §4.
  • [13] A. Furnari and G. M. Farinella (2019) What would you expect? anticipating egocentric actions with rolling-unrolling LSTMs and modality attention.. In ICCV, Cited by: §2, §5.1.
  • [14] D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra (2019) Splitnet: sim2sim and task2task transfer for embodied visual navigation. In ICCV, Cited by: §2.
  • [15] H. Grabner, J. Gall, and L. Van Gool (2011) What makes a chair a chair?. In CVPR, Cited by: §2.
  • [16] J. Guan, Y. Yuan, K. M. Kitani, and N. Rhinehart (2020) Generative hybrid representations for activity forecasting with no-regret learning. In CVPR, Cited by: §2.
  • [17] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert (2011) From 3d scene geometry to human workspace. In CVPR, Cited by: §1, §2.
  • [18] M. Hassan, V. Choutas, D. Tzionas, and M. J. Black (2019) Resolving 3D human pose ambiguities with 3D scene constraints. In ICCV, Cited by: §2, §2.
  • [19] J. F. Henriques and A. Vedaldi (2018) MapNet: an allocentric spatial memory for mapping environments. In CVPR, Cited by: §2.
  • [20] Y. Huang, M. Cai, Z. Li, and Y. Sato (2018) Predicting gaze in egocentric video by learning task-dependent attention transition. In ECCV, Cited by: §2.
  • [21] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §G.
  • [22] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR, Cited by: §3.2.
  • [23] Y. Jiang, H. Koppula, and A. Saxena (2013) Hallucinated humans as the hidden context for labeling 3d scenes. In CVPR, Cited by: §2.
  • [24] Y. Jiang, M. Lim, and A. Saxena (2012) Learning object arrangements in 3d scenes using human context. In ICML, Cited by: §2.
  • [25] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen (2019) Epic-fusion: audio-visual temporal binding for egocentric action recognition. In ICCV, Cited by: §1.
  • [26] Q. Ke, M. Fritz, and B. Schiele (2019) Time-conditioned action anticipation in one shot. In CVPR, Cited by: §2.
  • [27] H. S. Koppula, R. Gupta, and A. Saxena (2013) Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research 32 (8), pp. 951–970. Cited by: §1.
  • [28] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg (2014) Joint semantic segmentation and 3d reconstruction from monocular video. In ECCV, Cited by: §5.4.
  • [29] J. Li, X. Wang, S. Tang, H. Shi, F. Wu, Y. Zhuang, and W. Y. Wang (2020) Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In CVPR, Cited by: §2.
  • [30] Y. Li, A. Fathi, and J. M. Rehg (2013) Learning to predict gaze in egocentric video. In ICCV, Cited by: §2.
  • [31] Y. Li, M. Liu, and J. M. Rehg (2018) In the eye of beholder: joint learning of gaze and actions in first person video. In ECCV, Cited by: §1, §2, §5.
  • [32] Y. Li, M. Liu, and J. M. Rehg (2021) In the eye of the beholder: gaze and actions in first person video. TPAMI. Cited by: §3.2, §4, §5, §H.
  • [33] Y. Li, Z. Ye, and J. M. Rehg (2015) Delving into egocentric actions. In CVPR, Cited by: §1, §2.
  • [34] J. Liu, A. Shahroudy, M. Perez, G. Wang, L. Duan, and A. C. Kot (2019) Ntu rgb+ d 120: a large-scale benchmark for 3d human activity understanding. TPAMI. Cited by: §1.
  • [35] M. Liu, S. Tang, Y. Li, and J. Rehg (2020) Forecasting human object interaction: joint prediction of motor attention and actions in first person video. In ECCV, Cited by: §3.2.
  • [36] M. Liu, D. Yang, Y. Zhang, Z. Cui, J. M. Rehg, and S. Tang (2020) 4D human body capture from egocentric video via 3d scene grounding. arXiv preprint arXiv:2011.13341. Cited by: §2.
  • [37] M. Ma, H. Fan, and K. M. Kitani (2016) Going deeper into first-person activity recognition. In CVPR, Cited by: §1, §2.
  • [38] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR, Cited by: §3.2.
  • [39] D. Moltisanti, M. Wray, W. Mayol-Cuevas, and D. Damen (2017) Trespassing the boundaries: labeling temporal bounds for object interactions in egocentric video. In ICCV, Cited by: §1, §2.
  • [40] T. Nagarajan, C. Feichtenhofer, and K. Grauman (2019) Grounded human-object interaction hotspots from video. In ICCV, Cited by: §2.
  • [41] T. Nagarajan, Y. Li, C. Feichtenhofer, and K. Grauman (2020) EGO-topo: environment affordances from egocentric video. In CVPR, Cited by: §1, §1, §2, §5.1.
  • [42] E. Ng, D. Xiang, H. Joo, and K. Grauman (2020) You2me: inferring body pose in egocentric video via first and second person interactions. In CVPR, Cited by: §2.
  • [43] H. Park, E. Jain, and Y. Sheikh (2012) 3d social saliency from head-mounted cameras. NeurIPS. Cited by: §2.
  • [44] H. Pirsiavash and D. Ramanan (2012) Detecting activities of daily living in first-person camera views. In CVPR, Cited by: §1, §2.
  • [45] Y. Poleg, C. Arora, and S. Peleg (2014) Head motion signatures from egocentric videos. In ACCV, Cited by: §2.
  • [46] Y. Poleg, C. Arora, and S. Peleg (2014) Temporal segmentation of egocentric videos. In CVPR, Cited by: §1, §2.
  • [47] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In CVPR, Cited by: §2.
  • [48] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, Cited by: §2.
  • [49] N. Rhinehart and K. M. Kitani (2017) First-person activity forecasting with online inverse reinforcement learning. In ICCV, Cited by: §2.
  • [50] N. Rhinehart and K. M. Kitani (2016) Learning action maps of large environments via first-person vision. In CVPR, Cited by: §1, §1, §2, §5.1.
  • [51] T. Sattler, B. Leibe, and L. Kobbelt (2012) Improving image-based localization by active correspondence search. In ECCV, Cited by: §1, §4.
  • [52] M. Savva, A. X. Chang, P. Hanrahan, M. Fisher, and M. Nießner (2014) SceneGrok: inferring action maps in 3d environments. TOG. Cited by: §1, §2.
  • [53] J. L. Schönberger and J. Frahm (2016) Structure-from-motion revisited. In CVPR, Cited by: §4, §4.
  • [54] G. Serra, M. Camurri, L. Baraldi, M. Benedetti, and R. Cucchiara (2013) Hand segmentation for gesture recognition in ego-vision. In Proceedings of the 3rd ACM international workshop on Interactive multimedia on mobile & portable devices, pp. 31–36. Cited by: §2.
  • [55] Y. Shen, B. Ni, Z. Li, and N. Zhuang (2018) Egocentric activity prediction via event modulated attention. In ECCV, Cited by: §2.
  • [56] S. Song and J. Xiao (2014) Sliding shapes for 3d object detection in depth images. In ECCV, Cited by: §2.
  • [57] S. Song and J. Xiao (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, Cited by: §2.
  • [58] H. Soo Park, J. Hwang, Y. Niu, and J. Shi (2016) Egocentric future localization. In CVPR, Cited by: §2.
  • [59] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe (2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797. Cited by: §4, §H.
  • [60] D. Z. Wang and I. Posner (2015) Voting for voting in online point cloud object detection. In RSS, Cited by: §2.
  • [61] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §3.5, §5.1, §G.
  • [62] W. Wu, Z. Qi, and L. Fuxin (2019) Pointconv: deep convolutional networks on 3d point clouds. In CVPR, Cited by: §2.
  • [63] W. Wu, Z. Y. Wang, Z. Li, W. Liu, and L. Fuxin (2020) PointPWC-net: cost volume on point clouds for (self-) supervised scene flow estimation. In ECCV, Cited by: §2.
  • [64] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: §5.1.
  • [65] B. Yang, W. Luo, and R. Urtasun (2018) Pixor: real-time 3d object detection from point clouds. In CVPR, Cited by: §2.
  • [66] M. Zhang, K. Teck Ma, J. Hwee Lim, Q. Zhao, and J. Feng (2017) Deep future gaze: gaze anticipation on egocentric videos using adversarial networks. In CVPR, Cited by: §2.
  • [67] S. Zhang, Y. Zhang, Q. Ma, M. J. Black, and S. Tang (2020) PLACE: proximity learning of articulation and contact in 3D environments. In 3DV, Cited by: §2, §2.
  • [68] Y. Zhang, M. Hassan, H. Neumann, M. J. Black, and S. Tang (2020) Generating 3d people in scenes without people. In CVPR, Cited by: §2, §2.
  • [69] Y. Zhou, B. Ni, R. Hong, X. Yang, and Q. Tian (2016) Cascaded interactional targeting network for egocentric video analysis. In CVPR, Cited by: §1, §2.
  • [70] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In CVPR, Cited by: §2, §5.1, §5.4.

G Implementation Details

Data Processing. We resize all video frames to the short edge size of 256. For the coarse 3D map, we adopt a resolution of for parent voxel, and for children voxels. For training, our model takes an input of 8 frames (temporal sampling rate of 8) with a resolution of 224x224, matching the backbone input size listed in Table 5. For inference, our model samples 30 clips from a video (3 along the spatial dimension and 10 in time). Each clip has 8 frames with a resolution of . We average the scores of all sampled clips for video-level prediction.
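The inference protocol described above (10 temporal positions times 3 spatial crops, followed by score averaging) can be sketched as follows. This is a minimal illustration in plain Python; the helper names `clip_grid` and `video_level_scores` and the toy scores are hypothetical, and the actual cropping and model forward pass are omitted.

```python
def clip_grid(num_temporal=10, num_spatial=3):
    """Enumerate (temporal index, spatial crop index) pairs for test-time sampling."""
    return [(t, s) for t in range(num_temporal) for s in range(num_spatial)]


def video_level_scores(clip_scores):
    """Average per-clip class scores into a single video-level score vector."""
    num_clips = len(clip_scores)
    num_classes = len(clip_scores[0])
    return [
        sum(clip[c] for clip in clip_scores) / num_clips
        for c in range(num_classes)
    ]


print(len(clip_grid()))  # number of clips sampled per video

# Toy example: two clips, two classes.
scores = video_level_scores([[0.2, 0.8], [0.6, 0.4]])
predicted_class = max(range(len(scores)), key=lambda c: scores[c])
print(scores, predicted_class)
```

Averaging over clips reduces the dependence of the final prediction on any single temporal window or spatial crop.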

Training Details. Our model is trained using SGD with momentum 0.9 and batch size 64 on 4 GPUs. The initial learning rate is 0.0375 with cosine decay. We set weight decay to 1e-4 and enable batch norm [21]. To avoid overfitting, we adopt several data augmentation techniques, including random flipping, rotation, cropping and color jittering.
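The cosine learning rate decay mentioned above can be written as a one-line schedule. The sketch below uses the stated initial learning rate of 0.0375; the total number of training steps and the final learning rate of zero are assumptions made for illustration, since they are not specified here.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.0375, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule length, for illustration only.
total_steps = 100
for step in (0, 25, 50, 75, 100):
    print(step, round(cosine_lr(step, total_steps), 6))
```

The schedule starts at the initial learning rate, reaches half of it at the midpoint of training, and decays smoothly to the minimum value at the end.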

Network Architecture Details. We present the architectures of our video pathway using I3D Res50 [61] backbone and our environment pathway network (EnvNet) in Table 5.

H Dataset Details

This section introduces details of our new egocentric video dataset. We show additional rendered images of the 3D reconstructions, and present additional qualitative results of the key frame camera registration. Finally, we plot the distribution of activity categories in our dataset.

Camera Registration. We provide qualitative results of key frame egocentric camera registration in Fig. 4. Specifically, we use the estimated camera pose to render the egocentric view of the 3D environment. When the camera registration fails (insufficient inliers are matched by RANSAC), we assign a dummy result at the world origin, which results in a completely wrong rendered view of the 3D environment, as in the last row of Fig. 4. We also visualize the matched feature points [1] to help readers better interpret the RANSAC-based matching method for camera registration introduced in Sec. 4 of the main paper. Notably, registering the egocentric camera into a 3D environment based only on images remains a major challenge. Motion blur, foreground occlusion, changes of illumination, and featureless surfaces in indoor capture are the main causes of failure in 2D-to-3D registration. To mitigate these failure cases, we only consider camera relocalization using key frames to approximate the ground truth of 3D action locations, and do not model how the location might shift over time within the same action. Note that if an entire action clip does not have a single robust key frame registration result, we adopt a uniform distribution for the 3D action location prior during training.
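To make the uniform-prior fallback concrete, the sketch below evaluates a KL divergence between a uniform prior over the 4x14x14 location map (the size listed for the action location branch in Table 5) and two hypothetical predictions. The direction of the KL term and the helper names are assumptions for illustration; Table 5 only states that a KLD loss is attached to this branch.

```python
import math


def uniform_prior(num_cells):
    """Uniform distribution over all candidate map cells."""
    return [1.0 / num_cells] * num_cells


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same cells."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


num_cells = 4 * 14 * 14  # cells in the 4x14x14x1 location map
prior = uniform_prior(num_cells)

# Two hypothetical predictions: a diffused map and a sharply peaked map.
diffused = uniform_prior(num_cells)
peaked = [0.9] + [0.1 / (num_cells - 1)] * (num_cells - 1)

print(kl_divergence(prior, diffused))  # no penalty for a diffused prediction
print(kl_divergence(prior, peaked))    # clearly positive penalty for a peaked one
```

Under a uniform prior, a spread-out prediction incurs no penalty while a sharply peaked one does, so clips without a reliable registration result encourage the model towards diffused location maps.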

Figure 4: Qualitative Results for Camera Registration. We present both green successful cases and red failure cases of the camera registration results by using the estimated camera pose to render the egocentric view of the 3D environment reconstructions.

3D Environment Reconstruction. Our dataset captures the 3D reconstructions of three different living rooms using a state-of-the-art dense reconstruction system [59]. We provide more rendered images of the 3D reconstructions in Fig. 5.

Figure 5: Rendered views of the 3D environment reconstructions captured in our dataset.

Activity Distribution. We present the distribution of egocentric activities in our dataset in Fig. 6. Similar to [6, 32], our dataset has a "long-tailed" distribution that characterizes naturalistic human behavior. For example, the activity category "Pick up Book from Floor" occurs 1400 times, while the tail category "Put Painting" occurs only 9 times. Mean Class Accuracy provides a metric that is not biased towards frequently occurring categories, and is thus better suited than Top-1 accuracy for activity recognition on our dataset. We highlight that our model outperforms the I3D baseline on Mean Class Accuracy (41.64 vs. 37.48 on the seen split, Table 3 of the main paper).
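The difference between the two metrics on a long-tailed label distribution can be seen in a small example. The sketch below uses toy labels and a hypothetical classifier that always predicts the frequent class; the function names are illustrative and the counts are not taken from the dataset.

```python
from collections import defaultdict


def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy long-tailed data: nine samples of a head class, one of a tail class.
y_true = ["pick up book"] * 9 + ["put painting"]
y_pred = ["pick up book"] * 10  # always predicts the head class

print(top1_accuracy(y_true, y_pred))
print(mean_class_accuracy(y_true, y_pred))
```

In this toy case Top-1 accuracy is 0.9 while Mean Class Accuracy is 0.5, because the tail class is never recognized; this is the bias towards frequent categories that Mean Class Accuracy avoids.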

Figure 6: Long tailed activity distribution in our dataset.

I Analysis of Table 1 in Main Paper

We provide additional analysis of our method. Specifically, Fig. 7 compares the activity classes on which our model significantly outperforms I3D and I3D+2DGround. Interestingly, our model is better at classes where the video features might not be discriminative enough for recognition, for example "Pick up Book from Floor" vs. "Put Book on Shelf", or "Pick up Poster" vs. "Stamp Poster". Moreover, the contextual features from the 2D ground plane provide limited information for understanding those egocentric activities. We conjecture that our method makes use of environment features surrounding the predicted 3D action location to complement video features for activity recognition.

Figure 7: A closer look at the experiment results comparison in Table 1 of main paper.
Figure 8: Additional visualization of predicted 3D action location and action labels.

J Additional Qualitative Results

Finally, we provide additional qualitative results. As shown in Fig. 8, we present predicted 3D action location and action labels. The figure follows the same format as Fig. 3 in the main paper. Those results suggest that our model can effectively localize the action location and thereby more accurately predict the action labels.

Another interesting observation is that the model may output a "diffused" heatmap when the foreground active objects take up the majority of the video frames (right column of Fig. 8). This is because the model receives a uniform prior as the supervisory signal when the camera registration fails for an action clip. In these cases, our model opts for predicting a diffused heatmap of the action location to avoid missing important environment features. In doing so, our model might still be able to successfully predict the action labels, despite the failure of camera registration.

| ID | Branch | Type | Kernel Size THW,(C) | Stride THW | Output Size THWC | Comments (Loss) |
|---|---|---|---|---|---|---|
| 1 | Backbone (Input Size: 8x224x224x3) | Conv3D | 5x7x7,64 | 1x2x2 | 8x112x112x64 | |
| 2 | | MaxPool1 | 1x3x3 | 1x2x2 | 8x56x56x64 | |
| | | Bottleneck 0-2 (3 times) | | | | |
| 4 | | MaxPool2 | 2x1x1 | 2x1x1 | 4x56x56x256 | |
| | | Bottleneck 0, Bottleneck 1-3 (3 times) | | | | Concatenation with EnvNet Features for 3D Action Location Prediction |
| | | Bottleneck 0, Bottleneck 1-5 (5 times) | | | | |
| | | Bottleneck 0, Bottleneck 1-2 (2 times) | | | | Concatenation with EnvNet Features for Activity Recognition |
| 11 | EnvNet (Input Size: 8x28x28x64) | Conv3d 1 | 3x3x3,256 | 2x1x1 | 4x28x28x256 | |
| 12 | | Conv3d 2 | 1x3x3,512 | 1x1x1 | 4x28x28x512 | Concatenation with Video Features for 3D Action Location Prediction |
| | Action Location Branch | Conv3d 1 | 1x3x3,512 | 1x2x2 | 4x14x14x512 | |
| | Action Location Branch | Conv3d 2 | 1x3x3,1 | 1x1x1 | 4x14x14x1 | KLD Loss |
| | | Gumbel Softmax 1 | | | 4x14x14x1 | Sampling 3D Action Location |
| 15 | | Maxpool | 2x1x1 | 2x1x1 | 4x14x14x512 | |
| 16 | | Conv3d 3 | 1x3x3,1024 | 1x1x1 | 4x14x14x1024 | |
| 17 | | Weighted Avg Pooling | | | 4x14x14x1024 | Guided by Sampled 3D Action Location |
| 18 | | Conv3d 4 | 1x3x3,1024 | 1x3x3 | 4x7x7x1024 | Concatenation with Video Features for Activity Recognition |
| 22 | Recognition Network | Weighted Avg Pooling | 4x7x7 | 4x7x7 | 1x1x1x1024 | Fused Environmental and Video features |
| 23 | | Fully Connected | | | 1x1x1xN | |
| 24 | | Softmax | | | 1x1x1xN | Cross Entropy Loss |

Table 5: Network architecture of our two-pathway network. We omit the residual connections in the ResNet-50 backbone for simplicity.