In the Eye of the Beholder: Gaze and Actions in First Person Video

05/31/2020 · by Yin Li, et al. · University of Wisconsin-Madison and Georgia Institute of Technology

We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a head-worn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our method describes the participant's gaze as a probabilistic variable and models its distribution using stochastic units in a deep network. We further sample from these stochastic units, generating an attention map to guide the aggregation of visual features for action recognition. Our method is evaluated on our EGTEA Gaze+ dataset and achieves a performance level that exceeds the state-of-the-art by a significant margin. More importantly, we demonstrate that our model can be applied to a larger scale FPV dataset—EPIC-Kitchens—even without using gaze, offering new state-of-the-art results on FPV action recognition.


1 Introduction

Advances in sensor miniaturization, low-power computing, and battery life have enabled the first generation of mainstream wearable cameras. Millions of hours of videos are captured by these devices every year, creating a record of our daily visual experience at an unprecedented scale. This has created a major opportunity to develop new capabilities and applications for computer vision. We have witnessed a surge of interest in the automatic analysis of visual data captured from wearable cameras, also known as First Person Vision (FPV) [32] or Egocentric Vision. Examples include first person action and activity recognition [43, 48, 56, 71, 35, 64, 95, 76], first person gaze estimation and prediction [41, 98, 53, 52], first person pose estimation [61, 31], and first person video summarization [94, 38, 59].

We address the problem of joint gaze estimation and action recognition in FPV. Our daily interaction with objects is guided by a sequence of carefully orchestrated fixations. It is thus critical to study the links between gaze and actions. We argue that “where we look” reveals important information about “what we do.” Consider the examples in Figure 1, where only small regions around the first person’s point of gaze are shown. What is this person doing? We can easily identify the actions as “squeeze liquid soap into hand” and “cut tomato,” in spite of the fact that most of the pixels are missing. This is possible because egocentric gaze serves as an index into the critical regions of the video that define the action. Focusing on these regions eliminates the potential distraction of irrelevant background pixels, and allows us to focus on the key elements of the action. In this case, attention is naturally embodied in the camera wearer’s actions. Thus, FPV provides the ideal vehicle for studying the joint modeling of attention and action.

Fig. 1: Can you tell what the person is doing? (examples taken from our EGTEA Gaze+ dataset) With only a small fraction of the pixels visible, centered around the point of gaze, we can easily recognize the camera wearer’s actions. The gaze indexes key regions containing interactions with objects. We leverage this intuition and develop a model to jointly infer gaze and actions in First Person Vision.

To study attention and action in FPV, we extend our previous work [14, 43] and introduce the Extended GTEA Gaze+ (EGTEA Gaze+) dataset—a new FPV action dataset of meal preparation tasks captured in a naturalistic kitchen environment. Our work is related to recent efforts [10] to create large scale benchmarks for FPV action recognition. However, in contrast to concurrent work, such as EPIC-Kitchens [10] and Charades-Ego [68], our dataset not only includes FPV videos with fine-grained action annotations, but also provides first person gaze tracking data and egocentric hand masks. We believe our EGTEA Gaze+ dataset offers the most comprehensive benchmark for FPV to date. Our dataset is publicly available at http://cbi.gatech.edu/fpv.

Moving beyond the dataset, a major challenge for the joint modeling of egocentric gaze and action is the uncertainty in gaze measurements. A significant portion of egocentric gaze events are irrelevant to the actions. For instance, a substantial fraction [22] of our gaze events during daily actions are saccades—rapid gaze jumps during which our visual system receives no input [6]. Within the gaze events that remain, it is not clear what portion of the fixations correspond to overt attention and are therefore meaningfully connected to actions [29]. In addition, there are small but non-negligible measurement errors in the eye tracker itself [21]. It follows that a joint model of attention and actions must account for the uncertainty of gaze. What model should we use to represent this uncertainty?

Our inspiration comes from the observation that gaze can be characterized by a latent distribution of attention in the context of an action, represented as an attention map in egocentric coordinates. This map identifies image regions that are salient to the current action, such as hands, objects, and surfaces. We model gaze measurements as samples from the attention map distribution. Given gaze measurements obtained during the production of actions, we can directly learn a model for the attention map, which can in turn guide action recognition. Our action recognition model can then focus on action-relevant regions to determine what the person is doing. The attention model is thus tightly coupled with the recognition of actions. Building on this intuition, we develop a deep network with a latent variable attention model and an attention mechanism for recognition.

To this end, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Specifically, we model the latent distribution of gaze as stochastic units in a deep network. This representation allows us to sample attention maps from the gaze distribution. These maps are further used to selectively aggregate relevant visual features in space and time for action recognition. Our model thus encodes the uncertainty in gaze measurement, and models visual attention in the context of actions. When gaze measurement is available during training, we train the model in an end-to-end fashion using action labels and noisy gaze measurements as supervision. When gaze measurement is not available, we explore training our model using a simple prior distribution. At test time, our model receives only an input video and is able to infer both gaze and action.

We first evaluate our model on the new EGTEA Gaze+ dataset. As a consequence of jointly modeling gaze and actions, we obtain results for action recognition that outperform state-of-the-art deep models by a significant margin on EGTEA Gaze+. Our gaze estimation accuracy is also comparable with several strong baseline methods on EGTEA Gaze+. More importantly, we demonstrate that the model architecture developed on EGTEA Gaze+ can be effectively transferred to a larger scale FPV action dataset—EPIC-Kitchens. We develop a version of our model that does not require direct gaze supervision, yet uses a uniform prior of attention for training. Our joint model achieves state-of-the-art results on the challenging EPIC-Kitchens dataset.

We summarize our key contributions as follows:

  • We present the EGTEA Gaze+ dataset, the most comprehensive FPV dataset with gaze tracking data, annotated fine-grained actions and hand masks. We establish a solid benchmark for FPV action recognition using the proposed dataset. We believe our dataset and benchmark will provide a major resource for the community.

  • We propose a novel deep model for joint gaze estimation and action recognition in FPV. At the core of our model lies the probabilistic modeling of visual attention using stochastic units in a deep network. To the best of our knowledge, this is the first work to model uncertainty in gaze measurements for action recognition, and the first deep model for joint gaze estimation and action recognition in FPV.

  • Our method achieves state-of-the-art results on the new EGTEA Gaze+ dataset. More importantly, we demonstrate that our model can be applied to EPIC-Kitchens—the largest FPV dataset—even without using gaze, leading to new state-of-the-art results.

A preliminary version of this paper was discussed in the first author’s thesis [44] and also appeared in ECCV’18 [42]. In this journal version, we added a detailed description of the dataset effort and substantially improved our results on EGTEA Gaze+. More importantly, we demonstrated that our attention model is very general and can achieve competitive results on a larger scale FPV dataset (EPIC-Kitchens), even without using human gaze for training.

Our paper is organized as follows. Section 2 covers related work on first person vision. Section 3 presents our dataset. Section 4 describes our model for joint gaze estimation and action recognition. Section 5 details our experiments. Finally, Section 6 summarizes our findings and discusses future directions for FPV.

2 Related Work

2.1 Action Recognition

We briefly review the literature on action recognition in computer vision. The main property of this work is that it assumes a third person view of one or more individuals, e.g., as would be captured by a surveillance camera, and asks what the individuals are doing. A thorough survey of this previous work is beyond our scope, and we refer the readers to recent survey papers [2, 84] for a comprehensive description. We discuss relevant work on the development of deep models and the use of attentional cues for recognizing actions.

Deep Models for Action Recognition. Deep models have shown recent success in action recognition. For example, Simonyan and Zisserman [70] proposed the two-stream network that learns to recognize an action from both optical flow fields and video frames. Wang et al. [87] further extended the two-stream network to model multiple temporal segments within a video. Du et al. [81] replaced 2D convolutions with spatiotemporal convolutions and learned a 3D convolutional neural network for action recognition. Carreira and Zisserman further proposed a two-stream 3D convolutional network for action recognition [7]. Du et al. [83] and Xie et al. [93] explored different architectures of 3D convolutional networks. CNN features can also be combined with either tracked local descriptors [86] or human pose estimation [9]. Our model builds on the latest development of two-stream 3D convolutional networks [8] to recognize actions in FPV. Our technical novelty is to incorporate stochastic units to model egocentric gaze.

Attention for Actions. Human gaze provides useful signals for the location of actions. This intuition has been explored for action recognition in domains outside of FPV. For example, Mathe and Sminchisescu [50] proposed to recognize actions by sampling local descriptors from a predicted saliency map. Liu et al. [46] considered motor attention as a first-class player for egocentric action anticipation. Shapovalova et al. [66] presented a method that uses human gaze for learning to localize actions. However, these methods did not use deep models. Recently, Shikhar et al. [67] incorporated soft attention into a deep recurrent network for recognizing actions. However, their notion of attention is defined by discriminative image regions that are not derived from gaze measurements, and therefore cannot support the joint inference of egocentric gaze and actions.

Our method shares a key intuition with [50, 66] by using predicted gaze to select visual features. However, unlike [50, 66], our attention model is integrated within a deep network and trained end-to-end. Our model is also similar to [67] in that we also design an attention mechanism that facilitates end-to-end training. However, unlike [67], attention is modeled as stochastic units in our model and receives supervision from either noisy human gaze measurements or a prior distribution.

Datasets for Action Recognition. A major driving force behind the recent advances in action recognition is the development of large-scale video datasets and benchmarks. Examples include UCF101 [73], HMDB [36], and the more recent 20BN-Something-Something [20], Charades [69] and Kinetics [7], where tens of thousands of video clips were collected from the Internet and annotated manually. Meanwhile, previous FPV action datasets, including our own work [14], lag behind in terms of the number of samples. Similar to the concurrent work of EPIC-Kitchens [10] and Charades-Ego [68], our work seeks to bridge this gap. Different from existing efforts, our key focus is to provide a more comprehensive set of signals for attention and action in FPV. Specifically, the EGTEA Gaze+ dataset includes FPV videos, gaze tracking, action annotations and hand masks. We describe our effort on EGTEA Gaze+ in Sec. 3.

Another highly relevant dataset is the MPII-Cooking dataset [63]. Both datasets focus on cooking activities, with MPII following a conventional third-person paradigm. Our dataset, in contrast, was captured from the first person perspective, and it offers the largest benchmark for FPV action recognition, gaze estimation and hand segmentation. Our dataset is also related to the ADL dataset from Pirsiavash and Ramanan [56], where they collected and annotated 10 hours of FPV videos. However, ADL targets complex activities (Activities of Daily Living) and is substantially smaller than our dataset in terms of the number of instances. We believe that our EGTEA Gaze+ dataset can serve as a major resource for the community to further advance the understanding of attention and actions in FPV.

2.2 First Person Vision

We now describe the emerging field of first person vision and its related work. In this section, we focus on action and activity recognition in FPV. Other efforts include egocentric gaze estimation [41, 52, 53], hand analysis [39, 25, 62, 16, 80], pose estimation [61, 31], physiological parameter estimation [23, 51], user identification [24, 57, 96] and video summarization [94, 38, 59]. A recent survey of this literature can be found in [4].

FPV Gaze. Gaze estimation is well studied in computer vision [5]. Several recent works have addressed the problem of egocentric gaze estimation. Our previous work [41] estimated egocentric gaze using hand and head cues. Zhang et al. [97] predicted future gaze by estimating gaze from predicted future frames. Huang et al. [26] modeled the transition of attention for FPV gaze estimation. Park et al. [52] considered 3D social gaze from multiple camera wearers. However, these works did not model egocentric gaze in the context of actions.

FPV Actions. FPV action has been the subject of many recent efforts. Spriggs et al. [74] proposed to segment and recognize daily activities using a combination of video and wearable sensor data. Kitani et al. [35] used a global motion descriptor to discover egocentric actions. Fathi et al. [12] presented a joint model of objects, actions and activities. Pirsiavash and Ramanan [55] further advocated for an object-centric representation of FPV activities. Other efforts included the modeling of conversations [11] and reactions [95, 40] in social interactions.

Several recent works have developed deep models for FPV action recognition. Ryoo et al. [65] developed a novel pooling method for deep models. Poleg et al. [58] used temporal convolutions on motion fields for long-term activity recognition. Kazakos et al. [33] proposed to fuse video and audio signals for FPV action recognition. Wray et al. [91] made use of text descriptions of FPV actions for zero-shot learning. In contrast to our work, these prior works did not consider using egocentric gaze for action recognition.

FPV Gaze and Actions. There have been a few works that incorporate egocentric gaze for FPV action recognition. For example, our previous work [43] showed the benefits of gaze-indexed visual features in a comprehensive benchmark. Both Singh et al. [71] and Ma et al. [48] explored the use of multi-stream networks to capture egocentric attention. These works have clearly demonstrated the advantage of using egocentric gaze for FPV actions. However, they all model FPV gaze and actions separately rather than jointly, and they do not address the uncertainty in gaze. Moreover, these methods require side information in addition to the input image at testing time, e.g., hand masks [71, 43] or object information [48]. More recently, Sudhakaran et al. [78, 77] presented LSTM models with soft attention for FPV action recognition, yet their methods did not consider human gaze. In contrast, our method jointly models gaze and actions, captures the uncertainty of gaze, and requires only video inputs during testing.

Our previous work [13] presented a joint model for FPV gaze and actions. This work extends [13] in multiple aspects: (1) we propose an end-to-end deep model rather than using hand crafted features; (2) we explicitly model “noise” in gaze measurements while [13] did not; (3) we infer gaze and action jointly through a single pass during testing while [13] used iterative inference. In a nutshell, we model gaze as a stochastic variable via a novel deep architecture for joint gaze estimation and action recognition. Our model thus combines the benefits of latent variable modeling with the expressive power of a learned feature representation. Consequently, we show that our method can outperform the latest deep models [8] for FPV action recognition.

3 The Extended GTEA Gaze+ Dataset

We start by presenting our work on creating the EGTEA Gaze+ dataset—a major expansion of our previous GTEA Gaze+ dataset [14, 43]. Specifically, our new dataset contains more than 28 hours of video, roughly three times larger than GTEA Gaze+. These videos come from 86 unique sessions of 32 subjects performing 7 different meal preparation tasks. Our new dataset subsumes GTEA Gaze+ as a subset, yet with revised annotations via our new annotation pipeline. The final dataset comes with videos, gaze tracking, action annotations and hand masks. We hope our dataset will serve as a major vehicle for understanding first person gaze and actions. We now describe our data collection and annotation process.

3.1 Data Collection

The dataset was collected in the kitchen area of the Aware Home (http://www.awarehome.gatech.edu/) on the Georgia Tech campus. This kitchen area provides a naturalistic household environment that contains standard appliances, furnishings and food. All participants were recruited from Georgia Tech. Written consent was obtained from each participant, such that their FPV recordings can be used and shared for research purposes. The study was approved by the Institutional Review Board (IRB). We recorded videos, audio and gaze tracking data using SMI eye-tracking glasses (https://imotions.com/hardware/smi-eye-tracking-glasses/); see Figure 2 (left). We now describe our study protocol and how we de-identify the videos.

Protocol. At the beginning of each session, a researcher introduced the protocol, presented a target recipe (7 recipes were considered, including breakfast (scrambled eggs), snack (peanut-butter sandwich), turkey sandwich, pizza, Greek salad, pasta salad and cheese burger) to the participant, and answered any questions they may have had. Each recipe included detailed key steps of the dish, such as “fill a small pot with water”. The participants were given a few minutes to explore the kitchen and to go through the recipe. The same researcher then helped the participant put on the eye-tracking glasses, which were connected to a host laptop. A calibration of the eye tracker was first performed by asking the participant to stand still and look at a few landmarks in the kitchen. After the calibration, the researcher helped the participant put on a backpack with the host laptop inside. The participant was then able to move freely within the kitchen, with FPV video and gaze captured by the glasses and stored on the laptop in the backpack.

During the session, the participant was asked to prepare the dish by following the recipe. Paper copies of the recipe were available on site, and the participant was free to check the recipe during a session. No other instructions were given. The participant could choose to end the session at any time, yet the majority of them ended the session after they finished the dish. At the end of a session, another calibration of the eye tracker was performed, similar to the one at the beginning of the session, and the eye tracking quality was manually verified by the researcher. Sessions with low tracking quality were discarded from our dataset. Finally, the researcher helped the participant remove the glasses.

Each session usually lasted between 10 minutes and half an hour. Across different sessions, we altered the lighting conditions by turning on different light bulbs or putting down the blinds of the window. The object instances also varied as we re-supplied the utensils and food. Our goal is to have videos that exhibit a high variety of lighting conditions, objects and actions.

Video De-identification. We manually screened all recorded videos to remove any frames that might reveal the identity of the participant. Most of the removed frames are around the start and end of a session, when the participant was mounting or removing the glasses. In rare cases, we removed frames that contain reflections of the participant’s face in the environment, e.g., reflections on a mirror-finished stainless steel kettle. The screened video was further verified by another researcher. After the screening, the vast majority of the frames were kept. All removed frames were replaced with an empty black image in the videos.

Video Data Statistics. We recruited 32 participants across 7 recipes. In the end, we obtained 86 videos (sessions) with high quality eye tracking and a total of 28 hours. Each video has a resolution of 1280x960 and was captured at 24 Hz. All videos come with audio and gaze tracking data. Sample frames of the videos are shown in Figure 2 (right).

Gaze Data Statistics. Binocular gaze data was tracked at 30 Hz and synchronized with the recorded videos by the SMI mobile eye tracker. Each gaze point is time-stamped and defined as a 2D point on the image plane. Furthermore, proprietary software from SMI was used to classify each tracked gaze point as (1) a saccade (rapid eye movement), (2) a fixation, or (3) an unknown gaze type. In very rare cases (3.3%), the gaze tracker failed to track the egocentric gaze (untracked). Over a total of 3 million gaze points, the ratios of fixation, saccade, unknown gaze type and untracked gaze are 53.6%, 26.1%, 17.0% and 3.3%, respectively. Sample gaze points are shown in Figure 3.

Fig. 2: Left: the SMI eye tracking glasses used for video recording. Right: sample frames from the videos. Our dataset contains videos with different lighting conditions, object instances and actions.

3.2 Data Annotation

We further annotated the dataset with pixel-level annotations of the first person’s hands on sparsely sampled frames, as well as frame-level annotations of actions for all videos. We present our annotation details below.

3.2.1 Hand Mask Annotation

In addition to action annotations, we provide egocentric hand mask annotations on sparsely sampled video frames. Egocentric hands are important cues for FPV actions, as we use our hands to interact with objects and the physical environment. Specifically, we sampled one frame every 5 seconds within each video. Empty (due to de-identification) or blurry frames were removed, and the remaining frames were out-sourced to a third-party company, where a modified interface from [3] was used to contour the egocentric hands. Each hand was annotated using one or more polygons (if it is occluded) following its contour, which were further converted into a hand mask. After the annotation, we went through the hand masks and removed poorly annotated ones.

Fig. 3: Gaze tracking data from our dataset. The tracked gaze points are shown as green dots on the video frames.

Hand Mask Statistics. We obtained a total of approximately 14K hand masks from the sampled frames. Both the frames and the hand masks are at the original video resolution. Each frame typically contains masks for one or both hands. We released this set as our Hand14K dataset, an important part of the EGTEA Gaze+ dataset. Figure 4 shows sample annotations of hand masks from Hand14K. We hope this dataset will provide a major resource for analyzing hands in FPV.

3.2.2 Action Annotation

To capture FPV actions, we use the same taxonomy as [1] to define action labels. Moreover, we develop a multi-stage pipeline to enable accurate and efficient video annotation. We discuss the action categories in our dataset, present our pipeline, and analyze the resulting action labels.

Action Categories. Our first step is to identify the action categories. In this work, we focus on fine-grained actions that can be described by a combination of a single verb and a set of nouns, such as “take tomato” or “turn on oven”. The verb describes the motion, e.g., “take”, “turn on”, and the nouns specify the objects involved in the action, e.g., “tomato” or “peanut butter container”. We did not distinguish between the plural and singular forms of the nouns, and instead focus on recognizing the presence of the objects. This combination of a verb and nouns can describe complex actions, such as “pour condiment (from) container (to) salad”. A similar naming taxonomy is also used in [69, 10] and discussed in [1].

Fig. 4: Ground truth hand masks from our Hand14K dataset. The annotated masks are shown as green regions.
Fig. 5: Action annotation pipeline. We follow a three stage pipeline for annotation, with each stage focusing on a single task. From left to right: interfaces for action candidate labeling, action naming and action trimming. We use ELAN [90] for generating action candidates—clips that contain the full extent of an action. Moreover, we developed an interactive web User Interface (UI) to further label the clips (action naming) and refine the temporal boundaries of the actions (action trimming).
Fig. 6: Long-tailed distribution of verbs (left), nouns (middle) and actions (right) in our dataset. Our taxonomy covers 19 verbs and 106 fine-grained action categories. The top-10 objects and top-20 actions are further displayed. The distribution poses the additional challenge of learning from imbalanced data.

Annotation Pipeline. Annotating an action in a continuous video requires identifying its onset and offset, and generating its label according to our taxonomy. This process can be very time consuming even for an expert human annotator. We propose to streamline the annotation by dividing the process into multiple stages, with each stage focusing on a single sub-task. More precisely, our pipeline consists of three stages: action candidate labeling, action naming and action trimming. Figure 5 presents an illustration of the stages. Our pipeline creates a more reliable work flow for action annotation. For example, it allows us to identify and correct errors from previous stages. We annotated our dataset in-house to ensure the best quality, although the pipeline and tools can be easily scaled to crowd-sourced settings. We now describe our annotation pipeline in detail.

Action Candidate Labeling. This stage aims to identify all potential actions and their rough temporal extents. We use ELAN (a multi-modal annotation tool developed at the Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands; http://tla.mpi.nl/tools/tla-tools/elan/) [90] for this step. Specifically, annotators were asked to mark the rough onset and offset of all possible actions in a video. Two action candidates are allowed to overlap in time. A small to moderate amount of error is expected in this stage in exchange for efficiency, as later stages will filter out bad candidates.

Action Naming. This stage seeks to label the action candidates from the previous stage under our taxonomy. We cropped all action candidates into video clips with temporal padding at both the start and the end. These clips are most likely to include the full extent of a single action. Cropping the videos not only reduces the visual content that a user has to examine, but also helps to ensure that the name of the action can be inferred from an isolated clip. We then developed a web interface that presents the clips to the annotators and allows the annotators to input the action names.

Specifically, the web interface presents one clip at a time and asks the user to annotate a verb and a set of nouns. For the verb, a user can choose from a list of 19 pre-defined verbs. We empirically verified that this list provides a good coverage of the actions in our videos. For nouns, a user must enter them as a list. Moreover, we allowed users to red flag a clip if (1) no action is present; (2) there are multiple major actions; or (3) the action is not complete in the given clip. Flagged clips were manually examined and removed if they do not contain a proper action instance.

Action Trimming. This final stage further refines the temporal boundaries of labeled action clips from the previous stage. Similar to action naming, we developed a web interface where both a video clip (temporally padded) and its action label are presented to annotators. The users were instructed to identify the exact temporal extent of the labeled action. This is done by marking the onset and offset on a slider bar, as shown in Figure 5 (right). Similarly, we allowed the users to flag action clips with incorrect labels. Flagged clips were checked manually.

Post-Processing. As the last step, we post-processed the labels to finalize the annotations. First, we removed action clips that are shorter than 0.5 seconds, as (1) their frames were typically blurred due to rapid motion, and (2) we found it hard to accurately identify their temporal boundaries. Moreover, we ran a spell checker on the nouns and merged all synonyms. In addition, we combined some sub-categories of objects into their super-categories due to naming inconsistency among our annotators. Specifically, fork, knife and spoon were merged into “eating utensil”; spatula, skimmer and ladle were renamed “cooking utensil”; and jar, bottle and box were renamed “container”. Finally, we sorted all action categories by their frequency, and only the top 106 categories are considered for our final dataset. A minimal sketch of the merging step is shown below.
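As a concrete illustration of this post-processing step, the following sketch (Python; the field names and helper function are illustrative and not the actual annotation tooling) maps annotated nouns to their super-categories:

```python
# Hypothetical post-processing sketch: merge object sub-categories into the
# super-categories described above. The mapping mirrors the text; the helper
# name and input format are illustrative assumptions.
SUPER_CATEGORY = {
    "fork": "eating utensil", "knife": "eating utensil", "spoon": "eating utensil",
    "spatula": "cooking utensil", "skimmer": "cooking utensil", "ladle": "cooking utensil",
    "jar": "container", "bottle": "container", "box": "container",
}

def normalize_nouns(nouns):
    """Map each annotated noun to its super-category (if any)."""
    return [SUPER_CATEGORY.get(n.strip().lower(), n.strip().lower()) for n in nouns]

# Example: the clip "take knife" becomes "take eating utensil".
print(normalize_nouns(["Knife", "tomato"]))   # ['eating utensil', 'tomato']
```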

Action Label Statistics. After post-processing, our action annotation covers 19 verbs, a merged vocabulary of nouns, and 106 unique action labels (combinations of a verb and nouns), leading to a total of 10,325 action instances within all 86 videos. Figure 6 shows all of our verbs, as well as the top objects and top action labels. While the combination of a verb and nouns can lead to complex actions, the most frequent actions tend to be simple, such as “read recipe” or “open fridge”.

Moreover, we present the distribution of verbs, nouns and all action labels in Figure 6. Figure 6 demonstrates the “long tailed” distribution of our action categories. While the most common action “read recipe” happens 752 times, the least common action “put (down) oil container” occurs only 32 times. This distribution, which we believe characterizes our daily visual experience, is very different from previous action recognition datasets, such as UCF101 or HMDB. It poses a significant challenge of learning from imbalanced samples.

3.3 Our EGTEA Gaze+ Dataset

Our final dataset includes 28 hours of FPV videos with a resolution of 1280x960 at 24 Hz. The dataset has 86 unique sessions from 32 subjects across 7 recipes. Each session consists of an HD video, an audio track, binocular gaze tracking data (30 Hz), frame-level action annotations, and hand masks at sparsely sampled frames. Our annotations include approximately 14K hand masks and 10,325 action instances from the 106 most frequent action categories. Each action instance lasts a few seconds on average, and actions occur several times per minute. In total, this yields a corpus of over 2 million frames with tracked gaze points, a large portion of which fall within annotated actions.

Train/Test Splits. To facilitate fair benchmarking on our dataset, we further created three different splits of non-overlapping train and test sets, with 8299/2022, 8299/2022, and 8230/2021 samples (train/test). These splits were created by random sampling such that roughly 80% of the samples per category are used for training and the rest for testing; a sketch of this procedure is given below. We encourage reporting results on all three splits.
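The per-category split can be reproduced conceptually with a short sketch like the one below (Python; the data layout and random seed are assumptions, not the exact procedure used to generate the released splits):

```python
import random
from collections import defaultdict

def split_per_category(instances, train_ratio=0.8, seed=0):
    """Randomly split action instances so that ~80% of each category is used
    for training. `instances` is a list of (clip_id, action_label) pairs; the
    bookkeeping in the released splits may differ."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for clip_id, label in instances:
        by_label[label].append(clip_id)
    train, test = [], []
    for label, clips in by_label.items():
        rng.shuffle(clips)
        cut = int(round(train_ratio * len(clips)))
        train += [(c, label) for c in clips[:cut]]
        test += [(c, label) for c in clips[cut:]]
    return train, test
```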

| FPV Dataset | Mounting | Res (FPS) | Hours | Sessions | Subjects | Action Instances | Action Classes | Hand | Gaze | Object Boxes |
|---|---|---|---|---|---|---|---|---|---|---|
| EGTEA Gaze+ | Head | 1280x960 (24) | 28 | 86 | 32 | 10,321 | 106 | ✓ | ✓ | |
| EPIC-Kitchens [10] | Head | 1920x1080 (60) | 55 | 432 | 32 | 39,596 | 149 | | | ✓ |
| Charades-Ego [68] | Vary | Vary† | 34.4 (+34.4) | 7860 | 112 | 68,536 | 157 | | | |
| GTEA Gaze+ [43] | Head | 1280x960 (24) | 9 | 37 | 6 | 3,371 | 44 | | ✓ | |
| UCI ADL [56] | Chest | 1280x960 (30) | 10 | 20 | 20 | 436 | 32 | | | ✓ |
| JPL Interaction [64] | Chest | 320x240 (30) | 1 | 57 | 8 | 94 | 7 | | | |
| CMU MMAC [37] | Head | 640x480 (12) | 6 | 25 | 5 | 516 | 31 | | | |

TABLE I: Comparison to FPV datasets. Our dataset provides the most comprehensive signals of egocentric hand, gaze and actions. †: resolution may vary.

Comparison to FPV Datasets. Table I compares our EGTEA Gaze+ dataset with existing FPV action datasets. In comparison to previous FPV datasets [37, 64, 56, 43], our dataset excels in scale. Similar to the concurrent work of EPIC-Kitchens [10] and Charades-Ego [68], our dataset features HD videos and considers a similar number of action categories. While EGTEA Gaze+ does have a much smaller number of action instances, our dataset stands out in terms of egocentric gaze and hand annotations. Notably, EGTEA Gaze+ is the only dataset that offers gaze tracking, hand masks and action annotations at the same time, thereby offering the most comprehensive benchmark for FPV gaze and actions. We anticipate that our dataset will be used for evaluating FPV gaze estimation, hand segmentation, action recognition and action detection.

4 Modeling Gaze and Actions in FPV

We now present our joint model of egocentric gaze and actions. We denote an input first person video as $x$, with its frames indexed by time $t$. Our goal is to predict the action category $y$ for $x$. We assume egocentric gaze measurements $g$ are available during training yet need to be inferred during testing. $g$ is measured as a single 2D gaze point at each time $t$, defined on the image plane of $x$. For our model, it is helpful to reparameterize $g$ as a 2D saliency map $l$, where the value at the gaze position is set to one and all others are zero, and thus $\sum_{ij} l_{ij} = 1$. In this case, $l$ defines a proper probability distribution of 2D gaze.
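For concreteness, a minimal sketch of this re-parameterization of a gaze point into the one-hot map $l$ (NumPy; normalized [0, 1] gaze coordinates are an assumption about the input format):

```python
import numpy as np

def gaze_to_map(gaze_xy, height, width):
    """Re-parameterize a single 2D gaze point as a one-hot 2D map that sums to 1.
    `gaze_xy` is (x, y) in normalized [0, 1] image coordinates (an assumption;
    the released annotations may use pixel coordinates instead)."""
    l = np.zeros((height, width), dtype=np.float32)
    x, y = gaze_xy
    col = min(int(x * width), width - 1)
    row = min(int(y * height), height - 1)
    l[row, col] = 1.0
    return l  # a valid probability distribution over the 2D grid
```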

Fig. 7: Overview of our joint model of FPV gaze and actions. Our model takes multiple RGB and flow frames as inputs, and outputs a set of parameters defining a distribution of gaze in the middle layers. We then sample a gaze map from this distribution. This map is used to selectively pool visual features at higher layers of the network for action recognition. During training, our model receives action labels and noisy gaze measurement. Once trained, the model is able to infer gaze and recognize actions in FPV. We show that this network builds a probabilistic model that naturally accounts for the uncertainty of gaze and captures the relationship between gaze and actions in FPV.

Figure 7 presents an overview of our model. Consider an analogy between our model and the well-known R-CNN framework for object detection [19, 60]. Our model takes a video as input and outputs the distribution of gaze as an intermediate result. We then sample the gaze map $\tilde{l}$ from this predicted distribution. $\tilde{l}$ encodes location information for actions and thus can be viewed as a source of action proposals—similar to the object proposals in R-CNN. Finally, we use the attention map to select features from the network hierarchy for recognition. This can be viewed as Region of Interest (ROI) pooling in R-CNN, where visual features in relevant regions are selected for recognition.

4.1 Modeling Gaze with Stochastic Units

Our key idea is to model the gaze map $l$ as a probabilistic variable to account for its uncertainty. More precisely, we model the conditional probability of the action $y$ given the video $x$ by

$p(y \mid x) = \sum_{l} p(y \mid l, x)\, p(l \mid x)$.   (1)

Intuitively, $p(l \mid x)$ estimates gaze given the input video $x$, and $p(y \mid l, x)$ further uses the predicted gaze to select visual features from the input video to predict the action $y$. Moreover, we want to use high capacity models, such as deep networks, for both $p(l \mid x)$ and $p(y \mid l, x)$. While this model is appealing, learning and inference are intractable for high dimensional video inputs $x$.

Our solution, inspired by [34, 72], is to approximate the intractable posterior with a carefully designed distribution $q(l \mid x)$. Specifically, $q(l \mid x)$ is defined on a discrete 2D grid, e.g., the image plane, and has the same size as $l$. It is parameterized by the output $\pi(x)$ of a deep neural network, normalized to sum to one over the grid, i.e.,

$q(l_{ij} = 1 \mid x) = \pi_{ij}(x)$.   (2)

$q(l \mid x)$ thus models the probabilistic distribution of egocentric gaze: our deep network creates a 2D map that approximates the distribution of the latent attention map. Specifically, $\pi_{ij}(x)$ can be viewed as the expectation of the gaze at position $(i, j)$. We can then sample the gaze map $\tilde{l}$ from $q(l \mid x)$ for recognition.
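A minimal sketch of such a gaze head (PyTorch; the 1x1 convolution, the single-frame 2D treatment, and the channel count are illustrative choices rather than the exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeHead(nn.Module):
    """Predicts the parameters of q(l | x): one probability per spatial cell.
    The 1x1 convolution and default channel count are illustrative."""
    def __init__(self, in_channels=832):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat):                           # feat: (B, C, H, W)
        logits = self.conv(feat)                       # (B, 1, H, W)
        b, _, h, w = logits.shape
        q = F.softmax(logits.view(b, -1), dim=1)       # distribution over H*W cells
        return q.view(b, h, w), logits.view(b, h, w)   # probabilities and raw logits
```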

Given a sampled gaze map $\tilde{l}$, our attention mechanism selectively aggregates the visual features produced by a network $\phi(x)$. In our model, this is simply a weighted average pooling, where the weights are defined by the gaze map $\tilde{l}$. We then send the pooled features to the recognition network $f$. We further constrain $f$ to have the form of a linear classifier with weights $w$, followed by a softmax function. This design is important for approximate inference. Now we have

$p(y \mid \tilde{l}, x) = f\Big(\sum_{ij} \tilde{l}_{ij}\, \phi_{ij}(x)\Big) = \mathrm{Softmax}\Big(w^\top \sum_{ij} \tilde{l}_{ij}\, \phi_{ij}(x)\Big)$.   (3)

The sum operation is equivalent to spatially re-weighting individual feature channels. We expect that the network will learn to attend to discriminative regions for action recognition. Note that this is a soft attention mechanism that allows back-propagation. Thus, top-down modulation of gaze can be achieved through gradients from the action labels.
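A minimal single-frame sketch of this attention-weighted pooling followed by a linear classifier (PyTorch; the full model pools 3D features over both space and time, and the feature dimension is an illustrative choice):

```python
import torch
import torch.nn as nn

class AttentionPoolClassifier(nn.Module):
    """Aggregates features with the (sampled) gaze map and classifies the action.
    A single-frame 2D sketch of Eq. (3)."""
    def __init__(self, feat_dim=1024, num_actions=106):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_actions)   # linear classifier; softmax applied in the loss

    def forward(self, feat, gaze_map):
        # feat: (B, C, H, W); gaze_map: (B, H, W) summing to 1 per sample
        w = gaze_map.unsqueeze(1)                    # (B, 1, H, W)
        pooled = (feat * w).sum(dim=(2, 3))          # weighted average pooling -> (B, C)
        return self.fc(pooled)                       # action logits
```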

Our model thus includes three sub-networks: $\pi$, which outputs the parameters of the attention map; $\phi$, which extracts visual representations from $x$; and $f$, which pools features and recognizes actions. All three sub-networks share the same backbone network with separate heads, and thus our model is realized as a single feed-forward deep network. Due to the sampling process introduced in the modeling, learning the parameters of the network is challenging. We overcome this challenge by using variational learning and optimizing a lower bound. We now present our training objective and inference method.

4.2 Variational Learning

During training, we make use of the input video $x$, its action label $y$, and human gaze measurements $g$ drawn from a distribution $p(l \mid g)$. Intuitively, our learning process has two major goals. First, our predicted gaze distribution $q(l \mid x)$, parameterized by $\pi(x)$, should match the noisy observations of gaze. Second, the action recognition error should be minimized. We achieve these goals by maximizing the lower bound of $\log p(y \mid x)$, given by

$\log p(y \mid x) \ge \mathbb{E}_{l \sim q(l \mid x)}\big[\log p(y \mid l, x)\big] - D_{KL}\big(q(l \mid x)\,\|\,p(l \mid g)\big)$,   (4)

where $D_{KL}$ is the Kullback–Leibler (KL) divergence between $q(l \mid x)$ and $p(l \mid g)$, and $\mathbb{E}$ denotes the expectation.

Learning with Egocentric Gaze. Computing the KL term requires prior knowledge of $p(l \mid g)$. In our case, given $g$, we observe gaze drawn from $p(l \mid g)$; thus $p(l \mid g)$ captures the noise pattern of the gaze measurement $g$. We adopt a simple noise model of gaze. For all tracked fixation points, we assume 2D isotropic Gaussian noise, where the standard deviation of the Gaussian is selected based on the average tracking error of modern eye trackers. When the gaze point is a saccade (or is missing), we set $p(l \mid g)$ to a 2D uniform distribution, allowing attention to be uniformly distributed.
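A sketch of how this supervisory distribution $p(l \mid g)$ can be constructed on the downsampled grid (NumPy; the standard deviation below is a placeholder value, not the one used in our experiments):

```python
import numpy as np

def gaze_prior(gaze_xy, height, width, sigma=1.5, is_fixation=True):
    """p(l | g) on the downsampled grid: an isotropic 2D Gaussian around a tracked
    fixation, or a uniform distribution for saccades / missing gaze. `sigma`
    (in grid cells) is a placeholder; the paper ties it to the average tracking
    error of the eye tracker. `gaze_xy` is in normalized [0, 1] coordinates."""
    if not is_fixation or gaze_xy is None:
        return np.full((height, width), 1.0 / (height * width), dtype=np.float32)
    cx, cy = gaze_xy[0] * (width - 1), gaze_xy[1] * (height - 1)
    ys, xs = np.mgrid[0:height, 0:width]
    p = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return (p / p.sum()).astype(np.float32)
```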

Learning without Human Gaze. If no gaze data is available, e.g., on the EPIC-Kitchens dataset [10], we assume a 2D uniform distribution as the prior of $l$, similar to the missing gaze case. This prior assumes that attention can be allocated with equal probability to every possible location on the image plane. Even with such a weak prior, we found it helpful for regularizing the learning on EPIC-Kitchens. Other distributions could be further explored, such as a center prior.

Loss Function. Given our prior of gaze $p(l \mid g)$, we minimize the negative of the empirical lower bound as our loss function,

$\mathcal{L}(x, y, g) = -\log p(y \mid \tilde{l}, x) + D_{KL}\big(q(l \mid x)\,\|\,p(l \mid g)\big), \quad \tilde{l} \sim q(l \mid x)$.   (5)

During training, we sample the gaze map $\tilde{l}$ from the predicted distribution $q(l \mid x)$, apply the map for recognition ($p(y \mid \tilde{l}, x)$), and compute its negative log likelihood—the same as the cross entropy loss for a categorical variable $y$. Our objective function thus has two terms: (a) the negative log likelihood term, i.e., the cross entropy loss between the predicted and ground-truth action labels using the sampled gaze maps; and (b) the KL divergence between the predicted distribution $q(l \mid x)$ and the gaze distribution $p(l \mid g)$.
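A sketch of the resulting training loss (PyTorch; any relative weighting between the two terms is omitted here, and the tensor layout is an assumption):

```python
import torch
import torch.nn.functional as F

def joint_loss(action_logits, action_labels, q_map, prior_map, eps=1e-8):
    """Negative empirical lower bound: cross entropy on the prediction made with
    the sampled gaze map, plus KL(q(l|x) || p(l|g)). `q_map` and `prior_map` are
    (B, H, W) distributions that each sum to 1 per sample."""
    ce = F.cross_entropy(action_logits, action_labels)
    q = q_map.flatten(1).clamp_min(eps)
    p = prior_map.flatten(1).clamp_min(eps)
    kl = (q * (q.log() - p.log())).sum(dim=1).mean()
    return ce + kl
```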

Reparameterization. Our model is fully differentiable except for the sampling of $\tilde{l}$. To allow end-to-end back-propagation, we re-parameterize the discrete distribution using the Gumbel-Softmax approach as in [30, 49]. Specifically, instead of sampling from $q(l \mid x)$ directly, we sample the gaze map via

$\tilde{l} = \mathrm{Softmax}\big((\log \pi(x) + \eta) / \tau\big)$,   (6)

where $\tau$ is a temperature that controls the “sharpness” of the distribution and is fixed for all of our experiments. The softmax normalization ensures that $\sum_{ij} \tilde{l}_{ij} = 1$, such that $\tilde{l}$ is a proper gaze map. $\eta$ follows the Gumbel distribution $\eta = -\log(-\log(u))$, where $u$ is drawn from the uniform distribution on $[0, 1]$. This Concrete distribution separates the sampling into a random variable drawn from a uniform distribution and a set of parameters $\pi(x)$, and thus allows the direct back-propagation of gradients to $\pi$.
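A sketch of this Gumbel-Softmax sampling step (PyTorch; it operates on unnormalized logits, which is equivalent to using $\log \pi$ up to a per-sample constant that the softmax cancels):

```python
import torch
import torch.nn.functional as F

def sample_gaze_map(logits, tau):
    """Draw a relaxed sample from the discrete gaze distribution (Eq. 6).
    `logits` has shape (B, H, W); `tau` is the temperature hyperparameter.
    Gradients flow back to the logits through the softmax."""
    b, h, w = logits.shape
    u = torch.rand_like(logits).clamp(1e-8, 1.0 - 1e-8)
    gumbel = -torch.log(-torch.log(u))                    # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    return F.softmax(y.view(b, -1), dim=1).view(b, h, w)  # sums to 1 per sample
```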

4.3 Approximate Inference

During testing, we feed an input video $x$ forward through the network to estimate the gaze distribution $q(l \mid x)$. Ideally, we should sample multiple gaze maps from $q(l \mid x)$, pass them into our recognition network $f$, and average all predictions. This is, however, prohibitively expensive: since $f$ is nonlinear and $l$ has hundreds of dimensions, we would need many samples to approximate the expectation $\mathbb{E}_{l \sim q}[p(y \mid l, x)]$, where each sample requires us to recompute $f$. We take a shortcut by feeding $q(l \mid x)$ into $f$ to avoid the sampling. We note that $q(l \mid x)$ is the expectation of $\tilde{l}$, and thus our approximation is $p(y \mid \mathbb{E}[l], x)$.

This shortcut does provide a good approximation. Recall that our recognition network $f$ is a softmax linear classifier. Thus, $-\log p(y \mid l, x)$ is convex in $l$ (even with weight decay on $w$). By Jensen’s inequality, we have $\mathbb{E}_{l \sim q}\big[-\log p(y \mid l, x)\big] \ge -\log p(y \mid \mathbb{E}[l], x)$. Thus, our approximation is indeed a lower bound of the sample-averaged estimate of the negative log likelihood. Using this deterministic approximation during testing also eliminates the randomness in the results due to sampling. We have empirically verified the effectiveness of our approximation.
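A sketch of this single-pass inference, composed from the sub-network sketches above (the composition and the shared feature map are hypothetical simplifications of the two-stream model):

```python
import torch

@torch.no_grad()
def predict(backbone, gaze_head, classifier, clip):
    """Single-pass inference: use the expected gaze map E[l] = q(l|x) instead of
    averaging over samples. `backbone`, `gaze_head`, and `classifier` follow the
    sketches above; in the full model the gaze head uses mid-level features and
    the classifier uses top-level features."""
    feat = backbone(clip)                 # (B, C, H, W) visual features
    q_map, _ = gaze_head(feat)            # expected attention map q(l | x)
    logits = classifier(feat, q_map)      # attention-pooled action scores
    return logits.softmax(dim=1), q_map   # action probabilities and gaze estimate
```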

4.4 Discussions

We connect our model to the techniques of Dropout and DropBlock, as well as the model of Conditional Variational AutoEncoder (CVAE). We hope these connections help to draw better insights about our model.

Connection to Dropout and DropBlock. Our sampling procedure during learning can be viewed as an alternative to Dropout and the more recent DropBlock [75, 18], and thus helps to regularize the learning. In particular, we sample the gaze map $\tilde{l}$ to re-weight features. This map will have a single peak and many close-to-zero values because of the softmax function. If a position has a very small weight, the features at that position are effectively “dropped out”. The key difference is that our sampling is guided by the predicted gaze distribution $q(l \mid x)$ instead of the random masking used by Dropout or DropBlock.

Connection to Conditional Variational Autoencoder. Our model is highly relevant to CVAE [72]. Both models use stochastic variables for discriminative tasks. Yet they differ in three aspects: (1) our stochastic unit—the 2D gaze distribution—is discrete, whereas CVAE employs a continuous Gaussian variable, leading to a different reparameterization technique; (2) our stochastic unit—the gaze map—is physically meaningful and receives supervision during training, while CVAE’s is latent; (3) our model approximates the posterior with $q(l \mid x)$ and uses one forward pass for approximate inference, while CVAE models the posterior as a function of both the input and the output and requires recurrent updates.

4.5 Network Architecture

Our model builds on two-stream I3D networks [8]. Similar to its base Inception network [79], I3D has 5 convolutional blocks, and the network uses 3D convolutions to capture the temporal dynamics of videos. Specifically, our model takes both RGB frames and optical flow as inputs, and feeds them into an RGB stream and a flow stream, respectively. We fuse the two streams at the end of the 4th convolutional block for gaze estimation, and at the end of the 5th convolutional block for action recognition. The fusion is done using element-wise summation as suggested by [15]. We use 3D max pooling to match the predicted gaze map to the size of the feature map at the 5th convolutional block for weighted pooling.

Our model takes 24 frames as input, and outputs action scores and a gaze map at a temporal stride of 8. The output gaze map is spatially downsampled by 32x relative to the input frames. During testing, we average the clip-level action scores to recognize actions in a video. Note that our gaze output is down-sampled both spatially (32x) and temporally (8x). When evaluating gaze, we aggregate fixation points within 8 frames and project them into a downsampled 2D map. This time interval (300 ms) is comparable to the duration of a fixation (around 250 ms), and thus this temporal aggregation should preserve the location of gaze.
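A sketch of this spatial and temporal aggregation of fixation points into the downsampled evaluation grid (NumPy; marking every grid cell that contains a fixation within the 8-frame window is one reasonable reading of the aggregation):

```python
import numpy as np

def aggregate_gaze(points, frame_hw, stride=32):
    """Project the fixation points of one 8-frame window onto the 32x downsampled
    grid used by the model's gaze output. `points` is a list of (x, y) pixel
    coordinates; `frame_hw` is the (height, width) of the input frames."""
    h, w = frame_hw[0] // stride, frame_hw[1] // stride
    grid = np.zeros((h, w), dtype=np.float32)
    for x, y in points:
        col = min(int(x) // stride, w - 1)
        row = min(int(y) // stride, h - 1)
        grid[row, col] = 1.0          # mark every cell containing a fixation
    return grid
```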

Implementation Details. We downsampled all video frames and computed optical flow using FlowNet V2 [27]. We empirically verified that FlowNet V2 gives satisfactory motion estimation in egocentric videos. The flow maps were truncated and rescaled following [87, 70]. During training, we randomly cropped square regions from the frames, and fed the RGB frames and flow maps into our networks. We also performed random horizontal flips and color jittering for data augmentation. For testing, we feed in the frames and their horizontally flipped versions. For action recognition, we average-pool the scores of all clips within a video. For gaze estimation, we flip back the gaze map of the flipped input and average the two maps.

Training Details. All our models are trained using SGD with a momentum of 0.9 and weight decay. When using SGD for variational learning, we draw a single sample for each input within a mini-batch; multiple samples of the same input are drawn across different iterations. The initial weights of the 3D convolutional networks are restored from Kinetics pre-trained models [8]. For training the two-stream networks, we used a batch size of 40, parallelized over 4 GPUs. We used the same initial learning rate as [8], decayed it by a constant factor partway through training, and trained for a fixed number of epochs. We enabled batch normalization [28] during training and adjusted the decay rate of its parameters, allowing faster aggregation of dataset statistics. By default, dropout was attached to the fully connected layer during training, as suggested in [87]. In comparison to our previous work [42], we found that adding dropout and training longer leads to better results.

5 Experiments and Results

This section presents our experiments and results. We first introduce the datasets and the evaluation metrics. We then present our results on gaze and actions using the EGTEA Gaze+ and EPIC-Kitchens datasets. Specifically, our main results include three parts. First, we present an ablation study of our model. Second, we present results on gaze estimation and compare to a set of strong baselines on the EGTEA Gaze+ dataset, where ground truth gaze points are available. Finally, we discuss our main results on FPV action recognition. Our results are compared to several state-of-the-art methods on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Overall, our model achieves strong results for both gaze estimation and action recognition.

5.1 Dataset and Evaluation Metric

We start by introducing the datasets used in our experiment, and present the evaluation metrics used on these datasets.

Dataset. We use the EGTEA Gaze+ dataset as the main vehicle for our benchmark. The dataset includes 10,325 action instances from 106 categories. These instances are divided into three train and test splits. We use the first split for our ablation study, and report gaze estimation and action recognition results on all splits. EGTEA Gaze+ manifests two key challenges of FPV action recognition. First, the definition of FPV actions leads to a fine-grained recognition problem. The task is to recognize action categories like “cut onion” or “spread condiment (on) bread (using) eating utensil”. Thus, the inference of these categories involves non-trivial understanding of body motion and object information. Moreover, the action labels are imbalanced—the distribution of action instances follows a long-tailed distribution. The frequent classes have hundreds of samples while the classes on the tail have only dozens of samples. This long-tailed distribution poses the additional challenge of learning from imbalanced data.

Evaluation Metric. We now present our evaluation metrics for FPV gaze estimation and action recognition.

Gaze Estimation: We consider gaze estimation as a binary classification problem, evaluate precision-recall curves, and report the best F1 scores together with their corresponding precision and recall. Untracked gaze points and saccades are excluded from the evaluation. Note that we did not use the average angular error common in gaze tracking, because our model produces a low-resolution gaze map. The gaze estimation results are only reported on EGTEA Gaze+, where the ground truth gaze is given by a mobile eye tracker. A sketch of this protocol is shown below.
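A minimal sketch of this evaluation (Python with scikit-learn; it assumes predicted per-cell scores and binary ground-truth maps flattened over the evaluation set):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1(gaze_scores, gaze_targets):
    """`gaze_scores`: predicted per-cell probabilities; `gaze_targets`: binary
    ground-truth maps (1 where a fixation falls). Both cover the evaluation set.
    Returns the best F1 with its corresponding precision and recall."""
    prec, rec, _ = precision_recall_curve(gaze_targets.ravel(), gaze_scores.ravel())
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-8)
    i = int(np.argmax(f1))
    return f1[i], prec[i], rec[i]
```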


Action Recognition: We treat action recognition as a multi-class classification problem. For EGTEA Gaze+, we report mean class accuracy on all three splits, i.e., the average per-class accuracy at the clip and video level. Mean class accuracy provides a less biased metric in comparison to instance level accuracy for class-imbalanced problems.
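A sketch of the mean class accuracy computation (NumPy; classes absent from the test set are simply skipped):

```python
import numpy as np

def mean_class_accuracy(preds, labels, num_classes=106):
    """Average of per-class accuracies: each action category contributes
    equally, regardless of its number of test samples."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    accs = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            accs.append(float((preds[mask] == c).mean()))
    return float(np.mean(accs))
```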

5.2 Ablation Study

We begin with a comprehensive study of our model on the first split of the EGTEA Gaze+ dataset. The goal of this ablation study is to delineate the different components of our model. Specifically, we vary each of the following components: (1) the backbone network for feature representation; (2) the probabilistic modeling; and (3) the attention guided action recognition, and evaluate their impact on performance.

| Networks | Clip Acc | Video Acc |
|---|---|---|
| I3D RGB | 46.35 | 50.08 |
| I3D Flow | 37.74 | 45.78 |
| I3D Fusion | N/A | 54.19 |
| I3D Joint | 50.30 | 55.76 |

TABLE II: Ablation study of backbone networks on the EGTEA Gaze+ dataset. We compare RGB, Flow, late fusion and joint training of I3D for action recognition, and report mean class accuracy.

| Methods | Gaze F1 | Action Acc |
|---|---|---|
| I3D Joint | N/A | 55.76 |
| Prob-Atten | N/A | 56.50 |
| Gaze MLE | 26.63 | 55.88 |
| Ours (Prob.) | 34.01 | 57.20 |

TABLE III: Ablation study of probabilistic modeling on the EGTEA Gaze+ dataset. We compare our model to its deterministic version (Gaze MLE). We report F1 scores for gaze and mean class accuracy for action.
| Method | Split 1 (Prec / Recall / F1) | Split 2 (Prec / Recall / F1) | Split 3 (Prec / Recall / F1) |
|---|---|---|---|
| EgoGaze* [41] | 16.63 / 16.63 / 16.63 | 12.85 / 12.85 / 12.85 | 18.30 / 18.30 / 18.30 |
| Simple Gaze | 16.11 / 41.82 / 31.33 | 24.66 / 37.16 / 29.65 | 29.52 / 38.85 / 33.55 |
| Deep Gaze [97] | 28.71 / 43.08 / 34.46 | 26.30 / 40.28 / 31.82 | 30.96 / 44.48 / 36.51 |
| Gaze MLE† | 21.25 / 35.65 / 26.63 | 18.15 / 38.40 / 24.65 | 24.11 / 38.23 / 29.57 |
| Our Model† | 28.29 / 42.65 / 34.01 | 26.05 / 40.76 / 31.79 | 30.89 / 42.37 / 35.73 |

TABLE IV: Gaze estimation results on EGTEA. We report best F1 scores and their corresponding precision and recall. We compare our model to several baselines. Our results are comparable to the latest methods. †: methods that jointly model gaze and actions; *: see Section 5.3 for discussion.

Backbone Network: RGB vs. Flow. We benchmark different backbone networks for FPV action recognition. Our goal is to understand which network performs best in the egocentric setting. Concretely, we tested the RGB and flow streams of I3D [8], the late fusion of the two streams, and the joint training of the two streams [15]. The results are summarized in Table II. Overall, the EGTEA dataset is very challenging; even the strongest model has a video-level accuracy of only around 56%. To help calibrate the performance, we note that the same I3D model achieved considerably higher accuracies on Kinetics and UCF [8], and reported strong results on Charades [69, 89].

Unlike Kinetics or UCF, where flow stream performs comparably to RGB stream, the performance of I3D flow stream on EGTEA is significantly lower than its RGB counterpart. The results suggest that it is more difficult to capture motion cues in FPV. This is probably due to the frequent motion of the camera in FPV. Moreover, the joint training of RGB and flow streams performs the best. Thus, we choose this network as our backbone for the rest of our experiments.

Modeling: Probabilistic vs. Deterministic. Going forward, we test the probabilistic modeling part of our method. We focus on the key question: “Does probabilistic modeling of gaze help?” To this end, we compare to a deterministic version of our model that uses maximum likelihood estimation for gaze. We denote this model as Gaze MLE. Instead of sampling, this model learns to directly output a gaze map, and applies the map for recognition. During training, the gaze map is supervised by human gaze using a pixel-wise sigmoid cross entropy loss. Both the model architecture and the training procedure are the same as in our model. We disable the gaze loss when a fixation is not available.

We compare our model with Gaze MLE for gaze and actions, and present the results in Table III. Our probabilistic model outperforms its deterministic version by 1.3% in video accuracy for action recognition and by 7.4 points in F1 for gaze estimation. We attribute this significant gain to the modeling. Our probabilistic model helps to facilitate learning even with a noisy supervisory signal such as human gaze.

Attention for Action Recognition. Finally, we compare our method to a probabilistic attention model (Prob-Atten in Table III) using the same backbone networks. Prob-Atten follows the same network architecture yet receives a uniform distribution as the prior for variational learning. The benefit of this probabilistic attention modeling was studied in our recent work [45]. For action recognition, Prob-Atten is worse than the gaze supervised model by 0.7%, yet outperforms the base I3D model by 0.7%. These results suggest that (1) probabilistic attention helps to improve action recognition even without explicit supervision of gaze; and (2) adding human gaze as supervision provides a significant performance gain.

5.3 FPV Gaze Estimation

We now present our results for FPV gaze estimation.

Baselines. We consider the following baselines.

  • EgoGaze [41] makes use of hand crafted egocentric features, such as head motion and hand position, to regress gaze points. For a fair comparison, we use FlowNet V2 for motion estimation and hand masks from an FCN for hand positions (same as our method). EgoGaze outputs a single gaze point per frame. With a single ground-truth gaze point, EgoGaze will have equal numbers of false positives and false negatives. Thus, its precision, recall and F1 scores are the same.

  • Simple Gaze is a straightforward deep model for gaze estimation. Specifically, we directly estimate the gaze map using per-pixel sigmoid cross entropy loss. We use the same backbone network (I3D Joint) as our model and keep the output resolution the same.

  • Deep Gaze [97] is the FPV gaze prediction module from [97], where a 3D convolutional network is combined with a KL loss. Again, we use I3D Joint as the backbone network and keep the output resolution. Note that this model can be considered as a special case of our model by removing the sampling, the attention mechanism and the recognition network.

  • Gaze MLE is the same model in our ablation study, where gaze is estimated using maximum likelihood.

Results. Our gaze estimation results are shown in Table IV. We report the best F1 scores and their corresponding precision and recall. Not surprisingly, deep models outperform hand crafted features by a large margin. We also observe that models with a KL loss (e.g., Deep Gaze and our model) are consistently better than those using a cross entropy loss (e.g., Simple Gaze and Gaze MLE). We conjecture that this is due to the difficulty in balancing between the losses for gaze and action. Finally, our method, which jointly models gaze and actions, achieves comparable results to gaze-only models.

Discussion. Our results suggest that top-down, task-relevant attention is not fully captured by our joint model, even though top-down modulation can be achieved via back-propagation. This remains an interesting direction for the community to explore. Finally, we note that the gaze estimation benchmark uses noisy human gaze as ground truth. We argue that, even though these gaze measurements are noisy, they largely correlate with the underlying signal of attention, and thus the benchmark results remain meaningful.

5.4 FPV Action Recognition

We now describe our results on FPV action recognition. We introduce our baselines, compare our model to the baselines, and discuss the results.

Method               Split1 (Clip / Video)   Split2 (Clip / Video)   Split3 (Clip / Video)
EgoIDT+Gaze [43]†    N/A / 42.55             N/A / 37.30             N/A / 37.60
2SCNN [70]           N/A / 43.78             N/A / 41.47             N/A / 40.28
I3D (joint) [8]      50.30 / 55.76           44.92 / 53.14           48.38 / 53.55
I3D+Gaze†            47.16 / 53.74           43.49 / 50.30           45.35 / 49.63
EgoConv+I3D [71]     N/A / 54.19             N/A / 51.45             N/A / 49.41
Ego-RNN-2S [78]      N/A / 52.40             N/A / N/A               N/A / N/A
LSTA-2S [77]         N/A / 53.00             N/A / N/A               N/A / N/A
Gaze MLE             50.79 / 55.88           45.87 / 52.77           48.96 / 53.49
Our Model            50.80 / 57.20           45.45 / 53.75           48.31 / 54.13
TABLE V: Action recognition results on EGTEA, reported as mean class accuracy (Clip Acc / Video Acc per split). We compare our model to the latest methods. Our method outperforms previous methods, including those that use human gaze during testing (marked with †), by a significant margin.
Fig. 8: Visualization of gaze estimation and action recognition results. For each 24-frame video snippet, we plot the output gaze heat map (higher values in red) with a temporal stride of 8 frames, and display the ground-truth gaze points as green dots. The result for each snippet is thus shown as three key frames with their gaze maps. We print the predicted and ground-truth action labels above the images. Both successful cases (first and second rows) and failure cases (third row) are presented.

Baselines. We consider the following set of baselines for FPV action recognition.

  • EgoIDT+Gaze [43] combines egocentric features with dense trajectory descriptors [85]. These features are further selected by gaze points and encoded using Fisher vectors [54] for action recognition.

  • 2SCNN [70] is the two-stream network with a VGG16 backbone developed for generic action recognition.

  • I3D (joint) [8] is the two-stream I3D with joint training, a strong baseline on many datasets.

  • I3D+Gaze is inspired by [43, 13], where the ground-truth human gaze is used to pool features from the last convolutional outputs of the network. For this method, we use the same I3D joint backbone and the same attention mechanism as our model, yet use human gaze for pooling features (a minimal sketch of this pooling is given after this list). When human gaze is not available, we fall back to average pooling.

  • EgoConv+I3D [71] adds a stream of egocentric cues for FPV action recognition. This egocentric stream encodes head motion and hand masks, and its outputs are fused with the RGB and flow streams. We use a Fully Convolutional Network (FCN) [47] for hand segmentation, and late-fuse the scores of the egocentric stream with I3D for a fair comparison. This model is trained from scratch.

  • Ego-RNN-2S [78] and LSTA-2S [77] are recent LSTM-based models using soft attention for FPV action recognition. Both models use two-stream networks. We report their updated mean class accuracy results from the ICCV'19 EPIC workshop (available at https://eyewear-computing.org/EPIC_ICCV19/program/ICCV-EPIC-LSTA.pdf).

  • Gaze MLE is the deterministic version of our model used in our ablation and gaze results. It provides a simple baseline for multi-task learning of gaze and actions within a single deep network.
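As referenced in the I3D+Gaze item above, the snippet below gives a minimal sketch of gaze-guided feature pooling with average pooling as the fallback when no gaze is available. The tensor shapes and the function name gaze_pool are illustrative and do not mirror the exact implementation.

```python
import torch

def gaze_pool(features, gaze_map=None):
    """Pool spatio-temporal features with a gaze-derived attention map,
    falling back to average pooling when no gaze is available.

    features: (B, C, T, H, W) last convolutional outputs.
    gaze_map: (B, T, H, W) non-negative weights (e.g., a blurred fixation
              map resized to the feature resolution), or None.
    """
    if gaze_map is None:
        return features.mean(dim=(2, 3, 4))           # plain average pooling
    # Normalize the gaze map per frame so the weights sum to one.
    w = gaze_map / gaze_map.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
    weighted = features * w.unsqueeze(1)               # broadcast over channels
    return weighted.sum(dim=(3, 4)).mean(dim=2)        # pool space, then time
```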

We were unable to compare against the relevant methods from [48, 13]. These methods require additional object annotations for training, which are not available in the EGTEA dataset. We emphasize that our method does not need object or hand information for training or testing.

Results and Discussion. We report results on all three splits of EGTEA Gaze+ in Table V. As expected, all deep models outperform hand-crafted features (EgoIDT+Gaze [43]) by a large margin. Among the deep models, the 3D network (I3D) outperforms its 2D counterparts. Surprisingly, I3D [8], originally designed for generic action recognition, provides a strong baseline that outperforms all previous deep models, including the recent methods [78, 77]. However, further combining egocentric features (EgoConv+I3D) or using egocentric gaze directly (I3D+Gaze) does not improve the results, and in fact decreases the accuracy.

There are several possible reasons for this performance drop. First, EgoConv [71] was designed to capture actions defined by gross body motion, such as "take" vs. "put", while our setting requires fine-grained recognition, e.g., "take cup" vs. "take plate"; the EgoConv features are thus less useful. Second, as we argued in the introduction, human gaze can be quite noisy, with a substantial fraction of gaze points irrelevant to the action. Using gaze data directly might therefore hurt performance.

Finally, we note that the simple joint model (Gaze MLE) performs comparably to I3D, and our full model further improves the strong I3D baseline by an average of +0.9% over the three splits. Our results also consistently outperform all baseline methods across all splits, including those that use human gaze during testing. Our model reaches accuracies of 57.20% / 53.75% / 54.13% on the three splits, respectively. We argue that these results provide strong evidence for our modeling of uncertainty in gaze measurements: a model must learn to account for this uncertainty to avoid misleading gaze points that would distract it towards action-irrelevant regions.

Visualizing Egocentric Gaze for Action Recognition. Moving beyond accuracy, we provide additional results to help understand our model. Specifically, we visualize the gaze estimates and action labels output by our model, together with the ground-truth gaze, in Figure 8. Our gaze outputs often attend to the foreground objects that the person is interacting with, which we believe is why the model can better recognize egocentric actions. Moreover, we found these visualizations helpful for diagnosing action recognition errors. A good example is the second failure case in the third row (middle) of Figure 8, where our model successfully recognizes the object as "condiment container" yet fails to distinguish the verb ("take" vs. "open"). Another example is the first failure case in Figure 8, where the recognition model is confused by the appearance similarity between "onion" and "cucumber".

5.5 Extension to EPIC-Kitchens Dataset

Finally, we extend our model to the EPIC-Kitchens dataset, the largest benchmark for FPV action recognition. Using a single network on RGB frames, our model achieves state-of-the-art performance on EPIC-Kitchens.

Dataset and Metric. EPIC-Kitchens provides a large collection of action instances. Similar to our dataset, each instance is described by the combination of a single verb and a single noun, with 125 verb classes and 331 noun classes in total. We report results on both the seen and unseen test sets provided by [10]; our results are evaluated by the server provided by [10]. The main metric is top-k accuracy for verb, noun and action labels.
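For reference, the sketch below computes top-k accuracy for a single label type (verb or noun); for the action label, a prediction is typically counted as correct only when both the verb and the noun are correct. The function name and tensor shapes are assumptions for illustration.

```python
import torch

def topk_accuracy(scores, labels, k=5):
    """Top-k accuracy: the fraction of samples whose ground-truth label
    appears among the k highest-scoring classes.

    scores: (N, C) class scores; labels: (N,) ground-truth class indices.
    """
    topk = scores.topk(k, dim=1).indices                  # (N, k)
    correct = (topk == labels.unsqueeze(1)).any(dim=1)    # hit anywhere in top-k
    return correct.float().mean().item()
```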

Training Details. We followed a training protocol similar to that of EGTEA Gaze+, with a few modifications. First, gaze tracking data is not available for EPIC-Kitchens, hence we replaced the gaze prior with a uniform distribution. Second, to further improve performance, we replaced the I3D backbone with the more recent CSN152 network pre-trained by [82]. Third, given the heavier backbone, we considered only a single RGB stream and reduced the batch size to 16. Finally, we found early stopping helpful for preventing over-fitting: we trained the network for only 18 epochs, decaying the learning rate by 1/10 at epoch 15. This strategy boosts performance on both the seen and unseen test sets, with a larger impact on the unseen set.
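A minimal sketch of this schedule in PyTorch is shown below, assuming a generic optimizer and a placeholder model; the actual training used a CSN152 backbone and settings not reproduced here.

```python
import torch

model = torch.nn.Linear(8, 2)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Decay the learning rate by 1/10 at epoch 15, stop training at epoch 18.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15], gamma=0.1)

for epoch in range(18):                                # early stop at 18 epochs
    # ... one pass over the training set with batch size 16 ...
    optimizer.step()                                   # placeholder parameter update
    scheduler.step()                                   # apply the decay milestone
```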

Results and Discussion. Our results are presented in Table VI and compared to state-of-the-art methods [77, 92, 33, 17, 88]. Our final model, using only RGB frames, achieves state-of-the-art results compared with all prior work, including methods that use optical flow [77], an object detector [88], or audio data [33]. Our single model outperforms the best published single-model entries [33, 17] by significant margins of 4.0% and 1.8% on the seen and unseen sets, respectively. Moreover, our single model even beats large ensembles [33]. At the time of submission, our results rank 1st on the unseen set and 3rd on the seen set of the EPIC-Kitchens leaderboard, where the top-ranked entry on the seen set uses an object detector and model ensembles [88]. We must point out that the uniform gaze prior used in our model is over-simplified, yet our model still achieves very competitive results on this larger benchmark. We speculate that better modeling of the gaze prior could further improve performance.

Test set     Method               Verb (Top1/Top5)  Noun (Top1/Top5)  Action (Top1/Top5)
Seen (s1)    2SCNN [10]           40.44 / 83.04     30.46 / 57.05     13.67 / 33.25
             TSN (fusion) [10]    48.23 / 84.09     36.71 / 62.32     20.54 / 39.79
             LSTA-2S [77]         62.12 / 87.95     40.41 / 64.47     32.60 / 52.85
             LFB Max [92]         60.00 / 88.40     45.00 / 71.80     32.70 / 55.30
             EPIC-Fusion [33]     64.75 / 90.70     46.03 / 71.34     34.80 / 56.65
             R(2+1)D [17]         65.20 / 87.40     45.10 / 67.80     34.50 / 53.80
             2SI3D+Obj [88]       69.80 / 90.95     52.27 / 76.71     41.37 / 63.59
             Ours (CSN152)        68.51 / 89.32     49.96 / 72.30     38.75 / 59.00
Unseen (s2)  2SCNN [10]           36.16 / 71.97     18.03 / 38.41     7.31 / 19.49
             TSN (fusion) [10]    39.40 / 74.29     22.70 / 45.72     10.89 / 25.26
             LSTA-2S [77]         48.89 / 77.88     24.27 / 46.06     18.71 / 33.77
             LFB Max [92]         50.90 / 77.60     31.50 / 57.80     21.20 / 39.40
             EPIC-Fusion [33]     52.69 / 79.93     27.86 / 53.78     19.06 / 39.40
             R(2+1)D [17]         57.30 / 81.10     35.70 / 58.70     25.60 / 42.70
             2SI3D+Obj [88]       58.96 / 82.69     33.90 / 62.27     25.20 / 45.48
             Ours (CSN152)        60.05 / 81.97     38.14 / 63.81     27.35 / 45.24
TABLE VI: Action recognition results on the EPIC-Kitchens test sets. Following [10], we report top1/top5 accuracy for verb, noun and action on the seen (s1) and unseen (s2) sets. Our results are compared against previous methods that use a single model. At the time of this submission, our model (Ours, CSN152) ranks 1st on the unseen set and 3rd on the seen set of the EPIC-Kitchens Action Recognition Challenge Leaderboard.

6 Conclusion and Future Work

In this paper, we considered the task of joint gaze estimation and action recognition in first person video. To facilitate our research, we introduced a new dataset, EGTEA Gaze+, which comes with video recordings, gaze tracking and hand masks, thereby providing the most comprehensive benchmark for understanding egocentric gaze and actions. Moving beyond the dataset, we presented a novel deep model that jointly estimates FPV gaze and recognizes FPV actions. At the core of our model lies the probabilistic modeling of human gaze within a deep network. We evaluated our model on EGTEA Gaze+ and demonstrated its superior performance. More importantly, the same model was applied to the largest FPV action recognition benchmark (EPIC-Kitchens) and achieved state-of-the-art results. We believe our dataset and model offer new insights into connecting egocentric gaze and actions, and thus provide a solid step towards advancing First Person Vision.

Acknowledgments

This research was supported by grant U54EB020404 awarded by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). This work was also partially supported by the Intel Science and Technology Center for Pervasive Computing (ISTC-PC). The work was developed during the first author's Ph.D. studies at Georgia Tech.

References

  • [1] G. A. Sigurdsson, O. Russakovsky, and A. Gupta (2017) What actions are needed for understanding human actions in videos?. In ICCV, Cited by: §3.2.2, §3.2.2.
  • [2] J.K. Aggarwal and M.S. Ryoo (2011-04) Human activity analysis: a review. ACM Comput. Surv. 43 (3), pp. 16:1–16:43. Cited by: §2.1.
  • [3] S. Bell, P. Upchurch, N. Snavely, and K. Bala (2013) OpenSurfaces: a richly annotated catalog of surface appearance. ACM Transactions on Graphics (TOG) 32 (4), pp. 111. Cited by: §3.2.1.
  • [4] A. Betancourt, P. Morerio, C. S. Regazzoni, and M. Rauterberg (2015) The evolution of first person vision methods: a survey. Circuits and Systems for Video Technology, IEEE Transactions on 25 (5), pp. 744–760. Cited by: §2.2.
  • [5] A. Borji and L. Itti (2013) State-of-the-art in visual attention modeling. TPAMI 35 (1), pp. 185–207. Cited by: §2.2.
  • [6] B. Bridgeman, D. Hendry, and L. Stark (1975) Failure to detect displacement of the visual world during saccadic eye movements. Vision research 15 (6), pp. 719–722. Cited by: §1.
  • [7] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, Cited by: §2.1, §2.1.
  • [8] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, Cited by: §2.1, §2.2, §4.5, §4.5, 3rd item, §5.2, §5.4, TABLE V.
  • [9] G. Cheron, I. Laptev, and C. Schmid (2015-12) P-cnn: pose-based cnn features for action recognition. In ICCV, Cited by: §2.1.
  • [10] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2018) Scaling egocentric vision: the EPIC-KITCHENS dataset. In ECCV, Cited by: §1, §2.1, §3.2.2, §3.3, TABLE I, §4.2, §5.5, TABLE VI.
  • [11] A. Fathi, J. K. Hodgins, and J. M. Rehg (2012) Social interactions: a first-person perspective. In CVPR, Cited by: §2.2.
  • [12] A. Fathi, A. Farhadi, and J. M. Rehg (2011) Understanding egocentric activities. In ICCV, Cited by: §2.2.
  • [13] A. Fathi, Y. Li, and J. M. Rehg (2012) Learning to recognize daily actions using gaze. In ECCV, pp. 314–327. Cited by: §2.2, 4th item, §5.4.
  • [14] A. Fathi, Y. Li, and J. M. Rehg (2012) Learning to recognize daily actions using gaze. In ECCV, Cited by: §1, §2.1, §3.
  • [15] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In CVPR, Cited by: §4.5, §5.2.
  • [16] T. Feix, J. Romero, H. Schmiedmayer, A. M. Dollar, and D. Kragic (2016) The grasp taxonomy of human grasp types. IEEE Transactions on Human-Machine Systems 46 (1), pp. 66–77. Cited by: §2.2.
  • [17] D. Ghadiyaram, D. Tran, and D. Mahajan (2019) Large-scale weakly-supervised pre-training for video action recognition. In CVPR, pp. 12046–12055. Cited by: §5.5, TABLE VI.
  • [18] G. Ghiasi, T. Lin, and Q. V. Le (2018) Dropblock: a regularization method for convolutional networks. In NeurIPS, pp. 10727–10737. Cited by: §4.4.
  • [19] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587. Cited by: §4.
  • [20] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The ”something something” video database for learning and evaluating visual common sense.. In ICCV, Cited by: §2.1.
  • [21] D. W. Hansen and Q. Ji (2010) In the eye of the beholder: a survey of models for eyes and gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (3), pp. 478–500. Cited by: §1.
  • [22] J. M. Henderson (2003) Human gaze control during real-world scene perception. Trends in cognitive sciences 7 (11), pp. 498–504. Cited by: §1.
  • [23] J. Hernandez, Y. Li, J. M. Rehg, and R. W. Picard (2014) BioGlass: physiological parameter estimation using a head-mounted wearable device. In Wireless Mobile Communication and Healthcare (Mobihealth), 2014 EAI 4th International Conference on, Cited by: §2.2.
  • [24] Y. Hoshen and S. Peleg (2016) An egocentric look at video photographer identity. In CVPR, Cited by: §2.2.
  • [25] D. Huang, M. Ma, W. Ma, and K. M. Kitani (2015) How do we use our hands? discovering a diverse set of common grasps. In CVPR, Cited by: §2.2.
  • [26] Y. Huang, M. Cai, Z. Li, and Y. Sato (2018) Predicting gaze in egocentric video by learning task-dependent attention transition. In ECCV, Cited by: §2.2.
  • [27] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In ICCV, Cited by: §4.5.
  • [28] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §4.5.
  • [29] L. Itti and C. Koch (2001) Computational modelling of visual attention. Nature reviews neuroscience 2 (3), pp. 194. Cited by: §1.
  • [30] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR, Cited by: §4.2.
  • [31] H. Jiang and K. Grauman (2017) Seeing invisible poses: estimating 3d body pose from egocentric video. In CVPR, Cited by: §1, §2.2.
  • [32] T. Kanade and M. Hebert (2012) First-person vision. Proceedings of the IEEE 100 (8), pp. 2442–2453. Cited by: §1.
  • [33] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen (2019) EPIC-Fusion: audio-visual temporal binding for egocentric action recognition. In ICCV, Cited by: §2.2, §5.5, TABLE VI.
  • [34] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §4.1.
  • [35] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto (2011) Fast unsupervised ego-action learning for first-person sports videos. In CVPR, Cited by: §1, §2.2.
  • [36] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In ICCV, pp. 2556–2563. Cited by: §2.1.
  • [37] F. D. la Torre Frade, J. K. Hodgins, A. W. Bargteil, X. M. Artal, J. C. Macey, A. C. I. Castells, and J. Beltran (2008-04) Guide to the carnegie mellon university multimodal activity (CMU-MMAC) database. Technical report Technical Report CMU-RI-TR-08-22, Carnegie Mellon University, Pittsburgh, PA. Cited by: §3.3, TABLE I.
  • [38] Y. J. Lee and K. Grauman (2015) Predicting important objects for egocentric video summarization. IJCV 114 (1), pp. 38–55. Cited by: §1, §2.2.
  • [39] C. Li and K. M. Kitani (2013) Model recommendation with virtual probes for egocentric hand detection. In ICCV, Cited by: §2.2.
  • [40] H. Li, Y. Cai, and W. Zheng (2019-06) Deep dual relation modeling for egocentric interaction recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [41] Y. Li, A. Fathi, and J. M. Rehg (2013) Learning to predict gaze in egocentric video. In ICCV, Cited by: §1, §2.2, §2.2, 1st item, TABLE IV.
  • [42] Y. Li, M. Liu, and J. M. Rehg (2018) In the eye of beholder: joint learning of gaze and actions in first person video. In ECCV, Cited by: §1, §4.5.
  • [43] Y. Li, Z. Ye, and J. M. Rehg (2015) Delving into egocentric actions. In CVPR, Cited by: §1, §1, §2.2, §3.3, TABLE I, §3, 1st item, 4th item, §5.4, TABLE V.
  • [44] Y. Li (2017) Learning embodied models of actions from first person video. Ph.D. Thesis, Georgia Institute of Technology. Cited by: §1.
  • [45] M. Liu, X. Chen, Y. Zhang, Y. Li, and J. M. Rehg (2019) Paying more attention to motion: attention distillation for learning video representations. arXiv preprint arXiv:1904.03249. Cited by: §5.2.
  • [46] M. Liu, S. Tang, Y. Li, and J. Rehg (2019) Forecasting human object interaction: joint prediction of motor attention and egocentric activity. arXiv preprint arXiv:1911.10967. Cited by: §2.1.
  • [47] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: 5th item.
  • [48] M. Ma, H. Fan, and K. M. Kitani (2016) Going deeper into first-person activity recognition. In CVPR, Cited by: §1, §2.2, §5.4.
  • [49] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR, Cited by: §4.2.
  • [50] S. Mathe and C. Sminchisescu (2012) Dynamic eye movement datasets and learnt saliency models for visual action recognition. In ECCV, Cited by: §2.1, §2.1.
  • [51] K. Nakamura, S. Yeung, A. Alahi, and L. Fei-Fei (2017) Jointly learning energy expenditures and activities using egocentric multimodal signals. In CVPR, Cited by: §2.2.
  • [52] H. S. Park, E. Jain, and Y. Sheikh (2012) 3D social saliency from head-mounted cameras.. In NeurIPS, Cited by: §1, §2.2, §2.2.
  • [53] H. S. Park and J. Shi (2015) Social saliency prediction. In CVPR, Cited by: §1, §2.2.
  • [54] F. Perronnin, J. Sánchez, and T. Mensink (2010) Improving the fisher kernel for large-scale image classification. In ECCV, pp. 143–156. Cited by: 1st item.
  • [55] H. Pirsiavash and D. Ramanan (2012) Detecting activities of daily living in first-person camera views. In CVPR, Cited by: §2.2.
  • [56] H. Pirsiavash and D. Ramanan (2012) Detecting activities of daily living in first-person camera views. In CVPR, Cited by: §1, §2.1, §3.3, TABLE I.
  • [57] Y. Poleg, C. Arora, and S. Peleg (2014) Head motion signatures from egocentric videos. In ACCV, Cited by: §2.2.
  • [58] Y. Poleg, A. Ephrat, S. Peleg, and C. Arora (2016) Compact CNN for indexing egocentric videos. In WACV, Cited by: §2.2.
  • [59] Y. Poleg, T. Halperin, C. Arora, and S. Peleg (2015) Egosampling: fast-forward and stereo for egocentric videos. In CVPR, pp. 4768–4776. Cited by: §1, §2.2.
  • [60] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §4.
  • [61] G. Rogez, J. S. Supancic, and D. Ramanan (2015) First-person pose recognition using egocentric workspaces. In CVPR, Cited by: §1, §2.2.
  • [62] G. Rogez, J. S. Supancic, and D. Ramanan (2015) Understanding everyday hands in action from rgb-d images. In ICCV, Cited by: §2.2.
  • [63] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele (2012) A database for fine grained activity detection of cooking activities. In CVPR, Cited by: §2.1.
  • [64] M. S. Ryoo and L. Matthies (2013) First-person activity recognition: what are they doing to me?. In CVPR, Cited by: §1, §3.3, TABLE I.
  • [65] M. S. Ryoo, B. Rothrock, and L. Matthies (2015) Pooled motion features for first-person videos. In CVPR, pp. 896–904. Cited by: §2.2.
  • [66] N. Shapovalova, M. Raptis, L. Sigal, and G. Mori (2013) Action is in the eye of the beholder: eye-gaze driven model for spatio-temporal action localization. In NIPS, pp. 2409–2417. Cited by: §2.1, §2.1.
  • [67] S. Sharma, R. Kiros, and R. Salakhutdinov (2016) Action recognition using visual attention. In ICLR Workshop, Cited by: §2.1, §2.1.
  • [68] G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari (2018) Charades-Ego: A large-scale dataset of paired third and first person videos. CoRR abs/1804.09626. Cited by: §1, §2.1, §3.3, TABLE I.
  • [69] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In ECCV, Cited by: §2.1, §3.2.2, §5.2.
  • [70] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, Cited by: §2.1, §4.5, 2nd item, TABLE V.
  • [71] S. Singh, C. Arora, and C. Jawahar (2016) First person action recognition using deep learned descriptors. In CVPR, Cited by: §1, §2.2, 5th item, §5.4, TABLE V.
  • [72] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In NeurIPS, pp. 3483–3491. Cited by: §4.1, §4.4.
  • [73] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §2.1.
  • [74] E. H. Spriggs, F. De La Torre, and M. Hebert (2009) Temporal segmentation and activity classification from first-person sensing. In CVPR Workshops, Cited by: §2.2.
  • [75] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: §4.4.
  • [76] Y. Su and K. Grauman (2016) Detecting engagement in egocentric video. In ECCV, Cited by: §1.
  • [77] S. Sudhakaran, S. Escalera, and O. Lanz (2019) LSTA: long short-term attention for egocentric action recognition. In CVPR, Cited by: §2.2, 6th item, §5.4, §5.5, TABLE V, TABLE VI.
  • [78] S. Sudhakaran and O. Lanz (2018) Attention is all we need: nailing down object-centric attention for egocentric activity recognition. In BMVC, Cited by: §2.2, 6th item, §5.4, TABLE V.
  • [79] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, Cited by: §4.5.
  • [80] B. Tekin, F. Bogo, and M. Pollefeys (2019-06) H+O: unified egocentric recognition of 3d hand-object poses and interactions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [81] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, Cited by: §2.1.
  • [82] D. Tran, H. Wang, L. Torresani, and M. Feiszli (2019) Video classification with channel-separated convolutional networks. In ICCV, pp. 5552–5561. Cited by: §5.5.
  • [83] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, Cited by: §2.1.
  • [84] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea (2008-11) Machine recognition of human activities: a survey. Circuits and Systems for Video Technology, IEEE Transactions on 18 (11), pp. 1473–1488. Cited by: §2.1.
  • [85] H. Wang, A. Kläser, C. Schmid, and C. Liu (2011) Action recognition by dense trajectories. In CVPR, Cited by: 1st item.
  • [86] L. Wang, Y. Qiao, and X. Tang (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, Cited by: §2.1.
  • [87] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV, Cited by: §2.1, §4.5, §4.5.
  • [88] X. Wang, Y. Wu, L. Zhu, and Y. Yang (2019) Baidu-UTS submission to the EPIC-kitchens action recognition challenge 2019. arXiv preprint arXiv:1906.09383. Cited by: §5.5, TABLE VI.
  • [89] X. Wang, R. B. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §5.2.
  • [90] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes (2006) ELAN: a professional framework for multimodality research. In Proceedings of LREC, Vol. 2006, pp. 5th. Cited by: Fig. 5, §3.2.2.
  • [91] M. Wray, D. Larlus, G. Csurka, and D. Damen (2019) Fine-grained action retrieval through multiple parts-of-speech embeddings. In ICCV, Cited by: §2.2.
  • [92] C. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krahenbuhl, and R. Girshick (2019) Long-term feature banks for detailed video understanding. In CVPR, pp. 284–293. Cited by: §5.5, TABLE VI.
  • [93] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In ECCV, Cited by: §2.1.
  • [94] J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg, and V. Singh (2015) Gaze-enabled egocentric video summarization via constrained submodular maximization. In CVPR, Cited by: §1, §2.2.
  • [95] R. Yonetani, K. M. Kitani, and Y. Sato (2016) Recognizing micro-actions and reactions from paired egocentric videos. In CVPR, Cited by: §1, §2.2.
  • [96] R. Yonetani, K. M. Kitani, and Y. Sato (2015) Ego-surfing first-person videos. In CVPR, Cited by: §2.2.
  • [97] M. Zhang, K. Teck Ma, J. Hwee Lim, Q. Zhao, and J. Feng (2017) Deep future gaze: gaze anticipation on egocentric videos using adversarial networks. In CVPR, Cited by: §2.2, 3rd item, TABLE IV.
  • [98] M. Zhang, K. Teck Ma, J. Hwee Lim, Q. Zhao, and J. Feng (2017) Deep future gaze: gaze anticipation on egocentric videos using adversarial networks. In CVPR, Cited by: §1.