In the Eye of the Beholder: Gaze and Actions in First Person Video

05/31/2020
by Yin Li, et al.

We address the task of jointly determining what a person is doing and where they are looking, based on the analysis of video captured by a head-worn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks, and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our method describes the participant's gaze as a probabilistic variable and models its distribution using stochastic units in a deep network. We further sample from these stochastic units to generate an attention map that guides the aggregation of visual features for action recognition. Our method is evaluated on our EGTEA Gaze+ dataset and exceeds the state of the art by a significant margin. More importantly, we demonstrate that our model can be applied to the larger-scale FPV dataset EPIC-Kitchens, even without using gaze, offering new state-of-the-art results on FPV action recognition.
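
To make the modeling idea above concrete, here is a minimal sketch (PyTorch style, not the authors' released code) of treating gaze as a stochastic spatial attention map that gates feature aggregation for action recognition. The module name, tensor shapes, and the use of a Gumbel-softmax relaxation for differentiable sampling are assumptions for illustration only.

    # Illustrative sketch, not the paper's implementation: gaze modeled as a
    # probabilistic spatial variable, sampled to form an attention map that
    # guides feature aggregation for action classification.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StochasticGazeAttention(nn.Module):
        def __init__(self, in_channels: int, num_actions: int):
            super().__init__()
            # Predicts unnormalized log-probabilities of gaze over spatial locations.
            self.gaze_head = nn.Conv2d(in_channels, 1, kernel_size=1)
            self.classifier = nn.Linear(in_channels, num_actions)

        def forward(self, feats: torch.Tensor, tau: float = 1.0):
            # feats: (B, C, H, W) visual features from a video backbone.
            b, c, h, w = feats.shape
            logits = self.gaze_head(feats).view(b, h * w)          # (B, H*W)

            # Draw a soft sample from the gaze distribution; the Gumbel-softmax
            # relaxation (an assumption here) keeps the sampling differentiable.
            attn = F.gumbel_softmax(logits, tau=tau, hard=False)   # (B, H*W)
            gaze_map = attn.view(b, 1, h, w)

            # Attention-weighted aggregation of features, then action classification.
            pooled = (feats * gaze_map).sum(dim=(2, 3))            # (B, C)
            action_logits = self.classifier(pooled)
            return action_logits, gaze_map

In a sketch like this, the sampled attention map can be supervised with recorded gaze when it is available (as in EGTEA Gaze+) and trained from the action loss alone when it is not (as in EPIC-Kitchens).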
