Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions

Yifei Huang et al., The University of Tokyo, 01/07/2019

In this work, we address the two coupled tasks of gaze prediction and action recognition in egocentric videos by exploring their mutual context. Our assumption is that when a person performs a manipulation task, what the person is doing determines where the person is looking, and the gaze point divides each frame into gaze and non-gaze regions that carry important and complementary information about the ongoing action. We propose a novel mutual context network (MCN) that jointly learns action-dependent gaze prediction and gaze-guided action recognition in an end-to-end manner. Experiments on public egocentric video datasets demonstrate that our MCN achieves state-of-the-art performance on both gaze prediction and action recognition.


1 Introduction

The popularity of wearable cameras in recent years has been accompanied by a large number of first-person, or egocentric, videos that record people's daily interactions with their surrounding environments. The demand for automatic analysis of egocentric videos has promoted various egocentric vision techniques [1] such as egocentric video hyper-lapse [20, 35] and video summarization [26, 48]. In particular, the tasks of understanding what a person is doing and where the person is looking have attracted great interest from researchers. The former is often called egocentric action recognition [24, 27, 52] and the latter egocentric gaze prediction [22, 55, 13]. Although both tasks have been studied extensively in the past years, few works have focused on the relationship between them, even though they are in fact deeply related.

This work aims to jointly model the two coupled tasks of gaze prediction and action recognition in egocentric videos. Previous works have studied how gaze can be used for action recognition [7, 23]: they model human gaze in egocentric videos and use the estimated gaze points to remove unrelated background information, in order to improve action recognition. However, while the guidance of gaze for action recognition has been studied, gaze itself was simply modeled as a saliency prediction problem, and little effort has been made to explicitly explore the influence of actions on gaze prediction.

In an egocentric video, background regions are often cluttered and may contain multiple salient regions, so it is difficult for a saliency-based model to predict gaze reliably without additional information about the locus of attention. Psychologists have shown that the action performed by a person implicitly affects where the person looks [45, 46, 38]. For example, to take a knife from a table, a person first moves his/her focus onto the knife and then keeps fixating on it before grasping it. Also, when people perform the same daily actions, such as "put cup" or "take plate", they often exhibit similar gaze movements. Therefore, we argue that for better modeling of gaze and actions in egocentric videos, not only gaze-guided action recognition (gaze context for actions) but also action-dependent gaze prediction (action context for gaze) should be jointly considered.

Figure 1: Different actions in egocentric videos will result in different gaze patterns. Our method leverages this observation and explicitly uses action likelihood as a prior for egocentric gaze prediction.

In this paper, we propose a mutual context network (MCN) that jointly predicts human gaze and recognizes actions in egocentric videos by exploiting the mutual context between the two coupled tasks. The proposed MCN takes a video sequence as input and outputs an action likelihood as well as a gaze probability map for each frame. Two novel modules are developed within the model to leverage the context from the predicted actions and gaze probability maps, respectively. The first module, the action-based gaze prediction module, takes the predicted action likelihood as input and produces a set of convolutional kernels relevant to the action being performed; the generated action kernels are then convolved with the input feature maps to locate action-related regions. The second module, the gaze-guided action recognition module, uses the estimated gaze point as a guideline to spatially aggregate the input features for action recognition. Rather than using only the region around the gaze point as in previous work, features are aggregated separately in the gaze region and the non-gaze region and both are fed to the gaze-guided action recognition module, while the relative importance of the two regions is learned automatically during training.

Our main contributions are summarized as follows:

  • We propose a novel MCN for both egocentric gaze prediction and action recognition that leverages the mutual context between the two tasks.

  • We propose a novel action-based gaze prediction module that explicitly utilizes information from the estimated action for gaze prediction. This is done by generating convolution kernels for gaze prediction adaptively from the estimated action likelihood.

  • Our proposed MCN achieves state-of-the-art performance in both gaze prediction and action recognition and is able to learn action-dependent gaze patterns.

2 Related works

2.1 Egocentric gaze prediction

Predicting gaze in an egocentric video can benefit a diverse range of applications such as joint attention discovery [12, 16, 30], action recognition [7], human computer interaction [8, 17, 21], and video summarization [48]. Despite the correlation between gaze and saliency [31], previous works have revealed the need for additional cues for predicting gaze in egocentric videos [22, 44, 50, 51, 55, 56]. Li [22] used head motion and hand cues in a graphical model for gaze prediction. However, the pre-defined egocentric cues may limit the generalization ability of their model. Huang [13] proposed a hybrid deep model which incorporates task-dependent gaze shift patterns in addition to a bottom-up saliency-based model. However, they did not consider the differences in gaze patterns with respect to different actions.

In this work, we explicitly leverage the contextual information from the performed actions for gaze prediction by using the predicted action likelihood. To the best of our knowledge, this is the first work to explore the influence of actions on egocentric gaze prediction.

2.2 Egocentric action recognition

Egocentric action recognition is one of the central topics in egocentric vision and has been studied extensively in recent years [9, 24, 25, 28, 29, 34, 33, 41, 43, 53, 54, 4]. Kitani [19] used global motion to discover different egocentric actions in an unsupervised manner. Fathi [6] adopted a graphical model to recognize actions in relation to objects and head/hand motion. Ryoo [37] proposed a novel pooling method for action recognition. Ma [27] proposed a comprehensive deep model for recognizing objects and actions jointly. Singh [40] used additional inputs such as hand masks to improve action recognition performance. Sudhakaran [42] used object-centric attention in a recurrent neural network to achieve better action recognition performance. Different from these works, our method recognizes actions with contextual information from gaze by modeling actions and gaze in a unified framework.

2.3 Gaze and actions

Human gaze and actions are deeply correlated in egocentric videos, and the use of gaze has been shown to be beneficial for action recognition [7, 39, 57]. However, little work has been done on the joint modeling of egocentric gaze prediction and action recognition. Extending [7], Li [23] proposed a deep model for jointly modeling gaze and actions: they modeled the probabilistic nature of gaze and used the estimated gaze for better action recognition. However, their work did not explicitly consider the contextual information from actions for gaze prediction, and gaze prediction is less reliable without such context.

In this work, we leverage the mutual context of gaze and actions in our proposed model by using the action likelihood as a conditional input for gaze prediction and, simultaneously, using gaze as guidance for action recognition. By explicitly exploring this mutual context, our model achieves state-of-the-art performance on both gaze prediction and action recognition.

3 Our proposed MCN

Figure 2: Our proposed mutual context network (MCN) contains 5 sub-modules: the feature encoding module, which encodes input video frames into feature maps $F$; the gaze-guided action recognition module, which uses gaze as a guideline to recognize actions; the action-based gaze prediction module, which takes the predicted action likelihood as input and outputs an action-dependent gaze probability map $G_a$; the saliency-based gaze prediction module, which outputs a saliency map $G_s$; and finally the late fusion module, which produces the final gaze probability map $G$.

3.1 Overview

When performing a task, especially a hand manipulation task, human gaze and hand-object interactions are tightly related. While the image region around a person's gaze point explicitly reveals important information about the ongoing action, the action performed by the person implicitly affects where the person is looking. In this work, we propose a mutual context network (MCN) that uses the estimated action to predict the gaze point and uses gaze as guidance for action recognition.

Figure 2 depicts the architecture of our MCN. The input RGB frames and optical flow images are first encoded into feature maps by the feature encoding module, which are then used as input to the following modules. One key component of our model is the action-based gaze prediction module, which learns to predict gaze using the predicted action likelihood as a conditional input. As complementary information for gaze prediction, we also obtain a saliency map with the saliency-based gaze prediction module. The outputs of the two modules are then fused by the late fusion module to obtain the final gaze probability map. The other key component is the gaze-guided action recognition module, which takes the predicted gaze as guidance to selectively aggregate the input features for action recognition. The output action likelihood is in turn used as the conditional input to the action-based gaze prediction module, thus constructing a loop of mutual context.
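To make this loop of mutual context concrete, the sketch below shows one way the five sub-modules could be wired together in PyTorch. The class and argument names are ours for illustration (the paper does not describe this interface), and the action likelihood fed to the action-based branch comes from the previous inference step, as in the alternating procedure of Section 3.6.

```python
import torch.nn as nn


class MCN(nn.Module):
    """Illustrative wiring of the five sub-modules in Figure 2 (names are ours)."""

    def __init__(self, encoder, saliency_head, action_gaze_head, fusion, action_head):
        super().__init__()
        self.encoder = encoder                    # feature encoding module (Sec. 3.2)
        self.saliency_head = saliency_head        # saliency-based gaze prediction (Sec. 3.3)
        self.action_gaze_head = action_gaze_head  # action-based gaze prediction (Sec. 3.4)
        self.fusion = fusion                      # late fusion module (Eq. 4)
        self.action_head = action_head            # gaze-guided action recognition (Sec. 3.5)

    def forward(self, rgb, flow, action_prior):
        # action_prior: action likelihood from the previous inference step (Sec. 3.6)
        feats = self.encoder(rgb, flow)                   # shared spatio-temporal features F
        g_s = self.saliency_head(feats)                   # bottom-up saliency map G_s
        g_a = self.action_gaze_head(feats, action_prior)  # action-conditioned gaze map G_a
        g = self.fusion(g_a, g_s)                         # final gaze probability map G
        a_hat = self.action_head(feats, g)                # gaze-guided action likelihood
        return g, a_hat
```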

3.2 Feature encoding module

We adopt the first four convolutional blocks of the 3D convolutional network I3D [3] for feature encoding. Following [23], we fuse the RGB stream and the optical flow stream at the end of the 4th convolutional block by element-wise summation. With this 3D encoder, the output feature map $F$ is of size $C \times T \times H \times W$, where $C$ is the number of channels, $T$ is the temporal dimension, and $H$ and $W$ are the spatial height and width.
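A minimal sketch of this two-stream encoding is given below, assuming the first four I3D blocks of each stream are available as backbone modules; only the element-wise summation of the RGB and flow features is taken from the paper, the rest of the interface is illustrative.

```python
import torch.nn as nn


class TwoStreamEncoder(nn.Module):
    """Sketch of the feature encoding module: two 3D-CNN backbones, one per stream,
    fused by element-wise summation. A generic `backbone` stands in for the first
    four I3D blocks, which in practice come from a pretrained I3D."""

    def __init__(self, rgb_backbone: nn.Module, flow_backbone: nn.Module):
        super().__init__()
        self.rgb_backbone = rgb_backbone    # first 4 conv blocks of I3D (RGB stream)
        self.flow_backbone = flow_backbone  # first 4 conv blocks of I3D (flow stream)

    def forward(self, rgb, flow):
        # rgb:  (B, 3, T, H, W) stacked RGB frames
        # flow: (B, 2, T, H, W) stacked optical-flow images
        f_rgb = self.rgb_backbone(rgb)
        f_flow = self.flow_backbone(flow)
        return f_rgb + f_flow               # element-wise summation of the two streams
```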

3.3 Saliency-based gaze prediction module

Image regions with high saliency tend to attract human attention. For instance, regions with unique and distinguishing features such as a moving object or high contrast of brightness are more likely to be looked at than other regions. Therefore, we use a saliency-based gaze prediction module to learn the image regions that are more likely to draw human attention. For this, we use a 3D decoder that takes the encoded feature map as input and outputs a series of gaze probability maps with each pixel value within the range of [0, 1]. While this bottom-up approach provides information about salient regions in the image, it is not sufficient to reliably identify the attended region when multiple salient regions exist, which is common in egocentric video.
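As an illustration of such a bottom-up decoder, the sketch below stacks 3D transposed convolutions followed by a 1x1x1 convolution and a sigmoid so that each output pixel lies in [0, 1]. The number of upsampling stages matches the four-layer decoder described in Section 3.6, but the kernel sizes, strides, and channel widths are our own assumptions.

```python
import torch.nn as nn


def make_gaze_decoder(in_channels: int, widths=(256, 128, 64, 32)):
    """Sketch of a 3D decoder for the saliency-based gaze prediction module:
    four transposed-conv upsampling stages (BatchNorm + ReLU after each),
    followed by a 1x1x1 conv and a sigmoid so every pixel lies in [0, 1].
    Kernel sizes, strides, and channel widths are illustrative choices."""
    layers = []
    c_in = in_channels
    for c_out in widths:
        layers += [
            nn.ConvTranspose3d(c_in, c_out, kernel_size=(1, 4, 4),
                               stride=(1, 2, 2), padding=(0, 1, 1)),  # doubles H and W
            nn.BatchNorm3d(c_out),
            nn.ReLU(inplace=True),
        ]
        c_in = c_out
    layers += [nn.Conv3d(c_in, 1, kernel_size=1), nn.Sigmoid()]
    return nn.Sequential(*layers)
```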

3.4 Action-based gaze prediction module

As different actions are associated with different objects and motions, the gaze patterns of different actions differ, so the gaze prediction module should leverage action information for more reliable gaze prediction. To this end, inspired by [49, 5], we use the output of the gaze-guided action recognition module to generate a group of convolutional kernels that identify the regions relevant to the performed action. The generated action kernels are then convolved with the input features to locate the action-related regions. Finally, gaze probability maps of the same size as the input frames are generated by a decoder consisting of deconvolutional layers.

More formally, given the action likelihood $\hat{a}$ estimated by the action recognition module and the input feature maps $F$ with $C$ channels ($T$ and $H \times W$ are the temporal and spatial dimensions), the gaze probability map $G_a$ is generated through the following procedure:

$K = \phi(\hat{a})$   (1)
$\tilde{F} = K * F$   (2)
$G_a = \mathrm{Dec}(\tilde{F})$   (3)

where $\phi$ is the kernel generator, $K$ is the group of generated kernels, $\tilde{F}$ is the filtered feature maps, $\mathrm{Dec}$ is the decoder, and $*$ denotes the convolution operator. The kernel generator $\phi$ contains one fully connected layer and two convolutional layers; the output of the fully connected layer is first reshaped and then forwarded to the two convolutional layers.
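The sketch below illustrates Eqs. (1)-(2): a kernel generator built from one fully connected layer and two convolutional layers maps the action likelihood to a bank of per-sample kernels, which are then convolved with the encoded features. The channel counts and kernel shape here are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionKernelGenerator(nn.Module):
    """Sketch of the kernel generator phi: maps the action likelihood a_hat to a
    bank of convolution kernels that highlight action-related regions."""

    def __init__(self, num_actions: int, feat_channels: int,
                 n_kernels: int = 16, ksize: int = 3):
        super().__init__()
        self.n_kernels, self.ksize, self.feat_channels = n_kernels, ksize, feat_channels
        # one fully connected layer, reshaped, then two conv layers (as in the paper)
        self.fc = nn.Linear(num_actions, n_kernels * ksize * ksize)
        self.refine = nn.Sequential(
            nn.Conv2d(n_kernels, n_kernels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(n_kernels, n_kernels * feat_channels, 3, padding=1),
        )

    def forward(self, a_hat):
        # a_hat: (B, num_actions) action likelihood from the recognition module
        k = self.fc(a_hat).view(-1, self.n_kernels, self.ksize, self.ksize)
        k = self.refine(k)
        # K: per-sample kernels of shape (n_kernels, feat_channels, ksize, ksize)
        return k.view(-1, self.n_kernels, self.feat_channels, self.ksize, self.ksize)


def apply_action_kernels(features, kernels):
    """Convolve the per-sample generated kernels K with the features F (Eq. 2)."""
    B, C, T, H, W = features.shape
    out = []
    for b in range(B):
        f_b = features[b].permute(1, 0, 2, 3)   # (T, C, H, W): treat time as a batch
        out.append(F.conv2d(f_b, kernels[b], padding=kernels.shape[-1] // 2))
    return torch.stack(out).permute(0, 2, 1, 3, 4)   # (B, n_kernels, T, H, W)
```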

We also adopt the saliency-based gaze prediction module, which can be seen as complementary to the action-based gaze prediction module. Finally, we use a late fusion module to combine the outputs $G_a$ and $G_s$ of the two modules:

$G = \mathrm{Fuse}(G_a, G_s)$   (4)

Late fusion has proved effective in previous work on gaze prediction [13]. Following previous works [55, 22], we take the spatial location with the maximum likelihood on $G$ as the predicted gaze point.
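A possible implementation of the late fusion module in Eq. (4) is sketched below, using the layer configuration reported in Section 3.6 (four convolutions with output channels 32, 32, 8, 1). Applying the convolutions per frame in 2D and ending with a sigmoid are assumptions on our part.

```python
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Sketch of the late fusion module (Eq. 4): G_a and G_s are stacked as a
    2-channel input and passed through four conv layers (channels 32, 32, 8, 1)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, 1),
            nn.Sigmoid(),   # keep the fused map in [0, 1] (our assumption)
        )

    def forward(self, g_a, g_s):
        # g_a, g_s: (B, 1, T, H, W) gaze probability maps
        B, _, T, H, W = g_a.shape
        x = torch.cat([g_a, g_s], dim=1)                       # (B, 2, T, H, W)
        x = x.permute(0, 2, 1, 3, 4).reshape(B * T, 2, H, W)   # fold time into batch
        g = self.net(x).reshape(B, T, 1, H, W).permute(0, 2, 1, 3, 4)
        return g                                               # final map G, same shape as inputs
```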

3.5 Gaze-guided action recognition module

Here we describe the gaze-guided action recognition module of our MCN, which uses the predicted gaze point as a guide to extract discriminative features for action recognition. Previous works [7, 23] mostly used gaze as a filter to remove features of image regions far from the gaze point. However, focusing only on the region around the gaze point may lose important information about the action. We observed that when performing certain actions such as "put an object", a person may fixate on the table on which to place the object instead of looking at the object in hand, which carries critical information about the action. Therefore, we argue that while the gaze region is important, the region outside it (the non-gaze region) may also contain complementary information about the action. In this work, we develop a two-way pooling structure that aggregates features in the gaze and non-gaze regions separately and uses both for action recognition.

As shown in Figure 2, we first forward $F$ to the fifth convolutional block of I3D to encode more compact features $F_5$. At each temporal index $t$ of $F_5$, we locate the corresponding spatial gaze point $(x_t, y_t)$ on the feature map by selecting the maximum spatial location of the 3D max-pooled gaze map $G$. Then we split the spatial dimensions of the feature map into two parts: the gaze region and the non-gaze region. The gaze region (dark green region of $F_5$ in the figure) consists of the locations whose spatial positions fall within a fixed range around $(x_t, y_t)$, and the non-gaze region is the remaining area (light green region of $F_5$ in the figure). We pool the two regions separately over the spatial dimensions, generating two feature tensors $f^{g}$ and $f^{n}$:

$f^{g}(k, t) = \operatorname{pool}_{(i, j) \in \mathcal{R}_t} F_5(k, t, i, j)$   (5)
$f^{n}(k, t) = \operatorname{pool}_{(i, j) \notin \mathcal{R}_t} F_5(k, t, i, j)$   (6)

where $F_5(k, t, i, j)$ denotes the value of the $k$-th channel of $F_5$ at temporal index $t$ and spatial position $(i, j)$, $\mathcal{R}_t$ is the gaze region at temporal index $t$, and $\operatorname{pool}$ denotes spatial pooling over the indicated region.

The pooled feature tensors $f^{g}$ and $f^{n}$ are fed into two 1x1x1 convolution layers (denoted as $\psi_g$ and $\psi_n$), and the outputs are channel-wise concatenated and forwarded into the final 1x1x1 convolution layer (denoted as $\psi_{out}$) for prediction. We average the predictions over the temporal dimension to get the action likelihood $\hat{a}$:

$h^{g} = \psi_g(f^{g})$   (7)
$h^{n} = \psi_n(f^{n})$   (8)
$\hat{a} = \frac{1}{T'} \sum_{t=1}^{T'} \psi_{out}\big( h^{g}_t \,\|\, h^{n}_t \big)$   (9)

Here $\|$ denotes channel-wise concatenation and $T'$ is the temporal length of $F_5$. We set the number of output channels of $\psi_n$ to be smaller than that of $\psi_g$: since modeling the non-gaze region is empirically simpler than modeling the gaze region, we limit its channel size to prevent over-fitting.
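The following sketch illustrates the two-way pooling of Eqs. (5)-(9): features are pooled inside and outside a window around the predicted gaze point, passed through two 1x1x1 convolutions, concatenated, classified, and averaged over time. The window radius, channel widths, and the choice of average pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GazeGuidedHead(nn.Module):
    """Sketch of the gaze-guided action recognition head (Eqs. 5-9)."""

    def __init__(self, in_channels: int, num_actions: int,
                 radius: int = 1, c_gaze: int = 512, c_non: int = 128):
        super().__init__()
        self.radius = radius
        self.conv_gaze = nn.Conv3d(in_channels, c_gaze, 1)   # psi_g
        self.conv_non = nn.Conv3d(in_channels, c_non, 1)     # psi_n (fewer channels)
        self.classifier = nn.Conv3d(c_gaze + c_non, num_actions, 1)   # psi_out

    def forward(self, feats, gaze_map):
        # feats:    (B, C, T, H, W) output of the 5th I3D block
        # gaze_map: (B, T, H, W)    gaze probabilities pooled to the feature resolution
        B, C, T, H, W = feats.shape
        flat = gaze_map.view(B, T, -1).argmax(dim=-1)
        gy, gx = flat // W, flat % W                          # gaze point per (batch, time)
        ys = torch.arange(H, device=feats.device).view(1, 1, H, 1)
        xs = torch.arange(W, device=feats.device).view(1, 1, 1, W)
        in_gaze = ((ys - gy.view(B, T, 1, 1)).abs() <= self.radius) & \
                  ((xs - gx.view(B, T, 1, 1)).abs() <= self.radius)   # (B, T, H, W)
        mask = in_gaze.unsqueeze(1).float()                   # (B, 1, T, H, W)

        def masked_avg(m):                                    # spatial average over a region
            return (feats * m).sum(dim=(3, 4), keepdim=True) / \
                   m.sum(dim=(3, 4), keepdim=True).clamp(min=1)

        f_gaze, f_non = masked_avg(mask), masked_avg(1.0 - mask)   # Eqs. (5)-(6)
        h = torch.cat([self.conv_gaze(f_gaze), self.conv_non(f_non)], dim=1)
        logits = self.classifier(h).squeeze(-1).squeeze(-1)   # (B, num_actions, T)
        return logits.mean(dim=-1)                            # temporal average (Eq. 9)
```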

3.6 Implementation and training details

The whole framework is implemented in PyTorch [32]. The feature encoding module is identical to the first 4 convolutional blocks of the I3D [3] network without the last pooling layer, and takes 24 stacked frames as input. The decoder contains a set of 4 transposed convolution layers with padding 1, each followed by batch normalization and ReLU activation. We add another convolution layer with kernel size 1 and a sigmoid layer on top of the decoder so that the output values lie within [0, 1]. The kernel generator takes the action likelihood vector $\hat{a} \in \mathbb{R}^{N_a}$ as input, where $N_a$ is the number of action categories; the output of its fully connected layer is reshaped and passed to its two convolutional layers, which use kernel size 3, stride 1, and padding 1. For the gaze-guided action recognition module, the convolution block is identical to the fifth convolution block of the I3D network, and the 3D max-pooling layer that maps the gaze probability map to the feature resolution has kernel size (8, 32, 32). The late fusion module is composed of 4 convolutional layers with output channels 32, 32, 8, and 1, in which the first 3 layers have a kernel size of 3 with zero padding of 1 and the last layer has a kernel size of 1 with no padding.

For training the whole network, we first train the gaze-guided action recognition module and the saliency-based gaze prediction module using ground truth action labels and gaze positions. We use the Adam optimizer [18] in all experiments, and the base I3D weights are initialized from weights pretrained on the Kinetics dataset [15]. We then use the output of action recognition to train the action-based gaze prediction module and finally the late fusion module. We use cross-entropy loss for action recognition and binary cross-entropy loss for gaze prediction, applying a Gaussian centered on the gaze point to generate ground truth maps for gaze prediction. The learning rates for the action recognition module and the gaze prediction modules are fixed separately throughout training. For data augmentation during training, we first resize the images and then apply random cropping and random horizontal flipping with probability 0.5; the ground truth gaze maps undergo the same augmentation. At test time, we resize the image, feed both the image and its flipped version, and report the averaged performance.
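For reference, ground truth gaze maps of the kind used for the binary cross-entropy loss can be generated by placing a Gaussian at each annotated gaze point, as in the sketch below; the sigma value is a placeholder, since the paper's exact value is not reproduced here.

```python
import torch


def gaze_point_to_map(gaze_xy, height, width, sigma=10.0):
    """Sketch of ground-truth gaze map generation: place an isotropic Gaussian at
    the annotated gaze point of each frame (sigma is an illustrative placeholder).

    gaze_xy: (T, 2) tensor of (x, y) pixel coordinates, one per frame.
    Returns a (T, height, width) map whose peak value is 1 at the gaze point.
    """
    ys = torch.arange(height).view(1, height, 1).float()
    xs = torch.arange(width).view(1, 1, width).float()
    gx = gaze_xy[:, 0].view(-1, 1, 1).float()
    gy = gaze_xy[:, 1].view(-1, 1, 1).float()
    d2 = (xs - gx) ** 2 + (ys - gy) ** 2          # squared distance to the gaze point
    return torch.exp(-d2 / (2.0 * sigma ** 2))
```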

1: Initialize the gaze prediction $G$ using the saliency-based gaze prediction module ($G \leftarrow G_s$).
2: Initialize the action likelihood $\hat{a}$ from the gaze-guided action recognition module using $G$.
3: while $e > \epsilon$ and $n < n_{\max}$ do
4:     Get $G_a$ from the action-based gaze prediction module using $\hat{a}$;
5:     Update $G$ by fusing the new $G_a$ with $G_s$ in the late fusion module (Eq. 4);
6:     Update $\hat{a}$ from the gaze-guided action recognition module based on $G$;
7:     Compute the AAE $e$ between the current and the previous $G$;
8:     $n \leftarrow n + 1$
9: end while
Algorithm 1: Alternating inference procedure

We iteratively infer gaze positions and action likelihood vectors in an alternating fashion, as described in Algorithm 1. The iteration terminates when the change in the gaze prediction from the previous iteration (measured by average angular error, AAE) falls below a threshold, or when the number of iterations exceeds an upper bound.
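A compact sketch of this alternating inference is given below, with the sub-modules passed in as callables; the stopping threshold and iteration cap are placeholder values, and `aae` stands for any function measuring the average angular error between the gaze points derived from two gaze maps.

```python
def alternating_inference(encode, saliency_head, action_gaze_head, fuse, action_head,
                          rgb, flow, aae, eps=0.5, max_iter=5):
    """Sketch of Algorithm 1 (alternating inference). All module arguments are
    callables corresponding to the sub-modules in Figure 2; eps / max_iter are
    illustrative values, not the paper's settings."""
    feats = encode(rgb, flow)
    g_s = saliency_head(feats)        # step 1: initialize gaze with the saliency module
    g = g_s
    a_hat = action_head(feats, g)     # step 2: initialize the action likelihood
    for _ in range(max_iter):
        g_a = action_gaze_head(feats, a_hat)   # action-conditioned gaze map
        g_new = fuse(g_a, g_s)                 # update the fused gaze prediction
        a_hat = action_head(feats, g_new)      # update the action likelihood
        if aae(g_new, g) < eps:                # stop when the gaze prediction stabilizes
            g = g_new
            break
        g = g_new
    return g, a_hat
```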

4 Experiments

4.1 Dataset and evaluation metric

Our experiments are conducted on two public datasets: EGTEA [23] and GTEA Gaze+ [22]. The GTEA Gaze+ dataset consists of 7 activities performed by 5 subjects, with each video clip lasting 10 to 15 minutes. We perform 5-fold cross validation across all 5 subjects and report the average, as in [22]. The EGTEA dataset is an extension of GTEA Gaze+ that contains 29 hours of egocentric videos recorded at 24 fps, taken from 86 unique sessions in which 32 subjects perform meal preparation tasks in a kitchen environment. Fine-grained annotations of 106 action classes are provided together with measured ground truth gaze points for all frames. Following [23], we use the first split (8299 training and 2022 testing instances) of the dataset to evaluate gaze prediction and action recognition. We use the trimmed action clips of both datasets for training and testing unless otherwise noted.

We compare different methods on both tasks of gaze prediction and action recognition. For gaze prediction, we adopt two commonly used evaluation metrics: AAE (Average Angular Error, in degrees) [36] and AUC (Area Under Curve) [2]. For action recognition, we use classification accuracy as the evaluation metric.
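For concreteness, AAE can be computed by converting each predicted and ground-truth gaze point into a viewing direction using the camera's field of view and averaging the angle between the two directions, as in the sketch below. The field-of-view values must come from the dataset's camera and are inputs here, so this is an illustration of the metric rather than the exact evaluation script.

```python
import math


def average_angular_error(pred_pts, gt_pts, width, height, fov_x_deg, fov_y_deg):
    """Sketch of AAE: mean angle (in degrees) between predicted and ground-truth
    gaze directions, with pixel coordinates mapped to angles via the camera FOV."""

    def direction(x, y):
        # angles of the pixel relative to the image center
        ax = math.radians((x / width - 0.5) * fov_x_deg)
        ay = math.radians((y / height - 0.5) * fov_y_deg)
        # unit vector pointing from the camera through the pixel
        vx, vy, vz = math.tan(ax), math.tan(ay), 1.0
        n = math.sqrt(vx * vx + vy * vy + vz * vz)
        return vx / n, vy / n, vz / n

    total = 0.0
    for (px, py), (gx, gy) in zip(pred_pts, gt_pts):
        p, g = direction(px, py), direction(gx, gy)
        cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, g))))
        total += math.degrees(math.acos(cos))
    return total / len(pred_pts)
```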

4.2 Gaze prediction results

We compare our method with the following baselines:

  • Saliency prediction methods: we use two representative traditional methods, GBVS [10] and Itti's model [14], as baselines. We also re-implement the deep FCN-based model SALICON [11] and train it on the same dataset, using the gaze annotations as ground truth saliency maps, as another baseline.

  • Egocentric gaze prediction methods: we also compare with the three egocentric gaze prediction methods most related to our work: the coarse gaze prediction method of Li [23], the GAN-based method DFG [55], and the attention transition-based method of Huang [13]. Since [23] only outputs a coarse gaze prediction map, we resize their output using bilinear interpolation. For Li and DFG, we report results based on our own implementations, as no code is publicly available; for Huang, we use the authors' original implementation. However, Huang is designed for untrimmed video since it requires knowledge about continuous attention transition. Therefore, we also report the performance of Huang† [13] trained using the full untrimmed dataset.

  • Subsets of our full MCN: we also conduct an ablation study using subsets of our full model, namely the saliency-based gaze prediction module (Saliency-based) and the action-based gaze prediction module (Action-based). In addition, we test the action-based gaze prediction module with ground truth action labels (Action-based (GT)).

Method | EGTEA AAE | EGTEA AUC | GTEA Gaze+ AAE | GTEA Gaze+ AUC
GBVS [10] | 12.81 | 0.707 | 12.68 | 0.829
Itti [14] | 12.50 | 0.717 | 12.73 | 0.801
SALICON [11] | 11.17 | 0.881 | 12.34 | 0.867
Li [23] | 8.58 | 0.870 | 8.97 | 0.889
DFG [55] | 6.30 | 0.923 | 6.39 | 0.910
Huang [13] | 6.25 | 0.925 | 6.23 | 0.924
Huang† [13] | N/A | N/A | (4.83) | (0.939)
Saliency-based | 6.36 | 0.922 | 6.57 | 0.929
Action-based | 6.20 | 0.928 | 6.35 | 0.923
Action-based (GT) | 6.04 | 0.927 | 6.20 | 0.933
Our full MCN | 5.79 | 0.932 | 5.74 | 0.945
Table 1: Comparison of gaze prediction performance on the two datasets. Results of previous methods are on top; results of our full MCN and its subsets are at the bottom. Lower AAE and higher AUC indicate better performance. (GT) denotes using ground truth action labels as input. Values in brackets indicate the method is trained on the full untrimmed dataset.

Table 1 shows the quantitative comparison of gaze prediction performance. Although our saliency-based gaze prediction module alone does not outperform the state-of-the-art gaze prediction methods [55, 13], our action-based gaze prediction module clearly outperforms all previous methods trained on the same dataset, demonstrating the usefulness of action information for gaze prediction. Our full MCN further outperforms the action-based gaze prediction module, indicating that an ideal gaze prediction method should consider both bottom-up visual saliency and the top-down influence of actions. The superiority of the encoder-decoder based SALICON [11] over [23] reveals the importance of a decoder-based structure for fine-grained gaze prediction.

It should be noted that Huang† trained with untrimmed videos outperforms our method in terms of AAE on the GTEA Gaze+ dataset. The comparison between our method and the two variants Huang and Huang† shows that while our method benefits from action-based gaze prediction and achieves state-of-the-art performance on the trimmed dataset, its current version cannot fully exploit the additional data available in the untrimmed dataset. This indicates a potential research direction and is discussed as future work in Section 5.

Comparing the subsets of our MCN, the action-based module performs better than the saliency-based module and even better than the state-of-the-art methods, indicating the effectiveness of action information for gaze prediction. When the action-based module is fed with ground-truth action labels, the performance improves further. The performance of our full MCN improves significantly by integrating the two sub-modules, strongly indicating that the action-based and saliency-based gaze prediction modules carry complementary information and should be jointly considered.

Figure 3: Qualitative visualizations of gaze prediction results on EGTEA dataset. We show the output heatmap from our full MCN and several baselines. Ground truth action labels and gaze points (GT) are placed on the leftmost columns.

Qualitative results are shown in Figure 3. With the help of the action-based gaze prediction module, our full MCN can better locate the action and thus gives better gaze prediction results. For example, in the first row, our MCN successfully recognizes the action as "take paper towel" and thus finds the paper towel in the hand, whereas the baseline methods mostly focus on the stove or other salient regions. In the second row, while other methods are distracted by the plates and food on the counter, our MCN successfully locates the hand with the dishrag in the bottom right corner and the part of the counter that will be cleaned in the next few frames. More interestingly, as shown in the fourth row, the lettuce at the ground-truth gaze fixation lies on a cluttered kitchen table, which is challenging for other methods to locate; still, our full MCN correctly predicts the gaze to be on the lettuce with the help of context from the action "take lettuce". Similar situations can be found in the other rows of the figure.

4.3 Examination of action-based gaze prediction module

We conduct an additional experiment on the 20 most frequent actions in the test set of the EGTEA dataset to examine our action-based gaze prediction module. We feed the module with the action label of each of the 20 action classes and examine how the gaze prediction performance (AAE score) varies when the module is tested on the videos of each of these actions. For example, we feed the action-based gaze prediction module with the action label of "take plate" and compute the AAE scores on the videos of all 20 actions. As a result, we obtain a $20 \times 20$ matrix of AAE scores, denoted by $M$, in which $M_{ij}$ is the AAE score of the action-based gaze module fed with the label of the $i$-th action and applied to the videos of the $j$-th action.

We find that the average AAE score on the diagonal of $M$ is 6.21, while the average AAE score on the non-diagonal entries is 6.87. This indicates that the action-based gaze prediction module benefits more from correct action information than from incorrect labels. We also observe that there exist several action groups (e.g., "cut something") for which feeding the module with an incorrect action label from the same group does not much affect the gaze prediction performance. Please see the supplementary material for a more detailed description of the experimental setting and result analysis.
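The diagonal versus off-diagonal comparison above can be reproduced from such a matrix $M$ with a few lines of NumPy, as sketched below.

```python
import numpy as np


def diagonal_vs_offdiagonal(aae_matrix):
    """Sketch of the analysis in Sec. 4.3: given the 20x20 matrix M of AAE scores
    (row i = action label fed to the module, column j = videos of action j),
    compare the mean AAE on the diagonal (correct action context) with the mean
    AAE off the diagonal (incorrect action context)."""
    m = np.asarray(aae_matrix, dtype=float)
    on_diag = m[np.diag_indices_from(m)].mean()     # correct action labels
    off_mask = ~np.eye(m.shape[0], dtype=bool)
    off_diag = m[off_mask].mean()                   # incorrect action labels
    return on_diag, off_diag   # the paper reports 6.21 vs. 6.87 on EGTEA
```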

4.4 Action recognition results

As for the task of action recognition, we compare our method with the following baselines:

  • I3D [3] is one of the state-of-the-art models for action recognition. We refer to [23] for the accuracy of this baseline method.

  • Methods using measured gaze: I3D+Gaze uses a ground truth gaze point as a guideline to pool feature maps from the last convolution layer of the fifth convolutional block. EgoIDT+Gaze [24] is a traditional method which uses dense trajectories [47] selected by a ground truth gaze point for action recognition.

  • State-of-the-art egocentric action recognition methods: Li [23] uses an estimated gaze probability map as soft attention to perform a weighted average over the top I3D features. Sudhakaran [42] adopts an attention mechanism in a recurrent neural network to recognize actions. We also compare our method with Ma [27] and Shen [39], which use additional annotations of object locations and hand masks during training; [39] even uses ground-truth gaze positions as input during testing. For these methods, we report the performance from their original papers.

  • Baseline of our model: MCN (gaze region) is a variant of our MCN that pools only the gaze region, without using the non-gaze region. We use this baseline to validate the usefulness of information from the non-gaze region.

Method | EGTEA | GTEA Gaze+
EgoIDT + Gaze [24] | 46.50 | 60.50
I3D [3] | 49.79 | 57.64
I3D [3] + Gaze | 51.21 | 59.72
Li [23] | 53.30 | N/A
Sudhakaran [42] | N/A | 60.13
Ma [27] | N/A | (66.40)
Shen [39] | N/A | (67.10)
MCN (gaze region) | 52.35 | 59.21
Our full MCN | 55.63 | 61.14
Table 2: Quantitative comparison of action recognition. We report recognition accuracy in %. Values in brackets indicate the methods that rely on additional labeling.

Table 2 lists the action recognition accuracy of our model and the baseline methods. The deep learning method I3D [3] outperforms EgoIDT+Gaze [24], which uses handcrafted features, on the EGTEA dataset but not on the GTEA Gaze+ dataset, possibly due to the smaller number of training samples in GTEA Gaze+. With the use of measured gaze, I3D+Gaze performs slightly better than I3D. On both datasets, our MCN performs the best among all methods except [27] and [39], which rely on additional labeling. We also conduct an ablation study against our baseline that uses only the gaze region for action recognition. The superiority of our MCN over this baseline supports our view that the non-gaze region contains supplementary information and should be jointly considered for action recognition.

Figure 4: Gaze prediction AUC and action recognition accuracy with respect to inference iteration on the EGTEA dataset. The blue line corresponds to action recognition accuracy (left axis), and the orange line corresponds to gaze prediction AUC (right axis). The strongest baselines for action recognition [23] and gaze prediction [13] are shown as cyan and red dashed lines, respectively.

We also show the performance of our method over the alternating inference procedure in Figure 4. The performance of both gaze prediction and action recognition increases in the first two iterations and saturates afterwards. Our method outperforms the strongest baselines on both tasks after the first iteration. This strongly supports our hypothesis that the mutual context of gaze and actions is beneficial for both tasks.

4.5 Failure cases and discussion

Here we discuss several failure cases. One type of failure comes from inaccurate action recognition. As shown in the first row of Figure 5, although we use a late fusion module to fuse the outputs of the sub-modules, our model may fail when action recognition gives a wrong result. Still, the impact of failed action recognition is limited: among all the testing data of the EGTEA dataset, our MCN still outperforms the other methods when action recognition fails, although its AAE is worse than when action recognition is correct.

Another type of failure occurs when a person begins to shift the gaze fixation between consecutive actions. An example is shown in the second row of Figure 5: after grabbing the bread, instead of keeping fixation on the bread, the person's attention moves to the plate on which he plans to put the bread. Beyond improving action recognition accuracy, this reveals the need to take attention transition [13] into account in our current gaze prediction model.

Figure 5: Failure cases of our MCN on gaze prediction. In the first row, failed action recognition misleads gaze prediction. In the second row, although the action recognition is correct, the camera wearer shifts the gaze fixation to the region of the future destination after finishing the action of grabbing the bread.

5 Conclusion and future work

In this work, we proposed a novel deep model for joint egocentric gaze prediction and action recognition. Our model explicitly leverages the mutual context between the two tasks and achieves state-of-the-art performance on both tasks on public egocentric video datasets. Although our model predicts gaze within an action period more reliably than previous methods, gaze prediction performance still needs further improvement, especially at the transition moments between consecutive actions. We consider exploring gaze transition patterns relevant to the performed actions an interesting direction for future work.

References

  • [1] A. Betancourt, P. Morerio, C. S. Regazzoni, and M. Rauterberg. The evolution of first person vision methods: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 25(5):744–760, 2015.
  • [2] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 921–928, 2013.
  • [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
  • [4] A. Cartas, E. Talavera, P. Radeva, and M. Dimiccoli. On the role of event boundaries in egocentric activity recognition from photostreams. arXiv preprint arXiv:1809.00402, 2018.
  • [5] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2770–2779, 2017.
  • [6] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 407–414, 2011.
  • [7] A. Fathi, Y. Li, and J. M. Rehg. Learning to recognize daily actions using gaze. In Proceedings of the European Conference on Computer Vision (ECCV), pages 314–327, 2012.
  • [8] M. Fujisaki, H. Takenouchi, and M. Tokumaru. Interactive evolutionary computation using multiple users' gaze information. In International Conference on Human-Computer Interaction, pages 109–116, 2017.
  • [9] A. Furnari, S. Battiato, and G. M. Farinella. Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation. In Proceedings of the European Conference on Computer Vision Workshops (ECCVW), pages 389–405, 2018.
  • [10] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in neural information processing systems, pages 545–552, 2007.
  • [11] X. Huang, C. Shen, X. Boix, and Q. Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 262–270, 2015.
  • [12] Y. Huang, M. Cai, H. Kera, R. Yonetani, K. Higuchi, and Y. Sato. Temporal localization and spatial segmentation of joint attention in multiple first-person videos. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pages 2313–2321, 2017.
  • [13] Y. Huang, M. Cai, Z. Li, and Y. Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 754–769, 2018.
  • [14] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision research, 40(10-12):1489–1506, 2000.
  • [15] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [16] H. Kera, R. Yonetani, K. Higuchi, and Y. Sato. Discovering objects of joint attention via first-person sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 7–15, 2016.
  • [17] M. Khamis, F. Alt, M. Hassib, E. von Zezschwitz, R. Hasholzner, and A. Bulling. Gazetouchpass: Multimodal authentication using gaze and touch on mobile devices. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pages 2156–2164, 2016.
  • [18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [19] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3241–3248, 2011.
  • [20] J. Kopf, M. F. Cohen, and R. Szeliski. First-person hyper-lapse videos. ACM Transactions on Graphics (TOG), 33(4):78, 2014.
  • [21] A. Kurauchi, W. Feng, A. Joshi, C. Morimoto, and M. Betke. Eyeswipe: Dwell-free text entry using gaze paths. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 1952–1956, 2016.
  • [22] Y. Li, A. Fathi, and J. M. Rehg. Learning to predict gaze in egocentric video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3216–3223, 2013.
  • [23] Y. Li, M. Liu, and J. M. Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 619–635, 2018.
  • [24] Y. Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 287–295, 2015.
  • [25] M. Lu, Z.-N. Li, Y. Wang, and G. Pan. Deep attention network for egocentric action recognition. IEEE Transactions on Image Processing, 2019.
  • [26] Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2714–2721, 2013.
  • [27] M. Ma, H. Fan, and K. M. Kitani. Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1894–1903, 2016.
  • [28] T. McCandless and K. Grauman. Object-centric spatio-temporal pyramids for egocentric activity recognition. In Proceedings of the British Machine Vision Conference (BMVC), volume 2, page 3, 2013.
  • [29] K. Ogaki, K. M. Kitani, Y. Sugano, and Y. Sato. Coupling eye-motion and ego-motion features for first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2012.
  • [30] H. S. Park, E. Jain, and Y. Sheikh. 3d social saliency from head-mounted cameras. In Advances in Neural Information Processing Systems, pages 422–430, 2012.
  • [31] D. Parkhurst, K. Law, and E. Niebur. Modeling the role of salience in the allocation of overt visual attention. Vision research, 42(1):107–123, 2002.
  • [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [33] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2847–2854, 2012.
  • [34] Y. Poleg, A. Ephrat, S. Peleg, and C. Arora. Compact cnn for indexing egocentric videos. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, 2016.
  • [35] Y. Poleg, T. Halperin, C. Arora, and S. Peleg. Egosampling: Fast-forward and stereo for egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4768–4776, 2015.
  • [36] N. Riche, M. Duvinage, M. Mancas, B. Gosselin, and T. Dutoit. Saliency and human fixations: state-of-the-art and study of comparison metrics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1153–1160, 2013.
  • [37] M. S. Ryoo, B. Rothrock, and L. Matthies. Pooled motion features for first-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 896–904, 2015.
  • [38] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
  • [39] Y. Shen, B. Ni, Z. Li, and N. Zhuang. Egocentric activity prediction via event modulated attention. In Proceedings of the European Conference on Computer Vision (ECCV), pages 197–212, 2018.
  • [40] S. Singh, C. Arora, and C. Jawahar. First person action recognition using deep learned descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2620–2628, 2016.
  • [41] E. H. Spriggs, F. De La Torre, and M. Hebert. Temporal segmentation and activity classification from first-person sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 17–24, 2009.
  • [42] S. Sudhakaran and O. Lanz. Attention is all we need: Nailing down object-centric attention for egocentric activity recognition. Proceedings of the British Machine Vision Conference (BMVC), 2018.
  • [43] D. Surie, T. Pederson, F. Lagriffoul, L.-E. Janlert, and D. Sjölie. Activity recognition using an egocentric perspective of everyday objects. In International Conference on Ubiquitous Intelligence and Computing, pages 246–257, 2007.
  • [44] H. R. Tavakoli, E. Rahtu, J. Kannala, and A. Borji. Digging deeper into egocentric gaze prediction. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 273–282, 2019.
  • [45] S. P. Tipper, C. Lortie, and G. C. Baylis. Selective reaching: evidence for action-centered attention. Journal of Experimental Psychology: Human Perception and Performance, 18(4):891, 1992.
  • [46] J. N. Vickers. Advances in coupling perception and action: the quiet eye as a bidirectional link between gaze, attention, and action. Progress in brain research, 174:279–288, 2009.
  • [47] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3169–3176, 2011.
  • [48] J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg, and V. Singh. Gaze-enabled egocentric video summarization via constrained submodular maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2235–2244, 2015.
  • [49] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual dynamics: Stochastic future generation via layered cross convolutional networks. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
  • [50] K. Yamada, Y. Sugano, T. Okabe, Y. Sato, A. Sugimoto, and K. Hiraki. Can saliency map models predict human egocentric visual attention? In Proceedings of the Asian Conference on Computer Vision (ACCV), pages 420–429, 2010.
  • [51] K. Yamada, Y. Sugano, T. Okabe, Y. Sato, A. Sugimoto, and K. Hiraki. Attention prediction in egocentric video using motion and visual saliency. In Pacific-Rim Symposium on Image and Video Technology, pages 277–288, 2011.
  • [52] Y. Yan, E. Ricci, G. Liu, and N. Sebe. Egocentric daily activity recognition via multitask clustering. IEEE Transactions on Image Processing, 24(10):2984–2995, 2015.
  • [53] Y. Tang, Y. Tian, J. Lu, J. Feng, and J. Zhou. Action recognition in RGB-D egocentric videos. In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2017.
  • [54] R. Yonetani, K. M. Kitani, and Y. Sato. Recognizing micro-actions and reactions from paired egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2629–2638, 2016.
  • [55] M. Zhang, K. Teck Ma, J. Hwee Lim, Q. Zhao, and J. Feng. Anticipating where people will look using adversarial networks. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
  • [56] Z. Zhang, S. Bambach, C. Yu, and D. J. Crandall. From coarse attention to fine-grained gaze: A two-stage 3d fully convolutional network for predicting eye gaze in first person video. 2018.
  • [57] Z. Zuo, L. Yang, Y. Peng, F. Chao, and Y. Qu. Gaze-informed egocentric action recognition for memory aid systems. IEEE Access, 6:12894–12904, 2018.