The role of ego vision in view-invariant action recognition

06/10/2019 ∙ by Gaurvi Goyal, et al. ∙ Istituto Italiano di Tecnologia

Analysis and interpretation of egocentric video data is becoming more and more important with the increasing availability and use of wearable cameras. Exploring and fully understanding affinities and differences between ego and allo (or third-person) vision is paramount for the design of effective methods to process, analyse and interpret egocentric data. In addition, a deeper understanding of ego-vision and its peculiarities may enable new research perspectives in which first-person viewpoints can act either as a means for easily acquiring large amounts of data to be employed in general-purpose recognition systems, or as a challenging test-bed to assess the usability of techniques specifically tailored to allocentric vision in more challenging settings. Our work, with an eye to cognitive science findings, leverages transfer learning in Convolutional Neural Networks to demonstrate capabilities and limitations of an implicitly learnt view-invariant representation in the specific case of action recognition.




1 Introduction

Action recognition is a core topic in computer vision, with applications in a variety of artificial intelligence systems, including human-computer interaction, robotics and video surveillance, to name just a few. Action recognition research is currently making considerable leaps forward, with better algorithms and models being proposed more and more frequently. Among the open problems the research community is dealing with, we focus on tolerance to viewpoint changes. This property is not easily obtained in recognition tasks and requires special care. In the domain of ego-vision, view-invariant action recognition is an important element with two different implications: first, ego-vision systems may provide us with large amounts of data, which could fuel general-purpose recognition systems; conversely, in the design of algorithms for an ego-vision system, one may want to incorporate information learnt from allocentric vision data. While action recognition from the egocentric view has been explored to some degree [16, 18], within view-invariance the subject remains largely untouched. Over the years, the problem of view-invariant motion recognition has been addressed considering two different settings, i.e. observing the same dynamic event simultaneously from multiple cameras [27, 13] or considering independent instances of the same dynamic concept [10, 9].

Methodologically, early works approached it as an epipolar geometry problem [19, 24], while later works can be categorized into methods acting at a descriptor level, which design representations explicitly embedding view-invariant information [10, 12, 9, 16], or at a similarity level [23, 8, 26]. This category also includes methods that address the problem with a transfer learning formulation, from one view to another [27], or to a common virtual view, sometimes in a 3D reference frame [13]. In the last decade, the availability of affordable 3D acquisition systems has facilitated approaches combining multiple types of information, e.g. videos and skeletal data (for more details see [17]).

(a) View0: lateral
(b) View1: ego
(c) View2: frontal
Figure 1: Synchronized frames from the MoCA dataset (action pouring).

View-invariant action recognition plays a crucial role in humans, supporting the capability to solve the correspondence problem, i.e., identifying a mapping between others’ actions and one’s own, which is necessary for crucial activities like social learning, imitation or mimicry [15]. Results in neuroscience indicate that such view invariance is a property of higher-order visual areas, such as the Superior Temporal Sulcus (STS) [7], which could however also be supported by pooling together the responses of other view-dependent areas. Indeed, studies in the macaque brain suggest that view-dependent mirror neurons in the premotor cortex (area F5) play an essential role in the formation of view-invariant representations. Alternatively, it has been speculated that a top-down stream of information from view-dependent mirror neurons might modulate the activity of visual representations in the STS, reinforcing the processing of visual patterns that are associated with different views of the same action [2]. Although some human perceptual abilities are immediately available, being part of an innate background of skills, a large portion of them are acquired over time, leveraging what we can generally call the “human experience”. In modern artificial intelligence, its role is often played by a large amount of data. With the success of data-driven methods, especially deep learning, the variety in the available data has a corresponding effect on the variety and the effectiveness of applications. Very complex architectures leverage the availability of large datasets, which allow us to learn not only input-output relationships with good generalization properties, but also multiple intermediate representations that could be exploited to address other tasks through transfer learning. The ability of pre-trained deep neural networks to extract relevant information from new data is documented [22] and often applied, but only recently in action recognition [4, 21].

Considering this context, in our work we assess the potential of pre-trained features in mimicking the role of view-dependent neurons and view-invariant higher-level descriptions. We consider the MoCA (Multimodal Cooking Actions) dataset [14], specifically acquired to study view invariance in both artificial and biological systems (footnote 1: the dataset will soon be made available to the research community). It includes three different views (one egocentric and two allocentric) of a set of 20 different upper body activities (see Fig. 1). We discuss the effectiveness of intermediate pre-trained features in dealing with different degrees of view invariance, with specific reference to situations in which the egocentric view is involved.

2 Methodology

Our approach is based on learning intermediate-level features with the help of a pre-trained architecture, and applying this representation as an input to a multi-class classification architecture which depends on the specific task of interest. To learn the representation we consider a variant of the Inception 3D model [4], which takes optical flow estimates as inputs and uses 3D convolutional filters to incorporate and compress both spatial and temporal information. The model is pre-trained on the ImageNet dataset [5] and on Kinetics-400 [11]. Once trained, the network may be seen as a multi-resolution representation of image sequences.

To pinpoint an appropriate extraction point for the intermediate features, we look for intermediate layers that should produce representations tolerant to viewpoint changes without being too closely tied to a specific classification task; hence, a point 2 layers before the end was selected (see [4] for details). Thereafter, for a given multi-class classification task, segmented video clips of the actions are used as inputs to the action recognition pipeline. From them, the optical flow is extracted using the TV-L1 algorithm [25]. The optical flow is input into the trained Inception 3D model, and the activations, i.e. the learnt intermediate spatio-temporal features, are then fed to a multi-class classifier. In Section 3, we compare results obtained with two classifiers of different complexity: a Single Layered Perceptron (SLP) and a convolutional neural network (3DConv). The simple SLP allows us to comment on the intrinsic ability of the learnt features to deal with view invariance and with the complexity of ego-vision. The more complex 3DConv shows the potential of the approach under different challenging classification tasks.
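The final stage of the pipeline, a single-layer classifier on top of fixed pre-trained features, can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the feature vectors are synthetic stand-ins for the Inception 3D activations, and the function names, feature dimension, learning rate and number of epochs are all assumptions made for the example.

```python
import numpy as np

def train_slp(features, labels, num_classes, lr=0.1, epochs=200, seed=0):
    """Train a single-layer perceptron (softmax regression) by batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # gradient of mean cross-entropy
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict_slp(W, b, features):
    """Return the index of the highest-scoring class for each feature vector."""
    return np.argmax(features @ W + b, axis=1)

def make_synthetic_features(centers, per_class, rng, noise=0.3):
    """Hypothetical stand-in for pre-trained activations: one noisy cluster per action."""
    labels = np.repeat(np.arange(len(centers)), per_class)
    feats = centers[labels] + noise * rng.normal(size=(labels.size, centers.shape[1]))
    return feats, labels

rng = np.random.default_rng(0)
centers = rng.normal(size=(20, 64))          # 20 action classes, 64-d features (assumed)
train_x, train_y = make_synthetic_features(centers, 10, rng)
test_x, test_y = make_synthetic_features(centers, 5, rng)

W, b = train_slp(train_x, train_y, num_classes=20)
test_acc = float(np.mean(predict_slp(W, b, test_x) == test_y))
print(f"held-out accuracy on synthetic features: {test_acc:.2f}")
```

Because the classifier has no hidden layers, any accuracy it reaches on real features would mostly reflect how linearly separable the pre-trained representation already is, which is the reason the text gives for including the SLP.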

(a) Cam0
(b) Cam1
(c) Cam2
(d) Cam3
(e) Cam4
Figure 2: Sample frames from the IXMAS dataset (actor alba and action scratchhead).
Source → Target     SLP      3DConv
0 → 0               93.25    96.25
1 → 1               91.11    96.35
2 → 2               92.70    96.43
0,1,2 → 0,1,2       87.37    94.81
0,1 → 2             68.33    62.30
0,2 → 1             46.03    61.67
1,2 → 0             68.10    62.70
0 → 1               47.38    50.63
0 → 2               68.33    64.84
1 → 0               47.38    33.10
1 → 2               32.86    36.35
2 → 0               66.27    61.67
2 → 1               34.84    54.92
Table 1: Performance evaluation (accuracy, in %) on the MoCA dataset considering various training (source) and test (target) view subsets. Views - 0: Lateral, 1: Egocentric, 2: Frontal.
Source → Target     DT [20]   Hankelets [12]   SLP     3DConv
Mean                61.7      56.4             69.4    68.5
0 → 1               93.9      83.7             84.4    89.0
0 → 4               27.6      33.6             48.8    44.4
1 → 4               22.4      26.9             47.0    42.5
2 → 4               53.3      60.1             66.0    61.3
3 → 4               34.8      31.2             45.0    45.4
4 → 0               42.1      39.6             53.3    48.6
4 → 1               25.8      32.8             56.6    49.1
4 → 2               63.3      68.1             69.3    57.9
4 → 3               48.8      37.4             53.5    46.7
0,3 → 4             -         -                57.3    49.2
0,1,2,3 → 4         -         -                62.8    57.9
Table 2: Performance evaluation (accuracy, in %) on the IXMAS dataset considering a subset of training (source) and test (target) splits of the dataset, based on viewpoints. Mean refers to the average of accuracies for all combinations of viewpoints with the one-one protocol. A dash marks a value that is not reported.
Figure 3: Confusion matrix for 3DConv classifier trained on V1 and tested on V2 of the MoCA dataset.

3 Results

The core of our experimental analysis focuses on the MoCA dataset [14], consisting of 20 cooking action primitives, involving one or two arms of a volunteer, with subtle differences between different actions. The dataset comprises synchronized videos of actions from 3 different viewpoints (see Fig. 1): Lateral (V0), Egocentric (V1), and Frontal (V2). Training (TR) and Test (TE) sequences are available for each action and viewpoint. In different iterations of the experiment, we trained the classifiers with a variety of subsets of the TR split and tested on subsets of the TE split. Validation splits were processed using a batch-wise protocol with batch normalization parameters calculated per batch.

The resulting validation accuracies are shown in Table 1. We carried out a set of baseline experiments in which TR and TE are uniform ({i → i}, with i ∈ {0, 1, 2}). We also include another baseline, {0,1,2 → 0,1,2}, where a view-invariant model is obtained simply by training the classifier on multiple views. In all these cases, classification performance is high. Next, we consider a one-view-out protocol, in which the classifiers are trained with 2 viewpoints and tested on the third; in this case, there is a notable and expected drop in the capability of the classifiers to correctly classify the actions, but considering that they are not explicitly trained to identify actions view-invariantly, this drop is modest and thus unremarkable. Notice in particular how the egocentric view is the hardest to classify if it does not participate in the training phase.
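The size of the one-view-out drop can be read directly off Table 1. The short sketch below copies the relevant accuracies from the table and subtracts each one-view-out accuracy from the mean of the three same-view baselines; the dictionary layout, the split labels and the function name are illustrative choices, not part of the original work.

```python
# Accuracies (in %) copied from Table 1; keys are "source views -> target view".
TABLE1 = {
    "SLP": {
        "0->0": 93.25, "1->1": 91.11, "2->2": 92.70,
        "0,1->2": 68.33, "0,2->1": 46.03, "1,2->0": 68.10,
    },
    "3DConv": {
        "0->0": 96.25, "1->1": 96.35, "2->2": 96.43,
        "0,1->2": 62.30, "0,2->1": 61.67, "1,2->0": 62.70,
    },
}

SAME_VIEW = ("0->0", "1->1", "2->2")
ONE_VIEW_OUT = ("0,1->2", "0,2->1", "1,2->0")

def one_view_out_drop(row):
    """Drop (in percentage points) of each one-view-out split w.r.t. the same-view mean."""
    baseline = sum(row[k] for k in SAME_VIEW) / len(SAME_VIEW)
    return {k: baseline - row[k] for k in ONE_VIEW_OUT}

drops = {name: one_view_out_drop(row) for name, row in TABLE1.items()}
hardest = {name: max(d, key=d.get) for name, d in drops.items()}

for name, d in drops.items():
    summary = ", ".join(f"{split}: {value:.2f}" for split, value in d.items())
    print(f"{name}: {summary} (largest drop: {hardest[name]})")
```

For both classifiers the largest drop falls on the split that holds out view 1, the egocentric one, although for 3DConv the three one-view-out accuracies are within about one percentage point of each other.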

Finally, we adopt a one-one protocol, training classifiers on a single viewpoint and evaluating on another viewpoint, to analyse view-to-view relationships. When both views are allocentric ({0 → 2}, {2 → 0}), the resulting values are almost as high as in the one-view-out experiments. But in all cases where V1 is involved in the one-one protocol ({0 → 1}, {1 → 0}, {1 → 2}, {2 → 1}), there is a noticeable drop in performance. The results highlight the specific challenge in dealing with view invariance when ego-vision is one of the views considered. This appears to be understandable, considering the smaller amount of dynamic information included in the ego view, but it is also in contrast with findings in cognitive science. Indeed, recent neuroscientific literature suggests that not all views are equally important. The first-person view seems to have a prominent role with respect to other perspectives in terms of responsiveness in the sensorimotor areas of the brain during action observation [1] and has been shown to facilitate certain forms of action understanding (e.g., estimating the size of an object to be grasped) [3]. Beyond the egocentric perspective, the frontal view also seems to have a peculiar role, eliciting stronger activity in the ventral premotor cortex compared with the lateral view, suggesting a preference for “face-to-face interactions” [6].
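To make the one-one comparison concrete, the sketch below enumerates the six ordered view pairs, splits them by whether the egocentric view (V1) is involved, and averages the corresponding Table 1 accuracies. The values are copied from the table; the helper names and data layout are assumptions made for this illustration.

```python
from itertools import permutations

VIEWS = (0, 1, 2)   # 0: lateral, 1: egocentric, 2: frontal
EGO = 1

# One-one accuracies (in %) copied from Table 1, keyed by (source, target).
ONE_ONE = {
    "SLP":    {(0, 1): 47.38, (0, 2): 68.33, (1, 0): 47.38,
               (1, 2): 32.86, (2, 0): 66.27, (2, 1): 34.84},
    "3DConv": {(0, 1): 50.63, (0, 2): 64.84, (1, 0): 33.10,
               (1, 2): 36.35, (2, 0): 61.67, (2, 1): 54.92},
}

def one_one_splits(views=VIEWS):
    """All ordered (source, target) pairs of distinct views."""
    return list(permutations(views, 2))

def mean_accuracy(row, splits):
    """Average accuracy of one classifier over the given splits."""
    return sum(row[s] for s in splits) / len(splits)

splits = one_one_splits()
ego_splits = [s for s in splits if EGO in s]
allo_splits = [s for s in splits if EGO not in s]

summary = {
    name: {
        "ego_involved": mean_accuracy(row, ego_splits),
        "allocentric_only": mean_accuracy(row, allo_splits),
    }
    for name, row in ONE_ONE.items()
}

for name, stats in summary.items():
    print(f"{name}: ego-involved {stats['ego_involved']:.2f}, "
          f"allocentric-only {stats['allocentric_only']:.2f}")
```

Averaged this way, the ego-involved pairs sit roughly 20 to 27 percentage points below the purely allocentric pairs for both classifiers.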

Figure 3 shows the confusion matrix for the above experiment, in the case where the 3DConv classifier is trained on the egocentric view V1 and tested on V2. Notice that carrot (grating carrots) is almost always classified as cut (cutting bread). The motion of the two actions is very similar from these two perspectives. Also note that many actions are often confused with the eating action. This is probably because, since the face is not visible in either view, the limited information available makes eating very easy to confuse with actions like transporting (moving an object across the table).
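A confusion matrix of the kind shown in Figure 3, and the dominant confusion for each class, can be computed as in the sketch below. The labels and predictions are a small made-up toy example using four of the action names mentioned above; they are not the paper's data, and the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """cm[i, j] counts samples of true class i that were predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return cm

def most_confused_with(cm, class_names):
    """For each class with at least one error, the wrong class it is predicted as most often."""
    errors = cm.copy()
    np.fill_diagonal(errors, 0)          # ignore correct predictions
    result = {}
    for i, name in enumerate(class_names):
        if errors[i].sum() > 0:
            result[name] = class_names[int(errors[i].argmax())]
    return result

# Toy data for illustration only (not taken from the experiments).
class_names = ["carrot", "cut", "eating", "transporting"]
true_labels = [0, 0, 0, 1, 1, 2, 3, 3]
predicted   = [1, 1, 1, 1, 1, 2, 2, 3]

cm = confusion_matrix(true_labels, predicted, num_classes=len(class_names))
confusions = most_confused_with(cm, class_names)
print(cm)
print(confusions)
```

Reading the matrix row by row in this way is what supports statements such as "carrot is almost always classified as cut".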

We conclude by reporting a further set of experiments, carried out with the same protocol on the IXMAS benchmark (see Fig. 2). This dataset does not include an egocentric view, but instead incorporates a top view which is very different from the others. The results reported in Table 2 confirm the observation that the architecture exhibits a good amount of view invariance, in particular for views that are more likely to be observed. It is instead less robust on view 4, the top view, which is less common. Similarly to biological systems, our architecture appears to be better tuned for a set of more likely viewpoints.
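The reduced robustness on the top view can be checked against Table 2 by averaging the one-one accuracies that involve view 4 and comparing them with the reported mean over all view pairs. The sketch below copies the values from the table; the data layout and function name are illustrative assumptions.

```python
# One-one accuracies (in %) copied from Table 2. "mean" is the reported average
# over all one-one view pairs; the other keys are "source->target".
TABLE2 = {
    "DT":        {"mean": 61.7, "0->4": 27.6, "1->4": 22.4, "2->4": 53.3, "3->4": 34.8,
                  "4->0": 42.1, "4->1": 25.8, "4->2": 63.3, "4->3": 48.8},
    "Hankelets": {"mean": 56.4, "0->4": 33.6, "1->4": 26.9, "2->4": 60.1, "3->4": 31.2,
                  "4->0": 39.6, "4->1": 32.8, "4->2": 68.1, "4->3": 37.4},
    "SLP":       {"mean": 69.4, "0->4": 48.8, "1->4": 47.0, "2->4": 66.0, "3->4": 45.0,
                  "4->0": 53.3, "4->1": 56.6, "4->2": 69.3, "4->3": 53.5},
    "3DConv":    {"mean": 68.5, "0->4": 44.4, "1->4": 42.5, "2->4": 61.3, "3->4": 45.4,
                  "4->0": 48.6, "4->1": 49.1, "4->2": 57.9, "4->3": 46.7},
}

TOP_VIEW = 4
OTHER_VIEWS = (0, 1, 2, 3)

def top_view_summary(row):
    """Average accuracy when the top view is the test view or the training view."""
    to_top = [row[f"{v}->{TOP_VIEW}"] for v in OTHER_VIEWS]
    from_top = [row[f"{TOP_VIEW}->{v}"] for v in OTHER_VIEWS]
    return {
        "to_top": sum(to_top) / len(to_top),
        "from_top": sum(from_top) / len(from_top),
        "mean_all_pairs": row["mean"],
    }

summary = {name: top_view_summary(row) for name, row in TABLE2.items()}

for name, stats in summary.items():
    print(f"{name}: to top {stats['to_top']:.2f}, from top {stats['from_top']:.2f}, "
          f"all pairs {stats['mean_all_pairs']:.1f}")
```

For every method in the table, both averages involving view 4 fall below the mean over all view pairs, which is consistent with the top view being the hardest one to transfer to or from.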

4 Discussion

Our analysis suggests that the relationship between the egocentric viewpoint and individual allocentric viewpoints is considerably weaker than the relationship among allocentric viewpoints (even if they have been acquired from widely different perspectives). This could be explained by the reduced amount of information conveyed by egocentric data, which in biological vision is compensated for by proprioception, i.e. the awareness of the position and movements of one’s own body.

However, the relationship still exists, as demonstrated by the ability of the classifiers to recognise actions to some extent despite not having any significant information about the egocentric view and about how different the actions look from this viewpoint. It is a very interesting observation that the combination of two allocentric viewpoints was able to train the 3DConv classifier well enough to identify actions from the egocentric viewpoint with almost the same accuracy as in other scenarios with unseen viewpoints. This apparent ability deserves further investigation on wider multi-view datasets, to assess the generality of our observations.


Some results incorporated in this publication have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, G.A. No 804388.


  • [1] M. Angelini, M. Fabbri-Destro, N. F. Lopomo, M. Gobbo, G. Rizzolatti, and P. Avanzini. Perspective-dependent reactivity of sensorimotor mu rhythm in alpha and beta ranges during action observation: An EEG study. Scientific Reports, 2018.
  • [2] V. Caggiano, L. Fogassi, G. Rizzolatti, J. K. Pomper, P. Thier, M. A. Giese, and A. Casile. View-based encoding of actions in mirror neurons of area f5 in macaque premotor cortex. Current Biology, 21(2):144–148, 2011.
  • [3] F. Campanella, G. Sandini, and M. C. Morrone. Visual information gleaned by observing grasping movement in allocentric and egocentric perspectives. Proceedings of the Royal Society B: Biological Sciences, 278(1715):2142–2149, 2010.
  • [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR 2009.
  • [6] S. Ferri, K. Pauwels, G. Rizzolatti, and G. Orban. Stereoscopically observing manipulative actions. Cerebral cortex, 26(8):3591–3610, 2016.
  • [7] E. D. Grossman, N. L. Jardine, and J. A. Pyles. fMR-adaptation reveals invariant coding of biological motion on human STS. Frontiers in Human Neuroscience, 2010.
  • [8] C.-H. Huang, Y.-R. Yeh, and Y.-C. F. Wang. Recognizing actions across cameras by exploring the correlated subspace. In ECCV 2012.
  • [9] K. Huang, Y. Zhang, and T. Tan. A discriminative model of motion and cross ratio for view-invariant action recognition. IEEE Trans. Image Processing 2012.
  • [10] I. Junejo, E. Dexter, I. Laptev, and P. Perez. View-independent action recognition from temporal self-similarities. IEEE PAMI, 2011.
  • [11] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, et al. The kinetics human action video dataset. arXiv preprint, 2017.
  • [12] B. Li, O. I. Camps, and M. Sznaier. Cross-view activity recognition using hankelets. In IEEE CVPR 2012.
  • [13] R. Li and T. Zickler. Discriminative virtual views for cross-view action recognition. In IEEE CVPR 2012.
  • [14] D. Malafronte, G. Goyal, A. Vignolo, F. Odone, and N. Noceti. Investigating the use of space-time primitives to understand human movements. In ICIAP 2017.
  • [15] C. L. Nehaniv and K. Dautenhahn. The correspondence problem, imitation in animals and artifacts, 2002.
  • [16] T.-H.-C. Nguyen, J.-C. Nebel, F. Florez-Revuelta, et al. Recognition of activities of daily living with egocentric vision: A review. Sensors, 2016.
  • [17] T. Singh and D. K. Vishwakarma. Human activity recognition in video benchmarks: A survey. In Advances in Signal Processing and Communication 2019.
  • [18] S. Song, V. Chandrasekhar, N.-M. Cheung, S. Narayan, L. Li, and J.-H. Lim. Activity recognition in egocentric life-logging videos. In ACCV 2014 Workshops, 2015.
  • [19] T. Syeda-Mahmood, A. Vasilescu, and S. Sethi. Recognizing action events from multiple viewpoints. In Detection and Recognition of Events in Video, 2001.
  • [20] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR. IEEE, 2011.
  • [21] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV 2016.
  • [22] K. R. Weiss, T. M. Khoshgoftaar, and D. Wang. A survey of transfer learning, volume 3. 2016.
  • [23] X. Wu and Y. Jia. View-invariant action recognition using latent kernelized structural svm. In ECCV 2012.
  • [24] A. Yilmaz and M. Shah. Recognizing human actions in videos acquired by uncalibrated moving cameras. In ICCV 2005.
  • [25] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, 2007.

  • [26] J. Zheng and Z. Jiang. Learning view-invariant sparse representations for cross-view action recognition. In IEEE ICCV 2013.
  • [27] J. Zheng, Z. Jiang, P. J. Phillips, and R. Chellappa. Cross-view action recognition via a transferable dictionary pair. In BMVC 2012.