Egocentric Action Recognition by Video Attention and Temporal Context

07/03/2020
by Juan-Manuel Pérez-Rúa, et al.

We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge. In this challenge, action recognition is posed as the problem of simultaneously predicting a single `verb' and `noun' class label for an input trimmed video clip; a `verb' and a `noun' together define a compositional `action' class. The challenging aspects of this real-life action recognition task include small, fast-moving objects, complex hand-object interactions, and occlusions. At the core of our submission is a recently proposed spatial-temporal video attention model called `W3' (`What-Where-When') attention <cit.>. We further introduce a simple yet effective contextual learning mechanism that models `action' class scores directly from long-term temporal behaviour based on the `verb' and `noun' prediction scores. Our solution achieves strong performance on the challenge metrics without object-specific reasoning or extra training data. In particular, our best solution with a multimodal ensemble achieves 2nd place for `verb' and 3rd place for `noun' and `action' on the Seen Kitchens test set.
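As a rough illustration of the compositional formulation described above (an `action' score derived from `verb' and `noun' prediction scores, refined by temporal context), the following Python sketch composes per-clip verb and noun probabilities into a joint action-score matrix and smooths it over neighbouring clips. This is not the paper's W3 attention model or its learned contextual mechanism; the outer-product composition, the fixed averaging window, and all function and variable names are illustrative assumptions.

import numpy as np

def compose_action_scores(verb_probs, noun_probs):
    # Outer product of verb and noun probabilities gives a joint score
    # for every (verb, noun) pair, i.e. every compositional 'action' class.
    # verb_probs: (num_verbs,), noun_probs: (num_nouns,)
    return np.outer(verb_probs, noun_probs)  # shape: (num_verbs, num_nouns)

def temporal_context_scores(clip_scores, window=3):
    # Average the composed action scores of clips within a temporal window
    # around each clip; a simple stand-in for a learned temporal-context model.
    refined = []
    for t in range(len(clip_scores)):
        lo, hi = max(0, t - window), min(len(clip_scores), t + window + 1)
        refined.append(np.mean(clip_scores[lo:hi], axis=0))
    return refined

# Toy example: 5 clips, 10 verb classes, 20 noun classes.
rng = np.random.default_rng(0)
verb = rng.dirichlet(np.ones(10), size=5)   # per-clip verb probabilities
noun = rng.dirichlet(np.ones(20), size=5)   # per-clip noun probabilities
per_clip = [compose_action_scores(v, n) for v, n in zip(verb, noun)]
refined = temporal_context_scores(per_clip, window=2)
print(refined[0].shape)  # (10, 20): one score per (verb, noun) action pair

In practice the verb and noun probabilities would come from the video attention model's classifier heads, and the temporal refinement would be learned rather than a fixed averaging window; the sketch only shows how compositional action scores can be read off from the two marginal predictions.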
