Trear: Transformer-based RGB-D Egocentric Action Recognition

01/05/2021
by Xiangyu Li, et al.

In this paper, we propose a Transformer-based RGB-D egocentric action recognition framework, called Trear. It consists of two modules: an inter-frame attention encoder and a mutual-attentional fusion block. Instead of using optical flow or recurrent units, we adopt a self-attention mechanism to model the temporal structure of the data from different modalities. Input frames are cropped randomly to mitigate the effect of data redundancy. Features from each modality interact through the proposed fusion block and are combined through a simple yet effective fusion operation to produce a joint RGB-D representation. Empirical experiments on two large egocentric RGB-D datasets, THU-READ and FPHA, and one small dataset, WCVS, show that the proposed method outperforms the state of the art by a large margin.
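To make the cross-modal fusion idea concrete, below is a minimal PyTorch sketch of a mutual-attentional fusion block in the spirit the abstract describes: each modality attends to the other, and the results are combined into a joint RGB-D representation. The layer names, feature dimensions, and the choice of element-wise addition as the final fusion operation are assumptions for illustration, not the authors' released implementation.

# Minimal sketch of a mutual-attentional fusion block (assumed design,
# not the authors' code). Each modality cross-attends to the other,
# then the two streams are merged into one joint representation.
import torch
import torch.nn as nn


class MutualAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-attention: queries from one modality, keys/values from the other.
        self.rgb_to_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb_feats, depth_feats):
        # rgb_feats, depth_feats: (batch, num_frames, dim) temporal features,
        # e.g. outputs of per-modality inter-frame attention encoders.
        rgb_attn, _ = self.rgb_to_depth(rgb_feats, depth_feats, depth_feats)
        depth_attn, _ = self.depth_to_rgb(depth_feats, rgb_feats, rgb_feats)
        rgb_out = self.norm_rgb(rgb_feats + rgb_attn)        # residual + norm
        depth_out = self.norm_depth(depth_feats + depth_attn)
        # Simple fusion into a joint RGB-D representation (assumed: addition).
        return rgb_out + depth_out


if __name__ == "__main__":
    rgb = torch.randn(2, 8, 512)    # 2 clips, 8 sampled frames, 512-d features
    depth = torch.randn(2, 8, 512)
    fused = MutualAttentionFusion()(rgb, depth)
    print(fused.shape)              # torch.Size([2, 8, 512])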
