AssembleNet++: Assembling Modality Representations via Attention Connections

08/18/2020
by Michael S. Ryoo, et al.

We create a family of powerful video models that are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention to better learn the importance of features at each convolutional block of the network. We introduce a new network component named peer-attention, which dynamically learns the attention weights using another block or input modality. Even without pre-training, our models outperform previous work on standard public activity recognition datasets with continuous videos, establishing a new state of the art. We also confirm that our findings, adding neural connections from the object modality and using peer-attention, are generally applicable to different existing architectures, improving their performance. We explicitly name our model AssembleNet++. The code will be available at: https://sites.google.com/corp/view/assemblenet/
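To make the peer-attention idea concrete: the attention weights applied to one block's output are computed from the features of a *different* block (or input modality), rather than from the block itself. Below is a minimal NumPy sketch of one plausible channel-wise formulation; the function name `peer_attention` and the pooled-projection-sigmoid form are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def peer_attention(x, peer, w, b):
    """Channel-wise peer-attention sketch (illustrative, not the paper's code).

    Attention weights for the target features `x` are derived from a peer
    block's features `peer`.

    x:    (C, T, H, W)  target block features
    peer: (Cp, T, H, W) peer block or input-modality features
    w:    (C, Cp) projection weights, b: (C,) bias -- learned in practice
    """
    # Global average pool the peer features over space-time -> (Cp,)
    pooled = peer.mean(axis=(1, 2, 3))
    # Project to the target channel dimension and squash into (0, 1)
    weights = 1.0 / (1.0 + np.exp(-(w @ pooled + b)))  # sigmoid, shape (C,)
    # Reweight every channel of x by its peer-derived attention weight
    return x * weights[:, None, None, None]
```

Since the weights are a function of the peer block alone, one modality (e.g. the semantic-object stream) can gate which channels of another modality's features are emphasized at each connection.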


