Multimodal Vision Transformers with Forced Attention for Behavior Analysis

12/07/2022
by Tanay Agrawal, et al.

Human behavior understanding requires looking at minute details within the larger context of a scene containing multiple input modalities; such understanding is necessary for designing more human-like machines. While transformer approaches have shown great improvements, they face multiple challenges such as lack of data and background noise. To tackle these, we introduce the Forced Attention (FAt) Transformer, which utilizes forced attention with a modified backbone for input encoding and makes use of additional inputs. In addition to improving performance on different tasks and inputs, the modification requires less time and memory. We provide a model for generalised feature extraction for tasks concerning social signals and behavior analysis. Our focus is on understanding behavior in videos where people are interacting with each other or talking into the camera, which simulates the first-person point of view in social interaction. FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition. We achieve state-of-the-art results on the Udiva v0.5, First Impressions v2 and MPII Group Interaction datasets, and we provide an extensive ablation study of the proposed architecture.
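The abstract does not spell out the mechanism, but one plausible reading of "forced attention" is biasing the attention logits toward externally supplied region masks (for example, face or person masks) before the softmax. The sketch below illustrates that idea in PyTorch; the function name, the mask shape, and the `strength` parameter are assumptions made for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only (not the paper's code): attention logits are biased
# toward tokens covered by an external region mask before the softmax.
import torch
import torch.nn.functional as F

def forced_attention(q, k, v, region_mask, strength=1.0):
    """Scaled dot-product attention biased toward masked regions.

    q, k, v:      (batch, heads, tokens, dim) query/key/value tensors.
    region_mask:  (batch, tokens) soft mask in [0, 1], e.g. a downsampled
                  face/person mask aligned with the key tokens (an assumption
                  for illustration; the paper's mechanism may differ).
    strength:     how strongly attention is pushed toward masked tokens
                  (a hypothetical knob, not from the paper).
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5       # (B, H, T, T)
    bias = strength * region_mask[:, None, None, :]   # broadcast over heads and queries
    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v

# Tiny usage example with random tensors.
B, H, T, D = 2, 4, 16, 32
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)
mask = torch.rand(B, T)           # stand-in for a face/person region mask
out = forced_attention(q, k, v, mask)
print(out.shape)                  # torch.Size([2, 4, 16, 32])
```

An additive bias keeps the operation differentiable and leaves unmasked tokens reachable, in contrast to hard masking, which would zero out their attention entirely.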


