Knowledge Integration Networks for Action Recognition

02/18/2020
by   Shiwen Zhang, et al.
0

In this work, we propose Knowledge Integration Networks (referred as KINet) for video action recognition. KINet is capable of aggregating meaningful context features which are of great importance to identifying an action, such as human information and scene context. We design a three-branch architecture consisting of a main branch for action recognition, and two auxiliary branches for human parsing and scene recognition which allow the model to encode the knowledge of human and scene for action recognition. We explore two pre-trained models as teacher networks to distill the knowledge of human and scene for training the auxiliary tasks of KINet. Furthermore, we propose a two-level knowledge encoding mechanism which contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information. This results in an end-to-end trainable framework where the three tasks can be trained collaboratively, allowing the model to compute strong context knowledge efficiently. The proposed KINet achieves the state-of-the-art performance on a large-scale action recognition benchmark Kinetics-400, with a top-1 accuracy of 77.8 KINet has strong capability by transferring the Kinetics-trained model to UCF-101, where it obtains 97.8

READ FULL TEXT

page 1

page 3

page 7

research
08/29/2016

Human Action Recognition without Human

The objective of this paper is to evaluate "human action recognition wit...
research
12/19/2021

Precondition and Effect Reasoning for Action Recognition

Human action recognition has drawn a lot of attention in the recent year...
research
02/08/2020

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Egocentric video recognition is a natural testbed for diverse interactio...
research
03/22/2019

On the Importance of Video Action Recognition for Visual Lipreading

We focus on the word-level visual lipreading, which requires to decode t...
research
04/12/2021

Event-based Timestamp Image Encoding Network for Human Action Recognition and Anticipation

Event camera is an asynchronous, high frequency vision sensor with low p...
research
12/11/2019

Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition

Human activities often occur in specific scene contexts, e.g., playing b...
research
10/11/2019

Interaction Relational Network for Mutual Action Recognition

Person-person mutual action recognition (also referred to as interaction...

Please sign up or login with your details

Forgot password? Click here to reset