Students taught by multimodal teachers are superior action recognizers

10/09/2022
by Gorjan Radevski, et al.

The focal point of egocentric video understanding is modelling hand-object interactions. Standard models (CNNs, Vision Transformers, etc.) that receive RGB frames as input perform well; however, their performance improves further when additional modalities such as object detections, optical flow, or audio are provided as input. The added complexity of the required modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such multimodal approaches while using only RGB frames as input at inference time. Our approach is based on multimodal knowledge distillation, featuring a multimodal teacher (in the current experiments trained using only object detections, optical flow and RGB frames) and a unimodal student (using only RGB frames as input). We present preliminary results which demonstrate that the resulting model, distilled from a multimodal teacher, significantly outperforms the baseline RGB model (trained without knowledge distillation), as well as an omnivorous version of itself (trained on all modalities jointly), in both standard and compositional action recognition.
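
The abstract does not state the exact training objective, so the following is only a minimal sketch of a generic knowledge distillation loss, assuming the common formulation with temperature-softened teacher predictions. The function names and the `temperature` and `alpha` parameters are illustrative assumptions, not taken from the paper. Here the teacher logits stand in for the output of a multimodal model (RGB, optical flow, object detections), and the student logits for the output of an RGB-only model.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical combined loss for one training example.

    student_logits: class scores from the RGB-only student.
    teacher_logits: class scores from the multimodal teacher.
    label: index of the ground-truth action class.
    temperature, alpha: assumed hyperparameters (not from the paper).
    """
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # Standard cross-entropy of the student against the ground-truth label.
    ce = -log_softmax(student_logits)[label]
    # The T^2 factor keeps the two terms on a comparable gradient scale.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


if __name__ == "__main__":
    student = [1.0, 0.5, -0.2]   # made-up RGB-only scores
    teacher = [2.0, 0.1, -1.0]   # made-up multimodal scores
    print(distillation_loss(student, teacher, label=0))
```

At inference time only the student is evaluated, so under this sketch the optical flow and object detection inputs are needed solely to compute the teacher logits during training.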

research
07/14/2023

Multimodal Distillation for Egocentric Action Recognition

The focal point of egocentric video understanding is modelling hand-obje...
research
10/15/2021

From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation

Multimodal Deep Learning has garnered much interest, and transformers ha...
research
08/06/2021

Feature-Supervised Action Modality Transfer

This paper strives for action recognition and detection in video modalit...
research
07/10/2020

Optical Flow Distillation: Towards Efficient and Stable Video Style Transfer

Video style transfer techniques inspire many exciting applications on mo...
research
03/26/2021

Multimodal Knowledge Expansion

The popularity of multimodal sensors and the accessibility of the Intern...
research
08/17/2022

Leukocyte Classification using Multimodal Architecture Enhanced by Knowledge Distillation

Recently, a lot of automated white blood cells (WBC) or leukocyte classi...
research
08/17/2022

Progressive Cross-modal Knowledge Distillation for Human Action Recognition

Wearable sensor-based Human Action Recognition (HAR) has achieved remark...
