UniDual: A Unified Model for Image and Video Understanding

06/10/2019
by Yufei Wang et al.

Although a video is effectively a sequence of images, visual perception systems typically model images and videos separately, thus failing to exploit the correlation and the synergy provided by these two media. While a few prior research efforts have explored the benefits of leveraging still-image datasets for video analysis, or vice versa, most of these attempts have been limited to pretraining a model on one visual modality and then adapting it via finetuning on the other. In contrast, in this paper we introduce a framework that enables joint training of a unified model on mixed collections of image and video examples spanning different tasks. The key ingredient in our architecture design is a new network block, which we name UniDual. It consists of a shared 2D spatial convolution followed by two parallel point-wise convolutional layers, one devoted to images and the other to videos. For video input, the point-wise filtering implements a temporal convolution; for image input, it performs a pixel-wise nonlinear transformation. Repeated stacking of such blocks gives rise to a network where images and videos undergo partially distinct execution pathways, unified by spatial convolutions (capturing commonalities in visual appearance) but separated by point-wise operations (modeling patterns specific to each modality). Extensive experiments on Kinetics and ImageNet demonstrate that our UniDual model, jointly trained on these datasets, yields substantial accuracy gains for both tasks compared to 1) training separate models, 2) traditional multi-task learning, and 3) the conventional framework of pretraining followed by finetuning. On Kinetics, the UniDual architecture applied to a state-of-the-art video backbone model (R(2+1)D-152) yields an additional video@1 accuracy gain of 1.5%.
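To make the block design concrete, below is a minimal PyTorch sketch of a UniDual block as described in the abstract. This is not the paper's implementation: the layer widths, the activation placement, and the frame-wise application of the shared 2D convolution to video input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UniDualBlock(nn.Module):
    """Sketch of a UniDual block: a shared 2D spatial convolution followed by
    two parallel modality-specific point-wise convolutions (a temporal
    convolution for video, a pixel-wise transformation for images).
    Kernel sizes and channel widths are illustrative assumptions."""

    def __init__(self, in_channels, out_channels, temporal_kernel=3):
        super().__init__()
        # Shared 2D spatial convolution: captures appearance commonalities
        # across both modalities.
        self.spatial = nn.Conv2d(in_channels, out_channels,
                                 kernel_size=3, padding=1)
        # Image pathway: 1x1 conv, i.e. a pixel-wise transformation.
        self.image_pointwise = nn.Conv2d(out_channels, out_channels,
                                         kernel_size=1)
        # Video pathway: t x 1 x 1 3D conv, i.e. a temporal convolution
        # that is point-wise in space.
        self.video_pointwise = nn.Conv3d(
            out_channels, out_channels,
            kernel_size=(temporal_kernel, 1, 1),
            padding=(temporal_kernel // 2, 0, 0),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        if x.dim() == 4:  # image batch: (N, C, H, W)
            y = self.relu(self.spatial(x))
            return self.relu(self.image_pointwise(y))
        # Video batch: (N, C, T, H, W). Apply the shared 2D spatial
        # convolution independently to every frame.
        n, c, t, h, w = x.shape
        y = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        y = self.relu(self.spatial(y))
        y = y.reshape(n, t, -1, h, w).permute(0, 2, 1, 3, 4)
        return self.relu(self.video_pointwise(y))

# Usage: the same block handles both modalities.
block = UniDualBlock(3, 64)
img = torch.randn(2, 3, 112, 112)      # image batch
vid = torch.randn(2, 3, 8, 112, 112)   # video batch, 8 frames
print(block(img).shape)  # torch.Size([2, 64, 112, 112])
print(block(vid).shape)  # torch.Size([2, 64, 8, 112, 112])
```

In a mixed batch, image and video examples would pass through the same `spatial` weights but through different point-wise branches, which is what allows joint training to share appearance features while keeping modality-specific temporal modeling separate.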


