Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

12/06/2022
by   AJ Piergiovanni, et al.

We present a simple approach that turns a ViT encoder into an efficient video model which works seamlessly with both image and video inputs. By sparsely sampling the inputs, the model can train and run inference on both modalities. The model scales easily and can adapt large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results, and the code will be open-sourced.
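The core idea is to tokenize a video with a small number of sparsely strided 3D "tubes" of different shapes, alongside standard 2D image patches, so that images and videos can feed the same ViT encoder. Below is a minimal PyTorch sketch of such a tokenizer; the SparseTubeTokenizer class and the specific tube kernel shapes and strides are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class SparseTubeTokenizer(nn.Module):
        """Tokenizes a video with a few sparse, differently shaped 3D tubes.

        A 1x16x16 tube applied to a single frame reduces to standard ViT
        patch embedding, so images (T=1) and videos share one encoder.
        """
        def __init__(self, dim=768):
            super().__init__()
            # Each entry: (kernel t,h,w), (stride t,h,w). Strides larger
            # than the kernels keep the token count low (sparse sampling).
            # These shapes are placeholders, not the paper's settings.
            specs = [
                ((1, 16, 16), (16, 16, 16)),  # image-style spatial patches
                ((8, 8, 8),   (16, 32, 32)),  # spatio-temporal tube
                ((16, 4, 4),  (32, 16, 16)),  # long in time, small in space
            ]
            self.tubes = nn.ModuleList(
                [nn.Conv3d(3, dim, kernel_size=k, stride=s) for k, s in specs]
            )

        def forward(self, video):            # video: (B, 3, T, H, W)
            tokens = []
            for conv in self.tubes:
                if video.shape[2] >= conv.kernel_size[0]:  # skip tubes longer than the clip
                    x = conv(video)          # (B, dim, t', h', w')
                    tokens.append(x.flatten(2).transpose(1, 2))
            return torch.cat(tokens, dim=1)  # (B, N_tokens, dim)

    # An image is treated as a single-frame video: only the 1x16x16 tube fires.
    tok = SparseTubeTokenizer()
    video_tokens = tok(torch.randn(2, 3, 32, 224, 224))
    image_tokens = tok(torch.randn(2, 3, 1, 224, 224))

Because the tube strides are large, the total token count stays small even for long clips, which is what makes the shared encoder efficient for video.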


Related research

09/14/2022 · CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
The pre-trained image-text models, like CLIP, have demonstrated the stro...

05/04/2023 · VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
We propose a new two-stage pre-training framework for video-to-text gene...

11/14/2022 · Grafting Pre-trained Models for Multimodal Headline Generation
Multimodal headline utilizes both video frames and transcripts to genera...

04/01/2021 · Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval - in particular a joi...

09/14/2023 · Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
Recently, large-scale pre-trained language-image models like CLIP have s...

05/16/2023 · Is a Video worth n× n Images? A Highly Efficient Approach to Transformer-based Video Question Answering
Conventional Transformer-based Video Question Answering (VideoQA) approa...

07/12/2019 · AVD: Adversarial Video Distillation
In this paper, we present a simple yet efficient approach for video repr...
