PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

12/08/2022
by   Roei Herzig, et al.

Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, gathering and annotating scene structure for videos requires significant effort, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide a powerful alternative for producing scene-level annotations across multiple tasks. In this work, we propose an approach that leverages synthetic scene data to improve video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture, throughout the entire network, information shared among the synthetic scene tasks as well as information shared between the synthetic scene tasks and a real downstream video task. We refer to this approach as "Promptonomy", since the prompts model a task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets.
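The core mechanism described above can be sketched in a few lines: learnable per-task prompt tokens are prepended to the video patch tokens, processed jointly by a shared transformer backbone, and each prompt's output is read out by its own lightweight task head. The sketch below is a minimal illustration with hypothetical names and sizes (`PromptedVideoTransformer`, embedding dimension, head widths), not the authors' implementation.

```python
import torch
import torch.nn as nn

class PromptedVideoTransformer(nn.Module):
    """Minimal sketch of multi-task prompt learning for a video transformer.

    One learnable prompt token per task is prepended to the patch tokens;
    all tokens share the same transformer backbone, and each task's
    prediction is read out from its prompt token. All dimensions here
    are illustrative placeholders.
    """

    def __init__(self, patch_dim=32, dim=64, depth=2, heads=4,
                 num_tasks=3, out_dim=10):
        super().__init__()
        # Stand-in for a real video patch/tubelet embedding.
        self.patch_embed = nn.Linear(patch_dim, dim)
        # One learnable prompt token per synthetic-scene task.
        self.task_prompts = nn.Parameter(torch.randn(num_tasks, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # A small specialized head per task, on top of the shared backbone.
        self.task_heads = nn.ModuleList(
            nn.Linear(dim, out_dim) for _ in range(num_tasks))

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) raw patch features.
        x = self.patch_embed(patches)
        b = x.size(0)
        prompts = self.task_prompts.unsqueeze(0).expand(b, -1, -1)
        # Prepend task prompts so they attend to (and are attended by)
        # the video tokens in every layer.
        x = self.encoder(torch.cat([prompts, x], dim=1))
        # Each task's prediction comes from its own prompt token.
        return [head(x[:, i]) for i, head in enumerate(self.task_heads)]
```

In this setup only the prompts and heads are task-specific; everything shared among tasks lives in the backbone, which is what lets synthetic-task supervision transfer to the real downstream task.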


Related research

06/13/2022 · Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Recent action recognition models have achieved impressive results by int...

10/12/2019 · Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models
Deep video action recognition models have been highly successful in rece...

09/12/2020 · Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion
One significant factor we expect the video representation learning to ca...

06/08/2023 · Efficient Multi-Task Scene Analysis with RGB-D Transformers
Scene analysis is essential for enabling autonomous systems, such as mob...

12/07/2022 · Multimodal Vision Transformers with Forced Attention for Behavior Analysis
Human behavior understanding requires looking at minute details in the l...

07/27/2015 · Discovery of Shared Semantic Spaces for Multi-Scene Video Query and Summarization
The growing rate of public space CCTV installations has generated a need...

04/25/2019 · Meta-Sim: Learning to Generate Synthetic Datasets
Training models to high-end performance requires availability of large l...
