Streaming Video Model

03/30/2023
by Yucheng Zhao, et al.

Video understanding tasks have traditionally been modeled by two separate architectures, each specially tailored to a distinct class of tasks. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract spatiotemporal features, while frame-based video tasks, such as multiple object tracking (MOT), rely on a single image backbone to extract spatial features. In contrast, we propose to unify video understanding tasks into one novel streaming video architecture, referred to as Streaming Vision Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled, temporally-aware spatial encoder to serve the frame-based video tasks. The frame features are then input into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy on the sequence-based action recognition task and a competitive advantage over conventional architectures on the frame-based MOT task. We believe that the concept of the streaming video model and the implementation of S-ViT are solid steps towards a unified deep learning architecture for video understanding. Code will be available at https://github.com/yuzhms/Streaming-Video-Model.
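To make the two-stage streaming design concrete, here is a minimal PyTorch sketch of the idea: a per-frame spatial encoder whose attention also reads a rolling cache of tokens from earlier frames (making frame features temporally aware), followed by a temporal decoder that pools frame features for a sequence-level task. All module names (MemorySpatialEncoder, TemporalDecoder), dimensions, and the memory mechanism are illustrative assumptions, not the paper's actual S-ViT implementation.

```python
import torch
import torch.nn as nn

class MemorySpatialEncoder(nn.Module):
    """Per-frame encoder; attention reads tokens cached from past frames,
    so the spatial features are temporally aware (illustrative sketch)."""
    def __init__(self, dim=256, num_heads=8, memory_frames=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory_frames = memory_frames
        self.memory = []  # rolling cache of past-frame token sequences

    def forward(self, frame_tokens):           # (B, N, dim): tokens of one frame
        context = torch.cat(self.memory + [frame_tokens], dim=1)
        out, _ = self.attn(frame_tokens, context, context)
        out = self.norm(frame_tokens + out)
        # keep a bounded window of past tokens, detached, streaming-style
        self.memory.append(out.detach())
        self.memory = self.memory[-self.memory_frames:]
        return out                              # frame-level features

class TemporalDecoder(nn.Module):
    """Aggregates per-frame features over time for a sequence-based head."""
    def __init__(self, dim=256, num_classes=400):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):             # (B, T, dim): one feature per frame
        x = self.temporal(frame_feats)
        return self.head(x.mean(dim=1))         # clip-level prediction

# Usage: frames arrive one at a time. Frame-level features could serve a
# frame-based task (e.g. MOT association) immediately; stacking them and
# running the temporal decoder yields the sequence-based prediction.
encoder, decoder = MemorySpatialEncoder(), TemporalDecoder()
frames = [torch.randn(1, 49, 256) for _ in range(8)]   # 8 frames of 7x7 tokens
per_frame = [encoder(f).mean(dim=1) for f in frames]    # (1, 256) each
logits = decoder(torch.stack(per_frame, dim=1))         # (1, 400)
```

The key property this sketch tries to capture is that the spatial encoder runs once per incoming frame (streaming), while the temporal decoder is a lightweight, task-specific module applied on top of the cached frame features.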


