Towards Long-Form Video Understanding

by Chao-Yuan Wu, et al.

Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.
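To make the "object-centric transformer" idea concrete, here is a minimal sketch (not the paper's implementation; all names and shapes are illustrative assumptions): instead of attending over raw space-time patches, the model attends over a small set of per-object feature vectors pooled from across a long video.

```python
# Hedged sketch of object-centric attention (illustrative only, not the
# authors' architecture): single-head scaled dot-product attention over
# per-object feature tokens rather than raw pixels or patches.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Mix each object token with all others via attention weights."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this object to every other object, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # Weighted average of value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: three detected objects, each a 4-dim pooled feature vector.
objects = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.9, 0.1, 0.0, 0.0]]
mixed = attend(objects, objects, objects)
print(len(mixed), len(mixed[0]))  # 3 object tokens in, 3 out, 4-dim each
```

Because attention cost grows with the number of tokens, operating on a handful of object tokens per clip (rather than thousands of patch tokens) is one plausible way such a model stays tractable over long videos.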





Related research:

- Long-Term Feature Banks for Detailed Video Understanding
- MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
- Long Short-Term Transformer for Online Action Detection
- An Annotated Video Dataset for Computing Video Memorability
- How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos
- Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events
- CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning