Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

09/20/2023
by   Mohamed Afham, et al.
0

While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.

READ FULL TEXT
research
04/08/2019

SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition

While many action recognition datasets consist of collections of brief, ...
research
06/21/2021

Towards Long-Form Video Understanding

Our world offers a never-ending stream of visual stimuli, yet today's vi...
research
10/19/2022

Temporal Action Segmentation: An Analysis of Modern Technique

Temporal action segmentation from videos aims at the dense labeling of v...
research
09/22/2014

Temporally Coherent Bayesian Models for Entity Discovery in Videos by Tracklet Clustering

A video can be represented as a sequence of tracklets, each spanning 10-...
research
10/12/2022

LiveSeg: Unsupervised Multimodal Temporal Segmentation of Long Livestream Videos

Livestream videos have become a significant part of online learning, whe...
research
05/11/2016

Unsupervised Semantic Action Discovery from Video Collections

Human communication takes many forms, including speech, text and instruc...

Please sign up or login with your details

Forgot password? Click here to reset