Temporal Alignment Networks for Long-term Video

04/06/2022
by   Tengda Han, et al.
0

The objective of this paper is a temporal alignment network that ingests long term video sequences, and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment. The challenge is to train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise, and are only weakly aligned when relevant. Apart from proposing the alignment network, we also make four contributions: (i) we describe a novel co-training method that enables to denoise and train on raw instructional videos without using manual annotation, despite the considerable noise; (ii) to benchmark the alignment performance, we manually curate a 10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal descriptions. Our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset by a significant margin; (iii) we apply the trained model in the zero-shot settings to multiple downstream video understanding tasks and achieve state-of-the-art results, including text-video retrieval on YouCook2, and weakly supervised video action segmentation on Breakfast-Action; (iv) we use the automatically aligned HowTo100M annotations for end-to-end finetuning of the backbone model, and obtain improved performance on downstream action recognition tasks.

READ FULL TEXT

page 2

page 4

page 5

page 13

page 14

page 16

research
07/21/2019

TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

In this paper we propose a novel Temporal Attentive Relation Network (TA...
research
12/13/2019

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Annotating videos is cumbersome, expensive and not scalable. Yet, many s...
research
06/06/2023

Learning to Ground Instructional Articles in Videos through Narrations

In this paper we present an approach for localizing steps of procedural ...
research
03/23/2017

Weakly Supervised Action Learning with RNN based Fine-to-coarse Modeling

We present an approach for weakly supervised learning of human actions. ...
research
01/05/2023

HierVL: Learning Hierarchical Video-Language Embeddings

Video-language embeddings are a promising avenue for injecting semantics...
research
09/30/2021

Unsupervised Few-Shot Action Recognition via Action-Appearance Aligned Meta-Adaptation

We present MetaUVFS as the first Unsupervised Meta-learning algorithm fo...
research
02/08/2023

Weakly-supervised Representation Learning for Video Alignment and Analysis

Many tasks in video analysis and understanding boil down to the need for...

Please sign up or login with your details

Forgot password? Click here to reset