HierVL: Learning Hierarchical Video-Language Embeddings

01/05/2023
by Kumar Ashutosh, et al.

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
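
To make the two-level objective concrete, here is a minimal sketch of a hierarchical contrastive loss in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the symmetric InfoNCE form, the mean pooling used to aggregate clip embeddings into a video embedding, the temperature of 0.07, and every function and parameter name (`info_nce`, `mean_pool`, `hierarchical_loss`, `video_weight`) are hypothetical choices made for this example. The abstract states only that alignment is encouraged at both the clip level and the video level.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed aggregation, not from the paper)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE over matched pairs (visual[i], text[i])."""
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    n = len(v)
    # sim[i][j]: temperature-scaled cosine similarity of visual i and text j.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], using a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_denom = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_denom - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*sim)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (cross_entropy_on_diagonal(sim) + cross_entropy_on_diagonal(columns))


def hierarchical_loss(clip_vis, clip_txt, summary_txt, video_weight=1.0):
    """Clip-level plus video-level contrastive loss.

    clip_vis, clip_txt: one list per video, each holding that video's clip
    embeddings and the matching narration embeddings.
    summary_txt: one summary-text embedding per video.
    """
    # Clip level: each clip is contrasted against every narration in the batch.
    flat_vis = [c for video in clip_vis for c in video]
    flat_txt = [c for video in clip_txt for c in video]
    clip_level = info_nce(flat_vis, flat_txt)

    # Video level: pooled clip embeddings are contrasted against the summaries.
    video_vis = [mean_pool(video) for video in clip_vis]
    video_level = info_nce(video_vis, summary_txt)

    return clip_level + video_weight * video_level
```

In this sketch the clip-level term plays the role of the "what is happening" constraint, and the video-level term plays the role of the "why it is happening" constraint. A real system would compute both terms on learned encoder outputs with automatic differentiation, and the aggregation from clips to a video would likely be learned rather than a fixed average.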
