Verbs in Action: Improving verb understanding in video-language models

04/13/2023
by Liliane Momeni, et al.

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work that proposes a method to alleviate, rather than simply highlight, the verb understanding problem.
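The first component described above can be sketched as a cross-modal contrastive loss in which each video is scored against its true caption plus LLM-generated captions whose verb has been swapped (e.g. "opening a door" vs. "closing a door"). The following is a minimal, hypothetical numpy sketch of such a loss over precomputed embeddings; the function name, shapes, and temperature value are illustrative assumptions, not the paper's implementation (and the calibration strategy and verb phrase alignment loss are not covered here).

```python
import numpy as np

def verb_hard_negative_contrastive_loss(video_emb, pos_text_emb,
                                        hard_neg_text_embs, temperature=0.07):
    """Hypothetical sketch: contrastive loss where one video embedding is
    matched against its positive caption embedding (index 0) and K
    verb-swapped hard-negative caption embeddings.

    video_emb:          (D,)   video embedding
    pos_text_emb:       (D,)   embedding of the correct caption
    hard_neg_text_embs: (K, D) embeddings of verb-swapped negative captions
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    v = normalize(video_emb)
    p = normalize(pos_text_emb)
    n = normalize(hard_neg_text_embs)

    # Cosine similarities, positive first, scaled by temperature.
    logits = np.concatenate(([v @ p], n @ v)) / temperature

    # Softmax cross-entropy with the positive at index 0.
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

If the video embedding is closer to the correct caption than to the verb-swapped negatives, the loss is near zero; if a hard negative scores higher, the loss grows, which is what pushes the model to attend to verbs rather than shared nouns.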


Related research

09/28/2021: VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
We present VideoCLIP, a contrastive approach to pre-train a unified mode...

01/05/2023: Test of Time: Instilling Video-Language Models with a Sense of Time
Modeling and understanding time remains a challenge in contemporary vide...

05/18/2023: Paxion: Patching Action Knowledge in Video-Language Foundation Models
Action knowledge involves the understanding of textual, visual, and temp...

08/23/2021: TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
Contrastive learning has been widely used to train transformer-based vis...

07/16/2022: Clover: Towards A Unified Video-Language Alignment and Fusion Model
Building a universal video-language model for solving various video unde...

06/12/2020: Video Understanding as Machine Translation
With the advent of large-scale multimodal video datasets, especially seq...

05/17/2023: Probing the Role of Positional Information in Vision-Language Models
In most Vision-Language models (VL), the understanding of the image stru...
