ActBERT: Learning Global-Local Video-Text Representations

11/14/2020
by Linchao Zhu, et al.

In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information. This enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state of the art, demonstrating its superiority in video-text representation learning.
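To make the three-stream entangling concrete, below is a minimal, hypothetical PyTorch sketch of a block in the spirit of the abstract's description: the global action stream gathers clues from the text and object streams via cross-attention, then feeds action-conditioned context back into each local stream. This is an illustrative assumption, not the authors' implementation; all names (EntangledBlock, d_model, the toy sequence lengths) are invented for the example.

```python
# Illustrative sketch only: a three-stream cross-attention block loosely
# mirroring the idea of global actions catalyzing text-object interactions.
# Not the ActBERT authors' code; shapes and names are assumptions.
import torch
import torch.nn as nn


class EntangledBlock(nn.Module):
    """Fuses global action, local object, and text token features."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Action stream attends to text and to regional objects.
        self.act_from_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.act_from_obj = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Action-conditioned context flows back into each local stream.
        self.txt_from_act = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.obj_from_act = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_act = nn.LayerNorm(d_model)
        self.norm_txt = nn.LayerNorm(d_model)
        self.norm_obj = nn.LayerNorm(d_model)

    def forward(self, actions, objects, text):
        # Global actions extract clues from both local sources.
        a_t, _ = self.act_from_txt(actions, text, text)
        a_o, _ = self.act_from_obj(actions, objects, objects)
        actions = self.norm_act(actions + a_t + a_o)
        # Inject global action context back into text and object streams.
        t_a, _ = self.txt_from_act(text, actions, actions)
        text = self.norm_txt(text + t_a)
        o_a, _ = self.obj_from_act(objects, actions, actions)
        objects = self.norm_obj(objects + o_a)
        return actions, objects, text


if __name__ == "__main__":
    B, d = 2, 256
    block = EntangledBlock(d_model=d)
    acts = torch.randn(B, 8, d)   # one feature per clip-level action
    objs = torch.randn(B, 20, d)  # region features from an object detector
    txt = torch.randn(B, 16, d)   # token embeddings of the paired text
    a, o, t = block(acts, objs, txt)
    print(a.shape, o.shape, t.shape)
```

In this reading, the action stream acts as a bottleneck that routes information between text tokens and object regions, so that the joint representation stays aware of fine-grained objects while being conditioned on global human intention.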
