COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

11/01/2020
by   Simon Ging, et al.

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchical information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
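To illustrate the cross-modal cycle-consistency idea mentioned above, the sketch below implements a soft nearest-neighbor cycle in NumPy: a clip embedding is mapped to its soft nearest neighbor among the sentence embeddings, then cycled back to the video space, and the loss penalizes the distance between the cycled-back soft index and the original clip index. This is a minimal illustration of the general technique, not the authors' implementation; the function name and the exact distance formulation are assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cycle_consistency_loss(video_emb, text_emb, i):
    """Soft cycle: clip i -> soft nearest sentence -> soft nearest clip.

    video_emb: (n_clips, d) clip embeddings
    text_emb:  (n_sents, d) sentence embeddings
    i:         index of the clip to cycle
    Returns the squared error between the cycled-back soft index and i.
    (Hypothetical formulation for illustration only.)
    """
    # similarity of clip i to every sentence
    alpha = softmax(text_emb @ video_emb[i])
    # soft nearest neighbor of clip i in the text embedding space
    soft_text = alpha @ text_emb
    # similarity of that soft text point back to every clip
    beta = softmax(video_emb @ soft_text)
    # expected ("soft") index of the cycled-back location
    soft_idx = beta @ np.arange(len(video_emb))
    return (soft_idx - i) ** 2
```

When the two embedding spaces are well aligned, the cycle returns to the clip it started from and the loss approaches zero; misaligned embeddings land elsewhere and incur a penalty, which is what drives the video and text representations together during training.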


