Collaborative Three-Stream Transformers for Video Captioning

09/18/2023
by Hao Wang, et al.

As the most critical components of a sentence, the subject, predicate, and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), that models the three parts separately and lets them complement each other for better representation. Specifically, COST consists of three transformer branches that exploit visual-linguistic interactions of different granularities in the spatial-temporal domain: between videos and text, between detected objects and text, and between actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches, so that the branches can support each other in exploiting the most discriminative semantic information of different granularities for accurate caption prediction. The whole model is trained in an end-to-end fashion. Extensive experiments on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions, and MSVD, demonstrate that the proposed method performs favorably against state-of-the-art methods.
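To make the idea of branches that "support each other" concrete, here is a minimal sketch in plain Python. It is not the authors' module: the abstract does not specify how the cross-granularity attention is computed, so this example assumes a simple form in which each token of one stream attends to the tokens of the other two streams and adds the result back as a residual. The names `attend` and `cross_stream_fuse`, the toy 2-dimensional features, and the absence of learned projections are all illustrative assumptions.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def cross_stream_fuse(streams):
    # streams: dict mapping a stream name to a list of token vectors.
    # Assumed design (not taken from the paper): every token attends to the
    # tokens of the *other* streams, and the attended context is added back
    # to the token as a residual.
    fused = {}
    for name, tokens in streams.items():
        others = [t for other, toks in streams.items() if other != name for t in toks]
        fused[name] = [
            [x + c for x, c in zip(tok, attend(tok, others, others))]
            for tok in tokens
        ]
    return fused

# Toy features standing in for the video, detected-object, and action streams.
streams = {
    "video": [[1.0, 0.0], [0.5, 0.5]],
    "object": [[0.0, 1.0]],
    "action": [[1.0, 1.0], [0.2, 0.8], [0.9, 0.1]],
}
fused = cross_stream_fuse(streams)
```

Each stream keeps its own token count and feature size after fusion, so in a sketch like this the three branches could continue to be processed separately while still sharing information across granularities.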


