MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

05/11/2020
by Jie Lei, et al.

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks because it demands not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history, which helps the model better predict the next sentence (with respect to coreference and repetition) and thus encourages coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets, ActivityNet Captions and YouCookII, show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events. All code is available open-source at: https://github.com/jayleicn/recurrent-transformer
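To make the notion of a "less repetitive" paragraph concrete, the following is a minimal sketch of one simple way to quantify repetition: the fraction of n-grams in a paragraph that duplicate an n-gram seen earlier. The function name `ngram_repetition`, the whitespace tokenization, and the default `n=4` are assumptions made for illustration; this is not necessarily the metric used in the paper's experiments.

```python
from collections import Counter


def ngram_repetition(sentences, n=4):
    """Return the fraction of n-grams that repeat an earlier n-gram.

    Hypothetical helper for illustration only. N-grams are collected
    within each sentence (never across sentence boundaries) and pooled
    over the whole paragraph. Tokenization is a plain whitespace split.
    """
    ngrams = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        ngrams.extend(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    if not ngrams:
        # No sentence is long enough to contain an n-gram.
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repetition.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


paragraph = [
    "a man is playing a guitar",
    "a man is playing a guitar on stage",
    "the crowd claps",
]
print(ngram_repetition(paragraph))  # 3 repeated out of 8 4-grams -> 0.375
```

A lower score indicates fewer repeated phrases across the sentences of a paragraph; a paragraph with no repeated n-grams scores 0.0.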


