Adversarial Inference for Multi-Sentence Video Description

12/13/2018
by Jae Sung Park, et al.

While significant progress has been made in the image captioning task, video description is still comparatively in its infancy, due to the complex nature of video data. Generating multi-sentence descriptions for long videos is even more challenging. Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video. Recently, methods based on reinforcement learning (RL) and adversarial learning have been explored to improve image captioning models; however, both types of methods suffer from a number of issues, e.g. poor readability and high redundancy for RL, and stability issues for generative adversarial networks (GANs). In this work, we instead propose to apply adversarial techniques during inference, designing a discriminator which encourages better multi-sentence video description. In addition, we find that a multi-discriminator "hybrid" design, where each discriminator targets one aspect of a description, leads to the best results. Specifically, we decouple the discriminator to evaluate three criteria: 1) visual relevance to the video, 2) language diversity and fluency, and 3) coherence across sentences. Our approach results in more accurate, diverse and coherent multi-sentence video descriptions, as shown by automatic as well as human evaluation on the popular ActivityNet Captions dataset.
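
The inference-time selection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `HybridDiscriminator`, `adversarial_inference` and `sample_candidates` are hypothetical. The three learned discriminators are replaced by toy hand-written scorers, and combining their outputs as a weighted sum is an assumption. The idea shown is that a generator proposes several candidate sentences per video segment. The candidate that the hybrid discriminator scores highest is kept. Its coherence term is conditioned on the previously chosen sentence.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Hypothetical scorer signatures: higher score = "more like a human description"
VisualScorer = Callable[[Sequence[float], str], float]  # relevance to the segment
LanguageScorer = Callable[[str], float]                 # fluency / diversity
PairScorer = Callable[[Optional[str], str], float]      # coherence with previous sentence


@dataclass
class HybridDiscriminator:
    visual: VisualScorer
    language: LanguageScorer
    pair: PairScorer
    weights: tuple = (1.0, 1.0, 1.0)  # assumption: equal weighting of the three criteria

    def score(self, feat: Sequence[float], prev: Optional[str], cand: str) -> float:
        wv, wl, wp = self.weights
        return (wv * self.visual(feat, cand)
                + wl * self.language(cand)
                + wp * self.pair(prev, cand))


def adversarial_inference(segment_feats, sample_candidates, disc) -> List[str]:
    """Pick, per segment, the candidate sentence the discriminator scores highest."""
    paragraph: List[str] = []
    prev: Optional[str] = None
    for feat in segment_feats:
        candidates = sample_candidates(feat, prev)  # generator proposals (hypothetical)
        best = max(candidates, key=lambda c: disc.score(feat, prev, c))
        paragraph.append(best)
        prev = best  # the next coherence score is conditioned on this choice
    return paragraph


# --- Toy stand-ins for the learned components (purely illustrative) ---
KEYWORDS = ["dog", "ball"]  # one keyword per feature dimension
CANDIDATES = ["a dog runs", "a dog dog runs", "a ball rolls"]

def toy_visual(feat, s):
    words = s.split()
    return sum(w for w, kw in zip(feat, KEYWORDS) if kw in words)

def toy_language(s):
    words = s.split()
    return len(set(words)) / len(words)  # penalizes repeated words

def toy_pair(prev, s):
    return -1.0 if prev is not None and s == prev else 0.0  # penalizes repetition

def toy_sampler(feat, prev):
    return CANDIDATES  # a real generator would condition on feat and prev

disc = HybridDiscriminator(toy_visual, toy_language, toy_pair)
paragraph = adversarial_inference([[1.0, 0.0], [0.0, 1.0]], toy_sampler, disc)
print(paragraph)  # ['a dog runs', 'a ball rolls']
```

With these toy scorers, the second segment rejects a repeat of "a dog runs" because the coherence term penalizes it. It also rejects "a dog dog runs" because the language term penalizes the repeated word. Keeping each criterion in a separate scorer mirrors the decoupled "hybrid" design the abstract describes.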

research
05/11/2020

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Generating multi-sentence descriptions for videos is one of the most cha...
research
02/04/2023

Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning

Coherent entity-aware multi-image captioning aims to generate coherent c...
research
03/24/2020

Learning Compact Reward for Image Captioning

Adversarial learning has shown its advances in generating natural and di...
research
10/26/2019

Diverse Video Captioning Through Latent Variable Expansion with Conditional GAN

Automatically describing video content with text description is challeng...
research
08/22/2020

Identity-Aware Multi-Sentence Video Description

Standard video and movie description tasks abstract away from person ide...
research
08/06/2020

Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards

Generating accurate descriptions for online fashion items is important n...
research
01/17/2022

Discourse Analysis for Evaluating Coherence in Video Paragraph Captions

Video paragraph captioning is the task of automatically generating a coh...