Boosting Video Captioning with Dynamic Loss Network

07/25/2021
by Nasibullah, et al.

Video captioning is a challenging problem at the intersection of vision and language, with many real-life applications, including video retrieval, video surveillance, assistance for visually challenged people, and human-machine interfaces. Recent deep learning-based methods have shown promising results, but their performance still lags behind that of other vision tasks (such as image classification and object detection). A significant drawback of existing video captioning methods is that they are optimized over the cross-entropy loss function, which is uncorrelated with the de facto evaluation metrics (BLEU, METEOR, CIDEr, ROUGE). In other words, cross-entropy is not a proper surrogate of the true loss function for video captioning. This paper addresses the drawback by introducing a dynamic loss network (DLN), which provides an additional feedback signal that directly reflects the evaluation metrics. Our results on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSRVTT) datasets outperform previous methods.
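
To make the idea of an additional, metric-aware feedback signal concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not say how the DLN is built or how its signal is combined with cross-entropy, so everything here is an assumption for illustration: the function names, the additive combination, the `weight` parameter, and the use of clipped unigram precision (a simplified BLEU-1 without brevity penalty) in place of a learned network's estimate.

```python
import math


def cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    step_probs[t][k] is the model's probability of token k at step t.
    """
    nll = [-math.log(step_probs[t][tok]) for t, tok in enumerate(target_ids)]
    return sum(nll) / len(nll)


def unigram_precision(candidate, reference):
    """Clipped unigram precision: a simplified stand-in for BLEU-1."""
    if not candidate:
        return 0.0
    ref_counts = {}
    for word in reference:
        ref_counts[word] = ref_counts.get(word, 0) + 1
    matched = 0
    for word in candidate:
        if ref_counts.get(word, 0) > 0:
            matched += 1
            ref_counts[word] -= 1
    return matched / len(candidate)


def combined_loss(step_probs, target_ids, metric_estimate, weight=0.5):
    """Cross-entropy plus a penalty that shrinks as the metric estimate rises.

    Hypothetical combination: metric_estimate is assumed to lie in [0, 1].
    Here the caller supplies it; a learned network could supply it instead.
    """
    return cross_entropy(step_probs, target_ids) + weight * (1.0 - metric_estimate)


# Toy example with a five-word vocabulary (all values are made up).
vocab = ["a", "man", "is", "cooking", "playing"]
reference = ["a", "man", "is", "cooking"]
target_ids = [vocab.index(w) for w in reference]
step_probs = [
    [0.90, 0.025, 0.025, 0.025, 0.025],
    [0.025, 0.90, 0.025, 0.025, 0.025],
    [0.025, 0.025, 0.90, 0.025, 0.025],
    [0.025, 0.025, 0.025, 0.40, 0.525],  # model slightly prefers "playing"
]
# Greedy decoding: pick the most probable token at each step.
candidate = [vocab[max(range(len(p)), key=p.__getitem__)] for p in step_probs]
metric = unigram_precision(candidate, reference)
loss = combined_loss(step_probs, target_ids, metric)
```

In this toy case the greedy caption ends in "playing" rather than "cooking", so the metric stand-in drops below 1 and the combined objective is larger than cross-entropy alone. This sketch only computes a scalar; in actual training the extra term would have to be differentiable, or handled by some other optimization scheme, to influence the model.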
