Visual-aware Attention Dual-stream Decoder for Video Captioning

10/16/2021
by   Zhixin Sun, et al.

Video captioning is a challenging task that requires capturing different visual parts of a video and describing them in sentences, demanding both visual and linguistic coherence. The attention mechanism in current video captioning methods learns to assign a weight to each frame, prompting the decoder dynamically, but it may not explicitly model the correlation and temporal coherence of the visual features extracted across sequential frames. To generate semantically coherent sentences, we propose a new Visual-aware Attention (VA) model, which concatenates the dynamic changes of the temporal frame sequence with the word at the previous time step as the input to the attention mechanism for extracting sequence features. In addition, prevalent approaches widely use teacher-forcing (TF) learning during training, where the next token is generated conditioned on the previous ground-truth tokens; the semantic information in the previously generated tokens is thereby lost. We therefore design a self-forcing (SF) stream that takes the semantic information in the probability distribution of the previous token as input to enhance the current token. The Dual-stream Decoder (DD) architecture unifies the TF and SF streams, generating sentences in both streams that approach the annotated captions. Meanwhile, the Dual-stream Decoder alleviates the exposure bias problem caused by the discrepancy between training and testing under TF learning. The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated through experimental studies on the Microsoft Video Description (MSVD) and MSR-Video to Text (MSR-VTT) datasets.
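To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation: the visual-aware attention query is formed by concatenating frame-to-frame dynamics with the previous word embedding, and the self-forcing input is the expected word embedding under the previous token's probability distribution (all dimensions, weight matrices, and the use of frame deltas as "dynamic changes" are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_v, d_w, d_h = 8, 16, 12, 20  # frames, visual dim, word dim, attention dim

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Visual features for T frames; frame-to-frame deltas stand in for
# the "dynamic changes of temporal sequence frames" (an assumption).
V = rng.standard_normal((T, d_v))
dV = np.vstack([np.zeros((1, d_v)), np.diff(V, axis=0)])

w_prev = rng.standard_normal(d_w)  # embedding of the word at the previous step

# Visual-aware query: concatenate frame dynamics with the previous word.
W_q = 0.1 * rng.standard_normal((d_v + d_w, d_h))
W_k = 0.1 * rng.standard_normal((d_v, d_h))
Q = np.concatenate([dV, np.tile(w_prev, (T, 1))], axis=1) @ W_q  # (T, d_h)
K = V @ W_k                                                      # (T, d_h)

scores = (Q * K).sum(axis=1) / np.sqrt(d_h)
alpha = softmax(scores)          # one attention weight per frame
context = alpha @ V              # attended visual feature fed to the decoder

# Self-forcing (SF) input: instead of the single ground-truth token used in
# teacher forcing, take the expected embedding under the previous token's
# predicted distribution, preserving its semantic information.
vocab = 30
E = rng.standard_normal((vocab, d_w))          # word embedding table
p_prev = softmax(rng.standard_normal(vocab))   # previous token's distribution
sf_input = p_prev @ E                          # soft, semantics-aware embedding

print(context.shape, sf_input.shape)
```

In teacher forcing the decoder would receive one row of `E`; the SF stream replaces that hard lookup with the probability-weighted mixture `p_prev @ E`, which is what lets the dual-stream decoder keep the semantics of its own predictions.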


