Variational Stacked Local Attention Networks for Diverse Video Captioning

01/04/2022
by   Tonmoay Deb, et al.

While describing spatio-temporal events in natural language, video captioning models mostly rely on the encoder's latent visual representation. Recent progress in encoder-decoder models attends to encoder features mainly through linear interaction with the decoder. However, growing model complexity for visual data calls for more explicit feature interaction to capture fine-grained information, which is currently absent in the video captioning domain. Moreover, feature aggregation methods have been used to unveil a richer visual representation, either by concatenation or through a linear layer. Although the feature sets for a video semantically overlap to some extent, these approaches result in objective mismatch and feature redundancy. In addition, diversity in captions is fundamental to expressing one event from several meaningful perspectives, and is currently missing in the temporal, i.e., video, captioning domain. To this end, we propose the Variational Stacked Local Attention Network (VSLAN), which exploits low-rank bilinear pooling for self-attentive feature interaction and stacks multiple video feature streams in a discount fashion. Each feature stack's learned attributes contribute to our proposed diversity encoding module, followed by the decoding query stage, to facilitate end-to-end diverse and natural captions without any explicit supervision on attributes. We evaluate VSLAN on the MSVD and MSR-VTT datasets in terms of syntax and diversity. The CIDEr score of VSLAN outperforms current off-the-shelf methods by 7.8% on MSVD and 4.5% on MSR-VTT. On the same datasets, VSLAN achieves competitive results on caption diversity metrics.
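The abstract's core mechanism is low-rank bilinear pooling used to score interactions between visual features and a query before attention-weighted aggregation. The sketch below illustrates that general technique only, assuming PyTorch; the module name, layer sizes, activation choices, and the single-query setup are assumptions for illustration and do not reproduce the paper's exact VSLAN architecture.

```python
# Minimal sketch of attention scored by low-rank bilinear pooling
# (Hadamard product of two low-rank projections). Illustrative only.
import torch
import torch.nn as nn

class LowRankBilinearAttention(nn.Module):
    """Scores each local feature against a query via low-rank bilinear pooling,
    then returns the attention-weighted feature."""
    def __init__(self, feat_dim, query_dim, rank=256):
        super().__init__()
        self.U = nn.Linear(feat_dim, rank, bias=False)   # projects local visual features
        self.V = nn.Linear(query_dim, rank, bias=False)  # projects the query / decoder state
        self.w = nn.Linear(rank, 1, bias=False)          # collapses the joint space to a score

    def forward(self, feats, query):
        # feats: (batch, num_locals, feat_dim); query: (batch, query_dim)
        joint = torch.tanh(self.U(feats)) * torch.tanh(self.V(query)).unsqueeze(1)
        alpha = torch.softmax(self.w(joint).squeeze(-1), dim=-1)    # attention weights over locals
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)      # attended feature (batch, feat_dim)

# Example usage with hypothetical dimensions
att = LowRankBilinearAttention(feat_dim=2048, query_dim=512)
pooled = att(torch.randn(4, 20, 2048), torch.randn(4, 512))  # -> (4, 2048)
```

Compared with plain linear (additive) attention, the elementwise product of the two projections gives a bilinear, multiplicative interaction between feature and query while keeping the parameter count low through the shared rank dimension.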

