Exploration of Visual Features and their weighted-additive fusion for Video Captioning

by Praveen S V, et al.

Video captioning is a popular task that challenges models to describe events in videos using natural language. In this work, we investigate how well various visual feature representations, derived from state-of-the-art convolutional neural networks, capture high-level semantic context. We introduce the Weighted Additive Fusion Transformer with Memory Augmented Encoders (WAFTM), a captioning model that incorporates memory in a transformer encoder and uses a novel feature-fusion method that gives due importance to the more significant representations. We illustrate the performance gain realized by applying Word-Piece Tokenization and the popular REINFORCE algorithm. Finally, we benchmark our model on two datasets and obtain a CIDEr of 92.4 on MSVD and a METEOR of 0.091 on the ActivityNet Captions Dataset.
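
To make the idea of weighted-additive fusion concrete, here is a minimal sketch in plain Python. It is not the WAFTM implementation: the abstract does not specify how the weights are computed, so the softmax over raw scores, the function names `softmax` and `weighted_additive_fusion`, and the toy `appearance` and `motion` vectors are all assumptions made for illustration. The sketch only shows the general pattern of combining several same-sized feature vectors as a weighted sum, so that more significant representations contribute more.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_additive_fusion(features, scores):
    """Fuse same-sized feature vectors as a weighted sum.

    features: list of equal-length vectors, one per visual feature extractor.
    scores:   one raw importance score per vector. Hypothetical: in a trained
              model these would presumably be learned, not hand-set.
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must share one dimension")
    weights = softmax(scores)
    # Element-wise weighted sum across the feature vectors.
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


# Toy vectors standing in for two feature types (assumed, not from the paper).
appearance = [1.0, 0.0, 2.0]
motion = [0.0, 4.0, 2.0]

# Equal scores give equal weights, so the result is the plain average.
fused = weighted_additive_fusion([appearance, motion], scores=[0.0, 0.0])
```

Raising one score relative to the others shifts the fused vector toward that feature type, which is the sense in which a weighted sum can give "due importance" to a representation.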

