Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation

12/28/2021
by Philipp Harzig, et al.

Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning, yet lack a straightforward and reproducible application to VTT. Moreover, there is no comprehensive study of strategies for video description generation, including how to exploit the accompanying audio with fully self-attentive networks. Thus, we explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset to determine a configuration applicable to unseen datasets that helps describe short video clips in natural language, improve the CIDEr and BLEU-4 scores by 37.13 and 12.83 points compared to a vanilla Transformer network, and achieve state-of-the-art results on the MSR-VTT and MSVD datasets. Also, FPE helps increase the CIDEr score by a relative factor of 8.6.
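The abstract does not spell out how Fractional Positional Encoding works, but one plausible reading is that audio tokens, which arrive at a different frame rate than video tokens, are assigned fractional positions on the shared video timeline before a standard sinusoidal encoding is applied. The sketch below illustrates that idea; the frame counts and the `sinusoidal_pe` helper are hypothetical, not taken from the paper.

```python
import numpy as np

def sinusoidal_pe(positions, d_model):
    """Standard sinusoidal positional encoding, evaluated at
    arbitrary (possibly fractional) positions."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]  # (T, 1)
    dims = np.arange(d_model)[None, :]                            # (1, D)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.empty((positions.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Hypothetical clip: 10 video frames and 25 audio frames cover the same
# time span. Scaling audio indices by 10/25 places each audio frame at a
# fractional position on the video frame axis, so temporally coincident
# audio and video tokens receive identical positional encodings.
n_video, n_audio, d_model = 10, 25, 16
video_pe = sinusoidal_pe(np.arange(n_video), d_model)
audio_pe = sinusoidal_pe(np.arange(n_audio) * (n_video / n_audio), d_model)
```

Under this reading, the attention layers can relate an audio token to the video frame it co-occurs with purely through the shared positional signal, without resampling either modality.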

