VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

04/06/2019
by Xin Wang, et al.

We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions, there are over 206,000 English-Chinese parallel translation pairs. Compared to the widely used MSR-VTT dataset, VATEX is multilingual, larger, more linguistically complex, and more diverse in terms of both videos and natural language descriptions. We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, which aims to describe a video in multiple languages with a compact unified captioning model, and (2) Video-guided Machine Translation, which translates a source-language description into the target language using the video information as additional spatiotemporal context. Extensive experiments on the VATEX dataset show that the unified multilingual model not only produces both English and Chinese descriptions for a video more efficiently, but also outperforms the monolingual models. Furthermore, we demonstrate that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation. Finally, we discuss the potential of using VATEX for other video-and-language research.


research
12/13/2020

MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish

Automatic generation of video descriptions in natural language, also cal...
research
03/12/2022

Taking an Emotional Look at Video Paragraph Captioning

Translating visual data into natural language is essential for machines ...
research
07/26/2017

Video Highlight Prediction Using Audience Chat Reactions

Sports channel video portals offer an exciting domain for research on mu...
research
06/20/2017

Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation

Recent work in computer vision has yielded impressive results in automat...
research
09/14/2023

Multilingual Audio Captioning using machine translated data

Automated Audio Captioning (AAC) systems attempt to generate a natural l...
research
11/22/2022

Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

Training supervised video captioning model requires coupled video-captio...
research
10/17/2019

Multi-View Features and Hybrid Reward Strategies for Vatex Video Captioning Challenge 2019

This document describes our solution for the VATEX Captioning Challenge ...
