The Role of the Input in Natural Language Video Description

02/09/2021
by   Silvia Cascianelli, et al.

Natural Language Video Description (NLVD) has recently received strong interest in the Computer Vision, Natural Language Processing (NLP), Multimedia, and Autonomous Robotics communities. State-of-the-Art (SotA) approaches obtain remarkable results when tested on the benchmark datasets, but they generalize poorly to new datasets. In addition, none of the existing works focuses on the processing of the input to NLVD systems, which is both visual and textual. This work presents an extensive study of the role of the visual input, evaluated with respect to the overall NLP performance. The study performs data augmentation of the visual component, applying common transformations that model the camera distortions, noise, lighting, and camera positioning typical of real-world operative scenarios. A t-SNE based analysis is proposed to evaluate the effects of the considered transformations on the overall visual data distribution. The study uses the English subset of the Microsoft Research Video Description (MSVD) dataset, which is commonly used for NLVD. This dataset was observed to contain a considerable number of syntactic and semantic errors. These errors have been amended manually, and the new version of the dataset (called MSVD-v2) is used in the experiments. The MSVD-v2 dataset is released to help gain insight into the NLVD problem.
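
As an illustration of the kind of visual augmentation the abstract describes, the following is a minimal sketch of three frame-level transformations: additive Gaussian noise, a gamma-based lighting change, and a horizontal shift standing in for a change in camera position. It is not the authors' implementation. The function names, the parameter values, and the choice of NumPy are all assumptions made for this example, and the abstract does not specify which transformations or settings were actually used.

```python
import numpy as np


def add_gaussian_noise(frame, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise to a uint8 frame (hypothetical sensor-noise model)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = frame.astype(np.float32) + rng.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def adjust_gamma(frame, gamma=1.5):
    """Change lighting with a gamma curve; gamma > 1 darkens, gamma < 1 brightens."""
    normalized = frame.astype(np.float32) / 255.0
    return np.clip(255.0 * normalized ** gamma, 0, 255).astype(np.uint8)


def shift_horizontally(frame, pixels=8):
    """Shift the frame content sideways, padding with black (crude camera-position change)."""
    shifted = np.zeros_like(frame)
    if pixels > 0:
        shifted[:, pixels:] = frame[:, :-pixels]
    elif pixels < 0:
        shifted[:, :pixels] = frame[:, -pixels:]
    else:
        shifted[:] = frame
    return shifted


def augment_video(frames, rng=None):
    """Apply the same chain of transformations to every frame of a (T, H, W, C) video."""
    return np.stack(
        [shift_horizontally(adjust_gamma(add_gaussian_noise(f, rng=rng))) for f in frames]
    )
```

Under these assumptions, a video stored as a `(T, H, W, C)` uint8 array can be passed to `augment_video`, and the result has the same shape and dtype. In a study like the one described, each transformation would more likely be applied separately so that its effect on description quality can be measured in isolation.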


