Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review

03/27/2021
by Jesus Perez-Martin, et al.

Research in the area of Vision and Language encompasses challenging topics that seek to connect visual and textual information. The video-to-text problem is one of these topics; its goal is to connect an input video with its textual description. This connection is mainly made in one of two ways: by retrieving the most significant descriptions from a corpus, or by generating a new description given a context video. These two approaches correspond to essential tasks for the Computer Vision and Natural Language Processing communities: the text-retrieval-from-video task and the video captioning/description task. Both tasks are substantially more complex than predicting or retrieving a single sentence from an image, because the spatiotemporal information present in videos introduces diversity and complexity in both the visual content and the structure of the associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze how the most frequently reported benchmark datasets were created, showing their strengths and drawbacks with respect to the requirements of the problem. We also show the impressive progress that researchers have made on each dataset, and we analyze why, despite this progress, video-to-text conversion is still unsolved. State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions. We cover several significant challenges in the field and discuss future research directions.

research
11/30/2020

A Comprehensive Review on Recent Methods and Challenges of Video Description

Video description involves the generation of the natural language descri...
research
02/09/2021

The Role of the Input in Natural Language Video Description

Natural Language Video Description (NLVD) has recently received strong i...
research
07/25/2018

Video Storytelling

Bridging vision and natural language is a longstanding goal in computer ...
research
10/17/2022

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

This paper surveys vision-language pre-training (VLP) methods for multim...
research
04/10/2016

TGIF: A New Dataset and Benchmark on Animated GIF Description

With the recent popularity of animated GIFs on social media, there is ne...
research
11/21/2020

Deep learning for video game genre classification

Video game genre classification based on its cover and textual descripti...
research
05/26/2021

Computer Vision and Conflicting Values: Describing People with Automated Alt Text

Scholars have recently drawn attention to a range of controversial issue...
