From Show to Tell: A Survey on Image Captioning

07/14/2021
by Matteo Stefanini, et al.

Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Since 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes and the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results, research in image captioning has not yet reached a conclusive answer. This work aims to provide a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and for highlighting future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.

research
10/06/2018

A Comprehensive Survey of Deep Learning for Image Captioning

Generating a description of an image is called image captioning. Image c...
research
07/26/2022

Retrieval-Augmented Transformer for Image Captioning

Image captioning models aim at connecting Vision and Language by providi...
research
02/21/2020

Image to Language Understanding: Captioning approach

Extracting context from visual representations is of utmost importance i...
research
07/28/2021

A Thorough Review on Recent Deep Learning Methodologies for Image Captioning

Image Captioning is a task that combines computer vision and natural lan...
research
03/12/2016

Image Captioning with Semantic Attention

Automatically generating a natural language description of an image has ...
research
01/14/2019

Image Based Review Text Generation with Emotional Guidance

In the current field of computer vision, automatically generating texts ...
research
05/24/2023

Exploring Diverse In-Context Configurations for Image Captioning

After discovering that Language Models (LMs) can be good in-context few-...
