Image captioning is the task of describing the visual content of an image in natural language, employing a visual understanding system and a language model capable of generating meaningful and syntactically correct sentences. Neuroscience research has clarified the link between human vision and language generation only in the last few years [ardila2015language]. Similarly, in Artificial Intelligence, the design of architectures capable of processing images and generating language is a very recent matter. The goal of these research efforts is to find the most effective pipeline to process an input image, represent its content, and transform that into a sequence of words by generating connections between visual and textual elements while maintaining the fluency of language. In its standard configuration, image captioning is an image-to-sequence problem whose inputs are pixels. These are encoded as one or multiple feature vectors in the visual encoding step, which prepares the input for a second generative step, called the language model. This produces a sequence of words or sub-words decoded according to a given vocabulary.
In the last few years, the research community has improved the models considerably: from the first deep learning-based proposals adopting Recurrent Neural Networks (RNNs) fed with global image descriptors, methods have been enriched with attentive approaches and reinforcement learning, up to the breakthroughs of Transformers and self-attention and to single-stream BERT-like approaches. At the same time, the Computer Vision and Natural Language Processing (NLP) communities have addressed the challenge of building proper evaluation protocols and evaluation metrics to compare results with human-generated ground-truths. Moreover, several domain-specific scenarios and variants of the task have been investigated. However, the achieved results are still far from an ultimate solution. With the aim of providing a testament to the journey that captioning has taken so far, and of encouraging novel ideas, in this paper we trace a holistic overview of the models developed in recent years.
Following the inherent dual nature of captioning models, we develop a taxonomy of both the visual encoding and the language modeling approaches, focusing on their key aspects and limitations. We focus on the training strategies adopted in the literature over the past years, from cross-entropy loss to reinforcement learning and the recent advancement obtained by the pre-training paradigm and masked language model losses. Furthermore, we review the main datasets used to explore image captioning, from domain-generic benchmarks to domain-specific datasets collected to investigate specific aspects of the problem. Also, we analyze standard and non-standard metrics adopted for performance evaluation, and the different characteristics of the caption they analyze.
An additional contribution of this work is a quantitative comparison of the main image captioning methods which considers both standard and non-standard metrics, and a discussion on their relationships which sheds light on performance, differences, and characteristics of the most important models. Finally, we give an overview of many variants of the problem and discuss some open challenges and future directions.
2 Visual Encoding
Providing an effective representation of the visual content is the first challenge of an image captioning pipeline. Excluding the earliest image captioning works [pan2004automatic, farhadi2010every, pan2004gcap, yao2010i2t, gupta2012choosing, li2011composing, kulkarni2013babytalk, aker2010generating, mitchell2012midge], we focus on deep learning-based solutions.
The current approaches to visual encoding can be classified into four main categories: 1. non-attentive methods based on global CNN features; 2. additive attentive methods that embed the visual content using either grids or regions; 3. graph-based methods adding visual relationships between visual regions; and 4. self-attentive methods that employ Transformer-based paradigms, using region-based, patch-based, or image-text early fusion solutions. This taxonomy is visually summarized in Fig. 1.
2.1 Global CNN Features
With the advent of CNNs, all models consuming visual inputs have been improved in terms of performance. The visual encoding step of image captioning is no exception. In the simplest recipe, the activation of one of the last layers of a CNN is employed to extract high-level and fixed-sized representations, which are then used as a conditioning element for the language model (Fig. 2). This is the approach employed in the seminal paper “Show and Tell” [vinyals2015show] (the title of this survey is a tribute to this pioneering work), where the output of a GoogLeNet [szegedy2015going] pre-trained on ImageNet [russakovsky2015imagenet] is fed to the initial hidden state of the language model. In the same year, Karpathy et al. [karpathy2015deep] used global features extracted from AlexNet [krizhevsky2012imagenet] as the input for a language model. Further, Mao et al. [mao2015deep] and Donahue et al. [donahue2015long] injected global features extracted from the VGG network [simonyan2014very] at each time-step of the language model.
Global CNN features were then employed in a large variety of image captioning models [chen2015mind, fang2015captions, jia2015guiding, you2016image, wu2016value, gu2017empirical, chen2017structcap, chen2018groupcap]. Notably, Rennie et al. [rennie2017self] introduced the FC model, in which images are encoded using a ResNet-101 [he2016deep], preserving their original dimensions. Other approaches [yao2017boosting, gan2017semantic] integrated high-level attributes or tags, represented as a probability distribution over the most common words of the training captions.
The main advantage of employing global CNN features resides in their simplicity and compactness of representation, which makes it possible to extract and condense information from the whole input and to consider the overall context of an image. However, this paradigm also leads to excessive compression of information and lacks granularity: all salient objects and regions are fused in a single vector, making it hard for a captioning model to produce specific and fine-grained descriptions.
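To make the compression argument concrete, the following is a minimal sketch of one common way to obtain a global descriptor, namely averaging a grid of convolutional activations into a single vector. The function name, the nested-list layout, and the toy values are assumptions made for illustration; the cited models may instead take the activation of a fully connected layer.

```python
def global_average_pool(grid):
    """Collapse an H x W x C grid of activations into one C-dimensional vector.

    `grid` is a nested list indexed as grid[row][col][channel].
    """
    cells = [cell for row in grid for cell in row]
    channels = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]


# Toy 2 x 2 grid with 2 channels (illustrative values, not real CNN outputs).
# Two different locations respond strongly, each on a different channel.
grid = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
global_vec = global_average_pool(grid)
# The spatial position of each response is lost after pooling.
```

After pooling, the language model can no longer tell which location produced which response, which is the lack of granularity discussed above.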
2.2 Attention Over Grid of CNN Features
Motivated by the drawbacks of global representations, most of the following approaches have increased the granularity level of visual encoding [xu2015show, rennie2017self, lu2017knowing] (Fig. 2). Drawing from machine translation, the additive attention mechanism has demonstrated remarkable performance in a wide range of tasks and has endowed image captioning architectures with time-varying visual features encoding, enabling greater flexibility and finer granularity.
Definition of additive attention. The intuition behind attention boils down to weighted averaging. In the first formulation proposed for sequence alignment by Bahdanau et al. [bahdanau2014neural] (also known as additive attention), a single-layer feed-forward neural network with a hyperbolic tangent non-linearity is used to compute attention weights. Formally, given two generic sets of vectors $\{x_1, \ldots, x_n\}$ and $\{h_1, \ldots, h_m\}$, the additive attention score between the $h_i$'s and the $x_j$'s is computed as follows:
$$f_{att}(h_i, x_j) = W_3^\top \tanh(W_1 h_i + W_2 x_j),$$
where $W_1$ and $W_2$ are weight matrices, and $W_3$ is a weight vector that performs a linear combination. A softmax function is then applied to obtain a probability distribution $p(x_j \mid h_i)$, representing how much the element encoded by $x_j$ is relevant for $h_i$.
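The additive attention described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: the names `W1`, `W2`, and `w3`, the absence of bias terms, and the hand-picked toy weights are assumptions, and a real model would learn these parameters.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(h, xs, W1, W2, w3):
    """Attend over the vectors `xs` with a single query vector `h`.

    Each score is w3 . tanh(W1 h + W2 x_j); the weights are a softmax over j
    and the context is the weighted average of the x_j.
    """
    proj_h = matvec(W1, h)
    scores = []
    for x in xs:
        proj_x = matvec(W2, x)
        hidden = [math.tanh(a + b) for a, b in zip(proj_h, proj_x)]
        scores.append(sum(w * u for w, u in zip(w3, hidden)))
    weights = softmax(scores)
    dim = len(xs[0])
    context = [sum(w * x[d] for w, x in zip(weights, xs)) for d in range(dim)]
    return weights, context


# Toy example: one 2-d query attending over three 2-d feature vectors.
h = [1.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]  # illustrative weights, not learned
W2 = [[1.0, 0.0], [0.0, 1.0]]
w3 = [1.0, 1.0]
weights, context = additive_attention(h, xs, W1, W2, w3)
```

The returned `weights` form a probability distribution over the input vectors, and `context` is the weighted average that a decoder would consume at the current step.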
Although the attention mechanism was initially devised for modeling the relationships between two sequences of elements (i.e. hidden states from a recurrent encoder and a decoder), it can be adapted to connect a set of fine-grained visual representations with the hidden states of a language model. From the point of view of visual features extraction, a global representation may lead to sub-optimal results due to noisy contexts or irrelevant regions while generating a specific word. Employing an attention mechanism over a set of features, instead, allows retaining richer information useful for a more comprehensive sentence generation.
Attending convolutional activations. Xu et al. [xu2015show] introduced the first method leveraging the additive attention over the spatial output grid of a convolutional layer. This allows the model to selectively focus on certain elements of the grid by selecting a subset of features for each generated word. Specifically, the model first extracts the activation of the last convolutional layer of a VGG network [simonyan2014very], then uses additive attention to compute a weight for each grid element, interpreted as the relative importance of that element for generating the next word.
Other approaches. The solution based on additive attention over a grid of features has been widely adopted by several following works with minor improvements in terms of visual encoding [yao2017boosting, chen2018regularizing, lu2017knowing, wang2017skeleton, ge2019exploring, gu2018stack].
Review networks – For instance, Yang et al. [yang2016review] supplemented the encoder-decoder framework with a recurrent review network. This performs a given number of review steps with attention on the encoder hidden states and outputs a “thought vector” after each step, which is then used by the attention mechanism in the decoder.
Multi-level features – Chen et al. [chen2017sca] proposed to employ channel-wise attention over convolutional activations, followed by a more classical spatial attention. They also experimented with using more than one convolutional layer to exploit multi-level features. On the same line, Jiang et al. [jiang2018recurrent] proposed to use multiple CNNs in order to exploit their complementary information, then fused their representations with a recurrent procedure.
Exploiting human attention – Some works also integrated saliency information (i.e. what humans pay more attention to in a scene) to guide caption generation. This idea was first proposed by Sugano and Bulling [sugano2016seeing], who exploited human eye-fixation information for image captioning by including normalized fixation histograms over the image as an input to the soft-attention module of [xu2015show] and weighting the attended image regions based on whether they are fixated or not. Subsequent works along this line [tavakoli2017paying, ramanishka2017top, cornia2018paying] used predicted saliency information in place of eye-fixation information.
2.3 Attention Over Visual Regions
Although the intuition of using saliency stems from neuroscience, the same discipline suggests that our brain constantly integrates a top-down reasoning process with a bottom-up flow of visual input signals. The top-down path consists of predicting the upcoming sensory input by leveraging our knowledge and inductive bias. On the other hand, the bottom-up flow constantly provides visual stimuli adjusting the previous predictions, passing from input signals to their interpretation. The captioning models mentioned so far exploit an attention mechanism that can be thought of as a top-down system. In this mechanism, the language model predicts the next word based on its learned assumptions while attending to a feature grid, whose geometry is independent of the image content.
Bottom-up and top-down attention. The solution proposed by Anderson et al. [anderson2018bottom] entails integrating an additional bottom-up path, defined by an object detector in charge of proposing image regions, coupled with the top-down mechanism that learns to weigh each region for each word prediction (see Fig. 2). In this approach, Faster R-CNN [ren2015faster, ren2017faster] is adopted to detect objects in two stages: the first, called Region Proposal Network, produces object proposals by sliding over intermediate features of a CNN; the second pools each region of interest to extract a feature vector for each proposal. One of the key elements of this approach resides in its pre-training strategy, where an auxiliary training loss is added for learning to predict attribute classes alongside object classes on the Visual Genome [krishnavisualgenome] dataset. This allows the model to predict a dense and rich set of detections, including both salient objects and contextual regions, and favors the learning of better feature representations.
Other approaches. Employing pooled vectors from image regions has demonstrated its advantages when dealing with the raw visual input and has been the de facto standard in image captioning for years. As a result, many of the following works have based the visual encoding phase on this strategy [ke2019reflective, qin2019look, huang2019adaptively, wang2020show]. Among them, we point out two remarkable variants.
Visual Policy – While typical visual attention points to a single image region at every step, the approach proposed by Zha et al. [zha2019context] introduces a sub-policy network that interprets also the visual part sequentially by encoding historical visual actions (e.g. previously attended regions) via an LSTM to serve as context for the next visual action.
Geometric Transforms – Pedersoli et al. [pedersoli2017areas] proposed to use spatial transformers for generating image-specific attention areas by regressing region proposals in a weakly-supervised fashion (relying on the captioning training loss only). Specifically, a localization network learns an affine transformation for each location of the feature map, and then a bilinear interpolation is used to regress a feature vector for each region with respect to anchor boxes.
2.4 Graph-based Encoding
To further improve the encoding of image regions and their relationships, some studies consider using graphs built over image regions (see Fig. 3) to enrich the representation by including semantic and spatial connections.
Spatial and semantic graphs. The first attempt in this sense is due to Yao et al. [yao2018exploring], followed by Guo et al. [guo2019aligning], who proposed the use of a graph convolutional network (GCN) [kipf2016semi] to integrate both semantic and spatial relationships between objects. The semantic relationships graph is obtained by applying a classifier pre-trained on Visual Genome [krishnavisualgenome] that predicts an action or an interaction between object pairs. The spatial relationships graph is instead inferred through geometry measures (i.e. intersection over union, relative distance, and angle) between bounding boxes of object pairs.
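As an illustration of the geometry measures mentioned above, the sketch below computes intersection over union, center distance, and angle for a pair of bounding boxes. The `(x1, y1, x2, y2)` box format, the use of box centers, and the function names are assumptions made for this example; the cited works may normalize or discretize these quantities differently before building graph edges.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center(box):
    """Center point of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def spatial_relation(a, b):
    """Geometry measures from box `a` to box `b`.

    The angle is measured in degrees in image coordinates (y grows downwards).
    """
    (ax, ay), (bx, by) = center(a), center(b)
    dx, dy = bx - ax, by - ay
    return {
        "iou": iou(a, b),
        "distance": math.hypot(dx, dy),
        "angle": math.degrees(math.atan2(dy, dx)),
    }


# Two 2 x 2 boxes that overlap on half of their area, side by side.
rel = spatial_relation((0, 0, 2, 2), (1, 0, 3, 2))
```

A graph builder could then threshold or bin these values to assign a spatial relation label to each edge between object pairs.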
Scene graphs. With a focus on modeling semantic relations, Yang et al. [yang2019auto] proposed to integrate semantic priors learned from text in the image encoding by exploiting a graph-based representation of both images and sentences. The representation used is the scene graph, i.e. a directed graph connecting the objects, their attributes, and their mutual relations. Along the same line, Shi et al. [shi2020improving] represented the image as a semantic relationship graph but proposed to train the module in charge of predicting the predicate nodes directly on the ground-truth captions rather than on external datasets [krishnavisualgenome]. The obtained graph is then fed to a GCN for encoding and is also exploited at the decoding stage.
Hierarchical trees. As a special case of a graph-based encoding, Yao et al. [yao2019hierarchy] employed a tree to represent the image as a hierarchical structure. The root represents the image as a whole, intermediate nodes represent image regions and their contained sub-regions, and the leaves represent segmented objects in the regions. The image encoding is then obtained by feeding the image tree to a TreeLSTM [tai2015improved].
Graph encodings brought a mechanism to leverage relationships between detected objects, which allows the exchange of information in adjacent nodes and thus in a local manner. Further, it seamlessly allows the integration of external semantic information. On the other hand, manually building the graph structure can limit the interactions between visual features. This is where self-attention proved to be more successful by connecting all the elements with each other in a complete graph representation.
2.5 Self-Attention Encoding
Self-attention is an attentive mechanism where each element of a set is connected with all the others, and that can be adopted to compute a refined representation of the same set of elements through residual connections (Fig. 3). It was first introduced in 2017 by Vaswani et al. [vaswani2017attention] for machine translation and language understanding tasks, giving birth to the Transformer architecture and its subsequent variants, which have dominated the NLP field and later also Computer Vision.
Definition of self-attention. Formally, self-attention makes use of the scaled dot-product mechanism, i.e. a multiplicative attention operator that handles three sets of vectors: a set of $n_q$ query vectors $Q$, a set of key vectors $K$, and a set of value vectors $V$, both containing $n_k$ elements. The operator takes a weighted sum of value vectors according to a similarity distribution between query and key vectors, i.e.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$
where $d_k$ is a scaling factor. In the case of self-attention, the three sets of vectors are obtained as linear projections of the same input set of elements. The success of the Transformer demonstrates that leveraging self-attention allows achieving superior performances compared to attentive RNNs.
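The scaled dot-product operator and its self-attentive use can be sketched as follows. This is a single-head illustration in plain Python under simplifying assumptions: the projection matrices `Wq`, `Wk`, and `Wv` are hypothetical names, identity projections are used in the toy example, and multi-head splitting, residual connections, and layer normalization are omitted.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """For each query, return a weighted sum of the value vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


def self_attention(X, Wq, Wk, Wv):
    """Queries, keys, and values are linear projections of the same set X."""
    return scaled_dot_product_attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))


# Three 2-d input elements; identity projections keep the example readable.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = self_attention(X, identity, identity, identity)
```

Each row of `out` is a convex combination of the input elements, so every element is refined using information from all the others.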
Early self-attention approaches. Among the first image captioning models leveraging this approach, Yang et al. [yang2019learning] employed a self-attentive module to encode relationships between features resulting from an object detector. Later, Li et al. [li2019entangled] proposed a Transformer model with a visual encoder for the region features coupled with a semantic encoder that exploits knowledge from an external tagger. Both encoders are based on self-attention and feed-forward layers. Their output is then fused in the decoder through a gating mechanism governing the propagation of visual and semantic information.
Variants of the self-attention operator. Other works proposed variants or modifications of the self-attention operator tailored for image captioning [herdade2019image, guo2020normalized, huang2019attention, pan2020x, cornia2020meshed].
Geometry-aware encoding – Herdade et al. [herdade2019image] introduced a modified version of self-attention that takes into account the spatial relationships between regions. In particular, an additional geometric weight is computed between object pairs and is used to scale the attention weights. On a similar line, Guo et al. [guo2020normalized] proposed a normalized and geometry-aware version of self-attention that makes use of the relative geometry relationships between input objects.
Attention on Attention – Huang et al. [huang2019attention] proposed an extension of the attention operator, named “Attention on Attention”, in which the final attended information is weighted by a gate guided by the context. Specifically, they concatenate the output of the self-attention with the queries, then compute an information vector and a gate vector that are finally multiplied together. In their visual encoder, they employ this mechanism in order to refine the visual features. This method is then adopted by later models such as [liu2020prophet].
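The gating step described above can be sketched as follows, assuming the attended vector has already been computed by an attention operator. The parameter names (`W_i`, `b_i`, `W_g`, `b_g`), the use of a sigmoid for the gate, and the toy weights are assumptions made for illustration and are not taken from the authors' code.

```python
import math


def linear(W, b, v):
    """Affine map W v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def attention_on_attention(query, attended, W_i, b_i, W_g, b_g):
    """Gate the attended information with a context-dependent gate.

    The attended vector and the query are concatenated; one linear map yields
    the information vector, another (followed by a sigmoid) yields the gate,
    and the two are multiplied element-wise.
    """
    joint = attended + query  # list concatenation
    info = linear(W_i, b_i, joint)
    gate = [sigmoid(z) for z in linear(W_g, b_g, joint)]
    return [g * i for g, i in zip(gate, info)]


# Toy parameters: the information map copies the attended vector, while the
# gate map is all zeros so that every gate value equals sigmoid(0) = 0.5.
query = [1.0, 0.0]
attended = [0.5, 0.5]
W_i = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
b_i = [0.0, 0.0]
W_g = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
b_g = [0.0, 0.0]
out = attention_on_attention(query, attended, W_i, b_i, W_g, b_g)
```

With learned parameters, the gate can suppress attended information that is irrelevant to the current query instead of passing it on unchanged.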
X-Linear Attention – Pan et al. [pan2020x] proposed to use bilinear pooling techniques to strengthen the representative capacity of the output attended feature. Notably, this mechanism encodes the region-level features with higher-order interaction, leading to a set of enhanced region-level and image-level features.
Memory-augmented Attention – Cornia et al. [cornia2020meshed, cornia2020smart] proposed a Transformer-based architecture where the self-attention operator of each encoder layer is augmented with a set of memory vectors. Specifically, the set of keys and values is extended with additional “slots” learned during training, which can encode multi-level visual relationships with a priori knowledge.
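A minimal sketch of the idea of extending keys and values with memory slots is given below. The names `mem_k` and `mem_v` are hypothetical, the slots are fixed toy values rather than learned parameters, and the projections and multi-head structure of a full Transformer layer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """For each query, return a weighted sum of the value vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


def memory_augmented_attention(Q, K, V, mem_k, mem_v):
    """Attention whose keys and values are extended with extra memory slots.

    The slots do not depend on the input, so they can store information that
    is not present in the current set of region features.
    """
    return scaled_dot_product_attention(Q, K + mem_k, V + mem_v)


# One query, one input key/value pair, and one memory slot.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0]]
V = [[1.0, 0.0]]
mem_k = [[1.0, 0.0]]  # in a trained model these would be learned
mem_v = [[0.0, 1.0]]
out = memory_augmented_attention(Q, K, V, mem_k, mem_v)
```

In this toy case the query matches the input key and the memory key equally, so the output mixes the input value and the memory value in equal parts.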
Other self-attention-based approaches. Ji et al. [Ji2020ImprovingIC] proposed to improve self-attention by adding to the sequence of feature vectors a global vector computed as their average. A global vector is computed for each layer, and the resulting global vectors are combined via an LSTM, thus obtaining an inter-layer representation. Luo et al. [luo2021dual] proposed a hybrid approach that combines region and grid features to exploit their complementary advantages. Two self-attention modules are applied independently to each kind of features, and a cross-attention module locally fuses their interactions. Finally, the approach proposed by Zhang et al. [zhang2021rstnet] completely disregards region features and applies self-attention directly to grid features, incorporating their relative geometry relationships into self-attention computation.
Vision Transformer. Transformer-like architectures can also be applied directly on image patches, thus excluding or limiting the usage of the convolutional operator [dosovitskiy2020image, touvron2020training] (Fig. 4). On this line, Liu et al. [liu2021cptr] devised the first convolution-free architecture for image captioning. Specifically, a pre-trained Vision Transformer network (i.e. ViT [dosovitskiy2020image]) is adopted as encoder, and a standard Transformer decoder is employed to generate captions.
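To illustrate what operating directly on image patches means, the sketch below splits an image into non-overlapping square patches and flattens each one into a vector. The nested-list image layout and the function name are assumptions for this example; in a Vision Transformer each flattened patch would additionally be linearly projected and combined with a positional embedding, which is omitted here.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image into flattened, non-overlapping patches.

    `image` is a nested list indexed as image[row][col][channel]. Patches are
    returned in row-major order, each as a flat list of length
    patch_size * patch_size * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for r in range(top, top + patch_size):
                for c in range(left, left + patch_size):
                    patch.extend(image[r][c])
            patches.append(patch)
    return patches


# A 4 x 4 single-channel image whose pixel value encodes its position.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
# The image becomes a sequence of 4 tokens, each with 4 values.
```

The resulting sequence of patch vectors plays the same role as the sequence of region or grid features in the encoders discussed earlier.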
Early fusion and vision-and-language pre-training. Other works using self-attention to encode visual features achieved remarkable performance also thanks to vision-and-language pre-training [tan2019lxmert, lu2019vilbert] and early-fusion strategies [li2020oscar, zhou2020unified]. For example, following the BERT architecture [devlin2018bert], Zhou et al. [zhou2020unified] combined encoder and decoder into a single stream of Transformer layers, where region and word tokens are early fused together into a unique flow. This unified model is first pre-trained on large amounts of image-caption pairs to perform both bidirectional and sequence-to-sequence prediction tasks and then fine-tuned for image captioning. On the same line, Li et al. [li2020oscar] proposed OSCAR, a BERT-like architecture that also includes object tags as anchor points in order to ease the semantic alignment between images and text. These tags are extracted from an object detector and concatenated with the image regions and word embeddings fed to the model. They also performed large-scale pre-training on millions of image-text pairs, with a masked token loss similar to the BERT masked language modeling loss and a contrastive loss for distinguishing aligned words-tags-regions triples from polluted ones. Moreover, Zhang et al. [zhang2021vinvl] proposed VinVL, built on top of OSCAR, introducing a new object detector capable of extracting better visual features and a modified version of the vision-and-language pre-training objectives. Specifically, the object detector presents minor changes with respect to Faster R-CNN and is pre-trained on a large corpus consisting of four public datasets. The vision-and-language pre-training objectives are the same masked token loss and a 3-way contrastive loss that takes into account two types of triples: words-tags-regions from captioning datasets and question-answer-regions from visual question answering datasets.
Global CNN features are a simple and compact way to encode the visual information but have been proven to be insufficient. Indeed, for image captioning, the information about the visual entities in the scene, detected in image regions, is essential. Almost all the surveyed approaches adopt the same model (i.e. Faster R-CNN trained on Visual Genome) as the object detector backbone for its remarkable performance. Nonetheless, since the set of objects the backbone can distinguish defines what can be described in an image, applying more and more general detectors will extend the application domain of image captioning approaches. Not only is distinguishing each visual entity important, but encoding their spatial and contextual relations has also been proven to boost captioning performance. In this sense, explicit graph-based representations, in conjunction with attentive mechanisms and, later, implicit self-attentive solutions, have had more success than global representations. This fact clearly suggests that a viable direction is designing visual encoders that model mutual relations between objects. Moreover, the success of BERT-like solutions performing image and text early-fusion indicates the suitability of visual representations that also integrate textual information.
3 Language Models
The primary goal of a language model is to predict the probability of a given sequence of words occurring in a sentence. As such, it represents a crucial component of many NLP tasks, as it gives a machine the ability to understand and deal with natural language as a stochastic process.
Formally, given a sequence of $n$ words $y_1, \ldots, y_n$, the language model component of an image captioning algorithm assigns a probability $P(y_1, y_2, \ldots, y_n \mid X)$ to the sequence as:
$$P(y_1, y_2, \ldots, y_n \mid X) = \prod_{i=1}^{n} P(y_i \mid y_1, y_2, \ldots, y_{i-1}, X),$$
where $X$ represents the visual encoding on which the language model is specifically conditioned. Notably, when predicting the next word given the previous ones, the language model is auto-regressive, which means that each predicted word is conditioned on the previous ones. The language model also decides when to stop generating caption words by outputting a special end-of-sequence token.
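The auto-regressive factorization can be illustrated with a small numerical sketch. The per-step probabilities below are hypothetical values standing in for the conditional probabilities a trained captioner would assign to each ground-truth word given the image and the previous words; the function name is likewise an assumption.

```python
import math


def sequence_log_prob(step_probs):
    """Log-probability of a sequence under the chain rule.

    `step_probs[i]` is the conditional probability of the i-th word given the
    visual encoding and all previous words. Summing logarithms is preferred
    to multiplying probabilities because it avoids numerical underflow.
    """
    return sum(math.log(p) for p in step_probs)


# Hypothetical conditionals for a four-token caption (last one is end-of-sequence).
step_probs = [0.5, 0.4, 0.9, 0.8]
log_p = sequence_log_prob(step_probs)
prob = math.exp(log_p)  # equals 0.5 * 0.4 * 0.9 * 0.8
nll = -log_p  # the quantity minimized by cross-entropy training
```

The negative of this log-probability is the per-caption cross-entropy loss, which is why the same factorization underlies both training and generation.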
The main language modeling strategies applied to image captioning can be categorized as: 1. LSTM-based approaches, which can be either single-layer or two-layer; 2. CNN-based methods that constitute a first attempt in surpassing the fully recurrent paradigm; 3. Transformer-based fully-attentive approaches; 4. image-text early-fusion (BERT-like) strategies that directly connect the visual and textual inputs. This taxonomy is visually summarized in Fig. 1.
3.1 LSTM-based Models
As language has a sequential structure, RNNs are naturally suited to deal with the generation of sentences. Among RNN variants, LSTM [hochreiter1997long] has been the predominant option for language modeling.
3.1.1 Single-layer LSTM
The simplest LSTM-based captioning architecture is based on a single-layer LSTM and was proposed by Vinyals et al. [vinyals2015show]. As shown in Fig. 5, the visual encoding is used as the initial hidden state of the LSTM, which then generates the output caption. At each time step, a word is predicted by applying a softmax activation function over the projection of the hidden state into a vector of the same size as the vocabulary. During training, input words are taken from the ground-truth sentence, while during inference, input words are those generated at the previous step.
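The difference between training and inference inputs can be sketched with a toy decoder. Here `toy_step` is a hypothetical stand-in for one LSTM step followed by the vocabulary projection, the five-word vocabulary is invented for the example, and `init_hidden` stands in for the visual encoding; greedy selection is used at inference for simplicity, although other decoding strategies exist.

```python
import math

VOCAB = ["<bos>", "a", "dog", "runs", "<eos>"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_step(prev_word, hidden):
    """Hypothetical decoder step returning (logits over VOCAB, new hidden).

    It simply favors the word that follows `prev_word` in VOCAB.
    """
    nxt = min(VOCAB.index(prev_word) + 1, len(VOCAB) - 1)
    logits = [5.0 if i == nxt else 0.0 for i in range(len(VOCAB))]
    return logits, hidden + 1


def teacher_forced_nll(step_fn, init_hidden, target):
    """Training-time loss: each input is the ground-truth previous word."""
    prev, hidden, nll = "<bos>", init_hidden, 0.0
    for word in target + ["<eos>"]:
        logits, hidden = step_fn(prev, hidden)
        probs = softmax(logits)
        nll -= math.log(probs[VOCAB.index(word)])
        prev = word  # ground truth, not the model prediction
    return nll


def greedy_decode(step_fn, init_hidden, max_len=10):
    """Inference: each input is the word generated at the previous step."""
    words, prev, hidden = [], "<bos>", init_hidden
    for _ in range(max_len):
        logits, hidden = step_fn(prev, hidden)
        probs = softmax(logits)
        prev = VOCAB[probs.index(max(probs))]
        if prev == "<eos>":  # the model decides when to stop
            break
        words.append(prev)
    return words


caption = greedy_decode(toy_step, init_hidden=0)
loss = teacher_forced_nll(toy_step, 0, ["a", "dog", "runs"])
```

Because inference feeds the model its own predictions, early mistakes can propagate to later steps, which motivates several of the decoder variants surveyed below.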
Shortly after, with the seminal work “Show, Attend and Tell”, Xu et al. [xu2015show] introduced the additive attention mechanism, a dynamic and time-varying representation of the image that replaced the static global vector and improved the alignment between words and visual content. As depicted in Fig. 5, in this case, the previous hidden state guides the attention mechanism over the visual features $X$, computing a context vector which is then fed to the MLP in charge of predicting the output word.
Other approaches. Many subsequent works have adopted a decoder based on a single-layer LSTM, mostly without any architectural changes [yang2016review, chen2017sca, pedersoli2017areas], while others have proposed significant modifications, summarized below.
Visual sentinel – Lu et al. [lu2017knowing] augmented the spatial image features with an additional learnable vector, called visual sentinel, which can be attended by the decoder in place of visual features while generating “non-visual” words (e.g. “the”, “of”, and “on”), for which visual features are not needed (Fig. 5). At each time step, the visual sentinel is computed from the previous hidden state and generated word. Then, the model generates a context vector as a combination of attended image features and visual sentinel, whose importance is weighted by a learnable gate. Many subsequent works [lu2018neural, cornia2019show] confirmed the utility of this additional vector.
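The combination of attended features and visual sentinel can be sketched as a convex mixture controlled by a scalar gate. The function name and the choice of passing the gate value `beta` directly are assumptions for this illustration; in the cited model the gate is produced by the network at each time step rather than supplied by hand.

```python
def adaptive_context(attended, sentinel, beta):
    """Mix attended image features with the visual sentinel.

    `beta` in [0, 1] is the gate: values close to 1 rely on the sentinel
    (useful for non-visual words), values close to 0 rely on the image.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * s + (1.0 - beta) * a for a, s in zip(attended, sentinel)]


attended = [1.0, 0.0]  # toy attended image feature
sentinel = [0.0, 1.0]  # toy visual sentinel
visual_word = adaptive_context(attended, sentinel, beta=0.0)
function_word = adaptive_context(attended, sentinel, beta=1.0)
mixed = adaptive_context(attended, sentinel, beta=0.25)
```

When generating a word such as “of”, a high gate value lets the decoder ignore the image almost entirely, while content words keep the gate low.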
Hidden state reconstruction – Chen et al. [chen2018regularizing] proposed to regularize the transition dynamics of the language model by using a second LSTM for reconstructing the previous hidden state based on the current one. Ge et al. [ge2019exploring] proposed to better capture context information by using a bidirectional LSTM with an auxiliary module. The auxiliary module in a direction approximates the hidden state of the LSTM in the other direction. Finally, a cross-modal attention mechanism combines grid visual features with the two sentences from the bidirectional LSTM to obtain the final caption.
Multi-stage generation – Wang et al. [wang2017skeleton] proposed to generate a caption from coarse central aspects to finer attributes by decomposing the caption generation process into two phases: skeleton sentence generation and attributes enriching, both implemented with single-layer LSTMs. On the same line, Gu et al. [gu2018stack] devised a coarse-to-fine multi-stage framework using a sequence of LSTM decoders, each operating on the output of the previous one to produce increasingly refined captions.
3.1.2 Two-layer LSTM
LSTMs can be expanded to multi-layer structures to augment their capability of capturing higher-order relations. Donahue et al. [donahue2015long] first proposed a two-layer LSTM as a language model for captioning, stacking two layers, where the hidden states of the first are the input to the second.
Two layers and additive attention. Anderson et al. [anderson2018bottom] went further and proposed to specialize the two layers to perform visual attention and the actual language modeling. As shown in Fig. 5, the first LSTM layer acts as a top-down visual attention model which takes the previously generated word, the previous hidden state, and the mean-pooled image features. Then, the current hidden state is used to compute a probability distribution over image regions with an additive attention mechanism. The so-obtained attended image feature vector is fed to the second LSTM layer, which combines it with the hidden state of the first layer to generate a probability distribution over the vocabulary.
Variants of two-layer LSTM. Because of their representation power, LSTMs with two layers and internal attention mechanisms represented the most employed language model approach before the advent of Transformer-based architectures [yao2018exploring, yang2019auto, yao2019hierarchy, shi2020improving]. As such, many other variants have been proposed to improve the performance of this approach.
Neural Baby Talk – To ground words into image regions, Lu et al. [lu2018neural] incorporated a pointing network that modulates the content-based attention mechanism. In particular, during the generation process, the network predicts slots in the caption, which are then filled with the image region classes. For non-visual words, a visual sentinel is used as dummy grounding. This approach leverages the object detector both as a feature region extractor and as a visual word prompter for the language model.
Reflective attention – Ke et al. [ke2019reflective] introduced two reflective modules: the first computes the relevance between hidden states from all the past predicted words and the current one, thus modeling longer dependencies and fostering historical coherence. The second improves the syntactic structure of the sentence by guiding the generation process with information on the common positions of words (e.g. subjects are usually at the beginning, while predicates are in the middle).
Look back and predict forward – On a similar line, Qin et al. [qin2019look] used two modules: the look back module that takes into account the previous attended vector to compute the next one, and the predict forward module that predicts the next two words at once, thus alleviating the accumulated errors problem that may occur at inference time.
Adaptive attention time – Huang et al. [huang2019adaptively] proposed an adaptive attention time mechanism, in which the decoder can take an arbitrary number of attention steps for each generated word, determined by a confidence network on top of the second-layer LSTM.
Recall mechanisms – Wang et al. [wang2020show] introduced a recall mechanism modeled with a text retrieval system, which provides the model with useful words for each image. An auxiliary word distribution is obtained from the recalled words and used as a semantic guide.
3.1.3 Boosting LSTM with Self-Attention
Some works adopted the self-attention operator in place of the additive attention one in LSTM-based language models [huang2019attention, pan2020x, liu2020prophet, zhu2020autocaption]. In particular, Huang et al. [huang2019attention] augmented the LSTM with the Attention on Attention operator, which computes another step of attention on top of visual self-attention. Pan et al. [pan2020x] introduced the X-Linear attention block, which enhances self-attention with second-order interactions and improves both the visual encoding and the language model.
3.1.4 Neural Architecture Search for RNN
Zhu et al. [zhu2020autocaption] applied the neural architecture search paradigm to select the connections between layers and the operations within gates of RNN-based image captioning language models. To evaluate their method, they considered the decoder of the X-LAN architecture [pan2020x], which includes a variant of the self-attention operator.
3.2 Convolutional Language Models
An approach worth mentioning is that proposed by Aneja et al. [aneja2018convolutional], which uses convolutions as a language model. In particular, a global image feature vector is combined with word embeddings and fed to a CNN, operating on all words in parallel during training and sequentially at inference time. Convolutions are right-masked to prevent the model from using the information of future word tokens. Despite the clear advantage of parallel training, the convolutional operator has not gained popularity in language models due to its poor performance and the advent of Transformer architectures.
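The right-masking idea can be illustrated with a minimal sketch. It is not the implementation of [aneja2018convolutional]: the function name `causal_conv1d` is hypothetical, and each word is reduced to a single scalar instead of an embedding vector to keep the example short.

```python
def causal_conv1d(sequence, kernel):
    """Right-masked 1-D convolution: output t depends only on inputs at positions <= t.

    Assumption for illustration: each word is a single float rather than
    an embedding vector, and there is a single filter.
    """
    k = len(kernel)
    # Left-pad with zeros so that the window ending at position t
    # never covers positions after t.
    padded = [0.0] * (k - 1) + list(sequence)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(sequence))
    ]


words = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 1.0]  # the last weight multiplies the current word
outputs = causal_conv1d(words, kernel)

# Changing a future word leaves the earlier outputs untouched.
changed = causal_conv1d([1.0, 2.0, 3.0, 100.0], kernel)
print(outputs[:3] == changed[:3])
```

Because all windows are computed independently, every position can be processed in parallel during training, while at inference time the sequence grows one word at a time.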
3.3 Transformer-based Architectures
The fully-attentive paradigm proposed by Vaswani et al. [vaswani2017attention] in the seminal paper “Attention is all you need” has completely changed the perspective of language generation. Shortly after, the Transformer model became the building block of other breakthroughs in NLP, such as BERT [devlin2018bert] and GPT [radford2018improving], and the de facto standard architecture for many language understanding tasks. As image captioning can be cast as a set-to-sequence problem when using image regions, the Transformer architecture has also been employed for this task. The standard Transformer decoder performs a masked self-attention operation, which is applied to words, followed by a cross-attention operation, where words act as queries and the outputs of the last encoder layer act as keys and values, plus a final feed-forward network (Fig. 6). During training, a masking mechanism restricts each word to attend only to the previous words, thus constraining a unidirectional generation process. The original Transformer architecture has been employed in some image captioning models without significant architectural modifications [herdade2019image, guo2020normalized, luo2021dual]. In addition, some variants have been proposed to improve language generation and visual feature encoding.
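As a rough illustration of the masked self-attention step, the sketch below computes single-head scaled dot-product attention in which position t only attends to positions up to t. It is a simplification under stated assumptions: the learned query, key, and value projections, the multiple heads, and the cross-attention over encoder outputs are all omitted, and the function name `causal_self_attention` is hypothetical.

```python
import math


def causal_self_attention(x):
    """Single-head masked self-attention over a list of T word vectors.

    Assumption for illustration: queries, keys, and values are the raw
    input vectors (no learned projections).
    """
    d = len(x[0])
    outputs = []
    for t in range(len(x)):
        # Scores are computed against positions 0..t only: this is the mask.
        scores = [
            sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        weights = [math.exp(score - top) for score in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the visible value vectors.
        outputs.append(
            [sum(w * x[s][i] for s, w in enumerate(weights)) for i in range(d)]
        )
    return outputs


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(embeddings)
# The first word can only attend to itself, so it is returned unchanged.
print(attended[0])
```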
Gating mechanisms. Li et al. [li2019entangled] proposed a gating mechanism for the cross-attention operator, which controls the flow of visual and semantic information by combining and modulating image regions representations with semantic attributes coming from an external tagger. On the same line, Ji et al. [Ji2020ImprovingIC] integrated a context gating mechanism to modulate the influence of the global image representation on each generated word, modeled via multi-head attention. Cornia et al. [cornia2020meshed] proposed to take into account all encoding layers in place of performing cross-attention only on the last one. To this end, they devised the meshed decoder, which contains a mesh operator that modulates the contribution of all the encoding layers independently and a gate that weights these contributions guided by the text query.
3.4 BERT-like Architectures
Although the encoder-decoder paradigm is a common approach to image captioning, some works have revisited captioning architectures to exploit a BERT-like [devlin2018bert] structure in which the visual and textual modalities are fused together in the early stages (Fig. 7). When employed as a language model for captioning, the main advantage of this architecture is that layers dealing with text can be initialized with pre-trained parameters learned from massive textual corpora. Therefore, the BERT paradigm has been adopted mainly in works that exploit pre-training [li2020oscar, zhou2020unified, zhang2021vinvl].
In this section, we refer to a BERT-like approach when the overall architecture does not have a clear distinction between the encoding and decoding phases and when inputs coming from both modalities are processed together in a single stream made of Transformer layers.
The first example is due to Zhou et al. [zhou2020unified], who developed a unified model that fuses visual and textual modalities into a BERT-like architecture for image captioning. The model consists of a shared multi-layer Transformer encoder network for both encoding and decoding, pre-trained on a large corpus of image-caption pairs and then fine-tuned for image captioning by right-masking the token sequence to simulate the unidirectional generation process. Further, Li et al. [li2020oscar] introduced the usage of object tags detected in the image as anchor points for learning a better alignment in vision-and-language joint representations. To this end, their model represents an input image-text pair as a word tokens-object tags-region features triple, where the object tags are the textual classes proposed by the object detector.
3.5 Non-autoregressive Language Models
Thanks to the parallelism offered by Transformers, non-autoregressive language models have been proposed in machine translation to reduce the inference time by generating all words in parallel. Some efforts have been made to apply this paradigm to image captioning [gao2019masked, fei2019fast, guo2020non, fei2020iterative, guo2021fast]. The first approaches towards a non-autoregressive generation were composed of a number of different generation stages, where all words were predicted in parallel and refined at each stage. Gao et al. [gao2019masked] adopted a multi-stage masking procedure where mask tokens are given as inputs to the decoder, with different masking ratios for each stage. Similarly, Fei et al. [fei2020iterative] proposed to iteratively refine the captions with an additional length predictor module which predicts the total number of words and adjusts the final length of the generated sequence.
Another line of work involves using reinforcement learning, with significant performance improvements. These approaches treat the generation process as a cooperative multi-agent reinforcement system, where the positions of the words in the target sequence are viewed as agents that learn to cooperatively maximize a sentence-level reward [guo2020non, guo2021fast]. These works also leverage knowledge distillation on unlabeled data and a post-processing step to remove identical consecutive tokens.
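The post-processing step that removes identical consecutive tokens can be sketched in a few lines. This is only an illustration of the idea, not the code of [guo2020non, guo2021fast]; the function name is hypothetical.

```python
from itertools import groupby


def remove_consecutive_duplicates(tokens):
    """Collapse runs of identical adjacent tokens into a single token."""
    # groupby yields one (key, run) pair per run of equal adjacent items.
    return [token for token, _ in groupby(tokens)]


caption = ["a", "a", "dog", "dog", "runs", "on", "on", "a", "beach"]
print(remove_consecutive_duplicates(caption))
```

Only adjacent repetitions are collapsed, so a word that legitimately occurs twice in different positions (such as the article "a" above) is preserved.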
Recurrent models based on LSTM have been the standard for many years. Their application to language modeling led to the development of clever and successful ideas that can also be integrated into non-recurrent solutions, for example, the generation of increasingly detailed captions by successive refinements, the grounding of generated words, the reconstruction of the internal state as a regularization strategy, and the neural architecture search strategy applied to language models. The main disadvantage of recurrent models is that they are slow to train and struggle to maintain long-term dependencies between the generated words. These drawbacks are alleviated by autoregressive Transformer-based solutions, which gained popularity on many natural language generation tasks, including image captioning. Inspired by the success of pre-training on large, unsupervised corpora for NLP tasks, massive pre-training has also been applied to image captioning by employing BERT-like architectures. This strategy led to impressive performance, suggesting that visual and textual semantic relations can also be inferred and learned from data that are not well curated. BERT-like architectures are suitable for such massive pre-training but are not generative architectures by design. This makes their success in image captioning more attributable to the pre-training than to the architecture design and suggests that massive pre-training on generative-oriented architectures different from BERT-like ones would be a direction worth exploring.
4 Training Strategies
An image captioning model is commonly expected to generate a caption word by word by taking into account the previous words and the image. At each step, the output word is sampled from a learned distribution over the vocabulary words. In the simplest scenario, i.e. the greedy decoding mechanism, the word with the highest probability is output. The main drawback of this setting is that possible prediction errors quickly accumulate along the way. To alleviate this drawback, one effective strategy is to use the beam search algorithm [koehn2009statistical] that, instead of outputting the word with maximum probability at each time step, maintains a fixed number of sequence candidates (those with the highest probability at each step) and finally outputs the most probable one.
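The difference between greedy decoding and beam search can be seen on a toy example. The sketch below is a simplified beam search under stated assumptions: `toy_step` is a hypothetical stand-in for a captioning model that returns a next-word distribution given the words generated so far, and no length normalization is applied.

```python
import math


def beam_search(step_fn, bos, eos, beam_size=3, max_len=10):
    """Return the most probable token sequence found with a beam of given size."""
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [word], score + math.log(prob)))
        # Keep only the beam_size most probable extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that reached max_len without eos
    return max(finished, key=lambda c: c[1])[0]


def toy_step(tokens):
    """Hypothetical next-word distribution, conditioned on the last word only."""
    table = {
        "<bos>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.4, "cat": 0.3, "bird": 0.3},
        "the": {"cat": 0.9, "dog": 0.1},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})


# beam_size=1 is greedy decoding: it commits to "a" and ends with probability 0.24.
print(beam_search(toy_step, "<bos>", "<eos>", beam_size=1))
# A beam of two keeps "the" alive and finds "the cat" with probability 0.36.
print(beam_search(toy_step, "<bos>", "<eos>", beam_size=2))
```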
During training, the captioning model must learn to properly predict the probabilities of the words to appear in the caption. To this end, the most common training strategies are based on 1. cross-entropy loss; 2. masked language model strategy; 3. reinforcement learning that allows directly optimizing for captioning-specific non-differentiable metrics; 4. vision-and-language pre-training objectives (see Fig. 1).
4.1 Cross-Entropy Loss
The cross-entropy loss is the first proposed and most used objective for image captioning models. With this loss, the goal of the training, at each timestep, is to minimize the negative log-likelihood of the current word given the previous ground-truth words. Given a sequence of target words $y_{1:T}$, the loss is formally defined as:
$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log P\left(y_t \mid y_{1:t-1}, X\right),$$
where $P$ is the probability distribution induced by the language model, $y_t$ the ground-truth word at time $t$, $y_{1:t-1}$ indicates the previous ground-truth words, and $X$ the visual encoding. The cross-entropy loss is designed to operate at word level and optimize the probability of each word in the ground-truth sequence without considering longer-range dependencies between generated words. The traditional training setting with cross-entropy also suffers from the exposure bias problem [ranzato2015sequence], caused by the discrepancy between the training setting, in which the model is conditioned on ground-truth words, and the test setting, in which it is conditioned on its own predicted words.
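As a small numerical illustration of this objective, the sketch below sums the negative log-probabilities that a model assigns to the ground-truth words. The per-step distributions are made-up numbers standing in for the output of a language model conditioned on the image and on the previous ground-truth words; the function name is hypothetical.

```python
import math


def cross_entropy_loss(step_distributions, target_words):
    """Negative log-likelihood of the target words under per-step distributions.

    step_distributions[t] maps each vocabulary word to the probability the
    model assigns to it at step t (teacher forcing is assumed).
    """
    return -sum(
        math.log(distribution[word])
        for distribution, word in zip(step_distributions, target_words)
    )


distributions = [
    {"a": 0.5, "the": 0.5},
    {"cat": 0.25, "dog": 0.75},
]
loss = cross_entropy_loss(distributions, ["a", "cat"])
print(loss)  # equals -(log 0.5 + log 0.25) = log 8
```

Each term depends only on the probability of one word, which is why the objective does not directly account for sentence-level properties of the generated caption.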
4.2 Masked Language Model (MLM)
The first masked language model was proposed for training the BERT [devlin2018bert] architecture, with the aim of learning a bidirectional representation for language. The main idea behind this optimization function consists in randomly masking out a small subset of the input token sequence and training the model to predict the masked tokens while relying on the rest of the sequence, i.e. both previous and subsequent tokens. As a consequence, the model learns to employ contextual information to infer missing tokens, which allows building a robust sentence representation where the context plays an essential role. Since this strategy considers only the prediction of the masked tokens and ignores the prediction of the non-masked ones, training with it is much slower than training for complete left-to-right or right-to-left generation. Notably, some works have employed this strategy as a pre-training objective, sometimes completely avoiding the combination with the cross-entropy [li2020oscar, zhang2021vinvl].
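The masking step can be sketched as follows. This is a simplified illustration: the function name `mask_tokens` and its parameters are hypothetical, every selected position is replaced by the mask token, and the refinements of the BERT recipe (such as occasionally keeping or randomly replacing the selected token) are left out.

```python
import random


def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask a subset of tokens; return the masked sequence and the targets.

    The returned targets map each masked position to its original token,
    which is what the model is trained to predict.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = tokens[position]
        masked[position] = mask_token
    return masked, targets


caption = "a dog runs on the green grass today".split()
masked_caption, targets = mask_tokens(caption, mask_ratio=0.25)
print(masked_caption)
print(targets)
```

Only the positions listed in `targets` contribute to the loss, which illustrates why each training pass provides a learning signal for a small fraction of the tokens.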
4.3 Reinforcement Learning
Given the limitations of word-level training strategies, a significant improvement was achieved by applying the reinforcement learning paradigm to the training of image captioning models. Within this framework, the image captioning model is considered as an agent whose parameters determine a policy. At each time step, the agent executes the policy to choose an action, i.e. the prediction of the next word in the generated sentence. Once the end-of-sequence is reached, the agent receives a reward depending on the generated sentence. The aim of the training is to optimize the agent parameters to maximize the expected reward. Many works harnessed this paradigm and explored different sequence-level metrics as rewards. The first proposal is due to Ranzato et al. [ranzato2015sequence], who introduced the usage of the REINFORCE algorithm [williams1992simple, zaremba2015reinforcement] adopting BLEU [papineni2002bleu] and ROUGE [lin2004rouge] as reward signals. Ren et al. [ren2017deep] experimented with visual-semantic embeddings obtained from a network that encodes the image and the caption generated so far, in order to compute a similarity score to be used as reward. Liu et al. [liu2017improved] proposed to use as reward a linear combination of the SPICE [spice2016] and CIDEr [vedantam2015cider] metrics, called SPIDEr. Finally, the most widely adopted reinforcement learning-based strategy [zhang2017actor, gao2019self, cornia2020meshed], introduced by Rennie et al. [rennie2017self], entails using the CIDEr score as reward, as it correlates well with human judgment [vedantam2015cider]. The reward is normalized with respect to a baseline value to reduce the reward variance. Formally, to compute the loss gradient, beam search and greedy decoding are leveraged as follows:
$$\nabla_\theta L(\theta) = -\frac{1}{k} \sum_{i=1}^{k} \left( r(w^i) - b \right) \nabla_\theta \log P(w^i),$$
where $w^i$ is the $i$-th sentence in the beam or in a sampled collection of $k$ sentences, $r(\cdot)$ is the reward function, i.e. the CIDEr computation, and $b$ is the baseline, computed as the reward of the sentence obtained via greedy decoding [rennie2017self], or as the average reward of the beam candidates [cornia2020meshed].
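A minimal numerical sketch of the baseline-normalized objective is given below. It computes the scalar surrogate whose gradient with respect to the model parameters, in an automatic differentiation framework, would weight each sentence log-probability by its reward minus the baseline. The rewards and log-probabilities are made-up values, and the function name `scst_loss` is hypothetical.

```python
def scst_loss(log_probs, rewards, baseline=None):
    """Surrogate loss: mean over sentences of -(reward - baseline) * log-probability.

    log_probs[i] is the log-probability of the i-th sampled or beam sentence,
    rewards[i] its sequence-level reward (e.g. a CIDEr score).
    If no baseline is given, the mean reward of the candidates is used;
    passing the reward of the greedy-decoded sentence is the other option
    mentioned in the text.
    """
    k = len(rewards)
    if baseline is None:
        baseline = sum(rewards) / k
    return -sum((r - baseline) * lp for r, lp in zip(rewards, log_probs)) / k


log_probs = [-1.0, -2.0, -3.0]
rewards = [1.0, 0.5, 0.0]

# Mean-reward baseline: sentences above the mean are pushed up, those below are pushed down.
print(scst_loss(log_probs, rewards))
# Explicit baseline, standing in for the reward of a greedy-decoded sentence.
print(scst_loss(log_probs, rewards, baseline=1.0))
```

Sentences whose reward equals the baseline contribute nothing, so the update concentrates on candidates that are clearly better or worse than the reference behavior.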
Note that, since it would be difficult for a random policy to improve in an acceptable amount of time, the usual procedure entails pre-training with cross-entropy or masked language modeling first, and then fine-tuning with reinforcement learning by employing a sequence-level metric as reward. This ensures that the initial reinforcement learning policy is more suitable than a random one.
4.4 Vision-and-Language Pre-Training
In the context of vision-and-language pre-training, one of the most common pre-training objectives is the masked contextual token loss, where tokens of each modality (visual and textual) are randomly masked following the BERT strategy [devlin2018bert], and the model has to predict the masked input based on the context of both modalities, thus connecting their joint representation. Another largely adopted strategy entails using a contrastive loss, where the inputs are organized as image regions-captions words-object tags triples, and the model is asked to discriminate correct triples from polluted ones, in which tags are randomly replaced [li2020oscar, zhang2021vinvl]. Other objectives take into account the text-image alignment at a word-region level and entail predicting the original word sequence given a corrupted one [xia2020xgpt].
5 Evaluation Protocol
As for any data-driven task, the development of image captioning has been enabled by the collection of large datasets and the definition of quantitative scores to evaluate the performance and monitor the advancement of the field.
Image captioning datasets contain images and one or multiple captions associated with them. Having multiple ground-truth captions for each image helps to capture the variability of human descriptions. Besides the number of available captions, their characteristics (e.g. average caption length and vocabulary size) also strongly influence the design and the performance of image captioning algorithms. Note that the distribution of the terms in the dataset captions is usually long-tailed; thus, the common practice is to include in the vocabulary only those terms whose frequency is above a pre-defined threshold. The threshold must be chosen as a trade-off between numerical tractability and the capability to mimic the lexical richness and diversity of human descriptions. The available datasets differ both in the images contained (for their domain and visual quality) and in the captions associated with the images (for their length, number, relevance, and style). A summary of the most used public datasets is reported in Table I, and some sample image-caption pairs are reported in Fig. 8, along with some word clouds obtained from the 50 most used visual words in the captions.
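The frequency-threshold practice for building the vocabulary can be sketched as follows. The function name, the whitespace tokenization, and the `<unk>` placeholder for discarded terms are assumptions made for illustration; actual pipelines differ in tokenization and in the chosen threshold.

```python
from collections import Counter


def build_vocabulary(captions, min_count=5, unk="<unk>"):
    """Keep only the terms that occur at least min_count times in the captions."""
    counts = Counter(
        word for caption in captions for word in caption.lower().split()
    )
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    # Rare terms are all mapped to a single placeholder token.
    return [unk] + kept


captions = ["a dog runs", "a cat sits", "a dog sleeps"]
print(build_vocabulary(captions, min_count=2))
```

Raising `min_count` shrinks the output layer of the language model at the cost of mapping more of the long tail to the placeholder token.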
| Dataset | Domain | Nb. Images | Nb. Caps (per Image) | Vocab Size | Nb. Words (per Cap.) |
|---|---|---|---|---|---|
| Flickr30K [young2014image] | Generic | | | | |
| Flickr8K [hodosh2013framing] | Generic | | | | |
| CC3M [sharma2018conceptual] | Generic | | | | |
| CC12M [changpinyo2021conceptual] | Generic | | | | |
| SBU Captions [ordonez2011im2text] | Generic | | | | |
| VizWiz [gurari2020captioning] | Assistive | | | | |
| CUB-200 [reed2016learning] | Birds | | | | |
| Oxford-102 [reed2016learning] | Flowers | | | | |
| Fashion Cap. [yang2020fashion] | Fashion | | | | |
| BreakingNews [ramisa2017breakingnews] | News | | | | |
| GoodNews [biten2019good] | News | | | | |
| TextCaps [sidorov2020textcaps] | OCR | | | | |
| Loc. Narratives [pont2020connecting] | Generic | | | | |
5.1.1 Standard captioning datasets
Standard benchmark datasets are used by the community to compare their approaches on a common test-bed. As a matter of fact, these comparisons guide the development of image captioning strategies by helping to identify promising directions. Therefore, datasets used as benchmarks should be representative of the task at hand, both in terms of the challenges it poses and of the ideal expected results (i.e. achievable human performance). In this sense, benchmark datasets should contain a large number of generic-domain images, each associated with multiple captions.
Early image captioning architectures [mao2015deep, donahue2015long, karpathy2015deep] were commonly trained and tested on the Flickr30K [young2014image] and Flickr8K [hodosh2013framing] datasets, consisting of pictures collected from the Flickr website, containing everyday activities, events, and scenes, paired with five captions each. Currently, the most commonly used dataset for image captioning is Microsoft COCO [lin2014microsoft], which consists of images of complex scenes with people, animals, and common everyday objects in their natural context. It contains more than 120,000 images, each of them annotated with five different captions, divided into 82,783 images for training and 40,504 for validation. For ease of evaluation, most of the literature follows the splits defined by Karpathy et al. [karpathy2015deep], where 5,000 images of the original validation set are used for validation, 5,000 for test, and the rest for training. The dataset also has an official test set, composed of 40,775 images paired with 40 private captions each, and a public evaluation server (https://competitions.codalab.org/competitions/3221) to measure the performance.
5.1.2 Pre-training datasets
Although training on large well-curated datasets is a sound approach, some works [lu2019vilbert, li2020oscar] have demonstrated the benefits of pre-training on even bigger vision-and-language datasets, which can be either image captioning datasets with less diverse and lower-quality captions or datasets collected for other tasks (e.g. visual question answering [li2020oscar, zhou2020unified], text-to-image generation [ramesh2021zero], image-caption association [radford2021learning]). Among the pre-training datasets that have been specifically collected for image captioning, it is worth mentioning SBU Captions [ordonez2011im2text], originally used for tackling image captioning as a retrieval task [hodosh2013framing], which contains around 1 million image-text pairs collected from the Flickr website. Later, the Conceptual Captions [sharma2018conceptual, changpinyo2021conceptual] datasets were proposed, which are collections of around 3.3 million (CC3M) and 12 million (CC12M) images, each paired with one weakly-associated description automatically collected from the web with a relaxed filtering procedure. Although the large scale and variety in caption style make Conceptual Captions particularly interesting for pre-training, the contained captions are simple, and the availability of images is not always guaranteed since they are provided as URLs.
Pre-training on such datasets requires significant computational resources and effort to collect the data needed. Nevertheless, this strategy represents an asset to obtain state-of-the-art performance. For this reason, some vision-and-language pre-training datasets are not publicly available [ramesh2021zero, radford2021learning].
5.1.3 Domain-specific datasets
While domain-generic benchmark datasets are important to capture the main aspects of the image captioning task, domain-specific datasets are also important to highlight and target specific challenges. These may relate to the visual domain (e.g. type and style of the images) and the semantic domain. In particular, the distribution of the terms used to describe domain-specific images can be significantly different from that of the terms used for domain-generic images.
An example of a dataset that is specific in terms of visual domain is the VizWiz Captions [gurari2020captioning] dataset, collected to steer image captioning research towards assistive technologies. The images in this dataset have been taken by visually-impaired people with their phones; thus, they can be of low quality and concern a wide variety of everyday activities, most of which entail reading some text.
Some examples of datasets with a specific semantic domain are the CUB-200 [welinder2010caltech] and the Oxford-102 [nilsback2008automated] datasets, which contain images of birds and flowers, respectively, that have been paired with ten captions each by Reed et al. [reed2016learning]. Given the specificity of these datasets, rather than for standard image captioning, they are usually adopted for different related tasks such as cross-domain captioning [chen2017show], visual explanation generation [hendricks2016generating, hendricks2018grounding], and text-to-image synthesis [reed2016generative]. Another domain-specific dataset is Fashion Captioning [yang2020fashion], which contains images of clothing items in different poses and colors that may share the same caption. The vocabulary for describing these images is somewhat smaller and more specific than for generic datasets. In contrast, datasets such as BreakingNews [ramisa2017breakingnews] and GoodNews [biten2019good] enforce using a richer vocabulary since their images, taken from news articles, have long associated captions written by expert journalists. The same applies to the TextCaps [sidorov2020textcaps] dataset, which contains images with text that must be “read” and included in the caption, and to Localized Narratives [pont2020connecting], whose captions have been collected by recording people freely narrating what they see in the images.
Collecting domain-specific datasets and developing solutions to tackle the challenges they pose is crucial to extend the applicability of image captioning algorithms.
5.2 Evaluation Metrics
Evaluating the quality of a generated caption is a tricky and subjective task [vedantam2015cider, spice2016], complicated by the fact that captions must not only be grammatical and fluent but also properly refer to the input image. Arguably, the best way to measure the quality of the caption for an image is still carefully designing a human evaluation campaign in which multiple users score the produced sentences. However, human evaluation is lengthy and costly and, most importantly, is not reproducible – which prevents a fair comparison between different approaches. Automatic scoring methods are therefore used to assess the quality of system-produced captions, usually by comparing them with human-produced reference sentences, although some metrics can also be applied without relying on reference captions. A taxonomy and the main characteristics of the most commonly used metrics are summarized in Table II.
5.2.1 Standard evaluation metrics
The first strategy adopted to evaluate image captioning performance consists of exploiting metrics designed for NLP tasks such as machine translation and summarization [papineni2002bleu, lin2004rouge, banerjee2005meteor]. Later, specific image captioning metrics have been proposed [vedantam2015cider, spice2016]. The most commonly used ones capture different aspects of the caption quality based on n-gram precision and recall. For fair comparison among different approaches, the common practice is to use the implementation provided in the Microsoft COCO caption evaluation repository (https://github.com/tylin/coco-caption).
As expected, metrics designed for image captioning usually correlate better with human judgment than those borrowed from other NLP tasks (with the exception of METEOR [banerjee2005meteor]), both at corpus-level and caption-level [spice2016, sharif2018nneval, cui2018learning]. Correlation with human judgment is measured via statistical correlation coefficients (such as Pearson’s, Kendall’s, and Spearman’s correlation coefficients) and via the agreement with humans’ preferred caption in a pair of candidates, all evaluated on sample captioned images.
BLEU [papineni2002bleu]. BLEU is a precision-oriented metric designed for machine translation. To obtain the score, n-gram precision is calculated for n-grams up to length four, with counts clipped so that an n-gram occurring in the candidate more often than in the reference is not over-rewarded. Finally, the n-gram precision values are combined via a weighted geometric mean, i.e. a weighted sum of their logarithms, and multiplied by a brevity penalty for candidates shorter than the reference.
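The modification to the n-gram precision can be made concrete with a short sketch. It computes only the modified precision for a single n-gram order, not the full BLEU score, and it is not the code of the Microsoft COCO caption evaluation repository; the function name is hypothetical.

```python
from collections import Counter


def clipped_ngram_precision(candidate, references, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""

    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    candidate_counts = ngrams(candidate)
    if not candidate_counts:
        return 0.0
    # For each n-gram, the maximum number of times it occurs in any single reference.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in ngrams(reference).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(
        min(count, max_reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return clipped / sum(candidate_counts.values())


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
# Without clipping, all seven unigrams would match; with clipping only two do.
print(clipped_ngram_precision(candidate, references, n=1))
```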
METEOR [banerjee2005meteor]. METEOR is a precision and recall-based machine translation evaluation metric. Unigram precision and unigram recall are calculated by matching unigrams in the candidate and reference sentences based on their exact form, stemmed form, and meaning. Then, an F-mean is obtained, weighing the recall more than the precision. In addition, a multiplicative factor is used to reward identically ordered contiguous matched unigrams.
ROUGE [lin2004rouge]. ROUGE is a recall-oriented metric designed for summarization. It is based on the idea that ideal candidate summaries should overlap the reference summary. For image captioning evaluation, precision and recall are obtained based on the longest subsequence of tokens in the same relative order, possibly with other tokens in-between, that appears in both candidate and reference caption, and an F-mean is computed favoring the recall.
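The longest-common-subsequence computation behind this variant of ROUGE can be sketched as below. The weighting parameter `beta=1.2` is an assumption chosen so that recall is favored; the function names are hypothetical, and the sketch handles a single reference only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(candidate, reference, beta=1.2):
    """F-mean of LCS-based precision and recall; beta > 1 favors recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


candidate = "a dog runs in the park".split()
reference = "a dog is running in the park".split()
# The common subsequence "a dog in the park" has length 5.
print(lcs_length(candidate, reference))
print(rouge_l(candidate, reference))
```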
Table II (excerpt). Embedding-based metrics: WMD [kusner2015word], designed for document dissimilarity; Coverage [cornia2019show, bigazzi2020explore], designed for captioning. Learning-based metrics: BERT-S [zhang2020bertscore], designed for text similarity.
CIDEr [vedantam2015cider]. CIDEr was designed to correlate well with human judgment on image caption quality. It is based on the cosine similarity between the Term Frequency-Inverse Document Frequency weighted n-grams in the candidate caption and in the set of reference captions associated with the image, thus taking into account both precision and recall. In addition, a Gaussian penalty factor rewards length similarity between candidate and reference sentences.
SPICE [spice2016]. SPICE is specifically designed for captioning evaluation and considers the candidate caption semantic content rather than its grammaticality and fluency. The captions are represented as sets of tuples extracted from their scene graphs, and precision and recall are calculated over matching tuples in these sets. Note that two tuples match if their elements match or are synonyms. SPICE is obtained as the F1-mean. This score can be quantified for certain objects, attributes, and relations separately, and, by definition, it could also work by directly comparing the scene graphs of the image and the candidate caption.
5.2.2 Diversity metrics
To better assess the performance of a captioning system, it is common practice to consider a set of the above-mentioned standard metrics. Nevertheless, these are somewhat gameable because they favor word similarity rather than meaning correctness [caglayan2020curious]. Another drawback of the standard metrics is that they do not capture (but rather disfavor) the desirable capability of the system to produce novel and diverse captions, which is more in line with the variability with which humans describe complex images. This consideration led to the development of diversity metrics [shetty2017speaking, van2018measuring, wang2019describing, wang2020diversity]. Most of these metrics can potentially be calculated even when no ground-truth captions are available at test time. However, since they overlook the syntactic correctness of the captions and their relatedness with the image, it is advisable to combine them with other metrics.
The overall performance of a captioning system can be evaluated in terms of corpus-level diversity or, when the system can output multiple captions for the same image, single image-level diversity (termed global diversity and local diversity, respectively, in [van2018measuring]). To quantify the former, one can consider the number of unique words used in all the generated captions (Vocab) and the percentage of generated captions that were not present in the training set (%Novel). For the latter, one can use the ratio of unique caption unigrams or bigrams to the total number of caption unigrams (Div-1 and Div-2).
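These diversity scores are simple counting statistics, as the sketch below illustrates. The function names, the lowercasing, and the whitespace tokenization are assumptions made for illustration; published implementations may normalize text differently.

```python
def corpus_diversity(generated_captions, training_captions):
    """Return (Vocab, %Novel) for captions generated over a whole test set."""
    vocabulary = {
        word for caption in generated_captions for word in caption.lower().split()
    }
    training_set = {caption.lower() for caption in training_captions}
    novel = sum(caption.lower() not in training_set for caption in generated_captions)
    return len(vocabulary), 100 * novel / len(generated_captions)


def div_n(captions_for_one_image, n):
    """Ratio of unique n-grams to the total number of unigrams (Div-n)."""
    tokenized = [caption.lower().split() for caption in captions_for_one_image]
    unique_ngrams = {
        tuple(tokens[i:i + n])
        for tokens in tokenized
        for i in range(len(tokens) - n + 1)
    }
    total_unigrams = sum(len(tokens) for tokens in tokenized)
    return len(unique_ngrams) / total_unigrams


print(corpus_diversity(["a dog runs", "a cat sits"], ["a dog runs"]))
print(div_n(["a dog runs", "a dog sleeps"], 1))
print(div_n(["a dog runs", "a dog sleeps"], 2))
```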
5.2.3 Embedding-based metrics
An alternative approach to captioning evaluation consists in relying on caption semantic similarity or other specific aspects of caption quality, which are estimated via embedding-based metrics [kusner2015word, cornia2019show, cornia2020smart, bigazzi2020explore].
| Model | Visual Encoding, Language Model, Training Strategies (✓) | Main Results: BLEU-4 | Main Results: METEOR | Main Results: CIDEr |
|---|---|---|---|---|
| Unified VLP [zhou2020unified] | ✓ ✓ ✓ ✓ ✓ ✓ | 39.5 | 29.3 | 129.3 |
| Neural Baby Talk [lu2018neural] | ✓ ✓ ✓ | 34.7 | 27.1 | 107.2 |
| SCST (Att2in) [rennie2017self] | ✓ ✓ ✓ ✓ | 33.3 | 26.3 | 111.4 |
| Adaptive Attention [lu2017knowing] | ✓ ✓ ✓ | 33.2 | 26.6 | 108.5 |
| Areas of Attention [pedersoli2017areas] | ✓ ✓ ✓ | 30.7 | 24.5 | 93.8 |
| Review Net [yang2016review] | ✓ ✓ ✓ | 29.0 | 23.7 | 88.6 |
| Show, Attend and Tell [xu2015show] | ✓ ✓ ✓ | 24.3 | 23.9 | - |
| SCST (FC) [rennie2017self] | ✓ ✓ ✓ ✓ | 31.9 | 25.5 | 106.3 |
| Embedding Reward [ren2017deep] | ✓ ✓ ✓ ✓ | 30.4 | 25.1 | 93.7 |
| Show and Tell [vinyals2015show] | ✓ ✓ ✓ | 24.6 | - | - |
| Mind’s Eye [chen2015mind] | ✓ ✓ ✓ | 19.0 | 20.4 | - |
WMD [kusner2015word]. WMD was introduced to evaluate document semantic dissimilarity but can also be applied to captioning evaluation (converted into a similarity score via a negative exponential) by considering generated captions and ground-truth captions as the compared documents [kilickaya2017re]. The captions are represented by normalized bag-of-words, not including stopwords. Their WMD is defined as the minimum cumulative sum of the pairwise Euclidean distance of their word embeddings [mikolov2013distributed], weighted by a term representing the contribution of the generated caption word to the probability mass of the ground-truth caption word.
Alignment [cornia2019show]. Alignment was introduced to evaluate controllable captioning but can also be applied to standard captioning evaluation. The produced caption and the ground-truth caption are represented as the sequence of their contained nouns, and an alignment score is calculated via the Needleman-Wunsch algorithm, where the noun matching score is given by the cosine similarity of the nouns word vectors represented with GloVe embeddings [pennington2014glove]. The noun alignment score is given by the alignment score normalized by the maximum length of the compared sequences.
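The noun alignment score can be sketched with a small dynamic program. This is an illustration under stated assumptions rather than the implementation of [cornia2019show]: the two-dimensional word vectors are toy values standing in for GloVe embeddings, the gap penalty of zero is an assumption, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def noun_alignment(predicted_nouns, reference_nouns, vectors, gap=0.0):
    """Needleman-Wunsch score between two noun sequences, normalized by the longer length."""
    n, m = len(predicted_nouns), len(reference_nouns)
    if n == 0 or m == 0:
        return 0.0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + cosine(
                vectors[predicted_nouns[i - 1]], vectors[reference_nouns[j - 1]]
            )
            # Either align the two nouns or skip one of them (gap).
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m] / max(n, m)


# Toy word vectors (assumption): orthogonal directions for unrelated nouns.
toy_vectors = {"dog": [1.0, 0.0], "ball": [0.0, 1.0]}
print(noun_alignment(["dog", "ball"], ["dog", "ball"], toy_vectors))
print(noun_alignment(["ball", "dog"], ["dog", "ball"], toy_vectors))
```

Because the alignment must preserve the order of the nouns, swapping two otherwise correct nouns lowers the score, which is the behavior a controllability-oriented metric is meant to capture.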
Coverage [cornia2020smart, bigazzi2020explore]. Coverage expresses the completeness of a caption, which is evaluated by considering the mentioned visual entities. To this end, scene object categories and caption nouns are represented as word vectors [pennington2014glove] and matched via the Hungarian algorithm. The noun coverage score is defined as the sum of the assignment scores normalized by the number of object categories in the scene. Since this score considers visual objects directly, it can be applied even when no ground-truth caption is available.
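A minimal sketch of the coverage score follows. To keep it dependency-free, the optimal one-to-one assignment is found by brute force over permutations rather than by the Hungarian algorithm; both return the same optimum, but brute force is only practical for a handful of items. The function name `noun_coverage` and the toy embeddings are assumptions made for illustration.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def noun_coverage(object_categories, caption_nouns, emb):
    """Sum of the best one-to-one assignment scores, divided by the number of objects."""
    n_obj, n_noun = len(object_categories), len(caption_nouns)
    if n_obj == 0:
        return 0.0
    sim = [[cosine(emb[o], emb[n]) for n in caption_nouns] for o in object_categories]
    k = min(n_obj, n_noun)
    if n_obj <= n_noun:
        # Each object gets a distinct noun.
        best = max(
            sum(sim[i][c] for i, c in enumerate(cols))
            for cols in permutations(range(n_noun), k)
        )
    else:
        # Each noun gets a distinct object; the remaining objects stay uncovered.
        best = max(
            sum(sim[r][j] for j, r in enumerate(rows))
            for rows in permutations(range(n_obj), k)
        )
    return best / n_obj


# Hypothetical toy embeddings; a real evaluation would load GloVe vectors.
TOY_EMB = {
    "dog": [1.0, 0.0, 0.0],
    "ball": [0.0, 1.0, 0.0],
    "tree": [0.0, 0.0, 1.0],
    "puppy": [0.8, 0.6, 0.0],
}

coverage = noun_coverage(["dog", "ball", "tree"], ["puppy", "ball"], TOY_EMB)
```

In the example the caption leaves "tree" unmentioned, so the score stays below 1 even though the two nouns it does contain match scene objects closely.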
|Standard Metrics||Diversity Metrics||Embedding-based Metrics||Learning-based Metrics|
|Show and Tell [vinyals2015show]||13.6||72.4||31.4||25.0||53.1||97.2||18.1||0.014||0.045||635||36.1||16.5||0.199||71.7||71.8||93.4||0.697||0.762|
|SCST (FC) [rennie2017self]||13.4||74.7||31.7||25.2||54.0||104.5||18.4||0.008||0.023||376||60.7||16.8||0.218||74.7||71.9||89.0||0.691||0.758|
|Show, Attend and Tell [xu2015show]||18.1||74.1||33.4||26.2||54.6||104.6||19.3||0.017||0.060||771||47.0||17.6||0.209||72.1||73.2||93.6||0.710||0.773|
|SCST (Att2in) [rennie2017self]||14.5||78.0||35.3||27.1||56.7||117.4||20.5||0.010||0.031||445||64.9||18.5||0.238||76.0||73.9||88.9||0.712||0.779|
|Unified VLP [zhou2020unified]||138.2||80.9||39.5||29.3||59.6||129.3||23.2||0.019||0.081||898||74.1||26.6||0.258||77.1||75.1||94.4||0.750||0.807|
5.2.4 Learning-based evaluation
As a further development towards caption quality assessment, learning-based evaluation strategies [sharif2018nneval, cui2018learning, jiang2019tiger, zhang2020bertscore, lee2020vilbertscore, hessel2021clipscore, wang2021faier, lee2021umic] that evaluate how human-like a caption is [dai2017towards] are being investigated.
TIGEr [jiang2019tiger]. TIGEr represents the reference and candidate captions as grounding score vectors obtained from a pre-trained model [lee2018stacked] that grounds their words on the image regions and scores the candidate caption based on the similarity of the grounding vectors. This is computed by taking into account region rank similarity (i.e. how similarly the image regions are ranked based on the grounding scores in the vectors) and weight distribution similarity (i.e. how similar the grounding score distributions in the vectors are).
BERT-S [zhang2020bertscore]. BERT Score is a metric to evaluate various language generation tasks [unanue2021berttune], including image captioning. It exploits pre-trained BERT embeddings [devlin2018bert] to represent and match the tokens in the reference and candidate sentences via cosine similarity. The best matching token pairs are used for computing precision, recall, and F1-score.
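The token matching step can be sketched as below. The vectors are toy stand-ins for contextual BERT embeddings, the function name `greedy_match_prf` is hypothetical, and optional refinements of the published metric (such as importance weighting of tokens) are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def greedy_match_prf(cand_vecs, ref_vecs):
    """Precision, recall and F1 from best-match cosine similarities between tokens."""
    if not cand_vecs or not ref_vecs:
        raise ValueError("both sentences need at least one token vector")
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs))) for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy example: two candidate tokens, one reference token.
p, r, f1 = greedy_match_prf([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

Here the extra candidate token lowers precision to 0.5 while recall stays at 1.0, which is the asymmetry the F1-score balances.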
CLIP-S [hessel2021clipscore]. CLIP Score is a direct application of the CLIP [radford2021learning] cross-modal retrieval model, which leverages large-scale vision-and-language pre-training, to image captioning evaluation. The score consists of an adjusted cosine similarity between the image and the candidate caption representations. Thus, CLIP-S is designed to work without reference captions. Nonetheless, its reference-augmented variant (called RefCLIP-S in [hessel2021clipscore]) can also exploit the reference captions by taking the harmonic mean of the image-candidate CLIP-S score and the maximum cosine similarity between the candidate caption and the reference ones.
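Both scores reduce to a few lines once the embeddings are available. The sketch below assumes the embeddings have already been produced by a CLIP model and uses toy vectors in their place; the rescaling weight `w = 2.5` and the clipping at zero follow the formulation in [hessel2021clipscore], while the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def clip_s(image_vec, cand_vec, w=2.5):
    """Reference-free score: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_vec, cand_vec), 0.0)


def ref_clip_s(image_vec, cand_vec, ref_vecs, w=2.5):
    """Harmonic mean of CLIP-S and the best candidate-reference similarity."""
    a = clip_s(image_vec, cand_vec, w)
    b = max(max(cosine(cand_vec, r) for r in ref_vecs), 0.0)
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


# Toy vectors standing in for CLIP image and text embeddings.
image = [1.0, 0.0]
candidate = [1.0, 0.0]
references = [[0.0, 1.0], [1.0, 0.0]]

reference_free = clip_s(image, candidate)
reference_based = ref_clip_s(image, candidate, references)
```

Since the harmonic mean is dominated by the smaller of its two terms, a candidate that matches the image but none of the references (or vice versa) receives a low reference-based score.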
6 Experimental Evaluation
According to the taxonomies proposed in Sections 2, 3, and 4, in Table III, we overview the most relevant surveyed methods. We report their performance in terms of BLEU-4, METEOR, and CIDEr on the MS COCO Karpathy split test set and their main features in terms of visual encoding, language modeling, and training strategies. In the table, methods are clustered based primarily on their visual encoding strategy and ordered based on the obtained scores. Methods exploiting vision-and-language pre-training are further separated from the others. Image captioning models have reached impressive performance in just a few years: from an average BLEU-4 of 25.1 for the methods using global CNN features to an average BLEU-4 of 35.3 and 39.8 for those exploiting the attention and self-attention mechanisms, respectively, peaking at 41.7 in the case of vision-and-language pre-training. By looking at the performance in terms of the more representative CIDEr score, we can notice the same positive trend and make the following considerations on the design choices adopted in the surveyed works. As for the visual encoding, the more complete and structured the included information about semantic visual concepts and their mutual relations, the better the performance (consider that methods applying attention over a grid of features reach an average CIDEr score of 105.8, while those performing attention over visual regions reach 121.8, which further increases to 130.4 on average for graph-based approaches and methods using self-attention). As for the language model, LSTM-based approaches combined with strong visual encoders are still competitive with subsequent fully-attentive methods in terms of performance. These methods are slower to train but are generally smaller than Transformer-based ones (apart from the optimized Transformer model [cornia2020meshed]).
As for the training strategy, sentence-level fine-tuning with reinforcement learning leads to significant performance improvement (consider that methods relying only on the cross-entropy loss obtain an average CIDEr score of 92.3, while those combining it with reinforcement learning fine-tuning reach 125.1 on average). Moreover, the collected results show that the masked language modeling loss can be a valid alternative to the cross-entropy loss. Finally, it emerges that vision-and-language pre-training on large datasets boosts performance and deserves further investigation.
Furthermore, in Table IV, we analyze the performance of some of the main approaches in terms of all the evaluation scores presented in Section 5.2, to take into account the different aspects of caption quality these express, and report their number of parameters to give an idea of the computational complexity and memory occupancy of the models. The data in the table have been obtained either from the model weights and caption files provided by the original authors or from our best implementation. Given its wide use as a benchmark in the field, we consider the domain-generic MS COCO dataset also for this analysis. In the table, methods are clustered based on the information included in the visual encoding and ordered by CIDEr score. It can be observed that standard and embedding-based metrics all improved substantially with the introduction of region-based visual encodings. Further improvement was due to the integration of information on inter-object relations, expressed either via graphs or self-attention. Notably, CIDEr, SPICE, and Coverage most reflect the benefit of vision-and-language pre-training. Moreover, as expected, it emerges that the diversity-based scores are correlated, especially Div-1, Div-2, and the Vocab Size. The correlation between this family of scores and the others is almost linear, except for early approaches, which perform reasonably well in terms of diversity despite lower values for standard metrics. From the trend of learning-based scores, it emerges that exploiting models trained on textual data only (BERT-S, reported in the table as its F1-score variant) does not help discriminate among image captioning approaches. On the other hand, considering as reference only the visual information and disregarding the ground-truth captions is possible with the appropriate vision-and-language pre-trained model (consider that CLIP-S and its reference-based variant, RefCLIP-S, are linearly correlated).
This is a desirable property for an image captioning evaluation score since it allows estimating the performance of a model without relying on reference captions that can be limited in number and somehow subjective.
For readability, in Fig. 9 we highlight the relation between the CIDEr score and other characteristics from Table IV (i.e. number of parameters and representative non-standard metrics). We chose CIDEr as this score is commonly regarded as one of the most relevant indicators of image captioning system performance. The first plot, depicting the relation between model complexity and performance, shows that more complex models do not necessarily lead to better performance. Consider for example the Meshed-Memory Transformer [cornia2020meshed] and X-LAN [pan2020x] in comparison with X-Transformer [pan2020x] and Unified VLP [zhou2020unified]. All methods achieve a CIDEr score higher than 130, but the first two are much more compact architectures (38.8M and 75.2M parameters compared to over 137M). The other plots describe an almost-linear relation between CIDEr and the other scores, with some flattening for high CIDEr values. These trends confirm the suitability of the CIDEr score as an indicator of the overall performance of an image captioning algorithm, whose specific characteristics in terms of the produced captions would still be expressed more precisely by non-standard metrics.
7 Image Captioning Variants
Beyond general-purpose image captioning, several specific sub-tasks have been explored in the literature. These can be classified into four categories according to their scope: 1. dealing with lacking training data; 2. focusing on the visual input; 3. focusing on the textual output; 4. addressing user requirements.
7.1 Dealing with lacking training data
Paired image-caption datasets are very expensive to obtain. Thus, some image captioning variants are being explored that limit the need for full supervision information.
Novel Object Captioning. Novel object captioning focuses on describing objects not appearing in the training set, thus enabling a zero-shot learning setting that can increase the applicability of the models in the real world. Early approaches to this task [hendricks16cvpr, venugopalan17cvpr] tried to transfer knowledge from out-domain images by conditioning the model on external unpaired visual and textual data at training time. To explore this strategy, Hendricks et al. [hendricks16cvpr] introduced a variant of the MS COCO dataset [lin2014microsoft], called held-out COCO, in which image-caption pairs containing one of eight pre-selected object classes were removed from the training set but not from the test set. To further encourage research on this task, the more challenging nocaps dataset, with nearly 400 novel objects, has been introduced [agrawal2019nocaps], where images are grouped into three subsets depending on their semantic distance to MS COCO (i.e. in-domain, near-domain, and out-of-domain images). Some approaches to this variant [yao2017incorporating, li2019lstmp] integrate copying mechanisms in the language model to select novel objects predicted from a tagger. Other methods generate a caption template with placeholders to be filled with novel objects [wu2018dnoc, lu2018neural] or replace ambiguous words with novel objects in a second stage [feng2020cascaded]. On a different line, Anderson et al. [cbs2017emnlp] devised the Constrained Beam Search algorithm to force the inclusion of selected tag words in the output caption, following the predictions of a tagger. Moreover, following the pre-training trend with BERT-like architectures, Hu et al. [hu2020vivo] proposed a multi-layer Transformer model pre-trained by randomly masking one or more tags from image-tag pairs.
Unpaired Captioning. Unpaired captioning aims at understanding and describing images without paired image-text training data. Following unpaired machine translation approaches, the early work [gu2018unpaired] proposes to generate captions in a pivot language and then translate predicted captions to the target language. After this work, the most common approach focuses on adversarial learning by training an LSTM-based discriminator to distinguish whether a caption is real or generated [feng2019unsupervised, laina2019towards]. Some alternatives exist, such as [gu2019unpaired] that generates a caption from the image scene-graph, [guo2020recurrent] that leverages a memory-based network, or [ben2021unpaired, kim2019image] that propose semi-supervised strategies.
Continual Captioning. Continual captioning aims to deal with partially unavailable data by following the continual learning paradigm to incrementally learn new tasks without forgetting what has been learned before. In this respect, new tasks can be represented as sequences of captioning tasks with different vocabularies, as proposed in [del2020ratt], and the model should be able to transfer visual concepts from one to the other while enlarging its vocabulary. To this end, continual captioning approaches focus on techniques to address catastrophic forgetting, such as freezing part of the model during training, employing pseudo-labels, or knowledge distillation [nguyen2019contcap].
7.2 Focusing on the visual input
Some sub-tasks focus on making the textual description more correlated with visual data.
Dense Captioning. Dense captioning was proposed by Johnson et al. [johnson2016densecap] and consists of concurrently localizing and describing salient image regions with short natural language sentences. In this respect, the task can be conceived as a generalization of object detection, where captions replace object tags, or of image captioning, where single regions replace the full image. The main challenge of this task is the aperture problem, i.e. the lack of contextual information relating each region to the surrounding ones. To face this issue, contextual and global features [yang2017dense, li2019learning] and attribute generators [yin2019context, kim2019dense] can be exploited. The performance on this task is commonly evaluated on the Visual Genome [krishnavisualgenome] dataset since it contains a large number of images with region-level annotated captions. Related to this variant, an important line of work [krause2017hierarchical, liang2017recurrent, mao2018show, melas2018training, chatterjee2018diverse, zha2019context, luo2019curiosity, che2019visual] focuses on the generation of textual paragraphs that densely describe the visual content as a coherent story.
Text-based Image Captioning. Text-based image captioning, also known as OCR-based image captioning or image captioning with reading comprehension, aims at reading and including the text appearing in images in the generated descriptions. The task was introduced by Sidorov et al. [sidorov2020textcaps] with the TextCaps dataset. Another dataset designed for pre-training for this variant is OCR-CC [yang2021tap], which is a subset of images containing meaningful text taken from the CC3M dataset [sharma2018conceptual] and automatically annotated through a commercial OCR system. The common approach to this variant entails combining image regions and text tokens, i.e. groups of characters from an OCR, possibly enriched with mutual spatial information [wang2020multimodal, wang2021improving], in the visual encoding [sidorov2020textcaps, zhu2021simple]. Another direction entails generating multiple captions describing different parts of the image, including the contained text [xu2021towards].
Change Captioning. Change captioning targets changes that occurred in a scene, thus requiring both accurate change detection and effective natural language description. The task was first presented in [jhamtani2018learning] with the Spot-the-Diff dataset, composed of pairs of frames extracted from video surveillance footage and the corresponding textual descriptions of visual changes. To further explore this variant, the CLEVR-Change dataset [park2019robust] has been introduced, which contains five scene change types on almost 80K image pairs. The proposed approaches for this variant apply attention mechanisms to focus on semantically relevant aspects without being deceived by distractors such as viewpoint changes [shi2020finding, huang2021image] or perform multi-task learning with image retrieval as an auxiliary task [hosseinzadeh2021image], where an image must be retrieved from its paired image and the description of the occurred changes.
7.3 Focusing on the textual output
Since every image captures a wide variety of entities with complex interactions, human descriptions tend to be diverse and grounded to different objects and details. Some image captioning variants explicitly focus on these aspects.
Diverse Captioning. Diverse image captioning tries to replicate the quality and variability of the sentences produced by humans. The most common technique to achieve diversity is based on variants of the beam search algorithm [vijayakumar2018diverse] that entail dividing the beams into similar groups and encouraging diversity between groups. Other solutions have been investigated, such as contrastive learning [dai2017contrastive], conditional GANs [dai2017towards, shetty2017speaking], and paraphrasing [liu2019generating]. However, these solutions tend to underperform in terms of caption quality, which is partially recovered by using variational auto-encoders [wang2017diverse, aneja2019sequential, chen2019variational, mahajan2020diverse]. Another approach is to exploit multiple part-of-speech tag sequences predicted from image region classes [deshpande2019fast] and force the model to produce different captions based on these sequences.
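The group-wise variant of beam search mentioned above can be sketched as follows. This is a simplified illustration rather than the exact procedure of [vijayakumar2018diverse]: the language model is replaced by a hypothetical `log_probs_fn` that maps a prefix to next-token log-probabilities, the diversity term is a plain count of how often earlier groups picked the same token at the current step, and the running score kept for each beam is the unpenalized log-probability (an assumption of this sketch).

```python
import math


def diverse_beam_search(log_probs_fn, num_groups, beams_per_group, steps, penalty):
    """Group-wise beam search with a token-level diversity penalty.

    Groups are expanded one after another at every time step; a token already
    chosen by an earlier group at the same step is penalized in later groups.
    """
    # Each group holds a list of (token sequence, cumulative log-probability).
    groups = [[((), 0.0)] for _ in range(num_groups)]
    for _ in range(steps):
        used = {}  # token -> times selected by earlier groups at this step
        for g in range(num_groups):
            candidates = []
            for seq, score in groups[g]:
                for tok, lp in log_probs_fn(seq).items():
                    penalized = score + lp - penalty * used.get(tok, 0)
                    candidates.append((penalized, score + lp, seq + (tok,)))
            # Rank by the penalized score, keep the unpenalized one.
            candidates.sort(key=lambda c: c[0], reverse=True)
            groups[g] = [(seq, s) for _, s, seq in candidates[:beams_per_group]]
            for seq, _ in groups[g]:
                used[seq[-1]] = used.get(seq[-1], 0) + 1
    return [[seq for seq, _ in grp] for grp in groups]


def toy_model(prefix):
    """Hypothetical stand-in for a captioner: same distribution for any prefix."""
    return {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}


with_penalty = diverse_beam_search(toy_model, num_groups=2, beams_per_group=1, steps=1, penalty=1.0)
without_penalty = diverse_beam_search(toy_model, num_groups=2, beams_per_group=1, steps=1, penalty=0.0)
```

With the penalty disabled, both groups pick the most likely token and the outputs collapse to the same sequence; with the penalty enabled, the second group is pushed towards a different token.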
Multilingual Captioning. Since image captioning is commonly performed in English, multilingual captioning [elliott2015multilingual] aims to extend the applicability of captioning systems to other languages. The two main strategies entail collecting captions in different languages for commonly used datasets (e.g. Chinese and Japanese captions for MS COCO images [li2019coco, miyazaki2016cross], German captions for Flickr30K [elliott2016multi30k]), or directly training multilingual captioning systems with unpaired captions [elliott2015multilingual, lan2017fluency, gu2018unpaired, song2019unpaired, wu2019improving], which requires specific evaluation protocols [chen2021towards].
7.4 Addressing user requirements
Regular image captioning models generate factual captions with a neutral tone and no interaction with end-users. Instead, some image captioning sub-tasks are devoted to coping with user requests.
Personalized Captioning. Humans find captions more effective when they avoid stating the obvious and are written in a style that catches their interest. Personalized image captioning aims at fulfilling this requirement by generating descriptions that take into account the user’s prior knowledge, active vocabulary, and writing style. To this end, early approaches exploit a memory block as a repository for this contextual information [chunseong2017attend, park2018towards]. On another line, Zhang et al. [zhang2020learning] proposed a multi-modal Transformer network that personalizes captions conditioned on the user’s recent captions and a learned user representation. Other works have instead focused on the style of captions as an additional controllable input and proposed to solve this task by exploiting unpaired stylized textual corpora [gan2017stylenet, mathews2018semstyle, guo2019mscap, zhao2020memcap] and adversarial learning [guo2019mscap]. Some datasets have been collected to explore this variant, such as InstaPIC [chunseong2017attend], which is composed of multiple Instagram posts from the same users, FlickrStyle10K [gan2017stylenet], which contains images and textual sentences with two different styles, and Personality-Captions [shuster2019engaging], which contains triples of images, captions, and one among 215 personality traits to be used to condition the caption generation.
Controllable Captioning. Controllable captioning puts the users in the loop by asking them to select and give priorities to what should be described in an image. This information is exploited as a guiding signal for the generation process. The signal can be in the form of part-of-speech tag sequences, as in [deshpande2019fast], of sets or sequences of image regions corresponding to a sentence noun chunk (i.e. a noun with its modifiers) as in [cornia2019show], of mouse traces, producing a dense visual grounding between words and visual elements, as in [pont2020connecting], or of verbs and semantic roles, where verbs represent activities in the image and semantic roles determine how objects engage in these activities, as in [chen2021human]. A step further is proposed by Meng et al. [meng2021connecting], which also incorporates controlled trace generation and joint caption-trace generation tasks.
8 Conclusions and Future Directions
Image captioning is an intrinsically complex challenge for machine intelligence as it integrates difficulties from both Computer Vision and Natural Language Generation. While most approaches keep the visual encoding and language modeling steps distinct, the single-stream trend of BERT-like architectures entails performing early fusion of visual and textual data. This strategy achieves remarkable performance but is usually combined with massive pre-training. Thus, it is worth investigating whether standard encoder-decoder methods enriched with pre-training could achieve similar results. Nonetheless, methods based on the classical two-stream paradigm are more explainable, both for model designers and end-users. The presented literature review and experimental comparison show the performance improvement over the last few years. However, many open challenges remain since accuracy, robustness, and generalization results are far from satisfactory. Similarly, requirements of fidelity, naturalness, and diversity are not yet met. In this respect, since image captioning has been conceived for improving human-machine interaction, the possibility of including the user in the loop is promising. Based on the analysis presented, we can trace three main developmental directions for the image captioning field, which are discussed in the following.
8.1 Procedural and architectural changes
As emerged from the analysis in Section 6, a paradigm shift is needed in order to boost the achievable performance.
Large-scale vision-and-language pre-training. Since image captioning models are data greedy, training on standard datasets can be limiting. Thus, pre-training on large-scale vision-and-language datasets, even if not well-curated, is a solid strategy for improving the captioning capabilities, as demonstrated in [li2020oscar, zhou2020unified, zhang2021vinvl]. Moreover, new pre-training strategies could be devised to leverage the data available in a self-supervised fashion, e.g. by reconstructing the inputs or predicting correlations, finally boosting the performance on downstream tasks such as image captioning.
Novel architectures and training strategies. The best performing paradigm for image captioning is currently the bottom-up one, which leverages object detectors for image region encoding. Nonetheless, the surveyed work [liu2021cptr] explores a fully-Transformer paradigm, where image patches are fed directly to Transformer encoders, as in the Vision Transformer proposed in [dosovitskiy2020image]. Although this first attempt underperforms the majority of previous works, it suggests that hybrid solutions, possibly integrating a Transformer-based object detector, might be a worthwhile future direction. Other promising directions entail exploring Neural Architecture Search [zhu2020autocaption] and applying the distillation mechanism to autoregressive models, as suggested by the results achieved for non-autoregressive models in [guo2020non, guo2021fast]. Finally, a promising line of exploration is the design of new objective functions for training. In particular, when a reinforcement learning phase is performed, rewards based on human feedback or interaction can be considered [shen2019learning].
8.2 Focus on the open challenges
Generalization to different domains and increased diversity and naturalness of the generated captions are among the main open challenges for image captioning.
Generalizing to different domains. Image captioning models are usually trained on datasets that do not cover all possible real-life scenarios and, therefore, cannot generalize well to different contexts. As an example, in Fig. 10, we report some qualitative results with clear errors, indicating the difficulties in dealing with rare visual concepts. In this sense, further research efforts are needed towards a robust representation of visual concepts. Moreover, developments in image captioning variants such as novel object captioning or controllable captioning could help to tackle this open issue. This would be strategic for adopting image captioning in specific applications, such as medicine, industrial product description, or cultural heritage.
Diversity and natural generation. As argued in [dai2017towards], image captioning models should produce descriptions with three properties: semantic fidelity, i.e. reflecting the actual visual content, naturalness, i.e. reading as if they were written by a human, and diversity, i.e. expressing notably different concepts as different humans would describe. However, most of the existing approaches emphasize only semantic fidelity. Although we discussed some attempts to encourage naturalness and diversity with conditional GANs [dai2017towards], contrastive learning [dai2017contrastive], variational auto-encoders [wang2017diverse], part-of-speech tagging [deshpande2019fast], or word latent spaces [aneja2019sequential], further research is needed to design models that are suitable for real-world applications.
8.3 Design of trustworthy AI solutions
Due to its potential in human-machine interactions, image captioning needs solutions that are transparent and acceptable for end-users. We frame these needs as interpretable results, overcoming bias, and adequate evaluation.
The need for interpretability. People can naturally give explanations, highlight proofs, and express confidence in what they predict, also recognizing the need for more information before reaching a conclusion. Conversely, existing image captioning algorithms lack reliable and interpretable means for determining the cause of a particular output. In this respect, a possible strategy can be based on attention visualization, which loosely couples word predictions and image regions, indicating correlations and grounding [cornia2020meshed]. However, further research is needed to shed more light on model explainability, focusing on how models deal with data from different modalities or novel concepts.
Tackling dataset bias. Since most vision-and-language datasets share common patterns and regularities, memorizing those patterns gives algorithms a shortcut to exploit unwanted correspondences. Therefore, dataset biases in human textual annotations or overrepresented visual concepts are major issues for any vision-and-language task. This topic has been investigated in the context of language generation [florez2019unintended] but is even more challenging in image captioning [hendricks2018women], where the joint ambiguity of visual and textual data must be taken into account. In this sense, some effort should be devoted to the study of fairness and bias in the image-description pairs. In this regard, two possible directions entail designing specific evaluation metrics and focusing on the robustness to unwanted correlations.
The role of evaluation. Despite the promising performance on the benchmark datasets, state-of-the-art approaches are not yet satisfactory when applied in the wild. A possible reason for this is the evaluation procedures used and their impact on the training approaches currently adopted. Captioning algorithms are trained to mimic ground truth sentences, which is somewhat a different task from understanding the visual content and expressing it in text. For this reason, the design of appropriate and reproducible evaluation protocols [hodosh2016focused, xie2019going, alikhani2020cross] and insightful metrics remains an open challenge in image captioning. Moreover, since the task is currently defined as a supervised one and thus is strongly influenced by the training data, the development of scores that do not need reference captions for assessing the performance would be key for a shift towards unsupervised image captioning.
We thank CINECA, the Italian Supercomputing Center, for providing computational resources. This work has been supported by “Fondazione di Modena”, by the “Artificial Intelligence for Cultural Heritage (AI4CH)” project, cofunded by the Italian Ministry of Foreign Affairs and International Cooperation, and by the H2020 ICT-48-2020 HumanE-AI-NET and ELISE projects. We also want to thank the authors who provided us with the captions and model weights for some of the surveyed approaches.