Oracle performance for visual captioning

by Li Yao, et al.

The task of associating images and videos with a natural language description has attracted a great amount of attention recently. Rapid progress has been made in terms of both developing novel algorithms and releasing new datasets. Indeed, the state-of-the-art results on some of the standard datasets have been pushed into the regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the possibility of empirically establishing performance upper bounds on various visual captioning datasets without extra data labelling effort or human evaluation. In particular, it is assumed that visual captioning is decomposed into two steps: from visual inputs to visual concepts, and from visual concepts to natural language descriptions. One can obtain an upper bound by assuming the first step is perfect, so that only a conditional language model for the second step needs to be trained. We demonstrate the construction of such bounds on MS-COCO, YouTube2Text and LSMDC (a combination of M-VAD and MPII-MD). Surprisingly, despite the imperfect process we used for visual concept extraction in the first step and the simplicity of the language model for the second step, we show that current state-of-the-art models fall short when compared with the learned upper bounds. Furthermore, with such a bound, we quantify several important factors concerning image and video captioning: the number of visual concepts captured by different models, the trade-off between the amount of visual elements captured and their accuracy, and the intrinsic difficulty and blessing of different datasets.



1 Introduction

With standard datasets publicly available, such as COCO and Flickr (Lin et al., 2014; Hodosh et al., 2013; Young et al., ) in image captioning, and YouTube2Text, M-VAD and MPII-MD (Guadarrama et al., 2013; Torabi et al., 2015; Rohrbach et al., 2015b) in video captioning, the field has been progressing at an astonishing speed. For instance, the state-of-the-art results on COCO image captioning have improved rapidly from 0.17 to 0.31 in BLEU (Kiros et al., 2014; Devlin et al., 2015b; Donahue et al., 2015; Vinyals et al., 2014; Xu et al., 2015b; Mao et al., 2015; Karpathy and Fei-Fei, 2014; Bengio et al., 2015; Qi Wu et al., 2015). Similarly, the benchmark on YouTube2Text has been repeatedly pushed from 0.31 to 0.50 in BLEU (Rohrbach et al., 2013; Venugopalan et al., 2015b; Yao et al., 2015; Venugopalan et al., 2015a; Xu et al., 2015a; Rohrbach et al., 2015a; Yu et al., 2015; Ballas et al., 2016). While obtaining encouraging results, captioning approaches involve large networks, usually leveraging a convolutional network for the visual part and a recurrent network for the language side. This results in models of considerable complexity in which the contribution of each component is not clear.

Instead of proposing better models, the main objective of this work is to develop a method that offers deeper insight into the strengths and weaknesses of popular visual captioning models. In particular, we propose a trainable oracle that disentangles the contribution of the visual model from that of the language model. To obtain such an oracle, we follow the assumption that the image and video captioning task may be solved in two steps (Rohrbach et al., 2013; Fang et al., 2015). Consider the model $P(S \mid v)$, where $v$ refers to usually high dimensional visual inputs, such as representations of an image or a video, and $S$ refers to a caption, usually a sentence of natural language description. In order to work well, $P(S \mid v)$ needs to form higher level visual concepts, either explicitly or implicitly, based on $v$ in the first step, denoted as $P(a \mid v)$, followed by a language model that transforms visual concepts into a legitimate sentence, denoted by $P(S \mid a)$. Here $a$ refers to atoms that are visually perceivable from $v$.

The above assumption suggests an alternative way to build an oracle. In particular, we assume the first step is close to perfect in the sense that visual concepts (or hints) are observed with almost 100% accuracy. We then train the best language model conditioned on the hints to produce captions.

Using the proposed oracle, we compare current state-of-the-art models against it, which helps quantify their capacity for visual modeling, a major weakness in contrast to their strong language modeling. In addition, when applied to different datasets, the oracle offers insight into their intrinsic difficulty and blessing, which can serve as a general guideline when designing new algorithms and developing new models. Finally, we also relax the assumption to investigate the case where visual concepts cannot realistically be predicted with 100% accuracy, and demonstrate a quantity-accuracy trade-off in solving visual captioning tasks.

2 Related work

Visual captioning

The problem of image captioning has attracted a great amount of attention lately. Early work focused on constructing linguistic templates or syntactic trees based on a set of concepts from visual inputs (Kuznetsova et al., 2012; Mitchell et al., 2012; Kulkarni et al., 2013). Another popular approach is based on caption retrieval in the embedding space, such as Kiros et al. (2014) and Devlin et al. (2015b). Most recently, language models conditioned on visual inputs have been widely studied, in the work of Fang et al. (2015) where a maximum entropy language model is used, and in Donahue et al. (2015), Vinyals et al. (2014), Xu et al. (2015b), Mao et al. (2015) and Karpathy and Fei-Fei (2014) where recurrent neural network based models are built to generate natural language descriptions. The work of Devlin et al. (2015a) advocates combining both types of language models. Furthermore, CIDEr (Vedantam et al., 2015) was proposed as an alternative evaluation metric for image captioning and is shown to be more advantageous than BLEU and METEOR. To further improve performance, Bengio et al. (2015) suggest a simple sampling algorithm during training, which was one of the winning recipes for the MSR-COCO Captioning challenge, and Jia et al. (2015) suggest the use of extra semantic information to guide the language generation process.

Similarly, video captioning has made substantial progress recently. Early models such as Kojima et al. (2002), Barbu et al. (2012) and Rohrbach et al. (2013) tend to focus on constrained domains with a limited range of activities and objects in videos. They also rely heavily on hand-crafted video features, followed by template-based or shallow statistical machine translation approaches to produce captions. Borrowing success from image captioning, recent models such as Venugopalan et al. (2015b), Donahue et al. (2015), Yao et al. (2015), Venugopalan et al. (2015a), Xu et al. (2015a), Rohrbach et al. (2015a), Yu et al. (2015) and most recently Ballas et al. (2016) have adopted a more general encoder-decoder approach with end-to-end parameter tuning. Videos are fed into a specific variant of encoding neural network to form a higher level visual summary, followed by a caption decoder based on recurrent neural networks. Training such models is possible thanks to the availability of three relatively large scale datasets, one collected from YouTube by Guadarrama et al. (2013), the other two constructed from Descriptive Video Service (DVS) on movies by Torabi et al. (2015) and Rohrbach et al. (2015b). The latter two have recently been combined as the official dataset for the Large Scale Movie Description Challenge (LSMDC).

Capturing higher-level visual concept

The idea of using intermediate visual concepts to guide caption generation has been discussed in Qi Wu et al. (2015) in the context of image captioning and in Rohrbach et al. (2015a) for video captioning. Both works trained classifiers on a predefined set of visual concepts, extracted from captions using heuristics from linguistics and natural language processing. Our work resembles both of them in the sense that we also extract similar constituents from captions. The purpose of this study, however, is different. By assuming perfect classifiers on those visual atoms, we are able to establish performance upper bounds for a particular dataset. Note that a simple bound is suggested by Rohrbach et al. (2015a), where METEOR is measured on all the training captions against a particular test caption and the largest score is picked as the upper bound. In comparison, our approach constructs a series of oracles that are trained to generate captions given different numbers of visual hints. Such bounds are therefore a clear indication of a model's ability to capture concepts within images and videos when performing caption generation, unlike the bound suggested by Rohrbach et al. (2015a), which performs caption retrieval.

3 Oracle Model

The construction of the oracle is inspired by the observation that

$$P(S \mid v) = \sum_{a} P_\theta(S \mid a)\, P(a \mid v),$$

where $S = \{w_1, \dots, w_t\}$ denotes a caption containing a sequence of words having a length $t$, $v$ denotes the visual inputs such as an image or a video, and $a$ denotes visual concepts which we call “atoms”. We have explicitly factorized the captioning model $P(S \mid v)$ into two parts: $P_\theta(S \mid a)$, which we call the conditional language model given atoms, and $P(a \mid v)$, which we call the conditional atom model given visual inputs. To establish the oracle, we assume that the atom model is given, which amounts to treating $P(a \mid v)$ as a Dirac delta function that assigns all the probability mass to the observed atom $a$. In other words, $P(S \mid v) = P_\theta(S \mid a)$.

Therefore, with the fully observed $a$, the task of image and video captioning reduces to the task of language modeling conditioned on atoms. This is arguably a much easier task than the direct modeling of $P(S \mid v)$, so a well-trained model $P_\theta(S \mid a)$ can be treated as a performance oracle for it. The information contained in $a$ directly influences the difficulty of modeling $P_\theta(S \mid a)$. For instance, if no atoms are available, $P_\theta(S \mid a)$ reduces to unconditional language modeling, which can be considered a lower bound of $P(S \mid v)$. As the amount of information $a$ carries increases, the modeling of $P_\theta(S \mid a)$ becomes more and more straightforward.

3.1 Oracle Parameterization

Given a set of atoms $a^{(k)}$ that summarize the visual concepts appearing in the visual inputs $v$, this section describes the detailed parameterization of the model $P_\theta(S \mid a^{(k)})$, with $\theta$ denoting the overall parameters. In particular, we adopt the commonly used encoder-decoder framework (Cho et al., 2014) to model this conditional based on the following simple factorization: $P_\theta(S \mid a^{(k)}) = \prod_{t} P_\theta(w_t \mid w_{<t}, a^{(k)})$.

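Recurrent neural networks (RNNs) are natural choices when outputs are sequences. We build on the recent success of a variant of RNNs called long short-term memory networks (LSTMs), first introduced by Hochreiter and Schmidhuber (1997), formulated as follows:

$$h_t, c_t = \mathrm{LSTM}(x_t, h_{t-1}, c_{t-1}) \qquad (1)$$

where $h_t$ and $c_t$ represent the RNN state and memory of the LSTM at timestep $t$, respectively. Combined with the atom representation $\Phi(a^{(k)})$, Equ. (1) is implemented as follows:

$$\begin{aligned}
i_t &= \sigma\big(W_i E_w[w_{t-1}] + U_i h_{t-1} + A_i \Phi(a^{(k)}) + b_i\big) \\
f_t &= \sigma\big(W_f E_w[w_{t-1}] + U_f h_{t-1} + A_f \Phi(a^{(k)}) + b_f\big) \\
o_t &= \sigma\big(W_o E_w[w_{t-1}] + U_o h_{t-1} + A_o \Phi(a^{(k)}) + b_o\big) \\
\tilde{c}_t &= \tanh\big(W_c E_w[w_{t-1}] + U_c h_{t-1} + A_c \Phi(a^{(k)}) + b_c\big) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where $E_w$ denotes the word embedding matrix, as opposed to the atom embedding matrix $E_a$, and $W$, $U$, $A$ and $b$ are parameters of the LSTM. With the LSTM’s state $h_t$, the probability of the next word in the sequence is $p_t = \mathrm{softmax}\big(U_p \tanh(W_p h_t + b_p) + d\big)$ with parameters $U_p$, $W_p$, $b_p$ and $d$. The overall training criterion of the oracle is

$$\theta^{*} = \arg\max_{\theta} U_k(\theta), \qquad U_k(\theta) = \sum_{n=1}^{N} \log P_\theta\big(S^{(n)} \mid a^{(n,k)}\big) \qquad (2)$$

given $N$ training pairs $(S^{(n)}, a^{(n,k)})$. $\theta$ represents the parameters in the LSTM.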
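To make the parameterization concrete, the following is a minimal NumPy sketch of one decoding step and of the per-caption log-likelihood summed in the training criterion. It assumes a standard LSTM whose gates receive an extra additive term computed from the atom representation; the gate layout, the parameter names in the dictionary, the toy dimensions and the use of token id 0 as a start symbol are all illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(rng, vocab_size, word_dim, atom_dim, hidden_dim):
    """Random toy parameters; all shapes are illustrative assumptions."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    p = {"E_w": mat(vocab_size, word_dim)}          # word embedding matrix
    for gate in ("i", "f", "o", "c"):
        p["W_" + gate] = mat(hidden_dim, word_dim)    # acts on the previous word
        p["U_" + gate] = mat(hidden_dim, hidden_dim)  # acts on the previous state
        p["A_" + gate] = mat(hidden_dim, atom_dim)    # acts on the atom representation
        p["b_" + gate] = np.zeros(hidden_dim)
    p["W_p"] = mat(hidden_dim, hidden_dim)
    p["b_p"] = np.zeros(hidden_dim)
    p["U_p"] = mat(vocab_size, hidden_dim)
    p["d"] = np.zeros(vocab_size)
    return p

def lstm_step(p, prev_word, h_prev, c_prev, phi):
    """One LSTM step conditioned on the atom representation phi."""
    x = p["E_w"][prev_word]
    def pre(gate):
        return (p["W_" + gate] @ x + p["U_" + gate] @ h_prev
                + p["A_" + gate] @ phi + p["b_" + gate])
    i, f, o = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
    c = f * c_prev + i * np.tanh(pre("c"))
    h = o * np.tanh(c)
    # Next-word distribution: softmax(U_p tanh(W_p h + b_p) + d).
    logits = p["U_p"] @ np.tanh(p["W_p"] @ h + p["b_p"]) + p["d"]
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return h, c, exp / exp.sum()

def caption_log_prob(p, words, phi, bos=0):
    """log P(S | atoms) as a sum of per-word log-probabilities."""
    hidden_dim = p["b_i"].shape[0]
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    total, prev = 0.0, bos
    for w in words:
        h, c, probs = lstm_step(p, prev, h, c, phi)
        total += float(np.log(probs[w]))
        prev = w
    return total

rng = np.random.default_rng(0)
params = init_params(rng, vocab_size=10, word_dim=4, atom_dim=3, hidden_dim=5)
print(caption_log_prob(params, [2, 3, 4], phi=np.ones(3)))
```

Summing `caption_log_prob` over all training pairs gives the quantity that training maximizes; in practice the gradient would come from an automatic differentiation framework rather than from this forward-only sketch.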

3.2 Atoms Construction

Each configuration of $a$ may be associated with a different distribution $P_\theta(S \mid a)$, and therefore a different oracle model. We define a configuration as an orderless collection of unique atoms. That is, $a^{(k)} = \{a_1, \dots, a_k\}$, where $k$ is the size of the bag and all items in the bag are different from each other. For the particular problem of image and video captioning, atoms are defined as words in captions that are most related to actions, entities, and attributes of entities (see Figure 1). The choice of these three language components as atoms is not arbitrary. It is reasonable to consider these three types among the most visually perceivable ones when humans describe visual content in natural language. We further verify this by conducting a human evaluation procedure to identify “visual” atoms from this set and show that a dominant majority of them indeed match human visual perception, as detailed in Section 5.1. Being able to capture these important concepts is considered crucial for superior performance. Therefore, comparing the performance of existing models against this oracle reveals their ability to capture atoms from visual inputs when $P(a^{(k)} \mid v)$ is unknown.

A set of atoms is treated as “a bag of words”. As with the use of a word embedding matrix in neural language modeling (Bengio et al., 2003), the atom $a_i$ is used to index the atom embedding matrix $E_a$ to obtain a vector representation of it. The representation of the entire set of atoms is then

$$\Phi(a^{(k)}) = \sum_{i=1}^{k} E_a[a_i] \qquad (3)$$
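The bag-of-atoms representation can be sketched in a few lines. This sketch assumes the set representation is the sum of the embedding rows of the atoms in the bag; the function name and the toy embedding matrix are hypothetical.

```python
def atom_bag_representation(atom_ids, atom_embeddings):
    """Sum the embedding rows of the unique atoms in the bag.

    atom_ids: iterable of integer atom indices (duplicates are ignored,
              since a bag holds each atom at most once).
    atom_embeddings: list of equal-length rows, one per atom.
    """
    dim = len(atom_embeddings[0])
    phi = [0.0] * dim
    for idx in set(atom_ids):  # order does not matter for a sum
        for j, value in enumerate(atom_embeddings[idx]):
            phi[j] += value
    return phi

# Toy embedding matrix with three atoms of dimension 2 (hypothetical values).
E_a = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
print(atom_bag_representation([2, 0, 2], E_a))  # [1.5, 0.5]
```

An empty bag yields the zero vector, which corresponds to conditioning on no atoms at all.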


4 Contributing factors of the oracle

The formulation of Section 3 is generic: it relies only on the assumption of a two-step visual captioning process and is independent of the parameterization in Section 3.1. In practice, however, one needs to take into account several factors that contribute to the oracle.

Firstly, atoms, or visual concepts, may be defined as 1-gram words, 2-gram phrases and so on. Arguably, a mixture of N-gram representations has the potential to capture more complicated correlations among visual concepts. For simplicity, this work uses only 1-gram representations, detailed in Section 5.1. Secondly, the procedure used to extract atoms needs to be reliable, extracting mainly visual concepts and leaving out non-visual ones. To ensure this, the procedure used in this work is verified with human evaluation, detailed in Section 5.1. Thirdly, the modeling capacity of the conditional language model has a direct influence on the obtained oracle. Section 3.1 has shown one example of many possible parameterizations. Lastly, the oracle may be sensitive to the training procedure and its hyper-parameters (see Section 5.2).

While it is important to keep in mind that the proposed oracle is conditioned on the above factors, we show in the experimental section that, quite surprisingly, even with the simplest procedure and parameterization the oracle serves its purpose reasonably well.

5 Experiments

We demonstrate the procedure of learning the oracle on three standard visual captioning datasets. MS COCO (Lin et al., 2014) is the most commonly used benchmark dataset in image captioning. It consists of 82,783 training and 40,504 validation images, each accompanied by 5 captions, all in one sentence. We follow the split used in Xu et al. (2015b), where a subset of 5,000 images is used for validation and another subset of 5,000 images is used for testing. YouTube2Text is the most commonly used benchmark dataset in video captioning. It consists of 1,970 video clips, each accompanied by multiple captions. Overall, there are 80,000 video and caption pairs. Following Yao et al. (2015), it is split into 1,200 clips for training, 100 for validation and 670 for testing. Another two video captioning datasets have been recently introduced in Torabi et al. (2015) and Rohrbach et al. (2015b). Compared with YouTube2Text, they are both much larger in the number of video clips, most of which are associated with one or two captions. They have recently been merged for the Large Scale Movie Description Challenge (LSMDC). We therefore name this particular dataset LSMDC. The official splits contain 91,908 clips for training, 6,542 for validation and 10,053 for testing.

5.1 Atom extraction

Figure 1: Given ground truth captions, three categories of visual atoms (entity, action and attribute) are automatically extracted using NLP Parser. “NA” denotes the empty atom set.

Visual concepts in images or videos are summarized as atoms that are provided to the caption language model. They are split into three categories: actions, entities, and attributes. To identify these three classes, we use the Stanford natural language parser to extract them automatically. After a caption is parsed, we apply simple heuristics based on the tags produced by the parser, ignoring the phrase and sentence level tags. Words tagged with {“NN”, “NNP”, “NNPS”, “NNS”, “PRP”} are used as entity atoms. Words tagged with {“VB”, “VBD”, “VBG”, “VBN”, “VBP”, “VBZ”} are used as action atoms. Words tagged with {“JJ”, “JJR”, “JJS”} are used as attribute atoms. After atoms are identified, they are lemmatized with the NLTK lemmatizer to reduce them to their dictionary form. Figure 1 illustrates some results. We extracted atoms for COCO, YouTube2Text and LSMDC. This gives 14,207 entities, 4,736 actions and 8,671 attributes for COCO; 6,922 entities, 2,561 actions and 2,637 attributes for YouTube2Text; and 12,895 entities, 4,258 actions and 8,550 attributes for LSMDC. Note that although the total number of atoms in each category may be large, atom frequency varies. In addition, the language parser does not guarantee perfect tags. Therefore, when atoms are used in training the oracle, we sort them according to their frequency and use the more frequent ones first, which also gives priority to atoms with larger coverage, as detailed in Section 5.2 below.
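The tag-based heuristic can be sketched as follows. The sketch assumes the caption has already been tagged with Penn Treebank part-of-speech tags (by the Stanford parser or any other tagger) and is passed in as `(word, tag)` pairs. The `lemmatize` argument is a hypothetical hook where a lemmatizer could be plugged in; by default it leaves words unchanged.

```python
ENTITY_TAGS = {"NN", "NNP", "NNPS", "NNS", "PRP"}
ACTION_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
ATTRIBUTE_TAGS = {"JJ", "JJR", "JJS"}

def extract_atoms(tagged_tokens, lemmatize=lambda word, category: word):
    """Group the words of one tagged caption into the three atom categories.

    tagged_tokens: list of (word, pos_tag) pairs.
    lemmatize: callable (word, category) -> lemma; identity by default.
    Returns a dict mapping category name to a set of unique atoms.
    """
    atoms = {"entity": set(), "action": set(), "attribute": set()}
    for word, tag in tagged_tokens:
        if tag in ENTITY_TAGS:
            category = "entity"
        elif tag in ACTION_TAGS:
            category = "action"
        elif tag in ATTRIBUTE_TAGS:
            category = "attribute"
        else:
            continue  # determiners, prepositions, etc. are not atoms
        atoms[category].add(lemmatize(word.lower(), category))
    return atoms

# Hand-tagged toy caption (tags written by hand for illustration).
tagged = [("A", "DT"), ("young", "JJ"), ("man", "NN"), ("is", "VBZ"),
          ("riding", "VBG"), ("a", "DT"), ("horse", "NN")]
for category, words in extract_atoms(tagged).items():
    print(category, sorted(words))
```

A caption with no matching tags yields three empty sets, which plays the role of the empty atom set marked “NA” in Figure 1.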

We conducted a simple human evaluation to confirm that the extracted atoms are indeed predominantly visual. As it might be impractical to evaluate all the extracted atoms for all three datasets, we focus on the 150 most frequent atoms. This evaluation is intended to match the last column of Table 2, where current state-of-the-art models have a capacity equivalent to perfectly capturing fewer than 100 atoms from each of the three categories. Subjects are asked to cast their votes independently. The final decision on whether an atom is visual is made by majority vote. Table 1 shows the ratio of atoms flagged as visual by this procedure.

              entities  actions  attributes
COCO          92%       85%      81%
YouTube2Text  95%       91%      72%
LSMDC         83%       87%      78%

Table 1: Human evaluation of the proportion of atoms that are voted as visual. It is clear that the atoms extracted from the three categories are dominated by visual elements, hence verifying the procedure described in Section 3.2. Another observation is that entities and actions tend to be more visual than attributes according to human perception.
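The ratios in Table 1 come from a majority vote per atom. A minimal sketch of that aggregation is given below; the vote data is invented for illustration, and treating a tie as “not visual” is an assumption, since the text does not say how ties are resolved.

```python
def visual_ratio(votes_per_atom):
    """Fraction of atoms whose majority vote is 'visual'.

    votes_per_atom: dict mapping atom -> list of booleans, one per subject
                    (True means the subject judged the atom visual).
    A strict majority is required; ties count as non-visual (assumption).
    """
    visual = sum(
        1 for votes in votes_per_atom.values()
        if 2 * sum(votes) > len(votes)
    )
    return visual / len(votes_per_atom)

# Invented votes from three subjects on four atoms.
votes = {
    "man": [True, True, False],
    "red": [True, True, True],
    "try": [False, False, True],
    "find": [False, True, False],
}
print(visual_ratio(votes))  # 0.5
```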

5.2 Training

After the atoms are extracted, they are sorted according to the frequency with which they appear in the dataset, with the most frequent one leading the sorted list. Taking the first $k$ items from this list gives the top $k$ most frequent ones, forming a bag of atoms denoted by $a^{(k)}$, where $k$ is the size of the bag. Conditioned on the atom bag, the oracle is maximized as in Equ. (2).
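The frequency-based selection can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: frequency is counted as the number of captions an atom appears in, and the bag used to condition on a given caption is taken to be the caption's own atoms that fall within the top-k list. Ties in frequency are broken alphabetically to keep the result deterministic.

```python
from collections import Counter

def top_k_atoms(caption_atoms, k):
    """Return the k most frequent atoms across all captions.

    caption_atoms: list of atom lists, one per caption.
    """
    counts = Counter(atom for atoms in caption_atoms for atom in set(atoms))
    ranked = sorted(counts, key=lambda atom: (-counts[atom], atom))
    return ranked[:k]

def restrict_to_top_k(atoms, allowed):
    """Keep only the atoms of one caption that belong to the allowed list."""
    return sorted(set(atoms) & set(allowed))

captions = [["man", "ride", "horse"], ["man", "horse"], ["man", "run"]]
top2 = top_k_atoms(captions, k=2)
print(top2)                                         # ['man', 'horse']
print(restrict_to_top_k(["man", "run"], top2))      # ['man']
```

Increasing `k` enlarges the allowed list, so each caption is conditioned on more of its own atoms, which is what the x-axis of Figure 2 varies.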

To form captions, we used vocabularies of size 20k, 13k and 25k for COCO, YouTube2Text and LSMDC respectively. For all three datasets, models were trained on the training set with different configurations of (1) atom embedding size, (2) word embedding size and (3) LSTM state and memory size. To avoid overfitting, we also experimented with weight decay and Dropout (Hinton et al., 2012) to regularize models of different sizes. In particular, we used random hyper-parameter search (Bergstra and Bengio, 2012) over a range of values for (1), (2) and (3). Similarly, we performed random search over the weight decay coefficient and over whether or not to use dropout. Optimization was performed by SGD with a minibatch size of 128, using Adadelta (Zeiler, 2012) to automatically adjust the per-parameter learning rate. Model selection was done on the standard validation set, with an early stopping patience of 2,000 (training stops if no improvement is made after 2,000 minibatch updates). We report the results on the test splits.

5.3 Interpretation

Figure 2: Learned oracle on COCO (left), YouTube2Text (middle) and LSMDC (right). The number of atoms is varied on the x-axis and the oracles are computed on the y-axis on the test sets. The first row shows the oracles on BLEU and METEOR with $3k$ atoms, $k$ from each of the three categories. The second row shows the oracles when $k$ atoms are selected individually for each category. CIDEr is used for COCO and YouTube2Text as each test example is associated with multiple ground truth captions, as argued in Vedantam et al. (2015). For LSMDC, METEOR is used, as argued by Rohrbach et al. (2015a).

All three metrics (BLEU, METEOR and CIDEr) are computed with the Microsoft COCO Evaluation Server (Chen et al., 2015). Figure 2 summarizes the learned oracle with an increasing number of atoms $k$.

comparing oracle performance with existing models

We compare the performance of current state-of-the-art models against the established oracles in Figure 2. Table 2 shows the comparison on three different datasets. With Figure 2, one can easily associate a particular performance with the equivalent number of atoms perfectly captured across all three atom categories, as illustrated in Table 2, with the oracle shown in bold. It is somewhat surprising that state-of-the-art models have performance equivalent to capturing only a small number of “ENT” and “ALL” atoms. This experiment highlights the shortcomings of state-of-the-art visual models. By improving them, we could close the performance gap that currently exists with the oracles.

Model                        ENT  ACT   ATT   ALL
(Qi Wu et al., 2015)         200  2100  4000  50
(Yu et al., 2015)            60   500   1900  20
(Venugopalan et al., 2015a)  40   50    4000  10

Table 2: Measuring the semantic capacity of current state-of-the-art models. Using Figure 2, one can easily map a reported metric to the number of visual atoms captured. This establishes an equivalence between a model, the proposed oracle and a model’s semantic capacity. (“ENT” for entities. “ACT” for actions. “ATT” for attributes. “ALL” for all three categories combined. “B1” for BLEU-1, “B4” for BLEU-4. “M” for METEOR. “C” for CIDEr.) Note that CIDEr is between 0 and 10 according to Vedantam et al. (2015). The learned oracle is denoted in bold.
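Reading an equivalent atom count off an oracle curve can be sketched as a simple lookup: find the smallest number of atoms at which the oracle score reaches the reported score of a model. The curve values below are invented placeholders, not numbers from Figure 2, and the function name is hypothetical.

```python
def equivalent_atoms(oracle_curve, reported_score):
    """Smallest number of atoms whose oracle score reaches reported_score.

    oracle_curve: list of (num_atoms, oracle_score) pairs, one per trained
                  oracle; scores are assumed non-decreasing in num_atoms.
    Returns None if even the largest oracle scores below reported_score.
    """
    for num_atoms, score in sorted(oracle_curve):
        if score >= reported_score:
            return num_atoms
    return None

# Placeholder curve: (number of atoms, metric value), invented for illustration.
curve = [(0, 0.10), (10, 0.20), (50, 0.30), (200, 0.40)]
print(equivalent_atoms(curve, 0.28))  # 50
print(equivalent_atoms(curve, 0.50))  # None
```

Because oracles are trained only at a discrete set of atom counts, the lookup returns the nearest evaluated count from above rather than an interpolated value.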

quantify the diminishing return

As the number of atoms $k$ in $a^{(k)}$ increases, one would expect the oracle to improve accordingly. It is however not yet clear how fast this improvement is. In other words, the gain in performance may not be proportional to the number of atoms given when generating captions, due to atom frequencies and language modeling. Figure 2 quantifies this effect. The oracle on all three datasets shows a significant gain at the beginning, which diminishes quickly as more and more atoms are used.

Row 2 of Figure 2 also highlights the difference among actions, entities and attributes in generating captions. For all three datasets tested, entities played much more important roles, even more so than action atoms in general. This is particularly true on LSMDC where the gain of modeling attributes is much less than the other two categories.

Although visual atoms dominate the three atom categories, as shown in Section 5.1, as the atoms increase in number, more and more non-visual atoms may be included, such as “living”, “try”, “find” and “free”, which are relatively difficult to associate with a particular part of the visual inputs. Excluding non-visual atoms from the conditional language model can further tighten the oracle bound, as fewer hints are provided to it. The major difficulty lies in the labor of separating visual atoms from non-visual ones by hand, as to the best of our knowledge this is difficult to automate with heuristics.

atom accuracy versus atom quantity

We have assumed that the atoms are given, or in other words, that the prediction accuracy of atoms is 100%. In reality, one would hardly expect to have a perfect atom classifier. There is naturally a trade-off between the number of atoms one would like to capture and the accuracy with which they are predicted. Figure 3 quantifies this trade-off on COCO and LSMDC. It also indicates the upper limit of performance given different levels of atom prediction accuracy. In particular, we have replaced $a^{(k)}$ by $\hat{a}^{(k)}_r$, where a portion $r$ of $a^{(k)}$ is randomly selected and replaced by other randomly picked atoms not appearing in $a^{(k)}$. The case of $r = 0$ corresponds to the results shown in Figure 2. The larger the ratio $r$, the worse the assumed atom prediction is. The value of $r$ is shown in the legend of Figure 3. According to the figure, there are two options for improving the caption generation score: either keep the number of atoms fixed while improving the atom prediction accuracy, or keep the accuracy while increasing the number of included atoms. As state-of-the-art visual models already model around 1,000 atoms, we hypothesize that more could be gained by improving atom accuracy than by increasing the number of atoms detected by those models.
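The corruption procedure can be sketched as follows: a fraction of the atoms in a bag is chosen at random and each chosen atom is swapped for a different atom that does not appear in the bag. The function name, the rounding of the number of replaced atoms, and the explicit random generator argument are assumptions of this sketch.

```python
import random

def corrupt_atoms(bag, vocabulary, r, rng):
    """Replace a fraction r of the atoms in bag with atoms outside the bag.

    bag: list of unique atoms for one example.
    vocabulary: all atoms that may be used as replacements.
    r: corruption ratio in [0, 1]; r = 0 returns the bag unchanged.
    rng: a random.Random instance, passed in for reproducibility.
    """
    corrupted = list(bag)
    n_replace = round(r * len(corrupted))
    in_bag = set(corrupted)
    candidates = [atom for atom in vocabulary if atom not in in_bag]
    if n_replace > len(candidates):
        raise ValueError("not enough out-of-bag atoms to sample from")
    positions = rng.sample(range(len(corrupted)), n_replace)
    replacements = rng.sample(candidates, n_replace)  # without replacement
    for position, new_atom in zip(positions, replacements):
        corrupted[position] = new_atom
    return corrupted

vocab = ["man", "horse", "ride", "field", "dog", "car", "red", "jump"]
bag = ["man", "horse", "ride", "field"]
print(corrupt_atoms(bag, vocab, r=0.5, rng=random.Random(0)))
```

Sampling replacements without replacement keeps the corrupted bag free of duplicates, so its size stays equal to that of the original bag.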

Figure 3: Learned oracles with different atom precision ($r$ in red) and atom quantity (x-axis) on COCO (left) and LSMDC (right). The number of atoms is varied on the x-axis and the oracles are computed on the y-axis on the test sets. CIDEr is used for COCO and METEOR for LSMDC. It shows that one could increase the score either by improving $P(a^{(k)} \mid v)$ with a fixed $k$ or by increasing $k$. It also shows the maximal error bearable for different scores.

intrinsic difficulties of particular datasets

Figure 2 also reveals the intrinsic properties of each dataset. In general, the bounds on YouTube2Text are much higher than on COCO, with LSMDC the lowest. For instance, from the first column of the figure, taking 10 atoms in each case, BLEU-4 is around 0.15 for COCO, 0.30 for YouTube2Text and less than 0.05 for LSMDC. With little visual information to condition upon, a strong language model is required, which makes a dramatic difference across the three datasets. Therefore the oracle, when compared across different datasets, offers an objective measure of the difficulty of using them in the captioning task.

6 Discussion

This work formulates oracle performance for visual captioning. The oracle is constructed with the assumption of decomposing visual captioning into two consecutive steps. We have assumed the perfection of the first step where visual atoms are recognized, followed by the second step where language models conditioned on visual atoms are trained to maximize the probability of given captions. Such an empirical construction requires only automatic atom parsing and the training of conditional language models, without extra labeling or costly human evaluation.

Such an oracle enables us to gain insight into several important factors accounting for both the success and the failure of current state-of-the-art models. It further reveals model-independent properties of different datasets. We furthermore relax the assumption of perfect atom prediction. This sheds light on a trade-off between atom accuracy and atom coverage, providing guidance for future research in this direction. Importantly, our experimental results suggest that more effort is required in step one, where visual inputs are converted into visual concepts (atoms).

Despite its effectiveness shown in the experiments, the empirical oracle is constructed with the simplest atom extraction procedure and model parameterization in mind, which makes such a construction in a sense a “conservative” oracle.


The authors would like to acknowledge the support of the following agencies for research funding and computing support: IBM T.J. Watson Research, NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. We would also like to thank the developers of Theano (Theano Development Team, 2016) for developing such a powerful tool for scientific computing.