Reinforcing an Image Caption Generator Using Off-Line Human Feedback

11/21/2019 ∙ by Paul Hongsuck Seo, et al. ∙ Google, POSTECH, Seoul National University

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only used outcome of an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the amount of caption ratings is several orders of magnitude less than the caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, where policy gradients are estimated by samples from a distribution that focuses on the captions in a caption ratings dataset. Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges, and additionally on a different, multi-dimensional side-by-side human evaluation procedure.




1 Introduction

Image captioning is the task of automatically generating fluent natural language descriptions for an input image. However, measuring the quality of generated captions in an automatic manner is a challenging and yet-unsolved task; therefore, human evaluations are often required to assess the complex semantic relationships between a visual scene and a generated caption [30, 8, 36]. As a result, there is a mismatch between the training objective of the captioning models and their final evaluation criteria. The simplest and most frequently used training objective is maximum likelihood estimation (MLE) [33, 25, 21, 22, 6], while other approaches make use of handcrafted evaluation metrics, such as CIDEr [32], to optimize model parameters using reinforcement learning (RL) [29, 10, 4, 27]. However, these surrogate objectives capture only limited aspects of caption quality, and often fail to guide the training procedure towards models capable of producing outputs that are highly rated by human evaluators.

(a) Image caption
(b) Captions ratings
Figure 1: Examples from an image caption dataset [30] and a caption-ratings dataset [19]. (a) Images in the caption dataset are annotated with ground-truth captions written by humans. (b) Captions in the caption ratings dataset are generated by trained models and scored in [0, 1] (0: worst, 1: best) by human raters.

As a result of the need to understand the performance of the current models, human evaluation studies for measuring caption quality are frequently reported in the literature [30, 12, 11, 36]. In addition to an aggregate model performance, such human evaluation studies also produce a valuable by-product: a dataset of model-generated image captions with human-annotated quality labels, as shown in Figure 1(b). We argue that such a by-product, henceforth called a caption ratings dataset, can be successfully used to improve the quality of image captioning models, for several reasons. First, optimizing based on instance-level human judgments of caption quality represents a closer-to-truth objective for image captioning: generating more captions judged as good and fewer rated as poor by human raters. Second, besides highly-rated captions that serve as positive examples (i.e., what good captions look like), a caption ratings dataset also contains captions that are scored highly by a model but annotated as negative examples (i.e., what model-favored yet bad captions look like), which intuitively should be a useful signal for correcting common model biases. To the best of our knowledge, our work is the first to propose using human caption ratings directly for training captioning models.

Our goal is to leverage the signals from a pre-collected caption ratings dataset [19] for training an image captioning model. We propose a method based on policy gradient, where the human ratings are considered as rewards for generating captions (seen as taking actions) in an RL framework. Since the dataset provides ratings only for a small set of images and captions, we do not have a generic reward function for random image-caption pairs. Therefore, it is not straightforward to apply a policy gradient method that requires a reward for randomly sampled captions. To address this challenge, we use an off-policy technique and force the network to sample captions for which ratings are available in the dataset. We evaluate the effectiveness of our method using human evaluation studies on the T2 test set used for the Conceptual Captions Challenge, using both a similar human evaluation methodology and an additional, multi-dimensional side-by-side human evaluation strategy. Additionally, the human raters in our evaluation study are different from the ones that provided the caption ratings in [19], thereby ensuring that the results are independent of using a specific human-evaluator pool. The results of our human evaluations indicate that the proposed method improves the image captioning quality, by effectively leveraging both the positive and negative signals from the caption ratings dataset.

The main contributions of this paper are the following:

  • We propose to train captioning models using human ratings produced during evaluations of previous models.

  • We propose an off-policy policy gradient method to cope with the sparsity in available caption ratings.

  • We present a set of experiments using human evaluations that demonstrates the effectiveness of our approach.

2 Related Work

There have been multiple attempts to define metrics that evaluate the quality of generated captions. Several studies proposed automatic metrics using ground-truth captions. A few of them are adopted from the machine translation community and are based on n-gram matches between ground-truth and generated captions; BLEU [26] and ROUGE [20] measure precision and recall based on n-gram matches, respectively, while METEOR [5] incorporates alignments between n-gram matches. In the context of evaluating image caption quality specifically, CIDEr [32] and SPICE [3] utilize more corpus-level and semantic signals to measure matches between generated and ground-truth captions. Aside from these handcrafted metrics, a recent study proposes to learn an automatic metric from a captioning dataset [8], while another uses semantic similarity between object labels identified in the image and the words in the caption [24].
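To make the precision/recall distinction concrete, the following is a minimal sketch of clipped n-gram overlap between a generated caption and a single ground-truth caption. It is not the full BLEU or ROUGE definition (BLEU, for instance, also combines several n-gram orders and applies a brevity penalty); the function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision_recall(candidate, reference, n=1):
    """Clipped n-gram overlap as precision (BLEU-like) and recall (ROUGE-like).

    Assumption: captions are tokenized by whitespace and there is one reference.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most as often as in both
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

generated = "a dog on the grass"
ground_truth = "a brown dog runs on the grass"
p1, r1 = ngram_precision_recall(generated, ground_truth, n=1)
p2, r2 = ngram_precision_recall(generated, ground_truth, n=2)
```

In this toy pair every generated unigram appears in the ground truth, so unigram precision is perfect while recall is not; neither number says whether the caption is a good description of the image, which is the gap human ratings are meant to fill.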

To overcome the limitations imposed by the automatic metrics, several studies evaluate their models using human judgments [30, 36, 11, 12]. However, none of them makes further use of the human-rated captions produced by these model evaluations. In this work, we show how one can utilize such human-rated captions for training better captioning models.

MLE with ground-truth captions has been widely adopted as the standard surrogate objective for training [33, 25, 21, 22, 6]. Aside from this main thrust, an additional line of research is concerned with optimizing models that maximize some automatic evaluation metric(s) using RL, in an attempt to bridge the mismatch between the training objective and the evaluation criteria [29, 10, 4, 27]. To our knowledge, this is the first study that proposes to optimize test-time scores of human judgment using a dataset generated by a human evaluation process.

Another line of related research is focused on learning from human feedback, which has been actively explored in the field of RL. Some approaches use binary human feedback to update an agent [16, 2, 23] whereas approaches with preference-based RL take human feedback as preferences between two action/state trajectories [1, 35, 34]. A common technique adopted in these methods is to learn an estimation model from human feedback to approximate the absent reward function [16, 7, 13]. However, these approaches assume that the models receive human feedback iteratively in a training loop; in contrast, our approach uses the caption ratings in an off-line manner, simply as a pre-existing annotation dataset. As a result, our method focuses on existing examples within the dataset, using an off-policy technique.

(a) Maximum likelihood estimation (b) On-policy policy gradient with rating estimator (c) Off-policy policy gradient with true human ratings
Figure 2: Illustration of training settings for three different methods. The 2D boxes represent the space of possible captions, and each dot is a caption with its color corresponding to human ratings (blue is high and red is low). The solid line in each plot indicates the virtual boundary separating low-quality and high-quality captions in terms of human ratings. The color gradation in (b) represents learned rating estimates for captions, while the dashed line is the model’s approximated virtual boundary between low- and high-quality estimates. In (c), the weight shown for each sample is its importance weight.

3 Methods

3.1 Caption Ratings Dataset

A sample in a caption ratings dataset is comprised of an image $I$, a machine-generated caption $c$, and a human judgment of the caption quality $r(c, I)$. For each image, multiple captions from several candidate models are available, some of which might be rated higher than others. In the setup used in this paper, the low-rated captions serve as negative examples, because human annotators judged them as bad captions (see examples in Figure 1(b)). The rating $r(c, I)$ is possibly an aggregate of multiple ratings from different raters. Section 4.1 provides more details of the caption ratings dataset that we employ.

We make a few observations that apply not only to image captioning, but more generally to the principle of generating annotations. Although a human-ratings dataset is usually just a by-product of human evaluations for past models, such a dataset can be valuable for improving models (as we show in this paper). A ratings dataset has several advantageous properties over traditional supervised-learning datasets. First, obtaining ratings for automatically generated outputs is significantly cheaper than collecting ground-truth labels, because it requires less rater training and less time spent annotating. Moreover, if human evaluation is performed anyway during a model’s development cycle, there is no additional cost associated with using these annotations for further improving the model. In addition, it is easy to capture consensus between multiple raters to reduce noise, e.g., by averaging their scores; it is completely non-trivial to achieve a similar effect from multiple ground-truth labels. Last but not least, the examples with a negative rating score provide valuable training signals, as they explicitly penalize the mistakes that appear in model outputs with high probability; this type of signal is completely lacking in traditional supervised-learning datasets.

3.2 Reinforcing Caption Generator using Ratings

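Given a caption ratings dataset $\mathcal{D}$ with triplets $(I, c, r(c, I))$, our objective is to maximize the expected ratings of the output captions $J(\theta)$, which is given by

$$J(\theta) = \mathbb{E}_{I \sim p_{\mathcal{D}}(I),\, c \sim p_\theta(c|I)}\left[ r(c, I) \right] = \sum_{I} p_{\mathcal{D}}(I) \sum_{c} p_\theta(c|I)\, r(c, I), \qquad (1)$$

where $p_{\mathcal{D}}(I)$ is the dataset distribution for $I$ and $p_\theta(c|I)$ is the conditional caption distribution estimated by a model parameterized by $\theta$.

Our objective in Eq. (1) exactly aligns with the reward maximization of RL, and therefore we apply the techniques of RL by configuring the captioning model as the agent, the rating scores as the reward, the input images as the states, and the captions as the actions. Specifically, we use a policy gradient method where an approximated policy gradient is computed using Monte-Carlo sampling,

$$\nabla_\theta J_{PG}(\theta) = \mathbb{E}_{\pi}\left[ \left(r(c, I) - b\right) \nabla_\theta \ln p_\theta(c|I) \right] \approx \frac{1}{S} \sum_{s=1}^{S} \left(r(c_s, I_s) - b\right) \nabla_\theta \ln p_\theta(c_s|I_s), \qquad (2)$$

where $\mathbb{E}_{\pi}$ represents $\mathbb{E}_{I \sim p_{\mathcal{D}}(I),\, c \sim p_\theta(c|I)}$, $I_s$ and $c_s$ are image and caption sampled from $p_{\mathcal{D}}(I)$ and $p_\theta(c|I)$, respectively, and $S$ is the number of samples. In the above equations, we subtract a baseline $b$ from the rating score $r(c_s, I_s)$ to reduce the variance of the estimator while keeping its original bias.

Although this formulation is straightforward, there remains a critical challenge in applying this technique to our task, since the dataset $\mathcal{D}$ contains only sparse information about $r(c, I)$ and the true ratings for most captions are unknown. Eq. (2) requires the rating $r(c_s, I_s)$ for a randomly sampled caption, which may not be present in the dataset $\mathcal{D}$. In the rest of this section, we present two alternative techniques for this challenge, and discuss the advantages of one alternative versus the other.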
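The Monte-Carlo estimator with a baseline can be illustrated on a toy problem. The sketch below is not the captioning model: it assumes a single image whose "captions" are three discrete choices scored by a softmax over free logits, and it assumes the rating of every choice is known, which is exactly what the real setting lacks. All names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_estimate(logits, rewards, baseline, num_samples, rng):
    """Monte-Carlo estimate of the gradient of the expected reward w.r.t. the logits."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        c = rng.choices(range(len(logits)), weights=probs)[0]  # on-policy sample
        advantage = rewards[c] - baseline
        for k in range(len(logits)):
            # For a softmax policy, d ln p(c) / d logit_k = 1[k == c] - p_k.
            indicator = 1.0 if k == c else 0.0
            grad[k] += advantage * (indicator - probs[k]) / num_samples
    return grad

# Toy setup: one image, three candidate captions with known ratings (assumed).
rewards = [1.0, 0.5, 0.0]
rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
grad = policy_gradient_estimate(logits, rewards, baseline=0.5, num_samples=16, rng=rng)

# Gradient ascent shifts probability mass toward the best-rated caption.
for _ in range(200):
    step = policy_gradient_estimate(logits, rewards, 0.5, 16, rng)
    logits = [z + 0.5 * g for z, g in zip(logits, step)]
final_probs = softmax(logits)
```

With the baseline set to 0.5, samples rated above it raise their own log-likelihood and samples rated below it lower theirs, while a sample rated exactly at the baseline contributes nothing.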

On-policy policy gradient with rating estimates

One approach to address the sparsity of the rating function is to construct a caption quality estimator, while keeping the sampling process on-policy; this is the method adopted in, e.g., [16, 7, 13]. Incidentally, it is also the expressed goal for the effort behind the caption ratings dataset in [19] that we use in this work.

For this purpose, we train a rating estimator $\tilde{r}(c, I; \phi)$ parameterized by $\phi$, by minimizing the mean squared error with respect to the true rating scores for the image-caption pairs in the caption ratings dataset. The trained estimator then replaces the true rating function $r(c_s, I_s)$ in Eq. (2) and the estimated policy gradient is now:

$$\nabla_\theta J_{PG}(\theta) \approx \frac{1}{S} \sum_{s=1}^{S} \left(\tilde{r}(c_s, I_s; \phi) - b\right) \nabla_\theta \ln p_\theta(c_s|I_s), \qquad (3)$$

where the samples $(I_s, c_s)$, their number $S$, and the baseline $b$ are as in Eq. (2). This technique allows us to obtain rating estimates for any image-caption pair, including ones that are not present in the dataset $\mathcal{D}$. The training objective with Eq. (3) is now maximizing the expected rating estimate of captions. This approach is effective only if the trained rating estimator generalizes well to unseen images and captions, and only to the extent to which the estimator performs well over the sampled search space. In our work, we have observed artifacts of the rating estimator that negatively impact the performance of this method, e.g., severely ill-formed captions for which the estimator had no training signal but to which it assigned high ratings. We report results for this method in Section 4.

Off-policy policy gradient with true ratings

This second method takes an orthogonal approach to address the sparsity of the rating function. We modify the sampling process in such a manner that it allows us to directly utilize the true ratings of the dataset (no estimation involved), while ensuring that the training procedure is not influenced by the captions whose true ratings are not available. More precisely, we adopt an off-policy policy gradient technique that uses an alternative distribution $q(c|I)$, instead of the true policy distribution $p_\theta(c|I)$, for sampling. The policy gradient in Eq. (2) is approximated as follows:

$$\nabla_\theta J_{PG}(\theta) = \mathbb{E}_{\beta}\left[ \frac{p_\theta(c|I)}{q(c|I)} \left(r(c, I) - b\right) \nabla_\theta \ln p_\theta(c|I) \right] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{p_\theta(c_s|I_s)}{q(c_s|I_s)} \left(r(c_s, I_s) - b\right) \nabla_\theta \ln p_\theta(c_s|I_s), \qquad (4)$$

where $\mathbb{E}_{\beta}$ represents $\mathbb{E}_{I \sim p_{\mathcal{D}}(I),\, c \sim q(c|I)}$ with an alternative caption distribution $q(c|I)$, and $p_\theta(c_s|I_s) / q(c_s|I_s)$ represents the importance weight for sample caption $c_s$ and image $I_s$. The alternative caption sampling distribution is defined as:

$$q(c|I) = (1 - \epsilon)\, p_{\mathcal{D}}(c|I) + \epsilon\, U(c), \qquad (5)$$

where $p_{\mathcal{D}}(c|I)$ is the conditional caption distribution in the dataset $\mathcal{D}$, $U(c)$ is the uniform distribution, and $\epsilon$ is a small positive weight assigned to the uniform distribution. In all experiments, we sample a single caption per image in the batch. While captions that are not present in the dataset may still be sampled from $U(c)$, we assign a reward $b$ to these captions, in order to prevent incorrect contributions to the gradient computation. In the policy gradient formulation, examples with reward value $b$ are considered to have no information, and their weight $r(c, I) - b = 0$ cancels out the entire term corresponding to these examples. Note that off-policy methods enable experience replay, i.e., repeating previous experiences with known rewards. In this view, our method can be seen as training a captioning model by replaying the experiences in the ratings dataset.
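The off-policy estimator can be sketched on the same kind of toy problem: a softmax policy over a handful of candidate captions for one image, of which only some carry human ratings. This is an illustrative sketch rather than the training code; in particular, it assumes that the dataset distribution over an image's rated captions is uniform, and every function and variable name is hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def behavior_distribution(ratings, num_captions, eps):
    """Mixture q = (1 - eps) * p_D + eps * U over caption indices.

    Assumption: p_D is uniform over the rated captions of the image.
    """
    q = [eps / num_captions] * num_captions
    for c in ratings:
        q[c] += (1.0 - eps) / len(ratings)
    return q

def off_policy_gradient(logits, ratings, baseline, eps, num_samples, rng):
    """Importance-weighted policy gradient estimate w.r.t. the logits."""
    probs = softmax(logits)
    q = behavior_distribution(ratings, len(logits), eps)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        c = rng.choices(range(len(logits)), weights=q)[0]  # sample from q, not the policy
        reward = ratings.get(c, baseline)  # unrated captions receive the baseline as reward
        weight = probs[c] / q[c]           # importance weight p_theta / q
        scale = weight * (reward - baseline) / num_samples
        for k in range(len(logits)):
            indicator = 1.0 if k == c else 0.0
            grad[k] += scale * (indicator - probs[k])
    return grad

# Toy image with 4 candidate captions; only captions 0 and 1 have human ratings.
ratings = {0: 1.0, 1: 0.0}
q = behavior_distribution(ratings, 4, eps=0.01)
grad = off_policy_gradient([0.0] * 4, ratings, baseline=0.5, eps=0.01,
                           num_samples=100, rng=random.Random(0))
```

Because unrated captions are given the baseline as their reward, any such sample multiplies its gradient term by zero, so only captions with true ratings move the parameters.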

Curriculum learning

As our training condition, we assume access to both a captioning dataset and a caption ratings dataset. Under a curriculum learning procedure, we first train a model by MLE on the captioning dataset, and then fine-tune the model with the above methods using the caption ratings dataset. To avoid overfitting during fine-tuning, we add the MLE loss on the captioning dataset as a regularization term. Given the caption labeled dataset $\mathcal{D}_{IC}$ and the caption ratings dataset $\mathcal{D}$, the final gradients w.r.t. the parameters are therefore computed as follows:

$$\nabla_\theta J(\theta) = \alpha\, \nabla_\theta J_{PG}(\theta) + \nabla_\theta J_{MLE}(\theta), \qquad (6)$$

where $J_{PG}(\theta)$ is the policy gradient objective on $\mathcal{D}$, $J_{MLE}(\theta)$ is the average log-likelihood of ground-truth captions in $\mathcal{D}_{IC}$, and $\alpha$ is a hyper-parameter that balances the regularization effect.

3.3 Comparing two policy gradient methods

Intuitively, the two policy gradient methods described in this section have strong relationships to MLE, since training signals are based on the gradients of caption log-likelihoods. We illustrate the training settings of MLE and the two proposed methods in Figure 2. In MLE, we train the model using positive captions only and treat all positive captions equally, as illustrated in Figure 2(a): the parameters are updated by the gradients of log-likelihoods of ground-truth captions. The on-policy policy gradient method (Eq. (3)) instead computes the gradients of reward-weighted log-likelihoods of sample captions over all possible captions. By sampling from the policy distribution (on-policy), we may sample captions whose true rating scores are not known (not in the dataset). The on-policy method thus approximates the rating function by a rating estimator, depicted by the background gradient in Figure 2(b). However, the mismatch between the true rating function and the estimator (depicted by the gap between solid and dashed lines) can degrade the quality of the resulting captioning model. On the other hand, the off-policy method focuses on the captions with true rating scores in the dataset, by changing the sampling distribution. In contrast to MLE, where each sample is viewed as equally correct and important, the off-policy method weights each caption by its rating, and therefore includes captions with negative feedback, as illustrated in Figure 2(c). Note that, in the off-policy method, the baseline determines the threshold for positive/negative feedback; captions with ratings below the baseline are explicitly penalized, while the others are positively rewarded.

4 Experiments

4.1 Datasets

Image captioning dataset

In the experiments, we use Conceptual Captions [30], a large-scale captioning dataset that consists of images crawled from the Internet, with captions derived from corresponding Alt-text labels on the webpages. The training and validation splits have approximately 3.3M and 16K samples, respectively.

Caption ratings dataset

In our experiments, we use the Caption-Quality dataset [19], recently introduced for the purpose of training quality-estimation models for image captions. We re-purpose this data as our caption ratings dataset. The dataset is divided into training, validation and test splits containing approximately 130K, 7K and 7K rated captions, respectively. Each image has an average of 4.5 captions (generated by different models that underwent evaluation). The captions are individually rated by asking raters the question “Is this a good caption for the image?”, with the answers “NO” or “YES” mapped to a 0 or 1 score, respectively. Each image/caption pair is evaluated by 10 different human raters, and an average rating score per caption is obtained by quantizing the resulting averages into a total of nine bins in [0, 1].
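As a rough illustration of how ten binary answers could become one of nine rating levels, the sketch below averages the votes and snaps the mean to the nearest level. The text does not specify the bin boundaries, so the equally spaced levels and nearest-level rounding used here are assumptions, and `aggregate_rating` is a hypothetical helper.

```python
def aggregate_rating(votes, num_bins=9):
    """Average binary votes and snap the mean to one of `num_bins` levels in [0, 1].

    Assumptions: levels are equally spaced (0, 1/8, ..., 1 for nine bins) and the
    mean is mapped to the nearest level.
    """
    mean = sum(votes) / len(votes)
    step = 1.0 / (num_bins - 1)
    return round(mean / step) * step

seven_of_ten = aggregate_rating([1] * 7 + [0] * 3)
all_yes = aggregate_rating([1] * 10)
all_no = aggregate_rating([0] * 10)
```

Under these assumptions, a caption approved by 7 of 10 raters has a mean of 0.7 and lands on the level 0.75.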

Conceptual Captions Challenge T2 dataset

To evaluate our models, we run human evaluation studies on the T2 test dataset used in the CVPR 2019 Conceptual Captions Challenge. The dataset contains 1K images sampled from the Open Images Dataset [18]. Note that the images in the Caption-Quality dataset are also sampled from the Open Images Dataset, but using a disjoint split, so there is no overlap between the caption ratings dataset we use for training and the T2 test set we use for evaluations.

4.2 Experimental Settings

(a) Baseline
(b) Baseline+
(c) On-policy policy gradient with rating estimator (OnPG)
(d) Off-policy policy gradient with true ratings (OffPG)
Figure 3: The training procedures for the different methods described. Blue and red boxes represent datasets and models, respectively. MLE is maximum likelihood estimation and MSE means mean-squared error minimization.
Metric            Type   Question
Goodness          S      Is this a good caption for the image?
Informativeness   SxS    Which caption provides more useful info for a person who cannot see this image?
Correctness       SxS    Which caption has fewer mistakes?
Fluency           SxS    Which caption has better language quality?
Table 1: Questions asked to raters in the two human evaluations. Type ‘S’ evaluation means single-caption rating. Type ‘SxS’ evaluation means side-by-side caption rating.

Model architecture

As the backbone model for image captioning we adopt the architecture described in [6], since it provides the highest single-model score in the Conceptual Captions Challenge (as of Sept. 5, 2019). Given an image, we extract two types of visual features: 1) ultra fine-grained semantic features using a pretrained network [14] from the entire image and 16 bounding boxes proposed by Faster R-CNN [28], and 2) label embeddings of objects predicted by the Google Cloud Vision API. We use these features with an encoder-decoder Transformer Network [31] to generate the captions.

In addition, we train a caption rating estimator for the OnPG method using the Caption-Quality dataset. The rating estimator extracts the same types of visual features as the captioning model above, and embeds the input caption with a pretrained BERT encoder [9]. We concatenate all these features after projecting them into a common embedding space and predict the human ratings of the input image/caption pair. To feed the generated captions from the captioning model directly into the rating estimator, we share the vocabulary (but not the token embeddings) between the two models. We fix the pretrained image feature extraction modules in both models during training, as well as the BERT encoder of the rating estimator. The rating estimator achieves a test performance that is close to the one reported (0.519 Spearman correlation) in [19]; however, as we will discuss further, its performance on the Caption-Quality test set does not transfer well to the needs of the OnPG method, which needs correct rating estimates for ill-formed captions as well.

Baselines and proposed models

We first train an MLE model as our baseline, trained on the Conceptual Captions training split alone. We refer to this model as Baseline. For a baseline approach that utilizes (some of) the Caption-Quality data, we merge positively-rated captions from the Caption-Quality training split with the Conceptual Captions examples and fine-tune the baseline model. We call this model Baseline+(t), where t is the rating threshold for the included positive captions. We train models for two variants, t ∈ {0.5, 0.7}, which results in 72K and 51K additional (pseudo-)ground-truth captions, respectively. Note that the Baseline+(t) approaches attempt to make use of the same additional dataset as our two reinforced models, OnPG and OffPG, but they need to exclude below-threshold captions due to the constraints in MLE.
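The threshold-based selection used for the Baseline+ variants amounts to a simple filter over rated captions. The sketch below uses made-up toy data and a hypothetical `positive_captions` helper; whether the comparison with the threshold is inclusive is an assumption.

```python
def positive_captions(rated_samples, threshold):
    """Keep (image_id, caption) pairs rated at or above `threshold`.

    Assumption: the threshold is inclusive. Captions below it are dropped,
    since MLE has no way to use them as negative examples.
    """
    return [(image_id, caption)
            for image_id, caption, rating in rated_samples
            if rating >= threshold]

# Toy rated samples: (image id, model-generated caption, aggregated human rating).
rated = [
    ("img1", "a dog running on the grass", 0.875),
    ("img1", "dog dog dog", 0.125),
    ("img2", "a city street at night", 0.5),
]
pseudo_gt_05 = positive_captions(rated, 0.5)
pseudo_gt_07 = positive_captions(rated, 0.7)
```

Raising the threshold keeps fewer but more reliably good captions, and in either case the low-rated captions are never seen by the model.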

In addition to the baselines, we train two reinforced models: one based on the on-policy policy gradient method with a rating estimator (OnPG), and the other based on the off-policy policy gradient method with the true ratings (OffPG). The differences between the methods are shown in Figure 3.

Training details

We train Baseline using the Adam optimizer [15] on the training split of the Conceptual Captions dataset for 3M iterations with a batch size of 4,096. The learning rate is warmed up for 20 epochs and exponentially decayed by a factor of 0.95 every 25 epochs.

The Baseline+(t) models are obtained by fine-tuning Baseline on the merged dataset for 1M iterations, with the same decay factor. For OnPG, because its memory footprint increases significantly due to the additional parameters of the rating estimator, we reduce the batch size for training this model by a factor of 0.25; the value of the baseline $b$ in Eq. (2) is set to the moving average of the rating estimates. During OffPG training, for each batch, we sample half of the examples from the Conceptual Captions dataset and the other half from the Caption-Quality dataset; $b$ is set to the average of the ratings in the dataset.

                   Average                       Voting
Model              Goodness       Improvement    Goodness       Improvement
Baseline           66.23±0.60%                   66.30±1.49%
Baseline+ (0.5)    66.68±0.61%    0.45%          66.50±1.48%    0.20%
Baseline+ (0.7)    65.83±0.62%    -0.40%         65.20±1.51%    -1.10%
OnPG               65.97±0.61%    -0.26%         66.40±1.48%    0.10%
OffPG              68.42±0.61%    3.19%          69.70±1.46%    3.40%
Table 2: Human evaluation single-caption results: Goodness scores for models (higher is better). The improvement columns show improvements over Baseline. Note that all score increases of Baseline+(t) and OnPG are within the error range.

Model              Informativeness   Correctness    Fluency
Baseline+ (0.5)    1.78±0.85%        0.18±0.49%     0.10±0.28%
Baseline+ (0.7)    0.70±0.58%        0.68±0.33%     0.23±0.15%
OnPG               -0.33±0.90%       -0.35±0.62%    0.08±0.20%
OffPG              7.45±1.06%        5.90±0.80%     1.69±0.31%
Table 3: Human evaluation side-by-side comparisons against the baseline. Positive values denote superior performance compared to the baseline. Note that some score increases for Baseline+(t) and OnPG are within the error range.

4.3 Evaluations

We run two sets of human evaluation studies to evaluate the performance of our models and baselines, using the T2 dataset (1K images). For every evaluation, we generate captions using beam search (beam size of 5).

Single-caption evaluation

In the first type of evaluation, 6 distinct raters are asked to judge each image caption as good or bad. They are shown the image and caption with the “Goodness” question prompt shown in Table 1. The bad or good rating is translated to 0 or 1, respectively. We measure the “average” goodness score as the average of all the ratings over the test set. We also report a “voting” score (the metric reported on the Conceptual Captions Challenge leaderboard), which is the average of the binarized score for each caption based on majority voting. Note that both the “average” and “voting” scores are in the range [0, 1], where higher values denote better model performance.
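The difference between the two goodness scores is easiest to see on a small example. The sketch below is illustrative only: the votes are invented, `goodness_scores` is a hypothetical helper, and the handling of a 3-to-3 tie among six raters (counted here as not good) is an assumption that the text does not settle.

```python
def goodness_scores(votes_per_caption):
    """Return the (average, voting) goodness scores for a list of per-caption votes.

    average: mean of all individual 0/1 ratings over the test set.
    voting:  fraction of captions that a strict majority of raters judged good.
    Assumption: a tie does not count as a majority.
    """
    all_votes = [v for votes in votes_per_caption for v in votes]
    average = sum(all_votes) / len(all_votes)
    majority_good = [1 if 2 * sum(votes) > len(votes) else 0 for votes in votes_per_caption]
    voting = sum(majority_good) / len(majority_good)
    return average, voting

# Three captions, each judged by 6 raters (1 = good, 0 = bad).
votes = [
    [1, 1, 1, 1, 0, 0],  # clear majority: good
    [1, 1, 1, 0, 0, 0],  # tie
    [0, 0, 0, 0, 0, 1],  # clear majority: bad
]
average, voting = goodness_scores(votes)
```

The average score credits every individual vote, whereas the voting score first reduces each caption to a single binary verdict, which is why its confidence intervals in Table 2 are wider.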

Side-by-side caption evaluation

In the other type of evaluation, we measure the relative improvement of a model against the Baseline model. Three professional raters are shown the input image and two captions (anonymized and randomly shuffled with respect to their left/right position) side-by-side. One of the captions is from a candidate model and the other is always from Baseline. We ask for relative judgments on three dimensions: Informativeness, Correctness and Fluency, using their corresponding questions shown in Table 1. Each of these dimensions allows a 5-way choice, shown below together with their corresponding scores:

The left caption is much better         -1.0
The left caption is slightly better     -0.5
The two captions seem equal              0.0
The right caption is slightly better    +0.5
The right caption is much better        +1.0

Each model is evaluated by the average rating scores from 3 distinct raters. As a result, we obtain 3 values for each model in the range [-1, 1], where a negative score means a performance degradation in the given dimension with respect to Baseline. For every human evaluation, we report confidence intervals based on bootstrap resampling.
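A percentile bootstrap is one common way to obtain such intervals. The sketch below is a generic version and not necessarily the exact procedure used for the reported numbers: the 95% level, the number of resamples, and the toy scores are all assumptions, and `bootstrap_ci` is a hypothetical helper.

```python
import random

def bootstrap_ci(values, num_resamples=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Assumptions: items are resampled independently with replacement, and the
    interval is read off the empirical quantiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(num_resamples))
    lower = means[int((1.0 - level) / 2.0 * num_resamples)]
    upper = means[min(int((1.0 + level) / 2.0 * num_resamples), num_resamples - 1)]
    return lower, upper

# Toy per-image side-by-side scores for one model and one dimension, in [-1, 1].
scores = [1.0, 0.5, 0.0, 0.0, -0.5, 0.5, 1.0, 0.0]
mean_score = sum(scores) / len(scores)
lower, upper = bootstrap_ci(scores)
```

If the resulting interval contains zero, the difference from Baseline cannot be distinguished from noise at the chosen level, which is how the "within the error range" remarks in the table captions can be read.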


Figure 4: Rating distribution for the correctness question. Tendencies are similar for the other side-by-side questions.
Figure 5: Results of OnPG and OffPG in side-by-side human comparisons while varying the weight of the policy gradient. Models are tested on 200 samples from the T2 dataset.
Figure 6: Qualitative examples of generated captions. Numbers represent informativeness, correctness and fluency scores (rated by comparing against those generated by Baseline).

4.4 Results

Single-caption evaluation

Table 2 shows the goodness scores from the single-caption evaluation. Both “average” and “voting” metrics clearly indicate that OffPG significantly improves over Baseline, while the other methods achieve only marginal gains, all of which are within the error range. The Baseline+(t) models use only 1.5% and 2.2% additional data, at t = 0.7 and t = 0.5, respectively, with insignificant impact. Moreover, these methods only maximize the likelihood of the additional captions, which are already generated with high likelihood by previous models trained on the same dataset, which results in self-reinforcement. In contrast, the policy gradient methods are allowed to utilize the negative feedback to directly penalize incorrect captions. However, OnPG fails to improve the quality, most likely because it relies on a noisy caption rating estimator that fails to generalize well over the large space of possible captions.

Side-by-side evaluations

The results from the side-by-side evaluations are shown in Table 3. The OffPG method achieves significant improvements on all three dimensions. This is an important result, considering that we trained the model using a caption ratings dataset that contains single-scalar scores for generic ‘goodness’ (as opposed to the well-defined dimensions along which the OffPG method scores have improved). These results demonstrate that the single-caption ‘goodness’ ratings encapsulate a signal for all these dimensions in their scalar value. Note that we observe the same tendency consistently under a variety of hyperparameter settings in our internal experiments.

Figure 4 highlights the way in which the OffPG method achieves its superiority over the Baseline model, compared to the other alternative models (using the ‘Correctness’ scores). For instance, over 75% of the captions for both Baseline+(t) models receive a 0.0 score (equal quality), and more than half of them are exactly identical to their corresponding Baseline captions. In contrast, OffPG makes a strong impact by explicitly penalizing the captions with negative feedback: fewer than 16% of its captions are identical to the corresponding Baseline captions. Moreover, we observe a large portion of captions with scores of 1.0 in favor of OffPG, indicating that many captions are significantly enhanced. We observe similar trends in all three metrics.

On-policy vs. off-policy performance

We compare the OnPG and OffPG methods in more depth, by performing ablation experiments for the hyper-parameter $\alpha$ (the weight for the policy gradient). Figure 5 shows the results of these ablation experiments, for which we performed side-by-side comparisons over a 200-image subset from the T2 dataset. The results indicate that a very small $\alpha$ limits the impact of the additional signal for both models, since the regularization effect from the original loss term becomes too strong. By allowing updates using policy gradient with a larger $\alpha$ value, OffPG improves the performance along all three dimensions, whereas the performance of OnPG starts degrading at higher $\alpha$ values. At a sufficiently large $\alpha$, OnPG suffers drastically from mode collapse and ends up generating a single caption for every image. This mode collapse is a result of poor generalization of the rating estimator: the collapsed captions are structurally ill-formed (e.g., an empty string, or a string with simply a period ‘.’), but they receive high rating estimates from the estimator. Although we can (and did) introduce some heuristics to avoid some of these failure cases in the estimator, we observe that OnPG training would continue to suffer from the estimator failing to generalize well over the vast space of possible captions. This observation is similar to the mode collapse phenomenon seen when training generative adversarial networks (GANs), but even more severe, as the estimator in OnPG is fixed (unlike the discriminators in GANs, which are trained simultaneously).

Another drawback of OnPG is that it increases the computational complexity significantly during training. In terms of memory usage, the rating estimator introduces 65% additional parameters, and uses more than double the memory for gradient computation compared to the other models. Also, the sequential caption sampling in OnPG slows down the training procedure, by breaking the parallelism in the Transformer computations, in addition to the time complexity incurred by the rating estimator. Empirically, OnPG is over 10 times slower than the others in processing the same number of examples in training. In contrast, the time and space complexities of OffPG remain the same as for Baseline and Baseline+(t), since the only difference is the use of scalar weights applied to the gradients of each caption log-likelihood, as shown in Figure 2.

Qualitative results

Figure 6 presents some qualitative example outputs for our models, showcasing the effectiveness of the OffPG method. We observe that the OffPG model is often successful at correcting arbitrary qualifiers present in the baseline outputs (e.g., ‘half marathon’ and ‘most beautiful’ in the second and third examples, respectively).

5 Conclusion

In this paper, we describe how to train an improved captioning model by using a caption ratings dataset, which is often a natural by-product of the development process of image captioning models. We show that an off-policy RL technique with an alternative sampling distribution successfully deals with the sparsity of information about the rating function, while an on-policy method has difficulties in obtaining an improved model, due to generalization issues of the ratings estimator. Although this conclusion may not be definitive, it is an important result that opens up additional lines of inquiry into the relative merits of these RL techniques.


  • [1] R. Akrour, M. Schoenauer, and M. Sebag (2012) APRIL: active preference learning-based reinforcement learning. In ECML-KDD, pp. 116–131. Cited by: §2.
  • [2] S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza (2014) Power to the people: the role of humans in interactive machine learning. AI Magazine 35 (4), pp. 105–120. Cited by: §2.
  • [3] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In ECCV, pp. 382–398. Cited by: §2.
  • [4] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086. Cited by: §1, §2.
  • [5] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §2.
  • [6] S. Changpinyo, B. Pang, P. Sharma, and R. Soricut (2019) Decoupled box proposal and featurization with ultrafine-grained semantic labels improve image captioning and visual question answering. In EMNLP-IJCNLP, Cited by: §1, §2, §4.2.
  • [7] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017) Deep reinforcement learning from human preferences. In NIPS, pp. 4299–4307. Cited by: §2, §3.2.
  • [8] Y. Cui, G. Yang, A. Veit, X. Huang, and S. Belongie (2018) Learning to evaluate image captioning. In CVPR, Cited by: §1, §2.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.2.
  • [10] N. Ding and R. Soricut (2017) Cold-start reinforcement learning with softmax policy gradients. In NIPS, Cited by: §1, §2.
  • [11] P. Dognin, I. Melnyk, Y. Mroueh, J. Ross, and T. Sercu (2019) Adversarial semantic alignment for improved image captions. In CVPR, pp. 10463–10471. Cited by: §1, §2.
  • [12] M. Forbes, C. Käser-Chen, P. Sharma, and S. Belongie (2019) Neural naturalist: generating fine-grained image comparisons. In EMNLP-IJCNLP, Cited by: §1, §2.
  • [13] B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei (2018) Reward learning from human preferences and demonstrations in Atari. In NeurIPS, pp. 8011–8023. Cited by: §2, §3.2.
  • [14] D. Juan, C. Lu, Z. Li, F. Peng, A. Timofeev, Y. Chen, Y. Gao, T. Duerig, A. Tomkins, and S. Ravi (2019) Graph-rise: graph-regularized image semantic embedding. arXiv preprint arXiv:1902.10814. Cited by: §4.2.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [16] W. B. Knox, B. D. Glass, B. C. Love, W. T. Maddox, and P. Stone (2012) How humans teach agents. International Journal of Social Robotics 4 (4), pp. 409–421. Cited by: §2, §3.2.
  • [17] P. Koehn (2004) Statistical significance tests for machine translation evaluation. In EMNLP, pp. 388–395. Cited by: §4.3.
  • [18] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §4.1.
  • [19] T. Levinboim, A. Thapliyal, P. Sharma, and R. Soricut (2019) Quality estimation for image captions based on large-scale human evaluations. arXiv preprint arXiv. Cited by: Figure 1, §1, §3.2, §4.1, §4.2.
  • [20] C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text summarization branches out, Cited by: §2.
  • [21] J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In CVPR, pp. 375–383. Cited by: §1, §2.
  • [22] J. Lu, J. Yang, D. Batra, and D. Parikh (2018) Neural baby talk. In CVPR, pp. 7219–7228. Cited by: §1, §2.
  • [23] J. MacGlashan, M. K. Ho, R. Loftin, B. Peng, G. Wang, D. L. Roberts, M. E. Taylor, and M. L. Littman (2017) Interactive learning from policy-dependent human feedback. In ICML, pp. 2285–2294. Cited by: §2.
  • [24] P. Madhyastha, J. Wang, and L. Specia (2019) VIFIDEL: evaluating the visual fidelity of image descriptions. In ACL, Cited by: §2.
  • [25] J. Mun, M. Cho, and B. Han (2017) Text-guided attention model for image captioning. In AAAI, Cited by: §1, §2.
  • [26] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. Cited by: §2.
  • [27] Y. Qin, J. Du, Y. Zhang, and H. Lu (2019) Look back and predict forward in image captioning. In CVPR, pp. 8367–8375. Cited by: §1, §2.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §4.2.
  • [29] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In CVPR, pp. 7008–7024. Cited by: §1, §2.
  • [30] P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565. Cited by: Figure 1, §1, §1, §2, §4.1.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §4.2.
  • [32] R. Vedantam, C. L. Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In CVPR, Cited by: §1, §2.
  • [33] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR, Cited by: §1, §2.
  • [34] C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz (2017) A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research 18 (1), pp. 4945–4990. Cited by: §2.
  • [35] C. Wirth, J. Fürnkranz, and G. Neumann (2016) Model-free preference-based reinforcement learning. In AAAI, Cited by: §2.
  • [36] S. Zhao, P. Sharma, T. Levinboim, and R. Soricut (2019) Informative image captioning with external sources of information. In ACL, Cited by: §1, §1, §2.