
A Better Variant of Self-Critical Sequence Training

In this work, we present a simple yet better variant of Self-Critical Sequence Training. We make a simple change to the choice of baseline function in the REINFORCE algorithm. The new baseline brings better performance at no extra cost, compared to the greedy decoding baseline.





1 Introduction

Self-Critical Sequence Training (SCST), since its release, has been a popular way to train sequence generation models. While originally proposed for image captioning, SCST has not only become the new standard for training captioning models Yao et al. (2017); Dognin et al. (2019); Luo et al. (2018); Jiang et al. (2018); Liu et al. (2018b); Ling and Fidler (2017); Liu et al. (2018a); Chen et al. (2019); Yang et al. (2019); Zhao et al. (2018), but has also been applied to many other tasks, such as video captioning Li et al. (2018); Chen et al. (2018); Li and Gong (2019), reading comprehension Hu et al. (2017), summarization Celikyilmaz et al. (2018); Paulus et al. (2017); Wang et al. (2018); Pasunuru and Bansal (2018), image paragraph generation Melas-Kyriazi et al. (2018), and speech recognition Zhou et al. (2018).

SCST is used to optimize generated sequences with respect to a non-differentiable objective, usually an evaluation metric: for example, CIDEr for captioning or ROUGE for summarization. To optimize such an objective, SCST adopts REINFORCE with a baseline Williams (1992), where a "Self-Critical" baseline is used; specifically, the score of the greedy decoding output serves as the baseline. This was shown to be better than the learned baseline functions more commonly used in the reinforcement learning literature.

In this work, we present a different choice of baseline which, to the best of our knowledge, was first proposed in Mnih and Rezende (2016). As elaborated in Sec. 3, this baseline can also be described as a variant of "Self-Critical". The method is simple, yet faster and more effective than the greedy decoding baseline used in SCST.

2 Recap for SCST

MIXER Ranzato et al. (2015) was the first to use the REINFORCE algorithm for sequence generation training. It uses a learned function approximator as the baseline.

SCST inherits the REINFORCE algorithm from MIXER but discards the learned baseline function. Instead, SCST uses the reward of the greedy decoding result as the baseline, achieving better captioning performance and lower gradient variance.

2.1 Math formulation

The goal of SCST, for example in captioning, is to maximize the expected CIDEr score of generated captions:

$$\max_\theta \; \mathbb{E}_{\hat{c}\sim p_\theta(\cdot|I)}\big[R(\hat{c})\big]$$

where $\hat{c}$ is a sampled caption; $I$ is the image; $p_\theta$ is the captioning model parameterized by $\theta$; and $R(\cdot)$ is the CIDEr scoring function.

Since this objective is not differentiable with respect to $\theta$, back-propagation is not directly feasible. To optimize it, a policy gradient method, specifically REINFORCE with a baseline Williams (1992), is used:

$$\nabla_\theta \mathbb{E}_{\hat{c}\sim p_\theta}\big[R(\hat{c})\big] = \mathbb{E}_{\hat{c}\sim p_\theta}\big[\big(R(\hat{c}) - b\big)\,\nabla_\theta \log p_\theta(\hat{c}|I)\big]$$

The policy gradient method allows estimating the gradient from individual samples (the right-hand side) and applying gradient ascent. To reduce the variance of the estimation, a baseline $b$ is needed, and $b$ has to be independent of $\hat{c}$.

In SCST, the baseline is set to be the CIDEr score of the greedy decoding caption, denoted as $\hat{c}^*$. Thus, we have

$$\nabla_\theta \mathbb{E}_{\hat{c}\sim p_\theta}\big[R(\hat{c})\big] \approx \big(R(\hat{c}) - R(\hat{c}^*)\big)\,\nabla_\theta \log p_\theta(\hat{c}|I)$$
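As a minimal sketch of this estimator (the function name and the scalar framing are ours, not from the paper), the surrogate loss for one sampled caption can be written as:

```python
def scst_loss(sample_log_prob, sample_reward, greedy_reward):
    """Surrogate loss for SCST's REINFORCE-with-baseline update (one image).

    sample_log_prob: log p_theta(c_hat | I) of the sampled caption
                     (summed over its tokens), as produced by the model.
    sample_reward:   R(c_hat), e.g. the CIDEr score of the sampled caption.
    greedy_reward:   R(c_star), the CIDEr score of the greedy-decoded
                     caption, used as the baseline b.
    """
    advantage = sample_reward - greedy_reward
    # Differentiating this loss w.r.t. theta yields
    # -(R(c_hat) - R(c_star)) * grad log p_theta(c_hat | I),
    # i.e. the negated REINFORCE estimator, so minimizing it
    # performs gradient ascent on the expected reward.
    return -advantage * sample_log_prob
```

In practice the rewards are treated as constants (no gradient flows through the CIDEr computation); only `sample_log_prob` carries gradients.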

3 The Better SCST

The success of SCST comes from the better gradient variance reduction introduced by the greedy decoding baseline. In our variant, we use the baseline proposed in Mnih and Rezende (2016) to achieve even better variance reduction.

Following Mnih and Rezende (2016), we sample $K$ captions for each image when applying REINFORCE: $\hat{c}_1, \ldots, \hat{c}_K$, with each $\hat{c}_k \sim p_\theta(\cdot|I)$.

The baseline for each sampled caption is defined as the average reward of the rest of the samples. That is, for caption $\hat{c}_k$, its baseline is

$$b_k = \frac{1}{K-1}\sum_{j\neq k} R(\hat{c}_j)$$

Since each sample is independently drawn, $b_k$ is independent of $\hat{c}_k$ and thus a valid baseline. The final gradient estimation is

$$\nabla_\theta \mathbb{E}_{\hat{c}\sim p_\theta}\big[R(\hat{c})\big] \approx \frac{1}{K}\sum_{k=1}^{K}\big(R(\hat{c}_k) - b_k\big)\,\nabla_\theta \log p_\theta(\hat{c}_k|I)$$
Note that $b_k$ is an estimation of the expected reward, which is similar to the learning objective of value functions in other reinforcement learning algorithms. The expected reward is usually a good baseline choice in that it can effectively reduce gradient variance. In Sec. 4, we show empirically that our gradient variance is lower than that of SCST.

It is still a "Self-Critical" baseline, because the critic still comes from the model itself: its other sampled outputs, instead of its greedy decoding output.
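The leave-one-out baseline above can be sketched as follows (the numpy framing and the function name are ours): for one image, given the log-probabilities and rewards of the $K$ samples, the surrogate loss is:

```python
import numpy as np

def loo_scst_loss(log_probs, rewards):
    """REINFORCE surrogate loss with the leave-one-out baseline of
    Mnih and Rezende (2016), for one image.

    log_probs: shape (K,), log p_theta(c_k | I) for the K sampled captions.
    rewards:   shape (K,), R(c_k), e.g. CIDEr scores (treated as constants).
    """
    K = len(rewards)
    # b_k = average reward of the other K-1 samples: (sum - R_k) / (K - 1).
    baselines = (rewards.sum() - rewards) / (K - 1)
    advantages = rewards - baselines
    # Average the K per-sample estimators; minimizing this surrogate
    # performs gradient ascent on the expected reward.
    return -(advantages * log_probs).mean()
```

Computing the baselines via `rewards.sum() - rewards` avoids an explicit loop over the $K$ leave-one-out sums.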

4 Experiments

We first pretrain all models using the standard cross-entropy loss and then switch to Self-Critical training.

For a fair comparison, during the Self-Critical stage we always sample 5 captions for each image ($K=5$), for both SCST and our variant.

All the experiments are done on the COCO captioning dataset Lin et al. (2014). The scores are obtained on the Karpathy test split Karpathy and Fei-Fei (2015) using beam search with beam size 5, unless otherwise noted.


Since no extra greedy decoding pass is needed, our method is slightly faster than SCST.

Performance on different model architectures

We experiment with four different architectures. FC and Att2in are from SCST Rennie et al. (2017). UpDown is from Anderson et al. (2018). Transformer is adapted from Vaswani et al. (2017) for the captioning task.

Table 1 shows that our variant is better than SCST on all architectures, especially on Transformer.

Bleu1 Bleu2 Bleu3 Bleu4 ROUGE_L METEOR CIDEr SPICE
FC(SCST) 74.7 57.8 43.0 31.7 54.0 25.2 104.5 18.4
FC(Ours) 74.9 57.9 43.0 31.6 54.1 25.4 105.3 18.6
Att2in(SCST) 78.0 61.8 47.0 35.3 56.7 27.1 117.4 20.5
Att2in(Ours) 78.4 62.1 47.3 35.6 56.9 27.3 119.5 20.7
UpDown(SCST) 79.4 63.3 48.6 36.7 57.6 27.9 122.7 21.5
UpDown(Ours) 80.0 63.9 49.1 37.2 57.8 28.0 123.9 21.5
Transformer(SCST) 80.0 64.6 50.3 38.6 58.4 28.6 126.6 22.2
Transformer(Ours) 80.7 65.6 51.3 39.4 58.7 28.9 129.6 22.8
Table 1: The performance of our method on different model architectures. The numbers are from authors’ own implementation.

Different training hyperparameters

Here we adopt a different training setting ('Long') for the UpDown model. The 'Long' setting uses a larger batch size and a longer training time. Table 2 shows that there is always a gap between our method and SCST, which cannot be closed by longer training or a larger batch size.

Bleu1 Bleu2 Bleu3 Bleu4 ROUGE_L METEOR CIDEr SPICE
UpDown+SCST 79.4 63.3 48.6 36.7 57.6 27.9 122.7 21.5
UpDown+Ours 80.0 63.9 49.1 37.2 57.8 28.0 123.9 21.5
UpDown(Long)+SCST 80.3 64.5 49.9 38.0 58.3 28.4 127.2 21.9
UpDown(Long)+Ours 80.4 64.7 50.1 38.1 58.4 28.5 127.9 22.0
Table 2:

The performance of UpDown model with SCST/Ours under two different hyperparameter settings.

Multiple runs

Table 3 shows that our variant is consistently better than SCST across different random seeds. All the models use the 'Long' setting with the UpDown model.

Specifically, we pretrain 5 models using cross-entropy loss, then apply SCST and our method to each. The same random seed (RS) means the two runs share the same pretrained model.

Bleu1 Bleu2 Bleu3 Bleu4 ROUGE_L METEOR CIDEr SPICE
RS1+SCST 80.3 64.5 49.9 38.0 58.3 28.4 127.2 21.9
RS1+Ours 80.4 64.7 50.1 38.1 58.4 28.5 127.9 22.0
RS2+SCST 80.2 64.5 49.9 37.9 58.3 28.3 127.2 21.9
RS2+Ours 80.2 64.5 50.0 38.1 58.2 28.4 128.0 22.0
RS3+SCST 80.2 64.5 50.0 38.1 58.3 28.3 127.3 21.8
RS3+Ours 80.2 64.7 50.2 38.3 58.3 28.4 127.9 22.0
RS4+SCST 80.2 64.5 49.9 37.9 58.2 28.3 127.0 21.8
RS4+Ours 80.2 64.5 50.0 38.0 58.3 28.5 127.7 22.0
RS5+SCST 80.2 64.6 50.2 38.4 58.4 28.5 127.6 21.9
RS5+Ours 80.2 64.5 49.8 37.9 58.3 28.4 127.8 22.0
Mean(SCST) 80.2 64.5 50.0 38.0 58.3 28.4 127.3 21.8
Mean(Ours) 80.2 64.6 50.0 38.1 58.3 28.4 127.9 22.0
Table 3: Within each of the first 5 blocks, the two models share the same cross-entropy pretrained model (RS stands for random seed). The last block shows the average scores of the 5 models.

Training curves

Figure 1 shows the model performance on the validation set during training, after entering Self-Critical stage. The scores are averaged over the 5 UpDown(Long) models above.

Figure 1: Performance on validation set during training. (With UpDown(Long) + greedy decoding)

Is greedy decoding necessary for SCST?

We also experiment with a variant of SCST, replacing the greedy decoding output with a separately sampled output. (This is similar to our method with $K=2$.)

Table 4 shows that the one-sample baseline is worse than greedy decoding. This is as expected: using a single sample to estimate the expected reward is too noisy, resulting in larger gradient variance, while the reward of the greedy decoding output may be biased but is more stable. It also shows that it is important to use a sufficiently large $K$ to obtain a good estimate of the expected reward.

Bleu1 Bleu2 Bleu3 Bleu4 ROUGE_L METEOR CIDEr SPICE
UpDown(SCST) 79.4 63.3 48.6 36.7 57.6 27.9 122.7 21.5
UpDown(Sample) 79.6 63.5 48.7 36.7 57.7 27.8 122.1 21.3
Table 4: Replacing the greedy decoding output in SCST with a separately drawn sample.

Variance reduction

As stated in Sec. 3, the motivation for using the average-reward baseline is better variance reduction. Here we show that it is indeed better in practice.

The gradient variance is calculated as follows. At the end of each epoch, we take the saved model and run it through the training set. We obtain the gradients from each training batch and calculate, for each parameter, the variance of its gradient across batches. To get a single value, we average over all the parameters. A mathematical expression of this process is:

$$\mathrm{Var} = \frac{1}{|\theta|}\sum_{i}\mathrm{Var}_b\big[g_i^{(b)}\big]$$

where $i$ is the index of each parameter; $b$ is the index of each batch; $\theta$ denotes the network parameters; and $g_i^{(b)}$ is the gradient of $\theta_i$ at batch $b$.
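This procedure can be sketched as follows (the dict-of-arrays representation of per-batch gradients is our assumption, not a detail from the paper):

```python
import numpy as np

def mean_gradient_variance(grads_per_batch):
    """Average per-parameter gradient variance across batches.

    grads_per_batch: list over batches b of dicts mapping a parameter
                     name to that parameter's gradient array at batch b.
    Returns a single scalar: the variance of each scalar parameter's
    gradient across batches, averaged over all parameters.
    """
    total_var, n_params = 0.0, 0
    for name in grads_per_batch[0]:
        # Stack the gradients of this parameter over all batches: (B, ...).
        stacked = np.stack([g[name] for g in grads_per_batch])
        # Variance across the batch axis, one value per scalar parameter.
        per_param_var = stacked.var(axis=0)
        total_var += per_param_var.sum()
        n_params += per_param_var.size
    return total_var / n_params
```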

As shown in Fig. 2, our method consistently attains lower variance than SCST.

Figure 2: The gradient variance on the training set. (Model: UpDown)

5 Code release

6 Conclusion

We propose a variant of the popular SCST, which can serve as a drop-in replacement for it. This variant reduces the gradient variance of REINFORCE by modifying the baseline function. We show that this method is effective on the image captioning task, and we believe it should benefit other sequence generation tasks as well.


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §4.
  • A. Celikyilmaz, A. Bosselut, X. He, and Y. Choi (2018) Deep communicating agents for abstractive summarization. arXiv preprint arXiv:1803.10357. Cited by: §1.
  • C. Chen, S. Mu, W. Xiao, Z. Ye, L. Wu, and Q. Ju (2019) Improving image captioning with conditional generative adversarial nets. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8142–8150. Cited by: §1.
  • Y. Chen, S. Wang, W. Zhang, and Q. Huang (2018) Less is more: picking informative frames for video captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 358–373. Cited by: §1.
  • P. Dognin, I. Melnyk, Y. Mroueh, J. Ross, and T. Sercu (2019) Adversarial semantic alignment for improved image captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10463–10471. Cited by: §1.
  • M. Hu, Y. Peng, Z. Huang, X. Qiu, F. Wei, and M. Zhou (2017) Reinforced mnemonic reader for machine reading comprehension. arXiv preprint arXiv:1705.02798. Cited by: §1.
  • W. Jiang, L. Ma, Y. Jiang, W. Liu, and T. Zhang (2018) Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 499–515. Cited by: §1.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §4.
  • L. Li and B. Gong (2019) End-to-end video captioning with multitask reinforcement learning. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 339–348. Cited by: §1.
  • Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei (2018) Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7492–7500. Cited by: §1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.
  • H. Ling and S. Fidler (2017) Teaching machines to describe images via natural language feedback. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5075–5085. Cited by: §1.
  • D. Liu, Z. Zha, H. Zhang, Y. Zhang, and F. Wu (2018a) Context-aware visual policy network for sequence-level image captioning. arXiv preprint arXiv:1808.05864. Cited by: §1.
  • X. Liu, H. Li, J. Shao, D. Chen, and X. Wang (2018b) Show, tell and discriminate: image captioning by self-retrieval with partially labeled data. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 338–354. Cited by: §1.
  • R. Luo, B. Price, S. Cohen, and G. Shakhnarovich (2018) Discriminability objective for training descriptive captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6964–6974. Cited by: §1.
  • L. Melas-Kyriazi, A. M. Rush, and G. Han (2018) Training for diversity in image paragraph captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 757–761. Cited by: §1.
  • A. Mnih and D. J. Rezende (2016) Variational inference for monte carlo objectives. arXiv preprint arXiv:1602.06725. Cited by: §1, §3, §3.
  • R. Pasunuru and M. Bansal (2018) Multi-reward reinforced summarization with saliency and entailment. arXiv preprint arXiv:1804.06451. Cited by: §1.
  • R. Paulus, C. Xiong, and R. Socher (2017) A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304. Cited by: §1.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §2.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.
  • L. Wang, J. Yao, Y. Tao, L. Zhong, W. Liu, and Q. Du (2018) A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization. arXiv preprint arXiv:1805.03616. Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §1, §2.1.
  • X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694. Cited by: §1.
  • T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei (2017) Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4894–4902. Cited by: §1.
  • W. Zhao, B. Wang, J. Ye, M. Yang, Z. Zhao, R. Luo, and Y. Qiao (2018) A multi-task learning approach for image captioning.. In IJCAI, pp. 1205–1211. Cited by: §1.
  • Y. Zhou, C. Xiong, and R. Socher (2018) Improving end-to-end speech recognition with policy learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5819–5823. Cited by: §1.