Can adversarial training learn image captioning?

by   Jean-Benoit Delbrouck, et al.

Recently, generative adversarial networks (GANs) have gathered a lot of interest. Their efficiency in generating unseen samples of high quality, especially images, has improved over the years. In the field of Natural Language Generation (NLG), using the adversarial setting to generate meaningful sentences has proven difficult for two reasons: the lack of existing architectures that produce realistic sentences and the lack of evaluation tools. In this paper, we propose an adversarial architecture related to the conditional GAN (cGAN) that generates sentences according to a given image (a task also called image captioning). This attempt is the first that uses no pre-training or reinforcement methods. We also explain why our experimental settings can be safely evaluated and interpreted for further work.




1 Introduction

Generative adversarial networks (GAN, Goodfellow et al. (2014)) have attracted a lot of attention over the last years, especially in the field of image generation. GANs have shown great success in generating high-fidelity, diverse images with models learned directly from data. Recently, new architectures have been investigated to create class-conditioned GANs Brock et al. (2018) so that the model is able to generate a new image sample from a given ImageNet category. These networks are more broadly known as conditional GANs or cGANs Mirza and Osindero (2014), where the generation is conditioned on a label.

In the field of Natural Language Generation (NLG), on the other hand, a lot of effort has been made to generate structured sequences. In the current state of the art, recurrent neural networks (RNN; Graves (2013)) are trained to produce a sequence of words by maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. Scheduled sampling Bengio et al. (2015) and reinforcement learning Zoph and Le (2017) have also been investigated to train such networks. Unfortunately, training discrete probabilistic models with GANs has proven to be a very difficult task. Previous investigations require complicated training techniques such as policy gradient methods and pre-training, and often struggle to generate realistic sentences. Moreover, it is not always clear how NLG should be evaluated in an adversarial setting Semeniuta et al. (2018).

In this paper, we propose a cGAN-like architecture that generates a sentence according to a label, the label being an image to describe. This work is related to the image captioning task, which provides strict evaluation methods for any given captioning data-set. We also investigate whether a GAN can learn image captioning in a straightforward manner, which includes a fully differentiable end-to-end architecture and no pre-training. The generated sentences are then evaluated against the ground-truth captions given by the task. The widely used COCO caption data-set Lin et al. (2014) contains 5 human-annotated ground-truth descriptions per image, which justifies our choice of a generative adversarial setting whose goal is to generate realistic and diverse samples.

2 Related work

A few works can be related to ours. First, Yu et al. (2017) proposed Sequence Generative Adversarial Nets trained with policy gradient methods Sutton et al. (2000) and used synthetic data experiments to evaluate the training. Other works also investigated adversarial text generation with reinforcement learning and pre-training Guo et al. (2018); Dai et al. (2017). Finally, the closest work to ours is that of Press et al. (2017), who propose an adversarial setting without pre-training and without reinforcement. Our model differs in that we use an image as a conditional label to generate a sentence, that is, an image caption.

3 Adversarial image captioning

In this section, we briefly describe the model architecture used in our experiments.

As in any adversarial generative setting, our model is composed of a generator $G$ and a discriminator $D$. The generator $G$ is an RNN that uses a visual attention mechanism Xu et al. (2015) over an image $I$ to generate a distribution of probabilities $p_t$ over the vocabulary at each time-step $t$. During training, $G$ is fed a caption as the embedded ground-truth words and $D$ is fed with either the set of probability distributions from $G$ or the embedded ground-truth words of a real caption. $D$ has to say whether the input received is real or fake according to the image. $D$ is also the same RNN as $G$ but with different training weights. The RNN can be expressed as follows:


where $x_t$ is the embedded ground-truth symbol of word $t$ and $f_{att}$ the attention model over image $I$. $G$ and $D$ are both trained simultaneously with the following min-max objective:


where $x$ is an example from the true data and $G(z)$ a sample from the Generator. Variable $z$ is supposed to be Gaussian noise.
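The two kinds of discriminator input described in this section can be pictured with a small sketch. It assumes that a probability distribution over the vocabulary is mapped into the embedding space by a weighted sum of word embeddings, which the text does not spell out; the toy matrix `E` and the helper names `embed_one_hot` and `embed_distribution` are hypothetical.

```python
def embed_one_hot(token_id, embedding):
    """Embedding lookup for a ground-truth word (same as one-hot @ E)."""
    return list(embedding[token_id])


def embed_distribution(probs, embedding):
    """Expected embedding under a distribution over the vocabulary (probs @ E)."""
    dim = len(embedding[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding)) for d in range(dim)]


# Toy embedding matrix: 3 words, 2 dimensions (illustrative values only).
E = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]

real_step = embed_one_hot(2, E)                          # word from a real caption
fake_step = embed_distribution([0.1, 0.1, 0.8], E)       # generator output at one step
one_hot_as_dist = embed_distribution([0.0, 0.0, 1.0], E)  # one-hot seen as a distribution
```

Under this assumption a real word is just the special case of a distribution with all its mass on one token, so both inputs live in the same space and gradients can flow from the discriminator back to the generator's probabilities.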

4 Tips and tricks

Two tricks are important to make adversarial captioning work:

Gradient penalty for embeddings  As shown in equation 3, the discriminator receives half of the time a probability distribution over the vocabulary from $G$. This is fully differentiable, unlike a discrete selection of tokens (for example, an arg max over each distribution). A potential concern regarding our strategy of training the discriminator to distinguish between a sequence of 1-hot vectors from the true data distribution and a sequence of probabilities from the generator is that the discriminator can easily exploit the sparsity of the 1-hot vectors. However, a gradient penalty can be added to the discriminator loss to provide good gradients even under an optimal discriminator. The gradient penalty Gulrajani et al. (2017) is defined as $\lambda \, \mathbb{E}_{\hat{x}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big]$ with $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$, where $x$ is a real sample, $\tilde{x}$ is a generated sample and $\epsilon$ is a random number sampled from the uniform distribution $U[0, 1]$.
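The gradient penalty of Gulrajani et al. (2017) can be illustrated on a toy problem. The sketch below is not the authors' implementation: it uses a single sample instead of an expectation, a hypothetical linear critic, and a finite-difference gradient in place of automatic differentiation. The coefficient `lam=10.0` is an illustrative choice, not a value taken from this paper.

```python
import math
import random


def interpolate(real, fake, eps):
    """x_hat = eps * real + (1 - eps) * fake, element-wise."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]


def numerical_gradient(critic, x, h=1e-5):
    """Central finite-difference gradient of a scalar critic at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2.0 * h))
    return grad


def gradient_penalty(critic, real, fake, lam, rng):
    """Single-sample penalty: lam * (||grad critic(x_hat)||_2 - 1)^2."""
    eps = rng.random()  # eps drawn uniformly from [0, 1)
    x_hat = interpolate(real, fake, eps)
    grad = numerical_gradient(critic, x_hat)
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2


def make_linear_critic(weights):
    """Hypothetical critic D(x) = w . x, whose gradient is w everywhere."""
    return lambda x: sum(w * v for w, v in zip(weights, x))


real = [0.0, 0.0, 1.0]   # 1-hot vector from the true data
fake = [0.2, 0.3, 0.5]   # probability distribution from the generator
rng = random.Random(0)

unit_critic = make_linear_critic([0.6, 0.8, 0.0])    # gradient norm 1
steep_critic = make_linear_critic([0.0, 0.0, 2.0])   # gradient norm 2

penalty_unit = gradient_penalty(unit_critic, real, fake, lam=10.0, rng=rng)
penalty_steep = gradient_penalty(steep_critic, real, fake, lam=10.0, rng=rng)
```

A critic whose gradient has unit norm along the interpolation path pays almost no penalty, while a steeper critic is penalized quadratically. This pushes the critic toward gradients of norm 1 between the 1-hot and soft inputs.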

Dropout as noise  For the evaluation of a model to be consistent, we can’t introduce noise as input to our Generator. To work around this constraint, we provide noise only in the form of dropout to make our Generator less deterministic. Because we don’t want to sample from a latent space (our model doesn’t fall into the category of generative models), using only dropout is a good workaround in our case. Moreover, dropout has already shown success in previous generative adversarial work Isola et al. (2017).
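As a minimal sketch of how dropout can act as the only noise source, the snippet below applies inverted dropout to a toy hidden vector. The function name `dropout`, the vector `hidden` and the seed are hypothetical, and where exactly dropout is applied in the model is not reproduced here.

```python
import random


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if p <= 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


hidden = [0.5, -1.0, 2.0, 0.25]   # toy generator hidden state
rng = random.Random(0)

sample_a = dropout(hidden, 0.5, rng)   # one stochastic view of the state
sample_b = dropout(hidden, 0.5, rng)   # another draw from the same state
no_noise = dropout(hidden, 0.0, rng)   # p = 0 leaves the state unchanged
```

Each call with p > 0 draws a fresh random mask, so the same input can yield different outputs without any latent vector being fed to the generator.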

5 Experimentation

We use the MS-COCO data-set Lin et al. (2014), consisting of 414,113 image-description pairs. For our experiments, we only pick a subset of 50,000 training images; 1,000 images are used for validation.

Each ground-truth symbol is a word embedding from GloVe Pennington et al. (2014). All GRUs used are of size 256, so is . Image $I$ is extracted at the output of the pool-5 layer of ResNet-50 He et al. (2015). The attention mechanism consists of a simple element-wise product between and :

where and . Finally, the size of the following matrices are: where is the vocabulary size and .

As hyper-parameters, we set the batch size to 512, the gradient penalty and a dropout of p=0.5 is applied at the output of

in the Generator. We stop training if the BLEU score on the validation set doesn’t improve for 5 epochs.
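The stopping rule can be sketched as a patience counter over validation scores. The function name `early_stopping_epoch` and the score values are hypothetical, and treating a tie as "no improvement" is an assumption.

```python
def early_stopping_epoch(val_scores, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a new best score
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Hypothetical validation BLEU scores, one per epoch.
scores = [1.0, 2.0, 3.0, 3.0, 2.0, 2.5, 2.9, 3.0]
stop_at = early_stopping_epoch(scores, patience=5)
```

In this toy run the best score is reached at epoch 3 and the five following epochs never exceed it, so training stops at epoch 8.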

6 Results

(a) Ground truth : a group of people who are sitting on bikes
Generated caption : a group of people riding on the side of a car
(b) Ground truth : a kitchen with a stove a sink and a counter
Generated caption : a kitchen with a sink stove a sink and other
(c) Ground truth : a group of people standing around a kitchen
Generated caption : a group of people standing in a kitchen
Figure 1: Success cases of our adversarial captioning model. The model is able to recognize groups of people, some locations and objects. We also notice the correct use of verbs.

The best configuration, as described in section 5, gives a BLEU score Papineni et al. (2002) of 7.30. Figure 1 shows some of the best generated captions for given images. We observed that the model is able to recognize groups of people as well as some locations (such as a kitchen) and objects (such as a sink). The model also learned to use the correct verb for a given caption. For example, in Figure 1 the model is able to differentiate riding from standing.
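For readers unfamiliar with the metric, here is a simplified sentence-level BLEU in the spirit of Papineni et al. (2002): the geometric mean of clipped n-gram precisions up to 4-grams, times a brevity penalty. It is a sketch, not the evaluation script behind the score above, which is presumably corpus-level and on a 0 to 100 scale. It uses whitespace tokenization and no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU in [0, 1]."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # For each n-gram, the maximum count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference closest in length.
    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


# Caption pair (c) from Figure 1.
score_c = bleu("a group of people standing in a kitchen",
               ["a group of people standing around a kitchen"])
perfect = bleu("a kitchen with a sink", ["a kitchen with a sink"])
```

For pair (c) the four clipped precisions are 7/8, 5/7, 3/6 and 2/5, and both sentences have 8 tokens, so the brevity penalty is 1.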

(a) Ground truth : a nude man sitting on his bed while using his phone
Generated caption : a <unk> <unk> <unk> next to an table
(b) Ground truth : two people stand using laptops in a dark room with big stars on the wall
Generated caption : a kitchen with a sink and tiled sink
(c) Ground truth : an elephant using its trunk to blow the dirt off its face
Generated caption : a man of people sits in a kitchen
Figure 2: Worst generated captions (BLEU )

Nevertheless, we can identify two failure cases. First, the model often outputs sentences filled with the <unk> token. It is possible that the model has not been trained for long enough or on enough data. The Generator receives only a single adversarial feedback signal for all the words generated, so some words may not have received enough gradient to be used successfully. In general, the pool of words used is not very large: the words used in Figure 1 are related to the ones used in Figure 2. Secondly, the model sometimes outputs well-formed sentences (Figure 2 (b) and (c)) that are unrelated to the image. Here, it is possible that the conditional information has not been taken into account.

7 Conclusion

Figure 3: Result of the dropout on embedding and hidden state

In this paper, we made a first attempt at adversarial captioning without pre-training and reinforcement techniques. The task is challenging, especially since the generator G and discriminator D work with inputs of different sparsity. Only the WGAN with gradient penalty was able to give acceptable results; other techniques such as the relativistic GAN Jolicoeur-Martineau (2018) or WGAN-divergence Wu et al. (2018) didn’t work in our case. We also notice that the model was very sensitive to dropout. However, Figure 3 confirms our intuition that using no dropout is not beneficial for the generator (the bottom-left of the heat-map resulted in a BLEU score of 0).

There are a few improvements that can be made in future research. First, the attention model could be more sophisticated so that the visual signal is stronger. The size of the overall model could also be increased. Finally, the model should be trained on the full COCO training set. It is also possible that stopping training after 5 epochs without improvement is an issue, since the model could take time to converge.


  • [1] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179. Cited by: §1.
  • [2] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. Cited by: §1.
  • [3] B. Dai, S. Fidler, R. Urtasun, and D. Lin (2017) Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2970–2979. Cited by: §2.
  • [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [5] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1.
  • [6] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. CoRR abs/1704.00028. Cited by: §4.
  • [7] J. Guo, S. Lu, H. Cai, W. Zhang, Y. Yu, and J. Wang (2018) Long text generation via adversarial training with leaked information. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. Cited by: §5.
  • [9] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §4.
  • [10] A. Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734. Cited by: §7.
  • [11] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755. Cited by: §1, §5.
  • [12] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §1.
  • [13] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, pp. 311–318. Cited by: §6.
  • [14] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §5.
  • [15] O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf (2017) Language generation with recurrent generative adversarial networks without pre-training. arXiv preprint arXiv:1706.01399. Cited by: §2.
  • [16] S. Semeniuta, A. Severyn, and S. Gelly (2018) On accurate evaluation of gans for language generation. arXiv preprint arXiv:1806.04936. Cited by: §1.
  • [17] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2.
  • [18] J. Wu, Z. Huang, J. Thoma, D. Acharya, and L. Van Gool (2018) Wasserstein divergence for gans. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 653–668. Cited by: §7.
  • [19] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2048–2057. Cited by: §3.
  • [20] L. Yu, W. Zhang, J. Wang, and Y. Yu (2017) Seqgan: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [21] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §1.