1 Introduction
Generative adversarial networks (GAN; Goodfellow et al. (2014)) have attracted a lot of attention over the last years, especially in the field of image generation. GANs have shown great success at generating high-fidelity, diverse images with models learned directly from data. Recently, new architectures have been investigated to create class-conditioned GANs Brock et al. (2018), so that the model is able to generate a new image sample from a given ImageNet category. These networks are more broadly known as conditional GANs, or cGANs Mirza and Osindero (2014), where the generation is conditioned on a label. In the field of Natural Language Generation (NLG), on the other hand, a lot of effort has been made to generate structured sequences. In the current state of the art, recurrent neural networks (RNN; Graves (2013)) are trained to produce a sequence of words by maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. Scheduled sampling Bengio et al. (2015) and related training strategies Zoph and Le (2017) have also been investigated to train such networks. Unfortunately, training discrete probabilistic models with GANs has proven to be a very difficult task. Previous investigations require complicated training techniques such as policy gradient methods and pre-training, and often struggle to generate realistic sentences. Moreover, it is not always clear how NLG should be evaluated in an adversarial setting Semeniuta et al. (2018). In this paper, we propose a cGAN-like architecture that generates a sentence according to a label, the label being an image to describe. This work is related to the image captioning task, which comes with strict evaluation methods for any given captioning data-set. We also investigate whether GANs can learn image captioning in a straightforward manner, that is, with a fully differentiable end-to-end architecture and no pre-training. The generated sentences are then evaluated against the ground-truth captions provided by the task. The widely-used COCO caption data-set Lin et al. (2014) contains 5 human-annotated ground-truth descriptions per image, which justifies our choice of a generative adversarial setting, whose goal is to generate realistic and diverse samples.
2 Related work
A few works can be related to ours. First, Yu et al. (2017) proposed a sequence generative adversarial network (SeqGAN) trained with policy gradient methods Sutton et al. (2000) and used synthetic data experiments to evaluate the training. Other works have also investigated adversarial text generation with reinforcement learning and pre-training Guo et al. (2018); Dai et al. (2017). Finally, the closest work to ours is that of Press et al. (2017), which proposes an adversarial setting without pre-training and without reinforcement. Our model differs in that we use a conditional label, namely an image, to generate a sentence or image caption.
3 Adversarial image captioning
In this section, we briefly describe the model architecture used in our experiments.
As in any generative adversarial setting, our model is composed of a generator $G$ and a discriminator $D$. The generator $G$ is an RNN that uses a visual attention mechanism Xu et al. (2015) over an image $I$ to generate a probability distribution $p_t$ over the vocabulary at each time-step $t$. During training, $G$ is fed a caption as the embedded ground-truth words, while $D$ is fed either the sequence of probability distributions produced by $G$ or the embedded ground-truth words of a real caption. $D$ has to decide whether the input it receives is real or fake with respect to the image. $D$ uses the same RNN architecture as $G$ but with its own weights. The RNN can be expressed as follows:

$x_t = [\,e_{t-1}\,;\,a_t\,]$  (1)
$h_t = \mathrm{GRU}(x_t, h_{t-1})$  (2)
$p_t = \mathrm{softmax}(W\,h_t)$  (3)
$a_t = f_{att}(h_{t-1}, I)$  (4)

where $e_t$ is the embedded ground-truth symbol of word $w_t$ and $f_{att}$ the attention model over image $I$. $G$ and $D$ are both trained simultaneously with the following min-max objective:

$\min_G \max_D \;\mathbb{E}_{x \sim p_{data}}\big[\log D(x, I)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z, I), I)\big)\big]$  (5)

where $x$ is an example from the true data and $G(z, I)$ a sample from the Generator. Variable $z$ is supposed to be Gaussian noise.
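For concreteness, the sketch below shows one way the recurrent step could be implemented in PyTorch, assuming a GRU of hidden size 256, an element-wise attention over pooled image features, and dropout as the only source of noise (see Sections 4 and 5). Class and variable names are illustrative and not taken from our actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGRUCaptioner(nn.Module):
    """Sketch of the recurrent network used as the generator G (and, with
    separate weights, as the backbone of the discriminator D)."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256, img_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)     # project pool-5 features
        self.att_proj = nn.Linear(hidden_dim, hidden_dim)   # attention from the state
        self.gru = nn.GRUCell(emb_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)        # to a vocabulary distribution
        self.drop = nn.Dropout(p=0.5)                       # the only source of "noise"

    def step(self, e_prev, h_prev, img_feat):
        """One time-step: attend to the image, update the state, emit p_t (Eqs. 1-4)."""
        a_t = torch.sigmoid(self.att_proj(h_prev)) * self.img_proj(img_feat)
        h_t = self.gru(torch.cat([e_prev, a_t], dim=-1), h_prev)
        p_t = F.softmax(self.out(self.drop(h_t)), dim=-1)
        return p_t, h_t

# Usage sketch: one teacher-forced step on a batch of 4 images.
g = AttentionGRUCaptioner(vocab_size=10000)
e = torch.zeros(4, 300)        # embedded previous ground-truth word
h = torch.zeros(4, 256)        # previous hidden state
img = torch.randn(4, 2048)     # pool-5 image features
p_t, h = g.step(e, h, img)     # p_t: (4, 10000) probabilities over the vocabulary
```

On the real side, the discriminator is fed the embedded ground-truth words; on the fake side it is fed the sequence of $p_t$ vectors, which keeps the whole pipeline differentiable (see Section 4).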
4 Tips and tricks
Two tricks are important to make adversarial captioning work:
Gradient penalty for embeddings. As shown in Equation 3, half of the time the discriminator receives from $G$ a probability distribution over the vocabulary at each time-step, which keeps the pipeline fully differentiable, in contrast to sampling discrete symbols. A potential concern with training the discriminator to distinguish between sequences of 1-hot vectors from the true data distribution and sequences of probabilities from the generator is that the discriminator can easily exploit the sparsity of the 1-hot vectors. However, a gradient penalty can be added to the discriminator loss to provide good gradients even under an optimal discriminator. The gradient penalty Gulrajani et al. (2017) is defined as $\lambda\,\mathbb{E}_{\hat{x}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big]$ with $\hat{x} = \epsilon\,x + (1-\epsilon)\,\tilde{x}$, where $x$ is a real example, $\tilde{x}$ a generated one, and $\epsilon$ is a random number sampled from the uniform distribution $U[0, 1]$.
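As an illustration, here is a minimal PyTorch sketch of this penalty, interpolating between a real 1-hot caption and the generated probability sequence before scoring it with the conditional discriminator, in the spirit of Gulrajani et al. (2017). The discriminator signature and the coefficient value (10, the default from that paper) are assumptions.

```python
import torch

def gradient_penalty(discriminator, real_onehot, fake_probs, img_feat, lambda_gp=10.0):
    """WGAN-GP style penalty; real_onehot and fake_probs have shape (B, T, V)."""
    eps = torch.rand(real_onehot.size(0), 1, 1, device=real_onehot.device)  # epsilon ~ U[0, 1]
    x_hat = eps * real_onehot.detach() + (1.0 - eps) * fake_probs.detach()
    x_hat.requires_grad_(True)
    d_out = discriminator(x_hat, img_feat)                  # conditional score, shape (B,)
    grads = torch.autograd.grad(outputs=d_out.sum(), inputs=x_hat,
                                create_graph=True)[0]        # gradient of D w.r.t. x_hat
    grad_norm = grads.reshape(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()       # added to the discriminator loss
```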
Dropout as noise. For the evaluation of a model to be consistent, we can't introduce random noise as input to our Generator. To work around this constraint, we provide noise only in the form of dropout, which makes our Generator less deterministic. Because we don't want to sample from a latent space (our model doesn't fall into the usual category of generative models), using only dropout is a good work-around in our case. Moreover, dropout has already shown success in previous generative adversarial work Isola et al. (2017).
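A small sketch of what this means in practice: dropout layers can be kept in training mode while sampling, so that repeated decodings of the same image differ, and switched off for a deterministic evaluation. The helper below is illustrative only.

```python
import torch.nn as nn

def set_dropout_noise(model: nn.Module, enabled: bool = True):
    """Use dropout as the only noise source: keep Dropout layers stochastic
    even when the rest of the model is in eval mode."""
    model.eval()
    if enabled:
        for m in model.modules():
            if isinstance(m, nn.Dropout):
                m.train()   # re-enable dropout so two decodings of the same image differ

# set_dropout_noise(generator, enabled=True)   # stochastic captions (dropout acts as noise)
# set_dropout_noise(generator, enabled=False)  # deterministic captions for evaluation
```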
5 Experimentation
We use the MS-COCO data-set Lin et al. (2014), consisting of 414,113 image-description pairs. For our experiments, we only pick a subset of 50,000 training images; 1,000 images are used for validation.
Each ground-truth symbol is a word embedding from GloVe Pennington et al. (2014). All GRUs used are of size 256, and so is the hidden state $h_t$. The image representation $I$ is extracted at the output of the pool-5 layer of ResNet-50 He et al. (2015). The attention mechanism consists of a simple element-wise product between (a projection of) the recurrent state $h_{t-1}$ and the image features $I$, so that both operands have the same dimension. Finally, the output matrix $W$ projects the hidden state to the vocabulary, whose size we denote $V$.
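To illustrate the input pipeline, the snippet below extracts pool-5 features with torchvision's ResNet-50 and combines them with the recurrent state through an element-wise product; the exact projection layers are assumptions, since not all dimensions were specified here, and the GloVe word embeddings are assumed to be loaded separately.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Pool-5 image features: ResNet-50 without its final classification layer (2048-d).
resnet = models.resnet50(pretrained=True)
pool5 = nn.Sequential(*list(resnet.children())[:-1])
image = torch.randn(1, 3, 224, 224)              # a preprocessed input image
with torch.no_grad():
    img_feat = pool5(image).flatten(1)           # shape (1, 2048)

# Element-wise attention between the 256-d recurrent state and projected image features.
hidden_dim = 256
img_proj = nn.Linear(2048, hidden_dim)           # assumed projection to match dimensions
att_gate = nn.Linear(hidden_dim, hidden_dim)
h_prev = torch.zeros(1, hidden_dim)              # previous GRU state
a_t = torch.sigmoid(att_gate(h_prev)) * img_proj(img_feat)   # attended visual input, (1, 256)
```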
As hyper-parameters, we set the batch size to 512 and use the gradient penalty described in Section 4; a dropout of p=0.5 is applied at the output of the recurrent cell in the Generator. We stop training if the BLEU score on the validation set doesn't improve for 5 epochs.
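The stopping criterion can be summarized by the small loop below, where train_one_epoch and validation_bleu stand in for the actual training step and BLEU evaluation, and the maximum number of epochs is an arbitrary cap.

```python
def fit(train_one_epoch, validation_bleu, patience=5, max_epochs=200):
    """Stop once the validation BLEU has not improved for `patience` epochs."""
    best_bleu, since_improvement = -1.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        bleu = validation_bleu()
        if bleu > best_bleu:
            best_bleu, since_improvement = bleu, 0
        else:
            since_improvement += 1
        if since_improvement >= patience:
            break               # early stop: no improvement for `patience` epochs
    return best_bleu
```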
6 Results
[Figure 1: example images with their generated captions. (a) "a group of people riding on the side of a car"; (b) "a kitchen with a sink stove a sink and other"; (c) "a group of people standing in a kitchen".]
The best configuration, as described in Section 5, gives a BLEU score Papineni et al. (2002) of 7.30. Figure 1 shows some of the best generated captions for given images. We observed that the model is able to recognize groups of people as well as some locations (such as a kitchen) and objects (such as a sink). The model also learned to use the correct verb for a given caption: for example, in Figure 1 the model is capable of differentiating riding from standing.
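For reference, a corpus-level BLEU against the available ground-truth captions can be computed as in the sketch below; we use NLTK here purely for illustration, with whitespace tokenization, which does not necessarily match the official COCO evaluation toolkit.

```python
from nltk.translate.bleu_score import corpus_bleu

def caption_bleu(hypotheses, references_per_image):
    """hypotheses: one generated caption (string) per image;
    references_per_image: the list of ground-truth captions (strings) per image."""
    hyps = [h.split() for h in hypotheses]
    refs = [[r.split() for r in refs] for refs in references_per_image]
    return 100.0 * corpus_bleu(refs, hyps)   # 4-gram BLEU by default, scaled to 0-100

# Usage sketch with made-up captions:
score = caption_bleu(
    ["a group of people standing in a kitchen"],
    [["a group of people are standing in a kitchen",
      "several people stand around a kitchen counter"]])
print(f"BLEU: {score:.2f}")
```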
[Figure 2: failure cases with their generated captions. (a) "a <unk> <unk> <unk> next to an table"; (b) "a kitchen with a sink and tiled sink"; (c) "a man of people sits in a kitchen".]
Nevertheless, we can identify two failure cases. First, the model often outputs sentences filled with the <unk> token. It is possible that the model hasn't been trained for long enough and on too little data. The Generator receives only a single adversarial feedback for all the words generated, so some words may not have received enough gradient to be used successfully. In general, the pool of words used is not very large: the words used in Figure 1 are close to the ones used in Figure 2. Secondly, the model sometimes outputs well-formed sentences (Figure 2, b and c) that are unrelated to the image. Here, it is possible that the conditional information has not been taken into account.
7 Conclusion
[Figure 3: heat-map of validation BLEU scores for different dropout settings.]
In this paper, we made a first attempt at adversarial captioning without pre-training and reinforcement techniques. The task is challenging, especially since the generator G and the discriminator D work with inputs of different sparsity. Nevertheless, only the WGAN with gradient penalty was able to give acceptable results; other techniques such as the relativistic GAN Jolicoeur-Martineau (2018) or WGAN-divergence Wu et al. (2018) didn't work in our case. We also noticed that the model is very sensitive to dropout: Figure 3 confirms our intuition that removing dropout hurts the generator (the bottom-left of the heat-map, with no dropout, resulted in a BLEU score of 0).
There are a few improvements that can be made in future research. First, the attention model could be made more sophisticated so that the visual signal is stronger. The size of the overall model could also be increased. Finally, the model should be trained on the full COCO training set. It is also possible that the early-stopping patience of 5 epochs is an issue, since the model could need more time to converge.
References
- [1] Bengio et al. (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179.
- [2] Brock et al. (2018) Large scale GAN training for high fidelity natural image synthesis.
- [3] Dai et al. (2017) Towards diverse and natural image descriptions via a conditional GAN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2970–2979.
- [4] Goodfellow et al. (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- [5] Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- [6] Gulrajani et al. (2017) Improved training of Wasserstein GANs. CoRR abs/1704.00028.
- [7] Guo et al. (2018) Long text generation via adversarial training with leaked information. In Thirty-Second AAAI Conference on Artificial Intelligence.
- [8] He et al. (2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
- [9] Isola et al. (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
- [10] Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734.
- [11] Lin et al. (2014) Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755.
- [12] Mirza and Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- [13] Papineni et al. (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Stroudsburg, PA, USA, pp. 311–318.
- [14] Pennington et al. (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- [15] Press et al. (2017) Language generation with recurrent generative adversarial networks without pre-training. arXiv preprint arXiv:1706.01399.
- [16] Semeniuta et al. (2018) On accurate evaluation of GANs for language generation. arXiv preprint arXiv:1806.04936.
- [17] Sutton et al. (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
- [18] Wu et al. (2018) Wasserstein divergence for GANs. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 653–668.
- [19] Xu et al. (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2048–2057.
- [20] Yu et al. (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.
- [21] Zoph and Le (2017) Neural architecture search with reinforcement learning. In ICLR.