The scarcity of image-caption datasets raises the question of whether it would be feasible to synthesize artificial samples that contain high-quality pairs of image and text. Such an application could help improve the performance of neural networks on tasks such as image captioning and image classification. Commercial applications also include content creation, drug discovery and news entertainment.
Recent advances in generative adversarial networks (GANs) and sequence-to-sequence modeling have made image-to-text and text-to-image generation feasible. Our work combines these two paradigms and explores the implications of combining the two models.
Our main accomplishments include: 1) to the best of our knowledge, the first study on synthesizing novel pairs of image and text; and 2) an analysis of cycles, from image to text back to image, and a comparison with autoencoders.
2.1 Generative Adversarial Networks
Generative image modeling has improved markedly in the past few years with the emergence of deep learning techniques. Generative Adversarial Networks, pioneered by Goodfellow et al., have since opened the door to image generation with deep convolutional generative models (DCGAN) [Goodfellow2014]. In the paper "Generative Adversarial Text to Image Synthesis", Reed et al. provided a multi-stage approach for generating images from text [Reed2016]. The authors first trained a sequence encoder that learns discriminative text feature representations. They then trained a DCGAN conditioned on the text embeddings. This end-to-end approach provided an effective way to generate images based on text.
2.2 Image Captioning
On the flip side of text to image, the problem of generating natural language descriptions from images has also advanced considerably, thanks to the combination of deep convolutional networks with recurrent neural networks. Techniques such as seq2seq by Sutskever et al. have shown great potential in natural language processing tasks such as machine translation [Sutskever2014]. In "Show and Tell: A Neural Image Caption Generator", Vinyals et al. demonstrated that descriptive and coherent captions can be generated by feeding the last convolutional activations from an image classification network through a long short-term memory (LSTM) network [Vinyals2015].
2.3 Dataset Augmentation
Inspired by the above advances, we tackle the problem of dataset augmentation by synthesizing novel image and text pairs. The imbalanced learning literature offers a large body of work on creating additional samples, where new examples are created for the minority class in order to prevent overfitting. In the Synthetic Minority Over-sampling Technique (SMOTE), new examples are created in feature space by interpolating between a minority-class example and one of its nearest minority-class neighbors [Chawla2002]. In ADASYN, a new sample is generated by combining an existing sample with weighted differences between the sample and its neighbors [He2008]. In Parsimonious Mixture of Gaussian Trees, new samples are created by first fitting a mixture of Gaussians to the existing minority class and then sampling from it [Cao2014]. Our source domain generation methods draw on these ideas.
In section 3 of the paper, we cover how we are re-purposing Reed et al.’s GAN-CLS to generate new image and text pairs, as well as its connection with autoencoders. In section 4 of the paper, we show qualitative results and an analysis of our approach.
In this paper, we utilize a neural framework to generate images and text. Recent advances in generative adversarial networks and sequence-to-sequence models have shown the feasibility of translating image to text, as well as text to image. Combining the two, we gain the ability to generate novel image-text pairs.
Specifically, in terms of image and text, our problem is formulated as generating novel images and text based on existing pairs of images and text. That is, we want to draw new image-text pairs from the joint distribution that produced the pairs in the existing dataset.
We approach the paired generation problem with a two-step approach:
1) Source domain generation: sample a novel example from the source domain.
2) Target domain generation: generate the target domain example conditioned on the source domain sample.
In the case where the source domain is text, we first sample a text, then generate an image based on the text sample; and vice versa, when the source domain is images, we first sample an image, then generate a text caption based on the image sample.
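The two-step procedure can be sketched as a small function that first draws a source sample and then generates the target from it. This is only an illustration: `generate_pair`, `toy_sample_text` and `toy_text_to_image` are hypothetical names, and the toy callables stand in for the trained source sampler and conditional generator.

```python
import random
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")  # source-domain sample (e.g. a text or its embedding)
T = TypeVar("T")  # target-domain sample (e.g. an image)


def generate_pair(sample_source: Callable[[], S],
                  generate_target: Callable[[S], T]) -> Tuple[S, T]:
    """Step 1: draw a source sample. Step 2: generate the target conditioned on it."""
    source = sample_source()
    target = generate_target(source)
    return source, target


# Toy stand-ins (assumptions, not the trained models).
rng = random.Random(0)
captions = ["a red flower", "a yellow flower"]


def toy_sample_text() -> str:
    return rng.choice(captions)


def toy_text_to_image(text: str) -> dict:
    # A real generator would return pixels; here we only record the conditioning text.
    return {"rendered_from": text}


text, image = generate_pair(toy_sample_text, toy_text_to_image)
```

Swapping the two callables (an image sampler and an image-to-text captioner) gives the reverse direction without changing `generate_pair`.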
3.1 Source domain generation
The source domain generation problem is the following: given existing examples from the source domain, generate novel samples. This problem requires us to construct novel examples, rather than sampling from the existing dataset. Because of the high dimensionality of the image and text domains, we generate novel embeddings instead of raw images or text. The encodings of text and image are learned during the training phase of GAN-CLS. The image embedding is the last convolutional activation in the discriminator, and the text embedding is the output of the encoder that the generator is conditioned on. To construct novel image and text embeddings, we propose two approaches inspired by the existing imbalanced learning literature: prototype-based and density-based.
3.1.1 Prototype based
In the first method, we make modifications to existing samples, which we call prototypes. Prototypes are examples drawn from the dataset. Our method draws heavily from techniques such as SMOTE that synthesize new examples based on pairs of existing examples.
A new image embedding can be produced by interpolating between the embeddings of two prototype images, and likewise a new text embedding by interpolating between the embeddings of two prototype texts.
A major advantage of using existing prototypes is that the prototypes can be easily identified. The prototype method can be further extended by sampling the prototypes conditioned on additional information, such as the source class.
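A minimal sketch of prototype-based generation, assuming the new embedding is a convex combination of two prototype embeddings with a random weight. The function name, the uniform choice of weight, and the optional class filter are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def interpolate_prototypes(embeddings, labels=None, target_class=None, rng=None):
    """Return a new embedding between two randomly chosen prototypes.

    If target_class is given, both prototypes are drawn from that class
    (an assumed way of conditioning on the source class).
    """
    rng = np.random.default_rng() if rng is None else rng
    embeddings = np.asarray(embeddings, dtype=float)
    candidates = np.arange(len(embeddings))
    if target_class is not None:
        if labels is None:
            raise ValueError("labels are required when target_class is set")
        candidates = candidates[np.asarray(labels) == target_class]
    i, j = rng.choice(candidates, size=2, replace=False)  # two distinct prototypes
    lam = rng.uniform(0.0, 1.0)                           # interpolation weight
    new_embedding = lam * embeddings[i] + (1.0 - lam) * embeddings[j]
    return new_embedding, (int(i), int(j)), float(lam)


# Toy embeddings standing in for learned image or text embeddings.
toy_embeddings = np.random.default_rng(0).normal(size=(6, 4))
toy_labels = [0, 0, 0, 1, 1, 1]
new_emb, pair, lam = interpolate_prototypes(
    toy_embeddings, toy_labels, target_class=1, rng=np.random.default_rng(1)
)
```

Returning the prototype indices alongside the embedding keeps the two source examples identifiable, which is the advantage noted above.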
3.1.2 Density based
In density-based source domain generation, we learn a generative model of the distribution of the source domain. We then sample from the learned distribution to obtain additional examples in the source domain.
In our application, we fit separate Gaussian mixture models to the text and image embeddings: $p(e) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(e \mid \mu_k, \Sigma_k)$, where there are $K$ mixture components, each a Gaussian with mean $\mu_k$ and covariance $\Sigma_k$, and $\pi_k$ are the mixture weights.
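As an illustration of the density-based approach, the sketch below fits a Gaussian mixture with plain EM and then samples new embeddings from it. It assumes diagonal covariances and a fixed number of iterations for brevity; the paper does not state these choices, and `fit_diag_gmm` and `sample_gmm` are hypothetical helper names.

```python
import numpy as np


def fit_diag_gmm(x, k, n_iter=50, seed=0):
    """Fit a K-component Gaussian mixture with diagonal covariances via EM."""
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    means = x[rng.choice(n, size=k, replace=False)]    # initialise at random data points
    variances = np.tile(x.var(axis=0) + 1e-6, (k, 1))  # start from the global variance
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point, in log space.
        diff = x[:, None, :] - means[None, :, :]                      # (n, k, d)
        log_prob = -0.5 * np.sum(
            diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2
        )                                                             # (n, k)
        log_resp = np.log(weights) + log_prob
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        diff = x[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", resp, diff ** 2) / nk[:, None] + 1e-6
    return weights, means, variances


def sample_gmm(weights, means, variances, n_samples, seed=0):
    """Draw samples by picking a component, then sampling from its Gaussian."""
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(weights), size=n_samples, p=weights)
    noise = rng.standard_normal((n_samples, means.shape[1]))
    return means[comps] + noise * np.sqrt(variances[comps]), comps


# Toy embeddings: two well-separated clusters in 3 dimensions.
rng = np.random.default_rng(0)
toy_embeddings = np.vstack([
    rng.normal(-5.0, 1.0, size=(200, 3)),
    rng.normal(5.0, 1.0, size=(200, 3)),
])
weights, means, variances = fit_diag_gmm(toy_embeddings, k=2)
new_embeddings, components = sample_gmm(weights, means, variances, n_samples=10)
```

Keeping the component index of each sample makes it possible to draw several embeddings from the same cluster.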
3.2 Target domain generation
After generating the source domain sample, we generate the target domain conditioned on it. This leads to two formulations: one going from text to image, and one going from image to text.
3.2.1 Text to Image
Like Reed et al. [Reed2016], we model text to image synthesis as a conditional generative adversarial network (cGAN). The generator network takes a noise vector and a text embedding as input and produces an image.
The discriminator network, on the other hand, takes an image and a text embedding as input and scores whether the image is real and matches the text.
The loss function is adopted from GAN-CLS.
The intuition behind the loss is that it forces the discriminator to learn to discern three cases: generated images with the right captions, real images with wrong captions, and wrong images with the right captions.
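For reference, the matching-aware GAN-CLS objective can be sketched as follows. This follows our reading of Algorithm 1 in Reed et al. [Reed2016], which scores a real image with the right text, a real image with the wrong text, and a generated image with the right text; it is not necessarily the exact loss used in this paper. The function name and the sign convention (losses to be minimized) are assumptions.

```python
import numpy as np


def gan_cls_losses(s_real_right, s_real_wrong, s_fake_right, eps=1e-8):
    """Discriminator and generator losses from discriminator scores in (0, 1).

    s_real_right: D(real image, matching text)
    s_real_wrong: D(real image, mismatched text)
    s_fake_right: D(generated image, matching text)
    """
    s_r = np.clip(np.asarray(s_real_right, dtype=float), eps, 1.0 - eps)
    s_w = np.clip(np.asarray(s_real_wrong, dtype=float), eps, 1.0 - eps)
    s_f = np.clip(np.asarray(s_fake_right, dtype=float), eps, 1.0 - eps)
    # The discriminator should accept (real, right) and reject the other two cases.
    d_loss = -np.mean(np.log(s_r) + 0.5 * (np.log(1.0 - s_w) + np.log(1.0 - s_f)))
    # The generator wants its images to be accepted with the matching text.
    g_loss = -np.mean(np.log(s_f))
    return float(d_loss), float(g_loss)


d_good, g_good = gan_cls_losses(0.9, 0.1, 0.1)     # discriminator separates the cases well
d_unsure, g_unsure = gan_cls_losses(0.5, 0.5, 0.5)  # discriminator cannot tell them apart
```

A discriminator that separates the cases well gets a lower loss than an undecided one, while the generator's loss moves in the opposite direction.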
3.2.2 Image to Text
Going from image to text, we train our model to maximize the likelihood of the correct caption given the image.
We estimate this likelihood using an LSTM. Unlike Vinyals et al. [Vinyals2015], instead of training a new convolutional neural network, we take advantage of the discriminator trained in our conditional GAN from section 3.2.1 by using the activations of the last convolutional layer of its CNN as the image representation.
Note that in the GAN-CLS architecture, these activations have not yet been concatenated with the textual information. This means that they must already contain the information needed to distinguish real images with wrong captions and wrong images with the right captions. This leads us to conclude that they carry enough information to generate descriptive captions.
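The caption likelihood being maximized can be illustrated with a short sketch. It assumes the usual factorization used in neural captioning, where the log-likelihood of a caption is the sum of per-word log-probabilities given the image and the preceding words. The array `step_logits` is a hypothetical stand-in for the LSTM's output scores at each time step.

```python
import numpy as np


def caption_log_likelihood(step_logits, caption_ids):
    """Sum over time steps of log p(word_t | image, previous words).

    step_logits: array of shape (T, vocab_size), unnormalised scores per step.
    caption_ids: length-T sequence of ground-truth word indices.
    """
    logits = np.asarray(step_logits, dtype=float)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    steps = np.arange(len(caption_ids))
    return float(log_probs[steps, np.asarray(caption_ids)].sum())


caption = [2, 0, 3]                 # toy caption of three word indices
uniform_logits = np.zeros((3, 4))   # a model with no preference over 4 words
confident_logits = np.zeros((3, 4))
confident_logits[np.arange(3), caption] = 5.0  # a model favouring the correct words

ll_uniform = caption_log_likelihood(uniform_logits, caption)
ll_confident = caption_log_likelihood(confident_logits, caption)
```

Training would adjust the model parameters to increase this quantity over the dataset; the uniform model scores 3 log(1/4), and the model favouring the correct words scores higher.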
3.3 Formulating a cycle: A to B to A
Combining the text-to-image model, which generates an image based on text, and the image-to-text model, which generates a text based on an image, we study the implications of going from image to text and then back to image, or vice versa, which we term cycles.
3.3.1 Relation to Autoencoders
The traditional autoencoder tries to minimize the reconstruction loss between its input and the reconstructed input. Although we do not directly minimize a reconstruction loss, our formulation of cycles does have close ties with autoencoders.
As noted by Goodfellow et al. [Goodfellow2014], at the optimum of the GAN objective the distribution of the generator G matches the true distribution of the input data. A conditional GAN, by extension, learns the true conditional distribution. This means that when going from text to image, the conditional distribution of the data is preserved.
On the other hand, when going from image to text, we maximize the likelihood of the caption given the image, which also means that the conditional distribution of the data is preserved.
However, it is important to note that both mappings only preserve information that is shared between the image and text. Information not shared between the image and text is lost in the generation process.
Therefore, we should expect the information loss within the embedding space, which captures the information shared between the two domains, to be minimal. This means that the embedding of the original image should be close to the embedding of the reconstructed image. However, we should not expect the original and reconstructed images themselves to be similar in pixel space, since both mappings maximize the probability of the data conditioned only on the information shared between image and text.
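One way to make this expectation concrete is to compare an original and a reconstructed image both in embedding space and in pixel space. The sketch below is purely illustrative: `toy_embed` (mean color per channel) is a hypothetical stand-in for the learned image embedding, and the "reconstruction" is just a flipped copy of the original.

```python
import numpy as np


def cosine_similarity(a, b):
    a = np.ravel(a).astype(float)
    b = np.ravel(b).astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def compare_cycle(original, reconstructed, embed):
    """Return (embedding-space cosine similarity, pixel-space mean squared error)."""
    emb_sim = cosine_similarity(embed(original), embed(reconstructed))
    diff = np.asarray(original, dtype=float) - np.asarray(reconstructed, dtype=float)
    pixel_mse = float(np.mean(diff ** 2))
    return emb_sim, pixel_mse


def toy_embed(image):
    # Hypothetical embedding: keeps only the average color, discarding layout.
    return np.asarray(image, dtype=float).mean(axis=(0, 1))


rng = np.random.default_rng(0)
original = rng.uniform(0.0, 1.0, size=(8, 8, 3))
reconstructed = original[::-1, ::-1, :]  # same colors, different pixel arrangement

emb_sim, pixel_mse = compare_cycle(original, reconstructed, toy_embed)
```

In this toy case the embeddings agree almost exactly while the pixels differ, mirroring the situation described above where shared attributes survive the cycle but pixel-level detail does not.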
We experimented with the Oxford 102 flowers dataset, which consists of 102 classes of flowers and roughly 8,000 images, with 10 captions per image. Using our model, we are able to generate novel pairs of image and text.
4.1 Source domain generation
In Figures 4, 5 and 6, we show results of source domain generation. In Figures 4 and 6, the source domain is text and the target domain is images. In Figure 4, we interpolate between two prototype text embeddings, with the interpolation weight varied linearly from one prototype to the other. This demonstrates our ability to construct novel images based on text.
In Figure 5, we demonstrate the reverse of Figure 4 and combine two prototype images to generate novel text.
In Figure 6, we explore density-based source domain generation. Each image is generated by sampling from the same cluster in the Gaussian mixture model.
4.2 Cycle generation: A to B to A
Figure 1 in section 1 shows an example of the image to text to image cycle. Note that the caption and image preserve the semantic attributes of the flowers rather than the raw pixel values. Likewise, for text to image to text, Figure 7 below shows similar behavior: the meaning of the text stays the same, but the exact words do not.
We propose a novel way of generating pairs of image and text by repurposing the existing GAN-CLS architecture. This method can be extended to generating pairs of novel samples in multiple domains. Additionally, we study the implications of cycles, going from image to text to image and vice versa, and compare our model with autoencoders.
- [Goodfellow2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio 2014. Generative Adversarial Nets NIPS
- [Reed2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee 2016. Generative Adversarial Text to Image Synthesis arXiv preprint arXiv:1605.05396
- [Reed2016b] Scott Reed, Zeynep Akata, Honglak Lee, Bernt Schiele 2016. Learning Deep Representations of Fine-Grained Visual Descriptions arXiv preprint arXiv:1605.05395
- [Sutskever2014] Ilya Sutskever, Oriol Vinyals, Quoc V. Le 2014. Sequence to Sequence Learning with Neural Networks NIPS
- [Vinyals2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan 2015. Show and Tell: A Neural Image Caption Generator arXiv preprint arXiv:1411.4555
- [Cao2014] Hong Cao, Vincent Tan, John Pang 2014. A Parsimonious Mixture of Gaussian Trees Model for Oversampling in Imbalanced and Multimodal Time-Series Classification IEEE Volume: 25, Issue: 12, Dec. 2014
- [Chawla2002] Nitesh Chawla, Kevin Bowyer, Lawrence Hall, Philip Kegelmeyer 2002. SMOTE: Synthetic Minority Over-sampling Technique arXiv preprint arXiv:1106.1813
- [He2008] Haibo He, Yang Bai, Edwardo Garcia 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning Neural Networks, 2008. IJCNN 2008.