Generating high-quality images with arbitrary content based on user input, whether through natural language or a discrete user interface, has been a long-term goal and focal point of the Graphics and Vision community. In Natural Language Processing, a comparably important goal has been that of building natural language interfaces for complex programs that interact with users to complete tasks collaboratively. For visual data, prior work has generated images from sentences [1, 2, 3, 4] and attributes [5, 6, 7]. More complex areas such as video or unsupervised generation also exist [8, 9, 10]. For language, examples include [11, 12, 13, 14, 15]; we point the reader to  for a more in-depth overview and motivation. We define a task at the intersection of these two areas: build a system that a user can interact with iteratively to generate images in a dynamic fashion, i.e., updating after every round of interaction. Possible use cases for such a system range from a skilled designer rapidly prototyping and refining designs to an art hobbyist creating high-quality illustrations using speech commands. At every round, the user provides input, and the system returns an image conditioned on the history of inputs so far.
Our proposed system also serves as a helpful testbed for debugging and improving conditional Generative Adversarial Networks (GANs) and benchmarking methods for interpretability in deep models. Our multi-turn setting disentangles the effects of different conditioning information on the rendered images, making statements like “shape information is more difficult to integrate than color” quantifiable and explicit. We believe that such a setting can drive the community to find model improvements, modules, and training strategies to address specific problems in conditional GANs. We additionally justify this task as a setting for exploring neural interpretability by introducing the concept of visual justifications and arguing that they may be superior to current approaches to interpretability.
This framework presents new challenges that current methods cannot address. Because of the lack of supervision for the intermediate stages of image generation, this task is similar to episodic reinforcement learning, where the agent does not receive the reward until the end of the episode. In our case, we need to learn to produce reasonable intermediate images without direct supervision. This makes current GAN-based approaches ill-suited, since they implicitly assume a supervised case. There does exist a class of latent variable models that at first glance may seem sufficient, since they update a ‘canvas’ over the course of generation, but we note that these canvases are latent variables and not human-comprehensible intermediate results, nor are they trained to be. Beyond the problem of intermediate supervision, applying current state-of-the-art models in a non-recurrent fashion fails to keep images consistent through time. This applies both to attributes that are unrelated to the conditioning information (illumination, background, etc.) and to attributes that are related (pose, color).
In order to solve these challenges, we propose a novel model framework with a new training algorithm that allows us to hallucinate intermediate supervision without making strong distributional assumptions about conditioning or datasets used, and without requiring the existence of a sampler. This allows us to train our model to generate intermediate results without ever observing them. We display qualitative results showing that our model does learn to generate these intermediate results responsively to conditioning information and coherently with the full sequence of generation, even when large changes are required. In doing so, we lay out a class of deep generative models that have recurrence through time, using primitive elements, namely convolutional Gated Recurrent Units (GRUs), originally designed for discriminative video-based tasks. Additionally, we demonstrate concretely that this setting allows us to understand better where and why the current state of the art in GANs fails at generating images, pointing to specific problems to tackle with the goal of improving image generation in general.
Our contributions in this work are as follows. We combine two research threads from two different areas (vision/graphics and NLP) in order to introduce a task that is practically useful and that offers insights into promising research directions in these areas and in conditional image generation and neural interpretability. We introduce a framework that includes a novel training algorithm as well as model improvements in order to handle the challenges introduced by this task. We demonstrate that this framework works well both in terms of the quality of images generated and in terms of performance on the task by presenting images as well as examples of specific changes occurring (shape, color, etc.).
2 Related Work
Beyond the brief overview of concrete problems and long-term goals in NLP that help to define our task, we would like to draw the reader’s attention to the significant body of work in the Human-Computer Interaction community over the past several decades on the space of control, feedback, and multi-step interaction with computers and technology. Much of this work phrases these interactions as a form of collaborative conversation, while also building upon empirical and theoretical results showing that people interact with computers in such a conversational fashion. The emphasis in this literature on incremental and rapid feedback to changes made by the user suggests that our task has implications for building better experiences using vision and NLP, beyond the usefulness of phrasing problems as multi-turn for performance.
Since the introduction of GANs, there has been a surge of interest in image generation, from both the unconditional and the conditional perspectives. Work has explored generating images from captions [1, 2, 3] and attributes [5, 6], as well as how to parametrize the models and training framework beyond the original formulation.
With regard to generation, perhaps the works most closely related to ours are  and . Both of these works are important to our work: the former for the introduction of a mental ‘canvas’ that a generator iteratively updates in order to produce a final image, and the latter for the use of such a model on a difficult language-to-image task. Our work is also closely related to , although they do not update the image with conditioning information at each step, and do not enforce that the image should be complete at every step. We distinguish our framework by the requirement that intermediate generation be human-comprehensible, via an actual image and not a latent vector.
2.1 Neural Interpretability
To date, the majority of the work on interpretability in neural networks can be broadly categorized either as post-training methods, which find neurons selective for specific concepts or objects, or as attention-based methods, which provide internal attention maps to illustrate where the network is focusing as it makes predictions.
Although these methods do provide a more intuitive view into the operation of these networks on single examples and on datasets as a whole, they are not applicable to all tasks, particularly in complex settings like VQA where multiple modalities are involved, and they fail to meaningfully communicate what models are ‘thinking’ at a high level. In contrast, we propose to treat the neural interpretability problem as a problem of visual justification, where the model must explain its actions at every step by providing visual output, and where people can, at a glance, understand what the model is ‘thinking’.
This is in the same spirit as newer work which provides natural language justifications for model predictions in sentiment analysis and image classification. We go a step further, creating visual justifications that illustrate the model’s internal representation, not just its final actions, and our visual justifications are amenable to cases where a model must justify its ‘mental process’ during computation as opposed to waiting until the very end. Accordingly, we expect to generate images at every turn of the conversation demonstrating the bot’s current estimate of the image from the conditioning information so far, and we expect these images to change meaningfully through the course of the conversation as new information is obtained (e.g. ‘primary color: red’).
3.1 Problem Definition
Our task, simply posed, is as follows: the image generation model participates in a “conversation” with an external actor who provides conditioning information at every turn, which the generator must use to produce images. For example, this information may take the form of attribute-value pairs describing the bird in question. Note that our framework also applies to more complex cases, e.g. where the conditioning information at each turn is a full sentence.
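The interaction described above can be sketched as a simple loop. This is a minimal illustration only: the names `run_conversation` and `dummy_generate`, the `Image` placeholder type, and the example attribute values are hypothetical, and the stand-in generator does not produce real images.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]       # (attribute, value), e.g. ("Primary Color", "red")
Image = List[List[float]]    # placeholder image type for this sketch


def run_conversation(
    turns: List[Turn],
    generate: Callable[[List[Turn]], Image],
) -> List[Image]:
    """Return one image per turn, each conditioned on the history so far."""
    history: List[Turn] = []
    images: List[Image] = []
    for turn in turns:
        history.append(turn)
        # Pass a copy so the generator cannot mutate the running history.
        images.append(generate(list(history)))
    return images


def dummy_generate(history: List[Turn]) -> Image:
    # Stand-in generator: a 1x1 "image" whose value is the history length.
    return [[float(len(history))]]


# Illustrative attribute values; not taken from the dataset annotations.
turns = [("Primary Color", "red"), ("Shape", "perching-like"), ("Size", "small")]
images = run_conversation(turns, dummy_generate)
```

The key point is that the generator is called once per turn with the full history, so every intermediate output is itself an image.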
In contrast to [1, 2, 3] and many of the GANs operating on the CUB dataset, which use a class-disjoint train/test split of the data, we take a stratified sample of the CUB dataset by class. In order to clearly illustrate our setting and to avoid the problem of dealing with very fine-grained attribute classes, we restrict the attributes used to the top 4 observed in the dataset by frequency.
The selected attributes are “Primary Color,” “Shape,” “Size,” and “Bill Color.” They provide good coverage over the types of attributes we could use, and they contain both very disjoint attribute class pairs (e.g., “Shape” and “Bill Color”) and less disjoint ones (e.g., “Primary Color” and “Bill Color”), allowing us to observe how these affect training and generation. In addition, this choice helps us to disentangle the problem of vocabulary sparsity from our setting, as a first step toward what we ultimately want from our models: dealing with highly diverse and sparse vocabularies that characterize fine-grained changes across long timescales in a highly interacting fashion.
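A stratified split by class, as opposed to a class-disjoint split, can be sketched as follows. The function name, the 20% held-out fraction, and the toy labels are assumptions made for illustration; the actual split proportions are not stated here.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    test_fraction: float = 0.2,   # hypothetical value, not from the paper
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Split example indices so that each class with at least two
    examples contributes to both the train and the test set."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one example per class, but never the whole class.
        n_test = max(1, round(len(indices) * test_fraction))
        n_test = min(n_test, len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["sparrow"] * 10 + ["gull"] * 5   # toy class labels
train_idx, test_idx = stratified_split(labels)
```

Under this kind of split, every bird class seen at test time has also been seen during training, which differs from the class-disjoint protocol.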
Our model consists of two components: a reader submodule, to properly integrate conditioning information through time, and a recurrent generator, to transform this integrated information into pictures at each time step. Because of the importance of pretraining embeddings for GANs in terms of image quality and convergence, we pretrain these components separately, and then tune them together.
We first embed attributes and values, encode them together with a linear layer, and utilize a GRU to manage the relevant state updates through time as we receive new inputs. A nonlinear stack of transformations follows, as specified in Fig. 2. Intuitively, we want our reader to observe conditioning information at each step and return embeddings that are useful for generating images that match the given input.
4.3 StackGAN++ and modifications
4.3.1 Conv-GRU and recurrence in generation
We use a convolutional GRU as introduced in . With the convolutional modifications, the Conv-GRU equations are as follows:
where $*$ and $\odot$ denote convolution and the Hadamard product, respectively.
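For reference, a common way of writing the Conv-GRU update is sketched below. The symbol names ($x_t$ for the input feature map at step $t$, $h_t$ for the hidden state, $z_t$ and $r_t$ for the update and reset gates, and $W$, $U$ for the convolution kernels) are assumptions of this sketch, and bias terms are omitted.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * x_t + U_z * h_{t-1}\right) \\
r_t &= \sigma\left(W_r * x_t + U_r * h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W * x_t + U * (r_t \odot h_{t-1})\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Compared with a standard GRU, the only change is that the matrix products are replaced by convolutions, so the hidden state keeps its spatial layout.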
Intuitively, adding recurrence in the generator can help with training by removing some of the burden of keeping state through time and integrating new information that would otherwise fall on the reader module. It can also improve performance by giving the network the ability to condition on past generations, in the same way that conditioning on lower-resolution generation has been shown to improve higher-resolution image generation. Finally, the use of recurrence increases the ability of the network to keep elements that are not related, or not closely related, to the conditioning information (pose, illumination, background, etc.) fixed through the sequence of images.
We use a GRU recurrence instead of an LSTM recurrence in order to reduce training times, as well as parameter sizes and memory/computational requirements. The system was not sensitive to this difference. We demonstrate the first use of this class of modules in a non-discriminative context. Beyond this brief overview, we point the reader to  and  to learn more about these convolutionally-recurrent models and how they can improve performance.
4.3.2 Conditional Augmentation
We use conditional augmentation to improve training and diversity: the output of the conditional augmentation layer is drawn from an independent Gaussian distribution whose mean and diagonal covariance matrix are functions of the reader’s output. We additionally concatenate noise, which is fixed through time.
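A minimal sketch of this sampling step is given below, assuming the mean and log-variance for each turn have already been computed from the reader's output (in practice by learned layers, which are not shown). The function names and the log-variance parameterization are assumptions of this sketch.

```python
import math
import random
from typing import List, Sequence


def conditional_augmentation(
    mean: Sequence[float],
    log_variance: Sequence[float],
    rng: random.Random,
) -> List[float]:
    """Draw c ~ N(mean, diag(exp(log_variance))) by reparameterization."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_variance)
    ]


def generator_inputs(
    means: Sequence[Sequence[float]],
    log_variances: Sequence[Sequence[float]],
    noise_dim: int,
    seed: int = 0,
) -> List[List[float]]:
    """Build one generator input per turn; the noise is shared across turns."""
    rng = random.Random(seed)
    fixed_noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]  # sampled once
    inputs = []
    for mean, log_variance in zip(means, log_variances):
        condition = conditional_augmentation(mean, log_variance, rng)
        inputs.append(condition + fixed_noise)  # list concatenation
    return inputs


# Two turns, 2-d conditioning, 3-d noise; tiny variance keeps samples near the mean.
inputs = generator_inputs(
    means=[[0.0, 1.0], [0.5, -0.5]],
    log_variances=[[-50.0, -50.0], [-50.0, -50.0]],
    noise_dim=3,
)
```

Keeping the noise vector fixed across turns means that changes between consecutive images can only come from the conditioning information and the recurrent state.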
For the GAN component, we use the same hyperparameters as StackGAN++. All convolutional GRUs have kernel size 1 and have the same number of channels as the layer they run on top of. For the reader module, the GRU has hidden size 1024, and all other hyperparameters are described in the model diagram.
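With kernel size 1, each convolution reduces to a channel-mixing matrix applied independently at every spatial location. The sketch below illustrates one step of such a Conv-GRU in plain Python, following the common GRU gating formulation; the parameter names (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`) are hypothetical, and bias terms are omitted.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]        # [out_channels][in_channels]
FeatureMap = List[List[Vector]]   # [height][width][channels]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def conv_gru_1x1(
    x: FeatureMap, h_prev: FeatureMap, params: Dict[str, Matrix]
) -> FeatureMap:
    """One Conv-GRU step with 1x1 kernels: the same matrices are applied
    at every pixel, and input and state have the same number of channels."""
    h_next = []
    for x_row, h_row in zip(x, h_prev):
        out_row = []
        for x_pix, h_pix in zip(x_row, h_row):
            wz, uz = matvec(params["Wz"], x_pix), matvec(params["Uz"], h_pix)
            wr, ur = matvec(params["Wr"], x_pix), matvec(params["Ur"], h_pix)
            z = [sigmoid(a + b) for a, b in zip(wz, uz)]   # update gate
            r = [sigmoid(a + b) for a, b in zip(wr, ur)]   # reset gate
            gated = [ri * hi for ri, hi in zip(r, h_pix)]
            wx, ug = matvec(params["W"], x_pix), matvec(params["U"], gated)
            cand = [math.tanh(a + b) for a, b in zip(wx, ug)]
            out_row.append(
                [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_pix, cand)]
            )
        h_next.append(out_row)
    return h_next


# With all-zero weights both gates equal 0.5 and the candidate state is 0,
# so the new state is exactly half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
h_new = conv_gru_1x1([[[1.0, 2.0]]], [[[0.4, -0.8]]], params)
```

A practical implementation would use a deep learning framework's convolution layers; the loop form here is only meant to make the per-pixel computation explicit.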
To adapt the joint conditional-unconditional losses for the discriminators and generator to the multi-turn setting, we need to find supervision for the image at each turn of the conversation.
where is the image generated by the generator, is the length of conversations, is the history of the conversation from time to , and is the conditional distribution of images that match . Unfortunately, naively setting for all , and training with this loss for all rounds in a batch tends to produce unchanging image sequences of lower quality. See Fig. 3 for an illustration of our approach in contrast to the naive approach.
defines a distribution over images sampled from by picking resulting in . For ease of notation, we define the following:
We would like to approximate the following loss:
where is the distribution over conversations (i.e. the distribution over conditioning strings of length ). For problems where the conditioning information is discrete and has a small vocabulary, it is possible to get a good approximation to the conditional distribution by sampling from the dataset, but this method will not work well for datasets that are sparse in this conditional distribution. To illustrate: the likelihood of the first two sentences describing two different images being exactly the same is vanishingly small for any reasonably-sized vocabulary, even when these images are semantically similar. To work on more complex problems where this data sparsity exists, we need a method that doesn’t require sampling from this conditional distribution.
By Fubini-Tonelli’s theorem, we can swap the first two expectations,
we then collapse the last two expectations into:
is the joint distribution over images and conditioning information. A similar argument follows for . This loss function admits a simple learning algorithm; see Alg. 1.
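One reading of the collapsed expectation is that a real image, paired with a prefix of its own conditioning sequence, is a valid sample from the joint distribution over images and partial histories. The sketch below illustrates the data-sampling step this suggests. It is an assumption-laden illustration rather than a reproduction of Alg. 1: the function name, the uniform choice of the turn index, and the toy dataset are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Turn = Tuple[str, str]


def sample_training_pairs(
    dataset: Sequence[Tuple[str, List[Turn]]],
    batch_size: int,
    rng: random.Random,
) -> List[Tuple[str, List[Turn], int]]:
    """Sample (image_id, history_prefix, t) triples.

    Each real image is paired with a prefix of its own conditioning
    sequence, so it can act as the "real" example for turn t even though
    no intermediate image was ever observed.
    """
    batch = []
    for _ in range(batch_size):
        image_id, conversation = rng.choice(dataset)
        t = rng.randint(1, len(conversation))   # inclusive on both ends
        batch.append((image_id, conversation[:t], t))
    return batch


# Toy dataset: image identifiers with their full attribute-value sequences.
dataset = [
    ("img_a", [("Primary Color", "red"), ("Shape", "perching-like")]),
    ("img_b", [("Size", "small")]),
]
batch = sample_training_pairs(dataset, batch_size=8, rng=random.Random(0))
```

Because only (image, full conversation) pairs from the dataset are needed, this avoids sampling from the conditional distribution of images given a partial history.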
We use the associated image embeddings for StackGAN++, and train our reader module to take in conditioning information and predict the embedding at every step under mean squared error for 15 epochs using Adam with default parameters and a batch size of 32. Reader training was not sensitive to these hyperparameters.
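The per-step regression objective can be illustrated with a short sketch. It assumes the same target embedding is used at every turn and that the loss is averaged uniformly over turns and embedding dimensions; both the function name and this averaging choice are assumptions.

```python
from typing import Sequence


def reader_pretraining_loss(
    predicted_per_turn: Sequence[Sequence[float]],
    target_embedding: Sequence[float],
) -> float:
    """Mean squared error between each turn's predicted embedding and the
    single image embedding, averaged over turns and dimensions."""
    total, count = 0.0, 0
    for prediction in predicted_per_turn:
        for p, t in zip(prediction, target_embedding):
            total += (p - t) ** 2
            count += 1
    return total / count


# Two turns: the first prediction is far from the target, the second matches it.
loss = reader_pretraining_loss([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0])
```

Early turns necessarily see less information than later ones, so their predictions are expected to contribute most of this loss.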
We also attempted to use embeddings from a variety of networks pretrained on ImageNet, as well as networks pretrained for CUB classification. We found that image quality is highly sensitive to the embeddings used for pretraining, and that the class-conditional loss used to train these embeddings was important for learning high-quality image embeddings that translate well to training a GAN.
We pretrain the StackGAN component (generator and discriminator) without the Conv-GRU layers to generate images using for 150 epochs, using all the hyperparameters from  without the KL-divergence regularization they introduce, and without dropping the learning rate as they do.
We then “put the models together” by initializing the full model with the parameters from the pretrained reader, generator, and discriminator, and train for epochs, keeping the optimizer hyperparameters the same, except we drop the learning rate by a factor of every epochs. We train with a batch size of 64 and perform simple dataset augmentation via random horizontal flips. Note that the Conv-GRU layers in the model are initialized from scratch using the standard PyTorch initialization.
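The two training details mentioned above, random horizontal flips and a stepwise learning-rate drop, can be sketched as follows. The decay factor, decay interval, and base learning rate used in the example are hypothetical placeholders, not the values used in our experiments, and the single-channel nested-list image type is a simplification.

```python
import random
from typing import List

Image = List[List[float]]   # [height][width], single channel for brevity


def random_horizontal_flip(image: Image, rng: random.Random, p: float = 0.5) -> Image:
    """Mirror the image left-to-right with probability p."""
    if rng.random() < p:
        return [list(reversed(row)) for row in image]
    return [list(row) for row in image]


def step_decay(base_lr: float, epoch: int, factor: float, every: int) -> float:
    """Divide the learning rate by `factor` once every `every` epochs."""
    return base_lr / (factor ** (epoch // every))


rng = random.Random(0)
flipped = random_horizontal_flip([[1.0, 2.0, 3.0]], rng, p=1.0)

# Hypothetical schedule: halve the rate every 10 epochs.
lr_at_epoch_25 = step_decay(base_lr=2e-4, epoch=25, factor=2.0, every=10)
```

In a PyTorch training loop, the same behavior is typically obtained with a framework-provided flip transform and a step learning-rate scheduler.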
6 Results and Analysis
6.1 Common Changes
We illustrate 4 common changes in generated images we see as a result of updating from given conditioning information: Color, Shape, Size, and Part. Part refers to changes localized to a specific part of the bird (most typically the bill), while Shape and Size changes require more intelligent adjustment of the generated image to remain coherent with the history so far.
In this paper we introduce a novel multi-turn image generation task where the model is asked to generate images at every step conditioned on a sequence of input information. A key challenge here is the lack of intermediate supervision during training, which we overcome with a novel learning algorithm and model improvements. Our learning algorithm hallucinates intermediate supervision in a provably equivalent way to training on the intermediate supervision if it existed. More broadly, this task is a helpful testbed for debugging and improving conditional GANs and benchmarking methods for interpretability in deep models.
This work was initiated during an internship at Microsoft Research. We thank Andrew Bennett, Valts Blukis, Daniel Jiwoong Im, Dipendra Misra, and Ayush Sekhari for helpful discussions.
-  Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: ICML. (2016)
-  Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., Metaxas, D.N.: Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV) (2017) 5908–5916
-  Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: Stackgan++: Realistic image synthesis with stacked generative adversarial networks. CoRR abs/1710.10916 (2017)
-  Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.: Attngan: Fine-grained text to image generation with attentional generative adversarial networks. CoRR abs/1711.10485 (2017)
-  Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: Aga: Attribute-guided augmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 3328–3336
-  Kaneko, T., Hiramatsu, K., Kashino, K.: Generative attribute controller with conditional filtered generative adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 7006–7015
-  Mirza, M., Osindero, S.: Conditional generative adversarial nets. CoRR abs/1411.1784 (2014)
-  Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: NIPS. (2016)
-  Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434 (2015)
-  Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial networks. CoRR abs/1406.2661 (2014)
-  Zelle, J.M., Mooney, R.J.: Learning to parse database queries using inductive logic programming. In: AAAI/IAAI, Vol. 2. (1996)
-  Manshadi, M., Gildea, D., Allen, J.F.: Integrating programming by example and natural language programming. In: AAAI. (2013)
-  She, L., Yang, S., Cheng, Y., Jia, Y., Chai, J.Y., Xi, N.: Back to the blocks world: Learning new actions through situated human-robot dialogue. In: SIGDIAL Conference. (2014)
-  Ling, W., Blunsom, P., Grefenstette, E., Hermann, K.M., Kociský, T., Wang, F., Senior, A.: Latent predictor networks for code generation. CoRR abs/1603.06744 (2016)
-  Yin, P., Neubig, G.: A syntactic neural model for general-purpose code generation. In: ACL. (2017)
-  Chaurasia, S., Mooney, R.J., Gligoric, M., Chaurasia, A., Mooney, R.: Dialog for natural language to code. (2017)
-  Gregor, K., Danihelka, I., Graves, A., Rezende, D.J., Wierstra, D.: Draw: A recurrent neural network for image generation. In: ICML. (2015)
-  Ballas, N., Yao, L., Pal, C.J., Courville, A.C.: Delving deeper into convolutional networks for learning video representations. CoRR abs/1511.06432 (2015)
-  Pérez-Quiñones, M.A., Sibert, J.L.: A collaborative model of feedback in human-computer interaction. In: CHI. (1996)
-  Payne, S.J.: Looking hci in the i. In: INTERACT. (1990)
-  Payne, S.J.: Display-based action at the user interface. Int. J. Man-Mach. Stud. 35(3) (September 1991) 275–289
-  Hutchins, E., Hollan, J.D., Norman, D.A.: Direct manipulation interfaces. Human-Computer Interaction 1 (1985) 311–338
-  Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier gans. In: ICML. (2017)
-  Mansimov, E., Parisotto, E., Ba, J., Salakhutdinov, R.: Generating images from captions with attention. CoRR abs/1511.02793 (2015)
-  Im, D.J., Kim, C.D., Jiang, H., Memisevic, R.: Generating images with recurrent adversarial networks. CoRR abs/1602.05110 (2016)
-  Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., Torralba, A.: Object detectors emerge in deep scene cnns. CoRR abs/1412.6856 (2014)
-  Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. CoRR abs/1704.05796 (2017)
-  Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C.L., Parikh, D., Batra, D.: Vqa: Visual question answering. 2015 IEEE International Conference on Computer Vision (ICCV) (2015) 2425–2433
-  Lei, T., Barzilay, R., Jaakkola, T.S.: Rationalizing neural predictions. In: EMNLP. (2016)
-  Hendricks, L.A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., Darrell, T.: Generating visual explanations. In: ECCV. (2016)
-  Denton, E.L., Chintala, S., Szlam, A., Fergus, R.: Deep generative image models using a laplacian pyramid of adversarial networks. In: NIPS. (2015)
-  Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. CoRR abs/1710.10196 (2017)
-  Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional lstm network: A machine learning approach for precipitation nowcasting. In: NIPS. (2015)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009) 248–255
-  Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch. In: NIPS-W. (2017)