The volume of visual media has grown tremendously in recent years, which has intensified the need for professional image editing tools (e.g., Adobe Photoshop, Microsoft Photos). However, visual editing remains a challenging task that relies heavily on manual effort, which is time-consuming and requires artistic creativity as well as iterative experimentation. A natural approach to automating the process is to provide an interactive environment in which a system generates images automatically by following users’ verbal commands, and in which users provide feedback on intermediate results, which in turn allows the system to further refine the generated images.
In this work, we propose a new task, interactive image editing via conversational language, in which a system generates new images by interacting with users in multi-turn dialogue. Figure 1 shows an example system, which supports natural language communication between the agent and a user for image editing (e.g., customizing shape, color, size, or texture). Potential applications of such a system range from dialogue-based visual design to language-guided visual assistance.
Current approaches useful for this task are limited to either keyword-based input (e.g., object attributes) [3, 11] or a single-turn setting [6, 33, 25]. While these paradigms are effective to some degree, restricting the format of user feedback inevitably constrains the information that a user can convey to the agent to influence the image generation process. To address these limitations, we propose a new conditional Generative Adversarial Network (GAN) framework, which includes an image generator for generating intermediate results and a neural state tracker for encoding dialogue context. In each dialogue turn, the generator produces a new image by taking as input the dialogue history and the previously generated images. To fully utilize the sequential information, the proposed model is trained end-to-end on the full dialogue sequence. Moreover, we adopt a regularizer based on an image-text matching loss, as well as an attention mechanism, for better fine-grained text-to-image generation and multi-turn refinement.
As this is a newly proposed task, we also introduce two new datasets, Zap-Seq and DeepFashion-Seq, which were collected via crowdsourcing in a real-world application scenario. In total, there are 8,734 dialogues in Zap-Seq and 4,820 in DeepFashion-Seq. Each dialogue consists of a sequence of images with slight variations in design, accompanied by a sequence of textual descriptions of the differences between each pair of consecutive images, similar to the examples in Figure 1.
Experiments on these two datasets show that the proposed SeqAttnGAN framework achieves better performance than state-of-the-art techniques. In particular, by incorporating dialogue history information, SeqAttnGAN is able to generate high quality images, beating all baseline models in terms of contextual relevance and consistency. Results also show that allowing natural language feedback is more effective than only taking keywords or visual attributes as input, as used in previous approaches.
Our contributions are summarized as follows:
We propose a new task for visual editing, interactive image editing via dialogue, which allows users to provide natural language commands and feedback for image editing through multi-turn dialogue.
We introduce two new datasets for this task, Zap-Seq and DeepFashion-Seq, collected through crowdsourcing. With free-form descriptions and diverse vocabularies, the two datasets provide reliable benchmarks for the interactive image editing task.
We propose a new conditional GAN framework, SeqAttnGAN, which can fully utilize dialogue history to synthesize images that conform to the user’s iterative feedback.
2 Related Work
Image Generation and Editing
Language-based image editing [6, 21] is a task designed to reduce manual labor while helping users create visual data. Specifically, a system that performs automatic image editing should be able to understand which part of the image the user is referring to. This is a very challenging task, which requires comprehensive understanding of both natural language and visual information. Several studies have explored this direction. Hu et al. [18] tackled the language-based image segmentation task, taking a phrase as the input. Manuvinakurike et al. [21] developed a system that uses simple language to modify an image, where a classification model is used to understand user intent. Wang et al. [34] proposed a neural model for global image editing.
Since the introduction of GANs [14], there has been a surge of interest in image generation tasks. In the conditional GAN space, there have been studies on generating images from images, captions or attributes, and object patches. There have also been studies on how to parameterize the models and the training framework beyond the vanilla GAN. Zhang et al. [40] stacked several GANs for text-to-image synthesis, with different GANs generating images of different sizes. In these studies, the image is synthesized at the context level but is not region-specific.
Dialogue-based Vision Tasks
AI tasks that lie at the intersection of computer vision and natural language processing have drawn much attention in the research community, benefiting from the latest deep learning techniques and GANs. Such tasks include visual question answering, visual-semantic embeddings [36], grounding phrases in image regions [30], and image-grounded conversation [24].
Most approaches have focused on end-to-end neural models based on encoder-decoder architectures and sequence-to-sequence learning [13, 32, 4, 8]. Das et al. [8] proposed the task of visual dialogue, where the agent answers questions about images in an interactive dialogue. De Vries et al. [9] introduced the GuessWhat?! game, where a series of questions is asked to pinpoint a specific object in an image, with yes/no/NA answers. However, these dialogue settings are purely text-based, and visual features only play a supporting role. DeVault et al. investigated building dialogue systems that help users efficiently explore data through visualizations. Guo et al. [15] introduced an agent that presents candidate images to the user and retrieves new images based on the user’s feedback. Another piece of related work is [3], which performs interactive image generation by encoding history information. In contrast to these works, our work uses text information to guide image generation and editing.
3 Datasets: Zap-Seq and DeepFashion-Seq
The interactive image editing task is defined as follows: a user interacts with the system via iterative dialogue turns to edit an image. In the t-th dialogue turn, the system presents the user with a reference image that it has generated. The user then gives natural language feedback describing the difference between the reference image and the desired image. This process continues iteratively until the user is satisfied with the result rendered by the system, or the maximum number of dialogue turns has been reached.
Existing image generation datasets are mostly single-turn and thus not suitable for this new multi-turn task. Therefore, we developed two new datasets for the proposed task, Zap-Seq and DeepFashion-Seq, which were derived from two existing datasets (UT-Zap50K [39] and DeepFashion [20]).
First, we retrieve sequences of images from the two datasets, with each sequence containing 3 to 5 images. Every pair of consecutive images differs slightly in one or two attributes. As a result, a total of 8,734 image sequences were extracted from UT-Zap50K and 4,820 sequences from DeepFashion. After collecting these image sequences, the second step is to collect natural language descriptions that capture the difference between each image pair. We resorted to crowdsourcing via Amazon Mechanical Turk [5] for this data collection task. Specifically, each human annotator was asked to provide a free-form sentence describing the difference between two given images. Figure 2 provides some examples. The interface of the data collection task is provided in the Appendix. To provide a robust dataset, we also randomly select images from the two original datasets to form additional sequences, which make up a portion of the total datasets.
After manually removing wrong and duplicate annotations, we collected a total of 18,497 descriptions for the Zap-Seq dataset and 12,765 for DeepFashion-Seq. The statistics of the two datasets are shown in Table 1.
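The turn-taking protocol described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `edit_session`, `generate`, `get_feedback`, and `max_turns` are hypothetical names that do not come from the paper, and images and feedback are stood in for by plain strings.

```python
from typing import Callable, List, Optional, Tuple


def edit_session(
    initial_image: str,
    generate: Callable[[str, List[str]], str],
    get_feedback: Callable[[str], Optional[str]],
    max_turns: int = 5,
) -> Tuple[str, List[str]]:
    """Run the dialogue until the user is satisfied or max_turns is reached.

    get_feedback returns None when the user accepts the reference image.
    """
    reference = initial_image
    history: List[str] = []
    for _ in range(max_turns):
        feedback = get_feedback(reference)
        if feedback is None:  # user is satisfied with the current image
            break
        history.append(feedback)
        # The system conditions on the previous image and the full history.
        reference = generate(reference, history)
    return reference, history


# Toy usage: the stand-in "generator" appends the latest request to the label.
requests = iter(["make it red", "add a strap", None])
final, turns = edit_session(
    "shoe",
    generate=lambda img, hist: f"{img}+{hist[-1]}",
    get_feedback=lambda img: next(requests),
)
```

Returning `None` from the feedback callback is one possible way to model the "user is satisfied" stopping condition; the turn limit covers the other.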
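The paper does not spell out how the image sequences were retrieved, so the following is only a minimal sketch of one way to chain images whose consecutive members differ in one or two attributes. The greedy strategy, the attribute names, and the helpers `num_differing` and `build_sequence` are all assumptions made for illustration.

```python
from typing import Dict, List

Attributes = Dict[str, str]


def num_differing(a: Attributes, b: Attributes) -> int:
    """Count attributes whose values differ between two images."""
    keys = set(a) | set(b)
    return sum(1 for k in keys if a.get(k) != b.get(k))


def build_sequence(start: int, images: List[Attributes], length: int = 3) -> List[int]:
    """Greedily chain indices so consecutive images differ in 1 or 2 attributes."""
    seq = [start]
    while len(seq) < length:
        last = images[seq[-1]]
        nxt = next(
            (i for i, img in enumerate(images)
             if i not in seq and 1 <= num_differing(last, img) <= 2),
            None,
        )
        if nxt is None:  # no suitable neighbour left
            break
        seq.append(nxt)
    return seq


# Hypothetical attribute annotations for four shoe images.
catalog = [
    {"color": "black", "heel": "flat", "material": "leather"},
    {"color": "red", "heel": "flat", "material": "leather"},
    {"color": "red", "heel": "high", "material": "suede"},
    {"color": "white", "heel": "kitten", "material": "canvas"},
]
sequence = build_sequence(0, catalog)
```

In this toy catalog the last image differs from every other image in all three attributes, so it is never chained into the sequence.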
| Dataset | #dialogues | #turns per dialogue | #descriptions | #words per description | #unique words |
Most descriptions are very concise (between 4 and 8 words), yet the vocabulary is highly diverse (943 unique words in the Zap-Seq dataset and 687 in DeepFashion-Seq). Compared with pre-defined attributes, the descriptions often include fine-grained spatial or structural details. More information about the data collection procedure and the datasets can be found in the Appendix.
4 Sequential Attention GAN (SeqAttnGAN)
For the new task, we develop a model that generates a new image given the current dialogue description, while preserving coherence with the natural language description, visual quality, and naturalness. Our framework is inspired by Generative Adversarial Networks (GANs) [14, 23] and consists of three components: an image generator, a neural state tracker, and a context encoder. As shown in Figure 3, in the t-th dialogue turn, the context encoder encodes the user response and passes it to the state tracker. The state tracker then aggregates this representation with the dialogue history from previous turns. Based on the joint representation of the user response and the previous intermediate images, the image generator produces a new image.
Specifically, in the t-th step the generator takes the hidden state as input and generates an image. The hidden state is produced by the neural state tracker, which is based on a gated recurrent unit (GRU): it fuses the representation of the current user response with the dialogue history representation from previous dialogue turns, and outputs the aggregated feature vector. A noise vector is sampled in each step from a standard normal distribution, and the first turn starts from the “initial image”. The final response representation is obtained by concatenating the inputs and embedding them through a linear transformation. The proposed attention model is discussed in the following subsection.
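To make the role of the GRU-based state tracker concrete, here is a minimal pure-Python sketch of a single GRU cell that fuses a response embedding with the running dialogue state. It is not the authors' implementation: the class name `GRUTracker`, the dimensions, the random initialisation, and the omission of bias terms are all assumptions, and the update follows the common convention h' = (1 - z) * n + z * h.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUTracker:
    """Minimal GRU cell (no biases): fuses a response with the dialogue state."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One (input, hidden) weight pair per gate: update z, reset r, candidate n.
        self.w = {g: init(hidden_dim, input_dim) for g in "zrn"}
        self.u = {g: init(hidden_dim, hidden_dim) for g in "zrn"}

    def _pre(self, gate: str, x: Vector, h: Vector) -> Vector:
        return [a + b for a, b in zip(matvec(self.w[gate], x), matvec(self.u[gate], h))]

    def step(self, response: Vector, state: Vector) -> Vector:
        z = [sigmoid(v) for v in self._pre("z", response, state)]
        r = [sigmoid(v) for v in self._pre("r", response, state)]
        gated = [ri * si for ri, si in zip(r, state)]  # reset gate applied to history
        n = [math.tanh(v) for v in self._pre("n", response, gated)]
        # Interpolate between the old state and the candidate state.
        return [(1.0 - zi) * ni + zi * si for zi, ni, si in zip(z, n, state)]


tracker = GRUTracker(input_dim=4, hidden_dim=3)
state = [0.0, 0.0, 0.0]
for response in ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]):  # two dialogue turns
    state = tracker.step(response, state)
```

Because the same cell is applied at every turn, the state after turn t depends on all earlier responses, which is the property the tracker relies on to carry dialogue history.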
To perform compositional mapping [37, 33, 38], i.e., to encourage the model to produce regions and associated features that conform to the textual description, we introduce an attention module into the framework. The module takes the user feedback and the image features from the previous hidden layer as inputs. First, the user feedback is converted to the common semantic space via a transform layer. Then, a word-context vector is computed for each sub-region of the image based on its hidden features. For each sub-region of the image (a column of the hidden feature matrix), the word-context vector is obtained by learning attention weights over every word in the feedback given that sub-region. Finally, the module produces a word-context matrix, which is passed to the neural state tracker to generate an image in the t-th step.
Compared with AttnGAN [38], our framework deploys the attention component in a dialogue sequence. All dialogue turns share the same generator, while AttnGAN has disjoint generators for different scales. Hence we name our model Sequential Attention GAN (SeqAttnGAN). As in prior work, the objective of SeqAttnGAN is the joint conditional-unconditional loss over the discriminator and generator. With the supervision of the ground-truth image in the t-th turn, the loss of the generator is defined as:
where the loss of the discriminator is calculated by:
where the real image is drawn from the true data distribution and the generated image is drawn from the model distribution.
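The word-context computation can be illustrated with a small pure-Python sketch. It assumes dot-product scoring followed by a softmax over words, as in AttnGAN-style attention; the function names and the toy features are hypothetical, and the word features are assumed to be already projected into the common semantic space.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_context(regions: List[Vector], words: List[Vector]) -> List[Vector]:
    """Return one word-context vector per image sub-region.

    regions: hidden features of the sub-regions (one vector per region).
    words:   word features already projected into the same space.
    """
    dim = len(words[0])
    contexts = []
    for region in regions:
        # Dot-product relevance of every word to this sub-region.
        scores = [sum(r * w for r, w in zip(region, word)) for word in words]
        weights = softmax(scores)
        # Attention-weighted sum of the word features.
        contexts.append(
            [sum(wt * word[d] for wt, word in zip(weights, words)) for d in range(dim)]
        )
    return contexts


words = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for two word embeddings
regions = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
contexts = word_context(regions, words)
```

A region whose features align with one word receives a context vector dominated by that word, while an uninformative region (the all-zero one here) receives a uniform average of the words.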
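One plausible form of these two losses, written in the AttnGAN-style joint conditional-unconditional formulation, is sketched below. The notation is an assumption introduced for illustration and may differ from the authors' own: $x_t$ is the ground-truth image at turn $t$, $\hat{x}_t$ the generated image, $e_t$ the encoded dialogue context, and $D$ the discriminator. The single-argument terms score realism alone, while the two-argument terms score agreement between image and text.

```latex
\begin{aligned}
\mathcal{L}_{G} &= -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_t \sim p_{G}}\big[\log D(\hat{x}_t)\big]
                  -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_t \sim p_{G}}\big[\log D(\hat{x}_t, e_t)\big], \\
\mathcal{L}_{D} &= -\tfrac{1}{2}\,\mathbb{E}_{x_t \sim p_{\mathrm{data}}}\big[\log D(x_t)\big]
                  -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_t \sim p_{G}}\big[\log\big(1 - D(\hat{x}_t)\big)\big] \\
                &\quad -\tfrac{1}{2}\,\mathbb{E}_{x_t \sim p_{\mathrm{data}}}\big[\log D(x_t, e_t)\big]
                  -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_t \sim p_{G}}\big[\log\big(1 - D(\hat{x}_t, e_t)\big)\big].
\end{aligned}
```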
Deep Multimodal Similarity Regularizer
In addition to the GAN loss, we add another term to the SeqAttnGAN objective. We adopt the deep attentional multimodal similarity model (DAMSM) used in [12, 38] to: 1) maximize the utility of all the input information (such as attributes) to boost model performance; and 2) regularize the model in order to stabilize the image generator. DAMSM measures the similarity between the synthesized images and the user input sentences, acting as an effective regularizer. For simplicity, we call it the DMS regularizer in this paper.
The DMS function is pre-trained using dialogue data. Specifically, for any dialogue sample, we first retrieve an initial image, then concatenate the attribute values of the image and the annotated description to form a new text. Note that here we combine attributes and reference feedback as the text, which is different from [38, 40]. After selecting image-description pairs, we obtain the image-description corpus. Following [17, 38], the posterior probability of a description matching an image is defined as:
where the definition involves a smoothing factor and the word-level attention-driven image-text matching score (i.e., the attention weights are calculated between the sub-regions of an image and each word of its corresponding text). Given a batch of image-description pairs, the loss function for matching the images with their corresponding descriptions is:
Symmetrically, we can also define the loss function for matching textual descriptions with their corresponding images (by switching the roles of images and descriptions). Combining these two, the pre-trained DMS function is computed as:
In summary, we use an idea similar to DAMSM to form the DMS regularizer. The training pairs are created by concatenating attributes and user descriptions. The image-description matching score is calculated in each step. By bringing in the discriminative power of DMS, the model can generate region-specific features that align well with the descriptions while also improving visual diversity.
Adding all terms together, given a sequence of full supervisions, the overall objective of SeqAttnGAN is defined as:
where a hyperparameter balances the two loss functions. The GAN loss and the DMS loss are computed in each step and aggregated to update the gradients, similar to the training of regular dialogue systems.
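The batch-level matching loss can be sketched as follows, assuming the DAMSM-style form in which the posterior is a softmax over matching scores scaled by the smoothing factor. The score matrices are toy values, the helper names are hypothetical, and the default `gamma` is an arbitrary illustrative choice rather than a value reported in the paper.

```python
import math
from typing import List


def matching_loss(scores: List[List[float]], gamma: float = 10.0) -> float:
    """Negative log posterior that each image matches its own description.

    scores[i][j] is the matching score between image i and description j;
    the correct pairs lie on the diagonal. gamma is the smoothing factor
    (its default here is an assumption).
    """
    loss = 0.0
    for i, row in enumerate(scores):
        logits = [gamma * s for s in row]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        loss -= logits[i] - log_norm  # -log P(description_i | image_i)
    return loss


def dms_loss(scores: List[List[float]], gamma: float = 10.0) -> float:
    """Sum of the image-to-text and text-to-image matching losses."""
    transposed = [list(col) for col in zip(*scores)]
    return matching_loss(scores, gamma) + matching_loss(transposed, gamma)


aligned = [[1.0, 0.0], [0.0, 1.0]]    # each image scores highest on its own text
confused = [[0.5, 0.5], [0.5, 0.5]]   # scores carry no information
loss_aligned = dms_loss(aligned)
loss_confused = dms_loss(confused)
```

Transposing the score matrix is what implements the symmetric text-to-image direction: the same function is reused with the roles of images and descriptions swapped.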
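As a rough illustration of how the per-turn terms might be combined, the sketch below sums a GAN loss and a weighted DMS loss over the turns of one dialogue. Whether the original objective sums or averages over turns is not recoverable from the text, so the plain sum, the function name, and the numbers are assumptions.

```python
from typing import Sequence


def sequence_objective(
    gan_losses: Sequence[float],
    dms_losses: Sequence[float],
    lam: float,
) -> float:
    """Sum the per-turn generator loss and the weighted DMS loss over a dialogue."""
    if len(gan_losses) != len(dms_losses):
        raise ValueError("expected one GAN loss and one DMS loss per turn")
    return sum(g + lam * d for g, d in zip(gan_losses, dms_losses))


# Hypothetical per-turn losses for a three-turn dialogue.
total = sequence_objective([0.9, 0.7, 0.6], [0.4, 0.3, 0.2], lam=2.0)
```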
The text encoder used in the model follows prior work and can be jointly tuned with each module in our framework. For the image encoder, we use a Convolutional Neural Network (CNN), ResNet-101 [16], pre-trained on ImageNet [10] with fixed parameters. For the DMS function, we use a Bi-directional Long Short-Term Memory (BiLSTM) network to encode the text. The batch size is set to 50. The remaining hyper-parameters are set empirically through experiments.
We validate the effectiveness of our model through both quantitative and qualitative evaluations. Given the subjective nature of dialogue with visual synthesis, we also conduct a user study via crowdsourcing to compare our approach with state-of-the-art methods.
5.1 Datasets and Baselines
All experiments are performed on the Zap-Seq and DeepFashion-Seq datasets with the same splits: 90% of the images were used for training, and the model was evaluated on a held-out test set drawn from the remaining 10%. The training process used image pairs sampled from the training set, which has no overlap with the test set.
We compare our approach with several baselines:
(1) StackGAN. The first baseline is StackGAN v1 [40], chosen because the resolution of images in Zap-Seq and DeepFashion-Seq is low. A generator is trained to generate target images at a fixed resolution, based on the reference attributes and the descriptions. In the other components of the StackGAN architecture, all hyper-parameters and training epochs remain the same as in the original.
(2) AttnGAN. AttnGAN [38] currently achieves the state-of-the-art Inception Score on MS-COCO. Similar to StackGAN, we utilize AttnGAN1, generating images at a fixed resolution. The discriminator and all the hyper-parameters stay unchanged.
(3) LIBE. We also use the recurrent attentive model employed in [6] for image coloring and segmentation tasks as another baseline. Like AttnGAN and StackGAN, LIBE utilizes image-caption pairs to train the model. The hyper-parameter setting and training details remain the same as in the original paper.
For training, we use bounding box information for images. Data augmentation is also performed on both datasets. Specifically, images are cropped and augmented with horizontal flips. To perform a fair comparison, all models share the same generator and discriminator structure. The text encoder is also shared. Baseline model training follows the standard conditional GAN training procedure, using Adam with the default batch size of 32.
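The crop-and-flip augmentation can be sketched in a few lines of pure Python. The crop size is not recoverable from the text, so `size` is left as a parameter; the choice of a centre crop, the flip probability of 0.5, and the helper names are likewise assumptions.

```python
import random
from typing import List

Image = List[List[int]]  # rows of pixel values


def center_crop(img: Image, size: int) -> Image:
    top = (len(img) - size) // 2
    left = (len(img[0]) - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def horizontal_flip(img: Image) -> Image:
    return [row[::-1] for row in img]


def augment(img: Image, size: int, rng: random.Random) -> Image:
    """Crop to size x size, then flip left-right with probability 0.5."""
    out = center_crop(img, size)
    return horizontal_flip(out) if rng.random() < 0.5 else out


# Toy 4x4 "image" whose pixel values encode their position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
cropped = center_crop(image, 2)
flipped = horizontal_flip(cropped)
augmented = augment(image, 2, random.Random(0))
```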
5.2 Quantitative Evaluation
In this section, we provide the quantitative evaluation and analysis on the two datasets. In each step of a dialogue from the test set, we randomly sampled one image from each model, then calculated the Inception Score (IS) and Fréchet Inception Distance (FID), comparing each selected sample with the ground-truth image. The averaged numbers are presented in Table 2. Our SeqAttnGAN model outperforms StackGAN and LIBE on the Zap-Seq dataset, with slightly worse performance than AttnGAN. On the DeepFashion-Seq dataset, our model achieves the best results among all the models.
Next, to evaluate whether the generated images are coherent with the input text, we also measure the Structural Similarity Index (SSIM) score between the generated images and the ground-truth images. Table 3 summarizes the results, which show that the images generated by our model are more consistent with the ground truth than those of all the baselines. This indicates that the proposed model can generate images with better contextual coherence.
Figure 4 and Figure 5 present a few examples comparing all the approaches with the ground truth. In each example, we can observe that our model generates images consistent with the ground-truth images and the reference descriptions. Specifically, SeqAttnGAN can generate images with good visual quality while incorporating the changes described in the text. Even for some fine-grained features (“kitten heel”, “leather”, “button”), the generated images satisfy the requirement well. AttnGAN is able to synthesize visually sharp and diverse images, but is not as good as our method in terms of context relevance. StackGAN does not perform as well as our model or AttnGAN, in terms of both visual quality and content consistency. This observation is consistent with the quantitative study.
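For reference, the SSIM formula can be illustrated with a simplified pure-Python version that uses whole-image statistics. Standard SSIM averages the same expression over local sliding windows, so this single-window variant is only a sketch; the function name and pixel values are hypothetical, and the constants follow the usual choice of (0.01 L)^2 and (0.03 L)^2 for dynamic range L.

```python
from typing import List


def global_ssim(x: List[float], y: List[float], dynamic_range: float = 255.0) -> float:
    """SSIM computed from whole-image statistics (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


target = [10.0, 50.0, 200.0, 120.0]  # flattened toy "ground-truth" pixels
same = global_ssim(target, target)
inverted = global_ssim(target, [255.0 - p for p in target])
```

An image compared with itself scores 1, while an intensity-inverted copy has negative covariance with the original and therefore scores much lower.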
Some examples from LIBE are shown in Figure 6. Visually, LIBE cannot generate good quality samples and can only capture the color information to some degree.
5.3 Human Evaluation
We perform a human evaluation on Amazon Mechanical Turk. For each dataset, we randomly sampled 100 image sequences generated by all the models, each assigned to 3 workers to label. The source model of each image is hidden from the annotators for fair comparison. The participants were asked to rank the quality of the generated image sequences with respect to: 1) consistency with the description and the source image, and 2) visual quality and naturalness.
Table 4 provides the results from this evaluation. For each approach, we computed the average ranking (where 1 is the best and 3 is the worst) and its standard deviation. Results show that our approach achieves the best rank. This human study indicates that our solution achieves the best visual quality and image-text consistency among all the models.
| StackGAN | 2.68 ± 0.24 | 2.53 ± 0.27 |
| AttnGAN | 2.37 ± 0.27 | 2.46 ± 0.29 |
| SeqAttnGAN | 1.84 ± 0.23 | 1.88 ± 0.25 |
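The ranking statistics reported above can be computed as in the following sketch. The rankings are made-up toy values, not the study's data, and the use of the population standard deviation is an assumption since the paper does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical rankings: one list per model, one entry per (sequence, worker)
# judgement, where 1 is the best rank and 3 is the worst.
rankings: Dict[str, List[int]] = {
    "StackGAN": [3, 3, 2, 3],
    "AttnGAN": [2, 2, 3, 1],
    "SeqAttnGAN": [1, 1, 1, 2],
}

# Average rank and its spread for every model.
summary = {model: (mean(r), pstdev(r)) for model, r in rankings.items()}
best = min(summary, key=lambda model: summary[model][0])
```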
Besides the crowdsourced human evaluation, we also recruited real users to interact with the proposed system. Figure 7 shows examples of several dialogue interactions with real users. We can observe that users often start the dialogue with a high-level description of the main attributes (e.g., color, category). As the dialogue progresses, users give more specific feedback on fine-grained details. The benefit of free-form dialogue is demonstrated by the flexible usage of fine-grained attribute words (“white shoelace”), as well as comparative descriptions (“thinner”, “more open”). Our model is able to capture both coarse and fine-grained differences between images through multi-turn refinement. Overall, these results show that the proposed SeqAttnGAN model has promising potential to generalize to real-world applications.
5.4 Ablation Study
We conducted an ablation study to verify the effectiveness of the two main components of the proposed SeqAttnGAN model: the attention module and the DMS regularizer. We first compare the IS, FID and SSIM scores of SeqAttnGAN with and without attention and DMS. Table 5 shows the ablation results on Zap-Seq and DeepFashion-Seq.
We observe that both attention and DMS improve the model by a large margin. We also show some examples generated with the different variants in Figure 8. Results show that the full model outperforms the ablated variants, as DMS helps stabilize the training while the attention module improves image-text consistency. Similar observations can be made on the DeepFashion-Seq dataset.
In this paper, we present interactive image editing via dialogue, a novel task that resides in the intersection of computer vision and language. We demonstrate the value of this task as well as its many challenges. To provide benchmarks for this new task, we release two datasets, with image sequences accompanied by textual descriptions. To solve this task, we propose the SeqAttnGAN model, which can jointly model user’s description and dialogue history to iteratively generate images.
Experimental results demonstrate the effectiveness of SeqAttnGAN. In both quantitative and human evaluation, our approach with sequential training outperforms baseline methods that rely on pre-defined attributes or are trained in a single-turn paradigm, while offering more expressive and natural human-machine communication. In particular, the proposed attention technique encourages the networks to focus on specific areas of the image, and the DMS function regularizes the model to boost its rendering power.
The results are limited by the fashion data we adopted. In future work, we plan to build a generic system for other types of images (e.g., faces [35]). Currently the framework still needs associated attributes to train the regularizer, which makes it hard to generalize. We would also like to investigate ways to avoid using attribute data. Finally, we plan to investigate models that support more robust natural language interactions, which requires techniques such as user intent understanding and co-reference resolution.
References
- [1] 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017.
- [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
- [3] R. Y. Benmalek, C. Cardie, S. J. Belongie, X. He, and J. Gao. The neural painter: Multi-turn image generation. CoRR, abs/1806.06183, 2018.
- [4] A. Bordes and J. Weston. Learning end-to-end goal-oriented dialog. CoRR, abs/1605.07683, 2016.
- [5] M. Buhrmester, T. Kwang, and S. Gosling. Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6:3–5, 2011.
- [6] J. Chen, Y. Shen, J. Gao, J. Liu, and X. Liu. Language-based image editing with recurrent attentive models. arXiv preprint arXiv:1711.06288, 2017.
- [7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, 2014.
- [8] A. Das, S. Kottur, J. M. F. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In ICCV, pages 2970–2979. IEEE Computer Society, 2017.
- [9] H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In CVPR, pages 4466–4475. IEEE Computer Society, 2017.
- [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [11] M. Dixit, R. Kwitt, M. Niethammer, and N. Vasconcelos. AGA: Attribute guided augmentation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2017.
- [12] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.
- [13] J. Gao, M. Galley, and L. Li. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267, 2018.
- [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
- [15] X. Guo, H. Wu, Y. Cheng, S. Rennie, and R. S. Feris. Dialog-based interactive image retrieval. CoRR, abs/1805.00145, 2018.
- [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- [17] X. He, L. Deng, and W. Chou. Discriminative learning in sequential pattern recognition. volume 25, pages 14–36. Institute of Electrical and Electronics Engineers, Inc., September 2008.
- [18] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
- [19] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
- [20] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [21] R. Manuvinakurike, T. Bui, W. Chang, and K. Georgila. Conversational image editing: Incremental intent identification in a new dialogue task. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 284–295, Melbourne, Australia, July 2018. Association for Computational Linguistics.
- [22] R. Manuvinakurike, D. DeVault, and K. Georgila. Using reinforcement learning to model incrementality in a fast-paced dialogue game. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbruecken, Germany, Aug. 2017. SIGDIAL.
- [23] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
- [24] N. Mostafazadeh, C. Brockett, B. Dolan, M. Galley, J. Gao, G. Spithourakis, and L. Vanderwende. Image-grounded conversations: Multimodal context for natural question and response generation. In IJCNLP, 2017.
- [25] S. Nam, Y. Kim, and S. J. Kim. Text-adaptive generative adversarial networks: Manipulating images with natural language. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 42–51. Curran Associates, Inc., 2018.
- [26] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2642–2651, 2017.
- [27] X. Ouyang, Y. Cheng, Y. Jiang, C.-L. Li, and P. Zhou. Pedestrian-Synthesis-GAN: Generating pedestrian data in real scene and beyond. arXiv preprint arXiv:1804.02047, 2018.
- [28] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1060–1069, 2016.
- [29] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 1060–1069. JMLR.org, 2016.
- [30] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
- [31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. volume 45, Nov. 1997.
- [32] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 3776–3783, 2016.
- [33] S. Zhu, S. Fidler, R. Urtasun, D. Lin, and C. C. Loy. Be your own Prada: Fashion synthesis with structural coherence. In International Conference on Computer Vision (ICCV), 2017.
- [34] H. Wang, J. D. Williams, and S. Kang. Learning to globally edit images with textual description. CoRR, abs/1810.05786, 2018.
- [35] J. Wang, Y. Cheng, and R. S. Feris. Walk and learn: Facial attribute representation learning from egocentric video and contextual data. In CVPR, pages 2295–2304. IEEE Computer Society, 2016.
- [36] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013, 2016.
- [37] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
- [38] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. CoRR, abs/1711.10485, 2017.
- [39] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’14, pages 192–199, 2014.
- [40] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
- [41] B. Zhao, J. Feng, X. Wu, and S. Yan. Memory-augmented attribute manipulation networks for interactive fashion search. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6156–6164, 2017.