Recent developments in deep learning and generative adversarial networks (GANs) have made it possible to generate realistic-looking, high-resolution images. Image generation techniques generally come in two forms: i) unconditional image generation, where the generator produces an image starting from a noise vector [Goodfellow et al., 2014]; and ii) conditional image generation, where the aim is to generate an image that adheres to a given condition [Mirza and Osindero, 2014; Isola et al., 2017; Zhu et al., 2017].
While a lot of work has been done on single-conditional image synthesis, unsupervised multi-conditional image synthesis is relatively new. We aim to generate an image conditioned on two input images such that the generated image has the texture of the first and the shape of the second.
Our work builds on the recently released FineGAN [Singh et al., 2018]. Its authors propose a GAN-based framework that learns to disentangle the background, shape and texture of an image in an unsupervised manner. The network generates an image conditioned on input background, shape and texture codes.
Our pipeline takes two images as input and generates an output image (O). It consists of three steps: i) compute the texture code (T) that describes the first input image; ii) compute the shape code (S) that describes the second input image; and iii) feed the computed codes (T and S) to the pre-trained FineGAN network to obtain the desired output O.
To compute the codes (T and S), we take a trained FineGAN and iterate over all possible combinations of shape and texture codes for 10 different noise vectors (different noise vectors lead to different orientations of the generated image). We denote by G the set of all such generated images. To compute the texture code (T), we find the nearest neighbour of the first input image amongst G in the embedding space of an ImageNet [Russakovsky et al., 2015] pre-trained ResNet50 model [He et al., 2016], and take the texture code of that neighbour. The embedding space is defined by the output of the Global Average Pooling layer of the ResNet50 model.
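The lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, random vectors stand in for the ResNet50 pooled features, the code counts are toy values, and Euclidean distance is an assumption, since the text does not name the metric.

```python
import itertools
import numpy as np


def enumerate_code_grid(num_shape, num_texture, num_noise=10):
    """List every (shape code, texture code, noise index) triple used to build G."""
    return list(itertools.product(range(num_shape),
                                  range(num_texture),
                                  range(num_noise)))


def nearest_neighbour_index(query_emb, bank_embs):
    """Row index of bank_embs closest to query_emb (Euclidean distance, an assumption)."""
    dists = np.linalg.norm(bank_embs - query_emb, axis=1)
    return int(np.argmin(dists))


def lookup_texture_code(query_emb, bank_embs, grid):
    """Texture code of the generated image whose embedding is nearest to the query."""
    _, texture, _ = grid[nearest_neighbour_index(query_emb, bank_embs)]
    return texture


# Toy data: one 16-d random vector per generated image stands in for the
# pooled ResNet50 features (hypothetical sizes, chosen only for illustration).
rng = np.random.default_rng(0)
grid = enumerate_code_grid(num_shape=3, num_texture=4)
bank_embs = rng.normal(size=(len(grid), 16))

# A query lying very close to generated image 37 should recover its texture code.
query_emb = bank_embs[37] + 0.01 * rng.normal(size=16)
texture_code = lookup_texture_code(query_emb, bank_embs, grid)
print(texture_code, grid[37])
```

In a real run, `bank_embs` would hold the embeddings of the images in G and `query_emb` the embedding of the first input image, both taken from the same network.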
We repeat the same process for the second input image to compute the shape code (S), except that we use a shape-biased pre-trained ResNet50 network. The motivation for using a shape-biased network stems from [Geirhos et al., 2018], where the authors show that ImageNet-trained models are biased towards the texture details of an image. The authors train the network on stylized variants of the ImageNet dataset, which results in a shape-biased network. We hypothesize that the shape bias of the network allows it to better capture the shape details of the image, leading to correct identification of the shape code of an input image. We verify this design choice both quantitatively and qualitatively in the following section.
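The two lookups can be written compactly. The notation here is illustrative rather than the authors': $I_1$ and $I_2$ stand for the first and second input images, $\phi_{\mathrm{tex}}$ and $\phi_{\mathrm{shape}}$ for the pooled embeddings of the standard and the shape-biased ResNet50, and $t(g)$ and $s(g)$ for the known texture and shape codes of a generated image $g \in G$. The Euclidean norm is an assumption, since the text does not specify the distance.

```latex
\begin{align}
T &= t\Big(\operatorname*{arg\,min}_{g \in G}
      \big\lVert \phi_{\mathrm{tex}}(I_1) - \phi_{\mathrm{tex}}(g) \big\rVert_2\Big), \\
S &= s\Big(\operatorname*{arg\,min}_{g \in G}
      \big\lVert \phi_{\mathrm{shape}}(I_2) - \phi_{\mathrm{shape}}(g) \big\rVert_2\Big).
\end{align}
```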
3 Results and Discussions
Our cFineGAN results over three datasets, CUB-200-2011 [Welinder et al., 2010], UT Zappos50k [Yu and Grauman, 2014] (we trained our own FineGAN model over this dataset) and Stanford Dogs [Khosla et al., 2015], are shown in figure 1. Additional results can be found in the Appendix section.
To quantitatively evaluate the benefit of using a shape-biased pre-trained model for extracting the shape code, we compute the nearest neighbour in the embedding space for each generated image in G. We define accuracy as the fraction of times the query image and its nearest neighbour have the same shape code. This metric can be computed because the shape codes of all the generated images are known. Table 1 shows that the shape-biased model achieves a much higher accuracy than a standard model. Some qualitative results are shown in figure 2.
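The accuracy metric can be sketched as below. This is a hedged illustration with a hypothetical function name and toy 2-d embeddings. It assumes Euclidean distance and that each query is excluded from its own neighbour search; the text states neither explicitly.

```python
import numpy as np


def shape_code_accuracy(embs, shape_codes):
    """Fraction of images whose nearest *other* image has the same shape code."""
    embs = np.asarray(embs, dtype=float)
    shape_codes = np.asarray(shape_codes)
    # Pairwise squared Euclidean distances between all embeddings.
    sq_norms = np.sum(embs ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embs @ embs.T
    # Exclude each query from its own neighbour search (an assumption).
    np.fill_diagonal(d2, np.inf)
    nearest = np.argmin(d2, axis=1)
    return float(np.mean(shape_codes[nearest] == shape_codes))


# Toy example: the last point carries shape code 1 but sits next to the
# code-0 cluster, so its nearest neighbour disagrees with it.
toy_embs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.2, 0.1]]
toy_codes = [0, 0, 1, 1, 1]
accuracy = shape_code_accuracy(toy_embs, toy_codes)
print(accuracy)
```

Running the same function once with embeddings from the standard network and once with embeddings from the shape-biased network would give the two figures being compared.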
We compare our method against the baseline approach of [Singh et al., 2018], where the authors train classifiers over the domain of generated images to predict the shape and texture codes of an input image. Since the classifier is trained on generated images but is expected to predict the codes of natural images at evaluation time, the large domain shift between the train and test settings leads to incorrect outputs. We show some qualitative comparisons against this baseline in figure 3. As can be seen, cFineGAN better captures the shape and texture details of the input images.
References
- [Geirhos et al., 2018] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231.
- [Goodfellow et al., 2014] Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
- [He et al., 2016] Deep residual learning for image recognition.
- [Isola et al., 2017] Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
- [Khosla et al., 2015] Novel dataset for fine-grained image categorization: Stanford Dogs.
- [Mirza and Osindero, 2014] Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- [Russakovsky et al., 2015] ImageNet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
- [Singh et al., 2018] FineGAN: unsupervised hierarchical disentanglement for fine-grained object generation and discovery. arXiv preprint arXiv:1811.11155.
- [Welinder et al., 2010] Caltech-UCSD Birds 200.
- [Yu and Grauman, 2014] Fine-grained visual comparisons with local learning. In Computer Vision and Pattern Recognition (CVPR).
- [Zhu et al., 2017] Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.