The motivation for this work was to find a way of transforming a generative model that had been trained on one distribution so that it outputs a completely new distribution of images that does not model an existing dataset. We approached this by taking the generator from a pre-trained generative adversarial network (GAN) Goodfellow et al. (2014) trained on one dataset (in this case ImageNet Deng et al. (2009)) and then fine-tuning it with features from another dataset using a classifier trained on data from both datasets.
With this approach we were hoping not simply to model the distribution of images in the new dataset, but to transform the generator so that it outputs a new distribution of images that fuses visual features from both datasets, resulting in a distribution with novel characteristics. By starting from a pre-trained model with good initial weights, we hoped to preserve some aspects of the original distribution, such as the spatial structure of the images, while instilling it with some new characteristics from the other dataset.
We created a dataset of approximately 14k images from Pinterest boards with the title a e s t h e t i c (see Figure 2 in the Appendix for samples). Images from these boards can usually be characterised by distinct, washed-out colour palettes (often with only one dominant colour in the image), and the photographs are often framed with no particular subject in focus.
We trained a binary classifier to distinguish between the a e s t h e t i c images and images from the ImageNet dataset Deng et al. (2009). (We also trained classifiers for other datasets with prominent aesthetic characteristics, but for brevity we discuss only the results from fine-tuning with the classifiers trained on the a e s t h e t i c dataset.) To train the classifier we fine-tuned a pre-trained ResNet He et al. (2016) model that had been trained to weakly classify Instagram hashtags and then ImageNet Mahajan et al. (2018). In addition to training the classifier to classify a e s t h e t i c images and ImageNet images as separate classes (contrastive features), we also, initially by accident, trained a classifier that classifies them as being in the same class (joint features), which led to significantly better results when used for fine-tuning the generator (see Section 3 for further discussion).
After training the cross-dataset classifier, we used this model to fine-tune the weights of a pre-trained BigGAN Brock et al. (2019) generator trained on the ImageNet dataset at a resolution of 128x128 pixels. (For this we used ‘The author’s officially unofficial PyTorch BigGAN implementation’, https://github.com/ajbrock/BigGAN-PyTorch, and would like to thank the authors of the repository, Andrew Brock and Alex Andonian, for releasing the model weights for the discriminator as well as the generator, without which this work would not have been possible.) We also used the frozen weights of the discriminator in the fine-tuning training procedure, updating the weights of the generator based on a weighted sum of the loss from the discriminator and the loss from the cross-dataset classifier (see Figure 1 for details). During this fine-tuning process, the networks are not exposed to any new training data; all the samples and losses are produced using only the pre-trained networks.
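The update described above can be illustrated with a minimal PyTorch sketch. It is not the exact procedure of Figure 1: the tiny fully connected networks are hypothetical stand-ins for the pre-trained BigGAN generator and discriminator and the fine-tuned ResNet classifier, class conditioning is omitted, and the hinge-style discriminator term, the binary cross-entropy classifier term, the learning rate, and the names `fine_tune_step`, `w_disc`, and `w_clf` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: the paper uses a pre-trained BigGAN generator and
# discriminator and a fine-tuned ResNet classifier instead of these toy nets.
LATENT_DIM, IMG_DIM = 16, 3 * 8 * 8
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, IMG_DIM), nn.Tanh()
)
discriminator = nn.Sequential(nn.Linear(IMG_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
classifier = nn.Sequential(nn.Linear(IMG_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

# Freeze the discriminator and the classifier: only the generator is updated.
for net in (discriminator, classifier):
    net.eval()
    for p in net.parameters():
        p.requires_grad_(False)

# Assumed optimiser settings, chosen only for this sketch.
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)


def fine_tune_step(batch_size=9, w_disc=1.0, w_clf=1.0):
    """One generator update computed from generated samples only."""
    z = torch.randn(batch_size, LATENT_DIM)
    fake = generator(z)

    # Assumed hinge-style generator loss against the frozen discriminator.
    loss_disc = -discriminator(fake).mean()

    # Assumed objective: push the classifier logits toward the target class.
    logits = classifier(fake)
    loss_clf = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # Weighted sum of the two losses drives the generator update.
    loss = w_disc * loss_disc + w_clf * loss_clf
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the two frozen networks have `requires_grad` disabled, gradients flow through them to the generated images but never modify their weights, and no real images enter the loop.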
The process of training and convergence is very rapid. Usually within 1000 iterations (using a batch size of 9) the generator has converged onto a configuration of the weights that satisfies both the cross-dataset classifier and the discriminator. However, we found that the best results were achieved using early stopping: the most interesting visual results often occurred when training was stopped after 300-600 iterations. Because training is so quick, it is trivial to try multiple configurations of the loss weighting and manually compare the visual results.
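One way to organise such a comparison is sketched below in plain Python. The function `sweep_with_snapshots`, its callbacks `make_step` and `save`, and the 100-iteration snapshot interval are hypothetical; only the 300-600 iteration window comes from the text. The sketch assumes that `make_step` restarts from a fresh copy of the pre-trained generator for each weighting.

```python
def sweep_with_snapshots(weightings, make_step, save,
                         snapshot_from=300, stop_at=600, every=100):
    """Run one short fine-tuning job per loss weighting and record snapshots.

    make_step(w_disc, w_clf) returns a zero-argument callable that performs
    one generator update (assumed to start from the pre-trained weights).
    save(w_disc, w_clf, iteration) stores samples or weights for inspection.
    """
    saved = []
    for w_disc, w_clf in weightings:
        step = make_step(w_disc, w_clf)
        for iteration in range(1, stop_at + 1):
            step()
            # Snapshot inside the early-stopping window, then stop at its end.
            in_window = iteration >= snapshot_from
            if in_window and (iteration - snapshot_from) % every == 0:
                save(w_disc, w_clf, iteration)
                saved.append((w_disc, w_clf, iteration))
    return saved


# Usage with dummy callbacks standing in for real training and saving code.
updates = []
snapshots = sweep_with_snapshots(
    weightings=[(1.0, 0.5), (1.0, 2.0)],
    make_step=lambda w_disc, w_clf: (lambda: updates.append((w_disc, w_clf))),
    save=lambda w_disc, w_clf, iteration: None,
)
```

Saving several snapshots per run, rather than only the final weights, keeps the choice of stopping point a matter of visual inspection afterwards.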
3 Discussion and Conclusion
In the process of this work we have happened upon a number of surprising results. The first is that the manner in which features from the different datasets get combined was highly unexpected. Neither fine-tuning with the contrastive features classifier nor fine-tuning with the joint features classifier has produced images that resemble the images in either the ImageNet or a e s t h e t i c datasets.
The second surprising result is that when fine-tuning with the joint features classifier the visual results were much richer and more varied (almost dreamlike in nature) than the results from fine-tuning with the contrastive features classifier (see Figures 4 and 5 in the Appendix for a detailed comparison). We speculate that the contrastive features classifier discards a lot of important features from the ImageNet distribution, so when the generator is fine-tuned, there are fewer combinations of features that can be used and the resulting distribution has much less variety.
In future research, we hope to find ways of having more control over what kinds of characteristics from the different datasets get combined in the fine-tuning process, be that characteristics relating to aesthetic qualities, the structure and form in the images, or the stylistic qualities of a given dataset. We also hope to apply these techniques to higher resolution GAN models. However, without access to pre-trained discriminators, it is currently not possible to apply these techniques to the higher resolution generative models that have been made publicly available without retraining the models from scratch.
This work has been supported by the UK’s EPSRC Centre for Doctoral Training in Intelligent Games and Game Intelligence (IGGI; grant EP/L015846/1).
- Brock et al. (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
- Deng et al. (2009) ImageNet: a large-scale hierarchical image database. pp. 248–255.
- Goodfellow et al. (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- He et al. (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Mahajan et al. (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196.