Recent years have witnessed the increasing power of GANs to generate high-quality samples that are indistinguishable from real data (Karras et al., 2017; Lucic et al., 2018; Miyato et al., 2018; Brock et al., 2019; Karras et al., 2019a), demonstrating the capability of GANs to exploit the valuable information within the underlying data distribution. Although many powerful GAN models pretrained on large-scale datasets have been released, few efforts have been made (Giacomello et al., 2019) to take advantage of the valuable information within those models to aid downstream tasks. This stands in clear contrast to the popularity of transfer learning for recognition tasks (e.g., reusing the feature extractor of a pretrained classifier) (Bengio, 2012; Donahue et al., 2014; Luo et al., 2017; Zamir et al., 2018) and in natural language processing (e.g., reusing the expensively pretrained BERT (Devlin et al., 2018)) (Bao & Qiao, 2019; Peng et al., 2019; Mozafari et al., 2019).
Motivated by the significant value of released pretrained GAN models, we propose to leverage the valuable information therein to facilitate downstream tasks with limited target data, a setting that arises frequently due to expensive data collection in the target domain or privacy regulations, as in medical or biological fields (Yi et al., 2019). Specifically, we concentrate on the challenging problem of limited-data generation, i.e., using the limited available data in the target domain to train a GAN for realistic generation, where the newly trained GAN is expected to capture the underlying data manifold with the help of the valuable information from released pretrained models. One key observation motivating this approach is that a well-trained GAN can generate realistic images not observed in the training dataset (Brock et al., 2019; Karras et al., 2019a; Han et al., 2019), demonstrating the generalization ability of GANs in capturing the underlying data manifold. Probably originating from novel combinations of information/attributes/styles (see the stunning illustrations in StyleGAN (Karras et al., 2019a)), this generalization of GANs is extremely appealing for limited-data applications (Yi et al., 2019; Han et al., 2019), where one would expect GANs to generate realistic samples to alleviate overfitting or provide regularizations for classification (Wang & Perez, 2017; Frid-Adar et al., 2018), segmentation (Bowles et al., 2018), or detection (Han et al., 2019, 2020).
For the problem of limited-data generation, naively training a GAN on the limited data is prone to overfitting, since powerful GAN models with many parameters are essential for realistic generation (Bermudez et al., 2018; Bowles et al., 2018; Frid-Adar et al., 2018; Finlayson et al., 2018). To alleviate overfitting, one would often consider transferring additional information from other domains via transfer learning, which may also deliver better training efficiency and final performance simultaneously (Caruana, 1995; Bengio, 2012; Sermanet et al., 2013; Donahue et al., 2014; Zeiler & Fergus, 2014; Girshick et al., 2014). However, most of the transfer learning literature focuses on recognition tasks, based on the finding that low-level filters (those close to the observation) of a classifier pretrained on a large-scale source dataset are fairly general (like Gabor filters) and thus transferable to different target domains (Yosinski et al., 2014); as the well-trained low-level filters, often data-demanding to obtain (Frégier & Gouray, 2019; Noguchi & Harada, 2019), provide additional information, transfer learning often leads to better performance (Yosinski et al., 2014; Long et al., 2015; Noguchi & Harada, 2019). Compared to transfer learning on recognition tasks, fewer efforts have been made for generation tasks (Shin et al., 2016b; Wang et al., 2018b; Noguchi & Harada, 2019), which are summarized in detail in Related Work. We work in this direction and specifically consider transfer learning for limited-data generation.
Based on the finding that GAN generators pretrained on large-scale datasets show a similar pattern to that of recognition models (i.e., lower-level layers (those close to the observation) portray generally-applicable local patterns like materials, edges, and colors, while higher-level layers represent more specific semantic objects or object parts (Bau et al., 2017, 2018)), we consider transferring the low-level filters/patterns from a pretrained GAN model to facilitate limited-data generation in a target domain. From a Bayesian perspective, the transferred filters serve as strong prior knowledge based on large-scale datasets, providing hard regularizations for the likelihood information from the limited target data. We take popular natural image generation as an illustrative example; however, our findings and the presented techniques are believed to be applicable to other domains (like medical or biological ones). Our main contributions include:
We empirically reveal that the low-level filters (within both the generator and the discriminator) of a GAN pretrained on large-scale datasets can be transferred to different domains for better generation therein.
We design a small tailored specific network that harmoniously cooperates with the transferred low-level filters and enables style mixing for limited-data generation.
To better adapt the transferred filters to the target domain, we introduce the adaptive filter modulation (AdaFM), which leads to boosted performance.
Extensive experiments are conducted to verify the effectiveness of the proposed techniques.
2 Background and Related Work
2.1 Generative Adversarial Networks
Attracting vast attention, generative adversarial networks (GANs) (Goodfellow et al., 2014) show increasing power in synthesizing highly realistic observations (Brock et al., 2019; Karras et al., 2019a); accordingly, GANs are widely applied to various research fields, such as image (Hoffman et al., 2017; Ledig et al., 2017; Ak et al., 2019), text (Lin et al., 2017; Fedus et al., 2018; Wang & Wan, 2018; Wang et al., 2019), video (Mathieu et al., 2015; Wang et al., 2017, 2018a; Chan et al., 2019), and audio (Engel et al., 2019; Yamamoto et al., 2019; Kumar et al., 2019) generation.
Often a GAN consists of two adversarial components, i.e., a generator $G$ and a discriminator $D$. As the adversarial game proceeds, the generator learns to generate increasingly realistic fake data to confuse the discriminator, while the discriminator tries to discriminate the real data from the fake ones generated by the generator. The standard GAN objective (Goodfellow et al., 2014) is
$$\min_G \max_D \; \mathbb{E}_{\boldsymbol{x} \sim p_{\text{data}}(\boldsymbol{x})}[\log D(\boldsymbol{x})] + \mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log(1 - D(G(\boldsymbol{z})))],$$
where $p_{\boldsymbol{z}}(\boldsymbol{z})$ is an easy-to-sample distribution like a standard Normal and $p_{\text{data}}(\boldsymbol{x})$ is the underlying data distribution.
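As a quick sanity check of this objective, its value can be estimated by Monte Carlo from discriminator outputs on real and fake minibatches. Below is a minimal NumPy sketch; the function name `gan_value` is ours, purely illustrative:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the standard GAN objective
    E_x[log D(x)] + E_z[log(1 - D(G(z)))], given discriminator
    outputs d_real = D(x) on real data and d_fake = D(G(z)) on fakes."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
```

When the fake distribution matches the data distribution, the optimal discriminator outputs 1/2 everywhere and the objective attains its equilibrium value of log(1/4).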
2.2 GANs on Limited-Data
Existing work related to GANs on limited-data can be roughly summarized into two groups.
Exploit GANs for better usage of the information within limited data. In addition to traditional data augmentation like shifting, zooming, rotation, or flipping, GANs trained on limited data can be used for synthetic augmentation, like transformed-style images or fake observation-label/segmentation pairs (Wang & Perez, 2017; Bowles et al., 2018; Frid-Adar et al., 2018; Han et al., 2019, 2020). However, because of the limited available data, a relatively small GAN model is often used to alleviate overfitting, leading to reduced generative power; only the information within the limited data is used.
Use GANs to transfer additional information to help limited-data generation. As the information within the available data is limited, it's often preferred to transfer additional information from other domains via transfer learning (Yosinski et al., 2014; Long et al., 2015; Noguchi & Harada, 2019). TransferGAN (Wang et al., 2018b) proposes to initialize the GAN model with parameters pretrained on a large-scale source dataset, followed by fine-tuning those parameters on the limited target data. As the (often large) source model architecture is directly transferred to the target domain, fine-tuning the whole model with overly limited target data suffers from overfitting, as empirically verified in our experiments; since the high-level specific filters are also transferred, the similarity between source and target is critical for beneficial transfer (Wang et al., 2018b). Different from TransferGAN's fine-tuning of the whole model, Noguchi & Harada (2019) propose to freeze all pretrained parameters and introduce new trainable ones to adapt the hidden batch statistics of the generator for extremely-limited target generation. However, the generator is not adversarially trained (an L1/perceptual loss is used instead), leading to blurry generations in the target domain (Noguchi & Harada, 2019). By comparison, our method transfers-then-freezes the generally-applicable low-level filters from source to target, followed by employing a trainable tailored high-level network (on top of the low-level layers) to better adapt to the limited target data. Accordingly, compared to TransferGAN, our method is expected to suffer less from overfitting and behave more robustly toward the differences between source and target domains; compared to (Noguchi & Harada, 2019), our method is more flexible and thus provides clearly better generation, thanks to its adversarial training on the relatively limited data.
3 Our Method
For limited-data generation, we propose to introduce additional information by transferring the valuable low-level filters (those close to observation) from released GANs pretrained on large-scale datasets. Combining that prior (i.e., the transferred low-level filters, often generally-applicable but data-demanding to train (Yosinski et al., 2014; Frégier & Gouray, 2019)) with the likelihood from the limited-data, one could expect less overfitting and/or better performance (thanks to the transferred information). Specifically, given a source pretrained GAN model, we reuse its low-level filters (termed general-part) in target domain, replace the high-level layers (termed specific-part) with another smaller network, and then train that specific-part using the limited target data while keeping the transferred general-part frozen (see Figures 1(f) and 1(g)).
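The transfer-then-freeze recipe above amounts to partitioning the pretrained parameters into a frozen general-part (lower groups) and a trainable specific-part. The sketch below illustrates one such partition; the `groupN.*` naming scheme is a hypothetical convention of ours, not the released model's actual parameter names:

```python
def split_transferred(param_names, n_general_groups):
    """Partition parameter names into a frozen 'general-part' (the lower
    groups, transferred from the pretrained model) and a trainable
    'specific-part'. Assumes each name starts with its group index,
    e.g. 'group2.conv.weight', with group 0 closest to the observation."""
    frozen, trainable = [], []
    for name in param_names:
        gid = int(name.split('.', 1)[0][len('group'):])
        (frozen if gid < n_general_groups else trainable).append(name)
    return frozen, trainable
```

During training, gradient updates are applied only to the trainable list, while the frozen list keeps its pretrained values.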
Three key questions arise:
(i) How to specify the general-part for transferring?
(ii) How to tailor the specific-part?
(iii) How to better adapt the transferred general-part?
Before introducing the proposed techniques in detail, we first discuss source datasets, available pretrained GAN models, and the employed evaluation metrics. Intuitively, to get generally-applicable low-level filters, one would expect a large-scale source dataset with rich diversity. A common choice is the ImageNet dataset (Krizhevsky et al., 2012; Donahue et al., 2014; Oquab et al., 2014; Shin et al., 2016a), which contains over a million high-resolution images from 1,000 classes; we also adopt it as the source dataset. Concerning the released GAN models pretrained on ImageNet, available choices include SNGAN (Miyato et al., 2018), GP-GAN (Mescheder et al., 2018), and BigGAN (Brock et al., 2019); we employ the pretrained GP-GAN model because of its well-written codebase and our available computational resources. To evaluate generative performance, we adopt the widely used Fréchet inception distance (FID, lower is better) (Heusel et al., 2017), a metric assessing the realism and variation of generated samples (Zhang et al., 2018).
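For concreteness, FID fits a Gaussian to Inception features of real and generated samples and computes the Fréchet distance between the two Gaussians. A pure-NumPy sketch of that closed form (assuming the feature means and covariances have already been estimated; the trace term is computed via the symmetric form to avoid a general matrix square root):

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    Tr((S1 S2)^{1/2}) equals Tr((S1^{1/2} S2 S1^{1/2})^{1/2}),
    which is computable with symmetric eigendecompositions."""
    vals1, vecs1 = np.linalg.eigh(sigma1)
    sqrt_s1 = (vecs1 * np.sqrt(np.clip(vals1, 0, None))) @ vecs1.T
    inner = sqrt_s1 @ sigma2 @ sqrt_s1
    tr_sqrt = np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * tr_sqrt)
```

Identical statistics give FID 0; any mismatch in mean or covariance increases the score, matching the "lower is better" convention.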
3.1 On Specifying the General-Part for Transferring
As mentioned in the Introduction, generative and recognition models share a similar pattern: higher-level filters portray more task-specific/data-specific information, while lower-level ones (those close to the observation) portray more generally applicable information (Yosinski et al., 2014; Zeiler & Fergus, 2014; Bau et al., 2017, 2018). Given the GP-GAN model pretrained on ImageNet, it's natural to ask how many low-level filters (i.e., how large a general-part) can be transferred to the target domain. Generally speaking, the optimal solution is a compromise dependent on the available target data: given plenty of data (more likelihood information), fewer low-level filters should be transferred (less prior is needed); but when target data are limited (limited likelihood information), it's better to transfer more filters (more prior is preferred). We empirically address this question by transferring the pretrained GP-GAN model to the CelebA dataset (Liu et al., 2015), which is fairly different from the source ImageNet (see Figure 2). It's worth emphasizing that the general-part discovered here also delivers excellent results on three other datasets in the experiments, likely because the newly introduced AdaFM technique (see Section 3.3) has strong modulation power to adapt the transferred low-level filters to the target domain.
3.1.1 General-Part in Generator
To figure out the suitable general-part in the GP-GAN generator to be transferred to the target CelebA dataset (to verify the generalization of pretrained filters, we bypass the limited-data assumption in this section and use the whole CelebA dataset for training), we employ the GP-GAN architectures and design experiments with an increasing number of lower layers of the generator included in the transferred/frozen general-part; the remaining specific-part of the generator (see Figure 1(h)) and the discriminator are reinitialized and trained on CelebA.
Four settings for the generator general-part are tested, i.e., 2, 4, 5, and 6 lower groups to be transferred (termed G2, G4, G5, and G6, respectively; G4 is illustrated in Figure 1(f)). After training until the generative quality stabilizes, we show in Figure 3 the generated samples and FIDs of the four settings. It's clear that the G2/G4 general-part delivers decent generative quality (see eye details, hair texture, and cheek smoothness), despite the source ImageNet being quite different from the target CelebA, confirming the generalization of the low-level filters from up to 4 lower groups of the pretrained GP-GAN generator (also verified on three other datasets in the experiments). The lower FID of G4 compared to G2 indicates that transferring more low-level filters pretrained on large-scale source datasets potentially benefits performance in the target domain. (Another reason might be that training well-behaved low-level filters is quite time-consuming and data-demanding (Frégier & Gouray, 2019; Noguchi & Harada, 2019); the worse FID of G2 is believed to be caused by insufficiently trained low-level filters, as we find the images from G2 show relatively lower diversity and contain strange textures in the details (see Figure 13 in Appendix), and FID is biased towards texture over shape (Karras et al., 2019b).) But when we freeze more groups as the generator general-part (i.e., G5 and G6), the generative quality drops quickly; this is expected, as higher-level filters are more specific to the source ImageNet and may not fit the target CelebA. Reviewing Figure 3, we choose G4 as the setting for the generator general-part.
3.1.2 General-Part in Discriminator
Based on the G4 generator general-part, we next conduct experiments to specify the discriminator general-part. We consider transferring/freezing 0, 2, 3, and 4 lower groups of the pretrained GP-GAN discriminator (termed D0, D2, D3, and D4, respectively; D2 is illustrated in Figure 14 of the Appendix). Figure 4 shows the generated samples and FIDs for each setting. Similar to what's found for the generator, transferring more low-level filters from the pretrained GP-GAN discriminator also leads to better generative performance (compare the FID of D0 with that of D2), thanks to the additional information therein; however, as the higher-level filters are more specific to the source ImageNet, they lead to decreased generative quality (see the results from D3 and D4). Therefore, considering both generator and discriminator, we transfer/freeze the G4D2 general-part from the pretrained GP-GAN model, which will be shown to work quite well on three other target datasets in the experiments.
3.2 On Tailoring the High-Level Specific-Part
Even with the transferred G4D2 general-part, the remaining specific-part may contain too many trainable parameters given the limited available data in the target domain (e.g., the GPHead model in Figure 1(f) shows mode collapse (see Figure 5) when trained on the small Flowers dataset (Nilsback & Zisserman, 2008)); another concern is that when using a GAN for synthetic augmentation in limited-data applications, style mixing is a highly appealing capability (Wang & Perez, 2017). Motivated by these considerations, we propose to replace the high-level specific-part with a tailored smaller network (compare Figure 1(f) with 1(g)) to alleviate overfitting, enable style mixing, and also lower the computational/memory cost.
Specifically, that tailored specific-part is constructed as a fully connected (FC) layer followed by two successive style blocks (borrowed from StyleGAN (Karras et al., 2019a), with an additional shortcut; see Figure 1(c)). Similar to StyleGAN, the style blocks enable unsupervised disentanglement of high-level attributes, which may benefit efficient exploration of the underlying data manifold and thus better generation; they also enable generating samples with new attribute combinations (style mixing), which dramatically enlarges the generation diversity (see Figure 12 and Figure 16 in Appendix). Note the tailored specific-part is also used in our method (see Figure 1(h)). We term the model consisting of that specific-part and the G4D2 general-part the SmallHead. Different from the GPHead, the SmallHead trains stably without mode collapse on Flowers (see Figure 5). In the experiments, the SmallHead is found to work well on other small datasets as well.
3.3 Better Adaption of the Transferred General-Part
Based on the above transferred general-part and tailored specific-part, we next present a new technique, termed adaptive filter modulation (AdaFM), to better adapt the transferred low-level filters to the target domain for boosted performance, as shown in the experiments.
Motivated by style-transfer methods like adaptive instance normalization (AdaIN) (Huang & Belongie, 2017), where one manipulates the style of an image by modifying the statistics (like the mean or variance) of its latent feature maps, we alternatively consider manipulating the style of a function (represented by the transferred general-part) by modifying the statistics of its convolutional filters via AdaFM.
Specifically, a tiny number of learnable parameters, i.e., a scale $\gamma \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$ and a shift $\beta \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$, where $C_{\text{out}}/C_{\text{in}}$ denote the number of output/input channels, are introduced in AdaFM to modulate the transferred/frozen convolutional filters $\mathbf{W} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K \times K}$ with kernel size $K$ in the general-part. Namely,
$$\hat{\mathbf{W}}_{i,j,:,:} = \gamma_{i,j} \mathbf{W}_{i,j,:,:} + \beta_{i,j},$$
for $i = 1, \dots, C_{\text{out}}$ and $j = 1, \dots, C_{\text{in}}$. $\hat{\mathbf{W}}$ is then used to convolve with the input feature maps to produce the output ones. Applying AdaFM to the convolutional kernels of a residual block (see Figure 1(a); He et al., 2016) gives the AdaFM block (see Figure 1(b)). With the residual blocks of the SmallHead replaced by AdaFM blocks, we obtain our generator, as shown in Figure 1(h), which delivers boosted performance compared to the SmallHead in the experiments.
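The modulation itself is a one-line broadcasted operation. A minimal NumPy sketch (the frozen filters `W` are given; `gamma` and `beta` are the only learnable tensors):

```python
import numpy as np

def adafm(W, gamma, beta):
    """Adaptive filter modulation: rescale and shift each (i, j) kernel
    slice of frozen conv filters W of shape (C_out, C_in, K, K) with
    learnable gamma and beta, both of shape (C_out, C_in)."""
    return gamma[:, :, None, None] * W + beta[:, :, None, None]
```

Initializing `gamma` to ones and `beta` to zeros makes AdaFM start as the identity, so training begins from the transferred filters unchanged.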
To better understand the power of AdaFM, we next draw connections to two of its (approximate) special cases. The first is weight demodulation, used in the recently proposed StyleGAN2 (Karras et al., 2019b), a model with state-of-the-art generative quality. Compared with our AdaFM, weight demodulation employs a zero shift $\beta = 0$ and a rank-one scale $\gamma_{i,j} = s_j d_i$, where the style $s \in \mathbb{R}^{C_{\text{in}}}$ is produced by a trainable mapping network (often an MLP) and $d_i$ is calculated as
$$d_i = \frac{1}{\sqrt{\sum_{j,k_1,k_2} (s_j \mathbf{W}_{i,j,k_1,k_2})^2 + \epsilon}},$$
where $\epsilon$ is a small constant to avoid numerical issues. Despite being closely related, AdaFM and weight demodulation are motivated differently: we propose AdaFM to better adapt the transferred/frozen filters to the target domain, while weight demodulation is used to relax instance normalization while keeping controllable style-mixing ability (Karras et al., 2019b).
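The two-step modulate/demodulate computation can be sketched as follows (a simplified NumPy rendering; in StyleGAN2 the style `s` comes from the mapping network, while here it is passed in as a plain vector):

```python
import numpy as np

def weight_demodulation(W, s, eps=1e-8):
    """StyleGAN2-style weight demodulation, an approximate special case
    of AdaFM with zero shift and a rank-one scale gamma_{i,j} = s_j d_i.
    W: (C_out, C_in, K, K) filters; s: (C_in,) per-input-channel style."""
    Wp = W * s[None, :, None, None]                          # modulate
    d = 1.0 / np.sqrt((Wp ** 2).sum(axis=(1, 2, 3)) + eps)   # demod coeffs
    return Wp * d[:, None, None, None]
```

By construction, each output-channel filter of the result has (approximately) unit L2 norm, which is what replaces explicit instance normalization.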
The other special case of AdaFM is the filter selection (FS) presented in (Noguchi & Harada, 2019), which employs a rank-one simplification of both the scale and the shift. Specifically, scale $\gamma_{i,j} = \gamma_i$ and shift $\beta_{i,j} = \beta_i$ with $\gamma, \beta \in \mathbb{R}^{C_{\text{out}}}$ (see Figures 1(d) and 1(e)). The motivation of FS is to "select" the transferred filters $\mathbf{W}$; for example, if $\gamma$ is a binary vector, it literally selects among the filters $\{\mathbf{W}_{i,:,:,:}\}$. As FS doesn't modulate among input channels, its basic assumption is that the source and target share the same correlation among input feature maps, which might not be true. See the demonstrative example in Figure 6, where the source domain has the basic pattern of an almost-red square whereas the target has the basic pattern of an almost-green square (with the same shape); it's clear that simply selecting (FS) a filter/pattern won't work, calling for a modulation (AdaFM) among filter channels. Figure 7 shows the results from models with FS and AdaFM (using the same architecture as in Figure 1(h)); it's obvious that AdaFM brings boosted performance, empirically supporting the above intuition that the basic shapes within each $\mathbf{W}_{i,j,:,:}$ are generally applicable while the correlations among input/output channels may be data-specific (this is further supported by AdaFM delivering boosted performance on other datasets in the experiments).
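The contrast with AdaFM is easiest to see in code: FS broadcasts a single scale/shift over all input channels of each filter, so it can only keep, drop, or globally rescale whole filters. A minimal NumPy sketch:

```python
import numpy as np

def filter_selection(W, gamma, beta):
    """Filter selection (FS): a rank-one simplification of AdaFM that
    scales/shifts only per output channel. W: (C_out, C_in, K, K);
    gamma, beta: (C_out,). A binary gamma literally selects filters."""
    return gamma[:, None, None, None] * W + beta[:, None, None, None]
```

Because `gamma` has no input-channel axis, FS cannot re-weight the correlation among input channels (the almost-red vs. almost-green example above), which is exactly the freedom AdaFM adds.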
4 Experimental Results
Taking natural image generation as an illustrative example, we demonstrate the effectiveness of the proposed techniques by transferring the GP-GAN model pretrained on the large-scale source ImageNet dataset (containing over a million images from 1,000 classes) to facilitate limited-data generation on four smaller target datasets and their variants: CelebA (Liu et al., 2015), Flowers (Nilsback & Zisserman, 2008), Cars (Krause et al., 2013), and Cathedral (Zhou et al., 2014).
The experiments proceed by (i) demonstrating the advantage of our method over existing/naive ones; (ii) conducting ablation studies to analyze the contribution of each component of our method; (iii) verifying the proposed techniques in more challenging settings with only 1K target images; (iv) analyzing why/how AdaFM leads to boosted performance; and (v) illustrating the potential of exploiting the tailored specific-part for data augmentation in limited-data applications. Generated images and FID (Heusel et al., 2017) are used to evaluate generative performance. Experimental settings and more results are given in the Appendix. Code is available in the supplementary material.
4.1 Comparisons with Existing/Naive Methods
To demonstrate our contributions over existing/naive methods, we compare our method with (i) TransferGAN (Wang et al., 2018b), which initializes with the pretrained GP-GAN model (accordingly, the same network architectures are adopted; see Figure 1(f)) and then fine-tunes all parameters on the target data, and (ii) Scratch, which trains a model with the same architecture as ours (see Figure 1(h)) from scratch on the target data.
The experimental results are shown in Figure 8, with the final FID scores summarized in Table 2. Since TransferGAN employs the (large) source GP-GAN architecture, it may suffer from overfitting if the target data are too limited, which manifests as training/mode collapse; accordingly, TransferGAN fails on the smaller datasets: Flowers, Cars, and Cathedral. By comparison, thanks to the tailored specific-part, both Scratch and our method train stably on all target datasets, as shown in Figure 8. Compared to Scratch, our method shows dramatically increased training efficiency, thanks to the transferred low-level filters, and significantly improved generative quality (a much better FID in Table 2), which are attributed to both the transferred general-part and the better adaption to the target domain via AdaFM.
4.2 Ablation Study of Our Method
To reveal how each component contributes to the excellent performance of our method, we consider experimental settings in a sequential manner. (a) GP-GAN: adopt the GP-GAN architectures (Figure 1(f), but with all parameters trainable and randomly initialized), used as a baseline where no low-level filters are transferred. (b) GPHead: use the model in Figure 1(f), to demonstrate the contribution of the transferred general-part. (c) SmallHead: employ the model in Figure 1(g), to reveal the contribution of the tailored specific-part. (d) Our: leverage the model in Figure 1(h), to show the contribution of the presented AdaFM.
The FID curves during training and the final FID scores of the compared methods are shown in Figure 9 and Table 1, respectively. By comparing GP-GAN with GPHead in Figure 9 (left) and Table 1 (on CelebA), it's clear that the transferred general-part contributes by dramatically increasing the training efficiency (also refer to Figure 8) and delivering better performance. Comparing SmallHead to both GPHead and GP-GAN in Table 1 indicates that the tailored specific-part helps alleviate overfitting (stable training). By better adapting the transferred general-part to the target domains, the proposed AdaFM contributes most to the boosted performance (see Figure 9 (right) and Table 1), empirically confirming our intuition in Section 3.3.
4.3 More Challenging Limited-Data Generation
To verify the effectiveness of the proposed techniques in more challenging settings, we consider limited-data generation with only 1K data samples. Specifically, we randomly select 1K images from CelebA, Flowers, and Cathedral to form their limited-1K variants, termed CelebA-1K, Flowers-1K, and Cathedral-1K, respectively. Since TransferGAN already fails with more target images (see Section 4.1), we omit it and only compare our method with Scratch. The FID curves versus training iterations are shown in Figure 10, with the lowest FIDs summarized in Table 3. Under the challenging settings with only 1K training samples, both Scratch and our method with the G4D2 general-part (labeled Our-G4D2) suffer from overfitting, as shown in Figure 10; Scratch actually suffers more due to its larger number of trainable parameters. Compared with Scratch, our method trains more efficiently (see Figure 10) and delivers significantly improved best performance (see Table 3). To alleviate overfitting, we transfer more discriminator filters from the pretrained GP-GAN model, with the results given in Figure 10 and Table 3. Patterns emerge similar to those in Section 3.1, i.e., less data (likelihood) call for more transferred information (prior); the discovered G4D2 general-part also works fairly well (see the comparable FID scores in Table 3).
4.4 Analysis of AdaFM and Style Augmentation with the Tailored Specific-Part
To better understand why adopting AdaFM in the transferred general-part leads to boosted performance, we summarize in Figure 11(a)(b) the learned scale $\gamma$ and shift $\beta$ from different groups of the generator general-part. It's clear that all transferred filters are used in the target domains (no zero-valued $\gamma$) but with modulations ($\gamma$ has values spread around 1). As AdaFM delivers boosted performance, such modulation is evidently crucial to a successful transfer from source to target, confirming the intuition discussed in Section 3.3. Next, to illustrate how $\gamma$ behaves on different target datasets, we show in Figure 11(c) the sorted comparisons of the learned $\gamma$ within one group; apparently, different datasets prefer different modulations as expected, justifying the necessity of AdaFM and its performance gain. Concerning an explicit demonstration of AdaFM and medical/biological applications with gray-scale images, we conduct another experiment on a gray-scale variant of Cathedral (results are given in Appendix I due to space constraints), where we find that without AdaFM, worse (blurry and messy) details are observed in the generated images (refer also to Figure 7), likely because of the mismatched correlations among channels between the source and target domains.
To reveal the potential of exploiting the tailored specific-part for data augmentation in limited-data applications, we conduct style mixing with the specific-part following (Karras et al., 2019a). Figure 12 shows the results on Flowers (see Appendix F for details and more results). Style mixing enables synthesizing a vast number of new images via style/attribute combinations. Therefore, the tailored specific-part can be used for diverse data augmentation, which is believed to be extremely appealing for downstream limited-data applications.
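At its core, StyleGAN-style mixing just swaps which source's per-layer style vectors feed the layers beyond a crossover point; a minimal sketch (per-layer styles represented as a plain list, an illustrative simplification of the actual network wiring):

```python
def style_mix(styles_a, styles_b, crossover):
    """Style mixing: feed styles from sample A to the layers below the
    crossover index and styles from sample B to the layers at and above
    it, yielding a new attribute combination."""
    return styles_a[:crossover] + styles_b[crossover:]
```

Sweeping the crossover point over the two style blocks of the tailored specific-part yields coarse-to-fine attribute recombinations, which is what enlarges the augmentation diversity.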
5 Conclusions
We reveal that the valuable information (specifically, the low-level filters) within GAN models pretrained on large-scale source datasets (ImageNet) can be transferred for limited-data generation in a target domain. A small tailored specific network is developed to alleviate overfitting on limited target data and to enable style mixing for diverse data augmentation. We also present the adaptive filter modulation (AdaFM) to better adapt the transferred filters to the target domain, which delivers boosted performance on limited-data generation.
- Ak et al. (2019) Ak, K., Lim, J., Tham, J., and Kassim, A. Attribute manipulation generative adversarial networks for fashion images. In ICCV, pp. 10541–10550, 2019.
- Bao & Qiao (2019) Bao, X. and Qiao, Q. Transfer learning from pre-trained bert for pronoun resolution. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 82–88, 2019.
- Bau et al. (2017) Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, pp. 6541–6549, 2017.
- Bau et al. (2018) Bau, D., Zhu, J., Strobelt, H., Zhou, B., Tenenbaum, J., Freeman, W., and Torralba, A. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
- Bengio (2012) Bengio, Y. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML workshop on unsupervised and transfer learning, pp. 17–36, 2012.
- Bermudez et al. (2018) Bermudez, C., Plassard, A., Davis, L., Newton, A., Resnick, S., and Landman, B. Learning implicit brain MRI manifolds with deep learning. In Medical Imaging 2018: Image Processing, volume 10574, pp. 105741L. International Society for Optics and Photonics, 2018.
- Bowles et al. (2018) Bowles, C., Chen, L., Guerrero, R., Bentley, P., Gunn, R., Hammers, A., Dickie, D., Hernández, M., Wardlaw, J., and Rueckert, D. GAN augmentation: augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863, 2018.
- Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.
- Caruana (1995) Caruana, R. Learning many related tasks at the same time with backpropagation. In NIPS, pp. 657–664, 1995.
- Chan et al. (2019) Chan, C., Ginosar, S., Zhou, T., and Efros, A. Everybody dance now. In CVPR, pp. 5933–5942, 2019.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Donahue et al. (2014) Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, pp. 647–655, 2014.
- Engel et al. (2019) Engel, J., Agrawal, K., Chen, S., Gulrajani, I., Donahue, C., and Roberts, A. Gansynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710, 2019.
- Fedus et al. (2018) Fedus, W., Goodfellow, I., and Dai, A. MaskGAN: better text generation via filling in the _. arXiv preprint arXiv:1801.07736, 2018.
- Finlayson et al. (2018) Finlayson, S., Lee, H., Kohane, I., and Oakden-Rayner, L. Towards generative adversarial networks as a new paradigm for radiology education. arXiv preprint arXiv:1812.01547, 2018.
- Frégier & Gouray (2019) Frégier, Y. and Gouray, J. Mind2mind: transfer learning for GANs. arXiv preprint arXiv:1906.11613, 2019.
- Frid-Adar et al. (2018) Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., and Greenspan, H. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.
- Giacomello et al. (2019) Giacomello, E., Loiacono, D., and Mainardi, L. Transfer brain MRI tumor segmentation models across modalities with adversarial networks. arXiv preprint arXiv:1910.02717, 2019.
- Girshick et al. (2014) Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587, 2014.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.
- Han et al. (2019) Han, C., Murao, K., Noguchi, T., Kawata, Y., Uchiyama, F., Rundo, L., Nakayama, H., and Satoh, S. Learning more with less: conditional PGGAN-based data augmentation for brain metastases detection using highly-rough annotation on MR images. arXiv preprint arXiv:1902.09856, 2019.
- Han et al. (2020) Han, C., Rundo, L., Araki, R., Furukawa, Y., Mauri, G., Nakayama, H., and Hayashi, H. Infinite brain MR images: PGGAN-based data augmentation for tumor detection. In Neural Approaches to Dynamics of Signal Exchanges, pp. 291–303. Springer, 2020.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
- Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pp. 6626–6637, 2017.
- Hoffman et al. (2017) Hoffman, J., Tzeng, E., Park, T., Zhu, J., Isola, P., Saenko, K., Efros, A., and Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
- Huang & Belongie (2017) Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pp. 1501–1510, 2017.
- Karras et al. (2017) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- Karras et al. (2019a) Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In CVPR, June 2019a.
- Karras et al. (2019b) Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958, 2019b.
- Krause et al. (2013) Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.
- Kumar et al. (2019) Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W., Sotelo, J., de Brebisson, A., Bengio, Y., and Courville, A. MelGAN: Generative adversarial networks for conditional waveform synthesis. arXiv preprint arXiv:1910.06711, 2019.
- Ledig et al. (2017) Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pp. 4681–4690, 2017.
- Lin et al. (2017) Lin, K., Li, D., He, X., Zhang, Z., and Sun, M. Adversarial ranking for language generation. In NIPS, pp. 3155–3165, 2017.
- Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In ICCV, pp. 3730–3738, 2015.
- Long et al. (2015) Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
- Lucic et al. (2018) Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? a large-scale study. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), NeurIPS, pp. 700–709. Curran Associates, Inc., 2018.
- Luo et al. (2017) Luo, Z., Zou, Y., Hoffman, J., and Fei-Fei, L. Label efficient learning of transferable representations acrosss domains and tasks. In NIPS, pp. 165–177, 2017.
- Mathieu et al. (2015) Mathieu, M., Couprie, C., and LeCun, Y. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
- Mescheder et al. (2018) Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? In ICML, pp. 3478–3487, 2018.
- Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
- Mozafari et al. (2019) Mozafari, M., Farahbakhsh, R., and Crespi, N. A BERT-based transfer learning approach for hate speech detection in online social media. arXiv preprint arXiv:1910.12574, 2019.
- Nilsback & Zisserman (2008) Nilsback, M. and Zisserman, A. Automated flower classification over a large number of classes. In ICCV, Graphics & Image Processing, pp. 722–729. IEEE, 2008.
- Noguchi & Harada (2019) Noguchi, A. and Harada, T. Image generation from small datasets via batch statistics adaptation. arXiv preprint arXiv:1904.01774, 2019.
- Oquab et al. (2014) Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, pp. 1717–1724, 2014.
- Peng et al. (2019) Peng, Y., Yan, S., and Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474, 2019.
- Sermanet et al. (2013) Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
- Shin et al. (2016a) Shin, H., Roth, H., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., and Summers, R. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging, 35(5):1285–1298, 2016a.
- Shin et al. (2016b) Shin, S., Hwang, K., and Sung, W. Generative knowledge transfer for neural language models. arXiv preprint arXiv:1608.04077, 2016b.
- Wang & Perez (2017) Wang, J. and Perez, L. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit, 2017.
- Wang & Wan (2018) Wang, K. and Wan, X. SentiGAN: Generating sentimental texts via mixture adversarial networks. In IJCAI, pp. 4446–4452, 2018.
- Wang et al. (2019) Wang, R., Zhou, D., and He, Y. Open event extraction from online text using a generative adversarial network. arXiv preprint arXiv:1908.09246, 2019.
- Wang et al. (2018a) Wang, T., Liu, M., Zhu, J., Liu, G., Tao, A., Kautz, J., and Catanzaro, B. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018a.
- Wang et al. (2017) Wang, X., Shrivastava, A., and Gupta, A. A-fast-RCNN: Hard positive generation via adversary for object detection. In CVPR, pp. 2606–2615, 2017.
- Wang et al. (2018b) Wang, Y., Wu, C., Herranz, L., van de Weijer, J., Gonzalez-Garcia, A., and Raducanu, B. Transferring GANs: generating images from limited data. In ECCV, pp. 218–234, 2018b.
- Yamamoto et al. (2019) Yamamoto, R., Song, E., and Kim, J. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. arXiv preprint arXiv:1904.04472, 2019.
- Yi et al. (2019) Yi, X., Walia, E., and Babyn, P. Generative adversarial network in medical imaging: A review. Medical image analysis, pp. 101552, 2019.
- Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In NIPS, pp. 3320–3328, 2014.
- Zamir et al. (2018) Zamir, A., Sax, A., Shen, W., Guibas, L., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning. In CVPR, pp. 3712–3722, 2018.
- Zeiler & Fergus (2014) Zeiler, M. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, pp. 818–833. Springer, 2014.
- Zhang et al. (2018) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
- Zhou et al. (2014) Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. Learning deep features for scene recognition using places database. In NIPS, pp. 487–495, 2014.
Appendix A Experimental settings
All training images are resized to for consistency. For the large dataset, i.e., CelebA, the dimension of the latent vector is set to ; for the small datasets, i.e., Flowers, Cars, and Cathedral, it is set to .
The reported FID scores are calculated between real and generated images on CelebA, Flowers, Cars, and Cathedral, respectively. Adam with learning rate and coefficients is used as the optimizer, and iterations are used for training. Because of limited computational power, we use a batch size of .
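The training configuration above can be sketched in PyTorch as follows. The exact hyperparameter values did not survive in the text, so the numbers below (latent dimension, batch size, learning rate, Adam coefficients) are common GAN defaults, not the paper's settings; the toy linear generator/discriminator and the hinge loss are likewise stand-ins for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder hyperparameters: common GAN defaults, NOT the paper's values.
LATENT_DIM = 128
BATCH_SIZE = 32
LR = 2e-4
BETAS = (0.0, 0.999)

# Toy stand-ins for the actual generator and discriminator.
G = nn.Linear(LATENT_DIM, 64)
D = nn.Linear(64, 1)

opt_g = torch.optim.Adam(G.parameters(), lr=LR, betas=BETAS)
opt_d = torch.optim.Adam(D.parameters(), lr=LR, betas=BETAS)

# One discriminator update with a hinge loss, for illustration.
z = torch.randn(BATCH_SIZE, LATENT_DIM)
real = torch.randn(BATCH_SIZE, 64)  # stand-in for a batch of real features
loss_d = torch.relu(1.0 - D(real)).mean() + torch.relu(1.0 + D(G(z).detach())).mean()
opt_d.zero_grad()
loss_d.backward()
opt_d.step()
```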
Appendix B On explaining the worse FID of GD
We believe the worse FID of GD is caused by its insufficiently trained low-level filters. By inspecting the generated samples, we find that:
- some samples look similar to each other, indicating that the generator has relatively low diversity;
- most samples contain strange water-like textures (see Figure 13).
Both phenomena negatively affect the final FID.
Appendix C On transferring discriminator and reinitializing the specific-part
To determine the suitable general part of the discriminator to transfer from the pretrained GP-GAN models to the target CelebA dataset, we design a series of experiments that freeze some low-level groups as the general part and train the remaining high-level specific part. The architecture for transferring the discriminator is shown in Figure 14, with D as an example.
With the general parts of the generator and discriminator frozen at the values pretrained on ImageNet, we only reinitialize and train the remaining specific part. The reinitialization is as follows:
- For layers other than the FC layer in the generator/discriminator, we use the values pretrained on ImageNet as initialization;
- For the FC layer in the generator/discriminator: since the model pretrained on ImageNet used a conditional training configuration, the input to the generator is a concatenation of the latent vector and the embedded label vector, which is not applicable to CelebA (the same holds for the output of the discriminator); we therefore initialize both FC layers randomly.
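The selective reinitialization above can be sketched as a state-dict filter: copy every pretrained tensor whose name and shape match the target network, but skip the FC layer so it keeps its fresh random initialization. The `fc` name prefix and the toy module below are illustrative assumptions, not the actual GP-GAN checkpoint layout.

```python
import torch
import torch.nn as nn

def load_except_fc(target: nn.Module, pretrained_state: dict, fc_prefix: str = "fc"):
    """Copy pretrained weights into `target`, skipping FC layer(s).

    `fc_prefix` is a hypothetical naming convention; adapt it to the
    module names of the actual pretrained checkpoint."""
    own = target.state_dict()
    kept = {k: v for k, v in pretrained_state.items()
            if not k.startswith(fc_prefix) and k in own and own[k].shape == v.shape}
    own.update(kept)
    target.load_state_dict(own)  # FC layer keeps its random init
    return sorted(kept)

# Toy demonstration with a small generator-like module.
class ToyG(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 16)      # label-conditional input in the source model
        self.conv = nn.Conv2d(1, 4, 3)  # "general part" to transfer

src, tgt = ToyG(), ToyG()
loaded = load_except_fc(tgt, src.state_dict())
```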
Appendix D TransferGAN fails to train on the three small datasets
Since the FC layer of the pretrained model is not applicable to the target data (as discussed in Section C), we implement the TransferGAN method by initializing all layers except the FC layer with the values pretrained on ImageNet, leaving the FC layer randomly initialized.
When applied to the small datasets, the TransferGAN method suffers from mode collapse (as shown in Figure 15), which we attribute to overfitting.
Appendix E How to obtain the matrix shown in Figure 11(c)?
The sorted demonstration of the learned is obtained from the last convolution layer in Group 2. We calculate the sorted matrix shown in Figure 11(c) as follows:
- reshape the matrix into a vector for each dataset;
- stack these vectors into a matrix , where each row is obtained from a specific dataset;
- clip all the values of to , then re-scale them to ;
- for dataset , find its set of columns and sort these columns within that set according to the values in row .
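The steps above can be sketched in NumPy. The clipping range and the rule that assigns columns to a dataset were lost from the text, so this sketch assumes clipping to [0, 1] and assigns each column to the dataset whose row attains its maximum; both are assumptions, not the paper's exact procedure.

```python
import numpy as np

def sorted_similarity_matrix(mats, clip_lo=0.0, clip_hi=1.0):
    """Sketch of the Figure 11(c) construction (assumed details, see above)."""
    # 1) reshape each per-dataset matrix into a row vector and stack them
    M = np.stack([m.reshape(-1) for m in mats])  # (n_datasets, n_entries)
    # 2) clip, then re-scale all values into [0, 1]
    M = np.clip(M, clip_lo, clip_hi)
    M = (M - M.min()) / (M.max() - M.min() + 1e-12)
    # 3) group columns by the dataset that dominates them, then sort each
    #    group by that dataset's values (descending)
    owner = M.argmax(axis=0)
    order = []
    for i in range(M.shape[0]):
        cols = np.where(owner == i)[0]
        order.extend(cols[np.argsort(-M[i, cols])])
    return M[:, order]

demo = sorted_similarity_matrix([np.random.rand(4, 4) for _ in range(3)])
```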
Appendix F Style Mixing on Flowers and CelebA
The style mixing results shown in Figure 12 of the main manuscript are obtained as follows. Following (Karras et al., 2019a), given the generative process of a “source” image, we replace its style input of Group 3 (the arrow at left, see Figure 1(h)) with that from a “destination” image, and then propagate through the rest of the generator to produce a new image with mixed style. (We choose Group 3 for example demonstration; one can of course control the input to other Groups, or even hierarchically control all of them.)
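The group-wise style replacement can be sketched as follows. The tiny generator below is a toy stand-in (its module names, shapes, and tanh-modulated blocks are illustrative only); the point is the mechanism: keep the source styles for all groups except one, which receives the destination style before propagating through the remaining blocks.

```python
import torch
import torch.nn as nn

class TinyStyleG(nn.Module):
    """Toy stand-in for a style-modulated generator: each 'group' receives
    its own style vector (names and shapes are illustrative only)."""
    def __init__(self, n_groups=4, dim=8):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_groups))
        self.styles = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_groups))

    def forward(self, w_per_group, x):
        for block, style, w in zip(self.blocks, self.styles, w_per_group):
            x = torch.tanh(block(x) + style(w))  # style-modulated update
        return x

G = TinyStyleG()
x0 = torch.zeros(1, 8)
w_src = [torch.randn(1, 8) for _ in range(4)]  # styles of the "source" image
w_dst = [torch.randn(1, 8) for _ in range(4)]  # styles of the "destination" image

# Style mixing: keep the source styles, but feed the destination style
# into Group 3 (index 2), then propagate through the rest of the network.
w_mix = list(w_src)
w_mix[2] = w_dst[2]
mixed = G(w_mix, x0)
```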
A similar style mixing is conducted on CelebA, with results shown in Figure 16. We observe that the styles from the source images control identity, posture, and hair type, while the styles from the destination images control sex, color, and expression.
Appendix G More generated samples for comparing our method with Scratch
Figures 17, 18, and 19 show more randomly generated samples from our method and Scratch, supplementing Figure 8 in Section 4.1 of the main manuscript. Thanks to the transferred low-level filters and the better adaptation to the target domain via AdaFM, our method shows much higher generation quality than Scratch.
Appendix H The contribution of the proposed AdaFM
To demonstrate the contribution of the proposed AdaFM, randomly generated samples from our method (with AdaFM) and SmallHead (without AdaFM) are shown in Figures 20, 21, and 22. Note that the only difference between our method and SmallHead is the use of AdaFM. Clearly, with AdaFM, the generation quality is greatly improved.
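A minimal sketch of the AdaFM idea, assuming the common formulation where a frozen pretrained convolution filter W is rescaled and shifted per (output, input) channel pair by small learned parameters, W' = gamma * W + beta; the class name, shapes, and defaults below are illustrative and may differ from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaFMConv2d(nn.Module):
    """Sketch of adaptive filter modulation (AdaFM): the pretrained conv
    filter is frozen, and only per-channel-pair scale/shift parameters
    (gamma, beta) are learned on the target data."""
    def __init__(self, weight: torch.Tensor, padding=1):
        super().__init__()
        self.register_buffer("weight", weight)  # frozen pretrained filter
        c_out, c_in = weight.shape[:2]
        self.gamma = nn.Parameter(torch.ones(c_out, c_in, 1, 1))
        self.beta = nn.Parameter(torch.zeros(c_out, c_in, 1, 1))
        self.padding = padding

    def forward(self, x):
        w = self.gamma * self.weight + self.beta  # modulated filter
        return F.conv2d(x, w, padding=self.padding)

pretrained = torch.randn(4, 3, 3, 3)  # stand-in for a pretrained 3x3 filter
layer = AdaFMConv2d(pretrained)
x = torch.randn(1, 3, 8, 8)
y = layer(x)  # at init (gamma=1, beta=0) this equals the plain pretrained conv
```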
Appendix I Medical/biological applications with gray-scale images
To explicitly demonstrate AdaFM and to address medical/biological applications with gray-scale images, we conduct experiments on a gray-scale variant of the Cathedral dataset, termed gray-Cathedral. Randomly generated samples are shown in Figure 23. Without AdaFM, blurry and messy details are observed in the generated images, likely because of the mismatched correlation among channels between the source and target domains. With AdaFM, the generated samples show much clearer details, because AdaFM makes it easy to modulate the filters to match the required correlation among input feature maps in the target data.