On Leveraging Pretrained GANs for Limited-Data Generation

02/26/2020, by Miaoyun Zhao et al.

Recent work has shown that GANs can generate highly realistic images that are indistinguishable from real ones by humans. Of particular interest here is the empirical observation that most generated images are not contained in the training datasets, indicating potential generalization by GANs. That generalizability makes it appealing to exploit GANs in applications with limited available data, e.g., to augment training data and alleviate overfitting. To better facilitate training a GAN on limited data, we propose to leverage already-available GAN models pretrained on large-scale datasets (like ImageNet) to introduce additional common knowledge (which may not exist within the limited data), following the transfer learning idea. Specifically, taking natural image generation as an example, we reveal that the low-level filters (those close to observations) of both the generator and discriminator of pretrained GANs can be transferred to help the target limited-data generation. For better adaptation of the transferred filters to the target domain, we introduce a new technique named adaptive filter modulation (AdaFM), which boosts performance over baseline methods. Unifying the transferred filters and the introduced techniques, we present our method and conduct extensive experiments to demonstrate its training efficiency and better performance on limited-data generation.




1 Introduction

Recent years have witnessed the increasing power of GANs to generate high-quality samples that are indistinguishable from real data (Karras et al., 2017; Lucic et al., 2018; Miyato et al., 2018; Brock et al., 2019; Karras et al., 2019a), demonstrating the capability of GANs to exploit the valuable information within the underlying data distribution. Although many powerful GAN models pretrained on large-scale datasets have been released, few efforts have been made (Giacomello et al., 2019) to take advantage of the valuable information within those models to aid downstream tasks. This stands in clear contrast with the popularity of transfer learning for recognition tasks (e.g., reusing the feature extractor of a pretrained classifier) (Bengio, 2012; Donahue et al., 2014; Luo et al., 2017; Zamir et al., 2018) and transfer learning in natural language processing (e.g., reusing the expensively-pretrained BERT (Devlin et al., 2018)) (Bao & Qiao, 2019; Peng et al., 2019; Mozafari et al., 2019).

Motivated by the significant value of released pretrained GAN models, we propose to leverage the valuable information therein to facilitate downstream tasks with limited target data, a situation that arises frequently due to expensive data collection in the target domain or privacy regulations such as those in medical or biological fields (Yi et al., 2019). Specifically, we concentrate on the challenging problem of limited-data generation, i.e., using the limited available data in a target domain to train a GAN for realistic generation, where the newly trained GAN is expected to capture the underlying data manifold with the help of the valuable information from released pretrained models. One key observation motivating this approach is that a well-trained GAN can generate realistic images not observed in the training dataset (Brock et al., 2019; Karras et al., 2019a; Han et al., 2019), demonstrating the generalization ability of GANs in capturing the underlying data manifold. Probably originating from novel combinations of information/attributes/styles (see the stunning illustrations in StyleGAN (Karras et al., 2019a)), that generalization of GANs is extremely appealing for limited-data applications (Yi et al., 2019; Han et al., 2019), where one would expect GANs to generate realistic samples to alleviate overfitting or provide regularizations for classification (Wang & Perez, 2017; Frid-Adar et al., 2018), segmentation (Bowles et al., 2018), or detection (Han et al., 2019, 2020).

For the problem of limited-data generation, naively training a GAN on the limited data is prone to overfitting, since powerful GAN models carry many parameters, which are essential for realistic generation (Bermudez et al., 2018; Bowles et al., 2018; Frid-Adar et al., 2018; Finlayson et al., 2018). To alleviate overfitting, one would often consider transferring more information from other domains via transfer learning, which may also deliver better training efficiency and final performance simultaneously (Caruana, 1995; Bengio, 2012; Sermanet et al., 2013; Donahue et al., 2014; Zeiler & Fergus, 2014; Girshick et al., 2014). However, most of the transfer learning literature focuses on recognition tasks, building on the foundation that low-level filters (those close to the observation) of a classifier pretrained on a large-scale source dataset are fairly general (like Gabor filters) and thus transferable to different target domains (Yosinski et al., 2014); as those well-trained low-level filters, often data-demanding to obtain (Frégier & Gouray, 2019; Noguchi & Harada, 2019), provide additional information, transfer learning often leads to better performance (Yosinski et al., 2014; Long et al., 2015; Noguchi & Harada, 2019). Compared to transfer learning for recognition tasks, fewer efforts have been made for generation tasks (Shin et al., 2016b; Wang et al., 2018b; Noguchi & Harada, 2019), as summarized in detail in Related Work. We work in this direction and specifically consider transfer learning for limited-data generation.

Based on the finding that GAN generators pretrained on large-scale datasets show a similar pattern to that of recognition models (i.e., lower-level layers (those close to the observation) portray generally-applicable local patterns like materials, edges, and colors, while higher-level layers represent more specific semantic objects or object parts (Bau et al., 2017, 2018)), we consider transferring the low-level filters/patterns from a pretrained GAN model to facilitate limited-data generation in a target domain. From a Bayesian perspective, the transferred filters serve as strong prior knowledge derived from large-scale datasets, providing hard regularizations on the likelihood information from the limited target data. We take the popular natural image generation task as an illustrative example; however, our findings and the presented techniques are believed to be applicable to other domains (like medical or biological imaging). Our main contributions include:

  • We empirically reveal that the low-level filters (within both the generator and the discriminator) of a GAN pretrained on large-scale datasets can be transferred to different domains for better generation therein.

  • We design a tailored, smaller specific network that cooperates harmoniously with the transferred low-level filters and enables style mixing for limited-data generation.

  • To better adapt the transferred filters to the target domain, we introduce adaptive filter modulation (AdaFM), which leads to boosted performance.

  • Extensive experiments are conducted to verify the effectiveness of the proposed techniques.

2 Background and Related Work

2.1 Generative Adversarial Networks

Attracting vast attention, generative adversarial networks (GANs) (Goodfellow et al., 2014) show increasing power in synthesizing highly realistic observations (Brock et al., 2019; Karras et al., 2019a); accordingly, GANs are widely applied across research fields, such as image (Hoffman et al., 2017; Ledig et al., 2017; Ak et al., 2019), text (Lin et al., 2017; Fedus et al., 2018; Wang & Wan, 2018; Wang et al., 2019), video (Mathieu et al., 2015; Wang et al., 2017, 2018a; Chan et al., 2019), and audio (Engel et al., 2019; Yamamoto et al., 2019; Kumar et al., 2019).

A GAN typically consists of two adversarial components, i.e., a generator G and a discriminator D. As the adversarial game proceeds, the generator learns to generate increasingly realistic fake data to confuse the discriminator, while the discriminator tries to distinguish the real data from the fake data produced by the generator. The standard GAN objective (Goodfellow et al., 2014) is

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))],

where p(z) is an easy-to-sample distribution, like a standard Normal, and p_{\text{data}}(x) is the underlying data distribution.
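As a concrete check of the objective above, the following NumPy sketch estimates V(D, G) by Monte Carlo in a toy 1-D setting. This is purely illustrative, not the paper's implementation: the discriminator, generator, and distributions below are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x):
    # Toy fixed discriminator: a logistic function of the observation.
    return 1.0 / (1.0 + np.exp(-(x - 1.5)))

def generator(z):
    # Toy generator: shifts easy-to-sample noise toward the data region.
    return z + 2.0

x = rng.normal(3.0, 1.0, 100_000)   # samples from the "data" distribution
z = rng.normal(0.0, 1.0, 100_000)   # samples from the noise distribution

# Monte Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
value = (np.log(discriminator(x)).mean()
         + np.log(1.0 - discriminator(generator(z))).mean())
print(value)
```

During training, D ascends this value while G descends it; the estimate is always negative since both log terms are logs of probabilities.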

2.2 GANs on Limited-Data

Existing work related to GANs on limited-data can be roughly summarized into two groups.

Exploit GANs for better usage of the information within the limited data. In addition to traditional data augmentation like shifting, zooming, rotation, or flipping, GANs trained on limited data can be used for synthetic augmentation, such as generating transformed-style images or fake observation-label/segmentation pairs (Wang & Perez, 2017; Bowles et al., 2018; Frid-Adar et al., 2018; Han et al., 2019, 2020). However, because of the limited available data, a relatively small GAN model is often used to alleviate overfitting, leading to reduced generative power. Only the information within the limited data is used.

Figure 1: Network architectures. A “group” in general contains blocks with the same feature-map size. MLP consists of FC layers.

Use GANs to transfer additional information to help limited-data generation. As the information within the available data is limited, it is often preferable to transfer additional information from other domains via transfer learning (Yosinski et al., 2014; Long et al., 2015; Noguchi & Harada, 2019). TransferGAN (Wang et al., 2018b) proposes to initialize the GAN model with parameters pretrained on a large-scale source dataset, followed by fine-tuning those parameters with the limited target data. As the source model architecture (often large) is directly transferred to the target domain, fine-tuning the whole model with very limited target data suffers from overfitting, as empirically verified in our experiments; moreover, since the high-level specific filters are also transferred, the similarity between source and target is critical for beneficial transfer (Wang et al., 2018b). Different from TransferGAN, which fine-tunes the whole model, Noguchi & Harada (2019) propose to freeze all parameters and introduce new trainable ones to adapt the hidden batch statistics of the generator for extremely-limited target generation. However, their generator is not adversarially trained (an L1/Perceptual loss is used instead), leading to blurry generation in the target domain (Noguchi & Harada, 2019). By comparison, our method transfers-then-freezes the generally-applicable low-level filters from source to target, and then employs a trainable, tailored high-level network (on top of the low-level layers) to better adapt to the limited target data. Accordingly, compared to TransferGAN, our method is expected to suffer less from overfitting and behave more robustly to differences between source and target domains; compared to (Noguchi & Harada, 2019), our method is more flexible and thus provides clearly better generation, thanks to its adversarial training on the relatively limited data.

3 Our Method

For limited-data generation, we propose to introduce additional information by transferring the valuable low-level filters (those close to the observation) from released GANs pretrained on large-scale datasets. Combining that prior (i.e., the transferred low-level filters, generally applicable but data-demanding to train (Yosinski et al., 2014; Frégier & Gouray, 2019)) with the likelihood from the limited data, one can expect less overfitting and/or better performance (thanks to the transferred information). Specifically, given a source pretrained GAN model, we reuse its low-level filters (termed the general-part) in the target domain, replace the high-level layers (termed the specific-part) with another smaller network, and then train that specific-part using the limited target data while keeping the transferred general-part frozen (see Figures 1(f) and 1(g)).
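The transfer-then-freeze recipe can be sketched in a few lines of NumPy. This is a minimal toy model, not the paper's code: the "general-part" and "specific-part" are stand-in linear maps, and the adversarial loss is replaced by a simple regression loss for clarity. The key point is that gradients flow through the frozen low-level filters, but only the specific-part is updated.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pretrained" low-level filters (close to the observation): frozen after transfer.
W_low = rng.normal(size=(8, 4))
W_low_frozen = W_low.copy()

# Tailored high-level specific-part: reinitialized and trainable.
W_high = rng.normal(size=(4, 2))

def generate(z):
    # high-level code -> rendering through the frozen low-level map
    return W_low @ (W_high @ z)

z = rng.normal(size=(2, 16))          # latent batch
target = rng.normal(size=(8, 16))     # stand-in training signal

def loss():
    return 0.5 * np.sum((generate(z) - target) ** 2) / z.shape[1]

loss_before = loss()
lr = 0.01
for _ in range(200):
    err = generate(z) - target                    # d(loss)/d(output), per sample
    grad_high = W_low.T @ err @ z.T / z.shape[1]  # backprop through the frozen part...
    W_high -= lr * grad_high                      # ...but only the specific-part moves
loss_after = loss()

assert np.allclose(W_low, W_low_frozen)           # general-part stays frozen
```

In a deep-learning framework, the same effect is obtained by disabling gradients on the transferred layers (e.g., setting their parameters as non-trainable) while optimizing only the new head.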

In what follows, we take natural image generation as an example and present our method by answering three questions in Sections 3.1, 3.2, and 3.3, respectively, i.e.,


  • How to specify the general-part for transferring?

  • How to tailor the specific-part?

  • How to better adapt the transferred general-part?

Before introducing the proposed techniques in detail, we first discuss the source dataset, available pretrained GAN models, and the employed evaluation metric. Intuitively, to obtain generally-applicable low-level filters, one would expect a large-scale source dataset with rich diversity. A common choice is the ImageNet dataset (Krizhevsky et al., 2012; Donahue et al., 2014; Oquab et al., 2014; Shin et al., 2016a), which contains about 1.2 million high-resolution images from 1,000 classes; we also adopt it as the source dataset. Concerning released GAN models pretrained on ImageNet, available choices include SNGAN (Miyato et al., 2018), GP-GAN (Mescheder et al., 2018), and BigGAN (Brock et al., 2019); we employ the pretrained GP-GAN model because of its well-written codebase and the available computational resources. To evaluate generative performance, we adopt the widely used Fréchet inception distance (FID, lower is better) (Heusel et al., 2017), a metric assessing the realism and variation of generated samples (Zhang et al., 2018).
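FID compares Gaussians fitted to feature statistics of real and generated images; in practice the features come from an Inception network, but the distance itself has a closed form: ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2}). A minimal NumPy sketch follows (feature extraction omitted; the toy "features" below are random stand-ins, not Inception activations):

```python
import numpy as np

def fid(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians fitted to feature statistics."""
    # For symmetric PSD matrices, Tr((cov1 cov2)^{1/2}) equals
    # Tr((cov1^{1/2} cov2 cov1^{1/2})^{1/2}), whose argument is symmetric,
    # so an eigendecomposition suffices (no general matrix sqrt needed).
    vals1, vecs1 = np.linalg.eigh(cov1)
    sqrt1 = vecs1 @ np.diag(np.sqrt(np.clip(vals1, 0, None))) @ vecs1.T
    inner_vals = np.clip(np.linalg.eigvalsh(sqrt1 @ cov2 @ sqrt1), 0, None)
    tr_sqrt = np.sum(np.sqrt(inner_vals))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 8))             # stand-in "real" features
fake = rng.normal(loc=0.5, size=(5000, 8))    # stand-in "generated" features
mu_r, cov_r = real.mean(0), np.cov(real, rowvar=False)
mu_f, cov_f = fake.mean(0), np.cov(fake, rowvar=False)
print(fid(mu_r, cov_r, mu_f, cov_f))   # shrinks as the two sets match
```

The distance is zero when the two fitted Gaussians coincide and grows with mismatch in either mean (realism/content) or covariance (variation).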

3.1 On Specifying the General-Part for Transferring

As mentioned in the Introduction, both generative and recognition models share a similar pattern: higher-level filters portray more task-specific/data-specific information, while lower-level ones (those close to the observation) portray more generally-applicable information (Yosinski et al., 2014; Zeiler & Fergus, 2014; Bau et al., 2017, 2018). Given the GP-GAN model pretrained on ImageNet, it is natural to ask how many low-level filters (i.e., how large a general-part) can be transferred to the target domain. Generally speaking, the optimal answer is a compromise dependent on the available target data: given plenty of data (more likelihood information), fewer low-level filters should be transferred (less prior is needed); but when target data are limited (limited likelihood information), it is better to transfer more filters (more prior is preferred). We address that question empirically by transferring the pretrained GP-GAN model to the CelebA dataset (Liu et al., 2015), which is fairly different from the source ImageNet (see Figure 2). It is worth emphasizing that the general-part discovered here also delivers excellent results on three other datasets in the experiments, likely because the newly introduced AdaFM technique (see Section 3.3) has strong modulation power to adapt the transferred low-level filters to the target domain.

Figure 2: Sample images from the ImageNet and CelebA datasets. Although quite different, they are likely to share the same set of low-level filters describing basic shapes like lines and curves.

3.1.1 General-Part in Generator

To determine the suitable general-part of the GP-GAN generator to transfer to the target CelebA dataset (to verify the generalization of pretrained filters, we bypass the limited-data assumption in this section and use the whole CelebA dataset for training), we adopt the GP-GAN architectures and design experiments with an increasing number of lower generator layers included in the transferred/frozen general-part; the remaining specific-part of the generator (see Figure 1(h)) and the discriminator are reinitialized and trained on CelebA.

Four settings for the generator general-part are tested, i.e., transferring the 2, 4, 5, and 6 lower groups (termed G2, G4, G5, and G6, respectively; G4 is illustrated in Figure 1(f)). After training until the generative quality stabilizes, we show in Figure 3 the generated samples and FIDs of the four settings. It is clear that the G2/G4 general-part delivers decent generative quality (see eye details, hair texture, and cheek smoothness), even though the source ImageNet is quite different from the target CelebA, confirming the generalization of the low-level filters from up to the 4 lower groups of the pretrained GP-GAN generator (also verified on three other datasets in the experiments). The lower FID of G4 relative to G2 indicates that transferring more low-level filters pretrained on large-scale source datasets potentially yields better performance in the target domain. (Another reason might be that training well-behaved low-level filters is quite time-consuming and data-demanding (Frégier & Gouray, 2019; Noguchi & Harada, 2019). The worse FID of G2 is believed to be caused by insufficiently trained low-level filters, as we find the images from G2 show relatively lower diversity and contain strange textures in the details (see Figure 13 in the Appendix); note also that FID is biased towards texture rather than shape (Karras et al., 2019b).) But when we freeze more groups as the generator general-part (i.e., G5 and G6), the generative quality drops quickly; this is expected, as higher-level filters are more specific to the source ImageNet and may not fit the target CelebA. Reviewing Figure 3, we choose G4 as the setting for the generator general-part.

Figure 3: Generated samples and FIDs from different settings for the generator general-part. GmDn indicates freezing the m/n lower groups as the general-part of the generator/discriminator.
Figure 4: Generated samples and FIDs from different settings for the discriminator general-part.

3.1.2 General-Part in Discriminator

Based on the G4 generator general-part, we next conduct experiments to specify the discriminator general-part. We consider transferring/freezing the 0, 2, 3, and 4 lower groups of the pretrained GP-GAN discriminator (termed D0, D2, D3, and D4, respectively; D2 is illustrated in Figure 14 of the Appendix). Figure 4 shows the generated samples and FIDs for each setting. Similar to what is found for the generator, transferring more low-level filters from the pretrained GP-GAN discriminator also leads to better generative performance (compare the FID of D0 with that of D2), thanks to the additional information therein; however, as the higher-level filters are more specific to the source ImageNet, they lead to decreased generative quality (see the results from D3 and D4). Therefore, considering both generator and discriminator, we transfer/freeze the G4D2 general-part from the pretrained GP-GAN model, which will be shown to work quite well on three other target datasets in the experiments.

3.2 On Tailoring the High-Level Specific-Part

Figure 5: Generated images from the GPHead and SmallHead trained on the Flowers dataset (Nilsback & Zisserman, 2008).

Even with the transferred G4D2 general-part, the remaining specific-part may contain too many trainable parameters given the limited available data in the target domain (e.g., the GPHead model in Figure 1(f) shows mode collapse (see Figure 5) when trained on the small Flowers dataset (Nilsback & Zisserman, 2008)); another consideration is that, when using a GAN for synthetic augmentation in limited-data applications, style mixing is a highly appealing capability (Wang & Perez, 2017). Motivated by these concerns, we propose to replace the high-level specific-part with a tailored smaller network (compare Figure 1(f) with 1(g)), to alleviate overfitting, to enable style mixing, and to lower the computational/memory cost.

Specifically, that tailored specific-part is constructed as a fully connected (FC) layer followed by two successive style blocks (borrowed from StyleGAN (Karras et al., 2019a), with an additional shortcut; see Figure 1(c)). Similar to StyleGAN, the style blocks enable unsupervised disentanglement of high-level attributes, which may benefit efficient exploration of the underlying data manifold and thus better generation; they also enable generating samples with new attribute combinations (style mixing), which dramatically enlarges the generation diversity (see Figures 12 and 16 in the Appendix). Note that the tailored specific-part is also used in our full method (see Figure 1(h)). We term the model consisting of that specific-part and the G4D2 general-part the SmallHead. Different from the GPHead, the SmallHead trains stably without mode collapse on Flowers (see Figure 5). In the experiments, the SmallHead is also found to work well on other small datasets.
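Style mixing with such a specific-part amounts to feeding different latent codes to different style blocks. A toy NumPy sketch of the idea follows; the mapping network, block sizes, and modulation rule below are simplified stand-ins, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in mapping network producing a 16-dim style vector from a latent code.
W_map = rng.normal(size=(16, 8))
def mapping(z):
    return np.tanh(W_map @ z)

def style_block(h, s):
    # Per-channel modulation of the features by a style slice (AdaIN-like).
    return h * (1.0 + s)

def specific_part(z_src, z_dst=None):
    """If z_dst is given, it overrides the style of the second (finer) block:
    a minimal version of style mixing."""
    w_src = mapping(z_src)
    w_dst = w_src if z_dst is None else mapping(z_dst)
    h = np.ones(8)                    # stand-in for the FC-layer output
    h = style_block(h, w_src[:8])     # coarse styles (e.g., shape, location)
    h = style_block(h, w_dst[8:])     # fine styles (e.g., color, details)
    return h

z1, z2 = rng.normal(size=8), rng.normal(size=8)
mixed = specific_part(z1, z2)         # coarse from z1, fine from z2
```

Every pair of latent codes thus yields a new combination of coarse and fine attributes, which is what enlarges the effective diversity for augmentation.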

3.3 Better Adaption of the Transferred General-Part

Figure 6: A demonstrative example motivating AdaFM. Both source and target share the same basic “shape” within each channel but use a different among-channel correlation. AdaFM learns to adapt the source filters to the target.

Based on the above transferred general-part and tailored specific-part, we next present a new technique, termed adaptive filter modulation (AdaFM), to better adapt the transferred low-level filters to the target domain for boosted performance, as shown in the experiments.

Motivated by the style transfer literature (Huang & Belongie, 2017; Noguchi & Harada, 2019), where one manipulates the style of an image by modifying the statistics (like the mean or variance) of its latent feature maps, we alternatively consider manipulating the style of a function (represented by the transferred general-part) by modifying the statistics of its convolutional filters via AdaFM.

Specifically, a tiny number of learnable parameters, i.e., a scale \(\gamma \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}\) and a shift \(\beta \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}\), where \(C_{\text{in}}/C_{\text{out}}\) denote the numbers of input/output channels, are introduced in AdaFM to modulate the transferred/frozen convolutional filters \(\mathbf{W} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K_1 \times K_2}\) with kernel size \(K_1 \times K_2\) in the general-part. Namely,

\[ \widehat{\mathbf{W}}_{i,j,:,:} = \gamma_{i,j} \mathbf{W}_{i,j,:,:} + \beta_{i,j}, \]

for \(i \in \{1, \dots, C_{\text{out}}\}\) and \(j \in \{1, \dots, C_{\text{in}}\}\). \(\widehat{\mathbf{W}}\) is then convolved with the input feature maps to produce the output ones. Applying AdaFM to the convolutional kernels of a residual block (see Figure 1(a) (He et al., 2016)) gives the AdaFM block (see Figure 1(b)). With the residual blocks of the SmallHead replaced by AdaFM blocks, we obtain our generator, as shown in Figure 1(h), which delivers boosted performance over the SmallHead in the experiments.
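In code, AdaFM is a single broadcasted affine transform over the frozen filter bank. A minimal NumPy sketch (shapes are illustrative):

```python
import numpy as np

def adafm(W, gamma, beta):
    """Modulate frozen filters W of shape (C_out, C_in, K1, K2) with a
    learnable scale gamma and shift beta, each of shape (C_out, C_in)."""
    return gamma[:, :, None, None] * W + beta[:, :, None, None]

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32, 3, 3))   # transferred, kept frozen
gamma = np.ones((64, 32))             # learnable; identity at initialization
beta = np.zeros((64, 32))

W_hat = adafm(W, gamma, beta)
assert np.allclose(W_hat, W)          # identity init leaves the filters intact
```

During training, only gamma and beta receive gradients while W stays fixed, so each modulated layer adds just 2 * C_out * C_in trainable parameters.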

For a better understanding of the power of our AdaFM, below we draw parallel connections to two of its (approximate) special cases. The first is named weight demodulation, which is revealed in the recently proposed StyleGAN-II (Karras et al., 2019b), a model with state-of-the-art generative quality. Compared with our AdaFM, weight demodulation employs a zero shift \(\beta = \mathbf{0}\) and a rank-one scale with

\[ \gamma_{i,j} = \frac{s_j}{\sqrt{\sum_{j,k_1,k_2} (s_j \mathbf{W}_{i,j,k_1,k_2})^2 + \epsilon}}, \]

where the style \(s\) is produced by a trainable mapping network (often an MLP) and \(\epsilon\) is a small constant to avoid numerical issues. Despite being closely related, AdaFM and weight demodulation are motivated differently: we propose AdaFM to better adapt the transferred/frozen filters to the target domain, while weight demodulation is used to relax instance normalization while keeping controllable style mixing (Karras et al., 2019b).
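Weight demodulation can likewise be written as an operation on the filter bank; the NumPy sketch below (with illustrative shapes; s plays the role of the style vector) makes the rank-one structure explicit:

```python
import numpy as np

def demodulate(W, s, eps=1e-8):
    """Weight-demodulation-style modulation: scale the input channels of
    W (C_out, C_in, K1, K2) by a style vector s (C_in,), then normalize
    each output filter; the shift is implicitly zero."""
    Ws = W * s[None, :, None, None]                     # style modulation
    d = np.sqrt((Ws ** 2).sum(axis=(1, 2, 3)) + eps)    # per-output-filter norm
    return Ws / d[:, None, None, None]

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8, 3, 3))
s = rng.uniform(0.5, 1.5, size=8)
out = demodulate(W, s)

# The same operation as a rank-one AdaFM scale: gamma[i, j] = s[j] / d[i].
d = np.sqrt(((W * s[None, :, None, None]) ** 2).sum(axis=(1, 2, 3)) + 1e-8)
gamma = s[None, :] / d[:, None]
assert np.allclose(gamma[:, :, None, None] * W, out)
```

The scale factorizes as an outer product of a per-output and a per-input vector, i.e., it is rank-one, whereas AdaFM's gamma is a full C_out-by-C_in matrix.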

The other special case of AdaFM is the Filter Selection (FS) presented in (Noguchi & Harada, 2019), which employs a rank-one simplification to both the scale and the shift. Specifically, \(\gamma = \gamma_1 \mathbf{1}^T\) and \(\beta = \beta_1 \mathbf{1}^T\) with \(\gamma_1, \beta_1 \in \mathbb{R}^{C_{\text{out}}}\) (see Figures 1(d) and 1(e)). The motivation of FS is to “select” the transferred filters; for example, if \(\gamma_1\) is a binary vector, it literally selects among the filters. As FS does not modulate among input channels, its basic assumption is that the source and target share the same correlation among input feature maps, which might not be true. See the demonstrative example in Figure 6, where the source domain has the basic pattern of an almost-red square whereas the target has the basic pattern of an almost-green square (with the same shape); it is clear that simply selecting (FS) a filter/pattern will not work, calling for a modulation (AdaFM) among filter channels. Figure 7 shows the results from models with FS and AdaFM (using the same architecture as in Figure 1(h)); it is obvious that AdaFM brings boosted performance, empirically supporting the above intuition that the basic shape within each channel is generally applicable while the correlation among input/output channels may be data-specific (this is further supported by AdaFM delivering boosted performance on other datasets in the experiments).

Figure 7: Generated samples from our model (a) with AdaFM, and (b) with AdaFM replaced by FS. (c) FID scores along training.

4 Experimental Results

Figure 8: FID scores (left) and generated images (right) of Scratch and Our methods on target datasets. The transferred general-part dramatically accelerates the training, leading to better performance.

Taking natural image generation as an illustrative example, we demonstrate the effectiveness of the proposed techniques by transferring the GP-GAN model pretrained on the large-scale source ImageNet dataset (containing about 1.2 million images from 1,000 classes) to facilitate limited-data generation on four smaller target datasets, CelebA (Liu et al., 2015), Flowers (Nilsback & Zisserman, 2008), Cars (Krause et al., 2013), and Cathedral (Zhou et al., 2014), as well as their variants containing only 1,000 images.

The experiments proceed by (1) demonstrating the advantage of our method over existing/naive ones; (2) conducting ablation studies to analyze the contribution of each component of our method; (3) verifying the proposed techniques in more challenging settings with only 1,000 target images; (4) analyzing why/how AdaFM leads to boosted performance; and (5) illustrating the potential of exploiting the tailored specific-part for data augmentation in limited-data applications. Generated images and FID (Heusel et al., 2017) are used to evaluate generative performance. Experimental settings and more results are given in the Appendix. Code is available in the supplementary material.

4.1 Comparisons with Existing/Naive Methods

To demonstrate our contributions over existing/naive methods, we compare our method with (1) TransferGAN (Wang et al., 2018b), which initializes with the pretrained GP-GAN model (accordingly, the same network architectures are adopted; see Figure 1(f)) and then fine-tunes all parameters on the target data, and (2) Scratch, which trains a model with the same architecture as ours (see Figure 1(h)) from scratch on the target data.

Figure 9: FID scores from the ablation studies of our method on CelebA (left) and the smaller datasets of Flower, Cars, and Cathedral (right).
Table 1: FID scores from the ablation studies of our method at the end of training. Lower is better.
Method CelebA Flowers Cars Cathedral
(a): GP-GAN 19.48 failed failed failed
(b): GPHead 11.15 failed failed failed
(c): SmallHead 12.42 29.94 20.64 34.83
(d): Our 9.90 16.76 10.10 15.78
Method CelebA Flowers Cars Cathedral
TransferGAN 18.69 failed failed failed
Scratch 16.51 29.65 11.77 30.59
Our 9.90 16.76 10.10 15.78
Table 2: FID scores of the compared methods at the end of training. Lower is better. “Failed” means training/mode collapse.

The experimental results are shown in Figure 8, with the final FID scores summarized in Table 2. Since TransferGAN employs the source (large) GP-GAN architecture, it may suffer from overfitting if the target data are too limited, which manifests as training/mode collapse; accordingly, TransferGAN fails on the small datasets: Flowers, Cars, and Cathedral. By comparison, thanks to the tailored specific-part, both Scratch and our method train stably on all target datasets, as shown in Figure 8. Compared to Scratch, our method shows dramatically increased training efficiency, thanks to the transferred low-level filters, and significantly improved generative quality (a much better FID in Table 2), which are attributed to both the transferred general-part and the better adaptation to the target domain via AdaFM.

4.2 Ablation Study of Our Method

Figure 10: FID scores on CelebA-1K (left), Flower-1K (center), and Cathedral-1K (right). The best FID achieved is marked with a star.
Table 3: The best FID achieved during training on the limited-1K datasets. Lower is better.
Method CelebA-1K Flowers-1K Cathedral-1K
Scratch 20.75 58.18 39.97
Our-G4D2 14.19 46.68 38.17
Our-G4D3 13.99 - -
Our-G4D5 19.77 43.05 35.88

To reveal how each component contributes to the performance of our method, we consider experimental settings in a sequential manner. (a) GP-GAN: adopt the GP-GAN architectures (Figure 1(f), but with all parameters trainable and randomly initialized), used as a baseline where no low-level filters are transferred. (b) GPHead: use the model in Figure 1(f), to demonstrate the contribution of the transferred general-part. (c) SmallHead: employ the model in Figure 1(g), to reveal the contribution of the tailored specific-part. (d) Our: leverage the model in Figure 1(h), to show the contribution of the presented AdaFM.

The FID curves during training and the final FID scores of the compared methods are shown in Figure 9 and Table 1, respectively. By comparing GP-GAN with GPHead in Figure 9 (left) and Table 1 (on CelebA), it is clear that the transferred general-part contributes by dramatically increasing training efficiency (also refer to Figure 8) and by improving performance. Comparing SmallHead to both GPHead and GP-GAN in Table 1 indicates that the tailored specific-part helps alleviate overfitting (stable training). By better adapting the transferred general-part to the target domains, the proposed AdaFM contributes most to the boosted performance (see Figure 9 (right) and Table 1), empirically confirming our intuition in Section 3.3.

4.3 More Challenging Limited-Data Generation

To verify the effectiveness of the proposed techniques in more challenging settings, we consider limited-data generation with only 1,000 data samples. Specifically, we randomly select 1,000 images from CelebA, Flowers, and Cathedral to form their limited-1K variants, termed CelebA-1K, Flowers-1K, and Cathedral-1K, respectively. Since TransferGAN already fails on the full small datasets (see Section 4.1), we omit it and only compare our method with Scratch. The FID curves versus training iterations are shown in Figure 10, with the lowest FIDs summarized in Table 3. Under the challenging settings with only 1,000 training images, both Scratch and our method with the G4D2 general-part (labeled Our-G4D2) suffer from overfitting, as shown in Figure 10; Scratch actually suffers more, due to its larger number of trainable parameters. Compared with Scratch, our method trains more efficiently (see Figure 10) and delivers significantly better best-achieved performance (see Table 3). To alleviate overfitting, we transfer more discriminator filters from the pretrained GP-GAN model, with the results given in Figure 10 and Table 3. Patterns emerge similar to those in Section 3.1, i.e., less data (likelihood) calls for more transferred information (prior); moreover, the discovered G4D2 general-part works fairly well (see the comparable FID scores in Table 3).

4.4 Analysis of AdaFM and Style Augmentation with the Tailored Specific-Part

Figure 11: Boxplots of the learned scale (a) and shift (b) within each group of the generator general-part. (c) Sorted comparison of the learned scales on different datasets (see Appendix E for details).

To better understand why adopting AdaFM in the transferred general-part leads to boosted performance, we summarize in Figure 11(a)(b) the learned scale and shift from different groups of the generator general-part. It is clear that all transferred filters are used in the target domains (no zero-valued scales), but with clear modulations. As AdaFM delivers boosted performance, such modulation is evidently crucial to a successful transfer from source to target, confirming the intuition discussed in Section 3.3. Next, to illustrate how the learned scales behave on different target datasets, we show in Figure 11(c) their sorted comparisons within one group; apparently, different datasets prefer different modulations, as expected, justifying the necessity of AdaFM and its performance gain. Concerning an explicit demonstration of AdaFM and medical/biological applications with gray-scale images, we conduct another experiment on a gray-scale variant of Cathedral (results are given in Appendix I due to space constraints), where we find that without AdaFM, worse (blurry and messy) details are observed in the generated images (refer also to Figure 7), likely because of the mismatched correlation among channels between the source and target domains.

Figure 12: Style mixing on Flowers via the tailored specific-part. The source controls the general flower shape, location, and background, while the destination controls color and petal details.

To reveal the potential of exploiting the tailored specific-part for data augmentation in limited-data applications, we conduct style mixing with the specific-part following (Karras et al., 2019a). Figure 12 shows the results on Flowers (see Appendix F for details and more results). Style mixing enables synthesizing vast numbers of new images via style/attribute combination. Therefore, the tailored specific-part can be used for diverse data augmentation, which we believe is extremely appealing for downstream limited-data applications.

5 Conclusions

We reveal that the valuable information (specifically, low-level filters) within GAN models pretrained on large-scale source datasets (like ImageNet) can be transferred to limited-data generation in a target domain. A small specific network is developed to alleviate overfitting on the limited target data and to enable style mixing for diverse data augmentation. We also present adaptive filter modulation (AdaFM) to better adapt the transferred filters to the target domain, which delivers boosted performance on limited-data generation.


Appendix A Experimental settings

All training images are resized to the same resolution for consistency. The dimension of the latent vector is set separately for the large dataset, i.e., CelebA, and for the small datasets, i.e., Flowers, Cars, and Cathedral.

The provided FID scores are calculated between real and generated images on CelebA, Flowers, Cars, and Cathedral, respectively. Adam is used as the optimizer. Because of limited computation power, we use a small batch size.
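For reference, the FID used throughout is the Fréchet distance between Gaussian fits of real and generated feature statistics. The sketch below operates on precomputed feature vectors (in practice these come from an Inception network, which is omitted here); the function name and shapes are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """Frechet Inception Distance between two feature sets
    (rows = samples, columns = feature dimensions):
    ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    s_r = np.cov(feats_real, rowvar=False)
    s_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s_r @ s_f)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(s_r + s_f - 2 * covmean))

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 8))
assert fid(x, x) < 1e-3  # identical distributions give (near-)zero FID
```

Lower FID therefore means the generated distribution's feature statistics are closer to those of the real data.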

Appendix B On explaining the worse FID of GD

We believe the worse FID of GD is caused by its insufficiently trained low-level filters. By observing the generated samples, we find that:

  • some samples look similar to each other, indicating that the generator has relatively low diversity;

  • most samples contain a strange texture that looks like water spots (see Figure 13).

Both phenomena negatively affect the final FID.

Figure 13: Samples generated from GD. Water-spot-shaped texture appears in the hair area (indicated by yellow boxes).

Appendix C On transferring discriminator and reinitializing the specific-part

To determine a suitable general-part of the discriminator to transfer from the pretrained GP-GAN model to the target CelebA dataset, we design a series of experiments that freeze some low-level groups as the general-part and train the remaining high-level specific-part. Figure 14 shows an example of the transfer architecture for the discriminator.

Figure 14: The transfer architecture for the discriminator, with the general-part consisting of the lower groups. The architecture is inherited from GP-GAN.

With the general-parts of the generator and discriminator frozen at their values pretrained on ImageNet, we only reinitialize and train the remaining specific-part. The reinitialization is as follows:

  • For all layers except the FC layers in the generator/discriminator, we use the values pretrained on ImageNet as initialization;

  • For the FC layers in the generator/discriminator: since the model pretrained on ImageNet used a conditional training configuration, the input to its generator is a concatenation of the latent vector and an embedded label vector, which is not applicable to CelebA (the same situation holds for the output of the discriminator); we therefore initialize both FC layers randomly.
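The reinitialization scheme above can be sketched as follows. This is a simplified illustration (layer names, shapes, and the init scale are all assumptions, not the paper's exact architecture): every non-FC layer copies its pretrained value, FC layers are drawn fresh, and the general-part is marked frozen.

```python
import numpy as np

def build_target_params(pretrained, general_keys, rng):
    """Sketch of the transfer scheme in this appendix:
    - every layer except the FC layers starts from ImageNet-pretrained values;
    - FC layers are re-initialized randomly (in the real model their shapes
      differ, since the source model was class-conditional);
    - layers in `general_keys` are marked frozen (not trained on the target)."""
    params, frozen = {}, set(general_keys)
    for name, value in pretrained.items():
        if name.startswith("fc"):
            params[name] = rng.standard_normal(value.shape) * 0.02
        else:
            params[name] = value.copy()
    return params, frozen

rng = np.random.default_rng(0)
pretrained = {"g1.conv": np.ones((4, 4)), "g5.conv": np.ones((4, 4)),
              "fc_in": np.ones((8, 4))}
params, frozen = build_target_params(pretrained, {"g1.conv"}, rng)
assert np.array_equal(params["g1.conv"], pretrained["g1.conv"])  # transferred
assert not np.array_equal(params["fc_in"], pretrained["fc_in"])  # re-initialized
```

A training loop would then update only parameters whose names are not in `frozen`.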

Appendix D TransferGAN fails to train on the three small datasets

Since the FC layer of the pretrained model is not applicable to the target data (as discussed in Appendix C), we implement the TransferGAN method by initializing all layers except the FC layer with the values pretrained on ImageNet, leaving the FC layer randomly initialized.

When applied to the small datasets, the TransferGAN method suffers from mode collapse (as shown in Figure 15), which we attribute to overfitting.

Figure 15: The TransferGAN method suffers from mode collapse on the small datasets. These are supplementary results for Section 4.1 of the main manuscript.

Appendix E How to obtain the matrix shown in Figure 11(c)?

The sorted demonstration of the learned scale is obtained from the last convolution layer in Group 2. We compute the sorted matrix shown in Figure 11(c) as follows:

  • reshape the learned scale matrix into a vector for each dataset;

  • stack these vectors into a matrix, where each row is obtained from a specific dataset;

  • clip all values of the matrix to a fixed range, then re-scale them to a common range;

  • for each dataset, find its associated set of columns and sort these columns according to the values in the corresponding row.
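The steps above can be sketched in NumPy as follows. The clip range, the [0, 1] rescale target, and the rule assigning each column to the dataset with the largest value in it are assumptions made for illustration; the paper's exact choices were lost in extraction.

```python
import numpy as np

def sorted_scale_matrix(scales, clip=2.0):
    """Sketch of the Figure 11(c) procedure:
    - each row is the flattened learned scale of one dataset;
    - values are clipped and rescaled to [0, 1] (assumed range);
    - each column is assigned to the dataset with the largest value in it
      (assumed rule), then sorted within that dataset's block."""
    m = np.stack([s.ravel() for s in scales])   # one row per dataset
    m = np.clip(m, -clip, clip)
    m = (m - m.min()) / (m.max() - m.min())     # rescale to [0, 1]
    owner = m.argmax(0)                         # column-to-dataset assignment
    order = []
    for i in range(m.shape[0]):
        cols = np.flatnonzero(owner == i)
        order.extend(cols[np.argsort(m[i, cols])])
    return m[:, order]

rng = np.random.default_rng(0)
scales = [rng.standard_normal((4, 4)) for _ in range(3)]
m = sorted_scale_matrix(scales)
assert m.shape == (3, 16)                       # rows = datasets, cols preserved
```

The resulting heat map then shows, per dataset, a sorted block of the scales it dominates, making cross-dataset differences in modulation visually obvious.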

Appendix F Style Mixing on Flowers and CelebA

The style mixing results shown in Figure 12 of the main manuscript are obtained as follows. Following (Karras et al., 2019a), given the generative process of a “source” image, we replace its style input of Group 3 (the arrow at left, see Figure 1(h)) with that from a “destination” image, and then propagate through the rest of the generator to generate a new image with mixed style. (We choose Group 3 for example demonstration; one can of course control the input to other Groups, or even hierarchically control all of them.)
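The swap is simple to express in code. Below is a toy sketch: the generator is modeled as a chain of groups, each consuming its own style vector, and style mixing replaces one group's style with the destination's. The group structure, shapes, and `np.tanh` transform are all stand-ins for the real architecture.

```python
import numpy as np

def generate(styles, groups):
    """Toy generator: each group transforms the running feature
    using its own style input (all ops/shapes illustrative)."""
    x = np.zeros_like(styles[0])
    for style, w in zip(styles, groups):
        x = np.tanh(w @ x + style)
    return x

def style_mix(src_styles, dst_styles, groups, mixed_group=2):
    """Replace the source's style input of one group (index 2 here,
    standing in for 'Group 3') with the destination's, then finish
    the forward pass as usual."""
    styles = list(src_styles)
    styles[mixed_group] = dst_styles[mixed_group]
    return generate(styles, groups)

rng = np.random.default_rng(0)
groups = [rng.standard_normal((8, 8)) for _ in range(4)]
src = [rng.standard_normal(8) for _ in range(4)]
dst = [rng.standard_normal(8) for _ in range(4)]
mixed = style_mix(src, dst, groups)
assert not np.allclose(mixed, generate(src, groups))  # the swap changes the output
```

Because only the chosen group's style changes, attributes controlled by the other groups (here, those fed by the source) carry over to the mixed image.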

A similar style mixing is conducted on CelebA, with the results shown in Figure 16. We observe that the styles produced for source images control the identity, posture, and hair type, while the styles produced for destination images control the sex, color, and expression.

Figure 16: Style mixing on CelebA via the tailored specific-part. The source controls the identity, posture, and hair type, while the destination controls the sex, color, and expression.

Appendix G More generated samples for comparing our method with Scratch

Figures 17, 18, and 19 show more randomly generated samples from our method and Scratch, supplementing Figure 8 in Section 4.1 of the main manuscript. Thanks to the transferred low-level filters and the better adaptation to the target domain via AdaFM, our method shows much higher generation quality than Scratch.

Figure 17: More generated samples on CelebA for Figure 8 in Section 4.1 of the main manuscript. (a) Our; (b) Scratch.
Figure 18: More generated samples on Flowers for Figure 8 in Section 4.1 of the main manuscript. (a) Our; (b) Scratch.
Figure 19: More generated samples on Cathedral for Figure 8 in Section 4.1 of the main manuscript. (a) Our; (b) Scratch.

Appendix H The contribution of the proposed AdaFM

To demonstrate the contribution of the proposed AdaFM, randomly generated samples from our method (with AdaFM) and SmallHead (without AdaFM) are shown in Figures 20, 21, and 22. Note that the only difference between our method and SmallHead is the use of AdaFM. Clearly, AdaFM greatly improves the generation quality.

Figure 20: Randomly generated samples from our method and SmallHead on CelebA. (a) Our (with AdaFM); (b) SmallHead (without AdaFM).
Figure 21: Randomly generated samples from our method and SmallHead on Flowers. (a) Our (with AdaFM); (b) SmallHead (without AdaFM).
Figure 22: Randomly generated samples from our method and SmallHead on Cathedral. (a) Our (with AdaFM); (b) SmallHead (without AdaFM).

Appendix I Medical/biological applications with gray-scale images

For an explicit demonstration of AdaFM in medical/biological applications with gray-scale images, we conduct an experiment on a gray-scale variant of the Cathedral data, termed gray-Cathedral. Randomly generated samples are shown in Figure 23. Without AdaFM, worse (blurry and messy) details are observed in the generated images, likely because of the mismatched correlation among channels between the source and target domains. With AdaFM, the generated samples have much clearer details, because AdaFM makes it easy to modulate the filters to match the required correlation among the input feature maps of the target data.

(a) Our method with AdaFM
(b) SmallHead without AdaFM
Figure 23: Randomly generated samples on gray-Cathedral for Section 4.4 of the main manuscript (better viewed zoomed in).