F2GAN: Fusing-and-Filling GAN for Few-shot Image Generation

08/05/2020 ∙ by Yan Hong, et al. ∙ Shanghai Jiao Tong University

In order to generate images for a given category, existing deep generative models generally rely on abundant training images. However, extensive data acquisition is expensive, and the ability to learn quickly from limited data is required in many real-world applications. Moreover, these existing methods are not well-suited for fast adaptation to a new category. Few-shot image generation, which aims to generate images from only a few images of a new category, has attracted some research interest. In this paper, we propose a Fusing-and-Filling Generative Adversarial Network (F2GAN) to generate realistic and diverse images for a new category with only a few images. In our F2GAN, a fusion generator is designed to fuse the high-level features of conditional images with random interpolation coefficients, and then fill in attended low-level details with a non-local attention module to produce a new image. Moreover, our discriminator can ensure the diversity of generated images via a mode seeking loss and an interpolation regression loss. Extensive experiments on five datasets demonstrate the effectiveness of our proposed method for few-shot image generation.


1. Introduction

Deep generative models, mainly including Variational Auto-Encoder (VAE) based methods (vae) and Generative Adversarial Network (GAN) based methods (goodfellow2014generative), draw extensive attention from the artificial intelligence community. Despite the advances achieved by current GAN-based methods (cyclegan; stargan1; stargan2; stylegan1; stylegan2; mm2; DoveNet2020; GAIN2019), the remaining bottlenecks of deep generative models are the need for large amounts of training data and the difficulty of fast adaptation to a new category (clouatre2019figr; bartunov2018few; liang2020dawson), especially for newly emerging or long-tail categories. Therefore, it is necessary to consider how to generate images for a new category with only a few images. This task is referred to as few-shot image generation (clouatre2019figr; hong2020matchinggan), which can benefit a wide range of downstream category-aware tasks like few-shot classification (vinyals2016matching; sung2018learning).

Figure 1. Illustration of fusing three conditional images with interpolation coefficients in our proposed F2GAN. The high-level features of conditional images are fused with interpolation coefficients and the details (e.g., color dots representing query locations) of the generated image are filled by using relevant low-level features (e.g., color boxes corresponding to query locations) from conditional images. Best viewed in color.

In the few-shot image generation task, the model is trained on seen categories with sufficient labeled training images. Then, given only a few training images from a new unseen category, the learnt model is expected to produce more diverse and realistic images for this unseen category. In some previous few-shot image generation methods (vinyals2016matching; hong2020matchinggan), the model is trained on seen categories in an episode-based manner (vinyals2016matching), in which a small number (e.g., 1, 3, 5) of images from one seen category are provided in each training episode (vinyals2016matching) to generate new images. The input images used in each training episode are called conditional images. After training, the learnt model can generate new images by using a few conditional images from each unseen category.

To the best of our knowledge, there are only a few works on few-shot image generation. Among them, DAGAN (antoniou2017data) addresses a special case, i.e., one-shot image generation, by injecting random noise into the generator to produce a slightly different image from the same category. However, this method is conditioned on only one image and fails to fuse the information of multiple images from the same category. More recent few-shot image generation methods can be divided into optimization-based methods and metric-based methods. Particularly, the optimization-based FIGR (clouatre2019figr) (resp., DAWSON (liang2020dawson)) adopts a similar idea to Reptile (nichol2018first) (resp., MAML (finn2017model)), by initializing a generator with images from seen categories and fine-tuning the trained model with images from each unseen category. The metric-based GMN (bartunov2018few) (resp., MatchingGAN (hong2020matchinggan)) is inspired by the matching network (vinyals2016matching) and combines the matching procedure with VAE (resp., GAN). However, FIGR, DAWSON, and GMN can hardly produce sharp and realistic images. MatchingGAN performs better, but has difficulty in fusing complex natural images.

In this paper, we follow the idea in (hong2020matchinggan) of fusing conditional images, and propose a novel fusing-and-filling GAN (F2GAN) to enhance the fusion ability. The high-level idea, depicted in Figure 1, is to fuse the high-level features of conditional images and fill in the details of the generated image with relevant low-level features of conditional images. In detail, our method contains a fusion generator and a fusion discriminator, as shown in Figure 2. Our generator is built upon a U-Net structure with skip connections (ronneberger2015u) between the encoder and the decoder. A well-known fact is that in a CNN encoder, shallow blocks encode low-level information at high spatial resolution while deep blocks encode high-level information at low spatial resolution. We interpolate the high-level bottleneck features (the feature vectors between encoder and decoder) of multiple conditional images with random interpolation coefficients. Then, the fused high-level feature is upsampled through the decoder to produce a new image. In each upsampling stage, we borrow missing details from the skip-connected shallow encoder block by using our Non-local Attentional Fusion (NAF) module. Precisely, the NAF module searches the outputs of shallow encoder blocks of conditional images in a global range, to attend the information of interest for each location in the generated image.

In the fusion discriminator, we employ typical adversarial loss and classification loss to enforce the generated images to be close to real images and from the same category of conditional images. To ensure the diversity of generated images, we additionally employ a mode seeking loss and an interpolation regression loss, both of which are related to interpolation coefficients. Specifically, we use a variant of mode seeking loss (mao2019mode) to prevent the images generated based on different interpolation coefficients from collapsing to a few modes. Moreover, we propose a novel interpolation regression loss by regressing the interpolation coefficients based on the features of conditional images and generated image, which means that each generated image can recognize its corresponding interpolation coefficients. In the training phase, we train our F2GAN based on the images from seen categories. In the testing phase, conditioned on a few images from each unseen category, we can randomly sample interpolation coefficients to generate diverse images for this unseen category.

Our contributions can be summarized as follows: 1) we design a new few-shot image generation method F2GAN, which fuses high-level features and fills in low-level details; 2) technically, we propose a novel non-local attentional fusion module in the generator and a novel interpolation regression loss in the discriminator; 3) comprehensive experiments on five real datasets demonstrate the effectiveness of our proposed method.

Figure 2. The framework of our method, which consists of a fusion generator and a fusion discriminator. The new image $\hat{x}$ is generated based on the random interpolation coefficients $a$ and $K$ conditional images $x_1, \ldots, x_K$. Due to space limitation, we only draw three encoder blocks and two decoder blocks. Best viewed in color.

2. Related Work

Data Augmentation: Data augmentation (Krizhevsky2012ImageNet) aims to augment the training set with new samples. Traditional data augmentation techniques (e.g., crop, rotation, color jittering) can only produce limited diversity. Some more advanced augmentation techniques (zhang2017mixup; yun2019cutmix) have been proposed, but they fail to produce realistic images. In contrast, deep generative models can exploit the distribution of training data to generate more diverse and realistic samples for feature augmentation (schwartz2018delta; mm1) and image augmentation (antoniou2017data). Our method belongs to image augmentation and can produce more images to augment the training set.

Generative Adversarial Network: Generative Adversarial Network (GAN) (goodfellow2014generative; xu2019learning) is a powerful generative model based on adversarial learning. In the early stage, unconditional GANs (miyato2018spectral) generated images with random vectors by learning the distribution of training images. Then, GANs conditioned on a single image (miyato2018cgans; antoniou2017data) were proposed to transform the conditional image to a target image. Recently, a few conditional GANs attempted to accomplish more challenging tasks conditioned on more than one image, such as few-shot image translation (liu2019few) and few-shot image generation (clouatre2019figr; bartunov2018few). In this paper, we focus on few-shot image generation, which will be detailed next.

Few-shot Image Generation: Few-shot image generation is a challenging task which aims to generate new images from only a few conditional images. Early few-shot image generation works were limited to certain application scenarios. For example, Bayesian learning and reasoning were applied in (lake2011one; rezende2016one) to learn simple concepts like pen strokes and combine the concepts hierarchically to generate new images. More recently, FIGR (clouatre2019figr) was proposed to combine adversarial learning with the optimization-based few-shot learning method Reptile (nichol2018first) to generate new images. Similar to FIGR (clouatre2019figr), DAWSON (liang2020dawson) applied the meta-learning MAML algorithm (finn2017model) to GAN-based generative models to achieve domain adaptation between seen categories and unseen categories. The metric-based few-shot learning method Matching Network (vinyals2016matching) was combined with the Variational Auto-Encoder (Pu2016Variational) in GMN (bartunov2018few) to generate new images without finetuning in the test phase. MatchingGAN (hong2020matchinggan) attempted to use a learned metric to generate images based on a single or a few conditional images. In this work, we propose a new solution for few-shot image generation, which can generate more diverse and realistic images.

Attention Mechanism: Attention modules aim to localize the regions of interest. Abundant attention mechanisms like spatial attention (xu2016ask), channel attention (chen2017sca), and full attention (wang2018mancs) have been developed. Here, we discuss the two works most related to our method. The method in (lathuiliere2019attention) employs a local attention mechanism to select relevant information from multi-source human images for human image generation, but it fails to capture long-range relevance. Inspired by non-local attention (zhang2019self; wang2018non), we develop a novel non-local attentional fusion (NAF) module for few-shot image generation.

3. Our Method

Given a few conditional images $x_1, \ldots, x_K$ from the same category ($K$ is the number of conditional images) and random interpolation coefficients $a = [a_1, \ldots, a_K]$, our model aims to generate a new image $\hat{x}$ from the same category. We fuse the high-level bottleneck features of conditional images with interpolation coefficients $a$, and fill in the low-level details specified by the Non-local Attentional Fusion (NAF) module during upsampling to generate the new image $\hat{x}$.

We split all categories into seen categories $\mathcal{C}_s$ and unseen categories $\mathcal{C}_u$, where $\mathcal{C}_s \cap \mathcal{C}_u = \emptyset$. In the training phase, our model is trained with images from seen categories to learn a mapping, which translates a few conditional images of a seen category to a new image belonging to the same category. In the testing phase, a few conditional images from an unseen category in $\mathcal{C}_u$ together with random interpolation coefficients are fed into the trained model to generate new diverse images for this unseen category. As illustrated in Figure 2, our model consists of a fusion generator and a fusion discriminator, which will be detailed next.

Figure 3. The architecture of our Non-local Attentional Fusion (NAF) module. $e_{p,k}$ is the feature of conditional image $x_k$ from the $p$-th encoder block, $d$ is the output from the preceding decoder block, and $\hat{e}_p$ is the output of NAF.

3.1. Fusion Generator

Our fusion generator adopts an encoder-decoder structure (antoniou2017data) which is a combination of U-Net (ronneberger2015u) and ResNet (he2016deep). Specifically, there are in total $2P{+}1$ residual blocks ($P$ encoder blocks, $P$ decoder blocks, and one intermediate block), in which each encoder (resp., decoder) block contains convolutional layers with leaky ReLU and batch normalization followed by one downsampling (resp., upsampling) layer, while the intermediate block contains convolutional layers with leaky ReLU and batch normalization. The detailed architecture can be found in Supplementary. The encoder (resp., decoder) blocks progressively decrease (resp., increase) the spatial resolution. For ease of description, the encoder blocks from shallow to deep are indexed from $1$ to $P$, and the decoder blocks are indexed from $1$ to $P$ in processing order. We use $b_k$ to denote the bottleneck feature of $x_k$ from the intermediate block. Besides, we add skip connections between the encoder and the decoder. For $p = 1, \ldots, P-1$, the $p$-th skip connection directs the output from the $p$-th encoder block to the output from the $(P-p)$-th decoder block, whose spatial resolutions match. Then, we use $e_{p,k} \in \mathbb{R}^{H_p \times W_p \times C^e_p}$ to denote the output feature of conditional image $x_k$ from the $p$-th encoder block, and $d_q \in \mathbb{R}^{H'_q \times W'_q \times C^d_q}$ to denote the output feature from the $q$-th decoder block, where $C^e_p$ and $C^d_q$ are the numbers of channels in the $p$-th encoder block and the $q$-th decoder block respectively.

To fuse the bottleneck features $b_1, \ldots, b_K$ of conditional images $x_1, \ldots, x_K$, we randomly sample interpolation coefficients $a = [a_1, \ldots, a_K]$, which satisfy $a_k \geq 0$ and $\sum_{k=1}^{K} a_k = 1$, leading to the fused bottleneck feature $b = \sum_{k=1}^{K} a_k b_k$. Since the spatial size of the bottleneck feature is very small, the spatial misalignment issue can be ignored and the high-level semantic information of conditional images is fused. Then, the fused bottleneck feature $b$ is upsampled through decoder blocks. During upsampling in each decoder block, lots of details are missing and need to be filled in. We borrow the low-level detailed information from the output features of its skip-connected encoder block. Furthermore, we insert a Non-local Attentional Fusion (NAF) module into the skip connection to attend relevant detailed information, as shown in Figure 2. For the $p$-th skip connection, the NAF module takes $\{e_{p,k}\}_{k=1}^{K}$ and the preceding decoder feature as input and outputs $\hat{e}_p$. Then, $\hat{e}_p$ concatenated with the preceding decoder feature is taken as the input to the $(P-p)$-th decoder block.
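The fusing step above — coefficients on the probability simplex combined with the bottleneck features — can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' code; normalizing non-negative draws is just one simple way to satisfy the two constraints on the coefficients.

```python
import numpy as np

def sample_coefficients(K, rng=np.random.default_rng(0)):
    # Draw a = [a_1, ..., a_K] with a_k >= 0 and sum(a) = 1 by sampling
    # non-negative values and normalizing (a Dirichlet draw would also work).
    u = rng.random(K)
    return u / u.sum()

def fuse_bottleneck(features, a):
    # features: (K, C) bottleneck vectors of the K conditional images;
    # the fused feature is their convex combination sum_k a_k * b_k.
    return np.tensordot(a, features, axes=1)

a = sample_coefficients(3)
b = fuse_bottleneck(np.stack([np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]), a)
```

Because the coefficients are convex weights, the fused feature always stays inside the convex hull of the conditional features, which is what makes the interpolation well behaved.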

Our attention-enhanced fusion strategy is a little similar to (lathuiliere2019attention). However, for each spatial location on the decoder feature $d$, the attention module in (lathuiliere2019attention) only attends exactly the same location on the encoder features $e_{p,k}$, which will hinder attending relevant information if the conditional images are not strictly aligned. For example, for category “dog face”, the dog eyes may appear at different locations in different conditional images which have different face poses. Inspired by non-local attention (zhang2019self; wang2018non), for each spatial location on $d$, we search relevant information in a global range on $e_{p,k}$. Specifically, our proposed Non-local Attentional Fusion (NAF) module calculates an attention map $A$ based on $e_{p,k}$ and $d$, in which each entry $A_{i,j}$ represents the attention score between the $i$-th location on $e_{p,k}$ and the $j$-th location on $d$. Therefore, the design philosophy and technical details of our NAF module are considerably different from those in (lathuiliere2019attention).

The architecture of the NAF module is shown in Figure 3. First, the encoder features $e_{p,k}$ of the $K$ conditional images, concatenated into $e_p$, and the decoder feature $d$ are projected to a common space by $\theta(\cdot)$ and $\phi(\cdot)$ respectively, where $\theta$ and $\phi$ are convolutional layers with spectral normalization (miyato2018spectral). For ease of calculation, we reshape $\theta(e_p)$ (resp., $\phi(d)$) into a $C' \times N_e$ (resp., $C' \times N_d$) matrix, in which $N_e$ and $N_d$ are the numbers of spatial locations. Then, we can calculate the attention map $A$ between $e_p$ and $d$:

$A = \operatorname{softmax}\big(\theta(e_p)^{\top}\,\phi(d)\big)$ (1)

With the obtained attention map $A$, we attend information from $e_p$ and achieve the attended feature map $\hat{e}_p$:

$\hat{e}_p = \mu\big(g(e_p)\,A\big)$ (2)

where $g$ denotes a convolutional layer followed by reshaping to a $C' \times N_e$ matrix, similar to $\theta$ and $\phi$ in (1). $\mu$ reshapes the feature map back to the spatial size of $d$ and then performs convolution.
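The attend-and-aggregate computation in (1)–(2) reduces to two matrix products once the feature maps are flattened. The sketch below is a minimal numpy illustration under stated assumptions: the projections $\theta$, $\phi$, $g$ are stood in for by random matrices (in the model they are spectrally normalized convolutions), and the reshape-plus-convolution $\mu$ is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Flattened feature maps: encoder feature e (C channels x Ne locations, the K
# conditional images concatenated along the location axis) and decoder feature d.
C, Ne, Nd, Cp = 8, 3 * 16, 16, 8
e = rng.standard_normal((C, Ne))
d = rng.standard_normal((C, Nd))

# Stand-ins for the projections theta, phi, g (random matrices, not learned convs).
theta = rng.standard_normal((Cp, C))
phi = rng.standard_normal((Cp, C))
g = rng.standard_normal((Cp, C))

# Eq.(1): attention map A (Ne x Nd); each column is a distribution over all
# encoder locations, so every decoder location can look anywhere in e.
A = softmax((theta @ e).T @ (phi @ d), axis=0)

# Eq.(2): gather encoder information for every decoder location.
e_hat = (g @ e) @ A  # (Cp x Nd), reshaped back to a spatial map in the model
```

The key point of the non-local design shows up in the shape of `A`: it couples every encoder location with every decoder location, unlike the location-aligned attention of (lathuiliere2019attention).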

As the shallow (resp., deep) encoder blocks contain the low-level (resp., high-level) information, our generated images can fuse multi-level information of conditional images coherently. Finally, the generated image can be represented by $\hat{x} = G(x_1, \ldots, x_K; a)$, where $G$ denotes the fusion generator.

Following (hong2020matchinggan), we adopt a weighted reconstruction loss to constrain the generated image:

$\mathcal{L}_1 = \sum_{k=1}^{K} a_k \,\| \hat{x} - x_k \|_1$ (3)

Intuitively, the generated image should bear more resemblance to the conditional image with larger interpolation coefficient.
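A toy numpy sketch of this weighted reconstruction loss makes the intuition concrete; the two 2×2 arrays below are hypothetical stand-ins for images.

```python
import numpy as np

def weighted_recon_loss(x_hat, xs, a):
    # sum_k a_k * ||x_hat - x_k||_1: a larger coefficient a_k weights the
    # L1 deviation from x_k more heavily, pulling x_hat toward that image.
    return float(sum(a_k * np.abs(x_hat - x_k).sum() for a_k, x_k in zip(a, xs)))

xs = np.stack([np.zeros((2, 2)), np.ones((2, 2))])  # two toy "images"
```

With all weight on the first image, reproducing it exactly costs nothing; shifting weight to the second image penalizes the same output by its full L1 distance.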

Figure 4. Images generated by DAGAN, MatchingGAN, and our F2GAN (K=3) on five datasets (from top to bottom: Omniglot, EMNIST, VGGFace, Flowers, and Animals Faces). The conditional images are in the left three columns.

3.2. Fusion Discriminator

The network structure of our discriminator $D$ is analogous to that in (liu2019few), which consists of one convolutional layer followed by five residual blocks (mescheder2018training). The detailed architecture can be found in Supplementary. Differently, we use one fully connected (fc) layer with one-dimensional output following an average pooling layer to obtain the discriminator score. We treat conditional images as real images and the generated image as a fake image. In detail, the average score for conditional images and the score for the generated image are calculated for adversarial learning. To stabilize the training process, we use the hinge adversarial loss in (miyato2018cgans). To be exact, the goal of the discriminator is minimizing $\mathcal{L}_D$ while the goal of the generator is minimizing $\mathcal{L}_{GAN}$:

$\mathcal{L}_D = \frac{1}{K}\sum_{k=1}^{K} \max\big(0,\, 1 - D(x_k)\big) + \max\big(0,\, 1 + D(\hat{x})\big), \qquad \mathcal{L}_{GAN} = -D(\hat{x})$ (4)

Analogous to ACGAN (odena2017conditional), we apply a classifier with cross-entropy loss to classify the real images and the generated images into the corresponding seen categories. Specifically, the last fc layer of the discriminator is replaced by another fc layer with the output dimension being the number of seen categories:

$\mathcal{L}_c(x) = -\log P\big(c(x) \mid x\big)$ (5)

where $c(x)$ is the ground-truth category of $x$. We minimize $\mathcal{L}_c(x_k)$ for conditional images when training the discriminator. We update the generator by minimizing $\mathcal{L}_c(\hat{x})$, since we expect the generated image to be classified as the same category as the conditional images.

By varying the interpolation coefficients $a$, we expect to generate diverse images, but one common problem for GANs is mode collapse (mao2019mode), which means that the generated images may collapse into a few modes. In our fusion generator, when sampling two different interpolation coefficient vectors $a^1$ and $a^2$, the generated images $\hat{x}^{a^1}$ and $\hat{x}^{a^2}$ are likely to collapse into the same mode. To guarantee the diversity of generated images, we use two strategies to mitigate mode collapse: one is a variant of the mode seeking loss (mao2019mode) to seek for more modes, and the other is establishing a bijection between the generated image $\hat{x}^{a}$ and its corresponding interpolation coefficients $a$. The mode seeking loss in (mao2019mode) was originally used to produce diverse images when using different latent codes. Here, we slightly twist the mode seeking loss to produce diverse images when using different interpolation coefficients. Specifically, we remove the last fc layer of the discriminator $D$ and use the remaining feature extractor $\tilde{D}$ to extract the features of generated images with different interpolation coefficients. Then, we maximize the ratio of the distance between $\tilde{D}(\hat{x}^{a^1})$ and $\tilde{D}(\hat{x}^{a^2})$ over the distance between $a^1$ and $a^2$, yielding the following mode seeking loss:

$\mathcal{L}_m = \dfrac{\| a^1 - a^2 \|_1}{\big\| \tilde{D}(\hat{x}^{a^1}) - \tilde{D}(\hat{x}^{a^2}) \big\|_1}$ (6)
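A minimal numpy sketch of this mode seeking ratio, with vectors standing in for the coefficient pairs and for the extracted features of the two generated images:

```python
import numpy as np

def mode_seeking_loss(a1, a2, f1, f2, eps=1e-8):
    # Ratio of coefficient distance to generated-feature distance; minimizing
    # it forces different coefficient vectors to yield distant features.
    return float(np.abs(np.asarray(a1) - np.asarray(a2)).sum()
                 / (np.abs(f1 - f2).sum() + eps))

a1, a2 = [0.7, 0.3], [0.3, 0.7]
close = mode_seeking_loss(a1, a2, np.zeros(4), 0.01 * np.ones(4))  # collapsed
far = mode_seeking_loss(a1, a2, np.zeros(4), np.ones(4))           # diverse
```

When the two generated images collapse to nearly identical features, the denominator shrinks and the loss blows up, so the generator is driven toward diverse outputs.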

To further ensure the diversity of generated images, the bijection between the generated image $\hat{x}$ and its corresponding interpolation coefficients $a$ is established by a novel interpolation regression loss, which regresses the interpolation coefficients based on the features of the conditional images $x_k$ and the generated image $\hat{x}$. Note that the feature extractor $\tilde{D}$ is the same as in (6). Specifically, we apply a fully-connected (fc) layer to the concatenated feature $[\tilde{D}(x_k), \tilde{D}(\hat{x})]$, and obtain the similarity score $s_k$ between $x_k$ and $\hat{x}$. Then, we apply a softmax layer to $[s_1, \ldots, s_K]$ to obtain the predicted interpolation coefficients $\tilde{a} = [\tilde{a}_1, \ldots, \tilde{a}_K]$, which are enforced to match the ground-truth $a$:

$\mathcal{L}_a = \| \tilde{a} - a \|_2^2$ (7)

By recognizing the interpolation coefficient based on the generated image and conditional images, we actually establish a bijection between the generated image and interpolation coefficient, which discourages two different interpolation coefficients from generating the same image.
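The score-then-softmax regression described above can be sketched as follows. This is an illustrative numpy sketch under stated assumptions: `w` is a hypothetical fc-layer weight vector, the feature vectors are random stand-ins for $\tilde{D}$ outputs, and a squared L2 penalty is used to match the predicted and ground-truth coefficients.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def interpolation_regression_loss(cond_feats, gen_feat, w, a_true):
    # Score each concatenated pair [feat(x_k), feat(x_hat)] with the fc layer w,
    # softmax the K scores into predicted coefficients, and penalize their
    # squared L2 deviation from the ground-truth coefficients.
    scores = [w @ np.concatenate([f, gen_feat]) for f in cond_feats]
    a_pred = softmax(scores)
    return float(((a_pred - np.asarray(a_true)) ** 2).sum())

rng = np.random.default_rng(0)
cond = [rng.standard_normal(4) for _ in range(3)]
gen = rng.standard_normal(4)
loss = interpolation_regression_loss(cond, gen, rng.standard_normal(8), [1/3] * 3)
```

Since the softmax output is itself a point on the simplex, the predicted coefficients live in the same space as the sampled ones, which is what makes the regression target well defined.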

3.3. Optimization

The overall loss function to be minimized is as follows,

$\mathcal{L} = \mathcal{L}_{GAN} + \mathcal{L}_c + \lambda_1 \mathcal{L}_1 + \lambda_m \mathcal{L}_m + \lambda_a \mathcal{L}_a$ (8)

in which $\lambda_1$, $\lambda_m$, and $\lambda_a$ are trade-off parameters. In the framework of adversarial learning, the fusion generator and the fusion discriminator are optimized by their related loss terms in an alternating manner. In particular, the fusion discriminator is optimized by minimizing $\mathcal{L}_D$ and $\mathcal{L}_c(x_k)$, while the fusion generator is optimized by minimizing $\mathcal{L}_{GAN}$, $\mathcal{L}_c(\hat{x})$, $\mathcal{L}_1$, $\mathcal{L}_m$, and $\mathcal{L}_a$, in which $\mathcal{L}_c(x_k)$ and $\mathcal{L}_c(\hat{x})$ are defined below (5).

Method VGGFace Flowers Animals Faces
FID (↓) IS (↑) LPIPS (↑) FID (↓) IS (↑) LPIPS (↑) FID (↓) IS (↑) LPIPS (↑)
FIGR (clouatre2019figr) 139.83 2.98 0.0834 190.12 1.38 0.0634 211.54 1.55 0.0756
GMN (bartunov2018few) 136.21 2.14 0.0902 200.11 1.42 0.0743 220.45 1.71 0.0868
DAWSON (liang2020dawson) 137.82 2.56 0.0769 188.96 1.25 0.0583 208.68 1.51 0.0642
DAGAN (antoniou2017data) 128.34 4.12 0.0913 151.21 2.18 0.0812 155.29 3.32 0.0892
MatchingGAN (hong2020matchinggan) 118.62 6.16 0.1695 143.35 4.36 0.1627 148.52 5.08 0.1514
F2GAN 109.16 8.85 0.2125 120.48 6.58 0.2172 117.74 7.66 0.1831
Table 1. FID (↓), IS (↑) and LPIPS (↑) of images generated by different methods for unseen categories on three datasets.

4. Experiments

Method Omniglot EMNIST VGGFace
5-sample 10-sample 15-sample 5-sample 10-sample 15-sample 5-sample 10-sample 15-sample
Standard 66.22 81.87 83.31 83.64 88.64 91.14 8.82 20.29 39.12
Traditional 67.32 82.28 83.95 84.62 89.63 92.07 9.12 22.83 41.63
FIGR (clouatre2019figr) 69.23 83.12 84.89 85.91 90.08 92.18 6.12 18.84 32.13
GMN (bartunov2018few) 67.74 84.19 85.12 84.12 91.21 92.09 5.23 15.61 35.48
DAWSON (liang2020dawson) 68.56 82.02 84.01 83.63 90.72 91.83 5.27 16.92 30.61
DAGAN (antoniou2017data) 88.81 89.32 95.38 87.45 94.18 95.58 19.23 35.12 44.36
MatchingGAN (hong2020matchinggan) 89.03 90.92 96.29 91.75 95.91 96.29 21.12 40.95 50.12
F2GAN 91.93 92.48 97.12 93.18 97.01 97.82 24.76 43.21 53.42
Table 2. Accuracy(%) of different methods on three datasets in low-data setting.
Method VGGFace Flowers Animals Faces
5-way 5-shot 10-way 5-shot 5-way 5-shot 10-way 5-shot 5-way 5-shot 10-way 5-shot
MatchingNets (vinyals2016matching) 60.01 48.67 67.98 56.12 59.12 50.12
MAML (finn2017model) 61.09 47.89 68.12 58.01 60.03 49.89
RelationNets (sung2018learning) 62.89 54.12 69.83 61.03 67.51 58.12
MTL (sun2019meta) 77.82 68.95 82.35 74.24 79.85 70.91
DN4 (li2019revisiting) 78.13 70.02 83.62 73.96 81.13 71.34
MatchingNet-LFT (Hungfewshot) 77.64 69.92 83.19 74.32 80.95 71.62
MatchingGAN (hong2020matchinggan) 78.72 70.94 82.76 74.09 80.36 70.89
F2GAN 79.85 72.31 84.92 75.02 82.69 73.19
Table 3. Accuracy(%) of different methods on three datasets in few-shot classification setting.

4.1. Datasets and Implementation Details

We conduct experiments on five real datasets: Omniglot (Brenden2015One), EMNIST (cohen2017emnist), VGGFace (cao2018vggface2), Flowers (nilsback2008automated), and Animal Faces (deng2009imagenet). For VGGFace, Omniglot, and EMNIST, following MatchingGAN (hong2020matchinggan), we randomly split the categories of each dataset into seen categories for training and disjoint unseen categories for testing. For the Animal Faces and Flowers datasets, we use the seen/unseen splits provided in (liu2019few). In Animal Faces, the categories are carnivorous animal categories selected from ImageNet (deng2009imagenet), split into seen categories for training and unseen categories for testing; the Flowers categories are likewise split into seen categories for training and unseen categories for testing.

We set the trade-off parameters $\lambda_1$, $\lambda_m$, and $\lambda_a$ in (8) empirically. We set the number of conditional images to $K=3$ by balancing the benefit against the cost, because a larger $K$ only brings slight improvement (see Supplementary). We use the Adam optimizer with learning rate 0.0001 to train our model.

4.2. Quantitative Evaluation of Generated Images

We evaluate the quality of images generated by different methods on three datasets based on the commonly used Inception Score (IS) (xu2018empirical), Fréchet Inception Distance (FID) (heusel2017gans), and Learned Perceptual Image Patch Similarity (LPIPS) (zhang2018unreasonable). The IS is positively correlated with the visual quality of generated images. We fine-tune the ImageNet-pretrained Inception-V3 model (szegedy2016rethinking) on unseen categories to calculate the IS for generated images. The FID is designed for measuring the similarity between two sets of images. We remove the last average pooling layer of the ImageNet-pretrained Inception-V3 model and use it as the feature extractor. Based on the extracted features, we compute the Fréchet Inception Distance between the generated images and the real images from the unseen categories. The LPIPS measures the average feature distance among the generated images. We compute the average of pairwise distances among generated images for each category, and then compute the average over all unseen categories as the final LPIPS score. The details of distance calculation can be found in (zhang2018unreasonable).
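The per-category-then-overall averaging used for the LPIPS score can be sketched as below. This is a simplified diversity proxy for illustration only: real LPIPS uses a learned perceptual metric between images, whereas this sketch applies plain L2 distance to hypothetical feature vectors.

```python
import numpy as np
from itertools import combinations

def avg_pairwise_distance(feats_per_category):
    # Mean pairwise distance among generated images within each unseen
    # category, then averaged over categories (the LPIPS averaging protocol;
    # plain L2 here stands in for the learned perceptual distance).
    per_category = []
    for feats in feats_per_category:
        dists = [float(np.linalg.norm(f1 - f2))
                 for f1, f2 in combinations(feats, 2)]
        per_category.append(sum(dists) / len(dists))
    return sum(per_category) / len(per_category)
```

A set of identical generations scores zero, and the score grows with spread, which is why a higher LPIPS indicates more diverse outputs.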

For our method, we train our model on seen categories. Then, we use random interpolation coefficients and $K$ conditional images from each unseen category to generate a new image for this unseen category. We can generate adequate images for each unseen category by repeating the above procedure. Similarly, GMN (bartunov2018few), FIGR (clouatre2019figr), and MatchingGAN (hong2020matchinggan) are trained in the same episodic setting on seen categories, and the trained models are used to generate images for unseen categories. Different from the above methods, DAGAN (antoniou2017data) is conditioned on a single image, but we can use one conditional image each time to generate adequate images for unseen categories.

For each unseen category, we use each method to generate the same number of new images based on a few sampled real images, and calculate FID, IS, and LPIPS based on the generated images. The results of different methods are reported in Table 1, from which we observe that our method achieves the highest IS, lowest FID, and highest LPIPS, demonstrating that our model can generate more diverse and realistic images compared with baseline methods.

We show some example images generated by our method on five datasets, including simple concept datasets and relatively complex natural datasets, in Figure 4. For comparison, we also show the images generated by DAGAN and MatchingGAN, which are competitive baselines as demonstrated in Table 1. On the concept datasets Omniglot and EMNIST, we can see that the images generated by DAGAN are closer to the inputs with limited diversity, while MatchingGAN and F2GAN can both fuse features from conditional images to generate diverse images for simple concepts. On the natural datasets VGGFace, Flowers, and Animals Faces, we observe that MatchingGAN can generate plausible images on the VGGFace dataset because face images are well-aligned. However, the images generated by MatchingGAN are of low quality on the Flowers and Animals Faces datasets. In contrast, the images generated by our method are more diverse and realistic than those of DAGAN and MatchingGAN, because the information of more than one conditional image is fused more coherently in our method. In Supplementary, we also visualize our generated results on the FIGR-8 dataset, which is released and used in FIGR (clouatre2019figr), as well as more visualization results on the Flowers and Animals Faces datasets.

4.3. Visualization of Linear Interpolation

To evaluate whether the space of generated images is densely populated, we perform linear interpolation based on two conditional images $x_1$ and $x_2$ for ease of visualization. In detail, for interpolation coefficients $[a_1, a_2]$, we start from $[1, 0]$, and then gradually decrease (resp., increase) $a_1$ (resp., $a_2$) to $0$ (resp., $1$) with a fixed step size. Because MatchingGAN also fuses conditional images with interpolation coefficients, we report the results of both MatchingGAN and our F2GAN in Figure 5. Compared with MatchingGAN, our F2GAN can produce more diverse images with a smoother transition between two conditional images. More results can be found in Supplementary.
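The coefficient schedule for this visualization is a short walk along the simplex edge between the two conditional images; a minimal sketch (the number of steps is an arbitrary choice for illustration):

```python
def interpolation_schedule(steps=10):
    # Walk [a_1, a_2] from [1, 0] to [0, 1] in equal steps; each coefficient
    # pair drives one generated image in the interpolation strip.
    return [(1 - t / steps, t / steps) for t in range(steps + 1)]

sched = interpolation_schedule(10)
```

Each pair in `sched` sums to one, so every intermediate point is a valid set of interpolation coefficients.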

4.4. Low-data Classification

To further evaluate the quality of generated images, we use the generated images to help downstream classification tasks in the low-data setting in this section and the few-shot setting in Section 4.5. For low-data classification on unseen categories, following MatchingGAN (hong2020matchinggan), we randomly select a few (e.g., 5, 10, 15) training images per unseen category while the remaining images in each unseen category are test images. Note that we have training and testing phases for the classification task, which are different from the training and testing phases of our F2GAN. We initialize the ResNet (he2016deep) backbone using the images of seen categories, and then train the classifier using the training images of unseen categories. Finally, the trained classifier is used to predict the test images of unseen categories. This setting is referred to as “Standard” in Table 2.

Then, we use the generated images to augment the training set of unseen categories. For each few-shot generation method, we generate new images for each unseen category based on the training set of unseen categories. Then, we train the ResNet classifier on the augmented training set (including both the original training set and the generated images) and apply the trained classifier to the test set of unseen categories. We also use traditional augmentation techniques (e.g., crop, rotation, color jittering) to augment the training set and report the results as “Traditional” in Table 2.

The results of different methods are listed in Table 2. On Omniglot and EMNIST datasets, all methods outperform “Standard” and “Traditional”, which demonstrates the benefit of deep augmentation methods. On VGGFace dataset, our F2GAN, MatchingGAN (hong2020matchinggan), and DAGAN (antoniou2017data) outperform “Standard”, while the other methods underperform “Standard”. One possible explanation is that the images generated by GMN and FIGR on VGGFace are of low quality, which harms the classifier. It can also be seen that our proposed F2GAN achieves significant improvement over baseline methods, which corroborates the high quality of our generated images.

Figure 5. Linear interpolation results of MatchingGAN (top row) and our F2GAN (bottom row) based on two conditional images $x_1$ and $x_2$ on the Flowers dataset.

4.5. Few-shot Classification

We follow the $N$-way $C$-shot setting in few-shot classification (vinyals2016matching; sung2018learning) by creating evaluation episodes and calculating the accuracy averaged over multiple evaluation episodes. In each evaluation episode, $N$ categories are randomly selected from the unseen categories. Then, $C$ images from each of the $N$ categories are randomly selected as the training set while the remaining images are used as the test set. We use a pretrained ResNet (he2016deep) (pretrained on the seen categories) as the feature extractor and train a linear classifier for the selected unseen categories. Besides the $N \times C$ training images, our fusion generator produces additional images for each of the $N$ categories to augment the training set.
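The episode construction described above can be sketched with the standard library alone; `images_by_category` is a hypothetical mapping from each unseen category to its images.

```python
import random

def sample_episode(images_by_category, n_way, c_shot, seed=0):
    # One N-way C-shot evaluation episode: pick N unseen categories, take C
    # training images per category, and keep the rest as that category's
    # test images.
    rng = random.Random(seed)
    cats = rng.sample(sorted(images_by_category), n_way)
    train, test = {}, {}
    for c in cats:
        imgs = list(images_by_category[c])
        rng.shuffle(imgs)
        train[c], test[c] = imgs[:c_shot], imgs[c_shot:]
    return train, test

data = {f"cat{i}": list(range(20)) for i in range(8)}
train, test = sample_episode(data, n_way=5, c_shot=5)
```

Averaging accuracy over many such episodes (with different seeds) gives the reported few-shot classification numbers.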

We compare our method with existing few-shot classification methods, including representative methods MatchingNets (vinyals2016matching), RelationNets (sung2018learning), MAML (finn2017model) as well as state-of-the-art methods MTL (sun2019meta), DN4 (li2019revisiting), MatchingNet-LFT (Hungfewshot). Note that no augmented images are added to the training set of unseen categories for these baseline methods. Instead, we strictly follow their original training procedure, in which the images from seen categories are used to train those few-shot classifiers. Among the baselines, MAML (finn2017model) and MTL (sun2019meta) need to further fine-tune the trained classifier based on the training set of unseen categories in each evaluation episode.

We also compare our method with the competitive few-shot generation baseline MatchingGAN (hong2020matchinggan). For MatchingGAN, we use the same setting as our F2GAN and generate augmented images for unseen categories. Besides, we compare our F2GAN with FUNIT (liu2019few) in the Supplementary.

Taking 5-way 5-shot and 10-way 5-shot as examples, we report the accuracy averaged over evaluation episodes on three datasets in Table 6. Our method achieves the best results in both settings on all datasets, which shows the benefit of using the augmented images produced by our fusion generator.

Figure 6. Visualization of attention maps learned by the NAF module. In each row, the generated image (first column) is produced from three conditional images. For each color dot (query location) in the generated image, we draw an arrow of the same color pointing to the bright region (the most-attended region for that query location) in the corresponding conditional image's attention map. Best viewed in color.

4.6. Ablation Studies

The number of conditional images: To analyze the impact of the number of conditional images, we train F2GAN with K conditional images based on seen categories, and generate new images for unseen categories with K' conditional images. Due to space limitation, we leave the details to the Supplementary.

Loss terms: In our method, we employ the weighted reconstruction loss (3), the mode seeking loss (6), and the interpolation regression loss (7). To investigate the impact of these three loss terms, we conduct experiments on the Flowers dataset by removing each loss term from the final objective (8) separately. The quality of generated images is evaluated from two perspectives. On the one hand, IS, FID, and LPIPS of generated images are computed as in Section 4.2. On the other hand, we report the accuracy of few-shot classification augmented with generated images as in Section 4.5. The results are summarized in Table 4, which shows that ablating the weighted reconstruction loss leads to a slight degradation of generated images. Another observation is that without the mode seeking loss, the results w.r.t. all metrics become much worse, which indicates that the mode seeking loss can enhance the diversity of generated images. Besides, when removing the interpolation regression loss, the diversity and realism of generated images are compromised, resulting in lower classification accuracy.
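The exact formulations are given in Eqns. (3), (6), and (7) of the main paper; the sketch below is only a schematic NumPy rendering of the mode seeking idea (penalize two generated images for collapsing together when their interpolation coefficients differ). The L1 metric and feature-space distance are illustrative assumptions.

```python
import numpy as np

def mode_seeking_loss(gen_feat_1, gen_feat_2, coeff_1, coeff_2, eps=1e-8):
    """Schematic mode seeking term: minimizing the ratio of coefficient
    distance to generated-image distance pushes images generated from
    different interpolation coefficients apart, discouraging mode collapse."""
    d_coeff = np.linalg.norm(coeff_1 - coeff_2, ord=1)
    d_image = np.linalg.norm(gen_feat_1 - gen_feat_2, ord=1)
    return d_coeff / (d_image + eps)
```

Intuitively, the loss is large when two clearly different coefficient vectors yield nearly identical images, which is exactly the collapse the diversity metrics (LPIPS) in Table 4 are sensitive to.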

Attention module: In our fusion generator, a Non-local Attentional Fusion (NAF) module is designed to borrow low-level information from the encoder. To corroborate the effectiveness of our design, we remove the NAF module and directly connect the fused encoder features with the output of the corresponding decoder blocks via skip connection, which is referred to as "w/o NAF" in Table 4. Besides, we replace our NAF module with the local attention used in (lathuiliere2019attention) to compare the two attention mechanisms, which is referred to as "local NAF" in Table 4. The results show that both "local NAF" and our NAF achieve better results than "w/o NAF", which proves the necessity of an attention-enhanced fusion strategy. We also observe that our NAF module improves the realism and diversity of generated images, as reflected by lower FID, higher IS, and higher LPIPS.

Moreover, we visualize the attention maps in Figure 6. The first column exhibits the images generated from three conditional images. For each generated image, we choose three representative query locations, which borrow low-level details from the three conditional images respectively. For each conditional image, we obtain its attention map from the corresponding rows of the attention matrix in (1). For each colored query point, we draw an arrow of the same color to summarize the most-attended regions (bright regions) in the corresponding conditional image. In the first row, we can see that the red (resp., green, blue) query location in the generated flower borrows some color and shape details from the first (resp., second, third) conditional image. Similarly, in the second row, the red (resp., green, blue) query location in the generated dog face borrows visual details of the forehead (resp., tongue, cheek) from the first (resp., second, third) conditional image.
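The attention computation behind these maps can be sketched in NumPy as below. This is a schematic sketch only: the real NAF module uses learned 1x1 projections for queries, keys, and values, which are omitted here, and the shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_attention_fusion(decoder_feat, encoder_feats):
    """Schematic non-local attention over conditional-image features.

    decoder_feat:  (Hq*Wq, C)    query features from a decoder block.
    encoder_feats: (K, Hk*Wk, C) low-level features of K conditional images.
    Returns attended features (Hq*Wq, C) and the attention map
    (Hq*Wq, K*Hk*Wk), whose rows correspond to the maps in Figure 6.
    """
    k, hw, c = encoder_feats.shape
    kv = encoder_feats.reshape(k * hw, c)  # keys/values from all K images
    attn = softmax(decoder_feat @ kv.T / np.sqrt(c), axis=-1)
    return attn @ kv, attn
```

Each row of `attn` sums to 1 and says, for one query location in the generated image, how much low-level detail is borrowed from every spatial position of every conditional image.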

setting accuracy (%) FID (↓) IS (↑) LPIPS (↑)
w/o weighted reconstruction loss 74.89 122.68 6.39 0.2114
w/o mode seeking loss 73.92 125.26 4.92 0.1691
w/o interpolation regression loss 72.42 122.12 4.18 0.1463
w/o NAF 72.62 137.81 5.11 0.1825
local NAF 73.98 134.45 5.92 0.2052
F2GAN 75.02 120.48 6.58 0.2172
Table 4. Ablation studies of our loss terms and attention module on Flowers dataset.

5. Conclusion

In this paper, we have proposed a novel few-shot image generation method F2GAN, which fuses the high-level features of conditional images and fills in detailed information borrowed from the conditional images. Technically, we have developed a non-local attentional fusion module and an interpolation regression loss. We have conducted extensive generation and classification experiments on five datasets to demonstrate the effectiveness of our method.

Acknowledgements.
The work is supported by the National Key R&D Program of China (2018AAA0100704) and is partially sponsored by the National Natural Science Foundation of China (Grant No. 61902247) and the Shanghai Sailing Program (19YF1424400).

References

6. The number of conditional images

To analyze the impact of the number of conditional images, we train our F2GAN with K conditional images based on seen categories, and generate new images for unseen categories with K' conditional images. By default, we set K = 3 in our experiments. We evaluate the quality of images generated with different K and K' via low-data (i.e., 10-sample) classification (see the main paper). Taking the EMNIST dataset as an example, we report the results in Table 5 by varying K and K'. From Table 5, we can observe that our F2GAN achieves satisfactory performance when K' = K. The performance generally increases as K increases (except from 3 to 5), but the performance gain is not very significant. Then, we observe how the performance varies with K' for a fixed K. Given a fixed K, when K' < K, the performance drops a little compared with K' = K. Instead, when K' > K, the performance drops sharply, especially when K' is much larger than K. One possible explanation is that when we train our F2GAN with K conditional images, it is not adept at fusing the information of more conditional images (K' > K) in the testing phase.

K'=3 97.01 96.86 95.82 94.56
K'=5 95.24 96.98 96.08 95.52
K'=7 93.76 95.13 97.23 96.86
K'=9 90.17 92.74 94.38 97.86
Table 5. Accuracy (%) of low-data (10-sample) classification augmented by our F2GAN with different K (columns) and K' (rows) on the EMNIST dataset.

7. More Generation Results

We show more example images generated by our F2GAN (K=3) on the Flowers and Animals datasets in Figure 7 and Figure 8, respectively. Besides, we additionally conduct experiments on the FIGR-8 (clouatre2019figr) dataset, which is not used in our main paper. The generated images on the FIGR-8 dataset are shown in Figure 9. On all three datasets, our F2GAN can generally generate diverse and plausible images based on a few conditional images. However, for some complex categories with very large intra-class variance, the generated images are not very satisfactory. For example, in Figure 8, the mouths of some generated dog faces look a little unnatural. We conjecture that in these hard cases, our fusion generator may have difficulty in fusing the high-level features of the conditional images or seeking relevant details from them.

Figure 7. Images generated by our F2GAN(K=3) on Flowers dataset. The conditional images are in the left three columns.
Figure 8. Images generated by our F2GAN(K=3) on Animals Faces dataset. The conditional images are in the left three columns.
Figure 9. Images generated by our F2GAN(K=3) on FIGR-8 dataset  (clouatre2019figr). The conditional images are in the left three columns.

8. More Interpolation results

Following Section 4.3 in the main paper, we show more interpolation results of our F2GAN in Figure 10. Given two images from the same unseen category, we perform linear interpolation between these two conditional images. In detail, for interpolation coefficients (a, 1-a), we start from a = 1 and gradually decrease a to 0 (so that 1-a increases from 0 to 1) with a fixed step size. It can be seen that our F2GAN is able to produce diverse and realistic images with rich details between the two conditional images, even when the two conditional images are quite different.
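The coefficient schedule can be sketched as follows; the step size of 0.1 is an illustrative assumption, as the paper's exact step is not reproduced here.

```python
def interpolation_coefficients(step=0.1):
    """Coefficient pairs (a, 1 - a) for two conditional images, sweeping
    a from 1 down to 0. Each pair sums to 1, so every generated image is a
    convex combination of the two conditional images' fused features."""
    n = round(1.0 / step)
    return [(round(1.0 - i * step, 10), round(i * step, 10))
            for i in range(n + 1)]
```

Feeding each pair to the fusion generator produces the image sequence shown in Figure 10, morphing from one conditional image to the other.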

Figure 10. Linear interpolation results of our F2GAN on Flowers dataset.

9. Comparison with Few-shot Image Translation

Few-shot image translation methods like FUNIT (liu2019few) mainly borrow category-invariant content information from seen categories to generate new images for unseen categories in the testing phase. Technically, FUNIT disentangles the category-relevant factors (i.e., class code) and category-irrelevant factors (i.e., content code) of images. Next, we refer to the images from seen (resp., unseen) categories as seen (resp., unseen) images. By replacing the content code of an unseen image with those of seen images, FUNIT can generate more images for this unseen category. However, in this way, few-shot image translation can only introduce category-irrelevant diversity, but fails to introduce enough category-relevant diversity for category-specific properties.

To confirm this point, we conduct few-shot classification experiments (see Section 4.5) to evaluate the quality of generated images. Based on the released model of FUNIT (liu2019few) trained on Animal Faces (deng2009imagenet), we use the class codes of unseen images and the content codes of seen images to generate new images for each unseen category. Then, we use the generated images to help few-shot classification, which is referred to as "FUNIT-1" in Table 6. Besides, we also exchange content codes within the images from the same unseen category to generate new images for each unseen category, but the number of new images generated in this way is quite limited, since in the K-shot setting only the K images of each unseen category are available for exchanging codes. We refer to this setting as "FUNIT-2" in Table 6.

From Table 6, it can be seen that “FUNIT-1” is better than “FUNIT-2”, because “FUNIT-1” leverages a large amount of extra seen images when generating new unseen images. However, “FUNIT-1” is inferior to some state-of-the-art few-shot classification methods as well as our F2GAN, because FUNIT cannot introduce adequate category-relevant diversity as analyzed above.

Method 5-way 5-shot 10-way 5-shot
MatchingNets (vinyals2016matching) 59.12 50.12
MAML (finn2017model) 60.03 49.89
RelationNets (sung2018learning) 67.51 58.12
MTL (sun2019meta) 79.85 70.91
DN4 (li2019revisiting) 81.13 71.34
MatchingNet-LFT (Hungfewshot) 80.95 71.62
MatchingGAN (hong2020matchinggan) 80.36 70.89
FUNIT-1 78.02 69.12
FUNIT-2 75.29 67.87
F2GAN 82.69 73.19
Table 6. Accuracy (%) of different methods on the Animals Faces dataset in the few-shot classification setting.

10. Details of Network Architecture

Generator: In our fusion generator, there are in total 11 residual blocks (5 encoder blocks, 5 decoder blocks, and 1 intermediate block), in which each encoder (resp., decoder) block contains convolutional layers with leaky ReLU and batch normalization followed by one downsampling (resp., upsampling) layer, while the intermediate block contains convolutional layers with leaky ReLU and batch normalization. The architecture of our generator is summarized in Table 7.

Layer Resample Norm Output Shape
Image - - 128*128*3
Conv - - 128*128*32
Residual Block AvgPool BN 64*64*64
Residual Block AvgPool BN 32*32*64
Residual Block AvgPool BN 16*16*96
Residual Block AvgPool BN 8*8*96
Residual Block AvgPool BN 4*4*128
Residual Block - BN 4*4*128
Residual Block Upsample BN 8*8*96
Residual Block Upsample BN 16*16*96
Residual Block Upsample BN 32*32*64
Residual Block Upsample BN 64*64*64
Residual Block Upsample BN 128*128*64
Conv - - 128*128*3
Table 7. The network architecture of our fusion generator. BN denotes batch normalization.
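As a sanity check on Table 7, the spatial resolutions follow directly from the resampling column: each AvgPool halves the feature map and each Upsample doubles it. A minimal sketch reproducing the progression:

```python
def generator_spatial_sizes(input_size=128):
    """Walk the resampling ops of the 11 residual blocks in Table 7
    (5 downsampling, 1 identity, 5 upsampling) and return the spatial
    size after each block."""
    ops = ["down"] * 5 + ["same"] + ["up"] * 5
    sizes, s = [], input_size
    for op in ops:
        if op == "down":
            s //= 2
        elif op == "up":
            s *= 2
        sizes.append(s)
    return sizes
```

The walk confirms that a 128x128 input reaches a 4x4 bottleneck and is restored to 128x128 before the final convolution, matching the output shapes in Table 7.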

Discriminator: Our discriminator is analogous to that in (liu2019few) and consists of one convolutional layer followed by five groups of residual blocks. Each group is structured as ResBlk -> ResBlk -> AvgPool, where ResBlk is a ReLU-first residual block (mescheder2018training), and the number of channels in the five groups is set to 64, 128, 256, 512, and 1024, respectively. We use one fully connected (fc) layer with a single output following a global average pooling layer to obtain the discriminator score. The architecture of our discriminator is summarized in Table 8.

The classifier shares the feature extractor with the discriminator and only replaces the last fc layer with another fc layer whose number of outputs equals the number of seen categories. The mode seeking loss and the interpolation regression loss also use the feature extractor of the discriminator. Specifically, we remove the last fc layer from the discriminator to extract the features of generated images, based on which the two losses are calculated.

Layer Resample Norm Output Shape
Image - - 128*128*3
Conv - - 128*128*32
Residual Blocks AvgPool - 64*64*64
Residual Blocks AvgPool - 32*32*128
Residual Blocks AvgPool - 16*16*256
Residual Blocks AvgPool - 8*8*512
Residual Blocks AvgPool - 4*4*1024
Global GlobalAvgPool - 1*1*1024
FC - - 1*1*1
Table 8. The network architecture of our fusion discriminator.