Learning Inception Attention for Image Synthesis and Image Recognition

by   Jianghao Shen, et al.
NC State University

Image synthesis and image recognition have witnessed remarkable progress, but often at the expense of computationally expensive training and inference. Learning lightweight yet expressive deep models has emerged as an important and interesting direction. Inspired by the well-known split-transform-aggregate design heuristic in the Inception building block, this paper proposes a Skip-Layer Inception Module (SLIM) that facilitates efficient learning of image synthesis models, and a same-layer variant (dubbed SLIM too) as a stronger alternative to the well-known ResNeXts for image recognition. In SLIM, the input feature map is first split into a number of groups (e.g., 4). Each group is then transformed to a latent style vector (via channel-wise attention) and a latent spatial mask (via spatial attention). The learned latent masks and latent style vectors are aggregated to modulate the target feature map. For generative learning, SLIM is built on the recently proposed lightweight Generative Adversarial Networks (i.e., FastGANs), which present a skip-layer excitation (SLE) module. For few-shot image synthesis tasks, the proposed SLIM achieves better performance than the SLE work and other related methods. For one-shot image synthesis tasks, it shows a stronger capability of preserving image structures than prior arts such as the SinGANs. For image classification tasks, the proposed SLIM is used as a drop-in replacement for convolution layers in ResNets (resulting in ResNeXt-like models) and achieves better accuracy on the ImageNet-1000 dataset, with significantly smaller model complexity.





1 Introduction

Figure 1: Illustration of the squeeze-excitation (SE) module [17] (Left), the skip-layer excitation (SLE) module [35] (Middle), and the proposed Skip-Layer Inception Modulation module for image synthesis (Right), dubbed SLIM (albeit it is visually “fat-looking”). See text for details.

1.1 Motivation and Objective

Image synthesis and classification are classic problems in computer vision and machine learning, and they remain challenging. Remarkable progress has been made since the recent resurgence of deep neural networks (DNNs). Super-human performance has been reported for image classification on ImageNet-1000 [9], and high-quality synthesized images have even triggered serious concerns about so-called deep fakes [12]. However, synthesizing and classifying images typically entails computationally expensive training and inference, which may even lead to environmental issues due to the carbon footprint [47]. Thus, learning lightweight yet highly-expressive deep models has emerged as an important and interesting research direction, especially with less data. This paper focuses on learning few-shot and one-shot image synthesis models and on designing and learning smaller yet expressive models from scratch for image recognition.

Designing and/or searching for better neural architectures has always been one main fundamental problem in deep learning. Among many remarkable works, the ResNets [14] and the ResNeXts [56] are representative architectures which have been deployed in numerous applications and used as strong baselines in state-of-the-art DNN development.

Consider generative adversarial networks (GANs): state-of-the-art methods such as BigGANs [4] and StyleGANs [25, 26] utilize ResNets as their backbones. Although powerful, as the resolution of synthesized images goes higher, the width and the depth of a generator network grow accordingly, leading to a much increased memory footprint and longer training time. Recently, by exploiting differentiable data augmentation methods [63], Liu et al. [35] present a FastGAN approach which further introduces a skip-layer channel-wise excitation (SLE) module as a drop-in component for both generator and discriminator networks, as well as a decoder-enhanced design for self-supervised discriminator networks. FastGANs have shown exciting results which outperform the well-known and powerful StyleGANv2 [26] under low-shot training settings. In this paper, we propose to extend the SLE module in FastGANs by leveraging both channel-wise and spatial attention mechanisms to improve the quality of synthesized images, especially at the resolution of 1024×1024, while retaining the efficiency and stability of training.

Meanwhile, learning generative models with less data has been pushed to the extreme of using a single image. SinGANs have shown very promising and exciting results [46]. However, state-of-the-art methods lack the capability of sufficiently preserving dominant image structures. To address this structure-preservation challenge, in this paper we adapt the same module that we propose to replace the SLE in FastGANs [35] to learn structure-aware SinGANs, while retaining the efficiency.

In addition to developing the skip-layer feature modulation/attention for image synthesis tasks, this paper also investigates a similar idea under the same-layer setting for DNNs in image classification tasks.

1.2 Method Overview

SLIM for Image Synthesis. Fig. 1 (Right) illustrates the proposed Skip-Layer Inception Modulation (SLIM) module. Unlike the channel-wise 1-D re-calibration weights for a target feature map in both the SE and SLE modules (i.e., one scalar weight per channel), the proposed SLIM aims to learn spatial-block-wise 3-D re-calibration weights (i.e., one weight per channel and per spatial location), which are spatial-block-wise due to the nearest upsampling operation from a low-resolution mask to the full resolution of the target feature map. The learning of 3-D re-calibration weights enables richer information flow from a source feature map to a target one, which leads to better synthesis quality.

More specifically, the proposed SLIM is motivated by the split-transform-aggregate design heuristic popularized by the well-known Google’s Inception networks [51]. In the SLIM module, an input source feature map is first split into a number of groups (e.g., 4). Each group is then transformed to a latent style vector (via channel-wise attention, the right-top of Fig. 1) and a latent spatial mask (via spatial attention, the right-bottom of Fig. 1). An optional noise injection layer is introduced to enhance the style. The learned latent masks and latent style vectors are aggregated to form the block-wise 3-D re-calibration weights modulating the target feature map. The Inception design also induces fine-grained style learning and mixing between different groups split from a source feature map.
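The split-transform-aggregate flow just described can be sketched as a small PyTorch module. This is a minimal illustration under our own assumptions: the module and variable names, channel sizes, and the use of a group-wise Softmax for mask normalization are our reading of the design, not the released implementation, and the optional noise injection layer is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SLIMSketch(nn.Module):
    """Sketch of SLIM: split the source map into groups, learn a latent
    style vector (channel attention) and a latent spatial mask (spatial
    attention) per group, then aggregate them into 3-D re-calibration
    weights that modulate the target feature map."""
    def __init__(self, src_ch, tgt_ch, groups=4, dilation=2):
        super().__init__()
        assert src_ch % groups == 0 and tgt_ch % groups == 0
        self.groups = groups
        gc, oc = src_ch // groups, tgt_ch // groups
        # channel-attention branch -> one latent style vector per group
        self.style = nn.ModuleList([nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(gc, oc), nn.BatchNorm1d(oc), nn.ReLU(),
            nn.Linear(oc, oc)) for _ in range(groups)])
        # spatial-attention branch -> one latent spatial mask per group
        self.mask = nn.ModuleList([
            nn.Conv2d(gc, 1, 3, padding=dilation, dilation=dilation)
            for _ in range(groups)])

    def forward(self, src, tgt):
        n, c, H, W = tgt.shape
        oc = c // self.groups
        chunks = src.chunk(self.groups, dim=1)
        styles = torch.stack([b(x) for b, x in zip(self.style, chunks)], 1)  # n,g,oc
        masks = torch.cat([m(x) for m, x in zip(self.mask, chunks)], 1)      # n,g,h,w
        masks = torch.softmax(masks, dim=1)          # normalize masks across groups
        masks = F.interpolate(masks, size=(H, W), mode='nearest')            # n,g,H,W
        # aggregate: repeat styles spatially and masks channel-wise
        w = styles.view(n, self.groups, oc, 1, 1) * masks.unsqueeze(2)       # n,g,oc,H,W
        return tgt * w.reshape(n, c, H, W)
```

Note how the nearest upsampling of the low-resolution masks makes the resulting 3-D weights spatial-block-wise, as described above.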

The proposed SLIM is deployed in FastGANs [35] and SinGANs [46]. For FastGANs, the proposed SLIM module is a drop-in replacement for the original SLE module, and it obtains better performance, especially for high-resolution image synthesis at the resolution of 1024×1024, while retaining comparable efficiency. For SinGANs, it shows a stronger capability of preserving image structures than the Concurrent SinGANs (ConSinGANs) [16].

Figure 2: Illustration of the Same-Layer Inception Module (dubbed SLIM too) for image recognition. It can be treated as a strong alternative to the widely used SE module and as a new realization of the ResNeXts [56]. See text for details.

SLIM for Image Recognition. Fig. 2 illustrates the Same-Layer Inception Module (dubbed SLIM too). Similarly, we first split the input feature map into a number of groups (e.g., 4). For each input group, the transformation consists of a standard convolution (with a 3×3 kernel), the channel attention (based on SE), and the spatial attention (which can be treated as a spatially-adaptive variant of SE). The spatial-channel attention is the product of the channel attention and the spatial attention, resulting in a 3-D attention matrix. Softmax is applied along the channel dimension of the 3-D attention matrix, which is then used to re-calibrate the output of the convolution branch in an element-wise / spatially-adaptive way. The module is used to replace all the 3×3 convolutions in a feature backbone (e.g., ResNet-50), and achieves better accuracy with significantly smaller model complexity on ImageNet-1000 [9].
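A single group of this same-layer transformation might look as follows in PyTorch. This is a hedged sketch: the class name, layer choices, and the reduction ratio are our assumptions for illustration, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class SameLayerGroup(nn.Module):
    """One group of the same-layer SLIM idea: a 3x3 convolution whose output
    is re-calibrated by a 3-D attention matrix, the product of channel
    attention (SE-style) and spatial attention, with Softmax applied along
    the channel dimension."""
    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.ch_attn = nn.Sequential(          # channel attention (based on SE)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, in_ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(in_ch // reduction, out_ch, 1))
        self.sp_attn = nn.Conv2d(in_ch, 1, 3, padding=1)  # spatially-adaptive part

    def forward(self, x):
        a = self.ch_attn(x) * self.sp_attn(x)  # broadcast -> out_ch x H x W per sample
        a = torch.softmax(a, dim=1)            # normalize along the channel dimension
        return self.conv(x) * a                # element-wise re-calibration
```

Splitting the input into several such groups and concatenating their outputs yields the ResNeXt-like multi-branch structure described above.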

2 Related Work and Our Contributions

Unconditional image synthesis and GANs.

Unconditional image synthesis is a generative learning task which seeks probabilistic models that are capable of explaining away data under the maximum likelihood estimation (MLE) framework. To this end, introducing multiple layers of hidden variables is often entailed, which in turn popularizes DNNs. There are two types of realizations in the literature: top-down generator models [11, 13, 28, 44, 37, 61, 56, 57] that are non-linear generalizations of factor analysis [45], and bottom-up energy-based models [33, 40, 10, 27, 62, 58, 41, 29, 32] that are in the form of a Gibbs or Boltzmann distribution. It is well-known that training a DNN-based generator model individually is an extremely difficult task, mainly due to the difficulty of sampling the posterior for real images. Adversarial training is a popular workaround and GANs [11] are widely used in practice, formulated under a two-player minimax game setting. The proposed SLIM module can be used as a drop-in component in the generator network of GANs.

Light-weight GANs with Few-Shot Learning. Compared to the extensive research on light-weight neural architectures in discriminative learning for mobile platforms, much less work has been done in generative learning [30]. The residual network [14] is the most popular choice, on top of which powerful generative models such as BigGANs [4] and StyleGANs [24, 26] have been built with remarkable progress. For high-resolution image synthesis, these models are very computationally expensive in training and inference. Meanwhile, training these models typically requires a big dataset, which further increases the training time. Few-shot learning is appealing, but very challenging in training GANs, since data augmentation methods developed for discriminative learning tasks are not directly applicable. To address this challenge, differentiable data augmentation methods and variants [63, 23, 52, 64] have been proposed for training GANs with very exciting results. Very recently, equipped with the differentiable data augmentation method [63], a FastGAN approach [35] was proposed to realize light-weight yet sufficiently powerful GANs with several novel designs including the SLE module. The proposed SLIM is built on the SLE in FastGANs by exploiting the well-known Google Inception [50, 51] building block design and the Átrous convolution [6].

Learning Unconditional GANs from a Single Image. There are several works on learning GANs from a single texture image [2, 22, 34]. Recently, a SinGAN approach [46] has shown surprisingly good results on learning unconditional GANs from a single non-texture image. It is further improved in ConSinGANs [16], which jointly train several stages in progressively growing the generator network. However, preserving image structures in synthesis remains a challenging problem. The proposed SLIM is applied to the vanilla SinGANs [46], leading to a simplified workflow that can be trained in a stage-wise manner, and is thus more efficient than ConSinGANs while facilitating a stronger capability of preserving image structures.

Attention Mechanism in Deep Networks. Attention reallocates the available computational resources to the components of a signal most relevant to the task [42, 19, 20, 31, 38, 53]. Attention mechanisms have been widely used in computer vision tasks [21, 5, 60, 7]. The SE module [17] applies a lightweight self-gating module to facilitate channel-wise feature re-calibration. Our proposed SLIM module incorporates spatially-adaptive feature modulation while maintaining the lightweight design, improving the representation power for efficient discriminative learning in addition to generative learning.

Our Contributions. In summary, this paper makes the following four main contributions in unconditional image synthesis using GANs and image analysis: (i) it proposes a Skip-Layer Inception Modulation (SLIM) module to facilitate better learning of generative models from few-shot / one-shot images, and to provide visually-interpretable visualization of the skip-layer information flow; (ii) it shows significantly better performance for high-resolution image synthesis at the resolution of 1024×1024 when deployed in the FastGANs [35], while retaining the efficiency; (iii) it enables a simplified workflow for SinGANs [46], and shows a stronger capability of preserving image structures than prior arts; (iv) it further presents a Same-Layer Inception Module for image classification, which serves as a stronger new realization of the ResNeXt-like models when deployed in the ResNets.

Figure 3: The generator network of FastGANs [35] and the drop-in replacement of the SLE module by the proposed SLIM module. Note that the network specification is reproduced based on the officially released code of FastGANs (link). See text for detail.

3 Approach

In this section, we present details of the proposed SLIM modules for image synthesis and image recognition.

3.1 The SLIM for Image Synthesis

Unconditional Image Synthesis. The goal is to learn a generator network G which maps a latent code z to an image,

I = G(z; Θ),   (1)

where z represents a 1-D latent code (e.g., a noise vector in FastGANs [35]) or a 2-D latent code (e.g., a spatial noise map in SinGANs [46]), which is typically drawn from a standard Normal distribution (i.e., white noise). Θ collects all the parameters of the generator network G. Given a latent code, it is straightforward to generate an image.

FastGANs. As shown in Fig. 3, the generator network used in FastGANs [35] adopts a minimally-simple yet elegantly chosen design methodology. Given an input latent code, the initial block applies a transposed convolution to map the latent code to a feature map. Then, a composite block (UpCompBlock) and a plain block (UpBlock) are interleaved to map the feature map to the one at a given target resolution (e.g., 1024×1024). Batch normalization [18] and the gated linear unit (GLU) [8] are used in the building blocks. Spectral normalization [36] is also used for all the convolution and linear layers. In a composite upsample block, noise injection is used right after the convolution operation. As aforementioned, a plain feedforward architecture will suffer from possible gradient-vanishing and style-effect-vanishing issues. The SLE module (the middle of Fig. 1) is proposed to mitigate these issues. In training, an auxiliary image is also synthesized at a lower resolution, which further helps alleviate potential gradient-vanishing issues and collects multi-scale feedback from the discriminator. The discriminator in FastGANs also includes some novel designs with the SLE module applied. Since our focus is to re-design the SLE module in the generator and we keep the discriminator unchanged in the experiments, we omit the discussion of the discriminator due to space limits. Please refer to the original paper [35] for details.
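To make the block design concrete, a plain upsample block in the spirit of FastGAN's UpBlock can be sketched as follows. This is a simplified version: channel counts and the exact layer ordering are our assumptions, not the released specification.

```python
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    """Nearest upsampling, a spectral-normalized 3x3 convolution, batch
    normalization, and a gated linear unit (GLU) that halves the channels."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.utils.spectral_norm(nn.Conv2d(in_ch, out_ch * 2, 3, padding=1)),
        nn.BatchNorm2d(out_ch * 2),
        nn.GLU(dim=1))  # splits channels in two halves and gates one by the other
```

Note that the convolution produces twice the output channels because the GLU consumes one half as the gate for the other.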

SinGANs. Fig. 4 (top) shows a stage in the generator of SinGANs [46]. A SinGAN is progressively trained from a chosen low resolution (based on the predefined image pyramid settings) to the resolution of the single input image. Following the notation of SinGANs [46], the start (coarsest) resolution is indexed by N and the end (finest) resolution by 0. At the very beginning, a 2-D latent code z_N is sampled and the initial generator G_N is trained under the vanilla GAN settings. Then, at stage n, the generator has been progressively grown from G_N, and we have x_n = G_n(z_n, (x_{n+1})↑), where (x_{n+1})↑ represents the up-sampled synthesized image from the previous stage with respect to the predefined ratio used in the construction of the image pyramid. More specifically, each stage of the vanilla SinGAN learns a residual image on top of the output from the previous stage, G_n(z_n, (x_{n+1})↑) = (x_{n+1})↑ + ψ_n(z_n + (x_{n+1})↑), where ψ_n denotes the convolutional network at stage n. Batch normalization [18] and leaky ReLU [59] are used in a convolution block, while spectral normalization is not used. With the proposed SLIM module, we substantially change the workflow of the generator as shown in the bottom of Fig. 4. As we shall elaborate, the proposed SLIM module is spatially-adaptive in modulating a target feature map using a source feature map, so we remove the residual learning setting. Similarly, we keep the discriminator of SinGANs unchanged in our experiments. Due to space limits, we omit its discussion; more details can be found in the original paper [46].

Figure 4: Illustration of deploying the proposed SLIM module in SinGANs [46]. See text for detail.

The SLIM Module. Fig. 1 (Right) illustrates the proposed SLIM module. We first compare the formulations of the SE module [17], the SLE module [35] and the proposed SLIM module. Focusing on how an input target feature map X is transformed to the output feature map Y of the same size C×H×W, denote by Y = (y_1, ..., y_C), where y_c represents a single channel slice of the tensor Y for c = 1, ..., C. We have,

SE: y_c = α_c(X) · x_c,   (2)
SLE: y_c = α_c(S) · x_c,   (3)

where α_c(X) is the channel importance coefficient computed from X itself in the SE module, and α_c(S) is computed from a skip-layer source feature map S in the SLE module. So, the SE module realizes self-attention between channels (a.k.a. “neurons”), while the SLE module realizes cross-attention, and the former is a special case of the latter when S = X. Both α_c(X) and α_c(S) are scalars shared by all spatial locations in the same channel slice. For discriminative learning tasks such as image classification, this channel-wise feature re-calibration works very well since spatial locations will be discarded by the classification head sub-network (typically via a global average pooling followed by a fully-connected layer). For image synthesis tasks whose outputs are location-sensitive, it may not be sufficient to deliver the entailed modulation effects.
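For concreteness, the contrast between Eqn. 2 and Eqn. 3 can be written as two small PyTorch modules. The SE follows [17] in spirit and the SLE mirrors the skip-layer design of FastGANs; the exact layer choices below (pooling size, activation, reduction ratio) are our simplification, not the released code.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Eqn. 2: channel weights alpha(X) computed from the target map X itself."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.f = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.f(x)  # self-attention between channels

class SLE(nn.Module):
    """Eqn. 3: channel weights alpha(S) computed from a skip-layer source map S."""
    def __init__(self, src_ch, tgt_ch):
        super().__init__()
        self.f = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),
            nn.Conv2d(src_ch, tgt_ch, 4), nn.LeakyReLU(0.1),
            nn.Conv2d(tgt_ch, tgt_ch, 1), nn.Sigmoid())
    def forward(self, s, x):
        return x * self.f(s)  # cross-attention: SE is the special case s == x
```

In both cases the learned weight is a single scalar per channel, broadcast over all spatial locations, which is exactly the limitation the SLIM module addresses.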

The proposed SLIM module aims to facilitate spatially-aware re-calibration by extending the SLE module. It learns a 3-D weight matrix W(S) from the source feature map S to modulate the target feature map (that is, to “pay full attention”), and we have,

SLIM: y_c = w_c(S) ⊙ x_c,   (4)

where w_c(S) is the c-th channel slice of W(S) and ⊙ represents the Hadamard product.

Learning the spatially-adaptive weight matrix W(S) from S. We want to distill two types of information: one represents 1-D latent style codes that are informed by the source feature map S and used to modulate the target feature map; the other reflects 2-D latent spatial masks that are used to distribute the latent style codes. Decoupling these two is beneficial, enabling them to be learned faster and more accurately. In the vanilla SLE module (the middle of Fig. 1), a coarse spatial attention map is also computed, followed by a channel-wise operation. This sequential transformation may entangle the two in a less effective way, and the spatial pooling may discard some useful spatial information.

Decoupling the Style and Layout. We decouple the channels (“neurons”) in an input source feature map by splitting them into a number of groups (e.g., 4), that is to exploit mixture modeling or clustering of the “neurons” in a building block, as suggested by the Inception module [51] which in turn is inspired by the theoretical study of how to construct an optimal neural architecture in a layer-wise manner with a set of constraints satisfied [1]. For each group, we apply the decoupled channel-wise and spatial transformation for learning the latent style codes and the latent spatial masks concurrently.

In learning the latent style codes, we use the transformation (the right-top of Fig. 1) similar in spirit to the vanilla SE except for adding BatchNorm [18] after the first fully-connected layer and removing the last sigmoid operation. We remove the sigmoid operation since the outputs will be interpreted as latent style codes, rather than the channel-wise importance weights.

In learning the latent spatial layout/masks, we exploit the Átrous convolution [6] to capture long-range contextual information. The dilation rate is chosen based on the size of an input source feature map such that the receptive field of the Átrous convolution is sufficiently large. To overcome the potential collapse issue of the learned latent spatial masks, especially in synthesizing high-resolution images, we apply a noise injection operation before the sigmoid transformation (the bottom-right of Fig. 1).
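As a quick sanity check on the dilation choice: a 3×3 Átrous convolution with dilation rate d has an effective receptive field of 2d+1 positions per axis at no extra parameter cost, and padding equal to the rate preserves the spatial size. The snippet below is purely illustrative; the function name is ours.

```python
import torch
import torch.nn as nn

def atrous_mask_branch(in_ch, dilation):
    """3x3 dilated convolution producing a single-channel mask; padding equal
    to the dilation rate keeps the spatial size, while the effective receptive
    field grows to (2 * dilation + 1) per axis."""
    return nn.Conv2d(in_ch, 1, kernel_size=3, padding=dilation, dilation=dilation)

for d in (1, 2, 4):
    m = atrous_mask_branch(4, d)
    y = m(torch.randn(1, 4, 16, 16))
    assert y.shape == (1, 1, 16, 16)  # spatial size preserved for any rate
```

This is why larger source feature maps warrant larger dilation rates, as reflected in the experimental settings of Sec. 4.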

To sum up, from the channel-wise attention branches, we compute a group of latent style codes and concatenate them to form the latent style code tensor. From the spatial attention branches, we compute a group of latent spatial masks and concatenate them to form the latent spatial mask tensor. Then, the 3-D weight matrix W(S) in Eqn. 4 is computed as the element-wise product of two C×H×W tensors: one obtained by repeating the latent style codes along the last two (spatial) dimensions, and the other obtained by repeating the latent spatial masks along the second (channel) dimension. A Softmax is used to normalize the spatial masks between the latent objects (i.e., across the groups).
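The repeat-and-multiply aggregation just described can be checked numerically with a toy example (all shapes below are hypothetical, chosen only for illustration):

```python
import torch

# Toy sizes: g groups, oc channels per group, an H x W target feature map.
g, oc, H, W = 2, 3, 4, 4
style = torch.randn(1, g * oc)        # concatenated latent style codes
mask = torch.rand(1, g, H, W)         # concatenated latent spatial masks
mask = torch.softmax(mask, dim=1)     # normalize the masks between the groups

# Repeat the style codes along the spatial dims and the masks along channels.
style_t = style.view(1, g, oc, 1, 1).expand(1, g, oc, H, W)
mask_t = mask.unsqueeze(2).expand(1, g, oc, H, W)
w = (style_t * mask_t).reshape(1, g * oc, H, W)   # the 3-D weights W(S)

target = torch.randn(1, g * oc, H, W)
out = target * w                      # Eqn. 4: Hadamard modulation
```

After the Softmax, the group masks sum to one at every spatial location, so each location distributes its modulation budget among the latent objects.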

Note that the SE module, the SLE module and the proposed SLIM module are all instance-specific, i.e., computed separately for each input in a batch.

3.2 Comparing the SLIM with Alternative Designs in Image Synthesis

Comparing with the weight modulation in StyleGANv2 [26]. The weight modulation method in StyleGANv2 is an elegantly designed operation to achieve detailed style tuning effects. The style code is used to directly modulate the filter kernels (as model parameters) in an instance-specific manner, and then the modulated filter kernels are used in computing the convolution. Although highly expressive, this weight modulation is not spatially-adaptive. Also, it increases the computational burden and the memory footprint in execution. The proposed SLIM directly modulates the feature map in a light-weight manner.

Comparing with the SPADE in GauGANs [43] and the ISLA-Norm in LostGANs [48]. Both the SPADE and the ISLA-Norm exploit spatially-adaptive modulation, but apply it inside the BatchNorm. They replace the vanilla channel-wise affine transformation in the BatchNorm with spatially-adaptive affine transformation. The spatially-adaptive affine transformation coefficients are learned either from the input semantic masks in GauGANs or the generated latent masks from the input layouts in LostGANs. The proposed SLIM is similar in spirit to the ISLA-Norm, but is formulated under the Inception architecture together with the skip-layer idea proposed in FastGANs [35].

Figure 5: Illustration of deploying the SLIM in ResNets (right), compared with the SE (left) and the vanilla ResNeXt (middle).

3.3 The SLIM for Image Recognition

Fig. 2 illustrates the SLIM for substituting a vanilla 3×3 convolution in a network such as the ResNet [14]. Let X denote an input feature map. Fig. 5 shows the deployment of the SLIM in the residual building block. At a very high level, the outputs of the channel attention, the spatial attention, and the convolution in a group play roles analogous to the query, key and value in the Transformer model [53], and the multi-group design corresponds to the multi-head design.

4 Experiments

In this section, we test the proposed SLIM in few-shot image synthesis using FastGANs [35], one-shot image synthesis using SinGANs [46], and ImageNet-1000 classification using ResNets [14]. Our PyTorch source code will be released soon; it is built on the officially released PyTorch code of FastGANs (https://github.com/odegeasslbc/FastGAN-pytorch), SinGANs (https://github.com/tamarott/SinGAN), and the MMClassification GitHub repo (https://github.com/open-mmlab/mmclassification), respectively.

Figure 6: Top: synthesized face images in the FFHQ dataset [25]. The model is trained using 2k FFHQ training images for around 15 hours on a single GPU. Bottom: visualization of the learned spatial masks across the stages (Fig. 3).

4.1 Results of SLIM-FastGANs

Data. We adopt the datasets used in the vanilla FastGANs [35]. There are 5 categories tested at the resolution of 256×256, each of which uses around 100 training images (see part one of Table 4). There are 7 categories tested at the resolution of 1024×1024, each of which uses around 1000 training images (see part two of Table 4).

Model / Training Time (hours) Art FFHQ Flower Pokemon AnimeFace Skull Shell
StyleGANv2 / 24 74.56 - - 190.23 152.73 127.98 241.37
StyleGANv2 finetune / 24 - - - 60.12 61.23 107.68 220.45
SLE-FastGAN / 8 45.08 44.3 31.7 57.19 59.38 130.05 155.47
SLIM-FastGAN (ours) / 9 43.46 39.59 29.90 51.2 54.22 101.16 140.45
Table 1: FID comparisons (smaller is better) on the 7 categories, with models trained with 1000 images and tested at the resolution of 1024×1024. The SLE-FastGAN results are obtained by retraining the vanilla FastGANs. The 1000 images are randomly sampled from the original, much larger datasets: FFHQ and Oxford-Flower.
Resolution SLIM-FastGAN SLE-FastGAN
256×256 28.5M 29.1M
1024×1024 28.5M 29.2M

Table 2: Number of generator parameters of SLIM-FastGAN and SLE-FastGAN, in the models for the 256×256 and 1024×1024 resolutions, respectively.
Model Nature-2k Nature-5k Nature-10k FFHQ-2k FFHQ-5k FFHQ-10k
SLE [35] 103.71 104.73 99.64 27.68 20.6 19.21
SLIM (ours) 101.53 96.91 93.94 24.59 19.45 18.84
Table 3: FID comparisons (smaller is better) on the 2 categories with models trained with more data (2k, 5k, and 10k). Both the vanilla SLE-FastGAN and the proposed SLIM-FastGAN are trained from scratch with the same data sampled from the original FFHQ and Nature Photograph datasets. Training time is budgeted at around 20 hours for all the experiments.

Settings. We follow the settings in the official code of FastGANs. We keep the first SLE module from the initial block, considering that the small spatial size may not be helpful for learning spatial masks, and replace the remaining SLE modules with the proposed SLIM. The dimension of the input latent code follows the official setting (Fig. 3). One thing to highlight is the output size of the discriminator: two different settings are used in reported results, which show different performance on different categories. We use the default output size in the official code, so some of the results of the proposed SLIM module could be further improved. At the resolution of 1024×1024, we apply the noise injection layer in the SLIM (the right-bottom of Fig. 1); at the resolution of 256×256, no noise injection is applied. The dilation rates for the Átrous convolution in the spatial branches are set to 2, 2 and 4 for the SLIM modules from the earlier to the later stages, respectively. The learning rate follows the default value (0.0002) except for FFHQ, which uses 0.0001. The Adam optimizer is used with the default momentum parameters. A single GPU is used in training.

We compare the proposed SLIM with several baselines: 1) FastGAN [35]; 2) the adaptive data augmentation method [23], in which we follow the original paper to set the target value to 0.6, and set the increasing rate of the augmentation probability such that it can increase from 0 to 1 within 10k iterations (1/5 of the total training time); 3) the CBAM module [55], which also leverages spatial and channel attention for better representation learning (https://github.com/Jongchan/attention-module); and 4) the SPAP module [49], which leverages multi-scale spatial attention for GAN learning (https://github.com/WillSuen/DilatesGAN). For 3) and 4), we replace the original SLE module in FastGAN with the corresponding CBAM or SPAP modules.

Metrics. We evaluate our methods using FID [15] and KID [3] for image quality. To assess the overfitting of the methods, we further evaluate them using the latent recovery [54], density, and coverage [39] metrics.

Efficiency. The proposed SLIM module slightly increases the training time for high-resolution image synthesis (e.g., from 8 to 9 hours in Table 1). The training-time increase is negligible for image synthesis models at lower resolutions such as 256×256. We further compare the number of parameters of the proposed SLIM with the original FastGAN in Table 2. The proposed SLIM has fewer parameters than the SLE-FastGAN because, in SLIM, we downscale the channel number by a factor of 16 through a linear layer before extracting the style and spatial information from the input. This shows the effectiveness of our design.

Metric Method 256×256, 100 images per category
Obama Dog Cat Grumpy Cat Panda
FID SLE-Diffaug 41.05 50.66 35.11 26.65 10.03
SLIM-Diffaug 36.4 49.99 33.55 26.01 9.48
SPAP-Diffaug 51.98 58.46 54.31 30.15 14.41
CBAM-Diffaug 40.05 52.35 36.14 26.89 10.14
SLE-ADA 38.9 52.04 34.5 26.83 9.87
SLIM-ADA 34.5 49.83 31.2 26.03 9.50
KID SLE-Diffaug 0.012 0.014 0.006 0.007 0.004
SLIM-Diffaug 0.005 0.012 0.004 0.004 0.002
SPAP-Diffaug 0.045 0.026 0.014 0.013 0.009
CBAM-Diffaug 0.012 0.016 0.007 0.007 0.004
SLE-ADA 0.01 0.015 0.004 0.007 0.003
SLIM-ADA 0.004 0.012 0.002 0.004 0.002
MRE/p-value SLE-Diffaug 0.11/0.87 0.16/0.77 0.06/0.32 0.24/0.19 0.29/0.81
SLIM-Diffaug 0.15/0.65 0.22/0.45 0.16/0.32 0.11/0.94 0.17/0.76
SPAP-Diffaug 0.28/0.45 0.32/0.39 0.44/0.13 0.37/0.11 0.41/0.55
CBAM-Diffaug 0.12/0.86 0.23/0.65 0.11/0.35 0.18/0.53 0.24/0.79
SLE-ADA 0.12/0.83 0.17/0.74 0.09/0.35 0.27/0.19 0.28/0.84
SLIM-ADA 0.15/0.71 0.21/0.59 0.18/0.42 0.13/0.95 0.15/0.81
Density/Coverage SLE-Diffaug 1.31/1.0 0.79/0.96 0.95/1.0 1.25/1.0 1.78/1.0
SLIM-Diffaug 1.38/1.0 0.84/0.98 1.07/1.0 1.38/1.0 1.89/1.0
SPAP-Diffaug 0.91/0.86 0.53/0.90 0.89/0.92 1.01/0.94 1.32/0.95
CBAM-Diffaug 1.35/1.0 0.79/0.95 0.94/1.0 1.25/1.0 1.75/1.0
SLE-ADA 1.35/1.0 0.78/0.94 0.96/1.0 1.28/1.0 1.82/1.0
SLIM-ADA 1.39/1.0 0.89/1.0 1.12/1.0 1.41/1.0 1.87/1.0

Metric Method 1024×1024, 1000 images 1024×1024, 100 images
FFHQ Art Flower Pokemon AnimeFace Skulls Shells
FID SLE-Diffaug 44.3 45.08 31.7 57.19 59.38 130.05 155.47
SLIM-Diffaug 39.59 43.46 29.90 51.2 54.22 101.16 140.45
SPAP-Diffaug 78.37 61.89 60.15 114.98 93.53 118.09 160.12
CBAM-Diffaug 58.23 58.12 44.13 76.76 84.45 125.61 156.76
SLE-ADA 44.43 45.1 31.89 55.67 59.11 120.62 153.47
SLIM-ADA 39.12 43.53 29.63 48.56 53.31 96.56 140.75
KID SLE-Diffaug 0.012 0.011 0.006 0.014 0.018 0.054 0.068
SLIM-Diffaug 0.011 0.009 0.006 0.011 0.014 0.030 0.044
SPAP-Diffaug 0.21 0.54 0.019 0.11 0.15 0.045 0.11
CBAM-Diffaug 0.15 0.45 0.012 0.058 0.13 0.051 0.071
SLE-ADA 0.012 0.011 0.006 0.013 0.018 0.049 0.071
SLIM-ADA 0.011 0.009 0.006 0.009 0.013 0.025 0.044
MRE SLE-Diffaug 0.0097 0.0301 0.31 0.0013 2.13 0.32 1.83
SLIM-Diffaug 0.0082 0.0027 0.03 0.0012 0.88 0.28 0.22
SPAP-Diffaug 0.0121 0.0821 0.65 0.12 3.01 0.29 1.91
CBAM-Diffaug 0.0083 0.0592 0.45 0.53 2.07 0.31 0.84
SLE-ADA 0.0095 0.02903 0.28 0.0014 2.03 0.35 1.75
SLIM-ADA 0.0081 0.0028 0.03 0.0013 0.87 0.27 0.24
p-value SLE-Diffaug 0.39 0.025 0.06 0.78 0.75 0.17 0.89
SLIM-Diffaug 0.73 0.53 0.43 0.81 0.86 0.39 0.98
SPAP-Diffaug 0.12 0.11 0.03 0.42 0.31 0.21 0.49
CBAM-Diffaug 0.24 0.23 0.02 0.37 0.42 0.58 0.46
SLE-ADA 0.37 0.027 0.08 0.76 0.73 0.18 0.90
SLIM-ADA 0.74 0.55 0.42 0.83 0.87 0.38 0.98
Density SLE-diffaug 1.18 1.38 0.85 1.14 1.17 0.90 0.29
SLIM-Diffaug 1.20 1.41 0.92 1.21 1.21 1.18 0.52
SPAP-Diffaug 0.81 0.74 0.66 0.54 0.61 0.92 0.27
CBAM-Diffaug 0.88 0.81 0.73 0.67 0.79 0.90 0.28
SLE-ADA 1.17 1.39 0.85 1.13 1.18 0.91 0.28
SLIM-ADA 1.21 1.40 0.92 1.25 1.24 1.21 0.53
Coverage SLE-Diffaug 0.95 0.95 0.93 0.95 0.98 0.89 0.85
SLIM-Diffaug 0.96 0.96 0.95 0.96 1.0 1.0 0.91
SPAP-Diffaug 0.88 0.84 0.79 0.71 0.68 0.92 0.81
CBAM-Diffaug 0.91 0.88 0.83 0.78 0.73 0.90 0.83
SLE-ADA 0.96 0.94 0.95 0.94 0.98 0.92 0.86
SLIM-ADA 0.96 0.97 0.96 0.95 1.0 1.0 0.92

Table 4: Comparison between SLIM and SLE, SPAP, CBAM, and the adaptive data augmentation technique (ADA) using the related metrics. In evaluating the MRE, we use a train/val split with ratio 9:1. For the Density and Coverage, we use the default k-nearest-neighbours setting. For KID, we omit the variance for clarity of the table; most of the variances are negligible.

Results. Table 4 compares our proposed SLIM module with related methods. A '-Diffaug' postfix means the method adopts the vanilla differentiable data augmentation technique; an '-ADA' postfix means it adopts the adaptive data augmentation technique. Under the differentiable data augmentation scheme, the proposed SLIM consistently outperforms the SLE, CBAM and SPAP modules in terms of FID and KID, especially for high-resolution images. When the ADA technique is used, SLIM is still consistently better than the SLE module, which shows that our SLIM is complementary to ADA in improving performance in few-shot image synthesis. For the overfitting evaluation, in terms of MRE/p-value, our SLIM is worse than SLE on the low-resolution image datasets. Note that a p-value above the commonly used threshold indicates no major concerns of memorization, and both SLE and SLIM seem to work reasonably well in this respect. The results also suggest that our SLIM is more conservative towards the training distribution when less data (e.g., 100 images) and less information (low resolution) are used in training and higher fidelity of generated images is expected. This conservative synthesis may be potentially less biased in generative learning. Table 3 shows the FID comparison between SLIM and SLE when trained with more data.
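For reference, the FID numbers above follow the standard Fréchet distance between Gaussian fits of real and generated feature activations. A minimal numpy/scipy sketch is given below; the Inception feature extractor and the exact implementation used in the paper are omitted.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_fake):
    """Frechet Inception Distance between two N x D feature sets:
    ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2})."""
    mu_r, mu_f = feat_real.mean(0), feat_fake.mean(0)
    s_r = np.cov(feat_real, rowvar=False)
    s_f = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(s_r @ s_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(s_r + s_f - 2.0 * covmean))
```

Identical feature sets give a distance of (numerically) zero, while a mean shift between real and generated features increases the first term quadratically.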

Figure 7: Synthesized images by the vanilla SinGAN, ConSinGAN, SLE-SinGAN and our SLIM-SinGAN. SLIM-SinGAN is better at preserving structure while producing meaningful semantic variations (note the change in the number of Sphinx statues in the SLIM-SinGAN results).
Metric SinGAN ConSinGAN SLE-SinGAN SLIM-SinGAN (ours)
SIFID 0.518 1.02 0.652 0.467
Diversity Score 0.52 0.45 0.61 0.32
Table 5: Single Image FID (SIFID) (smaller is better) and diversity score (larger is better) comparisons using the 14 selected images. Note that SIFID may not reflect the actual quality of synthesized images, as pointed out in ConSinGANs [16]. We observe that ConSinGANs may fail to learn some images, e.g., the volcano, which causes the high SIFID.

4.2 Results of SLIM-SinGANs

Data. We tested the SLIM-SinGAN on 14 images used by the vanilla SinGANs: Brandenberg, bridge, Golden gate, tower, angkorwat, balloons, birds, colusseum, mountains, starry night, tree, cows, volcano, zebra. The images are selected to cover different image structures which are difficult for one-shot GANs.

Settings. We follow the settings provided by the vanilla SinGANs [46]. Two baselines are used: (i) ConSinGANs [16], for which we follow the suggestions in the paper to try different combinations of the learning rate and the number of jointly trained stages and select the best results; (ii) SLE-SinGANs, in which the SLE module is used instead of SLIM at the bottom of Fig. 4.

Metrics. We evaluate our methods with single image FID (SIFID) and the Diversity Score proposed in the vanilla SinGANs [46].

Efficiency. The proposed SLIM-SinGANs are as efficient as the vanilla SinGANs in terms of training speed.

Results. Table 5 shows the comparison results, and Fig. 7 shows synthesized images for the Egyptian pyramid image. In terms of diversity score, our SLIM obtains lower diversity, following a trend similar to ConSinGAN. The testing images are structure-rich images for which our goal is to study how to preserve the structure, so the diversity score should be interpreted jointly with the SIFID. We also observe that SLIM-SinGANs may suffer from the training-image memorization problem while trying to preserve the structures in images, especially for images with dominant foreground structural objects, which we leave for future work. More qualitative results are provided in the supplementary material.

Method #Params FLOPs top-1 top-5
ResNet50-SE 28.09M 4.13G 77.74 93.84
ResNeXt-32x4d-50 25.03M 4.27G 77.90 93.66
ResNet50-SLIM (ours) 18.66M 3.36G 78.06 94.14
VGG-11-BN 132.87M 7.64G 70.75 90.12
VGG-11-BN-SLE 132.87M 7.65G 71.95 90.64
VGG-11-BN-SLIM (ours) 132.96M 7.70G 72.49 90.91

Table 6: Comparisons between SLE and SLIM in VGG-11, and between SE and the modified SLIM in ResNet-50 in ImageNet-1000 classification. Top-1 and Top-5 accuracy (%) are used. Results are from the MMClassification model zoo.

4.3 Results of the SLIM in ImageNet-1000

Table 6 (top) shows the results of applying SLIM in ResNet-50. Compared with SE, our SLIM obtains a 0.32% top-1 accuracy increase while significantly reducing the model parameters (by 9M) and FLOPs. Compared with ResNeXt-32x4d-50, our SLIM obtains a 0.16% top-1 accuracy increase, again with fewer parameters. These results clearly show the effectiveness of the proposed SLIM.

Ablation study. To test whether the skip-layer connections also help classification tasks and whether the SLIM module helps more than the SLE module, we apply the vanilla SLE and our SLIM in the VGG-11+BatchNorm network, which does not have skip connections as in ResNets. The skip-layer connections are introduced between the 5 stages in VGG-11, with stages 1/2/3 connected to stages 3/4/5, respectively. The results in Table 6 (bottom) show that skip-layer connections can also help in classification, and that our SLIM works better than the vanilla SLE.
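The skip-layer gating used in this ablation can be sketched as follows. The weight shapes here are hypothetical stand-ins for the two learned projections of the excitation path (the actual SLE/SLIM modules use learned 1x1 convolutions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skip_layer_gate(feat_low, feat_high, w1, w2):
    """Channel-wise recalibration of a deep stage by a shallow stage:
    squeeze the low-stage map (C_low, H, W) to a vector, project it
    through two layers, and gate the high-stage map per channel."""
    squeezed = feat_low.mean(axis=(1, 2))              # global average pool
    gate = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0))  # (C_high,) in (0, 1)
    return feat_high * gate[:, None, None]
```

In the VGG-11 setting above, such gates connect the stage-1/2/3 feature maps to the stage-3/4/5 feature maps, respectively.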

5 Conclusion

This paper proposes a method of learning Inception Attention for both few-shot/one-shot image synthesis and large-scale image recognition. A Skip-Layer Inception Module (SLIM) is presented for the former, and a Same-Layer Inception Module (dubbed SLIM too) for the latter. Both share the same idea: to explore and exploit spatially-adaptive feature modulation/recalibration by jointly learning channel-wise attention (based on the popular Squeeze-Excitation module) as a latent style representation and spatial attention as a latent layout representation. The resulting Inception Attention is a 3D attention matrix that modulates/recalibrates the input feature map. In experiments, the proposed SLIM module is tested in few-shot image synthesis using FastGANs [35], one-shot image synthesis using SinGANs [46], and ImageNet-1000 classification using ResNets. The SLIM-FastGANs are consistently better than the vanilla FastGANs, SPAP, and CBAM, and obtain significantly better performance at high-resolution image synthesis. The SLIM-SinGANs show stronger capabilities in preserving image structures than prior arts such as the ConSinGANs [16]. The SLIM-ResNets show better performance than the SE variant and the vanilla ResNeXt with significantly reduced model complexity.
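The shared split-transform-aggregate idea can be sketched with parameter-free pooling standing in for the learned channel and spatial attention branches (a simplification for illustration, not the authors' exact implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slim_attention(x, groups=4):
    """Split a (C, H, W) feature map into groups; per group, derive a
    channel-wise 'style' vector and a spatial 'layout' mask; aggregate
    them into a 3D attention matrix that modulates the input."""
    chunks = np.split(x, groups, axis=0)
    attn = []
    for g in chunks:
        style = sigmoid(g.mean(axis=(1, 2)))  # latent style (channel attention)
        mask = sigmoid(g.mean(axis=0))        # latent layout (spatial attention)
        attn.append(style[:, None, None] * mask[None, :, :])
    return x * np.concatenate(attn, axis=0)   # 3D attention, same shape as x
```

The skip-layer variant applies the attention computed from one stage's features to a deeper stage's feature map; the same-layer variant modulates the map it was computed from, as above.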


Acknowledgments

This work was supported in part by NSF IIS-1909644, ARO Grant W911NF1810295, NSF IIS-1822477, NSF CMMI-2024688, NSF IUSE-2013451 and DHHS-ACL Grant 90IFDV0017-01-00. The views presented in this paper are those of the authors and should not be interpreted as representing any funding agencies.


  • [1] S. Arora, A. Bhaskara, R. Ge, and T. Ma (2014) Provable bounds for learning some deep representations. In International conference on machine learning, pp. 584–592. Cited by: §3.1.
  • [2] U. Bergmann, N. Jetchev, and R. Vollgraf (2017) Learning texture manifolds with the periodic spatial gan. arXiv preprint arXiv:1705.06566. Cited by: §2.
  • [3] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying mmd gans. arXiv preprint arXiv:1801.01401. Cited by: §4.1.
  • [4] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §1.1, §2.
  • [5] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. (2015) Look and think twice: capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE international conference on computer vision, pp. 2956–2964. Cited by: §2.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2, §3.1.
  • [7] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5659–5667. Cited by: §2.
  • [8] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In International conference on machine learning, pp. 933–941. Cited by: §3.1.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.1, §1.2.
  • [10] C. Finn, P. Christiano, P. Abbeel, and S. Levine (2016) A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852. Cited by: §2.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • [12] D. Güera and E. J. Delp (2018) Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. Cited by: §1.1.
  • [13] T. Han, Y. Lu, S. Zhu, and Y. N. Wu (2017) Alternating back-propagation for generator network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 1976–1984. Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.1, §2, §3.3, §4.
  • [15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30. Cited by: §4.1.
  • [16] T. Hinz, M. Fisher, O. Wang, and S. Wermter (2021) Improved techniques for training single-image gans. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1300–1309. Cited by: §1.2, §2, §4.2, Table 5, §5.
  • [17] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: Figure 1, §2, §3.1.
  • [18] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456. Cited by: §3.1, §3.1, §3.1.
  • [19] L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence 20 (11), pp. 1254–1259. Cited by: §2.
  • [20] L. Itti and C. Koch (2001) Computational modelling of visual attention. Nature reviews neuroscience 2 (3), pp. 194–203. Cited by: §2.
  • [21] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. Advances in neural information processing systems 28, pp. 2017–2025. Cited by: §2.
  • [22] N. Jetchev, U. Bergmann, and R. Vollgraf (2016) Texture synthesis with spatial generative adversarial networks. arXiv preprint arXiv:1611.08207. Cited by: §2.
  • [23] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676. Cited by: §2, §4.1.
  • [24] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §2.
  • [25] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1.1, Figure 6.
  • [26] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: §1.1, §2, §3.2.
  • [27] T. Kim and Y. Bengio (2016) Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439. Cited by: §2.
  • [28] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Cited by: §2.
  • [29] R. Kumar, S. Ozair, A. Goyal, A. Courville, and Y. Bengio (2019) Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508. Cited by: §2.
  • [30] K. Kurach, M. Lucic, and X. Z. M. M. S. Gelly (2018) THE gan landscape: losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720. Cited by: §2.
  • [31] H. Larochelle and G. E. Hinton (2010) Learning to combine foveal glimpses with a third-order boltzmann machine. Advances in neural information processing systems 23, pp. 1243–1251. Cited by: §2.
  • [32] J. Lazarow, L. Jin, and Z. Tu (2017) Introspective neural networks for generative modeling. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2793–2802. External Links: Link, Document Cited by: §2.
  • [33] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: §2.
  • [34] C. Li and M. Wand (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In European conference on computer vision, pp. 702–716. Cited by: §2.
  • [35] B. Liu, Y. Zhu, K. Song, and A. Elgammal (2021) Towards faster and stabilized gan training for high-fidelity few-shot image synthesis. arXiv e-prints, pp. arXiv–2101. Cited by: Figure 1, §1.1, §1.1, §1.2, Figure 3, §2, §2, §3.1, §3.1, §3.1, §3.2, §4.1, §4.1, Table 3, §4, §5, §6.2.4, Table 7.
  • [36] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §3.1.
  • [37] A. Mnih and K. Gregor (2014) Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pp. 1791–1799. Cited by: §2.
  • [38] V. Mnih, N. Heess, A. Graves, et al. (2014) Recurrent models of visual attention. In Advances in neural information processing systems, pp. 2204–2212. Cited by: §2.
  • [39] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo (2020) Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pp. 7176–7185. Cited by: §4.1.
  • [40] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng (2011) Learning deep energy models. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 1105–1112. Cited by: §2.
  • [41] E. Nijkamp, M. Hill, S. Zhu, and Y. N. Wu (2019) Learning non-convergent non-persistent short-run mcmc toward energy-based model. In Advances in Neural Information Processing Systems, pp. 5232–5242. Cited by: §2.
  • [42] B. A. Olshausen, C. H. Anderson, and D. C. Van Essen (1993) A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience 13 (11), pp. 4700–4719. Cited by: §2.
  • [43] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §3.2.
  • [44] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning-Volume 32, pp. II–1278. Cited by: §2.
  • [45] D. B. Rubin and D. T. Thayer (1982) EM algorithms for ml factor analysis. Psychometrika 47 (1), pp. 69–76. Cited by: §2.
  • [46] T. R. Shaham, T. Dekel, and T. Michaeli (2019) Singan: learning a generative model from a single natural image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4570–4580. Cited by: §1.1, §1.2, §2, §2, Figure 4, §3.1, §3.1, §4.2, §4.2, §4, §5, §6.3.2, §6.3.3.
  • [47] E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243. Cited by: §1.1.
  • [48] W. Sun and T. Wu (2019) Image synthesis from reconfigurable layout and style. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10531–10540. Cited by: §3.2.
  • [49] W. Sun and T. Wu (2019) Learning spatial pyramid attentive pooling in image synthesis and image-to-image translation. arXiv preprint arXiv:1901.06322. Cited by: §4.1.
  • [50] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.
  • [51] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §1.2, §2, §3.1.
  • [52] N. Tran, V. Tran, N. Nguyen, T. Nguyen, and N. Cheung (2020) Towards good practices for data augmentation in gan training. arXiv preprint arXiv:2006.05338. Cited by: §2.
  • [53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.3.
  • [54] R. Webster, J. Rabin, L. Simon, and F. Jurie (2019) Detecting overfitting of deep generative networks via latent recovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11273–11282. Cited by: §4.1.
  • [55] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) Cbam: convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pp. 3–19. Cited by: §4.1.
  • [56] J. Xie, R. Gao, Z. Zheng, S. Zhu, and Y. N. Wu (2019) Learning dynamic generator model by alternating back-propagation through time. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5498–5507. Cited by: Figure 2, §1.1, §2.
  • [57] J. Xie, Y. Lu, R. Gao, S. Zhu, and Y. N. Wu (2018) Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence 42 (1), pp. 27–45. Cited by: §2.
  • [58] J. Xie, Y. Lu, S. Zhu, and Y. Wu (2016) A theory of generative convnet. In International Conference on Machine Learning, pp. 2635–2644. Cited by: §2.
  • [59] B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §3.1.
  • [60] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.
  • [61] Y. Yu, Z. Gong, P. Zhong, and J. Shan (2017) Unsupervised representation learning with deep convolutional neural network for remote sensing images. In International Conference on Image and Graphics, pp. 97–108. Cited by: §2.
  • [62] J. Zhao, M. Mathieu, and Y. LeCun (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126. Cited by: §2.
  • [63] S. Zhao, Z. Liu, J. Lin, J. Zhu, and S. Han (2020) Differentiable augmentation for data-efficient gan training. arXiv preprint arXiv:2006.10738. Cited by: §1.1, §2.
  • [64] Z. Zhao, Z. Zhang, T. Chen, S. Singh, and H. Zhang (2020) Image augmentations for gan training. arXiv preprint arXiv:2006.02595. Cited by: §2.

6 Supplementary Materials

6.1 Training details of SLIM-FastGANs

We use the Adam optimizer for training, with β1=0.5 and β2=0.999. We use a learning rate of 0.0002 for all datasets except FFHQ and the panda datasets, for which we use 0.0001. For the architecture of the discriminator, we adopt the output size of ; for the SLIM-FastGAN model on the datasets, we apply Gaussian noise injection to the spatial masks of SLIM, with zero mean and unit variance. For the convolution of the spatial branch of SLIM, we set the dilation rates to 2, 2, 4 at stages , , , respectively.
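The Gaussian noise injection into the spatial masks can be sketched as a training-only perturbation (the function name is illustrative):

```python
import numpy as np

def inject_mask_noise(mask, rng, training=True):
    """Add zero-mean, unit-variance Gaussian noise to a SLIM spatial
    attention mask during training; leave it untouched at inference."""
    if not training:
        return mask
    return mask + rng.normal(loc=0.0, scale=1.0, size=mask.shape)
```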

6.2 More results of SLIM-FastGANs

6.2.1 Clarification on results on FFHQ-1k

We notice a gap in performance on FFHQ-1K between our retrained version, based on the official FastGAN code, and the result reported in the paper. According to the authors' feedback via email, the best configuration for reproducing the FFHQ-1k result of FID=24.45 will NOT be released since it is deployed on a commercial platform. Consequently, the FID of the SLE-FastGAN we trained with the FastGAN code is worse than the one reported in the paper.

Method FID
SLE-FastGAN [35] (reported in the paper) 24.45
Retrained from the official FastGAN code 44.31
Retrained from the official FastGAN code (dataset-specific version) 42.82
Our SLIM-FastGAN built on the official FastGAN code 39.59

Table 7: FFHQ-1k performance comparison between the reported result in FastGAN paper, our retrained version based on their code, and the proposed SLIM-FastGAN

6.2.2 Synthesis results of SLIM-FastGAN on datasets

Fig. 16 shows the example synthesized images of our proposed SLIM-FastGAN.

6.2.3 Backtracking results

Settings. (Thanks to the clarification by the authors via email.) 1) We first split the dataset into a train/test ratio of 9:1. 2) We train the model on the resulting training set. 3) We pick the trained generator checkpoints at iterations 20k, 40k, and 80k, respectively, and perform latent backtracking for 1k iterations on the test set. 4) We compute the mean LPIPS between the test images and the reconstructed images from backtracking of the corresponding checkpoints, where LPIPS is the average perceptual distance between two sets of images. In this test, a lower LPIPS is better because it means that the model trained on the training set can backtrack an unseen test set with small reconstruction error, indicating less overfitting.
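The latent backtracking in step 3) can be sketched as follows. For a dependency-free illustration, LPIPS is replaced by a plain L2 reconstruction loss and the gradient is estimated by finite differences over a generic generator callable; the actual procedure optimizes a perceptual loss via backpropagation.

```python
import numpy as np

def backtrack_latent(generate, target, z_dim, steps=300, lr=0.05, seed=0):
    """Recover a latent z such that generate(z) reconstructs the target
    image, by gradient descent on the L2 reconstruction error."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=z_dim)
    eps = 1e-4
    for _ in range(steps):
        base = ((generate(z) - target) ** 2).sum()
        grad = np.zeros_like(z)
        for i in range(z.size):  # forward-difference gradient estimate
            zp = z.copy()
            zp[i] += eps
            grad[i] = (((generate(zp) - target) ** 2).sum() - base) / eps
        z -= lr * grad
    return z, ((generate(z) - target) ** 2).sum()
```

With a well-conditioned toy generator the reconstruction error drops to near zero; on real data, the final perceptual distance on the held-out split is what measures overfitting.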

Results. Fig. 17 shows example backtracking results on several of the few-shot datasets. The smooth transition of the interpolated images between the backtracked test images shows that our model has not overfit to the training set.

6.2.4 Style mixing results

Settings. To demonstrate that the proposed SLIM is able to disentangle the high-level semantic attributes of features at different scales, we conduct the style mixing experiment as done in the FastGAN paper [35]: for a pair of style and content images, we extract channel weights from the style image and use them to modulate the features of the content image, while retaining the spatial masks of the content image. The resulting effect, shown in Fig. 18, is that the appearance and color scheme of the style image are propagated to the content image while the spatial structure of the content image is unchanged.
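A simplified sketch of this mixing rule, with parameter-free pooling standing in for the learned attention branches:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def style_mix(content_feat, style_feat):
    """Channel weights come from the style image's features, the spatial
    mask from the content image's features: appearance is swapped while
    the content layout is retained."""
    style_vec = sigmoid(style_feat.mean(axis=(1, 2)))  # channel attention from style
    mask = sigmoid(content_feat.mean(axis=0))          # spatial attention from content
    return content_feat * style_vec[:, None, None] * mask[None, :, :]
```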

6.3 More results of SLIM-SinGANs

In order to demonstrate the strength of the proposed SLIM module in one-shot generative learning, we present a comparison of the synthesis results of SLIM-SinGAN with other related methods on 9 images. We also show results of image harmonization and editing under the one-shot setting.

6.3.1 Synthesis with self-collected images

Fig. 8 to Fig. 15 show the synthesis comparison on 9 images. We can see that, compared to ConSinGAN, SinGAN and SLE-SinGAN, SLIM-SinGAN captures the global layout of the image better while producing meaningful local semantic variations.

6.3.2 Harmonization

Fig. 19 shows the comparison on one-shot image harmonization task as done in  [46]. It shows that our proposed SLIM-SinGAN can realistically blend an object into the background image.

6.3.3 Editing

Fig. 20 shows the comparison on the one-shot image editing task as done in [46]. It shows that our proposed SLIM-SinGAN is able to produce a seamless composite in which image regions have been copied and pasted into other locations; note that SLIM-SinGAN produces a more realistic composite within the edited regions.

Figure 8: One-Shot synthesis comparison on the bridge image. Note how the synthesized images of SLIM-SinGAN capture the global layout of the real image while producing semantically meaningful variations (size and number of towers at the top).
Figure 9: One-Shot synthesis comparison on the Great Wall image.
Figure 10: One-Shot synthesis comparison on the capitol hill image.
Figure 11: One-Shot synthesis comparison on the ancient Chinese tower image.
Figure 12: One-Shot synthesis comparison on the Lincoln memorial image.
Figure 13: One-Shot synthesis comparison on the Temple of Heaven image.
Figure 14: One-Shot synthesis comparison on the Temple of ancient Chinese tower image.
Figure 15: One-Shot synthesis comparison on the Temple of Heaven image.
Figure 16: Examples of synthesized images at the resolution of . Best viewed in magnification.
Figure 17: Example backtracking results.
Figure 18: Examples style mixing results. Best viewed in magnification.
Figure 19: One-Shot harmonization comparison on example images.
Figure 20: One-Shot Editing comparison on example images.