PolyGAN: High-Order Polynomial Generators

08/19/2019 · Grigorios Chrysos et al. · Imperial College London

Generative Adversarial Networks (GANs) have become the gold standard when it comes to learning generative models that can describe intricate, high-dimensional distributions. Since their advent, numerous variations of GANs have been introduced in the literature, primarily focusing on novel loss functions, optimization/regularization strategies, and architectures. In this work, we take an orthogonal approach to the above and turn our attention to the generator. We propose to model the data generator by means of a high-order polynomial using tensorial factors. We design a hierarchical decomposition of the polynomial and demonstrate how it can be efficiently implemented by a neural network. We show, for the first time, that by using our decomposition a GAN generator can approximate the data distribution using only linear/convolution blocks, without any activation functions. Finally, we highlight that PolyGAN can be easily adapted and used alongside all major GAN architectures. In an extensive series of quantitative and qualitative experiments, PolyGAN improves upon the state-of-the-art by a significant margin.


1 Introduction

Generative Adversarial Networks (GANs) are currently one of the most popular lines of research in machine learning. Research on GANs mainly revolves around (a) how to achieve faster and/or better convergence (e.g., by studying different loss functions (Nowozin et al., 2016; Arjovsky and Bottou, 2017; Mao et al., 2017) or regularization schemes (Odena et al., 2018; Miyato et al., 2018; Gulrajani et al., 2017)) and (b) how to design generators that can effectively model complicated high-dimensional distributions (e.g., by progressively training large networks (Karras et al., 2018) or by utilizing deep ResNet-type networks as generators (Brock et al., 2019)). Nevertheless, as stated in a recent in-depth comparison of many different GAN training schemes (Lucic et al., 2018), the improvements may arise from a higher computational budget and more tuning rather than from fundamental algorithmic changes.

Motivated by the aforementioned empirical finding, we take an orthogonal approach and investigate a direction that has not been explored in the literature before. We propose to model a vector-valued generator function by a high-order multivariate polynomial of the input noise, and to efficiently learn its parameter tensors by means of tensor decomposition. Concretely, we apply a hierarchical shared tensor decomposition to the parameter tensors, which is specifically tailored to capture interactions of the latent variables across different levels of approximation. Moreover, this specific decomposition allows us to implement the polynomial approximation as a hierarchical structure (e.g., a neural network decoder) which can be used as a generator in a GAN setting. The proposed polynomial-based generator GAN (PolyGAN) provides an intuitive way of generating samples with an increasing level of detail. This is pictorially shown in Fig. 1. The result of the proposed GAN using a fourth-order polynomial approximator is shown in Fig. 1 (a), while Fig. 1 (b) shows the corresponding generation when removing the fourth-order power from the generator.

The multivariate approximation is preferable to a classic compositional neural network with non-linear activations mainly due to the following two reasons:

  • The non-linear activation functions pose a difficulty in theoretical analysis, e.g., of convergence. Several methods (e.g., Saxe et al., 2014; Hardt and Ma, 2017; Laurent and Brecht, 2018; Lampinen and Ganguli, 2019) focus only on linear models (with respect to the weights) in order to be able to rigorously analyze the neural network dynamics, the residual design principle, local extrema, and generalization error, respectively. As illustrated in (Arora et al., 2019), element-wise non-linearities pose a challenge to proving convergence, and they pose a "major difficulty" in an adversarial learning setting (Ji and Liang, 2018).

  • On the other hand, our polynomial-based analysis relies on strong theoretical grounds: the Stone–Weierstrass theorem (Stone, 1948) guarantees that every continuous function defined on a closed interval can be uniformly approximated, as closely as desired, by a polynomial. Moreover, the current practice in deep learning is to approximate smooth, continuous functions (as required by back-propagation).

We demonstrate that the proposed generator is agnostic to the GAN training scheme through extensive experimentation with three different widely used GAN architectures, i.e., DCGAN (Radford et al., 2015), SNGAN (Miyato et al., 2018), and SAGAN (Zhang et al., 2019). Furthermore, we introduce an experimental result that has not emerged before: we remove the activation functions of the generator (keeping only the typical tanh in the last layer) and show that we can still generate compelling images. We will release the code at http://anonymous.

Figure 1: Generated samples by an instance of the proposed GAN. (a) Generated samples using a fourth-order polynomial and (b) the corresponding generated samples when removing the terms that correspond to the fourth order. As evidenced, the proposed GAN generates samples with an increasing level of detail.

2 Method

In this section, a novel approach for approximating generators in GANs is introduced. The notation is summarized in Section 2.1, with the derivation of the polynomial approximation and its factorization following in Section 2.2.

2.1 Preliminaries and notation

Matrices (vectors) are denoted by uppercase (lowercase) boldface letters, e.g., $\mathbf{X}$ ($\mathbf{x}$). Tensors are denoted by calligraphic letters, e.g., $\mathcal{X}$. The order of a tensor is the number of indices needed to address its elements. Consequently, each element of an $M$th-order tensor $\mathcal{X}$ is addressed by $M$ indices, i.e., $(\mathcal{X})_{i_1, i_2, \ldots, i_M} \doteq x_{i_1, i_2, \ldots, i_M}$.

The mode-$m$ unfolding of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ maps $\mathcal{X}$ to a matrix $\mathbf{X}_{(m)} \in \mathbb{R}^{I_m \times \bar{I}_m}$ with $\bar{I}_m = \prod_{k=1, k \neq m}^{M} I_k$, such that the tensor element $x_{i_1, i_2, \ldots, i_M}$ is mapped to the matrix element $x_{i_m, j}$, where $j = 1 + \sum_{k=1, k \neq m}^{M} (i_k - 1) J_k$ with $J_k = \prod_{n=1, n \neq m}^{k-1} I_n$.

The mode-$m$ vector product of a tensor $\mathcal{X}$ with a vector $\mathbf{u} \in \mathbb{R}^{I_m}$, denoted by $\mathcal{X} \times_m \mathbf{u} \in \mathbb{R}^{I_1 \times \cdots \times I_{m-1} \times I_{m+1} \times \cdots \times I_M}$, results in a tensor of order $M-1$ and is defined element-wise as

$(\mathcal{X} \times_m \mathbf{u})_{i_1, \ldots, i_{m-1}, i_{m+1}, \ldots, i_M} = \sum_{i_m=1}^{I_m} x_{i_1, i_2, \ldots, i_M} u_{i_m}$. (1)

Furthermore, we denote $\mathcal{X} \prod_{m=j}^{M} \times_m \mathbf{u}^{(m)} \doteq \mathcal{X} \times_j \mathbf{u}^{(j)} \times_{j+1} \mathbf{u}^{(j+1)} \cdots \times_M \mathbf{u}^{(M)}$.
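To make these operators concrete, the following NumPy sketch (the function names are ours, for illustration) implements the mode-$m$ unfolding and the mode-$m$ vector product for a third-order tensor, and checks the product against the element-wise definition in (1):

import numpy as np

def unfold(X, mode):
    # Mode-`mode` unfolding: bring the requested mode first, then
    # flatten the remaining modes in Fortran (column-major) order.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order='F')

def mode_vec_product(X, u, mode):
    # Contract mode `mode` of X with the vector u; the order drops by one.
    return np.tensordot(X, u, axes=(mode, 0))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4))          # a third-order tensor
u = rng.standard_normal(3)

Y = mode_vec_product(X, u, mode=1)          # shape (2, 4)
Y_ref = np.einsum('ijk,j->ik', X, u)        # element-wise definition, eq. (1)
assert np.allclose(Y, Y_ref)
print(unfold(X, 1).shape)                   # (3, 8)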

The Khatri-Rao product (i.e., column-wise Kronecker product) of matrices $\mathbf{A} \in \mathbb{R}^{I \times N}$ and $\mathbf{B} \in \mathbb{R}^{J \times N}$ is denoted by $\mathbf{A} \odot \mathbf{B}$ and yields a matrix of dimensions $(IJ) \times N$. The Hadamard product of $\mathbf{A} \in \mathbb{R}^{I \times N}$ and $\mathbf{B} \in \mathbb{R}^{I \times N}$ is denoted by $\mathbf{A} * \mathbf{B}$ and is equal to $a_{(i,j)} b_{(i,j)}$ for the $(i, j)$ element.
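A small NumPy sketch of the two products (the helper name is ours); it also verifies numerically the mixed-product identity that Lemma 1 below relies on:

import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: column r is kron(A[:, r], B[:, r]).
    I, N = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, N)

rng = np.random.default_rng(1)
A, B = rng.standard_normal((4, 5)), rng.standard_normal((3, 5))
C, D = rng.standard_normal((4, 5)), rng.standard_normal((3, 5))

hadamard = (A.T @ C) * (B.T @ D)             # Hadamard product of two products
lhs = khatri_rao(A, B).T @ khatri_rao(C, D)  # (A ⊙ B)^T (C ⊙ D)
assert np.allclose(lhs, hadamard)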

The CP decomposition (Kolda and Bader, 2009; Sidiropoulos et al., 2017) factorizes a tensor into a sum of component rank-one tensors. An $M$th-order tensor $\mathcal{X}$ is rank-one when it can be written as the outer product of $M$ vectors $\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(M)}$. That is, $\mathcal{X} = \mathbf{u}^{(1)} \circ \mathbf{u}^{(2)} \circ \cdots \circ \mathbf{u}^{(M)}$, where $\circ$ denotes the vector outer product. Consequently, the rank-$R$ CP decomposition of an $M$th-order tensor $\mathcal{X}$ is written as:

$\mathcal{X} \doteq [\![ \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(M)} ]\!] = \sum_{r=1}^{R} \mathbf{u}_r^{(1)} \circ \mathbf{u}_r^{(2)} \circ \cdots \circ \mathbf{u}_r^{(M)}$, (2)

where the factor matrices $\mathbf{U}^{(m)} = [\mathbf{u}_1^{(m)}, \mathbf{u}_2^{(m)}, \ldots, \mathbf{u}_R^{(m)}] \in \mathbb{R}^{I_m \times R}$ collect the vectors from the rank-one components. By considering the mode-$1$ unfolding of $\mathcal{X}$, the CP decomposition can be written in matrix form as (Kolda and Bader, 2009):

$\mathbf{X}_{(1)} \doteq \mathbf{U}^{(1)} \big( \mathbf{U}^{(M)} \odot \cdots \odot \mathbf{U}^{(2)} \big)^T$. (3)

More details on tensors and multilinear operators can be found in Kolda and Bader (2009); Sidiropoulos et al. (2017).
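The following sketch (with arbitrarily chosen dimensions) builds a rank-$R$ third-order tensor from its factor matrices and verifies the unfolded form (3); khatri_rao is the helper defined in the previous snippet:

import numpy as np

def khatri_rao(A, B):
    I, N = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, N)

rng = np.random.default_rng(2)
R = 4
U1, U2, U3 = (rng.standard_normal((I, R)) for I in (5, 6, 7))

# Rank-R CP tensor: a sum of R rank-one (outer-product) terms, eq. (2).
X = np.einsum('ir,jr,kr->ijk', U1, U2, U3)

# Mode-1 unfolding (columns ordered with earlier modes varying fastest).
X1 = X.reshape(5, -1, order='F')
# Matrix form of the CP decomposition, eq. (3).
assert np.allclose(X1, U1 @ khatri_rao(U3, U2).T)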

2.2 High-order polynomial generators

GANs typically consist of two deep networks, namely a generator $G$ and a discriminator $D$. $G$ is a decoder (i.e., a function approximator of the sampler of the target distribution) which receives as input a random noise vector $\mathbf{z} \in \mathbb{R}^d$ and outputs a sample $\mathbf{x} = G(\mathbf{z}) \in \mathbb{R}^o$. $D$ receives as input both $G(\mathbf{z})$ and real samples and tries to differentiate the fake from the real samples. During training, both $G$ and $D$ compete against each other till they reach an "equilibrium" (Goodfellow et al., 2014). In practice, both the generator and the discriminator are modeled as deep neural networks, involving compositions of linear and non-linear operators (Radford et al., 2015).

In this paper, we focus on the generator. Instead of modeling the generator as a composition of linear and non-linear functions, we assume that each generated pixel may be expanded as an $N$th-order polynomial¹ in $\mathbf{z}$. That is,

$x_w = G(\mathbf{z})_w \doteq \beta_w + \sum_{n=1}^{N} \mathcal{W}^{[n]}_w \prod_{j=1}^{n} \times_j \mathbf{z}$, (4)

where the scalars $\beta_w$ and the set of tensors $\{\mathcal{W}^{[n]}_w\}_{n=1}^{N}$ are the parameters of the polynomial expansion associated with each output of the generator (i.e., pixel $w$). Clearly, for $n = 1$ the weights are $d$-dimensional vectors; for $n = 2$ the weights form a matrix; and for higher orders of approximation, i.e., $n \geq 3$, the weights are $n$th-order tensors.

¹With an $N$th-order polynomial we can approximate any smooth function (Stone, 1948).

By stacking the parameters for all pixels, we define the parameters $\boldsymbol{\beta} \in \mathbb{R}^o$ and $\mathcal{W}^{[n]} \in \mathbb{R}^{o \times \prod_{j=1}^{n} \times_j d}$ for $n = 1, \ldots, N$. Consequently, the vector-valued generator function is expressed as:

$G(\mathbf{z}) \doteq \boldsymbol{\beta} + \sum_{n=1}^{N} \mathcal{W}^{[n]} \prod_{j=2}^{n+1} \times_j \mathbf{z}$. (5)

Intuitively, the aforementioned functional form is an expansion which allows the $n$th-order interactions between the elements of the noise vector $\mathbf{z}$. Furthermore, it is worth noting that (5) resembles the functional form of a truncated Maclaurin expansion of vector-valued functions. In the case of a Maclaurin expansion, the $\mathcal{W}^{[n]}$ represent the $n$th-order partial derivatives of a known function. However, in our case the generator function is unknown, and hence all the parameters need to be estimated from training samples.

The number of unknown parameters in (5) is $o \cdot \sum_{n=1}^{N} d^{n} + o$, which grows exponentially with the order $N$ of the approximation. Consequently, the model of (5) is prone to overfitting and its training is computationally demanding.
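A tiny-dimensional sketch of the full (undecomposed) expansion (5) for $N = 3$; the dimensions are ours, chosen only to make the exponential parameter growth visible:

import numpy as np

rng = np.random.default_rng(3)
d, o = 8, 16                       # small noise / output dims for illustration
beta = rng.standard_normal(o)
W1 = rng.standard_normal((o, d))
W2 = rng.standard_normal((o, d, d))
W3 = rng.standard_normal((o, d, d, d))

def G_full(z):
    # Equation (5) with N = 3: mode products of each W^[n] with z.
    return (beta
            + np.einsum('wi,i->w', W1, z)
            + np.einsum('wij,i,j->w', W2, z, z)
            + np.einsum('wijk,i,j,k->w', W3, z, z, z))

x = G_full(rng.standard_normal(d))              # a generated "sample"
n_params = beta.size + W1.size + W2.size + W3.size
print(n_params)                                 # o*(1 + d + d**2 + d**3) = 9360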

A natural approach to reducing the number of parameters is to assume that the weights exhibit redundancy and hence the weight tensors are of low rank. To this end, several low-rank tensor decompositions can be employed (Kolda and Bader, 2009; Sidiropoulos et al., 2017). Let each parameter tensor admit a rank-$k$ CP decomposition (Kolda and Bader, 2009), namely $\mathcal{W}^{[n]} \doteq [\![ \mathbf{C}^{[n]}, \mathbf{A}^{[n]}_{1}, \ldots, \mathbf{A}^{[n]}_{n} ]\!]$, with $\mathbf{C}^{[n]} \in \mathbb{R}^{o \times k}$ and $\mathbf{A}^{[n]}_{j} \in \mathbb{R}^{d \times k}$ for $j = 1, \ldots, n$. Then, (5) is expressed as

$G(\mathbf{z}) = \boldsymbol{\beta} + \sum_{n=1}^{N} \mathbf{C}^{[n]} \Big( \big( \mathbf{A}^{[n]T}_{1} \mathbf{z} \big) * \cdots * \big( \mathbf{A}^{[n]T}_{n} \mathbf{z} \big) \Big)$, (6)

which has significantly fewer parameters than (5), especially for small rank $k$. However, a different set of factor matrices for each level of approximation is required in (6), and hence the correlation of pixels at different levels of approximation is not taken into account.
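A numerical sketch of the reduction leading to (6), for a single second-order term (the dimensions and rank are ours): when $\mathcal{W}^{[2]}$ has CP structure, its mode products with $\mathbf{z}$ collapse to Hadamard products of linear transformations of $\mathbf{z}$.

import numpy as np

rng = np.random.default_rng(4)
d, o, k = 8, 16, 5
C = rng.standard_normal((o, k))
A1, A2 = rng.standard_normal((d, k)), rng.standard_normal((d, k))

# Second-order weight tensor with CP structure: W2 = [[C, A1, A2]].
W2 = np.einsum('wr,ir,jr->wij', C, A1, A2)

z = rng.standard_normal(d)
full = np.einsum('wij,i,j->w', W2, z, z)       # the n = 2 term of eq. (5)
factored = C @ ((A1.T @ z) * (A2.T @ z))       # the same term in eq. (6)
assert np.allclose(full, factored)
# Parameter count: o*d**2 = 1024 for W2 vs. k*(o + 2*d) = 160 for the factors.

The limitation noted above (a separate set of factors per order) motivates the shared factorization that follows.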

To alleviate this, and to further reduce the number of parameters, we propose the following joint factorization, in which factors are shared across the different levels of approximation:

(7)

This hierarchical decomposition can be implemented with a neural network structure and act as a GAN decoder.

The main building block: third-order approximation

To illustrate the proposed approach, we consider a third-order function approximation ($N = 3$):

$G(\mathbf{z}) = \boldsymbol{\beta} + \mathcal{W}^{[1]} \times_2 \mathbf{z} + \mathcal{W}^{[2]} \times_2 \mathbf{z} \times_3 \mathbf{z} + \mathcal{W}^{[3]} \times_2 \mathbf{z} \times_3 \mathbf{z} \times_4 \mathbf{z}$. (8)

By applying the decomposition of (7), (8) is written as:

(9)

The following lemmas are used to transform (9) into a network structure; their proofs are deferred to the appendix.

Lemma 1.

For the sets of matrices $\{\mathbf{A}_\nu \in \mathbb{R}^{I_\nu \times k}\}_{\nu=1}^{N}$ and $\{\mathbf{B}_\nu \in \mathbb{R}^{I_\nu \times \rho}\}_{\nu=1}^{N}$, it holds that

$\Big( \bigodot_{\nu=1}^{N} \mathbf{A}_\nu \Big)^T \Big( \bigodot_{\nu=1}^{N} \mathbf{B}_\nu \Big) = \big( \mathbf{A}_1^T \mathbf{B}_1 \big) * \big( \mathbf{A}_2^T \mathbf{B}_2 \big) * \cdots * \big( \mathbf{A}_N^T \mathbf{B}_N \big)$. (10)
Lemma 2.

Let

(11)

It holds that

(12)
Lemma 3.

Let

(13)

with the quantities defined as in Lemma 2 (Lemma 6 in the appendix). Then, for the $G(\mathbf{z})$ of (9), it is true that $G(\mathbf{z})$ equals the expression in (13).

Combining Lemma 2 and Lemma 3 (proved as Lemmas 6 and 7 in the appendix), we obtain:

(14)

The last equation can be implemented in a hierarchical manner with a three-layer neural network as shown in Fig. 2.

Figure 2: Schematic illustration of the polynomial generator (for a third-order approximation). The symbol $*$ refers to the Hadamard product.
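To make the structure of Fig. 2 concrete, the following PyTorch sketch implements a third-order generator with Hadamard-product noise injections; the layer widths, the skip connections that retain lower-order terms, and the class name are our assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class PolyGenerator3(nn.Module):
    """Sketch of a third-order polynomial generator: linear layers whose
    outputs are fused with transformed noise via Hadamard products."""

    def __init__(self, noise_dim=128, hidden=256, out_dim=2):
        super().__init__()
        self.first = nn.Linear(noise_dim, hidden)
        # One linear "local transformation" of z per injection (cf. Fig. 2).
        self.inject = nn.ModuleList(nn.Linear(noise_dim, hidden)
                                    for _ in range(2))
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden)
                                    for _ in range(2))
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, z):
        h = self.first(z)                      # first-order term
        for layer, inj in zip(self.layers, self.inject):
            # The Hadamard injection raises the polynomial order by one;
            # the additive skip keeps the lower-order terms.
            h = layer(h) * inj(z) + h
        return self.out(h)                     # no activation functions

G = PolyGenerator3()
x = G(torch.randn(4, 128))                     # a batch of 4 samples
print(x.shape)                                 # torch.Size([4, 2])

Removing the last injection of such a structure drops the highest-order terms, which is exactly the experiment shown in Fig. 1 (b).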

3 Related work

Since the literature on GANs is vast, we refer the interested reader to a recent survey on the topic (Creswell et al., 2018). In what follows, we focus only on the most closely related works to ours.

The authors of (Berthelot et al., 2017) use skip connections to concatenate the noise in deeper layers of the generator. The recent BigGAN (Brock et al., 2019) performs a hierarchical composition through skip connections from the noise $\mathbf{z}$ to multiple resolutions of the generator. In their implementation, they split $\mathbf{z}$ into one chunk per resolution and concatenate each chunk (of $\mathbf{z}$) to the respective resolution.

Despite the propagation of the noise to successive layers, the aforementioned works have substantial differences from ours. We introduce a well-motivated and mathematically elaborate method to achieve a more precise approximation in a power-series-like fashion. In contrast to the previously mentioned works, we also do not concatenate the noise with the feature representations, but rather multiply the noise with the feature representations, which we mathematically justify.

The work that is most closely related to ours is the recently proposed StyleGAN (Karras et al., 2019), which is an improvement over the Progressive Growing of GANs (ProGAN) (Karras et al., 2018). Like ProGAN, StyleGAN is a highly-engineered network that achieves compelling results on synthesized 2D images. In order to explain the improvements of StyleGAN over ProGAN, the authors adopt arguments from the style-transfer literature (Huang and Belongie, 2017). Nevertheless, style transfer proposes to use features from images for conditional image translation, which is very different from unsupervised sample (image) generation. We believe that these improvements can be better explained in the light of our proposed polynomial function approximation. That is, as we show in Fig. 1, the injection layers allow building a hierarchical decomposition with an increasing level of detail (rather than different styles).

In addition, the improvements of StyleGAN (Karras et al., 2019) are demonstrated on a well-tuned model, while in this paper we showcase that, without any complicated engineering process, the injection can be applied to several generators (or any other type of decoder) and consistently improves the performance.

4 Experiments

In this section, we describe the experimental setup as well as the obtained quantitative and qualitative results. In Section 4.1, we implement exactly the theoretical analysis of Section 2 and thus experimentally solidify our methodology. In Section 4.2, we extend the framework of the previous section to real-world images. Finally, in Section 4.3, we utilize more challenging datasets and state-of-the-art network structures and establish that our framework is architecture-agnostic and consistently outperforms the baselines. An ablation study as well as further quantitative and qualitative results are deferred to the appendix.

We note that throughout our experiments we minimally modify the generators of the implemented architectures to derive our approach, as visualized in Fig. 2. Additionally, we implement the most closely related alternative to our framework: instead of using the Hadamard operator as in Fig. 2, we concatenate the noise with the feature representations at the respective layer/block. The latter approach is frequently used in the literature (Berthelot et al., 2017; Brock et al., 2019) and is referred to as "Concat" in the paper. The numbers of trainable parameters of the generators of the compared methods are reported in Table 3. Our method incurs only a minimal increase in parameters, while the concatenation increases the number of parameters substantially.

4.1 Synthetic

In this synthetic experiment, we assess the polynomial-based generator on a sinusoidal function in a bounded domain. Only linear blocks, i.e., no activation functions, are used in the generator. That is, all the element-wise non-linearities (such as ReLUs) are removed.

The distribution we want to match is a sinusoidal signal, i.e., the input to the generator is a noise sample $z$ and the output is a two-dimensional point on the sinusoid. The generator architecture consists of fully connected (FC) layers. In Fig. 3, samples are visualized for each compared method.
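For reference, a minimal sketch of the target sampler for this experiment, under our assumption that the matched signal is the curve $(t, \sin t)$ over one period (the exact interval is not specified above):

import numpy as np

def sample_sinusoid(n, rng=None):
    # Draw points (t, sin t) on the curve; the interval is our assumption.
    rng = rng or np.random.default_rng(5)
    t = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.stack([t, np.sin(t)], axis=1)    # shape (n, 2)

real_batch = sample_sinusoid(256)              # ground-truth samples (Fig. 3a)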

Figure 3: Synthesized data for learning the sinusoid. No activations are used in the generators. From left to right: (a) the ground-truth signal (GT), (b) the original GAN (Orig), (c) concatenating the noise throughout the layers (Concat), (d) PolyGAN. Notably, neither Orig nor Concat can learn to approximate the different Taylor terms.
Figure 4: Synthesized data for MNIST with a single activation function in the generator. From left to right: (a) the ground-truth samples (GT), (b) the original GAN (Orig), (c) concatenating the noise throughout the layers (Concat), (d) PolyGAN.

4.2 Generators with Linear Blocks

We extend the linear generation described in Sec. 4.1 to real-world images. Since real-world data are more intricate, we capitalize on the expressivity of the recent resnet-based SNGAN (Miyato et al., 2018). In our case, we remove all intermediate activations and retain only the tanh in the last layer (for normalization purposes). For every resnet-based generator, we consider each resnet block as one term of the expansion in (5).

In Fig. 5 we illustrate conditional generation on MNIST (LeCun et al., 1998); in Fig. 4, unsupervised generation on MNIST. As can be seen, there is a significant visual difference: the "Orig" and "Concat" methods result in severe mode collapse and fail to retain any fine details. The corresponding experiment for conditional CIFAR10 (Krizhevsky et al., 2014) is deferred to the supplementary material.

Figure 5: Conditional generation of MNIST digits with a single activation function, comparing GT, Orig, Concat and PolyGAN. Note that both "Orig" and "Concat" suffer from severe mode collapse.

4.3 Extensive experiments on image generation

We use three different popular GAN architectures. In particular, we utilize DCGAN (Radford et al., 2015), SNGAN (Miyato et al., 2018), and SAGAN (Zhang et al., 2019). The original implementations of the aforementioned GANs serve as our baselines; we then minimally modify each baseline (without adapting any hyper-parameters, for a fair comparison) to obtain our proposed approach. Algorithms 1 and 2 succinctly present the key differences of our approach from the traditional one (shown for SNGAN; the other architectures are modified similarly).

To reduce the variance often observed during GAN training (Lucic et al., 2018; Odena et al., 2018), each reported score is averaged over 10 runs utilizing different seeds. The metrics we utilize are the Inception Score (IS) (Salimans et al., 2016) and the Frechet Inception Distance (FID) (Heusel et al., 2017).

4.3.1 Unsupervised image generation

In this experiment, we study the image generation problem without any labels or class information for the images. The architectures of DCGAN and the resnet-based SNGAN are used for image generation on CIFAR10 (Krizhevsky et al., 2014), a widely used dataset for benchmarking generative methods. CIFAR10 includes 60,000 images of 32×32 resolution; we use the 50,000 training images for training and the rest for testing. Table 1 summarizes the IS/FID scores of the compared methods. In all of the experiments, PolyGAN outperforms the compared methods.

Table 1: IS/FID scores on CIFAR10 (Krizhevsky et al., 2014) utilizing the DCGAN (Radford et al., 2015) and SNGAN (Miyato et al., 2018) architectures for unsupervised image generation. For each architecture, the models Orig, Concat and PolyGAN are compared on IS (higher is better) and FID (lower is better). Each network is run 10 times and the mean and standard deviation are reported. In both cases, inserting block-wise noise injections into the generator (i.e., converting it to our proposed PolyGAN) results in improved scores.


4.3.2 Conditional image generation

Frequently, class information is available in the databases. We can utilize it, e.g., in the form of conditional batch normalization or class embeddings, to synthesize images conditioned on a class. We train two networks, i.e., SNGAN (Miyato et al., 2018) on CIFAR10 (Krizhevsky et al., 2014) and SAGAN (Zhang et al., 2019) on Imagenet (Russakovsky et al., 2015). SAGAN uses self-attention blocks (Wang et al., 2018) to improve the resnet-based generator. Imagenet is a large-scale dataset that includes over one million training samples and 50,000 validation images. We reshape the images to 128×128 resolution.

Despite our best efforts to show that our method is both architecture- and database-agnostic, the recent methods are run for hundreds of thousands or even millions of iterations till "convergence". For SAGAN, the authors report that each training run requires multiple GPUs utilized for weeks to reach the final reported Inception Score. We report the metrics for networks that are run with a batch size four times smaller than the original, to fit in a single 16GB NVIDIA V100 GPU. Following the current practice in ML, due to the lack of computational budget (Hoogeboom et al., 2019), we run SAGAN for a reduced number of iterations (see Fig. 3 of the original paper for the IS during training)². Each such experiment takes several days to train. The FID/IS scores of our approach compared against the baseline method can be found in Table 2. In both cases, our proposed method yields a higher Inception Score and a lower FID.

²Given the batch size difference, our training corresponds to roughly a fraction of the steps of the authors' reported results.

Table 2: Quantitative results on conditional image generation. We implement both SNGAN trained on CIFAR10 and SAGAN trained on Imagenet, comparing the Orig and PolyGAN models on IS (higher is better) and FID (lower is better). Each network is run 10 times and the mean and variance are reported.

Table 3: Number of generator parameters for each approach on the various databases (DCGAN on CIFAR10: Orig, Concat, PolyGAN; SNGAN on CIFAR10: Orig, Concat, PolyGAN; SAGAN on Imagenet: Orig, PolyGAN). As can be seen, our method only marginally increases the parameters while substantially improving the performance. On the other hand, "Concat" significantly increases the parameters without an analogous increase in performance.
Input: noise z; RELU.
Output: generated sample x.
% fully-connected layer for reshaping.
h = reshape(Linear(z))
for i = 1:3 do
    % resnet blocks.
    h = resblock(h)
end for
x = tanh(Conv(h))
Algorithm 1: Original SNGAN generator.
Input: noise z; RELU.
Output: generated sample x.
% global transformation of z.
w = RELU(Linear(z))
% fully-connected layer for reshaping.
h = reshape(Linear(w))
% perform an injection here.
h = h * reshape(Linear(w))
for i = 1:3 do
    % resnet blocks.
    h = resblock(h)
    % reshape for injection.
    h = h * reshape(Linear(w))
end for
x = tanh(Conv(h))
Algorithm 2: Ours (PolyGAN).

5 Conclusion

In this work, we study data generation as a hierarchical regression task and introduce a decomposition that yields a function approximation in the form of a high-order polynomial. We implement the decomposition on three different popular GAN architectures and demonstrate that, with a minimal increase in the network parameters, the new decomposition outperforms the original architectures or recently proposed variants of them by a significant margin. We additionally showcase that our decomposition can be used to synthesize images without any activation functions in the generator, i.e., by utilizing only linear blocks.

6 Acknowledgements

We would like to thank Takeru Miyato for the advice on implementing the baselines of our experiments. We are thankful to Nvidia for the hardware donation and Amazon web services for the cloud credits. The work of GC and SM was partially funded by an Imperial College DTA. The work of Stefanos Zafeiriou was partially funded by the EPSRC Fellowship DEFORM: Large Scale Shape Analysis of Deformable Models of Humans (EP/S010203/1) and a Google Faculty Award.

References

  • M. Arjovsky and L. Bottou (2017) Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations (ICLR).
  • S. Arora, N. Cohen, N. Golowich, and W. Hu (2019) A convergence analysis of gradient descent for deep linear neural networks. In International Conference on Learning Representations (ICLR).
  • D. Berthelot, T. Schumm, and L. Metz (2017) BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR).
  • A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath (2018) Generative adversarial networks: an overview. IEEE Signal Processing Magazine 35 (1), pp. 53–65.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS).
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS), pp. 5767–5777.
  • M. Hardt and T. Ma (2017) Identity matters in deep learning. In International Conference on Learning Representations (ICLR).
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NIPS), pp. 6626–6637.
  • E. Hoogeboom, R. v. d. Berg, and M. Welling (2019) Emerging convolutions for generative normalizing flows. In International Conference on Machine Learning (ICML).
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE Proceedings of International Conference on Computer Vision (ICCV), pp. 1501–1510.
  • K. Ji and Y. Liang (2018) Minimax estimation of neural net distance. In Advances in Neural Information Processing Systems (NIPS), pp. 3845–3854.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR).
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In IEEE Proceedings of International Conference on Computer Vision and Pattern Recognition (CVPR).
  • T. G. Kolda and B. W. Bader (2009) Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500.
  • A. Krizhevsky, V. Nair, and G. Hinton (2014) The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html.
  • A. K. Lampinen and S. Ganguli (2019) An analytic theory of generalization dynamics and transfer learning in deep linear networks. In International Conference on Learning Representations (ICLR).
  • T. Laurent and J. Brecht (2018) Deep linear networks with arbitrary loss: all local minima are global. In International Conference on Machine Learning (ICML).
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems (NIPS), pp. 700–709.
  • X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley (2017) Least squares generative adversarial networks. In IEEE Proceedings of International Conference on Computer Vision (ICCV), pp. 2813–2821.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR).
  • S. Nowozin, B. Cseke, and R. Tomioka (2016) f-GAN: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (NIPS), pp. 271–279.
  • A. Odena, J. Buckman, C. Olsson, T. B. Brown, C. Olah, C. Raffel, and I. Goodfellow (2018) Is generator conditioning causally related to GAN performance? In International Conference on Machine Learning (ICML).
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NIPS), pp. 2234–2242.
  • A. M. Saxe, J. L. McClelland, and S. Ganguli (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR).
  • N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos (2017) Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing 65 (13), pp. 3551–3582.
  • M. H. Stone (1948) The generalized Weierstrass approximation theorem. Mathematics Magazine 21 (5), pp. 237–254.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In IEEE Proceedings of International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9.
  • L. Theis, A. v. d. Oord, and M. Bethge (2016) A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR).
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In IEEE Proceedings of International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning (ICML).

Appendix A Introduction

This is the supplementary material accompanying the submission "PolyGAN: High-Order Polynomial Generators". The following sections are organized as follows:

  • Section B provides the Lemmas and their proofs required for our derivations.

  • Section C includes further information on the experimental setup along with ablation studies on different training options.

Appendix B Theoretical study

Additional notation: The Khatri-Rao product of a set of matrices $\{\mathbf{A}_\nu \in \mathbb{R}^{I_\nu \times k}\}_{\nu=1}^{N}$ is denoted by $\mathbf{A}_1 \odot \mathbf{A}_2 \odot \cdots \odot \mathbf{A}_N \doteq \bigodot_{\nu=1}^{N} \mathbf{A}_\nu$.

In this section, we will prove the following identity connecting the sets of matrices $\{\mathbf{A}_\nu \in \mathbb{R}^{I_\nu \times k}\}_{\nu=1}^{N}$ and $\{\mathbf{B}_\nu \in \mathbb{R}^{I_\nu \times \rho}\}_{\nu=1}^{N}$:

$\Big( \bigodot_{\nu=1}^{N} \mathbf{A}_\nu \Big)^T \Big( \bigodot_{\nu=1}^{N} \mathbf{B}_\nu \Big) = \big( \mathbf{A}_1^T \mathbf{B}_1 \big) * \cdots * \big( \mathbf{A}_N^T \mathbf{B}_N \big)$. (15)

To demonstrate the simple case with two matrices, we first prove the special case with $N = 2$.

Lemma 4.

For matrices $\mathbf{A}_1 \in \mathbb{R}^{I_1 \times k}$, $\mathbf{A}_2 \in \mathbb{R}^{I_2 \times k}$, $\mathbf{B}_1 \in \mathbb{R}^{I_1 \times \rho}$ and $\mathbf{B}_2 \in \mathbb{R}^{I_2 \times \rho}$, it holds that

$\big( \mathbf{A}_1 \odot \mathbf{A}_2 \big)^T \big( \mathbf{B}_1 \odot \mathbf{B}_2 \big) = \big( \mathbf{A}_1^T \mathbf{B}_1 \big) * \big( \mathbf{A}_2^T \mathbf{B}_2 \big)$. (16)

Proof.

Initially, both sides of the equation have dimensions $k \times \rho$, i.e., they match. The $(i, j)$ element of the matrix product $\mathbf{A}_1^T \mathbf{B}_1$ is

$\big( \mathbf{A}_1^T \mathbf{B}_1 \big)_{(i,j)} = \sum_{s=1}^{I_1} (\mathbf{A}_1)_{(s,i)} (\mathbf{B}_1)_{(s,j)}$. (17)

Then the $(i, j)$ element of the right hand side (rhs) of equation 16 is:

$\Big( \sum_{s=1}^{I_1} (\mathbf{A}_1)_{(s,i)} (\mathbf{B}_1)_{(s,j)} \Big) \cdot \Big( \sum_{t=1}^{I_2} (\mathbf{A}_2)_{(t,i)} (\mathbf{B}_2)_{(t,j)} \Big)$. (18)

From the definition of the Khatri-Rao product, it is straightforward to obtain the element $\big( (s-1) I_2 + t, \, i \big)$ of $\mathbf{A}_1 \odot \mathbf{A}_2$ as $(\mathbf{A}_1)_{(s,i)} (\mathbf{A}_2)_{(t,i)}$. Similarly, the $\big( (s-1) I_2 + t, \, j \big)$ element of $\mathbf{B}_1 \odot \mathbf{B}_2$ is $(\mathbf{B}_1)_{(s,j)} (\mathbf{B}_2)_{(t,j)}$.

The respective $(i, j)$ element of the left hand side (lhs) of the equation is:

$\sum_{s=1}^{I_1} \sum_{t=1}^{I_2} (\mathbf{A}_1)_{(s,i)} (\mathbf{A}_2)_{(t,i)} (\mathbf{B}_1)_{(s,j)} (\mathbf{B}_2)_{(t,j)}$. (19)

In the last equation, the double sum over $(s, t)$ factorizes into the product of the two sums in (18), which proves the identity. ∎

In a similar manner, we generalize the identity to the case of $N$ terms below.

Lemma 5.

For sets of matrices $\{\mathbf{A}_\nu \in \mathbb{R}^{I_\nu \times k}\}_{\nu=1}^{N}$ and $\{\mathbf{B}_\nu \in \mathbb{R}^{I_\nu \times \rho}\}_{\nu=1}^{N}$, it holds that

$\Big( \bigodot_{\nu=1}^{N} \mathbf{A}_\nu \Big)^T \Big( \bigodot_{\nu=1}^{N} \mathbf{B}_\nu \Big) = \big( \mathbf{A}_1^T \mathbf{B}_1 \big) * \cdots * \big( \mathbf{A}_N^T \mathbf{B}_N \big)$. (20)

Proof.

On the right hand side (rhs), we have Hadamard products of the matrix multiplications $\mathbf{A}_\nu^T \mathbf{B}_\nu$. Each multiplication results in a matrix of $k \times \rho$ dimensions; thus, the rhs is a matrix of $k \times \rho$ dimensions.

The lhs is a matrix multiplication of two Khatri-Rao products. The first Khatri-Rao product has dimensions $(\prod_{\nu} I_\nu) \times k$, while the second $(\prod_{\nu} I_\nu) \times \rho$. Altogether, the lhs has $k \times \rho$ dimensions.

Similarly to the previous Lemma, the $(i, j)$ element of the rhs is:

$\prod_{\nu=1}^{N} \Big( \sum_{s_\nu=1}^{I_\nu} (\mathbf{A}_\nu)_{(s_\nu, i)} (\mathbf{B}_\nu)_{(s_\nu, j)} \Big)$. (21)

To proceed with the lhs, it is straightforward to derive that

$\Big( \bigodot_{\nu=1}^{N} \mathbf{A}_\nu \Big)_{(\mu(s_1, \ldots, s_N), \, i)} = \prod_{\nu=1}^{N} (\mathbf{A}_\nu)_{(s_\nu, i)}$, (22)

where $\mu(s_1, \ldots, s_N)$ is the (recursively defined) row index determined by the indices $s_\nu$. Summing over all rows in the matrix multiplication, we obtain:

$\sum_{s_1, \ldots, s_N} \prod_{\nu=1}^{N} (\mathbf{A}_\nu)_{(s_\nu, i)} (\mathbf{B}_\nu)_{(s_\nu, j)}$, (23)

which factorizes into (21). ∎

Lemma 6.

Let

(24)

It holds that

(25)
Proof.

We will prove the equivalence starting from (24) and transforming it into (25). From (24):

(26)

where in the last equation we have applied the identity of Lemma 5. Applying the Lemma once more on the last term of (26), we obtain (25). ∎

Lemma 7.

Let

(27)

with the quantities defined as in Lemma 6. Then, it holds for the $G(\mathbf{z})$ of (9) that it equals the expression in (27).

Proof.

Transforming (27) into (9):

(28)

To simplify the notation, we define auxiliary symbols for the repeated sub-expressions. Then, from (28):

(29)

Replacing (29) into (28), we obtain (9). ∎

Appendix C Experiments

In Section C.1, further information on the baseline architectures is provided; in Section C.2, the implementation details are described; and in Section C.3, an ablation study is performed.

C.1 Baseline details

Three recent methods are used for the experimental validation of our proposed generators. The architectures employed as baselines are DCGAN (Radford et al., 2015), SNGAN (Miyato et al., 2018), and SAGAN (Zhang et al., 2019).

For all the aforementioned architectures, the default hyper-parameters are left unchanged. The same code bases are used for reporting the results of both the baseline and our method, to avoid any discrepancies, e.g., different frameworks resulting in unfair comparisons. The source code will be released to enable the reproduction of our results.

C.2 Implementation details

We find experimentally that in deeper networks, instead of injecting the input noise $\mathbf{z}$ directly, we might need fully connected layer(s) to transform it into an intermediate representation $\mathbf{w}$ before injecting it. These fully connected layers are henceforth referred to as global transformations on $\mathbf{z}$. Similarly, the fully connected layers before each injection are referred to as local transformations on $\mathbf{w}$.

The implementation details for each network are the following:

  • DCGAN: We use a global transformation followed by a RELU non-linearity, as well as local transformations followed by RELU non-linearities. The remaining details are the same as in the baseline model.

  • SNGAN: Similarly to DCGAN, we use a global transformation with a RELU non-linearity. Each local transformation is only composed of a fully-connected layer.

We have not tried to further optimize the non-linearities or the layers in the local transformations; additional engineering might yield superior results depending on the task. However, our goal is to assess the performance without additional overhead or engineering.

In the resnet-based generators, e.g., SNGAN or SAGAN, we perform an injection after each block (see the algorithms in the main paper).

In the conditional SNGAN/SAGAN, the injection is performed only on the noise, i.e., no class information is injected.

Evaluation metrics: The popular Inception Score (IS) (Salimans et al., 2016) and Frechet Inception Distance (FID) (Heusel et al., 2017) are used for the quantitative evaluation. Both scores use feature representations from a pretrained classifier. Despite their shortcomings, IS and FID are widely used (Lucic et al., 2018; Creswell et al., 2018), since alternative metrics fail for generative models (Theis et al., 2016).

The Inception Score is defined as

$IS = \exp\big( \mathbb{E}_{\mathbf{x}} \big[ KL\big( p(y \,|\, \mathbf{x}) \,\|\, p(y) \big) \big] \big)$, (30)

where $\mathbf{x}$ is a generated sample and $p(y \,|\, \mathbf{x})$ is the conditional distribution over the labels $y$ (in practice computed by the Inception network (Szegedy et al., 2015)). The marginal distribution over the labels is approximated by $p(y) \approx \frac{1}{M} \sum_{m=1}^{M} p(y \,|\, \mathbf{x}_m)$ for $M$ generated samples. Following the methods in the literature (Miyato et al., 2018), we compute the Inception Score on generated samples for each run, averaged over several splits.
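A minimal NumPy sketch of equation (30), computed from a matrix of classifier softmax outputs (the classifier itself and the sample/split counts are omitted; the function name is ours):

import numpy as np

def inception_score(p_yx, eps=1e-12):
    # p_yx: (M, num_classes) matrix of softmax outputs p(y | x_m).
    p_y = p_yx.mean(axis=0, keepdims=True)         # marginal label distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                # equation (30)

rng = np.random.default_rng(6)
logits = rng.standard_normal((1000, 10))
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(inception_score(p))                          # toy value, random predictions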

The Frechet Inception Distance (FID) utilizes feature representations from a pretrained network (Szegedy et al., 2015) and assumes that the distributions of these representations are Gaussian. Denoting the statistics of the representations of the real images by $(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ and of the generated (fake) ones by $(\boldsymbol{\mu}_f, \boldsymbol{\Sigma}_f)$, FID is:

$FID = \| \boldsymbol{\mu}_r - \boldsymbol{\mu}_f \|_2^2 + \mathrm{Tr}\big( \boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_f - 2 (\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_f)^{1/2} \big)$. (31)

In the experiments, we use the real images to compute the mean and covariance of the real distribution, and an equal number of synthesized samples for the fake statistics.
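For concreteness, a NumPy/SciPy sketch of equation (31) computed from two feature matrices (the pretrained-network feature extraction is omitted; the function name is ours):

import numpy as np
from scipy import linalg

def fid(feat_real, feat_fake):
    # Gaussian statistics of the real / fake feature representations.
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    # Matrix square root of the covariance product; drop the tiny
    # imaginary parts introduced by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    # Equation (31).
    return np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean)

rng = np.random.default_rng(7)
print(fid(rng.standard_normal((512, 64)), rng.standard_normal((512, 64))))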

For both scores the original tensorflow inception network weights are used; the routines of tensorflow.contrib.gan.eval are called for the metric evaluation.

C.3 Ablation study

In this section, we conduct an ablation study to further assess our method. The experiments are based on SNGAN, since most recent methods use a similar generator (Zhang et al., 2019; Brock et al., 2019). Unless explicitly mentioned otherwise, the experiments in this section use SNGAN trained on CIFAR10 for unsupervised image generation.

In the first experiment, we evaluate the effect of the additional fully-connected (FC) layer on $\mathbf{z}$. Namely, we report two alternatives: (i) a linear global transformation ('Ours-linear-global') and (ii) a global transformation with a RELU non-linearity ('Ours-RELU-global').

Table 4: IS/FID scores for the global transformation on SNGAN on CIFAR10 (models: Ours-linear-global, Ours-RELU-global). We consider the two alternatives of adding an activation to the global transformation or not. The results verify that both cases improve upon the baseline, but the addition of the RELU does indeed boost the performance.

Based on the previous experiment, we insert the same global transformation into the original SNGAN, i.e., we insert a fully connected layer with a RELU activation at the input of the generator. The original model is referred to as 'Original', while the alternative with the added global transformation as 'Original-RELU-global'.

Table 5: Evaluation of the global transformation on the original SNGAN on CIFAR10 (models: Original, Original-RELU-global, Ours-RELU-global). The addition of the global transformation in the original SNGAN does not improve the results; on the contrary, the FID mean and variance deteriorate relative to the baseline.

The recent BigGAN (Brock et al., 2019) uses a block-wise injection. However, in contrast to our method, in BigGAN the original $\mathbf{z}$ is split into different non-overlapping parts that are then injected. That is, if the noise is injected into two blocks, the noise is split into three chunks: the first is used at the input, the second in the first injection, and the third in the second.

We scrutinize this splitting against our method; we split the noise into chunks of equal size for performing the injections. The injection with splitting is referred to as 'Inject-split' below. As the experimental results demonstrate, the naive splitting deteriorates the scores on the task. However, we have not tried optimizing the dimensionality of each chunk or the conditional setting (used in (Brock et al., 2019)).

Table 6: Ablation experiment on splitting the noise into non-overlapping parts for the injection (SNGAN on CIFAR10; models: Original, Inject-split, PolyGAN).
Figure 6: Conditional image generation on CIFAR10 for a generator with a single activation function, comparing GT, Orig, Concat and PolyGAN. Our approach generates more realistic samples than the compared methods, which additionally suffer from severe mode collapse.