Stochastic Conditional Generative Networks with Basis Decomposition

09/25/2019 ∙ by Ze Wang, et al. ∙ 4

While generative adversarial networks (GANs) have revolutionized machine learning, a number of open questions remain to fully understand them and exploit their power. One of these questions is how to efficiently achieve proper diversity and sampling of the multi-mode data space. To address this, we introduce BasisGAN, a stochastic conditional multi-mode image generator. By exploiting the observation that a convolutional filter can be well approximated as a linear combination of a small set of basis elements, we learn a plug-and-play basis generator to stochastically generate basis elements, with just a few hundred parameters, to fully embed stochasticity into convolutional filters. By sampling basis elements instead of filters, we dramatically reduce the cost of modeling the parameter space with no sacrifice in either image diversity or fidelity. To illustrate this proposed plug-and-play framework, we construct variants of BasisGAN based on state-of-the-art conditional image generation networks, and train the networks by simply plugging in a basis generator, without additional auxiliary components, hyperparameters, or training objectives. The experimental success is complemented with theoretical results indicating how the perturbations introduced by the proposed sampling of basis elements can propagate to the appearance of generated images.


1 Introduction

Conditional image generation networks learn mappings from the condition domain to the image domain by training on massive samples from both domains. The mapping from a condition, e.g., a map, to an image, e.g., a satellite image, is essentially one-to-many, as illustrated in Figure 1. In other words, there exist many plausible output images that satisfy a given input condition, which motivates us to explore multi-mode conditional image generation that produces diverse images conditioned on a single input condition.

One technique to improve image generation diversity is to feed the image generator with an additional latent code, in the hope that such a code can carry information that is not covered by the input condition, so that diverse output images are achieved by decoding the missing information conveyed through different latent codes. However, as illustrated in the seminal work [12], encoding the diversity with an input latent code can lead to unsatisfactory performance for the following reasons. When training with objectives such as the GAN loss [7], regularizations such as the L1 loss [12] and the perceptual loss [27] are imposed to improve both visual fidelity and correspondence to the input condition. However, no similar regularization is imposed to enforce the correspondence between outputs and latent codes, so the network is prone to ignore input latent codes in training and to produce identical images from an input condition even with different latent codes. Several methods have been proposed to explicitly encourage the network to take input latent codes into account when encoding diversity. For example, [15] explicitly maximizes the ratio of the distance between generated images with respect to the corresponding latent codes, while [32] applies an auxiliary network for decoding the latent codes from the generated images. Although the diversity of the generated images is significantly improved, these methods have drawbacks. In [15], at least two samples generated from the same condition are needed to calculate the regularization term, which multiplies the memory footprint when training each mini-batch. The auxiliary network structures and training objectives in [32] unavoidably increase training difficulty and memory footprint. These previously proposed methods usually require considerable modifications to the underlying framework.


Figure 1: Illustration of the proposed BasisGAN. Diverse generated images are achieved by the parameter generation in the stochastic sub-model, where basis generators take samples from a prior distribution and generate low-dimensional basis elements from the learned spaces. The sampled basis elements are linearly combined using the deterministic basis coefficients and used to reconstruct the convolutional filters. Filters in each stochastic layer are modeled with a separate basis generator. By convolving the same feature from the deterministic sub-model using different convolutional filters, images with diverse appearances are generated.

In this paper, we propose a stochastic model, BasisGAN, that directly maps an input condition to diverse output images, aiming at building networks that model multiple modes intrinsically. The proposed method exploits a known observation that a well-trained deep network can converge to significantly different sets of parameters across multiple trainings, due to factors such as different parameter initializations and different choices of mini-batches. Therefore, instead of treating a conditional image generation network as a deterministic function with fixed parameters, we propose modeling the filter in each convolutional layer as a sample from a filter space, and learning the corresponding filter space using a tiny network for efficient and diverse filter sampling. In [6], parameter non-uniqueness is used for multi-mode image generation by training several generators with different parameters simultaneously as a multi-agent solution. However, the maximum number of modes in [6] is restricted by the number of agents, and the replication increases memory as well as computational cost. Based on the above parameter non-uniqueness property, we introduce stochastic convolutional layers into a deep network, where filters are sampled from learned filter spaces. Specifically, we learn the mapping from a simple prior to the filter space using neural networks, here referred to as filter generators. To empower a deterministic network with multi-mode image generation, we divide the network into a deterministic sub-model and a stochastic sub-model as shown in Figure 1, where standard convolutional layers and stochastic convolutional layers with filter generators are deployed, respectively. By optimizing an adversarial loss, filter generators can be jointly trained with a conditional image generation network. In each forward pass, filters at stochastic layers are sampled by filter generators. Highly diverse images conditioned on the same input are achieved by jointly sampling filters in multiple stochastic convolutional layers.

However, the filters of a convolutional layer are usually high-dimensional when written together as one vector, which makes modeling and sampling a filter space highly costly in practice in terms of training time, sampling time, and filter generator memory footprint. Based on the low-rank property observed from sampled filters, we decompose each filter as a linear combination of a small set of basis elements [20], and propose to sample only low-dimensional spatial basis elements instead of filters. By replacing filter generators with basis generators, the proposed method becomes highly efficient and practical. Theoretical arguments are provided on how perturbations introduced by sampling basis elements can propagate to the appearance of generated images.

The proposed BasisGAN introduces a generalizable concept to promote diverse modes in conditional image generation. As basis generators act as plug-and-play modules, variants of BasisGAN can be easily constructed by replacing the standard convolutional layers in various state-of-the-art conditional image generation networks with stochastic layers with basis generators. We then train them directly, without additional auxiliary components, hyperparameters, or training objectives on top of the underlying models. Experimental results consistently show that the proposed BasisGAN is a simple yet effective solution to multi-mode conditional image generation. We further show empirically that the inherent stochasticity introduced by our method allows training without paired samples, and that one-to-many image-to-image translation is achieved using a stochastic auto-encoder, where stochasticity prevents the network from learning a trivial identity mapping.

Our contributions are summarized as follows:

  • We propose a plug-and-play basis generator to stochastically generate basis elements, with just a few hundred parameters, to fully embed stochasticity into network filters.

  • Theoretical arguments are provided to support the simplification of replacing stochastic filter generation with basis generation.

  • Both the generation fidelity and diversity of the proposed BasisGAN with basis generators are validated extensively, and state-of-the-art performance is consistently observed.

2 Related Work

Conditional image generation.

Parametric modeling of the natural image distribution has been studied for years, from restricted Boltzmann machines [23] to variational autoencoders [13]; in particular, variants with conditions [17, 24, 25] show promising results. With the great power of GANs [7], conditional generative adversarial networks (cGANs) [12, 18, 21, 27, 29, 31] achieve great progress in generating visually appealing images given conditions. However, the quality of images and the faithfulness to input conditions come at the cost of image diversity, as discussed in [32], which is addressed by the proposed BasisGAN.

Multi-mode conditional image generation.

To enable cGANs to perform multi-mode image generation, pioneering works like infoGAN [3] and pix2pix [12] propose to encode the diversity in an input latent code. To force the networks to take input latent codes into account, [32] deploys auxiliary networks and training objectives to impose the recovery of the input latent code from the generated images. MSGAN [15] and DSGAN [30] propose regularization terms for diversity that enforce a larger distance between generated images with respect to different input latent codes given one input condition. These methods require considerable modifications to the underlying original framework.

Neural network parameter generation and uncertainty.

Extensive studies have been conducted on generating network parameters using another network since Hypernetworks [8]. As a seminal work on network parameter modeling, Hypernetworks successfully reduce learnable parameters by relaxing weight-sharing across layers. Follow-up works like Bayesian Hypernetworks [14] further introduce uncertainty to the generated parameters. Variational inference based methods like Bayes by Backprop [2] address the intractable posterior distribution of parameters by assuming a prior (usually Gaussian). However, the assumed prior unavoidably degrades the expressiveness of the learned distribution. The parameter prediction of neural networks is intensively studied in the context of few-shot learning [1, 19, 28], which aims to customize a network to a new task adaptively and efficiently in a data-driven way. Apart from few-shot learning, [5] suggests parameter prediction as a way to study the redundancy in neural networks. While studying the representation power of random weights, [22] also suggests the uncertainty and non-uniqueness of network parameters. Another family of networks with uncertainty is based on variational inference [2], where an assumption on the distribution of network weights is imposed for tractable learning of the weight distribution. Works studying the relationship between local and global minima of deep networks [9, 26] also suggest the non-uniqueness of optimal parameters of a deep network.

3 Stochastic Filter Generation

A conditional generative network (cGAN) [16] learns the mapping from an input condition domain $\mathcal{A}$ to an output image domain $\mathcal{B}$ using a deep neural network. Conditional image generation is essentially a one-to-many mapping, as there can be multiple plausible instances $B \in \mathcal{B}$ that map to a condition $A \in \mathcal{A}$ [32], corresponding to a distribution $p(B|A)$. However, the naive mapping of the generator formulated by a neural network, $G: \mathcal{A} \to \mathcal{B}$, is deterministic and is incapable of covering the distribution $p(B|A)$. We exploit the non-uniqueness of network parameters as discussed above, and introduce stochasticity into convolutional filters through plug-and-play filter generators. To achieve this, we divide a network into two sub-models:

  • A deterministic sub-model with convolutional filters $\phi$ that remain fixed after training;

  • A stochastic sub-model whose convolutional filters $w$ are sampled from parameter spaces modeled by neural networks $T$, referred to as filter generators, parametrized by $\theta$ with inputs $z$ from a prior distribution $p(z)$ that is the same for all experiments in this paper.

Note that filters in each stochastic layer are modeled with a separate neural network, which is not explicitly shown in the formulation for notational brevity. With this formulation, the conditional image generation becomes $G_{\phi,\theta}: \mathcal{A} \to \mathcal{B}$, with stochasticity achieved by sampling filters $w = T_\theta(z)$ for the stochastic sub-model in each forward pass. The conditional GAN loss [7, 16] then becomes

$$\min_G \max_D V(D, G) = \mathbb{E}_{A \sim p(A),\, B \sim p(B|A)}\big[\log D(A, B)\big] + \mathbb{E}_{A \sim p(A),\, z \sim p(z)}\big[\log\big(1 - D(A, G(A; T_\theta(z)))\big)\big], \tag{1}$$

where $D$ denotes a standard discriminator. Note that we represent the generator here as $G(A; T_\theta(z))$ to emphasize that the generator uses stochastic filters $w = T_\theta(z)$.

Given a stochastic generative network $G$ parametrized by $\phi$ and $\theta$, and an input condition $A$, the generated images form a conditional probability $q_{\phi,\theta}(B|A)$, so that (1) can be simplified as

$$\min_G \max_D V(D, G) = \mathbb{E}_{A \sim p(A),\, B \sim p(B|A)}\big[\log D(A, B)\big] + \mathbb{E}_{A \sim p(A),\, B \sim q_{\phi,\theta}(B|A)}\big[\log\big(1 - D(A, B)\big)\big]. \tag{2}$$

When the optimal discriminator is achieved, (2) can be reformulated as

$$C(G) = \max_D V(D, G) = \mathbb{E}_{A \sim p(A)}\big[-\log 4 + 2\,\mathrm{JSD}\big(p(B|A)\,\|\,q_{\phi,\theta}(B|A)\big)\big], \tag{3}$$

where $\mathrm{JSD}$ is the Jensen-Shannon divergence (the proof is provided in the supplementary material). The global minimum of (3) is achieved when, given every sampled condition $A$, the generator perfectly replicates the true distribution $p(B|A)$, which indicates that by directly optimizing the loss in (1), conditional image generation with diversity is achieved with the proposed stochasticity in the convolutional filters.
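The step from (2) to (3) can be checked numerically for a single fixed condition. The sketch below is only an illustration: the discrete distributions `p` and `q` are made-up stand-ins for the true and generated conditional distributions, and it assumes the standard result from [7] that the optimal discriminator is the ratio p / (p + q).

```python
import numpy as np

def kl(a, b):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(a * np.log(a / b)))

def jsd(p, q):
    """Jensen-Shannon divergence (natural logarithm)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_discriminator(p, q):
    """E_p[log D*] + E_q[log(1 - D*)] with D* = p / (p + q)."""
    d_star = p / (p + q)
    return float(np.sum(p * np.log(d_star)) + np.sum(q * np.log(1.0 - d_star)))

# Hypothetical distributions over four possible outputs for one condition.
p = np.array([0.4, 0.3, 0.2, 0.1])       # stand-in for the true conditional distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # stand-in for the generated conditional distribution

lhs = value_at_optimal_discriminator(p, q)
rhs = -np.log(4.0) + 2.0 * jsd(p, q)
print(np.isclose(lhs, rhs))                                            # both expressions agree
print(np.isclose(value_at_optimal_discriminator(p, p), -np.log(4.0)))  # value when q equals p
```

Because the divergence is non-negative and vanishes only when the two distributions coincide, the value at `q = p` is the smallest the expression can take.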

To optimize (1), we train $D$ as in [7] to maximize the probability of assigning the correct label to both training examples and samples from $G$. Simultaneously, we train $G$ to minimize the following loss, where the filter generators $T$ are jointly optimized to bring stochasticity:

$$\mathcal{L} = \mathbb{E}_{A \sim p(A),\, z \sim p(z)}\big[\log\big(1 - D(A, G(A; T_\theta(z)))\big)\big]. \tag{4}$$

We describe in detail the optimization of the generator parameters $\{\phi, \theta\}$ in supplementary material Algorithm 1.
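A minimal NumPy sketch of one stochastic convolutional layer with a filter generator is given below. It is not the authors' implementation: the class `FilterGenerator`, the two-layer architecture, every size, and the naive `conv2d` are hypothetical choices made for illustration, and `generator_loss` assumes the standard minimax generator term from [7].

```python
import numpy as np

rng = np.random.default_rng(0)

class FilterGenerator:
    """Tiny MLP mapping a latent sample z to one layer's filter bank (hypothetical sizes)."""

    def __init__(self, z_dim, hidden, filter_shape):
        self.filter_shape = filter_shape
        n_out = int(np.prod(filter_shape))
        self.W1 = rng.normal(scale=0.1, size=(z_dim, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, n_out))

    def __call__(self, z):
        h = np.tanh(z @ self.W1)
        return (h @ self.W2).reshape(self.filter_shape)

def conv2d(x, w):
    """Naive 'valid' cross-correlation. x: (C_in, H, W); w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + k, j:j + k]
            y[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return y

def generator_loss(d_fake):
    """Monte Carlo estimate of E[log(1 - D(.))] from discriminator outputs in (0, 1)."""
    return float(np.mean(np.log(1.0 - d_fake)))

z_dim = 8
T = FilterGenerator(z_dim, hidden=16, filter_shape=(4, 3, 3, 3))
features = rng.normal(size=(3, 8, 8))   # stand-in for the deterministic sub-model's output

# Two forward passes over the same features, each with freshly sampled filters.
out_a = conv2d(features, T(rng.normal(size=z_dim)))
out_b = conv2d(features, T(rng.normal(size=z_dim)))
print(out_a.shape, np.allclose(out_a, out_b))
```

The same feature map yields different outputs on each pass because only the sampled filters change, which is the source of diversity described above.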

Discussions on diversity modeling in cGANs.

The goal of cGAN is to model the conditional probability $p(B|A)$. Previous cGAN models [15, 16, 32] typically incorporate randomness in the generator by setting $B = G(A, z)$, $z \sim p(z)$, where $G$ is a deep network with deterministic parametrization and the randomness is introduced via $z$, e.g., a latent code, as an extra input. This formulation implicitly makes the following two assumptions: (A1) the randomness of the generator is independent from that of $p(A)$; (A2) each realization $B(\omega)$ conditional on $A$ can be modeled by a CNN, i.e., $B(\omega) = G_\omega(A)$, where $G_\omega$ is a draw from an ensemble of CNNs, $\omega$ being the random event. (A1) is reasonable as long as the source of variation to be modeled by the cGAN is independent from that contained in $A$, and the rationale for (A2) lies in the expressive power of CNNs for image-to-image translation. The previous model adopts a specific form of $G_\omega$ by feeding a random input $z$ to $G$, yet one may observe that the most general formulation under (A1) and (A2) would be to sample the generator itself from a certain distribution $p(G)$, which is independent from $p(A)$. Since generative CNNs are parametrized by convolutional filters, this is equivalent to setting $B = G(A; w)$, $w \sim p(w)$, where we use ";" in the parentheses to emphasize that what comes after it is the parametrization of the generator. The proposed cGAN model in the current paper indeed takes such an approach, where we model $p(w)$ by a separate filter generator network.

4 Stochastic Basis Generation

Using the method above, the filters of each stochastic layer are generated in the form of a high-dimensional vector whose length is determined by the kernel size and the numbers of input and output channels. Although directly generating such high-dimensional vectors is feasible, it can be highly costly in terms of training time, sampling time, and memory footprint when the network scale grows. We present a thorough comparison in terms of generated quality and sampled filter size in supplementary material Figure A.1, where it is clearly shown that filter generation is too costly to afford. In this section, we propose to replace filter generation with basis generation to achieve the quality/cost trade-off shown by the red dot in supplementary material Figure A.1. Details on the memory and computational cost are also provided at the end of the supplementary material.
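To give a feel for the sizes involved, the arithmetic below compares how many values a generator would have to emit per layer in the two settings. The layer dimensions and the number of bases are hypothetical and are not taken from the paper; the sketch also assumes square kernels and that each basis element has the spatial size of one kernel.

```python
def filter_vector_size(kernel, c_in, c_out):
    """Number of values a filter generator must emit for one layer."""
    return kernel * kernel * c_in * c_out

def basis_vector_size(kernel, num_bases):
    """Number of values a basis generator must emit for one layer."""
    return kernel * kernel * num_bases

# Hypothetical layer: 3x3 kernels, 256 input channels, 256 output channels, 7 bases.
n_filter = filter_vector_size(3, 256, 256)
n_basis = basis_vector_size(3, 7)
print(n_filter, n_basis, n_filter // n_basis)
```

Under these assumed sizes the basis generator's output is several thousand times smaller than the full filter vector, and the gap widens as the channel counts grow because the basis size does not depend on them.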

For convolutional filters, the weights form a 3-way tensor involving a spatial index and two channel indices, for the input and output channels respectively. Tensor low-rank decomposition cannot be defined in a unique way. For convolutional filters, a natural solution is then to separate out the spatial index, which leads to depth-separable network architectures [4]. Among other studies of low-rank factorization of convolutional layers, [20] proposes to approximate a convolutional filter using a set of prefixed basis elements linearly combined by learned reconstruction coefficients.

Given that the weights in convolutional layers may have a low-rank structure, we collect a large number of generated filters and reshape the stack of sampled filters into a 2-dimensional matrix. We consistently observe that this matrix is always of low effective rank, regardless of the network scale we use to estimate the filter distribution. If we assume that a collection of generated filters obeys such a low-rank structure, the following theorem proves that it suffices to generate bases in order to generate the desired distribution of filters.

Theorem 1.

Let be probability space and a 3-way random tensor, where maps each event to For each fixed and ,

. If there exists a set of deterministic linear transforms

, in s.t. for any and , then there exists random vectors , , s.t. in distribution. If has a probability density, then so do . (The proof of the theorem is provided in the supplementary material.)
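The low-rank structure that the theorem relies on can be illustrated with synthetic data. In the sketch below, every sampled filter bank is built from freshly sampled spatial bases and one fixed set of coefficients, and the stacked matrix then has numerical rank equal to the number of bases. The sizes, the variable names, and the particular reshape (one row per spatial position and sample, one column per channel pair) are assumptions made for this example and may differ from the reshape used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

k, c_in, c_out = 3, 8, 8          # hypothetical layer size
num_bases, num_samples = 4, 50    # hypothetical basis count and number of sampled filter banks

# Deterministic coefficients shared by all samples: one row per basis element.
coeffs = rng.normal(size=(num_bases, c_in * c_out))

rows = []
for _ in range(num_samples):
    bases = rng.normal(size=(k * k, num_bases))   # freshly sampled spatial bases
    rows.append(bases @ coeffs)                    # (k*k, c_in*c_out) filter bank

# Every row is a combination of the same num_bases coefficient rows.
stacked = np.concatenate(rows, axis=0)            # (num_samples*k*k, c_in*c_out)

singular_values = np.linalg.svd(stacked, compute_uv=False)
effective_rank = int(np.sum(singular_values > 1e-8 * singular_values[0]))
print(stacked.shape, effective_rank)
```

The relative threshold `1e-8` is an arbitrary cutoff for counting non-negligible singular values.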

We simplify the expensive filter generation problem by decomposing each filter as a linear combination of a small set of basis elements, and then sampling basis elements instead of filters directly. In our method, we assume that the diverse modes of conditional image generation are essentially caused by spatial perturbations, and thus we propose to introduce stochasticity into the spatial basis elements. Specifically, we apply convolutional filter decomposition as in [20] to write each filter as a linear combination of basis elements weighted by decomposition coefficients, where the number of basis elements is a pre-defined small value. We keep the decomposition coefficients deterministic and learn them directly from training samples. Instead of using predefined basis elements as in [20], we adopt a basis generator to sample the basis elements, which dramatically reduces the difficulty of modeling the corresponding probability distribution. The costly filter generators in Section 3 are now replaced by much more efficient basis generators, and stochastic filters are then constructed by linearly combining sampled basis elements with the deterministic coefficients. The convolutional filter reconstruction is illustrated as part of Figure 1. As illustrated in this figure, BasisGAN is constructed by replacing the standard convolutional layers with the proposed stochastic convolutional layers with basis generators, and the network parameters can be learned without additional auxiliary training objectives or regularization.
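The filter reconstruction described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the single linear map `G_basis` standing in for a basis generator, the helper names, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

k, c_in, c_out, num_bases, z_dim = 3, 4, 6, 5, 8   # hypothetical sizes

# Deterministic coefficients, learned once and shared across forward passes.
coeffs = rng.normal(size=(num_bases, c_out, c_in))

# Hypothetical basis generator: one linear map from z to num_bases spatial kernels.
G_basis = rng.normal(scale=0.1, size=(z_dim, num_bases * k * k))

def sample_bases(z):
    """Map a latent sample to basis elements of shape (num_bases, k, k)."""
    return (z @ G_basis).reshape(num_bases, k, k)

def reconstruct_filters(bases, coeffs):
    """w[o, i] = sum_m coeffs[m, o, i] * bases[m], giving shape (c_out, c_in, k, k)."""
    return np.einsum('moi,mhw->oihw', coeffs, bases)

# Two filter banks from two latent samples; only the bases differ between them.
w1 = reconstruct_filters(sample_bases(rng.normal(size=z_dim)), coeffs)
w2 = reconstruct_filters(sample_bases(rng.normal(size=z_dim)), coeffs)

basis_values = num_bases * k * k        # values emitted by the basis generator
filter_values = c_out * c_in * k * k    # values a filter generator would have to emit
print(w1.shape, np.allclose(w1, w2), basis_values, filter_values)
```

Keeping `coeffs` fixed while resampling the bases matches the split described in the text: the channel mixing is deterministic and only the spatial component is stochastic.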

5 Experiments

In this section, we conduct experiments on multiple conditional generation tasks. Our preliminary objective is to show that, thanks to the inherent stochasticity of the proposed BasisGAN, multi-mode conditional image generation can be learned without any additional regularization that explicitly promotes diversity. The effectiveness of the proposed BasisGAN is demonstrated by quantitative and qualitative results on multiple tasks and underlying models. We start with a stochastic auto-encoder example to demonstrate the inherent stochasticity brought by the basis generator. Then we proceed to image-to-image translation tasks, and compare the proposed method with: the regularization based methods DSGAN [30] and MSGAN [15], which adopt explicit regularization terms that encourage a larger distance between output images with different latent codes; the model based method MUNIT [11], which explicitly decouples appearance from content and achieves diverse image generation by manipulating the appearance code; and BicycleGAN [32], which uses auxiliary networks to encourage the diversity of the generated images with respect to the input latent code. We further demonstrate that, as an essential way to inject randomness into conditional image generation, our method is compatible with existing regularization based methods, which can be adopted together with our proposed method for further performance improvements. Finally, extensive ablation studies are provided in the supplementary material.

5.1 Stochastic Auto-encoder

The inherent stochasticity of the proposed BasisGAN allows learning a conditional one-to-many mapping even without paired samples for training. We validate this with a variant of BasisGAN referred to as a stochastic auto-encoder, which is trained to do simple self-reconstruction with real-world images as inputs. Only the L1 loss and the GAN loss are imposed to promote fidelity and correspondence. However, thanks to the inherent stochasticity of BasisGAN, we observe that the network does not collapse to a trivial identity mapping, and diverse outputs with strong correspondence to the input images are generated with appealing fidelity. Some illustrative results are shown in Figure 2.

Figure 2: Stochastic auto-encoder: one-to-many conditional image generation without paired samples. Each panel shows an input followed by generated diverse samples. The network is trained directly to reconstruct the input real-world images, and the inherent stochasticity of the proposed method successfully promotes diverse output appearances with strong fidelity and correspondence to the inputs.

5.2 Image to Image Translation

To faithfully validate the fidelity and diversity of generated images, we follow [15] and evaluate the performance quantitatively using the following metrics:
LPIPS. The diversity of generated images is measured using LPIPS [15]. LPIPS computes the distance between images in a feature space. Generated images with higher diversity give higher LPIPS scores, which are more favourable in conditional image generation.
FID. FID [10] is used to measure the fidelity of the generated images. It computes the distance between the distribution of the generated images and that of the true images. Since the entire GAN family aims to faithfully model the true data distribution parametrically, a lower FID is favourable in our case since it reflects a closer fit to the desired distribution.
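The two metrics can be sketched on precomputed feature arrays as follows. This is a simplified illustration under stated assumptions: real FID uses statistics of features from a pretrained Inception network and real LPIPS uses deep features of a pretrained network, neither of which is included here. The function names are hypothetical, and the Euclidean distance is only a stand-in for a perceptual distance.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between the Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # tr((cov1 cov2)^{1/2}) is the sum of square roots of the eigenvalues of cov1 @ cov2,
    # which are real and non-negative when both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt)

def mean_pairwise_distance(features):
    """Average Euclidean distance over all pairs of rows (stand-in for a perceptual distance)."""
    n = len(features)
    dists = [np.linalg.norm(features[i] - features[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

# Toy usage with made-up statistics and features.
mu, eye = np.zeros(3), np.eye(3)
print(frechet_distance(mu, eye, np.array([1.0, 2.0, 2.0]), eye))
print(mean_pairwise_distance(np.array([[0.0, 0.0], [3.0, 4.0]])))
```

A lower Frechet distance indicates a closer match between the two feature distributions, while a higher mean pairwise distance among samples generated from one condition indicates more diverse outputs.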

Pix2Pix BasisGAN.

As one of the most prevalent conditional image generation networks, Pix2Pix [12] serves as a solid baseline for many multi-mode conditional image generation methods. It achieves conditional image generation by feeding the generator a conditional image, and training the generator to synthesize images with both a GAN loss and an L1 loss to the ground truth image. Typical applications for Pix2Pix include edge maps → shoes or handbags, maps → satellites, and so on. We adopt the ResNet-based Pix2Pix model, and impose the proposed stochasticity in the successive residual blocks, where regular convolutional layers and convolutional layers with basis generators convolve alternately with the feature maps. The network is re-trained from scratch directly, without any extra loss functions or regularization. Some samples are visualized in Figure 3. For a fair comparison with previous works [11, 12, 15, 30, 32], we perform the quantitative evaluations on image-to-image translation tasks, and the results are presented in Table 1. As discussed, all the state-of-the-art methods require considerable modifications to the underlying framework. By simply using the proposed stochastic basis generators as plug-and-play modules in the Pix2Pix model, BasisGAN generates significantly more diverse images while maintaining quality comparable with other state-of-the-art methods.

Figure 3: BasisGAN adapted from Pix2Pix. Columns show the input, the ground truth, and generated diverse samples. The network is trained without any auxiliary loss functions or regularizations. From top to bottom, the image-to-image translation tasks are: edges → handbags, edges → shoes, maps → satellite, nights → days, facades → buildings. Additional examples are provided in the supplementary material, Figure A.2.
Dataset: Labels → Facade
| Methods | Pix2Pix | BicycleGAN | MSGAN | BasisGAN | DSGAN (20s) | BasisGAN (20s) |
| Diversity | 0.0003 ± 0.0000 | 0.1413 ± 0.0005 | 0.1894 ± 0.0011 | 0.2648 ± 0.004 | 0.18 | 0.2594 ± 0.004 |
| Fidelity | 139.19 ± 2.94 | 98.85 ± 1.21 | 92.84 ± 1.00 | 88.7 ± 1.28 | 57.20 | 24.14 ± 0.76 |

Dataset: Map → Satellite
| Methods | Pix2Pix | BicycleGAN | MSGAN | BasisGAN | DSGAN (20s) | BasisGAN (20s) |
| Diversity | 0.0016 ± 0.0003 | 0.1150 ± 0.0007 | 0.2189 ± 0.0004 | 0.2417 ± 0.005 | 0.13 | 0.2398 ± 0.005 |
| Fidelity | 168.99 ± 2.58 | 145.78 ± 3.90 | 152.43 ± 2.52 | 35.54 ± 2.19 | 49.92 | 28.92 ± 1.88 |

| Dataset | Edge → Handbag | Edge → Handbag | Edge → Shoe | Edge → Shoe |
| Methods | MUNIT | BasisGAN | MUNIT | BasisGAN |
| Diversity | 0.32 ± 0.624 | 0.35 ± 0.810 | 0.217 ± 0.512 | 0.242 ± 0.743 |
| Fidelity | 92.84 ± 0.121 | 88.76 ± 0.513 | 62.57 ± 0.917 | 64.17 ± 1.14 |

Table 1: Quantitative results on image to image translation. Diversity and fidelity are measured using LPIPS and FID, respectively. Pix2Pix [12], BicycleGAN [32], MSGAN [15], and DSGAN [30] are included in the comparisons. DSGAN adopts a different setting (denoted as 20s in the table) by generating 20 samples per input for computing the scores. We report results under both settings.

Pix2PixHD BasisGAN.

In this experiment, we report results on high-resolution scenarios, which particularly demand efficiency and have not been previously studied by other conditional image generation methods.

Figure 4: High resolution conditional image generation. Columns show the input condition and generated diverse samples. Additional examples are provided in the supplementary material, Figure A.3.

We conduct high resolution image synthesis on Pix2PixHD [27], which is proposed to conditionally generate images with resolution up to $2048 \times 1024$. The importance of this experiment arises from the fact that existing methods [15, 32] require considerable modifications to the underlying networks, which in this case are difficult to scale to very high resolution image synthesis due to the memory limits of modern hardware. Our method requires no auxiliary network structures or special batch formulation, and is thus easy to scale to large-scale scenarios. Some generated samples are visualized in Figure 4. Quantitative results and comparisons against DSGAN [30] are reported in Table 2. BasisGAN significantly improves both diversity and fidelity with little overhead in terms of training time, testing time, and memory.

Table 2: Quantitative results on high resolution image to image translation. Diversity and fidelity are measured using LPIPS and FID, respectively.
| Methods | Pix2PixHD | DSGAN | BasisGAN |
| Diversity | 0.0 | 0.12 | 0.168 |
| Fidelity | 48.85 | 28.8 | 25.12 |

Table 3: Quantitative results on face inpainting. Diversity and fidelity are measured using LPIPS and FID, respectively.
| Methods | DSGAN | BasisGAN | BasisGAN + DSGAN |
| Diversity | 0.05 | 0.062 | 0.073 |
| Fidelity | 13.94 | 12.88 | 12.82 |

Image inpainting.

Following DSGAN [30], we conduct one-to-many image inpainting experiments on face images. Following [30], centered face images in the celebA dataset are adopted and parts of the faces are discarded by removing the center pixels. We adopt the exact same network used in [30] and replace the convolutional layers with layers with basis generators. To show the plug-and-play compatibility of the proposed BasisGAN, we conduct experiments both by training BasisGAN alone and by combining BasisGAN with the regularization based method DSGAN (BasisGAN + DSGAN). When combining BasisGAN with DSGAN, we feed all the basis generators in BasisGAN with the same latent code, and use the distance between the latent codes and the distance between generated samples to compute the regularization term proposed in [30]. Quantitative and qualitative results are in Table 3 and Figure 5, respectively. BasisGAN delivers a good balance between diversity and fidelity, while combining BasisGAN with the regularization based DSGAN further improves the performance.
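The role of the two distances in such a regularization term can be sketched as below. This is a simplified, hypothetical form: the exact norm, any clipping, and the weighting used in [30] are not reproduced, and the names `diversity_ratio`, `regularized_generator_loss`, `weight`, and `eps` are introduced only for this example.

```python
import numpy as np

def diversity_ratio(img1, img2, z1, z2, eps=1e-8):
    """Mean absolute image difference divided by mean absolute latent difference."""
    image_dist = np.mean(np.abs(img1 - img2))
    latent_dist = np.mean(np.abs(z1 - z2))
    return float(image_dist / (latent_dist + eps))

def regularized_generator_loss(adv_loss, img1, img2, z1, z2, weight=1.0):
    """Adversarial loss minus a weighted diversity ratio, so minimizing it rewards diverse outputs."""
    return adv_loss - weight * diversity_ratio(img1, img2, z1, z2)

# Toy usage: two made-up outputs produced from two latent codes for the same condition.
img_a, img_b = np.zeros(4), 2.0 * np.ones(4)
z_a, z_b = np.zeros(3), np.ones(3)
print(diversity_ratio(img_a, img_b, z_a, z_b))
print(regularized_generator_loss(0.7, img_a, img_b, z_a, z_b))
```

Because the same latent code is fed to every basis generator in the combined setting, a single pair of latent codes is enough to define the denominator of the ratio.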

Figure 5: Face inpainting examples. Columns show the input condition, BasisGAN results, and BasisGAN + DSGAN results.

6 Conclusion

In this paper, we proposed BasisGAN to model multiple modes for conditional image generation in an intrinsic way. We formulated BasisGAN as a stochastic model that allows convolutional filters to be sampled from a filter space learned by a neural network instead of being deterministic. To significantly reduce the cost of sampling high-dimensional filters, we adopted parameter reduction using filter decomposition and sampled low-dimensional basis elements, as supported by the theoretical results presented here. Stochasticity is introduced by replacing deterministic convolutional layers with stochastic layers with basis generators. BasisGAN with basis generators achieves high-fidelity and high-diversity, state-of-the-art conditional image generation, without any auxiliary training objectives or regularizations. Extensive experiments with multiple underlying models demonstrate the effectiveness and extensibility of the proposed method.

References

  • [1] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi (2016) Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pp. 523–531. Cited by: §2.
  • [2] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §2.
  • [3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180. Cited by: §2.
  • [4] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258. Cited by: §4.
  • [5] M. Denil, B. Shakibi, L. Dinh, N. De Freitas, et al. (2013) Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156. Cited by: §2.
  • [6] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. Dokania (2018) Multi-agent diverse generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8513–8521. Cited by: §1.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1, §2, §3, §3.
  • [8] D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §2.
  • [9] B. D. Haeffele and R. Vidal (2015) Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540. Cited by: §2.
  • [10] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §5.2.
  • [11] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §5.2, §5.
  • [12] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134. Cited by: §1, §2, §2, §5.2, Table 1.
  • [13] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [14] D. Krueger, C. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville (2017) Bayesian hypernetworks. arXiv preprint arXiv:1710.04759. Cited by: §2.
  • [15] Q. Mao, H. Lee, H. Tseng, S. Ma, and M. Yang (2019) Mode seeking generative adversarial networks for diverse image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §3, §5.2, §5.2, §5.2, Table 1, §5.
  • [16] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §3, §3, §3.
  • [17] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759. Cited by: §2.
  • [18] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §2.
  • [19] S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238. Cited by: §2.
  • [20] Q. Qiu, X. Cheng, R. Calderbank, and G. Sapiro (2018) DCFNet: deep neural network with decomposed convolutional filters. International Conference on Machine Learning. Cited by: §1, §4, §4.
  • [21] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays (2017) Scribbler: controlling deep image synthesis with sketch and color. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409. Cited by: §2.
  • [22] A. M. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng (2011) On random weights and unsupervised feature learning. In International Conference on Machine Learning, Vol. 2, pp. 6. Cited by: §2.
  • [23] P. Smolensky (1986) Information processing in dynamical systems: foundations of harmony theory. Technical report Colorado Univ at Boulder Dept of Computer Science. Cited by: §2.
  • [24] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491. Cited by: §2.
  • [25] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798. Cited by: §2.
  • [26] R. Vidal, J. Bruna, R. Giryes, and S. Soatto (2017) Mathematics of deep learning. arXiv preprint arXiv:1712.04741. Cited by: §2.
  • [27] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §5.2.
  • [28] X. Wang, F. Yu, R. Wang, T. Darrell, and J. E. Gonzalez (2019) TAFE-net: task-aware feature embeddings for low shot learning. arXiv preprint arXiv:1904.05967. Cited by: §2.
  • [29] W. Xian, P. Sangkloy, V. Agrawal, A. Raj, J. Lu, C. Fang, F. Yu, and J. Hays (2018) Texturegan: controlling deep image synthesis with texture patches. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8456–8465. Cited by: §2.
  • [30] D. Yang, S. Hong, Y. Jang, T. Zhao, and H. Lee (2019) Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024. Cited by: §2, §5.2, §5.2, §5.2, Table 1, §5.
  • [31] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2223–2232. Cited by: §2.
  • [32] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476. Cited by: §1, §2, §2, §3, §3, §5.2, §5.2, Table 1, §5.

Appendix A Proof of Equation (3)

Proof.

Given (2) in Section 3, the minimax game of adversarial training is expressed as:

(A.1)

Fixing the generator and considering only the discriminator:

(A.2)

The optimal discriminator in (A.2) is achieved when

(A.3)

Given the optimal discriminator, (A.2) can be written as:

(A.4)

where KL denotes the Kullback-Leibler divergence. The minimum of (A.4) is achieved iff the Jensen-Shannon divergence is zero, i.e., iff the two distributions being compared coincide. The global minimum of (A.1) is then achieved when, for every sample, the generator perfectly replicates the conditional distribution. ∎

Appendix B Proof of Theorem 4.1

Proof.

Without loss of generality, suppose that the basis elements form a linearly independent set in the space of filters, which is finite dimensional (a space of fixed-size matrices). Then the filter lying in the span of the basis elements for every realization means that there are unique coefficients s.t.

and the coefficient vector can be determined from the filter by a (deterministic) linear transform. Since each entry of the filter is a random variable, i.e., a measurable function on the underlying probability space, so is each coefficient, viewed as a mapping on the same probability space, because an invertible linear transform between finite-dimensional spaces preserves measurability. For the same reason, if the filter has a probability density, then so does each coefficient vector. Letting these coefficient vectors be the random vectors proves the statement. ∎
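The key step of the proof, that the coefficients of a filter in the span of linearly independent basis elements are unique and are obtained from the filter by a fixed linear transform, can be checked numerically. The following is a minimal sketch in plain Python, not the authors' code: the 2-by-2 basis elements, the coefficient values, and the helper names `flatten`, `solve`, and `coefficients` are all assumptions made for illustration.

```python
def flatten(matrix):
    """Flatten a 2-D list into a 1-D list, row by row."""
    return [v for row in matrix for v in row]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def coefficients(filt, bases):
    """Recover a with filt = sum_k a[k] * bases[k] via the normal equations.

    The Gram matrix is invertible exactly when the bases are linearly
    independent, so the map from filter to coefficients is linear.
    """
    B = [flatten(p) for p in bases]
    w = flatten(filt)
    gram = [[sum(x * y for x, y in zip(bi, bj)) for bj in B] for bi in B]
    rhs = [sum(x * y for x, y in zip(bi, w)) for bi in B]
    return solve(gram, rhs)


# Three linearly independent 2-by-2 basis elements (hypothetical values).
bases = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
true_a = [0.5, -2.0, 3.0]

# Compose the filter as a linear combination of the basis elements.
filt = [
    [sum(a * p[i][j] for a, p in zip(true_a, bases)) for j in range(2)]
    for i in range(2)
]

recovered = coefficients(filt, bases)
```

Because `coefficients` is a fixed linear map once the basis is chosen, any randomness in the filter carries over directly to the coefficients, which is the measurability argument used above.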

Appendix C Parameter Optimization in Filter Generation

The optimization of the parameters in filter generation is presented in Algorithm 1.

for number of iterations do
  • Sample a minibatch of pairs of samples .

  • Sample .

  • Calculate the gradient w.r.t. the convolutional filters and as in the standard setting

    where .

  • Calculate the gradient w.r.t. in the filter generator .

  • Update the parameters : ; : , where is the learning rate.

end for
Algorithm 1: Optimization of the generator parameters.
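The structure of the loop in Algorithm 1 can be illustrated with a deliberately small, framework-free sketch. Everything concrete here is an assumption for illustration rather than the paper's setup: the filter generator is a single linear map `phi` (the paper uses a neural network), the filter has three taps, a squared-error loss stands in for the adversarial objective, the discriminator and all other parameters are omitted, and the names `generate_filter`, `loss_and_filter_grad`, `generator_grad`, and `train_step` are hypothetical. The sketch shows only the two-stage gradient: first with respect to the generated filter, as in the standard setting, then through the chain rule to the filter generator parameters.

```python
import random


def generate_filter(phi, z):
    """Hypothetical linear filter generator: w = phi @ z."""
    return [sum(p * zj for p, zj in zip(row, z)) for row in phi]


def loss_and_filter_grad(w, x, y):
    """Stand-in loss (w . x - y)^2 and its gradient w.r.t. the filter w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return err * err, [2.0 * err * xi for xi in x]


def generator_grad(grad_w, z):
    """Chain rule for the linear generator: dL/dphi[i][j] = dL/dw[i] * z[j]."""
    return [[g * zj for zj in z] for g in grad_w]


def train_step(phi, x, y, z, lr):
    """One update: generate filter, backprop to it, then to phi."""
    w = generate_filter(phi, z)
    loss, grad_w = loss_and_filter_grad(w, x, y)
    grad_phi = generator_grad(grad_w, z)
    new_phi = [
        [p - lr * g for p, g in zip(p_row, g_row)]
        for p_row, g_row in zip(phi, grad_phi)
    ]
    return new_phi, loss


rng = random.Random(0)
phi = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(3)]
x, y = [1.0, -1.0, 0.5], 2.0  # toy input and target (assumed values)

for _ in range(200):
    z = [rng.gauss(0.0, 1.0) for _ in range(2)]  # sample a latent code
    phi, loss = train_step(phi, x, y, z, lr=0.01)
```

In a real implementation an automatic differentiation framework would perform both gradient stages in a single backward pass; they are written out by hand here only to make the dependency between the two stages visible.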

Appendix D Computation Comparison

We present a thorough comparison in terms of generation quality and sampled filter size in Figure A.1. It clearly shows that filter generation is prohibitively costly, whereas basis generation achieves a significantly better quality/cost trade-off, as shown by the red dot in Figure A.1.

(a) Quality/cost comparison.
(b) Generated images.
Figure A.1: (a) shows the comparison between basis generation and filter generation in terms of quality and cost. In (b), the top row shows images generated with basis generators (the red dot in (a)), and the bottom row shows images generated with filter generators at the highest cost (highest in (a)). Basis generation achieves better performance at significantly lower cost compared to filter generation. The quality metrics are introduced in Section 5.

Appendix E Ablation Studies

In this section, we perform ablation studies on the proposed BasisGAN, and evaluate multiple factors that can affect generation results. We perform ablation studies on BasisGAN adapted from the Pix2Pix model with the maps satellite dataset.

Size of basis generators. We model a basis generator using a small neural network, which consists of several hidden layers and takes as input a latent code sampled from a prior distribution. We consistently observe that a basis generator with a single hidden layer achieves the best performance while maintaining fast basis generation speed. Here we perform further experiments on the size of the intermediate layer and the size of the input latent code, and the results are presented in Table A.1. It is observed that the size of a basis generator does not significantly affect the final performance, and we use the setting in all the experiments for a good balance between performance and cost.

Dimensions   16 + 16   32 + 32   64 + 64   128 + 128   256 + 256   512 + 512
Diversity    0.2242    0.2388    0.2417    0.2448      0.2452      0.2433
Fidelity     40.16     37.41     35.54     34.36       33.70       32.31
Table A.1: Quantitative results with different sizes of input latent code and intermediate layer. Each entry in the Dimensions row gives the size of the latent code plus the size of the intermediate layer.

Appendix F Qualitative Results

F.1 Pix2Pix BasisGAN

Additional qualitative results for Pix2Pix BasisGAN are presented in Figure A.2.

Figure A.2: Pix2Pix BasisGAN. Columns, left to right: input, ground truth, generated diverse samples.

F.2 Pix2PixHD BasisGAN

Additional qualitative results for Pix2PixHD BasisGAN are presented in Figure A.3.

Figure A.3: Pix2PixHD BasisGAN. Columns, left to right: input condition, generated diverse samples.

Appendix G Speed and Memory

We use PyTorch to implement all the experiments. Training and testing are performed on a single NVIDIA 1080Ti graphics card with 11GB of memory. The comparisons of testing speed and training memory are presented in Table A.2. The training memory is measured under the standard setting with resolution of for Pix2Pix, and for Pix2PixHD.

Methods              Testing speed (s)   Training memory (MB)
Pix2Pix              0.01017             1465
Pix2Pix BasisGAN     0.01025             1439
Pix2PixHD            0.0299              8145
Pix2PixHD BasisGAN   0.0324              8137

Table A.2: Speed in testing and memory usage in training.
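To read Table A.2 as relative overheads, the following short sketch computes the percentage change of each BasisGAN variant against its baseline. The numbers are copied from the table; the helper name `relative_change` and the dictionary layout are assumptions made for illustration.

```python
# (testing time in seconds, training memory in MB), copied from Table A.2.
table = {
    "Pix2Pix": (0.01017, 1465),
    "Pix2Pix BasisGAN": (0.01025, 1439),
    "Pix2PixHD": (0.0299, 8145),
    "Pix2PixHD BasisGAN": (0.0324, 8137),
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


for base in ("Pix2Pix", "Pix2PixHD"):
    t0, m0 = table[base]
    t1, m1 = table[base + " BasisGAN"]
    print(
        f"{base}: time {relative_change(t1, t0):+.2f}%, "
        f"memory {relative_change(m1, m0):+.2f}%"
    )
```

With the values in Table A.2, the basis generator adds under 1% to the testing time for Pix2Pix and under 10% for Pix2PixHD, while the measured training memory is slightly lower in both cases.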