Conditional image generation networks learn mappings from the condition domain to the image domain by training on large numbers of samples from both domains. The mapping from a condition, e.g., a map, to an image, e.g., a satellite image, is essentially one-to-many, as illustrated in Figure 1. In other words, there exist many plausible output images that satisfy a given input condition, which motivates us to explore multi-mode conditional image generation that produces diverse images conditioned on a single input condition.
One technique to improve image generation diversity is to feed the image generator an additional latent code, in the hope that such a code can carry information not covered by the input condition, so that diverse output images are obtained by decoding the missing information conveyed through different latent codes. However, as illustrated in the seminal work [12], encoding the diversity with an input latent code can lead to unsatisfactory performance for the following reasons. When training with objectives like the GAN loss [7], regularizations like the L1 loss and the perceptual loss are imposed to improve both visual fidelity and correspondence to the input condition. However, no similar regularization is imposed to enforce the correspondence between outputs and latent codes, so the network is prone to ignoring input latent codes during training and producing identical images from an input condition even with different latent codes. Several methods have been proposed to explicitly encourage the network to take input latent codes into account to encode diversity. For example, MSGAN [15] explicitly maximizes the ratio of the distance between generated images to the distance between the corresponding latent codes, while BicycleGAN [32] applies an auxiliary network for decoding the latent codes from the generated images. Although the diversity of the generated images is significantly improved, these methods have drawbacks. In MSGAN, at least two samples generated from the same condition are needed for calculating the regularization term, which multiplies the memory footprint of each training mini-batch. The auxiliary network structures and training objectives in BicycleGAN unavoidably increase training difficulty and memory footprint. These previously proposed methods usually require considerable modifications to the underlying framework.
In this paper, we propose a stochastic model, BasisGAN, that directly maps an input condition to diverse output images, aiming at building networks that model the multi-mode intrinsically.
The proposed method exploits a known observation that a well-trained deep network can converge to significantly different sets of parameters across multiple trainings, due to factors such as different parameter initializations and different choices of mini-batches.
Therefore, instead of treating a conditional image generation network as a deterministic function with fixed parameters, we propose modeling the filter in each convolutional layer as a sample from filter space, and learning the corresponding filter space using a tiny network for efficient and diverse filter sampling.
In [6], parameter non-uniqueness is used for multi-mode image generation by training several generators with different parameters simultaneously as a multi-agent solution. However, the maximum number of modes in [6] is restricted by the number of agents, and the replication increases memory as well as computational cost.
Based on the above parameter non-uniqueness property, we introduce stochastic convolutional layers into a deep network, where filters are sampled from learned filter spaces. Specifically, we learn the mapping from a simple prior to the filter space using neural networks, here referred to as filter generators. To empower a deterministic network with multi-mode image generation, we divide the network into a deterministic sub-model and a stochastic sub-model as shown in Figure 1, where standard convolutional layers and stochastic convolutional layers with filter generators are deployed, respectively. By optimizing an adversarial loss, filter generators can be jointly trained with a conditional image generation network. In each forward pass, filters at stochastic layers are sampled by filter generators. Highly diverse images conditioned on the same input are achieved by jointly sampling filters in multiple stochastic convolutional layers.
However, the filters of a convolutional layer are usually high-dimensional when written together as one vector, which makes modeling and sampling a filter space highly costly in practice in terms of training time, sampling time, and filter generator memory footprint. Based on the low-rank property observed from sampled filters, we decompose each filter as a linear combination of a small set of basis elements, and propose to sample only low-dimensional spatial basis elements instead of filters. By replacing filter generators with basis generators, the proposed method becomes highly efficient and practical. Theoretical arguments are provided on how perturbations introduced by sampling basis elements can propagate to the appearance of generated images.
The proposed BasisGAN introduces a generalizable concept to promote diverse modes in conditional image generation. As basis generators act as plug-and-play modules, variants of BasisGAN can be easily constructed by replacing the standard convolutional layers in various state-of-the-art conditional image generation networks with stochastic layers with basis generators. We then train them directly, without additional auxiliary components, hyperparameters, or training objectives on top of the underlying models. Experimental results consistently show that the proposed BasisGAN is a simple yet effective solution to multi-mode conditional image generation. We further show empirically that the inherent stochasticity introduced by our method allows training without paired samples: one-to-many image-to-image translation is achieved using a stochastic auto-encoder, where stochasticity prevents the network from learning a trivial identity mapping.
Our contributions are summarized as follows:
We propose a plug-and-play basis generator that stochastically generates basis elements, with just a few hundred parameters, to fully embed stochasticity into network filters.
Theoretical arguments are provided to support the simplification of replacing stochastic filter generation with basis generation.
Both the generation fidelity and the diversity of the proposed BasisGAN with basis generators are validated extensively, and state-of-the-art performance is consistently observed.
2 Related Work
Conditional image generation.
Parametric modeling of the natural image distribution has been studied for years, from restricted Boltzmann machines [23] to variational autoencoders [13]; in particular, variants with conditions [17, 24, 25] show promising results. With the great power of GANs [7], conditional generative adversarial networks (cGANs) [12, 18, 21, 27, 29, 31] achieve great progress in generating visually appealing images given conditions. However, the quality of images and the faithfulness to input conditions come at the cost of image diversity, which is addressed by the proposed BasisGAN.
Multi-mode conditional image generation.
To enable cGANs to perform multi-mode image generation, pioneering works like InfoGAN [3] and pix2pix [12] propose to encode the diversity in an input latent code. To force the networks to take input latent codes into account, BicycleGAN [32] deploys auxiliary networks and training objectives to impose the recovery of the input latent code from the generated images. MSGAN [15] and DSGAN [30] propose regularization terms for diversity that enforce a larger distance between generated images with respect to different input latent codes given one input condition. These methods require considerable modifications to the underlying original framework.
Neural network parameter generation and uncertainty.
Extensive studies have been conducted on generating network parameters using another network since Hypernetworks [8]. As a seminal work on network parameter modeling, Hypernetworks successfully reduce learnable parameters by relaxing weight-sharing across layers. Follow-up works like Bayesian Hypernetworks [14] further introduce uncertainty to the generated parameters. Variational inference based methods like Bayes by Backprop [2] address the intractable posterior distribution of parameters by assuming a prior (usually Gaussian). However, the assumed prior unavoidably degrades the expressiveness of the learned distribution. The parameter prediction of neural networks is intensively studied in the context of few-shot learning [1, 19, 28], which aims to customize a network to a new task adaptively and efficiently in a data-driven way. Apart from few-shot learning, [5] suggests parameter prediction as a way to study the redundancy in neural networks. While studying the representation power of random weights, [22] also suggests the uncertainty and non-uniqueness of network parameters. Another family of networks with uncertainty is based on variational inference, where an assumption on the distribution of network weights is imposed for tractable learning of the weight distribution. Works studying the relationship between local and global minima of deep networks [9, 26] also suggest the non-uniqueness of the optimal parameters of a deep network.
3 Stochastic Filter Generation
A conditional generative network (cGAN) [16] learns the mapping from an input condition domain $\mathcal{A}$ to an output image domain $\mathcal{B}$ using a deep neural network. Conditional image generation is essentially a one-to-many mapping, as there could be multiple plausible instances $B \in \mathcal{B}$ that map to a condition $A \in \mathcal{A}$, corresponding to a distribution $p(B|A)$. However, the naive mapping of the generator formulated by a neural network $G: A \mapsto B$ is deterministic, and is incapable of covering the distribution $p(B|A)$. We exploit the non-uniqueness of network parameters as discussed above, and introduce stochasticity into convolutional filters through plug-and-play filter generators. To achieve this, we divide a network into two sub-models:
A deterministic sub-model with convolutional filters $\phi$ that remain fixed after training;
A stochastic sub-model whose convolutional filters $w$ are sampled from parameter spaces modeled by neural networks $T$, referred to as filter generators, parametrized by $\theta$, with inputs $z$ drawn from a prior distribution $p(z)$ (the same simple prior is used for all experiments in this paper).
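Note that filters in each stochastic layer are modeled with a separate neural network, which is not explicitly shown in the formulation for notational brevity. With this formulation, the conditional image generation becomes $G_{\phi,\theta}: A \mapsto B$, with stochasticity achieved by sampling filters $w = T_\theta(z)$ for the stochastic sub-model in each forward pass. The conditional GAN loss [7, 16] then becomes
\begin{align}
\min_G \max_D V(D, G) ={} & \mathbb{E}_{A \sim p(A),\, B \sim p(B|A)} \big[ \log D(A, B) \big] \tag{1} \\
& + \mathbb{E}_{A \sim p(A),\, z \sim p(z)} \big[ \log \big( 1 - D(A, G(A; T_\theta(z))) \big) \big], \tag{2}
\end{align}
where $A$ is an input condition, $B$ is an image, $z$ is a sample from the prior $p(z)$, and $D$ denotes a standard discriminator. Note that we represent the generator here as $G(A; T_\theta(z))$ to emphasize that the generator uses stochastic filters $w = T_\theta(z)$.
When the optimal discriminator is achieved, the loss in (1)-(2) can be reformulated as
$$C(G) = \mathbb{E}_{A \sim p(A)} \big[ -\log 4 + 2 \cdot \mathrm{JSD}\big( p(B|A) \,\|\, p_G(B|A) \big) \big], \tag{3}$$
where $\mathrm{JSD}$ is the Jensen-Shannon divergence and $p_G(B|A)$ denotes the distribution of generated images $G(A; T_\theta(z))$ with $z \sim p(z)$ (the proof is provided in the supplementary material). The global minimum of (3) is achieved when, for every sampled condition $A$, the generator perfectly replicates the true distribution $p(B|A)$, which indicates that by directly optimizing the loss in (1)-(2), conditional image generation with diversity is achieved with the proposed stochasticity in the convolutional filters.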
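The relation between the optimal-discriminator value and the Jensen-Shannon divergence can be checked numerically on a toy case. The sketch below is illustrative only: it assumes a single fixed condition and small discrete distributions, and the function names (`kl`, `jsd`, `value_at_optimal_d`) and the example distributions are hypothetical choices, not part of the original method.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_at_optimal_d(p, q):
    """GAN value for one fixed condition with D*(b) = p(b) / (p(b) + q(b))."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi + qi == 0:
            continue  # outcome has no mass under either distribution
        d_star = pi / (pi + qi)
        if pi > 0:
            total += pi * math.log(d_star)
        if qi > 0:
            total += qi * math.log(1.0 - d_star)
    return total


p_true = [0.5, 0.3, 0.2]  # toy "true" conditional distribution (assumed)
q_gen = [0.2, 0.2, 0.6]   # toy generator distribution (assumed)

lhs = value_at_optimal_d(p_true, q_gen)
rhs = -math.log(4) + 2 * jsd(p_true, q_gen)
print(abs(lhs - rhs) < 1e-9)
```

When the two distributions coincide, the divergence term vanishes and the value reduces to $-\log 4$, which is the global minimum discussed above.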
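To optimize (1), we train the discriminator $D$ as in [7] to maximize the probability of assigning the correct label to both training examples and samples from the generator $G_{\phi,\theta}$. Simultaneously, we train $G_{\phi,\theta}$ to minimize the following loss, where the filter generators $T_\theta$ are jointly optimized to bring stochasticity:
$$\mathcal{L}_G = \mathbb{E}_{A \sim p(A),\, z \sim p(z)} \big[ \log \big( 1 - D(A, G(A; T_\theta(z))) \big) \big].$$
We describe in detail the optimization of the generator parameters in Algorithm 1 of the supplementary material.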
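The sampling step described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the filter generator is assumed to be a two-layer MLP, the prior is assumed to be a standard Gaussian, and all names (`make_filter_generator`, `sample_filters`, `generator_loss`) and sizes are hypothetical.

```python
import math
import random


def make_filter_generator(latent_dim, hidden_dim, filter_size, rng):
    """Build a toy filter generator: an MLP mapping z to a flat filter vector."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(latent_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(filter_size)]

    def generate(z):
        hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return generate


def sample_filters(generator, latent_dim, rng):
    """Draw z from the (assumed Gaussian) prior and map it to filters."""
    z = [rng.gauss(0, 1) for _ in range(latent_dim)]
    return generator(z)


def generator_loss(disc_scores):
    """Monte Carlo estimate of E[log(1 - D(A, G(A; T(z))))] from scores in (0, 1)."""
    return sum(math.log(1.0 - d) for d in disc_scores) / len(disc_scores)


rng = random.Random(0)
L, C_in, C_out, latent_dim = 3, 2, 4, 8  # hypothetical layer shape
T = make_filter_generator(latent_dim, 16, L * L * C_in * C_out, rng)

# Two forward passes draw two different filter sets for the same layer.
w_a = sample_filters(T, latent_dim, rng)
w_b = sample_filters(T, latent_dim, rng)
print(len(w_a), w_a != w_b)
```

Because a new latent code is drawn in every forward pass, the same input condition is processed by different filters each time, which is the source of output diversity.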
Discussions on diversity modeling in cGANs.
The goal of cGAN is to model the conditional probability $p(B|A)$. Previous cGAN models [15, 16, 32] typically incorporate randomness in the generator by setting $B = G(A, z)$, $z \sim p(z)$, where $G$ is a deep network with deterministic parametrization and the randomness is introduced via $z$, e.g., a latent code, as an extra input. This formulation implicitly makes the following two assumptions: (A1) The randomness of the generator is independent from that of $p(A)$; (A2) Each realization $B(\omega)$ conditional on $A$ can be modeled by a CNN, i.e., $B(\omega) = G(A; \omega)$, where $G(\cdot\,; \omega)$ is a draw from an ensemble of CNNs, $\omega$ being the random event. (A1) is reasonable as long as the source of variation to be modeled by cGAN is independent from that contained in $A$, and the rationale of (A2) lies in the expressive power of CNNs for image-to-image translation. The previous model adopts a specific form of $G(A; \omega)$ via feeding a random input $z(\omega)$ to $G$, yet one may observe that the most general formulation under (A1) and (A2) would be to sample the generator itself from a certain distribution $p(G)$, which is independent from $p(A)$. Since generative CNNs are parametrized by convolutional filters, this would be equivalent to setting $B = G(A; w)$, $w \sim p(w)$, where we use ";" in the parentheses to emphasize that what follows it is the parametrization of the generator. The proposed cGAN model in the current paper indeed takes such an approach, where we model $p(w)$ by a separate filter generator network.
4 Stochastic Basis Generation
Using the method above, filters of each stochastic layer are generated in the form of a high-dimensional vector of size $L \times L \times C' \times C$, where $L$, $C'$, and $C$ correspond to the kernel size and the numbers of input and output channels, respectively. Although directly generating such high-dimensional vectors is feasible, it can be highly costly in terms of training time, sampling time, and memory footprint as the network scale grows. We present a thorough comparison in terms of generated quality and sampled filter size in supplementary material Figure A.1, where it is clearly shown that filter generation is too costly to afford. In this section, we propose to replace filter generation with basis generation to achieve the quality/cost trade-off shown by the red dot in supplementary material Figure A.1. Details on the memory and computational cost are also provided at the end of the supplementary material.
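A short calculation makes the cost gap concrete. The sketch below compares the number of values a generator must output per layer when sampling full filters versus sampling only spatial basis elements, as proposed later in this section. The layer shape and the number of basis elements are hypothetical values chosen for illustration, not settings reported in the paper.

```python
def filter_vector_size(kernel_size, c_in, c_out):
    """Values to generate when sampling all filters of one layer directly."""
    return kernel_size * kernel_size * c_in * c_out


def basis_vector_size(kernel_size, num_basis):
    """Values to generate when sampling only spatial basis elements."""
    return kernel_size * kernel_size * num_basis


L, C_in, C_out = 3, 256, 256  # hypothetical layer shape
K = 8                         # hypothetical number of basis elements

full = filter_vector_size(L, C_in, C_out)  # 3 * 3 * 256 * 256 = 589824
basis = basis_vector_size(L, K)            # 3 * 3 * 8 = 72
print(full, basis, full // basis)
```

The output size of a basis generator does not depend on the channel counts, so its cost stays flat as layers get wider.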
For convolutional filters, the weights $w$ form a 3-way tensor involving a spatial index and two channel indices for the input and output channels, respectively. Tensor low-rank decomposition cannot be defined in a unique way. For convolutional filters, a natural solution then is to separate out the spatial index, which leads to depth-separable network architectures [4]. Among other studies of low-rank factorization of convolutional layers, [20] proposes to approximate a convolutional filter using a set of prefixed basis elements linearly combined by learned reconstruction coefficients.
Given that the weights in convolutional layers may have a low-rank structure, we collect a large number of generated filters and reshape the stack of $N$ sampled filters into a 2-dimensional matrix $F$ of size $J \times J'$, where $J = N \times L \times L$ and $J' = C' \times C$. We consistently observe that $F$ is always of low effective rank, regardless of the network scales we use to estimate the filter distribution. If we assume that a collection of generated filters follows such a low-rank structure, the following theorem proves that it suffices to generate bases in order to generate the desired distribution of filters.
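Theorem 4.1. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{K}$ a 3-way random tensor, where $\mathcal{K}$ maps each event $\omega \in \Omega$ to $\mathcal{K}(\omega) = (\mathcal{K}(\omega)_{u,\lambda',\lambda})$, with spatial index $u$ and channel indices $\lambda' \in [C']$, $\lambda \in [C]$. For each fixed $\omega$ and $u$, $\mathcal{K}(\omega)_u := (\mathcal{K}(\omega)_{u,\lambda',\lambda})_{\lambda',\lambda} \in \mathbb{R}^{C' \times C}$. If there exists a set of deterministic linear transforms $a_k$, $k = 1, \dots, K$, in $\mathbb{R}^{C' \times C}$ s.t. $\mathcal{K}(\omega)_u \in \mathrm{span}\{a_k\}_{k=1}^{K}$ for any $\omega$ and $u$, then there exist $K$ random vectors $b_k$, $k = 1, \dots, K$, indexed by $u$, s.t. $\mathcal{K}_{u,\lambda',\lambda} = \sum_{k=1}^{K} b_k(u)\, a_k(\lambda', \lambda)$ in distribution. If $\mathcal{K}$ has a probability density, then so do $\{b_k\}_{k=1}^{K}$. (The proof of the theorem is provided in the supplementary material.)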
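The structure assumed by the theorem can be illustrated with a small exact computation: when every spatial slice of every sampled filter lies in the span of $K$ fixed coefficient matrices, the stacked filter matrix has rank at most $K$. The sketch below is an illustration under assumptions, not the paper's experiment: it uses small random integer values so that the rank can be computed exactly, and the helper names are hypothetical.

```python
from fractions import Fraction
import random


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


def stacked_filter_matrix(num_samples, kernel_size, coeffs, rng):
    """Stack sampled filters into an (N*L*L) x (C'*C) matrix.

    Each row is one spatial position u of one sampled filter, built as
    sum_k b_k(u) * a_k with fixed flattened coefficient matrices a_k.
    """
    rows = []
    for _ in range(num_samples * kernel_size * kernel_size):
        b = [rng.randint(-5, 5) for _ in coeffs]  # sampled basis values b_k(u)
        rows.append([sum(bk * ak[j] for bk, ak in zip(b, coeffs))
                     for j in range(len(coeffs[0]))])
    return rows


rng = random.Random(0)
K, C_in, C_out = 3, 4, 5  # hypothetical sizes
coeffs = [[rng.randint(-3, 3) for _ in range(C_in * C_out)] for _ in range(K)]
F = stacked_filter_matrix(num_samples=10, kernel_size=3, coeffs=coeffs, rng=rng)
rank = matrix_rank(F)
print(len(F), len(F[0]), rank <= K)
```

Here the matrix has 90 rows and 20 columns, yet its rank cannot exceed the number of coefficient matrices, which mirrors the low effective rank observed on sampled filters.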
We simplify the expensive filter generation problem by decomposing each filter as a linear combination of a small set of basis elements, and then sampling basis elements instead of filters directly. In our method, we assume that the diverse modes of conditional image generation are essentially caused by spatial perturbations; thus, we propose to introduce stochasticity into the spatial basis elements. Specifically, we apply convolutional filter decomposition as in [20] to write $w = \psi a$, with $\psi \in \mathbb{R}^{L \times L \times K}$ and $a \in \mathbb{R}^{K \times C' \times C}$, where $\psi$ are basis elements, $a$ are decomposition coefficients, and $K$ is a pre-defined small value. We keep the decomposition coefficients $a$ deterministic and learn them directly from training samples. Instead of using predefined basis elements as in [20], we adopt a basis generator to sample the basis elements $\psi$, which dramatically reduces the difficulty of modeling the corresponding probability distribution. The costly filter generators in Section 3 are now replaced by much more efficient basis generators, and stochastic filters are then constructed by linearly combining sampled basis elements with the deterministic coefficients. The convolutional filter reconstruction is illustrated as a part of Figure 1. As illustrated in this figure, BasisGAN is constructed by replacing the standard convolutional layers with the proposed stochastic convolutional layers with basis generators, and the network parameters can be learned without additional auxiliary training objectives or regularization.
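The filter reconstruction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the basis generator is assumed to be a two-layer MLP fed with a Gaussian latent code, the coefficients are random stand-ins for learned values, and all names and sizes are hypothetical.

```python
import math
import random


def make_basis_generator(latent_dim, hidden_dim, kernel_size, num_basis, rng):
    """Tiny MLP mapping a latent code z to num_basis spatial basis elements."""
    n_spatial = kernel_size * kernel_size
    w1 = [[rng.gauss(0, 0.1) for _ in range(latent_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)]
          for _ in range(num_basis * n_spatial)]

    def generate(z):
        hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in w1]
        flat = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
        # psi[k][u]: value of basis element k at spatial position u
        return [flat[k * n_spatial:(k + 1) * n_spatial] for k in range(num_basis)]

    return generate


def reconstruct_filters(psi, coeffs):
    """Return w[u][c] = sum_k psi[k][u] * coeffs[k][c]."""
    n_spatial, n_pairs = len(psi[0]), len(coeffs[0])
    return [[sum(psi[k][u] * coeffs[k][c] for k in range(len(psi)))
             for c in range(n_pairs)] for u in range(n_spatial)]


rng = random.Random(0)
L, K, C_in, C_out, latent_dim = 3, 4, 2, 3, 8  # hypothetical sizes
basis_generator = make_basis_generator(latent_dim, 16, L, K, rng)

# Deterministic coefficients (learned in practice, random stand-ins here),
# one flattened C_in x C_out matrix per basis element.
coeffs = [[rng.gauss(0, 1) for _ in range(C_in * C_out)] for _ in range(K)]

z = [rng.gauss(0, 1) for _ in range(latent_dim)]
psi = basis_generator(z)                # sampled basis elements
w = reconstruct_filters(psi, coeffs)    # L*L rows, C_in*C_out columns
print(len(psi), len(psi[0]), len(w), len(w[0]))
```

Only the small basis tensor is resampled per forward pass; the coefficients stay fixed, so the stochastic part of the layer remains independent of the channel counts.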
5 Experiments
In this section, we conduct experiments on multiple conditional generation tasks. Our primary objective is to show that, thanks to the inherent stochasticity of the proposed BasisGAN, multi-mode conditional image generation can be learned without any additional regularization that explicitly promotes diversity. The effectiveness of the proposed BasisGAN is demonstrated by quantitative and qualitative results on multiple tasks and underlying models. We start with a stochastic auto-encoder example to demonstrate the inherent stochasticity brought by the basis generator. We then proceed to image-to-image translation tasks, and compare the proposed method with: the regularization-based methods DSGAN [30] and MSGAN [15], which adopt explicit regularization terms that encourage a larger distance between output images with different latent codes; the model-based method MUNIT [11], which explicitly decouples appearance from content and achieves diverse image generation by manipulating the appearance code; and BicycleGAN [32], which uses auxiliary networks to encourage the diversity of the generated images with respect to the input latent code. We further demonstrate that, as a fundamental way to inject randomness into conditional image generation, our method is compatible with existing regularization-based methods, which can be adopted together with our proposed method for further performance improvements. Finally, extensive ablation studies are provided in the supplementary material.
5.1 Stochastic Auto-encoder
The inherent stochasticity of the proposed BasisGAN allows learning a conditional one-to-many mapping even without paired samples for training. We validate this with a variant of BasisGAN referred to as a stochastic auto-encoder, which is trained to perform simple self-reconstruction with real-world images as inputs. Only the L1 loss and the GAN loss are imposed to promote fidelity and correspondence. However, thanks to the inherent stochasticity of BasisGAN, we observe that the network does not collapse to a trivial identity mapping, and diverse outputs with strong correspondence to the input images are generated with appealing fidelity. Some illustrative results are shown in Figure 2.
Figure 2. Columns: input, generated diverse samples.
5.2 Image to Image Translation
To faithfully validate the fidelity and diversity of generated images, we follow prior work and evaluate the performance quantitatively using the following metrics:
LPIPS. The diversity of generated images is measured using LPIPS, which computes the distance between images in a feature space. Generated images with higher diversity give higher LPIPS scores, which are more favourable in conditional image generation.
FID. FID [10] is used to measure the fidelity of the generated images. It computes the distance between the distribution of the generated images and that of the true images. Since the goal of the entire GAN family is to faithfully model the true data distribution parametrically, a lower FID is favourable in our case, as it reflects a closer fit to the desired distribution.
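The two metrics can be illustrated with simplified stand-ins. The sketch below is not LPIPS or FID as used in the experiments: diversity is computed as the mean pairwise distance between samples with a plain L1 distance in place of deep features, and fidelity uses the closed-form Fréchet distance between two one-dimensional Gaussians instead of multivariate Inception statistics. All function names are hypothetical.

```python
import itertools
import statistics


def l1_distance(a, b):
    """Mean absolute difference; a stand-in for a learned perceptual distance."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mean_pairwise_distance(samples, distance):
    """Diversity score: average distance over all pairs of generated samples."""
    pairs = list(itertools.combinations(samples, 2))
    if not pairs:
        raise ValueError("at least two samples are required")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def frechet_distance_1d(real_feats, fake_feats):
    """Frechet distance between 1-D Gaussians fitted to two feature sets."""
    mu_r, mu_f = statistics.mean(real_feats), statistics.mean(fake_feats)
    sd_r, sd_f = statistics.pstdev(real_feats), statistics.pstdev(fake_feats)
    return (mu_r - mu_f) ** 2 + (sd_r - sd_f) ** 2


identical = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # collapsed generator
diverse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]    # varied outputs

print(mean_pairwise_distance(identical, l1_distance))
print(mean_pairwise_distance(diverse, l1_distance))
print(frechet_distance_1d([0, 2], [1, 3]))
```

A generator that ignores its source of randomness scores zero on the diversity measure, which matches the near-zero diversity reported for the deterministic baseline.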
As one of the most prevalent conditional image generation networks, Pix2Pix [12] serves as a solid baseline for many multi-mode conditional image generation methods. It achieves conditional image generation by feeding the generator a conditional image, and training the generator to synthesize images with both a GAN loss and an L1 loss to the ground truth image. Typical applications of Pix2Pix include edge maps → shoes or handbags, maps → satellite images, and so on. We adopt the ResNet-based Pix2Pix model, and impose the proposed stochasticity in the successive residual blocks, where regular convolutional layers and convolutional layers with basis generators are applied alternately to the feature maps. The network is re-trained from scratch directly without any extra loss functions or regularizations. Some samples are visualized in Figure 3. For a fair comparison with previous works [11, 12, 15, 32, 30], we perform the quantitative evaluations on image-to-image translation tasks, and the results are presented in Table 1. As discussed, all the state-of-the-art methods require considerable modifications to the underlying framework. By simply using the proposed stochastic basis generators as plug-and-play modules in the Pix2Pix model, BasisGAN generates significantly more diverse images while maintaining quality comparable to other state-of-the-art methods.
Figure 3. Columns: input, ground truth, generated diverse samples.
| Methods | Pix2Pix | BicycleGAN | MSGAN | BasisGAN | DSGAN (20s) | BasisGAN (20s) |
|---|---|---|---|---|---|---|
| Diversity (LPIPS, higher is better) | 0.0003 ± 0.0000 | 0.1413 ± 0.0005 | 0.1894 ± 0.0011 | 0.2648 ± 0.004 | 0.18 | 0.2594 ± 0.004 |
| Fidelity (FID, lower is better) | 139.19 ± 2.94 | 98.85 ± 1.21 | 92.84 ± 1.00 | 88.7 ± 1.28 | 57.20 | 24.14 ± 0.76 |

| Methods | Pix2Pix | BicycleGAN | MSGAN | BasisGAN | DSGAN (20s) | BasisGAN (20s) |
|---|---|---|---|---|---|---|
| Diversity (LPIPS, higher is better) | 0.0016 ± 0.0003 | 0.1150 ± 0.0007 | 0.2189 ± 0.0004 | 0.2417 ± 0.005 | 0.13 | 0.2398 ± 0.005 |
| Fidelity (FID, lower is better) | 168.99 ± 2.58 | 145.78 ± 3.90 | 152.43 ± 2.52 | 35.54 ± 2.19 | 49.92 | 28.92 ± 1.88 |
|Dataset||Edge Handbag||Edge Shoe|
|Diversity||0.32 0.624||0.35 0.810||0.217 0.512||0.242 0.743|
|Fidelity||92.84 0.121||88.76 0.513||62.57 0.917||64.17 1.14|
In this experiment, we report results on high-resolution scenarios, which particularly demand efficiency and have not been previously studied by other conditional image generation methods.
Figure 4. Columns: input condition, generated diverse samples.
We conduct high-resolution image synthesis on Pix2PixHD [27], which is designed to conditionally generate high-resolution images. The importance of this experiment arises from the fact that existing methods [15, 32] require considerable modifications to the underlying networks, which in this case makes them difficult to scale to very high-resolution image synthesis due to the memory limits of modern hardware. Our method requires no auxiliary network structures or special batch formulation, and is thus easy to scale to large-scale scenarios. Some generated samples are visualized in Figure 4. Quantitative results and comparisons against DSGAN [30] are reported in Table 2. BasisGAN significantly improves both diversity and fidelity with little overhead in terms of training time, testing time, and memory.
Following DSGAN [30], we conduct one-to-many image inpainting experiments on face images. As in [30], centered face images in the CelebA dataset are adopted, and parts of the faces are discarded by removing the center pixels. We adopt the exact same network used in [30] and replace the convolutional layers with layers with basis generators. To show the plug-and-play compatibility of the proposed BasisGAN, we conduct experiments both by training BasisGAN alone and by combining BasisGAN with the regularization-based method DSGAN (BasisGAN + DSGAN). When combining BasisGAN with DSGAN, we feed all the basis generators in BasisGAN the same latent code, and use the distance between the latent codes and the distance between generated samples to compute the regularization term proposed in [30]. Quantitative and qualitative results are shown in Table 3 and Figure 5, respectively. BasisGAN delivers a good balance between diversity and fidelity, while combining BasisGAN with the regularization-based DSGAN further improves the performance.
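The combination with a distance-based regularizer can be sketched as below. This is an assumption-laden illustration rather than the exact DSGAN term: it computes the ratio between the distance of two generated samples and the distance of their latent codes, clips it at a bound, and negates it so that minimizing the loss rewards diversity. The L1 distance, the bound `tau`, the small constant `eps`, and the function names are all hypothetical choices.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def diversity_regularizer(img_1, img_2, z_1, z_2, tau=1.0, eps=1e-8):
    """Negative clipped ratio of sample distance to latent-code distance.

    img_1 and img_2 are outputs for the same condition generated with the
    latent codes z_1 and z_2; tau bounds the reward for diversity.
    """
    ratio = l1(img_1, img_2) / (l1(z_1, z_2) + eps)
    return -min(ratio, tau)


# Identical outputs for different latent codes earn no reward.
print(diversity_regularizer([0.0, 0.0], [0.0, 0.0], [0.0], [1.0]))
# Very different outputs hit the clipping bound.
print(diversity_regularizer([0.0], [4.0], [0.0], [1.0], tau=1.0))
```

Since all basis generators receive the same latent code in this setting, a single pair of codes suffices to evaluate the term for a pair of generated samples.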
Figure 5. Columns: input condition, BasisGAN, BasisGAN + DSGAN.
6 Conclusion
In this paper, we proposed BasisGAN to model multiple modes in conditional image generation in an intrinsic way. We formulated BasisGAN as a stochastic model that allows convolutional filters to be sampled from a filter space learned by a neural network instead of being deterministic. To significantly reduce the cost of sampling high-dimensional filters, we adopted parameter reduction using filter decomposition and sampled low-dimensional basis elements, as supported by the theoretical results presented here. Stochasticity is introduced by replacing deterministic convolutional layers with stochastic layers with basis generators. BasisGAN with basis generators achieves high-fidelity and high-diversity, state-of-the-art conditional image generation, without any auxiliary training objectives or regularizations. Extensive experiments with multiple underlying models demonstrate the effectiveness and extensibility of the proposed method.
References
[1] (2016) Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pp. 523–531.
[2] (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622.
[3] (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
[4] Xception: deep learning with depthwise separable convolutions. pp. 1251–1258.
[5] (2013) Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156.
[6] (2018) Multi-agent diverse generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8513–8521.
[7] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[8] (2016) Hypernetworks. arXiv preprint arXiv:1609.09106.
[9] (2015) Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540.
[10] (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
[11] (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189.
[12] Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
[13] (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[14] (2017) Bayesian hypernetworks. arXiv preprint arXiv:1710.04759.
[15] (2019) Mode seeking generative adversarial networks for diverse image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition.
[16] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[17] (2016) arXiv preprint arXiv:1601.06759.
[18] (2016) Context encoders: feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
[19] (2018) Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238.
[20] (2018) DCFNet: deep neural network with decomposed convolutional filters. International Conference on Machine Learning.
[21] (2017) Scribbler: controlling deep image synthesis with sketch and color. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409.
[22] (2011) On random weights and unsupervised feature learning. In International Conference on Machine Learning, Vol. 2, pp. 6.
[23] (1986) Information processing in dynamical systems: foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science.
[24] (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
[25] (2016) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.
[26] (2017) Mathematics of deep learning. arXiv preprint arXiv:1712.04741.
[27] (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition.
[28] (2019) TAFE-net: task-aware feature embeddings for low shot learning. arXiv preprint arXiv:1904.05967.
[29] (2018) Texturegan: controlling deep image synthesis with texture patches. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8456–8465.
[30] (2019) Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024.
[31] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2223–2232.
[32] (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476.
Appendix A Proof of Equation (3)
Given (2) in Section 3, the minimax game of adversarial training is expressed as:
By fixing and only consider:
The optimal discriminator in (A.2) is achieved when
Given the optimal discriminator , (A.2) is expressed as:
where is the Kullback-Leibler divergence. The minimum of is achieved iff the Jensen-Shannon divergence is and . The global minimum of (A.1) is therefore achieved when, for every sampled , the generator perfectly replicates the conditional distribution . ∎
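For reference, the proof follows the standard optimality argument for adversarial training. The sketch below restates it in assumed notation: conditions $A$, images $B$, the true conditional distribution $p(B|A)$, and the conditional distribution $p_g(B|A)$ induced by the generator. These symbols are assumptions chosen for illustration and may differ from those used in Section 3.

```latex
\begin{align}
\min_G \max_D V(D, G)
  &= \mathbb{E}_{A \sim p(A)} \Big[ \mathbb{E}_{B \sim p(B|A)} \log D(A, B)
     + \mathbb{E}_{B \sim p_g(B|A)} \log\big(1 - D(A, B)\big) \Big], \\
D^*(A, B)
  &= \frac{p(B|A)}{p(B|A) + p_g(B|A)}, \\
V(D^*, G)
  &= \mathbb{E}_{A \sim p(A)} \Big[ -\log 4
     + \mathrm{KL}\big(p(\cdot|A) \,\|\, m_A\big)
     + \mathrm{KL}\big(p_g(\cdot|A) \,\|\, m_A\big) \Big],
     \qquad m_A = \tfrac{1}{2}\big(p(\cdot|A) + p_g(\cdot|A)\big), \\
  &= -\log 4 + 2\, \mathbb{E}_{A \sim p(A)} \Big[ \mathrm{JSD}\big(p(\cdot|A) \,\|\, p_g(\cdot|A)\big) \Big].
\end{align}
```

Since the Jensen-Shannon divergence is non-negative and vanishes only when its two arguments coincide, the value $-\log 4$ is attained iff $p_g(B|A) = p(B|A)$ for almost every $A$.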
Appendix B Proof of Theorem 4.1
Without loss of generality, suppose that is a linearly independent set in the space of , which is finite dimensional (the space of -by- matrices). Then is in the span of for any means that there are unique coefficients s.t.
and the vector
can be determined from by a (deterministic) linear transform.
Since each entry is a random variable, i.e. a measurable function on , then so is viewed as a mapping from to , for each and , because an invertible linear transform between finite-dimensional spaces preserves measurability. For the same reason, if has a probability density, then so does each . Letting be the random vectors proves the statement. ∎
Appendix C Parameter Optimization in Filter Generation
The optimization of the parameters in filter generation is presented in Algorithm 1.
Sample a minibatch of pairs of samples .
Calculate the gradient w.r.t. the convolutional filters and as in the standard setting
Calculate the gradient w.r.t. in the filter generator .
Update the parameters : ; : , where is the learning rate.
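The two gradient steps above can be illustrated with a minimal sketch in plain Python. Everything concrete here is an assumption made for illustration: the filter generator is a toy linear map `K = W z + b`, the "network" is a single inner product between the generated filter and the input, and the loss is a squared error. Only the structure (gradient with respect to the generated filter first, then the chain rule into the generator parameters, then a gradient step with learning rate `lr`) mirrors Algorithm 1.

```python
def generate_filter(W, b, z):
    """Hypothetical linear filter generator: maps a latent code z to a filter K = W z + b."""
    return [sum(w * zj for w, zj in zip(row, z)) + bi for row, bi in zip(W, b)]


def train_step(W, b, x, y, z, lr):
    """One gradient step on the generator parameters (W, b) for a single pair (x, y)."""
    K = generate_filter(W, b, z)                # filter produced from the latent code
    pred = sum(k * xi for k, xi in zip(K, x))   # toy network: inner product of filter and input
    err = pred - y
    loss = 0.5 * err * err                      # assumed squared-error objective

    # Step 1: gradient w.r.t. the generated filter, as in the standard setting.
    grad_K = [err * xi for xi in x]

    # Step 2: chain rule from the filter into the generator parameters.
    grad_W = [[gk * zj for zj in z] for gk in grad_K]
    grad_b = list(grad_K)

    # Step 3: gradient descent update with learning rate lr.
    new_W = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    new_b = [bi - lr * g for bi, g in zip(b, grad_b)]
    return new_W, new_b, loss
```

In an automatic differentiation framework the two gradient steps collapse into a single backward pass; they are written out separately here only to make the dependency of the generator update on the filter gradient explicit.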
Appendix D Computation Comparison
We present a thorough comparison in terms of generation quality and sampled filter size in Figure A.1. It clearly shows that direct filter generation is too costly to afford, whereas basis generation achieves a significantly better quality/cost trade-off, as indicated by the red dot in Figure A.1.
Appendix E Ablation Studies
In this section, we perform ablation studies on the proposed BasisGAN and evaluate multiple factors that can affect generation results. All ablation studies use BasisGAN adapted from the Pix2Pix model with the maps satellite dataset.
Size of basis generators. We model a basis generator using a small neural network, which consists of several hidden layers and takes as input a latent code sampled from a prior distribution. We consistently observe that a basis generator with a single hidden layer achieves the best performance while maintaining fast basis generation speed. Here we perform further experiments on the size of intermediate layers and the input latent code size, and the results are presented in Table A.1. We observe that the size of a basis generator does not significantly affect the final performance, and we use the setting in all the experiments for a good balance between performance and cost.
|Dimensions||16+16||32 + 32||64 + 64||128 + 128||256 + 256||512 + 512|
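A basis generator of the kind described above can be sketched as a single-hidden-layer network in plain Python. The sketch rests on several assumptions that are not stated in this appendix: each table entry such as 64 + 64 is read as latent code size plus hidden layer size, the prior is a standard Gaussian, the activation is a ReLU, and the output is reshaped into `num_bases` bases of size `kernel_size` by `kernel_size`. All function and parameter names are hypothetical.

```python
import math
import random


def init_linear(rng, n_in, n_out):
    """Random weights and zero biases for a fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def linear(W, b, v):
    """Apply a fully connected layer to a vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def make_basis_generator(latent_dim=64, hidden_dim=64, num_bases=7, kernel_size=3, seed=0):
    """Build a toy basis generator: latent code -> one hidden layer -> set of bases."""
    rng = random.Random(seed)
    k2 = kernel_size * kernel_size
    W1, b1 = init_linear(rng, latent_dim, hidden_dim)
    W2, b2 = init_linear(rng, hidden_dim, num_bases * k2)

    def generate(z):
        hidden = [max(0.0, a) for a in linear(W1, b1, z)]  # single hidden layer, ReLU
        flat = linear(W2, b2, hidden)
        # Reshape the flat output into num_bases bases, each kernel_size x kernel_size.
        return [
            [flat[i * k2 + r * kernel_size: i * k2 + (r + 1) * kernel_size]
             for r in range(kernel_size)]
            for i in range(num_bases)
        ]

    return generate


def sample_latent(rng, latent_dim=64):
    """Draw a latent code from an assumed standard Gaussian prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
```

Calling `generate` with two different latent codes yields two different sets of bases for the same layer, which is the source of output diversity; because the network is this small, drawing a new set of bases is cheap.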
Appendix F Qualitative Results
F.1 Pix2Pix BasisGAN
Additional qualitative results for Pix2Pix BasisGAN are presented in Figure A.2.
|Input||Ground truth||Generated diverse samples|
F.2 Pix2PixHD BasisGAN
Additional qualitative results for Pix2PixHD BasisGAN are presented in Figure A.3.
|Input condition||Generated diverse samples|
Appendix G Speed and Memory
We use PyTorch to implement all the experiments. Training and testing are performed on a single NVIDIA 1080Ti graphics card with 11GB memory. The comparisons of testing speed and training memory are presented in Table A.2. The training memory is measured under the standard setting with resolution of for Pix2Pix, and for Pix2PixHD.
|Methods||Testing speed (s)||Training memory (MB)|