Training Generative Adversarial Networks from Incomplete Observations using Factorised Discriminators

05/29/2019 · Daniel Stoller et al. · Queen Mary University of London, Spotify

Generative adversarial networks (GANs) have shown great success in applications such as image generation and inpainting. To stabilise the challenging training process, one typically requires large datasets, which are not available for many tasks. In many cases, large amounts of incomplete observations are additionally available and could be exploited, but it remains unclear how to train a GAN in such a setting. To address this shortcoming, we factorise the high-dimensional joint distribution of the complete data into a set of lower-dimensional distributions along with their dependencies. As a consequence, we can split the discriminator in a GAN into multiple "sub-discriminators" that can be independently trained from incomplete observations. Their outputs can be combined to obtain an estimate of the density ratio between the joint real and the generator distribution, which enables training the generator as in the original GAN framework. As an additional benefit, our modularisation facilitates incorporating prior knowledge into the discriminator architecture. We apply our method to image generation, image segmentation and audio source separation, and show an improved performance compared to a standard GAN when additional incomplete training examples are available.


1 Introduction

In generative adversarial networks (GANs) (Goodfellow et al., 2014), a generator network is trained to produce samples from a given target distribution. To achieve this, a discriminator network is employed to distinguish between “real” samples from the dataset and “fake” samples from the generator network. The discriminator’s feedback is used by the generator to improve its output. While GANs have become highly effective at synthesising realistic examples even for complex data such as natural images (Radford et al., 2015; Karras et al., 2018), they typically rely on large training datasets, which are not available for many tasks. So far, it remains unclear how incomplete observations could be used for training, which further limits the amount of available data, especially in applications such as image segmentation (Cordts et al., 2016) or audio source separation (Stoller et al., 2018), where annotated examples are rare compared to individual input or output examples. Furthermore, large discriminator networks operating on the joint distribution are more difficult to train, as they have to consider dependencies between all input dimensions (Karaletsos, 2016).

In this paper, we adapt the standard GAN framework to enable training from incomplete observations. To achieve this, we split the discriminator into multiple “marginal” discriminators, each modelling a separate set of dimensions of the input. As this modification on its own would ignore any dependencies between these parts, we incorporate two additional “dependency discriminators”, each focusing only on inter-part relationships. We show how the outputs from these marginal and dependency discriminators can be recombined and used to estimate the same density ratios as in the original GAN framework – which enables training any generator network in an unmodified form. In contrast to previous GANs, however, our approach only requires full observations to train the smaller dependency model and can leverage much bigger, simpler datasets to train the marginal discriminators, which enables the generator to model the marginal distributions more accurately. Additionally, prior knowledge about the marginals and dependencies can be incorporated into the architecture of each discriminator. Compared to other approaches that rely on imputation models to handle incomplete observations (Pu et al., 2018; Yoon et al., 2018), our approach is designed for cases where the pattern of missing data is known, which enables us to construct a factorisation scheme that completely eliminates any overlap between sub-discriminators. This way, sub-components in our approach require considerably less capacity, which further limits the need for large datasets. Finally, our approach can be extended to the conditional generation setting in a straightforward way.

In our experiments, we apply our approach (“FactorGAN”; implementation available at https://github.com/f90/FactorGAN) to two image generation tasks (Sections 4.1 and 4.2), image segmentation (Section 4.3) and audio source separation (Section 4.4), and observe improved performance in missing data scenarios compared to a GAN. For image segmentation, we also compare to the CycleGAN (Zhu et al., 2017), which does not require images to be paired with their segmentation maps. However, it relies on an additional loss that assumes a one-to-one mapping between inputs and outputs and needs to be balanced against the GAN loss with a hyper-parameter. FactorGAN instead learns a probabilistic mapping from inputs to outputs from a mixture of paired and unpaired examples, using a single adversarial objective with a known optimal solution for the generator, and reaches a much higher segmentation accuracy even with only a few paired samples.

2 Method

After a brief summary of GANs in Section 2.1, we introduce our method to learn from missing data in Section 2.2, and present variants for conditional generation (2.3) and independent outputs (2.4).

2.1 Generative adversarial networks

To model a probability distribution $p_x$ over $\mathbb{R}^d$, we follow the standard GAN framework and introduce a generator model $G_\phi$ that maps an $m$-dimensional input $z \sim p_z$ to a $d$-dimensional sample $G_\phi(z)$, resulting in the generator distribution $q_x$. To train $G_\phi$ such that $q_x$ approximates the real data density $p_x$, a discriminator $D_\theta$ is trained to estimate whether a given sample $x$ is real or generated:

$L_D = \mathbb{E}_{x \sim p_x}[\log D_\theta(x)] + \mathbb{E}_{x \sim q_x}[\log(1 - D_\theta(x))]$   (1)

In the non-parametric limit (Goodfellow et al., 2014), $D_\theta(x)$ approaches $\frac{p_x(x)}{p_x(x) + q_x(x)}$ at every point $x$. The generator is updated based on the discriminator’s estimate of this quantity. In this paper, we use the alternative loss function for $G_\phi$ as proposed by Goodfellow et al. (2014):

$L_G = -\mathbb{E}_{x \sim q_x}[\log D_\theta(x)]$   (2)
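As an illustration, the two objectives can be implemented as in the following minimal PyTorch sketch; the layer sizes, noise dimension, and data shapes here are arbitrary placeholders, not the architectures used in our experiments:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()  # operates on pre-sigmoid ("logit") outputs

def d_loss(x_real):
    # Equation (1), as a minimisation: real samples -> label 1, generated -> 0
    z = torch.randn(x_real.size(0), 64)
    x_fake = G(z).detach()  # no generator gradients during the discriminator step
    return bce(D(x_real), torch.ones(x_real.size(0), 1)) + \
           bce(D(x_fake), torch.zeros(x_fake.size(0), 1))

def g_loss(batch_size):
    # Equation (2): the non-saturating generator loss -E_q[log D(G(z))]
    z = torch.randn(batch_size, 64)
    return bce(D(G(z)), torch.ones(batch_size, 1))
```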

2.2 Adaptation to missing data

In the following, we consider the case that incomplete observations are available in addition to our regular dataset (i.e. simpler yet larger datasets). In particular, we partition the set of input dimensions of $x$ into $K$ ($1 < K \leq d$) non-overlapping subsets $S_1, \ldots, S_K$. For each $i \in \{1, \ldots, K\}$, an incomplete (“marginal”) observation $x^i$ can be drawn from $p^i_x$, which is obtained from $p_x$ after marginalising out all dimensions not in $S_i$. Analogously, $q^i_x$ denotes the $i$-th marginal distribution of the generator $G_\phi$. Next, we extend the existing GAN framework such that we can employ the additional incomplete observations. In this context, a main hurdle is that a standard GAN discriminator is trained with samples from the full joint $p_x$. To eliminate this restriction, we note that the discriminator output $D_\theta(x)$ can be mapped to a “joint density ratio” $\frac{p_x(x)}{q_x(x)}$ by applying the bijective function $f(v) = \frac{v}{1-v}$. For our approach, we exploit that this joint density ratio can be factorised into a product of density ratios:

$\frac{p_x(x)}{q_x(x)} = \underbrace{\frac{p_x(x)}{\prod_{i=1}^{K} p^i_x(x^i)}}_{c_P(x)} \cdot \underbrace{\frac{\prod_{i=1}^{K} q^i_x(x^i)}{q_x(x)}}_{c_Q(x)^{-1}} \cdot \prod_{i=1}^{K} \frac{p^i_x(x^i)}{q^i_x(x^i)}$   (3)

Each “marginal density ratio” $\frac{p^i_x(x^i)}{q^i_x(x^i)}$ captures the generator’s output quality for one marginal variable $x^i$, while the $c_P(x)$ and $c_Q(x)$ terms describe the dependency structure between marginal variables in the real and generated distribution, respectively. We can estimate each density ratio independently by training a “sub-discriminator” network, and combine their outputs for an estimate of $\frac{p_x(x)}{q_x(x)}$, as we will show in the following.
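Equation (3) can be checked numerically on a toy example. The following sketch, with arbitrary joint distributions over two binary parts assumed purely for illustration, verifies the identity at every point:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2)); p /= p.sum()   # real joint p(x1, x2)
q = rng.random((2, 2)); q /= q.sum()   # generator joint q(x1, x2)

p1, p2 = p.sum(axis=1), p.sum(axis=0)  # real marginals
q1, q2 = q.sum(axis=1), q.sum(axis=0)  # generator marginals

for x1 in range(2):
    for x2 in range(2):
        cP = p[x1, x2] / (p1[x1] * p2[x2])   # real dependency term
        cQ = q[x1, x2] / (q1[x1] * q2[x2])   # generated dependency term
        marg = (p1[x1] / q1[x1]) * (p2[x2] / q2[x2])
        assert np.isclose(p[x1, x2] / q[x1, x2], (cP / cQ) * marg)
```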

Estimating the marginal density ratios:

To estimate $\frac{p^i_x(x^i)}{q^i_x(x^i)}$ for each $i$, we train a “marginal discriminator network” $D^i$ with parameters $\theta_i$ to determine whether a marginal sample $x^i$ is real or generated, following the GAN discriminator loss in Equation (1) (with samples drawn from $p^i_x$ and $q^i_x$ instead of $p_x$ and $q_x$, respectively). This allows making use of the additional incomplete observations. In the non-parametric limit, $D^i(x^i)$ will approach $\frac{p^i_x(x^i)}{p^i_x(x^i) + q^i_x(x^i)}$, so that we can use $f(D^i(x^i))$ as an estimate of $\frac{p^i_x(x^i)}{q^i_x(x^i)}$.

Estimation of $c_P$ and $c_Q$:

Note that $c_P$ and $c_Q$ are also density ratios, this time containing a distribution over $x$ in both the numerator and denominator – the main difference being that in the latter the individual parts are independent from each other. To approximate the ratio $c_P$, we can apply the same principles as above and train a “p-dependency discriminator” $D_P$ to distinguish samples from the two distributions, i.e. to discriminate real joint samples from samples where the individual parts are real but were drawn independently of each other (i.e. the individual parts might not originate from the same real joint sample). Again, in the non-parametric limit, its response approaches $\frac{p_x(x)}{p_x(x) + \prod_{i=1}^{K} p^i_x(x^i)}$ and thus $c_P(x)$ can be approximated via $f(D_P(x))$. Analogously, the term $c_Q$ is estimated with a “q-dependency discriminator” $D_Q$ – here, we compare joint generator samples with samples where the individual parts were shuffled across several generated samples (to implement the independence assumption).
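The “independent parts” samples needed by both dependency discriminators can be obtained by permuting each part across the batch dimension, as in this sketch (a hypothetical helper; for the p-dependency discriminator the parts come from real joint samples, for the q-dependency discriminator from generated ones):

```python
import torch

def break_dependencies(parts):
    """parts: one tensor per marginal variable, each with shape (batch, ...).
    Permutes each part independently along the batch dimension, so the
    returned parts no longer originate from the same joint sample."""
    n = parts[0].size(0)
    return [p[torch.randperm(n)] for p in parts]
```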

Joint discriminator sample complexity:

In contrast to $c_Q$, where the generator provides an infinite number of samples, estimating $c_P$ without overfitting to the limited number of joint training samples can be challenging. While standard GANs suffer from the same difficulty, our factorisation into specialised sub-units allows for additional opportunities to improve the sample complexity. In particular, we can design the architecture of the p-dependency discriminator to incorporate prior knowledge about the dependency structure (if only certain features of a marginal variable influence the dependencies, we can limit the input to the p-dependency discriminator to these features instead of the full marginal sample to prevent overfitting).

Combining the discriminators:

As the marginal and the p- and q-dependency sub-discriminators provide estimates of their respective density ratios, we can multiply them following Equation (3) and apply $f^{-1}$ to obtain the desired estimate of $\frac{p_x(x)}{p_x(x) + q_x(x)}$. We describe a numerically stable and simple implementation in the supplementary material, involving only a linear combination of pre-activation sub-discriminator outputs followed by a sigmoid (see Section 6.4 for details and proof). The time for a generator update step grows linearly with the number of marginals $K$, assuming the time to update each of the marginal discriminators remains constant.
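Concretely, the combination amounts to the following sketch (see Section 6.4 for the proof; variable names are illustrative):

```python
import torch

def combined_discriminator(d_marginals, d_p, d_q):
    """d_marginals: list of logits d^i(x^i) from the marginal discriminators;
    d_p, d_q: logits of the p- and q-dependency discriminators.
    Returns an estimate of p_x(x) / (p_x(x) + q_x(x))."""
    return torch.sigmoid(d_p - d_q + sum(d_marginals))
```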

2.3 Adaptation to conditional generation

Conditional generation, such as image segmentation or inpainting, can be performed with GANs by using a generator that maps a conditional input $c$ and noise $z$ to an output $G_\phi(c, z)$, resulting in an output density $q_x(x|c)$. We can view $c$ and $x$ as parts of a joint variable with real distribution $p_{c,x}$, which leads to the equivalent task of matching $p_{c,x}$ to the joint generator distribution $q_{c,x}$. In a conditional GAN, the discriminator needs to distinguish between joint samples from $p_{c,x}$ and $q_{c,x}$, which requires “paired” samples from $p_{c,x}$ and is inefficient as the inputs $c$ are the same in both $p_{c,x}$ and $q_{c,x}$. In contrast, applying our factorisation principle from Equation (3) to $c$ and $x$ yields

$\frac{p_{c,x}(c,x)}{q_{c,x}(c,x)} = \frac{c_P(c,x)}{c_Q(c,x)} \cdot \frac{p_c(c)}{q_c(c)} \cdot \frac{p_x(x)}{q_x(x)} = \frac{c_P(c,x)}{c_Q(c,x)} \cdot \frac{p_x(x)}{q_x(x)}$   (4)

since the generator is conditioned on real inputs, so that $p_c = q_c$. This suggests the use of a p- and a q-dependency discriminator to model the input-output relationship, and a marginal discriminator over $x$ that matches aggregate generator predictions from $q_x$ to real output examples from $p_x$. Note that we do not need a marginal discriminator for $c$, which increases computational efficiency. This adaptation can also involve additionally partitioning $x$ into multiple partial observations as shown in Equation (3).

2.4 Adaptation to independent marginals

In case the marginals can be assumed to be completely independent, one can remove the p-dependency discriminator from our framework, since $c_P(x) = 1$ for all inputs $x$. This approach can be useful in the conditional setting, when each output is related to the input but the outputs’ marginals are independent from each other. In this context, our approach is related to adversarial ICA (Brakel and Bengio, 2017). Note that the q-dependency discriminator still needs to be trained on the full generator outputs so that the generator does not introduce unwanted dependencies between the marginals.

2.5 Further extensions

There are many more ways of partitioning the joint distribution into marginals. We discuss two additional variants (Hierarchical and auto-regressive FactorGANs) of our approach in Section 6.3 of the supplementary material.

3 Related work

Yoon et al. (2018) randomly mask the inputs to a GAN generator so it learns to impute missing values, but not to generate joint observations from scratch like the FactorGAN. Pu et al. (2018) use GANs for joint distribution modelling by training a generator for each possible factorisation of the joint distribution, which enables flexible missing data imputation. However, a separate generator is required for each way of partitioning the joint into marginals, so the approach is prohibitively slow for a large number of parts $K$. In contrast, we assume either all parts or exactly one part of the variable of interest is observed, allowing the discriminators to be factorised without introducing functional redundancies between individual parts that create computational overhead. Karaletsos (2016) proposes adversarial inference on local factors of a high-dimensional joint distribution and factorises both generator and discriminator based on independence assumptions given by a Bayesian network, whereas we keep a joint sample generator and model all dependencies.

While our approach is not limited to conditional generation, we briefly review related conditional approaches in the following. The “CycleGAN” (Zhu et al., 2017), along with similar approaches (Gan et al., 2017), exploits unpaired samples by assuming a one-to-one mapping between the domains and using bidirectional generators, while FactorGAN makes no such assumptions and instead uses paired examples to learn the dependency structure. Brakel and Bengio (2017) perform independent component analysis in an adversarial fashion, using a discriminator similar to a q-dependency discriminator to identify correlations and thereby enforce independence between the separator outputs. While similar, our method is fully adversarial and extends this framework with a p-dependency discriminator to enable modelling of dependencies. For audio source separation, GANs have been used to match the outputs of a source separation model to real source signals, but source dependencies were either ignored (Zhang et al., 2017) or modelled with an additional supervised mean squared error loss (Stoller et al., 2018), which lacks a unified objective with a known optimal solution as provided by the FactorGAN framework.

4 Experiments

To validate our method, we compare our FactorGAN with the regular GAN approach, for both unsupervised generation and supervised prediction tasks. To investigate whether FactorGAN can make use of additional partial observations, we vary the proportion of training samples available for joint sampling (“paired”), while using the rest to sample only from the marginals (“unpaired”). We train all models on a single NVIDIA GTX 1080 GPU. The code to reproduce all experiments can be found in the supplementary material.

Training procedure

For stable training, we employ spectral normalisation (Miyato et al., 2018) on each discriminator network to ensure it satisfies a Lipschitz condition. Since the overall output used for training the generator is simply a linear combination of the individual discriminators (see Section 6.4 in the supplementary material), the generator gradients are also constrained in magnitude accordingly. Unless otherwise noted, we train all models with the Adam optimiser, using the same learning rate and batch size throughout. We perform two discriminator updates after each generator update.
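A minimal sketch of these settings follows; the layer sizes and the Adam learning rate here are placeholders rather than the exact values used in our experiments:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_discriminator(n_in):
    # spectral normalisation on every layer keeps the discriminator Lipschitz
    return nn.Sequential(
        spectral_norm(nn.Linear(n_in, 128)), nn.LeakyReLU(0.2),
        spectral_norm(nn.Linear(128, 1)))

D = make_discriminator(784)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)  # placeholder hyper-parameters
# Per generator update, two discriminator updates are performed:
# for _ in range(2):
#     opt_D.zero_grad(); d_loss(x_real).backward(); opt_D.step()
```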

4.1 Paired MNIST

Our first experiment involves “Paired MNIST”, a synthetic dataset of low complexity whose dependencies between marginals can be easily controlled. More precisely, we generate a paired version of the original MNIST dataset (http://yann.lecun.com/exdb/mnist/) by creating samples that contain a pair of vertically stacked digit images. With probability $\lambda$, the lower digit chosen during random generation is the same as the upper one, and different otherwise. For FactorGAN, we model the distributions of upper and lower digits as individual marginal distributions ($K = 2$).

Experimental setup

We compare the normal GAN with our FactorGAN, also including a variant without the p-dependency discriminator that assumes marginals to be independent (“FactorGAN-no-cp”). We conduct the experiment with $\lambda = 0.1$ and $\lambda = 0.9$, and also vary the number of training samples available in paired form, while keeping the others as marginal samples only usable by FactorGAN. For both generators and discriminators, we use simple multi-layer perceptrons (MLPs) (Tables 1 and 2, see supplementary material).

To evaluate the quality of generated digits, we adopt the “Fréchet Inception Distance” (FID) as metric (Heusel et al., 2017). It is based on estimating the distance between the distributions of hidden layer activations of a pre-trained Imagenet object detection model for real and fake examples. To adapt the metric to MNIST data, we pre-train a classifier to predict MNIST digits (see Table 3 in the supplementary material) on the training set, obtaining a high test accuracy. We input the top and bottom digits in each sample separately to the classifier and collect the activations from the last hidden layer (FC1) to compute FIDs for the top and bottom digits, respectively. We use the average of both FIDs to measure the overall output quality of the marginals (lower values are better).
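A sketch of the FID computation from such activations, assuming `acts_real` and `acts_fake` are arrays of FC1 activations:

```python
import numpy as np
from scipy import linalg

def fid(acts_real, acts_fake):
    """acts_real, acts_fake: (n_samples, n_features) activation arrays."""
    mu1, mu2 = acts_real.mean(axis=0), acts_fake.mean(axis=0)
    s1 = np.cov(acts_real, rowvar=False)
    s2 = np.cov(acts_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2).real  # matrix square root; drop imaginary noise
    return np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean)
```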

Since the only dependencies in the data are digit correlations controlled by $\lambda$, we can evaluate how well FactorGAN models these dependencies. We compute $p(d_1, d_2)$ as the probability for a real sample to have digit $d_1$ at the top and digit $d_2$ at the bottom, along with marginal probabilities $p(d_1)$ and $p(d_2)$ (and analogously $q(d_1, d_2)$, $q(d_1)$ and $q(d_2)$ for generated data). Since we do not have ground-truth digit labels for the generated samples, we instead use the class with the highest probability according to the pre-trained classifier. We encode the dependency as a ratio between a joint probability and the product of its marginals, where the ratios for real and generated data are ideally the same. Therefore, we take the sum of their absolute differences over all digit combinations as evaluation metric (lower is better):

$d_{dep} = \sum_{d_1=0}^{9} \sum_{d_2=0}^{9} \left| \frac{p(d_1, d_2)}{p(d_1)\,p(d_2)} - \frac{q(d_1, d_2)}{q(d_1)\,q(d_2)} \right|$   (5)

Note that the metric computes how well dependencies in the real data are modelled by a generator, but not whether it introduces any additional unwanted dependencies such as top and bottom digits sharing stroke thickness, and thus presents only a necessary condition for a good generator.
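A sketch of this metric, assuming arrays of classifier-predicted digit labels for the top and bottom halves (the small constant guarding against empty digit classes is an implementation convenience, not part of the formulation):

```python
import numpy as np

def dependency_ratios(top, bottom, eps=1e-12):
    """top, bottom: integer arrays of predicted digit labels per sample."""
    joint = np.zeros((10, 10))
    for d1, d2 in zip(top, bottom):
        joint[d1, d2] += 1
    joint /= joint.sum()
    m1, m2 = joint.sum(axis=1), joint.sum(axis=0)  # marginal digit probabilities
    return joint / (np.outer(m1, m2) + eps)        # joint over product of marginals

def dependency_metric(real_top, real_bottom, fake_top, fake_bottom):
    # Equation (5): sum of absolute differences over all digit combinations
    diff = dependency_ratios(real_top, real_bottom) - \
           dependency_ratios(fake_top, fake_bottom)
    return np.abs(diff).sum()
```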

Results
(a) FID value, averaged over both digits
(b) Dependency metric
Figure 1: Performance with different numbers of paired training samples and settings of $\lambda$, compared between GAN and FactorGAN with and without dependency modelling.

The results of our experiment are shown in Figure 1. Since FactorGAN-no-cp trains on all samples independently of the number of paired observations, both FID and $d_{dep}$ are constant. As expected, FactorGAN-no-cp delivers good digit quality, and performs well for $\lambda = 0.1$ (as it assumes independence) and badly for $\lambda = 0.9$ with regard to dependency modelling.

FactorGAN outperforms GAN in terms of FID for small numbers of paired samples by exploiting the additional unpaired samples, although this gap closes as both models eventually have access to the same amount of data. FactorGAN also consistently improves in modelling the digit dependencies with an increasing number of paired observations. For $\lambda = 0.9$, this also applies to the normal GAN, although its performance is much worse for small sample sizes as it introduces unwanted digit dependencies. Additionally, its performance appears unstable for $\lambda = 0.1$, where it achieves its best results with a small number of paired examples. Further improvements in this setting could be gained by incorporating prior knowledge about the nature of these dependencies into the p-dependency discriminator to increase its sample efficiency, but this is left for future work.

4.2 Image pair generation

In this section, we use GAN and FactorGAN for generating pairs of images in an unsupervised way to evaluate how well FactorGAN models more complex data distributions.

Datasets

We use the “Cityscapes” dataset (Cordts et al., 2016) and the “Edges2Shoes” dataset (Isola et al., 2017). To keep the outputs in a continuous domain, we treat the segmentation maps in the Cityscapes dataset as RGB images, instead of a set of discrete categorical labels. Each input and output image is downsampled as a preprocessing step to reduce computational complexity and to ensure stable GAN training.

Experimental setup

We define the distributions of input as well as output images as marginal distributions. Therefore, FactorGAN uses two marginal discriminators and a p- and a q-dependency discriminator. All discriminators employ the convolutional architecture shown in Table 5 (see supplementary material). To control for the impact of discriminator size, we also train a GAN with twice the number of filters in each discriminator layer to match its size with the combined size of the FactorGAN discriminators. The same convolutional generator shown in Table 4 in the supplementary material is used for GAN and FactorGAN. Each image pair is concatenated along the channel dimension to form one sample, resulting in six channels for the Cityscapes and four for the Edges2Shoes dataset (since edge maps are greyscale). We make either 100, 1000, or all training samples available in paired form, to investigate whether FactorGAN can improve upon GAN by exploiting the remaining unpaired samples, or match its quality if there are none.

For evaluation, we randomly assign part of the validation data to a “test-train” and the rest to a “test-test” partition. We train an LSGAN discriminator (Mao et al., 2017) with the architecture shown in Table 5 in the supplementary material (but with half the filters in each layer) on the test-train partition to distinguish real from generated samples, before measuring its loss on the test-test partition. We continuously sample from the generator during training and testing instead of using a fixed set of samples to better approximate the true generator distribution. As evaluation metric, we use the average test loss over multiple training runs, which was shown to correlate with subjective ratings of visual quality (Im et al., 2018) and also matched our own quality judgements throughout this study. A larger value indicates better performance, as we use a flipped sign compared to Im et al. (2018). While the quantitative results appear indicative of output quality, accurate GAN evaluation is still an open problem, so we encourage the reader to judge the generated examples in the supplementary material.
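The evaluation idea can be sketched as follows, assuming a trained generator and a freshly initialised critic `C`; only the LSGAN critic objective is shown:

```python
import torch

def ls_critic_loss(C, x_real, x_fake):
    # LSGAN critic objective (Mao et al., 2017): push real outputs to 1, fake to 0
    return 0.5 * ((C(x_real) - 1.0) ** 2).mean() + 0.5 * (C(x_fake) ** 2).mean()

# After fitting C on the "test-train" partition against freshly drawn generator
# samples, the same loss measured on "test-test" data gives the reported metric.
```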

Results
Figure 2: GAN and FactorGAN output quality estimated by the LS metric for different datasets and numbers of paired samples. Error bars show 95% confidence intervals.

(a) GAN
(b) FactorGAN
Figure 3: Examples generated for the Edges2Shoes dataset using 100 paired samples

Our FactorGAN achieves better or similar output quality compared to the GAN baseline in all cases, as seen in Figure 2. For the Edges2Shoes dataset, the performance gains are most pronounced for small numbers of paired samples. On the more complex Cityscapes dataset, FactorGAN outperforms GAN by a large margin independent of training set size, even when the discriminators are closely matched in size. This suggests that FactorGAN converges in fewer training iterations on this dataset, although the exact cause is unclear and should be investigated in future work.

We show some generated examples in Figure 3. Due to the small number of available paired samples, we observe a strong mode collapse of the GAN in Figure 3(a), while FactorGAN provides high-fidelity, diverse outputs, as shown in Figure 3(b). Similar observations can be made for the Cityscapes dataset when using 100 paired samples (see Section 6.5.2 in the supplementary material).

4.3 Image segmentation

Our approach extends to the case of conditional generation (see Section 2.3), so we tackle a complex and important image segmentation task on the Cityscapes dataset, where we ask the generator to predict a segmentation map for a city scene (instead of generating both from scratch as in Section 4.2).

Experimental setup

We downsample the scenes and segmentation maps and use a U-Net architecture (Ronneberger et al., 2015), shown in Table 6 in the supplementary material, as segmentation model. For FactorGAN, we use one marginal discriminator to match the distributions of real and fake segmentation maps to ensure realistic predictions, which enables training with isolated city scenes and segmentation maps. To ensure correct predictions for each city scene, a p- and a q-dependency discriminator learn the input-output relationship using joint samples, both employing the convolutional architecture shown in Table 5 (see supplementary material). Note that, as in Section 4.2, we output segmentation maps in the continuous RGB space instead of performing classification. In addition to the MSE in the RGB space, we compute the widely used pixel-wise classification accuracy (Cordts et al., 2016) by assigning each output pixel to the class whose label colour has the lowest Euclidean distance in RGB space.
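The accuracy computation can be sketched as follows; `palette` is an assumed array of per-class label colours:

```python
import numpy as np

def rgb_to_classes(pred_rgb, palette):
    """pred_rgb: (H, W, 3) array; palette: (n_classes, 3) label colours.
    Returns the (H, W) map of nearest-colour class indices."""
    dists = np.linalg.norm(pred_rgb[:, :, None, :] - palette[None, None], axis=-1)
    return dists.argmin(axis=2)

def pixel_accuracy(pred_rgb, target_classes, palette):
    return (rgb_to_classes(pred_rgb, palette) == target_classes).mean()
```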

Results
Figure 4: MSE (left) and accuracy (right) obtained on the Cityscapes dataset with different numbers of paired training samples for the GAN and FactorGAN
(a) GAN
(b) FactorGAN
Figure 5: Segmentation predictions made on the Cityscapes dataset for the same set of test inputs, compared between models, using 25 paired samples for training

The results in Figure 4 demonstrate that our approach can exploit additional unpaired samples to deliver a better MSE and accuracy than a GAN, and less noisy outputs, as seen in Figure 5. While the CycleGAN (Zhu et al., 2017) treats all samples as unpaired and reaches a substantially lower accuracy, FactorGAN offers a marked increase in accuracy when only 25 samples are paired, although other factors such as the choice of discriminator architecture or GAN loss might also affect this difference.

4.4 Audio source separation

We apply our method to audio source separation as another conditional generation task to investigate whether it transfers across application domains. Specifically, we conduct experiments on separating music signals into singing voice and accompaniment, which are detailed in the supplementary material in Section 6.2. Similarly to image segmentation in Section 4.3, we find that FactorGAN outperforms the normal GAN regarding separation quality, suggesting that our factorisation is indeed useful across problem domains.

5 Discussion

We find that FactorGAN outperforms GAN across all experiments when additional incomplete samples are available, especially when they are abundant in comparison to the number of joint samples. When using only joint observations, FactorGAN should be expected to match the GAN in quality, and it does so quite closely in most of our experiments. Surprisingly, it outperforms GAN in some scenarios such as image segmentation, even when discriminator sizes are matched – a phenomenon we do not fully understand yet and should be investigated in the future.

Since the p-dependency discriminator does not rely on generator samples that change during training, it could be pre-trained to reduce computation time, but this led to sudden training instabilities in our experiments. We suspect that this is due to a mismatch between training and testing conditions for the p-dependency discriminator, since it is trained on real but evaluated on fake data, and neural networks can yield overly confident predictions outside the support of the training set (Gal and Ghahramani, 2016). Therefore, we expect classifiers with better uncertainty calibration to alleviate this issue.

6 Conclusion

In this paper, we demonstrated how a joint distribution can be factorised into a set of marginals and dependencies, giving rise to the FactorGAN – a GAN in which the discriminator is split into parts that can be independently trained with incomplete observations. For both generation and conditional prediction tasks in multiple domains, we find that FactorGAN outperforms the standard GAN when additional incomplete observations are available. For Cityscapes scene segmentation in particular, FactorGAN achieves a much higher accuracy than the fully unsupervised CycleGAN, while requiring only a small fraction of all examples to be annotated.

Factorising discriminators enables incorporating more prior knowledge into the design of neural architectures in GANs, which could improve empirical results in applied domains. The presented factorisation is generally applicable independent of model choice, so it can be readily integrated into many existing GAN-based approaches. Since the joint density can be factorised in different ways, multiple extensions are conceivable depending on the particular application (as shown in Section 6.3 in the supplementary material). This paper derives FactorGAN from the original GAN proposed by Goodfellow et al. (2014) by exploiting the probabilistic view of the optimal discriminator. Adapting the FactorGAN to alternative GAN objectives (such as the Wasserstein GAN (Arjovsky et al., 2017)) might be possible as well.

References

Appendix

6.1 Tables

Layer Input shape Outputs Output shape Activation
FC ReLU
FC ReLU
FC Sigmoid
Table 1: The architecture of our generator on the MNIST dataset. All layers have biases.
Layer Input shape Outputs Output shape Activation
FC LeakyReLU
FC LeakyReLU
FC -
Table 2: The architecture of our discriminators on the paired MNIST dataset. The input size differs between the marginal and the dependency discriminators.
Layer Input shape Filter size Stride Outputs Output shape Activation
Conv -
AvgPool LeakyReLU
Conv -
AvgPool LeakyReLU
FC1 - - LeakyReLU
FC2 - - -
Table 3: The architecture of our MNIST classifier. Dropout is applied to FC1 outputs.
Layer Input shape Filter size Stride Outputs Output shape Activation
ConvT ReLU
ConvT ReLU
ConvT ReLU
ConvT ReLU
ConvT ReLU
Conv Sigmoid
Table 4: The architecture of our convolutional generator. “ConvT” denotes a transposed convolution. All layers have biases. The number of output channels depends on the task.
Layer Input shape Filter size Stride Outputs Output shape Activation
Conv LeakyReLU
Conv LeakyReLU
Conv LeakyReLU
Conv LeakyReLU
Conv LeakyReLU
FC - - 1 LeakyReLU
Table 5: The architecture of our convolutional discriminator. All layers except FC have biases. The input height, width and number of channels are set for each task so that the dimensions of the input data are matched.
Layer Input (shape) Outputs Output shape
DoubleConv1
MP1
DoubleConv2
MP2
DoubleConv3
MP3
DoubleConv4
MP4
DoubleConv5
FC
Concat DoubleConv5 -
UpConv
Concat DoubleConv4
Conv
UpConv
Concat DoubleConv3
Conv
UpConv
Concat DoubleConv2
Conv
UpConv
Concat DoubleConv1
Conv
Conv
Table 6: The architecture of our U-Net. The height and number of input channels depend on the experiment. MP is max-pooling with stride 2. FC takes noise as input. UpConv performs a transposed convolution. Concat concatenates the current feature map with one from the downsampling path. The final output is computed depending on the task (see text for more details).
Layer Input shape Outputs Output shape
Conv
BatchNorm & ReLU -
Conv
BatchNorm & ReLU -
Table 7: The DoubleConv neural network block used in the U-Net. Both Conv layers use the same filter size.

6.2 Audio source separation experiment

For our audio source separation experiment, our generator takes a music (mixture) spectrogram $m$ along with noise and maps it to estimates of the accompaniment and vocal spectrogram magnitudes $a$ and $v$, implicitly defining an output probability $q(a, v \mid m)$. We define the joint real and generated distributions that should be matched as $p(m, a, v)$ and $q(m, a, v)$. Since the source signals in our dataset are simply added in the time domain to produce the mixture, this approximately applies to the spectrogram magnitudes as well, so we assume that $m = a + v$. We can constrain our generator to make predictions that always satisfy this condition, thereby taking care of the input-output relationship manually, similarly to Sønderby et al. [2017]. Instead of predicting the sources directly, a mask with values in the range $[0, 1]$ is computed, and the accompaniment and vocals are estimated by multiplying the mixture element-wise with the mask and with one minus the mask, respectively. As a result, the mixture is fully determined by the predicted sources in both real and generated data, so we can simplify the joint density ratio to

$\frac{p(m, a, v)}{q(m, a, v)} = \frac{p(a, v)}{q(a, v)} = \frac{c_P(a, v)}{c_Q(a, v)} \cdot \frac{p_a(a)}{q_a(a)} \cdot \frac{p_v(v)}{q_v(v)}$   (6)

meaning that the discriminator(s) in the GAN and the FactorGAN only require $(a, v)$ pairs, but not the mixture as additional input, as the correct input-output relationship is already incorporated into the generator. Furthermore, the last equality suggests a FactorGAN application with one marginal discriminator for each source along with dependency discriminators to model source dependencies.
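The mask-based output layer can be sketched as follows; `unet` stands for an assumed network with a final sigmoid, so the mask lies in $[0, 1]$:

```python
import torch

def separate(unet, mixture, noise):
    """mixture: magnitude spectrogram tensor; returns (accompaniment, vocals)."""
    mask = unet(mixture, noise)          # final sigmoid keeps values in [0, 1]
    accompaniment = mixture * mask
    vocals = mixture * (1.0 - mask)
    # accompaniment + vocals == mixture by construction, so the additive
    # input-output relationship never has to be enforced by a discriminator
    return accompaniment, vocals
```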

Dataset

We use MUSDB [Rafii et al., 2017] as multi-track dataset for our experiment, featuring 100 songs for training and 50 songs for testing. Each song is downsampled before spectrogram magnitudes are computed using an STFT (we discard the bin with the highest frequency so that the number of frequency bins is a power of 2, avoiding padding issues in our network architectures). Snippets of fixed length are created by cropping each song’s full spectrogram at regular intervals. Thus, the generator only separates snippets and outputs predictions of the same shape; however, this does not change the derivation presented in Equation (6), and longer inputs at test time can be processed by partitioning them into snippets and concatenating the model predictions.

Experimental setup

For our generator, we use the U-Net architecture detailed in Table 6. We use the convolutional discriminator described in Table 5 (see supplementary material). The source dependency discriminators take two sources as input, concatenated along the channel dimension.

In each experiment, we vary the number of training songs whose snippets are available for paired training, and compare between GAN and FactorGAN. The spectrograms predicted on the test set are converted to audio with the inverse STFT by reusing the phase from the mixture, and then evaluated using the signal-to-distortion ratio (SDR), a well-established evaluation metric for source separation [Vincent et al., 2006].
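The inversion step can be sketched as follows, assuming `librosa` and STFT settings matching those used to compute the spectrograms:

```python
import numpy as np
import librosa

def to_audio(est_magnitude, mixture_stft, hop_length):
    """Invert an estimated magnitude spectrogram using the mixture phase."""
    phase = np.angle(mixture_stft)  # phase of the complex mixture STFT
    return librosa.istft(est_magnitude * np.exp(1j * phase), hop_length=hop_length)
```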

Results
Figure 6: GAN and FactorGAN separation performance for different numbers of paired samples

Figure 6 shows our separation results. Compared to a GAN, the separation performance is significantly higher using FactorGAN. As expected, FactorGAN improves slightly with more paired examples, which is not the case for the GAN – here we find that the vocal output becomes too quiet when increasing the number of songs for training, possibly a sign of mode collapse. Similarly to the results seen in the image pair generation experiments, we suspect that the FactorGAN discriminator might approximate the joint density more closely than the GAN discriminator due to its use of multiple discriminators, although the reasons for this are not yet understood.

6.3 Possible extensions

We can decompose the joint density ratio in other ways than shown in Equation 3 in the paper. In the following, we discuss two additional possibilities.

6.3.1 Hierarchical FactorGAN

The decomposition of the joint density ratio could be applied recursively, splitting the obtained marginals further into “sub-marginals” and their dependencies, which could be repeated multiple times. In addition to training with incomplete observations where only a single part is given, this also allows making use of samples where only sub-parts of these parts are given and is thus more flexible than a single factorisation as used in the standard FactorGAN.

As a demonstration, we split each marginal $x^i$ further into a group of sub-marginals $x^{i,1}, \ldots, x^{i,K_i}$ and their dependencies, without further recursion for simplicity:

$\frac{p^i_x(x^i)}{q^i_x(x^i)} = \frac{c^i_P(x^i)}{c^i_Q(x^i)} \cdot \prod_{j=1}^{K_i} \frac{p^{i,j}_x(x^{i,j})}{q^{i,j}_x(x^{i,j})}$   (7)

$c^i_P$ and $c^i_Q$ are dependency terms analogous to $c_P$ and $c_Q$, but defined only on the marginal variable $x^i$, whose “sub-marginals” are denoted by $x^{i,j}$.

Such a hierarchical decomposition might also be beneficial if the data is known to be generated from a hierarchical process. We leave the empirical exploration of this concept to future work.

6.3.2 Autoregressive FactorGAN

For a multi-dimensional variable $x = (x_1, \ldots, x_T)$ composed of elements arranged in a sequence, such as time series data, the joint density ratio can also be decomposed in a causal, auto-regressive fashion:

$\frac{p_x(x)}{q_x(x)} = \prod_{t=1}^{T} \frac{c^t_P(x)}{c^t_Q(x)} \cdot \frac{p_t(x_t)}{q_t(x_t)}$   (8)

$= \prod_{t=1}^{T} \frac{p(x_t \mid x_1, \ldots, x_{t-1})}{q(x_t \mid x_1, \ldots, x_{t-1})}$   (9)

Note that $c^t_P$ is defined here as $c^t_P(x) = \frac{p(x_1, \ldots, x_t)}{p(x_1, \ldots, x_{t-1})\, p_t(x_t)}$ ($c^t_Q$ analogously using $q$). Equation (8) suggests an auto-regressive version of FactorGAN in which the generator output quality at each time-step is evaluated using a marginal discriminator that estimates $\frac{p_t(x_t)}{q_t(x_t)}$, combined with dependency discriminators that model the dependency between the current and all past time-steps.

The final product formulation in Equation (9) reveals a close similarity to auto-regressive models and suggests a modification of the normal GAN with an auto-regressive discriminator that rates an input at each time-step given the previous ones. Using a derivation analogous to the one shown in Section 6.4, this implies taking the unnormalised discriminator outputs at each time-step, summing them, and applying a sigmoid non-linearity to obtain the overall estimate of the probability $\frac{p_x(x)}{p_x(x) + q_x(x)}$. A similar implementation was used before by Mogren [2016] in an attempt to stabilise GAN training with recurrent neural networks as discriminators, but for the first time, we provide a rigorous theoretical justification for this practice here.
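Such a discriminator can be sketched as follows; the GRU-based architecture is an illustrative assumption, not the setup of Mogren [2016]:

```python
import torch
import torch.nn as nn

class AutoRegressiveDiscriminator(nn.Module):
    def __init__(self, n_features, n_hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden, batch_first=True)  # causal summary
        self.head = nn.Linear(n_hidden, 1)                         # score per step

    def forward(self, x):                       # x: (batch, time, features)
        h, _ = self.rnn(x)                      # state at step t sees x_1, ..., x_t
        step_logits = self.head(h).squeeze(-1)  # unnormalised output per time-step
        return torch.sigmoid(step_logits.sum(dim=1))  # overall probability estimate
```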

6.4 Discriminator combination

Definition 6.1.

Sigmoid discriminator output. Let $D^i(x^i) = \sigma(d^i(x^i))$ for all $i \in \{1, \ldots, K\}$; analogously define $D_P(x) = \sigma(d_P(x))$ and $D_Q(x) = \sigma(d_Q(x))$, where $\sigma$ is the sigmoid function and $d^i$, $d_P$ and $d_Q$ denote the unnormalised (pre-activation) sub-discriminator outputs.

Definition 6.2.

Combined discriminator. Let $D(x) = \sigma\left(d_P(x) - d_Q(x) + \sum_{i=1}^{K} d^i(x^i)\right)$ be the output of the combined discriminator that is used for training using Equation (2).

Theorem 1.

Combined discriminator approximates $\frac{p_x(x)}{p_x(x) + q_x(x)}$. Under Definitions 6.1 and 6.2 and assuming optimally trained sub-discriminators, $D(x) = \frac{p_x(x)}{p_x(x) + q_x(x)}$.

Proof.

Proof of Theorem 1 using Definitions 6.1 and 6.2: for an optimal sub-discriminator with output $\sigma(d)$, the estimated density ratio equals $\frac{\sigma(d)}{1 - \sigma(d)} = e^{d}$, so that $e^{d^i(x^i)} = \frac{p^i_x(x^i)}{q^i_x(x^i)}$, $e^{d_P(x)} = c_P(x)$ and $e^{d_Q(x)} = c_Q(x)$. With Equation (3), we obtain

$D(x) = \sigma\left(\log\left[\frac{c_P(x)}{c_Q(x)} \prod_{i=1}^{K} \frac{p^i_x(x^i)}{q^i_x(x^i)}\right]\right) = \sigma\left(\log \frac{p_x(x)}{q_x(x)}\right) = \frac{p_x(x)}{p_x(x) + q_x(x)}$   (10)

6.5 Generated examples

6.5.1 Paired MNIST

Figure 7: Paired MNIST examples generated by GAN and FactorGAN for different numbers of paired training samples.

6.5.2 Image Pairs

Figure 8: GAN generating image pairs for the Cityscapes dataset using 100 paired samples.
Figure 9: GAN (big) generating image pairs for the Cityscapes dataset using 100 paired samples.
Figure 10: FactorGAN generating image pairs for the Cityscapes dataset using 100 paired samples.
Figure 11: GAN generating image pairs for the Cityscapes dataset using 1000 paired samples.
Figure 12: GAN (big) generating image pairs for the Cityscapes dataset using 1000 paired samples.
Figure 13: FactorGAN generating image pairs for the Cityscapes dataset using 1000 paired samples.
Figure 14: GAN generating image pairs using the full Cityscapes dataset.
Figure 15: GAN (big) generating image pairs using the full Cityscapes dataset.
Figure 16: FactorGAN generating image pairs using the full Cityscapes dataset.
(a) GAN
(b) FactorGAN
Figure 17: Image pairs generated for the Edges2Shoes dataset using 100 paired samples.
(a) GAN
(b) FactorGAN
Figure 18: Image pairs generated for the Edges2Shoes dataset using 1000 paired samples.
(a) GAN
(b) FactorGAN
Figure 19: Image pairs generated for the Edges2Shoes dataset using all samples as paired.

6.6 Image segmentation

(a) GAN
(b) FactorGAN
Figure 20: Segmentation predictions made on the Cityscapes dataset for the same set of test inputs, compared between models, using paired samples for training
(a) GAN
(b) FactorGAN
Figure 21: Segmentation predictions made on the Cityscapes dataset for the same set of test inputs, compared between models, using paired samples for training
(a) GAN
(b) FactorGAN
Figure 22: Segmentation predictions made on the Cityscapes dataset for the same set of test inputs, compared between models, using all paired samples for training