HyperGAN: A Generative Model for Diverse, Performant Neural Networks

01/30/2019 ∙ by Neale Ratzlaff, et al.

We introduce HyperGAN, a generative network that learns to generate all the weights within a deep neural network. HyperGAN employs a novel mixer to transform independent Gaussian noise into a latent space whose dimensions are correlated, which is then transformed to generate the weights in each layer of a deep neural network. We use an architecture that resembles a generative adversarial network, but we evaluate the likelihood of samples with a classification loss, which is equivalent to minimizing the KL divergence between the generated network parameter distribution and an unknown true parameter distribution. We apply HyperGAN to classification, showing that it can learn to generate parameters which solve the MNIST and CIFAR-10 datasets with performance competitive with fully supervised learning, while learning a rich distribution of effective parameters. We also show that HyperGAN can provide better uncertainty estimates than standard ensembles, as evaluated by the ability of HyperGAN-generated ensembles to detect out-of-distribution data as well as adversarial examples. In addition to being highly accurate on inlier data, HyperGAN provides reasonable uncertainty estimates.


1 Introduction

Since the inception of deep neural networks, it has been found that it is possible to train from different random initializations and obtain networks that, albeit having quite different parameters, achieve similar accuracy (Freeman & Bruna, 2016). It has further been found that ensembles of deep networks trained in such a way have significant performance advantages over single models (Maclin & Opitz, 2011), similar to the classical bagging approach in statistics. Ensemble models also have other benefits, such as being robust to outliers and providing uncertainty estimates over their inputs (Lakshminarayanan et al., 2017).

In Bayesian deep learning, there is a significant interest in having a probabilistic interpretation of network parameters and modeling a distribution over them. Earlier approaches have utilized dropout as a Bayesian approximation, by randomly setting different parameters to zero and thus integrating over many possible networks.

(Gal & Ghahramani, 2016) showed that networks with dropout following each layer are equivalent to a deep Gaussian process (Damianou & Lawrence, 2013) marginalized over its covariance functions, and proposed MC dropout as a simple way to estimate model uncertainty. Applying dropout to every layer, however, may result in over-regularization and underfitting of the target function. Moreover, dropout does not integrate over the full space of possible models, only those which may be reached from one (random) initialization.

As another interesting direction: hypernetworks (Ha et al., 2016) are neural networks which output parameters for a target neural network. The hypernetwork and the target network together form a single model which is trained jointly. The original hypernetwork produced the target weights as a deterministic function of its own weights, but Bayesian Hypernetworks (BHNs) (Krueger et al., 2017), and Multiplicative Normalizing Flows (MNF) (Louizos & Welling, 2016) scale and shift model parameters by transforming samples from a Gaussian prior through a normalizing flow. Normalizing flows can model complicated posteriors, but are composed of only bijective, invertible functions. This allows for computation of the exact likelihood of a posterior sample, but also limits their scalability and the variety of learnable functions.

In this paper we explore an approach which generates all the parameters of a neural network in a single pass, without assuming any fixed noise models on the weights, or functional form of the generating function. To keep our method scalable, we do not restrict ourselves to invertible functions as in flow-based approaches. We instead utilize ideas from generative adversarial networks (GANs). We are especially motivated by recent Wasserstein Auto-encoder (Tolstikhin et al., 2017) approaches. These approaches have demonstrated an impressive capability to model complicated, multimodal distributions.

One of the issues in generating weights for every layer is the connectivity of the network: the output of the previous layer becomes the input of the next layer, hence the generated weights must correspond across layers in order to produce valid results. In our approach, we sample from a simple multi-dimensional Gaussian distribution and transform this sample into multiple different vectors. We call this procedure a mixer, since it introduces correlation into the otherwise independent random noise. Each resulting vector is then used to generate all the weights within one layer of a deep network. The generator is trained with the conventional maximum-likelihood (classification/regression) loss on the weights that it generates, and an adversarial regularization keeps it from collapsing onto only one mode. In this way, it is possible to generate networks much larger than the dimensionality of the latent code, making our approach capable of generating all the weights of a deep network with a single GPU. As an example, in our experiments on CIFAR-10 we start from a 256-dimensional latent vector and generate all weights in one pass, consuming only 4 GB of GPU memory.

Somewhat surprisingly, with just this approach we can already generate complete, multi-layer convolutional networks which do not require additional fine-tuning. We are able to easily sample many well-trained networks from the generator which each achieve low loss on the dataset the generative model is trained on. Moreover, our diversity constraints result in models significantly more diverse than traditional training with multiple random starts, dropout or adding scaling factors to the weights.

We believe our approach is widely applicable to a variety of tasks. One area where populations of diverse networks show promise is uncertainty estimation and anomaly detection. We show through a variety of experiments that populations of diverse networks sampled from our model produce reasonable uncertainty estimates, computed as the entropy of the predictive distribution of the sampled networks. Such uncertainty estimates allow us to detect out-of-distribution samples as well as adversarial examples. Our method is straightforward, as well as easy to train and sample from. We hope to inspire future work on estimating the manifold of neural networks.

We summarize our contribution as follows:

  • We propose HyperGAN, a novel approach for generating all the weights for a target network architecture. HyperGAN contains a novel mixer that mixes the noise vector into multiple separate vectors that generate each layer of the network respectively.

  • Different from prior GANs, HyperGAN does not require repeated samples to start with (e.g. there is no need to train a large collection of networks to create a training set for the GAN), but trains directly on the maximum-likelihood loss function. This significantly reduces the effort needed in training. The generated networks perform well without any need for further fine-tuning.

  • We have validated our performance on classification, where we use HyperGAN to easily generate an ensemble of 100 networks and achieve significantly better accuracy than individual networks. To validate the uncertainty estimates provided by multiple networks sampled from the HyperGAN, we performed experiments on a synthetic regression dataset, an open-category classification task, and an adversarial detection task.

2 Related Work

Generating parameters for neural networks has been framed in contexts other than the Bayesian and flow-based approaches described above. The hypernetwork framework (Ha et al., 2016) described models where one network directly supervises the weight updates of another network. (Pawlowski et al., 2017) used hypernetworks to generate layer-wise weights for a target architecture given some auxiliary noise as input. The auxiliary noise is noted to be independent between layers, while we impose structure in our inputs to create diversity in our generated samples.

Computer vision methods often use data driven approaches as seen in methods such as Spatial Transformer networks (Jaderberg et al., 2015), or Dynamic Filter networks (Jia et al., 2016). In these methods the filter parameters of the main network are conditioned on the input data, receiving contextual scale and shift updates from an auxiliary network. Our approach, however, generates weights for the whole network. Furthermore our predicted weights are highly nonlinear functions of the input, instead of simple affine transformations based on the input examples.

GANs have also been used as a method for sampling the distribution captured by a Bayesian neural network (BNN) trained with Stochastic Gradient Langevin Dynamics (SGLD). (Wang et al., 2018) propose Adversarial Posterior Distillation (APD): instead of generating parameters from the BNN posterior using MCMC, a GAN trained on intermediate models during the SGLD process is used to sample from a BNN. However, the training samples from the SGLD training process are inevitably correlated, potentially reducing the diversity of generated networks. Our approach does not use correlated examples in training and hence it can generate more diverse networks.

Recently, (Lakshminarayanan et al., 2017) proposed Deep Ensembles, where adversarial training was applied to standard ensembles to smooth the predictive variance. However, adversarial training is an expensive process, since adversarial examples must be generated for each batch of data seen. We seek a method to learn a distribution over parameters which does not require adversarial training.

Meta-learning approaches use different kinds of weights in order to increase the generalization ability of neural networks. The first proposed method is fast weights (Hinton & Plaut, 1987), which uses an auxiliary (slow) network to produce weight changes in the target (fast) network, acting as a short-term memory store. Meta Networks (Munkhdalai & Yu, 2017) build on this approach by using an external neural memory store in addition to multiple sets of fast and slow weights. (Ba et al., 2016) augment recurrent networks with fast weights as a more biologically plausible memory system. Unfortunately, generating each predicting network requires querying the base (slow) learner many times. Many of these methods, along with hyperparameter learning (Lorraine & Duvenaud, 2018) and the original hypernetwork, learn target weights which are deterministic functions of the training data. Our method instead captures a distribution over parameters, and provides a cheap way to directly sample full networks from a learned distribution.

Figure 1: HyperGAN architecture. The mixer $Q$ transforms the noise sample $z \sim \mathcal{N}(0, I_d)$ into correlated latent codes $q = (q_1, \dots, q_{N_l})$. The generators each transform one latent subvector $q_i$ into the parameters of the corresponding layer in the target network. The discriminator $D$ forces the mixed latent space $Q(z)$ to be well-distributed and close to the prior $P$.

3 HyperGAN

Taking a cue from the original hypernetwork framework for generating neural networks (Ha et al., 2016), we coin our approach HyperGAN. The idea of HyperGAN is to utilize a GAN-type approach to directly generate discriminative networks. To do this, a standard approach would be to acquire a large set of trained neural networks and use those as training data for the GAN (Wang et al., 2018). However, a large collection of neural networks would be extremely costly to build. The other approach, proposed in (Wang et al., 2018), of utilizing many intermediate models from an SGLD training process as examples to train the GAN, yields training samples that are highly correlated and not diverse enough.

Instead, we propose to directly optimize the supervised learning objective for the generator, rather than a reconstruction error on training samples as in normal GANs. Similar to a GAN, we start by drawing a random sample $z \sim \mathcal{N}(0, I_d)$, where $0$ is an all-zero vector and $I_d$ is a $d \times d$ identity matrix. The idea is that this random sample provides enough diversity, and if we can maintain such diversity while optimizing the supervised learning objective, we can generate networks that all optimize the loss function well, but are sufficiently diverse because they are generated from different Gaussian random vectors.

Figure 1 shows the HyperGAN architecture. We begin by defining a neural network as a function $F(x; \theta)$ with input $x$ and parameters $\theta$, consisting of a given architecture with $N_l$ layers, and a training set $\mathcal{D}$ with inputs $x$ and targets $y$. Distinct from the standard GAN, we propose a mixer $Q$, a fully-connected network that maps $z$ to a mixed latent space $Q(z)$. The mixer is motivated by the observation that weight parameters between network layers must be strongly correlated, as the output of one layer needs to be the input to the next one. Hence, it is likely that some correlations are also needed in the latent vectors that generate those weight parameters. Our $N_l \times d_e$-dimensional mixed latent space contains vectors that are all correlated, which we then partition into layer embeddings $q = (q_1, \dots, q_{N_l})$, each $q_i$ being a $d_e$-dimensional vector. Finally, we use $N_l$ parallel generators $G_1, \dots, G_{N_l}$ to generate the parameters $\theta_i = G_i(q_i)$ for all layers in $F$. This approach is also memory efficient, since the extremely high-dimensional space of the weight parameters is connected separately to multiple latent vectors, instead of being fully connected to the entire latent space.
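To make the data flow concrete, the following is a minimal PyTorch sketch of the mixer and per-layer generators as we read them; the module names, hidden widths, and example layer sizes (taken from the MNIST target listed in the appendix) are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Mixer(nn.Module):
    """Maps independent Gaussian noise z to a correlated mixed latent code,
    then partitions it into one embedding per target-network layer."""
    def __init__(self, d=256, d_e=128, n_layers=3):
        super().__init__()
        self.n_layers, self.d_e = n_layers, d_e
        self.net = nn.Sequential(nn.Linear(d, 512), nn.ReLU(),
                                 nn.Linear(512, n_layers * d_e))

    def forward(self, z):
        q = self.net(z)                               # mixed latent space Q(z)
        return q.view(-1, self.n_layers, self.d_e)    # layer embeddings q_1..q_N

class LayerGenerator(nn.Module):
    """Transforms one layer embedding into the flat parameter vector of one layer."""
    def __init__(self, d_e, n_params):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_e, 512), nn.ReLU(),
                                 nn.Linear(512, n_params))

    def forward(self, q_i):
        return self.net(q_i)

# Example: sample one set of parameters for a 3-layer MNIST target network.
layer_sizes = [32 * 1 * 5 * 5, 32 * 32 * 5 * 5, 512 * 10]   # per-layer parameter counts
mixer = Mixer(d=256, d_e=128, n_layers=len(layer_sizes))
generators = nn.ModuleList(LayerGenerator(128, n) for n in layer_sizes)

z = torch.randn(1, 256)                                     # z ~ N(0, I_d)
q = mixer(z)                                                # correlated layer embeddings
theta = [g(q[:, i]) for i, g in enumerate(generators)]      # one weight tensor per layer
```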

After generation, we can evaluate the new model on the training set. We define an objective which minimizes the error of the generated parameters with respect to a task loss $\mathcal{L}$:

$$\min_{Q,\, G} \;\; \mathbb{E}_{z \sim \mathcal{N}(0,\, I_d)} \Big[\, \mathcal{L}\big(F(x;\, G(Q(z))),\; y\big) \Big] \tag{1}$$

At each training step we generate a different network from a random $z$, and then evaluate the loss function on a mini-batch from the training set. The resulting loss is backpropagated through the generators until $F(x;\, G(Q(z)))$ minimizes the target loss $\mathcal{L}$.

The main concern about directly optimizing the formulation in (1) would be that the codes sampled from $Q$ may collapse to the maximum likelihood estimate (MLE) (when $\mathcal{L}$ is a log-likelihood). This means that the generators may learn a very narrow approximation of the parameter distribution. On the other hand, one can think of the training process as simultaneously starting from many starting points (since we sample a different $z$ for each mini-batch) and attempting to obtain a good optimum on all of them. Because deep networks are extremely overparameterized and many global optima exist (Choromanska et al., 2015), the optimization may indeed converge to different optima from different random $z$.

The mixer may make the training process easier by building the required correlations into the latent code, hence improving the chance that the optimization converges to different optima from different random $z$. To further ensure that the parameters are well distributed, we add an adversarial constraint on the mixed latent space $Q(z)$ and dictate that it not deviate too much from a Gaussian prior $P$. This constraint is closer to the generated parameters and ensures that $Q$ itself does not collapse to always outputting the same latent code. With this we arrive at the HyperGAN objective:

$$\min_{Q,\, G} \;\; \mathbb{E}_{z \sim \mathcal{N}(0,\, I_d)} \Big[\, \mathcal{L}\big(F(x;\, G(Q(z))),\; y\big) \Big] \;+\; \lambda\, d\big(P,\; Q(z)\big) \tag{2}$$

where $\lambda$ is a hyperparameter and $d(P, Q(z))$ is a regularization term which penalizes the distance between the prior $P$ and the distribution of latent codes. In practice, $d$ could be any distance function between two distributions. We choose to parameterize $d$ with a discriminator network $D$ that outputs probabilities, and use the adversarial loss (Goodfellow et al., 2014) to approximate $d$. Note that while $P$ and $\mathcal{N}(0, I_d)$ are both multivariate Gaussians, they are of different dimensionality and covariance.

$$\min_{Q}\; \max_{D} \;\; \mathbb{E}_{q^* \sim P}\big[\log D(q^*)\big] \;+\; \mathbb{E}_{z \sim \mathcal{N}(0,\, I_d)}\big[\log\big(1 - D(Q(z))\big)\big] \tag{3}$$

Note that we find it difficult to learn a discriminator in the output (parameter) space because the dimensionality is high and there is no structure in those parameters to be utilized as in images (where CNNs can be trained). Our experiments show that regularizing in the latent space works well, which matches results from recent work in implicit generative models (Tolstikhin et al., 2017).
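For illustration, one training step of objective (2) might look like the following sketch. The functional evaluation of the generated weights follows the MNIST target described later (two 5x5 conv layers and one linear layer, biases omitted), and the adversarial term uses the standard non-saturating GAN loss on the mixed latent codes; `mixer`, `generators`, and `discriminator` are assumed to be modules like those sketched above, with the discriminator producing a single logit. This is a sketch under those assumptions, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def target_forward(x, theta):
    """Evaluate the target MNIST network F(x; theta) with generated weights.
    theta = [w1, w2, w3] holds flat parameter vectors for conv1, conv2, linear."""
    w1 = theta[0].view(32, 1, 5, 5)
    w2 = theta[1].view(32, 32, 5, 5)
    w3 = theta[2].view(10, 512)
    h = F.max_pool2d(F.leaky_relu(F.conv2d(x, w1)), 2)    # 28x28 -> 12x12
    h = F.max_pool2d(F.leaky_relu(F.conv2d(h, w2)), 2)    # 12x12 -> 4x4
    return F.linear(h.flatten(1), w3)                      # class logits

def hypergan_step(mixer, generators, discriminator, opt_g, opt_d, x, y, lam=1.0):
    z = torch.randn(1, 256)
    q = mixer(z)                                           # mixed latent codes Q(z)
    theta = [g(q[:, i]).squeeze(0) for i, g in enumerate(generators)]

    # Mixer/generator update: task loss plus adversarial regularizer on Q(z).
    task_loss = F.cross_entropy(target_forward(x, theta), y)
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(q.flatten(1)), torch.ones(1, 1))     # push Q(z) toward the prior
    opt_g.zero_grad()
    (task_loss + lam * adv_loss).backward()
    opt_g.step()

    # Discriminator update: separate prior samples from mixed latent codes.
    prior = torch.randn_like(q.flatten(1))
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(prior), torch.ones(1, 1)) +
              F.binary_cross_entropy_with_logits(discriminator(q.flatten(1).detach()),
                                                 torch.zeros(1, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    return task_loss.item()
```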

This framework is general and can be adapted to a variety of tasks and losses. In this work, we show that HyperGAN can operate in both classification and regression settings. For multi-class classification, the generators and mixer are trained with the cross entropy loss function:

$$\mathcal{L} \;=\; \mathbb{E}_{(x,\, y) \sim \mathcal{D}}\Big[ -\textstyle\sum_{c}\, y_c \,\log F_c\big(x;\, G(Q(z))\big) \Big] \tag{4}$$

For regression tasks we replace the cross-entropy term with the mean squared error (MSE):

$$\mathcal{L} \;=\; \mathbb{E}_{(x,\, y) \sim \mathcal{D}}\Big[ \big\|\, y - F\big(x;\, G(Q(z))\big) \big\|^2 \Big] \tag{5}$$

3.1 Discussion: Learning to Generate without Explicit Samples

In generative models such as GANs or the WAE (Tolstikhin et al., 2017), it is necessary to have a collection of data points, drawn from the distribution being estimated, for the generator to be trained on. HyperGAN does not have such a set of samples to train with. Instead, it optimizes a supervised learning objective such as maximum likelihood. To draw a connection between this objective and the traditional reconstruction objective in GANs, we note that after training, $G(Q(z))$ represents a maximum likelihood estimate of $\theta$, and the MLE has a well-known link to the KL divergence:

$$\arg\max_{\theta}\; \mathbb{E}_{x \sim p^*}\big[\log p(x \mid \theta)\big] \;=\; \arg\min_{\theta}\; \mathrm{KL}\big(p^*(x)\,\big\|\,p(x \mid \theta)\big) \tag{3.1}$$

Equation (3.1) shows that by minimizing the error of the MLE on the log-likelihood, we are indeed minimizing the KL divergence between the unknown true parameter distribution $p^*(\theta)$ and the distribution of generated samples $G(Q(z))$.

Hence, we can also view HyperGAN as approximating a target distribution over neural network parameters. However, HyperGAN only assumes that this target distribution exists, and updates its approximation via maximum likelihood to better match the unknown $p^*(\theta)$. The success of this approach in generating parameter distributions could lead to further insights about generative models.
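For completeness, the identity behind (3.1) follows from the standard decomposition of the KL divergence (written generically; the first term does not depend on $\theta$):

$$\mathrm{KL}\big(p^*(x)\,\|\,p(x \mid \theta)\big) \;=\; \mathbb{E}_{x \sim p^*}\big[\log p^*(x)\big] \;-\; \mathbb{E}_{x \sim p^*}\big[\log p(x \mid \theta)\big],$$

so maximizing the expected log-likelihood over $\theta$ is the same as minimizing the KL term.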

4 Experiments

4.1 Experiment Setup

We conduct a variety of experiments to test HyperGAN’s ability to achieve both high accuracy and obtain accurate uncertainty estimates. First we show classification performance on both MNIST and CIFAR-10 datasets. Next we examine HyperGAN’s capability to learn the variance of a simple 1D dataset. Afterwards, we perform experiments on anomaly detection by testing HyperGAN on out-of-distribution examples. For models trained on MNIST we test on notMNIST. For CIFAR experiments we train on the first 5 classes of CIFAR-10 (airplane, automobile, bird, cat, deer), then test whether the classifier can detect out-of-distribution images from the 5 remaining classes not shown during training. Finally, we test our robustness to adversarial examples as extreme cases of out-of-distribution data.

In the following experiments we compare against APD (Wang et al., 2018), MNF (Louizos & Welling, 2016), and MC dropout (Gal & Ghahramani, 2016). We also evaluate standard ensembles as a baseline. The target architecture is the same across all approaches; more details about the target architecture can be found in the supplementary material. For APD we train both the MNIST and CIFAR networks with SGLD for 100 epochs, and then train a GAN with the architectures and hyperparameters specified in (Wang et al., 2018) until convergence. For MNF we use the code provided with (Louizos & Welling, 2016) and train the model for 100 epochs. MC dropout is trained and sampled from as described in (Gal & Ghahramani, 2016). In all experiments, unless otherwise stated, we draw 100 networks from the posterior to form the predictive distribution for each approach.

HyperGAN Details

Both of our models take a 256-dimensional noise sample $z$ as input, but have different-sized mixed latent spaces. The HyperGAN for the MNIST experiments consists of three weight generators, each taking a 128-dimensional latent vector as input. The target network for the MNIST experiments is a small two-layer convolutional network followed by one fully-connected layer, using leaky ReLU activations and 2x2 max pooling after each convolutional layer. Our HyperGAN trained on CIFAR-10 uses 5 weight generators and a latent dimensionality of 256. The target architecture for CIFAR-10 consists of three convolutional layers, each followed by leaky ReLU and 2x2 max pooling, with two fully connected layers after the convolutional layers.

The mixer, generators, and discriminator are each a 2 layer MLP with 512 units in each layer and ReLU nonlinearity. We found that larger networks offered little performance benefit, and ultimately hurt scalability. It should be noted that larger networks would not harm the capability of HyperGANs to model the target distribution. We trained our HyperGAN on MNIST using less than 1.5GB of memory on a single GPU, while CIFAR-10 used just 4GB, making HyperGAN surprisingly scalable.
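To pin down the scale involved, the following sketch assembles mixer, generators, and discriminator at the sizes described above for the CIFAR-10 model (2-layer MLPs with 512 hidden units, 256-dimensional noise and layer embeddings, target layer sizes taken from the appendix table); the exact shapes and the discriminator input are our reading of the text, not a specification.

```python
import torch.nn as nn

HIDDEN = 512      # hidden width of mixer, generators, and discriminator
D_NOISE = 256     # dimensionality of z
D_EMBED = 256     # per-layer embedding size for the CIFAR-10 model

# Flat parameter counts of the CIFAR-10 target network (appendix, Table 4).
target_param_counts = [
    16 * 3 * 3 * 3,     # conv1
    32 * 16 * 3 * 3,    # conv2
    32 * 64 * 3 * 3,    # conv3
    256 * 128,          # linear1
    128 * 10,           # linear2
]

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, d_out))

mixer = mlp(D_NOISE, len(target_param_counts) * D_EMBED)
generators = nn.ModuleList(mlp(D_EMBED, n) for n in target_param_counts)
discriminator = mlp(len(target_param_counts) * D_EMBED, 1)   # real/fake logit on Q(z)

n_hypergan = sum(p.numel() for m in [mixer, *generators, discriminator]
                 for p in m.parameters())
print(f"HyperGAN parameters: {n_hypergan:,}; "
      f"generated target parameters: {sum(target_param_counts):,}")
```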

           HyperGAN                   Standard Training           APD
           Conv1   Conv2   Linear     Conv1   Conv2    Linear     Conv1   Conv2   Linear
Mean       7.49    51.10   22.01      27.05   160.51   5.97       2.63    5.01    17.4
Std. dev.  1.59    10.62   6.01       0.31    0.51     0.06       0.22    0.41    1.43

Table 1: 2-norm statistics on the layers of a population of networks sampled from HyperGAN, compared to 10 standard networks trained from different random initializations, as well as 10 samples from the posterior learned by APD. Both HyperGAN and the standard models were trained on MNIST to 98% accuracy. It is easy to see that HyperGAN generates far more diverse networks.

4.2 Classification Accuracy and Diversity

First we evaluate the classification accuracy of HyperGAN on MNIST and CIFAR-10. Classification serves as an entrance exam into our other experiments, as the distribution we want to learn is over parameters which can effectively solve the classification task. We test both single sampled networks and ensembles. For our ensembles we average predictions from $M$ sampled models with the scoring rule $p(y \mid x) = \frac{1}{M}\sum_{m=1}^{M} p(y \mid x,\, \theta_m)$. It should be noted that we did not perform fine-tuning or any additional training on the sampled networks. The results are shown in Table 2. We generate ensembles of different sizes and compare against APD (Wang et al., 2018), MNF (Louizos & Welling, 2016), and MC dropout (Gal & Ghahramani, 2016). For each method we draw 100 samples from the learned posterior to generate the predictive distribution and use the above averaging to compute the classification score.
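A minimal sketch of this ensemble scoring (averaging softmax probabilities over sampled networks) is shown below; `sample_network` is a placeholder for drawing a fresh classifier from a trained HyperGAN and is assumed, not part of the original code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(sample_network, x, n_models=100):
    """p(y|x) ~= (1/M) * sum_m softmax(F(x; theta_m)) over M sampled networks."""
    probs = torch.stack([F.softmax(sample_network()(x), dim=-1)
                         for _ in range(n_models)])
    return probs.mean(dim=0)    # averaged predictive distribution

# Example usage (hypothetical `hypergan.sample()` returning a classifier module):
# preds = ensemble_predict(lambda: hypergan.sample(), test_images).argmax(dim=-1)
```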

Method                   MNIST    MNIST 5000    CIFAR-5    CIFAR-10    CIFAR-10 5000
HyperGAN, 1 network      98.64    96.69         84.50      76.32       76.31
HyperGAN, 5 networks     98.75    97.24         85.51      76.84       76.41
HyperGAN, 10 networks    99.22    97.33         85.54      77.52       77.12
HyperGAN, 100 networks   99.31    97.71         85.81      77.71       77.38
APD                      98.61    96.35         83.21      75.62       75.13
MNF                      99.30    97.52         84.00      76.71       76.88
MC Dropout               98.73    95.58         84.00      72.75       70.10

Table 2: Classification accuracy (%) of HyperGAN on MNIST and CIFAR-10. CIFAR-5 refers to a dataset with only the first 5 classes of CIFAR-10. MNIST 5000 and CIFAR-10 5000 refer to training on only 5000 examples of MNIST and CIFAR-10, respectively, which has been used in prior work (e.g. (Louizos & Welling, 2016)). When held to the same architecture, the ensemble from HyperGAN performs better than ensembles using other approaches (100-network ensembles are used for APD, MNF, and MC Dropout).

In Table 1 we show some statistics of the networks generated by HyperGAN on MNIST. We note that HyperGAN can generate very diverse networks, as the variance of network weights generated by the HyperGAN is significantly higher than standard training from different random initializations, as well as APD. More insights on the diversity of HyperGAN samples can be found in Section 5.

4.3 1-D Toy Regression Task

We next evaluate the capability of HyperGAN to fit a simple 1D function from noisy samples and to produce reasonable uncertainty estimates in regions with few training samples. This dataset was first proposed by (Hernández-Lobato & Adams, 2015), and consists of a training set of 20 points drawn uniformly from the interval $[-4, 4]$. The targets are given by $y = x^3 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 9)$. We used the same target architecture as in (Hernández-Lobato & Adams, 2015) and (Louizos & Welling, 2016): a one-layer neural network with 100 hidden units and ReLU nonlinearity, trained with MSE. For HyperGAN we use two-layer generators and 128 hidden units across all networks. Because this is a small task, we use only a 64-dimensional latent space.
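For reference, the toy dataset can be generated as follows; this is our sketch of the setup in (Hernández-Lobato & Adams, 2015), with the interval and noise scale taken from that benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-4.0, 4.0, size=20)          # 20 training inputs
y = x ** 3 + rng.normal(0.0, 3.0, size=20)   # targets y = x^3 + eps, eps ~ N(0, 9)
```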

Figure 2 shows that HyperGAN clearly learns the target function and captures the variation in the data. Furthermore, sampling more (100) networks to compose a larger ensemble improves the predicted uncertainty in regions with few training examples.

Figure 2: Results of HyperGAN on the 1D regression task. From left to right, we plot the predictive distribution of 10 and 100 sampled models from a trained HyperGAN. Within each image, the blue line is the target function $y = x^3$, the red circles show the noisy observations, the grey line is the learned mean function, and the light blue shaded region denotes the standard deviation band around the mean.

4.4 Anomaly Detection

To measure the uncertainty on out-of-distribution data, we measure the total predictive entropy given by HyperGAN-generated ensembles. For MNIST experiments we train a HyperGAN on the MNIST dataset and test on the notMNIST dataset: a 10-class set of 28x28 grayscale images depicting the letters A - J. In this setting, we want the softmax probabilities on inlier MNIST examples to have minimum entropy: a single large activation close to 1. On out-of-distribution data we want the predictions to assign nearly equal probability to all classes. Similarly, we test our CIFAR-10 model by training on the first 5 classes and using the latter 5 classes as out-of-distribution examples. To build an estimate of the predictive entropy we sample multiple networks from HyperGAN, evaluate them on each example, and measure the entropy of the resulting predictive distribution. We compare our uncertainty measurements with those of APD, MNF, MC dropout, and standard ensembles. Unless otherwise noted, we compute the entropy based on 100 networks.
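The out-of-distribution score is the entropy of the ensemble-averaged predictive distribution; a small sketch follows, again with `sample_network` as an assumed stand-in for sampling a classifier from a trained HyperGAN.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predictive_entropy(sample_network, x, n_models=100, eps=1e-12):
    """Entropy H[p(y|x)] of the ensemble-averaged softmax; high entropy
    flags likely out-of-distribution (or adversarial) inputs."""
    probs = torch.stack([F.softmax(sample_network()(x), dim=-1)
                         for _ in range(n_models)]).mean(dim=0)
    return -(probs * (probs + eps).log()).sum(dim=-1)
```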

Figure 3: Empirical CDF of the predictive entropy of all approaches on notMNIST. L2 refers to conventional ensembles trained from different random starts. One can see that the entropy of HyperGAN models is significantly higher than that of the baselines.

Fig. 3 shows that HyperGAN is overall less confident on outlier samples than other approaches on the notMNIST dataset. Standard ensembles overfit considerably, as expected; further, Table 1 shows that the diversity of standard ensembles is quite low. Fig. 4 shows similar behavior on CIFAR. Hence HyperGAN can better separate inliers from outliers when out-of-distribution examples are present.

Figure 4: Empirical CDF of the predictive entropy on out of distribution data: the 5 classes of CIFAR-10 unseen during training. L2 refers to conventional ensembles trained from different random starts

4.5 Adversarial Detection

We employ the same experimental setup for the detection of adversarial examples, an extreme type of out-of-distribution data. Adversarial examples are often optimized to lie within a small neighborhood of a real data point, so that they are hard for humans to detect. They are created by adding perturbations in the direction of greatest loss with respect to the parameters of the model. Because HyperGAN learns a distribution over parameters, it should be more robust to adversarial attacks. We generate adversarial examples using the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and Projected Gradient Descent (PGD) (Madry et al., 2017). FGSM adds a small perturbation to the target image in the direction of greatest loss. FGSM is known to underfit to the target model, hence it may transfer better across many similar models. In contrast, PGD takes many steps in the direction of greatest loss, producing a stronger adversarial example at the risk of overfitting to a single set of parameters. This poses the following challenge: to detect attacks by both FGSM and PGD, HyperGAN will need to generate diverse parameters to avoid both attacks.
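For concreteness, FGSM and PGD can be sketched as below; these are the standard textbook formulations (inputs assumed to lie in [0, 1], `model` any differentiable classifier), not the authors' attack code.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: perturb x in the direction of greatest loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha, steps):
    """Multi-step PGD: repeated gradient-sign steps projected back into the eps-ball."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)   # project into the L-inf ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```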

To detect adversarial examples, we first hypothesize that a single adversarial example will not fool the entire space of parameters learned by HyperGAN. If we evaluate adversarial examples against many newly generated networks, we should then see high entropy among the predictions (softmax probabilities) for any individual class.

Adversarial examples have been shown to successfully fool ensembles (Dong et al., 2017), but with HyperGAN one can always generate significantly more models to add to the ensemble for the cost of one forward pass, making it hard to attack. We compare the performance of HyperGAN with ensembles of conventionally trained models on MNIST. We fuse their logits (unnormalized log probabilities) together as $l(x) = \sum_i w_i\, l_i(x)$, where $w_i$ is the $i$th model weighting and $l_i$ is the logits of the $i$th model. In all experiments we consider uniformly weighted ensembles. For HyperGAN we simply sample the generators to create as many models as we need, and then fuse their logits together. We test HyperGAN ensembles of several sizes. Adversarial examples are generated by attacking the ensemble directly until the generated image completely fools the whole ensemble. For HyperGAN, we attack the full ensemble but test with a new ensemble of equal size. For other methods we first attack a single model and then test with 100 samples unless otherwise specified, since that is the setting reported in their papers. Note this puts HyperGAN in a disadvantageous position, as it has to overcome a much stronger attack (one having knowledge of more models).

For the purposes of adversarial detection, we compute the entropy of the predictive distribution of the ensemble to score the example on the likelihood that it was drawn from the training distribution. Figure 5 shows that HyperGAN predictions on adversarial examples approach maximum entropy, performing better than other methods as well as standard ensembles, even when the experimental conditions are significantly disadvantageous for HyperGAN. HyperGAN is especially suited to this task, as adversarial examples are optimized against a set of parameters, and those are exactly the parameters which HyperGAN can change. Because HyperGAN can generate very diverse models, it is difficult for an adversarial example to fool the ever-changing ensemble generated by HyperGAN. In the supplementary material we show that adversarial examples which fool a single HyperGAN sample fool only 50% to 70% of a larger ensemble.

Figure 5: Entropy of predictions on FGSM and PGD adversarial examples. HyperGAN generates ensembles that are far more effective than standard ensembles, even with equal population size. Note that for large ensembles it is hard to find adversarial examples with small norms.

5 Ablation Study

The proposed architecture for HyperGAN is motivated by two requirements. First, we want to generate parameters for a target architecture which can solve a task specified by the data and the loss function. Second, we want the generated networks to be diverse. The two specific constructs in the paper are the mixer, which introduces the necessary correlations between generated layers, and an adversarially trained discriminator which enforces that samples from the mixed latent space are well-distributed according to the prior. In this section we test and discuss the validity and effect of these two components by removing each one in turn and checking the classification accuracy and diversity of HyperGAN after the removal.

Training without a discriminator is the simpler of the two experiments. The only modification made to the training procedure is that we remove the distributional constraint on the mixed latent space $Q(z)$. We can see in Figure 6 that the classification accuracy of the generated networks is unaffected, while the diversity shown in Figure 7 decreases, showing that by keeping the mixed latent space well-distributed, the discriminator has a positive effect on the diversity of the generated models. This is similar to the prevention of mode collapse in adversarial (Makhzani et al., 2015) and Wasserstein (Tolstikhin et al., 2017) autoencoders. We can also see that over the course of training, diversity does indeed decrease, which is also common in GAN training. Hence, stopping training early, once accuracy has just converged, may be key to maintaining diversity. We would like to study the effect of early stopping further from the theoretical side in future work.

Figure 6: Ablation of HyperGAN accuracy on CIFAR-10, with normal HyperGAN, without the mixer and without the discriminator, respectively. All of them converge to very similar accuracy but the version without the mixer stumbles significantly in the beginning

Next we train HyperGAN without the mixer, and hence without the mixed latent space $Q(z)$. In this case, the generator for each layer takes as input an independent $d_e$-dimensional sample from a Gaussian distribution. From Fig. 6, we see that even in this case we can obtain similar classification accuracy. However, from Fig. 7 we see that without the mixer, diversity suffers significantly. We hypothesize that without the mixer, a valid optimization trajectory is difficult to find (Fig. 6 shows that the HyperGAN with no mixer starts with low accuracy for a longer period); once one trajectory is finally found, the optimizer prioritizes classification loss over diversity. When the mixer is included, the built-in correlation between the parameters of different layers may make optimization easier, hence diverse good optima are found even from different random starts. We also report experiments with both the mixer and the discriminator removed in the supplementary material.

Figure 7: HyperGAN diversity on CIFAR-10 given a normal training run, with the mixer removed, and with the discriminator removed. Diversity is shown as the standard deviation divided by the norm of the weights, within a population of generated networks.
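The diversity measure in Figure 7 (standard deviation divided by the norm of the weights, over a population of generated networks) can be computed roughly as in the sketch below; the exact normalization the authors use may differ.

```python
import torch

def weight_diversity(weight_samples: torch.Tensor) -> float:
    """weight_samples: (n_networks, n_params) tensor holding one flattened
    layer (or whole network) per sampled model. Returns the mean per-weight
    standard deviation divided by the mean 2-norm of a sample."""
    std = weight_samples.std(dim=0).mean()
    norm = weight_samples.norm(dim=1).mean()
    return (std / norm).item()
```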

6 Conclusion and Future Work

We have proposed HyperGAN, a generative model for neural network weights. Training a GAN-style model to learn a probability distribution over neural networks allows us to non-deterministically sample diverse, performant networks, which we can use to form ensembles that obtain better classification accuracy and uncertainty estimates. Our method is highly scalable in terms of the number of networks in the predicting ensemble, requiring just one forward pass to generate a new performant network and a low GPU memory footprint. We have also shown that the uncertainty estimates from the generated ensembles are capable of detecting out-of-distribution data and adversarial examples. In the future, we believe HyperGAN could impact many domains, including meta-learning and reinforcement learning.


Appendix A

A.1 Generated Filter Examples

We show the first filter in 25 different networks generated by the HyperGAN to illustrate their difference in Fig. 8. It can be seen that qualitatively HyperGAN learns to generate classifiers with a variety of filters.

Figure 8: Convolutional filters from MNIST classifiers sampled from HyperGAN. For each image we sample the same 5x5 filter from 25 separate generated networks. From left to right: figures a and b show the first samples of the first two generated filters for layer 1 respectively. Figures c and d show samples of filters 1 and 2 for layer 2. We can see that qualitatively, HyperGAN learns to generate classifiers with a variety of filters.

A.2 Outlier Examples

In Figure 9 we show examples which do not behave like most of their respective distribution. On top are MNIST images which HyperGAN networks predict to have high entropy. We can see that they are generally ambiguous and do not fit with the rest of the training data. The bottom row shows notMNIST examples which score with low entropy according to HyperGAN. It can be seen that these examples look like they could come from the MNIST training distribution, making HyperGAN's predictions reasonable.

Figure 9: Top: MNIST examples to which HyperGAN assigns high entropy (outlier). Bottom: Not-MNIST examples which are predicted with low entropy (inlier)

A.3 HyperGAN Network Details

In Tables 3 and 4 we show how the latent points are transformed through the generators to become a full layer of parameters. For a MNIST-based HyperGAN we generate layers from small latent points of dimensionality 128. For CIFAR-10-based HyperGANs we use a larger dimensionality of 256 for the latent points.

Layer      Latent Size    Output Layer Size
Conv 1     128 x 1        32 x 1 x 5 x 5
Conv 2     128 x 1        32 x 32 x 5 x 5
Linear     128 x 1        512 x 10
Table 3: MNIST HyperGAN Target Size

Layer      Latent Size    Output Layer Size
Conv 1     256 x 1        16 x 3 x 3 x 3
Conv 2     256 x 1        32 x 16 x 3 x 3
Conv 3     256 x 1        32 x 64 x 3 x 3
Linear 1   256 x 1        256 x 128
Linear 2   256 x 1        128 x 10
Table 4: CIFAR-10 HyperGAN Target Size

A.4 Diversity with Neither Mixer nor Discriminator

We run experiments on both MNIST and CIFAR-10 where we remove both the mixer and the discriminator. Tables 5 and 6 show statistics of the networks generated by HyperGAN using only independent Gaussian samples as inputs to the generators. In this setting, HyperGAN learns to generate only a very narrow distribution of parameters.

HyperGAN w/o (Q, D) - CIFAR-10
           Conv1   Conv2   Conv3   Linear1   Linear2
Mean       1.87    16.83   9.35    10.66     20.35
Std. dev.  0.11    2.44    1.02    0.16      0.76

Standard Training - CIFAR-10
           Conv1   Conv2   Conv3   Linear1   Linear2
Mean       5.13    15.19   16.15   11.79     2.45
Std. dev.  1.19    4.40    4.28    2.80      0.13

Table 5: Statistics on the layers of networks sampled from HyperGAN without the mixing network or discriminator, compared to 10 standard networks trained from different random initializations.

           HyperGAN w/o (Q, D) - MNIST     Standard Training - MNIST
           Conv1   Conv2    Linear         Conv1   Conv2    Linear
Mean       10.79   106.39   14.81          27.05   160.51   5.97
Std. dev.  0.58    0.90     0.79           0.31    0.51     0.06

Table 6: Statistics on the layers of a population of networks sampled from HyperGAN, compared to 10 standard networks trained from different random initializations. Without the mixing network or the discriminator, HyperGAN suffers from a lack of diversity.

A.5 HyperGAN Diversity on Adversarial Examples

As an ablation study, in Fig. 10 we show the diversity of HyperGAN predictions on adversarial examples generated to fool one sampled network. While those examples can fool 50% to 70% of the networks generated by HyperGAN, they usually do not fool all of them.











Figure 10: Diversity of predictions on adversarial examples. FGSM and PGD examples are created against a network generated by HyperGAN, and tested on 500 more generated networks. FGSM transfers better than PGD, though both attacks fail to cover the distribution learned by HyperGAN