Training a neural network is synonymous with learning the values of the weights. In contrast, we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. Hidden in a randomly weighted Wide ResNet-50 we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet. Not only do these "untrained subnetworks" exist, but we provide an algorithm to effectively find them. We empirically show that as randomly weighted neural networks with fixed weights grow wider and deeper, an "untrained subnetwork" approaches a network with learned weights in accuracy.
What lies hidden in an overparameterized neural network with random weights? If the distribution is properly scaled, then it contains a subnetwork which performs well without ever modifying the values of the weights (as illustrated by Figure 1).
The number of subnetworks is combinatorial in the size of the network, and modern neural networks contain millions or even billions of parameters [21]. We should expect that even a randomly weighted neural network contains a subnetwork that performs well on a given task. In this work, we provide an algorithm to find these subnetworks.
Finding subnetworks contrasts with the prevailing paradigm for neural network training – learning the values of the weights by stochastic gradient descent. Traditionally, the network structure is either fixed during training (e.g. ResNet [8] or MobileNet [9]), or optimized in conjunction with the weight values (e.g. Neural Architecture Search (NAS)). We instead optimize to find a good subnetwork within a fixed, randomly weighted network. We never tune the value of any weight in the network, not even the batch norm [10] parameters or the first or last layer.
In [4], Frankle and Carbin articulate The Lottery Ticket Hypothesis: neural networks contain sparse subnetworks that can be effectively trained from scratch when reset to their initialization. We offer a complementary conjecture: within a sufficiently overparameterized neural network with random weights (e.g. at initialization), there exists a subnetwork that achieves competitive accuracy. Specifically, the test accuracy of the subnetwork is able to match the accuracy of a trained network with the same number of parameters.
This work is catalyzed by the recent advances of Zhou et al. [29]. By sampling subnetworks in the forward pass, they first demonstrate that subnetworks of randomly weighted neural networks can achieve impressive accuracy. However, we hypothesize that stochasticity may limit their performance. As the number of parameters in the network grows, they are likely to have a high variability in their sampled networks.
To this end we propose the edge-popup algorithm for finding effective subnetworks within randomly weighted neural networks. We show a significant boost in performance and scale to ImageNet. For each fixed random weight in the network, we consider a positive real-valued score. To choose a subnetwork we take the weights with the top-$k\%$ highest scores. With a gradient estimator we optimize the scores via SGD. We are therefore able to find a good neural network without ever changing the values of the weights. We empirically demonstrate the efficacy of our algorithm and formally show that under certain technical assumptions the loss decreases on the mini-batch with each modification of the subnetwork.
We experiment on small and large scale datasets for image recognition, namely CIFAR-10 [12] and ImageNet [3]. On CIFAR-10 we empirically demonstrate that as networks grow wider and deeper, untrained subnetworks perform just as well as the dense network with learned weights. On ImageNet, we find a subnetwork of a randomly weighted Wide ResNet-50 which is smaller than, but matches the performance of, a trained ResNet-34. Moreover, a randomly weighted ResNet-101 [8] with fixed weights contains a subnetwork that is much smaller, but surpasses the performance of VGG-16 [23]. In short, we validate the unreasonable effectiveness of randomly weighted neural networks for image recognition.
Lottery Tickets and Supermasks
In [4], Frankle and Carbin offer an intriguing hypothesis: neural networks contain sparse subnetworks that can be effectively trained from scratch when reset to their initialization. These so-called winning tickets have won the “initialization lottery”. Frankle and Carbin find winning tickets by iteratively shrinking the size of the network, masking out weights which have the lowest magnitude at the end of each training run.
Follow up work by Zhou et al. [29] demonstrates that winning tickets achieve better than random performance without training. Motivated by this result they propose an algorithm to identify a “supermask” – a subnetwork of a randomly initialized neural network that achieves high accuracy without training. On CIFAR-10, they are able to find subnetworks of randomly initialized neural networks that achieve 65.4% accuracy.
The algorithm presented by Zhou et al. is as follows: for each weight $w$ in the network they learn an associated probability $p$. On the forward pass they include weight $w$ with probability $p$ and otherwise zero it out. Equivalently, they use weight $\tilde{w} = wX$ where $X$ is a Bernoulli($p$) random variable ($X$ is 1 with probability $p$ and 0 otherwise). The probabilities $p$ are the output of a sigmoid, and are learned using stochastic gradient descent. The terminology “supermask” arises as finding a subnetwork is equivalent to learning a binary mask for the weights.
Our work builds upon Zhou et al., though we recognize that the stochasticity of their algorithm may limit performance. In Section 3.1 we provide more intuition for this claim. We show a significant boost in performance with an algorithm that does not sample supermasks on the forward pass. For the first time we are able to match the performance of a dense network with a supermask.
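As a rough illustration of the stochastic forward pass of Zhou et al. described above, the sketch below keeps each weight with probability given by a sigmoid of a learned parameter. The function names, the `logits` parameterization, and all values are assumptions made for this example; learning the logits with SGD is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_masked_weights(weights, logits, rng):
    """Keep each weight w with probability p = sigmoid(logit), else zero it."""
    masked = []
    for w, logit in zip(weights, logits):
        p = sigmoid(logit)
        x = 1 if rng.random() < p else 0  # X ~ Bernoulli(p)
        masked.append(w * x)
    return masked


rng = random.Random(0)
weights = [0.5, -1.2, 0.3, 0.8]   # fixed random weights (hypothetical values)
logits = [4.0, -4.0, 0.0, 0.0]    # hypothetical learned mask parameters

# Two forward passes with identical parameters may use different subnetworks.
first = sample_masked_weights(weights, logits, rng)
second = sample_masked_weights(weights, logits, rng)
```

Because the mask is resampled on every forward pass, the subnetwork that is evaluated can change even when the probabilities do not, which is the source of variability the text refers to.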
Neural Architecture Search (NAS)
The advent of modern neural networks has shifted the focus from feature engineering to feature learning. However, researchers may now find themselves manually engineering the architecture of the network. Methods of Neural Architecture Search (NAS) [30, 2, 16, 24] instead provide a mechanism for learning the architecture of a neural network jointly with the weights. Models powered by NAS have recently obtained state-of-the-art classification performance on ImageNet [25].
As highlighted by Xie et al. [27], the connectivity patterns in methods of NAS remain largely constrained. Surprisingly, Xie et al. establish that randomly wired neural networks can achieve competitive performance. Accordingly, Wortsman et al. [26] propose a method of Discovering Neural Wirings (DNW) – where the weights and structure are jointly optimized free from the typical constraints of NAS. We highlight DNW as we use a similar method of analysis and gradient estimator to optimize our supermasks. In DNW, however, the subnetwork is chosen by taking the weights with the highest magnitude. There is therefore no way to learn supermasks with DNW as the weights and connectivity are inextricably linked – there is no way to separate the weights and the structure.
Weight Agnostic Neural Networks
In Weight Agnostic Neural Networks (WANNs) [5], Gaier and Ha question if an architecture alone may encode the solution to a problem. They present a mechanism for building neural networks that achieve high performance when each weight in the network has the same shared value. Importantly, the performance of the network is agnostic to the value itself. They are able to obtain accuracy on MNIST [15].
We are quite inspired by WANNs, though we would like to highlight some important distinctions. Instead of each weight having the same value, we explore the setting where each weight has a random value. In Section A.2.2 of their appendix, Gaier and Ha mention that they were not successful in this setting. However, we find a good subnetwork for a given random initialization – the supermasks we find are not agnostic to the weights. Finally, Gaier and Ha construct their network architectures, while we look for supermasks within standard architectures.
Linear Classifiers
Linear classifiers on top of randomly weighted neural networks are often used as baselines in unsupervised learning [18]. Our work is different in motivation: we explicitly find untrained subnetworks which achieve high performance without changing any weight values, including the final layer.
In this section we present our optimization method for finding effective subnetworks within randomly weighted neural networks. We begin by building intuition in an unusual setting – the infinite width limit. Next we motivate and present our algorithm for finding effective subnetworks.
The Existence of Good Subnetworks
Modern neural networks have a staggering number of possible subnetworks. Consequently, even at initialization, a neural network should contain a subnetwork which performs well.
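The counting intuition above can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that each of $S$ subnetworks independently performs well with some tiny probability $q$; both symbols and all numbers are hypothetical. Under that assumption the chance that no subnetwork works is $(1-q)^S$.

```python
import math


def prob_no_good_subnetwork(q, num_subnetworks):
    """Return (1 - q) ** num_subnetworks, computed in log space for stability."""
    return math.exp(num_subnetworks * math.log1p(-q))


q = 1e-9  # hypothetical chance that a single subnetwork is good enough

# The probability of finding no good subnetwork shrinks as the count grows.
probabilities = [prob_no_good_subnetwork(q, s) for s in (1e6, 1e9, 1e12)]
```

Real subnetworks share weights and are not independent, so this is only a heuristic picture of why a large search space helps.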
To build intuition we will consider an extreme case – a neural network $N$ in the infinite width limit (for a convolutional neural network, the width of the network is the number of channels). As in Figure 1, let $\tau$ be a network with the same structure as $N$ that achieves good accuracy. If the weights of $N$ are initialized using any standard scaling of a normal distribution, e.g. xavier [6] or kaiming [7], then we may show there exists a subnetwork of $N$ that achieves the same performance as $\tau$ without training. Let $q$ be the probability that a given subnetwork of $N$ has weights that are close enough to $\tau$ to obtain the same accuracy. This probability $q$ is extremely small, but it is still nonzero. Therefore, the probability that no subnetwork of $N$ is close enough to $\tau$ is effectively $(1-q)^S$ where $S$ is the number of subnetworks. $S$ grows very quickly with the width of the network, and this probability becomes arbitrarily small.
How Should We Find A Good Subnetwork
Even if there are good subnetworks in randomly weighted neural networks, how should we find them?
Zhou et al. learn an associated probability $p$ with each weight $w$ in the network. On the forward pass they include weight $w$ with probability $p$ (where $p$ is the output of a sigmoid) and otherwise zero it out. The infinite width limit provides intuition for a possible shortcoming of the algorithm presented by Zhou et al. [29]. Even if the parameters $p$ are fixed, the algorithm will likely never observe the same subnetwork twice. As such, the gradient estimate becomes more unstable, and this in turn may make training difficult.
Our algorithm for finding a good subnetwork is illustrated by Figure 2. With each weight $w$ in the neural network we learn a positive, real-valued popup score $s$. The subnetwork is then chosen by selecting the weights in each layer corresponding to the top-$k\%$ highest scores. For simplicity we use the same value of $k$ for all layers.
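The per-layer selection just described can be sketched as follows. This is a minimal illustration and not the authors' implementation; the helper names, layer names, and score values are made up for the example.

```python
def top_k_percent_mask(scores, k_percent):
    """Return a 0/1 mask that keeps the k% highest scores of one layer."""
    n_keep = round(len(scores) * k_percent / 100)
    # Indices ordered from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:n_keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


def select_subnetwork(layer_scores, k_percent):
    """Apply the same k to every layer, as in the text."""
    return {name: top_k_percent_mask(scores, k_percent)
            for name, scores in layer_scores.items()}


# Hypothetical popup scores for two layers of different sizes.
layer_scores = {
    "layer1": [0.9, 0.1, 0.4, 0.7, 0.2, 0.6],
    "layer2": [0.3, 0.8, 0.5, 0.05],
}
masks = select_subnetwork(layer_scores, k_percent=50)
```

Selecting within each layer, rather than across the whole network, guarantees that every layer keeps the same fraction of its weights.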
How should we update the score $s_{uv}$? Consider a single edge in a fully connected layer which connects neuron $u$ to neuron $v$. Let $w_{uv}$ be the weight of this edge, and $s_{uv}$ the associated score. If this score is initially low then $w_{uv}$ is not selected in the forward pass. But we would still like a way to update its score to allow it to pop back up. Informally, with backprop [22] we compute how the loss “wants” node $v$’s input to change (i.e. the negative gradient). We then examine the weighted output of node $u$. If this weighted output is aligned with the negative gradient, then node $u$ can take node $v$’s output where the loss “wants” it to go. Accordingly, we should increase the score. If this alignment happens consistently, then the score will continue to increase and the edge will re-enter the chosen subnetwork (i.e. popup).
More formally, if $w_{uv}\mathcal{Z}_u$ denotes the weighted output of neuron $u$, and $\mathcal{I}_v$ denotes the input of neuron $v$, then we update $s_{uv}$ as
$$ s_{uv} \leftarrow s_{uv} - \alpha \frac{\partial \mathcal{L}}{\partial \mathcal{I}_v} \mathcal{Z}_u w_{uv} \tag{1} $$
where $\mathcal{L}$ is the loss and $\alpha$ is the learning rate.
This argument and the analysis that follows are motivated and guided by the work of [26]. In their work, however, they do not consider a score and are instead directly updating the weights. In the forward pass they use the top $k\%$ of edges by magnitude, and therefore there is no way of learning a subnetwork without learning the weights. Their goal is to train sparse neural networks, while we aim to showcase the efficacy of randomly weighted neural networks.
We now formally detail the edge-popup algorithm.
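For clarity, we first describe our algorithm for a fully connected neural network. In Section B.2 we provide the straightforward extension to convolutions along with code in PyTorch [20].
A fully connected neural network consists of layers $1, \ldots, L$ where layer $\ell$ has $n_\ell$ nodes $\mathcal{V}^{(\ell)} = \{v_1^{(\ell)}, \ldots, v_{n_\ell}^{(\ell)}\}$. We let $\mathcal{I}_v$ denote the input to node $v$ and let $\mathcal{Z}_v$ denote the output, where $\mathcal{Z}_v = \sigma(\mathcal{I}_v)$ for some non-linear activation function $\sigma$ (e.g. ReLU [13]). The input to neuron $v$ in layer $\ell$ is a weighted sum of all neurons in the preceding layer. Accordingly, we write $\mathcal{I}_v$ as
$$ \mathcal{I}_v = \sum_{u \in \mathcal{V}^{(\ell-1)}} w_{uv} \mathcal{Z}_u \tag{2} $$
where $w_{uv}$ are the network parameters for layer $\ell$. The output of the network is taken from the final layer while the input data is given to the very first layer. Before training, the weights $w_{uv}$ for layer $\ell$ are initialized by independently sampling from distribution $\mathcal{D}_\ell$. For example, if we are using kaiming normal initialization [7] with ReLU activations, then $\mathcal{D}_\ell = \mathcal{N}\left(0, \sqrt{2/n_{\ell-1}}\right)$ where $\mathcal{N}$ denotes the normal distribution.
Normally, the weights are optimized via stochastic gradient descent. In our edge-popup algorithm, we instead keep the weights at their random initialization, and optimize to find a subnetwork $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. We then compute the input of node $v$ in layer $\ell$ as
$$ \mathcal{I}_v = \sum_{(u,v) \in \mathcal{E}} w_{uv} \mathcal{Z}_u \tag{3} $$
where $\mathcal{G}$ is a sub-graph of the original fully connected network. (Footnote 1: The original network has edges $\mathcal{E}_{fc} = \bigcup_{\ell=1}^{L-1} \left(\mathcal{V}^{(\ell)} \times \mathcal{V}^{(\ell+1)}\right)$ where $\times$ denotes the cross-product.)
As mentioned above, for each weight $w_{uv}$ in the original network we learn a popup score $s_{uv}$. We choose the subnetwork by selecting the weights in each layer which have the top-$k\%$ highest scores. Equation 3 may therefore be written equivalently as
$$ \mathcal{I}_v = \sum_{u \in \mathcal{V}^{(\ell-1)}} w_{uv} \mathcal{Z}_u h(s_{uv}) \tag{4} $$
where $h(s_{uv}) = 1$ if $s_{uv}$ is among the top $k\%$ highest scores in layer $\ell$ and $h(s_{uv}) = 0$ otherwise. Since the gradient of $h$ is 0 everywhere it is not possible to directly compute the gradient of the loss with respect to $s_{uv}$. We instead use the straight-through gradient estimator [1], in which $h$ is treated as the identity in the backwards pass – the gradient goes “straight-through” $h$.
Consequently, we approximate the gradient to $s_{uv}$ as
$$ \hat{g}_{s_{uv}} = \frac{\partial \mathcal{L}}{\partial \mathcal{I}_v} \frac{\partial \mathcal{I}_v}{\partial s_{uv}} = \frac{\partial \mathcal{L}}{\partial \mathcal{I}_v} w_{uv} \mathcal{Z}_u \tag{5} $$
where $\mathcal{L}$ is the loss we are trying to minimize. The scores are then updated via stochastic gradient descent with learning rate $\alpha$. If we ignore momentum and weight decay [14] then we update $s_{uv}$ as
$$ \tilde{s}_{uv} = s_{uv} - \alpha \frac{\partial \mathcal{L}}{\partial \mathcal{I}_v} \mathcal{Z}_u w_{uv} \tag{6} $$
where $\tilde{s}_{uv}$ denotes the score after the gradient step. (Footnote 2: To ensure that the scores are positive we take the absolute value.)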
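To make one step of the procedure concrete, here is a small plain-Python sketch of a masked forward pass followed by a score update for a single fully connected layer. It is an illustration under simplifying assumptions (one output node, a quadratic loss, no momentum or weight decay, hypothetical values and function names) and not the authors' PyTorch code.

```python
def layer_mask(scores, k_percent):
    """h(s_uv): 1 for the top-k% scores of the layer, 0 otherwise."""
    flat = sorted((s for row in scores for s in row), reverse=True)
    n_keep = round(len(flat) * k_percent / 100)
    if n_keep == 0:
        return [[0 for _ in row] for row in scores]
    threshold = flat[n_keep - 1]
    # Ties at the threshold are all kept in this simplified sketch.
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def forward(weights, scores, z, k_percent):
    """Masked forward pass: I_v = sum_u w_uv * Z_u * h(s_uv)."""
    mask = layer_mask(scores, k_percent)
    n_in, n_out = len(weights), len(weights[0])
    return [sum(weights[u][v] * z[u] * mask[u][v] for u in range(n_in))
            for v in range(n_out)]


def update_scores(weights, scores, z, grad_inputs, lr):
    """Straight-through score update, then absolute value to stay positive."""
    return [[abs(scores[u][v] - lr * grad_inputs[v] * weights[u][v] * z[u])
             for v in range(len(weights[0]))]
            for u in range(len(weights))]


# Hypothetical layer: four input nodes feeding one output node.
weights = [[1.0], [-2.0], [0.5], [1.5]]   # fixed, never updated
scores = [[0.9], [0.8], [0.7], [0.1]]
z = [1.0, 1.0, 2.0, 1.0]                  # outputs Z_u of the previous layer
target = 1.0

inputs = forward(weights, scores, z, k_percent=50)
grad_inputs = [inputs[0] - target]        # dL/dI_v for L = 0.5 * (I_v - target)^2
new_scores = update_scores(weights, scores, z, grad_inputs, lr=0.1)
new_mask = layer_mask(new_scores, k_percent=50)
```

In this toy example the third edge starts outside the subnetwork, but its weighted output is aligned with the negative gradient, so its score rises and it replaces the second edge in `new_mask`. The weights themselves are never modified.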
As the scores change certain edges in the subnetwork will be replaced with others. Motivated by the analysis of [26] we show that when swapping does occur, the loss decreases for the mini-batch.
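Theorem 1: When edge $(i, k)$ replaces $(j, k)$ and the rest of the subnetwork remains fixed then the loss decreases for the mini-batch (provided the learning rate is sufficiently small, and the loss is smooth).
Proof. Let $\tilde{s}_{uv}$ denote the score of weight $w_{uv}$ after the gradient update. If edge $(i, k)$ replaces $(j, k)$ then our algorithm dictates that $s_{ik} < s_{jk}$ but $\tilde{s}_{ik} > \tilde{s}_{jk}$. Accordingly,
$$ \tilde{s}_{ik} - s_{ik} > \tilde{s}_{jk} - s_{jk} \tag{7} $$
which implies that
$$ -\frac{\partial \mathcal{L}}{\partial \mathcal{I}_k} w_{ik} \mathcal{Z}_i > -\frac{\partial \mathcal{L}}{\partial \mathcal{I}_k} w_{jk} \mathcal{Z}_j \tag{8} $$
by the update rule given in Equation 6. Let $\tilde{\mathcal{I}}_k$ denote the input to node $k$ after the swap is made and $\mathcal{I}_k$ denote the original input. Note that $\tilde{\mathcal{I}}_k - \mathcal{I}_k = w_{ik}\mathcal{Z}_i - w_{jk}\mathcal{Z}_j$ by Equation 3. We now wish to show that $\mathcal{L}(\tilde{\mathcal{I}}_k) < \mathcal{L}(\mathcal{I}_k)$.
When the learning rate is sufficiently small (and the loss is smooth) we may assume that $\tilde{\mathcal{I}}_k$ is close to $\mathcal{I}_k$ and ignore second-order terms in a Taylor expansion:
$$ \begin{align} \mathcal{L}(\tilde{\mathcal{I}}_k) &= \mathcal{L}\left(\mathcal{I}_k + (\tilde{\mathcal{I}}_k - \mathcal{I}_k)\right) \tag{9} \\ &\approx \mathcal{L}(\mathcal{I}_k) + \frac{\partial \mathcal{L}}{\partial \mathcal{I}_k} (\tilde{\mathcal{I}}_k - \mathcal{I}_k) \tag{10} \\ &= \mathcal{L}(\mathcal{I}_k) + \frac{\partial \mathcal{L}}{\partial \mathcal{I}_k} (w_{ik}\mathcal{Z}_i - w_{jk}\mathcal{Z}_j) \tag{11} \end{align} $$
and from Equation 8 we have that $\frac{\partial \mathcal{L}}{\partial \mathcal{I}_k} (w_{ik}\mathcal{Z}_i - w_{jk}\mathcal{Z}_j) < 0$ and so $\mathcal{L}(\tilde{\mathcal{I}}_k) < \mathcal{L}(\mathcal{I}_k)$ as needed.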
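The inequality at the heart of the proof can be checked numerically on a toy case. The sketch below assumes a quadratic loss on the input of a single node and hypothetical values for the two swapped edges; it illustrates the argument and does not replace the proof.

```python
def quadratic_loss(node_input, target):
    return 0.5 * (node_input - target) ** 2


# Hypothetical values for one node: an edge with weighted output w_in * z_in
# enters the subnetwork and an edge with weighted output w_out * z_out leaves.
w_in, z_in = 0.5, 2.0
w_out, z_out = -2.0, 1.0
rest = 1.0      # contribution of the edges that stay in the subnetwork
target = 1.0

input_before = rest + w_out * z_out
input_after = rest + w_in * z_in
grad = input_before - target  # dL/dI evaluated at the original input

# A swap happens only if the entering edge's score grew by more than the
# leaving edge's score, i.e. -grad * w_in * z_in > -grad * w_out * z_out.
swap_condition = -grad * w_in * z_in > -grad * w_out * z_out

# First-order term of the Taylor expansion of the loss around the old input.
first_order_change = grad * (w_in * z_in - w_out * z_out)

loss_before = quadratic_loss(input_before, target)
loss_after = quadratic_loss(input_after, target)
```

Here the swap condition holds, the first-order change is negative, and the loss drops. The first-order term only matches the true change when the input moves a small distance, which is why the theorem requires a sufficiently small learning rate and a smooth loss.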
We examine a more general case of Theorem 1 in Section B.1 of the supplementary material.
We demonstrate the unreasonable effectiveness of randomly weighted neural networks for image recognition on the standard benchmark datasets CIFAR-10 [12] and ImageNet [3]. This section is organized as follows: in Section 4.1 we discuss the experimental setup and hyperparameters. We then perform a series of ablations at small scale: we examine the effect of $k$, the % of weights which remain in the subnetwork, and the effect of width. In Section 4.4 we compare against the algorithm of Zhou et al., followed by Section 4.5 in which we study the effect of the distribution used to sample the weights. We conclude with Section 4.6, where we optimize to find subnetworks of randomly weighted neural networks which achieve good performance on ImageNet [3].
We use two different distributions for the weights in our network:
Kaiming Normal [7], which we denote $N_k$. Each weight in layer $\ell$ is sampled from a normal distribution with mean 0 and standard deviation $\sqrt{2/n_{\ell-1}}$, where $n_{\ell-1}$ is the number of nodes in the preceding layer.
Signed Kaiming Constant, which we denote $U_k$. Here we set each weight to be a constant and randomly choose its sign to be $+$ or $-$. The constant we choose is the standard deviation of Kaiming Normal, and as a result the variance is the same. We use the notation $U_k$ as we are sampling uniformly from the set $\{-\sigma_k, \sigma_k\}$ where $\sigma_k$ is the standard deviation for Kaiming Normal (i.e. $\sqrt{2/n_{\ell-1}}$).
In Section 4.5 we reflect on the importance of the random distribution and experiment with alternatives.
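The two weight distributions can be sketched in plain Python as follows. This assumes a fully connected layer whose fan-in is the number of nodes in the preceding layer; the function names and layer sizes are illustrative, and a PyTorch implementation would normally rely on the library's own initializers instead.

```python
import math
import random


def kaiming_std(fan_in):
    """Standard deviation of Kaiming Normal for ReLU: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


def kaiming_normal(fan_in, fan_out, rng):
    """Sample every weight from a zero-mean normal with the Kaiming std."""
    std = kaiming_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def signed_kaiming_constant(fan_in, fan_out, rng):
    """Every weight has magnitude equal to the Kaiming std and a random sign."""
    std = kaiming_std(fan_in)
    return [[rng.choice((-std, std)) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)
fan_in, fan_out = 8, 3  # hypothetical layer sizes
normal_weights = kaiming_normal(fan_in, fan_out, rng)
signed_weights = signed_kaiming_constant(fan_in, fan_out, rng)
```

Both samplers produce zero-mean weights with the same variance; the signed constant variant simply removes all variation in magnitude.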
On CIFAR-10 [12] we experiment with simple VGG-like architectures of varying depth. These architectures are also used by Frankle and Carbin [4] and Zhou et al. [29] and are provided in Table 1. On ImageNet we experiment with ResNet-50 and ResNet-101 [8], as well as their wide variants [28].
Model | Conv2 | Conv4 | Conv6 | Conv8 |
---|---|---|---|---|
Conv Layers | 64, 64, pool | 64, 64, pool 128, 128, pool | 64, 64, pool 128, 128, pool 256, 256, pool | 64, 64, pool 128, 128, pool 256, 256, pool 512, 512, pool |
FC | 256, 256, 10 | 256, 256, 10 | 256, 256, 10 | 256, 256, 10 |
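Table 1: The VGG-like architectures used for CIFAR-10, which are also used by Frankle and Carbin [4] and Zhou et al. [29]. However, the slightly deeper Conv8 does not appear in the previous work. Each model first performs convolutions followed by the fully connected (FC) layers, and pool denotes max-pooling.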
In every experiment we train for 100 epochs and report the last epoch accuracy on the validation set. When we optimize with Adam [11] we do not decay the learning rate. When we optimize with SGD we use cosine learning rate decay [17]. On CIFAR-10 [12] we train our models with weight decay 1e-4, momentum 0.9, batch size 128, and learning rate 0.1. We also often run both an Adam and SGD baseline where the weights are learned. The Adam baseline uses the same learning rate and batch size as in [4, 29] (batch size 60 and learning rates 2e-4, 3e-4, and 3e-4 for Conv2, Conv4, and Conv6 respectively; Conv8 is not tested in [4], though we find that learning rate 3e-4 still performs well). For the SGD baseline we find that training does not converge with learning rate 0.1, and so we use 0.01. As is standard, we also use weight decay 1e-4, momentum 0.9, and batch size 128.
For the ImageNet experiments we use the hyperparameters found on NVIDIA’s public GitHub example repository for training ResNet [19]. For simplicity, our edge-popup algorithm does not modify batch norm parameters; they are frozen at their default initialization in PyTorch (i.e. bias 0, scale 1).
This discussion has encompassed the extent of the hyperparameter tuning for our models. We do, however, perform hyperparameter tuning for the Zhou et al. [29] baseline and improve accuracy significantly. We include further discussion of this in Section 4.4.
In all experiments on CIFAR-10 [12] we use 5 different random seeds and plot the mean accuracy ± one standard deviation. Moreover, on all figures, Learned Dense Weights denotes standard training of the full model (all weights remaining).
Our algorithm has one associated parameter: the % of weights which remain in the subnetwork, which we refer to as $k$. Figure 3 illustrates how the accuracy of the subnetwork we find varies with $k$, a trend which we will now dissect. We consider a range of values of $k$ and plot the dense model when it is trained as a horizontal line (as it has 100% of the weights).
We receive the worst accuracy when $k$ approaches 0 or 100. When $k$ approaches 0, we are not able to perform well as our subnetwork has very few weights. On the other hand, when $k$ approaches 100, our network outputs are random.
The best accuracy occurs at intermediate values of $k$, and we make a combinatorial argument for this trend. Writing $k$ as a fraction, we are choosing $kn$ weights out of $n$, and there are $\binom{n}{kn}$ ways of doing so. The number of possible subnetworks is therefore maximized when $k \approx 0.5$ (i.e. 50% of the weights), and at this value our search space is at its largest.
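The combinatorial argument is easy to verify for a small layer. The sketch below counts the subnetworks of a hypothetical layer with 20 weights for every possible number of kept weights; the layer size is chosen only to keep the numbers readable.

```python
import math

n = 20  # hypothetical number of weights in a layer

# Number of distinct subnetworks that keep exactly `kept` of the n weights.
counts = {kept: math.comb(n, kept) for kept in range(n + 1)}

# The count peaks when half of the weights are kept.
best = max(counts, key=counts.get)
```

For real layers with millions of weights the same peak occurs at half of the weights, and the binomial coefficient there is astronomically larger than at the extremes.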
Our intuition from Section 3.1 suggests that as the network gets wider, a subnetwork of a randomly weighted model should approach the trained model in accuracy. How wide is wide enough?
In Figure 4 we vary the width of Conv4 and Conv6. The width of a linear layer is the number of “neurons”, and the width of a convolution layer is the number of channels. The width multiplier is the factor by which the width of all layers is scaled. A width multiplier of 1 corresponds to the models tested in Figure 3.
As the width multiplier increases, the gap shrinks between the accuracy of a subnetwork found with edge-popup and the dense model when it is trained. Notably, when Conv6 is wide enough, a subnetwork of the randomly weighted model (for a particular choice of $k$) performs just as well as the dense model when it is trained.
Moreover, this boost in performance is not solely from the subnetwork having more parameters. Even when the number of parameters is fixed, increasing the width and therefore the search space leads to better performance. In Figure 5 we fix the number of parameters in the subnetwork while modifying $k$ and the width multiplier. Specifically, we test subnetworks of two constant sizes, and Figure 5 indicates the size of the subnetwork in each case.
In Figure 6 we compare the performance of edge-popup with Zhou et al. Their work considers distributions $N_x$ and $U_x$, which are identical to those presented in Section 4.1 but with xavier normal [6] instead of kaiming normal [7] – the factor of $\sqrt{2}$ is omitted from the standard deviation. By running their algorithm with the Kaiming variants $N_k$ and $U_k$ we witness a significant improvement. However, even the $N_x$ and $U_x$ results exceed those in the paper as we perform some hyperparameter tuning. As in our experiments on CIFAR-10, we use SGD with weight decay 1e-4, momentum 0.9, batch size 128, and a cosine scheduler [17]. We double the learning rate until we see their performance become worse, and settle on 200 (as mentioned in their work, an absurdly high learning rate is required).
The distribution that the random weights are sampled from is very important. As illustrated by Figure 7, the performance of our algorithm vastly decreases when we switch to using xavier normal [6] or kaiming uniform [7].
Following the derivation in [7], the variance of the forward pass is not exactly 1 when we consider a subnetwork with only $k\%$ of the weights. To correct for this we could scale the standard deviation by $\sqrt{1/k}$, with $k$ written as a fraction. This distribution is referred to as “Scaled Kaiming Normal” on Figure 7. We may also consider this scaling for the Signed Kaiming Constant distribution which is described in Section 4.1.
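A small calculation shows why rescaling restores the variance. The sketch assumes that the inputs to a node are independent with unit variance, that a fraction $k$ of the incoming weights survive, and that the correction factor on the standard deviation is $\sqrt{1/k}$; the function names and numbers are hypothetical.

```python
import math


def kaiming_std(fan_in):
    return math.sqrt(2.0 / fan_in)


def scaled_kaiming_std(fan_in, k_fraction):
    """Kaiming std multiplied by sqrt(1 / k), where k is the kept fraction."""
    return kaiming_std(fan_in) * math.sqrt(1.0 / k_fraction)


fan_in = 512
k_fraction = 0.3

# Variance of a weighted sum of unit-variance inputs is (#terms) * std ** 2.
dense_variance = fan_in * kaiming_std(fan_in) ** 2
unscaled_sparse_variance = k_fraction * fan_in * kaiming_std(fan_in) ** 2
scaled_sparse_variance = (
    k_fraction * fan_in * scaled_kaiming_std(fan_in, k_fraction) ** 2
)
```

Without the correction the sparse sum has only a fraction $k$ of the dense variance, and with it the two agree. The correction shrinks toward 1 as $k$ approaches 100%.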
On ImageNet we observe similar trends to CIFAR-10. As ImageNet is much harder, computationally feasible models are not overparameterized to the same degree. As a consequence, the performance of a randomly weighted subnetwork does not match the full model with learned weights. However, we still witness a very encouraging trend – the performance increases with the width and depth of the network.
As illustrated by Figure 8, a randomly weighted Wide ResNet-50 contains a subnetwork that is smaller than, but matches the accuracy of ResNet-34 when trained on ImageNet [3]. As strongly suggested by our trends, better and larger “parent” networks would result in even stronger performance on ImageNet [3]. A table which reports the numbers in Figure 8 may be found in Section A of the supplementary material.
Figure 9 illustrates the effect of $k$, which follows an almost identical trend: intermediate values of $k$ perform best, though $k = 30$ now provides the best performance. Figure 9 also demonstrates that we significantly outperform Zhou et al. at scale (they do not test on ImageNet in their paper). Their algorithm does not allow an explicit choice of the % of weights remaining in the subnetwork, and we found the algorithm unstable outside of the range reported.
The choice of the random distribution matters more for ImageNet. The “Scaled” distribution we discuss in Section 4.5 did not show any discernible difference on CIFAR-10. However, Figure 10 illustrates that on ImageNet it is much better. Recall that the “Scaled” distribution adds a factor of $\sqrt{1/k}$, which has less of an effect when $k$ approaches 100%. This result highlights the possibility of finding better distributions which work best for this algorithm.
Hidden within randomly weighted neural networks we find subnetworks with compelling accuracy. This work provides an avenue for many areas of exploration. For example, we anticipate the development of faster algorithms, or the alternating optimization of the structure and the weights.
Finally, we hope that our findings serve as a useful step in the pursuit of understanding the optimization of neural networks.
We thank Jesse Dodge and Nicholas Lourie for many helpful discussions, Gabriel Ilharco for valuable feedback, and Sarah Pratt for .
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 1026–1034, Washington, DC, USA, 2015. IEEE Computer Society.
Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.
In this section we first prove a more general case of Theorem 1 then provide an extension of edge-popup for convolutions along with code in PyTorch [20], found in Algorithm 1.
We now examine a more general case of Theorem 1, where the two swapped edges are not connected to the same node. Again we are motivated by the analysis of [26], though we tackle a more general case.
Theorem 1 (more general): When a nonzero number of edges are swapped in one layer and the rest of the network remains fixed then the loss decreases for the mini-batch (provided the learning rate is sufficiently small, and the loss is smooth).
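Proof. As before, we let $\tilde{s}_{uv}$ denote the score of weight $w_{uv}$ after the gradient update. Additionally, let $\tilde{\mathcal{I}}_v$ denote the input to node $v$ after the gradient update whereas $\mathcal{I}_v$ is the input to node $v$ before the update. Finally, let $i_1, \ldots, i_n$ denote the nodes in layer $\ell - 1$ and $j_1, \ldots, j_m$ denote the nodes in layer $\ell$. Our goal is to show that
$$ \mathcal{L}\left(\tilde{\mathcal{I}}_{j_1}, \ldots, \tilde{\mathcal{I}}_{j_m}\right) < \mathcal{L}\left(\mathcal{I}_{j_1}, \ldots, \mathcal{I}_{j_m}\right) \tag{12} $$
where the loss is written as a function of layer $\ell$’s input for brevity. Since the learning rate $\alpha$ is small and the loss is smooth we may assume that each $\tilde{\mathcal{I}}_{j_b}$ is close to $\mathcal{I}_{j_b}$ and ignore second-order terms in a Taylor expansion:
$$ \begin{align} & \mathcal{L}\left(\tilde{\mathcal{I}}_{j_1}, \ldots, \tilde{\mathcal{I}}_{j_m}\right) \tag{13} \\ &= \mathcal{L}\left(\mathcal{I}_{j_1} + (\tilde{\mathcal{I}}_{j_1} - \mathcal{I}_{j_1}), \ldots, \mathcal{I}_{j_m} + (\tilde{\mathcal{I}}_{j_m} - \mathcal{I}_{j_m})\right) \tag{14} \\ &\approx \mathcal{L}\left(\mathcal{I}_{j_1}, \ldots, \mathcal{I}_{j_m}\right) + \sum_{b=1}^{m} \frac{\partial \mathcal{L}}{\partial \mathcal{I}_{j_b}} \left(\tilde{\mathcal{I}}_{j_b} - \mathcal{I}_{j_b}\right) \tag{15} \end{align} $$
And so, in order to show Equation 12 it suffices to show that
$$ \sum_{b=1}^{m} \frac{\partial \mathcal{L}}{\partial \mathcal{I}_{j_b}} \left(\tilde{\mathcal{I}}_{j_b} - \mathcal{I}_{j_b}\right) < 0. \tag{16} $$
It is helpful to rewrite the sum to be over edges. Specifically, we will consider the sets $\mathcal{E}_{\text{new}}$ and $\mathcal{E}_{\text{old}}$ where $\mathcal{E}_{\text{new}}$ contains all edges that entered the network after the gradient update and $\mathcal{E}_{\text{old}}$ consists of edges which were previously in the subnetwork, but have now exited. As the total number of edges is conserved we know that $|\mathcal{E}_{\text{new}}| = |\mathcal{E}_{\text{old}}|$, and by assumption $|\mathcal{E}_{\text{new}}| > 0$.
Using the definition of $\tilde{\mathcal{I}}_{j_b}$ and $\mathcal{I}_{j_b}$ from Equation 3 we may rewrite Equation 16 as
$$ \sum_{(i_a, j_b) \in \mathcal{E}_{\text{new}}} \frac{\partial \mathcal{L}}{\partial \mathcal{I}_{j_b}} w_{i_a j_b} \mathcal{Z}_{i_a} - \sum_{(i_c, j_d) \in \mathcal{E}_{\text{old}}} \frac{\partial \mathcal{L}}{\partial \mathcal{I}_{j_d}} w_{i_c j_d} \mathcal{Z}_{i_c} < 0 \tag{17} $$
which, by Equation 6 and factoring out $1/\alpha$, becomes
$$ \sum_{(i_a, j_b) \in \mathcal{E}_{\text{new}}} \left(s_{i_a j_b} - \tilde{s}_{i_a j_b}\right) - \sum_{(i_c, j_d) \in \mathcal{E}_{\text{old}}} \left(s_{i_c j_d} - \tilde{s}_{i_c j_d}\right) < 0. \tag{18} $$
We now show that
$$ \left(s_{i_a j_b} - \tilde{s}_{i_a j_b}\right) - \left(s_{i_c j_d} - \tilde{s}_{i_c j_d}\right) < 0 \tag{19} $$
for any pair of edges $(i_a, j_b) \in \mathcal{E}_{\text{new}}$ and $(i_c, j_d) \in \mathcal{E}_{\text{old}}$. Since $|\mathcal{E}_{\text{new}}| = |\mathcal{E}_{\text{old}}| > 0$ we are then able to conclude that Equation 18 holds.
As $(i_a, j_b)$ was not in the edge set before the gradient update, but $(i_c, j_d)$ was, we can conclude
$$ s_{i_a j_b} - s_{i_c j_d} < 0. \tag{20} $$
Likewise, since $(i_a, j_b)$ is in the edge set after the gradient update, but $(i_c, j_d)$ isn’t, we can conclude
$$ \tilde{s}_{i_c j_d} - \tilde{s}_{i_a j_b} < 0. \tag{21} $$
Adding Equation 20 and Equation 21 gives Equation 19, as needed.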
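In order to show that our method extends to convolutional layers we recall that convolutions may be written in a form that resembles Equation 2. Let $\kappa$ be the kernel size which we assume is odd for simplicity, then for $x \in \{1, \ldots, W\}$ and $y \in \{1, \ldots, H\}$ we have
$$ \mathcal{I}_v^{(x,y)} = \sum_{u \in \mathcal{V}^{(\ell-1)}} \sum_{\kappa_1=1}^{\kappa} \sum_{\kappa_2=1}^{\kappa} w_{uv}^{(\kappa_1,\kappa_2)} \mathcal{Z}_u^{\left(x+\kappa_1-\lceil \kappa/2 \rceil,\; y+\kappa_2-\lceil \kappa/2 \rceil\right)} \tag{22} $$
where instead of “neurons”, we now have “channels”. The input $\mathcal{I}_v$ and output $\mathcal{Z}_v$ are now two dimensional and so $\mathcal{Z}_v^{(x,y)}$ is a scalar. As before, $\mathcal{Z}_v = \sigma(\mathcal{I}_v)$ where $\sigma$ is a nonlinear function. However, in the convolutional case $\sigma$ is often batch norm [10] followed by ReLU (and then implicitly followed by zero padding).
Instead of simply having weights $w_{uv}$ we now have weights $w_{uv}^{(\kappa_1,\kappa_2)}$ for $\kappa_1 \in \{1, \ldots, \kappa\}$, $\kappa_2 \in \{1, \ldots, \kappa\}$. Likewise, in our edge-popup algorithm we now consider scores $s_{uv}^{(\kappa_1,\kappa_2)}$ and again use the top $k\%$ in the forward pass. As before, let $h\left(s_{uv}^{(\kappa_1,\kappa_2)}\right) = 1$ if $s_{uv}^{(\kappa_1,\kappa_2)}$ is among the top $k\%$ highest scores in the layer and $h\left(s_{uv}^{(\kappa_1,\kappa_2)}\right) = 0$ otherwise. Then in edge-popup we are performing a convolution as
$$ \mathcal{I}_v^{(x,y)} = \sum_{u \in \mathcal{V}^{(\ell-1)}} \sum_{\kappa_1=1}^{\kappa} \sum_{\kappa_2=1}^{\kappa} w_{uv}^{(\kappa_1,\kappa_2)} \mathcal{Z}_u^{\left(x+\kappa_1-\lceil \kappa/2 \rceil,\; y+\kappa_2-\lceil \kappa/2 \rceil\right)} h\left(s_{uv}^{(\kappa_1,\kappa_2)}\right) \tag{23} $$
which mirrors the formulation of edge-popup in Equation 4. In fact, when $\kappa = W = H = 1$ (i.e. a 1x1 convolution on a 1x1 feature map) then Equation 23 and Equation 4 are equivalent.
The update for the scores is quite similar, though we must now sum over all spatial (i.e. $x$ and $y$) locations as given below:
$$ \tilde{s}_{uv}^{(\kappa_1,\kappa_2)} = s_{uv}^{(\kappa_1,\kappa_2)} - \alpha \sum_{x=1}^{W} \sum_{y=1}^{H} \frac{\partial \mathcal{L}}{\partial \mathcal{I}_v^{(x,y)}} w_{uv}^{(\kappa_1,\kappa_2)} \mathcal{Z}_u^{\left(x+\kappa_1-\lceil \kappa/2 \rceil,\; y+\kappa_2-\lceil \kappa/2 \rceil\right)} \tag{24} $$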
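The spatial sum in the convolutional score update can be sketched in plain Python for a single input channel and a single output channel. This is an illustrative sketch only: it uses 0-based indices, treats positions outside the feature map as zero padding, and all names and values are hypothetical. It is not the PyTorch code of Algorithm 1.

```python
def conv_score_gradient(weight, z, grad_input):
    """Gradient estimate for each kernel score, summed over spatial positions.

    weight:     kappa x kappa kernel for one (input, output) channel pair
    z:          2D output of the input channel
    grad_input: 2D gradient of the loss w.r.t. the output channel's input
    """
    kappa = len(weight)
    half = kappa // 2  # kappa is assumed odd, so the kernel is centred
    rows, cols = len(z), len(z[0])
    grad = [[0.0] * kappa for _ in range(kappa)]
    for k1 in range(kappa):
        for k2 in range(kappa):
            total = 0.0
            for x in range(rows):
                for y in range(cols):
                    xi, yi = x + k1 - half, y + k2 - half
                    # Positions outside the map contribute zero (padding).
                    if 0 <= xi < rows and 0 <= yi < cols:
                        total += grad_input[x][y] * z[xi][yi]
            grad[k1][k2] = weight[k1][k2] * total
    return grad


# A 1x1 kernel on a 1x1 feature map reduces to the fully connected case.
single = conv_score_gradient([[2.0]], [[3.0]], [[0.5]])

# A 3x3 kernel on a 2x2 feature map of ones with unit gradients.
ones_kernel = [[1.0] * 3 for _ in range(3)]
ones_map = [[1.0, 1.0], [1.0, 1.0]]
grid = conv_score_gradient(ones_kernel, ones_map, ones_map)
```

Each score would then be updated by subtracting the learning rate times its entry in the returned grid, mirroring the fully connected update.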