Innovations in how deep networks are trained have played an important role in the remarkable success deep learning has produced in a variety of application areas, including image recognition (He et al., 2016), object detection (Ren et al., 2015; He et al., 2017), machine translation (Vaswani et al., 2017) and language modeling (Brown et al., 2020). Learning typically involves either optimizing a network from scratch (Krizhevsky et al., 2012), finetuning a pre-trained model (Yosinski et al., 2014) or jointly optimizing the architecture and weights (Zoph and Le, 2017). Against this predominant background, we pose the following question: can a network instantiated with only random weights achieve competitive results compared to the same model using optimized weights?
For a given task, an untrained, randomly initialized network is unlikely to produce good performance. However, we demonstrate that given sufficient random weight options for each connection, there exist configurations of these random weight values that have generalization performance comparable to that of a trained network with the same architecture. More importantly, we introduce a method that can find these high-performing randomly weighted configurations consistently and efficiently. Furthermore, we show empirically that a small number of random weight options per connection is sufficient to obtain accuracy comparable to that of the trained network. The weights are never updated. Instead, the algorithm simply selects for each connection a weight value from a fixed set of random weights.
We use the analogy of “slot machines” to describe how our method operates. Each reel in a Slot Machine has a fixed set of symbols. The reels are jointly spun in an attempt to find winning combinations. In our context, each connection has a fixed set of random weight values. Our algorithm “spins the reels” in order to find a winning combination of symbols, i.e., selects a weight value for each connection so as to produce an instantiation of the network that yields strong performance. While in physical Slot Machines the spinning of the reels is governed by a fully random process, in our Slot Machines the selection of the weights is guided by a method that optimizes the given loss at each spinning iteration.
More formally, we allocate a set of fixed random weight values to each connection. Our algorithm assigns a quality score to each of these
possible values. In the forward pass a weight value is selected for each connection based on the scores. The scores are then updated in the backward pass via stochastic gradient descent. However, the weights are never changed. By evaluating different combinations of fixed randomly generated values, this extremely simple procedure finds weight configurations that yield high accuracy.
We demonstrate the efficacy of our algorithm through experiments on MNIST and CIFAR-10. On MNIST, our randomly weighted Lenet-300-100 (Lecun et al., 1998) obtains test accuracy comparable to that of the trained network while using only a few weight options per connection. On CIFAR-10 (Krizhevsky, 2009), our six-layer convolutional network matches the test set performance of the trained network when selecting from a small number of fixed random values at each connection.
Finetuning the models obtained by our procedure generally boosts performance over trained networks albeit at an additional compute cost (see Figure 4). Also, compared to traditional networks, our networks are less memory efficient due to the inclusion of scores. That said, our work casts light on some intriguing phenomena about neural networks:
First, our results suggest a performance equivalence between random initialization and training, showing that a proper initialization is crucially important.
Second, this paper further highlights the enormous expressive capacity of neural networks. Maennel et al. (2020) show that contemporary neural networks are so powerful that they can memorize randomly generated labels. This work builds on that revelation and demonstrates that current networks can model challenging non-linear mappings extremely well even with random weights.
Finally, we are hopeful that our novel model, consisting in the introduction of multiple weight options for each edge, will inspire other initialization and optimization strategies.
2 Related Work
Supermasks and the Strong Lottery Ticket Conjecture. The lottery ticket hypothesis was articulated in Frankle and Carbin (2018) and states that a randomly initialized neural network contains sparse subnetworks which, when trained in isolation from scratch, can achieve accuracy similar to that of the trained dense network. Inspired by this result, Zhou et al. (2019)
present a method for identifying subnetworks of randomly initialized neural networks that achieve better than chance performance without training. These subnetworks (named “supermasks”) are found by assigning a probability value to each connection. These probabilities are used to sample the connections to use and are updated via stochastic gradient descent. Without ever modifying the weights, Zhou et al. (2019) find subnetworks that perform impressively across multiple datasets.
Follow-up work by Ramanujan et al. (2019) finds supermasks that match the performance of a dense network. On ImageNet (Russakovsky et al., 2009), they find subnetworks within a randomly weighted Wide ResNet-50 (Zagoruyko and Komodakis, 2016) that match the performance of a smaller, trained ResNet-34 (He et al., 2016). Accordingly, they propose the strong lottery ticket conjecture: a sufficiently overparameterized, randomly weighted neural network contains a subnetwork that performs as well as a trained network with the same number of parameters. Ramanujan et al. (2019) adopt a deterministic protocol in their so-called “edge-popup” algorithm for finding supermasks instead of the stochastic algorithm of Zhou et al. (2019).
These empirical results, as well as recent theoretical ones (Malach et al., 2020; Pensia et al., 2020), suggest that pruning a randomly initialized network is just as good as optimizing the weights, provided a good pruning mechanism is used. Our work corroborates this intriguing phenomenon but differs from these prior methods in a significant aspect. We eliminate pruning completely and instead introduce multiple weight values per connection. Thus, rather than selecting connections to define a subnetwork, our method selects weights for all connections in a network of fixed structure. Although our work has interesting parallels with pruning (as illustrated in Figure 9), it is different from pruning as all connections remain active in every forward pass.
Pruning at Initialization. The lottery ticket hypothesis also inspired several recent works aimed at pruning (i.e., predicting “winning” tickets) at initialization (Lee et al., 2020, 2019; Tanaka et al., 2020; Wang et al., 2020). Our work differs in motivation from these methods and from those that train only a subset of the weights (Hoffer et al., 2018; Rosenfeld and Tsotsos, 2019). Our aim is to find neural networks with random weights that match the performance of trained networks with the same number of parameters.
Weight Agnostic Neural Networks. Gaier and Ha (2019) build neural network architectures with high performance in a setting where all the weights have the same shared random value. The optimization is instead performed over the architecture (Stanley and Miikkulainen, 2002). They show empirically that the network performance is indifferent to the shared value but defaults to random chance when all the weights assume different random values. Although we do not perform weight training, the weights in this work have different random values. Further, we build our models using fixed architectures.
Low-bit Networks and Quantization Methods. Similar to binary networks (Courbariaux and Bengio, 2016; Rastegari et al., 2016) and network quantization (Hubara et al., 2017; Wang et al., 2018), the parameters in slot machines are drawn from a finite set. However, whereas the primary objectives in the prior networks are mostly compression and computational speedup, the motivation behind slot machines is recovering good performance from randomly initialized networks. Accordingly, slot machines use real-valued weights as opposed to the binary (or small integer) weights used by low-bit networks. Further, the weights in low-bit networks are usually optimized directly whereas only associated scores are optimized in slot machines.
Random Decision Trees.
Our approach is inspired by the popular use of random subsets of features in the construction of decision trees (Breiman et al., 1984). Instead of considering all possible choices of features and all possible splitting tests at each node, random decision trees are built by restricting the selection to small random subsets of feature values and splitting hypotheses. We adapt this strategy to the training of neural networks by restricting the optimization of each connection to a random subset of weight values.
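To illustrate the analogy, here is a toy version of the random-subset strategy for choosing a single decision-tree split (an illustrative sketch with our own function names, not our algorithm or Breiman et al.'s exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(labels):
    """Gini impurity of a set of integer class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_random_split(X, y, n_candidates=3):
    """Pick the best of a few randomly drawn (feature, threshold)
    candidates instead of scanning all possible splits, in the spirit
    of random decision trees. Returns (impurity, feature, threshold)."""
    best = None
    for _ in range(n_candidates):
        f = rng.integers(X.shape[1])          # random feature
        t = rng.choice(X[:, f])               # random threshold from the data
        left, right = y[X[:, f] <= t], y[X[:, f] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, f, t)
    return best

X = rng.normal(size=(20, 5))
y = (X[:, 2] > 0).astype(int)                 # label depends only on feature 2
result = best_random_split(X, y)
```

The restriction to a few random candidates is exactly what our method transplants to weights: only a small random subset of values is ever considered per connection.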
3 Slot Machines: Networks with Fixed Random Weight Options
Figure 1: In the forward pass, one of the fixed random weight values is selected for each connection, based on a quality score computed for each weight value. On the backward pass, the quality scores of all weights are updated using a straight-through gradient estimator (Bengio et al., 2013), enabling the network to sample better weights in future passes. Unlike the scores, the weights are never changed.
Our goal is to construct non-sparse neural networks with completely random weights that achieve high accuracy. We start by providing an intuition for our method in Section 3.1, before formally defining our algorithm in Section 3.2.
3.1 Intuition

An untrained, randomly initialized network is unlikely to perform better than random chance. Interestingly, the impressive advances of Ramanujan et al. (2019) and Zhou et al. (2019) demonstrate that untrained networks can in fact do well if pruned properly. We show in this work that pruning is not required to obtain good performance with a randomly weighted network. To provide an intuition for our method, consider an untrained network with one weight value for each connection, as typically done. If the weights are drawn randomly from an appropriate distribution (e.g., Glorot and Bengio (2010) or He et al. (2015)), there is an extremely small but nonzero probability $q$ that the network obtains good accuracy (say, greater than a given threshold) on the given task. Now consider another untrained network with the same architectural configuration but with $K$ weight choices per connection. If $C$ is the number of connections, this network contains within it $K^C$ different network instantiations that are architecturally identical but differ in weight configuration. The probability that none of these $K^C$ instantiations obtains good accuracy is essentially $(1-q)^{K^C}$. This probability decays quickly as either $K$ or $C$ increases. We find randomly weighted networks that achieve very high accuracy even with small values of $K$. For instance, a six-layer convolutional network with only a few random values per connection obtains high test accuracy on CIFAR-10.
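The counting argument can be made concrete with purely hypothetical numbers (the values of q, K, and C below are illustrative placeholders, not quantities taken from our experiments):

```python
import math

# Hypothetical numbers purely for illustration:
q = 1e-9   # chance that one random instantiation exceeds the accuracy threshold
K = 2      # weight options per connection
C = 1000   # connections in the network

num_instantiations = K ** C                    # distinct networks contained
# P(no instantiation is good) = (1 - q)^(K^C); work in log space to
# avoid underflow of the tiny probability itself.
log_p_none = num_instantiations * math.log1p(-q)
print(log_p_none)  # astronomically negative => P(none good) is essentially 0
```

Even with a vanishingly small per-instantiation success probability, the doubly exponential number of instantiations drives the failure probability to zero.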
But how do we select a good network among these many different instantiations? Brute-force evaluation of all possible configurations is clearly not feasible due to the massive number of different hypotheses. Instead, we present an algorithm, shown in Figure 1, that iteratively searches for the best combination of connection values for the entire network by optimizing the given loss. To do this, the method learns a real-valued quality score for each weight option. These scores are used to select the weight value of each connection during the forward pass. The scores are then updated in the backward pass based on the loss value in order to improve training performance over iterations.
3.2 Learning in Slot Machines
Here we introduce our algorithm for the case of fully-connected networks, but the description extends seamlessly to convolutional networks. A fully-connected neural network is an acyclic graph consisting of a stack of $L$ layers where the $\ell$-th layer has $n_\ell$ neurons. The activation $a_i^{(\ell)}$ of neuron $i$ in layer $\ell$ is given by

$$a_i^{(\ell)} = g\!\left(\sum_{j=1}^{n_{\ell-1}} a_j^{(\ell-1)} w_{ij}^{(\ell)}\right)$$

where $w_{ij}^{(\ell)}$ is the weight of the connection between neuron $j$ in layer $\ell-1$ and neuron $i$ in layer $\ell$, $a^{(0)} = x$ represents the input to the network, and $g$ is a non-linear activation function. Traditionally, $w_{ij}^{(\ell)}$ starts off as a random value drawn from an appropriate distribution before being optimized with respect to a dataset and a loss function using gradient descent. In contrast, we do not ever update the weights. Instead, we associate a set of $K$ possible weight options with each connection (for simplicity, we use the same number $K$ of options for all connections in a network), and then optimize the selection of the weight to use from these predefined sets for all connections.
Forward Pass. Let $\{w_{ij1}, \dots, w_{ijK}\}$ be the set of the $K$ possible weight values for connection $(i, j)$ (for brevity, from now on we omit the superscript denoting the layer) and let $s_{ijk}$ be the “quality score” of value $w_{ijk}$, denoting the preference for this value over the other possible values. We define a selection function $\rho$ which takes as input the scores $s_{ij1}, \dots, s_{ijK}$ and returns an index between $1$ and $K$. In the forward pass, we set the weight of $(i, j)$ to $w_{ijk^*}$ where $k^* = \rho(s_{ij1}, \dots, s_{ijK})$.
In our work we set $\rho$ to be either the argmax function (returning the index corresponding to the largest score) or sampling from a Multinomial distribution defined by a softmax over the quality scores. We refer to the former as Greedy Selection (GS). We name the latter Probabilistic Sampling (PS) and implement it as

$$k^* \sim \mathrm{Mult}\!\left(\frac{e^{s_{ij1}}}{\sum_{k=1}^{K} e^{s_{ijk}}}, \dots, \frac{e^{s_{ijK}}}{\sum_{k=1}^{K} e^{s_{ijk}}}\right).$$
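As a concrete illustration, the two selection rules can be sketched in NumPy for a single connection (a sketch with our own function names, not the paper's reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_index_gs(scores):
    """Greedy Selection (GS): pick the option with the highest score."""
    return int(np.argmax(scores))

def sample_index_ps(scores):
    """Probabilistic Sampling (PS): draw the option index from a
    categorical (Multinomial) distribution given by a softmax over the
    quality scores."""
    probs = np.exp(scores - scores.max())  # shift for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs))

scores = np.array([0.1, 0.5, 0.2])  # one connection with three options
print(select_index_gs(scores))       # -> 1
```

GS is deterministic given the scores, while PS keeps exploring other options in proportion to their softmax probabilities.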
The empirical comparison between these two selection strategies is given in Section 4.5.
We note that, although multiple values per connection are considered during training (as opposed to the infinite number of possible values available to traditional training), only one value per connection is used at test time. The final network is obtained by selecting for each connection the value corresponding to the highest score (for both GS and PS) upon completion of training. Thus, the effective capacity of the network at inference time is the same as that of a traditionally trained network.
Backward Pass. In the backward pass, all the scores are updated with straight-through gradient estimation since $\rho$ has a zero gradient almost everywhere. The straight-through gradient estimator (Bengio et al., 2013) treats $\rho$ essentially as the identity function in the backward pass by setting the gradient of the loss with respect to $s_{ijk}$ as

$$\frac{\partial \mathcal{L}}{\partial s_{ijk}} \;=\; \frac{\partial \mathcal{L}}{\partial a_i^{(\ell)}} \, \frac{\partial a_i^{(\ell)}}{\partial z_i^{(\ell)}} \, a_j^{(\ell-1)} \, w_{ijk}$$

for $k = 1, \dots, K$, where $\mathcal{L}$ is the objective function and $z_i^{(\ell)}$ is the pre-activation of neuron $i$ in layer $\ell$.
Given $\alpha$ as the learning rate, and ignoring momentum, we update the scores via stochastic gradient descent as

$$\tilde{s}_{ijk} = s_{ijk} - \alpha \frac{\partial \mathcal{L}}{\partial s_{ijk}}$$

where $\tilde{s}_{ijk}$ is the score after the update. Our experiments demonstrate that this simple algorithm learns to select effective configurations of random weights, resulting in impressive results across different datasets and models.
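Putting the forward and backward passes together, a minimal single-layer sketch in NumPy might look as follows (our naming, a linear layer, and a squared-error loss for illustration; the actual networks use non-linearities and SGD with momentum):

```python
import numpy as np

rng = np.random.default_rng(0)

# One "slot machine" layer: W stores the fixed random options for each
# connection, S the learnable quality scores. W is never updated.
n_in, n_out, K = 4, 3, 8
W = rng.uniform(-1, 1, size=(n_out, n_in, K))     # fixed random options
S = rng.uniform(0.0, 0.1, size=(n_out, n_in, K))  # quality scores
W_original = W.copy()

def selected_weights():
    """Greedy Selection: per connection, pick the highest-scoring option."""
    idx = S.argmax(axis=2)
    return np.take_along_axis(W, idx[..., None], axis=2).squeeze(2)

def train_step(x, target, lr=0.05):
    """One SGD step on the scores with a straight-through estimator:
    the gradient w.r.t. score s_ijk is approximated by the gradient
    w.r.t. the effective weight, (dL/dz_i) * x_j, times option w_ijk."""
    global S
    w = selected_weights()
    z = w @ x                                     # linear pre-activations
    grad_z = z - target                           # from L = 0.5*||z - target||^2
    grad_w = np.outer(grad_z, x)                  # dL/dw_ij
    S = S - lr * grad_w[..., None] * W            # straight-through update
    return 0.5 * np.sum((z - target) ** 2)

x = rng.uniform(-1, 1, size=n_in)
target = np.array([1.0, -1.0, 0.5])
losses = [train_step(x, target) for _ in range(50)]
assert np.array_equal(W, W_original)              # weights never change
```

Note that the forward output only changes when an update flips an argmax; the options themselves stay fixed, matching the idea that only the scores are learned.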
4 Experiments

4.1 Experimental Setup
The weights of all our networks are sampled uniformly at random from a Glorot Uniform distribution (Glorot and Bengio, 2010), ignoring the number of options per connection when computing the standard deviation, since this number does not affect the network capacity. As for the weights, we initialize the scores independently from a uniform distribution bounded by small constants, using slightly different bounds for fully-connected and convolutional layers. We hold out portions of the training sets of MNIST and CIFAR-10 for validation and report performance on the separate test set. On MNIST, we experiment with the Lenet-300-100 (Lecun et al., 1998) architecture following the protocol in Frankle and Carbin (2018). We also use the VGG-like architectures used thereof and in Zhou et al. (2019) and Ramanujan et al. (2019). We denote these networks as CONV-2, CONV-4, and CONV-6. For completeness, these architectures and their training schedules are provided in Appendix C. All our plots show the averages of five independent trials. Error bars, whenever shown, denote the minimum and maximum over the trials. A core component of our algorithm is the hyperparameter controlling the number of options per connection. As such, we conduct experiments with different numbers of options and analyze the performance as this number varies. When not otherwise noted, we use the Greedy Selection (GS) method. We empirically compare GS to Probabilistic Sampling (PS) in Section 4.5.
4.2 Slot Machines versus Trained Networks
We compare the networks using random weights selected from our approach with two different baselines: (1) randomly initialized networks with one weight option per connection, and (2) traditional trained networks whose weights are optimized directly. These baselines are off-the-shelf modules from PyTorch (Paszke et al., 2019) which we train in the standard way according to the schedules presented in Appendix C.
As shown in Figure 2, untrained dense networks perform at chance if they have only one weight per edge. However, methodically selecting the parameters from just two random values for each connection greatly enhances performance across different datasets and networks. Even better, as shown in Figure 3, as the number of random weight options per connection increases, the performance of these networks approaches that of trained networks with the same number of parameters, despite containing only random values. Malach et al. (2020) proved that any “ReLU network of depth $\ell$ can be approximated by finding a weighted-subnetwork of a random network of depth $2\ell$ and sufficient width.” We find a random network that performs as well as a trained network of the same depth, without any form of pruning.
4.3 Finetuning Slot Machines
Our approach can also be viewed as a strategy to provide a better initialization for traditional training. To assess the value of such a scheme, we finetune the networks obtained after training Slot Machines for 100 epochs (see Appendix C for implementation details). Figure 4 summarizes the results in terms of training time (including both selection and finetuning) vs test accuracy. It can be noted that for the CONV-4 and CONV-6 architectures, finetuned Slot Machines achieve higher accuracy compared to the same models learned from scratch, effectively at no additional training cost. For VGG-19, finetuning improves accuracy but the resulting model still does not match the performance of the model trained from scratch.
To show that the weight selection in Slot Machines does in fact impact performance of finetuned models, we finetune from different Slot Machine checkpoints. If the selection is beneficial, then finetuning from later checkpoints will show improved performance. As shown in Figure 7, this is indeed the case as finetuning from later Slot Machine checkpoints results in higher performance on the test set.
4.4 Common Sets of Random Weights
We tested slot machines in settings where (1) the connections in a layer share the same set of randomly initialized weights, and (2) all connections in the network share the same set of randomly initialized weights. At the layer level, the weights are drawn from a uniform distribution whose standard deviation matches that of the Glorot Normal distribution for that layer. When using the same set of weights globally, we sample the weights independently from a uniform distribution whose standard deviation is the mean of the standard deviations of the per-layer Glorot Normal distributions.
Each of the weights is still associated with a score. The slot machine with shared weights is then trained as before. This approach has the potential of compressing the model, although the full set of scores is still needed.
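A sketch of the layer-level shared sampling, under our assumption that "uniform with the Glorot Normal standard deviation" means $U(-\sigma\sqrt{3}, \sigma\sqrt{3})$, which has standard deviation $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_layer_pool(fan_in, fan_out, n_options):
    """Sample one shared set of random values for all connections in a
    layer. Assumption: the uniform distribution is chosen so its std
    equals the Glorot Normal std, sigma = sqrt(2 / (fan_in + fan_out)),
    giving bounds of +/- sigma * sqrt(3)."""
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    bound = sigma * np.sqrt(3.0)
    return rng.uniform(-bound, bound, size=n_options)

# e.g., the pool shared by every connection of a 300 -> 100 layer
pool = shared_layer_pool(300, 100, n_options=8)
```

For the global variant described above, sigma would instead be the mean of the per-layer sigmas, with a single pool shared by the whole network.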
As shown in Figure 4, these models continue to do well when the number of options per connection is large enough. However, unlike conventional slot machines, these models do not do well with very few options. Further, with few options it sometimes takes more than one random initialization to get a set that is effective. This phenomenon is directly related to the large performance variance of these models when the number of options is small.
4.5 Greedy Selection Versus Probabilistic Sampling
As detailed in Section 3.2, we consider two different methods for sampling our networks in the forward pass: a greedy selection where the weight corresponding to the highest score is used and a stochastic selection which draws from a proper distribution over the weights. We compare the behavior of our networks under these two different protocols.
As seen in Figure 3, GS performs better. To fully comprehend the performance differences between these two strategies, it is instructive to look at Figure 7, which reports the percentage of weights changed every 5 epochs by the two strategies. PS keeps changing a large percentage of weights even in late stages of the optimization, due to its probabilistic sampling. Despite the network changing considerably, PS still manages to obtain good accuracy (see Figure 3) indicating that there are potentially many good random networks within a Slot Machine. However, as hypothesized in Ramanujan et al. (2019), the high variability due to stochastic sampling means that the same network is likely never or rarely observed more than once in any training run. This makes learning extremely challenging and consequently adversely impacts performance. Conversely, GS is less exploratory and converges fairly quickly to a stable set of weights.
From Figure 3 we can also notice that the accuracy of GS improves or remains stable as the number of options per connection is increased. This is not always the case for PS when the number of options is large. We claim this behavior is expected since GS is more restricted in terms of the choices it can take. Thus, GS benefits more from a large number of options compared to PS.
4.6 Distribution of Selected Weights
Here we look at the distribution of selected weights at different training points in order to understand why certain weights are chosen and others are not. As shown in Figure 8, both GS and PS tend to prefer weights with large magnitudes as learning progresses. This propensity for large weights might help explain why methods such as magnitude-based pruning of trained networks work as well as they do. Unlike the weights, we find that the scores associated with the selected weights form a normal distribution over time, as shown in Figure 12 in the Appendix.
5 Conclusion and Future Work
This work shows that neural networks with fixed random weights perform competitively, given a good selection criterion and provided each connection has more than one weight option. We introduce a simple selection procedure that is remarkably effective and consistent in producing strong weight configurations from these few options per connection. We also demonstrate that these selected configurations can be used as starting initializations for finetuning, which often produces accuracy gains over training the network from scratch at comparable computational cost. Our study suggests that our method tends to naturally select large-magnitude weights as training proceeds. Future work will be devoted to analyzing what other properties differentiate selected weights from those that are not selected, as knowing such properties may pave the way for more effective initializations for neural networks.
- Estimating or propagating gradients through stochastic neurons for conditional computation.
- Classification and regression trees. Wadsworth & Brooks/Cole, Monterey, CA.
- Language models are few-shot learners.
- BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1.
- The lottery ticket hypothesis: finding sparse, trainable neural networks.
- Weight agnostic neural networks.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 9, Chia Laguna Resort, Sardinia, Italy, pp. 249–256.
- Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
- Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV).
- Deep residual learning for image recognition. In CVPR, pp. 770–778.
- Fix your classifier: the marginal value of training the last weight layer. In ICLR.
- Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Vol. 25, pp. 1097–1105.
- Learning multiple layers of features from tiny images. Technical report, University of Toronto.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- A signal propagation perspective for pruning neural networks at initialization. In ICLR.
- SNIP: single-shot network pruning based on connection sensitivity. In ICLR.
- What do neural networks learn when trained with random labels?
- Proving the lottery ticket hypothesis: pruning is all you need.
- PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems (NeurIPS).
- Optimal lottery tickets via subset sum: logarithmic over-parameterization is sufficient.
- What’s hidden in a randomly weighted neural network?
- XNOR-Net: ImageNet classification using binary convolutional neural networks. In Computer Vision – ECCV 2016, pp. 525–542.
- Faster R-CNN: towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS).
- Intriguing properties of randomly weighted networks: generalizing while learning next to nothing. In 2019 16th Conference on Computer and Robot Vision (CRV), pp. 9–16.
- ImageNet: a large-scale hierarchical image database. In CVPR.
- Evolving neural networks through augmenting topologies. Evolutionary Computation 10 (2), pp. 99–127.
- Pruning neural networks without any data by iteratively conserving synaptic flow.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010.
- Picking winning tickets before training by preserving gradient flow. In ICLR.
- Two-step quantization for low-bit neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4376–4384.
- How transferable are features in deep neural networks? In Neural Information Processing Systems (NIPS).
- Wide residual networks.
- Deconstructing lottery tickets: zeros, signs, and the supermask.
- Neural architecture search with reinforcement learning. In ICLR.
Appendix A Relation to Pruning
The models learned by our algorithm could in principle be found by applying pruning to a bigger network representing the multiple weight options in the form of additional connections. One way to achieve this is by introducing additional “dummy” layers after every layer except the output layer. Each “dummy” layer contains identity units, as many as the number of weight options times the number of neurons in the corresponding layer. The addition of these layers has the effect of separating out the random values for each connection in our network into distinct connection weights. It is important that the neurons of the “dummy” layer encode the identity function to ensure that the random values can pass through it unmodified. Finally, in order to obtain the model learned by our system, all connections between a layer and its associated “dummy” layer must be pruned except for the weights which would have been selected by our algorithm, as shown in Figure 9. This procedure requires allocating a bigger network and is clearly more costly than our algorithm.
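The construction can be checked on a toy example with a single connection (our naming; a sketch of the idea, not the full layer-wise procedure):

```python
import numpy as np

# Toy check for one connection with four random options: selecting an
# option by score (slot machine view) gives the same output as a widened
# network in which each option becomes its own edge through an identity
# "dummy" unit, with all but the selected edge pruned away.
options = np.array([0.3, -1.2, 0.7, 0.05])    # fixed random values
scores = np.array([0.2, 0.9, 0.1, 0.4])       # quality scores
x = 2.0                                        # input activation

# Slot machine view: use the highest-scoring option directly.
slot_out = options[scores.argmax()] * x

# Pruning view: parallel edges feed identity units; a binary mask prunes
# every edge except the selected one, and the outputs are summed.
mask = (np.arange(len(options)) == scores.argmax()).astype(float)
expanded_out = np.sum(mask * (options * x))

print(slot_out == expanded_out)  # -> True
```

The expanded network computes the same function but needs one edge (and one identity unit) per option, which is why the construction is more costly.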
Figure 9: The green-colored connections represent the selected weights. The square boxes in the bigger network implement identity functions. The circles designate vanilla neurons with non-linear activation functions. Red dashed lines in our network represent unchosen weights; these lines in the bigger network would correspond to pruned weights.
Appendix B Experimental Comparison with Prior Pruning Approaches
Our approach is similar to the pruning technique of Ramanujan et al. (2019) as their method too does not update the weights of the network after initialization. Furthermore, their strategy selects the weights greedily, as in our GS. However, they use one weight per connection and employ pruning to uncover good subnetworks within the random network, whereas we use multiple random values per edge. Additionally, we never prune any of the connections. We compare the results of our networks to this prior work in Table 1. We also compare with supermasks (Zhou et al., 2019). Supermasks employ a probability distribution during the selection, which makes them reminiscent of our PS models. However, they use a Bernoulli distribution at each weight while PS uses a Multinomial distribution at each connection. Also, like Ramanujan et al. (2019), supermasks have one weight per connection and perform pruning rather than weight selection. Table 1 shows that GS achieves accuracy comparable to that of Ramanujan et al. (2019) while PS matches the performance of supermasks. These results suggest an interesting empirical performance equivalency among these related but distinct approaches.
Table 1: Test accuracy (%) of Lenet on MNIST and of CONV-2/4/6 on CIFAR-10.

| Method | Lenet | CONV-2 | CONV-4 | CONV-6 |
| --- | --- | --- | --- | --- |
| Ramanujan et al. (2019) | - | 77.7 | 85.8 | 88.1 |
| Supermasks (Zhou et al., 2019) | 98.0 | 66.0 | 72.5 | 76.5 |
| Slot Machines (GS) | 98.1 | 77.2 | 84.6 | 86.3 |
| Slot Machines (PS) | 98.0 | 67.8 | 76.7 | 78.3 |
Appendix C Further Implementation Details
We find that a high learning rate is required when sampling the network probabilistically, a behaviour which was also observed in Zhou et al. (2019). Accordingly, we use a high learning rate for all PS models except the six-layer convolutional network, which uses a different one. We did not train VGG-19 using probabilistic selection. Also, a slightly lower learning rate helps GS networks when the number of options per connection is large; accordingly, we choose the constant learning rate of GS models based on this number. When optimizing the weights directly from scratch or finetuning, we decay the learning rate once during training for all models, except when training CONV-2 and CONV-4 from scratch, where we decay the rate at a different epoch. Directly learned weights and finetuned models all use the same initial learning rate. All the VGG-19 models use an initial learning rate which is decayed once, and an additional time when directly optimizing the weights from scratch. Finetuning in VGG-19 also starts from the same initial learning rate, which is reduced at epoch 20.
Whenever we sample using PS, we train for additional epochs relative to the number used by the corresponding GS model, as shown in Table 2. We do this to compensate for the slow convergence of PS models shown in Figure 7. All finetuned models are trained for the same number of epochs. When directly optimizing the weights from scratch, we set the number of epochs to the sum of the number used for the corresponding selection checkpoint and that of the finetuned model; we then report the test accuracy based on early stopping with respect to the validation accuracy.
We use data augmentation and dropout when experimenting on CIFAR-10 (Krizhevsky, 2009). All models use the same batch size and a stochastic gradient descent optimizer with a momentum of 0.9. We do not use weight decay when optimizing slot machines (for both GS and PS), but we do use weight decay for all directly optimized models (training from scratch and finetuning). We use batch normalization in VGG-19 but the affine parameters are never updated throughout training.
Table 2: Specifications of the network architectures. Entries of the form 2x256 denote two convolutional layers with 256 filters each; pool denotes max pooling and avg-pool denotes average pooling.
Appendix D Distribution of Selected Weights and Scores
As discussed in Section 4.6 and shown in Figure 8, we observe that slot machines tend to choose increasingly large-magnitude weights as learning proceeds. In Figures 10, 11, and 12 of this appendix, we show similar plots for other models. We reasoned that the observed behavior might be due to the Glorot Uniform distribution from which the weights are sampled. Accordingly, we performed an ablation in which we used a Glorot Normal distribution for the weights as opposed to the Glorot Uniform distribution used throughout the paper. As shown in Figure 9(a), the initialization distribution does indeed contribute to the observed pattern of preference for large-magnitude weights. However, initialization may not be the only reason, as the models continue to choose large-magnitude weights even when the weights are sampled from a Glorot Normal distribution. This is shown more clearly in the third layer of Lenet, which has relatively few weights compared to the first two layers. We also observed similar behavior in normally distributed convolutional layers.
Unlike the weights, the selected scores are distributed normally, as shown in Figure 12. The scores in PS move much further away from the initial values compared to those in GS. This is largely due to the large learning rates used in PS models.
Appendix E Scores Initialization
We initialize the quality scores by sampling from a uniform distribution. As shown in Figure 13, we observe that our networks are sensitive to the range of the uniform distribution the scores are drawn from when trained using GS. However, as expected, we found them to be insensitive to the position of the distribution. Generally, narrow uniform distributions lead to higher test set accuracy than wide distributions. This matches intuition, since the network requires relatively little effort to drive a very small score across a small range compared to a large range. To concretize this intuition, take for example the weight that gives the minimum loss for a given connection. If its associated score is initialized poorly to a small value and the range is small, the network will need little effort to push it to the top to be selected. However, if the range is large, the network will need much more effort to drive its score to the top. We believe that this sensitivity to the distribution range could be compensated by using higher learning rates for wider distributions of scores and vice-versa.