Deep neural networks solve a variety of problems by using multiple layers to progressively extract higher-level features from the raw input. The most commonly adopted method for training deep neural networks is backpropagation rumelhart1985learning, which has been around for the past 35 years. Backpropagation assumes that the function $f$ is differentiable and uses the partial derivative with respect to each weight $w_i$ to minimize the function as follows,
$$w_i \leftarrow w_i - \eta \frac{\partial f}{\partial w_i},$$
where $\eta$ is the learning rate. The method is also efficient, as it makes a single functional estimate to update all the weights of the network. That is, the partial derivative for some weight $w_j$, where $j \neq i$, would change once $w_i$ is updated, yet this change is not factored into the weight update rule for $w_j$. Although deep neural networks are non-convex (and the weight update rule measures approximate gradients), this update rule works surprisingly well in practice.
To explain the above observation, recent literature du2018gradient; li2018learning argues that because the network is over-parametrized, the initial set of weights is very close to the final solution, and even a little nudging with gradient descent around the initialization point leads to a very good solution. We take this argument to another extreme: instead of using gradient based optimizers, which provide strong direction and magnitude signals for updating the weights, we explore the region around the initialization point by sampling weight changes to minimize the objective function. Formally, our weight update rule is
$$w_i \leftarrow \begin{cases} w_i, & f(x, w_i) \le f(x, w_i + \Delta w_i) \\ w_i + \Delta w_i, & f(x, w_i) > f(x, w_i + \Delta w_i) \end{cases}$$
where $\Delta w_i$ is the weight change hypothesis. Here, we explicitly test the region around the initial set of weights by computing the function, and we update a weight only if the change reduces the loss, see Fig. 1. Surprisingly, our experiments demonstrate that the above update rule requires fewer weight updates compared to backpropagation to find good minimizers for deep neural networks, strongly suggesting that just exploring regions around randomly initialized networks is sufficient, even without explicit gradient computation. We evaluate this weight update scheme (called RSO; random search optimization) on classification datasets like MNIST and CIFAR-10 with deep convolutional neural networks (6-10 layers) and obtain competitive accuracy numbers. For example, RSO obtains 99.1% accuracy on MNIST and 81.8% accuracy on CIFAR-10 using just the random search optimization algorithm. We do not use any other optimizers for optimizing the final classification layer.
Although RSO is computationally expensive (because it requires a number of updates that is linear in the number of network parameters), our hope is that, as we develop better intuition about the structural properties of deep neural networks (like Hebbian principles, Gabor filters, depth-wise convolutions), we will be able to find better minimizers quickly. If the number of trainable parameters is reduced drastically (like frankle2020training), search based methods could be a viable alternative to back-propagation. Furthermore, since the architectural innovations of the past decade use back-propagation by default, a different optimization algorithm could potentially lead to a different class of architectures, because the minimizers of an objective function found by different greedy optimizers could be different.
2 Related Work
Multiple optimization techniques have been proposed for training deep neural networks. When gradient based methods were believed to get stuck in local minima with random initialization, layer-wise training was popular for optimizing deep neural networks hinton2006fast; bengio2007greedy using contrastive methods hinton2002training. In a similar spirit, Greedy InfoMax lowe2019putting recently proposed maximizing mutual information between adjacent layers instead of training a network end to end. taylor2016training finds the weights of each layer independently by solving a sequence of optimization problems which can be solved globally in closed form. However, these training methods do not generalize to deep neural networks which have more than 2-3 layers, and it is not shown that the performance increases as we make the network deeper. Hence, back-propagation with SGD or other gradient based optimizers duchi2011adaptive; sutskever2013importance; kingma2014adam is commonly used for optimizing deep neural networks.
Once surprisingly good results were obtained with gradient based methods and back-propagation, the research community started questioning the commonly assumed hypothesis that gradient based optimizers get stuck in local minima. Recently, multiple works have proposed that because these networks are heavily over-parametrized, the initial set of random filters is already close to the final solution and gradient based optimizers only nudge the parameters to obtain the final solution du2018gradient; li2018learning. For example, only training batch-norm parameters (and keeping the random filters fixed) can obtain very good results with heavily parametrized very deep neural networks ( 800 layers) frankle2020training. It was also shown that networks can be trained by just masking out some weights without modifying the original set of weights ramanujan2019s, although one can argue that masking is a very powerful operator and can represent an exponential number of output spaces. The network pruning literature covers more on optimizing subsets of an over-parametrized randomly initialized neural network frankle2018lottery; li2016pruning. Our method, RSO, is also based on the hypothesis that the initial set of weights is close to the final solution. Here we show that gradient based optimizers may not even be necessary for training deep networks and that, when starting from randomly initialized weights, even search based algorithms can be a feasible option.
Recently, search based algorithms have gained traction in the deep learning community. Since the design space of network architectures is huge, search based techniques are used to explore the placement of different neural modules to find better design spaces which lead to better accuracy zoph2016neural; liu2018darts. This is done at a block level and each network is still trained with gradient descent. Similar to NAS based methods, weight agnostic neural networks wanngaier2019weight (WANN) also search for architectures, but keep the set of weights fixed. WANN operates at a much finer granularity compared to NAS based methods while searching for connections between neurons and does not use gradient descent for optimization. Genetic algorithms (GA), a class of search based optimization algorithms, have also been used for training reinforcement learning algorithms such2017deep; salimans2017evolution which use neural networks. Deep Neuroevolution such2017deep, which is based on genetic algorithms, creates multiple replicas (children) of an initial neural network by adding minor random perturbations to all the weight parameters and then selects the best child. The problem with such an approach is that updating all the parameters of the network at once leads to random directions which are unlikely to contain a direction that will minimize the objective function. Also, GA based methods were only trained on networks with 2-3 hidden layers, which is fairly shallow when compared to modern deep architectures.
Consider a deep neural network with $D$ layers, where the weights of a layer $d$ with $n_d$ neurons are represented by $W_d = \{w_1, w_2, \ldots, w_{n_d}\}$. For an input activation $A_{d-1}$, $W_d$ generates an activation $A_d = \{a_1, a_2, \ldots, a_{n_d}\}$, see Fig 2. Each weight tensor $w_i \in W_d$ generates an activation $a_i \in A_d$, where $a_i$ can be a scalar or a tensor depending on whether the layer is fully connected, convolutional, recurrent, batch-norm etc. The objective of the training process is to find the best set of weights $W = \{W_1, \ldots, W_D\}$, which minimize a loss function $F(X, L; W)$ given some input data $X$ and labels $L$.
To this end, we initialize the weights of the network with a Gaussian distribution, like he2015delving. The input data is also normalized to have zero mean and unit standard deviation. Once the weights of all layers are initialized, we compute the standard deviation $\sigma_d$ of all elements in the weight tensor $W_d$ of each layer $d$. In the weight update step for a weight $w_j$ in layer $d$, a change $\Delta w_j$ is sampled from $\mathcal{N}(0, \sigma_d)$. We call the resulting perturbation $\Delta W_j$; it is zero for all weights of the network except $w_j$, where it equals $\Delta w_j$. For a randomly sampled mini-batch $(x, l)$ of the input data and labels, we compute the loss $F$ for $W$, $W + \Delta W_j$ and $W - \Delta W_j$, where $W$ denotes the full set of network weights. If adding or subtracting $\Delta W_j$ reduces $F$, $W$ is updated; otherwise the original weight is retained. This process is repeated for all the weights in the network, i.e., to update all the weights of the network once, $F$ needs to be computed three times the number of weight parameters in the network, $3|W|$. We first update the weights of the layer closest to the labels and then sequentially move closer to the input. This is typically faster than optimizing the other way, but both methods lead to good results. This algorithm is described in Algorithm 1.
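The update procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy two-layer ReLU network, the helper names `cross_entropy`, `rso_cycle` and `make_toy_problem`, and the synthetic two-class data are all hypothetical.

```python
import numpy as np

def cross_entropy(weights, x, y):
    """Loss of a toy two-layer ReLU network (hypothetical stand-in for F)."""
    hidden = np.maximum(x @ weights[0], 0.0)
    logits = hidden @ weights[1]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(y)), y].mean())

def rso_cycle(weights, sigmas, loss_fn, sample_batch, rng):
    """One cycle: visit every weight once, layer closest to the labels first."""
    for d in reversed(range(len(weights))):
        layer = weights[d]
        for j in range(layer.size):
            x, y = sample_batch()                   # mini-batch for this weight
            delta = rng.normal(0.0, sigmas[d])      # sampled weight change
            original = layer.flat[j]
            candidates = (original, original + delta, original - delta)
            losses = []
            for value in candidates:                # three loss evaluations
                layer.flat[j] = value
                losses.append(loss_fn(weights, x, y))
            # argmin returns the first minimum, so ties keep the original weight
            layer.flat[j] = candidates[int(np.argmin(losses))]
    return weights

def make_toy_problem(seed=0, n=200, hidden=8):
    """Synthetic two-class data and Gaussian-initialized weights (assumed setup)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 2)) + 2.0 * y[:, None]   # two shifted Gaussian blobs
    x = (x - x.mean(axis=0)) / x.std(axis=0)         # zero mean, unit std
    weights = [rng.normal(0.0, np.sqrt(2.0 / 2), size=(2, hidden)),
               rng.normal(0.0, np.sqrt(2.0 / hidden), size=(hidden, 2))]
    return x, y, weights, rng
```

In this sketch the per-layer standard deviations would be computed once after initialization, for example `sigmas = [float(w.std()) for w in weights]`, and `sample_batch` would return a random subset of the training data. Note that no learning rate appears anywhere; the step size comes only from the sampled change.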
In Algorithm 1, in line 12, we sample change in weights from a Gaussian distribution whose standard deviation is the same as the standard deviation of the layer. This is to ensure that the change in weights is within a small range. The Gaussian sampling can also be replaced with other distributions like uniform sampling from or just sampling values from a template like and these would also be effective in practice. The opposite direction of a randomly sampled weight is also tested because often it leads to a better hypothesis when one direction does not decrease the loss. However, in quite a few cases (close to 10% as per our experiments), not changing the weight at all is better. Note that there is no concept of learning rate in this algorithm. We also do not normalize the loss if the batch size increases or decreases as the weight update step is independent of the magnitude of the loss.
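The interchangeable sampling schemes mentioned above could look roughly as follows. The uniform range and the template values are not recoverable from the text, so the range `[-sigma, sigma]` and the template `{-sigma, +sigma}` used here are assumptions chosen for illustration, as is the function name `sample_delta`.

```python
import numpy as np

def sample_delta(sigma, rng, scheme="gaussian"):
    """Sample one candidate weight change for a layer with std `sigma`."""
    if scheme == "gaussian":
        return float(rng.normal(0.0, sigma))
    if scheme == "uniform":
        # Assumed range: the source's exact interval is not shown.
        return float(rng.uniform(-sigma, sigma))
    if scheme == "template":
        # Assumed template: a fixed step of one layer std in either direction.
        return float(rng.choice([-sigma, sigma]))
    raise ValueError(f"unknown sampling scheme: {scheme}")
```

All three variants keep the proposed change on the scale of the layer's own weights, which is the property the paragraph above relies on.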
There is widespread belief in the literature that randomly initialized deep neural networks are already close to the final solution du2018gradient; li2018learning. Hence, we use this prior and explore regions using bounded step sizes (sampled from the per-layer Gaussian described above) in a single dimension. We chose to update one weight at a time instead of sampling all the weights of the network jointly, as the latter would require an exponential number of samples to estimate their joint distribution. RSO would become significantly faster if even some prior knowledge about the distribution of the weights of individual neurons were used.
We demonstrate the effectiveness of RSO on image classification tasks on the MNIST mnist1998 and the CIFAR-10 Krizhevsky2009CIFAR data sets. MNIST consists of 60k training images and 10k testing images of handwritten single digits and CIFAR-10 consists of 50k training images and 10k testing images for 10 different classes of images.
We use a standard convolution neural network (CNN) with 6 convolution layers followed by one fully connected layer as the baseline network for MNIST. All convolution layers use a filter and generate an output with filters channels. Each convolution layer is followed by a Batch Norm layer and then a ReLU operator. Every second convolution layer is followed by an average pool operator. The final convolution layer's output is pooled globally and the pooled features are input to a fully connected layer that maps them to the target output classes. The feature output is mapped to probability values using a softmax layer, and cross entropy loss is used as the objective function when choosing between two network states. For RSO, we train the networks for cycles as described in Section 3 and in each cycle the weights in each layer are optimized once. We sample random samples in a batch per weight for computing the loss during training. The order of updates within each layer is discussed in Section 4.3. After optimizing a convolution layer, we update the parameters of its batch norm layer using search based optimization as well. We linearly anneal the standard deviation at a cycle of the sampling distribution for layer , such that the standard deviation at the final cycle is . We report performance of this network using RSO and compare it with backpropagation (with SGD) and the approach described in wanngaier2019weight in Table 1. The network is able to achieve a near state-of-the-art accuracy of 99.12% using random search alone to optimize all the weights in the network.
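The linear annealing of the sampling standard deviation can be sketched as below. The final value reached at the last cycle is not recoverable from the text, so `final_fraction` is a hypothetical parameter with an arbitrary default, and the function name is ours.

```python
def annealed_sigma(sigma_init, cycle, num_cycles, final_fraction=0.1):
    """Linearly interpolate the sampling std from sigma_init down to
    sigma_init * final_fraction over the cycles (final_fraction is assumed)."""
    if num_cycles < 2:
        return sigma_init
    progress = cycle / (num_cycles - 1)          # 0 at the first cycle, 1 at the last
    return sigma_init * (1.0 - progress * (1.0 - final_fraction))
```

Here `cycle` counts from 0, and `sigma_init` would be the standard deviation of the layer's weights measured right after initialization.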
4.2 Perturbing multiple weights versus perturbing a single weight
We compare different strategies, each with a different set of weights that are perturbed at each update step, to empirically demonstrate the effectiveness of updating the weights one at a time (Algorithm 1). The default strategy is to sample the perturbation for a single weight per update step and cycle through the layers and through each weight within each layer. A second possible strategy is to sample perturbations for all the weights in a layer jointly at each update step and then cycle through the layers. The third strategy is to sample perturbations for all the weights in the whole network at every update step. For layer-level and network-level sampling, we obtain optimal results when the perturbations for a layer with weight tensor are sampled from , where is the standard deviation of all elements in after is initialized for the first time. We optimize each of the networks for K steps for each of the three strategies and report performance on MNIST in Figure 2. For the baseline network described in Section 4.1, K steps translate to about cycles when using the single weight update strategy. Updating a single weight obtains much better and faster results compared to layer-level random sampling which, in turn, is faster than network-level random sampling. When we test on harder data sets like CIFAR-10, the network-level and layer-level strategies do not even show promising initial accuracies when compared to the single weight sampling strategy. The network-level sampling strategy is close to how genetic algorithms function; however, changing the entire network is significantly slower and less accurate than RSO. Note that our experiments are done with neural networks with 6-10 layers.
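The three perturbation granularities compared above differ only in which entries of the perturbation are nonzero, which the following sketch makes explicit. The function names, the `scale` parameter (a placeholder for the elided layer-level and network-level sampling distribution) and the choice to test both signs at every granularity are assumptions made for illustration.

```python
import numpy as np

def propose(weights, sigmas, level, rng, layer=0, index=0, scale=1.0):
    """Return one perturbation tensor per layer for the chosen granularity."""
    deltas = [np.zeros_like(w) for w in weights]
    if level == "weight":        # a single weight in a single layer
        deltas[layer].flat[index] = scale * rng.normal(0.0, sigmas[layer])
    elif level == "layer":       # all weights of one layer jointly
        deltas[layer] = scale * rng.normal(0.0, sigmas[layer],
                                           size=weights[layer].shape)
    elif level == "network":     # all weights of the network jointly
        deltas = [scale * rng.normal(0.0, s, size=w.shape)
                  for w, s in zip(weights, sigmas)]
    else:
        raise ValueError(f"unknown level: {level}")
    return deltas

def accept_if_better(weights, deltas, loss_fn):
    """Keep whichever of W, W + delta, W - delta has the lowest loss."""
    candidates = [weights,
                  [w + d for w, d in zip(weights, deltas)],
                  [w - d for w, d in zip(weights, deltas)]]
    losses = [loss_fn(c) for c in candidates]
    return candidates[int(np.argmin(losses))]    # ties keep the original weights
```

With `level="network"` the sketch resembles the mutate-and-select step of a genetic algorithm with a single child, which is the comparison drawn in the paragraph above.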
|Random Search (ours)||Backpropagation (SGD)||WANN wanngaier2019weight|
4.3 Order of optimization within a layer
When individually optimizing all the weights in a layer , RSO needs an order for optimizing each of the weights, , where is the number of neurons and . is the number of output channels in a convolution layer and each neuron has weights, where the number of input channels and is the filter size. By default, we first optimize the set of weights that affect each output neuron and optimize the next set and so on till the last neuron. Similarly, in a fully connected layer we first optimize weights for one output neuron and then move to the next in a fixed manner. These orders do not change across optimization cycles. This ordering strategy obtains 99.06% accuracy on MNIST. To verify the robustness to the optimization order, we inverted the optimization order for both convolution and fully connected layers. In the inverted order, we first optimize the set of weights that interact with one input channel and then move to the next set of weights and so on. Inverting the order of optimization leads to a performance of 99.12% on the MNIST data set. The final performance for the two runs is almost identical and demonstrates the robustness of the optimization algorithm to a given optimization order in a layer.
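The two visiting orders can be expressed as permutations of the flat weight indices of a convolution tensor. The sketch below assumes a weight layout of `(out_channels, in_channels, k, k)`; both the layout and the function name `weight_order` are assumptions, not details given in the text.

```python
import numpy as np

def weight_order(shape, by="output"):
    """Flat indices of a conv weight tensor in the order they would be visited.

    by="output": all weights of one output neuron, then the next neuron.
    by="input":  all weights touching one input channel, then the next channel.
    """
    indices = np.arange(int(np.prod(shape))).reshape(shape)
    if by == "output":
        return indices.reshape(-1)
    if by == "input":
        # Move the input-channel axis to the front before flattening.
        return np.moveaxis(indices, 1, 0).reshape(-1)
    raise ValueError(f"unknown order: {by}")
```

For a tensor of shape `(2, 3, 1, 1)`, the default order visits indices `0, 1, 2, 3, 4, 5`, while the inverted order visits `0, 3, 1, 4, 2, 5`. Either order covers every weight exactly once per cycle.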
|100K Steps||200K Steps||300K Steps||400K Steps||500K Steps||600K Steps|
On CIFAR-10, we show the ability of RSO to leverage the capacity of reasonably deep convolution networks and show that performance improves with an increase in network depth. We present results on architectures with , and convolution layers followed by one fully connected layer. The CNN architectures are denoted by Depth-, where is the number of convolution layers plus and the details of the architectures are reported in Table 3. Each convolution layer is followed by a Batch Norm layer and a ReLU activation. The final convolution layer output is pooled globally and the pooled features are input to a fully connected layer that maps them to the target output classes. For RSO, we train the networks for cycles and in each cycle the weights in each layer are optimized once as described in Section 3. To optimize each weight we use a random batch of samples. The performance of RSO improves significantly with an increase in depth. This demonstrates that RSO is able to leverage the improvements which come from increasing depth and is not restricted to working with only shallow networks. A comparison for different depth counts is shown in Table 3.
To compare with SGD, we find the hyper-parameters for the best performance of the Depth- architecture by running grid search over batch size from to K and learning rate from to . In RSO, we anneal the standard deviation of the sampling distribution, use weight decay and do not use any data augmentation. For SGD we use a weight decay of , momentum at , no data augmentation and step down the learning rate by a factor of twice. The top performance on CIFAR-10 using the Depth- network was 82.94%.
4.5 Comparison of total weight updates
RSO perturbs each weight once per cycle and a weight may or may not be updated. For cycles, the maximum number of times all weights are updated is . Back-propagation based SGD updates each weight in the network at each iteration. For a learning schedule with epochs and batches per epoch, the total number of iterations is . On the left in Figure 3 we report the accuracy versus the number of times all the weights are updated for RSO and SGD on the MNIST data set. For SGD, we ran grid search as described in Section 4.4 to find hyper-parameters that require the minimum steps to reach accuracy because that is the accuracy of RSO in cycles. We found that using a batch size of 5000, a learning rate of 0.5 and a linear warm-up for the first 5 epochs achieves in less than steps.
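The counting argument above can be made concrete with a small helper. The batch sizes and training set sizes come from the text (5000 on the 60k MNIST training images, 3000 on the 50k CIFAR-10 training images); the epoch and cycle counts are not recoverable, so they are left as arguments. Keeping the final partial batch (rounding up) is an assumption.

```python
import math

def sgd_total_iterations(num_train, batch_size, epochs):
    """Number of times SGD updates all weights: epochs x batches per epoch."""
    batches_per_epoch = math.ceil(num_train / batch_size)  # assumes the last partial batch is kept
    return epochs * batches_per_epoch

def rso_max_full_updates(cycles):
    """RSO visits each weight once per cycle, so at most `cycles` full updates."""
    return cycles
```

For example, `sgd_total_iterations(60000, 5000, 1)` gives 12 iterations per epoch on MNIST, and `sgd_total_iterations(50000, 3000, 1)` gives 17 on CIFAR-10, so the SGD count grows by that factor with every additional epoch.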
On the right in Figure 3 we report the accuracy on the Depth- network (Table 3) at different stages of learning on CIFAR-10 for RSO and SGD. We apply grid search to find optimal hyper-parameters for SGD such that the result is , which is the accuracy of RSO on the Depth- network after cycles. The optimal settings use a batch size of 3000, a learning rate of 4.0, momentum at and a linear warm-up for the first 5 epochs to achieve an accuracy of in a total of iterations. The results on number of iterations demonstrate that the number of times all the weights are updated in RSO, , is typically much smaller than the corresponding number in SGD, for both MNIST and CIFAR-10. Further, RSO reaches on MNIST after updating all the weights just times, which indicates that the initial set of random weights is already close to the final solution and needs small changes to reach a high accuracy.
4.6 Updating weights in parallel
RSO is computationally expensive and sequential in its default update strategy (Section 4.3). The sequential strategy ensures that the up-to-date state of the rest of the network is used when perturbing a single weight in each update step. In contrast, the update for each weight during an iteration in the commonly used backpropagation algorithm is calculated using gradients that are based on a single state of the network. In backpropagation, the use of an older state of the network empirically proves to be a good approximation for finding the weight update for multiple weights. As shown in Section 4.2, updating a single weight at each step leads to better convergence with RSO as compared to updating weights jointly at the layer and network level. If we can search for updates by optimizing one weight at a time while using an older state of the network, we would have an embarrassingly parallel optimization strategy that can then leverage distributed computing to significantly reduce the run time of the algorithm.
We found that the biggest computational bottleneck was presented by convolution layers and accordingly experimented with two levels of parallel search specifically for convolution layers. The first approach was to search each of the weights in a layer in parallel. The search for the new estimate for each weight uses the state of the network before the optimization started for layer . The second approach we tried was to search for weights of each output neuron in parallel. We spawn different threads per neuron and within each thread we update the weights for the assigned neuron sequentially. Finally we merge the weights for all neurons in the layer. Results on the MNIST data set are shown on the left and results on the CIFAR-10 data set are shown on the right in Figure 4. Regarding the rate of convergence, sequential updates outperform both layer-level parallel search and updating the neurons in parallel. However, both parallel approaches converge almost as well as sequential search at the end of cycles on MNIST. For the CIFAR-10 data set, the sequential update strategy seems to outperform the parallel search strategies by a significant margin over a learning schedule of cycles. However, given the embarrassingly parallel nature of both parallel search strategies, this limitation may be overcome by running the experiment on a longer learning schedule using distributed learning hardware to possibly close this gap. This is a hypothesis which will have to be tested experimentally in a distributed setting.
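The first (layer-level) parallel strategy can be sketched as follows. For simplicity the independent per-weight searches run in a plain loop rather than on separate workers, and a single `loss_fn(weights)` stands in for the per-weight mini-batch loss; the function name and these simplifications are assumptions.

```python
import numpy as np

def parallel_layer_update(weights, layer, loss_fn, sigma, rng):
    """Search every weight of one layer against the same stale snapshot,
    then merge the accepted changes. The input weights are not modified."""
    snapshot = [w.copy() for w in weights]    # state before the layer is optimized
    base_loss = loss_fn(snapshot)
    merged = snapshot[layer].copy()
    for j in range(merged.size):              # iterations are independent of each other
        trial = [w.copy() for w in snapshot]  # every search starts from the snapshot
        original = trial[layer].flat[j]
        delta = rng.normal(0.0, sigma)
        best_loss, best_value = base_loss, original
        for value in (original + delta, original - delta):
            trial[layer].flat[j] = value
            loss = loss_fn(trial)
            if loss < best_loss:
                best_loss, best_value = loss, value
        merged.flat[j] = best_value           # merge step
    result = [w.copy() for w in snapshot]
    result[layer] = merged
    return result
```

Because each search ignores changes accepted for the other weights, the merged layer is not guaranteed to reduce the loss when weights interact, which is consistent with the slower convergence reported above for the parallel variants.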
4.7 Caching network activation
Throughout this paper, we have described neural networks with convolution layers and a fully connected layer. The number of weight parameters , where and are the number of output filters, number of input filters and filter size respectively, for convolution layer . The FLOPs of a forward pass for a batch size of is , where is the input activation size at layer . Updating all the parameters in the network once using a batch size requires FLOPs, which leads to a computationally demanding algorithm.
To reduce the computational demands for optimizing a layer , we cache the network activations from the layer before optimizing layer . This enables us to start the forward pass from the layer by sampling a random batch from these activations. The FLOPs for a forward pass starting at a convolution layer reduce to . The cost of caching the activations of layer for the complete training set is negligible if computed at a batch size large enough that the number of forward iterations . The FLOPs for training the parameters once using this caching technique is . This caching strategy leads to an amortized reduction in FLOPs by for the MNIST network described in Section 4.1 and by for the Depth- network in Table 3.
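The effect of caching can be illustrated with a rough cost model. Everything in this sketch is an assumption: it counts one multiply-accumulate per weight per output position, ignores batch norm, pooling, the fully connected layer and the (reportedly negligible) cost of building the cache, and uses hypothetical names. It is not the paper's FLOP formula, whose terms are not recoverable from the text.

```python
def conv_params(c_out, c_in, k):
    """Number of weights in a convolution layer with k x k filters."""
    return c_out * c_in * k * k

def conv_forward_cost(c_out, c_in, k, out_hw):
    """Approximate cost of one forward pass through the layer for one sample,
    for an output feature map of out_hw x out_hw positions."""
    return conv_params(c_out, c_in, k) * out_hw * out_hw

def full_update_cost(layers, cached):
    """Cost of perturbing every weight once (three forward passes per weight).

    layers: list of (c_out, c_in, k, out_hw) tuples, input side first.
    cached: if True, the forward pass for layer d starts at layer d.
    """
    per_layer = [conv_forward_cost(*spec) for spec in layers]
    total = 0
    for d, (c_out, c_in, k, _) in enumerate(layers):
        forward = sum(per_layer[d:]) if cached else sum(per_layer)
        total += 3 * conv_params(c_out, c_in, k) * forward
    return total
```

For a hypothetical two-layer network `[(2, 1, 1, 4), (2, 2, 1, 2)]`, the model gives 864 without caching and 480 with caching, a reduction of 1.8x. The savings grow with depth because weights in later layers skip more of the network.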
5 Conclusion and Future Work
RSO is our first effective attempt to train reasonably deep neural networks ( 10 layers) with search based techniques. A fairly naive sampling technique was proposed in this paper for training the neural network. Yet, as can be seen from the experiments, RSO converges an order of magnitude faster compared to back-propagation when we compare the number of weight updates. However, RSO computes the function for each weight update, so its training time scales linearly with the number of parameters in the neural network. If we can reduce the number of parameters, RSO will train faster.
Another direction for speeding up would be to use effective priors on the joint distribution of weights. This could be done using techniques like Blocked Gibbs sampling where some variables are sampled jointly, conditioned on all other variables. While sampling at the layer and network level leads to a drop in performance for RSO, a future direction is to identify highly-correlated and coupled blocks of weights in the network (like Hebbian priors) that can be sampled jointly to reduce computational costs, similar to blocked Gibbs. The north star would be a neural network which consists of a fixed set of basis functions (like DCT or wavelet basis functions), where we only have to learn a small set of hyper-parameters which modulate the responses of these basis functions using a training data set. If we can construct such networks and train them with sampling based techniques, it would significantly improve our understanding of deep networks.
There are other improvements which can be made for gradient based optimizers based on the observations in this work. We showed that the weight update step during training need not be proportional to the magnitude of the loss and weight updates can be sampled from predefined templates. So, instead of relying on the learning rate to decide the step size, one could use two fixed step sizes per weight (computed using the standard deviation of the layer weights) and pick the step size which is aligned with the direction of the gradient. This could potentially lead to faster convergence and simplify the training process as it would not require tuning the learning rate parameter.
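One possible reading of this suggestion is a gradient step whose magnitude is fixed to the layer's standard deviation and whose sign follows the gradient. The sketch below is our interpretation, not a method evaluated in this work; the function name and the use of one `sigma` per layer are assumptions.

```python
import numpy as np

def fixed_step_update(weights, grads, sigmas):
    """Move each weight by exactly +sigma or -sigma of its layer,
    choosing the direction that opposes the gradient (no learning rate)."""
    return [w - s * np.sign(g) for w, g, s in zip(weights, grads, sigmas)]
```

For example, with weights `[1.0, -1.0]`, gradients `[2.0, -3.0]` and a layer standard deviation of `0.5`, the updated weights are `[0.5, -0.5]`. Weights with a zero gradient are left unchanged because `np.sign(0)` is 0.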