We present an algorithm for finding initial weights for networks which, across a range of examples, train more effectively than randomly-initialized networks with the same architecture. Our results illustrate how the geometry of a data set can inform the development of a network to be trained on that data. We also expect that further development will prove useful to those working at the state of the art.
Effective methods for initializing the weights of deep networks He et al. (2015); Saxe et al. (2014) allow for faster and more accurate training. Geometric and topological analyses of neural networks during training find that the first layer of a network eventually learns weights which match “features” in the input space Carlsson and Gabrielsson (2020), and that extracting those features explicitly can be useful.
Here, we approximate these features of the data distribution via a process we call Linear Discriminant Sorting, or the “Sorting Game,” a deterministic method to initialize weights of a feedforward neural network. The weights which are found via the Sorting Game are then permitted to evolve during training, leading to greater flexibility.
That initial, nonrandom weights affect a network’s training has previously been shown in work on the lottery ticket hypothesis Frankle and Carbin (2019), which holds that large networks contain smaller subnetworks which train nearly as well as the original large network. When one reinitializes these subnetworks with random weights, they no longer perform well. Locating these smaller subnetworks requires the computationally expensive process of weight pruning Frankle and Carbin (2019). Here we find initial network weights which place a small neural network close to a near-optimal loss basin, a process which is less computationally intensive.
Through improvements in performance, we provide evidence for a model of what some neural networks do: find discriminating features in early layers, and then use further layers to perform logic on those features. The current algorithm and its implementation for relatively small networks, along with data sets which are generally modest – though we do report in Section 3.4 on the CIFAR-10 data set – is also meant as a promising invitation both to scale up to larger networks and to implement for feedforward subnetworks of architectures such as transformers Vaswani et al. (2017). More broadly, we see our main results as providing evidence for the fruitfulness of ideas from geometry and topology in better understanding and developing machine learning.
2.1 Motivation and Background
The Linear Discriminant Sorting algorithm (informally, the “sorting game”) builds on the mathematical description of neurons as hyperplanes partitioning the input space. In the sigmoid setting, a neuron is effectively determined by a “strip” with a hyperplane at its center, on which the activation function changes values (typically from 0 to 1). In the ReLU setting, the activation function is constant on one side of the hyperplane and linear on the other.
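To make the hyperplane picture concrete, here is a small pure-Python sketch (our own illustration, not code from the paper) of a sigmoid neuron whose activation transitions across a strip centered on the hyperplane w · x + b = 0:

```python
import math

def sigmoid_neuron(w, b):
    """Return a function computing sigma(w . x + b) for an input vector x."""
    def act(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return act

# A neuron whose hyperplane is x0 = 0 (w = (4, 0), b = 0); the strip
# width across which the activation changes is set by the magnitude of w.
neuron = sigmoid_neuron((4.0, 0.0), 0.0)
print(round(neuron((-2.0, 0.0)), 3))  # far on the negative side: near 0
print(round(neuron((0.0, 5.0)), 3))   # on the hyperplane: 0.5
print(round(neuron((2.0, 0.0)), 3))   # far on the positive side: near 1
```

Moving parallel to the hyperplane (e.g. changing the second coordinate here) does not change the activation; only the signed distance to the hyperplane matters.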
In some studies, the distribution of weights of the neurons in the first layer of a trained network reflects the geometry of the dataset on which the network has been trained Carlsson and Gabrielsson (2020). In particular, we observe that in sigmoid networks trained on classification tasks, the hyperplanes representing the first-layer neurons often appear to lie between the point clouds representing each class, as illustrated in Figure 1.
We use linear discriminant analysis to compute hyperplanes best separating two classes of data. The unit vectors corresponding to those hyperplanes are then used to define first-layer neurons in a neural network. In our applications we primarily use this to initialize the first layer of weights, but we also initialize deeper fully-connected layers in a network following a fixed architecture in Section 3.4.
2.2 Informal Description
We describe the sorting game applied to two classes of data; additional classes are addressed by taking each class label in turn and performing the sorting game on that class versus the rest.
First, we find a hyperplane which separates the two classes by computing the linear discriminant between the data points in the input space. We then set the components of the resulting linear discriminant, normalized to unit magnitude, as the weights of a neuron in the first layer of the network.
We then discard the data points which have been sorted. To choose which points to discard, we first project the data onto the orthogonal complement of the hyperplane. We select a bias that maximizes the total number of data points from opposing classes lying on opposite sides of the hyperplane; these points we consider “sorted” and remove. We then repeat the process of finding a linear discriminant, sorting, and removing well-sorted points until a unique linear discriminant can no longer be computed. See Figure 2. If there are multiple classes, we perform this procedure for the characteristic function of each class.
We use these hyperplanes to initialize the first layer of a neural network, with at least as many neurons as hyperplanes found. We then initialize any remaining layers of the network according to standard initialization schemes before training the network. We permit the discovered initial weights to evolve normally as part of the network.
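The procedure above can be sketched in code. The following is a pure-Python toy for two-dimensional, two-class data; the function names, the stand-alone two-class Fisher discriminant, and the stopping rule are our own simplifications rather than the paper's implementation:

```python
def mean(vecs):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def fisher_direction_2d(class0, class1):
    """Two-class Fisher discriminant direction Sw^-1 (m1 - m0), unit norm, 2-D."""
    m0, m1 = mean(class0), mean(class1)
    # Pooled within-class scatter matrix Sw (2x2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for pts, m in ((class0, m0), (class1, m1)):
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            s[0][0] += d[0] * d[0]; s[0][1] += d[0] * d[1]
            s[1][0] += d[1] * d[0]; s[1][1] += d[1] * d[1]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
         (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]

def best_bias(w, class0, class1):
    """Choose the threshold along w that sorts the most points correctly."""
    p0 = [w[0] * x + w[1] * y for x, y in class0]
    p1 = [w[0] * x + w[1] * y for x, y in class1]
    best_count, best_t = -1, 0.0
    for t in p0 + p1:
        count = sum(v < t for v in p0) + sum(v >= t for v in p1)
        if count > best_count:
            best_count, best_t = count, t
    return -best_t  # bias b so that the hyperplane is w . x + b = 0

def sorting_game(class0, class1, max_rounds=10):
    """Repeatedly find a discriminant, then drop the well-sorted points."""
    hyperplanes = []
    for _ in range(max_rounds):
        if len(class0) < 2 or len(class1) < 2:
            break
        w = fisher_direction_2d(class0, class1)
        b = best_bias(w, class0, class1)
        hyperplanes.append((w, b))
        # Keep only the mis-sorted points for the next round.
        class0 = [p for p in class0 if w[0] * p[0] + w[1] * p[1] + b >= 0]
        class1 = [p for p in class1 if w[0] * p[0] + w[1] * p[1] + b < 0]
    return hyperplanes

# Two linearly separable point clouds: a single pass sorts everything.
toy0 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
toy1 = [(3.0, 0.0), (3.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
planes = sorting_game(toy0, toy1)
print(len(planes))  # prints 1
```

Each (w, b) pair returned would then seed one first-layer neuron, with any remaining neurons initialized by a standard scheme.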
2.3 Formalized Algorithm
2.4 Sampling and Dimensional Reduction
Performing the linear discriminant analysis on samples from the data reduces computational expense; we use such a strategy in Section 3.4. Less obvious but also quite helpful is decreasing dimensionality. When the data points live in a high-dimensional input space, computing all linear discriminants at once is expensive Cai et al. (2008). Instead, we may perform the linear discriminant analysis on a subset of the input variables at a time: running it on a fixed number of features of the input data set at a time, over successive groups of features, reduces the overall complexity. This leads to a large practical speed-up, for example, when fine-tuning the feedforward subnetwork of AlexNet on CIFAR-10 in Section 3.4.
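A sketch of the blockwise idea (illustrative only; we substitute a simple difference-of-class-means direction for the per-block linear discriminant to keep the example self-contained):

```python
def block_directions(class0, class1, k):
    """One discriminant direction per block of k features, each embedded
    back into the full d-dimensional input space as a weight vector."""
    d = len(class0[0])
    dirs = []
    for start in range(0, d, k):
        idx = range(start, min(start + k, d))
        # Stand-in for per-block LDA: difference of class means on this block.
        m0 = [sum(p[i] for p in class0) / len(class0) for i in idx]
        m1 = [sum(p[i] for p in class1) / len(class1) for i in idx]
        v = [a - b for a, b in zip(m1, m0)]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0:
            continue  # the classes do not differ on this block
        w = [0.0] * d
        for i, x in zip(idx, v):
            w[i] = x / norm
        dirs.append(w)
    return dirs

# Toy data: 4 features, classes differ in features 0 and 2.
c0 = [[0, 5, 0, 5], [1, 5, 1, 5]]
c1 = [[4, 5, 4, 5], [5, 5, 5, 5]]
dirs = block_directions(c0, c1, 2)
print(len(dirs))  # prints 2: one direction per 2-feature block
```

Each block is processed independently, so the cost scales with the block size rather than the full input dimension.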
We compare networks initialized with the LDA sorting game to those initialized randomly. In most experiments, we use the LDA Sorting algorithm to determine a number of neurons to initialize deterministically, and then create both LDA-initialized and entirely randomly-initialized networks with the same architecture. Results with an a priori fixed architecture are in Section 3.4.
We compare the training performance of the two initialization schemes both visually, by comparing epoch-vs-accuracy graphs over many trials, and with two metrics. The first metric is the difference between the average number of training steps needed for the LDA-initialized networks and for the randomly-initialized networks to reach a threshold accuracy. Here, the threshold accuracy is defined as the maximum observed training accuracy of the least-accurate network trained with the same hyperparameters; we define it this way to allow for consistent meaning across hyperparameters. The second metric is the difference in minimum validation error between the LDA-initialized networks and the randomly-initialized networks.
These two measurements capture the improvements in performance of the sorting game algorithm. When the average number of steps to threshold is significantly lower for the LDA-initialized networks, they reach a given training accuracy sooner than those initialized randomly. When the minimum validation error of the LDA-initialized networks is lower than that of the randomly-initialized ones, LDA sorting leads to better generalization by the trained network.
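These metrics are straightforward to compute from recorded accuracy curves; the following sketch uses our own variable names and toy data, not values from the paper:

```python
def threshold_accuracy(curves):
    """Threshold = best accuracy reached by the worst-performing run."""
    return min(max(curve) for curve in curves)

def mean_epochs_to(threshold, curves):
    """Average index of the first epoch at which each run reaches threshold."""
    firsts = [next(e for e, acc in enumerate(curve) if acc >= threshold)
              for curve in curves]
    return sum(firsts) / len(firsts)

# Toy accuracy-per-epoch curves for two initialization schemes.
lda_runs = [[0.5, 0.8, 0.9, 0.95], [0.6, 0.85, 0.92, 0.94]]
rand_runs = [[0.3, 0.5, 0.7, 0.9], [0.2, 0.4, 0.8, 0.91]]
t = threshold_accuracy(lda_runs + rand_runs)  # worst run peaks at 0.9
print(mean_epochs_to(t, rand_runs) - mean_epochs_to(t, lda_runs))
```

A positive difference means the LDA-initialized runs reached the threshold earlier on average; the second metric is simply the difference of minimum validation errors between the two groups.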
3.1 Sigmoid Activation
Our first case is that of sigmoid-activated networks. We compare networks with LDA-sorted first layers against networks with the same architecture and orthogonal weight initialization Saxe et al. (2014).
For the MNIST dataset, the LDA initialization algorithm finds 21 weights, which we use to initialize 21 hidden units. Comparing the training trajectory of networks (784 input neurons, 21 hidden neurons, 10 output neurons, softmax and categorical crossentropy loss) initialized with these 21 components against randomly-initialized networks of the same architecture, the LDA-initialized networks reach higher training accuracy significantly sooner than those initialized entirely randomly, in all but the networks trained with a very low batch size and high learning rate. Visually, in Figure 3 we see that the accuracy of the LDA-initialized networks (in red, in all figures) is consistently higher than that of the randomly-initialized networks (in blue, in all figures).
We observe similar results for the Fashion MNIST dataset, where the LDA initialization algorithm finds 28 components. We initialized a network with 28 hidden units, 10 output units, and a softmax output layer, and trained it using stochastic gradient descent and categorical crossentropy loss.
(Figures: training comparisons at batch sizes 25, 100, and 500.)
3.2 Comparison Across Batch Size and Learning Rate
To ensure that the improved training we see from LDA initialization is robust, we performed the same experiment across batch sizes and learning rates for the MNIST and Fashion MNIST dataset initializations. We keep the initialized (sorted) weights the same but independently randomize the remaining weights. The table in Figure 4 demonstrates the comparison between the behavior of LDA-sorted networks and those with random initialization. We consistently see substantial differences in the number of epochs required to reach threshold accuracy, namely about ten epochs out of ninety, with standard deviations across trials of approximately one epoch.
3.3 Initializing a Subset of Neurons
In the case where architecture is pre-selected, the sorting game still gives a benefit to training behavior. Using LDA sorting to initialize only a subset of the first layer’s weights, and then randomly initializing the remaining weights, continues to demonstrate improved training performance over orthogonal initialization, though the improvement diminishes as additional neurons are added, as in Figure 8.
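Initializing only a subset of the first layer can be sketched as follows (a hypothetical helper of our own; Gaussian initialization stands in for whatever scheme the remaining rows would use):

```python
import random

def init_first_layer(lda_dirs, n_hidden, d, scale=0.05):
    """First-layer weight rows: LDA-found directions first, the rest random."""
    rows = [list(w) for w in lda_dirs[:n_hidden]]
    while len(rows) < n_hidden:
        rows.append([random.gauss(0.0, scale) for _ in range(d)])
    return rows

random.seed(0)
# One LDA-found direction seeds row 0; rows 1-3 are random.
W = init_first_layer([[1.0, 0.0, 0.0]], n_hidden=4, d=3)
print(len(W), len(W[0]))  # prints: 4 3
```

All rows, deterministic and random alike, are then trained normally as part of the network.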
3.4 AlexNet Fine Tune
We use the sampling modification described in Sec. 2.4 to initialize 1024 neurons using the output of the AlexNet convolutional layers Krizhevsky et al. (2012). We use the CIFAR-10 dataset Krizhevsky and Hinton (2009), and resize images to the appropriate size for input into the AlexNet convolutional layers. We train a feedforward network with 4092 input neurons, 1024 hidden neurons in the first layer, and 10 output neurons with softmax. We follow a learning rate schedule with an initial learning rate of 0.01 and a decay factor of 0.7 every 10 epochs, with a dropout factor of 0.4. Compared to a Gaussian-initialized network, the linear discriminant initialization leads to significant improvement in initial training, as seen in Figure 9. Since training was performed on data augmented by random affine transformations, training accuracy was inconsistent; instead, we compute threshold accuracy and training time on validation data. During a 50-epoch training run, Sorting Game-initialized networks reached threshold accuracy on average epochs sooner than Xavier Normal-initialized networks Glorot and Bengio (2010). Additionally, Sorting Game-initialized networks reached an average of percentage points lower minimum validation error ( CI to percentage points).
While the sorting game was designed to handle sigmoid activation functions, an identical experiment with the same weight initialization was also performed with the remaining feedforward layers of AlexNet with ReLU activation. Compared to He initialization He et al. (2015), the training still appeared improved, but the difference was less pronounced.
3.5 Global performance, and deeper layers
In practice, the computational time required to run this algorithm is lower than that of pruning a large network, but higher than that of simply running a randomly-initialized network a bit longer. On one machine, applying the (non-optimized) Sorting Game algorithm to the MNIST dataset takes approximately 170 seconds, while a single epoch of training what is now considered a fairly small network with 800 hidden units, such as those used in Lucas et al. (2003), takes about 17 seconds on the same device. Training a full-sized network to completion, roughly one hundred epochs, in order to prune its weights would thus take approximately ten times as long as Sorting Game initialization.
Finally, we report that naively applying linear discriminant analysis to the image of data under the first layer, in order to initialize a second layer, did not yield positive results. At the moment, the Sorting Game only has strong supporting evidence as a way to initialize the first layer.
Our experiments demonstrate improvement in training performance when using LDA initialization compared to standard randomized initializations. This improvement is robust across hyperparameters and also occurs in larger architectures.
We are optimistic that, with optimization, this algorithm could be of value to machine learning practitioners. But this work may be of greater theoretical significance in that it sheds light on the geometry of the loss landscapes of neural network training. Because low stochasticity (large batch size and lower learning rate) leads to a greater separation between Sorting Game-initialized networks and those networks which are randomly initialized, a reasonable interpretation of these results is that LDA Sorting finds a loss basin which has an optimum closer to a global optimum than a randomly-initialized network. We thus have a deterministic algorithmic step which could be incorporated in a number of ways to achieve a combination of higher accuracy and less computational expense.
In some applications, larger batch sizes are desirable for more efficient parallel computation during training Smith et al. (2018). However, large-batch training has potential pitfalls, such as decreased generalization Keskar et al. (2017). Since the improvements from the Sorting Game appear more pronounced when training with larger batch sizes, we hope that this initialization scheme could make large batch sizes more feasible in practice, should they be desired.
Though our results were more strongly supported for sigmoid activation functions than for ReLU, we believe that the general principles of the initialization scheme are applicable in a broader scope. Some modification will be needed for varied classes of activation functions, which opens an avenue for inquiry, namely the interplay between activation functions, the geometry of data, and the geometry of trained networks.
We also expect that other network architectures can be initialized in this way, and in particular expect the Sorting Game to be applicable to the fully-connected feedforward portions of recurrent architectures. Finally, modifying the Sorting Game so that it can be fruitfully applied to multiple layers in a network would not only be of greater practical value, especially for deep networks, but is likely to require deeper insight into the geometry of data, networks, and loss landscapes.
- Cai, D., He, X., and Han, J. (2008). Training linear discriminant analysis in linear time. In Proceedings of the International Conference on Data Engineering, pp. 209–217.
- Carlsson, G. and Gabrielsson, R. B. (2020). Topological approaches to deep learning. In Topological Data Analysis, pp. 119–146.
- Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, pp. 179–188.
- Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations (ICLR 2019).
- Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
- Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations (ICLR 2017).
- Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report.
- Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105.
- Lucas, S. M. et al. (2003). ICDAR 2003 robust reading competitions. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR ’03), p. 682.
- Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (85), pp. 2825–2830.
- Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations (ICLR 2014).
- Smith, S. L., Kindermans, P.-J., Ying, C., and Le, Q. V. (2018). Don’t decay the learning rate, increase the batch size. In 6th International Conference on Learning Representations (ICLR 2018).
- Vaswani, A. et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5999–6009.