Linear discriminant initialization for feed-forward neural networks

07/24/2020 ∙ by Marissa Masden, et al. ∙ University of Oregon

Informed by the basic geometry underlying feed-forward neural networks, we initialize the weights of the first layer of a neural network using the linear discriminants which best distinguish individual classes. Networks initialized in this way take fewer training steps to reach the same level of training accuracy, and asymptotically attain higher accuracy on training data.


1 Introduction

We present an algorithm for finding initial weights for networks that, across a range of examples, train more effectively than randomly-initialized networks with the same architecture. Our results illustrate how the geometry of a data set can inform the development of a network to be trained on that data. We also expect that further development will prove useful to those working at the state of the art.

Effective methods for initializing the weights of deep networks He et al. (2015); Saxe et al. (2014) allow for faster and more accurate training. Geometric and topological analyses of neural networks during training find that the first layer of a network eventually learns weights which match “features” in the input space Carlsson and Gabrielsson (2020), and that extracting those features explicitly can be useful.

Here, we approximate these features of the data distribution via a process we call Linear Discriminant Sorting, or the “Sorting Game,” a deterministic method to initialize weights of a feedforward neural network. The weights which are found via the Sorting Game are then permitted to evolve during training, leading to greater flexibility.

That initial, nonrandom weights affect a network’s training has previously been shown in work on the lottery ticket hypothesis Frankle and Carbin (2019): large networks contain smaller subnetworks which train nearly as well as the original large network. When one reinitializes these subnetworks with random weights, they no longer perform well. Locating these smaller subnetworks requires the computationally expensive process of weight pruning Frankle and Carbin (2019). Here we find initial network weights which place a small neural network close to a near-optimal loss basin, a process which is less computationally intensive.

Through improvements in performance, we provide evidence for a model of what some neural networks do, namely find discriminating features in early layers and then use further layers to perform logic on those features. The current algorithm and its implementation for relatively small networks, along with data sets which are generally modest – though we do report in Section 3.4 on the CIFAR-10 data set – are also meant as a promising invitation both to scale up to larger networks and to implement for feedforward subnetworks of architectures such as transformers Vaswani et al. (2017). More broadly, we see our main results as providing evidence for the fruitfulness of ideas from geometry and topology to better understand and develop machine learning.

2 Algorithm

2.1 Motivation and Background

Figure 1: Hyperplanes representing the three neurons in the first layer of a small neural network trained on the annulus dataset, illustrating the relationship between the geometry of data and the first layer weights.

Figure 2: The sorting game, in pictures. In (a), we compute the linear discriminant between the yellow and purple classes. In (b), we determine which points are unsorted by this discriminant, and in (c) we compute the linear discriminant between the remaining points. We recursively apply this process until all points are sorted.

The Linear Discriminant Sorting algorithm (informally, the “sorting game”) builds on the mathematical description of neurons as hyperplanes partitioning the input space. In the sigmoid setting, a neuron is effectively determined by a “strip” with a hyperplane at its center, on which the activation function changes values (typically from $0$ to $1$). In the ReLU setting, the activation function is constant on one side of the hyperplane and linear on the other.
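As a concrete rendering of this picture (an illustration only; the weight vector, bias, and input below are arbitrary placeholders, not quantities from the paper), the snippet evaluates a single neuron under both activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A neuron with weight vector w and bias b defines the hyperplane {x : w.x + b = 0}.
# With a sigmoid activation, the output transitions from near 0 to near 1 inside a
# "strip" around that hyperplane; with ReLU, the output is 0 on one side and grows
# linearly with the signed pre-activation on the other.
w = np.array([1.0, -2.0])   # placeholder weights
b = 0.5                     # placeholder bias
x = np.array([0.3, 0.4])    # placeholder input point

pre_activation = w @ x + b            # signed (scaled) distance from the hyperplane
print(sigmoid(pre_activation))        # sigmoid neuron output
print(max(0.0, pre_activation))       # ReLU neuron output
```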

In some studies, the distribution of weights of the neurons in the first layer of a trained network reflects the geometry of the dataset on which the network has been trained Carlsson and Gabrielsson (2020). In particular, we observe that in sigmoid networks trained on classification tasks, the hyperplanes representing the first layer neurons often appear to lie between the point clouds representing each class, as illustrated in Figure 1.

Our algorithm applies linear discriminant analysis Fisher (1936); Pedregosa et al. (2011) to compute hyperplanes best separating two classes of data. The unit vectors corresponding to those hyperplanes are then used to define first-layer neurons in a neural network. In our applications we primarily use this to initialize the first layer of weights, but we also initialize deeper fully-connected layers in a network following a fixed architecture in Section 3.4.
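As a minimal sketch of this step (not the authors' reference code), a single two-class discriminant direction can be obtained with scikit-learn's LinearDiscriminantAnalysis and normalized to unit length; the helper name below is ours.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_unit_direction(X, y_binary):
    """Return the unit normal vector of the LDA hyperplane separating
    the two classes encoded in y_binary (0/1)."""
    lda = LinearDiscriminantAnalysis()
    lda.fit(X, y_binary)
    w = lda.coef_[0]                  # discriminant direction for the binary case
    return w / np.linalg.norm(w)      # unit magnitude, as used for the neuron weights
```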

2.2 Informal Description

We describe the sorting game applied to two classes of data, as additional classes are addressed by taking each class label $c$ in turn and performing the sorting game on “class $c$ versus not class $c$”.

First, we find a hyperplane which separates the two classes by computing the linear discriminant between the data points in the input space. Then, we take the components of the resulting linear discriminant, normalized to unit magnitude, as the hyperplane defining a neuron in the first layer of the network.

We then discard the data points which have been sorted. To choose which points to discard, we first project the data onto the orthogonal complement of the hyperplane. We select a bias that maximizes the total number of data points which belong to opposing classes on opposite sides of the hyperplane; these points we consider to be “sorted” and remove. We then repeat the process of finding a linear discriminant, sorting, and removing well-sorted points, until a unique linear discriminant cannot be computed; see Figure 2. If there are multiple classes, we perform this procedure for the characteristic function of each class.

We use these hyperplanes to initialize the first layer of a neural network, with at least as many neurons as hyperplanes found. We then initialize any remaining layers of the network according to standard initialization schemes before training the network. We permit the discovered initial weights to evolve normally as part of the network.

2.3 Formalized Algorithm

  Input: Data points $\{x_j\}_{j=1}^{N}$ where $x_j \in \mathbb{R}^d$; labels $y_j$; number of classes $C$.
  Initialize neuron index $k \leftarrow 0$.
  for $c = 1$ to $C$ do
     repeat
        Compute the unit component vector $w$ for the top linear discriminant on the remaining points for the binary class ($y_j = c$ versus $y_j \neq c$).
        Store $w$ as $W_k$.
        Set $p_j = w \cdot x_j$ for each remaining point $x_j$.
        Find the bias $b$ maximizing the sum: $\sum_j \mathbb{1}\left[\operatorname{sign}(p_j + b) \text{ agrees with the binary label of } x_j\right]$.
        Store $b$ as $b_k$.
        Increment $k$.
        Remove points $x_j$ for all $j$ satisfying: $\operatorname{sign}(w \cdot x_j + b)$ agrees with the binary label of $x_j$.
     until a unique linear discriminant cannot be computed
  end for
  Set network weights $(W, b)$ from the stored $W_k$ and $b_k$.
Algorithm 1 Sorting Game Algorithm
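The Python sketch below is one way Algorithm 1 could be realized, based on our reading of Section 2.2; it uses scikit-learn for the discriminants, and the stopping rules, bias search, and names (sorting_game, best_bias, max_rounds) are our own choices rather than the authors' implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def best_bias(projections, y_binary):
    """Choose the bias whose hyperplane sorts the most points correctly.

    projections: 1-D array of w . x_j for the remaining points.
    Returns (bias, sorted_mask) where sorted_mask marks correctly sorted points.
    """
    # Candidate thresholds: midpoints between consecutive projected values.
    order = np.sort(np.unique(projections))
    candidates = (order[:-1] + order[1:]) / 2 if len(order) > 1 else order
    best_b, best_mask, best_count = 0.0, np.zeros_like(y_binary, dtype=bool), -1
    for t in candidates:
        side = projections > t
        # Try both orientations of the labels relative to the hyperplane.
        for positive_side in (side, ~side):
            mask = positive_side == y_binary.astype(bool)   # correctly sorted points
            if mask.sum() > best_count:
                best_b, best_mask, best_count = -t, mask, mask.sum()
    return best_b, best_mask

def sorting_game(X, y, num_classes, max_rounds=50):
    """Return stacked first-layer weights W (units x features) and biases b."""
    weights, biases = [], []
    for c in range(num_classes):
        Xr, yr = X.copy(), (y == c).astype(int)          # one-vs-rest labels
        for _ in range(max_rounds):
            if len(np.unique(yr)) < 2:                   # one class exhausted: stop
                break
            try:
                lda = LinearDiscriminantAnalysis().fit(Xr, yr)
            except Exception:                            # no usable discriminant
                break
            w = lda.coef_[0]
            w = w / np.linalg.norm(w)                    # unit component vector
            b, sorted_mask = best_bias(Xr @ w, yr)
            weights.append(w)
            biases.append(b)
            if not sorted_mask.any():                    # nothing newly sorted: stop
                break
            Xr, yr = Xr[~sorted_mask], yr[~sorted_mask]  # discard sorted points
    return np.array(weights), np.array(biases)
```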

2.4 Sampling and Dimensional Reduction

Performing the linear discriminant analysis on samples from the data reduces computational expense. We use such a strategy in Section 3.4. Less obvious but also quite helpful is decreasing dimensionality. If there are $N$ data points in $d$-dimensional space with $d$ large, computing all linear discriminants is expensive, with a cost that grows superlinearly in $d$ Cai et al. (2008). Instead, we may perform the linear discriminant analysis on a subset of input variables at a time. Doing so on $k$ features of the input data set at a time (for $d/k$ blocks) reduces each subproblem to dimension $k$ and lowers the overall complexity. This leads to a large practical speed up, for example, when fine-tuning the feedforward subnetwork of AlexNet on CIFAR-10 in Section 3.4.
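The blockwise variant can be sketched as follows; this is our reading of the dimension-reduction strategy above, with the block size and the zero-padding of each direction back to the full input dimension being our own illustrative choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def blockwise_lda_directions(X, y_binary, block_size):
    """Compute one LDA direction per disjoint block of `block_size` features,
    embedded back into the full d-dimensional input space."""
    n, d = X.shape
    directions = []
    for start in range(0, d, block_size):
        cols = slice(start, min(start + block_size, d))
        lda = LinearDiscriminantAnalysis().fit(X[:, cols], y_binary)
        w = np.zeros(d)
        w[cols] = lda.coef_[0]               # place the block's direction in full space
        directions.append(w / np.linalg.norm(w))
    return np.array(directions)
```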

3 Results

We compare networks initialized with the LDA sorting game to those initialized randomly. In most experiments, we use the LDA Sorting algorithm to determine the number of neurons to initialize deterministically, and then create both LDA-initialized and entirely randomly-initialized networks with the same architecture. Results with an a priori fixed architecture are in Section 3.4.

We compare the training performance between the two initialization schemes both visually, comparing epoch-versus-accuracy graphs over many trials, and using two metrics. The first metric is the difference between $E_{\mathrm{LDA}}$ and $E_{\mathrm{rand}}$, the average number of training epochs needed for training accuracy to reach the threshold accuracy for LDA-initialized and randomly-initialized networks, respectively. Here, the threshold accuracy is defined as the maximum observed training accuracy of the least-accurate network trained with the same hyperparameters. We define the threshold accuracy in this way to allow for consistent meaning across hyperparameters. The second metric is the difference in minimum validation error between the LDA-initialized networks, $v_{\mathrm{LDA}}$, and the randomly-initialized networks, $v_{\mathrm{rand}}$.

These two measurements capture the improvements in performance of the sorting game algorithm. When we see below that $E_{\mathrm{LDA}}$ is significantly less than $E_{\mathrm{rand}}$, the LDA-initialized networks reach a given training accuracy sooner than those initialized randomly. When the minimum validation error $v_{\mathrm{LDA}}$ is less than $v_{\mathrm{rand}}$, this indicates that LDA sorting leads to better generalization by the trained network.
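Both metrics can be computed directly from recorded accuracy curves; the small helpers below sketch one way to do so, with the names and array layout being ours rather than the authors'.

```python
import numpy as np

def epochs_to_threshold(train_acc_curves, threshold):
    """Average number of epochs for training accuracy to first reach `threshold`.

    train_acc_curves: array of shape (num_runs, num_epochs)."""
    epochs = []
    for curve in np.asarray(train_acc_curves):
        hits = np.nonzero(curve >= threshold)[0]
        epochs.append(hits[0] + 1 if hits.size else len(curve))
    return float(np.mean(epochs))

def min_validation_error(val_acc_curves):
    """Average minimum validation error (in percent) over runs."""
    val_acc = np.asarray(val_acc_curves)
    return float(np.mean(100.0 * (1.0 - val_acc.max(axis=1))))

# Threshold accuracy, per the definition above: the best training accuracy of the
# *least* accurate run trained with the given hyperparameters, i.e.
# threshold = min(curves.max(axis=1)) taken over all runs for that setting.
```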

3.1 Sigmoid Activation

Our first case is that of sigmoid-activated networks. We compare networks with LDA-sorted first layers against networks with the same architecture and orthogonal weight initialization Saxe et al. (2014).

For the MNIST dataset, the LDA initialization algorithm finds 21 weights, which we use to initialize 21 hidden units. Comparing the training trajectory of networks (784 input neurons, 21 hidden neurons, 10 output neurons, softmax and categorical crossentropy loss) initialized with these 21 components against randomly-initialized networks of the same architecture, the LDA-initialized networks reach higher training accuracy significantly sooner than those initialized entirely randomly in all but the networks trained with a very low batch size and high learning rate. Visually, in Figure 3 we see that the accuracy of the LDA-initialized networks (in red, in all figures) is consistently higher than that of the randomly-initialized networks (in blue, in all figures).

Figure 3: Training accuracy plotted through the training of 20 different fully-connected feedforward neural networks on the MNIST dataset (top) and the Fashion MNIST dataset (bottom).

We observe similar results for the Fashion MNIST dataset, where the LDA initialization algorithm finds 28 components. We initialized a network with 28 hidden units, 10 output units, and a softmax output layer, and trained it using stochastic gradient descent and categorical crossentropy loss.
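As an illustration of how such a network might be assembled (a sketch under our assumptions, not the authors' code), the PyTorch snippet below builds a sigmoid feedforward network whose first layer is copied from the sorting-game output $(W, b)$ and left trainable; the optimizer settings in the comment are placeholders rather than the exact hyperparameters used here.

```python
import torch
import torch.nn as nn

def build_lda_initialized_net(W, b, num_inputs=784, num_classes=10):
    """Feedforward sigmoid network whose first layer is set from the
    sorting-game weights W (units x inputs) and biases b (units,)."""
    num_hidden = W.shape[0]                   # 21 for MNIST, 28 for Fashion MNIST
    net = nn.Sequential(
        nn.Linear(num_inputs, num_hidden),
        nn.Sigmoid(),
        nn.Linear(num_hidden, num_classes),   # softmax is folded into the loss below
    )
    with torch.no_grad():
        net[0].weight.copy_(torch.as_tensor(W, dtype=torch.float32))
        net[0].bias.copy_(torch.as_tensor(b, dtype=torch.float32))
    return net

# Training uses plain SGD with categorical cross entropy; the initialized weights
# are left trainable and evolve with the rest of the network, e.g.:
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(net.parameters(), lr=0.1)   # placeholder learning rate
```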

Figure 4: Comparison of the number of training epochs to threshold accuracy ($E_{\mathrm{LDA}}$ versus $E_{\mathrm{rand}}$) across batch size (25, 100, and 500) and learning rate for the FashionMNIST data set. Statistically significant differences are indicated.
Figure 5: Comparison of minimum validation error (in percent, $v_{\mathrm{LDA}}$ versus $v_{\mathrm{rand}}$) across batch size (25, 100, and 500) and learning rate, following 100 training epochs. Statistically significant differences are indicated.
Figure 6: Distribution of the number of epochs until threshold accuracy is reached, over 20 training sequences of 100 epochs each, on MNIST Dataset.
Figure 7: Example of the distribution of minimum validation error after 100 epochs of training. (Training on MNIST dataset)

3.2 Comparison Across Batch Size and Learning Rate

To ensure that the improved training we see from LDA initialization is robust, we performed the same experiment across batch sizes and learning rates for the MNIST and FashionMNIST dataset initializations. We keep the initialized (sorted) weights the same but independently randomize the remaining weights. The table in Figure 4 demonstrates the comparison between the behavior of LDA-sorted networks and those with random initialization. We consistently see substantial differences in the number of epochs required to reach threshold accuracy, namely about ten epochs out of ninety, with standard deviations across trials which are approximately one.

3.3 Initializing a Subset of Neurons

In the case where architecture is pre-selected, the sorting game still gives a benefit to training behavior. Using LDA sorting to initialize only a subset of the first layer’s weights, and then randomly initializing the remaining weights, continues to demonstrate improved training performance over orthogonal initialization, though the improvement diminishes as additional neurons are added, as in Figure 8.
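One way to realize this mixed initialization is sketched below; the Gaussian fallback for the randomly initialized rows is a stand-in for whatever standard scheme (orthogonal, He, Xavier) is being compared against, and the helper name is ours.

```python
import numpy as np

def mixed_first_layer(W_sorted, b_sorted, num_hidden, num_inputs, rng=None):
    """First-layer parameters for a fixed architecture: the first rows come
    from the sorting game, the remaining rows are randomly initialized."""
    rng = np.random.default_rng() if rng is None else rng
    k = W_sorted.shape[0]                     # number of sorting-game neurons
    # Placeholder random scheme for the "extra" neurons; substitute the scheme
    # actually being compared against (orthogonal, He, Xavier, ...).
    W = rng.normal(scale=1.0 / np.sqrt(num_inputs), size=(num_hidden, num_inputs))
    b = np.zeros(num_hidden)
    W[:k] = W_sorted                          # overwrite the first k rows
    b[:k] = b_sorted
    return W, b
```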

Figure 8: Performance of progressively larger networks trained on Fashion MNIST for 100 epochs. The Sorting Game initialization finds 28 neurons, and training networks with exactly 28 hidden neurons is represented in the first row. Each subsequent row represents a larger network with 28 “extra neurons” which are randomly initialized, in addition to the Sorting Game initialized subnetwork.

3.4 AlexNet Fine Tune

We use the sampling modification described in Sec. 2.4 to initialize 1048 neurons using the output of the AlexNet convolutional layers Krizhevsky et al. (2012). We use the CIFAR-10 dataset Krizhevsky and Hinton (2009), and resize images to the appropriate size for input into the AlexNet convolutional layers. We train a feedforward network with 4092 input neurons, 1024 hidden neurons in the first layer, and 10 output neurons with softmax. We follow a learning rate schedule with an initial learning rate of 0.01 and a learning rate decay factor of 0.7 every 10 epochs, with a dropout factor of 0.4. Compared to a Gaussian-initialized network, the linear discriminant initialization leads to significant improvement in initial training, as seen in Figure 9. Since training was performed on data augmented by random affine transformations, training accuracy was inconsistent; instead, we compute threshold accuracy and training time on validation data. During a 50-epoch training run, Sorting Game-initialized networks reached threshold accuracy on average epochs sooner than Xavier Normal-initialized networks Glorot and Bengio (2010). Additionally, Sorting Game-initialized networks reached an average of percentage points lower minimum validation error ( CI to percentage points).
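The fine-tuning setup can be sketched roughly as follows with torchvision's AlexNet; this is an illustration under our assumptions (feature dimension inferred from the model rather than hard-coded, head layer sizes taken from the sorting-game output, dropout placed before the first linear layer), not the authors' exact pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Frozen AlexNet convolutional stack used as a fixed feature extractor.
# (torchvision >= 0.13; older versions use models.alexnet(pretrained=True).)
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
feature_extractor = nn.Sequential(alexnet.features, alexnet.avgpool, nn.Flatten())
for p in feature_extractor.parameters():
    p.requires_grad = False

preprocess = transforms.Compose([
    transforms.Resize(224),          # CIFAR-10 images upsampled to AlexNet's input size
    transforms.ToTensor(),
])

def build_finetune_head(W, b, num_classes=10, dropout=0.4):
    """Feedforward head whose first layer is set from sorting-game output (W, b)."""
    head = nn.Sequential(
        nn.Dropout(dropout),
        nn.Linear(W.shape[1], W.shape[0]),
        nn.Sigmoid(),
        nn.Linear(W.shape[0], num_classes),
    )
    with torch.no_grad():
        head[1].weight.copy_(torch.as_tensor(W, dtype=torch.float32))
        head[1].bias.copy_(torch.as_tensor(b, dtype=torch.float32))
    return head

# Learning-rate schedule sketched in the text: start at 0.01, decay by 0.7 every
# 10 epochs, e.g. torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.7)
```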

While the sorting game was designed to handle sigmoid activation functions, an identical experiment with the same weight initialization was also performed with the remaining feedforward layers of AlexNet with ReLU activation. Compared to He initialization He et al. (2015), the training still appeared improved, but the difference was less pronounced.

Figure 9: Validation accuracy throughout training when fine tuning an AlexNet implementation with fixed convolutional layers and mutable feedforward layers with sigmoid activation. Comparison of training between LDA-initialized first feedforward layer, and Xavier Gaussian-initialized.

3.5 Global performance, and deeper layers

In practice, the computational time required to run this algorithm is lower than that of pruning a large network, but higher than that of running a randomly-initialized network a bit longer. On one machine, applying the (non-optimized) Sorting Game algorithm to the MNIST dataset takes approximately 170 seconds, while a single epoch of training what is now considered a fairly small network with 800 hidden units, such as those used in Lucas et al. (2003), on the same device takes about 17 seconds. Training a full-sized network to completion, roughly one hundred epochs, in order to prune its weights would thus take approximately ten times as long as Sorting Game initialization.
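For concreteness, the timing comparison above works out as (using only the numbers reported in the preceding paragraph):

\[
\underbrace{100 \text{ epochs} \times 17 \text{ s/epoch}}_{\text{train a full network to completion}} \approx 1700 \text{ s} \;\approx\; 10 \times \underbrace{170 \text{ s}}_{\text{Sorting Game initialization}}.
\]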

Finally, we report that naively applying linear discriminant analysis to the image of data under the first layer, in order to initialize a second layer, did not yield positive results. At the moment, the Sorting Game only has strong supporting evidence as a way to initialize the first layer.

4 Discussion

Our experiments demonstrate improvement in training performance when using LDA initialization compared to standard randomized initializations. This improvement is robust across hyperparameters and also occurs in larger architectures.

We are optimistic that, with optimization, this algorithm could be of value to machine learning practitioners. But this work may be of greater theoretical significance in that it sheds light on the geometry of the loss landscapes of neural network training. Because low stochasticity (large batch size and lower learning rate) leads to a greater separation between Sorting Game-initialized networks and those networks which are randomly initialized, a reasonable interpretation of these results is that LDA Sorting finds a loss basin which has an optimum closer to a global optimum than a randomly-initialized network. We thus have a deterministic algorithmic step which could be incorporated in a number of ways to achieve a combination of higher accuracy and less computational expense.

In some applications, larger batch sizes are desirable for more efficient parallel computation during training Smith et al. (2018). However, large-batch training has pitfalls, such as potentially decreased generalization Keskar et al. (2019). Since the improvements from the Sorting Game appear more pronounced when training with larger batch sizes, we hope that using this initialization scheme could make large batch sizes more feasible in practice, should they be desired.

Though our results were more strongly supported for sigmoid activation functions than for ReLU, we believe that the general principles of this initialization scheme are applicable in a broader scope. Some modification will be needed for other classes of activation functions, which opens up an avenue for inquiry, namely the interplay between activation functions, the geometry of data, and the geometry of trained networks.

We also expect that other network architectures can be initialized in this way, and in particular expect the Sorting Game to be applicable to fully-connected feedforward portions of recurrent architectures. Finally, modifying the Sorting Game so that it can be fruitfully applied to multiple layers in a network would not only be of greater practical value, especially for deep networks, but is likely to require deeper insight into the geometry of data, networks, and loss landscapes.


References

  • D. Cai, X. He, and J. Han (2008) Training linear discriminant analysis in linear time. In Proceedings of the IEEE International Conference on Data Engineering, pp. 209–217. Cited by: §2.4.
  • G. Carlsson and R. B. Gabrielsson (2020) Topological approaches to deep learning. In Topological Data Analysis, pp. 119–146. arXiv:1811.01122. Cited by: §1, §2.1.
  • R. A. Fisher (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, pp. 179–188. Cited by: §2.1.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In 7th International Conference on Learning Representations (ICLR 2019). arXiv:1803.03635. Cited by: §1.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 9, pp. 249–256. Cited by: §3.4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. arXiv:1502.01852. Cited by: §1, §3.4.
  • N. S. Keskar, J. Nocedal, P. T. P. Tang, D. Mudigere, and M. Smelyanskiy (2019) On large-batch training for deep learning: generalization gap and sharp minima. In 5th International Conference on Learning Representations (ICLR 2017), pp. 1–16. arXiv:1609.04836. Cited by: §4.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto. Cited by: §3.4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §3.4.
  • S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young (2003) ICDAR 2003 robust reading competitions. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR ’03), pp. 682. Cited by: §3.5.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (85), pp. 2825–2830. Cited by: §2.1.
  • A. M. Saxe, J. L. McClelland, and S. Ganguli (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations (ICLR 2014), pp. 1–22. arXiv:1312.6120. Cited by: §1, §3.1.
  • S. L. Smith, P. J. Kindermans, C. Ying, and Q. V. Le (2018) Don’t decay the learning rate, increase the batch size. In 6th International Conference on Learning Representations (ICLR 2018), pp. 1–11. arXiv:1711.00489. Cited by: §4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5999–6009. arXiv:1706.03762. Cited by: §1.