. Moreover, linear models are suitable for approximating a higher-level representation of the dataset distribution using autoencoders [3, 27, 13, 16]. They can serve as classic denoising solvers or even as estimators of a dataset's manifold. However, linear models perform poorly on complex problems with nonlinear properties [17, 15, 19].
On the other hand, deep neural networks with nonlinear capabilities are an extensively researched field. These nonlinear models have reached state-of-the-art performance on computer vision problems [26, 20, 10, 21, 7] and natural language processing tasks [25, 5, 2, 4]. Yet the learning process and the direct influence of individual samples on the network remain poorly understood. Therefore, deep linear networks provide a convenient framework for understanding part of the bigger picture of deep learning.
Recent findings in theoretical deep learning show that, despite the linear mapping between the input and the output of deep linear networks, they still have optimization advantages over a network with a single linear layer [1, 9, 12, 22]. These works support the claim with theoretical analysis and experimental demonstrations.
According to the theoretical analysis, deep linear networks have a non-convex optimization process. This paper demonstrates how this perception might be different for fully-connected linear networks. Deep fully-connected linear networks could perform a non-convex (and non-concave) optimization process. However, in practice, these networks are, indeed, going through a convex optimization process that is experimentally equivalent to a single fully-connected linear layer network.
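The non-convexity mentioned above can be illustrated with a toy case. The sketch below is an illustration only, not one of the paper's experiments: it assumes a two-layer scalar linear network with a squared-error loss (the names `loss`, `a`, `b`, and `mid` are hypothetical) and checks the midpoint inequality that any convex function must satisfy.

```python
import numpy as np

def loss(w1, w2, x=1.0, y=1.0):
    """Squared error of a two-layer scalar linear network y_hat = w2 * w1 * x."""
    return (w2 * w1 * x - y) ** 2

# Two parameter settings that both fit the target exactly.
a = np.array([1.0, 1.0])
b = np.array([-1.0, -1.0])
mid = 0.5 * (a + b)  # midpoint of the two minimizers (the origin)

loss_a = loss(*a)
loss_b = loss(*b)
loss_mid = loss(*mid)

# A convex loss would satisfy loss(mid) <= (loss(a) + loss(b)) / 2.
violates_convexity = loss_mid > 0.5 * (loss_a + loss_b)
print(loss_a, loss_b, loss_mid, violates_convexity)
```

The loss is zero at both end points but strictly positive at their midpoint, so the loss surface over the weights `(w1, w2)` is not convex, even though the network computes a linear function of its input.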
We begin in section 2 by exposing some properties of fully-connected linear networks trained with stochastic gradient descent (SGD) and experimentally support our claims in section 3. Then, in section 4, we show how these properties lead to an equivalent optimization process of a deep linear network and a single linear layer (with fully-connected architectures). Finally, in section 5, we explain why SGD with momentum [24] also performs an equivalent optimization process.
2 Proportions in fully-connected linear networks
This section will expose some properties regarding the weights of fully-connected linear networks (experimentally supported by section 3). We will use the following notations:
of size - The weights of layer in the network.
- The th row of .
- The changes of the weights in the first step with respect to the gradient of the loss function.
For a network with a single linear layer, we will define (for two classes). Without loss of generality, we will focus on the randomly initialized weights (and not, the size of vector is .
Let be a single training sample used for training a network with a single linear layer for a single step. Then there is a scalar such that:
Given a network with a single linear layer with randomly initialized weights and a set such that each pair corresponds to the proportional property described in Claim 1 with respect to . Training the network with the entire batch for a single step (with the same initial weights ) will result in the following equality:
For a deep linear network, the following statement is applied (using the previous notations):
For a deep linear network:
For a deep linear network, we get the following for any layer in the network and any step in the training process:
3 Experimental support
To measure how close the vectors are, in terms of proportionality, for each angle in the experiments, we use instead. In addition, we use the negative log-likelihood loss function (a nonlinear function). The experiments were conducted on a K80 GPU.
Claim 1 (support)
For a single linear layer, we use the classes Cat and Dog of the CIFAR10 dataset [14]. For ten different initializations of a single linear layer, we randomly pick different samples. We compute the angle between the first-step weight change and the chosen sample. The mean value of the calculated angles is
, and the standard deviation is.
As expected, each angle is very close to 0° or to 180° (up to numerical errors), which indicates that the weight change equals the chosen sample multiplied by some scalar.
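The proportionality in Claim 1 can be checked with a minimal NumPy sketch. Everything below is an assumption made for illustration: a random vector stands in for a CIFAR10 sample, the label `y`, the learning rate `lr`, and the initialization scale are arbitrary, and the gradient of the softmax negative log-likelihood is written out by hand rather than taken from the original experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 3072, 2                        # flattened 32x32x3 input, two classes
W = rng.normal(scale=0.01, size=(n_classes, d))  # randomly initialized single layer
x = rng.normal(size=d)                        # stand-in for one training sample
y = 1                                         # hypothetical label of that sample

def nll_grad(W, x, y):
    """Gradient of softmax + negative log-likelihood w.r.t. W for one sample."""
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0                               # dL/dz = softmax(z) - one_hot(y)
    return np.outer(p, x)                     # dL/dW = (dL/dz) x^T

def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

lr = 0.1
delta_W = -lr * nll_grad(W, x, y)             # weight change of the first SGD step
angles = [angle_deg(row, x) for row in delta_W]
print(angles)
```

Each row of `delta_W` is the sample scaled by one entry of the logit gradient, so every angle lands at one of the two extremes of the range up to floating-point error.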
Claim 2 (support)
For a batch size of and a single linear network with a single layer, we get that the expression over ten initializations produces an average value of . It implies that the equation is indeed true up to numerical errors.
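One way to read the batch statement is that the update computed on a whole batch can be assembled from the per-sample updates taken at the same initial weights. The sketch below checks this under an assumption of its own, namely that the loss is averaged over the batch, so the combination is a mean. The batch size, the random data, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_classes, batch = 3072, 2, 8
W = rng.normal(scale=0.01, size=(n_classes, d))   # shared initial weights
X = rng.normal(size=(batch, d))                   # stand-in samples
Y = rng.integers(0, n_classes, size=batch)        # stand-in labels

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def batch_grad(W, X, Y):
    """Gradient of the mean negative log-likelihood over the batch w.r.t. W."""
    P = softmax(X @ W.T)
    P[np.arange(len(Y)), Y] -= 1.0
    return P.T @ X / len(Y)

# Gradient of the full batch versus the mean of single-sample gradients.
full = batch_grad(W, X, Y)
per_sample = np.mean(
    [batch_grad(W, X[i:i + 1], Y[i:i + 1]) for i in range(batch)], axis=0
)
max_abs_diff = np.abs(full - per_sample).max()
print(max_abs_diff)
```

With a summed rather than averaged loss, the same check holds with `np.sum` in place of `np.mean`.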
Claim 3 (support)
In the same manner, for multiple layers in the network (using “Architecture B” appears in Appendix B) and a batch size of , consider all possible combinations of and for . We get that for the angles above , we have an average angle of , and for the angles below , the average angle is which supports the fact that:
Claim 4 (support)
For long-term training of a linear network with multiple layers, the following experiments report the average angle between the vectors of each layer as a function of the step number. Each analysis also includes a graph of the model's accuracy over the training steps.
We will use several linear architectures for the experiments. The architectures appear in Appendix B.
In Figure 1, we can see the analysis for Cat versus Dog trained with a batch size of , a learning rate of , and “Architecture B” over steps. The first image shows the average angle between each pair of vectors ( for ) as a function of the iteration (the step number). There are four plots in the first image, each plot for each one of the four linear layers in the network. The presented angles are in degrees, and it is easy to spot that (up to a complement of ) the angle is below a single degree (which is extremely small). The second graph (below the first one) shows the accuracy of the same network. It implies that the angles are independent of the accuracy, and up to minor errors, they have very small values in each phase of the training. Overall it supports the claim that for any given step and layer , the following expression is true:
An additional set of experiments with multiple different settings appears in Appendix A.
Note: When calculating the angles, we normalize the vectors. For vectors with small norms in the first place, a numerical error might occur during the normalization. Therefore, wider layers might have more significant errors. Overall the error is tiny (the vertical scale of the first plot is in degrees) and usually below .
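The angle measurement described in this section can be sketched for a single gradient step as follows. This is an illustrative reimplementation under stated assumptions, not the original experimental code: random data replaces the Cat and Dog images, the batch size and initialization scale are arbitrary, biases are omitted, backpropagation is written by hand, and angles near 180° are folded onto 0° with `min(a, 180 - a)`, which is a choice of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [3072, 2048, 1024, 128, 2]     # layer widths of "Architecture B"
batch = 16
Ws = [rng.normal(scale=1.0 / np.sqrt(m), size=(n, m))
      for m, n in zip(sizes[:-1], sizes[1:])]
X = rng.normal(size=(batch, sizes[0]))
Y = rng.integers(0, 2, size=batch)

# Forward pass, keeping the input of every layer.
acts = [X]
for W in Ws:
    acts.append(acts[-1] @ W.T)

# Softmax + mean negative log-likelihood: gradient w.r.t. the logits.
Z = acts[-1] - acts[-1].max(axis=1, keepdims=True)
P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
P[np.arange(batch), Y] -= 1.0
delta = P / batch

# Backward pass through the linear layers, last layer first.
grads = []
for W, h in zip(reversed(Ws), reversed(acts[:-1])):
    grads.append(delta.T @ h)          # dL/dW for this layer
    delta = delta @ W                  # gradient w.r.t. this layer's input
grads.reverse()

def folded_angle_deg(u, v):
    """Angle in degrees, folded so that 0 means proportional (either sign)."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    a = np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
    return min(a, 180.0 - a)

# Largest folded angle between the first gradient row and every other row.
max_angles = [max(folded_angle_deg(G[0], row) for row in G[1:]) for G in grads]
print(max_angles)
```

`max_angles` holds one value per layer. Repeating the measurement after every optimizer step would give curves of the kind described for Figure 1.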
4 The optimization process
Following the above claims (section 2), the weight update of each layer can be viewed as a collection of scalar multiples of a single vector rather than a collection of unrelated vectors. In this case, the entire update matrix depends on a single vector up to scalar multiplication. Additionally, in the classic training process, where we use stochastic gradient descent (SGD), this is true for any given step in the training process (and not only for the first step). In other words, each layer is updated with a matrix of rank 1. Their multiplication does not reduce the expressiveness of the network.
In general, we can apply a simple reduction from the optimization process of fully-connected deep linear networks to the optimization process of a single fully-connected layer, as illustrated in Figure 2. Assuming that the proportion property of section 2 holds, we can represent the weight update of a layer as a collection of scalar multiples of one vector. Since the update depends on a single vector, up to scalar multiplication, the same update could have been made with a single neuron (which also depends on a single weight vector of the same size). Extending this conclusion to each step and layer in the original network (excluding the output layer), we get an optimization process similar to that of a fully-connected linear network with only a single neuron in each hidden layer. With a single neuron in each hidden layer, the network does not use its depth, since the weights of each hidden layer reduce to a scalar. Therefore, during training, a single linear layer is an equivalent learner to any deep fully-connected linear architecture.
This section concludes that the training process of a randomly initialized fully-connected deep linear network is experimentally equivalent to a randomly initialized linear network with a single layer.
5 SGD with Momentum
In many cases, the SGD optimizer is used with momentum [24] to gain convergence advantages. With such an optimizer, the optimization process retains a memory of the weight updates from previous steps. The momentum algorithm is defined as follows:
Where is the iteration index, is the learning rate, and is the momentum factor. For an iterative representation of , we get:
Assume a new optimization method with a predefined number of steps (in our case, steps), using the traditional SGD with a gentle twist:
In the new variant, is an iteration-dependent scalar. Proportional vectors would be proportional even after scalar multiplications. Therefore, the properties mentioned in section 2 and supported by section 3 are applied to the new variant of SGD defined above. Moreover, we get the following equality:
The above equality shows that taking steps using SGD with momentum produces the same state as the proposed optimization method. Additionally, both methods (SGD with momentum and the new method defined above) use the same matrices to reach that state (up to scalar multiplications). It implies similar expressive abilities, and therefore both processes are equivalent in that sense.
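The equality between the two optimizers can be checked numerically. The sketch below makes several assumptions of its own: it uses the velocity form of momentum (`v = mu * v - lr * g`, then `theta = theta + v`) with zero initial velocity, a small quadratic objective chosen only for illustration, and it reuses the gradients recorded along the momentum trajectory when applying the rescaled plain steps. The symbols `lr`, `mu`, `T`, and `c_i` are names chosen here and are not taken from the original notation.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
A = A @ A.T + np.eye(5)              # positive-definite quadratic, illustration only
b = rng.normal(size=5)

def grad(theta):
    """Gradient of 0.5 * theta^T A theta - b^T theta."""
    return A @ theta - b

lr, mu, T = 0.01, 0.9, 50
theta0 = rng.normal(size=5)

# SGD with momentum, recording every gradient it uses.
theta, v, used_grads = theta0.copy(), np.zeros(5), []
for _ in range(T):
    g = grad(theta)
    used_grads.append(g)
    v = mu * v - lr * g
    theta = theta + v

# The same gradients, each applied once as a plain step scaled by the
# iteration-dependent scalar c_i = sum_{k=0}^{T-1-i} mu^k.
theta_alt = theta0.copy()
for i, g in enumerate(used_grads):
    c_i = (1.0 - mu ** (T - i)) / (1.0 - mu)
    theta_alt = theta_alt - lr * c_i * g

max_abs_diff = np.abs(theta - theta_alt).max()
print(max_abs_diff)
```

Because each gradient is only rescaled by `c_i`, any proportionality between the rows of a gradient matrix is preserved, which is the property the argument of this section relies on.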
Overall, SGD with momentum could be represented as the new variant of SGD we proposed. As explained earlier in this section, the new variant of SGD has an equivalent optimization process to a single layer. Therefore, the optimization process of SGD with momentum has an equivalent optimization process to a single fully-connected layer.
6 Summary and open problems
We experimentally demonstrated how the derivatives of the weights (represented as vectors) are proportional for the same iteration and the same layer when training a deep fully-connected network with conventional optimizers. Provided with this experimental outcome, we concluded that a deep fully-connected network trained with conventional optimizers has an equivalent optimization process to a single fully-connected layer.
Our experiments are valid solely for fully-connected networks. Intuitively, the proportion between two vectors of weights is due to the mutual input layer observations during the training. Even though convolutional layers observe local regions individually (rather than the entire input simultaneously), there are still dependencies between close areas in the image and different filters concerning the same region. Therefore, deep linear convolutional neural networks probably do not have a convex optimization process; however, these models still might demonstrate interesting optimization constraints. Testing this hypothesis in future work might equip us with more knowledge regarding linear networks in particular and convolutional behavior in general.
[1] (2018) On the optimization of deep networks: implicit acceleration by overparameterization.
[2] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[3] (1988) Biological cybernetics 59 (4), pp. 291–294.
[4] (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901.
[5] (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[6] LIBLINEAR: a library for large linear classification. The Journal of Machine Learning Research 9, pp. 1871–1874.
[7] (2014) Generative adversarial nets. Advances in neural information processing systems 27.
[8] How data augmentation affects optimization for linear regression. Advances in Neural Information Processing Systems 34.
[9] (2018) Identity matters in deep learning.
[10] (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.
[11] (1997) Modeling the manifolds of images of handwritten digits. IEEE transactions on Neural Networks 8 (1), pp. 65–74.
[12] (2016) Deep learning without poor local minima.
[13] Semantic autoencoder for zero-shot learning. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3174–3183.
[14] (2009) Learning multiple layers of features from tiny images.
[15] (2017) The expressive power of neural networks: a view from the width. Advances in neural information processing systems 30.
[16] Relational autoencoder for feature extraction. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 364–371.
[17] (2006) Comparative performance of linear and nonlinear neural networks to predict irregular breathing. Physics in Medicine & Biology 51 (22), pp. 5903.
[18] Learning dynamics of linear denoising autoencoders. In International Conference on Machine Learning, pp. 4141–4150.
[19] (2017) On the expressive power of deep neural networks. In international conference on machine learning, pp. 2847–2854.
[20] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788.
[21] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28.
[22] (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.
[23] (2005) Introduction to stochastic search and optimization: estimation, simulation, and control. Vol. 65, John Wiley & Sons.
[24] (2013) On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 1139–1147.
[25] (2017) Attention is all you need. Advances in neural information processing systems 30.
[26] (2018) Deep learning for computer vision: a brief review. Computational intelligence and neuroscience 2018.
[27] (2014) Generalized autoencoder: a neural network framework for dimensionality reduction. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 496–503.
[28] (2005) Applied linear regression. Vol. 528, John Wiley & Sons.
[29] (2001) Text categorization based on regularized linear classification methods. Information retrieval 4 (1), pp. 5–31.
Appendix A Additional experiments
Appendix B Architectures
- Architecture A:
Two linear layers.
Fully-connected layer [input: 3072, output: 128]
Fully-connected layer [input: 128, output: 2]
- Architecture B:
Four linear layers.
Fully-connected layer [input: 3072, output: 2048]
Fully-connected layer [input: 2048, output: 1024]
Fully-connected layer [input: 1024, output: 128]
Fully-connected layer [input: 128, output: 2]
- Architecture C:
Eight linear layers.
Fully-connected layer [input: 3072, output: 2500]
Fully-connected layer [input: 2500, output: 1500]
Fully-connected layer [input: 1500, output: 1024]
Fully-connected layer [input: 1024, output: 512]
Fully-connected layer [input: 512, output: 256]
Fully-connected layer [input: 256, output: 64]
Fully-connected layer [input: 64, output: 16]
Fully-connected layer [input: 16, output: 2]
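The three architectures can be written down as lists of layer widths. The sketch below is a minimal NumPy illustration under assumptions of its own: layers have no bias terms, the initialization is a scaled Gaussian, and the helper names (`init_linear_network`, `forward`, `collapse`) are hypothetical. It also checks that each stack of linear layers computes the same map as a single matrix from 3072 inputs to 2 outputs.

```python
import numpy as np

# Layer widths taken from the lists above (input size first, output size last).
ARCHITECTURES = {
    "A": [3072, 128, 2],
    "B": [3072, 2048, 1024, 128, 2],
    "C": [3072, 2500, 1500, 1024, 512, 256, 64, 16, 2],
}

def init_linear_network(sizes, rng):
    """One weight matrix of shape (output, input) per layer; biases are omitted."""
    return [rng.normal(scale=1.0 / np.sqrt(m), size=(n, m))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(weights, x):
    """Apply the layers in order, with no nonlinearity in between."""
    for W in weights:
        x = W @ x
    return x

def collapse(weights):
    """Multiply the layers into the single matrix W_L ... W_1, starting at the output."""
    W_eff = weights[-1]
    for W in reversed(weights[:-1]):
        W_eff = W_eff @ W
    return W_eff

rng = np.random.default_rng(4)
x = rng.normal(size=3072)
shapes, outputs_match = {}, {}
for name, sizes in ARCHITECTURES.items():
    weights = init_linear_network(sizes, rng)
    W_eff = collapse(weights)
    shapes[name] = W_eff.shape
    outputs_match[name] = bool(np.allclose(forward(weights, x), W_eff @ x))
print(shapes, outputs_match)
```

Collapsing from the output side keeps every intermediate product at two rows, which avoids multiplying the large hidden-layer matrices together.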