Backpropagation is widely considered to be biologically implausible. Among others, Stephen Grossberg stressed this fundamental limitation by discussing the transport of the weights that the algorithm assumes. He claimed that “Such a physical transport of weights has no plausible physical interpretation.” Since then, a number of papers have contributed insights on how to overcome this limitation, including recent contributions.
The ideas behind this paper nicely intersect with Yann LeCun’s paper on a theoretical framework for Backpropagation, which is based on a Lagrangian formulation of learning with the neural equations imposed as constraints. It is worth mentioning that he established the connection by imposing the stationarity condition of the Lagrangian, which is in fact the only case in which one can recover the classic factorization of Backpropagation into forward and backward terms. This connection was brought to our attention during a discussion with Yoshua Bengio on the biological plausibility of Backpropagation and related algorithms. We find that, because of the close connections with the ideas proposed in this paper, this is a nice instance of “Chi cerca trova, chi ricerca ritrova” (“Who seeks finds, who researches finds again”; Ennio De Giorgi), a saying that is quite popular amongst mathematicians.
Instead of focusing on the derivation of Backpropagation in the Lagrangian framework, in this paper we introduce a novel approach to learning that explores the learning adjoint space, which is characterized by the triple (weights, neuron outputs, Lagrangian multipliers). Following the prescriptions for the discovery of the minimum, we search for saddle points in this space. It turns out that gradient descent in the variables and gradient ascent in the multipliers give rise to algorithms that are fully local. This confers on them a biological plausibility that Backpropagation does not enjoy.
An important consequence of the proposed scheme is that it is very well suited to the gradual construction of a network. Algorithms can be constructed to build neurons that gradually satisfy the constraints imposed by the training set while respecting their own underlying model. In a sense, at any stage of learning, the algorithm is characterized by the property of finding a perfect building (PB) of the neurons.
2 Lagrangian formulation of supervised learning
Our model is based on any directed graph , where is the set of vertices of and is the multiset of arcs of the graph. We denote with the set of input neurons, with the set of the output neurons and with the set of hidden neurons. Suppose furthermore (here we use Iverson’s notation: given a statement $A$, we set $[A]$ to $1$ if $A$ is true and to $0$ if $A$ is false). Given a training set , learning is formulated as the problem of determining as a solution of the problem
with . Clearly, once we determine , the corresponding is determined from the neural constraint. The error function , which is accumulated over all the patterns, is minimized under the neural architectural constraints. Here,
is a loss function, while the weighted regularization term favors sparse solutions that correspond to neural pruning. The corresponding Lagrangian is with
Now, let be. In order to analyze the consequence of the saddle point condition of the Lagrangian, we calculate
Once we set , if we impose the stationarity conditions on the Lagrangian and then we recognize in Eq. (3) and Eq. (4) the classic structure of the Backpropagation algorithm. In particular, Eq. (3) expresses the gradient with respect to the weights by the factorization of the forward and backward terms, while Eq. (4), when imposing , yields the backward update of the delta term, which is represented (with negated sign) by the corresponding Lagrangian multiplier. In particular, we have
Finally, clearly reproduces the neural constraints. This theoretical framework of Backpropagation was proposed early on in a seminal paper by Yann LeCun, at the dawn of the connectionist wave.
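The correspondence between the stationarity conditions and Backpropagation can be checked numerically on a very small case. The following is only a sketch under stated assumptions: a chain of two tanh neurons, a quadratic error on the output, and a Lagrangian written with a sign convention chosen here for illustration (it is not necessarily the one used in the equations above). All function and variable names are hypothetical.

```python
import math

def forward(w1, w2, u):
    """Forward pass of a two-neuron chain u -> x1 -> x2 with tanh units."""
    x1 = math.tanh(w1 * u)
    x2 = math.tanh(w2 * x1)
    return x1, x2

def error(w1, w2, u, y):
    """Quadratic error on the output neuron (assumed loss)."""
    _, x2 = forward(w1, w2, u)
    return 0.5 * (x2 - y) ** 2

def lagrangian_gradient(w1, w2, u, y):
    """Weight gradient obtained from the stationarity conditions.

    Assumed Lagrangian (illustrative notation and signs):
        L = 0.5*(x2 - y)^2 + lam1*(x1 - tanh(w1*u)) + lam2*(x2 - tanh(w2*x1))
    """
    # dL/dlam = 0 reproduces the neural constraints, i.e. the forward pass.
    x1, x2 = forward(w1, w2, u)
    d1 = 1.0 - x1 ** 2  # tanh'(w1 * u)
    d2 = 1.0 - x2 ** 2  # tanh'(w2 * x1)
    # dL/dx2 = (x2 - y) + lam2 = 0
    lam2 = -(x2 - y)
    # dL/dx1 = lam1 - lam2 * d2 * w2 = 0  (backward recursion)
    lam1 = lam2 * d2 * w2
    # dL/dw factorizes into a backward term (multiplier) and a forward term (input).
    g1 = -lam1 * d1 * u
    g2 = -lam2 * d2 * x1
    return g1, g2

g1, g2 = lagrangian_gradient(0.7, -0.4, 1.5, 0.3)
```

Under these assumptions, `g1` and `g2` should agree with a finite-difference estimate of the gradient of `error` with respect to the two weights, and the multipliers play the role of the negated error signals propagated backward.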
3 Plausibility in the learning adjoint space
The idea behind the proposed approach is inspired by the basic differential multiplier method (BDMM) . Instead of using the Lagrangian approach as a framework for Backpropagation, one can think of searching for saddle points by the following learning algorithm, which updates the parameters according to
This is a batch-mode learning algorithm which operates in the learning adjoint space defined by the triple . The parameters can be randomly initialized, though in the following we will discuss the initialization . Notice that we need to minimize with respect to the variables , so we perform gradient descent according to lines (1) and (1). Conversely, the Lagrangian multipliers are updated by gradient ascent as stated in line (1). It is worth mentioning that the parameter updating of and exhibits the same batch-mode structure adopted for , but they are kept separate to emphasize the structure of the variables in the adjoint learning space.
Now, we give some more details on the algorithm with the purpose of proving that, unlike Backpropagation, it is fully local.
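The descent/ascent scheme can be illustrated on a toy constrained problem that has nothing to do with neural networks. The sketch below is an assumption-laden example, not the learning algorithm itself: it minimizes $x^2 + y^2$ subject to $x + y = 1$ with the Lagrangian $L = x^2 + y^2 + \lambda (x + y - 1)$, using gradient descent on the variables and gradient ascent on the multiplier. The function name, step size, and number of steps are hypothetical choices.

```python
def bdmm_toy(steps=4000, eta=0.05):
    """Descent on (x, y), ascent on lam for L = x^2 + y^2 + lam*(x + y - 1)."""
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(steps):
        grad_x = 2.0 * x + lam      # dL/dx
        grad_y = 2.0 * y + lam      # dL/dy
        grad_lam = x + y - 1.0      # dL/dlam, i.e. the constraint residual
        # Simultaneous update: descent on the variables, ascent on the multiplier.
        x, y, lam = x - eta * grad_x, y - eta * grad_y, lam + eta * grad_lam
    return x, y, lam

x_star, y_star, lam_star = bdmm_toy()
```

For this quadratic problem the iteration spirals into the saddle point of the Lagrangian, where the constraint is satisfied and the gradient with respect to the variables vanishes. On non-convex problems such as neural learning no such guarantee is implied by this example.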
We begin by noticing that the weights are updated in line (2), according to Eq. (3). As already noticed, this has the classic factorization structure of Backpropagation. Interestingly, the output is not determined by the forward computation as in the Backpropagation algorithm, since evolves according to the updating line (2). As we can see, the updating ends as , which corresponds to the satisfaction of the Backpropagation backward step (see Eq. (4)). As already noticed, in this case the Lagrangian multipliers can be interpreted as the Backpropagation delta error. The computation of the Lagrangian multiplier , which is used in the factorization of Eq. (3), follows line (2) and exhibits a dual structure with respect to . It requires accumulating , which is determined in line (2). The multiplier ends its updating as , which corresponds to the satisfaction of the neural constraints defined by Eq. (1). When looking at the overall structure of the algorithm, one promptly realizes that the outer loops on and on drive the computation over all examples and neurons.
Interestingly, for any pair the factors and are computed by involving and , respectively (see side figure). This corresponds exactly to the operations needed in Backpropagation for the backward and forward steps, but the remarkable difference is that these computations take place at the same time, locally on each neuron, without needing the propagation to the outputs. The algorithm is , where is the set of arcs of the graph; that is, it exhibits the same optimal asymptotic property as Backpropagation.
This analysis reveals the fully local structure of the algorithm for the discovery of saddle points, a property that, unlike Backpropagation, makes it biologically plausible.
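The locality of the updates can be sketched on a small directed acyclic graph. The code below is an illustration under assumptions introduced here: tanh units, a quadratic error on the output neurons, no regularization, a single example, and the Lagrangian $L = \sum_{o} \frac{1}{2}(x_o - y_o)^2 + \sum_i \lambda_i \big(x_i - \tanh(\sum_j w_{ij} x_j)\big)$ with an illustrative sign convention. The names `local_step` and `consistent_state` are hypothetical, and the repeated scan over the arc list is kept for brevity rather than efficiency.

```python
import math

def local_step(arcs, w, x, lam, targets, inputs, eta):
    """One simultaneous step: descent on (w, x), ascent on lam.

    arcs is a list of pairs (i, j), each meaning an arc j -> i.
    Each neuron reads only its parents (forward) and children (backward).
    """
    neurons = [i for i in x if i not in inputs]
    a = {i: sum(w[(k, j)] * x[j] for (k, j) in arcs if k == i) for i in neurons}
    back = {i: lam[i] * (1.0 - math.tanh(a[i]) ** 2) for i in neurons}
    # dL/dw_ij = -back_i * x_j, so descent adds eta * back_i * x_j.
    new_w = {(i, j): w[(i, j)] + eta * back[i] * x[j] for (i, j) in arcs}
    new_x, new_lam = dict(x), {}
    for i in neurons:
        from_children = sum(back[k] * w[(k, j)] for (k, j) in arcs if j == i)
        grad_x = lam[i] - from_children
        if i in targets:
            grad_x += x[i] - targets[i]
        new_x[i] = x[i] - eta * grad_x                        # descent on outputs
        new_lam[i] = lam[i] + eta * (x[i] - math.tanh(a[i]))  # ascent on multipliers
    return new_w, new_x, new_lam

def consistent_state(arcs, w, inputs, targets, order):
    """Forward pass and backward recursion, used here only for comparison.

    order lists the non-input neurons in topological order.
    """
    x = dict(inputs)
    for i in order:
        x[i] = math.tanh(sum(w[(k, j)] * x[j] for (k, j) in arcs if k == i))
    lam = {}
    for i in reversed(order):
        lam[i] = sum(lam[k] * (1.0 - x[k] ** 2) * w[(k, j)]
                     for (k, j) in arcs if j == i)
        if i in targets:
            lam[i] -= x[i] - targets[i]
    return x, lam

arcs = [(1, 0), (2, 0), (2, 1), (3, 1), (3, 2)]
w = {(1, 0): 0.5, (2, 0): -0.3, (2, 1): 0.8, (3, 1): 0.4, (3, 2): -0.6}
inputs, targets = {0: 1.0}, {3: 0.2}
x, lam = consistent_state(arcs, w, inputs, targets, order=[1, 2, 3])
new_w, new_x, new_lam = local_step(arcs, w, x, lam, targets, inputs, eta=0.1)
```

When the outputs and multipliers are initialized from a forward and a backward pass, one local step leaves them unchanged and moves the weights along the negative Backpropagation gradient. From any other initialization the three groups of variables evolve together, which is the regime discussed above.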
Notice that the extension to on-line and mini-batch modes requires a sort of prediction of and , since when we move from one iteration to the next, unlike for batch mode, no variable is allocated to keep the corresponding value. This is an interesting issue, which seems to indicate that the biological plausibility gained by the proposed scheme is restricted either to the possibility of storing all the variables according to the batch scheme or to the presence of an inherent temporal structure that is found in most perceptual tasks.
4 Support neurons and support examples
Let us begin with the assumption that there is no regularization () and that the learning process has ended up in the condition . From Eq. (4) we can promptly see that , . In other words, the neural constraints do not provide any reaction. Whenever this condition holds true, we say that we are dealing with straw neurons. If any of the (hidden) neurons can be removed while keeping the end-of-learning condition true, then that neuron is clearly needless. Of course, if we continue pruning the network, after a while the end-of-learning condition will be violated. Interestingly, as this happens, all neurons are likely to turn into support neurons, which are characterized by . This tricky transition somehow indicates the ill-posedness of learning when formulated as in Eq. (1). On the other hand, in SVMs too, the support vectors emerge because of the presence of regularization.
If then we suddenly see the emergence of a new mathematical structure that reveals the essence of a network building plan that is driven by the parsimony principle. As we can see from Eq. (4), the end-of-learning condition no longer corresponds to the vanishing of the Lagrangian multipliers. While for we have , , for we have
The partition based on the nominal satisfaction of the condition at the end of learning allows us to distinguish straw neurons from support neurons, since whenever we have . Hence, a learning process that drives to with some straw neurons suggests that we can remove them and rely on support neurons only.
This paper gives an insight into the longstanding debate on the unlikely biological plausibility of Backpropagation. It has been shown that the formulation in the learning adjoint space under the Lagrangian formalism leads to fully local algorithms that naturally emerge when searching for saddle points. The approach can be extended to on-line and mini-batch mode learning by keeping the weights as the state of the learning process. The adoption of regularization also makes it possible to appreciate the role of support and straw neurons; the latter can be pruned, so that the learning process also contributes to the construction of the architecture. Finally, the approach can also be naturally extended to any recurrent network for sequences and graphs by expressing the corresponding constraints induced by the given data structure.
We thank Yoshua Bengio for discussions on biological plausibility, which also brought us back to closely related studies on a theoretical framework for Backpropagation by Yann LeCun.
-  Yoshua Bengio, Dong-Hyun Lee, Jörg Bornschein, and Zhouhan Lin. Towards biologically plausible deep learning. CoRR, abs/1502.04156, 2015.
-  S. Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11:23–63, 1987.
-  Y. le Cun. A theoretical framework for backpropagation. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, The 1988 Connectionist Models Summer School, pages 21–28, San Mateo, CA, 1988. Morgan Kaufmann.
-  John C. Platt and Alan H. Barr. Constrained differential optimization. In Neural Information Processing Systems, Denver, Colorado, USA, 1987, pages 612–621, 1987.
-  Benjamin Scellier and Yoshua Bengio. Towards a biologically plausible backprop. CoRR, abs/1602.05179, 2016.