Deep learning (e.g., with many layers of neural networks) works very well in areas ranging from speech recognition and image classification to drug discovery, medical image analysis, particle discovery, automatic game playing, and many others Goodfellow et al. (2016). This success is due to the availability of large training datasets and of efficient hardware such as GPUs to accelerate training, rather than to breakthroughs in theoretical foundations. Recent theoretical efforts have partially addressed the origin of the remarkable performance of deep networks, relying on model assumptions Choromanska et al. (2014); Chaudhari and Soatto (2015); Kawaguchi (2016); Soudry and Carmon (2016); Vardan et al. (2016); Patel et al. (2016).
Understanding deep learning as a whole is extremely challenging. However, based on simple hypotheses, we are able to provide theoretical insights into collective properties of trained deep networks, which some regularization techniques exploit to achieve superior performance. One of these techniques is dropconnect, proposed to provide a sampling of a model ensemble during training Wan et al. (2013), with the purpose of reducing over-fitting, which is common in training a fully-connected neural network because the number of weight parameters is much larger than the amount of labeled data provided to the network. Although dropconnect is quite effective in practice, theoretical justifications or underlying mechanisms are still lacking. On the other hand, redundancy in population codes has been observed from the retina to the motor cortex Puchalla et al. (2005); Narayanan et al. (2005); Schneidman et al. (2011).
This redundancy emerges from the collective neural code and provides a neural mechanism that protects the neurophysiological function of the circuit against noise. In terms of synaptic activities, this kind of redundancy may also arise from interactions between synapses, even during supervised training of a deep fully-connected feedforward network. The deep architecture should then be robust, to some extent, to the absence of a fraction of connections between layers. In other words, the generalization ability, measured by the output error on the test data set, does not change significantly until a certain amount of connections are removed. Indeed, this is observed in the numerical experiments of this work.
Surprisingly, this typical phenomenon can be qualitatively captured by a simple random active path model, in which we construct randomly and independently each active path from the input at the bottom layer to the output at the top layer, and each path serves as a constraint in a -weight interaction model ( refers to the depth of the deep network). Increasing the number of paths is equivalent to increasing the weight's mean degree in a graphical model representation. The ground state energy of the model decreases with the mean degree until the mean degree grows up to some critical value. This simple theoretical model qualitatively explains why a small dropconnect probability should be avoided in order to obtain good generalization ability. In practice, we use a dropconnect probability selected from the redundancy regime, and in the backward pass, a random feedback weight matrix is used to backpropagate the error to save computation cost. Elements of the random feedback matrix are independently sampled from a uniform distribution with a relatively large bound for the support Lillicrap Timothy P. et al. (2016). In particular, the random feedback breaks the symmetry of connections between layers in a deep network, which is more biologically plausible, since top-down connections may weight the teaching signals generated from high-order brain areas differently Harris Kenneth D. and Mrsic-Flogel Thomas D. (2013); Harris Kenneth D and Shepherd Gordon M G (2015).
Moreover, the random feedback reduces the computation time compared to the standard dropconnect implementation, while still behaving similarly to dropconnect. The current model study adds some qualitative understanding of this scalable algorithm for modern deep learning, and builds connections to the physics of the p-spin interaction model, which is well studied in the physics community Gardner (1985); Kirkpatrick and Thirumalai (1987); Franz et al. (2001); Montanari, A. and Ricci-Tersenghi, F. (2003).
ii.1 Redundancy in active paths of deep networks
We consider a deep network model with layers of fully-connected feedforward architecture. Each layer has neurons (the so-called width of a layer). We define the input as -dimensional vector, and the weight matrix specifies connections between layer and layer . A bias can also be incorporated into the weight matrix by assuming an additional constant input. The output at the final layer is expressed as:
is an element-wise sigmoid function for neurons at layer, defined as . Note that for the top layer, we use softmax transfer function as , where is the weighted sum of inputs into neuron at the top layer.
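As a minimal sketch of the forward computation described above, the following Python code applies an element-wise sigmoid at the hidden layers and a softmax at the top layer. The function names, the layer widths, and the weight initialization are illustrative assumptions and are not taken from the paper; biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic function 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Row-wise softmax; subtracting the row maximum improves numerical stability.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def forward(x, weights):
    # weights[l] has shape (width of layer l, width of layer l+1).
    h = x
    for W in weights[:-1]:
        h = sigmoid(h @ W)           # hidden layers use the sigmoid
    return softmax(h @ weights[-1])  # top layer uses the softmax

# Hypothetical widths, chosen only for illustration.
rng = np.random.default_rng(0)
sizes = [784, 32, 16, 10]
weights = [rng.normal(0.0, 0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
x = rng.random((5, sizes[0]))        # five random "images"
y = forward(x, weights)              # shape (5, 10), each row sums to one
```

Each row of `y` can be read as a probability distribution over the output classes.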
The network is trained to learn the labeled handwritten digits on MNIST Lecun et al. (1998) with training handwritten digits (from 0 to 9), and after learning, the network is tested on a separate set of examples. For this supervised learning, we use the cross-entropy defined by as our cost function to minimize, where is the total number of classes, is the actual softmax output at the top layer, and is the target label (one-hot representation). The test error characterizes the generalization ability of the learned model. We measure the test error as the classification error defined by , where defines an indicator function of an event, which takes the value one if the event is true and zero otherwise. In simulations, we use , thus the deep network architecture is specified by ---. The cross-entropy is minimized by backpropagation of the error from the top layer to the bottom layer.
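The two quantities above can be sketched as follows. This is an illustration under stated assumptions: the averaging over examples, the small constant `eps` inside the logarithm, and the function names are choices made here, not details confirmed by the text.

```python
import numpy as np

def cross_entropy(y_hat, y_onehot, eps=1e-12):
    # Mean over examples of -sum_k target_k * log(output_k).
    # Averaging over examples and the eps guard are assumptions of this sketch.
    return -np.mean(np.sum(y_onehot * np.log(y_hat + eps), axis=1))

def classification_error(y_hat, y_onehot):
    # Fraction of examples whose most probable class differs from the target label.
    return np.mean(np.argmax(y_hat, axis=1) != np.argmax(y_onehot, axis=1))

# Toy softmax outputs for three examples and three classes.
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
y_true = np.eye(3)[[0, 1, 2]]   # one-hot targets for classes 0, 1, 2
ce = cross_entropy(y_hat, y_true)
err = classification_error(y_hat, y_true)
```

In this toy case the third example is misclassified, so the classification error is one third.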
To visualize the redundancy behavior of active paths, we first introduce a dilution probability , where specifies the weight population between two consecutive layers ( and -layer). In testing simulations, each weight in a weight population was removed independently with . By varying with zero , one can see how the test error changes, as shown in Fig. 1 (a). Interestingly, the test performance is quite robust against increasing up to some value (e.g., ). Above this critical value, the test performance deteriorates rapidly with the dilution probability. This property also holds for diluting other weight populations by varying or . By varying with zero , apart from the same qualitative behavior, we also observe large fluctuations in test error, which deteriorates at a smaller compared to the critical value of or . This indicates that the final layer is more sensitive to network structure perturbation, likely because of its reduced dimensionality in weights compared to that of early layers. Furthermore, we change the width of the second hidden layer, and observe that the typical redundancy behavior does not change significantly in the redundant regime, while the non-redundant regime shows a strong finite-width effect (Fig. 2).
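The dilution procedure used in the testing simulations can be sketched as below: each weight in a chosen weight population is removed (set to zero) independently with a given probability, and the test error would then be recomputed with the diluted matrix. The name `dilute`, the argument `q`, and the matrix size are hypothetical.

```python
import numpy as np

def dilute(W, q, rng):
    # Remove each weight independently with probability q.
    # rng.random draws from [0, 1), so q = 0 keeps everything and q = 1 removes everything.
    keep = rng.random(W.shape) >= q
    return W * keep

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(200, 200))  # stand-in for one trained weight population
W_diluted = dilute(W, 0.3, rng)
removed_fraction = np.mean(W_diluted == 0.0)  # close to 0.3 for a large matrix
```

Sweeping `q` for one weight population while keeping the other populations intact would reproduce the kind of scan described in the text.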
ii.2 A random active path model
In this section, we propose a random active path (RAP) model to understand the redundancy of active paths. An active path refers to a path from one input unit to one output unit with the property that each connection on the path is present. We observe that the distribution of products of real-valued weights between consecutive layers (on an arbitrary path) is concentrated around zero, and that the product of weights on a chosen path takes a negative or a positive value with nearly equal probability (see Fig. 3). Therefore, for layers, we construct weight populations whose elements take ±1 with equal probability. The weight population size is fixed to be , which tends to infinity in the thermodynamic limit. For simplicity, we assume that the sign of weights is important to constrain the contribution of each path to the global loss of the network Courbariaux et al. (2015). From each population, we randomly pick one weight (say ) to form an interaction, being . Each interaction represents a random active path that constrains the weight configuration in a deep network. The above process is repeated for times, and thus we have the following Hamiltonian for all possible weight configurations in a deep network setting:
where means all the weights involved in the active path . In fact, Eq. (2) is a disordered -weight interaction model Gardner (1985); Franz et al. (2001); Montanari, A. and Ricci-Tersenghi, F. (2003), where here the quenched disorder stems from the random construction process. Clearly, if , the ground state energy is obtained. However, the weight configuration to have a low energy state is not unique in general.
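The construction of random active paths can be illustrated with a short simulation. This is only a sketch under assumptions: the energy is taken here to be minus the sum over paths of the product of the weights on each path, which is one form consistent with the description but is not confirmed by the text; the names `build_paths` and `energy` and the sizes are hypothetical.

```python
import numpy as np

def build_paths(num_pop, N, M, rng):
    # paths[a, l] is the index of the weight picked from population l for path a;
    # each pick is uniform and independent.
    return rng.integers(0, N, size=(M, num_pop))

def energy(w, paths):
    # w has shape (num_pop, N) with entries +1 or -1.
    # Assumed form: minus the sum over paths of the product of weights on the path.
    picked = w[np.arange(w.shape[0]), paths]   # shape (M, num_pop)
    return -np.sum(np.prod(picked, axis=1))

rng = np.random.default_rng(1)
num_pop, N, M = 3, 50, 200                     # illustrative sizes only
paths = build_paths(num_pop, N, M, rng)
w = rng.choice([-1, 1], size=(num_pop, N))     # a random weight configuration
e_random = energy(w, paths)
e_all_plus = energy(np.ones((num_pop, N), dtype=int), paths)
```

Under the assumed sign convention, the all-plus configuration satisfies every path constraint, so its energy equals minus the number of paths. Flipping every weight in any two populations leaves all products unchanged, which illustrates why low-energy configurations are not unique.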
It has been shown that the zero-one loss for a deep non-linear network can be transformed into a -weight interaction spherical model under a few unrealistic approximations Choromanska et al. (2014), for example that the activation of any path is independent of the input and that all paths have independent inputs. Therefore, the energy landscape of the -weight model can be equivalent to the loss surface structure under these coarse-grained settings. Although it is challenging to prove that these assumptions are reasonable in real deep networks, the -weight interaction model still provides a good starting point for a qualitative understanding of complicated properties of deep neural networks.
In the current setting, increasing leads to a high degree of each weight, i.e., each weight is involved in a large number of active paths with high probability. In the thermodynamic limit, one can compute analytically the degree distribution. Here, we consider (i.e., ) for simplicity. It is straightforward to generalize the following formulas to larger (deeper network). We also assume each layer has equal width . First, the probability for an active path is given by . In addition, we have the identity , where is the total number of weights in Eq. (2), and it is proportional to with the proportion coefficient ( as ), and , where indicates the degree of weight . Thus , where . Now the probability of degree for an arbitrary weight in Eq. (2) can be written explicitly as follows,
where we have used in the large limit. The degree profile is exactly a Poisson distribution with mean degree, as also verified by a comparison between the theory and one instance of the model (Fig. 4 (a)).
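The Poisson degree profile can be checked numerically on one sampled instance. In the sketch below, every path picks one weight uniformly from each of three equally sized populations, so the degree of a weight is binomial and approaches a Poisson law for large populations. The population size `N`, the number of paths `M`, and the resulting mean degree `M / N` are assumptions of this sketch and need not match the parametrization used in the paper.

```python
import numpy as np
from math import exp, factorial

def degree_histogram(paths, N, kmax):
    # Degree of a weight = number of paths in which it participates.
    degrees = np.concatenate(
        [np.bincount(paths[:, l], minlength=N) for l in range(paths.shape[1])]
    )
    # Empirical probability of each degree 0..kmax (larger degrees are truncated).
    hist = np.bincount(degrees, minlength=kmax + 1)[: kmax + 1] / degrees.size
    return degrees, hist

def poisson_pmf(k, mean):
    return exp(-mean) * mean ** k / factorial(k)

rng = np.random.default_rng(2)
N, M, kmax = 20000, 60000, 12                  # illustrative sizes only
paths = rng.integers(0, N, size=(M, 3))        # three weight populations
degrees, hist = degree_histogram(paths, N, kmax)
mean_degree = M / N
theory = np.array([poisson_pmf(k, mean_degree) for k in range(kmax + 1)])
max_gap = np.max(np.abs(hist - theory))        # small for large N
```

Comparing `hist` with `theory` is the numerical analogue of the comparison between the theory and one instance of the model.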
Similarly, one can prove the relationship between the dilution probability (e.g., , , or ) and the mean degree , from the fact that . It then follows that the dropconnect probability . Thus for a fixed width, the is proportional to . Tuning amounts to varying in the model. Note that for simplicity, we assume here a homogeneous dilution probability for all layers.
The mean-field model defined in Eq. (2) can be solved by the cavity method Mézard and Parisi (2001); Huang (2017). The technical details are summarized in the methods (Appendix A). As observed in Fig. 4 (b), given the inverse temperature , the RAP model has a paramagnetic phase if is small, and the energy decreases rapidly as increases, i.e., when the network becomes denser with decreasing dilution probability. At a critical degree, the entropy starts to become negative (the so-called entropy crisis), implying that the paramagnetic phase is no longer thermodynamically dominant, and a discontinuous spin phase transition emerges Gardner (1985); Franz et al. (2001); Montanari, A. and Ricci-Tersenghi, F. (2003). However, we do not perform a one-step replica symmetry-breaking (RSB) Gardner (1985); Montanari, A. and Ricci-Tersenghi, F. (2003) study here; instead, we adopt a frozen ansatz, i.e., after the critical , the energy is independent of , and fixed according to the zero-entropy condition Nakajima and Hukushima (2009). In the current model setting, at . The result is shown in Fig. 4 (b). Even if a more advanced approximation (e.g., RSB) is considered, the energy level is expected to decrease much more slowly in the high degree regime Nakajima and Hukushima (2009). This behavior mimics the redundancy observed in practical deep network simulations (Fig. 1). Note that this qualitative property does not change when different values of are considered, the only difference being that the energy levels differ at different inverse temperatures.
ii.3 Dropconnect combined with random synaptic feedback improves the test performance
The RAP model provides a qualitative understanding of the dropconnect probability and its relationship with the redundancy phenomenon. In modern deep network training, one popular regularization technique is dropconnect, in which, during the forward pass of standard backpropagation Rumelhart David E. et al. (1986), a binary mask is applied to the feedforward connections, implementing a form of Bayesian model selection Huang (2017). Each entry of the binary mask is set to one with dropconnect probability , and zero otherwise. The redundancy behavior implies that . Therefore, we can study the effects of different values of as the dropconnect probability on the regularization ability of deep networks. Note that small values of maintain the plateau of test error in Fig. 1.
To avoid the computation cost of memorizing the dropconnect mask applied in the feedforward pass Wan et al. (2013), we use a random feedback weight matrix to propagate the error Lillicrap Timothy P. et al. (2016). Elements of the random feedback weight matrix are independently sampled from a uniform distribution with support within , where denotes the bound of the weight strength. We first test the effects of different bounds on the learning performance by using pure feedback alignment (without dropconnect regularization). The result is shown in Fig. 5, indicating achieves the best performance among all values tested.
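One way to combine a dropconnect forward pass with random feedback in the backward pass is sketched below for a network with a single hidden layer. This is a sketch under assumptions and not the authors' implementation: the binary mask is used only in the forward pass, the hidden-layer error is propagated through a fixed random matrix `B` instead of the transposed forward weights, and the weight updates are left unmasked. The names `random_feedback` and `train_step`, the bound `b`, the keep probability `p_keep`, and the learning rate are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def random_feedback(shape, b, rng):
    # Fixed feedback matrix with entries drawn uniformly from [-b, b).
    return rng.uniform(-b, b, size=shape)

def train_step(x, y, W1, W2, B, p_keep, lr, rng):
    # Forward pass: each connection is kept (mask entry one) with probability p_keep.
    M1 = rng.random(W1.shape) < p_keep
    M2 = rng.random(W2.shape) < p_keep
    h = sigmoid(x @ (W1 * M1))
    y_hat = softmax(h @ (W2 * M2))
    loss = -np.mean(np.sum(y * np.log(y_hat + 1e-12), axis=1))
    # Backward pass: output error of softmax with cross-entropy, averaged over the batch.
    delta_out = (y_hat - y) / x.shape[0]
    # The error reaches the hidden layer through B, not through the masked W2.
    delta_hid = (delta_out @ B.T) * h * (1.0 - h)
    W2_new = W2 - lr * (h.T @ delta_out)   # unmasked update (assumption)
    W1_new = W1 - lr * (x.T @ delta_hid)
    return W1_new, W2_new, loss

rng = np.random.default_rng(3)
x = rng.random((20, 8))
y = np.eye(3)[rng.integers(0, 3, size=20)]    # toy one-hot labels
W1 = rng.normal(0.0, 0.1, size=(8, 6))
W2 = rng.normal(0.0, 0.1, size=(6, 3))
B = random_feedback(W2.shape, 0.5, rng)       # same shape as W2, fixed during training
W1, W2, loss = train_step(x, y, W1, W2, B, p_keep=0.8, lr=0.1, rng=rng)
```

Setting `p_keep` to one and passing a copy of `W2` as `B` recovers an ordinary backpropagation step for this small network, which is a convenient sanity check for the sketch.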
As shown in Fig. 6 (a), small values of result in high test error, consistent with the non-redundant regime where the network function is very sensitive to architecture perturbation (i.e., deletion of some connections between layers). In contrast, high values of maintain a good test performance. This is because, as dropconnect is carried out during training, a good model ensemble (robust in performance) is sampled randomly, and the network learning is still able to use the information transmitted from input layer to output layer. As decreases down to some small value, the performance gets worse immediately (Fig. 6 (b)), which is expected from redundancy behavior of active paths in a deep network. Remarkably, this hybrid regularization algorithm improves substantially over the standard backpropagation in the deep network’s generalization performance (Fig. 6 (b)). Note that, to make this hybrid regularization algorithm more effective, a small learning rate at the later stage of the training should be used.
In this work, we study the redundancy behavior of network connections on the test performance of deep networks. In modern deep learning applications, a fully-connected multi-layered neural network is commonly used (or as a part of the network). When a connection between layers is deleted with a certain probability, the performance of a trained network is not affected unless this deletion probability increases up to some value, beyond which the performance deteriorates rapidly. This motivates us to propose a phenomenological model to understand this empirical behavior. We assume that the weight products of neighboring layers give constraints on the learning task the network tries to implement. As observed in typical deep network training, these weight products have equal probability to be positive or negative; thus we propose a random active path model, where each path from input to output is randomly constructed from pools of weights between consecutive layers. Interestingly, when the number of selected paths increases, the mean degree of each involved weight in a graphical model representation grows as well, which finally leads to the disappearance of the paramagnetic phase in the model. The energy in the paramagnetic phase grows as the graphical model becomes diluted. This thermodynamic behavior mimics the redundancy behavior observed in real deep network training. Given the redundancy property, the deep network has the chance of repairing and modifying subnetworks without affecting the overall function. It is worth noting that the redundancy property also holds in a deep network where connections are binary Courbariaux et al. (2015).
In addition, by turning on a certain fraction of connections during the forward pass of supervised learning, one can reduce the over-fitting effects of training. This amounts to implementing a model selection, and our phenomenological analysis of a random active path model and the redundancy phenomenon offers theoretical insights into this kind of model selection. When the dropconnect probability is very small, the performance gets worse rapidly as the probability decreases, which is consistent with a rapid increase of the model energy when the mean degree of each weight decreases down to some small value. This suggests that one should choose a relatively large dropconnect probability to sample a good model ensemble in terms of its robustness in learning performance. The RAP model explains the redundancy behavior qualitatively. Nevertheless, it still needs to be improved by introducing correlations between paths sharing the same input, so that it can yield quantitative predictions for deep network training, which deserves further systematic study.
We also propose a combination of dropconnect with random feedback alignment, to avoid the expensive computation cost of memorizing the dropconnect mask when backpropagating the error. The random feedback matrix breaks the symmetry of connections, with a mechanism closer to biological observations, in the sense that in cortical computation, different weight matrices are used for bottom-up and top-down information processing, respectively Harris Kenneth D. and Mrsic-Flogel Thomas D. (2013); Harris Kenneth D and Shepherd Gordon M G (2015). We verify that this combination is effective in classifying handwritten digits on the benchmark dataset MNIST. This hybrid algorithm indeed reduces the over-fitting effects at some optimal values of the dropconnect probability, compared to standard backpropagation (Fig. 6 (b)). This efficient combination thus offers the possibility to test its learning capability in much more complicated network architectures and on more complex datasets.
Appendix A Mean field method to solve the RAP model
We present self-consistent mean field equations in this appendix to analyze the statistical properties of the RAP model (Eq. (2)). These equations can be derived using the standard cavity method Mézard and Montanari (2009), and have been similarly studied in a recent work Huang (2017) in a different context. These equations are given by (assuming ):
where denotes the member of interaction except , and denotes the interaction set is involved in with removed. is interpreted as the message passing from the weight to the interaction it participates in, while is interpreted as the message passing from the interaction to its member . We call this iteration equation the belief propagation, which serves as the message passing algorithm whose fixed point corresponds to the stationary point of the following Bethe free energy Mézard and Parisi (2001)
where is the normalization constant of the model probability . The free energy contribution of one weight is given by and the free energy contribution of one interaction is given by . We define the function . With the free energy, one can estimate both the entropy and the energy of the model, which is given by:
The entropy can be obtained by using the standard thermodynamic formula as when .
- Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, Cambridge, MA, 2016).
- Choromanska et al. (2014) A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun, ArXiv e-prints 1412.0233 (2014).
- Chaudhari and Soatto (2015) P. Chaudhari and S. Soatto, ArXiv e-prints 1511.06485 (2015).
- Kawaguchi (2016) K. Kawaguchi, ArXiv e-prints 1605.07110 (2016).
- Soudry and Carmon (2016) D. Soudry and Y. Carmon, ArXiv e-prints 1605.08361 (2016).
- Vardan et al. (2016) V. Papyan, Y. Romano, and M. Elad, ArXiv e-prints 1607.08194 (2016).
- Patel et al. (2016) A. B. Patel, T. Nguyen, and R. G. Baraniuk, ArXiv e-prints 1612.01936 (2016).
- Wan et al. (2013) L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, in Proceedings of the 30th International Conference on Machine Learning (ICML-13), edited by S. Dasgupta and D. McAllester (JMLR Workshop and Conference Proceedings, 2013), vol. 28, pp. 1058–1066.
- Puchalla et al. (2005) J. L. Puchalla, E. Schneidman, R. A. Harris, and M. J. Berry, Neuron 46, 493 (2005).
- Narayanan et al. (2005) N. S. Narayanan, E. Y. Kimchi, and M. Laubach, Journal of Neuroscience 25, 4207 (2005).
- Schneidman et al. (2011) E. Schneidman, J. L. Puchalla, R. Segev, R. A. Harris, W. Bialek, and M. J. Berry, Journal of Neuroscience 31, 15732 (2011).
- Lillicrap Timothy P. et al. (2016) T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, Nature Communications 7, 13276 (2016).
- Harris Kenneth D. and Mrsic-Flogel Thomas D. (2013) K. D. Harris and T. D. Mrsic-Flogel, Nature 503, 51 (2013).
- Harris Kenneth D and Shepherd Gordon M G (2015) K. D. Harris and G. M. G. Shepherd, Nature Neuroscience 18, 170 (2015).
- Gardner (1985) E. Gardner, Nuclear Physics B 257, 747 (1985).
- Kirkpatrick and Thirumalai (1987) T. R. Kirkpatrick and D. Thirumalai, Phys. Rev. B 36, 5388 (1987).
- Franz et al. (2001) S. Franz, M. Mézard, F. Ricci-Tersenghi, M. Weigt, and R. Zecchina, EPL (Europhysics Letters) 55, 465 (2001).
- Montanari, A. and Ricci-Tersenghi, F. (2003) A. Montanari and F. Ricci-Tersenghi, Eur. Phys. J. B 33, 339 (2003).
- Lecun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
- Courbariaux et al. (2015) M. Courbariaux, Y. Bengio, and J.-P. David, in Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Curran Associates, Inc., 2015), pp. 3105–3113.
- Mézard and Parisi (2001) M. Mézard and G. Parisi, Eur. Phys. J. B 20, 217 (2001).
- Huang (2017) H. Huang, Journal of Statistical Mechanics: Theory and Experiment 2017, 033501 (2017).
- Nakajima and Hukushima (2009) T. Nakajima and K. Hukushima, Phys. Rev. E 80, 011103 (2009).
- Rumelhart David E. et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature 323, 533 (1986).
- Huang (2017) H. Huang, ArXiv e-prints 1703.07943 (2017).
- Mézard and Montanari (2009) M. Mézard and A. Montanari, Information, Physics, and Computation (Oxford University Press, Oxford, 2009).