The paradigm of deep learning has revolutionized our ability to perform challenging classification tasks in a variety of domains such as computer vision and speech. However, so far, a complete theoretical understanding of deep learning is lacking. Training deep-nets is a highly non-convex problem involving millions of variables, and an exponential number of fixed points. Viewed naively, proving any guarantees appears to be intractable. In this paper, on the contrary, we show that guaranteed learning of a subset of parameters is possible under mild conditions.
We propose a novel learning algorithm based on the method-of-moments. The notion of using moments for learning distributions dates back to Pearson (Pearson, 1894); see (Anandkumar et al., 2014) for a survey. The basic idea is to develop efficient algorithms for factorizing moment matrices and tensors. When the underlying factors are sparse, $\ell_1$-based convex optimization techniques have been proposed before and employed for learning dictionaries (Spielman et al., 2012), topic models, and linear latent Bayesian networks (Anandkumar et al., 2012).
In this paper, we employ the $\ell_1$-based optimization method to learn deep-nets with sparse connectivity. However, so far, this method has theoretical guarantees only for linear models. We develop novel techniques to prove its correctness even for non-linear models. A key tool we use is Stein's lemma from statistics (Stein, 1986). Taken together, we show how to effectively leverage algorithms based on the method-of-moments to train deep non-linear networks.
1.1 Summary of Results
We present a theoretical framework for analyzing when neural networks can be learnt efficiently. We demonstrate how the method-of-moments can yield useful information about the weights in a neural network, and also in some cases, even recover them exactly. In practice, the output of our method can be used for dimensionality reduction for back propagation, resulting in reduced computation.
We show that in a feedforward neural network, the relevant moment matrix to consider is the cross-moment matrix between the label and the score function of the input data (i.e., the derivative of the log of the density function). The classical result of Stein (Stein, 1986) states that this matrix yields the expected derivative of the label (as a function of the input). Stein's result is essentially obtained through integration by parts (Nourdin et al., 2013).
By employing Stein's lemma, we show that the row span of the moment matrix between the label and the input score function corresponds to the span of the weight vectors in the first layer, under natural non-degeneracy conditions. Thus, the singular value decomposition of this moment matrix can be used as a low-rank approximation of the first-layer weight matrix during back propagation, when the number of neurons is less than the input dimensionality. Note that since the first layer typically has the largest number of parameters (if a convolutional structure is not assumed), having a low-rank approximation results in significant improvement in performance and computational requirements.
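As a concrete illustration, the following sketch simulates this pipeline on a toy one-hidden-layer model with whitened Gaussian input (for which the score function is simply $-x$). All names and dimensions here are illustrative, not the paper's experimental setup: the top singular vectors of the empirical cross-moment matrix recover the row span of the first-layer weights and serve as a low-rank projector.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 5, 20000             # input dim, hidden neurons, samples

# Illustrative first-layer weights and a toy one-hidden-layer generator.
A1 = rng.normal(size=(k, d)) / np.sqrt(d)
x = rng.normal(size=(n, d))        # whitened Gaussian input
y = np.tanh(x @ A1.T)              # stand-in labels from the hidden layer

# Score of the standard Gaussian is S(x) = -x; form M = E[y S(x)^T].
M = y.T @ (-x) / n                 # empirical k x d cross-moment matrix

# The top right singular vectors of M span (approximately) the row space
# of A1, giving a rank-k projector for dimensionality reduction.
_, _, Vt = np.linalg.svd(M, full_matrices=False)
P = Vt[:k]                         # k x d projection onto the learned span
x_low = x @ P.T                    # reduced features to feed back propagation
```

The sign convention of the score does not matter here, since it does not change the span.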
We then show that we can exactly recover the weight matrix of the first layer from the moment matrix when the weights are sparse. It has been argued that sparse connectivity is a natural constraint which can lead to improved performance in practice (Thom and Palm, 2013). We show that the weights can be correctly recovered using an efficient $\ell_1$-based optimization approach. Such approaches have previously been employed for linear models such as dictionary learning (Spielman et al., 2012) and topic modeling (Anandkumar et al., 2012). Here, we establish that the method is also successful in learning non-linear networks, by appealing to Stein's lemma.
Thus, we show that the cross-moment matrix between the label and the score function of the input contains useful information for training neural networks. This result has an intriguing connection with (Alain and Bengio, 2012), where it is shown that a denoising auto-encoder approximately learns the score function of the input. Our analysis here provides a theoretical explanation of why pre-training can lead to improved performance during back propagation: the interaction between the score function (learnt during pre-training) and the label during back propagation correctly identifies the span of the weight vectors, and thus leads to improved performance.
The use of score functions for improved classification performance is popular under the framework of Fisher kernels (Jaakkola et al., 1999). However, in (Jaakkola et al., 1999), the Fisher kernel is defined as the derivative with respect to some model parameter, while here we consider the derivative with respect to the input and refer to it as the score function. Note that if the Fisher kernel is taken with respect to a location parameter, the two notions are equivalent. Here, we show that considering the moment between the label and the score function of the input can lead to guaranteed learning and improved classification.
Note that there are various efficient methods for computing the score function (in addition to the auto-encoder). For instance, Sasaki et al. (2014) point out that the score function can be estimated efficiently through non-parametric methods without the need to estimate the density function. In fact, the solution is closed form, and the hyper-parameters (such as the kernel bandwidth and the regularization parameter) can be tuned easily through cross validation. There are also a number of score matching algorithms, where the goal is to find a good fit in terms of the score function, e.g., (Hyvärinen, 2005; Swersky et al., 2011). We can employ them to obtain accurate estimates of the score function.
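As a minimal illustration of a plug-in estimator (a crude parametric stand-in for the nonparametric score estimators cited above, not an implementation from those works), one can fit a Gaussian to the data and use its closed-form score; `gaussian_score` is a hypothetical helper name.

```python
import numpy as np

def gaussian_score(X):
    """Plug-in score estimate: fit a Gaussian N(mu, Sigma) to the data and
    return S(x) = grad_x log p(x) = -Sigma^{-1} (x - mu) at each sample."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    return -np.linalg.solve(Sigma, (X - mu).T).T

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))    # approximately standard Gaussian data
S = gaussian_score(X)              # for N(0, I) the true score is -x
```

For strongly non-Gaussian data, the nonparametric and score-matching estimators cited above should be used instead.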
Since we employ a method-of-moments approach, we assume that the label is generated by a feedforward neural network, to which the input data is fed. In addition, we make mild non-degeneracy assumptions on the weights and the derivatives of the activation functions. Such assumptions make the learning problem tractable, whereas the general learning problem is NP-hard. We expect that the output of our moment-based approach can provide effective initializers for the back propagation procedure.
1.2 Related Work
In this paper, we show that the method-of-moments can yield low-rank approximations for the weights in the first layer. Empirically, low-rank approximations of the weight matrices have been employed successfully to improve performance and reduce computation (Davis and Arel, 2013). Moreover, the notion of using moment matrices for dimension reduction is popular in statistics, where the dimension-reducing subspace is termed the central subspace (Cook, 1998).
We present an $\ell_1$-based convex optimization technique to learn the weights in the first layer, assuming they are sparse. Note that this is different from other convex approaches for learning feedforward neural networks. For instance, Bengio et al. (2005) show via a boosting approach that learning neural networks is a convex optimization problem as long as the number of hidden units can be selected by the algorithm. However, typically, the neural network architecture is fixed, and in that case the optimization is non-convex.
Our work is the first to show guaranteed learning of a feedforward neural network incorporating both the label and the input. Arora et al. (2013) considered the auto-encoder setting, where learning is unsupervised, and showed how the weights can be learnt correctly under a set of conditions. They assume that the hidden layers can be decoded correctly using a “Hebbian” style rule, and that all hidden units have only binary states. We present a different approach for learning, using the moments between the label and the score function of the input.
2 Moments of a Neural Network
2.1 Feedforward network with one hidden layer
We first consider a feedforward network with one hidden layer; subsequently, we discuss how far this can be extended. Let $y$ be the label vector generated from the neural network and $x \in \mathbb{R}^d$ be the feature vector. We assume $x$ has a well-behaved continuous probability distribution $p(x)$ such that the score function exists. The network is depicted in Figure 1. Let
$$\mathbb{E}[y \mid x] = f\big(A_2\, \sigma(A_1 x)\big), \qquad (1)$$
where $A_1$ and $A_2$ are the first- and second-layer weight matrices, $\sigma(\cdot)$ is an elementwise activation function, and $f(\cdot)$ is the output nonlinearity.
This setup is applicable to both multiclass and multilabel settings. For multiclass classification, $f$ is the softmax function, and for multilabel classification, $f$ is an elementwise sigmoid function. Recall that multilabel classification refers to the case where each instance can have more than one (binary) label (Bishop et al., 2006; Tsoumakas and Katakis, 2007).
2.2 Method-of-moments: label-score function correlation matrix
We hope to obtain information about the weight matrix using moments of the label and the input. The question is when this is possible and with what guarantees. To study the moments, let us start from a simple problem. For a linear network $\mathbb{E}[y \mid x] = A x$ and whitened Gaussian input $x \sim \mathcal{N}(0, I)$, we have $\mathbb{E}[y\, x^\top] = A$. In order to learn $A$, we can form the label-score function correlation matrix as
$$M := \mathbb{E}\big[y\, S(x)^\top\big].$$
Therefore, if the row span of $A$ is low dimensional, we can project $x$ into that span and perform classification in this lower dimension.
Stein's lemma for a Gaussian random vector (Stein, 1986) states that for a function $g$ satisfying some mild regularity conditions we have
$$\mathbb{E}\big[g(x)\, x^\top\big] = \mathbb{E}\big[\nabla_x g(x)\big],$$
where $\nabla_x g(x)$ denotes the Jacobian of $g$.
A more difficult problem is the generalized linear model (GLM) $\mathbb{E}[y \mid x] = f(A x)$ of a (whitened) Gaussian $x$, for any nonlinear activation function $f$ that satisfies some mild regularity conditions. Using Stein's lemma we have
$$\mathbb{E}\big[y\, x^\top\big] = \mathbb{E}\big[\nabla f(A x)\big]\, A = B\, A,$$
where $B := \mathbb{E}[\nabla f(A x)]$. Therefore, assuming $B$ has full column rank, we obtain the row span of $A$. For a Gaussian (and elliptical) random vector $x$, the projection $A x$ provides the sufficient statistic with no information loss. Thus, we can project the input into this span and obtain dimensionality reduction.
The Gaussian assumption is restrictive. The more challenging problem is when the random vector $x$ has a general probability distribution and the network has hidden layers. How can we deal with such an instance? Below we provide a method to learn such models.
Let $x \in \mathbb{R}^d$ be a random vector with probability density function $p(x)$, and let $y$ be the output label corresponding to the network described in Equation (1). For a general probability distribution, we use the score function of the random vector $x$, which provides us with the sufficient statistics for learning.
Definition: Score function
The score of $x$ with probability density function $p(x)$ is the random vector $S(x) := \nabla_x \log p(x)$.
We then form the cross-moment matrix $M := \mathbb{E}[y\, S(x)^\top]$, which can be calculated in a supervised setting. Note that $S(x)$ represents the score function for the random vector $x$.
In a nonlinear neural network with feature vector $x$ and output label $y$ as in Equation (1), we have
$$M = \mathbb{E}\big[y\, S(x)^\top\big] = \mathbb{E}\big[\mathbb{E}[y \mid x]\; S(x)^\top\big] = -\,\mathbb{E}\big[\nabla_x G(x)\big] = -\,\mathbb{E}\big[\nabla f \cdot A_2\, \mathrm{diag}\big(\sigma'(A_1 x)\big)\big]\, A_1,$$
where $G(x) := \mathbb{E}[y \mid x]$ and $\nabla_x G(x)$ denotes its Jacobian. The second equality is a result of the law of total expectation. The third equality follows from Stein's lemma as in Proposition 1 below. The last equality results from the chain rule.
Proposition 1 (Stein’s lemma (Stein et al., 2004)).
Let $x \in \mathbb{R}^d$ be a random vector with joint density function $p(x)$. Suppose the score function $S(x) = \nabla_x \log p(x)$ exists. Consider any continuously differentiable function $G(x)$ such that all the entries of $G(x)\, p(x)$ go to zero on the boundaries of the support of $p(x)$. Then, we have
$$\mathbb{E}\big[G(x)\, S(x)^\top\big] = -\,\mathbb{E}\big[\nabla_x G(x)\big].$$
Note that it is also assumed that the above expectations exist (in the sense that the corresponding integrals exist).
The proof follows from integration by parts; the result for scalar inputs and scalar-valued functions is provided in (Stein et al., 2004).
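For intuition, the scalar version of the lemma is a one-line integration by parts, using $S(x) = (\log p(x))' = p'(x)/p(x)$ and the boundary decay of $g(x)\,p(x)$:

```latex
\mathbb{E}[g(x)\, S(x)]
  = \int g(x)\,\frac{p'(x)}{p(x)}\, p(x)\, dx
  = \int g(x)\, p'(x)\, dx
  = \big[g(x)\, p(x)\big]_{-\infty}^{\infty} - \int g'(x)\, p(x)\, dx
  = -\,\mathbb{E}[g'(x)].
```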
Remark 1 (Connection with pre-training).
The above result provides us with a nice closed form. If $\mathbb{E}\big[\nabla f \cdot A_2\, \mathrm{diag}(\sigma'(A_1 x))\big]$ has full column rank, we obtain the row space of $A_1$. In deep networks, the auto-encoder has been shown to approximately learn the score function of the input (Alain and Bengio, 2012), and pre-training has been shown to result in better performance. Here, we use the correlation matrix between the labels and the score function to obtain the span of the weights; the auto-encoder appears to do the same by estimating the score function. Therefore, our method provides a theoretical explanation of why pre-training is helpful.
For a whitened Gaussian (and elliptical) random vector, projecting the input onto the row space of $A_1$ is a sufficient statistic. Empirically, even for non-Gaussian distributions, this has led to improvements (Sun et al., 2013; Li, 1992). The moment method presented in this paper thus provides a low-rank approximation for training neural networks.
So far, we showed that we can recover the span of $A_1$. How can we retrieve the matrix $A_1$ itself? Without further assumptions, this problem is not identifiable. A reasonable assumption is that $A_1$ is sparse. In this case, we can pose this problem as learning $A_1$ given its row span. This problem arises in a number of settings such as learning a sparse dictionary or topic modeling. Next, using the idea presented in (Spielman et al., 2012), we discuss how this can be done.
3 Learning the Weight Matrix
The first natural identifiability requirement on $A_1$ is that it has full row rank. Spielman et al. (2012) show that for Bernoulli-Gaussian entries, under relative scaling of the parameters, the sparsest vectors in the row span of $A_1$ are the rows of $A_1$ themselves. Any vector in this space is generated by a linear combination of rows of $A_1$; the intuition is random sparsity, where a combination of different sparse rows cannot produce a sparse row. Under this identifiability condition, we need to solve the optimization problem
$$\min_w \big\|w^\top M\big\|_0 \quad \text{subject to} \quad w \neq 0.$$
In order to come up with a tractable update, Spielman et al. (2012) use the convex $\ell_1$ relaxation of the $\ell_0$ norm, and relax the nonzero constraint on $w$ by constraining it to lie in an affine hyperplane $r^\top w = 1$. Therefore, the algorithm involves solving the following linear programming problem:
$$\min_w \big\|w^\top M\big\|_1 \quad \text{subject to} \quad r^\top w = 1.$$
It is proved that under some additional conditions, when $r$ is chosen as a column or a sum of two columns of $M$, the linear program produces rows of $A_1$ with high probability (Spielman et al., 2012). We explain these conditions in our context in Section 3.1.
By normalizing the rows of the output, we obtain a row-normalized version of $A_1$. The algorithm is shown in Algorithm 2. Note that $e_i$ refers to the $i$-th basis vector.
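To make the relaxation concrete, here is a small self-contained sketch on a toy instance with disjoint row supports, chosen so that recovery is exact and easy to check; the cited guarantees cover the harder Bernoulli-Gaussian case. The function name `sparsest_in_rowspan` is illustrative, not code from the cited works.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: rows of A are sparse with disjoint supports, so the sparsest
# vectors in the row span of M are exactly the rows of A (up to scaling).
A = np.zeros((3, 12))
A[0, :3] = [1.0, -2.0, 1.5]
A[1, 4:7] = [2.0, 1.0, -1.0]
A[2, 8:11] = [1.0, 1.0, 2.0]
R = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
M = R @ A          # M shares A's row span but its rows are dense mixtures

def sparsest_in_rowspan(M, r):
    """Solve min_w ||w^T M||_1  s.t.  r^T w = 1 as an LP with slack t:
    minimize sum(t)  subject to  -t <= M^T w <= t  and  r^T w = 1."""
    k, d = M.shape
    c = np.concatenate([np.zeros(k), np.ones(d)])
    A_ub = np.block([[M.T, -np.eye(d)], [-M.T, -np.eye(d)]])
    b_ub = np.zeros(2 * d)
    A_eq = np.concatenate([r, np.zeros(d)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] * (k + d))
    return res.x[:k] @ M           # the recovered sparse row of the span

# Choosing r as a column of M recovers a (scaled) row of A.
row = sparsest_in_rowspan(M, M[:, 0])
```

Running this over all columns of $M$ and normalizing the outputs recovers all rows of $A$ in this toy case.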
We finally note that there exist more sophisticated analyses and algorithms for the problem of finding the sparsest vectors in a subspace. Anandkumar et al. (2012) provide a deterministic-sparsity version of the result. Barak et al. (2012) require more computation, even quasi-polynomial time, but can solve the problem in denser settings.
3.1 Guarantees for learning first layer weights
We make the following assumptions to ensure that the weight matrix $A_1$ is learnt correctly.
A.1 Elementwise first layer: $\sigma(\cdot)$ is an elementwise function.
A.2 Nondegeneracy: $\mathbb{E}\big[\nabla f \cdot A_2\, \mathrm{diag}(\sigma'(A_1 x))\big]$ has full column rank.
A.3 Score function: The score function $S(x) = \nabla_x \log p(x)$ exists.
A.4 Sufficient input dimension: The input dimension $d$ is sufficiently large relative to the number of neurons $k$, i.e., $d \geq c\, k \log k$ for some positive constant $c$.
A.5 Sparse connectivity: The weight matrix $A_1$ is Bernoulli-Gaussian: each entry is nonzero with some probability $\theta$, and the nonzero entries are i.i.d. Gaussian.
A.6 Normalized weight matrix: The weight matrix $A_1$ is row-normalized.
Assumption A.1 is common in the deep network literature, since there are only elementwise activations in the intermediate layers.
Assumption A.2 is satisfied when $A_2$ is full-rank and $\nabla f$ and $\sigma'$ are non-degenerate. This is the case when the number of classes is large, i.e., at least as large as the number of neurons, as in ImageNet-scale problems. In the future, we plan to consider the setting with a small number of classes using other tools such as tensor methods. As for the non-degeneracy assumption on the derivatives, the reason is that we assume the functions are at least linear, i.e., their first-order derivatives are nonzero. This is true for the activation functions used in deep networks, such as the sigmoid function, the piecewise linear rectifier, and the softmax function at the last layer.
Note that Assumption A.4 uses an improvement (Luh and Vu, 2015) over the initial result of Spielman et al. (2012). In a deep network, $k$ is usually a few thousand while $d$ is in the millions; hence, Assumption A.4 is satisfied. Note that Luh and Vu (2015) also provide an algorithm for very sparse weight matrices which requires an even weaker condition on $d$.
Assumption A.5 requires the weight matrix to be sparse, with the expected number of nonzero elements in each column of $A_1$ bounded as in (Luh and Vu, 2015). In other words, each input coordinate is connected to a bounded number of neurons. This is a meaningful assumption in the deep-nets literature, as it has been argued that sparse connectivity is a natural constraint which can lead to improved performance in practice (Thom and Palm, 2013).
If Assumption A.6 does not hold, we will have to learn the scaling and the bias through back propagation. Nevertheless, since the row-normalized $A_1$ provides the directions, the number of parameters in back propagation is still reduced significantly: instead of learning a dense matrix, we only need to find the scalings of a sparse matrix. This results in a significant shrinkage in the number of parameters that back propagation needs to learn.
Finally we provide the results on learning the first layer weight matrix in a feedforward network with one hidden layer.
For proof, see (Spielman et al., 2012).
Remark 3 (Efficient implementation).
The $\ell_1$ optimization is efficient to implement. The algorithm involves solving a collection of $\ell_1$ minimization problems, one per choice of the constraint vector $r$. Traditionally, each $\ell_1$ minimization can be formulated as a linear programming problem with inequality constraints and one equality constraint. Since the computational complexity of such a method is often too high for large-scale problems, one can instead use approximate methods such as gradient projection (Figueiredo et al., 2007; Kim et al., 2007), iterative shrinkage-thresholding (Daubechies et al., 2004), and proximal gradient (Nesterov, 1983; Nesterov et al., 2007), which are noticeably faster (Anandkumar et al., 2012).
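As an illustration of the shrinkage-thresholding family mentioned above, the following is a minimal ISTA sketch for the generic Lasso problem $\min_x \frac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$ (a stand-in objective; adapting it to the row-recovery problem requires the reformulations discussed in the cited works).

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, b, lam, n_iter=2000):
    """Iterative shrinkage-thresholding for min_x 0.5||Ax-b||^2 + lam*||x||_1.
    Uses step size 1/L with L the Lipschitz constant of the smooth part."""
    L = np.linalg.norm(A, 2) ** 2          # spectral norm squared of A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 60]] = [1.5, -2.0, 1.0]     # 3-sparse ground truth
b = A @ x_true                             # noiseless observations
x_hat = ista(A, b, lam=0.5)
```

FISTA (Nesterov acceleration) follows the same template with a momentum term and converges noticeably faster.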
Remark 4 (Learning $A_2$).
After learning $A_1$, we can encode the first layer as $h = \sigma(A_1 x)$ and perform softmax regression on the pairs $(h, y)$ to learn $A_2$.
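A minimal sketch of this two-stage procedure, assuming $A_1$ has already been recovered (here it is simply drawn at random for illustration) and using plain gradient descent on the multinomial cross-entropy:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_regression(H, Y, lr=1.0, n_iter=2000):
    """Fit A2 by gradient descent on the multinomial cross-entropy,
    treating the encoded first layer H = sigma(A1 x) as fixed features.
    Y is one-hot (n x c)."""
    n, k = H.shape
    c = Y.shape[1]
    A2 = np.zeros((c, k))
    for _ in range(n_iter):
        P = softmax(H @ A2.T)              # n x c class probabilities
        A2 -= lr * (P - Y).T @ H / n       # cross-entropy gradient step
    return A2

# Hypothetical setup: A1 is assumed already recovered by the moment method.
rng = np.random.default_rng(0)
n, d, k, c = 2000, 20, 8, 5
A1 = rng.normal(size=(k, d)) / np.sqrt(d)
X = rng.normal(size=(n, d))
H = np.tanh(X @ A1.T)                      # encoded first layer
A2_true = rng.normal(size=(c, k))
Y = np.eye(c)[(H @ A2_true.T).argmax(axis=1)]   # labels from the network
A2_hat = softmax_regression(H, Y)
```

Since the second stage is a convex problem in $A_2$ given the fixed encoding, any off-the-shelf multinomial logistic regression solver can be substituted here.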
Remark 5 (Extension to deterministic sparsity).
The results in this work are stated in the random setting, where i.i.d. Bernoulli-Gaussian entries for the matrix $A_1$ are assumed. In general, the results can be presented in terms of deterministic conditions as in (Anandkumar et al., 2012), who show that the model is identifiable when $A_1$ has full column rank and an expansion condition holds on the bipartite graph of its nonzero entries: every set of columns must have sufficiently many neighbors, where the neighbors of a set of columns are the rows with a nonzero entry in at least one of those columns. They also show that under additional conditions, the $\ell_1$ relaxation can recover the model parameters. See (Anandkumar et al., 2012) for the details.
3.2 Extension to deep networks
So far, we have considered a network with one hidden layer. Now, consider a deep neural network with depth $\ell$. Let $y$ be the label vector and $x$ be the feature vector. We have
$$\mathbb{E}[y \mid x] = f\big(A_\ell\, \sigma(A_{\ell-1} \cdots \sigma(A_1 x) \cdots)\big),$$
where $\sigma(\cdot)$ is an elementwise function (linear or nonlinear). This setup is applicable to both multiclass and multilabel settings: for multiclass classification, $f$ is the softmax function, and for multilabel classification, $f$ is an elementwise sigmoid function. In this network, we can learn the first layer using the idea presented earlier in this section. From Stein's lemma, we have
$$\mathbb{E}\big[y\, S(x)^\top\big] = -\,\mathbb{E}\big[\nabla f \cdot A_\ell\, \mathrm{diag}(\sigma'(\cdot)) \cdots A_2\, \mathrm{diag}\big(\sigma'(A_1 x)\big)\big]\, A_1.$$
Assumption B.2 (Nondegeneracy): The matrix $\mathbb{E}\big[\nabla f \cdot A_\ell\, \mathrm{diag}(\sigma'(\cdot)) \cdots A_2\, \mathrm{diag}(\sigma'(A_1 x))\big]$ has full column rank, where $x$ denotes the input and $A_i$ denotes the weight matrix of the $i$-th layer.
The proof follows from Stein's lemma, the chain rule, and (Spielman et al., 2012).
In a deep network, the first layer contains most of the parameters (if a structure such as a convolutional network is not assumed), while the other layers consist of far fewer parameters since they have a small number of neurons. Therefore, the above result represents substantial progress in learning deep neural networks.
This is the first result to learn a subset of the parameters of a deep network in the general nonlinear case in a supervised manner. The idea presented in (Arora et al., 2013) is for the auto-encoder setting, whereas we consider the supervised setting. Also, Arora et al. (2013) assume that the hidden layers can be decoded correctly using a “Hebbian” style rule, and that all hidden units have only binary states. In addition, our method tolerates a higher sparsity level than theirs.
Remark 7 (Challenges in learning the higher layers).
In order for the matrix in Assumption B.2 to have full column rank, the intermediate layers should have square weight matrices. However, if we want to learn the middle layers, the sufficient-input-dimension requirement forces the number of rows of the weight matrices to be smaller than the number of columns in a specific manner, and therefore the matrix cannot have full column rank. In the future, we hope to investigate new methods to overcome this challenge.
We introduced a new paradigm for learning neural networks using the method-of-moments. In the literature, this method has been restricted to the unsupervised setting; here, we bridged the gap and employed it for discriminative learning. This opens up a number of interesting research directions for future investigation.

First, note that we only considered inputs with a continuous distribution for which the score function exists. The question is whether learning the parameters of a neural network is possible for discrete data. Although Stein's lemma has a form for discrete variables (in terms of finite differences) (Wei et al., 2010), it is not clear how that can be leveraged to learn the network parameters. Next, it is worth analyzing how we can go beyond the $\ell_1$ relaxation and provide guarantees in such cases.

Another interesting problem arises in the case of a small number of classes. For the non-degeneracy condition, we require the number of classes to be at least the number of neurons in the hidden layers; therefore, our method does not apply when the number of classes is smaller. In addition, in order to learn the weight matrices of the intermediate layers, we need the number of rows to be smaller than the number of columns to have sufficient input dimension, while the non-degeneracy assumption requires these weight matrices to be square. Hence, learning the weights in the intermediate layers of deep networks remains a challenging problem. Tensor methods, which have been highly successful in learning a wide range of latent variable models such as topic models, mixtures of Gaussians, and community detection (Anandkumar et al., 2014), may provide a way to overcome the last two challenges.
A. Anandkumar is supported in part by Microsoft Faculty Fellowship, NSF Career award CCF-, NSF Award CCF-, ARO YIP Award WNF--- and ONR Award N. H. Sedghi is supported by ONR Award N.
- Alain and Bengio (2012) Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data generating distribution. arXiv preprint arXiv:1211.4246, 2012.
- Anandkumar et al. (2012) A. Anandkumar, D. Hsu, A. Javanmard, and S. M. Kakade. Learning topic models and latent Bayesian networks under expansion constraints. Preprint, arXiv:1209.5350, Sept. 2012.
- Anandkumar et al. (2014) A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. J. of Machine Learning Research, 15:2773–2832, 2014.
- Arora et al. (2013) Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. arXiv preprint arXiv:1310.6343, 2013.
- Barak et al. (2012) Boaz Barak, Fernando G. S. L. Brandão, Aram W. Harrow, Jonathan Kelner, David Steurer, and Yuan Zhou. Hypercontractivity, sum-of-squares proofs, and their applications. In Proceedings of the forty-fourth annual ACM symposium on Theory of Computing, pages 307–326. ACM, 2012.
- Bengio et al. (2005) Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in neural information processing systems, pages 123–130, 2005.
- Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning, volume 1. Springer, New York, 2006.
- Cook (1998) R Dennis Cook. Principal hessian directions revisited. Journal of the American Statistical Association, 93(441):84–94, 1998.
- Daubechies et al. (2004) Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on pure and applied mathematics, 57(11):1413–1457, 2004.
- Davis and Arel (2013) Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation. arXiv preprint arXiv:1312.4461, 2013.
- Figueiredo et al. (2007) Mário AT Figueiredo, Robert D Nowak, and Stephen J Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. Selected Topics in Signal Processing, IEEE Journal of, 1(4):586–597, 2007.
- Hyvärinen (2005) Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
- Jaakkola et al. (1999) Tommi Jaakkola, David Haussler, et al. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, pages 487–493, 1999.
- Kim et al. (2007) Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An interior-point method for large-scale l 1-regularized least squares. Selected Topics in Signal Processing, IEEE Journal of, 1(4):606–617, 2007.
- Li (1992) Ker-Chau Li. On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.
- Luh and Vu (2015) Kyle Luh and Van Vu. Dictionary learning with few samples and matrix concentration. arXiv preprint arXiv:1503.08854, 2015.
- Nesterov (1983) Yurii Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
- Nesterov et al. (2007) Yurii Nesterov et al. Gradient methods for minimizing composite objective function, 2007.
- Nourdin et al. (2013) Ivan Nourdin, Giovanni Peccati, and Yvik Swan. Integration by parts and representation of information functionals. arXiv preprint arXiv:1312.5276, 2013.
- Pearson (1894) K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society, London, A., page 71, 1894.
- Sasaki et al. (2014) Hiroaki Sasaki, Aapo Hyvärinen, and Masashi Sugiyama. Clustering via mode seeking by direct estimation of the gradient of a log-density. arXiv preprint arXiv:1404.5028, 2014.
- Spielman et al. (2012) Daniel A Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Conference on Learning Theory, 2012.
- Stein (1986) Charles Stein. Approximate computation of expectations. Lecture Notes-Monograph Series, 7:i–164, 1986.
- Stein et al. (2004) Charles Stein, Persi Diaconis, Susan Holmes, Gesine Reinert, et al. Use of exchangeable pairs in the analysis of simulations. In Stein’s Method, pages 1–25. Institute of Mathematical Statistics, 2004.
- Sun et al. (2013) Yuekai Sun, Stratis Ioannidis, and Andrea Montanari. Learning mixtures of linear classifiers. arXiv preprint arXiv:1311.2547, 2013.
- Swersky et al. (2011) Kevin Swersky, David Buchman, Nando de Freitas, Benjamin M. Marlin, et al. On autoencoders and score matching for energy based models. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1201–1208, 2011.
- Thom and Palm (2013) Markus Thom and Günther Palm. Sparse activity and sparse connectivity in supervised learning. The Journal of Machine Learning Research, 14(1):1091–1143, 2013.
- Tsoumakas and Katakis (2007) Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.
- Wei et al. (2010) Zhengyuan Wei, Xinsheng Zhang, and Taifu Li. On stein identity, chernoff inequality, and orthogonal polynomials. Communications in Statistics—Theory and Methods, 39(14):2573–2593, 2010.