# Provable Methods for Training Neural Networks with Sparse Connectivity

We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.

## 1 Introduction

The paradigm of deep learning has revolutionized our ability to perform challenging classification tasks in a variety of domains such as computer vision and speech. However, so far, a complete theoretical understanding of deep learning is lacking. Training deep-nets is a highly non-convex problem involving millions of variables, and an exponential number of fixed points. Viewed naively, proving any guarantees appears to be intractable. In this paper, on the contrary, we show that guaranteed learning of a subset of parameters is possible under mild conditions.

We propose a novel learning algorithm based on the method-of-moments. The notion of using moments for learning distributions dates back to Pearson (Pearson, 1894). This paradigm has seen a recent revival in machine learning and has been applied for unsupervised learning of a variety of latent variable models (see (Anandkumar et al., 2014) for a survey). The basic idea is to develop efficient algorithms for factorizing moment matrices and tensors. When the underlying factors are sparse, $\ell_1$-based convex optimization techniques have been proposed before, and have been employed for learning dictionaries (Spielman et al., 2012), topic models, and linear latent Bayesian networks (Anandkumar et al., 2012).

In this paper, we employ the $\ell_1$-based optimization method to learn deep-nets with sparse connectivity. However, so far, this method has theoretical guarantees only for linear models. We develop novel techniques to prove its correctness even for non-linear models. A key technique we use is Stein’s lemma from statistics (Stein, 1986). Taken together, we show how to effectively leverage algorithms based on the method-of-moments to train deep non-linear networks.

### 1.1 Summary of Results

We present a theoretical framework for analyzing when neural networks can be learnt efficiently. We demonstrate how the method-of-moments can yield useful information about the weights in a neural network, and also in some cases, even recover them exactly. In practice, the output of our method can be used for dimensionality reduction for back propagation, resulting in reduced computation.

We show that in a feedforward neural network, the relevant moment matrix to consider is the cross-moment matrix between the label and the score function of the input data (i.e., the derivative of the log of the density function). The classical result of Stein (Stein, 1986) states that this matrix yields the expected derivative of the label (as a function of the input). Stein’s result is essentially obtained through integration by parts (Nourdin et al., 2013).

By employing Stein’s lemma, we show that the row span of the moment matrix between the label and the input score function corresponds to the span of the weight vectors in the first layer, under natural non-degeneracy conditions. Thus, the singular value decomposition of this moment matrix can be used as a low rank approximation of the first layer weight matrix during back propagation, when the number of neurons is less than the input dimensionality. Note that since the first layer typically has the largest number of parameters (if a convolutional structure is not assumed), having a low rank approximation results in significant improvement in performance and computational requirements.

We then show that we can exactly recover the weight matrix of the first layer from the moment matrix when the weights are sparse. It has been argued that sparse connectivity is a natural constraint which can lead to improved performance in practice (Thom and Palm, 2013). We show that the weights can be correctly recovered using an efficient $\ell_1$ optimization approach. Such approaches have earlier been employed for linear models such as dictionary learning (Spielman et al., 2012) and topic modeling (Anandkumar et al., 2012). Here, we establish that the method is also successful in learning non-linear networks, by appealing to Stein’s lemma.

Thus, we show that the cross-moment matrix between the label and the score function of the input contains useful information for training neural networks. This result has an intriguing connection with (Alain and Bengio, 2012), where it is shown that a denoising auto-encoder approximately learns the score function of the input. Our analysis here provides a theoretical explanation of why pre-training can lead to improved performance during back propagation: the interaction between the score function (learnt during pre-training) and the label during back propagation correctly identifies the span of the weight vectors, and thus leads to improved performance.

The use of score functions for improved classification performance is popular under the framework of Fisher kernels (Jaakkola et al., 1999). However, in (Jaakkola et al., 1999), the Fisher kernel is defined as the derivative with respect to some model parameter, while here we consider the derivative with respect to the input and refer to it as the score function. Note that if the Fisher kernel is with respect to a location parameter, these two notions are equivalent. Here, we show that considering the moment between the label and the score function of the input can lead to guaranteed learning and improved classification.

Note that there are various efficient methods for computing the score function (in addition to the auto-encoder). For instance, Sasaki et al. (2014) point out that the score function can be estimated efficiently through non-parametric methods without the need to estimate the density function. In fact, the solution is closed form, and the hyper-parameters (such as the kernel bandwidth and the regularization parameter) can be tuned easily through cross validation. There are a number of score matching algorithms, where the goal is to find a good fit in terms of the score function, e.g., (Hyvärinen, 2005; Swersky et al., 2011). We can employ them to obtain accurate estimates of the score function.

Since we employ a method-of-moments approach, we assume that the label is generated by a feedforward neural network, to which the input data is fed. In addition, we make mild non-degeneracy assumptions on the weights and the derivatives of the activation functions. Such assumptions make the learning problem tractable, whereas the general learning problem is NP-hard. We expect that the output of our moment-based approach can provide effective initializers for the back propagation procedure.

### 1.2 Related Work

In this paper, we show that the method-of-moments can yield low rank approximations for weights in the first layer. Empirically, low rank approximations of the weight matrices have been employed successfully to improve the performance and for reducing computations (Davis and Arel, 2013). Moreover, the notion of using moment matrices for dimension reduction is popular in statistics, and the dimension reducing subspace is termed as a central subspace (Cook, 1998).

We present an $\ell_1$-based convex optimization technique to learn the weights in the first layer, assuming they are sparse. Note that this is different from other convex approaches for learning feedforward neural networks. For instance, Bengio et al. (2005) show via a boosting approach that learning neural networks is a convex optimization problem as long as the number of hidden units can be selected by the algorithm. However, typically, the neural network architecture is fixed, and in that case, the optimization is non-convex.

Our work is the first to show guaranteed learning of a feedforward neural network incorporating both the label and the input. Arora et al. (2013) considered the auto-encoder setting, where learning is unsupervised, and showed how the weights can be learnt correctly under a set of conditions. They assume that the hidden layer can be decoded correctly using a “Hebbian” style rule, and that the hidden units have only binary states. We present a different approach for learning by using the moments between the label and the score function of the input.

## 2 Moments of a Neural Network

### 2.1 Feedforward network with one hidden layer

We first consider a feedforward network with one hidden layer. Subsequently, we discuss how far this can be extended. Let $y$ be the label vector generated from the neural network and $x$ be the feature vector. We assume $x$ has a well-behaved continuous probability distribution $p(x)$ such that the score function $\nabla_x \log p(x)$ exists. The network is depicted in Figure 1. Let

$$
\mathbb{E}[y \mid h] = \sigma_2(A_2 h), \qquad \mathbb{E}[h \mid x] = \sigma_1(A_1 x). \tag{1}
$$

This setup is applicable to both multiclass and multilabel settings. For multiclass classification, $\sigma_2$ is the softmax function, and for multilabel classification $\sigma_2$ is an elementwise sigmoid function. Recall that multilabel classification refers to the case where each instance can have more than one (binary) label (Bishop et al., 2006; Tsoumakas and Katakis, 2007).

### 2.2 Method-of-moments: label-score function correlation matrix

We hope to get information about the weight matrix using moments of the label and the input. The question is when this is possible and with what guarantees. To study the moments, let us start from a simple problem. For a linear network and whitened Gaussian input $x \sim \mathcal{N}(0, I)$, we have $y_{\text{linear}} = Ax$. In order to learn $A$, we can form the label-score function correlation matrix as

$$
\mathbb{E}[y_{\text{linear}}\, x^\top] = A\,\mathbb{E}[x x^\top] = A.
$$

Therefore, if $A$ is low dimensional, we can project $x$ into that span and perform classification in this lower dimension.

Stein’s lemma for a Gaussian random vector $x$ (Stein, 1986) states that for a function $g(\cdot)$ satisfying some mild regularity conditions we have

$$
\mathbb{E}[g(x)\, x^\top] = \mathbb{E}_x[\nabla_x g(x)].
$$

A more difficult problem is a generalized linear model (GLM) of a (whitened) Gaussian $x$. In this case, $\mathbb{E}[y \mid x] = \sigma(Ax)$ for any nonlinear activation function $\sigma(\cdot)$ that satisfies some mild regularity conditions. Using Stein’s lemma we have

$$
\mathbb{E}[\sigma(Ax)\, x^\top] = \mathbb{E}_{x'}[\nabla_{x'} \sigma(x')]\, A,
$$

where $x' = Ax$. Therefore, assuming $\mathbb{E}_{x'}[\nabla_{x'} \sigma(x')]$ has full column rank, we obtain the row span of $A$. For a Gaussian (and elliptical) random vector $x$, $Ax$ provides the sufficient statistic with no information loss. Thus, we can project the input into this span and obtain dimensionality reduction.
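The GLM identity can be checked numerically. The following is a minimal sketch, assuming $\sigma = \tanh$ applied elementwise and a small hand-picked matrix `A` (both are illustrative choices, not taken from the paper); the function name `stein_glm_check` is hypothetical. For an elementwise activation, $\mathbb{E}_{x'}[\nabla_{x'}\sigma(x')]$ is a diagonal matrix of expected derivatives.

```python
import numpy as np


def stein_glm_check(A, n_samples=200_000, seed=0):
    """Monte Carlo estimates of both sides of E[s(Ax) x^T] = E[s'(x')] A, s = tanh."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, A.shape[1]))  # whitened Gaussian input
    z = x @ A.T                                       # x' = A x, one row per sample
    lhs = np.tanh(z).T @ x / n_samples                # estimate of E[sigma(Ax) x^T]
    d = np.mean(1.0 - np.tanh(z) ** 2, axis=0)        # estimate of E[sigma'(x'_i)]
    rhs = np.diag(d) @ A                              # Diag(E[sigma'(x')]) A
    return lhs, rhs


# Illustrative 2 x 3 weight matrix (an assumption for this sketch).
A = np.array([[1.0, -0.5, 0.0],
              [0.3, 0.8, -1.0]])
lhs, rhs = stein_glm_check(A)
print(np.max(np.abs(lhs - rhs)))  # small, up to Monte Carlo error
```

Both sides are estimated from the same samples, so the remaining gap is sampling noise rather than a modeling error.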

The Gaussian distribution assumption is restrictive. The more challenging problem is when the random vector $x$ has a general probability distribution and the network has hidden layers. How can we deal with such an instance? Below we provide the method to learn such problems.

#### 2.2.1 Results

Let $x$ be a random vector with probability density function $p(x)$ and let $y$ be the output label corresponding to the network described in Equation (1). For a general probability distribution, we use the score function of the random vector $x$, which provides us with sufficient statistics for $x$.

##### Definition: Score function

The score of $x$ with probability density function $p(x)$ is the random vector $\nabla_x \log p(x)$.

Let

$$
M := \mathbb{E}\left[y\,(\nabla_x \log p(x))^\top\right],
$$

which can be calculated in a supervised setting. Note that $\nabla_x \log p(x)$ represents the score function for the random vector $x$.
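As an illustration of how $M$ might be estimated from labeled samples, here is a minimal sketch that assumes the input is Gaussian with known mean and covariance, so that the score has the closed form $-\Sigma^{-1}(x-\mu)$. The helper names `gaussian_score` and `cross_moment` and all numerical values are hypothetical. With noiseless linear labels $y = Ax$, the estimate should approach $-A$; the minus sign comes from the score itself.

```python
import numpy as np


def gaussian_score(x, mean, cov):
    """Score of N(mean, cov): grad_x log p(x) = -cov^{-1} (x - mean), row-wise."""
    return -np.linalg.solve(cov, (x - mean).T).T


def cross_moment(y, score):
    """Empirical M = (1/n) * sum_i y_i score_i^T, of shape (n_y, n_x)."""
    return y.T @ score / y.shape[0]


rng = np.random.default_rng(0)
mean = np.array([1.0, -1.0, 0.5])
cov = np.array([[2.0, 0.5, 0.0],
                [0.5, 1.0, 0.3],
                [0.0, 0.3, 1.5]])          # positive definite (diagonally dominant)
A = np.array([[1.0, 0.0, -1.0],
              [0.5, 2.0, 0.0]])

x = rng.multivariate_normal(mean, cov, size=200_000)
y = x @ A.T                                # noiseless linear labels, for checking only
M_hat = cross_moment(y, gaussian_score(x, mean, cov))
print(np.round(M_hat, 2))                  # should be close to -A, up to sampling error
```

For non-Gaussian inputs the score would have to be estimated, for example by the score matching methods cited in the introduction; that step is outside this sketch.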

###### Theorem 1.

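In a nonlinear neural network with feature vector $x$ and output label $y$, we have

$$
M = -\mathbb{E}_x\left[\sigma_2'(\tilde{x}_2)\, A_2\, \mathrm{Diag}(\sigma_1'(\tilde{x}_1))\right] A_1,
$$

where $\tilde{x}_2 = A_2 \sigma_1(A_1 x)$ and $\tilde{x}_1 = A_1 x$.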

###### Proof.

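Our method builds upon Stein’s lemma (Stein, 1986). We use Proposition 1.

$$
\begin{aligned}
M &= \mathbb{E}_{x,y}\left[y\,(\nabla_x \log p(x))^\top\right] = \mathbb{E}_x\left[\mathbb{E}_y\left[y\,(\nabla_x \log p(x))^\top \mid x\right]\right] \\
&= \mathbb{E}_x\left[\sigma_2(A_2 \sigma_1(A_1 x))\,(\nabla_x \log p(x))^\top\right] \\
&= -\mathbb{E}_x\left[\nabla_x\, \sigma_2(A_2 \sigma_1(A_1 x))\right] \\
&= -\mathbb{E}_x\left[\sigma_2'(\tilde{x}_2)\, A_2\, \mathrm{Diag}(\sigma_1'(\tilde{x}_1))\, A_1\right].
\end{aligned}
$$

The second equality is a result of the law of total expectation, and the third substitutes the model (1). The fourth equality follows from Stein’s lemma as in Proposition 1 below. The last equality results from the chain rule.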
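The statement of Theorem 1 can be sanity-checked by simulation. The sketch below assumes whitened Gaussian input (score $-x$), $\sigma_1 = \tanh$, an elementwise sigmoid $\sigma_2$ (the multilabel case), and uses the conditional mean $\sigma_2(A_2\sigma_1(A_1x))$ in place of sampled labels; the weights and the name `theorem1_check` are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def theorem1_check(A1, A2, n_samples=200_000, seed=0):
    """Monte Carlo estimates of M and of -E[s2'(x2) A2 Diag(s1'(x1))] A1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, A1.shape[1]))
    score = -x                               # score of N(0, I)
    h = np.tanh(x @ A1.T)                    # sigma_1(x~_1), with x~_1 = A1 x
    y = sigmoid(h @ A2.T)                    # sigma_2(x~_2), with x~_2 = A2 h
    M_hat = y.T @ score / n_samples          # estimate of E[y score^T]
    d2 = y * (1.0 - y)                       # sigma_2'(x~_2), elementwise
    d1 = 1.0 - h ** 2                        # sigma_1'(x~_1), elementwise
    # Entry (i, j) of E[Diag(d2) A2 Diag(d1)] equals E[d2_i d1_j] * A2[i, j].
    B = (d2.T @ d1 / n_samples) * A2
    return M_hat, -B @ A1


# Illustrative weights: 4 inputs, 2 hidden units, 3 labels.
A1 = np.array([[1.0, 0.0, -0.5, 0.0],
               [0.0, 0.8, 0.0, 1.2]])
A2 = np.array([[1.0, -1.0],
               [0.5, 2.0],
               [-1.5, 0.3]])
M_hat, M_thm = theorem1_check(A1, A2)
print(np.max(np.abs(M_hat - M_thm)))         # small, up to Monte Carlo error
```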

###### Proposition 1 (Stein’s lemma (Stein et al., 2004)).

Let $x$ be a random vector with joint density function $p(x)$. Suppose the score function $\nabla_x \log p(x)$ exists. Consider any continuously differentiable function $g(x)$ such that all the entries of $p(x)\,g(x)$ go to zero on the boundaries of the support of $p(x)$. Then, we have

$$
\mathbb{E}\left[g(x)\,(\nabla_x \log p(x))^\top\right] = -\mathbb{E}\left[\nabla_x g(x)\right].
$$

Note that it is also assumed that the above expectations exist (in the sense that the corresponding integrals exist).

The proof follows from integration by parts; the result for scalar $x$ and scalar-output functions $g(x)$ is provided in (Stein et al., 2004).
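For intuition, the scalar case can be written out in one line. This is a sketch under the stated boundary condition, assuming $x$ is scalar with density $p$ supported on an interval $(a, b)$ (the endpoints $a$, $b$ are notation introduced here) and $g(x)\,p(x) \to 0$ at both ends:

$$
\mathbb{E}\left[g(x)\,\frac{d}{dx}\log p(x)\right]
= \int_a^b g(x)\,\frac{p'(x)}{p(x)}\,p(x)\,dx
= \Big[g(x)\,p(x)\Big]_a^b - \int_a^b g'(x)\,p(x)\,dx
= -\mathbb{E}\left[g'(x)\right].
$$

The boundary term vanishes by assumption, which is exactly where the decay condition on $p(x)\,g(x)$ is used.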

###### Remark 1 (Connection with pre-training).

The above theorem provides a closed-form expression for $M$. If $\mathbb{E}_x[\sigma_2'(\tilde{x}_2)\, A_2\, \mathrm{Diag}(\sigma_1'(\tilde{x}_1))]$ has full column rank, we obtain the row space of $A_1$. In deep networks, an auto-encoder has been shown to approximately learn the score function of the input (Alain and Bengio, 2012). It has been shown that pre-training results in better performance. Here, we use the correlation matrix between the labels and the score function to obtain the span of the weights. An auto-encoder appears to be doing the same by estimating the score function. Therefore, our method provides a theoretical explanation of why pre-training is helpful.

###### Remark 2.

For a whitened Gaussian (and elliptical) random vector, the projection of the input onto the row space of $A_1$ is a sufficient statistic. Empirically, even for non-Gaussian distributions, this has led to improvements (Sun et al., 2013; Li, 1992). The moment method presented in this paper thus provides a low-rank approximation for training neural networks.
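The dimensionality-reduction step can be sketched as follows, assuming the factorization of Theorem 1 holds exactly, i.e. $M = -BA_1$ with a full-column-rank factor $B$. Here `B` is a hypothetical stand-in for the expected-derivative term, and `A1` and the helper `row_space_basis` are illustrative. The top-$k$ right singular vectors of $M$ then span the row space of $A_1$.

```python
import numpy as np


def row_space_basis(M, k):
    """Top-k right singular vectors of M: an orthonormal basis of its row space."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:k]                                    # shape (k, n_x)


A1 = np.array([[1.0, 0.0, 2.0, 0.0, -1.0, 0.0],
               [0.0, 3.0, 0.0, 1.0, 0.0, -2.0]])     # k = 2 neurons, n_x = 6 inputs
B = np.array([[0.5, -1.0],
              [1.0, 0.2],
              [0.3, 0.7]])                           # stand-in factor, full column rank
M = -B @ A1                                          # moment matrix as in Theorem 1

V = row_space_basis(M, k=2)
x = np.arange(6, dtype=float)
x_low = V @ x                                        # 2-dimensional representation of x

# Projecting the rows of A1 onto span(V) leaves them unchanged: same row space.
residual = A1 - (A1 @ V.T) @ V
print(np.linalg.norm(residual))                      # zero up to floating-point error
```

With an estimated $M$, the trailing singular values are no longer exactly zero, so $k$ would have to be chosen from the spectrum; that choice is not addressed here.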

So far, we showed that we can recover the span of $A_1$. How can we retrieve the matrix $A_1$ itself? Without further assumptions this problem is not identifiable. A reasonable assumption is that $A_1$ is sparse. In this case, we can pose this problem as learning $A_1$ given its row span. This problem arises in a number of settings such as learning a sparse dictionary or topic modeling. Next, using the idea presented in (Spielman et al., 2012), we discuss how this can be done.

## 3 Learning the Weight Matrix

In this Section, we explain how we learn the weight matrix $A_1$ given the moment $M$. The complete framework is shown in Algorithm 1. Assuming sparsity, we use the method of Spielman et al. (2012).

##### Identifiability

The first natural identifiability requirement on $A_1$ is that it has full row rank. Spielman et al. (2012) show that for Bernoulli-Gaussian entries under relative scaling of parameters, we can impose that the sparsest vectors in the row span of $M$ are the rows of $A_1$. Any vector in this space is generated by a linear combination $w^\top M$ of rows of $M$. The intuition is random sparsity, where a combination of different sparse rows cannot make a sparse row. Under this identifiability condition, we need to solve the optimization problem

$$
\text{minimize} \quad \|w^\top M\|_0 \quad \text{subject to} \quad w \neq 0.
$$

##### ℓ1 optimization

In order to come up with a tractable update, Spielman et al. (2012) use the convex relaxation of the $\ell_0$ norm and relax the nonzero constraint on $w$ by constraining it to lie in an affine hyperplane $r^\top w = 1$. Therefore, the algorithm includes solving the following linear programming problem

$$
\text{minimize} \quad \|w^\top M\|_1 \quad \text{subject to} \quad r^\top w = 1.
$$

It is proved that under some additional conditions, when $r$ is chosen as a column or sum of two columns of $M$, the linear program produces rows of $A_1$ with high probability (Spielman et al., 2012). We explain these conditions in our context in Section 3.1.

By normalizing the rows of the output, we obtain a row-normalized version of $A_1$. The algorithm is shown in Algorithm 2. Note that $e_i$ refers to the $i$-th basis vector.
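The linear program above can be sketched with an off-the-shelf LP solver. This is an illustration under assumptions, not a reproduction of Algorithm 2: it uses SciPy's `linprog`, takes $r$ to be each single column of $M$ in turn, and keeps the sparsest linearly independent solutions. The helper names, the stand-in factor `B`, and the small sparse `A1` (chosen so the outcome can be verified by hand) are all hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def sparsest_row_lp(M, r):
    """Solve  min_w ||w^T M||_1  s.t.  r^T w = 1  as an LP; return w^T M or None.

    Auxiliary variables t >= |M^T w| turn the l1 objective into sum(t).
    """
    m, n = M.shape
    c = np.concatenate([np.zeros(m), np.ones(n)])      # minimize sum(t)
    A_ub = np.block([[M.T, -np.eye(n)],                #  M^T w - t <= 0
                     [-M.T, -np.eye(n)]])              # -M^T w - t <= 0
    A_eq = np.concatenate([r, np.zeros(n)])[None, :]   # r^T w = 1
    bounds = [(None, None)] * m + [(0, None)] * n      # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n), A_eq=A_eq,
                  b_eq=np.array([1.0]), bounds=bounds, method="highs")
    return res.x[:m] @ M if res.status == 0 else None


def normalize(v, tol=1e-6):
    """Zero tiny entries, scale to unit l2 norm, make the largest entry positive."""
    v = np.where(np.abs(v) < tol, 0.0, v)
    v = v / np.linalg.norm(v)
    return v * np.sign(v[np.argmax(np.abs(v))])


def recover_rows(M, k):
    """Run the LP with r = each column of M; keep the k sparsest independent rows."""
    candidates = []
    for j in range(M.shape[1]):
        row = sparsest_row_lp(M, M[:, j])
        if row is not None and np.linalg.norm(row) > 1e-6:
            candidates.append(normalize(row))
    candidates.sort(key=np.count_nonzero)              # sparsest first
    rows = []
    for v in candidates:
        if np.linalg.matrix_rank(np.array(rows + [v]), tol=1e-6) > len(rows):
            rows.append(v)
        if len(rows) == k:
            break
    return np.array(rows)


A1 = np.array([[2.0, 0.0, -1.0, 0.0, 0.0, 1.0],
               [0.0, 1.0, 0.0, 3.0, -2.0, 1.0]])       # sparse first-layer weights
B = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, -1.0]])                            # stand-in factor, full column rank
M = -B @ A1
rows = recover_rows(M, k=2)
print(rows)                                            # row-normalized rows of A1
```

Recovery is only up to scaling and sign, which is why the rows are normalized and given a fixed sign convention before comparison. The dense LP formulation is meant for small examples; larger problems would call for the approximate first-order methods mentioned in Remark 3.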

We finally note that there exist more sophisticated analysis and algorithms for the problem of finding the sparsest vectors in a subspace. Anandkumar et al. (2012) provide the deterministic sparsity version of the result. Barak et al. (2012) require more computation and even quasi-polynomial time but they can solve the problem in denser settings.

### 3.1 Guarantees for learning first layer weights

We have the following assumptions to ensure that the weight matrix is learnt correctly.

##### Assumptions
1. Elementwise first layer: $\sigma_1$ is an elementwise function.

2. Nondegeneracy: The matrix $\mathbb{E}_x[\sigma_2'(\tilde{x}_2)\, A_2\, \mathrm{Diag}(\sigma_1'(\tilde{x}_1))]$ has full column rank.

3. Score function: The score function $\nabla_x \log p(x)$ exists.

4. Sufficient input dimension: We have for some positive constant .

5. Sparse connectivity: The weight matrix is Bernoulli-Gaussian. For some positive constant , we have

6. Normalized weight matrix: The weight matrix $A_1$ is row-normalized.

Assumption A.1 is common in the deep network literature since there are only elementwise activations in the intermediate layers.

Assumption A.2 is satisfied when $A_2$ is full-rank and $\sigma_1'$, $\sigma_2'$ are non-degenerate. This is the case when the number of classes is large, i.e., larger than the number of hidden neurons, as in ImageNet. In the future, we plan to consider the setting with a small number of classes using other methods such as tensor methods. The non-degeneracy assumption on $\sigma_1'$ and $\sigma_2'$ holds because we assume the functions are at least linear, i.e., their first order derivatives are nonzero. This is true for the activation functions used in deep networks, such as the sigmoid function, the piecewise linear rectifier, and the softmax function at the last layer.

Note that Assumption A.4 uses an improvement over Spielman’s initial result (Luh and Vu, 2015). In a deep network is usually a few thousand while is in the millions. Hence, Assumption A.4 is satisfied. Note that Luh and Vu (2015) have provided an algorithm for very sparse weight matrices, which only needs .

Assumption A.5 requires the weight matrix to be sparse and the expected number of nonzero elements in each column of be at most  (Luh and Vu, 2015). In other words, each input is connected to at most neurons. This is a meaningful assumption in the deep-nets literature as it has been argued that sparse connectivity is a natural constraint which can lead to improved performance in practice (Thom and Palm, 2013).

If Assumption A.6 does not hold, we will have to learn the scaling and the bias through back propagation. Nevertheless, since the row-normalized $A_1$ provides the directions, the number of parameters in back propagation is reduced significantly. Instead of learning a dense matrix, we only need to find the scaling in a sparse matrix. This results in a significant reduction in the number of parameters that back propagation needs to learn.

Finally we provide the results on learning the first layer weight matrix in a feedforward network with one hidden layer.

###### Theorem 2.

Let Assumptions A.1 to A.6 hold for the nonlinear neural network (1). Then Algorithm 2 uniquely recovers a row-normalized version of $A_1$ with exponentially small probability of failure.

For proof, see (Spielman et al., 2012).

###### Remark 3 (Efficient implementation).

The $\ell_1$ optimization is efficient to implement. The algorithm involves solving a number of optimization problems. Traditionally, the $\ell_1$ minimization can be formulated as a linear programming problem. In particular, each of these $\ell_1$ minimization problems can be written as an LP with inequality constraints and one equality constraint. Since the computational complexity of such a method is often too high for large scale problems, one can use approximate methods such as gradient projection (Figueiredo et al., 2007; Kim et al., 2007), iterative-shrinkage thresholding (Daubechies et al., 2004) and proximal gradient (Nesterov, 1983; Nesterov et al., 2007) that are noticeably faster (Anandkumar et al., 2012).

###### Remark 4 (Learning ^A2).

After learning $A_1$, we can encode the first layer as $h = \sigma_1(A_1 x)$ and perform softmax regression to learn $A_2$.
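The second-stage fit can be sketched as ordinary softmax regression on the encoded features. The example below assumes $\sigma_1 = \tanh$, treats a fixed matrix `A1_hat` as the already-recovered first layer, generates synthetic labels from a made-up `A2_true`, and uses plain gradient descent; all names and values are illustrative assumptions.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)            # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_softmax(h, labels, n_classes, lr=0.1, n_steps=200):
    """Gradient descent on the mean cross-entropy; returns (A2, loss history)."""
    n, k = h.shape
    onehot = np.eye(n_classes)[labels]
    A2 = np.zeros((n_classes, k))
    losses = []
    for _ in range(n_steps):
        p = softmax(h @ A2.T)
        losses.append(-np.mean(np.log(p[np.arange(n), labels])))
        A2 -= lr * (p - onehot).T @ h / n           # gradient of the mean loss
    return A2, losses


rng = np.random.default_rng(0)
A1_hat = np.array([[1.0, 0.0, -0.5, 0.0],
                   [0.0, 0.8, 0.0, 1.2]])           # assumed output of the first stage
A2_true = np.array([[2.0, -1.0],
                    [-1.0, 2.0],
                    [-1.0, -1.0]])                  # used only to simulate labels

x = rng.standard_normal((2000, 4))
h = np.tanh(x @ A1_hat.T)                           # encoded first layer
logits = h @ A2_true.T
labels = np.argmax(logits + rng.gumbel(size=logits.shape), axis=1)  # softmax sampling

A2_hat, losses = fit_softmax(h, labels, n_classes=3)
print(losses[0], losses[-1])                        # loss decreases from log(3)
```

Since softmax weights are identifiable only up to a common shift across classes, `A2_hat` should be judged by its loss or predictions, not by entrywise agreement with `A2_true`.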

###### Remark 5 (Extension to deterministic sparsity).

The results in this work are proposed in the random setting where the i.i.d. Bernoulli-Gaussian entries for matrix are assumed. In general, the results can be presented in terms of deterministic conditions as in (Anandkumar et al., 2012). Anandkumar et al. (2012) show that the model is identifiable when has full column rank and the following expansion condition holds (Anandkumar et al., 2012).

$$
|N_B(S)| \geq |S| + d_{\max}(B), \quad \forall S \subseteq \text{Columns of } B,\ |S| \geq 2.
$$

Here, $N_B(S)$ denotes the set of neighbors of the columns of $B$ in the set $S$. They also show that under additional conditions, the $\ell_1$ relaxation can recover the model parameters. See (Anandkumar et al., 2012) for the details.

### 3.2 Extension to deep networks

So far, we have considered a network with one hidden layer. Now, consider a deep -node neural network with depth . Let be the label vector and be the feature vector. We have

where each activation is an elementwise function (linear or nonlinear). This setup is applicable to both multiclass and multilabel settings. For multiclass classification, the last-layer activation is the softmax function, and for multilabel classification it is an elementwise sigmoid function. In this network, we can learn the first layer using the idea presented earlier in this Section. From Stein’s lemma, we have

Assumption B.2 Nondegeneracy:

The matrix has full column rank.

In Assumption B.2, where denotes the input and the -th layer.

###### Theorem 3.

Let Assumptions hold for the nonlinear deep neural network (2). Then, Algorithm 2 uniquely recovers a row-normalized version of with exponentially small probability of failure.

The proof follows from Stein’s lemma, the chain rule, and the results of Spielman et al. (2012).

In a deep network, the first layer includes most of the parameters (if a structure such as a convolutional network is not assumed), and the other layers consist of a small number of parameters since they have a small number of neurons. Therefore, the above result represents significant progress in learning deep neural networks.

###### Remark 6.

This is the first result to learn a subset of deep networks for general nonlinear case in supervised manner. The idea presented in (Arora et al., 2013) is for the auto-encoder setting, whereas we consider supervised setting. Also, Arora et al. (2013) assume that the hidden layer can be decoded correctly using a “Hebbian” style rule, and they all have only binary states. In addition, they can handle sparsity level up to while we can go up to , i.e. .

###### Remark 7 (Challenges in learning the higher layers).

In order for to have full column rank, intermediate layers should have square weight matrices. However, if we want to learn the middle layers, requires that the number of rows of the weight matrices be smaller than the number of columns in a specific manner and therefore cannot have full column rank. In future, we hope to investigate new methods to help in overcoming this challenge.

## 4 Conclusion

We introduced a new paradigm for learning neural networks using the method-of-moments. In the literature, this method has been restricted to the unsupervised setting. Here, we bridged the gap and employed it for discriminative learning. This opens up many interesting research directions for future investigation. First, note that we only considered inputs with a continuous distribution for which the score function exists. The question is whether learning the parameters in a neural network is possible for discrete data. Although Stein’s lemma has a form for discrete variables (in terms of finite differences) (Wei et al., 2010), it is not clear how that can be leveraged to learn the network parameters. Next, it is worth analyzing how we can go beyond the $\ell_1$ relaxation and provide guarantees in such cases. Another interesting problem arises in the case of a small number of classes. Note that for the non-degeneracy condition, we require the number of classes to be bigger than the number of neurons in the hidden layers. Therefore, our method does not work when the number of classes is smaller than the number of hidden neurons. In addition, in order to learn the weight matrices for intermediate layers, we need the number of rows to be smaller than the number of columns to have sufficient input dimension. On the other hand, the non-degeneracy assumption requires these weight matrices to be square. Hence, learning the weights in the intermediate layers of deep networks is a challenging problem. It seems that tensor methods, which have been highly successful in learning a wide range of hidden models such as topic models, mixtures of Gaussians, and community detection (Anandkumar et al., 2014), may provide a way to overcome the last two challenges.

### Acknowledgment

A. Anandkumar is supported in part by Microsoft Faculty Fellowship, NSF Career award CCF-, NSF Award CCF-, ARO YIP Award WNF--- and ONR Award N. H. Sedghi is supported by ONR Award N.