Deep neural networks have been extremely successful in many tasks related to images, videos and reinforcement learning. However, the success of deep learning is still far from being understood in theory. In particular, learning a neural network is a complicated non-convex optimization problem, which is hard in the worst case. Why, then, can we efficiently learn a neural network? Despite much recent effort, the class of neural networks that we know how to provably learn in polynomial time is still very limited, and many results require strong assumptions on the input distribution.
In this paper we design a new algorithm that is capable of learning a two-layer neural network for a general class of input distributions. (There are different ways to count the number of layers. Here, by a two-layer network we refer to a fully-connected network with two layers of edges, i.e., two weight matrices; this is considered a three-layer network if one counts layers of nodes, as in Goel and Klivans (2017), or a one-hidden-layer network if one counts only hidden layers.) Following standard models for learning neural networks, we assume there is a ground-truth neural network. The data is generated by first sampling the input x from an input distribution D, then computing the output y according to the ground-truth network, which is unknown to the learner. The learning algorithm tries to find a neural network whose output is as close to y as possible over the input distribution D. Learning a neural network is known to be a hard problem even in some simple settings (Goel et al., 2016; Brutzkus and Globerson, 2017), so we need to make assumptions on the network structure, on the input distribution D, or on both. Many works assume a simple input distribution (such as Gaussian) and try to learn more and more complex networks (Tian, 2017; Brutzkus and Globerson, 2017; Li and Yuan, 2017; Soltanolkotabi, 2017; Zhong et al., 2017). However, the input distributions in real life are distributions of very complicated objects such as texts, images, or videos. These inputs are highly structured, clearly not Gaussian, and do not even have a simple generative model.
We consider a type of two-layer neural network, where the output y is generated as

y = A σ(Bx) + ξ. (1)

Here x ∈ R^d is the input, and A and B are the two weight matrices (we assume the output dimension equals the number of hidden units for simplicity; our results can easily be generalized as long as the dimension of the output is no smaller than the number of hidden units). The function σ is the ReLU activation σ(z) = max{z, 0}, applied entry-wise to the vector Bx, and ξ is a noise vector that has E[ξ] = 0 and is independent of x. Although the network only has two layers, learning similar networks is far from trivial: even when the input distribution is Gaussian, Ge et al. (2017b) and Safran and Shamir (2018) showed that the standard optimization objective can have bad local optima. Ge et al. (2017b) gave a new and more complicated objective function that does not have bad local minima.
For the input distribution D, our only requirement is that D is symmetric. That is, for any x ∈ R^d, the probability of observing x is the same as the probability of observing −x. A symmetric distribution can still be very complicated and need not be representable by a finite number of parameters. In practice, one can often think of the symmetry requirement as a “factor-2” approximation to an arbitrary input distribution: given arbitrary training samples, it is possible to augment the input data with its negations to make the input distribution symmetric, and it should take at most twice the effort to label both the original and augmented data. In many cases (such as images) the augmented data can be interpreted (for images it is just negated colors), so reasonable labels can be obtained.
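The augmentation described above can be sketched in a few lines (this helper is our own illustration, not code from the paper): negating every training input makes the empirical distribution symmetric by construction, since each x appears together with −x. Labels for the negated inputs would still need to be obtained separately.

```python
import numpy as np

def symmetrize_inputs(X):
    """Augment a dataset with the negation of every input.

    The empirical distribution over the augmented rows is symmetric:
    for every row x there is a matching row -x. Labels for the negated
    inputs still have to be obtained, which is the "factor-2" cost.
    """
    return np.concatenate([X, -X], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)) + 0.3        # an asymmetric input distribution
X_aug = symmetrize_inputs(X)

# Odd empirical moments of the augmented data vanish (up to rounding):
print(np.allclose(X_aug.mean(axis=0), 0.0))  # True
```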
1.1 Our Results
When the input distribution is symmetric, we give the first algorithm that can learn a two-layer neural network. Our algorithm is based on the method-of-moments approach: first estimate some correlations between x and y, then use this information to recover the model parameters. More precisely, we have
Theorem 1 (informal).
Suppose the data is generated according to Equation (1) and the input distribution D is symmetric. Given exact correlations between x and y of order at most 4, as long as the weight matrices and the input distribution are not degenerate, there is a polynomial-time algorithm that outputs a network of the same size that is effectively the same as the ground-truth network: for any input x, the two networks produce the same output.
Of course, in practice we only have samples of (x, y) and cannot get the exact correlations. However, our algorithm is robust to perturbations, and in particular can work with polynomially many samples.
Theorem 2 (informal).
Suppose the data is generated according to Equation (1) and the input distribution D is symmetric. As long as the weight matrices and the input distribution are not degenerate, there is an algorithm that uses polynomial time and polynomially many samples and outputs a network of the same size that computes an ε-approximation to the ground-truth network.
In fact, the algorithm recovers the original parameters up to scaling and permutation. Here, when we say the weight matrices are not degenerate, we mean that the matrices A and B should be full rank and, in addition, that a certain distinguishing matrix that we define later in Section 2 is also full rank. We justify these assumptions using the smoothed analysis framework (Spielman and Teng, 2004).
In smoothed analysis, the input is not purely controlled by an adversary. Instead, the adversary first generates an arbitrary instance (in our case, arbitrary weight matrices A, B and a symmetric input distribution D), and the parameters of this instance are then randomly perturbed to yield a perturbed instance. The algorithm only needs to work with high probability on the perturbed instance. This limits the power of the adversary and prevents it from creating highly degenerate cases (e.g., choosing weight matrices that are far from full rank). Roughly speaking, we show
Theorem 3 (informal).
There is a simple way to perturb the input distribution and the weight matrices such that, with high probability, the distance between the perturbed instance and the original instance is small, and our algorithm outputs an ε-approximation to the perturbed network using polynomial time and polynomially many samples.
In the rest of the paper, we first review related work. In Section 2 we formally define the network and introduce some notation. Our algorithm is given in Section 3. Finally, in Section 4 we run experiments showing that the algorithm can indeed learn the two-layer network efficiently and robustly, with a reasonable number of samples, for different (symmetric) input distributions and weight matrices. Due to space constraints, the proofs for the polynomial sample complexity (Theorem 2) and the smoothed analysis (Theorem 3) are deferred to the appendix.
1.2 Related Work
There are many works in learning neural networks, and they come in many different styles.
ReLU network, Gaussian input
When the input is Gaussian, Ge et al. (2017b) showed that for a two-layer neural network, although the standard objective does have bad local optimal solutions, one can construct a new objective whose local optima are all globally optimal. Several other works (Tian, 2017; Brutzkus and Globerson, 2017; Li and Yuan, 2017; Soltanolkotabi, 2017; Zhong et al., 2017) extend this to different settings. A closely related work (Janzamin et al., 2015) does not require the input distribution to be Gaussian, but still relies on knowing the score function of the input distribution (which in general cannot be estimated efficiently from samples).
General input distributions
There are several lines of work that try to extend the learning results to more general distributions. Du et al. (2017) showed how to learn a single neuron or a single convolutional filter under some conditions for the input distribution. Daniely et al. (2016); Zhang et al. (2016, 2017); Goel and Klivans (2017); Du and Goel (2018) used kernel methods to learn neural networks when the norm of the weights and input distributions are both bounded (and in general the running time and sample complexity in this line of work depend exponentially on the norms of weights/input). The work that is most similar to our setting is Goel et al. (2018), where they showed how to learn a single neuron (or a single convolutional filter) for any symmetric input distribution. Our two-layer neural network model is much more complicated.
Method-of-Moments and Tensor Decomposition
Our work uses the method of moments, which has already been applied to learn many latent variable models (see Anandkumar et al. (2014) and references therein). The particular algorithm that we use is inspired by an over-complete tensor decomposition algorithm, FOOBI (De Lathauwer et al., 2007). Our smoothed analysis results are inspired by Bhaskara et al. (2014) and Ma et al. (2016), although our setting is more complicated and we need several new ideas.
2 Preliminaries
In this section, we first describe the neural network model that we learn, and then introduce notation related to matrices and tensors. Finally, we define the distinguishing matrix, which is a central object in our analysis.
2.1 Network Model
We consider two-layer neural networks with d-dimensional input, k hidden units and k-dimensional output, as shown in Figure 1. We assume that k ≤ d. The input of the neural network is denoted by x ∈ R^d. Assume that the input is drawn i.i.d. from a symmetric distribution D (if the density function of D is p, we assume p(x) = p(−x) for any x). Let the two weight matrices in the neural network be A ∈ R^{k×k} and B ∈ R^{k×d}. The output y ∈ R^k is generated as follows:

y = A σ(Bx) + ξ,
where σ is the element-wise ReLU function and ξ is zero-mean random noise that is independent of the input x. Let the value of the hidden units be h = σ(Bx). Denote the i-th row of matrix B by b_i^⊤, and let the i-th column of matrix A be a_i (1 ≤ i ≤ k). By the positive-homogeneity of ReLU activations, for any constant c > 0, scaling the i-th row of B by c while scaling the i-th column of A by 1/c does not change the function computed by the network. Therefore, without loss of generality, we assume every row vector of B has unit norm.
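This scaling invariance is just the positive-homogeneity of ReLU, σ(cz) = c·σ(z) for c > 0. A quick numerical check (purely illustrative) confirms that rescaling a row of B and the matching column of A leaves the network's output unchanged:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(1)
k, d = 4, 6
A = rng.normal(size=(k, k))
B = rng.normal(size=(k, d))
x = rng.normal(size=d)

# Scale row i of B by c > 0 and column i of A by 1/c.
i, c = 2, 3.7
B2, A2 = B.copy(), A.copy()
B2[i, :] *= c
A2[:, i] /= c

y1 = A @ relu(B @ x)
y2 = A2 @ relu(B2 @ x)
print(np.allclose(y1, y2))  # True: both networks compute the same function
```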
2.2 Notation
We use [n] to denote the set {1, 2, …, n}. For two random variables X and Y, we say X ∼ Y if they come from the same distribution.
In the vector space R^n, we use ⟨u, v⟩ to denote the inner product of two vectors, and ‖v‖ to denote the Euclidean norm. We use e_i to denote the i-th standard basis vector. For a matrix M, let M_i denote its i-th row vector, and let M^j denote its j-th column vector. Let M's singular values be σ_1(M) ≥ σ_2(M) ≥ ⋯, and denote the smallest singular value by σ_min(M). The condition number of matrix M is defined as κ(M) := σ_max(M)/σ_min(M). We use I_n to denote the identity matrix of dimension n. The spectral norm of a matrix is denoted by ‖M‖, and the Frobenius norm by ‖M‖_F.
We represent a p-dimensional linear subspace S by a matrix S whose columns form an orthonormal basis for S. The projection matrix onto the subspace S is denoted by Proj_S = SS^⊤, and the projection matrix onto the orthogonal complement of S is denoted by Proj_{S^⊥} = I − SS^⊤.
For matrices A ∈ R^{m×n} and B ∈ R^{p×q}, let the Kronecker product of A and B be A ⊗ B ∈ R^{mp×nq}, whose ((i, s), (j, t)) entry is A_{i,j}B_{s,t}. For a vector x ∈ R^d, the Kronecker product x ⊗ x has dimension d². We denote the p-fold Kronecker product of x by x^{⊗p}, which has dimension d^p.
We often need to convert between vectors and matrices. For a matrix M ∈ R^{d×d}, let vec(M) ∈ R^{d²} be the vector obtained by stacking all the columns of M. For a vector v ∈ R^{d²}, let mat(v) denote the inverse mapping, such that mat(vec(M)) = M. Let the space of all d × d symmetric matrices be denoted R^{d×d}_sym, which has dimension d(d+1)/2. For a symmetric matrix M, we denote by vec_sym(M) the vector obtained by stacking all the upper triangular entries (including diagonal entries) of M; note that vec_sym(M) has dimension d(d+1)/2 rather than d². For a vector v of this dimension, let mat_sym(v) denote the inverse mapping of vec_sym, such that mat_sym(vec_sym(M)) = M.
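These conversions are standard; a small numpy check (our own illustration, using column-major vec as in the definition above) verifies the vec/mat maps and the classical Kronecker identity vec(AXB^⊤) = (B ⊗ A)·vec(X):

```python
import numpy as np

def vec(M):
    """Stack the columns of M (column-major / Fortran order)."""
    return M.flatten(order="F")

def mat(v, shape):
    """Inverse of vec: mat(vec(M), M.shape) == M."""
    return v.reshape(shape, order="F")

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(6, 5))

assert np.allclose(mat(vec(X), X.shape), X)
# The standard identity relating vec and matrix products:
assert np.allclose(vec(A @ X @ B.T), np.kron(B, A) @ vec(X))
print("vec/mat and Kronecker identity check out")
```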
2.3 Distinguishing Matrix
A central object in our analysis is a large matrix whose columns are closely related to pairs of hidden variables. We call this the distinguishing matrix and define it below:
Definition 1 (Distinguishing Matrix).
Given the weight matrix B of the first layer and the input distribution D, the distinguishing matrix N^D is a d² × k(k−1)/2 matrix whose columns are indexed by pairs (i, j) with i < j ≤ k, and

N^D_{ij} = E_{x∼D}[(b_i^⊤x)(b_j^⊤x)(x ⊗ x) · 1{(b_i^⊤x)(b_j^⊤x) ≤ 0}].
Another related concept is the augmented distinguishing matrix M^D, whose first k(k−1)/2 columns are exactly the same as those of the distinguishing matrix N^D, and whose last column (indexed by 0) is defined as E_{x∼D}[x ⊗ x].
For both matrices, when the input distribution D is clear from context, we write N or M and omit the superscript.
The exact reason for these definitions will only become clear after we explain the algorithm in Section 3. Our algorithm requires that these matrices be robustly full rank, in the sense that their smallest singular values are lower bounded. Intuitively, every column looks at the expectation over samples for which the two neurons have opposite signs ((b_i^⊤x)(b_j^⊤x) ≤ 0, hence the name distinguishing matrix).
Requiring N and M to be full rank prevents several degenerate cases. For example, if two hidden units are perfectly correlated and always share the same sign for every input (which is very unnatural), requiring the distinguishing matrix to be full rank rules this out. Later, in Section C, we also show that requiring a lower bound on σ_min(M) is not unreasonable: in the smoothed analysis setting, where nature can make a small perturbation to the input distribution D, we show that for any input distribution D there exist simple perturbations D′, arbitrarily close to D, such that σ_min(M^{D′}) is lower bounded.
3 Our Algorithm
In this section, we describe our algorithm for learning the two-layer networks defined in Section 2.1. As a warm-up, we first consider a single-layer neural network and recover the results of Goel et al. (2018) using the method of moments; this also serves as a crucial step in our main algorithm. Due to space constraints, we only present the algorithm and proof ideas; the detailed proofs are deferred to Section A in the appendix. Throughout this section, when we write an expectation E[·] without further specification, the expectation is over the randomness of x and the noise ξ.
3.1 Warm-up: Learning Single-layer Networks
We first give a simple algorithm for learning a single-layer neural network. More precisely, suppose we are given samples (x, y), where x comes from a symmetric distribution and the output y is computed by

y = σ(w^⊤x) + ξ. (3)

Here the ξ's are i.i.d. noise terms that satisfy E[ξ] = 0. The noise ξ is also assumed to be independent of the input x. The goal is to learn the weight vector w.
The idea of the algorithm is simple: we estimate the correlation E[yx] between x and y and the covariance E[xx^⊤] of x, and then recover the hidden vector w from these two estimates. The main challenge here is that y is not a linear function of x. Goel et al. (2018) gave a crucial observation that allows us to deal with the non-linearity:
Lemma 1.
Suppose x comes from a symmetric distribution and y is computed as in (3). Then

E[yx] = (1/2)·E[xx^⊤]·w.
Importantly, the right-hand side of Lemma 1 does not contain the ReLU function σ. This is true because, if x comes from a symmetric distribution, averaging between x and −x gets rid of non-linearities like ReLU or leaky ReLU. Later we prove a more general version of this lemma (Lemma 6).
Using this lemma, it is immediate to get a method-of-moments algorithm for learning w: we just need to estimate E[yx] and E[xx^⊤], and then w = 2(E[xx^⊤])^{−1}E[yx]. This is summarized in Algorithm 1.
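A minimal sketch of this recovery (our own illustration under simplifying assumptions, not the paper's code): we take the noiseless case and symmetrize the empirical sample so that x and −x are equally likely. Then Lemma 1 holds exactly on the empirical distribution, and w = 2·(E[xx^⊤])^{−1}·E[yx] recovers the weight vector up to floating-point error.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(2)
d, n = 5, 2000
w = rng.normal(size=d)

# Exactly symmetric empirical distribution: pair every x with -x, and
# label each input with the noiseless single neuron y = relu(w.x).
X = rng.normal(size=(n, d)) + 0.5
X = np.concatenate([X, -X], axis=0)
y = relu(X @ w)

# Method of moments: w = 2 * E[x x^T]^{-1} E[y x].
Exx = X.T @ X / len(X)
Eyx = X.T @ y / len(X)
w_hat = 2.0 * np.linalg.solve(Exx, Eyx)

print(np.allclose(w_hat, w))  # True: exact recovery on a symmetrized sample
```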
3.2 Learning Two-layer Networks
In order to learn the weights of the network defined in Section 2.1, a crucial observation is that we have k outputs as well as k hidden units. This gives a possible way to reduce the two-layer problem to the single-layer problem. For simplicity, we consider the noiseless case in this section, where y = Aσ(Bx).
Let v ∈ R^k be a vector and consider v^⊤y; it is clear that v^⊤y = (A^⊤v)^⊤σ(Bx). Let u_i be the normalized version of the i-th row of A^{−1}; then u_i has the property that A^⊤u_i = c·e_i, where c is a constant and e_i is a basis vector.
The key observation here is that if A^⊤v = c·e_i, then v^⊤y = c·σ(b_i^⊤x). As a result, v^⊤y is the output of a single-layer neural network with weight vector proportional to b_i. If we know all the vectors u_1, …, u_k, the input/output pairs (x, u_i^⊤y) correspond to k single-layer networks with weight vectors proportional to b_1, …, b_k. We can then apply the algorithm in Section 3.1 (or the algorithm in Goel et al. (2018)) to learn the weight vectors.
When A^⊤v = c·e_i, we say that v^⊤y is a pure neuron. Next, we design an algorithm that can find all vectors v that generate pure neurons, thereby reducing the problem of learning a two-layer network to learning single-layer networks.
Pure Neuron Detector
In order to find the vectors v that generate pure neurons, we look for a property that holds if and only if the output v^⊤y can be represented by a single neuron.
Intuitively, using ideas similar to Lemma 1 we can get a property that holds for all pure neurons:
Lemma 2.
Suppose z = σ(w^⊤x). Then E[z²] = (1/2)·E[(w^⊤x)²] and E[zx] = (1/2)·E[xx^⊤]w. As a result, we have

E[z²] = 2·E[zx]^⊤(E[xx^⊤])^{−1}E[zx].
As before, the ReLU activation does not appear because of the symmetric input distribution. For z = v^⊤y, we can estimate all of these moments using samples and check whether this condition is satisfied. However, the problem with this property is that even if v^⊤y is not pure, it may still satisfy the property. More precisely, if z = v^⊤y is a mixture of several neurons, the identity acquires additional cross terms involving pairs of neurons. These additional terms may accidentally cancel each other, which leads to a false positive. To address this problem, we consider a higher-order moment:
Lemma 3.
Suppose z = σ(w^⊤x); then the corresponding fourth-order moment identity holds exactly, again with no contribution from the ReLU. Moreover, if z = v^⊤y, where c = A^⊤v is a k-dimensional vector, the fourth-order identity acquires an extra term of the form Σ_{i<j} c_i c_j N_{ij}. Here the N_{ij}'s are columns of the distinguishing matrix defined in Definition 1.
The important observation here is that the extra terms are multiples of the N_{ij}, which are d²-dimensional (or, considering their symmetry, d(d+1)/2-dimensional) objects. When the distinguishing matrix is full rank, its columns are linearly independent. In that case, if the sum of the extra terms is 0, then the coefficient in front of each N_{ij} must also be 0. The coefficients are c_i c_j, which are non-zero if and only if both c_i and c_j are non-zero; therefore, to make all the coefficients 0, at most one of the c_i's can be non-zero. This is summarized in the following corollary:
Corollary 1 (Pure Neuron Detector).
Define f(v) to be the difference between the two sides of the identity in Lemma 3, viewed as a function of v. Suppose the distinguishing matrix N is full rank. If f(v) = 0 for a unit vector v, then v must be equal to one of ±u_1, …, ±u_k.
We call the function f a pure neuron detector, as v^⊤y is a pure neuron if and only if f(v) = 0. Therefore, to finish the algorithm, we just need to find all solutions of f(v) = 0.
The main obstacle in solving this system of equations is that every entry of f(v) is a quadratic function of v. The system of equations is therefore a system of quadratic equations, and solving a generic system of quadratic equations is NP-hard. However, in our case this can be done by a technique very similar to the FOOBI algorithm for tensor decomposition (De Lathauwer et al., 2007). The key idea is to linearize the function by treating each degree-2 monomial v_i v_j as a separate variable. The number of variables becomes k(k+1)/2, and f is linear in this space. In other words, there exists a matrix T such that f(v) = T·vec_sym(vv^⊤). Clearly, if v^⊤y is a pure neuron, then T·vec_sym(vv^⊤) = 0. That is, the vectors vec_sym(u_i u_i^⊤) are all in the nullspace of T. Later, in Section A, we prove that the nullspace of T consists of exactly these vectors (and their linear combinations):
Lemma 4.
Let T be the unique matrix that satisfies T·vec_sym(vv^⊤) = f(v) for all v (where f is defined as in Corollary 1), and suppose the distinguishing matrix N is full rank. Then the nullspace of T is exactly the span of {vec_sym(u_i u_i^⊤) : i ∈ [k]}.
Based on Lemma 4, we can estimate T from the samples we are given; its smallest singular directions then give us the span of the vec_sym(u_i u_i^⊤)'s.
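The nullspace-recovery step can be illustrated with a toy construction (ours, not the paper's: here we fabricate a matrix T whose nullspace is, by design, the span of the vectors vec(u_i u_i^⊤), and recover that span from the right singular vectors associated with the smallest singular values):

```python
import numpy as np

rng = np.random.default_rng(3)
k, d = 3, 5

# Ground-truth directions u_i; the target nullspace is span{vec(u_i u_i^T)}.
U = rng.normal(size=(k, d))
null_basis = np.stack([np.outer(u, u).ravel() for u in U], axis=1)  # d^2 x k

# Fabricate a T whose nullspace is exactly that span: compose a random
# linear map with the projection away from the span.
Q, _ = np.linalg.qr(null_basis)
T = rng.normal(size=(40, d * d)) @ (np.eye(d * d) - Q @ Q.T)

# The right singular vectors with (near-)zero singular value span ker(T).
_, s, Vt = np.linalg.svd(T)
recovered = Vt[-k:].T                     # d^2 x k orthonormal basis

# Each vec(u_i u_i^T) lies in the recovered span (projection leaves it fixed).
proj = recovered @ (recovered.T @ null_basis)
print(np.allclose(proj, null_basis))       # True
```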
Finding the u_i's from the span of the u_i u_i^⊤'s
In order to reduce the problem to single-layer problems, the final step is to find the u_i's from the span of the u_i u_i^⊤'s. This step has also appeared in FOOBI and, more generally, in other tensor decomposition algorithms, and can be solved by simultaneous diagonalization. Let U be the matrix whose rows are the u_i^⊤'s, so that every element of the span has the form U^⊤DU for a diagonal matrix D. Let Y_1 = U^⊤D_1U and Y_2 = U^⊤D_2U be two random elements in the span, where D_1 and D_2 are two random diagonal matrices. Both matrices Y_1 and Y_2 can be diagonalized by the matrix U. In this case, if we compute Y_1Y_2^{−1}, since u_i is a column of U^⊤, we know that u_i is an eigenvector of Y_1Y_2^{−1}! The matrix can have at most k eigendirections and there are k u_i's; therefore the u_i's are the only eigenvectors of Y_1Y_2^{−1}.
Lemma 5.
Given the span of the u_i u_i^⊤'s, let Y_1 and Y_2 be two random matrices in this span. With probability 1, the u_i's are the only eigenvectors of Y_1Y_2^{−1}.
Using this procedure, we can find all the u_i's (up to permutation and sign flips). Without loss of generality, we assume u_i^⊤y = c_i·σ(b_i^⊤x). The only remaining problem is that c_i might be negative. However, this is easily fixable by checking E[u_i^⊤y]: since σ(b_i^⊤x) is always nonnegative, E[u_i^⊤y] has the same sign as c_i, and we can flip u_i if E[u_i^⊤y] is negative.
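The simultaneous-diagonalization step above can be sketched as follows (an illustration of the linear algebra, not the paper's implementation): two random elements of span{u_i u_i^⊤} have the form Y_1 = U^⊤D_1U and Y_2 = U^⊤D_2U, so Y_1Y_2^{−1} = U^⊤(D_1D_2^{−1})U^{−⊤}, whose eigenvectors are exactly the u_i's up to scale and permutation.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 4
U = rng.normal(size=(k, k))          # rows are the (unknown) directions u_i

# Two random elements of span{u_i u_i^T}: Y = sum_i d_i u_i u_i^T = U^T D U.
D1 = np.diag(rng.normal(size=k))
D2 = np.diag(rng.normal(size=k))
Y1 = U.T @ D1 @ U
Y2 = U.T @ D2 @ U

# Eigenvectors of Y1 Y2^{-1} recover the u_i's up to scale and permutation.
_, vecs = np.linalg.eig(Y1 @ np.linalg.inv(Y2))

# Match each recovered eigenvector to some row of U (up to sign/scale):
U_normed = U / np.linalg.norm(U, axis=1, keepdims=True)
cos = np.abs(U_normed @ np.real(vecs))   # |cosine| between u_i and eigenvectors
print(np.allclose(cos.max(axis=0), 1.0))  # each eigenvector aligns with some u_i
```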
3.3 Detailed Algorithm and Guarantees
We can now give the full algorithm; see Algorithm 2. The main steps are as explained in the previous section. Steps 2 - 5 construct the pure neuron detector and find the span of the vec_sym(u_i u_i^⊤)'s (as in Corollary 1); Steps 7 - 9 perform simultaneous diagonalization to get all the u_i's; Steps 11 - 12 call Algorithm 1 to solve the single-layer problems and output the result.
We are now ready to state a formal version of Theorem 1:
Suppose A, B, and the distinguishing matrix N are all full rank, and that Algorithm 2 has access to the exact moments. Then the network returned by the algorithm computes exactly the same function as the original neural network.
It is easy to prove this theorem using the lemmas we have.
By Corollary 1, we know that after Step 5 of Algorithm 2, the span of the computed nullspace is exactly equal to the span of the vec_sym(u_i u_i^⊤)'s. By Lemma 5, the eigenvectors found at Step 8 are exactly the normalized versions of the rows of A^{−1}. Without loss of generality, we fix the permutation and assume u_i^⊤y = c_i·σ(b_i^⊤x). In Step 9, we use the fact that E[u_i^⊤y] = c_i·E[σ(b_i^⊤x)], where E[σ(b_i^⊤x)] is always positive because σ is the ReLU function. Therefore, after Step 9 we can assume all the c_i's are positive.
Now u_i^⊤y = c_i·σ(b_i^⊤x) = σ(c_i·b_i^⊤x) (again by the positive-homogeneity of the ReLU function σ), so by the design of Algorithm 1, the weight vector recovered for the i-th single-layer problem is scaled by c_i, while the corresponding recovered column of the second-layer matrix is scaled by 1/c_i. These two scaling factors cancel each other, so the two networks compute the same function. ∎
4 Experiments
In this section, we provide experimental results to validate the robustness of our algorithm, both for Gaussian input distributions and for more general symmetric distributions such as symmetric mixtures of Gaussians.
There are two important ways in which our implementation differs from the description in Section 3.3. First, our description of the simultaneous diagonalization step is chosen mostly for simplicity of stating and proving the algorithm. In practice, we find it more robust to draw random samples from the subspace spanned by the last k right singular vectors of T and compute the CP decomposition of all the samples (reshaped as matrices and stacked together as a tensor) via alternating least squares (Comon et al., 2009). As alternating least squares can also be unstable, we repeat this step 10 times and select the best run. Second, once we have recovered and fixed A, we use gradient descent to learn B, which, compared to Algorithm 1, does a better job of ensuring the overall error will not explode even if there is significant error in recovering A. Crucially, these modifications are not necessary when the number of samples is large enough. For example, given 10,000 input samples drawn from a spherical Gaussian, with A and B drawn as random orthogonal matrices, our implementation of the original formulation of the algorithm was still able to recover both A and B with low error and achieve close to zero mean square error across 10 random trials.
4.1 Sample Efficiency
First, we show that our algorithm does not require a large number of samples when the matrices are not degenerate. In particular, we generate random orthonormal matrices A and B as the ground truth, and use our algorithm to learn the neural network.
As illustrated by Figure 2, regardless of the sizes of A and B, our algorithm is able to recover both weight matrices with negligible error so long as the number of samples is around 5x the number of parameters. To measure the error in recovering A and B, we first normalize the columns of A and the rows of B for both the learned parameters and the ground truth, pair corresponding columns and rows together, and then compute the squared distance between the learned and ground-truth parameters. In Figure 2 we also show the overall mean square error, averaged over all output units, achieved by our learned parameters.
4.2 Robustness to Noise
Figure 3 demonstrates the robustness of our algorithm to label noise for Gaussian and symmetric-mixture-of-Gaussians input distributions. In this experiment, we fix the sizes of both A and B and again generate both parameters as random orthonormal matrices. The overall mean square error achieved by our algorithm grows almost perfectly in step with the amount of label noise, indicating that our algorithm recovers the globally optimal solution regardless of the choice of input distribution.
4.3 Robustness to Condition Number
We have already shown that our algorithm continues to perform well across a range of input distributions and even when A and B are high-dimensional. In all previous experiments, however, we sampled A and B as random orthonormal matrices so as to control for their conditioning. In this experiment, we take the input distribution to be a random symmetric mixture of two Gaussians and vary the condition number of either A or B by sampling singular value decompositions UΣV^⊤ such that U and V are random orthonormal matrices and Σ is a diagonal matrix of singular values chosen based on the desired condition number. Figure 4 demonstrates that the performance of our algorithm remains steady so long as A and B are reasonably well-conditioned, before eventually fluctuating. Moreover, even with these fluctuations, the algorithm still recovers A and B with sufficient accuracy to keep the overall mean square error low.
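This construction can be sketched as follows (our reading of the setup; the exact singular-value profile used in the experiments is an assumption, here taken to be linear interpolation from 1 down to 1/cond):

```python
import numpy as np

def random_matrix_with_condition(n, cond, rng):
    """Sample M = U diag(s) V^T with random orthonormal U, V and singular
    values interpolating from 1 down to 1/cond, so that kappa(M) = cond."""
    U, _ = np.linalg.qr(rng.normal(size=(n, n)))
    V, _ = np.linalg.qr(rng.normal(size=(n, n)))
    s = np.linspace(1.0, 1.0 / cond, n)
    return U @ np.diag(s) @ V.T

rng = np.random.default_rng(5)
M = random_matrix_with_condition(8, 100.0, rng)
print(round(np.linalg.cond(M)))  # -> 100
```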
5 Conclusion
Optimizing the parameters of a neural network is a difficult problem, especially since the objective function depends on the input distribution, which is often unknown and can be very complicated. In this paper, we design a new algorithm using the method of moments and spectral techniques to avoid the complicated non-convex optimization for neural networks. Our algorithm can learn a network of similar complexity to those in previous works, while allowing much more general input distributions.
There are still many open problems. Besides the obvious ones of extending our results to more general distributions and more complicated networks, we are also interested in the relation to the optimization landscape of neural networks. In particular, our algorithm shows there is a way to find the globally optimal network in polynomial time. Does that imply anything about the optimization landscape of the standard objective functions for learning such a neural network, or does it imply there exists an alternative objective function that does not have any local minima? We hope this work can lead to new insights for optimizing neural networks.
References
- Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832.
- Arora et al. (2014) Arora, S., Bhaskara, A., Ge, R., and Ma, T. (2014). Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592.
- Bhaskara et al. (2014) Bhaskara, A., Charikar, M., Moitra, A., and Vijayaraghavan, A. (2014). Smoothed analysis of tensor decompositions. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 594–603. ACM.
- Brutzkus and Globerson (2017) Brutzkus, A. and Globerson, A. (2017). Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966.
- Carbery and Wright (2001) Carbery, A. and Wright, J. (2001). Distributional and L^q norm inequalities for polynomials over convex bodies in R^n. Mathematical Research Letters, 8(3):233–248.
- Comon et al. (2009) Comon, P., Luciani, X., and De Almeida, A. L. (2009). Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics: A Journal of the Chemometrics Society, 23(7-8):393–405.
- Daniely et al. (2016) Daniely, A., Frostig, R., and Singer, Y. (2016). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261.
- De Lathauwer et al. (2007) De Lathauwer, L., Castaing, J., and Cardoso, J.-F. (2007). Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965–2973.
- Du and Goel (2018) Du, S. S. and Goel, S. (2018). Improved learning of one-hidden-layer convolutional neural networks with overlaps. arXiv preprint arXiv:1805.07798.
- Du et al. (2017) Du, S. S., Lee, J. D., and Tian, Y. (2017). When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129.
- Ge et al. (2015) Ge, R., Huang, Q., and Kakade, S. M. (2015). Learning mixtures of gaussians in high dimensions. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 761–770. ACM.
- Ge et al. (2017a) Ge, R., Jin, C., and Zheng, Y. (2017a). No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708.
- Ge et al. (2017b) Ge, R., Lee, J. D., and Ma, T. (2017b). Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501.
- Goel et al. (2016) Goel, S., Kanade, V., Klivans, A., and Thaler, J. (2016). Reliably learning the relu in polynomial time. arXiv preprint arXiv:1611.10258.
- Goel and Klivans (2017) Goel, S. and Klivans, A. (2017). Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010.
- Goel et al. (2018) Goel, S., Klivans, A., and Meka, R. (2018). Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547.
- Janzamin et al. (2015) Janzamin, M., Sedghi, H., and Anandkumar, A. (2015). Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473.
- Li and Yuan (2017) Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607.
- Livni et al. (2014) Livni, R., Shalev-Shwartz, S., and Shamir, O. (2014). On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863.
- Ma et al. (2016) Ma, T., Shi, J., and Steurer, D. (2016). Polynomial-time tensor decompositions with sum-of-squares. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 438–446. IEEE.
- Rudelson and Vershynin (2009) Rudelson, M. and Vershynin, R. (2009). Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 62(12):1707–1739.
- Safran and Shamir (2018) Safran, I. and Shamir, O. (2018). Spurious local minima are common in two-layer relu neural networks. In International Conference on Machine Learning.
- Soltanolkotabi (2017) Soltanolkotabi, M. (2017). Learning relus via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017.
- Spielman and Teng (2004) Spielman, D. A. and Teng, S.-H. (2004). Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463.
- Stewart and Sun (1990) Stewart, G. and Sun, J. (1990). Matrix Perturbation Theory. Computer Science and Scientific Computing.
- Stewart (1977) Stewart, G. W. (1977). On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM review, 19(4):634–662.
- Tian (2017) Tian, Y. (2017). An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560.
- Tropp (2012) Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434.
- Zhang et al. (2017) Zhang, Y., Lee, J., Wainwright, M., and Jordan, M. (2017). On the learnability of fully-connected neural networks. In Artificial Intelligence and Statistics, pages 83–91.
- Zhang et al. (2016) Zhang, Y., Lee, J. D., and Jordan, M. I. (2016). l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001.
- Zhong et al. (2017) Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175.
Appendix A Details of Exact Analysis
In this section, we first provide the missing proofs for the lemmas that appeared in Section 3. Then we discuss how to handle the noisy case (i.e., ξ ≠ 0) and give the corresponding algorithm (Algorithm 3). At the end, we also briefly discuss how to handle the case where the matrix A has more rows than columns (more outputs than hidden units).
Again, throughout this section, when we write an expectation E[·], it is taken over the randomness of x and the noise ξ.
A.1 Missing Proofs for Section 3
Single-layer: To get rid of non-linearities like ReLU, we use the symmetry of the input distribution (similar to Goel et al. (2018)). Here we prove a more general statement (Lemma 6) instead of the specific Lemma 1. Note that Lemma 1 is the special case with m = n = 1 (the noise ξ does not affect the result, since it has zero mean and is independent of x, so E[ξx] = 0).
Lemma 6.
Suppose the input x comes from a symmetric distribution. For any vectors u, v and any non-negative integers m and n such that m + n is an even number, we have

E[σ(v^⊤x)^m (u^⊤x)^n] = (1/2)·E[(v^⊤x)^m (u^⊤x)^n],

where the expectation is taken over the input distribution.
Proof. Since the input x comes from a symmetric distribution, we know that x ∼ −x. Thus, we have

E[σ(v^⊤x)^m (u^⊤x)^n] = (1/2)·E[σ(v^⊤x)^m (u^⊤x)^n + σ(−v^⊤x)^m (−u^⊤x)^n].
There are two cases to consider: m and n are both even numbers, or both odd numbers.

For the case where m and n are both even, we have (−u^⊤x)^n = (u^⊤x)^n. If v^⊤x ≥ 0, we know σ(v^⊤x)^m + σ(−v^⊤x)^m = (v^⊤x)^m. Otherwise, we have σ(v^⊤x)^m + σ(−v^⊤x)^m = (−v^⊤x)^m = (v^⊤x)^m. Thus,

E[σ(v^⊤x)^m (u^⊤x)^n] = (1/2)·E[(v^⊤x)^m (u^⊤x)^n].

For the other case, where m and n are both odd, we have (−u^⊤x)^n = −(u^⊤x)^n. Similarly, if v^⊤x ≥ 0, we know σ(v^⊤x)^m − σ(−v^⊤x)^m = (v^⊤x)^m. Otherwise, we have σ(v^⊤x)^m − σ(−v^⊤x)^m = −(−v^⊤x)^m = (v^⊤x)^m. Thus,

E[σ(v^⊤x)^m (u^⊤x)^n] = (1/2)·E[(v^⊤x)^m (u^⊤x)^n]. ∎
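Lemma 6 can be checked numerically on any exactly symmetrized sample (an illustration of the identity, not part of the proof): pairing every x with −x makes the two sides agree up to floating-point error, with no distributional assumption beyond symmetry.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(6)
d, n = 6, 1000
v, u = rng.normal(size=d), rng.normal(size=d)

# Exactly symmetric empirical distribution: every x paired with -x.
X = rng.standard_t(df=5, size=(n, d))      # any base distribution works
X = np.concatenate([X, -X], axis=0)

for m, n_pow in [(2, 0), (1, 1), (2, 2), (3, 1)]:   # m + n even
    lhs = np.mean(relu(X @ v) ** m * (X @ u) ** n_pow)
    rhs = 0.5 * np.mean((X @ v) ** m * (X @ u) ** n_pow)
    assert np.isclose(lhs, rhs)

print("Lemma 6 identity verified on a symmetrized sample")
```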
Proof of Lemma 3. Here, we only prove the second equation, since the first equation is just a special case of it. First, we rewrite z = v^⊤y by letting c = A^⊤v. Then we transform the two terms on the LHS as follows, looking at the first term first. We have
For any vector , consider . We have
Let , we have
where the second equality holds due to Lemma 6.
Now, let’s look at the second term