Neural networks have achieved remarkable success in solving many modern machine learning problems which were previously considered to be intractable. With the use of neural networks now being wide-spread in numerous communities, the optimization of neural networks is an object of intensive study.
Common usage of neural networks involves running stochastic gradient descent (SGD) with simple non-linear activation functions, such as the extremely popular ReLU function, to learn an incredibly large set of weights. This technique has enjoyed immense success in solving complicated classification tasks with record-breaking accuracy. However, theoretically the behavior and convergence properties of SGD are very poorly understood, and few techniques are known which achieve provable bounds for the training of large neural networks. This is partially due to the hardness of the problem – there are numerous formulations where the problem is known to be NP-hard[BR92, Jud88, BDL18, MR18]. Nevertheless, given the importance and success in solving this problem in practice, it is important to understand the source of this hardness.
Typically a neural network can be written in the following form: , where is the depth of the network, is a matrix with columns corresponding to individual -dimensional input samples, and is the output labeling of . The functions are applied entry-wise to a matrix, and are typically non-linear. Perhaps the most popular activation used in practice is the ReLU, given by . Here each is an unknown linear map, representing the “weights”, which maps inputs from one layer to the next layer. In the reconstruction problem, when it is known that and are generated via the above model, the goal is to recover the matrices .
In this work, we consider the problem of learning the weights of two layer networks with a single non-linear layer. Such a network can be specified by two weight matrices and , such that, on a -dimensional input vector , the classification of the network is given by . Given a training set of examples, along with their labeling , where is a (possibly zero) noise matrix, the learning problem is to find and for which
We consider two versions of this problem. First, in the noiseless (or realizable) case, we observe precisely. In this setting, we demonstrate that exact recovery of the matrices is possible in polynomial time. Our algorithms, rather than exploiting smoothness of activation functions, exploit combinatorial properties of rectified activation functions. Additionally, we consider the more general noisy case, where we instead observe , where is a noise matrix which can satisfy various conditions. Perhaps the most common assumption in the literature [GKLW18, GLM17, JSA15] is that has mean and is sub-Gaussian. Observe that the first condition is equivalent to the statement that . While we primarily focus on designing polynomial time algorithms for this model of noise, in Section 7 we demonstrate fixed-parameter tractable (in the number of ReLUs) algorithms to learn the underlying neural network for a much wider class of noise matrices . We predominantly consider the identifiable case where has full column rank, however we also provide supplementary algorithms for the exact case when . Our algorithms are robust to the behavior of for positive , and therefore generalize beyond the ReLU to a wider class of rectified functions such that for and otherwise.
It is known that stochastic gradient descent cannot converge to the ground truth parameters when is ReLU and is orthonormal, even if we have access to an infinite number of samples [LSSS14]. This is consistent with empirical observations and theory, which states that over-parameterization is crucial to train neural networks successfully [Har14, SC16]. In contrast, in this work we demonstrate that we can approximate the optimal parameters in the noisy case, and obtain the optimal parameters exactly in the realizable case, in polynomial time, without over-parameterization. In other words, we provide algorithms that do not succumb to spurious local minima, and can converge to the global optimum efficiently, without over-parametrization.
1.1 Our Contributions
We now state our results more formally. We consider -layer neural networks with ReLU-activation functions . Such a neural network is specified by matrices and . We are given -dimensional input examples , which form the columns of our input matrix , and also give the network’s -dimensional classification of , which is , where is applied entry-wise. We note that our formulation corresponds to having one non-linear layer.
Worst Case Upper Bounds.
In the worst case setting, no properties are assumed on the inputs . While this problem is generally assumed to be intractable, we show, perhaps surprisingly, that when rank and , polynomial time exact algorithms do exist. One of our primary techniques throughout this work is the leveraging of combinatorial aspects of the ReLU function. For a row , we define a sign pattern of this row to simply be the subset of positive entries of the row. Thus, a sign pattern of a vector in is simply given by the orthant of in which it lies. We first prove an upper bound of on the number of orthants which intersect with an arbitrary -dimensional subspace of . Next, we show how to enumerate these sign patterns in time .
We use this result to give an time algorithm for the neural network learning problem in the realizable case, where for some fixed rank-k matrices . After fixing a sign pattern of , we can effectively “remove” the non-linearity of . Even so, the learning problem is still non-convex, and cannot be solved in polynomial time in the general case (even for fixed ). We show, however, that if the rank of is
, then it is possible to use a sequence of linear programs to recoverin polynomial time given the sign pattern, which allows for an overall running time. Our theorem is stated below.
Theorem 1. Given , such that and is rank , there is an algorithm that finds such that and runs in time .
Worst Case Lower Bounds.
Our upper bound relies crucially on the fact that is rank , which is full rank when . We demonstrate that an time algorithm is no longer possible without this assumption by proving the NP-hardness of the realizable learning problem when rank, which holds even for as small as . Our hardness result is as follows.
Theorem 2. For a fixed , the problem of deciding whether there exists a solution to is NP-hard even for . Furthermore, for the case for , the problem is still NP-hard when is allowed to be a variable.
Since non-convex optimization problems are known to be NP-hard in general, it is, perhaps, unsatisfying to settle for worst-case results. Typically, in the learning community, to make problems tractable it is assumed that the input data is drawn from some underlying distribution that may be unknown to the algorithm. So, in the spirit of learning problems, we make the common step of assuming that the samples in
have a standard Gaussian distribution. More generally, our algorithms work for arbitrary multi-variate Gaussian distributions over the columns of, as long as the covariance matrix is non-degenerate, i.e., full rank (see Remark 3
). In this case, our running time and sample complexity will blow up by the condition number of the covariance matrix, which we can estimate first using standard techniques. For simplicity, we state our results here for, though, for the above reasons, all of our results for Gaussian inputs extend to all full rank
Furthermore, because many of our primary results utilize the combinatorial sparsity patterns of , where is a Gaussian matrix, we do not rely on the fact that is linear for . For this reason, our results generalize easily to other non-linear rectified functions . In other words, any function given by
where is a continuous, injective function. In particular, our bounds do not change for polynomial valued for . For more details of this generalization, see Appendix 9.1. Note, however, that our worst-case, non-distributional algorithms (stated earlier), where is a fixed matrix, do not generalize to non-linear .
We first consider the noiseless setting, also referred to as the exact or realizable setting. Here is given for rank matrices and , where has non-degenerate Gaussian marginals. The goal is then to recover the weights exactly up to a permutation of their rows (since one can always permute both sets of rows without effecting the output of the network). Note that for any positive diagonal matrix , when is the ReLU. Thus recovery of is always only possible up to a permutation and positive scaling. We now state our main theorem for the exact recovery of the weights in the realizable (noiseless) setting.
Suppose where are both rank-, and such that the columns of are mean i.i.d. Gaussian. Then if , then there is a -time algorithm which recovers exactly up to a permutation of the rows with high probability.
exactly up to a permutation of the rows with high probability.
To the best of our knowledge, this is the first algorithm which learns the weights matrices of a two-layer neural network with ReLU activation exactly in the noiseless case and with Gaussian inputs . Our algorithm first obtains good approximations to the weights , and concludes by solving a system of judiciously chosen linear equations, which we solve using Gaussian elimination. Therefore, we obtain exact solutions in polynomial time, without needing to deal with convergence guarantees of continuous optimization primitives. Furthermore, to demonstrate the robustness of our techniques, we show that using results introduced in the concurrent and independent work of Ge et. al. [GKLW18], we can extend Theorem 7 to hold for inputs sampled from symmetric distributions (we refer the reader to Corollary 5). We note that [GKLW18] recovers the weight matrices up to additive error and runs in -time, whereas our algorithm has no dependency.
The runtime of our algorithm depends on the condition number of , which is a fairly ubiquitous requirement in the literature for learning neural networks, and optimization in general [GKLW18, JSA15, LSW15, CMTV17, AGMR17, ZSJ17, SJA16]. To address this dependency, in Lemma 16 we give a lower bound which shows at least a linear dependence on is necessary in the sample and time complexity.
Next, we introduce an algorithm for approximate recovery of the weight matrices when for Gaussian marginals and an i.i.d. sub-Gaussian mean-zero noise matrix
Let be given, where are rank-, is a matrix of i.i.d. mean-zero sub-Gaussian random variables with variance
is a matrix of i.i.d. mean-zero sub-Gaussian random variables with variance, and such that the columns of are i.i.d. Gaussian. Then given , there is an algorithm that runs in time and w.h.p. outputs such that
Again, to the best of our knowledge, this work is the first which learns the weights of a -layer network in this noisy setting without additional constraints, such as the restriction that be positive. Recent independent and concurrent work, using different techniques, achieves similar approximate recovery results in the noisy setting [GKLW18]. We note that the algorithm of Goel et. al. [GK17] that [GKLW18] uses, crucially requires the linearity of the ReLU for , and thus the work of [GKLW18] does not generalize to the larger class of rectified functions which we handle. We also note that the algorithm of [GLM17] requires to be non-negative. Finally, the algorithms presented in [JSA15] work for activation functions that are thrice differentiable and can only recover rows of up to scaling. Note, for the ReLU activation function, we need to resolve the signs of each row.
Fixed-Parameter Tractable Algorithms.
For several harder cases of the above problems, we are able to provide Fixed-Parameter Tractable algorithms. First, in the setting where the “labels” are vector valued, i.e., , we note prior results, not restricted to ReLU activation, require the rank of to be [GKLW18, JSA15, GLM17]. This implies that , namely, that the output dimension of the neural net is at least as large as the number
of hidden neurons. Perhaps surprisingly, however, we show that even whendoes not have full column rank, we can still recover exactly in the realizable case, as long as no two columns are non-negative scalar multiples of each other. Note that this allows for columns of the form for as long as is non-zero. Our algorithm for doing so is fixed paramater tractable in the condition number of and the number of hidden neurons . Our results rely on proving bounds on the sample complexity in order to obtain all possible sparsity patterns of the -dimensional columns of .
Suppose for for any such that no two columns of are non-negative scalar multiples of each other, and has rank, and . Then there is an algorithm which recovers exactly with high probability in time .
Furthermore, we generalize our results in the noisy setting to arbitrary error matrices , so long as they are independent of the Gaussians . In this setting, we consider a slightly different objective function, which is to find such that approximates well, where the measure is to compete against the optimal generative solution . Our results are stated below.
Theorem LABEL:thm:vershyninfinal. Let be given, where are rank-, and is any matrix independent of . Then there is an algorithm which outputs in time such that with probability we have
where is the spectral norm of .
Note that the above error bounds depend on the flatness of the spectrum of . In particular, our bounds give a approximation whenever the spectral norm of is a factor smaller than the Frobenius norm, as is in the case for a wide class of random matrices [Ver10]. When this is not the case, we can scale by , to get an -time algorithm which gives a approximation for any error matrix independent of such that .
Finally, we show that for sparse noise, when the network is low-rank we can reduce the problem to the problem of exact recovery in the noiseless case. Here, by low-rank we mean that . It has frequently been observed in practice that many pre-trained neural-networks exhibit correlation and a low-rank structure [DSD13, DZB14]. Thus, in practice it is likely that need not be as large as to well-approximate the data. For such networks, we give a polynomial time algorithm for Gaussian for exact recovery of . Our algorithm assumes that has orthonormal columns, and satisfies an incoherence property, which is fairly standard in the numerical linear algebra community [CR07, CR09, KMO10, CLMW11, JNS13, Har14]. Formally, assume where is i.i.d. Gaussian, and is obtained from the following sparsity procedure. First, fix any matrix , and randomly choose a subset of entries for some , and set them equal to . The following result states that we can exactly recover in polynomial time even when .
Theorem 13 & Corollary 8. Let be rank matrices, where has orthonormal columns, for some , and , where . Here is the condition number of . Let be generated from the -sparsity procedure with for some constant and let . Suppose the sample complexity satisfies Then on i.i.d. Gaussian input there is a time algorithm that recovers exactly up to a permutation and positive scaling with high probability.
1.2 Related Work
Recently, there has been a flurry of work developing provable algorithms for learning the weights of a neural network under varying assumptions on the activation functions, input distributions, and noise models [SJA16, ABMM16, GKKT16, MR18, ZSJ17, GKLW18, GLM17, ZSJ17, Tia17a, LY17a, BG17, Sol17, GKM18, DG18]. In addition, there have been a number of works which consider lower bounds for these problems under a similar number of varying assumptions [GKKT16, LSSS14, ZLJ16, SJA16, ABMM16, BDL18, MR18]. We describe the main approaches here, and how they relate to our problem.
Learning ReLU Networks without noise.
In the noiseless setting with Gaussian input, the results of Zhong et al. [ZSJ17]
utilize a similar strategy as ours. Namely, they first apply techniques from tensor decomposition to find a good initialization of the weights, whereafter they can be learned to a higher degree of accuracy using other methods. At this point our techniques diverge, as they utilize gradient descent on the initialized weights, and demonstrate good convergence properties forsmooth activation functions. However, their results do not give convergence guarantees for non-smooth activation functions, including the ReLU and the more general class of rectified functions considered in this work. In this work, once we are given a good initialization, we utilize combinatorial aspects of the sparsity patterns of ReLU’s, as well as solving carefully chosen linear systems, to obtain exact solutions.
Li and Yuan [LY17b] also analyize stochastic gradient descent, and demonstrate good convergence properties when the weight matrix is known to be close to the identity, and is the all ’s vector. In [Tia17b], stochastic gradient descent convergence is also analyzed when is the all ’s vector, and when is orthonormal. Moreover, [Tia17b] does not give bounds on sample complexity, and requires that a good initialization point is already given.
Learning ReLU Networks with noise.
Ge et al. [GLM17] considers learning a ReLU network with a single output dimension where is restricted to be entry-wise positive and is a zero-mean sub-Gaussian noise vector. In this setting, it is shown that the weights can be approximately learned in polynomial time when the input is i.i.d. Gaussian. However, in contrast to the algorithms in this work, the algorithm of [GLM17] relies heavily on the non-negativity of [Ge18], and thus cannot generalize to arbitrary . Janzamin, Sedghi, and Anandkumar [JSA15] utilize tensor decompositions to approximately learn the weights in the presence of mean zero sub-Gaussian noise, when the activation functions are smooth and satisfy the property that . Using similar techniques, Sedghi and Anandkumar [SJA16] provide a polynomial time algorithm to approximate the weights, if the weights are sparse.
A more recent result of Ge et al. demonstrates polynomial time algorithms for learning weights of two-layer ReLU networks in the presence of mean zero sub-gaussian noise, when the input is drawn from a mixture of a symmetric and Gaussian distribution [GKLW18]. We remark that the results of [GKLW18] were independently and concurrently developed, and utilize substantially different techniques than ours that rely crucially on the linearity of the ReLU for [Ge18]. For these reasons, their algorithms do not generalize to the larger class of rectified functions which are handled in this work. To the best of the our knowledge, for the case of Gaussian inputs, this work and [GKLW18] are the first to obtain polynomial time learning algorithms for this noisy setting.
A variety of works study learning ReLU’s in the more general agnostic learning setting, based off Valiant’s original PAC learning model [Val84]. The agnostic PAC model allows for arbitrary noisy and distributions over observations, and the goal is to output a hypothesis function which approximates the output of the neural network. Note that this does not necessarily entail learning the weights of an underlying network. For instance, Arora et al. [ABMM16] gives an algorithm with running time to minimize the empirical risk of a two-layer neural network. A closer analysis of the generalization bounds required in this algorithm for PAC learning is given in [MR18], which gives a time algorithm under the constraints that is given a fixed input, and both the input examples and the weights are restricted to being in the unit ball. In contrast, our time algorithm for general error matrices improves on their complexity whenever , and moreover can handle arbitrarily large and unknown
. We remark, however, that our loss function is different from that of the PAC model, and is in fact roughly equivalent to the empirical loss considered in[ABMM16].
Note that the above algorithms properly learn the networks. That is, they actually output weight matrices such that approximates the data well under some measure. A relaxation if this setting is improper learning, where the output of the learning algorithm can be any efficiently computable function, and not necessarily the weights of neural network. Several works have been studied that achieve polynomial running times under varying assumptions about the network parameters, such as [GKKT16, GK17]. The algorithm of [GK17], returns a “clipped” polynomial. In addition, [ZLJ16] gives polynomial time improper learning algorithms for multi-layer neural networks under several assumptions on the weights and activation functions.
Hardness results for learning networks have an extensive history in the literature [Jud88, BR92]. Originally, hardness was considered for threshold activation functions , where it is known that even for two ReLU’s the problem is NP-hard [BR92]. Very recently, there have been several concurrent and independent lower bounds developed for learning ReLU networks. The work of [BDL18] has demonstrated the hardness of a neural network with the same number of nodes as the hard network in this paper, albeit with two applications of ReLU’s (i.e., two non-linear layers) instead of one. Note that the hardness results of this work hold for even a single non-linear layer. Also concurrently and independently, a recent result of [MR18] appears to demonstrate the same NP-hardness as that in this paper, albiet using a slightly different reduction. The results of [MR18] also demonstrate that approximately learning even a single ReLU is NP-hard. In addition, there are also NP-hardness results with respects to improper learning of ReLU networks [GKKT16, LSSS14, ZLJ16] under certain complexity theoretic assumptions.
One of the main techniques of our work involves analyzing the sparsity patterns of the vectors in the rowspan of . Somewhat related reasoning has been applied by Spielman, Wang, and Wright to the dictionary learning problem [SWW12]. Here, given a matrix , the problem is to recover matrices such that , where is sparse. They argue the uniqueness of such a factorization by proving that, under certain conditions, the sparsest vectors in the row span of are the precisely rows of . This informs their later algorithm for the exact recovery of these sparse vectors using linear programming.
1.3 Our Techniques
One of the primary technical contributions of this work is the utilization of the combinatorial structure of sparsity patterns of the rows of , where is a rectified function, to solve learning problems. Here, a sparsity pattern refers to the subset of coordinates of which are non-zero, and a rectified function is one which satisfies for , and otherwise.
For instance, given where are full rank and is the ReLU, one approach to recovering the weights is to find -linearly vectors such that span precisely the rows of . Without the function , one could accomplish this by solving a linear system. Of course, the non-linearity of the activation function complicates matters significantly. Observe, however, that if the sparsity pattern of was known before hand, one could simple *remove* on the coordinates where is non-zero, and solve the linear system here. On all other coordinates, one knows that is
, and thus finding a linearly independent vector in the right row span can be solved with a linear system. Of course, naively one would need to iterate overpossible sparsity patterns before finding the correct one. However, one can show that any -dimensional subspace of can intersect at most orthants of , and moreover these orthants can be enumerated in time given the subspace. Thus the rowspan of , being -dimensional, can contain vectors with at most patterns. This is the primary obervation behind our -time algorithm for exact recovery of in the noiseless case (for arbitrary ).
As mentioned before, the prior result requires to be rank-, otherwise the row span of cannot be recovered from the row span of . We show that this difficulty is not merely a product of our specific algorithm, by demonstrating that even for as small as , if is given as input then it is NP-hard to find such that , thus ruling out any general time algorithm for the problem. For the case of , the problem is still NP-hard even when is not given as input, and is a variable.
In response to the aformentioned hardness results, we relax to the case where the input has Gaussian marginals. In the noiseless case, we exactly learn the weights given (up to a positive scaling and permutation). As mentioned, our results utilize analysis of the sparsity patterns in the row-span of . One benefit of these techniques is that they are largely insensitive to the behavior of for positive , and instead rely on the rectified property . Hence, this can include even exponential functions, and not solely the ReLU.
Our exact recovery algorithms proceed in two steps. First, we obtain an approximate version of the matrix . For a good enough approximation, we can exactly recover the sparsity pattern of . Our main insight is, roughly, that the only sparse vectors in the row span of are precisely the rows of . Specifically, we show that the only vectors in the row span which have the same sparsity pattern as a row of are scalar multiples of that row. Moreover, we show that no vector in the row span of is supported on a strict subset of the support of a given row of . Using these facts, we can then set up a judiciously designed linear system to find these vectors, which allows us to recover and then exactly. By solving linear systems, we avoid using iterative continuous optimization methods, which recover a solution up to additive error and would only provide rates of convergence in terms of . In contrast, Gaussian elimination yields exact solutions in a polynomial number of arithmetic operations.
The first step, finding a good approximation of
, can be approached from multiple angles. In this work, we demonstrate two different techniques to obtain these approximations, the first being Independent Component Analysis (ICA), and the second being tensor decomposition. To illustrate the robustness of our exact recovery procedure once a good estimate ofis known, we show in Section 4.3 how we can bootstrap the estimators of recent, concurrent and independent work [GKLW18], to improve them from approximate recovery to exact recovery.
Independent Component Analysis.
In the restricted case when is orthonormal, we show that our problem can be modeled as a special case of Independent Component Analysis (ICA). The ICA problem approximately recovers a subspace , given that the algorithm observes samples of the form , where
is i.i.d. and drawn from a distribution that has moments bounded away from Gaussians, and
is a Gaussian noise vector. Intuitively, the goal of ICA is to find a linear transformation of the data such that each of the coordinates or features are as independent as possible. By rotational invariance of Gaussians, in this caseis also i.i.d. Gaussian, and we know that the columns of have independent components and moments bounded away from a Gaussian. Thus, in the orthonormal case, our problem is well suited for the ICA framework.
A second, more general approach to approximating is to utilize techniques from tensor decomposition. Our starting point is the generative model considered by Janzamin et. al. [JSA15], which matches our setting, i.e., . The main idea behind this algorithm is to construct a tensor that is a function of both and captures non-linear correlations between them. A key step is to show that the resulting tensor has low CP-rank and the low-rank components actually capture the rows of the weight matrix . Intuitively, working with higher order tensors is necessary since matrix decompositions are only identifiable up to orthogonal components, whereas tensors have identifiable non-orthogonal components, and we are specifically interested in recovering approximations for non-orthonormal .
Next, we run a tensor decomposition algorithm to recover the low-rank components of the resulting tensor. While computing a tensor decomposition is NP-hard in general [HL13], there is a plethora of work on special cases, where computing such decompositions is tractable [BCMV14, SWZ16, WA16, GVX14, GM15, BM16]. Tensor decomposition algorithms have recently become an invaluable algorithmic primitive and with applications in statistical and machine learning [JSA15, JSA14, GLM17, AGH14, BKS15].
However, there are several technical hurdles involved in utilizing tensor decompositions to obtain estimates of . The first is that standard analysis of these methods utilizes a generalized version of Stein’s Lemma to compute the expected value of the tensor, which relies on the smoothness of the activation function. Thus, we first approximate closely using a Chebyshev polynomial on a sufficiently large domain. However, we cannot algorithmically manipulate the input to demand that instead be generated as . Instead, we add a small mean-zero Gaussian perturbation to our samples and analyze the variation distance between and . For a good enough approximation , this variation distance will be too small for any algorithm to distinguish between them, thus standard arguments imply the success of tensor decomposition algorithms when given the inputs and .
Next, a key step is to construct a non-linear transformation of the input by utilizing knowledge about the underlying density function for the distribution of , which we denote by . The non-linear function considered is the so-called Score Function, defined in [JSA14], which is the normalized
-th order derivative of the input probability distribution function. Computing the score function for an arbitrary distribution can be computationally challenging. However, as mentioned in [JSA14], we can use Hermite polynomials that help us compute a closed form for the score function, in the special case when .
A further complication arises due to the fact that this form of tensor decomposition is agnostic to the signs of . Namely, we are guaranteed vectors from tensor decomposition such that , where is some unknown sign. Prior works have dealt with this issue by considering restricted classes of smooth activation functions which satisfy [JSA15]. For such functions, one can compensate for not knowing the signs by allowing for an additional affine transformation in the neural network. Since we consider non-affine networks and rectified functions which do not satisfy this restriction, we must develop new methods to recover the signs to avoid the exponential blow-up needed to simply guess them.
For the noiseless case, if is close enough to , we can employ our previous results on the uniqueness of sparsity patterns in the row-span of . Namely, we can show that the sparsity pattern of will in fact be feasible in the row-span of , whereas the sparsity pattern of will not, from which we recover the signs via a linear system.
In the presence of noise, however, the problem becomes substantially more complicated. Because we do not have the true row-span of , but instead a noisy row-span given by , we cannot recover the ’s by feasibility arguments involving sparsity patterns. Our solution to the sign ambiguity in the noisy case is a projection-based scheme. Our scheme for determining involves constructing a dimensional subspace , spanned by vectors of the form for all . We augment this subspace as and . We then claim that the length of the projections of the rows of onto the will be smaller for than for . Thus by averaging the projections of the rows of onto these subspaces and finding the subspace which has the smaller projection length on average, we can recover the ’s with high probability. Our analysis involves bounds on projections onto perturbed subspaces, and a spectral analysis of the matrices , where is composed of up to rows of the form and .
In addition to our polynomial time algorithms, we also demonstrate how various seemingly intractable relaxations to our model, within the Gaussian input setting, can be solved in fixed-parameter tractable time in the number of hidden units, and the condition numbers of and . Our first result demonstrates that, in the noiseless case, exact recovery of is still possible even when is not rank . Note that the assumption that is rank is required in many other works on learning neural networks [GLM17, GKLW18, JSA15, SJA16]
We demonstrate that taking columns of , where is the condition number of , is sufficient to obtain -sparse vectors in the columns of . As a result, we can look for column of which are positive scalar multiples of each other, and conclude that any such pair will indeed be a positive scaling of a column of with probability . This allows for exact recovery of for any and , as long as no two columns of are positive scalar multiples of each other. Thereafter, we can recover by solving a linear system on the subset of -sparse columns of , and argue that the resulting constraint matrix is full rank. The result is a time algorithm for exact recovery of .
Our second FPT result involves a substantial generalization of the class of error matrices which we can handle. In fact, we allow arbitrary , so long as they are independent of the input . Our primary technical observation is as follows. Suppose that we were given , where is an arbitrary, possibly very large, error vector, and . Then one can look at the sign of each entry , and consider it to be a noisy observation of which side of a halfspace the vector
lies within. In other words, we couch the problem as a noisy half-space learning problem, where the half-space is given by the hyperplane normal to, and the labeling of is the sign of .
Now while the error on each entry will be large, resulting in nearly half of the labelings being flipped incorrectly, because is independent of , we are able to adapt recent techniques in noisy-halfspace learning to recover in polynomial time. In order to utilize these techniques without knowing anything about , we must first smooth out the error by adding a large Gaussian matrix. The comparatively small value of is then able to shift the observed distribution of signs sufficiently to have non-trivial correlation with the true signs. Taking polynomially many samples, our algorithms detect this correlation, which will allow for accurate recovery of .
To even obtain a matrix of the form , where is a row of , we can guess the pseudo-inverse of . To reduce the dependency on , we first sketch by a subspace-embedding , which will be a random Gaussian matrix and approximately preserve the column span of . In particular, this approximately preserves the spectrum of . The resulting matrix has
entries, and, given the maximum singular value of the inverse (which can be guessed to a factor of), can be guessed accurately enough for our purposes in time , which dominates the overall runtime of the algorithm.
In Section 2 we introduce our time exact algorithm when rank and arbitrary , for recovery of rank- matrices such that . In this section, we also demonstrate that for a very wide class of distributions for random matices , the matrix is in fact full rank with high probability, and therefore can be solved with our exact algorithm. Then, in Section 3, we prove NP-hardness of the learning problem when rank. Next, in Section 4, we give a polynomial time algorithm for exact recovery of in the case when has Gaussian marginals in the realizable setting. Section 4.1 develops our Independenct Component Analysis Based algorithm, whereas Section 4.2 develops our more general exact recovery algorithm. In Section 4.3, we show how recent concurrent results can be bootstrapped via our technqiues to obtain exact recovery for a wider class of distributions.
In Section 5, we demonstrate how to extend our algorithm to the case where where is mean i.i.d. sub-Gaussian noise. Then in Section 6, we give a fixed-paramater tractable (FPT) (in and ) for the exact recovery of in the case where does not have full column rank. We give our second FPT algorithm in Section 7, which finds weights which approximate the optimal network for arbitrary error matrices that are independent of . In Section 8, we demonstrate how the weights of certain low-rank networks, where , can be recovered exactly in the presence of a class of arbitrary sparse noise in polynomial time. Finally, in Appendix 9.1, we give further details on generalizing the ReLU to the class of rectified activation functions.
For a positive integer , we write to denote the set . We use the term with high probability (w.h.p.) in a parameter to describe an event that occurs with probability . For a real , we will often use the shorthand to denote a sufficiently large constant degree polynomial in . Since for simplicity we do not seek to analyze or optimize the polynomial running time of our algorithms, we will state many of our error bounds within technical lemmas as where constitutes some set of relevant parameters, with the understanding that this polynomial can be made arbitrarily large by increasing the sample complexity of our algorithms by a polynomial factor.
In this work we use boldface font to denote matrices, and non-boldface font to denote vectors. For a vector , we use to denote the norm of . For any matrix with rows and columns, for all , let denote the -th row of , for all let denote the -th column and let denote the -th entry of
. Further, the singular value decomposition of, denoted by , is such that is a matrix with orthonormal columns, is a matrix with orthonormal rows and is an diagonal matrix, where is the rank of . The entries along the diagonal are the singular values of , denoted by . We write to denote the Frobenius norm of , and
to denote the spectral norm. We will write to denote the
square identity matrix. We use the notationto denote the projection of the vector onto the row-span of . In other words, if , then . We now recall the condition number of a matrix .
For a rank matrix , let be the non-zero singular values of . Then the condition number of is given by
Note that if has full column rank (i.e., ), then if is the pseudo-inverse of we have and
where is the spectral norm of . Similarly if has full row rank (i.e. ), then and
A real -th order tensor is is the outer product of -dimensional Euclidean spaces. A third order tensor is defined to be rank- if where . Further, has Candecomp/Parafac (CP) rank- if it can be written as the sum of rank- tensors, i.e.,
is such that . Next, given a function , we use the notation to denote the -th order derivative operator w.r.t. the variable , such that
In the context of the ReLU activation function, a useful notion to consider is that of a sign pattern, which will be used frequently in our analysis.
For any matrix dimensions , a sign pattern is simply a subset of . For a matrix , we let be the sign pattern defined by
Intuitively, in the context of rectified activation functions, the sign pattern is an important notion since is invariant under application of , in other words . We similarly define a sparsity-pattern of a matrix as a subset of where is non-zero. Note that a sign and sparsity pattern of , taken together, specify precisely where the strictly positive, negative, and zero-valued entries are in .
We use the notation to denote the Gaussian distribution with mean and variance . More generally, we write to denote a -dimensional multi-variate Gaussian distribution with mean and variance . We make use of the -stability of the Gaussian distribution several times in this work, so we now recall the following definition of stable random variables. We refer the reader to [Ind06] for a further discussion of such distributions.
A distribution is said to be -stable if whenever are drawn independently, we have
for any fixed vector , where is again distributed as a -stable random variable. In particular, the Gaussian random variables are -stable for (i.e., , where ).
Finally, we remark that in this paper, we will work in the common real RAM model of computation, where arithmetic operations on real numbers can be performed in constant time.
2 Exact solution when rank
In this section, we consider the exact case of the neural network recovery problem. Given an input matrix of examples, and a matrix of classifications, the exact version of the recovery problem is to obtain rank- matrices such that , if such matrices exist. In this section we demonstrate the existence of an -time algorithm for exact recovery when . We demonstrate that this assumption is likely necessary in Section 3, where we show that if then the problem is NP-hard even for any when the matrix is given as input, and NP-hard for when is allowed to be a variable. This rules out the existence of a general time algorithm for this problem.
The main theorem we prove in this section is that there is an algorithm with running time dominated by such that it recovers the underlying matrices and exactly. Intuitively, we begin by showing a structural result that there are at most sign patterns that lie in the row space of and we can efficiently enumerate over them using a linear program. For a fixed sign pattern in this set, we construct a sequence of linear programs (LP) such that the -th LP finds a vector , is in the row span of , subject to the fixed sign pattern, and the constraint that is not a linear combination of . We note that being linearly independent is not a linear constraint, but we demonstrate how it can be linearized in a straightforward manner.
Crucially, our algorithm relies on the fact that we have the row-span of . Note that this is implied by the assumption that is rank . Knowing the rowspan allows us to design the constraints in the prior paragraph, and thus solve the LP to recover the rows of On the other hand, if the rank of is less than , then it no longer seems possible to efficiently determine the row span of . In fact, our NP-Hardness result of Section 3 demonstrates that, given as input, if the rank of is strictly less than , the problem of determining the exact row-span of is NP-Hard. The main result of this section is then as follows.
Given , there is an algorithm that finds such that and runs in time , if .
Let be a basis for the row-span of . For two matrices of the same dimension, we will write if the row spans of and are the same. The first step in our algorithm is to obtain a feasible set of sign patterns, within which the true sign pattern of must lie.
Given , such that , there is an algorithm which runs in time and returns a set of sign patterns with such that for any rank- matrices such that and any row , for some .
Recall, is rank . Thus there is a subset of rows of which span all the rows of . Critically, here we require that the rank of is and thus the row space of is the same as that of . Since and have the same dimensional row space, the row spaces of and are precisely the same, and so there must be an invertible change of basis matrix such that . Now note that , and thus it suffices to return a set of sign patterns which contains . Therefore, consider any fixed sign pattern , and fix a row , and consider the following feasibility linear program in the variables
Note that if the sign pattern is feasible by some , then the above LP will be feasible with a suitably large positive scaling to . Now the LP has variables and constraints, and thus a solution is obtained by choosing the that makes a subset of linearly independent constraints tight. Observe in any such LP of the above form, there are at most possible constraints that can ever occur. Thus if is realizable as the sign pattern of some , then it is obtained by the unique solution to a system which chooses to make of these constraints tight. Formally, if are the constraints for which in the LP, then a solution is given by where are a subset of of the constraints. Since there are at most such possible choices, it follows that there are at most realizable sign patterns, and these can be enumerated in time by simply checking the sign pattern which results from the solution (if one exists) to taken over all subsets of constraints of size .
Given access to the set of candidate sign patterns, , and vectors , we can define the following iterative feasibility linear program, that at each iteration finds a vector which is equal to some vector in the row span of , and such that are all linearly independent and in the row span of .
[frametitle=Algorithm 1 : Iterative LP., skipabove=, skipbelow=, linewidth=1.0pt, frametitlerule=true, ] Input: Matrix , a sign pattern , vectors such that are linearly independent.
Let be variables in .
Let be a matrix such that for all , . Construct the projection matrix onto . Note, the projection matrix is given by .
Define w.r.t. the sign pattern such that