Neural networks have attracted significant research interest in recent years due to the success of deep neural networks [2, 3, 4]. However, the theoretical underpinnings of this success remain largely mysterious. Efforts have been made to understand which classes of functions can be represented by deep neural networks [5, 6, 7, 8, 9], and why these networks generalize well [10, 11, 12].
One important line of research that has attracted extensive attention is the model-recovery problem, which is important for the network to generalize well . Assuming the training samples , , are generated independently and identically distributed (i.i.d.) from a distribution based on a neural network model with the ground truth parameter , the goal is to recover the underlying model parameter using the training samples. Consider a network whose output is given as . Previous studies along this topic can be mainly divided into two cases of data generations, with the input being Gaussian.
Regression, where each sample is generated as
This type of regression problem has been studied in various settings. In particular, [15] studied the one-hidden-layer multi-neuron network model, and  studied a two-layer feedforward network with ReLU activations and identity mapping.
Classification, where the label is drawn according to the conditional distribution
Such a problem has been studied in  when the network contains only a single neuron.
For both cases, previous studies attempted to recover , by minimizing an empirical loss function using the squared loss, i.e. , given the training data. Two types of statistical guarantees were provided for such model recovery problems using the squared loss. More specifically,  showed that in the local neighborhood of the ground truth , the empirical loss function is strongly convex for each given point under independent
high probability event. Hence, their guarantee for gradient descent to converge to the ground truth, assuming proper initialization, requires a fresh set of samples at every iteration, so the total sample complexity depends on the number of iterations. On the other hand, studies such as [14, 17] established strong convexity in the entire local neighborhood in a uniform sense, so that resampling per iteration is not needed for gradient descent to have guaranteed linear convergence once it enters such a local neighborhood. Clearly, the second kind of statistical guarantee, without per-iteration resampling, is much stronger and more desirable.
Fig. 1: (a) FCN; (b) CNN.
In this paper, we focus on the classification setting by minimizing the empirical loss using the cross entropy objective, which is a popular choice in training practical neural networks. The geometry as well as the model recovery problem based on the cross entropy loss function have not yet been understood even for one-hidden-layer networks. Such a loss function is much more challenging to analyze than the squared loss, not just because it is nonconvex with multiple neurons, but also because its gradient and Hessian take much more complicated forms compared with the squared loss; moreover, it is hard to control the size of gradient and Hessian due to the saturation phenomenon, i.e., when approaches or . The main focus of this paper is to develop technical analysis for guaranteed model recovery under the challenging cross entropy loss function for the classification problem for two types of one-hidden-layer network structures.
I-A Problem Formulation
We consider two popular types of one-hidden-layer nonlinear neural networks illustrated in Fig. 1, i.e., a Fully-Connected Network (FCN)  and a non-overlapping Convolutional Neural Network (CNN) . For both cases, we let be the input,
be the number of neurons, and the activation function be the sigmoid function
FCN: the network parameter is , and
Non-overlapping CNN: for simplicity we let for some integers . Let be the network parameter, and the
th stride ofbe given as . Then,
The non-overlapping CNN model can be viewed as a highly structured instance of the FCN, where the weight matrix can be written as:
In a model recovery setting, we are given training samples that are drawn i.i.d. from certain distribution regarding the ground truth network parameter (or resp. for CNN). Suppose the network input
is drawn from a standard Gaussian distribution, an assumption widely used in the literature, e.g., [14, 18, 19, 20]. Then, conditioned on , the output is mapped to via the output of the neural network, i.e.,
Our goal is to recover the network parameter, i.e., , via minimizing the following empirical loss function:
where is the cross-entropy loss function, i.e.,
where can subsume either or .
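The setup above can be sketched in a few lines of Python. This is only an illustrative sketch: the displayed equations are not reproduced here, so the averaged-sigmoid form of both networks, the patch layout of the CNN, and all function names (`fcn_output`, `cnn_output`, `cnn_as_fcn`, `sample_data`, `empirical_loss`) are assumptions made for illustration. It shows the two network outputs, the block-structured embedding of the CNN filter into an FCN weight matrix, Bernoulli label generation under Gaussian inputs, and the empirical cross-entropy loss.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fcn_output(W, x):
    # W: list of K weight vectors of length d; x: input of length d.
    # Assumed form: average of sigmoid(w_k . x) over the K neurons.
    return sum(sigmoid(sum(a * b for a, b in zip(w, x))) for w in W) / len(W)


def cnn_output(w, x):
    # w: shared filter of length m; x: input of length d = m * K.
    # Assumed form: average of sigmoid(w . patch_j) over K non-overlapping patches.
    m = len(w)
    K = len(x) // m
    patches = [x[j * m:(j + 1) * m] for j in range(K)]
    return sum(sigmoid(sum(a * b for a, b in zip(w, p))) for p in patches) / K


def cnn_as_fcn(w, K):
    # Embed the filter into K FCN weight vectors: neuron j holds w in block j, zeros elsewhere.
    m = len(w)
    return [[0.0] * (j * m) + list(w) + [0.0] * ((K - 1 - j) * m) for j in range(K)]


def sample_data(output_fn, params, d, n, rng):
    # Draw x ~ N(0, I_d) and a binary label y with P(y = 1 | x) = network output.
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = 1 if rng.random() < output_fn(params, x) else 0
        data.append((x, y))
    return data


def empirical_loss(output_fn, params, data, eps=1e-12):
    # Average cross-entropy over the samples; eps guards against log(0).
    total = 0.0
    for x, y in data:
        p = min(max(output_fn(params, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)
```

For example, `empirical_loss(cnn_output, w, sample_data(cnn_output, w, d, n, random.Random(0)))` evaluates the loss of a filter `w` on data generated from that same filter.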
I-B Our Contributions
Considering the multi-neuron classification problem with either FCN or CNN, the main contributions of this work are summarized as follows. Throughout the discussions below, we assume the number of neurons is a constant, and state the scaling only in terms of the input dimension and the number of samples .
Uniform local strong convexity: If the input is Gaussian, the empirical risk function is uniformly strongly convex in a local neighborhood of the ground truth as soon as the sample size .
Statistical and computational rate of gradient descent: consequently, if initialized in this neighborhood, gradient descent converges linearly to a critical point (which we show to exist). Due to the nature of quantized labels here, the recovery of the ground truth is only up to certain statistical accuracy. In particular, gradient descent finds the critical point with a computation cost of , where denotes the numerical accuracy and converges to at a rate of in the Frobenius norm.
Tensor initialization: We adopt the tensor method proposed in , and show that it provably provides an initialization in the neighborhood of the ground truth both for FCN and CNN. In particular, we strengthened the guarantee of the tensor method by replacing the homogeneous assumption on activation functions in  by a mild condition on the curvature of activation functions around , which holds for a larger class of activation functions including sigmoid and tanh.
We derive network-specific quantities to capture the local geometry of FCN and CNN, which imply that the geometry of CNN is more benign than that of FCN, as corroborated by the numerical experiments. To analyze the challenging cross-entropy loss function, our proof develops new machinery to exploit the statistical information of the geometric curvatures, including the gradient and Hessian of the empirical risk, and develops covering arguments to guarantee uniform concentration. To the best of our knowledge, combining the analysis of gradient descent and initialization, this work provides the first globally convergent algorithm for the recovery of one-hidden-layer neural networks using the cross-entropy loss function.
I-C Related Work
Given the scope of this paper, we focus on the most relevant literature on theoretical and algorithmic aspects of learning shallow neural networks via nonconvex optimization. The parameter recovery viewpoint is relevant to the success of nonconvex learning in signal processing problems such as matrix completion, phase retrieval, blind deconvolution, dictionary learning and tensor decomposition –, to name a few; see also the overview article . The statistical model for data generation effectively removes worst-case instances and allows us to focus on average-case performance, which often possesses much more benign geometric properties that enable global convergence of simple local search algorithms.
The studies of the one-hidden-layer network model can be further categorized into two classes: landscape analysis and model recovery. In landscape analysis, it is known that if the network size is large enough compared to the data input, then there are no spurious local minima in the optimization landscape, and all local minima are global [30, 31, 32, 33]. For the case with multiple neurons () in the under-parameterized setting, the work of Tian  studied the landscape of the population squared loss surface with ReLU activations. In particular, there exist spurious bad local minima in the optimization landscape [35, 36] even at the population level. Zhong et al.  provided several important geometric characterizations for the regression problem using a variety of activation functions and the squared loss.
In the model recovery problem, the number of neurons is smaller than the input dimension, and all the existing works discussed below assumed the squared loss and (sub-)Gaussian inputs. In the case with a single neuron (),  showed that gradient descent with a zero initialization converges linearly when the activation function is ReLU, as long as the sample complexity is for the regression problem. When the activation function is quadratic,  showed that randomly initialized gradient descent converges fast to the global optimum at a near-optimal sample complexity. On the other hand,  showed that when has bounded first, second and third derivatives, there are no critical points other than the unique global minimum (within a constrained region of interest), and (projected) gradient descent converges linearly with an arbitrary initialization, as long as the sample complexity is for the classification problem. Moreover, in the case with multiple neurons,  showed that projected gradient descent with a local initialization converges linearly for smooth activations with bounded second derivatives for the regression problem,  showed that gradient descent with tensor initialization converges linearly to a neighborhood of the ground truth using ReLU activations, and  showed the linear convergence of gradient descent with the spectral initialization using quadratic activations. For CNN with ReLU activations,  showed that gradient descent converges to the ground truth with random initialization for the population risk function based on the squared loss under Gaussian inputs. Moreover,  showed that gradient descent successfully learns a two-layer convolutional neural network despite the existence of bad local minima. From a technical perspective, our study differs from all the aforementioned work in that the cross-entropy loss function we analyze has a very different form. Furthermore, we study the model recovery classification problem in the multi-neuron case, which has not been studied before.
Finally, we note that several papers study one-hidden-layer or two-layer neural networks with different structures under Gaussian input. For example,  studied the overlapping convolutional neural network,  studied a two-layer feedforward network with ReLU activations and identity mapping, and  introduced the Porcupine Neural Network. Very recently, several papers [42, 43, 44] established global convergence of gradient descent for optimizing deep neural networks in the over-parameterized regime. These results are not directly comparable to ours since both the networks and the loss functions are different.
I-D Paper Organization and Notations
The rest of the paper is organized as follows. Section II presents the main results on local geometry and local linear convergence of gradient descent. Section III discusses the initialization based on the tensor method. Numerical examples are presented in Section IV, and finally, conclusions are drawn in Section V. Details of the technical proofs are deferred to the supplemental materials.
Throughout this paper, we use boldface letters to denote vectors and matrices, e.g. and . The transpose of is denoted by , and , denote the spectral norm and the Frobenius norm. For a positive semidefinite (PSD) matrix , we write . The identity matrix is denoted by . The gradient and the Hessian of a function are denoted by and , respectively.
as the sub-exponential norm of a random variable. We use to denote constants whose values may vary from place to place. For nonnegative functions and , means there exist positive constants and such that for all ; means there exist positive constants and such that for all .
II Gradient Descent and its Performance Guarantee
Since the loss function (4) is highly nonconvex, vanilla gradient descent with an arbitrary initialization may get stuck at local minima when estimating the network parameter. Therefore, we implement gradient descent (GD) with a well-designed initialization scheme that is described in detail in Section III. In this section, we focus on the performance of the local update rule
where is the constant step size. The algorithm is summarized in Algorithm 1.
Input: Training data , step size , iteration
Gradient Descent: for
Note that throughout the execution of GD, the same set of training samples is used, which is the standard implementation of gradient descent. Consequently, the analysis is challenging due to the statistical dependence between the iterates and the data.
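A minimal sketch of this local update rule for the FCN case follows. It assumes the averaged-sigmoid FCN output and a Bernoulli label model; the helper names (`make_data`, `loss_and_grad`, `gradient_descent`) are hypothetical, and the step size and iteration count in the usage note are arbitrary illustrative choices rather than values from the paper. The gradient is obtained from the chain rule through the cross-entropy loss and the sigmoid, and the same samples are reused at every iteration.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def make_data(W_star, n, rng):
    # x ~ N(0, I_d); y ~ Bernoulli(average of sigmoid(w_k . x)) (assumed model form).
    d = len(W_star[0])
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        p = sum(sigmoid(dot(w, x)) for w in W_star) / len(W_star)
        data.append((x, 1 if rng.random() < p else 0))
    return data


def loss_and_grad(W, data, eps=1e-12):
    # Empirical cross-entropy loss and its gradient with respect to each w_k.
    K, d, n = len(W), len(W[0]), len(data)
    loss = 0.0
    grad = [[0.0] * d for _ in range(K)]
    for x, y in data:
        acts = [sigmoid(dot(w, x)) for w in W]
        p = min(max(sum(acts) / K, eps), 1.0 - eps)
        loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
        dl_dp = -(y - p) / (p * (1.0 - p))  # derivative of the per-sample loss in p
        for k in range(K):
            # Chain rule: dp/dw_k = sigmoid'(w_k . x) * x / K.
            coef = dl_dp * acts[k] * (1.0 - acts[k]) / (K * n)
            for j in range(d):
                grad[k][j] += coef * x[j]
    return loss, grad


def gradient_descent(W0, data, step, iters):
    # Constant step size; the same training samples are used at every iteration.
    W = [list(w) for w in W0]
    history = []
    for _ in range(iters):
        loss, grad = loss_and_grad(W, data)
        history.append(loss)
        W = [[wj - step * gj for wj, gj in zip(w, g)] for w, g in zip(W, grad)]
    return W, history
```

A typical call would perturb a ground truth slightly to obtain `W0`, then run `gradient_descent(W0, data, 0.2, 100)` and inspect `history` to see the loss decrease.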
II-A Geometric properties of the networks
Definition 1 (Key quantity for FCN).
Let and define , and Define as
Definition 2 (Key quantity for CNN).
Let and define as
Note that Definition 1 for FCN is different from that in [15, Property 3.2] but consistent with [15, Lemma D.4] which removes the third term in [15, Property 3.2]. For the activation function considered in this paper, the first two terms suffice. Definition 2 for CNN is a newly distilled quantity in this paper tailored to the special structure of CNN. We depict as a function of in a certain range for the sigmoid activation in Fig. 2. It can be numerically verified that for all . Furthermore, the value of is much larger than for the same input.
II-B Uniform local strong convexity
We first characterize the local strong convexity of in a neighborhood of the ground truth. We use the Euclidean ball to denote the local neighborhood of for FCN or of for CNN.
where is the radius of the ball. With a slight abuse of notation, we drop the subscript FCN or CNN for simplicity whenever it is clear from the context that the result is for FCN when the argument is and for CNN when the argument is . Further, denotes the -th singular value of . Let the condition number be , and . The following theorem guarantees that the Hessian of the empirical risk function in the local neighborhood of the ground truth is positive definite with high probability for both FCN and CNN.
Theorem 1 (Local Strong Convexity).
For FCN, assume for all . There exist constants and such that as soon as
with probability at least , we have for all ,
For CNN, assume . There exist constants and such that as soon as
with probability at least , we have for all ,
We note that for FCN (1), all column permutations of are equivalent global minima of the loss function, and Theorem 1 applies to all such permutations of . The proof of Theorem 1 is outlined in Appendix B.
Theorem 1 guarantees that for both FCN (1) and CNN (2) the Hessian of the empirical cross-entropy loss function is positive definite in a neighborhood of the ground truth , as long as the sample size is sufficiently large. The bounds in Theorem 1 depend on the dimension parameters of the network ( and ), as well as the ground truth (, , ).
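As an informal numerical illustration of what local strong convexity means, the sketch below spot-checks the curvature of the empirical loss at random points near a ground truth. It is not the paper's procedure: it assumes the averaged-sigmoid FCN form, uses hypothetical helper names, and only samples the quadratic form of the Hessian along random unit directions via a central second difference. A positive minimum over the sampled points and directions is consistent with, but does not prove, a positive definite Hessian over the whole ball.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fcn_prob(W, x):
    # Assumed FCN output: average of sigmoid(w_k . x) over the neurons.
    return sum(sigmoid(sum(a * b for a, b in zip(w, x))) for w in W) / len(W)


def fcn_loss(W, data, eps=1e-12):
    # Empirical cross-entropy loss.
    total = 0.0
    for x, y in data:
        p = min(max(fcn_prob(W, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def make_data(W_star, n, rng):
    d = len(W_star[0])
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        data.append((x, 1 if rng.random() < fcn_prob(W_star, x) else 0))
    return data


def random_unit(K, d, rng):
    # Random K x d matrix with unit Frobenius norm.
    G = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
    norm = math.sqrt(sum(g * g for row in G for g in row))
    return [[g / norm for g in row] for row in G]


def shift(W, U, t):
    return [[w + t * u for w, u in zip(rw, ru)] for rw, ru in zip(W, U)]


def directional_curvature(W, U, data, h=1e-3):
    # Central second difference, approximating <U, Hessian(W) U>.
    return (fcn_loss(shift(W, U, h), data) - 2.0 * fcn_loss(W, data)
            + fcn_loss(shift(W, U, -h), data)) / (h * h)


def min_curvature_near(W_star, data, radius, trials, rng):
    # Smallest sampled curvature over random points in a Frobenius ball around W_star.
    K, d = len(W_star), len(W_star[0])
    worst = float("inf")
    for _ in range(trials):
        W = shift(W_star, random_unit(K, d, rng), radius * rng.random())
        worst = min(worst, directional_curvature(W, random_unit(K, d, rng), data))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    W_star = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    data = make_data(W_star, 1000, rng)
    print(min_curvature_near(W_star, data, 0.1, 20, rng))
```

The radius, sample size, and number of trials in the usage lines are arbitrary choices for illustration.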
II-C Performance Guarantees of GD
For the classification problem, due to the nature of quantized labels, is no longer a critical point of . By the strong convexity of the empirical risk function in the local neighborhood of , there can exist at most one critical point in , which is the unique local minimizer in if it exists. The following theorem shows that there indeed exists such a critical point , which is provably close to the ground truth , and gradient descent converges linearly to .
Theorem 2 (Performance Guarantees of Gradient Descent).
Assume the assumptions in Theorem 1 hold. Under the event that local strong convexity holds,
for FCN, there exists a critical point in such that
and if the initial point , GD converges linearly to , i.e.
for , where are constants;
for CNN, there exists a critical point in such that
and if the initial point , GD converges linearly to , i.e.
for , where are constants.
Similarly to Theorem 1, for FCN (1) Theorem 2 also holds for all column permutations of . The proof can be found in Appendix C. Theorem 2 guarantees the existence of a critical point in the local neighborhood of the ground truth, to which GD converges, and also shows that the critical point converges to the ground truth at the rate of for FCN (1) and for CNN (2) as the sample size increases. Therefore, can be recovered consistently as goes to infinity. Moreover, for both FCN (1) and CNN (2), gradient descent converges to (or resp. ) at a linear rate, as long as it is initialized in the basin of attraction. To achieve -accuracy, i.e. (or resp. ), it requires a computational complexity of (or resp. ), which is linear in , and .
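The link between linear convergence and the stated computational complexity can be made explicit with a short generic calculation. The symbols below are assumptions for illustration only: c in (0, 1) stands for a generic per-iteration contraction factor, W_t for the GD iterate, and \widehat{W} for the critical point; the paper's exact constants are not reproduced.

```latex
\|W_t - \widehat{W}\|_F \;\le\; c^{\,t}\,\|W_0 - \widehat{W}\|_F \;\le\; \epsilon
\quad\Longleftarrow\quad
t \;\ge\; \frac{\log\bigl(\|W_0 - \widehat{W}\|_F/\epsilon\bigr)}{\log(1/c)} .
```

Hence O(log(1/ε)) iterations suffice. Each iteration evaluates one full gradient over n samples of dimension d, which costs O(nd) operations when the number of neurons is treated as a constant, giving a total cost that is linear in n, d, and log(1/ε).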
III Initialization via Tensor Method
Our initialization adopts the tensor method proposed in . The method applies to the FCN model directly and to the CNN model with a slight modification, as presented in . We focus on the FCN case in this section and omit the CNN case for brevity since it is a straightforward extension. Below, we first briefly describe the tensor method, and then present the performance guarantee of the initialization with remarks on the differences from that in .
III-A Preliminary and Algorithm
This subsection briefly introduces the tensor method proposed in , to which a reader can refer for more details. We first define a product as follows. If is a vector and is the identity matrix, then . If is a symmetric rank- matrix factorized as and is the identity matrix, then
where , , , , and .
Define , , , and , , , as follows:
Let denote a randomly picked vector. We define and as follows: , where , and , where . (Footnote 1: See (101) in the supplemental materials for the definition.)
We further denote . The initialization algorithm based on the tensor method is summarized in Algorithm 2, which includes two major steps. Step 1 first estimates the direction of each column of by decomposing to approximate the subspace spanned by (denoted by ), then reduces the third-order tensor to a lower-dimensional tensor , and applies non-orthogonal tensor decomposition on to output the estimate , where is a random sign. Step 2 approximates the magnitude of and the sign by solving a linear system of equations. For more implementation details about Algorithm 2, e.g., the power method, we refer to .
III-B Performance Guarantee of Initialization
For the classification problem, we make the following technical assumptions, similarly to [15, Assumption 5.3] for the regression problem.
The activation function satisfies the following conditions:
If , then
At least one of and is non-zero.
Furthermore, we do not require the homogeneous assumption (i.e., for an integer ) required in , which can be restrictive. Instead, we assume the following condition on the curvature of the activation function around the ground truth, which holds for a larger class of activation functions such as sigmoid and tanh.
Let be the index of the first nonzero where . For the activation function , there exists a positive constant such that is strictly monotone over the interval , and the derivative of is lower bounded by some constant for all .
We next present the performance guarantee for the initialization algorithm in the following theorem.
The proof of Theorem 3 consists of (a) showing the estimation of the direction of is sufficiently accurate and (b) showing the approximation of the norm of is accurate enough. The proof of part (a) is the same as that in , but our argument in part (b) is different, where we relax the homogeneous assumption on activation functions. More details can be found in the supplementary materials in Appendix E.
IV Numerical Experiments
For FCN, we first implement gradient descent to verify that the empirical risk function is strongly convex in the local region around . If we initialize multiple times in such a local region, gradient descent is expected to converge to the same critical point , with the same set of training samples. Given a set of training samples, we randomly initialize multiple times, and then calculate the variance of the output of gradient descent. Denote the output of the -th run as and the mean of the runs as . The error is calculated as , where is the total number of random initializations. Adopted in , it quantifies the standard deviation of the estimator under different initializations with the same set of training samples. We say an experiment is successful if . We generate the ground truth from Gaussian matrices, and the training samples are generated using the FCN (1). Figure 3 (a) shows the success rate of gradient descent, averaged over sets of training samples for each pair of and , where and respectively. The maximum number of iterations for gradient descent is set to . It can be seen that as long as the number of samples is large enough, gradient descent converges to the same local minimum with high probability.
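The spread metric described above can be sketched as follows. The exact formula is not reproduced in the text, so the normalization used here (root of the mean squared Frobenius distance to the run average) and the function name `init_spread` are assumptions for illustration.

```python
import math


def init_spread(runs):
    # runs: list of L estimates, each a K x d list of lists, obtained from
    # different initializations on the same training samples.
    L = len(runs)
    K, d = len(runs[0]), len(runs[0][0])
    # Entrywise mean of the L estimates.
    mean = [[sum(r[k][j] for r in runs) / L for j in range(d)] for k in range(K)]
    # Sum over runs of the squared Frobenius distance to the mean.
    sq = sum((r[k][j] - mean[k][j]) ** 2
             for r in runs for k in range(K) for j in range(d))
    return math.sqrt(sq / L)
```

A value near zero indicates that all runs reached essentially the same point, which is the behavior expected when the loss is strongly convex in the region containing the initializations.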
We next show the statistical accuracy of the local minimizer found by gradient descent when it is initialized close enough to the ground truth. Suppose we initialize around the ground truth such that . We calculate the average estimation error as over Monte Carlo simulations with random initializations. Fig. 3 (b) shows the average estimation error with respect to the sample complexity when and respectively. It can be seen that the estimation error decreases gracefully as we increase the sample size and matches the theoretical prediction of error rates reasonably well.
Similarly, for CNN, we first verify that the empirical risk function is locally strongly convex using the same method as before. We generate the entries of the true weights from the standard Gaussian distribution, and generate the training samples using the CNN model (2). In Fig. 4 (a), we say an experiment is successful if , and the success rate is calculated over sets of training samples with and respectively. Then we verify the performance of gradient descent in Fig. 4 (b). Suppose we initialize in the neighborhood of , i.e., , for fixed ; the average error is calculated over Monte Carlo simulations. It can be seen that the error decreases as we increase the number of samples.
V Conclusions
In this paper, we have studied the model recovery problem of a one-hidden-layer neural network using the cross-entropy loss in a multi-neuron classification problem. In particular, we have characterized the sample complexity to guarantee local strong convexity in a neighborhood (whose size we have characterized as well) of the ground truth when the training data are generated from a classification model for two types of neural network models: fully-connected network and non-overlapping convolutional network. This guarantees that with high probability, gradient descent converges linearly to a critical point close to the ground truth if initialized properly. In the future, it will be interesting to extend the analysis in this paper to more general classes of activation functions, particularly ReLU-like activations, and to more general network structures, such as convolutional neural networks [45, 46].
Appendix A Gradient and Hessian of Population Loss
For convenience of analysis, we first provide the gradient and Hessian formulas for the cross-entropy loss using FCN and CNN.
A-A The FCN case
Consider the population loss function , where is associated with network . Hiding the dependence on for notational simplicity, we can calculate the gradient and the Hessian as
for . Here, when ,
and when ,
A-B The CNN case
For the CNN case, i.e., , the corresponding gradient and Hessian of the population loss function are given by
where when ,
and when ,
Appendix B Proof of Theorem 1
To show that the empirical loss possesses local strong convexity, we proceed in the following steps:
We first show that the Hessian of the population loss function is smooth with respect to (Lemma 1);
We then show that satisfies local strong convexity and smoothness in a neighborhood of with appropriately chosen radius, , by leveraging similar properties of (Lemma 2);
Next, we show that the Hessian of the empirical loss function is close to its population counterpart uniformly in with high probability (Lemma 3);
Finally, putting all the arguments together, we establish that satisfies local strong convexity and smoothness in .
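The final step of this outline can be read as a standard eigenvalue perturbation argument. The sketch below uses assumed symbols that are not taken from the paper: f_n and f for the empirical and population losses, μ for a lower bound on the population Hessian in the ball (of the kind provided by Lemma 2), and δ for a uniform bound on the deviation between the two Hessians (of the kind provided by Lemma 3). By Weyl's inequality, for every W in the ball,

```latex
\lambda_{\min}\bigl(\nabla^2 f_n(W)\bigr)
\;\ge\; \lambda_{\min}\bigl(\nabla^2 f(W)\bigr)
      - \bigl\|\nabla^2 f_n(W) - \nabla^2 f(W)\bigr\|
\;\ge\; \mu - \delta .
```

So the empirical loss is strongly convex throughout the ball as soon as the sample size is large enough to make δ smaller than μ, and the same deviation bound gives a matching upper bound on the largest eigenvalue.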
To begin, we first show that the Hessian of the population risk is smooth enough around in the following lemmas.
Lemma 1 (Hessian Smoothness of Population Loss).
Lemma 2 (Local Strong Convexity and Smoothness of Population Loss).
The proof is provided in Appendix D-B. The next step is to show the Hessian of the empirical loss function is close to the Hessian of the population loss function in a uniform sense, which can be summarized as follows.