I. Introduction
Neural networks have attracted a significant amount of research interest in recent years due to the success of deep neural networks [1] in practical domains such as computer vision and artificial intelligence [2, 3, 4]. However, the theoretical underpinnings behind such success remain mysterious to a large extent. Efforts have been made to understand which classes of functions can be represented by deep neural networks [5, 6, 7, 8], when (stochastic) gradient descent is effective for optimizing a nonconvex loss function [9], and why these networks generalize well [10, 11, 12]. One important line of research that has attracted extensive attention is the model recovery problem, which is important for the network to generalize well [13]. Assuming the training samples (x_i, y_i), i = 1, …, n, are generated independently and identically distributed (i.i.d.) from a distribution based on a neural network model with a ground truth parameter, the goal is to recover the underlying model parameter using the training samples. Denote the output of the network on input x by H(x). Previous studies along this topic can be mainly divided into two cases of data generation, with the input being Gaussian.

Regression, where each label is generated from the network output, i.e., y_i = H(x_i), possibly corrupted by additive noise.
This type of regression problem has been studied in various settings. In particular, [14] studied the single-neuron model under the Rectified Linear Unit (ReLU) activation, [15] studied the one-hidden-layer multi-neuron network model, and [16] studied a two-layer feedforward network with ReLU activations and identity mapping.
Classification, where the label is drawn according to the conditional distribution P(y_i = 1 | x_i) = H(x_i), P(y_i = 0 | x_i) = 1 − H(x_i).
Such a problem has been studied in [17] when the network contains only a single neuron.
For both cases, previous studies attempted to recover the ground truth parameter by minimizing an empirical loss function based on the squared loss, i.e., the average of (y_i − H(x_i))^2 over the training data. Two types of statistical guarantees were provided for such model recovery problems using the squared loss. More specifically, [15] showed that in the local neighborhood of the ground truth, the empirical loss function is strongly convex at each given point under an independent high-probability event. Hence, their guarantee for gradient descent to converge to the ground truth, assuming proper initialization, requires a fresh set of samples at every iteration, so the total sample complexity depends on the number of iterations. On the other hand, studies such as [17, 14] established strong convexity over the entire local neighborhood in a uniform sense, so that resampling per iteration is not needed for gradient descent to achieve guaranteed linear convergence once it enters such a neighborhood. Clearly, the second kind of statistical guarantee, without per-iteration resampling, is much stronger and more desirable.

Fig. 1: (a) FCN. (b) CNN.
In this paper, we focus on the classification setting and minimize the empirical loss under the cross-entropy objective, which is a popular choice in training practical neural networks. The geometry as well as the model recovery problem based on the cross-entropy loss function have not yet been understood even for one-hidden-layer networks. Such a loss function is much more challenging to analyze than the squared loss, not just because it is nonconvex with multiple neurons, but also because its gradient and Hessian take much more complicated forms compared with the squared loss; moreover, it is hard to control the size of the gradient and Hessian due to the saturation phenomenon, i.e., when the network output approaches 0 or 1. The main focus of this paper is to develop technical analysis for guaranteed model recovery under the challenging cross-entropy loss function for the classification problem, for two types of one-hidden-layer network structures.
I-A. Problem Formulation
We consider two popular types of one-hidden-layer nonlinear neural networks illustrated in Fig. 1, i.e., a Fully-Connected Network (FCN) [15] and a non-overlapping Convolutional Neural Network (CNN) [18]. For both cases, we let x ∈ R^d be the input, K be the number of neurons, and the activation function φ(·) be the sigmoid function φ(z) = 1/(1 + e^(−z)).

FCN: the network parameter is W = [w_1, …, w_K] ∈ R^(d×K), and
H_FCN(W; x) = (1/K) Σ_{k=1}^{K} φ(w_k^T x). (1)
Non-overlapping CNN: for simplicity we let d = Km for some integers K and m. Let w ∈ R^m be the network parameter, and the k-th stride of x be given as x^(k) = [x_{(k−1)m+1}, …, x_{km}]^T. Then,
H_CNN(w; x) = (1/K) Σ_{k=1}^{K} φ(w^T x^(k)). (2)
The non-overlapping CNN model can be viewed as a highly structured instance of the FCN, where the k-th column of the weight matrix W carries the filter w on the coordinates of the k-th stride and is zero elsewhere.
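To make this structural correspondence concrete, the following minimal numerical sketch (our own illustration; the function names and dimensions are hypothetical, not from the paper) embeds a CNN filter into an FCN weight matrix and checks that the two network outputs coincide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_fcn(W, x):
    # FCN output: average of sigmoid(w_k^T x) over the K columns of W
    return sigmoid(W.T @ x).mean()

def h_cnn(w, x):
    # Non-overlapping CNN: apply the shared filter w to each stride of x
    m, = w.shape
    strides = x.reshape(-1, m)            # K strides of length m (d = K*m)
    return sigmoid(strides @ w).mean()

rng = np.random.default_rng(0)
K, m = 4, 3
w = rng.standard_normal(m)
x = rng.standard_normal(K * m)

# Embed the CNN filter into an FCN weight matrix: column k carries w on
# the coordinates of the k-th stride and zeros elsewhere.
W = np.zeros((K * m, K))
for k in range(K):
    W[k * m:(k + 1) * m, k] = w

assert np.isclose(h_cnn(w, x), h_fcn(W, x))
```

The assertion confirms that the CNN is exactly an FCN whose weight matrix is constrained to this block pattern.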
In a model recovery setting, we are given n training samples {(x_i, y_i)}_{i=1}^{n} that are drawn i.i.d. from a certain distribution determined by the ground truth network parameter W* (resp. w* for CNN). Suppose the network input x is drawn from a standard Gaussian distribution N(0, I_d). This assumption has been widely adopted in previous literature [14, 19, 18, 20], to name a few. Then, conditioned on x, the binary label y is drawn according to the output of the neural network, i.e.,
P(y = 1 | x) = H(x), P(y = 0 | x) = 1 − H(x). (3)
Our goal is to recover the network parameter, i.e., W* (resp. w*), via minimizing the following empirical loss function:
f_n(W) = (1/n) Σ_{i=1}^{n} ℓ(W; x_i, y_i), (4)
where ℓ is the cross-entropy loss function, i.e.,
ℓ(W; x, y) = −y log(H(x)) − (1 − y) log(1 − H(x)), (5)
where the argument W can subsume either the FCN parameter W or the CNN parameter w.
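As an illustration of the data model (3) and the empirical loss (4)-(5), the following sketch (our own, with hypothetical dimensions; the clipping constant `eps` is a numerical safeguard, not part of the model) generates labels from an FCN teacher under Gaussian inputs and evaluates the empirical cross-entropy loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_fcn(W, X):
    # Network output H(x) for each row of X; one column of W per neuron
    return sigmoid(X @ W).mean(axis=1)

def empirical_cross_entropy(W, X, y, eps=1e-12):
    # Empirical loss (4): average of -y log H(x) - (1 - y) log(1 - H(x))
    p = np.clip(h_fcn(W, X), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
d, K, n = 10, 3, 2000
W_star = rng.standard_normal((d, K))       # ground truth parameter

X = rng.standard_normal((n, d))            # standard Gaussian inputs
y = rng.binomial(1, h_fcn(W_star, X))      # quantized labels per model (3)

loss_at_truth = empirical_cross_entropy(W_star, X, y)
```

Note that because the labels are quantized, the loss at the ground truth is strictly positive, which foreshadows the statistical-accuracy discussion in Section II.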
I-B. Our Contributions
Considering the multi-neuron classification problem with either FCN or CNN, the main contributions of this work are summarized as follows. Throughout the discussion below, we assume the number of neurons K is a constant, and state the scaling only in terms of the input dimension d and the number of samples n.

Uniform local strong convexity: if the input is Gaussian, the empirical risk function is uniformly strongly convex in a local neighborhood of the ground truth as soon as the sample size n exceeds the threshold given in Theorem 1.

Statistical and computational rates of gradient descent: consequently, if initialized in this neighborhood, gradient descent converges linearly to a critical point (which we show to exist). Due to the quantized nature of the labels, recovery of the ground truth is possible only up to a certain statistical accuracy. In particular, gradient descent finds the critical point at a computational cost linear in log(1/ε), where ε denotes the numerical accuracy, and the critical point converges to the ground truth in the Frobenius norm at a statistical rate that vanishes as the sample size n grows.

Tensor initialization: we adopt the tensor method proposed in [15], and show that it provably provides an initialization in the neighborhood of the ground truth for both FCN and CNN. In particular, we strengthen the guarantee of the tensor method by replacing the homogeneity assumption on activation functions in [15] with a mild condition on the curvature of the activation function around the ground truth, which holds for a larger class of activation functions including the sigmoid and tanh.
We derive network-specific quantities to capture the local geometry of FCN and CNN, which imply that the geometry of CNN is more benign than that of FCN, as corroborated by the numerical experiments. In order to analyze the challenging cross-entropy loss function, our proof develops various new machinery to exploit the statistical information of the geometric curvatures, including the gradient and Hessian of the empirical risk, and to develop covering arguments that guarantee uniform concentration. To the best of our knowledge, by combining the analysis of gradient descent and initialization, this work provides the first globally convergent algorithm for the recovery of one-hidden-layer neural networks using the cross-entropy loss function.
I-C. Related Work
Given the scope of this paper, we focus on the most relevant literature on theoretical and algorithmic aspects of learning shallow neural networks via nonconvex optimization. The parameter recovery viewpoint is relevant to the success of nonconvex learning in signal processing problems such as matrix completion, phase retrieval, blind deconvolution, dictionary learning and tensor decomposition [21]–[28], to name a few; see also the overview article [29]. The statistical model for data generation effectively removes worst-case instances and allows us to focus on average-case performance, which often possesses much more benign geometric properties that enable global convergence of simple local search algorithms.
Studies of the one-hidden-layer network model can be further categorized into two classes: landscape analysis and model recovery. In landscape analysis, it is known that if the network size is large enough compared to the data input, then there are no spurious local minima in the optimization landscape, and all local minima are global [30, 31, 32, 33]. For the case with multiple neurons in the under-parameterized setting, the work of Tian [34] studied the landscape of the population squared loss surface with ReLU activations. In particular, spurious bad local minima exist in the optimization landscape [35, 36] even at the population level. Zhong et al. [15] provided several important geometric characterizations for the regression problem using a variety of activation functions and the squared loss.
In the model recovery problem, the number of neurons is smaller than the input dimension, and all the existing works discussed below assumed the squared loss and (sub-)Gaussian inputs. In the case with a single neuron, [14] showed that gradient descent with zero initialization converges linearly for the regression problem when the activation function is the ReLU, under a suitable sample complexity. When the activation function is quadratic, [37] showed that randomly initialized gradient descent converges fast to the global optimum at a near-optimal sample complexity. On the other hand, [17] showed that when the activation function has bounded first, second and third derivatives, there are no critical points other than the unique global minimum (within a constrained region of interest), and (projected) gradient descent converges linearly from an arbitrary initialization for the classification problem, again under a suitable sample complexity. Moreover, in the case with multiple neurons, [19] showed that projected gradient descent with a local initialization converges linearly for smooth activations with bounded second derivatives for the regression problem, [38] showed that gradient descent with tensor initialization converges linearly to a neighborhood of the ground truth using ReLU activations, and [39] showed the linear convergence of gradient descent with spectral initialization using quadratic activations. For CNN with ReLU activations, [18] showed that gradient descent converges to the ground truth with random initialization for the population risk function based on the squared loss under Gaussian inputs. Moreover, [20] showed that gradient descent successfully learns a two-layer convolutional neural network despite the existence of bad local minima. From a technical perspective, our study differs from all the aforementioned work in that the cross-entropy loss function we analyze has a very different form.
Furthermore, we study the model recovery classification problem in the multi-neuron case, which has not been studied before.
Finally, we note that several papers study one-hidden-layer or two-layer neural networks with different structures under Gaussian input. For example, [40] studied an overlapping convolutional neural network, [16] studied a two-layer feedforward network with ReLU activations and identity mapping, and [41] introduced the Porcupine Neural Network. Very recently, several papers [42, 43, 44] established global convergence of gradient descent for optimizing deep neural networks in the over-parameterized regime. These results are not directly comparable to ours since both the networks and the loss functions are different.
I-D. Paper Organization and Notations
The rest of the paper is organized as follows. Section II presents the main results on local geometry and local linear convergence of gradient descent. Section III discusses the initialization based on the tensor method. Numerical examples are presented in Section IV, and finally, conclusions are drawn in Section V. Details of the technical proofs are deferred to the supplemental materials.
Throughout this paper, we use boldface letters to denote vectors and matrices, e.g., x and W. The transpose of W is denoted by W^T, and ||W|| and ||W||_F denote the spectral norm and the Frobenius norm, respectively. For a positive semidefinite (PSD) matrix A, we write A ⪰ 0. The identity matrix is denoted by I. The gradient and the Hessian of a function f are denoted by ∇f and ∇²f, respectively. Denote ||·||_{ψ_1} as the sub-exponential norm of a random variable. We use c, C, c_1, c_2, … to denote constants whose values may vary from place to place. For nonnegative functions f(x) and g(x), f(x) ≳ g(x) means there exist positive constants c and a such that f(x) ≥ c·g(x) for all x ≥ a; f(x) ≲ g(x) means there exist positive constants C and a such that f(x) ≤ C·g(x) for all x ≥ a.

II. Gradient Descent and its Performance Guarantee
To estimate the network parameter, since (4) is a highly nonconvex function, vanilla gradient descent with an arbitrary initialization may get stuck at local minima. Therefore, we implement gradient descent (GD) with a well-designed initialization scheme that is described in detail in Section III. In this section, we focus on the performance of the local update rule
W_{t+1} = W_t − η ∇f_n(W_t),
where η is the constant step size. The algorithm is summarized in Algorithm 1.
Input: training data {(x_i, y_i)}_{i=1}^{n}, step size η, number of iterations T.
Initialization: W_0 obtained via the tensor method (Algorithm 2).
Gradient Descent: for t = 0, 1, …, T − 1, update W_{t+1} = W_t − η ∇f_n(W_t).
Output: W_T.
Note that throughout the execution of GD, the same set of training samples is used, which is the standard implementation of gradient descent. Consequently, the analysis is challenging due to the statistical dependence of the iterates on the data.
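The update rule above can be sketched as follows for the FCN model (a minimal illustration of our own; the step size, iteration count, and initialization radius are arbitrary choices, and the gradient is derived by the standard chain rule rather than taken from the paper's appendix).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(W, X, y, eps=1e-12):
    # Empirical cross-entropy loss (4) and its gradient for the FCN model.
    A = sigmoid(X @ W)                     # per-neuron activations, n x K
    p = np.clip(A.mean(axis=1), eps, 1 - eps)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Chain rule: dl/dp = (p - y)/(p(1-p)); dp/dw_k = sigma'(w_k^T x) x / K
    dp = (p - y) / (p * (1 - p))
    G = X.T @ (dp[:, None] * A * (1 - A)) / (X.shape[0] * W.shape[1])
    return loss, G

def gradient_descent(W0, X, y, eta=0.1, T=500):
    # Algorithm 1: the SAME training set is reused at every iteration.
    W = W0.copy()
    for _ in range(T):
        _, G = loss_and_grad(W, X, y)
        W -= eta * G
    return W

rng = np.random.default_rng(1)
d, K, n = 8, 2, 5000
W_star = rng.standard_normal((d, K))
X = rng.standard_normal((n, d))
y = rng.binomial(1, sigmoid(X @ W_star).mean(axis=1)).astype(float)

W0 = W_star + 0.1 * rng.standard_normal((d, K))   # local initialization
W_hat = gradient_descent(W0, X, y)
```

The local initialization mimics the role of the tensor method; in practice W0 would come from Algorithm 2 rather than from a perturbation of the (unknown) ground truth.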
II-A. Geometric Properties of the Networks
Before stating our main results, we first introduce an important quantity that captures the geometric properties of the loss function for the neural networks (1) and (2).
Definition 1 (Key quantity for FCN). For σ > 0, define ρ(σ) as the curvature quantity obtained from [15, Property 3.2] by dropping its third term (cf. [15, Lemma D.4]).
Definition 2 (Key quantity for CNN). For σ > 0, define ρ(σ) as the analogous curvature quantity tailored to the structure of the CNN.
Note that Definition 1 for FCN is different from [15, Property 3.2] but consistent with [15, Lemma D.4], which removes the third term in [15, Property 3.2]. For the activation function considered in this paper, the first two terms suffice. Definition 2 for CNN is a newly distilled quantity in this paper tailored to the special structure of the CNN. We depict ρ(σ) as a function of σ over a certain range for the sigmoid activation in Fig. 2. It can be numerically verified that ρ(σ) > 0 over this range. Furthermore, the value of ρ(σ) for CNN is much larger than that for FCN at the same input.
II-B. Uniform Local Strong Convexity
We first characterize the local strong convexity of f_n in a neighborhood of the ground truth. We use a Euclidean ball to denote the local neighborhood of W* for FCN, or of w* for CNN:
B_FCN(W*, r) = { W : ||W − W*||_F ≤ r }, (6a)
B_CNN(w*, r) = { w : ||w − w*||_2 ≤ r }, (6b)
where r is the radius of the ball. With slight abuse of notation, we will drop the subscript FCN or CNN for simplicity whenever it is clear from the context that the result is for FCN when the argument is W and for CNN when the argument is w. Further, σ_k(W*) denotes the k-th singular value of W*, and the condition number is κ = σ_1(W*)/σ_K(W*). The following theorem guarantees that the Hessian of the empirical risk function in the local neighborhood of the ground truth is positive definite with high probability for both FCN and CNN.

Theorem 1 (Local Strong Convexity).
Consider the classification model with FCN (1) or CNN (2) and the sigmoid activation function.

For FCN, assume the norms of the columns of W* are uniformly bounded. There exist positive constants C_1 and C_2 such that, as soon as the sample size n is sufficiently large, with high probability the Hessian ∇²f_n(W) is positive definite, with a strictly positive lower bound, uniformly over all W ∈ B(W*, r).

For CNN, assume ||w*|| is bounded. There exist positive constants C_3 and C_4 such that, as soon as the sample size n is sufficiently large, with high probability the Hessian ∇²f_n(w) is positive definite, with a strictly positive lower bound, uniformly over all w ∈ B(w*, r).
We note that for FCN (1), all column permutations of W* are equivalent global minima of the loss function, and Theorem 1 applies to all such permutations of W*. The proof of Theorem 1 is outlined in Appendix B.

Theorem 1 guarantees that for both FCN (1) and CNN (2), the Hessian of the empirical cross-entropy loss function is positive definite in a neighborhood of the ground truth, as long as the sample size is sufficiently large. The bounds in Theorem 1 depend on the dimension parameters of the network (d and K), as well as on the ground truth (e.g., its singular values and condition number).
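The positive-definiteness claim can be probed numerically on a toy instance. The sketch below (our own, with hypothetical small dimensions) forms a finite-difference Hessian of the empirical cross-entropy loss at a point near the ground truth and inspects its smallest eigenvalue.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def empirical_loss(theta, X, y, K, eps=1e-9):
    # Empirical cross-entropy loss (4) with the parameter flattened
    W = theta.reshape(-1, K)
    p = np.clip(sigmoid(X @ W).mean(axis=1), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fd_hessian(f, theta, h=1e-4):
    # Central finite-difference Hessian of f at theta
    D = theta.size
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            def g(di, dj):
                s = theta.copy()
                s[i] += di
                s[j] += dj
                return f(s)
            H[i, j] = (g(h, h) - g(h, -h) - g(-h, h) + g(-h, -h)) / (4 * h * h)
    return (H + H.T) / 2          # symmetrize against numerical noise

rng = np.random.default_rng(2)
d, K, n = 4, 2, 4000
W_star = rng.standard_normal((d, K))
X = rng.standard_normal((n, d))
y = rng.binomial(1, sigmoid(X @ W_star).mean(axis=1)).astype(float)

theta = W_star.ravel() + 0.05 * rng.standard_normal(d * K)   # near W*
H = fd_hessian(lambda t: empirical_loss(t, X, y, K), theta)
min_eig = np.linalg.eigvalsh(H).min()
```

With sufficiently many samples, `min_eig` is expected to be positive in line with Theorem 1, although a single random draw is only an illustration, not a proof.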
II-C. Performance Guarantees of GD
For the classification problem, due to the quantized nature of the labels, the ground truth W* is no longer a critical point of the empirical loss f_n. By the strong convexity of the empirical risk function in the local neighborhood of the ground truth, there can exist at most one critical point in B(W*, r), which is the unique local minimizer in B(W*, r) if it exists. The following theorem shows that there indeed exists such a critical point W_n, that it is provably close to the ground truth W*, and that gradient descent converges linearly to W_n.
Theorem 2 (Performance Guarantees of Gradient Descent).
Assume the assumptions in Theorem 1 hold. Under the event that local strong convexity holds:

for FCN, there exists a critical point W_n in B(W*, r) whose distance to W* is bounded by the statistical error rate, and if the initial point W_0 ∈ B(W*, r), GD with a suitable constant step size converges linearly to W_n, i.e., the distance of the iterates to W_n contracts geometrically;

for CNN, there exists a critical point w_n in B(w*, r) whose distance to w* is bounded by the statistical error rate, and if the initial point w_0 ∈ B(w*, r), GD with a suitable constant step size converges linearly to w_n.
Similarly to Theorem 1, for FCN (1), Theorem 2 also holds for all column permutations of W*. The proof can be found in Appendix C. Theorem 2 guarantees the existence of a critical point in the local neighborhood of the ground truth, to which GD converges, and also shows that the critical point converges to the ground truth as the sample size n increases; therefore, the ground truth can be recovered consistently as n goes to infinity. Moreover, for both FCN (1) and CNN (2), gradient descent converges to W_n (resp. w_n) at a linear rate, as long as it is initialized in the basin of attraction. To achieve ε accuracy, the required computational complexity is linear in log(1/ε).
III. Initialization via the Tensor Method
Our initialization adopts the tensor method proposed in [15]. The initialization method works for the FCN model, and it also works for the CNN model with a slight modification as presented in [45]. We focus on the FCN case in this section and omit the CNN case for brevity, since it is a straightforward extension. Below, we first briefly describe the tensor method, and then present the performance guarantee of the initialization, with remarks on the differences from [15].
III-A. Preliminary and Algorithm
This subsection briefly introduces the tensor method proposed in [15], to which the reader may refer for more details. We first define a product between a vector (or a symmetric low-rank matrix given in factorized form) and the identity matrix; the precise expressions follow [15] and are omitted here due to the involved notation.
Definition 3. Define the moments M_1, M_2, M_3, M_4 and the scalars m_1, m_2, m_3, m_4 as in [15]; they are population moments of the network output weighted by tensor powers of the input.
Definition 4. Let α denote a randomly picked vector. Define P_2 and P_3 as the second- and third-order moments obtained from the quantities in Definition 3 probed by α (see (101) in the supplemental materials for the precise definition).
We further denote by V the estimated subspace basis. The initialization algorithm based on the tensor method is summarized in Algorithm 2, which includes two major steps. Step 1 first estimates the direction of each column of W* by decomposing P_2 to approximate the subspace spanned by the columns of W* (denoted by V), then reduces the third-order tensor P_3 to a lower-dimensional tensor, and applies non-orthogonal tensor decomposition on it to output direction estimates up to random signs. Step 2 approximates the magnitudes of the columns of W* and the signs by solving a linear system of equations. For more implementation details of Algorithm 2, e.g., the power method, we refer the reader to [15].
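The flavor of the subspace-estimation step can be illustrated with a simplified stand-in of our own, which is not the paper's exact construction of P_2: by Stein's identity, the probed third-order Gaussian moment E[y((α^T x) x x^T − x α^T − α x^T − (α^T x) I)] equals a weighted sum of w_k* w_k*^T, so its top eigenvectors estimate the subspace spanned by the ground-truth columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d, K, n = 20, 3, 100000
W_star = rng.standard_normal((d, K))
X = rng.standard_normal((n, d))
y = rng.binomial(1, sigmoid(X @ W_star).mean(axis=1)).astype(float)

alpha = rng.standard_normal(d)
alpha /= np.linalg.norm(alpha)           # random probe vector

# Empirical version of E[y((a^T x) x x^T - x a^T - a x^T - (a^T x) I)].
# By Stein's identity its population version lies in span{w_k* w_k*^T}.
ax = X @ alpha
Eyx = (y[:, None] * X).mean(axis=0)      # estimate of E[y x]
M = (X.T @ ((y * ax)[:, None] * X)) / n \
    - np.outer(Eyx, alpha) - np.outer(alpha, Eyx) \
    - (y * ax).mean() * np.eye(d)

eigvals, eigvecs = np.linalg.eigh(M)
idx = np.argsort(-np.abs(eigvals))[:K]
V = eigvecs[:, idx]                      # estimated orthonormal basis

# Residual of projecting the true columns onto span(V); smaller is better
resid = W_star - V @ (V.T @ W_star)
rel_err = np.linalg.norm(resid) / np.linalg.norm(W_star)
```

A third-order (rather than second-order) moment is needed here because the sigmoid's even symmetry makes the plain second-order Stein moment vanish under symmetric Gaussian inputs; the actual Algorithm 2 uses the quantities in Definitions 3-4 instead of this surrogate.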
III-B. Performance Guarantee of Initialization
For the classification problem, we make the following technical assumptions, similar to [15, Assumption 5.3] for the regression problem.
Assumption 1.
The activation function satisfies the non-degeneracy conditions of [15, Assumption 5.3] on the moment quantities in Definition 3; in particular, at least one of a certain pair of these quantities is nonzero.
Furthermore, we do not require the homogeneity assumption (i.e., φ(az) = a^p φ(z) for all a > 0 and some integer p) required in [15], which can be restrictive. Instead, we assume the following condition on the curvature of the activation function around the ground truth, which holds for a larger class of activation functions including the sigmoid and tanh.
Assumption 2.
Let j be the index of the first nonzero quantity in Definition 3. For the activation function φ, there exists a positive constant δ such that the associated moment function is strictly monotone over an interval of length δ, and its derivative is lower bounded by some positive constant over that interval.
We next present the performance guarantee for the initialization algorithm in the following theorem.
Theorem 3.
The proof of Theorem 3 consists of (a) showing that the estimate of the direction of each column of W* is sufficiently accurate, and (b) showing that the approximation of the norm of each column is accurate enough. The proof of part (a) is the same as that in [15], but our argument in part (b) is different: there we relax the homogeneity assumption on the activation function. More details can be found in the supplementary materials in Appendix E.
IV. Numerical Experiments
For FCN, we first implement gradient descent to verify that the empirical risk function is strongly convex in a local region around W*. If we initialize multiple times in such a local region, it is expected that gradient descent converges to the same critical point with the same set of training samples. Given a set of training samples, we randomly initialize multiple times, and then calculate the variance of the outputs of gradient descent. Denote the output of the m-th run as W_m and let the mean over the M runs be their average. The error is calculated as the standard deviation of the outputs {W_m} around this mean, where M is the total number of random initializations. As adopted in [17], this quantifies the deviation of the estimator under different initializations with the same set of training samples. We say an experiment is successful if this error is negligibly small. We generate the ground truth from Gaussian matrices, and the training samples are generated using the FCN (1). Fig. 3 (a) shows the success rate of gradient descent, averaged over sets of training samples, for each pair of the input dimension d and the sample size n. The maximum number of iterations for gradient descent is fixed. It can be seen that as long as the sample size is large enough, gradient descent converges to the same local minimizer with high probability.
We next examine the statistical accuracy of the local minimizer to which gradient descent converges when it is initialized close enough to the ground truth. Suppose we initialize in a small neighborhood of the ground truth. We calculate the average estimation error over Monte Carlo simulations with random initializations. Fig. 3 (b) shows the average estimation error with respect to the sample size n for several values of the input dimension d. It can be seen that the estimation error decreases gracefully as we increase the sample size, and matches the theoretical prediction of the error rates reasonably well.
Similarly, for CNN, we first verify that the empirical risk function is locally strongly convex using the same method as before. We generate the entries of the true weights from the standard Gaussian distribution, and generate the training samples using the CNN model (2). In Fig. 4 (a), we say an experiment is successful if the deviation across initializations is negligibly small, and the success rate is calculated over sets of training samples for several pairs of the filter size and the sample size. We then verify the performance of gradient descent in Fig. 4 (b). Suppose we initialize in a neighborhood of w*; for a fixed filter size, the average error is calculated over Monte Carlo simulations. It can be seen that the error decreases as we increase the number of samples.
V. Conclusions
In this paper, we have studied the model recovery problem of a one-hidden-layer neural network using the cross-entropy loss in a multi-neuron classification setting. In particular, we have characterized the sample complexity that guarantees local strong convexity in a neighborhood (whose size we have also characterized) of the ground truth when the training data are generated from a classification model, for two types of neural network models: the fully-connected network and the non-overlapping convolutional network. This guarantees that, with high probability, gradient descent converges linearly to a small neighborhood of the ground truth if initialized properly. In the future, it will be interesting to extend the analysis in this paper to a more general class of activation functions, particularly ReLU-like activations, and to more general network structures, such as convolutional neural networks [46, 45].
Appendix A Gradient and Hessian of Population Loss
For the convenience of analysis, we first provide the gradient and Hessian formulas for the cross-entropy loss under the FCN and CNN models.
A-A. The FCN Case
Consider the population loss function f(W) = E[ℓ(W; x, y)], where the expectation is over the data distribution associated with the network (1). Hiding the dependence on (x, y) for notational simplicity, the gradient and the Hessian can be calculated in closed form, given in (8) and (9), for each pair of neuron indices; the blocks of the Hessian take different forms depending on whether the two neuron indices coincide.
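As a sanity check on such computations, the per-sample cross-entropy gradient for the FCN model follows from the chain rule as dℓ/dw_k = (H − y)/(H(1 − H)) · φ′(w_k^T x) x / K, and can be verified against finite differences. This derivation is our own and is only meant to be consistent in spirit with (8), not a reproduction of it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, x, y):
    # Per-sample cross-entropy loss (5) for the FCN model
    p = sigmoid(W.T @ x).mean()
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def analytic_grad(W, x, y):
    # Chain rule: dl/dp = (p - y)/(p(1-p)); dp/dw_k = sigma'(w_k^T x) x / K
    a = sigmoid(W.T @ x)
    p = a.mean()
    dp = (p - y) / (p * (1 - p))
    return dp * np.outer(x, a * (1 - a)) / W.shape[1]

rng = np.random.default_rng(5)
d, K = 5, 3
W = rng.standard_normal((d, K))
x = rng.standard_normal(d)
y = 1.0

# Central finite-difference check of the gradient, entry by entry
G = analytic_grad(W, x, y)
h = 1e-6
G_fd = np.zeros_like(W)
for i in range(d):
    for k in range(K):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, k] += h
        Wm[i, k] -= h
        G_fd[i, k] = (loss(Wp, x, y) - loss(Wm, x, y)) / (2 * h)

assert np.allclose(G, G_fd, atol=1e-5)
```

The same check applies to the CNN case after replacing the per-neuron weights with strides of a shared filter.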
A-B. The CNN Case
For the CNN case, i.e., the model (2), the corresponding gradient and Hessian of the population loss function are given in (10) and (11), where again the expressions differ depending on whether the two stride indices coincide.
Appendix B Proof of Theorem 1
In order to show that the empirical loss possesses local strong convexity, we proceed in the following steps:

We first show that the Hessian of the population loss function is smooth with respect to the parameter (Lemma 1);

We then show that the population loss satisfies local strong convexity and smoothness in a neighborhood of the ground truth with an appropriately chosen radius, by leveraging the corresponding properties at the ground truth (Lemma 2);

Next, we show that the Hessian of the empirical loss function is close to its population counterpart, uniformly over the neighborhood, with high probability (Lemma 3).

Finally, putting all the arguments together, we establish that the empirical loss satisfies local strong convexity and smoothness in the neighborhood.
To begin, we first show that the Hessian of the population risk is sufficiently smooth around the ground truth, in the following lemma.
Lemma 1 (Hessian Smoothness of Population Loss).
The proof is provided in Appendix D-A. Together with the fact that the Hessian of the population loss at the ground truth can be lower and upper bounded, Lemma 1 allows us to bound the population Hessian in a neighborhood of the ground truth, as given below.
Lemma 2 (Local Strong Convexity and Smoothness of Population Loss).
The proof is provided in Appendix D-B. The next step is to show that the Hessian of the empirical loss function is uniformly close to the Hessian of the population loss function, which can be summarized as follows.