1 Introduction
Deep neural networks have recently boosted the notion of “deep learning from data,” with field-changing performance improvements reported in numerous machine learning and artificial intelligence tasks [21, 13]. Despite their widespread use as well as numerous recent contributions, our understanding of how and why neural networks (NNs) achieve this success remains limited. While their expressivity (expressive power) has been well argued [37, 38], the research focus has shifted toward addressing the computational challenges of training such models and understanding their generalization behavior.
From the vantage point of optimization, training deep NNs requires dealing with extremely high-dimensional and nonconvex problems, which are NP-hard in the worst case. It has been shown that even training a two-layer NN of three nodes is NP-complete [3], and the loss function associated with even a single neuron exhibits exponentially many local minima [2]. It is therefore not clear whether and how we can provably yet efficiently train a NN to global optimality.
Nevertheless, as often evidenced by empirical tests, these NN architectures can be ‘successfully’ trained by means of simple local search heuristics, such as ‘plain-vanilla’ (stochastic) gradient descent ((S)GD) on real or randomly generated data. Considering the over-parameterized setting in particular, where the NNs have far more parameters than training samples, SGD can often successfully train these networks while exhibiting favorable generalization performance without overfitting [31]. As an example, the celebrated VGG-19 net with 20 million parameters trained on the CIFAR-10 dataset of 50 thousand data samples achieves state-of-the-art classification accuracy, and also generalizes well to other datasets [42]. In addition, training NNs by e.g., adding noise to the training samples [46], or to the (stochastic) gradients during backpropagation [30], has well-documented merits in enhancing generalization performance, as well as in avoiding bad local minima [46]. In this contribution, we take a further step toward understanding the analytical performance of NNs, by providing fundamental insights into the optimization landscape and generalization capability of NNs trained by means of SGD with properly injected noise.
For concreteness, we address these challenges in a binary classification setting, where the goal is to train a two-layer ReLU network on linearly separable data. Although a nonlinear NN is clearly not necessary for classifying linearly separable data, since a linear classifier such as the Perceptron would do [39], the fundamental question we target here is whether and how one can efficiently train a ReLU network to global optimality, despite the presence of infinitely many local minima, maxima, and saddle points [23]. Separable data have also been used in recent works [45, 23, 29, 26, 5, 48]. The motivation behind employing separable data is twofold. First, they afford a zero training loss, which makes it possible to tell whether a NN has been successfully trained or not (as most loss functions for training NNs are nonconvex, it is in general difficult to verify global optimality). Second, separable data enable improvement of the plain-vanilla SGD by leveraging the power of random noise in a principled manner, so that the modified SGD algorithm can provably escape local minima and saddle points efficiently, and converge to a global minimum in a finite number of nonzero updates. We further investigate the generalization capability of successfully trained ReLU networks leveraging compression bounds [27]. Thus, the binary classification setting offers a favorable testbed for studying the effect of training noise on avoiding overfitting when learning ReLU networks. Although the focus of this paper is on two-layer networks, our novel algorithm and theoretical results can shed light on developing reliable training algorithms for deep networks, as well as on understanding their generalization.
In a nutshell, the main contributions of the present work are:

A simple SGD algorithm that can provably escape local minima and saddle points to efficiently train any twolayer ReLU network to attain global optimality;

Theoretical and empirical evidence supporting the injection of noise during training NNs to escape bad local minima and saddle points; and

Tight generalization error bounds and guarantees for (possibly overparameterized) ReLU networks optimally trained with the novel SGD algorithm.
The remainder of this paper is structured as follows. Section 2 reviews related contributions. Section 3 introduces the binary classification setting, and the problem formulation. Section 4 presents the novel SGD algorithm, and establishes its theoretical performance. Section 5 deals with the generalization behavior of ReLU networks trained with the novel SGD algorithm. Numerical tests on synthetic data and real images are provided in Section 6. The present paper is concluded with research outlook in Section 7, while technical proofs of the main results are delegated to the Appendix.
Notation: Lower- (upper-)case boldface letters denote vectors (matrices), e.g., $\bm{a}$ ($\bm{A}$). Calligraphic letters are reserved for sets, e.g., $\mathcal{S}$, with the exception of $\mathcal{D}$ representing some probability distribution. The operation $\lfloor c \rfloor$ returns the largest integer no greater than the given number $c$, the cardinality $|\mathcal{S}|$ counts the number of elements in set $\mathcal{S}$, and $\|\bm{x}\|_2$ denotes the Euclidean norm of $\bm{x}$.
2 Related Work
As mentioned earlier, NN models have lately enjoyed great empirical success in numerous domains [21, 13, 53]. Many contributions have been devoted to explaining such a success; see e.g., [4, 5, 18, 12, 23, 56, 35, 24, 26, 44, 19, 48, 52, 51, 10, 25, 40, 49, 17]. Recent research efforts have focused on the expressive ability of deep NNs [38], and on the computational tractability of training such models [43, 5]. In fact, training NNs is NP-hard in general, even for small and shallow networks [14, 2]. Under various assumptions (e.g., Gaussian data, and a sufficiently large number of hidden units) and for different models, however, it has been shown that local search heuristics such as (S)GD can efficiently learn two-layer NNs with quadratic or ReLU activations [43].
Another line of research has studied the landscape properties of various loss functions for learning NNs; see e.g. [18, 56, 23, 5, 55, 33, 32, 51, 25, 36]. Generalizing the results for the loss [18, 56], it has been proved that deep linear networks with arbitrary convex and differentiable losses have no suboptimal (a.k.a. bad) local minima, that is, all local minima are global, when the hidden layers are at least as wide as the input or the output layer [22]. For nonlinear NNs, most results have focused on learning shallow networks. For example, it has been shown that there are no bad local minima in learning two-layer networks with quadratic activations and the loss, provided that the number of hidden neurons exceeds twice that of inputs [43]. Focusing on a binary classification setting, [5] demonstrated that despite the nonconvexity present in learning one-hidden-layer leaky ReLU networks with a hinge loss criterion, all critical points are global minima if the data are linearly separable. Thus, SGD can efficiently find a global optimum of a leaky ReLU network. On the other hand, it has also been shown that there exist infinitely many bad local optima in learning even two-layer ReLU networks under mild conditions; see e.g., [50, Theorem 6], [5, Thm. 8], [23]. Interestingly, [23] provided a complete description of all suboptimal critical points in learning two-layer ReLU networks with a hinge loss on separable data. Yet, it remains unclear whether and how one can efficiently train even a single-hidden-layer ReLU network to global optimality.
Recent efforts have also been centered on understanding the generalization behavior of deep NNs by introducing and/or studying different complexity measures. These include Rademacher complexity, uniform stability, and spectral complexity; see [20] for a recent survey. However, the obtained generalization bounds do not account for the underlying training schemes, namely optimization methods. As such, they do not provide tight guarantees for the generalization performance of (over-parameterized) networks trained with iterative algorithms [5]. Even though recent work suggested an improved generalization bound by optimizing the PAC-Bayes bound of an over-parameterized network in a binary classification setting [9], this result is meaningful only when the optimization succeeds. Leveraging standard compression bounds, generalization guarantees have been derived for two-layer leaky ReLU networks trained with plain-vanilla SGD [5]. But this bound does not generalize to ReLU networks, owing to the difficulty, and even impossibility, of using plain-vanilla SGD to train ReLU networks to a global optimum.
3 Problem Formulation
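Consider a binary classification setting, in which the training set $\mathcal{S} := \{(\bm{x}_i, y_i)\}_{i=1}^{n}$ comprises $n$ data sampled i.i.d. from some unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where without loss of generality we assume $\mathcal{X} := \{\bm{x} \in \mathbb{R}^d : \|\bm{x}\|_2 \le 1\}$ and $\mathcal{Y} := \{-1, +1\}$. We are interested in the linearly separable case, in which there exists an optimal linear classifier vector $\bm{\omega}^\ast \in \mathbb{R}^d$ such that $y \langle \bm{\omega}^\ast, \bm{x} \rangle \ge 1$ holds with probability one for $(\bm{x}, y) \sim \mathcal{D}$. To allow for affine classifiers, a “bias term” can be appended to the classifier vector by augmenting all data vectors with an extra component of $1$ accordingly.
We deal with single-hidden-layer NNs having $d$ scalar inputs, $k$ hidden neurons, and a single output (for binary classification). The overall input-output relationship of such a two-layer NN is
$$f(\bm{x}) := \sum_{j=1}^{k} v_j\, \sigma(\bm{w}_j^\top \bm{x}) \tag{1}$$
which maps each input vector $\bm{x} \in \mathbb{R}^d$ to a scalar output by combining $k$ nonlinear maps of linearly projected renditions of $\bm{x}$, effected via the ReLU activation $\sigma(z) := \max(0, z)$. Clearly, due to the nonnegativity of ReLU outputs, one requires at least $k \ge 2$ hidden units so that the output can take both positive and negative values to signify the ‘positive’ and ‘negative’ classes. Here, $\bm{w}_j \in \mathbb{R}^d$ stacks up the weights of the links connecting the input $\bm{x}$ to the $j$th hidden neuron, and $v_j$ is the weight of the link from the $j$th hidden neuron to the output. Upon defining $\bm{W} := [\bm{w}_1 \cdots \bm{w}_k]^\top \in \mathbb{R}^{k \times d}$ and $\bm{v} := [v_1 \cdots v_k]^\top$, which are henceforth collectively denoted as $\mathcal{W} := \{\bm{v}, \bm{W}\}$ for brevity, one can express $f(\bm{x})$ in a compact matrix-vector representation as
$$f(\bm{x}; \mathcal{W}) = \bm{v}^\top \sigma(\bm{W}\bm{x}) \tag{2}$$
where the ReLU activation $\sigma(\bm{z})$ should be understood entrywise when applied to a vector $\bm{z}$.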
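As a minimal sketch of the input-output map in (1)-(2), the following pure-Python snippet evaluates a two-layer ReLU network as a weighted sum of rectified projections. The names `relu` and `forward`, as well as the toy weights, are illustrative assumptions and are not taken from the authors' code.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def forward(W, v, x):
    """Two-layer ReLU network output: sum_j v[j] * relu(<W[j], x>).

    W is a list of k rows (one weight vector per hidden neuron),
    v is the list of k second-layer weights, x is the input vector.
    """
    hidden = [relu(sum(w_i * x_i for w_i, x_i in zip(w, x))) for w in W]
    return sum(v_j * h_j for v_j, h_j in zip(v, hidden))


# Hypothetical toy network: k = 2 hidden neurons, d = 2 inputs.
W = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, -1.0]  # one positive and one negative second-layer weight
print(forward(W, v, [0.5, 0.2]))
print(forward(W, v, [-1.0, -1.0]))  # every ReLU is inactive for this input
```

With second-layer weights of both signs, the output can be positive or negative, which is why at least two hidden units are needed.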
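Given our NN described by $f(\bm{x}; \mathcal{W})$ and adopting a hinge loss criterion $\ell(z) := \max(0, 1 - z)$, we define the empirical loss as the average loss of $f(\bm{x}; \mathcal{W})$ over the training set $\mathcal{S}$, that is
$$L_{\mathcal{S}}(\mathcal{W}) := \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - y_i\, \bm{v}^\top \sigma(\bm{W}\bm{x}_i)\big).$$
With the output $f(\bm{x}; \mathcal{W})$, we construct a binary classifier as $g(\bm{x}) := \mathrm{sgn}(f(\bm{x}; \mathcal{W}))$, where the sign function $\mathrm{sgn}(z) = 1$ if $z \ge 0$, and $\mathrm{sgn}(z) = -1$ otherwise. For this classifier, the training error (a.k.a. misclassification rate) over $\mathcal{S}$ is
$$R_{\mathcal{S}}(\mathcal{W}) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{y_i \ne \mathrm{sgn}(f(\bm{x}_i; \mathcal{W}))\} \tag{3}$$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function taking value $1$ if the argument is true, and $0$ otherwise.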
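To make the two performance measures concrete, here is a minimal pure-Python sketch of the empirical hinge loss and the training error in (3). The function names, the tie-breaking convention `sgn(0) = +1`, and the three-point dataset are assumptions made for illustration only.

```python
def forward(W, v, x):
    """Two-layer ReLU network output: sum_j v[j] * max(0, <W[j], x>)."""
    return sum(vj * max(0.0, sum(a * b for a, b in zip(w, x)))
               for w, vj in zip(W, v))


def hinge(z):
    """Hinge loss max(0, 1 - z), evaluated at z = y * f(x)."""
    return max(0.0, 1.0 - z)


def sgn(z):
    """Sign function; mapping 0 to +1 is an assumed convention."""
    return 1 if z >= 0 else -1


def empirical_loss(W, v, data):
    """Average hinge loss over a list of (x, y) pairs."""
    return sum(hinge(y * forward(W, v, x)) for x, y in data) / len(data)


def training_error(W, v, data):
    """Fraction of samples whose predicted sign differs from the label."""
    return sum(1 for x, y in data if sgn(forward(W, v, x)) != y) / len(data)


# Hypothetical, linearly separable toy data (e.g. via the vector (1, -1)).
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([-1.0, 0.0], -1)]
W = [[2.0, 0.0], [0.0, 2.0]]
v = [1.0, -1.0]
print(empirical_loss(W, v, data), training_error(W, v, data))
```

In this toy configuration the third sample lies in the negative cone of both hidden neurons, so the network output there is zero and the sample contributes both a nonzero loss and a classification error.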
Figure 1: Suboptimal critical points (at which some data sample yields a nonzero loss) for a ReLU network with two hidden neurons, each corresponding to a hyperplane (the left and right blue lines in each plot). The arrows point to the positive cone of each hyperplane. The training dataset contains three samples, two of which belong to one class (colored red) and one of which belongs to the other class (colored black). Data points with a nonzero classification error (hence nonzero loss) must lie in the negative cone of all hyperplanes.
In this paper, we fix the second layer of the network to be some constant vector $\bm{v}$ given a priori, with at least one positive and at least one negative entry. Therefore, training the ReLU network boils down to learning the weight matrix $\bm{W}$ only. As such, the network is henceforth denoted by $f(\bm{x}; \bm{W})$, and the goal is to solve the following optimization problem
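$$\min_{\bm{W} \in \mathbb{R}^{k \times d}} \; L_{\mathcal{S}}(\bm{W}) \tag{4}$$
where $L_{\mathcal{S}}(\bm{W}) := \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - y_i\, \bm{v}^\top \sigma(\bm{W}\bm{x}_i)\big)$. Evidently for separable data and the ReLU network considered in this paper, it must hold that $\min_{\bm{W}} L_{\mathcal{S}}(\bm{W}) = 0$. Due to piecewise linear (nonsmooth) ReLU activations, $L_{\mathcal{S}}(\bm{W})$ becomes nonsmooth. It can be further shown that $L_{\mathcal{S}}(\bm{W})$ is nonconvex (e.g., [5, Proposition 5.1]), and indeed admits infinitely many (suboptimal) local minima [5, Thm. 8].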
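Interestingly though, it is possible to provide an analytical characterization of all suboptimal solutions. Specifically, at any critical point $\bm{W}^\dagger$ of $L_{\mathcal{S}}(\bm{W})$ (see Footnote 1) that incurs a nonzero loss for some data pair $(\bm{x}_i, y_i)$, it holds that
$$\sigma(\bm{w}_j^{\dagger\top} \bm{x}_i) = 0, \quad \text{for all } j = 1, \ldots, k. \tag{7}$$
Expressed differently, if data pair $(\bm{x}_i, y_i)$ yields a nonzero loss at a critical point $\bm{W}^\dagger$, the ReLU output must vanish at all hidden neurons. Building on this observation, we say that a critical point $\bm{W}^\dagger$ of $L_{\mathcal{S}}(\bm{W})$ is suboptimal if it obeys simultaneously the following two conditions: i) $\ell(y_i f(\bm{x}_i; \bm{W}^\dagger)) > 0$ for some data sample $(\bm{x}_i, y_i) \in \mathcal{S}$, and ii) for which it holds that $\sigma(\bm{W}^\dagger \bm{x}_i) = \bm{0}$. According to these two defining conditions, the set of all suboptimal critical points includes different local minima, as well as all maxima and saddle points; see Figure 1 for an illustration. It is also clear from Figure 1 that the two conditions in certain cases cannot be changed by small perturbations on $\bm{W}^\dagger$, suggesting that there are in general infinitely many suboptimal critical points. Therefore, optimally training even such a single-hidden-layer ReLU network is indeed challenging.
Footnote 1: The critical point for a general nonconvex and nonsmooth function is defined invoking the Clarke subdifferential (see e.g., [6], [47]). Precisely, consider a function $h(\bm{z})$, which is locally Lipschitz around $\bm{z}$, and differentiable outside a set $\Omega$ of Lebesgue measure zero. Then the convex hull of the set of limits of the form $\lim_{k \to \infty} \nabla h(\bm{z}_k)$, where $\bm{z}_k \to \bm{z}$ as $k \to \infty$ with $\bm{z}_k \notin \Omega$, is the Clarke subdifferential of $h$ at $\bm{z}$, and $\bm{z}$ is a critical point of $h$ if this set contains $\bm{0}$.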
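Consider minimizing $L_{\mathcal{S}}(\bm{W})$ by means of plain-vanilla SGD with constant learning rate $\eta > 0$, as
$$\bm{W}^{t+1} = \bm{W}^{t} - \eta\, \partial \ell\big(y_{i_t} f(\bm{x}_{i_t}; \bm{W}^{t})\big) \tag{8}$$
with the (sub)gradient of the hinge loss at a randomly sampled datum $(\bm{x}_{i_t}, y_{i_t}) \in \mathcal{S}$ given by
$$\partial \ell\big(y_{i_t} f(\bm{x}_{i_t}; \bm{W}^{t})\big) = -\, y_{i_t}\, \mathbb{1}\{1 - y_{i_t} \bm{v}^\top \sigma(\bm{W}^{t} \bm{x}_{i_t}) > 0\}\; \mathrm{diag}\big(\mathbb{1}\{\bm{W}^{t} \bm{x}_{i_t} > \bm{0}\}\big)\, \bm{v}\, \bm{x}_{i_t}^\top \tag{9}$$
where $\mathrm{diag}(\bm{z})$ is a diagonal matrix holding entries of vector $\bm{z}$ on its diagonal, and the indicator function applied to a vector is understood entrywise. For any suboptimal critical point $\bm{W}^\dagger$ incurring a nonzero loss for some $(\bm{x}_i, y_i)$, it can be readily deduced that $\mathrm{diag}(\mathbb{1}\{\bm{W}^\dagger \bm{x}_i > \bm{0}\}) = \bm{0}$ [23], so that the corresponding update in (8) is zero.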
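The way plain-vanilla SGD stalls can be sketched in a few lines of pure Python. The snippet below implements one hinge-loss subgradient step for a two-layer ReLU network with a fixed second layer, under the assumption that a ReLU counts as active only when its pre-activation is strictly positive; the function name `sgd_step` and the toy numbers are hypothetical.

```python
def sgd_step(W, v, x, y, eta):
    """One plain SGD step on the hinge loss; returns (new_W, update_was_nonzero).

    Row j of W moves by eta * y * v[j] * x only if the sample has a nonzero
    loss AND ReLU j is active (assumed here to mean <W[j], x> > 0).
    """
    pre = [sum(a * b for a, b in zip(w, x)) for w in W]
    out = sum(vj * max(0.0, p) for vj, p in zip(v, pre))
    if 1.0 - y * out <= 0.0:  # zero hinge loss: nothing to do
        return W, False
    new_W = [[a + eta * y * vj * (1.0 if p > 0.0 else 0.0) * b
              for a, b in zip(w, x)]
             for w, vj, p in zip(W, v, pre)]
    return new_W, new_W != W


W = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, -1.0]

# A sample in the negative cone of both neurons has loss 1, yet W does not move.
stuck_W, moved = sgd_step(W, v, [-0.6, -0.6], 1, eta=0.1)
print(moved)

# A sample that activates the first neuron does trigger an update.
next_W, moved = sgd_step(W, v, [0.6, 0.0], 1, eta=0.1)
print(moved, next_W)
```

The first call illustrates the stalling mechanism: the sample is misclassified, but because every ReLU is inactive the subgradient is zero and the iterate cannot improve on that sample.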
Following the convention [30], we say that a ReLU is active if its output is nonzero, and inactive otherwise. Furthermore, we denote the state of the $j$th ReLU by its activity indicator function $\mathbb{1}\{\bm{w}_j^\top \bm{x} > 0\}$. In words, there always exist some data sample(s) for which all hidden neurons become inactive at a suboptimal critical point. This is corroborated by the fact that under some conditions, plain-vanilla SGD converges to a suboptimal local minimum with high probability [5]. It will also be verified by our numerical tests in Section 6 that SGD can indeed get stuck in suboptimal local minima when training ReLU networks.
4 Main Results
In this section, we present our main results that include a modified SGD algorithm and theory for efficiently training single-hidden-layer ReLU networks to global optimality. As in the convergence analysis of the Perceptron algorithm (see e.g., [34], [41, Chapter 9]), we define an update at iteration $t$ as nonzero or effective if the corresponding (modified) stochastic gradient is nonzero, or equivalently, whenever one has $\bm{W}^{t+1} \ne \bm{W}^{t}$.
4.1 Algorithm
As explained in Section 3, plain-vanilla SGD iterations for minimizing the empirical loss in (4) can get stuck in suboptimal critical points. Recall from (9) that whenever this happens, all ReLU activity indicators must vanish for some data sample, or equivalently, the $j$th ReLU is inactive on that sample for all $j$. To avoid being trapped in these points, we will endow the algorithm with a nonzero ‘(sub)gradient’ even at a suboptimal critical point, so that the algorithm will be able to continue updating, and will have a chance to escape from suboptimal critical points. If successful, then when the algorithm converges, the hinge-loss indicator in (9) must vanish for all data samples, that is, $y_i f(\bm{x}_i; \bm{W}) \ge 1$ for all $i$, which is attainable thanks to linear separability of the data. By the definition of the hinge loss, this yields a zero loss in (4), which guarantees that the algorithm converges to a global optimum. Two critical questions arise at this point: Q1) How can we endow a nonzero ‘(sub)gradient’ based search direction even at a suboptimal critical point, while having the global minima as limiting points of the algorithm? and Q2) How is it possible to guarantee convergence?
Question Q1) can be answered by ensuring that at least one ReLU is active at a nonoptimal point. Toward this objective, motivated by recent efforts in escaping saddle points [15], [1], [11], we are prompted to add a zero-mean random noise vector to the input vector of the activity indicator function of all ReLUs. This would replace the noise-free activity indicator in the subgradient (cf. (9)) with its noise-perturbed counterpart at every iteration. In practice, Gaussian additive noise with sufficiently large variance works well.
Albeit empirically effective in training ReLU networks, SGD with such architecture-agnostic noise injected into all ReLU activity indicator functions cannot guarantee convergence in general; at the least, its convergence is difficult or even impossible to establish. We shall take a different route to bypass this hurdle here, which will lead to a simple algorithm provably convergent to a desired global optimum in a finite number of nonzero updates. This result holds regardless of the data distribution, initialization, network size, or the number of hidden neurons. To ensure convergence of our modified SGD algorithm, we carefully design the noise injection process by maintaining at least one nonzero ReLU activity indicator variable at every nonoptimal critical point.
For the data sample picked at iteration $t$, we inject Gaussian noise into the $j$th ReLU activity indicator function in the SGD update of (9), if and only if the corresponding quantity holds, and we repeat this for all neurons $j$.
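The selective noise-injection idea of this section can be sketched as follows in pure Python. This is not the authors' Algorithm 1: the exact injection condition is not recoverable from the text here, so the sketch ASSUMES that neuron `j` is perturbed only when `y * v[j] >= 0` for the picked sample `(x, y)`; it also assumes a strict-positivity activity indicator, a single noise level `gamma`, and a stopping rule of `patience` consecutive iterations without a nonzero update. All names and default values are hypothetical.

```python
import random


def output(W, v, x):
    """f(x; W) = sum_j v[j] * max(0, <W[j], x>)."""
    return sum(vj * max(0.0, sum(a * b for a, b in zip(w, x)))
               for w, vj in zip(W, v))


def noisy_sgd(data, v, W0, eta=0.1, gamma=10.0, patience=200,
              max_iter=100_000, seed=0):
    """SGD on the hinge loss with noise in selected ReLU activity indicators.

    Assumes nonzero entries in v and nonzero inputs x, so that an active
    (possibly noise-activated) indicator always yields a nonzero update.
    Returns the final weights and the number of nonzero updates.
    """
    rng = random.Random(seed)
    W = [row[:] for row in W0]
    quiet, updates = 0, 0
    for _ in range(max_iter):
        if quiet >= patience:  # assumed stopping rule
            break
        x, y = rng.choice(data)
        changed = False
        if 1.0 - y * output(W, v, x) > 0.0:  # sample has nonzero hinge loss
            for j, vj in enumerate(v):
                pre = sum(a * b for a, b in zip(W[j], x))
                # ASSUMPTION: perturb indicator j only if y * v[j] >= 0.
                noise = rng.gauss(0.0, gamma) if y * vj >= 0 else 0.0
                if pre + noise > 0.0:
                    W[j] = [a + eta * y * vj * b for a, b in zip(W[j], x)]
                    changed = True
        quiet = 0 if changed else quiet + 1
        updates += 1 if changed else 0
    return W, updates


# Hypothetical separable toy data and an initialization at which the third
# sample has nonzero loss while both ReLUs are inactive on it.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([-1.0, 0.0], -1)]
v = [1.0, -1.0]
W0 = [[1.0, 0.0], [1.0, 0.0]]
W, updates = noisy_sgd(data, v, W0)
losses = [max(0.0, 1.0 - y * output(W, v, x)) for x, y in data]
print(updates, losses)
```

Under the assumed rule, a perturbed neuron can only be switched on in a direction that increases `y * f(x)` for the picked sample, which is the intuition behind restricting the noise to a subset of the indicators rather than all of them.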
Interestingly, the noise variance admits simple choices, so long as it is selected sufficiently large to match the size of the corresponding summands . We will build up more intuition and highlight the basic principle behind such a noise injection design shortly in Section 4.2, along with our formal convergence analysis. For implementation purposes, we summarize the novel SGD algorithm with randomly perturbed ReLU activity indicator functions in Algorithm 1. As far as the stopping criterion is concerned, it is safe to conclude that the algorithm has converged if there has been no nonzero update for a sufficiently long succession of iterations, whose length is set by some fixed, large enough integer. This holds with high probability, which depends on , and (), where the latter denotes the number of neurons with (). We have the following result, whose proof is provided in Appendix D.4.
Proposition 1.
Let for all neurons , and all iterations , and consider cycling deterministically through . If there is no nonzero update after a succession of iterations, then Algorithm 1 converges to a global optimum of with probability at least , where
$\Phi(\cdot)$ is the cumulative distribution function of the standardized Gaussian distribution $\mathcal{N}(0, 1)$.
(10)
Observe that the probability in Proposition 1 can be made arbitrarily close to by taking sufficiently large and/or . Regarding our proposed approach in Algorithm 1, three remarks are worth making.
Remark 1.
With the carefully designed noise injection rule, our algorithm constitutes a nontrivial generalization of the Perceptron or plain-vanilla SGD algorithms to learn ReLU networks. Implementing Algorithm 1 is as easy as plain-vanilla SGD, requiring almost negligible extra computation overhead. Both numerically and analytically, we will demonstrate the power of our principled noise injection into partial ReLU activity indicator functions, as well as establish the optimality, efficiency, and generalization performance of Algorithm 1 in learning two-layer (over-parameterized) ReLU networks on linearly separable data.
Remark 2.
It is worth remarking that the random (Gaussian) noise in our proposal is solely added to the ReLU activity indicator functions, rather than to any of the hidden neurons. This is evident from the first indicator function being the (sub)derivative of a hinge loss, in Step 5 of Algorithm 1, which is kept as it is in the plain-vanilla SGD, namely it is not affected by the noise. Moreover, our use of random noise in this way distinguishes itself from those in the vast literature for evading saddle points (see e.g., [15], [1], [11], [28]), which simply add noise to either the iterates or to the (stochastic) (sub)gradients. This distinction endows our approach with the unique capability of also escaping local minima (in addition to saddle points). To the best of our knowledge, our approach is the first of its kind in provably yet efficiently escaping local minima under suitable conditions.
Remark 3.
Compared with previous efforts in learning ReLU networks (e.g., [4], [43], [55], [8], [16], [54]), our proposed Algorithm 1 provably converges to a global optimum in a finite number of nonzero updates, without any assumptions on the data distribution, training/network size, or initialization. This holds even in the presence of exponentially many local minima and saddle points. To the best of our knowledge, Algorithm 1 provides the first solution to efficiently train such a single-hidden-layer ReLU network to global optimality with a hinge loss, so long as the training samples are linearly separable. Generalizations to other objective functions based on e.g., the hinge loss and the smoothed hinge loss (a.k.a. polynomial hinge loss) [26], as well as to multi-layer ReLU networks are possible, and they are left for future research.
4.2 Convergence analysis
In this section, we analyze the convergence of Algorithm 1 for learning single-hidden-layer ReLU networks with a hinge loss criterion on linearly separable data, namely for minimizing the empirical loss in (4). Recall that since we only train the first layer, with the second-layer weight vector fixed a priori, we can assume without further loss of generality that the entries of the second-layer weight vector are all nonzero. Otherwise, one can exclude the corresponding hidden neurons from the network, yielding an equivalent reduced-size NN whose second-layer weight vector has all its entries nonzero.
Before presenting our main convergence results for Algorithm 1, we introduce some notation. To start, let () be the index set of data samples belonging to the ‘positive’ (‘negative’) class, namely whose (). It is thus selfevident that and hold under our assumptions. Putting our work in context, it is useful to first formally summarize the landscape properties of the objective function , which can help identify the challenges in learning ReLU networks.
Proposition 2.
Function has the following properties: i) it is nonconvex, and ii) for each suboptimal local minimum (that incurs a nonzero loss), there exists (at least) a datum for which all ReLUs become inactive.
The proof of Property i) in Proposition 2 can be easily adapted from that of [5, Proposition 5.1], while Property ii) is just a special case of [23, Thm. 5] for a fixed ; hence they are both omitted in this paper.
We will provide an upper bound on the number of nonzero updates that Algorithm 1 performs until no nonzero update occurs within a sufficiently long succession of iterations (cf. (10)), whose length is set by a large enough integer. This, together with the fact that all suboptimal critical points of the loss in (4) are not limiting points of Algorithm 1 due to the Gaussian noise injection with a large enough variance at every iteration, will guarantee convergence of Algorithm 1 to a global optimum of (4). Specifically, the main result is summarized in the following theorem.
Theorem 1 (Optimality).
If all rows of the initialization satisfy for any constant , and the second layer weight vector is kept fixed with both positive and negative (but nonzero) entries, then Algorithm 1 with some constant step size converges to a global minimum of after performing at most nonzero updates, where for it holds that
(11) 
In particular, if , then Algorithm 1 converges to a global optimum after at most nonzero updates.
Regarding Theorem 1, a couple of observations are of interest. The developed Algorithm 1 converges to a globally optimal solution of the nonconvex optimization (4) within a finite number of nonzero updates, which implicitly corroborates the ability of Algorithm 1 to escape suboptimal local minima, as well as saddle points. This holds regardless of the underlying data distribution, the number of training samples, the number of hidden neurons, or even the initialization. It is also worth highlighting that the number of nonzero updates does not depend on the dimension of input vectors, but it scales with (in the worst case), and it is inversely proportional to the step size. Recall that the worst-case bound for SGD learning of leaky-ReLU networks with initialization is [5, Thm. 2]
(12) 
where again, denotes an optimal linear classifier obeying . Clearly, the upper bound above does not depend on . This is due to the fact that the loss function corresponding to learning leaky-ReLU networks has no bad local minima, since all critical points are global minima. This is in sharp contrast with the loss function associated with learning ReLU networks investigated here, which generally involves infinitely many bad local minima. On the other hand, the bound in (12) is inversely proportional to the quadratic ‘leaky factor’ of leaky ReLUs. This motivates having , which corresponds to letting the leaky ReLU approach the ReLU. In such a case, (12) would yield a worst-case bound of infinity for learning ReLU networks, corroborating the difficulty, and even impossibility, of learning ReLU networks by ‘plain-vanilla’ SGD. Indeed, the gap between in Theorem 1 and the bound in (12) is the price for being able to escape local minima and saddle points paid by our noise-injected SGD Algorithm 1. Last but not least, Theorem 1 also suggests that for a given network and a fixed step size , Algorithm 1 with works well too.
We briefly present the main ideas behind the proof of Theorem 1 next, but defer the technical details to Appendix A.1. Our proof mainly builds upon the convergence proof of the classical Perceptron algorithm (see e.g., [41, Thm. 9.1]), and it is also inspired by that of [5, Thm. 1]. Nonetheless, the novel approach of performing SGD with principled noise injection into the ReLU activity indicator functions distinguishes itself from previous efforts. Since we are mainly interested in the (maximum) number of nonzero updates to be performed until convergence, we will assume for notational convenience that all iterations in (10) of Algorithm 1 perform a nonzero update. This assumption is made without loss of generality: after the algorithm converges, one can always recount the effective iterations that correspond to a nonzero update and renumber them consecutively.
Our main idea is to demonstrate that every single nonzero update of the form (10) in Algorithm 1 makes nonnegligible progress in bringing the current iterate toward some global optimum of (4), constructed based on the optimal linear classifier weight vector. Specifically, as in the convergence proof of the Perceptron algorithm, we will establish separately a lower bound on the so-termed Frobenius inner product between the current iterate and this global optimum, which performs a component-wise inner product of two same-size matrices as though they are vectors; and an upper bound on the Frobenius norms of the two matrices. Both bounds will be carefully expressed as functions of the number of performed nonzero updates. Recalling the Cauchy-Schwarz inequality, the lower bound on the inner product cannot grow larger than the upper bound on the product of the norms. Since every nonzero update brings the lower and upper bounds closer by a nonnegligible amount, the worst case (in terms of the number of nonzero updates) is to have the two bounds equal at convergence. From this equality, we are able to deduce an upper bound (due to a series of inequalities used in the proof to produce a relatively clean bound) on the number of nonzero updates by solving a univariate quadratic equation.
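For intuition, the classical Perceptron argument that this proof strategy parallels can be written out in a few lines. The derivation below is the textbook bound for a linear classifier, not the bound of Theorem 1; it assumes zero initialization $\bm{w}_0 = \bm{0}$, inputs with $\|\bm{x}\|_2 \le 1$, a separating vector $\bm{\omega}^\ast$ with $y \langle \bm{\omega}^\ast, \bm{x} \rangle \ge 1$ on every sample, and updates $\bm{w}_{t+1} = \bm{w}_t + y_t \bm{x}_t$ performed only on mistakes, i.e., when $y_t \langle \bm{w}_t, \bm{x}_t \rangle \le 0$.

```latex
\begin{align*}
\text{(lower bound)} \quad
  \langle \bm{\omega}^\ast, \bm{w}_{T} \rangle
  &= \sum_{t=0}^{T-1} y_t \langle \bm{\omega}^\ast, \bm{x}_t \rangle
  \;\ge\; T, \\
\text{(upper bound)} \quad
  \|\bm{w}_{t+1}\|_2^2
  &= \|\bm{w}_t\|_2^2 + 2\, y_t \langle \bm{w}_t, \bm{x}_t \rangle + \|\bm{x}_t\|_2^2
  \;\le\; \|\bm{w}_t\|_2^2 + 1
  \;\;\Longrightarrow\;\; \|\bm{w}_T\|_2^2 \le T, \\
\text{(Cauchy-Schwarz)} \quad
  T \;\le\; \langle \bm{\omega}^\ast, \bm{w}_T \rangle
  &\le \|\bm{\omega}^\ast\|_2 \, \|\bm{w}_T\|_2
  \;\le\; \|\bm{\omega}^\ast\|_2 \sqrt{T}
  \;\;\Longrightarrow\;\; T \le \|\bm{\omega}^\ast\|_2^2 .
\end{align*}
```

The inner product grows linearly in the number of updates $T$ while the norm grows only like $\sqrt{T}$, so the two bounds can remain compatible only for finitely many updates.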
It will become clear in the proof that injecting random noise into just a subset of (rather than all) ReLU activity indicator functions enables us to leverage two key inequalities, namely, and for all data samples . These inequalities uniquely correspond to whether an update is nonzero or not. In turn, this characterization is indeed the key to establishing the desired lower and upper bounds for the two quantities on the two sides of the Cauchy-Schwarz inequality, a critical ingredient of our convergence analysis.
4.3 Lower bound
Besides the worst-case upper bound given in Theorem 1, we also provide a lower bound on the number of nonzero updates required by Algorithm 1 for convergence, which is summarized in the following theorem. The proof is provided in Appendix C.3.
Theorem 2 (Lower bound).
The lower bound on the number of nonzero updates to be performed in Theorem 2 matches that for learning single-hidden-layer leaky-ReLU networks initialized from zero [5, Thm. 4]. On the other hand, it is also clear that the worst-case bound established in Theorem 1 is (significantly) looser than the lower bound here. The gap between the two bounds (in learning ReLU versus leaky ReLU networks) is indeed the price we pay for escaping bad local minima and saddle points through our noise-injected SGD approach.
5 Generalization
In this section, we investigate the generalization performance of training (possibly overparameterized) ReLU networks using Algorithm 1 with randomly perturbed ReLU activity indicator functions. Toward this objective, we will rely on compression generalization bounds, specifically for the classification error as in (3) [27].
Recall that our ReLU network has hidden units, and a fixed secondlayer weight . Stressing the number of ReLUs in the subscript, let denote the classifier obtained by training the network over training set using Algorithm 1 with initialization having rows obeying . Let also denote the set of all classifiers obtained using any and any , not necessarily those employed by Algorithm 1.
Suppose now that Algorithm 1 has converged after nonzero updates, as per Theorem 1. And let be the tuple of training data from randomly picked by SGD iterations of Algorithm 1. To exemplify the tuple used per realization of Algorithm 1, we write . Since can be smaller than , function and thus rely on compressed (down to size ) versions of the tuples comprising the set [41, Definition 30.4]. Let be the subset of training data not picked by SGD to yield ; and correspondingly, let denote the ensemble risk associated with , and the empirical risk associated with the complement training set, namely . With these notational conventions, our next result follows from [41, Thm. 30.2].
Theorem 3 (Compression bound).
If , then the following inequality holds with probability of at least over the choice of and
(13) 
Regarding Theorem 3, two observations are in order. The bound in (13) is nonasymptotic but as , the last two terms on the right-hand side vanish, implying that the ensemble risk is upper bounded by the empirical risk . Moreover, once the SGD iterations in Algorithm 1 converge, we can find the complement training set , and thus can be determined. After recalling that holds at a global optimum of by Theorem 1, we obtain from Theorems 1 and 3 the following corollary.
Corollary 1.
If , and all rows of the initialization satisfy , then the following holds with probability at least over the choice of
(14) 
where is given in Theorem 1.
Expressed differently, the bound in (14) suggests that in order to guarantee a low generalization error, one requires in the worst case about training data to reliably learn a two-layer ReLU network of hidden neurons. This holds true despite the fact that Algorithm 1 can achieve a zero training loss regardless of the training size . One implication of Corollary 1 is a fundamental difference in the sample complexity for generalization between training a ReLU network (at least in the worst case), versus training a leaky ReLU network (), which at most needs data to be trained via SGD-type algorithms.
6 Numerical Tests
To validate our theoretical results, this section evaluates the empirical performance of Algorithm 1 using both synthetic data and real data. To benchmark Algorithm 1, we also simulated the plain-vanilla SGD. To compare the two algorithms as fairly as possible, the same initialization , constant step size , and random data sampling scheme were employed. For reproducibility, the Matlab code of Algorithm 1 is publicly available at https://gangwg.github.io/RELUS/.
6.1 Synthetic data
We consider first two synthetic tests using data generated from Gaussian as well as uniform distributions. In the first test, feature vectors were sampled i.i.d. from a standardized Gaussian distribution , and classifier was drawn from . Labels were generated according to . To further yield for all , we normalized by the smallest number among . We performed independent experiments with , and over a varying set of training samples using ReLU networks comprising hidden neurons. The second-layer weight vector was kept fixed with the first entries being and the remaining being . For fixed and , each experiment used a random initialization generated from , step size , and noise variance , along with a maximum of effective data passes.
Figure 2 depicts our results, where we display success rates of the plain-vanilla SGD (top panel) and our noise-injected SGD in Algorithm 1 (bottom panel); each plot presents results obtained from the experiments. Within each plot, a white square signifies that of the trials were successful, meaning that the learned ReLU network yields a training loss , while black squares indicate success rates. It is evident that the developed Algorithm 1 trained all considered ReLU networks to global optimality, while plain-vanilla SGD can get stuck at bad local minima, for small in particular. The bottom panel confirms that Algorithm 1 achieves optimal learning of single-hidden-layer ReLU networks on separable data, regardless of the network size, the number of training samples, and the initialization. The top panel, however, suggests that learning ReLU networks becomes easier with plain-vanilla SGD as grows larger, namely as the network becomes ‘more overparameterized.’
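The data generation for the Gaussian test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released Matlab code: the function name `make_separable_gaussian`, the standard normal draws for both the features and the classifier, and the unit-margin target of the normalization are all assumptions made for the example.

```python
import numpy as np

def make_separable_gaussian(n, d, seed=0):
    """Generate n linearly separable samples in d dimensions.

    Features and the ground-truth classifier are drawn from a standard
    Gaussian (assumed), and labels are the signs of the linear scores.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))        # feature vectors, one per row
    w_star = rng.standard_normal(d)        # ground-truth linear classifier
    scores = X @ w_star
    y = np.where(scores >= 0, 1.0, -1.0)   # labels in {+1, -1}
    # Rescale the classifier by the smallest absolute score so that every
    # sample attains a margin of at least one (assumed normalization target).
    w_star = w_star / np.min(np.abs(scores))
    return X, y, w_star

X, y, w_star = make_separable_gaussian(n=50, d=5)
print(np.min(y * (X @ w_star)))  # smallest margin after normalization
```

Dividing by the smallest absolute score leaves the labels unchanged, since a positive rescaling of the classifier does not alter the sign of any score.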
We repeated the first test using synthetic data as well as classifier generated i.i.d. from the uniform distribution . All other settings were kept the same. Success rates of plain-vanilla SGD are plotted in Fig. 3 (left panel), while those of the proposed Algorithm 1 are omitted, as it was successful in all simulated tests.
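The success rates reported for the plain-vanilla SGD baseline can be estimated along the following lines. The sketch covers only the baseline (subgradient descent on the hinge loss over the first-layer weights, with the second layer fixed) and does not implement the noise injection of Algorithm 1. All names and numerical settings (`plain_sgd`, `success_rate`, the step size, the number of passes, the loss tolerance `tol`, and the half-and-half split of the second-layer signs) are hypothetical choices for illustration.

```python
import numpy as np

def hinge_loss(W, v, X, y):
    """Average hinge loss of the network x -> v^T relu(W x)."""
    outputs = np.maximum(X @ W.T, 0.0) @ v
    return float(np.mean(np.maximum(0.0, 1.0 - y * outputs)))

def plain_sgd(X, y, k, step=0.1, passes=50, seed=0):
    """Plain-vanilla SGD on the first-layer weights W; the second layer v is fixed."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((k, d))  # Gaussian initialization (assumed standard normal)
    v = np.concatenate([np.ones(k // 2), -np.ones(k - k // 2)])  # assumed sign split
    for _ in range(passes * n):
        i = rng.integers(n)  # draw one training sample uniformly at random
        pre = W @ X[i]
        if 1.0 - y[i] * (v @ np.maximum(pre, 0.0)) > 0.0:
            # Hinge loss is active: the subgradient step moves only active neurons.
            W += step * y[i] * np.outer(v * (pre > 0.0), X[i])
    return W, v

def success_rate(n, d, k, trials=10, tol=1e-4, passes=50):
    """Fraction of independent trials whose final training loss is at most tol."""
    successes = 0
    for t in range(trials):
        rng = np.random.default_rng(1000 + t)
        X = rng.standard_normal((n, d))
        y = np.where(X @ rng.standard_normal(d) >= 0, 1.0, -1.0)
        W, v = plain_sgd(X, y, k, passes=passes, seed=t)
        successes += hinge_loss(W, v, X, y) <= tol
    return successes / trials

print(success_rate(n=20, d=5, k=4, trials=5))
```

Because a neuron whose pre-activation is nonpositive on a sample receives no update from that sample, this baseline can stall with a nonzero loss, which is the failure mode the noise injection is designed to break.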
6.2 Real data
Performance of Algorithm 1 for training (over)parameterized ReLU networks is further corroborated using two real datasets: iris in UCI’s machine learning repository [7], and MNIST images (downloaded from http://yann.lecun.com/exdb/mnist/). The iris dataset contains four-dimensional feature vectors belonging to three classes. To obtain a two-class linearly separable dataset, the first-class data vectors were relabeled , while the remaining were relabeled . We performed independent experiments over a varying set of training samples using ReLU networks with hidden neurons. Gaussian initialization from , step size , noise variance , and a maximum of effective data passes were simulated. Success rates of plain-vanilla SGD are given in Fig. 3 (right). Again, Algorithm 1 achieves a success rate in all simulated settings.
The linearly separable MNIST dataset collects images of digits (labeled ) and (labeled ), each having dimension . We performed independent experiments over a varying set of training samples using ReLU networks with hidden neurons. The constant step size of both plain-vanilla SGD and Algorithm 1 was set to () when the ReLU networks have () hidden units, while the noise variance in Algorithm 1 was set to . Similar to the first experiment on randomly generated data, we plot success rates of the plain-vanilla SGD (top panel) and our noise-injected SGD (bottom panel) algorithms over training sets of MNIST images in Figure 4. Algorithm 1 achieved a success rate under all testing conditions, which corroborates our theoretical results in Theorem 1, and it markedly improves upon its plain-vanilla SGD alternative.
7 Conclusions
This paper approached the task of training ReLU networks from a nonconvex optimization point of view. Focusing on the task of binary classification with a hinge loss criterion, this contribution put forth the first algorithm that can provably yet efficiently train any single-hidden-layer ReLU network to global optimality, provided that the data are linearly separable. The algorithm is as simple as plain-vanilla SGD, but it is able to exploit the power of random additive noise to break ‘optimality’ of the SGD learning process at any suboptimal critical point. We established an upper and a lower bound on the number of nonzero updates that the novel algorithm requires for convergence to a global optimum. Our result holds regardless of the underlying data distribution, network/training size, or initialization. We further developed generalization error bounds for two-layer NN classifiers with ReLU activations, which provide the first theoretical guarantee for the generalization behavior of ReLU networks trained with SGD. A comparison of such bounds with those of a leaky ReLU network reveals a key difference in the sample complexity required for generalization between optimally learning a ReLU network and optimally learning a leaky ReLU network.
Since analysis, comparisons, and corroborating tests focus on single-hidden-layer networks with a hinge loss criterion here, our future work will naturally aim at generalizing the novel noise-injection design to SGD for multilayer ReLU networks, considering alternative loss functions, and pursuing generalizations to (multi-)kernel-based approaches.
References
 [1] G. An, “The effects of adding noise during backpropagation training on a generalization performance,” Neural Comput., vol. 8, no. 3, pp. 643–674, Apr. 1996.
 [2] P. Auer, M. Herbster, and M. K. Warmuth, “Exponentially many local minima for single neurons,” in Adv. in Neural Inf. Process. Syst., Denver, Colorado, Nov. 27–Dec. 2, 1995, pp. 316–322.
 [3] A. Blum and R. L. Rivest, “Training a 3-node neural network is NP-complete,” in Adv. in Neural Inf. Process. Syst., Cambridge, Massachusetts, Aug. 3–5, 1988, pp. 494–501.
 [4] A. Brutzkus and A. Globerson, “Globally optimal gradient descent for a ConvNet with Gaussian inputs,” in Intl. Conf. on Mach. Learn., vol. 70, Sydney, Australia, Aug. 6–11, 2017.
 [5] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz, “SGD learns overparameterized networks that provably generalize on linearly separable data,” in Intl. Conf. on Learn. Rep., Vancouver, BC, Canada, Apr. 30–May 3, 2018.
 [6] F. H. Clarke, Optimization and Nonsmooth Analysis. SIAM, 1990, vol. 5.
 [7] D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml.
 [8] S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh, “Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima,” in Intl. Conf. on Mach. Learn., vol. 80, Stockholm, Sweden, July 10–15, 2018, pp. 1338–1347.
 [9] G. K. Dziugaite and D. M. Roy, “Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data,” arXiv:1703.11008, 2017.
 [10] H. Fu, Y. Chi, and Y. Liang, “Guaranteed recovery of one-hidden-layer neural networks via cross entropy,” arXiv:1802.06463, 2018.
 [11] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points – Online stochastic gradient for tensor decomposition,” in Conf. on Learn. Theory, vol. 40, Paris, France, July 3–6, 2015, pp. 797–842.
 [12] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Intl. Conf. on Artif. Intell. and Stat., Sardinia, Italy, May 13–15, 2010, pp. 249–256.
 [13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge: MIT press, 2016, vol. 1.
 [14] M. Gori and A. Tesi, “On the problem of local minima in backpropagation,” IEEE Trans. Pattern Anal. Mach. Intell., no. 1, pp. 76–86, Jan. 1992.
 [15] L. Holmstrom and P. Koistinen, “Using additive noise in backpropagation training,” IEEE Trans. Neural Netw., vol. 3, no. 1, pp. 24–38, Jan. 1992.
 [16] G. Jagatap and C. Hegde, “Learning ReLU networks via alternating minimization,” arXiv:1806.07863, 2018.
 [17] S. M. M. Kalan, M. Soltanolkotabi, and A. S. Avestimehr, “Fitting ReLUs via SGD and Quantized SGD,” arXiv:1901.06587, 2019.
 [18] K. Kawaguchi, “Deep learning without poor local minima,” in Adv. in Neural Inf. Process. Syst., Barcelona, Spain, Dec. 5–10, 2016, pp. 586–594.
 [19] K. Kawaguchi and L. P. Kaelbling, “Elimination of all bad local minima in deep learning,” arXiv:1901.00279, 2019.
 [20] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, “Generalization in deep learning,” arXiv:1710.05468, 2017.
 [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Adv. in Neural Inf. Process. Syst., Lake Tahoe, Nevada, Dec. 3–6, 2012, pp. 1097–1105.
 [22] T. Laurent and J. von Brecht, “Deep linear neural networks with arbitrary loss: All local minima are global,” in Intl. Conf. on Mach. Learn., Stockholm, Sweden, July 10–15, 2018, pp. 2908–2913.
 [23] ——, “The multilinear structure of ReLU networks,” in Intl. Conf. on Mach. Learn., vol. 80, Stockholm, Sweden, July 10–15, 2018.
 [24] D. Li, T. Ding, and R. Sun, “Overparameterized deep neural networks have no strict local minima for any continuous activations,” arXiv:1812.11039, 2018.
 [25] Y. Li and Y. Liang, “Learning overparameterized neural networks via stochastic gradient descent on structured data,” arXiv:1808.01204, 2018.
 [26] S. Liang, R. Sun, J. D. Lee, and R. Srikant, “Adding one neuron can eliminate all bad local minima,” arXiv:1805.08671, 2018.
 [27] N. Littlestone and M. Warmuth, “Relating data compression and learnability,” University of California, Santa Cruz, Tech. Rep., 1986.
 [28] S. Lu, M. Hong, and Z. Wang, “On the sublinear convergence of randomly perturbed alternating gradient descent to second order stationary solutions,” arXiv:1802.10418, 2018.
 [29] M. Nacson, N. Srebro, and D. Soudry, “Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate,” arXiv:1806.01796, 2018.
 [30] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Intl. Conf. on Mach. Learn., Haifa, Israel, June 21–24, 2010, pp. 807–814.
 [31] B. Neyshabur, R. Tomioka, and N. Srebro, “In search of the real inductive bias: On the role of implicit regularization in deep learning,” arXiv:1412.6614, 2014.
 [32] Q. Nguyen, “On connected sublevel sets in deep learning,” arXiv:1901.07417, 2019.
 [33] Q. Nguyen and M. Hein, “Optimization landscape and expressivity of deep CNNs,” in Intl. Conf. Mach. Learn., 2018, pp. 3727–3736.
 [34] A. B. Novikoff, “On convergence proofs for perceptrons,” in Proc. Symp. Math. Theory Automata, vol. 12, 1963, pp. 615–622.
 [35] S. Oymak, “Stochastic gradient descent learns state equations with nonlinear activations,” arXiv:1809.03019, 2018.
 [36] S. Oymak and M. Soltanolkotabi, “Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks,” arXiv:1902.04674, 2019.
 [37] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, “Exponential expressivity in deep neural networks through transient chaos,” in Adv. in Neural Inf. Process. Syst., Barcelona, Spain, Dec. 5–10, 2016, pp. 3360–3368.
 [38] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, “On the expressive power of deep neural networks,” in Intl. Conf. on Mach. Learn., vol. 70, Sydney, Australia, Aug. 6–11, 2017, pp. 2847–2854.
 [39] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychol. Rev., vol. 65, no. 6, p. 386, Nov. 1958.
 [40] I. Safran and O. Shamir, “Spurious local minima are common in two-layer ReLU neural networks,” in Intl. Conf. on Mach. Learn., vol. 80, Stockholm, Sweden, July 10–15, 2018, pp. 4430–4438.
 [41] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. New York, NY: Cambridge University Press, 2014.
 [42] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
 [43] M. Soltanolkotabi, “Learning ReLU via gradient descent,” in Adv. in Neural Inf. Process. Syst., Long Beach, CA, Dec. 4–9, 2017, pp. 2007–2017.
 [44] M. Soltanolkotabi, A. Javanmard, and J. D. Lee, “Theoretical insights into the optimization landscape of overparameterized shallow neural networks,” arXiv:1707.04926, 2017.
 [45] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, “The implicit bias of gradient descent on separable data,” J. Mach. Learn. Res., vol. 19, no. 70, pp. 1–57, 2018.
 [46] C. Wang and J. C. Principe, “Training neural networks with additive noise in the desired signal,” IEEE Trans. Neural Netw., vol. 10, no. 6, pp. 1511–1517, Nov. 1999.
 [47] G. Wang, G. B. Giannakis, and Y. C. Eldar, “Solving systems of random quadratic equations via truncated amplitude flow,” IEEE Trans. Inf. Theory, vol. 64, no. 2, pp. 773–794, Feb. 2018.
 [48] T. Xu, Y. Zhou, K. Ji, and Y. Liang, “Convergence of SGD in learning ReLU models with separable data,” arXiv:1806.04339, 2018.
 [49] G. Yehudai and O. Shamir, “On the power and limitations of random features for understanding neural networks,” arXiv:1904.00687, 2019.
 [50] C. Yun, S. Sra, and A. Jadbabaie, “A critical view of global optimality in deep learning,” arXiv:1802.03487, 2018.
 [51] ——, “Efficiently testing local optimality and escaping saddles for ReLU networks,” arXiv:1809.10858, 2018.
 [52] ——, “Small nonlinearities in activation functions create bad local minima in neural networks,” arXiv:1802.03487, 2018.
 [53] L. Zhang, G. Wang, and G. B. Giannakis, “Real-time power system state estimation and forecasting via deep neural networks,” arXiv:1811.06146, Nov. 2018.
 [54] X. Zhang, Y. Yu, L. Wang, and Q. Gu, “Learning one-hidden-layer ReLU networks via gradient descent,” arXiv:1806.07808, 2018.
 [55] K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon, “Recovery guarantees for one-hidden-layer neural networks,” in Intl. Conf. on Mach. Learn., vol. 70, Sydney, Australia, Aug. 6–11, 2017, pp. 4140–4149.
 [56] Y. Zhou and Y. Liang, “Critical points of linear neural networks: Analytical forms and landscape properties,” in Intl. Conf. on Learn. Rep., Vancouver, BC, Canada, Apr. 30–May 3, 2018.