1 Introduction
Deep learning (DL) has been demonstrated successfully in many different research areas such as image classification [20], speech recognition [16] and natural language processing [32]. In general, its empirical success stems mainly from better network architectures [15], larger amounts of training data [6], and better learning algorithms [12]. However, a theoretical understanding of why DL succeeds in applications still remains elusive. Very recently, researchers have started to understand DL from the perspective of optimization, such as the optimality of the learned models [13, 14, 36]. It has been proved that under certain (very restrictive) conditions the critical points learned for deep models actually achieve global optimality, even though the optimization in deep learning is highly nonconvex. These theoretical results may partially explain why such deep models work well in practice.
Global optimality is always desirable and preferred in optimization. Locating global optimality in deep learning, however, is extremely challenging due to the high nonconvexity, and thus, to the best of our knowledge, none of the conventional DL solvers, e.g. stochastic gradient descent (SGD) [2], Adagrad [7], Adadelta [37], RMSProp [33] and Adam [18], is intentionally developed for this purpose. Alternatively, different regularization techniques are applied to smooth the objective functions in DL so that the solvers can converge to some geometrically wider and flatter regions in the parameter space where good model solutions may exist [39, 4, 40]. But these solutions may not necessarily be the global optimum. Inspired by techniques in the global optimization of nonconvex functions, we propose a novel approximation algorithm, BPGrad, which has the ability to locate global optimality in DL via branch and pruning (BP). BP [29] is a well-known algorithm developed for searching for global solutions to nonconvex optimization problems. Its basic idea is to effectively and gradually shrink the gap between the lower and upper bounds of the global optimum by efficiently branching and pruning the parameter space. Fig. 1 illustrates the optimization procedure in BPGrad.
In order to branch and prune the space, we assume that the objective functions in DL are Lipschitz continuous [8], or can be approximated by Lipschitz functions. This is motivated by the facts that (1) Lipschitz continuity provides a natural way to estimate the lower and upper bounds of the global optimum (see Sec. 2.3.1) used in BP, and (2) it can also serve as regularization, if needed, to smooth the objective functions so that the returned solutions generalize well.
In Fig. 2 we illustrate the functionality of Lipschitz continuity as regularization, where the noisy, narrower but deeper valley is smoothed out, while the wider but shallower valley is preserved. Such regularization behavior can prevent algorithms from getting stuck in bad local minima. This is also advocated and demonstrated to be crucial for achieving good generalization of learned DL models in several recent works such as [4]. In this sense, our BPGrad algorithm/solver essentially aims to locate global optimality in the smoothed objective functions for DL.
Further, BPGrad generates solutions along the directions of gradients (i.e. branching), staying outside the estimated regions wherein no global optimum can exist theoretically (i.e. pruning); by repeating this branch-and-pruning procedure, BPGrad can locate the global optimum. Empirically, the high demand for computation as well as memory footprint of running BPGrad inspires us to develop an efficient DL solver that approximates BPGrad towards global optimization.
Contributions: The main contributions of our work are:
- We propose a novel approximation algorithm, BPGrad, which aims to locate the global optimum in DL. To the best of our knowledge, our approach is the first algorithmic attempt towards global optimization in DL.
- Theoretically, we prove that BPGrad can converge to global optimality within finitely many iterations.
- Empirically, we propose a novel and efficient DL solver based on BPGrad to reduce its requirements of computation as well as memory footprint. We provide both theoretical and empirical justification that our solver preserves the theoretical properties of BPGrad, and demonstrate that it outperforms conventional DL solvers in the applications of object recognition, detection, and segmentation.
1.1 Related Work
Global Optimality in DL: The empirical loss minimization problem in learning deep models is high-dimensional and nonconvex, with potentially numerous local minima and saddle points. Blum and Rivest [1] showed that finding the global optima is difficult because, in the worst case, even learning a simple 3-node neural network is NP-complete.
In spite of the difficulties in optimizing deep models, researchers have attempted to provide empirical as well as theoretical justification for the success of these models w.r.t. global optimality in learning. Zhang et al. [38] empirically demonstrated that sufficiently overparametrized networks trained with stochastic gradient descent can reach global optimality. Choromanska et al. [5] studied the loss surface of multilayer networks using a spin-glass model and showed that for many large-size decoupled networks, there exists a band with many local optima whose objective values are small and close to that of a global optimum. Brutzkus and Globerson [3] showed that gradient descent converges to the global optimum in polynomial time on a shallow neural network with one hidden layer, a convolutional structure, and a ReLU activation function. Kawaguchi
[17] proved that the error landscape does not have bad local minima in the optimization of linear deep neural networks. Yun et al. [36] extended these results and proposed sufficient and necessary conditions for a critical point to be a global minimum. Haeffele and Vidal [13] suggested that it is critical to balance the degrees of positive homogeneity between the network mapping and the regularization function to prevent non-optimal local minima in the loss surface of neural networks. Nguyen and Hein [27] argued that almost all local minima are globally optimal in fully connected wide neural networks whose number of hidden neurons in one layer is larger than the number of training points. Soudry and Carmon [30] employed smoothed analysis techniques to provide a theoretical guarantee that the highly nonconvex loss functions in multilayer networks can be easily optimized using local gradient descent updates. Hand and Voroninski [14] established favorable global geometry for the problem of enforcing priors provided by generative deep neural networks via empirical risk minimization.
DL Solvers: SGD [2] is the most widely used DL solver due to its simplicity; its learning rate (i.e., step size for gradients) is predefined. In general, SGD suffers from slow convergence, and thus its learning rate needs to be carefully tuned. To improve the efficiency of SGD, several DL solvers with adaptive learning rates have been proposed, including Adagrad [7], Adadelta [37], RMSProp [33] and Adam [18]
. These solvers integrate the advantages of both stochastic and batch methods, where small minibatches are used to estimate diagonal second-order information heuristically. They are capable of escaping saddle points and often yield faster convergence empirically.
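For concreteness, the adaptive update rules just mentioned can be sketched as follows. These are the standard formulations of Adagrad [7], RMSProp [33] and Adam [18]; the variable names and default hyperparameters below are our own illustrative choices, not tied to any particular implementation:

```python
import numpy as np

def adagrad_step(x, g, cache, lr=0.01, eps=1e-8):
    """Adagrad: accumulate ALL past squared gradients, so the effective
    learning rate lr / sqrt(cache) can only shrink over time."""
    cache = cache + g ** 2
    return x - lr * g / (np.sqrt(cache) + eps), cache

def rmsprop_step(x, g, cache, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp: an exponential moving average of squared gradients,
    which keeps the effective learning rate from vanishing."""
    cache = decay * cache + (1 - decay) * g ** 2
    return x - lr * g / (np.sqrt(cache) + eps), cache

def adam_step(x, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: rescale by bias-corrected estimates of the gradient
    mean (m) and uncentered variance (v)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy check: a few Adam steps on f(x) = x^2 should move x toward 0.
x, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 51):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

Note how Adagrad's cache grows monotonically, which is exactly the shrinking-learning-rate issue discussed below.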
Specifically, Adagrad is well suited for dealing with sparse data, as it adapts the learning rate to the parameters, performing smaller updates on frequent parameters and larger updates on infrequent ones. However, it suffers from a continually shrinking learning rate, which motivates Adadelta, RMSProp and Adam. Adadelta accumulates squared gradients within a fixed-size window rather than over all time as in Adagrad, RMSProp updates the parameters based on rescaled gradients, and Adam does so based on the estimated mean and variance of the gradients. Very recently, Mukkamala and Hein [26] proposed variants of RMSProp and Adagrad with logarithmic regret bounds.
Convention vs. Ours: Though the properties of global optimality in DL are very attractive, to the best of our knowledge there is so far no solver developed intentionally to capture such global optimality. To fill this void, we propose our BPGrad algorithm towards global optimization in DL.
From the optimization perspective, our algorithm shares similarities with the recent work [25] on global optimization of general Lipschitz functions (not specifically for DL). In [25] a uniform sampler is utilized to maximize the lower bound of the maximizer (equivalently minimizing the upper bound of the minimizer) subject to Lipschitz conditions. Convergence properties w.h.p. are derived. In contrast, our approach considers estimating both lower and upper bounds of the global optimum, and employs the gradients as guidance to more effectively sample the parameter space for pruning. Convergence is proved to show that our algorithm will terminate within finite iterations.
From the empirical solver perspective, our solver shares similarities with the recent work [19] on improving SGD using feedback from the objective function. Specifically, [19] tracks the relative changes in the objective function with a running average, and uses it to adaptively tune the learning rate in SGD. No theoretical analysis, however, is provided as justification. In contrast, our solver also uses feedback from the objective function to determine the learning rate adaptively, but based on the rescaled distance between the feedback and the current lower bound estimate. Theoretical as well as empirical justifications are established.
2 BPGrad Algorithm for Deep Learning
2.1 Key Notation
We denote by x ∈ X ⊆ R^n the parameters in the neural network, by (z, y) a pair of a data sample z and its associated label y, by φ the nonconvex prediction function represented by the network, by f the objective function for training the network with Lipschitz constant L, by ∇f(x) the gradient of f over the parameters x (we assume ∇f(x) ≠ 0 w.l.o.g.; empirically we can randomly sample a nonzero direction for update wherever ∇f(x) = 0), by ∇̃f(x) = ∇f(x)/||∇f(x)||_2 the normalized gradient (i.e. the direction of the gradient), by x* a global minimum of f, and by ||·||_2 the l2 norm operator over vectors.
Definition 1 (Lipschitz Continuity [8]).
A function f is Lipschitz continuous with Lipschitz constant L on X if there exists a (necessarily nonnegative) constant L such that
|f(x_1) - f(x_2)| ≤ L ||x_1 - x_2||_2, ∀ x_1, x_2 ∈ X.  (1)
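As a quick numeric illustration of Definition 1, a Lipschitz constant can be lower-bounded empirically by sampling pairs of points and measuring slopes. This is only a sketch; the test function, domain and sample count below are our own choices:

```python
import numpy as np

def empirical_lipschitz(f, dim, n_pairs=5000, seed=0):
    """Lower-bound the Lipschitz constant of f on [-1, 1]^dim by the largest
    observed slope |f(x1) - f(x2)| / ||x1 - x2||_2 over random pairs."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x1, x2 = rng.uniform(-1, 1, dim), rng.uniform(-1, 1, dim)
        dist = np.linalg.norm(x1 - x2)
        if dist > 1e-12:
            best = max(best, abs(f(x1) - f(x2)) / dist)
    return best

# f(x) = ||x||_1 is Lipschitz with constant sqrt(dim) w.r.t. the l2 norm,
# so for dim = 4 the estimate can never exceed 2.
est = empirical_lipschitz(lambda x: np.abs(x).sum(), dim=4)
```

Any finite sample only witnesses slopes, so this yields a lower bound on L rather than L itself.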
2.2 Problem Setup
We would like to learn the parameters x for a given network by minimizing the following objective function f:
min_{x ∈ X} f(x) ≡ E_{(z,y)} [ ℓ(y, φ(z; x)) ] + Ω(x),  (2)
where E_{(z,y)} denotes the expectation over data pairs, ℓ denotes a loss function (e.g., hinge loss) for measuring the difference between the ground-truth labels and the predicted labels given data samples, and Ω denotes a regularizer over the parameters. In particular, we assume that:

- f is lower bounded by 0 and upper bounded as well, i.e. 0 ≤ f(x) < +∞, ∀ x ∈ X;
- f is differentiable everywhere in the bounded space X;
- f is Lipschitz continuous, or can be approximated by Lipschitz functions, with constant L.
2.3 Algorithm
2.3.1 Lower & Upper Bound Estimation
Consider the situation where t samples {x_i}_{i=1,…,t} exist for evaluation by the function f with Lipschitz constant L, whose global minimum is reached at x*. Then, based on Eq. 1 and simple algebra, we obtain
max_{1≤i≤t} { f(x_i) - L ||x_i - x*||_2 } ≤ f(x*) ≤ min_{1≤i≤t} f(x_i).  (3)
This provides a tractable upper bound of the global minimum but, unfortunately, an intractable lower bound. The intractability comes from the fact that x* is unknown, which makes the lower bound in Eq. 3 unusable empirically.
To address this problem, we propose a novel tractable estimator, ρ min_{1≤i≤t} f(x_i), with a predefined constant 0 ≤ ρ < 1. This estimator intentionally introduces a gap from the upper bound, which will be shrunk by either decreasing the upper bound or increasing ρ. As proved in Thm. 1 (see Sec. 2.4), when the parameter space is fully covered by the samples {x_i}, this estimator becomes a lower bound of f(x*).
In summary, we define our lower and upper bound estimators for the global minimum f(x*) as ρ min_{1≤i≤t} f(x_i) and min_{1≤i≤t} f(x_i), respectively.
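Under this reading of Sec. 2.3.1 (the lower bound estimator scales the incumbent best evaluation by ρ ∈ [0, 1)), the two estimators reduce to a few lines; the sample evaluations and ρ below are made up for illustration:

```python
import numpy as np

def bpgrad_bounds(f_vals, rho):
    """Tractable estimators for the global minimum f(x*):
    upper bound = min_i f(x_i)         (from Eq. 3),
    lower bound = rho * min_i f(x_i)   (the estimator of Sec. 2.3.1).
    Increasing rho, or finding a lower evaluation, shrinks the gap."""
    upper = float(np.min(f_vals))
    return rho * upper, upper

# Three hypothetical evaluations f(x_1..3); gap = (1 - rho) * upper.
lower, upper = bpgrad_bounds([3.2, 2.5, 4.1], rho=0.9)
```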
2.3.2 Branch & Pruning
Based on our estimators, we propose a novel approximation algorithm, BPGrad, towards global optimization in DL via branch and pruning. We show it in Alg. 1, where a predefined constant ε > 0 controls the precision of the solution.
Branch: The inner loop in Alg. 1 conducts the branch operation to split the parameter space recursively by sampling. Towards this goal, we need a mapping between the parameter space and the bounds. Considering the lower bound in Eq. 3, we propose sampling a new point x_{t+1} based on the previous samples so that it satisfies
f(x_i) - L ||x_i - x_{t+1}||_2 ≤ ρ min_{1≤j≤t} f(x_j), ∀ i = 1, …, t.  (4)
Note that an equivalent constraint has been used in [25].
To improve sampling efficiency for decreasing the objective, we propose sampling along the directions of (stochastic) gradients with small distortion. Though gradients only encode local structures of (nonconvex) functions in a high-dimensional space, they are good indicators for locating local minima [23, 28]. Specifically, we propose a minimization problem for generating samples:
min_{x_{t+1}, η ≥ 0} ||x_{t+1} - (x_t - η ∇̃f(x_t))||_2^2 + θ η^2, s.t. Eq. 4,  (5)
where θ ≥ 0 is a predefined constant controlling the trade-off between the distortion and the step size η. That is, under the condition in Eq. 4, the objective in Eq. 5 aims to generate a sample that has small distortion from an anchor point along the direction of the gradient, whose step size is small as well due to the locality of gradients.
Note that other reasonable objective functions may also be utilized here for sampling purposes, as long as the condition in Eq. 4 is satisfied. More efficient sampling objectives will be investigated in future work.
Pruning: In fact, Eq. 4 specifies that new samples should be generated outside the union of a set of balls defined by the previous samples. To describe this requirement precisely, we introduce the new concept of removable parameter space:
Definition 2 (Removable Parameter Space (RPS)).
We define the RPS as
RPS = ∪_{1≤i≤t} B(x_i, r_i), with r_i = (f(x_i) - ρ min_{1≤j≤t} f(x_j)) / L,  (6)
where B(x_i, r_i) = { x : ||x - x_i||_2 < r_i } defines a ball centered at sample x_i with radius r_i.
The RPS specifies a region wherein the function evaluations of all points cannot be smaller than the lower bound estimator, conditioned on the Lipschitz continuity assumption. Therefore, when the lower bound estimator is higher than the global minimum f(x*), we can safely remove all points in the RPS without evaluation. However, when it becomes smaller than f(x*), we risk missing the global solutions.
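A pruning test can then be sketched as follows, assuming the ball radii take the form r_i = (f(x_i) - ρ min_j f(x_j))/L implied by Eqs. 4 and 6 (function and variable names here are ours):

```python
import numpy as np

def in_rps(x, samples, f_vals, L, rho):
    """Return True if candidate x falls inside the removable parameter space:
    x lies within distance (f(x_i) - rho * min_j f(x_j)) / L of some previous
    sample x_i, i.e. it violates Eq. 4 and can be pruned without evaluation."""
    f_min = min(f_vals)
    for x_i, f_i in zip(samples, f_vals):
        radius = (f_i - rho * f_min) / L
        if np.linalg.norm(x - x_i) < radius:
            return True
    return False
```

During branching, a candidate x_{t+1} would be kept only if this test returns False for it.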
To address this issue, we propose the outer loop in Alg. 1 to increase the lower bound estimator for drawing more samples, which may further decrease the upper bound later.
2.4 Theoretical Analysis
Theorem 1 (Lower & Upper Bounds).
Whenever X ⊆ RPS holds, the samples generated by Alg. 1 satisfy
ρ min_{1≤i≤t} f(x_i) ≤ f(x*) ≤ min_{1≤i≤t} f(x_i).  (7)
Proof.
Since f(x*) is the global minimum, it always holds that f(x*) ≤ min_{1≤i≤t} f(x_i). Now suppose that X ⊆ RPS holds but ρ min_{1≤i≤t} f(x_i) > f(x*) holds as well. Then, by Eq. 4, there would exist at least one point (i.e. the global minimum x*) left outside the RPS for sampling, contradicting the condition X ⊆ RPS. This completes the proof. ∎
Corollary 1 (Approximation Error Bound).
Whenever both X ⊆ RPS and f(x*) ≥ 0 hold, it is satisfied that
min_{1≤i≤t} f(x_i) - f(x*) ≤ (1 - ρ) min_{1≤i≤t} f(x_i).  (8)
Theorem 2 (Convergence within Finite Samples).
The total number of samples T in Alg. 1 is upper bounded by
T ≤ V / ( C ( (1 - ρ) f_min / (2L) )^n ),  (9)
where V denotes the volume of the space X ⊆ R^n, C denotes a constant (the volume of the unit ball in R^n), and f_min = min_{1≤i≤T} f(x_i) denotes the minimum evaluation.
Proof.
Given any two samples x_i, x_j with i ≠ j, by Eq. 4 we have
||x_i - x_j||_2 ≥ (f(x_i) - ρ min_{1≤k≤t} f(x_k)) / L ≥ (1 - ρ) f_min / L.  (10)
This allows us to generate two balls B(x_i, r) and B(x_j, r) with r = (1 - ρ) f_min / (2L) so that they have no overlap with each other. As a result, we can generate T balls with radius (1 - ρ) f_min / (2L) and no overlaps, and their accumulated volume should be no bigger than the volume of the space. That is,
T · C ( (1 - ρ) f_min / (2L) )^n ≤ V.  (11)
Further using simple algebra we can complete the proof. ∎
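The ball-packing argument above can be turned into a back-of-the-envelope calculator for the sample bound. This follows our reconstruction of Eqs. 9-11, so treat the exact constants as illustrative:

```python
import math

def max_samples_bound(volume, dim, L, f_min, rho):
    """Ball-packing bound on the number of samples T: non-overlapping balls
    of radius r = (1 - rho) * f_min / (2 * L) must fit inside the space of
    the given volume, so T <= volume / (C * r^dim), where C is the volume
    of the unit ball in R^dim."""
    r = (1 - rho) * f_min / (2 * L)
    unit_ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    return volume / (unit_ball * r ** dim)
```

The bound grows exponentially with the dimension, which is precisely the obstacle behind problem P1 discussed in Sec. 3.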
3 Approximate DL Solver based on BPGrad
Though the BPGrad algorithm has nice theoretical properties for global optimization, directly applying Alg. 1 to deep learning incurs the following problems that limit its empirical usage:
(P1) From Thm. 2 we can see that, due to the high dimensionality of the parameter space in DL, it is impractical to draw sufficiently many samples to cover the entire space.
(P2) Solving Eq. 5 involves the knowledge of all previous samples, which incurs a significant amount of both computational and storage burden for deep learning.
(P3) Computing f(x) and ∇f(x) is time-consuming, especially for large-scale data.
To address problem P1, in practice we manually set the maximum numbers of iterations for both the inner and outer loops in Alg. 1.
To address problem P2, we make some extra assumptions to simplify the branching/sampling procedure based on Eq. 5, as follows:
(A1) Minimizing the distortion is much more important than minimizing the step sizes, i.e. θ → 0;
(A2) the Lipschitz constant L is sufficiently large so that the sampling condition in Eq. 4 always holds;
(A3) the step size η_t is always sufficiently small for local update;
(A4) x_{t+1} can be sampled only based on x_t and ∇̃f(x_t).
By imposing these assumptions upon Eq. 5, we can directly compute the solution as follows:
x_{t+1} = x_t - η_t ∇̃f(x_t), with η_t = (f(x_t) - ρ min_{1≤j≤t} f(x_j)) / L.  (12)
To address problem P3, we utilize minibatches to estimate f(x) and ∇f(x) efficiently in each iteration.
In summary, we list our BPGrad solver in Alg. 2, obtained by modifying Alg. 1 for fast sampling as well as a low memory footprint in DL, but at the risk of getting stuck in local regions. Fig. 3 illustrates such scenarios on a 1D example. In (b) the sampling method falls into a loop because it considers only the current sample rather than the history of samples. In contrast, the sampling method in (a) is able to keep generating new samples by avoiding the RPS of previous samples, at the cost of more computation and storage, as expected.
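Putting Eq. 12 and the momentum term that Alg. 2 employs together, one iteration of the approximate solver can be sketched as follows. This is our own reading of Alg. 2; the toy objective and all names are illustrative:

```python
import numpy as np

def bpgrad_step(x, grad, f_x, f_best, velocity, L, rho=0.9, momentum=0.9):
    """One approximate BPGrad update: the adaptive learning rate is the
    rescaled gap between the current evaluation and the lower bound estimator
    (Eq. 12), the gradient is normalized, and momentum dampens oscillations.
    f_best tracks min_j f(x_j) over the past iterations."""
    g_norm = np.linalg.norm(grad)
    if g_norm < 1e-12:  # footnote 1: pick a random direction if the gradient vanishes
        grad = np.random.randn(*x.shape)
        g_norm = np.linalg.norm(grad)
    eta = (f_x - rho * f_best) / L  # adaptive step size from Eq. 12
    velocity = momentum * velocity - eta * grad / g_norm
    return x + velocity, velocity

# Toy run on f(x) = ||x||_2^2, whose global minimum is 0 at the origin.
x, v, f_best = np.array([2.0, -1.0]), np.zeros(2), float("inf")
for _ in range(200):
    f_x = float(x @ x)
    f_best = min(f_best, f_x)
    x, v = bpgrad_step(x, 2 * x, f_x, f_best, v, L=10.0)
```

Unlike SGD's fixed schedule, the step size here shrinks automatically as the incumbent objective approaches the estimated lower bound.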
3.1 Theoretical Analysis
Theorem 3 (Global Property Preservation).
Proof.
Corollary 2.
Discussion: Both Thm. 3 and Cor. 2 imply that, roughly speaking, our solver prefers sampling the parameter space along a path towards a single direction. However, the gradients in conventional backpropagation have little guarantee of satisfying Eq. 13 or Eq. 15, due to the lack of such constraints in learning. On the other hand, momentum [31] is a well-known technique in deep learning for dampening oscillations in gradients and accelerating along directions of low curvature. Therefore, our solver in Alg. 2 involves momentum to compensate for such drawbacks in backpropagation and better approximate Alg. 1.
3.2 Empirical Justification
In this section we discuss the feasibility of the assumptions A1-A4 for reducing computation and storage as well as preserving the properties towards global optimization in deep learning. We utilize MatConvNet [35] as our testbed, and run our solver in Alg. 2 to train the default networks in MatConvNet for MNIST [21] and CIFAR10 [20], respectively, using the default parameters unless stated otherwise. Also, we set ρ and L for MNIST and CIFAR10 by default to the values used in Sec. 4.1.1. For justification purposes we only run 4 epochs on each dataset, with 600 and 500 iterations per epoch for MNIST and CIFAR10, respectively. For more experimental details, please refer to Sec. 4.
Essentially, assumption A1 is made to support the other three in simplifying the objective in Eq. 5, and assumption A2 usually holds in deep learning due to its high dimensionality. Therefore, below we only focus on the empirical justification of assumptions A3 and A4.
Feasibility of A3: To justify this, we collect the step sizes η_t by running Alg. 2 on both datasets, and plot them in Fig. 4. Overall, these values are indeed sufficiently small for local updates based on gradients, and in general η_t decreases as the number of iterations increases. This behavior is expected, as the objective is supposed to decrease w.r.t. iterations as well. The value gap at the beginning on the two datasets is induced mainly by the different choices of L.
Feasibility of A4: To justify this, we show some evidence in Fig. 5, where we plot the left-hand side (LHS) and right-hand side (RHS) of Eq. 4 based on the samples returned by Alg. 2. As we see, in all the subfigures on the right (with momentum), the values on the RHS are always no smaller than the corresponding values on the LHS. In contrast, in the remaining subfigures on the left, without momentum (i.e. the conventional SGD update), the values on the RHS are always no bigger than the corresponding values on the LHS. These observations appear to be robust across different datasets, and insensitive to the parameters that determine the radii of the balls, i.e. the step sizes for gradients. The momentum parameter, which is related to the directions of gradients for updating models, appears to be the only factor that makes the samples of our solver satisfy Eq. 4. This also supports our claims in Thm. 3 and Cor. 2 about the relation between the model update and the gradient required to satisfy Eq. 4. More evidence is provided in Sec. 4.1.1. Given this evidence, we hypothesize that assumption A4 may hold empirically when using sufficiently large momentum values.
4 Experiments
To demonstrate the generalization of our BPGrad solver, we test it in the applications of object recognition, detection, and segmentation by training deep convolutional neural networks (CNNs). We utilize MatConvNet as our testbed, and employ its demo code as well as default network architectures for different tasks. Since our solver has the ability of determining learning rates adaptively, we compare ours with another four widely used DL solvers with adaptive learning rates, namely Adagrad, Adadelta, RMSProp, and Adam. We tune the parameters in these solvers to achieve their best performance as we can.
4.1 Object Recognition
4.1.1 MNIST & CIFAR10
The MNIST digit dataset consists of a training set of 60,000 images and a test set of 10,000 images in 10 classes labeled from 0 to 9, where all images have a resolution of 28×28 pixels. The CIFAR10 dataset consists of a training set of 50,000 images and a test set of 10,000 images in 10 object classes, where the image resolution is 32×32 pixels.
We follow the default implementation to train an individual CNN similar to LeNet [22] on each dataset; for the details of the network architectures, please refer to the demo code. For all the solvers, we train the networks on MNIST and CIFAR10 with the same minibatch size, weight decay, and momentum. In addition, we fix the initial weights of the two networks and the feeding order of minibatches for fair comparison. On MNIST, the global learning rate is shared by Adagrad, RMSProp and Adam. On CIFAR10, the global learning rate is set separately for RMSProp and for Adagrad, Adam and Eve [19], and it is reduced twice at later epochs. Adadelta does not require a global learning rate.
For our solver, the parameters ρ and T typically depend on the numbers of minibatches and epochs, respectively. Empirically, we find that our default choice of ρ seems to work well, and thus we use it for all the experiments. Accordingly, T is set by default to the product of the numbers of minibatches and epochs.
We also find that performance is quite robust w.r.t. the parameter L, the Lipschitz constant, indicating that heavy tuning of this parameter is unnecessary in practice. To demonstrate this, we compare the training objectives of our solver with varying L in Fig. 6. To highlight the differences, here we crop and show the results of the first four epochs, but note that the remaining results behave similarly. As we can see, on MNIST the corresponding curves are clustered when L varies from 10 to 100, and similarly on CIFAR10 for L from 50 to 1000. We set L within these ranges for MNIST and CIFAR10, respectively, in our solver.
Next, we show the solver comparison results in Fig. 7. To illustrate the effect of momentum on the performance of our solver, here we plot two variants of our solver, with and without momentum, respectively. As we see, the variant with momentum works much better, achieving a lower training objective as well as a lower top-1 error at test time. This again provides evidence for the importance of satisfying Eq. 4 in our solver when searching for good solutions towards global optimality.
Overall, our solver performs best on MNIST and slightly worse on CIFAR10 at test time, although in terms of training objective it achieves competitive performance on MNIST and the best on CIFAR10. We hypothesize that this behavior comes from the regularization effect of Lipschitz continuity. Moreover, our solver decreases the objectives much faster than all the competitors in the first few epochs. This observation reflects the superior ability of our solver to determine adaptive learning rates for gradients. On CIFAR10 we also compare with an extra solver, Eve, based on our implementation. Eve was proposed in the recent related work [19] that improves Adam with feedback from the objective function, and was tested on CIFAR10 as well. As we can see, our solver is much more reliable, performing consistently over epochs.
4.1.2 ImageNet ILSVRC2012 [20]
This dataset contains about 1.28 million training images and 50,000 validation images among 1,000 object classes. Following the demo code, we train the same AlexNet [20] on it from scratch using different solvers, with the default minibatch size, weight decay, momentum, and learning rates for the competitors, and with ρ and L set accordingly for our solver.
Table 1: Top-1 error (%) comparison on ILSVRC2012.

             Adagrad  Adadelta  RMSProp  Adam   BPGrad
training      49.0     71.6      46.0    70.0    33.0
validation    54.8     76.7      47.2    72.8    44.0
Table 2: Per-class AP (%) and mAP comparison on the VOC2007 test set.
aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  mbike  persn  plant  sheep  sofa  train  tv  mAP  
Adagrad  67.5  71.5  60.7  47.1  28.3  72.7  76.7  77.0  34.3  70.2  64.0  72.0  74.2  69.5  64.9  28.8  57.4  60.5  73.1  61.1  61.7 
RMSProp  69.1  75.8  61.5  47.9  30.2  74.7  77.1  79.4  33.2  71.1  66.3  74.4  76.3  69.9  65.1  28.9  62.9  62.5  73.2  60.8  63.0 
Adam  68.9  79.9  64.1  56.6  37.0  77.4  77.7  82.5  38.2  71.5  64.7  77.6  77.7  75.0  66.8  30.6  65.9  65.1  74.4  67.9  66.0 
BPGrad  69.4  77.7  66.4  55.1  37.2  76.1  77.7  83.6  38.6  73.8  67.4  76.0  81.9  72.7  66.3  31.0  64.2  66.2  73.8  64.9  66.0 
We show the comparison results in Fig. 8. It is evident that our solver works best at both training and test time; namely, it converges faster and achieves a lower objective as well as a lower top-1 error on the validation set. In terms of numbers, ours is 3.2% lower than the second best, RMSProp, as listed in Table 1.
Based on all the experiments above we conclude that our solver is suitable to train deep models for object recognition.
4.2 Object Detection
Following Fast R-CNN [11] in the demo code, we conduct the solver comparison on the PASCAL VOC2007 dataset [9] with 20 object classes, using selective search [34] as the default object proposal approach. For all solvers, we train the network on the images in the VOC2007 trainval set and test it on the images in the VOC2007 test set, sharing the same weight decay and momentum, and using the default learning rates for the competitors. We do not compare with Adadelta because we could not obtain reasonable performance after heavy parameter tuning. For our solver, we set ρ and L accordingly.
We show the training comparison in Fig. 9 and the test results in Table 2. Though our training losses are inferior to those of Adam in this case, our solver works as well as Adam at test time on average, achieving the best AP on 10 of the 20 classes. This demonstrates the suitability of our solver for training deep models for object detection.
4.3 Object Segmentation
Following the work [24] on semantic segmentation based on fully convolutional networks (FCN), we train FCN-32s with a per-pixel multinomial logistic loss and validate it with the standard metrics of mean pixel intersection over union (IU), pixel accuracy, and mean accuracy. For all the solvers, we conduct training on the PASCAL VOC2011 [10] segmentation set with shared momentum and weight decay. For Adagrad, RMSProp and Adam, we find that the default parameters achieve the best performance, while for Adadelta we tune its parameters. The global learning rates for RMSProp, Adagrad and Adam are set individually; Adadelta does not require a global learning rate. For our solver, we set ρ and L accordingly.
We show the learning curves on the training and validation sets in Fig. 10, and list the test-time comparison results in Table 3. In this case our solver has very similar learning behavior to Adagrad, but achieves the best performance at test time. The smaller fluctuation over epochs on the validation set again demonstrates the superior reliability of our solver compared with the competitors. Taking these observations into account, we believe that our solver is able to learn robust deep models for object segmentation.
Table 3: Test-time comparison (%) on the PASCAL VOC2011 segmentation set.
mean IU  pixel accuracy  mean accuracy  average  
Adagrad  60.8  89.5  77.4  75.9 
Adadelta  46.6  86.0  54.4  62.3 
RMSProp  60.5  90.2  71.0  73.9 
Adam  50.9  87.2  66.4  68.2 
BPGrad  62.4  89.8  79.6  77.3 
5 Conclusion
In this paper we propose a novel approximation algorithm, namely BPGrad, towards searching for global optimality in DL via branch and pruning, based on the Lipschitz continuity assumption. Our basic idea is to keep generating new samples from the parameter space (i.e. branching) outside the removable parameter space (i.e. pruning). Lipschitz continuity not only provides us with a way to estimate the lower and upper bounds of the global optimum, but also serves as regularization to further smooth the objective functions in DL. Theoretically, we prove that under certain conditions our BPGrad algorithm can converge to global optimality within finitely many iterations. Empirically, in order to avoid the high demands of computation as well as storage for BPGrad in DL, we propose a new efficient solver. Theoretical and empirical justification for preserving the properties of BPGrad is provided. We demonstrate the superiority of our solver over several conventional DL solvers in object recognition, detection, and segmentation.
References
 [1] A. Blum and R. L. Rivest. Training a 3-node neural network is NP-complete. In NIPS, pages 494–501, 1989.
 [2] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
 [3] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
 [4] P. Chaudhari, A. Choromanska, S. Soatto, and Y. LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
 [5] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, pages 192–204, 2015.
 [6] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR, 2009.
 [7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12(Jul):2121–2159, 2011.
 [8] K. Eriksson, D. Estep, and C. Johnson. Applied Mathematics Body and Soul: Vol IIII. SpringerVerlag Publishing, 2003.
 [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html.
 [10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascalnetwork.org/challenges/VOC/voc2011/workshop/index.html.
 [11] R. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
 [12] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [13] B. D. Haeffele and R. Vidal. Global optimality in neural network training. In CVPR, pages 7331–7339, 2017.
 [14] P. Hand and V. Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [16] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 [17] K. Kawaguchi. Deep learning without poor local minima. In NIPS, pages 586–594, 2016.
 [18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [19] J. Koushik and H. Hayashi. Improving stochastic gradient descent with feedback. arXiv preprint arXiv:1611.01505, 2016.
 [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

 [21] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [23] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In COLT, pages 1246–1257, 2016.
 [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
 [25] C. Malherbe and N. Vayatis. Global optimization of lipschitz functions. In ICML, 2017.
 [26] M. C. Mukkamala and M. Hein. Variants of RMSProp and Adagrad with logarithmic regret bounds. arXiv preprint arXiv:1706.05507, 2017.
 [27] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
 [28] I. Panageas and G. Piliouras. Gradient descent only converges to minimizers: Nonisolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.
 [29] D. G. Sotiropoulos and T. N. Grapsa. A branchandprune method for global optimization. In Scientific Computing, Validated Numerics, Interval Methods, pages 215–226. Springer, 2001.
 [30] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
 [31] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, pages 1139–1147, 2013.
 [32] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.

 [33] T. Tieleman and G. Hinton. Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 [34] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.
 [35] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In ACM Multimedia, pages 689–692, 2015.
 [36] C. Yun, S. Sra, and A. Jadbabaie. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.
 [37] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 [38] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
 [39] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging sgd. In NIPS, pages 685–693, 2015.
 [40] Z. Zhang and M. Brand. Convergent block coordinate descent for training tikhonov regularized deep neural networks. In NIPS, 2017.