BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning

Understanding the global optimality in deep learning (DL) has been attracting more and more attention recently. Conventional DL solvers, however, have not been developed intentionally to seek for such global optimality. In this paper we propose a novel approximation algorithm, BPGrad, towards optimizing deep models globally via branch and pruning. Our BPGrad algorithm is based on the assumption of Lipschitz continuity in DL, and as a result it can adaptively determine the step size for current gradient given the history of previous updates, wherein theoretically no smaller steps can achieve the global optimality. We prove that, by repeating such branch-and-pruning procedure, we can locate the global optimality within finite iterations. Empirically an efficient solver based on BPGrad for DL is proposed as well, and it outperforms conventional DL solvers such as Adagrad, Adadelta, RMSProp, and Adam in the tasks of object recognition, detection, and segmentation.



1 Introduction

Deep learning (DL) has been demonstrated successfully in many different research areas such as image classification [20], speech recognition [16], and natural language processing [32]. In general, its empirical success stems mainly from better network architectures [15], larger amounts of training data [6], and better learning algorithms [12].

However, a theoretical understanding of DL and its success in applications still remains elusive. Very recently, researchers have started to understand DL from the perspective of optimization, such as the optimality of the learned models [13, 14, 36]. It has been proved that under certain (very restrictive) conditions the critical points learned for deep models actually achieve global optimality, even though the optimization in deep learning is highly nonconvex. These theoretical results may partially explain why such deep models work well in practice.

Global optimality is always desirable and preferred in optimization. Locating global optimality in deep learning, however, is extremely challenging due to its high non-convexity, and thus no conventional DL solver, e.g. stochastic gradient descent (SGD) [2], Adagrad [7], Adadelta [37], RMSProp [33], or Adam [18], is intentionally developed for this purpose, to the best of our knowledge. Alternatively, different regularization techniques are applied to smooth the objective functions in DL so that the solvers can converge to some geometrically wider and flatter regions in the parameter space where good model solutions may exist [39, 4, 40]. But these solutions may not necessarily be the global optimum.

Figure 1: Illustration of how BPGrad works, where each black dot denotes the solution at each iteration (i.e. branch), directed dotted lines denote the current gradients, and red dotted circles denote the regions wherein there should be no solutions achieving global optimality (i.e. pruning). BPGrad can automatically estimate the scales of these regions based on the function evaluations of solutions and the Lipschitz continuity assumption.

Inspired by techniques in the global optimization of nonconvex functions, we propose a novel approximation algorithm, BPGrad, which has the ability to locate global optimality in DL via branch and pruning (BP). BP [29] is a well-known algorithm developed for searching for global solutions to nonconvex optimization problems. Its basic idea is to effectively and gradually shrink the gap between the lower and upper bounds of the global optimum by efficiently branching and pruning the parameter space. Fig. 1 illustrates the optimization procedure in BPGrad.

In order to branch and prune the space, we assume that the objective functions in DL are Lipschitz continuous [8], or can be approximated by Lipschitz functions. This is motivated by the facts that (1) Lipschitz continuity provides a natural way to estimate the lower and upper bounds of the global optimum (see Sec. 2.3.1) used in BP, and (2) it can also serve as regularization, if needed, to smooth the objective functions so that the returned solutions generalize well.

Figure 2: Illustration of Lipschitz continuity as regularization (red) to smooth a function (blue).

In Fig. 2 we illustrate the functionality of Lipschitz continuity as regularization, where the noisy narrower but deeper valley is smoothed out, while the wider but shallower valley is preserved. Such regularization behavior can prevent algorithms from becoming stuck in bad local minima, and it has been advocated and demonstrated to be crucial for achieving good generalization of learned DL models in several recent works such as [4]. In this sense, our BPGrad algorithm/solver essentially aims to locate global optimality in the smoothed objective functions for DL.

Further, BPGrad generates solutions along the directions of gradients (i.e. branch) based on the estimated regions wherein no global optimum should exist theoretically (i.e. pruning), and by repeating such a branch-and-pruning procedure BPGrad can locate the global optimum. Empirically, the high demand in computation as well as memory footprint of running BPGrad inspires us to develop an efficient DL solver that approximates BPGrad towards global optimization.

Contributions: The main contributions of our work are:

  1. We propose a novel approximation algorithm, BPGrad, which is intent on locating the global optimum in DL. To our best knowledge, our approach is the first algorithmic attempt towards global optimization in DL.

  2. Theoretically we prove that BPGrad can converge to global optimality within finite iterations.

  3. Empirically we propose a novel and efficient DL solver based on BPGrad to reduce the requirement of computation as well as footprint in memory. We provide both theoretical and empirical justification for our solver towards preserving the theoretical properties of BPGrad. We demonstrate that our solver outperforms conventional DL solvers in the applications of object recognition, detection, and segmentation.

1.1 Related Work

Global Optimality in DL: The empirical loss minimization problem in learning deep models is high-dimensional and nonconvex, with potentially numerous local minima and saddle points. Blum and Rivest [1] showed that it is difficult to find the global optimum because, in the worst case, even learning a simple 3-node neural network is NP-complete.

In spite of the difficulties in optimizing deep models, researchers have attempted to provide empirical as well as theoretical justification for the success of these models w.r.t. global optimality in learning. Zhang et al. [38] empirically demonstrated that sufficiently over-parametrized networks trained with stochastic gradient descent can reach global optimality. Choromanska et al. [5] studied the loss surface of multilayer networks using a spin-glass model and showed that for many large-size decoupled networks, there exists a band with many local optima whose objective values are small and close to that of a global optimum. Brutzkus and Globerson [3] showed that gradient descent converges to the global optimum in polynomial time on a shallow neural network with one hidden layer, a convolutional structure, and a ReLU activation function. Kawaguchi [17] proved that the error landscape does not have bad local minima in the optimization of linear deep neural networks. Yun et al. [36] extended these results and proposed sufficient and necessary conditions for a critical point to be a global minimum. Haeffele and Vidal [13] suggested that it is critical to balance the degrees of positive homogeneity between the network mapping and the regularization function to prevent non-optimal local minima in the loss surface of neural networks. Nguyen and Hein [27] argued that almost all local minima are globally optimal in fully connected wide neural networks whose number of hidden neurons in one layer is larger than the number of training points. Soudry and Carmon employed smoothed analysis techniques to provide a theoretical guarantee that the highly nonconvex loss functions in multilayer networks can be easily optimized using local gradient descent updates. Hand and Voroninski [14] provided theoretical properties for the problem of enforcing priors given by generative deep neural networks via empirical risk minimization, by establishing favorable global geometry.

DL Solvers: SGD [2] is the most widely used DL solver due to its simplicity, whose learning rate (i.e., step size for the gradient) is predefined. In general, SGD suffers from slow convergence, and thus its learning rate needs to be carefully tuned. To improve the efficiency of SGD, several DL solvers with adaptive learning rates have been proposed, including Adagrad [7], Adadelta [37], RMSProp [33], and Adam [18]. These solvers integrate the advantages of both stochastic and batch methods, using small mini-batches to estimate diagonal second-order information heuristically. They are capable of escaping saddle points and often yield faster convergence empirically.

Specifically, Adagrad is well suited for dealing with sparse data, as it adapts the learning rate to the parameters, performing smaller updates on frequent parameters and larger updates on infrequent ones. However, it suffers from a continually shrinking learning rate, which motivates Adadelta, RMSProp, and Adam. Adadelta accumulates squared gradients within a window rather than over the entire history as in Adagrad, RMSProp updates the parameters based on rescaled gradients, and Adam does so based on the estimated mean and variance of the gradients. Very recently, Mukkamala et al. [26] proposed variants of RMSProp and Adagrad with logarithmic regret bounds.
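To make the differences concrete, here are minimal NumPy sketches of the three adaptive update rules just described; the hyperparameter names and default values are ours, for illustration only:

```python
import numpy as np

def adagrad_step(w, g, acc, lr=0.01, eps=1e-8):
    """Adagrad: squared gradients accumulate forever, so the effective
    learning rate keeps shrinking."""
    acc = acc + g * g
    return w - lr * g / (np.sqrt(acc) + eps), acc

def rmsprop_step(w, g, acc, lr=0.01, gamma=0.9, eps=1e-8):
    """RMSProp: a decaying average of squared gradients avoids the
    shrink-to-zero behavior of Adagrad."""
    acc = gamma * acc + (1.0 - gamma) * g * g
    return w - lr * g / (np.sqrt(acc) + eps), acc

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: updates based on bias-corrected estimates of the mean (m)
    and uncentered variance (v) of the gradients; t counts steps from 1."""
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Note how Adagrad's accumulator only grows, which is exactly the shrinking-learning-rate issue mentioned above, while RMSProp and Adam keep decaying averages instead.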

Convention vs. Ours: Though the properties of global optimality in DL are very attractive, to the best of our knowledge no solver has been intentionally developed to capture such global optimality so far. To fill this void, we propose our BPGrad algorithm towards global optimization in DL.

From the optimization perspective, our algorithm shares similarities with the recent work [25] on global optimization of general Lipschitz functions (not specifically for DL). In [25] a uniform sampler is utilized to maximize the lower bound of the maximizer (equivalently, minimizing the upper bound of the minimizer) subject to Lipschitz conditions, and convergence properties that hold with high probability are derived. In contrast, our approach estimates both lower and upper bounds of the global optimum, and employs gradients as guidance to sample the parameter space more effectively for pruning. Convergence is proved, showing that our algorithm terminates within finite iterations.

From the empirical solver perspective, our solver shares similarities with the recent work [19] on improving SGD using feedback from the objective function. Specifically, [19] tracks the relative changes in the objective function with a running average and uses it to adaptively tune the learning rate in SGD; no theoretical analysis, however, is provided as justification. In contrast, our solver also uses feedback from the objective function to determine the learning rate adaptively, but based on the rescaled distance between the feedback and the current lower bound estimate. Theoretical as well as empirical justifications are established.

2 BPGrad Algorithm for Deep Learning

2.1 Key Notation

We denote x ∈ X as the parameters in the neural network, with X the bounded parameter space, (z, y) as a pair of a data sample and its associated label, φ as the nonconvex prediction function represented by the network, f as the objective function for training the network with Lipschitz constant L, ∇f(x) as the gradient of f over the parameters x (w.l.o.g. we assume ∇f(x) ≠ 0; empirically we can randomly sample a non-zero direction for update wherever ∇f(x) = 0), ∇f(x)/||∇f(x)|| as the normalized gradient (i.e. direction of the gradient), f* as the global minimum, and ||·|| as the ℓ2-norm operator over vectors.

Definition 1 (Lipschitz Continuity [8]).

A function f is Lipschitz continuous on the bounded space X if there is a (necessarily nonnegative) constant L, the Lipschitz constant, such that

|f(x1) − f(x2)| ≤ L · ||x1 − x2||,  ∀ x1, x2 ∈ X.   (1)

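As a quick numerical companion to the definition, the Lipschitz inequality can be checked (or the best empirical constant estimated) on sampled points; the helper names below are ours, for illustration only:

```python
import numpy as np

def lipschitz_ok(f, xs, L):
    """Check |f(a) - f(b)| <= L * ||a - b|| for every sampled pair."""
    xs = [np.asarray(x, dtype=float) for x in xs]
    for i, a in enumerate(xs):
        for b in xs[i + 1:]:
            if abs(f(a) - f(b)) > L * np.linalg.norm(a - b) + 1e-12:
                return False
    return True

def empirical_lipschitz(f, xs):
    """Largest observed ratio |f(a) - f(b)| / ||a - b||: a lower bound
    on any valid Lipschitz constant for f over these samples."""
    xs = [np.asarray(x, dtype=float) for x in xs]
    best = 0.0
    for i, a in enumerate(xs):
        for b in xs[i + 1:]:
            d = np.linalg.norm(a - b)
            if d > 0:
                best = max(best, abs(f(a) - f(b)) / d)
    return best
```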
2.2 Problem Setup

We would like to learn the parameters x for a given network by minimizing the following objective function f:

min_x f(x) ≜ E_{(z,y)} [ ℓ(y, φ(z; x)) ] + R(x),

where E denotes the expectation over data pairs (z, y), ℓ denotes a loss function (e.g., hinge loss) for measuring the difference between the ground-truth labels and the labels predicted from the data samples, and R denotes a regularizer over the parameters. Particularly we assume that:

  1. f is lower bounded by 0 and upper bounded as well;

  2. f is differentiable everywhere in the bounded space X;

  3. f is Lipschitz continuous, or can be approximated by Lipschitz functions, with constant L.

2.3 Algorithm

2.3.1 Lower & Upper Bound Estimation

Consider the situation where samples {x_i} exist for evaluation by the function f with Lipschitz constant L, whose global minimum f* is reached by the sample x*. Then based on Eq. 1 and simple algebra, we can obtain

min_i f(x_i) − L · max_i ||x_i − x*|| ≤ f* ≤ min_i f(x_i).   (3)

This provides us a tractable upper bound but, unfortunately, an intractable lower bound of the global minimum. The intractability comes from the fact that x* is unknown, which makes the lower bound in Eq. 3 unusable empirically.

To address this problem, we propose a novel tractable estimator, ρ · min_i f(x_i) with ρ ∈ [0, 1). This estimator intentionally introduces a gap from the upper bound, which will be shrunk by either decreasing the upper bound or increasing ρ. As proved in Thm. 1 (see Sec. 2.4), when the parameter space is fully covered by the samples {x_i}, this estimator will become a lower bound of f*.

In summary, we define our lower and upper bound estimators for the global minimum as ρ · min_i f(x_i) and min_i f(x_i), respectively.
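A minimal sketch of the two estimators, assuming (as an illustrative reading of the text) that the lower estimator is the upper bound scaled down by a gap-control factor `rho` in [0, 1); the names are ours:

```python
import numpy as np

def bound_estimators(f_vals, rho=0.9):
    """Upper bound: best objective value seen so far. Lower estimator:
    the upper bound scaled by rho in [0, 1), leaving a gap that shrinks
    as the upper bound decreases or as rho grows toward 1."""
    assert 0.0 <= rho < 1.0
    assert all(v >= 0 for v in f_vals)  # f is assumed lower bounded by 0
    upper = float(np.min(f_vals))
    lower = rho * upper
    return lower, upper
```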

2.3.2 Branch & Pruning

Based on our estimators, we propose a novel approximation algorithm, BPGrad, towards global optimization in DL via branch and pruning. We show it in Alg. 1 where the predefined constant controls the precision of the solution.

Input : objective function with Lipschitz constant , precision
Output : minimizer
Randomly initialize , , ; while  do
       while  satisfies Eq. 4 do
             Compute by solving Eq. 5; ;
       end while
      Increase such that still holds;
end while
return where ;
Algorithm 1 BPGrad Algorithm for Deep Learning
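A heavily simplified, runnable sketch of the branch step of Alg. 1 on a low-dimensional function. The specific choices here, a step of size `(f(w) - rho * f_best) / L` along the normalized gradient and a stop once the bound gap `(1 - rho) * f_best` falls below the precision `eps`, are our assumptions in the spirit of the text, not the paper's exact equations:

```python
import numpy as np

def bpgrad_sketch(f, grad, w0, L, rho=0.5, eps=1e-3, max_iter=200):
    """Simplified branch loop: move along the normalized gradient by
    (f(w) - rho * f_best) / L, i.e. just far enough to leave the pruning
    ball of the current sample, and stop once the gap between the upper
    and lower estimators, (1 - rho) * f_best, is below eps."""
    w = np.asarray(w0, dtype=float)
    samples, f_best = [], np.inf
    for _ in range(max_iter):
        fw = f(w)
        samples.append((w.copy(), fw))
        f_best = min(f_best, fw)
        if (1.0 - rho) * f_best <= eps:  # bounds are eps-close: done
            break
        g = grad(w)
        step = (fw - rho * f_best) / L   # radius of the current pruning ball
        w = w - step * g / (np.linalg.norm(g) + 1e-12)
    return min(samples, key=lambda s: s[1])[0]
```

On a 1D quadratic with a matching Lipschitz constant, the loop keeps shrinking the gap between the two estimators until it falls below `eps`.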

Branch: The inner loop in Alg. 1 conducts the branch operation to split the parameter space recursively by sampling. Towards this goal, we need a mapping between the parameter space and the bounds. Considering the lower bound in Eq. 3, we propose sampling x_{t+1} based on the previous samples so that it satisfies

f(x_i) − L · ||x_{t+1} − x_i|| ≤ ρ · min_j f(x_j),  ∀ i ≤ t.   (4)
Note that an equivalent constraint has been used in [25].

To improve sampling efficiency for decreasing the objective, we propose sampling along the directions of (stochastic) gradients with small distortion. Though gradients only encode local structures of (nonconvex) functions in a high dimensional space, they are good indicators for locating local minima [23, 28]. Specifically, we propose a minimization problem for generating samples:


where a predefined constant controls the trade-off between the distortion and the step size. That is, under the condition in Eq. 4, the objective in Eq. 5 aims to generate a sample along the direction of the gradient that has small distortion from an anchor point and whose step size is small as well, due to the locality property of gradients.

Note that other reasonable objective functions may also be utilized here for sampling purpose as long as the condition in Eq. 4 is satisfied. More efficient sampling objectives will be investigated in our future work.

Pruning: In fact Eq. 4 specifies that new samples should be generated outside the union of a set of balls defined by previous samples. To precisely describe this requirement, we introduce a new concept of removable solution space in our work as follows:

Definition 2 (Removable Parameter Space (RPS)).

We define the RPS as the union of balls centered at the previous samples,

RPS = ∪_i B( x_i, (f(x_i) − ρ · min_j f(x_j)) / L ),

where B(x, r) = {w : ||w − x|| ≤ r} defines a ball centered at sample x with radius r.

The RPS specifies a region wherein the function evaluations of all the points cannot be smaller than the lower bound estimator, conditioned on the Lipschitz continuity assumption. Therefore, when the lower bound estimator is higher than the global minimum, we can safely remove all the points in the RPS without evaluation. However, when it becomes smaller than the global minimum, we risk missing the global solutions.

To address this issue, we propose the outer loop in Alg. 1 to increase the lower bound for drawing more samples which may further decrease the upper bound later.
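The membership test behind Definition 2 can be sketched as follows; the ball radius `(f(x_i) - lower_est) / L` follows the Lipschitz reasoning above, and all names are illustrative:

```python
import numpy as np

def in_rps(w, samples, f_vals, L, lower_est):
    """A point w is removable if it lies inside some ball B(x_i, r_i)
    with radius r_i = (f(x_i) - lower_est) / L: by Lipschitz continuity,
    f(w) >= f(x_i) - L * ||w - x_i|| >= lower_est inside that ball, so
    no point there can beat the lower bound estimator."""
    w = np.asarray(w, dtype=float)
    for x, fx in zip(samples, f_vals):
        r = (fx - lower_est) / L
        if np.linalg.norm(w - np.asarray(x, dtype=float)) <= r:
            return True
    return False
```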

2.4 Theoretical Analysis

Theorem 1 (Lower & Upper Bounds).

Whenever holds, the samples generated by Alg. 1 satisfy


Since is the global minimum, it always holds that . Now suppose that if holds, holds as well. Then there would exist at least one point (i.e. global minimum) left for sampling, contradicting the condition of . We then complete the proof. ∎

Corollary 1 (Approximation Error Bound).

Whenever both and hold, it is satisfied that

Theorem 2 (Convergence within Finite Samples).

The total number of samples, , in Alg. 1 is upper bounded by:


where denotes the volume of the space , denotes a constant, and denotes the minimum evaluation.


Given such that , we have


This allows us to generate two balls and so that they have no overlap with each other. As a result we can generate balls with radius of and no overlaps, and their accumulated volume should be no bigger than . That is,


Further using simple algebra we can complete the proof. ∎
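The counting step is a standard ball-packing argument; with generic symbols (N disjoint balls of radius r in dimension n, a unit-ball volume constant C_n, and the bounded space X, all illustrative), it reads:

```latex
% N disjoint balls of radius r must fit inside the bounded space X:
\sum_{i=1}^{N} \mathrm{vol}\big(B(x_i, r)\big) = N \, C_n r^n \le \mathrm{vol}(X)
\quad\Longrightarrow\quad
N \le \frac{\mathrm{vol}(X)}{C_n r^n}.
```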

3 Approximate DL Solver based on BPGrad

Input : number of evaluations repeating times at most, objective function with Lipschitz constant , momentum
Output : minimizer
, and randomly initialize ; for  to  do
       ; while  do
             ; ; ;
       end while
      if  holds then  Break ;
end for
return where ;
Algorithm 2 BPGrad based Solver for Deep Learning
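One update of the solver can be sketched as follows, under our illustrative reading of the text: the learning rate is the gap between the objective feedback f(w) and a lower estimator rho * f(w), rescaled by 1/L per the Lipschitz assumption, applied along a momentum-smoothed normalized gradient. The names and the exact formula are assumptions rather than the paper's verbatim equations:

```python
import numpy as np

def bpgrad_step(w, v, loss_fn, grad_fn, L, rho=0.9, mu=0.9):
    """One hedged BPGrad-style update: the step size is the rescaled gap
    between the objective feedback f(w) and the lower estimator
    rho * f(w), taken along the momentum-smoothed gradient direction."""
    g = grad_fn(w)
    v = mu * v + g                        # momentum accumulation
    eta = (1.0 - rho) * loss_fn(w) / L    # (f(w) - rho * f(w)) / L
    w = w - eta * v / (np.linalg.norm(v) + 1e-12)
    return w, v
```

Iterating this step on a toy quadratic steadily decreases the objective; in Alg. 2 the same step would be driven by mini-batch estimates of the objective and its gradient.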

Though the BPGrad algorithm has nice theoretical properties for global optimization, directly applying Alg. 1 to deep learning will incur the following problems that limit its empirical usage:

  1. From Thm. 2 we can see that, due to the high dimensionality of the parameter space in DL, it is impractical to draw sufficient samples to cover the entire space.

  2. Solving Eq. 5 involves the knowledge of all previous samples, which incurs a significant amount of both computational and storage burden for deep learning.

  3. Computing the objective and its gradient is time-consuming, especially for large-scale data.

To address problem P1, in practice we manually set the maximum iterations for both inner and outer loops in Alg. 1.

To address problem P2, we further make some extra assumptions to simplify the branching/sampling procedure based on Eq. 5 as follows:

  1. Minimizing the distortion is much more important than minimizing the step size;

  2. is sufficiently large where so that always holds;

  3. is always sufficiently small for local update;

  4. can be sampled only based on and .

By imposing these assumptions upon Eq. 5, we can directly compute the solution as follows:

x_{t+1} = x_t − η_t · ∇f(x_t) / ||∇f(x_t)||,  where the step size is η_t = (1 − ρ) · f(x_t) / L.   (12)
To address problem P3, we utilize mini-batches to estimate and efficiently in each iteration.

In summary, we list our BPGrad solver in Alg. 2, obtained by modifying Alg. 1 for the sake of fast sampling as well as a low memory footprint in DL, but at the risk of becoming stuck in local regions. Fig. 3 illustrates such scenarios in a 1D example. In (b) the sampling method falls into a loop because it does not consider the history of samples but only the current one. In contrast, the sampling method in (a) is able to keep generating new samples by avoiding the RPS of previous samples, at the cost of more computation and storage, as expected.

(a) Sampling using Eq. 5 (b) Sampling using Eq. 12
Figure 3: 1D illustration of difference in sampling between (a) using Eq. 5 and (b) using Eq. 12. Here the solid blue lines denote function , the black dotted lines denote the sampling paths starting from , and each big triangle surrounded by blue dotted lines denotes the RPS of each sample. As we see, (b) suffers from being stuck locally, while (a) can avoid the locality based on the RPS.

3.1 Theoretical Analysis

Theorem 3 (Global Property Preservation).

Let where is computed using Eq. 12. Then satisfies Eq. 4 if it holds that


where denotes the inner product between two vectors.


Based on Eq. 1, Eq. 12, and Eq. 13, we have


which is essentially equivalent to Eq. 4 based on algebra. We then can complete the proof. ∎

Corollary 2.

Suppose that a monotonically decreasing sequence is generated to minimize function by sampling using Eq. 12. Then the condition in Eq. 13 can be rewritten as follows:


Discussion: Both Thm. 3 and Cor. 2 imply that, roughly speaking, our solver prefers sampling the parameter space along a path towards a single direction. However, the gradients in conventional backpropagation have little guarantee of satisfying Eq. 13 or Eq. 15, due to the lack of such constraints in learning. On the other hand, momentum [31] is a well-known technique in deep learning for dampening oscillations in gradients and accelerating along directions of low curvature. Therefore, our solver in Alg. 2 involves momentum to compensate for such drawbacks in backpropagation and better approximate Alg. 1.

3.2 Empirical Justification

In this section we discuss the feasibility of the assumptions A1-A4 for reducing computation and storage as well as preserving the properties towards global optimization in deep learning. We utilize MatConvNet [35] as our testbed, and run our solver in Alg. 2 to train the default networks in MatConvNet for MNIST [21] and CIFAR10 [20], respectively, using the default parameters unless otherwise mentioned. Also we set for MNIST and for CIFAR10 by default. For justification purposes we only run 4 epochs on each dataset, with 600 and 500 iterations per epoch for MNIST and CIFAR10, respectively. For more experimental details, please refer to Sec.
Essentially assumption A1 is made to support the other three to simplify the objective in Eq. 5, and assumption A2 usually holds in deep learning due to its high dimensionality. Therefore, below we only focus on empirical justification of assumptions A3 and A4.

Figure 4: Plots of on MNIST and CIFAR10, respectively.

Feasibility of A3: To justify this, we collect the step sizes by running Alg. 2 on both datasets, and plot them in Fig. 4. Overall these values are indeed sufficiently small for local updates based on gradients, and in general they decrease as the number of iterations increases. This behavior is expected, as the objective is supposed to decrease w.r.t. iterations as well. The value gap at the beginning on the two datasets is induced mainly by the different parameter settings.

Figure 5: Comparison between LHS and RHS of Eq. 4 based on returned by Alg. 2 using different values for momentum parameter .

Feasibility of A4: To justify this, we show some evidence in Fig. 5, where we plot the left-hand side (LHS) and right-hand side (RHS) of Eq. 4 based on the samples returned by Alg. 2. As we see, in all the subfigures on the right the values of the RHS are always no smaller than those of the LHS. In contrast, in the remaining subfigures on the left (i.e. with conventional SGD updates) the values of the RHS are always no bigger than those of the LHS. These observations appear to be robust across different datasets, and irrelevant to the parameter which determines the radius of the balls, i.e. the step sizes for gradients. The momentum parameter, which is related to the directions of gradients for updating models, appears to be the only factor making the samples of our solver satisfy Eq. 4. This also supports our claims in Thm. 3 and Cor. 2 about the relation between model update and gradient required to satisfy Eq. 4. More evidence is provided in Sec. 4.1.1. Given this evidence, we hypothesize that assumption A4 may hold empirically when using sufficiently large values for the momentum parameter.

4 Experiments

To demonstrate the generalization ability of our BPGrad solver, we test it in the applications of object recognition, detection, and segmentation by training deep convolutional neural networks (CNNs). We utilize MatConvNet as our testbed, and employ its demo code as well as default network architectures for the different tasks. Since our solver determines learning rates adaptively, we compare it with four other widely used DL solvers with adaptive learning rates, namely Adagrad, Adadelta, RMSProp, and Adam. We tune the parameters of these solvers to achieve their best performance as far as we can.

4.1 Object Recognition

4.1.1 MNIST & CIFAR10

The MNIST digit dataset consists of a training set of 60,000 images and a test set of 10,000 images in 10 classes labeled from 0 to 9, where all images have a resolution of 28x28 pixels. The CIFAR-10 dataset consists of a training set of 50,000 images and a test set of 10,000 images in 10 object classes, where the image resolution is 32x32 pixels.

We follow the default implementation to train an individual CNN similar to LeNet- [22] on each dataset. For the details of network architectures please refer to the demo code. Specifically for all the solvers, we train the networks for and epochs on MNIST and CIFAR10, respectively, with a mini-batch size , weight decay , and momentum . In addition, we fix the initial weights for two networks and the feeding order of mini-batches for fair comparison. The global learning rate is set to on MNIST for Adagrad, RMSProp and Adam. On CIFAR10, the global learning rate is set to for RMSProp, but to for Adagrad, Adam and Eve [19], and it is reduced to and at the -st and -st epoch. Adadelta does not require the global learning rate.

Figure 6: Illustration of robustness of Lipschitz constant in our solver.
Figure 7: Comparison on (left) training objectives and (right) test top-1 errors for object recognition using (top) MNIST and (bottom) CIFAR10.

For our solver, the parameters and typically depend on the numbers of mini-batches and epochs, respectively. Empirically we find that seems to work well, and thus we use it by default for all the experiments. Accordingly by default will be set to the product of the numbers of mini-batches and epochs.

Also we find that the parameter as Lipschitz constant is quite robust w.r.t. performance, indicating that heavily tuning this parameter is unnecessary in practice. To demonstrate this, we compare the training objectives of our solver by varying in Fig. 6. To highlight the differences, here we crop and show the results in the first four epochs, but note that the remaining results have similar behavior. As we can see on MNIST when varies from 10 to 100, the corresponding curves are clustered, similarly on CIFAR10 for from 50 to 1000. We decide to set for MNIST and for CIFAR10, respectively, in our solver.

Next we show the solver comparison results in Fig. 7. To illustrate the effect of momentum in our solver in terms of performance, here we plot two variants of our solver with and , respectively. As we see our solver with works much better than the counterpart, achieving lower training objectives as well as lower top-1 error at test time. This again provides evidence to support the importance of satisfying Eq. 4 in our solver to search for good solutions toward global optimality.

Overall, our solver performs best on MNIST and is slightly inferior on CIFAR10 at test time, although in terms of training objective it achieves competitive performance on MNIST and the best on CIFAR10. We hypothesize that this behavior comes from the effect of regularization from Lipschitz continuity. Notably, our solver can decrease the objectives much faster than all the competitors in the first few epochs, which reflects its superior ability to determine adaptive learning rates for gradients. On CIFAR10 we also compare with an extra solver, Eve, based on our implementation; Eve was proposed in the recent related work [19], which improves Adam with feedback from the objective function, and was tested on CIFAR10 as well. As we can see, our solver is much more reliable, performing consistently over epochs.

4.1.2 ImageNet ILSVRC2012 [20]

This dataset contains about 1.28 million training images and 50,000 validation images among 1,000 object classes. Following the demo code, we train the same AlexNet [20] on it from scratch using different solvers. We perform training for epochs, with a mini-batch size , weight decay , momentum , and default learning rates for the competitors. For our solver we set .

Figure 8: Comparison on (left) training objectives and (right) validation top-1 errors for object recognition using ImageNet ILSVRC2012.

Adagrad Adadelta RMSProp Adam BPGrad
training 49.0 71.6 46.0 70.0 33.0
validation 54.8 76.7 47.2 72.8 44.0
Table 1: Top-1 recognition error () on ImageNet ILSVRC2012 dataset.
aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv mAP
Adagrad 67.5 71.5 60.7 47.1 28.3 72.7 76.7 77.0 34.3 70.2 64.0 72.0 74.2 69.5 64.9 28.8 57.4 60.5 73.1 61.1 61.7
RMSProp 69.1 75.8 61.5 47.9 30.2 74.7 77.1 79.4 33.2 71.1 66.3 74.4 76.3 69.9 65.1 28.9 62.9 62.5 73.2 60.8 63.0
Adam 68.9 79.9 64.1 56.6 37.0 77.4 77.7 82.5 38.2 71.5 64.7 77.6 77.7 75.0 66.8 30.6 65.9 65.1 74.4 67.9 66.0
BPGrad 69.4 77.7 66.4 55.1 37.2 76.1 77.7 83.6 38.6 73.8 67.4 76.0 81.9 72.7 66.3 31.0 64.2 66.2 73.8 64.9 66.0
Table 2: Average precision (AP, %) of object detection on VOC2007 test dataset.

We show the comparison results in Fig. 8. It is evident that our solver works the best at both training and test time. Namely, it converges faster to achieve lower objective as well as lower top-1 error on validation dataset. In terms of numbers, ours is 3.2% lower than the second best, RMSProp, at the -th epoch as listed in Table 1.

Based on all the experiments above we conclude that our solver is suitable to train deep models for object recognition.

4.2 Object Detection

Figure 9: Loss comparison on VOC2007 trainval dataset, including (left) the regression loss using bounding boxes and (right) the classification loss.

Following Fast RCNN [11] in the demo code, we conduct the solver comparison on the PASCAL VOC2007 dataset [9] with 20 object classes using selective search [34] as default object proposal approach. For all solvers, we train the network for epochs using the images in VOC2007 trainval set and test it using images in VOC2007 test set. We set the weight decay and momentum to and , respectively, and use default learning rates for the competitors. We do not compare with Adadelta because we cannot obtain reasonable performance after heavy parameter tuning. For our solver we set and .

We show the training comparison in Fig. 9, and test results in Table 2. Though our training losses are inferior to those of Adam in this case, our solver works as well as Adam at test time on average, achieving best AP on out of classes. This demonstrates the suitability of our solver in training deep models for object detection.

4.3 Object Segmentation

Following the work [24] on semantic segmentation based on fully convolutional networks (FCN), we train FCN-32s with a per-pixel multinomial logistic loss and validate it with the standard metrics of mean pixel intersection over union (IU), pixel accuracy, and mean accuracy. For all the solvers, we conduct training for epochs with momentum and weight decay on the PASCAL VOC2011 [10] segmentation set. For Adagrad, RMSProp and Adam, we find that the default parameters achieve the best performance. For Adadelta, we tune its parameters with . The global learning rate for RMSProp is set to and for both Adagrad and Adam. Adadelta does not require a global learning rate. For our solver, we set .

We show the learning curves on training and validation datasets in Fig. 10, and list the test-time comparison results in Table 3. In this case our solver has very similar learning behavior as Adagrad, but achieves the best performance at test time. The smaller fluctuation over epochs on the validation dataset demonstrates again the superior reliability of our solver, compared with the competitors. Taking these observations into account, we believe that our solver has the ability of learning robust deep models for object segmentation.

Figure 10: Segmentation performance comparison using FCN-32s model on VOC2011 training and validation datasets.
mean IU pixel accuracy mean accuracy average
Adagrad 60.8 89.5 77.4 75.9
Adadelta 46.6 86.0 54.4 62.3
RMSProp 60.5 90.2 71.0 73.9
Adam 50.9 87.2 66.4 68.2
BPGrad 62.4 89.8 79.6 77.3
Table 3: Numerical comparison on semantic segmentation performance (%) using VOC2011 test dataset at the -th epoch.

5 Conclusion

In this paper we propose a novel approximation algorithm, namely BPGrad, towards searching for global optimality in DL via branch and pruning, based on the Lipschitz continuity assumption. Our basic idea is to keep generating new samples from the parameter space (i.e. branch) outside the removable parameter space (i.e. pruning). Lipschitz continuity not only provides us a way to estimate the lower and upper bounds of the global optimum, but also serves as regularization to further smooth the objective functions in DL. Theoretically, we prove that under some conditions our BPGrad algorithm can converge to global optimality within finite iterations. Empirically, in order to avoid the high demand in computation as well as storage of BPGrad in DL, we propose a new efficient solver. Theoretical and empirical justification for preserving the properties of BPGrad is provided. We demonstrate the superiority of our solver over several conventional DL solvers in object recognition, detection, and segmentation.