1 Introduction
Deep neural networks are a central object of study in machine learning, computer vision, and many other domains. They have substantially improved upon conventional learning algorithms in many areas, including speech recognition, object detection, and natural language processing [28].

The focus of this work is to investigate the duality gap of deep neural networks. The duality gap is the discrepancy between the optimal values of the primal and dual problems. While it is well understood for convex optimization, little is known in the nonconvex setting. A smaller relative duality gap typically implies that the problem itself is less nonconvex, and thus easier to optimize.¹ Our results establish that: deep neural networks with a multi-branch architecture have a small relative duality gap.

¹Although a zero duality gap can be attained for some nonconvex optimization problems [6, 48, 11], these are in essence convex problems, because their dual and bidual problems are always convex. Such problems are therefore relatively easy to optimize compared with other nonconvex ones.

Our study is motivated by the computational difficulty of deep neural networks caused by their nonconvex nature. While many works have witnessed the power of local search algorithms for deep neural networks [16], these algorithms typically converge to suboptimal solutions in the worst case according to various empirical observations [52, 28]. It has been reported that for a single-hidden-layer neural network, when the number of hidden units is small, stochastic gradient descent can easily get stuck at poor local minima [27, 49]. Furthermore, there is significant evidence that when networks are deep enough, bad saddle points do exist [1] and may be hard to escape [15, 21, 10, 1].

Given these computational obstacles, several efforts have been devoted to designing new architectures that alleviate the above issues, including over-parametrization [17, 54, 23, 41, 2, 46] and multi-branch architectures [57, 18, 63, 33, 60]. Empirically, increasing the number of hidden units of a single-hidden-layer network encourages first-order methods to converge to a global solution, which supports the folklore that the loss surface of a wider network looks more "convex" (see Figure 1). Furthermore, several recently proposed architectures, including ResNeXt [63], Inception [57], Xception [18], SqueezeNet [33], and Wide ResNet [64], are based on multiple branches and have demonstrated substantial improvement over many existing models in a range of applications. In this work, we show that one cause of this success is that the loss of a multi-branch network is less nonconvex in terms of the duality gap.

Figure 1: The loss surface of a one-hidden-layer ReLU network projected onto a 2d plane, spanned by three points to which the SGD algorithm converges from three different initialization seeds. As the number of hidden neurons increases, the landscape becomes less nonconvex.

Our Contributions.
This paper provides both theoretical and experimental results for the population and empirical risks of deep neural networks by estimating the duality gap.
First, we study the duality gap of deep neural networks with arbitrary activation functions, adversarial data distributions, and a multi-branch architecture (see Theorem 1). The multi-branch architecture is general: it includes the classic one-hidden-layer architecture as a special case (see Figure 2). By the Shapley–Folkman lemma, we show that the duality gap of both population and empirical risks shrinks to zero as the number of branches increases. Our result provides a better understanding of various state-of-the-art architectures such as ResNeXt, Inception, Xception, SqueezeNet, and Wide ResNet.
Second, we prove that strong duality (a.k.a. zero duality gap) holds for the empirical risk of deep linear neural networks (see Theorem 2). To this end, we develop multiple new proof techniques, including a reduction to low-rank approximation and the construction of a dual certificate (see Section 4).
Finally, we empirically study the loss surface of multibranch neural networks. Our experiments verify our theoretical findings.
Notation. We use bold capital letters to represent matrices and lower-case letters to represent scalars. Specifically, let be the identity matrix and denote by the all-zero matrix. Let be a set of network parameters, each of which represents the connection weights between the th and th layers of the neural network. We use to indicate the th column of . We use to represent the th largest singular value of matrix . Given the skinny SVD of matrix , we denote by the truncated SVD of to the first singular values. For matrix norms, denote by the matrix Schatten norm; the nuclear norm and the Frobenius norm are special cases of the Schatten norm: and . We use to represent the matrix operator norm, i.e., , and denote by the rank of matrix . Denote by the span of the rows of . Let be the Moore–Penrose pseudo-inverse of . For a convex matrix function , we denote by the conjugate function of and by the subdifferential. We use to represent a diagonal matrix with diagonal entries . Let , and . For any two matrices and of matching dimensions, we denote by the concatenation of and along the rows and the concatenation of the two matrices along the columns.
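As a small, hedged illustration of the matrix norms above: the Schatten p-norm family can be computed directly from singular values, with the nuclear, Frobenius, and operator norms as the cases p = 1, 2, and infinity. The helper name below is our own, introduced purely for illustration.

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm of A; p = np.inf gives the operator norm."""
    s = np.linalg.svd(A, compute_uv=False)  # singular values of A
    if np.isinf(p):
        return s.max()
    return (s ** p).sum() ** (1.0 / p)

A = np.array([[3.0, 0.0], [0.0, 4.0]])     # singular values 4 and 3
nuclear = schatten_norm(A, 1)              # sum of singular values: 7
frobenius = schatten_norm(A, 2)            # sqrt(9 + 16) = 5
operator = schatten_norm(A, np.inf)        # largest singular value: 4
```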
2 Duality Gap of Multi-Branch Neural Networks
We first study the duality gap of neural networks in a classification setting. We show that the wider the network is, the smaller the duality gap becomes.
Network Setup. The output of our network follows from a multi-branch architecture (see Figure 2):
where is the concatenation of all network parameters , is the input instance, is the parameter space, and represents a continuous mapping by a subnetwork, which is allowed to have an arbitrary architecture such as a convolutional or recurrent neural network. As an example, can be in the form of a -layer feedforward subnetwork:

Here, the functions are allowed to encode an arbitrary form of continuous element-wise nonlinearity (or linearity) after each matrix multiplication, such as sigmoid, rectification, or convolution, and the number of layers in each subnetwork can be arbitrary as well. When and , i.e., each subnetwork in Figure 2 represents one hidden unit, the architecture reduces to a one-hidden-layer network. We apply the so-called hinge loss [4, 7] on top of the network output for label :
(1) 
The hinge loss has been widely applied in active learning of classifiers and margin-based learning [4, 7]. When , it reduces to the classic hinge loss [43, 17, 38]. We make the following assumption on the margin parameter , which states that the parameter is sufficiently large.
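Since the displayed formula (1) is not reproduced here, the sketch below assumes the standard margin form of the hinge loss, max(0, lambda − y·f(x)); the function name and scalar setting are illustrative assumptions, not the paper's exact notation.

```python
def lambda_hinge(output, label, lam=1.0):
    """Assumed margin form of the lambda-hinge loss: max(0, lam - y * f(x))."""
    return max(0.0, lam - label * output)

# Correct prediction with margin at least lam: zero loss.
print(lambda_hinge(2.0, +1, lam=1.0))   # 0.0
# Misclassified point: loss grows with the margin parameter lam.
print(lambda_hinge(-0.5, +1, lam=1.0))  # 1.5
```

Setting lam = 1 recovers the classic hinge loss, consistent with the remark above.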
Assumption 1 (Parameter ).
For sample drawn from distribution , we have for all with probability measure .
We further observe empirically that smaller values of the parameter and other loss functions support our theoretical result as well (see the experiments in Section 5). Extending our theory to more general losses is an interesting open question for future work.

To study how closely these generic neural network architectures approach the family of convex functions, we analyze the duality gap of minimizing the risk w.r.t. the loss (1) with an extra regularization constraint. The normalized duality gap is a measure of the intrinsic nonconvexity of a given function [13]: the gap is zero when the function itself is convex, and is large when the loss surface is intrinsically far from convex. Typically, the closer the network is to the family of convex functions, the easier it is to optimize.
Multi-Branch Architecture. Our analysis of multi-branch neural networks is built upon a tool from nonconvex geometric analysis, the Shapley–Folkman lemma. Roughly, the Shapley–Folkman lemma states that the sum of constrained nonconvex functions is close to being convex. A neural network is an ideal target for this lemma: the width of the network corresponds to the number of summand functions. So intuitively, the wider the neural network is, the smaller the duality gap will be. In particular, we study the following nonconvex problem concerning the population risk:
(2) 
where are convex regularization functions, e.g., weight decay, and can be arbitrary such that the problem is feasible. Correspondingly, the dual of problem (2) is a one-dimensional convex optimization problem:²

(3)

²Although problem (3) is convex, this does not necessarily mean it can be solved easily, because computing is itself a hard problem. So rather than trying to solve the convex dual problem, our goal is to study the duality gap in order to understand the degree of nonconvexity of the problem.
For , denote by
the convex relaxation of function on . For , we also define
Our main results for multi-branch neural networks are as follows:
Theorem 1.
Note that measures the divergence between the function value of and its convex relaxation . The constant is the maximal divergence among all subnetworks, which grows slowly as increases, because only measures the divergence of one branch. The normalized duality gap has been widely used to measure the degree of nonconvexity of optimization problems [13, 58, 14, 24, 22]. Such normalization avoids trivialities in characterizing the degree of nonconvexity: scaling the objective function by any constant does not change the normalized duality gap. Although Theorem 1 is stated for the population risk, the conclusion holds for the empirical loss as well; this can be achieved by setting the marginal distribution to be the uniform distribution on a finite set and the corresponding labels to be uniformly distributed on the same finite set.

Inspiration for Architecture Designs. Theorem 1 shows that the loss surface of a deep network is less nonconvex when the width is large; when , surprisingly, deep learning becomes as easy as convex optimization. An intuitive explanation is that a large number of randomly initialized hidden units represent all possible features, so the optimization problem reduces to training the top layer of the network, which is convex. Our result encourages a class of network architectures with multiple branches and supports some of the most successful architectures in practice, such as Inception [57], Xception [18], ResNeXt [63], SqueezeNet [33], Wide ResNet [64], and Shake-Shake regularization [25], all of which benefit from the split-transform-merge behaviour shown in Figure 2. The theory sheds light on the strong performance of these architectures.
Related Works. While many efforts have been devoted to studying the local minima or saddle points of deep neural networks [42, 68, 55, 36, 62, 61], little is known about the duality gap of deep networks. In particular, Choromanska et al. [20, 19] showed that the number of poor local minima cannot be too large. Kawaguchi [35] improved upon the results of [20, 19] by assuming that the activation functions are independent Bernoulli variables and that the input data are drawn from a Gaussian distribution. Xie et al. [62] and Haeffele et al. [30] studied the local minima of regularized networks, but they require either that the network is shallow or that the network weights are rank-deficient. Ge et al. [27] showed that every local minimum is globally optimal by modifying the activation function. Zhang et al. [67] and Aslan et al. [3] reduced the nonlinear activation to the linear case by kernelization and relaxed the nonconvex problem to a convex one. However, no formal guarantee was provided for the tightness of the relaxation. Theorem 1, on the other hand, bounds the duality gap of deep neural networks under mild assumptions.

Another line of research studies the convexity behaviour of neural networks when the number of hidden neurons goes to infinity. In particular, Bach [5] proved that a single-hidden-layer network is as easy as convex optimization by using classical non-Euclidean regularization tools. Bengio et al. [12] showed a similar phenomenon for multi-layer networks with an incremental algorithm. In comparison, Theorem 1 not only captures the convexification phenomenon when , but also goes further by characterizing the convergence rate of the convexity of neural networks in terms of the duality gap. Furthermore, the conclusion in Theorem 1 holds for the population risk, which was unknown before.
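The Shapley–Folkman convexification underlying Theorem 1 can be illustrated numerically on a toy set. The set S = {0, 1} and the gap-based nonconvexity measure below are illustrative choices of our own, not the measure used in the formal proof.

```python
import numpy as np
from itertools import product

def minkowski_average(S, n):
    """All points (s_1 + ... + s_n) / n with each s_i drawn from S."""
    return sorted({sum(combo) / n for combo in product(S, repeat=n)})

def nonconvexity(points):
    """Distance from a 1d point set to its convex hull: half the largest gap."""
    return float(np.max(np.diff(np.array(points))) / 2.0)

S = [0.0, 1.0]                              # a nonconvex two-point set
gaps = [nonconvexity(minkowski_average(S, n)) for n in (1, 2, 4, 8)]
# gaps = [0.5, 0.25, 0.125, 0.0625]: the averaged set fills its convex
# hull [0, 1] at rate O(1/n), mirroring how the duality gap shrinks with
# the number of branches.
```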
3 Strong Duality of Linear Neural Networks
In this section, we show that the duality gap is zero when the activation function is linear. Deep linear neural networks have received significant attention in recent years [51, 35, 67, 44, 8, 28, 31, 9] because of their simple formulation³ and their connection to nonlinear neural networks.

³Although the expressive power of deep linear neural networks and three-layer linear neural networks is the same, the analyses of the landscapes of the two models are significantly different, as pointed out by [28, 35, 44].
Network Setup. We discuss the strong duality of regularized deep linear neural networks of the form
(4) 
where is the given instance matrix, is the given label matrix, and represents the weight matrix in each linear layer. We mention that (a) while the linear operation is a simple matrix multiplication in problem (4), it can easily be extended to other linear operators, e.g., the convolutional operator or a linear operator with a bias term, by properly involving a group of kernels in the variable [30]. (b) The regularization terms in problem (4) are of common interest, e.g., see [30]. When , our regularization terms reduce to , which is well known as weight decay or Tikhonov regularization. (c) The regularization parameter is the same for each layer since we have no prior preference among layers.
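Since the norms and exponents in problem (4) are not fully reproduced here, the sketch below assumes the common weight-decay form (1/2)‖Y − W_H⋯W_1 X‖_F² + (λ/2)Σᵢ‖Wᵢ‖_F² as a stand-in for the paper's exact objective; treat every name and dimension as illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def deep_linear_objective(weights, X, Y, lam):
    """Assumed form of (4): squared Frobenius fit plus per-layer weight decay."""
    out = X
    for W in weights:                       # multilinear product W_H ... W_1 X
        out = W @ out
    fit = 0.5 * np.linalg.norm(Y - out, "fro") ** 2
    reg = 0.5 * lam * sum(np.linalg.norm(W, "fro") ** 2 for W in weights)
    return fit + reg

H = 3                                       # number of linear layers
dims = [4, 3, 3, 2]                         # layer widths: 4 -> 3 -> 3 -> 2
X = rng.standard_normal((4, 6))             # instance matrix (d x n)
Y = rng.standard_normal((2, 6))             # label matrix
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(H)]
val = deep_linear_objective(weights, X, Y, lam=0.1)
```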
Our analysis leads to the following guarantees for the deep linear neural networks.
Theorem 2.
Denote by and . Let and , where stands for the minimal nonzero singular value of . Then strong duality holds for the deep linear neural network (4). In other words, the optimum of problem (4) equals that of its convex dual problem
(5) 
where is a convex function. Moreover, the optimal solutions of the primal problem (4) can be obtained from the dual problem (5) as follows: let be the skinny SVD of matrix ; then for , and is a globally optimal solution to problem (4).
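The recovery step can be sketched as follows. The balanced splitting below, in which every layer absorbs S^{1/H} from the skinny SVD U S Vᵀ, is one assumed choice consistent with the SVD construction described above, not necessarily the paper's exact formula.

```python
import numpy as np

rng = np.random.default_rng(2)

def factor_into_layers(M, H):
    """Return [W_1, ..., W_H] with W_H @ ... @ W_1 == M (balanced SVD split)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > 1e-9 * s.max()))     # numerical rank of M
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    root = np.diag(s ** (1.0 / H))          # each layer absorbs S^(1/H)
    return [root @ Vt] + [root] * (H - 2) + [U @ root]

M = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 6))  # rank-3 target
Ws = factor_into_layers(M, H=4)
prod = Ws[-1]
for W in reversed(Ws[:-1]):                 # compute W_H @ ... @ W_1
    prod = prod @ W
# prod reconstructs M up to numerical error
```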
The regularization parameter cannot be too large in order to avoid underfitting. Our result provides a suggested upper bound for the regularization parameter, where oftentimes characterizes the level of random noise. When , our analysis reduces to the unregularized deep linear neural network, a model which has been widely studied in [35, 44, 8, 28].
Theorem 2 implies the following result on the landscape of deep linear neural networks: regularized deep linear learning can be converted into an equivalent convex problem via duality. We note that strong duality rarely holds in nonconvex optimization: matrix completion [6], Fantope [48], and quadratic optimization with two quadratic constraints [11] are among the few paradigms that enjoy it. For deep networks, the effectiveness of convex relaxation has been observed empirically in [3, 67], but much remains unknown about the theoretical guarantees of such relaxations. Our work establishes strong duality for regularized deep linear neural networks and provides an alternative approach to overcoming the computational obstacles caused by nonconvexity: one can apply convex solvers, e.g., the Douglas–Rachford algorithm,⁴ to problem (5) and then conduct a singular value decomposition to compute the weights from . In addition, our result inherits the benefits of convex analysis. The vast majority of results on deep learning study the generalization error or expressive power by analyzing the complicated nonconvex form [47, 66, 65]. In contrast, with strong duality one can investigate various properties of deep linear networks in a much simpler convex form.

⁴Grussler et al. [29] provided a fast algorithm to compute the proximal operators of . Hence, the Douglas–Rachford algorithm can find the global solution up to an error in function value in time [32].

Related Works. The goal of convexified linear neural networks is to relax the nonconvex form of deep learning to computable convex formulations [67, 3]. While several efforts have been devoted to investigating the effectiveness of such convex surrogates, e.g., by analyzing the generalization error after relaxation [67], little is known about whether the relaxation is tight with respect to the original problem. Our result, on the other hand, provides theoretical guarantees for the tightness of the convex relaxation of deep linear networks, a phenomenon observed empirically in [3, 67].
We mention another related line of research: no bad local minima. On one hand, although recent works have shown the absence of spurious local minima for deep linear neural networks [50, 35, 44], many of them lack a theoretical analysis of the regularization term. Specifically, Kawaguchi [35] showed that unregularized deep linear neural networks have no spurious local minima. Lu and Kawaguchi [44] proved that depth creates no bad local minima for unregularized deep linear neural networks. In contrast, our optimization problem is more general in that it takes the regularization term into account. On the other hand, even when the "local = global" argument holds for deep linear neural networks, it can still be hard to escape bad saddle points [1]. In particular, Kawaguchi [35] proved that for linear networks deeper than three layers, there exist bad saddle points at which the Hessian does not have any negative eigenvalue. Therefore, state-of-the-art algorithms designed to escape saddle points might not be applicable [34, 26]. Our result provides an alternative approach that solves the deep linear network by convex programming, bypassing the computational issues incurred by bad saddle points.

4 Our Techniques and Proof Sketches
(a) Shapley–Folkman Lemma. The proof of Theorem 1 is built upon the Shapley–Folkman lemma [22, 56, 24, 13], which characterizes a convexification phenomenon concerning the average of multiple sets and is analogous to the central limit theorem in probability theory. Consider the averaged Minkowski sum of sets given by . Intuitively, the lemma states that as , where is a measure of the nonconvexity of a set (see Figure 3 for a visualization). We apply this lemma to the optimization formulation of deep neural networks. Denote by the augmented epigraph the set , where is the constraint and is the objective function in the optimization problem. The key observation is that the augmented epigraph of the loss of a neural network with multi-branch architecture can be expressed as the Minkowski average of the augmented epigraphs of all branches. We thus obtain a natural connection between an optimization problem and its corresponding augmented epigraph. Applying the Shapley–Folkman lemma to the augmented epigraph then characterizes the nonconvexity of the deep neural network.

(b) Variational Form. The proof of Theorem 2 is built upon techniques (b), (c), and (d). In particular, problem (4) is highly nonconvex due to its multilinear form in the optimized variables . Fortunately, we are able to analyze the problem by grouping together and converting the original nonconvex problem in the separate variables into a convex optimization with respect to the new grouped variable . This requires us to represent the objective function of (4) as a convex function of . To this end, we prove that (see Lemma 4 in Appendix C). So the objective function in problem (4) has the equivalent form
(6) 
This observation enables us to represent the optimization problem as a convex function of the output of a neural network. Therefore, we can analyze the nonconvex problem by applying powerful tools from convex analysis.
(c) Reduction to Low-Rank Approximation. Our results of strong duality concerning problem (6) are inspired by the problem of low-rank matrix approximation:
(7) 
We know that all local solutions of (7) are globally optimal [35, 44, 6]. To analyze the more general regularized problem (4), our main idea is to reduce problem (6) to the form of (7) via the Lagrangian function. In other words, the Lagrangian function of problem (6) should be of the form (7) for a fixed Lagrangian variable , which we construct in subsection (d). While some prior works attempted a similar reduction, their conclusions either depended on unrealistic conditions on local solutions, e.g., that all local solutions are rank-deficient [30, 29], or relied on strong assumptions on the objective functions, e.g., twice-differentiability [30], which do not apply to the nonsmooth problem (6). Instead, we bypass these obstacles by formulating the strong duality of problem (6) as the existence of a dual certificate satisfying certain dual conditions (see Lemma 6 in Appendix C). Roughly, the dual conditions state that the optimal solution of problem (6) is locally optimal for problem (7). On one hand, by the above-mentioned properties of problem (7), globally minimizes the Lagrangian function when is fixed to . On the other hand, by the convexity of the nuclear norm, for this fixed the Lagrangian variable globally optimizes the Lagrangian function. Thus is a primal-dual saddle point of the Lagrangian function of problem (6), and the desired strong duality follows directly from this argument.
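The key property of problem (7) used above, that its global optimum is given in closed form by a truncated SVD (the Eckart–Young theorem), can be checked numerically; the matrix sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def best_rank_r(M, r):
    """Globally optimal rank-r approximation of M in Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

M = rng.standard_normal((6, 5))
r = 2
M_r = best_rank_r(M, r)
s = np.linalg.svd(M, compute_uv=False)
err = np.linalg.norm(M - M_r, "fro") ** 2   # achieved squared error
tail = float(np.sum(s[r:] ** 2))            # Eckart-Young optimum: tail energy
```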
(d) Dual Certificate. The remaining step is to construct a dual certificate such that the dual conditions hold. The challenge is that the dual conditions impose several constraints simultaneously on the dual certificate (see condition (19) in Appendix C), making it hard to find a desired certificate; this is why progress on dual certificates has focused on convex programming. To resolve the issue, we carefully choose the certificate as an appropriate scaling of a subgradient of the nuclear norm around a low-rank solution, where the nuclear norm follows from our regularization term in technique (b). Although the nuclear norm has infinitely many subgradients, we prove that our construction of the dual certificate obeys all desired dual conditions. Putting techniques (b), (c), and (d) together completes our proof of strong duality.
5 Experiments
In this section, we verify our theoretical contributions experimentally. We release our PyTorch code at https://github.com/hongyanz/multibranch.

5.1 Visualization of Loss Landscape
Experiments on Synthetic Datasets. We first show that over-parametrization results in a less nonconvex loss surface on a synthetic dataset. The dataset consists of examples in whose labels are generated by an underlying one-hidden-layer ReLU network with 11 hidden neurons [49]. We make use of the visualization technique of [40] to plot the landscape, projecting the high-dimensional hinge loss () landscape onto a 2d plane spanned by three points. These points are found by running the SGD algorithm with three different initializations until convergence. As shown in Figure 1, the landscape exhibits strong nonconvexity with many local minima in the under-parametrized case . But as increases, the landscape becomes more convex. In the extreme case, when there are hidden neurons in the network, no nonconvexity can be observed in the landscape.
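A minimal sketch of this projection technique, under the assumption that the plane is taken through the first converged point with a Gram–Schmidt basis built from the other two (one common way to implement [40]'s visualization). The toy quadratic loss and all dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)

def plane_basis(t1, t2, t3):
    """Orthonormal basis (u, v) of the plane through t1 spanned by t2 and t3."""
    u = t2 - t1
    u = u / np.linalg.norm(u)
    v = t3 - t1
    v = v - (v @ u) * u                     # Gram-Schmidt orthogonalization
    v = v / np.linalg.norm(v)
    return u, v

def landscape_grid(loss, t1, t2, t3, lo=-1.0, hi=2.0, n=25):
    """Evaluate the loss on an n x n grid in the projection plane."""
    u, v = plane_basis(t1, t2, t3)
    coords = np.linspace(lo, hi, n)
    return np.array([[loss(t1 + a * u + b * v) for b in coords]
                     for a in coords])

loss = lambda theta: float(np.sum(theta ** 2))  # toy loss, illustration only
t1, t2, t3 = (rng.standard_normal(10) for _ in range(3))
Z = landscape_grid(loss, t1, t2, t3)        # 25 x 25 grid of loss values
```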
Experiments on MNIST and CIFAR-10. We next verify the phenomenon of over-parametrization on the MNIST [39] and CIFAR-10 [37] datasets. For both datasets, we follow the standard preprocessing step of normalizing each pixel by subtracting its mean and dividing by its standard deviation. We do not apply data augmentation. For MNIST, we consider a single-hidden-layer network defined as: , where , , is the input dimension, is the number of hidden neurons, and is the number of branches, with and . For CIFAR-10, in addition to the exact same one-hidden-layer architecture, we also test a deeper network containing hidden layers of size , with ReLU activations and . We apply a 10-class hinge loss on top of the output of the considered networks.

Figure 4 shows how the landscape changes as increases from to for MNIST, and from to for CIFAR-10. When there is only one branch, the landscapes exhibit strong nonconvexity with many local minima. As the number of branches increases, the landscape becomes more convex. When for the -hidden-layer networks on MNIST and CIFAR-10, and for the -hidden-layer network on CIFAR-10, the landscape is almost convex.
5.2 Frequency of Hitting Global Minimum
To further analyze the nonconvexity of loss surfaces, we consider various one-hidden-layer networks, where each network is trained 100 times using different initialization seeds under the setting of the synthetic experiments in Section 5.1. Since we know the ground-truth global minimum, we record the frequency with which SGD hits the global minimum up to a small error after iterations. Table 1 shows that increasing the number of hidden neurons results in a higher rate of hitting global optimality. This further verifies that the loss surface of a one-hidden-layer neural network becomes less nonconvex as the width increases.
Table 1: Frequency of hitting the global minimum over 100 runs.

# Hidden Neurons  Hitting Rate    # Hidden Neurons  Hitting Rate
10                2 / 100         16                30 / 100
11                9 / 100         17                32 / 100
12                21 / 100        18                35 / 100
13                24 / 100        19                52 / 100
14                24 / 100        20                64 / 100
15                29 / 100        21                75 / 100
6 Conclusions
In this work, we study the duality gap for two classes of network architectures. For neural networks with arbitrary activation functions, a multi-branch architecture, and the hinge loss, we show that the duality gap of both population and empirical risks shrinks to zero as the number of branches increases. Our result sheds light on the power of over-parametrization and of state-of-the-art architectures, where increasing the number of branches tends to make the loss surface less nonconvex. For neural networks with a linear activation function and loss, we show that the duality gap is zero. Both results hold for arbitrary depths and adversarial data, and the analytical techniques may be of independent interest to nonconvex optimization more broadly.
Acknowledgements. We would like to thank Jason D. Lee for informing us of the Shapley–Folkman lemma, and Maria-Florina Balcan, David P. Woodruff, Xiaofei Shi, and Xingyu Xie for their thoughtful comments on the paper.
References
 [1] A. Anandkumar and R. Ge. Efficient approaches for escaping higher order saddle points in nonconvex optimization. In Annual Conference on Learning Theory, pages 81–102, 2016.
 [2] S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, 2018.
 [3] Ö. Aslan, X. Zhang, and D. Schuurmans. Convex deep learning via normalized kernels. In Advances in Neural Information Processing Systems, pages 3275–3283, 2014.
 [4] P. Awasthi, M.-F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Annual Conference on Learning Theory, pages 152–192, 2016.

 [5] F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
 [6] M.-F. Balcan, Y. Liang, D. P. Woodruff, and H. Zhang. Matrix completion and related problems via strong duality. In Innovations in Theoretical Computer Science, 2018.
 [7] M.-F. F. Balcan and H. Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. In Advances in Neural Information Processing Systems, pages 4799–4808, 2017.

 [8] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
 [9] P. Baldi and Z. Lu. Complex-valued autoencoders. Neural Networks, 33:136–147, 2012.
 [10] P. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. In European Conference on Computational Learning Theory, pages 50–62, 1999.
 [11] A. Beck and Y. C. Eldar. Strong duality in nonconvex quadratic optimization with two quadratic constraints. SIAM Journal on Optimization, 17(3):844–860, 2006.
 [12] Y. Bengio, N. L. Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In Advances in neural information processing systems, pages 123–130, 2006.
 [13] D. P. Bertsekas and N. R. Sandell. Estimates of the duality gap for large-scale separable nonconvex optimization problems. In IEEE Conference on Decision and Control, volume 21, pages 782–785, 1982.
 [14] Y. Bi and A. Tang. Refined Shapley–Folkman lemma and its application in duality gap estimation. arXiv preprint arXiv:1610.05416, 2016.
 [15] A. Blum and R. L. Rivest. Training a 3node neural network is NPcomplete. In Advances in neural information processing systems, pages 494–501, 1989.
 [16] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In International Conference on Machine Learning, 2017.
 [17] A. Brutzkus, A. Globerson, E. Malach, and S. ShalevShwartz. SGD learns overparameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations, 2018.

 [18] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
 [19] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–200, 2015.
 [20] A. Choromanska, Y. LeCun, and G. B. Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In Annual Conference on Learning Theory, pages 1756–1760, 2015.
 [21] B. DasGupta, H. T. Siegelmann, and E. Sontag. On the complexity of training neural networks with continuous activation functions. IEEE Transactions on Neural Networks, 6(6):1490–1504, 1995.
 [22] A. d’Aspremont and I. Colin. An approximate ShapleyFolkman theorem. arXiv preprint arXiv:1712.08559, 2017.
 [23] S. S. Du and J. D. Lee. On the power of overparametrization in neural networks with quadratic activation. In International Conference on Machine Learning, 2018.
 [24] E. X. Fang, H. Liu, and M. Wang. Blessing of massive scale: Spatial graphical model estimation with a total cardinality constraint. 2015.
 [25] X. Gastaldi. ShakeShake regularization. arXiv preprint arXiv:1705.07485, 2017.

 [26] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Annual Conference on Learning Theory, pages 797–842, 2015.
 [27] R. Ge, J. D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. In International Conference on Learning Representations, 2017.
 [28] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
 [29] C. Grussler, A. Rantzer, and P. Giselsson. Lowrank optimization with convex constraints. arXiv:1606.01793, 2016.
 [30] B. D. Haeffele and R. Vidal. Global optimality in neural network training. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7331–7339, 2017.
 [31] M. Hardt and T. Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017.
 [32] B. He and X. Yuan. On the convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
 [33] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
 [34] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. arXiv:1703.00887, 2017.
 [35] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
 [36] K. Kawaguchi, B. Xie, and L. Song. Deep semi-random features for nonlinear function approximation. In AAAI Conference on Artificial Intelligence, 2018.
 [37] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [38] T. Laurent and J. von Brecht. The multilinear structure of ReLU networks. arXiv preprint arXiv:1712.10132, 2017.
 [39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [40] H. Li, Z. Xu, G. Taylor, and T. Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.
 [41] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in overparameterized matrix sensing and neural networks with quadratic activations. In Annual Conference on Learning Theory, 2017.
 [42] S. Liang, R. Sun, J. D. Lee, and R. Srikant. Adding one neuron can eliminate all bad local minima. arXiv preprint arXiv:1805.08671, 2018.
 [43] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. In International Conference on Machine Learning, 2018.
 [44] H. Lu and K. Kawaguchi. Depth creates no bad local minima. arXiv:1702.08580, 2017.
 [45] T. L. Magnanti, J. F. Shapiro, and M. H. Wagner. Generalized linear programming solves the dual. Management Science, 22(11):1195–1203, 1976.
 [46] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
 [47] B. Neyshabur, R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.
 [48] M. L. Overton and R. S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix. SIAM Journal on Matrix Analysis and Applications, 13(1):41–45, 1992.
 [49] I. Safran and O. Shamir. Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, 2017.
 [50] A. M. Saxe. Deep linear neural networks: A theory of learning in the brain and mind. PhD thesis, Stanford University, 2015.
 [51] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
 [52] S. Shalev-Shwartz, O. Shamir, and S. Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, 2017.
 [53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [54] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.
 [55] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv:1605.08361, 2016.
 [56] R. M. Starr. Quasi-equilibria in markets with non-convex preferences. Econometrica: Journal of the Econometric Society, pages 25–38, 1969.
 [57] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence, pages 4278–4284, 2017.
 [58] M. Udell and S. Boyd. Bounding duality gap for separable problems with linear constraints. Computational Optimization and Applications, 64(2):355–378, 2016.
 [59] M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118, 2016.
 [60] A. Veit, M. J. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.
 [61] R. Vidal, J. Bruna, R. Giryes, and S. Soatto. Mathematics of deep learning. arXiv preprint arXiv:1712.04741, 2017.
 [62] B. Xie, Y. Liang, and L. Song. Diverse neural network learns true target functions. In Artificial Intelligence and Statistics, pages 1216–1224, 2017.
 [63] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
 [64] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, pages 87.1–87.12, 2016.
 [65] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2016.
 [66] Y. Zhang, J. Lee, M. Wainwright, and M. Jordan. On the learnability of fully-connected neural networks. In Artificial Intelligence and Statistics, pages 83–91, 2017.
 [67] Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural networks. In International Conference on Machine Learning, 2017.
 [68] P. Zhou and J. Feng. Empirical risk landscape analysis for understanding deep neural networks. In International Conference on Learning Representations, 2018.
Appendix A Supplementary Experiments
A.1 Performance of Multi-Branch Architecture
In this section, we test the classification accuracy of the multi-branch architecture on the CIFAR-10 dataset. We use a 9-layer VGG network [53] as the sub-network in each branch, which is memory-efficient and allows practitioners to fit many branches into GPU memory simultaneously. The detailed network setup of VGG-9 is given in Table 2, where the width of VGG-9 is either 16 or 32. We vary the number of branches in the overall architecture from 4 to 32, training with the cross-entropy loss. Figure 5 presents the test accuracy on CIFAR-10 as the number of branches increases: the test accuracy improves monotonically with the number of parallel branches/paths.
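As a minimal sketch of how such an architecture aggregates its parallel branches: the toy one-unit branches below are stand-ins for the VGG-9 sub-networks, and averaging is one common aggregation choice, assumed here for illustration.

```python
# Toy sketch of a multi-branch forward pass.  Each branch is a stand-in
# "sub-network" (a single ReLU unit, not the actual VGG-9), and the
# branch outputs are aggregated by averaging.

def make_branch(weight, bias):
    """A toy one-unit 'sub-network': x -> max(0, weight * x + bias)."""
    def branch(x):
        return max(0.0, weight * x + bias)
    return branch

def multi_branch_forward(branches, x):
    """Average the outputs of all parallel branches."""
    return sum(b(x) for b in branches) / len(branches)

# Sweep the number of branches from 4 to 32, mirroring the experiment.
for num_branches in (4, 8, 16, 32):
    branches = [make_branch(w / num_branches, 0.1)
                for w in range(1, num_branches + 1)]
    output = multi_branch_forward(branches, 2.0)
```

In practice each branch would be a full convolutional sub-network and the aggregation could also be a concatenation or a sum, but the parallel-then-aggregate structure is the same.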
Layer  Weight  Activation  Input size  Output size 

Input  N / A  N / A  N / A  
Conv1  BN + ReLU  
Conv2  BN + ReLU  
MaxPool  N / A  N / A  
Conv3  BN + ReLU  
Conv4  BN + ReLU  
MaxPool  N / A  N / A  
Conv5  BN + ReLU  
Conv6  BN + ReLU  
Conv7  BN + ReLU  
MaxPool  N / A  N / A  
Flatten  N / A  N / A  
FC1  BN + ReLU  
FC2  Softmax 
Table 2: The width of the network controls the number of filters in each convolution layer. All convolution layers have a kernel of size 3 and zero padding of size 1. All layers followed by batch normalization have no bias term. All max pooling layers have a stride of 2.
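The spatial sizes in Table 2 follow the standard convolution output-size formula; a small sketch (the 2x2 pooling kernel is our assumption, since the table only fixes the pooling stride):

```python
def conv_out_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (in_size + 2 * padding - kernel) // stride + 1

# With kernel 3 and zero padding 1 (stride 1), as in Table 2, convolutions
# preserve spatial size; pooling with stride 2 (kernel 2 assumed) halves it.
size = 32                                                  # CIFAR-10 is 32x32
size = conv_out_size(size, kernel=3, stride=1, padding=1)  # conv: 32 -> 32
size = conv_out_size(size, kernel=2, stride=2)             # pool: 32 -> 16
```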
A.2 Strong Duality of Deep Linear Neural Networks
We compare the optima of the primal problem (4) and the dual problem (5) by numerical experiments for three-layer linear neural networks (). The data are generated as follows: we construct the output matrix by drawing the entries of from the i.i.d. standard Gaussian distribution, and set the input matrix to the identity matrix. The varies from to . Both the primal and dual problems are solved by numerical algorithms. Given the nonconvex nature of the primal problem, we rerun the algorithm from multiple initializations and keep the best solution obtained. The results are shown in Figure 6: the optima of the primal and dual problems almost match, and the small remaining gap is due to numerical inaccuracy.
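The comparison itself reduces to computing a relative duality gap from the two numerically obtained optima; a minimal sketch with hypothetical placeholder values (not the actual numbers behind Figure 6):

```python
# Sketch of the primal-dual comparison: compute the relative duality gap
# from numerically obtained optima.  All numbers below are hypothetical
# placeholders for illustration only.

def relative_gap(primal_opt, dual_opt):
    """Relative duality gap; nonnegative whenever weak duality holds."""
    return (primal_opt - dual_opt) / abs(primal_opt)

# Rerun a local method from several initializations and keep the best
# (smallest) primal value to mitigate nonconvexity.
candidate_primal_values = [1.0312, 1.0251, 1.0249]   # hypothetical reruns
primal_opt = min(candidate_primal_values)
dual_opt = 1.0240                                    # hypothetical dual optimum
gap = relative_gap(primal_opt, dual_opt)
```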
We also compare the distance between the solution of the primal problem and that of the dual problem in Table 3; the solutions are close to each other.
5  10  15  20  25  30  35  40  45  50  

distance ()  3.14  1.92  1.04  3.92  6.53  8.00 
Appendix B Proofs of Theorem 1: Duality Gap of Multi-Branch Neural Networks
The lower bound follows immediately from weak duality, so we only need to prove the upper bound .
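To make weak duality concrete, here is a toy example of our own (unrelated to the network problem): a discrete, hence nonconvex, feasible set produces a strictly positive duality gap, while the dual optimum still lower-bounds the primal optimum.

```python
# Toy illustration of weak duality: minimize x over x in {0, 1}
# subject to x >= 0.3.  The discrete feasible set makes this nonconvex.
# Primal optimum: x = 1, value 1.  The Lagrange dual
#   g(lam) = min_{x in {0, 1}} [x + lam * (0.3 - x)]
# is maximized at lam = 1 with value 0.3, so the duality gap is 0.7 > 0.

def dual_value(lam):
    """Dual function: minimize the Lagrangian over the discrete set {0, 1}."""
    return min(x + lam * (0.3 - x) for x in (0, 1))

primal = min(x for x in (0, 1) if x >= 0.3)                 # = 1
dual = max(dual_value(j / 100.0) for j in range(0, 301))    # close to 0.3

assert dual <= primal   # weak duality: the dual never exceeds the primal
```

Theorem 1 bounds how large such a gap can be, in relative terms, for multi-branch networks.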
Consider the subset of :
Define the vector summation
Since and are continuous w.r.t. and ’s are compact, the set
is compact as well. So , , , and , are all compact sets. According to the definition of and the standard duality argument [45], we have
and
Technique (a): Shapley-Folkman Lemma. We are going to apply the following Shapley-Folkman lemma.
Lemma 3 (Shapley-Folkman, [56]).
Let be a collection of subsets of . Then for every , there is a subset of size at most such that
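In one dimension the lemma can be checked numerically: the Minkowski sum of n copies of the nonconvex set {0, 1} has convex hull [0, n], and every point of the hull decomposes so that at most one summand is fractional. The decomposition routine below is our own illustration, not part of the proof.

```python
import math

def shapley_folkman_decompose(t, n):
    """Write t in conv({0,1} + ... + {0,1}) = [0, n] as a sum of n points
    of conv({0,1}) = [0, 1], at most one of which lies outside {0, 1}."""
    assert 0.0 <= t <= n
    k = int(math.floor(t))
    # k ones, one (possibly fractional) remainder, and zeros for the rest.
    return [1.0] * k + [t - k] + [0.0] * (n - k - 1) if k < n else [1.0] * n

parts = shapley_folkman_decompose(4.7, 8)
assert abs(sum(parts) - 4.7) < 1e-9
# At most one summand fails to lie in the original set {0, 1}:
assert sum(1 for p in parts if p not in (0.0, 1.0)) <= 1
```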
We apply Lemma 3 to prove Theorem 1 with . Let be such that
Applying the above Shapley-Folkman lemma to the set , we have that there exist a subset of size and vectors
such that
(8) 
(9) 
Representing elements of the convex hull of by Carathéodory's theorem, we have that for each , there are vectors and scalars such that
Recall that we define
(10) 
and . We have for ,
(11) 
and
(12) 
Thus, by Eqns. (8) and (11), we have
(13) 
and by Eqns. (9) and (12), we have
(14) 
Given any and , we can find a vector such that
(15) 
where the first inequality holds because is convex and the second inequality holds by the definition (10) of . Therefore, Eqns. (13) and (15) imply that
Namely, is a feasible solution of problem (2). Also, Eqns. (14) and (15) yield
where the last inequality holds because . Finally, letting leads to the desired result.
Appendix C Proofs of Theorem 2: Strong Duality of Deep Linear Neural Networks
Let . We note that, by the Pythagorean theorem, for every ,
So we can focus on the following optimization problem instead of problem (4):
(16) 
Technique (b): Variational Form. Our proof is inspired by a variational form of problem (16), given by the following lemma.