Artificial neural networks are currently considered the state of the art in applications ranging from image classification to speech recognition and even machine translation. However, little is understood about the process by which they are trained for supervised learning tasks. The problem of optimizing their parameters is an active area of both practical and theoretical research. Despite considerable sensitivity to initialization and choice of hyperparameters, neural networks often achieve compelling results after optimization by gradient descent methods. Due to the nonconvexity and massive parameter space of these functions, it is poorly understood how these sub-optimal methods have proven so successful. Indeed, training a certain kind of neural network is known to be NP-Complete, making it difficult to provide any worst-case training guarantees Blum:1992:OCT:148433.148441 . Much recent work has attempted to reconcile these differences between theory and practice Kawaguchi:2016:WithoutPoorLocalMinima ; Soudry:2016:NoBadLocalMinima .
This article attempts a modest step towards understanding the dynamics of the training procedure. We establish three main convexity results for a certain class of neural network, which is the current state of the art. First, that the objective is piecewise convex as a function of the input data, with parameters fixed, which corresponds to the behavior at test time. Second, that the objective is again piecewise convex as a function of the parameters of a single layer, with the input data and all other parameters held constant. Third, that the training objective function, for which all parameters are variable but the input data is fixed, is piecewise multi-convex. That is, it is a continuous function which can be represented by a finite number of multi-convex functions, each active on a multi-convex parameter set. This generalizes the notion of biconvexity found in the optimization literature to piecewise functions and arbitrary index sets Gorski:2007:BiConvex .
To prove these results, we require two main restrictions on the definition of a neural network: that its layers are piecewise affine functions, and that its objective function is convex and continuously differentiable. Our definition includes many contemporary use cases, such as least squares or logistic regression on a convolutional neural network with rectified linear unit (ReLU) activation functions and either max- or mean-pooling. In recent years these networks have mostly supplanted the classic sigmoid type, except in the case of recurrent networks Glorot:2011:ReluNetworks . We make no assumptions about the training data, so our results apply to the current state of the art in many practical scenarios.
Piecewise multi-convexity allows us to characterize the extrema of the training objective. As in the case of biconvex functions, stationary points and local minima are guaranteed optimality on larger sets than we would have for general smooth functions. Specifically, these points are partial minima when restricted to the relevant piece. That is, they are points for which no decrease can be made in the training objective without simultaneously varying the parameters across multiple layers, or crossing the boundary into a different piece of the function. Unlike global minima, we show that partial minima are reliably found by the optimization algorithms used in current practice.
Finally, we provide some guarantees for solving general multi-convex optimization problems by various algorithms. First we analyze gradient descent, proving necessary convergence conditions. We show that every point to which gradient descent converges is a piecewise partial minimum, excepting some boundary conditions. To prove stronger results, we define a different optimization procedure breaking each parameter update into a number of convex sub-problems. For this procedure, we show both necessary and sufficient conditions for convergence to a piecewise partial minimum. Interestingly, adding regularization to the training objective is all that is needed to prove necessary conditions. Similar results have been independently established for many kinds of optimization problems, including bilinear and biconvex optimization, and in machine learning the special case of linear autoencoders Wendell:1976:Bilinear ; Gorski:2007:BiConvex ; Baldi:2012:ComplexValuedAutoencoders . Our analysis extends existing results on alternating convex optimization to the case of arbitrary index sets, and general multi-convex point sets, which is needed for neural networks. We admit biconvex problems, and therefore linear autoencoders, as a special case.
Despite these results, we find that it is difficult to pass from partial to global optimality results. Unlike the encouraging case of linear autoencoders, we show that a single rectifier neuron, under the squared error objective, admits arbitrarily poor local minima. This suggests that much work remains to be done in understanding how sub-optimal methods can succeed with neural networks. Still, piecewise multi-convex functions are in some senses easier to minimize than the general class of smooth functions, for which none of our previous guarantees can be made. We hope that our characterization of neural networks could contribute to a better understanding of these important machine learning systems.
2 Preliminary material
We begin with some preliminary definitions and basic results concerning continuous piecewise functions.
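Let $g_1, g_2, \dots, g_M$ be continuous functions from $\mathbb{R}^n \to \mathbb{R}$. A continuous piecewise function $f$ has a finite number of closed, connected sets $S_1, S_2, \dots, S_M$ covering $\mathbb{R}^n$ such that for each $k$ we have $f(x) = g_k(x)$ for all $x \in S_k$. The set $S_k$ is called a piece of $f$, and the function $g_k$ is called active on $S_k$.
More specific definitions follow by restricting the functions $g_k$. A continuous piecewise affine function has $g_k(x) = a_k^T x + b_k$ where $a_k \in \mathbb{R}^n$ and $b_k \in \mathbb{R}$. A continuous piecewise convex function has $g_k$ convex, with $S_k$ convex as well.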
Note that this definition of piecewise convexity differs from that found in the convex optimization literature, which focuses on convex piecewise convex functions, i.e. maxima of convex functions Tsevendorj:2001:PiecewiseConvex . Note also that we do not claim a unique representation in terms of active functions and pieces , only that there exists at least one such representation.
Before proceeding, we shall extend definition 2.1 to functions of multidimensional codomain for the affine case.
A function , and let denote the component of . Then is continuous piecewise affine if each is. Choose some piece from each and let , with . Then is a piece of , on which we have for some and .
First, we prove an intuitive statement about the geometry of the pieces of continuous piecewise affine functions.
Let be continuous piecewise affine. Then admits a representation in which every piece is a convex polytope.
Let denote the component of . Now, can be written in closed form as a max-min polynomial Ovchinnikov:2002:pwl_max_min . That is, is the maximum of minima of its active functions. Now, for the minimum of two affine functions we have
This function has two pieces divided by the hyperplane. The same can be said of . Thus the pieces of are intersections of half-spaces, which are just convex polytopes. Since the pieces of are intersections of the pieces of , they are convex polytopes as well.
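As an informal illustration of this step of the proof, the following sketch checks numerically that the minimum of two affine functions has two pieces separated by a hyperplane. The coefficients and the helper names `affine`, `side` and `active_piece` are hypothetical choices made for the example; they do not come from the text.

```python
def affine(a, b):
    """Return the affine map x -> <a, x> + b, with x given as a sequence."""
    return lambda x: sum(ai * xi for ai, xi in zip(a, x)) + b

# Two hypothetical affine functions on R^2.
g1 = affine((1.0, 2.0), 0.5)
g2 = affine((-1.0, 0.5), 1.0)

def h(x):
    """Pointwise minimum of the two affine functions."""
    return min(g1(x), g2(x))

def side(x):
    """g1 - g2 is itself affine; its zero set is the dividing hyperplane."""
    return g1(x) - g2(x)

def active_piece(x):
    """Index of the closed half-space containing x (ties belong to piece 1)."""
    return 1 if side(x) <= 0.0 else 2

points = [(0.0, 0.0), (1.0, 1.0), (-2.0, 0.5), (3.0, -4.0)]
# On each half-space, h coincides with exactly one of the affine functions.
checks = [h(x) == (g1(x) if active_piece(x) == 1 else g2(x)) for x in points]
```

Each piece is a half-space, so intersecting the pieces of several such minima and maxima yields convex polytopes, as the proof states.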
Let and be continuous piecewise affine. Then so is .
To establish continuity, note that the composition of continuous functions is continuous.
Let be a piece of and a piece of such that , where denotes the inverse image of . By theorem 2.3, we can choose and to be convex polytopes. Since is affine, is closed and convex Boyd:2004:ConvexOptimization . Thus is a closed, convex set on which we can write
which is an affine function.
Now, consider the finite set of all such pieces . The union of over all pieces is just , as is the union of all pieces . Thus we have
Thus is piecewise affine on .
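The construction in this proof can be sketched in one dimension. The two ReLU-type layers below and their coefficients are assumptions, chosen so that every value is exactly representable in floating point. The pieces of the composition are obtained by intersecting the pieces of the inner function with the inverse images of the pieces of the outer function.

```python
def relu(t):
    return max(0.0, t)

def f(x):
    """Hypothetical inner layer, with a kink at x = 0.5."""
    return relu(2.0 * x - 1.0)

def g(y):
    """Hypothetical outer layer, with a kink at y = 3."""
    return relu(-y + 3.0)

def comp(x):
    return g(f(x))

# f(x) = 3 at x = 2, so the pieces of the composition meet at 0.5 and 2.
# The two unbounded pieces are truncated for sampling.
pieces = [(-4.0, 0.5), (0.5, 2.0), (2.0, 6.0)]

def second_difference(fn, x, step):
    """Zero for every x and step exactly when fn is affine on the sampled range."""
    return fn(x - step) - 2.0 * fn(x) + fn(x + step)

def affine_on(fn, lo, hi, step=0.25):
    x = lo + step
    while x + step <= hi:
        if second_difference(fn, x, step) != 0.0:
            return False
        x += step
    return True

affine_flags = [affine_on(comp, lo, hi) for lo, hi in pieces]
kink_at_half = second_difference(comp, 0.5, 0.25)
```

The second difference vanishes inside every piece and is nonzero across the boundary at 0.5, which is consistent with a function that is affine on each piece but not globally.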
We now turn to continuous piecewise convex functions, of which continuous piecewise affine functions are a subset.
Let be a continuous piecewise affine function, and a convex function. Then is continuous piecewise convex.
On each piece of we can write
This function is convex, as it is the composition of a convex and an affine function Boyd:2004:ConvexOptimization . Furthermore, is convex by theorem 2.3. This establishes piecewise convexity by the proof of theorem 2.4.
Our final theorem concerns the arithmetic mean of continuous piecewise convex functions, which is essential for the analysis of neural networks.
Let be continuous piecewise convex functions. Then so is their arithmetic mean .
The proof takes the form of two lemmas.
Let and be a pair of continuous piecewise convex functions on . Then so is .
Let be a piece of , and a piece of , with . Note that the sum of convex functions is convex Rockafellar:1970:ConvexAnalysis . Thus is convex on . Furthermore, is convex because it is an intersection of convex sets Rockafellar:1970:ConvexAnalysis . Since this holds for all pieces of and , we have that is continuous piecewise convex on .
Let , and let be a continuous piecewise convex function. Then so is .
The continuous function is convex on every piece of .
Having established that continuous piecewise convexity is closed under addition and positive scalar multiplication, we can see that it is closed under the arithmetic mean, which is just the composition of these two operations.
3 Neural networks
In this work, we define a neural network to be a composition of functions of two kinds: a convex continuously differentiable objective (or loss) function, and continuous piecewise affine functions , constituting the layers. Furthermore, the outermost function must be , so that we have
where denotes the entire network. This definition is not as restrictive as it may seem upon first glance. For example, it is easily verified that the rectified linear unit (ReLU) neuron is continuous piecewise affine, as we have
$$h(x) = \max(0, Ax + b),$$
where the maximum is taken pointwise. It can be shown that maxima and minima of affine functions are piecewise affine Ovchinnikov:2002:pwl_max_min . This includes the convolutional variant, in which $A$ is a Toeplitz matrix. Similarly, max pooling is continuous piecewise linear, while mean pooling is simply linear. Furthermore, many of the objective functions commonly seen in machine learning are convex and continuously differentiable, as in least squares and logistic regression. Thus this seemingly restrictive class of neural networks actually encompasses the current state of the art.
By theorem 2.4, the composition of all layers is continuous piecewise affine. Therefore, a neural network is ultimately the composition of a continuous convex function with a single continuous piecewise affine function. Thus by theorem 2.5 the network is continuous piecewise convex. Figure 1 provides a visualization of this result for the example network
where . For clarity, this is just the two-layer ReLU network
with the squared error objective and a single data point , setting and , with all other parameters set to .
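A one-dimensional analogue of this result can be checked numerically. The sketch below is not the network of figure 1: its weights, its target value and the helper `midpoint_convex` are hypothetical, chosen only to show a squared-error ReLU network that is convex on each piece of its input space without being convex overall.

```python
def relu(t):
    return max(0.0, t)

def network(x, y=2.0):
    """Hypothetical two-layer ReLU network with squared error, as a function of x."""
    hidden = relu(1.0 * x + 0.0)      # first layer
    out = relu(1.0 * hidden - 1.0)    # second layer
    return (out - y) ** 2             # convex objective

def midpoint_convex(fn, lo, hi, n=64):
    """Check fn((a+b)/2) <= (fn(a)+fn(b))/2 for all grid pairs in [lo, hi]."""
    grid = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return all(fn((a + b) / 2.0) <= (fn(a) + fn(b)) / 2.0 + 1e-12
               for a in grid for b in grid)

# The pieces of this network are x <= 1 (output layer dead) and x >= 1.
convex_left = midpoint_convex(network, -3.0, 1.0)
convex_right = midpoint_convex(network, 1.0, 5.0)
convex_globally = midpoint_convex(network, -3.0, 5.0)
```

The midpoint test is only a necessary condition for convexity on a grid, so it can refute convexity but not prove it. Here it passes on both pieces and fails on their union.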
Before proceeding further, we must define a special kind of differentiability for piecewise continuous functions, and show that this holds for neural networks.
Let be piecewise continuous. We say that is piecewise continuously differentiable if each active function is continuously differentiable.
To see that neural networks are piecewise continuously differentiable, note that the objective is continuously differentiable, as are the affine active functions of the layers. Thus their composition is continuously differentiable. It follows that non-differentiable points are found only on the boundaries between pieces.
4 Network parameters of a single layer
In the previous section we have defined neural networks as functions of labeled data. These are the functions relevant during testing, where parameters are constant and data is variable. In this section, we extend these results to the case where data is constant and parameters are variable, which is the function to be optimized during training. For example, consider the familiar equation
with parameters and data . During testing, we hold constant, and consider as a function of the data . During training, we hold constant and consider as a function of the parameters . This is what we mean when we say that a network is being “considered as a function of its parameters.” (This is made rigorous by taking cross-sections of point sets in section 5.) This leads us to an additional stipulation on our definition of a neural network. That is, each layer must be piecewise affine as a function of its parameters as well. This is easily verified for all of the layer types previously mentioned. For example, with the ReLU neuron we have
so for we have that the component of is linear in , while for it is constant. To see this, we can re-arrange the elements of
into a column vector, in row-major order, so that we have
In section 3 we have said that a neural network, considered as a function of its input data, is convex and continuously differentiable on each piece. Now, a neural network need not be piecewise convex as a function of the entirety of its parameters. (To see this, consider the following two-layer network: , , and . For we have . Now fix the input as . Considered as a function of its parameters, this is , which is decidedly not convex.) However, we can regain piecewise convexity by considering it only as a function of the parameters in a single layer, all others held constant.
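The failure of joint convexity can be illustrated with a small numerical check. The function below is a hypothetical stand-in, a linear two-layer network with one scalar weight per layer under squared error; it is not necessarily the network described in the text. The names `loss` and `convex_in_a` are likewise assumptions.

```python
def loss(a, b, x=1.0, y=1.0):
    """Squared error of the hypothetical two-layer network x -> a * (b * x)."""
    return (a * b * x - y) ** 2

# Joint convexity fails: both endpoints have zero loss, the midpoint does not.
p, q = (1.0, 1.0), (-1.0, -1.0)
mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
jointly_convex_on_segment = loss(*mid) <= (loss(*p) + loss(*q)) / 2.0

def convex_in_a(b, pts):
    """Midpoint check in the first-layer weight, second-layer weight held at b."""
    return all(loss((a1 + a2) / 2.0, b) <= (loss(a1, b) + loss(a2, b)) / 2.0 + 1e-12
               for a1 in pts for a2 in pts)

single_layer_ok = convex_in_a(0.5, [-2.0, -1.0, 0.0, 1.0, 2.0])
```

With the second weight fixed, the loss is a convex quadratic in the first weight, which matches the single-layer statement that follows.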
A neural network is continuous piecewise convex and piecewise continuously differentiable as a function of the parameters in a single layer.
For the time being, assume the input data consists of a single point . By definition is the composition of a convex objective and layers , with a function of . Let denote the network considered as a function of the parameters of layer , all others held constant. Now, the layers are constant with respect to the parameters of , so we can write . Thus on each piece of we have
By definition is a continuous piecewise affine function of its parameters. Since is constant, we have that is a continuous piecewise affine function of the parameters of . Now, by theorem 2.4 we have that is a continuous piecewise affine function of the parameters of . Thus by theorem 2.5, is continuous piecewise convex.
To establish piecewise continuous differentiability, recall that affine functions are continuously differentiable, as is .
Having established the theorem for the case of a single data point, consider the case where we have multiple data points, denoted . Now, by theorem 2.6 the arithmetic mean is continuous piecewise convex. Furthermore, the arithmetic mean preserves piecewise continuous differentiability. Thus these results hold for the mean value of the network over the dataset.
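We conclude this section with a simple remark which will be useful in later sections. Let be a neural network, considered as a function of the parameters of the layer, and let be a piece of . Then the optimization problem of minimizing this function over that piece is convex, since the function is convex on the piece and the piece is itself a convex set.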
5 Network parameters of multiple layers
In the previous section we analyzed the convexity properties of neural networks when optimizing the parameters of a single layer, all others held constant. Now we are ready to extend these results to the ultimate goal of simultaneously optimizing all network parameters. Although not convex, the problem has a special convex substructure that we can exploit in proving future results. We begin by defining this substructure for point sets and functions.
Let , let , and let . The set
is the cross-section of intersecting with respect to .
In other words, is the subset of for which every point is equal to in the components not indexed by . Note that this differs from the typical definition, which is the intersection of a set with a hyperplane. For example, is the -axis, whereas is the -plane. Note also that cross-sections are not unique, for example . In this case the first two components of the cross section are irrelevant, but we will maintain them for notational convenience. We can now apply this concept to functions on .
Let , let and let be a collection of sets covering . We say that is multi-convex with respect to if is convex when restricted to the cross section , for all and .
This formalizes the notion of restricting a non-convex function to a variable subset on which it is convex, as in section 4 when a neural network was restricted to the parameters of a single layer. For example, let , and let , and . Then is a convex function of with fixed at . Similarly, is a convex function of with fixed at . Thus is multi-convex with respect to . To fully define a multi-convex optimization problem, we introduce a similar concept for point sets.
Let and let be a collection of sets covering . We say that is multi-convex with respect to if the cross-section is convex for all and .
This generalizes the notion of biconvexity found in the optimization literature Gorski:2007:BiConvex . From here, we can extend definition 2.1 to multi-convex functions. However, we will drop the topological restrictions on the pieces of our function, since multi-convex sets need not be connected.
Let be a continuous function. We say that is continuous piecewise multi-convex if there exists a collection of multi-convex functions and multi-convex sets covering such that for each we have for all . Next, let . Then, is continuous piecewise multi-convex so long as each component is, as in definition 2.2.
From this definition, it is easily verified that a continuous piecewise multi-convex function admits a representation where all pieces are multi-convex, as in the proof of theorem 2.3.
Before we can extend the results of section 4 to multiple layers, we must add one final constraint on the definition of a neural network. That is, each of the layers must be continuous piecewise multi-convex, considered as functions of both the parameters and the input. Again, this is easily verified for all of the layer types previously mentioned. We have already shown they are piecewise convex on each cross-section, taking our index sets to separate the parameters from the input data. It only remains to show that the number of pieces is finite. The only layer which merits consideration is the ReLU, which we can see from equation 4 consists of two pieces for each component: the “dead” or constant region, with , and its complement. With components we have at most pieces, corresponding to binary assignments of “dead” or “alive” for each component.
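The counting argument can be illustrated by enumerating the “dead” or “alive” patterns of a small ReLU layer. This is a minimal sketch under stated assumptions: the weights `W`, biases `b` and sampling grid are hypothetical, and for simplicity the patterns are enumerated over the input only, with the parameters held fixed.

```python
from itertools import product

def activation_pattern(W, b, x):
    """Tuple of booleans, True where the corresponding ReLU component is alive."""
    return tuple(sum(w * xi for w, xi in zip(row, x)) + bi > 0.0
                 for row, bi in zip(W, b))

# Hypothetical layer with three components acting on a two-dimensional input.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]

grid = [i / 2.0 for i in range(-6, 7)]
patterns = {activation_pattern(W, b, x) for x in product(grid, repeat=2)}
upper_bound = 2 ** len(W)
```

For this layer only seven of the eight binary assignments occur, since the third component cannot be alive while the first two are dead. The bound is therefore an upper bound and need not be attained.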
Having said that each layer is continuous piecewise multi-convex, we can extend these results to the whole network.
Let be a neural network, and let be a collection of index sets, one for the parameters of each layer of . Then is continuous piecewise multi-convex with respect to .
We begin the proof with a lemma for more general multi-convex functions.
Let , , and let and be continuous piecewise multi-convex, with respect to a collection of index sets , and with respect to , where indexes the variables in , and the variables in . Then is continuous piecewise multi-convex with respect to .
Let be a piece of , let be a piece of and let , with chosen so that . Clearly is multi-convex on with respect to . It remains to show that is a multi-convex set. Now, let and we shall show that the cross-sections are convex. First, for any we have . Similarly, we have . These sets are convex, as they are the Cartesian products of convex sets Rockafellar:1970:ConvexAnalysis . Finally, as in the proof of theorem 2.4, we can cover with the finite collection of all such pieces , taken over all and .
Our next lemma extends theorem 2.6 to multi-convex functions.
Let be a collection of sets covering , and let and be continuous piecewise multi-convex with respect to . Then so is .
Let be a piece of and be a piece of with . Then for all , , a convex set on which is convex. Thus is continuous piecewise multi-convex, where the pieces of are the intersections of pieces of and .
We can now prove the theorem.
For the moment, assume we have only a single data point. Now, let and denote layers of , with parameters . Since and are continuous piecewise multi-convex functions of their parameters and input, we can write the two-layer sub-network as . By repeatedly applying lemma 5.6, the whole network is multi-convex on a finite number of sets covering the input and parameter space.
Now we extend the theorem to the whole dataset, where each data point defines a continuous piecewise multi-convex function . By lemma 5.7, the arithmetic mean is continuous piecewise multi-convex.
In the coming sections, we shall see that multi-convexity allows us to give certain guarantees about the convergence of various optimization algorithms. But first, we shall prove some basic results independent of the optimization procedure. These results were summarized by Gorski et al. for the case of biconvex differentiable functions Gorski:2007:BiConvex . Here we extend them to piecewise functions and arbitrary index sets. First we define a special type of minimum relevant for multi-convex functions.
Let and let be a collection of sets covering . We say that is a partial minimum of with respect to if for all .
In other words, is a partial minimum of with respect to if it minimizes on every cross-section of intersecting , as shown in figure 2. By convexity, these points are intimately related to the stationary points of .
Let be a collection of sets covering , let be continuous piecewise multi-convex with respect to , and let . Then is a partial minimum of on every piece containing .
Let be a piece of containing , let , and let denote the relevant cross-section of . We know is convex on , and since , we have that minimizes on this convex set. Since this holds for all , is a partial minimum of on .
It is clear that multi-convexity provides a wealth of results concerning partial minima, while piecewise multi-convexity restricts those results to a subset of the domain. Less obvious is that partial minima of smooth multi-convex functions need not be local minima. An example was pointed out by a reviewer of this work, that the biconvex function has a partial minimum at the origin which is not a local minimum. However, the converse is easily verified, even in the absence of differentiability.
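The distinction between partial and local minima can be checked numerically. The formula of the reviewer's example is not recoverable from the text here, so the sketch assumes the biconvex function f(x, y) = xy purely for illustration; the sampling grids are also hypothetical.

```python
def f(x, y):
    """Assumed biconvex example: affine in x for fixed y, and vice versa."""
    return x * y

ts = [i / 100.0 for i in range(-100, 101)]

# The origin minimizes f on both cross-sections through it (the two axes).
min_on_x_axis = all(f(0.0, 0.0) <= f(t, 0.0) for t in ts)
min_on_y_axis = all(f(0.0, 0.0) <= f(0.0, t) for t in ts)
is_partial_minimum = min_on_x_axis and min_on_y_axis

# Moving both variables at once decreases f arbitrarily close to the origin.
small = [10.0 ** -k for k in range(1, 8)]
is_local_minimum = not any(f(t, -t) < f(0.0, 0.0) for t in small)
```

The decrease is only visible when both coordinates vary simultaneously, which is exactly the kind of move that a partial minimum does not rule out.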
Let be a collection of sets covering , let be continuous piecewise multi-convex with respect to , and let be a local minimum on some piece of . Then is a partial minimum on .
The proof is essentially the same as that of theorem 5.9.
We have seen that for multi-convex functions there is a close relationship between stationary points, local minima and partial minima. For these functions, infinitesimal results concerning derivatives and local minima can be extended to larger sets. However, we make no guarantees about global minima. The good news is that, unlike global minima, we shall see that we can easily solve for partial minima.
6 Gradient descent
In the realm of non-convex optimization, also called global optimization, methods can be divided into two groups: those which can certifiably find a global minimum, and those which cannot. In the former group we sacrifice speed, in the latter correctness. This work focuses on algorithms of the latter kind, called local or sub-optimal methods, as only this type is used in practice for deep neural networks. In particular, the most common methods are variants of gradient descent, where the gradient of the network with respect to its parameters is computed by a procedure called backpropagation. Since its explanation is often obscured by jargon, we shall provide a simple summary here.
Backpropagation is nothing but the chain rule applied to the layers of a network. Splitting the network into two functions, where , and , we have
where denotes the Jacobian operator. Note that here the parameters of are considered fixed, whereas the parameters of are variable and the input data is fixed. Thus is the gradient of with respect to the parameters of , if it exists. The special observation is that we can proceed from the top layer of the neural network to the bottom , with , and , each time computing the gradient of with respect to the parameters of . In this way, we need only store the vector and the matrix can be forgotten at each step. This is known as the “backward pass,” which allows for efficient computation of the gradient of a neural network with respect to its parameters. A similar algorithm computes the value of as a function of the input data, which is often needed to evaluate . First we compute and store as a function of the input data, then , and so on until we have . This is known as the “forward pass.” After one forward and one backward pass, we have computed with respect to all the network parameters.
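The forward and backward passes can be sketched for a tiny network. Everything below is a hypothetical illustration: a scalar two-layer ReLU network with weights `w1` and `w2`, a squared-error objective, and a finite-difference check that is not part of the procedure described above.

```python
def relu(t):
    return max(0.0, t)

def forward(w1, w2, x, y):
    """Forward pass: compute and store the output of each layer in turn."""
    a1 = w1 * x
    h1 = relu(a1)
    z = w2 * h1
    loss = (z - y) ** 2
    return a1, h1, z, loss

def backward(w1, w2, x, y):
    """Backward pass: apply the chain rule from the top layer to the bottom."""
    a1, h1, z, _ = forward(w1, w2, x, y)
    d_z = 2.0 * (z - y)                        # objective w.r.t. network output
    d_w2 = d_z * h1                            # gradient for the top layer
    d_h1 = d_z * w2                            # only this value is carried down
    d_a1 = d_h1 * (1.0 if a1 > 0.0 else 0.0)   # ReLU derivative on its active piece
    d_w1 = d_a1 * x                            # gradient for the bottom layer
    return d_w1, d_w2

def finite_difference(w1, w2, x, y, eps=1e-6):
    """Central differences, used only to sanity-check the backward pass."""
    loss = lambda u, v: forward(u, v, x, y)[3]
    d_w1 = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2.0 * eps)
    d_w2 = (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2.0 * eps)
    return d_w1, d_w2

grad_bp = backward(0.8, -0.5, 1.5, 1.0)
grad_fd = finite_difference(0.8, -0.5, 1.5, 1.0)
```

In the scalar case the Jacobians reduce to ordinary derivatives, so the quantity passed from layer to layer is a single number rather than a vector.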
Having computed , we can update the parameters by gradient descent, defined as follows.
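Let $S \subseteq \mathbb{R}^n$, and let $f : S \to \mathbb{R}$ be partially differentiable, with $x_0 \in S$. Then gradient descent on $f$ is the sequence $\{x_n\}_{n=0}^{\infty}$ defined by
$$x_{n+1} = x_n - \alpha_n \nabla f(x_n),$$
where $\alpha_n > 0$ is called the step size or “learning rate.” In this work we shall make the additional assumption that $\sum_{n=0}^{\infty} \alpha_n = \infty$.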
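The definition can be sketched as a short routine. This is a minimal scalar version under stated assumptions: the objective, its gradient, the starting point and the step-size schedule 1/(n + 2) are all hypothetical, chosen so that the step sizes are positive and not summable.

```python
def gradient_descent(grad, x0, step, n_steps):
    """Iterate x_{n+1} = x_n - step(n) * grad(x_n), starting from x0."""
    x = x0
    for n in range(n_steps):
        x = x - step(n) * grad(x)
    return x

def grad(x):
    """Gradient of the hypothetical objective (x - 3)^2 / 2."""
    return x - 3.0

def step(n):
    """Positive, decreasing and non-summable step sizes."""
    return 1.0 / (n + 2)

x_final = gradient_descent(grad, 0.0, step, 10000)
```

For this particular objective and schedule the error shrinks like 1/(n + 1), so convergence to the stationary point is slow but assured.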
Variants of this basic procedure are preferred in practice because their computational cost scales well with the number of network parameters. There are many different ways to choose the step size, but our assumption that covers what is usually done with deep neural networks. Note that we have not defined what happens if . Since we are ultimately interested in neural networks on , we can ignore this case and say that the sequence diverges. Gradient descent is not guaranteed to converge to a global minimum for all differentiable functions. However, it is natural to ask to which points it can converge. This brings us to a basic but important result.
Let , and let result from gradient descent on with , and continuously differentiable at . Then .
First, we have
Assume for the sake of contradiction that for the partial derivative we have . Now, pick some such that , and by continuous differentiability, there is some such that for all , implies . Now, there must be some such that for all we have , so that does not change sign. Then we can write
But this contradicts the fact that converges. Thus .
In the convex optimization literature, this simple result is sometimes stated in connection with Zangwill’s much more general convergence theorem Zangwill:1969:Optimization ; Iusem:2003:subgradientConvergence . Note, however, that unlike Zangwill we state necessary, rather than sufficient conditions for convergence. While many similar results are known, it is difficult to strictly weaken the conditions of theorem 6.2. For example, if we relax the condition that is not summable, and take , then will always converge to a non-stationary point. Similarly, if we relax the constraint that is continuously differentiable, taking and decreasing monotonically to zero, we will always converge to the origin, which is not differentiable. Furthermore, if we have with constant, then will not converge for almost all . It is possible to prove much stronger necessary and sufficient conditions for gradient descent, but these results require additional assumptions about the step size policy as well as the function to be minimized, and possibly even the initialization Nesterov:2004:ConvexBook .
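The three relaxations discussed above can be explored numerically. The specific functions are assumptions, since the formulas are not recoverable from the text here: the sketch uses f(x) = x for the summable case and f(x) = |x| for the other two, with hypothetical starting points and step sizes.

```python
def gradient_descent(grad, x0, step, n_steps):
    """Iterate x_{n+1} = x_n - step(n) * grad(x_n) and return all iterates."""
    xs = [x0]
    for n in range(n_steps):
        xs.append(xs[-1] - step(n) * grad(xs[-1]))
    return xs

def abs_grad(x):
    """Derivative of |x| away from the origin; the value at 0 is an arbitrary choice."""
    return 1.0 if x > 0.0 else -1.0

# Summable steps on f(x) = x: the iterates converge, but the gradient is 1 at the limit.
summable = gradient_descent(lambda x: 1.0, 0.0, lambda n: 0.5 ** (n + 1), 60)

# Non-summable steps decreasing to zero on f(x) = |x|: the iterates approach the
# origin, where f is not differentiable.
decreasing = gradient_descent(abs_grad, 0.35, lambda n: 1.0 / (n + 1), 5000)

# Constant steps on f(x) = |x|: the iterates oscillate and do not converge.
constant = gradient_descent(abs_grad, 0.35, lambda n: 1.0, 50)
```

In the second case the distance to the origin is eventually bounded by the previous step size, which is why a schedule decreasing to zero suffices there.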
It is worth discussing in greater detail, since this is a piecewise affine function and thus of interest in our investigation of neural networks. While we have said its only convergence point is not differentiable, it remains subdifferentiable, and convergence results are known for subgradient descent Iusem:2003:subgradientConvergence . In this work we shall not make use of subgradients, instead considering descent on a piecewise continuously differentiable function, where the pieces are and . Although theorem 6.2 does not apply to this function, the relevant results hold anyways. That is, is minimal on some piece of , a result which extends to any continuous piecewise convex function, as any saddle point is guaranteed to minimize some piece.
Here we should note one way in which this analysis fails in practice. So far we have assumed the gradient is precisely known. In practice, it is often prohibitively expensive to compute the average gradient over large datasets. Instead we take random subsamples, in a procedure known as stochastic gradient descent. We will not analyze its properties here, as current results on the topic impose additional restrictions on the objective function and step size, or require different definitions of convergence Bertsekas:2010:IncrementalGradient ; Bach:2011:sgd ; Ge:2015:sgdSaddle . Restricting ourselves to the true gradient allows us to provide simple proofs applying to an extensive class of neural networks.
We are now ready to generalize these results to neural networks. There is a slight ambiguity in that the boundary points between pieces need not be differentiable, nor even sub-differentiable. Since we are interested only in necessary conditions, we will say that gradient descent diverges when does not exist. However, our next theorem can at least handle non-differentiable limit points.
Let be a collection of sets covering , let be continuous piecewise multi-convex with respect to , and piecewise continuously differentiable. Then, let result from gradient descent on , with , such that either
is continuously differentiable at , or
there is some piece of and some such that for all .
Then is a partial minimum of on every piece containing .
If the first condition holds, the result follows directly from theorems 6.2 and 5.9. If the second condition holds, then is a convergent gradient descent sequence on , the active function of on . Since is continuously differentiable on , the first condition holds for . Since , is a partial minimum of as well.
The first condition of theorem 6.3 holds for every point in the interior of a piece, and some boundary points. The second condition extends these results to non-differentiable boundary points so long as gradient descent is eventually confined to a single piece of the function. For example, consider the continuous piecewise convex function as shown in figure 3. When we converge to from the piece , it is as if we were converging on the smooth function . This example also illustrates an important caveat regarding boundary points: although is an extremum of on , it is not an extremum on .
7 Iterated convex optimization
Although the previous section contained some powerful results, theorem 6.3 suffers from two main weaknesses: it is only a necessary condition, and it requires extra care at non-differentiable points. It is difficult to overcome these limitations with gradient descent. Instead, we shall define a different optimization technique, from which necessary and sufficient convergence results follow, regardless of differentiability.
Iterated convex optimization splits a non-convex optimization problem into a number of convex sub-problems, solving the sub-problems in each iteration. For a neural network, we have shown that the problem of optimizing the parameters of a single layer, all others held constant, is piecewise convex. Thus, restricting ourselves to a given piece yields a convex optimization problem. In this section, we show that these convex sub-problems can be solved repeatedly, converging to a piecewise partial optimum.
Let be a collection of sets covering , and let and be multi-convex with respect to . Then iterated convex optimization is any sequence where is a solution to the optimization problem
We call this iterated convex optimization because problem 7 can be divided into convex sub-problems
for each . In this work, we assume the convex sub-problems are solvable, without delving into specific solution techniques. Methods for alternating between solvable sub-problems have been studied by many authors, for many different types of sub-problems Wendell:1976:Bilinear . In the context of machine learning, the same results have been developed for the special case of linear autoencoders Baldi:2012:ComplexValuedAutoencoders . Still, extra care must be taken in extending these results to arbitrary index sets. The key is that is not updated until all sub-problems have been solved, so that each iteration consists of solving convex sub-problems. This is equivalent to the usual alternating convex optimization for biconvex functions, where consists of two sets, but not for general multi-convex functions.
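The procedure can be sketched on a small biconvex problem. This is a hypothetical illustration: the regularized objective `g`, the constant `LAM` and the starting point are assumptions, and the sub-problems are solved in closed form only because each is a one-dimensional convex quadratic. Both sub-problems are solved from the current iterate before any update is made, and the next iterate is the best of the resulting candidates.

```python
LAM = 0.1  # hypothetical regularization weight

def g(x, y):
    """Hypothetical biconvex objective with a squared-norm regularizer."""
    return (x * y - 1.0) ** 2 + LAM * (x * x + y * y)

def solve_x(y):
    """Minimizer over x with y fixed, from setting the partial derivative to zero."""
    return y / (y * y + LAM)

def solve_y(x):
    """Minimizer over y with x fixed."""
    return x / (x * x + LAM)

def iterate(x, y):
    """Solve every sub-problem from (x, y), then keep the best candidate."""
    candidates = [(solve_x(y), y), (x, solve_y(x))]
    return min(candidates, key=lambda p: g(*p))

point = (2.0, 0.5)
values = [g(*point)]
for _ in range(200):
    point = iterate(*point)
    values.append(g(*point))
```

With two index sets this reduces to alternating minimization in practice, since the sub-problem just solved offers no further decrease. For this objective the iterates approach the point where both coordinates equal the square root of 0.9, at which both partial derivatives vanish.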
Some basic convergence results follow immediately from the solvability of problem 7. First, note that is a feasible point, so we have . This implies that exists, so long as is bounded below. However, this does not imply the existence of . See Gorski et al. for an example of a biconvex function on which diverges Gorski:2007:BiConvex . To prove stronger convergence results, we introduce regularization to the objective.
Let be a collection of sets covering , and let and be multi-convex with respect to . Next, let , and let , where and is a convex norm. Finally, let result from iterated convex optimization of . Then has at least one convergent subsequence, in the topology induced by the metric .
From lemma 2.7, is multi-convex, so we are allowed iterated convex optimization. Now, if we have that . Thus whenever . Since is a non-increasing sequence, we have that . Equivalently, lies in the set . Since is continuous, is closed and bounded, and thus it is compact. Then, by the Bolzano-Weierstrass theorem, has at least one convergent subsequence Johnsonbaugh:1970:RealAnalysis .
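The boundedness step of this proof can be written out explicitly. The following is a sketch under assumed notation, which may differ from the symbols used elsewhere in this article: $f$ denotes the original objective with $B = \inf f > -\infty$, $g(x) = f(x) + \lambda \|x\|$ denotes the regularized objective with $\lambda > 0$, and $x^k$ denotes the iterates starting from $x^0$.

```latex
\begin{align*}
g(x) &= f(x) + \lambda \|x\| \;\ge\; B + \lambda \|x\|, \\
\|x\| > \frac{g(x^0) - B}{\lambda} &\implies g(x) > g(x^0), \\
g(x^k) \le g(x^0) \ \text{for all } k &\implies \|x^k\| \le \frac{g(x^0) - B}{\lambda}.
\end{align*}
```

Under these assumptions every iterate lies in a sublevel set of $g$ that is contained in a ball of radius $(g(x^0) - B)/\lambda$, which is the bounded set to which the compactness argument is applied.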
In theorem 7.2, the function is called the regularized version of . In practice, regularization often makes a non-convex optimization problem easier to solve, and can reduce over-fitting. The theorem shows that iterated convex optimization on a regularized function always has at least one convergent subsequence. Next, we shall establish some rather strong properties of the limits of these subsequences.
Let be a collection of sets covering , and let and be multi-convex with respect to . Next, let result from iterated convex optimization of . Then the limit of every convergent subsequence is a partial minimum on with respect to , in the topology induced by the metric for some norm . Furthermore, if and are convergent subsequences, then .
Let denote a subsequence of with . Now, assume for the sake of contradiction that is not a partial minimum on with respect to . Then there is some and some with such that . Now, is continuous at , so there must be some such that for all , implies . Furthermore, since is an interior point, there must be some open ball of radius centered at , as shown in figure 4. Now, there must be some such that . Then, let , and since , we know that , and thus . Finally, , so we have , which contradicts the fact that minimizes over a set containing . Thus is a partial minimum on with respect to .
Finally, let and be two convergent subsequences of , with and , and assume for the sake of contradiction that . Then by continuity, there is some such that . But this contradicts the fact that is non-increasing. Thus .
The previous theorem is an extension of results reviewed in Gorski et al. to arbitrary index sets Gorski:2007:BiConvex . While Gorski et al. explicitly constrain the domain to a compact biconvex set, we show that regularization guarantees that the iterates cannot escape a certain compact set, establishing the necessary condition for convergence. Furthermore, our results hold for general multi-convex sets, while the earlier result is restricted to Cartesian products of compact sets.
These results for iterated convex optimization are considerably stronger than what we have shown for gradient descent. While any bounded sequence in finite-dimensional Euclidean space has a convergent subsequence, and we can guarantee boundedness for some variants of gradient descent, we cannot normally say much about the limits of subsequences. For iterated convex optimization, we have shown that the limit of any convergent subsequence is a partial minimum, and all limits of subsequences are equal in objective value. For all practical purposes, this is just as good as saying that the original sequence converges to a partial minimum.
8 Global optimization
Although we have provided necessary and sufficient conditions for convergence of various optimization algorithms on neural networks, the points of convergence need only minimize cross-sections of pieces of the domain. Of course we would prefer results relating the points of convergence to global minima of the training objective. In this section we illustrate the difficulty of establishing such results, even for the simplest of neural networks.
In recent years much work has been devoted to providing theoretical explanations for the empirical success of deep neural networks, a full accounting of which is beyond the scope of this article. In order to simplify the problem, many authors have studied linear neural networks, in which the layers have the form , where is the parameter matrix. With multiple layers this is clearly a linear function of the input, but not of the parameters. As a special case of piecewise affine functions, our previous results suffice to show that these networks are multi-convex as functions of their parameters. This was proven for the special case of linear autoencoders by Baldi and Lu Baldi:2012:ComplexValuedAutoencoders .
Many authors have claimed that linear neural networks contain no “bad” local minima, i.e. every local minimum is a global minimum Kawaguchi:2016:WithoutPoorLocalMinima ; Soudry:2016:NoBadLocalMinima . This is especially evident in the study of linear autoencoders, which were shown to admit many points of inflection, but only a single strict minimum Baldi:2012:ComplexValuedAutoencoders . While powerful, this claim does not apply to the networks seen in practice. To see this, consider the dataset consisting of three pairs, parameterized by
. Note that the dataset has zero mean and unit variance in the input variable, which is common practice in machine learning. However, we do not take zero mean in the output variable, as the model we shall adopt is non-negative.
Next, consider the simple neural network
This is the squared error of a single ReLU neuron, parameterized by . We have chosen this simplest of all networks because we can solve for the local minima in closed form, and show they are indeed very bad. First, note that is a continuous piecewise convex function of six pieces, realized by dividing the plane along the line for each , as shown in figure 5. Now, for all but one of the pieces, the ReLU is “dead” for at least one of the three data points, i.e. . On these pieces, at least one of the three terms of equation 9 is constant. The remaining terms are minimized when , represented by the three dashed lines in figure 5. There are exactly three points where two of these lines intersect, and we can easily show that two of them are strict local minima. Specifically, the point minimizes the first two terms of equation 9, while minimizes the first and last term. In each case, the remaining term is constant over the piece containing the point of intersection. Thus these points are strict global minima on their respective pieces, and strict local minima on . Furthermore, we can compute and . This gives
Now, it might be objected that we are not permitted to take if we require that the variable has unit variance. However, these same limits can be achieved with variance tending to unity by adding instances of the point
to our dataset. Thus even under fairly stringent requirements we can construct a dataset yielding arbitrarily bad local minima, both in the parameter space and the objective value. This provides some weak justification for the empirical observation that success in deep learning depends greatly on the data at hand.
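The flavor of this construction can be checked numerically. The sketch below uses a small hypothetical dataset (`X`, `Y`), not the parameterized dataset of this section, and it does not normalize the input to unit variance. The squared error of a single ReLU neuron with slope `a` and bias `b` then has two local minima, `A` and `B`, each obtained by fitting two data points exactly while the ReLU is dead on the third. The local-minimum test is only a random spot-check on a small circle, not a proof.

```python
import numpy as np

# Hypothetical three-point dataset (zero-mean inputs, not unit variance).
X = np.array([-1.0, 0.0, 1.0])
Y = np.array([3.0, 1.0, 10.0])


def loss(a, b):
    """Squared error of a single ReLU neuron max(a * x + b, 0)."""
    return float(np.sum((np.maximum(a * X + b, 0.0) - Y) ** 2))


# A fits the first two points exactly; the ReLU is dead on the third.
A = (-2.0, 1.0)
# B fits the last two points exactly; the ReLU is dead on the first.
B = (9.0, 1.0)


def looks_like_strict_local_min(point, radius=1e-3, trials=1000, seed=0):
    """Spot-check: every sampled point at the given radius has larger loss."""
    rng = np.random.default_rng(seed)
    base = loss(*point)
    for _ in range(trials):
        d = rng.normal(size=2)
        d *= radius / np.linalg.norm(d)
        if loss(point[0] + d[0], point[1] + d[1]) <= base:
            return False
    return True


loss_A = loss(*A)  # only the dead third term contributes: 10**2
loss_B = loss(*B)  # only the dead first term contributes: 3**2
```

In this sketch the objective at `A` equals the square of the third target, so increasing that target makes the gap between the two local minima as large as desired, which mirrors the limiting argument above.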
We have shown that the results concerning local minima in linear networks do not extend to the nonlinear case. Ultimately this should not be a surprise, as with linear networks the problem can be relaxed to linear regression on a convex objective. That is, the composition of all linear layers is equivalent to the function for some matrix , and under our previous assumptions the problem of finding the optimal is convex. Furthermore, it is easily shown that the number of parameters in the relaxed problem is polynomial in the number of original parameters. Since the relaxed problem fits the data at least as well as the original, it is not surprising that the original problem is computationally tractable.
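The relaxation can be illustrated with a short numerical sketch. The layer sizes, the random data, and the use of a squared-error objective without biases are assumptions made for this example only. A two-layer linear network collapses to a single matrix, and the relaxed problem over that matrix is ordinary least squares, solved here with the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer linear network: 3 inputs, 4 hidden units, 2 outputs.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

# Hypothetical data: 20 samples stored as columns.
X = rng.normal(size=(3, 20))
Y = rng.normal(size=(2, 20))


def network(inputs):
    """Apply the two linear layers in sequence."""
    return W2 @ (W1 @ inputs)


def sq_error(M):
    """Squared error of the single linear map M on the data."""
    return float(np.sum((M @ X - Y) ** 2))


# The composition of the layers is itself one linear map.
W_collapsed = W2 @ W1

# Relaxed problem: least squares over a single matrix, which is convex.
W_star = Y @ np.linalg.pinv(X)

err_network = sq_error(W_collapsed)
err_relaxed = sq_error(W_star)
```

Because every network of this form corresponds to some single matrix, the relaxed solution `W_star` fits the data at least as well as any setting of `W1` and `W2`.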
This simple example was merely meant to illustrate the difficulty of establishing results for every local minimum of every neural network. Since training a certain kind of network is known to be NP-Complete, it is difficult to give any guarantees about worst-case global behavior Blum:1992:OCT:148433.148441 . We have made no claims, however, about probabilistic behavior on the average practical dataset, nor have we ruled out the effects of more specialized networks, such as very deep ones.
We showed that a common class of neural networks is piecewise convex in each layer, with all other parameters fixed. We extended this to a theory of piecewise multi-convex functions, showing that the training objective function can be represented by a finite number of multi-convex functions, each active on a multi-convex parameter set. From here we derived various results concerning the extrema and stationary points of piecewise multi-convex functions. We established convergence conditions for both gradient descent and iterated convex optimization on this class of functions, showing they converge to piecewise partial minima. Similar results are likely to hold for a variety of other optimization algorithms, especially those guaranteed to converge at stationary points or local minima.
We have witnessed the utility of multi-convexity in proving convergence results for various optimization algorithms. However, this property may be of practical use as well. Better understanding of the training objective could lead to the development of faster or more reliable optimization methods, heuristic or otherwise. These results may provide some insight into the practical success of sub-optimal algorithms on neural networks. However, we have also seen that local optimality results do not extend to global optimality as they do for linear autoencoders. Clearly there is much left to discover about how, or even if, we can optimize deep, nonlinear neural networks.
The author would like to thank Mihir Mongia for his helpful comments in preparing this manuscript.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
- (1) A. L. Blum, R. L. Rivest, Training a 3-node neural network is NP-complete, Neural Networks 5 (1) (1992) 117–127. doi:10.1016/S0893-6080(05)80010-3.
- (2) K. Kawaguchi, Deep learning without poor local minima, arXiv:1605.07110.
- (3) D. Soudry, Y. Carmon, No bad local minima: Data independent training error guarantees for multilayer neural networks, arXiv:1605.08361.
- (4) J. Gorski, F. Pfeuffer, K. Klamroth, Biconvex sets and optimization with biconvex functions: a survey and extensions, Mathematical Methods of Operations Research 66 (3) (2007) 373–407. doi:10.1007/s00186-007-0161-1.
- (5) X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: G. J. Gordon, D. B. Dunson (Eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), Vol. 15, Journal of Machine Learning Research - Workshop and Conference Proceedings, 2011, pp. 315–323.
- (6) R. E. Wendell, A. P. Hurter, Minimization of a non-separable objective function subject to disjoint constraints, Operations Research 24 (4) (1976) 643–657.
- (7) P. Baldi, Z. Lu, Complex-valued autoencoders, Neural Networks 33 (2012) 136–147.
- (8) I. Tsevendorj, Piecewise-convex maximization problems, Journal of Global Optimization 21 (1) (2001) 1–14. doi:10.1023/A:1017979506314.
- (9) S. Ovchinnikov, Max-min representation of piecewise linear functions, Contributions to Algebra and Geometry 43 (1) (2002) 297–302.
- (10) S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, The Edinburgh Building, Cambridge, CB2 8RU, UK, 2004.
- (11) R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
- (12) W. I. Zangwill, Nonlinear programming : a unified approach, Prentice-Hall international series in management, Prentice-Hall, Englewood Cliffs, N.J., 1969.
- (13) A. N. Iusem, On the convergence properties of the projected gradient method for convex optimization, Computational and Applied Mathematics 22 (2003) 37 – 52.
- (14) Y. Nesterov, Introductory Lectures on Convex Optimization : A Basic Course, Applied optimization, Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.
- (15) D. P. Bertsekas, Incremental gradient, subgradient, and proximal methods for convex optimization: A survey, Tech. rep., Massachusetts Institute of Technology Laboratory for Information and Decision Systems (2010).
- (16) F. R. Bach, E. Moulines, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, in: Proceedings of the 25th Annual Conference on Neural Information Processing Systems, 2011, pp. 451–459.
- (17) R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points - online stochastic gradient for tensor decomposition, Vol. 1, Journal of Machine Learning Research - Workshop and Conference Proceedings, 2015, pp. 1–46.
- (18) R. Johnsonbaugh, W. E. Pfaffenberger, Foundations of Mathematical Analysis, Marcel Dekker, New York, New York, USA, 1981.