1 Introduction
The winner of the ILSVRC 2015 classification competition used a new architecture called residual networks (He et al., 2016), which enabled very fast training of very deep networks. These have since been widely adopted. (As of this writing, the paper that introduced this technique, published in 2015, has over 3700 citations.) Deep networks express models as the composition of transformations; residual networks depart from traditional deep learning models by using parameters to describe how each transformation differs from the identity, rather than how it differs from zero.
Motivated by this methodological advance, Hardt and Ma (2017) recently considered compositions of many linear maps, each close to the identity map. They showed that any matrix $A$ with spectral norm and condition number bounded by constants can be represented as a product of matrices $(I + A_L) \cdots (I + A_1)$, where each $A_i$ has spectral norm $O(1/L)$. They considered this nonconvex parameterization for a linear regression problem with additive Gaussian noise, and showed that any critical point of the quadratic loss for which the $A_i$ have sufficiently small spectral norm must correspond to the linear transformation that generated the data. This raised the possibility that gradient descent with each layer initialized to the identity might provably converge for this nonconvex optimization problem; Bartlett et al. (2018) investigated this, identifying sets of problems where this method converges, and where it does not.

In this paper, we continue this line of research. First, we identify a nonlinear counterpart of Hardt and Ma’s results motivated by deep residual networks: any smooth bi-Lipschitz map $h$ (that is, an invertible Lipschitz map with differentiable inverse) can be represented exactly as a composition $h_m \circ \cdots \circ h_1$ of functions that are close to the identity in the sense that each $h_i - \mathrm{Id}$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. Since a two-layer neural network with standard activation functions can approximate arbitrary continuous functions, we can represent each $h_i$ in the composition as $\mathrm{Id} + f_i$, where $f_i$ is computed by a two-layer network. The fact that $h_i - \mathrm{Id}$ has a small Lipschitz constant for deep networks shows that $f_i$ is small, in the sense that it only needs to approximate a slowly changing function.

The requirement in our analysis that $h$ is bi-Lipschitz generalizes the assumption in the linear case studied by Hardt and Ma that the map to be learned has a bounded condition number. The practical strength, and therefore relevance, of invertible feature maps in the nonlinear case is supported by the success of reversible networks (Maclaurin et al., 2015; Gomez et al., 2017; Dinh et al., 2017).
For our second result, we consider a nonlinear regression problem using a composition of near-identity nonlinear maps. If we consider Fréchet derivatives with respect to the functions in the composition, we show that any critical point of the quadratic criterion must be the global optimum. In contrast, if each is a two-layer net of the form , analogously to the architecture of He et al. (2016), and we consider derivatives with respect to the real-valued parameters, there are regression problems that give rise to suboptimal critical points. We discuss the implications of this analysis in Section 6.
A number of authors have investigated how using a deep architecture affects the set of functions computed by a network (see Montufar et al., 2014; Telgarsky, 2015; Poole et al., 2016; Mhaskar et al., 2016). Our main results abstract away the parameterization and focus on the expressiveness of compositions of near-identity functions, along with properties of their error landscapes.
2 Notation and Definitions
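Let $\mathrm{Id}$ denote the identity map on $\mathbb{R}^d$, $\mathrm{Id}(x) = x$. Throughout, $\|\cdot\|$ denotes a norm on $\mathbb{R}^d$. We also use $\|\cdot\|$ to denote an induced norm: for a function $f : X \to Y$, where $X$ and $Y$ are normed spaces with norms $\|\cdot\|_X$ and $\|\cdot\|_Y$, we write
$$\|f\| := \sup\left\{ \frac{\|f(x)\|_Y}{\|x\|_X} : x \in X,\ \|x\|_X > 0 \right\}.$$
Define the Lipschitz seminorm of $f$ as
$$\|f\|_L := \sup\left\{ \frac{\|f(x) - f(y)\|_Y}{\|x - y\|_X} : x, y \in X,\ x \neq y \right\}.$$
Define the ball of radius $\alpha$ in a normed space $(X, \|\cdot\|)$ as $B_\alpha(X) = \{x \in X : \|x\| \le \alpha\}$. For a function $f : \mathbb{R}^d \to \mathbb{R}^d$, $Df$ denotes the Jacobian matrix, that is, the matrix with entries $(Df)_{i,j} = \partial f_i / \partial x_j$.
For a functional $F : X \to Y$ defined on Banach spaces $X$ and $Y$, recall that the Fréchet derivative of $F$ at $f \in X$ is the linear operator $DF(f) : X \to Y$ satisfying $F(f + \Delta) = F(f) + DF(f)(\Delta) + o(\Delta)$, that is,
$$\lim_{\Delta \to 0} \frac{\|F(f + \Delta) - F(f) - DF(f)(\Delta)\|_Y}{\|\Delta\|_X} = 0.$$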
We use to denote the Fréchet derivative of .
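As a rough numerical companion to the Lipschitz seminorm defined above, the following sketch estimates it by sampling pairs of points. The seminorm is a supremum over all pairs, so a finite sample can only give a lower bound. The map `f`, the helper names, and the sampling scheme are illustrative assumptions and do not come from the paper.

```python
import math
import random

def lipschitz_estimate(f, dim, radius=1.0, pairs=2000, seed=0):
    """Lower-bound the Lipschitz seminorm of f on a cube by sampling pairs.

    Uses the Euclidean norm on both the input and the output space.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(pairs):
        x = [rng.uniform(-radius, radius) for _ in range(dim)]
        y = [rng.uniform(-radius, radius) for _ in range(dim)]
        gap = math.dist(x, y)
        if gap == 0.0:
            continue  # the supremum is over distinct points only
        best = max(best, math.dist(f(x), f(y)) / gap)
    return best

def residual(f):
    """Return the map x -> f(x) - x, that is, f - Id."""
    return lambda x: [fi - xi for fi, xi in zip(f(x), x)]

def f(x):
    """Hypothetical near-identity map: x + 0.1 * sin(x), componentwise."""
    return [xi + 0.1 * math.sin(xi) for xi in x]

est_f = lipschitz_estimate(f, dim=3)
est_res = lipschitz_estimate(residual(f), dim=3)
print("estimate for f:     ", est_f)
print("estimate for f - Id:", est_res)
```

For this particular `f`, the residual `f - Id` is 0.1-Lipschitz, so the second estimate cannot exceed 0.1, while the estimate for `f` itself stays near 1. This is the sense in which a map can be close to the identity without being small.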
3 Representation
For , denote . Let be a differentiable, invertible map satisfying the following properties: (a) Smoothness: for some and all ,
(1) 
(b) Lipschitz inverse: for some , ; (c) Positive orientation: For some , .
Then for all , there are functions satisfying, for all , and, on , .
Think of the functions as near-identity maps that might be computed as where and are matrices,
is a nonconstant nonlinearity, such as a sigmoidal function or piecewise-linear function, applied componentwise, and
is a vector. Although the proof constructs the
as differentiable (and even smooth) maps, each could, for example, be approximated to arbitrary accuracy on the compact using a single layer of ReLUs (for which ). (See, for example, Theorem 1 in (Hornik, 1991), and the comments in Section 3 of that paper about immediate generalizations to unbounded nonlinearities.) In that case, the conclusion of the theorem implies that can be Lipschitz.

Notice that the conclusion of the theorem does not require the function to be small; shifting by an arbitrary constant does not affect the Lipschitz property.
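To make the remark above concrete, here is a minimal sketch of one near-identity layer, assuming it has the form `h(x) = x + A relu(Bx + b)` with matrices `A`, `B` and a vector `b`. That form, the dimensions, and the random weights are assumptions chosen for illustration. Because ReLU is 1-Lipschitz, `h - Id` is Lipschitz with constant at most the product of the spectral norms of `A` and `B`; the sketch uses Frobenius norms, which upper-bound spectral norms, to avoid a linear-algebra dependency.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def relu(v):
    return [max(0.0, vi) for vi in v]

def frobenius(M):
    """Frobenius norm, an upper bound on the spectral norm."""
    return math.sqrt(sum(mij * mij for row in M for mij in row))

def make_layer(A, B, b):
    """Return h(x) = x + A relu(Bx + b) (assumed layer form)."""
    def h(x):
        hidden = relu([s + bi for s, bi in zip(matvec(B, x), b)])
        return [xi + ci for xi, ci in zip(x, matvec(A, hidden))]
    return h

rng = random.Random(0)
d, k = 3, 5                      # hypothetical input and hidden widths
A = [[rng.uniform(-0.05, 0.05) for _ in range(k)] for _ in range(d)]
B = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(k)]
b = [rng.uniform(-1.0, 1.0) for _ in range(k)]
h = make_layer(A, B, b)

bound = frobenius(A) * frobenius(B)

# Empirical Lipschitz ratio of the residual h - Id over random pairs.
worst = 0.0
for _ in range(2000):
    x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    y = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    rx = [hi - xi for hi, xi in zip(h(x), x)]
    ry = [hi - yi for hi, yi in zip(h(y), y)]
    worst = max(worst, math.dist(rx, ry) / math.dist(x, y))

print("sampled ratio:", worst, "bound:", bound)
```

Changing `b`, or adding a constant vector to the output, shifts the residual without changing these ratios, which matches the observation that the Lipschitz property is unaffected by constant shifts.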
The constants hidden in the big-O notation in the theorem are polynomial in , , , , and . (Here, and
denote the smallest and largest singular values.)
The condition is an unavoidable topological constraint that arises because of the orientation of the identity map. As Hardt and Ma (2017) argue in the linear context, if we view as a mapping from raw representations to meaningful features, we can easily set the orientation of appropriately (that is, so that ) without compromising the mapping’s usefulness.
To prove Theorem 3, we prove the following special case.
Consider an that satisfies the conditions of Theorem 3 and also and . Then for all , there are functions satisfying, for all ,
and, on , , provided where the constant depends on , and .
To see that Theorem 3 is a corollary, notice that we can write where Since satisfies and , Theorem 3 shows that it can be expressed as a composition of nearidentity maps. Furthermore, the translations and have the property that is Lipschitz. Finally, Theorem 2.1 in Hardt and Ma (2017) shows that we can decompose the Jacobian matrix with for , and this implies that the linear map is Lipschitz (see Lemma 4, Part 3 below).
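The linear step in this reduction can be illustrated on a hand-picked special case. The sketch below does not implement the construction of Hardt and Ma (2017). It only uses the fact that a scaled rotation `s * R(theta)` in the plane, with `s > 0`, is the `m`-th power of `s**(1/m) * R(theta/m)`, and checks numerically that each factor's distance to the identity shrinks roughly like `1/m`. The target matrix and all function names are assumptions.

```python
import math

def scaled_rotation(s, theta):
    """2x2 matrix s * R(theta): rotate by theta, then scale by s > 0."""
    c, sn = math.cos(theta), math.sin(theta)
    return [[s * c, -s * sn], [s * sn, s * c]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def distance_to_identity(s, theta):
    """Spectral norm of s * R(theta) - I.

    The two columns of this difference are orthogonal and have equal
    length, so the spectral norm is the length of either column.
    """
    return math.hypot(s * math.cos(theta) - 1.0, s * math.sin(theta))

s, theta = 2.0, 2.5              # hypothetical target A = 2 * R(2.5)
target = scaled_rotation(s, theta)

results = {}
for m in (4, 16, 64):
    s_m, theta_m = s ** (1.0 / m), theta / m
    factor = scaled_rotation(s_m, theta_m)
    product = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(m):
        product = matmul(factor, product)
    error = max(abs(product[i][j] - target[i][j])
                for i in range(2) for j in range(2))
    results[m] = (distance_to_identity(s_m, theta_m), error)
    print(m, results[m])
```

In this example the product reproduces the target up to rounding, and `m` times the distance to the identity stays bounded as `m` grows. The target has positive determinant, which is the orientation condition discussed above.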
Before proving Theorem 3, we observe that the smoothness property implies a bound on the accuracy of a linear approximation, and a Lipschitz bound. The proof is in Appendix A. For satisfying the conditions of Theorem 3 and any ,
and .
(of Theorem 3) We give an explicit construction of the . For , define by , where the constants will be chosen later. The
’s can be viewed as functions that interpolate between the identity (which is
, the limit as approaches zero of ) and (which is , because ). Note that is invertible on , with for . Define and, for , define by so that and in particular . It remains to show that, for a suitable choice of , the satisfy the Lipschitz condition.
We have
Now, fix and and set and . Then we have
(2) 
We consider two cases: when and are close, and when they are distant. First, suppose that .
Also, we can relate to via the Lipschitz property of :
so
(3) 
Combining, and using the assumption ,
(4) 
Now suppose that . From (2), we have
(6) 
where, in the first term, we have used the Lipschitz property from Lemma 3, with . But
by (1). Substituting into (6), and using (3) together with the assumption that ,
Combining with (4), it suffices to choose to satisfy and to satisfy, for , where
If we choose and set for , then these conditions are equivalent to and
Thus, it suffices if
using the inequality , which follows from convexity of .
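The telescoping idea behind this construction can be sketched in one dimension. The formulas of the proof are not recoverable from this text, so the sketch assumes the interpolating maps have the form `g_a(x) = h(a x) / a`, which tends to the identity as `a` approaches zero when `h(0) = 0` and `h'(0) = 1`, and defines each factor as `g_{a_i}` composed with the inverse of `g_{a_{i-1}}`. The choice `h(x) = exp(x) - 1` and the uniform schedule `a_i = i / m` are illustrative assumptions, not the schedule chosen in the proof.

```python
import math

def h(x):
    """Hypothetical target map with h(0) = 0 and h'(0) = 1."""
    return math.expm1(x)

def h_inv(y):
    return math.log1p(y)

def g(a, x):
    """Assumed interpolant between the identity (a -> 0) and h (a = 1)."""
    return h(a * x) / a

def g_inv(a, y):
    return h_inv(a * y) / a

def make_factors(m):
    """Factors h_1 = g_{a_1} and h_i = g_{a_i} o g_{a_{i-1}}^{-1}."""
    a = [i / m for i in range(1, m + 1)]      # assumed uniform schedule
    factors = [lambda x, a1=a[0]: g(a1, x)]
    for prev, cur in zip(a, a[1:]):
        factors.append(lambda x, p=prev, c=cur: g(c, g_inv(p, x)))
    return factors

def check(m, radius=1.0, grid=201):
    """Return (largest Lipschitz ratio of any h_i - Id, composition error)."""
    xs = [-radius + 2.0 * radius * j / (grid - 1) for j in range(grid)]
    ys = xs
    worst = 0.0
    for factor in make_factors(m):
        zs = [factor(y) for y in ys]
        for j in range(grid - 1):
            change = abs((zs[j + 1] - ys[j + 1]) - (zs[j] - ys[j]))
            worst = max(worst, change / abs(ys[j + 1] - ys[j]))
        ys = zs                                # feed outputs to next factor
    err = max(abs(y - h(x)) for x, y in zip(xs, ys))
    return worst, err

worst10, err10 = check(10)
worst100, err100 = check(100)
print(worst10, err10)
print(worst100, err100)
```

For this example the composition reproduces `h` on the grid up to rounding, and the largest Lipschitz ratio of any `h_i - Id` is bounded by `exp(1/m) - 1`, so it shrinks as the number of factors grows.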
4 Zero Fréchet derivatives with deep compositions
The following theorem is the main result of this section. It shows that if a composition of near-identity maps has zero Fréchet derivatives of a quadratic criterion with respect to the functions in the composition, then the composition minimizes that criterion. That is, all critical points of this kind are global minimizers; there are no saddle points or suboptimal local minimizers in the near-identity region.
Consider a distribution on , and define the criterion
Define a conditional expectation , so that minimizes . Consider the function computed by an layer network , and suppose that, for some and all , is differentiable, , and . Suppose that . Then for all ,
Thus, if is a critical point of , that is, for all , , we must have .
The theorem defines the expected quadratic loss under an arbitrary joint distribution, but in particular it could be a discrete distribution that is uniform on a training set.
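For the discrete case mentioned above, a small sketch can show the quadratic criterion and its minimizer on a toy training set. It assumes the criterion is half the mean squared error (the factor of one half is an assumption and does not affect the minimizer), uses scalar outputs, and takes the conditional expectation to be the per-input average of the labels. The data and function names are made up for illustration.

```python
from collections import defaultdict

def quadratic_criterion(h, data):
    """Empirical criterion: half the mean of (h(x) - y)^2 over the data."""
    return 0.5 * sum((h(x) - y) ** 2 for x, y in data) / len(data)

def conditional_mean(data):
    """Return x -> average of the labels y observed with that x."""
    groups = defaultdict(list)
    for x, y in data:
        groups[x].append(y)
    table = {x: sum(ys) / len(ys) for x, ys in groups.items()}
    return lambda x: table[x]

# Hypothetical training set with repeated inputs and noisy labels.
data = [(0.0, 1.0), (0.0, 3.0), (1.0, 2.0), (1.0, 2.5), (2.0, -1.0)]

h_star = conditional_mean(data)
q_star = quadratic_criterion(h_star, data)
q_shift = quadratic_criterion(lambda x: h_star(x) + 0.3, data)
q_ident = quadratic_criterion(lambda x: x, data)

print("conditional mean:", q_star)
print("shifted by 0.3:  ", q_shift)
print("identity map:    ", q_ident)
```

The excess of any predictor over the conditional mean equals half the mean squared difference between the two, which is the decomposition that the projection argument in the proof relies on. In the shifted example that excess is `0.5 * 0.3**2`.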
Notice that if satisfies the properties of Theorem 3, then it can be represented as a composition of with the required properties. If it cannot, then the theorem shows that the near-identity region will not contain critical points. The only property we require of is the boundedness condition . From the definition of the induced norm, this implicitly assumes that and that is differentiable at . In the context of learning embeddings, it seems reasonable to fix the embedding’s value at one input vector, and express its value elsewhere relative to that value.
Notice also that, although the theorem requires differentiability of the
, it is only important for various derivatives to be defined. In particular, a network with nondifferentiable but Lipschitz activation functions, like a ReLU network, could be approximated to arbitrary accuracy by replacing the ReLU nonlinearity with a differentiable one. The conclusions of the theorem apply to any critical point at a differentiable approximation of the ReLU network.
Suppose .

.

is invertible and .

For , , and hence .
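The first two parts of the lemma can be checked numerically on a simple example. The sketch assumes the hypothesis is that `f - Id` is Lipschitz with constant `alpha < 1`, and tests the standard consequences of that assumption: `f` distorts distances by factors between `1 - alpha` and `1 + alpha`, `f` is invertible, and `f^{-1} - Id` is Lipschitz with constant at most `alpha / (1 - alpha)`. The map `f(x) = x + alpha * sin(x)` and the fixed-point inversion are illustrative choices, not taken from the paper.

```python
import math

ALPHA = 0.3

def r(x):
    """Residual f - Id; it is ALPHA-Lipschitz because |cos| <= 1."""
    return ALPHA * math.sin(x)

def f(x):
    return x + r(x)

def f_inverse(y, iterations=200):
    """Solve f(x) = y by iterating x <- y - r(x).

    The update is a contraction with factor ALPHA < 1, so it converges
    to the unique solution from any starting point.
    """
    x = y
    for _ in range(iterations):
        x = y - r(x)
    return x

n = 400
xs = [-3.0 + 6.0 * j / n for j in range(n + 1)]
ys = [f(x) for x in xs]

# Part 1: distance distortion of f on neighbouring grid points.
ratios = [(ys[j + 1] - ys[j]) / (xs[j + 1] - xs[j]) for j in range(n)]
lo, hi = min(ratios), max(ratios)

# Part 2: invertibility and the Lipschitz constant of f^{-1} - Id.
inverses = [f_inverse(y) for y in ys]
round_trip = max(abs(xi - x) for xi, x in zip(inverses, xs))
inv_res = [xi - y for xi, y in zip(inverses, ys)]
inverse_lip = max(abs(inv_res[j + 1] - inv_res[j]) / (ys[j + 1] - ys[j])
                  for j in range(n))

print(lo, hi, round_trip, inverse_lip)
```

The same fixed-point iteration works in any dimension whenever the residual is a contraction, which is one way to see why near-identity maps are automatically invertible.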
Part 1: The triangle inequality and the Lipschitz property give
Similarly,
Part 2: For , the inequality of Part 1 shows that is invertible. Together with the Lipschitz property, this also shows that
which, since , gives
.
Part 3:
From the definition of
,
We can write, for any with ,
Hence, . Since is a linear functional, this also shows that it is Lipschitz:
(of Theorem 4) From the projection theorem,
Fix . To analyze the effect of changing the function on
by applying the chain rule for Fréchet derivatives, we trace the effect of changing
on by describing as the result of the composition of a sequence of functionals, which map functions to functions. In particular, we write where for , for , and . Now, using the chain rule for Fréchet derivatives, where is the evaluation functional, . From the definition of the Fréchet derivative, always satisfies
(7) 
The definition of the Fréchet derivative also implies that is linear, as is the functional defined by . If and were unequal, progressively scaling down an input on which they differ would scale down the difference by the same amount, contradicting (7). Thus , which in turn implies
For all , since is Lipschitz, Lemma 4 implies is invertible. The lemma also implies that, for all , is Lipschitz, and hence that is also invertible. Because these inverses exist, we can define
where we pick the scalar so that . This choice ensures that
(8) 
and hence Since , for all there is a with . Define Then, using the definition of the induced norm and Equation (8), we have
Recalling that all and are Lipschitz, we can apply Lemma 4:
Taking the limit as implies the result.
5 Bad critical points for sigmoid residual nets
Theorem 4 may be paraphrased to say that residual nets cannot have any bad critical points in the near-identity region, when we consider Fréchet derivatives. In this section, we show that when we consider gradients with respect to the parameters of a fixed-size residual network with sigmoid activation functions, the corresponding statement is not true.
For a depth , width and size , the residual network with parameters computes the function , where each layer is defined by , with and , and we define of a vector as the componentwise application of .
To gain an intuitive understanding of the existence of suboptimal critical points, consider the following two properties of networks with nonlinearities. First, there are finitely many simple transformations (such as permutations of hidden units, or negation of the input and output parameters of a unit) that leave the network function unchanged. Second, apart from these transformations, two networks with different parameter values compute different functions. (This was shown for generic parameter values and networks of arbitrary depth by Fefferman (1994), and improved by Albertini and Sontag (1992) for the special case of two-layer networks.) Then for any globally optimal parameter value, there is a simple transformation that is also globally optimal. Consider a path between these two parameter values that minimizes the maximum value of the criterion along the path. (It is not hard to construct a scenario in which such a minimax path exists.) The maximizer must be a suboptimal critical point. The proof we give of the following theorem is more direct, relying on specific properties of the parameterization, but we should expect a similar result to apply to networks with other nonlinearities and parameterizations, provided functions have multiple isolated distinct representations as in the case of networks.
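The two symmetries named in this argument are easy to verify numerically. The sketch below assumes a single residual layer of the form `h(x) = x + A act(Bx)` with an odd activation (`tanh` here, chosen because the sign-flip symmetry needs `act(-z) = -act(z)`). The layer form, the sizes, and the helper names are assumptions made for illustration.

```python
import math
import random

def layer(A, B, act=math.tanh):
    """Return h(x) = x + A act(Bx) (assumed layer form)."""
    def h(x):
        hidden = [act(sum(bij * xj for bij, xj in zip(row, x))) for row in B]
        return [xi + sum(aij * hj for aij, hj in zip(row, hidden))
                for xi, row in zip(x, A)]
    return h

def permute_hidden(A, B, perm):
    """Reorder hidden units: permute the columns of A and the rows of B."""
    return [[row[p] for p in perm] for row in A], [B[p] for p in perm]

def negate_hidden(A, B, unit):
    """Flip the sign of one unit's input weights and output weights."""
    A2 = [[-v if j == unit else v for j, v in enumerate(row)] for row in A]
    B2 = [[-v for v in row] if i == unit else list(row)
          for i, row in enumerate(B)]
    return A2, B2

rng = random.Random(1)
d, k = 2, 4                      # hypothetical input and hidden widths
A = [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(d)]
B = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(k)]

h = layer(A, B)
h_perm = layer(*permute_hidden(A, B, [2, 0, 3, 1]))
h_neg = layer(*negate_hidden(A, B, unit=1))

points = [[rng.uniform(-2.0, 2.0) for _ in range(d)] for _ in range(50)]
gap_perm = max(abs(u - v) for x in points for u, v in zip(h(x), h_perm(x)))
gap_neg = max(abs(u - v) for x in points for u, v in zip(h(x), h_neg(x)))
print(gap_perm, gap_neg)
```

Both transformed parameter vectors differ from the original yet compute the same function up to rounding, so a global optimum always comes with distinct, equally good copies in parameter space.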
The proof leverages the fact that, while Theorem 4 rules out the possibility of bad critical points arising from interactions between the layers , they may still arise due to the dynamics of training an individual .
For any