Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

We show that any smooth bi-Lipschitz h can be represented exactly as a composition h_m ∘ ... ∘ h_1 of functions h_1,...,h_m that are close to the identity in the sense that each (h_i-Id) is Lipschitz, and the Lipschitz constant decreases inversely with the number m of functions composed. This implies that h can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fréchet derivatives with respect to the h_1,...,h_m, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region.

1 Introduction

The winner of the ILSVRC 2015 classification competition used a new architecture called residual networks (He et al., 2016), which enabled very fast training of very deep networks. These have since been widely adopted. (As of this writing, the paper that introduced this technique, published in 2015, has over 3700 citations.) Deep networks express models as the composition of transformations; residual networks depart from traditional deep learning models by using parameters to describe how each transformation differs from the identity, rather than how it differs from zero.

Motivated by this methodological advance, Hardt and Ma (2017) recently considered compositions of many linear maps, each close to the identity map. They showed that any matrix with spectral norm and condition number bounded by constants can be represented as a product of $m$ matrices of the form $I + A_i$, where each $A_i$ has spectral norm $O(1/m)$. They considered this non-convex parameterization for a linear regression problem with additive Gaussian noise, and showed that any critical point of the quadratic loss for which the $A_i$ have sufficiently small spectral norm must correspond to the linear transformation that generated the data. This raised the possibility that gradient descent with each layer initialized to the identity might provably converge for this non-convex optimization problem; Bartlett et al. (2018) investigated this, identifying sets of problems where this method converges, and where it does not.

In this paper, we continue this line of research. First, we identify a non-linear counterpart of Hardt and Ma's results motivated by deep residual networks: any smooth bi-Lipschitz $h$ (that is, an invertible Lipschitz map with differentiable inverse) can be represented exactly as a composition of functions $h_1, \ldots, h_m$ that are close to the identity in the sense that each $h_i - \mathrm{Id}$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. Since a two-layer neural network with standard activation functions can approximate arbitrary continuous functions, we can represent each $h_i$ in the composition as the identity plus a function computed by a two-layer network. The fact that $h_i - \mathrm{Id}$ has a small Lipschitz constant for deep networks shows that this network function is small, in the sense that it only needs to approximate a slowly changing function.

The requirement in our analysis that $h$ is bi-Lipschitz generalizes the assumption in the linear case studied by Hardt and Ma that the map to be learned has a bounded condition number. The practical strength, and therefore relevance, of invertible feature maps in the non-linear case is supported by the success of reversible networks (Maclaurin et al., 2015; Gomez et al., 2017; Dinh et al., 2017).

For our second result, we consider a nonlinear regression problem using a composition of near-identity nonlinear maps. If we consider Fréchet derivatives with respect to the functions $h_i$ in the composition, we show that any critical point of the quadratic criterion must be the global optimum. In contrast, if each $h_i - \mathrm{Id}$ is computed by a two-layer sigmoidal network, analogously to the architecture of He et al. (2016), and we consider derivatives with respect to the real-valued parameters, there are regression problems that give rise to suboptimal critical points. We discuss the implications of this analysis in Section 6.

A number of authors have investigated how using a deep architecture affects the set of functions computed by a network (see Montufar et al., 2014; Telgarsky, 2015; Poole et al., 2016; Mhaskar et al., 2016). Our main results abstract away the parameterization and focus on the expressiveness of compositions of near-identity functions, along with properties of their error landscapes.

2 Notation and Definitions

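Let $\mathrm{Id}$ denote the identity map on $\mathbb{R}^d$, $\mathrm{Id}(x) = x$. Throughout, $\|\cdot\|$ denotes a norm on $\mathbb{R}^d$. We also use $\|\cdot\|$ to denote an induced norm: for a function $f: U \to V$, where $U$ and $V$ are normed spaces with norms $\|\cdot\|_U$ and $\|\cdot\|_V$, we write

$$\|f\| := \sup\left\{ \frac{\|f(x)\|_V}{\|x\|_U} : x \in U,\ \|x\|_U > 0 \right\}.$$

Define the Lipschitz seminorm of $f$ as

$$\|f\|_L := \sup\left\{ \frac{\|f(x) - f(y)\|_V}{\|x - y\|_U} : x, y \in U,\ x \ne y \right\}.$$

Define the ball of radius $r$ in a normed space $(U, \|\cdot\|)$ as $B_r(U) = \{x \in U : \|x\| \le r\}$. For a function $f: \mathbb{R}^d \to \mathbb{R}^d$, $Df$ denotes the Jacobian matrix, that is, the matrix with entries $(Df(x))_{i,j} = \partial f_i(x) / \partial x_j$.

For a functional $F: U \to V$ defined on Banach spaces $U$ and $V$, recall that the Fréchet derivative of $F$ at $f \in U$ is the linear operator $DF(f): U \to V$ satisfying $F(f + \Delta) = F(f) + DF(f)(\Delta) + o(\|\Delta\|)$, that is,

$$\lim_{\Delta \to 0} \frac{\|F(f + \Delta) - F(f) - DF(f)(\Delta)\|_V}{\|\Delta\|_U} = 0.$$

We use $D_{h_i}$ to denote the Fréchet derivative with respect to $h_i$.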
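The Lipschitz seminorm defined above is a supremum over all pairs of points, so sampling can only bound it from below. The following minimal sketch illustrates that idea in Python; the map `near_identity`, the sample size, and the helper names are illustrative assumptions, not objects from the paper.

```python
import math
import random


def lipschitz_lower_bound(f, points):
    """Largest observed ratio ||f(x) - f(y)|| / ||x - y|| over sampled pairs.

    This only bounds the Lipschitz seminorm from below, because the
    supremum in the definition runs over all pairs, not just sampled ones.
    """
    best = 0.0
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            dist = math.dist(x, y)
            if dist > 0.0:
                best = max(best, math.dist(f(x), f(y)) / dist)
    return best


def residual(f):
    """Return f - Id, the part of f that differs from the identity."""
    return lambda x: tuple(fx - xi for fx, xi in zip(f(x), x))


def near_identity(x):
    """Hypothetical near-identity map on R^2: a small smooth perturbation of Id."""
    return (x[0] + 0.1 * math.sin(x[1]), x[1] + 0.1 * math.cos(x[0]))


rng = random.Random(0)
samples = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
print(lipschitz_lower_bound(near_identity, samples))            # between 0.9 and 1.1
print(lipschitz_lower_bound(residual(near_identity), samples))  # at most 0.1
```

The second estimate is the quantity that the near-identity results control: the map itself has Lipschitz constant near 1, while its residual has a small one.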

3 Representation

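For $R > 0$, denote $\mathcal{X} = B_R(\mathbb{R}^d)$. Let $h: \mathbb{R}^d \to \mathbb{R}^d$ be a differentiable, invertible map satisfying the following properties: (a) Smoothness: for some $\alpha > 0$ and all $x, y, u$,

$$\|(Dh(y) - Dh(x))u\| \le \alpha \|y - x\| \|u\|; \tag{1}$$

(b) Lipschitz inverse: for some $M > 0$, $\|h^{-1}\|_L \le M$; (c) Positive orientation: for some $x_0 \in \mathcal{X}$, $\det(Dh(x_0)) > 0$.

Then for all $m$, there are $m$ functions $h_1, \ldots, h_m: \mathbb{R}^d \to \mathbb{R}^d$ satisfying, for all $i$, $\|h_i - \mathrm{Id}\|_L = O(\log m / m)$ and, on $\mathcal{X}$, $h_m \circ \cdots \circ h_1 = h$.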

Think of the functions $h_i$ as near-identity maps that might be computed as $h_i(x) = x + A\sigma(Bx + b)$, where $A$ and $B$ are matrices, $\sigma$ is a nonconstant nonlinearity, such as a sigmoidal function or piecewise-linear function, applied component-wise, and $b$ is a vector. Although the proof constructs the $h_i$ as differentiable (and even smooth) maps, each could, for example, be approximated to arbitrary accuracy on the compact $\mathcal{X}$ using a single layer of ReLUs (for which $\sigma(z) = \max\{0, z\}$). (See, for example, Theorem 1 in (Hornik, 1991), and the comments in Section 3 of that paper about immediate generalizations to unbounded nonlinearities.) In that case, the conclusion of the theorem implies that the residual $h_i - \mathrm{Id}$ can be $O(\log m / m)$-Lipschitz.
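A residual layer of the kind described above can be sketched in a few lines. The example below assumes the form "identity plus a single hidden ReLU layer"; the names `A`, `B`, `b`, the sizes, and the weight scale are hypothetical choices for illustration. It also computes a crude upper bound on the Lipschitz constant of the residual part, using the facts that ReLU is 1-Lipschitz and that the spectral norm is at most the Frobenius norm.

```python
import math
import random


def relu(z):
    return [max(0.0, zi) for zi in z]


def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def frobenius(M):
    return math.sqrt(sum(mij * mij for row in M for mij in row))


def residual_layer(A, B, b):
    """Return the map x -> x + A relu(Bx + b)."""
    def layer(x):
        hidden = relu([zi + bi for zi, bi in zip(matvec(B, x), b)])
        return [xi + ri for xi, ri in zip(x, matvec(A, hidden))]
    return layer


# Hypothetical small weights: d = 2 inputs, k = 3 hidden units.
rng = random.Random(1)
scale = 0.05
A = [[scale * rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
B = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b = [rng.uniform(-1, 1) for _ in range(3)]
layer = residual_layer(A, B, b)

# Upper bound on the Euclidean Lipschitz constant of x -> A relu(Bx + b).
bound = frobenius(A) * frobenius(B)
print(bound)
```

Shrinking `scale` shrinks the bound proportionally, which is one simple way to keep a layer in the near-identity region. Adding a constant to the layer output leaves the bound unchanged.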

Notice that the conclusion of the theorem does not require the function $h_i - \mathrm{Id}$ to be small; shifting $h_i$ by an arbitrary constant does not affect the Lipschitz property.

The constants hidden in the big-oh notation in the theorem are polynomial in , , , , and . (Here, and

denote the smallest and largest singular values.)

The condition $\det(Dh(x_0)) > 0$ is an unavoidable topological constraint that arises because of the orientation of the identity map. As Hardt and Ma (2017) argue in the linear context, if we view $h$ as a mapping from raw representations to meaningful features, we can easily set the orientation of $h$ appropriately (that is, so that $\det(Dh(x_0)) > 0$) without compromising the mapping's usefulness.

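To prove Theorem 3, we prove the following special case.

Consider an $h$ that satisfies the conditions of Theorem 3 and also $h(0) = 0$ and $Dh(0) = I$. Then for all $m$, there are $m$ functions $h_1, \ldots, h_m$ satisfying, for all $x \in \mathcal{X}$,

$$h_m \circ h_{m-1} \circ \cdots \circ h_1(x) = h(x)$$

and, on $h_{i-1} \circ \cdots \circ h_1(\mathcal{X})$, $\|h_i - \mathrm{Id}\|_L \le \epsilon$, provided $\epsilon \ge C \log m / m$, where the constant $C$ depends on $\alpha$, $M$, and $R$.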

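To see that Theorem 3 is a corollary of this special case, notice that we can write $h(x) = h(x_0) + Dh(x_0)\,\tilde{h}(x - x_0)$, where $\tilde{h}(x) = Dh(x_0)^{-1}\left(h(x + x_0) - h(x_0)\right)$. Since $\tilde{h}$ satisfies $\tilde{h}(0) = 0$ and $D\tilde{h}(0) = I$, the special case shows that it can be expressed as a composition of near-identity maps. Furthermore, the translations $t_1(x) = x - x_0$ and $t_2(x) = x + h(x_0)$ have the property that $t_j - \mathrm{Id}$ is $0$-Lipschitz. Finally, Theorem 2.1 in Hardt and Ma (2017) shows that we can decompose the Jacobian matrix $Dh(x_0) = (I + A_m) \cdots (I + A_1)$ with $\|A_i\| = O(1/m)$ for $i = 1, \ldots, m$, and this implies that the linear map $x \mapsto A_i x$ is $O(1/m)$-Lipschitz (see Lemma 4, Part 3 below).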

Before proving Theorem 3, we observe that the smoothness property implies a bound on the accuracy of a linear approximation, and a Lipschitz bound. The proof is in Appendix A. For $h$ satisfying the conditions of Theorem 3 and any $x, y \in \mathcal{X}$,

$$\|h(y) - (h(x) + Dh(x)(y - x))\| \le \frac{\alpha}{2} \|y - x\|^2,$$

and, on $\mathcal{X}$, $\|h\|_L \le \|Dh(0)\| + \alpha R$.

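Proof (of the special case of Theorem 3). We give an explicit construction of the $h_i$. For $i = 1, \ldots, m$, define $g_i$ by $g_i(x) = h(a_i x)/a_i$, where the constants $0 < a_1 < \cdots < a_m = 1$ will be chosen later. The $g_i$'s can be viewed as functions that interpolate between the identity (which is $Dh(0)$, the limit as $a$ approaches zero of $h(ax)/a$) and $h$ (which is $g_m$, because $a_m = 1$). Note that $g_i$ is invertible on $\mathcal{X}$, with $g_i^{-1}(y) = h^{-1}(a_i y)/a_i$ for $y \in g_i(\mathcal{X})$. Define $h_1 = g_1$ and, for $1 < i \le m$, define $h_i$ by $h_i = g_i \circ g_{i-1}^{-1}$, so that $h_i \circ \cdots \circ h_1 = g_i$ and in particular $h_m \circ \cdots \circ h_1 = g_m = h$. It remains to show that, for a suitable choice of $a_1, \ldots, a_m$, the $h_i$ satisfy the Lipschitz condition.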
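The telescoping construction above can be checked numerically in one dimension. The sketch below assumes the target map $h = \sinh$ (chosen only because $h(0) = 0$, $h'(0) = 1$, and the inverse is available in closed form) and a hypothetical geometric schedule of scales ending at $a_m = 1$; neither choice comes from the paper.

```python
import math


def h(x):
    """Illustrative target map with h(0) = 0 and h'(0) = 1."""
    return math.sinh(x)


def h_inv(y):
    return math.asinh(y)


def g(a, x):
    """Rescaled map g_a(x) = h(a x) / a."""
    return h(a * x) / a


def g_inv(a, y):
    """Inverse of g_a: solves h(a x) / a = y for x."""
    return h_inv(a * y) / a


def make_layers(scales):
    """Build h_1 = g_1 and h_i = g_i o g_{i-1}^{-1} from increasing scales."""
    layers = [lambda x, a=scales[0]: g(a, x)]
    for a_prev, a in zip(scales, scales[1:]):
        layers.append(lambda y, a_prev=a_prev, a=a: g(a, g_inv(a_prev, y)))
    return layers


def compose(layers, x):
    """Apply h_1 first and h_m last."""
    for layer in layers:
        x = layer(x)
    return x


m = 10
c = 0.3  # hypothetical ratio parameter: a_{i-1} = (1 - c) * a_i
scales = [(1 - c) ** (m - i) for i in range(1, m + 1)]  # last scale is 1
layers = make_layers(scales)
for x in (-1.0, -0.25, 0.5, 1.0):
    print(x, compose(layers, x), h(x))
```

The composition agrees with `h` up to floating-point error for any increasing schedule, because each inverse cancels the preceding map. The schedule only affects how close each individual layer is to the identity.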

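We have

$$\begin{aligned}
\|h_1(x) - x - (h_1(y) - y)\|
&= \frac{1}{a_1} \|h(a_1 x) - a_1 x - (h(a_1 y) - a_1 y)\| \\
&= \frac{1}{a_1} \|Dh(a_1 y)(a_1 x - a_1 y) - (a_1 x - a_1 y) + (h(a_1 x) - (h(a_1 y) + Dh(a_1 y)(a_1 x - a_1 y)))\| \\
&= \frac{1}{a_1} \|(Dh(a_1 y) - Dh(0))(a_1 x - a_1 y) + (h(a_1 x) - (h(a_1 y) + Dh(a_1 y)(a_1 x - a_1 y)))\| \\
&\le a_1 \alpha \left( \|y\| \|x - y\| + \tfrac{1}{2} \|x - y\|^2 \right) \quad \text{(by (1) and Lemma 3)} \\
&\le 2 R a_1 \alpha \|x - y\|.
\end{aligned}$$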

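Now, fix $i > 1$ and $x, y$, and set $u = g_{i-1}^{-1}(x)$ and $v = g_{i-1}^{-1}(y)$. Then we have

$$\begin{aligned}
\|h_i(y) - y - (h_i(x) - x)\|
&= \|g_i(v) - g_{i-1}(v) - (g_i(u) - g_{i-1}(u))\| \\
&= \frac{1}{a_i} \left\| h(a_i v) - \frac{a_i}{a_{i-1}} h(a_{i-1} v) - \left( h(a_i u) - \frac{a_i}{a_{i-1}} h(a_{i-1} u) \right) \right\|.
\end{aligned} \tag{2}$$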

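We consider two cases: when $x$ and $y$ are close, and when they are distant. First, suppose that $\|y - x\| \le a_i - a_{i-1}$.

$$\begin{aligned}
\|h_i(y) - y - (h_i(x) - x)\|
&= \frac{1}{a_i} \left\| h(a_i v) - h(a_i u) - \frac{a_i}{a_{i-1}} \left( h(a_{i-1} v) - h(a_{i-1} u) \right) \right\| \\
&= \frac{1}{a_i} \Big\| a_i Dh(a_i u)(v - u) + h(a_i v) - \left( h(a_i u) + a_i Dh(a_i u)(v - u) \right) \\
&\qquad\quad - \frac{a_i}{a_{i-1}} \Big( a_{i-1} Dh(a_{i-1} u)(v - u) + h(a_{i-1} v) - \left( h(a_{i-1} u) + a_{i-1} Dh(a_{i-1} u)(v - u) \right) \Big) \Big\| \\
&\le \|(Dh(a_i u) - Dh(a_{i-1} u))(v - u)\| + \frac{\alpha}{2} (a_i + a_{i-1}) \|v - u\|^2 \quad \text{(applying Lemma 3 twice)} \\
&\le \alpha (a_i - a_{i-1}) \|u\| \|v - u\| + \frac{\alpha}{2} (a_i + a_{i-1}) \|v - u\|^2 \quad \text{(by (1))} \\
&\le \alpha \left( R (a_i - a_{i-1}) + \tfrac{1}{2} (a_i + a_{i-1}) \|v - u\| \right) \|v - u\|.
\end{aligned}$$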

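Also, we can relate $\|v - u\|$ to $\|y - x\|$ via the Lipschitz property of $h^{-1}$:

$$a_{i-1} \|v - u\| = \|h^{-1}(h(a_{i-1} v)) - h^{-1}(h(a_{i-1} u))\| \le M \|h(a_{i-1} v) - h(a_{i-1} u)\|,$$

so

$$\|y - x\| = \frac{1}{a_{i-1}} \|h(a_{i-1} v) - h(a_{i-1} u)\| \ge \frac{1}{M} \|v - u\|. \tag{3}$$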

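Combining, and using the assumption $\|y - x\| \le a_i - a_{i-1}$,

$$\begin{aligned}
\|h_i(y) - y - (h_i(x) - x)\|
&\le \alpha M \left( R (a_i - a_{i-1}) + \tfrac{1}{2} (a_i + a_{i-1}) M \|y - x\| \right) \|y - x\| \\
&\le (a_i - a_{i-1}) \alpha M (R + M) \|y - x\|.
\end{aligned} \tag{4}$$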

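Now suppose that $\|y - x\| > a_i - a_{i-1}$. From (2), we have

$$\begin{aligned}
\|h_i(y) - y - (h_i(x) - x)\|
&= \frac{1}{a_i} \left\| h(a_i v) - \frac{a_i}{a_{i-1}} h(a_{i-1} v) - \left( h(a_i u) - \frac{a_i}{a_{i-1}} h(a_{i-1} u) \right) \right\| \\
&= \frac{1}{a_i} \left\| \frac{a_{i-1} - a_i}{a_{i-1}} \left( h(a_{i-1} v) - h(a_{i-1} u) \right) + h(a_i v) - h(a_{i-1} v) - \left( h(a_i u) - h(a_{i-1} u) \right) \right\| \\
&= \frac{1}{a_i} \Big\| \frac{a_{i-1} - a_i}{a_{i-1}} \left( h(a_{i-1} v) - h(a_{i-1} u) \right) \\
&\qquad\quad + h(a_i v) - \left( h(a_{i-1} v) + Dh(a_{i-1} v)(a_i v - a_{i-1} v) \right) \\
&\qquad\quad - \left( h(a_i u) - \left( h(a_{i-1} u) + Dh(a_{i-1} u)(a_i u - a_{i-1} u) \right) \right) \\
&\qquad\quad + Dh(a_{i-1} v)(a_i v - a_{i-1} v) - Dh(a_{i-1} u)(a_i u - a_{i-1} u) \Big\| \\
&\le \frac{a_i - a_{i-1}}{a_i} L \|v - u\| + \frac{1}{a_i} \|Dh(a_{i-1} v)(a_i v - a_{i-1} v) - Dh(a_{i-1} u)(a_i u - a_{i-1} u)\| \\
&\qquad + \frac{1}{a_i} \frac{\alpha}{2} (a_i - a_{i-1})^2 \left( \|v\|^2 + \|u\|^2 \right)
\end{aligned} \tag{6}$$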

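where, in the first term, we have used the Lipschitz property from Lemma 3, with $L = 1 + \alpha R$ (recall that $Dh(0) = I$). But

$$\begin{aligned}
\frac{1}{a_i} \|Dh(a_{i-1} v)(a_i v - a_{i-1} v) - Dh(a_{i-1} u)(a_i u - a_{i-1} u)\|
&= \frac{a_i - a_{i-1}}{a_i} \|Dh(a_{i-1} v)(v) - Dh(a_{i-1} u)(u)\| \\
&= \frac{a_i - a_{i-1}}{a_i} \|v - u + (Dh(a_{i-1} u) - Dh(0))(v - u) + (Dh(a_{i-1} v) - Dh(a_{i-1} u)) v\| \\
&\le \frac{a_i - a_{i-1}}{a_i} \left( 1 + \alpha a_{i-1} (\|u\| + \|v\|) \right) \|v - u\|,
\end{aligned}$$

by (1). Substituting into (6), and using (3) together with the assumption that $\|y - x\| > a_i - a_{i-1}$,

$$\begin{aligned}
\|h_i(y) - y - (h_i(x) - x)\|
&\le \frac{a_i - a_{i-1}}{a_i} \left( L M + M \left( 1 + a_{i-1} \alpha (\|u\| + \|v\|) \right) + \frac{\alpha}{2} \left( \|v\|^2 + \|u\|^2 \right) \right) \|y - x\| \\
&\le \frac{a_i - a_{i-1}}{a_i} \left( M (L + 1 + 2 R \alpha) + \alpha R^2 \right) \|y - x\|.
\end{aligned}$$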

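Combining with (4), it suffices to choose $a_1$ to satisfy $2 R a_1 \alpha \le \epsilon$ and $a_2, \ldots, a_m$ to satisfy, for $1 < i \le m$, $(a_i - a_{i-1}) B / a_i \le \epsilon$, where

$$B = \max\left\{ \alpha M (R + M),\ M (L + 1 + 2 R \alpha) + \alpha R^2 \right\}.$$

If we choose $c \in (0, 1)$ and set $a_i = (1 - c)^{m - i}$ for $i = 1, \ldots, m$, then these conditions are equivalent to $c \le \epsilon / B$ and

$$(1 - c)^{m - 1} \le \frac{\epsilon}{2 \alpha R} \iff 1 - \left( \frac{\epsilon}{2 \alpha R} \right)^{1/(m-1)} \le c.$$

Thus, it suffices if

$$\frac{\epsilon}{B} \ge 1 - \left( \frac{\epsilon}{2 \alpha R} \right)^{1/(m-1)}
\Longleftarrow \epsilon \ge \frac{B}{m - 1} \ln \frac{2 \alpha R}{\epsilon}
\Longleftarrow \epsilon \ge \frac{B}{m - 1} \max\left\{ 1, \ln \frac{2 \alpha R m}{B} \right\}
\Longleftarrow \epsilon \ge \frac{B \ln 2m}{m - 1},$$

using the inequality $1 - e^{-x} \le x$, which follows from convexity of $e^{-x}$.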
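To see how the per-layer Lipschitz constants behave as the depth grows, the following sketch estimates them by finite differences for a one-dimensional example. It assumes $h = \sinh$ on $[-R, R]$ with $R = 1$, a geometric schedule for the scales, and the ratio parameter $c = \ln(2m)/(m-1)$; that value of `c` is an assumption chosen to mirror the rate in the final bound, not a constant taken from the proof.

```python
import math

R = 1.0  # radius of the one-dimensional domain [-R, R]


def g(a, x):
    """Rescaled map g_a(x) = sinh(a x) / a."""
    return math.sinh(a * x) / a


def residual_lipschitz(a_prev, a, grid=2000):
    """Finite-difference estimate of the Lipschitz constant of h_i - Id.

    Here h_i = g_a o g_{a_prev}^{-1}, evaluated on the image of [-R, R]
    under g_{a_prev}. Passing a_prev=None handles the first layer h_1 = g_a.
    """
    vs = [-R + 2 * R * j / grid for j in range(grid + 1)]
    ys = vs if a_prev is None else [g(a_prev, v) for v in vs]  # inputs of h_i
    rs = [g(a, v) - y for v, y in zip(vs, ys)]                 # h_i(y) - y
    return max(abs(r1 - r0) / (y1 - y0)
               for r0, r1, y0, y1 in zip(rs, rs[1:], ys, ys[1:]))


def max_residual_lipschitz(m):
    """Worst estimated residual Lipschitz constant over the m layers."""
    c = math.log(2 * m) / (m - 1)  # assumed choice; needs c < 1, so m >= 3
    scales = [(1 - c) ** (m - i) for i in range(1, m + 1)]
    worst = residual_lipschitz(None, scales[0])
    for a_prev, a in zip(scales, scales[1:]):
        worst = max(worst, residual_lipschitz(a_prev, a))
    return worst


eps = {m: max_residual_lipschitz(m) for m in (8, 16, 32, 64)}
for m, value in eps.items():
    print(m, value)
```

In this example the estimate shrinks as `m` grows, in line with the $\ln(2m)/(m-1)$ rate. The finite-difference maximum is only a lower estimate of the true Lipschitz constant.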

4 Zero Fréchet derivatives with deep compositions

The following theorem is the main result of this section. It shows that if a composition of near-identity maps has zero Fréchet derivatives of a quadratic criterion with respect to the functions in the composition, then the composition minimizes that criterion. That is, all critical points of this kind are global minimizers; there are no saddle points or suboptimal local minimizers in the near-identity region.

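Consider a joint distribution $P$ on pairs $(X, Y)$, and define the criterion

$$Q(h) = \frac{1}{2} \mathbb{E}_{(X,Y) \sim P} \|h(X) - Y\|_2^2.$$

Define a conditional expectation $h^*(x) = \mathbb{E}[Y \mid X = x]$, so that $h^*$ minimizes $Q$. Consider the function $h = h_m \circ \cdots \circ h_1$ computed by an $m$-layer network, and suppose that, for some $0 < \epsilon < 1$ and all $i$, $h_i$ is differentiable, $h_i(0) = 0$, and $\|h_i - \mathrm{Id}\|_L \le \epsilon$. Suppose that $\|h - h^*\| < \infty$. Then for all $i$,

$$\inf_{\Delta \in B_1} D_{h_i} Q(h)(\Delta) \le - \frac{(1 - \epsilon)^{m-1}}{\|h - h^*\|} \left( Q(h) - Q(h^*) \right).$$

Thus, if $h$ is a critical point of $Q$, that is, for all $i$, $D_{h_i} Q(h) = 0$, we must have $Q(h) = Q(h^*)$.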

The theorem defines the expected quadratic loss under an arbitrary joint distribution, but in particular it could be a discrete distribution that is uniform on a training set.

Notice that if $h^*$ satisfies the properties of Theorem 3, then it can be represented as a composition of $h_i$ with the required properties. If it cannot, then the theorem shows that the near-identity region will not contain critical points. The only property we require of $h^*$ is the boundedness condition $\|h - h^*\| < \infty$. From the definition of the induced norm, this implicitly assumes that $h^*(0) = 0$ and that $h^*$ is differentiable at $0$. In the context of learning embeddings, it seems reasonable to fix the embedding's value at one input vector, and express its value elsewhere relative to that value.

Notice also that, although the theorem requires differentiability of the $h_i$, it is only important for various derivatives to be defined. In particular, a network with non-differentiable but Lipschitz activation functions, like a ReLU network, could be approximated to arbitrary accuracy by replacing the ReLU nonlinearity with a differentiable one. The conclusions of the theorem apply to any critical point at a differentiable approximation of the ReLU network.

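Suppose $\|f - \mathrm{Id}\|_L \le \alpha < 1$.

1. $(1 - \alpha) \|x - y\| \le \|f(x) - f(y)\| \le (1 + \alpha) \|x - y\|$.

2. $f$ is invertible and $\|f^{-1} - \mathrm{Id}\|_L \le \alpha / (1 - \alpha)$.

3. For $F(g) = f \circ g$, $\|DF(g) - \mathrm{Id}\| \le \alpha$, and hence $\|DF(g) - \mathrm{Id}\|_L \le \alpha$.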

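Part 1: The triangle inequality and the Lipschitz property give

$$\|x - y\| \le \|f(x) - f(y)\| + \|f(x) - x - (f(y) - y)\| \le \|f(x) - f(y)\| + \alpha \|x - y\|.$$

Similarly,

$$\|f(x) - f(y)\| \le \|x - y\| + \|f(x) - x - (f(y) - y)\| \le (1 + \alpha) \|x - y\|.$$

Part 2: For $\alpha < 1$, the inequality of Part 1 shows that $f$ is invertible. Together with the Lipschitz property, this also shows that

$$\|x - y - (f(x) - f(y))\| \le \alpha \|x - y\| \le \frac{\alpha}{1 - \alpha} \|f(x) - f(y)\|,$$

which, since $x = f^{-1}(f(x))$ and $y = f^{-1}(f(y))$, gives $\|f^{-1} - \mathrm{Id}\|_L \le \alpha / (1 - \alpha)$.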
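Part 3: From the definition of $F$, $F(g) = f \circ g$ and $F(g + \Delta) = f \circ (g + \Delta)$. We can write, for any $\Delta$,

$$\begin{aligned}
\|\Delta - DF(g)(\Delta)\|
&= \|\Delta + F(g + \Delta) - F(g + \Delta) + F(g) - F(g) + g - g - DF(g)(\Delta)\| \\
&\le \|F(g + \Delta) - F(g) - DF(g)(\Delta)\| + \|f \circ (g + \Delta) - (g + \Delta) - (f \circ g - g)\| \\
&= o(\|\Delta\|) + \sup_x \frac{\|f \circ (g + \Delta)(x) - (g + \Delta)(x) - (f \circ g - g)(x)\|}{\|x\|} \\
&\le o(\|\Delta\|) + \alpha \sup_x \frac{\|\Delta(x)\|}{\|x\|} = o(\|\Delta\|) + \alpha \|\Delta\|.
\end{aligned}$$

Hence, $\|DF(g) - \mathrm{Id}\| \le \alpha$. Since $DF(g) - \mathrm{Id}$ is a linear functional, this also shows that it is $\alpha$-Lipschitz:

$$\|DF(g)(\Delta_1) - \Delta_1 - (DF(g)(\Delta_2) - \Delta_2)\| = \|DF(g)(\Delta_1 - \Delta_2) - (\Delta_1 - \Delta_2)\| \le \alpha \|\Delta_1 - \Delta_2\|.$$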
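Parts 1 and 2 of the lemma can be checked numerically for a simple scalar map. The sketch below assumes $f(x) = x + \alpha \sin x$ with $\alpha = 0.4$, so that $f - \mathrm{Id}$ is $\alpha$-Lipschitz. The inverse is computed by the fixed-point iteration $x \leftarrow y - (f(x) - x)$, which is a contraction with factor $\alpha$. The map and the iteration count are illustrative assumptions.

```python
import math

ALPHA = 0.4  # assumed Lipschitz constant of f - Id (must be below 1)


def f(x):
    """Hypothetical near-identity map: f - Id = ALPHA * sin."""
    return x + ALPHA * math.sin(x)


def f_inverse(y, iterations=200):
    """Invert f by iterating x <- y - (f(x) - x), a contraction with factor ALPHA."""
    x = y
    for _ in range(iterations):
        x = y - ALPHA * math.sin(x)
    return x


grid = [-3.0 + 0.05 * j for j in range(121)]
pairs = [(x, y) for i, x in enumerate(grid) for y in grid[i + 1:]]

# Part 1: difference quotients of f stay within [1 - ALPHA, 1 + ALPHA].
ratios = [abs(f(x) - f(y)) / abs(x - y) for x, y in pairs]
print(min(ratios), max(ratios))

# Part 2: f^{-1} - Id is Lipschitz with constant at most ALPHA / (1 - ALPHA).
inv = {x: f_inverse(x) for x in grid}
inv_ratios = [abs((inv[x] - x) - (inv[y] - y)) / abs(x - y) for x, y in pairs]
print(max(inv_ratios), ALPHA / (1 - ALPHA))
```

The same contraction argument is what makes each near-identity layer invertible in the proof that follows.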

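Proof (of Theorem 4). From the projection theorem,

$$Q(h) = \frac{1}{2} \mathbb{E}_{(X,Y) \sim P} \|h(X) - Y\|_2^2 = \frac{1}{2} \mathbb{E} \|h(X) - h^*(X)\|_2^2 + \frac{1}{2} \mathbb{E}_{(X,Y) \sim P} \|h^*(X) - Y\|_2^2.$$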
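The decomposition above can be verified directly when the distribution is uniform on a finite training set. In the sketch below, the six data points and the candidate predictor `h` are made up for illustration, and `h_star` is the empirical conditional mean of $Y$ given $X$.

```python
from collections import defaultdict

# Hypothetical 1-D training set; repeated inputs make E[Y | X = x] nontrivial.
data = [(0.0, 1.0), (0.0, 3.0), (1.0, 2.0), (1.0, 2.5), (1.0, 4.5), (2.0, -1.0)]


def conditional_mean(pairs):
    """Return x -> average of the y values observed with that x."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    means = {x: sum(ys) / len(ys) for x, ys in groups.items()}
    return lambda x: means[x]


def Q(h, pairs):
    """Quadratic criterion under the uniform distribution on the pairs."""
    return 0.5 * sum((h(x) - y) ** 2 for x, y in pairs) / len(pairs)


def h(x):
    """Arbitrary candidate predictor."""
    return 0.5 * x + 1.0


h_star = conditional_mean(data)

# Excess risk term: half the mean squared distance between h and h_star.
excess = 0.5 * sum((h(x) - h_star(x)) ** 2 for x, _ in data) / len(data)
print(Q(h, data), excess + Q(h_star, data))  # the two sides of the decomposition
```

The cross term vanishes because the residuals $Y - h^*(X)$ average to zero within each group of equal inputs, which is the finite-sample form of the projection argument.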

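Fix $i$. To analyze the effect of changing the function $h_i$ on $Q(h)$ by applying the chain rule for Fréchet derivatives, we trace the effect of changing $h_i$ on $h$ by describing $h$ as the result of the composition of a sequence of functionals, which map functions to functions. In particular, we write $h = H_m(H_{m-1}(\cdots H_{i+1}(G_i(h_i)) \cdots))$, where $G_i(g) = g \circ h_{i-1} \circ \cdots \circ h_1$, $H_j(g) = h_j \circ g$ for $j > i$, and $h_i^j = h_j \circ \cdots \circ h_1$ for $j \ge i$. Now, using the chain rule for Fréchet derivatives,

$$\begin{aligned}
D_{h_i} Q(h)
&= \mathbb{E}\left[ (h(X) - h^*(X)) \cdot \mathrm{ev}_X \circ D_{h_i} h \right] \\
&= \mathbb{E}\left[ (h(X) - h^*(X)) \cdot \mathrm{ev}_X \circ DH_m(h_i^{m-1}) \circ \cdots \circ DH_{i+1}(h_i^i) \circ DG_i(h_i) \right],
\end{aligned}$$

where $\mathrm{ev}_x$ is the evaluation functional, $\mathrm{ev}_x(f) = f(x)$. From the definition of the Fréchet derivative, $DG_i(g)$ always satisfies

$$0 = \lim_{\Delta \to 0} \frac{\|G_i(g + \Delta) - G_i(g) - DG_i(g)(\Delta)\|}{\|\Delta\|} = \lim_{\Delta \to 0} \frac{\|\Delta \circ h_{i-1} \circ \cdots \circ h_1 - DG_i(g)(\Delta)\|}{\|\Delta\|}. \tag{7}$$

The definition of the Fréchet derivative also implies that $DG_i(g)$ is linear, as is the functional defined by $\Delta \mapsto \Delta \circ h_{i-1} \circ \cdots \circ h_1$. If these two linear functionals were unequal, progressively scaling down an input on which they differ would scale down the difference by the same amount, contradicting (7). Thus $DG_i(g)(\Delta) = \Delta \circ h_{i-1} \circ \cdots \circ h_1$, which in turn implies

$$\begin{aligned}
D_{h_i} Q(h)(\Delta)
&= \mathbb{E}\left[ (h(X) - h^*(X)) \cdot \mathrm{ev}_X \circ DH_m(h_i^{m-1}) \circ \cdots \circ DH_{i+1}(h_i^i) \circ \Delta \circ h_{i-1} \circ \cdots \circ h_1 \right] \\
&= \mathbb{E}\left[ (h(X) - h^*(X)) \cdot DH_m(h_i^{m-1}) \circ \cdots \circ DH_{i+1}(h_i^i) \circ \Delta \circ h_{i-1} \circ \cdots \circ h_1 (X) \right].
\end{aligned}$$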

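For all $j < i$, since $h_j - \mathrm{Id}$ is $\epsilon$-Lipschitz, Lemma 4 implies $h_j$ is invertible. The lemma also implies that, for all $j > i$, $DH_j(h_i^{j-1}) - \mathrm{Id}$ is $\epsilon$-Lipschitz, and hence that $DH_j(h_i^{j-1})$ is also invertible. Because these inverses exist, we can define

$$\Delta = c \left( DH_m(h_i^{m-1}) \circ \cdots \circ DH_{i+1}(h_i^i) \right)^{-1} \circ (h^* - h) \circ (h_{i-1} \circ \cdots \circ h_1)^{-1},$$

where we pick the scalar $c > 0$ so that $\|\Delta\| = 1$. This choice ensures that

$$DH_m(h_i^{m-1}) \circ \cdots \circ DH_{i+1}(h_i^i) \circ \Delta \circ h_{i-1} \circ \cdots \circ h_1 = c (h^* - h), \tag{8}$$

and hence $D_{h_i} Q(h)(\Delta) = -c\, \mathbb{E} \|h(X) - h^*(X)\|_2^2 = -2c \left( Q(h) - Q(h^*) \right)$. Since $\|\Delta\| = 1$, for all $\gamma > 0$ there is a $y$ with $\|\Delta(y)\| \ge (1 - \gamma) \|y\|$. Define $x = (h_{i-1} \circ \cdots \circ h_1)^{-1}(y)$. Then, using the definition of the induced norm and Equation (8), we have

$$c \|h - h^*\| \ge c \frac{\|h(x) - h^*(x)\|}{\|x\|} = \frac{1}{\|x\|} \left\| DH_m(h_i^{m-1}) \circ \cdots \circ DH_{i+1}(h_i^i) \circ \Delta(y) \right\|.$$

Recalling that all $h_j - \mathrm{Id}$ and $DH_j(h_i^{j-1}) - \mathrm{Id}$ are $\epsilon$-Lipschitz, we can apply Lemma 4:

$$\begin{aligned}
c \|h - h^*\|
&\ge (1 - \epsilon)^{m-i} \frac{\|\Delta(y)\|}{\|x\|} \\
&\ge (1 - \epsilon)^{m-i} (1 - \gamma) \frac{\|y\|}{\|x\|} \\
&= (1 - \epsilon)^{m-i} (1 - \gamma) \frac{\|h_{i-1} \circ \cdots \circ h_1(x)\|}{\|x\|} \\
&\ge (1 - \epsilon)^{m-1} (1 - \gamma).
\end{aligned}$$

Taking the limit as $\gamma \to 0$ implies the result.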

5 Bad critical points for sigmoid residual nets

Theorem 4 may be paraphrased to say that residual nets cannot have any bad critical points in the near-identity region, when we consider Fréchet derivatives. In this section, we show that when we consider gradients with respect to the parameters of a fixed-size residual network with sigmoid activation functions, the corresponding statement is not true.

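For a depth $m$, width $d$ and size $k$, the residual network with parameters $\theta = (A_1, B_1, \ldots, A_m, B_m)$ computes the function $h_\theta = h_m \circ \cdots \circ h_1$, where each layer is defined by $h_i(x) = x + A_i \tanh(B_i x)$, with $A_i \in \mathbb{R}^{d \times k}$ and $B_i \in \mathbb{R}^{k \times d}$, and we define $\tanh$ of a vector as the component-wise application of $\tanh$.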

To gain an intuitive understanding of the existence of suboptimal critical points, consider the following two properties of networks with $\tanh$ nonlinearities. First, there are finitely many simple transformations (such as permutations of hidden units, or negation of the input and output parameters of a unit) that leave the network function unchanged. Second, apart from these transformations, two networks with different parameter values compute different functions. (This was shown for generic parameter values and networks of arbitrary depth by Fefferman (1994), and improved by Albertini and Sontag (1992) for the special case of two-layer networks.) Then for any globally optimal parameter value, there is a simple transformation that is also globally optimal. Consider a path between these two parameter values that minimizes the maximum value of the criterion along the path. (It is not hard to construct a scenario in which such a minimax path exists.) The maximizer must be a suboptimal critical point. The proof we give of the following theorem is more direct, relying on specific properties of the parameterization, but we should expect a similar result to apply to networks with other nonlinearities and parameterizations, provided functions have multiple isolated distinct representations as in the case of $\tanh$ networks.

The proof leverages the fact that, while Theorem 4 rules out the possibility of bad critical points arising from interactions between the layers $h_i$, they may still arise due to the dynamics of training an individual $h_i$.

For any