# Global Convergence of Sobolev Training for Overparametrized Neural Networks

Sobolev loss is used when training a network to approximate the values and derivatives of a target function at a prescribed set of input points. Recent works have demonstrated its successful applications in tasks such as distillation and synthetic gradient prediction. In this work we prove that an overparametrized two-layer ReLU neural network trained on the Sobolev loss with gradient flow from random initialization can fit any given function values and any given directional derivatives, under a separation condition on the input data.


## 1 Introduction

Deep neural networks are ubiquitous and have established state-of-the-art performance in a wide variety of applications and fields. These networks often have a large number of parameters, which are tuned via gradient descent (or its variants) on an empirical risk minimization task. In particular, in supervised learning it is often required that the output of the network fits certain values or labels that can be thought of as coming from an unknown target function. In many settings, though, additional prior information on the task or target function might be available, and enforcing it might be of interest. One such example is the case of higher-order derivatives of the unknown target function, which, as shown in [5], naturally arise in problems such as distillation, in which a large teacher network is used to train a more compact student network, or the prediction of synthetic gradients for training deep complex models. Therefore [5] proposed "Sobolev training" which, given training inputs $\{x_i\}_{i=1}^{n}$, attempts to minimize the following empirical risk:

$$L(W)=\sum_{i=1}^{n}\Big[\ell\big(f(W,x_i),f^*(x_i)\big)+\sum_{j=1}^{K}\ell_j\big(D^j_x f(W,x_i),\,D^j_x f^*(x_i)\big)\Big]\tag{1}$$

where $f(W,x)$ is a neural network with input $x$ and parameters $W$, $f^*$ denotes the target function, $\ell$ is a loss penalizing the deviation of the network outputs from those of $f^*$, and $\ell_j$ are loss functions penalizing the deviation of the $j$-th derivative of the network with respect to $x$ from the $j$-th derivative of the target $f^*$.

The empirical successes of Sobolev training have been demonstrated in a number of works. In [5] it was shown that Sobolev training leads to smaller generalization errors than standard training in tasks such as distillation and synthetic gradient prediction, especially in the low-data regime. Similar results were obtained for transfer learning via Jacobian matching in [14]. Earlier, Sobolev training was applied in [13] in order to enforce invariance to translations and small rotations. More recently, Sobolev training has been used in the context of anisotropic hyperelasticity in order to improve the predictions of the stress tensor (the derivative of the network with respect to the input deformation tensor) in [16]. Finally, the idea of Sobolev training is also tightly connected to other techniques that have recently been successfully employed, such as attention matching in student distillation [18, 5], or convex data augmentation for improving generalization and robustness [19].

On the theoretical side, justification for Sobolev training was given in [5], extending the classical work of Hornik [9] and establishing universal approximation properties of neural networks with ReLU activation function in Sobolev spaces. This result was further improved for deep networks in [7]. While these works motivate the use of the Sobolev loss (1), conditions under which it can be successfully minimized were not given. In particular, even though the networks used in Sobolev training are usually shallow, the resulting loss (1) is highly non-convex, and therefore the success of first-order methods is not a priori guaranteed.

In this paper we study a two-layer ReLU neural network trained with a Sobolev loss when at each input point the output value and a set of directional derivatives of the target function are given. Leveraging recent results on training with standard losses [2, 20, 1, 12], we show that if the network is sufficiently overparametrized, the weights are randomly initialized, and the data satisfy certain natural non-degeneracy assumptions, then gradient flow achieves a global minimum.

## 2 Main Result

We study the training of neural networks with "Directional Sobolev Training". In particular we assume we are given training data

$$\{x_i,\,y_i,\,V_i,\,h_i\}_{i=1}^{n},\tag{2}$$

where $x_i\in\mathbb{R}^d$, $y_i\in\mathbb{R}$, $h_i\in\mathbb{R}^k$, and $V_i\in\mathbb{R}^{d\times k}$ has orthonormal columns (unit Euclidean norm and pairwise orthogonal). This training data can be thought of as being generated by a differentiable function $f^*:\mathbb{R}^d\to\mathbb{R}$ according to

$$y_i=f^*(x_i),\quad\text{and}\quad h_i=V_i^T\nabla f^*(x_i)\quad\text{for } i=1,\dots,n,\tag{3}$$

so that each entry of the vector $h_i$ corresponds to a directional derivative of $f^*$ in the direction given by the corresponding column of the matrix $V_i$. We will denote by $y\in\mathbb{R}^n$ and $h\in\mathbb{R}^{nk}$ the vectors with entries $y_i$ and blocks $h_i$, respectively.

In this work we study the training of a two-layer neural network of width $m$:

$$f(W,x)=\sum_{r=1}^{m} a_r\,\sigma(w_r^T x),\tag{4}$$

where the $a_r$ are fixed at initialization, $\sigma(z)=\max(z,0)$ is the ReLU activation function, and $W\in\mathbb{R}^{m\times d}$ is the weight matrix with rows $w_r^T$. The network weights are learned by minimizing the Directional Sobolev Loss

$$\min_{W\in\mathbb{R}^{m\times d}} L(W):=\frac{1}{2}\sum_{i=1}^{n}\big(f(W,x_i)-y_i\big)^2+\frac{1}{2}\sum_{i=1}^{n}\big\|V_i^T\nabla_x f(W,x_i)-h_i\big\|_2^2,$$

via the gradient flow

$$\frac{dw_r(t)}{dt}=-\frac{\partial L(W(t))}{\partial w_r}.\tag{5}$$

Note that even though the ReLU activation function is not differentiable at zero, we let $\sigma'(z)=\mathbb{1}_{\{z>0\}}$ and $\nabla_x f(W,x)=\sum_{r=1}^{m} a_r\,\sigma'(w_r^T x)\,w_r$. This corresponds to the choice made in most deep learning libraries, and the dynamical system (5) can then be seen as the one followed in practice when using Sobolev training. Explicit formulas for the partial derivatives are given in the next section.
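For concreteness, the network (4), its input gradient under the convention $\sigma'(z)=\mathbb{1}_{\{z>0\}}$, and the directional Sobolev loss can be sketched in a few lines of NumPy. This is an illustrative sketch with assumed array shapes, not code from the paper:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def drelu(z):
    # sigma'(z) = 1{z > 0}, so sigma'(0) = 0, matching the convention in the text
    return (z > 0).astype(float)

def f(W, a, x):
    """Two-layer ReLU network f(W, x) = sum_r a_r sigma(w_r^T x)."""
    return a @ relu(W @ x)

def grad_x_f(W, a, x):
    """Input gradient: sum_r a_r sigma'(w_r^T x) w_r."""
    return W.T @ (a * drelu(W @ x))

def sobolev_loss(W, a, xs, ys, Vs, hs):
    """Directional Sobolev loss: value term plus directional-derivative term."""
    loss = 0.0
    for x, y, V, h in zip(xs, ys, Vs, hs):
        loss += 0.5 * (f(W, a, x) - y) ** 2
        loss += 0.5 * np.sum((V.T @ grad_x_f(W, a, x) - h) ** 2)
    return loss
```

Here `xs`, `ys`, `Vs`, `hs` play the roles of $x_i$, $y_i$, $V_i$, $h_i$; the loss vanishes exactly when the network interpolates all values and directional derivatives.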

In this work we prove that for wide enough networks, gradient flow converges to a global minimizer of $L$. In particular, define the vectors of residuals $e(t)\in\mathbb{R}^n$ and $S(t)\in\mathbb{R}^{nk}$ with coordinates

$$[e(t)]_i=y_i-f(W(t),x_i),\qquad [S(t)]_i=h_i-V_i^T\nabla_x f(W(t),x_i).\tag{6}$$

We show that $e(t)\to 0$ and $S(t)\to 0$ as $t\to\infty$, under the following assumption of non-degeneracy of the training data.

###### Assumption 1

There exist $\delta_1>0$ and $\delta_2\in[0,1/k)$ such that the following hold:

$$\min_{i\neq j}\,\min\big(\|x_i-x_j\|_2,\ \|x_i+x_j\|_2\big)\ \ge\ \delta_1,\tag{7}$$

and for every $i=1,\dots,n$:

$$\max_{1\le j\le k}\,\big|v_{i,j}^T x_i\big|\ \le\ \delta_2,\tag{8}$$

where $v_{i,1},\dots,v_{i,k}$ are the columns of $V_i$.
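As an illustration, the constants $\delta_1$ and $\delta_2$ in (7) and (8) can be computed directly for a given dataset. The helper below is hypothetical, not from the paper:

```python
import numpy as np

def separation_constants(xs, Vs):
    """Return the largest delta_1 and smallest delta_2 for which
    conditions (7) and (8) hold on the data."""
    n = len(xs)
    # (7): min over i != j of min(||x_i - x_j||, ||x_i + x_j||)
    delta_1 = min(
        min(np.linalg.norm(xs[i] - xs[j]), np.linalg.norm(xs[i] + xs[j]))
        for i in range(n) for j in range(n) if i != j
    )
    # (8): max over i and columns v_{i,j} of V_i of |v_{i,j}^T x_i|
    delta_2 = max(np.max(np.abs(V.T @ x)) for x, V in zip(xs, Vs))
    return delta_1, delta_2
```

A large $\delta_1$ means no two inputs are (anti-)parallel; a small $\delta_2$ means each $V_i$ is nearly orthogonal to its own input $x_i$.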

Given $w\in\mathbb{R}^d$, define the following "feature maps":

$$\phi_w(x_i):=\sigma'(w^T x_i)\,x_i,\qquad \psi_w(x_i):=\sigma'(w^T x_i)\,V_i,$$

and the matrix

$$\Omega(w)=\big[\phi_w(x_1),\dots,\phi_w(x_n),\ \psi_w(x_1),\dots,\psi_w(x_n)\big]\in\mathbb{R}^{d\times(k+1)n}.$$

The next quantity plays an important role in the proof of convergence of the gradient flow (5).

###### Definition 1

Define the matrix $H^\infty:=\mathbb{E}_{w\sim N(0,I_d)}\big[\Omega(w)^T\Omega(w)\big]\in\mathbb{R}^{(k+1)n\times(k+1)n}$, and let $\lambda_*:=\lambda_{\min}(H^\infty)$ be its smallest eigenvalue.

Under the non-degeneracy assumptions on the training set we show that $H^\infty$ is strictly positive definite.

###### Proposition 1

Under Assumption 1, the minimum eigenvalue of $H^\infty$ obeys:

$$\lambda_*\ \ge\ \frac{(1-k\delta_2)\,\delta_1}{100\,n^2}.$$
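The quantity $\lambda_*$ can also be estimated numerically by Monte Carlo: sample $w\sim N(0,I_d)$, average the Gram matrices $\Omega(w)^T\Omega(w)$, and take the smallest eigenvalue. This is a sketch under the definitions above; the sample size is an arbitrary choice:

```python
import numpy as np

def omega(w, xs, Vs):
    """Omega(w) = [phi_w(x_1), ..., phi_w(x_n), psi_w(x_1), ..., psi_w(x_n)]."""
    act = [float(w @ x > 0) for x in xs]       # sigma'(w^T x_i)
    cols = [s * x for s, x in zip(act, xs)]    # phi_w(x_i): 1-d columns
    cols += [s * V for s, V in zip(act, Vs)]   # psi_w(x_i): d x k blocks
    return np.column_stack(cols)               # d x (k+1)n

def lambda_star_mc(xs, Vs, n_samples=20000, seed=0):
    """Monte Carlo estimate of lambda_min(E_w[Omega(w)^T Omega(w)])."""
    rng = np.random.default_rng(seed)
    d = xs[0].shape[0]
    H = None
    for _ in range(n_samples):
        Om = omega(rng.standard_normal(d), xs, Vs)
        G = Om.T @ Om
        H = G if H is None else H + G
    return float(np.linalg.eigvalsh(H / n_samples)[0])
```

For instance, with $x_1=e_1$, $x_2=e_2$ in $\mathbb{R}^2$ and $V_1=e_2$, $V_2=e_1$ as single columns, the exact $H^\infty$ can be computed by hand (its entries are products of half-space probabilities) and the estimate concentrates around its smallest eigenvalue.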

We are now ready to state the main result of this work.

###### Theorem 2.1

Assume Assumption 1 is satisfied and the data are normalized so that $\|x_i\|_2=1$ for all $i$. Consider the one-hidden-layer neural network (4), let $\delta\in(0,1)$, set the number of hidden nodes to $m=\Omega\big(n^6k^4\max(1,\gamma^2)/(\lambda_*^4\delta^4)\big)$ (with $\gamma$ as in Lemma 3), and initialize the weights i.i.d. according to:

$$w_r\sim N(0,I_d)\quad\text{and}\quad a_r\sim \mathrm{unif}\Big\{-\frac{1}{m^{1/2}},\ \frac{1}{m^{1/2}}\Big\}\quad\text{for } r=1,\dots,m.\tag{9}$$

Consider the gradient flow (5); then with probability at least $1-\delta$ over the random initialization of $W(0)$ and $a$, for every $t\ge 0$:

$$\|e(t)\|_2^2+\|S(t)\|_2^2\ \le\ \exp(-\lambda_* t)\big(\|e(0)\|_2^2+\|S(0)\|_2^2\big),$$

and in particular $L(W(t))\to 0$ as $t\to\infty$.
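The qualitative content of the theorem, exponential decay of the Sobolev residuals from random initialization, can be checked empirically by discretizing the gradient flow (5) with small-step gradient descent. The sizes and step size below are arbitrary illustrative choices, not the regime of the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, m = 3, 5, 1, 1000

# Random unit-norm inputs, targets, and orthonormal direction matrices V_i
xs = [x / np.linalg.norm(x) for x in rng.standard_normal((n, d))]
ys = rng.standard_normal(n)
Vs = [np.linalg.qr(rng.standard_normal((d, k)))[0] for _ in range(n)]
hs = [rng.standard_normal(k) for _ in range(n)]

# Initialization (9)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def loss_and_grad(W):
    """Directional Sobolev loss and its W-gradient, with sigma'(z) = 1{z > 0}."""
    L, G = 0.0, np.zeros_like(W)
    for x, y, V, h in zip(xs, ys, Vs, hs):
        s = (W @ x > 0).astype(float)       # sigma'(w_r^T x)
        fx = a @ (s * (W @ x))              # network output f(W, x)
        Sx = V.T @ (W.T @ (a * s)) - h      # directional-derivative residual
        L += 0.5 * (fx - y) ** 2 + 0.5 * Sx @ Sx
        # dL/dw_r = a_r sigma'(w_r^T x) [ (f - y) x + V Sx ]
        G += np.outer(a * s, (fx - y) * x + V @ Sx)
    return L, G

eta, T = 0.2, 1500                          # small-step Euler discretization
losses = []
for _ in range(T):
    L, G = loss_and_grad(W)
    losses.append(L)
    W = W - eta * G
```

With a wide network the loss decays roughly geometrically, mirroring the $\exp(-\lambda_* t)$ rate of the continuous-time statement.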

The proof of this theorem is given in Section 3; below we show how to extend this result to a network with bias.

### 2.1 Consequences for a network with bias

Given training data (2) generated by a target function according to (3), in this section we demonstrate how the previous theory can be extended to the Sobolev training of a two-layer network of width $m$ with bias terms $b_r$:

$$g(W,b,x)=\sum_{r=1}^{m} a_r\,\sigma\big(\alpha\,w_r^T x+\beta\,b_r\big),\tag{10}$$

where$^1$ $\alpha,\beta>0$ are fixed constants, $W\in\mathbb{R}^{m\times d}$, and $b\in\mathbb{R}^{m}$. ($^1$Notice that the introduction of the constants $\alpha$ and $\beta$ does not change the expressivity of the network.)

Similarly as before, the network weights and biases are learned by minimizing the Directional Sobolev Loss

$$\min_{W\in\mathbb{R}^{m\times d},\,b\in\mathbb{R}^{m}} L(W,b):=\frac{1}{2}\sum_{i=1}^{n}\big(g(W,b,x_i)-y_i\big)^2+\frac{1}{2}\sum_{i=1}^{n}\Big\|\frac{1}{\alpha}V_i^T\nabla_x g(W,b,x_i)-\frac{h_i}{\alpha}\Big\|_2^2,\tag{11}$$

via the gradient flow

$$\frac{dw_r(t)}{dt}=-\frac{\partial L(W(t),b(t))}{\partial w_r}\quad\text{and}\quad \frac{db_r(t)}{dt}=-\frac{\partial L(W(t),b(t))}{\partial b_r}.\tag{12}$$

Based on the following separation condition on the input points, we will prove convergence to zero training error of the Sobolev loss.

###### Assumption 2

There exists $\hat\delta_1>0$ such that the following holds:

$$\min_{i\neq j}\|x_i-x_j\|_2\ \ge\ \hat\delta_1.\tag{13}$$

Define the vectors of residuals $e(t)\in\mathbb{R}^n$ and $S(t)\in\mathbb{R}^{nk}$ with coordinates

$$[e(t)]_i=y_i-g(W(t),b(t),x_i),\qquad [S(t)]_i=\big(h_i-V_i^T\nabla_x g(W(t),b(t),x_i)\big)/\alpha;$$

then the next theorem follows readily from the analysis in the previous section.

###### Theorem 2.2

Assume Assumption 2 is satisfied and the data are normalized so that $\|x_i\|_2=1$ for all $i$. Consider the two-layer neural network (10), let $\delta\in(0,1)$, choose the number of hidden nodes $m$ polynomially large in $n$, $k$, $1/\lambda_*$ and $1/\delta$ as in Theorem 2.1, and initialize the weights i.i.d. according to:

$$w_r\sim N(0,I_d),\qquad b_r\sim N(0,1)\qquad\text{and}\qquad a_r\sim \mathrm{unif}\Big\{-\frac{1}{m^{1/2}},\ \frac{1}{m^{1/2}}\Big\}.$$

Consider the gradient flow (12); then with probability at least $1-\delta$ over the random initialization of $W(0)$, $b(0)$ and $a$, for every $t\ge 0$:

$$\|e(t)\|_2^2+\|S(t)\|_2^2\ \le\ \exp(-\lambda_* t)\big(\|e(0)\|_2^2+\|S(0)\|_2^2\big),$$

where $\lambda_*$ denotes the analogue of the quantity in Definition 1 for the network with bias. In particular $e(t)\to 0$ and $S(t)\to 0$ as $t\to\infty$.

### 2.2 Discussion

Theorem 2.2 establishes that the gradient flow (12) converges to a global minimum, and therefore that a wide enough network, randomly initialized and trained with the Sobolev loss (11), can interpolate any given function values and directional derivatives. We observe that recent works on the analysis of standard training [21, 12] have shown that, using more refined concentration results and control on the weight dynamics, the polynomial dependence of the width on the number of samples can be lowered. We believe that by applying similar techniques to Sobolev training, the dependence of $m$ on the number of samples and derivatives can be further improved.

Regarding the assumptions on the input data, we note that [12, 2, 6] have shown convergence of gradient descent to a global minimum of the standard loss when the input points satisfy the separation condition (7). This condition ensures that no two input points $x_i$ and $x_j$ are parallel, and it reduces to (13) for a network with bias. While the separation condition (7) is also required in Sobolev training, condition (8) is only required for a network without bias, as a consequence of its homogeneity.

Finally, the analysis of gradient methods for training overparametrized neural networks with standard losses has been used to study their inductive bias and ability to learn certain classes of functions (see for example [2, 3]). Similarly, the results of this paper could be used to shed some light on the superior generalization capabilities of networks trained with a Sobolev loss and their use for knowledge distillation.

## 3 Proof of Theorem 2.1

We follow the lines of recent works on the optimization of neural networks in the Neural Tangent Kernel regime [10, 4, 12], in particular the analysis of [2, 6, 17]. We investigate the dynamics of the residual errors $e(t)$ and $S(t)$, beginning with that of the predictions. Let $\bar F(W,x_i):=V_i^T\nabla_x f(W,x_i)$; then:

$$\frac{d}{dt}f(W(t),x_i)=\sum_{r=1}^{m}\frac{\partial f(W(t),x_i)}{\partial w_r}^{T}\frac{dw_r(t)}{dt}=\sum_{j=1}^{n}A_{ij}(t)\big(y_j-f(W(t),x_j)\big)+\sum_{j=1}^{n}B_{ij}(t)\big(h_j-\bar F(W(t),x_j)\big),$$
$$\frac{d}{dt}\bar F(W(t),x_i)=\sum_{r=1}^{m}\frac{\partial \bar F(W(t),x_i)}{\partial w_r}^{T}\frac{dw_r(t)}{dt}=\sum_{j=1}^{n}B_{ji}(t)^{T}\big(y_j-f(W(t),x_j)\big)+\sum_{j=1}^{n}C_{ij}(t)\big(h_j-\bar F(W(t),x_j)\big),$$

where we defined the matrices $A(t):=\sum_{r=1}^m A_r(t)$, $B(t):=\sum_{r=1}^m B_r(t)$, $C(t):=\sum_{r=1}^m C_r(t)$, with block structure:

$$[A_r(t)]_{ij}:=\frac{\partial f(W(t),x_i)}{\partial w_r}^{T}\frac{\partial f(W(t),x_j)}{\partial w_r}=\frac{1}{m}\,\sigma'(w_r^Tx_i)\,\sigma'(w_r^Tx_j)\,x_i^Tx_j,$$
$$[B_r(t)]_{ij}:=\frac{\partial f(W(t),x_i)}{\partial w_r}^{T}\frac{\partial \bar F(W(t),x_j)}{\partial w_r}=\frac{1}{m}\,\sigma'(w_r^Tx_i)\,\sigma'(w_r^Tx_j)\,x_i^TV_j,$$
$$[C_r(t)]_{ij}:=\frac{\partial \bar F(W(t),x_i)}{\partial w_r}^{T}\frac{\partial \bar F(W(t),x_j)}{\partial w_r}=\frac{1}{m}\,\sigma'(w_r^Tx_i)\,\sigma'(w_r^Tx_j)\,V_i^TV_j.$$

The residual errors (6) then follow the dynamical system:

$$\frac{d}{dt}\begin{pmatrix} e \\ S \end{pmatrix}=-H(t)\begin{pmatrix} e \\ S \end{pmatrix},\tag{14}$$

where $H(t)\in\mathbb{R}^{(k+1)n\times(k+1)n}$ is given by:

$$H(t)=\begin{bmatrix} A(t) & B(t) \\ B(t)^T & C(t) \end{bmatrix}.$$

We moreover observe that if we define:

$$\Omega_r(t):=\Big[\frac{\partial f(W(t),x_1)}{\partial w_r},\dots,\frac{\partial f(W(t),x_n)}{\partial w_r},\ \frac{\partial \bar F(W(t),x_1)}{\partial w_r},\dots,\frac{\partial \bar F(W(t),x_n)}{\partial w_r}\Big]$$

and $H_r(t):=\Omega_r(t)^T\Omega_r(t)$, then direct calculations show that $H(t)=\sum_{r=1}^m H_r(t)$ and that $H(t)$ is symmetric positive semidefinite for all $t\ge0$. In the next section we will show that $H(t)$ is strictly positive definite in a neighborhood of initialization, while in Section 3.2 we will show that this holds for all times, leading to global convergence of the errors to zero.
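The decomposition $H(t)=\sum_r \Omega_r(t)^T\Omega_r(t)$, and hence positive semidefiniteness, is easy to verify numerically. The sketch below uses assumed shapes and random data, purely as an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, m = 3, 4, 2, 50
xs = rng.standard_normal((n, d))
Vs = [np.linalg.qr(rng.standard_normal((d, k)))[0] for _ in range(n)]
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def omega_r(r):
    """Columns: df/dw_r at each x_i, then d(V_i^T grad_x f)/dw_r at each x_i."""
    cols = []
    for x in xs:
        s = float(W[r] @ x > 0)
        cols.append(a[r] * s * x[:, None])   # df/dw_r = a_r sigma'(w_r^T x_i) x_i
    for x, V in zip(xs, Vs):
        s = float(W[r] @ x > 0)
        cols.append(a[r] * s * V)            # dF_i/dw_r = a_r sigma'(w_r^T x_i) V_i
    return np.hstack(cols)                   # d x (k+1)n

# H = sum_r Omega_r^T Omega_r: a sum of Gram matrices, hence PSD
H = sum(omega_r(r).T @ omega_r(r) for r in range(m))
```

Since $a_r^2=1/m$, the top-left $n\times n$ block of `H` reproduces the formula for $A(t)$ above entry by entry.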

### 3.1 Analysis near initialization

In this section we analyze the behavior of the matrix $H(t)$ and the dynamics of the errors near initialization. We begin by bounding the derivatives of the network outputs and of $\bar F$ with respect to $w_r$, for every $r=1,\dots,m$.

###### Lemma 1

For all $t\ge0$, $i=1,\dots,n$ and $r=1,\dots,m$, it holds:

$$\Big\|\frac{\partial f(W(t),x_i)}{\partial w_r}\Big\|_2\le\frac{1}{\sqrt m},\qquad \Big\|\frac{\partial \bar F(W(t),x_i)}{\partial w_r}\Big\|_2\le\sqrt{\frac{k}{m}},\qquad \|H_r(t)\|_2\le\frac{n(k+1)}{m}.$$

We now lower bound the smallest eigenvalue of $H(0)$.

###### Lemma 2

Let $\delta\in(0,1)$ and $m=\Omega\big(\frac{n(k+1)}{\lambda_*}\log\frac{n(k+1)}{\delta}\big)$; then with probability $1-\delta$ over the random initialization:

$$\lambda_{\min}(H(0))\ \ge\ \frac{3}{4}\lambda_*.$$

We now provide a bound on the expected value of the residual errors at initialization.

###### Lemma 3

Let $W$ and $a$ be randomly initialized as in (9); then the residual errors (6) at time zero satisfy, with probability at least $1-\delta$:

$$\|e(0)\|_2+\|S(0)\|_2\ \le\ \frac{2\sqrt{nk+\gamma}}{\delta^{1/2}},$$

where $\gamma$ is a constant depending only on $\|y\|_2$ and $\|h\|_2$.

Next define the neighborhood around initialization:

$$B_0:=\Big\{W:\ \|H(W)-H(W(0))\|_F\le\frac{\lambda_*}{4}\Big\},$$

and the escape time

$$\tau_0:=\inf\{t:\ W(t)\notin B_0\}.\tag{15}$$

We can now prove the main result of this section, which characterizes the dynamics of $e(t)$, $S(t)$ and the weights in the vicinity of initialization.

###### Lemma 4

Let $\delta\in(0,1)$ and $m$ as in Lemma 2; then with probability $1-\delta$ over the random initialization, for every $0\le t\le\tau_0$:

$$\|e(t)\|_2^2+\|S(t)\|_2^2\ \le\ \exp(-\lambda_* t)\big(\|e(0)\|_2^2+\|S(0)\|_2^2\big)$$

and

$$\|w_r(t)-w_r(0)\|_2\ \le\ \frac{4}{\lambda_*}\sqrt{\frac{kn}{m}}\big(\|e(0)\|_2+\|S(0)\|_2\big)=:R.$$
###### Proof

Observe that if $t\le\tau_0$, by Lemma 2 with probability $1-\delta$:

$$\lambda_{\min}(H(t))\ \ge\ \lambda_{\min}(H(0))-\|H(t)-H(0)\|_F\ \ge\ \frac{\lambda_*}{2}.$$

Therefore, using (14), it follows that for any $t\le\tau_0$:

$$\frac{d}{dt}\,\frac{1}{2}\big(\|e(t)\|_2^2+\|S(t)\|_2^2\big)\ \le\ -\frac{\lambda_*}{2}\big(\|e(t)\|_2^2+\|S(t)\|_2^2\big),$$

which implies the first claim by Gronwall's lemma. Next, using (5), the bounds in Lemma 1 and the above inequality, we obtain:

$$\Big\|\frac{d}{dt}w_r(t)\Big\|_2=\Big\|\frac{\partial}{\partial w_r}L(W(t))\Big\|_2=\Big\|\sum_{i=1}^{n}\big(y_i-f(W(t),x_i)\big)\frac{\partial f(W(t),x_i)}{\partial w_r}+\sum_{i=1}^{n}\frac{\partial \bar F_i(t)}{\partial w_r}^{T}\big(h_i-\bar F_i(t)\big)\Big\|$$
$$\le\ \frac{1}{\sqrt m}\sum_{i=1}^{n}\big|y_i-f(W(t),x_i)\big|+\sqrt{\frac{k}{m}}\sum_{i=1}^{n}\big\|h_i-\bar F_i(W(t))\big\|\ \le\ 2\sqrt{\frac{kn}{m}}\,e^{-\frac{\lambda_*}{2}t}\big(\|e(0)\|+\|S(0)\|\big).$$

We can therefore conclude by bounding the distance from initialization as:

$$\|w_r(t)-w_r(0)\|\ \le\ \int_0^t\Big\|\frac{d}{ds}w_r(s)\Big\|_2\,ds\ \le\ \frac{4}{\lambda_*}\sqrt{\frac{kn}{m}}\big(\|e(0)\|+\|S(0)\|\big).$$

### 3.2 Proof of Global Convergence

In order to conclude the proof of global convergence, according to Lemma 4 we only need to show that $\tau_0=\infty$, where $\tau_0$ is defined in (15). Arguing by contradiction, assume this is not the case, so that $\tau_0<\infty$. Below we bound $\|H(\tau_0)-H(0)\|_F$.

For $r=1,\dots,m$ and $i,j=1,\dots,n$, let $Q^r_{ij}:=I\{\sigma'(w_r(0)^Tx_i)\neq\sigma'(w_r(\tau_0)^Tx_i)\}+I\{\sigma'(w_r(0)^Tx_j)\neq\sigma'(w_r(\tau_0)^Tx_j)\}$; then from the formulas for $A$, $B$ and $C$ in the previous sections, we have:

$$\big|[A(\tau_0)-A(0)]_{ij}\big|\le\frac{1}{m}\sum_{r=1}^{m}Q^r_{ij},\qquad \big\|[B^T(\tau_0)-B^T(0)]_{ij}\big\|_2\le\frac{\sqrt k}{m}\sum_{r=1}^{m}Q^r_{ij},\qquad \big\|[C(\tau_0)-C(0)]_{ij}\big\|_F\le\frac{k}{m}\sum_{r=1}^{m}Q^r_{ij}.$$

Let $R$ be as in Lemma 4; then with probability at least $1-\delta$, for all $r$ we have $\|w_r(\tau_0)-w_r(0)\|\le R$. Moreover, observe that if $\|w_r(\tau_0)-w_r(0)\|\le R$ and $|w_r(0)^Tx_i|>R$, then $\sigma'(w_r(0)^Tx_i)=\sigma'(w_r(\tau_0)^Tx_i)$. Therefore, for any $i$ and $r$ we can define the event $E_{i,r}:=\{|w_r(0)^Tx_i|\le R\}$ and observe that:

$$I\{\sigma'(w_r(0)^Tx_i)\neq\sigma'(w_r(\tau_0)^Tx_i)\}\ \le\ I\{E_{i,r}\}+I\{\|w_r(\tau_0)-w_r(0)\|>R\}.$$

Next note that $w_r(0)^Tx_i\sim N(0,1)$, so that $P[E_{i,r}]\le\frac{2R}{\sqrt{2\pi}}$, and in particular:

$$\frac{1}{m}\sum_{r=1}^{m}\mathbb{E}\big[Q^r_{ij}\big]\ \le\ \frac{1}{m}\sum_{r=1}^{m}\big(P[E_{i,r}]+P[E_{j,r}]\big)+2\,P\Big[\bigcup_r\{\|w_r(\tau_0)-w_r(0)\|>R\}\Big]\ \le\ \frac{4R}{\sqrt{2\pi}}+2\delta.$$

By Markov's inequality we can conclude that with probability at least $1-\delta$:

$$\|A(\tau_0)-A(0)\|_F\le\sum_{i,j}\big|[A(\tau_0)]_{ij}-[A(0)]_{ij}\big|\ \le\ \frac{4n^2}{\sqrt{2\pi}\,\delta}R+\frac{2n^2}{m},$$
$$\|B^T(\tau_0)-B^T(0)\|_F\le\sum_{i,j}\big\|[B^T(\tau_0)]_{ij}-[B^T(0)]_{ij}\big\|_2\ \le\ \frac{4n^2\sqrt k}{\sqrt{2\pi}\,\delta}R+\frac{2n^2}{m},$$
$$\|C(\tau_0)-C(0)\|_F\le\sum_{i,j}\big\|[C(\tau_0)]_{ij}-[C(0)]_{ij}\big\|_F\ \le\ \frac{4n^2k}{\sqrt{2\pi}\,\delta}R+\frac{2n^2}{m},$$

and using Lemma 3 together with the definitions of $R$ and $\lambda_*$:

$$\|H(\tau_0)-H(0)\|_F\ \le\ \frac{16n^2k}{\sqrt{2\pi}\,\delta}R+\frac{8n^2}{m}\ =\ O\Big(\frac{n^3k^2}{\sqrt m\,\delta^2}\,\frac{\max(1,\gamma)}{\lambda_*}\Big).$$

Then choosing $m=\Omega\Big(\frac{n^6k^4\max(1,\gamma^2)}{\lambda_*^4\,\delta^4}\Big)$ we obtain

$$\|H(\tau_0)-H(0)\|_F\ <\ \frac{\lambda_*}{4},$$

which contradicts the definition of $\tau_0$, and therefore $\tau_0=\infty$.

## Appendix 0.A Supplementary proofs for Section 3.1

In this section we provide the remaining proofs of the results in Section 3.1. We begin by recalling the following matrix Chernoff inequality (see for example [15, Theorem 5.1.1]).

###### Theorem 0.A.1 (Matrix Chernoff)

Consider a finite sequence $\{X_r\}_{r=1}^{m}$ of independent, random, Hermitian, positive semidefinite $p\times p$ matrices with $\lambda_{\max}(X_r)\le L$. Let $X=\sum_{r=1}^{m}X_r$; then for all $\epsilon\in[0,1)$:

$$P\big[\lambda_{\min}(X)\le\epsilon\,\lambda_{\min}(\mathbb{E}[X])\big]\ \le\ p\,e^{-(1-\epsilon)^2\lambda_{\min}(\mathbb{E}[X])/2L}.\tag{16}$$

In order to lower bound the smallest eigenvalue of $H(0)$, we use Lemma 1 together with the previous concentration result.

###### Proof (Lemma 2)

We first note that $\mathbb{E}[H(0)]=H^\infty$, and moreover each $H_r(0)$ is symmetric positive semidefinite with $\|H_r(0)\|_2\le\frac{n(k+1)}{m}$ by Lemma 1. Applying the concentration bound (16) with $\epsilon=3/4$ together with the assumption on $m$ gives the thesis.

We next upper bound the errors at initialization.

###### Proof (Lemma 3)

Note that for any $i$, due to the assumed independence of the weights at initialization and the normalization of the data:

$$\mathbb{E}\big[(f(W,x_i))^2\big]=\sum_{r=1}^{m}\frac{1}{m}\,\mathbb{E}\big[\sigma(w_r^Tx_i)^2\big]\ \le\ 1,$$

and similarly for the directional derivatives:

$$\mathbb{E}\big[\|\bar F(W,x_i)\|_2^2\big]=\mathbb{E}_{g\sim N(0,I)}\big[\|\sigma'(g^Tx_i)\,V_i^Tg\|_2^2\big]\ \le\ \sum_{j=1}^{k}\mathbb{E}\big[(v_{i,j}^Tg)^2\big]\ \le\ k.$$

We conclude the proof by using Jensen’s and Markov’s inequalities.

## Appendix 0.B Proof of Proposition 1

Consider the matrices $X_i:=[x_i,\ V_i]\in\mathbb{R}^{d\times(k+1)}$, and for $w\in\mathbb{R}^d$ define

$$\hat\psi_w(x_i)=\sigma'(w^Tx_i)\,X_i,$$

and the matrix

$$\hat\Omega(w)=\big[\hat\psi_w(x_1),\dots,\hat\psi_w(x_n)\big]\in\mathbb{R}^{d\times n(k+1)},$$

which corresponds to a column permutation of $\Omega(w)$. Next observe that the matrix $\hat H^\infty:=\mathbb{E}_{w}\big[\hat\Omega(w)^T\hat\Omega(w)\big]$ is (permutation-)similar to $H^\infty$ and therefore has the same eigenvalues. In this section we lower bound $\lambda_*$ by analyzing $\hat H^\infty$.

We begin by recalling some facts about the spectral properties of products of matrices.

###### Definition 2 ([8])

Let $A=[A_{\alpha\beta}]_{\alpha,\beta=1,\dots,n}$ and $B=[B_{\alpha\beta}]_{\alpha,\beta=1,\dots,n}$ be block matrices in which each block is in $\mathbb{R}^{(k+1)\times(k+1)}$. Then we define the block Hadamard product of $A$ and $B$ as the matrix with:

$$A\,\square\,B:=\big[A_{\alpha\beta}B_{\alpha\beta}\big]_{\alpha,\beta=1,\dots,n},$$

where $A_{\alpha\beta}B_{\alpha\beta}$ denotes the usual matrix product between $A_{\alpha\beta}$ and $B_{\alpha\beta}$.

Generalizing Schur’s Lemma one has the following regarding the eigenvalues of the block Hadamard product of two block matrices.

###### Proposition 2 ([8])

Let $A$ and $B$ be positive semidefinite block matrices. Assume that every block of $A$ commutes with every block of $B$; then $A\,\square\,B$ is positive semidefinite and $\lambda_{\min}(A\,\square\,B)\ \ge\ \lambda_{\min}(A)\,\min_{1\le\alpha\le n}\lambda_{\min}(B_{\alpha\alpha})$.

We finally recall the following fact on the eigenvalues of the Kronecker product of two matrices.

###### Proposition 3 ([11])

Let $A\in\mathbb{R}^{n\times n}$ with eigenvalues $\{\lambda_i\}_{i=1}^{n}$ and $B\in\mathbb{R}^{m\times m}$ with eigenvalues $\{\mu_j\}_{j=1}^{m}$; then the Kronecker product $A\otimes B$ has eigenvalues $\{\lambda_i\mu_j\}_{i,j}$.

We next define the following random kernel matrix.

###### Definition 3

Let $w\in\mathbb{R}^d$; then define the random matrix $M(w)\in\mathbb{R}^{n\times n}$ with entries $[M(w)]_{ij}=\sigma'(w^Tx_i)\,\sigma'(w^Tx_j)$.

The next result from [12] establishes positive definiteness of this matrix in expectation, under the separation condition (7).

###### Lemma 5 ([12])

Let $x_1,\dots,x_n$ in $\mathbb{R}^d$ and assume that (7) is satisfied for all $i\neq j$. Then the following holds:

$$\mathbb{E}_{w\sim N(0,I)}\big[M(w)\big]\ \succeq\ \frac{\delta_1}{100\,n^2}\,I_n.$$

Finally, let $X:=[X_1,\dots,X_n]$, so that $X^TX$ is the block matrix with blocks $X_i^TX_j$. Thanks to assumption (8), the following result on the Gram matrices holds.

###### Lemma 6

Assume that condition (8) is satisfied; then for any $i=1,\dots,n$ we have $\lambda_{\min}(X_i^TX_i)\ \ge\ 1-k\delta_2$.

###### Proof

The claim follows by observing that $X_i^TX_i$ has unit diagonal entries and that, by Gershgorin's Disk Theorem:

$$\big|\lambda_{\min}(X_i^TX_i)-1\big|\ \le\ \sum_{j=1}^{k}\big|v_{i,j}^Tx_i\big|\ \le\ k\,\delta_2.$$