1 Introduction
Deep neural networks are ubiquitous and have established state-of-the-art performance in a wide variety of applications and fields. These networks often have a large number of parameters, which are tuned via gradient descent (or its variants) on an empirical risk minimization task. In particular, in supervised learning it is often required that the output of the network fit certain values/labels that can be thought of as coming from an unknown target function. In many settings, though, additional prior information on the task or target function might be available, and enforcing it might be of interest. One such example is the case of high-order derivatives of the unknown target function, which, as shown in
[5], naturally arises in problems such as distillation, in which a large teacher network is used to train a more compact student network, or the prediction of synthetic gradients for training deep complex models. Therefore [5] proposed “Sobolev training”, which, given training inputs , attempts to minimize the following empirical risk:(1) 
where is a neural network with input and parameters , denotes the target function, is a loss penalizing the deviation from the outputs of , and
are loss functions penalizing the deviations of the
th derivative of the network with respect to from the th derivative of the target . The empirical successes of Sobolev training have been demonstrated in a number of works. In [5]
it was shown that Sobolev training leads to smaller generalization errors than standard training in tasks such as distillation and synthetic gradient prediction, especially in the low-data regime. Similar results were also obtained for transfer learning via Jacobian matching in
[14]. Earlier, Sobolev training was applied in [13] in order to enforce invariance to translations and small rotations. More recently, Sobolev training has been used in the context of anisotropic hyperelasticity in order to improve the predictions of the stress tensor (the derivative of the network with respect to the input deformation tensor) in
[16]. Finally, the idea of Sobolev training is also tightly connected to other techniques which have recently been successfully employed, such as attention matching in student distillation [18] and [5], or convex data augmentation for generalization and robustness improvement [19]. On the theoretical side, justification for Sobolev training was given by [5], extending the classical work of Hornik [9]
and giving universal approximation properties of neural networks with ReLU activation function in Sobolev spaces. This result was then further improved for deep networks in
[7]. While these works motivated the use of the Sobolev loss (1), conditions under which it can be successfully minimized were not given. In particular, even though the networks used in Sobolev training are usually shallow, the resulting loss (1) is highly nonconvex, and therefore the success of first-order methods is not a priori guaranteed. In this paper we study a two-layer ReLU neural network trained with a Sobolev loss when at each input point the output values and a set of directional derivatives of the target function are given. Leveraging recent results on training with standard losses [2, 20, 1, 12], we show that if the network is sufficiently overparameterized, the weights are randomly initialized, and the data satisfy certain natural nondegeneracy assumptions, gradient flow achieves a global minimum.
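As a toy illustration of a Sobolev-type empirical risk of the form (1), the following sketch (our own minimal example, not the network setting analyzed below) fits a degree-5 polynomial to the target f(x) = sin(x) by matching both the values f(x_i) and the first derivatives f'(x_i) in least squares:

```python
import numpy as np

# Minimal illustration of a Sobolev-style empirical risk: fit a degree-5
# polynomial g(x; theta) = sum_j theta_j x^j to f(x) = sin(x), penalizing
# both value and first-derivative mismatches at the sample points.
x = np.linspace(-1.0, 1.0, 20)
y, dy = np.sin(x), np.cos(x)                 # targets f(x_i), f'(x_i)

deg = 5
V = np.vander(x, deg + 1, increasing=True)   # V[i, j] = x_i**j
Vd = np.hstack([np.zeros((len(x), 1)),       # d/dx x**j = j * x**(j-1)
                V[:, :-1] * np.arange(1, deg + 1)])

# Minimizer of sum_i (g(x_i) - y_i)^2 + (g'(x_i) - dy_i)^2 via the
# normal equations of the combined least-squares problem.
theta = np.linalg.solve(V.T @ V + Vd.T @ Vd, V.T @ y + Vd.T @ dy)

print(np.max(np.abs(V @ theta - y)))    # value error at the samples
print(np.max(np.abs(Vd @ theta - dy)))  # derivative error at the samples
```

Because the model is linear in its parameters the Sobolev loss stays convex here; the point of the paper is precisely that for neural networks this convexity is lost.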
2 Main Result
We study the training of neural networks with “Directional Sobolev Training”. In particular we assume we are given training data
(2) 
where , , and with orthonormal columns (unit Euclidean norm and pairwise orthogonal). This training data can be thought of as being generated by a differentiable function according to
(3) 
so that each entry of the vector
corresponds to a directional derivative of in the direction given by the corresponding column of the matrix . We will denote by and the vectors with entries and blocks respectively. In this work we study the training of a two-layer neural network with width :
(4) 
where are fixed at initialization, is the ReLU activation function, and is the weight matrix with rows . The network weights are learned by minimizing the Directional Sobolev Loss
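A minimal sketch of such a network and of its directional derivative, under the assumed parameterization g(W; x) = (1/√m) Σ_r a_r σ(w_r·x) with fixed signs a_r (the 1/√m scaling and the convention σ'(0) = 0 are our assumptions, common in analyses of this type):

```python
import numpy as np

# Sketch of a width-m two-layer ReLU network, g(W; x) = (1/sqrt(m)) *
# sum_r a_r * relu(w_r . x), and its directional derivative along a unit
# vector v, using the convention relu'(0) = 0 discussed below (4).
def network(W, a, x):
    m = len(a)
    return (a * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)

def directional_derivative(W, a, x, v):
    m = len(a)
    active = (W @ x > 0.0).astype(float)   # relu'(w_r . x), with relu'(0) = 0
    return (a * active * (W @ v)).sum() / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 5, 50
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)
x = rng.normal(size=d); x /= np.linalg.norm(x)
v = rng.normal(size=d); v /= np.linalg.norm(v)

# Finite-difference check of the directional derivative.
eps = 1e-6
fd = (network(W, a, x + eps * v) - network(W, a, x - eps * v)) / (2 * eps)
print(abs(fd - directional_derivative(W, a, x, v)))  # small generically
```

Away from the (measure-zero) kinks of the ReLU, the finite-difference quotient matches the analytic directional derivative, which is what the gradient flow below uses.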
via the Gradient Flow
(5) 
Note that even though the ReLU activation function is not differentiable, we let and . This corresponds to the choice made in most deep learning libraries, and the dynamical system (5) can then be seen as the one followed in practice when using Sobolev training. Explicit formulas for the partial derivatives are given in the next section. In this work we prove that for wide enough networks, gradient flow converges to a global minimizer of . In particular, define the vectors of residuals and with coordinates
(6) 
We show that and as , under the following nondegeneracy assumption on the training data.
Assumption 1
There exist and such that the following hold:
(7) 
and for every :
(8) 
where are the columns of .
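A hedged numerical reading of the separation condition (7): no two inputs are parallel after normalization to the unit sphere, i.e. both the difference and the sum of any two normalized inputs stay bounded away from zero. The sketch below checks only this qualitative content (the exact constants in (7)-(8) are not reproduced):

```python
import numpy as np

# Minimal check of the "no two parallel inputs" content of the separation
# condition: after normalizing each row of X to unit norm, the quantity
# min_{i != j} min(||x_i - x_j||, ||x_i + x_j||) must be bounded away from 0.
def separation(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = len(Xn)
    gaps = [min(np.linalg.norm(Xn[i] - Xn[j]), np.linalg.norm(Xn[i] + Xn[j]))
            for i in range(n) for j in range(i + 1, n)]
    return min(gaps)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 8))
print(separation(X) > 0)   # Gaussian inputs are non-parallel almost surely
```

Generic (e.g. Gaussian) inputs satisfy this almost surely; the assumption rules out degenerate datasets with repeated or antipodal directions.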
Given define the following “feature maps”:
and matrix:
The next quantity plays an important role in the proof of convergence of the gradient flow (5).
Definition 1
Define the matrix with entries given by , and let
be its smallest eigenvalue.
Under the nondegeneracy of the training set we show that is strictly positive definite.
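As a hedged sanity check of this kind of limiting Gram matrix, the sketch below treats only the standard (value-matching) part of the kernel, for which the infinite-width Gram matrix has the known closed form H_ij = x_i·x_j (π − arccos(x_i·x_j))/(2π) for unit-norm inputs (cf. [6]); we compare it against a Monte Carlo estimate over random weights w ~ N(0, I) and verify strict positive definiteness for non-parallel inputs:

```python
import numpy as np

# Monte Carlo check of the value-part limiting Gram matrix
# H_ij = E_w[1{w.x_i > 0} 1{w.x_j > 0}] x_i.x_j
#      = x_i.x_j * (pi - arccos(x_i.x_j)) / (2*pi)   for unit-norm x_i.
rng = np.random.default_rng(0)
n, d, m = 6, 4, 200_000
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

G = X @ X.T
H_exact = G * (np.pi - np.arccos(np.clip(G, -1, 1))) / (2 * np.pi)

W = rng.normal(size=(m, d))
A = (W @ X.T > 0).astype(float)        # activation patterns, m x n
H_mc = G * (A.T @ A) / m               # empirical expectation over w

print(np.max(np.abs(H_mc - H_exact)))        # small Monte Carlo error
print(np.linalg.eigvalsh(H_exact).min() > 0)  # strictly positive definite
```

The Sobolev setting augments this kernel with derivative features, but the mechanism is the same: nondegenerate data keep the smallest eigenvalue strictly positive.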
Proposition 1
Under Assumption 1, the minimum eigenvalue of obeys:
We are now ready to state the main result of this work.
Theorem 2.1
Assume that Assumption 1 is satisfied and the data are normalized so that . Consider a one-hidden-layer neural network (4), let , set the number of hidden nodes to , and initialize the weights i.i.d. according to:
(9) 
Consider the Gradient Flow (5); then with probability over the random initialization of and , for every : and in particular as .
The proof of this theorem is given in Section 3; below we show how to extend this result to a network with bias.
2.1 Consequences for a network with bias
Given training data (2) generated by a target function according to (3), in this section we demonstrate how the previous theory can be extended to the Sobolev training of a two-layer network with width and bias term :
(10) 
where and . (Notice that the introduction of the constants and does not change the expressivity of the network.)
As before, the network weights and biases are learned by minimizing the Directional Sobolev Loss
(11) 
via the Gradient Flow
(12) 
Based on the following separation condition on the input points, we will prove convergence of the Sobolev loss to zero training error.
Assumption 2
There exists such that the following holds:
(13) 
Define the vectors of residuals and with coordinates
Then the next theorem follows readily from the analysis in the previous section.
Theorem 2.2
Assume that Assumption 2 is satisfied and the data are normalized so that . Consider a two-layer neural network (10), let , set the number of hidden nodes to , and initialize the weights i.i.d. according to:
Consider the Gradient Flow (12); then with probability over the random initialization of , and , for every
where and . In particular as .
2.2 Discussion
Theorem 2.2 establishes that the gradient flow (12) converges to a global minimum, and therefore that a wide enough network, randomly initialized and trained with the Sobolev loss (11), can interpolate any given function values and directional derivatives. We observe that recent works on the analysis of standard training [21, 12] have shown that, using more refined concentration results and control on the weight dynamics, the polynomial dependence on the number of samples can be lowered. We believe that by applying similar techniques to Sobolev training, the dependence of on the number of samples and derivatives can be further improved. Regarding the assumptions on the input data, we note that [12, 2, 6] have shown convergence of gradient descent to a global minimum of the standard loss when the input points satisfy the separation conditions (7). These conditions ensure that no two input points and are parallel, and reduce to (13) for a network with bias. While the separation condition (7) is also required in Sobolev training, the condition (8) is only required in the case of a network without bias, as a consequence of its homogeneity.
Finally, the analysis of gradient methods for training overparametrized neural networks with standard losses has been used to study their inductive bias and ability to learn certain classes of functions (see for example [2, 3]). Similarly, the results of this paper could be used to shed some light on the superior generalization capabilities of networks trained with a Sobolev loss and their use for knowledge distillation.
3 Proof of Theorem 2.1
We follow the lines of recent works on the optimization of neural networks in the Neural Tangent Kernel regime [10, 4, 12], in particular the analysis of [2, 6, 17]. We investigate the dynamics of the residual errors and , beginning with that of the predictions. Let ; then:
where we defined the matrices , , , with block structure:
The residual errors (6) then follow the dynamical system:
(14) 
where is given by:
We moreover observe that if we define:
and , then direct calculations show that and is symmetric positive semidefinite for all . In the next section we will show that is strictly positive definite in a neighborhood of the initialization, while in Section 3.2 we will show that this continues to hold along the trajectory, leading to global convergence of the errors to zero.
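As a numerical sanity check of these dynamics, the sketch below runs an Euler discretization of a gradient flow of the type (5) for a small two-layer ReLU network with a Sobolev-type loss (one value and one directional-derivative target per point). It assumes the parameterization g(W; x) = (1/√m) Σ_r a_r σ(w_r·x) and only illustrates the decay of the residuals, not the paper's constants:

```python
import numpy as np

# Euler discretization of the gradient flow for a Sobolev-type loss
# L = 0.5 * (||u||^2 + ||v||^2), where u are value residuals and v are
# directional-derivative residuals of a two-layer ReLU network.
rng = np.random.default_rng(0)
n, d, m, lr, steps = 5, 6, 2000, 0.2, 500

X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
D = rng.normal(size=(n, d)); D /= np.linalg.norm(D, axis=1, keepdims=True)
y = rng.normal(size=n)      # target values
z = rng.normal(size=n)      # target directional derivatives

a = rng.choice([-1.0, 1.0], size=m)
W = rng.normal(size=(m, d))

def residuals(W):
    pre = W @ X.T                      # m x n preactivations
    ind = (pre > 0).astype(float)      # relu'(w_r . x_i), with relu'(0) = 0
    u = a @ np.maximum(pre, 0.0) / np.sqrt(m) - y
    v = ((a[:, None] * ind) * (W @ D.T)).sum(0) / np.sqrt(m) - z
    return u, v, ind

u, v, _ = residuals(W)
loss0 = 0.5 * (u @ u + v @ v)
for _ in range(steps):
    u, v, ind = residuals(W)
    # dL/dw_r = (1/sqrt(m)) sum_i a_r 1{w_r.x_i > 0} (u_i x_i + v_i d_i)
    grad = (a[:, None] * (ind * u)) @ X / np.sqrt(m) \
         + (a[:, None] * (ind * v)) @ D / np.sqrt(m)
    W -= lr * grad

u, v, _ = residuals(W)
print(0.5 * (u @ u + v @ v) < loss0)   # residuals decreased
```

With a sufficiently wide network the kernel governing these residual dynamics stays close to its value at initialization, which is the mechanism exploited in the proof below.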
3.1 Analysis near initialization
In this section we analyze the behavior of the matrix and the dynamics of the errors near the initialization. We begin by bounding the output and directional derivatives of the network for every .
Lemma 1
For all and , it holds:
We now lower bound the smallest eigenvalue of .
Lemma 2
Let , and ; then with probability over the random initialization:
We now provide a bound on the expected value of the residual errors at initialization.
Lemma 3
Next define the neighborhood around initialization:
and the escape time
(15) 
We can now prove the main result of this section which characterizes the dynamics of , and the weights in the vicinity of .
Lemma 4
Let and then with probability over the random initialization, for every :
and
Proof
3.2 Proof of Global Convergence
In order to conclude the proof of global convergence, according to Lemma 4, we only need to show that , where is defined in (15). Arguing by contradiction, assume this is not the case, so that . Below we bound .
Let , then from the formulas for , and in the previous sections, we have:
Let be as in Lemma 4; then with probability at least , for all we have . Moreover, observe that if and , then . Therefore, for any and we can define the event and observe that:
Next, note that , so that , and in particular:
By Markov's inequality we can conclude that with probability at least :
and using Lemma 3 together with the definitions of and :
Then, choosing , we obtain
which contradicts the definition of , and therefore .
References
 [1] Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 (2018)
 [2] Arora, S., Du, S.S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization and generalization for over-parameterized two-layer neural networks. arXiv preprint arXiv:1901.08584 (2019)
 [3] Bietti, A., Mairal, J.: On the inductive bias of neural tangent kernels. In: Advances in Neural Information Processing Systems. pp. 12873–12884 (2019)
 [4] Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956 (2018)
 [5] Czarnecki, W.M., Osindero, S., Jaderberg, M., Swirszcz, G., Pascanu, R.: Sobolev training for neural networks. In: Advances in Neural Information Processing Systems. pp. 4278–4287 (2017)
 [6] Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)
 [7] Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep ReLU neural networks in norms. arXiv preprint arXiv:1902.07896 (2019)
 [8] Günther, M., Klotz, L.: Schur’s theorem for a block Hadamard product. Linear Algebra and its Applications 437(3), 948–956 (2012)
 [9] Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural networks 4(2), 251–257 (1991)
 [10] Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: Convergence and generalization in neural networks. In: Advances in neural information processing systems. pp. 8571–8580 (2018)
 [11] Laub, A.J.: Matrix Analysis for Scientists and Engineers, vol. 91. SIAM (2005)
 [12] Oymak, S., Soltanolkotabi, M.: Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674 (2019)
 [13] Simard, P., Victorri, B., LeCun, Y., Denker, J.: Tangent prop - a formalism for specifying selected invariances in an adaptive network. In: Advances in Neural Information Processing Systems. pp. 895–903 (1992)
 [14] Srinivas, S., Fleuret, F.: Knowledge transfer with Jacobian matching. arXiv preprint arXiv:1803.00443 (2018)

 [15] Tropp, J.A.: An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8(1-2), 1–230 (2015)
 [16] Vlassis, N., Ma, R., Sun, W.: Geometric deep learning for computational mechanics Part I: Anisotropic hyperelasticity. arXiv preprint arXiv:2001.04292 (2020)
 [17] Weinan, E., Ma, C., Wu, L.: A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Science China Mathematics pp. 1–24
 [18] Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)
 [19] Zhang, H., Cisse, M., Dauphin, Y.N., LopezPaz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
 [20] Zou, D., Cao, Y., Zhou, D., Gu, Q.: Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888 (2018)
 [21] Zou, D., Gu, Q.: An improved analysis of training overparameterized deep neural networks. In: Advances in Neural Information Processing Systems. pp. 2053–2062 (2019)
Appendix 0.A Supplementary proofs for Section 3.1
In this section we provide the remaining proofs of the results in Section 3.1. We begin by recalling the following matrix Chernoff inequality (see for example [15, Theorem 5.1.1]).
Theorem 0.A.1 (Matrix Chernoff)
Consider a finite sequence of independent, random, Hermitian matrices with . Let ; then for all
(16) 
In order to lower bound the smallest eigenvalue of we use Lemma 1 together with the previous concentration result.
Proof (Lemma 2)
We next upper bound the errors at initialization.
Proof (Lemma 3)
Note that for any , due to the assumption on the independence of the weights at initialization and the normalization of the data:
and similarly for the directional derivatives
We conclude the proof by using Jensen’s and Markov’s inequalities.
Appendix 0.B Proof of Proposition 1
Consider the matrices , and for define
and the matrix:
which corresponds to a column permutation of . Next, observe that the matrix is similar to and therefore has the same eigenvalues. In this section we lower bound by analyzing .
We begin by recalling some facts about the spectral properties of products of matrices.
Definition 2 ([8])
Let and be matrices in which each block is in . Then we define the block Hadamard product of and as the matrix with:
where denotes the usual matrix product between and .
Generalizing Schur’s lemma, one has the following result on the eigenvalues of the block Hadamard product of two block matrices.
Proposition 2 ([8])
Let and be positive semidefinite matrices. Assume that every block of commutes with every block of , then:
We finally recall the following fact on the eigenvalues of the Kronecker product of matrices.
Proposition 3 ([11])
Let with eigenvalues and with eigenvalues ; then the Kronecker product of and has eigenvalues .
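A quick numerical check of Proposition 3: the spectrum of a Kronecker product consists of all pairwise products of the factors' eigenvalues.

```python
import numpy as np

# Verify that eig(kron(A, B)) equals the pairwise products of eig(A), eig(B),
# using small random symmetric matrices so that all spectra are real.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)); A = A + A.T
B = rng.normal(size=(2, 2)); B = B + B.T

ev_kron = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
ev_prod = np.sort(np.outer(np.linalg.eigvalsh(A),
                           np.linalg.eigvalsh(B)).ravel())
print(np.max(np.abs(ev_kron - ev_prod)))   # zero up to floating point
```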
We next define the following random kernel matrix.
Definition 3
The next result from [12] establishes positive definiteness of this matrix in expectation, under the separation condition (7).
Finally, let be the block matrix with blocks . Thanks to assumption (8), the following result on the Gram matrices holds.
Lemma 6
Assume that condition (8) is satisfied; then for any we have .
Proof
The claim follows by observing that, by Gershgorin’s disk theorem:
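A small numerical illustration of the Gershgorin bound used here: every eigenvalue of a square matrix lies in some disk centered at a diagonal entry with radius the off-diagonal absolute row sum, so for a symmetric matrix the smallest eigenvalue is at least min_i (A_ii − Σ_{j≠i} |A_ij|).

```python
import numpy as np

# Gershgorin lower bound on the smallest eigenvalue of a symmetric,
# diagonally dominant matrix, compared with the exact spectrum.
rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n)) * 0.1
A = A @ A.T + 2.0 * np.eye(n)   # symmetric with dominant diagonal

radii = np.abs(A).sum(1) - np.abs(np.diag(A))
gershgorin_lb = np.min(np.diag(A) - radii)
print(np.linalg.eigvalsh(A).min() >= gershgorin_lb)   # True by the theorem
```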