In 2014, Hinton et al. [hinton15] made a surprising observation: they found it easier to train classifiers using the real–valued outputs of another classifier as target values than using actual ground–truth labels. They introduced the term knowledge distillation
(or distillation for short) for this phenomenon. Since then, distillation–based training has been confirmed robustly in several different types of neural networks [chen2017learning, yim2017gift, yu2017visual]. It has been observed that optimization is generally better behaved than with label-based training, and that it needs less regularization, if any, and fewer specific optimization tricks. Consequently, in several fields, distillation has become a standard technique for transferring information between classifiers with different architectures, such as from deep to shallow neural networks or from ensembles of classifiers to individual ones.
While the practical benefits of distillation are beyond doubt, its theoretical justification remains almost completely unclear. (This is the reason for our title, a play on Eugene Wigner’s famous article "The Unreasonable Effectiveness of Mathematics in the Natural Sciences".) Existing explanations rarely go beyond qualitative statements, e.g. claiming that learning from soft labels should be easier than learning from hard labels, or that in a multi-class setting the teacher’s output provides information about how similar different classes are to each other. The recent paper [phuong19] is the sole exception (as far as we are aware), but its analysis is restricted to the setting of linear networks.
We carry out the first theoretical analysis of distillation in the setting of a simple two–layer non–linear neural network. Our analysis is carried out in the model underlying the exciting recent work that analyzes the dynamics of neural networks under the so–called kernel regime, see [Arora19, DuH19, CaoG19a, mei2019mean] and others. In this series of papers, it was shown that the behaviour of training by gradient descent (GD) in the limit of very wide neural networks can be approximated by linear system dynamics. This is dubbed the kernel regime because it was shown in [JFH18] that a fixed kernel – the neural tangent kernel – characterizes the behavior of fully-connected infinite width neural networks in this regime.
In the same model and regime, we prove the first theoretical results on knowledge distillation for non-linear networks. Our framework is general enough to encompass Vapnik’s notion of privileged information and provides a unified analysis of generalized distillation in the paradigm of machines teaching machines as in [LopezPaz15]. We give results both on what is learnt by the student network (in section 2.5) and on the speed of convergence (in section 2.6). Intriguingly, we also theoretically confirm the lottery ticket hypothesis [FrankleC19] in this model and regime as a special case of our analysis (in section 2.5.1). We introduce novel techniques to make this analysis possible. We also carry out a systematic experimental evaluation that confirms the theoretical analysis.
2 Problem Formulation and Main Results
Unlike most previous work, which studies distillation in various complex architectures, we focus our attention on a simplified, analytically tractable setting: the two–layer non–linear model introduced in [Arora19, du2018gradient, DuH19, CaoG19a]:
$$f(x) = \sum_{k=1}^{m} \frac{a_k}{\sqrt{m}}\,\sigma\left(w_k^{T} x\right). \qquad (1)$$
Here the weights $w_k$ are variables of the network corresponding to the $m$ hidden units, $\sigma$ is a real (nonlinear) activation function and the weights $a_k$ are fixed. In our theoretical analysis, the student network has this form while the teacher can be any classifier. In our experimental analysis both the teacher and student networks have this form; the student just has many fewer nodes in the hidden layer.
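As an illustration of the two-layer model described above, the following minimal sketch evaluates a network of the form used in the cited works, f(x) = (1/sqrt(m)) * sum_k a_k * sigma(w_k . x). The 1/sqrt(m) scaling, the tanh activation, the random data and all names (`two_layer_forward`, `X`, `W`, `a`) are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def two_layer_forward(X, W, a, sigma=np.tanh):
    """Evaluate f(x_i) = (1/sqrt(m)) * sum_k a_k * sigma(w_k . x_i) for every row x_i of X."""
    m = W.shape[0]                    # number of hidden units
    hidden = sigma(X @ W.T)           # hidden features, shape (n, m)
    return hidden @ a / np.sqrt(m)    # network outputs, shape (n,)

# Hypothetical toy data: n unit-norm inputs in d dimensions, m hidden units.
rng = np.random.default_rng(0)
n, d, m = 8, 5, 20
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(m, d))            # trainable first-layer weights w_k
a = rng.choice([-1.0, 1.0], size=m)    # fixed second-layer weights a_k (assumed +/-1)
outputs = two_layer_forward(X, W, a)
```

Only `W` would be updated during training in this sketch; `a` stays fixed, matching the description of the model.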
We introduce a general optimization framework for knowledge transfer, formulating it as a least squares optimization problem with regularization provided by privileged knowledge. Given a dataset comprising data samples and their corresponding labels, our framework is given by the following optimization problem
where is stated in (1) and is the corresponding hidden feature of the student network. Here, we assume a given function for each hidden unit as privileged knowledge. We observe that (2) incorporates this knowledge as a regularization of the original nonlinear least squares (average risk minimization) framework for fitting a function to the labeled data: The coefficient in (2) is the regularization parameter. A special case of this setup concerns knowledge distillation where are hidden features of a teacher network, i.e. framing (2) as a standard student-teacher scheme for knowledge distillation.
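To make the structure of this objective concrete, here is a hedged sketch of a loss with a label-fitting term plus a regularizer that pulls each student hidden feature towards a given privileged-knowledge function. The exact constants, the factor 1/2, the tanh activation and the names (`transfer_loss`, `Phi`, `lam`) are assumptions for illustration; the formulation in (2) may scale the terms differently.

```python
import numpy as np

def transfer_loss(X, y, W, a, Phi, lam, sigma=np.tanh):
    """Assumed form: 0.5 * sum_i (y_i - f(x_i))^2 + 0.5 * lam * sum_{i,k} (Phi[i,k] - h_k(x_i))^2."""
    m = W.shape[0]
    hidden = sigma(X @ W.T)                 # student hidden features h_k(x_i), shape (n, m)
    f = hidden @ a / np.sqrt(m)             # student outputs
    fit = 0.5 * np.sum((y - f) ** 2)        # least-squares fit to the labels
    reg = 0.5 * lam * np.sum((Phi - hidden) ** 2)  # privileged-knowledge regularizer
    return fit + reg

rng = np.random.default_rng(0)
n, d, m = 10, 4, 6
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

Phi_same = np.tanh(X @ W.T)   # privileged features identical to the student's own features
loss_no_reg = transfer_loss(X, y, W, a, Phi_same, lam=0.0)
loss_matched = transfer_loss(X, y, W, a, Phi_same, lam=5.0)
loss_shifted = transfer_loss(X, y, W, a, Phi_same + 0.1, lam=5.0)
```

When the privileged features coincide with the student's features the regularizer vanishes, so `loss_matched` equals `loss_no_reg`; shifting every target by 0.1 adds 0.5 * 5 * n * m * 0.01 to the loss.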
2.1 Relation to Previous Work
To the best of our knowledge, [phuong19] is the only previous attempt to give a theoretical analysis of distillation. In the setting of (2), their work corresponds to pure distillation with a single hidden unit, sigmoid activation, and cross-entropy replacing the squared-error loss, hence leading to a convex objective. Their result concerns the final value and is expressed in terms of the weight values. Convexity allows [phuong19] to avoid our assumption on the initial weights in Theorems 1 and 2, as the final solution is independent of initialization. Our result is applicable to a different regime with a large number of units, high expression capacity and a non-convex formulation.
To gain a deeper insight into the process of knowledge transfer by (2), we study the generic behavior of the gradient descent (GD) algorithm when applied to the optimization therein. In the spirit of the analysis in [du2018gradient, Arora19], briefly explained in Section 3, we carry out an investigation of the dynamics of GD that answers two fundamental questions:
What does the (student) network learn?
How fast is the convergence by the gradient descent?
The answer to both these questions emerges from the analysis of the dynamics of GD.
2.2 General Framework
The existing analysis of dynamics for neural networks in a series of recent papers [Arora19, DuH19, CaoG19a] is tied centrally to the premise that the behaviour of GD for the optimization can be approximated by a linear dynamics of finite order. To isolate the negligible effect of learning rate in GD, it is also conventional to study the case
where GD is alternatively represented by an ordinary differential equation (ODE), known as the gradient flow, with a continuous "time" variable replacing the iteration number (being equivalent to the limit of ). Let us denote by
the vector of the output of the network at time . Then, the theory of linear systems with a finite order suggests the following expression for the evolution of :
where is a constant and
Here, is the order of the linear system and complex-valued vectors and nonzero complex values are to be determined by the specifications of the dynamics. The constants are called poles, that also correspond to the singular points of the Laplace transform of (except for , which corresponds to the constant in our formulation). We observe that such a representation may only have a convergence (final) value at if the poles have strictly positive real parts, in which case is the final value. Moreover, the asymptotic rate of convergence is determined by the dominating term in (4), i.e. the smallest value with a nonzero vector . We observe that identifying and the dominating term responds to the aforementioned questions of interest. In this paper, we show that these values can be calculated as the number of hidden units increases.
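The role of the poles can be illustrated on the simplest linear system of this kind, the label-only gradient flow df/dt = -H (f - y) with a symmetric positive definite matrix H. This is a sketch under stated assumptions: the ODE, the random `H`, and the names (`linear_flow`, `f0`) are illustrative and stand in for the more general dynamics analyzed in the paper.

```python
import numpy as np

def linear_flow(H, y, f0, t):
    """Closed-form solution of df/dt = -H (f - y), f(0) = f0, for symmetric H."""
    poles, V = np.linalg.eigh(H)           # eigenvalues of H act as the poles
    coeffs = V.T @ (f0 - y)                # initial error expressed in the eigenbasis
    return y + V @ (np.exp(-poles * t) * coeffs)

rng = np.random.default_rng(0)
n = 6
B = rng.normal(size=(n, n))
H = B @ B.T + 0.5 * np.eye(n)              # symmetric, eigenvalues at least 0.5
y = rng.normal(size=n)
f0 = np.zeros(n)

f_start = linear_flow(H, y, f0, 0.0)       # equals the initial value f0
f_late = linear_flow(H, y, f0, 200.0)      # all exponentials have decayed, so this is close to y
slowest_pole = np.linalg.eigvalsh(H)[0]    # smallest pole sets the asymptotic rate
```

In this toy case the final value is `y` and the error eventually decays like exp(-slowest_pole * t), provided the initial error has a nonzero component along the corresponding eigenvector.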
We first need to introduce a number of definitions. Let us take as the initial values of the weights and define as the realization of the "associated Gram matrix" where denotes the derivative function of (that can be defined in the distribution sense). Further, denote by the vector of the initial values of the unit for different data points . Finally, we define .
Our analysis will also be built upon a number of assumptions:
Assumption 1. Nonzero eigenvalues of the matrices are all distinct. Note that they are always strictly positive as are by construction positive semi-definite (psd).
Assumption 2. is the eigenvalue of the matrix at exactly distinct strictly negative values of , where s are all different from the eigenvalues of , with and being the corresponding right and left eigenvectors.
Assumption 3. The function and its derivative are Lipschitz continuous.
Assumption 4. We assume and and . Moreover, s and s are bounded.
2.5 What does the student learn?
This result pertains to the first question above, concerning the final value of . For this, we prove the following result:
Theorem 1. Suppose that Assumptions 1-4 hold. Then, where
To gain a deeper insight, we specialize this result to the case where the teacher is a well-trained network with a similar structure as in (1) with units and coefficients . In this setup, the constant is a correction term and the output of the teacher at is exactly for (e.g. by the results in [du2018gradient]). Then, the privileged knowledge is extracted by randomly selecting indices and setting and . Then, Theorem 1 leads to the following result:
Theorem 2. Suppose that the student is initialized by , selected by the above randomized scheme and . Moreover, is bounded with . Then, the output after training satisfies:
with a probability higher than , where is a universal constant.
In simple words, the final error in the student will be proportional to , confirming the intuition that the error grows with a smaller student.
2.5.1 Relation to Lottery Ticket Hypothesis
We further observe an interesting connection between the setup in Theorem 2 and the lottery ticket hypothesis in [FrankleC19]: When (no distillation), the training procedure on the student network can be interpreted as the retraining of the randomly selected features of the teacher. This coincides with the lottery ticket setup. As such, our analysis in Theorem 2 with shows that re-training in the kernel limit with a fixed fraction of features leads to zero training loss, confirming the lottery ticket hypothesis.
2.6 How fast does the student learn?
Now, we turn our attention to the question of the speed of convergence, for which we have the following result:
Theorem 3. Under Assumptions 1-4, the dynamics of can be written as
(Footnote: we clarify that holds in sense, i.e. means .)
From the definition, can be interpreted as an average overlap between a combination of the label vector and the knowledge vectors , and the "spectral" structure of the data as reflected by the vectors . This is a generalization of the geometric argument in [Arora19], on par with the "data geometry" concept introduced in [phuong19]. We will later use this result in our experiments to improve distillation by modifying the data geometry of coefficients.
2.7 Further Remarks
The above two results have a number of implications on the distillation process. First, note that the case reproduces the results in : The final value simply becomes
while the poles will become the singular values of the matrix. The other extreme case of corresponds to pure distillation, where the effect of the first term in (2) becomes negligible and hence the optimization boils down to individually training each hidden unit by . One may then expect the solution of this case to be . However, the conditions of the above theorems become difficult to verify in this case, so we instead present experiments below that numerically investigate the corresponding dynamics.
We also observe that for a finite value of , the final value is a weighted average, depending on the quality of . Defining as the final error, we simply conclude that , where the term reflects the quality of teacher in representing the labels. Also for an imperfect teacher the error monotonically increases with , while larger generally has a positive effect on the speed of convergence as it shifts the poles to become larger. For , this leads to a trade-off between speed and quality. The result of Theorem 1 also stems from the probabilistic analysis of in the lottery ticket setup.
The analysis of Theorem 3 gives us another way to assess the effect of the teacher. If the teachers have a small overlap with the "eigen-basis" corresponding to small values of , then the dynamics is mainly identified by the large poles , speeding up the convergence properties.
Finally, the assumption can be simply satisfied in the distillation case with single hidden layer, where we have and initializing the weights of the student by that of the teacher leads to . We further numerically investigate the consequences of violating this assumption.
We observe that the above elements can be simply investigated in the context of distillation with the lottery ticket principle in [frankle2018lottery]. This is what we mainly study in the numerical experiments.
3 Analysis and Insights
The study in [du2018gradient] on the dynamics of backpropagation (BP) serves as our main source of inspiration, which we review first. The point of departure in this work is to represent the dynamics of BP or gradient descent (GD) for the standard risk minimization, as in (2) and (1) with . In this case, the associated ODE to GD reads:
where are respectively the vectors of and , calculated in (1) by replacing . Moreover, the matrix consists of as its column. While the dynamics in (7) is generally difficult to analyze, we identify two simplifying ingredients in the study of [du2018gradient]. First, it turns the attention from the dynamics of weights to the dynamics of the function, as reflected by the following relation:
where is a short-hand notation for and . The second element in the proof can be formulated as follows:
Kernel Hypothesis (KH): In the asymptotic case of , the dynamics of has a negligible effect, such that it may be replaced by , resulting in linear dynamics.
The reason for our terminology of the KH is that under this assumption, the dynamics of BP resembles that of a kernel regularized least squares problem. The investigation in [du2018gradient] further establishes KH under mild assumptions and further notes that for random initialization of weights concentrates on its mean value, denoted by .
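The concentration of the Gram matrix at random initialization can be checked numerically. The sketch below assumes the ReLU setting of [du2018gradient]: unit-norm inputs, Gaussian first-layer weights, second-layer weights in {-1, +1}, and a Gram matrix H_ij = (x_i . x_j) * (1/m) * sum_k a_k^2 * 1[w_k . x_i > 0] * 1[w_k . x_j > 0], whose mean is (x_i . x_j) * (pi - arccos(x_i . x_j)) / (2 pi). The function names and the toy sizes are hypothetical.

```python
import numpy as np

def relu_gram(X, W, a):
    """Empirical Gram matrix at initialization for a ReLU network (assumed form, see lead-in)."""
    active = (X @ W.T > 0).astype(float)     # ReLU derivative per (sample, unit), shape (n, m)
    weighted = active * a                     # scale column k by a_k
    return (X @ X.T) * (weighted @ weighted.T) / W.shape[0]

def relu_gram_limit(X):
    """Mean value of the Gram matrix for unit-norm inputs and standard Gaussian weights."""
    cos = np.clip(X @ X.T, -1.0, 1.0)
    return cos * (np.pi - np.arccos(cos)) / (2.0 * np.pi)

rng = np.random.default_rng(0)
n, d, m = 5, 3, 20000
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

H_emp = relu_gram(X, W, a)
H_inf = relu_gram_limit(X)
max_gap = np.max(np.abs(H_emp - H_inf))      # shrinks as the width m grows
```

With a wide layer the empirical matrix is already close to its mean, which is the sense in which the random Gram matrix "concentrates".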
3.1 Dynamics of Knowledge Transfer
Following the methodology of [du2018gradient], we proceed by providing the dynamics of the GD algorithm for the optimization problem in (2) with . Direct calculation of the gradient leads us to the following associated ODE for GD:
where are similar to the previous case in (7). Furthermore, are respectively the vectors of and . We may now apply the methodology of [du2018gradient] to obtain the dynamics of the features. We also observe that unlike this work, the hidden features explicitly appear in the dynamics:
where is a block matrix with as its block ( denotes the Kronecker delta function).
3.2 Dynamics Under Kernel Hypothesis
Now, we follow [du2018gradient] by simplifying the relation in (3.1) under the kernel hypothesis, which in this case assumes the matrices to be fixed to their initial values , leading again to linear dynamics:
Despite similarities with the case in [du2018gradient, Arora19], the relation in (14) is not simple to analyze due to the asymmetry in and the complexity of its eigen-structure. For this reason, we proceed by taking the Laplace transform of (3.1) (assuming ) which after straightforward manipulations gives:
where are respectively the Laplace transforms of . Hence, is given by taking the inverse Laplace transform of . Note that by construction, is a rational function, which shows the finite order of the dynamics. To find the inverse Laplace transform, we only need to find the poles of . These poles can only be either among the eigenvalues of or the values where the matrix becomes rank deficient. Under Assumptions 1 and 2, we may conclude that the poles are only , which gives the result in Theorems 1 and 2. More details of this approach can be found in the supplement, where the kernel hypothesis for this case is also rigorously proved.
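As a worked illustration of the Laplace-transform argument, consider the simpler label-only linear dynamics with a fixed symmetric positive semi-definite matrix H having eigenpairs (\lambda_j, v_j). This is a sketch under assumed notation (H, y, f(0), v_j, \lambda_j are chosen for illustration) and not the full block system treated in the paper.

```latex
% Label-only case under the kernel hypothesis; notation assumed for illustration.
\begin{align*}
\frac{d f(t)}{dt} &= -H\bigl(f(t) - y\bigr), \\
s F(s) - f(0) &= -H F(s) + \frac{1}{s} H y, \\
F(s) &= (sI + H)^{-1}\Bigl(f(0) + \frac{1}{s} H y\Bigr)
      = \frac{y}{s} + \sum_{j} \frac{v_j v_j^{T}\bigl(f(0) - y\bigr)}{s + \lambda_j}, \\
f(t) &= y + \sum_{j} e^{-\lambda_j t}\, v_j v_j^{T}\bigl(f(0) - y\bigr).
\end{align*}
```

The rational function F(s) is singular at s = 0 and at s = -\lambda_j, so in this toy case the decay rates are the eigenvalues of H and, when H is positive definite, the final value is y. The paper's sign convention, in which poles with positive real part give convergence, corresponds to the negatives of these singular points.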
4 Experimental Results
We perform our numerical analysis on CIFAR-10, a commonly used dataset for validating deep neural models. This dataset is also used for the experiments in [Arora19]. As in [Arora19], we only look at the first two classes and set the label if the image belongs to the first class and if it belongs to the second class. The images are normalized such that for all .
The weights in our model are initialized as follows:
For optimization, we use (full batch) gradient descent with the learning rate . In our experiments we set similar to [Arora19].
In all of our experiments we use 100 hidden neurons for the teacher network and 20 hidden neurons for the student network.
4.1 Dynamics of knowledge transfer
In this section we study knowledge transfer and distillation in different settings. We first consider a finite regularization in Eq. 2 by setting . Figure 1 shows the dynamics of the results in different settings, i) no teacher, i.e., the student is independently trained without access to a teacher, ii) student training, where the student is trained by both the teacher and the true labels according to Eq. 2, and iii) the teacher, trained by only the true labels. For each setting, we illustrate the training loss (Figure 1(a)) and the test loss (Figure 1(b)). Note that true labels are the same for the teacher and the students. Teacher shows the best performance because of its considerably larger capacity. On the other hand, we observe, i) for the student with access to the teacher its performance is better than the student without access to the teacher. This corresponds to the result in Theorem 1 and the discussion in section 2.7, where the final performance of the student is shown to be improved by the teacher. ii) the convergence rate of the optimization is significantly faster for the student with teacher compared to the other alternatives. This confirms the prediction of Theorem 3. This experiment implies the importance of a proper knowledge transfer to the student network via the information from the teacher.
In the following, we study two special cases of the generic formulation in Eq. 2 where and . As discussed in section 2.5.1, the case corresponds to the lottery ticket setup. Figure 2 compares these two extreme cases with the student with and the teacher w.r.t. training loss (Figure 2(a)) and test loss (Figure 2(b)). We observe that the student with a finite regularization () outperforms the two other students in terms of both convergence rate (optimization speed) and the quality of the results. In particular, when the student is trained with and it is initialized with the weights of the teacher, then the generic loss in Eq. 2 equals 0. This renders the student network to keep its weights unchanged for and the performance remains equal to that of the privileged knowledge without the label input.
4.2 Dynamics of knowledge transfer with imperfect teacher
In this section, we study the impact of the quality of the teacher on the student network. We consider the student-teacher scenario in three different settings, i) perfect teacher where the student is initialized with the final weights of the teacher and uses the final teacher outputs in Eq. 2, ii) imperfect teacher where the student is initialized with the intermediate (early) weights of the teacher network and uses the respective intermediate teacher outputs in Eq. 2, and iii) no student initialization where the student is initialized randomly but uses the final teacher outputs. In all the settings, we assume .
Figure 3 shows the results for these three settings, respectively w.r.t. training loss (Figure 3(a)) and test loss (Figure 3(b)). We observe that initializing and training the student with the perfect (fully trained) teacher yields the best results in terms of both quality (training and test loss) and convergence rate (optimization speed). This observation verifies our theoretical analysis on the importance of initialization of the student with fully trained teacher, as the student should be very close to the teacher.
4.3 Kernel embedding
In order to provide the teacher and the student with more relevant information and to study the role of the data geometry, as stated by Theorem 3, we can use properly designed kernel embeddings. Specifically, instead of using the original features for the networks, we could first learn an optimal kernel which is highly aligned with the labels in training data, implicitly improving the combination of in Theorem 3 and then we feed the features induced by that kernel embedding into the networks (both student and teacher).
For this purpose, we employ the method proposed in [CortesMohri2012] that develops an algorithm to learn a new kernel from a group of kernels according to a similarity measure between the kernels, namely centered alignment. Then, the problem of learning a kernel with a maximum alignment between the input data and the labels is formulated as a quadratic programming (QP) problem. The respective algorithm is known as alignf [CortesMohri2012].
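The centered alignment mentioned above can be sketched as follows. The definition used here, the normalized Frobenius inner product of the two centered kernel matrices, is the standard one; the Gaussian kernel, the toy data and the function names are assumptions for illustration, and the sketch does not implement the alignf QP itself.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix: K_c = C K C with C = I - (1/n) 1 1^T."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def centered_alignment(K1, K2):
    """Frobenius inner product of the centered kernels, normalized by their Frobenius norms."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, -1.0])

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K_x = np.exp(-0.5 * sq_dists)      # one hypothetical Gaussian base kernel
K_y = np.outer(y, y)               # target kernel built from the labels

self_align = centered_alignment(K_y, K_y)   # a kernel is perfectly aligned with itself
cross_align = centered_alignment(K_x, K_y)  # alignment of the base kernel with the labels
```

A kernel learning method of this type searches over non-negative combinations of base kernels for the one whose alignment with the label kernel is largest.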
Let us denote by the centered variant of a kernel matrix . To obtain the optimal combination of the kernels (i.e., a weighted combination of some base kernels), [CortesMohri2012] suggests the objective function to be centered alignment between the combination of the kernels and , where is the true labels vector. By restricting the weights to be non-negative, a QP can be formulated as
is the number of the base kernels and for , and finally is a vector wherein for . If is the solution of the QP, then the vector of kernel weights is given by [CortesMohri2012, Gonen2011]
Using this algorithm we learn an optimal kernel based on seven different Gaussian kernels. Then, we need to approximate the kernel embeddings. To do so, we use the Nyström method [Nystroem]. Then we feed the approximated embeddings to the neural networks. The results in Figure 4 clearly show that using the kernel embeddings as inputs to the neural networks, helps both teacher and student networks in terms of training loss (Figure 4(a)) and test loss (Figure 4(b)).
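The Nyström approximation step can be sketched as below. This is a minimal NumPy version for a single Gaussian kernel, assuming a fixed bandwidth `gamma` and a hand-picked set of landmark points; the paper's pipeline uses a learned combination of seven Gaussian kernels, so the kernel, the parameters and the function names here are illustrative only.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian kernel matrix exp(-gamma * ||a_i - b_j||^2) between rows of A and B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystroem_embedding(X, landmarks, gamma, eps=1e-10):
    """Return Z with Z Z^T = K_nm K_mm^+ K_mn, the Nystroem approximation of the kernel."""
    K_mm = gaussian_kernel(landmarks, landmarks, gamma)
    K_nm = gaussian_kernel(X, landmarks, gamma)
    evals, evecs = np.linalg.eigh(K_mm)
    keep = evals > eps                       # drop numerically null directions
    # Equivalent to K_nm @ K_mm^{-1/2} restricted to the kept eigen-directions.
    return K_nm @ evecs[:, keep] / np.sqrt(evals[keep])

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
gamma = 0.5

Z_small = nystroem_embedding(X, X[:10], gamma)   # 10 landmarks: low-dimensional embedding
Z_full = nystroem_embedding(X, X, gamma)         # all points as landmarks: recovers the kernel
K_exact = gaussian_kernel(X, X, gamma)
```

The rows of `Z_small` would then replace the raw inputs when feeding the networks; with all points used as landmarks, `Z_full @ Z_full.T` reproduces the exact kernel up to numerical error.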
4.4 Spectral analysis
Here, we investigate the overlap parameter of different networks, where we compute a simplified but conceptually consistent variant of the overlap parameter in Theorem 3. For a specific network, we consider the normalized columns of matrix (as defined in Eq. 2) corresponding to the nonlinear outputs of the hidden neurons, and compute the dot product of each column with the top eigenvectors of , and take the average. We repeat this for all the columns and depict the histogram. For a small value of , the resulting values are approximately equal to in Theorem 3.
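The simplified overlap computation described above might look like the following sketch. Several details are assumptions, since the text does not fix them: the dot products are averaged in absolute value, `num_top` eigenvectors of a symmetric matrix `H` are used, and the data are random placeholders rather than trained network features.

```python
import numpy as np

def overlap_values(features, H, num_top):
    """Average absolute dot product of each normalized feature column with the top eigenvectors of H."""
    cols = features / np.linalg.norm(features, axis=0, keepdims=True)
    evals, evecs = np.linalg.eigh(H)              # eigenvalues in ascending order
    top = evecs[:, -num_top:]                     # eigenvectors of the largest eigenvalues
    return np.mean(np.abs(top.T @ cols), axis=0)  # one overlap value per hidden unit

rng = np.random.default_rng(0)
n, m = 12, 7
B = rng.normal(size=(n, n))
H = B @ B.T                                       # placeholder symmetric PSD matrix
features = rng.normal(size=(n, m))                # placeholder hidden-unit outputs
features[:, 0] = 3.0 * np.linalg.eigh(H)[1][:, -1]  # first unit aligned with the top eigenvector

vals = overlap_values(features, H, num_top=1)
counts, edges = np.histogram(vals, bins=10)       # histogram of the overlap values
```

A column that is parallel to the top eigenvector gets overlap 1, while generic columns get smaller values, which is what the histograms are meant to visualize.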
Figure 5 shows such histograms for two settings. In Figure 5(a) we compare the overlap parameter for two teachers, one trained partially (imperfect teacher) and the other trained fully (perfect teacher). We observe that the overlap parameter is larger for the teacher trained perfectly, i.e., there is more consistency between its outputs and the matrix . This analysis is consistent with the results in Figure 3, which demonstrate the importance of a fully trained (perfect) teacher. In Figure 5(b), we show that this improvement is transferred to the student via distillation.
We confirm this point further: if the teacher learns better representations via kernels, this is also transferred by distillation to the student.
4.5 Kernel spectral analysis
Finally, we perform a similar analysis to section 4.4, where we compute the overlap parameter for a teacher (Figure 6(a)) and a student (Figure 6(b)) trained with the representations from an optimal kernel described in section 4.3. The histograms show that the overlap parameter for a teacher trained with kernel embeddings is larger than the teacher trained with original features. We also observe that this better overlap parameter, i.e., higher consistency between and , is transferred to the student trained with the same kernel representations.
5 Related Work
In its current and most widely known form, distillation was introduced by Hinton et al. [hinton15] with the aim of compressing neural networks. Since then, distillation has quickly gained popularity among practitioners and established its place in deep learning folklore. It has been found to work well across a wide range of applications.
In contrast to its empirical success, a rigorous theoretical understanding of the principles underlying the effectiveness of distillation has largely remained a mystery. Until recently, Lopez-Paz et al. [LopezPaz15] was the only work that tried to examine distillation from a theoretical perspective. It casts distillation as a form of Vapnik’s notion of learning using privileged information, a learning setting in which additional per-instance information is available at training time but not at test time. Their paper is more a heuristic argument for the effectiveness of distillation with respect to generalization error than a rigorous analysis. Very recently, Phuong and Lampert [phuong19] made the first attempt to analyze distillation in a simple model. In their setting, both the teacher and the student are linear classifiers (although the student’s weight vector is allowed an over-parametrized representation as a product of matrices). They give conditions under which the student’s weight vector converges (approximately) to that of the teacher and derive consequences for generalization error. Crucially, their analysis is limited to linear networks while we analyze distillation in the context of non-linear networks.
A series of recent papers [Arora19, DuH19, CaoG19a] achieved breakthroughs in understanding how (infinitely) wide neural network training behaves in the so-called kernel regime. In this regime, the dynamics of training by gradient descent can be approximated by the dynamics of a linear system. In this paper, we extend the repertoire of the methods that can be applied in such settings.
6 Conclusion
We give the first theoretical analysis of knowledge distillation for non–linear neural networks in the model and regime of [Arora19, DuH19, CaoG19a]. We provide results both on what is learnt by the student and on the speed of convergence. As an intriguing side result, we also confirm the "lottery ticket hypothesis" [FrankleC19] in this model and regime. Our numerical studies further confirm our theoretical findings on the role of data geometry and distillation in the final performance of the student.
7 Supplementary Material
In this long version we have made a few clarifications compared to the submitted version, according to the following corrections.
After (6 in paper), must be replaced by .
In Assumption 2, the number of values must be changed to .
In Assumption 3, the derivative function must also be Lipschitz.
In Assumption 4, can be changed to . We also assume that remains bounded.
Two more assumptions are added: s and s are bounded.
We clarify that in Theorem 2, holds in sense, i.e. means .
In Assumption 2, we found out that is symmetric. Hence , but we keep the expression unchanged.
Theorem 2 also needs the assumption that is bounded (it can be incorporated in , but the theorem does not reflect the dependency on ).
is the largest singular value and means 2-norm of vectors.
7.3 Proof of Theorem 1 and 3 in Paper
We continue the discussion in (15 of paper). Note that the values correspond to the points where . We also observe that these values correspond to the negative of eigenvalues of the matrix . We conclude that under assumption 2, the eigenvalues of are distinct and strictly positive, hence this matrix is diagonalizable. Now, we write and state the following lemma:
Suppose that is a diagonalizable matrix with strictly positive eigenvalues and denote its smallest eigenvalue by . Take as a matrix-valued function of the continuous variable such that for a given fixed value of
Let denote the solution to (13 in paper) with . Then,
Consider the iteration that generates a sequence of functions for where and is the solution to
with , which can also be written as
We observe that on the interval is a contraction map under norm as we have
and hence by the triangle inequality on the Hilbert space of functions, we get
we conclude that
which shows that is a contraction. Then, from Banach fixed-point theorem we conclude that converges uniformly on the interval to the fixed-point of , which coincides with the solution of (13 in paper). Moreover,
Now, we observe that
which completes the proof. ∎
Now, we state two results that connect to the change of :
Under Assumption 3, the following relation holds:
Take an arbitrary block vector with and note that
where . We obtain the desired result by observing that
Next, we show
where is the maximal eigenvalue of the data matrix and is the largest of the Lipschitz constants of .
Note that since is symmetric, we have (e.g. by eigen-decomposition)
Taking an arbitrary normalized , we observe that
On the other hand,
where . Hence,
We also observe that
and from Lipschitz continuity,
We conclude that
Similarly, we obtain
which completes the proof. ∎
We finally connect the magnitude of the change to :
With the same definitions as in Lemma 3, we have
From a similar argument as in Lemma 3, we have
which completes the proof. ∎
We may now proceed to the proof of Theorem 1 and 3. Define
Note that is nonempty as and open since is continuous. We show that for sufficiently large , . Otherwise is an open interval where . For any , we have from Lemma 1
Denote and and define