Transferring knowledge learned by a powerful model ("teacher") to a simpler model ("student") has become a recurring theme in machine learning. The goal of knowledge transfer is to have the teacher guide the learning process of the student, so as to achieve high prediction accuracy or low sample complexity that would otherwise be hard for the student to attain by itself. This learning paradigm is practically useful when simpler models must be deployed in real-world systems that require a small memory footprint or fast processing time.
We focus on two specific settings of knowledge transfer in this work. The first is learning using privileged information (LUPI) (Vapnik and Vashist, 2009), in which the teacher provides the student with an additional set of features during training but not at test time; this extra feature set contains richer information that makes the learning problem easier for the student. For example, the "student may normally only have access to the image of a biopsy to predict the existence of cancer, but during the training process, it also has access to the medical report of an oncologist" (Lopez-Paz et al., 2015). The second setting is distillation (Ba and Caruana, 2014; Hinton et al., 2015), in which the teacher and the student have access to the same samples (and the same features); the teacher learns a complex model (e.g., a deep neural network) that provides soft targets (e.g., decision values) for the student (e.g., a shallow neural network) to mimic directly during training. Lopez-Paz et al. (2015) have unified the two settings as generalized distillation, which facilitates learning from multiple machines and data representations. We take the same unified view as Lopez-Paz et al. (2015) in this work and analyze the different settings in a common learning framework, while using intuitions and assumptions specific to LUPI or distillation when appropriate. While in LUPI the learning task is often supervised, the constraint that privileged information is only available during training resembles that of (unsupervised) multi-view representation learning (Wang et al., 2015), where the goal is to learn feature transformations from multiple measurements ("views") of the input data that are useful for downstream tasks (which are often supervised).
For example, for binary supervised learning (with the 0-1 loss), when $n$ training samples are used, the convergence rate is $\mathcal{O}(1/n)$ in the separable case (i.e., when there exists a perfect classifier), and the rate degrades to $\mathcal{O}(1/\sqrt{n})$ when the problem is not separable. However, when learning an SVM classifier in the non-separable case, if the teacher can supply the student with the slack variables associated with the optimal solution for each sample (thus setting the correct margin target for each sample), the learning objective becomes separable for the student and the faster $\mathcal{O}(1/n)$ rate can be restored. It has been hypothesized that a faster convergence rate can in general be obtained under LUPI (Lopez-Paz et al., 2015), although it is not clear under what conditions and with what learning method this is achievable.
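To make the slack-variable intuition concrete, here is an illustrative sketch (the toy data, the subgradient solver, and the parameter `lam` are our own choices, not from the text): we fit a soft-margin linear classifier on non-separable data, read off the slack of each sample at the solution, and check that every sample satisfies its slack-corrected margin constraint, so the corrected problem is "separable":

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-separable 1-D data: heavily overlapping class-conditional distributions.
n = 200
y = rng.choice([-1.0, 1.0], size=n)
x = y + 1.5 * rng.standard_normal(n)
X = np.stack([x, np.ones(n)], axis=1)        # add a bias feature

# Soft-margin objective: mean hinge loss + (lam/2)*||w||^2,
# minimized by plain subgradient descent (illustrative solver choice).
lam = 0.1
w = np.zeros(2)
for t in range(1, 5001):
    margins = y * (X @ w)
    g = -(y[margins < 1, None] * X[margins < 1]).sum(0) / n + lam * w
    w -= g / (lam * t)                       # standard 1/(lam*t) step size

# "Privileged" slacks of the learned solution: xi_i = max(0, 1 - y_i <w, x_i>).
xi = np.maximum(0.0, 1.0 - y * (X @ w))

# With the slacks revealed, every sample meets its corrected margin target,
# i.e. the problem is "separable" w.r.t. the relaxed constraints.
assert np.all(y * (X @ w) >= 1.0 - xi - 1e-9)
```

The final check holds by the definition of the slacks, which is precisely the point: once a teacher reveals the optimal slacks, the student faces margin constraints it can satisfy exactly.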
We can consider the normal feature set and the privileged information as two "views" of the input data. There exists a long line of research, both theoretical and empirical, studying the multi-view learning setting, supervised or not. The most popular intuition in multi-view learning is that the views need to agree, either on the features (Hermann and Blunsom, 2014; Wang et al., 2015; Fang et al., 2015) or on the predictions (Blum and Mitchell, 1998; Sindhwani et al., 2005; Rosenberg and Bartlett, 2007; Kakade and Foster, 2007; Balcan and Blum, 2010; Blum and Mansour, 2017). We draw heavily on inspiration from these prior works and show how, under reasonable assumptions on the model complexities for distillation and LUPI, agreement between the views helps reduce the search space for the student and therefore leads to faster convergence.
In particular, we analyze distillation and LUPI in the framework of learning linear predictors with convex, smooth and non-negative losses, which are commonly used in machine learning. We show that an improved rate is possible by being optimistic, that is, assuming the student can achieve low expected loss for distillation, and assuming a good predictor on the privileged information can be transformed into a good predictor on the regular feature set for LUPI. Under these assumptions, we perform regularized ERM where the regularization term measures the prediction discrepancy (squared difference) between the student and the teacher; this is an intuitive and easy-to-optimize regularizer that is often used in practice. The solution to regularized ERM achieves faster convergence than the one without regularization; equivalently, there exists a larger range of optimal loss values for which the student achieves the fast $\mathcal{O}(1/n)$ convergence. Interestingly, in the multi-view distillation and LUPI settings, the complexity control is achieved through the coordinate system defined by Canonical Correlation Analysis (Hotelling, 1936).
Bold lower case letters (such as $\mathbf{a}$, $\mathbf{b}$) denote vectors, and bold upper case letters (such as $\mathbf{A}$, $\mathbf{B}$) denote matrices. For vectors, subscripts index samples of a random vector, while superscripts index the coordinates. $\left[a^i\right]_{i=1}^{d}$ denotes the $d$-dimensional vector whose $i$-th coordinate is $a^i$. We use $\mathbf{0}$ and $\mathbf{I}$ to denote the all-zero vector and identity matrix respectively, whose dimensions can be inferred from the context. A convex function $f$ is $\beta$-smooth in $\mathbf{u}$ if $f(\mathbf{u}') \le f(\mathbf{u}) + \langle \nabla f(\mathbf{u}), \mathbf{u}' - \mathbf{u}\rangle + \frac{\beta}{2}\left\|\mathbf{u}' - \mathbf{u}\right\|^2$ for all $\mathbf{u}, \mathbf{u}'$, and it is $\sigma$-strongly convex in $\mathbf{u}$ if $f(\mathbf{u}') \ge f(\mathbf{u}) + \langle \nabla f(\mathbf{u}), \mathbf{u}' - \mathbf{u}\rangle + \frac{\sigma}{2}\left\|\mathbf{u}' - \mathbf{u}\right\|^2$ for all $\mathbf{u}, \mathbf{u}'$.
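As a quick sanity check of these definitions (our own illustration, not from the text), the function $f(\mathbf{u}) = \|\mathbf{u}\|^2$ is both $2$-smooth and $2$-strongly convex, so the quadratic upper and lower bounds coincide:

```python
import numpy as np

# f(u) = ||u||^2 has gradient 2u; its quadratic expansion is exact, so the
# smoothness upper bound and strong-convexity lower bound both hold with
# beta = sigma = 2.
def f(u):
    return float(u @ u)

def grad_f(u):
    return 2.0 * u

rng = np.random.default_rng(1)
beta = sigma = 2.0
for _ in range(100):
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    upper = f(u) + grad_f(u) @ (v - u) + beta / 2 * np.sum((v - u) ** 2)
    lower = f(u) + grad_f(u) @ (v - u) + sigma / 2 * np.sum((v - u) ** 2)
    assert f(v) <= upper + 1e-9   # beta-smoothness
    assert f(v) >= lower - 1e-9   # sigma-strong convexity
```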
1.1 A brief review of CCA
We briefly review Canonical Correlation Analysis (CCA, Hotelling, 1936), a classical method for measuring the correlation between two random vectors, as it plays an important role later. Let $\mathbf{x} \in \mathbb{R}^{d_x}$ and $\mathbf{z} \in \mathbb{R}^{d_z}$ be two random vectors with a joint probability distribution. The simultaneous formulation of CCA finds a set of $r$ directions (canonical directions) for each view, collected in matrices $\mathbf{U} \in \mathbb{R}^{d_x \times r}$ and $\mathbf{V} \in \mathbb{R}^{d_z \times r}$, such that projections of $(\mathbf{x}, \mathbf{z})$ onto these directions are maximally correlated:

$$\max_{\mathbf{U}, \mathbf{V}} \; \mathrm{tr}\left(\mathbf{U}^\top \boldsymbol{\Sigma}_{xz} \mathbf{V}\right), \qquad \text{s.t.} \quad \mathbf{U}^\top \boldsymbol{\Sigma}_{xx} \mathbf{U} = \mathbf{V}^\top \boldsymbol{\Sigma}_{zz} \mathbf{V} = \mathbf{I}, \qquad (1)$$

where the cross- and auto-covariance matrices are defined as

$$\boldsymbol{\Sigma}_{xz} = \mathbb{E}\left[\mathbf{x}\mathbf{z}^\top\right], \qquad \boldsymbol{\Sigma}_{xx} = \mathbb{E}\left[\mathbf{x}\mathbf{x}^\top\right], \qquad \boldsymbol{\Sigma}_{zz} = \mathbb{E}\left[\mathbf{z}\mathbf{z}^\top\right].$$

In this work, we assume for simplicity that the random vectors have zero mean ($\mathbb{E}[\mathbf{x}] = \mathbf{0}$ and $\mathbb{E}[\mathbf{z}] = \mathbf{0}$) and identity covariance ($\boldsymbol{\Sigma}_{xx} = \mathbf{I}$ and $\boldsymbol{\Sigma}_{zz} = \mathbf{I}$). The global optimum of (1), denoted by $(\mathbf{U}^*, \mathbf{V}^*)$, can be computed as follows. Let the full SVD of $\boldsymbol{\Sigma}_{xz}$ be

$$\boldsymbol{\Sigma}_{xz} = \mathbf{A} \mathbf{D} \mathbf{B}^\top,$$

where $\mathbf{A} \in \mathbb{R}^{d_x \times d_x}$ and $\mathbf{B} \in \mathbb{R}^{d_z \times d_z}$ are orthogonal, and $\mathbf{D} \in \mathbb{R}^{d_x \times d_z}$ contains the singular values (canonical correlations) on its diagonal:

$$1 \ge \rho_1 \ge \rho_2 \ge \cdots \ge \rho_{\min(d_x, d_z)} \ge 0.$$

Then the optimal directions are

$$\mathbf{U}^* = \mathbf{A}_r, \qquad \mathbf{V}^* = \mathbf{B}_r,$$

where $\mathbf{A}_r$ denotes the submatrix of $\mathbf{A}$ containing the first $r$ columns, and the optimal objective value is $\sum_{i=1}^{r} \rho_i$. We refer the readers to Gao et al. (2017) and Allen-Zhu and Li (2016) for efficient numerical procedures for computing the solution.
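A minimal numerical sketch of this procedure (the synthetic data and dimensions are our own choices): whiten each view so the auto-covariances are identity, matching the assumption above, then read off the canonical directions and correlations from the SVD of the cross-covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dz, r = 20000, 5, 4, 2

# Two correlated views: z is a random linear function of x plus noise.
X = rng.standard_normal((n, dx))
Z = X @ rng.standard_normal((dx, dz)) + 0.5 * rng.standard_normal((n, dz))

# Whiten each view so the (empirical) auto-covariances become identity.
def whiten(M):
    M = M - M.mean(0)
    cov = M.T @ M / len(M)
    evals, evecs = np.linalg.eigh(cov)
    return M @ evecs @ np.diag(evals ** -0.5) @ evecs.T

Xw, Zw = whiten(X), whiten(Z)
Sxz = Xw.T @ Zw / n                      # cross-covariance of whitened views

# CCA via the SVD: Sigma_xz = A diag(rho) B^T; rho are canonical correlations.
A, rho, Bt = np.linalg.svd(Sxz)
U, V = A[:, :r], Bt.T[:, :r]             # top-r canonical directions

assert np.allclose(Xw.T @ Xw / n, np.eye(dx), atol=1e-8)   # identity covariance
assert np.all((rho >= -1e-9) & (rho <= 1 + 1e-6))          # rho lie in [0, 1]
assert np.isclose(np.trace(U.T @ Sxz @ V), rho[:r].sum())  # objective = sum of rho
```

The last assertion checks the stated optimal objective value: projecting onto the top-$r$ singular directions attains $\sum_{i=1}^r \rho_i$.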
Note that the canonical directions and correlations are derived solely from the multi-view input distribution (regardless of the label information), and they define a new coordinate system, different from that of the input space. As we will see later, under certain assumptions this new system facilitates complexity control for statistical learning.
2 Learning with smooth and non-negative loss
We now discuss the learning setup in which we analyze the different knowledge transfer settings. Consider linear predictors for supervised learning: given i.i.d. samples of the random variables $(\mathbf{x}, y)$ drawn from an unknown distribution $\mathcal{D}$, we would like to learn a linear predictor $\mathbf{w}$ to predict the target $y$ from the input $\mathbf{x}$, based on the inner product $\langle \mathbf{w}, \mathbf{x} \rangle$, and the discrepancy between the prediction and the target is measured by an instantaneous loss function $\ell\left(\langle \mathbf{w}, \mathbf{x}\rangle, y\right)$. Let $L(\mathbf{w}) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\left[\ell\left(\langle \mathbf{w}, \mathbf{x}\rangle, y\right)\right]$ be the expected loss associated with the predictor $\mathbf{w}$. With $n$ i.i.d. samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, the empirical loss is defined as $\hat{L}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \ell\left(\langle \mathbf{w}, \mathbf{x}_i\rangle, y_i\right)$.
We assume the loss $\ell(u, y)$ is convex and $\beta$-smooth in the first argument $u$. Such losses are widely used in machine learning. For example, the least squares loss $\ell(u, y) = (u - y)^2$ is $2$-smooth in $u$, and the cross-entropy (logistic) loss $\ell(u, y) = \log\left(1 + e^{-yu}\right)$ is $\frac{1}{4}$-smooth in $u$. In addition to $\boldsymbol{\Sigma}_{xx} = \mathbf{I}$ (as in Sec 1.1), we further assume $\|\mathbf{x}\| \le 1$, implying that $L(\mathbf{w})$ is $\beta$-smooth in $\mathbf{w}$. Sometimes we also assume the loss to be $\sigma$-strongly convex in $u$; the least squares loss is $2$-strongly convex, and the logistic loss is strongly convex as long as the first argument (decision value) is bounded.
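The smoothness claims can be checked numerically (our own illustration; the constants follow from bounding the second derivative in the first argument):

```python
import numpy as np

def l_logistic(u, y):        # cross-entropy / logistic loss, y in {-1, +1}
    return np.log1p(np.exp(-y * u))

def dl_logistic(u, y):       # derivative in the first argument
    return -y / (1.0 + np.exp(y * u))

def l_sq(u, y):              # least squares loss
    return (u - y) ** 2

rng = np.random.default_rng(2)
for _ in range(2000):
    y = rng.choice([-1.0, 1.0])
    u, v = 10 * rng.standard_normal(2)
    # logistic loss: the (1/4)-smooth quadratic upper bound in the first argument
    assert l_logistic(v, y) <= (l_logistic(u, y) + dl_logistic(u, y) * (v - u)
                                + (1 / 4) / 2 * (v - u) ** 2) + 1e-9
    # least squares: the 2-smooth bound holds with equality (exact expansion)
    assert np.isclose(l_sq(v, y), l_sq(u, y) + 2 * (u - y) * (v - u)
                      + (2 / 2) * (v - u) ** 2)
```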
The goal of convex-smooth-bounded learning (Shalev-Shwartz and Ben-David, 2014, Sec 12.2.2) is to find a good predictor $\hat{\mathbf{w}}$ such that

$$L(\hat{\mathbf{w}}) \le \min_{\mathbf{w} \in \mathcal{W}} L(\mathbf{w}) + \epsilon, \qquad (2)$$

where $\mathcal{W} = \{\mathbf{w} : \|\mathbf{w}\| \le B\}$ is the hypothesis class we would like to learn, and $\epsilon > 0$ is the excess risk. Such predictors can be learned properly by solving a constrained ERM problem, or improperly by solving a regularized ERM problem.
Lemma 1 (Srebro et al. (2012)).
Let $L^* = \min_{\mathbf{w} \in \mathcal{W}} L(\mathbf{w})$ be the optimal expected loss. Then for either of the following estimators

$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\|\mathbf{w}\| \le B} \; \hat{L}(\mathbf{w}), \qquad \text{or} \qquad \hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \; \hat{L}(\mathbf{w}) + \lambda \|\mathbf{w}\|^2,$$

where $\lambda$ is chosen optimally as in Srebro et al. (2012), we have

$$\mathbb{E}\left[L(\hat{\mathbf{w}})\right] \le L^* + \mathcal{O}\left(\frac{\beta B^2}{n} + \sqrt{\frac{\beta B^2 L^*}{n}}\right), \qquad (3)$$

where the expectation is over the training samples.
Observe that, for learning with smooth non-negative losses, the rate of convergence can be bounded using the optimal expected loss. For large $L^*$, the second term in (3) is dominant and we obtain the usual $\mathcal{O}(1/\sqrt{n})$ rate. However, if $L^*$ is close to zero, the first term becomes dominant and we obtain a faster $\mathcal{O}(1/n)$ rate. This phenomenon is similar in spirit to the faster convergence for the 0-1 loss when the problem is separable. Overall, we can bound the sample complexity to achieve excess error $\epsilon$ as

$$n(\epsilon) = \mathcal{O}\left(\frac{\beta B^2 \left(L^* + \epsilon\right)}{\epsilon^2}\right). \qquad (4)$$

We thus see that the transition between the two regimes happens at $L^* \approx \epsilon$: if the optimal loss is not much larger than the target excess error, the sample complexity is $\mathcal{O}\left(\beta B^2 / \epsilon\right)$ (corresponding to $\mathcal{O}(1/n)$ convergence); otherwise the sample complexity degrades to $\mathcal{O}\left(\beta B^2 L^* / \epsilon^2\right)$ (corresponding to $\mathcal{O}(1/\sqrt{n})$ convergence). This faster rate achieved for small expected loss is known as the optimistic rate, which may not be attainable by learning with non-smooth losses (e.g., in convex-Lipschitz-bounded learning).
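The two regimes can be read off from the bound directly; a tiny sketch (with $\beta B^2$ normalized to $1$, our own simplification):

```python
# Sample-complexity bound n(eps) ~ beta * B^2 * (L* + eps) / eps^2,
# with the constant factor beta * B^2 set to 1 for illustration.
def n_bound(L_star, eps):
    return (L_star + eps) / eps ** 2

# Optimistic regime: L* <= eps, so n scales as 1/eps ...
assert abs(n_bound(L_star=0.0, eps=1e-3) - 1e3) < 1e-6
# ... while for L* >> eps the 1/eps^2 term dominates (pessimistic regime).
assert n_bound(L_star=0.1, eps=1e-3) > 1e5
```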
3 Analysis for distillation
In the distillation setting (Ba and Caruana, 2014; Hinton et al., 2015), we have only one view $\mathbf{x}$. We first train a powerful teacher model $\mathbf{w}_t$, which provides soft supervision for training a simpler student model $\mathbf{w}_s$. In the context of learning linear predictors, we let $\mathbf{w}_t$ be learned from a larger hypothesis class $\mathcal{W}_t = \{\mathbf{w} : \|\mathbf{w}\| \le B_t\}$, while $\mathbf{w}_s$ is constrained to come from a smaller hypothesis class $\mathcal{W}_s = \{\mathbf{w} : \|\mathbf{w}\| \le B_s\}$, with $B_s \le B_t$.
We will assume that the optimal predictor from $\mathcal{W}_t$, i.e., $\mathbf{w}_t^* = \operatorname{argmin}_{\mathbf{w} \in \mathcal{W}_t} L(\mathbf{w})$, has small expected loss $L_t^* = L(\mathbf{w}_t^*)$, so that the teacher model, learned with the ERM

$$\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{w} \in \mathcal{W}_t} \; \hat{L}(\mathbf{w}),$$

enjoys faster convergence:

$$\mathbb{E}\left[L(\mathbf{w}_t)\right] \le L_t^* + \mathcal{O}\left(\frac{\beta B_t^2}{n} + \sqrt{\frac{\beta B_t^2 L_t^*}{n}}\right).$$

The key assumption that allows us to accelerate the learning of the student model is that the optimal predictor $\mathbf{w}_s^* = \operatorname{argmin}_{\mathbf{w} \in \mathcal{W}_s} L(\mathbf{w})$ from the smaller hypothesis class agrees well with the teacher model on prediction values, i.e., there exists some small $\epsilon_{st} \ge 0$, such that

$$\mathbb{E}_{\mathbf{x}}\left[\langle \mathbf{w}_s^* - \mathbf{w}_t, \mathbf{x}\rangle^2\right] \le \epsilon_{st}. \qquad (5)$$
As the lemma below shows, this assumption holds, e.g., when the loss is strongly convex and the student model can achieve small expected loss relative to the teacher's.
If the instantaneous loss $\ell$ is $\sigma$-strongly convex in the first argument, we have

$$\mathbb{E}\left[\langle \mathbf{w}_s^* - \mathbf{w}_t, \mathbf{x}\rangle^2\right] \le \frac{4}{\sigma}\left(L_s^* + \mathbb{E}\left[L(\mathbf{w}_t)\right] - 2 L_t^*\right),$$

where $L_s^* = L(\mathbf{w}_s^*)$, and the expectation is taken over the data distribution and the samples for learning $\mathbf{w}_t$.
Let $\mathbf{w}$ be any predictor from $\mathcal{W}_t$. By the $\sigma$-strong convexity of $\ell$, we have for any $(\mathbf{x}, y)$ that

$$\ell\left(\langle \mathbf{w}, \mathbf{x}\rangle, y\right) \ge \ell\left(\langle \mathbf{w}_t^*, \mathbf{x}\rangle, y\right) + \ell'\left(\langle \mathbf{w}_t^*, \mathbf{x}\rangle, y\right)\langle \mathbf{w} - \mathbf{w}_t^*, \mathbf{x}\rangle + \frac{\sigma}{2}\langle \mathbf{w} - \mathbf{w}_t^*, \mathbf{x}\rangle^2,$$

where $\ell'$ denotes the derivative in the first argument. Taking expectation over $(\mathbf{x}, y) \sim \mathcal{D}$, we have

$$L(\mathbf{w}) \ge L(\mathbf{w}_t^*) + \langle \nabla L(\mathbf{w}_t^*), \mathbf{w} - \mathbf{w}_t^*\rangle + \frac{\sigma}{2}\,\mathbb{E}\left[\langle \mathbf{w} - \mathbf{w}_t^*, \mathbf{x}\rangle^2\right]. \qquad (6)$$

Note that $\mathcal{W}_s \subseteq \mathcal{W}_t$. Since both $\mathbf{w}_s^*, \mathbf{w}_t \in \mathcal{W}_t$, and $\mathbf{w}_t^*$ is the minimizer of $L$ over $\mathcal{W}_t$, we have by the first order optimality condition that $\langle \nabla L(\mathbf{w}_t^*), \mathbf{w} - \mathbf{w}_t^*\rangle \ge 0$ for all $\mathbf{w} \in \mathcal{W}_t$. Substituting this into (6) yields

$$L(\mathbf{w}) \ge L_t^* + \frac{\sigma}{2}\,\mathbb{E}\left[\langle \mathbf{w} - \mathbf{w}_t^*, \mathbf{x}\rangle^2\right].$$

Setting $\mathbf{w} = \mathbf{w}_s^*$ and $\mathbf{w} = \mathbf{w}_t$ in the above inequality (the latter holds for each realization of the training samples, over which we then take expectation), we obtain respectively

$$\mathbb{E}\left[\langle \mathbf{w}_s^* - \mathbf{w}_t^*, \mathbf{x}\rangle^2\right] \le \frac{2}{\sigma}\left(L_s^* - L_t^*\right), \qquad \mathbb{E}\left[\langle \mathbf{w}_t - \mathbf{w}_t^*, \mathbf{x}\rangle^2\right] \le \frac{2}{\sigma}\left(\mathbb{E}\left[L(\mathbf{w}_t)\right] - L_t^*\right).$$

Combining the above two inequalities with $(a + b)^2 \le 2a^2 + 2b^2$, we have

$$\mathbb{E}\left[\langle \mathbf{w}_s^* - \mathbf{w}_t, \mathbf{x}\rangle^2\right] \le \frac{4}{\sigma}\left(L_s^* + \mathbb{E}\left[L(\mathbf{w}_t)\right] - 2 L_t^*\right). \qquad \square$$
Due to the assumption that $\boldsymbol{\Sigma}_{xx} = \mathbf{I}$, the condition (5) conveniently reduces to

$$\left\|\mathbf{w}_s^* - \mathbf{w}_t\right\|^2 \le \epsilon_{st}.$$

Therefore, under the assumption of small discrepancy between the student and the teacher, our search space for the student can be further reduced to

$$\left\{\mathbf{w} : \|\mathbf{w}\| \le B_s, \; \left\|\mathbf{w} - \mathbf{w}_t\right\|^2 \le \epsilon_{st}\right\},$$

which is much smaller than $\mathcal{W}_s$ for small $\epsilon_{st}$.
It is then natural for us to perform regularized ERM to take advantage of the additional complexity constraint. We propose to use the regularization term $\mathbb{E}_{\mathbf{x}}\left[\langle \mathbf{w} - \mathbf{w}_t, \mathbf{x}\rangle^2\right]$, which encourages the student to agree with the teacher on the decision values (soft targets) over the input distribution. In practice, this term can be approximated on a large set of unlabeled data, as it does not require ground-truth targets $y$.
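For the squared loss, this distillation-regularized ERM has a closed form, which the following sketch illustrates (the synthetic data, ridge-based teacher, and $\lambda$ values are our own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 20

# Synthetic regression data; w_true plays the role of a good direction.
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Teacher predictor (here: ridge regression on the same data, for illustration).
w_t = np.linalg.solve(X.T @ X / n + 1e-3 * np.eye(d), X.T @ y / n)

# Distillation-regularized ERM for the squared loss: with identity input
# covariance, E<w - w_t, x>^2 = ||w - w_t||^2, so the objective
#   (1/n) sum_i (<w, x_i> - y_i)^2 + lam * ||w - w_t||^2
# has the closed-form solution below.
lam = 0.5
w_s = np.linalg.solve(X.T @ X / n + lam * np.eye(d),
                      X.T @ y / n + lam * w_t)

# Sanity check: w_s satisfies the first-order optimality condition.
grad = 2 / n * X.T @ (X @ w_s - y) + 2 * lam * (w_s - w_t)
assert np.allclose(grad, 0.0, atol=1e-8)

# As lam grows, the student is pulled toward the teacher's soft targets.
w_s_big = np.linalg.solve(X.T @ X / n + 100.0 * np.eye(d),
                          X.T @ y / n + 100.0 * w_t)
assert np.linalg.norm(w_s_big - w_t) < np.linalg.norm(w_s - w_t)
```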
Compute the following predictor

$$\hat{\mathbf{w}}_s = \operatorname*{argmin}_{\mathbf{w}} \; \hat{L}(\mathbf{w}) + \lambda\,\mathbb{E}_{\mathbf{x}}\left[\langle \mathbf{w} - \mathbf{w}_t, \mathbf{x}\rangle^2\right] \qquad (7)$$

with $\lambda = \frac{\beta}{n}\left(1 + \sqrt{1 + \frac{n L_s^*}{\beta\,\epsilon_{st}}}\right)$. Then we have

$$\mathbb{E}\left[L(\hat{\mathbf{w}}_s)\right] \le L_s^* + \mathcal{O}\left(\frac{\beta\,\epsilon_{st}}{n} + \sqrt{\frac{\beta\,\epsilon_{st}\,L_s^*}{n}}\right). \qquad (8)$$
The proof is similar to Srebro et al. (2012, Theorem 5). Since $\hat{\mathbf{w}}_s$ is the minimizer of the regularized objective, we have for any $\mathbf{w}$ that

$$\hat{L}(\hat{\mathbf{w}}_s) + \lambda\left\|\hat{\mathbf{w}}_s - \mathbf{w}_t\right\|^2 \le \hat{L}(\mathbf{w}) + \lambda\left\|\mathbf{w} - \mathbf{w}_t\right\|^2.$$

Observe that the regularized ERM objective is $2\lambda$-strongly convex in $\mathbf{w}$. Taking the expectation of the above inequality and applying the stability result (Lemma 12 in the appendix), we obtain

$$\mathbb{E}\left[L(\hat{\mathbf{w}}_s)\right] \le \frac{1}{1 - \beta/(\lambda n)}\left(L(\mathbf{w}) + \lambda\left\|\mathbf{w} - \mathbf{w}_t\right\|^2\right).$$

Setting $\mathbf{w} = \mathbf{w}_s^*$ and substituting in $\left\|\mathbf{w}_s^* - \mathbf{w}_t\right\|^2 \le \epsilon_{st}$ yields

$$\mathbb{E}\left[L(\hat{\mathbf{w}}_s)\right] \le \frac{1}{1 - \beta/(\lambda n)}\left(L_s^* + \lambda\,\epsilon_{st}\right).$$

Minimizing the RHS over $\lambda$ gives $\lambda = \frac{\beta}{n}\left(1 + \sqrt{1 + \frac{n L_s^*}{\beta\,\epsilon_{st}}}\right)$, and

$$\mathbb{E}\left[L(\hat{\mathbf{w}}_s)\right] \le L_s^* + \frac{4\beta\,\epsilon_{st}}{n} + 2\sqrt{\frac{\beta\,\epsilon_{st}\,L_s^*}{n}} = L_s^* + \mathcal{O}\left(\frac{\beta\,\epsilon_{st}}{n} + \sqrt{\frac{\beta\,\epsilon_{st}\,L_s^*}{n}}\right),$$

where we have used the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$. ∎
As long as $\epsilon_{st} \ll B_s^2$, the convergence rate in Theorem 4 is much faster than that of Lemma 1. We observe from Lemma 3 that $\epsilon_{st}$ can be of the order $\mathcal{O}\left(L_s^* - L_t^*\right)$, plus the teacher's excess risk. In this case, the above theorem leads to the following overall convergence (by our assumption, the $L_t^*$-related terms are of higher order):

$$\mathbb{E}\left[L(\hat{\mathbf{w}}_s)\right] \le L_s^* + \mathcal{O}\left(\frac{\beta\left(L_s^* - L_t^*\right)}{n} + \sqrt{\frac{\beta\left(L_s^* - L_t^*\right) L_s^*}{n}}\right). \qquad (9)$$

Equivalently, we can bound the sample complexity to achieve excess error $\epsilon$ as

$$n(\epsilon) = \mathcal{O}\left(\frac{\beta\left(L_s^* - L_t^*\right)\left(L_s^* + \epsilon\right)}{\epsilon^2}\right). \qquad (10)$$

Compare this rate with the one obtained without the additional complexity control (Remark 2). In (9), as long as $L_s^* \lesssim \sqrt{\epsilon}$, rather than requiring $L_s^* \lesssim \epsilon$ as in (4), the effective convergence rate of (7) is $\mathcal{O}(1/n)$. To see that the difference is significant, if the target excess error is $\epsilon$, we now only need the optimal loss to be on the order of $\sqrt{\epsilon}$ to achieve the optimistic rate, instead of $\epsilon$ without distillation. In other words, we achieve the optimistic rate in the distillation setting under a much less stringent condition on the optimal expected loss.
4 Analysis for learning using privileged information
In the case of learning using privileged information, we have access to two views: the regular features $\mathbf{x}$ (used by the student) and the privileged information $\mathbf{z}$ (used by the teacher). We first discuss how the correlation between them comes into play when transferring knowledge from $\mathbf{z}$ to $\mathbf{x}$. The prerequisites on CCA were discussed in Section 1.1.
Perform the following change of coordinates (recall that $\mathbf{A}$ and $\mathbf{B}$ have full dimensions and thus these new variables are well defined):

$$\tilde{\mathbf{x}} = \mathbf{A}^\top \mathbf{x}, \qquad \tilde{\mathbf{z}} = \mathbf{B}^\top \mathbf{z}.$$

Clearly, the transformed data has identity covariance, i.e., $\mathbb{E}\left[\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top\right] = \mathbf{I}$ and $\mathbb{E}\left[\tilde{\mathbf{z}}\tilde{\mathbf{z}}^\top\right] = \mathbf{I}$. On the other hand, we have $\mathbb{E}\left[\tilde{\mathbf{x}}\tilde{\mathbf{z}}^\top\right] = \mathbf{D}$ and $\mathbb{E}\left[\tilde{\mathbf{z}}\tilde{\mathbf{x}}^\top\right] = \mathbf{D}^\top$.
Based on the above coordinate transformation and the definition of CCA, an important observation made by Kakade and Foster (2007) is the following equality.

Lemma 6 (Lemma 2 of Kakade and Foster (2007) re-stated).
For any $\mathbf{w} \in \mathbb{R}^{d_x}$ and $\mathbf{v} \in \mathbb{R}^{d_z}$, we have

$$\mathbb{E}\left[\left(\langle \mathbf{w}, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}, \tilde{\mathbf{z}}\rangle\right)^2\right] = \sum_i \rho_i\left(w^i - v^i\right)^2 + \sum_i \left(1 - \rho_i\right)\left(\left(w^i\right)^2 + \left(v^i\right)^2\right),$$

with the convention that $w^i = 0$ for $i > d_x$, and that $v^i = 0$ for $i > d_z$.
This lemma implies that, for a pair of predictors that agree well on the decision values, the predictors have low complexity in the CCA coordinate system: if $\rho_i$ is large (close to $1$), minimizing the discrepancy encourages $w^i$ to be close to $v^i$, and if $\rho_i$ is small (close to $0$), minimizing the discrepancy encourages both $w^i$ and $v^i$ to be small. For any predictor $\mathbf{v}$ of view $\tilde{\mathbf{z}}$, we define the operator $T$ to return a corresponding predictor of view $\tilde{\mathbf{x}}$:

$$T\mathbf{v} = \left[\rho_i\, v^i\right]_{i=1}^{d_x}.$$
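The identity in Lemma 6 can be verified algebraically; the following sketch (random coefficients of our own choosing) compares the closed form of the population discrepancy in the CCA coordinates against the lemma's decomposition:

```python
import numpy as np

rng = np.random.default_rng(4)
dx, dz = 6, 4
d = min(dx, dz)

# In the CCA coordinate system, E[xt xt^T] = I, E[zt zt^T] = I and
# E[xt zt^T] = diag(rho), so the population discrepancy has the closed form
#   E(<w,xt> - <v,zt>)^2 = ||w||^2 + ||v||^2 - 2 sum_i rho_i w^i v^i.
rho = np.sort(rng.uniform(0, 1, d))[::-1]
w = rng.standard_normal(dx)
v = rng.standard_normal(dz)

lhs = w @ w + v @ v - 2 * np.sum(rho * w[:d] * v[:d])

# Lemma-6-style decomposition, with w^i = 0 for i > dx and v^i = 0 for i > dz.
wp = np.concatenate([w, np.zeros(max(0, dz - dx))])
vp = np.concatenate([v, np.zeros(max(0, dx - dz))])
rp = np.concatenate([rho, np.zeros(max(dx, dz) - d)])
rhs = np.sum(rp * (wp - vp) ** 2 + (1 - rp) * (wp ** 2 + vp ** 2))

assert np.isclose(lhs, rhs)
```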
4.1 Multi-view distillation
Similar to the single-view distillation analyzed in Section 3, we can perform a two-step distillation process with multi-view data; this setting is also named generalized distillation by Lopez-Paz et al. (2015). In the first step, we train a predictor $\mathbf{v}_t$ on labeled data from the view $\tilde{\mathbf{z}}$. In the second step, we train a predictor on labeled data from the view $\tilde{\mathbf{x}}$, incorporating the soft supervision provided by $\mathbf{v}_t$. Note that the labeled data from the two views need not overlap.
As before, we make the assumption that $\mathbf{v}_t$ is learned from the hypothesis class $\mathcal{V}_t = \{\mathbf{v} : \|\mathbf{v}\| \le B_t\}$ by performing ERM, with optimistic-rate convergence due to the low optimal expected loss $L_t^* = \min_{\mathbf{v} \in \mathcal{V}_t} L(\mathbf{v})$. But different from single-view distillation, since the view $\tilde{\mathbf{z}}$ contains privileged information (a rich representation of the data), learning is easy and a low-complexity hypothesis class (e.g., $B_t$ is small) suffices. The different assumptions on the complexities of hypothesis spaces for distillation vs. LUPI are motivated by Lopez-Paz et al. (2015, Section 4).

Additionally, assume the optimal predictor $\mathbf{w}_s^*$ from the hypothesis class $\mathcal{W}_s = \{\mathbf{w} : \|\mathbf{w}\| \le B_s\}$ agrees well with $\mathbf{v}_t$ on decision values, i.e., there exists some small $\epsilon_{st} \ge 0$, such that

$$\mathbb{E}\left[\left(\langle \mathbf{w}_s^*, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}_t, \tilde{\mathbf{z}}\rangle\right)^2\right] \le \epsilon_{st}. \qquad (11)$$
The following lemma provides an example in which this assumption is satisfied.
Let the instantaneous loss $\ell$ be $\sigma$-strongly convex in the first argument. Let $\mathbf{v}$ be a predictor from $\mathcal{V}_t$ with $\|\mathbf{v}\| \le B_t$, such that the CCA-transformed predictor $T\mathbf{v}$ achieves expected loss $L(T\mathbf{v})$ on view $\tilde{\mathbf{x}}$. Then we have

$$\mathbb{E}\left[\langle \mathbf{w}_s^* - T\mathbf{v}, \tilde{\mathbf{x}}\rangle^2\right] \le \frac{2}{\sigma}\left(L(T\mathbf{v}) - L_s^*\right), \qquad \mathbb{E}\left[\left(\langle T\mathbf{v}, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}, \tilde{\mathbf{z}}\rangle\right)^2\right] = \sum_i \left(1 - \rho_i^2\right)\left(v^i\right)^2,$$

where the expectation is taken over the data distribution.
The proof is similar to that of Lemma 3. By the definition of $T$, and the fact that $\|\mathbf{v}\| \le B_t$, we have that $\left\|T\mathbf{v}\right\| = \left(\sum_i \rho_i^2 \left(v^i\right)^2\right)^{1/2} \le \|\mathbf{v}\| \le B_t$. By our assumption that $B_t \le B_s$, it holds that $T\mathbf{v} \in \mathcal{W}_s$. Now that both $\mathbf{w}_s^*$ and $T\mathbf{v}$ are in $\mathcal{W}_s$, and $\mathbf{w}_s^*$ is the minimizer of $L$ in $\mathcal{W}_s$, we can use the same argument as in Lemma 3 and have

$$L(T\mathbf{v}) \ge L_s^* + \frac{\sigma}{2}\,\mathbb{E}\left[\langle T\mathbf{v} - \mathbf{w}_s^*, \tilde{\mathbf{x}}\rangle^2\right],$$

and the first inequality follows.

To show the second inequality, just observe from Lemma 6 that

$$\mathbb{E}\left[\left(\langle T\mathbf{v}, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}, \tilde{\mathbf{z}}\rangle\right)^2\right] = \sum_i \left(v^i\right)^2\left[\rho_i\left(\rho_i - 1\right)^2 + \left(1 - \rho_i\right)\left(1 + \rho_i^2\right)\right] = \sum_i \left(1 - \rho_i^2\right)\left(v^i\right)^2. \qquad \square$$
We can then use the soft targets in the regularization, and the following theorem is completely analogous to Theorem 4.

Compute either of the following predictors

$$\hat{\mathbf{w}}_s = \operatorname*{argmin}_{\mathbf{w}} \; \hat{L}(\mathbf{w}) + \lambda\left\|\mathbf{w} - T\mathbf{v}_t\right\|^2, \qquad (12)$$

$$\hat{\mathbf{w}}_s = \operatorname*{argmin}_{\mathbf{w}} \; \hat{L}(\mathbf{w}) + \lambda\,\mathbb{E}\left[\left(\langle \mathbf{w}, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}_t, \tilde{\mathbf{z}}\rangle\right)^2\right], \qquad (13)$$

with $\lambda$ chosen as in Theorem 4. Then we have

$$\mathbb{E}\left[L(\hat{\mathbf{w}}_s)\right] \le L_s^* + \mathcal{O}\left(\frac{\beta\,\epsilon_{st}}{n} + \sqrt{\frac{\beta\,\epsilon_{st}\,L_s^*}{n}}\right).$$

The difference between the two predictors is that the regularization term in (13) does not require computing $T\mathbf{v}_t$ or even the CCA system. But in view of Lemma 7, the difference between the two regularizers in (12) and (13) is bounded by a small constant.
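In fact, under the identity-covariance CCA coordinates above, the two regularizers differ by a term that does not depend on $\mathbf{w}$, which the following sketch (random coefficients of our own choosing) verifies:

```python
import numpy as np

rng = np.random.default_rng(5)
dx = dz = 5
rho = rng.uniform(0, 1, dx)                 # canonical correlations
v_t = rng.standard_normal(dz)               # teacher predictor on view z
Tv = rho * v_t                              # operator T: (Tv)^i = rho_i v^i

def reg12(w):                               # ||w - T v_t||^2, as in (12)
    return np.sum((w - Tv) ** 2)

def reg13(w):                               # E(<w,xt> - <v_t,zt>)^2, as in (13)
    return w @ w + v_t @ v_t - 2 * np.sum(rho * w * v_t)

# The two regularizers differ by sum_i (1 - rho_i^2) (v_t^i)^2, a constant in w,
# so the two regularized ERM problems share the same minimizer.
const = np.sum((1 - rho ** 2) * v_t ** 2)
for _ in range(100):
    w = rng.standard_normal(dx)
    assert np.isclose(reg13(w) - reg12(w), const)
```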
Similar to the single-view distillation setting, the additional complexity constraint from the multi-view agreement assumption allows us to effectively reduce the search space when $\epsilon_{st} \ll B_s^2$. In particular, when the assumptions of Lemma 7 hold, the predictor (12) achieves the convergence rate

$$\mathbb{E}\left[L(\hat{\mathbf{w}}_s)\right] \le L_s^* + \mathcal{O}\left(\frac{\beta\,\epsilon_{st}}{n} + \sqrt{\frac{\beta\,\epsilon_{st}\,L_s^*}{n}}\right).$$

To achieve excess error $\epsilon$, the sample complexity can be written as

$$n(\epsilon) = \mathcal{O}\left(\frac{\beta\,\epsilon_{st}\left(L_s^* + \epsilon\right)}{\epsilon^2}\right).$$

Assuming $\epsilon_{st} = \mathcal{O}(L_s^*)$, then as long as $L_s^* \lesssim \sqrt{\epsilon}$, an overall sample complexity of $\mathcal{O}(\beta/\epsilon)$, or $\mathcal{O}(1/n)$ convergence rate, is achieved. Again, the optimistic rate is achieved with a less stringent requirement on the optimal expected loss.
Here, the expected loss of $T\mathbf{v}_t$ measures the quality of the predictor transformed from a good predictor of view $\tilde{\mathbf{z}}$, and in a sense how well the privileged information can be transferred. If $T\mathbf{v}_t$ is already near-optimal for view $\tilde{\mathbf{x}}$, all the student has to do is stay close to it.
4.2 Simultaneous learning of teacher and student
In the previous sections, we required first learning the teacher and then the student. We now show that it is possible to learn both predictors at the same time, using a multi-task learning objective. This is the same setup as considered by the SVM+ algorithm (Vapnik and Izmailov, 2015).
Similar to the distillation setting, we would like to learn predictors with low norms that agree well on decision values; the hypothesis class is defined as

$$\left\{(\mathbf{w}, \mathbf{v}) : \|\mathbf{w}\|^2 + \|\mathbf{v}\|^2 \le B^2, \;\; \mathbb{E}\left[\left(\langle \mathbf{w}, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}, \tilde{\mathbf{z}}\rangle\right)^2\right] \le \epsilon_{st}\right\}.$$

Let $(\mathbf{w}^*, \mathbf{v}^*)$ be the optimal predictors from this class. Denote $L^* = L(\mathbf{w}^*) + L(\mathbf{v}^*)$.
As before, we can learn over the hypothesis class with regularized ERM. The term $\mathbb{E}\left[\left(\langle \mathbf{w}, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}, \tilde{\mathbf{z}}\rangle\right)^2\right]$ now regularizes the learning of both $\mathbf{w}$ and $\mathbf{v}$, and this is known as "co-regularization" (Sindhwani et al., 2005; Rosenberg and Bartlett, 2007).
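For the squared loss, the co-regularized objective is jointly quadratic in $(\mathbf{w}, \mathbf{v})$ and can be solved via its normal equations; the sketch below (synthetic two-view data and $\lambda$ values of our own choosing) also checks that stronger co-regularization makes the two predictors agree more:

```python
import numpy as np

rng = np.random.default_rng(6)
n, dx, dz = 200, 8, 6

# Two views generated from a shared latent signal, plus labels.
latent = rng.standard_normal((n, 4))
X = latent @ rng.standard_normal((4, dx)) + 0.3 * rng.standard_normal((n, dx))
Z = latent @ rng.standard_normal((4, dz)) + 0.3 * rng.standard_normal((n, dz))
y = latent @ rng.standard_normal(4) + 0.1 * rng.standard_normal(n)

def co_regularized(lam1, lam2):
    """Joint (w, v) minimizer of
       (1/n) sum_i [(<w,x_i>-y_i)^2 + (<v,z_i>-y_i)^2]
       + lam1 (||w||^2 + ||v||^2) + lam2 (1/n) sum_i (<w,x_i> - <v,z_i>)^2,
       obtained by solving the normal equations of this quadratic objective."""
    Cxx, Czz, Cxz = X.T @ X / n, Z.T @ Z / n, X.T @ Z / n
    H = np.block([[Cxx + lam1 * np.eye(dx) + lam2 * Cxx, -lam2 * Cxz],
                  [-lam2 * Cxz.T, Czz + lam1 * np.eye(dz) + lam2 * Czz]])
    b = np.concatenate([X.T @ y / n, Z.T @ y / n])
    theta = np.linalg.solve(H, b)
    return theta[:dx], theta[dx:]

def disagreement(w, v):
    return np.mean((X @ w - Z @ v) ** 2)

w0, v0 = co_regularized(lam1=0.1, lam2=0.0)
w1, v1 = co_regularized(lam1=0.1, lam2=10.0)

# Stronger co-regularization forces the two views' predictors to agree more.
assert disagreement(w1, v1) < disagreement(w0, v0)
```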
Let $\{(\tilde{\mathbf{x}}_i, \tilde{\mathbf{z}}_i, y_i)\}_{i=1}^{n}$ be i.i.d. samples from the distribution. Compute the following predictors

$$(\hat{\mathbf{w}}, \hat{\mathbf{v}}) = \operatorname*{argmin}_{\mathbf{w}, \mathbf{v}} \; \frac{1}{n}\sum_{i=1}^{n}\left[\ell\left(\langle \mathbf{w}, \tilde{\mathbf{x}}_i\rangle, y_i\right) + \ell\left(\langle \mathbf{v}, \tilde{\mathbf{z}}_i\rangle, y_i\right)\right] + \lambda_1\left(\|\mathbf{w}\|^2 + \|\mathbf{v}\|^2\right) + \lambda_2\,\mathbb{E}\left[\left(\langle \mathbf{w}, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}, \tilde{\mathbf{z}}\rangle\right)^2\right] \qquad (14)$$

with $\lambda_1 > 0$ and $\lambda_2 > 0$. Then we have

$$\mathbb{E}\left[L(\hat{\mathbf{w}}) + L(\hat{\mathbf{v}})\right] \le \frac{1}{1 - 2\tilde{\beta}/n}\left(L^* + \begin{pmatrix}\mathbf{w}^*\\ \mathbf{v}^*\end{pmatrix}^\top \mathbf{M}\begin{pmatrix}\mathbf{w}^*\\ \mathbf{v}^*\end{pmatrix}\right), \quad \text{with} \quad \mathbf{M} = \begin{pmatrix}\left(\lambda_1 + \lambda_2\right)\mathbf{I} & -\lambda_2\,\mathbf{D}\\ -\lambda_2\,\mathbf{D}^\top & \left(\lambda_1 + \lambda_2\right)\mathbf{I}\end{pmatrix},$$

where $\tilde{\beta} = \frac{2\beta\left(\lambda_1 + \lambda_2\right)}{\lambda_1\left(\lambda_1 + 2\lambda_2\right)}$ is the smoothness parameter derived in the proof, and the matrix $\mathbf{D}$ is the diagonal matrix in the SVD of $\boldsymbol{\Sigma}_{xz}$.
Let $\boldsymbol{\theta} = \begin{pmatrix}\mathbf{w}\\ \mathbf{v}\end{pmatrix}$, and so the combined regularizer in (14) can be written as $\boldsymbol{\theta}^\top \mathbf{M} \boldsymbol{\theta}$. Further define the change of variable $\tilde{\boldsymbol{\theta}} = \mathbf{M}^{1/2}\boldsymbol{\theta}$, and we have that the regularizer $\|\tilde{\boldsymbol{\theta}}\|^2$ is $2$-strongly convex in $\tilde{\boldsymbol{\theta}}$.

On the other hand, we can rewrite the instantaneous loss in terms of $\tilde{\boldsymbol{\theta}}$:

$$\ell\left(\langle \mathbf{w}, \tilde{\mathbf{x}}\rangle, y\right) = \ell\left(\left\langle \mathbf{M}^{-1/2}\begin{pmatrix}\tilde{\mathbf{x}}\\ \mathbf{0}\end{pmatrix}, \tilde{\boldsymbol{\theta}}\right\rangle, y\right).$$

The smoothness of the loss with respect to $\tilde{\boldsymbol{\theta}}$ depends on the size of its input, which is

$$\left\|\mathbf{M}^{-1/2}\begin{pmatrix}\tilde{\mathbf{x}}\\ \mathbf{0}\end{pmatrix}\right\|^2 = \tilde{\mathbf{x}}^\top\left[\mathbf{M}^{-1}\right]_{11}\tilde{\mathbf{x}}, \qquad (15)$$

where $\left[\mathbf{M}^{-1}\right]_{11}$ is the top left block of $\mathbf{M}^{-1}$ with dimension $d_x \times d_x$. By the inversion formula of block matrices, we have

$$\left[\mathbf{M}^{-1}\right]_{11} = \left(\left(\lambda_1 + \lambda_2\right)\mathbf{I} - \frac{\lambda_2^2}{\lambda_1 + \lambda_2}\,\mathbf{D}\mathbf{D}^\top\right)^{-1}.$$

Since the minimum eigenvalue of the matrix inside $(\cdot)^{-1}$ is at least $\left(\lambda_1 + \lambda_2\right) - \frac{\lambda_2^2}{\lambda_1 + \lambda_2} = \frac{\lambda_1\left(\lambda_1 + 2\lambda_2\right)}{\lambda_1 + \lambda_2}$ (using $\rho_i \le 1$ and the bound on $\|\tilde{\mathbf{x}}\|$), we continue from (15) and claim that the instantaneous loss of view $\tilde{\mathbf{x}}$ is smooth in $\tilde{\boldsymbol{\theta}}$ with parameter

$$\frac{\beta\left(\lambda_1 + \lambda_2\right)}{\lambda_1\left(\lambda_1 + 2\lambda_2\right)}.$$

The same smoothness condition holds for the view $\tilde{\mathbf{z}}$, and therefore the combined instantaneous loss is $\tilde{\beta}$-smooth in $\tilde{\boldsymbol{\theta}}$, with $\tilde{\beta} = \frac{2\beta\left(\lambda_1 + \lambda_2\right)}{\lambda_1\left(\lambda_1 + 2\lambda_2\right)}$.
The empirical loss in (14) thus approximates the expectation of the combined loss with $n$ paired samples. By the stability of regularized ERM for smooth losses, we have that

$$\mathbb{E}\left[L(\hat{\mathbf{w}}) + L(\hat{\mathbf{v}})\right] \le \frac{1}{1 - 2\tilde{\beta}/n}\,\mathbb{E}\left[\hat{L}(\hat{\mathbf{w}}) + \hat{L}(\hat{\mathbf{v}})\right].$$

Since $(\hat{\mathbf{w}}, \hat{\mathbf{v}})$ is the minimizer of the multi-task objective, we have

$$\hat{L}(\hat{\mathbf{w}}) + \hat{L}(\hat{\mathbf{v}}) + \hat{\boldsymbol{\theta}}^\top\mathbf{M}\hat{\boldsymbol{\theta}} \le \hat{L}(\mathbf{w}^*) + \hat{L}(\mathbf{v}^*) + \boldsymbol{\theta}^{*\top}\mathbf{M}\boldsymbol{\theta}^*.$$

Define the shorthand $r^* = \boldsymbol{\theta}^{*\top}\mathbf{M}\boldsymbol{\theta}^* = \lambda_1\left(\|\mathbf{w}^*\|^2 + \|\mathbf{v}^*\|^2\right) + \lambda_2\,\mathbb{E}\left[\left(\langle \mathbf{w}^*, \tilde{\mathbf{x}}\rangle - \langle \mathbf{v}^*, \tilde{\mathbf{z}}\rangle\right)^2\right] \le \lambda_1 B^2 + \lambda_2\,\epsilon_{st}$.

Taking the expectation over the samples and re-organizing terms, we obtain

$$\mathbb{E}\left[L(\hat{\mathbf{w}}) + L(\hat{\mathbf{v}})\right] \le \frac{1}{1 - 2\tilde{\beta}/n}\left(L^* + r^*\right).$$

Let $\lambda_1$ and $\lambda_2$ be chosen to minimize the right hand side. Substituting them into the definition of $\tilde{\beta}$, we obtain the final bound.