Everything old is new again: A multi-view learning approach to learning using privileged information and distillation

We adopt a multi-view approach for analyzing two knowledge transfer settings---learning using privileged information (LUPI) and distillation---in a common framework. Under reasonable assumptions about the complexities of the hypothesis spaces, and being optimistic about the expected loss achievable by the student (in distillation) and by a transformed teacher predictor (in LUPI), we show that encouraging agreement between the teacher and the student leads to a reduced search space. As a result, an improved convergence rate can be obtained with regularized empirical risk minimization.

02/22/2021

1 Introduction

Transferring knowledge learned by a powerful model (“teacher”) to a simpler model (“student”) has become a recurring theme in machine learning. The goal of knowledge transfer is to have the teacher guide the learning process of the student, so as to achieve high prediction accuracy or to reduce the sample complexity, which are otherwise hard for the student to achieve by itself. This learning paradigm is practically useful when simpler models must be deployed to real-world systems that require a small memory footprint or fast processing time.

We focus on two specific settings of knowledge transfer in this work. The first is learning using privileged information (LUPI) (Vapnik and Vashist, 2009), in which the teacher provides an additional feature representation to the student during training but not at test time, and this extra feature set contains richer information that makes the learning problem easier for the student; for example, the “student may normally only have access to the image of a biopsy to predict the existence of cancer, but during the training process, it also has access to the medical report of an oncologist” (Lopez-Paz et al., 2015). The second setting is distillation (Ba and Caruana, 2014; Hinton et al., 2015), in which the teacher and the student have access to the same samples (and the same features), and the teacher learns a complex model (e.g., a deep neural network) that provides soft targets (e.g., the decision values) for the student (e.g., a shallow neural network) to mimic directly during its training process.

Lopez-Paz et al. (2015) have unified the two settings as generalized distillation, which facilitates learning from multiple machines and data representations. We take the same unified view as Lopez-Paz et al. (2015) in this work and analyze different settings in a common learning framework, while using intuitions and assumptions specific to LUPI or distillation when appropriate. While in LUPI the learning task is often supervised, the constraint that privileged information is only available during training resembles that of (unsupervised) multi-view representation learning (Wang et al., 2015), where the goal is to learn feature transformations from multiple measurements (“views”) of the input data that can be useful for downstream tasks (which are often supervised).

With the intuition of the teacher making the learning problem easier for the student, the LUPI setting is theoretically motivated (Vapnik and Vashist, 2009; Pechyony and Vapnik, 2010). For example, for binary supervised learning (with the 0-1 loss), when $n$ training samples are used, the convergence rate is $O(1/n)$ in the separable case (i.e., when there exists a perfect classifier), and the rate degrades to $O(1/\sqrt{n})$ when the problem is not separable. However, when learning an SVM classifier in the non-separable case, if the teacher can supply the student with the slack variables associated with the optimal solution for each sample (thus setting the correct slack on each sample), the learning objective becomes separable for the student and the faster rate can be restored. It has been hypothesized that a faster convergence rate can in general be obtained under LUPI (Lopez-Paz et al., 2015), although it is not clear under what conditions and with what learning method this is achievable.

Our contributions

We can consider the normal feature set and the privileged information as two “views” of the input data. There is a long line of (both theoretical and empirical) research studying the multi-view learning setting, supervised or not, and the most popular intuition in multi-view learning is that the views need to agree, either on the features (Hermann and Blunsom, 2014; Wang et al., 2015; Fang et al., 2015) or on the predictions (Blum and Mitchell, 1998; Sindhwani et al., 2005; Rosenberg and Bartlett, 2007; Kakade and Foster, 2007; Balcan and Blum, 2010; Blum and Mansour, 2017). We draw heavily on this prior work and show how, under reasonable assumptions on the model complexities for distillation and LUPI, agreement between the views helps reduce the search space for the student and therefore leads to faster convergence.

In particular, we analyze distillation and LUPI in the framework of learning linear predictors with convex, smooth, and non-negative losses, which are commonly used in machine learning. We show that an improved rate is possible by being optimistic, that is, by assuming the student can achieve low expected loss for distillation, and by assuming a good predictor on the privileged information can be transformed into a good predictor on the regular feature set for LUPI. Under these assumptions, we perform regularized ERM where the regularization term measures the prediction discrepancy (squared difference) between the student and the teacher; this is an intuitive and easy-to-optimize regularizer that is often used in practice. The solution to regularized ERM achieves faster convergence than the one without regularization; equivalently, there exists a larger range of optimal losses for which the student achieves the fast (optimistic) rate. Interestingly, in the multi-view distillation and LUPI settings, the complexity control is achieved through the coordinate system defined by Canonical Correlation Analysis (Hotelling, 1936).

Notations

Bold lower case letters (such as $\mathbf{x}$, $\mathbf{z}$) denote vectors, and bold upper case letters (such as $\mathbf{U}$, $\mathbf{V}$) denote matrices. For vectors, subscripts index samples of a random vector, while superscripts index the coordinates. $[a^i]_{i=1}^{d}$ denotes the $d$-dimensional vector whose $i$-th coordinate is $a^i$. We use $\mathbf{0}$ and $\mathbf{I}$ to denote the all-zero vector and the identity matrix respectively, whose dimensions can be inferred from the context. A convex function $f$ is $\beta$-smooth in $\mathbf{x}$ if $f(\mathbf{y}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x} \rangle + \frac{\beta}{2}\|\mathbf{y}-\mathbf{x}\|^2$ for all $\mathbf{x}, \mathbf{y}$, and it is $\sigma$-strongly convex in $\mathbf{x}$ if $f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x} \rangle + \frac{\sigma}{2}\|\mathbf{y}-\mathbf{x}\|^2$ for all $\mathbf{x}, \mathbf{y}$.

1.1 A brief review on CCA

We briefly review Canonical Correlation Analysis (CCA, Hotelling, 1936), a classical method for measuring the correlation between two random vectors, as it plays an important role later. Let $\mathbf{x} \in \mathbb{R}^{d_x}$ and $\mathbf{z} \in \mathbb{R}^{d_z}$ be two random vectors with a joint probability distribution $p(\mathbf{x}, \mathbf{z})$. The simultaneous formulation of CCA finds a set of directions (canonical directions) for each view, collected in matrices $\mathbf{U}$ and $\mathbf{V}$, such that the projections of $\mathbf{x}$ and $\mathbf{z}$ onto these directions are maximally correlated:

 \max_{\mathbf{U}, \mathbf{V}} \; \mathrm{tr}(\mathbf{U}^\top E_{xz} \mathbf{V}) \quad \text{s.t.} \quad \mathbf{U}^\top E_{xx} \mathbf{U} = \mathbf{V}^\top E_{zz} \mathbf{V} = \mathbf{I} (1)

where the cross- and auto-covariance matrices are defined as

 E_{xz} = E[\mathbf{x}\mathbf{z}^\top], \quad E_{xx} = E[\mathbf{x}\mathbf{x}^\top], \quad E_{zz} = E[\mathbf{z}\mathbf{z}^\top]. (2)

In this work, we assume for simplicity that the random vectors have zero mean ($E[\mathbf{x}] = \mathbf{0}$ and $E[\mathbf{z}] = \mathbf{0}$) and identity covariance ($E_{xx} = \mathbf{I}$ and $E_{zz} = \mathbf{I}$). The global optimum of (1), denoted by $(\mathbf{U}^*, \mathbf{V}^*)$, can be computed as follows. Let the full SVD of $E_{xz}$ be

 E_{xz} = \tilde{\mathbf{U}} \Sigma \tilde{\mathbf{V}}^\top,

where $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ are unitary, and $\Sigma$ contains the singular values (canonical correlations) on its diagonal:

 1 \ge \lambda_1 \ge \cdots \ge \lambda_{\mathrm{rank}(E_{xz})} > 0.

Then, with $r = \mathrm{rank}(E_{xz})$, the optimal directions are

 (\mathbf{U}^*, \mathbf{V}^*) = (\tilde{\mathbf{U}}_{1:r}, \tilde{\mathbf{V}}_{1:r}),

where $\tilde{\mathbf{U}}_{1:r}$ denotes the submatrix of $\tilde{\mathbf{U}}$ containing the first $r$ columns, and the optimal objective value is $\sum_{i=1}^{r} \lambda_i$. We refer the readers to Gao et al. (2017); Allen-Zhu and Li (2016) for efficient numerical procedures for computing the solution.

Note that the canonical directions and correlations are derived solely from the multi-view input distribution (regardless of the label information), and they define a new coordinate system, different from that of the input space. As we will see later, under certain assumptions, this new system facilitates complexity control for statistical learning.
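For concreteness, the SVD-based computation above can be sketched in a few lines of NumPy. This is an illustrative sketch under the whitening assumption of this section ($E_{xx} = E_{zz} = \mathbf{I}$), not one of the scalable solvers of Gao et al. (2017) or Allen-Zhu and Li (2016); the dimensions and mixing weights of the synthetic data are arbitrary choices.

```python
import numpy as np

def cca_whitened(X, Z, r):
    # For whitened views (zero mean, identity covariance), the canonical
    # directions are the top-r singular vectors of the cross-covariance
    # E_xz, and the singular values are the canonical correlations.
    n = X.shape[0]
    Exz = X.T @ Z / n                     # empirical cross-covariance
    Ut, lams, Vt = np.linalg.svd(Exz)     # full SVD: Exz = U~ Sigma V~^T
    return Ut[:, :r], Vt[:r].T, lams

rng = np.random.default_rng(0)
n, d = 200000, 3
X = rng.standard_normal((n, d))           # view 1: approximately whitened
R = np.linalg.qr(rng.standard_normal((d, d)))[0]
Z = 0.8 * X @ R + 0.6 * rng.standard_normal((n, d))  # view 2: also whitened
U, V, lams = cca_whitened(X, Z, d)
# by construction E_xz = 0.8 R, so all canonical correlations are near 0.8
```

Because the population covariances of both synthetic views are exactly the identity, the constraint in (1) is satisfied up to sampling error by any orthonormal $\mathbf{U}$, $\mathbf{V}$, which is why a single SVD suffices here.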

2 Learning with smooth and non-negative loss

We now discuss the learning setup in which we analyze the different knowledge transfer settings. Consider linear predictors for supervised learning: given i.i.d. samples of random variables $(\mathbf{x}, y)$ drawn from an unknown distribution $\mathcal{D}$, we would like to learn a linear predictor $\mathbf{w}$ to predict the target $y$ from the input $\mathbf{x}$ based on the inner product $\mathbf{w}^\top \mathbf{x}$, where the discrepancy between the prediction and the target is measured by an instantaneous loss function $\ell(\mathbf{w}^\top \mathbf{x}, y)$. Let $L(\mathbf{w}) = E_{(\mathbf{x}, y) \sim \mathcal{D}}[\ell(\mathbf{w}^\top \mathbf{x}, y)]$ be the expected loss associated with the predictor $\mathbf{w}$. With $n$ i.i.d. samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, the empirical loss is defined as $\hat{L}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{w}^\top \mathbf{x}_i, y_i)$.

We assume the loss $\ell(a, y)$ is convex and $\beta$-smooth in the first argument $a$. Such losses are widely used in machine learning. For example, the least squares loss $(a - y)^2$ is $2$-smooth in $a$, and the logistic (cross-entropy) loss $\log(1 + e^{-ya})$ is $\frac{1}{4}$-smooth in $a$. In addition to $E_{xx} = \mathbf{I}$ (as in Sec 1.1), we further assume $\|\mathbf{x}\| \le R$, implying that $\ell(\mathbf{w}^\top \mathbf{x}, y)$ is $\beta R^2$-smooth in $\mathbf{w}$. Sometimes we also assume the loss to be $\sigma$-strongly convex in $a$; the least squares loss is $2$-strongly convex, and the logistic loss is strongly convex as long as the first argument (decision value) is bounded.
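These smoothness constants can be checked numerically: for a twice-differentiable loss, $\beta$-smoothness in the decision value $a$ is equivalent to a uniform bound $\ell''(a, y) \le \beta$. A small sketch (the grid of decision values is an arbitrary choice):

```python
import numpy as np

# beta-smoothness of a twice-differentiable loss in the decision value a
# is equivalent to ell''(a) <= beta everywhere; check both example losses.
a = np.linspace(-20.0, 20.0, 100001)

# least squares loss (a - y)^2: second derivative in a is identically 2
ls_curvature = np.full_like(a, 2.0)

# logistic loss log(1 + exp(-y a)) with y = 1: second derivative is
# s (1 - s) where s = sigmoid(-a), which attains its maximum 1/4 at a = 0
s = 1.0 / (1.0 + np.exp(a))
logistic_curvature = s * (1.0 - s)

print(ls_curvature.max(), logistic_curvature.max())
```

The same check also illustrates why the logistic loss is only strongly convex on a bounded interval: its curvature $s(1-s)$ decays to zero as $|a| \to \infty$.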

The goal of convex-smooth-bounded learning (Shalev-Shwartz and Ben-David, 2014, Sec 12.2.2) is to find a good predictor $\hat{\mathbf{w}}$ such that

 L(\hat{\mathbf{w}}) \le \min_{\|\mathbf{w}\| \le B} L(\mathbf{w}) + \epsilon

where $\{\mathbf{w} : \|\mathbf{w}\| \le B\}$ is the hypothesis class we would like to learn and $\epsilon$ is the excess risk. Such predictors can be learned (properly) by solving the constrained ERM problem, or improperly by solving the regularized ERM.

Lemma 1 (Srebro et al. (2012)).

Let $L^* = \min_{\|\mathbf{w}\| \le B} L(\mathbf{w})$ be the optimal expected loss. Then for either of the following estimators

 (constrained ERM) \quad \hat{\mathbf{w}} = \arg\min_{\|\mathbf{w}\| \le B} \hat{L}(\mathbf{w}) \qquad (regularized ERM) \quad \hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \hat{L}(\mathbf{w}) + \frac{\gamma}{2}\|\mathbf{w}\|^2

where $\gamma$ is chosen appropriately (see Srebro et al., 2012), we have

 E[L(\hat{\mathbf{w}}) - L^*] = \tilde{O}\left( \frac{\beta R^2 B^2}{n} + \sqrt{\frac{\beta R^2 B^2 L^*}{n}} \right) (3)

where the expectation is over the training samples.

Remark 2.

Observe that, for learning with smooth non-negative losses, the rate of convergence can be bounded using the optimal expected loss. For large $L^*$, the second term in (3) is dominant and we obtain the usual $O(1/\sqrt{n})$ rate. However, if $L^*$ is close to zero and the first term becomes dominant, we obtain a faster $O(1/n)$ rate. This phenomenon is in spirit similar to the faster convergence for the 0-1 loss when the problem is separable. Overall, we can bound the sample complexity to achieve excess error $\epsilon$ as

 \tilde{O}\left( \frac{\beta R^2 B^2}{\epsilon} \cdot \frac{\epsilon + L^*}{\epsilon} \right). (4)

We thus see that the transition between the two regimes happens at $L^* \approx \epsilon$: if the optimal loss is not much larger than the target excess error, the sample complexity is $\tilde{O}(\beta R^2 B^2 / \epsilon)$ (corresponding to $1/n$ convergence); otherwise the sample complexity degrades to $\tilde{O}(\beta R^2 B^2 L^* / \epsilon^2)$ (corresponding to $1/\sqrt{n}$ convergence). This faster rate achieved for small expected loss is known as the optimistic rate, which may not be attainable by learning with non-smooth losses (e.g., convex-Lipschitz-bounded learning).
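The two estimators of Lemma 1 can be sketched concretely for the squared loss: projected gradient descent for the constrained ERM, and a ridge-regression closed form for the regularized ERM. All problem sizes, the step size, and $\gamma$ below are illustrative assumptions, not values prescribed by the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, B = 500, 5, 2.0
w_true = rng.standard_normal(d)
w_true *= B / np.linalg.norm(w_true)      # ground truth on the ball boundary
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

def grad(w):
    # gradient of the empirical squared loss (1/n) sum_i (w^T x_i - y_i)^2
    return 2.0 / n * X.T @ (X @ w - y)

# constrained ERM: projected gradient descent onto {w : ||w|| <= B}
w = np.zeros(d)
for _ in range(2000):
    w = w - 0.01 * grad(w)
    nrm = np.linalg.norm(w)
    if nrm > B:
        w *= B / nrm                      # Euclidean projection onto the ball
w_constrained = w

# regularized ERM: for the squared loss this is ridge regression,
# solving (2 X^T X / n + gamma I) w = 2 X^T y / n in closed form
gamma = 0.01
w_regularized = np.linalg.solve(2.0 * X.T @ X / n + gamma * np.eye(d),
                                2.0 * X.T @ y / n)
```

With plenty of samples and small noise, both estimators land close to the same ball-constrained optimum, matching the lemma's claim that either route works.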

3 Analysis for distillation

In the distillation setting (Ba and Caruana, 2014; Hinton et al., 2015), we have only one view $\mathbf{x}$. We first train a powerful teacher model $\hat{\mathbf{v}}$, which provides soft supervision for training a simpler student model $\hat{\mathbf{w}}$. In the context of learning linear predictors, we let $\hat{\mathbf{v}}$ be learned from a larger hypothesis class $\mathcal{C}_v = \{\mathbf{v} : \|\mathbf{v}\| \le B_v\}$, while $\hat{\mathbf{w}}$ is constrained to come from a smaller hypothesis class $\mathcal{C}_w = \{\mathbf{w} : \|\mathbf{w}\| \le B_w\}$, with $B_w \le B_v$.

We will assume that the optimal predictor from $\mathcal{C}_v$, i.e., $\mathbf{v}^* = \arg\min_{\mathbf{v} \in \mathcal{C}_v} L(\mathbf{v})$, has small expected loss $L_v^* = L(\mathbf{v}^*)$, so that the teacher model, learned with the ERM

 \hat{\mathbf{v}} = \arg\min_{\mathbf{v} \in \mathcal{C}_v} \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{v}^\top \mathbf{x}_i, y_i),

enjoys faster convergence: by Lemma 1, its suboptimality $\epsilon_t = E[L(\hat{\mathbf{v}})] - L_v^*$ is of order $\tilde{O}(\beta R^2 B_v^2 / n)$ when $L_v^*$ is small.

The key assumption that allows us to accelerate the learning of the student model is that the optimal predictor $\mathbf{w}^* = \arg\min_{\mathbf{w} \in \mathcal{C}_w} L(\mathbf{w})$ from the smaller hypothesis class agrees well with the teacher model on prediction values, i.e., there exists some small $S > 0$ such that

 E\left\| \mathbf{w}^{*\top} \mathbf{x} - \hat{\mathbf{v}}^\top \mathbf{x} \right\|^2 \le S^2. (5)

As the lemma below shows, this assumption holds, e.g., when the loss is strongly convex and the student model can achieve small expected loss $L_w^* = L(\mathbf{w}^*)$ relative to $L_v^*$.

Lemma 3.

If the instantaneous loss is $\sigma$-strongly convex in the first argument, we have

 E\left\| \mathbf{w}^{*\top} \mathbf{x} - \hat{\mathbf{v}}^\top \mathbf{x} \right\|^2 \le \frac{4(L_w^* - L_v^* + \epsilon_t)}{\sigma}

where the expectation is taken over the data distribution and the samples for learning $\hat{\mathbf{v}}$.

Proof.

Let $\mathbf{w}$ be any predictor from $\mathcal{C}_v$ (in particular, any $\mathbf{w} \in \mathcal{C}_w$, since $B_w \le B_v$). By the $\sigma$-strong convexity of $\ell$, we have for any $(\mathbf{x}, y)$ that

 \ell(\mathbf{w}^\top \mathbf{x}, y) - \ell(\mathbf{v}^{*\top} \mathbf{x}, y) \ge \ell'(\mathbf{v}^{*\top} \mathbf{x}, y) \cdot (\mathbf{w}^\top \mathbf{x} - \mathbf{v}^{*\top} \mathbf{x}) + \frac{\sigma}{2} \left\| \mathbf{w}^\top \mathbf{x} - \mathbf{v}^{*\top} \mathbf{x} \right\|^2.

Taking expectation over $(\mathbf{x}, y)$, we have

 L(\mathbf{w}) - L(\mathbf{v}^*) \ge \left\langle E[\ell'(\mathbf{v}^{*\top} \mathbf{x}, y) \cdot \mathbf{x}], \; \mathbf{w} - \mathbf{v}^* \right\rangle + \frac{\sigma}{2} \cdot E\left\| \mathbf{w}^\top \mathbf{x} - \mathbf{v}^{*\top} \mathbf{x} \right\|^2. (6)

Note that $\nabla L(\mathbf{v}^*) = E[\ell'(\mathbf{v}^{*\top} \mathbf{x}, y) \cdot \mathbf{x}]$. Since both $\mathbf{w}, \mathbf{v}^* \in \mathcal{C}_v$, and $\mathbf{v}^*$ is the minimizer of $L$ over $\mathcal{C}_v$, we have by the first-order optimality condition that $\langle \nabla L(\mathbf{v}^*), \mathbf{w} - \mathbf{v}^* \rangle \ge 0$. Substituting this into (6) yields

 L(\mathbf{w}) - L(\mathbf{v}^*) \ge \frac{\sigma}{2} \cdot E\left\| \mathbf{w}^\top \mathbf{x} - \mathbf{v}^{*\top} \mathbf{x} \right\|^2.

Setting $\mathbf{w} = \mathbf{w}^*$ and $\mathbf{w} = \hat{\mathbf{v}}$ in the above inequality (taking an additional expectation over the teacher's training samples in the latter case), we obtain respectively

 L_w^* - L_v^* \ge \frac{\sigma}{2} \cdot E\left\| \mathbf{w}^{*\top} \mathbf{x} - \mathbf{v}^{*\top} \mathbf{x} \right\|^2, \qquad \epsilon_t = E[L(\hat{\mathbf{v}}) - L_v^*] \ge \frac{\sigma}{2} \cdot E\left\| \hat{\mathbf{v}}^\top \mathbf{x} - \mathbf{v}^{*\top} \mathbf{x} \right\|^2.

Combining the above two inequalities (via $\|a - c\|^2 \le 2\|a - b\|^2 + 2\|b - c\|^2$), we have

 E\left\| \mathbf{w}^{*\top} \mathbf{x} - \hat{\mathbf{v}}^\top \mathbf{x} \right\|^2 \le 2 E\left\| \mathbf{w}^{*\top} \mathbf{x} - \mathbf{v}^{*\top} \mathbf{x} \right\|^2 + 2 E\left\| \hat{\mathbf{v}}^\top \mathbf{x} - \mathbf{v}^{*\top} \mathbf{x} \right\|^2 \le \frac{4}{\sigma}(L_w^* - L_v^* + \epsilon_t). \qquad \blacksquare

Due to the assumption that $E_{xx} = \mathbf{I}$, the condition (5) conveniently reduces to

 \|\mathbf{w}^* - \hat{\mathbf{v}}\| \le S.

Therefore, under the assumption of small discrepancy between the student and the teacher, our search space for $\mathbf{w}^*$ can be further reduced to

 \{\mathbf{w} \in \mathcal{C}_w : \|\mathbf{w} - \hat{\mathbf{v}}\| \le S\},

which is much smaller than $\mathcal{C}_w$ for small $S$.

It is then natural for us to perform regularized ERM to take advantage of the additional complexity constraint. We propose to use the regularization $\frac{\nu}{2} E\|\mathbf{w}^\top \mathbf{x} - \hat{\mathbf{v}}^\top \mathbf{x}\|^2 = \frac{\nu}{2}\|\mathbf{w} - \hat{\mathbf{v}}\|^2$, which encourages the student to agree with the teacher on the decision values (soft targets) over the distribution. In practice, this term can be approximated on a large set of unlabeled data, as it does not require ground-truth targets $y$.

Theorem 4.

Compute the following predictor

 \hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathcal{C}_w} \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{w}^\top \mathbf{x}_i, y_i) + \frac{\nu}{2} \|\mathbf{w} - \hat{\mathbf{v}}\|^2 (7)

with $\nu$ chosen as in the proof. Then we have

 E[L(\hat{\mathbf{w}}) - L_w^*] \le \frac{16 \beta R^2 S^2}{n} + \sqrt{\frac{16 \beta R^2 S^2 L_w^*}{n}}.
Proof.

The proof is similar to Srebro et al. (2012, Theorem 5). Since $\hat{\mathbf{w}}$ is the minimizer of the regularized objective, we have for any $\mathbf{w} \in \mathcal{C}_w$ that

 \hat{L}(\hat{\mathbf{w}}) \le \hat{L}(\hat{\mathbf{w}}) + \frac{\nu}{2}\|\hat{\mathbf{w}} - \hat{\mathbf{v}}\|^2 \le \hat{L}(\mathbf{w}) + \frac{\nu}{2}\|\mathbf{w} - \hat{\mathbf{v}}\|^2.

Observe that the regularized ERM objective is $\nu$-strongly convex in $\mathbf{w}$. Taking the expectation of the above inequality and applying the stability result (Lemma 12 in the appendix), we obtain

 E[L(\hat{\mathbf{w}})] \le \frac{1}{1 - 8\beta R^2 / (\nu n)} \left( L(\mathbf{w}) + \frac{\nu}{2}\|\mathbf{w} - \hat{\mathbf{v}}\|^2 \right).

Setting $\mathbf{w} = \mathbf{w}^*$ and substituting in $\|\mathbf{w}^* - \hat{\mathbf{v}}\| \le S$ yields

 E[L(\hat{\mathbf{w}})] \le \frac{1}{1 - 8\beta R^2 / (\nu n)} \left( L_w^* + \frac{\nu S^2}{2} \right).

Minimizing the RHS over $\nu$ gives

 E[L(\hat{\mathbf{w}}) - L_w^*] \le \frac{8\beta R^2 S^2}{n} + \sqrt{\frac{64 \beta^2 R^4 S^4}{n^2} + \frac{16 \beta R^2 S^2 L_w^*}{n}} \le \frac{16 \beta R^2 S^2}{n} + \sqrt{\frac{16 \beta R^2 S^2 L_w^*}{n}}

where we have used the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$. ∎

Remark 5.

As long as $S \ll B_w$, the convergence rate in Theorem 4 is much faster than that of Lemma 1. We observe from Lemma 3 that $S^2$ can be of the order $O\left((L_w^* - L_v^* + \epsilon_t)/\sigma\right)$. In this case, the above theorem leads to the following overall convergence (by our assumption, the $L_v^*$- and $\epsilon_t$-related terms are of higher order):

 E[L(\hat{\mathbf{w}}) - L_w^*] = \tilde{O}\left( \frac{\beta R^2 L_w^*}{\sigma n} + \sqrt{\frac{\beta R^2 (L_w^*)^2}{\sigma n}} \right). (8)

Equivalently, we can bound the sample complexity to achieve excess error $\epsilon$ as

 O\left( \frac{\beta R^2 L_w^*}{\sigma \epsilon} \cdot \frac{\epsilon + L_w^*}{\epsilon} \right). (9)

Compare this rate with the one obtained without the additional complexity control (Remark 2). In (9), as long as $L_w^* = O(\sqrt{\epsilon})$, rather than requiring $L^* = O(\epsilon)$ as in (4), the effective convergence rate of (7) is $O(1/n)$. To see that the difference is significant, if the target excess error is $\epsilon$, we now only need the optimal expected loss to be on the order of $\sqrt{\epsilon}$ to achieve the optimistic rate, instead of $\epsilon$ without distillation. In other words, we achieve the optimistic rate in the distillation setting under a much less stringent condition on the optimal expected loss.
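For the squared loss, the distillation objective (7) has a closed form, which makes the effect of the teacher-agreement regularizer easy to see on a toy problem. The setup below is a sketch under loud assumptions: the "teacher" is simply taken to be a small perturbation of the optimal student predictor (playing the role of a model trained on far more data), and all sizes and $\nu$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 15, 20                              # fewer labels than dimensions
w_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ w_star + 0.05 * rng.standard_normal(n)

# assumed teacher: close to w_star (e.g., trained on far more data);
# here we simply perturb w_star to play that role
v_hat = w_star + 0.02 * rng.standard_normal(d)

# student objective (7) for the squared loss:
#   min_w (1/n) sum_i (w^T x_i - y_i)^2 + (nu/2) ||w - v_hat||^2
# closed form: (2 X^T X / n + nu I) w = 2 X^T y / n + nu v_hat
nu = 1.0
w_distilled = np.linalg.solve(2 * X.T @ X / n + nu * np.eye(d),
                              2 * X.T @ y / n + nu * v_hat)

# plain ERM baseline (tiny ridge only for invertibility; n < d is ill-posed)
w_plain = np.linalg.solve(2 * X.T @ X / n + 1e-6 * np.eye(d),
                          2 * X.T @ y / n)

err_distilled = np.linalg.norm(w_distilled - w_star)
err_plain = np.linalg.norm(w_plain - w_star)
```

With $n < d$, plain ERM cannot recover the component of $\mathbf{w}^*$ outside the row space of the data, while pulling toward a good teacher supplies exactly that missing information, in line with the reduced search space $\{\mathbf{w} : \|\mathbf{w} - \hat{\mathbf{v}}\| \le S\}$.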

4 Analysis for learning using privileged information

In the case of learning with privileged information, we have access to two views: the regular features $\mathbf{x}$ (used by the student) and the privileged information $\mathbf{z}$ (used by the teacher). We first discuss how the correlation between them comes into play when transferring knowledge from $\mathbf{z}$ to $\mathbf{x}$. The prerequisites on CCA are discussed in Section 1.1.

Perform the following change of coordinates (recall that $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ have full dimensions and thus these new variables are well defined):

 \tilde{\mathbf{x}} = \tilde{\mathbf{U}}^\top \mathbf{x}, \quad \tilde{\mathbf{z}} = \tilde{\mathbf{V}}^\top \mathbf{z}, \quad \tilde{\mathbf{w}} = \tilde{\mathbf{U}}^\top \mathbf{w}, \quad \tilde{\mathbf{v}} = \tilde{\mathbf{V}}^\top \mathbf{v}. (10)

Clearly, the transformed data has identity covariance, i.e., $E[\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top] = \mathbf{I}$ and $E[\tilde{\mathbf{z}}\tilde{\mathbf{z}}^\top] = \mathbf{I}$. On the other hand, we have $E[\tilde{\mathbf{x}}\tilde{\mathbf{z}}^\top] = \Sigma$, while the decision values are preserved: $\mathbf{w}^\top \mathbf{x} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$ and $\mathbf{v}^\top \mathbf{z} = \tilde{\mathbf{v}}^\top \tilde{\mathbf{z}}$.

Based on the above coordinate transformation and the definition of CCA, an important observation made by Kakade and Foster (2007) is the following equality.

Lemma 6 (Lemma 2 of Kakade and Foster (2007) re-stated).

For any $\mathbf{w} \in \mathbb{R}^{d_x}$ and $\mathbf{v} \in \mathbb{R}^{d_z}$, we have

 E\left\| \mathbf{w}^\top \mathbf{x} - \mathbf{v}^\top \mathbf{z} \right\|^2 = \sum_{i=1}^{d_x} (1 - \lambda_i)(\tilde{w}^i)^2 + \sum_{i=1}^{r} \lambda_i (\tilde{w}^i - \tilde{v}^i)^2 + \sum_{i=1}^{d_z} (1 - \lambda_i)(\tilde{v}^i)^2 = \sum_{i=1}^{d_x} (\tilde{w}^i - \lambda_i \tilde{v}^i)^2 + \sum_{i=1}^{d_z} (1 - \lambda_i^2)(\tilde{v}^i)^2,

with the convention that $\lambda_i = 0$ for $i > r = \mathrm{rank}(E_{xz})$, and that $\tilde{v}^i = 0$ for $i > d_z$.

This lemma implies that, for a pair of predictors that agree well on the decision values, the predictors have low complexity in the CCA coordinate system: if $\lambda_i$ is large (close to $1$), minimizing the discrepancy encourages $\tilde{w}^i$ to be close to $\tilde{v}^i$, and if $\lambda_i$ is small (close to $0$), minimizing the discrepancy encourages both $\tilde{w}^i$ and $\tilde{v}^i$ to be small. For any predictor $\mathbf{v}$ of view $\mathbf{z}$, we define the operator $T_{\mathrm{CCA}}$ to return a corresponding predictor of view $\mathbf{x}$:

 T_{\mathrm{CCA}}(\mathbf{v}) = \tilde{\mathbf{U}} \cdot \left[ \lambda_i (\tilde{\mathbf{V}}^\top \mathbf{v})^i \right]_{i=1}^{d_x}.
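The identity of Lemma 6 can be verified numerically in closed form: under identity covariances, $E\|\mathbf{w}^\top \mathbf{x} - \mathbf{v}^\top \mathbf{z}\|^2 = \|\mathbf{w}\|^2 + \|\mathbf{v}\|^2 - 2\mathbf{w}^\top E_{xz} \mathbf{v}$, so no sampling is needed. A sketch (the dimensions and the randomly generated CCA system are arbitrary choices, and we take $d_x = d_z$ for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4                                   # take d_x = d_z = d for simplicity
# build a valid CCA system: unitary U~, V~ and correlations 1 >= lam_1 >= ...
Ut = np.linalg.qr(rng.standard_normal((d, d)))[0]
Vt = np.linalg.qr(rng.standard_normal((d, d)))[0]
lam = np.sort(rng.uniform(0.1, 1.0, d))[::-1]
Exz = Ut @ np.diag(lam) @ Vt.T          # cross-covariance of whitened views

w, v = rng.standard_normal(d), rng.standard_normal(d)

# population value of E||w^T x - v^T z||^2 under E_xx = E_zz = I
lhs = w @ w + v @ v - 2.0 * w @ Exz @ v

# second form of Lemma 6, evaluated in the CCA coordinate system
wt, vt = Ut.T @ w, Vt.T @ v
rhs = np.sum((wt - lam * vt) ** 2) + np.sum((1.0 - lam ** 2) * vt ** 2)

def T_cca(v):
    # the operator T_CCA: map a view-z predictor to a view-x predictor
    return Ut @ (lam * (Vt.T @ v))
```

The first sum in `rhs` is exactly $\|\mathbf{w} - T_{\mathrm{CCA}}(\mathbf{v})\|^2$, which is what makes $T_{\mathrm{CCA}}$ the natural target for the student in the next subsection.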

4.1 Multi-view distillation

Similar to the single-view distillation analyzed in Section 3, we can perform a two-step distillation process with multi-view data; this setting is also named generalized distillation by Lopez-Paz et al. (2015). In the first step, we train a predictor $\hat{\mathbf{v}}$ on labeled data from the view $\mathbf{z}$. In the second step, we train a predictor $\hat{\mathbf{w}}$ on labeled data from the view $\mathbf{x}$, incorporating the soft supervision provided by $\hat{\mathbf{v}}$. Note that the labeled data from the two views need not overlap.

As before, we make the assumption that $\hat{\mathbf{v}}$ is learned from the hypothesis class $\mathcal{C}_v = \{\mathbf{v} : \|\mathbf{v}\| \le B_v\}$ by performing ERM, with optimistic-rate convergence due to a low optimal expected loss $L_v^*$. But different from single-view distillation, since the view $\mathbf{z}$ contains privileged information (a rich representation of the data), learning is easy and a low-complexity hypothesis class (e.g., $B_v$ is small) suffices. The different assumptions on the complexities of hypothesis spaces for distillation vs. LUPI are motivated by Lopez-Paz et al. (2015, Section 4).

Additionally, assume the optimal predictor $\mathbf{w}^*$ from the hypothesis class $\mathcal{C}_w = \{\mathbf{w} : \|\mathbf{w}\| \le B_w\}$ agrees well with $\hat{\mathbf{v}}$ on decision values, i.e., there exists some small $S > 0$, such that

 E\left\| \mathbf{w}^{*\top} \mathbf{x} - \hat{\mathbf{v}}^\top \mathbf{z} \right\|^2 \le S^2. (11)

The following lemma provides an example when this assumption is satisfied.

Lemma 7.

Let the instantaneous loss be $\sigma$-strongly convex in the first argument. Let $\mathbf{v}$ be a predictor from $\mathcal{C}_v$ with $\|\mathbf{v}\| \le B_v$, and suppose the CCA-transformed predictor $T_{\mathrm{CCA}}(\mathbf{v})$ achieves expected loss $L_{\mathrm{CCA}}(\mathbf{v}) := L(T_{\mathrm{CCA}}(\mathbf{v}))$. Then we have

 \|\mathbf{w}^* - T_{\mathrm{CCA}}(\mathbf{v})\|^2 = E\left\| \mathbf{w}^{*\top} \mathbf{x} - (T_{\mathrm{CCA}}(\mathbf{v}))^\top \mathbf{x} \right\|^2 \le \frac{2(L_{\mathrm{CCA}}(\mathbf{v}) - L_w^*)}{\sigma}, \quad \text{and} \quad E\left\| \mathbf{w}^{*\top} \mathbf{x} - \mathbf{v}^\top \mathbf{z} \right\|^2 \le \frac{2(L_{\mathrm{CCA}}(\mathbf{v}) - L_w^*)}{\sigma} + (1 - \lambda_{d_z}^2) B_v^2,

where the expectation is taken over the data distribution.

Proof.

The proof is similar to that of Lemma 3. By the definition of $T_{\mathrm{CCA}}$, and the fact that $\lambda_i \le 1$, we have $\|T_{\mathrm{CCA}}(\mathbf{v})\| \le \|\mathbf{v}\| \le B_v$, so that $T_{\mathrm{CCA}}(\mathbf{v}) \in \mathcal{C}_w$ (recall that $B_v \le B_w$ in this setting). Now that both $\mathbf{w}^*$ and $T_{\mathrm{CCA}}(\mathbf{v})$ are in $\mathcal{C}_w$ and $\mathbf{w}^*$ is the minimizer of $L$ in $\mathcal{C}_w$, we can use the same argument as in Lemma 3 and have

 L_{\mathrm{CCA}}(\mathbf{v}) - L_w^* \ge \frac{\sigma}{2} \cdot E\left\| \mathbf{w}^{*\top} \mathbf{x} - (T_{\mathrm{CCA}}(\mathbf{v}))^\top \mathbf{x} \right\|^2.

By our assumption of $E_{xx} = \mathbf{I}$, it holds that $E\| \mathbf{w}^{*\top} \mathbf{x} - (T_{\mathrm{CCA}}(\mathbf{v}))^\top \mathbf{x} \|^2 = \|\mathbf{w}^* - T_{\mathrm{CCA}}(\mathbf{v})\|^2$, and the first inequality follows.

To show the second inequality, just observe from Lemma 6 that

 E\left\| \mathbf{w}^{*\top} \mathbf{x} - \mathbf{v}^\top \mathbf{z} \right\|^2 = \|\mathbf{w}^* - T_{\mathrm{CCA}}(\mathbf{v})\|^2 + \sum_{i=1}^{d_z} (1 - \lambda_i^2)(\tilde{v}^i)^2 \le \|\mathbf{w}^* - T_{\mathrm{CCA}}(\mathbf{v})\|^2 + (1 - \lambda_{d_z}^2) \sum_{i=1}^{d_z} (\tilde{v}^i)^2 \le \|\mathbf{w}^* - T_{\mathrm{CCA}}(\mathbf{v})\|^2 + (1 - \lambda_{d_z}^2) \|\mathbf{v}\|^2. \qquad \blacksquare

We can then use the soft targets in the regularization, and the following theorem is completely analogous to Theorem 4.

Theorem 8.

Compute either of the following predictors

 \hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathcal{C}_w} \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{w}^\top \mathbf{x}_i, y_i) + \frac{\nu}{2} \|\mathbf{w} - T_{\mathrm{CCA}}(\hat{\mathbf{v}})\|^2 (12)

or

 \hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathcal{C}_w} \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{w}^\top \mathbf{x}_i, y_i) + \frac{\nu}{2} E_{\mathbf{x}, \mathbf{z}} \left\| \mathbf{w}^\top \mathbf{x} - \hat{\mathbf{v}}^\top \mathbf{z} \right\|^2 (13)

with $\nu$ chosen as in Theorem 4. Then we have

 E[L(\hat{\mathbf{w}}) - L_w^*] \le \frac{16 \beta R^2 S^2}{n} + \sqrt{\frac{16 \beta R^2 S^2 L_w^*}{n}}.

The difference between the two predictors is that the regularization term in (13) does not require computing $T_{\mathrm{CCA}}(\hat{\mathbf{v}})$ or even the CCA system. But in view of Lemma 7, the difference between the two regularizers in (12) and (13) is bounded by a small constant.

Remark 9.

Similar to the single-view distillation setting, the additional complexity constraint from the multi-view agreement assumption allows us to effectively reduce the search space when $S \ll B_w$. In particular, when the assumptions of Lemma 7 hold, the predictor (12) achieves the convergence rate

 L(\hat{\mathbf{w}}) - L_w^* = O\left( \frac{\beta R^2 (L_{\mathrm{CCA}}(\hat{\mathbf{v}}) - L_w^*)}{\sigma n} + \sqrt{\frac{\beta R^2 L_w^* (L_{\mathrm{CCA}}(\hat{\mathbf{v}}) - L_w^*)}{\sigma n}} \right).

To achieve excess error $\epsilon$, the sample complexity can be written as

 n = O\left( \frac{\beta R^2 (L_{\mathrm{CCA}}(\hat{\mathbf{v}}) - L_w^*)}{\sigma \epsilon} \left( \frac{\epsilon + L_w^*}{\epsilon} \right) \right).

Assuming $L_{\mathrm{CCA}}(\hat{\mathbf{v}}) - L_w^* = O(L_w^*)$, then as long as $L_w^* = O(\sqrt{\epsilon})$, an overall sample complexity of $O(1/\epsilon)$, or equivalently a convergence rate of $O(1/n)$, is achieved. Again, the optimistic rate is achieved under a less stringent requirement on $L_w^*$.

Here, the expected loss $L_{\mathrm{CCA}}(\hat{\mathbf{v}})$ measures the quality of the predictor transformed from a good predictor of view $\mathbf{z}$, and in a sense how well the privileged information can be transferred. If $T_{\mathrm{CCA}}(\hat{\mathbf{v}})$ is already near-optimal for view $\mathbf{x}$, all the student has to do is to stay close to it.
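As an illustration, the two-step procedure with regularizer (12) can be sketched for the squared loss. The synthetic whitened views below take $\tilde{\mathbf{U}} = \tilde{\mathbf{V}} = \mathbf{I}$, so that $T_{\mathrm{CCA}}$ reduces to coordinate-wise scaling by $\lambda_i$; the sample sizes, noise level, and $\nu$ are all illustrative assumptions, not prescriptions from the analysis.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_teacher, n_student = 10, 20000, 30
lam = np.full(d, 0.95)                  # canonical correlations; U~ = V~ = I
w_star = rng.standard_normal(d) / np.sqrt(d)

def sample(n):
    # whitened views with E_xx = E_zz = I and E_xz = diag(lam)
    x = rng.standard_normal((n, d))
    z = x * lam + rng.standard_normal((n, d)) * np.sqrt(1 - lam ** 2)
    y = x @ w_star + 0.05 * rng.standard_normal(n)
    return x, z, y

# step 1: teacher on the privileged view z, with plenty of labeled data
_, Z, y_t = sample(n_teacher)
v_hat = np.linalg.solve(Z.T @ Z / n_teacher + 1e-3 * np.eye(d),
                        Z.T @ y_t / n_teacher)

# step 2: student on view x with few labels, pulled toward
# T_CCA(v_hat) = U~ diag(lam) V~^T v_hat  (= lam * v_hat here), as in (12)
X, _, y = sample(n_student)
target = lam * v_hat
nu = 1.0
w_hat = np.linalg.solve(2 * X.T @ X / n_student + nu * np.eye(d),
                        2 * X.T @ y / n_student + nu * target)
```

In this construction the population-optimal teacher is $\Lambda \mathbf{w}^*$, so $T_{\mathrm{CCA}}(\hat{\mathbf{v}}) \approx \Lambda^2 \mathbf{w}^*$: the closer the canonical correlations are to $1$, the better the transferred predictor, exactly the role $L_{\mathrm{CCA}}(\hat{\mathbf{v}})$ plays in Remark 9.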

4.2 Simultaneous learning of teacher and student

In the previous sections, we required first learning the teacher and then the student. We now show that it is possible to learn both predictors at the same time, using a multi-task learning objective. This is the same setup considered by the SVM+ algorithm (Vapnik and Izmailov, 2015).

Similar to the distillation setting, we would like to learn predictors with low norm that agree well on decision values; the hypothesis class is defined as

 \mathcal{C} = \left\{ (\mathbf{w}, \mathbf{v}) : \|\mathbf{w}\| \le B_w, \; \|\mathbf{v}\| \le B_v, \; E\left\| \mathbf{w}^\top \mathbf{x} - \mathbf{v}^\top \mathbf{z} \right\|^2 \le S^2 \right\}.

Let $(\mathbf{w}^*, \mathbf{v}^*)$ be the optimal predictors from $\mathcal{C}$. Denote $L^* = L(\mathbf{w}^*) + L(\mathbf{v}^*)$.

As before, we can learn over the hypothesis class $\mathcal{C}$ with regularized ERM. The agreement term now regularizes the learning of both $\mathbf{w}$ and $\mathbf{v}$, and this is known as “co-regularization” (Sindhwani et al., 2005; Rosenberg and Bartlett, 2007).
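Here is a minimal sketch of co-regularized ERM for the squared loss, with the expectation in the agreement term approximated by its empirical average on the same samples; the problem sizes and the choices of $\gamma$, $\nu$, and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 8, 200
lam = np.full(d, 0.9)                   # canonical correlations; U~ = V~ = I
w_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
Z = X * lam + rng.standard_normal((n, d)) * np.sqrt(1 - lam ** 2)
y = X @ w_star + 0.05 * rng.standard_normal(n)

# co-regularized objective with squared loss: both views fit the labels,
# both are norm-penalized, and the agreement term couples them
gamma, nu, lr = 0.05, 0.5, 0.05
w, v = np.zeros(d), np.zeros(d)
for _ in range(3000):
    agree = X @ w - Z @ v               # decision-value disagreement
    gw = 2 / n * X.T @ (X @ w - y) + gamma * w + nu / n * X.T @ agree
    gv = 2 / n * Z.T @ (Z @ v - y) + gamma * v - nu / n * Z.T @ agree
    w, v = w - lr * gw, v - lr * gv

fit_w = np.mean((X @ w - y) ** 2)       # empirical loss of each view
fit_v = np.mean((Z @ v - y) ** 2)
```

Note the two gradients share the `agree` term with opposite signs: the agreement penalty pulls the two predictors' decision values toward each other, which is exactly the co-regularization effect analyzed below.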

Lemma 10.

Let $\{(\mathbf{x}_i, \mathbf{z}_i, y_i)\}_{i=1}^{n}$ be i.i.d. samples from the distribution. Compute the following predictors

 (\hat{\mathbf{w}}, \hat{\mathbf{v}}) = \arg\min_{\mathbf{w}, \mathbf{v}} \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{w}^\top \mathbf{x}_i, y_i) + \frac{1}{n} \sum_{j=1}^{n} \ell(\mathbf{v}^\top \mathbf{z}_j, y_j) + \frac{\gamma}{2}\|\mathbf{w}\|^2 + \frac{\gamma}{2}\|\mathbf{v}\|^2 + \frac{\nu}{2} E\left\| \mathbf{w}^\top \mathbf{x} - \mathbf{v}^\top \mathbf{z} \right\|^2 (14)

with $\gamma$ and $\nu$ set as in the proof, where $D$ is a complexity parameter depending on $B_w$, $B_v$, and $S$. Then we have

 E[(L(\hat{\mathbf{w}}) + L(\hat{\mathbf{v}})) - L^*] \le \frac{32 \beta R^2 D^2}{n} + \sqrt{\frac{32 \beta R^2 D^2 L^*}{n}}.
Proof.

Perform the change of coordinates in (10), and let $\mathbf{p} = [\tilde{\mathbf{w}}; \tilde{\mathbf{v}}]$ be the concatenated variable. In view of Lemma 6, we observe that the regularization in (14) (the sum of the last three terms) is

 \frac{\gamma}{2}\|\mathbf{w}\|^2 + \frac{\gamma}{2}\|\mathbf{v}\|^2 + \frac{\nu}{2} E\left\| \mathbf{w}^\top \mathbf{x} - \mathbf{v}^\top \mathbf{z} \right\|^2 = \frac{1}{2} \mathbf{p}^\top \mathbf{M} \mathbf{p}, \quad \text{where} \quad \mathbf{M} = \begin{pmatrix} (\gamma + \nu)\mathbf{I} & -\nu \Sigma \\ -\nu \Sigma^\top & (\gamma + \nu)\mathbf{I} \end{pmatrix}

and $\Sigma$ is the diagonal matrix in the SVD of $E_{xz}$.

Further define the change of variable $\mathbf{q} = \mathbf{M}^{\frac{1}{2}} \mathbf{p}$; the regularizer becomes $\frac{1}{2}\|\mathbf{q}\|^2$, and so it is $1$-strongly convex in $\mathbf{q}$.

On the other hand, we can rewrite the instantaneous loss in terms of $\mathbf{q}$:

 \ell(\mathbf{w}^\top \mathbf{x}, y) = \ell(\tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}, y) = \ell\left( \mathbf{p}^\top \begin{bmatrix} \tilde{\mathbf{x}} \\ \mathbf{0} \end{bmatrix}, y \right) = \ell\left( \mathbf{q}^\top \mathbf{M}^{-\frac{1}{2}} \begin{bmatrix} \tilde{\mathbf{x}} \\ \mathbf{0} \end{bmatrix}, y \right).

The smoothness of the loss with respect to $\mathbf{q}$ depends on the squared norm of its linear input, which is

 \left\| \mathbf{M}^{-\frac{1}{2}} \begin{bmatrix} \tilde{\mathbf{x}} \\ \mathbf{0} \end{bmatrix} \right\|^2 = \tilde{\mathbf{x}}^\top (\mathbf{M}^{-1})_{xx} \tilde{\mathbf{x}} \le \left\| (\mathbf{M}^{-1})_{xx} \right\| R^2 (15)

where $(\mathbf{M}^{-1})_{xx}$ is the top-left block of $\mathbf{M}^{-1}$ with dimension $d_x \times d_x$. By the inversion formula of block matrices, we have

 (\mathbf{M}^{-1})_{xx} = \left( (\gamma + \nu)\mathbf{I} - \frac{\nu^2}{\gamma + \nu} \Sigma \Sigma^\top \right)^{-1}.

Since the minimum eigenvalue of the matrix inside the inverse is at least $\frac{(\gamma + \nu)^2 - \lambda_1^2 \nu^2}{\gamma + \nu}$, we continue from (15) and claim that the instantaneous loss of view $\mathbf{x}$ is smooth in $\mathbf{q}$ with parameter

 \beta' = \frac{(\gamma + \nu) \beta R^2}{(\gamma + \nu)^2 - \lambda_1^2 \nu^2}.

The same smoothness condition holds for the view $\mathbf{z}$, and therefore the combined instantaneous loss is $2\beta'$-smooth in $\mathbf{q}$.

The empirical loss in (14) thus approximates the expectation of the combined loss with paired samples. By the stability of regularized ERM for smooth losses, we have that

 E[L(\hat{\mathbf{w}}) + L(\hat{\mathbf{v}})] \le \frac{1}{1 - \frac{8\beta'}{n}} E[\hat{L}(\hat{\mathbf{w}}) + \hat{L}(\hat{\mathbf{v}})].

Since $(\hat{\mathbf{w}}, \hat{\mathbf{v}})$ is the minimizer of the multi-task objective, we have

 \hat{L}(\hat{\mathbf{w}}) + \hat{L}(\hat{\mathbf{v}}) \le \hat{L}(\mathbf{w}^*) + \hat{L}(\mathbf{v}^*) + \frac{\gamma}{2}\|\mathbf{w}^*\|^2 + \frac{\gamma}{2}\|\mathbf{v}^*\|^2 + \frac{\nu}{2} E\left\| \mathbf{w}^{*\top} \mathbf{x} - \mathbf{v}^{*\top} \mathbf{z} \right\|^2 \le \hat{L}(\mathbf{w}^*) + \hat{L}(\mathbf{v}^*) + \frac{\gamma (B_w^2 + B_v^2)}{2} + \frac{\nu S^2}{2}.

Define the shorthand $B^2 = B_w^2 + B_v^2$.

Taking the expectation over the samples and re-organizing terms, we obtain

 E[(L(\hat{\mathbf{w}}) + L(\hat{\mathbf{v}})) - L^*] \le \left( \frac{1}{1 - \frac{8\beta'}{n}} - 1 \right) L^* + \frac{1}{1 - \frac{8\beta'}{n}} \left( \frac{\gamma B^2}{2} + \frac{\nu S^2}{2} \right).

Let $\gamma = \frac{1}{\Delta B^2}$ and $\nu = \frac{1}{\Delta S^2}$ for some $\Delta > 0$ to be chosen. Substituting them into the definition of $\beta'$, we obtain

 \beta' = \alpha \Delta \quad \text{where} \quad \alpha = \frac{\beta R^2 B^2 S^2 (B^2 + S^2)}{(B^2 + S^2)^2 - \lambda_1^2 B^4}.

Furthermore,

 E[(L(^w)+L(^v))−L∗]≤⎛⎝11−