1 Introduction
Research communities have amassed a sizable number of deep net architectures for different tasks, and new ones are added almost daily. Some of those architectures are trained from scratch while others are finetuned, i.e., before training, their weights are initialized using a structurally similar deep net which was trained on different data.
Beyond finetuning, particularly in reinforcement learning, teachers have also been considered in one way or another by Rusu et al. (2016b); Fernando et al. (2017); Wang et al. (2017); Li and Hoiem (2016); Bengio et al. (2009); Patel et al. (2015); Chen and Liu (2016); Teh et al. (2017); Parisotto et al. (2016). For instance, progressive neural net (Rusu et al., 2016b) keeps multiple teachers during both training and inference, and learns to extract useful features from the teachers for a new target task. PathNet (Fernando et al., 2017) uses genetic algorithms to choose pathways from a giant network for learning new tasks. ‘Growing a Brain’ (Wang et al., 2017) finetunes a neural network while growing the network’s capacity (wider or deeper layers). Actor-mimic (Parisotto et al., 2016) pretrains a big model on multiple source tasks; the big model is then used as a weight initialization for a new model which is trained on a new target task. Knowledge distillation (Hinton et al., 2015) distills knowledge from a large ensemble of models to a smaller student model.
However, all the aforementioned techniques have limitations. For example, progressive neural net models (Rusu et al., 2016b) grow with the number of teachers. The resulting large number of parameters limits the number of teachers a progressive neural net can handle, and considerably increases the training and testing time. In PathNet (Fernando et al., 2017), searching over a big network for pathways is computationally intensive. For finetuning-based methods such as ‘Growing a Brain’ (Wang et al., 2017) and actor-mimic (Parisotto et al., 2016), only one pretrained model can be used at a time. Hence, their performance heavily relies on the chosen pretrained model.
To address these shortcomings, we develop knowledge flow, which moves ‘knowledge’ from multiple teachers to a student during training. Irrespective of how many teachers we use, the student is guaranteed to become independent at the final stage of training and the size of the resulting student net remains constant. In addition, our framework places no restrictions on the deep net size of the teachers and the student, which provides flexibility in choosing teacher models. Importantly, our approach is applicable to a variety of tasks from reinforcement learning to fully supervised training.
We evaluate knowledge flow on a variety of tasks from reinforcement learning to fully supervised learning. In particular, we follow Rusu et al. (2016b); Fernando et al. (2017) and compare on the same Atari games. In addition, we observe significant top-1 error rate improvements on supervised learning datasets, e.g., CIFAR-10 and CIFAR-100.
2 Background
Knowledge flow is applicable to a variety of settings from supervised learning to reinforcement learning, which we briefly review to introduce notation.
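Supervised Learning recovers the parameters $\theta$ of a mapping $f_\theta: \mathcal{X} \to \mathcal{Y}$ from data space $\mathcal{X}$ to output space $\mathcal{Y}$. To this end, a dataset $\mathcal{D}$ containing pairs $(x, y)$ (assumed to be sampled i.i.d.) is used, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Given this dataset, the parameters $\theta$ of the mapping $f_\theta$ are learned by minimizing a loss function $\ell_{(x,y)}(\theta)$ composed of a regularization term and an empirical risk which compares groundtruth label $y$ and prediction $f_\theta(x)$. The parameters are obtained by optimizing the following program:
$$\min_{\theta} \; \mathbb{E}_{(x,y)}\left[\ell_{(x,y)}(\theta)\right]. \tag{1}$$
Hereby, the mapping $f_\theta$ is obtained by maximizing the logits or a corresponding probability distribution $\hat{f}_\theta$, i.e., $f_\theta(x) = \arg\max_{y} \hat{f}_\theta(y \mid x)$. Here and below let the hat (‘$\hat{\cdot}$’) indicate probability distributions over appropriate domains.
Reinforcement Learning considers an agent interacting with an environment according to a policy $\pi$ which maps a state $s_t$ to an action $a_t$ at time $t$. The policy depends on the parameters $\theta_\pi$. After performing action $a_t$, the agent observes the next state $s_{t+1}$ and receives a scalar reward $r_t$. The discounted return at time $t$ is defined as $R_t = \sum_{i \geq 0} \gamma^i r_{t+i}$, where $\gamma \in (0, 1]$ is the discount factor. The expected future reward when observing state $s$ and when following policy $\pi$ is defined as $V^\pi(s) = \mathbb{E}_{\tau}\left[R_t \mid s_t = s\right]$, where $\tau$ is a trajectory generated by following $\pi$ from state $s$.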
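The goal of reinforcement learning is to find a policy that maximizes the expected future reward from each state $s$. Without loss of generality, in this paper, we follow the asynchronous advantage actor-critic (A3C) formulation (Mnih et al., 2016). In A3C, the policy mapping $\pi$ is obtained from a probability distribution $\hat{\pi}_{\theta_\pi}(a \mid s)$ over actions given a state, where $\hat{\pi}_{\theta_\pi}$ is modeled by a deep net with parameters $\theta_\pi$. The value function is also approximated by a deep net $V_{\theta_v}$, having parameters $\theta_v$.
To optimize the policy parameters $\theta_\pi$ given a state $s_t$, a loss function based on a scaled negative log-likelihood and a negative entropy regularizer is common:
$$\ell_\pi^\tau(\theta_\pi) = -\frac{1}{T}\sum_{t=1}^{T}\left[\left(R_t - V_{\theta_v}(s_t)\right)\log \hat{\pi}_{\theta_\pi}(a_t \mid s_t) + \beta H\left(\hat{\pi}_{\theta_\pi}(\cdot \mid s_t)\right)\right].$$
Hereby, $R_t$ is the empirical multi-step return obtained when starting in state $s_t$, and $T$ is the length of the trajectory $\tau$ generated by following $\hat{\pi}_{\theta_\pi}$. The scalar $\beta \geq 0$ is a user-specified constant, and $H$ is the entropy function, which encourages exploration by favoring a uniform probability distribution $\hat{\pi}_{\theta_\pi}$.
To optimize the value function $V_{\theta_v}$, it is common to use the squared loss
$$\ell_v^\tau(\theta_v) = \frac{1}{T}\sum_{t=1}^{T}\left(R_t - V_{\theta_v}(s_t)\right)^2.$$
By minimizing the empirical expectation of $\ell_\pi^\tau$ and $\ell_v^\tau$, i.e., by addressing
$$\begin{cases} \min_{\theta_\pi} \; \mathbb{E}_{\tau \sim \hat{\pi}_{\theta_\pi}}\left[\ell_\pi^\tau(\theta_\pi)\right] \\ \min_{\theta_v} \; \mathbb{E}_{\tau \sim \hat{\pi}_{\theta_\pi}}\left[\ell_v^\tau(\theta_v)\right] \end{cases} \tag{2}$$
alternatingly, we learn a policy and a value function that maximize expected return.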
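To make the two A3C loss terms described above concrete, the following is a minimal pure-Python sketch. It is an illustration under assumptions rather than the implementation used in the experiments: the function names, the averaging over the trajectory length, and the default entropy weight `beta=0.01` are hypothetical choices, and a real implementation would use an automatic differentiation library.

```python
import math


def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """Compute R_t = r_t + gamma * R_{t+1}, seeded with a bootstrap value."""
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def entropy(probs):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a3c_losses(action_probs, actions, values, returns, beta=0.01):
    """Return (policy_loss, value_loss) averaged over one trajectory.

    action_probs[t] is the policy distribution at step t, actions[t] the
    chosen action index, values[t] the value estimate, returns[t] the
    empirical return. The advantage is treated as a constant here.
    """
    steps = len(actions)
    policy_loss = 0.0
    value_loss = 0.0
    for probs, action, value, ret in zip(action_probs, actions, values, returns):
        advantage = ret - value
        # Scaled negative log-likelihood plus negative entropy regularizer.
        policy_loss += -math.log(probs[action]) * advantage - beta * entropy(probs)
        # Squared loss for the value function.
        value_loss += (ret - value) ** 2
    return policy_loss / steps, value_loss / steps
```

With `beta=0.0` the policy term reduces to the advantage-weighted negative log-likelihood, which can be handy when checking an implementation against hand-computed values.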
3 Knowledge Flow
Instead of optimizing the programs given in Eq. (1) and Eq. (2) from scratch, the aforementioned warm-start techniques (see Sec. 5 for more) are applicable. To address their mentioned shortcomings, we propose knowledge flow, a framework that moves ‘knowledge’ from an arbitrary number of deep nets, henceforth referred to as ‘teachers,’ to a deep net under training, called the ‘student.’
3.1 Overview
Knowledge flow is outlined on example deep nets in Fig. 1 (a, b). We train the parameters of the student net which are randomly initialized. To this end we take advantage of teachers, whose parameters are fixed and obtained from pretrained models on different source tasks by different algorithms. For example, for reinforcement learning, we may consider teachers trained by A3C (Mnih et al., 2016), A2C (Dhariwal et al., 2017) or DQN (Mnih et al., 2015).
‘Knowledge’ of multiple teachers is transferred to a student by adding transformed and scaled intermediate representations from the teacher deep nets to the student net. To achieve this, we modify the student net, i.e., $\hat{f}_\theta$ in the supervised setting and $\hat{\pi}_{\theta_\pi}$ and $V_{\theta_v}$ in the reinforcement learning case. We add teacher representations which are transformed by multiplication with a trainable matrix $Q$ and scaled via a weight that is normalized to sum to one for each student layer and parameterized via trainable parameters $w$. The normalized weights encode which of the teachers’ or the student’s representation to trust at every layer of the student net. Note that a teacher can help the student at different levels of abstraction with input from different levels of its net.
Importantly, after training, the student model should perform well on the target task without relying on teachers. To achieve this, as training progresses, we increasingly encourage a high normalized weight on the student representation, which forces the student to eventually capture all the ‘knowledge.’ Due to the trainable scaling, at an early stage of training, we observe the student to rely heavily on the ‘knowledge’ of the teachers to quickly obtain better performance. However, as training proceeds, the student is encouraged to become more and more independent. During the final stages of training, the student is no longer able to rely on teachers, which ensures that the student has learned to master the desired task on its own. This is observed in Fig. 1 (c).
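To formally encourage this successive transfer we introduce two additional loss functions. The first, referred to as the dependency loss $\ell_{\text{dep}}(w)$, captures how much a student relies on teachers. It depends on the weight vector $w$ which encodes the strength of the coupling. The second one ensures that a student’s behavior doesn’t change rapidly when the teachers’ influence decreases. We use the loss $\ell_{\text{KL}}$ to capture the change.
By combining student net modifications and additional loss terms, for the supervised task we obtain
$$\min_{\theta, w, Q} \; \mathbb{E}_{(x,y)}\left[\tilde{\ell}_{(x,y)}(\theta, w, Q) + \lambda_1 \ell_{\text{dep}}(w) + \lambda_2 \ell_{\text{KL}}\left(\tilde{\hat{f}}_{\theta}, \tilde{\hat{f}}_{\theta_{\text{old}}}\right)\right],$$
and for reinforcement learning the transformed program reads as follows:
$$\begin{cases} \min_{\theta_\pi, w, Q} \; \mathbb{E}_{\tau \sim \tilde{\pi}_{\theta_\pi}}\left[\tilde{\ell}_\pi^\tau(\theta_\pi, w, Q) + \lambda_1 \ell_{\text{dep}}(w) + \lambda_2 \ell_{\text{KL}}^\tau\left(\tilde{\hat{\pi}}_{\theta_\pi}, \tilde{\hat{\pi}}_{\theta_{\pi,\text{old}}}\right)\right] \\ \min_{\theta_v, w, Q} \; \mathbb{E}_{\tau \sim \tilde{\pi}_{\theta_\pi}}\left[\tilde{\ell}_v^\tau(\theta_v, w, Q)\right]. \end{cases}$$
The loss $\tilde{\ell}$ originates from the original loss (Eqs. (1) and (2)) by transforming the deep net to include cross-connections, hence its dependence on $w$ and $Q$. The tilde (‘$\tilde{\cdot}$’) denotes this dependence, also for the probability distribution $\tilde{\hat{f}}$ and the policy distribution $\tilde{\hat{\pi}}$. Parameters from the current and a previous iteration are referred to via $\theta$ and $\theta_{\text{old}}$ respectively.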
For both supervised and reinforcement learning, $\lambda_1$ and $\lambda_2$ control the strength with which the influence of the teachers is decreased.
A low $\lambda_1$ allows the student to rely on teachers. Close to the end of training, the student should be independent. Therefore, we set $\lambda_1$ to a small value at the beginning, and gradually increase its value as training progresses.
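As an illustration of such a schedule, the sketch below ramps the dependency-loss weight linearly from a small starting value to its final value. The function name `ramp_lambda` and its arguments are hypothetical; a linear ramp is only one possible choice of a gradually increasing schedule.

```python
def ramp_lambda(step, total_steps, final_value, start_value=0.0):
    """Linearly interpolate from start_value to final_value over training.

    The fraction of completed training is clipped to [0, 1] so that the
    weight stays at final_value once total_steps has been reached.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start_value + fraction * (final_value - start_value)
```

For example, `ramp_lambda(50, 100, 2.0)` returns the halfway value between zero and the final weight.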
Note that we don’t make any assumptions about the teachers’ and the student’s objectives. If a teacher’s and the student’s objectives differ, negative transfer may occur initially. However, the proposed method quickly decreases the weight for teacher layers to reduce this effect. Despite such differences, students could potentially still benefit from the low-level representations of the teachers. We do observe this low-level knowledge transfer in our experiments.
In the following we first describe how to modify the deep nets, before we detail the loss functions $\ell_{\text{dep}}$ and $\ell_{\text{KL}}$, which are used to successively decrease the influence of the teachers.
3.2 Deep Net Transformation and Loss Terms
Deep Net Transformation: Knowledge flow enhances the student by adding transformed and scaled intermediate representations from teacher models. To perform the transformation, intermediate representations from teachers are first multiplied by transformation matrices $Q$. Then the transformed representations from teachers and representations from the student are linearly combined. The weights for this linear combination are normalized to sum to one for each student layer.
The student model and every teacher model are each assigned an index, and $\theta$ refers to the parameters of the student. For every teacher we number its deep net layers from one up to the number of layers in that teacher, and we number the layers of the student model in the same way. For every layer we distinguish its output right before and right after the activation unit.
To align a teacher’s layer with a student’s layer, we introduce a learnable transformation matrix $Q$ whose dimensions are given by the number of elements in the two corresponding layers. Multiplication with $Q$ aligns the representation from the chosen layer of the teacher with the representation of the chosen layer of the student.
For each layer in the student model, we define a candidate set, which contains the student layer itself and all the teachers’ layers to be considered. For example, in Fig. 1 (a), layer one of the student model is combined with layer one of teacher one and layer two of teacher two. Therefore, the candidate set of layer one of the student model consists of these three layers.
To decide which teachers’ or the student’s representation to trust at every layer of the student net, we introduce a normalized weight for every element of the candidate set of every student layer. The weights are parameterized by $w$ and sum to one for each layer in the student deep net.
The combined intermediate representation of a student layer is the weighted sum of the student’s own representation and the transformed teacher representations in its candidate set. The normalized weight of a teacher layer determines how much the student layer relies on the transformed representation of that teacher layer. Intuitively, if the transformed representation of a teacher layer is helpful, its weight will be close to one. We visualize the deep net transformation in Fig. 1 (b).
Note that the intermediate representations of teachers are not changed in our framework: the output of a teacher layer is obtained by applying the original activation unit to the original representation.
The maximal number of introduced matrices in our framework is the number of student layers times the total number of teacher layers. In practice, we don’t link a student’s layer to every layer of a teacher network. Intuitively, a teacher’s bottom layer features are very likely irrelevant to a student’s top layer features. Indeed, we observed that linking a teacher’s bottom layer to a student’s top layer generally doesn’t yield improvements. Therefore, in practice, we recommend linking one teacher layer to one or two student layers, in which case the number of introduced matrices $Q$ is on the order of the total number of teacher layers. Also note that while additional trainable parameters $w$ and $Q$ are introduced in our framework, they are not part of the resulting student network, since we ensure that the normalized weight on the student’s own representation equals one at the end of training, as discussed next. Hence, the additional parameters function as auxiliary knobs that help the student learn faster. In the final stage of training, the student is independent (see Fig. 1 (c)) and no longer relies on $w$, $Q$, or any transformed representations from teachers.
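The layer combination described above can be sketched in a few lines of pure Python. This is a minimal illustration under assumptions: the softmax parameterization of the normalized weights, the function names, and the list-based vectors and matrices are hypothetical choices, and the source does not specify which exact representations (before or after the activation) enter the sum.

```python
import math


def softmax(scores):
    """Map unconstrained scores to non-negative weights that sum to one."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(q * x for q, x in zip(row, vector)) for row in matrix]


def combine_layer(student_repr, teacher_reprs, transforms, scores):
    """Combine one student layer with the teacher layers in its candidate set.

    scores[0] belongs to the student itself; scores[k + 1] belongs to the
    k-th teacher layer, whose representation is first aligned to the
    student's dimensionality by its transformation matrix transforms[k].
    """
    weights = softmax(scores)
    combined = [weights[0] * x for x in student_repr]
    for weight, transform, teacher in zip(weights[1:], transforms, teacher_reprs):
        aligned = matvec(transform, teacher)
        combined = [c + weight * a for c, a in zip(combined, aligned)]
    return combined, weights
```

When the score of the student entry dominates, the combined representation collapses to the student's own representation, which mirrors the independence the framework aims for at the end of training.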
Decreasing Teachers’ Influence: We successively decrease the influence of the teachers during training by gradually encouraging the normalized weight that each student layer places on its own representation to increase to a value of one. To capture how much the student relies on teachers, we introduce the dependency loss $\ell_{\text{dep}}(w)$ as the negative log probability, i.e., the negative logarithm of the normalized weights on the student’s own representations.
By minimizing $\ell_{\text{dep}}(w)$, we encourage the weights for the layers of the student to increase. Hence we encourage the student to become more and more independent. During the final stage of training, the normalized weight on the student’s own representation approaches one for all student layers, making the student independent of the transformed representations obtained from teachers.
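A minimal sketch of such a dependency loss follows. It assumes, as a labeled simplification, that the loss sums the negative log of the student's own normalized weight over all student layers; the function name and the list-based input are hypothetical.

```python
import math


def dependency_loss(student_weights):
    """Negative log of the normalized weight each student layer puts on itself.

    student_weights[j] is the normalized weight (in (0, 1]) that student
    layer j assigns to its own representation. The loss is zero exactly
    when every student layer is fully independent of the teachers.
    """
    return -sum(math.log(p) for p in student_weights)
```

The loss grows without bound as any student weight approaches zero, so minimizing it pushes every layer toward relying on its own representation.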
Empirically, we found that a fast decrease of the influence of the teachers can degrade the performance. This is intuitive as it requires some time to find good transformations $Q$. Moreover, decreasing the influence of a teacher too fast may change the output distribution over labels or actions of the student model too much, and thus lead to performance loss. To prevent changing a student’s output distribution too fast, we found a Kullback-Leibler (KL) regularizer to yield good results. More specifically, in the case of supervised learning we use the KL divergence $\ell_{\text{KL}}(\tilde{\hat{f}}_{\theta}, \tilde{\hat{f}}_{\theta_{\text{old}}})$ between the student’s output distributions under the previous and the current parameters.
Hereby, $\theta$ is the set of current parameters, and $\theta_{\text{old}}$ are the previous ones. In the reinforcement learning case we use $\ell_{\text{KL}}^\tau(\tilde{\hat{\pi}}_{\theta_\pi}, \tilde{\hat{\pi}}_{\theta_{\pi,\text{old}}})$.
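The KL regularizer can be illustrated for discrete output distributions as follows. This is a sketch under assumptions: the direction of the divergence (previous distribution first), the small constant `eps` for numerical safety, and the function name are hypothetical choices, not details confirmed by the text.

```python
import math


def kl_divergence(p_old, p_new, eps=1e-12):
    """KL(p_old || p_new) for two discrete distributions of equal length.

    p_old is the student's output distribution under the previous
    parameters and p_new the one under the current parameters. Entries of
    p_old that are zero contribute nothing to the sum.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(p_old, p_new)
        if p > 0.0
    )
```

The value is zero when both distributions coincide and grows as the current output distribution drifts away from the previous one, which is the behavior the regularizer penalizes.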
4 Experimental Results
In the following we evaluate knowledge flow on reinforcement and supervised learning tasks. Results are reported by using only the student model to avoid even the smallest influence from any teacher nets.



Table 1
Game  Ours (one teacher)  PathNet  Ours (one teacher)  PathNet  Ours (two teachers)  PNN  A3C (no teachers)  PPO (no teachers)  ACKTR (no teachers)
Alien  1254  1700  1259  1800  1911  2000  182  1850  3197
Asterix  3982  2000  3823  2000  6012  9000  6723  4533  31583
Boxing  96  70  96  80  99  99  34  95  1
Gopher  4152  3900  3820  2100  5233  4500  8443  2933  47730
Hero  21250  12500  29343  12500  30928  30000  28766  n/a  n/a
JamesBond  857  600  832  600  1245  850  352  561  512
Krull  8193  7800  6890  7500  10000  9954  8067  7942  9689
[Figure 2: training curves. Panels: (a) Alien, (b) Boxing, (c) Gopher, (d) Hero, (e) JamesBond, (f) Krull]
4.1 Reinforcement Learning
We evaluate knowledge flow on reinforcement learning using Atari games that were used by Rusu et al. (2016b); Fernando et al. (2017). Following existing work, the input to our agent consists of raw images from the environment. The agent learns to predict actions based only on the rewards and the input images from the environment. The agent chooses an action every four frames, and the last action is repeated on the skipped frames. For all teacher models and the student model, we use the feed-forward architecture of A3C (Mnih et al., 2016). The model has three hidden layers. The first layer is a convolutional layer with 16 filters of size 8x8 and stride 4. The second layer is a convolutional layer with 32 filters of size 4x4 and stride 2. The third layer is a fully connected layer with 256 hidden units. Following the third hidden layer are two sets of outputs. One is a softmax output that provides a probability distribution over all valid actions. The other one is a scalar output that provides the estimated value function. We use the same hyperparameter settings as Mnih et al. (2016) except for the learning rate. Mnih et al. (2016) use RMSProp with shared statistics while we use Adam with shared statistics, which we found to give better results when training the baselines. The learning rate is gradually decreased to zero for all experiments. To select $\lambda_1$ and $\lambda_2$ in our framework, we follow progressive neural net (Rusu et al., 2016b) and sample them randomly. Note that $\lambda_1$ is set to zero at the beginning of training, and linearly increased to the sampled value at the end of training. Following Rusu et al. (2016b), we repeat each experiment 25 times with different random seeds and randomly sampled $\lambda_1$ and $\lambda_2$. The results of the top three out of 25 runs are reported. As in A3C, we run 16 agents on 16 CPU cores in parallel.
Evaluation Metrics: We follow the evaluation procedure of Mnih et al. (2015). The trained student models are evaluated by playing each game for 30 episodes. We also follow the ‘no-op’ procedure: at the beginning of each testing episode, the agents perform up to 30 ‘no-op’ actions.
Results: We first compare our framework with PathNet (Fernando et al., 2017) and progressive neural net (PNN) (Rusu et al., 2016b), which are state-of-the-art transfer reinforcement learning frameworks, using their experimental settings. The comparison is summarized in Table 1. The state-of-the-art results (Mnih et al., 2016; Schulman et al., 2017; Wu et al., 2017) on Atari games are also included in Table 1 for reference. Compared to PathNet, a student model trained using our transfer framework with one teacher achieves higher scores in 11 out of 14 experiments. Compared with PNN, for a two-teacher framework, our trained student model has only 0.7M parameters while PNN has 16M parameters. Nonetheless we observe higher scores in five out of the seven experiments. The results demonstrate that knowledge flow effectively transfers knowledge from teachers to the student. Table 1 also indicates that, in our framework, when the number of teachers increases from one to two, the student’s performance improves significantly across all experiments. The training curves for the experiments are shown in Fig. 2. Each curve is the average of the top three out of 25 runs. We observe our approach to generally perform very well.
[Figure 3: training curves. Panels: (a) Seaquest, (b) KungFuMaster, (c) Alien]
To further evaluate knowledge flow, we experiment with different combinations of environment/teacher settings. These settings are not used by PathNet and progressive neural network. The results are summarized in Table 2, where “ours w/ expert” indicates that one teacher is an expert for the target game; “ours w/ non-expert” indicates that neither teacher is an expert for the target game; “Finetune” refers to finetuning from a non-expert on a new target game; “A3C baseline” refers to our implementation of the A3C baseline; and “A3C” refers to the scores reported originally (Mnih et al., 2016). Note that our A3C implementation achieves better scores than those reported by Mnih et al. (2016) for most of the games. As shown in Table 2, knowledge flow with an expert teacher performs better than the baseline across all experiments, which we interpret as evidence that knowledge flow successfully transfers ‘knowledge’ from an expert teacher to the student. In addition, knowledge flow with non-expert teachers also outperforms finetuning from a non-expert teacher. The reasons are twofold: First, a student model in knowledge flow can learn from multiple teachers while the finetuning method can only start from one setting. Second, in knowledge flow, the student can avoid the negative impact of insufficiently pretrained teachers, while finetuning from an insufficiently pretrained model slows down the training process and may degrade the overall performance. The training curves for the experiments are shown in Fig. 3. More training curves are in the Appendix (Fig. 6). Note that in knowledge flow, the student can benefit from the intermediate representations of the teachers, even if input space, output space and objectives differ. For example, in Fig. 3 (a), the two teachers are Chopper Command and Space Invaders, which are quite different from the target game Seaquest. The student model still benefits from learning from the teachers and achieves scores ten times larger than learning without teachers or finetuning from a teacher.
4.2 Supervised Learning
For supervised learning, we use a variety of image classification benchmarks, including CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), STL-10 (Coates et al., 2011), and EMNIST (Cohen et al., 2017). The parameters $\lambda_1$ for the dependency loss and $\lambda_2$ for the KL loss are determined using the validation set of each dataset.
Evaluation Metrics: To evaluate the trained student model we report the top-1 error rate on the test set of each dataset. All plots and reported numbers are the average of three runs obtained using different random seeds.
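For reference, the top-1 error rate can be computed as in the following sketch. The function name and the list-of-lists input format are hypothetical; the code only illustrates the metric and is not taken from the evaluation pipeline of the experiments.

```python
def top1_error(scores, labels):
    """Percentage of samples whose highest-scoring class is not the label.

    scores[n] holds one score per class for sample n, and labels[n] is the
    index of the groundtruth class of sample n.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and aligned")
    wrong = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted != label:
            wrong += 1
    return 100.0 * wrong / len(labels)
```

Averaging this number over runs with different random seeds gives the kind of figure reported in the tables.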
CIFAR-10/CIFAR-100: The CIFAR-10 and CIFAR-100 datasets consist of colored images of size $32 \times 32$. CIFAR-10 (C10) has 10 classes and CIFAR-100 (C100) has 100 classes. For both datasets, the training and test sets contain 50,000 and 10,000 images respectively. We perform all experiments on CIFAR-10 and CIFAR-100 with standard data augmentation (Huang et al., 2017). We use DenseNet (Huang et al., 2017) (depth 100, growth rate 24) as a baseline and follow their hyperparameter settings to train our baseline, teacher and student models. For our approach, we first train teachers on CIFAR-10, CIFAR-100, and SVHN (Netzer et al., 2011). We then train the student model using different combinations of teachers. We compare our results to finetuning and the baseline model. As shown in Table 3 (a), for the CIFAR-10 target task, finetuning from the CIFAR-100 expert improves over the baseline. Finetuning from the SVHN expert performs worse than the baseline model. Intuitively, for the CIFAR-10 target task, the CIFAR-100 deep net is a good teacher while a deep net trained with SVHN isn’t. Presented with both good and inadequate teachers, knowledge flow still improves over the baseline. This demonstrates that knowledge flow can not only leverage a good teacher’s ‘knowledge,’ but can also avoid misleading influence. As detailed in Table 3 (b), the results are similar on the CIFAR-100 dataset.
To further demonstrate the properties of knowledge flow, additional results are in the appendix.
5 Related Work
As mentioned before, ‘knowledge’ transfer has been considered using a variety of techniques. We briefly discuss related work in contrast to our approach in the following and defer details to Sec. 8.
PathNet (Fernando et al., 2017) enables multiple agents to train the same deep net while reusing parameters and avoiding catastrophic forgetting. In contrast to this formulation we consider availability of multiple pretrained teacher nets.
Progressive Net (Rusu et al., 2016b) leverages transfer and avoids catastrophic forgetting by introducing lateral connections to previously learned features. Our method uses similar lateral connections. However, in contrast to Rusu et al. (2016b), our method ensures independence of the student after training, addressing a limitation of Rusu et al. (2016b) where only a fraction of the capacity of the student is eventually utilized.
Distral (Teh et al., 2017), a neologism combining ‘distill & transfer learning,’ considers joint training of multiple tasks. Multiple tasks share a ‘distilled’ policy which encodes common behavior between different tasks. While each worker addresses its own task, a shared policy encourages consistency between the policies. Different from Distral, which is a multitask learning framework that addresses multiple tasks at the same time, knowledge flow addresses a single task. Common to multitask learning and knowledge flow is a transfer of information. However, in multitask learning, information extracted from different tasks is shared to boost performance, while, in knowledge flow, the information of multiple teachers is leveraged to help a student better learn a single, new, previously unseen task.
Other related work includes actor-mimic (Parisotto et al., 2016), learning without forgetting (Li and Hoiem, 2016), growing a brain (Wang et al., 2017), policy distillation (Rusu et al., 2016a), domain adaptation (Pan and Yang, 2010; Long et al., 2015; Tzeng et al., 2015), knowledge distillation (Hinton et al., 2015), and lifelong learning (Chen and Liu, 2016). A more detailed discussion on related work is provided in Sec. 8 of the supplementary material.
6 Conclusion
We developed a general knowledge flow approach that permits training a deep net from any number of teachers. We showed results for reinforcement learning and supervised learning, demonstrating improvements compared to training from scratch and to finetuning. In the future we plan to learn when to use which teacher and how to actively swap teachers during training of a student.
References
Bengio et al. (2009). Curriculum learning. In Proc. ICML.
Chen (2017). Pytorchplayground. Note: https://github.com/aaronxichen/pytorchplayground
Chen and Liu (2016). Lifelong machine learning. Morgan & Claypool Publishers.
Coates et al. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proc. AISTATS.
Cohen et al. (2017). EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373.
Dhariwal et al. (2017). OpenAI baselines.
Fernando et al. (2017). PathNet: evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734.
Furlanello et al. (2016). Active long term memory networks. arXiv preprint arXiv:1606.02355.
He et al. (2016). Deep residual learning for image recognition. In Proc. CVPR.
Hinton et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Huang et al. (2017). Densely connected convolutional networks. In Proc. CVPR.
Jung et al. (2016). Less-forgetting learning in deep neural networks. arXiv.
Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Li and Hoiem (2016). Learning without forgetting. In Proc. ECCV.
Long et al. (2015). Learning transferable features with deep adaptation networks. In Proc. ICML.
Mitchell et al. (2015). Never-ending learning. In Proc. AAAI.
Mnih et al. (2016). Asynchronous methods for deep reinforcement learning. In Proc. ICML.
Mnih et al. (2015). Human-level control through deep reinforcement learning. In Nature.
Netzer et al. (2011). Reading digits in natural images with unsupervised feature learning.
Pan and Yang (2010). A survey on transfer learning. IEEE Trans. on Knowl. and Data Eng.
Parisotto et al. (2016). Actor-mimic: deep multitask and transfer reinforcement learning. In Proc. ICLR.
Patel et al. (2015). Visual domain adaptation: A survey of recent advances. IEEE Signal Process. Mag.
Rusu et al. (2016a). Policy distillation. In Proc. ICLR.
Rusu et al. (2016b). Progressive neural networks. arXiv preprint arXiv:1606.04671.
Ruvolo and Eaton (2013). ELLA: an efficient lifelong learning algorithm. In Proc. ICML.
Schulman et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Teh et al. (2017). Distral: robust multitask reinforcement learning. In Proc. NIPS.
Thoma (2017). Analysis and optimization of convolutional neural network architectures. arXiv preprint arXiv:1707.09725.
Thrun (1998). Lifelong learning algorithms. In Learning to Learn.
Tzeng et al. (2015). Simultaneous deep transfer across domains and tasks. In Proc. ICCV.
Wang et al. (2017). Growing a brain: fine-tuning by increasing model capacity. In Proc. CVPR.
Wu et al. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Proc. NIPS.
Zhao et al. (2015). Stacked what-where autoencoders. arXiv preprint arXiv:1506.02351.
7 Appendix
7.1 Supervised Learning
Comparison with Knowledge Distillation: We follow knowledge distillation (KD) (Hinton et al., 2015) to distill knowledge from a larger model (teacher) to a smaller model (student). The student models have between 5% and 50% of the parameters of the teacher models. Following their setup, we conduct experiments on MNIST, MNIST with digit ‘3’ missing in the training set, CIFAR-100, and ImageNet. For MNIST and MNIST with digit ‘3’ missing, following KD, the teacher model is an MLP with two hidden layers of 1200 hidden units, and the student model is an MLP with two hidden layers of 800 hidden units. For CIFAR-100, we use the model from Chen (2017) as the teacher model. The student model follows the structure of the teacher, but the number of output channels of each convolutional layer is halved. For ImageNet, the teacher model is a 50-layer ResNet (He et al., 2016), and the student model is an 18-layer ResNet. The test errors of the distilled student models are summarized in Table 4. Our framework has consistently better performance than KD, because the student model in our framework benefits not only from the output layer behavior of the teacher but also from intermediate layer representations of the teacher.
Table 4
Model  MNIST  MNIST w/o digit ‘3’  C100  ImageNet
Student alone  1.46  11.06  31.87  30.24
KD (Hinton et al., 2015)  0.74  2.06  30.28  30.04
Ours  0.73  1.05  30.07  29.05
EMNIST:
Model (Teacher)  Test error(%) 
Cohen et al. (2017)  14.85 
Finetune from EMNIST digits  9.04 
Baseline  9.20 
Ours (EMNIST letters)  7.13 
Ours (EMNIST half letters)  8.13 
Ours (EMNIST digit)  8.11 
The ‘EMNIST Letters’ dataset consists of images of size $28 \times 28$ pixels showing handwritten letters. It has 26 balanced classes. Each class contains lower and upper case letters. The training and test sets contain 124,800 and 20,800 images respectively. The ‘EMNIST Digits’ dataset consists of images of size $28 \times 28$ pixels showing handwritten digits. It has 10 balanced classes. The training and test sets contain 240,000 and 40,000 images respectively.
In this case we use the MNIST model from Chen (2017) as the baseline, teacher and student model. We trained teachers on EMNIST Digits, EMNIST Letters, and EMNIST Letters with only 13 classes. Our target task is EMNIST Letters. The student model is trained with different teachers and the results are compared to finetuning, the baseline model, and the state-of-the-art results on EMNIST. The results are summarized in Table 5. Compared to the baseline and finetuning, student learning in our framework with an expert teacher (EMNIST Letters), a semi-expert teacher (half of the EMNIST Letters classes), and a non-expert teacher (EMNIST Digits) all yield better performance. In Fig. 4 we illustrate the accuracy over epochs for training of different models.
STL10:
Model  Test error (%)
Zhao et al. (2015)  25.20 
Thoma (2017)  21.34 
Baseline  25.50 
Finetune from C10  14.32 
Finetune from C100  14.38 
Ours (C100)  12.35 
Ours (C10, C100)  11.09 
The STL-10 dataset consists of colored images of size $96 \times 96$ pixels. It has 10 balanced classes. The training set contains 5,000 labeled images and 100,000 unlabeled images. The test set contains 8,000 images. In our experiment, we only use the 5,000 labeled images for training.
We use the STL-10 model from Chen (2017) as our baseline, teacher and student model. We trained teachers on CIFAR-10 and CIFAR-100. We compare our results to finetuning and the baseline in Table 6. Note that STL-10 is very similar to CIFAR-10 and CIFAR-100. Therefore, both CIFAR-10 and CIFAR-100 are very good teachers. As shown in Table 6, compared to the baseline, finetuning a model using weights pretrained on CIFAR-10 or CIFAR-100 reduces the test error by more than 11 percentage points. Compared with finetuning, student model training in our framework further reduces the test error by roughly 2 to 3 percentage points. Note that we only train on the labeled data, whereas the dataset is typically used to evaluate semi-supervised approaches that also exploit the unlabeled images. Hence our results are obtained using less data and may not be directly comparable. We still list their results in Table 6 for reference. In Fig. 5 we illustrate the accuracy over the epochs of training.
[Figure panels: (a) Alien, (b) Breakout, (c) ChopperCommand, (d) KungFuMaster, (e) MsPacman, (f) Seaquest]
7.2 Reinforcement Learning
We also compare to Distral (Teh et al., 2017), a state-of-the-art multitask reinforcement learning framework. We use its ‘KL + ent 1 col’ variant, which has a central model and one task model per task. We perform the experiments on Atari games with three tasks (task 1, task 2, task 3). The teachers for task 2 and task 3 are provided to our framework. Distral is trained for 120M steps (40M steps/task), and our model is trained for 40M steps. For a fair comparison, we report results of Distral’s task 1 model, which is better than its central model. The results are summarized in t:distral. Distral is suboptimal here because it aims to learn a multitask agent. In addition, it assumes identical action and state spaces. When the target task is very different from the source tasks, Distral cannot decrease the teacher influence. In contrast, our framework can decrease a teacher’s influence and thus reduce negative transfer.
Task 1, Task 2, Task 3  Distral (Teh et al., 2017)  Ours
KungFuMaster, Hero, Seaquest  27433  35103 
Hero, Seaquest, Riverraid  15096  30928 
James, Seaquest, Riverraid  550  1245 
7.3 Visualization of Normalized Weights of Teachers and Student
Following the reviewer’s suggestion, we plot the averaged normalized weight of each teacher and of the student in the C10 experiment, where C100 and SVHN experts are the teachers. Intuitively, the C100 teacher should have a higher weight than the SVHN teacher, because C100 is more relevant to C10. The plot verifies this intuition. As shown in fig:c10prob, the averaged normalized weight of the C100 teacher is higher than that of the SVHN teacher throughout training. Note that both teachers’ normalized weights approach zero at the end of training.
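To make the plotted quantity concrete, the sketch below shows one way an averaged normalized weight could be computed. It assumes, purely for illustration, that each linked student layer holds one raw score per source (the student itself and each teacher), that scores are normalized with a softmax, and that the average is taken over layers. The function names and all numeric scores are hypothetical and are not taken from the experiments.

```python
import math


def normalize(scores):
    """Softmax: map raw per-source scores to non-negative weights summing to 1."""
    peak = max(scores.values())
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def averaged_normalized_weight(per_layer_scores):
    """Average each source's normalized weight over all linked layers."""
    sums = {}
    for scores in per_layer_scores:
        for name, weight in normalize(scores).items():
            sums[name] = sums.get(name, 0.0) + weight
    return {name: s / len(per_layer_scores) for name, s in sums.items()}


# Hypothetical scores for two linked layers, early and late in training.
early = [
    {"student": 0.0, "C100": 1.0, "SVHN": 0.2},
    {"student": 0.1, "C100": 0.8, "SVHN": 0.0},
]
late = [
    {"student": 6.0, "C100": 0.5, "SVHN": -0.5},
    {"student": 7.0, "C100": 0.0, "SVHN": -1.0},
]

early_avg = averaged_normalized_weight(early)
late_avg = averaged_normalized_weight(late)
print(early_avg, late_avg)
```

With these made-up scores, the more relevant teacher receives the larger weight early on, and both teacher weights become small once the student's own score dominates, which mirrors the qualitative behavior described above.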
7.4 Ablation Studies
[Figure 8 panels: (a) Hero, (b) JamesBond, (c) KungFuMaster]
7.4.1 Untrained Teacher Models
To verify that the student really benefits from the knowledge of teachers, we conduct an ablation study suggested by a reviewer. We use teacher models that have not been trained at all. Intuitively, learning with untrained teachers should perform worse than learning with knowledgeable teachers. Our experiments verify this intuition. In Fig. 8 (a), where the target task is Hero, learning with untrained teachers (‘w/ untrained teachers’) achieves an average reward of 15934. Learning with knowledgeable teachers (‘Ours with seaquest and riverraid teacher’) achieves an average reward of 30928. More results are presented in Figs. 8 (b, c). The results show that knowledge flow achieves higher rewards than training with untrained teachers in different environments and teacher-student settings.
7.4.2 Training Without KL Term
[Figure fig:kl panels: (a) MsPacman, (b) KungFuMaster, (c) Boxing]
The KL term prevents the student’s output distribution over actions or labels from changing drastically when the teachers’ influence is decreasing. To investigate the importance of the KL term, we conduct an ablation study where the KL coefficient is set to zero. The results are summarized in fig:kl. Consider fig:kl (a), where the target task is MsPacman and the teachers are Riverraid and Seaquest experts. Without the KL term, the rewards drop drastically when a teacher’s influence decreases. In contrast, with the KL term, we do not observe performance drops. At the end of training, learning with the KL term achieves an average reward of 2907, and learning without the KL term achieves an average reward of 1215. More results are presented in fig:kl (b, c), which show that training with the KL term achieves higher reward than training without it.
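The role of the KL term can be illustrated with a small sketch. It assumes a penalty of the form KL(old output distribution || new output distribution), scaled by a coefficient and added to the task loss. The direction of the divergence, the function names, and the logits are illustrative assumptions, not the exact formulation used in the experiments.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution over actions or labels."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def total_loss(task_loss, old_logits, new_logits, kl_coefficient):
    """Task loss plus a penalty on how far the output distribution has moved."""
    penalty = kl_divergence(softmax(old_logits), softmax(new_logits))
    return task_loss + kl_coefficient * penalty


# Hypothetical logits before and after an update.
old_logits = [2.0, 0.5, -1.0]
small_change = [1.9, 0.6, -1.0]
drastic_change = [-1.0, 0.5, 2.0]

small_penalty = kl_divergence(softmax(old_logits), softmax(small_change))
drastic_penalty = kl_divergence(softmax(old_logits), softmax(drastic_change))
print(small_penalty, drastic_penalty)
```

Setting `kl_coefficient` to zero recovers the ablated setting, in which a drastic change of the output distribution is no longer penalized.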
7.5 Teachers with Different Architecture than Student
[Figure fig:dif_arch panels: (a) KungFuMaster, (b) Boxing, (c) Gopher]
In additional experiments, following the suggestion of a reviewer, we use teacher architectures that differ from the student model. More specifically, we use the model of Mnih et al. (2015) as the teacher model. It consists of 3 convolutional layers with 32, 64, and 64 filters, followed by a hidden fully connected layer with 512 ReLUs. We use the model of Mnih et al. (2016) as the student model. It consists of 2 convolutional layers with 16 and 32 filters, respectively, followed by a hidden fully connected layer with 256 ReLUs. In both models, the fully connected layer is followed by two output layers for actions and values. We link each teacher’s first convolutional layer to the student’s first convolutional layer, each teacher’s third convolutional layer to the student’s second convolutional layer, and each teacher’s fully connected layer to the student’s fully connected layer. The results are summarized in fig:dif_arch. We observe that learning with teachers whose architecture differs from the student’s performs similarly to learning with teachers that have the same architecture. Consider as an example fig:dif_arch (a), where the target task is KungFu Master and the teachers are experts for Seaquest and Riverraid. At the end of training, learning with teachers of different architectures achieves an average reward of 37520, and learning with teachers of the same architecture achieves an average reward of 35012. More results are shown in fig:dif_arch (b, c). The results show that knowledge flow can enable higher rewards even if the teacher and student architectures differ.
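The layer links described above can be written down as a small data structure. The sketch below records the three links and the layer widths stated in the text, and derives the shape a linear map would need in order to bring teacher features to the student's width. The use of such a projection, as well as the layer and function names, are assumptions made for illustration; this section does not state how the differing widths are reconciled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A connection from one teacher layer to one student layer."""
    teacher_layer: str
    student_layer: str


# Widths (filters or hidden units) as stated in the text.
TEACHER_WIDTHS = {"conv1": 32, "conv2": 64, "conv3": 64, "fc": 512}
STUDENT_WIDTHS = {"conv1": 16, "conv2": 32, "fc": 256}

# The three links used in the experiment.
LINKS = [
    Link("conv1", "conv1"),
    Link("conv3", "conv2"),
    Link("fc", "fc"),
]


def projection_shapes(links):
    """(student_width, teacher_width) of a hypothetical per-link linear map."""
    return {
        (link.teacher_layer, link.student_layer): (
            STUDENT_WIDTHS[link.student_layer],
            TEACHER_WIDTHS[link.teacher_layer],
        )
        for link in links
    }


shapes = projection_shapes(LINKS)
for (teacher_layer, student_layer), shape in shapes.items():
    print(f"teacher {teacher_layer} -> student {student_layer}: {shape}")
```

Note that the teacher's second convolutional layer is not linked to any student layer in this configuration.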
7.6 Average Network as
[Figure fig:avg panels: (a) Boxing, (b) KungFuMaster, (c) Gopher]
An average network can be used to obtain the parameters. To investigate how this affects performance, we conduct an experiment where the parameters are computed as the exponential running average of the model weights. The results are summarized in fig:avg. We observe that using an exponential average results in performance very similar to using a single model. Consider fig:avg (a), where the target task is Boxing and the teacher is a Riverraid expert. At the end of training, using an average network achieves an average reward of 96.2, and using a single network achieves an average reward of 96.0. More results on using an average network are shown in fig:avg (b, c).
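An exponential running average of model weights can be sketched as follows. The update form (decay times the old average plus one minus decay times the current weights), the decay value of 0.9, and the function names are illustrative assumptions and should not be read as the exact setting used in the experiments.

```python
def ema_update(average, current, decay):
    """Return decay * average + (1 - decay) * current, element-wise."""
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must lie in [0, 1)")
    return [decay * a + (1.0 - decay) * c for a, c in zip(average, current)]


def track(weight_history, decay):
    """Run the average over a sequence of weight vectors, starting from the first."""
    average = list(weight_history[0])
    for weights in weight_history[1:]:
        average = ema_update(average, weights, decay)
    return average


# Hypothetical weights of a two-parameter model over five training steps.
noisy_weights = [
    [1.0, -2.0],
    [1.2, -1.8],
    [0.8, -2.2],
    [1.1, -1.9],
    [0.9, -2.1],
]

averaged = track(noisy_weights, decay=0.9)
print(averaged)
```

Because each update is a convex combination, the averaged weights always stay within the range of the weights seen so far and change more smoothly than the weights of a single network.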
8 Related Work
As mentioned before, variants of ‘knowledge’ transfer have been considered using a variety of techniques, for instance, finetuning, progressive neural nets (Rusu et al., 2016b), PathNet (Fernando et al., 2017), ‘Growing a Brain’ (Wang et al., 2017), actormimic (Parisotto et al., 2016), and learning without forgetting (Li and Hoiem, 2016). Also related are techniques for transfer learning and lifelong learning. We discuss those methods and contrast them with our approach in the following.
PathNet (Fernando et al., 2017) enables multiple agents to train the same giant deep net while reusing parameters and avoiding catastrophic forgetting. To this end, agents embedded in the neural net discover which weights can be reused for new tasks and restrict application of gradients to those parameters. In contrast to this formulation, we assume that multiple already-trained teacher nets are available.
Progressive Net (Rusu et al., 2016b) leverages transfer and avoids catastrophic forgetting by introducing lateral connections to previously learned features. Our method uses similar lateral connections. However, in contrast to Rusu et al. (2016b), we introduce scaling with normalized weights. This ensures that the student is independent of the teachers once training is complete, addressing a limitation of Rusu et al. (2016b), where only a fraction of the capacity of the student is eventually utilized.
Distral (Teh et al., 2017), a neologism combining ‘distill & transfer learning’, considers joint training of multiple tasks. Multiple tasks share a ‘distilled’ policy which encodes common behavior across tasks. While each worker addresses its own task, the shared policy encourages consistency between the policies. Distral is a multitask learning framework, in which multiple tasks are addressed at the same time, whereas knowledge flow addresses a single task. Common to multitask learning and knowledge flow is a transfer of information. However, in multitask learning, information extracted from different tasks is shared to boost performance, while in knowledge flow, the information of multiple teachers is leveraged to help a student better learn a single, new, previously unseen task.
Knowledge distillation (Hinton et al., 2015) distills information from a larger deep net into a smaller one. It assumes both nets are trained on the same dataset. In contrast, our technique allows knowledge transfer between different source and target domains.
Actormimic (Parisotto et al., 2016) enables an agent to learn how to address multiple tasks simultaneously and to generalize the extracted knowledge to new domains. A single policy net learns how to act in a set of tasks following the guidance of several expert teachers. A combination of feature regression and cross entropy loss is used to encourage the student to produce similar actions and representations. Our proposed technique differs in that we take advantage of the teachers’ representations at the beginning of training.
Learning without forgetting (Li and Hoiem, 2016) permits adding a new task to a deep net without forgetting the original capabilities. Importantly, only data from the new task is used, and the old capabilities are retained by first recording the old network’s output on the new data. Similar techniques have been developed by Furlanello et al. (2016); Jung et al. (2016). In contrast, we transfer ‘knowledge’ from teacher networks more explicitly.
Growing a Brain (Wang et al., 2017) analyzes the parameters which change during finetuning and points out that a more natural model adaptation is obtained when increasing the model capacity, by extending either width or depth. Appropriate normalization is essential to significantly outperform classical finetuning. Since this technique is based on finetuning, it differs from our student-teacher based approach.