1 Introduction
The field of deep reinforcement learning has seen significant advances in the recent past. Innovations in environment design have led to a range of exciting, challenging and visually rich 3D worlds (e.g. Beattie et al., 2016; Kempka et al., 2016; Brockman et al., 2016). These have in turn led to the development of more complex agent architectures and necessitated massively parallelisable policy gradient and Q-learning based RL algorithms (e.g. Mnih et al., 2016; Espeholt et al., 2018). While the efficacy of these methods is undeniable, the problems we consider increasingly require more powerful models, complex action spaces and challenging training regimes for learning to occur.
Curriculum learning is a powerful instrument in the deep learning toolbox (e.g. Bengio et al., 2009; Graves et al., 2017). In a typical setup, one trains a network sequentially on related problems of increasing difficulty, with the end goal of maximising final performance on a desired task. However, such task-oriented curricula pose some practical difficulties for reinforcement learning. For instance, they require a certain understanding of and control over the generation process of the environment, such that simpler variants of the task can be constructed. And in situations where this is possible, it is not always obvious how to construct useful curricula – simple intuitions from learning in humans do not always apply to neural networks. Recent work (e.g. Sutskever & Zaremba, 2014; Graves et al., 2017) proposes randomised or automated curricula to circumvent some of these issues, with some success.
In this paper, instead of curricula over task variants we consider an alternate formulation – namely a curriculum over variants of agents. We are interested in training a single final agent, and in order to do so we leverage a series of intermediate agents that differ structurally in the way in which they construct their policies (Fig. 1). Agents in such curricula are not arranged according to architectural complexity, but rather training complexity. While these complexity measures are often aligned, they are sometimes orthogonal (e.g. it is often faster to train a complex model on two distinct tasks than a simpler model on them jointly). In contrast to a curriculum over tasks, our approach can be applied to a wide variety of problems where we do not have the ability to modify the underlying task specification or design. However, in domains where traditional curricula are applicable, these two methods can be easily combined.
The primary contribution of this work is thus to motivate and provide a principled approach for training with curricula over agents.
Mix & Match: An overview
In the Mix & Match framework, we treat multiple agents of increasing learning complexity as one M&M agent, which acts with a mixture of policies from its constituent agents (Fig. 1). Consequently it can be seen as an ensemble or a mixture of experts agent, which is used solely for the purpose of training. Additionally, knowledge transfer (i.e. distillation) is used such that we encourage the complex agents to match the simpler ones early on. The mixing coefficient is controlled such that ultimately only the complex, target agent is used for generating experience. Note that we consider the complexity of an agent not just in terms of the depth or size of its network (see Section 4.2), but with reference to the difficulty in training it from scratch (see Sections 4.1 and 4.3). We also note that while techniques analogous to ours can be applied to train mixtures of experts/policies (i.e. maximising performance across agents), this is not the focus of the present research; here our focus is to train a final target agent.
Training with the Mix & Match framework confers several potential benefits – for instance performance maximisation (either with respect to score or data efficiency), or enabling effective learning in otherwise hard-to-train models. With reference to this last point, M&M might be particularly beneficial in settings where real world constraints (inference speed, memory) demand the use of certain specific final models.
2 Related work
Curriculum learning is a long-standing idea in machine learning, with mentions as early as the work of Elman (1993). In its simplest form, pretraining and fine-tuning is a form of curriculum, widely explored (e.g. Simonyan & Zisserman, 2014). More explicitly, several works look at the importance of a curriculum for neural networks (e.g. Bengio et al., 2009). In many works, the focus is on constructing a sequence of tasks of increasing difficulty. More recent work (Graves et al., 2017; Sutskever & Zaremba, 2014), however, looks at automating task selection or employing a mixture of difficulties at each stage in training. We propose to extend this idea and apply it instead to training agents in curricula – keeping in spirit with recent ideas of mixtures of tasks (here, models).
The recent work on Net2Net (Chen et al., 2016) proposes a technique to increase the capacity of a model without changing the underlying function it represents. In order to achieve this, the architectures have to be supersets/subsets of one another and be capable of expressing identity mappings. Follow-up work (Wei et al., 2016) extends these ideas further. Both approaches can be seen as implicitly constructing a form of curriculum over the architecture, as a narrow architecture is first trained, then morphed into a wider one.
Related to this idea is the concept of knowledge transfer or distillation (Hinton et al., 2015; Ba & Caruana, 2014) – a technique for transferring the functional behaviour of a network into a different model, regardless of the structure of the target or source networks. While initially proposed for model compression (Buciluǎ et al., 2006; Ba & Caruana, 2014), in Parisotto et al. (2016) and Rusu et al. (2016) distillation is used to compress multiple distinct policies into a single one. Distral (Teh et al., 2017) instead focuses on learning independent policies, which use a co-distilled centralised agent as a communication channel.
Our work borrows and unifies several of these threads with a focus on online, end-to-end training of model curricula from scratch.
3 Method details
We first introduce some notation to more precisely describe our framework. Let us assume we are given a sequence of trainable agents^{1} (with corresponding policies $\pi_1, \dots, \pi_K$, each parametrised with some $\theta_i$, which can share some parameters) ordered according to the complexity of interest (i.e. $\pi_1$ can be a policy using a tiny neural network while $\pi_K$ is the very complex one). The aim is to train $\pi_K$, while all remaining agents are there to induce faster/easier learning. Furthermore, let us introduce the categorical random variable $c \sim \mathrm{Cat}(1, \dots, K \mid \alpha)$ (with probability mass function $p(c = i) = \alpha_i$) which will be used to select a policy at a given time:
$$\pi_{mm}(a \mid s) = \sum_{i=1}^{K} \alpha_i \pi_i(a \mid s).$$
^{1} For simplicity of notation we omit the time dependence of all random variables; however, we do consider a time-extended setting. In particular, when we talk about the policy of agent $i$ we refer to this agent's policy at a given time (which will change in the next step).
The point of Mix & Match is to allow curriculum learning; consequently we need the probability mass function (pmf) of $c$ to change over time. Initially the pmf should have $\alpha_1 = 1$ and near the end of training $\alpha_K = 1$, thus allowing a curriculum of policies from the simple $\pi_1$ to the target $\pi_K$. Note that $\alpha_i$ has to be adjusted in order to control learning dynamics and to maximise overall learning performance, rather than the immediate increase. Consequently it should be trained in a way which maximises long-lasting increase of performance (as opposed to gradient based optimisation, which tends to be greedy and focus on immediate rewards).
We further note that mixing of policies is necessary but not sufficient to obtain curriculum learning – even though (for a non-Dirac-delta-like $c$) gradients always flow through multiple policies, there is nothing causing them to actually share knowledge. In fact, this sort of mixture of experts is inherently competitive rather than cooperative (Jacobs et al., 1991). In order to address this issue we propose using a distillation-like cost $D$, which will align the policies together.
The specific implementation of the above cost will vary from application to application. In the following sections we look at a few possible approaches.
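The final optimisation problem we consider is just a weighted sum of the original loss $L^{RL}$ (i.e. A3C; Mnih et al., 2016), applied to the control policy $\pi_{mm}$, and the knowledge transfer loss $L_{mm}$:
$$L(\theta) = L^{RL}(\theta \mid \pi_{mm}) + \lambda L_{mm}(\theta).$$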
We now describe in more detail each module required to implement Mix & Match, starting with policy mixing, then knowledge transfer and finally $\alpha$ adjustment.
3.1 Policy mixing
There are two equivalent views of the proposed policy mixing element – one can either think about having a categorical selector random variable $c$ described before, or an explicit mixing of the policy. The expected gradients of both are the same:
$$\mathbb{E}_{i \sim c}\left[\nabla_\theta \pi_i(a \mid s, \theta)\right] = \sum_{i=1}^{K} \alpha_i \nabla_\theta \pi_i(a \mid s, \theta) = \nabla_\theta \sum_{i=1}^{K} \alpha_i \pi_i(a \mid s, \theta) = \nabla_\theta \pi_{mm}(a \mid s, \theta),$$
however, if one implements the method by actually sampling from $c$ and then executing a given policy, the resulting single gradient update will be different from the one obtained from explicitly mixing the policy. From this perspective it can be seen as a Monte Carlo estimate of the mixed policy; thus for the sake of variance reduction we use explicit mixing in all experiments in this paper.
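The two views of policy mixing can be illustrated with a minimal sketch in plain Python. It compares the explicitly mixed action distribution with the empirical action frequencies obtained by first sampling a selector and then sampling from the chosen policy. The function names (`mix_policies`, `sample_then_act`), the two toy policies and the mixing weights are all hypothetical and chosen only for illustration; the sketch covers action distributions in a single state, and the statement about gradients follows from the same linearity argument.

```python
import random


def mix_policies(policies, alphas):
    """Explicit mixing: pi_mm(a|s) = sum_i alpha_i * pi_i(a|s) for one state."""
    n_actions = len(policies[0])
    return [sum(a * p[k] for a, p in zip(alphas, policies)) for k in range(n_actions)]


def sample_then_act(policies, alphas, rng):
    """Two-stage sampling: draw the selector c ~ Cat(alphas), then a ~ pi_c."""
    c = rng.choices(range(len(policies)), weights=alphas)[0]
    return rng.choices(range(len(policies[c])), weights=policies[c])[0]


# Hypothetical action distributions of a simple and a target policy in one state.
pi_1 = [0.7, 0.2, 0.1]
pi_2 = [0.1, 0.3, 0.6]
alphas = [0.75, 0.25]

pi_mm = mix_policies([pi_1, pi_2], alphas)

# Monte Carlo estimate of the same distribution via the selector variable.
rng = random.Random(0)
n_samples = 100_000
counts = [0] * len(pi_mm)
for _ in range(n_samples):
    counts[sample_then_act([pi_1, pi_2], alphas, rng)] += 1
empirical = [count / n_samples for count in counts]

print("explicit mixture:", pi_mm)
print("sampled estimate:", empirical)
```

The sampled frequencies approach the explicit mixture only as the number of samples grows, which is the variance that explicit mixing avoids.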
3.2 Knowledge transfer
For simplicity we consider the case of $K = 2$, but all following methods have a natural extension to an arbitrary number of policies. Also, for notational simplicity we drop the dependence of the losses or policies on $\theta$ when it is obvious from context.
Consider the problem of ensuring that the final policy ($\pi_2$) matches the simpler policy ($\pi_1$), while having access to samples from the control policy, $\pi_{mm}$. For simplicity, we define our M&M loss over the trajectories directly, similarly to the unsupervised auxiliary losses (Jaderberg et al., 2017b; Mirowski et al., 2017), thus we put:
$$L_{mm}(\theta) = \frac{1 - \alpha}{|S|} \sum_{s \in S} \sum_{t=1}^{|s|} D_{KL}\left(\pi_1(\cdot \mid s_t) \,\|\, \pi_2(\cdot \mid s_t)\right) \qquad (1)$$
and trajectories ($s \in S$) are sampled from the control policy. The $1 - \alpha$ term is introduced so that the distillation cost disappears when we switch to $\pi_2$.^{2} This is similar to the original policy distillation (Rusu et al., 2016), however here the control policy is a mixture of $\pi_2$ (the student) and $\pi_1$ (the teacher).
^{2} It can also be justified as a distillation of the mixture policy; see the Appendix for the derivation.
One can use a memory buffer to store $S$ and ensure that targets do not drift too much (Ross et al., 2011). In such a setting, under reasonable assumptions one can prove the convergence of $\pi_2$ to $\pi_1$ given enough capacity and experience.
Remark 1.
Let us assume we are given a set of $N$ trajectories from some predefined mix $\pi_{mm} = (1 - \alpha)\pi_1 + \alpha\pi_2$ for any fixed $\alpha \in (0, 1)$ and a big enough neural network with softmax output layer as $\pi_2$. Then in the limit as $N \to \infty$, the minimisation of Eq. 1 converges to $\pi_2 = \pi_1$ if the optimiser used is globally convergent when minimising cross entropy over a finite dataset.
Proof.
Given in the appendix. ∎
In practice we have found that minimising this loss in an online manner (i.e. using the current on-policy trajectory as the only sample for Eq. (1)) works well in the considered applications.
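As a rough illustration of the knowledge transfer cost, the sketch below computes a trajectory-averaged KL divergence from the simple policy to the target policy, scaled by one minus the mixing coefficient. This form is an assumption based on the description in this section (a KL-based distillation cost over trajectories that vanishes once the target policy fully controls the agent); the function names and the nested-list layout of the inputs are hypothetical.

```python
import math


def kl(p, q):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def mm_loss(teacher_probs, student_probs, alpha):
    """Assumed form: (1 - alpha) / |S| * sum over trajectories and steps of KL(pi_1 || pi_2).

    teacher_probs[i][t] and student_probs[i][t] hold the action distributions of
    the simple policy (pi_1) and the target policy (pi_2) at step t of trajectory i.
    """
    total = 0.0
    for traj_teacher, traj_student in zip(teacher_probs, student_probs):
        for p1, p2 in zip(traj_teacher, traj_student):
            total += kl(p1, p2)
    return (1.0 - alpha) * total / len(teacher_probs)


# One hypothetical trajectory with two steps and two actions.
teacher = [[[0.5, 0.5], [0.8, 0.2]]]
student = [[[0.9, 0.1], [0.6, 0.4]]]

print(mm_loss(teacher, student, alpha=0.3))  # positive while the policies differ
print(mm_loss(teacher, student, alpha=1.0))  # cost vanishes once alpha reaches 1
```

In the online setting described above, `teacher_probs` and `student_probs` would hold only the current on-policy trajectory.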
3.3 Adjusting $\alpha$ through training
An important component of the proposed method is how to set values of $\alpha$ through time. For simplicity let us again consider the case of $K = 2$, where one needs just a single $\alpha$ (as $c$ now comes from a Bernoulli distribution), which we treat as a function of time $t$.
Hand crafted schedule
Probably the most common approach is to define a schedule by hand. Unfortunately, this requires per-problem fitting, which might be time consuming. Furthermore, while designing an annealing schedule is simple (given that we provide enough flat regions so that RL training is stable), following this path might miss the opportunity to learn a better policy using non-monotonic switches in $\alpha$.
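A hand-crafted schedule with flat regions might look like the following minimal sketch. The staircase shape, the function name `staircase_alpha` and the number of levels are illustrative assumptions, not the schedule used in the experiments (which rely on population based training instead).

```python
def staircase_alpha(step, total_steps, n_levels=5):
    """Monotone staircase from 0 to 1 made of n_levels flat regions.

    Each flat region keeps the mixing coefficient constant for
    total_steps / n_levels steps, so the RL update sees a stationary mixture.
    """
    if step >= total_steps:
        return 1.0
    level = (step * n_levels) // total_steps  # integer in 0 .. n_levels - 1
    return level / (n_levels - 1)


schedule = [staircase_alpha(step, 1000) for step in range(0, 1000, 100)]
print(schedule)
```

Such a schedule can only increase, so it cannot express the back-and-forth switches that an adaptive scheme is able to discover.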
Online hyperparameter tuning
Since $\alpha$ changes through time one cannot use typical hyperparameter tuning techniques (like grid search or simple Bayesian optimisation) as the space of possible values is exponential in the number of timesteps ($\alpha = (\alpha^{(1)}, \dots, \alpha^{(T)}) \in \Delta_{K-1}^{T}$, where $\Delta_k$ denotes a $k$ dimensional simplex). One possible technique to achieve this goal is the recently proposed Population Based Training (PBT; Jaderberg et al., 2017a), which keeps a population of agents, trained in parallel, in order to optimise hyperparameters through time (without the need of ever reinitialising networks). For the rest of the paper we rely on using PBT for $\alpha$ adaptation, and discuss it in more detail in the next section.
3.4 Population based training and M&M
Population based training (PBT) is a recently proposed learning scheme, which performs online adaptation of hyperparameters in conjunction with parameter optimisation and a form of online model selection. As opposed to many classical hyperparameter optimisation schemes, the ability of PBT to modify hyperparameters throughout a single training run makes it possible to discover powerful adaptive strategies, e.g. auto-tuned learning rate annealing schedules.
The core idea is to train a population of agents in parallel, which periodically query each other to check how well they are doing relative to others. Badly performing agents copy the weights (neural network parameters) of stronger agents and perform local modifications of their hyperparameters. This way poorly performing agents are used to explore the hyperparameters space.
From a technical perspective, one needs to define two functions – eval which measures how strong a current agent is and explore which defines how to perturb the hyperparameters. As a result of such runs we obtain agents maximising the eval function. Note that when we refer to an agent in the PBT context we actually mean the M&M agent, which is already a mixture of constituent agents.
We propose to use one of two schemes, depending on the characteristics of the problem we are interested in. If the models considered have a clear benefit (in terms of performance) of switching from the simple to the more complex model, then all one needs to do is provide eval with the performance (i.e. reward over episodes) of the mixed policy. As an explore function for $\alpha$ we randomly add or subtract a fixed value (truncating between 0 and 1). Thus, once there is a significant benefit of switching to the more complex model, PBT will do it automatically. On the other hand, often we want to switch from an unconstrained architecture to some specific, heavily constrained one (where there may not be an obvious benefit in performance from switching). In such a setting, as is the case when training a multitask policy from constituent single-task policies, we can make eval an independent evaluation job which only looks at the performance of an agent with $\alpha = 1$. This way we directly optimise for the final performance of the model of interest, but at the cost of additional evaluations needed for PBT.
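The two PBT hooks described above can be sketched as follows for the mixing coefficient. The function names, the step size `delta` and the reward window are placeholders for illustration; only the add-or-subtract-and-truncate behaviour and the use of averaged episode rewards are taken from the text.

```python
import random


def explore_alpha(alpha, delta, rng):
    """Randomly add or subtract a fixed value, truncating the result to [0, 1]."""
    step = delta if rng.random() < 0.5 else -delta
    return min(1.0, max(0.0, alpha + step))


def eval_recent_reward(episode_rewards, window=30):
    """Average reward over the most recent episodes of the evaluated policy."""
    recent = episode_rewards[-window:]
    return sum(recent) / len(recent)


rng = random.Random(0)
alpha = 0.0
for _ in range(5):
    alpha = explore_alpha(alpha, delta=0.05, rng=rng)  # delta is a hypothetical value
print(alpha)
print(eval_recent_reward([float(r) for r in range(100)]))
```

For the constrained-target scheme, `eval_recent_reward` would be fed rewards collected by a separate evaluation job running the target policy alone rather than the mixture.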
4 Experiments
We now test and analyse our method on three sets of RL experiments. We train all agents with a form of batched actor critic with an off-policy correction (Espeholt et al., 2018) using DeepMind Lab (Beattie et al., 2016) as an environment suite. This environment offers a range of challenging 3D, first-person view based tasks (see the appendix) for RL agents. Agents perceive 96 × 72 pixel based RGB observations and can move, rotate, jump and tag builtin bots.
We start by demonstrating how M&M can be used to scale to a large and complex action space. We follow this with results of scaling complexities of the agent architecture and finally on a problem of learning a multitask policy. In all following sections we do not force $\alpha$ to approach 1; instead we initialise it around 0 and analyse its adaptation through time. Unless otherwise stated, the eval function returns averaged rewards from the last 30 episodes of the control policy. Note that even though in the experimental sections we use $K = 2$, the actual curriculum goes through potentially infinitely many agents, being a result of mixing between $\pi_1$ and $\pi_2$. Further technical details and descriptions of all tasks are provided in the Appendix.
4.1 Curricula over number of actions used
DeepMind Lab provides the agent with a complex action space, represented as a 6 dimensional vector. Two of these action groups are very high resolution (rotation and looking up/down actions), allowing up to 1025 values. The remaining four groups are low resolution actions, such as the ternary action of moving forward, backward or not moving at all, shooting or not shooting etc. If naively approached this leads to a very large number of possible joint actions at each timestep.
Even though this action space is defined by the environment, practitioners usually use an extremely reduced subset of available actions (Mnih et al., 2016; Espeholt et al., 2018; Jaderberg et al., 2017b; Mirowski et al., 2017) – from 9 to 23 preselected ones. When referring to action spaces we mean the subset of possible actions for which the agent’s policy provides a non-zero probability. Smaller action spaces significantly simplify the exploration problem and introduce a strong inductive bias into the action space definition. However, having such a tiny subset of possible movements can be harmful for the final performance of the agent. Consequently, we apply M&M to this problem of scaling action spaces. We use 9 actions to construct $\pi_1$, the simple policy (called Small action space). This is only used to guide learning of our final agent – which in this case uses 756 actions – these are all possible combinations of available actions in the environment (when limiting the agent to 5 values of rotation about the z-axis, and 3 values about the x-axis). Similarly to the research in continuous control using diagonal Gaussian distributions (Heess et al., 2017), we use a factorised policy (and thus assume conditional independence given state) to represent the joint distribution $\pi_2(a_1, \dots, a_6 \mid s) = \prod_{j=1}^{6} \hat{\pi}_j(a_j \mid s)$, which we refer to as Big action space. In order to be able to mix these two policies we map $\pi_1$ actions onto the corresponding ones in the action space of $\pi_2$ (which is a strict superset of that of $\pi_1$).
We use a simple architecture of a convolutional network followed by an LSTM, analogous to previous works in this domain (Jaderberg et al., 2017b). For M&M we share all elements of the two agents apart from the final linear transformation into the policy/value functions (Fig. 1(a)). Full details of the experimental hyperparameters can be found in the appendix, and on each figure we show the average over 3 runs for each result.
We see that the small action space leads to faster learning but hampers final performance as compared to the big action space (Fig. 3 and Fig. 4). Mix & Match applied to this setting gets the best of both worlds – it learns fast, and not only matches, but surpasses the final performance of the big action space. One possible explanation for this increase is the better exploration afforded by the small action space early on, which allows agents to fully exploit their flexibility of movement.
We further compare two variants of our method. We first investigate using M&M (Shared Head) – in this approach, we share weights in the final layer for those actions that are common to both policies. This is achieved by masking the factorised policy and renormalising accordingly. We further consider a variant of our distillation cost – when computing the KL between $\pi_1$ and $\pi_2$ one can also mask this loss such that $\pi_2$ is not penalised for assigning non-zero probabilities to the actions outside the $\pi_1$ support – M&M (Masked KL). Consistently across tested levels, both the Shared Head and Masked KL approaches achieve comparable or worse performance than the original formulation. It is worth noting, however, that if M&M were to be applied to a non-factorised complex action space, the Masked KL might prove beneficial, as it would then be the only signal ensuring that the agent explores the new actions.
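The difference between the plain and the masked distillation cost can be sketched on a toy example. Here the small policy is embedded into the big action space through a hypothetical index map, and the masked variant is implemented by renormalising the big policy over the support of the small one before taking the KL. This renormalisation is one possible reading of the masking described above, not necessarily the exact form used in the experiments; all names and numbers are illustrative.

```python
import math


def kl(p, q):
    """D_KL(p || q), summing only over actions where p is non-zero."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def embed(small_probs, small_to_big, n_big):
    """Place small-policy probabilities at their indices in the big action space."""
    big = [0.0] * n_big
    for prob, idx in zip(small_probs, small_to_big):
        big[idx] += prob
    return big


def masked_kl(p1_big, p2_big):
    """KL(pi_1 || pi_2) with pi_2 renormalised over the support of pi_1."""
    z = sum(q for p, q in zip(p1_big, p2_big) if p > 0.0)
    p2_masked = [q / z if p > 0.0 else 0.0 for p, q in zip(p1_big, p2_big)]
    return kl(p1_big, p2_masked)


# Hypothetical setup: 3 small actions mapped into a 6-action big space.
pi_1 = embed([0.5, 0.3, 0.2], small_to_big=[0, 2, 5], n_big=6)
pi_2 = [1.0 / 6.0] * 6  # big policy spreading mass over all actions

full = kl(pi_1, pi_2)
masked = masked_kl(pi_1, pi_2)
print(full, masked)
```

In this toy case the masked cost is smaller than the plain one by exactly the log of the mass the big policy places on the small policy's support, so probability assigned to the new actions is no longer penalised.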
Figure 6: Comparison of the M&M agent and various baselines on four DM Lab levels. Each curve represents the average of 3 independent runs of 10 agents each (used for population based training). FF and LSTM represent the feedforward and LSTM baselines respectively, while FF+LSTM is a model with both cores and a skip connection. FF+LSTM is thus a significantly bigger model than the others, possibly explaining the outlier on the LT level. The LSTM&LSTM experiment shows M&M applied with two LSTM agents.
When plotting $\alpha$ through time (Fig. 5 Left) we see that the agent switches fully to the big action space early on, thus showing that the small action space was useful only for the initial phase of learning. This is further confirmed by looking at how varied the actions taken by the agent are through training. Fig. 5 (Right) shows how the marginal distribution over actions evolves through time. We see that new actions are unlocked through training, and further that the final distribution is more entropic than the initial one.
4.2 Curricula over agent architecture
Another possible curriculum is over the main computational core of the agent. We use an architecture analogous to the one used in previous sections, but for the simple or initial agent, we substitute the LSTM with a linear projection from the processed convolutional signal onto a 256 dimensional latent space. We share both the convolutional modules as well as the policy/value function projections (Fig. 1(b)). We use a 540 element action space, and a factorised policy as described in the previous section.
We ran experiments on four problems in the DM Lab environment, focusing on various navigation tasks. On one hand, reactive policies (which can be represented solely by a FF policy) should learn reasonably quickly to move around and explore, while on the other hand, recurrent networks (which have memory) are needed to maximise the final performance – by either learning to navigate new maze layouts (Explore Object Location Small) or avoiding (seeking) explored unsuccessful (successful) paths through the maze.
As one can see on the average human normalised performance plot (Fig. 7), M&M applied to the transition between FF and LSTM cores does lead to a significant improvement in final performance (20% increase in human normalised performance over tasks of interest). It is, however, no longer as fast as the FF counterpart. In order to investigate this phenomenon we ran multiple ablation experiments (Fig. 6). In the first one, denoted FF+LSTM, we use a skip connection which simply adds the activations of the FF core and LSTM core before passing them to a single linear projector for the policy/value heads. This enriched architecture does improve the performance of the LSTM-only model; however, it usually learns even slower, and has very similar learning dynamics to M&M. Consequently it strongly suggests that M&M’s lack of initial speedup comes from the fact that it is architecturally more similar to the skip connection architecture. Note that FF+LSTM is, however, a significantly bigger model (which appears to be helpful on the LT Horseshoe color task).
Another question of interest is whether the benefit truly comes from the two core types, or simply through some sort of regularisation effect introduced by the KL cost. To test this hypothesis we also ran an M&M-like model but with 2 LSTM cores (instead of the feedforward). This model significantly underperforms all other baselines in speed and performance. This seems to suggest that the distillation or KL cost on its own is not responsible for any benefits we are seeing, and rather it is the full proposed Mix & Match method.
Finally, if we look at the progression of the mixing coefficient ($\alpha$) through time (Fig. 8), we notice once again quick switches on navigation-like tasks (all curves except the green one). However, there are two interesting observations to be made. First, the laser tag level, which requires a lot of reactiveness in the policy, takes much longer to switch to the LSTM core (however it does so eventually). This might be related to the complexity of the level, which has pickup gadgets as well as many opponents, making memory useful much later in training. Secondly, for the simple goal finding task in a fixed maze (Nav maze static 01, the blue curve) the agent first rapidly switches to the LSTM, but then more or less mid-training switches back to the mixture policy (an intermediate $\alpha$), before finally switching completely again towards the end of training. This particular behaviour is possible due to the use of unconstrained $\alpha$ adaptation with PBT – thus depending on the current performance the agent can go back and forth through the curriculum, which for this particular problem seems to be needed.
4.3 Curricula for multitask
As a final proof of concept we consider the task of learning a single policy capable of solving multiple RL problems at the same time. The basic approach for this sort of task is to train a model in a mixture of environments or equivalently to train a shared model in multiple environments in parallel (Teh et al., 2017; Espeholt et al., 2018). However, this sort of training can suffer from two drawbacks. First, it is heavily reward scale dependent, and will be biased towards high-reward environments. Second, environments that are easy to train provide a lot of updates for the model and consequently can also bias the solution towards themselves.
To demonstrate this issue we use three DeepMind Lab environments – one is Explore Object Locations Small, which has high rewards and a steep initial learning curve (due to lots of reward signal coming from gathering apples). The two remaining ones are challenging laser tag levels (described in detail in the appendix). In both these problems training is hard, as the agent is interacting with other bots as well as complex mechanics (pick up bonuses, tagging floors, etc.).
We see in Fig. 9 that the multitask solution focuses on solving the navigation task, while performing comparatively poorly on the more challenging problems. To apply M&M to this problem we construct one agent per environment (each acting as $\pi_1$ from previous sections) and then one centralised “multitask” agent ($\pi_2$ from previous sections). Crucially, agents share convolutional layers but have independent LSTMs. Training is done in a multitask way, but the control policy in each environment is again a mixture between the task-specific policy (the specialist) and the centralised agent's policy; see Fig. 1(c) for details. Since it is no longer beneficial to switch to the centralised policy, we use the performance of the central policy as the optimisation criterion (eval) for PBT, instead of the control policy.
We evaluate both the performance of the mixture and the centralised agent independently. Fig. 9 shows per-task performance of the proposed method. One can notice much more uniform performance – the M&M agent learns to play well in both challenging laser tag environments, while slightly sacrificing performance in a single navigation task. One of the reasons for this success is the fact that knowledge transfer is done in policy space, which is invariant to reward scaling. While the agent can still focus purely on high reward environments once it has switched to using only the central policy, this inductive bias in training with M&M ensures a much higher minimum score.
5 Conclusions
We have demonstrated that the proposed method – Mix & Match – is an effective training framework to both improve final performance and accelerate the learning process for complex agents in challenging environments. This is achieved by constructing an implicit curriculum over agents of different training complexities. The collection of agents is bound together as a single composite whole using a mixture policy. Information can be shared between the components via shared experience or shared architectural elements, and also through a distillation-like KL-matching loss. Over time the component weightings of this mixture are adapted such that at the end of training we are left with a single active component consisting of the most complex agent – our main agent of interest from the outset. From an implementation perspective, the proposed method can be seen as a simple wrapper (the M&M wrapper) that is compatible with existing agent architectures and training schemes; as such it could easily be introduced as an additional element in conjunction with a wide variety of on- or off-policy RL algorithms. In particular we note that, despite our focus on policy-based agents in this paper, the principles behind Mix & Match are also easily applied to value-based approaches such as Q-learning.
By leveraging M&M training, we are able to train complex agents much more effectively and in much less time than is possible if one were to attempt to train such an agent without the support of our methods. The diverse applications presented in this paper support the generality of our approach. We believe our training framework could help the community unlock the potential of powerful, but hitherto intractable, agent variants.
Acknowledgements
We would like to thank Raia Hadsell, Koray Kavukcuoglu, Lasse Espeholt and Iain Dunning for their invaluable comments, advice and support.
Appendix
Appendix A Network architectures
Default network architecture consists of:

Convolutional layer with 16 8x8 kernels of stride 4

Convolutional layer with 32 4x4 kernels of stride 2

ReLU

Linear layer with 256 neurons

ReLU

Concatenation with one hot encoded last action and last reward

LSTM core with 256 hidden units

Linear layer projecting onto policy logits, followed by softmax

Linear layer projecting onto baseline

Depending on the experiment, some elements are shared and/or replaced as described in the text.
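For orientation, the spatial size of the convolutional features implied by the list above can be worked out with a few lines of arithmetic. This sketch assumes unpadded ("valid") convolutions and a 72 × 96 (height × width) input; neither the padding mode nor the axis order is stated in the text, so the resulting numbers are indicative only.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1


# Assumed input resolution (height, width) and the two conv layers listed above.
height, width = 72, 96
conv_layers = [(8, 4, 16), (4, 2, 32)]  # (kernel, stride, output channels)

channels = 3
for kernel, stride, out_channels in conv_layers:
    height = conv_out(height, kernel, stride)
    width = conv_out(width, kernel, stride)
    channels = out_channels

flat_features = height * width * channels
print(height, width, channels, flat_features)  # input size of the 256-unit linear layer
```

With "same" padding instead, the spatial sizes (and hence the input to the linear layer) would be larger.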
Appendix B PBT (Jaderberg et al., 2017a) details
In all experiments PBT controls adaptation of three hyperparameters: $\alpha$, learning rate and entropy cost regularisation. We use populations of size 10.
The explore operator for learning rate and entropy regularisation is the permutation operator, which randomly multiplies the corresponding value by one of two fixed factors. For $\alpha$ it is an adder operator, which randomly adds or subtracts a fixed value and truncates the result to the $[0, 1]$ interval. Exploration is executed with probability 25% independently each time a worker is ready.
The exploit operator copies all the weights and hyperparameters from a randomly selected agent if its performance is significantly better.
A worker is deemed ready to undergo adaptation every 300 episodes.
We use a t-test with a p-value threshold of 5% to decide whether a given performance is significantly better than another, applied to the averaged returns of the last 30 episodes.
Initial distributions of hyperparameters are as follows:

learning rate: loguniform(1e-5, 1e-3)

entropy cost: loguniform(1e-4, 1e-2)

alpha: loguniform(1e-3, 1e-2)
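Sampling a population from these initial distributions could be sketched as follows. The exponents are assumed to be negative (for example a learning rate range of 1e-5 to 1e-3), since the minus signs appear to have been lost in the listing; the helper names and the per-worker seeding are hypothetical.

```python
import math
import random


def loguniform(low, high, rng):
    """Sample a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_initial_hyperparameters(rng):
    """Draw one worker's initial hyperparameters (ranges assume negative exponents)."""
    return {
        "learning_rate": loguniform(1e-5, 1e-3, rng),
        "entropy_cost": loguniform(1e-4, 1e-2, rng),
        "alpha": loguniform(1e-3, 1e-2, rng),
    }


# Population of size 10, one independent seed per worker.
population = [sample_initial_hyperparameters(random.Random(seed)) for seed in range(10)]
for worker in population[:3]:
    print(worker)
```

Note that every worker starts with a mixing coefficient close to zero, so the simple policy dominates the control policy at the beginning of training.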
B.1 Single task experiments
The eval function uses the rewards of the control policy $\pi_{mm}$.
B.2 Multitask experiments
The eval function uses the rewards of the central policy, which requires a separate evaluation worker per learner.
Appendix C M&M details
for action space experiments is set to , and for agent core and multitask to
. In all experiments we allow backpropagation through both policies, so that the teacher is also regularised towards the student (and thus does not diverge too quickly), which is similar to the Distral work.
While in principle we could also transfer knowledge between value functions, we did not find it especially helpful empirically, and since it introduces additional weight to be adjusted, we have not used it in the reported experiments.
Appendix D IMPALA (Espeholt et al., 2018) details
We use 100 CPU actors per learner. Each learner is trained with a single K80 GPU card. We use V-trace correction with truncation as described in the original paper.
Agents are trained with a fixed unroll of 100 steps. Optimisation is performed using RMSProp with decay of 0.99 and epsilon of 0.1. The discounting factor is set to 0.99, baseline fitting cost is 0.5, and rewards are clipped at 1. Action repeat is set to 4.
Appendix E Environments
We ran DM Lab using 96 72 3 RGB observations, at 60 fps.
e.1 Explore Object Locations Small
The task is to find all apples (each giving 1 point) in the procedurally generated maze, where each episode has different maze, apples locations as well as visual theme. Collecting all apples resets environment.
e.2 Nav Maze Static 01/02
Nav Maze Static 01 is a fixed-geometry maze with apples (worth 1 point) and one calabash (worth 10 points; collecting it resets the environment). The agent spawns in a random location, but the walls, theme and object positions are held constant.
The only difference for Nav Maze Static 02 is that it is significantly bigger.
e.3 LaserTag Horseshoe Color
Laser tag level against 6 built-in bots in a wide horseshoe-shaped room. There are 5 Orb Gadgets and 2 Disc Gadgets located in the middle of the room, which can be picked up and used for more efficient tagging of opponents.
e.4 LaserTag Chasm
Laser tag level in a square room with Beam Gadgets, Shield Pickups (50 health) and Overshield Pickups (50 armor) hanging above a tagging floor (chasm) that splits the room in half. Jumping is required to reach the items. Falling into the chasm causes the agent to lose 1 point. There are 4 built-in bots.
Appendix F Proofs
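First let us recall the loss of interest
$$\mathcal{L}_{mm}(\theta) = \frac{1-\alpha}{|S|} \sum_{s \in S} \sum_{t=1}^{|s|} D_{KL}\left(\pi_1(\cdot|s_t) \,\|\, \pi_2(\cdot|s_t)\right) \tag{2}$$
where each $s \in S$ comes from $\pi_{mm} = (1-\alpha)\pi_1 + \alpha\pi_2$.
Proposition 1. Let us assume we are given a set of $N$ trajectories from some predefined mix $\pi_{mm} = (1-\alpha)\pi_1 + \alpha\pi_2$ for any fixed $\alpha \in (0,1)$ and a big enough neural network with softmax output layer as $\pi_2$. Then in the limit as $N \to \infty$, the minimisation of Eq. 1 converges to $\pi_1$ if the optimiser used is globally convergent when minimising cross entropy over a finite dataset.
Proof.
For $D_N$ denoting the set of $N$ sampled trajectories over state space $\mathcal{S}$, let us denote by $\hat{S}_N$ the set of all states in $D_N$, meaning that $\hat{S}_N = \bigcup D_N$. Since $\pi_2$ is a softmax based policy, it assigns non-zero probability to all actions in every state. Consequently $\pi_{mm}$ also does, as $\alpha \in (0,1)$. Thus we have
$$\lim_{N \to \infty} \hat{S}_N = \mathcal{S}.$$
Due to following the mixture policy, the actual dataset $\hat{D}_N$ gathered can consist of multiple replicas of each element in $\hat{S}_N$, in different proportions than one would achieve when following $\pi_1$. Note, however, that if we use an optimiser which is capable of minimising the cross entropy over a finite dataset, it can also minimise loss (1) over $\hat{D}_N$, thus in particular over $\hat{S}_N$, which is its strict subset. Since the network is big enough, it means that it will converge to 0 training error:
$$\forall_{s \in \hat{S}_N} \; \lim_{t \to \infty} D_{KL}\left(\pi_1(\cdot|s) \,\|\, \pi_2(\cdot|s, \theta_t)\right) = 0$$
where $\theta_t$ is the solution of the $t$-th iteration of the optimiser used. Connecting the two above we get that in the limit of $N$ and $t$
$$\forall_{s \in \mathcal{S}} \; D_{KL}\left(\pi_1(\cdot|s) \,\|\, \pi_2(\cdot|s)\right) = 0 \iff \pi_1 = \pi_2.$$
∎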
While global convergence might sound like a very strong property, it holds for example when both the teacher and student policies are linear. In general, for deep networks it is hypothesised that if they are big enough and well initialised, they do converge to arbitrarily small training error even if trained with simple gradient descent; thus the above proposition is not too restrictive for deep RL.
Appendix G On α based scaling of knowledge transfer loss
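Let us take a closer look at the proposed loss
$$\ell_{mm}(\theta) = (1-\alpha)\, D_{KL}\left(\pi_1(\cdot|s) \,\|\, \pi_2(\cdot|s)\right) = (1-\alpha)\, H(\pi_1(\cdot|s), \pi_2(\cdot|s)) - (1-\alpha)\, H(\pi_1(\cdot|s))$$
and more specifically at the $(1-\alpha)$ factor, where $H(p, q)$ denotes cross entropy and $H(p)$ entropy. The intuitive justification for this quantity is that it leads to $D_{KL}$ gradually disappearing as the M&M agent switches to the final agent. However, one can provide another explanation. Let us instead consider the divergence between the mixed policy and the target policy (which also has the property of being $0$ once the agent switches):
$$\hat{\ell}_{mm}(\theta) = D_{KL}\left(\pi_{mm}(\cdot|s) \,\|\, \pi_2(\cdot|s)\right) = H(\pi_{mm}(\cdot|s), \pi_2(\cdot|s)) - H(\pi_{mm}(\cdot|s)) = (1-\alpha)\, H(\pi_1(\cdot|s), \pi_2(\cdot|s)) + \alpha\, H(\pi_2(\cdot|s)) - H(\pi_{mm}(\cdot|s)).$$
One can notice that both losses consist of two terms, one being a cross entropy between $\pi_1$ and $\pi_2$ and the other being a form of entropy regulariser. Furthermore, these two losses differ only with respect to the regularisation:
$$\ell_{mm}(\theta) - \hat{\ell}_{mm}(\theta) = H(\pi_{mm}(\cdot|s)) - \left[(1-\alpha)\, H(\pi_1(\cdot|s)) + \alpha\, H(\pi_2(\cdot|s))\right]$$
but since entropy is concave, this quantity is non-negative, meaning that
$$-(1-\alpha)\, H(\pi_1(\cdot|s)) \geq \alpha\, H(\pi_2(\cdot|s)) - H(\pi_{mm}(\cdot|s)),$$
therefore
$$\ell_{mm}(\theta) \geq \hat{\ell}_{mm}(\theta).$$
Thus the proposed scheme is almost equivalent to minimising the KL between the mixed policy and $\pi_2$, but with a more severe regularisation factor (and thus it is an upper bound of $\hat{\ell}_{mm}$).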
Further research and experiments need to be performed to assess the quantitative differences between these costs, though. In the preliminary experiments we ran, the difference was hard to quantify: both methods behaved similarly well.
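The relation between the two costs can be checked numerically for a single state. The sketch below is an illustration only: it assumes the mixed policy is $(1-\alpha)\pi_1 + \alpha\pi_2$, compares the scaled cost $(1-\alpha) D_{KL}(\pi_1 \| \pi_2)$ with $D_{KL}(\pi_{mm} \| \pi_2)$, and verifies that their difference equals the entropy gap $H(\pi_{mm}) - (1-\alpha)H(\pi_1) - \alpha H(\pi_2)$, which is non-negative because entropy is concave. All function names and the example distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q):
    """KL divergence D_KL(p || q); q must be strictly positive (e.g. a softmax)."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def mix(p1, p2, alpha):
    """Mixed policy (1 - alpha) * p1 + alpha * p2 over a shared action set."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(p1, p2)]


def scaled_kl(p1, p2, alpha):
    """The (1 - alpha)-scaled cost: (1 - alpha) * D_KL(p1 || p2)."""
    return (1 - alpha) * kl(p1, p2)


def mixed_kl(p1, p2, alpha):
    """The alternative cost: D_KL(p_mm || p2)."""
    return kl(mix(p1, p2, alpha), p2)


def entropy_gap(p1, p2, alpha):
    """H(p_mm) - [(1 - alpha) H(p1) + alpha H(p2)], non-negative by concavity."""
    return (entropy(mix(p1, p2, alpha))
            - (1 - alpha) * entropy(p1) - alpha * entropy(p2))


if __name__ == "__main__":
    teacher, student = [0.7, 0.2, 0.1], [0.2, 0.3, 0.5]
    for alpha in (0.0, 0.25, 0.5, 0.9, 1.0):
        print(alpha, scaled_kl(teacher, student, alpha),
              mixed_kl(teacher, student, alpha),
              entropy_gap(teacher, student, alpha))
```

Both costs vanish at $\alpha = 1$ and coincide at $\alpha = 0$; in between, the scaled cost is the larger of the two.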
Appendix H On knowledge transfer loss
Throughout this paper we focused on using the Kullback-Leibler divergence for knowledge transfer, $D_{KL}(p \| q) = H(p, q) - H(p)$. For many distillation related methods it is actually equivalent to minimising cross entropy (as $H(p)$ is constant); in the M&M case the situation is more complex. When both $p$ and $q$ are learning, $D_{KL}$ provides a two-way effect: from one perspective $q$ is pulled towards $p$, and on the other $p$ is mode seeking towards $q$ while at the same time being pushed towards the uniform distribution (entropy maximisation). This has two effects: first, it makes it harder for the teacher to get too far ahead of the student (similarly to Teh et al. (2017); Zhang et al. (2017)); second, the additional entropy term makes it expensive to keep using the teacher, and so switching is preferred.
Another element which has not been covered in depth in this paper is the possibility of deep distillation. Apart from matching policies, one could include inner activation matching (Parisotto et al., 2016), which could be beneficial for deeper models which do not share modules. Furthermore, for speeding up convergence of distillation one could use Sobolev Training (Czarnecki et al., 2017) and match both the policy and its Jacobian matrix. Since policy matching was enough for the current experiments, none of these methods has been used in this paper; however, for much bigger models and more complex domains it might be a necessity, as M&M depends on the ability to rapidly transfer knowledge between agents.
References
 Ba & Caruana (2014) Ba, Jimmy and Caruana, Rich. Do deep nets really need to be deep? In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2654–2662. 2014.
 Beattie et al. (2016) Beattie, Charles, Leibo, Joel Z., Teplyashin, Denis, Ward, Tom, Wainwright, Marcus, Küttler, Heinrich, Lefrancq, Andrew, Green, Simon, Valdés, Víctor, Sadik, Amir, Schrittwieser, Julian, Anderson, Keith, York, Sarah, Cant, Max, Cain, Adam, Bolton, Adrian, Gaffney, Stephen, King, Helen, Hassabis, Demis, Legg, Shane, and Petersen, Stig. Deepmind lab. CoRR, 2016.
 Bengio et al. (2009) Bengio, Yoshua, Louradour, Jerome, Collobert, Ronan, and Weston, Jason. Curriculum learning. In ICML, 2009.
 Brockman et al. (2016) Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 Buciluǎ et al. (2006) Buciluǎ, Cristian, Caruana, Rich, and Niculescu-Mizil, Alexandru. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. ACM, 2006.
 Chen et al. (2016) Chen, Tianqi, Goodfellow, Ian J., and Shlens, Jonathon. Net2net: Accelerating learning via knowledge transfer. ICLR, abs/1511.05641, 2016.
 Czarnecki et al. (2017) Czarnecki, Wojciech M, Osindero, Simon, Jaderberg, Max, Swirszcz, Grzegorz, and Pascanu, Razvan. Sobolev training for neural networks. In Advances in Neural Information Processing Systems, pp. 4281–4290, 2017.
 Elman (1993) Elman, Jeffrey. Learning and development in neural networks: The importance of starting small. In Cognition, pp. 71–99, 1993.
 Espeholt et al. (2018) Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih, Volodymir, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain, Legg, Shane, and Kavukcuoglu, Koray. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures, 2018.
 Graves et al. (2017) Graves, Alex, Bellemare, Marc G., Menick, Jacob, Munos, Rémi, and Kavukcuoglu, Koray. Automated curriculum learning for neural networks. CoRR, 2017.
 Heess et al. (2017) Heess, Nicolas, Sriram, Srinivasan, Lemmon, Jay, Merel, Josh, Wayne, Greg, Tassa, Yuval, Erez, Tom, Wang, Ziyu, Eslami, Ali, Riedmiller, Martin, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 Hinton et al. (2015) Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Jacobs et al. (1991) Jacobs, Robert A, Jordan, Michael I, Nowlan, Steven J, and Hinton, Geoffrey E. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
 Jaderberg et al. (2017a) Jaderberg, Max, Dalibard, Valentin, Osindero, Simon, Czarnecki, Wojciech M., Donahue, Jeff, Razavi, Ali, Vinyals, Oriol, Green, Tim, Dunning, Iain, Simonyan, Karen, Fernando, Chrisantha, and Kavukcuoglu, Koray. Population based training of neural networks. CoRR, 2017a.
 Jaderberg et al. (2017b) Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2017b.
 Kempka et al. (2016) Kempka, Michał, Wydmuch, Marek, Runc, Grzegorz, Toczek, Jakub, and Jaśkowski, Wojciech. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pp. 1–8. IEEE, 2016.
 Li & Yuan (2017) Li, Yuanzhi and Yuan, Yang. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pp. 597–607, 2017.
 Mirowski et al. (2017) Mirowski, Piotr, Pascanu, Razvan, Viola, Fabio, Soyer, Hubert, Ballard, Andrew J, Banino, Andrea, Denil, Misha, Goroshin, Ross, Sifre, Laurent, Kavukcuoglu, Koray, et al. Learning to navigate in complex environments. ICLR, 2017.
 Mnih et al. (2016) Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
 Parisotto et al. (2016) Parisotto, Emilio, Ba, Lei Jimmy, and Salakhutdinov, Ruslan. Actor-mimic: Deep multitask and transfer reinforcement learning. ICLR, 2016.

 Ross et al. (2011) Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.
 Rusu et al. (2016) Rusu, Andrei A, Colmenarejo, Sergio Gomez, Gulcehre, Caglar, Desjardins, Guillaume, Kirkpatrick, James, Pascanu, Razvan, Mnih, Volodymyr, Kavukcuoglu, Koray, and Hadsell, Raia. Policy distillation. 2016.
 Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 Sutskever & Zaremba (2014) Sutskever, Ilya and Zaremba, Wojciech. Learning to execute. CoRR, 2014.
 Teh et al. (2017) Teh, Yee, Bapst, Victor, Czarnecki, Wojciech M., Quan, John, Kirkpatrick, James, Hadsell, Raia, Heess, Nicolas, and Pascanu, Razvan. Distral: Robust multitask reinforcement learning. In NIPS. 2017.
 Wei et al. (2016) Wei, Tao, Wang, Changhu, Rui, Yong, and Chen, Chang Wen. Network morphism. In Proceedings of The 33rd International Conference on Machine Learning, pp. 564–572, 2016.
 Zhang et al. (2017) Zhang, Ying, Xiang, Tao, Hospedales, Timothy M, and Lu, Huchuan. Deep mutual learning. arXiv preprint arXiv:1706.00384, 2017.