Using Task Descriptions in Lifelong Machine Learning for Improved Performance and Zero-Shot Transfer

10/10/2017 ∙ by David Isele, et al. ∙ University of Pennsylvania

Knowledge transfer between tasks can improve the performance of learned models, but requires an accurate estimate of the inter-task relationships to identify the relevant knowledge to transfer. These inter-task relationships are typically estimated based on training data for each task, which is inefficient in lifelong learning settings where the goal is to learn each consecutive task rapidly from as little data as possible. To reduce this burden, we develop a lifelong learning method based on coupled dictionary learning that utilizes high-level task descriptions to model the inter-task relationships. We show that using task descriptors improves the performance of the learned task policies, providing both theoretical justification for the benefit and empirical demonstration of the improvement across a variety of learning problems. Given only the descriptor for a new task, the lifelong learner is also able to accurately predict a model for the new task through zero-shot learning using the coupled dictionary, eliminating the need to gather training data before addressing the task.







1 Introduction

Transfer learning (TL) and multi-task learning (MTL) methods reduce the amount of experience needed to train individual task models by reusing knowledge from other related tasks. This transferred knowledge can improve the training speed and model performance, as compared to learning the tasks in isolation following the classical machine learning pipeline [Pan & Yang, 2010]. TL and MTL techniques typically select the relevant knowledge to transfer by modeling inter-task relationships using a shared representation, based on training data for each task [Baxter, 2000; Ando & Zhang, 2005; Bickel et al., 2009; Maurer et al., 2013]. Despite benefits over single-task learning, this process requires sufficient training data for each task to identify these relationships before knowledge transfer can succeed and improve generalization performance. This need for data is especially problematic in learning systems that are expected to rapidly learn to handle new tasks during real-time interaction with the environment: when faced with a new task, the learner would first need to gather data on the new task before bootstrapping a model via transfer, consequently delaying how quickly the learner could address the new task.

An earlier version of this work, focusing on policy gradient reinforcement learning, appeared in the proceedings of IJCAI 2016 [Isele, Rostami, & Eaton, 2016].

Consider instead the human ability to rapidly bootstrap a model for a new task, given only a high-level task description—before obtaining experience on the actual task. For example, viewing only the image on the box of a new IKEA chair, we can immediately identify previous related assembly tasks and begin formulating a plan to assemble the chair. In the same manner, an experienced inverted pole balancing agent may be able to predict the controller for a new pole given its mass and length, prior to interacting with the physical system. These examples suggest that an agent could similarly use high-level task information to bootstrap a model for a new task more efficiently.

Inspired by this idea, we explore the use of high-level task descriptions to improve knowledge transfer between multiple machine learning tasks. We focus on lifelong learning scenarios [Thrun, 1996; Ruvolo & Eaton, 2013], in which multiple tasks arrive consecutively and the goal is to rapidly learn each new task by building upon previous knowledge. Our approach to integrating task descriptors into lifelong machine learning is general, as demonstrated on applications to reinforcement learning, regression, and classification problems.

Our algorithm, Task Descriptors for Lifelong Learning (TaDeLL), encodes task descriptions as feature vectors that identify each task, treating these descriptors as side information in addition to training data on the individual tasks. The idea of using task features for knowledge transfer has been explored previously by Bonilla et al. (2007) in an offline batch MTL setting, and more recently by Sinapov et al. (2015) in a computationally expensive method for estimating transfer relationships between pairs of tasks. In comparison, our approach operates online over consecutive tasks and is much more computationally efficient.

We use coupled dictionary learning to model the inter-task relationships between the task descriptions and the individual task policies in lifelong learning. The coupled dictionary enforces the notion that tasks with similar descriptions should have similar policies, but still allows dictionary elements the freedom to accurately represent the different task policies. We connect the coupled dictionaries to the concept of mutual coherence in sparse coding, providing theoretical justification for why the task descriptors improve performance, and verify this improvement empirically.

In addition to improving the task models, we show that the task descriptors enable the learner to accurately predict the policies for unseen tasks given only their description—this process of learning without data is known as zero-shot learning. This capability is particularly important in the online setting of lifelong learning. It enables the system to accurately predict policies for new tasks through transfer, without requiring the system to pause to gather training data on each task.

Specifically, this article provides the following contributions:


  • We develop a general mechanism based on coupled dictionary learning to incorporate task descriptors into knowledge transfer algorithms that use a factorized representation of the learned knowledge to facilitate transfer [Kumar & Daumé, 2012; Maurer et al., 2013; Ruvolo & Eaton, 2013].

  • Using this mechanism, we develop two algorithms, one for lifelong learning (TaDeLL) and one for MTL (TaDeMTL), that incorporate task descriptors to improve learning performance.

  • Most critically, we show how these algorithms can achieve zero-shot transfer to bootstrap a model for a novel task, given only the high-level task descriptor.

  • We provide theoretical justification for the benefit of using task descriptors in lifelong learning and MTL, building on the idea of mutual coherence in sparse coding.

  • Finally, we demonstrate the empirical effectiveness of TaDeLL and TaDeMTL on reinforcement learning scenarios involving the control of dynamical systems, and on prediction tasks in classification and regression settings, showing the generality of our approach.

2 Related Work

Multi-task learning (MTL) [Caruana, 1997] methods often model the relationships between tasks to identify similarities between their datasets or underlying models. There are many different approaches to modeling these task relationships. Bayesian approaches take a variety of forms, making use of common priors [Wilson et al., 2007; Lazaric & Ghavamzadeh, 2010], using regularization terms that couple task parameters [Evgeniou & Pontil, 2004; Zhong & Kwok, 2012], and finding mixtures of experts that can be shared across tasks [Bakker & Heskes, 2003].

Where Bayesian MTL methods aim to find an appropriate bias to share among all task models, transformation methods seek to make one dataset look like another, often in a transfer learning setting. This can be accomplished with distribution matching [Bickel et al., 2009], inter-task mapping [Taylor et al., 2007], or manifold alignment techniques [Wang & Mahadevan, 2009; Ham et al., 2005; Bou Ammar et al., 2015].

Both the Bayesian strategy of discovering biases and the shared spaces often used in transformation techniques are implicitly connected to methods that learn shared knowledge representations for MTL. For example, the original MTL framework developed by Caruana (1997) and later variations [Baxter, 2000] capture task relationships by sharing hidden nodes in neural networks that are trained on multiple tasks. Related work in dictionary learning techniques for MTL [Maurer et al., 2013; Kumar & Daumé, 2012] factorizes the learned models into a shared latent dictionary over the model space to facilitate transfer. Individual task models are then captured as sparse representations over this dictionary; the task relationships are captured in these sparse codes.

The Efficient Lifelong Learning Algorithm (ELLA) framework [Ruvolo & Eaton, 2013] used this same approach of a shared latent dictionary, trained online, to facilitate transfer as tasks arrive consecutively. The ELLA framework was first created for regression and classification [Ruvolo & Eaton, 2013], and later developed for policy gradient reinforcement learning (PG-ELLA) [Bou Ammar et al., 2014; Bou Ammar et al., 2015]. Other approaches that extend MTL to online settings also exist [Cavallanti et al., 2010]. Saha et al. (2011) use a task interaction matrix to model task relations online, and Dekel et al. (2006) propose a shared global loss function that can be minimized as tasks arrive.

However, all these methods use task data to characterize the task relationships—this explicitly requires training on the data from each task in order to perform transfer. Instead of relying solely on the tasks' training data, several works have explored the use of high-level task descriptors to model the inter-task relationships in MTL and transfer learning settings. Task descriptors have been used in combination with neural networks [Bakker & Heskes, 2003] to define a task-specific prior and to control the gating network between individual task clusters. Bonilla et al. (2007) explore similar techniques for multi-task kernel machines, using task features in combination with the data for a gating network over individual task experts to augment the original task training data. These papers focus on multi-task classification and regression in batch settings where the system has access to the data and features for all tasks, in contrast to our study of task descriptors for lifelong learning over consecutive tasks. We use coupled dictionary learning to link the task description space with the task's parameter space. This idea was originally used in image processing [Yang et al., 2010] and was recently explored in the machine learning literature [Xu et al., 2016]. The core idea is that two feature spaces can be linked through two dictionaries which are coupled by a joint sparse representation.

In the work most similar to our problem setting, Sinapov et al. (2015) use task descriptors to estimate the transferability between each pair of tasks for transfer learning. Given the descriptor for a new task, they identify the source task with the highest predicted transferability, and use that source task for a warm start in reinforcement learning (RL). Though effective, their approach is computationally expensive, since they estimate the transferability for every task pair through repeated simulation. Their evaluation is also limited to a transfer learning setting; they do not consider the effects of transfer over consecutive tasks or updates to the transferability model, as we do in the lifelong setting.

Our work is also related to zero-shot learning, which seeks to successfully label out-of-distribution examples, often through means of learning an underlying representation that extends to new tasks and using outside information that appropriately maps to the latent space [Palatucci et al., 2009; Socher et al., 2013]. The Simple Zero-Shot method by Romera-Paredes and Torr (2015) also uses task descriptions. Their method learns a multi-class linear model, and factorizes the linear model parameters, assuming the descriptors are coefficients over a latent basis to reconstruct the models. Our approach assumes a more flexible relationship: that both the model parameters and task descriptors can be reconstructed from separate latent bases that are coupled together through their coefficients. In comparison to our lifelong learning approach, the Simple Zero-Shot method operates in an offline multi-class setting.

3 Background

Our proposed framework for lifelong learning with task descriptors supports both supervised learning (classification and regression) and reinforcement learning settings. For completeness, we briefly review these learning paradigms here.

3.1 Supervised Learning

Consider a standard batch supervised learning setting. Let $\mathbf{x} \in \mathbb{R}^d$ be a $d$-dimensional vector representing a single data instance with a corresponding label $y$. Given a set of $n$ sample observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ with corresponding labels $\mathbf{y} = \{y_1, \ldots, y_n\}$, the goal of supervised learning is to learn a function $f: \mathbf{x} \mapsto y$ that labels inputs with their outputs and generalizes well to unseen observations.

In regression tasks, the labels are assumed to be real-valued (i.e., $y \in \mathbb{R}$). In classification tasks, the labels are a set of discrete classes; for example, in binary classification, $y \in \{-1, +1\}$. We assume that the learned model for both paradigms can be parameterized by a vector $\boldsymbol{\theta}$. The model is then trained to minimize the average loss over the training data between the model's predictions and the given target labels:

$$\min_{\boldsymbol{\theta}} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f(\mathbf{x}_i; \boldsymbol{\theta}), y_i\big) + \mathcal{R}(\boldsymbol{\theta}) \enspace ,$$

where $\mathcal{L}$ is generally assumed to be a convex metric, and $\mathcal{R}(\boldsymbol{\theta})$ regularizes the learned model. The form of the model $f$, loss function $\mathcal{L}$, and regularization method varies between learning methods. This formulation encompasses a number of parametric learning methods, including linear regression and logistic regression.
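As a toy instance of this objective, the sketch below fits a 1-D linear regression with an L2 regularizer by gradient descent. The data, step size, and regularization strength are all invented for illustration; this is not drawn from the paper's experiments.

```python
# Toy instance of the supervised objective: 1-D ridge regression fit by
# gradient descent. All data and hyperparameters are invented.

# data generated from y = 2x, so theta should approach 2
X = [0.0, 1.0, 2.0, 3.0]
Y = [0.0, 2.0, 4.0, 6.0]

theta = 0.0   # model parameter
lam = 0.01    # regularization strength
lr = 0.05     # gradient step size
n = len(X)

for _ in range(500):
    # gradient of (1/n) * sum (theta*x - y)^2 + lam * theta^2
    g = sum(2.0 * (theta * x - y) * x for x, y in zip(X, Y)) / n + 2.0 * lam * theta
    theta -= lr * g
```

The regularizer biases the solution slightly below the data-generating value of 2, exactly the role played by the $\mathcal{R}(\boldsymbol{\theta})$ term in the objective.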

3.2 Reinforcement Learning

A reinforcement learning (RL) agent selects sequential actions in an environment to maximize its expected return. An RL task is typically formulated as a Markov Decision Process (MDP) $\langle \mathcal{X}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{X}$ is the set of states, $\mathcal{A}$ is the set of actions that the agent may execute, $P: \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0, 1]$ is the state transition probability describing the system's dynamics, $R: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount assigned to rewards over time. At time step $h$, the agent is in state $\mathbf{x}_h \in \mathcal{X}$ and chooses an action $\mathbf{a}_h \in \mathcal{A}$ according to policy $\pi_{\boldsymbol{\theta}}$, which is represented as a function defined by a vector of control parameters $\boldsymbol{\theta}$. The agent then receives reward $r_h$ according to $R$ and transitions to state $\mathbf{x}_{h+1}$ according to $P$. This sequence of states, actions, and rewards is given as a trajectory $\boldsymbol{\tau} = [\mathbf{x}_{1:H}, \mathbf{a}_{1:H}, r_{1:H}]$ over a horizon $H$. The goal of RL is to find the optimal policy $\pi^*$ with parameters $\boldsymbol{\theta}^*$ that maximizes the expected reward. However, learning an individual task still requires numerous trajectories, motivating the use of transfer to reduce the number of interactions with the environment.

Policy Gradient (PG) methods [Sutton et al., 1999], which we employ as our base learner for RL tasks, are a class of RL algorithms that are effective for solving high-dimensional problems with continuous state and action spaces, such as robotic control [Peters & Schaal, 2008]. The goal of PG is to optimize the expected average return: $\mathcal{J}(\boldsymbol{\theta}) = \int_{\mathbb{T}} p_{\boldsymbol{\theta}}(\boldsymbol{\tau})\, \mathcal{R}(\boldsymbol{\tau}) \, d\boldsymbol{\tau}$, where $\mathbb{T}$ is the set of all possible trajectories, the average reward on trajectory $\boldsymbol{\tau}$ is given by $\mathcal{R}(\boldsymbol{\tau}) = \frac{1}{H} \sum_{h=1}^{H} r_h$, and $p_{\boldsymbol{\theta}}(\boldsymbol{\tau}) = P_0(\mathbf{x}_1) \prod_{h=1}^{H} p(\mathbf{x}_{h+1} \mid \mathbf{x}_h, \mathbf{a}_h)\, \pi_{\boldsymbol{\theta}}(\mathbf{a}_h \mid \mathbf{x}_h)$ is the probability of $\boldsymbol{\tau}$ under an initial state distribution $P_0$. Most PG methods (e.g., episodic REINFORCE [Williams, 1992], PoWER [Kober & Peters, 2009], and Natural Actor Critic [Peters & Schaal, 2008]) optimize the policy by employing supervised function approximators to maximize a lower bound on the expected return of $\mathcal{J}(\boldsymbol{\theta})$. This optimization is carried out by generating trajectories using the current policy $\pi_{\boldsymbol{\theta}}$, and then comparing the result with a new policy $\pi_{\tilde{\boldsymbol{\theta}}}$. Jensen's inequality can then be used to lower bound the expected return [Kober & Peters, 2009]:

$$\log \mathcal{J}(\tilde{\boldsymbol{\theta}}) = \log \int_{\mathbb{T}} p_{\tilde{\boldsymbol{\theta}}}(\boldsymbol{\tau})\, \mathcal{R}(\boldsymbol{\tau}) \, d\boldsymbol{\tau} \;\geq\; \int_{\mathbb{T}} p_{\boldsymbol{\theta}}(\boldsymbol{\tau})\, \mathcal{R}(\boldsymbol{\tau}) \log \frac{p_{\tilde{\boldsymbol{\theta}}}(\boldsymbol{\tau})}{p_{\boldsymbol{\theta}}(\boldsymbol{\tau})} \, d\boldsymbol{\tau} + \text{constant} \;\propto\; -D_{\mathrm{KL}}\big(p_{\boldsymbol{\theta}}(\boldsymbol{\tau})\, \mathcal{R}(\boldsymbol{\tau}) \,\big\|\, p_{\tilde{\boldsymbol{\theta}}}(\boldsymbol{\tau})\big) = \mathcal{J}_{L,\boldsymbol{\theta}}(\tilde{\boldsymbol{\theta}}) \enspace ,$$

where $D_{\mathrm{KL}}(p \,\|\, q) = \int p(\boldsymbol{\tau}) \log \frac{p(\boldsymbol{\tau})}{q(\boldsymbol{\tau})} \, d\boldsymbol{\tau}$. This is equivalent to minimizing the KL divergence between the reward-weighted trajectory distribution of $\pi_{\boldsymbol{\theta}}$ and the trajectory distribution $p_{\tilde{\boldsymbol{\theta}}}$ of the new policy $\pi_{\tilde{\boldsymbol{\theta}}}$.

In our work, we treat the lower bound $\mathcal{J}_{L,\boldsymbol{\theta}}$ similarly to the loss function of a classification or regression task. Consequently, both supervised learning tasks and RL tasks can be modeled in a unified framework, where the goal is to minimize a convex loss function.
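For intuition, here is a minimal sketch of the PG idea on an invented one-step, two-action problem (not one of this paper's domains): a softmax policy over two parameters, updated with the standard REINFORCE estimator, the gradient of log pi(a) scaled by the return.

```python
import math
import random

random.seed(0)

def softmax(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.0, 0.0]     # policy parameters, one per action
lr = 0.1
reward = [1.0, 0.0]    # action 0 is better in this made-up task

for _ in range(2000):  # one-step episodes
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1
    R = reward[a]
    # REINFORCE update: theta += lr * R * grad log pi(a),
    # where d/d theta_b log pi(a) = 1{a = b} - pi(b) for a softmax policy
    for b in range(len(theta)):
        grad = (1.0 if b == a else 0.0) - p[b]
        theta[b] += lr * R * grad

p = softmax(theta)     # the learned policy now strongly prefers action 0
```

Even this toy version shows why PG needs many episodes per task, which is the sample-efficiency burden that transfer aims to reduce.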

3.3 Lifelong Machine Learning

Figure 1: The lifelong machine learning process. As a new task arrives, knowledge accumulated from previous tasks is selectively transferred to the new task to improve learning. Newly learned knowledge is then stored for future use.

In a lifelong learning setting [Thrun, 1996; Ruvolo & Eaton, 2013], a learner faces multiple, consecutive tasks and must rapidly learn each new task by building upon its previous experience. The learner may encounter a previous task at any time, and so must optimize performance across all tasks seen so far. A priori, the agent does not know the total number of tasks $T_{\max}$, the task distribution, or the task order.

At time $t$, the lifelong learner encounters task $\mathcal{Z}^{(t)}$. In this paper, all tasks are either regression problems, classification problems, or reinforcement learning problems specified by an MDP. Note that we do not mix the learning paradigms—a lifelong learning agent will only face one type of learning task during its lifetime. The agent will learn each task consecutively, acquiring training data (i.e., trajectories or samples) in each task before advancing to the next. The agent's goal is to learn the optimal models or policies with corresponding parameters $\{\boldsymbol{\theta}^{(1)}, \ldots, \boldsymbol{\theta}^{(T)}\}$, where $T$ is the number of unique tasks seen so far ($1 \leq T \leq T_{\max}$). Ideally, knowledge learned from previous tasks should accelerate training and improve performance on each new task $\mathcal{Z}^{(t)}$. Also, the lifelong learner should scale effectively to large numbers of tasks, learning each new task rapidly from minimal data. The lifelong learning framework is depicted in Figure 1.

Figure 2: The task-specific model (or policy) parameters $\boldsymbol{\theta}^{(t)}$ are factored into a shared knowledge repository $L$ and a sparse code $\mathbf{s}^{(t)}$. The repository $L$ stores chunks of knowledge that are useful for multiple tasks, and the sparse code $\mathbf{s}^{(t)}$ extracts the relevant pieces of knowledge for a particular task's model (or policy).

The Efficient Lifelong Learning Algorithm (ELLA) [Ruvolo & Eaton, 2013] and PG-ELLA [Bou Ammar et al., 2014] were developed to operate in this lifelong learning setting for classification/regression and RL tasks, respectively. Both approaches assume the parameters for each task model can be factorized using a shared knowledge base $L$, facilitating transfer between tasks. Specifically, the model parameters for task $\mathcal{Z}^{(t)}$ are given by $\boldsymbol{\theta}^{(t)} = L\mathbf{s}^{(t)}$, where $L \in \mathbb{R}^{d \times k}$ is the shared basis over the model space, and $\mathbf{s}^{(t)} \in \mathbb{R}^{k}$ are the sparse coefficients over the basis. This factorization, depicted in Figure 2, has been effective for transfer in both lifelong and multi-task learning [Kumar & Daumé, 2012; Maurer et al., 2013].
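To make the factorization concrete, a toy numeric sketch with invented values: the sparse code selects which columns of the shared basis contribute to a task's parameters.

```python
# Toy sketch of theta(t) = L s(t): a shared basis L (d x k) combined with a
# sparse task code s(t). All numbers are invented for illustration.

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

# d = 3 model parameters, k = 4 shared basis columns
L = [[0.5, 0.0, 1.0, 0.2],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 3.0]]

s_t = [1.0, 0.0, 0.0, 0.5]   # sparse: this task uses only components 1 and 4
theta_t = matvec(L, s_t)     # task-specific model parameters
```

Two tasks whose codes overlap in a basis column share that chunk of knowledge; refining the column from one task's data benefits the other.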

Under this assumption, the MTL objective is:

$$\min_{L, S} \; \frac{1}{T} \sum_{t=1}^{T} \Big[ \mathcal{L}\big(L\mathbf{s}^{(t)}\big) + \mu \big\| \mathbf{s}^{(t)} \big\|_1 \Big] + \lambda \| L \|_{\mathsf{F}}^2 \enspace , \tag{1}$$

where $S = [\mathbf{s}^{(1)} \cdots \, \mathbf{s}^{(T)}]$ is the matrix of sparse vectors, $\mathcal{L}$ is the task-specific loss for task $\mathcal{Z}^{(t)}$, and $\|\cdot\|_{\mathsf{F}}$ is the Frobenius norm. The $L_1$ norm is used to approximate the true vector sparsity of $\mathbf{s}^{(t)}$, and $\mu$ and $\lambda$ are regularization parameters. Note that for a convex loss function $\mathcal{L}$, this problem is convex in each of the variables $L$ and $S$. Thus, one can use an alternating optimization approach to solve it in a batch learning setting. To solve this objective in a lifelong learning setting, Ruvolo and Eaton (2013) take a second-order Taylor expansion to approximate the objective around an estimate $\boldsymbol{\alpha}^{(t)}$ of the single-task model parameters for each task $\mathcal{Z}^{(t)}$, and update only the coefficients $\mathbf{s}^{(t)}$ for the current task at each time step. This process reduces the MTL objective to the problem of sparse coding the single-task policies in the shared basis $L$, and enables $S$ and $L$ to be solved efficiently by the following alternating online update rules that constitute ELLA [Ruvolo & Eaton, 2013]:

$$\mathbf{s}^{(t)} \leftarrow \arg\min_{\mathbf{s}} \big\| \boldsymbol{\alpha}^{(t)} - L\mathbf{s} \big\|_{\boldsymbol{\Gamma}^{(t)}}^2 + \mu \|\mathbf{s}\|_1 \tag{2}$$
$$A \leftarrow A + \big(\mathbf{s}^{(t)}\mathbf{s}^{(t)\top}\big) \otimes \boldsymbol{\Gamma}^{(t)} \tag{3}$$
$$\mathbf{b} \leftarrow \mathbf{b} + \mathrm{vec}\big(\mathbf{s}^{(t)\top} \otimes \big(\boldsymbol{\alpha}^{(t)\top}\boldsymbol{\Gamma}^{(t)}\big)\big) \tag{4}$$
$$L \leftarrow \mathrm{mat}\left( \left(\tfrac{1}{T}A + \lambda I_{kd}\right)^{-1} \tfrac{1}{T}\mathbf{b} \right) \enspace , \tag{5}$$

where $\|\mathbf{v}\|_{A}^2 = \mathbf{v}^\top A \mathbf{v}$, the symbol $\otimes$ denotes the Kronecker product, $\boldsymbol{\Gamma}^{(t)}$ is the Hessian of the loss $\mathcal{L}$, $I_{kd}$ is the $kd \times kd$ identity matrix, $A$ is initialized to a zero matrix, and $\mathbf{b}$ is initialized to zeros.
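As a sanity check on the flavor of these updates, here is a deliberately tiny sketch for the special case d = 1 and k = 2, where the Kronecker products collapse and the matrix inverse is an explicit 2x2 inverse. Everything here (values, helper names) is invented for illustration; a real implementation maintains the full kd x kd system.

```python
# Tiny sketch of the online updates for d = 1, k = 2. Running statistics
# A (2x2) and b (length 2) accumulate each task's contribution, and L is
# re-solved from (A/T + lam*I) L = b/T via the explicit 2x2 inverse.

lam = 0.1                       # regularization on L
A = [[0.0, 0.0], [0.0, 0.0]]    # initialized to zeros
b = [0.0, 0.0]
T = 0

def update(A, b, T, s, alpha, gamma):
    """One online step: fold in task (s, alpha, gamma), then re-solve for L."""
    T += 1
    for i in range(2):
        b[i] += gamma * alpha * s[i]            # b update (collapsed, d = 1)
        for j in range(2):
            A[i][j] += gamma * s[i] * s[j]      # A update (collapsed, d = 1)
    M = [[A[0][0] / T + lam, A[0][1] / T],
         [A[1][0] / T, A[1][1] / T + lam]]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    L = [(M[1][1] * b[0] / T - M[0][1] * b[1] / T) / det,   # L = M^{-1} b/T
         (-M[1][0] * b[0] / T + M[0][0] * b[1] / T) / det]
    return L, T

# one task whose single-task solution is alpha = 3.0 with Hessian gamma = 2.0,
# using only the first basis column (s = [1, 0])
L_hat, T = update(A, b, T, s=[1.0, 0.0], alpha=3.0, gamma=2.0)
# L_hat[0] approaches alpha, shrunk slightly toward zero by the regularizer
```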

Bou Ammar et al. (2014) extended this approach to handle reinforcement learning, first substituting the convex lower bound of the PG objective into the multi-task objective in order to make the optimization convex.

While these methods are effective for lifelong learning, this approach requires training data to estimate the model for each new task before the learner can solve it. Our key idea is to eliminate this restriction by incorporating task descriptors into lifelong learning, enabling zero-shot transfer to new tasks. That is, upon learning a few tasks, future task models can be predicted solely using task descriptors.

4 Lifelong Learning with Task Descriptors

Figure 3: The lifelong machine learning process with task descriptions. A model of task descriptors is added into the lifelong learning framework and coupled with the learned models. Because of the learned coupling between model and description, the model for a new task can be predicted from the task description.

4.1 Task Descriptors

While most MTL and lifelong learning methods use task training data to model inter-task relationships, high-level descriptions can describe task differences. For example, in multi-task medical domains, patients are often grouped into tasks by demographic data and disease presentation [Oyen & Lane, 2012]. In control problems, the dynamical system parameters (e.g., the spring, mass, and damper constants in a spring-mass-damper system) describe the task. Descriptors can also be derived from external sources, such as text descriptions [Pennington et al., 2014; Huang et al., 2012] or Wikipedia text associated with the task [Socher et al., 2013].

To incorporate task descriptors into the learning procedure, we assume that each task $\mathcal{Z}^{(t)}$ has an associated descriptor $m^{(t)}$ that is given to the learner upon first presentation of the task. The learner has no knowledge of future tasks, or the distribution of task descriptors. The descriptor is represented by a feature vector $\phi(m^{(t)}) \in \mathbb{R}^{d_m}$, where $\phi(\cdot)$ performs feature extraction and (possibly) a non-linear basis transformation on the features. We make no assumptions on the uniqueness of $\phi(m^{(t)})$, although in general tasks will have different descriptors.¹ In addition, each task also has associated training data to learn the model; in the case of RL tasks, the data consists of trajectories that are dynamically acquired by the agent through experience in the environment.

¹This raises the question of what descriptive features to use, and how task performance will change if some descriptive features are unknown. We explore these issues in Section 8.1.
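As a small illustration of what $\phi$ might look like, the sketch below builds a descriptor feature vector from raw dynamical-system parameters (e.g., mass and length for pole balancing) using a degree-2 polynomial expansion. Both the choice of raw parameters and the polynomial basis are invented examples, not the paper's prescribed features.

```python
# Hypothetical descriptor feature function phi: the raw task parameters plus
# all degree-2 monomials, as one possible non-linear basis transformation.

def phi(m):
    feats = list(m)  # raw parameters, e.g. [mass, length]
    feats += [a * b for i, a in enumerate(m) for b in m[i:]]  # degree-2 terms
    return feats

phi([2.0, 0.5])  # -> [2.0, 0.5, 4.0, 1.0, 0.25]
```

Two tasks with the same parameters would map to the same descriptor here, consistent with the text's note that uniqueness of the descriptors is not assumed.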

We incorporate task descriptors into lifelong learning via sparse coding with a coupled dictionary, enabling the descriptors and learned models to augment each other. In an earlier version of this work, we focused on RL tasks [Isele, Rostami, & Eaton, 2016]; here, we more fully explore the range of our approach, showing how it can be applied to regression, classification, and RL problems.

4.2 Coupled Dictionary Optimization

As described previously, many multi-task and lifelong learning approaches have found success with factorizing the policy parameters for each task as a sparse linear combination over a shared basis: $\boldsymbol{\theta}^{(t)} = L\mathbf{s}^{(t)}$. In effect, each column of the shared basis $L$ serves as a reusable model or policy component representing a cohesive chunk of knowledge. In lifelong learning, the basis $L$ is refined over time as the system learns more tasks. The coefficient vectors $\mathbf{s}^{(t)}$ encode the task policies in this shared basis, providing an embedding of the tasks based on how their policies share knowledge.

We make a similar assumption about the task descriptors—that the descriptor features $\phi(m^{(t)})$ can be linearly factorized² using a latent basis $D \in \mathbb{R}^{d_m \times k}$ over the descriptor space. This basis captures relationships among the descriptors, with coefficients that similarly embed tasks based on commonalities in their descriptions. From a co-view perspective [Yu et al., 2014], both the policies and descriptors provide information about the task, and so each can augment the learning of the other. Each underlying task is common to both views, and so we seek to find task embeddings that are consistent for both the policies and their corresponding task descriptors. As depicted in Figure 4, we can enforce this by coupling the two bases $L$ and $D$, sharing the same coefficient vectors $\mathbf{s}^{(t)}$ to reconstruct both the policies and descriptors. Therefore, for task $\mathcal{Z}^{(t)}$,

$$\boldsymbol{\theta}^{(t)} = L\mathbf{s}^{(t)} \qquad\qquad \phi\big(m^{(t)}\big) = D\mathbf{s}^{(t)} \enspace . \tag{6}$$

²This is potentially non-linear with respect to $m^{(t)}$, since $\phi$ can be non-linear.

Figure 4: The coupled dictionaries of TaDeLL, illustrated on an RL task. Policy parameters $\boldsymbol{\theta}^{(t)}$ are factored into $L$ and $\mathbf{s}^{(t)}$, while the task description $\phi(m^{(t)})$ is factored into $D$ and $\mathbf{s}^{(t)}$. Because we force both dictionaries to use the same sparse code $\mathbf{s}^{(t)}$, the relevant pieces of information for a task become coupled with the description of the task.

To optimize the coupled bases $L$ and $D$ during the lifelong learning process, we employ techniques for coupled dictionary optimization from the sparse coding literature [Yang et al., 2010], which optimize the dictionaries for multiple feature spaces that share a joint sparse representation. This notion of coupled dictionary learning has led to high-performance algorithms for image super-resolution [Yang et al., 2010], allowing the reconstruction of high-resolution images from low-resolution samples, and for multi-modal retrieval [Zhuang et al., 2013] and cross-domain retrieval [Yu et al., 2014]. The core idea is that features in two independent subspaces can have the same representation in a third subspace.

Given the factorization in Eq. 6, we can re-formulate the multi-task objective (Eq. 1) for the coupled dictionaries as

$$\min_{L, D, S} \; \frac{1}{T} \sum_{t=1}^{T} \Big[ \mathcal{L}\big(L\mathbf{s}^{(t)}\big) + \rho \big\| \phi\big(m^{(t)}\big) - D\mathbf{s}^{(t)} \big\|_2^2 + \mu \big\|\mathbf{s}^{(t)}\big\|_1 \Big] + \lambda \big( \|L\|_{\mathsf{F}}^2 + \|D\|_{\mathsf{F}}^2 \big) \enspace , \tag{7}$$

where $\rho$ balances the fit to the model or policy against the fit to the task descriptor.

To solve Eq. 7 online, we approximate $\mathcal{L}$ by a second-order Taylor expansion around $\boldsymbol{\alpha}^{(t)}$, the minimizer for the single-task learner. In reinforcement learning, $\boldsymbol{\alpha}^{(t)}$ is the single-task policy for $\mathcal{Z}^{(t)}$ based on the observed trajectories [Bou Ammar et al., 2014]. In supervised learning, $\boldsymbol{\alpha}^{(t)}$ is the single-task model parameters for $\mathcal{Z}^{(t)}$ [Ruvolo & Eaton, 2013]. This step leads to a unified simplified formalism that is independent of the learning paradigm (i.e., classification, regression, or RL). Approximating Eq. 7 leads to

$$\min_{L, D, S} \; \frac{1}{T} \sum_{t=1}^{T} \Big[ \big\| \boldsymbol{\alpha}^{(t)} - L\mathbf{s}^{(t)} \big\|_{\boldsymbol{\Gamma}^{(t)}}^2 + \rho \big\| \phi\big(m^{(t)}\big) - D\mathbf{s}^{(t)} \big\|_2^2 + \mu \big\|\mathbf{s}^{(t)}\big\|_1 \Big] + \lambda \big( \|L\|_{\mathsf{F}}^2 + \|D\|_{\mathsf{F}}^2 \big) \enspace . \tag{8}$$

We can merge pairs of terms in Eq. 8 by choosing:

$$\boldsymbol{\beta}^{(t)} = \begin{bmatrix} \boldsymbol{\alpha}^{(t)} \\ \phi\big(m^{(t)}\big) \end{bmatrix} \qquad K = \begin{bmatrix} L \\ D \end{bmatrix} \qquad A^{(t)} = \begin{bmatrix} \boldsymbol{\Gamma}^{(t)} & \mathbf{0} \\ \mathbf{0} & \rho I_{d_m} \end{bmatrix} \enspace ,$$

where $\mathbf{0}$ is the zero matrix, letting us rewrite (8) concisely as

$$\min_{K, S} \; \frac{1}{T} \sum_{t=1}^{T} \Big[ \big\| \boldsymbol{\beta}^{(t)} - K\mathbf{s}^{(t)} \big\|_{A^{(t)}}^2 + \mu \big\|\mathbf{s}^{(t)}\big\|_1 \Big] + \lambda \|K\|_{\mathsf{F}}^2 \enspace . \tag{9}$$
This objective can now be solved efficiently online, as a series of per-task update rules given in Algorithm 1, which we call TaDeLL (Task Descriptors for Lifelong Learning). $L$ and $D$ are updated independently, using update equations analogous to Eqs. 3–5, following a recursive construction based on an eigenvalue decomposition.

1: $L \leftarrow$ RandomMatrix$_{d,k}$,  $D \leftarrow$ RandomMatrix$_{d_m,k}$
2: while some task $\mathcal{Z}^{(t)}$ is available do
3:      $\mathbf{T}^{(t)} \leftarrow$ collectData($\mathcal{Z}^{(t)}$)
4:      Compute $\boldsymbol{\alpha}^{(t)}$ and $\boldsymbol{\Gamma}^{(t)}$ from $\mathbf{T}^{(t)}$
5:      $\mathbf{s}^{(t)} \leftarrow$ sparse coding of $\boldsymbol{\beta}^{(t)}$ in $K$               Eq. 9
6:      $L \leftarrow$ update $L$ using $(\mathbf{s}^{(t)}, \boldsymbol{\alpha}^{(t)}, \boldsymbol{\Gamma}^{(t)})$               Eqs. 3–5
7:      $D \leftarrow$ update $D$ using $(\mathbf{s}^{(t)}, \phi(m^{(t)}), \rho)$      Eqs. 3–5
8:      for each task $t$ seen so far do:  $\boldsymbol{\theta}^{(t)} \leftarrow L\mathbf{s}^{(t)}$
9: end while
Algorithm 1  TaDeLL ($k$, $\lambda$, $\mu$)
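The stacking that turns Eq. 8 into Eq. 9 is purely mechanical, and can be sketched with toy sizes (all values below are invented):

```python
# Toy construction of the stacked quantities in Eq. 9:
#   beta = [alpha; phi(m)], K = [L; D], A = blockdiag(Gamma, rho * I).

def blockdiag(Gamma, rho, d_m):
    """Block-diagonal weight matrix: Gamma top-left, rho*I bottom-right."""
    d = len(Gamma)
    n = d + d_m
    A = [[0.0] * n for _ in range(n)]
    for i in range(d):
        for j in range(d):
            A[i][j] = Gamma[i][j]
    for i in range(d_m):
        A[d + i][d + i] = rho
    return A

alpha = [1.0, 2.0]                # single-task model estimate (d = 2)
phi_m = [0.5]                     # task descriptor features (d_m = 1)
L = [[1.0, 0.0], [0.0, 1.0]]      # model dictionary (d x k, k = 2)
D = [[0.3, 0.7]]                  # descriptor dictionary (d_m x k)
Gamma = [[2.0, 0.0], [0.0, 2.0]]  # Hessian of the task loss at alpha
rho = 0.5

beta = alpha + phi_m              # stacked target vector, length d + d_m
K = L + D                         # stacked dictionary, (d + d_m) x k
A = blockdiag(Gamma, rho, len(phi_m))
```

After stacking, each per-task term has exactly the same weighted sparse-coding form as in ELLA, so the same machinery applies with $K$ in place of $L$.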

For the sake of clarity, we now explicitly state the differences between using TaDeLL for RL problems and for classification and regression problems. In an RL setting, at each time step TaDeLL receives a new RL task and samples trajectories for the new task. We use the single-task policy computed by a twice-differentiable policy gradient method as $\boldsymbol{\alpha}^{(t)}$. The Hessian $\boldsymbol{\Gamma}^{(t)}$, calculated around the point $\boldsymbol{\alpha}^{(t)}$, is derived according to the particular policy gradient method being used; Bou Ammar et al. (2014) derive it for the cases of Episodic REINFORCE and Natural Actor Critic. The reconstructed $\boldsymbol{\theta}^{(t)} = L\mathbf{s}^{(t)}$ is then used as the policy for the task.

In the case of classification and regression, at each time step TaDeLL observes a labeled training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ for task $\mathcal{Z}^{(t)}$, where $\mathbf{x}_i \in \mathbb{R}^d$. For classification tasks, $y_i \in \{-1, +1\}$, and for regression tasks, $y_i \in \mathbb{R}$. We then set $\boldsymbol{\alpha}^{(t)}$ to be the parameters of a single-task model trained via classification or regression (e.g., logistic or linear regression) on that data set. $\boldsymbol{\Gamma}^{(t)}$ is set to be the Hessian of the corresponding loss function around the single-task solution $\boldsymbol{\alpha}^{(t)}$, and the reconstructed $\boldsymbol{\theta}^{(t)} = L\mathbf{s}^{(t)}$ is used as the model parameters for the corresponding classification or regression problem.
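For a concrete (invented) regression task, both quantities are closed-form in the 1-D ridge case: $\boldsymbol{\alpha}^{(t)}$ is the scalar minimizer and $\boldsymbol{\Gamma}^{(t)}$ its second derivative.

```python
# Sketch: forming alpha(t) and Gamma(t) from a toy 1-D ridge regression task.
# Loss: (1/n) * sum (alpha*x - y)^2 + lam * alpha^2   (all values invented)

X = [1.0, 2.0, 3.0]
Y = [2.1, 3.9, 6.0]
lam = 0.01
n = len(X)

sxx = sum(x * x for x in X)
sxy = sum(x * y for x, y in zip(X, Y))

alpha = sxy / (sxx + n * lam)       # single-task minimizer (zero gradient)
gamma = 2.0 * sxx / n + 2.0 * lam   # Hessian (second derivative) at alpha
```

In higher dimensions $\boldsymbol{\alpha}^{(t)}$ becomes the fitted weight vector and $\boldsymbol{\Gamma}^{(t)}$ a $d \times d$ matrix, but the recipe is the same: fit the single task, then differentiate its loss twice at the solution.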

4.3 Zero-Shot Transfer Learning

In a lifelong setting, when faced with a new task, the agent's goal is to learn an effective policy for that task as quickly as possible. At this stage, previous multi-task and lifelong learners incur a delay before they can produce a decent policy, since they first need to acquire data from the new task in order to identify related knowledge and train the new policy via transfer.

Incorporating task descriptors enables our approach to predict a policy for the new task immediately, given only the descriptor. This ability to perform zero-shot transfer is enabled by the use of coupled dictionary learning, which allows us to observe a data instance in one feature space (i.e., the task descriptor), and then recover its underlying latent signal in the other feature space (i.e., the policy parameters) using the dictionaries and sparse coding.

Given only the descriptor $m^{(t_{\text{new}})}$ for a new task $\mathcal{Z}^{(t_{\text{new}})}$, we can estimate the embedding of the task in the latent descriptor space via LASSO on the learned dictionary $D$:

$$\tilde{\mathbf{s}}^{(t_{\text{new}})} = \arg\min_{\mathbf{s}} \big\| \phi\big(m^{(t_{\text{new}})}\big) - D\mathbf{s} \big\|_2^2 + \mu \|\mathbf{s}\|_1 \enspace . \tag{10}$$

Since the estimate $\tilde{\mathbf{s}}^{(t_{\text{new}})}$ also serves as the coefficients over the latent policy space $L$, we can immediately predict a policy for the new task as $\tilde{\boldsymbol{\theta}}^{(t_{\text{new}})} = L\tilde{\mathbf{s}}^{(t_{\text{new}})}$. This zero-shot transfer learning procedure is given as Algorithm 2.

1: Inputs: task descriptor $m^{(t_{\text{new}})}$, learned bases $L$ and $D$
2: $\tilde{\mathbf{s}}^{(t_{\text{new}})} \leftarrow \arg\min_{\mathbf{s}} \| \phi(m^{(t_{\text{new}})}) - D\mathbf{s} \|_2^2 + \mu \|\mathbf{s}\|_1$               Eq. 10
3: $\tilde{\boldsymbol{\theta}}^{(t_{\text{new}})} \leftarrow L\tilde{\mathbf{s}}^{(t_{\text{new}})}$
4: Return $\tilde{\boldsymbol{\theta}}^{(t_{\text{new}})}$
Algorithm 2  Zero-Shot Transfer to a New Task
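Putting the zero-shot procedure into code, a self-contained sketch: a pure-Python coordinate-descent LASSO stands in for the solver of Eq. 10, and the dictionaries are toy hand-picked values rather than learned ones.

```python
# Hedged sketch of zero-shot transfer: sparse-code the descriptor phi(m) on
# the descriptor dictionary D via LASSO, then predict theta = L s. Toy data.

def lasso_cd(D, phi, mu=0.1, iters=200):
    """Coordinate-descent LASSO: min_s ||phi - D s||^2 + mu ||s||_1."""
    d_m, k = len(D), len(D[0])
    s = [0.0] * k
    for _ in range(iters):
        for j in range(k):
            # residual excluding coordinate j
            r = [phi[i] - sum(D[i][l] * s[l] for l in range(k) if l != j)
                 for i in range(d_m)]
            rho_j = sum(D[i][j] * r[i] for i in range(d_m))
            z_j = sum(D[i][j] ** 2 for i in range(d_m))
            # soft-thresholding update
            if rho_j > mu / 2:
                s[j] = (rho_j - mu / 2) / z_j
            elif rho_j < -mu / 2:
                s[j] = (rho_j + mu / 2) / z_j
            else:
                s[j] = 0.0
    return s

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# toy coupled dictionaries: d_m = 2 descriptor features, d = 3 policy
# parameters, k = 2 shared basis columns (values invented, not learned)
D = [[1.0, 0.0],
     [0.0, 1.0]]
L = [[2.0, 0.0],
     [0.0, -1.0],
     [1.0, 1.0]]

phi_new = [1.0, 0.0]             # descriptor of an unseen task
s_new = lasso_cd(D, phi_new)     # sparse code estimated from descriptor alone
theta_new = matvec(L, s_new)     # predicted policy parameters, no task data
```

Note that no training data for the new task appears anywhere: the descriptor alone selects the sparse code, and the coupled policy dictionary turns that code into parameters.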

5 Theoretical Analysis

This section examines theoretical issues related to incorporating task descriptors into lifelong learning via coupled dictionaries. We start by outlining theory to support why the inclusion of task features can improve the performance of the learned policies and safely enable zero-shot transfer to new tasks. We then prove the convergence of TaDeLL. A full sample complexity analysis is beyond the scope of this paper, and, indeed, remains an open problem for zero-shot learning [Romera-Paredes & Torr, 2015].

5.1 Connections to Mutual Coherence in Sparse Coding

To analyze the policy improvement, since the policy parameters are factored as $\boldsymbol{\theta}^{(t)} = L\mathbf{s}^{(t)}$, we proceed by showing that incorporating the descriptors through coupled dictionaries can improve both $L$ and $\mathbf{s}^{(t)}$. Note that learning these dictionaries faster means faster knowledge transfer and more accurate task prediction. In this analysis, we employ the concept of mutual coherence, which has been studied extensively in the sparse recovery literature [Donoho et al., 2006]. Mutual coherence measures the similarity of a dictionary's elements as

$$\mu(D) = \max_{1 \leq i \neq j \leq k} \frac{\big| \mathbf{d}_i^\top \mathbf{d}_j \big|}{\|\mathbf{d}_i\|_2 \, \|\mathbf{d}_j\|_2} \enspace ,$$

where $\mathbf{d}_i$ is the $i$th column of a dictionary $D$. If $\mu(D) = 0$, then $D$ is an invertible orthogonal matrix and so sparse recovery can be solved directly by inversion; $\mu(D) = 1$ implies that $D$ is not full rank and a poor dictionary. Intuitively, low mutual coherence indicates that the dictionary's columns are considerably different, and thus such a "good" dictionary can represent a wider range of tasks, potentially yielding more knowledge transfer. This intuition is shown in the following:

Theorem 5.1.

[Donoho, Elad, & Temlyakov 2006] Suppose we have noisy observations y = Ax + e of the linear system y₀ = Ax, such that ‖e‖₂ < ε. Let x be a solution to the system, and let k = ‖x‖₀. If k < (1 + μ(A)⁻¹)/2, then x is the unique sparsest solution of the system. Moreover, if x̂ is the LASSO solution for the system from the noisy observations, then ‖x̂ − x‖₂² ≤ 4ε² / (1 − μ(A)(4k − 1)).

Therefore, a dictionary with low mutual coherence leads to more stable solutions for the s^(t)'s against inaccurate single-task estimates of the policies (the α^(t)'s). We next show that our approach likely lowers the mutual coherence of the dictionary.

TaDeLL alters the problem from training the policy dictionary L alone to training the coupled dictionaries L and D (stacked in K). Let s^(t) be the solution to Eq. 1 for task t, which is unique under sparse recovery theory, so s^(t) remains unchanged for all tasks. Theorem 5.1 implies that, if μ(K) ≤ μ(L), coupled dictionary learning can help with a more accurate recovery of the s^(t)'s. To show this, we note that Eq. 7 can also be derived as an MAP estimate from a Bayesian perspective, enforcing a Laplacian prior on the s^(t)'s and assuming K to be a Gaussian random matrix with elements drawn i.i.d. When considering such a random matrix, Donoho & Huo (2001) proved that the mutual coherence asymptotically decays as the number of rows grows. Using this as an estimate for μ(L) and μ(K), since incorporating task descriptors increases the number of rows from d to d + m, asymptotically μ(K) ≤ μ(L), implying that TaDeLL learns a superior dictionary. Moreover, if μ(D) ≤ μ(L), the theorem implies we can use D alone to recover the task policies through zero-shot transfer.
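As a quick numerical illustration of this asymptotic behavior (our own check, not an experiment from the paper; the dimensions are arbitrary), the average mutual coherence of i.i.d. Gaussian dictionaries drops when rows are added, exactly as stacking D under L does:

```python
import numpy as np

def mutual_coherence(A):
    """Largest absolute cosine similarity between distinct columns of A."""
    A_norm = A / np.linalg.norm(A, axis=0)
    gram = np.abs(A_norm.T @ A_norm)
    np.fill_diagonal(gram, 0.0)
    return gram.max()

rng = np.random.default_rng(42)
k = 10          # number of dictionary columns (latent dimension)
d, m = 20, 20   # policy dimension and descriptor dimension (illustrative)

# Average coherence of a d-row dictionary (like L) versus a (d+m)-row
# dictionary (like K = [L; D]) over many random i.i.d. Gaussian draws.
trials = 200
mu_L = np.mean([mutual_coherence(rng.standard_normal((d, k)))
                for _ in range(trials)])
mu_K = np.mean([mutual_coherence(rng.standard_normal((d + m, k)))
                for _ in range(trials)])
```

On these draws `mu_K` comes out below `mu_L`, consistent with the asymptotic argument above.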

To show that task features can also improve the sparse recovery, we rely on the following theorem about LASSO:

Theorem 5.2.

[Negahban, Yu, Wainwright, & Ravikumar 2009] Let x be the unique solution to the system y = Ax with A ∈ ℝ^{n×k} and sparsity h = ‖x‖₀. If x̂ is the LASSO solution for the system from noisy observations, then with high probability: ‖x̂ − x‖₂ ≤ c √(h log k / n), where the constant c depends on properties of the linear system and observations.

This theorem shows that the LASSO reconstruction error is proportional to 1/√n, where n is the number of rows of the dictionary. When we incorporate the descriptors through K, the number of rows increases from d to d + m while h and k remain constant, shrinking the right-hand side and yielding a tighter fit. Therefore, task descriptors can improve both the learned dictionary quality and the sparse recovery accuracy. To ensure an equivalently tight fit for s^(t) when using either the policies (through L, with d rows) or the descriptors alone (through D, with m rows), Theorem 5.2 suggests that we should have m ≥ d to ensure that zero-shot learning yields similarly tight estimates of s^(t).
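Spelling out this last condition (a sketch in our own notation: h denotes the sparsity of the code, k the number of dictionary columns, and c the constant from Theorem 5.2), applying the theorem's bound to recovery from the d-row policy dictionary L and from the m-row descriptor dictionary D gives:

```latex
\underbrace{\|\hat{s} - s\|_2 \;\le\; c\,\sqrt{\frac{h \log k}{d}}}_{\text{recovery through } L}
\qquad\qquad
\underbrace{\|\hat{s} - s\|_2 \;\le\; c\,\sqrt{\frac{h \log k}{m}}}_{\text{recovery through } D}
```

so the descriptor-based bound is at least as tight as the policy-based bound exactly when m ≥ d.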

5.2 Theoretical Convergence of TaDeLL

In this section, we prove the convergence of TaDeLL, showing that the learned dictionaries become increasingly stable as the learner sees more tasks. We build upon the theoretical results of Bou Ammar et al. (2014) and Ruvolo & Eaton (2013), demonstrating that these results apply to coupled dictionary learning with task descriptors, and use them to prove convergence.

Let ĝ_T(K) represent the sparse-coded approximation to the MTL objective, which can be defined as:

ĝ_T(K) = (1/T) Σ_{t=1}^{T} [ ‖β^(t) − K s^(t)‖²_{Λ^(t)} + μ ‖s^(t)‖₁ ] + λ ‖K‖_F² ,

where β^(t) stacks the single-task policy estimate α^(t) with the task descriptor φ(m^(t)), and Λ^(t) is the corresponding block-diagonal weighting matrix. This equation can be viewed as the cost for K when the sparse coefficients s^(t) are kept constant. Let K_T be the version of the dictionary obtained after observing T tasks. Given these definitions, we consider the following theorem:

Theorem 5.3.

[Ruvolo  EatonRuvolo  Eaton2013]

  1. The learned dictionary stabilizes over learning with rate ‖K_T − K_{T−1}‖_F = O(1/T).

  2. ĝ_T(K_T) converges almost surely.

  3. ĝ_T(K_T) − g_T(K_T) converges almost surely to zero, where g_T is the true MTL objective.

This theorem requires two conditions:

  1. The tuples (Λ^(t), β^(t)) are drawn i.i.d. from a distribution with compact support, which bounds the norms of Λ^(t) and β^(t).

  2. For all t, let K_s be the subset of the dictionary K in which only the columns corresponding to non-zero elements of s^(t) are included. Then, all eigenvalues of the matrix K_s⊤ Λ^(t) K_s need to be strictly positive.

Bou Ammar et al. (2014) show that both of these conditions are met for the lifelong learning framework given in Eqs. 2–5. When we incorporate the task descriptors into this framework, we alter α^(t), Γ^(t), and L into β^(t), Λ^(t), and K. Note that β^(t) and Λ^(t) are formed by adding deterministic entries to α^(t) and Γ^(t), and thus can still be considered to be drawn i.i.d. (because α^(t) and Γ^(t) are assumed to be drawn i.i.d.). Therefore, incorporating task descriptors does not violate Condition 1.

To show that Condition 2 holds, if we analogously form K_s, then the eigenvalues of K_s⊤ Λ^(t) K_s are strictly positive, because they are either eigenvalues of the corresponding matrix formed from L (which are strictly positive according to [Bou Ammar, Eaton, & Ruvolo 2014]) or equal to the regularization parameter, which is positive by definition. Thus, both conditions are met, and convergence follows directly from Theorem 5.3.

5.3 Computational Complexity

In this section, we analyze the computational complexity of TaDeLL. Each update begins with one PG step to update α^(t) and Γ^(t) at a cost of O(ξ(d, n_t)), where ξ depends on the base PG learner and n_t is the number of trajectories obtained for task t. The cost of updating L and s^(t) alone is O(k²d³) [Ruvolo & Eaton 2013], and so the cost of updating the coupled dictionaries and s^(t) through coupled dictionary learning is O(k²(d + m)³). This yields an overall per-update cost of O(ξ(d, n_t) + k²(d + m)³), which is independent of the number of tasks T.

Next, we empirically demonstrate the benefits of TaDeLL on a variety of different learning problems.

6 Evaluation on Reinforcement Learning Domains

We apply TaDeLL to a series of RL problems, considering the problem of learning policies for collections of different but related systems. We use three benchmark control problems and an application to quadrotor stabilization.

6.1 Benchmark Dynamical Systems

Spring Mass Damper (SM)   The SM system is described by three parameters: the spring constant, mass, and damping constant. The system’s state is given by the position and velocity of the mass. The controller applies a force to the mass, attempting to stabilize it to a given position.

Cart Pole (CP)   The CP system involves balancing an inverted pendulum by applying a force to the cart. The system is characterized by the cart and pole masses, pole length, and a damping parameter. The states are the position and velocity of the cart and the angle and rotational velocity of the pole.

Bicycle (BK)   This system focuses on keeping a bicycle balanced upright as it rolls along a horizontal plane at constant velocity. The system is characterized by the bicycle mass, the coordinates of its center of mass, and parameters relating to the shape of the bike (the wheelbase, trail, and head angle). The state is the bike's tilt and its derivative; the actions are the torque applied to the handlebar and its derivative.

Figure 5: Performance of multi-task (solid lines), lifelong (dashed), and single-task learning (dotted) on the benchmark dynamical systems: (a) Simple Mass, (b) Cart Pole, (c) Bicycle. (Best viewed in color.)

6.2 Methodology

In each domain we generated 40 tasks, each with different dynamics, by varying the system parameters. The reward for each task was based on the distance between the current state and the goal. For lifelong learning, tasks were encountered consecutively with repetition, and learning proceeded until each task had been seen at least once. We used the same random task order between methods to ensure a fair comparison. The learners sampled trajectories of 100 steps, and the learning session during each task presentation was limited to 30 iterations. For MTL, all tasks were presented simultaneously. We used Natural Actor Critic [Peters & Schaal 2008] as the base learner for the benchmark systems and episodic REINFORCE [Williams 1992] for quadrotor control. We chose the dictionary size k and the regularization parameters independently for each domain to optimize the combined performance of all methods on 20 held-out tasks, and set the descriptor weight to balance the fit to the descriptors against the fit to the policies. We measured learning curves based on the final policies for each of the 40 tasks. The system parameters for each task were used as the task descriptor features; we also tried several non-linear transformations of these parameters as features, but found that the linear features worked well.

6.3 Results on Benchmark Systems

Figure 6: Zero-shot transfer to new tasks, showing the initial "jumpstart" improvement on each task domain: (a) Simple Mass, (b) Cart Pole, (c) Bicycle. (Best viewed in color.)
Figure 7: Learning performance when using the zero-shot policies as warm-start initializations for PG on (a) Simple Mass, (b) Cart Pole, and (c) Bicycle. The performance of the single-task PG learner is included for comparison. (Best viewed in color.)

Figure 5 compares our TaDeLL approach for lifelong learning with task descriptors to: 1) PG-ELLA [Bou Ammar, Eaton, & Ruvolo 2014], which does not use task features; 2) GO-MTL [Kumar & Daumé 2012], the MTL optimization of Eq. 1; and 3) single-task learning using PG. For comparison, we also performed an offline MTL optimization of Eq. 7 via alternating optimization, and plot the results as TaDeMTL. The shaded regions on the plots denote standard error bars.

We see that task descriptors improve lifelong learning on every system, even driving performance to a level that is unachievable from training the policies from experience alone via GO-MTL in the SM and BK domains. The difference between TaDeLL and TaDeMTL is also negligible for all domains except CP, demonstrating the effectiveness of our online optimization.

To measure zero-shot performance, we generated an additional 40 tasks for each domain, averaging results over these new tasks. Figure 6 shows that task descriptors are effective for zero-shot transfer to new tasks. We see that our approach improves the initial performance (i.e., the "jumpstart" [Taylor & Stone 2009]) on new tasks, outperforming the method of Sinapov et al. (2015) and single-task PG, which was allowed to train on the task. We attribute the especially poor performance of Sinapov et al.'s method on CP to the fact that the CP policies differ substantially across tasks; in domains where the source policies are vastly different from the target policies, their algorithm does not have an appropriate source to transfer. Their approach is also much more computationally expensive (quadratic in the number of tasks) than ours (linear in the number of tasks), as shown in Figure 14; details of the runtime experiments are included in Section 8.2. Figure 7 shows that the zero-shot policies can also be used effectively as a warm-start initialization for a PG learner, which is then allowed to improve the policy.

6.4 Application to Quadrotor Control

We also applied our approach to the more challenging domain of quadrotor control, focusing on zero-shot transfer to new stability tasks. To ensure realistic dynamics, we use the model of Bouabdallah & Siegwart (2005), which has been verified on physical systems. The quadrotors are characterized by three inertial constants and the arm length, with their state consisting of roll, pitch, and yaw and their derivatives.

Figure 8: Warm start learning on quadrotor control. (Best viewed in color.)

Figure 8 shows the results of our application, demonstrating that TaDeLL can predict a controller for new quadrotors through zero-shot learning that has equivalent accuracy to PG, which had to train on the system. As with the benchmarks, TaDeLL is effective for warm start learning with PG.

7 Evaluation on Supervised Learning Domains

In this section, we evaluate TaDeLL on regression and classification domains, considering the problem of predicting the real-valued location of a robot’s end effector and two synthetic classification tasks.

7.1 Predicting the Location of a Robot End Effector

In this section, we evaluate TaDeLL on a regression domain. We look at the problem of predicting the real-valued position of the end effector of an 8-DOF robotic arm in 3D space, given the angles of the robot joints. Different robots have different link lengths, offsets, and twists, and we use these parameters as the description of the task.

We consider 200 different robot arms and use 10 points as training data per robot. The robot arms are simulated using the Robot Toolbox [CorkeCorke2011]. The learned dictionaries are then used to predict models for 200 different unseen robots. We measure performance as the mean square error of the prediction against the true location of the end effector.

Table 1 shows that both TaDeLL and ELLA outperform the single-task learner, with TaDeLL slightly outperforming ELLA. This same improvement holds for zero-shot prediction on new robot arms, with TaDeLL outperforming the single-task learner, which was trained on the new robot.

To better understand the relationship of dictionary size to performance, we investigated how learning performance varies with the number of bases in the dictionary. Figure 10 shows this relationship for the lifelong learning and zero-shot prediction settings. We observe that TaDeLL performs better with a larger dictionary than ELLA does; we hypothesize that this difference results from the added difficulty of encoding the representations together with the task descriptions. To test this hypothesis, we reduced the number of descriptors in an ablative experiment. Recall that the task has 24 descriptors consisting of a twist, link offset, and link length for each joint. We reduced the number of descriptors by removing, in turn, the subsets of features corresponding to the twist, offset, and length. Figure 11 shows the results of this ablative experiment, revealing that the need for the increased number of bases is particularly related to learning the twist.

Algorithm   Lifelong Learning   Zero-Shot Prediction
TaDeLL      0.131 ± 0.004       0.159 ± 0.005
ELLA        0.152 ± 0.005       N/A
STL         0.73 ± 0.07         0.70 ± 0.05
Table 1: Regression performance on robot end effector prediction in both the lifelong learning and zero-shot settings. Performance is measured as mean squared error.
Figure 9: Example model of an 8-DOF robot. (Photo of the Sawyer arm by Rethink Robotics.)
(a) Lifelong Learning.
(b) Zero-shot Prediction.
Figure 10: Performance of TaDeLL and ELLA as the dictionary size k is varied, for lifelong learning and zero-shot learning. Performance of the single-task learner is provided for comparison. In the lifelong learning setting, both TaDeLL and ELLA demonstrate positive transfer that converges to the performance of the single-task learner as k is increased. We see that, for this problem, TaDeLL prefers a slightly larger value of k.
Figure 11: An ablative experiment studying the performance of TaDeLL as a function of the dictionary size k, as we vary the subset of descriptors used. The features consist of twist (t), length (l), and offset (o) variables for each joint. We train TaDeLL using only subsets of the features, and we see that the need for a larger k is directly related to learning the twist. Subsets that contain twist descriptors are shown in magenta; trials that do not include twist descriptors are shown in gray. Performance of ELLA and the single-task learner (STL) are provided for comparison. (Best viewed in color.)

7.2 Experiments on Synthetic Classification Domains

To better understand the connections between TaDeLL’s performance and the structure of the tasks, we evaluated TaDeLL on two synthetic classification domains. The use of synthetic domains allows us to tightly control the task generation process and the relationship between the target model and the descriptor.

The first synthetic domain consists of binary-labeled instances, where each sample x belongs to the positive class iff w⊤x ≥ 0. Each task has a different parameter vector w drawn from a uniform distribution; these vectors are also used as the task descriptors. Note that by sampling from the uniform distribution, this domain violates the assumption of ELLA that the task models are drawn from a common set of latent features. Each task's data consists of 10 training samples, and we generated 100 tasks to evaluate lifelong learning.
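The generation process for this domain can be sketched as follows (a sketch under our own assumptions: the instance dimension, instance distribution, and sampling range of w are illustrative, not values from the paper):

```python
import numpy as np

def make_synthetic_task(dim=8, n_samples=10, rng=None):
    """Generate one task from Synthetic Domain 1 (illustrative sketch).

    The task parameter vector w doubles as the task descriptor;
    labels indicate the side of the hyperplane w^T x = 0.
    """
    if rng is None:
        rng = np.random.default_rng()
    w = rng.uniform(-1.0, 1.0, size=dim)       # task model = descriptor
    X = rng.standard_normal((n_samples, dim))  # training instances
    y = (X @ w >= 0).astype(int)               # positive class iff w^T x >= 0
    return X, y, w

# 100 tasks with 10 training samples each, as in the lifelong setting.
rng = np.random.default_rng(1)
tasks = [make_synthetic_task(rng=rng) for _ in range(100)]
```

Because each w is sampled independently and uniformly, there is no shared latent basis underlying the task models, which is exactly the violation of ELLA's assumption noted above.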

Table 2 shows the performance on this Synthetic Domain 1. We see that the inclusion of meaningful task descriptors enables TaDeLL to learn a better dictionary than ELLA in a lifelong learning setting. We also generated an additional 100 unseen tasks to evaluate zero-shot prediction, which is similarly successful.

Algorithm   Lifelong Learning   Zero-Shot Prediction
TaDeLL      0.926 ± 0.004       0.930 ± 0.002
ELLA        0.814 ± 0.008       N/A
STL         0.755 ± 0.009       0.762 ± 0.008
Table 2: Classification accuracy on Synthetic Domain 1.

For the second synthetic domain, we generated random L and D matrices, and then generated a random sparse vector s^(t) for each task. The true task model is then given by a logistic regression classifier with parameters θ^(t) = L s^(t). This generation process directly follows the assumptions of ELLA and TaDeLL, where each s^(t) is generated independently. We similarly generated 100 tasks for lifelong learning and another 100 unseen tasks for zero-shot prediction, and used the true task models to label 10 training points per task. In this experiment, we empirically demonstrate that TaDeLL works well when this assumption holds (Table 3), in both the lifelong learning and zero-shot prediction settings.
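This generation process can be sketched as follows (a sketch; the dimensions, sparsity level, and distributions are our illustrative assumptions, not values from the paper):

```python
import numpy as np

def make_domain2_tasks(n_tasks=100, d=8, m=6, k=5, sparsity=2, rng=None):
    """Generate Synthetic Domain 2 tasks (illustrative sketch).

    A shared pair of dictionaries (L, D) and an independent sparse code
    s per task yield the true model theta = L s and descriptor phi = D s.
    """
    if rng is None:
        rng = np.random.default_rng()
    L = rng.standard_normal((d, k))   # shared model dictionary
    D = rng.standard_normal((m, k))   # shared descriptor dictionary
    tasks = []
    for _ in range(n_tasks):
        s = np.zeros(k)
        idx = rng.choice(k, size=sparsity, replace=False)
        s[idx] = rng.standard_normal(sparsity)   # sparse task code
        theta, phi = L @ s, D @ s
        X = rng.standard_normal((10, d))         # 10 training points
        # Logistic-regression labels sampled from the true model theta.
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        y = (rng.uniform(size=10) < p).astype(int)
        tasks.append((X, y, phi))
    return L, D, tasks

L_true, D_true, tasks = make_domain2_tasks(rng=np.random.default_rng(2))
```

Since the models and descriptors here share the same sparse code over coupled dictionaries, this construction matches TaDeLL's generative assumptions exactly.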

We also use this domain to investigate performance versus sample complexity, generating varying amounts of training data per task. In Figure 12(a), we see that TaDeLL is able to greatly improve performance given only a small number of samples; as expected, its benefit becomes less dramatic as the single-task learner receives sufficient samples. Figure 12(b) shows similar behavior in the zero-shot case.

Algorithm   Lifelong Learning   Zero-Shot Prediction
TaDeLL      0.889 ± 0.006       0.87 ± 0.01
ELLA        0.821 ± 0.007       N/A
STL         0.752 ± 0.009       0.751 ± 0.009
Table 3: Classification accuracy on Synthetic Domain 2.
Figure 12: Performance versus sample complexity on Synthetic Domain 2, for (a) lifelong learning and (b) zero-shot prediction.

8 Additional Experiments

Having shown how TaDeLL can improve learning in a variety of settings, we now turn our attention to understanding other aspects of the algorithm. Specifically, we look at the issue of task descriptor selection and partial information, runtime comparisons, and the effect of varying the number of tasks used to train the dictionaries.

8.1 Choice of Task Descriptor Features

For RL, we used the system parameters as the task description, and for the robot end effector prediction, we used the dimensions of the robot. While in these cases the choice of task descriptor was straightforward, this might not always be the case. It is unclear exactly how the choice of task descriptor features might affect the resulting performance. In other scenarios, we may have only partial knowledge of the system parameters.

To address these questions, we conducted additional experiments on the Spring-Mass (SM) system and the robot end effector problem, using various subsets of the task descriptor features when learning the coupled dictionaries. Figure 13(a) shows how the number and selection of parameters affect performance on the SM domain. We evaluated jumpstart performance when using all possible subsets of the system parameters (mass, damping constant, and spring constant) as the task descriptor features; these subsets are shown along the horizontal axis. Overall, the results show that the learner performs better when using larger subsets of the system parameters as the task descriptors.

The robot task has 24 descriptors consisting of a twist, link offset, and link length for each joint. We group the features describing twist, offset, and length together and examine removing different subsets. Figure 13(b) shows that twist is more important than the other features, and again the inclusion of more features improves performance.

Figure 13: Performance using various subsets of the SM system parameters (mass, damping constant, and spring constant) and the robot system parameters (twist, link length, and offset) as the task descriptors: (a) Spring-Mass RL, (b) Robot End Effector Prediction.

8.2 Computational Efficiency

Figure 14: Runtime comparison.

We compared the average per-task runtime of our approach to that of Sinapov et al. (2015), the method most closely related to ours. Since their method requires training transferability predictors between all pairs of tasks, its total runtime grows quadratically with the number of tasks. In comparison, our online algorithm is highly efficient. As shown in Section 5.3, the per-update cost of TaDeLL is independent of the number of tasks T, giving TaDeLL a total runtime that scales linearly in the number of tasks.

Figure 14 shows the per-task runtime for each algorithm on a set of 40 tasks, as evaluated on an Intel Core i7-4700HQ CPU. TaDeLL samples tasks randomly with replacement and terminates once every task has been seen. For Sinapov et al.'s method, we used 10 PG iterations for calculating the warm start, ensuring a fair comparison between the methods. These results show a substantial reduction in computational time for TaDeLL: two orders of magnitude over the 40 tasks.

8.3 Performance for Various Numbers of Tasks

Although we showed in Section 5.2 that the learned dictionaries become more stable as the system learns more tasks, we cannot currently guarantee that this stability improves the performance of zero-shot transfer. To evaluate the effect of the number of tasks on zero-shot performance, we conducted an additional set of experiments on both the Spring-Mass domain and the robot end effector prediction domain. Our results, shown in Figure 15, reveal that zero-shot performance does indeed improve as the dictionaries are trained over more tasks. This improvement is most stable and rapid in the MTL setting, since the optimization over all dictionaries and task policies is run to convergence, but TaDeLL also shows clear improvement in zero-shot performance as the number of tasks increases. Since zero-shot transfer involves only the learned coupled dictionaries, we can conclude that the quality of these dictionaries for zero-shot transfer improves as the system learns more tasks.

Figure 15: Zero-shot performance as a function of the number of tasks used to train the dictionary, for (a) Spring-Mass RL and (b) robot end effector prediction. As more tasks are used, the performance of zero-shot transfer improves.

9 Conclusion

This article demonstrated that incorporating high-level task descriptors into lifelong learning both improves learning performance and also enables zero-shot transfer to new tasks. The mechanism of using a coupled dictionary to connect the task descriptors with the learned models is relatively straightforward, yet highly effective in practice and has connections to mutual coherence in sparse coding. Most critically, it provides a fast and simple mechanism to predict the model or policy for a new task via zero-shot learning, given only its high level task descriptor. This approach is general and can handle multiple learning paradigms, including classification, regression, and RL tasks. Experiments demonstrate that our approach outperforms the state of the art and requires substantially less computational time than competing methods.

This ability to rapidly bootstrap models (or policies) for new tasks is critical to the development of lifelong learning systems that will be deployed for extended periods in real environments and tasked with handling a variety of tasks. High-level descriptions provide an effective way for humans to communicate and to instruct each other, and the description need not come from another agent: humans often read instructions and then complete a novel task quite effectively. Enabling lifelong learning systems to similarly take advantage of these high-level descriptions is an important step toward their practical deployment. As shown in our experiments with warm-start learning from the zero-shot predicted policy, these task descriptors can also be combined with training data on the new task in a hybrid approach.

Despite TaDeLL’s strong performance, defining what constitutes an effective task descriptor for a group of related tasks remains an open question. In our framework, task descriptors are given, typically as fundamental descriptions of the system. The representation we use for the task descriptors, a feature vector, is also relatively simple. One interesting direction for future work is to develop methods for integrating more complex task descriptors into MTL or lifelong learning. These more sophisticated mechanisms could include natural language descriptions, step-by-step instructions, or logical relationships. Such an advance would likely involve moving beyond the linear framework used in TaDeLL, but would constitute an important step toward enabling more practical use of high-level task descriptors in lifelong learning.


This research was supported by ONR grant #N00014-11-1-0139, AFRL grant #FA8750-14-1-0069, and AFRL grant #FA8750-16-1-0109. We would like to thank the anonymous reviewers of the conference version of this paper for their helpful feedback.


  • [Ando  ZhangAndo  Zhang2005] Ando, R. K.  Zhang, T. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data  The Journal of Machine Learning Research, 6, 1817–1853.
  • [Bakker  HeskesBakker  Heskes2003] Bakker, B.  Heskes, T. 2003. Task clustering and gating for Bayesian multitask learning  The Journal of Machine Learning Research, 4, 83–99.
  • [BaxterBaxter2000] Baxter, J. 2000. A model of inductive bias learning  The Journal of Artificial Intelligence Research, 12, 149–198.

  • [Bickel, Sawade,  SchefferBickel et al.2009] Bickel, S., Sawade, C.,  Scheffer, T. 2009. Transfer learning by distribution matching for targeted advertising  Advances in Neural Information Processing Systems, 145–152.
  • [Bonilla, Agakov,  WilliamsBonilla et al.2007] Bonilla, E. V., Agakov, F. V.,  Williams, C. 2007. Kernel multi-task learning using task-specific features  In Proceedings of the International Conference on Artificial Intelligence and Statistics, 43–50.
  • [Bou Ammar, Eaton, Luna,  RuvoloBou Ammar et al.2015] Bou Ammar, H., Eaton, E., Luna, J. M.,  Ruvolo, P. 2015. Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning  In Proceedings of the International Joint Conference on Artificial Intelligence.
  • [Bou Ammar, Eaton,  RuvoloBou Ammar et al.2014] Bou Ammar, H., Eaton, E.,  Ruvolo, P. 2014. Online multi-task learning for policy gradient methods  In Proceedings of the International Conference on Machine Learning.
  • [Bou Ammar, Eaton, Ruvolo,  TaylorBou Ammar et al.2015] Bou Ammar, H., Eaton, E., Ruvolo, P.,  Taylor, M. E. 2015. Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment  In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-15).
  • [Bouabdallah  SiegwartBouabdallah  Siegwart2005] Bouabdallah, S.  Siegwart, R. 2005. Backstepping and sliding-mode techniques applied to an indoor micro quadrotor  In Proceedings of the 2005 IEEE International Conference on Robotics and Automation., 2247–2252.
  • [CaruanaCaruana1997] Caruana, R. 1997. Multitask Learning  Machine Learning, 28, 41–75.
  • [Cavallanti, Cesa-Bianchi,  GentileCavallanti et al.2010] Cavallanti, G., Cesa-Bianchi, N.,  Gentile, C. 2010. Linear algorithms for online multitask classification  The Journal of Machine Learning Research, 11, 2901–2934.
  • [CorkeCorke2011] Corke, P. I. 2011. Robotics, Vision & Control: Fundamental Algorithms in Matlab. Springer.
  • [Dekel, Long,  SingerDekel et al.2006] Dekel, O., Long, P. M.,  Singer, Y. 2006. Online multitask learning  In Proceedings of the International Conference on Computational Learning Theory, 453–467. Springer.

  • [Donoho, Elad,  TemlyakovDonoho et al.2006] Donoho, D. L., Elad, M.,  Temlyakov, V. N. 2006. Stable recovery of sparse overcomplete representations in the presence of noise  IEEE Transactions on Information Theory, 52(1), 6–18.
  • [Donoho  HuoDonoho  Huo2001] Donoho, D. L.  Huo, X. 2001. Uncertainty principles and ideal atomic decomposition  IEEE Transactions on Information Theory, 47(7), 2845–2862.
  • [Evgeniou  PontilEvgeniou  Pontil2004] Evgeniou, T.  Pontil, M. 2004. Regularized multi–task learning  In Proceedings of the International Conference on Knowledge Discovery and Data Mining,  109–117. ACM.
  • [Ham, Lee,  SaulHam et al.2005] Ham, J., Lee, D. D.,  Saul, L. K. 2005. Semisupervised alignment of manifolds  In Proceedings of International Conference on Artificial Intelligence and Statistics,  120–127.
  • [Huang, Socher, Manning,  NgHuang et al.2012] Huang, E. H., Socher, R., Manning, C. D.,  Ng, A. 2012. Improving word representations via global context and multiple word prototypes  Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, 873–882.
  • [Isele, Rostami,  EatonIsele et al.2016] Isele, D., Rostami, M.,  Eaton, E. 2016. Using task features for zero-shot knowledge transfer in lifelong learning  In Proceedings of the International Joint Conference on Artificial Intelligence.
  • [Kober  PetersKober  Peters2009] Kober, J.  Peters, J. 2009. Policy search for motor primitives in robotics  Advances in Neural Information Processing Systems, 849–856.
  • [Kumar  DauméKumar  Daumé2012] Kumar, A.  Daumé, H. 2012. Learning task grouping and overlap in multi-task learning  In Proceedings of the International Conference on Machine Learning, 1383–1390.
  • [Lazaric  GhavamzadehLazaric  Ghavamzadeh2010] Lazaric, A.  Ghavamzadeh, M. 2010. Bayesian multi-task reinforcement learning  In Proceedings of International Conference on Machine Learning,  599–606. Omnipress.
  • [Maurer, Pontil,  Romera-ParedesMaurer et al.2013] Maurer, A., Pontil, M.,  Romera-Paredes, B. 2013. Sparse coding for multitask and transfer learning  In Proceedings of the International Conference on Machine Learning, 28, 343–351.
  • [Negahban, Yu, Wainwright,  RavikumarNegahban et al.2009] Negahban, S., Yu, B., Wainwright, M.,  Ravikumar, P. 2009. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers  In Advances in Neural Information Processing Systems, 1348–1356.
  • [Oyen  LaneOyen  Lane2012] Oyen, D.  Lane, T. 2012. Leveraging domain knowledge in multitask Bayesian network structure learning  In Proceedings of the AAAI Conference on Artificial Intelligence.
  • [Palatucci, Hinton, Pomerleau,  MitchellPalatucci et al.2009] Palatucci, M., Hinton, G., Pomerleau, D.,  Mitchell, T. M. 2009. Zero-shot learning with semantic output codes  Advances in Neural Information Processing Systems.
  • [Pan  YangPan  Yang2010] Pan, S. J.  Yang, Q. 2010. A survey on transfer learning  IEEE Transactions on Knowledge and Data Engineering, 22(10).
  • [Pennington, Socher,  ManningPennington et al.2014] Pennington, J., Socher, R.,  Manning, C. D. 2014. GloVe: Global vectors for word representation  Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 1532–1543.

  • [Peters  SchaalPeters  Schaal2008] Peters, J.  Schaal, S. 2008. Natural actor-critic  Neurocomputing, 71(7), 1180–1190.
  • [Romera-Paredes  TorrRomera-Paredes  Torr2015] Romera-Paredes, B.  Torr, P. H. S. 2015. An embarrassingly simple approach to zero-shot learning  Proceedings of International Conference on Machine Learning, 2152–2161.
  • [Ruvolo  EatonRuvolo  Eaton2013] Ruvolo, P.  Eaton, E. 2013. ELLA: An efficient lifelong learning algorithm  Proceedings of the International Conference on Machine Learning, 28, 507–515.
  • [Saha, Rai, Venkatasubramanian,  DaumeSaha et al.2011] Saha, A., Rai, P., Venkatasubramanian, S.,  Daume, H. 2011. Online learning of multiple tasks and their relationships  Proceedings of International Conference on Artificial Intelligence and Statistics, 643–651.
  • [Sinapov, Narvekar, Leonetti,  StoneSinapov et al.2015] Sinapov, J., Narvekar, S., Leonetti, M.,  Stone, P. 2015. Learning inter-task transferability in the absence of target task samples  Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems.
  • [Socher, Ganjoo, Manning,  NgSocher et al.2013] Socher, R., Ganjoo, M., Manning, C. D.,  Ng, A. Y. 2013. Zero-shot learning through cross-modal transfer  Advances in Neural Information Processing Systems, 935–943.
  • [Sutton, McAllester, Singh,  MansourSutton et al.1999] Sutton, R. S., McAllester, D. A., Singh, S. P.,  Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation  Advances in Neural Information Processing Systems, 99, 1057–1063.
  • [Taylor  StoneTaylor  Stone2009] Taylor, M. E.  Stone, P. 2009. Transfer learning for reinforcement learning domains: A survey  The Journal of Machine Learning Research, 10, 1633–1685.
  • [Taylor, Stone,  LiuTaylor et al.2007] Taylor, M. E., Stone, P.,  Liu, Y. 2007. Transfer learning via inter-task mappings for temporal difference learning  The Journal of Machine Learning Research, 8(Sep), 2125–2167.
  • [ThrunThrun1996] Thrun, S. 1996. Is learning the n-th thing any easier than learning the first?  Advances in Neural Information Processing Systems, 640–646.
  • [Wang  MahadevanWang  Mahadevan2009] Wang, C.  Mahadevan, S. 2009. A general framework for manifold alignment  In Proceedings of the AAAI Conference on Artificial Intelligence.
  • [WilliamsWilliams1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning  Machine Learning, 8(3-4), 229–256.
  • [Wilson, Fern, Ray,  TadepalliWilson et al.2007] Wilson, A., Fern, A., Ray, S.,  Tadepalli, P. 2007. Multi-task reinforcement learning: A hierarchical Bayesian approach  In Proceedings of the International Conference on Machine Learning,  1015–1022. ACM.
  • [Xu, Hospedales,  GongXu et al.2016] Xu, X., Hospedales, T. M.,  Gong, S. 2016. Multi-task zero-shot action recognition with prioritised data augmentation  In Proceedings of the European Conference on Computer Vision,  343–359. Springer.
  • [Yang, Wright, Huang,  MaYang et al.2010] Yang, J., Wright, J., Huang, T. S.,  Ma, Y. 2010. Image super-resolution via sparse representation  IEEE Transactions on Image Processing, 19(11), 2861–2873.
  • [Yu, Wu, Yang, Tian, Luo,  ZhuangYu et al.2014] Yu, Z., Wu, F., Yang, Y., Tian, Q., Luo, J.,  Zhuang, Y. 2014. Discriminative coupled dictionary hashing for fast cross-media retrieval  Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 395–404.
  • [Zhong  KwokZhong  Kwok2012] Zhong, L. W.  Kwok, J. T. 2012. Convex multitask learning with flexible task clusters  Proceedings of the International Conference on Machine Learning, 1, 49–56.
  • [Zhuang, Wang, Wu, Zhang,  LuZhuang et al.2013] Zhuang, Y. T., Wang, Y. F., Wu, F., Zhang, Y.,  Lu, W. M. 2013. Supervised coupled dictionary learning with group structures for multi-modal retrieval  Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence.