In many real-world problems we are interested in learning multiple tasks while the training set for each task is quite small. When the different tasks are related, one can learn all tasks simultaneously and aim for improved predictive performance by taking advantage of the common aspects of all tasks. This general idea is known as multi-task learning, and it has been successfully investigated in several technical settings, with applications in many areas including medical diagnosis (Bi et al., 2008), recommendation systems (Dinuzzo et al., 2008) and HIV therapy screening (Bickel et al., 2008).
In this paper we explore Bayesian models, especially Gaussian process (GP) models, where sharing the prior and its parameters among the tasks can be seen to implement multi-task learning (Álvarez et al., 2011; Bonilla et al., 2008; Xue et al., 2007; Gelman, 2004; Yu et al., 2005; Schwaighofer et al., 2005; Pillonetto et al., 2010). Our focus is on the functional mixed-effect model (Lu et al., 2008; Pillonetto et al., 2010) where each task is modeled as a sum of a fixed effect shared by all the tasks and a random effect that can be interpreted as representing task-specific deviations. In particular, both effects are realizations of zero-mean Gaussian processes. Thus, in this model, tasks share structure through the hyper-parameters of the prior and through the fixed-effect portion. This model has shown success in several applications, including geophysics (Lu et al., 2008), medicine (Pillonetto et al., 2010) and astrophysics (Wang et al., 2010). One of the main difficulties with this model, however, is computational cost: while the number of samples per task is small, the total sample size can be large, and the typical cubic complexity of GP inference can be prohibitive (Yu et al., 2005). Some improvement can be obtained when all the tasks share the same sampling points, or when different tasks share many of the input points (Pillonetto et al., 2009, 2010). However, if the number of distinct sampling points is large the complexity remains high. For example, this is the case in (Wang et al., 2010), where sample points are clipped to a fine grid to avoid the high cardinality of the example set.
The same problem, handling large samples, has been addressed in single-task formulations of GP regression, where several approaches for so-called sparse solutions have been developed (Rasmussen and Williams, 2005; Seeger and Lawrence, 2003; Snelson, 2006; Titsias, 2009). These methods approximate the GP with a small set of support variables (also called inducing variables or pseudo-inputs) and their corresponding function values, and perform inference using this set.
In this paper, we develop a sparse solution for multi-task learning with GPs in the context of the functional mixed-effect model. Specifically, we extend the approach of Titsias (2009) and develop a variational approximation that allows us to efficiently learn the shared hyper-parameters and choose the sparse pseudo-samples. In addition, we show how the variational approximation can be used to perform prediction efficiently once learning has been performed. Our approach is particularly useful when individual tasks have a small number of samples, different tasks do not share sampling points, and there is a large number of tasks. Our experiments, using artificial and real data, validate the approach, showing that it can recover the performance of inference with the full sample, that it performs better than simple sparse approaches for multi-task GPs, and that for some applications it significantly outperforms an alternative sparse multi-task GP formulation (Álvarez and Lawrence, 2011).
To summarize, our contribution is threefold. First, we introduce the first sparse solution for multi-task GP learning in the mixed-effect model. Second, we develop a variational model-selection approach for the proposed sparse model. Finally, we evaluate the algorithm and several baseline approaches for multi-task GPs, showing that the proposed method performs well.
This paper is organized as follows. Section 2 reviews the mixed-effect GP model and its direct inference. Section 3 develops the variational inference and model selection for the sparse mixed-effect GP model. Section 4 shows how to extend the sparse solution to the grouped mixed-effect GP model. We discuss related work in Section 5 and demonstrate the performance of the proposed approach using three datasets in Section 6. Section 7 concludes with a summary and directions for future work.
2 Mixed-effect GP for Multi-task Learning
In this section and the next one, we develop the mixed-effect model and its sparse solution without considering grouping. The model and results are extended to include grouping in Section 4. Consider a set of $M$ tasks where the data for the $j$-th task is given by $D_j = \{(x_{ji}, y_{ji})\}_{i=1}^{N_j}$. Multi-task learning aims to learn all tasks simultaneously, taking advantage of the common aspects of different tasks. In this paper, given the data $\{D_j\}_{j=1}^{M}$, we are interested in learning the nonparametric Bayesian mixed-effect model and using the model to perform inference. The model captures each task as a sum of an average effect function $\bar{f}$ and an individual variation $\tilde{f}_j$ specific to the $j$-th task. More precisely (Pillonetto et al., 2010):
For each $j = 1, \ldots, M$ and each input $x$,
where $\bar{f}$ and $\tilde{f}_j$ are zero-mean Gaussian processes. In addition, $\bar{f}$ and the set of $\{\tilde{f}_j\}_{j=1}^{M}$ are assumed to be mutually independent, with covariance functions $\bar{K}$ and $\tilde{K}$ respectively.
Assumption 1 implies that for any inputs $s, t$ and tasks $j, l$, the following holds:
$$\mathrm{Cov}\big(f_j(s), f_l(t)\big) = \bar{K}(s,t) + \delta_{jl}\,\tilde{K}(s,t),$$
where $\delta_{jl}$ is the Kronecker delta function. Let $\mathbf{y}$ be the concatenation of the examples from all tasks, and similarly let $X$ be the concatenation of the corresponding inputs. It can easily be seen that, for any $j$ and new input $x_*$ for task $j$, we have
where the covariance matrix is given by
From (3) we can extract the marginal distribution where
which can be used for model selection, that is, for learning the hyper-parameters of the GP. Equation (3) also provides the predictive distribution where
This works well in that sharing information improves predictive performance but, as the number of tasks grows, the dimension increases, leading to slow inference scaling as $\mathcal{O}(N^3)$, where $N = \sum_j N_j$ is the total number of samples. In other words, even though each task may have a very small sample, the multi-task inference problem becomes infeasible when the number of tasks is large.
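To make the cost concrete, the following sketch builds the joint covariance implied by the mixed-effect model (shared covariance everywhere plus a within-task block-diagonal term) and performs the direct cubic-cost solve. The RBF kernels, sizes, and variable names are illustrative choices of ours, not quantities from the paper.

```python
import numpy as np

def rbf(A, B, ls, var):
    """Squared-exponential kernel matrix between 1-D input arrays A and B."""
    d = A[:, None] - B[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

rng = np.random.default_rng(0)
M, Nj, noise = 20, 5, 0.1                 # tasks, samples per task, noise std
X = [np.sort(rng.uniform(0, 1, Nj)) for _ in range(M)]
Xall = np.concatenate(X)                  # N = M * Nj stacked inputs
N = Xall.size

# Joint covariance of y: the shared fixed effect couples all tasks,
# while the random effect adds only within-task (block-diagonal) covariance.
C = rbf(Xall, Xall, ls=0.3, var=1.0)      # \bar{K}(s, t) everywhere
for j in range(M):                        # + delta_{jl} * \tilde{K}(s, t)
    s = slice(j * Nj, (j + 1) * Nj)
    C[s, s] += rbf(X[j], X[j], ls=0.1, var=0.2)
C += noise ** 2 * np.eye(N)

y = rng.multivariate_normal(np.zeros(N), C)   # one sample from the model
alpha = np.linalg.solve(C, y)                 # the O(N^3) bottleneck
print(C.shape, alpha.shape)
```

Even for modest per-task sizes, the solve is over the full $N \times N$ matrix, which is what the sparse construction below avoids.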
In single-task GP regression, several sparse GP approaches have been proposed to reduce the computational cost (Rasmussen and Williams, 2005; Seeger and Lawrence, 2003; Snelson, 2006; Titsias, 2009). In general, these methods approximate the GP with a small number of support variables and perform inference using this subset and the corresponding function values. Different approaches differ in how they choose the support variables; the simplest approach is to choose a random subset of the given data points. Recently, Titsias (2009) introduced a sparse method based on variational inference using a set of inducing samples, which are distinct from the training points. In this approach, the inducing points are chosen to maximize a variational lower bound on the marginal likelihood, thus providing a clear methodology for the choice of the support set. Following this idea, Álvarez et al. (2010) proposed variational inference for sparse convolved multiple-output GPs.
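For intuition, the single-task variational bound of Titsias (2009) can be evaluated in a few lines: it is the log likelihood under the Nyström approximation plus a trace penalty for the discarded covariance. The sketch below is ours (RBF kernel, arbitrary sizes); it only illustrates that enlarging a nested set of inducing points tightens the bound.

```python
import numpy as np

def rbf(A, B, ls=0.3, var=1.0):
    d = A[:, None] - B[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def titsias_bound(Xu, X, y, noise):
    """Variational lower bound on the single-task log marginal likelihood
    with inducing inputs Xu (Titsias, 2009)."""
    n = X.size
    Kmm = rbf(Xu, Xu) + 1e-8 * np.eye(Xu.size)    # jitter for stability
    Knm = rbf(X, Xu)
    Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)       # Nystrom approximation
    S = Qnn + noise ** 2 * np.eye(n)
    logdet = np.linalg.slogdet(S)[1]
    quad = y @ np.linalg.solve(S, y)
    trace = np.trace(rbf(X, X) - Qnn)             # penalty for the approximation
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad) - trace / (2 * noise ** 2)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 50))
y = np.sin(6 * X) + 0.1 * rng.standard_normal(50)
Xu_big = np.linspace(0, 1, 9)
few = titsias_bound(Xu_big[::4], X, y, 0.1)       # 3 inducing points
more = titsias_bound(Xu_big, X, y, 0.1)           # 9 inducing points (superset)
print(few, more)
```

With nested inducing sets, adding inducing points can only increase the optimal bound, which is what makes the bound a principled selection criterion.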
In this paper we extend this approach to provide a sparse solution for the aforementioned model, as well as generalizing it to the grouped mixed-effect GP model (Wang et al., 2010). As in the case of sparse methods for single-task GPs, the key idea is to introduce a small set of auxiliary inducing sample points and to base the learning and inference on these points. For the multi-task case, each random effect $\tilde{f}_j$ is specific to the $j$-th task. Therefore, it makes sense to induce values only for the fixed-effect portion $\bar{f}$. The details of this construction are developed in the following sections.
3 Sparse mixed-effect GP Model
In this section, we develop a sparse solution for the mixed-effect model without group effect. The model is simpler to analyze and apply, and it thus provides a good introduction to the results developed in the next section for the grouped model.
3.1 Variational Model Selection
In this section we specify the sparse model, and show how we can learn the hyper-parameters and the inducing variables using the sparse model. As mentioned above, we introduce auxiliary inducing sample points $X_u$ and the corresponding hidden variables $u$, the values of the fixed effect $\bar{f}$ at $X_u$. Let $\bar{\mathbf{f}}_j$ and $\tilde{\mathbf{f}}_j$ denote the values of the two functions at the inputs of the $j$-th task. In addition let $\bar{\mathbf{f}} = (\bar{\mathbf{f}}_1, \ldots, \bar{\mathbf{f}}_M)$ and $\tilde{\mathbf{f}} = (\tilde{\mathbf{f}}_1, \ldots, \tilde{\mathbf{f}}_M)$, and similarly for $\mathbf{y}$.
To learn the hyper-parameters we wish to maximize the marginal likelihood $p(\mathbf{y})$, where $\mathbf{y}$ is the vector of all observations. In the following we develop a variational lower bound for this quantity. To this end, we need the complete-data likelihood and the variational distribution.
The complete data likelihood is given by:
We approximate the posterior on the hidden variables by
which extends the variational form used by Titsias (2009) to handle the individual variations as well as the multiple tasks. One can see that the variational distribution is not completely free-form. Instead, it preserves the exact form of the conditional $p(\bar{\mathbf{f}} \mid u)$, and in using it implicitly assumes that $u$ is a sufficient statistic for $\bar{\mathbf{f}}$. The free-form factor $\phi(u)$ corresponds to the posterior over $u$ but is allowed to diverge from this value to compensate for the assumption that $u$ is sufficient. Notice that we are not making any assumption about the sufficiency of $u$ in the generative model; the approximation is entirely due to the variational distribution. An additional assumption is added later to derive a simplified form of the predictive distribution.
The inner integral denoted as is
where the second line holds because, in the sum indexed by $j$, all the product measures other than the $j$-th
are integrated to 1, leaving only the $j$-th integral. Below we show that
where , , and . Thus we have
Let $X$ be a random variable and $g$ any positive function; then by Jensen's inequality, $\log \mathbb{E}[g(X)] \geq \mathbb{E}[\log g(X)]$. Therefore, the best lower bound we can derive from (9), if it is achievable, is the case where equality holds in Jensen's inequality. In subsection 3.1.2 we show that $\phi$ can be chosen to obtain equality, and therefore the variational lower bound is
Evaluating the integral by marginalizing out and recalling that is the concatenation of the , we get
Thus, we have explicitly written the parameters that can be chosen to further optimize the lower bound, namely the support inputs $X_u$ and the hyper-parameters of the covariance functions $\bar{K}$ and $\tilde{K}$. By calculating derivatives of (10) we can optimize the lower bound using a gradient-based method. In the experiments in this paper, we use stochastic gradient descent (SGD), which works better than conjugate gradients (CG) in this scenario where the number of tasks is large. Titsias (2009) outlines methods that can be used when gradients are not useful.
Consider the $j$-th element in the sum of (7):
where the third line holds because of the independence between $\bar{f}$ and $\tilde{f}_j$. We next show how this expectation can be evaluated. This is more complex than the single-task case because of the coupling of the fixed effect and the random effect.
3.1.2 Variational distribution
For equality to hold in Jensen's inequality, the function inside the expectation must be constant. In our case this is easily achieved because $\phi$ is a free parameter, and we can set
yielding the bound given in (10). This setting yields the form of the optimal variational distribution
from which we observe that is
Notice that by choosing the number of tasks to be 1 and the random effect to be a noise process, i.e., $\tilde{K}(s,t) = \sigma^2 \delta_{st}$, (10) and (26) are exactly the variational lower bound and the corresponding variational distribution of (Titsias, 2009).
3.2 Prediction using the Variational Solution
Given any task $j$, our goal is to calculate the predictive distribution of $f_j$ at some new input point $x_*$. As described before, full inference is expensive and therefore we wish to use the variational approximation for prediction as well. The key assumption is that $u$ contains as much information as $\mathbf{y}$ in terms of making predictions for $f_j(x_*)$. To start with, it is easy to see that the predictive distribution is Gaussian and that it satisfies
The above equation is more complex than the predictive distribution for a single-task sparse GP because of the coupling induced by the shared fixed effect. We next show how it can be calculated via conditioning.
where and are given in (11).
is sufficient for , i.e. . Since we are interested in prediction for each task separately, by marginalizing out , we also have and
For we can view as noisy realizations from the same GP as and therefore
In order to obtain a sparse form of the predictive distribution we need to make an additional assumption.
We assume that is sufficient for , i.e.,
The above set of conditional distributions also imply that and are independent given and .
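The conditioning steps that follow all rely on the standard Gaussian conditional formula; a quick numerical sketch with generic matrices (not the paper's quantities) may help fix it in mind.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
S = A @ A.T + 5 * np.eye(5)              # a joint covariance, partitioned below
Saa, Sab, Sbb = S[:2, :2], S[:2, 2:], S[2:, 2:]

b = rng.standard_normal(3)               # the observed sub-vector
# For a zero-mean Gaussian: a | b ~ N(Sab Sbb^{-1} b, Saa - Sab Sbb^{-1} Sba)
mean = Sab @ np.linalg.solve(Sbb, b)
cov = Saa - Sab @ np.linalg.solve(Sbb, Sab.T)
print(mean.shape, cov.shape)
```

The conditional covariance is the Schur complement of the observed block, so it stays positive definite whenever the joint covariance is.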
To evaluate (12), we have the following
We can easily get by marginalizing out in (15),
Finally, to calculate we have
we can get the second term. This yields
where . To simplify the notation, let , and . Then (18) can be evaluated as
We have therefore shown how to calculate the predictive distribution in (12). The complexity of these computations scales cubically only in the number of inducing points and the per-task sample sizes, which is a significant improvement over the $\mathcal{O}(N^3)$ complexity of full inference, where $N = \sum_j N_j$.
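To see why prediction through inducing points is cheap, the following single-task sketch computes the standard sparse predictive mean: once the $m \times m$ system involving the inducing points is solved, each new prediction costs only $\mathcal{O}(m)$. The kernel, sizes, and data are illustrative assumptions of ours.

```python
import numpy as np

def rbf(A, B, ls=0.2):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 1, 200))
y = np.sin(6 * X) + 0.1 * rng.standard_normal(200)
Xu = np.linspace(0, 1, 15)               # m = 15 inducing inputs
noise = 0.1

Kmm = rbf(Xu, Xu) + 1e-8 * np.eye(15)
Kmn = rbf(Xu, X)
Sigma = Kmm + Kmn @ Kmn.T / noise ** 2   # only an m x m system to solve
w = np.linalg.solve(Sigma, Kmn @ y) / noise ** 2

Xs = np.linspace(0, 1, 5)
mu = rbf(Xs, Xu) @ w                     # sparse predictive mean at new inputs
print(mu)
```

The full-sample alternative would solve an $N \times N$ system for every task jointly, which is exactly the cost the sparse construction removes.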
4 Sparse Grouped mixed-effect GP Model (GMT-GP)
In this section, we extend the mixed-effect GP model so that the fixed-effect functions admit a group structure. We call this the grouped mixed-effect GP model (GMT-GP). More precisely, each task is generated by selecting one of several shared fixed-effect GPs and adding its individual variation. We show how to perform inference and model selection efficiently.
4.1 Generative Model
First, we specify the sparse GMT-GP model, and show how we can learn the hyper-parameters and the inducing variables using this sparse model.
For each $j = 1, \ldots, M$ and each input $x$,
where the fixed effects $\bar{f}_k$ and the random effects $\tilde{f}_j$ are zero-mean Gaussian processes with covariance functions $\bar{K}$ and $\tilde{K}$ respectively. In addition, the $\bar{f}_k$ and $\tilde{f}_j$ are assumed to be mutually independent.
The generative process (shown in Fig. 1) is as follows, where Dir and Multi denote the Dirichlet and the Multinomial distribution respectively.
Draw the processes of the mean effect: ;
For the $j$-th task (time series):
Draw the random effect: ;
Draw , where and where to simplify the notation stands for .
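The generative process above can be sampled directly; the sketch below is a minimal simulation under assumptions of ours (RBF kernels, a shared time grid for readability, and symmetric Dirichlet weights), whereas the paper's setting allows each task its own sampling points.

```python
import numpy as np

def rbf(A, B, ls, var):
    d = A[:, None] - B[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

rng = np.random.default_rng(6)
k, M, Nj = 3, 12, 8                      # groups, tasks, samples per task
t = np.linspace(0, 1, Nj)
Kbar = rbf(t, t, 0.3, 1.0) + 1e-8 * np.eye(Nj)
Ktil = rbf(t, t, 0.1, 0.1) + 1e-8 * np.eye(Nj)

pi = rng.dirichlet(np.ones(k))                         # pi ~ Dir(alpha)
fbar = [rng.multivariate_normal(np.zeros(Nj), Kbar)    # k mean-effect draws
        for _ in range(k)]
tasks, z = [], []
for j in range(M):
    zj = rng.choice(k, p=pi)                           # group label ~ Multi(pi)
    ftil = rng.multivariate_normal(np.zeros(Nj), Ktil) # task-specific variation
    yj = fbar[zj] + ftil + 0.05 * rng.standard_normal(Nj)
    z.append(zj)
    tasks.append(yj)
print(len(tasks), tasks[0].shape)
```

Each observed series is thus one of the $k$ shared mean curves perturbed by a smooth task-specific deviation plus noise, which is exactly the structure the variational EM below must recover.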
4.2 Variational Model Selection
In this section we show how to perform learning via a variational approximation. The derivation follows the same outline as in the previous section, but due to the hidden variables that specify group membership, we have to use the variational EM algorithm. As mentioned above, for the $k$-th mixed effect (or center), we introduce auxiliary inducing support variables and the corresponding hidden variable, the value of the $k$-th fixed-effect function evaluated at the inducing points.
Let denote the function values of the -th mean effect so that is the sub-vector of corresponding to the -th task. Let be the values of the random effect at . Denote the collection of the hidden variables as , , and . In addition let , and , and similarly