# Nonparametric Bayesian Mixed-effect Model: a Sparse Gaussian Process Approach

Multi-task learning models using Gaussian processes (GP) have been developed and successfully applied in various applications. The main difficulty with this approach is the computational cost of inference using the union of examples from all tasks. Therefore sparse solutions, that avoid using the entire data directly and instead use a set of informative "representatives" are desirable. The paper investigates this problem for the grouped mixed-effect GP model where each individual response is given by a fixed-effect, taken from one of a set of unknown groups, plus a random individual effect function that captures variations among individuals. Such models have been widely used in previous work but no sparse solutions have been developed. The paper presents the first sparse solution for such problems, showing how the sparse approximation can be obtained by maximizing a variational lower bound on the marginal likelihood, generalizing ideas from single-task Gaussian processes to handle the mixed-effect model as well as grouping. Experiments using artificial and real data validate the approach showing that it can recover the performance of inference with the full sample, that it outperforms baseline methods, and that it outperforms state of the art sparse solutions for other multi-task GP formulations.


## 1 Introduction

In many real-world problems we are interested in learning multiple tasks while the training set for each task is quite small. When the different tasks are related, one can learn all tasks simultaneously and aim for improved predictive performance by taking advantage of the common aspects of all tasks. This general idea is known as multi-task learning and it has been successfully investigated in several technical settings, with applications in many areas including medical diagnosis (Bi et al., 2008), recommendation systems (Dinuzzo et al., 2008) and HIV therapy screening (Bickel et al., 2008).

In this paper we explore Bayesian models, especially ones using Gaussian processes (GP), where sharing the prior and its parameters among the tasks can be seen to implement multi-task learning (Álvarez et al., 2011; Bonilla et al., 2008; Xue et al., 2007; Gelman, 2004; Yu et al., 2005; Schwaighofer et al., 2005; Pillonetto et al., 2010). Our focus is on the functional mixed-effect model (Lu et al., 2008; Pillonetto et al., 2010) where each task is modeled as a sum of a fixed effect shared by all the tasks and a random effect that can be interpreted as representing task-specific deviations. In particular, both effects are realizations of zero-mean Gaussian processes. Thus, in this model, tasks share structure through the hyper-parameters of the prior and through the fixed-effect portion. This model has shown success in several applications, including geophysics (Lu et al., 2008), medicine (Pillonetto et al., 2010) and astrophysics (Wang et al., 2010). One of the main difficulties with this model, however, is computational cost: while the number of samples per task is small, the total sample size can be large, and the typical cubic complexity of GP inference can be prohibitive (Yu et al., 2005). Some improvement can be obtained when all the tasks share the same sampling points, or when different tasks share many of the input points (Pillonetto et al., 2009, 2010). However, if the number of distinct sampling points is large, the complexity remains high. For example, this is the case in (Wang et al., 2010) where sample points are clipped to a fine grid to avoid the high cardinality of the example set.

The same problem, handling large samples, has been addressed in single-task formulations of GP, where several approaches for so-called sparse solutions have been developed (Rasmussen and Williams, 2005; Seeger and Lawrence, 2003; Snelson, 2006; Titsias, 2009). These methods approximate the GP with a set of support variables (also called inducing variables or pseudo inputs) and their corresponding function values, and perform inference using this set.

In this paper, we develop a sparse solution for multi-task learning with GP in the context of the functional mixed-effect model. Specifically, we extend the approach of Titsias (2009) and develop a variational approximation that allows us to efficiently learn the shared hyper-parameters and choose the sparse pseudo samples. In addition, we show how the variational approximation can be used to perform prediction efficiently once learning has been performed. Our approach is particularly useful when individual tasks have a small number of samples, different tasks do not share sampling points, and there is a large number of tasks. Our experiments, using artificial and real data, validate the approach, showing that it can recover the performance of inference with the full sample, that it performs better than simple sparse approaches for multi-task GP, and that for some applications it significantly outperforms an alternative sparse multi-task GP formulation (Álvarez and Lawrence, 2011).

To summarize, our contribution is threefold. First, we introduce the first sparse solution for the multi-task GP mixed-effect model. Second, we develop a variational model-selection approach for the proposed sparse model. Finally, we evaluate the algorithm and several baseline approaches for multi-task GP, showing that the proposed method performs well.

This paper is organized as follows. Section 2 reviews the mixed-effect GP model and its direct inference. Section 3 develops the variational inference and model selection for the sparse mixed-effect GP model. Section 4 shows how to extend the sparse solution to the grouped mixed-effect GP model. We discuss related work in Section 5 and demonstrate the performance of the proposed approach using three datasets in Section 6. Section 7 concludes with a summary and directions for future work.

## 2 Mixed-effect GP for Multi-task Learning

In this section and the next one, we develop the mixed-effect model and its sparse solution without considering grouping. The model and results are extended to include grouping in Section 4. Consider a set of $M$ tasks where the data for the $j$-th task is given by $\{(x_i^j, y_i^j)\}_{i=1}^{N_j}$. Multi-task learning aims to learn all tasks simultaneously, taking advantage of common aspects of different tasks. In this paper, given the data from all tasks, we are interested in learning the nonparametric Bayesian mixed-effect model and using the model to perform inference. The model captures each task as a sum of an average effect function $\bar{f}$ and an individual variation $\tilde{f}_j$ specific to the $j$-th task. More precisely (Pillonetto et al., 2010):

###### Assumption 1

For each input $x$ and each task $j$,

$$f_j(x) = \bar{f}(x) + \tilde{f}_j(x), \qquad j = 1, \dots, M \tag{1}$$

where $\bar{f}$ and $\tilde{f}_j$ are zero-mean Gaussian processes. In addition, $\bar{f}$ and the set of $\{\tilde{f}_j\}$ are assumed to be mutually independent, with covariance functions $K(\cdot,\cdot)$ and $\tilde{K}(\cdot,\cdot)$ respectively.

Assumption 1 implies that for any inputs $s, t$ and any tasks $j, l$, the following holds:

$$\operatorname{Cov}[f_j(s), f_l(t)] = \operatorname{Cov}[\bar{f}(s), \bar{f}(t)] + \delta_{jl}\cdot\operatorname{Cov}[\tilde{f}(s), \tilde{f}(t)] \tag{2}$$

where $\delta_{jl}$ is the Kronecker delta function. Let $\breve{x}$ be the concatenation of the inputs from all tasks, and similarly let $\breve{y}$ be the concatenation of the targets, where $y_j = (y_1^j, \dots, y_{N_j}^j)$ and $N = \sum_j N_j$ is the total number of examples. It can easily be seen that, for any task $j$ and new input $x_*$, we have

$$\begin{bmatrix} f_j(x_*) \\ \breve{y} \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} C^{\dagger}(x_*, x_*) & C^{\dagger}(x_*, \breve{x}) \\ C^{\dagger}(\breve{x}, x_*) & C^{\dagger}(\breve{x}, \breve{x}) + \sigma^2 I \end{bmatrix}\right) \tag{3}$$

where the covariance matrix is given by

$$C^{\dagger}(x_i^j, x_k^l) = K(x_i^j, x_k^l) + \delta_{jl}\cdot\tilde{K}(x_i^j, x_k^l).$$

From (3) we can extract the marginal distribution of $\breve{y}$,

$$\breve{y} \mid \breve{x} \sim \mathcal{N}\left(0,\ C^{\dagger}(\breve{x}, \breve{x}) + \sigma^2 I\right), \tag{4}$$

which can be used for model selection, that is, learning the hyper-parameters of the GP. Equation (3) also provides the predictive distribution of $f_j(x_*)$, where

$$\begin{aligned} \mathbb{E}[f_j(x_*)] &= C^{\dagger}(x_*, \breve{x})\left(C^{\dagger}(\breve{x}, \breve{x}) + \sigma^2 I\right)^{-1}\breve{y} \\ \operatorname{Cov}[f_j(x_*)] &= C^{\dagger}(x_*, x_*) - C^{\dagger}(x_*, \breve{x})\left(C^{\dagger}(\breve{x}, \breve{x}) + \sigma^2 I\right)^{-1}C^{\dagger}(\breve{x}, x_*). \end{aligned} \tag{5}$$

This works well in that sharing the information improves predictive performance but, as the number of tasks grows, the dimension increases, leading to slow inference scaling as $O(N^3)$. In other words, even though each task may have a very small sample, the multi-task inference problem becomes infeasible when the number of tasks is large.
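As a concrete illustration, the dense prediction in (5) can be sketched in a few lines of NumPy. The squared-exponential kernels, length-scales and noise level below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rbf(a, b, scale=1.0, ell=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    return scale * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def mixed_effect_predict(x_all, y_all, task, x_star, j, sigma2=0.1):
    """Predictive mean and variance for task j at x_star, following eqs. (3)-(5).

    x_all, y_all : concatenated inputs/targets over all tasks
    task         : task index of each training point
    """
    same = (task[:, None] == task[None, :])
    # C-dagger: fixed-effect kernel plus task-matched random-effect kernel
    C = rbf(x_all, x_all) + same * rbf(x_all, x_all, scale=0.3, ell=0.5)
    c_star = (rbf(x_star, x_all)
              + (task[None, :] == j) * rbf(x_star, x_all, scale=0.3, ell=0.5))
    A = C + sigma2 * np.eye(len(x_all))
    mean = c_star @ np.linalg.solve(A, y_all)
    var = (rbf(x_star, x_star) + rbf(x_star, x_star, scale=0.3, ell=0.5)
           - c_star @ np.linalg.solve(A, c_star.T))
    return mean, np.diag(var)
```

The single `np.linalg.solve` over the union of all tasks' points is exactly the $O(N^3)$ operation that the sparse method of the next section avoids.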

In single-task GP regression, to reduce the computational cost, several sparse GP approaches have been proposed (Rasmussen and Williams, 2005; Seeger and Lawrence, 2003; Snelson, 2006; Titsias, 2009). In general, these methods approximate the GP with a small number of support variables and perform inference using this subset and the corresponding function values. Different approaches differ in how they choose the support variables; the simplest approach is to choose a random subset of the given data points. Recently, Titsias (2009) introduced a sparse method based on variational inference using a set of inducing samples, which are different from the training points. In this approach, the sample points are chosen to maximize a variational lower bound on the marginal likelihood, therefore providing a clear methodology for the choice of the support set. Following this idea, Álvarez et al. (2010) proposed a variational inference method for sparse convolved multiple-output GPs.

In this paper we extend this approach to provide a sparse solution for the aforementioned model, as well as generalizing it to the grouped mixed-effect GP model (Wang et al., 2010). As in the case of sparse methods for single-task GP, the key idea is to introduce a small set of auxiliary inducing sample points and base the learning and inference on these points. For the multi-task case, each random effect $\tilde{f}_j$ is specific to the $j$-th task. Therefore, it makes sense to induce values only for the fixed-effect portion $\bar{f}$. The details of this construction are developed in the following sections.
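The quality of an inducing-point approximation is governed by how well the Nyström matrix $Q = C_{nm}C_{mm}^{-1}C_{mn}$ reproduces the full kernel matrix $C$; the trace of the residual $C - Q$ reappears as a penalty term in the variational bound of Section 3. A small illustration with an arbitrary RBF kernel (the length-scale and grids are assumptions for the demo):

```python
import numpy as np

def rbf(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

x = np.linspace(0.0, 1.0, 200)       # "training" inputs
C = rbf(x, x)                        # full kernel matrix
errors = []
for m in (5, 10, 20):
    xm = np.linspace(0.0, 1.0, m)    # inducing inputs on a grid
    Cmm = rbf(xm, xm) + 1e-8 * np.eye(m)
    Cnm = rbf(x, xm)
    Q = Cnm @ np.linalg.solve(Cmm, Cnm.T)   # Nystrom approximation of C
    errors.append(np.trace(C - Q))          # residual penalized by the bound
```

Since $C - Q$ is positive semi-definite, the trace is nonnegative, and it shrinks quickly as inducing points are added.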

## 3 Sparse mixed-effect GP Model

In this section, we develop a sparse solution for the mixed-effect model without group effect. The model is simpler to analyze and apply, and it thus provides a good introduction to the results developed in the next section for the grouped model.

### 3.1 Variational Model Selection

In this section we specify the sparse model, and show how we can learn the hyper-parameters and the inducing variables using the sparse model. As mentioned above, we introduce auxiliary inducing sample points $X_m$ and hidden variables $f_m$, the values of $\bar{f}$ at $X_m$. Let $f_j$ and $\tilde{f}_j$ denote the values of the two functions $\bar{f}$ and $\tilde{f}_j$ at the inputs $x_j$ of the $j$-th task. In addition let $C_{jm} = K(x_j, X_m)$, $C_{mm} = K(X_m, X_m)$ and $C_{jj} = K(x_j, x_j)$, and similarly $\tilde{C}_{jj} = \tilde{K}(x_j, x_j)$ for $\tilde{K}$.

To learn the hyper-parameters we wish to maximize the marginal likelihood $\Pr(Y)$, where $Y = \{y_j\}_{j=1}^M$ denotes all the observations. In the following we develop a variational lower bound for this quantity. To this end, we need the complete data likelihood and the variational distribution.

• The complete data likelihood is given by:

$$\Pr(Y, \{f_j, \tilde{f}_j\}, f_m) = \left[\prod_{j=1}^M \Pr(y_j \mid f_j, \tilde{f}_j)\Pr(\tilde{f}_j)\right]\Pr(\{f_j\}\mid f_m)\Pr(f_m).$$

• We approximate the posterior on the hidden variables by

$$q(\{f_j, \tilde{f}_j\}, f_m) = \left[\prod_{j=1}^M \Pr(\tilde{f}_j \mid f_j, y_j)\right]\Pr(\{f_j\}\mid f_m)\,\phi(f_m) \tag{6}$$

which extends the variational form used by Titsias (2009) to handle the individual variations as well as the multiple tasks. One can see that the variational distribution is not completely in free form. Instead, $q$ preserves the exact form of $\Pr(\tilde{f}_j \mid f_j, y_j)$ and $\Pr(\{f_j\}\mid f_m)$, and in using it one implicitly assumes that $f_m$ is a sufficient statistic for $\{f_j\}$. The free-form factor $\phi(f_m)$ corresponds to $\Pr(f_m \mid Y)$ but is allowed to diverge from this value to compensate for the assumption that $f_m$ is sufficient. Notice that we are not making any assumption about the sufficiency of $f_m$ in the generative model; the approximation is entirely due to the variational distribution. An additional assumption is added later to derive a simplified form of the predictive distribution.

With the two ingredients ready, the variational lower bound (Jordan et al., 1999; Bishop, 2006), denoted $F_V$, is given by:

$$F_V = \int q(\{f_j, \tilde{f}_j\}, f_m)\log\frac{\Pr(Y, \{f_j, \tilde{f}_j\}, f_m)}{q(\{f_j, \tilde{f}_j\}, f_m)}\, d\{f_j\}\, d\{\tilde{f}_j\}\, df_m.$$

The inner integral, denoted $\log G(f_m, Y)$, is

$$\begin{aligned} &\int \left[\prod_{j=1}^M \Pr(\tilde{f}_j \mid f_j, y_j)\right]\Pr(\{f_j\}\mid f_m) \times \sum_{l=1}^M \log\left[\frac{\Pr(y_l \mid f_l, \tilde{f}_l)\Pr(\tilde{f}_l)}{\Pr(\tilde{f}_l \mid f_l, y_l)}\right] d\{f_j\}\, d\{\tilde{f}_j\} \\ &= \sum_{j=1}^M \int \Pr(\tilde{f}_j \mid f_j, y_j)\Pr(f_j \mid f_m) \times \log\left[\frac{\Pr(y_j \mid f_j, \tilde{f}_j)\Pr(\tilde{f}_j)}{\Pr(\tilde{f}_j \mid f_j, y_j)}\right] df_j\, d\tilde{f}_j \end{aligned} \tag{7}$$

where the second line holds because, in the sum indexed by $l$, all the product measures

$$\prod_{j=1, j\neq l}^{M} \Pr(\tilde{f}_j \mid f_j, y_j)\Pr(\{f_n\}_{n\neq l} \mid f_m, f_l)\, d\{f_j\}\, d\{\tilde{f}_j\},$$

are integrated to 1, leaving only the $l$-th integral. In subsection 3.1.1 we show that

$$\log G(f_m, Y) = \sum_{j=1}^M \left[\log\left[\mathcal{N}(y_j \mid \alpha_j, \hat{C}_{jj})\right] - \frac{1}{2}\operatorname{Tr}\left[(C_{jj} - Q_{jj})\hat{C}_{jj}^{-1}\right]\right] \tag{8}$$

where $\alpha_j = C_{jm}C_{mm}^{-1}f_m$, $Q_{jj} = C_{jm}C_{mm}^{-1}C_{mj}$, and $\hat{C}_{jj} = \tilde{C}_{jj} + \sigma_j^2 I$. Thus we have

$$F_V(X_m, \phi) = \int \phi(f_m)\log\left[\frac{G(f_m, Y)\Pr(f_m)}{\phi(f_m)}\right] df_m. \tag{9}$$

Let $X$ be a random variable and $g(\cdot)$ any nonnegative function; then by Jensen's inequality $\mathbb{E}[\log g(X)] \le \log\mathbb{E}[g(X)]$. Therefore, the best lower bound we can derive from (9), if it is achievable, is the case where equality holds in Jensen's inequality. In subsection 3.1.2 we show that $\phi(f_m)$ can be chosen to obtain equality, and therefore the variational lower bound is

$$F_V(X_m, \phi) = \log\int \prod_j\left[\mathcal{N}(y_j \mid \alpha_j, \hat{C}_{jj})\right]\Pr(f_m)\, df_m - \frac{1}{2}\sum_{j=1}^M \operatorname{Tr}\left[(C_{jj} - Q_{jj})\hat{C}_{jj}^{-1}\right].$$

Evaluating the integral by marginalizing out $f_m$, and recalling that $\breve{y}$ is the concatenation of the $y_j$, we get

$$F_V(X_m, \theta, \tilde{\theta}) = \log\left[\mathcal{N}\!\left(\breve{y} \mid 0,\ \Lambda_m C_{mm}^{-1}\Lambda_m^T + \hat{C}_m\right)\right] - \sum_{j=1}^M \frac{1}{2}\operatorname{Tr}\left[(C_{jj} - Q_{jj})\hat{C}_{jj}^{-1}\right] \tag{10}$$

where

$$\Lambda_m = \begin{pmatrix} C_{1m} \\ C_{2m} \\ \vdots \\ C_{Mm} \end{pmatrix} \qquad\text{and}\qquad \hat{C}_m = \bigoplus_{j=1}^M \hat{C}_{jj} = \begin{pmatrix} \hat{C}_{11} & & & \\ & \hat{C}_{22} & & \\ & & \ddots & \\ & & & \hat{C}_{MM} \end{pmatrix}.$$

Thus, we have explicitly written the parameters that can be chosen to further optimize the lower bound, namely the support inputs $X_m$, and the hyper-parameters $\theta$ and $\tilde{\theta}$ of $K$ and $\tilde{K}$ respectively. By calculating derivatives of (10) we can optimize the lower bound using a gradient-based method. In the experiments in this paper, we use stochastic gradient descent (SGD), which works better than conjugate gradient (CG) in this scenario where the number of tasks is large.

Titsias (2009) outlines methods that can be used when gradients are not useful.
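To make the bound concrete, the following sketch evaluates $F_V$ of (10) directly. For clarity it forms the full $N \times N$ matrix; a practical implementation would apply the matrix inversion lemma so that only $m \times m$ systems are solved. Kernels, length-scales and the noise level are illustrative assumptions:

```python
import numpy as np

def rbf(a, b, scale=1.0, ell=1.0):
    return scale * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def variational_bound(Xs, ys, Xm, sigma2=0.1):
    """F_V of eq. (10): log N(y | 0, Lam Cmm^{-1} Lam^T + Chat) minus trace penalty.
    Xs, ys are per-task input/target arrays; Xm holds the inducing inputs."""
    Cmm = rbf(Xm, Xm, ell=0.3) + 1e-8 * np.eye(len(Xm))
    x_all, y_all = np.concatenate(Xs), np.concatenate(ys)
    N = len(y_all)
    Lam = rbf(x_all, Xm, ell=0.3)                  # stacked blocks C_jm
    Chat = np.zeros((N, N))                        # block-diagonal hat-C_m
    offset = 0
    for x in Xs:
        n = len(x)
        Chat[offset:offset + n, offset:offset + n] = \
            rbf(x, x, scale=0.3, ell=0.15) + sigma2 * np.eye(n)
        offset += n
    S = Lam @ np.linalg.solve(Cmm, Lam.T) + Chat   # approximate marginal covariance
    _, logdet = np.linalg.slogdet(S)
    ll = -0.5 * (y_all @ np.linalg.solve(S, y_all) + logdet + N * np.log(2 * np.pi))
    trace = 0.0                                    # sum_j Tr[(C_jj - Q_jj) hatC_jj^{-1}]
    offset = 0
    for x in Xs:
        n = len(x)
        Cjm = Lam[offset:offset + n]
        Qjj = Cjm @ np.linalg.solve(Cmm, Cjm.T)
        B = Chat[offset:offset + n, offset:offset + n]
        trace += np.trace(np.linalg.solve(B, rbf(x, x, ell=0.3) - Qjj))
        offset += n
    return ll - 0.5 * trace
```

Since $F_V$ lower-bounds the exact marginal likelihood (4), its value can be sanity-checked against the dense computation on a small problem.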

#### 3.1.1 Evaluating $\log G(f_m, Y)$

Consider the $j$-th element in the sum of (7):

$$\begin{aligned} \hat{G}_j(f_j, y_j) &= \int \Pr(\tilde{f}_j \mid f_j, y_j)\Pr(f_j \mid f_m)\log\left[\frac{\Pr(y_j \mid f_j, \tilde{f}_j)\Pr(\tilde{f}_j)}{\Pr(\tilde{f}_j \mid f_j, y_j)}\right] df_j\, d\tilde{f}_j \\ &= \int \Pr(\tilde{f}_j \mid f_j, y_j)\Pr(f_j \mid f_m)\log\left[\frac{\Pr(\tilde{f}_j \mid y_j, f_j)\Pr(y_j \mid f_j)}{\Pr(\tilde{f}_j \mid f_j)}\cdot\frac{\Pr(\tilde{f}_j)}{\Pr(\tilde{f}_j \mid f_j, y_j)}\right] df_j\, d\tilde{f}_j \\ &= \int \Pr(f_j \mid f_m)\log\left[\Pr(y_j \mid f_j)\right]\left(\int \Pr(\tilde{f}_j \mid f_j, y_j)\, d\tilde{f}_j\right) df_j \\ &= \int \Pr(f_j \mid f_m)\log\left[\Pr(y_j \mid f_j)\right] df_j = \mathbb{E}_{f_j \mid f_m}\log\left[\Pr(y_j \mid f_j)\right] \end{aligned}$$

where the third line holds because of the independence between $\tilde{f}_j$ and $f_j$. We next show how this expectation can be evaluated. This is more complex than the single-task case because of the coupling of the fixed effect and the random effect.

Recall that

$$\Pr(f_j \mid f_m) = \mathcal{N}\!\left(f_j \mid C_{jm}C_{mm}^{-1}f_m,\ C_{jj} - C_{jm}C_{mm}^{-1}C_{mj}\right)$$

and

$$y_j \mid f_j \sim \mathcal{N}(f_j, \hat{C}_{jj})$$

where $\hat{C}_{jj} = \tilde{C}_{jj} + \sigma_j^2 I$. Denote $\hat{C}_{jj}^{-1} = L^T L$, where $L$ can be chosen as its Cholesky factor.

Notice that

$$\Pr(Lf_j \mid f_m) = \mathcal{N}\!\left(LC_{jm}C_{mm}^{-1}f_m,\ L(C_{jj} - Q_{jj})L^T\right)$$

where $Q_{jj} = C_{jm}C_{mm}^{-1}C_{mj}$. Recall the fact that for $\epsilon \sim \mathcal{N}(\mu, \Sigma)$ and a constant vector $a$, we have $\mathbb{E}\left[\|\epsilon - a\|^2\right] = \|\mu - a\|^2 + \operatorname{Tr}(\Sigma)$. Thus,

$$\begin{aligned} \mathbb{E}_{f_j \mid f_m}\log\left[\Pr(y_j \mid f_j)\right] &= -\frac{1}{2}\left\|Ly_j - LC_{jm}C_{mm}^{-1}f_m\right\|^2 - \frac{1}{2}\operatorname{Tr}\!\left(L(C_{jj} - Q_{jj})L^T\right) + \log\left[(2\pi)^{-\frac{N_j}{2}}\right] + \log\left[|\hat{C}_{jj}|^{-\frac{1}{2}}\right] \\ &= \left\{-\frac{1}{2}\left[y_j - C_{jm}C_{mm}^{-1}f_m\right]^T(L^T L)\left[y_j - C_{jm}C_{mm}^{-1}f_m\right] + \log\left[(2\pi)^{-\frac{N_j}{2}}\right] + \log\left[|\hat{C}_{jj}|^{-\frac{1}{2}}\right]\right\} - \frac{1}{2}\operatorname{Tr}\!\left[L(C_{jj} - Q_{jj})L^T\right] \\ &= \log\left[\mathcal{N}(y_j \mid \alpha_j, \hat{C}_{jj})\right] - \frac{1}{2}\operatorname{Tr}\!\left[(C_{jj} - Q_{jj})\hat{C}_{jj}^{-1}\right] \end{aligned}$$

where $\alpha_j = C_{jm}C_{mm}^{-1}f_m$. Finally, summing $\hat{G}_j(f_j, y_j)$ over the tasks we get (8).

#### 3.1.2 Variational distribution $\phi^*(f_m)$

For equality to hold in Jensen's inequality, the function inside the $\log$ must be constant. In our case this is easily achieved because $\phi(f_m)$ is a free parameter, and we can set

$$\frac{\prod_j\left[\mathcal{N}(y_j \mid \alpha_j, \hat{C}_{jj})\right]\Pr(f_m)}{\phi(f_m)} \equiv c,$$

yielding the bound given in (10). Solving for $\phi(f_m)$ yields the form of the optimal variational distribution

$$\phi^*(f_m) \propto \prod_j\left[\mathcal{N}(y_j \mid \alpha_j, \hat{C}_{jj})\right]\Pr(f_m) \propto \exp\left\{-\frac{1}{2}f_m^T\left[C_{mm}^{-1}\Phi C_{mm}^{-1}\right]f_m + f_m^T\left(C_{mm}^{-1}\sum_j C_{mj}\hat{C}_{jj}^{-1}y_j\right)\right\},$$

from which we observe that $\phi^*(f_m)$ is

$$\mathcal{N}\!\left(f_m \,\middle|\, C_{mm}\Phi^{-1}\sum_j C_{mj}\hat{C}_{jj}^{-1}y_j,\ C_{mm}\Phi^{-1}C_{mm}\right) \tag{11}$$

where $\Phi = C_{mm} + \sum_j C_{mj}\hat{C}_{jj}^{-1}C_{jm}$. Notice that by choosing the number of tasks to be 1 and the random effect to be a noise process, i.e. $\hat{C}_{jj} = \sigma^2 I$, (10) and (11) are exactly the variational lower bound and the corresponding variational distribution in (Titsias, 2009).
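In code, the moments of $\phi^*(f_m)$ in (11) can be accumulated task by task; the kernels and hyper-parameters below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rbf(a, b, scale=1.0, ell=0.3):
    return scale * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def phi_star(Xs, ys, Xm, sigma2=0.1):
    """Mean mu and covariance A of the optimal phi*(f_m), eq. (11)."""
    Cmm = rbf(Xm, Xm) + 1e-8 * np.eye(len(Xm))
    Phi = Cmm.copy()                       # Phi = Cmm + sum_j Cmj hatCjj^{-1} Cjm
    rhs = np.zeros(len(Xm))
    for x, y in zip(Xs, ys):
        Cmj = rbf(Xm, x)
        Chat = rbf(x, x, scale=0.3, ell=0.15) + sigma2 * np.eye(len(x))
        Phi += Cmj @ np.linalg.solve(Chat, Cmj.T)
        rhs += Cmj @ np.linalg.solve(Chat, y)
    mu = Cmm @ np.linalg.solve(Phi, rhs)   # posterior mean over inducing values
    A = Cmm @ np.linalg.solve(Phi, Cmm)    # posterior covariance
    return mu, A
```

Since $\Phi \succeq C_{mm}$, the covariance $A = C_{mm}\Phi^{-1}C_{mm}$ never exceeds the prior covariance $C_{mm}$, as expected of a posterior.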

### 3.2 Prediction using the Variational Solution

Given any task $j$, our goal is to calculate the predictive distribution of $f_j$ at some new input point $x_*$. As described before, the full inference is expensive and therefore we wish to use the variational approximation for the prediction as well. The key assumption is that $f_m$ contains as much information as $\breve{y}$ in terms of making predictions for $\bar{f}$. To start with, it is easy to see that the predictive distribution is Gaussian and that it satisfies

$$\begin{aligned} \mathbb{E}[f_j(x_*) \mid \breve{y}] &= \mathbb{E}[\bar{f}(x_*) \mid \breve{y}] + \mathbb{E}[\tilde{f}_j(x_*) \mid \breve{y}] \\ \operatorname{Var}[f_j(x_*) \mid \breve{y}] &= \operatorname{Var}[\bar{f}(x_*) \mid \breve{y}] + \operatorname{Var}[\tilde{f}_j(x_*) \mid \breve{y}] + 2\operatorname{Cov}[\bar{f}(x_*), \tilde{f}_j(x_*) \mid \breve{y}]. \end{aligned} \tag{12}$$

The above equation is more complex than the predictive distribution for single-task sparse GP because of the coupling between $\bar{f}$ and $\tilde{f}_j$ induced by conditioning on $\breve{y}$. We next show how this can be calculated via conditioning.

To calculate the terms in (12), three parts are needed, i.e., $\Pr(\bar{f}(x_*) \mid \breve{y})$, $\Pr(\tilde{f}_j(x_*) \mid \breve{y})$ and $\operatorname{Cov}[\bar{f}(x_*), \tilde{f}_j(x_*) \mid \breve{y}]$. Using the assumption of the variational form given in (6), we have the following facts.

1. $\phi^*(f_m) = \mathcal{N}(f_m \mid \mu, A)$, where $\mu$ and $A$ are given in (11).

2. $f_m$ is sufficient for $\{f_j\}$, i.e. $\Pr(\{f_j\} \mid f_m, \breve{y}) = \Pr(\{f_j\} \mid f_m)$. Since we are interested in prediction for each task separately, by marginalizing out $\{f_l\}_{l \neq j}$, we also have $\Pr(f_j \mid f_m, \breve{y}) = \Pr(f_j \mid f_m)$ and

$$f_j \mid f_m, \breve{y} \sim \mathcal{N}\!\left(C_{jm}C_{mm}^{-1}f_m,\ C_{jj} - C_{jm}C_{mm}^{-1}C_{mj}\right). \tag{13}$$
3. For $\tilde{f}_j(x_*)$ we can view $y_j - f_j$ as noisy realizations of the same GP as $\tilde{f}_j$ and therefore

$$\tilde{f}_j(x_*) \mid f_j, \breve{y} \sim \mathcal{N}\!\left(\tilde{C}_{*j}\left[\tilde{C}_{jj} + \sigma_j^2 I_j\right]^{-1}\left[y_j - f_j\right],\ \tilde{C}_{**} - \tilde{C}_{*j}\left[\tilde{C}_{jj} + \sigma_j^2 I_j\right]^{-1}\tilde{C}_{j*}\right). \tag{14}$$

In order to obtain a sparse form of the predictive distribution we need to make an additional assumption.

###### Assumption 2

We assume that $f_m$ is sufficient for $\bar{f}(x_*)$, i.e.,

$$\Pr(\bar{f}(x_*) \mid f_m, \breve{y}) = \Pr(\bar{f}(x_*) \mid f_m),$$

implying that

$$\bar{f}(x_*) \mid f_m, \breve{y} \sim \mathcal{N}\!\left(C_{*m}C_{mm}^{-1}f_m,\ C_{**} - C_{*m}C_{mm}^{-1}C_{m*}\right). \tag{15}$$

The above set of conditional distributions also implies that $\bar{f}(x_*)$ and $\tilde{f}_j(x_*)$ are independent given $f_m$ and $\breve{y}$.

To evaluate (12), we have the following

• We can easily get $\Pr(\bar{f}(x_*) \mid \breve{y})$ by marginalizing out $f_m$ in (15),

$$\Pr(\bar{f}(x_*) \mid \breve{y}) = \int \Pr(\bar{f}(x_*) \mid f_m)\,\phi^*(f_m)\, df_m$$

yielding

$$\bar{f}(x_*) \mid \breve{y} \sim \mathcal{N}\!\left(C_{*m}C_{mm}^{-1}\mu,\ C_{**} - C_{*m}C_{mm}^{-1}C_{m*} + C_{*m}C_{mm}^{-1}AC_{mm}^{-1}C_{m*}\right). \tag{16}$$
• Similarly, we can obtain $\Pr(\tilde{f}_j(x_*) \mid \breve{y})$ by first calculating $\Pr(f_j \mid \breve{y})$, marginalizing out $f_m$ in (13), and then marginalizing out $f_j$ in (14), as follows. First we have $f_j \mid \breve{y} \sim \mathcal{N}(C_{jm}C_{mm}^{-1}\mu, B)$ where

$$B = C_{jj} - C_{jm}C_{mm}^{-1}C_{mj} + C_{jm}C_{mm}^{-1}AC_{mm}^{-1}C_{mj}.$$

Next, for $\tilde{f}_j(x_*)$, we have

$$\Pr(\tilde{f}_j(x_*) \mid \breve{y}) = \int \Pr(\tilde{f}_j(x_*) \mid f_j, y_j)\Pr(f_j \mid \breve{y})\, df_j$$

and, marginalizing out $f_j$, the distribution $\Pr(\tilde{f}_j(x_*) \mid \breve{y})$ can be obtained as

$$\tilde{f}_j(x_*) \mid \breve{y} \sim \mathcal{N}\!\Big(\tilde{C}_{*j}\big[\tilde{C}_{jj} + \sigma_j^2 I_j\big]^{-1}\big(y_j - C_{jm}C_{mm}^{-1}\mu\big),\ \tilde{C}_{**} - \tilde{C}_{*j}\big[\tilde{C}_{jj} + \sigma_j^2 I_j\big]^{-1}\tilde{C}_{j*} + \tilde{C}_{*j}\big[\tilde{C}_{jj} + \sigma_j^2 I_j\big]^{-1}B\big[\tilde{C}_{jj} + \sigma_j^2 I_j\big]^{-1}\tilde{C}_{j*}\Big). \tag{17}$$
• Finally, to calculate $\operatorname{Cov}[\bar{f}(x_*), \tilde{f}_j(x_*) \mid \breve{y}]$ we use

$$\operatorname{Cov}[\bar{f}(x_*), \tilde{f}_j(x_*) \mid \breve{y}] = \mathbb{E}[\bar{f}(x_*)\cdot\tilde{f}_j(x_*) \mid \breve{y}] - \mathbb{E}[\bar{f}(x_*) \mid \breve{y}]\cdot\mathbb{E}[\tilde{f}_j(x_*) \mid \breve{y}]$$

where

$$\mathbb{E}[\bar{f}(x_*)\cdot\tilde{f}_j(x_*) \mid \breve{y}] = \mathbb{E}_{f_m \mid \breve{y}}\,\mathbb{E}[\bar{f}(x_*)\cdot\tilde{f}_j(x_*) \mid f_m, \breve{y}] = \mathbb{E}_{f_m \mid \breve{y}}\left[\mathbb{E}[\bar{f}(x_*) \mid f_m]\cdot\mathbb{E}[\tilde{f}_j(x_*) \mid f_m, y_j]\right] \tag{18}$$

where the second line holds because, as observed above, the terms are conditionally independent. The first term can be obtained directly from (15). By marginalizing out $f_j$ in (14) such that

$$\Pr(\tilde{f}_j(x_*) \mid f_m, y_j) = \int \Pr(\tilde{f}_j(x_*) \mid f_j, \breve{y})\Pr(f_j \mid f_m)\, df_j,$$

we can get the second term. This yields

$$\mathbb{E}[\tilde{f}_j(x_*) \mid f_m, y_j] = \tilde{C}_{*j}\big[\tilde{C}_{jj} + \sigma_j^2 I_j\big]^{-1}\left(y_j - \alpha_j\right) \tag{19}$$

where $\alpha_j = C_{jm}C_{mm}^{-1}f_m$ as before. To simplify the notation, let $H = C_{*m}C_{mm}^{-1}$, $F = \tilde{C}_{*j}\big[\tilde{C}_{jj} + \sigma_j^2 I_j\big]^{-1}$, and $G = C_{jm}C_{mm}^{-1}$. Then (18) can be evaluated as

$$Fy_j\cdot H\,\mathbb{E}[f_m \mid \breve{y}] - FG\left(\mathbb{E}[f_m f_m^T \mid \breve{y}]\right)H^T = Fy_j\cdot H\mu - FG\left[A + \mu\mu^T\right]H^T.$$

We have therefore shown how to calculate the predictive distribution in (12). The complexity of these computations is $O(Nm^2)$, which is a significant improvement over the $O(N^3)$ of the full solution, where $N = \sum_j N_j \gg m$.
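As a sanity check on the derivation, when the inducing set is taken to be the union of all training inputs, $f_m$ really is sufficient and the sparse predictive mean must coincide with the full-model mean of (5). The sketch below implements the mean computation via (11), (16) and (17); an Ornstein-Uhlenbeck kernel is used purely for numerical stability, and all hyper-parameters are illustrative assumptions:

```python
import numpy as np

def ou(a, b, scale=1.0, ell=0.5):
    """Ornstein-Uhlenbeck kernel, chosen for its well-conditioned Gram matrices."""
    return scale * np.exp(-np.abs(a[:, None] - b[None, :]) / ell)

def K(a, b):   # fixed-effect kernel (assumed)
    return ou(a, b, 1.0, 0.5)

def Kt(a, b):  # random-effect kernel (assumed)
    return ou(a, b, 0.3, 0.2)

def sparse_predict_mean(Xs, ys, Xm, x_star, j, sigma2=0.1):
    """Sparse predictive mean for task j at x_star via eqs. (11), (16), (17)."""
    Cmm = K(Xm, Xm) + 1e-10 * np.eye(len(Xm))
    Phi, rhs = Cmm.copy(), np.zeros(len(Xm))
    for x, y in zip(Xs, ys):
        Cmj, Chat = K(Xm, x), Kt(x, x) + sigma2 * np.eye(len(x))
        Phi += Cmj @ np.linalg.solve(Chat, Cmj.T)
        rhs += Cmj @ np.linalg.solve(Chat, y)
    mu = Cmm @ np.linalg.solve(Phi, rhs)                      # eq. (11)
    mean_bar = K(x_star, Xm) @ np.linalg.solve(Cmm, mu)       # eq. (16)
    xj, yj = Xs[j], ys[j]
    resid = yj - K(xj, Xm) @ np.linalg.solve(Cmm, mu)
    Chat_j = Kt(xj, xj) + sigma2 * np.eye(len(xj))
    mean_tilde = Kt(x_star, xj) @ np.linalg.solve(Chat_j, resid)  # eq. (17)
    return mean_bar + mean_tilde
```

With fewer inducing points the same routine gives the cheap approximate prediction; the point of the check is only that nothing is lost in the limit.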

## 4 Sparse Grouped mixed-effect GP Model (GMT-GP)

In this section, we extend the mixed-effect GP model so that the fixed-effect functions admit a group structure. We call this the grouped mixed-effect GP model (GMT-GP). More precisely, the fixed effect of each task is drawn from a mixture of $k$ shared fixed-effect GPs, to which the task adds its individual variation. We show how to perform the inference and model selection efficiently.

### 4.1 Generative Model

First, we specify the sparse GMT-GP model, and show how we can learn the hyper-parameters and the inducing variables using this sparse model.

###### Assumption 3

For each input $x$ and each task $j$,

$$f_j(x) = \bar{f}_{z_j}(x) + \tilde{f}_j(x), \qquad j = 1, \dots, M$$

where the $\bar{f}_i$, $i = 1, \dots, k$, and the $\tilde{f}_j$ are zero-mean Gaussian processes with covariance functions $K$ and $\tilde{K}$ respectively, and $z_j \in \{1, \dots, k\}$ indicates the group membership of the $j$-th task. In addition, the $\{\bar{f}_i\}$ and $\{\tilde{f}_j\}$ are assumed to be mutually independent.

The generative process (shown in Fig. 1) is as follows, where Dir and Multi denote the Dirichlet and the multinomial distribution respectively.

1. Draw the processes of the mean effect: $\bar{f}_i \sim \mathcal{GP}(0, K)$ for $i = 1, \dots, k$;

2. Draw the mixture weights $\pi \sim \mathrm{Dir}(\alpha)$;

3. For the $j$-th task (time series):

• Draw the group indicator $z_j \sim \mathrm{Multi}(\pi)$;

• Draw the random effect: $\tilde{f}_j \sim \mathcal{GP}(0, \tilde{K})$;

• Draw $y_j \sim \mathcal{N}(\bar{f}_{z_j}(x_j) + \tilde{f}_j(x_j), \sigma_j^2 I)$, where $x_j$ denotes the inputs of the $j$-th task and, to simplify the notation, $\bar{f}_{z_j}(x_j)$ stands for the vector of values of $\bar{f}_{z_j}$ at the points $x_j$.
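The generative process above can be simulated directly; the grid, kernels, and hyper-parameter values below are arbitrary illustrations, not choices made in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, scale=1.0, ell=0.3):
    return scale * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

k, M, n_j, sigma2 = 2, 6, 8, 0.05       # groups, tasks, points per task, noise
grid = np.linspace(0.0, 1.0, 50)
Kg = rbf(grid, grid) + 1e-8 * np.eye(50)
# step 1: draw the k fixed-effect (mean) processes on a fine grid
means = rng.multivariate_normal(np.zeros(50), Kg, size=k)
# step 2: draw the mixture weights pi ~ Dir(alpha)
pi = rng.dirichlet(np.ones(k))
tasks = []
for j in range(M):
    z = rng.choice(k, p=pi)             # group indicator z_j ~ Multi(pi)
    f_tilde = rng.multivariate_normal(  # random effect ~ GP(0, K-tilde)
        np.zeros(50), rbf(grid, grid, scale=0.3, ell=0.15) + 1e-8 * np.eye(50))
    idx = rng.choice(50, size=n_j, replace=False)   # few observations per task
    y = means[z, idx] + f_tilde[idx] + np.sqrt(sigma2) * rng.normal(size=n_j)
    tasks.append((grid[idx], y, z))
```

Sampling the GPs on a shared fine grid keeps the demo simple; the model itself does not require shared sampling points.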

### 4.2 Variational Model Selection

In this section we show how to perform the learning via a variational approximation. The derivation follows the same outline as in the previous section, but due to the hidden variables that specify group membership we have to use the variational EM algorithm. As mentioned above, for the $i$-th mixed effect (or center), we introduce auxiliary inducing support variables $X_m^i$ and the hidden variable $f_m^i$, which is the value of the $i$-th fixed-effect function evaluated at $X_m^i$.

Let denote the function values of the -th mean effect so that is the sub-vector of corresponding to the -th task. Let be the values of the random effect at . Denote the collection of the hidden variables as , , and . In addition let , and , and similarly