Infinite Shift-invariant Grouped Multi-task Learning for Gaussian Processes

03/05/2012 ∙ by Yuyang Wang, et al. ∙ Tufts University Harvard University 0

Multi-task learning leverages shared information among data sets to improve the learning performance of individual tasks. The paper applies this framework for data where each task is a phase-shifted periodic time series. In particular, we develop a novel Bayesian nonparametric model capturing a mixture of Gaussian processes where each task is a sum of a group-specific function and a component capturing individual variation, in addition to each task being phase shifted. We develop an efficient em algorithm to learn the parameters of the model. As a special case we obtain the Gaussian mixture model and em algorithm for phased-shifted periodic time series. Furthermore, we extend the proposed model by using a Dirichlet Process prior and thereby leading to an infinite mixture model that is capable of doing automatic model selection. A Variational Bayesian approach is developed for inference in this model. Experiments in regression, classification and class discovery demonstrate the performance of the proposed models using both synthetic data and real-world time series data from astrophysics. Our methods are particularly useful when the time series are sparsely and non-synchronously sampled.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In many real world problems we are interested in learning multiple tasks while the training set for each task is quite small. For example, in pharmacological studies, we may be attempting to predict the concentration of some drug at different times across multiple patients. Finding a good regression function of an individual patient based only on his or her measurements can be difficult due to insufficient training points for the patient. Instead, by using measurements across all the patients, we may be able to leverage common patterns across patients to obtain better estimates for the population and for each patient individually. Multi-task learning captures this intuition aiming to learn multiple correlated tasks simultaneously. This idea has attracted much interest in the literature and several approaches have been applied to a wide range of domains including medical diagnosis 

(Bi et al., 2008), recommendation systems (Dinuzzo et al., 2008) and HIV Therapy Screening (Bickel et al., 2008). Building on the theoretical framework for single-task learning, multi-task learning has recently been formulated by Evgeniou et al. (2006)

as a multi-task regularization problem in vector-valued Reproducing Kernel Hilbert space.

Several approaches formalizing multi-task learning exist within Bayesian statistics. Considering hierarchical Bayesian models 

(Xue et al., 2007; Gelman, 2004), one can view the parameter sharing of the prior among tasks as a form of multi-task learning where evidence from all tasks is used to infer the parameters. Over the past few years, Bayesian models for multi-task learning were formalized using Gaussian processes (Yu et al., 2005; Schwaighofer et al., 2005; Pillonetto et al., 2010). In this mixed-effect model, information is shared among tasks by having each task combine a common (fixed effect) portion and a task specific portion, each of which is generated by an independent Gaussian process.

Our work builds on this formulation extending it and the associated algorithms in several ways. In particular, we extend the model to include three new aspects. First, we allow the fixed effect to be multi-modal so that each task may draw its fixed effect from a different cluster. Second, we extend the model so that each task may be an arbitrarily phase-shifted image of the original time series. This yields our GMT model: the shift-invariant grouped mixed-effect model. Alternatively, our model can be viewed as a probabilistic extension of the Phased K-means algorithm of 

Rebbapragada et al. (2009) that performs clustering for phase-shifted time series data and as a non-parametric Bayesian extension of mixtures of random effects regressions for curve clustering (Gaffney and Smyth, 2003). Finally, unlike the existing models that require the model order to be set a priori, our extension in the DP-GMT model uses a Dirichlet process prior on the mixture proportions so that the number of mixture components is adaptively determined by the data rather than being fixed explicitly.

Our main technical contribution is the inference algorithm for the proposed model. We develop details for the em algorithm for the GMT model and a Variational em for DP-GMT optimizing the maximum a posteriori (MAP) estimates for the parameters of the models. Technically, the main insights are in estimating the expectation for the coupled hidden variables (the cluster identities and the task specific portion of the time series) and in solving the regularized least squares problem for a set of phase-shifted observations. In addition, for the DP-GMT, we show that the variational em algorithm can be implemented with the same complexity as the fixed order GMT without using sampling. Thus the DP-GMT provides an efficient model selection algorithm compared to alternatives such as BIC. As a special case our algorithm yields the (Infinite) Gaussian mixture model for phase shifted time series, which may be of independent interest, and which is a generalization of the algorithms of Rebbapragada et al. (2009) and Gaffney and Smyth (2003).

Our model primarily captures regression of time series but because it is a generative model it can be used for class discovery, clustering, and classification. We demonstrate the utility of the model using several experiments with both synthetic data and real-world time series data from astrophysics. The experiments show that our model can yield superior results when compared to the single-task learning and Gaussian mixture models, especially when each individual task is sparsely and non-synchronously sampled. The DP-GMT model yields results that are competitive with model selection using BIC over the GMT model, at much reduced computational cost.

The remainder of the paper is organized as follows. Section 2 provides an introduction to the multi-task learning problem and its Bayesian interpretation and develops the main assumptions of our model. Section 3 defines the new generative model, Section 4 develops the em algorithm for it, and the infinite mixture extension is addressed in Section 5. The experimental results are reported in Section 6. Related work is discussed in Section 7 and the final section concludes with a discussion and outlines ideas for future work.

2 Preliminaries

Throughout the paper, scalars are denoted using italics, as in ; vectors use bold typeface, as in , and denotes the th entry of . For a vector and real valued function , we extend the notation for to vectors so that where the superscript stands for transposition (and the result is a column vector). denotes a kernel function associated to some reproducing kernel Hilbert space (RKHS) and its norm is denoted as . To keep the notation simple, is substituted by where the index is not confusing.

2.1 Multi-task learning with kernel

Given training set , where , single-task learning focuses on finding a function that best fits and generalizes the observed data. In the regularization framework, learning amounts to solving the following variational problem (Evgeniou et al., 2000; Scholkopf and Smola, 2002)



is some (typically convex) loss function. The norm

relates to regularity condition on the function where a large norm penalizes non-smooth functions. The regularization parameter provides a tradeoff between the loss term and the complexity of the function.

Consider a set of tasks, with th task . Multi-task learning seeks to find for each task simultaneously, which, assuming square loss function, can be formulated as the following regularization problem


where the penalty term, applying jointly to all the tasks, encodes our prior information on how smooth the functions are, as well as how these tasks are correlated with each other. For example, setting the penalty term to implies that there is no correlation among the tasks. It further decomposes the optimization functional to separate single-task learning problems. On the other hand, with a shared penalty, the joint regularization can lead to improved performance. Moreover, we can use a norm in RKHS with a multi-task kernel to incorporate the penalty term (Micchelli and Pontil, 2005). Formally, consider a vector-valued function defined as . Then Equation (2) can be written as


where is the norm in RKHS with the multi-task kernel , where . As shown by Evgeniou et al. (2006), the representer theorem gives the form of the solution to Equation (3)


with norm

Let and , then the coefficients are given by the following linear system


where is the kernel matrix formed by .

2.2 Bayesian formulation

A Gaussian process is a functional extension for Multivariate Gaussian distributions. In the Bayesian literature, it has been widely used in statistical models by substituting a parametric latent function with a stochastic process with a Gaussian prior  

(Rasmussen, 2006). More precisely, under the single-task setting a simple Gaussian regression model is given by

where ’s prior is a zero mean Gaussian process with covariance function and

is independent zero mean white noise with variance

. Given data set , let , then and the posterior on is given by

The predictive distribution for some test point distinct from the training examples is

where . Furthermore, under square loss function, the optimizer of Equation (1) is equal to the expectation of the predictive distribution (Rasmussen, 2006). Finally, a Gaussian process corresponds to a RKHS with kernel such that


In this way, we can express a prior on functions using a zero mean Gaussian process (Lu et al., 2008)111

In general, a Gaussian process can not be thought of as a distribution on the RKHS, because with probability 1, one can find a Gaussian process such that its sample path does not belong to the RKHS. However, the equivalence holds between the RKHS and the expectation of a Gaussian process conditioned on a finite number of observations. For more details on the relationship between RKHS and Gaussian processes we refer interested reader to  

Seeger (2004).


Applying this framework in the context of multi-task learning, the model is given by

where are zero mean Gaussian processes and captures i.i.d. zero-mean noise with variance . Pillonetto et al. (2010) formalize the connection between multi-task kernel and covariance function among using


2.3 Basic Model Assumptions

Given data , the so-called nonparametric Bayesian mixed-effect model (Lu et al., 2008; Pillonetto et al., 2010) captures each task with respect to using a sum of an average effect function and an individual variation for each specific task,

This assumes that the fixed-effect (mean function) is sufficient to capture the behavior of the data, an assumption that is problematic for distributions with several modes. To address this, we introduce a mixture model allowing for multiple modes (just like standard Gaussian mixture model (GMM)), but maintaining the formulation using Gaussian processes. This amounts to adding a group effect structure and leads to the following assumption:

Assumption 1

For each and ,


where and are zero-mean Gaussian processes and . In addition, and are assumed to be mutually independent.

With the grouped-effect model and groups predefined, one can define a kernel that relates (with non zero similarity) only points from the same example or points for different examples but the same center as follows


However, in our work the groups are not known in advance and we cannot use this formulation. Instead we use a single kernel to relate all tasks.

The second extension allows us to handle phase shifted time series. In some applications, we face the challenge of learning a periodic function on a single period from samples , where similar functions in a group differ only in their phase. In the following assumption, the model of primary focus in this paper is presented, which extends the mixed-effect model to capture both shift-invariance and the clustering property.

Assumption 2

For each and ,


where , and are zero-mean Gaussian processes, stands for circular convolution and is the Dirac function with support at .222 Given a periodic function with period , its circular convolution with another function is defined as

where is arbitrary in and is also a periodic function with period . Using the definition we see that,
and thus performs a right shift of or in other words performs a phase shift of on . In addition, are assumed to be mutually independent.

3 Shift-invariant Grouped mixed-effect model

In Assumption 1, if we know the cluster assignment of each task, then the model decomposes to mixed-effect models which is the case investigated in (Pillonetto et al., 2010; Evgeniou et al., 2006). Similar results can be obtained for Assumption 2. However, prior knowledge of cluster membership is often not realistic. In this section, based on Assumption 2, a probabilistic generative model is formulated to capture the case of unknown clusters. We start by formally defining the generative model, which we call Shift-invariant Grouped mixed-effect Model (GMT). In this model, group effect functions are assumed to share the same Gaussian prior characterized by . The individual effect functions are Gaussian processes with covariance function . The model is shown in Figure 1 and it is characterized by parameter set and summarized as follows

  1. Draw

  2. For the th time series

    • Draw

    • Draw

    • Draw , where

Figure 1: GMT: Plate graph

where is the mixture proportion. Additionally, denote and , where are the time points when each time series is sampled and are the corresponding observations.

We assume that the group effect kernel and the number of centers are known. The assumption on is reasonable, in that normally we can get more information on the shape of the mean waveforms, thereby making it possible to design kernel for . On the other hand, the individual variations are more arbitrary and therefore is not assumed to be known. The assumption that is known requires some form of model selection. An extension using a non-parametric Bayesian model, the Dirichlet process (Teh, 2010), that does not limit is discussed in the section 5. The group effect , individual shifts , noise variance and the kernel for individual variations are unknown and need to be estimated. The cluster assignments and individual variation are treated as hidden variables. Note that one could treat too as hidden variables, but we prefer to get a concrete estimate for these variables because of their role as the mean waveforms in our model.

The model above is a standard model for regression. We propose to use it for classification by learning a mixture model for each class and using the Maximum A Posteriori (MAP) probability for the class for classification. In particular, consider a training set that has classes, where the th instance is given by . Each observation is given a label from . The problem is to learn the model for each class ( in total) separately and the classification rule for a new instance is given by


As we show in our experiments, the generative model can provide explanatory power for the application while giving excellent classification performance.

4 Parameter Estimation

Given data set , the learning process aims to find the MAP estimates of the parameter set


The direct optimization of Equation (12) is analytically intractable because of coupled sums that come from the mixture distribution. To solve this problem, we resort to the em algorithm (Dempster et al., 1977). The em algorithm is an iterative method for optimizing the maximum likelihood (ML) or MAP estimates of the parameters in the context of hidden variables. In our case, the hidden variables are (which is the same as in standard GMM), and . The algorithm iterates between the following expectation and maximization steps until it converges to a local maximum.

4.1 Expectation step

In the E-step, we calculate


where stands for estimated parameters from the last iteration. For our model, the difficulty comes from estimating the expectation with respect to the coupled latent variables . In the following, we show how this can be done. First notice that,

and further that


The first term in Equation (14) can be further written as


where is specified by the parameters estimated from last iteration. Since is given, the second term is the marginal distribution that can be calculated using a Gaussian process regression model. In particular, denoting we get

where is the kernel matrix for the th task using parameters from last iteration, i.e. . Therefore, the marginal distribution is


Next consider the second term in Equation (14). Given , we know that , i.e. there is no uncertainty about the identity of and therefore the calculation amounts to estimating the posterior distribution under standard Gaussian process regression, that is

and the conditional distribution is given by


where is the posterior mean


and is the posterior covariance of


Since Equation (15) is multinomial and is Normal in (17), the marginal distribution of is a Gaussian mixture distribution given by

To work out the concrete form of , denote iff . Then the complete data likelihood can be reformulated as

where we have used the fact that exactly one is 1 for each and included the last term inside the product over for convenience. Then Equation (13) can be written as

Denote the second term by . By a version of Fubini’s theorem (Stein and Shakarchi, 2005) we have


Now because the last term in Equation (20) does not include any , the equation can be further decomposed as




can be calculated from Equation (15) and (16) and can be viewed as a fractional label indicating how likely the th task is to belong to the th group. Recall that

is a normal distribution given by

and is a standard multivariate Gaussian distribution determined by its prior

Using these facts and Equation (21), can be re-formulated as


We next develop explicit closed forms for the remaining expectations. For the first, note that for and a constant vector ,

Therefore the expectation is


where is as in Equation (18) where we set explicitly.

For the second expectation we have

4.2 M-step

In this step, we aim to find


and use to update the model parameters. Using the results above this can be decomposed into three separate optimization problems as follows:

That is, can be estimated easily using its separate term, is only a function of and depends only on , and we have




The optimizations for and are described separately in the following two subsections.

4.2.1 Learning

To optimize Equation (26) we assume first that is given. In this case, optimizing decouples into sub-problems, finding th group effect and its corresponding shift . Denoting the residual , where , the problem becomes


Note that different have different dimensions and they are not assumed to be sampled at regular intervals. For further development, following  Pillonetto et al. (2010), it is useful to introduce the distinct vector whose component are the distinct elements of . For example if , then . For th task, let the binary matrix be such that

That is, extracts the values corresponding to the th task from the full vector. If are fixed, then the optimization in Equation (28) is standard and the representer theorem gives the form of the solution as


Denoting the kernel matrix as , and we get . To simplify the optimization we assume that can only take values in the discrete space , that is, , for some (e.g., a fixed finite fine grid), where we always choose . Therefore, we can write , where is . Accordingly, Equation (28) is reduced to


To solve this optimization, we follow a cyclic optimization approach where we alternate between steps of optimizing and respectively,

  • At step , optimize equation (30) with respect to given . Since is known, it follows immediately that Equation (30) decomposes into independent tasks, where for the th task we need to find such that is closest to under the Euclidean distance. A brute force search with time complexity yields the optimal solution. If the time series are synchronously sampled (i.e. ), this is equivalent to finding the shift corresponding the cross-correlation, defined as


    where and and refers to the vector right shifted by