A conditional one-output likelihood formulation for multitask Gaussian processes

06/05/2020 ∙ by Vanessa Gómez-Verdejo, et al. ∙ Universidad Carlos III de Madrid University of New Mexico 0

Multitask Gaussian processes (MTGP) are the Gaussian process (GP) framework's solution for multioutput regresion problems in which the T elements of the regressors cannot be considered conditionally independent given the observations. Standard MTGP models assume that there exist both a multitask covariance matrix as a function of an intertask matrix, and a noise covariance matrix. These matrices need to be approximated by a low rank simplification of order P in order to reduce the number of parameters to be learnt from T^2 to TP. Here we introduce a novel approach that simplifies the multitask learning by reducing it to a set of conditioned univariate GPs without the need for any low rank approximations, therefore completely eliminating the requirement to select an adequate value for hyperparameter P. At the same time, by extending this approach with either a hierarchical or approximate model, the proposed method is capable of recovering the multitask covariance and noise matrices after learning only 2T parameters, avoiding the validation of any model hyperparameter and reducing the overall complexity of the model as well as the risk of overfitting. Experimental results over synthetic and real problems confirm the advantages of this inference approach in its ability to accurately recover the original noise and signal matrices, as well as the achieved performance improvement in comparison to other state of art MTGP approaches. We have also integrated the model with standard GP toolboxes, showing that it is computationally competitive with other state of the art options.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 7

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Gaussian processes (GP) rasmussen2006

are considered state of the art in (non)linear regression since they are a natural way to implement a predictive posterior distribution that provides a predictive target mean together with a relevant measure of uncertainty in the form of a predictive covariance. While standard GPs were initially designed to handle scalar outputs, it is becoming more and more common to have to face multi-task (MT) or multidimensional output problems. While there are several approaches to formulate multioutput GPs

liu2018remarks

, here we focus on a multitask GP (MTGP) formulation that is able to jointly and symmetrically estimate a set of

tasks from a single observation.

The general MTGP formulation proposed in Bonilla2008 , and reformulated as a linear model of coregionalization (LMC) schmidt2003Bayesian ; fanshawe2012bivariate , assumes that the multitask covariance matrix is expressed as the Kroneker product of an intertask matrix and the input kernel matrix. To avoid the prohibitive cost of estimating a full rank matrix when is large, Bonilla2008 proposes the use of a low-rank approximation of order of , , so that the number of parameters to be learnt is reduced from to , with the drawback of requiring the adjustment of parameter .

To obtain computational savings in the above MTGP formulation, stegle2011efficient

proposes an efficient inversion of the MT covariance matrix by combining properties of the singular value decomposition (SVD) and the Kronecker product, reducing the computational cost from

to . Another interesting approach is proposed in Rakitsch2013 , where a noise covariance matrix is introduced to model noise dependencies together with the intertask relationships.

Additionally, several convolutional models lee2002 ; Boyle2005 ; Alvarez2008 ; alvarez2011computationally have emerged, establishing a more sophisticated formulation that is able to model blurred relationships between tasks by the generalization of the MT kernel matrix through a convolution. However, adequate usage requires a careful selection of the convolutional kernel in order to make the integral tractable, and the number of parameters must be limited to balance the model’s flexibility against its complexity to avoid overfitting issues. Furthermore, this complex design limits the model’s interpretability since the covariance matrices are not explicitly estimated. Efficient versions of these models alvarez2011computationally select inducing points to reduce the computational cost down to .

Inspired by the MTGP probabilistic model proposed in Rakitsch2013 , where both the intertask and noise covariance matrices are modelled, we introduce a novel learning methodology based on a decomposition of the likelihood function to turn the MT learning process into a set of conditional one-output GPs. This methodology, combined with a hierarchical extension of the conditioned GPs, allows us to introduce the following novelties and advantages: (1) We do not need to use a low rank approximations, avoiding the selection of hyperparameter . (2) We reduce the number of parameters to be learnt in the noise and signal covariance matrices from to , while recovering the full noise and intertask matrices. This reduces the the model’s complexity and, as experiments show, leads to an improvement in performance. (3) This conditioned model, combined with an efficient implementation, leads to a computational time of , comparable to the one of stegle2011efficient ; Rakitsch2013 when

. (4) Finally, this learning approach can be easily adapted to leverage efficient GP libraries, such as Pytorch

GPTorch , that can run on graphical processing units (GPU). In this case, the model also admits an embarrasingly parallel implementation over the tasks to get a computational cost per process of .

2 Introduction to the Multitask Gaussian Processes

Given the set of observations , and a transformation into a reproducing kernel Hilbert space ShaweTaylor04 , the general expression for the MT regression problem of estimating regressors or tasks, , from , results in the model

(1)

being a mixing matrix and representing the model noise. To complete this probabilistic model, the following inter-task noise distribution and weight prior are assumed

(2)

where

is a vectorization operator and

is the Kroneker product. Matrix models the covariances between the elements of and it is common for all the tasks. This considers that correlation between the noise elements of different tasks is represented through the noise covariance , and relationships between tasks are modelled with the intertask covariance .

Once these matrices are learnt, the predictive posterior for test sample is constructed as:

(3)

where vector contains the dot products between the test sample and the training dataset , and is the matrix of dot products between data, and .

So far, this formulation extends the model of Bonilla2008 and stegle2011efficient and is formally identical to Rakitsch2013 , which introduces the noise covariance. The underlying limitation of these works lies in the fact that the number of parameters to be optimized grows with , and therefore a rank- approximation of the form is used to model matrices and . This reduces the number of parameters to , but imposes the need of selecting a suitable value for parameter .

3 Parameter learning through conditional one-output likelihood for MTGPs

Here, we introduce a novel methodology based on a conditional one-output likelihood MTGP (Cool-MTGP) where the previous MTGP formulation is decomposed into a set of conditioned one-output GPs. This methodology reduces the number of parameters to be learnt to twice the number of tasks without assuming any low rank approximation and the adjustment of additional hyperparameters.

3.1 MT likelihood as a product of conditional one-output distributions

Let us consider a model whose output corresponding to input in the -th task is given by a linear combination of both the input data and the output of the previous tasks, , i.e.,

(4)

where is the noise process for task and the weight of each factorized task is split into two components: weight for the input data and weight for the previous tasks. This model is closely related to the original MTGP described in Section 2 since, given the values of and , one can recursively recover the original weights as:

(5)

We can now apply the chain rule of probability to the original joint MT likelihood to factorize it into a set of conditional probabilities, each associated to a conditional one-output GP:

(6)

where the likelihood for each of these conditioned GPs is given by:

(7)

And the prior over the input weight components is defined as:

(8)

where assumes a common prior for all tasks scaled by a task-dependent factor .

(a) Global model (b) Hierarchical extension
Figure 1: Graphical model for the conditional one-output likelihood MTGP model

This approach assumes that are latent variables modelled with a prior distribution, whereas previous task weights are defined as model parameters (see the graphical model in Figure1(a)); this way, for each task we generate a conditioned one-output likelihood GP (Cool-GP) with mean and covariance . This guarantees that the model remains Gaussian, allowing us to recover the original MTGP joint likelihood from the set of Cool-GP likelihoods, as defined in (7).

3.2 Parameter learning and model inference

In order to optimize the model in Figure 1(a), we need to learn the input prior factors , the noise covariances , the common kernel parameters and, additionally, the weights of the previous tasks . To reduce the number of parameters to be learnt we propose two approaches which are described in the following sections. A detailed derivation of both approaches and their corresponding algorithms is included in the Supplementary Material, alongside a software demo in notebook form to ensure the reproducibility of our results.

3.2.1 A hierarchical approach for Cool-MTGP learning

This first method introduces a two-step hierarchical model for each Cool-GP. In the first step we define a prior over ,

(9)

which enables the application of a first inference pass over it and obtain its MAP estimation as

(10)

Note that this process is equivalent to training a GP with zero mean and covariance matrix (as depicted in the model in Figure 1(b)); this way has the formal expression of a standard GP rasmussen2006 , where the input data kernel matrix is rescaled by factor plus a linear kernel matrix for the previous tasks outputs.

In the second step of the hierarchical model, we learn the remaining parameter values (, and ) by going back to the original cool-GP (Figure 1(a)) and substituting by its MAP estimation; the model parameters are then learnt by maximizing the joint likelihood (6) for and marginalizing the influence of , i.e,

(11)

where and parameters are optimized by gradient descent. Note that parameters and are particular of each GP, so they can be optimized independently, and only kernel parameters must be optimized jointly, and that each Cool-MTGP is a Gaussian Process with mean equal to and a covariance matrix . Finally, when the model parameters are learnt, we can also recover the MAP estimation of as:

(12)

3.2.2 An approximate approach for Cool-MTGP learning

In this second approach, instead of applying a two step learning algorithm and requiring the inversion of two kernel matrices, we achieve an approximation by performing joint inference over and . To accomplish this, we directly consider the priors of and in equations (8) and (9), so that each Cool-MTGP has zero mean and covariance . Then, to learn the model parameters we maximize the following marginalized joint likelihood

(13)

where the kernel matrix is the same one used to infer both and whose MAP estimations are given by

(14)

This approximate Cool-MTGP approach uses to infer the weights and learn the parameters, thus slashing the computational time in half. However, because we are now considering

as a latent random variable, the conditional likelihoods are no longer Gaussian and therefore we can’t ensure that the original MTGP formulation can be accurately recovered. For this reason, we call this approach

approximate. Nevertheless, as we shall show in the experimental section, the results for both the hierarchical and the approximate approaches are very similar.

3.2.3 Efficient implementation

A naive implementation of these two cool-MTGP versions result in a computational complexity of and , respectively. However, as we discuss in the Supplementary Material, by using SVD and rank-t update properties we can reduce this computational burden to , which results in a significant cost reduction in the common case where . Parallelization of the model is also straightforward since the learning stage for each Cool-GP can be carried out at the same time. Furthermore, in the case of the approximate Cool-MTG where all model parameters and variables for task t share the same GP model, any standard GP library (efficiently designed for single output GPs) can be leveraged to implement the model.

3.3 Recovering the multitask model

Despite the fact that intertask and noise covariance matrices and are not explicit in the conditioned model, they can be recovered from parameters , and the MAP estimation of the weights, and . The general MTGP formulation considers that the joint MT likelihood, which contains the noise matrix , is given by

(15)

Considering the factorization given by (6), the factorized likelihoods of (7), and applying the properties of the products of conditional Gaussians (see, e.g. bishop2006pattern ), we can recursively recover the mean and the covariance terms, which leads us directly to the MT noise covariance , as:

(16)

where it is assumed that and .

As for the construction of , we will provide a brief summary here while the full details can be found in the Supplementary Material. Given the knowledge of the joint covariance prior of , which we denote as , the expression of the intertask covariance matrix can be obtained from

(17)

where is a matrix whose -th row is constructed as the concatenation of and zeros, i.e . Hence, the first row is all zeros.

To obtain the terms of we first recall that its diagonal elements were predefined in (8). We approximate elements with the sample estimation of the cross correlation coefficients, i.e., where . Since this computation depends on variables and , we can carry it out by either generating samples from their posterior distribution and obtaining the values of with a Monte Carlo approximation, or by directly considering that are given by their MAP estimations. In fact, if we consider the latter approach (either Equation (12) or Equation (14)), the calculation of the terms of can be expressed in a more compact form as:

(18)

where with given by (10) in the hierarchical Cool-MTGP method, or in the approximate approach.

One might be concerned that this reconstruction process, and therefore the overall performance of the model, depends on the order in which the tasks are assigned to the Cool-GPs. However, experimental results show that the reconstruction of the matrices is consistent for any random permutation of the tasks, confirming that task order has little impact.

4 Experimental Results

4.1 Synthetic benchmarks

We have carried out a synthetic data experiment to compare the performance of all the considered MTGP algorithms against a ground truth model. Following the general model of Section 2, a data set has been drawn from likelihood function where and the intertask and noise covariance matrices follow the low rank form (and similarly for ). We have created datasets for three scenarios in which the and matrices were generated with values of , and . In all cases, tasks, samples and iterations were run with random training/test partitions.

We compare both the hierarchical (HCool-MT) and approximate (Cool-MT) versions of the proposed model to the standard MTGP (Std-MT) of Bonilla2008 and the extension introduced in Rakitsch2013 , which includes a noise matrix (-MT). Since these methods require the selection of parameter , we have analyzed three values: one equal to (the ideal case), a value of smaller than and, where possible, a value of greater than . Additionally, a ground-truth model that uses the true intertask and noise covariance matrices is included, as well as a set of independent GPs. Predictive performance is measured with the mean square error (MSE) averaged over all the tasks.

         (a) (b) (c)
          , , , , , ,
Figure 2: MSE for all models of the synthetic experiment and different values of the matrix rank of the generative model and parameter for Std-MTGP and -MTGP.

(a) Ground-Truth (b) Std-MT (c) -MT (d) Cool-MT (e) HCool-MT
Figure 3: Estimated intertask, , and noise, , covariance matrices vs. true ones (Ground-Truth) when R=10. Std-MT and -MT were trained for .

Figure 2 shows that -MT has the highest sensitivity to the choice of , and its performance degrades when the scenario complexity (defined by matrix rank ) increases. Std-MT has more robustness with respect to both the selection of and the value of . As was expected, both models show their best performances when , and thus a cross validation of is paramount for these methods to perform optimally. This sensitivity depends on the number of the parameters to be inferred, which is in the case of Std-MT and for -MT, while it is only for both Cool-MTGPs. This allows our model to perform closer to the ground truth independently of the scenario complexity .

Experimentally, it can be seen in Figure 3 that, while all models are capable of inferring the intertask covariance, the noise matrix is not properly inferred by the -MT when scenario complexity ( value) is high. Besides, comparing both versions of the Cool-MTGP model, the hierarchical approach is unsurprisingly better at estimating the true parameters and reconstructing the noise matrix, leading to a better consistency in its predictions as it was already shown in Figure 2.

(a) Real (b) Random 1 (c) Random 2 (d) Random 3 (e) Random 4
Figure 4: Estimated and covariances for four random permutations of the task order for 15 tasks with R=15.

Finally we address the possibility that the model is sensitive to task ordering during the training of the Cool-GPs. As can be seen in Figure 4, the model is consistent with regard to the order of the tasks.

4.2 Real data benchmarks

In this experiment we use real world scenarios to compare the Cool-MT model’s capabilities to those of the Std-MT and Conv-MT GPflow2017 models111We have been unable to include -MT in this part of our study due to convergence issues with the available implementation that led to unusable results.. To accomplish this, we have made use of the collection of datasets featured in spyromitros2016multi , which offers a wide variety in sample size, input dimensions and output tasks. A summary of their main characteristics can be found in the Supplementary Material. In all cases a standard normalization of the data was applied and 10 iterations were run with a random 80%/20% training/test partitioning of the data. In larger datasets (# samples 400) we dedicated 300 samples to the training set and 100 to the test set. All models were trained with both a linear kernel and a squared exponential (SE) kernel, but only the results for the best performing kernel are shown.

Dataset Kernel Ind. GPs Std-MT Conv-MT Cool-MT HCool-MT
oes10 Linear 1.04 0.88 0.15 0.07 0.14 0.07 0.14 0.07 0.14 0.07
oes97 Linear 1.53 0.58 0.21 0.08 0.19 0.07 0.20 0.08 0.20 0.08
scm1d Linear 1.78 1.01 0.58 0.43 0.21 0.05 0.22 0.05 0.22 0.05
scm20d Linear 0.52 0.06 0.49 0.06 0.47 0.06 0.47 0.06 0.47 0.06
andro SE 0.20 0.11 0.20 0.11 0.82 0.29 0.20 0.11 0.21 0.11
enb SE 0.04 0.01 0.03 0.00 0.08 0.02 0.02 0.00 0.02 0.01
edm SE 0.39 0.12 0.44 0.16 0.49 0.13 0.49 0.21 0.41 0.13
slump SE 0.39 0.08 0.50 0.22 0.40 0.09 0.37 0.07 0.37 0.07
atp1d SE 0.20 0.07 0.99 0.36 1.04 0.37 0.20 0.07 0.20 0.07
atp7d SE 0.48 0.19 0.97 0.33 1.16 0.30 0.48 0.17 0.47 0.17
Table 1: Real dataset benchmark results: MSE averaged over tasks. Best results in bold.

As can be seen in Table 1, in the linear cases all models present similar MSE, whereas in the non-linear scenarios the Cool-MTGPs come clearly on top. After analysing the values for the SE kernel length-scale parameter learnt by all the models, it is clear that Conv-MT and, in some cases, Std-MT are unable to achieve a correct estimation. We believe that this is due to our model’s reduced number of parameters to be learnt, making its adjustment easier. Once again, both Cool-MT approaches lead to similar results except in the case of the edm dataset, where the hierarchical approach gains an edge.

4.3 Computational performance analysis

In this last section we evaluate the computational performance of some of the methods under study when executed on a CPU and a GPU. For this purpose, we have selected the Std-MT and Conv-MT, since they are efficiently implemented over Pytorch and TensorFlow, and we have designed a wrapper over the Pytorch GP implementation for the proposed

Cool-MT approach in order to run it on GPUs. We have measured the run time and MSE performance of each algorithm with a linear kernel for different sized () training partitions of the scm20d dataset considering only 4 tasks. Computational times are averaged over iterations. optimization iterations were used for the Std-MT and Cool-MT methods, whereas Conv-MT needed to obtain accurate results. The experiment was carried out on an Intel Core i9 Processor using a single core (3.3GHz, 98GB RAM) and a GeForce RTX 2080Ti GPU (2944 Cuda Cores, 1.545GHz, 10.76GB VRAM).

Figure 5 shows the evolution of the runtime and MSE with . Conv-MT and Cool-MT show similar MSE, but the computational time of Conv-MT grows much faster with the number of data. We conclude that Cool-MT presents the best trade-off between accuracy and computational burden.

Notably, while Std-MT’s implementation is specific to GPUs using the optimizers provided by Pytorch, for now Cool-MT only uses a wrapper. Despite this, Cool-MT achieves comparable performance. Additional improvements can be expected with an implementation tailored for parallelization.

(a) Execution time in CPU (b) Execution time in GPU (c) MSE evolution
Figure 5: Runtime and accuracy comparison on the scm20d dataset with 4 tasks for different numbers of training samples. Conv-MT and Cool-MT are competitive in MSE, but Cool-MT scales much better with the number of data. Std-MT offers the smallest computational burden, but has poor performance with a low number of data.

5 Conclusions

In this paper we have proposed a novel solution for the MTGP problem that, compared to previous formulations, eliminates the need to validate any model hyperparameters and dramatically reduces the number of parameters to be learnt.

Similarly to other existing models, this proposal assumes that an intertask and a noise covariances exist. The novelty lies in the parameter inference, which is solved through the factorization of the joint MT likelihood into a product of conditional one-output GPs. Once these parameters are learnt, with either a hierarchical or an approximate approach, a recursive algorithm can be used to recover the MT intertask and noise covariances.

Experimental results show an accurate estimation of the MT intertask and noise matrices which translates into an improved error performance. At the same time, we have integrated the model with standard GP toolboxes, showing that it is computationally competitive with the state of the art.

Broader Impact

This article proposes a novel MTGP learning methodology based on a conditional one-output likelihood for MTGPs. This approach assumes full rank intertask and noise covariance matrices, without compromising the computational cost.

Our approach may become the reference formulation in the Machine Learning community for MTGP due to its improved performance, interpretability and competitive computational cost, as well as its easy integration with existing standard toolboxes for GPs.

It can be used in a variety of real world problems where it is important to account for the dependencies between tasks and where there is value in providing an interpretation of the observed MT regression process, for example, in medical time series analysis or in genomics. It can also be applied to forecast problems in areas such as Smart-Grids or power load prediction.

Our software repository is publicly available and can be used right away in any real world scenario using a GPU or high performance computing facility with standard Python libraries.

Acknowledgments and Disclosure of Funding

We thank Dr. Miguel Lázaro-Gredilla and Gustau Camps-Valls for their thorough review of the paper and fruitful discussions. Partially supported by the National Science Foundation EPSCoR Cooperative Agreement OIA-1757207, the Spanish MINECO grants TEC2017-83838-R and PID2019-108539RB-C22.

References

  • (1) C. E. Rasmussen and C. K. Williams, Gaussian process for machine learning. MIT press, 2006.
  • (2) H. Liu, J. Cai, and Y.-S. Ong, “Remarks on multi-output Gaussian process regression,” Knowledge-Based Systems, vol. 144, pp. 102–121, 2018.
  • (3) E. V. Bonilla, K. M. Chai, and C. Williams, “Multi-task Gaussian process prediction,” in Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, eds.), pp. 153–160, Curran Associates, Inc., 2008.
  • (4) A. M. Schmidt and A. E. Gelfand, “A Bayesian coregionalization approach for multivariate pollutant data,” Journal of Geophysical Research: Atmospheres, vol. 108, no. D24, 2003.
  • (5) T. R. Fanshawe and P. J. Diggle, “Bivariate geostatistical modelling: a review and an application to spatial variation in radon concentrations,” Environmental and ecological statistics, vol. 19, no. 2, pp. 139–160, 2012.
  • (6) O. Stegle, C. Lippert, J. M. Mooij, N. D. Lawrence, and K. Borgwardt, “Efficient inference in matrix-variate Gaussian models with iid observation noise,” in Advances in neural information processing systems, pp. 630–638, 2011.
  • (7) B. Rakitsch, C. Lippert, K. Borgwardt, and O. Stegle, “It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals,” in Advances in Neural Information Processing Systems 26, pp. 1466–1474, 2013.
  • (8) H. K. Lee, C. H. Holloman, C. A. Calder, and D. M. Higdon, “Flexible Gaussian processes via convolution,” Duke University, 2002.
  • (9) P. Boyle and M. Frean, “Dependent Gaussian processes,” in Advances in Neural Information Processing Systems 17, pp. 217–224, 2005.
  • (10) M. Alvarez and N. D. Lawrence, “Sparse convolved Gaussian processes for multi-output regression,” in Advances in Neural Information Processing Systems 21, pp. 57–64, 2009.
  • (11) M. A. Álvarez and N. D. Lawrence, “Computationally efficient convolved multiple output Gaussian processes,” Journal of Machine Learning Research, vol. 12, no. May, pp. 1459–1500, 2011.
  • (12) J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson, “Gpytorch: Blackbox matrix-matrix Gaussian process inference with gpu acceleration,” in Advances in Neural Information Processing Systems, pp. 7576–7586, 2018.
  • (13) J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, UK: Cambridge University Press, 2004.
  • (14) C. M. Bishop, Pattern recognition and machine learning

    , ch. 2. Probability Distributions.

    Springer, 2006.
  • (15) A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, “GPflow: A Gaussian process library using TensorFlow,” Journal of Machine Learning Research, vol. 18, pp. 1–6, apr 2017.
  • (16) E. Spyromitros-Xioufis, G. Tsoumakas, W. Groves, and I. Vlahavas, “Multi-target regression via input space expansion: treating targets as inputs,” Machine Learning, vol. 104, no. 1, pp. 55–98, 2016.