Weakly-supervised Multi-output Regression via Correlated Gaussian Processes

by Seokhyun Chung, et al.

Multi-output regression seeks to infer multiple latent functions using data from multiple groups/sources while accounting for potential between-group similarities. In this paper, we consider multi-output regression under a weakly-supervised setting where a subset of data points from multiple groups are unlabeled. We use dependent Gaussian processes for multiple outputs constructed by convolutions with shared latent processes. We introduce hyperpriors for the multinomial probabilities of the unobserved labels and optimize the hyperparameters which we show improves estimation. We derive two variational bounds: (i) a modified variational bound for fast and stable convergence in model inference, (ii) a scalable variational bound that is amenable to stochastic optimization. We use experiments on synthetic and real-world data to show that the proposed model outperforms state-of-the-art models with more accurate estimation of multiple latent functions and unobserved labels.








1 Introduction

The goal of multi-output regression is to estimate multiple latent functions, each corresponding to a group or a source, over a common domain of input and output. It learns from the observed input, output and source membership label by borrowing strength from potential between-source commonalities. The capability to account for dependencies between outputs improves the accuracy of prediction and estimation. Indeed, in recent years, multi-output regression has seen great success within the machine learning community, specifically in Gaussian processes (GP) (e.g., Dai et al., 2017; Álvarez and Lawrence, 2011). This success is mainly attributed to the fact that correlated outputs can be expressed as a realization from a single GP, where commonalities are modeled through inducing cross-correlations between outputs (e.g., Álvarez et al., 2013; Kaiser et al., 2018b; Kontar et al., 2018). In the literature, the resulting GP has been termed “Multi-output”, “Multivariate” or a “Multi-task” GP. In this paper we refer to this class of models as MGP.

However, the success stories of MGPs have mainly relied on the existence of fully labeled data, where for each observation there exists a correct label that indicates its group membership. In many applications, group labels are often hard or expensive to obtain, creating acute needs for methods that can handle data without complete group labels. We illustrate this with a hypothetical scenario in Figure 1.

Figure 1: Examples of fully labeled (left) and partially labeled (right) settings for multi-output regression.

Consider a regular physical examination that assesses diabetic risk through measuring an individual’s blood glucose level and body mass index (BMI) for two groups: AIDS (group A) and non-AIDS individuals (group B). On the one hand, we may expect sparse observations in group A compared to group B if the AIDS prevalence is low in the population being examined. Modeling the blood-glucose-level-versus-BMI curves independently a priori for the two groups will result in unstable predictions for group A with fewer observations. This can be addressed by using an MGP which borrows information from the densely observed group B to inform group A. On the other hand, the output labels are not always fully available. For example, due to privacy concerns, some AIDS patients may choose not to report their AIDS status; a subset of non-AIDS patients may likewise not report their status. This results in data points with unobserved labels for a subset of subjects in both groups (see Figure 1, Right). This aspect presents a critical challenge for the use of MGPs in such scenarios.

In this paper, we address this key challenge through an MGP-based probabilistic model that can jointly infer the group labels and the underlying latent functions for all the groups. We refer to our model as weakly-supervised regression (Zhou, 2017) because it can handle (i) unlabeled data (i.e., the semi-supervised setting), (ii) noisy labels, and (iii) prior belief on group memberships. Specifically, the membership for each data point is assumed to follow a multinomial distribution. For the labeled data points, this allows prior belief on label memberships to be included. For the unlabeled observations, we assign a Dirichlet prior on the multinomial probabilities.

The Dirichlet prior acts as a regularizer that controls group assignment acuteness and in turn is capable of minimizing predictive variance. Correlations between the outputs are then induced through sparse convolution processes with a layer of shared latent processes, which are amenable to computationally efficient posterior inference via sparse approximations. We then show how the sparse model seamlessly integrates the label prior probabilities. To overcome posterior intractability we derive a variational bound and show that it interestingly turns out to have a similar structure to a typical MGP. Using this structural similarity, a scalable variational bound amenable to stochastic optimization is obtained.

In the following, we summarize the major contributions of this paper.

1. We introduce a weakly-supervised regression framework based on correlated GPs that efficiently learns missing labels and estimates conditional means and variances for multiple groups. The model leverages unlabeled data to improve performance over a range of weakly-supervised settings including semi-supervision, noisy labels and prior belief on group memberships. To the best of our knowledge, this is the first study that addresses the weakly-supervised setting in MGP. We remark that in our model we exploit the popular convolved GPs (Álvarez and Lawrence, 2011) to establish cross dependencies, however, our framework is applicable to any separable or non-separable GP construction.

2. We derive two variational bounds. First, we derive a modified variational lower bound for the marginal likelihood. The modified bound enjoys (i) interpretability, which allows us to obtain useful insights into the proposed model, (ii) ease of modeling, whereby any MGP construction can be plugged into the lower bound, and (iii) stability due to faster convergence. The bound indeed shares a close structure with a fully labeled MGP, and inspired by this, we derive an alternative scalable variational bound that can be maximized with stochastic optimization methods.

3. We provide a mechanism to control assignment acuteness of unlabeled data via a Dirichlet prior on the multinomial probabilities. We highlight the analytical properties of this mechanism, specifically compared to state-of-the-art data association methods in GPs (Lázaro-Gredilla et al., 2012) which face the challenge of acute label assignment (see Section 4.3) resulting in high predictive variance and obscure assignments.

4. We illustrate the model and inferential algorithm using two challenging datasets: housing price index and climate data. We show that our model can leverage unlabeled data to improve predictive performance and reliably recover unobserved or noisy labels.

2 Background

2.1 Multioutput GPs: Fully-Labeled

Consider a dataset with observations, where and are the inputs and outputs, respectively; each observation is drawn from one of groups. Now, suppose we observe the group labels , for observation . Let represent the number of observations from output ; we have . Without loss of generality, we re-order the rows in by the ascending order of the group labels . That is, we write and as: with corresponding inputs where , and for .

Given a source , we assume the vector of outputs obtained at distinct input values follows , for observation , where and denote the -th unknown regression function and an independent additive Gaussian noise, respectively. Assuming Gaussian process priors for , the exact MGP likelihood upon integrating over is given as , where is an arbitrary covariance matrix. Here are hyperparameters for , an by block matrix; is comprised of block covariance (on the main diagonal blocks) and cross-covariance (off-diagonal blocks) matrices. The -th block is of dimension with elements , for any and from groups and , respectively; the notation indicates a block diagonal matrix with on the main diagonal.
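The block structure described above can be sketched in code. The snippet below is an illustrative assembly of the MGP covariance from per-pair covariance functions; the squared-exponential kernel and the helper names (`se_kernel`, `block_covariance`) are our own illustrations, not notation from the paper.

```python
import numpy as np

def se_kernel(x, x2, length=1.0, var=1.0):
    """Squared-exponential kernel between 1-D input vectors."""
    d = x[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def block_covariance(inputs, cov_fns):
    """Assemble the block covariance of an MGP.

    inputs  : list of D arrays, one per group (rows ordered by group label)
    cov_fns : cov_fns[d][d2](x, x2) returns the (cross-)covariance block
              between groups d and d2
    """
    D = len(inputs)
    rows = [
        np.hstack([cov_fns[d][d2](inputs[d], inputs[d2]) for d2 in range(D)])
        for d in range(D)
    ]
    return np.vstack(rows)

# two groups sharing one latent process: for illustration, every block
# (diagonal and cross) uses the same squared-exponential kernel
X = [np.linspace(0, 1, 4), np.linspace(0, 1, 3)]
k = lambda x, x2: se_kernel(x, x2, length=0.3)
K = block_covariance(X, [[k, k], [k, k]])
```

The resulting matrix is symmetric and positive semi-definite, with the within-group covariances on the diagonal blocks and cross-covariances off the diagonal, matching the structure described in the text.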

2.2 Sparse Convolved MGP (SCMGP)

To induce cross-correlations between the latent functions (off-diagonal blocks in ), we use convolved MGPs. For example, the class of separable covariances or the linear model of coregionalization are special cases of the convolution construction (Álvarez et al., 2012; Fricker et al., 2013). Convolved MGPs build the covariance matrix via , , where denotes a smoothing kernel and is a shared latent GP. The key rationale is to share the common latent GP across sources. Since convolution is a linear operator, the outputs are then samples from multiple interrelated GPs - an MGP. This construction readily generalizes to multiple shared latent GPs. We refer readers to Álvarez et al. (2012) for further details.
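When both the smoothing kernels and the latent GP kernel are Gaussian, the double convolution integral has a closed form: a Gaussian in the input difference whose variance is the sum of the three variances (a standard property of Gaussian convolutions, used in the convolved-MGP literature). The sketch below states this closed form and verifies it against a brute-force numerical double integral; all variable names and the specific parameter values are our own illustration.

```python
import numpy as np

def gauss(t, v):
    """Normalised Gaussian density N(t; 0, v)."""
    return np.exp(-0.5 * t**2 / v) / np.sqrt(2 * np.pi * v)

def cross_cov(x, x2, v_d, v_d2, v_u, s_d=1.0, s_d2=1.0):
    """Closed-form cross-covariance between outputs d and d2 built by
    convolving Gaussian smoothing kernels (variances v_d, v_d2) with a
    shared latent GP whose kernel is a Gaussian of variance v_u."""
    return s_d * s_d2 * gauss(x - x2, v_d + v_d2 + v_u)

# brute-force check of the double convolution integral
# cov = \int\int h_d(x - z) k_u(z - z') h_d2(x2 - z') dz dz'
v_d, v_d2, v_u, x, x2 = 0.2, 0.5, 1.0, 0.3, -0.4
z = np.linspace(-8, 8, 801)
dz = z[1] - z[0]
integrand = (gauss(x - z, v_d)[:, None]
             * gauss(z[:, None] - z[None, :], v_u)
             * gauss(x2 - z, v_d2)[None, :])
num = integrand.sum() * dz**2          # Riemann approximation
```

The agreement between `num` and `cross_cov(...)` illustrates why sharing the latent process induces a valid cross-covariance between outputs.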

The key advantage of convolved MGPs is that they are amenable to sparse approximations in a similar fashion as in univariate GPs (Quiñonero-Candela and Rasmussen, 2005). This is done through approximating using conditional independence assumptions. The approximation is built from a sparse set of inducing points (length ) where given , are a priori conditionally independent. This can be written as , where ; ; is a cross-covariance matrix relating the column vector and where the column vector collects values at each of inputs in group . Additionally, with , where is a covariance matrix of ; is a cross-covariance matrix between and ; is a set of hyperparameters. The marginal likelihood is:


Notice that we only introduce one latent process, but the construction easily generalizes to multiple latent processes. As shown in Eq. (1), the key idea is approximating by , which equals in the diagonal blocks and is a low-rank approximation in the off-diagonal blocks. Indeed, such an approximation is derived from variational inference; the recent work of Burt et al. (2019) provided a theoretical foundation for these approximations, showing that the gap between the variational bound and the exact GP can be made arbitrarily small with ( for the exponential kernel).
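The approximate covariance just described (exact diagonal blocks, low-rank off-diagonal blocks through the inducing points) can be sketched as follows. This is an illustrative construction under a shared squared-exponential kernel; the function names and jitter value are our assumptions.

```python
import numpy as np

def se(x, x2, l=0.4):
    """Squared-exponential kernel for 1-D inputs."""
    d = x[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / l) ** 2)

def sparse_mgp_cov(X_groups, Z, kern=se):
    """Sparse approximation in the spirit of Eq. (1): keep the exact
    covariance on the diagonal blocks, and use the low-rank term
    Kfu Kuu^{-1} Kuf on the off-diagonal blocks."""
    X = np.concatenate(X_groups)
    Kfu = kern(X, Z)
    Kuu = kern(Z, Z) + 1e-8 * np.eye(len(Z))   # jitter for stability
    Q = Kfu @ np.linalg.solve(Kuu, Kfu.T)      # low-rank part (rank <= M)
    K_tilde = Q.copy()
    i = 0
    for Xd in X_groups:                        # overwrite diagonal blocks
        j = i + len(Xd)
        K_tilde[i:j, i:j] = kern(Xd, Xd)
        i = j
    return K_tilde

X_groups = [np.linspace(0, 1, 5), np.linspace(0, 1, 4)]
Z = np.linspace(0, 1, 3)                       # M = 3 inducing inputs
K_tilde = sparse_mgp_cov(X_groups, Z)
```

The off-diagonal blocks then have rank at most M, which is what makes the approximation cheap when M is much smaller than the total number of observations.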

3 Weakly-Supervised MGP

Now we discuss our model, referred to as weakly-supervised MGP (WSMGP). We first present the general probabilistic framework, and then utilize SCMGP in the framework. We use superscripts and to indicate quantities associated with “labeled” and “unlabeled” observations, respectively.

Unlike the fully-labeled MGP, we now have observations comprised of unlabeled and labeled observations . For each labeled observation, we have a vector of labels where specifies that the -th observation originates from the -th group.

For the -th labeled observation we introduce a vector of unknown binary indicators , where and . Let indicate that the latent function generated observation . We define analogous indicators for the unlabeled observations.

For notational clarity, we collectively define additional notations as follows: the inputs , , , the observed responses , , , the group indicators , , , the noise levels and the unknown functions .

Our model can then be formulated as follows:


where a priori follows independent multinomial distributions with parameters , where an element indicates the probability that observation is generated from source ; thus for all . Here we have organized the probabilities by where and .

For each labeled observation , we give the probabilities specific values. The probabilities are determined by the level of uncertainty in the label . For example, if we are perfectly sure about , we can set , where is the indicator function, which equals 1 if the statement is true and 0 otherwise.
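A minimal sketch of this prior construction for labeled points, including the noisy-label case where some mass is spread over the other groups, is shown below. The helper name `label_prior` and the `noise` parameterisation are our own illustration.

```python
import numpy as np

def label_prior(label, D, noise=0.0):
    """Multinomial prior probabilities for one labeled observation.

    noise = 0  -> one-hot vector (the observed label is fully trusted)
    noise > 0  -> that much mass spread evenly over the other D-1 groups,
                  encoding the belief that the label may be wrong
    """
    p = np.full(D, noise / (D - 1))
    p[label] = 1.0 - noise
    return p

trusted = label_prior(1, 3)            # [0, 1, 0]
noisy = label_prior(0, 2, noise=0.1)   # [0.9, 0.1]
```

Setting `noise > 0` is one way to express the "noisy labels" aspect of the weakly-supervised setting within the same multinomial prior.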

For each unlabeled observation , we place a Dirichlet prior on :


where , a multivariate Beta function with parameter . In general, the elements of need not be identical; in this paper, for simplicity, we use an identical for all the elements, i.e., a symmetric Dirichlet. Hereon, we omit the hyperparameters from the probabilistic models.
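The effect of the symmetric Dirichlet concentration can be seen directly by sampling. The sketch below (our illustration; the specific concentration values 0.1 and 10 are arbitrary) shows that a small concentration pushes samples toward the corners of the simplex, i.e., near one-hot membership probabilities, while a large one concentrates them near the uniform vector.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3                                   # number of groups

# symmetric Dirichlet: one concentration value for every group
sharp = rng.dirichlet(np.full(D, 0.1), size=2000)   # concentration << 1
vague = rng.dirichlet(np.full(D, 10.0), size=2000)  # concentration >> 1

# small concentration -> samples near the simplex corners (near one-hot);
# large concentration -> samples near the uniform vector (1/D, ..., 1/D)
mean_max_sharp = sharp.max(axis=1).mean()
mean_max_vague = vague.max(axis=1).mean()
```

This is the mechanism later exploited in Section 4.3: the concentration acts as a dial between acute (near one-hot) and diffuse group assignments.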

Applying SCMGP in Our Framework In Eq. (3) we modeled as an MGP with a full covariance matrix . In this paper, we use SCMGP as a sparse approximation to the full covariance. That is, we replace the full MGP (3) by


Our framework is not restricted to the SCMGP, because any approximation of can be utilized. Nevertheless, we use SCMGP in this paper for its computational efficiency, generality and fast convergence rates. Based on the probability distributions (2)–(7), the marginal log-likelihood is


where represents integration over the ranges of variables in .

4 Inference

Derivation of the posterior distribution of WSMGP, , is analytically intractable. A popular technique for addressing this challenge is variational inference (VI) (Blei et al., 2017). VI approximates the posterior distribution by maximizing an evidence lower bound (ELBO) on the marginal likelihood, which is equivalent to minimizing the Kullback-Leibler (KL) divergence between the candidate variational distributions and the true posterior.

Here we note that in GPs, traditional VI uses a mean-field approximation which assumes that the latent variables in the variational distributions are independent, i.e., (Titsias, 2009). This has been the basis of many recent extensions and applications of VI in the literature (e.g., Zhao and Sun, 2016; Hensman et al., 2013; Panos et al., 2018). Hereon, the well-known ELBO derived by the mean-field approximation is denoted as .

4.1 Variational Approximation with Improved Lower Bound

In our model, we derive an alternative bound based on the observation that the latent variables and can be analytically marginalized out from Eq. (8). Thus, we only introduce variational distributions over and that belong to the same distributional families as those of the original distributions and . We distinguish the variational parameters from the original ones by using the hat notation (e.g., ). We refer to the new bound as the KL-corrected variational bound . To derive , we first observe the following inequality: , a direct application of Jensen’s inequality. Exponentiating each side of the above inequality and plugging it into Eq. (8), we find

Now we can analytically calculate the integral resulting in the following interpretable form


where , is a diagonal matrix with elements and

A detailed derivation of Eq. (9) is provided in the supplementary materials. We maximize to obtain the optimal variational parameters , and hyperparameters as our estimates. Note that the computational complexity is dominated by the inversion of in . Due to the structural equivalence to SCMGP, the computational complexity is .

4.2 Stochastic Variational Inference

Stochastic variational inference (SVI) facilitates employing stochastic optimization algorithms in the VI framework (Hoffman et al., 2013). The crucial benefit of is enabling parallelization and hence scalability to large datasets. However, cannot be utilized as a stochastic variational bound, because dependencies between the observations are retained by marginalizing out .

In recent GP literature, this problem is tackled by introducing a variational distribution (Hensman et al., 2013; Saul et al., 2016). Realizing that the first term of Eq. (9) differs from Eq. (1) via the diagonal matrix , we derive the stochastic variational bound of WSMGP in a similar manner. Specifically, the bound, denoted by , is derived as


The approximate marginal posterior for is given by and where with

Note that significantly reduces the computational complexity to , i.e., the complexity of inverting . The main advantage of this form is that it enables stochastic optimization, where mini-batches are sampled and noisy gradients are calculated in the optimization of . Through this procedure our model attains scalability. Detailed derivations are provided in the appendix. On the other hand, introducing may produce a challenge in implementation: the variational distribution is highly sensitive to hyperparameter changes. This can be handled using variational EM (Bishop, 2006), which iterates between optimizing the variational parameters and the hyperparameters.
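The mini-batch mechanism underlying such a stochastic bound can be illustrated on a toy objective. The sketch below is not the WSMGP bound itself; it only demonstrates the generic SVI ingredient the text relies on: a per-point objective whose mini-batch gradient, rescaled by N/B, is an unbiased estimate of the full-data gradient. All names and constants are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, B = 1000, 50                       # data size, mini-batch size
y = rng.normal(loc=2.0, size=N)       # toy data

def grad_minibatch(theta):
    """Noisy gradient of the toy objective 0.5 * sum_i (theta - y_i)^2.

    Sample B points and rescale by N/B so that the estimate is
    unbiased for the full-data gradient sum_i (theta - y_i)."""
    idx = rng.choice(N, size=B, replace=False)
    return (N / B) * np.sum(theta - y[idx])

# plain SGD on the toy objective; theta should approach the data mean
theta, lr = 0.0, 1e-4
for _ in range(2000):
    theta -= lr * grad_minibatch(theta)
```

Because each noisy gradient touches only B points, the per-step cost is independent of N, which is the source of the scalability claimed for the stochastic bound.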

4.3 Analysis of KL-divergence terms in

We now examine the KL terms in Eq. (9). For the labeled data points, we can easily see that the second term in Eq. (9) encourages to be close to , which is the prior label belief. For the unlabeled data points, the hyperparameter of the Dirichlet prior plays an important regularization role. To see this, we first note that the optimal value of the variable , denoted by , is . Given , the third term in Eq. (9) is expressed as

Figure 2: is convex or concave in depending on .

Figure 2 demonstrates the above KL-divergence corresponding to the -th observation by in the case that we consider two GPs: . Note that a small is preferred in maximizing . The figure shows that controls the acuteness, or discretion, of the assignment of observations to sources. In more detail, observe that if we set a large (e.g., ), then with or is greater than the one with . On the other hand, if we have a small (e.g., ), with is greater than the one with or . Also, converges to as increases.

Therefore, a small encourages assignment of an unlabeled observation to a group with probability close to zero or one. It thus acts as a form of regularization or shrinkage that encourages sparse assignments, which in turn reduces predictive variance. In contrast, a large , or the absence of a Dirichlet prior, augments model uncertainty via decreased acuteness in group assignments. These results also imply that the Dirichlet prior aids interpretability, as points can receive more acute classification results.
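The two regimes described above can be checked directly from the symmetric Dirichlet log-density for two groups. The sketch below (our illustration; the evaluation points 0.99 and 0.5 and the concentrations 0.1 and 10 are arbitrary) shows that a small concentration puts more log-density at near one-hot probabilities than at 0.5, while a large concentration does the opposite.

```python
from math import lgamma, log

def log_dirichlet_2(pi, alpha):
    """Log density of a symmetric 2-group Dirichlet at (pi, 1 - pi)."""
    log_B = 2 * lgamma(alpha) - lgamma(2 * alpha)   # log multivariate Beta
    return -log_B + (alpha - 1) * (log(pi) + log(1 - pi))

# small concentration: the corners of the simplex are preferred,
# encouraging acute (near one-hot) assignments
sharp_corner = log_dirichlet_2(0.99, 0.1)
sharp_middle = log_dirichlet_2(0.50, 0.1)

# large concentration: the middle is preferred, discouraging acuteness
vague_corner = log_dirichlet_2(0.99, 10.0)
vague_middle = log_dirichlet_2(0.50, 10.0)
```

This reproduces, in one dimension of the simplex, the convex/concave behavior depicted in Figure 2.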

5 Related work

Although weakly-supervised learning has recently become a popular area within the machine learning community (e.g., Grandvalet and Bengio, 2005; Singh et al., 2009; Rohrbach et al., 2013; Ng et al., 2018), research on its interface with GPs remains sparse. The few problems that have been addressed in this area mainly focused on classification tasks using GPs (Lawrence and Jordan, 2005; Skolidis and Sanguinetti, 2013; Damianou and Lawrence, 2015). From a regression perspective, the works are few. Jean et al. (2018) proposed an approach based on deep kernel learning (Wilson et al., 2016) for regression with unlabeled data, inferring a posterior distribution that minimizes the variance of unlabeled as well as labeled data. Cardona et al. (2015) modeled a convolved MGP for semi-supervised learning. Note that our model differs from such work because they define the unlabeled data as observations whose output value is missing, and they do not use prior knowledge or regularization for the labels.

Figure 3: An illustrative example for two imbalanced populations where (imbalance ratio, labeled fraction)=. Note that the observations from source 1 are sparse. The observed data points and labels are colored in the first panel; The same data is fitted by three models (WSMGP, OMGP, OMGP-WS) shown in the other plots where the inferred probability of each observation’s label is represented by a color gradient.

Our study considers weakly-supervised learning for the case of missing labels for outputs rather than missing output values. In this sense, our study is closely related to Lázaro-Gredilla et al. (2012), who solve the data association problem using OMGP. The goal of data association is to infer the movement trajectories of different objects while recovering the labels identifying which trajectories correspond to which object. Extending the OMGP, Ross and Dy (2013) proposed a model that enables inference of the number of latent GPs. Recently, Kaiser et al. (2018a) also extended the OMGP by modeling both the latent functions and the data associations using GPs. However, OMGP makes a key assumption of independence across mixture components (i.e., the outputs). This limits its capability in scenarios that call for modeling between-output correlations and borrowing strength across outputs, a key feature we incorporate in this work. Also, the previous work on the data association problem is unsupervised and cannot handle noisy labels or control labeling acuteness. Specifically, our model can search for the proper hyperparameter during optimization, reducing predictive variance through acute class assignments, whereas previous models offer no such control, which often results in poor predictions with high variances.

6 Experimental Results

We show experimental results assessing the performance of the proposed model using both synthetic and real-world data. We compare WSMGP to two OMGP-based benchmarks and SCMGP: (i) OMGP (Lázaro-Gredilla et al., 2012), where , for any (ignoring labels); (ii) weakly-supervised OMGP (OMGP-WS), where and for if , for the -th observation (accounting for observed labels); (iii) SCMGP, using only labeled observations. Note that comparing WSMGP with SCMGP shows whether WSMGP efficiently leverages the unlabeled data.

6.1 Experiment on Synthetic Data

For WSMGP, we use one latent process modeled as a GP with squared exponential kernel , where is a diagonal matrix. We also use the smoothing kernel with and a positive definite matrix , which is widely studied in the literature (e.g., Álvarez and Lawrence, 2011; Álvarez et al., 2019). As benchmarks, a squared exponential kernel is used for each independent GP. In the experiment using synthetic data, we assume that we have two sources (). The data is generated from a full MGP composed of two GPs corresponding to the sources, with a kernel where and . For each source we generate 120 observations. As in the motivating example, we consider sparse observations from source 1. We introduce -sparsity to indicate the ratio of the number of observations from source 1 to those from source 2. In addition, we use “-dense” to indicate that fraction of the observed data from each source are labeled. The source code is provided in the appendix.
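The synthetic-data protocol (two sources, sparsity ratio for source 1, a fraction of labels kept) can be sketched as follows. The generating functions here are simple stand-ins, not a draw from the paper's full MGP, and all names (`make_weak_data`, the `-1` unlabeled marker) are our own conventions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_weak_data(n2=120, sparsity=0.25, labeled_frac=0.5):
    """Generate two correlated outputs with imbalance and missing labels.

    sparsity     : ratio of group-1 to group-2 sample sizes (the r-sparsity)
    labeled_frac : fraction of observations whose label is kept (the p-dense)
    """
    n1 = int(sparsity * n2)
    x1, x2 = rng.uniform(0, 1, n1), rng.uniform(0, 1, n2)
    y1 = np.sin(4 * x1) + rng.normal(0, 0.1, n1)        # source 1 (sparse)
    y2 = np.sin(4 * x2) + 2.0 + rng.normal(0, 0.1, n2)  # source 2 (+ bias)
    x = np.concatenate([x1, x2])
    y = np.concatenate([y1, y2])
    labels = np.concatenate([np.zeros(n1, int), np.ones(n2, int)])
    observed = rng.random(n1 + n2) < labeled_frac       # which labels we keep
    return x, y, np.where(observed, labels, -1)         # -1 marks unlabeled

x, y, z = make_weak_data(sparsity=0.25, labeled_frac=0.5)
```

A weakly-supervised model is then fitted on `(x, y, z)`, with the `-1` entries treated as observations whose group membership must be inferred.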

Imbalanced populations We first investigate the behavior of WSMGP when we have imbalanced populations. To do this, we set and to mimic distinct levels of imbalance and fractions of observed labels. For WSMGP, we set . We evaluate the prediction performance of each model by the root mean squared error (RMSE), where the errors are evaluated at . To clearly separate the two groups/sources, in the simulations, we add a bias of (a constant GP mean) to source 2.

Figure 4: Mean of RMSE. Standard deviations are omitted for clarity.

Figure 5: An illustrative example of the case where the latent functions only differ in local areas.

We make four key observations about the results shown in Figures 3 and 4. First, WSMGP outperforms the benchmarks, especially for source 1 at low values (highly imbalanced populations). Therefore, WSMGP can use information from the dense group (source 2) to predict the responses in the sparse group (source 1). In contrast, the benchmarks model the latent curves independently, resulting in poor predictions for the sparse group. Importantly, WSMGP (second panel in Figure 3) infers the label of each observation with an overall higher accuracy relative to the benchmarks (third and fourth panels in Figure 3). Both better prediction and more accurate label recovery are direct benefits of our model’s capacity to incorporate between-group correlations. Second, Figure 4 shows that the prediction accuracy of each model improves as the population becomes more balanced with more labels (larger values of and ). Notably, we did not observe significant reductions in RMSE for OMGP as the fraction of observed labels increases; this is not surprising given that OMGP ignores the labels. Third, WSMGP outperforms SCMGP in both groups. These results illustrate the capability of WSMGP for weakly-supervised learning, which makes use of information from unlabeled data to improve predictions. Fourth, WSMGP has small RMSEs for both sources, while OMGP-WS cannot predict well for the sparse group 1 (Figure 4). For example, with enough labels or observations, OMGP-WS can make a reasonable prediction for the dense group (see the fourth panel in Figure 3), while its predictions for the sparse group are poor. The ability of WSMGP to predict the responses well even for sparse groups highlights the potential of our model for achieving fairness in prediction for imbalanced groups, a direction we hope to pursue in future work.

Revealing similar latent functions In this experiment, we generate data with three outputs, two of which are very similar; we set . We maintain the same sparsity and partially observed label settings as in the previous experiment. An illustrative example is shown in Figure 5. We remark that, based on the observations in the second plot, it is very difficult for a human observer to distinguish the true curve of each source. The differences between similar curves mainly come from the peaks and valleys (e.g., , ), and WSMGP performs well in revealing the underlying similar curves in those intervals.

Dirichlet prior In this experiment we investigate the effect of varying the hyperparameter of the Dirichlet prior. We compare WSMGP with and an alternative model, denoted WSMGP-NoDir, obtained by removing the Dirichlet prior from WSMGP. A comparative result is shown in Figure 6. Note that the plots in the second row represent , with values close to 1 or 0 representing high levels of posterior certainty in the inferred labels.

Figure 6: Predictions by WSMGP and WSMGP without the Dirichlet prior. Plots in the second row illustrate the posterior probability that each observation belongs to source 1.
From the second row of the figure, we see that WSMGP-NoDir does not assign proper probabilities to some observations, which leads to a poor prediction, while WSMGP assigns the probabilities correctly for every observation. This is because the inference of depends on the Dirichlet prior. Specifically, as discussed in Section 4.3, if a small is given (e.g., ), the model is encouraged to find a probability that clearly specifies a group for unlabeled observations, and vice versa. In particular, if is large enough (e.g., = 100), WSMGP converges to WSMGP-NoDir. This shows that placing the Dirichlet prior enables WSMGP to achieve more flexibility by searching for a proper , an option WSMGP-NoDir does not have. We can treat as a hyperparameter to be optimized; jointly optimizing it with the other parameters, the model can provide a better prediction.

6.2 Experiments on Real-world Data

Housing Price Index data

We apply our model and the benchmarks to the interpolation of housing-price time series. Specifically, we use the Housing Price Index (HPI) datasets (https://www.fhfa.gov/DataTools/Downloads/Pages/House-Price-Index-Datasets.aspx) created by the Federal Housing Finance Agency (FHFA). The HPI is a weighted repeat-sales index, which measures the variation in average prices in repeat sales or refinancings of the same properties. The FHFA measures the HPI based on repeat mortgage transactions for single-family properties. The dataset contains the HPI of various US cities from January 1975 to 2019, evaluated quarterly or monthly. Note that we can expect the HPI of neighboring cities to be correlated.

From the dataset we collect the HPI of two cities in the state of Michigan: Ann Arbor and the Kalamazoo-Portage metropolitan area. The HPI for both cities is evaluated quarterly. We randomly collected 30% of the observations as a training dataset. We sparsify the observations from Ann Arbor based on . Additionally, we further remove observations for Ann Arbor in the range from 2000 to 2011 for an efficient comparison of the models. Finally, we removed the labels specifying a city, based on , for both cities.

Figure 7 illustrates the training data and results. According to the results, we find that WSMGP can efficiently recover the labels as well as the latent curves in the sparse range (2000 to 2011). Specifically, using the given labels, WSMGP can accurately find the latent curves in the dense area and learn the correlation between them. Based on this correlation it can find the latent curve of the sparse group (Ann Arbor). Note that OMGP-WS can find the latent curve of the dense group (Kalamazoo-Portage), whereas it predicts poorly for the sparse group, in the sparse range in particular.

Figure 7: Predictions on HPI data.

Climate data The climate data are collected from a sensor network established on the southern coast of the UK, composed of four sensors called Bramblemet, Cambermet, Chimet and Sotonmet (Osborne et al., 2008; Parra and Tobar, 2017). They collect several maritime environmental signals every 5 minutes. The signals are highly correlated across the sensors since the sensors are set up in adjacent locations. We choose one variable, tide depth in meters, collected from Bramblemet and Chimet. We extract two consecutive days in August 2019 for each sensor. Specifically, we form 27 datasets whose day pairs are (1st, 2nd), (2nd, 3rd), ..., (29th, 30th), from which the two pairs (12th, 13th) and (13th, 14th) are excluded; we removed these two datasets since they exhibit abnormal trajectories. In particular, we sparsify the Bramblemet data with and remove 50% of the labels for both sensors. To obtain a more challenging setting, we further remove observations of Bramblemet from 12:00am to 12:00pm on the second day. Finally, we adopt RMSE to compare performances.

Figure 8: Boxplot of RMSE by method. Outliers are excluded.

Figure 8 summarizes the results. First of all, we find that WSMGP outperforms the benchmarks in prediction for Bramblemet, which has sparse observations. This shows that WSMGP recovers the missing labels well and makes a good prediction for the sparse output by transferring information from the dense output. We further remark that WSMGP predicts better than SCMGP. This is because WSMGP can leverage information from unlabeled data, which demonstrates the capability of our model for weakly-supervised learning. Note that the benchmarks using unlabeled data, i.e., OMGP and OMGP-WS, also outperform SCMGP.

7 Conclusion

In this study we have proposed a Bayesian probabilistic model for weakly-supervised learning that performs multi-output regression with partially labeled outputs. Through extensive simulations and empirical studies, we show that the proposed approach excels in settings with imbalanced populations and correlated latent functions, which we believe are particularly relevant for improving fairness in prediction in machine learning. Further scientific applications (e.g., Wu and Chen, 2019) call for extensions of the proposed WSMGP framework to multivariate non-continuous and mixed-type outcomes, which we will pursue in future work.


  • M. A. Álvarez and N. D. Lawrence (2011) Computationally efficient convolved multiple output gaussian processes. Journal of Machine Learning Research 12 (May), pp. 1459–1500. Cited by: §B.1, §1, §1, §6.1.
  • M. A. Álvarez, D. Luengo, and N. D. Lawrence (2013) Linear latent force models using gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11), pp. 2693–2705. Cited by: §1.
  • M. A. Álvarez, L. Rosasco, N. D. Lawrence, et al. (2012) Kernels for vector-valued functions: a review. Foundations and Trends® in Machine Learning 4 (3), pp. 195–266. Cited by: §2.2.
  • M. A. Álvarez, W. O. C. Ward, and C. Guarnizo (2019) Non-linear process convolutions for multi-output gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1969–1977. Cited by: §6.1.
  • C. M. Bishop (2006) Pattern recognition and machine learning. Springer. Cited by: §4.2.
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. Cited by: §4.
  • D. R. Burt, C. E. Rasmussen, and M. Van Der Wilk (2019) Rates of convergence for sparse variational gaussian process regression. In Proceedings of the 36th International Conference on Machine Learning, pp. 862–871. Cited by: §2.2.
  • H. D. V. Cardona, M. A. Álvarez, and Á. A. Orozco (2015) Convolved multi-output gaussian processes for semi-supervised learning. In International Conference on Image Analysis and Processing, pp. 109–118. Cited by: §5.
  • Z. Dai, M. A. Álvarez, and N. Lawrence (2017) Efficient modeling of latent information in supervised learning using gaussian processes. In Advances in Neural Information Processing Systems, pp. 5131–5139. Cited by: §1.
  • A. Damianou and N. D. Lawrence. (2015) Semi-described and semi-supervised learning with gaussian processes.. In Uncertainty in Artificial Intelligence (UAI), Cited by: §5.
  • T. E. Fricker, J. E. Oakley, and N. M. Urban (2013) Multivariate gaussian process emulators with nonseparable covariance structures. Technometrics 55 (1), pp. 47–56. Cited by: §2.2.
  • Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536. Cited by: §5.
  • J. Hensman, N. Fusi, and N. D. Lawrence (2013) Gaussian processes for big data. Conference on Uncertainty in Artificial Intellegence, pp. 282–290. Cited by: §4.2, §4.
  • M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley (2013) Stochastic variational inference. The Journal of Machine Learning Research 14 (1), pp. 1303–1347. Cited by: §4.2.
  • N. Jean, S. M. Xie, and S. Ermon (2018) Semi-supervised deep kernel learning: regression with unlabeled data by minimizing predictive variance. In Neural Information Processing Systems. Cited by: §5.
  • M. Kaiser, C. Otte, T. A. Runkler, and C. H. Ek (2018a) Data association with gaussian processes. arXiv preprint. Cited by: §5.
  • M. Kaiser, C. Otte, T. Runkler, and C. H. Ek (2018b) Bayesian alignments of warped multi-output gaussian processes. In Advances in Neural Information Processing Systems, pp. 6995–7004. Cited by: §1.
  • N. J. King and N. D. Lawrence (2006) Fast variational inference for gaussian process models through kl-correction. In European Conference on Machine Learning, pp. 270–281. Cited by: Appendix D.
  • R. Kontar, S. Zhou, C. Sankavaram, X. Du, and Y. Zhang (2018) Nonparametric modeling and prognosis of condition monitoring signals using multivariate gaussian convolution processes. Technometrics 60 (4), pp. 484–496. Cited by: §1.
  • N. D. Lawrence and M. I. Jordan (2005) Semi-supervised learning via gaussian processes. In Advances in neural information processing systems, pp. 753–760. Cited by: §5.
  • M. Lázaro-Gredilla, S. Van Vaerenbergh, and N. D. Lawrence (2012) Overlapping mixtures of gaussian processes for the data association problem. Pattern Recognition 45 (4), pp. 1386–1395. Cited by: Appendix D, §1, §5, §6.
  • Y. C. Ng, N. Colombo, and R. Silva (2018) Bayesian semi-supervised learning with graph gaussian processes. In Advances in Neural Information Processing Systems, pp. 1683–1694. Cited by: §5.
  • M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings (2008) Towards real-time information processing of sensor network data using computationally efficient multi-output gaussian processes. In 2008 International Conference on Information Processing in Sensor Networks (ipsn 2008), pp. 109–120. Cited by: §6.2.
  • A. Panos, P. Dellaportas, and M. K. Titsias (2018) Fully scalable gaussian processes using subspace inducing inputs. arXiv preprint arXiv:1807.02537. Cited by: §4.
  • G. Parra and F. Tobar (2017) Spectral mixture kernels for multi-output gaussian processes. In Advances in Neural Information Processing Systems, pp. 6681–6690. Cited by: §6.2.
  • J. Quiñonero-Candela and C. E. Rasmussen (2005) A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research 6 (Dec), pp. 1939–1959. Cited by: §2.2.
  • M. Rohrbach, S. Ebert, and B. Schiele (2013) Transfer learning in a transductive setting. In Advances in neural information processing systems, pp. 46–54. Cited by: §5.
  • J. Ross and J. Dy (2013) Nonparametric mixture of gaussian processes with constraints. In International Conference on Machine Learning, pp. 1346–1354. Cited by: §5.
  • A. D. Saul, J. Hensman, A. Vehtari, and N. D. Lawrence (2016) Chained gaussian processes. In Artificial Intelligence and Statistics, pp. 1431–1440. Cited by: §4.2.
  • A. Singh, R. Nowak, and J. Zhu (2009) Unlabeled data: now it helps, now it doesn’t. In Advances in neural information processing systems, pp. 1513–1520. Cited by: §5.
  • G. Skolidis and G. Sanguinetti (2013) Semisupervised multitask learning with gaussian processes. IEEE Transactions on Neural Networks and Learning Systems 24 (12), pp. 2101–2112. Cited by: §5.
  • M. Titsias (2009) Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574. Cited by: §4.
  • A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016) Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378. Cited by: §5.
  • Z. Wu and I. Chen (2019) Regression analysis of dependent binary data for estimating disease etiology from case-control studies. arXiv preprint arXiv:1906.08436. Cited by: §7.
  • J. Zhao and S. Sun (2016) Variational dependent multi-output gaussian process dynamical systems. The Journal of Machine Learning Research 17 (1), pp. 4134–4169. Cited by: §4.
  • Z. Zhou (2017) A brief introduction to weakly supervised learning. National Science Review 5 (1), pp. 44–53. Cited by: §1.

Appendix A Derivations of variational bounds

A.1 Derivations of

Here we use the same notation as in the main article. For notational simplicity, we omit hyperparameters in the derivations unless doing so causes confusion. We use the notation to indicate expectation over a variational distribution .

First, we start with calculating the following expectation that will be used shortly:


Now we derive . Recall that we have


In Eq. (A2), we first focus on the following integral on the exponential


where the last identity is based on Eq. (A1) and the terms for are collected into

Plugging Eq. (A3) into Eq. (A2), we obtain the final form of as


where the last inequality is based on Eq. (1) in the main article.

A.2 Derivations of

To obtain the scalable variational bound , we further introduce to approximate . We first derive the variational marginalized distribution for as

Then, the lower bound is given by

where we use the variational expectation Eq. (A1) for the last identity. We recover Eq. (10) by noting that because is a diagonal matrix.
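The simplification invoked above relies on a standard identity: when one factor of a matrix product is diagonal, its trace collapses to a weighted sum over the other factor's diagonal, so the off-diagonal entries never need to be formed. A minimal numerical check of this identity (with an arbitrary random matrix, not the model's actual covariance matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # arbitrary square matrix
d = rng.standard_normal(4)        # diagonal entries of D
D = np.diag(d)

# Full trace of the product ...
full = np.trace(A @ D)
# ... equals a sum over the diagonal of A weighted by d,
# which costs O(n) instead of O(n^2) once diag(A) is available.
cheap = np.sum(np.diag(A) * d)

assert np.isclose(full, cheap)
```

This is what makes the trace term in the scalable bound cheap to evaluate inside stochastic optimization.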

Appendix B Gradients of Variational Bounds

To utilize gradient-based optimization algorithms, we provide the gradients of the proposed variational bounds with respect to the parameters. In , the parameters to be optimized are: , , , and , where collects the hyperparameters related to the GPs and , denoted by and , respectively. In , we additionally have and .

We first consider , which appears in both and . The related parameters are and , that is, and for and . Because can be expressed as a summation whose terms are independent of each other with respect to the parameters, deriving the partial derivatives is trivial; we omit the derivatives of .

The remaining terms are trickier to differentiate since they involve matrices. To obtain the partial derivatives with respect to parameters inside matrices, we use the notation of Brookes (2001): we define , which denotes the vector obtained by vectorizing the matrix . The partial derivative for each parameter is calculated using the law of total derivative and the chain rule. For example, consider in . The matrices that involve are and . Hence, the partial derivative of is given by . In the following sections, we derive the partial derivatives of the proposed lower bounds with respect to each matrix.
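The total-derivative-plus-vectorization recipe can be checked numerically on a toy case. This sketch is our own illustration, not the paper's actual gradient code: it takes a scalar parameter entering a matrix as K(θ) = θB and a scalar objective f(K) = Σᵢⱼ Kᵢⱼ², and verifies that df/dθ = vec(∂f/∂K)ᵀ vec(∂K/∂θ) against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))   # fixed matrix; K depends on theta through it
theta = 0.7

# Matrix-valued function of a scalar parameter and a scalar objective.
K = theta * B
f = np.sum(K ** 2)

# Chain rule via vectorization: df/dtheta = vec(df/dK)^T vec(dK/dtheta).
df_dK = 2.0 * K                   # elementwise gradient of f w.r.t. K
dK_dtheta = B
grad = df_dK.reshape(-1) @ dK_dtheta.reshape(-1)

# Finite-difference check of the chain-rule gradient.
eps = 1e-6
f_plus = np.sum(((theta + eps) * B) ** 2)
fd = (f_plus - f) / eps
assert abs(grad - fd) < 1e-3
```

In the variational bounds the same pattern applies with each covariance matrix playing the role of K, summed over all matrices that involve the parameter.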

b.1 Matrix Derivatives of

Let denote the first term in , . Note that this term is structurally equivalent to the marginal distribution of the sparse convolved multi-output GP (1) proposed by Álvarez and Lawrence (2011). Our derivation is similar to theirs, so we directly provide the results; readers interested in the detailed derivations are referred to the supplement of Álvarez and Lawrence (2011).

with ; ;