Multiple Output Regression with Latent Noise

In high-dimensional data, structured noise, caused by observed and unobserved factors affecting multiple target variables simultaneously, poses a serious challenge for modeling by masking the often weak signal. Therefore, (1) explaining away the structured noise in multiple-output regression is of paramount importance. Additionally, (2) assumptions about the correlation structure of the regression weights are needed. We note that both can be formulated in a natural way in a latent variable model, in which both the interesting signal and the noise are mediated through the same latent factors. Under this assumption, the signal model then borrows strength from the noise model by encouraging similar effects on correlated targets. We introduce a hyperparameter for the latent signal-to-noise ratio, which turns out to be important for modelling weak signals, and an ordered infinite-dimensional shrinkage prior that resolves the rotational unidentifiability of reduced-rank regression models. Simulations and prediction experiments with metabolite, gene expression, fMRI measurement, and macroeconomic time series data show that our model equals or exceeds state-of-the-art performance and, in particular, outperforms the standard approach of assuming independent noise and signal models.

1 Introduction

Explaining away structured noise is one of the cornerstones of successful modeling of high-dimensional output data in the regression framework (Fusi et al., 2012; Rai et al., 2012; Virtanen et al., 2011; Klami et al., 2013; Rakitsch et al., 2013). Structured noise refers to dependencies between response variables that are unrelated to the dependencies of interest. It is noise caused by observed and unobserved confounders that affect multiple variables simultaneously. Common observed confounders in medical and biological data include the age and sex of an individual, whereas unobserved confounders include, for example, the state of the cell being measured, measurement artefacts influencing multiple probes, or other unrecorded experimental conditions. When not accounted for, structured noise may both hide interesting relationships and result in spurious findings (Leek and Storey, 2007; Kang et al., 2008).

The effects of known confounders can be removed straightforwardly using supervised methods. For the unobserved confounders, a routinely used approach for explaining away structured noise has been to assume a priori independent effects for the interesting and uninteresting factors. For example, in the factor regression setup (Bernardo et al., 2003; Stegle et al., 2010; Fusi et al., 2012), the target variables are assumed to have been generated as

$$Y = XB + Z\Lambda + E, \tag{1}$$

where $Y \in \mathbb{R}^{N \times M}$ is the matrix of target variables (or dependent variables) and $X \in \mathbb{R}^{N \times P}$ contains the covariates (or independent variables), for the $N$ observations. The model parameter matrix $Z$ comprises the unknown latent factors and $\Lambda$ the factor loadings, which are used to model away the structured noise. The term $E$ represents independent unstructured noise. To reduce the effective number of parameters in the regression coefficient matrix $B$, a low-rank structure may be assumed:

$$B = \Psi\Gamma, \tag{2}$$

where the rank $K$ of the parameters $\Psi \in \mathbb{R}^{P \times K}$ and $\Gamma \in \mathbb{R}^{K \times M}$ is substantially lower than the number of target variables $M$ or covariates $P$. The low-rank decomposition (2) of the regression coefficient matrix may be given an interpretation whereby the covariates affect the latent components with coefficients specified in $\Psi$, and the components, in turn, affect the targets with coefficients $\Gamma$.
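To make the low-rank decomposition concrete, here is a minimal sketch in Python with NumPy; the dimensions and all variable names are illustrative, not taken from the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, M, K = 500, 30, 60, 3          # observations, covariates, targets, rank

X = rng.standard_normal((N, P))
Psi = rng.standard_normal((P, K))    # covariates -> latent components
Gamma = rng.standard_normal((K, M))  # latent components -> targets

B = Psi @ Gamma                      # rank-K coefficient matrix, as in (2)
Y = X @ B + 0.1 * rng.standard_normal((N, M))  # model (1) without structured noise

# The effective number of regression parameters drops from P*M = 1800
# to K*(P + M) = 270.
```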

Figure 1: Illustration of (a) a priori independent interesting and uninteresting effects and (b) the latent noise assumption. Latent noise is mediated to the target variable measurements through a common subspace with the interesting effects.

Another line of work in multiple output prediction has focused on borrowing information from the correlation structure of the target variables when learning the regression model. The intuition stems from the observation that correlated targets are often seen to be affected similarly by the covariates, for example in genetic applications (see, e.g., Davis et al., 2014; Inouye et al., 2012). One popular method, GFlasso (Kim et al., 2009), learns the regression coefficients using

$$\hat{B} = \arg\min_{B}\; \|Y - XB\|_2^2 \;+\; \lambda \sum_{k=1}^{M} \|\beta_k\|_1 \;+\; \gamma \sum_{(m,l) \in E} |r_{ml}| \sum_{p=1}^{P} \left|\beta_{pm} - \operatorname{sign}(r_{ml})\,\beta_{pl}\right|, \tag{3}$$

where the $\beta_k$ are the columns of $B$. Two regularization parameters are introduced: $\lambda$ controls the standard Lasso penalty, and $\gamma$ encourages the effects $\beta_{pm}$ and $\beta_{pl}$ of the $p$th covariate on correlated outputs $m$ and $l$ to be similar. Here $r_{ml}$ denotes the correlation between the $m$th and $l$th phenotypes, and $E$ is an a priori specified correlation graph for the output variables, with edges representing correlations to be accounted for in the model.
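As a reading aid, the sketch below evaluates the GFlasso objective (3) for a given coefficient matrix; weighting each edge by $|r_{ml}|$ is one common choice for the correlation function, and the function name and argument layout are ours, not GFlasso's API:

```python
import numpy as np

def gflasso_objective(Y, X, B, edges, lam, gamma):
    """Evaluate the GFlasso objective (3). `edges` lists tuples (m, l, r_ml)
    for output pairs connected in the a priori correlation graph E."""
    loss = np.sum((Y - X @ B) ** 2)
    lasso = lam * np.sum(np.abs(B))            # standard Lasso penalty
    fusion = gamma * sum(                      # graph-guided fusion penalty
        abs(r) * np.sum(np.abs(B[:, m] - np.sign(r) * B[:, l]))
        for m, l, r in edges
    )
    return loss + lasso + fusion
```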

In this paper we propose a model that simultaneously learns the structured noise and encourages the sharing of information between the noise and regression models. To motivate the new model, we note that by assuming independent prior distributions on $B$ and $\Lambda$ in model (1), one implicitly assumes independence of the interesting and uninteresting effects, caused by the covariates $X$ and the unknown factors $Z$, respectively (Fig. 1a). This is a poor assumption, for instance, in molecular biology, where gene expression and metabolomics measurements record concentrations of compounds generated by ongoing biological processes. When predicting the expression from a limited set of covariates, such as single nucleotide polymorphisms (SNPs), the covariates determine the activity of the biological processes only partially and, thus, all other activity of the processes is due to structured “noise”, or unrecorded factors. In such cases, the noise affects the measurement levels through the very same processes as the interesting signal (Fig. 1b), and rather than assuming independence of the effects, an assumption of parallel effects is more appropriate. We refer to this type of noise as latent noise, as it can be considered to affect the same latent subspace as the interesting effects.

A natural way to encode the assumption of latent noise is to use the following model structure:

$$Y = (X\Psi + Z)\,\Gamma + E, \tag{4}$$

where $Z \in \mathbb{R}^{N \times K}$ is a matrix of unknown latent factors. In (4), $\Gamma$ mediates the effects of both the interesting and uninteresting signals on the target variables. We note that the change required in the model structure is small and has in fact been presented earlier (Bo and Sminchisescu, 2009; recently extended with an Indian Buffet Process prior on the latent space by Bargi et al., 2014). We now proceed to using the structure (4) for GFlasso-type sharing of information (3) between the regression and noise models while simultaneously explaining away structured noise. To see that the sharing of information between the noise and regression models follows immediately from model (4), consider simulations generated from the model. The a priori independence assumption of model (1) results in uncorrelated regression weights regardless of the correlations between target variables (Figure 2a). The assumption of latent noise (4), however, encourages the regression weights to be correlated similarly to the target variables (Figure 2c).
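The following sketch illustrates the Figure 2c behaviour by sampling from the priors of model (4): because the loadings $\Gamma$ shape both the structured noise covariance and the effective regression weights $\Psi\Gamma$, weight correlations track target correlations. Dimensions and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
P, M, K, S = 30, 60, 3, 2000     # covariates, targets, rank, prior draws

corr_targets, corr_weights = [], []
for _ in range(S):
    Psi = rng.standard_normal((P, K))
    Gamma = rng.standard_normal((K, M))
    B = Psi @ Gamma              # effective regression weights under (4)
    C = Gamma.T @ Gamma          # structured covariance of the targets
    m, l = rng.choice(M, size=2, replace=False)
    corr_targets.append(C[m, l] / np.sqrt(C[m, m] * C[l, l]))
    corr_weights.append(np.corrcoef(B[:, m], B[:, l])[0, 1])

# Correlated targets get correlated weights; this comes out strongly positive:
print(np.corrcoef(corr_targets, corr_weights)[0, 1])
```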

Figure 2: Conditional distribution of the correlation between regression coefficients, given the correlation between the corresponding target variables. In (a) the model (1) assumes a priori independent regression and noise models, and in (c) the model (4) makes the latent noise assumption; (b) shows a mixture of the models in (a) and (c). The data were generated using equation (17), as described in Section 5.1, where $\alpha$ denotes the relative proportion of latent noise in the data generation. The dashed lines denote the 95% confidence intervals of the conditional distributions.

In this work, we focus on modelling weak signals among structured noise in high-dimensional data. For instance, effects explaining only a few percent of the variance of the target variables can be considered weak. We have argued above that a model with the structure (4) is particularly well-suited for this purpose. Additionally, (i) particular emphasis must be put on defining adequate prior distributions to distinguish the weak effects from noise as effectively as possible, and (ii) scalability to large sample sizes is needed to have any chance of learning the weak effects. For (i), we define the latent signal-to-noise ratio as a generalization of the standard signal-to-noise ratio in the latent space:

$$\text{SNR}_{\text{latent}} = \frac{\operatorname{Var}(X\Psi)}{\operatorname{Var}(Z)}. \tag{5}$$

We use the latent signal-to-noise ratio as a hyperparameter in the modelling, and show that it is a key parameter affecting model performance. It can either be learned or set using prior knowledge. In addition, we introduce an ordered infinite-dimensional shrinkage prior that resolves the inherent rotational ambiguity in the model (4) by sorting both signal and noise components by their importance. Finally, we present efficient inference methods for the model.
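As an illustration of how the latent SNR could be used in practice, the sketch below converts a desired ratio into a variance for the latent noise, under the convention assumed here that the variances in (5) are taken elementwise over the latent matrices; the helper name is hypothetical:

```python
import numpy as np

def sigma_z2_for_latent_snr(X, Psi, snr):
    """Pick the latent-noise variance so that Var(X @ Psi) / Var(Z)
    matches a desired latent SNR; variances are taken elementwise
    over the latent matrices (our convention, not necessarily the paper's)."""
    return np.var(X @ Psi) / snr

# E.g., assuming latent noise explains 5-15x the signal variance (Section 5.1):
# sigma_z2 = sigma_z2_for_latent_snr(X, Psi, snr=1.0 / 10)
```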

2 Related work

Integrating multiple real-valued prediction tasks with the same set of covariates is called multiple-output regression (Breiman and Friedman, 1997); more generally, sharing statistical strength between related tasks is called multitask learning (Baxter, 1996; Caruana, 1997). The data consist of input-output pairs $(x_n, y_n)$, $n = 1, \ldots, N$; the $P$-dimensional input vectors $x_n$ (covariates) are used for predicting the $M$-dimensional vectors $y_n$ of target variables. The common approach to dealing with structured noise due to unobserved confounders is to apply factor regression modeling (1) (Bernardo et al., 2003) and to explain away the structured noise using a noise model that is assumed to be a priori independent of the regression model (Stegle et al., 2010; Fusi et al., 2012; Rai et al., 2012; Virtanen et al., 2011; Klami et al., 2013; Rakitsch et al., 2013). A recent Bayesian reduced-rank regression (BRRR) model (Marttinen et al., 2014) implements the routine assumption of independence of the regression and noise models; we include it in the comparison studies of this paper.

Methods for multiple-output regression without a structured noise model have been proposed in other fields. In genomic selection and multi-trait quantitative trait loci mapping, solutions (Yi and Banerjee, 2009; Xu et al., 2009; Calus and Veerkamp, 2011; Stephens, 2013) for low-dimensional target variable vectors (small $M$) have been proposed, but these methods do not scale up to the currently emerging needs of analyzing higher-dimensional target variable data. Additionally, sparse multiple-output regression models have been proposed for the prediction of phenotypes from genomic data (Kim et al., 2009; Sohn and Kim, 2012).

Many methods for multitask learning have been proposed in the field of kernel methods (Evgeniou and Pontil, 2007). These methods do not, however, scale up to data sets with several thousands of samples, which are required for predicting the weak effects. Other relevant work includes a recent method based on the BRRR presented by Foygel et al. (2012), but it does not scale to the dimensionalities of our experiments either. Methods for high-dimensional phenotypes have been proposed in the field of expression quantitative trait loci mapping (Bottolo et al., 2011), but for the weaker task of finding associations (and avoiding false positives) rather than prediction, which is our main focus. Functional assumptions (Wang et al., 2012) have also been used to constrain related learning problems.

3 Model

In this Section, we present details of the model, show how the hyperparameters can be set using the latent signal-to-noise ratio, and analyze the properties of the infinite-dimensional shrinkage prior.

3.1 Model details: latent-noise BRRR

Our model is given by

$$Y = (X\Psi + Z)\,\Gamma + E, \tag{6}$$

where $Y \in \mathbb{R}^{N \times M}$ contains the $M$-dimensional response variables for the $N$ observations, and $X \in \mathbb{R}^{N \times P}$ contains the predictor variables. The product $\Psi\Gamma$ of $\Psi \in \mathbb{R}^{P \times K}$ and $\Gamma \in \mathbb{R}^{K \times M}$ results in a regression coefficient matrix with rank $K$. The matrix $Z \in \mathbb{R}^{N \times K}$ contains unknown latent factors representing the latent noise. Finally, $E = (e_1, \ldots, e_N)^{\top}$, with $e_n \sim N(0, \Sigma)$ and $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_M^2)$, is a matrix of uncorrelated target variable-specific noise vectors. Figure 3 displays the structure of the model graphically. In the figure, the node corresponding to the parameter $\Gamma$, which is shared by the regression and noise models, is highlighted in green.

Similarly to a recent BRRR model (Marttinen et al., 2014) and the Bayesian infinite sparse factor analysis model (Bhattacharya and Dunson, 2011), we assume the number of components connecting the covariates to the targets to be infinite. Accordingly, the number of rows in the weight matrix $\Gamma$, and the numbers of columns in $\Psi$ and $Z$, are infinite. The low-rank nature of the model is enforced by shrinking the columns of $\Psi$ and $Z$ and the rows of $\Gamma$ increasingly with growing column/row index, such that only a small number of columns/rows are influential in practice. The increasing shrinkage also resolves rotational nonidentifiability by enforcing the model to mediate the strongest effects through the first columns/rows. In Section 3.3 we explore the basic properties of the infinite-dimensional prior to ensure its soundness.
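For intuition about the ordered shrinkage, the following sketch draws the global shrinkage weights of a multiplicative gamma process prior of the form (7) (Bhattacharya and Dunson, 2011) and shows that the implied column variances decay with the component index; the hyperparameter values 3 and 4 follow Section 6.1:

```python
import numpy as np

rng = np.random.default_rng(2)
a1, a2, H = 3.0, 4.0, 20     # hyperparameter values as in Section 6.1

delta = np.concatenate([rng.gamma(a1, 1.0, size=1),
                        rng.gamma(a2, 1.0, size=H - 1)])
tau = np.cumprod(delta)      # tau_h = prod_{l<=h} delta_l, as in (7)

# Column h of Psi and Z (and row h of Gamma) has prior variance ~ 1/tau_h,
# which decays (stochastically) with h, ordering components by importance:
print(1.0 / tau)
```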

Figure 3: Graphical representation of latent-noise BRRR. The observed data are denoted by black circles, variables related to the reduced-rank regression part of the model by white circles, variables related only to the noise model by gray circles, and variables related to both the regression and the structured noise model by green circles. The matrix $\Phi$ comprises the local sparsity parameters of the components for the target variables.

The hierarchical priors for the projection weight matrix $\Gamma = (\gamma_{hm})$, where $h = 1, \ldots, \infty$ and $m = 1, \ldots, M$, are set as follows:

$$\gamma_{hm} \mid \phi_{hm}, \tau_h \sim N\!\left(0, \phi_{hm}^{-1}\tau_h^{-1}\right), \quad \phi_{hm} \sim \operatorname{Ga}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right), \quad \tau_h = \prod_{l=1}^{h} \delta_l, \quad \delta_1 \sim \operatorname{Ga}(a_1, 1), \quad \delta_l \sim \operatorname{Ga}(a_2, 1), \; l \geq 2. \tag{7}$$

Here $\tau_h$ is a global shrinkage parameter for the $h$th row of $\Gamma$, and the $\phi_{hm}$ are local shrinkage parameters for the individual elements of $\Gamma$, providing additional flexibility over the global shrinkage prior. The same parameters are used to shrink the columns of the matrices $\Psi$ and $Z$, because the scales of $\Gamma$ and $\Psi$ (or $Z$) are not identifiable separately:

$$\psi_h \sim N\!\left(0, \tau_h^{-1} I_P\right), \qquad z_h \sim N\!\left(0, \sigma_Z^2\,\tau_h^{-1} I_N\right),$$

where $\sigma_Z^2$ is a parameter that specifies the amount of latent noise, and is used to regularize the model (see the next Section). With the priors specified, the hidden factors $Z$ can be integrated out analytically, yielding

$$y_n \mid x_n, \Psi, \Gamma \sim N\!\left(\Gamma^{\top}\Psi^{\top} x_n,\; \sigma_Z^2\,\tilde{\Gamma}^{\top}\tilde{\Gamma} + \Sigma\right), \tag{8}$$

where $\tilde{\Gamma}$ is obtained from $\Gamma$ by multiplying the rows of $\Gamma$ with the shrinkages of the columns of $Z$, that is, $\tilde{\Gamma} = \operatorname{diag}\!\left(\tau_1^{-1/2}, \tau_2^{-1/2}, \ldots\right)\Gamma$.

Finally, conjugate prior distributions

$$\sigma_m^{-2} \sim \operatorname{Ga}\!\left(a_{\sigma}, b_{\sigma}\right), \qquad m = 1, \ldots, M, \tag{9}$$

are placed on the noise parameters of the target variables.

3.2 Regularization of latent-noise BRRR through the variance of $Z$

The latent signal-to-noise ratio in Equation (5) has an intuitive interpretation: given our prior distributions for $\Psi$ and $Z$, the prior latent SNR indicates the extent to which we believe the noise to explain variation in $Y$, as compared to the variance explained by the covariates $X$. Thus, the latent SNR acts as a regularization parameter: by allowing the latent variables $Z$ to have a large variance, the data will be explained by the noise model rather than the covariates. We note that this approach to regularization is non-standard, and it may have favourable characteristics compared to the commonly used L1/L2 regularization of regression weights. First, the regression weights remain relatively unbiased, as they need not be forced to zero to control overfitting. More importantly, when regularizing with an a priori selected latent SNR, the regularization parameter itself remains interpretable: every value of the variance parameter $\sigma_Z^2$ of $Z$ can be immediately interpreted as the percentage of variance explained by the covariates as compared to the noise model. Making similar interpretations for L1/L2 regularization parameters is not straightforward. In our experiments, we use cross-validation to select the variance of $Z$. However, the interpretability of the parameter makes it easy to express beliefs about plausible values based on prior knowledge.

3.3 Proofs of the soundness of the infinite prior

In this Section we verify the soundness of both the infinite non-parametric prior, which we introduce for ordering the components according to decreasing importance, and the computational approximation resulting from truncation of the infinite model.

It has been proven that in Bayesian factor models, conditions on the shrinkage hyperparameters (in our case $\nu$ and $a_2$, defined in eqn 7) are sufficient for the elements of the covariance matrix $\Lambda^{\top}\Lambda$ to have finite variance in a Bayesian factor model, even if an infinite number of columns with a prior similar to our model is assumed for $\Lambda$ (Bhattacharya and Dunson, 2011). In this Section we present similar characteristics of the infinite reduced-rank regression model. The detailed proofs can be found in the Supplementary material. First, in analogy to the infinite Bayesian factor analysis model, we show that

$$\nu > 2 \quad \text{and} \quad (a_2 - 1)(a_2 - 2) > 1 \tag{10}$$

is sufficient for the prediction of any of the response variables to have finite variance under the prior distribution (Proposition 1). Second, we show that the underestimation of uncertainty (variance) resulting from using a finite-rank approximation to the infinite reduced-rank regression model decays exponentially with the rank of the approximation (Proposition 2). For notational clarity, let $\psi_h$ denote in the following the $h$th column of the matrix $\Psi$. With this notation, a prediction for the $m$th response variable can be written as

$$\hat{y}_m = \sum_{h=1}^{\infty} \left(x^{\top}\psi_h\right)\gamma_{hm}.$$

Furthermore, let $\Gamma(\cdot)$ denote below the gamma function (not to be confused with the matrix $\Gamma$ used in all other Sections of this paper).

Proposition 1 (Finite variance of predictions): Suppose that $\nu > 2$ and $(a_2 - 1)(a_2 - 2) > 1$. Then

$$\operatorname{Var}\!\left(\hat{y}_m\right) < \infty. \tag{11}$$

A detailed proof is provided in the Supplementary material.

Proposition 2 (Truncation error of the finite-rank approximation): Let $\hat{y}_m^{(H)}$ denote the prediction for the $m$th phenotype when using approximations of $\Psi$ and $\Gamma$ consisting of their first $H$ columns or rows only, respectively. Then,

$$\operatorname{Var}\!\left(\hat{y}_m\right) - \operatorname{Var}\!\left(\hat{y}_m^{(H)}\right) \leq C\, r^{H}$$

for some constant $C > 0$ and rate $0 < r < 1$; that is, the reduction in the variance of the prediction resulting from using the approximation, relative to the infinite model, decays exponentially with the rank of the approximation. A detailed proof is provided in the Supplementary material.
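A numerical illustration of Proposition 2, under the simplification that only the global shrinkage is considered: because component $h$ is shrunk by $\tau_h^{-1}$ in both $\psi_h$ and $\gamma_h$, its prior variance contribution involves $E[\tau_h^{-2}]$, which for Gamma$(a, 1)$ factors forms a geometric sequence:

```python
import numpy as np

a1, a2 = 3.0, 4.0            # shrinkage hyperparameters, as in Section 6.1
h = np.arange(1, 21)

# E[tau_h^{-2}] = prod_l E[delta_l^{-2}], and E[delta^{-2}] = 1/((a-1)(a-2))
# for delta ~ Gamma(a, 1), so per-component variance contributions are geometric:
contrib = (1.0 / ((a1 - 1) * (a1 - 2))) * (1.0 / ((a2 - 1) * (a2 - 2))) ** (h - 1)

# Variance lost by truncating the model at rank H decays exponentially:
tail = contrib[::-1].cumsum()[::-1]
print(tail[:5] / tail[0])    # geometric decay, with ratio 1/6 here
```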

4 Efficient computation by reparameterization

For estimating the parameters of latent-noise BRRR, we use Gibbs sampling, updating the parameters one by one by sampling them from their conditional posterior probability distributions given the current values of all other parameters. The computational bottleneck is the update of the matrix $\Psi$, and below we present a novel, efficient update for this parameter.

Update of $\Psi$

The conditional distribution of the parameter matrix $\Psi$ of latent-noise BRRR can be derived using a standard result for Bayesian linear models (Bishop et al., 2006), which states that if

$$y = Aw + \varepsilon, \qquad \varepsilon \sim N\!\left(0, \Lambda^{-1}\right), \qquad w \sim N\!\left(0, S_0\right), \tag{12}$$

then

$$w \mid y \sim N\!\left(\mu, S\right), \tag{13}$$

where

$$S = \left(S_0^{-1} + A^{\top}\Lambda A\right)^{-1}, \qquad \mu = S A^{\top}\Lambda y. \tag{14}$$

Because in our model (6) the columns of the noise matrix $E$ are assumed independent with variances $\sigma_1^2, \ldots, \sigma_M^2$, we get

$$\operatorname{vec}\!\left(Y - Z\Gamma\right) = \left(\Gamma^{\top} \otimes X\right)\operatorname{vec}(\Psi) + \operatorname{vec}(E), \qquad \operatorname{Cov}\!\left(\operatorname{vec}(E)\right) = \Sigma \otimes I_N. \tag{15}$$

Thus, by substituting $A = \Gamma^{\top} \otimes X$ into (12), together with the prior covariance derived from (7), we immediately obtain the posterior of $\operatorname{vec}(\Psi)$ from (13) and (14).
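The following sketch implements the naive update (12)-(15) directly, conditioning on the current $Z$ (the full method also integrates $Z$ out; see Equation (8)). It is meant only to make the vectorization explicit; its cost is exactly the bottleneck that the reparameterization of this Section removes, and the function and argument names are ours:

```python
import numpy as np

def sample_Psi_naive(Y, X, Z, Gamma, sigma2, prior_prec, rng):
    """Naive Gibbs update of vec(Psi) via (12)-(15).
    prior_prec: length P*K vector of prior precisions of vec(Psi) from (7),
    i.e., tau_k repeated over the P rows of column k.
    Cost: O((P*K)^3) -- the bottleneck removed by the reparameterization."""
    N, P = X.shape
    K = Gamma.shape[0]
    R = Y - Z @ Gamma                    # residual to be explained by X Psi Gamma
    A = np.kron(Gamma.T, X)              # design matrix of vec(Psi), eq. (15)
    Lam = np.kron(np.diag(1.0 / sigma2), np.eye(N))        # precision of vec(E)
    S = np.linalg.inv(np.diag(prior_prec) + A.T @ Lam @ A)  # eq. (14)
    mu = S @ (A.T @ Lam @ R.flatten(order="F"))
    return rng.multivariate_normal(mu, S).reshape((P, K), order="F")
```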

Updates of the shrinkage parameters and $Z$

The updates of the shrinkage hyperparameters are the same as in Bayesian reduced-rank regression, and their conditional posterior distributions can be found in the Supplementary material of Marttinen et al. (2014). The latent factor matrix $Z$ has the same conditional posterior distribution as the corresponding model parameter of Marttinen et al. (2014).

Improved update of $\Psi$

The computational bottleneck of the naive Gibbs sampler is the update of the parameter $\Psi$, which has $PK$ elements with a joint multivariate Gaussian distribution, conditionally on the other parameters (Geweke, 1996; Marttinen et al., 2014). Thus, the inversion of the precision matrix of the joint distribution has a computational cost of $O\!\left(P^3K^3\right)$. To remove the bottleneck, we reparameterize our model, after which a linear algebra trick by Stegle et al. (2011) can be used to reduce the computational cost of the bottleneck to $O\!\left(P^3 + K^3\right)$. When sampling $\Psi$, we also integrate over the distribution of $Z$, following the standard result from Equation (8). The reparameterization and the new posteriors are presented in the Supplementary material.

In brief, the eigenvalue decomposition of a matrix of the form

$$A \otimes B + \lambda I \tag{16}$$

can be evaluated inexpensively from the eigenvalue decompositions of the small matrices $A$ and $B$. Through the reparameterization, the posterior covariance matrix of $\operatorname{vec}(\Psi)$ becomes of the form (16), and the eigenvalue decomposition is then used to generate samples from the posterior distribution of $\Psi$. We note that the trick can also be applied to the original formulation of the Bayesian reduced-rank regression model by Geweke (1996), and the R code published with this article allows generating samples from the original model as well. In the next Section, we compare the computational cost of the algorithm using the naive Gibbs sampler and the improved version that uses the new parameterization.
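A minimal sketch of the linear algebra identity behind the speed-up (Stegle et al., 2011): the eigenvalue decomposition of (16) follows from the eigenvalue decompositions of the small factors. In actual use one would never form the full Kronecker products; they are built here only for the check:

```python
import numpy as np

def kron_eig(A, B, lam):
    """Eigendecomposition of kron(A, B) + lam * I built from the factors:
    O(p^3 + q^3) work instead of O((p*q)^3)."""
    sA, UA = np.linalg.eigh(A)
    sB, UB = np.linalg.eigh(B)
    return np.kron(sA, sB) + lam, np.kron(UA, UB)

# Sanity check on small random symmetric matrices:
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)); A = A @ A.T
B = rng.standard_normal((3, 3)); B = B @ B.T
evals, U = kron_eig(A, B, 0.5)
target = np.kron(A, B) + 0.5 * np.eye(12)
print(np.allclose(U @ np.diag(evals) @ U.T, target))   # True
```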

5 Simulation experiments

In Section 5.1 we investigate the impact of the noise model assumptions on simulated data. In Section 5.2 we confirm the computational speedup resulting from the new algorithm introduced in Section 4.

5.1 Impact of the noise model assumptions

We study the performance of the models on simulated data sets generated from a continuum of models between the two extremes of assuming either latent noise or a priori independent regression and noise models. The synthetic data are generated according to

$$Y = X\Psi\Gamma + \sqrt{\alpha}\, Z\Gamma + \sqrt{1-\alpha}\, \tilde{Z}\Lambda + E, \tag{17}$$

where the parameter $\alpha \in [0, 1]$ defines the proportion of the structured noise attributed to the latent noise $Z\Gamma$, and $\tilde{Z}\Lambda$ is structured noise generated independently of the signal. We study a continuum of problems with values of $\alpha$ ranging from 0 to 1. The parameters $\Gamma$ and $\Lambda$ are orthogonalized using Gram-Schmidt orthogonalization. The parameters are scaled so that the covariates explain 3% of the variance of $Y$ through $X\Psi\Gamma$, the diagonal Gaussian noise explains 20% of the total variance of $Y$, and the structured noise explains the remaining 77% of the total variance of $Y$. The simulation was repeated 100 times, and training data sets of 500 and 2000 samples were generated for each replicate. To compare the methods, the mean squared error (MSE) of the models learned with each method was compared to that of the true model on a test set of 15,000 samples. The number of covariates was fixed to 30 and the number of dependent variables to 60. The rank of the regression coefficient matrix and of the structured noise was fixed to 3 when simulating the data sets.
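A sketch of this simulation design, as reconstructed in (17); the original's exact scaling scheme is not specified here, so each component is simply rescaled to its stated share of the total variance, and the Gram-Schmidt step is omitted:

```python
import numpy as np

rng = np.random.default_rng(4)
N, P, M, K = 500, 30, 60, 3
alpha = 0.5                          # proportion of latent noise, eq. (17)

X = rng.standard_normal((N, P))
Psi = rng.standard_normal((P, K))
Gamma = rng.standard_normal((K, M))  # shared loadings (signal and latent noise)
Lam = rng.standard_normal((K, M))    # loadings of the independent structured noise

signal = X @ Psi @ Gamma
structured = (np.sqrt(alpha) * rng.standard_normal((N, K)) @ Gamma
              + np.sqrt(1 - alpha) * rng.standard_normal((N, K)) @ Lam)
diag_noise = rng.standard_normal((N, M))

def scale_to(component, share):
    """Rescale a component so it contributes `share` of unit total variance."""
    return component * np.sqrt(share / np.var(component))

# Variance shares from Section 5.1: signal 3%, structured 77%, diagonal 20%.
Y = scale_to(signal, 0.03) + scale_to(structured, 0.77) + scale_to(diag_noise, 0.20)
```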

The ranks of all methods were fixed to the true value. For latent-noise BRRR, the variance $\sigma_Z^2$ of $Z$ was selected using 10-fold cross-validation. The grid was specified so that it corresponds to latent signal-to-noise ratios ranging from 1/15 to 1/5, chosen according to the interpretation given in Section 3.2: it corresponds to assuming that the latent noise explains 5 to 15 times the variance explained by the covariates.

Figures 4 (a) and (b) present the results of the simulation study with training sets of 500 and 2000 samples, respectively. When the structured noise is generated according to the conventional assumption of independent signal and noise, the model making the independence assumption performs equally well to the true model with both 500 and 2000 samples. However, as the assumption is violated and the proportion of latent noise increases, the performance of the model with the a priori independence assumption quickly breaks down, whereas the model assuming latent noise performs consistently well. Methods that do not explain away the structured noise are unable to outperform the null model with the training set of 500 samples. When the number of training samples is increased to 2000, these models outperform the model assuming a priori independent signal and noise.

The naive assumption of a priori independent signal and noise thus appears to harm performance even with a large training data set of 2000 samples, as the models that do not address the structured noise at all clearly perform better. This emphasizes the need to take latent noise into account; even with thousands of samples, a relatively weak but wrong prior assumption can result in low performance.

Figure 4 also shows results for an alternative novel model that shares information between the noise and regression models (correlated BRRR; see the Supplementary material for a detailed description). The model includes a separate noise model for the structured noise, as in (1), but achieves the information sharing by assuming a joint prior for the noise and regression models. In detail, conditional on the noise model, the current residual correlation matrix between the response variables is used as a prior for the rows of $\Gamma$. This way the correlations between target variables are propagated into the corresponding regression weights; however, the strongest noise components are not automatically coupled with the strongest signal components. Notably, the performance of the correlated BRRR model is very similar to that of the regular BRRR model, which has no dependence between the noise and signal components. This underlines the importance of explicitly coupling the strongest signal and noise components, as in latent-noise BRRR, by enforcing them to use shared regression coefficient vectors.

Figure 4: Performance of different methods, compared to the true model, as a function of the proportion of latent noise, with a training set of (a) 500 and (b) 2000 samples. The x-axis indicates the proportion of noise generated according to the latent noise assumption (100% corresponds to $\alpha = 1$).

5.2 Runtime of the new algorithm for BRRR models

To confirm the computational speed-up resulting from the reparameterization presented in Section 4, we performed an experiment where the algorithm implementing the naive Gibbs sampling updates for the Bayesian reduced-rank regression (Geweke, 1996; Karlsson, 2012) was compared with the new algorithm that uses the reparameterization. Similar improvements were achieved with the new models.

Ten toy data replicates were generated from the prior. The number of samples in the training set was fixed to 5000 and the number of target variables was set to 12. Rank of the regression coefficient matrix was 2. Runtime was measured as a function of the number of covariates, which was varied from 100 to 300; 1000 posterior samples were generated. The new algorithm that reparameterizes the model clearly outperforms the naive Gibbs sampler (Figure 5). As a sanity check, the algorithms were compared in terms of the estimated regression coefficient matrices, which were found to be similar.

Figure 5: Runtime of the algorithm implementing the naive Gibbs sampler and of the new algorithm that reparameterizes the model.

6 Genomic experiments

In this Section, we evaluate the performance of the latent-noise BRRR in gene expression prediction, metabolomics prediction and an exemplifying multivariate association testing task. The impacts of the latent signal-to-noise ratio and the number of samples are investigated. Data from two cohorts, DILGOM and NFBC1966 (for details, see below), are used in the experiments. SNPs known to be associated with lipid metabolism (Teslovich et al., 2010; Kettunen et al., 2012; Global Lipids Genetics Consortium, 2013) are used as the covariates. We also compare the runtimes of different methods. Throughout the Section, mean squared error (MSE) of test data is used as the test score. The MSE is computed by dividing the data sets into 10 non-overlapping folds, and predicting one fold at a time while using the other folds as the training data. The mean over the test folds is used as the final score. The effect of the number of observations is studied by downsampling the training set while keeping the test sets fixed.
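The evaluation protocol can be summarized as in the sketch below; `fit` and `predict` are placeholders for any of the compared methods:

```python
import numpy as np

def cv_mse(X, Y, fit, predict, n_folds=10, seed=0):
    """Test MSE over non-overlapping folds, as used throughout Section 6:
    each fold is predicted by a model trained on the remaining folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for k, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = fit(X[train], Y[train])
        scores.append(np.mean((Y[test] - predict(model, X[test])) ** 2))
    return float(np.mean(scores))
```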

6.1 Methods included in comparison

We compare latent-noise BRRR with the state-of-the-art sparse multiple-output regression method Graph-guided Fused Lasso (’GFlasso’) (Kim and Xing, 2009), the BRRR/factor regression model (Marttinen et al., 2014) with and without the a priori independent noise model (’BRRR’, ’BRRR without noise model’), a standard Bayesian linear model (’blm’) (Gelman et al., 2004), elastic net-penalized multi-task learning (’L2/L1 MTL’), and a baseline method of predicting with the target data mean. GFlasso constitutes a suitable comparison as it encourages sharing of information between correlated responses, as our model does, but within the Lasso-type penalized regression framework, without a noise model to explain away the structured noise. L2/L1 MTL is a multitask regression method implemented in the glmnet package (Friedman et al., 2010) that allows elastic net regularization and also does not use a noise model to explain away confounders. The blm, on the other hand, was selected as a simple single-task baseline.

Parameters for the different methods were specified as follows:

  • GFlasso: The regularization parameters of the gw2 model were selected from the default grid using cross-validation. The method has been developed for genomic data, suggesting that the default values should be appropriate. However, for the NFBC1966 data, we were unable to run the method with the smallest values of the regularization parameters (110, 60, 10) due to lengthy runtimes with these values. To save computation time, we predicted only 2 randomly selected validation folds out of the 10 possible. With these computational compromises, the average training time for the largest training data sets was 650 h. With the NFBC1966 data, the pre-specified correlation network required by GFlasso was constructed to match the VLDL, IDL, LDL, and HDL metabolite clusters from Inouye et al. (2012). Within these clusters, the correlation network was fixed to the empirical correlations, and to 0 otherwise. With the DILGOM data, we used the empirical correlation network, with correlations below 0.8 set to 0 to reduce the number of edges in the network for computational speedup.

  • BRRR, BRRR without noise model: The hyperparameters $a_1$ and $a_2$ of all BRRR models were fixed to 3 and 4, respectively. 1,000 MCMC samples were generated and 500 were discarded as burn-in. In preliminary tests, similar results were obtained with 50,000 samples. The remaining samples, thinned by a factor of 10, were used for prediction. The truncation point of the infinite-rank BRRR model was learned using cross-validation.

  • latent-noise BRRR: With the NFBC1966 data, the latent signal-to-noise ratio was simply fixed to a value corresponding to the prior assumption of SNPs explaining about 1% of the variation of the phenotypes. A sensitivity analysis for this parameter is presented in Section 6.4. For the DILGOM data, the latent signal-to-noise ratio was selected using cross-validation. Other parameters, including the number of iterations, were set as for the BRRR.

  • blm: The variance hyperparameter of the blm was integrated over using MCMC. 1,000 posterior samples were generated and 500 were discarded as burn-in.

  • L2/L1 MTL: The effects of different types of regularization penalties are an active research topic, and we ran a continuum of mixtures of L1 and L2 penalties ranging from group lasso to ridge regression. The mixture parameter controlling the balance between L1 and L2 regularization was evaluated on the grid [0, 0.1, …, 0.9, 1.0] and selected using 10-fold cross-validation.

6.2 Data sets

The DILGOM data set (Inouye et al., 2010) consists of genome-wide SNP data along with metabolomics and gene expression measurements. For details concerning metabolomics and gene expression data collection, see Soininen et al. (2009) and Kettunen et al. (2012). In total, 509 individuals had all three measurement types. The DILGOM metabolomics data comprise 137 metabolites, most of which represent NMR-quantified levels of lipoproteins classified into 4 subclasses (VLDL, IDL, LDL, HDL), together with quantified levels of amino acids, some serum extracts, and a set of quantities derived as ratios of the aforementioned metabolites. All 137 metabolites were used simultaneously as prediction targets. In gene expression prediction, in total 387 probes corresponding to curated gene sets of 8 KEGG lipid metabolism pathways were used as the prediction targets. The average number of probes in a pathway was 48, and the pathways were treated independently. For details about the pathways, see the Supplementary material.

The NFBC1966 data comprises genome-wide SNP data along with metabolomics measurements for a cohort of 4,702 individuals (Rantakallio, 1969; Soininen et al., 2009). With these data, 96 metabolites belonging to the subclasses VLDL, IDL, LDL and HDL (Inouye et al., 2012) were used as the phenotype.

Effects of age, sex, and lipid-lowering medication were regressed out from the metabolomics data as a preprocessing step. For the genotype data, SNPs with a low minor allele frequency (below 0.01) were removed.

6.3 Results: NFBC1966 metabolomics prediction

Figure 6 presents test data MSE and training times for the different methods. With all training set sizes, latent-noise BRRR outperforms the other methods. Methods blm and BRRR without noise model perform worse than the baseline (null model) even with the largest training data containing 3761 individuals.

Figure 6: (a) Test data MSE for different amounts of training data, and (b) computation times of the methods for different training set sizes, on the NFBC1966 metabolomics data.

6.4 Results: Impact of the latent SNR assumption

To study the sensitivity of latent-noise BRRR to the value of the hyperparameter modelling the latent signal-to-noise ratio, we carried out an additional study. We re-ran the analysis of the NFBC1966 data with the largest training data size on a dense grid of the hyperparameter $\sigma_Z^2$ specifying the latent SNR. Figure 7 presents the results of the sensitivity analysis: test data MSE for NFBC1966 is shown as a function of $\sigma_Z^2$ and model rank. We see that the a priori set latent SNR assumption has a dramatic impact on performance. The optimal values correspond to assuming that the SNPs explain on the order of 1% of the variance of the phenotypes, which is in good agreement with previous estimates of the amount of variance predictable by the SNPs (Kettunen et al., 2012). The value learned from data with cross-validation (CV) gives slightly worse results than the best a priori setting, but is better than the alternatives.

Figure 7: Sensitivity to the a priori set latent signal-to-noise ratio. The results are test data MSE on NFBC1966 ($N = 3761$) as a function of the a priori latent SNR and model rank. The horizontal lines show the baseline (null model), the best comparison method (GFlasso), and the results with the SNR selected by cross-validation.

6.5 Results: DILGOM metabolomics and gene expression prediction

On the other data set, DILGOM, which includes two data types, gene expression and metabolomics, the limited number of samples ($N = 506$) made the prediction of the weak effects challenging for any method. Indeed, we noticed that the null model using the average training data value for prediction was better than any other method in terms of MSE over all variables. However, a detailed investigation of the results revealed that while many of the metabolites/probes could not be predicted at all (as indicated by worse-than-null-model MSE), some of the metabolites/expression probes could still be predicted better than the null model. Conclusions regarding model performance could therefore still be made by focusing the analysis on (1) the number of expression probes/metabolites that could be predicted better than the null model, and (2) the MSE computed over the predictable probes/metabolites (i.e., those that could be predicted better than the null model by at least one method).

Results of the metabolite prediction experiment with the DILGOM data are presented in Table 1. Latent-noise BRRR outperforms all other methods, measured both in terms of test data MSE for the predictable metabolites and the number of metabolites predicted better than the null model. Latent-noise BRRR has predictive power for 58 out of the 137 target variables, whereas GFlasso improves predictions as compared to using the target data mean for 38 target variables, BRRR for only 3 target variables, and the other methods have no predictive power for any of the target variables. In terms of MSE over all the metabolites, mean prediction (null model) and L2/L1 MTL outperform all other methods; L2/L1 MTL selects, through cross-validation, values of the regularization parameters that set all regression weights to zero, effectively reducing the model to the null model.



Method                      MSE on predictable   No. better than   MSE over all
                            metabolites          null model        metabolites
latent-noise BRRR           1.027051             58                1.021744
BRRR                        1.078377             3                 1.084020
GFlasso                     1.027175             38                1.021800
BRRR without noise model    1.246208             0                 1.334456
L2/L1 MTL                   1.027583             0                 1.021541
null model                  1.027583             0                 1.021541
blm                         1.557899             0                 1.551917
Table 1: Metabolite prediction results (DILGOM). Latent-noise BRRR predicts 58/137 metabolites better than the null model, GFlasso 38/137 and BRRR 3/137. Other methods are unable to improve performance in any part of the target variable data, compared to the null model.

Results of the gene expression prediction experiment are presented in Table 2. Latent-noise BRRR and BRRR perform equally well with respect to the MSE of the predictable expression probes. In terms of the number of probes predicted better than the null model, latent-noise BRRR is again the best. In terms of the total MSE, the null model and L2/L1 MTL outperform all other methods, and L2/L1 MTL sets all regression weights to zero based on the cross-validation. It is reassuring to see the good performance of the BRRR in this task, because the routine preprocessing of gene expression data is to remove effects corresponding to the first PCA components, which is very similar to what the structured noise model in the BRRR achieves.


Method                      MSE on predictable   No. better than   MSE over all
                            probes               null model        probes
latent-noise BRRR           1.009831             123               1.01076
BRRR                        1.009831             111               1.01137
GFlasso                     1.010133             103               1.01064
BRRR without noise model    1.015306             37                1.01688
L2/L1 MTL                   1.010077             0                 1.01006
null model                  1.010077             0                 1.01006
blm                         1.047237             2                 1.05529
Table 2: Gene expression prediction results (DILGOM). Out of the 387 gene expression probes, latent-noise BRRR predicts 123 better than the null model, BRRR 111 and GFlasso 103. The other methods improve predictive power as compared to the null model for very few target variables.

6.6 Results: multivariate association detection

Detection of associations between multiple SNPs and metabolites is a topic that has received attention recently (see, e.g., Kim et al., 2009; Inouye et al., 2012; Marttinen et al., 2014). Here we demonstrate the potential of the new method in this task using two illustrative example genes for which some ground truth is available. Associations between SNPs within two genes, LIPC and XRCC4, and the metabolites in the NFBC1966 data are investigated in the experiment. LIPC was selected as a reference, because it is one of the most strongly lipid-associated genes. In contrast, XRCC4 was discovered only recently using three cohorts of individuals (Marttinen et al., 2014), and it was selected to serve as an example of a complex association detectable only by associating multiple SNPs with multiple metabolites, and not visible to simpler methods.

We use the proportion of total variance explained (PTVE) as the test score (Marttinen et al., 2014), and sample 100 permutations to measure the power to detect the associations. Furthermore, we use downsampling to evaluate the impact of the amount of training data. For comparison, we select the BRRR, exhaustive pairwise (univariate) linear regression (’lm’), and canonical correlation analysis (CCA) (Ferreira and Purcell, 2009), these being the methods that have been proposed for the task and that have a sensible runtime in putative genome-wide applications. For lm, the minimum p-value of the regression coefficient over all SNP-metabolite pairs, and for CCA, the minimum p-value over all SNPs (each SNP associated with all metabolites jointly), are used as the test scores. The association involving the XRCC4 gene was originally detected using the BRRR model; however, unlike here, informative priors were used for the regression coefficients.
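The permutation scheme can be sketched as follows; `score` is a placeholder for the test score (e.g., PTVE), and the returned proportion corresponds to the entries of Table 3:

```python
import numpy as np

def permutation_rank(X, Y, score, n_perm=100, seed=0):
    """Proportion of permuted-data test scores below the observed score;
    1.0 means the observed score exceeds every permutation (cf. Table 3)."""
    rng = np.random.default_rng(seed)
    observed = score(X, Y)
    perms = [score(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    return float(np.mean([s < observed for s in perms]))
```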

Table 3 presents the ranking of the original data among the permuted data with different sample sizes and methods. As expected, all methods were able to detect the association involving LIPC with both training set sizes. However, latent-noise BRRR had clearly the highest power to detect the XRCC4 gene.



                     XRCC4             LIPC
                     4702     2351     4702     2351
latent-noise BRRR    0.98     0.94     1        1.00
BRRR                 0.41     0.32     1        0.99
lm                   0.62     0.74     1        1.00
cca                  0.20     0.24     1        1.00
Table 3: Power of different methods to detect the association to XRCC4 and LIPC genes with 4702 and 2351 samples. All methods detect the association to LIPC but only latent-noise BRRR detects the association to XRCC4. Power is measured as the proportion of association test scores in permuted data sets smaller than the test score in the original data set. Value 1 indicates that the association score of the unpermuted data was higher than the score in any permutation.

7 Discussion

In this work, we evaluated the performance of multiple output regression with different assumptions for the structured noise. While most existing methods assume a priori independence of the interesting effects and the uninteresting structured noise, we started from the opposite assumption of strong dependence between the components of the model. This assumption is more accurate for instance for the molecular biological data sets often analyzed with such methods. Using simulations we demonstrated the harmfulness of the independence assumption when latent noise was present. In real data experiments the model assuming latent noise outperformed state-of-the-art methods in prediction of metabolite and gene expression measurements from genotype (SNP) data. In an illustrative multivariate association detection task, the latent noise model had increased power to detect associations invisible to other methods. To better address the computational needs, we presented a new algorithm reducing the runtime considerably and improving the scalability of the BRRR models as the number of variables increases. The prior distributions were parameterized in terms of the new concept of latent signal-to-noise ratio, which was a key ingredient for optimal model performance. In addition, the rotational unidentifiability of the model was solved using ordered infinite-dimensional shrinkage priors.

The new model implementing the concept of latent noise was studied using high-dimensional data containing weak signal (weak effects). The new model exploits a ubiquitous characteristic of such data: while the interesting effects are weak, the noise is strong. Latent-noise BRRR borrows statistical strength from the noise model to facilitate learning of the weak effects, by automatically enforcing the regression coefficients on correlated target variables to be correlated. This intuitive characteristic can be seen as a counterpart, in the target variable space, of the powered correlation priors (Krishna et al., 2009): Krishna et al. used the correlation structure of the covariates as a prior for the regression weights to enforce correlated covariates to have correlated weights.

Latent-noise BRRR is an extension of several common model families. By removing the covariates, the model reduces to a standard factor analysis model, which explains the output data with underlying factors. Thus, latent-noise BRRR can be seen as a reversed analogy of PCA regression (Bernardo et al., 2003), in which components of the input space are used as covariates in prediction; in latent-noise BRRR, components derived from the output space are predicted using the covariates (see Bo and Sminchisescu, 2009). Allowing the noise term to affect the latent space directly results in interesting connections to linear mixed models (LMMs) and best linear unbiased prediction (BLUP) (Robinson, 1991); using the latent noise formulation, the model can explain away bias in the residuals, as in BLUP. On the other hand, LMMs have a random term for each sample and target variable. While LMMs are not computationally feasible to generalize to high-dimensional targets, due to the $NM$ random effect parameters and the associated inversion of an $NM \times NM$ covariance matrix, latent-noise BRRR can be seen as a low-rank generalization of LMMs for high-dimensional target variables: the covariates are used for prediction in the latent space, and in this space there is a noise term for each sample and latent dimension. Therefore, the number of random effect parameters stays at $NK$ and inference remains tractable.

In summary, our findings extend the existing literature on modeling structured noise in an important way by showing that structured noise can, and should, be taken advantage of when learning the interesting effects between the covariates and the target variables, and how this can be done. Code in R for the new method is available on request from the authors.

This work was financially supported by the Academy of Finland (grant number 251170 to the Finnish Centre of Excellence in Computational Inference Research COIN; grant number 259272 to PM; grant number 257654 to MP; grant number 140057 to SK).

NFBC1966 received financial support from the Academy of Finland (project grants 104781, 120315, 129269, 1114194, 24300796, Center of Excellence in Complex Disease Genetics and SALVE), University Hospital Oulu, Biocenter, University of Oulu, Finland (75617), NHLBI grant 5R01HL087679-02 through the STAMPEED program (1RL1MH083268-01), NIH/NIMH (5R01MH63706:02), ENGAGE project and grant agreement HEALTH-F4-2007-201413, EU FP7 EurHEALTHAgeing -277849 and the Medical Research Council, UK (G0500539, G0600705, G1002319, PrevMetSyn/SALVE).


References

  • Bargi et al. (2014) A. Bargi, R. Y. Xu, Z. Ghahramani, and M. Piccardi. A non-parametric conditional factor regression model for multi-dimensional input and response. In

    Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics

    , volume 33, pages 77–85. JMLR.org, 2014.
  • Baxter (1996) J. Baxter. A Bayesian/information theoretic model of bias learning. In

    Proceedings of the Ninth Annual Conference on Computational Learning Theory

    , COLT ’96, pages 77–88, New York, NY, USA, 1996. ACM.
  • Bernardo et al. (2003) J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statistics, 7:733–742, 2003.
  • Bhattacharya and Dunson (2011) A. Bhattacharya and D. Dunson. Sparse Bayesian infinite factor models. Biometrika, 98(2):291–306, 2011.
  • Bishop et al. (2006) C. M. Bishop et al. Pattern recognition and machine learning, volume 1. Springer, 2006.
  • Bo and Sminchisescu (2009) L. Bo and C. Sminchisescu. Supervised spectral latent variable models. In Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5, pages 33–40, Florida, USA, 2009. JMLR.org.
  • Bottolo et al. (2011) L. Bottolo, E. Petretto, S. Blankenberg, F. Cambien, S. Cook, L. Tiret, and S. Richardson. Bayesian detection of expression quantitative trait loci hot spots. Genetics, 189(4):1449–1459, 2011.
  • Breiman and Friedman (1997) L. Breiman and J. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54, 1997.
  • Calus and Veerkamp (2011) M. Calus and R. Veerkamp. Accuracy of multi-trait genomic selection using different methods. Genetics Selection Evolution, 43(1):26, 2011.
  • Caruana (1997) R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • Davis et al. (2014) O. Davis, G. Band, M. Pirinen, C. Haworth, E. Meaburn, Y. Kovas, N. Harlaar, S. Docherty, K. Hanscombe, M. Trzaskowski, et al. The correlation between reading and mathematics ability at age twelve has a substantial genetic component. Nature Communications, 5, 2014.
  • Evgeniou and Pontil (2007) A. Evgeniou and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, volume 19, pages 41–48. The MIT Press, 2007.
  • Ferreira and Purcell (2009) M. Ferreira and S. Purcell. A multivariate test of association. Bioinformatics, 25(1):132–133, 2009.
  • Foygel et al. (2012) R. Foygel, M. Horrell, M. Drton, and J. Lafferty. Nonparametric reduced rank regression. In Advances in Neural Information Processing Systems 25, pages 1637–1645, 2012.
  • Friedman et al. (2010) J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
  • Fusi et al. (2012) N. Fusi, O. Stegle, and N. Lawrence. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Computational Biology, 8(1):e1002330, 2012.
  • Gelman et al. (2004) A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian data analysis. Chapman & Hall/CRC, 2004.
  • Geweke (1996) J. Geweke. Bayesian reduced rank regression in econometrics. Journal of Econometrics, 75(1):121–146, 1996.
  • Global Lipids Genetics Consortium (2013) Global Lipids Genetics Consortium. Discovery and refinement of loci associated with lipid levels. Nature Genetics, 45(11):1274–1283, 2013.
  • Inouye et al. (2010) M. Inouye, J. Kettunen, P. Soininen, K. Silander, S. Ripatti, L. S Kumpula, E. Hämäläinen, P. Jousilahti, A. J. Kangas, S. Männistö, et al. Metabonomic, transcriptomic, and genomic variation of a population cohort. Molecular Systems Biology, 6(1), 2010.
  • Inouye et al. (2012) M. Inouye, S. Ripatti, J. Kettunen, L. Lyytikäinen, N. Oksala, P. Laurila, A. Kangas, P. Soininen, M. Savolainen, J. Viikari, et al. Novel loci for metabolic networks and multi-tissue expression studies reveal genes for atherosclerosis. PLoS Genetics, 8(8):e1002907, 2012.
  • Kang et al. (2008) H. M. Kang, C. Ye, and E. Eskin. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics, 180(4):1909–1925, 2008.
  • Karlsson (2012) S. Karlsson. Conditional posteriors for the reduced rank regression model. Technical Report Working Papers 2012:11, Orebro University Business School, 2012.
  • Kettunen et al. (2012) J. Kettunen, T. Tukiainen, A-P. Sarin, A. Ortega-Alonso, E. Tikkanen, L-P. Lyytikäinen, A. J. Kangas, P. Soininen, P. Würtz, K. Silander, et al. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature genetics, 44(3):269–276, 2012.
  • Kim and Xing (2009) S. Kim and E. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):e1000587, 2009.
  • Kim et al. (2009) S. Kim, K. Sohn, and E. Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
  • Klami et al. (2013) A. Klami, S. Virtanen, and S. Kaski. Bayesian canonical correlation analysis. The Journal of Machine Learning Research, 14(1):965–1003, 2013.
  • Krishna et al. (2009) A. Krishna, H. D. Bondell, and S. K. Ghosh. Bayesian variable selection using an adaptive powered correlation prior. Journal of Statistical Planning and Inference, 139(8):2665–2674, 2009.
  • Leek and Storey (2007) J. T. Leek and J. D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.
  • Marttinen et al. (2014) P. Marttinen, M. Pirinen, A-P. Sarin, J. Gillberg, J. Kettunen, I. Surakka, A. J. Kangas, P. Soininen, P. O’Reilly, M. Kaakinen, M. Kähönen, T. Lehtimäki, M. Ala-Korpela, O. T. Raitakari, V. Salomaa, M. R. Järvelin, S. Ripatti, and S. Kaski. Assessing multivariate gene-metabolome associations with rare variants using Bayesian reduced rank regression. Bioinformatics, 2014.
  • Rai et al. (2012) P. Rai, A. Kumar, and H. Daume III. Simultaneously leveraging output and task structures for multiple-output regression. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 3194–3202. Curran Associates, Inc., 2012.
  • Rakitsch et al. (2013) B. Rakitsch, C. Lippert, K. Borgwardt, and O. Stegle. It is all in the noise: Efficient multi-task gaussian process inference with structured residuals. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1466–1474. Curran Associates, Inc., 2013.
  • Rantakallio (1969) P. Rantakallio. Groups at risk in low birth weight infants and perinatal mortality. Acta Paediatrica Scandinavica, Suppl. 193, 1969.
  • Robinson (1991) G. K. Robinson. That blup is a good thing: The estimation of random effects. Statistical science, pages 15–32, 1991.
  • Sohn and Kim (2012) K-A. Sohn and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In N. Lawrence and M. Girolami, editors, International Conference on Artificial Intelligence and Statistics, volume 22 of JMLR W&CP, pages 1081–1089. JMLR.org, 2012.
  • Soininen et al. (2009) P. Soininen, A. Kangas, P. Würtz, T. Tukiainen, T. Tynkkynen, R. Laatikainen, M. Järvelin, M. Kähönen, T. Lehtimäki, J. Viikari, et al. High-throughput serum NMR metabonomics for cost-effective holistic studies on systemic metabolism. Analyst, 134(9):1781–1785, 2009.
  • Stegle et al. (2010) O. Stegle, L. Parts, R. Durbin, and J. Winn. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eqtl studies. PLoS Computational Biology, 6(5):e1000770, 2010.
  • Stegle et al. (2011) O. Stegle, C. Lippert, J. M Mooij, N. D. Lawrence, and K. M. Borgwardt. Efficient inference in matrix-variate gaussian models with iid observation noise. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 630–638. Curran Associates, Inc., 2011.
  • Stephens (2013) M. Stephens. A unified framework for association analysis with multiple related phenotypes. PLoS ONE 8(7): e65245, 2013.
  • Teslovich et al. (2010) T. Teslovich, K. Musunuru, A. Smith, A. Edmondson, I. Stylianou, M. Koseki, J. Pirruccello, S. Ripatti, D. Chasman, C. Willer, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307):707–713, 2010.
  • Virtanen et al. (2011) S. Virtanen, A. Klami, and S. Kaski. Bayesian CCA via group sparsity. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 457–464, New York, NY, 2011. ACM.
  • Wang et al. (2012) W. Wang, V. Baladandayuthapani, J. Morris, B. Broom, G. Manyam, and K. Do. Integrative Bayesian analysis of high-dimensional multi-platform genomics data. Bioinformatics, 29(2):149–159, 2012.
  • Xu et al. (2009) C. Xu, X. Wang, Z. Li, and S. Xu. Mapping QTL for multiple traits using Bayesian statistics. Genetical Research, 91(1):23–37, 2009.
  • Yi and Banerjee (2009) N. Yi and S. Banerjee. Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics, 181(3):1101–1113, 2009.