1 Introduction
Explaining away structured noise is one of the cornerstones of successful modeling of high-dimensional output data in the regression framework (Fusi et al., 2012; Rai et al., 2012; Virtanen et al., 2011; Klami et al., 2013; Rakitsch et al., 2013). Structured noise refers to dependencies between response variables that are unrelated to the dependencies of interest. It is noise caused by observed and unobserved confounders that affect multiple variables simultaneously. Common observed confounders in medical and biological data include the age and sex of an individual, whereas unobserved confounders include, for example, the state of the cell being measured, measurement artefacts influencing multiple probes, or other unrecorded experimental conditions. When not accounted for, structured noise may both hide interesting relationships and produce spurious findings
(Leek and Storey, 2007; Kang et al., 2008). The effects of known confounders can be removed straightforwardly by using supervised methods. For the unobserved confounders, a routinely used approach for explaining away structured noise has been to assume a priori independent effects for the interesting and uninteresting factors. For example, in the factor regression setup (Bernardo et al., 2003; Stegle et al., 2010; Fusi et al., 2012), the target variables are assumed to have been generated as
$$ Y = XW + Z\Lambda + E, \qquad (1) $$

where $Y \in \mathbb{R}^{N \times K}$ is the matrix of target variables (or dependent variables) and $X \in \mathbb{R}^{N \times P}$ contains the covariates (or independent variables), for $N$ observations. The model parameter matrices $Z$ and $\Lambda$ comprise the unknown latent factors and the factor loadings, which are used to model away the structured noise. The term $E$ represents independent unstructured noise. To reduce the effective number of parameters in the regression coefficient matrix $W \in \mathbb{R}^{P \times K}$, a low-rank structure may be assumed:
$$ W = \Psi\Gamma, \qquad (2) $$

where the rank $R$ of the parameters $\Psi \in \mathbb{R}^{P \times R}$ and $\Gamma \in \mathbb{R}^{R \times K}$ is substantially lower than the number of target variables $K$ or covariates $P$. The low-rank decomposition (2) of the regression coefficient matrix may be given an interpretation whereby the covariates affect the latent components with coefficients specified in $\Psi$, and the components, in turn, affect the targets with coefficients $\Gamma$.
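To make the decomposition (2) concrete, the following sketch (Python/NumPy rather than the R code accompanying the paper; all variable names are illustrative) fits an ordinary least-squares coefficient matrix and truncates it to rank $R$ with an SVD — a classical way of obtaining a reduced-rank estimate, not the Bayesian procedure developed below.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K, R = 500, 30, 60, 3  # observations, covariates, targets, rank

# Generate data from a rank-R regression model: Y = X Psi Gamma + E.
X = rng.standard_normal((N, P))
Psi = rng.standard_normal((P, R))
Gamma = rng.standard_normal((R, K))
Y = X @ Psi @ Gamma + 0.1 * rng.standard_normal((N, K))

# Full-rank OLS estimate of the coefficient matrix W (P x K).
W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Truncate to rank R with the SVD: W ~ U_R diag(s_R) V_R^T = Psi_hat Gamma_hat.
U, s, Vt = np.linalg.svd(W_ols, full_matrices=False)
Psi_hat = U[:, :R] * s[:R]   # P x R, plays the role of Psi
Gamma_hat = Vt[:R, :]        # R x K, plays the role of Gamma
W_rr = Psi_hat @ Gamma_hat

print(np.linalg.matrix_rank(W_rr))  # 3
```

Because the simulated noise is small and the true coefficient matrix has rank 3, the truncation discards almost nothing of the OLS solution.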
[Figure 1: (a) the conventional model with a priori independent signal and structured noise; (b) the proposed model in which the latent noise affects the targets through the same latent subspace as the interesting effects.]
Another line of work in multiple output prediction has focused on borrowing information from the correlation structure of the target variables when learning the regression model. The intuition stems from the observation that correlated targets are often seen to be affected similarly by the covariates, for example in genetic applications (see, e.g., Davis et al., 2014; Inouye et al., 2012). One popular method, GFlasso (Kim et al., 2009), learns the regression coefficients using
$$ \min_{B} \sum_{k=1}^{K} \| y_k - X\beta_k \|_2^2 + \lambda \sum_{k=1}^{K} \|\beta_k\|_1 + \gamma \sum_{(m,l) \in E} |r_{ml}| \sum_{j=1}^{P} \left| \beta_{jm} - \operatorname{sign}(r_{ml})\,\beta_{jl} \right|, \qquad (3) $$

where the $\beta_k$ are the columns of $B$. Two regularization parameters are introduced: $\lambda$ controls the standard Lasso penalty, and $\gamma$ encourages the effects $\beta_{jm}$ and $\beta_{jl}$ of the $j$th covariate on correlated outputs $m$ and $l$ to be similar. Here $r_{ml}$ represents the correlation between the $m$th and $l$th phenotypes. The graph $E$ is an a priori specified correlation graph for the output variables, with edges representing correlations to be accounted for in the model.
In this paper we propose a model that simultaneously learns the structured noise and encourages the sharing of information between the noise and the regression models. To motivate the new model, we note that by assuming independent prior distributions on $W$ and $\Lambda$ in model (1), one implicitly assumes independence of the interesting and uninteresting effects, caused by the covariates $X$ and the unknown factors $Z$, respectively (Fig. 1a). This is a poor assumption for instance in molecular biology, where gene expression and metabolomics measurements record concentrations of compounds generated by ongoing biological processes. When predicting the expression from a limited set of covariates, such as single nucleotide polymorphisms (SNPs), the covariates determine the activity of the biological process only partially and, thus, all other activity of the processes is due to structured "noise", or unrecorded factors. In such cases, the noise affects the measurement levels through the very same process as the interesting signal (Fig. 1b), and rather than assuming independence of the effects, an assumption about parallel effects would be more appropriate. We refer to this type of noise as latent noise, as it can be considered to affect the same latent subspace as the interesting effects.
A natural way to encode the assumption of latent noise is to use the following model structure
$$ Y = (X\Psi + H)\,\Gamma + E, \qquad (4) $$

where $H \in \mathbb{R}^{N \times R}$ is a matrix consisting of unknown latent factors. In (4), $\Gamma$ mediates the effects of both the interesting and the uninteresting signals on the target variables. We note that the change required in the model structure is small, and it has in fact been presented earlier (Bo and Sminchisescu 2009; recently extended with an Indian Buffet Process prior on the latent space by Bargi et al. 2014). We now proceed to using the structure (4) for GFlasso-type sharing of information (3) between the regression and noise models, while simultaneously explaining away structured noise. To see that the information sharing between the noise and regression models follows immediately from model (4), one can consider simulations generated from the model. The a priori independence assumption of model (1) results in uncorrelated regression weights regardless of the correlations between target variables (Figure 2a). The assumption of latent noise (4), however, encourages the regression weights to be correlated similarly with the target variables (Figure 2c).
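Why the shared loading matrix couples the noise and the regression weights can be seen with a few lines of linear algebra. In the sketch below (illustrative names; $\Psi$ is given orthonormal columns so that the identity is exact), the between-target covariance induced by the latent noise $H\Gamma$ and the Gram matrix of the per-target regression weight vectors are the same matrix $\Gamma^T\Gamma$:

```python
import numpy as np

rng = np.random.default_rng(1)
P, K, R = 30, 60, 3

# Latent-noise model (4): Y = (X Psi + H) Gamma + E.
# Both the signal X Psi Gamma and the structured noise H Gamma pass
# through the same Gamma, so the between-target noise covariance and
# the regression-weight similarity share the Gram matrix Gamma^T Gamma.
Gamma = rng.standard_normal((R, K))
Psi, _ = np.linalg.qr(rng.standard_normal((P, R)))  # orthonormal columns

W = Psi @ Gamma              # regression coefficient matrix, P x K
noise_cov = Gamma.T @ Gamma  # covariance of rows of H Gamma (unit-variance H)
weight_sim = W.T @ W         # similarity of per-target weight vectors

# With orthonormal Psi these coincide exactly.
print(np.allclose(noise_cov, weight_sim))  # True
```

Targets whose structured noise is strongly correlated therefore automatically receive similar regression weight vectors, which is the information sharing discussed above.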
[Figure 2: (a)–(c) Regression weights in data simulated from the model, for increasing relative proportions of latent noise in the data generation. The dashed lines denote the 95% confidence intervals of the conditional distributions.]
In this work, we focus on modelling weak signals among structured noise in high-dimensional data. For instance, effects explaining only a small fraction of the variance of the target variables can be considered weak. We have argued above that a model with the structure (4) is particularly well-suited for this purpose. Additionally, (i) particular emphasis must be put on defining adequate prior distributions to distinguish the weak effects from noise as effectively as possible, and (ii) scalability to large sample sizes is needed to have any chance of learning the weak effects. For (i), we define the latent signal-to-noise ratio as a generalization of the standard signal-to-noise ratio in the latent space:

$$ \mathrm{SNR}_{\mathrm{latent}} = \frac{\operatorname{tr} \operatorname{Cov}(X\Psi)}{\operatorname{tr} \operatorname{Cov}(H)}. \qquad (5) $$
We use the latent signal-to-noise ratio as a hyperparameter in the modelling, and show that it is a key parameter affecting model performance. It can either be learned or set using prior knowledge. In addition, we introduce an ordered infinite-dimensional shrinkage prior that resolves the inherent rotational ambiguity in the model (4) by sorting both signal and noise components by their importance. Finally, we present efficient inference methods for the model.
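As a sketch of how the latent SNR can be computed on given data, assuming it is read as the ratio of the total latent-signal variance to the total latent-noise variance over the latent dimensions (an assumption for illustration; the paper's definition is in (5)):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, R = 2000, 30, 3

X = rng.standard_normal((N, P))
Psi = rng.standard_normal((P, R)) / np.sqrt(P)  # latent signal loadings
sigma_H = 3.0                                   # latent-noise scale
H = sigma_H * rng.standard_normal((N, R))       # latent noise

def latent_snr(X, Psi, H):
    """Ratio of latent-signal variance to latent-noise variance,
    totalled over the R latent dimensions (assumed reading of (5))."""
    signal = X @ Psi
    return signal.var(axis=0).sum() / H.var(axis=0).sum()

print(latent_snr(X, Psi, H))  # roughly 1 / sigma_H**2
```

A small value states the prior belief that most of the latent-space variation is noise rather than covariate-driven signal, which is exactly the regime of the genomic experiments below.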
2 Related work
Integrating multiple real-valued prediction tasks with the same set of covariates is called multiple-output regression (Breiman and Friedman, 1997); more generally, sharing of statistical strength between related tasks is called multitask learning (Baxter, 1996; Caruana, 1997). The data consist of input-output pairs $(x_i, y_i)$, $i = 1, \ldots, N$; the $P$-dimensional input vectors (covariates) are used for predicting the $K$-dimensional vectors of target variables. The common approach to dealing with structured noise due to unobserved confounders is to apply factor regression modeling (1) (Bernardo et al., 2003) and to explain away the structured noise using a noise model that is assumed to be a priori independent of the regression model (Stegle et al., 2010; Fusi et al., 2012; Rai et al., 2012; Virtanen et al., 2011; Klami et al., 2013; Rakitsch et al., 2013). A recent Bayesian reduced-rank regression (BRRR) model (Marttinen et al., 2014) implements the routine assumption of the independence of the regression and noise models; we will include it in the comparison studies of this paper. Methods for multiple-output regression without a structured noise model have been proposed in other fields. In the application fields of genomic selection and multi-trait quantitative trait loci mapping, solutions (Yi and Banerjee, 2009; Xu et al., 2009; Calus and Veerkamp, 2011; Stephens, 2013) for low-dimensional target variable vectors have been proposed, but these methods do not scale up to the currently emerging needs of analyzing higher-dimensional target variable data. Additionally, sparse multiple-output regression models have been proposed for the prediction of phenotypes from genomic data (Kim et al., 2009; Sohn and Kim, 2012).
Many methods for multitask learning have been proposed in the field of kernel methods (Evgeniou and Pontil, 2007). These methods do not, however, scale up to data sets with several thousands of samples, which are required for predicting the weak effects. Other relevant work includes a recent method based on the BRRR presented by Foygel et al. (2012), but it does not scale to the dimensionalities of our experiments either. Methods for high-dimensional phenotypes have been proposed in the field of expression quantitative trait loci mapping (Bottolo et al., 2011) for the weaker task of finding associations (and avoiding false positives) rather than prediction, which is our main focus. Functional assumptions (Wang et al., 2012) have also been used to constrain related learning problems.
3 Model
In this Section, we present the details of the model, show how the hyperparameters can be set using the latent signal-to-noise ratio, and analyze the properties of the infinite-dimensional shrinkage prior.
3.1 Model details: latent-noise BRRR
Our model is given by
$$ Y = (X\Psi + H)\,\Gamma + E, \qquad (6) $$

where $Y \in \mathbb{R}^{N \times K}$ contains the $K$-dimensional response variables for $N$ observations, and $X \in \mathbb{R}^{N \times P}$ contains the predictor variables. The product $\Psi\Gamma$ of $\Psi \in \mathbb{R}^{P \times R}$ and $\Gamma \in \mathbb{R}^{R \times K}$ results in a regression coefficient matrix with rank $R$. The matrix $H \in \mathbb{R}^{N \times R}$ contains unknown latent factors representing the latent noise. Finally, $E$ is a matrix of uncorrelated target variable-specific noise vectors, with rows $\epsilon_i \sim N(0, \Sigma)$, where $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_K^2)$. Figure 3 displays the structure of the model graphically. In the figure, the node corresponding to the parameter $\Gamma$, which is shared by the regression and noise models, is highlighted in green. Similarly to a recent BRRR model (Marttinen et al., 2014) and the Bayesian infinite sparse factor analysis model (Bhattacharya and Dunson, 2011), we assume the number of components connecting the covariates to the targets to be infinite. Accordingly, the number of rows in the weight matrix $\Gamma$, and the numbers of columns in $\Psi$ and $H$, are infinite. The low-rank nature of the model is enforced by shrinking the columns of $\Psi$ and $H$ and the rows of $\Gamma$ increasingly with the growing column/row index, such that only a small number of columns/rows are influential in practice. The increasing shrinkage also solves any rotational non-identifiability issues by enforcing the model to mediate the strongest effects through the first columns/rows. In Section 3.3 we explore the basic properties of the infinite-dimensional prior, to ensure its soundness.
The hierarchical priors for the projection weight matrix $\Gamma = (\gamma_{r,k})$ are set as follows:

$$ \gamma_{r,k} \sim N(0, \phi_{r,k}^{-1} \tau_r^{-1}), \quad \phi_{r,k} \sim \mathrm{Ga}(\nu/2, \nu/2), \quad \tau_r = \prod_{l=1}^{r} \delta_l, \quad \delta_1 \sim \mathrm{Ga}(a_1, 1), \quad \delta_l \sim \mathrm{Ga}(a_2, 1), \; l \geq 2. \qquad (7) $$
Here $\tau_r$ is a global shrinkage parameter for the $r$th row of $\Gamma$ and the $\phi_{r,k}$ are local shrinkage parameters for the individual elements of $\Gamma$, to provide additional flexibility over the global shrinkage priors. The same parameters are used to shrink the columns of the matrices $\Psi$ and $H$, because the scales of $\Gamma$ and $\Psi$ (or $H$) are not identifiable separately:

$$ \psi_{\cdot,r} \sim N(0, \tau_r^{-1} I_P), \qquad h_{\cdot,r} \sim N(0, \alpha^2 \tau_r^{-1} I_N), $$

where $\alpha^2$ is a parameter that specifies the amount of latent noise, which is used to regularize the model (see the next Section). With the priors specified, the hidden factors $H$ can be integrated out analytically, yielding
$$ y_i \mid x_i \sim N\!\left( \Gamma^T \Psi^T x_i, \; \tilde{\Gamma}^T \tilde{\Gamma} + \Sigma \right), \qquad (8) $$

where $\tilde{\Gamma}$ is obtained from $\Gamma$ by multiplying the rows of $\Gamma$ with the shrinkage scales of the columns of $H$.
Finally, conjugate prior distributions

$$ \sigma_k^{-2} \sim \mathrm{Ga}(a_\sigma, b_\sigma), \quad k = 1, \ldots, K, \qquad (9) $$

are placed on the noise parameters of the target variables.
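The effect of the increasing shrinkage can be sketched numerically. Assuming a Bhattacharya–Dunson-style multiplicative gamma process for the $\tau_r$ in (7) (an assumption for illustration), the expected prior variance of the $r$th row decays geometrically, which is what sorts the components by importance:

```python
# Expected shrinkage of a multiplicative-gamma-process prior
# (assumed Bhattacharya-Dunson-style construction for (7)):
# tau_r = prod_{l<=r} delta_l, delta_1 ~ Ga(a1, 1), delta_l ~ Ga(a2, 1).
# By independence, the prior row variance E[1/tau_r] factorizes as
# E[1/tau_r] = (1 / (a1 - 1)) * (1 / (a2 - 1)) ** (r - 1)   for a1, a2 > 1,
# i.e. it decays geometrically, so only the first rows stay influential.

def expected_row_variance(r, a1=3.0, a2=4.0):
    """Prior variance of row r under the assumed shrinkage construction."""
    return (1.0 / (a1 - 1.0)) * (1.0 / (a2 - 1.0)) ** (r - 1)

vars_ = [expected_row_variance(r) for r in range(1, 8)]
print([round(v, 5) for v in vars_])  # geometric decay with ratio 1/3
```

The hyperparameter values 3 and 4 used here mirror the settings reported in the experiments; any $a_1, a_2 > 1$ give the same geometric picture with a different decay rate.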
3.2 Regularization of latent-noise BRRR through the variance of $H$
The latent signal-to-noise ratio in Equation (5) has an intuitive interpretation: given our prior distributions for $\Psi$ and $H$, the prior latent SNR indicates the extent to which we believe the noise to explain variation in $Y$, as compared to the variance explained by the covariates $X$. Thus, the latent SNR acts as a regularization parameter: by allowing the latent variables to have a large variance, the data will be explained by the noise model rather than the covariates. We note that this approach to regularization is non-standard, and it may have favourable characteristics compared to the commonly used L1/L2 regularization of regression weights. First, the regression weights remain relatively unbiased, as they need not be forced to zero to control for overfitting. More importantly, while regularizing with the a priori selected latent SNR, the regularization parameter itself remains interpretable: every value of the variance parameter of $H$ can be immediately interpreted as the percentage of variance explained by the covariates as compared to the noise model. Making similar interpretations for L1/L2 regularization parameters is not straightforward. In our experiments, we use cross-validation to select the variance of $H$. However, the interpretability of the parameter makes it easy to express beliefs about plausible values based on prior knowledge.
3.3 Proofs of the soundness of the infinite prior
In this Section we verify the sensibility of the infinite nonparametric prior, which we introduce for ordering the components according to decreasing importance, and of a computational approximation resulting from the truncation of the infinite model.
It has been proven that in Bayesian factor models a simple condition on the shrinkage hyperparameters (in our case defined in eqn (7)) is sufficient for the elements of $Y$ to have finite variance in the factor model (1), even if an infinite number of columns with a prior similar to our model is assumed for $\Lambda$ (Bhattacharya and Dunson, 2011). In this Section we present similar characteristics of the infinite reduced-rank regression model. The detailed proofs can be found in the Supplementary material. First, in analogy to the infinite Bayesian factor analysis model, we show that a corresponding condition on the hyperparameters,

(10)

is sufficient for the prediction of any of the response variables to have finite variance under the prior distribution (Proposition 1). Second, we show that the underestimation of uncertainty (variance) resulting from using a finite-rank approximation to the infinite reduced-rank regression model decays exponentially with the rank of the approximation (Proposition 2). For notational clarity, let $\gamma_{\cdot,k}$ denote in the following the $k$th column of the matrix $\Gamma$. With this notation, a prediction for the $k$th response variable can be written as

$$ \hat{y}_k = x^T \Psi \gamma_{\cdot,k}. $$
Furthermore, let $\Gamma(\cdot)$ denote below the gamma function (not to be confused with the matrix $\Gamma$ used in all other Sections of this paper).
Proposition 1 (Finite variance of predictions): Suppose that the hyperparameters of the prior (7) satisfy condition (10). Then

$$ \operatorname{Var}(\hat{y}_k) < \infty. \qquad (11) $$
A detailed proof is provided in the Supplementary material.
Proposition 2 (Truncation error of the finite-rank approximation): Let $\hat{y}_k^{(R)}$ denote the prediction for the $k$th phenotype when using approximations of $\Psi$ and $\Gamma$ consisting of their first $R$ columns or rows only, respectively. Then

$$ \operatorname{Var}(\hat{y}_k) - \operatorname{Var}\!\left(\hat{y}_k^{(R)}\right) = O(c^{-R}) \quad \text{for some } c > 1, $$

that is, the reduction in the variance of the prediction resulting from using the approximation, relative to the infinite model, decays exponentially with the rank of the approximation. A detailed proof is provided in the Supplementary material.
4 Efficient computation by reparameterization
For estimating the parameters of the latent-noise BRRR, we use Gibbs sampling, updating the parameters one by one by sampling them from their conditional posterior probability distributions, given the current values of all other parameters. The bottleneck of the computation is the update of the matrix $\Psi$, and below we present a novel efficient update for this parameter.

Update of $\Gamma$
The conditional distribution of the parameter matrix $\Gamma$ of the latent-noise BRRR can be updated using a standard result for Bayesian linear models (Bishop, 2006), which states that if

$$ y = A w + \epsilon, \quad \epsilon \sim N(0, \Sigma_\epsilon), \quad w \sim N(m_0, S_0), \qquad (12) $$

then

$$ w \mid y \sim N(m, S), \qquad (13) $$

where

$$ S = \left( S_0^{-1} + A^T \Sigma_\epsilon^{-1} A \right)^{-1}, \qquad m = S \left( S_0^{-1} m_0 + A^T \Sigma_\epsilon^{-1} y \right). \qquad (14) $$
Because in our model (6) the columns of the noise matrix $E$ are assumed independent with variances $\sigma_k^2$, we get

$$ y_{\cdot,k} = (X\Psi + H)\,\gamma_{\cdot,k} + \epsilon_{\cdot,k}, \quad \epsilon_{\cdot,k} \sim N(0, \sigma_k^2 I_N). \qquad (15) $$

Thus, by substituting $A = X\Psi + H$ into (12), together with the prior covariance derived from (7), we immediately obtain the posterior of $\Gamma$ from (13) and (14).
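A minimal numerical check of the update (12)–(14), specialized to isotropic noise and a standard normal prior (in the actual sampler, $A = X\Psi + H$ and each column of $\Gamma$ has its own $\sigma_k^2$; names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 200, 5

# Bayesian linear model: y = A w + eps, eps ~ N(0, s2 I), w ~ N(m0, S0).
# Posterior: w | y ~ N(m, S) with
#   S = (S0^-1 + A^T A / s2)^-1,  m = S (S0^-1 m0 + A^T y / s2).
A = rng.standard_normal((N, D))
w_true = rng.standard_normal(D)
s2 = 0.5 ** 2
y = A @ w_true + np.sqrt(s2) * rng.standard_normal(N)

m0, S0 = np.zeros(D), np.eye(D)
S = np.linalg.inv(np.linalg.inv(S0) + A.T @ A / s2)
m = S @ (np.linalg.inv(S0) @ m0 + A.T @ y / s2)

# With m0 = 0 and S0 = I the posterior mean equals ridge regression
# with penalty s2: m = (A^T A + s2 I)^-1 A^T y.
m_ridge = np.linalg.solve(A.T @ A + s2 * np.eye(D), A.T @ y)
print(np.allclose(m, m_ridge))  # True
```

The ridge identity is a useful sanity check: the Gaussian prior plays exactly the role of an L2 penalty on the regression weights.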
Updates of the hyperparameters and $H$
The updates of the hyperparameters are the same as in the Bayesian reduced-rank regression, and the conditional posterior distributions of the hyperparameters can be found in the Supplementary material of Marttinen et al. (2014). The latent-noise matrix $H$ has the same conditional posterior distribution as the corresponding model parameter of Marttinen et al. (2014).
Improved update of $\Psi$
The computational bottleneck of the naive Gibbs sampler is the update of the parameter $\Psi$, which has $P \times R$ elements with a joint multivariate Gaussian distribution, conditionally on the other parameters (Geweke, 1996; Marttinen et al., 2014). Thus, the inversion of the precision matrix of the joint distribution has a computational cost cubic in the number of elements of $\Psi$. To remove the bottleneck, we reparameterize our model, after which a linear algebra trick by Stegle et al. (2011) can be used to reduce the computational cost of the bottleneck substantially. When sampling $\Psi$ we also integrate over the distribution of $H$, following the standard result from Equation (8). The reparameterization and the new posteriors are presented in the Supplementary material. In brief, the eigenvalue decomposition of a matrix of the form

$$ K_1 \otimes K_2 + \sigma^2 I \qquad (16) $$

can be evaluated inexpensively. Through the reparameterization, the posterior covariance matrix of $\Psi$ becomes of the form (16), and the eigenvalue decomposition is then used to generate samples from the posterior distribution of $\Psi$. We note that the trick can also be applied to the original formulation of the Bayesian reduced-rank regression model by Geweke (1996), and the R code published with this article allows generating samples from the original model as well. In the next Section, we compare the computational cost of the algorithm using the naive Gibbs sampler and the improved version that uses the new parameterization.
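The linear algebra trick can be checked numerically. Assuming the covariance has the Kronecker-plus-diagonal form (16) (illustrative matrices below), its eigendecomposition follows directly from those of the two small factors, so only the small eigenproblems ever need to be solved:

```python
import numpy as np

rng = np.random.default_rng(4)
p, r = 6, 4

# If K1 = U1 D1 U1^T and K2 = U2 D2 U2^T, then
#   K1 (x) K2 + s I = (U1 (x) U2) (D1 (x) D2 + s I) (U1 (x) U2)^T,
# so the eigenvalues of the big matrix are kron(d1, d2) + s.
A1 = rng.standard_normal((p, p)); K1 = A1 @ A1.T  # symmetric PSD factor
A2 = rng.standard_normal((r, r)); K2 = A2 @ A2.T
s = 0.7

d1, U1 = np.linalg.eigh(K1)
d2, U2 = np.linalg.eigh(K2)
evals_fast = np.kron(d1, d2) + s           # from the two small problems

evals_direct = np.linalg.eigvalsh(np.kron(K1, K2) + s * np.eye(p * r))
print(np.allclose(np.sort(evals_fast), np.sort(evals_direct)))  # True
```

For a $P \times R$ parameter block this replaces one eigendecomposition of a $PR \times PR$ matrix by two much smaller ones, which is the source of the speedup reported in Section 5.2.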
5 Simulation experiments
In Section 5.1 we investigate the impact of the noise model assumptions on simulated data. In Section 5.2 we confirm the computational speedup resulting from the new algorithm introduced in Section 4.
5.1 Impact of the noise model assumptions
We study the performance of the models with simulated data sets generated using a continuum of models between the two extremes of assuming either latent noise, or a priori independent regression and noise models. The synthetic data are generated according to
$$ Y = X\Psi\Gamma + \sqrt{1-\lambda}\, Z\Lambda + \sqrt{\lambda}\, H\Gamma + E, \qquad (17) $$

where the parameter $\lambda \in [0, 1]$ defines the proportion of the structured-noise variance attributed to the latent noise. We study a continuum of problems with values of $\lambda$ ranging from 0 (a priori independent structured noise) to 1 (pure latent noise). The relevant parameter matrices are orthogonalized using Gram–Schmidt orthogonalization. The parameters are scaled so that the covariates explain 3% of the variance of $Y$ through $X\Psi\Gamma$, the diagonal Gaussian noise explains 20% of the total variance of $Y$, and the structured noise explains the remaining 77% of the total variance of $Y$. The simulation was repeated 100 times, and training data sets of 500 and 2000 samples were generated for each replicate. To compare the methods, the mean squared error (MSE) of the models learned with each method was compared to that of the true model on a test set of 15,000 samples. The number of covariates was fixed to 30 and the number of dependent variables to 60. The rank of the regression coefficient matrix and of the structured noise was fixed to 3 when simulating the data sets.
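A sketch of this data-generating scheme (illustrative names; each component is rescaled to its exact target variance share, whereas the paper additionally orthogonalizes the parameter matrices):

```python
import numpy as np

rng = np.random.default_rng(5)
N, P, K, R = 2000, 30, 60, 3
lam = 0.5  # proportion of structured noise that is latent noise

def scaled(M, share):
    """Rescale component M so its total empirical variance equals `share`."""
    return M * np.sqrt(share / M.var(axis=0).sum())

X = rng.standard_normal((N, P))
Psi, Gamma = rng.standard_normal((P, R)), rng.standard_normal((R, K))
Lam = rng.standard_normal((R, K))
Z, H = rng.standard_normal((N, R)), rng.standard_normal((N, R))

# Covariates 3%, structured noise 77%, diagonal noise 20%.
signal = scaled(X @ Psi @ Gamma, 0.03)
struct = scaled(np.sqrt(1 - lam) * (Z @ Lam) + np.sqrt(lam) * (H @ Gamma), 0.77)
noise = scaled(rng.standard_normal((N, K)), 0.20)
Y = signal + struct + noise

shares = [c.var(axis=0).sum() for c in (signal, struct, noise)]
print([round(v, 2) for v in shares])  # [0.03, 0.77, 0.2]
```

Sweeping `lam` from 0 to 1 moves the structured noise from the a priori independent subspace into the latent subspace shared with the signal, which is the continuum studied in Figure 4.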
The ranks of all methods were fixed to the true value. For the latent-noise BRRR, the variance of $H$ was selected using 10-fold cross-validation. The grid of candidate values was chosen according to the interpretation given in Section 3.2; it corresponds to assuming that the latent noise explains 5 to 15 times the variance explained by the covariates.
Figures 4 (a) and (b) present the results of a simulation study with training sets of 500 and 2000 samples, respectively. When the structured noise is generated according to the conventional assumption of independent signal and noise, the model making the independence assumption performs as well as the true model with both 500 and 2000 samples. However, as the assumption is violated and the proportion of latent noise increases, the performance of the model with the a priori independence assumption quickly breaks down, whereas the model assuming latent noise performs consistently well. Methods that do not explain away the structured noise are not able to outperform the null model with the training set of 500 samples. When the number of training samples is increased to 2000, these models outperform the model assuming a priori independent signal and noise.
The naïve assumption of a priori independent signal and noise actually appears to harm performance even with a large training data set of 2000 samples, as the models that do not address the structured noise at all clearly perform better. This emphasizes the need to take the latent noise into account; even with thousands of samples, a relatively weak but wrong prior assumption in this respect can result in low performance.
Figure 4 also shows results for an alternative novel model that shares information between the noise and regression models (correlated BRRR; see the Supplementary material for a detailed description). The model includes a separate noise model for the structured noise, as in (1), but achieves the information sharing by assuming a joint prior for the noise and regression models. In detail, conditional on the noise model, the current residual correlation matrix between the response variables is used as a prior for the rows of $\Gamma$. This way the correlations between target variables are propagated into the corresponding regression weights; however, the strongest noise components are not automatically coupled with the strongest signal components. Notably, the performance of the correlated BRRR model is very similar to that of the regular BRRR model, which has no dependence between the noise and signal components. This underlines the importance of explicitly coupling the strongest signal and noise components, as in the latent-noise BRRR, by enforcing them to use the shared regression coefficient vectors.
[Figure 4: Test-data MSE as a function of the proportion of latent noise; (a) training sets of 500 samples, (b) training sets of 2000 samples.]
5.2 Runtime of the new algorithm for BRRR models
To confirm the computational speedup resulting from the reparameterization presented in Section 4, we performed an experiment in which the algorithm implementing the naive Gibbs sampling updates for the Bayesian reduced-rank regression (Geweke, 1996; Karlsson, 2012) was compared with the new algorithm that uses the reparameterization. Similar improvements were achieved with the new models.
Ten toy data replicates were generated from the prior. The number of samples in the training set was fixed to 5000 and the number of target variables was set to 12. Rank of the regression coefficient matrix was 2. Runtime was measured as a function of the number of covariates, which was varied from 100 to 300; 1000 posterior samples were generated. The new algorithm that reparameterizes the model clearly outperforms the naive Gibbs sampler (Figure 5). As a sanity check, the algorithms were compared in terms of the estimated regression coefficient matrices, which were found to be similar.
6 Genomic experiments
In this Section, we evaluate the performance of the latent-noise BRRR in gene expression prediction, metabolomics prediction, and an exemplifying multivariate association testing task. The impacts of the latent signal-to-noise ratio and the number of samples are investigated. Data from two cohorts, DILGOM and NFBC1966 (for details, see below), are used in the experiments. SNPs known to be associated with lipid metabolism (Teslovich et al., 2010; Kettunen et al., 2012; Global Lipids Genetics Consortium, 2013) are used as the covariates. We also compare the runtimes of the different methods. Throughout the Section, the mean squared error (MSE) on test data is used as the test score. The MSE is computed by dividing the data sets into 10 non-overlapping folds, and predicting one fold at a time while using the other folds as the training data. The mean over the test folds is used as the final score. The effect of the number of observations is studied by downsampling the training set while keeping the test sets fixed.
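The evaluation protocol can be sketched as follows (illustrative code; the null model predicting the training-set mean serves as the baseline throughout the experiments):

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 200, 5
Y = rng.standard_normal((N, K))  # stand-in for the target matrix
X = rng.standard_normal((N, 3))  # stand-in for the SNP covariates

def cv_mse(X, Y, fit, n_folds=10):
    """Mean over folds of the test MSE, predicting one fold at a time
    while using the remaining folds as training data."""
    idx = np.arange(len(Y))
    scores = []
    for f in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, f)
        pred = fit(X[train], Y[train], X[f])
        scores.append(np.mean((Y[f] - pred) ** 2))
    return np.mean(scores)

# Baseline "null model": predict the training-set mean of each target.
null_fit = lambda Xtr, Ytr, Xte: np.tile(Ytr.mean(axis=0), (len(Xte), 1))
print(cv_mse(X, Y, null_fit))  # close to 1.0 for unit-variance targets
```

Any method plugged in as `fit` is then directly comparable to the null model on the same folds, which is how "better than null" is judged in Tables 1 and 2.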
6.1 Methods included in comparison
We compare the latent-noise BRRR with a state-of-the-art sparse multiple-output regression method, Graph-guided Fused Lasso ('GFlasso') (Kim and Xing, 2009); the BRRR/factor regression model (Marttinen et al., 2014) with and without the a priori independent noise model ('BRRR', 'BRRR without noise model'); a standard Bayesian linear model ('blm') (Gelman et al., 2004); elastic net-penalized multitask learning ('L2/L1 MTL'); and a baseline method of predicting with the target data mean. GFlasso constitutes a suitable comparison as it encourages sharing of information between correlated responses, like our model, but does so within the Lasso-type penalized regression framework, without a noise model to explain away the structured noise. L2/L1 MTL is a multitask regression method implemented in the glmnet package (Friedman et al., 2010) that allows elastic net regularization and also does not use a noise model to explain away confounders. The blm, on the other hand, was selected as a simple single-task baseline.
Parameters for the different methods were specified as follows:

GFlasso: The regularization parameters of the gw2 model were selected from the default grid using cross-validation. The method has been developed for genomic data, suggesting that the default values should be appropriate. However, for the NFBC1966 data, we were unable to run the method with the smallest values of the regularization parameters (110, 60, 10) due to the lengthy runtime with these values. To save computation time, we predicted only 2 randomly selected validation folds out of the 10 possible. With these computational compromises, the average training time for the largest training data sets was 650 h. With the NFBC1966 data, the prespecified correlation network required by GFlasso was constructed to match the VLDL, IDL, LDL, and HDL metabolite clusters from Inouye et al. (2012). Within these clusters, the correlation network was fixed to the empirical correlations, and to 0 otherwise. With the DILGOM data, we used the empirical correlation network, with correlations below 0.8 fixed to 0 to reduce the number of edges in the network for computational speedup.

BRRR, BRRR without noise model: The two hyperparameters of the shrinkage prior for all the BRRR models were fixed to 3 and 4, respectively. 1,000 MCMC samples were generated and 500 were discarded as burn-in. In preliminary tests, similar results were obtained with 50,000 samples. The remaining samples, thinned by a factor of 10, were used for prediction. The truncation point of the infinite-rank BRRR model was learned using cross-validation from a set of candidate values.

latent-noise BRRR: With the NFBC1966 data, the latent signal-to-noise ratio was simply fixed to a value corresponding to the prior assumption of SNPs explaining about 1% of the variation of the phenotypes. A sensitivity analysis for this parameter is presented in Section 6.4. For the DILGOM data, the latent signal-to-noise ratio was selected using cross-validation. Other parameters, including the number of iterations, were set as for the BRRR.

blm: The variance hyperparameter of the blm was integrated over using MCMC. 1,000 posterior samples were generated and 500 were discarded as burn-in.

L1/L2 MTL: The effects of different types of regularization penalties are an active research topic, and we ran a continuum of mixtures of L1 and L2 penalties ranging from group lasso to ridge regression. The mixture parameter controlling the balance between L1 and L2 regularization was evaluated on the grid [0, 0.1, ..., 0.9, 1.0] and selected using 10-fold cross-validation.
6.2 Data sets
The DILGOM data set (Inouye et al., 2010) consists of genome-wide SNP data along with metabolomics and gene expression measurements. For details concerning the metabolomics and gene expression data collection, see Soininen et al. (2009) and Kettunen et al. (2012). In total, 509 individuals had all three measurement types. The DILGOM metabolomics data comprise 137 metabolites, most of which represent NMR-quantified levels of lipoproteins classified into 4 subclasses (VLDL, IDL, LDL, HDL), together with quantified levels of amino acids, some serum extracts, and a set of quantities derived as ratios of the aforementioned metabolites. All 137 metabolites were used simultaneously as prediction targets. In gene expression prediction, in total 387 probes corresponding to the curated gene sets of 8 KEGG lipid metabolism pathways were used as the prediction targets. The average number of probes in a pathway was 48, and the pathways were treated independently. For details about the pathways, see the Supplementary material.
The NFBC1966 data comprises genomewide SNP data along with metabolomics measurements for a cohort of 4,702 individuals (Rantakallio, 1969; Soininen et al., 2009). With these data, 96 metabolites belonging to the subclasses VLDL, IDL, LDL and HDL (Inouye et al., 2012) were used as the phenotype.
Effects of age, sex, and lipid-lowering medication were regressed out from the metabolomics data as a preprocessing step. For the genotype data, SNPs with a low minor allele frequency (below 0.01) were removed as a preprocessing step.
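A sketch of this preprocessing, with hypothetical arrays standing in for the actual cohort data (the paper does not show its preprocessing code):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 500

# Regress out known confounders (age, sex, medication) from each
# metabolite, then drop SNPs with minor allele frequency below 0.01.
age = rng.uniform(30, 60, N)
sex = rng.integers(0, 2, N)
meds = rng.integers(0, 2, N)
C = np.column_stack([np.ones(N), age, sex, meds])  # confounder design
metab = 0.02 * age + rng.standard_normal(N)        # one metabolite (toy)

beta, *_ = np.linalg.lstsq(C, metab, rcond=None)
resid = metab - C @ beta                           # confounders removed

# Genotype dosages in {0, 1, 2}; middle SNP is very rare.
snps = rng.binomial(2, [0.3, 0.001, 0.25], size=(N, 3))
freq = snps.mean(axis=0) / 2
maf = np.minimum(freq, 1 - freq)
kept = snps[:, maf >= 0.01]
print(kept.shape[1])  # 2: the rare SNP is filtered out
```

The residuals are exactly orthogonal to the confounder columns by construction, so the downstream regression cannot rediscover age or sex effects.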
6.3 Results: NFBC1966 metabolomics prediction
Figure 6 presents the test data MSE and training times for the different methods. With all training set sizes, the latent-noise BRRR outperforms the other methods. The methods blm and BRRR without noise model perform worse than the baseline (null model) even with the largest training data containing 3,761 individuals.
[Figure 6: (a) test data MSE and (b) training times for the different methods on the NFBC1966 data.]
6.4 Results: Impact of the latent SNR assumption
To study the sensitivity of the latent-noise BRRR to the value of the hyperparameter specifying the latent signal-to-noise ratio, we carried out an additional study. We reran the analysis of the NFBC1966 data with the largest training data size on a dense grid of this hyperparameter. Figure 7 presents the results of the sensitivity analysis. Test data MSE for NFBC1966 is shown as a function of the latent SNR and the model rank. We see that the a priori set latent SNR assumption has a dramatic impact on performance. The optimal values correspond to assuming that the SNPs explain a small fraction of the variance of the phenotype, which is in good agreement with previous estimates of the amount of variance predictable by the SNPs (Kettunen et al., 2012). The value learned from data with cross-validation (CV) gives slightly worse results than the best a priori setting, but is still better than the alternatives.
6.5 Results: DILGOM metabolomics and gene expression prediction
On another data set, which includes two data types (gene expression and metabolomics), the limited number of samples (506) made the prediction of the weak effects challenging for any method. Indeed, we noticed that the null model using the average training data value for prediction was better than any other method in terms of MSE over all variables. However, a detailed investigation of the results revealed that while many of the metabolites/probes could not be predicted at all (as indicated by a worse-than-null-model MSE), some of the metabolites/expression probes could still be predicted better than the null model. By focusing the analysis on (1) the number of expression probes/metabolites that could be predicted better than the null model, and (2) the MSE computed over the predictable probes/metabolites (i.e., those that could be predicted better than the null model by at least one method), conclusions regarding the model performances could still be made.
Results of the metabolite prediction experiment with the DILGOM data are presented in Table 1. Latent-noise BRRR outperforms all other methods, measured both in terms of test data MSE for the predictable metabolites and in the number of metabolites predicted better than the null model. Latent-noise BRRR has predictive power for 58 out of the 137 target variables, whereas GFlasso improves predictions, as compared to using the target data mean, for 38 target variables, BRRR for only 3 target variables, and the other methods have no predictive power for any of the target variables. In terms of MSE over all the metabolites, mean prediction (the null model) and L1/L2 MTL outperform all other methods; L1/L2 MTL selects, through cross-validation, values of the regularization parameters that set all regression weights to zero, effectively reducing the model to the null model.
Table 1: Results of the metabolite prediction experiment with the DILGOM data.

                            MSE on predictable   Better than   MSE over all
                            metabolites          null model    metabolites
latent-noise BRRR           1.027051             58            1.021744
BRRR                        1.078377              3            1.084020
GFlasso                     1.027175             38            1.021800
BRRR without noise model    1.246208              0            1.334456
L2/L1 MTL                   1.027583              0            1.021541
null model                  1.027583              0            1.021541
blm                         1.557899              0            1.551917
Results of the gene expression prediction experiment are presented in Table 2. Latent-noise BRRR and BRRR perform equally well with respect to the MSE on the predictable expression probes. In terms of the number of expression probes predicted better than the null model, latent-noise BRRR is again the best. In terms of the total MSE, the null model and L2/L1 MTL outperform all other methods, with L2/L1 MTL setting all regression weights to zero based on cross-validation. The good performance of BRRR in this task is reassuring, because routine preprocessing of gene expression data removes the effects corresponding to the first PCA components, which is very similar to what the structured noise model in BRRR achieves.
Table 2: Results of the gene expression prediction experiment with the DILGOM data.

                            MSE on predictable   Better than   MSE over all
                            expression probes    null model    expression probes
latent-noise BRRR           1.009831             123           1.01076
BRRR                        1.009831             111           1.01137
GFlasso                     1.010133             103           1.01064
BRRR without noise model    1.015306              37           1.01688
L2/L1 MTL                   1.010077               0           1.01006
null model                  1.010077               0           1.01006
blm                         1.047237               2           1.05529
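The routine PCA-based preprocessing mentioned above, removing the leading principal components as a stand-in for structured noise, can be sketched as follows; the function name and the rank-1 confounder are illustrative assumptions:

```python
import numpy as np

def remove_top_pcs(Y, k):
    """Return Y (column-centered) with its first k principal components projected out.

    A common gene-expression preprocessing step: the leading PCs are assumed
    to capture structured, confounder-driven variation shared across probes.
    """
    Yc = Y - Y.mean(axis=0)                       # center each column
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    # Subtract the rank-k reconstruction, keeping only trailing components.
    return Yc - (U[:, :k] * s[:k]) @ Vt[:k]

# Toy data: a rank-1 confounder dominates 50 "probes" over 200 samples.
rng = np.random.default_rng(1)
confounder = rng.normal(size=(200, 1)) @ rng.normal(size=(1, 50))
Y = confounder + 0.1 * rng.normal(size=(200, 50))
Y_clean = remove_top_pcs(Y, k=1)   # most structured variation is removed
```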
6.6 Results: multivariate association detection
Detection of associations between multiple SNPs and metabolites is a topic that has received attention recently (see, e.g., Kim et al., 2009; Inouye et al., 2012; Marttinen et al., 2014). Here we demonstrate the potential of the new method in this task using two illustrative example genes for which some ground truth is available. In the experiment, we investigate associations between the metabolites in the NFBC1966 data and SNPs within two genes, LIPC and XRCC4. LIPC was selected as a reference because it is one of the most strongly lipid-associated genes. In contrast, XRCC4 was discovered only recently using three cohorts of individuals (Marttinen et al., 2014); it was selected to serve as an example of a complex association detectable only by associating multiple SNPs with multiple metabolites, and not visible to simpler methods.
We use the proportion of total variance explained (PTVE) as the test score (Marttinen et al., 2014) and sample 100 permutations to measure the power to detect the associations. Furthermore, we use downsampling to evaluate the impact of the amount of training data. For comparison, we select BRRR, exhaustive pairwise (univariate) linear regression ('lm'), and canonical correlation analysis (CCA) (Ferreira and Purcell, 2009), these being the methods that have been proposed for the task and that have sensible runtimes in putative genome-wide applications. For lm, the test score is the minimum p-value of the regression coefficient over all SNP-metabolite pairs; for CCA, it is the minimum p-value over all SNPs (each SNP associated with all metabolites jointly). The association involving the XRCC4 gene was originally detected using the BRRR model; however, unlike here, informative priors were used for the regression coefficients.

Table 3 presents the ranking of the original data among the permuted data for the different sample sizes and methods. As expected, all methods were able to detect the association involving LIPC with both training set sizes. However, latent-noise BRRR had clearly the highest power to detect the XRCC4 gene.
Table 3: Ranking of the original data among the permuted data, by gene and training set size.

                       XRCC4            LIPC
                       4702    2351     4702    2351
latent-noise BRRR      0.98    0.94     1       1.00
BRRR                   0.41    0.32     1       0.99
lm                     0.62    0.74     1       1.00
cca                    0.20    0.24     1       1.00
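The permutation scheme used above can be sketched as follows; since the BRRR-based PTVE score is not reproduced here, a simple squared-correlation score and the toy data stand in for it as illustrative assumptions:

```python
import numpy as np

def permutation_rank(score_fn, X, Y, n_perm=100, seed=0):
    """Fraction of permutations whose score the observed data beats.

    score_fn(X, Y) returns a scalar association score (e.g. PTVE);
    permuting the rows of Y breaks any X-Y association, giving a
    null distribution for the score.
    """
    rng = np.random.default_rng(seed)
    observed = score_fn(X, Y)
    null = [score_fn(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    return np.mean(observed > np.asarray(null))

# Illustrative score: squared correlation between the first columns.
def toy_score(X, Y):
    return np.corrcoef(X[:, 0], Y[:, 0])[0, 1] ** 2

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
Y = 0.5 * X[:, :1] + rng.normal(size=(300, 4))  # genuine association via X[:, 0]
rank = permutation_rank(toy_score, X, Y)        # high rank = high power
```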
7 Discussion
In this work, we evaluated the performance of multiple-output regression under different assumptions about the structured noise. While most existing methods assume a priori independence of the interesting effects and the uninteresting structured noise, we started from the opposite assumption of strong dependence between the components of the model. This assumption is more accurate, for instance, for the molecular biology data sets often analyzed with such methods. Using simulations, we demonstrated the harmfulness of the independence assumption when latent noise was present. In real data experiments, the model assuming latent noise outperformed state-of-the-art methods in predicting metabolite and gene expression measurements from genotype (SNP) data. In an illustrative multivariate association detection task, the latent noise model had increased power to detect associations invisible to the other methods. To better address computational needs, we presented a new algorithm that reduces the runtime considerably and improves the scalability of BRRR models as the number of variables increases. The prior distributions were parameterized in terms of the new concept of latent signal-to-noise ratio, which was a key ingredient for optimal model performance. In addition, the rotational unidentifiability of the model was resolved using ordered infinite-dimensional shrinkage priors.
The new model implementing the concept of latent noise was studied using high-dimensional data containing weak signal (weak effects). The new model exploits a ubiquitous characteristic of such data: while the interesting effects are weak, the noise is strong. Latent-noise BRRR borrows statistical strength from the noise model to facilitate learning of the weak effects, by automatically enforcing the regression coefficients of correlated target variables to be correlated. This intuitive characteristic can be seen as a counterpart, in the target variable space, of the powered correlation priors (Krishna et al., 2009): Krishna et al. used the correlation structure of the covariates as a prior for the regression weights to enforce correlated covariates to have correlated weights.
The latent-noise BRRR is an extension of several common model families. By removing the covariates, the model reduces to a standard factor analysis model, which explains the output data with underlying factors. Thus, latent-noise BRRR can be seen as a reversed analogy of PCA regression (Bernardo et al., 2003), in which components of the input space are used as covariates in prediction; in latent-noise BRRR, components derived from the output space are predicted using the covariates (see Bo and Sminchisescu, 2009). Allowing the noise term to affect the latent space directly results in interesting connections to linear mixed models (LMMs) and best linear unbiased prediction (BLUP) (Robinson, 1991); using the latent noise formulation, the model can explain away bias in the residuals as in BLUP. LMMs, on the other hand, have a random term for each sample and target variable. While LMMs are not computationally feasible to generalize to high-dimensional targets, owing to the number of random effect parameters and the associated inversion of a large covariance matrix, latent-noise BRRR can be seen as a low-rank generalization of LMMs for high-dimensional target variables: the covariates are used for prediction in the latent space, and in this space there is a noise term for each sample and latent dimension. Therefore, the number of random effect parameters stays at the number of samples times the latent dimensionality, and inference remains tractable.
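The contrast between the two noise assumptions can be sketched as follows, under assumed notation not taken verbatim from the paper (Y the targets, X the covariates, Ψ and Γ the low-rank coefficient factors, H the structured-noise factors, E unstructured noise):

```latex
% Standard factor regression: structured noise enters the output space
% through its own loadings, a priori independent of the interesting effects.
% Latent-noise BRRR: the noise acts in the K-dimensional latent space and
% reaches the outputs through the SAME loadings \Gamma as the signal, so
% coefficients of correlated targets are forced to be correlated.
\begin{align*}
  \mathbf{Y} &= \mathbf{X}\boldsymbol{\Psi}\boldsymbol{\Gamma}
              + \mathbf{H}\boldsymbol{\Lambda} + \mathbf{E}
  && \text{(independent structured noise)} \\
  \mathbf{Y} &= (\mathbf{X}\boldsymbol{\Psi} + \mathbf{H})\boldsymbol{\Gamma}
              + \mathbf{E}
  && \text{(latent noise)}
\end{align*}
```

In the latent-noise form, H has one entry per sample and latent dimension, which is what keeps the number of random effect parameters small relative to a full LMM over all target variables.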
In summary, our findings extend the existing literature on modeling structured noise in an important way, by showing that structured noise can, and should, be taken advantage of when learning the interesting effects between the covariates and the target variables, and by showing how this can be done. Code in R for the new method is available on request from the authors.
This work was financially supported by the Academy of Finland (grant number 251170 to the Finnish Centre of Excellence in Computational Inference Research COIN; grant number 259272 to PM; grant number 257654 to MP; grant number 140057 to SK).
NFBC1966 received financial support from the Academy of Finland (project grants 104781, 120315, 129269, 1114194, 24300796, Center of Excellence in Complex Disease Genetics and SALVE), University Hospital Oulu, Biocenter, University of Oulu, Finland (75617), NHLBI grant 5R01HL08767902 through the STAMPEED program (1RL1MH08326801), NIH/NIMH (5R01MH63706:02), ENGAGE project and grant agreement HEALTHF42007201413, EU FP7 EurHEALTHAgeing 277849 and the Medical Research Council, UK (G0500539, G0600705, G1002319, PrevMetSyn/SALVE).
References

Bargi et al. (2014) A. Bargi, R. Y. Xu, Z. Ghahramani, and M. Piccardi. A non-parametric conditional factor regression model for multi-dimensional input and response. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33, pages 77–85. JMLR.org, 2014.
Baxter (1996) J. Baxter. A Bayesian/information theoretic model of bias learning. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, COLT ’96, pages 77–88, New York, NY, USA, 1996. ACM.
 Bernardo et al. (2003) J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statistics, 7:733–742, 2003.
 Bhattacharya and Dunson (2011) A. Bhattacharya and D. Dunson. Sparse Bayesian infinite factor models. Biometrika, 98(2):291–306, 2011.
 Bishop et al. (2006) C. M. Bishop et al. Pattern recognition and machine learning, volume 1. Springer, 2006.
 Bo and Sminchisescu (2009) L. Bo and C. Sminchisescu. Supervised spectral latent variable models. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5, pages 33–40, Florida, USA, 2009. JMLR.org.
 Bottolo et al. (2011) L. Bottolo, E. Petretto, S. Blankenberg, F. Cambien, S. Cook, L. Tiret, and S. Richardson. Bayesian detection of expression quantitative trait loci hot spots. Genetics, 189(4):1449–1459, 2011.
 Breiman and Friedman (1997) L. Breiman and J. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54, 1997.
 Calus and Veerkamp (2011) M. Calus and R. Veerkamp. Accuracy of multitrait genomic selection using different methods. Genetics Selection Evolution, 43(1):26, 2011.
 Caruana (1997) R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
 Davis et al. (2014) O. Davis, G. Band, M. Pirinen, C. Haworth, E. Meaburn, Y. Kovas, N. Harlaar, S. Docherty, K. Hanscombe, M. Trzaskowski, et al. The correlation between reading and mathematics ability at age twelve has a substantial genetic component. Nature Communications, 5, 2014.
 Evgeniou and Pontil (2007) A. Evgeniou and M. Pontil. Multitask feature learning. In Advances in Neural Information Processing Systems 19, volume 19, pages 41–48. The MIT Press, 2007.
 Ferreira and Purcell (2009) M. Ferreira and S. Purcell. A multivariate test of association. Bioinformatics, 25(1):132–133, 2009.
 Foygel et al. (2012) R. Foygel, M. Horrell, M. Drton, and J. Lafferty. Nonparametric reduced rank regression. In Advances in Neural Information Processing Systems 25, pages 1637–1645, 2012.
 Friedman et al. (2010) J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
 Fusi et al. (2012) N. Fusi, O. Stegle, and N. Lawrence. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Computational Biology, 8(1):e1002330, 2012.
 Gelman et al. (2004) A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian data analysis. Chapman & Hall/CRC, 2004.
 Geweke (1996) J. Geweke. Bayesian reduced rank regression in econometrics. Journal of Econometrics, 75(1):121–146, 1996.
 Global Lipids Genetics Consortium (2013) Global Lipids Genetics Consortium. Discovery and refinement of loci associated with lipid levels. Nature Genetics, 45(11):1274–1283, 2013.
 Inouye et al. (2010) M. Inouye, J. Kettunen, P. Soininen, K. Silander, S. Ripatti, L. S Kumpula, E. Hämäläinen, P. Jousilahti, A. J. Kangas, S. Männistö, et al. Metabonomic, transcriptomic, and genomic variation of a population cohort. Molecular Systems Biology, 6(1), 2010.
 Inouye et al. (2012) M. Inouye, S. Ripatti, J. Kettunen, L. Lyytikäinen, N. Oksala, P. Laurila, A. Kangas, P. Soininen, M. Savolainen, J. Viikari, et al. Novel loci for metabolic networks and multitissue expression studies reveal genes for atherosclerosis. PLoS Genetics, 8(8):e1002907, 2012.
 Kang et al. (2008) H. M. Kang, C. Ye, and E. Eskin. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics, 180(4):1909–1925, 2008.
 Karlsson (2012) S. Karlsson. Conditional posteriors for the reduced rank regression model. Technical Report Working Papers 2012:11, Orebro University Business School, 2012.
 Kettunen et al. (2012) J. Kettunen, T. Tukiainen, A.-P. Sarin, A. Ortega-Alonso, E. Tikkanen, L.-P. Lyytikäinen, A. J. Kangas, P. Soininen, P. Würtz, K. Silander, et al. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature Genetics, 44(3):269–276, 2012.
 Kim and Xing (2009) S. Kim and E. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):e1000587, 2009.
 Kim et al. (2009) S. Kim, K. Sohn, and E. Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
 Klami et al. (2013) A. Klami, S. Virtanen, and S. Kaski. Bayesian canonical correlation analysis. The Journal of Machine Learning Research, 14(1):965–1003, 2013.
 Krishna et al. (2009) A. Krishna, H. D. Bondell, and S. K. Ghosh. Bayesian variable selection using an adaptive powered correlation prior. Journal of Statistical Planning and Inference, 139(8):2665–2674, 2009.
 Leek and Storey (2007) J. T. Leek and J. D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.
 Marttinen et al. (2014) P. Marttinen, M. Pirinen, A.-P. Sarin, J. Gillberg, J. Kettunen, I. Surakka, A. J. Kangas, P. Soininen, P. O’Reilly, M. Kaakinen, M. Kähönen, T. Lehtimäki, M. Ala-Korpela, O. T. Raitakari, V. Salomaa, M. R. Järvelin, S. Ripatti, and S. Kaski. Assessing multivariate gene-metabolome associations with rare variants using Bayesian reduced rank regression. Bioinformatics, 2014.
 Rai et al. (2012) P. Rai, A. Kumar, and H. Daume III. Simultaneously leveraging output and task structures for multipleoutput regression. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 3194–3202. Curran Associates, Inc., 2012.
 Rakitsch et al. (2013) B. Rakitsch, C. Lippert, K. Borgwardt, and O. Stegle. It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1466–1474. Curran Associates, Inc., 2013.
 Rantakallio (1969) P. Rantakallio. Groups at risk in low birth weight infants and perinatal mortality. Acta Paediatrica Scandinavica, 193:Suppl–193, 1969.
 Robinson (1991) G. K. Robinson. That BLUP is a good thing: The estimation of random effects. Statistical Science, pages 15–32, 1991.
 Sohn and Kim (2012) KA. Sohn and S. Kim. Joint estimation of structured sparsity and output structure in multipleoutput regression via inversecovariance regularization. In N. Lawrence and M. Girolami, editors, International Conference on Artificial Intelligence and Statistics, volume 22 of JMLR W&CP, pages 1081–1089. JMLR.org, 2012.
 Soininen et al. (2009) P. Soininen, A. Kangas, P. Würtz, T. Tukiainen, T. Tynkkynen, R. Laatikainen, M. Järvelin, M. Kähönen, T. Lehtimäki, J. Viikari, et al. Highthroughput serum NMR metabonomics for costeffective holistic studies on systemic metabolism. Analyst, 134(9):1781–1785, 2009.
 Stegle et al. (2010) O. Stegle, L. Parts, R. Durbin, and J. Winn. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Computational Biology, 6(5):e1000770, 2010.
 Stegle et al. (2011) O. Stegle, C. Lippert, J. M. Mooij, N. D. Lawrence, and K. M. Borgwardt. Efficient inference in matrix-variate Gaussian models with iid observation noise. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 630–638. Curran Associates, Inc., 2011.
 Stephens (2013) M. Stephens. A unified framework for association analysis with multiple related phenotypes. PLoS ONE 8(7): e65245, 2013.
 Teslovich et al. (2010) T. Teslovich, K. Musunuru, A. Smith, A. Edmondson, I. Stylianou, M. Koseki, J. Pirruccello, S. Ripatti, D. Chasman, C. Willer, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307):707–713, 2010.
 Virtanen et al. (2011) S. Virtanen, A. Klami, and S. Kaski. Bayesian CCA via group sparsity. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML11), ICML ’11, pages 457–464, New York, NY, 2011. ACM.
 Wang et al. (2012) W. Wang, V. Baladandayuthapani, J. Morris, B. Broom, G. Manyam, and K. Do. Integrative Bayesian analysis of highdimensional multiplatform genomics data. Bioinformatics, 29(2):149–159, 2012.
 Xu et al. (2009) C. Xu, X. Wang, Z. Li, and S. Xu. Mapping QTL for multiple traits using Bayesian statistics. Genetical Research, 91(1):23–37, 2009.
 Yi and Banerjee (2009) N. Yi and S. Banerjee. Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics, 181(3):1101–1113, 2009.