1 Introduction
In matrix completion—one of the most widely used approaches for collaborative filtering—the objective is to predict missing elements of a partially observed data matrix. Such problems are often characterized by large and extremely sparsely observed data sets. The classic linear solution to the problem is to find a factorization of the data matrix as a product of latent variables and weights (), from which elements of can be predicted as . Probabilistic matrix factorization (PMF) [11], formulates the problem as a probabilistic model, regularized by placing priors on and
, and finds the solution as a maximum a posteriori (MAP) estimate of these matrices. Fully Bayesian matrix factorization
[10]expands this model by further placing priors on model hyperparameters, and marginalizing these along with
and . Bayesian matrix factorization brings the advantages of automatic complexity control and better robustness to overfitting. Moreover, the solution comes with an uncertainty estimate, which is useful when the completed matrix is used for decision making. For instance, sparsely observed drugtarget interaction matrices are used for deciding which interactions to measure next.Lawrence and Urtasun Lawrence+Urtasun:2009 generalized PMF using a Gaussian process latent variable model (GPLVM) formulation, where the relationship between and is given by , with a GPprior placed over . The is optimized to find its MAP solution. Note that this formulation also subsumes the linear model as a special case. Subsequently, a variational inference framework for fully Bayesian GPLVM has been developed [2, 15], building on sparse GP approximations [9, 13]. It parametrizes the covariance matrix implied by the GP kernel using a set of auxiliary inducing variables. While Bayesian GPLVM has been successfully used for dimensionality reduction and extracting latent representations, less consideration has been given to its applicability in matrix completion tasks with extremely sparse data. Computationally, this is a much more demanding problem, because the variational updates have to be performed separately for each dimension of the data matrix, instead of being performed as a single operation.
Existing approaches for scaling up Bayesian GPLVM make use of the insight that, conditional on the inducing variables, the data points can be decoupled for parallel computations. In this line of work, Gal et al. Gal+others:2014 introduced a distributed version of Bayesian GPLVM. Dai et al. Dai+others:2014 proposed a similar framework, additionally using GPU acceleration to speed up local computations. Neither of the works demonstrated learning of latent variable models beyond moderatelysized data, nor have they been implemented for sparse matrices, which is needed for the problems considered in this paper. More importantly, current distributed solutions require the worker nodes to communicate with the master node in every iteration, which leads to an accumulating communication overhead as the number of worker units increased with the size of the problem. Vander Aa et al. VanderAa+others:2016 reported such a phenomenon for their distributed MPI implementation of Bayesian linear matrix factorization. Finally, our experience indicates that existing distributed implementations may suffer from high memory consumption.
For GPregression models, with observed, Deisenroth and Ng Deisenroth+Ng:2015 proposed a framework with particularly good scaling properties and efficient use of memory. This framework utilizes a productofGPexperts (PoE) formulation [1, 7, 16], which makes predictions using a product of independent local models, each operating on a subset of the data. These types of approximations are amenable to embarrassingly parallel computations, and can therefore be scaled up to arbitrarily large data sets, at least in principle. However, a direct application of PoE for nonlinear matrix completion may not produce satisfactory predictions for two reasons. First, since the target matrix is very sparsely observed, each local model has very limited information to learn an informative model without sharing information. Second, while local models could be combined into larger models to improve predictive performance, this is hindered by the general nonuniqueness of the latent variables in latent variable models.
In this work, we propose a distributed computational strategy which is almost as communicationefficient as embarrassingly parallel computation, but enables local models to share information and avoids the problem of nonuniqueness in aggregating local models. In a nutshell, one data subset is processed first, and the rest of the embarrassingly parallel computations are conditioned on the result. A similar idea was recently presented by [8] for Bayesian linear matrix completion. The remainder of the paper proceeds as follows: In Section 2 we first provide a brief review of GPLVMs. Then, in Section 3, we present our framework for scalable Bayesian nonlinear matrix completion. An empirical evaluation of the method, using simulations and a benchmark dataset, is given in Section 4. The paper ends with conclusions in Section 5.
2 Gaussian Process Latent Variable Models
A Gaussian process latent variable model (GPLVM) [6] can be constructed from a nonlinear multioutput regression model,
by placing a GP prior over the unknown functions . Integrating over the space of functions with respect to a zeromean GP then yields the likelihood as
(1) 
where is an kernel matrix defined by a GP covariance function . We use
to collectively denote all parameters, including the noise variance
and the parameters of the covariance function.When values are missing, as is the case in matrix completion, each factor of the likelihood (2) will only account for observed elements, thus we have
(2) 
where denotes the set of indices of observed elements in column . Furthermore, the symmetric matrices and will only include rows and columns corresponding to the indices .
2.1 Bayesian GPLVM
For Bayesian GPLVM, we complement the likelihood (2) by placing a prior on . A standard choice is to set
(3) 
The marginal likelihood of Bayesian GPLVM is obtained by integrating the model with respect to :
(4) 
While this operation is intractable in general, Titsias and Lawrence Titsias+Lawrence:2010 introduced a variational framework, which leads to a tractable lower bound,
(5)  
on the log of the marginal likelihood (4). For a detailed treatment of the framework, see [2]. As a byproduct of optimizing the lower bound (5), we get a variational approximation to the posterior , for which we assume a factorized Gaussian form
(6) 
In our current work, we use this approximation to share information between parallel computations (see Section 3.1).
2.2 Extension to Multiview Settings
Manifold relevance determination (MRD) [3] extends GPLVM to a multiview setting by reformulating the likelihood (2) as , where the elements of index the data views, i.e. matrices conditionally independent given a single latent matrix . In matrix completion problems, one of the views is typically the target in which prediction is carried out, while the other views constitute sidedata. When predicting values in completely unobserved (or new) rows or columns in the target, predictions are effectively done ‘outside’ of the observed matrix. This can be done with the help of observed data in the sideviews, and is referred to as outofmatrix prediction.
3 Scalable Matrix Completion Using Bayesian GPLVM
This section presents a computational strategy, which enables Bayesian GPLVM to be scaled up for largescale matrix completion problems using a product of experts (PoE). In brief, we first partition the observation matrix into disjoint subsets for parallel processing. To avoid problems resulting from the unidentifiability of latent variable models, we couple the subsets as follows: in an initial stage only one subset is processed; in the second stage, the remaining subsets are processed in parallel using incremental learning (Section 3.1). For each subset, we use an implementation of the variational inference framework for Bayesian GPLVM [15]. Finally, with a set of independently learned Bayesian GPLVMs, we use PoE to predict unobserved matrix elements (Section 3.2).
The proposed strategy is summarized in Figure 1. In Section 3.3, we present a further variant of the method, which uses intermediate aggregation of the submodels, prior to PoE, to improve predictive performance. The scalability of our method is briefly discussed in Section 3.4.
3.1 Coupling Parallel Inferences Using Incremental Learning
To couple the parallel inferences over subsets in the second stage of our method, we use a probabilistic variant of the incremental learning algorithm introduced by [17] for online learning of (nonBayesian) nonlinear latent variable models. Let be the submatrix processed in the initial stage. Furthermore, denote by , , the combined submatrix obtained by augmenting with . The corresponding combined latent matrix is denoted by .
The objective of incremental learning is to learn the joint latent matrix without extensive relearning of , while still allowing it to be updated. When learning , Yao et al. Yao+others:2011 added a regularizer to the loglikelihood to prevent from deviating too much from its initial estimate, and to speed up learning. The original incremental learning algorithm used the Frobenius norm to regularize the updated estimate of . In our current work, we use the KLdivergence to ensure that the updated variational posterior approximation remains close to the initial approximation . For , we use the default prior given in Equation (3). Thus, the variational inference for the incremental learning of Bayesian GPLVM follows the procedure introduced in Section 2.1, with the KL terms in the lower bound of Eq. (5) taking the following form:
For the augmented subsets, the inducing points and kernel hyperparameters are initialized to values obtained in the initial stage. For initialization of latent variables, we use posterior means for and nearest neighbors for [17].
3.2 Prediction with Product of Experts
Product of experts (PoE) prediction for Gaussian process regression [4] uses the simple idea of combining predictions made by independent GPmodels (i.e. ‘experts’) as a product:
(7) 
where is a given test input and is the corresponding output value to be predicted. Under this model, a prediction is proportional to a Gaussian with parameters
With latent variable models, the essential difference to the above is that the inputs are unobserved and must therefore be inferred. In matrix completion, we wish to predict the missing part of a partially observed test point , where are the observed elements (or side views) and are the missing values (or the unobserved target view) to be reconstructed. The prediction task can be finished in two steps. First, we infer the latent input of the test point, which involves maximizing the variational lower bound on the marginal likelihood
to obtain the approximate posterior . The lower bound has the same form as the learning objective function in Equation (5), but for its maximization, the variational distribution over latent variables for training data and parameters remains fixed during test time. After obtaining , making predictions for is approached as GP prediction with uncertain inputs [2, 5].
In our distributed setting, the experts in PoE correspond to submodels learned from the augmented training subsets formed in the incremental learning phase. To correct for the initial subset being used in training sets, we formulate a corrected PoE as follows:
Finally, denoting the means and variances of the local predictive distributions as and , respectively, we compute the aggregated statistics of the corrected PoE predictions as:
In their distributed GP framework, Deisenroth and Ng Deisenroth+Ng:2015 used a reweighted variant of PoE, which they coined the robust Bayesian committee machine (rBCM). Although rBCM has been shown to outperform the basic PoE for GPregression, in our current setup, we have not observed any advantage of it over PoE. We have therefore formulated our framework using standard PoE but note that the extension to rBCM is straightforward.
3.3 Improved Solution with Intermediate Aggregation
PoE aggregates predictions from local submodels learned on data subsets, effectively using a blockdiagonal approximation of the fulldata covariance matrix. With larger submodels, PoE provides a closer approximation to the full covariance matrix, which can be expected to result in better predictive accuracy. Here we introduce an intermediate aggregation strategy, by which submodels are aggregated for improved predictive performance, while the initial training of submodels is still done on smaller subsets with lower computational cost. While latent variable models are in general nonidentifiable, making a direct aggregation of local models difficult to carry out in a meaningful way, the incremental learning introduced in Section 3.1, encourages identifiability among local models, alleviating the problem.
The aggregation of submodels involves (i) stacking together local variational distributions, which are assumed to be independent across subsets, (ii) concatenating the corresponding data subsets, and finally (iii) aggregating the hyperparameters of the models. The model parameters can be approximated using suitable statistics (e.g. mode, median or mean) of the distributions. In our implementation, we use the mode to approximate the kernel and Gaussian noise variance parameters, and use averaging to estimate inducing variables.
Since the first subset is used multiple times through incremental learning, the corresponding variational distribution is obtained through the following aggregation:
where
Above, each of the variational distributions is Gaussian, of the form given by Equation (6).
Note that after intermediate aggregation, each training subset is used only once to make predictions, and we may use the ordinary PoE formulation in Equation (7) for prediction.
3.4 Computational Cost
Our method aims to leverage the scaling properties of sparse GP for training and those of PoE for prediction. Thus, for data partitioned into subsets of size , , and assuming that a sufficient number of parallel workers is available, the time complexity for training is , where is the number of inducing points and reflects the fact that variational updates have to be performed separately for each dimension of sparsely observed data. For prediction, the cost is . For incremental learning and intermediate aggregation, refers to the size of the concatenation of multiple subsets. By intermediate aggregation of submodels, we are able to trade off prediction cost against accuracy.
4 Experiments
In this section, we evaluate the predictive performance of the proposed method for outofmatrix prediction problems on simulated and realworld chemogenomic data, and compare it with two alternative approaches: (i) the embarrassingly parallel or subset of data (SoD) approach, which has been widely studied to scale up Gaussian Process regression models, and (ii) Macau, Bayesian multirelational factorization with side information [12], supporting outofmatrix prediction. Macau is implemented with highly optimized C libraries, and is available for experiments on largescale data. The comparison with the SoD approach shows the advantage of our method in sharing information among submodels, while the comparison with Macau shows the benefit of using Bayesian nonlinear matrix factorization. We emphasize, however, that the choice and effectiveness of a model always depends on the problem at hand.
Simulated data.
We generated synthetic data using nonlinear signals corrupted with Gaussian noise, using matern data generator available in the GPy^{2}^{2}2https://sheffieldml.github.io/GPy/ software. The data has three views , the dimension of the views is as follows: N = 25,000, = 150, = 100, = 150. As the task is to perform outofmatrix prediction, we randomly selected 20% of the rows as a test set, using the remaining rows as the training set. In addition, 80% of the data in the first view were masked as missing, to simulate the sparsely observed target data in a realworld application. The other two views were regarded as side information and were fully observed. Unlike Bayesian GPLVM, Macau cannot handle missing values in the side data.
Realworld data.
We performed the experiments on ExCAPEDB data [14], which is an aggregation of public compoundtarget bioactivity data and describes interactions between drugs and targets using the pIC50^{3}^{3}3IC50 (units in ) is the concentration of drug at which 50% of the target is inhibited. The lower the IC50 of the drug, the less likely the drug will be to have some offtarget effect (e.g. potential toxicity) that is not desired. . measure. It has 969,725 compounds and 806 targets with 76,384,863 observed bioactivities. The dataset has 469 chem2vec features as side information which are generated from ECFP fingerprint features for the compounds using word2vec software. We used 3fold cross validation to split the training and test set, where about 30% of the rows or compounds were chosen as test set in each fold.
4.1 Experimental Setup
The experimental setting for MRD models is: number of inducing points 100, optimization through scaled conjugate gradients (SCG) with 500 iterations. For the SoD approach, the latent variables were initialized with PPCA method. We ran Macau with Gibbs sampling for 1200 iterations, discarded the first 800 samples as burnin and saved every second of the remaining samples yielding in total 200 posterior samples. We set the dimension of latent variables K=10 for ExCAPEDB data, K=5 for simulated data for all methods.
For the proposed and SoD methods, we partitioned the simulated data into 10x1 subsets and ExCAPEDB data into 400x1 subsets. Other partitions are also possible; we have chosen the size of subsets such that Bayesian inference could be performed for the subsets in reasonable time on a single CPU. Notice that the views with missing values are generally sparsely observed in many realworld applications, which makes it challenging to learn informative models for such data. Following Qin et al. Qin+others:2019, we reordered the rows and columns of training data in descending order according to the proportion of observations in them. This makes the first subset the most densely observed block, thus making the resulting submodel informative and facilitating the parallel inferences in the following stages.
Kernel  Macau  SoD  Proposed methods  Full posterior  
PoE  Intermediate aggregation  
RMSE: the smaller, the better.  
Linear  0.8927 0.010  0.747 0.034  0.685 0.041  0.656 0.038  0.656 0.039 
RBF    0.825 0.034  0.791 0.048  0.683 0.038  0.658 0.039 
Matern32    0.824 0.032  0.772 0.048  0.687 0.039  0.656 0.038 
Spearman correlation score: the larger, the better.  
Linear  0.6971 0.044  0.713 0.056  0.726 0.048  0.744 0.044  0.744 0.045 
RBF    0.689 0.060  0.651 0.080  0.721 0.048  0.742 0.045 
Matern32    0.684 0.064  0.672 0.083  0.718 0.047  0.744 0.044 
Wallclock time (secs.)  
Linear  171.44  27730.748  55191.296  57055.027  249782.967 
RBF    31371.620  53211.605  57306.449  176075.462 
Matern32    36757.294  67197.138  71945.822  106303.016 
Affinity  Model  RMSE  F1score  AUCROC  Ratio of successful  Wallclock 

level  score  queries  time (secs.)  
5  Macau  1.108 0.069  0.886 0.003  0.805 0.009  0.319 0.024  37041.8 
5  SoD  0.914 0.023  0.890 0.011  0.791 0.003  0.363 0.006  63110.16 
Proposed methods:  
5  PoE  0.831 0.021  0.900 0.001  0.805 0.002  0.309 0.008  92419.06 
5  Intermediate aggregation (nAggs=10)  0.743 0.009  0.919 0.005  0.811 0.006  0.405 0.018  93331.23 
5  Intermediate aggregation (nAggs=20)  0.736 0.004  0.914 0.003  0.813 0.004  0.455 0.008  100492.9 
6  Macau  1.123 0.065  0.783 0.013  0.799 0.006  0.318 0.003  37041.8 
6  SoD  0.930 0.021  0.787 0.011  0.791 0.005  0.381 0.019  63110.16 
Proposed methods:  
6  PoE  0.837 0.022  0.846 0.011  0.811 0.003  0.285 0.004  92419.06 
6  Intermediate aggregation (nAggs=10)  0.775 0.028  0.851 0.015  0.817 0.003  0.376 0.025  93331.23 
6  Intermediate aggregation (nAggs=20)  0.789 0.006  0.838 0.005  0.816 0.004  0.434 0.016  100492.9 
We evaluated performance by predictive accuracy and the quality of prediction for downstream ranking tasks. Root mean squared error (RMSE) is a common performance measure for matrix completion. In realworld applications, such as item recommendation or drug discovery, we are more interested in the performance of the ranking task, for instance how many of the recommended items the user actually clicks or buys, how many drugs recommended by models actually have the desired effect for the disease. For this purpose, we regard matrix completion as a classification task (of whether a prediction is relevant or not at a given threshold), use F1 and AUCROC score as performance metrics for ExCAPEDB. Furthermore, following Qin et al. Qin+others:2019, we use the wallclock time^{4}^{4}4Wallclock time measures the real time between the start and the end of a program. For parallel processes, we use the wallclock time of the slowest process. to measure the speedup achieved by parallelization. For our method, the reported wallclock time is calculated by summing the maximum wallclock times of submodels for each inference stage plus the wallclock time of making prediction.
For compound activity prediction tasks, we use a pIC50 cutoff (a.k.a. affinity level) at 5 and 6, corresponding to concentrations of 10 and 1, respectively. The test set was further filtered by only keeping targets having at least 100 compounds, at least 25 active compounds, and 25 inactive compounds, to ensure a minimum number of positive and negative data points.
Macau^{5}^{5}5We ran the Macau version available in SMURFF software: https://github.com/ExaScience/smurff. was run on compute nodes with 20 CPUs; all the other methods were run on a single CPU. Our implementation is based on the GPy package.
4.2 Results
The results for simulated and ExCAPEDB data are given in Table 1 and 2, respectively. In Table 1, column ‘Full posterior’ refers to the performance of MRD learned from the full data; column ‘Intermediate aggregation’ refers to our method which works by first aggregating multiple submodels into a model with reasonable size (as long as the compute node can still accommodate the model to make predictions) and then perform predictions by aggregating predictions from multiple experts with PoE.
It is clear from Table 1 that the model with full posterior performs better than other methods in terms of predictive performance; our intermediate aggregation method achieves competitive results while being much faster than the full posterior. The intermediate aggregation method also performs better than the SoD approach and the variant of our method without the intermediate aggregation step. With a linear kernel, all MRD models perform better than Macau.
For ExCAPEDB data, our intermediate aggregation method (by aggregating 10 or 20 submodels to obtain larger models for prediction) performs much better than all the other methods in all performance metrics for different affinity levels. At both affinity levels, all versions of our proposed method perform better than the SoD method in terms of RMSE, F1score and AUCROC score. Again, we observed that all MRD methods perform better than Macau in all performance metrics. In both tables, the wallclock times of our methods are larger than that of the SoD approach. This is due to the twostage parallel inferences of the proposed scheme.
To summarise, the proposed method with an intermediate aggregation step achieves a better tradeoff between predictive accuracy and computation time. The proposed method performs better than the embarrassingly parallel approaches for scalable Gaussian process models and a stateoftheart highly optimized implementation of linear Bayesian matrix factorization with side information.
5 Conclusion
In this paper, we have introduced a scalable approach for Bayesian nonlinear matrix completion. We have argued that a key factor in constructing distributed solutions for massivescale data is to limit the communication required between computational units. To this end, we have introduced a computational scheme which leverages embarrassingly parallel techniques developed for Gaussian process regression by suitably adapting them for Bayesian Gaussian process latent variable models. The resulting framework is almost as communicationefficient as embarrassingly parallel computation, adding only one additional stage of communication, while achieving accuracy close to the nondistributed full data solution.
Acknowledgements
The authors gratefully acknowledge the computational resources provided by the Aalto ScienceIT project and support by the Academy of Finland (Finnish Center for Artificial Intelligence, FCAI, and projects 319264, 292334).
References
 [1] (2014) Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In Modern Nonparametrics 3: Automating the Learning Pipeline workshop at NIPS, Cited by: §1.
 [2] (2016) Variational inference for latent variables and uncertain inputs in Gaussian processes. JMLR 17 (42), pp. 1–62. Cited by: §1, §2.1, §3.2.
 [3] (2012) Manifold relevance determination. In ICML, Cited by: §2.2.
 [4] (2015) Distributed gaussian processes. In ICML, pp. 1481–1490. Cited by: §3.2.
 [5] (2003) Gaussian process priors with uncertain inputs application to multiplestep ahead time series forecasting. In NIPS, pp. 545–552. Cited by: §3.2.

[6]
(2005)
Probabilistic nonlinear principal component analysis with Gaussian process latent variable models
. JMLR 6 (Nov), pp. 1783–1816. Cited by: §2.  [7] (2014) Hierarchical mixtureofexperts model for largescale Gaussian process regression. arXiv preprint arXiv:1412.3078. Cited by: §1.
 [8] (2019) Distributed Bayesian matrix factorization with limited communication. Machine Learning, pp. 1–26. Cited by: §1.
 [9] (2005) A unifying view of sparse approximate gaussian process regression. JMLR 6, pp. 1939–1959. External Links: ISSN 15324435, Link Cited by: §1.

[10]
(2008)
Bayesian probabilistic matrix factorization using Markov chain Monte Carlo
. In ICML, pp. 880–887. External Links: Link, Document Cited by: §1.  [11] (2008) Probabilistic matrix factorization. In NIPS, Cited by: §1.
 [12] (2015) Macau: scalable bayesian multirelational factorization with side information using mcmc. arXiv preprint arXiv:1509.04610. Cited by: §4.
 [13] (2006) Sparse Gaussian processes using pseudoinputs. In NIPS, pp. 1257–1264. Cited by: §1.
 [14] (2017) ExCAPEDB: an integrated large scale dataset facilitating big data analysis in chemogenomics. Journal of Cheminformatics 9 (1), pp. 17:1–17:9. External Links: Document, Link Cited by: §4.
 [15] (2010) Bayesian Gaussian process latent variable model. In AISTATS, pp. 844–851. Cited by: §1, §3.
 [16] (2000) A Bayesian committee machine. Neural Computation 12 (11), pp. 2719–2741. Cited by: §1.
 [17] (2011) Learning probabilistic nonlinear latent variable models for tracking complex activities. In NIPS, pp. 1359–1367. Cited by: §3.1, §3.1.
Comments
There are no comments yet.