Scalable Bayesian Non-linear Matrix Completion

07/31/2019
by   Xiangju Qin, et al.
Helsingin yliopisto
aalto
4

Matrix completion aims to predict missing elements in a partially observed data matrix which in typical applications, such as collaborative filtering, is large and extremely sparsely observed. A standard solution is matrix factorization, which predicts unobserved entries as linear combinations of latent variables. We generalize to non-linear combinations in massive-scale matrices. Bayesian approaches have been proven beneficial in linear matrix completion, but not applied in the more general non-linear case, due to limited scalability. We introduce a Bayesian non-linear matrix completion algorithm, which is based on a recent Bayesian formulation of Gaussian process latent variable models. To solve the challenges regarding scalability and computation, we propose a data-parallel distributed computational approach with a restricted communication scheme. We evaluate our method on challenging out-of-matrix prediction tasks using both simulated and real-world data.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

12/03/2015

Bayesian Matrix Completion via Adaptive Relaxed Spectral Regularization

Bayesian matrix completion has been studied based on a low-rank matrix f...
09/17/2020

Bayesian Matrix Completion for Hypothesis Testing

The United States Environmental Protection Agency (EPA) screens thousand...
07/05/2011

Distributed Matrix Completion and Robust Factorization

If learning methods are to scale to the massive sizes of modern datasets...
10/20/2021

Computational Graph Completion

We introduce a framework for generating, organizing, and reasoning with ...
12/11/2020

Deep Learning Approach for Matrix Completion Using Manifold Learning

Matrix completion has received vast amount of attention and research due...
05/26/2018

Nonlinear Inductive Matrix Completion based on One-layer Neural Networks

The goal of a recommendation system is to predict the interest of a user...
08/06/2018

Improving Temporal Interpolation of Head and Body Pose using Gaussian Process Regression in a Matrix Completion Setting

This paper presents a model for head and body pose estimation (HBPE) whe...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In matrix completion—one of the most widely used approaches for collaborative filtering—the objective is to predict missing elements of a partially observed data matrix. Such problems are often characterized by large and extremely sparsely observed data sets. The classic linear solution to the problem is to find a factorization of the data matrix as a product of latent variables and weights (), from which elements of can be predicted as . Probabilistic matrix factorization (PMF) [11], formulates the problem as a probabilistic model, regularized by placing priors on and

, and finds the solution as a maximum a posteriori (MAP) estimate of these matrices. Fully Bayesian matrix factorization

[10]

expands this model by further placing priors on model hyperparameters, and marginalizing these along with

and . Bayesian matrix factorization brings the advantages of automatic complexity control and better robustness to overfitting. Moreover, the solution comes with an uncertainty estimate, which is useful when the completed matrix is used for decision making. For instance, sparsely observed drug-target interaction matrices are used for deciding which interactions to measure next.

Lawrence and Urtasun Lawrence+Urtasun:2009 generalized PMF using a Gaussian process latent variable model (GP-LVM) formulation, where the relationship between and is given by , with a GP-prior placed over . The is optimized to find its MAP solution. Note that this formulation also subsumes the linear model as a special case. Subsequently, a variational inference framework for fully Bayesian GP-LVM has been developed [2, 15], building on sparse GP approximations [9, 13]. It parametrizes the covariance matrix implied by the GP kernel using a set of auxiliary inducing variables. While Bayesian GP-LVM has been successfully used for dimensionality reduction and extracting latent representations, less consideration has been given to its applicability in matrix completion tasks with extremely sparse data. Computationally, this is a much more demanding problem, because the variational updates have to be performed separately for each dimension of the data matrix, instead of being performed as a single operation.

Existing approaches for scaling up Bayesian GP-LVM make use of the insight that, conditional on the inducing variables, the data points can be decoupled for parallel computations. In this line of work, Gal et al. Gal+others:2014 introduced a distributed version of Bayesian GP-LVM. Dai et al. Dai+others:2014 proposed a similar framework, additionally using GPU acceleration to speed up local computations. Neither of the works demonstrated learning of latent variable models beyond moderately-sized data, nor have they been implemented for sparse matrices, which is needed for the problems considered in this paper. More importantly, current distributed solutions require the worker nodes to communicate with the master node in every iteration, which leads to an accumulating communication overhead as the number of worker units increased with the size of the problem. Vander Aa et al. VanderAa+others:2016 reported such a phenomenon for their distributed MPI implementation of Bayesian linear matrix factorization. Finally, our experience indicates that existing distributed implementations may suffer from high memory consumption.

For GP-regression models, with observed, Deisenroth and Ng Deisenroth+Ng:2015 proposed a framework with particularly good scaling properties and efficient use of memory. This framework utilizes a product-of-GP-experts (PoE) formulation [1, 7, 16], which makes predictions using a product of independent local models, each operating on a subset of the data. These types of approximations are amenable to embarrassingly parallel computations, and can therefore be scaled up to arbitrarily large data sets, at least in principle. However, a direct application of PoE for nonlinear matrix completion may not produce satisfactory predictions for two reasons. First, since the target matrix is very sparsely observed, each local model has very limited information to learn an informative model without sharing information. Second, while local models could be combined into larger models to improve predictive performance, this is hindered by the general non-uniqueness of the latent variables in latent variable models.

In this work, we propose a distributed computational strategy which is almost as communication-efficient as embarrassingly parallel computation, but enables local models to share information and avoids the problem of non-uniqueness in aggregating local models. In a nutshell, one data subset is processed first, and the rest of the embarrassingly parallel computations are conditioned on the result. A similar idea was recently presented by [8] for Bayesian linear matrix completion. The remainder of the paper proceeds as follows: In Section 2 we first provide a brief review of GP-LVMs. Then, in Section 3, we present our framework for scalable Bayesian non-linear matrix completion. An empirical evaluation of the method, using simulations and a benchmark dataset, is given in Section 4. The paper ends with conclusions in Section 5.

2 Gaussian Process Latent Variable Models

A Gaussian process latent variable model (GP-LVM) [6] can be constructed from a non-linear multi-output regression model,

by placing a GP prior over the unknown functions . Integrating over the space of functions with respect to a zero-mean GP then yields the likelihood as

(1)

where is an kernel matrix defined by a GP covariance function . We use

to collectively denote all parameters, including the noise variance

and the parameters of the covariance function.

When values are missing, as is the case in matrix completion, each factor of the likelihood (2) will only account for observed elements, thus we have

(2)

where denotes the set of indices of observed elements in column . Furthermore, the symmetric matrices and will only include rows and columns corresponding to the indices .

2.1 Bayesian GP-LVM

For Bayesian GP-LVM, we complement the likelihood (2) by placing a prior on . A standard choice is to set

(3)

The marginal likelihood of Bayesian GP-LVM is obtained by integrating the model with respect to :

(4)

While this operation is intractable in general, Titsias and Lawrence Titsias+Lawrence:2010 introduced a variational framework, which leads to a tractable lower bound,

(5)

on the log of the marginal likelihood (4). For a detailed treatment of the framework, see [2]. As a by-product of optimizing the lower bound (5), we get a variational approximation to the posterior , for which we assume a factorized Gaussian form

(6)

In our current work, we use this approximation to share information between parallel computations (see Section 3.1).

2.2 Extension to Multi-view Settings

Manifold relevance determination (MRD) [3] extends GP-LVM to a multi-view setting by reformulating the likelihood (2) as , where the elements of index the data views, i.e. matrices conditionally independent given a single latent matrix . In matrix completion problems, one of the views is typically the target in which prediction is carried out, while the other views constitute side-data. When predicting values in completely unobserved (or new) rows or columns in the target, predictions are effectively done ‘outside’ of the observed matrix. This can be done with the help of observed data in the side-views, and is referred to as out-of-matrix prediction.

Figure 1: Overview of scalable Bayesian non-linear matrix completion. In the learning phase (left panel), a large matrix is partitioned into 4x1 subsets, which are processed in two stages: in the initial stage only one subset is processed (box with solid borders), after which the remaining subsets are processed in parallel, each coupled with the first subset using incremental learning. The prediction phase (right panel) uses a product of experts, aggregating the means and variances of local experts into a joint Gaussian prediction with and .

3 Scalable Matrix Completion Using Bayesian GP-LVM

This section presents a computational strategy, which enables Bayesian GP-LVM to be scaled up for large-scale matrix completion problems using a product of experts (PoE). In brief, we first partition the observation matrix into disjoint subsets for parallel processing. To avoid problems resulting from the unidentifiability of latent variable models, we couple the subsets as follows: in an initial stage only one subset is processed; in the second stage, the remaining subsets are processed in parallel using incremental learning (Section 3.1). For each subset, we use an implementation of the variational inference framework for Bayesian GP-LVM [15]. Finally, with a set of independently learned Bayesian GP-LVMs, we use PoE to predict unobserved matrix elements (Section 3.2).

The proposed strategy is summarized in Figure 1. In Section 3.3, we present a further variant of the method, which uses intermediate aggregation of the submodels, prior to PoE, to improve predictive performance. The scalability of our method is briefly discussed in Section 3.4.

3.1 Coupling Parallel Inferences Using Incremental Learning

To couple the parallel inferences over subsets in the second stage of our method, we use a probabilistic variant of the incremental learning algorithm introduced by [17] for online learning of (non-Bayesian) non-linear latent variable models. Let be the submatrix processed in the initial stage. Furthermore, denote by , , the combined submatrix obtained by augmenting with . The corresponding combined latent matrix is denoted by .

The objective of incremental learning is to learn the joint latent matrix without extensive relearning of , while still allowing it to be updated. When learning , Yao et al. Yao+others:2011 added a regularizer to the log-likelihood to prevent from deviating too much from its initial estimate, and to speed up learning. The original incremental learning algorithm used the Frobenius norm to regularize the updated estimate of . In our current work, we use the KL-divergence to ensure that the updated variational posterior approximation remains close to the initial approximation . For , we use the default prior given in Equation (3). Thus, the variational inference for the incremental learning of Bayesian GP-LVM follows the procedure introduced in Section 2.1, with the KL terms in the lower bound of Eq. (5) taking the following form:

For the augmented subsets, the inducing points and kernel hyperparameters are initialized to values obtained in the initial stage. For initialization of latent variables, we use posterior means for and nearest neighbors for [17].

3.2 Prediction with Product of Experts

Product of experts (PoE) prediction for Gaussian process regression [4] uses the simple idea of combining predictions made by independent GP-models (i.e. ‘experts’) as a product:

(7)

where is a given test input and is the corresponding output value to be predicted. Under this model, a prediction is proportional to a Gaussian with parameters

With latent variable models, the essential difference to the above is that the inputs are unobserved and must therefore be inferred. In matrix completion, we wish to predict the missing part of a partially observed test point , where are the observed elements (or side views) and are the missing values (or the unobserved target view) to be reconstructed. The prediction task can be finished in two steps. First, we infer the latent input of the test point, which involves maximizing the variational lower bound on the marginal likelihood

to obtain the approximate posterior . The lower bound has the same form as the learning objective function in Equation (5), but for its maximization, the variational distribution over latent variables for training data and parameters remains fixed during test time. After obtaining , making predictions for is approached as GP prediction with uncertain inputs [2, 5].

In our distributed setting, the experts in PoE correspond to submodels learned from the augmented training subsets formed in the incremental learning phase. To correct for the initial subset being used in training sets, we formulate a corrected PoE as follows:

Finally, denoting the means and variances of the local predictive distributions as and , respectively, we compute the aggregated statistics of the corrected PoE predictions as:

In their distributed GP framework, Deisenroth and Ng Deisenroth+Ng:2015 used a re-weighted variant of PoE, which they coined the robust Bayesian committee machine (rBCM). Although rBCM has been shown to outperform the basic PoE for GP-regression, in our current setup, we have not observed any advantage of it over PoE. We have therefore formulated our framework using standard PoE but note that the extension to rBCM is straightforward.

3.3 Improved Solution with Intermediate Aggregation

PoE aggregates predictions from local submodels learned on data subsets, effectively using a block-diagonal approximation of the full-data covariance matrix. With larger submodels, PoE provides a closer approximation to the full covariance matrix, which can be expected to result in better predictive accuracy. Here we introduce an intermediate aggregation strategy, by which submodels are aggregated for improved predictive performance, while the initial training of submodels is still done on smaller subsets with lower computational cost. While latent variable models are in general non-identifiable, making a direct aggregation of local models difficult to carry out in a meaningful way, the incremental learning introduced in Section 3.1, encourages identifiability among local models, alleviating the problem.

The aggregation of submodels involves (i) stacking together local variational distributions, which are assumed to be independent across subsets, (ii) concatenating the corresponding data subsets, and finally (iii) aggregating the hyperparameters of the models. The model parameters can be approximated using suitable statistics (e.g. mode, median or mean) of the distributions. In our implementation, we use the mode to approximate the kernel and Gaussian noise variance parameters, and use averaging to estimate inducing variables.

Since the first subset is used multiple times through incremental learning, the corresponding variational distribution is obtained through the following aggregation:

where

Above, each of the variational distributions is Gaussian, of the form given by Equation (6).

Note that after intermediate aggregation, each training subset is used only once to make predictions, and we may use the ordinary PoE formulation in Equation (7) for prediction.

3.4 Computational Cost

Our method aims to leverage the scaling properties of sparse GP for training and those of PoE for prediction. Thus, for data partitioned into subsets of size , , and assuming that a sufficient number of parallel workers is available, the time complexity for training is , where is the number of inducing points and reflects the fact that variational updates have to be performed separately for each dimension of sparsely observed data. For prediction, the cost is . For incremental learning and intermediate aggregation, refers to the size of the concatenation of multiple subsets. By intermediate aggregation of submodels, we are able to trade off prediction cost against accuracy.

4 Experiments

In this section, we evaluate the predictive performance of the proposed method for out-of-matrix prediction problems on simulated and real-world chemogenomic data, and compare it with two alternative approaches: (i) the embarrassingly parallel or subset of data (SoD) approach, which has been widely studied to scale up Gaussian Process regression models, and (ii) Macau, Bayesian multi-relational factorization with side information [12], supporting out-of-matrix prediction. Macau is implemented with highly optimized C libraries, and is available for experiments on large-scale data. The comparison with the SoD approach shows the advantage of our method in sharing information among submodels, while the comparison with Macau shows the benefit of using Bayesian non-linear matrix factorization. We emphasize, however, that the choice and effectiveness of a model always depends on the problem at hand.

Simulated data.

We generated synthetic data using non-linear signals corrupted with Gaussian noise, using matern data generator available in the GPy222https://sheffieldml.github.io/GPy/ software. The data has three views , the dimension of the views is as follows: N = 25,000, = 150, = 100, = 150. As the task is to perform out-of-matrix prediction, we randomly selected 20% of the rows as a test set, using the remaining rows as the training set. In addition, 80% of the data in the first view were masked as missing, to simulate the sparsely observed target data in a real-world application. The other two views were regarded as side information and were fully observed. Unlike Bayesian GP-LVM, Macau cannot handle missing values in the side data.

Real-world data.

We performed the experiments on ExCAPE-DB data [14], which is an aggregation of public compound-target bioactivity data and describes interactions between drugs and targets using the pIC50333IC50 (units in ) is the concentration of drug at which 50% of the target is inhibited. The lower the IC50 of the drug, the less likely the drug will be to have some off-target effect (e.g. potential toxicity) that is not desired. . measure. It has 969,725 compounds and 806 targets with 76,384,863 observed bioactivities. The dataset has 469 chem2vec features as side information which are generated from ECFP fingerprint features for the compounds using word2vec software. We used 3-fold cross validation to split the training and test set, where about 30% of the rows or compounds were chosen as test set in each fold.

4.1 Experimental Setup

The experimental setting for MRD models is: number of inducing points 100, optimization through scaled conjugate gradients (SCG) with 500 iterations. For the SoD approach, the latent variables were initialized with PPCA method. We ran Macau with Gibbs sampling for 1200 iterations, discarded the first 800 samples as burn-in and saved every second of the remaining samples yielding in total 200 posterior samples. We set the dimension of latent variables K=10 for ExCAPE-DB data, K=5 for simulated data for all methods.

For the proposed and SoD methods, we partitioned the simulated data into 10x1 subsets and ExCAPE-DB data into 400x1 subsets. Other partitions are also possible; we have chosen the size of subsets such that Bayesian inference could be performed for the subsets in reasonable time on a single CPU. Notice that the views with missing values are generally sparsely observed in many real-world applications, which makes it challenging to learn informative models for such data. Following Qin et al. Qin+others:2019, we reordered the rows and columns of training data in descending order according to the proportion of observations in them. This makes the first subset the most densely observed block, thus making the resulting submodel informative and facilitating the parallel inferences in the following stages.

Kernel Macau SoD Proposed methods Full posterior
PoE Intermediate aggregation
RMSE: the smaller, the better.
Linear 0.8927 0.010 0.747 0.034 0.685 0.041 0.656 0.038 0.656 0.039
RBF - 0.825 0.034 0.791 0.048 0.683 0.038 0.658 0.039
Matern32 - 0.824 0.032 0.772 0.048 0.687 0.039 0.656 0.038
Spearman correlation score: the larger, the better.
Linear 0.6971 0.044 0.713 0.056 0.726 0.048 0.744 0.044 0.744 0.045
RBF - 0.689 0.060 0.651 0.080 0.721 0.048 0.742 0.045
Matern32 - 0.684 0.064 0.672 0.083 0.718 0.047 0.744 0.044
Wall-clock time (secs.)
Linear 171.44 27730.748 55191.296 57055.027 249782.967
RBF - 31371.620 53211.605 57306.449 176075.462
Matern32 - 36757.294 67197.138 71945.822 106303.016
Table 1: Comparison of performance metrics for different methods on simulated data. The results are averaged over 5 folds.
Affinity Model RMSE F1-score AUC-ROC Ratio of successful Wall-clock
level score queries time (secs.)
5 Macau 1.108 0.069 0.886 0.003 0.805 0.009 0.319 0.024 37041.8
5 SoD 0.914 0.023 0.890 0.011 0.791 0.003 0.363 0.006 63110.16
Proposed methods:
5 PoE 0.831 0.021 0.900 0.001 0.805 0.002 0.309 0.008 92419.06
5 Intermediate aggregation (nAggs=10) 0.743 0.009 0.919 0.005 0.811 0.006 0.405 0.018 93331.23
5 Intermediate aggregation (nAggs=20) 0.736 0.004 0.914 0.003 0.813 0.004 0.455 0.008 100492.9
6 Macau 1.123 0.065 0.783 0.013 0.799 0.006 0.318 0.003 37041.8
6 SoD 0.930 0.021 0.787 0.011 0.791 0.005 0.381 0.019 63110.16
Proposed methods:
6 PoE 0.837 0.022 0.846 0.011 0.811 0.003 0.285 0.004 92419.06
6 Intermediate aggregation (nAggs=10) 0.775 0.028 0.851 0.015 0.817 0.003 0.376 0.025 93331.23
6 Intermediate aggregation (nAggs=20) 0.789 0.006 0.838 0.005 0.816 0.004 0.434 0.016 100492.9
Table 2: Comparison of RMSE, F1-score, AUC-ROC score and the ratio of successful queries (i.e. queries with AUC-ROC score larger than 0.7 for the targets) for out-of-matrix prediction on ExCAPE-DB by different methods. The first three metrics are calculated for only the successful queries, the ratio is defined as #successfulQueries / #validQueries, where a query target is considered to be valid if it has at least 100 observed bioactivity, at least 25 active and 25 inactive compounds. The results are averaged over 3 runs.

We evaluated performance by predictive accuracy and the quality of prediction for downstream ranking tasks. Root mean squared error (RMSE) is a common performance measure for matrix completion. In real-world applications, such as item recommendation or drug discovery, we are more interested in the performance of the ranking task, for instance how many of the recommended items the user actually clicks or buys, how many drugs recommended by models actually have the desired effect for the disease. For this purpose, we regard matrix completion as a classification task (of whether a prediction is relevant or not at a given threshold), use F1- and AUC-ROC score as performance metrics for ExCAPE-DB. Furthermore, following Qin et al. Qin+others:2019, we use the wall-clock time444Wall-clock time measures the real time between the start and the end of a program. For parallel processes, we use the wall-clock time of the slowest process. to measure the speed-up achieved by parallelization. For our method, the reported wall-clock time is calculated by summing the maximum wall-clock times of submodels for each inference stage plus the wall-clock time of making prediction.

For compound activity prediction tasks, we use a pIC50 cutoff (a.k.a. affinity level) at 5 and 6, corresponding to concentrations of 10 and 1, respectively. The test set was further filtered by only keeping targets having at least 100 compounds, at least 25 active compounds, and 25 inactive compounds, to ensure a minimum number of positive and negative data points.

Macau555We ran the Macau version available in SMURFF software: https://github.com/ExaScience/smurff. was run on compute nodes with 20 CPUs; all the other methods were run on a single CPU. Our implementation is based on the GPy package.

4.2 Results

The results for simulated and ExCAPE-DB data are given in Table 1 and 2, respectively. In Table 1, column ‘Full posterior’ refers to the performance of MRD learned from the full data; column ‘Intermediate aggregation’ refers to our method which works by first aggregating multiple submodels into a model with reasonable size (as long as the compute node can still accommodate the model to make predictions) and then perform predictions by aggregating predictions from multiple experts with PoE.

It is clear from Table 1 that the model with full posterior performs better than other methods in terms of predictive performance; our intermediate aggregation method achieves competitive results while being much faster than the full posterior. The intermediate aggregation method also performs better than the SoD approach and the variant of our method without the intermediate aggregation step. With a linear kernel, all MRD models perform better than Macau.

For ExCAPE-DB data, our intermediate aggregation method (by aggregating 10 or 20 submodels to obtain larger models for prediction) performs much better than all the other methods in all performance metrics for different affinity levels. At both affinity levels, all versions of our proposed method perform better than the SoD method in terms of RMSE, F1-score and AUC-ROC score. Again, we observed that all MRD methods perform better than Macau in all performance metrics. In both tables, the wall-clock times of our methods are larger than that of the SoD approach. This is due to the two-stage parallel inferences of the proposed scheme.

To summarise, the proposed method with an intermediate aggregation step achieves a better trade-off between predictive accuracy and computation time. The proposed method performs better than the embarrassingly parallel approaches for scalable Gaussian process models and a state-of-the-art highly optimized implementation of linear Bayesian matrix factorization with side information.

5 Conclusion

In this paper, we have introduced a scalable approach for Bayesian non-linear matrix completion. We have argued that a key factor in constructing distributed solutions for massive-scale data is to limit the communication required between computational units. To this end, we have introduced a computational scheme which leverages embarrassingly parallel techniques developed for Gaussian process regression by suitably adapting them for Bayesian Gaussian process latent variable models. The resulting framework is almost as communication-efficient as embarrassingly parallel computation, adding only one additional stage of communication, while achieving accuracy close to the non-distributed full data solution.

Acknowledgements

The authors gratefully acknowledge the computational resources provided by the Aalto Science-IT project and support by the Academy of Finland (Finnish Center for Artificial Intelligence, FCAI, and projects 319264, 292334).

References

  • [1] Y. Cao and D. J. Fleet (2014) Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In Modern Nonparametrics 3: Automating the Learning Pipeline workshop at NIPS, Cited by: §1.
  • [2] A. C. Damianou, M. K. Titsias, and N. D. Lawrence (2016) Variational inference for latent variables and uncertain inputs in Gaussian processes. JMLR 17 (42), pp. 1–62. Cited by: §1, §2.1, §3.2.
  • [3] A. Damianou, C. H. Ek, M. K. Titsias, and N. D. Lawrence (2012) Manifold relevance determination. In ICML, Cited by: §2.2.
  • [4] M. Deisenroth and J. W. Ng (2015) Distributed gaussian processes. In ICML, pp. 1481–1490. Cited by: §3.2.
  • [5] A. Girard, C. E. Rasmussen, J. Q. Candela, and R. Murray-Smith (2003) Gaussian process priors with uncertain inputs application to multiple-step ahead time series forecasting. In NIPS, pp. 545–552. Cited by: §3.2.
  • [6] N. Lawrence (2005)

    Probabilistic non-linear principal component analysis with Gaussian process latent variable models

    .
    JMLR 6 (Nov), pp. 1783–1816. Cited by: §2.
  • [7] J. W. Ng and M. P. Deisenroth (2014) Hierarchical mixture-of-experts model for large-scale Gaussian process regression. arXiv preprint arXiv:1412.3078. Cited by: §1.
  • [8] X. Qin, P. Blomstedt, E. Leppäaho, P. Parviainen, and S. Kaski (2019) Distributed Bayesian matrix factorization with limited communication. Machine Learning, pp. 1–26. Cited by: §1.
  • [9] J. Quiñonero-Candela and C. E. Rasmussen (2005) A unifying view of sparse approximate gaussian process regression. JMLR 6, pp. 1939–1959. External Links: ISSN 1532-4435, Link Cited by: §1.
  • [10] R. Salakhutdinov and A. Mnih (2008)

    Bayesian probabilistic matrix factorization using Markov chain Monte Carlo

    .
    In ICML, pp. 880–887. External Links: Link, Document Cited by: §1.
  • [11] R. Salakhutdinov and A. Mnih (2008) Probabilistic matrix factorization. In NIPS, Cited by: §1.
  • [12] J. Simm, A. Arany, P. Zakeri, T. Haber, J. K. Wegner, V. Chupakhin, H. Ceulemans, and Y. Moreau (2015) Macau: scalable bayesian multi-relational factorization with side information using mcmc. arXiv preprint arXiv:1509.04610. Cited by: §4.
  • [13] E. Snelson and Z. Ghahramani (2006) Sparse Gaussian processes using pseudo-inputs. In NIPS, pp. 1257–1264. Cited by: §1.
  • [14] J. Sun, N. Jeliazkova, V. Chupakhin, J. Golib-Dzib, O. Engkvist, L. Carlsson, J. Wegner, H. Ceulemans, I. Georgiev, V. Jeliazkov, N. Kochev, T. J. Ashby, and H. Chen (2017) ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics. Journal of Cheminformatics 9 (1), pp. 17:1–17:9. External Links: Document, Link Cited by: §4.
  • [15] M. Titsias and N. Lawrence (2010) Bayesian Gaussian process latent variable model. In AISTATS, pp. 844–851. Cited by: §1, §3.
  • [16] V. Tresp (2000) A Bayesian committee machine. Neural Computation 12 (11), pp. 2719–2741. Cited by: §1.
  • [17] A. Yao, J. Gall, L. V. Gool, and R. Urtasun (2011) Learning probabilistic non-linear latent variable models for tracking complex activities. In NIPS, pp. 1359–1367. Cited by: §3.1, §3.1.