Latent variable models based on matrix factorization have in recent years become one of the most popular and successful approaches for matrix completion tasks, such as collaborative filtering in recommender systems (Koren et al, 2009) and drug discovery (Cobanoglu et al, 2013). The main idea in such models is, given a matrix of observed values , to find two matrices and with , such that their product forms a low-rank approximation of :
In matrix completion, the matrix is typically very sparsely observed, and the goal is to predict unobserved matrix elements based on the observed ones.
A standard way of dealing with high levels of unobserved elements in matrix factorization is to model the observed values only (instead of imputing the missing values), using regularization to avoid overfitting. A probabilistically justified regularized matrix factorization model was first introduced bySalakhutdinov and Mnih (2008b), and subsequently extended to a fully Bayesian formulation (called Bayesian Probabilistic Matrix Factorization, BPMF) (Salakhutdinov and Mnih, 2008a)
. This formulation sidesteps the difficulty of choosing appropriate values for the regularization parameters by considering them as hyperparameters, placing a hyperprior over them and using Markov chain Monte Carlo (MCMC) to perform posterior inference. Additional advantages of the fully Bayesian approach include improved predictive accuracy, quantification of the uncertainty in predictions, the ability to incorporate prior knowledge, as well as flexible utilization of side-information(Adams et al, 2010; Park et al, 2013; Porteous et al, 2010; Simm et al, 2015).
Given the appeal and many advantages of Bayesian matrix factorization, applying it also to massive-scale matrices would be attractive but scaling up the posterior inference has proven difficult, and calls for distributing both data and computation over many workers. So far only very few distributed implementations of BMF have been presented in the literature. Recently, Ahn et al (2015) proposed a solution based on distributed stochastic gradient Langevin dynamics (DSGLD), and showed empirically that BMF with DSGLD achieves the same level of predictive performance as Gibbs sampling. However, the convergence efficiency of the DSGLD solution is constrained by several factors such as the nee for careful tuning of the learning rate and for using an orthogonal group partition111This partition is used to avoid conflicting access to parameters among parallel workers, and refers to a partition scheme in which rows and columns included in one block do not appear in the other ones.
for training. When a model is trained with blocks in an orthogonal group, in each iteration it only makes use of a small subset of the full data set for learning, which could lead to an estimate with higher variance and slowing down the convergence speed.Şimşekli et al (2015) developed a similar distributed MCMC method based on SGLD for large generalised matrix factorization problems, which they called Parallel SGLD (PSGLD); see also Şimşekli et al (2017) for an application of the method to non-negative matrix factorization (NMF). Different from DSGLD, PSGLD is implemented such that for each iteration, only blocks of instead of the whole need to be transferred among parallel workers. Nevertheless, this solution suffers from the same issues as DSGLD. Vander Aa et al (2017) presented a distributed high-performance implementation of BPMF with Gibbs sampling using the TBB and GASPI libraries, and provided an empirical comparison with other state-of-the-art distributed high-performance parallel implementations. They found that a significant speed-up could only be achieved with a limited number of workers, after which the addition of more workers eventually leads to a dramatic drop in parallel computation efficiency due to the increased communication overhead (Vander Aa et al, 2016, 2017). Therefore, a key factor in devising even more scalable distributed solutions is to be able to minimize communication between worker nodes.
One of the most promising directions in large-scale Bayesian computation in recent years has been embarrassingly parallel MCMC, a family of essentially communication-free algorithms, where the data are first partitioned into multiple subsets and independent sampling algorithms are then run on each subset in parallel (Minsker et al, 2014; Neiswanger et al, 2014; Scott et al, 2016; Srivastava et al, 2015; Wang and Dunson, 2013; Wang et al, 2014, 2015). In these algorithms, communication only takes place in a final aggregation step, which combines the subset posteriors to form an approximation to the full-data posterior. A key factor which limits the applicability of such methods for BMF models is that Equation (1) can be solved only up to orthogonal transformations. Each subset posterior can therefore converge to any of an infinite number of modes, making the aggregation step difficult to carry out in a meaningful way. Previous embarrassingly parallel MCMC algorithms have only been applied in cases where the model is unidentified up to a finite number of solutions (e.g. Nemeth and Sherlock, 2017) but are not applicable in a continuum of unidentifiable cases.
In this paper, we introduce an approach which addresses the unidentifiability issue by introducing dependencies between the subset posterior inferences, while limiting the communication between workers. We will draw inspiration from the observation that even though an infinite number of solutions to Equation (1) exist in principle, in practical computation with a finite number of observations, a sampler with finite chain-length will only explore a small number of solutions, each corresponding to a separate mode. The key idea is to encourage the samplers in all subsets to target the same set of solutions; note that this does not restrict generality as the standard way of finding other modes, by employing additional chains, is available here as well.
In large BMF problems we partition the data matrix along both rows and columns, effectively making different subsets dependent on parameters shared by subsets on the same rows and columns. To implement the dependencies in the inference, we divide the subsets into three groups, which are processed in a hierarchy of three consecutive stages (see Figure 1). The posterior distributions obtained in each stage are propagated forwards and used as priors in the following stage. This way, communication only takes place between the stages and not between the subsets within a stage, and each subset inference is regularized using information from the relevant subset inference in the preceding stage. Note that within each stage, we perform the inference for subsets in parallel. Thus, for a partition scheme with subsets, the maximum number of parallel workers that can be used by our algorithm is equal to the number of subsets in the third stage, i.e. . We refer to the proposed procedure as posterior propagation (PP). Table 1 compares the computational characteristics of PP with previous approaches for parallel and distributed BMF / NMF.
|Tune learning||Shuffle||Optimal||Communication cost|
|rate or||train data||partition|
BMF + DSGLD
|(Ahn et al, 2015)||orthogonal|
NMF + PSGLD
|(Şimşekli et al, 2017)||orthogonal|
|(Vander Aa et al, 2017)|
1.1 Contributions and overview of the paper
The main contributions of our paper are as follows: In Section 3, we introduce a hierarchical, exact decomposition of the joint posterior of the BMF model parameters, which makes possible embarrassingly parallel computations over data subsets in a sequence of at most three stages, limiting all communication to take place between the stages. This decomposition is computationally intractable in general; however, in Section 4 we build on it to develop a MCMC-based approximate inference scheme for BMF. In the numerical experiments of Section 5, we show empirically, with both real and simulated data, that the proposed distributed approach is able to achieve a speed-up of almost an order of magnitude over the full posterior, with a negligible effect on predictive accuracy, compared to MCMC inference on the full data. In the experiments, the method also significantly outperforms state-of-the-art embarrassingly parallel MCMC methods in accuracy, and achieves competitive results compared to other available distributed and parallel implementations of BMF.
2.1 Bayesian matrix factorization
Let be a partially observed data matrix, and be matrices of unknown parameters. The general Bayesian matrix factorization (BMF) model is then specified by the likelihood
which is a probabilistic version of Equation (1). Here denotes an indicator function which equals 1 if the element is observed and 0 otherwise.
While the general BMF model is agnostic to the choice of distributional form, in many applications, the elementsand ,
where denotes the noise precision. Note that some formulations specify an individual precision for each column. To complete the Bayesian model, priors are placed on the model parameters and , commonly normal priors specified as
where denotes a -dimensional normal distribution with covariance specified in terms of the precision matrix. The model formulation may additionally include priors on some or all of the hyperparameters , as well as on the data precision parameters (e.g. Bhattacharya and Dunson, 2011; Salakhutdinov and Mnih, 2008a). For concreteness, we proceed in this paper with the Gaussian case, as specified in Equations (3–4b), but note that the developments of Section 3, along with Algorithm 1, are general with no reference to this choice of distributions for likelihood and priors.
2.1.1 Unidentifiability of matrix factorization
It is commonly known that the solution of Equation (1) is unique only up to orthogonal transformations. To demonstrate this unidentifiability, let
be any semi-orthogonal matrix for which, being a unit matrix, and denote . Then, performing an orthogonal transformation by right-multiplying both and by , leads to
by which an uncountable number of equally good solutions to Equation (1) can be produced.
As a special case, let be a unit matrix with the th diagonal element set to . The matrix is then clearly orthogonal (also semi-orthogonal), since . Right-multiplying and by has the effect of flipping the signs of all elements in the th columns of these matrices. It can then easily be verified that the product of the resulting matrices remains unchanged. Since any of the K columns of and can have their signs flipped without affecting the product , within this family of transformations we have equally good solutions for Equation (1).
Although the unidentifiability of a single matrix factorization task could be addressed by e.g. constraining to be a lower triangular matrix (Lopes and West, 2004), expensive communication would be needed for distributed inference schemes to ensure all the posteriors are jointly identifiable.
2.2 Embarrassingly parallel MCMC
Consider now a parametric modelwith exchangeable observations and parameter for which we wish to perform posterior inference using MCMC. If is very large, the inference may be computationally too expensive to be carried out on a single machine. Embarrassingly parallel MCMC strategies aim to overcome this by partitioning the data into multiple disjoint subsets , and running independent sampling algorithms for each subset using a down-weighted prior . In most embarrassingly parallel MCMC algorithms, the aggregation of the obtained subset posteriors into a full-data posterior is based, in one way or another, on the following product density equation:
where each factor constitutes an independent inference task. Aggregating the joint product density equation with satisfactory efficiency and accuracy is in general a challenging problem, since the involved densities are unknown and represented in terms of samples. For a recent overview of various subset posterior aggregation techniques, see Angelino et al (2016).
Standard embarrassingly parallel inference techniques relying on Equation (5) are not well-suited for unidentifiable models, such as the BMF model presented in Section 2.1. To illustrate this, consider the following simple example:
Assume that we have observed a data matrix
conditional on which we wish to estimate the parameters and of the corresponding BMF model. A plausible inference may then result in a bimodal posterior with high density regions around, say, the exact solutions
and . Next, assume that we split the data into three subsets
and conduct independent inference over each of them. Again, plausible subset inferences may accumulate posterior mass around some set of exact solutions, say,
along with their corresponding negative solutions. However, aggregating these inferences using Equation (5) does not necessarily lead to a posterior with high density around any correct solution.
Ideally, we would like all subset inferences in the above example to target the same solutions in order for them to reinforce each other. To do so, it is clearly necessary to impose some constraints or regularization on them. One way of doing this is to equip the inferences with strong enough prior information. We will build on this idea in the following section.
3 Hierarchical Parallelization of BMF
Let us now assume that a data matrix has been partitioned with respect to both rows and columns into subsets , , . It then follows from Equations (2, 4a, 4b), that the joint posterior density of the BMF parameter matrices and , given the partitioned data matrix , can be factorized as
Our goal is to develop an equivalent decomposition that fulfils the apparently contradictory aims of both allowing for embarrassingly parallel computations and making the subset inferences dependent.
3.1 From sequential to parallel inference
We begin with the simple case of having only three subsets. With and , the parameters of the partitioned BMF model are , , and . Sequential inference (exploiting no parallelism) over can then be performed in three successive stages as follows. In the first stage, the posteriors for the parameters and , given , are computed as
In the second stage, the posterior from the first stage is used as a prior for the shared parameter to compute
In the above stage, using the posterior obtained in Equation (7) as a prior can be interpreted as a form of regularization, which encourages the inference to target the same set of modes as the first stage. Finally, using the posterior from the second stage as a prior in the third stage then gives the full-data posterior as
In general, a data set partitioned into subsets will require stages of sequential inference to obtain the full posterior.
We will now consider an alternative, partly parallelizable inference scheme, which begins with an initial stage identical to that of the above sequential scheme. However, instead of processing the subsets and in sequence, we process them in parallel. Regularizing the inferences with a common informative prior, obtained in the first stage, introduces dependence between them and encourages the targeted solutions to agree with each other. This leads to the following decomposition:
where the right-hand side of line (9a) corresponds to the first stage, lines (9b)–(9c) correspond to the second stage, with the two remaining subsets now being processed in parallel. Finally, an aggregation stage combines all of (9a)–(9d).
With , the number of stages for the parallel scheme is exactly the same as for the sequential one. However, while the sequential scheme always requires stages, the number of stages for the parallel scheme remains constant for all . A key challenge is then to be able to carry out the aggregation stage efficiently. Strategies for aggregation are discussed further in Section 4.2.
3.2 Posterior propagation
We will now extend the idea introduced above for arbitrary partitions of
and show that this yields an exact decomposition of the full joint distribution (6). As is partitioned along both columns and rows, our hierarchical strategy is conducted in three successive stages. Communication is only required to propagate posteriors from one stage to the next, while within each stage, the subsets are processed in an embarrassingly parallel manner with no communication. The approach, coined posterior propagation (PP), proceeds as follows:
Inference stage I. Inference is conducted for the parameters of subset :
Inference stage II. Inference is conducted in parallel for parameters of subsets which share columns or rows with . Posterior marginals from stage 1 are used as priors for the shared parameters:
for and .
Inference stage III. The remaining subsets are processed in parallel using posterior marginals propagated from stage II as priors:
Product density equation. Combining the submodels in Equations (10–12), for all and , and dividing away the multiply-counted propagated posterior marginals yields the following product density equation:
The following theorem higlights the fact that this is indeed a proper decomposition of the full posterior density.
The proof of the theorem is given in Appendix A.
4 Approximate inference
In the previous section, we introduced a hierarchical decomposition of the joint posterior distribution of the BMF model, which couples inferences over subsets but allows for embarrassingly parallel computations in a sequence of (at most) three stages. The challenge with implementing this scheme in practice is threefold, and relates to the analytically intractable form of the BMF posterior: i) propagating posteriors efficiently from one stage to the next, ii) utilizing the posteriors of one stage as priors in the next stage, and iii) aggregating all subset posteriors at the end. In this section, we propose to resolve these challenges by using parametric approximations computed from subset posterior samples obtained by MCMC in each stage.
Computational schemes for distributed data settings, combining MCMC with propagation of information through parametric approximations have recently been explored by Xu et al (2014) and Vehtari et al (2018). Nevertheless, their expectation propagation algorithms for distributed data require frequent communication among parallel workers to share global information, which could become a bottleneck for large-scale computational problems where the number of model parameters scales linearly with the number of data samples. On the other hand, the proposed method only requires communication between stages of inference. While our focus here is on sampling-based inference, it is worth emphasizing that the decomposition introduced in Section 3.2 is itself not tied to any particular inference algorithm. Thus, it could also be combined e.g. with variational inference.
4.1 Parametric approximations for propagation of posteriors
We present here three alternative approaches for finding tractable approximations from posterior samples. A generic algorithm for the proposed inference scheme using these approximations is given in Algorithm 1.
Gaussian mixture model approximation. For the first approach, we note that the posterior distributions represented by the samples are typically multimodal due to the inherent unidentifiability of the BMF model. Gaussian mixture models
(GMM) are universal approximators of probability distributions, that is, given sufficiently many components, they can approximate any continuous distribution with arbitrary accuracy. Thus, they are a reasonable parametric approximation of the posterior.
Dominant mode approximation. Our second approach is based on the intuition that for purposes of prediction in matrix completion tasks, it is sufficient to find only one of the possibly infinitely many solutions to the matrix factorization problem. In this approach, we therefore locate the dominant mode from each posterior distribution. We then fit a multivariate Gaussian to the samples correspoding to this mode only, and propagate it as a prior to the following stage.
Moment matching approximation. Our final approach presents an intermediate between the previous two approaches. Here, we fit a unimodal multivariate Gaussian to the entire set of posterior samples for each parameter using moment matching. Beyond its simplicity, propagating Gaussian approximations for priors has the appeal that the inferences in different stages (i.e. steps 1, 1, 1, 1 in Algorithm 1) can be processed as a standard BMF. It also has the usual interpretation that the log-posterior corresponds to a sum-of-squared-errors objective function with quadratic regularization terms (Salakhutdinov and Mnih, 2008b)
. Finally, the moment matching approximation brings our scheme in close relation to recent work on expectation propagation for distributed data(Xu et al, 2014; Vehtari et al, 2018), but with only limited-communication and a single pass over the data as in assumed density filtering.
4.2 Approximating the product density equation for aggregation
Each subset posterior inference results in a joint distribution for subsets of the parameters and , approximated by a set of samples. Direct aggregation of these joint distributions using the product density equation (13) is a computationally challenging task. For computational efficiency, and to enable the use of the approximations introduced above, we simplify the task by decoupling the parameters and performing the aggregation by posterior marginals. With parametric approximations for each subset posterior marginal available (steps 1, 1, 1, 1 in Algorithm 1), we aggregate them by multiplying them together and dividing away all multiply counted propagated posteriors.
We assume that the marginal distributions over the parameter matrices can be factorized along rows into a product of -dimensional distributions, i.e.
The dominant mode and moment matching
approximations both produce unimodal multivariate Gaussian representations for each row of the parameter matrices. By the properties of Gaussian distributions, the aggregated posterior for theth row of is then obtained as
where , denote the estimated statistics of the posterior for the th submodel. Note that for submodels indexed by , the effect of the first submodel has been removed. The aggregation of each is done in similar fashion.
For the GMM approach, the posterior marginal for each and is a mixture with density . Maintaining this approximation in the aggregation phase would lead to the computationally challenging problem of dividing one mixture by another. Emphasizing speed and efficiency, we instead apply Equation (14) using pooled mixture components:
This section provides a brief discussion about the scalability of the above inference scheme in terms of computation time and communication cost. With workers available, both rows and columns can be partitioned into parts, assuming for simplicity an equal number of partitions in both directions (note, however, that this is not a requirement for our method). This results in a total of subsets. The computational cost of a typical BMF inference algorithm per iteration is proportional to , where and are the respective dimensions of the observation matrix, is the number of observed values, and is the number of latent dimensions. Thus, for each submodel, the theoretical computation time is proportional to
assuming an equal number of observations in each subset and iterations. Thus, the initial stage can be completed with one worker in time , inference stage II can be processed with workers in time , and inference stage III can be completed with workers in time . Finally, the aggregation step mainly involves calculating the product of multivariate Gaussian distributions, which can be done with parallel workers in time proportional to
Therefore, the total computation time of the algorithm with worker nodes is proportional to the sum of the computation times of each inference stage plus the computation time of the aggregation, .
In terms of communication cost, the proposed inference scheme requires first communicating inputs to workers and then collecting the outputs for aggregation. The inputs consist of two parts: data and prior distributions. As workers use non-overlapping parts of the data, the total amount of communication needed to load the data is proportional to the number of observations . Each worker receives parameters for distributions, each with parameters; for the dominant mode and moment matching approximations is proportional to and for the Gaussian mixture model approximation it is proportional to , where is the number of components and is due to using a full covariance matrix for the posterior of parameters. As there are workers, the total amount of communication needed for input distributions is proportional to . The output distributions are of the same size as the input distributions. Thus, the communication cost at the aggregation stage is the same as the communication cost of input distributions.
In this section, we evaluate the empirical performance of our limited-communication inference scheme for BMF, posterior propagation with parametric approximations, by comparing it with both embarrassingly parallel MCMC methods in Section 5.4, and available state-of-the-art parallel and distributed implementations of BMF and non-negative matrix factorization (NMF) in Section 5.5. In Section 5.6, we further analyse the advantage of our method over embarrassingly parallel MCMC in terms of encouraging a common representation for model parameters to facilitate subset posterior aggregation. Details about the implementation, test setup and data sets are provided in Sections 5.1–5.3.
For posterior inference in each subset, we use our R implementation222An implementation of BMF with PP, with highly optimized C libraries, is available in SMURFF software on github (bmfpp branch): https://github.com/ExaScience/smurff of the BPMF Gibbs sampler presented by Salakhutdinov and Mnih (2008a)333We have corrected an apparent error in the original formulation of the update formula for , where should be calculated with (Murphy, 2007) rather than . Alternatively, one could use the original formula for and calculate with or (Teh, 2007).. BPMF considers the noise precision as a constant and places a normal-Wishart prior on the hyperparameters and , as well as on hyperparameters and . In the first stage of PP, we sample all parameters of the BPMF model. However, in the second and third stages, we sample hyperparameters and only when the posterior of is not propagated. Similarly, hyperparameters and are sampled only when the posterior of is not propagated.
We have introduced three alternative approaches (in Section 4.1) to estimate subset posteriors from samples:
The GMM approximation fits a mixture of multivariate Gaussians to a clustering of the posterior samples (PP GMM).
The dominant mode approximation fits a multivariate Gaussian to the samples of the dominant mode (PP DM).
The moment matching approximation fits a unimodal multivariate Gaussian to the entire set of posterior samples (PP MM).
For a computationally fast way of implementing the first two algorithms, we first use the nonparametric -means clustering algorithm (Comiter et al, 2016) to cluster the posterior samples, then (i) for PP DM we choose the cluster with the maximum number of posterior samples to estimate the posterior, (ii) for PP GMM we estimate the posteriors for top-N modes/clusters. In our experiments, we use the top-3 modes. When using the estimated GMM as a prior for BMF, we perform Gibbs sampling in a similar way as for mixture models; denoting by and the estimated mean and precision, respectively, of mode in the posterior of parameter :
Compute the probability of generating the parameter for each mixture component, i.e. the likelihood of .
Calculate the responsibility of each component to explain .
Choose the component with the maximum as the propagated prior for , and update the parameter with its statistics.
The above procedure is done analogously for .
5.2 Test setup
We evaluate the distributed inference methods using simulated data and three real-world data sets: MovieLens-1M, MovieLens-20M and ChEMBL. In addition to inference on full data for medium-sized data sets, we predict missing values using column means; this benchmark serves as a baseline and sanity check.
We evaluate performance by predictive accuracy. To this end, we randomly partition the data into training and test sets and use root mean squared error (RMSE) on the test set as performance measure. For prediction in the experiments, we use Equation (1) with posterior means as values for and . Furthermore, we use wall-clock time444Wall-clock time measures the real time between the start and the end of a program. For parallel processes, we use the wall-clock time of the slowest process. to measure the speed-up achieved by parallelization. The reported wall-clock time for our method is calculated by summing the maximum wall-clock times of submodels for each inference stage plus the wall-clock time of the aggregation step555In our current implementation, we run each stage on a cluster as an array job consisting of multiple independent processes. While in this implementation communication between stages is done using read and write operations on disk, its effect on total computation time is negligible as this needs to be done at most three times during the entire progam, and the amount of data communicated is relatively small. .
In all experiments, we ran Gibbs sampling with 1200 iterations. We discarded the first 800 samples as burn-in and saved every second of the remaining samples yielding in total 200 posterior samples. The results were averaged over 5 runs. For parallelization, we experiment with different partitioning schemes; a partitioning scheme means that rows are partitioned into and columns into subsets. The partitioning scheme refers to the full data. Note that the maximum number of parallel workers that can be used by our algorithm is equal to the number of subsets in the third stage, i.e. . We also tested two ordering schemes. In the first scheme, rows and columns are permuted randomly (row/column order: random). In the second scheme, the rows and columns are reordered into a descending order according to the proportion of observations in them (row/column order: decreasing). Thus, the most dense rows and columns are processed in the first two stages, by which the subsequently propagated posteriors can be made more informative.
5.3 Data sets
The MovieLens-1M (Harper and Konstan, 2015) data set consists of 1,000,209 movie ratings by 6,040 users on 3,706 movies. Approximately 4.5% of the elements of the movie rating matrix, where each user corresponds to a row and each movie to a column, are observed. The MovieLens-20M (Harper and Konstan, 2015) data set contains 20 million ratings from 138,493 users on 27,278 movies; that is, about 0.53% of the elements are observed. Following Simm et al (2015), we set and for performance analysis.
The ChEMBL (Bento et al, 2014) data set describes interactions between drugs and proteins using the pIC50 measure. The data set has 15,703 rows corresponding to drugs and 346 columns corresponding to proteins, and contains 59,280 observations which is slightly over 1% of the elements. Again, we follow Simm et al (2015) to set and . As ChEMBL contains only 346 columns, we only partitioned the rows.
For these real-world data sets, we conduct a 5-fold cross-validation study where 20% of observations are withheld and used as test set.
To complement the real data sets, we generated simulated data sets with 6,040 observations and 3,706 features as follows: We set the number of latent factors to . The elements of the matrices and were generated independently from the standard univariate normal distribution. Finally, we generated the data with , where the is a noise matrix whose elements were generated from a standard normal distribution. For learning, we set the parameters and to the corresponding correct values, i.e., and . We generated 5 independent simulated data sets.
In many real-world applications, such as collaborative filtering and the ChEMBL benchmark, the data are very sparsely observed. We analyse the predictive performance of the model with respect to different types of missing data. To this end, we randomly select 80% of the data as missing, use these missing data as test set and the remaining data as training set. To additionally simulate not-missing-at-random data as the second simulated data scenario, we first assigned weights and to each row and column, respectively, such that they form an equally spaced decreasing sequence . Then we assigned the element to the test data with probability ; this results in a matrix with about 80% of elements missing. This is referred to as the structured missingness scenario.
5.4 Comparison with embarrassingly parallel methods
In this subsection, we compare the predictive performance and computation times of the proposed inference scheme to those of the full model, as well as the following algorithms for embarrassingly parallel MCMC666We found that running BMF with Gibbs sampler for each subset using down-weighted priors and could lead to numerical instabilities (i.e. resulting in non-positive definite precision matrices). Therefore, we chose to run BMF for each subset with the standard normal priors and , and then divided away the multiply-counted priors when aggregating the subset posteriors.:
Parametric density product (parametric) is a multiplication of Laplacian approximations to subset posteriors. The aggregated samples are drawn from the resulting multivariate Gaussian (Neiswanger et al, 2014).
Semiparametric density product (semiparametric) draws aggregated samples from multiplicated semi-parametric estimates of subset posteriors (Neiswanger et al, 2014).
Random partition tree (randomPARTree) works by first performing space partitioning over subset posteriors, followed by a density aggregation step which simply multiplies densities across subsets for each block and then normalizes (Wang et al, 2015). Aggregated samples are drawn from the aggregated posterior.
All of the above algorithms are implemented in the Matlab PART library777https://github.com/wwrechard/random-tree-parallel-MCMC. We ran the randomPARTree algorithm with different values for its hyperparameters (i.e. min cut length=0.001 or 0.01, min fraction block=0.01 or 0.1, cut type=“kd” or “ml”, local gaussian smoothing=true or false) for pilot analysis, and found that there are no significant differences in the predictive performance for different values for the hyperparameters. Thus, for this algorithm, we use KD-tree for space partition and 1000 resamples for final approximation, and use the default values for the other hyperparameters provided in the library.
The results for ChEMBL, MovieLens-1M, MovieLens-20M, and simulated data are shown in Figures 2 and 3. To improve the readability of the plots, we only plot RMSE for the two posterior aggregation methods that give the best performance for embarrassingly parallel MCMC on each data set. In the following, we summarize the major conclusions of this evaluation.
As a general conclusion, we found that posterior propagation can give almost an order of magnitude faster computation times with a negligible effect on predictive accuracy, compared to MCMC inference on the full data matrix; this can be seen on simulated data and MovieLens (Figure 3 and right-hand side of Figure 2 (a)). The almost horizontal RMSE curves for our methods on MovieLens-20M and simulated data indicate that posterior propagation speeds up computation with almost no loss on accuracy. Note that without approximations, PP would give the same results as the full model. The difference between them therefore quantifies the effect of the approximations made in our approach.
Of the embarrassingly parallel MCMC methods, the parametric aggregation method gives the best predictive accuracy on ChEMBL and MovieLens-1M data. Posterior propagation provides better predictive accuracy (lower RMSE values) than any of the embarrassingly parallel MCMC methods on all of the data sets considered. We also found out that reordering rows and columns, in a decreasing order with respect to the number of observations, usually improves the accuracy of posterior propagation compared to using a random order of rows and columns; this can be seen on the sparsely observed MovieLens-1M data (right-hand side of Figure 2 (a,b)). In Appendix C, we analyse the results in Figures 2 and 3 from another perspective to show the wall-clock time speed-up as a function of the number of parallel workers.
We further explored empirically whether posterior propagation can produce good prediction for users and items with only a few observations. This is useful for cold-start problems, i.e., recommendation for new users with very few observed ratings. For this analysis, we visualize test RMSE versus the number of ratings per user in the training set in Figure 4. Again, we observed that compared with the alternative embarrassingly parallel MCMC methods, our methods show superior performance for all user groups and improve prediction for users with very few observed ratings.
5.5 Comparison with other parallel and distributed implementations
In this subsection, we show that our method achieves competitive results compared to alternative implementations of parallel and distributed BMF, while keeping the communication requirement bounded. To this end, we compare our method on large-scale data (MovieLens-20M) with the following implementations:
Distributed parallel BPMF888https://github.com/ExaScience/bpmf (D-BPMF): a state-of-the-art C++ implementation of distributed BPMF with Gibbs sampler (Vander Aa et al, 2017). It supports hybrid communication for distributed learning, which utilizes TBB for shared memory level parallelism and Global Address Space Programming Interface (GASPI) for cross-node communication.
NMF with parallel SGLD999https://perso.telecom-paristech.fr/simsekli/nmf_sgmcmc/ (NMF + PSGLD): OpenMP implementation of non-negative matrix factorization with parallel SGLD (Şimşekli et al, 2017). This is an open source software that is similar to BPMF with distributed SGLD101010The software for large scale BPMF with distributed SGLD is not publicly available. (Ahn et al, 2015). This software requires careful tuning of hyperparameters in order to avoid numerical instabilities/overflow issues and get reasonable predictions. We set , (using a Gaussian likelihood), , for the experiment based on a pilot study.
For our method, we used a partition scheme with the same setup as for the experiments in Figure 2(c): i.e. , iterations for Gibbs sampling and a burn-in of 800 samples. Note that our method was implemented in the R language without any optimization, while the other two methods were implemented with highly optimized C libraries. For the sake of obtaining comparable results, the RMSE was obtained using our R implementation (same as in Figure 2(c)), while the wall-clock time is an estimate computed by using D-BPMF within each individual data subset in the three stages of our posterior propagation scheme. For each subset, we used a single node with 24 cores, resulting in a total of parallel processes (one per node) for the third stage.
Since only the OpenMP implementation of NMF + PSGLD was available to us, it was run within a single compute node with no cross-node communication.
The results of the comparison are shown in Table 2. The approximations made in our approach (BMF + PP) have only a slight negative effect on the RMSE compared to D-BPMF, which makes no distributional approximations and should therefore be close to the accuracy of the full model (not computed here because of its large size). On the other hand, in terms of computation time, our method is able to leverage the combination of a high level of parallelism and a very low communication frequency compared to D-BPMF, which requires frequent communication. Varying the number of nodes / cores for the latter yields results which are consistent with the empirical finding of Vander Aa et al (2017): parallel efficiency initially improves with increased parallelism, but begins to level off as the number of nodes increases beyond the boundary of fast network connections in the HPC cluster. A further point to note is that the communication cost of D-BPMF increases linearly with respect to the number of MCMC samples while ours stays constant, thus, the longer chains we run, the more advantage we get. Finally, NMF + PSGLD performs worse than the other alternative methods in terms of predictive accuracy and wall-clock computation time. This is partially due to the difficulty of tuning the hyperparameters for each specific data set for the DSGLD and PSGLD methods.
|Model||#Nodes/#cores||RMSE||Wall-clock time (s)||Communication cost|
|(BMF + PP)|
|(Vander Aa et al, 2017)||64/1536||0.772||124|
NMF + PSGLD
|(Şimşekli et al, 2017)|
5.6 Correlations of the subset posteriors
We observed in Section 5.4 that compared with existing embarrassingly parallel methods, the proposed method can provide a better trade-off between predictive accuracy and wall-clock time on several benchmark data sets. In this section, we investigate empirically to what extent our method is able to encourage joint identifiability via dependencies between the inferences in different subsets.
For this purpose, we compute for our method the correlations between the posterior expected means of parameters in subsets sharing rows or columns, and compare them to the corresponding correlations produced by embarrassingly parallel MCMC, see Figure 5. For example, for the partition scheme in Figure 1, we would calculate the correlations of posterior means for subsets as follows: cor(, ), cor(, ), cor(, ), cor(, ), for , . For embarrassingly parallel MCMC, to avoid low correlations due to mis-aligned permutations of the latent dimensions in different submodels, highly correlated latent dimensions were aligned prior to calculating the correlations between the posterior means.
An obvious trend from Figure 5 is that the correlation scores of posterior estimates generated by our method are much higher than those of embarrassingly parallel MCMC. The observation suggests that by propagating the posteriors obtained from the earlier stage to the next stage as priors, our method can produce highly dependent subset posteriors. On the other hand, since existing embarrassingly parallel MCMC methods do not introduce any dependencies between the inferences for different subsets, they are unable to enforce a common permutation and scaling for parameters, making the aggregation step challenging for unidentifiable models.
We have introduced a hierarchical embarrassingly parallel strategy for Bayesian matrix factorization, which enables a trade-off between accuracy and computation time, and uses very limited communication. The empirical evaluation on both real and simulated data shows that (i) our distributed approach is able to achieve a speed-up of almost an order-of-magnitude, with a negligible effect on predictive accuracy, compared to MCMC inference on the full data matrix; (ii) our method also significantly outperforms state-of-the-art embarrassingly parallel MCMC methods in accuracy, and (iii) performs comparably to other available distributed and parallel implementations of BMF. We further show that, unlike existing embarrassingly parallel approaches, our method produces posterior estimates, which are highly correlated across different subsets and thus enable a meaningful aggregation of subset inferences.
We have experimented with both inclusive approximations (GMM and MM; attempting to include all sampled modes, which is still restricted to a small finite number) as well as exclusive approximations (DM; attempting to exclude all but one mode, a property shared with variational inference). In our current setting, the inclusive approximations gave more consistent performance, striking a balance between restricting the set of solutions to encourage identifiability and letting the sampler explore good solutions. While the proposed approximations work well, more accurate representations for subset posteriors could be considered instead, in particular for aggregation (e.g. Wang et al, 2015). This would be relevant especially if we were interested in an accurate, global representation of the joint distribution of and , instead of predicting unobserved elements of the data matrix , which was the case in our current work. While more sophisticated representations may improve accuracy, they come at the expense of increased computational burden. An additional motivation to consider alternative subset posterior representations is to be able to establish theoretical guarantees for the global model, which despite their good empirical performance, are not straightforward to give for the current approximations.
Several works on distributed learning for BMF (e.g. BMF with DSGLD (Ahn et al, 2015), NMF with parallel SGLD (Şimşekli et al, 2017)) assume (implicitly) that the models are trained with blocks in an orthogonal group (or squared partition), in order to avoid conflicting access to parameters among parallel workers. In our work, we do not make any assumptions about the partition scheme, and our method can therefore work flexibly with diversified partition schemes, which depend on the size of the data in both dimensions. For instance, it can work with partitions only along the row direction for tall data, or partitions only along the column direction for fat data, or partitions along both row and column directions for tall and fat data.
The focus of our work has been to develop an efficient and scalable distributed learning scheme for BMF. In doing so, we have assumed implicitly that the data are missing at random, following many other works on BMF. While many (if not most) real-world data sets exhibit non-random patterns on missingness, we have handled such patterns using the simple strategy of reordering rows and columns into a descending order according to the proportion of observations in them. Thus, the most dense rows and columns are processed during the first two stages, by which the subsequently propagated posteriors can be made more informative. However, it is also possible to handle non-random patterns of missing values in a more principled manner. In the context of matrix factorization, Hernández-Lobato et al (2014) modelled the generative process for both the data and the missing data mechanism, and showed empirically that learning these two models jointly can improve performance of the MF model. This strategy would be straightforward to incorporate within our distributed scheme.
Finally, we have run experiments on a scale of only tens of millions of elements, but there is no obstacle for running the proposed distributed algorithm on larger matrices. Indeed, the proposed approach as such is not implementation-dependent, and it could be used together with any available well-optimized (but more communication intensive) implementation to enable further scaling beyond the point at which parallel efficiency would otherwise begin to level off.
Acknowledgements.This project has received funding from the European Union’s Horizon 2020 Research and Innovation programme under Grant Agreement no. 671555 and from the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN and grants 294238, 292334). The authors gratefully acknowledge the computational resources provided by the Aalto Science-IT project and by IT4I under Project ID OPEN-11-20. We also thank Tom Vander Aa for his help in setting up and configuring their distributed parallel BPMF software (Vander Aa et al, 2017) on the Salomon cluster, and the authors of Şimşekli et al (2017) for sharing their software for NMF + PSGLD. Finally, we wish to thank the four anonymous reviewers for their constructive feedback, which lead to many improvements of the paper.
Adams et al (2010)
Adams R, Dahl G, Murray I (2010) Incorporating side information in probabilistic matrix factorization with Gaussian processes. In: Proceedings of the 26th Annual Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp 1–9
- Ahn et al (2015) Ahn S, Korattikara A, Liu N, Rajan S, Welling M (2015) Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 9–18, DOI 10.1145/2783258.2783373
Angelino et al (2016)
Angelino E, Johnson MJ, Adams RP (2016) Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning 9(2-3):119–247, DOI10.1561/2200000052
- Bento et al (2014) Bento A, Gaulton A, Hersey A, Bellis L, Chambers J, Davies M, Krüger F, Light Y, Mak L, McGlinchey S, Nowotka M, Papadatos G, Santos R, Overington J (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Research 42(D1):D1083–D1090
- Bhattacharya and Dunson (2011) Bhattacharya A, Dunson DB (2011) Sparse Bayesian infinite factor models. Biometrika 98(2):291–306, DOI 10.1093/biomet/asr013
- Cobanoglu et al (2013) Cobanoglu MC, Liu C, Hu F, Oltvai ZN, Bahar I (2013) Predicting drug-target interactions using probabilistic matrix factorization. Journal of Chemical Information and Modeling 53(12):3399–3409, DOI 10.1021/ci400219z
Comiter et al (2016)
Comiter M, Cha M, Kung HT, Teerapittayanon S (2016) Lambda means clustering: Automatic parameter search and distributed computing implementation. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp 2331–2337, DOI10.1109/ICPR.2016.7899984
- Şimşekli et al (2015) Şimşekli U, Koptagel H, Güldaş H, Cemgil AT, Öztoprak F, Birbil ŞI (2015) Parallel stochastic gradient MCMC for matrix factorisation models. arXiv preprint arXiv:150601418
- Şimşekli et al (2017) Şimşekli U, Durmus A, Badeau R, Richard G, Moulines É, Cemgil AT (2017) Parallelized stochastic gradient Markov chain Monte Carlo algorithms for non-negative matrix factorization. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, pp 2242–2246
- Harper and Konstan (2015) Harper FM, Konstan JA (2015) The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5(4)
- Hernández-Lobato et al (2014) Hernández-Lobato JM, Houlsby N, Ghahramani Z (2014) Probabilistic matrix factorization with non-random missing data. In: Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, pp II–1512–II–1520
- Koren et al (2009) Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. Computer 42(8):30–37, DOI 10.1109/MC.2009.263
- Lian et al (2015) Lian X, Huang Y, Li Y, Liu J (2015) Asynchronous parallel stochastic gradient for nonconvex optimization. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pp 2737–2745
- Lopes and West (2004) Lopes HF, West M (2004) Bayesian model assessment in factor analysis. Statistica Sinica 14(1):41–67
- Minsker et al (2014) Minsker S, Srivastava S, Lin L, Dunson D (2014) Robust and scalable Bayes via a median of subset posterior measures. arXiv preprint arXiv:14032660
- Murphy (2007) Murphy KP (2007) Conjugate Bayesian analysis of the Gaussian distribution. Tech. rep., Google
- Neiswanger et al (2014) Neiswanger W, Wang C, Xing E (2014) Asymptotically exact, embarrassingly parallel MCMC. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp 623–632
- Nemeth and Sherlock (2017) Nemeth C, Sherlock C (2017) Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Analysis DOI 10.1214/17-BA1063, advance publication
- Park et al (2013) Park S, Kim YD, Choi S (2013) Hierarchical Bayesian matrix factorization with side information. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, AAAI Press, pp 1593–1599
- Porteous et al (2010) Porteous I, Asuncion A, Welling M (2010) Bayesian matrix factorization with side information and Dirichlet process mixtures. In: Proceedings of the 24th AAAI Conference on Artificial Intelligence, AAAI Press, pp 563–568
- Salakhutdinov and Mnih (2008a) Salakhutdinov R, Mnih A (2008a) Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: Proceedings of the 25th International Conference on Machine Learning, ACM, pp 880–887, DOI 10.1145/1390156.1390267
- Salakhutdinov and Mnih (2008b) Salakhutdinov R, Mnih A (2008b) Probabilistic matrix factorization. In: Advances in Neural Information Processing Systems 20, MIT Press
- Scott et al (2016) Scott SL, Blocker AW, Bonassi FV, Chipman HA, George EI, McCulloch RE (2016) Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11(2):78–88, DOI 10.1080/17509653.2016.1142191
- Simm et al (2015) Simm J, Arany A, Zakeri P, Haber T, Wegner JK, Chupakhin V, Ceulemans H, Moreau Y (2015) Macau: Scalable Bayesian multi-relational factorization with side information using MCMC. arXiv preprint arXiv:150904610
- Srivastava et al (2015) Srivastava S, Li C, Dunson D (2015) Scalable Bayes via barycenter in Wasserstein space. arXiv preprint arXiv:150805880
- Teh (2007) Teh YW (2007) Exponential Families: Gaussian, Gaussian-Gamma, Gaussian-Wishart, Multinomial. Tech. rep., University College London
- Vander Aa et al (2016) Vander Aa T, Chakroun I, Haber T (2016) Distributed Bayesian probabilistic matrix factorization. In: 2016 IEEE International Conference on Cluster Computing, IEEE Computer Society, pp 346–349
- Vander Aa et al (2017) Vander Aa T, Chakroun I, Haber T (2017) Distributed Bayesian probabilistic matrix factorization. Procedia Computer Science 108:1030 – 1039
- Vehtari et al (2018) Vehtari A, Gelman A, Sivula T, Jylänki P, Tran D, Sahai S, Blomstedt P, Cunningham JP, Schiminovich D, Robert C (2018) Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data. arXiv preprint arXiv:14124869 URL https://arxiv.org/abs/1412.4869
- Wang and Dunson (2013) Wang X, Dunson D (2013) Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:13124605
- Wang et al (2014) Wang X, Peng P, Dunson DB (2014) Median selection subset aggregation for parallel inference. In: Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pp 2195–2203
- Wang et al (2015) Wang X, Guo F, Heller KA, Dunson DB (2015) Parallelizing MCMC with random partition trees. In: Advances in Neural Information Processing Systems 28, Curran Associates, Inc., pp 451–459
- Xu et al (2014) Xu M, Lakshminarayanan B, Teh YW, Zhu J, Zhang B (2014) Distributed Bayesian posterior sampling via moment sharing. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pp 3356–3364, URL http://papers.nips.cc/paper/5596-distributed-bayesian-posterior-sampling-via-moment-sharing.pdf
Appendix A Proof of Theorem 1
Appendix B Hardware platform
The experiments for our method and embarrassingly parallel MCMC were performed on Triton (Aalto Science-IT), a cluster with more than 200 compute nodes. The nodes used for the experiments are equipped with dual 12-core Xeon E5 2680 v3, a clock speed 2.50GHz and 128 GB of RAM. One node was used for posterior inference for each subset. The experiments for distributed / parallel implementations of BMF and NMF were conducted on Salomon (IT4Innovations), a cluster with 576 nodes, each equipped with dual 12-core Intel Xeon E5-2680v3, a clock speed 2.50GHz and at least 128 GB of physical memory per node.
Appendix C Wall-clock time speed-up
The results for RMSE vs. different partition schemes in Figure 2 and 3 provide a trade-off between predictive performance and computation time. In this section, we revisit the results for simulated data and MovieLens-1M to analyse the scaling behaviour of our method by evaluating wall-clock time speed-up with respect to the number of workers. We follow Lian et al (2015) and compute the wall-clock time speed-up (WTS) as follows:
The results of this analysis are given in Figure 6(a) and Table 3. The theoretical scaling behaviour for an arbitrary fixed problem size is shown in Figure 6(b), where we use the definition of computation time derived in Section 4.3 as a function of the number of workers. We conclude that the general behaviour in the empirical results is well in line with what we expect from our discussion in Section 4.3, with the actual scale factor depending on the problem-specific configurations (e.g. data size, partition scheme), implementation and hardware.
|Partition||#Workers||Decreasing order||Random order|
|RMSE||Wall-clock time (s)||WTS||RMSE||Wall-clock time (s)||WTS|