SMURFF: a High-Performance Framework for Matrix Factorization

04/04/2019, by Tom Vander Aa, et al.

Bayesian Matrix Factorization (BMF) is a powerful technique for recommender systems because it produces good results and is relatively robust against overfitting. Yet BMF is more computationally intensive and thus more challenging to implement for large datasets. In this work we present SMURFF, a high-performance, feature-rich framework to compose and construct different Bayesian matrix-factorization methods. The framework has been successfully used to do large-scale runs of compound-activity prediction. SMURFF is available as open source and can be used both on a supercomputer and on a desktop or laptop machine. Documentation and several examples are provided as Jupyter notebooks using SMURFF's high-level Python API.




1 Matrix Factorization

Recommender Systems (RS) have become very common in recent years and are useful in various real-life applications. The most popular ones are probably suggestions for movies on Netflix and books on Amazon [1]. However, they can also be used in less obvious areas such as drug discovery, where a key problem is the identification of candidate molecules that affect proteins associated with diseases [2].

In RS one has to analyze large and sparse matrices, for example those containing the known movie or book ratings. Matrix Factorization (MF) is a technique that has been successfully used here. As sketched in Figure 1, the idea of this method is to approximate the rating matrix R as the product of two low-rank matrices U and V. Predictions can be made from this approximation, which is dense.
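As a small illustration of this structure, the sketch below builds a toy rating matrix with unknown entries and shows how a low-rank product predicts every cell, including the missing ones. The factors here are random rather than trained; the point is only the shapes and the dense reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix R (users x movies) with unknown entries marked as NaN.
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 5.0]])

K = 2  # latent dimension
U = rng.normal(size=(R.shape[0], K))  # one latent vector per user
V = rng.normal(size=(R.shape[1], K))  # one latent vector per movie

# The dense approximation U @ V.T predicts every entry, including the
# unknown ones -- this is how recommendations are read off the model.
R_hat = U @ V.T
prediction_for_unknown = R_hat[0, 2]  # a prediction for a missing rating
```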

Bayesian Matrix Factorization (BMF [3]) using Gibbs sampling is one of the more popular approaches to matrix factorization. Thanks to the Bayesian approach, BMF has been shown to be more robust to overfitting [3]. Gibbs sampling [4] makes the problem computationally feasible. Yet BMF is still computationally intensive and thus challenging to implement for large datasets.

In this work we present SMURFF [5], a high-performance, feature-rich framework to compose and construct different Bayesian matrix factorization methods. Using the SMURFF framework one can easily vary: i) the type of matrix to be factored (dense or sparse); ii) the prior distribution assumed for the model (multivariate normal, spike-and-slab, and others); or iii) the noise model (Gaussian noise, probit noise or others). The framework also allows combining different matrices and thus incorporating more, and different, types of information into the model.

Figure 1: Low-rank Matrix Factorization

In this paper we describe the SMURFF framework developed. Section 2 lists the different features supported. In Section 3 we describe some of the implementation details and how those impact compute performance. We compare SMURFF to other matrix factorization implementations in Section 4, and look at compute performance in Section 5 and Section 6. Finally, Section 7 presents conclusions and future work.

2 Framework

Figure 2 shows a matrix-factorization system as can be composed with SMURFF. The basic matrix-components are:

Figure 2: SMURFF Matrix Factorization Framework
  • The matrix to be factored, R, which can be composed of different blocks. Each block can be sparse with many values unknown, or fully known. If R is fully known, the data can still be sparse (many values zero) or dense (all non-zero).

  • The model matrices, i.e. the factors U and V. The number of rows in U and columns in V is the so-called latent dimension, of size K. Each user and each movie is represented in the model by a latent vector of size K. Each element of such a vector is called a component.

  • A matrix with side information (F) for the rows and/or columns of R, which is taken into account by means of a link matrix (β). More information is in the paragraph below.

In the following paragraphs we describe the basic components of the matrix factorization framework and how they are used or constructed. The columns in Table 1 represent the different choices SMURFF supports in terms of input matrices, prior distributions, noise models and types of side information. We explain what those choices mean in the following paragraphs.

|         | Input Matrices       | Prior Distribution   | Noise Model       | Side Information |
|---------|----------------------|----------------------|-------------------|------------------|
| Choices | Sparse with unknowns | Normal               | Fixed Gaussian    | Link Matrix      |
|         | Sparse, fully known  | Spike-and-Slab (SnS) | Adaptive Gaussian | -                |
|         | Dense                | -                    | -                 | -                |
| BMF     | Sparse with unknowns | Normal               | Fixed             | -                |
| Macau   | Sparse with unknowns | Normal               | Fixed/Adaptive    | Link Matrix      |
| GFA     | Sparse and/or Dense  | Normal + SnS         | Fixed/Adaptive    | -                |

Table 1: Possible MF Algorithms

Input Matrices

As shown in Figure 2, different matrix types are supported.

Prior Distributions

Samples for U and V are taken from either a multivariate normal distribution or from a spike-and-slab distribution [6]. The spike-and-slab prior is a Gaussian distribution with an extra constraint that forces some latent components to zero. This makes it possible to identify common and disjoint factors between the different data sources [7].
Noise model

Noise is injected in the model when sampling. This noise is taken from a Gaussian distribution, with either a fixed precision or with an adaptive precision calculated based on how precisely the model matches the data.
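The difference between the two options can be sketched as follows, assuming a standard conjugate Gamma update for the adaptive case (the hyperparameters a0 and b0 are illustrative and not taken from SMURFF):

```python
import numpy as np

rng = np.random.default_rng(2)

# Residuals: model - data mismatch at the observed cells (toy values).
residuals = rng.normal(0.0, 0.5, size=5000)

# Fixed noise: the precision (1 / variance) is a user-supplied constant.
alpha_fixed = 5.0

# Adaptive noise: re-sample the precision from its conjugate Gamma
# posterior, so it tracks how precisely the model matches the data.
a0, b0 = 1.0, 1.0                     # illustrative Gamma hyperparameters
n = residuals.size
sse = np.sum(residuals ** 2)
alpha_adaptive = rng.gamma(shape=a0 + n / 2.0, scale=1.0 / (b0 + sse / 2.0))

# With residual standard deviation 0.5, the true precision is 1/0.25 = 4,
# and the sampled adaptive precision lands close to it.
```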

Side Information

Side information consists of properties of each row or column of the matrix. For example, for compound-activity prediction this could be the physical shape or chemical properties of the compounds, which can be used to improve the quality of the factorization.

The link matrix β, which incorporates the side information F into the model, is constructed using the Macau algorithm described in [8].
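The role of the link matrix can be sketched as below: it maps side information into the latent space, so that a latent vector, and hence predictions, can be obtained even for a row with no observed ratings. In practice β is learned by Macau's Gibbs sampler; here it is random for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

n_features, K = 8, 4
beta = rng.normal(size=(n_features, K))  # link matrix (learned by Macau in practice)

# Side-information vector for a *new* compound (row) never seen in the
# training matrix, e.g. a chemical fingerprint.
f_new = rng.normal(size=n_features)

# The link matrix maps side information into latent space.
u_new = f_new @ beta           # latent vector of size K for the new row
V = rng.normal(size=(10, K))   # column factors (e.g. proteins)
predictions = V @ u_new        # predicted activities for the new row
```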

Possible Algorithms

The aforementioned options for the type of input matrices, the prior distribution, the noise model and the inclusion of side information through a link matrix can be combined in different ways, resulting in different MF algorithms. The bottom part of Table 1 shows the choices needed to implement Macau [8], BMF [3] and Group Factor Analysis (GFA) [7]. Of course, other combinations are possible.

3 High-Performance Implementation

The simplest implementation (a single matrix, without side information) can be expressed as the pseudo code shown in Algorithm 1. Most time is spent in the loops updating U and V, where each iteration consists of some relatively basic matrix and vector operations on small matrices, and one computationally more expensive matrix inversion.

for sampling iterations do
       sample movie hyper-parameters based on V
       for all movies of R do
             update the movie model based on the ratings (R) for this movie and the model of the users that rated this movie, plus randomly sampled noise
       end for
       sample user hyper-parameters based on U
       for all users of R do
             update the user model based on the ratings (R) for this user and the model of the movies this user rated, plus randomly sampled noise
       end for
       for all test points do
             predict rating and compute RMSE
       end for
end for
Algorithm 1 SMURFF Pseudo Code
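A compact numpy sketch of this Gibbs loop for a single sparse matrix, with fixed hyper-parameters to keep it short (SMURFF additionally samples the hyper-parameters each iteration, and its C++ implementation uses Eigen rather than numpy):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy sparse ratings as (user, movie, value) triplets.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_movies, K = 3, 3, 2
alpha, lam = 2.0, 1.0  # fixed noise precision and prior precision

U = rng.normal(size=(n_users, K))
V = rng.normal(size=(n_movies, K))

def sample_factor(i, other, obs):
    """Gibbs update of one latent vector: its conditional posterior is a
    multivariate normal built from this row's observed ratings and the
    current factors on the other side (the expensive step is the KxK
    matrix inversion)."""
    prec = lam * np.eye(K)        # prior precision
    mean_term = np.zeros(K)
    for a, j, r in obs:
        if a == i:
            prec += alpha * np.outer(other[j], other[j])
            mean_term += alpha * r * other[j]
    cov = np.linalg.inv(prec)
    return rng.multivariate_normal(cov @ mean_term, cov)

for it in range(50):                        # sampling iterations
    for u in range(n_users):                # parallelisable for-all-users loop
        U[u] = sample_factor(u, V, ratings)
    flipped = [(j, i, r) for (i, j, r) in ratings]
    for m in range(n_movies):               # parallelisable for-all-movies loop
        V[m] = sample_factor(m, U, flipped)
    # predict the observed cells and track the RMSE of this sample
    rmse = np.sqrt(np.mean([(U[i] @ V[j] - r) ** 2 for i, j, r in ratings]))
```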

These matrix and vector operations are very well supported in Eigen [9], a high-performance modern C++11 linear algebra library. Sampling from the basic distributions is available in the C++ standard template library (STL), or can be trivially implemented on top. As a result, the Eigen-based C++ version of Algorithm 1 is a mere 35 lines of C++ code. This implementation only supports BMF with a single sparse matrix and has been the starting point of the SMURFF framework.

Although the structure of the fully featured SMURFF framework is similar to Algorithm 1, it consists of more than 25,000 lines of code.

A Python API has been built on top of this low-level C++ library, allowing users of SMURFF to combine it with other Python packages for machine learning and matrices, such as numpy, scipy or scikit-learn [10]. Several elaborate examples and a tutorial are available as Jupyter notebooks [11].

Multi-core parallelisation

Since the number of users and movies is very large and since all can be computed in parallel, the for-all-users and for-all-movies loops in Algorithm 1 become parallel-for loops. Within a single user/movie the amount of work depends on the number of non-zero entries in R. If a certain user/movie has many non-zero entries, we use a parallel algorithm, effectively splitting the work into smaller tasks that can utilize multiple cores on the system.

In the current implementation the parallelism is exploited using OpenMP [12]: parallel for loops for the user/movie loops, and OpenMP tasks for the parallelization within a single user/movie.
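SMURFF's parallelisation is OpenMP in C++; as an illustration of the same outer parallel-for pattern, here is a Python sketch using concurrent.futures (the function and values are stand-ins, not SMURFF code):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

V = np.ones((100, 8))  # current movie factors (toy values)

def update_user(u):
    """Stand-in for one Gibbs update: a user's new latent vector depends
    only on V and that user's own ratings, so all users are independent
    and the for-all-users loop can safely run in parallel."""
    return u, (u + 1) * V.mean(axis=0)

# The for-all-users loop of Algorithm 1 becomes a parallel-for.
with ThreadPoolExecutor(max_workers=4) as pool:
    new_U = dict(pool.map(update_user, range(1000)))
```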

4 Use Cases

In this section we look at the different matrix factorization methods that can be implemented using SMURFF, and compare the SMURFF and original implementations with respect to compute performance and machine learning result.


As a benchmark we use ChEMBL [13], a dataset from the drug discovery research field. ChEMBL contains descriptions of biological activities involving over a million chemical entities, extracted primarily from scientific literature. Several versions exist, since the dataset is updated on a fairly frequent basis. In this work, we use a dataset extracted from ChEMBL containing IC50 levels, which is a measure of the effectiveness of a substance in inhibiting a specific biological or biochemical function, and ECFP chemical descriptors for the compounds.


We compare the performance of the proposed multi-core BMF with several other popular machine-learning packages that implement the BMF algorithm. We verified that the predictive performance of the model is the same for all implementations.

Figure 3: Runtime of different BMF implementations.
  1. PyMC3 [14]: PyMC3 is a Python package for Bayesian statistical modeling and probabilistic machine learning which focuses on advanced Markov chain Monte Carlo and variational fitting algorithms. PyMC3 relies on Theano for automatic differentiation and also for computation optimization and dynamic C compilation.


  2. GraphChi: GraphChi is a scalable, high-performance multi-core implementation of Turi, a graph-based, high-performance, distributed computation framework written in C++ [15].

  3. SMURFF: this implementation.

  4. BMF with GASPI: a BMF-only multi-node implementation [16]. This code has been heavily optimized using the GASPI library [17] to run very efficiently on multiple nodes.

Figure 3 shows the runtime of BMF [3] for the four implementations. The X-axis shows the number of cores used. We have tested the PyMC3, GraphChi and SMURFF implementations on a single node with 36 cores, since these implementations do not support multiple nodes, and the "BMF with GASPI" implementation on a system with 128 nodes and 2048 cores in total, since this version has been optimized for a multi-node supercomputer.

The results show that SMURFF is significantly faster than GraphChi, and even more so than PyMC3. This is expected: PyMC3 and GraphChi are very versatile frameworks, easy to use and with many more capabilities than SMURFF, and this generality comes at a performance cost. PyMC3, for example, has a high-level Python interface in which one can choose many different samplers and distributions, and even implement custom ones in Python. SMURFF, on the other hand, only supports Gibbs sampling from normal distributions.

As expected, the BMF-with-GASPI version scales very well, up to 1000 cores, as we already showed in [16].


We have explored the use and correctness of Group Factor Analysis (GFA) with SMURFF by reproducing the results of the simulated study in [18].

We have verified that SMURFF and the original R implementation of GFA produce the same model, while the SMURFF C++ version is significantly faster. Using the SMURFF implementation, we were able to reduce the runtime of GFA on a realistic industrial dataset from 3 months for the R version [7] to 15 hours for the C++ implementation. This speedup is expected in general because R is an interpreted language, but in this case especially so, since R is even slower on sparse matrices and since the code contains many explicit for-loops [19].


The benefit of the Macau [8] algorithm is its ability to incorporate side information for the rows and/or columns of the matrix. SMURFF has been developed in collaboration with the authors of the Macau implementation, and shares the same code base.

SMURFF with Macau has been used to predict compound-on-protein activity on a matrix with more than one million compounds (rows) and several thousand proteins (columns) [20]. Chemical fingerprints were used for the compounds, in both dense and sparse formats. This has led to important new insights and potential new compounds to be used in drug discovery.

5 Compute Performance

In this section we look at the hardware performance of SMURFF on different processor architectures and accelerators. Since some hardware platforms are better at handling sparse matrices, and some are better at dealing with dense data, we also look at data types in this section.

Figure 4: Performance of BMF, Macau on dense input data and Macau on sparse input data on three different hardware platforms.

We compare three different hardware platforms:

  1. Xeon: a Haswell Intel Xeon processor running at 2.3 GHz with 36 cores in 2 sockets (2 × 18 cores). We use up to 72 threads with Hyper-Threading.

  2. Xeon Phi: an Intel KNC Xeon Phi processor running at 1.2 GHz with 61 cores. We use up to 244 threads with Hyper-Threading.

  3. ARM: a ThunderX ARM 64bit processor with 96 cores.

The results in Figure 4 show that the regular Xeon processor always gives the best performance and the Xeon Phi always the worst, the latter being between 4x and 10x slower. The two main reasons for the Xeon Phi being slower are, on the one hand, its lower clock frequency (1.2 GHz, compared to 2.3 GHz for the Xeon) and, on the other hand, its inferior cache structure. It is known that the cache coherency protocol on the regular Xeon is far better than the ring structure that is used for L2 cache coherency on the Xeon Phi [21].

The ARM processor performs in between the two Xeon processors, being on average slower than the regular Xeon, sometimes closer to the Xeon Phi and sometimes closer to the Xeon. This is explained by the difference in vector instruction size: the Xeon uses 256-bit AVX2 instructions, while the ARM uses 128-bit NEON instructions.

The gap between the different hardware platforms is largest for sparse input data, where the Xeon Haswell processor excels and the other two struggle. This is mainly due to the superior memory architecture of the Xeon: its larger L3 cache of 40 MB helps to keep the large sparse matrices in cache, compared to the smaller 16 MB cache on the ARM and the already mentioned slower ring interconnect between the Xeon Phi cores.

6 Binary Packaging using Conda

Figure 5: Performance of compilation with Conda as opposed to native compilation.

SMURFF can be installed with one command using Conda [22]: conda install -c vanderaa smurff

The main benefits of using Conda are:

  • Binary packages, no need to compile SMURFF yourself;

  • Performance almost as good as compiling SMURFF yourself;

  • A large ecosystem of packages, not only Python;

  • Works on Windows, Linux and Mac OS, with Python 2 and Python 3;

Figure 5 shows that the binary Conda package is indeed compatible with many processor generations, without much loss in performance. The figure shows the runtime of four different combinations of compilers (GCC or the Intel compiler), algebra libraries (Intel Math Kernel Library [23] or OpenBLAS) and target platform (either compilation for the native processor, or compilation for a generic hardware target with Conda).

Because the Intel MKL library dynamically adapts to the runtime hardware platform, it performs as well in native as in Conda mode, and much better than OpenBLAS, especially for the BMF algorithm. Using the Intel or GCC compiler does not make a difference, since both compilers are state of the art and since most of the time is spent in MKL.

7 Conclusions and Future Work

This paper described SMURFF, a multi-core high-performance framework that supports a wide range of Bayesian matrix-factorization algorithms, such as Macau [8], BMF [3] and GFA [7]. SMURFF is available as open source and can be used both on a supercomputer and on a desktop or laptop machine. Documentation and several examples are provided as Jupyter notebooks using SMURFF's high-level Python API. The framework has been successfully used to do large-scale runs of compound-activity prediction.

In future versions of SMURFF, we will incorporate the use of GASPI and MPI to be able to run SMURFF efficiently on multiple nodes.


This work is partly funded by the European projects ExCAPE and EPEEC with references 671555 and 801051, and by the IT4Innovations infrastructure, which is supported by the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center – LM2015070".


  • [1] C. A. Gomez-Uribe and N. Hunt, “The netflix recommender system: Algorithms, business value, and innovation,” ACM Trans. Manage. Inf. Syst., vol. 6, no. 4, pp. 13:1–13:19, Dec. 2015. [Online]. Available:
  • [2] M. Bredel and E. Jacoby, “Chemogenomics: an emerging strategy for rapid target and drug discovery,” Nature Reviews Genetics, vol. 5, p. 262, Apr. 2004. [Online]. Available:
  • [3] R. Salakhutdinov and A. Mnih, “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo,” in Proceedings of the International Conference on Machine Learning, vol. 25, 2008.
  • [4]

    I. Yildirim, “Bayesian inference: Gibbs sampling,”

    Technical Note, University of Rochester, 2012.
  • [5] “SMURFF matrix factorization source code,”
  • [6] H. Ishwaran and J. S. Rao, “Spike and slab variable selection: frequentist and bayesian strategies,” Annals of Statistics, pp. 730–773, 2005.
  • [7] S. Virtanen, A. Klami, S. A. Khan, and S. Kaski, “Bayesian group factor analysis.” in AISTATS, 2012, pp. 1269–1277.
  • [8] J. Simm, A. Arany, P. Zakeri, T. Haber, J. K. Wegner, V. Chupakhin, H. Ceulemans, and Y. Moreau, “Macau: Scalable bayesian multi-relational factorization with side information using MCMC,” 2015.
  • [9] G. Guennebaud, B. Jacob et al., “Eigen v3,” 2010.
  • [10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [11] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. E. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. B. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, C. Willing, and et al., “Jupyter notebooks - a publishing format for reproducible computational workflows,” in ELPUB, 2016.
  • [12] OpenMP Architecture Review Board, “OpenMP application program interface version 3.0,” May 2008. [Online]. Available:
  • [13] A. Bento, A. Gaulton, A. Hersey, L. Bellis, J. Chambers, M. Davies, F. Kruger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos, and J. Overington, “The chembl bioactivity database: an update.” Nucleic Acids Res., vol. 42, pp. 1083–1090, 2014.
  • [14] J. Salvatier, T. Wiecki, and C. Fonnesbeck, “Probabilistic programming in python using PyMC3,” PeerJ Computer Science, vol. 2, 2016.
  • [15] A. Kyrola, G. Blelloch, and C. Guestrin, “Graphchi: Large-scale graph computation on just a pc,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’12.   Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available:
  • [16] T. Vander Aa, I. Chakroun, and T. Haber, “Distributed bayesian probabilistic matrix factorization,” in ICCS 2017: International Conference on Computational Science, June 2017.
  • [17] D. Grünewald and C. Simmendinger, “The GASPI API specification and its implementation GPI 2.0,” in 7th International Conference on PGAS Programming Models, vol. 243, 2013.
  • [18] K. Bunte, E. Leppäaho, I. Saarinen, and S. Kaski, “Sparse group factor analysis for biclustering of multiple data sources,” CoRR, vol. abs/1512.08808, 2015. [Online]. Available:
  • [19] “Why are loops slow in R?”, 2018.
  • [20] The ExCAPE Consortium. ExCAPE: Exascale Compound Activity Prediction Engine. Retrieved: June 2017.
  • [21] J. Fang, H. Sips, L. Zhang, C. Xu, Y. Che, and A. L. Varbanescu, “Test-driving intel xeon phi,” in Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’14.   New York, NY, USA: ACM, 2014, pp. 137–148. [Online]. Available:
  • [22] “SMURFF conda packages,”
  • [23] Intel math kernel library, Intel Corporation, 2017, see