1 Matrix Factorization
Recommender Systems (RS) have become very common in recent years and are useful in various real-life applications. The most popular ones are probably movie suggestions on Netflix and book suggestions on Amazon
[1]. However, they can also be used in less obvious areas such as drug discovery, where a key problem is the identification of candidate molecules that affect proteins associated with diseases [2]. In RS one has to analyze large and sparse matrices, for example those containing the known movie or book ratings. Matrix Factorization (MF) is a technique that has been used successfully here. As sketched in Figure 1, the idea of this method is to approximate the rating matrix R as a product of two low-rank matrices U and V. Predictions can be made from this approximation, which is dense.
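As an illustration of this idea (not of SMURFF's own Bayesian sampler), the following numpy sketch factors a toy rating matrix with alternating least squares and uses the dense low-rank product to predict the unknown entries; all names and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix: 4 users x 5 movies, 0 marks an unknown rating.
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 1],
              [1, 1, 0, 5, 0],
              [0, 1, 5, 4, 0]], dtype=float)
mask = R > 0                      # observed entries

# Factor R into two low-rank matrices U (users x K) and V (movies x K)
# by simple alternating least squares on the observed entries only.
K, lam = 2, 0.1                   # latent dimension and regularization
U = rng.standard_normal((4, K))
V = rng.standard_normal((5, K))
for _ in range(50):
    for i in range(4):            # refit each user's latent vector
        Vi = V[mask[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(K),
                               Vi.T @ R[i, mask[i]])
    for j in range(5):            # refit each movie's latent vector
        Uj = U[mask[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(K),
                               Uj.T @ R[mask[:, j], j])

# The approximation U @ V.T is dense: it also fills in the unknowns.
pred = U @ V.T
```

The zero entries of R now receive predicted ratings from `pred`, which is the property the recommender exploits.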
Bayesian Matrix Factorization (BMF) [3], using Gibbs sampling, is one of the more popular approaches to matrix factorization. Thanks to the Bayesian approach, BMF has been shown to be more robust to overfitting [3], and Gibbs sampling [4] makes the problem computationally feasible. Yet BMF is still computationally intensive and thus challenging to apply to large datasets.
In this work we present SMURFF [5], a high-performance, feature-rich framework to compose and construct different Bayesian matrix factorization methods. Using the SMURFF framework one can easily vary: i) the type of matrix to be factored (dense or sparse); ii) the prior distribution the model is assumed to follow (multivariate normal, spike-and-slab, and others); or iii) the noise model (Gaussian noise, probit noise, or others). The framework also allows combining different matrices, and thus incorporating more and different types of information into the model.
In this paper we describe the SMURFF framework. Section 2 lists the different features supported. In Section 3 we describe some of the implementation details and how they impact compute performance. We compare SMURFF to other matrix factorization implementations in Section 4, and look at compute performance in Section 5 and Section 6. Finally, Section 7 presents conclusions and future work.
2 Framework
Figure 2 shows a matrix-factorization system as it can be composed with SMURFF. The basic matrix components are:

- The matrix to be factored, R, which can be composed of different blocks. Each of the blocks can be sparse with many values unknown, or fully known. If a block is fully known, its data can still be sparse (many values zero) or dense (all non-zeros).
- The model matrices, i.e. the factors U and V. The number of rows in U and of columns in V is the so-called latent dimension K. Each user and each movie is represented by a latent vector of size K, in U and V respectively. Each element of such a vector is called a component.
- A matrix with side information for the rows and/or columns of R, which is taken into account by constructing a link matrix. More information is given in the paragraphs below.
In the following paragraphs we describe the basic components of the matrix factorization framework and how they are used or constructed. The columns of Table 1 represent the different choices SMURFF supports in terms of input matrices, prior distributions, noise models and types of side information. We explain what these choices mean in the following paragraphs.
           Input Matrices        Prior Distribution    Noise Model        Side Information

Choices    Sparse with unknowns  Normal                Fixed Gaussian     Link Matrix
           Sparse fully known    Spike-and-Slab (SnS)  Adaptive Gaussian  -
           Dense                 -                     -                  -

BMF        Sparse with unknowns  Normal                Fixed              -
Macau      Sparse with unknowns  Normal                Fixed/Adaptive     Link Matrix
GFA        Sparse and/or Dense   Normal + SnS          Fixed/Adaptive     -
Input Matrices
As shown in Figure 2, different matrix types are supported.
Prior Distributions
Samples from U and V are taken from either a multivariate normal distribution or from a spike-and-slab distribution [6]. The spike-and-slab prior is a Gaussian distribution with an extra constraint that forces some latent components to zero. This makes it possible to identify common and disjoint factors between the different data sources [7].
Noise Model
Noise is injected into the model when sampling. This noise is taken from a Gaussian distribution, with either a fixed precision or an adaptive precision calculated from how precisely the model matches the data.
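The spike-and-slab constraint and the two noise settings can be illustrated with a small numpy sketch; the variable names and the inverse-mean-squared-residual estimate below are illustrative assumptions, not SMURFF's exact update rules.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Spike-and-slab draw for one latent vector of size K ---
K = 8
pi, slab_var = 0.5, 1.0           # inclusion probability, slab variance
z = rng.random(K) < pi            # inclusion flags: False = "spike" at zero
u = np.where(z, rng.normal(0.0, np.sqrt(slab_var), K), 0.0)

# --- Noise precision ---
# Fixed precision: a constant chosen up front.
alpha_fixed = 5.0

# Adaptive precision: derived from how precisely the current model
# matches the data (here: inverse mean squared residual).
R_obs = rng.normal(3.0, 1.0, 100)           # observed values
R_hat = R_obs + rng.normal(0.0, 0.5, 100)   # current model predictions
alpha_adaptive = 1.0 / np.mean((R_obs - R_hat) ** 2)
```

The spike components of `u` are exactly zero, which is what lets the model switch latent factors off per data source.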
Side Information
Side information consists of properties of each row or column of the matrix. For compound-activity prediction, for example, the physical shape or chemical properties of the compounds can be used to improve the quality of the factorization.
The link matrix that includes this side information in the model is constructed using the Macau algorithm described in [8].
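As a rough illustration (not the actual Macau sampler, which also samples the link matrix itself [8]), side information shifts the prior mean of each row's latent vector; the names F and beta and the matrix orientation below are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

n_rows, n_feat, K = 6, 10, 4
F = rng.standard_normal((n_rows, n_feat))   # side information per row
beta = rng.standard_normal((n_feat, K))     # link matrix

# Without side information, every latent vector shares the prior mean mu;
# with a link matrix, the mean is shifted per row by that row's features.
mu = np.zeros(K)
prior_means = mu + F @ beta                 # one K-dim prior mean per row

# Sampling a latent vector then centers on its row-specific prior mean.
u_0 = prior_means[0] + 0.1 * rng.standard_normal(K)
```

Rows with similar side information thus get similar prior means, which is how the factorization can generalize to rows with few or no observed entries.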
Possible Algorithms
The aforementioned options regarding the type of input matrices, the prior distribution, the noise model, and the inclusion of side information through a link matrix can be combined in different ways, resulting in different MF algorithms. The bottom part of Table 1 shows the choices needed to implement Macau [8], BMF [3] and Group Factor Analysis (GFA) [7]. Of course, other combinations are possible.
3 HighPerformance Implementation
The simplest implementation (a single matrix, without side information) can be expressed as the pseudo-code shown in Algorithm 1. Most time is spent in the loops updating U and V, where each iteration consists of relatively basic matrix and vector operations, plus one computationally more expensive matrix inversion.
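The core Gibbs update inside those loops can be sketched in a few lines of numpy. This is an illustrative sketch of the standard BMF conditional update for one user's latent vector [3], with a fixed noise precision and a simplified prior; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

K, n_movies, alpha = 4, 30, 2.0           # latent dim, movies, noise precision
V = rng.standard_normal((n_movies, K))    # current movie factors
r_i = rng.normal(3.0, 1.0, n_movies)      # user i's ratings
observed = rng.random(n_movies) < 0.3     # which ratings are known

Lambda_0 = np.eye(K)                      # prior precision for u_i
mu_0 = np.zeros(K)                        # prior mean for u_i

# Conditional posterior of u_i given V and the observed ratings:
Vo, ro = V[observed], r_i[observed]
Lambda_i = Lambda_0 + alpha * Vo.T @ Vo   # posterior precision
mu_i = np.linalg.solve(Lambda_i, Lambda_0 @ mu_0 + alpha * Vo.T @ ro)

# Draw u_i ~ N(mu_i, Lambda_i^{-1}); inverting Lambda_i is the
# "computationally more expensive matrix inversion" mentioned above.
u_i = rng.multivariate_normal(mu_i, np.linalg.inv(Lambda_i))
```

The same update, with rows and columns swapped, yields the movie-side loop; the matrix to invert is only K x K, so the cost per user is dominated by the number of observed ratings.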
These matrix and vector operations are very well supported in Eigen [9], a high-performance modern C++11 linear algebra library. Sampling from the basic distributions is available in the C++ standard template library (STL), or can be trivially implemented on top of it. As a result, the Eigen-based C++ version of Algorithm 1 is a mere 35 lines of C++ code. This implementation only supports BMF with a single sparse matrix and was the starting point of the SMURFF framework.
Although the structure of the fully-featured SMURFF framework is similar to Algorithm 1, it consists of more than 25 000 lines of code.
A Python API has been built on top of this low-level C++ library, allowing users of SMURFF to combine it with other Python packages for machine learning and matrices, such as numpy, scipy or scikit-learn [10]. Several elaborate examples and a tutorial are available as Jupyter notebooks [11].
Multi-core Parallelisation
Since the numbers of users and movies are very large, and since all latent vectors can be computed in parallel, the for-all-users and for-all-movies loops in Algorithm 1 become parallel-for loops. Within a single user/movie, the amount of work depends on the number of non-zero entries in R. If a certain user/movie has many non-zero entries, we use a parallel algorithm, effectively splitting the work into smaller tasks that can utilize multiple cores on the system.
In the current implementation this parallelism is exploited using OpenMP [12]: parallel for loops for the user/movie loops, and OpenMP tasks for the parallelization within users/movies.
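SMURFF implements this with OpenMP in C++; purely as a structural sketch, the same parallel-for-over-users pattern might look as follows in Python with a thread pool (threads suffice here because numpy's linear algebra releases the GIL).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(4)
n_users, n_movies, K, alpha = 100, 50, 4, 2.0
V = rng.standard_normal((n_movies, K))            # current movie factors
R = rng.normal(3.0, 1.0, (n_users, n_movies))     # toy ratings
mask = rng.random((n_users, n_movies)) < 0.2      # observed entries

def update_user(i):
    # Each user's update is independent, so it is safe to run in parallel.
    Vo, ro = V[mask[i]], R[i, mask[i]]
    Lam = np.eye(K) + alpha * Vo.T @ Vo
    return np.linalg.solve(Lam, alpha * Vo.T @ ro)

# The "for all users" loop becomes a parallel map over users.
with ThreadPoolExecutor(max_workers=4) as pool:
    U = np.array(list(pool.map(update_user, range(n_users))))
```

The second level of parallelism from the text, splitting a single heavy user across cores, corresponds to parallelizing the `Vo.T @ Vo` reduction inside `update_user`, which OpenMP tasks handle in the C++ version.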
4 Use Cases
In this section we look at the different matrix factorization methods that can be implemented using SMURFF, and compare the SMURFF and original implementations with respect to compute performance and machine learning result.
Dataset
As a benchmark we use ChEMBL [13], a dataset from the drug discovery field. ChEMBL contains descriptions of biological activities involving over a million chemical entities, extracted primarily from the scientific literature. Several versions exist, since the dataset is updated fairly frequently. In this work we use a dataset extracted from ChEMBL containing IC50 levels, a measure of the effectiveness of a substance in inhibiting a specific biological or biochemical function, together with ECFP chemical descriptors for the compounds.
BMF
We compare the performance of the proposed multi-core BMF with several other popular machine-learning packages that implement the BMF algorithm. We verified that the predictive performance of the model is the same for all implementations.

- PyMC3: a Python package for Bayesian statistical modeling and probabilistic machine learning, which focuses on advanced Markov chain Monte Carlo and variational fitting algorithms. PyMC3 relies on Theano for automatic differentiation, as well as for computation optimization and dynamic C compilation [14].
- GraphChi: Turi is a graph-based, high-performance, distributed computation framework written in C++. We show the performance of GraphChi, a scalable high-performance multi-core implementation of Turi [15].
- BMF with GASPI: our earlier distributed, multi-node BMF implementation [16].
- SMURFF: this implementation.
Figure 3 shows the runtime of BMF [3] for the four implementations; the x-axis shows the number of cores used. We tested the PyMC3, GraphChi and SMURFF implementations on a single node with 36 cores, since these implementations do not support multiple nodes, and the "BMF with GASPI" implementation on a system with 128 nodes and 2048 cores in total, since this version has been optimized for multi-node supercomputers.
The results show that SMURFF is faster than GraphChi, and even more so than PyMC3. This is not surprising: PyMC3 and GraphChi are very versatile frameworks, easy to use and with many more possibilities than SMURFF. PyMC3, for example, has a high-level Python interface where one can choose many different samplers and distributions, and even implement custom samplers and distributions in Python. SMURFF, on the other hand, only supports Gibbs sampling from normal distributions.
As expected, the BMF-with-GASPI version, built on the GASPI programming model [17], scales very well, up to 1000 cores, as we already showed in [16].
GFA
We have explored the use and correctness of Group Factor Analysis (GFA) in SMURFF by reproducing the results from the simulated study in [18].
We have verified that SMURFF and the original R implementation of GFA produce the same model, while the SMURFF C++ version is significantly faster. Using the SMURFF implementation, we were able to reduce the runtime of GFA on a realistic industrial dataset from 3 months for the R version [7] to 15 hours for the C++ implementation. Such a speedup is expected, in general because R is an interpreted language, and in this case especially because R is even slower on sparse matrices and the code contains many explicit for-loops [19].
Macau
The main benefit of the Macau [8] algorithm is its ability to incorporate side information for the rows and/or columns of the matrix. SMURFF has been developed in collaboration with the authors of the Macau implementation, and shares the same code base.
SMURFF with Macau has been used to predict compound-on-protein activity on a matrix with more than one million compounds (rows) and several thousand proteins (columns) [20]. Chemical fingerprints were used as side information for the compounds, in both dense and sparse formats. This has led to important new insights and potential new compounds for drug discovery.
5 Compute Performance
In this section we look at the hardware performance of SMURFF on different processor architectures and accelerators. Since some hardware platforms are better at handling sparse matrices, and some are better at dealing with dense data, we also look at data types in this section.
We compare three different hardware platforms:

Xeon: a Haswell Intel Xeon processor running at 2.3 GHz with 36 cores in 2 sockets (2 × 18 cores). We use up to 36 threads with Hyper-Threading.

Xeon Phi: an Intel Knights Corner (KNC) Xeon Phi processor running at 1.2 GHz with 61 cores. We use up to 244 threads with Hyper-Threading.

ARM: a ThunderX 64-bit ARM processor with 96 cores.
The results in Figure 4 show that the regular Xeon processor always gives the best performance, and the Xeon Phi always the worst, the latter being between 4x and 10x slower. The two main reasons for the Xeon Phi being slower are its lower clock frequency (1.2 GHz, compared to 2.3 GHz for the Xeon) and its inferior cache structure. It is known that the cache coherency protocol on the regular Xeon is far better than the ring structure used for L2 cache coherency on the Xeon Phi [21].
The ARM processor performs in between the two Xeon processors, being on average slower than the regular Xeon, sometimes closer to the Xeon Phi and sometimes closer to the Xeon. This is explained by the difference in vector instruction width: the Xeon uses 256-bit AVX2 instructions, while the ARM uses 128-bit NEON instructions.
The gap between the different hardware platforms is largest for sparse input data, where the Haswell Xeon excels and the other two struggle. This is mainly due to the superior memory architecture of the Xeon: its larger 40 MB L3 cache helps keep the large sparse matrices in cache, compared to the smaller 16 MB cache on the ARM and the already mentioned slower ring interconnect between the Xeon Phi cores.
6 Binary Packaging using Conda
SMURFF can be installed with one command using Conda [22]: conda install -c vanderaa smurff
The main benefits of using Conda are:

- Binary packages: no need to compile SMURFF yourself;
- Performance almost as good as compiling SMURFF yourself;
- A large ecosystem of packages, not only Python;
- Works on Windows, Linux and Mac OS, with Python 2 and Python 3.
Figure 5 shows that the binary Conda package is indeed compatible with many processor generations, without much loss in performance. The figure shows the runtime of four different combinations of compiler (GCC or the Intel compiler), linear algebra library (Intel Math Kernel Library [23] or OpenBLAS) and target platform (either compilation for the native processor, or compilation for a generic hardware target with Conda).
Because the Intel MKL library dynamically adapts to the runtime hardware platform, it performs as well in native mode as in Conda mode, and much better than OpenBLAS, especially for the BMF algorithm. Using the Intel or the GCC compiler does not make a difference, since both compilers are state-of-the-art and most of the time is spent inside MKL.
7 Conclusions and Future Work
This paper described SMURFF, a multi-core high-performance framework that supports a wide range of Bayesian matrix-factorization algorithms, such as Macau [8], BMF [3] and GFA [7]. SMURFF is available as open source and can be used both on a supercomputer and on a desktop or laptop machine. Documentation and several examples are provided as Jupyter notebooks using SMURFF's high-level Python API. The framework has been successfully used for large-scale runs of compound-activity prediction.
In future versions of SMURFF, we will incorporate GASPI and MPI to be able to run SMURFF efficiently on multiple nodes.
Acknowledgments
This work is partly funded by the European projects ExCAPE and EPEEC (grant references 671555 and 801051), and by the IT4Innovations infrastructure, which is supported by the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center – LM2015070".
References
 [1] C. A. GomezUribe and N. Hunt, “The netflix recommender system: Algorithms, business value, and innovation,” ACM Trans. Manage. Inf. Syst., vol. 6, no. 4, pp. 13:1–13:19, Dec. 2015. [Online]. Available: http://doi.acm.org/10.1145/2843948
 [2] M. Bredel and E. Jacoby, “Chemogenomics: an emerging strategy for rapid target and drug discovery,” Nature Reviews Genetics, vol. 5, p. 262, Apr. 2004. [Online]. Available: http://dx.doi.org/10.1038/nrg1317
 [3] R. Salakhutdinov and A. Mnih, “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo,” in Proceedings of the International Conference on Machine Learning, vol. 25, 2008.

 [4] I. Yildirim, "Bayesian inference: Gibbs sampling," Technical Note, University of Rochester, 2012.
 [5] "SMURFF matrix factorization source code," https://github.com/ExaScience/smurff.
 [6] H. Ishwaran and J. S. Rao, “Spike and slab variable selection: frequentist and bayesian strategies,” Annals of Statistics, pp. 730–773, 2005.
 [7] S. Virtanen, A. Klami, S. A. Khan, and S. Kaski, “Bayesian group factor analysis.” in AISTATS, 2012, pp. 1269–1277.
 [8] J. Simm, A. Arany, P. Zakeri, T. Haber, J. K. Wegner, V. Chupakhin, H. Ceulemans, and Y. Moreau, “Macau: Scalable bayesian multirelational factorization with side information using MCMC,” 2015.
 [9] G. Guennebaud, B. Jacob et al., “Eigen v3,” http://eigen.tuxfamily.org, 2010.
 [10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikitlearn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [11] T. Kluyver, B. RaganKelley, F. Pérez, B. E. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. B. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, C. Willing, and et al., “Jupyter notebooks  a publishing format for reproducible computational workflows,” in ELPUB, 2016.
 [12] OpenMP Architecture Review Board, “OpenMP application program interface version 3.0,” May 2008. [Online]. Available: http://www.openmp.org/mpdocuments/spec30.pdf
 [13] A. Bento, A. Gaulton, A. Hersey, L. Bellis, J. Chambers, M. Davies, F. Kruger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos, and J. Overington, “The chembl bioactivity database: an update.” Nucleic Acids Res., vol. 42, pp. 1083–1090, 2014.
 [14] J. Salvatier, T. Wiecki, and C. Fonnesbeck, “Probabilistic programming in python using PyMC3,” PeerJ Computer Science, vol. 2, 2016.
 [15] A. Kyrola, G. Blelloch, and C. Guestrin, “Graphchi: Largescale graph computation on just a pc,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
 [16] T. Vander Aa, I. Chakroun, and T. Haber, “Distributed bayesian probabilistic matrix factorization,” in ICCS 2017: International Conference on Computational Science, June 2017.
 [17] D. Grünewald and C. Simmendinger, “The GASPI API specification and its implementation GPI 2.0,” in 7th International Conference on PGAS Programming Models, vol. 243, 2013.
 [18] K. Bunte, E. Leppäaho, I. Saarinen, and S. Kaski, “Sparse group factor analysis for biclustering of multiple data sources,” CoRR, vol. abs/1512.08808, 2015. [Online]. Available: http://arxiv.org/abs/1512.08808
 [19] “Why are loops slow in r?” https://stackoverflow.com/questions/7142767/, 2018.
 [20] The ExCAPE Consortium. ExCAPE: Exascale Compound Activity Prediction Engine. http://excapeh2020.eu/. Retrieved: June 2017.
 [21] J. Fang, H. Sips, L. Zhang, C. Xu, Y. Che, and A. L. Varbanescu, “Testdriving intel xeon phi,” in Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’14. New York, NY, USA: ACM, 2014, pp. 137–148. [Online]. Available: http://doi.acm.org/10.1145/2568088.2576799
 [22] “SMURFF conda packages,” https://anaconda.org/vanderaa/smurff/.
 [23] Intel math kernel library, Intel Corporation, 2017, see http://software.intel.com/enus/articles/intelmkl/.