Efficient Matrix Factorization on Heterogeneous CPU-GPU Systems

06/24/2020 ∙ by Yuanhang Yu, et al. ∙ Zhejiang Gongshang University ∙ University of Technology Sydney ∙ UNSW

Matrix Factorization (MF) has been widely applied in machine learning and data mining. A large number of algorithms have been studied to factorize matrices. Among them, stochastic gradient descent (SGD) is a commonly used method. Heterogeneous systems with multi-core CPUs and GPUs have become increasingly promising due to the prevalence of GPUs in general-purpose data-parallel applications. Due to the large computational cost of MF, we aim to improve the efficiency of SGD-based MF computation by utilizing the massive parallel processing power of heterogeneous multiprocessors. The main challenge in parallel SGD algorithms on heterogeneous CPU-GPU systems lies in the granularity of the matrix division and the strategy used to assign tasks. We design a novel strategy to divide the matrix into a set of blocks by considering two aspects. First, we observe that the matrix should be divided nonuniformly, and relatively large blocks should be assigned to GPUs to saturate their computing power. Second, in addition to exploiting the characteristics of the hardware, the workloads assigned to the two types of hardware should be balanced. Toward the final division strategy, we design a cost model tailored for our problem to accurately estimate the performance of the hardware on different data sizes. A dynamic scheduling policy is also used to further balance workloads in practice. Extensive experiments show that our proposed algorithm achieves high efficiency with high training quality.






I Introduction

As a common technique in machine learning and data mining, matrix factorization (MF) has been widely applied in many areas, such as recommendation system [lian2014geomf, he2016fast], social network analysis [li2015predicting, al2017unveiling], web mining [qiu2018network], word embedding [pennington2014glove], and graph embedding [DBLP:journals/tkde/CuiWPZ19].

Given a sparse matrix $R$ with $m$ rows and $n$ columns, the aim of MF is to decompose $R$ into two dense matrices $P \in \mathbb{R}^{m \times k}$ and $Q \in \mathbb{R}^{k \times n}$ that satisfy the following condition:

$$R \approx P \times Q$$

Here, the dimension $k$ is far less than $m$ and $n$, so that the original sparse data can be represented well by new dense data with low dimensionality.

Stochastic Gradient Descent. The existing literature has proposed three main approaches to solve MF, namely, alternating least squares (ALS), coordinate descent (CD), and stochastic gradient descent (SGD). Among them, SGD-based algorithms have received the most attention due to their algorithmic simplicity and effectiveness. SGD algorithms execute several iterations until the training result converges; the number of iterations can be a parameter specified by users. In each iteration, the gradient of every element in the sparse matrix is computed, and the corresponding vectors in the result matrices are updated. We focus on SGD-based MF algorithms in this paper and use MF to denote SGD-based MF whenever there is no ambiguity.

In the literature, several parallel SGD approaches have been proposed for different system settings, such as GPUs (e.g., CuMF_SGD [xie2017cumf_sgd]), multi-core CPUs (e.g., FPSGD [zhuang2013fast]), and distributed systems (e.g., NOMAD [yun2014nomad]). To parallelize the computation in SGD, the main idea of these methods is to uniformly divide the sparse matrix into a set of blocks. In each iteration, a set of mutually independent blocks is assigned to different working units (e.g., threads, nodes, or GPUs) and updated by them. Here, two blocks are independent of each other if they share neither the same row nor the same column. This strategy avoids write conflicts in $P$ and $Q$, since elements in the same row (resp. column) of $R$ update the same row of $P$ (resp. the same column of $Q$). Several optimizations have also been studied for specific system settings, and more related works on MF are introduced in Section III.

Fig. 1: A rating matrix and a corresponding matrix factorization

Nowadays, GPUs have become very popular for running general-purpose data-parallel applications. Consequently, heterogeneous systems with multi-core CPUs and GPUs have become more and more promising for many general tasks. The idea of combining CPUs and GPUs not only speeds up task execution but also effectively utilizes the computing resources available in heterogeneous systems. By contrast, to meet performance requirements, CPU-only or GPU-only systems are usually over-provisioned, so their average utilization remains low. For example, after allocating a task to a GPU (i.e., starting the kernel), the CPU stays idle. Similarly, for applications where the GPU memory bandwidth is a major bottleneck, the massive computation resources of GPUs remain underutilized. If the resources that would otherwise sit idle in the system are utilized effectively, the running time of the program can be reduced. This has motivated a significant amount of research on heterogeneous computing techniques [luk2009qilin, pandit2014fluidic, stehle2017memory]. To the best of our knowledge, there is no existing work on the MF task on heterogeneous CPU-GPU systems, and it is desirable to develop an efficient MF algorithm for this system setting. In this paper, we mainly study the task division and the scheduling strategy for SGD on heterogeneous CPU-GPU systems. Our techniques do not closely depend on any specific algorithm tailored for a single CPU thread or a GPU.

Since the threads within a GPU are tightly coupled, a straightforward method to combine GPUs and CPUs is to treat each GPU as a CPU thread. Then, we can naturally adopt the state-of-the-art shared-memory algorithm FPSGD [zhuang2013fast] for our problem. We call this method HSGD. Despite the success of HSGD, there are still opportunities for further improvement. First, based on [zhuang2013fast], we derive a matrix division rule such that the number of blocks in the matrix should be at least $(s_c + s_g + 1) \times (s_c + s_g + 1)$, where $s_c$ and $s_g$ are the number of CPU cores and GPUs, respectively. We observe that in HSGD, even with the smallest number of blocks, the size of a block derived by the uniform division is not large enough to saturate the processing power of GPUs (Observation IV-B). This may limit the overall efficiency of HSGD. Second, the computing power of a GPU is normally much stronger than that of a single CPU thread. Given that blocks sharing the same row or column cannot be processed in parallel, the update frequency of the blocks processed by the GPU can be extremely high. This phenomenon may lead to weak training quality (Example IV-B).

Our Approach. To overcome the drawbacks of HSGD, we propose a nonuniform matrix division strategy for the parallel SGD algorithm. Specifically, we first design a novel cost model to evaluate the performance of the hardware. Existing approaches to estimating working efficiency on heterogeneous CPU-GPU systems are either unavailable in our setting or not accurate enough for the MF problem. Our cost model is tailored for the parallel SGD algorithm and can accurately estimate the performance of GPUs even when the block is relatively small. Based on the cost model, we initially divide the matrix into two parts, which are assigned to GPUs and CPUs, respectively. Each part is divided into a set of blocks to avoid conflicts in MF computation. We also adopt a dynamic scheduling policy that allows the dominant computation resource to use a work-stealing mechanism [blumofe1999scheduling] and process blocks originally assigned to the other resource. This optimization further balances workloads by filling the gap between the estimation of our cost model and the practical performance. When the dynamic scheduling policy is activated, the sparse matrix is divided into more blocks to avoid conflicts.

Contributions. We summarize our main contributions as follows.

  • A parallel SGD algorithm on heterogeneous CPU-GPU systems. To the best of our knowledge, this is the first work to efficiently tackle matrix factorization on heterogeneous CPU-GPU systems.

  • A nonuniform matrix division strategy to improve efficiency. Compared with a straightforward method to uniformly divide the matrix, our division strategy fully utilizes the strong processing power of GPUs.

  • A novel cost model to estimate the performance of GPUs. We balance workloads using a cost model specifically designed for our problem. The cost model of the GPU part considers both data transfer and kernel execution.

  • Extensive experiments on four benchmark datasets. The experiments demonstrate that our proposed framework can effectively utilize the resources in heterogeneous CPU-GPU systems, and show its superior performance compared to the competitors.

Organization. The rest of the paper is organized as follows. Section II introduces background information including MF and GPU architecture. Section III reviews related works. Section IV discusses the motivation and proposes an overall framework. Section V proposes the details of our cost model. Section VI introduces the dynamic scheduling strategy and the final matrix division strategy. Section VII reports the experimental results, and Section VIII concludes the paper.

II Preliminary

II-A Matrix Factorization

We consider a user-item rating matrix $R \in \mathbb{R}^{m \times n}$, where $m$ and $n$ are the number of rows and the number of columns of the matrix, respectively. For each $u \in [1, m]$ and $v \in [1, n]$, $r_{u,v}$ is the rating from user $u$ to item $v$. $R$ is normally sparse, since not every element in $R$ is explicitly reported. Matrix factorization aims to represent the matrix $R$ as a dot product of two dense matrices $P \in \mathbb{R}^{m \times k}$ and $Q \in \mathbb{R}^{k \times n}$, where $k$ is the number of latent factors. A mathematical representation of the matrix factorization is shown as follows.

$$r_{u,v} \approx \hat{r}_{u,v} = p_u \cdot q_v \qquad (1)$$

In Equation 1, $p_u$ is the $u$-th row vector of $P$, and $q_v$ is the $v$-th column vector of $Q$. Matrix factorization achieves Equation 1 by minimizing the following loss function.

$$\min_{P,Q} \sum_{r_{u,v} \in R} \left( (r_{u,v} - p_u \cdot q_v)^2 + \lambda_P \lVert p_u \rVert_F^2 + \lambda_Q \lVert q_v \rVert_F^2 \right) \qquad (2)$$

In Equation 2, $(r_{u,v} - p_u \cdot q_v)^2$ measures the gap between $r_{u,v}$ and the estimated value $p_u \cdot q_v$. $\lVert p_u \rVert_F^2$ and $\lVert q_v \rVert_F^2$ are used to avoid over-fitting, where $\lVert \cdot \rVert_F$ computes the Frobenius norm (https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm) of a vector. $\lambda_P$ and $\lambda_Q$ are regularization coefficients.


We give an example of matrix factorization in Figure 1. The rating matrix $R$ contains nine ratings for four movies given by four customers, and the number of latent factors is $k$. An entry $r_{u,v}$ means that user $u$ gives the rating $r_{u,v}$ to movie $v$. The results of $R$'s matrix factorization are shown on the right of $R$: $P$ is a user preference matrix, and $Q$ is a movie feature matrix. The vector $p_u$ is the preference of user $u$, and $q_v$ is the feature of movie $v$. The estimated value of $r_{u,v}$ is $p_u \cdot q_v$, which is close to the observed rating $r_{u,v}$.
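To make the dot-product view concrete, the following sketch computes one estimated rating from a user vector and an item vector. The numbers are invented for illustration; they are not the values shown in Figure 1.

```python
# Toy illustration of r_{u,v} ≈ p_u · q_v with k = 2 latent factors.
# All numbers below are made up for illustration.

def estimate_rating(p_u, q_v):
    """The estimated rating is the dot product of the two latent vectors."""
    return sum(a * b for a, b in zip(p_u, q_v))

p_u = [1.2, 0.8]   # preference vector of user u (one row of P)
q_v = [2.0, 1.5]   # feature vector of item v (one column of Q)

r_hat = estimate_rating(p_u, q_v)  # 1.2*2.0 + 0.8*1.5 ≈ 3.6
```

If the observed rating were, say, 4, the training procedure below would nudge both vectors to close the gap.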

II-B Stochastic Gradient Descent for Matrix Factorization

It is time-consuming to calculate the loss over the whole matrix when using the loss function in Equation 2, especially when $R$ contains billions of items. Several works have aimed to minimize Equation 2 and improve the efficiency of matrix factorization [funk2006netflix, koren2009matrix, yu2012scalable]. In this paper, we follow a prevalent algorithm among them called stochastic gradient descent (SGD) [funk2006netflix]; the ideas of several other algorithms are introduced in Section III.

SGD executes several iterations, whose number can be specified by users. Instead of straightforwardly applying gradient descent to minimize Equation 2 in each iteration, SGD computes the gradient of every element in $R$ and updates the corresponding vectors in the result matrices. Consequently, the original loss function in Equation 2 is reduced to the following equation for a single element $r_{u,v}$.

$$\ell_{u,v} = (r_{u,v} - p_u \cdot q_v)^2 + \lambda_P \lVert p_u \rVert_F^2 + \lambda_Q \lVert q_v \rVert_F^2 \qquad (3)$$

The gradient of Equation 3 is represented as follows, where $e_{u,v} = r_{u,v} - p_u \cdot q_v$.

$$\frac{\partial \ell_{u,v}}{\partial p_u} = -2 \left( e_{u,v} \, q_v^\top - \lambda_P \, p_u \right) \qquad (4)$$

$$\frac{\partial \ell_{u,v}}{\partial q_v} = -2 \left( e_{u,v} \, p_u^\top - \lambda_Q \, q_v \right) \qquad (5)$$

Based on the gradient (Equation 4 and Equation 5), we train the model iteratively, and the value of the loss function decreases until it converges. Equation 6 shows this process, where $\gamma$ is the learning rate.

$$p_u \leftarrow p_u + \gamma \left( e_{u,v} \, q_v^\top - \lambda_P \, p_u \right), \qquad q_v \leftarrow q_v + \gamma \left( e_{u,v} \, p_u^\top - \lambda_Q \, q_v \right) \qquad (6)$$

// Data preprocessing phase
1 initialize $P$ and $Q$ with random values;
// Training phase
2 foreach $t \leftarrow 1$ to $T$ do
3       foreach $r_{u,v} \in R$ do
4             $e_{u,v} \leftarrow r_{u,v} - p_u \cdot q_v$;
5             $p_u \leftarrow p_u + \gamma (e_{u,v} \, q_v^\top - \lambda_P \, p_u)$;
6             $q_v \leftarrow q_v + \gamma (e_{u,v} \, p_u^\top - \lambda_Q \, q_v)$;
// Data post-processing phase
7 store $P$ and $Q$;
Algorithm 1 SGD

We present the pseudocode of SGD in Algorithm 1. There are several input parameters: $R$ is a sparse rating matrix stored in the form of triadic tuples, $k$ is the number of latent factors, $\gamma$ is the learning rate, $\lambda_P$ and $\lambda_Q$ are the regularization parameters, and $T$ is the number of iterations. Algorithm 1 outputs two feature matrices $P$ and $Q$. The algorithm consists of three phases: data preprocessing, SGD training, and data post-processing. The data preprocessing phase initializes the two resulting matrices $P$ and $Q$ with randomly generated values. In the training phase, in each iteration (lines 3-6), every element is picked to decrease the value of the loss function by using Equation 6. The SGD training terminates when the given number of iterations is reached (line 2) or the model converges. The feature matrices are stored after training.
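The three phases can be sketched in a few lines of plain Python. This is a minimal serial sketch consistent with the update rule above, not the paper's implementation: the variable names are ours, and a list of (u, v, r) triples stands in for the triadic-tuple store.

```python
import random

def sgd_mf(ratings, m, n, k=8, lr=0.03, reg=0.02, iters=100, seed=0):
    """Serial SGD for MF. `ratings` is a list of (u, v, r) triples."""
    rnd = random.Random(seed)
    # Data preprocessing: initialize P (m x k) and Q (n x k) randomly.
    P = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
    Q = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    for _ in range(iters):                 # training phase
        for u, v, r in ratings:
            est = sum(P[u][f] * Q[v][f] for f in range(k))
            err = r - est                  # e_{u,v}
            for f in range(k):
                pu, qv = P[u][f], Q[v][f]  # read before either update
                P[u][f] = pu + lr * (err * qv - reg * pu)
                Q[v][f] = qv + lr * (err * pu - reg * qv)
    return P, Q                            # post-processing: return/store
```

In practice the visit order of the elements is randomized between iterations; the sketch keeps the given order for brevity.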

Problem Definition. In this paper, we aim to develop an efficient SGD-based MF algorithm in heterogeneous CPU-GPU systems.


Our research mainly focuses on the scheduling strategy for the task division and assignment between CPUs and GPUs. Our proposed techniques do not closely depend on any specific CPU or GPU SGD-based MF algorithm.

II-C Characteristics of GPUs

We introduce several basic architecture characteristics of GPUs. More details can be found in [jia2018dissecting].

Execution Model. A GPU manages thousands of threads in a three-layer hierarchy, which includes grids, blocks, and warps. A block contains dozens of warps, and multiple blocks are organized into a grid. A warp is the smallest execution unit from the view of the hardware and contains 32 consecutive threads. Instructions of a GPU program (also called a kernel) are executed in a manner called SIMT (Single Instruction, Multiple Threads): all threads in a warp must execute the same instruction on their own data at the same time, while the states of threads in a warp can differ (active or inactive) so that they can go through different execution paths. Therefore, GPU threads are not as flexible as CPU threads.

Memory Hierarchy. A GPU has a complicated memory hierarchy; we introduce several commonly used concepts. Global memory is the largest memory, and its access speed is the slowest. Data in global memory can be accessed by all threads. Shared memory is on-chip and a relatively scarce resource. It can be accessed by threads from the same thread block, since the GPU guarantees that threads in the same block run on the same processor. Registers are used to store local variables in a kernel, and their access speed is the fastest. A register is directly accessible only by the thread it belongs to, although threads can read local variables of other threads in the same warp (e.g., via warp shuffle).

III Related Works

In this section, we introduce the details of FPSGD [zhuang2013fast] and CuMF_SGD [xie2017cumf_sgd], which are related to our problem. Several other related works are briefly summarized.

III-A Multi-Core MF Algorithms

There are several algorithms extending the idea of SGD to shared-memory (multi-core CPU) systems. Hogwild [recht2011hogwild] adopts a lock-free update strategy to compute MF in parallel. DSGD [gemulla2011large] performs a matrix-blocking partition to avoid conflicts, and several variants of DSGD [zhuang2013fast, oh2015fast, li2013sparkler] follow this property. FPSGD [zhuang2013fast] is the state-of-the-art method of this type. We introduce more details of FPSGD as follows.

FPSGD. FPSGD uniformly divides the rating matrix into a set of sub-matrices (called blocks). Two blocks are independent of each other if they share neither any common column nor any common row of the rating matrix. Otherwise, we say they conflict. FPSGD applies the SGD algorithm to a set of independent blocks simultaneously and repeats this step until the number of processed blocks reaches a predefined value. In each iteration, all selected blocks must be independent since two blocks in the same row (resp. column) update the same area of (resp. ) and cannot be processed in parallel.

Fig. 2: An example of independent and conflicting blocks in a grid-partitioned matrix

We explain FPSGD using Figure 2. The rating matrix is divided into a grid of blocks $B_{i,j}$, where $i$ and $j$ index the block row and block column. The computation on $B_{1,1}$ updates the vectors in $P_1$ and $Q_1$, and the computation on $B_{2,2}$ updates the vectors in $P_2$ and $Q_2$. They update vectors in different regions, which means that the computations on $B_{1,1}$ and $B_{2,2}$ can be performed at the same time; that is, $B_{1,1}$ and $B_{2,2}$ are mutually independent. However, $B_{1,1}$ and $B_{1,2}$ are not independent of each other, since they both update vectors in $P_1$, and this leads to a conflict.
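The independence test reduces to comparing block coordinates. A sketch, using our own (row, column) coordinate notation for blocks:

```python
def independent(block_a, block_b):
    """Two blocks (row_idx, col_idx) are independent iff they share
    neither a block row nor a block column of the rating matrix."""
    return block_a[0] != block_b[0] and block_a[1] != block_b[1]

# Blocks on the main diagonal are pairwise independent ...
assert independent((0, 0), (1, 1))
# ... while blocks in the same block row conflict: they would update
# the same rows of P concurrently.
assert not independent((0, 0), (0, 1))
```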

III-B GPU Algorithms

GPUSGD [jin2016gpusgd] proposes a matrix-blocking-based MF solution on GPUs. BIDMach [canny2013bidmach] supports SGD-based MF and uses GPUs as accelerators. CuMF_SGD [xie2017cumf_sgd] is the state-of-the-art.

CuMF_SGD. CuMF_SGD designs a high-performance GPU kernel by fully exploiting GPU hardware characteristics, including warp shuffle, memory coalescing, register usage, and half precision. CuMF_SGD fully utilizes the CPU-GPU memory transfer bandwidth by overlapping memory transfers with computation. Specifically, when scheduling blocks, the algorithm randomly selects multiple consecutive blocks at a time, instead of only one independent block, for the GPU. This strategy reduces the overhead of CPU-GPU memory transfers. CuMF_SGD uses CUDA streams to realize this strategy. A stream contains a list of GPU commands that are executed serially, and commands in different streams are executed in parallel if hardware resources permit. The CuMF_SGD kernel uses three streams to manage the CPU-to-GPU memory transfer, the GPU kernel execution, and the GPU-to-CPU memory transfer, respectively.

III-C Other Related Works

Other SGD-based MF Algorithms. MF has been extensively investigated in the literature, and a large number of algorithms have been proposed for this problem. In addition to the algorithms on multi-core and GPU architectures mentioned before, parallel SGD solutions have been discussed for distributed systems [li2013sparkler, yun2014nomad] and parameter servers [zhong2016scaling].

Other MF Algorithms. In addition to SGD-based algorithms, other methods such as Coordinate Descent (CD) [yu2012scalable] and Alternating Least Squares (ALS) [koren2009matrix] have been proposed to compute MF. They use different update rules in each iteration. Specifically, ALS [koren2009matrix] updates one of the result matrices at a time by fixing the other; it then updates the other result matrix in the same way, and an iteration is completed when both result matrices have been updated. CD [yu2012scalable] updates one element of a result matrix at a time by fixing all other elements in the two result matrices; an iteration is completed when all elements in the result matrices have been updated by the same update rule.

IV Our Approach

In this section, we first give a straightforward method and show its drawbacks. Then, we briefly introduce scheduling algorithms for heterogeneous systems and propose our method to balance workloads. Finally, an overview of our improved algorithm is provided.

IV-A A Straightforward Method

A straightforward idea to utilize both CPU and GPU resources is to treat a GPU kernel as a worker thread. Based on this idea, we can adapt FPSGD [zhuang2013fast] by regarding a GPU as an additional worker thread. To avoid conflicts and obtain good training quality, a worker thread receives a new block satisfying the criteria in Section III-A once it finishes processing a block. We apply the FPSGD and CuMF_SGD algorithms to process a matrix block on a CPU thread and a GPU, respectively. We call this straightforward algorithm HSGD.

Let $s_c$ and $s_g$ be the number of CPU threads and GPUs, respectively. Following the matrix-division rule in [zhuang2013fast], the number of blocks should be at least $(s+1) \times (s+1)$ for $s$ worker threads. We instantiate this formula in our setting and give a precise matrix-division rule.

Given an input rating matrix $R$, $s_c$ CPU threads, and $s_g$ GPUs, $R$ should be divided into at least $(s_c + s_g + 1) \times (s_c + s_g + 1)$ blocks.

We explain the rationale of this rule. When the number of blocks is less than $(s_c + s_g + 1) \times (s_c + s_g + 1)$, a thread may always be assigned the same block. As a result, only several specific blocks are updated during the algorithm. Obviously, this will lead to a terrible training result. Even worse, the algorithm cannot fully exploit all worker threads. For example, in Figure 2, assume that there are $s_c + s_g$ worker threads. One gray block is assigned to thread 1, and the rest of the gray blocks are assigned to the other threads. When thread 1 finishes processing its block, the rest of the gray blocks may still be occupied. To avoid conflicts, thread 1 has to continually process the same block. By contrast, when the block number increases to $(s_c + s_g + 1) \times (s_c + s_g + 1)$, thread 1 can always locate a spare row or column which is not occupied by other blocks.
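Assuming the division rule requires an $(s_c + s_g + 1) \times (s_c + s_g + 1)$ grid, a small sketch shows why a releasing worker always finds a conflict-free block. The block-locking model (an occupied block locks its whole block row and block column) follows the independence criterion above; the function names are ours.

```python
def min_grid_side(num_cpu_threads, num_gpus):
    # Division rule: at least (s_c + s_g + 1) x (s_c + s_g + 1) blocks.
    return num_cpu_threads + num_gpus + 1

def free_blocks(grid_side, occupied):
    """Blocks that conflict with no currently occupied block."""
    busy_rows = {r for r, _ in occupied}
    busy_cols = {c for _, c in occupied}
    return [(r, c) for r in range(grid_side) for c in range(grid_side)
            if r not in busy_rows and c not in busy_cols]

# With s_c + s_g = 3 workers and a 4 x 4 grid, a worker that releases
# its block still finds candidates even while the other two are busy.
side = min_grid_side(2, 1)                  # 4
candidates = free_blocks(side, [(0, 0), (1, 1)])
```

With only a $2 \times 2$ grid and the same two busy workers, `free_blocks` returns an empty list: the releasing worker is forced back onto its old block, which is exactly the degenerate behavior the rule prevents.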

IV-B Motivation

Even though HSGD successfully combines CPU and GPU resources, we make several observations that help (i) improve the working efficiency of GPUs and (ii) balance workloads across the different hardware.

Exploiting Hardware Characteristics. GPUs and CPUs have different features. We have two observations as follows.


In the context of MF, small blocks cannot saturate the GPU computing power.

Fig. 3: Processing speed of (a) GPUs and (b) CPUs on blocks with different sizes

To study the relationship between the block size and the efficiency of the GPU, we launch a GPU kernel with the default configuration to process blocks of different sizes. The details of the configuration and hardware can be found in Section VII. The GPU throughput is reported in Figure 3(a), where the x-axis represents the number of elements in a block and the y-axis represents the average number of elements processed per second.

The throughput significantly increases while the block size is relatively small. Afterward, the upward trend becomes gentle as the block size continues to grow. This phenomenon may be due to two main reasons. First, we need to transfer data via the PCIe bus to the global memory of the GPU when launching a GPU kernel, and a small block cannot fully utilize the bandwidth of the bus. Second, more data can better utilize all threads and the cache mechanism of the GPU. More detailed throughput measurements of the data transfer and the GPU kernel execution support this explanation and will be shown in Section V-B.


In the context of MF, the computing power of CPU cores is not sensitive to the block size.

The average number of elements processed per second by a CPU thread is shown in Figure 3(b). Unlike the GPU, the throughput of a CPU thread remains stable as the block size varies. This is because the worker threads of a CPU are relatively more independent than those of a GPU, and the computing capability of a single CPU thread is not as powerful as that of the whole GPU device.

Nonuniform Matrix Division. Based on the above two observations, an immediate idea to improve algorithmic efficiency is to make the block size as large as possible. However, as mentioned in Section IV-A, the input matrix should be divided into at least $(s_c + s_g + 1) \times (s_c + s_g + 1)$ blocks, and the block size under this division strategy is still relatively small. For example, we use 16 CPU threads and a GPU in the default configuration of our experiments. Considering the real-world dataset Yahoo!Music, we divide it into at least $18 \times 18 = 324$ blocks. Consequently, the number of elements in each block is less than one million, which is still not large enough to saturate the GPU computing power in light of Figure 3(a).

To improve the working efficiency of GPUs, we divide the rating matrix into blocks of different sizes: the large ones are assigned to GPUs, and the small ones are assigned to CPUs. Toward this end, several issues need to be addressed; for example, how to set suitable block sizes for both CPUs and GPUs, and how to divide the rating matrix in practice. We answer these questions in the following sections, and the final matrix division strategy is given in Section VI.

Workload Balance. As shown above, we should make the blocks assigned to GPUs as large as possible to improve the working efficiency of GPUs. However, an extremely nonuniform division strategy may cause a serious workload imbalance, which can remarkably reduce the overall efficiency when combining the two hardware resources. To prevent this issue, a scheduling strategy should be considered. Many efforts have been made to balance workloads for applications on heterogeneous systems. They can be categorized into three kinds: (1) dynamic methods, (2) static methods by classifier, and (3) static methods by cost model. Here, we briefly review some representative methods among them and elaborate on why they cannot be straightforwardly used for our problem. Then, we propose our own method. For more details on scheduling strategies in heterogeneous systems, interested readers can refer to the surveys [pradeep2018review, mittal2015survey].

(1) Dynamic Methods. Dynamic methods assign a new task to a computational device according to the performance of the devices on previous tasks [rossi2016leveraging, jimenez2009predictive, leis2014morsel, belviranli2013dynamic, boyer2013load, choi2013efficient, wang2013cap, kaleem2014adaptive]. For example, [rossi2016leveraging] maintains a double-ended queue: the CPU processes tasks from the front of the queue while the GPU simultaneously processes tasks from the other end. The algorithm naturally enables the dominant hardware resource to handle more tasks. The baseline algorithm in Section IV-A essentially adopts a dynamic strategy: if a CPU core or a GPU finishes updating a block, it is assigned a new block that does not incur any conflict. Dynamic methods perform well on other problems when there is no rule constraining task assignment.

However, this type of method does not work well for our problem due to the independence requirement of the task assignment. As discussed in Section IV-A, we first equally divide the matrix into a set of blocks, which is necessary to avoid conflicts. GPUs cannot achieve ideal performance in this setting according to Observation IV-B. In addition to the problem of GPU working efficiency, the dynamic update method in HSGD may suffer from poor training results. We give an example to explain this problem.

Fig. 4: A running example of the straightforward algorithm given 2 CPU cores and 1 GPU

We consider computing the matrix factorization of a rating matrix $R$ on a machine with two CPU threads ($c_1$ and $c_2$) and one GPU ($g$). Assume that we divide $R$ into $4 \times 4$ matrix blocks, which is shown in Figure 4. In the beginning, the two CPU threads and the GPU get blocks $B_{1,1}$, $B_{2,2}$, and $B_{3,3}$, respectively. Normally, the computing power of a GPU is much stronger than that of a CPU thread. As a result, when $g$ has completed its task $B_{3,3}$, $c_1$ and $c_2$ are still working on their blocks $B_{1,1}$ and $B_{2,2}$. According to the scheduling policy of HSGD, $g$ will apply for a new block which is independent of $B_{1,1}$ and $B_{2,2}$ and has the least number of updates. Block $B_{4,4}$ satisfies these conditions, so $g$ picks $B_{4,4}$ in the second step of Figure 4. In the following steps, $g$ will continually update the blocks in the lower-right corner, since the rows and columns of $B_{1,1}$ and $B_{2,2}$ are always occupied by $c_1$ and $c_2$. This phenomenon makes the numbers of updates for different blocks severely unbalanced, which is demonstrated in the last matrix: the number of updates for a block is relatively large if the corresponding color is dark. Compared with the original SGD algorithm, which updates matrix elements randomly, the process in this example leads to a weak training result.
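The skew in this example can be reproduced with a toy event simulation. The setup below (grid size, which blocks the slow workers hold, the scheduling rule) is our own invented illustration, not the paper's experiment: a fast worker finishes a block every step and grabs the least-updated conflict-free block, while two slow workers hold their blocks throughout.

```python
def simulate(steps, grid=4):
    """Toy run: two slow CPU workers pin blocks (0,0) and (1,1); one
    fast GPU worker repeatedly takes the least-updated block that is
    independent of every busy block, and finishes it each step."""
    updates = {(r, c): 0 for r in range(grid) for c in range(grid)}
    busy = [(0, 0), (1, 1)]          # blocks held by the CPU workers
    for _ in range(steps):
        busy_rows = {b[0] for b in busy}
        busy_cols = {b[1] for b in busy}
        candidates = [blk for blk in updates
                      if blk[0] not in busy_rows and blk[1] not in busy_cols]
        block = min(candidates, key=lambda b: updates[b])
        updates[block] += 1
    return updates

counts = simulate(100)
```

All 100 updates by the fast worker land on the four lower-right blocks; the remaining blocks it could serve never get touched, mirroring the dark corner in the last matrix of Figure 4.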

(2) Static Methods by Classifier. Static methods provide scheduling decisions for worker threads of different devices before applications start. This type of method establishes a classifier based on training datasets in the offline phase [grewe2011static, kofler2013automatic, wen2014smart, ghose2016divergence]. Given a new task, the classifier identifies the class of the task and applies a corresponding strategy of task assignment derived from the offline phase.

There are several drawbacks to applying classifier-based methods in our problem. First, it is complicated and time-consuming to generate training data, including static code features and optimal partition strategies. Moreover, these methods usually rely on specific frameworks such as OpenCL [stone2010opencl] and Insieme [insieme] to obtain static code features, which makes them less general. Second, they are usually designed for multi-task platforms; consequently, the partition scheme they generate is coarse-grained.

Fig. 5: Overview of HSGD*

(3) Static Methods by Cost Model. This type of method estimates the CPU's and GPU's execution times by establishing a cost model [luk2009qilin, lee2015orchestrating]. A cost model is a function revealing the relationship between the input data size and the corresponding execution time of a specific worker thread. A representative approach is Qilin [luk2009qilin]. It divides a training dataset into two parts, which are assigned to the CPUs and GPUs, respectively. The CPU part is further divided into subparts, each of which is processed by a CPU thread, and the corresponding execution time is recorded. A similar operation is applied to the GPU part. Qilin then uses curve fitting to construct two linear equations as projections of the execution times for CPUs and GPUs, respectively.

In the context of our problem, the simple linear function in [luk2009qilin] can hardly estimate the execution time of GPUs accurately. Recall Figure 3(a): the throughput curve is not a horizontal line, which shows that the execution time does not increase linearly as the number of elements in a block grows.

Our Method to Balance Workloads. We propose a hybrid method for our problem. First, we customize a cost model, specialized for the MF problem, to divide the matrix; in particular, we improve the accuracy of the GPU cost model. The details of the cost models are given in Section V. Second, we use a dynamic scheduling mechanism that allows the dominant resource to apply work stealing. This alleviates the deviation between the cost model and the practical performance.

IV-C The Framework

In this section, we give an overview of our improved algorithm, called HSGD*, in Figure 5. Our method contains an offline preprocessing phase and an online processing phase. The offline phase (the gray area in Figure 5) derives a cost model that estimates the hardware performance. This step needs to be performed only once on a machine, and the corresponding parameters are stored to support the processing of any input rating matrix in the online phase.

In the online phase, a sparse rating matrix $R$ is given. HSGD* first divides the matrix into two parts, denoted as $R_c$ and $R_g$, based on the cost models of CPUs and GPUs from the offline phase. Then, $R_c$ and $R_g$ are further divided into several blocks. Finally, the scheduler assigns blocks to worker threads. The calculation process continues until the number of iterations reaches the predefined value. During most of the period when HSGD* runs, CPU threads are only allowed to process blocks in $R_c$, and GPUs are only allowed to process blocks in $R_g$. Similar to HSGD, the block assignment avoids conflicts in the same row or column. We also have a dynamic scheduling strategy to balance workloads in practice, which is not reflected in Figure 5; the details will be shown in Section VI. A formal pseudocode of the framework HSGD* is reported in Algorithm 2, which is self-explanatory.

// Offline Phase
1 generate the cost models of both CPUs and GPUs;
// Online Query Processing
2 partition $R$ into $R_c$ and $R_g$ according to the cost models;
3 preprocess data;
4 initialize $P$ and $Q$;
5 the scheduler assigns blocks to CPU threads and GPUs until the number of iterations reaches $T$;
6 return $P$ and $Q$;
Algorithm 2 HSGD*

V Our Cost Model

Given an input matrix $R$, let $\alpha$ and $1-\alpha$ be the proportions of the workload assigned to GPUs and CPUs, respectively, where $0 \le \alpha \le 1$. We use $T_g(\alpha \lvert R \rvert)$ and $T_c((1-\alpha) \lvert R \rvert)$ to denote the time spent on updating the corresponding elements by a GPU and a CPU thread, respectively. When the context is clear, $T_g$ and $T_c$ are used for short. The total running time of our algorithm is represented below.

$$T(\alpha) = \max\left( T_g(\alpha \lvert R \rvert),\; T_c((1-\alpha) \lvert R \rvert) \right)$$

Note that both $T_g$ and $T_c$ are monotonic. Based on the computed cost functions, the total running time is minimized when the load between the resources is balanced. We set $\alpha$ using the following equation.

$$T_g(\alpha \lvert R \rvert) = T_c((1-\alpha) \lvert R \rvert)$$
Based on the discussion above, our aim is to derive cost functions F_g and F_c for a GPU and a CPU thread, respectively, where F_g (resp. F_c) denotes the estimation of T_g (resp. T_c). As discussed earlier, a straightforward way to establish a cost model is to follow the works [luk2009qilin, lee2015orchestrating], which assume that execution time is linearly related to the size of the input matrix. However, based on Observation IV-B, we find that the processing speed of GPUs increases as the block size increases in our problem, which makes linear regression inaccurate for the GPU cost model. Moreover, the kernel execution on GPUs and the data transfer are not completely serial due to the CUDA stream mechanism, so the total GPU running time is not a simple sum of the kernel execution time and the data transfer time. This observation makes us reconsider how kernel execution and data transfer exactly influence the total running time of GPUs, which is not discussed in previous cost models.
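Because both cost functions are monotonically increasing in the amount of assigned work, the balancing proportion α can be found numerically once they are fitted. A minimal sketch, using hypothetical linear cost functions in place of the fitted models:

```python
def balance_split(f_g, f_c, total, tol=1e-6):
    """Bisection for alpha with f_g(alpha*total) == f_c((1-alpha)*total).

    f_g and f_c must be increasing in the amount of assigned work.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f_g(mid * total) < f_c((1.0 - mid) * total):
            lo = mid          # GPU would finish first: give it more work
        else:
            hi = mid
    return (lo + hi) / 2.0

# hypothetical speeds: GPU 5 units/ms, a CPU thread 1 unit/ms
alpha = balance_split(lambda x: x / 5.0, lambda x: x / 1.0, total=1000.0)
```

With these illustrative speeds the GPU ends up with 5/6 of the work, matching the intuition that the faster device should take proportionally more.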

In the rest of this section, we introduce the strategy to prepare the training data in Section V-A and propose our cost model of GPUs in Section V-B. For the cost model of CPUs, we use a linear function to estimate the performance, similar to [luk2009qilin]. A formal pseudocode to compute the cost models is given in Algorithm 3. We explain each step as follows.

Output: F_c and F_g
// S is an array of segments
// D_c and D_g are arrays of (size, time) records
// Data preprocessing phase
1 shuffle the input dataset and divide it equally into segments S_1, ..., S_s, stored in S;
// Training cost models
2 run the single-thread CPU kernel on S_1, S_1∪S_2, ..., S_1∪⋯∪S_s and record (size, time) pairs into D_c;
3 fit F_c on D_c by linear regression;
4 measure the CPU-to-GPU and GPU-to-CPU transfer times on varying data sizes;
5 fit the transfer models T_h2d and T_d2h;
6 measure the GPU kernel execution time on varying block sizes and fit the kernel model T_k;
7 combine T_h2d, T_d2h, and T_k into F_g;
Algorithm 3 Cost Estimation

V-A Data Preparation and Training for CPUs

To derive the training data, we shuffle the input dataset to avoid an uneven data distribution. After the data is shuffled, we equally divide the input dataset into s disjoint parts S_1, ..., S_s, stored in array S at line 1 of Algorithm 3. Then, the CPU execution kernel, configured with a single thread, is launched at line 2 to compute on the datasets generated in the last step. Instead of computing on S_1, ..., S_s separately as in [luk2009qilin], the CPU execution kernel computes on S_1, S_1∪S_2, ..., S_1∪⋯∪S_s respectively, and the corresponding execution time is recorded. After this, we obtain an array of data sizes and corresponding execution times. This array is used as the training set for curve fitting for CPUs. Our adaptation generates a wider range of training data, which better reflects the relationship between data size and execution time. To eliminate noise, the execution time in the training data is derived from the average of multiple tests.
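The prefix-based sample generation and the least-squares fit can be sketched as follows; `run_kernel` stands in for the timed single-thread CPU kernel and is a placeholder, and the closed-form ordinary-least-squares fit mirrors the linear CPU model described above.

```python
def cpu_cost_samples(data, s, run_kernel):
    """Time the kernel on prefixes S1, S1∪S2, ..., S1∪...∪Ss."""
    step = len(data) // s
    samples = []
    for i in range(1, s + 1):
        prefix = data[: i * step]                 # growing prefix of segments
        samples.append((len(prefix), run_kernel(prefix)))
    return samples

def fit_linear(samples):
    """Ordinary least squares for time ≈ a * size + b."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b
```

In practice each timing would be averaged over multiple runs, as noted above, before the fit.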

V-B Estimating Working Efficiency of GPUs

Given a task, the total processing time by a GPU is spent on two parts: (1) data transfer between the CPU and the GPU via the PCI-e bus, and (2) GPU execution kernels.

Data Transfer. Data transfer has two directions — from CPU to GPU (sometimes called Host to Device) and from GPU to CPU. We denote the times spent on them by T_h2d and T_d2h, respectively. We only discuss the process to model the data transfer from CPU to GPU; the model for the other direction is similar.

(a) CPU to GPU
(b) GPU to CPU
Fig. 6: Transfer speed varies with block size

Figure 6 reveals the transfer speed for data of different sizes. The transfer speed grows very fast in the beginning. After the data size exceeds a threshold, the transfer speed remains stable. Based on this phenomenon, we use a square-root speed function to fit the curve when the dataset is not very large, and then use linear regression to model the rest. A formal model is expressed as follows, where x denotes the size of data transferred from CPU to GPU and δ denotes the threshold.

T_h2d(x) = x / (a√x + b) if x < δ;  T_h2d(x) = c·x + d if x ≥ δ

The rationale for the first case is that the data transfer time can be represented by the quotient of the data size and the transfer speed. According to Figure 6, we use the function a√x + b to model the speed curve of the first stage. We select this function since the trend behaves like the inverse function of a parabola. Note that the label distribution on the x-axis is not linear: the logarithmic scale makes the trend clearly visible even when the data size is small. Obviously, this phenomenon shows that we cannot fully utilize the bandwidth of the PCI-e bus if the data is not large enough, which supports Observation IV-B. Then, we determine the threshold δ by the extent of the transfer speed variation. Empirically, when the variation of the transfer speed is less than 2% in a time unit, we consider the transfer speed to be stable. Finally, we fit the curve as a linear function for the second stage when x ≥ δ. The curve fitting can be done using the least squares method.
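The two-stage model can be packaged as a single predictor. In the sketch below, the √x speed curve is just one plausible reading of the "inverse parabola" stage, and all coefficients (a, b, c, d, delta) are illustrative, not fitted values.

```python
import math

def make_transfer_model(a, b, c, d, delta):
    """Two-stage transfer-time model.

    Below the threshold delta, speed ramps up like a*sqrt(x) + b
    (the slow-start stage); above it, time is linear: c*x + d.
    """
    def predict(x):
        if x < delta:
            return x / (a * math.sqrt(x) + b)   # time = size / speed
        return c * x + d
    return predict
```

The same shape, with a logarithmic speed curve in the first stage, would serve for the kernel model discussed next.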

Fig. 7: Kernel execution time by varying data size

GPU Execution Kernel. We now design the cost model of the GPU execution kernel. Similar to Figure 6, the throughput of updating remains stable after the block size reaches a threshold δ′, which means that the computing power of the GPU is saturated. To fit the curves, we use a logarithmic speed function when the dataset is not very large, and then use linear regression to model the rest. The growth trend of the logarithmic function is slower than that of a power function (e.g., √x), which is more consistent with the trend in Figure 7. This is why we choose it to model the curve of the first stage. A formal model is expressed as follows, where T_k(x) denotes the time spent on x elements by the GPU execution kernel and the function a′ ln x + b′ represents the processing speed of the GPU execution kernel.

T_k(x) = x / (a′ ln x + b′) if x < δ′;  T_k(x) = c′·x + d′ if x ≥ δ′

Overall GPU Cost Model. We cannot simply sum the time of the kernel execution and the data transfer as our estimation for the overall execution time of the GPU, since these two parts are not absolutely serial. Specifically, to improve the overall working efficiency of GPUs, we adopt a widely used optimization based on the CUDA stream mechanism. A CUDA stream contains a list of GPU commands executed in serial, and commands in different streams are executed in parallel if hardware resources permit. At the same time, commands in different streams can be synchronized. This mechanism allows us to perform data transfer and kernel execution in parallel without breaking correctness. We explain this idea in the following example.

Fig. 8: Data transfer optimization

As shown in Figure 8, we use three streams to manage the data transfer from CPU to GPU, the kernel execution, and the data transfer from GPU to CPU, respectively. Assume that a block B is assigned to the GPU. Stream 1 transfers B and the corresponding factor-matrix segment to the global memory of the GPU. Then, the GPU kernel scans the block and updates the segment. Simultaneously, stream 1 continuously transfers the next block assigned to the GPU and its corresponding segment. When stream 2 finishes processing B, stream 3 transfers the updated segment back to the CPU.

From this example, the overall time for the GPU is roughly decided by the maximum time spent among the three streams, because it covers the time of the other two parts. Note that although the first transfer and the last copy-back cannot be overlapped by the maximal stream in the figure, their cost can be ignored when the number of transferred and computed blocks is large. Also note that T_d2h is always smaller than T_h2d, since we do not need to transfer the rating blocks back to the CPU, only the updated segments. Therefore, we define the overall cost model of a GPU as follows.

F_g(x) = max(T_h2d(x), T_k(x))

That is, the overall cost model of a GPU depends on the maximum between the data transfer time from CPU to GPU and the execution time of the GPU kernel.
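A rough sketch of this pipelined-streams estimate: with many blocks, the makespan is paced by the slowest stream, and only the first host-to-device transfer plus the final kernel run and copy-back are not overlapped. The per-block times below are hypothetical.

```python
def pipelined_gpu_time(t_h2d, t_kernel, t_d2h, n_blocks):
    """Approximate makespan of n_blocks flowing through three streams.

    Steady state is paced by the slowest stage; the pipeline fill
    (first transfer) and drain (last kernel + copy-back) are added once.
    """
    bottleneck = max(t_h2d, t_kernel, t_d2h)
    return t_h2d + (n_blocks - 1) * bottleneck + t_kernel + t_d2h
```

For large n_blocks the bottleneck term dominates, which is why the cost model above keeps only the maximum of the stream times.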

VI Workload Balance in Practice

VI-A Dynamic Scheduling

Even though we have proposed a tailored GPU cost model for MF, the estimation may still fail to exactly reflect the computing power of the devices on a different dataset, and the workloads of CPUs and GPUs may be unbalanced if we assign blocks purely according to the cost model. To remedy this issue, we adopt a dynamic scheduling strategy that activates when one device finishes its tasks. Specifically, assume that the GPU has finished its tasks. Instead of letting it wait for the tasks being processed by CPUs, we allow GPUs to pick up some blocks originally assigned to CPUs. We call it the static phase when the GPU and the CPU only process their originally assigned tasks, and the dynamic phase when one of them finishes its own tasks and is involved in processing tasks of the other. In the static phase, every GPU is assigned blocks in specific rows so that it can keep updating one segment of one result matrix all the time and avoid transferring this segment repeatedly.
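The static/dynamic phase switch can be illustrated with per-device block queues. The following is a simplified single-step simulation, not the actual multi-threaded scheduler: each device drains its own queue and, once idle, steals remaining blocks from the other; all block sizes and speeds are hypothetical.

```python
from collections import deque

def run_with_stealing(cpu_blocks, gpu_blocks, cpu_speed, gpu_speed):
    """Simulate the static phase (own queue) and dynamic phase (stealing)."""
    queues = {"cpu": deque(cpu_blocks), "gpu": deque(gpu_blocks)}
    speed = {"cpu": cpu_speed, "gpu": gpu_speed}
    clock = {"cpu": 0.0, "gpu": 0.0}
    done = {"cpu": [], "gpu": []}
    while queues["cpu"] or queues["gpu"]:
        dev = min(clock, key=clock.get)                    # device free earliest
        own = queues[dev]
        other = queues["gpu" if dev == "cpu" else "cpu"]
        block = own.popleft() if own else other.popleft()  # steal when idle
        clock[dev] += block / speed[dev]
        done[dev].append(block)
    return done, max(clock.values())
```

With a fast GPU and a slow CPU, the GPU ends up stealing most of the CPU's backlog, shortening the overall makespan exactly as the dynamic phase intends.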

VI-B Putting Things Together

We explain our final strategy for the matrix division in this section. An example is shown in Figure 9.

Fig. 9: The final division strategy

Number of Columns. Based on the cost model proposed in Section V, we first partition the matrix into two sub-matrices M_c and M_g for CPUs and GPUs, respectively; they are marked in white and gray in Figure 9. In light of Rule IV-A, we further divide the matrix into more columns than there are worker threads. This setting guarantees two things. First, GPUs always know not only the current block but also the next block, which enables the overlap between computation and data transfer in Figure 8. Second, there always exists a spare column when a GPU kernel or a CPU thread finishes processing its block.

Number of Rows for CPUs. Let s be the number of CPU threads and g the number of GPUs. As shown in Rule IV-A, we could divide the input matrix into rows such that there are s rows in M_c and g rows in M_g. However, this division strategy causes a problem when dynamic scheduling is activated. Specifically, assume that GPUs have finished their own tasks and are involved in processing blocks of CPUs. At that moment, we have s + g (CPU and GPU) workers operating on the blocks of M_c, which would break Rule IV-A and fail to fully exploit all worker threads. To support the assistance from GPUs, we set the number of rows of M_c to s + g based on Rule IV-A. This setting does not affect CPU efficiency, since the computing power of CPUs is not sensitive to the block size, as shown in Observation IV-B.

Number of Rows for GPUs. Similarly, if CPUs finish their tasks first and apply for blocks in M_g, the number of rows in M_g should be at least s + g (with s CPU threads and g GPUs). However, compared with g rows for GPUs, s + g rows lead to a smaller block size, which cannot saturate the computing power of GPUs according to Observation IV-B. Different from the case of CPUs, the division strategy for GPUs needs to ensure that the block size is large enough in the beginning, while the number of rows is large enough to avoid conflicts once CPUs join. To achieve this, we divide M_g into g rows, and further divide each row into ⌈(s + g)/g⌉ sub-rows. As a result, relatively large blocks spanning ⌈(s + g)/g⌉ sub-rows are assigned to GPUs in the static phase, and smaller blocks of a single sub-row are assigned to GPUs and CPUs in the dynamic phase.


We give an example to explain the division strategy for M_g. Assume that we have 2 GPUs and 4 CPU threads, i.e., g = 2 and s = 4. We divide M_g into 2 rows, and each row is further divided into 3 sub-rows. In the static phase, we assign a block spanning 3 sub-rows to a GPU, and in the dynamic phase, we assign a block spanning a single sub-row to a GPU or a CPU thread. On the other hand, M_c is divided into columns as described above and 6 rows. This division of M_c does not change during the algorithm.
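Under our reading of the division rules (M_c gets s + g rows; M_g gets g coarse rows, each split into ⌈(s + g)/g⌉ sub-rows), the row counts can be computed as follows; the constants here are assumptions reconstructed from the text above, not a verbatim transcription of the paper's formulas.

```python
import math

def division_plan(s, g):
    """Row counts for s CPU threads and g GPUs (assumed rules).

    M_c: s + g rows, so CPUs plus joining GPUs avoid row conflicts.
    M_g: g coarse rows for the static phase, each split into
         ceil((s + g) / g) sub-rows for the dynamic phase.
    """
    mc_rows = s + g
    mg_coarse_rows = g
    sub_rows_per_coarse = math.ceil((s + g) / g)
    return mc_rows, mg_coarse_rows, mg_coarse_rows * sub_rows_per_coarse

# the example above: 4 CPU threads and 2 GPUs
```

For s = 4 and g = 2 this yields 6 rows in M_c, 2 coarse GPU rows, and 6 sub-rows in M_g, matching the worked example.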

VII Experiments

We conduct extensive experiments to show the efficiency and the effectiveness of our proposed algorithms. Algorithms appearing in our experiments are summarized as follows.

  • CPU-Only: Only CPU works. We uniformly divide the matrix and use the strategy in [zhuang2013fast] to assign blocks. More details can be found in Section III-A. We use AVX and OpenMP for acceleration.

  • GPU-Only: Only GPU works. We vary the number of rows and columns for the matrix division and adopt the best one. The "-O3" optimization flag is supported.

  • HSGD: CPU and GPU work in parallel. The algorithm is introduced in Section IV-A. AVX, OpenMP, and "-O3" optimization flag are supported.

  • HSGD*: CPU and GPU work in parallel. Nonuniform matrix division and dynamic strategy are used. Our cost model decides the size of blocks assigned to two hardware resources. AVX, OpenMP, and "-O3" optimization flag are supported.

The stochastic gradient methods and the parameter k used in [zhuang2013fast] and [xie2017cumf_sgd] are different. To correctly combine the two methods on heterogeneous CPU-GPU systems, we embed the core parts of LIBMF (https://github.com/cjlin1/libmf) and CuMF_SGD (https://github.com/cuMF/cumf_sgd) into our code and make minor modifications so that the stochastic gradient methods they use are consistent. For the stochastic gradient method, we choose the more concise one in [zhuang2013fast]. For the value of k, we choose the larger one in [xie2017cumf_sgd], because a large value leads to a better training result.

Datasets and Parameter Setting. We evaluate the algorithms on four real-world datasets — MovieLens (http://grouplens.org/datasets/movielens/), Netflix (https://www.kaggle.com/netflix-inc/netflix-prize-data), R1 (https://webscope.sandbox.yahoo.com/catalog.php?datatype=r), and Yahoo!Music (https://webscope.sandbox.yahoo.com/catalog.php?datatype=c). Statistics of the datasets are presented in Table I. For reproducibility, we use the original training/test splits in our experiments. More details about each dataset can be found on the corresponding website. We set the parameters following [chin2015learning]; they are also listed in Table I.

Datasets      MovieLens   Netflix      R1           Yahoo!Music
#rows         71567       2649429      1948883      1000990
#columns      65133       17770        1101750      624961
#training     9301274     99072112     104215016    252800275
#test         698780      1408395      11364422     4003960
k             128         128          128          128
λ_P           0.05        0.05         1            1
λ_Q           0.05        0.05         1            1
γ             0.005       0.005        0.005        0.01
TABLE I: Dataset statistics and parameter settings

Experimental Environment. We use a machine with Intel Xeon E5-2687W v3 3.10GHz processors and a Quadro P4000 GPU with 8GB global memory. The number of available CPU cores is 20. The system interface of the GPU is PCI Express 3.0 ×16, with a total bandwidth of 32GB/s. By default, we use 16 CPU threads and 128 GPU parallel workers. Here, we follow the definition of GPU parallel workers in [xie2017cumf_sgd], i.e., the number of elements computed simultaneously in the GPU kernel. All datasets fit in memory in our experiments.

Organization. Section VII-A shows the adaptiveness of our algorithms by varying the computing resources. Section VII-B shows the effectiveness of our algorithm compared with the state-of-the-art competitor. Section VII-C and Section VII-D evaluate our optimization techniques, including the matrix division strategy and workload balance.

VII-A Overall Efficiency

We evaluate the overall efficiency of our final algorithm HSGD*, with CPU-Only and GPU-Only as baselines. We use the Root Mean Square Error (RMSE, https://en.wikipedia.org/wiki/Root-mean-square_deviation) as the loss metric, which is widely used in recommender systems. For each dataset, we terminate all algorithms and record the corresponding running time when the RMSE reaches a predefined value. Given that we use a different stochastic gradient method from [xie2017cumf_sgd] and a different k value from [zhuang2013fast], the predefined loss values they used are not applicable. For a fair comparison, we select values that can be reached by all methods, including HSGD, which suffers from a weak training quality. The predefined loss values are , , , and for MovieLens, Netflix, R1, and Yahoo!Music, respectively. The comparisons between HSGD and HSGD* are shown in Section VII-C.
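For reference, the RMSE over a held-out test set is computed as follows; `predict` stands in for the learned model, e.g. the dot product of the two factor vectors.

```python
import math

def rmse(test_ratings, predict):
    """Root Mean Square Error over (row, col, rating) test triples."""
    se = sum((r - predict(i, j)) ** 2 for i, j, r in test_ratings)
    return math.sqrt(se / len(test_ratings))
```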

VII-A1 Varying GPU Parallel Workers

In this experiment, we evaluate the adaptiveness of our algorithm by varying the number of GPU parallel workers from 32 to 512. The running times for different GPU parallel workers are reported in Figure 10 for the four datasets. Note that the CPU thread number is fixed to the default value of 16.

(a) MovieLens
(b) Netflix
(c) R1
(d) Yahoo!Music
Fig. 10: Varying GPU Threads

As a reference, the running time of CPU-Only is stable across all settings. Initially, GPU-Only is slower than CPU-Only; as we use more GPU workers, the running time of GPU-Only decreases and overtakes that of CPU-Only. The running time of HSGD* is the smallest among all algorithms. For example, in R1, given GPU worker threads, HSGD* takes seconds while CPU-Only and GPU-Only take seconds and seconds respectively. When the worker number increases to 512, HSGD* takes seconds while GPU-Only takes seconds. The decrease in the running time of HSGD* also shows that our algorithm is adaptive to different GPU settings and can fully utilize the increasing computing power of GPUs.

VII-A2 Varying CPU Thread Number

In this experiment, we evaluate the adaptiveness of our algorithm by varying the CPU thread number from 4 to 16. The GPU parallel workers are fixed to the default value of 128. The running times for different CPU thread numbers are reported in Figure 11.

In contrast to Figure 10, the running time of GPU-Only is consistent, and the running time of CPU-Only decreases when we use more CPU threads. HSGD* is the fastest algorithm on all settings and all datasets. For example, in R1 when the CPU thread number is 4, HSGD* takes seconds, while CPU-Only takes seconds, and GPU-Only takes seconds. When the CPU thread number increases to 16, HSGD* takes only seconds while CPU-Only takes seconds. The decrease in the running time of HSGD* shows that our algorithm is adaptive to different CPU settings and can fully utilize the increasing computing power of CPUs.

Figure 11 and Figure 10 show the high efficiency of HSGD*. When the gap between the computing power of CPU and GPU is limited (default setting), HSGD* achieves a 1.4-2.3x speedup over CPU-Only and a 1.4-2.3x speedup over GPU-Only on all datasets. The experiments also show that the overhead cost of HSGD* is minor. When the gap between the computing power of CPU and GPU is large, e.g., CPU uses 16 threads and GPU uses 512 parallel workers in Figure 10, HSGD* still achieves a slight speedup over GPU-Only.

(a) MovieLens
(b) Netflix
(c) R1
(d) Yahoo!Music
Fig. 11: Varying CPU Threads

VII-B Training Quality

In this experiment, we report the loss values (RMSE) of our algorithm HSGD* during the training process to show the effectiveness of our method. The experiment demonstrates that the loss value of our algorithm finally converges to a reasonable value. CPU-Only and GPU-Only are also compared as references.

(a) MovieLens
(b) Netflix
(c) R1
(d) Yahoo!Music
Fig. 12: Test RMSE over training time on four datasets

The results are shown in Figure 12. The downward trends of HSGD* are obvious, and the loss of HSGD* converges in the shortest time. In addition, HSGD* achieves a similar converged loss value compared with other algorithms, which shows the effectiveness of our algorithm. For example, in Yahoo!Music, the loss of HSGD* drops to in 10 seconds. At the same time, the loss values of CPU-Only and GPU-Only are and , respectively. When time increases to 25 seconds, the loss of HSGD*, CPU-Only, and GPU-Only drops to , , and , respectively. Finally, all loss values of HSGD*, CPU-Only, and GPU-Only stay about .

VII-C Matrix Division Strategy

We evaluate the effectiveness of our matrix division strategy in this experiment. For each dataset, we record the loss (RMSE) value on different running times of HSGD* with HSGD as a comparison. The result is shown in Figure 13.

We can see that the training quality of HSGD is poor, especially when processing relatively large datasets. This phenomenon is consistent with the discussion in Example IV-B. The nonuniform matrix division in HSGD* fixes this issue. Given the same running time, HSGD* derives a smaller loss value than HSGD, and the advantage of HSGD* is especially obvious on large datasets. For example, given 50 seconds on R1, the RMSE value of HSGD* reaches , while that of HSGD is only . This result shows that nonuniform matrix division can better utilize GPU resources.

Datasets MovieLens Netflix R1 Yahoo!Music
Workload proportion
HSGD*-Q C 49.56% 55.98% 56.07% 56.46%
G 50.44% 44.02% 43.93% 43.54%
HSGD*-M C 55.91% 49.02% 49.75% 53.61%
G 44.09% 50.98% 50.25% 46.39%
Running time
HSGD*-Q 0.92 s 15.87 s 13.07 s 40.88 s
HSGD*-M 0.89 s 13.02 s 12.08 s 35.41 s
TABLE II: Comparison of cost models

VII-D Workload Balance

We evaluate the effectiveness of techniques used to balance workloads in this section.

VII-D1 Cost Models

To show the effectiveness of our cost models, we report the proportion of workloads derived by our cost models with [luk2009qilin] as a comparison in Table II. To clearly reflect the algorithmic efficiency based on two cost models, we make these two methods run the same number of iterations, which is 20 in this experiment.

In Table II, HSGD*-Q represents the algorithm HSGD* which uses Qilin [luk2009qilin] to evaluate the working efficiency of hardware, and HSGD*-M represents the algorithm HSGD* which uses our model in Section V. Note that, for fairness, neither HSGD*-Q nor HSGD*-M includes the dynamic scheduling strategy in Section VI to further balance workloads. "C" and "G" in the table represent the assigned proportion of workloads to CPUs and GPUs, respectively.

The practical running times of HSGD*-Q and HSGD*-M are also reported. The running time of HSGD*-M is smaller than that of HSGD*-Q on all datasets. This result shows that our cost model derives a more accurate estimation of the working efficiency of the hardware. We find that HSGD*-M prefers to assign more work to the GPU than HSGD*-Q on all datasets except MovieLens. For MovieLens, HSGD*-M observes that the performance of the GPU is weak when processing a small dataset (Observation IV-B) and therefore assigns less work to the GPU. The advantage of HSGD*-M becomes considerable when the dataset is large. Note that given a smaller target loss, both algorithms require more iterations, and the advantage of our method becomes more obvious. For example, to achieve the predefined loss value of Yahoo!Music in Section VII-A, HSGD* needs 46 iterations, more than twice that of this experiment.

(a) MovieLens
(b) Netflix
(c) R1
(d) Yahoo!Music
Fig. 13: Test RMSE over training time
Dataset HSGD*-M HSGD*
MovieLens 0.89 s 0.84 s
Netflix 13.02 s 11.42 s
R1 12.08 s 10.58 s
Yahoo!Music 35.41 s 30.96 s
TABLE III: Effectiveness of dynamic scheduling

VII-D2 Dynamic Scheduling

We evaluate the effectiveness of the dynamic scheduling strategy (Section VI). Similar to the experiment for cost models, we use HSGD*-M to denote our final algorithm without the dynamic scheduling technique to further balance workloads. The running times of HSGD*-M and HSGD* on all datasets are shown in Table III.

The results show that HSGD* is faster than HSGD*-M on all datasets. Note that on MovieLens, which is relatively small, the computing power of the GPU cannot be saturated, which degrades the effectiveness of dynamic scheduling; as a result, the improvement on MovieLens is minor. By combining the new cost models and the dynamic scheduling strategy, our final algorithm achieves a significant improvement in balancing the workloads of MF.

VIII Conclusion

We propose an efficient parallel MF algorithm on heterogeneous CPU-GPU systems. To fully utilize the processing power of GPUs, our approach divides the input matrix nonuniformly and assigns large blocks to GPUs. We design a new cost model to estimate the performance of the computing resources, and we balance workloads by combining the cost models with dynamic scheduling in practice.