I Introduction
As a common technique in machine learning and data mining, matrix factorization (MF) has been widely applied in many areas, such as recommendation system [lian2014geomf, he2016fast], social network analysis [li2015predicting, al2017unveiling], web mining [qiu2018network], word embedding [pennington2014glove], and graph embedding [DBLP:journals/tkde/CuiWPZ19].
Given a sparse matrix with rows and columns, the aim of MF is to decompose into two dense matrices and , which satisfies the following condition:
Here, the dimension is far less than and so that the original sparse data can be represented well by new dense data with low dimensionality.
Stochastic Gradient Descent. The existing literature has proposed three main approaches to solve MF, namely, alternating least squares (ALS), coordinate descent (CD), and stochastic gradient descent (SGD). Among them, SGDbased algorithms have received the most attention, due to their algorithmic simplicity and effectiveness. SGD algorithms execute several iterations to make the training result convergent. The number of iterations can be a parameter specified by users. In each iteration, the gradient of every element in the sparse matrix
is computed and the corresponding vectors in the result matrices are updated. We focus on the SGDbased MF algorithms in this paper and use MF to denote the SGDbased MF whenever there is no ambiguity.
In the literature, several parallel SGD approaches have been proposed on different system settings such as GPUs (e.g., CuMF_SGD [xie2017cumf_sgd]), multicore CPUs (e.g., FPSGD [zhuang2013fast]), and distributed systems (e.g., NOMAD [yun2014nomad]). To parallelize the computation in SGD, the main idea of these methods is to uniformly divide the sparse matrix into a set of blocks. In each iteration, a set of mutually independent blocks are assigned to different working units (e.g., threads, nodes, or GPUs) and updated by these working units. Here, two blocks are independent of each other if they do not share the same row and the same column. This strategy avoids the writing conflict in and since the elements in the same row (resp. column) of will update the same row (resp. column) of (resp. ). Several optimizations are also studied for their specific system settings, and more related works of MF are introduced in Section III.
Nowadays, GPU becomes very popular for running generalpurpose dataparallel applications. Consequently, heterogeneous systems with multicore CPUs and GPUs become more and more promising for many general tasks. The idea of combing CPU and GPU not only speeds up the task execution but also effectively utilizes the computing resources available in heterogeneous systems. By contrast, to meet performance requirements, the CPUonly or GPUonly systems are usually overprovisioned. This leads that their average utilization remains low. For example, after allocating the task to a GPU (i.e., starting the kernel), the CPU stays idle. Similarly, for applications where the GPU memory bandwidth is a major bottleneck, the massive computation resources of GPUs remain underutilized. If the resources that were originally idle in the system are utilized effectively, the running time of the program will be reduced. This has motivated a significant amount of research on heterogeneous computing techniques [luk2009qilin, pandit2014fluidic, stehle2017memory]. To our best knowledge, there is no existing work on the MF task on heterogeneous CPUGPU systems, and it is desirable to develop an efficient MF algorithm for this system setting. In this paper, we mainly study the task division and the scheduling strategy for SGD on heterogeneous CPUGPU systems. Our techniques do not closely depend on any specific algorithm tailored for a single thread of the CPU or a GPU.
Due to the close relation between threads in a GPU, a straightforward method to combine GPUs and CPUs is to consider the GPU as a CPU thread. Then, we can naturally adopt the stateoftheart shared memory algorithm FPSGD [zhuang2013fast] in our problem. We call this method HSGD. Despite the success of HSGD, there are still opportunities for further improvement. First, based on [zhuang2013fast], we derive a matrix division rule such that the number of blocks in the matrix should be at least , where and are the number of CPU cores and GPUs, respectively. We observe that in HSGD, even with the smallest number of blocks, the size of a block derived by the uniform division is not large enough to saturate the processing power of GPUs (Observation IVB). This may limit the overall efficiency of HSGD. Second, the computing power of a GPU is normally much stronger than that of a single CPU thread. Given that blocks sharing the same row or column cannot be processed in parallel, the update frequency of blocks processed by GPU can be extremely high due to the powerful computing power of GPU. This phenomenon may lead to a weak training quality (Example IVB).
Our Approach. To overcome the drawbacks in HSGD, we propose a nonuniform matrix division strategy for the parallel SGD algorithm. Specifically, we first design a novel cost model to evaluate the performance of hardware. Existing works on estimating the working efficiency on heterogeneous CPUGPU systems are neither available in our setting nor accurate enough in the MF problem. Our cost model is tailored for the parallel SGD algorithm and can accurately estimate the performance of GPUs when the block is relatively small. Based on the cost model, we can initially divide the matrix into two parts which are assigned to GPUs and CPUs, respectively. Each part is divided into a set of blocks to avoid conflicts in MF computation. We also adopt a dynamic scheduling policy which allows the dominant computation resource to use workstealing mechanism[blumofe1999scheduling] and process blocks originally assigned to the other resource. This optimization further balances workloads by filling the gap between the estimation by our cost model and the practical performance. When the dynamic scheduling policy is activated, the sparse matrix is divided into more blocks for avoiding conflicts.
Contributions. We summarize main contributions as follows.

A parallel SGD algorithm on heterogeneous CPUGPU systems. To our best knowledge, this is the first work to efficiently tackle Matrix Factorization on heterogeneous CPUGPU systems.

A nonuniform matrix division strategy to improve efficiency. Compared with a straightforward method to uniformly divide the matrix, our division strategy fully utilizes the strong processing power of GPUs.

A novel cost model to estimate the performance of GPUs. We balance workloads using a cost model specifically designed for our problem. The cost model of the GPU part considers both data transfer and kernel execution.

Extensive experiments on four benchmark datasets. The experiments demonstrate that our proposed framework can effectively utilize the resources in the heterogeneous CPUGPU systems and show the superior performance compared to the competitors.
Organization. The rest of the paper is organized as follows. Section II introduces background information including MF and GPU architecture. Section III reviews related works. Section IV discusses the motivation and proposes an overall framework. Section V proposes the details of our cost model. Section VI introduces the dynamic scheduling strategy and the final matrix division strategy. Section VII reports the experimental results, and Section VIII concludes the paper.
Ii Preliminary
Iia Matrix Factorization
We consider a useritem rating matrix where and are the number of rows and the number of columns of the matrix, respectively. For each and , is the rating from the user to the item . is normally sparse since not every element in is explicitly reported. Matrix factorization aims to represent the matrix as a dot product between two dense matrices and , where is the number of latent factors. A mathematical representation of the matrix factorization is shown as follows.
(1) 
In Equation 1, is the th row vector of , and is the th column vector of . Matrix factorization achieves Equation 1
by minimizing the following loss function.
(2) 
In Equation 2, measures the gap between and estimated value . and are used to avoid overfitting. computes the Frobenius norm^{1}^{1}1https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm of a vector. and are regularization coefficients.
Example.
We give an example of the matrix factorization in Figure 1. The rating matrix contains nine ratings for four movies given by four customers. The number of latent factors is . We have which means that gives a rating to . The results of ’s matrix factorization are shown on the right of . is a user preference matrix, and is a movie feature matrix. The vector is the preference of the user , and is the feature of the movie . The estimated value of is , which is close to .
IiB Stochastic Gradient Descent for Matrix Factorization
It is timeconsuming to calculate the loss of the whole matrix when using the loss function of Equation 2, especially when contains billions of items. Several works have been done to minimize Equation 2 and improve the efficiency of matrix factorization [funk2006netflix, koren2009matrix, yu2012scalable]. In this paper, we follow a prevalent algorithm called stochastic gradient descent (SGD) [funk2006netflix] among them, and ideas of several other algorithms are introduced in Section III.
SGD executes several iterations. The number of iterations can be specified by users. Instead of straightforwardly applying the gradient descent to minimize Equation 2 in each iteration, SGD computes the gradient of every element in and updates the corresponding vectors in the result matrices. Consequently, the original loss function in Equation 2 is reduced to the following equation.
(3) 
The gradient of Equation 3 is represented as follows.
(4) 
(5) 
Based on the gradient (Equation 4 and Equation 5), we train the model iteratively, and the value of the loss function decreases until it is convergent. Equation 6 shows this process where is the learning rate.
(6) 
We present the pseudocode of SGD in Algorithm 1. There are several input parameters. is a sparse rating matrix stored in the form of triadic tuple. is the number of latent factors. is the learning rate. is the regularization parameter. is the number of iterations. Algorithm 1 outputs two feature matrices and . The algorithm consists of three phases: data preprocessing, SGD training, and data postprocessing. The data preprocessing phase initializes two resulting matrices and with values generated randomly. In the training phase, in each iteration (line 36), every element is picked to decrease the value of loss function by using Equation 6. The SGD training terminates when the given number of iterations is reached (line 2) or the model converges. The feature matrices are stored after training.
Problem Definition. In this paper, we aim to develop an efficient SGDbased MF algorithm in heterogeneous CPUGPU systems.
Remark.
Our research mainly focuses on the scheduling strategy for the task division and assignment between CPUs and GPUs. Our proposed techniques will not closely depend on any specific GPU or GPU SGDbased MF algorithms.
IiC Characteristics of GPUs
We introduce several basic architecture characteristics of GPUs. More details can be found in [jia2018dissecting].
Execution Model. GPU manages thousands of threads in a threelayer hierarchy, which includes grids, blocks, and warps. A block contains dozens of warps, and multiple blocks are organized into a grid. A warp is the smallest execution unit from the view of hardware and contains 32 consecutive threads. Instructions of a GPU program (also named kernel) are executed in a way called SIMT (Single Instruction Multiple Thread). SIMT means that all threads in a warp must execute the same instruction on their own data at the same time, while the state of threads in a warp can be different (active or inactive) so that they can go through different execution paths. Therefore, GPU threads are not as flexible as CPU threads.
Memory Hierarchy. GPU has a complicated memory hierarchy. We introduce several commonly used concepts. Global memory is the largest memory, whose access speed is the slowest. Data in global memory can be accessed by all threads. Shared memory is onchip and a relatively scarce resource. It can be accessed by threads from the same thread block since GPU promises that threads in the same block are performed on the same processor. Register is used to store local variables in kernel, whose access speed is the fastest. A register is only directly accessed by the thread it belongs to. Threads can access local variables of other threads in the same warp.
Iii Related Works
In this section, we introduce the details of FPSGD [zhuang2013fast] and CuMF_SGD [xie2017cumf_sgd], which are related to our problem. Several other related works are briefly summarized.
Iiia Multi Core MF Algorithms
There are several algorithms extending the ideas of SGD in shared memory (multi CPU cores) systems. Hogwild [recht2011hogwild] adopts a lockfree update strategy to compute MF in parallel. DSGD [gemulla2011large] performs a matrixblocking partition to avoid conflicts. Several variants of DSGD [zhuang2013fast, oh2015fast, li2013sparkler] also follow this property. FPSGD [zhuang2013fast] is the stateoftheart in this type. We introduce more details of FPSGD as follows.
FPSGD. FPSGD uniformly divides the rating matrix into a set of submatrices (called blocks). Two blocks are independent of each other if they share neither any common column nor any common row of the rating matrix. Otherwise, we say they conflict. FPSGD applies the SGD algorithm to a set of independent blocks simultaneously and repeats this step until the number of processed blocks reaches a predefined value. In each iteration, all selected blocks must be independent since two blocks in the same row (resp. column) update the same area of (resp. ) and cannot be processed in parallel.
Example.
We explain FPSGD in Figure 2. The rating matrix is divided into blocks. The computation on updates vectors and , and the computation on updates vectors and . They update vectors in different regions, which means that the computation on and can be performed at the same time. We can see that and are mutually independent. However, and are not independent of each other since they both update vectors in , and this leads to a conflict.
IiiB GPU Algorithms
GPUSGD [jin2016gpusgd] proposes matrix blockingbased MF solution on GPUs. BIDMach [canny2013bidmach] supports SGDbased MF and uses GPUs as accelerators. CuMF_SGD[xie2017cumf_sgd] is the stateoftheart.
CUMF_SGD. CuMF_SGD designs a highperformance GPU kernel by fully exploiting GPU hardware characteristics, which includes warp shuffle, memory coalescing, register usage, and halfprecision. CuMF_SGD fully utilizes the CPUGPU memory transfer bandwidth by simultaneously performing the memory transfer and the computation. Specifically, when scheduling blocks, the algorithm randomly selects multiple consecutive blocks at a time instead of only one independent block for the GPU. The strategy reduces the overhead of CPUGPU memory transfer. CuMF_SGD uses CUDA streams to achieve this strategy. A stream contains a list of GPU commands that are executed in serial. Commands in different streams are executed in parallel if hardware resources permit. CuMF_SGD kernel uses three streams to manage the CPUGPU memory transfer, the GPU kernel, and the GPUCPU memory transfer, respectively.
IiiC Other Related Works
Other SGDbased MF Algorithms. MF has been extensively investigated in the literature and a large number of algorithms were proposed for this problem. In addition to algorithms on multicore and GPU architecture mentioned before, Parallel SGD solutions have been discussed in distributed systems [li2013sparkler] [yun2014nomad] and parameter server [zhong2016scaling].
Other MF Algorithms. In addition to SGDbased algorithms, other methods like Coordinate Descent (CD) [yu2012scalable] and Alternate Least Square (ALS) [koren2009matrix] are also proposed to compute MF. They use different update rules in each iteration. Specifically, ALS [koren2009matrix] updates one of the result matrices once by fixing the other. Then, it updates the other result matrix in the same way. An iteration is completed when two result matrices are both updated. CD [yu2012scalable] updates one element in a result matrix once by fixing all other elements in two result matrices. An iteration is completed when all elements in result matrices are updated by the same update rule.
Iv Our Approach
In this section, we first give a straightforward method and show its drawbacks. Then, we briefly introduce scheduling algorithms for heterogeneous systems and propose our method to balance workloads. Finally, an overview of our improved algorithm is provided.
Iva A Straightforward Method
A straightforward idea to utilize both CPU and GPU resources is to treat a GPU kernel as a worker thread. Based on this idea, we can adapt FPSGD [zhuang2013fast] by regarding a GPU as an additional worker thread. Similarly, to avoid the conflict and obtain good training quality, a worker thread receives a new block satisfying the two criteria in Section IIIA once it finishes processing a block. We apply the FPSGD and CuMF_SGD algorithms to process a matrix block on a CPU thread and a GPU kernel, respectively. This straightforward algorithm shown above can be called HSGD.
Let and be the number of CPU threads and GPUs. Following the matrixdivision rule in [zhuang2013fast], the number of blocks should be at least . We refine this formula and give a more precise matrixdivision rule.
Rule.
Given an input rating matrix , CPU threads, and GPUs, should be divided into at least blocks.
We explain the rationale of this rule. When the number of blocks is less than , every thread is always assigned the same block. As a result, only several specific blocks are updated during the algorithm. Obviously, this will lead to a terrible training result. Even worse, the algorithm cannot fully exploit all worker threads. For example, in Figure 2, assume that there are threads. The block is assigned to thread 1, and the rest gray blocks are assigned to other threads. When thread 1 finishes processing , the rest gray blocks may still be occupied. To avoid conflicts, thread 1 has to continually process . By contrast, when the block number increases to , thread 1 can always locate a spare row or column which is not occupied by other blocks.
IvB Motivation
Even though the HSGD successfully combines CPU and GPU resources, we have several observations which can help (i) improve the working efficiency of GPUs and (ii) balance workloads for different hardware.
Exploiting Hardware Characteristics. GPUs and CPUs have different features. We have two observations as follows.
Observation.
In the context of MF, small blocks cannot saturate the GPU computing power.
To indicate the relationship between the block size and the efficiency of the GPU, we launch a GPU kernel with the default configuration to process blocks with different sizes. The details of configuration and hardware can be found in Section VII. The GPU throughput is reported in Figure 3(a), where the labels on the xaxis represent the number of elements in a block, and the labels on the yaxis represent the average number of elements processed in every second.
The throughput significantly increases when the block size is relatively small. Afterward, the upward trend becomes gentle as the block size continues to grow. This phenomenon may be due to two main reasons. First, we need to transfer data via the PCIe bus to the global memory of GPU when launching a GPU kernel. A small block cannot fully utilize the bandwidth of the bus. Second, more data can better utilize all threads and the cache mechanism of the GPU. More throughput details about data transfer and GPU kernel execution can support our opinion and will be shown in Section VB.
Observation.
In the context of MF, the computing power of CPU cores is not sensitive to the block size.
The average number of elements processed by a thread of CPU in every second is shown in Figure 3(b). Unlike the GPU, the throughput of a CPU thread always remains stable when the block size varies. This is because the worker threads of CPU are relatively more independent than those of GPU and the computing capability of a CPU thread is not so powerful, compared with the whole GPU device.
Nonuniform Matrix Division. Based on the above two observations, an immediate idea to improve the algorithmic efficiency is to set the block size as large as possible. However, as mentioned in Section IVA, the input matrix should be divided into at least blocks and the block size under this division strategy is still relatively small. For example, we use 16 CPU threads and a GPU in the default configuration of our experiments. Considering the realworld dataset Yahoo!Music with rows and columns, we divide it into at least blocks. Consequently, the number of elements in each block is less than million, which is still not large enough to saturate the GPU computing power in the light of Figure 3(a).
To improve the working efficiency of GPUs, we divide the rating matrix into blocks with different sizes. The large ones are assigned to GPUs, and the small ones are assigned to CPUs. Towards this end, there are several issues that we need to address. For example, we need to answer how to set suitable block sizes for both CPU and GPU, and how to divide the rating matrix in practice. We will answer these questions in the following section, and the final matrix division strategy will be given in Section VI.
Workload Balance. As shown above, we should make the size of blocks assigned to GPUs as large as possible in terms of improving the working efficiency of GPUs. However, an extreme nonuniform division strategy may cause a serious workload imbalance problem, which can remarkably reduce the overall efficiency when combining two hardware resources. To prevent this issue, a scheduling strategy should be considered. Many efforts have been done to balance workloads for the applications in heterogeneous systems. They can be categorized into three kinds: (1) dynamic methods, (2)
static methods by classifier
, and (3) static methods by cost models. Here, we briefly review some of representative methods among them and elaborate reasons why they cannot be straightforwardly used in our problem. Then, we propose our own method. For more details of scheduling strategies in heterogeneous systems, interested readers can refer to surveys[pradeep2018review, mittal2015survey].(1) Dynamic Methods. The dynamic methods assign a new task to a computational device according to the performance of devices on previous tasks [rossi2016leveraging, jimenez2009predictive, leis2014morsel, belviranli2013dynamic, boyer2013load, choi2013efficient, wang2013cap, kaleem2014adaptive]. For example, [rossi2016leveraging] maintains a doubleended queue. CPU processes the tasks from the front of the queue. Simultaneously, GPU processes the tasks from the reverse direction. The algorithm naturally enables the dominant hardware resource to handle more tasks. The baseline algorithm in Section IVA essentially adopts a dynamic strategy. If a CPU core or a GPU finishes updating a block, it is assigned a new block without incurring any conflict. The dynamic method performs well in other problems if there is not any rule to assign tasks.
However, this type of method does not work well in our problem due to the independence property of the task assignment. As discussed in Section IVA, we first equally divide the matrix into a set of blocks, which is necessary to avoid conflicts. GPUs cannot achieve an ideal performance in this setting according to Observation IVB. In addition to the problem of GPU working efficiency, the dynamic update method in HSGD may suffer from a poor training result. We give an example to explain this problem.
Example.
We consider computing the matrix factorization of a rating matrix on a machine with two CPU threads and one GPU . Assume that we divide into matrix blocks, which is shown in Figure 4. In the beginning, two CPU threads and the GPU get blocks , , and , respectively. Normally, the computing power of a GPU is much stronger than that of a CPU thread. As a result, when has completed its task , and are still working on their blocks and . According to the scheduling policy of HSGD, will apply a new block which is independent of and and has the least number of updates. Block satisfies these conditions. picks in the second step of Figure 4. In the following steps, will continually update the two blocks in the lower right corner since and are always occupied by and . This phenomenon makes the numbers of updates for different blocks severely unbalanced, which is demonstrated in the last matrix. The number of updates for a block is relatively large if the corresponding color is dark. Compared with the original SGD algorithm which updates matrix elements randomly, the process in this example leads to a weak training result.
(2) Static Methods by Classifier. Static methods provide scheduling decisions for worker threads of different devices before applications start. This type of method establishes a classifier based on training datasets in the offline phase [grewe2011static, kofler2013automatic, wen2014smart, ghose2016divergence]. Given a new task, the classifier identifies the class of the task and applies a corresponding strategy of task assignment derived from the offline phase.
There are several drawbacks if applying the classifierbased methods in our problem. First, it is complicated and timeconsuming to generate training data including static code features and optimal partition strategies. Moreover, these methods usually rely on specific frameworks such as OpenCL [stone2010opencl] and Insieme [insieme] to obtain static code features. This makes them not general. Second, they are usually designed for multitask platforms. Consequently, The partition scheme generated by them is coarsegrained.
(3) Static Methods By Cost Model. This type of method estimates CPU’s and GPU’s execution time by establishing a cost model [luk2009qilin, lee2015orchestrating]. A cost model is a function revealing the relationship between the input data size and the corresponding execution time of a specific worker thread. A representative approach among them is Qilin [luk2009qilin]. It divides a training dataset into two parts and , which are assigned to CPUs and GPUs, respectively. is further divided into subparts …. Each subpart is processed by a CPU thread, and the corresponding execution time is recorded. A similar operation is applied to by GPUs. Qilin uses the curve fitting to construct two linear equations as the projections of the execution times for CPUs and GPUs respectively.
In the context of our problem, a simple linear function in [luk2009qilin] is hard to accurately estimate the execution time of GPUs. Recall Figure 3(a), it is not a horizontal line. This proves that the execution time does not increase linearly as the number of elements in a block grows.
Our Method to Balance Workloads. We propose a hybrid method for our problem. We first customize a cost model to divide the matrix specialized for the MF problem. We improve the accuracy of the GPU cost model. The details of cost models are given in Section V. Second, we have a dynamic scheduling mechanism, which allows the dominant resource to use workstealing mechanism. This alleviates the deviation between the cost model and the practical performance.
IvC The Framework
In this section, we give an overview of our improved algorithm in Figure 5, which is called HSGD*. Our method contains an offline preprocessing phase and an online processing phase. The offline phase (the gray area in Figure 5) derives a cost model which estimates the hardware performance. This step can be performed only once on a machine, and the corresponding parameters are stored to support the query of any input rating matrix in the online phase.
In the online phase, a sparse rating matrix is given. HSGD* first divides the matrix into two parts, denoted as and , based on cost models of CPUs and GPUs from the offline phase. Then, and are further divided into several blocks. Finally, the scheduler assigns blocks to worker threads. The calculation process continues until the number of iterations reaches the predefined value. During most of the period when HSGD* runs, CPU threads are only allowed to process blocks in , and GPUs are only allowed to process blocks in . Similar to HSGD, the block assignment avoids conflicts in the same row or column. We also have a dynamic scheduling strategy to balance workloads in practice, which is not reflected in Figure 5. The details will be shown in Section VI. A formal pseudocode of the framework HSGD* is reported in Algorithm 2, which is selfexplanatory.
V Our Cost Model
Given an input matrix , let and be the proportion of the workload assigned to GPUs and CPUs, respectively, where . We use and to denote the time spent on updating elements in and by a GPU and a CPU thread, respectively. When the context is clear, and are used for short. The total running time of our algorithm is represented below.
(7) 
Note that both and are monotonic. Based on the computed cost functions, the total running time is minimized when the load between resources keeps balancing. We set using the following equation.
(8) 
Based on discussion above, our aim is to derive cost functions and for a GPU and a CPU thread, respectively, where (resp. ) denotes the estimation of (resp. ). As discussed earlier, a straightforward method to establish a cost model is to follow the works [luk2009qilin, lee2015orchestrating] which think that execution time is linearly related to the size of the input matrix. However, based on Observation IVB
, we find that the processing speed of GPUs increases when the block size increases in our problem. This makes linear regression methods for the GPU cost model inaccurate. Moreover, the computation in execution kernels of GPUs and the data transfer are not completely serial due to CUDA stream mechanism. The total running time of GPU is not a simple sum of the kernel execution and the data transfer. This observation makes us reconsider how execution kernel and data transfer exactly influence the total running time of GPUs, which is not discussed in previous cost models.
In the rest of this section, we introduce the strategy to prepare the training data in Section VA and propose our cost model of GPUs in Section VB. For the cost model of CPUs, we use a linear function to estimate the performance similar to [luk2009qilin]. A formal pseudocode to compute cost models is given in Algorithm 3. We explain each step as follows.
Va Data Preparation and Training for CPUs
To derive the training data, we shuffle the input dataset to avoid uneven data distribution. After the data is shuffled, we equally divide input dataset into disjoint parts , stored in array at line 1 of Algorithm 3. Then, CPU execution kernel configured with a single thread is launched to compute on datasets generated in the last step at line 2. Instead of computing on respectively in [luk2009qilin], CPU execution kernel computes on respectively, and the corresponding execution time is recorded. After this, we get an array including data size and corresponding execution time. As a training data set, this array is used to curve fitting for CPUs. Our adaptation generates a wider range of training data which can better reflect the relationship between data size and execution time. To eliminate noise, the execution time in the training data is derived from the average of multiple tests.
VB Estimating Working Efficiency of GPUs
Given a task, the total processing time by a GPU is spent on two parts: (1) data transfer between the CPU and the GPU via the PCIe bus, and (2) GPU execution kernels.
Data Transfer. Data transfer has two directions — from CPU to GPU (sometimes called Host to Device) and from GPU to CPU. We denote the times spent on them by and , respectively. We only discuss the process to model the data transfer from CPU to GPU, and the model for the other direction is similar.
Figure 6 reveals the transfer speed for data with different sizes. The transfer speed grows very fast in the beginning. After the data size is larger than a threshold, the transfer speed remains stable. Based on this phenomenon, we use a function to fit the curve when the dataset is not very large, then use the linear regression to model the rest. A formal model is expressed as follows. denotes the size of data transferred from CPU to GPU, and denotes the threshold.
The rationale for is that the data transfer time can be represented by the quotient of the data size and the transfer speed. According to Figure 6, we use the function to model the curve of the first stage. We select this function since the trend performs like an inverse function of the parabola. Note that the label distribution on the xaxis is not linear. This is because the logarithmic scale can make the trend clearly presented even though the data size is small. Obviously, this phenomenon shows that we cannot fully utilize the bandwidth of the PCI bus if the data is not large enough, which supports Observation IVB. Then, we determine the threshold by the extent to transfer speed variation. Empirically, when the variation of the transfer speed is less than 2% in a time unit, we consider that the transfer speed has been stable. Finally, we fit the curve as a linear function for the second stage when . The curve fitting can be done by using the least squares method.
GPU Execution Kernel. We design the cost model of the GPU execution kernel. Similar to Figure 6, the throughput of updating remains stable after the block size reaches a threshold, which means that the computing power of the GPU is saturated. To fit the curves, we use a logarithmic function when the dataset is not very large, and then use the linear regression to model the rest. The growth trend of the logarithmic function can be slower than the power function (e.g., ), which is more consistent with the trend in Figure 7. This is why we choose it to model the curve of the first stage. A formal model is expressed as follows, where denotes the time spent on by the GPU execution kernel. The function represents the processing speed of the GPU execution kernel.
Overall GPU Cost Model. We cannot simply sum the time of the kernel execution and the data transfer as our estimation for the overall execution time of the GPU, since these two parts are not absolutely serial. Specifically, to improve the overall working efficiency of GPUs, we adopt a widely used optimization based on the CUDA stream mechanism. A CUDA stream contains a list of GPU commands executed in serial, and commands in different streams are executed in parallel if hardware resources permit. At the same time, commands in different streams can be synchronized. This mechanism allows us to perform data transfer and kernel execution in parallel without breaking correctness. We explain this idea in the following example.
Example.
As shown in Figure 8, we use three streams to manage the data transfer from CPU to GPU, the kernel execution, and the data transfer from GPU to CPU, respectively. Assume that a block is assigned to the GPU. Stream 1 transfers and corresponding to the global memory of a GPU. Then, the GPU kernel scans the block and updates . Simultaneously, stream 1 continuously transfers the next block assigned to the GPU and corresponding . When stream 2 finishes , stream 3 transfers the updated back to CPU.
From this example, the overall time for GPU can be roughly decided by the maximum time spent among these three streams because it covers the time of the other two parts. Note that although the first and the last schemes cannot be overlapped by the maximal stream in the figure, the cost can be ignored when the number of transferred and computed blocks is very large. Note that is always smaller than since we do not need to transfer blocks back to CPU. Therefore, we define the overall cost model of a GPU as follows.
(9) 
The overall cost model of a GPU depends on the maximum between data transfer time from CPU and GPU and execution time of the GPU kernel.
Vi Workload Balance in Practice
Via Dynamic Scheduling
Even though we have proposed a tailored GPU cost model for MF, the estimation may be still hard to exactly reflect the computing power of devices given a different dataset. The workloads of CPU and GPU may be unbalanced if we assign blocks simply according to the cost model. To remedy this issue, we adopt a dynamic scheduling strategy when one device finishes its tasks. Specifically, assume that the GPU has finished its tasks. Instead of waiting for the tasks being processed by CPUs, we allow GPUs to pick some blocks originally assigned to CPUs. We call it static phase when the GPU and the CPU only process the originally assigned tasks and call it dynamic phase when one of them finishes its own tasks and is involved in processing tasks of the other. In static phase, every GPU is assigned to blocks in specific rows so that it can update one segment of one result matrix all the time and avoid the transfer of this segment.
ViB Putting Things Together
We explain our final strategy for the matrix division in this section. An example is shown in Figure 9.
Number of Columns. Based on the cost model proposed in Section V, we first partition the matrix into two submatrices and for CPUs and GPUs, respectively. They are marked by white and gray in Figure 9. In the light of Rule IVA, we further divide the matrix into columns. This setting guarantees two things. The first thing is that GPUs can always know not only the current block but also the next block. This enables the overlap between computation and data transfer in Figure 8. The second thing is that there always exists a spare column when a GPU kernel or a CPU thread finishes processing its block.
Number of Rows for CPUs. As shown in Rule IVA, we can divide the input matrix into rows, where there are (resp. ) rows in (resp. ). However, this division strategy causes a problem when the dynamic scheduling is activated. Specifically, assume that GPUs have finished their own tasks and are involved in processing blocks of CPUs. Currently, we have totally (CPU and GPU) threads working on blocks. This would break Rule IVA and cannot fully exploit all worker threads. To support the assistance from GPUs, we set the number of rows of CPUs as based on Rule IVA. The setting would not affect the CPU efficiency since the computing power of CPUs is not sensitive to the block size as shown in Observation IVB.
Number of Rows for GPUs. Similarly, If CPUs first finish their tasks and apply for blocks in , the number of rows in should be at least . However, compared with the row number for GPUs, the row number leads to a smaller block size, which cannot saturate the computing power of GPUs according to Observation IVB. Different from the case of CPUs, the division strategy for GPUs needs to satisfy that the block size is large enough in the beginning, while the number of rows is large enough to avoid conflicts if CPUs join. To achieve this target, we divide into rows. For each row where , we further divide into subrows. As a result, relatively large blocks with sizes are assigned to GPUs in static phase, and blocks with sizes are assigned to GPUs and CPUs in dynamic phase.
Example.
We give an example to explain the division strategy for . Assume that we have 2 GPUs and 4 CPU threads, i.e., . We divide into rows, and each row is further divide into subrows. In static phase, we assign a block with size to a GPU, and in dynamic phase, we assign a block with size to a GPU or a CPU thread. On the other hand, is divided into columns and rows. This division for would not change in the algorithm.
Vii Experiments
We conduct extensive experiments to show the efficiency and the effectiveness of our proposed algorithms. Algorithms appearing in our experiments are summarized as follows.

CPUOnly: Only CPU works. We uniformly divide the matrix and use the strategy in [zhuang2013fast] to assign blocks. More details can be found in Section IIIA. We use AVX and OpenMP for acceleration.

GPUOnly: Only GPU works. We vary the number of rows and columns for the matrix division and adopt the best one. The "O3" optimization flag is supported.

HSGD: CPU and GPU work in parallel. The algorithm is introduced in Section IVA. AVX, OpenMP, and "O3" optimization flag are supported.

HSGD*: CPU and GPU work in parallel. Nonuniform matrix division and dynamic strategy are used. Our cost model decides the size of blocks assigned to two hardware resources. AVX, OpenMP, and "O3" optimization flag are supported.
Stochastic gradient methods and the parameter used in [zhuang2013fast] and [xie2017cumf_sgd] are different. To correctly combine two methods on Heterogeneous CPUGPU Systems, we embed the core part of LIBMF^{2}^{2}2https://github.com/cjlin1/libmf and CuMF_SGD^{3}^{3}3https://github.com/cuMF/cumf_sgd into our code and make minor modifications to make the stochastic gradient methods they use consistent. For stochastic gradient method, We choose to use the more concise one in [zhuang2013fast]. For the value of parameter , we choose the larger one in [xie2017cumf_sgd] because a large value can lead to a better training result.
Datasets and Parameter Setting. We evaluate algorithms in four realworld datasets — MovieLens^{4}^{4}4http://grouplens.org/datasets/movielens/, Netflix^{5}^{5}5https://www.kaggle.com/netflixinc/netflixprizedata, R1^{6}^{6}6https://webscope.sandbox.yahoo.com/catalog.php?datatype=r, and Yahoo!Music^{7}^{7}7https://webscope.sandbox.yahoo.com/catalog.php?datatype=c. Statistics of the datasets are presented in Table I. For reproducibility, we consider the original training/test sets in our experiments. More details about each dataset can be found on the corresponding website. We set the parameters following [chin2015learning], which are also listed in Table I.
Datasets  MovieLens  Netflix  R1  Yahoo!Music 

71567  2649429  1948883  1000990  
65133  17770  1101750  624961  
9301274  99072112  104215016  252800275  
698780  1408395  11364422  4003960  
128  128  128  128  
0.05  0.05  1  1  
0.05  0.05  1  1  
0.005  0.005  0.005  0.01 
Experimental Environment. We use a machine with Intel Xeon E52687W v3 3.10GHz processors and a Quadro P4000 GPU with 8GB global memory. The number of available cores is 20. The system interface of the GPU is PCI Express 3.016. The total bandwidth is 32GB/s. By default, we use 16 CPU threads and 128 GPU parallel workers. Here, we follow the definition of GPU parallel workers in [xie2017cumf_sgd], which means the number of elements computed simultaneously in the GPU kernel. All datasets can fit in memory in our experiments.
Organization. Section VIIA shows the adaptiveness of our algorithms by varying the computing resources. Section VIIB shows the effectiveness of our algorithm compared with the stateortheart competitor. Section VIIC and Section VIID evaluate our optimization techniques including matrix division strategy and workload balance.
Viia Overall Efficiency
We evaluate the overall efficiency of our final algorithm HSGD* with CPUOnly and GPUOnly as comparisons. We use Root Mean Square Error (RMSE) ^{8}^{8}8https://en.wikipedia.org/wiki/Rootmeansquare_deviation as a metric for the loss, which is widely used in many recommender systems. For each dataset, we terminate all algorithms and record the corresponding running time when the RMSE reaches a predefined value. Given that we use a different stochastic gradient method from [xie2017cumf_sgd] and a different k value from [zhuang2013fast], the predefined loss values they used are not available. For fair comparison, we select these values that can be reached by all methods including HSGD which suffers from a weak training quality. The predefined loss values are , , , and for MovieLens, Netflix, R1, and Yahoo!Music, respectively. The comparisons between HSGD and HSGD* will be shown in Section VIIC.
ViiA1 Varying GPU parallel workers
In this experiment, we evaluate the adaptiveness of our algorithm by varying the GPU parallel workers from 32 to 512. The running times of algorithms for different GPU parallel workers are reported in Figure 10 for four datasets. Note that the CPU thread number is fixed to the default value .
As a reference, the running time of CPUOnly is stable on all settings. Initially, the GPUOnly is slower than CPUOnly. When we use more GPU threads, the running time of GPUOnly decreases and overtakes that of CPUOnly. The running time of HSGD* is the smallest among all algorithms. For example, in R1, given GPU worker threads, HSGD* takes seconds while CPUOnly and GPUOnly take seconds and seconds respectively. When the thread number increases to 512, HSGD* takes seconds while GPUOnly takes seconds. The decrease of the time of HSGD* also shows that our algorithm is adaptive to different GPU settings and can fully utilize the increasing computing power of GPUs.
ViiA2 Varying CPU Thread Number
In this experiment, we evaluate the adaptiveness of our algorithm by varying the CPU thread number from 4 to 16. The GPU parallel workers are fixed to the default value . The running times of algorithms for different CPU thread numbers are reported in Figure 11.
In contrast to Figure 10, the running time of GPUOnly is consistent, and the running time of CPUOnly decreases when we use more CPU threads. HSGD* is the fast algorithm on all settings and all datasets. For example, in R1 when the CPU thread number is 4, HSGD* takes seconds, while CPUOnly takes seconds, and GPUOnly takes seconds. When the CPU thread number increases to 16, HSGD* takes only seconds while CPUOnly takes seconds. The decrease of the time of HSGD* shows that our algorithm is adaptive to different CPU settings and can fully utilize the increasing computing power of CPUs.
Figure 11 and Figure 10 show the high efficiency of HSGD*. When the gap between the computing power of CPU and GPU is limited (default setting), HSGD* achieves a 1.42.3x speedup over CPUOnly and a 1.42.3x speedup over GPUOnly on all datasets. The experiments also show that the overhead cost of HSGD* is minor. When the gap between the computing power of CPU and GPU is large, e.g., CPU uses 16 threads and GPU uses 512 parallel workers in Figure 10, HSGD* still achieves a slight speedup over GPUOnly.
ViiB Training Quality
In this experiment, we report the derived loss values (RMSE) of our algorithm HSGD* during the training process to show the effectiveness of our method. The experiment will demonstrate the loss value of our algorithm finally converges to a reasonable value. CPUOnly and GPUOnly are also compared as references.
The results are shown in Figure 12. The downward trends of HSGD* are obvious, and the loss of HSGD* converges in the shortest time. In addition, HSGD* achieves a similar converged loss value compared with other algorithms, which shows the effectiveness of our algorithm. For example, in Yahoo!Music, the loss of HSGD* drops to in 10 seconds. At the same time, the loss values of CPUOnly and GPUOnly are and , respectively. When time increases to 25 seconds, the loss of HSGD*, CPUOnly, and GPUOnly drops to , , and , respectively. Finally, all loss values of HSGD*, CPUOnly, and GPUOnly stay about .
ViiC Matrix Division Strategy
We evaluate the effectiveness of our matrix division strategy in this experiment. For each dataset, we record the loss (RMSE) value on different running times of HSGD* with HSGD as a comparison. The result is shown in Figure 13.
We can see that the training quality of HSGD is poor especially when processing relatively large datasets. This phenomenon is consistent with the discussion in Example IVB. The nonuniform matrix division in HSGD* fixes this issue. Given the same running time, HSGD* derives a smaller loss value than HSGD, and the advantage of HSGD* is obvious especially in large datasets. For example, given 50 seconds in R1, the RMSE value for HSGD* reaches to , while that for HSGD is only . The result proves that nonuniform matrix division can utilize GPU resources better.
Datasets  MovieLens  Netflix  R1  Yahoo!Music  
Workload proportion  
HSGD*Q  C  49.56%  55.98%  56.07%  56.46% 
G  50.44%  44.02%  43.93%  43.54%  
HSGD*M  C  55.91%  49.02%  49.75%  53.61% 
G  44.09%  50.98%  50.25%  46.39%  
Running time  
HSGD*Q  0.92 s  15.87 s  13.07 s  40.88 s  
HSGD*M  0.89 s  13.02 s  12.08 s  35.41 s 
ViiD Workload Balance
We evaluate the effectiveness of techniques used to balance workloads in this section.
ViiD1 Cost Models
To show the effectiveness of our cost models, we report the proportion of workloads derived by our cost models with [luk2009qilin] as a comparison in Table II. To clearly reflect the algorithmic efficiency based on two cost models, we make these two methods run the same number of iterations, which is 20 in this experiment.
In Table II, HSGD*Q represents the algorithm HSGD* which uses Qilin [luk2009qilin] to evaluate the working efficiency of hardware. HSGD*M represents the algorithm HSGD* which uses our model in Section V to evaluate the working efficiency of hardware. Note that for fairness, both HSGD*Q and HSGD*M do not include the dynamic scheduling strategy in Section VI to further balance workloads. "C" and "G" in the table represent the assigned proportion of workloads to CPUs and GPUs, respectively.
The practical running times of HSGD*Q and HSGD*M are also reported. The running time of HSGD*M is smaller than that of HSGD*Q on all datasets. This result proves that our cost model can derive a more accurate estimation for the working efficiency of hardware. We can find that HSGD*M prefers to assign more work to GPU compared with HSGD*Q on all datasets except MovieLens. For MovieLens, HSGD*M observes that the performance of GPU is not strong when processing a small dataset (Observation IVB). Therefore, it assigns less work to GPU. The effectiveness of HSGD*M becomes considerable when the dataset is large. Note that given a smaller target loss, both algorithms will require more iterations, and the advantage of our method will become obvious. For example, to achieve the predefined loss value of Yahoo!Music in Section VIIA, HSGD* needs 46 iterations, which is more than twice that of this experiment.
Dataset  HSGD*M  HSGD* 

MovieLens  0.89 s  0.84 s 
Netflix  13.02 s  11.42 s 
R1  12.08 s  10.58 s 
Yahoo!Music  35.41 s  30.96 s 
ViiD2 Dynamic Scheduling
We evaluate the effectiveness of the dynamic scheduling strategy (Section VI). Similar to the experiment for cost models, we use HSGD*M to denote our final algorithm without the dynamic scheduling technique to further balance workloads. The running times of HSGD*M and HSGD* on all datasets are shown in Table III.
The result shows HSGD* is faster than HSGD*M on all datasets. Note that in MovieLens with relatively small size, the computing power of GPU cannot be saturated, which degrades the effectiveness of the dynamic scheduling. As a result, the improvement of dynamic scheduling on MovieLens is minor. By combining the new cost models and the dynamic scheduling strategy, our final algorithm achieves a significant improvement in balancing workloads of MF.
Viii Conclusion
We propose an efficient parallel MF algorithm on heterogeneous CPUGPU systems. To fully utilizes the processing power of GPUs, our approach divides the input matrix using a nonuniform method and assigns large blocks to GPUs. We design a new cost model to estimate the performance of computing resources. We balance workloads by combining the cost models and dynamic scheduling in practice.
Comments
There are no comments yet.