1. Introduction
Learning nonlinear relationships between predictor variables and responses is a fundamental problem in machine learning
(Barndorff-Nielsen and Shephard, 2004; Bertin-Mahieux et al., 2011). One state-of-the-art method is Kernel Ridge Regression (KRR) (Zhang et al., 2013), which we target in this paper. It combines ridge regression, a method that addresses ill-posedness and overfitting in the standard regression setting via L2 regularization, with kernel techniques, which add flexible support for capturing nonlinearity.

Computationally, the input of KRR is an n-by-d data matrix with n training samples and d features, where typically n >> d. At training time, KRR needs to solve a large linear system (K + λnI)α = y, where K is an n-by-n matrix, α and y are n-by-1 vectors, and λ is a scalar. Forming and solving this linear system is the major bottleneck of KRR. For example, even on a relatively small dataset (357 megabytes (MB) for a 520,000-by-90 data matrix (Bertin-Mahieux et al., 2011)), KRR needs to form a 2 terabyte dense kernel matrix. A standard distributed parallel dense linear solver on a P-processor system requires O(n^2/P) words of memory and O(n^3/P) arithmetic operations per processor; we refer to this approach hereafter as Distributed KRR (DKRR). In machine learning, where weak scaling is of primary interest, DKRR fares poorly: keeping n/P fixed, the storage grows as O(P) per processor and the flops as O(P^2) per processor. In the perfect weak scaling situation, both the memory needed and the flops grow as O(1) per processor (i.e. memory and flops are constant). In one experiment, the weak scaling efficiency of DKRR dropped from 100% to 0.32% as P increased from 96 to 1536 processors.

Divide-and-Conquer KRR (DCKRR) (Zhang et al., 2013) addresses the scaling problems of DKRR. Its main idea is to randomly shuffle the rows of the n-by-d data matrix and then partition it into p different (n/p)-by-d matrices, one per machine, leading to an (n/p)-by-(n/p) kernel matrix on each machine; it builds local KRR models that are then reduced globally to obtain the final model. DCKRR reduces the memory and computational requirements. However, it is hard to use in practice because it sacrifices accuracy. For example, on the Million Song Dataset (a music recommendation application) with 2k test samples, the mean squared error (MSE) decreased only from 88.9 to 81.0 as the number of training samples increased from 8k to 128k (doubling the number of processors as the number of samples doubles). In other words, DCKRR weak-scales poorly in terms of accuracy. By contrast, the MSE of DKRR decreased from 91 to 0.002, which is substantially better.
The idea of DCKRR is to partition the dataset into similar parts and generate similar models, and then average these models to get the final solution. We seek other ways to partition the problem that scale as well as DCKRR while improving its accuracy. Our idea is to partition the input dataset into different parts and generate different models, from which we then select the best model. Further addressing particular communication overheads, we obtain two new methods, which we call Balanced KRR (BKRR) and K-means KRR (KKRR). Figure 1 summarizes our optimization flow. We propose a series of approaches, with details explained later (KKRR3 is an impractical algorithm used later to bound the attainable accuracy). Figure 2 shows the fundamental tradeoff between accuracy and scaling for these approaches. Among them, we recommend BKRR2 (the optimized version of BKRR) and KKRR2 (the optimized version of KKRR) for practical use. BKRR2 is optimized for scaling and has good accuracy. KKRR2 is optimized for accuracy and has good speed.
When we increase the number of samples from 8k to 128k, KKRR2 reduces the MSE from 95 to less than 0.002, which addresses the poor accuracy of DCKRR. In addition, KKRR2 is faster than DCKRR for a variety of datasets. Our KKRR2 method improves weak scaling efficiency over DKRR from 0.32% to 38% and achieves a 591× speedup over DKRR at the same accuracy on the same data and hardware (1536 processors). For applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves a 3505× speedup with only a slight loss in accuracy.
2. Background
2.1. Linear Regression and Ridge Regression
In machine learning, linear regression is a widely-used method for modeling the relationship between a scalar dependent variable y (regressand) and multiple explanatory variables (independent variables) denoted by x, where x is a d-dimensional vector. The training data of a linear regression problem consists of n data points, where each data point (or training sample) is a pair (x_i, y_i). Linear regression has two phases: training and prediction. The training phase builds a model from an input set of training samples, while the prediction phase uses the model to predict the unknown regressand of a new data point x̂. The training phase is the main limiter to scaling, both with respect to increasing the training set size n and the number of processors P. In contrast, prediction is embarrassingly parallel and cheap per data point.

Training Process. Given n training data points (x_i, y_i), where each x_i is a d-dimensional vector and each y_i is a scalar, we want to build a model relating a given sample (x_i) to its measured regressand (y_i). For convenience, we define X as the n-by-d matrix whose i-th row is x_i, and y as the n-dimensional vector of regressands. The goal of regression is to find a w such that Xw ≈ y. This can be formulated as the least squares problem in Equation (1). To solve ill-posed problems or prevent overfitting, L2 regularization (Schmidt, 2005) is used, as formulated in Equation (2), which is called Ridge Regression. λ is a positive parameter that controls ||w||: the larger λ is, the smaller ||w|| is.
(1)  min_w ||Xw - y||^2

(2)  min_w ||Xw - y||^2 + λ||w||^2
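As a concrete illustration (a minimal numpy sketch, not the paper's implementation), Equation (2) has the closed-form solution of the regularized normal equations, w = (XᵀX + λI)⁻¹ Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1          # illustrative sizes and lambda, not tuned values

X = rng.standard_normal((n, d))
w_true = np.arange(1.0, d + 1.0)
y = X @ w_true + 0.01 * rng.standard_normal(n)

# Ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2,
# solved via the normal equations (X^T X + lam * I) w = X^T y.
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.round(w, 2))  # close to w_true = [1, 2, 3, 4, 5]
```

With mild regularization the recovered w is only slightly shrunk toward zero relative to the unregularized least squares solution.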
Prediction Process. Given a new sample x̂, we can use the w computed in the training process to predict the regressand of x̂ by ŷ = x̂ᵀw. If we have m test samples and their true regressands y_1, …, y_m, the accuracy of the estimated regressands ŷ_1, …, ŷ_m can be evaluated by the MSE (Mean Squared Error) defined in Equation (3).

(3)  MSE = (1/m) Σ_{i=1}^{m} (ŷ_i - y_i)^2
2.2. Kernel Method
For many real problems, the underlying model cannot be described by a linear function, so linear ridge regression suffers from poor prediction error. In those cases, a common approach is to map samples to a high-dimensional space using a nonlinear mapping, and then learn the model in the high-dimensional space. The kernel method (Hofmann et al., 2008) is a widely used approach that conducts this learning procedure implicitly by defining a kernel function: the similarity of samples in the high-dimensional space. Commonly used kernel functions are shown in Table 1; we use the most widely used Gaussian kernel in this paper.
Linear      K(x_i, x_j) = x_iᵀx_j
Polynomial  K(x_i, x_j) = (x_iᵀx_j + c)^q
Gaussian    K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
Sigmoid     K(x_i, x_j) = tanh(γ x_iᵀx_j + c)
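For instance, the Gaussian kernel matrix over a set of samples can be formed as follows (a numpy sketch; the bandwidth σ = 1 is a hypothetical choice, and the pairwise-distance expansion ||a - b||² = ||a||² + ||b||² - 2aᵀb is used to vectorize the computation):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """K[i, j] = exp(-||X1[i] - X2[j]||^2 / (2 * sigma^2))."""
    sq = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2.0 * X1 @ X2.T)
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

X = np.random.default_rng(1).standard_normal((4, 3))
K = gaussian_kernel(X, X)
print(K.shape)     # (4, 4)
print(np.diag(K))  # all ones, since K(x, x) = 1 for the Gaussian kernel
```

The resulting matrix is symmetric with entries in (0, 1], approaching zero as samples move apart in Euclidean distance.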
2.3. Kernel Ridge Regression (KRR)
Combining the kernel method with Ridge Regression yields Kernel Ridge Regression, which is presented in Equations (4) and (5). The ||·||_H in Equation (4) is a Hilbert space norm (Zhang et al., 2013). Given the n-by-n kernel matrix K, this problem reduces to the linear system defined in Equation (6). K is the kernel matrix constructed from the training data by K_{ij} = K(x_i, x_j), y is the input n-by-1 regressand vector corresponding to the training samples, and α is the n-by-1 unknown solution vector.
(4)  f* = argmin_{f ∈ H} (1/n) Σ_{i=1}^{n} (f(x_i) - y_i)^2 + λ||f||_H^2

(5)  f*(x) = Σ_{i=1}^{n} α_i K(x_i, x)
In the training phase, the algorithm's goal is to get α by (approximately) solving the linear system in (6). In the prediction phase, the algorithm uses α to predict the regressand of any unknown sample x̂ using Equation (7). KRR is specified in Algorithm 1. The algorithm searches for the best λ and σ from parameter sets Λ and Σ. Thus, in practice, Algorithm 1 is only a single iteration of KRR, because the best parameters are not known before using the dataset. |Λ| × |Σ| is the number of iterations, where Λ and Σ are the parameter sets of λ and σ (Gaussian kernel), respectively. Thus, the computational cost of the KRR method is O(|Λ| |Σ| n^3). If each parameter set contains a few dozen candidate values, the algorithm needs to finish thousands of iterations. People also use the cross-validation technique to select the best parameters, which needs much more time.
(6)  (K + λnI)α = y

(7)  ŷ = Σ_{i=1}^{n} α_i K(x_i, x̂)
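Putting Equations (6) and (7) together, one KRR iteration can be sketched as follows (numpy; the Gaussian-kernel helper and the values of λ and σ are illustrative assumptions, not tuned parameters from the paper):

```python
import numpy as np

def gauss(A, B, sigma):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

rng = np.random.default_rng(2)
n, d, lam, sigma = 500, 4, 1e-4, 1.0
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.01 * rng.standard_normal(n)  # nonlinear target

# Training: solve (K + lam * n * I) alpha = y   (Equation (6)).
K = gauss(X, X, sigma)
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

# Prediction: y_hat = sum_i alpha_i * K(x_i, x_hat)   (Equation (7)).
X_test = rng.standard_normal((50, d))
y_hat = gauss(X_test, X, sigma) @ alpha

# MSE on fresh samples (the variance of the target itself is about 0.43).
mse = np.mean((y_hat - np.sin(X_test[:, 0]))**2)
print(round(mse, 4))
```

Forming K and solving the dense n-by-n system are exactly the O(n²) memory and O(n³) flop costs discussed in this paper.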
2.4. K-means Clustering
Here we review the K-means clustering algorithm, which will be used in our algorithm. The objective of K-means clustering is to partition a dataset into k subsets (S_1, …, S_k), using a notion of proximity based on Euclidean distance (Forgy, 1965). The value of k is chosen by the user. Each subset S_j has a center c_j, which is a d-dimensional vector. A sample x belongs to S_j if c_j is its closest center. K-means is shown in Algorithm 2.
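A minimal Lloyd-style K-means sketch (numpy; Algorithm 2 in the paper may differ in details, e.g. Forgy initialization picks random samples, while this sketch initializes deterministically for reproducibility):

```python
import numpy as np

def kmeans(X, k, iters=10):
    centers = X[:k].copy()  # simple deterministic init; Forgy uses random samples
    for _ in range(iters):
        # Assignment step: each sample joins its closest center (Euclidean).
        d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1)
        labels = d2.argmin(1)
        # Update step: recompute each center as the mean of its samples.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers, labels

# Two well-separated toy groups around (0, 0) and (5, 5).
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0],
              [5.0, 5.1], [0.0, 0.2], [4.9, 5.0]])
centers, labels = kmeans(X, 2)
print(np.bincount(labels))  # [3 3]
```

Note the imbalance possibility visible even here: nothing in plain K-means forces clusters to have equal sizes, which is the issue the K-balance algorithm in Section 4.3 addresses.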
3. Existing Methods
3.1. Distributed KRR (DKRR)
The bottleneck of KRR is solving the n-by-n linear system (6), which is generated from a much smaller n-by-d input matrix with n >> d. As stated before, this makes weak scaling problematic, because memory per processor grows like O(P) and flops per processor grow like O(P^2). In the perfect weak scaling situation, both the memory needed and the flops grow as O(1) per processor (i.e. memory and flops are constant). Our experiments show the weak scaling efficiency of DKRR decreases from 100% to 0.32% as we increase the number of processors from 96 to 1536. Since the n-by-n matrix cannot be created on a single node, we create it in a distributed way on a √P-by-√P machine grid (Fig. 3). We first divide the sample matrix into √P equal parts by rows. To generate its block of the kernel matrix, each machine needs two of these parts of the sample matrix. For example, in Fig. 3, to generate its block, machine 6 needs the second and the third parts of the blocked sample matrix. Thus, we reduce the storage and computation for kernel creation from O(nd) to O(nd/√P) per machine. Then we use the distributed linear solver in ScaLAPACK (Choi et al., 1995) to solve for α.
3.2. DivideandConquer KRR (DCKRR)
DCKRR (Zhang et al., 2013) showed that using divide-and-conquer can reduce the memory and computational requirements. We first shuffle the original data matrix by rows. Then we distribute the rows evenly to all the nodes (lines 1-5 of Algorithm 3). On each node, we construct a much smaller kernel matrix ((n/p)-by-(n/p)) than the original kernel matrix (n-by-n) (lines 6-11 of Algorithm 3). After getting the local kernel matrix K_i, we use it to solve a linear equation on each node, where y_i holds the input labels and α_i is the solution. After getting α_i, the training step is complete. Then we use the local α_i to make predictions for each unknown data point and do a global average (reduction) to output the final predicted label (lines 13-15 of Algorithm 3). If we get a set of better hyper-parameters, we record them by overwriting the old versions (lines 16-19 of Algorithm 3).
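The DCKRR flow above can be sketched serially (numpy; each loop iteration stands in for one node, and the Gaussian-kernel helper, λ, and problem sizes are illustrative assumptions rather than the paper's MPI implementation):

```python
import numpy as np

def gauss(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

rng = np.random.default_rng(3)
n, d, p, lam = 400, 3, 4, 1e-3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])

# Divide: shuffle rows, then split evenly across the p "nodes".
perm = rng.permutation(n)
parts = np.array_split(perm, p)

# Local training: each node solves its own small (n/p)-by-(n/p) system.
models = []
for idx in parts:
    Ki = gauss(X[idx], X[idx])
    ai = np.linalg.solve(Ki + lam * len(idx) * np.eye(len(idx)), y[idx])
    models.append((idx, ai))

# Conquer: every node predicts every test sample; average the predictions.
X_test = rng.standard_normal((20, d))
y_hat = np.mean([gauss(X_test, X[idx]) @ ai for idx, ai in models], axis=0)
print(y_hat.shape)  # (20,)
```

Each local solve costs O((n/p)³) instead of O(n³), which is the source of DCKRR's scalability, while the final averaging step is the source of its accuracy loss discussed in Section 4.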
4. Accurate, Fast, and Scalable KRR
4.1. K-means KRR (KKRR)
DCKRR performs better than previous state-of-the-art methods (Zhang et al., 2013). However, based on our observations, DCKRR still needs to be improved: it has poor weak scaling in terms of accuracy, which is the bottleneck for distributed machine learning workloads. We observe that the poor accuracy of DCKRR is mainly due to the random partitioning of training samples. Thus, our objective is to design a better partitioning method that achieves higher accuracy while maintaining high scaling efficiency. Our algorithm is accurate, fast, and scalable on distributed systems.
The analysis in this section is based on the Gaussian kernel because it is the most widely used case (Zhang et al., 2013). Other kernels can work in the same way with different distance metrics. For any two training samples, their kernel value K(x_i, x_j) is close to zero when their Euclidean distance ||x_i - x_j|| is large. Therefore, for a given sample x̂, only the training points close to x̂ in Euclidean distance have a significant effect on the result of the prediction process (Equation (7)). Based on this idea, we partition the training dataset into p subsets (S_1, …, S_p). We use K-means to partition the dataset since K-means groups samples by Euclidean distance: samples with short Euclidean distances between them are clustered into one group. After K-means, each subset S_j has its data center c_j. Then we launch p independent KRRs (KRR_1, …, KRR_p) to process these subsets.
Given a test sample x̂, instead of only using one model to predict x̂, we give all the nodes a copy of the test set. This additional cost is trivial for two reasons: (1) the test dataset is much smaller than the training dataset, and the training dataset is much smaller than the kernel matrix, which is the major memory overhead; (2) this broadcast is finished at the initial step, and there is no need to do it at every iteration. Each node makes a prediction for x̂ using its local model. Then we get the final prediction by a global reduction (average) over all nodes. Figure 4 shows the framework of KKRR. KKRR is highly parallel because all the subproblems are independent of each other. However, the performance of KKRR is not as good as we expected. From Figure 5.1, we observe that DCKRR achieves better accuracy than KKRR. In addition, KKRR is slower than DCKRR. In the following sections, we make the algorithm more accurate and scalable than DCKRR.
4.2. KKRR2
The low accuracy of KKRR is mainly due to its conquer step. For DCKRR, because the divide step is a random and even partition, the sub-datasets and local models are similar to each other, and thus averaging works well. For KKRR, the clustering method divides the original dataset into different sub-datasets that are far away from each other in Euclidean distance. The local models generated from these sub-datasets are also very different from each other. Thus, using their average gives worse accuracy because some models are unrelated to the test sample x̂. For example, if the data center c_j of the j-th partition is far away from x̂ in Euclidean distance, the j-th model should not be used to make a prediction for x̂. Since we divide the original dataset based on Euclidean distance, a similar procedure should be used in the prediction phase. Thus, we design the following algorithm.
After the training process, each sub-KRR generates its own model file (M_1, …, M_p). We can use these models independently for prediction. For a given unknown sample x̂, if its closest data center (in Euclidean distance) is c_j, we only use M_j to make a prediction for x̂. We call this version KKRR2. From Figures 5.1 and 5.2 we observe that KKRR2 is more accurate than DCKRR. However, KKRR2 is slower than DCKRR. For example, to get the same accuracy (MSE = 88), KKRR2 is 2.2× slower than DCKRR (436 s vs. 195 s). Thus we need to focus on speed and scalability.
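The KKRR2 prediction rule is then a nearest-center lookup (a sketch; the `centers` and `predict_fns` inputs stand in for the clustering output and the per-cluster trained models, which are hypothetical here):

```python
import numpy as np

def kkrr2_predict(x_hat, centers, predict_fns):
    """Use only the model whose data center is closest to x_hat."""
    j = np.argmin([np.linalg.norm(x_hat - c) for c in centers])
    return predict_fns[j](x_hat)

# Toy illustration with two hypothetical local "models" (constant predictors).
centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
predict_fns = [lambda x: -1.0, lambda x: +1.0]

print(kkrr2_predict(np.array([0.2, 0.1]), centers, predict_fns))  # -1.0
print(kkrr2_predict(np.array([4.8, 5.1]), centers, predict_fns))  # 1.0
```

Unlike DCKRR's averaging, only the one model whose training data surrounds x̂ contributes, which is why KKRR2 recovers accuracy on clustered data.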
4.3. Balanced KRR (BKRR)
Based on the profiling results in Figure 6, we observe that the major source of KKRR's poor efficiency is load imbalance. The reason is that the partitioning by K-means is irregular and imbalanced. For example, processor 2 in Figure 6 has to handle 35,137 samples while processor 3 only needs to process 7,349 samples. Since the memory requirement grows like O(n_i^2) and the number of flops grows like O(n_i^3) for a partition with n_i samples, processor 3 runs 51× faster than processor 2 (Figure 6). Moreover, partitioning by K-means is data-dependent, and the sizes of the clusters cannot be controlled. This makes it unreliable in practice, and thus we need to replace K-means with a load-balanced partitioning algorithm. To this end, we design a K-balance clustering algorithm and use it to build Balanced Kernel Ridge Regression (BKRR).
In our design, a machine corresponds to a clustering center. If we have p machines, then we partition the training dataset into p parts. As mentioned above, the objective of the K-balance partitioning algorithm is to make the number of samples on each node close to n/p. If a data center has n/p samples, we say it is balanced. The basic idea of this algorithm is to find the closest center for each sample; once a given data center is balanced, no additional sample will be added to it. The detailed K-balance clustering method is in Algorithm 4. Line 1 of Algorithm 4 is an important step because K-balance first needs to run the K-means algorithm to get the data centers. This gives K-balance a clustering pattern similar to K-means. Lines 3-12 find the center for each sample. Lines 6-10 find the best under-loaded center for the i-th sample. Lines 15-19 recompute each data center by averaging all the samples assigned to it. Recomputing the centers by averaging is optional because it does not necessarily improve the results. From Figure 6 we observe that K-balance partitions the dataset in a balanced way. After K-balance clustering, all the nodes have the same number of samples, so in the training phase all the nodes have roughly the same training time and memory requirement.
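A serial sketch of the K-balance assignment step (numpy; this follows the idea of Algorithm 4, capping each center at n/p samples, with the K-means centers assumed given and n divisible by p for simplicity):

```python
import numpy as np

def k_balance_assign(X, centers):
    """Assign each sample to its closest center that still has room."""
    n, p = len(X), len(centers)
    cap = n // p                      # target load n/p per center (assumes p | n)
    load = np.zeros(p, dtype=int)
    labels = np.empty(n, dtype=int)
    # Squared Euclidean distances from every sample to every center.
    d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1)
    for i in range(n):
        # Closest center among those that are still under capacity.
        order = np.argsort(d2[i])
        j = next(c for c in order if load[c] < cap)
        labels[i] = j
        load[j] += 1
    return labels, load

rng = np.random.default_rng(4)
X = rng.standard_normal((12, 2))
centers = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels, load = k_balance_assign(X, centers)
print(load)  # [4 4 4]: every center ends with exactly n/p samples
```

Balance is guaranteed by construction: total capacity equals n, so every sample finds a center and no center exceeds n/p.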
After replacing K-means with K-balance, KKRR becomes BKRR, and KKRR2 becomes BKRR2. Algorithm 5 is the framework of BKRR2. As mentioned in the Introduction, KKRR2 is the optimized version of KKRR and BKRR2 is the optimized version of BKRR. Lines 1-7 perform the partition. Lines 9-22 perform one iteration of the training. Lines 9-11 construct the kernel matrix. Line 12 solves the linear equation on each machine. Lines 13-18 perform the prediction; only the best model makes the prediction for a given test sample. Lines 19-22 evaluate the accuracy of the system. There is no communication cost in BKRR2 during training except for the initial data partition and the final model selection. The additional cost of BKRR2 comes from two parts: (1) the K-balance algorithm and partition, and (2) the prediction.
The overhead of K-balance is tiny compared with the training part. K-balance first runs K-means. The cost of K-means is O(Ipnd), where I is the number of K-means iterations (usually less than 100) and p is the number of partitions, which is also the number of machines. The cost of lines 3-12 in Algorithm 4 is O(pnd). Both are tiny compared with the training part, which is O((n/p)^3) per machine. For example, if we use BKRR2 to process the 32k MSD training samples on 96 NERSC Edison processors, the single-iteration training time is 20 times larger than the K-balance and partition time. Since the computational overhead of K-balance is low compared to KRR training, we can simply use one node to run the K-balance algorithm. Although it is not necessary, we also have an approximate parallel algorithm for K-balance: we conduct parallel K-means clustering, distribute the centers to all the nodes, and then partition the samples across all the processors with the per-center capacity set to n/p. Since p << n in practice, this parallel implementation gets roughly the same results as the single-node implementation, with even lower overhead.
For the prediction part, instead of conducting many small communications, we make each machine first compute its own error (line 18 of Algorithm 5) and then conduct only one global communication (line 19 of Algorithm 5). The reason is detailed below. The runtime on a distributed system can be estimated as the sum of the time for computation and the time for communication: T = γF + αM + βMB, where F is the number of floating point operations, γ is the time per floating point operation, α is the latency, 1/β is the bandwidth, M is the number of messages, and B is the number of bytes per message. In practice, α >> β >> γ (e.g. on systems such as the Intel NetEffect NE020 and the Intel 5300-series Clovertown). Therefore, one big message is much cheaper than many small messages carrying the same data, because the latency α is paid once rather than M times. This optimization reduces the latency overhead. Figures 8 and 9 show that BKRR2 achieves a lower error rate than DCKRR in a shorter time for a variety of datasets. Figure 7 shows the difference between BKRR2 and DCKRR.
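The latency argument can be made concrete with illustrative constants (the values of α and β below are assumptions for illustration, not measured numbers from the paper's machines):

```python
# Cost model: T_comm = alpha * (#messages) + beta * (total bytes);
# the gamma * F computation term is the same in both schemes, so it is omitted.
# alpha and beta below are assumed, round-number values for illustration.
alpha = 1e-6                 # latency per message, seconds
beta = 1e-9                  # time per byte (inverse bandwidth), seconds

bytes_total = 8 * 10_000     # communicate 10,000 doubles in total

many_small = 10_000 * alpha + bytes_total * beta  # one double per message
one_big = 1 * alpha + bytes_total * beta          # a single batched message

print(many_small, one_big)   # batching wins by roughly 100x here
```

Because α dominates β per byte, the batched transfer is over two orders of magnitude cheaper even though both move the same data.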
4.4. BKRR3 and KKRR3: Error Lower Bound and The Unrealistic Approach
The KKRR and BKRR families share the same idea of partitioning the data into p parts and building p different models, but they differ in that the KKRR family is optimized for accuracy while the BKRR family is optimized for performance. We want to know the gap between our methods and the best theoretically attainable ones. We refer to the theoretical KKRR method as KKRR3 and the theoretical BKRR method as BKRR3.
BKRR3 is similar to BKRR2 in terms of communication and computation pattern. Like BKRR and BKRR2, after the training process, each sub-BKRR generates its own model file (M_1, …, M_p). We can use these model files independently for prediction. For a given test sample x̂ (with true regressand y), we give all the nodes a copy of x̂ (like KKRR). Each model makes a prediction for x̂; we get ŷ_1, ŷ_2, …, ŷ_p from the p models, respectively. We select ŷ_j for the prediction, where j = argmin_i |ŷ_i - y|. This means we inspect the true regressand to make sure we select the best model for each test sample. When changing K-balance to K-means, BKRR3 becomes KKRR3, which is much more accurate than DCKRR (Figures 5.1 and 5.2). The MSE of BKRR3 is a lower bound on the MSE of BKRR2 because BKRR3 always uses the best model for each test sample. In fact, BKRR3 is even much more accurate than the original method (DKRR) in all the tests in our experiments. The framework of BKRR3 is in Algorithm 6. Figures 8 and 9 show the results: BKRR3 is always the best approach in these comparisons, achieving much higher accuracy than DCKRR and BKRR2.
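The oracle selection rule of BKRR3 can be sketched as follows (it inspects the true regressand, which is exactly why it is an accuracy bound rather than a practical method; the model predictions below are hypothetical numbers):

```python
import numpy as np

def bkrr3_select(predictions, y_true):
    """Per test sample, pick the model prediction closest to the truth."""
    preds = np.asarray(predictions)          # shape: (p models, m samples)
    best = np.abs(preds - y_true).argmin(0)  # index of the best model per sample
    return preds[best, np.arange(preds.shape[1])]

# Three hypothetical models' predictions for four test samples.
preds = [[1.0, 2.0, 3.0, 4.0],
         [1.5, 1.9, 2.0, 4.5],
         [0.0, 0.0, 3.1, 3.9]]
y_true = np.array([1.1, 2.0, 3.0, 4.0])

print(bkrr3_select(preds, y_true))  # [1. 2. 3. 4.]
```

Since the selected prediction is by construction at least as close to the truth as BKRR2's nearest-center choice, the resulting MSE lower-bounds BKRR2's.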
5. Implementation and Analysis
5.1. RealWorld Dataset
To give a fair comparison with DCKRR, we use the Million Song Dataset (MSD) (Bertin-Mahieux et al., 2011) as the major dataset in our experiments, because MSD was used in the DCKRR paper. It is a freely available collection of audio features and metadata for a million contemporary popular music tracks. The dataset contains 515,345 samples. Each sample is a song (track) released between 1922 and 2011, represented as a vector of timbre information computed from the song. Each sample is a pair (x, y), where x is a d-dimensional (d = 90) vector and y is the year the song was released. The Million Song Dataset Challenge aims at being the best possible offline evaluation of a music recommendation system. It is a large-scale, personalized music recommendation challenge, where the goal is to predict the songs that a user will listen to, given both the user's listening history and full information (including metadata and content analysis) for all songs. To justify the efficiency of our approach, we use another three real-world datasets. The information of these four datasets is summarized in Table 2. All these datasets were downloaded from (Lin, 2017).
name  MSD  cadata  MG  spacega 

# Train  463,715  18,432  1,024  2,560 
# Test  51,630  2,208  361  547 
Dimension  90  8  6  6 
Application  Music  Housing  Dynamics  Politics 
Method  Processors  96  192  384  768  1536 

BKRR2  IterTime (s)  1.06  1.06  1.06  1.08  1.15 
KKRR2  IterTime (s)  2.60  3.66  4.05  4.23  6.85 
DKRR  IterTime (s)  13  75.1  234  1273  4048 
BKRR2  Efficiency  1.0  1.0  1.0  0.99  0.92 
KKRR2  Efficiency  1.0  0.71  0.64  0.62  0.38 
DKRR  Efficiency  1.0  0.17  0.06  0.01  0.003 
Samples  DCKRR  BKRR2  DKRR  KKRR2  BKRR3 

8k  88.9  93.1  90.9  95.0  40.2 
32k  85.5  87.7  85.0  87.5  6.6 
128k  81.0  14.7  0.002  -  -
5.2. Fair Comparison
Let p be the number of partitions or nodes and P be the number of processors; each node has 24 processors. When we use P = 1536 processors, we actually divide the dataset into p = 64 parts. Each machine generates a local model for BKRR2. To give a fair comparison, we make sure all the methods are tuned over the same parameter set. Different methods may select different best parameters from the same parameter set to achieve their lowest MSE. From Figures 8, 9 and 11, we clearly observe that BKRR2 is faster than DCKRR and also achieves a lower prediction error on all the datasets. In other words, BKRR2 and DCKRR may use different parameters to achieve their lowest MSEs, and the lowest MSE of BKRR2 is lower than the lowest MSE of DCKRR. BKRR2 is also slightly faster than both DCKRR and BKRR3 (for both single-iteration time and overall time). The reason is that each machine only needs to process 1/p of the test samples for prediction, whereas for DCKRR and BKRR3 each machine needs to process all the test samples. BKRR2 achieves a 1.2× speedup over DCKRR on average and has a much lower error rate on the 128k-sample MSD dataset (14.7 vs 81.0).
Because of DKRR's poor weak scaling, BKRR2 runs much faster than DKRR on 1536 processors and 128k training samples. The single-iteration time of BKRR2 is 1.15 sec, while the single-iteration time of DKRR is 4048 sec. Here, a single iteration means picking a pair of parameters and solving the linear system once (e.g. lines 9-22 of Algorithm 5 are one iteration). The algorithm can get a better pair of parameters after each iteration. All the algorithms in this paper run the same number of iterations because they use the same parameter tuning space. However, it is unfair to say BKRR2 achieves a 3505× speedup over DKRR because, for the same 2k test dataset, the lowest MSE of BKRR2 is 14.7 while that of DKRR is 0.002. This means the 128k-sample BKRR2 model is worse than the 128k-sample DKRR model in accuracy. To give a fair comparison, we increase the number of training samples of BKRR2 to 256k so that it can generate a better model. Using the 256k-sample model, the lowest MSE of BKRR2 is 0.53 (Figure 10). We run it on the same 1536 processors and observe that the single-iteration time becomes 5.6 sec. In this way, we can say that BKRR2 achieves a 723× speedup over DKRR while achieving roughly the same accuracy on the same hardware and the same test dataset. It is worth noting that the biggest dataset DKRR can handle on 1536 processors is 128k samples; beyond that it runs out of memory. Therefore, the 128k-sample model is the best model DKRR can get on 1536 processors. However, BKRR2 can use a bigger dataset to get an even better model on 1536 processors (e.g. the 512k-sample model in Figure 10).
The theoretical speedup of the 128k-sample BKRR2 run over the 128k-sample DKRR run is the ratio of n^3/P to (n/p)^3/(P/p), which is p^2 = 4096 for p = 64 and P = 1536; we achieve a 3505× speedup. The theoretical speedup of the 256k-sample BKRR2 run over the 128k-sample DKRR run is the ratio of n^3/P to (2n/p)^3/(P/p), which is p^2/8 = 512; we achieve a 723× speedup. The difference between the theoretical and practical speedups comes from the systems and low-level libraries (e.g. the implementations of LAPACK and ScaLAPACK). In this comparison, BKRR2's better scalability allows it to use more data than DKRR, which cannot run on an input of the same size. We also want to compare KKRR2 to DKRR using the same amount of data. The single-iteration time of KKRR2 with 128k samples is 6.9 sec in our experiment, and its MSE is even lower than the MSE of DKRR (0.002). Thus, KKRR2 achieves a 591× speedup over DKRR at the same accuracy using the same data and hardware (Tables 3 and 4).
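These theoretical ratios follow directly from the cost model; a quick check of the arithmetic:

```python
# DKRR solves one n^3 system on P processors: time ~ n^3 / P.
# BKRR2 solves p independent (n/p)^3 systems, each on a node with P/p
# processors: time ~ (n/p)^3 / (P/p) = n^3 / (p^2 * P).
p = 64

speedup_same_n = p**2      # same training set size for both methods
print(speedup_same_n)      # 4096

# Doubling BKRR2's samples (256k vs 128k) grows each local cost by 2^3 = 8x.
speedup_double_n = p**2 / 8
print(speedup_double_n)    # 512.0
```

The measured 3505× and 723× figures bracket these model predictions, with the gap attributed to library and system effects as noted above.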
5.3. WeakScaling Issue
As mentioned earlier, weak scaling means we fix the data size per node and increase the number of nodes. We use 96 to 1536 processors (4 to 64 partitions) for the scaling tests on the NERSC Edison supercomputer, with the MSD dataset. We set the 96-processor case as the baseline and define it to have 100% weak scaling efficiency. We double the number of samples as we double the number of processors. Figure 11 and Table 3 show the time weak-scaling and accuracy weak-scaling results. In terms of time, BKRR2 achieves 92% weak scaling efficiency on 1536 processors while DKRR only achieves 0.32%. We then compare weak scaling in terms of test accuracy: the MSE of DCKRR for 2k test samples only decreases from 88.9 to 81.0, while the MSE of BKRR2 decreases from 93.1 to 14.7, and KKRR2 reduces the MSE from 95 to less than 0.002. In conclusion, DKRR has bad time scaling and DCKRR has poor accuracy weak scaling. Our proposed method, BKRR2, has good weak scaling behavior both in time and in accuracy. The weak scaling of BKRR3 is shown in Table 5.
processors  96  192  384  768  1536 

Time (s)  1311  1313  1328  1345  1471 
Efficiency  100%  99.8%  98.7%  97.5%  89.1% 
MSE  40.2  16.5  6.62  1.89  -
5.4. Method Selection
KKRR2 achieves better accuracy than the state-of-the-art method. If users need the most accurate method, KKRR2 is the best choice: it runs slower than DCKRR but much faster than DKRR. BKRR2 achieves better accuracy than DCKRR and also runs faster, and it is much faster than DKRR at the same accuracy. The reason BKRR2 is more efficient than DKRR is not that it assumes more samples; rather, it has much better weak scaling, so it can more easily use more samples to get a better model. Thus, BKRR2 is the most practical method because it balances speed and accuracy. BKRR3 can be used to evaluate the performance of systems in both accuracy and speed. In summary, we recommend using either BKRR2 (optimized for speed) or KKRR2 (optimized for accuracy).
5.5. Implementation Details
We use the CSR (Compressed Sparse Row) format for processing the sparse matrix (the original n-by-d input matrix, not the dense kernel matrix) in our implementation. We use MPI for distributed processing. To give a fair comparison, we use ScaLAPACK (Choi et al., 1995) for solving the linear system on distributed systems and LAPACK (Anderson et al., 1999) on shared memory systems, both in their Intel MKL versions. Our experiments are conducted on the NERSC Cori and Edison supercomputer systems (NERSC, 2016). Our source code is available online: https://people.eecs.berkeley.edu/~youyang/cakrr.zip
The kernel matrix K is symmetric positive definite (SPD) (Wehbe, 2013), and so is K + λnI. Thus we use Cholesky decomposition to solve for α in DKRR, which is 2.2× faster than the Gaussian elimination version.
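A single-node sketch of the Cholesky-based solve with SciPy (the actual implementation uses the corresponding ScaLAPACK/LAPACK routines through Intel MKL; the random Gram matrix below merely stands in for a real kernel matrix):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(5)
n, lam = 100, 1e-3

# Any Gram matrix K = G G^T is symmetric positive semidefinite,
# so K + lam * n * I is SPD and admits a Cholesky factorization.
G = rng.standard_normal((n, n))
K = G @ G.T
y = rng.standard_normal(n)

c, low = cho_factor(K + lam * n * np.eye(n))  # one O(n^3 / 3) factorization
alpha = cho_solve((c, low), y)                # two cheap triangular solves

print(np.allclose((K + lam * n * np.eye(n)) @ alpha, y))  # True
```

Cholesky does roughly half the flops of LU on an SPD system and needs no pivoting, consistent with the 2.2× speedup reported above.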
To conduct a weak scaling study, we select a subset (e.g. 64k, 128k, 256k) of the 463,715 training samples to generate a model, and then use the model to make predictions. We use 2k test samples to show the MSE comparisons among the different methods. If we change the number of test samples (e.g. from 2k to 20k), the test accuracy stays roughly the same, because the test samples are randomly shuffled, so the 2k case has roughly the same pattern as the 20k case.
6. Related Work
The work most related to this paper is DCKRR (Zhang et al., 2013); the difference between DCKRR and this work was presented in the Introduction. Schölkopf et al. (Schölkopf et al., 1998) conducted a nonlinear form of principal component analysis (PCA) in high-dimensional feature spaces. Fine et al. (Fine and Scheinberg, 2002) used Incomplete Cholesky Factorization (ICF) to design an efficient Interior Point Method (IPM). Williams et al. (Williams and Seeger, 2001) designed a Nyström sampling method by carrying out an eigendecomposition on a smaller dataset. Si et al. (Si et al., 2014) designed a memory-efficient kernel approximation method by first dividing the kernel matrix and then doing low-rank approximation on the smaller matrices. These methods can reduce the running time from O(n^3) to O(n^2 c) or O(nc^2), where c is the rank of the kernel approximation. However, all of these methods were proposed for serial analysis on single-node systems. On the other hand, the exact kernel method is more accurate than existing approximate methods (Zhang et al., 2013). Zhang et al. (Zhang et al., 2013) showed that DCKRR can beat all the previous approximate methods, so DCKRR is considered the state-of-the-art approach. Thus, we focus on the comparison with Zhang et al. (Zhang et al., 2013).

A second line of work uses iterative methods such as gradient descent (Yao et al., 2007), the block Jacobi method (Schreiber, 1986), and the conjugate gradient method (Blanchard and Krämer, 2010) to reduce the running time. However, iterative methods cannot make full use of the available computational power efficiently because they only load a small amount of data into memory at a time. If we scale the algorithm to 100+ cores, the kernel matrix method will be much faster. These methods provide a tradeoff between time and accuracy, and we reserve them for future research. ASKIT (March et al., 2015) used n-body ideas to reduce the time and storage of kernel matrix evaluation, which is a direction for our future work. CA-SVM (You et al., 2015, 2016) used the divide-and-conquer approach for machine learning applications. The differences include: (1) CA-SVM is for classification while this work is for regression. (2) CA-SVM uses an iterative method, whereas we use a direct method. (3) The iterative method does not store the huge kernel matrix, so the CA-SVM work has no consideration of the huge kernel matrix; the scalability of the iterative method is also limited.
Lu et al. (Lu et al., 2013) designed a fast ridge regression algorithm, which reduces the running time from O(n²d) to O(nd log n). However, first, this work makes an unreasonable assumption: that the number of features is much larger than the number of samples (d ≫ n). This is true for some datasets, such as the webspam dataset (Lin, 2017), which has 350,000 training samples with 16,609,143 features each. However, on average only about 10 of these 16 million features are nonzero per sample, so such data can be processed efficiently in a sparse format like CSR (Compressed Sparse Row). Second, this method only works for linear ridge regression rather than the kernel version, so it loses the nonlinear feature-space information.
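The d ≫ n regime that Lu et al. target can also be handled exactly by solving linear ridge regression in its dual form: instead of the d-by-d primal system, one solves an n-by-n system and maps the solution back. The sketch below (my illustration; dimensions and the regularizer λ are arbitrary) verifies that the two solves agree.

```python
import numpy as np

# Linear ridge regression when d >> n: instead of solving the d x d
# primal system (X.T X + lam I) w = X.T y, solve the much smaller
# n x n dual system (X X.T + lam I) a = y and recover w = X.T a.
rng = np.random.default_rng(0)
n, d, lam = 50, 2000, 1e-2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Dual solve: O(n^2 d + n^3) work.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha

# Primal solve for verification: O(n d^2 + d^3) work, expensive for large d.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
agree = np.allclose(w_dual, w_primal, atol=1e-6)
```

The identity (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹ makes the two solutions exactly equal; replacing XXᵀ with a kernel matrix is precisely what KRR does to recover the nonlinear case.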
7. Conclusion
Due to its O(n²) memory requirement and O(n³) arithmetic operations, KRR is prohibitive for large-scale machine learning datasets when n is very large. The weak scaling of KRR is problematic: keeping n/p fixed, the total memory required grows like O(p) per processor and the total flops grow like O(p²) per processor. The reason BKRR2 is more scalable than DKRR is that it removes all the communication in the training part and reduces the computation cost per processor from O(n³/p) to O(n³/p³). The reason BKRR2 is more accurate than DC-KRR is that it creates p different models and selects the best one. Compared to DKRR, BKRR2 improves the weak scaling efficiency from 0.32% to 92% and achieves a 723× speedup for the same accuracy on the same 1536 processors. Compared to DC-KRR, BKRR2 achieves much higher accuracy while running faster; DC-KRR can never reach the best accuracy achieved by BKRR2. When we increase the number of samples from 8k to 128k and the number of processors from 96 to 1536, BKRR2 reduces the MSE from 93 to 14.7, which solves the poor accuracy weak-scaling problem of DC-KRR (whose MSE only decreases from 89 to 81). KKRR2 achieves a 591× speedup over DKRR for the same accuracy using the same data and hardware. Across the variety of datasets used in this paper, we observe that KKRR2 is the most accurate method, while BKRR2 is the most practical algorithm, balancing speed and accuracy. In conclusion, BKRR2 and KKRR2 are accurate, fast, and scalable methods.
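The divide-and-conquer structure summarized above can be sketched serially as follows: shuffle and partition the training data, solve an independent local KRR system on each partition, and then either average the local predictions (the DC-KRR-style reduction) or keep the single best local model as measured on held-out data (the model-selection idea behind BKRR2). This is my simplified single-process illustration, not the paper's distributed implementation; the kernel, λ, γ, and partition count are arbitrary.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of X and Y.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def local_krr(X, y, lam=1e-3, gamma=1.0):
    # Solve the local dense system (K + lam I) alpha = y on one partition.
    K = rbf(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (400, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(400)
Xval, yval = X[300:], y[300:]                    # held-out evaluation set
parts = np.array_split(rng.permutation(300), 4)  # shuffle, then partition

# One independent model per partition: each solve is (n/p)^3 work
# instead of one n^3 solve, and needs no communication.
models = [(X[p], local_krr(X[p], y[p])) for p in parts]
preds = [rbf(Xval, Xp) @ a for Xp, a in models]

mse_avg = np.mean((np.mean(preds, axis=0) - yval) ** 2)   # DC-KRR-style: average
mse_best = min(np.mean((p - yval) ** 2) for p in preds)   # BKRR2-style: best model
```

Each local solve touches only an (n/p)-by-(n/p) kernel block, which is where the O(n³/p³) per-partition cost and the reduced memory footprint come from.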
8. Acknowledgement
Yang You was supported by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research under Award Numbers DE-SC0008700 and DE-AC02-05CH11231. Cho-Jui Hsieh acknowledges the support of NSF via IIS-1719097.
References
 Anderson et al. (1999) Edward Anderson, Zhaojun Bai, Christian Bischof, Susan Blackford, Jack Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling, A McKenney, and D Sorensen. 1999. LAPACK Users’ guide. Vol. 9. SIAM.
 Barndorff-Nielsen and Shephard (2004) Ole E Barndorff-Nielsen and Neil Shephard. 2004. Econometric analysis of realized covariation: High frequency based covariance, regression, and correlation in financial economics. Econometrica 72, 3 (2004), 885–925.
 Bertin-Mahieux et al. (2011) Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. 2011. The million song dataset. In ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida. University of Miami, 591–596.
 Blanchard and Krämer (2010) Gilles Blanchard and Nicole Krämer. 2010. Optimal learning rates for kernel conjugate gradient regression. In Advances in Neural Information Processing Systems. 226–234.
 Choi et al. (1995) Jaeyoung Choi, James Demmel, Inderjit Dhillon, Jack Dongarra, Susan Ostrouchov, Antoine Petitet, Ken Stanley, David Walker, and R Clinton Whaley. 1995. ScaLAPACK: A portable linear algebra library for distributed memory computers—Design issues and performance. In Applied Parallel Computing Computations in Physics, Chemistry and Engineering Science. Springer, 95–106.
 Fine and Scheinberg (2002) Shai Fine and Katya Scheinberg. 2002. Efficient SVM training using low-rank kernel representations. The Journal of Machine Learning Research 2 (2002), 243–264.
 Forgy (1965) Edward W Forgy. 1965. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21 (1965), 768–769.
 Hofmann et al. (2008) Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. 2008. Kernel methods in machine learning. The Annals of Statistics (2008), 1171–1220.
 Lin (2017) Chih-Jen Lin. 2017. LIBSVM Machine Learning Regression Repository. (2017). https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
 Lu et al. (2013) Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. 2013. Faster ridge regression via the subsampled randomized Hadamard transform. In Advances in Neural Information Processing Systems. 369–377.
 March et al. (2015) William B March, Bo Xiao, and George Biros. 2015. ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions. SIAM Journal on Scientific Computing 37, 2 (2015), A1089–A1110.
 NERSC (2016) NERSC. 2016. NERSC Computational Systems. (2016). https://www.nersc.gov/users/computationalsystems/
 Schmidt (2005) Mark Schmidt. 2005. Least squares optimization with l1-norm regularization. (2005).

 Schölkopf et al. (1998) Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation 10, 5 (1998), 1299–1319.
 Schreiber (1986) R. Schreiber. 1986. Solving eigenvalue and singular value problems on an undersized systolic array. SIAM J. Sci. Stat. Comput. 7 (1986), 441–451.
 Si et al. (2014) Si Si, Cho-Jui Hsieh, and Inderjit Dhillon. 2014. Memory efficient kernel approximation. In Proceedings of The 31st International Conference on Machine Learning. 701–709.
 Wehbe (2013) Leila Wehbe. 2013. Kernel Properties - Convexity. (2013).
 Williams and Seeger (2001) Christopher Williams and Matthias Seeger. 2001. Using the Nystrom method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems. 682–688.
 Wright et al. (2015) Nicholas J Wright, Sudip S Dosanjh, Allison K Andrews, Katerina B Antypas, Brent Draney, R Shane Canon, Shreyas Cholia, Christopher S Daley, Kirsten M Fagnan, Richard A Gerber, et al. 2015. Cori: A Pre-Exascale Supercomputer for Big Data and HPC Applications. Big Data and High Performance Computing 26 (2015), 82.
 Yao et al. (2007) Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. 2007. On early stopping in gradient descent learning. Constructive Approximation 26, 2 (2007), 289–315.

 You et al. (2015) Yang You, James Demmel, Kenneth Czechowski, Le Song, and Richard Vuduc. 2015. CA-SVM: Communication-avoiding support vector machines on distributed systems. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. IEEE, 847–859.
 You et al. (2016) Yang You, Jim Demmel, Rich Vuduc, Le Song, and Kent Czechowski. 2016. Design and Implementation of a Communication-Optimal Classifier for Distributed Kernel Support Vector Machines. IEEE Transactions on Parallel and Distributed Systems (2016).
 Zhang et al. (2013) Yuchen Zhang, John Duchi, and Martin Wainwright. 2013. Divide and conquer kernel ridge regression. In Conference on Learning Theory. 592–617.