Oversampling Divide-and-conquer for Response-skewed Kernel Ridge Regression

07/13/2021 ∙ by Jingyi Zhang, et al. ∙ The University of Arizona ∙ Tsinghua University

The divide-and-conquer method has been widely used for computing large-scale kernel ridge regression estimates. Unfortunately, when the response variable is highly skewed, the divide-and-conquer kernel ridge regression (dacKRR) may overlook the underrepresented region and produce unacceptable results. We develop a novel response-adaptive partition strategy to overcome this limitation. In particular, we propose to allocate the replicates of some carefully identified informative observations to multiple nodes (local processors). The idea is analogous to the popular oversampling technique. Although such a technique has been widely used for addressing discrete label skewness, extending it to the dacKRR setting is nontrivial. We provide both theoretical and practical guidance on how to effectively over-sample the observations under the dacKRR setting. Furthermore, we show the proposed estimate has a smaller asymptotic mean squared error (AMSE) than that of the classical dacKRR estimate under mild conditions. Our theoretical findings are supported by both simulated and real-data analyses.


1 Introduction

We consider the problem of calculating large-scale kernel ridge regression (KRR) estimates in a nonparametric regression model. Although the theoretical properties of the KRR estimator are well-understood (Geer and van de Geer, 2000; Zhang, 2005; Steinwart et al., 2009), in practice, the computation of KRR estimates may suffer from a large computational burden. In particular, for a sample of size $N$, it requires $O(N^3)$ computational time to calculate a KRR estimate using the standard approach, as will be detailed in Section 2. Such a computational cost is prohibitive when the sample size is considerable. The divide-and-conquer approach has been implemented pervasively to alleviate such computational burden (Zhang et al., 2013, 2015; Xu et al., 2016; Xu and Wang, 2018; Xu et al., 2019). Such an approach randomly partitions the full sample into $m$ subsamples of equal size, then calculates a local estimate on an independent local processor (also called a local node) for each subsample. The local estimates are then averaged to obtain the global estimate. The divide-and-conquer approach reduces the computational cost of calculating KRR estimates from $O(N^3)$ to $O\{(N/m)^3\}$ per node. Such savings may be substantial as $m$ grows.

Despite algorithmic benefits, the success of the divide-and-conquer approach highly depends on the assumption that the subsamples can well represent the observed full sample. Nevertheless, this assumption cannot be guaranteed in many real-world applications, where the response variable may have a highly skewed distribution. Specifically, the response variable has a highly skewed distribution if its density is nearly zero over a large region of its range. Problems of this type arise in high energy physics, Bayesian inference, financial research, biomedical research, environmental data, among others (McGuinness et al., 1997; Afifi et al., 2007; Haixiang et al., 2017). In these applications, the responses falling in such a region occur with very low frequency. However, such responses are often of more interest as they tend to have a more widespread impact. For example, such a response may represent a rare signal for seismic activity or stock market fluctuations. Overlooking such signals could result in a substantial negative impact on society, either economically or in terms of human casualties.

Recall that in the classical divide-and-conquer approach, a random subsample is processed on every local node. Under the aforementioned rare-event scenarios, such a subsample could fail to have observations selected from the potentially informative region. The local estimate based on such a subsample thus is very likely to overlook that region. Averaging these local estimates could lead to unreliable estimation and prediction over these informative regions. A synthetic example in Fig. 1 illustrates the scenario in which the response is highly skewed. In this example, the one-dimensional sample (gray points) is uniformly generated over an interval. The response variable has a heavy-tailed distribution, as illustrated by the histogram. The classical dacKRR method is used to estimate the true function (gray curve). We observe that almost all the local estimates (blue curves) miss the peaks of the true function. Averaging these local estimates thus results in a global estimate (red curve) with unacceptable estimation performance. Such an observation is due to the fact that the subsample in each node overlooks the informative region of the response distribution (the peaks) and thus fails to yield a successful global estimate.

Figure 1: An illustration of unacceptable results for the classical dacKRR estimate.

Our contributions.

To combat these obstacles, we develop a novel response-adaptive partition approach with oversampling to obtain large-scale kernel ridge regression estimates, especially when the response is highly skewed. Although the oversampling technique is widely used for addressing discrete label skewness, extending such a technique to the continuous dacKRR setting is nontrivial. We bridge the gap by providing both theoretical and practical guidance on how to effectively over-sample the observations. Different from the classical divide-and-conquer approach, in which each observation is allocated to only one local node, we propose to allocate duplicates of some informative data points to multiple nodes. Theoretically, we show the proposed estimates have a smaller asymptotic mean squared error (AMSE) than the classical dacKRR estimates under mild conditions. Furthermore, we show that the number of strata of the response regulates the trade-off between the computational time and the AMSE of the proposed estimator. In particular, a larger number of strata is associated with a longer computational time as well as a smaller AMSE. Such results are novel to the best of our knowledge. Our theoretical findings are supported by both simulated and real-data analyses. In addition, we show the proposed method is not specific to the dacKRR method and has the potential to improve the estimation of other nonparametric regression estimators under the divide-and-conquer setting.

2 Preliminaries

Model setup. Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS). Such an RKHS is induced by the reproducing kernel $K(\cdot, \cdot)$, which is a symmetric nonnegative definite function, and an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ satisfying $\langle K(x, \cdot), f \rangle_{\mathcal{H}} = f(x)$ for any $f \in \mathcal{H}$. The RKHS $\mathcal{H}$ is equipped with the norm $\|f\|_{\mathcal{H}} = \langle f, f \rangle_{\mathcal{H}}^{1/2}$. The well-known Mercer's theorem states that, under some regularity conditions, the kernel function can be written as $K(x, x') = \sum_{j=1}^{\infty} \mu_j \phi_j(x) \phi_j(x')$, where $\{\mu_j\}_{j \ge 1}$ is a sequence of decreasing eigenvalues and $\{\phi_j\}_{j \ge 1}$ is a family of orthonormal basis functions. The smoothness of a function in $\mathcal{H}$ is characterized by the decaying rate of the eigenvalues $\{\mu_j\}_{j \ge 1}$. The major types of such decaying rates include the finite rank type, the exponentially decaying type, and the polynomially decaying type. Representative kernel functions of these three types are the polynomial kernel $K(x, x') = (1 + x^{\top} x')^{d}$ ($d$ is an integer), the Gaussian kernel $K(x, x') = \exp(-\|x - x'\|^{2}/\sigma^{2})$ ($\sigma$ is the scale parameter and $\|\cdot\|$ is the Euclidean norm), and kernels of the Sobolev spaces, respectively. We refer to Hastie et al. (2009); Shawe-Taylor and Cristianini (2004) for more details.
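As a concrete illustration (the parameterizations below are generic choices of ours, not necessarily those used later in the paper), the first two kernel types can be written in R as follows, together with a small helper that builds the kernel (Gram) matrix used in the next subsection.

```r
# Two common kernels of the types discussed above; parameter values are illustrative.
poly_kernel  <- function(x, y, d = 2) (1 + sum(x * y))^d                  # finite-rank type
gauss_kernel <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / sigma^2)  # exponentially decaying eigenvalues

# Kernel (Gram) matrix for a sample stored in the rows of a matrix X.
kernel_matrix <- function(X, kern = gauss_kernel) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) K[i, j] <- kern(X[i, ], X[j, ])
  K
}
```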

Consider the nonparametric regression model

$y_i = f_0(x_i) + \epsilon_i, \quad i = 1, \ldots, N,$     (1)

where $y_i$ is the response, $x_i$ are the predictors, $f_0$ is the unknown function to be estimated, and the $\epsilon_i$'s are i.i.d. normal random errors with zero mean and unknown variance $\sigma_{\epsilon}^{2}$. The kernel ridge regression estimator aims to find a projection of the data into the RKHS $\mathcal{H}$. Such an estimator can be written as

$\hat{f} = \operatorname{arg\,min}_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \{ y_i - f(x_i) \}^{2} + \lambda \| f \|_{\mathcal{H}}^{2}.$     (2)

Here, the regularization parameter $\lambda$ controls the trade-off between the goodness of fit of $\hat{f}$ and its smoothness. A penalized least squares framework analogous to Equation (2) has been extensively studied in the literature of regression splines and smoothing splines (Wahba, 1990; Hastie, 1996; Luo and Wahba, 1997; He et al., 2001; Gu and Kim, 2002; Ruppert, 2002; Zhang et al., 2004; Sklar et al., 2013; Yuan et al., 2013; Ma et al., 2015; Zhang et al., 2018; Meng et al., 2020).

The well-known representer theorem (Wahba, 1990) states that the minimizer of Equation (2) in the RKHS $\mathcal{H}$ takes the form $\hat{f}(\cdot) = \sum_{i=1}^{N} \alpha_i K(x_i, \cdot)$. Let $y = (y_1, \ldots, y_N)^{\top}$ be the response vector, $\alpha = (\alpha_1, \ldots, \alpha_N)^{\top}$ be the coefficient vector, and $\mathbf{K}$ be the kernel matrix such that the $(i, j)$-th element equals $K(x_i, x_j)$. This coefficient vector can be estimated by solving the minimization problem as follows,

(3)

It is known that the solution of such a minimization problem has a closed form

(4)

where the regularization parameter can be selected based on the cross-validation technique, the Mallows’s Cp method (Mallows, 2000), or the generalized cross-validation (GCV) criterion (Wahba and Craven, 1978).
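For reference, under one common scaling convention (an assumption on our part; the scaling of the regularization term varies across references), the minimization problem (3) and its closed-form solution (4) take the form

```latex
% Finite-dimensional KRR objective and its closed-form solution,
% written under the 1/N loss-scaling convention (an assumed convention).
\hat{\alpha}
  = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^{N}}
    \frac{1}{N}\,\lVert y - \mathbf{K}\alpha \rVert_2^{2}
    + \lambda\, \alpha^{\top} \mathbf{K} \alpha ,
\qquad
\hat{\alpha} = \bigl(\mathbf{K} + N\lambda I_{N}\bigr)^{-1} y ,
\qquad
\hat{f}(x) = \sum_{i=1}^{N} \hat{\alpha}_i\, K(x_i, x) .
```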

Although the solution of the minimization problem (3) has a closed form, the computational cost of calculating the solution using Equation (4) is of the order $O(N^{3})$, which is prohibitive when the sample size is considerable. In this paper, we focus on the divide-and-conquer approach for alleviating such a computational burden. In recent decades, a large number of studies have also aimed to develop low-rank matrix approximation methods to accelerate the calculation of kernel ridge regression estimates (Rahimi and Recht, 2008; Wang, 2015; Rudi et al., 2015; Mahoney, 2016; Musco and Musco, 2017). These approaches are beyond the scope of this paper. In practice, one can combine the divide-and-conquer approach with the aforementioned methods to further accelerate the calculation.
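The sketch below implements the closed-form fit and GCV selection of the regularization parameter under the convention displayed above; kernel_matrix() and gauss_kernel() are the helpers sketched earlier, and the code is an illustration rather than the exact procedure used in our experiments.

```r
# Closed-form KRR fit, assuming alpha-hat = (K + N * lambda * I)^{-1} y.
krr_fit <- function(X, y, lambda, kern = gauss_kernel) {
  N <- nrow(X)
  K <- kernel_matrix(X, kern)
  list(X = X, K = K, alpha = solve(K + N * lambda * diag(N), y))
}

# GCV criterion of Wahba and Craven (1978):
# GCV(lambda) = (1/N) || (I - A) y ||^2 / { (1/N) tr(I - A) }^2,
# where A = K (K + N * lambda * I)^{-1} is the smoothing ("hat") matrix.
gcv_score <- function(K, y, lambda) {
  N <- length(y)
  A <- K %*% solve(K + N * lambda * diag(N))
  resid <- y - A %*% y
  mean(resid^2) / mean(diag(diag(N) - A))^2
}

# Example: pick lambda on a grid by minimizing GCV (assuming K and y are available).
# lambdas <- 10^seq(-6, 0, length.out = 25)
# best <- lambdas[which.min(sapply(lambdas, gcv_score, K = K, y = y))]
```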

Background of the divide-and-conquer approach. The classical divide-and-conquer approach is easy to describe. Rather than solving the minimization problem (3) using the full sample, the divide-and-conquer approach randomly partitions the sample into $m$ subsamples of equal size. Each subsample is then allocated to an independent local node. Next, $m$ minimization problems are solved independently on each local node based on the corresponding subsamples. The final estimate is simply the average of all the local estimates. The divide-and-conquer approach has proven to be effective in linear models (Chen and Zhang, 2014; Lu et al., 2016), partially linear models (Zhao et al., 2016), nonparametric regression models (Zhang et al., 2013, 2015; Lin et al., 2017; Shang and Cheng, 2017; Guo et al., 2017), principal component analysis (Wu et al., 2018; Fan et al., 2019), matrix factorization (Mackey et al., 2011), among others.

Input: The training set ; the number of nodes
Step 1: Randomly and evenly partition the sample into disjoint subsamples, denoted by . Let be the number of observations in .
Step 2:
for  in  do
     calculate the local kernel ridge regression estimate on the -th local node
end for
Output: Combine the local estimates to obtain the final estimate
Algorithm 1 Classical divide-and-conquer kernel ridge regression

Algorithm 1 summarizes the classical divide-and-conquer method under the kernel ridge regression setting. Such an algorithm reduces the computational cost of the estimation from $O(N^{3})$ to $O\{(N/m)^{3}\}$ per node. The savings may be substantial as $m$ grows.
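For concreteness, a minimal R sketch of Algorithm 1 is given below. It reuses the helpers from Section 2; the local closed-form solve and its scaling of the regularization parameter follow the convention assumed there, not a prescribed implementation.

```r
# Classical dacKRR sketch: random partition, local closed-form fits, averaged prediction.
dac_krr <- function(X, y, m, lambda, kern = gauss_kernel) {
  N <- nrow(X)
  node_id <- sample(rep(seq_len(m), length.out = N))   # Step 1: random, roughly even partition
  fits <- lapply(seq_len(m), function(j) {             # Step 2: local KRR estimate on each node
    idx <- which(node_id == j)
    Kj  <- kernel_matrix(X[idx, , drop = FALSE], kern)
    list(X = X[idx, , drop = FALSE],
         alpha = solve(Kj + length(idx) * lambda * diag(length(idx)), y[idx]))
  })
  function(xnew) {                                      # Output: average of the m local estimates
    mean(sapply(fits, function(f)
      sum(apply(f$X, 1, function(xi) kern(xi, xnew)) * f$alpha)))
  }
}
```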

One difficulty in Algorithm 1 is how to choose the regularization parameters. A natural way to determine the size of these regularization parameters is to utilize the standard approaches, e.g., the Mallows's Cp criterion or the GCV criterion, based only on the local subsample. It is known that such a simple strategy may lead to a global estimate that suffers from suboptimal performance (Zhang et al., 2015; Xu and Wang, 2018). To overcome this challenge, a large number of studies in the recent decade have been dedicated to developing more effective methods for selecting the local regularization parameters. For example, Zhang et al. (2015) proposed to select the regularization parameter according to the order of the entire sample size $N$ instead of the subsample size $N/m$. Recently, Xu and Wang (2018) proposed the distributed GCV method to select a global optimal regularization parameter for nonparametric regression under the classical divide-and-conquer setting.

3 Motivation and Methodology

Motivation. To motivate the development of the proposed method, we first re-examine the oversampling strategy. Such a strategy is well known in imbalanced data analysis, where the labels can be viewed as skewed discrete response variables (Japkowicz and Stephen, 2002; He and Garcia, 2009; Krawczyk, 2016). Classical statistical and machine learning algorithms assume that the number of observations in each class is roughly of the same scale. In many real-world cases, however, such an assumption may not hold, resulting in imbalanced data. Imbalanced data pose a difficulty for classical learning algorithms, as they will be biased towards the majority group. Despite its rarity, the minority class is usually more important from the data mining perspective, as it may carry useful and important information. An effective classification algorithm thus should take such imbalances into account. To tackle imbalanced data, the oversampling strategy supplements the training set with multiple copies of the minority classes and keeps all the observations in the majority classes to make the whole dataset suitable for a standard learning algorithm. Such a strategy has proven to be effective for achieving more robust results (Japkowicz and Stephen, 2002; He and Garcia, 2009; Krawczyk, 2016).
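As a toy illustration of this classical strategy (class sizes below are made up), the minority class can simply be replicated until the class counts become comparable:

```r
# Oversampling a discrete minority class: replicate it until classes are balanced.
set.seed(1)
label <- factor(c(rep("majority", 990), rep("minority", 10)))
counts <- table(label)
n_copies <- floor(max(counts) / counts)   # copies needed per class: 1 for majority, 99 for minority
idx <- unlist(lapply(names(counts), function(cl) rep(which(label == cl), times = n_copies[cl])))
table(label[idx])                         # roughly balanced class counts after oversampling
```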

Intuitively, data with a continuous skewed response can be considered as imbalanced data. In particular, if we divide the range of the skewed response into $k$ equally spaced disjoint intervals, there will be a disproportionate ratio of observations across the intervals. The intervals that contain more observations can be considered as the majority classes, and the ones that contain fewer observations can be considered as the minority classes. Recall that the classical divide-and-conquer approach utilizes a simple random subsample from the full sample to calculate each local estimate. Therefore, when the response variable is highly skewed, such a subsample could easily overlook the observations corresponding to the minority classes, resulting in unsatisfactory estimates.
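The following toy example makes this view concrete: slicing a skewed continuous response into equally spaced intervals reveals majority and minority "slices" (the simulated response and the number of slices are illustrative only).

```r
# Viewing a skewed continuous response as imbalanced data.
set.seed(1)
y <- rexp(10000, rate = 2)        # a right-skewed toy response
k <- 10                           # number of equally spaced slices (illustrative)
slices <- cut(y, breaks = k)      # k equal-width intervals over range(y)
table(slices)                     # heavily unequal counts: majority vs. minority slices
```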

Main algorithm. Inspired by the oversampling strategy, we propose to supplement the training data with multiple copies of the minority classes before applying the divide-and-conquer approach. In addition, analogous to the downsampling strategy, the subsample in each node should keep most of the observations from the minority classes. Let $n_s$ denote the number of observations in slice $s$, and let $\lfloor \cdot \rfloor$ denote the floor function, which rounds a number down to its nearest integer. The proposed method, called oversampling divide-and-conquer kernel ridge regression, is summarized in Algorithm 2.

Input: The training set ; the number of nodes , the number of slices
Step 1: Divide the range of the responses into $k$ disjoint equally spaced intervals. Without loss of generality, we assume the intervals are indexed in increasing order of their numbers of observations.
Step 2:
for  in  do
     Let be the set that contains copies of all the elements in . Randomly and evenly partition the set into disjoint subsamples, denoted by .
end for
Step 3:
for  in  do
     The subsample corresponding to the $j$-th local node consists of the parts allocated to it in Step 2. Calculate the local estimate on the $j$-th local node
end for
Output: Combine the local estimates to obtain the final estimate
Algorithm 2 Oversampling divide-and-conquer kernel ridge regression
Figure 2: Illustration of Algorithm 2.

We use a toy example to illustrate Algorithm 2 in Fig. 2. Suppose there are only two nodes ($m = 2$), and the histogram of the response is divided into three slices ($k = 3$). The left panel of Fig. 2 illustrates the oversampling process. In particular, the observations are duplicated a certain number of times, such that the total numbers of observations in the slices are roughly equal to each other. The original observations and the duplicated observations are marked by tilted lines and horizontal lines, respectively. The right panel of Fig. 2 shows the next step. For each slice, the observations within the slice are randomly and evenly allocated to each of the two nodes. The observations allocated to different nodes are marked by blue and red, respectively. Finally, similar to the classical dacKRR approach, the local estimates are calculated and then averaged to obtain the global estimate.
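A minimal R sketch of Algorithm 2 is given below. It reuses the helpers from the Algorithm 1 sketch; the equal-width slicing via cut() and the scaling of the regularization parameter are our assumptions rather than prescribed choices.

```r
# Oversampling dacKRR sketch: slice the response, oversample each slice,
# split every slice across the m nodes, fit locally, and average.
odac_krr <- function(X, y, m, k, lambda, kern = gauss_kernel) {
  slice <- cut(y, breaks = k)                    # Step 1: k equally spaced response intervals
  n_max <- max(table(slice))
  node_sets <- vector("list", m)
  for (s in levels(slice)) {                     # Step 2: oversample and distribute each slice
    idx <- which(slice == s)
    if (length(idx) == 0) next
    idx <- rep(idx, times = floor(n_max / length(idx)))          # copies of all elements in the slice
    node_of <- sample(rep(seq_len(m), length.out = length(idx)))
    for (j in seq_len(m)) node_sets[[j]] <- c(node_sets[[j]], idx[node_of == j])
  }
  fits <- lapply(node_sets, function(idx) {      # Step 3: local KRR estimate on each node
    Kj <- kernel_matrix(X[idx, , drop = FALSE], kern)
    list(X = X[idx, , drop = FALSE],
         alpha = solve(Kj + length(idx) * lambda * diag(length(idx)), y[idx]))
  })
  function(xnew) {                               # Output: average of the m local estimates
    mean(sapply(fits, function(f)
      sum(apply(f$X, 1, function(xi) kern(xi, xnew)) * f$alpha)))
  }
}
```

For instance, with m = 2 nodes and k = 3 slices as in the toy example above, `fhat <- odac_krr(X, y, m = 2, k = 3, lambda = 1e-3)` returns a function that can be evaluated at new covariate vectors.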

Implementation details and the computational cost. Analogous to Algorithm 1, a difficulty in Algorithm 2 is how to choose the regularization parameters ’s. Throughout this paper, we opt to use the same approach to select ’s as the one proposed in Zhang et al. (2015). The number of slices can be set as a constant or be determined by Scott’s rule; see Scott (2015) for more details. Our empirical studies indicate that the performance of the proposed algorithm is robust to a wide range of choices of .
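For example, Scott's rule can be computed with the built-in bin-count helper in R; its bin width scales like sd(y) * length(y)^(-1/3), so the implied number of slices grows at the rate $N^{1/3}$ (the simulated response below is illustrative only).

```r
y <- rexp(10000, rate = 2)          # toy skewed response (illustration only)
k <- grDevices::nclass.scott(y)     # number of equal-width slices suggested by Scott's rule
```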

Consider the total number of observations in the training set after the oversampling process. When the response variable has a roughly uniform distribution, few observations are copied, and this number stays close to the original sample size $N$. In such a case, the computational cost of Algorithm 2 is of the same order as that of the classical divide-and-conquer method. In the most extreme case, in which the largest majority slice contains almost all the observations, the total number of observations after oversampling is of the order $kN$, and the computational cost of Algorithm 2 is therefore of the order $O\{(kN/m)^{3}\}$ per node. Consequently, when $k$ is a constant, again, the computational cost of Algorithm 2 is of the same order as that of the classical divide-and-conquer method. Otherwise, when $k$ is determined by Scott's rule (Scott, 2015), in which case one has $k = O(N^{1/3})$, the computational cost of Algorithm 2 becomes $O(N^{4}/m^{3})$ per node. However, such a higher computational cost is associated with theoretical benefits, as will be detailed in the next section.

4 Theoretical results

Main theorem. In this section, we obtain upper bounds on the mean squared estimation error of the proposed estimator and establish the main result in Theorem 4.1. We show our bounds contain a smaller asymptotic estimation variance term, compared to the bounds of the classical dacKRR estimator (Zhang et al., 2015; Liu et al., 2018). The following assumptions are required to bound the terms in Theorem 4.1. Assumption 1 guarantees the spectral decomposition of the kernel via Mercer's theorem. Assumption 2 is a regularity assumption on the underlying function. These two assumptions are fairly standard. Assumption 3 requires that the basis functions are uniformly bounded. Zhang et al. (2015) showed that one could easily verify this assumption for a given kernel. For instance, this assumption holds for many classical kernel functions, e.g., the Gaussian kernel.

Assumption 1. The reproducing kernel is symmetric, positive definite, and square integrable.

Assumption 2. The underlying function $f_0$ belongs to the RKHS $\mathcal{H}$.

Assumption 3. There exist some constants such that the basis functions $\{\phi_j\}_{j \ge 1}$ are uniformly bounded in the supremum norm $\|\cdot\|_{\infty}$.

Let the sample after the oversampling process and the total number of observations in it be defined as in Section 3. Let $m$ be the number of nodes. Throughout this section, we assume $m$ is a constant. Let the numbers of observations in each node for the classical divide-and-conquer method and for the proposed method be $N/m$ and its oversampled counterpart, respectively. Following the notation in Algorithm 2, let the number of observations of the most minority slice in each node be defined accordingly. Let the effective dimensionality of the kernel be defined as in Zhang (2005). The bounds on the mean squared estimation error of the proposed estimator are given in Theorem 4.1, followed by the main corollary.

Theorem 4.1.

Under Assumptions 1-3, the mean squared estimation error of the proposed estimator is bounded by

Corollary 4.2.

As , when , we have

(5)

Corollary 4.2 holds when we assume the first two terms of the bounds in Theorem 4.1 are dominant. With a properly chosen regularization parameter, most of the commonly used kernels, e.g., the kernels of the Sobolev spaces, can satisfy this assumption (Gu, 2013). The squared bias term for the classical dacKRR estimator is of the same order as the one in Equation (5) (Zhang et al., 2015). However, different from the variance term in Equation (5), the variance term for the classical dacKRR estimator is of the order .

Practical guidance on the selection of the parameter . We now compare the AMSE of the classical dacKRR estimator with the proposed one under three different scenarios. Furthermore, we provide some practical guidance on the selection of .

First, consider the scenario in which the response is not skewed, as shown in Fig. 3(a). In this case, oversampling is unnecessary, and thus the proposed estimate has the same AMSE as the classical dacKRR estimator.

Second, consider the scenario in which the response variable has a slightly skewed distribution, as shown in Fig. 3(b). Under this scenario, the required condition holds when the corresponding order relation is satisfied; here, the order notation is understood in the usual sense, i.e., the relation holds if and only if there exist some positive constants such that the associated inequality is satisfied for all sufficiently large samples. Notice that, according to the definition of the slice sizes, the condition holds when the number of slices $k$ is a constant. Corollary 4.2 then indicates the proposed estimate has the same AMSE as the classical dacKRR estimator.

Third, consider the scenario that the response variable has a highly skewed distribution, such that one has , as shown in Fig. 3(c). Under this scenario, the condition holds when , and Equation (5) becomes

(6)

Consider the case in which the number of slices $k$ diverges as the sample size grows. For example, when $k$ is determined by Scott's rule, one has $k = O(N^{1/3})$. In such cases, when the condition holds, Equation (6) indicates the proposed estimator has a smaller AMSE than the classical dacKRR estimator. Corollary 4.2 indicates that the parameter $k$ regulates the trade-off between the computational time and the AMSE. Specifically, a larger $k$ is associated with a longer computational time as well as a smaller AMSE.

Figure 3: Three different scenarios for the “skewness”. The solid bars are original data and the lined ones are over-sampled data.

5 Simulation results

To show the effectiveness of the proposed estimator, we compared it with the classical dacKRR estimator in terms of the mean squared error (MSE). We calculated the MSE for each of the estimators based on 100 replicates; that is, the MSE is the average, over the replications, of the squared estimation error of the fitted estimator against the true function. Throughout all the experiments in this section, we set the number of nodes $m$ to a fixed value. The Gaussian kernel function was used for the kernel ridge regression. We followed the procedure in Zhang et al. (2015) to select the bandwidth for the kernel and the regularization parameters. For the proposed method, we divided the range of the response into slices according to Scott's rule (Scott, 2015).
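For clarity, the sketch below shows one way to compute such a Monte Carlo MSE, assuming the fitted estimators are evaluated on a fixed grid of test points (the grid-based evaluation is our assumption).

```r
# Monte Carlo MSE over replications: average squared error of each fitted
# estimator (a function of a covariate vector) against the true function,
# evaluated on the rows of x_grid, then averaged over replications.
mse_over_reps <- function(f_true, fhat_list, x_grid) {
  truth <- apply(x_grid, 1, f_true)
  mean(sapply(fhat_list, function(fhat) {
    pred <- apply(x_grid, 1, fhat)
    mean((pred - truth)^2)
  }))
}
```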

Recall that in Algorithm 2, each observation is copied a certain number of times, such that the total numbers of observations in the slices are roughly equal to each other. A natural question is how the number of such copies affects the performance of the proposed method. In other words, if fewer copies are supplemented to the training data, would the proposed method perform better or worse? To answer this question, we introduced a constant to control the oversampling size. Specifically, for each slice, we duplicated the observations within the slice a scaled-down number of times relative to Algorithm 2, with the scaling governed by this constant, and we considered several values of the constant. This procedure is equivalent to Algorithm 2 when the constant equals one.
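A sketch of this controlled oversampling is given below; the name gamma for the scaling constant is ours, and gamma = 1 corresponds to Algorithm 2.

```r
# Controlled oversampling: each slice is replicated floor(gamma * n_max / n_s) times
# instead of floor(n_max / n_s) times; gamma = 1 recovers Algorithm 2.
oversample_indices <- function(slice, gamma = 1) {
  n_max <- max(table(slice))
  unlist(lapply(levels(slice), function(s) {
    idx <- which(slice == s)
    if (length(idx) == 0) return(integer(0))
    rep(idx, times = max(1, floor(gamma * n_max / length(idx))))
  }))
}
```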

Let be the -dimensional vector, such that all elements are equal to . We considered a function that has a highly skewed response, i.e.,

where , and represents the norm. We then simulated the data from Model (1) with , , and two different regression functions:

Uni-peak: , with ;

Double-peak: , with and .

Figure 4: Comparison of different estimators. Each row represents a different true function and each column represents a different .

Figure 4 shows the MSE versus different sample sizes under various settings. Each row represents a different true function, and each column represents a different setting. The classical dacKRR estimator and the full sample KRR estimator are labeled as blue lines and black lines, respectively. The proposed estimators are labeled as red lines, and the proposed method with the full amount of oversampling, i.e., Algorithm 2, is labeled as solid red lines. The vertical bars represent the standard errors obtained from 100 replications. In Fig. 4, we first observed that the classical dacKRR estimator, as expected, does not perform well. We then observed that the proposed estimators perform consistently better than the classical estimator. Such an observation indicates that when the response is highly skewed, oversampling the observations corresponding to the minority values of the response helps improve the estimation accuracy. Finally, we observed that a larger value of the oversampling constant is associated with better performance. In particular, as the number of copies increases, the proposed estimator tends to have faster convergence rates. This observation supports Corollary 4.2, which states that a larger number of observations after the oversampling process is associated with a smaller estimation MSE.

Figure 5: Comparison for different choices of ’s.

Besides Scott's rule, we also considered other rules, e.g., Sturges' formula (Sturges, 1926) and the Freedman-Diaconis choice (Freedman and Diaconis, 1981). Figure 5 compares the empirical performance of the proposed estimator w.r.t. different choices of $k$. For the setting of an unskewed response (the first row), all methods share similar performance. For the response-skewed settings (the second and the third rows), all three aforementioned rules yield better performance than a fixed $k$. Among all, Scott's rule gives the best results. The simulation results are consistent with Corollary 4.2, which indicates that the proposed estimator shows advantages over the classical one as long as the condition holds.
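The numbers of slices implied by the three rules can be compared directly with the built-in bin-count helpers in R (toy response for illustration only).

```r
y <- rexp(10000, rate = 2)          # toy skewed response (illustration only)
c(scott   = grDevices::nclass.scott(y),
  sturges = grDevices::nclass.Sturges(y),
  fd      = grDevices::nclass.FD(y))
```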

Besides the impact of $k$, we also studied the impact of the number of nodes $m$. In addition, we have also applied the proposed strategy to other nonparametric regression estimators, such as smoothing splines, under the divide-and-conquer setting. Such simulation results are provided in the Supplementary Material. These results indicated that the proposed method is robust to the choice of $m$ and has the potential to improve the estimation of other nonparametric regression estimators under the divide-and-conquer setting. It is also possible to consider unequally spaced slicing methods in the proposed method; however, such extensions are beyond the scope of this paper.

6 Real data analysis

We applied the proposed method to a real-world dataset called Melbourne-housing-market (data source: https://www.kaggle.com/anthonypino/melbourne-housing-market). The dataset includes housing clearance data in Melbourne from 2016 to 2017. Each observation represents a record for a sold house, including the total price and several predictors. Of interest is to predict the price per square meter (ranging from 20 to 30,000) for each sold house using the longitude and the latitude of the house, and the distance from the house to the central business district (CBD). Such a goal can be achieved by using kernel ridge regression.

Figure 6: Upper panel: testing MSE versus different number of nodes. Lower panel: CPU time (in seconds) versus different number of nodes. Vertical bars represent the standard errors.

We replicated the experiment one hundred times. In each replicate, we used stratified sampling to randomly select a portion of the observations as the testing set and used the remaining ones as the training set. The Gaussian kernel function was used for the kernel ridge regression. We set the number of nodes to a range of values. We followed the procedure in Zhang et al. (2015) to select the bandwidth for the kernel and the regularization parameters. The performance of different estimators was evaluated by the MSE of the prediction on the testing set. The experiment was conducted in R using a Windows computer with 16 GB of memory and a single-threaded 3.5 GHz CPU.
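A minimal sketch of such a stratified split is shown below; the number of slices and the test fraction in the sketch are illustrative placeholders rather than the values used in the experiment.

```r
# Stratified train/test split by response slices, so that both sets cover the
# skewed response range; k and test_frac are illustrative defaults.
stratified_split <- function(y, k = 10, test_frac = 0.2) {
  slice <- cut(y, breaks = k)
  test_idx <- unlist(lapply(levels(slice), function(s) {
    idx <- which(slice == s)
    if (length(idx) <= 1) return(integer(0))   # too few points in this slice to split
    sample(idx, size = max(1, round(test_frac * length(idx))))
  }))
  list(test = test_idx, train = setdiff(seq_along(y), test_idx))
}
```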

The upper panel of Fig. 6 shows the MSE versus the number of nodes, and the bottom panel shows the CPU time (in seconds) versus the number of nodes. To the best of our knowledge, our work is the first approach that addresses the response-skewed issue under the dacKRR setting, while other approaches only considered discrete-response settings or cannot be easily extended to the divide-and-conquer scenario. Thus we only compare our method with the classical dacKRR approach. In Fig. 6, the blue lines and the red lines represent the classical dacKRR approach and the proposed approach, respectively. Vertical bars represent the standard errors, which are obtained from one hundred replicates. The black lines represent the results of the full sample KRR estimate, and the gray dashed lines represent its standard errors. We can see that the proposed estimate consistently outperforms the classical dacKRR estimate. In particular, we observe that the MSE of the proposed estimate is comparable with the MSE of the full sample KRR estimate. The MSE of the classical dacKRR estimate, however, almost doubles that of the full sample KRR estimate when the number of nodes is greater than one hundred. The bottom panel of Fig. 6 shows that the CPU time of the proposed approach is comparable with that of the classical divide-and-conquer approach. All these observations are consistent with the findings in the previous section. Such observations indicate the proposed approach outperforms the classical dacKRR approach for response-skewed data without requiring too much extra computing time.

References

  • [1] A. A. Afifi, J. B. Kotlerman, S. L. Ettner, and M. Cowan (2007) Methods for improving regression analysis for skewed continuous or counted responses. Annu. Rev. Public Health 28, pp. 95–111.
  • [2] C. P. Chen and C. Zhang (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Information Sciences 275, pp. 314–347.
  • [3] J. Fan, D. Wang, K. Wang, and Z. Zhu (2019) Distributed estimation of principal eigenspaces. Annals of Statistics 47 (6), pp. 3009.
  • [4] D. Freedman and P. Diaconis (1981) On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57 (4), pp. 453–476.
  • [5] S. A. Geer and S. van de Geer (2000) Empirical Processes in M-estimation. Cambridge University Press.
  • [6] C. Gu and Y. Kim (2002) Penalized likelihood regression: general formulation and efficient approximation. Canadian Journal of Statistics 30 (4), pp. 619–628.
  • [7] C. Gu (2013) Smoothing Spline ANOVA Models. Springer Science & Business Media.
  • [8] Z. Guo, L. Shi, and Q. Wu (2017) Learning theory of distributed regression with bias corrected regularization kernel network. The Journal of Machine Learning Research 18 (1), pp. 4237–4261.
  • [9] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing (2017) Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 73, pp. 220–239.
  • [10] T. Hastie, R. Tibshirani, and J. Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
  • [11] T. Hastie (1996) Pseudosplines. Journal of the Royal Statistical Society. Series B (Methodological), pp. 379–396.
  • [12] H. He and E. A. Garcia (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), pp. 1263–1284.
  • [13] X. He, L. Shen, and Z. Shen (2001) A data-adaptive knot selection scheme for fitting splines. IEEE Signal Processing Letters 8 (5), pp. 137–139.
  • [14] N. Japkowicz and S. Stephen (2002) The class imbalance problem: a systematic study. Intelligent Data Analysis 6 (5), pp. 429–449.
  • [15] B. Krawczyk (2016) Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5 (4), pp. 221–232.
  • [16] S. Lin, X. Guo, and D. Zhou (2017) Distributed learning with regularized least squares. The Journal of Machine Learning Research 18 (1), pp. 3202–3232.
  • [17] M. Liu, Z. Shang, and G. Cheng (2018) How many machines can we use in parallel computing for kernel ridge regression? arXiv preprint arXiv:1805.09948.
  • [18] J. Lu, G. Cheng, and H. Liu (2016) Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212.
  • [19] Z. Luo and G. Wahba (1997) Hybrid adaptive splines. Journal of the American Statistical Association 92 (437), pp. 107–116.
  • [20] P. Ma, J. Z. Huang, and N. Zhang (2015) Efficient computation of smoothing splines via adaptive basis sampling. Biometrika 102 (3), pp. 631–645.
  • [21] L. Mackey, A. Talwalkar, and M. I. Jordan (2011) Divide-and-conquer matrix factorization. Advances in Neural Information Processing Systems 24.
  • [22] M. W. Mahoney (2016) Lecture notes on randomized linear algebra. arXiv preprint arXiv:1608.04481.
  • [23] C. L. Mallows (2000) Some comments on Cp. Technometrics 42 (1), pp. 87–94.
  • [24] D. McGuinness, S. Bennett, and E. Riley (1997) Statistical analysis of highly skewed immune response data. Journal of Immunological Methods 201 (1), pp. 99–114.
  • [25] C. Meng, X. Zhang, J. Zhang, W. Zhong, and P. Ma (2020) More efficient approximation of smoothing splines via space-filling basis selection. Biometrika 107, pp. 723–735.
  • [26] C. Musco and C. Musco (2017) Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems, pp. 3833–3845.
  • [27] A. Rahimi and B. Recht (2008) Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184.
  • [28] A. Rudi, R. Camoriano, and L. Rosasco (2015) Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems 28, pp. 1657–1665.
  • [29] D. Ruppert (2002) Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11 (4), pp. 735–757.
  • [30] D. W. Scott (2015) Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.
  • [31] Z. Shang and G. Cheng (2017) Computational limits of a distributed algorithm for smoothing spline. The Journal of Machine Learning Research 18 (1), pp. 3809–3845.
  • [32] J. Shawe-Taylor and N. Cristianini (2004) Kernel Methods for Pattern Analysis. Cambridge University Press.
  • [33] J. C. Sklar, J. Wu, W. Meiring, and Y. Wang (2013) Nonparametric regression with basis selection from multiple libraries. Technometrics 55 (2), pp. 189–201.
  • [34] I. Steinwart, D. R. Hush, and C. Scovel (2009) Optimal rates for regularized least squares regression. In COLT, pp. 79–93.
  • [35] H. A. Sturges (1926) The choice of a class interval. Journal of the American Statistical Association 21 (153), pp. 65–66.
  • [36] G. Wahba and P. Craven (1978) Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31, pp. 377–404.
  • [37] G. Wahba (1990) Spline Models for Observational Data. SIAM.
  • [38] S. Wang (2015) A practical guide to randomized matrix computations with MATLAB implementations. arXiv preprint arXiv:1505.07570.
  • [39] S. X. Wu, H. Wai, L. Li, and A. Scaglione (2018) A review of distributed algorithms for principal component analysis. Proceedings of the IEEE 106 (8), pp. 1321–1340.
  • [40] C. Xu, Y. Zhang, R. Li, and X. Wu (2016) On the feasibility of distributed kernel regression for big data. IEEE Transactions on Knowledge and Data Engineering 28 (11), pp. 3041–3052.
  • [41] D. Xu and Y. Wang (2018) Divide and recombine approaches for fitting smoothing spline models with large datasets. Journal of Computational and Graphical Statistics 27 (3), pp. 677–683.
  • [42] G. Xu, Z. Shang, and G. Cheng (2019) Distributed generalized cross-validation for divide-and-conquer kernel ridge regression and its asymptotic optimality. Journal of Computational and Graphical Statistics 28 (4), pp. 891–908.
  • [43] Y. Yuan, N. Chen, and S. Zhou (2013) Adaptive B-spline knot selection using multi-resolution basis set. IIE Transactions 45 (12), pp. 1263–1277.
  • [44] H. H. Zhang, G. Wahba, Y. Lin, M. Voelker, M. Ferris, R. Klein, and B. Klein (2004) Variable selection and model building via likelihood basis pursuit. Journal of the American Statistical Association 99 (467), pp. 659–672.
  • [45] J. Zhang, H. Jin, Y. Wang, X. Sun, P. Ma, and W. Zhong (2018) Smoothing spline ANOVA models and their applications in complex and massive datasets. Topics in Splines and Applications, pp. 63.
  • [46] T. Zhang (2005) Learning bounds for kernel regression using effective data dimensionality. Neural Computation 17 (9), pp. 2077–2098.
  • [47] Y. Zhang, J. Duchi, and M. Wainwright (2013) Divide and conquer kernel ridge regression. In Conference on Learning Theory, pp. 592–617.
  • [48] Y. Zhang, J. Duchi, and M. Wainwright (2015) Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research 16 (1), pp. 3299–3340.
  • [49] T. Zhao, G. Cheng, and H. Liu (2016) A partially linear framework for massive heterogeneous data. Annals of Statistics 44 (4), pp. 1400.