Under a big-data setting, the storage and analysis of data can no longer be performed on a single machine, and in this case dividing data into many sub-samples becomes a critical procedure for any numerical algorithm to be implemented. Distributed statistical estimation and distributed optimization have received increasing attention in recent years, and a flurry of research towards solving very large scale problems have emerged recently, such as Mcdonald et al. (2009); Zhang et al. (2013, 2015); Rosenblatt et al. (2016)
and the references therein. In general, distributed algorithm can be classified into two families: data parallelism and task parallelism. Data parallelism aims at distributing the data across different parallel computing nodes or machines; and task parallelism distributes different tasks across parallel computing nodes. We are only concerned with data parallelism in this paper. In particular, we primarily consider the distributed estimation for partially linear models via using the standard divide and conquer strategy.Divide-and-conquer technology is a simple and communication-efficient way for handling big data, which is commonly used in the literature of statistical learning. To be precise, the whole data is randomly allocated among machines, a local estimator is computed independently on each machine, and then the central node averages the local solutions into a global estimate.
Partially linear models (PLM) (Hardle and Liang, 2007; Heckman, 1986), as the leading example of semiparametric models, are a class of important tools for modeling complex data, which retain model interpretation and flexibility simultaneously. Given the observations , where is the response, and
are vectors of covariates, the partially linear models assume that
where is a vector of unknown parameters for linear terms, is an unknown function defined on a compact subset of , and ’s are independently standard normal variables. In the sparse setting, one often assumes that the cardinality of nonzero components of is far less than , that is, .
There is a substantial body of work focusing on the sparse setting for PLM, see, for example, Green et al. (1985); Wahba (1990); Hardle et al. (2007); Zhang et al. (2011); Lian et al. (2012); Wang et al. (2014), among others. Chen (1988) and Robinson (1988) showed that the parametric part can be estimated with parametric rates under appropriate conditions. Mammen et al. (1997) proved the linear part is asymptotically normal under a more general setting. These results are asymptotic and valid in the fixed dimensions, where the number of variables in the linear part is far less than the number of observations.
Although this paper is mainly concerned with data parallelism, which is practically useful in the setting, the size of each sub-sample that is allocated to each node may be less than with a large number of nodes (). So the high dimensional issue has been endowed with additional implications under the data parallelism setting. Compared to linear models or nonparametric additive models, the high dimensional case for studying PLM with is more challenging, mainly because of the correlation and interaction effect between covariates in the linear part and covariates in the nonparametric part. Under the high dimensional framework, a commonly-used approach is to construct penalized least squares estimation with a double penalty terms, using a smoothing functional norm to control complexity of the nonparametric part and a shrinkage penalty on parametric components to achieve model parsimony. To build each individual estimator before merging data, we consider a double-regularized approach with the Lasso penalty and a reproducing kernel Hilbert space (RKHS) norm, given by
where is the -th subsample with data size , and are two tuning parameters. Here we consider a function from a RKHS denoted by , endowed with the norm . The kernel function defined on and is determined by each other.
With diverging dimensions in the linear part, there is a rich literature on penalized estimation for PLM in the last decade. Xie et al. (2009) proposed the SCAD-penalized estimators of the linear coefficients, and achieved estimation consistency and variable selection consistency for the linear and nonparametric components. Similar to (1.2), Ni et al. (2009) formulated a double-penalized least squares approach, using the smoothing spline to estimate the nonparametric part and the SCAD to conduct variable selection. It is shown that the proposed procedure can be as efficient as the oracle estimator. Recently, Wang et al. (2014) proposed a new doubly penalized procedure for variable selection with respect to both linear and additive components, where the numbers of linear covariates and nonlinear components both diverges as the sample size increases. All these aforementioned papers focus on the case where is relatively small compared to .
Allowing for and even
, recent years has witnessed several related research in terms of non-asymptotic analysis for the sparse PLM. In particularity,Muller and van de Geer (2015) theoretically analyzed the penalized estimation (1.2), and proved that the parametric part in PLM achieves the optimal rates of the linear models, as if the nonparametric component were known in advance. More recently, Zhu (2017) considered a two-step approach for estimation and variable selection in PLM. The first step uses nonparametric regression to obtain the partial residuals, and the second step is an -penalized least squares estimator (the Lasso) to fit the partial residuals from the first step. Like Muller and van de Geer (2015), they derived the optimal non-asymptotic oracle results for the linear estimator. As shown in the distributed learning literature (Lee et al., 2015; Zhang et al., 2015), the optimal estimation of each local estimate is very critical to derive the optimal non-asymptotic results of the averaging estimate. However, we find that the above mentioned results cannot be used directly in our case. In fact, the optimal oracle results established in Muller et al. (2015) can be achieved only when the decay of the tuning parameter is not faster than , where is a nonparametric complexity. This constraints on makes it impossible to obtain the optimal rates of the averaging estimate. Zhu (2017) considered a two-step approach for PLM which differs from our one-step penalized estimation. To this end, under more relaxed constraints on the tuning parameters, we prove theoretical results for both the prediction and the estimation error.
In this paper, we study the aforementioned distributed estimation by allowing the dimension of , to exceed . Although distributed estimation on linear models (Zhang et al., 2013; Lee et al., 2015; Battey et al., 2015) and on fully nonparametric models (Zhang et al., 2015; Lin et al., 2016) have been well-understood, the investigation on distributed estimation for PLM is more challenging and there are very few works on this (Zhao et al., 2016). First, it is known that the Lasso penalty and the functional norm in (1.2
) lead to a heavily biased estimation, and naive averaging only reduces the variance of the local estimators, but has no effect on the bias(Mcdonald et al., 2009; Wu, 2017). Moreover, in the diverging dimensional setting (i.e. ), Rosenblatt and Nadler (2016) showed that the averaged empirical risk minimization (ERM) is suboptimal versus the centralized ERM. So debiasing is essential to improving accuracy of the averaging estimate. In addition, the significant influences of high dimensions and correlated covariates from the parametric and nonparametric components will result in additional technical difficulties.
Our main contribution to this line of research consists of the following two aspects. Our first contribution is to analyze the double-regularized least squares method (1.2) for estimating sparse PLM and provide upper bounds on the parametric part and the nonparametric part respectively. In particular, when the linear coefficients are restricted to a bounded domain, the optimal oracle rates of the parametric estimator can be achieved using a broad range of values of , which thereby yields the corresponding optimal bounds for the debiased averaging estimator. Note that from the proof on the debiased averaging estimation, it is seen that the approximation ability of the nonparametric kernel-based estimation significantly affects estimation errors of the averaging parametric estimator. Only by choosing a sufficiently small , the averaging estimate can achieve the same rates as the centralized M-estimator based on the entire data. This is why the mentioned previous results (Muller and van de Geer, 2015) are not applicable in our analysis. Even without any boundedness restriction of , the optimal rates of the nonparametric estimation are still obtained, while in this case we only obtain suboptimal rates for the parametric part.
Our second contribution is to propose a debiased distributed estimation for the sparse PLM under the big-data setting, since averaging cannot reduce the estimation bias in contrast to the local estimators. To our knowledge, this is the first work that considers distributed problems on the high dimensional PLM. In fact, our study in this paper is related to the previous work (Zhao et al., 2016)
, where they considered the naive averaging strategy for the PLM with non-sparse coefficients and fixed dimensions. So their work differs from the current paper in terms of problem setup and methodological strategy. Although the debiasing technology has been employed for the sparse linear regression(Lee et al., 2015), analyzing the debiased distributed estimation for PLM is more challenging, mainly because the approximation ability of the nonparametric component affects error level of the averaging estimation. To handle this problem, we apply some abstract operator theory to provide upper bounds of this approximation error in RKHS. By contrast, some existing related results (Wahba, 1990; Ni et al., 2009) depend on some strong assumptions on data sampling, for example, ’s are deterministically drawn from such that , where is a continuous and positive function. From our simulated results, we see the estimation error of the averaging debiased parametric estimator is comparable to that of the centralized M-estimator, while that of the naive averaged parametric estimator is much worse.
The rest of the paper is organized as follows. In Section 2, we provide some background on kernel spaces and propose a debiased averaging parametric estimator based on each local estimate (1.2). Section 3 is devoted to the statement of our main results and discussion of their consequences. In Section 4, for each local estimation, we present general upper bounds on the estimation error of the parametric and nonparametric parts, respectively. Section 5 contains the technical details, and some useful lemmas are deferred to the Appendix. Some simulation experiments are reported in Section 6, and we conclude in Section 7.
Notations. In the following, for a vector , we use , to represent and -norm in the Euclidean space respectively, and also . For a matrix , denotes the spectral norm. For a function defined on and a given data drawn from the underlying distribution defined on , let be the -norm for any square-integrable function . We use to denote the empirical -norm, i.e. . For sequences , means that there is some constant , such that , and means that for an absolute constant
with probability approaching one. The symbolswith various subscripts are used to denote different constants. For , we write .
2 Background and The Proposed Estimator
We begin with some background on RKHSs, and then formulate a profiled Lasso-type approach equivalent to the double-regularized one (1.2). Based on a gradient-induced debiasing and an estimate to approximate the inverse weighted covariate matrix, we propose the debiased averaging parametric estimator for PLM.
2.1 Reproducing kernel Hilbert space
Given a subset and a probability measure , we define a symmetric non-negative kernel function , associated with the RKHS of functions from to . The Hilbert space and its dot product are such that (i) for any , ; (ii) the reproducing property holds, e.g. for all . It is known that the kernel function is determined by (Aronszajn, 1950). Without loss of generality, we assume that , and such a condition includes the Gaussian kernel and the Laplace kernel as special cases.
The reproducing property of RKHS plays an important role in theoretical analysis and numerical optimization for any kernel-based method. Specially, this property implies that for all . Moreover, by Mercer’s theorem, a kernel defined on a compact subset admits the following eigen-decomposition:
are the eigenvalues andis an orthonormal basis in . The decay rate of fully characterizes the complexity of the RKHS induced by the kernel , and has close relationships with various entropy numbers, see Steinwart and Christmann (2008) for details. Based on this, we define the quantity:
Let be the smallest positive solution to the inequality: Then, we introduce the following quantity related to the convergence rates of nonparametric estimation:
2.2 The Debiased Local Estimator
For the -th machine, define , , and . is a semi-definite matrix whose entries are . The partially linear model (1.1) can then be written as . By the reproducing property of RKHS (Aronszajn, 1950), the nonparametric minimizer of programme (1.2) has the form and particularly . Hence we can write as
Given and , the first order optimality condition for convex optimization (Boydet and Vandenberghe, 2004) yields the solution
and the quadratic term in is called the profiled least squares in the literature. Note that is a nonnegative definite smoothing matrix.
Since the gradient vector of the empirical risk at is. Then, borrowing the debiasing idea in Javanmard and Montanari (2014) and in particular by adding a term proportional to the subgradient of the empirical risk for debiasing, the debiased estimator from the -th subsample with respect to the Lasso estimator is given by
where is an approximate inverse to the weight empirical covariance matrix on the design matrix . Note that we drop the dependence on of for simplicity.
can be regarded as a new design matrix through a linear transformation of the original design matric. Note that the choice of is crucial to the performance of the debiased estimator, and some feasible algorithms for forming has been proposed recently by Cai et al. (2011), Javanmard and Montanari (2014) and van der Geer et al. (2014). Thus, the averaged parametric estimator by combining the debiased estimators from all the subsamples is given by
In this paper, we employ an estimator for forming proposed by van der Geer et al. (2014): nodewise regression on the predictors. More precisely, for some , the -th machine solves
where is less its -th column . Then we can define a non-normalized covariance matrix by
where is the -th element of , indexed by . To scale the rows of , we define a diagonal matrix by
and based on this, we form .
In order to show that is a approximate inverse of , the first order optimality conditions of (2.5) is applied to yield
Let be the -th row of , according to the definition of , it follows from the above equality that
and the optimality condition of (2.5) is applied again to get
Finally, we have
We remark that forming is -times as expensive as solving the local Lasso problem, which is the most expensive step of evaluating the averaging estimator. To this end, we could consider an estimator only using a common for all the local estimators in the following way. To reduce computational cost, we assign the task of computing rows of to each local machine. Then each machine sends rows of computed by (2.5) to the central server, as well as sending and . Here we use different for different machine merely for convenience of presentation and implementation.
In addition, by (2.2), we see that the nonparametric solution has a closed form in terms of the parametric coefficients, and the distributed estimation of the nonparametric components in (1.2) can could also be formulated. In general, the parametric estimation in PLM is more difficult than the nonparametric part, requiring more refined theoretical analysis.
3 Theoretical Results
In this section, we first present several assumptions used in our theoretical analysis and introduce further notations. Thereafter, the assumptions are explained explicitly and the main results are stated.
Assumption A (model specification). For partially linear model (1.1), we assume that (i) is sparse with sparsity , and the nonparametric component is a multivariate zero-mean function in the RKHS defined pn ; (ii) the noise terms ’s are independent normal variables, and also uncorrelated with covariates .
Assumption B. (i) The covariate in is bounded uniformly, that is, for all and . (ii) The largest eigenvalue of the covariance matrix is finite, denoted by .
Gaussian error assumption is a quite strong, but standard in the literature. In general, this condition can be easily relaxed to sub-Gaussian errors. It is worth noting that we allow correlations between and
. The assumption that the target function belongs to the RKHS is frequently used in machine learning and statistics literature, seeSteinwart and Christmann (2008); Raskutti et al. (2012); Zhao et al. (2016) and many others. A bound on the -values is a more restrictive assumption than the sub-Gaussian tails, and we use it for technical reasons.
Throughout this paper, we assume that the nonparametric minimizer in (1.2) can be found in the bounded ball with some constant . This assumption is standard in the literature of nonparametric estimation in RKHS (Raskutti et al., 2012). To estimate the parametric and nonparametric parts respectively, we need some conditions concerning correlations between and . For each , let be the projection of onto . That is, with
We write and . In the extreme case, when is uncorrelated with , we get . The following condition ensures that there is enough information in the data to identify the parametric coefficients, which has been imposed in Yu et al. (2011); Muller et al. (2015).
Assumption C. (i) The smallest eigenvalue of is positive. (ii) The largest eigenvalue of is finite with high probability, denoted by .
By the definition of projection and the reproducing property of RKHS, we see that for any .
We are now ready to state our main results concerning the debiased averaging estimator (2.4). It indicates that the averaging estimator achieves the convergence rate of the centralized double-regularized estimator in some cases, as long as the dataset is not split across too many machines.
Suppose that Assumptions A-C hold. Under the constraint set , we consider the debiased averaging parametric estimator defined by (2.4), with suitable parameters constraint: for any ,
Let in (2.5), we have
with probability at least
In particular, with the choices of , and , we can choose the number of machines by , such that
We remark that provided is taken as some constant, Theorem 1 shows that the averaging estimator achieves the statistical minimax rate over all estimators using the set of samples, as if there is no information loss induced by data partitioning. The constraint has also been imposed in some existing works, for example in Fan and Lv (2013); Yuan et al. (2017). The result in Theorem 1 also indicates that the choice of the number of subsampled datasets does not rely on the sparsity parameter or the nonparametric complexity , which means that is adaptive to the parameters and .
On the other hand, we can derive a rough upper bound of from the minimization problem (1.2) without contraint. However, as shown below, may be diverging as increases.
Suppose that Assumptions A-C hold and for all . Consider the double-regularized estimator defined by (1.2), such that When , with probability at least , it holds that
As an illustration, for the RKHS with infinitely many eigenvalues which decay at a rate for some parameter . This type of scaling covers the case of Sobolev spaces and Besov spaces. In this case we can check that . When is less than exponential in , dominates and this implies that . Thus, we have by choosing , which is suboptimal in the minimax sense. We remark that the main challenge to deriving the optimal minimax rates for our averaging parametric estimator comes from the choice of , and existing results in the literature are not applicable to our case.
4 Estimation on Local Estimators
This section provides general upper bounds on and for the standard PLM (1.1). The novelty of our results lies in that the tuning parameter is allowed to be in a broader range than typically assumed in the literature, and this requirement is essential to derive sharp error bounds of our averaging parametric estimator. As a by-product, we also obtain the optimal nonparametric rates for each local estimate of (1.2) in the minimax sense.
The proof of Theorem 2, contained in Section 5, constitutes one main technical contribution of this paper. From the result of Theorem 2, it is seen that the achieved rate is affected by the bound on . There is some related work assuming directly that is bounded, see Fan and Lv (2013); Yuan et al. (2017). This case easily yields the optimal oracle rates of the parametric estimator. In general, there is no restriction on in most of the literature. In the latter setting, a standard technical proof is to first construct some event, and then prove that the desired results hold under the event, as well as that the event occurs with high probability, see Muller and van de Geer (2015) for details. However, their results require the lower bound of , which further hinders the optimal rates of the averaging estimator. We also notice that, the restricted strong convexity (Raskutti et al., 2012) and restricted eigenvalues constants (Bickel et al., 2009) are two key tools to derive sharp error bounds of the oracle results. These mentioned techniques can be used for the two-step estimation for PLM (Zhu, 2017), nevertheless it seems quite difficult to apply for our one-step approach.
The following proposition provides an upper bound of , which together with Theorem 2 can yield the rates of the parametric estimator even without the constraint.
Suppose that for all . When is satisfied, then with probability at least ,
Suppose that Assumptions A-C hold and for all . We consider the double-regularized estimator defined by (1.2), such that
Then, with probability at least , it holds that
This theorem establishes the consistency of the double-regularized estimator of the parametric part in (1.2). Note that, the smoothness index of RKHS affects the rate of convergence of the parametric estimator, while the nonparametric estimator has the minimax optimal rate in the nonparametric literature.
When is set to be a fixed constant, we have the following result.
Suppose that Assumptions A-C hold and in Theorem 2 is an absolute constant. Consider the double-regularized estimator by choosing
Then we have that
In this section, we provide the proofs of Theorems 1-4, as well as the bound based on the minimization problem in (1.2). We present a detailed proof, deferring some useful lemmas to the appendices. First of all, we give the proof of each local parametric estimate, which is one of key ingredients for obtaining the oracle rates of the averaging estimator based on the entire data.
5.1 Proof for Parametric Estimator
In this section, we focus on theoretical analysis on each local machine () in (1.2). For all the symbols and numbers, we drop their dependence on for notational simplicity. In what following, we write and , and particularly and .
Since and by sparsity assumption in model (1.1)?b by the triangle inequality, the last inequality further implies that
First, since are independent of and they are bounded and Gaussian respectively, and is sub-Gaussian, whose sub-Gaussian norms are upper bounded by . By the Hoffding-type inequality in Lemma 4 and the union bounds, we have
with probability at least .
Consider the term