# Sparse Model Identification and Learning for Ultra-high-dimensional Additive Partially Linear Models

The additive partially linear model (APLM) combines the flexibility of nonparametric regression with the parsimony of linear regression models, and has been widely used as a popular tool in multivariate nonparametric regression to alleviate the "curse of dimensionality". A natural question raised in practice is the choice of structure in the nonparametric part, that is, whether the continuous covariates enter into the model in linear or nonparametric form. In this paper, we present a comprehensive framework for simultaneous sparse model identification and learning for ultra-high-dimensional APLMs, where the numbers of both linear covariates and nonparametric components may exceed the sample size. We propose a fast and efficient two-stage procedure. In the first stage, we decompose the nonparametric functions into a linear part and a nonlinear part. The nonlinear functions are approximated by constant spline bases, and a triple penalization procedure is proposed to select nonzero components using adaptive group LASSO. In the second stage, we refit data with selected covariates using higher order polynomial splines, and apply spline-backfitted local-linear smoothing to obtain asymptotic normality for the estimators. The procedure is shown to be consistent for model structure identification. It can identify zero, linear, and nonlinear components correctly and efficiently. Inference can be made on both linear coefficients and nonparametric functions. We conduct simulation studies to evaluate the performance of the method and apply the proposed method to a dataset on the Shoot Apical Meristem (SAM) of maize genotypes for illustration.


## 1 Introduction

In the past three decades, flexible and parsimonious additive partially linear models (APLMs) have been extensively studied and widely used in many statistical applications, including biology, econometrics, engineering, and social science. Examples of recent work on APLMs include Liang et al. (2008), Liu et al. (2011), Ma and Yang (2011), Wang et al. (2011), Ma et al. (2013), Wang et al. (2014) and Lian et al. (2014). APLMs are natural extensions of classical parametric models with good interpretability and are becoming increasingly popular in data analysis.

Suppose we observe $\{(Y_i, \mathbf{Z}_i, \mathbf{X}_i)\}_{i=1}^n$. For subject $i$, $Y_i$ is a univariate response, $\mathbf{Z}_i = (Z_{i1}, \ldots, Z_{ip_1})^\top$ is a $p_1$-dimensional vector of covariates that may be linearly associated with the response, and $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip_2})^\top$ is a $p_2$-dimensional vector of continuous covariates that may have nonlinear associations with the response. We assume $\{(Y_i, \mathbf{Z}_i, \mathbf{X}_i)\}_{i=1}^n$ is an i.i.d. sample from the distribution of $(Y, \mathbf{Z}, \mathbf{X})$, satisfying the following model:

$$Y_i = \mu + \mathbf{Z}_i^\top \boldsymbol{\alpha} + \sum_{\ell=1}^{p_2} \phi_\ell(X_{i\ell}) + \varepsilon_i = \mu + \sum_{k=1}^{p_1} Z_{ik}\alpha_k + \sum_{\ell=1}^{p_2} \phi_\ell(X_{i\ell}) + \varepsilon_i, \qquad (1)$$

where $\mu$ is the intercept, $\alpha_k$, $k = 1, \ldots, p_1$, are unknown regression coefficients, $\phi_\ell$ are unknown smooth functions, and each $\phi_\ell$ is centered with $E\{\phi_\ell(X_\ell)\} = 0$ to make model (1) identifiable. The $\mathbf{X} = (X_1, \ldots, X_{p_2})^\top$ is a $p_2$-dimensional vector of zero-mean covariates having density with a compact support. Without loss of generality, we assume that each covariate $X_\ell$ can be rescaled into an interval $[a, b]$. The $\varepsilon_i$ terms are i.i.d. random errors with mean zero and variance $\sigma^2$.

The APLM is particularly convenient when $\mathbf{Z}$ is a vector of categorical or discrete variables; in this case, the components of $\mathbf{Z}$ enter the linear part of model (1) automatically, and the continuous variables usually enter the model nonparametrically. In practice, we might have reasons to believe that some of the continuous variables should enter the model linearly rather than nonparametrically. A natural question is how to determine which continuous covariates have a linear effect and which have a nonlinear effect. If the choice of linear components is correctly specified, then the biases in the estimation of these components are eliminated and root-$n$ convergence rates can be obtained for the linear coefficients. However, such prior knowledge is rarely available, especially when the number of covariates is large. Thus, structure identification, or linear and nonlinear detection, is an important step in the process of building an APLM from high-dimensional data.

When the number of covariates in the model is fixed, structure identification in additive models (AMs) has been studied in the literature. Zhang et al. (2011) proposed a penalization procedure to identify the linear components in AMs in the context of smoothing splines ANOVA. They demonstrated the consistency of the model structure identification and established the convergence rate of the proposed method specifically under the tensor product design.

Huang et al. (2012b) proposed another penalized semiparametric regression approach using a group minimax concave penalty to identify the covariates with linear effects. They showed consistency in determining the linear and nonlinear structure in covariates, and obtained the convergence rate of nonlinear function estimators and asymptotic properties of linear coefficient estimators; but they did not perform variable selection at the same time.

For high-dimensional AMs, Lian et al. (2015) proposed a double penalization procedure to distinguish covariates that enter the nonparametric and parametric parts and to identify significant covariates simultaneously. They demonstrated the consistency of the model structure identification, and established the convergence rate of nonlinear function estimators and asymptotic normality of linear coefficient estimators. Despite the nice theoretical properties, their method heavily relies on the local quadratic approximation in Fan and Li (2001), which is incapable of producing naturally sparse estimates. In addition, employing the local quadratic approximation can be extremely expensive because it requires the repeated factorization of large matrices, which becomes infeasible when the number of covariates is very large.

Note that all the aforementioned papers (Zhang et al., 2011; Huang et al., 2012b; Lian et al., 2015) about structure identification focus on the AM with continuous explanatory variables. However, in many applications, a canonical partitioning of the variables exists. In particular, if there are categorical or discrete explanatory variables, as in the case of the SAM data studies (see the details in Section 5) and in many genome-wide association studies, we may want to keep discrete explanatory variables separate from the other design variables and let discrete variables enter the linear part of the model directly. In addition, if there is some prior knowledge of certain parametric forms for some specific covariates, such as a linear form, we may lose efficiency if we simply model all the covariates nonparametrically.

The above practical and theoretical concerns motivate our further investigation of the simultaneous variable selection and structure selection problem for flexible and parsimonious APLMs, in which the features of the data suitable for parametric modeling are modeled parametrically, and nonparametric components are used only where needed. We consider the setting where both the dimension of the linear components and the dimension of the nonlinear components are ultra-high. We propose an efficient and stable penalization procedure for simultaneously identifying linear and nonlinear components, removing insignificant predictors, and estimating the remaining linear and nonlinear components. We prove that the proposed Sparse Model Identification, Learning and Estimation (referred to as SMILE) procedure is consistent. We propose an iterative group coordinate descent approach to solve the penalized minimization problem efficiently. Our algorithm is very easy to implement because it only involves simple arithmetic operations with no complicated numerical optimization steps, matrix factorizations, or inversions. In one simulation example, the entire model identification and variable selection process takes less than one minute on a regular PC.
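To make the "simple arithmetic operations" concrete: the core of a group coordinate descent step for a group-penalized least-squares objective is a closed-form group soft-threshold. The following Python sketch illustrates the idea on a design whose groups are orthonormalized; it is an illustration of the general technique under that assumption, not the authors' implementation, and all function names are ours.

```python
import numpy as np

def group_soft_threshold(u, t):
    """Shrink the vector u toward zero by t in Euclidean norm (exactly zero if ||u|| <= t)."""
    norm = np.linalg.norm(u)
    if norm <= t:
        return np.zeros_like(u)
    return (1.0 - t / norm) * u

def group_coordinate_descent(y, X_groups, lams, n_iter=50):
    """Cycle over coefficient groups for
        0.5 * ||y - sum_g X_g c_g||^2 + sum_g lam_g * ||c_g||.
    Assumes each group's columns are orthonormalized so that X_g^T X_g = n * I;
    then every block update is a closed-form group soft-threshold, involving
    only matrix-vector products -- no factorizations or inversions."""
    n = len(y)
    coefs = [np.zeros(X.shape[1]) for X in X_groups]
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for g, Xg in enumerate(X_groups):
            resid += Xg @ coefs[g]          # add group g back into the residual
            u = Xg.T @ resid / n            # unpenalized fit for group g alone
            coefs[g] = group_soft_threshold(u, lams[g] / n)
            resid -= Xg @ coefs[g]          # subtract the updated contribution
    return coefs
```

In the triple-penalized problem studied here, each scalar coefficient and each spline coefficient vector plays the role of one group (groups of size one for the scalar terms), with adaptive weights folded into the group-specific penalty levels.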

After variable selection and structure detection, we would like to provide an inferential tool for the linear and nonparametric components. The spline method is fast and easy to implement; however, its rate of convergence is established only in the mean squared sense, and there is no asymptotic distribution or uniform convergence result, so no measures of confidence can be assigned to the estimators. In this paper, we propose a two-step spline-backfitted local-linear smoothing (SBLL) procedure for APLM estimation, model selection and simultaneous inference for all the components. In the first stage, we approximate the nonparametric functions $\phi_\ell$, $\ell = 1, \ldots, p_2$, with undersmoothed constant spline functions. We perform model selection for the APLM using a triple penalized procedure to select important variables and identify the linear vs. nonlinear structure for the continuous covariates, which is crucial to obtain efficient estimators for the non-zero components. We show that the proposed model selection and structure identification for both parametric and nonparametric terms are consistent, and the estimators of the nonzero linear coefficients and nonzero nonparametric functions are both $L_2$-norm consistent. In the second stage, we refit the data with covariates selected in the first step using higher-order polynomial splines to achieve root-$n$ consistency of the coefficient estimators in the linear part, and apply a one-step local-linear backfitting to the projected nonparametric components obtained from the refitting. Asymptotic normality for both linear coefficient estimators and nonlinear component estimators, as well as simultaneous confidence bands (SCBs) for all nonparametric components, are provided.

The rest of the paper is organized as follows. In Section 2, we describe the first-stage spline smoothing and propose a triple penalized regularization method for simultaneous model identification and variable selection. The theoretical properties of selection consistency and rates of convergence for the coefficient estimators and nonparametric estimators are developed. Section 3 introduces the spline-backfitted local-linear estimators and SCBs for the nonparametric components. The performance of the estimators is assessed by simulations in Section 4 and illustrated by application to the SAM data in Section 5. Some concluding remarks are given in Section 6. Section A of the online Supplemental Materials evaluates the effect of different smoothing parameters on the performance of the proposed method. Technical details are provided in Section B of the Supplemental Materials.

## 2 Methodology

### 2.1 Model Setup

In the following, the functional form (linear vs. nonlinear) of each continuous covariate in model (1) is assumed to be unknown. In order to decide the form of $\phi_\ell$, for each $\ell = 1, \ldots, p_2$, we can decompose $\phi_\ell$ into a linear part and a nonlinear part: $\phi_\ell(x_\ell) = \beta_\ell x_\ell + g_\ell(x_\ell)$, where $g_\ell$ is some unknown smooth nonlinear function (see Assumption (A1) in Appendix E.1). For model identifiability, we assume that $E\{\phi_\ell(X_\ell)\} = 0$, $E\{g_\ell(X_\ell)\} = 0$, and $E\{X_\ell g_\ell(X_\ell)\} = 0$. The first two constraints, $E\{\phi_\ell(X_\ell)\} = 0$ and $E\{g_\ell(X_\ell)\} = 0$, are required to guarantee identifiability for the APLM. The constraint $E\{X_\ell g_\ell(X_\ell)\} = 0$ ensures there is no linear form in the nonlinear function $g_\ell$. Note that these constraints are also in accordance with the definition of the nonlinear contrast space in Zhang et al. (2011), which is a subspace of the orthogonal decomposition of an RKHS. In the following, we assume the $X_{i\ell}$ values are centered so that we can express the APLM in (1) without an intercept parameter as

$$Y_i = \sum_{k=1}^{p_1} Z_{ik}\alpha_k + \sum_{\ell=1}^{p_2} X_{i\ell}\beta_\ell + \sum_{\ell=1}^{p_2} g_\ell(X_{i\ell}) + \varepsilon_i. \qquad (2)$$

In the following, we define the predictor variable $Z_k$ as irrelevant in model (2) if and only if $\alpha_k = 0$, and $X_\ell$ as irrelevant if and only if $\beta_\ell = 0$ and $g_\ell(x_\ell) = 0$ for all $x_\ell$ on its support. A predictor variable is defined as relevant if and only if it is not irrelevant. Suppose that only an unknown subset of predictor variables is relevant. We are interested in identifying such subsets of relevant predictors consistently while simultaneously estimating their coefficients and/or functions.

For covariates $\mathbf{Z}$, we define

$$\text{Active index set for } \mathbf{Z}:\ S_z = \{k = 1, \ldots, p_1 : \alpha_k \neq 0\}, \qquad \text{Inactive index set for } \mathbf{Z}:\ N_z = \{k = 1, \ldots, p_1 : \alpha_k = 0\}.$$

For a continuous covariate $X_\ell$, we say it is a linear covariate if $\beta_\ell \neq 0$ and $g_\ell(x_\ell) = 0$ for all $x_\ell$ on its support, and a nonlinear covariate if $g_\ell \not\equiv 0$. Explicitly, we define the following index sets for $\mathbf{X}$:

$$\begin{aligned} \text{Active pure linear index set for } \mathbf{X}:&\ \ S_{x,PL} = \{\ell = 1, \ldots, p_2 : \beta_\ell \neq 0,\ g_\ell \equiv 0\}, \\ \text{Active nonlinear index set for } \mathbf{X}:&\ \ S_{x,N} = \{\ell = 1, \ldots, p_2 : g_\ell \not\equiv 0\}, \\ \text{Inactive index set for } \mathbf{X}:&\ \ N_x = \{\ell = 1, \ldots, p_2 : \beta_\ell = 0,\ g_\ell \equiv 0\}. \end{aligned}$$

Note that the active nonlinear index set for $\mathbf{X}$, $S_{x,N}$, can be decomposed as $S_{x,N} = S_{x,LN} \cup S_{x,PN}$, where $S_{x,LN}$ is the index set for covariates whose linear and nonlinear terms in (2) are both nonzero, and $S_{x,PN}$ is the active pure nonlinear index set for $\mathbf{X}$.

Therefore, the model selection problem for model (2) is equivalent to the problem of identifying the index sets $S_z$, $S_{x,PL}$, $S_{x,LN}$, $S_{x,PN}$, and $N_x$. To achieve this, we propose to minimize

$$\sum_{i=1}^n \Big\{Y_i - \sum_{k=1}^{p_1} Z_{ik}\alpha_k - \sum_{\ell=1}^{p_2} X_{i\ell}\beta_\ell - \sum_{\ell=1}^{p_2} g_\ell(X_{i\ell})\Big\}^2 + \sum_{k=1}^{p_1} p_{\lambda_{n1}}(|\alpha_k|) + \sum_{\ell=1}^{p_2} p_{\lambda_{n2}}(|\beta_\ell|) + \sum_{\ell=1}^{p_2} p_{\lambda_{n3}}(\|g_\ell\|_2), \qquad (3)$$

where $\|g_\ell\|_2 = [E\{g_\ell(X_\ell)\}^2]^{1/2}$, and $p_{\lambda_{n1}}(\cdot)$, $p_{\lambda_{n2}}(\cdot)$ and $p_{\lambda_{n3}}(\cdot)$ are penalty functions explained in detail in Section 2.3. The tuning parameters $\lambda_{n1}$, $\lambda_{n2}$ and $\lambda_{n3}$ decide the complexity of the selected model. The smoothness of the predicted nonlinear functions is controlled by the spline approximation described in Section 2.2, and $\lambda_{n1}$, $\lambda_{n2}$ and $\lambda_{n3}$ go to $0$ as $n$ increases to $\infty$.

### 2.2 Spline Basis Approximation

We approximate the smooth functions $g_\ell$ in (2) by polynomial splines for their simplicity in computation. For example, for each $\ell = 1, \ldots, p_2$, let $a = v_{\ell,0} < v_{\ell,1} < \cdots < v_{\ell,N_n} < v_{\ell,N_n+1} = b$ be knots that partition $[a, b]$ into $N_n + 1$ subintervals. The space of polynomial splines of order $d$ consists of functions $s$ satisfying: (i) the restriction of $s$ to each subinterval $[v_{\ell,J}, v_{\ell,J+1})$, $J = 0, \ldots, N_n$, is a polynomial of degree $d - 1$ (or less); (ii) for $d \geq 2$, $s$ is $(d-2)$ times continuously differentiable on $[a, b]$. Below we denote by $B^{(d)}_{J,\ell}$ the basis functions of this spline space.

To ensure the identifiability constraints on the spline approximants of $g_\ell$ hold, we consider the following normalized first-order B-splines, referred to as piecewise constant splines. We define for any $x_\ell \in [a, b]$ the piecewise constant B-spline function $I_{J,\ell}$ as the indicator function of the $J$-th of the $N_n + 1$ equally spaced subintervals of $[a, b]$ with length $H = (b - a)/(N_n + 1)$, that is,

$$I_{J,\ell}(x_\ell) = \begin{cases} 1, & a + JH \le x_\ell < a + (J+1)H, \\ 0, & \text{otherwise}, \end{cases} \qquad J = 0, 1, \ldots, N_n,$$

with the last subinterval closed at $b$.

Define the following centered spline basis

$$b^{(1)}_{J,\ell}(x_\ell) = I_{J,\ell}(x_\ell) - \frac{\|I_{J,\ell}\|_2^2}{\|I_{J-1,\ell}\|_2^2}\, I_{J-1,\ell}(x_\ell), \qquad \forall\, J = 1, \ldots, N_n,\ \ell = 1, \ldots, p_2,$$

with the standardized version given for any $x_\ell \in [a, b]$,

$$B^{(1)}_{J,\ell}(x_\ell) = \frac{b^{(1)}_{J,\ell}(x_\ell)}{\|b^{(1)}_{J,\ell}\|_2}, \qquad \forall\, J = 1, \ldots, N_n. \qquad (4)$$

So $E\{B^{(1)}_{J,\ell}(X_\ell)\} = 0$ and $\|B^{(1)}_{J,\ell}\|_2 = 1$, for $J = 1, \ldots, N_n$, $\ell = 1, \ldots, p_2$. In practice, we use the empirical distribution of $X_\ell$ to perform the centering and scaling in the definitions of $b^{(1)}_{J,\ell}$ and $B^{(1)}_{J,\ell}$.
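As an illustration, the empirical centering and scaling just described can be sketched in Python as follows. This is a hypothetical helper written for this exposition (not the authors' code), assuming equally spaced knots on $[a,b]$; for indicator functions the empirical squared $L_2$ norm of a column is simply its column mean, which is what makes the contrast columns have empirical mean exactly zero.

```python
import numpy as np

def constant_spline_basis(x, N, a=0.0, b=1.0):
    """First-order B-spline (indicator) basis: column J is 1 exactly when x falls
    in the J-th of the N+1 equally spaced subintervals of [a, b]."""
    H = (b - a) / (N + 1)
    J = np.clip(((x - a) / H).astype(int), 0, N)   # subinterval index 0..N
    I = np.zeros((len(x), N + 1))
    I[np.arange(len(x)), J] = 1.0
    return I

def centered_scaled_basis(x, N, a=0.0, b=1.0):
    """Empirical analogue of the centered, standardized basis in (4): contrast each
    indicator against its left neighbour using the ratio of empirical squared
    L2 norms (for indicators, the column means), then scale every contrast to
    unit empirical L2 norm. Columns then have empirical mean exactly zero."""
    I = constant_spline_basis(x, N, a, b)
    p = I.mean(axis=0)                              # empirical ||I_J||_2^2
    bmat = I[:, 1:] - (p[1:] / p[:-1]) * I[:, :-1]  # centered contrasts b_1..b_N
    scale = np.sqrt((bmat ** 2).mean(axis=0))       # empirical ||b_J||_2
    return bmat / scale                             # columns B_1, ..., B_N
```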

We approximate the nonparametric function $g_\ell$, $\ell = 1, \ldots, p_2$, using the above normalized piecewise constant splines:

$$g_\ell(x_\ell) \approx g_{\ell s}(x_\ell) = \sum_{J=1}^{N_n} \gamma_{J,\ell} B^{(1)}_{J,\ell}(x_\ell) = \mathbf{B}^{(1)\top}_\ell(x_\ell)\boldsymbol{\gamma}_\ell, \qquad (5)$$

where $\mathbf{B}^{(1)}_\ell(x_\ell) = \{B^{(1)}_{1,\ell}(x_\ell), \ldots, B^{(1)}_{N_n,\ell}(x_\ell)\}^\top$, and $\boldsymbol{\gamma}_\ell = (\gamma_{1,\ell}, \ldots, \gamma_{N_n,\ell})^\top$ is a vector of the spline coefficients. By using the centered constant spline basis functions, we can guarantee that $E\{g_{\ell s}(X_\ell)\} = 0$, and that $g_{\ell s}$ is constant except at the location of the knots.

Denote the length-$N_n p_2$ vector $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1^\top, \ldots, \boldsymbol{\gamma}_{p_2}^\top)^\top$. For any vector $\mathbf{a}$, denote by $\|\mathbf{a}\|$ the $L_2$ norm of $\mathbf{a}$. Following from (5), to minimize (3), it is approximately equivalent to consider the problem of minimizing

$$\sum_{i=1}^n \Big\{Y_i - \sum_{k=1}^{p_1} Z_{ik}\alpha_k - \sum_{\ell=1}^{p_2} X_{i\ell}\beta_\ell - \sum_{\ell=1}^{p_2} \mathbf{B}^{(1)\top}_{i\ell}\boldsymbol{\gamma}_\ell\Big\}^2 + \sum_{k=1}^{p_1} p_{\lambda_{n1}}(|\alpha_k|) + \sum_{\ell=1}^{p_2} p_{\lambda_{n2}}(|\beta_\ell|) + \sum_{\ell=1}^{p_2} p_{\lambda_{n3}}(\|\boldsymbol{\gamma}_\ell\|),$$

where $\mathbf{B}^{(1)}_{i\ell} = \mathbf{B}^{(1)}_\ell(X_{i\ell})$.

### 2.3 Adaptive Group LASSO Regularization

We use adaptive LASSO (Zou, 2006) and adaptive group LASSO (Huang et al., 2010) for variable selection and estimation. Other popular choices include methods based on the Smoothly Clipped Absolute Deviation penalty (Fan and Li, 2001) or the minimax concave penalty (Zhang, 2010). Specifically, we start with group LASSO estimators obtained from the following minimization:

$$(\tilde{\boldsymbol{\alpha}}, \tilde{\boldsymbol{\beta}}, \tilde{\boldsymbol{\gamma}}) = \operatorname*{argmin}_{\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma}} \sum_{i=1}^n \Big\{Y_i - \sum_{k=1}^{p_1} Z_{ik}\alpha_k - \sum_{\ell=1}^{p_2} X_{i\ell}\beta_\ell - \sum_{\ell=1}^{p_2} \mathbf{B}^{(1)\top}_{i\ell}\boldsymbol{\gamma}_\ell\Big\}^2 + \tilde\lambda_{n1}\sum_{k=1}^{p_1} |\alpha_k| + \tilde\lambda_{n2}\sum_{\ell=1}^{p_2} |\beta_\ell| + \tilde\lambda_{n3}\sum_{\ell=1}^{p_2} \|\boldsymbol{\gamma}_\ell\|. \qquad (6)$$

Then, let $w_{\alpha k} = |\tilde\alpha_k|^{-1}$, $w_{\beta\ell} = |\tilde\beta_\ell|^{-1}$, $w_{\gamma\ell} = \|\tilde{\boldsymbol{\gamma}}_\ell\|^{-1}$, where by convention, $1/0 = \infty$. The adaptive group LASSO objective function is defined as

$$L(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma};\lambda_{n1},\lambda_{n2},\lambda_{n3}) = \sum_{i=1}^n \Big\{Y_i - \sum_{k=1}^{p_1} Z_{ik}\alpha_k - \sum_{\ell=1}^{p_2} X_{i\ell}\beta_\ell - \sum_{\ell=1}^{p_2} \mathbf{B}^{(1)\top}_{i\ell}\boldsymbol{\gamma}_\ell\Big\}^2 + \lambda_{n1}\sum_{k=1}^{p_1} w_{\alpha k}|\alpha_k| + \lambda_{n2}\sum_{\ell=1}^{p_2} w_{\beta\ell}|\beta_\ell| + \lambda_{n3}\sum_{\ell=1}^{p_2} w_{\gamma\ell}\|\boldsymbol{\gamma}_\ell\|. \qquad (7)$$

The adaptive group LASSO estimators are minimizers of (7), denoted by

$$(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\gamma}}) = \operatorname*{argmin}_{\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma}}\ L(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma};\lambda_{n1},\lambda_{n2},\lambda_{n3}).$$

The model structure selected is defined by

$$\begin{aligned} \hat S_z &= \{1 \le k \le p_1 : |\hat\alpha_k| > 0\}, & \hat S_{x,PL} &= \{\ell : |\hat\beta_\ell| > 0,\ \|\hat{\boldsymbol{\gamma}}_\ell\| = 0,\ 1 \le \ell \le p_2\}, \\ \hat S_{x,LN} &= \{\ell : |\hat\beta_\ell| > 0,\ \|\hat{\boldsymbol{\gamma}}_\ell\| > 0,\ 1 \le \ell \le p_2\}, & \hat S_{x,PN} &= \{\ell : |\hat\beta_\ell| = 0,\ \|\hat{\boldsymbol{\gamma}}_\ell\| > 0,\ 1 \le \ell \le p_2\}. \end{aligned}$$
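For illustration, reading off the selected structure for the continuous covariates from a fitted model amounts to checking which of $|\hat\beta_\ell|$ and $\|\hat{\boldsymbol\gamma}_\ell\|$ are nonzero. The following hypothetical Python helper (names and the numerical tolerance are ours, not from the paper) makes this classification explicit:

```python
import numpy as np

def classify_structure(beta_hat, gamma_hat, tol=1e-10):
    """Sort continuous covariates into the four estimated sets: beta_hat[l] is the
    fitted linear coefficient of covariate l and gamma_hat[l] its fitted spline
    coefficient vector. (In exact arithmetic tol would be 0.)"""
    sets = {"pure_linear": [], "linear_nonlinear": [],
            "pure_nonlinear": [], "inactive": []}
    for l, (b, g) in enumerate(zip(beta_hat, gamma_hat)):
        lin = abs(b) > tol
        nonlin = np.linalg.norm(g) > tol
        if lin and not nonlin:
            sets["pure_linear"].append(l)        # l in S_x,PL
        elif lin and nonlin:
            sets["linear_nonlinear"].append(l)   # l in S_x,LN
        elif nonlin:
            sets["pure_nonlinear"].append(l)     # l in S_x,PN
        else:
            sets["inactive"].append(l)           # l in N_x
    return sets
```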

The spline estimators of each component function are

$$\hat g_\ell(x_\ell) = \sum_{J=1}^{N_n} \hat\gamma_{J,\ell} B^{(1)}_{J,\ell}(x_\ell) - n^{-1}\sum_{i=1}^n \sum_{J=1}^{N_n} \hat\gamma_{J,\ell} B^{(1)}_{J,\ell}(X_{i\ell}).$$

Accordingly, the spline estimators for the original component functions $\phi_\ell$'s are $\hat\phi_\ell(x_\ell) = \hat\beta_\ell x_\ell + \hat g_\ell(x_\ell)$.

The following theorems establish the asymptotic properties of the adaptive group LASSO estimators. Theorem 1 shows the proposed method can consistently distinguish nonzero components from zero components. Theorem 2 gives the convergence rates of the estimators. We only state the main results here. To facilitate the development of the asymptotic properties, we assume the following sparsity condition:

1. (Sparsity) The numbers of nonzero components $|S_z|$, $|S_{x,PL}|$ and $|S_{x,N}|$ are fixed, and there exist positive constants $c_\alpha$, $c_\beta$ and $c_g$ such that the nonzero $|\alpha_k|$, $|\beta_\ell|$ and $\|g_\ell\|_2$ are bounded below by $c_\alpha$, $c_\beta$ and $c_g$, respectively.

Other regularity conditions and proofs are provided in Appendices E.1–E.3.

###### Theorem 1.

Suppose that Assumptions (A1), (A2)–(A6) in Appendix E.1 hold. As $n \to \infty$, we have $\hat S_z = S_z$, $\hat S_{x,PL} = S_{x,PL}$, and $\hat S_{x,N} = S_{x,N}$ with probability approaching one.

In the following, to avoid confusion, we use $\alpha^0_k$ and $\beta^0_\ell$ to denote the true parameters in model (2), and $g^0_\ell$ to denote the true nonlinear functions in model (2). Let $\boldsymbol{\alpha}^0 = (\boldsymbol{\alpha}^{0\top}_{S_z}, \mathbf{0}^\top)^\top$, where $\boldsymbol{\alpha}^0_{S_z}$ consists of all nonzero components of $\boldsymbol{\alpha}^0$, without loss of generality; similarly, let $\boldsymbol{\beta}^0 = (\boldsymbol{\beta}^{0\top}_{S_{x,L}}, \mathbf{0}^\top)^\top$, where $\boldsymbol{\beta}^0_{S_{x,L}}$ consists of all nonzero components of $\boldsymbol{\beta}^0$, without loss of generality.

###### Theorem 2.

Suppose that Assumptions (A1), (A2)–(A6) in Appendix E.1 hold. Then

$$\begin{aligned} \sum_{k \in S_z} |\hat\alpha_k - \alpha^0_k|^2 &= O_P(n^{-1}N_n) + O(N_n^{-2}) + O_P\Big(n^{-2}\sum_{j=1}^3 \lambda_{nj}^2\Big), \\ \sum_{\ell \in S_{x,L}} |\hat\beta_\ell - \beta^0_\ell|^2 &= O_P(n^{-1}N_n) + O(N_n^{-2}) + O_P\Big(n^{-2}\sum_{j=1}^3 \lambda_{nj}^2\Big), \\ \sum_{\ell \in S_{x,N}} \|\hat g_\ell - g^0_\ell\|_2^2 &= O_P(n^{-1}N_n) + O(N_n^{-2}) + O_P\Big(n^{-2}\sum_{j=1}^3 \lambda_{nj}^2\Big). \end{aligned}$$

## 3 Two-stage SBLL Estimator and Inference

After model selection, our next step is to conduct statistical inference for the nonparametric component functions of those important variables. Although the one-step penalized estimation in Section 2.3 can quickly identify the nonzero nonlinear components, the asymptotic distribution is not available for the resulting estimators.

To obtain estimators whose asymptotic distribution can be used for inference, we first refit the data using the selected model:

$$Y_i = \sum_{k \in \hat S_z} Z_{ik}\alpha_k + \sum_{j \in \hat S_{x,PL}} X_{ij}\beta_j + \sum_{\ell \in \hat S_{x,N}} \phi_\ell(X_{i\ell}) + \varepsilon_i. \qquad (8)$$

We approximate the smooth functions in (8) by the polynomial splines introduced in Section 2.2. Let $\mathcal{G}^{(d-1)}$ be the space of polynomial splines of order $d$, and let $\mathcal{G}^{(d-1)}_0$ be its subspace of theoretically centered spline functions. Working with $\mathcal{G}^{(d-1)}_0$ ensures that the spline functions are centered; see for example Xue and Yang (2006); Wang and Yang (2007); Wang et al. (2014). Let $\{B^{(d)}_{J,\ell}(x_\ell)\}_{J=1}^{J_n}$ be a set of standardized spline basis functions for $\mathcal{G}^{(d-1)}_0$ with dimension $J_n$, so that $E\{B^{(d)}_{J,\ell}(X_\ell)\} = 0$ and $\|B^{(d)}_{J,\ell}\|_2 = 1$. Specifically, if $d = 1$, $B^{(d)}_{J,\ell}$ is the standardized piecewise constant spline function defined in (4).

We propose a one-step backfitting using refitted pilot spline estimators in the first stage followed by local-linear estimators. The refitted coefficients are defined as

$$(\hat{\boldsymbol{\alpha}}^*, \hat{\boldsymbol{\beta}}^*, \hat{\boldsymbol{\gamma}}^*) = \operatorname*{argmin}_{\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma}} \sum_{i=1}^n \Big(Y_i - \sum_{k \in \hat S_z} Z_{ik}\alpha_k - \sum_{j \in \hat S_{x,PL}} X_{ij}\beta_j - \sum_{\ell \in \hat S_{x,N}} \mathbf{B}^{(d)\top}_{i\ell}\boldsymbol{\gamma}_\ell\Big)^2. \qquad (9)$$
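Since (9) is an unpenalized least-squares problem on the selected columns only, the refit itself is a single ordinary least-squares solve. A minimal Python sketch (helper names are ours; the selected design matrices are assumed to have been assembled beforehand):

```python
import numpy as np

def refit_least_squares(y, Z_sel, X_lin_sel, B_sel):
    """Unpenalized refit on the selected columns: stack the selected discrete
    covariates Z_sel, the selected linear continuous covariates X_lin_sel, and
    the spline basis columns B_sel of the selected nonlinear covariates, then
    solve ordinary least squares in one shot."""
    D = np.hstack([Z_sel, X_lin_sel, B_sel])          # refitting design matrix
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    p1, p2 = Z_sel.shape[1], X_lin_sel.shape[1]
    # split back into (alpha*, beta*, gamma*)
    return coef[:p1], coef[p1:p1 + p2], coef[p1 + p2:]
```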

Then the refitted spline estimator for the nonlinear functions is

$$\hat\phi^*_\ell(x_\ell) = \sum_{J=1}^{J_n} \hat\gamma^*_{J,\ell} B^{(d)}_{J,\ell}(x_\ell), \qquad \ell \in \hat S_{x,N}. \qquad (10)$$

Next we establish the asymptotic normal distribution for the parametric estimators. To make $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ estimable at the $\sqrt{n}$ rate, we need a condition to ensure that $\mathbf{Z}$ and $\mathbf{X}$ are not functionally related. Define $\mathcal{F}^+$ as the Hilbert space of theoretically centered additive functions. For any $\mathbf{z}$, let $\pi_k(\mathbf{z})$ be the coordinate mapping that maps $\mathbf{z}$ to its $k$-th component $z_k$, and let $\psi_{z_k}$ be the orthogonal projection of $z_k$ onto $\mathcal{F}^+$. Similarly, for any $\mathbf{x}$, let $\pi_\ell(\mathbf{x})$ be the coordinate mapping that maps $\mathbf{x}$ to its $\ell$-th component $x_\ell$, and let

$$\psi_{x_\ell} = \operatorname*{argmin}_{\psi \in \mathcal{F}^+} \|x_\ell - \psi\|_2^2 = \operatorname*{argmin}_{\psi \in \mathcal{F}^+} E\{X_\ell - \psi(\mathbf{X})\}^2 \qquad (11)$$

be the orthogonal projection of $x_\ell$ onto $\mathcal{F}^+$. Define the projection residuals $\tilde z_k = z_k - \psi_{z_k}$ for $k \in S_z$ and $\tilde x_\ell = x_\ell - \psi_{x_\ell}$ for $\ell \in S_{x,PL}$, and stack them into the vectors $\tilde{\mathbf{z}} = (\tilde z_k)_{k \in S_z}$ and $\tilde{\mathbf{x}} = (\tilde x_\ell)_{\ell \in S_{x,PL}}$.

###### Theorem 3.

Under Assumptions (A1), (A2)–(A6), (A3′) and (A6′) in Appendix E.1,

$$(n\boldsymbol{\Sigma})^{1/2}\begin{pmatrix} \hat{\boldsymbol{\alpha}}^*_{S_z} - \boldsymbol{\alpha}_{0,S_z} \\ \hat{\boldsymbol{\beta}}^*_{S_{x,PL}} - \boldsymbol{\beta}_{0,S_{x,PL}} \end{pmatrix} \xrightarrow{D} N(\mathbf{0}, \mathbf{I}),$$

where $\mathbf{I}$ is an identity matrix and $\boldsymbol{\Sigma}$ is the asymptotic covariance matrix defined through the projection residuals $\tilde{\mathbf{z}}$ and $\tilde{\mathbf{x}}$ above.

The proof of Theorem 3 is similar to the proofs in Liu et al. (2011) and Li et al. (2018) and is thus omitted. If the index sets $S_z$ and $S_{x,PL}$ were given, $\boldsymbol{\Sigma}$ could be consistently estimated by its sample analogue, with the projections replaced by their spline estimates given in (E.16) in the Supplemental Materials. In practice, we replace $S_z$ and $S_{x,PL}$ with $\hat S_z$ and $\hat S_{x,PL}$, respectively, to obtain the corresponding estimate.

Let $A_n = \{\hat S_z = S_z \text{ and } \hat S_{x,PL} = S_{x,PL}\}$. In the selection step, we estimate $S_z$ and $S_{x,PL}$ consistently, that is, $P(A_n) \to 1$. Within the event $A_n$, the estimator is root-$n$ consistent according to Theorem 3. Since $A_n$ is shown to have probability tending to one, we can conclude that the estimator is also root-$n$ consistent.

These refitted pilot estimators defined in (9) and (10) are then used to define new pseudo-responses $\hat Y_{i\ell}$, which are estimates of the unobservable “oracle” responses $Y_{i\ell}$. Specifically,

$$\begin{aligned} \hat Y_{i\ell} &= Y_i - \Big\{\sum_{k \in \hat S_z} Z_{ik}\hat\alpha^*_k + \sum_{\ell' \in \hat S_{x,PL}} X_{i\ell'}\hat\beta^*_{\ell'} + \sum_{\ell'' \in \hat S_{x,N}\setminus\{\ell\}} \hat\phi^*_{\ell''}(X_{i\ell''})\Big\}, \\ Y_{i\ell} &= Y_i - \Big\{\sum_{k \in S_z} Z_{ik}\alpha^0_k + \sum_{\ell' \in S_{x,PL}} X_{i\ell'}\beta^0_{\ell'} + \sum_{\ell'' \in S_{x,N}\setminus\{\ell\}} \phi^0_{\ell''}(X_{i\ell''})\Big\}. \end{aligned} \qquad (12)$$

Denote by $K$ a continuous kernel function, and let $K_h(u) = h^{-1}K(u/h)$ be a rescaling of $K$, where $h$ is usually called the bandwidth. Next, we define the spline-backfitted local-linear (SBLL) estimator $\hat\phi^{\mathrm{SBLL}}_\ell(x_\ell)$ of $\phi_\ell(x_\ell)$ based on $\{\hat Y_{i\ell}\}_{i=1}^n$, which attempts to mimic the would-be SBLL estimator $\hat\phi^{o}_\ell(x_\ell)$ of $\phi_\ell(x_\ell)$ based on $\{Y_{i\ell}\}_{i=1}^n$ if the unobservable “oracle” responses were available:

$$\big(\hat\phi^{o}_\ell(x_\ell),\ \hat\phi^{\mathrm{SBLL}}_\ell(x_\ell)\big) = (1\ \ 0)\big(\mathbf{X}^{*\top}_\ell \mathbf{W}_\ell \mathbf{X}^*_\ell\big)^{-1}\mathbf{X}^{*\top}_\ell \mathbf{W}_\ell (\mathbf{Y}_\ell, \hat{\mathbf{Y}}_\ell), \qquad (13)$$

where $\mathbf{Y}_\ell = (Y_{1\ell}, \ldots, Y_{n\ell})^\top$ and $\hat{\mathbf{Y}}_\ell = (\hat Y_{1\ell}, \ldots, \hat Y_{n\ell})^\top$, with $Y_{i\ell}$ and $\hat Y_{i\ell}$ as defined in (12); and the weight and “design” matrices are

 Wℓ=n−1
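For intuition, each evaluation of the local-linear smoother in (13) is a small weighted least-squares fit whose local intercept is the estimate. A minimal Python sketch follows; the Epanechnikov kernel is our choice for illustration (the text only assumes a continuous kernel $K$), and the function name is hypothetical.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local-linear estimate at the point x0: fit y ~ b0 + b1*(x - x0) by weighted
    least squares with Epanechnikov kernel weights K_h(x - x0); the fitted local
    intercept b0 is the estimate, mirroring the (1 0) selector in (13)."""
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h  # K_h weights
    X = np.column_stack([np.ones_like(x), x - x0])                  # local design
    A = X.T @ (w[:, None] * X)                                      # 2x2 normal matrix
    beta = np.linalg.solve(A, X.T @ (w * y))
    return beta[0]
```

A useful sanity check on any local-linear implementation is that it reproduces linear functions exactly inside the support, which is the source of its boundary-bias advantage over the local-constant smoother.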