# On sufficient dimension reduction via principal asymmetric least squares

In this paper, we introduce principal asymmetric least squares (PALS) as a unified framework for linear and nonlinear sufficient dimension reduction. Classical methods such as sliced inverse regression (Li, 1991) and principal support vector machines (Li, Artemiou and Li, 2011) may not perform well in the presence of heteroscedasticity, while our proposal addresses this limitation by synthesizing different expectile levels. Through extensive numerical studies, we demonstrate the superior performance of PALS in terms of both computation time and estimation accuracy. For the asymptotic analysis of PALS for linear sufficient dimension reduction, we develop new tools to compute the derivative of an expectation of a non-Lipschitz function.

## Authors

• 2 publications
• 4 publications
11/28/2019

### Distributed estimation of principal support vector machines for sufficient dimension reduction

The principal support vector machines method (Li et al., 2011) is a powe...
10/19/2020

### Sufficient dimension reduction for classification using principal optimal transport direction

Sufficient dimension reduction is used pervasively as a supervised dimen...
10/20/2019

### Supporting Multi-point Fan Design with Dimension Reduction

Motivated by the idea of turbomachinery active subspace performance maps...
03/11/2021

10/28/2020

### Bridging linearity-based and kernel-based sufficient dimension reduction

There has been a lot of interest in sufficient dimension reduction (SDR)...
08/10/2015

### Model-based SIR for dimension reduction

A new dimension reduction method based on Gaussian finite mixtures is pr...
04/06/2021

### Accelerated derivative-free nonlinear least-squares applied to the estimation of Manning coefficients

A general framework for solving nonlinear least squares problems without...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

For univariate response and multivariate predictor , sufficient dimension reduction (Li, 1991; Cook, 1998) aims to find such that

 Y\raisebox0.5pt\rotatebox[origin=c]90.0$⊨$X|B⊤X, (1)

where “ ” means statistical independence. Under (1), the conditional distribution of given is the same as the conditional distribution of given . If satisfies (1), the column space of is called a dimension reduction space. Under very general conditions as discussed in Yin, Li and Cook (2008), the intersection of all dimension reduction spaces is also a dimension reduction space. We refer to this minimum dimension reduction space as the central space for the regression between and , and we denote it as . The dimensionality of the central space is known as the structural dimension.

Moment-based methods such as sliced inverse regression (SIR) (Li, 1991), sliced average variance estimation (SAVE) (Cook and Weisberg, 1991), and directional regression (Li and Wang, 2007) are among the most popular sufficient dimension reduction methods. These moment-based methods are easy to implement in practice, and their extensions include sparse sufficient dimension reduction (Li, 2007), dimension reduction with matrix-valued predictors (Li, Kim and Altman, 2010), and dimension reduction for functional data (Li and Song, 2017). For an excellent review, please refer to Li (2018). More recently, Li, Artemiou and Li (2011) proposed the principal support vector machine (PSVM), which applies a modified support vector machine to find the optimal separating hyperplanes of the discretized response. It is shown that the normal vector of the separating hyperplanes can be used to recover . Extensions of PSVM include

PSVM (Artemiou and Dong, 2016), principal logistic regression (Shin and Artemiou, 2017), and weighted PSVM (Shin et al., 2017).

A well-known limitation of the moment-based sufficient dimension reduction methods as well as PSVM is that they may not perform well in the presence of heteroscedastic error. In this paper, we propose to replace the hinge loss in PSVM with the asymmetric least squares loss, and we refer to the new proposal as principal asymmetric least squares (PALS). By synthesizing different expectile levels, PALS can improve the performance of PSVM when the error is heteroscedastic. We implement the sample level estimation of PALS through quadratic programming, provide the asymptotic normality of the sample PALS estimator, and extend PALS for nonlinear sufficient dimension reduction. Wang, Shin, and Wu (2018) proposed principal quantile regression (PQR), where quantile regression was used instead of the expectile regression in our proposal. We note that the check loss from the PQR objective function is not smooth, while PALS utilizes a smooth objective function. As a result, PALS leads to more accurate estimation with much improved computational speed compared to PQR.

The rest of the paper is organized as follows. The population level development and the sample level estimation of PALS are studied in Section 2 and Section 3, respectively. Extensions to nonlinear sufficient dimension reduction are examined in Section 4. Extensive simulation studies are reported in Section 5 and we provide a real data analysis in Section 6. Section 7 concludes the paper with some discussions. All the proofs are relegated to the Appendix.

## 2 Population level development

Let and . The population -th level objective function of PALS is

 L\uptau(α,β)=β⊤Σβ+λE[ρ\uptau{Y−α−β⊤(X−μ)}]. (2)

Here denotes the expectile level, is a tuning parameter, and

is the asymmetric least squares loss function (Newey and Powell, 1987) defined as follows

 ρ\uptau(c)={(1−\uptau)c2if c≤0,\uptauc2if c>0. (3)

The asymmetric least squares loss was originally designed to recover the regression expectiles, which is known be closely related to regression quantiles (Abdous and Remillard, 1995). The objective function (2) is linked to sufficient dimension reduction through the next result.

###### Theorem 1.

Suppose is linear in , where is a basis of . Let

 (α0,\uptau,β0,\uptau)=argminα∈R, β∈Rp L\uptau(α,β).

Then .

The assumption about is known as the linear conditional mean condition, and is common in the sufficient dimension reduction literature. As a result of Theorem 1, we have

###### Corollary 1.

Suppose is linear in , where is a basis of . Let and

 (α0,\uptauk,β0,\uptauk)=argminα∈R, β∈Rp L\uptauk(α,β) for k=1,…,K.

Then , where .

Here denotes the column space. Corollary 1 suggests that we can recover the central space by optimizing the PALS objective function (2) at multiple expectile levels.

## 3 Sample level estimation

Given an i.i.d sample , the sample version of (2) becomes

 ^L\uptau(α,β)=β⊤^Σβ+λnn∑i=1ρ\uptau{Yi−α−β⊤(Xi−¯X)}, (4)

where and . Denote , and . (4) reduces to

 ~L\uptau(α,θ)=θ⊤θ+~λn∑i=1ρ\uptau(Yi−α−θ⊤Zi). (5)

Let . We now introduce

 ξi+=(Yi−α−θ⊤Zi)+ and ξi−=(α+θ⊤Zi−Yi)+.

From the definition of in (3), (5) leads to the following primal optimization problem

 (6)

subject to , , , and .

###### Theorem 2.

Let and . The dual optimization problem of (6) is

 (7)

subject to , , and . Furthermore, we have

 ^θ0,\uptau=12Z⊤(^a0,\uptau−^η0,\uptau). (8)

Consider expectile levels . For a given , the dual problem (7) can be solved through standard quadratic programming to get and . After computing from (8), we get the minimizer of in (4) as . Based on Corollary 1, we get the estimator of as . Recall that denotes the structural dimension of

. The eigenvectors corresponding to the

largest eigenvalues of

then consist the final PALS estimator to recover the central space.

We conclude this section with the asymptotic normality of , where Vec means vectorization. The details are provided in the Appendix. In order to compute the derivative of an expectation of a non-Lipschitz function, we extend the theoretical development of PSVM (Li, Artemiou and Li, 2011). This extension is necessary as PSVM deals with discretized response while PALS applies to the continuous response without discretization.

###### Theorem 3.

Suppose the regularity conditions in Theorem 4 and Theorem 5 from the Appendix are satisfied. Then we have

 √n{Vec(Λ)−Vec(^Λ)}D⟶N(0,Ω)

as , where “” means converge in distribution and is specified in the Appendix.

## 4 Nonlinear sufficient dimension reduction

Suppose with are nonlinear functions satisfying

 Y\raisebox0.5pt\rotatebox[origin=c]90.0$⊨$X|φ(X), (9)

where . Then the conditional distribution of given is the same as the conditional distribution of given , and identifying is known as nonlinear sufficient dimension reduction. Let be a reproducing kernel Hilbert space of the functions of with inner product . Let be the covariance operator such that for any . Consider objective function

 Πτ(α,φ)=⟨φ,Σφ⟩H+λE[ρτ{Y−α−φ(X)}]. (10)

Compared with (2), we see that is a generalization of with the matrix replaced by the operator , the linear function replaced by the nonlinear function , and the inner product in replaced by the inner product in . Let be the minimizer of over and . Under proper conditions, it can be shown that is a function of . See, for example, Theorem 2 of Li, Artemiou and Li (2011).

Based on an i.i.d. sample , we now describe the implementation of nonlinear dimension reduction through PALS. Suppose can be spanned by . Then any function becomes , where and . The sample version of (10) thus becomes

 ^Πτ(α,γ)=1nγ⊤Ψ⊤Ψγ+λnn∑i=1ρτ{Yi−α−γ⊤ψ(Xi)}, (11)

where and the th row of is . has the same form as in (4), and can be minimized in a similar fashion. Denote the minimizer of as . We then estimate by . To synthesize multiple expectile levels, consider expectile levels . For a given , we get from minimizing . Denote with leading eigenvectors as . The final estimator of in (9) is .

It remains to choose a proper basis for . Define kernel matrix , with the element in the th row and th column as

 κ(Xi,Xj)=exp(−r∥Xi−Xj∥2). (12)

Here is a tuning parameter and denotes the Euclidean norm. Define , where is the identity matrix and is the matrix whose entries are . For , let be the eigenvector corresponding to the -th largest eigenvalue of . Then following Proposition 2 of Li, Artemiou and Li (2011). In our simulations, we choose , and use the sample version of for in (12), where and are independent .

## 5 Simulation studies

### 5.1 Linear sufficient dimension reduction

We evaluate the performance of PALS for linear sufficient dimension reduction in this section. The following models are considered:

 I: Y=X10.5+(X2+1.5)2+ε; II: Y=3sin{0.25(X1+X2)}+3sin{0.25(X3+X4)}+ε; III: Y=X1+0.5(e0.15X2)ε,

where and is independent of . The distribution of will be specified later. Let , , , and . Denote as the basis of the central space . Then for models I and III, while for model II.

We compare PALS with five existing methods in the literature: SIR, SAVE, directional regression (DR), PSVM, and PQR. The number of slices for SIR is set as , and we use slices for SAVE and DR. Note that SIR is generally not sensitive to the choice of slice numbers, while SAVE and DR work better with fewer slices. For PSVM, the number of dividing points is set as 9, as Li, Artemiou and Li (2011) recommend a larger number is preferable. For a given set of dividing points, two ways to dichotomize the response are considered in Li, Artemiou and Li (2011), “left versus right” (LVR) and “one versus another”. We adopt the LVR scheme in our simulations. For PQR, we follow Wang, Shin and Wu (2018) and set the number of quantile levels to be 9, which leads to 10 slices. For PALS, we set for . To evaluate the performance of each estimator , we report

 Δ=∥PB−P^B∥F, (13)

where denotes the orthogonal projection onto , and is the matrix Frobenius norm. Smaller value means more accurate estimation.

For the choice of the tuning parameter , PSVM, PQR and PALS seem to be not overly sensitive. We try , and report the best results that a fixed can achieve. In addition, we propose a variable scheme for PALS so that one can use different values across repetitions. Specifically, denote as the PALS estimator for a specific . We choose such that the squared sample distance correlation (Székely, Rizzo and Bakirov, 2007) between and is maximized. We refer to this method as DC-PALS.

First, we set , where the element in the th row and th column of is for . We fix , and consider . The results based on repetitions are summarized in Table 1. We report the average of in (13) and include its standard error in the parenthesis. We see that PALS with fixed leads to the best result across all three models. PSVM is not as good as classical method such as SIR in model III, where heteroscedasticity is present. PQR is very competitive in models II and III, but is not as good as PSVM and PALS in model I. Furthermore, DC-PALS with variable has the second best overall performance, and it is only slightly worse than PALS with fixed . As increases, all methods deteriorate, while PALS and DC-PALS maintain their advantage over the other methods.

Next, we fix , , and consider three cases for the distribution of : case (i), ; case (ii), with ; and case (iii), , , where the components of are independent. The linear conditional mean assumption holds for cases (i) and (ii), and is no longer satisfied for case (iii). The results based on repetitions are summarized in Table 2. Compared to PALS, PQR does not work as well for model I, and PSVM is significantly worse for model III when is normal. For cases (i) and (ii), all the estimators become worse when the correlation between the normal predictors increase. For cases (i) and (iii) with uncorrelated predictors, we see that all the methods become worse when the linear conditional mean assumption is violated. PALS and DC-PALS again have the best overall performances.

Last but not least, we list the computation time of 100 repetitions in Table 3 for PSVM, PQR and PALS when we fix and . We only report the results for model III. The other two models lead to similar results and are omitted. We see that the computation time generally increases when increases, although the increase does not seem to be significant. The predictor distribution does not seem to affect the computation time. PSVM costs the least computation time among all three methods. Although not as fast as PSVM, PALS is almost four times faster than PQR across all settings.

### 5.2 Nonlinear sufficient dimension reduction

For nonlinear sufficient dimension reduction, we consider the following models:

 IV: Y=√φ1(X)log{√φ1(X)}+0.5ε; V: Y=φ21(X)+0.5φ2(X)ε,

where , , , , and is independent of . Denote as the basis for nonlinear sufficient dimension reduction such that . Then for model IV, and for model V.

We denote our proposal in Section 4 as kernel PALS (kPALS), and we compare it with kernel SIR (kSIR) (Wu, 2008), kernel PSVM (kPSVM) (Li, Artemiou and Li, 2011), and kernel PQR (kPQR) (Wang, Shin and Wu, 2018). For estimator , we measure its performance by the squared sample distance correlation between and as

 Υ=dCor2{φ(X),^φ(X)}. (14)

Larger values of mean better estimation. Similar to PSVM, PQR and PALS for linear sufficient dimension reduction, their kernel counterparts require a choice of . See, for example, for kPALS in (11). For , we report the results based on the best . Parallel to DC-PALS, we also include DC-kPALS, where is chosen such that the squared sample distance correlation between and is maximized. We fix and set . From Table 4, we see that kPALS has the best performance, and DC-kPALS is a close second. All methods improve as increases for model IV, and only kSIR improves as increases for model V. Together with previous simulation studies, we conclude that distance correlation can be a useful tool to select for PALS in both linear and nonlinear sufficient dimension reduction.

## 6 Real data analysis of the Boston housing data

We consider Boston housing data for the real data analysis. The data is originally studied in Harrison and Rubinfeld (1978). After removing a categorical variable and excluding the cases where the census tract bounds the Charles river, we end up with 12 predictors and 471 observations. The response is the median value of owner-occupied homes in each census tract. For a complete list of the predictors, one can refer to Table 5 of Wang, Shin and Wu (2018). As suggested by Wang, Shin and Wu (2018), we set the structural dimension to be

and denote the estimator as . We apply PSVM, PQR and PALS to this data set, and report the squared sample distance correlation between and for different . From Table 5, we see that PQR and PALS perform similarly, and both are better than PSVM. Furthermore, PQR and PALS are less sensitive to the choice of than PSVM.

## 7 Conclusion

We propose PALS for linear and nonlinear sufficient dimension reduction in this paper. Our proposed method is very competitive with existing methods in the literature. On one hand, our proposal enjoys better estimation accuracy than SIR and PSVM, especially in the presence of heteroscedasticity. On the other hand, our proposal is computationally more efficient compared to PQR. Unlike PSVM where the response is dichotomized, both PQR and PALS deal with continuous response directly. We develop new tools for the asymptotic analysis of PALS. Specifically, Lemma 3 of Li, Artemiou and Li (2011) provides a tool to compute the derivative of an expectation of a non-Lipschitz function, and Theorem 3 of Wang, Shin and Wu (2018) applied this Lemma directly without considering the continuous support of the response in PQR. This limitation is addressed in Lemma 1 and Theorem 5 of the Appendix, where Lemma 3 of Li, Artemiou and Li (2011) is adapted for continuous response.

We consider a fixed set of expectile levels in this paper. Although our experience indicates that the performance of PALS is not sensitive to the choice of expectile levels, choosing an optimal set of expectile levels is worth further investigation. Kim, Wu and Shin (2019) develop quantile-slicing for sufficient dimension reduction, and expectile-slicing for sufficient dimension reduction may be an interesting research direction.

## Appendix A

Proof of Theorem 1. We assume without loss of generality that . Note that . Then (2) becomes

 L\uptau(α,β)=Var(β⊤X)+λE{ρ\uptau(Y−α−β⊤X)}. (15)

The first term on the right hand side of (15) satisfies

 Var(β⊤X)≥Var{E(β⊤X|B⊤X)}. (16)

The second term on the right hand side of (15) satisfies

 E{ρ\uptau(Y−α−β⊤X)}=E[E{ρ\uptau(Y−α−β⊤X)|B⊤X,Y}]≥E[ρ\uptau{E(Y−α−β⊤X)|B⊤X,Y}]=E[ρ\uptau{Y−α−E(β⊤X|B⊤X)}], (17)

where the inequlality is due to the convexity of , and the last equality is due to the conditional independence (1). The assumption that is linear in implies that

 E(X|B⊤X)=Σ(B⊤ΣB)−1B⊤X. (18)

(15), (16), (17) and (18) together imply that

 L\uptau(α,β)≥L\uptau(α,~β) % with ~β=B(B⊤ΣB)−1Σβ.

Thus the minimizer must satisfy .

Proof of Corollary 1. The proof follows directly from Theorem 1 and is omitted.

Proof of Theorem 2. Denote , , , , , and . Denote as the Lagrangian of the primal optimization problem (6) and abbreviate it as . Then we have

 L∗= θ⊤θ+~λ\uptaun∑i=1ξ2i++~λ(1−\uptau)n∑i=1ξ2i−−n∑i=1uiξi+−n∑i=1viξi− (19) +n∑i=1ai(Yi−α−θ⊤Zi−ξi+)+n∑i=1ηi(−Yi+α+θ⊤Zi−ξi−),

where , , , and for all . Take partial derivatives of (19) and set them to be zero. We get

 ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩∂L∗/∂θ=2θ−∑ni=1(ai−ηi)Zi=0∂L∗/∂α=∑ni=1(ηi−ai)=0∂L∗/∂ξi+=2~λ\uptauξi+−ui−ai=0∂L∗/∂ξi−=2~λ(1−\uptau)ξi−−vi−ηi=0 (20)

Assume for a particular . The Karush Kuhn Tucker (KKT) conditions state that for all . Then we must have from KKT. On the other hand, we have from the third equation of (20), which leads to because , , and . This contradiction guarantees that for all . Thus we have

 ξi+=ai2~λ\uptau for all i. (21)

Similarly, from the fourth equation of (20) and the KKT condition, we have for all and

 ξi−=ηi2~λ(1−\uptau) for % all i. (22)

Furthermore, the first equation of (20) leads to

 θ=12n∑i=1(ai−ηi)Zi (23)

By complementary slackness, we have

 uiξi+=0,ai(Yi−α−θ⊤Zi−ξi+)=0,viξi−=0,ηi(−Yi+α+θ⊤Zi−ξi−)=0 for all i. (24)

Plug (21), (22), (23) and (24) into (19), and we get the objective function in the dual optimization problem (7). The constraints for the dual problem are and for all . The second equation of (20) leads to the constraint that for all . Equation (23) leads to (8), which connects the solution of the dual problem to the primal problem.

## Appendix B

We provide the proof of Theorem 3 in this section. The following notations are needed. Without loss of generality, assume . Denote , and . Let be a block diagonal matrix such that the block diagonal elements of are and . Then in (2) becomes , where

 ℓ\uptau(~β,Ξ)=~β⊤~Σ~β+λρ\uptau(Y−~β⊤~X). (25)

Let be the -dimensional column vector of differential operators . The next result gives the gradient of .

###### Theorem 4.

Suppose for any , the distribution of is dominated by the Lebesgue measure, and . Then

 D~βE{ℓ\uptau(~β,Ξ)}=(0,2β⊤Σ)⊤−2λE{\uptauξ+~X−(\uptauξ+ξ−)I(ξ<0)~X}, (26)

where , , , and is the indicator function.

Proof. Denote . It is easy to check that

 ℓ∗\uptau(~β,Ξ)=\uptauξ2++(1−\uptau)ξ2−. (27)

Note that . It follows that

 D~βξ2+=−2ξ+{1−I(ξ<0)}~X. (28)

Similarly from , we have

 D~βξ2−=2ξ−I(ξ<0)~X. (29)

Plug (28) and (29) into (27). After taking derivatives, we get

 D~βℓ∗\uptau(~β,Ξ)=−2\uptauξ+{1−I(ξ<0)}~X+2(1−\uptau)ξ−I(ξ<0)~X=−2\uptauξ+~X+2\uptau(ξ+−ξ−)I(ξ<0)~X+2ξ−I(ξ<0)~X=−2\uptauξ+~X+2(\uptauξ+ξ−)I(ξ<0)~X. (30)

(30) and (25) together imply that

 E{D~βℓ\uptau(~β,Ξ)}=(0,2β⊤Σ)⊤−2λE{\uptauξ+~X−(\uptauξ+ξ−)I(ξ<0)~X}. (31)

Let be the support of . For and , we have

 ℓ∗\uptau(~β1,Ξ)−ℓ∗\uptau(~β2,Ξ)=\uptau{(Y−~β⊤1~X)2+−(Y−~β⊤2~X)2+}+(1−\uptau){(~β⊤1~X−Y)2+−(~β⊤2~X−Y)2+}. (32)

Note that and . Then

 (Y−~β⊤1~X)2+−(Y−~β⊤2~X)2+≤|(~β1−~β2)⊤~X|(|Y−~β⊤1~X|+|Y−~β⊤2~X|) ≤(1+∥X2∥)1/2(|Y−~β⊤1~X|+|Y−~β⊤2~X|)∥~β1−~β2∥

for some constant . The last inequality is due to the assumption that and . Thus the first term on the right hand side of (32) satisfies the Lipschitz condition with respect to . Similarly, one can show the second term on the right hand side of (32) also satisfies the Lipschitz condition. Together, we know satisfies the Lipschitz condition with respect to . From Lemma 2 of Li, Artemiou and Li (2011), we have

 D~βE{ℓ\uptau(~β,Ξ)}=E{D~βℓ\uptau(~β,Ξ)}. (33)

(31) and (33) together lead to the desired result.

The next lemma is used to compute the derivative of an expectation of a non-Lipschitz function. Let denote the operation of first taking derivative with respect to and then evaluating the derivative at .

###### Lemma 1.

Suppose , and

are random variables, and

is a measurable -valued function. Suppose, moreover,

• the joint distribution of

is dominated by the Lebesgue measure;

• for any , the function is continuous, where denotes the conditional density of given ;

• for each component of