 # Inverse Moment Methods for Sufficient Forecasting using High-Dimensional Predictors

We consider forecasting a single time series using high-dimensional predictors in the presence of a possible nonlinear forecast function. The sufficient forecasting (Fan et al., 2016) used sliced inverse regression to estimate lower-dimensional sufficient indices for nonparametric forecasting using factor models. However, Fan et al. (2016) is fundamentally limited to the inverse first-moment method, as it assumes a fixed number of factors, a linearity condition on the factors, and a monotone effect of the factors on the response. In this work, we study the inverse second-moment method using directional regression and the inverse third-moment method to extend the methodology and applicability of the sufficient forecasting. As the number of factors diverges with the dimension of predictors, the proposed method relaxes the distributional assumption of the predictor and enhances the capability of capturing the non-monotone effect of factors on the response. We not only provide a high-dimensional analysis of inverse moment methods such as exhaustiveness and rate of convergence, but also prove their model selection consistency. The power of our proposed methods is demonstrated in both simulation studies and an empirical study of forecasting monthly macroeconomic data from Q1 1959 to Q1 2016. During our theoretical development, we prove an invariance result for inverse moment methods, which makes a separate contribution to the sufficient dimension reduction literature.


## 1 Introduction

Forecasting using high-dimensional predictors is an increasingly important research topic in statistics, biostatistics, macroeconomics and finance. A large body of literature has contributed to forecasting in a data rich environment, with various applications such as the forecasts of market prices, dividends and bond risks (Sharpe, 1964; Lintner, 1965; Ludvigson and Ng, 2009), macroeconomic outputs (Stock and Watson, 1989; Bernanke et al., 2005), macroeconomic uncertainty and fluctuations (Ludvigson and Ng, 2007; Jurado et al., 2015), and clinical outcomes based on massive genetic, genomic and imaging measurements. Motivated by principal component regression, the pioneering papers by Stock and Watson (2002a, b) systematically introduced the forecasting procedure using factor models, which has played an important role in macroeconomic analysis. Recently, Fan et al. (2016) further extended the analysis of Stock and Watson (2002a, b) to allow for a nonparametric nonlinear forecast function and multiple nonadditive forecasting indices. Following Fan et al. (2016), we consider the following factor model with a target variable that we aim to forecast:

$$y_{t+1} = g(\boldsymbol{\phi}_1' f_t, \cdots, \boldsymbol{\phi}_L' f_t, \epsilon_{t+1}), \tag{1.1}$$

$$x_{it} = b_i' f_t + u_{it}, \quad 1 \le i \le p, \ 1 \le t \le T, \tag{1.2}$$

where $x_{it}$ is the $i$-th high-dimensional predictor observed at time $t$, $b_i$ is a $K \times 1$ vector of factor loadings, $f_t$ is a $K \times 1$ vector of common factors driving both the predictor and the response, $g(\cdot)$ is an unknown forecast function that is possibly nonadditive and nonseparable, $u_{it}$ is an idiosyncratic error, and $\epsilon_{t+1}$ is an independent stochastic error. Here, $b_i$, $f_t$ and $u_{it}$ are unobserved.

Had the factors been observed, model (1.1) would be commonly adopted in the literature of sufficient dimension reduction. The linear space spanned by $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_L$, denoted by $\mathcal{S}_{y|f}$, is the parameter of interest that is identifiable and known as the central subspace (Cook, 1998). Multiple methods have been proposed to estimate $\mathcal{S}_{y|f}$, among which a main family employs inverse moments, i.e. the moments of the conditional distribution $f_t \mid y_{t+1}$, and are called the inverse regression methods. Representative members include sliced inverse regression (Li, 1991), which uses the inverse first moment $E(f_t \mid y_{t+1})$, sliced average variance estimation (Cook and Weisberg, 1991) and directional regression (Li and Wang, 2007), which additionally use the inverse second moment $E(f_t f_t' \mid y_{t+1})$, and moreover, the inverse third-moment method (Yin and Cook, 2003), etc. Generally, methods that employ higher-order inverse moments can capture more comprehensive information from the data, which makes them better suited to exhaustive estimation, i.e. detecting all the directions in $\mathcal{S}_{y|f}$. The price, on the other hand, is to estimate more moments and impose additional distributional assumptions on the factors; see Li (1991), Yin and Cook (2003), and Li and Wang (2007) for more details.

A commonly recognized limitation of sufficient dimension reduction methods, including sliced inverse regression, is that they can only handle predictors with either a finite dimension or a diverging dimension that is dramatically smaller than the sample size (Zhu et al., 2006). Therefore, even though it is theoretically desirable to directly apply sufficient dimension reduction to forecasting, using $x_t = (x_{1t}, \ldots, x_{pt})'$ as the predictor and $y_{t+1}$ as the response, none of the existing sufficient dimension reduction methods would be readily applicable. For this reason, it is necessary to reduce the dimension of the predictor prior to sufficient dimension reduction, for which adopting the factor model (1.2) is a reasonable choice. An alternative choice can be found in Jiang and Liu (2014) and Yu et al. (2016), etc.

Following this logic, Fan et al. (2016) introduced the sufficient forecasting scheme, which uses factor analysis in model (1.2) to estimate $f_t$, and applies sliced inverse regression in model (1.1) with the estimated factors as the predictor. Such a combination provides a promising forecasting technique that not only extracts the underlying commonality of the high-dimensional predictor but also models the complex dependence between the predictor and the forecast target. Meanwhile, it allows the dimension of the predictor to diverge and even become much larger than the number of observations, which is intrinsically appealing for solving high-dimensional forecasting problems.

It is important to note that the consistency of the sequential procedure in Fan et al. (2016) is not as automatic as it may appear. Let $\mathcal{S}_{y|\hat f}$ be the central subspace based on the estimated factors $\hat f_t$'s. Without additional assumptions, the two central subspaces $\mathcal{S}_{y|f}$ and $\mathcal{S}_{y|\hat f}$ may not coincide (Li and Yin, 2007). Thus, the naive method of applying existing dimension reduction methods to the estimated factors $\hat f_t$'s may not necessarily lead to the consistent estimation of $\mathcal{S}_{y|f}$. Fan et al. (2016) effectively solved this issue by developing an important invariance result between $f_t$ and $\hat f_t$; see Proposition 1 and Equation (2.6) of Fan et al. (2016). This invariance result provides an essential theoretical foundation for using sliced inverse regression (Li, 1991) under Models (1.1)–(1.2). The method and theory of Fan et al. (2016) required three assumptions:

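• (A1) The number of factors $K$ is a fixed constant as $p, T \to \infty$.

• (A2) The matrix satisfies that are positive and distinct.

• (B1) Linearity condition (Li, 1991): $E(b' f_t \mid \boldsymbol{\phi}_1' f_t, \ldots, \boldsymbol{\phi}_L' f_t)$ is a linear function of $\boldsymbol{\phi}_1' f_t, \ldots, \boldsymbol{\phi}_L' f_t$ for any $b \in \mathbb{R}^K$.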

A fixed $K$ facilitates the accurate estimation of factors and loadings, but it also narrows our attention to a fixed dimensional factor space to forecast $y_{t+1}$. It is advocated that a diverging $K$ may find a better balance between estimating factors and forecasting (Lam and Yao, 2012; Li et al., 2013; Jurado et al., 2015); in other words, should a growing $K$ be used, the sufficient forecasting would deliver a potentially more powerful forecast. In the meantime, a diverging $K$ would also relax the linearity condition (B1), which would greatly enhance the applicability of the forecasting method. As $\mathcal{S}_{y|f}$ is unknown, the condition is commonly strengthened to require that it is satisfied for basis matrices of any $L$-dimensional subspace of $\mathbb{R}^K$, which equivalently requires $f_t$ to follow an elliptical distribution and can be restrictive. However, when $L$ is much smaller than $K$, the low-dimensional projection of a high-dimensional random vector is always approximately linear under fairly general regularity conditions (Hall and Li, 1993), and the ellipticity of $f_t$ is no longer needed.

Assumption (A2) can be restrictive as well: one can easily verify that is zero if the factors have an elliptical distribution and the forecast function is symmetric along direction of the factors. The latter occurs, for instance, when the forecast target was investigated using squared factors (Bai and Ng, 2008; Ludvigson and Ng, 2007). When (A2) fails, Fan et al. (2016) cannot detect all the sufficient forecasting directions and will lead to sub-optimal forecasting performance. Referring to the literature of sufficient dimension reduction mentioned above, a natural way to relax (A2) is to employ higher-order inverse regression methods, such as directional regression and the inverse third-moment method, in the sufficient dimension reduction stage.

In this work, we follow the same spirit as in Fan et al. (2016) to conduct factor analysis and sufficient dimension reduction sequentially based on models (1.1) and (1.2). We allow the number of factors $K$ to diverge as $p$ and $T$ grow, and employ directional regression and the inverse third-moment method for sufficient dimension reduction, which use higher-order inverse moments and exhaustively estimate $\mathcal{S}_{y|f}$ under weaker assumptions. The proposed method is applicable to a generally distributed predictor with a diverging number of factors, and is capable of detecting a non-monotone effect of the factors on the response. Hence it is more applicable and effective than Fan et al. (2016) in many cases. During our theoretical development, we also propose an invariance result, which makes a separate contribution to the literature of surrogate sufficient dimension reduction.

The rest of this paper is organized as follows. We first study directional regression in the sufficient forecasting in Section 2, including the factor analysis in Subsection 2.1, an invariance result for sufficient dimension reduction in Subsection 2.2, the details of implementation in Subsection 2.3, the asymptotic results in Subsection 2.4, and a Bayesian information criterion (BIC) to select the dimension of the central subspace in Subsection 2.5. In Section 3, we further incorporate the inverse third-moment method in the sufficient forecasting, and develop the corresponding theoretical results. Section 4 is devoted to the simulation studies and a real data example that illustrates the power of the proposed method. We leave all the proofs to Section 5.

## 2 Forecasting with directional regression

### 2.1 Factor analysis

To make a forecast, we need to estimate the factor loadings $B = (b_1, \ldots, b_p)'$ and the error covariance matrix $\Sigma_u$. For ease of presentation, we first assume that the number of underlying factors $K$ is growing as $p, T \to \infty$ but known. Consider the following constrained least squares problem:

$$(\hat B_K, \hat F_K) = \arg\min_{(B, F)} \|X - BF'\|_F^2, \quad \text{subject to } T^{-1} F'F = I_K, \ B'B \text{ is diagonal}, \tag{2.1}$$

where $X = (x_1, \ldots, x_T)$, $F = (f_1, \ldots, f_T)'$, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The constraints that $T^{-1} F'F = I_K$ and that $B'B$ is diagonal address the issue of identifiability during the minimization. As they can always be satisfied for any $B$ and $F$ after appropriate matrix operations, they impose no additional restrictions on the factor model (1.2). It is a commonly known fact that the minimizers $\hat B_K$ and $\hat F_K$ of (2.1) are such that the columns of $\hat F_K / \sqrt{T}$ are the eigenvectors corresponding to the $K$ largest eigenvalues of the $T \times T$ matrix $X'X$ and $\hat B_K = T^{-1} X \hat F_K$. To simplify notation, let $\hat B = \hat B_K$ and $\hat F = \hat F_K$.
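The principal-components solution of (2.1) can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming the predictor matrix is stored as a `p x T` array; the function name `estimate_factors` and the simulated data are hypothetical and not part of the original paper.

```python
import numpy as np

def estimate_factors(X, K):
    """Principal-components solution of the constrained least squares problem (2.1).

    X : (p, T) array whose columns are the observations x_t.
    K : number of factors, assumed known here.
    Returns (B_hat, F_hat) with shapes (p, K) and (T, K).
    """
    p, T = X.shape
    # Eigen-decomposition of the T x T matrix X'X (eigh returns ascending order).
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    top = np.argsort(eigvals)[::-1][:K]
    F_hat = np.sqrt(T) * eigvecs[:, top]   # enforces F'F / T = I_K
    B_hat = X @ F_hat / T                  # least squares loadings given F_hat
    return B_hat, F_hat

# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(0)
p, T, K = 50, 200, 3
B = rng.normal(size=(p, K))
F = rng.normal(size=(T, K))
X = B @ F.T + 0.1 * rng.normal(size=(p, T))
B_hat, F_hat = estimate_factors(X, K)
```

Both identifiability constraints hold by construction: `F_hat.T @ F_hat / T` equals the identity, and `B_hat.T @ B_hat` is diagonal because its entries are the leading eigenvalues of `X.T @ X` divided by `T`.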

As both the dimension of the predictor and the number of factors are diverging, it is necessary to regulate the magnitude of the factor loadings and the idiosyncratic error , so that the latter is negligible with respect to the former. We should also regulate the stationarity of the time series. In this paper, we adopt the following assumptions. For simplicity in notation, we let , and be the maximum of the absolute values of all the entries in . Let and denote the algebras generated by and respectively.

###### Assumption 2.1.

(1) There exists such that , and there exist two positive constants and such that

 c1

(2) Identification: , and is a diagonal matrix with distinct entries.

###### Assumption 2.2 (Data Generating Process).

, and are three independent groups, and all of them are strictly stationary. The factor process satisfies that both and are bounded sequences. In addition, for and some .

###### Assumption 2.3 (Residuals and Dependence).

There exists a positive constant that does not depend on or , such that
(1) , and .
(2) , and for every ,
(3) For every , .

Assumption 2.1 regulates the signal strength of the factors contained in the predictor through the order of the factor loadings, and Assumption 2.3 regulates the order of the idiosyncratic errors contained in the predictor. Together, they ensure that the former dominates the latter in the population level as grows. Assumption 2.2 implies that the sample observations are only weakly dependent, so that the estimation accuracy grows with the sample size.

Under these assumptions, we have the following consistency result for estimating the factor loadings. Instead of the Frobenius norm used in (2.1), we use the spectral norm to measure the magnitude of a matrix, defined as $\|A\| = \lambda_{\max}^{1/2}(A'A)$, the square root of the largest eigenvalue of $A'A$, for any matrix $A$.

###### Theorem 2.1.

Suppose . Let and . Under Assumptions 2.1, 2.2 and 2.3, we have

• ,

• .

Because the dimension of the factor loadings is diverging, the estimation error accumulates as $p$ grows. For a $p$-dimensional vector whose entries are constantly one, its spectral norm is $p^{1/2}$, which diverges to infinity. Thus, we should treat $p^{1/2}$ as the unit magnitude of the spectral norm of matrices with $p$ rows, in which sense statement 1) of Theorem 2.1 justifies the estimation consistency of the factor loadings. As the error term shrinks as $p$ grows under Assumption 2.3, the convergence order of the factor loading estimation largely depends on $p$: a higher-dimensional predictor means a more accurate estimation. The convergence order in this theorem can be further improved if we impose stronger assumptions on the negligibility of the error terms in the factor model (1.2).

Given , it is easy to see that . Thus, together with the negligibility of the error term , the consistency of and indicates the closeness between the true factors and the estimated factors , of which the latter will be used in the subsequent sufficient dimension reduction. The error covariance matrix can be estimated by thresholding the sample covariance matrix of the estimated residual , denoted by , as in Cai and Liu (2011), Xue et al. (2012), Fan et al. (2013) and Fan et al. (2016).

### 2.2 An invariance result

Using the estimated factors from factor analysis, we now apply directional regression to further produce a lower-dimensional sufficient predictor for forecasting. Before digging into more details about the estimation consistency, we focus on the population level, and temporarily assume an oracle scenario where $B$ is known a priori. This scenario simplifies the discussion, as it eliminates the estimation error introduced by estimating the factor loadings. We will return to the realistic case afterwards.

As pointed out in Fan et al. (2016), the inverse regression methods are not readily applicable to estimate the central subspace $\mathcal{S}_{y|f}$, for the reason that the estimated factors $\hat f_t$ always contain an error term aside from the true factors $f_t$, which can be asymptotically non-negligible in certain settings. To see this point, we apply ordinary least squares estimation in (1.2), and have

$$\hat f_t = f_t + u_t^*, \tag{2.2}$$

where $u_t^* = (B'B)^{-1} B' u_t$ with $u_t = (u_{1t}, \ldots, u_{pt})'$. Then $u_t^*$ is the price that we pay for the contamination of the original predictor by the error term $u_t$. Such a price is intrinsic as it is inevitable whatever estimators of $f_t$ are used. The consequence is two-fold. First, as no distributional assumption is imposed on $u_t^*$, the regularity conditions on $\hat f_t$ such as the linearity condition (B1) may fail, which causes inconsistency of the inverse regression methods in estimating the corresponding central subspace $\mathcal{S}_{y|\hat f}$. Second, even if $\mathcal{S}_{y|\hat f}$ can be estimated consistently, it need not coincide with the central subspace of interest.
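The decomposition (2.2) in the oracle scenario can be checked numerically. The sketch below assumes the loading matrix `B` is known and applies ordinary least squares to a single observation; the explicit form of the error term, `(B'B)^{-1} B' u_t`, is what the least squares algebra implies under model (1.2), and all variable names and the heavy-tailed error distribution are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 40, 3
B = rng.normal(size=(p, K))            # oracle loadings, assumed known
f_t = rng.normal(size=K)               # true factors at time t
u_t = rng.standard_t(df=5, size=p)     # non-normal idiosyncratic error
x_t = B @ f_t + u_t                    # model (1.2) in vector form

# Ordinary least squares of x_t on B gives the estimated factors.
f_hat = np.linalg.solve(B.T @ B, B.T @ x_t)
# The same projection applied to the error gives the contamination term.
u_star = np.linalg.solve(B.T @ B, B.T @ u_t)

print(np.allclose(f_hat, f_t + u_star))
```

The contamination term inherits whatever distribution the idiosyncratic errors have, which is why distributional conditions on the true factors do not automatically carry over to the estimated ones.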

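To address this issue, one may naturally search for suitable conditions that ensure the coincidence between the two central subspaces $\mathcal{S}_{y|f}$ and $\mathcal{S}_{y|\hat f}$, which is equivalent to requiring that the two spaces have equal dimension and that any basis matrix $(\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_L)$ of $\mathcal{S}_{y|f}$ satisfies

$$y_{t+1} \perp\!\!\!\perp (f_t + u_t^*) \mid (\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_L)'(f_t + u_t^*).$$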

Such a study can be embedded in surrogate sufficient dimension reduction (Li and Yin, 2007), which aims to conduct sufficient dimension reduction when the predictor is contaminated by a measurement error. In particular, a direct application of Theorem in Li and Yin (2007) implies that the coincidence of the central subspaces holds if both $f_t$ and $u_t^*$ are normally distributed, subject to the independence between $f_t$ and $u_t^*$ that we have already assumed. In that case, the resulting normality of $\hat f_t$ also makes $\mathcal{S}_{y|\hat f}$ estimable by the inverse regression methods like sliced inverse regression. Hence the central subspace of interest can be consistently estimated.

However, the normality of adopted in Li and Yin (2007) can be easily violated in practice, in which case the coincidence between the central subspaces becomes infeasible. It is important to notice that is the only parameter of interest, and its coincidence with is needed only when the latter serves as the intermediate parameter in the estimation procedure. Consequently, such coincidence can be relaxed if we manage to find other intermediate parameters, which naturally leads us to consider the inverse regression methods.

In all the inverse regression methods, the central subspace is characterized as the column space of certain positive semi-definite matrix parameters, called the kernel matrices. If we can manage to estimate the kernel matrices using the estimated factors $\hat f_t$ in place of $f_t$, then the central subspace can be estimated as well. Naturally, this can be realized if we adopt suitable conditions on the predictor, so that the kernel matrices are invariant to the change from $f_t$ to $\hat f_t$. Because the kernel matrices are constructed only by the inverse moments, rather than the entire joint distribution, we expect such conditions to be weaker than those required for the coincidence of the central subspaces. This point is demonstrated in the following theorem for directional regression.

###### Theorem 2.2.

Under model (1.2), the kernel matrix for directional regression,

$$M_{dr} = E\big\{2\,\mathrm{var}(f_t) - E[(f_t - f_s)(f_t - f_s)' \mid y_{t+1}, y_{s+1}]\big\}^2, \tag{2.3}$$

where $(f_s, y_{s+1})$ is an independent copy of $(f_t, y_{t+1})$, is invariant if $f_t$ and $f_s$ are replaced by $\hat f_t$ and $\hat f_s$ in (2.3), respectively.

Because the true factors have an identity covariance matrix, the form of the kernel matrix adopted in the theorem coincides with its original form in Li and Wang (2007). However, it does modify the latter when the estimated factors are used instead, which no longer have an identity covariance matrix at the population level. From the proof of the theorem, one can easily see that such a modification is crucial, as it removes the effect of the estimation error from the kernel matrix estimation.

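The coincidence between the column space of $M_{dr}$ and the central subspace $\mathcal{S}_{y|f}$ requires both the linearity condition (B1) and the constant variance condition:

• (B2) $\mathrm{var}(f_t \mid \boldsymbol{\phi}_1' f_t, \ldots, \boldsymbol{\phi}_L' f_t)$ is degenerate.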

Since the central subspace is unknown, (B2), like (B1), is commonly strengthened to require that it is satisfied for basis matrices of any $L$-dimensional subspace of $\mathbb{R}^K$. The strengthened conditions equivalently require the factors to be jointly normally distributed, which is again restrictive in practice. If one treats $f_t$ as the response and $(\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_L)' f_t$ as the predictor in regression, then (B1) is the linearity assumption on the regression function and (B2) is the homoscedasticity assumption on the error term. In this sense, we follow the convention in the regression literature and treat (B2) as less worrisome than (B1) in practice.

A similar invariance result to Theorem 2.2 has been developed in Fan, Xue and Yao (2016) for sliced inverse regression where the inverse first moment is involved (see their equation (2.6)). As is allowed to be non-normally distributed, according to Li and Yin (2007), the two central subspaces and may differ, which means that the former may not be recovered by the corresponding kernel matrix in directional regression. This can also be explained by the fact that when is used as the predictor, the linearity condition (B1) and the constant variance condition (B2) are violated due to the arbitrariness of the distribution of , which makes directional regression inconsistent.

One can always argue that under the assumption of negligible error term relative to the true factors , the estimated factors approximate to , which suggests the corresponding approximation between the central subspaces and . Consequently, the central subspace of interest can be still consistently estimated through the intermediate parameter , without using the invariance result. However, as the invariance result justifies the Fisher consistency of directional regression under the existence of potentially non-negligible measurement error , it is useful in more general settings. These settings include, for example, the case when the primary statistical interest is on estimating which linear combinations of affect , i.e. the factor loadings and the central subspace , rather than the forecast function . In this sense, this result itself makes an independent contribution to the literature of surrogate sufficient dimension reduction.

In reality, the hypothetical independent copies and do not exist in the observed data. Consequently, we estimate using its equivalent form, which is derived by expanding (2.3),

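$$\begin{aligned} M_{dr} ={}& 2E\big\{[\mathrm{var}(f_t) - E(f_t f_t' \mid y_{t+1})]^2\big\} + 2E^2\big[E(f_t \mid y_{t+1}) E'(f_t \mid y_{t+1})\big] \\ &+ 2E\big[E'(f_t \mid y_{t+1}) E(f_t \mid y_{t+1})\big] \cdot E\big[E(f_t \mid y_{t+1}) E'(f_t \mid y_{t+1})\big]. \end{aligned} \tag{2.4}$$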
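For readers who wish to verify the expansion from (2.3) to (2.4), the following sketch outlines the algebra. It assumes, in addition to what is stated in the text, that the factors have mean zero; the shorthand $m_t$ and $A_t$ is introduced here only for this illustration.

```latex
% Shorthand (introduced for this sketch only):
%   m_t = E(f_t | y_{t+1}),   A_t = var(f_t) - E(f_t f_t' | y_{t+1}).
% Assuming E(f_t) = 0, we have E(m_t) = 0 and E(A_t) = 0.
\begin{align*}
E[(f_t - f_s)(f_t - f_s)' \mid y_{t+1}, y_{s+1}]
  &= E(f_t f_t' \mid y_{t+1}) + E(f_s f_s' \mid y_{s+1}) - m_t m_s' - m_s m_t', \\
2\,\mathrm{var}(f_t) - E[(f_t - f_s)(f_t - f_s)' \mid y_{t+1}, y_{s+1}]
  &= A_t + A_s + m_t m_s' + m_s m_t'.
\end{align*}
% Squaring and taking expectations, every cross term that contains a lone
% factor A_t, A_s, m_t or m_s vanishes by independence and the zero means:
\begin{align*}
E(A_t^2) + E(A_s^2) &= 2E(A_t^2), \\
E(m_t m_s' m_t m_s') + E(m_s m_t' m_s m_t') &= 2E^2(m_t m_t'), \\
E(m_t m_s' m_s m_t') + E(m_s m_t' m_t m_s') &= 2E(m_t' m_t)\, E(m_t m_t'),
\end{align*}
% and summing the three lines gives (2.4).
```

The second line uses the fact that $m_s' m_t$ is a scalar, so that $m_t m_s' m_t m_s' = m_t m_t' m_s m_s'$, whose expectation factorizes by independence.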

The marginal covariance matrix and the conditional covariance matrix can be easily estimated by replacing with its estimate , slicing the support of , and using the sample moments. By Theorem 2.1, the factor loadings are consistently estimated by , so it is plausible that under suitable moment conditions and sufficiently large sample size, the leading eigenvectors of the resulting matrix span a consistent estimator of the central subspace . In Subsection 2.4, we will justify the sufficiency of Assumptions 2.1 - 2.3 for such consistency, and give the corresponding convergence order of the estimation.

### 2.3 Implementation

In the literature of sufficient dimension reduction, it has been a common practice to estimate the inverse moments in the inverse regression methods using the slicing technique; that is, we partition the sample of $y_{t+1}$ into $H$ slices with equal sample proportion, and estimate the moments of the factors within each slice. At the population level, it corresponds to partitioning the support of $y_{t+1}$ into $H$ slices, i.e. intervals, with equal probability, and using the slice indicator as the new working response variable.

The slicing technique substantially simplifies the implementation of the inverse regression methods. Because the slice indicator is a measurable function of the original response , must affect through . Thus, the working central subspace is always a subspace of the central subspace of interest , which means that no redundant directions of will be selected in the estimation.

Using the slice indicator , the inverse moments and can be easily estimated by the usual sample moments. As constrained in factor analysis, has sample variance equal to , which we use to estimate its population variance . Alternatively, one can also use the restriction that to estimate by , where is the thresholding covariance estimator. An omitted simulation study shows that the two estimators perform similarly to each other, so we choose the former for simplicity. The kernel matrix estimator , whose leading eigenvectors span an estimate of the central subspace, is then given by (2.3). In summary, the proposed estimator can be implemented using the following steps:

An omitted simulation study shows that our estimate is robust to the choice of , as long as the latter falls into a reasonable range, say three to ten. This phenomenon has also been observed by multiple authors; see, for example, Li (1991) and Li and Wang (2007).
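A compact sketch of the sample-level procedure described in this subsection is given below. It slices the response, forms the within-slice first and second moments of the (estimated) factors, plugs them into the expanded form (2.4), and extracts the leading eigenvectors. The function names, the default of five slices, and the simulated example are assumptions made for illustration; the marginal covariance is estimated here by the sample covariance of the supplied factors.

```python
import numpy as np

def slice_labels(y, H):
    """Assign each response to one of H slices with (nearly) equal counts."""
    ranks = np.argsort(np.argsort(y))
    return (ranks * H) // len(y)

def dr_kernel(F_hat, y, H=5):
    """Sample version of the kernel matrix in (2.4).

    F_hat : (T, K) factors, row t paired with the response y[t] (i.e. y_{t+1}).
    """
    T, K = F_hat.shape
    Fc = F_hat - F_hat.mean(axis=0)        # centre the factors
    V = Fc.T @ Fc / T                      # marginal covariance
    labels = slice_labels(y, H)
    term1 = np.zeros((K, K))
    E_mm = np.zeros((K, K))                # E[ m m' ], m = slice mean
    E_mtm = 0.0                            # E[ m' m ]
    for h in range(H):
        idx = labels == h
        w = idx.mean()                     # slice proportion
        m = Fc[idx].mean(axis=0)
        S = Fc[idx].T @ Fc[idx] / idx.sum()  # E(f f' | slice)
        D = V - S
        term1 += w * (D @ D)
        E_mm += w * np.outer(m, m)
        E_mtm += w * (m @ m)
    return 2 * term1 + 2 * E_mm @ E_mm + 2 * E_mtm * E_mm

def dr_directions(F_hat, y, L, H=5):
    """Leading L eigenvectors of the sample kernel matrix."""
    vals, vecs = np.linalg.eigh(dr_kernel(F_hat, y, H))
    return vecs[:, np.argsort(vals)[::-1][:L]]

# Illustration with a symmetric (non-monotone) forecast function.
rng = np.random.default_rng(0)
T, K = 2000, 4
F = rng.normal(size=(T, K))
y = F[:, 0] ** 2 + 0.1 * rng.normal(size=T)
phi_hat = dr_directions(F, y, L=1)
```

In this example the response depends on the first factor only through its square, so the inverse first moment carries no signal, while the second-moment term in the kernel picks up the first coordinate direction.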

### 2.4 Asymptotic properties

For simplicity of presentation, in this subsection we assume both the dimension of the central subspace $L$ and the number of factors $K$ to be known a priori, where the latter is a diverging sequence. The same asymptotic result can be developed similarly if $L$ and $K$ are unknown but consistently estimated. Consistent determination of $L$ and $K$ will be discussed later in Subsection 2.5, under which the result developed here can still be applied.

We first introduce some elementary result about the consistency of the eigen-decomposition of random matrices. The following concept characterizes a sequence of non-negligible random variables.

###### Definition 2.1.

A sequence of random variables $\{Z_T\}$ is called bounded below from zero in probability, and written as $Z_T = O_P^+(1)$, if there exists a constant $c > 0$ such that $P(Z_T > c) \to 1$ as $T \to \infty$.

This concept is a natural generalization of non-stochastic sequences that are bounded below from zero to a probabilistic version, much like the generalization from to . In particular, it includes these non-stochastic sequences as a special case. For convenience in notations, we denote these non-stochastic sequence also by , if no ambiguity is caused. Another simple example of is , where is an arbitrary positive constant and is an arbitrary sequence of random variables such that . Using this concept, the following result shows when the leading eigenvectors of a sequence of random matrices span a converging linear space. For any symmetric matrix , we denote its smallest eigenvalue by .

###### Lemma 2.1.

Let be a sequence of symmetric random matrices and and be an orthonormal basis of . If

• ,

• there exists a non-stochastic sequence with , such that and ,

then the linear space spanned by the leading eigenvectors of , denoted by , consistently estimates the linear space spanned by in the sense that the projection matrix of the former converges to that of the latter in probability; that is, .

If we let be the sample kernel matrix of directional regression, and be an orthonormal basis of the central subspace, then condition (a) of the lemma means that , which requires the rank of the kernel matrix to be at least , or equivalently, exhaustive estimation of the central subspace by directional regression. Based on Theorem in Li and Wang (2007), the following result gives the corresponding sufficient conditions on the inverse moments.

###### Theorem 2.3.

The following two assumptions are equivalent:

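• For any sequence of non-stochastic vectors and an independent copy of ,

$$\mathrm{var}\big[E[\{v_T'(f_t - f_s)\}^2 \mid y_{t+1}, y_{s+1}]\big] = O_P^+(1).$$

• For any sequence of non-stochastic vectors ,

$$\max\big\{\mathrm{var}[E(v_T' f_t \mid y_{t+1})], \ \mathrm{var}[E\{(v_T' f_t)^2 \mid y_{t+1}\}]\big\} = O_P^+(1).$$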

Moreover, they imply the exhaustiveness of directional regression. That is, $\lambda_L(M_{dr})$, the $L$th eigenvalue of $M_{dr}$, is $O_P^+(1)$.

The statements in this theorem require that all the directions in the central subspace are captured in the first two inverse moments, and , which is satisfied, for example, when is normally distributed. See also Cook and Lee (1999) for more details. Although one can always construct specific models in which the effect of is revealed only in higher-order inverse moments, the exhaustiveness of directional regression has been commonly recognized in applications. A detailed justification can also be found in Li and Wang (2007).

Because we estimate a sequence of central subspaces with diverging dimensions, in addition to the exhaustiveness of directional regression at each dimension $K$, we further require in this theorem that the weakest signal strength in the central subspace does not vanish as $K$ grows. The condition can be relaxed if we conduct a more careful study of the order restriction in Lemma 2.1 and allow a slower convergence rate in estimating the central subspace. But considering the fact that $L$ is fixed and the popularity of similar conditions in the literature of high-dimensional data analysis, we decide to adopt it here, and leave further relaxation to future work.

###### Theorem 2.4.

Under Assumptions 2.1, 2.2 and 2.3, the assumption in Theorem 2.3, the linearity condition (B1), and the constant variance condition (B2), if , then the leading eigenvectors of , denoted by , span a consistent estimator of the central subspace in the sense that

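$$\big\|(\hat{\boldsymbol{\phi}}_1, \ldots, \hat{\boldsymbol{\phi}}_L)(\hat{\boldsymbol{\phi}}_1, \ldots, \hat{\boldsymbol{\phi}}_L)' - (\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_L)(\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_L)'\big\|_F = O_P(K^{3/2} p^{-1/2} + K T^{-1/2}).$$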

In connection with Theorem 2.1, the estimation error in sufficient dimension reduction, as justified in this theorem, can be divided into two parts. The first part, which is of order $K^{3/2} p^{-1/2}$, is inherited from factor analysis. This part represents the price we pay for estimating the factor loadings, and it depends on the dimension $p$ of the original predictor. By contrast, the second part, which is of order $K T^{-1/2}$, does not depend on $p$ and is newly generated in the sufficient dimension reduction stage. As can be easily seen from the proof of the theorem, it represents the price we pay for estimating the unknown inverse second moment involved in the kernel matrix. Therefore, this part would persist even if no error were generated in factor analysis.

### 2.5 Model selection

We now discuss how to determine the number of factors and the dimension of the central subspace . The problem is commonly called order determination in the literature of dimension reduction (Luo and Li, 2016).

The correct specification of the number of factors, , is a fundamental issue of large factor models. Various consistent order-determination approaches have been established under the setting that is fixed; see Bai and Ng (2008); Onatski (2010); Ahn and Horenstein (2013), etc. However, a growing number of empirical studies suggest that the number of factors may increase as the cross-sectional dimension or time-series dimension increases; see Ludvigson and Ng (2009) and Jurado et al. (2015), whose estimate of the number of factors explaining certain macroeconomic time series ranges from 1 to 10. Recently, Li et al. (2013) extended the analysis of Bai and Ng (2008) to estimate the number of factors that may increase with the cross-sectional size and time period, and prove the consistency of a modified procedure of Bai and Ng (2008). Specifically, it estimates by

$$\hat K = \arg\min_{0 \le K \le K_{\max}} \log\Big(\frac{1}{pT} \big\|X - T^{-1} X \hat F_K \hat F_K'\big\|_F^2\Big) + K g(p, T),$$

where is a prescribed upper bound that possibly increases with and corresponds to the solution to (2.1). is a penalty function such that and , where . The specific choice of does not affect asymptotic results. One example suggested by Bai and Ng (2008) as well as Li et al. (2013) is

$$g(p, T) = \frac{p + T}{pT} \log\Big(\frac{pT}{p + T}\Big).$$

Under Assumptions 2.1–2.3 and letting , Li et al. (2013) showed that the data-driven is a consistent estimator of .
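The information criterion above can be sketched as follows, using the example penalty $g(p,T)$ displayed in the text. The function name `select_num_factors`, the choice of `K_max`, and the simulated data are illustrative assumptions; the principal-components step repeats the solution of (2.1) for each candidate number of factors.

```python
import numpy as np

def select_num_factors(X, K_max):
    """Minimise log(residual variance) + K * g(p, T) over 0 <= K <= K_max.

    X : (p, T) array whose columns are the observations x_t.
    """
    p, T = X.shape
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1]               # descending eigenvalues
    g = (p + T) / (p * T) * np.log(p * T / (p + T))  # example penalty from the text
    crit = []
    for K in range(K_max + 1):
        F_K = np.sqrt(T) * eigvecs[:, order[:K]]    # (T, K); empty when K = 0
        resid = X - X @ F_K @ F_K.T / T
        crit.append(np.log(np.sum(resid ** 2) / (p * T)) + K * g)
    return int(np.argmin(crit))

# Hypothetical example with three strong factors.
rng = np.random.default_rng(0)
p, T, K_true = 100, 200, 3
B = rng.normal(size=(p, K_true))
F = rng.normal(size=(T, K_true))
X = B @ F.T + 0.5 * rng.normal(size=(p, T))
K_hat = select_num_factors(X, K_max=8)
```

The penalty is what prevents the criterion from always choosing `K_max`: adding a spurious factor reduces the residual sum of squares only slightly, by less than the penalty increment in this example.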

Next we discuss the choice of . In the sufficient dimension reduction literature, multiple methods have been proposed to determine the dimension of the central subspace, including the sequential tests (Li 1991, Li and Wang 2007), the permutation test (Cook and Weisberg 1991), the bootstrap procedure (Ye and Weiss, 2003), the cross-validation method (Xia et al., 2002; Wang and Xia, 2008), the BIC type procedure (Zhu, Miao and Peng 2006), and the ladle estimator (Luo and Li, 2016). Among them, the BIC type procedure can be easily implemented in conjunction with the inverse regression methods, and reaches the desired consistency in high-dimensional cases. For a -dimensional positive semi-definite matrix parameter of rank and its sample estimator , let and be their eigenvalues in the descending order, respectively. We set a censoring constant that is invariant of , and define the objective function to be

$$G(l) = (T/2) \sum_{i = 1 + \min(\tau, l)}^{K_c} \big\{\log(\hat\lambda_i + 1) - \hat\lambda_i\big\} - C_T \, l(2K - l + 1)/2, \tag{2.5}$$

in which is nearest integer to and is the number of ’s that are greater than zero. We then estimate as the maximizer of . Compared with the original BIC type procedure in Zhu, Miao and Peng (2006), we introduce the censoring constant to restrict the range of candidate dimensions. Because the number of factors is diverging while the dimension of the central subspace is fixed, this restriction is reasonable for all large samples. The restriction is indeed crucial for the consistency of the order-determination procedure revealed in the following theorem without introducing additional constraints on the order of and , which improves the original result in Zhu, Miao and Peng (2006).

###### Theorem 2.5.

Suppose and . If satisfies that

 $$C_T K T^{-1}\to 0 \quad\text{and}\quad \|\hat{M}-M\|^{2}=o_P\!\left(C_T K T^{-1}\right),$$

then converges to in probability.

A candidate for can be . Referring to Theorem 2.4, if we apply the BIC-type procedure in conjunction with directional regression to detect the dimension of the central subspace , then we can choose to be . To further refine the procedure, we can incorporate a multiplicative constant in and tune its value in a data-driven manner, for example by cross-validation.
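The order-determination criterion (2.5) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bic_dimension` is hypothetical, the censored upper limit `K_c` is passed in directly (defaulting to K), the candidate dimensions are assumed to range over 0, ..., `K_c`, and the value `C_T = 1.0` in the usage lines is an arbitrary placeholder rather than a recommended choice.

```python
import numpy as np


def bic_dimension(eigvals, T, C_T, K_c=None):
    """Maximize a BIC-type criterion of the form (2.5) over candidate dimensions.

    eigvals : eigenvalues of the estimated K x K kernel matrix
    T       : sample size
    C_T     : penalty constant (user-supplied; an assumption in this sketch)
    K_c     : censored upper limit of the sum (assumed; defaults to K)
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative numerical noise
    K = lam.size
    if K_c is None:
        K_c = K
    tau = int(np.sum(lam > 0))       # number of positive eigenvalues
    terms = np.log(lam + 1.0) - lam  # log(lambda_i + 1) - lambda_i
    scores = []
    for l in range(K_c + 1):
        start = min(tau, l)          # zero-based index of i = 1 + min(tau, l)
        fit = 0.5 * T * terms[start:K_c].sum()
        penalty = C_T * l * (2 * K - l + 1) / 2.0
        scores.append(fit - penalty)
    scores = np.array(scores)
    return int(np.argmax(scores)), scores


# Illustrative eigenvalues: two clearly nonzero, four close to zero.
eigvals = [2.0, 1.0, 0.01, 0.005, 0.002, 0.001]
d_hat, scores = bic_dimension(eigvals, T=500, C_T=1.0)
print(d_hat, scores)
```

The first term rewards removing large eigenvalues from the tail sum, while the penalty grows roughly like `C_T * l * K`, so the maximizer balances the two.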

## 3 Sufficient forecasting with the inverse third moment

As mentioned in the Introduction, Yin and Cook (2003) employed the inverse third moment in their kernel matrix. In our context, the inverse third moment is

 $$\mu_{3}^{0}(f_t\mid y_{t+1})=E\left[\{f_t-E(f_t\mid y_{t+1})\}\otimes\{f_t-E(f_t\mid y_{t+1})\}\{f_t-E(f_t\mid y_{t+1})\}'\,\middle|\,y_{t+1}\right], \tag{3.1}$$

which can be treated as a matrix that contains distinct rows. Let be the sub-matrix of that contains these distinct rows. The kernel matrix of the inverse third-moment method, where the subscript stands for “third moment”, is . For the column space of to be a subspace of the central subspace , in addition to the linearity condition (B1) and the constant variance condition (B2), the distribution of the factors must also satisfy the symmetry condition (Yin and Cook, 2003):

• .

Like the linearity condition (B1) and the constant variance condition (B2), the symmetry condition (B3) is also satisfied when is normally distributed. Referring to the discussion below (B2), if we treat as the response and as the predictor in regression, then (B3) is implied by the symmetry of the error term, which is frequently assumed in the regression literature. Thus the condition can be regarded as fairly general in practice.

Compared with directional regression, the inverse third-moment method incorporates higher-order inverse moments, so it captures additional information about the inverse conditional distribution . As justified in the following theorem, under the symmetry condition (B3), it can serve as a useful complement to directional regression, if the latter fails to be exhaustive.
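To make the inverse third moment in (3.1) concrete, the sketch below computes a slice-based sample version and a kernel matrix from it. Several choices here are assumptions for illustration only: the function names, the use of equal-sized slices of the sorted response, keeping all K² rows rather than only the distinct ones, and forming the kernel as a slice-weighted average of μ'μ. The simulated data are likewise arbitrary.

```python
import numpy as np


def third_moment_matrix(Z):
    """Sample average of z (x) z z' over the rows of Z, as a (K*K, K) matrix.

    Row i*K + j, column k holds the average of z_i * z_j * z_k.
    """
    n, K = Z.shape
    tensor = np.einsum("ni,nj,nk->ijk", Z, Z, Z) / n
    return tensor.reshape(K * K, K)


def third_moment_kernel(F, y, n_slices=5):
    """Slice-based kernel built from within-slice centered third moments."""
    F = np.asarray(F, dtype=float)
    y = np.asarray(y, dtype=float)
    T, K = F.shape
    order = np.argsort(y)                     # sort observations by response
    kernel = np.zeros((K, K))
    for idx in np.array_split(order, n_slices):
        Fh = F[idx]
        Zh = Fh - Fh.mean(axis=0)             # center at the slice mean
        mu = third_moment_matrix(Zh)          # (K*K, K) inverse third moment
        kernel += (len(idx) / T) * (mu.T @ mu)
    return kernel


# Simulated illustration (assumed data-generating process).
rng = np.random.default_rng(0)
F = rng.standard_normal((600, 3))
y = F[:, 0] ** 3 + 0.1 * rng.standard_normal(600)
kernel = third_moment_kernel(F, y, n_slices=5)
print(np.linalg.eigvalsh(kernel)[::-1])
```

Because the kernel is a weighted sum of matrices of the form μ'μ, it is symmetric and positive semi-definite, so its leading eigenvectors can be used as candidate directions.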

###### Theorem 3.1.

Suppose that the linearity condition (B1), the constant variance condition (B2), and the symmetry condition (B3) are satisfied. If, for any sequence of non-stochastic vectors ,

 $$E^{2}\left[\{v_T' f_t-E(v_T' f_t\mid y_{t+1})\}^{3}\,\middle|\,y_{t+1}\right]=O_P^{+}(1),$$

then the inverse third-moment method is exhaustive in the sense that , the th eigenvalue of , is .

As the inverse first moment is excluded from the kernel matrix, the inverse third-moment method cannot capture the corresponding information. In addition, it will also miss any directions of the factors that are associated with the response in a symmetric pattern. Hence, it may fail to be exhaustive in applications. Because directional regression effectively uses the first two inverse moments and can detect symmetric patterns between the factors and the response, in the spirit of Ye and Weiss (2003), we can combine the two methods into one by using the kernel matrix . As shown in the next corollary, compared with each individual method, their ensemble is exhaustive under more general conditions.

###### Corollary 3.1.

If, for any sequence of non-stochastic vectors ,

then , the th eigenvalue of , is .

To estimate using the contaminated factors , similar to directional regression, we need to slightly modify the kernel matrix, which leads to the following invariance result.

###### Theorem 3.2.

Assume that the factor loadings are known a priori. Then, under model (1.2), the kernel matrix is invariant to the replacement of by if we modify to be

 $$\mu_{3}(f_t\mid y_{t+1})=\mu_{3}^{0}(f_t\mid y_{t+1})-E(f_t\otimes f_t f_t').$$

It is easy to see that the symmetry condition (B3) implies . Thus the modified inverse moment coincides with its original form . However, the two moments differ when the estimated factors are used in place of . The modification made in the former removes the effect of the error term from the kernel matrix, so it is crucial to the invariance result. When , which occurs, for example, if has a symmetric distribution, the effect of on automatically vanishes, and the invariance result holds without any modification to the inverse moment.
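The role of symmetry in the preceding discussion can be checked numerically: for a mean-zero vector with a symmetric distribution, the third moment of the form E(x ⊗ x x') vanishes, whereas for a skewed distribution it does not. The sketch below is only an illustration under assumed distributions (standard normal and centered exponential); the function name and the sample size are arbitrary choices.

```python
import numpy as np


def third_moment(X):
    """Sample average of x (x) x x' over the rows of X, as a (K*K, K) matrix."""
    X = np.asarray(X, dtype=float)
    n, K = X.shape
    return (np.einsum("ni,nj,nk->ijk", X, X, X) / n).reshape(K * K, K)


rng = np.random.default_rng(0)
n, K = 200_000, 3

symmetric = rng.standard_normal((n, K))          # symmetric, mean zero
skewed = rng.exponential(1.0, size=(n, K)) - 1.0  # skewed, mean zero

max_sym = np.abs(third_moment(symmetric)).max()
max_skew = np.abs(third_moment(skewed)).max()
print("largest |entry|, symmetric:", max_sym)
print("largest |entry|, skewed:   ", max_skew)
```

In the symmetric case every entry is close to zero up to sampling error, which is why no correction term is needed there; in the skewed case the diagonal-type entries stay well away from zero.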

As with directional regression, when is negligible with sufficiently fast convergence order, the kernel matrix for the true factors can be approximated by using the estimated factors , in which case consistent estimation of the central subspace can be justified without the invariance result. However, the invariance result makes its own contribution to the surrogate sufficient dimension reduction literature, as it sheds light on consistent sufficient dimension reduction estimation using the inverse third moment, even when the measurement error is non-negligible.

As in Li (1991) and Li and Wang (2007), Yin and Cook (2003) used the slicing strategy to estimate the kernel matrix . In our context, we modify estimation Steps 2 and 3 to:

Step 2. For , estimate by

 ˆμ30(