# Joint Likelihood-based Principal Components Regression

We propose a method for estimating principal components regressions by maximizing a multivariate normal joint likelihood for responses and predictors. In contrast to classical principal components regression, our method uses information in both responses and predictors to select useful linear combinations of the predictors. We show our estimators are consistent when responses and predictors have sub-Gaussian distributions and the number of observations tends to infinity faster than the number of predictors. Simulations indicate our method is substantially more accurate than classical principal components regression in estimation and prediction, and that it often compares favorably to competing methods such as partial least squares and predictor envelopes. We corroborate the simulation results and illustrate the practical usefulness of our method with a data example with cross-sectional prediction of stock returns.

There are no comments yet.

## Authors

• 7 publications
02/18/2021

### A Generative Approach to Joint Modeling of Quantitative and Qualitative Responses

In many scientific areas, data with quantitative and qualitative (QQ) re...
10/23/2020

### LowCon: A design-based subsampling approach in a misspecified linear modeL

We consider a measurement constrained supervised learning problem, that ...
09/30/2014

### Unsupervised Bump Hunting Using Principal Components

Principal Components Analysis is a widely used technique for dimension r...
07/24/2019

### Comparison of Multi-response Estimation Methods

Prediction performance does not always reflect the estimation behaviour ...
03/20/2019

### Comparison of Multi-response Prediction Methods

While data science is battling to extract information from the enormous ...
02/26/2021

### Cholesky-based multivariate Gaussian regression

Multivariate Gaussian regression is embedded into a general distribution...
03/08/2017

### Unsupervised Ensemble Regression

Consider a regression problem where there is no labeled data and the onl...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Principal components analysis is one of the most commonly used methods for dimension reduction (Cook, 2018a), whereby variables are reduced to

linear combinations of those variables—the principal components. The vectors of weights defining the principal components are eigenvectors of the variables’ sample covariance matrix. Thus, subject to the constraint that the weight vectors be orthonormal, the

principal components are the

linear combinations with maximal sample variances. In classical principal components regression (PCR), principal components of the predictors are used to model or predict one or more response variables

(e.g. Jolliffe, 2002, Chapter 8). Intuitively, one expects this to be useful when the corresponding relation holds in the population; that is, when only a few linear combinations of the predictors, with large population variances, affect the responses. This is the case in many popular models (Stock and Watson, 2002; Bair et al., 2006; Yu et al., 2006; Bai and Wang, 2016; Singer et al., 2016) and can be motivated by more general probabilistic arguments (Artemiou and Li, 2009). Accordingly, PCR has been considered in a wide range of applications, including ones in economics, chemometrics, and genomics (e.g. Næs and Martens, 1988; Chang et al., 2001; Stock and Watson, 2002; Watson, 2003; Wang and Abbott, 2008; Barbarino and Bura, 2017; Du et al., 2018).

Despite its popularity, it has also been argued that classical PCR is unreliable (Cook, 2018a) and that it is preferable to use both responses and predictors to decide which linear combinations of the predictors are relevant (Helland, 1992; Bair et al., 2006; Yu et al., 2006; Cook et al., 2010; Kelly and Pruitt, 2015). Our work shows that, in many such cases, it may be that a PCR model, which we will define in detail shortly, is still appropriate, but that classical PCR gives poor estimates of the parameters in that model. We address this by proposing a method that estimates a PCR model by maximizing a multivariate normal joint likelihood of responses and predictors. Thus, if data are multivariate normal, our estimators are maximum likelihood estimators, and otherwise they are M-estimators. As we will see, our method leads to substantially more accurate estimation and prediction than classical PCR in a wide range of settings. Other related methods typically do not give estimates that lie in the parameter space of a PCR model, and hence may be inappropriate when one is suggested by domain knowledge or other considerations. Nevertheless, we compare our method to partial least squares (PLS) and predictor envelopes (Cook et al., 2013)

since these are appropriate for similar settings and tend to perform well in practice. Using simulations, we show our method compares favorably to both when data are generated from our model. The practical relevance this model is supported by a data example where our method gives more accurate cross-sectional prediction of stock returns than classical PCR, PLS, predictor envelopes, and ordinary least squares (OLS). For a more precise discussion, we require some notation.

Let the pairs , , be independent with , , , , , and , where is the set of symmetric and positive definite matrices. Later, we will assume is sub-Gaussian for some of our asymptotic results, but that is not necessary to define our model. The assumption that both responses and predictors have mean zero is not needed in practice but makes some of our theoretical arguments easier to present. To define our model, we assume there exist a semi-orthogonal , diagonal and positive semi-definite , , and such that

 (i)  β=Uγand(ii)  ΣX=τ(Ip+UDUT). (1)

Model (1) implies the columns of are leading eigenvectors of and that . That is, only linear combinations of the predictors, whose weights are given by leading eigenvectors of the predictors’ population covariance matrix, are relevant for the regression. Condition (ii) implies the

smallest eigenvalues of

are all equal to or, equivalently, that is spiked (Johnstone, 2001). A spiked predictor covariance matrix is not necessary to define a PCR model, but it facilitates both our theory and the implementation of our method. Spiked covariance matrices are common in the modern literature on principal components analysis and large covariance matrix estimation (Johnstone, 2001; Cai et al., 2014; Wang and Fan, 2017; Donoho et al., 2018), and they arise naturally in latent variable and factor models that motivate both our and several related methods (Tipping and Bishop, 1999; Bair et al., 2006; Yu et al., 2006; Kelly and Pruitt, 2015; Singer et al., 2016) (see also Appendix C). With , any and can be written as in (1

). Thus, our model includes a classical multivariate linear regression model as a special case. Here, however, we are primarily interested in the cases where

so that there is potential for dimension reduction.

We propose estimating , , and by maximizing a multivariate normal joint log-likelihood for observations from (1); we do not attempt to estimate , , or directly since they are unidentifiable. Nevertheless, for any , , and in our parameter set, we can decompose them as in (1) and write the conditional log-likelihood for given and the marginal likelihood for , respectively, as

 ℓY∣X(U,γ,Σ)=−n2log|Σ|−12tr[(Y−XUγ)T(Y−XUγ)Σ−1]

and

 ℓX(U,D,τ)=−n2log|τ(Ip+UDUT)|−12τtr[XTX(Ip+UDUT)−1],

where is the determinant and the trace when applied to square matrices. The joint likelihood our estimators maximize is then

 ℓn(β,Σ,ΣX)=ℓn(U,γ,Σ,D,τ)=ℓY∣X(U,γ,Σ)+ℓX(U,D,τ). (2)

Since both and depend on , our estimate of , and in particular the eigenvectors of that estimate, will in general depend on both and . It is well known (Tipping and Bishop, 1999) that if maximizes , then the columns of are leading eigenvectors of . Thus, loosely speaking, our method will try to select a whose columns are close to the leading eigenvectors of and at the same time makes the (weighted) sum of squared errors of the regression of on small. By contrast, classical PCR is equivalent to a two-stage procedure where, in the first stage, is maximized to obtain , effectively ignoring and the fit of the regression of on . Then, in the second stage with fixed, is estimated as , where is obtained from a least squares regression of on (Jolliffe, 2002, Chapter 8).

The log-likelihood is in general unbounded if , and hence we assume throughout. Our asymptotic results, showing our estimators are consistent, assume both and tend to infinity with . This reflects our belief that our method will be of most interest in settings where the number of predictors is large enough for dimension reduction to be useful, but small enough in comparison to that a likelihood-based method without penalization is applicable. For context, we note that

is common in the random matrix literature and that

is a special case of settings considered in sparse, high-dimensional spiked covariance matrix estimation (Wang and Fan, 2017). Asymptotic theory for related likelihood-based methods such as predictor envelopes typically assumes is fixed (Cook et al., 2010), but it has been noted that it may be more appropriate to let grow with (Cook et al., 2007).

Maximizing is a non-trivial, non-convex optimization problem. Helland (1992) proposed an algorithm for a somewhat similar setting but with , , and without spiked covariance matrix. The spiked covariance is essential to our development of a method that handles and . The key step for the algorithm we propose is to re-parameterize as for an which we discuss in more detail in Section 3.1. Briefly, with appropriate restrictions on , this parameterization is identifiable and the corresponding log-likelihood can be partially maximized analytically in all arguments but . The resulting profile log-likelihood for can then maximized numerically.

There are similarities between our method and the ones we compare to, but also important differences. Predictor envelopes assume for some and whose columns are eigenvectors of

that, in contrast to in our model, need not be leading ones. Thus, the predictor envelope model is more flexible and less parsimonious than ours. PLS can be viewed as a moment-based estimator of a predictor envelope. With

, PLS is equivalent to selecting linear combinations of the predictors given by the columns of the sample Krylov matrix , where (Helland, 1990; Cook et al., 2013). More precisely, PLS is equivalent to selecting linear combinations given by any set of basis vectors for the column space of , and it is well known that the column space of , defined as but with population covariances in place of their sample versions, coincides with the predictor envelope (Cook et al., 2013). Loosely speaking, PLS and likelihood-based predictor envelope estimators attempt to both infer which eigenvectors are relevant and estimate them. Since our method assumes the leading ones are relevant, we expect it to perform better in settings where that is either true or a reasonable approximation. This intuition is confirmed in our simulations and data example.

The remainder of the paper is organized as follows. In Section 2 we give conditions that ensure consistency of our estimators and in Section 3 we outline how the to compute the estimates in practice and how to select the number of components, . Section 4 contains simulation results and in Section 5 we compare our method to competing ones in a data example. Concluding remarks are in Section 6.

## 2 Asymptotic results

### 2.1 Consistent estimation

This section gives conditions that ensure consistency of our estimators. The idea is to show that an appropriately scaled log-likelihood concentrates around its expectation, which is maximized by the true parameter. The majority of the work is showing that that convergence is uniform on suitably chosen subsets of the parameter set. Allowing to grow with and letting the true parameter depend on (and ) makes this more complicated than in classical settings. The results in this section assume is known; how to select in practice is discussed in Section 3.2. Proofs not given in the main text can be found in Appendix A unless otherwise noted.

We parameterize in terms of precision matrices and . Let and be the set of for which and can be written as in (1). The true parameter is denoted . We equip with the max-norm defined by

 ∥θ∥M=max{∥β∥,∥Ω∥,∥ΩX∥},

where is the spectral norm when applied to matrices. When parameterizing in terms of precision matrices, the negative log-likelihood corresponding to (2) is, up to scaling and an additive constant,

 gn(θ)=−log|Ω|+1ntr[(Y−Xβ)T(Y−Xβ)Ω]−log|ΩX|+tr(SXΩX). (3)

We denote the minimizers of interest by . Clearly, if minimizes over , then and can be decomposed to give maximizers of (2). The following result says a exists under reasonable conditions. We let denote the orthogonal projection onto the orthogonal complement of the column space of .

###### Theorem 2.1.

The set is non-empty if is invertible and has rank greater than , and it is empty if has rank less than or equal to or if is non-invertible and .

The conditions in Theorem 2.1 are weak in the settings we consider. In particular, if both responses and predictors have continuous distributions, has full column rank almost surely if , which implies the sufficient conditions in the theorem.

To state the assumptions we use for the asymptotic analysis, we use the following definitions

(Vershynin, 2018), or its distribution, is sub-Gaussian if there exists a such that for all . The sub-Gaussian norm of is . A random vector is sub-Gaussian if is sub-Gaussian for any unit-length . The sub-Gaussian norm of is the supremum over all unit-length of the sub-Gaussian norm of . A random vector

is sub-Gaussian if and only if its sub-Gaussian norm is finite. Common sub-Gaussian distributions include the multivariate normal and all with bounded support, such as, for example, Bernoulli and uniform distributions.

We use five assumptions to establish consistency. Let and denote the largest and smallest eigenvalues of their matrix arguments, respectively. We use to denote a generic constant that does not depend on or but can otherwise change with context. The number of predictors can change as a function of and we write .

###### Assumption 1.

For any fixed and every , , , , and ; there also exist , , , and such that and .

###### Assumption 2.

The true parameter depends on , and can hence depend on , but there exists a constant such that, for every and ,

1. ,

2. , and

3. .

###### Assumption 3.

The number of observations and the number of predictors both tend to infinity and satisfy

 limn→∞p(n)n=0;

the number of responses and the number of relevant linear combinations are fixed.

###### Assumption 4.

The vectors , , are independent and have sub-Gaussian norms bounded by a not depending on or .

Assumptions 2 and 3 define the type of asymptotics we consider. Cook and Forzani (2019) discuss possible interpretations of the boundedness of and in high-dimensional regressions. Here, since the estimators we consider do not exist in general when (Theorem 2.1), we only note that is required to avoid the case where tends to infinity with and . Assumption 4 is useful to us because, together with Assumption 3, it implies the sample covariance matrix

is close to its population counterpart in spectral norm, with high probability

(Vershynin, 2018, Theorem 4.6.1).

The final assumption, defined below, is made so that terms of that depend on but no other arguments concentrate around their expectation. Because it can seem less intuitive than the other assumptions, we discuss this assumption in detail in Section 2.2 and show it is implied by the other assumptions in many relevant settings, including when data are multivariate normal or generated from a common latent variable model.

###### Assumption 5.

For any , as ,

 P(|tr(SX−ΣX∗)|≥ϵ)→0.

We are ready to state the main result of the section.

###### Theorem 2.2.

If Assumptions 15 hold, then as , is non-empty with probability tending to one, and for any sequence of estimators with , it holds for any that

 P(∥^θn−θ∗∥M≥ϵ)→0.

Before proving Theorem 2.2, we discuss some intermediate results that we need in the proof. Recall, the main idea is to show that the distance between and tends to zero in probability, uniformly in on appropriately chosen subsets of . To that end, for , we define the set by

 An={θ∈Θ:∥β∥≤c2,c1≤λmin(Ω),λmax(Ω)≤c2,c1≤λmin(ΩX),λmax(ΩX)≤c2}. (4)

The following result formalizes the necessary uniform convergence.

###### Lemma 2.3.

If Assumptions 15 hold, then for any and , as ,

 P(supθ∈An|gn(θ)−κn(θ)|≥ϵ)→0.

Lemma 2.3 alone is not enough to ensure consistency; because the uniformity is not over the whole of , we also need to show that both and are in . Moreover, we need that is, loosely speaking, asymptotically uniquely minimized at . The following two lemmas give what we need. Results like Lemma 2.5 are common in the M-estimation literature; here, some work is needed to prove the lemma because depends on .

###### Lemma 2.4.

If Assumptions 15 hold, then there exist such that

 P(argminθ∈Θgn(θ)⊆An(c1,c2))→1.
###### Lemma 2.5.

For any such that and any ,

 κn(θ)−κn(θ∗)≥δ∥θ−θ∗∥2M

for some that depends on and but not on .

We are ready to prove Theorem 2.2

###### Proof of Theorem 2.2.

To show the existence part, it suffices, by Theorem 2.1, to show that and are invertible with probability tending to one. Lemma A.3 implies these quantities are consistent in spectral norm for and , respectively. Thus, by Lemma A.1 and Assumption 2, and with probability tending to one. It remains to prove that any existing minimizers are consistent. To that end, pick an arbitrary and a and so that Lemmas 2.4 and 2.5 hold; this is possible because for all small enough and large enough , and making larger only increases the probability in Lemma 2.4. A routine calculation shows , and, hence, . Let , where superscript denotes the complement and is the open ball in with radius and center . By Lemma 2.5, for every . But Lemma 2.3 says that, with probability tending to one, (say) for any . Thus, with probability tending to one, for any , . Thus, with probability tending to one, . Thus, since Lemma 2.4 says with probability tending to one, with probability tending to one, and that finishes the proof. ∎

The following corollary says the estimates of the covariance matrices are consistent when those of the precision matrices are. Its proof is straightforward by using bounds on the eigenvalues of , , and their estimates implied by Assumption 2 and Theorem 2.2, and is hence omitted.

###### Corollary 2.1.

If Assumptions 15 hold, then for any sequence of estimators with , it holds that, in probability as ,

 ∥^Ω−1−Σ∗∥→0and∥^Ω−1X−ΣX∗∥→0.

### 2.2 Consistent trace

To gain some intuition for when Assumption 5 holds, we discuss a few sufficient conditions that cover several relevant settings, including multivariate normal predictors.

Let be a spectral decomposition of and define , . The s have common covariance matrix and are sub-Gaussian when the s are. Moreover, . Thus, independence of the s and Chebyshev’s inequality gives, for any ,

 P(|tr(SX−ΣX∗)|≥ϵ)≤(ϵn)−2n∑i=1⎡⎣p∑j=1var(Z2i,j)+2∑j

If

has a multivariate normal distribution for every

, then the elements of are independent since they are uncorrelated, and hence their squares are independent and have covariance zero. Then, the right hand side of (5) is less than for any , which can be chosen to not depend on or by Assumption 4. But as by Assumption 3, so Assumption 5 is satisfied. More generally, the same argument works if linear combinations of that are uncorrelated are also independent; normality is not necessary.

If the covariance terms in (5) need not vanish, then the right hand side is less than , which tends to zero if . Thus, Assumption 5 is redundant if Assumption 3 is appropriately strengthened.

Finally, suppose that , where and are latent variables whose elements are independent with mean zero and unit variance, and is a parameter. This commonly considered latent variable model (Tipping and Bishop, 1999; Singer et al., 2016; Cook, 2018b) is consistent with our assumptions since is a spiked covariance matrix whose smallest eigenvalues are all equal to . Let and . Then , and hence

 tr[SX−ΓT∗Γ∗−τ∗Ip]=tr[ΓT∗(SW−Ik)Γ∗]+2√τ∗tr(ΓT∗SWE)+τ∗tr[SE−Ip].

The absolute values of the first two terms are bounded by and , respectively. These tend to zero in probability since Assumption 2 implies both and are bounded as , and Assumptions 3 and 4 imply consistency of sample covariance matrices in spectral norm (Lemma A.3). Finally, using Chebyshev’s inequality as in (5) together with independence of the elements of , and hence their squares, gives that tends to zero in probability.

We collect the sufficient conditions just discussed in the following proposition. Neither of the conditions is necessary, and they are not exclusive. In particular, 2 is a special case of 3.

###### Proposition 2.6.

Assumption 5 is satisfied if

1. Assumptions 1, 2, and 4 hold and Assumption 3 is strengthened to ,

or, in addition to Assumptions 14, any of the following holds for every :

1. is multivariate normal.

2. has the property that for any such that , it holds that and are independent.

3. for some and , whose entries are independent with mean zero and unit variance, and parameters and .

## 3 Implementation

### 3.1 Maximizing the likelihood

Recall the definition of and in (3). Minimization of over is a non-convex problem without analytical solution. However, simplifications are possible upon re-parameterizing as , and hence , for , the set of lower-echelon matrices with positive leading entries in every column (Cantó et al., 2015). The representation is unique, so that the parameters are identified, if is positive definite (Cantó et al., 2015). With this parameterization, left singular vectors of are leading eigenvectors of . Thus, is equivalent to for some . Using this, the scaled negative log-likelihood after re-parameterization is

 (6)

Minimizing over is equivalent to minimizing over . However, is useful for practical purposes because a profile log-likelihood for can be derived analytically. Indeed, routine calculations show the necessary partial minimizers of are , , where is the orthogonal projection onto the orthogonal complement of the column space of ; and . Thus,

 Hn(L)=minΣ,α,τhn(Σ,α,τ,L)=log|YTQXLY|+plogtr[XTX(Ip+LLT)−1].

If a minimizer of can be found numerically, then the parameters , , , and hence , can be estimated by plugging in place of in the expressions for , , and . On points where is not invertible, the inverse in the definition of can be replaced by, for example, the Moore–Penrose pseudo-inverse without affecting .

We use standard first order methods to minimize , which is non-convex. Software implementing our method, using a descent-based algorithm for constrained optimization (Byrd et al., 1995), is available at the author’s web page. Simulations and data examples indicate the estimates found by the proposed method are useful. Calculating the gradient necessary to implement a first order method is straightforward but tedious (Appendix B).

### 3.2 Selecting k

Information criteria offer a principled way to select in practice. Fix and , and, for any , let denote the number of parameters and a minimizer of . Then, up to additive constants, many popular information criteria can be written as

 Iρ(k)=ngn(^θn(k))+ρd(k),

where different give different criteria. Akaike’s information criterion (AIC) (Akaike, 1998) and Schwarz’s Bayesian information criterion (BIC) (Schwarz, 1978) set, respectively, and . For a given , is selected as . We examine the performance of AIC and BIC in our model using simulations in Section 4. The following proposition establishes . Recall, we are assuming and ; parameterizing the means requires an additional and parameters, respectively.

###### Proposition 3.1.

For a given , , and , the number of parameters in our model is if , and if .

## 4 Simulations

We compare our method to classical PCR, PLS using the SIMPLS algorithm in the pls package (Mevik and Wehrens, 2007), predictor envelopes using the Renvlp package (Lee and Su, 2019), and OLS. We use root mean squared error (RMSE) of estimating and out-of-sample prediction RMSE. These are defined, respectively, for a generic estimate and independent test set , as

 ∥^β−β∗∥F/√rpand∥Xnew^β−Ynew∥F/√rn,

where denotes the Frobenius matrix norm. We also compare the methods’ bias in selecting . For classical PCR and PLS we use leave one out cross-validation to select . For predictor envelopes, we use AIC, BIC, and a likelihood ratio test procedure implemented in the Renvlp package (Lee and Su, 2019). In all simulations, both the training set and the test set are generated as independent observations from our model with multivariate normal responses and predictors. We fix and in all simulations. We consider and in various alignments for which it is reasonable to use likelihood-based methods (here, our and envelopes). As a baseline, we use and . This is similar to our data example (Section 5) which has and . We construct the predictor covariance as where is a realization from the uniform distribution on . We examine performance for different values of , , , , and . In all settings, for some , which we call the average spiked eigenvalue. The coefficient is generated as , where is constructed by drawing its elements as independent realizations of the uniform distribution on and then normalizing each column to have a Euclidean norm that can change between simulations; we refer to this as the coefficient column norm or the coefficient size. Because is semi-orthogonal, for any column .

Before discussing the results in Figure 1 setting by setting, we note that, overall, our method has lower RMSE than classical PCR, by a substantial margin, in almost all settings we consider. In some settings, predictor envelopes with selected using likelihood ratio tests is competitive with our method, and in some PLS is. However, neither of the other methods perform as well as ours across all the simulations. The only setting in which our method is substantially outperformed by another is when and ; that is, when is about the same size as and both are relatively small. In this setting, the moment-based dimension reduction methods, classical PCR and PLS, perform better than the likelihood-based, though our method is substantially more accurate than predictor envelopes and OLS. In general, our method performs best in both estimation and prediction when is selected using BIC, and envelopes perform better with selected using likelihood ratio tests than information criteria.

The first row in Figure 1 shows our method performs best for all considered coefficient sizes, whether is selected using AIC or BIC, and in both estimation and prediction. The method whose RMSE is closest to that of our method is predictor envelopes with selected using likelihood ratio tests. PLS also performs well, especially for small coefficient sizes. On the other hand, leave one out cross validation with classical PCR grossly overestimates , and for large coefficients classical PCR is no better than OLS. Overall, the results indicate dimension reduction is particularly useful in settings with small coefficients.

The second row in Figure 1 shows all methods perform better relative to OLS when the average spiked eigenvalue is larger, which is intuitive since the relevant eigenvectors should then be easier to identify. Our method with selected using BIC performs best for all settings, but for the smallest envelopes with selected using likelihood ratio tests perform slightly better than our method with selected using AIC.

The third row in Figure 1 shows our method and predictor envelopes with selected using likleihood ratio tests are preferable to the moment-based methods when is small, but that PLS becomes competitive as increases. This is consistent with the fact that our method and envelopes in general require , and our asymptotics which require . Note, since for and all , a large implies a regression with relatively many predictors, but where each predictor has a small effect on the response. Conversely, a small means each predictor has a larger effect.

In the fourth row of Figure 1, our method is shown to perform better than the other methods for all considered . We also note that dimension reduction is decreasingly useful as the number of relevant components increases.

Finally, the last row of Figure 1 shows the moment-based methods are preferable for small , and that in estimation, the differences between our method and the others is relatively stable as varies. By contrast, the prediction RMSE for all methods gets closer to that of OLS as increases. That is, dimension reduction may not be as useful when is very large in comparison to .

## 5 Prediction of stock returns

We apply our method to data on monthly returns, from January 2010 to March 2020, on 29 stocks in the Dow Jones Industrial Average index (one out of the usual 30 is omitted because it was introduced in 2019). Specifically, we consider cross-sectional prediction of the return on one stock using contemporaneous returns of the other stocks. Code for replicating all results are available at the author’s web page. Before discussing the results and how they are obtained, we give a short motivation for why stock returns are suitable for an application of our method.

First, classical models assume stock prices follow a geometric Brownian motion (Hull, 2017). This implies the (log-)returns—the first difference of the logarithm of the prices—are temporally independent, normally distributed, and have constant variance. Of course, in practice we expect neither to hold exactly, but the popularity of models based on the geometric Brownian motion suggests it is often a useful approximation. With this in mind, it seems reasonable to estimate parameters by maximizing a likelihood for independent normal observations.

Secondly, it is often hypothesized that stock returns can be decomposed into a stock-specific component, which is independent of the returns on other stocks, and a few common components, or factors (Fama and French, 2015; Bai and Wang, 2016). The latter can include, for example, a risk-free interest rate and a market return, and they can be observable or unobservable (Bai and Wang, 2016). Let denote a vector of such components at time , and suppose the return on stock is

 Rt,j=μ∗j+γT∗jWt+εt,j, (7)

where and are parameters and an unobservable error term with mean zero and variance , assumed to be independent and identically distributed for all and . Our data comprise realizations of but not . Thus, the common components are latent variables that can induce dependence between contemporaneous returns of different stocks. To see how (7) relates to our model, let be the return to be predicted using the vector of predictors . Then with and ,

which is a spiked covariance matrix whose smallest eigenvalues are equal to and whose leading eigenvectors are leading left singular vectors of . Moreover, (7) and normality of the returns imply

 E(Yt∣Xt)=E(Yt)+Σ−1Xcov(Xt,Yt)(Xt−E(Xt))=E(Yt)+βT∗[Xt−E(Xt)]

where . Thus, lies in the span of leading left singular vectors of and, hence, (7) leads to a model consistent with our assumptions.

We split the 123 observations so that the first 70 are training data and the remaining 53 are test data. The response is centered by its training data sample mean and the predictors are centered and scaled by their training data sample means and sample standard deviations, respectively. We fit each method to the training data and compute the RMSE of prediction using the test data. For our method, we select

using AIC and BIC. For classical PCR and PLS, we select by leave one out cross-validation, using the implementation in the pls package (Mevik and Wehrens, 2007). The package suggests two ways of selecting based on cross-validation: a randomization test approach (van der Voet, 1994)

or the ”one standard error rule”, which selects the smallest number of components that gives a cross-validation error within one standard error of the smallest cross-validation error of any number of components

(Hastie et al., 2009). In our application, we found that both of these rules often selected the model with zero components, which led to poor performance and an uninteresting comparison with our method. Thus, we instead present results where the number of components are selected to minimize the cross-validation error. The predictor envelope model we fit using the Renvlp package (Lee and Su, 2019). This package provides three ways to select : AIC, BIC, and likelihood ratio tests—we consider all three.

We focus on prediction of Home Depot’s stock return. To highlight which results are particular to this choice of response and which hold more generally, we also present summary statistics from repeating the same analysis with the other stocks as responses. We focus on the results for Home Depot because they are relatively representative of more general patterns indicated by the summary statistics.

Table 1 shows the prediction results. The presented RMSEs are divided by the RMSE of a model without predictors, that is, the model which predicts all response realizations in the test set are equal to the training data sample mean of the response. For Home Depot’s stock, our method with selected using BIC performs best (RMSE 0.75); PLS (0.76) and envelopes with selected using likelihood ratio tests (0.76) also perform well. Envelopes with selected using AIC or BIC and OLS perform worse than a prediction without predictors. In general, the results for Home Depot indicate that methods which select a relatively small , but not , perform better. This finding is corroborated when looking at the summary statistics from repeating the analysis with the other stocks as responses: Our method with selected using BIC performs best (average RMSE 0.80) and, on average, selects . PLS performs second best (0.82) and on average selects . Our method with selected using AIC and classical PCR also perform relatively well. On average, all methods except OLS perform better than the model without predictors. However, our method with selected using BIC is the only method that does no worse than the model without predictors for any choice of response.

We now focus on the Home Depot results for our method with , as selected by BIC. In Figure 2, the data are plotted along with fitted values and predictions. The fitted values and predictions are typically closer to the sample mean than the response, and look to be relatively close to an (scaled) average of the predictors. Accordingly, if we fit our model to the full data with , then the estimated leading eigenvector of , say , has elements which all have the same sign and absolute values between 0.10 and 0.24 (see Supplementary Material); that is, is proportional to a weighted average of the predictors, which with (7) in mind could be interpreted as a market component. The second estimated eigenvector, , is more difficult to interpret: Its elements do not all have the same sign, and some are close to zero. In an OLS regression of on and

, the t-value for testing the null hypothesis that the first coefficient is zero is -8.25 and that for the second -0.53 (Supplementary Material). This may be an indication that only the first component has an effect on

. This is still compatible with as suggested by BIC since, even if only one linear combination of has a direct effect on , may lead to a poor model for . Inspecting the estimates from the full data, we find and ; that is, , , and , where denotes the th largest eigenvalue of its argument matrix.

## 6 Discussion

We have proposed a likelihood-based method for estimating principal components regressions that outperforms classical principal components regression in many practically relevant settings. Our method does this by addressing a common criticism of classical principal components regression, namely that the responses are ignored when selecting linear combinations of the predictors.

Both our data example and simulations indicate our method performs best when the number of components are selected using BIC, rather than AIC. In general, erring on the side of using fewer components seems preferable to using too many; that is, parsimonious models are preferred. In our data example, BIC indicates that using two components is appropriate, but further analysis indicates one may be sufficient for the regression. Thus, it may be that BIC selects two instead of one because the model for the predictors’ covariance matrix requires two components to fit well. This suggests that, more generally, it may be possible to further improve our method by implementing variable selection, or component selection. With some work, our method could be extended in that direction by, for example, encouraging sparsity in the estimate of the parameter in the representation . Sparsity in would say that, while the principal components regression model holds, not all of the principal components are important for all responses. If, for example, the th row of vanishes, then the th principal components is unimportant. Moreover, if a sparsity-inducing or other penalty on is added to the negative log-likelihood, then the resulting objective function need not be unbounded when . That is, it may also be possible to extend our method to high-dimensional settings by using a penalized likelihood estimator. When is small in comparison to , one may also be able to select which components are relevant for the regression by comparing information criteria from fitting the model with various restrictions on . Both of these approaches would require a substantial amount of work as the current algorithm relies on a re-parameterization where and do not appear explicitly.

In some applications it may be that only a few components are relevant, but that they do not correspond to leading eigenvectors. Our method can still perform well relative to PLS and predictor envelopes in many such cases. Intuitively, there is a price to pay for having to, as those methods aim to, infer which eigenvectors are important, so in some settings it can be preferable to fit a principal components regression model with many linear combinations, including some important and some unimportant ones.

Finally, we note that there may be room for extending our method to dependent data. Our asymptotic theory fundamentally relies on concentration of sub-Gaussian vectors; if the concentration results we use can be replaced by equivalent ones for dependent data, we expect other parts of our proofs can be adapted accordingly to show our estimators are consistent. However, an arguably more appropriate, and more ambitious, extension of our method to dependent data would use a likelihood that takes that dependence into account, for example by assuming responses and predictors follow vector autoregressive processes. This would also allow for an interesting re-analysis of the stock return data, focusing on forecasting instead of cross-sectional prediction.

## Appendix A Proofs

Define

 gn,1(θ)=−log|Ω|+n−1tr[(Y−Xβ)T(Y−Xβ)Ω],
 gn,2(θ)=−log|ΩX|+tr[SXΩX],
 κn,1(θ)=E[gn,1(θ)]=−log|Ω|+tr[(β−β∗)TΩ−1X(β−β∗)Ω]+tr(ΩΩ−1∗),

and

 κn,2(θ)=E[gn,2(θ)]=−log|ΩX|+tr(ΩXΩ−1X∗).

The following is Corollary III.2.6 in Bhatia (2012), attributed to Weyl, which we state here for easy reference. We let denote the th largest eigenvalue of its argument matrix.

If and