1 Introduction
Linear factor models have been widely used for a long time and with notable success in economics, finance, medicine, psychology, and various other natural and social sciences (Harman, 1976)
. In such a model, each observed variable is a linear combination of unobserved common factors plus idiosyncratic noise, and the collection of random variables is jointly Gaussian. We consider in this paper the problem of learning a factor model from a training set of vector observations. In particular, our learning problem entails simultaneously estimating the loadings of each factor and the residual variance of each variable. We seek an estimate of these parameters that best explains outofsample data. For this purpose, we consider the likelihood of test data that is independent of the training data. As such, our goal is to design a learning algorithm that maximizes the likelihood of a test set that is not used in the learning process.
A common approach to factor model learning involves application of principal component analysis (PCA). If the number of factors is known and residual variances are assumed to be uniform, PCA can be applied to efficiently compute model parameters that maximize likelihood of the training data (Tipping and Bishop, 1999). In order to simplify analysis, we begin our study with a context for which PCA is ideally suited. In particular, before treating more general models, we will restrict attention to models in which residual variances are uniform. As a baseline among learning algorithms, we consider applying PCA together with crossvalidation, computing likelihoodmaximizing parameters for different numbers of factors and selecting the number of factors that maximizes likelihood of a portion of the training data that is reserved for validation. We will refer to this baseline as uniformresidual rankconstrained maximumlikelihood (URM) estimation.
To improve on URM, we propose uniformresidual tracepenalized maximumlikelihood
(UTM) estimation. Rather than estimating parameters of a model with a fixed number of factors and iterating over the number of factors, this approach maximizes likelihood across models without restricting the number of factors but instead penalizing the trace of a matrix derived from the model’s covariance matrix. This trace penalty serves to regularize the model and naturally selects a parsimonious set of factors. The coefficient of this penalty is chosen via crossvalidation, similarly with the way in which the number of factors is selected by URM. Through a computational study using synthetic data, we demonstrate that UTM results in better estimates than URM. In particular, we find that UTM requires as little as twothirds of the quantity of data used by URM to match its performance. Further, leveraging recent work on random matrix theory, we establish theoretical results that explain how UTM corrects the biases induced by URM.
We then extend UTM to address the more general and practically relevant learning problem in which residual variances are not assumed to be uniform. To evaluate the resulting algorithm, which we refer to as scaled tracepenalized maximumlikelihood (STM) estimation, we carry out experiments using both synthetic data and real stock price data. The computational results demonstrate that STM leads to more accurate estimates than alternatives available from prior art. We also provide an analysis to illustrate how these alternatives can suffer from biases in this nonuniform residual variance setting.
Aside from the aforementioned empirical and theoretical analyses, an important contribution of this paper is in the design of algorithms that make UTM and STM efficient. The UTM approach is formulated as a convex semidefinite program (SDP), which can be solved by existing algorithms such as interiorpoint methods or alternating direction method of multipliers (see, e.g., Boyd et al. (2011)). However, when the data dimension is large, as is the case in many relevant contexts, such algorithms can take too long to be practically useful. This exemplifies a recurring obstacle that arises in the use of SDP formulations to study large data sets. We propose an algorithm based on PCA that solves the UTM formulation efficiently. In particular, we found this method to typically require three orders of magnitude less compute time than the alternating direction method of multipliers. Variations of PCA such as URM have enjoyed wide use to a large extent because of their efficiency, and the computation time required for UTM is essentially the same as that of URM. STM requires additional computation but remains in reach for problems where the computational costs of URM are acceptable.
Our formulation is related to that of Chandrasekaran et al. (2012), which estimates a factor model using a similar trace penalty. There are some important differences, however, that distinguish our work. First, the analysis of Chandrasekaran et al. (2012) focuses on establishing perfect recovery of structure in an asymptotic regime, whereas our work makes the point that this trace penalty reduces nonasymptotic bias. Second, our approach to dealing with nonuniform residual variances is distinctive and we demonstrate through computational and theoretical analysis that this difference reduces bias. Third, Chandrasekaran et al. (2012) treats the problem as a semidefinite program, whose solution is often computationally demanding when data dimension is large. We provide an algorithm based on PCA that efficiently solves our problem. The algorithm can also be adapted to solve the formulation of Chandrasekaran et al. (2012), though that is not the aim of our paper.
In addition, there is another thread of research on regularized maximumlikelihood estimation for covariance matrices that relates loosely to this paper. Along this line, Banerjee et al. (2008) regularizes maximumlikelihood estimation by the norm of the inverse covariance matrix in order to recover a sparse graphical model. An efficient algorithm called graphical Lasso was then proposed by Friedman et al. (2008) for solving this formulation. Similar formulations can also be found in Yuan and Lin (2007) and Ravikumar et al. (2011), who instead penalize the norm of offdiagonal elements of the inverse covariance matrix when computing maximumlikelihood estimates. For a detailed survey, see Pourahmadi (2011). Although our approach shares some of the spirit represented by this line of research in that we also regularize maximumlikelihood estimation by an like penalty, the settings are fundamentally different: while ours focuses on a factor model, theirs are based on sparse graphical models. We propose an approach that corrects the bias induced by conventional factor analysis, whereas their results are mainly concerned with accurate recovery of the topology of an underlying graph. As such, their work does not address biases in covariance estimates. On the algorithmic front, we develop a simple and efficient solution method that builds on PCA. On the contrary, their algorithms are more complicated and computationally demanding. ^{1}^{1}1The code of our algorithms can be downloaded at: http://www.yhkao.com/RPCAcode.zip.
2 Problem Formulation
We consider the problem of learning a factor model without knowledge of the number of factors. Specifically, we want to estimate a covariance matrix from samples , where is the sum of a symmetric matrix and a diagonal matrix . These samples can be thought of as generated by a factor model of the form , where represents a set of common factors and represents residual noise. The number of factors is represented by , and it is usually assumed to be much smaller than the dimension .
Our goal is to produce based on the observed samples a factor loadings matrix and a residual variance matrix such that the resulting factor model best explains outofsample data. In particular, we seek a pair of such that the covariance matrix maximizes the average loglikelihood of outofsample data:
This is also equivalent to minimizing the Kullback–Leibler divergence between
and .3 Learning Algorithms
Given our objective, one simple approach is to choose an estimate that maximizes insample loglikelihood:
(1) 
where , and we use to denote the sample covariance matrix. Here, the maximum likelihood estimate is simply given by .
The problem with maximum likelihood estimation in this context is that insample loglikelihood does not accurately predict outofsample loglikelihood unless the number of samples far exceeds the dimension . In fact, when the number of samples is smaller than the dimension , is illconditioned and the outofsample loglikelihood is negative infinity. One remedy to such poor generalization involves exploiting factor structure, as we discuss in this section.
3.1 Uniform Residual Variances
We begin with a simplified scenario in which the residual variances are assumed to be identical. As we will later see, such simplification facilitates theoretical analysis. This assumption will be relaxed in the next subsection.
3.1.1 Constraining the Number of Factors
Given a belief that the data is generated by a factor model with few factors, one natural approach is to employ maximum likelihood estimation with a constraint on the number of factors. Now suppose the residual variances in the generative model are identical, and as a result we impose an additional assumption that is a multiple
of the identity matrix. This leads to an optimization problem
(2)  
s.t.  
where denote the set of all positive semidefinite symmetric matrices, and is the exogenously specified number of factors. In this case, we can efficiently compute an analytical solution via principal component analysis (PCA), as established in Tipping and Bishop (1999). This involves first computing an eigendecomposition of the sample covariance matrix , where is orthonormal and with . The solution to (2) is then given by
(3) 
In other words, the estimate for residual variance equals the average of the last
sample eigenvalues, whereas the estimate for factor loading matrix is spanned by the top
sample eigenvectors with coefficients
. We will refer to this method as uniformresidual rankconstrained maximumlikelihood estimation, and use to denote the covariance matrix resulting from this procedure. It is easy to see that the eigenvalues of are , as illustrated in Figure 1(a).A number of methods have been proposed for estimating the number of factors (Akaike, 1987; Bishop, 1998; Minka, 2000; Hirose et al., 2011). Crossvalidation provides a conceptually simple approach that in practice works at least about as well as any other. To obtain best performance from such a procedure, one would make use of socalled fold crossvalidation. To keep things simple in our study and comparison of estimation methods, for all methods we will consider, we employ a version of crossvalidation that reserves a single subset of data for validation and selection of . Details of the procedure we used can be found in the appendix. Through selection of , this procedure arrives at a covariance matrix which we will denote by .
3.1.2 Penalizing the Trace
Although (2) can be elegantly solved via PCA, it is unclear that imposing a hard constraint on the number of factors will lead to an optimal estimate. In particular, one might suspect a “softer” regularization could improve estimation accuracy. Motivated by this idea, we propose penalizing the trace instead of constraining the rank of the factor loading matrix. As we shall see in the experiment results and theoretical analysis, such an approach indeed improves estimation accuracy significantly.
Nevertheless, naively replacing the rank constraint of (2) by a trace constraint will result in a nonconvex optimization problem, and it is not clear to us whether it can be solved efficiently. Let us explore a related alternative. Some straightforward matrix algebra shows that if with and , then the matrix defined by is in , with . This observation, together with the wellknown fact that the loglikelihood of is concave in the inverse covariance matrix , motivates the following convex program:
Here, the variable represents the reciprocal of residual variance. Pricing out the trace constraint leads to a closely related problem in which the trace is penalized rather than constrained:
(4)  
We will consider the trace penalized problem instead of the trace constrained problem because it is more convenient to design algorithms that address the penalty rather than the constraint. Let be an optimal solution to (4), and let denote the covariance matrix estimate derived from it. Here, the “U” indicates that residual variances are assumed to be uniform across variables and “T” stands for tracepenalized.
It is easy to see that (4) is a semidefinite program. As such, the problem can be solved in polynomial time by existing algorithms such as interiorpoint methods or alternating direction method of multipliers (ADMM). However, when the number of variables is large, as is the case in many contexts of practical import, such algorithms can take too long to be practically useful. One contribution of this paper is an efficient method for solving (4), which we now describe. The following result motivates the algorithm we will propose for computing :
Theorem 1
and share the same trace and eigenvectors, and letting the eigenvalues of the two matrices, sorted in decreasing order, be denoted by and , respectively, we have
(5) 
This theorem suggests an algorithm for computing . First, we compute the eigendecomposition of , where and are as defined in Section 3.1.1. This provides the eigenvectors and trace of . To obtain its eigenvalues, we only need to determine the value of such that the eigenvalues given by (5) sum to the desired trace. This is equivalent to determining the largest integer such that
To see this, note that setting
uniquely satisfies (5) and ensures . Algorithm 1 presents this method in greater detail. In our experiments, we found this method to typically require three orders of magnitude less compute time than ADMM. For example, it can solve a problem of dimension within seconds on a workstation, whereas ADMM requires hours to attain the same level of accuracy.
Also note that for reasonably large , this algorithm will flatten most sample eigenvalues and allow only the largest eigenvalues to remain outstanding, effectively producing a factor model estimate. Figure 1(b) illustrates this effect. Comparing Figure 1(a) and Figure 1(b), it is easy to see that URM and UTM primarily differ in the largest eigenvalues they produce: while URM simply retains the largest sample eigenvalues, UTM subtracts a constant from them. As we shall see in the theoretical analysis, this subtraction indeed corrects the bias incurred in sample eigenvalues.
Like URM, the most computationally expensive step of UTM lies in the eigendecomposition. ^{2}^{2}2In fact, a full eigendecomposition is not required, as we only need the top eigenvectors to compute the estimate. Beyond that, the evaluation of UTM eigenvalues for any given takes , and is generally negligible. In our implementation, this regularization parameter is chosen by crossvalidation from a range around , whose reasons will become apparent in Section 5.1. We denote by the covariance matrix resulting from this crossvalidation procedure.
3.2 Nonuniform Residual Variances
We now relax the assumption of uniform residual variances and discuss several methods for the general case. As in the previous subsection, the hyperparameters of these methods will be selected by crossvalidation.
3.2.1 Constraining the Number of Factors
Without the assumption of uniform residual variances, the rankedconstrained maximumlikelihood formulation can be written as
(6)  
s.t.  
where denote the set of all positive semidefinite diagonal matrices. Unlike (2), this formulation is generally hard to solve, and therefore we consider two widelyused approximate solutions.
The expectationmaximization (EM) algorithm
(Rubin and Thayer, 1982) is arguably the most conventional approach to solving (6), though there is no guarantee that this will result in a global optimal solution. The algorithm generates a sequence of iterates and , such that the covariance matrix increases the loglikelihood of with each iteration. Each iteration involves an estimation step in which we assume the data are generated according to the covariance matrix , and compute expectations and for . A maximization step then updates and based on these expectations. In our implementation, the initial and are selected by the MRH algorithm described in the next paragraph. We will denote the estimate produced by the EM algorithm by and that resulting from further selection of through crossvalidation by .A common heuristic for approximately solving (
6) without entailing iterative computation is to first compute by PCA and then take the factor matrix estimate to be the defined in (3) and the residual variances to be , for . In other words, is selected so that the diagonal elements of the estimated covariance matrix are equal to those of the sample covariance matrix. We will refer to this method as marginalvariancepreserving rankconstrained heuristic and denote the resulting estimates by and .3.2.2 Penalizing the Trace
We now develop an extension of Algorithm 1 that applies when residual variances are nonuniform. One formulation that may seem natural involves replacing in (4) with a diagonal matrix . That is,
(7)  
Indeed, a closely related formulation is proposed in Chandrasekaran et al. (2012). However, as we will see in Sections 4 and 5, solutions to this formulation suffer from bias and do not compete well against the method we will propose next. That said, let us denote the estimates resulting from solving this formulation by and , where “T” stands for tracepenalized. Also note that this formulation can be efficiently solved by a straightforward generalization of Theorem 1, though we will not elaborate on this.
Our approach involves componentwise scaling of the data. Consider an estimate of . Recall that we evaluate the quality of the estimate using the expected loglikelihood of outofsample data. If we multiply each data sample by a matrix , the data set becomes , where . If we also change our estimate accordingly to then the new expected loglikelihood becomes
Therefore, as long as we constrain to have unit determinant, will be equal to , suggesting that if is a good estimate of then is a good estimate of . This motivates the following optimization problem:
(8)  
The solution to this problem identifies a componentwisescaling matrix that allows the data to be bestexplained by a factor model with uniform residual variances. Given an optimal solution, should be approximately proportional to the residual variance of the th variable, so that scaling by makes residual variances uniform. Note that the optimization problem constrains to be nonnegative rather than zero. This makes the feasible region convex, and this constraint is binding at the optimal solution. Denote the optimal solution to (8) by . Our estimate is thus given by .
The objective function of (8) is not concave in , but is biconcave in and . We solve it by coordinate ascent, alternating between optimizing and . This procedure is guaranteed convergence. In our implementation, we initialize by . We will denote the resulting estimates by , where “ST” stands for scaled and tracepenalized.
4 Experiments
We carried out two sets of experiments to compare the performance of aforementioned algorithms. The first is based on synthetic data, whereas the second uses historical prices of stocks that make up the S&P 500 index.
4.1 Synthetic Data
We generated two kinds of synthetic data. The first was generated by a model in which each residual has unit variance. This data was sampled according to the following procedure, which takes as input the number of factors , the dimension , the factor variances , and the number of samples :

Sample orthonormal vectors isotropically.

Sample .

Let .

Let .

Sample iid from .
We repeated this procedure one hundred times for each , with , and . We applied to this data URM and UTM, since they are methods designed to treat such a scenario with uniform residual variances. Regularization parameters and were selected via crossvalidation, where about of each data set was used for training and for validation. Figure 2(a) plots outofsample loglikelihood delivered by the two algorithms. Performance is plotted as a function of the logratio of the number of samples to the number of variables, which represents the availability of data relative to the number of variables. We expect this measure to drive performance differences. UTM outperforms URM in all scenarios. The difference is largest when data is scarce. When data is abundant, both methods work about as well. This should be expected since both estimation methods are consistent.
To interpret this result in a more tangible way, we also plot the equivalent data requirement of UTM in Figure 2(b). This metric is defined as the portion of training data required by UTM to match the performance of URM. As we can see, UTM needs as little as 67% of the data used by the URM to reach the same estimation accuracy.
(a) The average outofsample loglikelihood delivered by URM and UTM, when residual variances are identical. (b) The average portion of data required by UTM to match the performance of URM. The error bars denote 95% confidence interval.
Our second type of synthetic data was generated using an entirely similar procedure except step 4 was replaced by
where were sampled iid from . Note that effectively controls the variation among residual variances. Since these residual variances are nonuniform, EM, MRH, TM, and STM were applied. Figure 3 plots the results for the cases and , corresponding to moderate and large variation among residual variances, respectively. In either case, STM outperforms the alternatives. Figure 4 further gives the equivalent data requirement of STM with respect to each alternative. It is worth pointing out that the performance of MRH and TM degrades significantly as the variation among residual variances grows, while EM is less susceptible to such change. We will elaborate on this phenomenon in Section 5.2.
4.2 S&P 500 Data
An important application area of factor analysis is finance, where return covariances are used to assess risks and guide diversification (Markowitz, 1952). The experiments we will now describe involve estimation of such covariances from historical daily returns of stocks represented in the S&P 500 index as of March, 2011. We use price data collected from the period starting November 2, 2001, and ending August 9, 2007. This period was chosen to avoid the erratic market behavior observed during the bursting of the dotcom bubble in 2000 and the financial crisis that began in 2008. Normalized daily logreturns were computed from closing prices through a process described in detail in the appendix. Over this duration, there were 1400 trading days and 453 of the stocks under consideration were active. This produced a data set , in which the th component of represents the normalized logdailyreturn of stock on day .
We generated estimates corresponding to each among a subset of the 1400 days. As would be done in realtime application, for each such day we used data points that would have been available on that day to compute the estimate and subsequent data to assess performance. In particular, we generated estimates every ten days beginning on day and ending on day . For each of these days, we evaluated average loglikelihood of logdailyreturns over the next ten days. Algorithm 2 formalizes this procedure.
These tests served the purpose of slidingwindow crossvalidation, as we tried a range of regularization parameters over this time period and used test results to select a regularization parameter for each algorithm. More specifically, for each algorithm , its regularization parameter was selected by
On days , we generated one estimate per day per algorithm, in each case using the regularization parameter selected earlier and evaluating average loglikelihood over the next ten days. For each algorithm , we took the average of these ten tenday averages to be its outofsample performance, defined as
Figure 5 plots the performance delivered EM, MRH, TM, and STM with . STM is the dominant solution. It is natural to ask why the performance of each algorithm improves then degrades as grows. If the time series were stationary, one would expect performance to monotonically improve with . However, this is a real time series and is not necessarily stationary. We believe that the distribution changes enough over about a thousand trading days so that using historical data collected further back worsens estimates. This observation points out that in real applications STM is likely to generate superior results even when all aforementioned algorithms are allowed to use all available data. This is in contrast with the experiments of Section 4.1 involving synthetic data, which may have led to an impression that the performance difference could be made small by using more data.
5 Analysis
In this section, we explain why UTM and STM are expected to outperform alternatives as we have seen in our experimental results.
5.1 Uniform Residual Variances
Let us start with the simpler context in which residual variances are identical. In other words, let for a low rank matrix and uniform residual variance . We will begin our analysis with two desirable properties of UTM, and then move on to the comparison between UTM and URM.
It is easy to see that is a consistent estimator of for any , since and by Theorem 1 we have
Another important property of UTM is the fact that the trace of UTM estimate is the same as that of sample covariance matrix. This preservation is desirable as suggested by the following result.
Proposition 1
For any fixed , scalars , and any sequence of covariance matrices with eigenvalues , we have
as and
(9) 
Note that, for any fixed fixed number of samples , the righthandside of (9) diminishes towards 0 as data dimension grows. In other words, as long as the data dimension is large compared to the number of factors , the sample trace is usually a good estimate of the true one, even when we have very limited data samples.
Now we would like to understand why UTM outperforms URM. Recall that, given an eigendecomposition of the sample covariance matrix, estimates generated by URM and UTM admit eigendecompositions and , deviating from the sample eigenvalues but not the eigenvectors . Hence, URM and UTM differ only in the way they select eigenvalues: URM takes each eigenvalue to be either a constant or the corresponding sample eigenvalue, while UTM takes each eigenvalue to be either a constant or the corresponding sample eigenvalue less another constant. Thus, large eigenvalues produced by UTM are a constant offset less than those produced by URM, as illustrated in Figure 1. We now explain why such subtraction lends UTM an advantage over URM in highdimensional cases.
Given the eigenvectors , let us consider the optimal eigenvalues that maximize outofsample loglikelihood of the estimate. Specifically, let us define
With some straightforward algebra, we can show that , where for . Let each th sample eigenvalue be denoted by , and let the th largest eigenvalue of be denoted by . The following theorem, whose proof relies on two results from random matrix theory found in Baik and Silverstein (2006) and Paul (2007), relates sample eigenvalues to optimal eigenvalues .
Theorem 2
For all , scalars , , sequences such that , covariance matrices with eigenvalues , and such that , there exists such that as .
Consider eigenvalues that are large relative to so that is negligible. In such cases, when in the asymptotic regime identified by Theorem 2, we have . This observation suggests that, when the number of factors is relatively small compared to data dimension , and when and number of samples scale proportionally to large numbers, the way in which UTM subtracts a constant from large sample eigenvalues should improve performance relative to URM, which does not modify large sample eigenvalues. Furthermore, comparing Theorem 1 and 2, we can see that the correction term should satisfy , or equivalently . This relation can help us narrow the search range of in crossvalidation.
It is worth pointing out that the overshooting effect of sample eigenvalues is well known in statistics literature (see, e.g, Johnstone (2001) ). Our contribution, however, is to quantify this effect for factor models, and show that the large eigenvalues are not only biased high, but biased high by the same amount.
5.2 Nonuniform Residual Variances
Comparing Figure 2, 3, and 4, we can see that the relation between URM and UTM is analogous to that between EM and STM. Specifically, the equivalent data requirement of UTM versus URM behaves very similarly as that of STM versus EM. This should not be surprising, as we now explain.
To develop an intuitive understanding of this phenomenon, let us consider an idealized, analytically tractable context in which both EM and STM successfully estimate the relative magnitudes of residual variances. In particular, suppose we impose an additional constraint into (6) and an additional constraint into (8). ^{3}^{3}3Here we use the notation to mean that there exists such that . In this case, it is straightforward to show that EM is equivalent to URM with data scaled by , and STM is equivalent to UTM with the same scaled data. Therefore, by the argument given in Section 5.1, it is natural to expect STM outperforms EM.
A question that remains, however, is why MRH and TM are not as effective as STM. We believe the reason to be that they tend to select factor loadings that assign larger values than appropriate to variables with large residual variances. Indeed, such disadvantage has been observed in our synthetic data experiment: when the variation among residual variances increases, the performances of MRH and TM degrade significantly, as shown in Figure 3.
Again, let us illustrate this phenomenon through an idealized context. Specifically, consider a case in which the sample covariance matrix turns out to be identical to , with and , where is a vector with every component equal to . Recall that MRH uses the eigenvectors of corresponding to the largest eigenvalues as factor loading vectors. One would hope that factor loading estimates are insensitive to underlying residual variances. However, the following proposition suggests that, as grows, the first component of the first eigenvector of dominates other components by an unbounded ratio.
Proposition 2
Suppose and , . Let be the top eigenvector of . Then we have .
As such, the factor estimated by this top eigenvector can be grossly misrepresented, implying MRH is not preferable when residual variances differ significantly from each other.
TM suffers from a similar problem, though possibly to a lesser degree. The matrix in the TM formulation (7) represents an estimate of . For simplicity, let us consider an idealized TM formulation which further incorporates a constraint . That is,
(10)  
Using the same setting as in Proposition 2, we can show that when this idealized TM algorithm produces an estimate of exactly one factor as desired, the first component of the estimated factor loading vector is strictly larger than the other components, as formally described in the following proposition.
Proposition 3
Again, this represents a bias that overemphasizes the variable with large residual variance, even when we incorporate additional information into the formulation. On the contrary, it is easy to see that STM can accurately recover all major factors if similar favorable constraints are incorporated into its formulation (ie., if we set and in (8) ).
6 Conclusion
We proposed factor model estimates UTM and STM, both of which are regularized versions of those that would be produced via PCA. UTM deals with contexts where residual variances are assumed to be uniform, whereas STM handles variation among residual variances. Our algorithm for computing the UTM estimate is as efficient as conventional PCA. For STM, we provide an iterative algorithm with guaranteed convergence. Computational experiments involving both synthetic and real data demonstrate that the estimates produced by our approach are significantly more accurate than those produced by preexisting methods. Further, we provide a theoretical analysis that elucidates the way in which UTM and STM corrects biases induced by alternative approaches.
Let us close by mentioning a few possible directions for further research. Our analysis has relied on data being generated by a Gaussian distribution. It would be useful to understand how things change if this assumption is relaxed. Further, in practice estimates are often used to guide subsequent decisions. It would be interesting to study the impact of STM on decision quality and whether there are other approaches that fare better in this dimension. Our recent paper on directed principle component analysis
(Kao and Van Roy, 2012) relates to this. In some applications, PCA is used to identify a subspace for dimension reduction. It would be interesting to understand if and when the subspace identified by STM is more suitable. Finally, there is a growing body of research on robust variations of factor analysis and PCA. These include the pursuit of sparse factor loadings (Jolliffe et al., 2003; Zou et al., 2004; D’Aspremont et al., 2004; Johnstone and Lu, 2007; Amini and Wainwright, 2009), and the methods that are resistant to corrupted data (Pison et al., 2003; Candès et al., 2009; Xu et al., 2010). It would be interesting to explore connections to this body of work.Appendix A Proofs
Lemma 1
Fixing , consider the optimization problem
(11)  
Let be the solution to (11), , and be an eigendecomposition of matrix with orthonormal. Then we have , where is a diagonal matrix with entries .
Proof We can rewrite (11) as
s.t. 
Now associate a Lagrange multiplier with the constraint and write down the Lagrangian as
Let denote the dual solution. By KKT conditions we have:
(12)  
(13)  
(14) 
Recall that
(15) 
Plugging this into (12) we get
and so
By (13), both and are in .
Let an eigendecomposition of
be for which is orthonormal and . Using (14) and the fact that trace is invariant under cyclic permutations, we have
Since and , we can deduce
Let . Because is positive semidefinite, for all we also have . Furthermore, since
multiplying both sides of (A) by leads to
which shows is an eigenvector of . Recall that . Without loss of generality, we can take for all . Indeed, since , we have
which implies we can further take . This gives
(17) 
where is a diagonal matrix with entries , for . Plugging this into (A) we have
(18) 
For any , multiplying the both sides of the above equation by results in
which implies