Based on the simple but essential idea of diversification and optimal risk-return profile of an investment strategy, the mean-variance model by Markowitz (1952) still represents the groundwork for portfolio optimization. In its original design, Markowitz portfolio theory assumes perfect knowledge about the expected value and variance of returns. For practical implementations, however, these parameters have to be estimated from historical data. The misspecifications due to error in estimation can lead to strong deviations from optimality and therefore a suboptimal portfolio choice and inferior out-of-sample performance (Jobson and Korkie, 1981; Frost and Savarino, 1986; Michaud, 1989; Broadie, 1993). This major drawback has been tackled from different perspectives in the financial literature. Some focus on estimation errors in the portfolio weights directly (see, e.g., Brodie et al., 2009; DeMiguel et al., 2009a), whereas others work on the inputs by improving expected returns and the covariance matrix.
For specific effective solutions, it is important to keep in mind that it is more difficult to estimate expected returns than the covariances of returns (Merton, 1980) and that portfolio weights are extremely sensitive to changes in expected returns (Best and Grauer, 1991a, b). Jagannathan and Ma (2003) even observe that the “…estimation error in the sample mean is so large nothing much is lost in ignoring the mean altogether when no further information about the population mean is available” (pp. 1652–1653). It is therefore not surprising that a considerable part of recent academic research focuses on the global minimum-variance portfolio (GMV), as this does not depend on expected returns. (DeMiguel et al. (2009b) additionally show that the mean-variance portfolio is outperformed out-of-sample by the minimum-variance portfolio not only in terms of risk, but also with respect to the return-risk ratio.)
However, even if investors decide to use the GMV portfolio, the estimation errors associated with covariances can still lead to significant estimation errors in the portfolio weights. Several approaches have been proposed in the literature to deal with covariance estimation errors, particularly for high-dimensional data.
We extend this line of research by first reviewing selected efficient covariance estimation methodologies. Especially for high-dimensional data, the sample covariance matrix is highly ill-conditioned. This negative property intensifies when calculating its inverse, namely the sample precision matrix. We cover several approaches that have been shown to overcome these estimation issues and perform well in terms of out-of-sample variance. For instance, we discuss the linear shrinkage estimators of Ledoit and Wolf (2004a, b) designed to offer an optimal bias-variance trade-off between the sample covariance matrix and a structured target matrix. Furthermore, we adopt the recent nonlinear shrinkage technique by Ledoit and Wolf (2017a)
that is proven to be optimal under a variety of financially relevant loss functions (Ledoit and Wolf, 2018b). Since an underlying factor structure is considered a valid assumption for asset returns’ covariance estimators, we outline the elaborate principal orthogonal complement thresholding (POET) estimator by Fan et al. (2013). In addition, we follow the findings of the recent empirical studies by Goto and Xu (2015) and Torri et al. (2019) and include a sparse precision matrix estimator, namely the graphical least absolute shrinkage and selection operator (GLASSO), which has so far received little attention in portfolio optimization research. To the best of our knowledge, we are the first to compare state-of-the-art covariance estimators such as the nonlinear shrinkage and POET estimators with the GLASSO estimator.
More importantly, the selected covariance estimation methods share one thing in common: a regularization of the sample covariance estimator is performed with an individual tuning parameter to optimize out-of-sample performance. For example, linear shrinkage methods need an optimal shrinkage intensity to balance the included variance and bias, whereas the performance of the GLASSO depends on the level of sparsity, induced by a penalty parameter. The procedure for optimally identifying those parameters often includes the choice of a specific loss function to be minimized. Furthermore, as often advocated, a loss function or selection criterion in the model estimation is best aligned with the model evaluation framework (Christoffersen and Jacobs, 2004; Ledoit and Wolf, 2017a; Engle et al., 2019). To exploit those effects in more detail, we apply a nonparametric cross-validation (CV) technique with different optimality criteria to determine the optimal parameters for all the considered covariance estimators. CV is a widely accepted data-driven method for model selection (see Arlot et al., 2010). By splitting a dataset into training and testing periods, a model is repeatedly estimated with the training data and validated with the test data with reference to a pre-specified measure of fit.
Since we focus on enhancing the risk profile of the GMV portfolios, we choose two relevant risk-related measures for our data-driven estimation methodology and the corresponding out-of-sample performance evaluation, namely the mean squared forecasting error (MSFE), as in Zakamulin (2015), and the portfolio’s out-of-sample variance. We show that in most cases there exists a strong positive relation between the CV criterion and the respective out-of-sample performance measure. For instance, when the overall goal is to reduce the out-of-sample variance of a portfolio, then using CV based on that performance measure provides better results than a theoretical solution or CV based on the original optimality criterion. Similar results are documented by Liu (2014), although he only considers the most straightforward linear shrinkage as in Ledoit and Wolf (2003, 2004a, b). Here, we examine more recent and efficient estimation methods and identify those that, based on their theoretical properties, can actually profit from a data-driven methodology. In detail, estimators that depend strongly on the choice of a specific parameter within their derivation are more likely to benefit from replacing the ‘original’ solution with a data-driven one.
Our contributions to the current literature on covariance and precision matrix estimation within the portfolio optimization framework can be summarized as follows. First, we show that recent advances in estimation methods lead to strong improvements in the risk profile of the GMV. In this context, we emphasize the distinct and often significant outperformance of the GLASSO, which estimates sparse precision matrices by identifying pairs of zero partial correlations in asset returns data. In line with the main discussion, we show that a model’s outperformance with respect to out-of-sample portfolio variance coincides with an identical estimation objective in-sample. Although the advanced method outlined in Ledoit and Wolf (2017a) is not strongly influenced by the CV procedure, all the other considered estimators generally perform better. Increasing the number of included assets amplifies that behavior. Considering the MSFE, the results are straightforward for all methods: if an investor aims to minimize this measure, the respective data-driven estimation ought to be performed. Nonetheless, we examine and analyze the inefficiency of the MSFE for high-dimensional cases. In particular, because the calculation of the realized covariance matrix is distorted for high-dimensional return data and in the presence of multicollinearity, it is not advisable to consider the MSFE as a proxy for portfolio risk. Finally, to test the robustness of our results, we analyze the effects of the suggested methods on the GMV without short-sales. As argued by Jagannathan and Ma (2003), a constraint imposed on short-sales is similar to the linear shrinkage technique, since it reduces the impact of estimation errors on optimal portfolio weights.
Although we show that efficient covariance estimators indeed do not result in strong outperformance in comparison to the sample estimator, a CV with the portfolio’s variance as an optimality criterion improves the out-of-sample risk even in this scenario.
The rest of the paper is organized as follows: In Section 2, we review the considered covariance estimation methods and their properties. Section 3 outlines the suggested data-driven methodology in respect to its main characteristics: the cross-validation procedure, the parameter set, and the selection criteria. We describe the empirical study in Section 4 with a strong focus on the chosen dataset, methodology and performance measures. In Section 5, we discuss the performance of classical and constrained GMV portfolios and analyze in detail the influence of data-driven estimation among all considered datasets and methods. Section 6 summarizes the results and concludes.
2 Overview of the Estimation Methods
2.1 Sample Covariance
The standard approach for estimating the covariance matrix of returns among researchers and practitioners is to use the sample estimator, defined as
$$\hat{\Sigma} = \frac{1}{T-1}\,\big(R - \mathbf{1}_T\,\hat{\mu}^\top\big)^\top \big(R - \mathbf{1}_T\,\hat{\mu}^\top\big),$$
where $R$ is the $T \times N$ matrix of past asset returns with $T$ observations and $N$ stocks, $\hat{\mu}$ is the vector of expected returns, here estimated with the sample mean, and $\mathbf{1}_T$ is a $T$-dimensional vector of ones. As shown by Merton (1980), the sample covariance matrix is an asymptotically unbiased and consistent estimator of the true covariance matrix when the concentration ratio $N/T \to 0$. Because of the limited availability of historical data, a concentration ratio of such magnitude is practically infeasible. With a high ratio of the number of assets to the sample size, also called high-dimensionality, the sample covariance becomes more ill-conditioned and exhibits a larger amount of estimation error, mainly due to the over- and underestimation of the respective eigenvalues. In these cases, the inverse $\hat{\Sigma}^{-1}$ is a poor estimator of $\Sigma^{-1}$. While $\hat{\Sigma}$ is an unbiased estimator of $\Sigma$, its inverse is highly biased whenever $N/T$ does not vanish, with an expected value of $E[\hat{\Sigma}^{-1}] = \frac{T-1}{T-N-2}\,\Sigma^{-1}$ under the assumption of normality. Moreover, for $N \ge T$, the sample estimator becomes singular and the inverse cannot be calculated.
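This rank deficiency is easy to verify numerically. The following NumPy sketch, with hypothetical dimensions $T < N$, shows that the sample covariance matrix cannot have full rank:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 60                      # fewer observations than assets
R = rng.normal(size=(T, N))        # hypothetical return matrix

# Sample covariance of the demeaned returns (N x N)
S = np.cov(R, rowvar=False)

# With T <= N the sample covariance is singular:
# demeaning caps its rank at T - 1, far below N
print(np.linalg.matrix_rank(S))
```

Any attempt to invert such a matrix fails or produces numerically meaningless results, which is exactly the problem the estimators below address.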
The sample estimator’s instability and possible singularity are a problem within the mean-variance portfolio optimization framework, where the covariance matrix and, specifically, its inverse capture the dependency between asset returns and allow for the effect of diversification as a way of reducing risk. It is then a straightforward conclusion that the accuracy of optimally estimated portfolio weights is directly related to the precision of the covariance estimator. As a solution to this problem, several alternative estimators have been proposed in the literature.
In the case of high-dimensionality and potentially high multicollinearity, there are two main groups of suggested estimation techniques. First, if no structure on the covariance matrix is assumed, recent studies recommend linear and nonlinear shrinkage techniques, which adjust the sample covariance matrix using a specific shrinkage function. As candidates for these, we consider the methods of Ledoit and Wolf (2004a, b, 2017b). The second broad branch of research addresses the deficiencies of the sample estimator by inducing structure on the covariance matrix or its inverse. Here we consider factor and graph models as in Fan et al. (2013) and Friedman et al. (2008), respectively.
2.2 Linear Shrinkage
To produce more stable estimators of the covariance matrix, a linear shrinkage procedure can be applied to the sample estimator towards a more structured target matrix $F$ as follows:
$$\hat{\Sigma}_{\text{Shrink}} = \delta F + (1 - \delta)\,\hat{\Sigma}, \qquad (1)$$
where the constant $\delta \in [0, 1]$ controls the shrinkage intensity, which is set higher the more ill-conditioned the sample estimator is, and vice versa. In contrast to the unbiased but unstable sample covariance matrix, a structured target matrix has little estimation error but tends to be biased. As a compromise, a convex combination of both exploits the bias-variance trade-off by accepting more bias in-sample in exchange for less variance out-of-sample. This idea is central to the shrinkage methodology of Stein (1956) and James and Stein (1961). In search of linear shrinkage techniques for covariance matrices, Ledoit and Wolf (2004b)
propose a set of asymptotically optimal estimators. If no specific correlation structure among assets is assumed, a generic target matrix would be the scaled identity matrix $F = \bar{\sigma}^2\,\mathbf{I}_N$. The asymptotically optimal linear shrinkage estimator is calculated as follows:
$$\hat{\Sigma}_{\text{LW}} = \delta^{*}\,\bar{\sigma}^2\,\mathbf{I}_N + (1 - \delta^{*})\,\hat{\Sigma}, \qquad (2)$$
where $\bar{\sigma}^2$ is the average of all individual sample variances and $\delta^{*}$ is the optimal shrinkage intensity parameter. However, in the context of financial time series, it may be beneficial to consider target matrices that reflect more of the correlation structure of asset returns.
Ledoit and Wolf (2004a) consider identical pairwise correlations between all assets. The target matrix is therefore derived under the constant correlation model of Elton and Gruber (1973). While the variances are kept at their original sample values, the off-diagonal entries of the target matrix are estimated by assuming a constant average sample correlation $\bar{\rho}$, resulting in $F_{ij} = \bar{\rho}\,\sqrt{\hat{\sigma}_{ii}\,\hat{\sigma}_{jj}}$ for $i \neq j$ and $F_{ii} = \hat{\sigma}_{ii}$. The corresponding estimator is therefore defined as
$$\hat{\Sigma}_{\text{CC}} = \delta^{*} F + (1 - \delta^{*})\,\hat{\Sigma}. \qquad (3)$$
The level of shrinkage in Equations (2) and (3), the parameter $\delta$, can be obtained analytically. In particular, as shown by Ledoit and Wolf (2004a, b), asymptotically consistent estimators of the optimal linear shrinkage intensity are derived under the quadratic loss function
$$L\big(\hat{\Sigma}_{\delta}, \Sigma\big) = \big\|\hat{\Sigma}_{\delta} - \Sigma\big\|_F^2,$$
known as the Frobenius loss, where the covariance estimator $\hat{\Sigma}_{\delta}$ is substituted according to Equation (2) or (3). The finite-sample solution is found at the minimum of the expected value of the Frobenius loss, namely the risk function mean squared error (MSE), as follows:
$$\delta^{*} = \operatorname*{arg\,min}_{\delta}\; E\Big[\big\|\hat{\Sigma}_{\delta} - \Sigma\big\|_F^2\Big].$$
The methodology behind this derivation is universal and can be applied to other shrinkage targets in a convex combination setting after an individually performed analysis and the respective mathematical adaptation. Within our data-driven implementation of the linear shrinkage estimators, however, we do not rely on the theoretically optimal shrinkage intensity; instead, we search for an optimal value using the suggested CV technique. Although simple in construction, linear shrinkage methodologies assume an appropriate target matrix representing a reasonable covariance structure. By contrast, nonlinear shrinkage does not require an exogenous choice of such a target matrix.
2.3 Nonlinear Shrinkage
The nonlinear shrinkage method, first proposed by Ledoit and Wolf (2012), shrinks covariance entries by increasing small (underestimated) sample eigenvalues and decreasing large (overestimated) ones in an individual fashion. Without any assumption about the true covariance matrix, the positive-definite rotationally equivariant nonlinear shrinkage estimator (this class of estimators, first introduced by Stein (1986), is based on the spectral decomposition of the sample covariance matrix) is defined as
$$\hat{\Sigma}_{\text{NL}} = U\,\hat{\Lambda}\,U^\top,$$
where $U$ is the matrix with the sample eigenvectors as columns and $\hat{\Lambda}$ is the diagonal matrix of the eigenvalues, shrunk as shown in Ledoit and Wolf (2012, 2015, 2017a, 2018a). To find the optimal nonlinear shrinkage of the eigenvalues, Ledoit and Wolf (2012)
originally minimize the MSE in finite samples. However, under the considered large-dimensional asymptotics from the field of random matrix theory, the Frobenius loss converges almost surely to a nonstochastic limit, guaranteeing the optimality of the nonlinear shrinkage estimator. Furthermore, as shown by Ledoit and Wolf (2017a) and Engle et al. (2019), the estimator yields the same mathematical solution under the minimum-variance loss function, defined as the out-of-sample variance of an optimal Markowitz portfolio with
$$L_{\text{MV}}\big(\hat{\Sigma}, \Sigma\big) = w^\top \Sigma\, w, \qquad w = \frac{\hat{\Sigma}^{-1} m}{\mathbf{1}^\top \hat{\Sigma}^{-1} m},$$
where $w$ are the optimal weights, $\hat{\Sigma}$ is the proposed estimator, $\Sigma$ is the true covariance matrix, and $m$ is an assumed vector of expected asset returns. This tailored loss function is crucial for the application of this estimator within a portfolio optimization framework. Overall, as shown in Ledoit and Wolf (2018b), the nonlinear shrinkage estimator is optimal under a practically relevant set of loss functions.
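The structure of a rotation-equivariant estimator is simple to illustrate: keep the sample eigenvectors, replace the eigenvalues. The shrinkage function below is a toy stand-in, not the Ledoit–Wolf nonlinear shrinkage formula:

```python
import numpy as np

def rotation_equivariant(S, shrink_fn):
    """Rotation-equivariant estimator: keep the sample eigenvectors and
    replace the sample eigenvalues by shrunk ones. shrink_fn stands in
    for the Ledoit-Wolf nonlinear shrinkage function, which we do not
    reproduce here."""
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(shrink_fn(lam)) @ U.T

# Illustrative stand-in: pull each eigenvalue halfway toward the mean,
# raising small eigenvalues and lowering large ones
toy_shrink = lambda lam: 0.5 * lam + 0.5 * lam.mean()

rng = np.random.default_rng(2)
S = np.cov(rng.normal(size=(120, 40)), rowvar=False)
Sigma = rotation_equivariant(S, toy_shrink)

# The eigenvalue spread shrinks while the total variance is preserved
print(np.ptp(np.linalg.eigvalsh(Sigma)) < np.ptp(np.linalg.eigvalsh(S)))  # True
```

The actual Ledoit–Wolf method derives the shrinkage function from the limiting spectral distribution rather than an ad hoc rule, but the eigendecomposition scaffolding is the same.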
Without going into further detail, we examine one consideration necessary for the direct estimation technique, as demonstrated by Ledoit and Wolf (2018a). The optimal solution for the nonlinear shrinkage is achieved using a nonparametric variable-bandwidth kernel estimation of the limiting spectral density of the sample eigenvalues and its Hilbert transform. The speed at which the bandwidth vanishes in the number of assets can be set according to standard kernel density estimation theory (Silverman, 1986) or following the Arrow model of Ledoit and Wolf (2018b). For the practical implementation of the nonlinear estimator, the average of those two speeds is suggested as a compromise. Within the data-driven methodology for nonlinear shrinkage, we aim to verify whether this choice of the kernel bandwidth’s speed can be outperformed in terms of the estimator’s efficiency, using a CV method with an aligned optimality criterion.
2.4 Approximate Factor Model
The previously outlined estimation techniques adjust the sample eigenvalues, while the eigenvectors are kept unchanged. Such approaches are only justifiable in the absence of a priori information on the structure of the covariance matrix and thus on the direction of the eigenvectors. In contrast, some structure-based approaches impose sparsity on the covariance matrix itself; see Bickel and Levina (2008) and Cai and Liu (2011). A more appropriate assumption in the case of financial time series is conditional sparsity with the help of factor models, so that the covariance matrix estimator is given by
$$\hat{\Sigma} = \hat{B}\,\hat{\Sigma}_F\,\hat{B}^\top + \hat{\Sigma}_u,$$
where $\hat{B}$ is the matrix of estimated factor loadings, $\hat{\Sigma}_F$ is the sample covariance matrix of the common factors, and $\hat{\Sigma}_u$ is the covariance matrix of the residuals. (Following this definition and assuming $K$ common factors with $K \ll N$, a covariance matrix estimator based on factor models needs to estimate far fewer parameters and is thus more stable.) One disadvantage of such exact factor models is the strong assumption of no correlation in the error terms across assets; that is, the error covariance matrix is assumed to contain only the sample variances of the residuals. Possible cross-sectional correlations after separating out the common factors are therefore neglected (Fan et al., 2013)
. Instead, approximate factor models allow for off-diagonal values within the error covariance matrix. The POET estimator is one of the most recent and efficient estimators from this branch of research. Using the close connection between factor models and principal component analysis, Fan et al. (2013) infer the necessary factor loadings by running a singular value decomposition on the sample covariance matrix as follows:
$$\hat{\Sigma} = \sum_{i=1}^{K} \hat{\lambda}_i\,\hat{\xi}_i\hat{\xi}_i^\top + \hat{R},$$
where $\hat{\lambda}_1 \ge \dots \ge \hat{\lambda}_N$ are the sample eigenvalues and $\hat{\xi}_i$ the corresponding eigenvectors. The part formed by the first $K$ principal components contains most of the information about the implied structure. The remainder $\hat{R}$ is assumed to be an approximately sparse matrix. For its estimation, an adaptive thresholding procedure (Cai and Liu, 2011) with a threshold parameter $\tau$ is applied. (For the operational use of POET, the threshold value needs to be determined so that the positive-definiteness of $\hat{R}^{\tau}$ is assured in finite samples. The choice of $\tau$ can therefore be made from a set for which the minimal eigenvalue of the errors’ covariance matrix after thresholding is positive; the minimal constant that guarantees positive-definiteness is then chosen. For more details, see Fan et al. (2013).) As a result, the POET estimator can be formulated as the sum of a low-rank and a sparse matrix as follows:
$$\hat{\Sigma}_{\text{POET}} = \sum_{i=1}^{K} \hat{\lambda}_i\,\hat{\xi}_i\hat{\xi}_i^\top + \hat{R}^{\tau}.$$
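The low-rank-plus-sparse construction can be sketched as follows. Note that the fixed hard threshold `tau` is a simplification for illustration; Fan et al. (2013) use adaptive, entry-dependent thresholds:

```python
import numpy as np

def poet(R, K, tau):
    """Sketch of the POET idea: a low-rank part from the first K principal
    components of the sample covariance plus a thresholded residual part.
    The fixed threshold tau replaces the adaptive thresholding of
    Fan et al. (2013) purely for illustration."""
    S = np.cov(R, rowvar=False)
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]                 # descending eigenvalues
    low_rank = U[:, :K] @ np.diag(lam[:K]) @ U[:, :K].T
    resid = S - low_rank                           # principal orthogonal complement
    sparse = np.where(np.abs(resid) >= tau, resid, 0.0)
    np.fill_diagonal(sparse, np.diag(resid))       # never threshold the variances
    return low_rank + sparse

rng = np.random.default_rng(3)
F = rng.normal(size=(500, 3))                      # three common factors
B = rng.normal(size=(3, 30))                       # factor loadings
R = F @ B + 0.5 * rng.normal(size=(500, 30))       # simulated returns
Sigma = poet(R, K=3, tau=0.05)
print(Sigma.shape)                                 # (30, 30)
```

The simulated data have an exact three-factor structure, so the first three principal components absorb nearly all the cross-sectional dependence and the residual part is close to diagonal.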
As argued by Fan et al. (2013), for high-dimensional data with a sufficiently large $T$, the number of factors $K$ can be inferred from the data. For this case, the recent literature proposes several sophisticated data-driven methodologies. A consistent data-driven estimator is adopted in the following form:
$$\hat{K} = \operatorname*{arg\,min}_{0 \le K \le M}\; \log\!\left(\frac{1}{NT}\,\big\| R\,\big(\mathbf{I}_N - \hat{\Xi}_K\hat{\Xi}_K^\top\big) \big\|_F^2\right) + K\,g(T, N),$$
where $M$ is the predefined maximum number of factors, $R$ is the $T \times N$ matrix of asset returns with sample covariance matrix $\hat{\Sigma}$, $\hat{\Xi}_K$ is the $N \times K$ matrix whose columns are the eigenvectors corresponding to the $K$ largest eigenvalues of $\hat{\Sigma}$, and $g(T, N)$ is a penalty function of the type introduced by Bai and Ng (2002). Although the suggested inference of $K$ is data-driven, in this study we further examine whether a CV technique with an appropriate optimality criterion can achieve better results.
2.5 Graphical Model
In a portfolio optimization context, the inverse covariance matrix is the direct input necessary for exploiting diversification effects upon optimization. While the covariance matrix only captures marginal correlations among returns, the precision (or concentration) matrix $\Theta = \Sigma^{-1}$ captures conditional correlations (Fan et al., 2016). Instead of imposing a factor structure on the covariance matrix with a sparse error covariance as in POET, sparsity in the precision matrix can be a valid approach for reducing estimation errors, especially in the case of conditional independence among asset pairs. In detail, the entry $\theta_{ij} = 0$ if and only if asset returns $i$ and $j$ are independent, conditional on the other assets in the investment universe. Since graphical models are used to describe both the conditional and unconditional dependence structures of a set of variables, the estimation of $\Theta$ is closely related to graphs under a Gaussian model. Within the Markowitz portfolio optimization framework, asset returns are assumed to follow a multivariate normal distribution, and the identification of zeros in the inverse can therefore be performed with the Gaussian graphical model. (This idea was first proposed by Dempster (1972) with the so-called covariance selection model.)
One of the most commonly used approaches for inducing sparsity on the precision matrix is to penalize the maximum likelihood. For i.i.d. normally distributed returns, the Gaussian log-likelihood function, expressed in terms of the precision matrix $\Theta = \Sigma^{-1}$ and up to constants, is given by
$$\ell(\Theta) = \log\det(\Theta) - \operatorname{tr}\!\big(\hat{\Sigma}\,\Theta\big), \qquad (10)$$
where $\det$ denotes the determinant and $\operatorname{tr}$ the trace of a matrix. Maximizing Equation (10) alone yields the known maximum-likelihood estimator for the precision matrix, $\hat{\Theta} = \hat{\Sigma}^{-1}$, which suffers from high estimation error. To reduce such errors, a penalized maximum log-likelihood function can be optimized. The latter is obtained by adding a lasso penalty (Tibshirani, 1996) on the precision matrix entries as follows:
$$\hat{\Theta}_{\text{GLASSO}} = \operatorname*{arg\,max}_{\Theta \succ 0}\; \log\det(\Theta) - \operatorname{tr}\!\big(\hat{\Sigma}\,\Theta\big) - \rho\,\big\|\Theta^{-}\big\|_1, \qquad (11)$$
where $\|\Theta^{-}\|_1$ is the $\ell_1$-norm (the sum of the absolute values) of $\Theta^{-}$, an $N \times N$ matrix whose off-diagonal elements equal the corresponding elements of the precision matrix and whose diagonal elements equal zero, so that no penalty is applied to the asset returns’ sample variances. Furthermore, $\rho \ge 0$ is a penalty parameter that controls the sparsity level, with higher values leading to a larger number of zero off-diagonal elements in the resulting estimator.
The penalized likelihood framework for sparse graphical model estimation was first proposed by Yuan and Lin (2007), who solve Equation (11) with an interior-point method. Banerjee et al. (2008) show that the problem is convex and solve it for the covariance matrix with a box-constrained quadratic program. To date, the fastest available solution is reached with the GLASSO algorithm, developed by Friedman et al. (2008) and later improved by Witten et al. (2011). They demonstrate that the above formulation is equivalent to N coupled lasso problems and solve it using a coordinate descent procedure. (For more details on the GLASSO algorithm, see Friedman et al. (2008), Goto and Xu (2015), and Torri et al. (2019).)
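The Friedman et al. (2008) algorithm is available in scikit-learn; a minimal sketch of the penalized estimation in Equation (11), on simulated independent returns, looks as follows:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(4)
R = rng.normal(size=(300, 10))          # simulated, mutually independent returns
S = np.cov(R, rowvar=False)

# Penalized maximum likelihood with lasso penalty rho on the precision
# entries, via scikit-learn's implementation of Friedman et al. (2008)
rho = 0.2
_, precision = graphical_lasso(S, alpha=rho)

# A larger rho produces more zero entries, i.e. a sparser precision matrix
print(np.sum(np.isclose(precision, 0.0)) > 0)
```

Since the simulated assets are independent, even a moderate penalty zeroes out essentially all off-diagonal precision entries; on real returns the surviving entries identify the conditionally dependent asset pairs.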
In addition to a fast algorithm, a value of the tuning parameter $\rho$ is necessary for calculating an optimal GLASSO estimator. For this purpose, Yuan and Lin (2007) suggest using the Bayesian information criterion (BIC), defined for each value of $\rho$ as the sum of the respective negative log-likelihood value and a penalty on the number of nonzero elements as follows:
$$\text{BIC}(\rho) = -\log\det\big(\hat{\Theta}_{\rho}\big) + \operatorname{tr}\!\big(\hat{\Sigma}\,\hat{\Theta}_{\rho}\big) + \frac{\log T}{T}\sum_{i \le j} \mathbb{1}\big\{\hat{\theta}_{ij,\rho} \neq 0\big\},$$
where the indicator function counts the number of nonzero elements in the estimated precision matrix. The value of $\rho$ corresponding to the lowest BIC is chosen as the optimal lasso penalty parameter. The choice of the BIC as a selection criterion for $\rho$ is further justified by the relation between the penalized problem in Equation (11) and the model selection criteria, the Akaike information criterion (AIC) and BIC of Akaike (1974) and Schwarz et al. (1978), respectively. As pointed out by Goto and Xu (2015), these information criteria penalize the number of parameters through a cardinality constraint to prevent high model complexity and overfitting. Replacing the cardinality with a lasso penalty combines model selection and parameter estimation in a common convex optimization problem. (Additional theoretical properties of the methods presented above have been investigated by Yuan and Lin (2007), Rothman et al. (2008), and Lam and Fan (2009), among others. For a recent overview of further estimation techniques for sparse precision matrices, see Fan et al. (2016).)
Although thoroughly analyzed from a statistical perspective, GLASSO has seen only a few applications within a portfolio optimization framework. (Goto and Xu (2015) induce sparsity to enhance robustness and lower the estimation error within portfolio hedging strategies; Brownlees et al. (2018) develop a procedure called “realized network” by applying GLASSO as a regularization procedure for realized covariance estimators; and Torri et al. (2019) analyze the out-of-sample performance of a minimum-variance portfolio estimated with GLASSO.) One of the reasons is that graphical models are often considered unfitting because of the assumed sparsity (see Ledoit and Wolf, 2017a). However, as outlined by Goto and Xu (2015) and Torri et al. (2019), a sparse precision matrix does not imply sparsity in the covariance matrix, since the asset returns remain correlated with each other. Although Yuan and Lin (2007) argue that a CV procedure for an optimal lasso penalty can yield better out-of-sample results, the existing financial applications estimate $\rho$ only in-sample using selection criteria such as the BIC. By contrast, we additionally consider the superiority of data-driven methods in this context and perform a multi-fold CV with two risk-related optimality criteria. The exact methodology is described in the next section.
3 Data-Driven Methodology
Each of the outlined covariance estimation methods includes an exogenous or data-dependent parameter. The linear shrinkage estimators in Equations (2) and (3) are calculated with an optimal common shrinkage intensity $\delta^{*}$. For the more general nonlinear shrinkage technique, Ledoit and Wolf (2017a) set the kernel bandwidth’s speed as the average of two recognized approaches. The approximate factor model, the POET estimator by Fan et al. (2013), deals with an unknown number of factors $K$, which is identified by minimizing popular information criteria. Finally, the GLASSO estimator proposed by Friedman et al. (2008) needs an optimal choice of the penalty parameter $\rho$, often identified by minimizing the BIC in-sample. To clarify our analysis, we refer to these estimation methods as ‘original’. In addition, we adopt a nonparametric technique, a multi-fold CV, to identify the necessary parameter for each estimation method in a data-driven way. Instead of relying on pre-specified assumptions and deriving corresponding solutions individually, we perform a grid search over a domain of values and find the best possible parameter for arbitrary out-of-sample statistics.
3.1 Parameter Set
To employ a data-driven choice, we first need to specify a domain of possible values for the necessary parameters that should be selected within the CV procedure. For this purpose we create a sequence (or grid) of arbitrary parameters for each covariance model. Depending on the chosen length of the sequence, the CV can be computationally time-consuming. Since the choice of this sequence is crucial for the out-of-sample efficiency of the data-driven methodology, the domain of possible parameters has to be individually evaluated for each estimation method by considering the trade-off between desired precision and computing time. Subsection 4.2 outlines the examined sequences for the considered covariance estimation methods.
3.2 Cross-Validation Procedure
The CV is a model validation technique designed to assess how an estimated model would perform on an unknown dataset. To evaluate model accuracy, the available dataset is repeatedly split into a training and a testing subset in a rolling-window fashion (see, e.g., Refaeilzadeh et al., 2009). In the case of an $h$-fold CV, a dataset with $T$ observations is split into equal parts of $m$ consecutive observations. The first rolling window uses the first fold, consisting of the first $m$ observations ordered by time, as training data. The consecutive $m$ observations are then used as test data to validate the performed estimation. This is done iteratively $h$ times by shifting the training window by $m$ observations, thereby maintaining the chronological order within the data. (Efron and Gong (1983) and Arlot et al. (2010), for example, provide more information about the construction and performance of the CV technique.)
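A minimal sketch of such a chronological split, assuming equal fold length $m = T/(h+1)$ (the exact fold bookkeeping is an illustrative assumption):

```python
import numpy as np

def rolling_folds(T, h):
    """Rolling-window h-fold split that preserves chronological order.
    Assumes equal fold length m = T // (h + 1): fold i trains on
    observations [i*m, (i+1)*m) and tests on the following m observations."""
    m = T // (h + 1)
    for i in range(h):
        train = np.arange(i * m, (i + 1) * m)
        test = np.arange((i + 1) * m, (i + 2) * m)
        yield train, test

folds = list(rolling_folds(T=120, h=5))
print(len(folds))                                # 5
# Every test window starts right after its training window ends
print(all(tr[-1] < te[0] for tr, te in folds))   # True
```

Unlike random $K$-fold CV, no future observation ever enters a training window, which is essential for time-series data.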
For each of the pre-defined parameter values and each covariance estimation model, we successively use the training data to calculate a covariance matrix estimator for the corresponding test dataset. (For clarity of notation, we do not differentiate between covariance estimation methods; the procedure is applied to all methods equally.) During the following validation stage, we must set selection criteria, also referred to as measures of fit, to identify which parameter value performs best. In this study, we investigate two common objective functions within the field of portfolio risk minimization.
As often argued and applied, the squared forecasting error (SFE) or, as defined in Section 2, the Frobenius loss is minimized to find a covariance estimator with the least forecasting error (e.g., Zakamulin, 2015). Specifically, we first calculate a realized covariance matrix for the test dataset with
$$\hat{\Sigma}_{\text{real}} = \frac{1}{m}\sum_{t=1}^{m} (r_t - \bar{r})(r_t - \bar{r})^\top,$$
where $r_t$ are the asset returns from the test dataset and $\bar{r}$ is the vector of average returns for the testing period consisting of $m$ observations. Then, we calculate the corresponding SFE as
$$\text{SFE} = \big\|\hat{\Sigma} - \hat{\Sigma}_{\text{real}}\big\|_F^2.$$
This procedure is repeated $h$ times, so that we end up with $h$ SFE values for each candidate parameter. From the parameter set, we then choose the value for which the average (over all iterations) SFE is minimized. In our empirical study, the data-driven estimation method with the SFE as a measure of fit is referred to as CV1.
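The CV1 measure of fit for one fold reduces to a few lines; the hypothetical train/test data below only illustrate the calculation:

```python
import numpy as np

def sfe(sigma_hat, R_test):
    """CV1 measure of fit: squared Frobenius distance between the candidate
    estimator and the realized covariance of the test window."""
    sigma_real = np.cov(R_test, rowvar=False, bias=True)
    return float(np.sum((sigma_hat - sigma_real) ** 2))

rng = np.random.default_rng(5)
R_train = rng.normal(size=(200, 5))      # hypothetical training window
R_test = rng.normal(size=(50, 5))        # hypothetical test window
sigma_hat = np.cov(R_train, rowvar=False)
err = sfe(sigma_hat, R_test)
print(err >= 0.0)                        # True
```

In the full procedure this value is averaged over all folds for each candidate parameter, and the minimizer is selected.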
Instead of the SFE, within a portfolio optimization framework one is generally more interested in whether a covariance estimator leads to lower out-of-sample risk of a portfolio (see, e.g., Liu, 2014; Ledoit and Wolf, 2017a; Engle et al., 2019). To incorporate and later investigate this concept, as our second scenario (CV2), we minimize the out-of-sample variance of a portfolio. In detail, with the previously estimated covariance matrix $\hat{\Sigma}$, we calculate the optimal weights $w$ for a portfolio of our choice (e.g., the GMV). This then allows us to calculate the respective portfolio returns throughout the testing period of $m$ observations as follows:
$$r_{p,t} = w^\top r_t, \qquad t = 1, \dots, m.$$
This procedure is repeated $h$ times, so that we end up with $h$ portfolio return vectors for each candidate parameter. From the parameter set, we then choose the value for which the empirical variance (over all iterations) of those out-of-sample portfolio returns is minimized.
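For a single fold, the CV2 criterion can be sketched as follows (GMV weights as the portfolio of choice; the simulated data are purely illustrative):

```python
import numpy as np

def cv2_variance(sigma_hat, R_test):
    """CV2 measure of fit: out-of-sample variance of the GMV portfolio
    implied by the candidate estimator, evaluated on test-window returns."""
    ones = np.ones(sigma_hat.shape[0])
    w = np.linalg.solve(sigma_hat, ones)
    w /= w.sum()                         # GMV weights
    return float(np.var(R_test @ w))     # variance of the portfolio returns

rng = np.random.default_rng(6)
R_train = rng.normal(size=(200, 5))
R_test = rng.normal(size=(50, 5))
sigma_hat = np.cov(R_train, rowvar=False)
v = cv2_variance(sigma_hat, R_test)
print(v >= 0.0)                          # True
```

The only difference from CV1 is the measure of fit: the estimator is judged by the realized risk of the portfolio it induces, not by its distance to a realized covariance matrix.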
4 Empirical Study
In addition to discussing original covariance estimation methods, our study carries out a nonparametric CV technique. We explicitly address the importance of the aligned selection criteria for the out-of-sample performance of a specific methodology by applying different measures of fit within the data-driven estimation. To exploit these considerations, we perform an extensive empirical study of the suggested covariance estimation methods within a portfolio optimization context. For this purpose, we create GMV portfolios with and without short-sales and evaluate their out-of-sample performance for a range of commonly used measures. Moreover, we compare the theoretical parameters with their calibrated equivalents. The exact empirical construct is elaborated on in the following subsections.
4.1 Model Setup
For the empirical study, we focus on the GMV portfolio. The optimal weights $\hat{w}_t$ for an investment period $t$ are determined by minimizing the portfolio's variance as follows:
$\hat{w}_t = \arg\min_{w} \; w'\hat{\Sigma}_t w \quad \text{s.t.} \quad w'\mathbf{1} = 1,$
where $\mathbf{1}$ is an $n$-dimensional vector of ones and $\hat{\Sigma}_t$ is an arbitrary covariance matrix estimator for the investment period $t$. This formulation has the analytical solution
$\hat{w}_t = \frac{\hat{\Sigma}_t^{-1}\mathbf{1}}{\mathbf{1}'\hat{\Sigma}_t^{-1}\mathbf{1}}. \qquad (13)$
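The analytical GMV solution can be computed directly from any covariance estimate; the sketch below (a hypothetical helper, using a linear solve instead of an explicit matrix inverse for numerical stability) illustrates this:

```python
import numpy as np

def gmv_weights(sigma):
    """Analytic GMV weights: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    n = sigma.shape[0]
    ones = np.ones(n)
    inv_ones = np.linalg.solve(sigma, ones)  # computes Sigma^{-1} 1
    return inv_ones / (ones @ inv_ones)      # normalize so the weights sum to one
```

For a diagonal covariance matrix, the resulting weights are inversely proportional to the asset variances, which matches the intuition of down-weighting riskier assets.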
Furthermore, we consider the following GMV portfolio with an imposed short-sale constraint on the weights (GMV-NOSHORT):
$\hat{w}_t = \arg\min_{w} \; w'\hat{\Sigma}_t w \quad \text{s.t.} \quad w'\mathbf{1} = 1, \; w \geq 0, \qquad (14)$
which is a quadratic optimization problem with linear constraints that can be solved with any standard quadratic optimization software. Here, the CV2 measure of fit is calculated differently depending on the assumed portfolio structure. When short-sales are allowed, we use the solution of Equation (13) to calculate the portfolio's variance as in Subsection 3.2. In the case of GMV-NOSHORT, on the contrary, we solve Equation (14) within the CV.
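As a rough illustration of the constrained case, the no-short problem can be handed to a generic solver; the sketch below uses SciPy's SLSQP routine, which is our illustrative choice rather than the software used in the study:

```python
import numpy as np
from scipy.optimize import minimize

def gmv_noshort(sigma):
    """GMV weights under w >= 0 and sum(w) = 1, solved as a generic QP."""
    n = sigma.shape[0]
    res = minimize(
        lambda w: w @ sigma @ w,             # portfolio variance objective
        x0=np.full(n, 1.0 / n),              # start from equal weights
        jac=lambda w: 2.0 * sigma @ w,       # analytic gradient
        bounds=[(0.0, None)] * n,            # no short positions
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```

When the unconstrained solution is already nonnegative, the constrained solver recovers it; otherwise, the negative positions are pushed to zero and the remaining weights rebalance.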
As discussed in the recent literature (see, e.g., Jagannathan and Ma, 2003; DeMiguel et al., 2009b), the introduction of a short-sale constraint is not only practically relevant because of common fund rules or budget constraints for individual investors; it moreover limits the estimation error in the portfolio weights. As a consequence, we would expect a slightly reduced effect of the estimation methods' efficiency with respect to the out-of-sample performance. The empirical analysis of both GMV and GMV-NOSHORT portfolios thus aims to complement our study of efficient covariance estimation methods and their CV1 and CV2 data-driven counterparts and, moreover, to ensure the practical reproducibility and relevance of our results.
4.2 Data and Methodology
To test the performance of the proposed covariance matrix estimation methods, we utilize four S&P 500 related datasets, which differ only in the number of assets used: 50, 100, 200 and 250 stocks. We consider daily prices from the Thomson Reuters EIKON database. The corresponding discrete returns start on 01/01/1990 and end on 12/31/2018. Overall, our data include observations for 348 months (or 7306 days) per asset. We use daily returns to keep the empirical results as close as possible to a practical scenario. In addition, daily returns are technically necessary for the implementation of the proposed CV procedure, especially the CV1 method, which involves the calculation of a monthly realized covariance matrix. Another important feature of the daily data is the better performance of the sample covariance estimator (Jagannathan and Ma, 2003; Zakamulin, 2015). To further ensure the stability of our results, we randomly select the necessary number of stocks among all the companies that have survived throughout the investigated period and keep them as the investment universe for our empirical study. Since the chosen datasets consist of individual stocks, the adopted investment strategies can be recreated easily and cost-efficiently in practice by simply buying or selling the respective amount of stocks. By choosing datasets with different quantities of assets, we aim to study the behavior of the covariance estimators when the in-sample concentration ratio becomes increasingly large.
To evaluate the out-of-sample performance of the constructed portfolios and, implicitly, the covariance estimation methods, we adopt a rolling-window study with an in-sample period of two years, i.e., 24 months (or roughly 504 days), and an out-of-sample period from 01/01/1992 to 12/31/2018, resulting in 324 months (or 6801 days) of out-of-sample portfolio returns. We employ a monthly rebalancing strategy, since this is more cost-efficient and common in practice. Within each rolling-window step, the covariance matrix of asset returns for the investment month $t$ is estimated at the end of month $t-1$ using approximately the most recent 504 daily in-sample observations.
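The rolling-window scheme described above can be sketched as follows; `estimate_cov` is a placeholder for any of the considered covariance estimators, and the GMV weights are formed analytically for brevity:

```python
import numpy as np

def rolling_oos_returns(returns, estimate_cov, window=504, hold=21):
    """For each rebalancing date, estimate the covariance on the trailing window,
    form GMV weights, and collect the daily out-of-sample portfolio returns."""
    T, n = returns.shape
    ones = np.ones(n)
    oos = []
    for start in range(window, T - hold + 1, hold):
        sigma_hat = estimate_cov(returns[start - window:start])
        w = np.linalg.solve(sigma_hat, ones)
        w /= ones @ w                                  # GMV weights, summing to one
        oos.extend(returns[start:start + hold] @ w)    # roughly one month of returns
    return np.array(oos)
```

With a 504-day window and a 21-day holding period, each step uses only information available at the rebalancing date, mirroring the monthly strategy in the study.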
In our empirical study, the sample covariance estimator serves as a benchmark for the high-dimensional estimation methods and the considered data-driven adjustments in terms of out-of-sample risk. For the application of the nonlinear shrinkage, we use the MATLAB code provided by Ledoit and Wolf (https://www.econ.uzh.ch/en/people/faculty/wolf/publications.html). The POET estimator is calculated using the R-package POET provided by Fan et al. (2013). As suggested by the authors, we adopt a soft-thresholding rule as well as a data-driven derivation of the number of factors and the thresholding constant for the original version of the method. Finally, the GLASSO estimator is calculated with the algorithm provided by Friedman et al. (2008) within the R-package glasso, with no penalty on the diagonal elements.
In addition to the models in Section 2, we calculate the data-driven estimators as in Section 3 by applying $K$-fold CV. To calculate the selection criteria for the respective CV methods, we divide the in-sample observations into a training sample of 12 months (or 252 days) and a testing sample of one month (or 21 days). With this construction, we replicate the proposed monthly rebalancing strategy inside the performed CV and ensure more stable results for the data-driven covariance estimators. As introduced in Subsection 3.1, we additionally need to define a set of parameters for each covariance estimation method.
Since both linear shrinkage methods in Equation (2) (LW1) and Equation (3) (LWCC) represent a weighted average between the sample and a target covariance matrix, we define a parameter set of G shrinkage intensities within the unit interval [0, 1]. Considering the reasoning in Subsection 2.3, for the nonlinear shrinkage estimator in Equation (6) (LWNL), we let the kernel bandwidth's speed vary over a fixed interval. Since the accuracy of a data-driven estimation depends on the number of examined parameters, with more parameters allowing for finer results, we consider a linear grid of equidistant values in the above cases. Furthermore, for the POET estimator, we consider a grid over its tuning parameters. For the GLASSO estimator, we follow Friedman et al. (2008) and choose a sequence of penalty parameters derived from the training data. Specifically, we define a logarithmic sequence ranging from a small fraction of the maximal absolute value of the sample covariance matrix, estimated with the training dataset, up to that maximal value.
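As an illustration of the GLASSO penalty sequence, a logarithmically spaced grid anchored at the maximal absolute entry of the training sample covariance might look as follows; the lower anchor `eps` and the grid size `n_lambda` are our own illustrative choices, since the exact values are not reproduced here:

```python
import numpy as np

def glasso_lambda_grid(sample_cov, n_lambda=10, eps=1e-2):
    """Logarithmically spaced penalty sequence from eps*lam_max up to lam_max,
    where lam_max is the largest absolute entry of the training sample covariance."""
    lam_max = np.max(np.abs(sample_cov))
    return np.logspace(np.log10(eps * lam_max), np.log10(lam_max), n_lambda)
```

Anchoring the grid at the largest absolute covariance entry is a common heuristic, since penalties beyond that value typically force a fully sparse (diagonal) precision matrix.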
After calculating all the possible combinations of original and data-driven estimators within the validation subset, we choose an optimal parameter for each covariance estimation method, as outlined in Subsection 3.2, and use all the in-sample data to estimate the covariance matrix for the next investment month, keeping the selected parameter fixed, as we assume no time dependency in the covariance matrix. We use this estimate to find the optimal weights, as in Equations (13) and (14). With these weights, we calculate the out-of-sample portfolio returns for each model. This procedure is repeated until the end of our investment horizon.
Overall, our study covers 17 different portfolios for each scenario with and without short-sales. First, we include the equally-weighted portfolio, hereafter also referred to as the Naive portfolio. This strategy implies an identity covariance matrix and hence does not include any estimation risk (DeMiguel et al., 2009b). In addition to the Naive strategy, which is a standard benchmark when comparing turnover rates, we build a GMV portfolio with the sample covariance matrix estimator, which serves as a benchmark for the out-of-sample risk comparison. For each of the five main covariance estimation methods discussed in Section 2, we construct portfolios with the original and calibrated parameters, resulting in three versions (original, CV1 and CV2) for each estimation methodology. All these portfolios are evaluated with the performance measures presented in the following subsection.
4.3 Performance Measures
To evaluate the out-of-sample performance of each covariance matrix estimation method, we report different performance measures for the risk profile as well as the allocation properties of the corresponding GMV and GMV-NOSHORT portfolios. First, we calculate the MSFE as
$\text{MSFE} = \frac{1}{M}\sum_{t=1}^{M} \sum_{i \leq j} \left(\hat{\sigma}_{ij,t} - \sigma^{RV}_{ij,t}\right)^2,$
where $\hat{\sigma}_{ij,t}$ denotes an entry of the covariance matrix estimator and $\sigma^{RV}_{ij,t}$ the corresponding entry of the realized covariance for month $t$. The MSFE is frequently used to measure the forecasting power of an estimation method. To avoid double counting of forecasting errors, we exclude the lower triangular part of both matrices from the calculation.
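A minimal sketch of the MSFE restricted to the upper triangle (including the diagonal) could look as follows; `msfe` is a hypothetical helper name:

```python
import numpy as np

def msfe(estimates, realized):
    """Mean squared forecast error over the upper triangle (incl. diagonal)
    of monthly covariance estimates vs. realized covariance matrices."""
    n = estimates[0].shape[0]
    iu = np.triu_indices(n)  # upper triangle only, so symmetric entries are not double counted
    errs = [np.sum((s[iu] - r[iu]) ** 2) for s, r in zip(estimates, realized)]
    return float(np.mean(errs))
```

Restricting the sum to the upper triangle implements the no-double-counting convention described above.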
Considering the nature of minimum-variance portfolios as risk-reduction strategies, we are especially interested in the out-of-sample SD as a performance indicator. We calculate the out-of-sample standard deviation (SD) of the 6801 out-of-sample portfolio returns and annualize it by multiplying by $\sqrt{252}$. For a more detailed analysis of the out-of-sample risk of the constructed portfolios and therefore, implicitly, the covariance estimation methods, we perform the two-sided Parzen kernel HAC test for differences in variances, as described by Ledoit and Wolf (2008) and Ledoit and Wolf (2011), and report the corresponding significance levels. Since we use daily returns, there are sufficient observations and a bootstrapping technique is not necessary.
Since the MSFE is closely related to the SFE optimality criterion, as within the CV1 method, we expect the correspondingly optimized covariance estimators to exhibit a lower MSFE than their original versions. On the contrary, a data-driven estimation with the CV2 approach, based on optimizing the portfolio’s variance, is expected to result in a lower out-of-sample SD.
In practice, investors additionally need to address the problem of high transaction costs; hence, they prefer a more stable allocation for an optimal portfolio strategy. Therefore, we analyze the average monthly turnover, defined as
where $\|\cdot\|_1$ denotes the $\ell_1$-norm of a vector, i.e., the sum of its absolute values, and the reference weights are the portfolio weights at the end of the investment month $t$, scaled back to sum to one. The average monthly turnover is therefore calculated as the average sum of absolute values of the monthly rebalancing trades across all assets and over all investment dates and is used as a proxy for the arising transaction costs. The next section reports the detailed out-of-sample performance analysis and empirical results.
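The turnover definition can be sketched as follows, under the assumption that the previous month's weights drift with the realized returns and are rescaled to sum to one before being compared with the new target weights:

```python
import numpy as np

def avg_turnover(target_weights, monthly_returns):
    """Average monthly turnover: L1 distance between the new target weights and
    the previous weights after one month of price drift, rescaled to sum to one."""
    to = []
    for t in range(len(target_weights) - 1):
        drifted = target_weights[t] * (1.0 + monthly_returns[t])
        drifted /= drifted.sum()  # month-end weights, scaled back to one
        to.append(np.abs(target_weights[t + 1] - drifted).sum())
    return float(np.mean(to))
```

With zero returns the drifted weights equal the previous targets, so the turnover reduces to the plain L1 distance between consecutive weight vectors.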
5 Empirical Results
In the empirical part of this paper, we compare how the original and data-driven methods for estimating the covariance matrix of returns affect the out-of-sample performance of GMV portfolios with and without short positions. This section examines the out-of-sample properties of the three estimation methodologies (original, CV1 and CV2). In particular, we focus on the out-of-sample risk and allocation profile of the optimal portfolios.
5.1 Optimal Parameters
Figure 1 displays, as an example, the selected linear shrinkage intensities for the original as well as the CV1- and CV2-based LW1 estimation methods in the case of 50 stocks. Figure 2 shows the evolution of the selected parameters for the remaining covariance estimation methods in the case of the 50 considered stocks; the other three datasets produce similar results. To our surprise, the code for the POET estimator provided by Fan et al. (2013) produces a consistent number of factors throughout the observation period and for all four datasets. The trend of the optimal linear shrinkage intensities shows that the original approach of Ledoit and Wolf (2004b) is less reactive to changes in asset returns than our CV methodologies. The strong fluctuation in the selected shrinkage intensity for CV1 and CV2 results from their data-driven nature, which implies fast adaptation to potentially changing market conditions. Still, such volatility in the parameter estimation could have negative effects on the out-of-sample properties of the corresponding estimators and thus of the estimated portfolios (e.g., in terms of turnover or the overall risk level). Therefore, the sole observation of the chosen parameters cannot lead to a clear conclusion on whether the CV technique enhances the covariance matrix estimator and the respective portfolio's performance. In the following subsections, we examine this behavior for the GMV portfolios with short-sales.
5.2 GMV with short-sales
Table 5.2 presents the central results of our empirical analysis of the GMV portfolio. The columns show the investment universes with 50, 100, 200 and 250 randomly chosen S&P 500 stocks as well as the three performance measures MSFE, SD and average monthly turnover rate (TO). The rows indicate the portfolio strategies based on the covariance estimation methods. While the original estimators are denoted only by the respective name of the estimation method, the suffixes CV1 and CV2 represent the data-driven approaches, as explained in the previous sections.
The compact representation of the results allows us to observe that, in the case of the enhanced covariance estimators, the annualized SD declines as more assets are included in the GMV portfolio. This is easily explained by the well-known power of diversification, the desirable effect of including more stocks in a portfolio. Not surprisingly, however, the estimation error of the sample covariance estimator diminishes the positive diversification effect, as shown by the increase in out-of-sample risk for the scenarios with 200 and 250 stocks. Moreover, all the efficient covariance estimation methods perform better than the sample estimator in terms of out-of-sample risk for all the datasets, with larger deviations for a higher number of stocks.
More importantly, we can clearly detect the positive effect of an appropriate choice of selection criterion for determining the necessary covariance parameters. For all the datasets, the minimization of out-of-sample portfolio variance with the CV2 approach indeed leads to a lower out-of-sample SD for the linear shrinkage methods LW1 and LWCC and the GLASSO estimator. Figure 3 provides a more visually accessible summary of the results for the GMV portfolios in terms of out-of-sample risk. For those estimation methods more sensitive to the tuning parameter, the differences in SD between the original, CV1, and CV2 methods are more pronounced for a higher number of stocks. This finding implies that an adequate choice of optimality criterion, one that better incorporates the investor's objective, is crucial for high-dimensional and ill-conditioned scenarios. Especially noteworthy is the consistently strong risk-reduction property of the GLASSO model. Among the original models, using an in-sample BIC for the optimal selection of the lasso penalty produces the best results for the 100SP and 250SP datasets. Moreover, when the data-dependent parameter for the GLASSO estimation is selected with the CV2 approach, the outperformance is even more pronounced. Hence, with respect to out-of-sample risk, applying graphical models to induce sparsity in the precision matrix seems to be a valid approach, even in comparison to highly sophisticated methods such as the nonlinear shrinkage and approximate factor models. Interestingly, for these latter estimators, the CV2 method does not lead to consistent outperformance in terms of risk. The original version of POET performs better than POET-CV2 in all cases but the 100SP dataset. This result implies that criteria such as the SFE (CV1) and the out-of-sample portfolio variance (CV2) do not produce better results than the originally established function in Equation (9). Moreover, for LWNL, there is almost no relevant difference in annualized out-of-sample SD.
As elaborated on in Subsection 2.3, LWNL is proven to be optimal not only under the Frobenius loss, but also under the tailored minimum-variance loss function. Therefore, the efficiency of LWNL does not strongly depend on the choice of the kernel bandwidth's speed, which explains why a more sophisticated, data-driven specification of this parameter does not improve the performance of this estimation method.
For the CV1 approach, the investigation of the MSFE is essential. The values reported in Table 5.2 indicate the distinct effect of the CV1 approach on the minimization of the out-of-sample MSFE. For all the estimation methods and datasets, except the isolated case of LWCC for 250SP, the MSFE is lowest for the CV1 version of every estimator. Even robust estimators such as LWNL and POET exhibit higher forecasting power, measured by the MSFE, when the corresponding parameters are estimated with the CV1 approach. In contrast, the MSFE measure does not seem to proxy for the better out-of-sample performance of a covariance estimator in terms of portfolio risk. Within the financial literature, including Zakamulin (2015), the MSFE is applied to datasets with low concentration ratios. As evident from our out-of-sample results, within a higher-dimensional setting, a lower MSFE does not strictly coincide with a lower out-of-sample SD of a portfolio for any of the datasets or estimation methods. (Only the LWNL estimator for the 200SP and 250SP datasets yields the lowest risk levels and lowest MSFE for the CV1 approach. This result is, however, merely based on a negligible difference and can thus be ignored.) Under the CV1 method, the SFE is computed as an estimator's squared distance to the monthly realized covariance matrix, derived on the basis of roughly 21 daily returns for $n$ assets. The implied concentration ratios $n/21$, ranging from roughly 2.4 for the 50SP dataset to 11.9 for the 250SP dataset, lead to ill-conditioned realized covariance matrices and a noisy SFE calculation. Within a high-dimensional framework, recent financial studies have therefore focused on an error-free estimation of large realized covariance matrices (see, e.g., Hautsch et al., 2012; Callot et al., 2017; Bollerslev et al., 2018).
To understand the magnitude of improvements in the CV2-based estimation methods as well as the superiority of the GLASSO method, Table 5.2 presents the differences in annualized SDs and the respective pairwise significance levels across all the main covariance estimation methods and their CV2-based counterparts for the high-dimensional case of the 250SP dataset. (Appendix 8 compares further datasets; overall, the results are similar in tendency, but less pronounced due to the lower dimensionality.) Table 5.2 is to be read column-wise; that is, the difference in SD for the LW1 and Sample estimators is listed in the second column of the first row. For completeness, we construct the table symmetrically. Still, we focus our attention on the elements above the diagonal only.
At first glance, we can once more distinguish the weak performance of the sample covariance estimator. All the other estimators lead to lower out-of-sample risk and are significantly different, with p-values of roughly 0.001 or lower. The second-worst estimation method for this asset universe is the originally proposed LW1, followed by LWCC. On the contrary, when the linear shrinkage intensity is optimized with respect to out-of-sample portfolio variance under the CV2 approach, we observe an astonishing improvement for those estimation methods. Both LW1-CV2 and LWCC-CV2 exhibit a strongly significant difference in out-of-sample SD with respect to their original counterparts. Comparing LWCC-CV2 with LW1-CV2, we can conclude that linear shrinkage toward the identity matrix yields more stable results than the same technique applied to a constant correlation model. This can be explained by the fact that the underlying model in LW1 is equivalent to the introduction of a ridge-type penalty in the estimation (Warton, 2008), which has been proven to perform well in situations where a favorable bias-variance trade-off is desired. Although LW1 does not incorporate any knowledge of the covariances of asset returns (Ledoit and Wolf, 2003), it performs better than the assumed constant correlation covariance model in LWCC. This observation leads us to conclude that it is better to assume less than to assume the wrong structure. Another surprising insight emerges from the comparison of LW1 with LWNL. Although especially designed for high-dimensional settings, both the original and the CV2-based nonlinear shrinkage methods lead to higher out-of-sample risk than the data-driven linear shrinkage methods. This effect is observable for the 100SP dataset as well (see, for reference, Table 8).
As the difference is not statistically significant in any of the cases, we can only draw the qualitative conclusion that a methodologically easy-to-understand and simple-to-implement method can perform as well as a complex state-of-the-art estimator when the necessary tuning parameter (here, the shrinkage intensity) is identified in a data-driven way. Moreover, a suitably chosen selection criterion for the optimal tuning parameter is extremely important in the context of the GLASSO estimator. When comparing the variances of the original (BIC-based) GLASSO estimation with those of GLASSO-CV2, we find a statistically significant difference with a p-value of 0.01. Furthermore, the generally strong performance of the GLASSO estimators becomes evident, since GLASSO outperforms all the other estimation models in terms of SD. Although GLASSO-CV2 results in lower out-of-sample portfolio risk than the highly efficient LWNL, the differences are only significant for the 100SP and 200SP datasets (see, for reference, the significance levels in Table 8 and Table 8). Since, as demonstrated by Ledoit and Wolf (2017a), the nonlinear shrinkage is considered a superior covariance estimation method even for more moderate concentration ratios, these results are of great importance for the application of the GLASSO within portfolio optimization.
In practice, investors are not only interested in the risk profile of a GMV portfolio. Because of regulatory restrictions and existing transaction costs, the practical implementation of a portfolio strategy depends on the distribution of its weights. Hence, Table 5.2 additionally reports the average monthly turnover rate as a proxy for the transaction costs induced by monthly rebalancing. The Naive portfolio, being long-only and equally weighted by construction, naturally has the lowest turnover (approximately 0.06 on average across all the datasets). As expected, the GMV portfolios estimated with the sample covariance matrix are characterized by extreme exposures for all the datasets. With higher dimensionality in the returns data, the ill-conditioned sample estimator induces even higher dispersion in the portfolio weights. On the other hand, considering all the estimators, an estimation with the GLASSO has the most pronounced positive effects on the allocation characteristics. In particular, the GLASSO-CV1 estimation methodology results in GMV portfolios with the lowest turnover for the 200SP and 250SP datasets. For the least high-dimensional dataset, 50SP, GLASSO-CV2 leads to the lowest turnover. An interesting case is the 100SP dataset, where clear outperformance in terms of turnover rate occurs for the estimator LW1-CV2, followed closely by GLASSO-CV1. It seems that when the concentration ratio is tolerable, as in the case of 100SP, the linear shrinkage of the sample covariance toward an identity matrix produces satisfactory results. However, when the sample covariance matrix becomes more ill-conditioned, as for the 200SP and 250SP datasets, even a sophisticated CV technique for LW1 cannot outperform the GLASSO. In addition, LWNL is strongly outperformed by the data-driven estimators LW1-CV2, LWCC-CV2, GLASSO-CV1, and GLASSO-CV2 in terms of allocation.
For instance, while GLASSO-CV1 achieves the lowest turnover rate of 0.36 for 250SP, estimation with LWNL leads to twice this value, with an average monthly turnover of approximately 0.73. More importantly, when a covariance estimator is susceptible to improvement by a data-driven estimation of the necessary parameters, as LW1, LWCC, and GLASSO are, CV1 or CV2 have a strong positive impact on the stability of the weights. Generally, a more efficient and stable estimator results in less variation in the weights and a lower turnover. For instance, the original GLASSO covariance estimator exhibits a considerably larger mean sum of absolute differences in its entries over time than GLASSO-CV2. This finding implies a more stable covariance estimation that leads to lower turnover out-of-sample. While LW1 shrinks the sample covariance matrix toward the identity matrix, a GLASSO estimator shrinks the precision matrix toward the identity matrix. Since the Naive portfolio corresponds to a GMV portfolio estimated with an identity covariance matrix, one may suggest that both estimation methods result in an implicit shrinkage of the sample GMV portfolio weights toward an equally-weighted portfolio, as in Tu and Zhou (2011). It can be concluded that an optimization for risk through a shrinkage toward the identity matrix results in the best trade-off between out-of-sample risk and turnover.
5.3 GMV without short-sales
Focusing on more practically relevant portfolio strategies, we construct a second set of GMV portfolios that do not allow negative weights (GMV-NOSHORT). The exclusion of short-sales is a common regulatory constraint that strongly influences the optimal performance with respect to the out-of-sample risk and the allocation of weights. To investigate these aspects, Table 5.3 reports the main out-of-sample measures for GMV-NOSHORT. Since the examined short-sale constraint does not play any role in the CV1-based estimation of the covariance matrix, we do not report the MSFE values. The table is structured similarly to Table 5.2, with the columns representing the investment universes (50SP, 100SP, 200SP and 250SP) and performance measures, while the rows indicate the portfolio strategies based on the considered covariance estimation methods.
First, due to the constraint imposed in Equation (14), it is straightforward that portfolio performance depends not only on the efficiency of the covariance estimator, but also on the selection of nonnegative asset positions. As expected, the sample covariance estimator leads to better results in comparison to Table 5.2, verifying the impact of constraints on the minimization of estimation errors in portfolio weights, as shown by Jagannathan and Ma (2003). In contrast to the results from the previous setting, the differences in out-of-sample performance in terms of portfolio risk are generally less distinctive among the estimation methods. While GLASSO-CV2 continues to achieve the lowest out-of-sample annualized SD for all the datasets, the data-driven approach enhances the performance of the original estimation methods only inconsistently. For instance, both LW1 and LWCC perform better than LW1-CV2 and LWCC-CV2 for the 100SP and 200SP datasets. Although the nonlinear shrinkage and POET with CV2-estimated parameters seem to produce lower out-of-sample risk, the effect is negligible. (Figure 4 plots the relative differences in SD among all the original and cross-validated estimators.)
Moreover, Table 5.3 presents the differences in annualized SDs and the respective pairwise significance levels across all the main covariance estimation methods and their CV2-based counterparts for the high-dimensional case of the 250SP dataset. (Appendix 9 compares further datasets; overall, the results are similar in tendency, but less pronounced due to the lower dimensionality.) The table is constructed similarly to Table 5.2.