Some Statistical Problems with High Dimensional Financial data

by   Arnab Chakrabarti, et al.

For high dimensional data, some of the standard statistical techniques do not work well. So modification or further development of statistical methods are necessary. In this paper, we explore these modifications. We start with the important problem of estimating high dimensional covariance matrix. Then we explore some of the important statistical techniques such as high dimensional regression, principal component analysis, multiple testing problems and classification. We describe some of the fast algorithms that can be readily applied in practice.



There are no comments yet.


page 1

page 2

page 3

page 4


Slicing: Nonsingular Estimation of High Dimensional Covariance Matrices Using Multiway Kronecker Delta Covariance Structures

Nonsingular estimation of high dimensional covariance matrices is an imp...

Statistical inference in factor analysis for diffusion processes from discrete observations

We consider statistical inference in factor analysis for ergodic and non...

Estimating Divergences in High Dimensions

The problem of estimating the divergence between 2 high dimensional dist...

High-Dimensional Classification for Brain Decoding

Brain decoding involves the determination of a subject's cognitive state...

High Dimensional Statistical Analysis and its Application to ALMA Map of NGC 253

In astronomy, if we denote the dimension of data as d and the number of ...

Information-Theoretic Limits for the Matrix Tensor Product

This paper studies a high-dimensional inference problem involving the ma...

Structure retrieval from 4D-STEM: statistical analysis of potential pitfalls in high-dimensional data

Four-dimensional scanning transmission electron microscopy (4D-STEM) is ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

A high degree of interdependence among modern financial systems, such as firms or banks, is captured through modeling by a network , where each node in represents a financial institution and an edge in stands for dependence between two such institutions. The edges are determined by calculating the correlation coefficient between asset prices of pairs of financial institutions. If the sample pairwise correlation coefficient is greater than some predefined threshold then an edge is formed between corresponding nodes. This network model can be useful to answer important questions on the financial market, such as determining clusters or sectors in the market, uncovering possibility of portfolio diversification or investigating the degree distribution boginski2005 , vandewalle2001 . See figure 1 for illustration of one such network. Using correlation coefficients to construct the economic or financial network has a serious drawback. If one is interested in direct dependence of two financial institutions the high observed correlation may be due to the effect of other institutions. Therefore a more appropriate measure to investigate direct dependence is partial correlation. Correlation and partial correlation coefficients are related to covariance and inverse covariance matrix respectively. Therefore in order to have meaningful inference on the complex system of financial network, estimation of covariance matrix accurately is of utmost importance. In this paper we investigate how inference based on covariance matrix for high dimensional data can be problematic and how to solve the problem.

Figure 1: Network topology of European economies in post-euro period as described in papadimitriou2014european .

The rest of the paper is organized as follows. In section 2 we discuss the distribution of eigenvalues of covariance matrix. In section 3 the problem and possible solution of covariance matrix estimation is discussed. Section 4 deals with estimation of precision matrix. Section 5 and 6 deals with multiple testing procedure and high dimensional regression problem respectively. We discuss high dimensional principal component analysis and several classification algorithms in section 7 and 8.

2 Distribution of Eigenvalues

2.1 Eigenvalues of Covariance Matrix

In multivariate statistical theory, the sample covariance matrix is the most common and undisputed estimator because it is unbiased and has good large sample properties with growing number of observations when number of variables is fixed. But if the ratio of the number of variables () and the number of observations () is large, then sample covariance does not behave as expected. It can be showen that if grows at the same rate as i.e. the sample covariance matrix becomes inconsistent and therefore can not be relied upon stein1956 . In Figure 2 eigenvalues of population covariance matrix and sample covariance matrix are plotted for different values of and

where the population covariance matrix is the identity matrix. It is evident that the true and sample spectra differ a lot as the ratio

grows. So for high dimensional data () there is a need to find an improved estimator.

Even though the sample eigenvalues are not consistent anymore, the limiting distribution of eigenvalues of the sample covariance matrix and the connection it has with the limiting eigenvalue distribution of population covariance matrix are of importance. Determining the limiting spectral distribution can be very useful to test the underlying assumption of the model. In this section we will very briefly discuss some results of random matrix theory that answers this kind of questions. Throughout we will denote the ratio

as .

2.2 Marchenko Pastur Law and Tracy Widom Law

Suppose that

are iid Gaussian variables with variance

. If then the empirical spectral distribution (distribution function of eigenvalues) of sample covariance matrix converges almost surely to the distribution F with the density

if , where and . If It will take additional positive mass at 0.

is called the scale parameter. The distribution is known as Marchenko-Pastur distribution.

If then empirical spectral distribution of converges almost surely to the semicircle law with density:

Although the sample eigenvalues are not consistent estimator, the limiting spectral distribution is related to the population covariance matrix in a particular way.

Also if and such that , then where is the largest eigenvalue of sample covariance and and is Tracy-Widom Law distribution.

Figure 2: Plot of true (dotted line) and sample (solid line) eigenvalues.

3 Covariance Matrix Estimator

3.1 Stein’s Approach

We see from figure 2 that the sample eigenvalues can differ a lot from the population eigenvalues. Thus shrinking the eigenvalues to a central value is a reasonable approach to take. Such an estimator was proposed by Stein stein1956inadmissibility and takes the following form:

where and is also a diagonal matrix. If then is the usual estimator

. In this approach the eigen vectors are kept as it is, but the eigenvalues are shrink towards a central value. As the eigen vectors are not altered or regularized this estimator is called rotation equivariant covariance estimator. To come up with a choice of

, we can use entropy loss function

or Frobeneous loss function

Under entropy risk (), we have where

The only problem with this estimator is that some of the essential properties of eigenvalues, like monotonicity and nonnegativity, are not guaranteed. Some modifications can be adopted in order to force the estimator to satisfy those conditions (see Lin and naul2016 ). An algorithm was proposed to avoid such undesirable situations by pooling adjacent estimators together in Stein1975 . In this algorithm first the negative ’s are pooled together with previous values until it becomes positive and then to keep the monotonicity intact, the estimates () are pooled together pairwise.

3.2 Ledoit-Wolf type Estimator

As an alternative to the above mentioned method, the empirical Bayes estimator can also be used to shrink the eigenvalues of sample covariance matrix. haff1980 proposed to estimate by

where . This estimator is a linear combination of and which is reasonable because although is unbiased, it is highly unstable for high dimensional data and has very little variability with possibly high bias. Therefore a more general form of estimator would be

where is a positive definite matrix and (shrinkage intensity parameter), can be determined by minimising the loss function. For example ledoit2004 used

to obtain a consistent estimator with . A trade off between bias and variance is achieved through the value of shrinkage parameter. In figure 3, bias, variance and MSE are plotted against shrinkage parameter. The optimum value of shrinkage intensity is that for which MSE is minimum.

It can be shown that if there exists and independent of such that and where is the -th element of any row of the matrix of principal components of and if

where denotes the set of all quadruples made of four distinct integers between and , then the following estimator (a convex combination of and ) is consistent for , see ledoit2004 :

where is the th row of and

The first condition clearly deals with the interplay between sample size, dimension and moments whereas the second one deals with the dependence structure. For

the last condition for dependence structure can be trivially verified as a consequence of the assumption on moments. This estimate is also computationally easy to work with. In fact as

is still unbiased estimator one possible way to reduce the variance is to use bootstrap-dependent techniques like bagging. But that is far more computationally demanding compared to this method.

With the additional assumption that is bounded as , ledoit2004 showed that

This result implies that expected loss of sample covariance matrix, although bounded, does not usually vanish. Therefore consistency of usual sample covariance matrix is achieved only when or

In the latter case most of the random variables are asymptotically degenerate. The difference between these two cases is that in the first, the number of variables is very less compared to

and in the latter degenerate variables are augmented with the earlier lot. Both of these essentially indicate sparsity.

Figure 3: Plot for Error vs shirnkage intensity ledoit2004

A more general target matrix can be used instead

. For example, under Gaussian distribution, if

, (intensity parameter) and , then optimal shrinkage intensity is

which implies that the shrinkage estimator is a function of the sufficient statistics and therefore can be further improved upon by using Rao-Blackwell theorem chen2009shrinkage . The resulting estimator becomes where

If we take , that is, the diagonal elements of then the optimal intensity that minimises can be estimated as

where , as shown infisher2011 . schafer2005 chose the shrinkage parameter to be

Along with conventional target matrix () they used five other target matrices summarised in the following Table.

Table 1

Target A: "diagonal, unit variance”
0 estimated parameters
Target B: “diagonal, common variance”
1 estimated parameter:
Target C: "common (co)variance”
2 estimated parameters:
Target D: "diagonal, unequal variance ”
estimated parameters:
Target E: “perfect positive correlation”
estimated parameters:
Target F: “constant correlation”
estimated parameters,

3.3 Element-wise Regularization

Under the assumption of sparsity, some element-wise regularization methods can be used. In contrast to ledoit2004 type of estimator, where only the eigenvalues were shrunk, here both eigenvalues and vectors are regularised. We will first discuss popular methods like Banding and Tapering which assume some order between the variables and as a result, the estimator is not invariant under permutation of variables. So this is useful for time-dependent data.


The idea behind banding is that the variables are ordered in such a way that elements of the covariance matrix, further away from the main diagonal, are negligible. An banded covariance matrix is defined as , where is the sample covariance matrix and is band length, determined through cross validation. One can question- which kind of population covariance matrix can be well approximated by Banded sample covariance matrix. Intuitively, such a matrix should have decaying entries as one moves away from the main diagonal. bickel2008covariance showed that the population covariance can be well approximated uniformly over the following class of matrices: , where is a constant and captures the rate of decay of the entries as goes away from . Although is large, if is very small compared to , that is, , then such a can be well-approximated by accurately chosen band length and the error in approximation depends on and . Same result holds also for the precision matrix. Banded covariance estimation procedure does not guarantee positive definiteness.


Tapering the covariance matrix is another possible way and it can preserve positive definiteness. is a tapered estimator where is the sample covariance matrix, is the tapering matrix and ‘’ denotes the Hadamard product (element-wise product). Properties of Hadamard product suggest that is positive definite if is so. The banded covariance matrix is a special case of this with , which is not positive definite.


The most widely applicable element-wise regularization method is defined through Thresholding Operator. The regularized estimator is , where is the sample covariance matrix and is the threshold parameter. can be determined through cross validation. Although it is much simpler than other methods, like penalized lasso, it has one problem. The estimator preserves symmetry but not positive definiteness. With Gaussian assumption, consistency of this estimator can be shown uniformly over a class with , and for sufficiently large bickel2008covariance . For the condition reduces to The rate of convergence is dependent on the dimension (), sample size () and , the determining factor of the number of nonzero elements in . Similar result can be shown for precision matrix. For non-Gaussian case, we need some moment conditions in order to achieve consistency result bickel2008covariance .

The result goes through for a larger class of thresholding operators. One such is called generalized thresholding operators with the following three properties:

  1. (shrinkage)

  2. for (thresholding)

  3. (constraint on amount of shrinkage)

Apart from being consistent under suitable conditions discussed earlier, if variance of each variable is bounded then this operator is also “sparsistent” i.e. able to identify true zero entries of population covariance matrix with probability tending to one.

For both thresholding and generalized thresholding operators is fixed for all entries of the matrix. An adaptive threshold estimator can be developed cai2011adaptive to have different parameters for different entries where

3.4 Approximate Factor Model

Sometimes the assumption of sparsity is too much to demand. For such situations estimation methods of a larger class of covariance matrices is required. A simple extension is possible to the class of matrices that can be decomposed into sum of low rank and sparse matrix: , where is low rank and is sparse matrix. Due to similarity with Factor model where is diagonal, this model is called approximate factor model. To estimate , one can decompose similarly as, , where the first part involves the first principal components and the second part is residual. As is sparse we can now use thresholding/adaptive thresholding operators to estimate it chamberlain1982arbitrage .

3.5 Positive Definiteness

Sometimes positive definiteness of the estimator is required in order to be used in classification or covariance regularised regression. As we have discussed thresholding estimators do not guarantee positive definite estimator. In the following section we will describe a couple of methods to achieve that. One possible way is to replace the negative eigenvalues in eigen decomposition of by zero. But this manipulation destroys the sparse nature of the covariance matrix. An alternative way is necessary that will ensure sparsity and at the same time will produce positive definite output. Let us denote sample correlation matrix by matrix if it is symmetric and positive definite ( for positive semi definite) and -th column of symmetric matrix with it’s -th element removed. matrix formed after removing -th column and -th row of and is the diagonal matrix with the same diagonal elements as . Define . Then a desirable positive definite estimator is

where and estimated correlation matrix is

with and respectively being tuning parameter and a fixed small value. The log-determinant term in the optimization function ensures positive definiteness. Regularizing the correlation matrix leads to faster convergence rate bound and scale invariance of the estimator. Under suitable and reasonable conditions this estimator is consistentrothman2012positive . For fast computation the following algorithm has been developed.

  • Input - a symmetric matrix with positive diagonals, , and initialise with . Follow steps for and repeat till convergence.

    Step1: and solve the lasso penalized regression:


    Step3: Compute

An alternative estimator has been proposed based on alternating direction method xue2012positive . If we want a positive semi definite matrix then the usual objective function along with penalty term should be optimized with an additional constraint for positive semi-definiteness:

For positive definite matrix we can replace the constraint with for very small . Introducing a new variable , we can write the same as

Now it is enough to minimize its augmented Lagrangian function for some given penalty parameter :

, where is the Lagrange multiplier. This can be achieved through the following algorithm (S being Soft Thresholding Operator):

  • Input , , .

  • Iterative alternating direction augmented Lagrangian step: for the -th iteration:

    1. Solve

    2. Solve

    3. Update

  • Repeat the above cycle till convergence.

4 Precision Matrix Estimator

In some situations instead of covariance matrix, the precision matrix () needs to be calculated. One example of such situation is financial network model using partial correlation coefficients because the sample estimate of partial correlation between two nodes is , where . Of course can be calculated from but that inversion involves operations. For high dimensional data it is computationally expensive. On the other hand, if it is reasonable to assume sparsity of the precision matrix, that is, most of the off-diagonal elements of the precision matrix are zeros, then we can directly estimate the precision matrix. Although the correlation for most of the pairs of financial institutions would not be zero, the partial correlations can be. So this assumption of sparsity would not be a departure from reality in many practical situations. In such cases starting from a fully connected graph we can proceed in a backward stepwise fashion, by removing the least significant edges. Instead of such sequential testing procedure, some multiple testing strategy, for example, controlling for false discovery rate, can also be adopted. We discuss this in detail in section 5. After determining which off-diagonal entries of precision matrix are zeros (by either sequential or multiple testing procedure), maximum likelihood estimates of nonzero entries can be found by solving a convex optimization problem: maximizing the concentrated likelihood subject to the constraint that a subset of entries of precision matrix equal to zero dempster1972covariance pourahmadi2013 .

Alternatively, under Gaussian assumption a penalized likelihood approach can be employed. If , the likelihood function is

The penalized likelihood , with penalty parameter can be used to obtain a sparse solution yuan2007model . The fastest algorithm to obtain the solution is called graphical lasso friedman2008sparse , described as follows:

  1. Denote . Start with a matrix that can be used as a proxy of . The choice recommended in Friedman et. al is .

  2. Repeat till convergence for :

    1. Partition matrix in two parts, th row and column, and the matrix composed by the remaining elements. After eliminating the th element, the remaining part of column (dimensional) is denoted as and similarly the row is denoted as . Similarly, define , , for matrix. (For , the partition will look like: and ).

    2. Solve the estimating equations

      using cyclical coordinate-descent algorithm to obtain .

    3. Update .

  3. In the final cycle, for each , solve for , with . Stacking up will give the th column of .

Figure 4 shows undirected graph from Cell-signalling data obtained through graphical lasso with different penalty parameters friedman2008sparse .

Figure 4: Resulting networks given by graphical lasso algorithm for different values of penalty parameter lambda friedman2008sparse

5 Multiple Hypothesis Testing Problem and False Discovery Rate

We can encounter large scale hypothesis testing problems in many practical situations. For example, in Section 4, we discussed to remove edge from a fully connected graph, we need to perform testing problems- vs . A detailed review can be found in drton2007multiple .

Suppose we have independent hypothesis

to test. In such situations it is important to control not only the type I error of individual hypothesis tests but also the overall (or combined) error rate. It is due to the fact that the probability of atleast one true hypothesis would be rejected becomes large:

, where being the level of significance, generally taken as 0.05 or 0.01. The conventional way to resolve this problem is by controlling the familywise error rate (FWER)- . One example of such is Bonferroni correction. The problem with this procedure is that it is overly conservative and as a consequence the power of the test will be small. A much more liberal and efficient method for high dimension has been proposed by Benjamini and Hochberg benjamini1995controlling . In figure 5, out of hypothesis tests in

cases null hypothesis is true and in

cases null hypothesis is false. According the decision rule, in out of cases null hypothesis is rejected. Clearly is observed but or are not. The following algorithm controls the expected false discovery proportion:

  1. The test statistic of

    yield values .

  2. Order the values .

  3. Rank the hypothesis according to the values

  4. Find largest , say , such that

  5. Reject the top tests as significant.

It can be shown that if the values are independent of each other then the rule based on the algorithm controls the expected false discovery proportion by , more precisely, .

Figure 5: False discovery rate efron2012large

6 High Dimensional Regression

In financial econometrics one can often encounter multiple regression analysis problem. A large number of predictors implies large number of parameters to be estimated which reduces the degrees of freedom. As a result prediction error will be increased. So in high dimensional regression regularization is an essential tool.

In this section we will breifly discuss the multivariate regression problem with responses and predictors, which requires estimation of parameters in the regression coefficient matrix. Suppose the matrix of regressors, responses and coefficient matrix are , and respectively. As we know (under multivariate mormality, this is also the maximum likelihood estimator) with parameters. Estimated covariance matrix (with parameters) of is . When and are large then both these estimators exhibits poor statistical properties. So here again, shrinkage and regularization of would help to obtain a better estimator. It can be achieved through Reduced Rank Regression which attempts to solve a constrained least square problem:

The solution of this constrained optimization is where with

being normalized eigenvector corresponding to the

th largest eigenvalue of the matrix . Choice of is important because this is the parameter that balances between bias and variance of prediction.

Alternatively, a regularized estimator can be obtained by introducing a nonnegative penalty function in the optimization problem:

when is a scalar function and is nonnegative quantity. Most common choices of are norms. leads to lasso estimate where as

amount to ridge regression.

for and for are called elastic net and bridge regression respectively. Grouped lasso with imposes groupwise penalty on the rows of , which may lead to exclusion of some predictors for all the responses.

All the above mentioned methods regularize the matrix only while leaving aside. Although a little more complicated, it is sometimes appropriate to regularize both and . One way to do that is to adding separate lasso penalty on and in the negative log likelihood:

where and are as usual tuning parameter, and . This optimization problem is not convex but biconvex. Note that solving the above mentioned optimization problem for with fixed at reduced to the optimization problem:

where . If we fix at a nonnegative definite it will lead to

It can be shown that the original problem can be solved by using the following algorithm prescribed by rothman2010sparse -

  • Fix and , initialize and .

    • Step 1: Compute by solving

      by coordinate descent algorithm.

    • Step 2: Compute by solving

      by Graphical lasso algorithm.

    • Step 3: If where is the Ridge estimator of .

7 Principal Components

In many high dimensional studies estimates of principal component loadings are inconsistent and the eigenvectors consists of too many entries to interpret. In such situation regularising the eigenvectors along with eigenvalues would be preferable. So it is desirable to have loading vectors with only a few nonzero entries. The simplest way to achieve that is through a procedure called SCoTLASS jolliffe2003modified . In this approach a lasso penalty is to be imposed on the PC loadings. So the first PC loading can be obtained by solving the optimization problem:

The next PC can be obtained by imposing extra constraint of orthogonality. Note that this is not a minimization problem and so can be difficult to solve. However the above problem is equivalent to the following one:

The equivalence between the two can be easily verified by using Cauchy Schwartz inequality to and noting that equality will be achieved for . The optimization problem can be solved by the following algorithm witten2009penalized

  • Initialize to have norm 1.

  • Iterate until convergence

    • , where is a soft thresholding operator, and if the computed satisfies otherwise with

8 Classification

Suppose there are independent observations of training data (), , coming from an unknown distribution. Here denotes the class of the th observation and therefore can take values if there are classes. , generally a vector of dimension , is the feature vector for the ith observation. Given a new observation X, the task is to determine the class, the observation belongs to. In other words we have to determine a function from the feature space to

. One very widely used class of classifiers is distance based classifiers. It assigns an observation to a class

, if the observation is closer to class on average compared to other classes i.e. , where is the center of the feature space of class . As an example if there are two classes and the feature distribution for the first class is and for the second class is

then under the assumption that both the classes have equal prior probabilities, the most widely used distance measure is called Mahalanobis’s distance

. So class 1 is to be chosen when

This technique is called Fisher Discriminant Analysis. For high dimensional data Fisher Discriminant Analysis does not perform well because it involves accurate estimation of precision matrix bickel2004some . In the following section we will discuss some high dimensional classification methods.

8.1 Naive Bayes Classifier

Suppose we classify the observation, with feature x, by some predetermined function i.e. . Now to judge the accuracy we need to consider some Loss function. A most intuitive loss function is the zero-one loss: