Distributionally Robust Inverse Covariance Estimation: The Wasserstein Shrinkage Estimator

05/18/2018
by   Viet Anh Nguyen, et al.
EPFL
Delft University of Technology

We introduce a distributionally robust maximum likelihood estimation model with a Wasserstein ambiguity set to infer the inverse covariance matrix of a p-dimensional Gaussian random vector from n independent samples. The proposed model minimizes the worst case (maximum) of Stein's loss across all normal reference distributions within a prescribed Wasserstein distance from the normal distribution characterized by the sample mean and the sample covariance matrix. We prove that this estimation problem is equivalent to a semidefinite program that is tractable in theory but beyond the reach of general purpose solvers for practically relevant problem dimensions p. In the absence of any prior structural information, the estimation problem has an analytical solution that is naturally interpreted as a nonlinear shrinkage estimator. Besides being invertible and well-conditioned even for p>n, the new shrinkage estimator is rotation-equivariant and preserves the order of the eigenvalues of the sample covariance matrix. These desirable properties are not imposed ad hoc but emerge naturally from the underlying distributionally robust optimization model. Finally, we develop a sequential quadratic approximation algorithm for efficiently solving the general estimation problem subject to conditional independence constraints typically encountered in Gaussian graphical models.


1. Introduction

The covariance matrix $\Sigma$ of a random vector $\xi$ governed by a distribution $\mathbb{P}$ collects basic information about the spreads of all individual components and the linear dependencies among all pairs of components of $\xi$. The inverse of the covariance matrix is called the precision matrix. This terminology captures the intuition that a large spread reflects a low precision and vice versa. While the covariance matrix appears in the formulations of many problems in engineering, science and economics, it is often the precision matrix that emerges in their solutions. For example, the optimal classification rule in linear discriminant analysis [17], the optimal investment portfolio in Markowitz' celebrated mean-variance model [35] or the optimal array vector of the beamforming problem in signal processing [15] all depend on the precision matrix. Moreover, the optimal fingerprint method used to detect a multivariate climate change signal blurred by weather noise requires knowledge of the climate vector's precision matrix [41].

If the distribution $\mathbb{P}$ of $\xi$ is known, then the covariance matrix $\Sigma$ and the precision matrix $\Sigma^{-1}$ can at least in principle be calculated in closed form. In practice, however, $\mathbb{P}$ is never known and only indirectly observable through $n$ independent training samples $\widehat\xi_1, \dots, \widehat\xi_n$ from $\mathbb{P}$. In this setting, $\Sigma$ and $\Sigma^{-1}$ need to be estimated from the training data. Arguably the simplest estimator for $\Sigma$ is the sample covariance matrix $\widehat\Sigma = \frac{1}{n}\sum_{i=1}^n (\widehat\xi_i - \widehat\mu)(\widehat\xi_i - \widehat\mu)^\top$, where $\widehat\mu = \frac{1}{n}\sum_{i=1}^n \widehat\xi_i$ stands for the sample mean. Note that $\widehat\mu$ and $\widehat\Sigma$ simply represent the actual mean and covariance matrix of the uniform distribution on the $n$ training samples. For later convenience, $\widehat\Sigma$ is defined here without Bessel's correction and thus constitutes a biased estimator (an elementary calculation shows that $\mathbb{E}[\widehat\Sigma] = \frac{n-1}{n}\Sigma$). Moreover, as a sum of $n$ rank-one matrices, $\widehat\Sigma$ is rank deficient in the big data regime ($p \ge n$). In this case, $\widehat\Sigma$ cannot be inverted to obtain a precision matrix estimator, which is often the actual quantity of interest.
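To make the rank-deficiency issue concrete, the following self-contained numpy sketch (an illustration only; the dimensions $p$ and $n$ are arbitrary) computes the biased sample covariance matrix from $n < p$ Gaussian samples and confirms that it is singular.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 20                                # big data regime: p >= n

# draw n samples from a non-degenerate Gaussian distribution
xi = rng.standard_normal((n, p))

# biased sample covariance matrix (no Bessel correction), as defined above
mu_hat = xi.mean(axis=0)
centered = xi - mu_hat
sigma_hat = centered.T @ centered / n

print("rank:", np.linalg.matrix_rank(sigma_hat))                     # at most n - 1 < p
print("smallest eigenvalue:", np.linalg.eigvalsh(sigma_hat).min())   # numerically zero
```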

If $\xi$ follows a normal distribution with unknown mean $\mu$ and precision matrix $X$, which we will assume throughout the rest of the paper, then the log-likelihood function of the training data can be expressed as

$$\ell(\mu, X) \;=\; \frac{n}{2}\Big(\log\det X - \big\langle \widehat\Sigma + (\widehat\mu - \mu)(\widehat\mu - \mu)^\top,\, X \big\rangle - p\log(2\pi)\Big) \qquad (1)$$

Note that $\ell(\mu, X)$ is strictly concave in $\mu$ and $X$ [7, Chapter 7] and depends on the training samples only through the sample mean and the sample covariance matrix. It is clear from this expression that $\ell(\mu, X)$ is maximized by $\mu = \widehat\mu$ for any fixed $X$. The maximum likelihood estimator for the precision matrix is thus obtained by maximizing $\ell(\widehat\mu, X)$ over all $X \succ 0$, which is tantamount to solving the convex program

$$\min_{X \succ 0} \;\; \big\langle \widehat\Sigma, X \big\rangle - \log\det X. \qquad (2)$$

If $\widehat\Sigma$ is rank deficient, which necessarily happens for $p \ge n$, then problem (2) is unbounded. Indeed, expressing the sample covariance matrix as $\widehat\Sigma = \widehat V \widehat\Lambda \widehat V^\top$ with $\widehat V$ orthogonal and $\widehat\Lambda$ diagonal, we may set $X_t = \widehat V D_t \widehat V^\top$ for any $t > 0$, where $D_t$ is the diagonal matrix with $(D_t)_{ii} = 1$ if $\widehat\Lambda_{ii} > 0$ and $(D_t)_{ii} = t$ if $\widehat\Lambda_{ii} = 0$. By construction, the objective value of $X_t$ in (2) tends to $-\infty$ as $t$ grows. If $\widehat\Sigma$ is invertible, on the other hand, then the first-order optimality conditions can be solved analytically, showing that the minimum of problem (2) is attained at $X^\star = \widehat\Sigma^{-1}$. This implies that maximum likelihood estimation under normality simply recovers the sample covariance matrix but fails to yield a precision matrix estimator for $p \ge n$.
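The unboundedness argument can also be checked numerically. The sketch below is a minimal illustration that assumes the objective $\langle \widehat\Sigma, X\rangle - \log\det X$ of problem (2): it builds the matrices $X_t$ described above and shows the objective value diverging as $t$ grows.

```python
import numpy as np

def mle_objective(sigma_hat, X):
    # <Sigma_hat, X> - log det X, the objective of problem (2)
    _, logdet = np.linalg.slogdet(X)
    return np.trace(sigma_hat @ X) - logdet

rng = np.random.default_rng(1)
p, n = 10, 5
xi = rng.standard_normal((n, p))
sigma_hat = np.cov(xi, rowvar=False, bias=True)      # rank deficient since p >= n

eigval, V = np.linalg.eigh(sigma_hat)                # sigma_hat = V diag(eigval) V^T
for t in [1e1, 1e3, 1e6]:
    d = np.where(eigval > 1e-10, 1.0, t)             # 1 on positive eigenvalues, t on zero ones
    X_t = V @ np.diag(d) @ V.T
    print(t, mle_objective(sigma_hat, X_t))          # tends to -infinity as t grows
```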

Adding an $\ell_1$-regularization term to its objective function guarantees that problem (2) has a unique minimizer, which constitutes a proper (invertible) precision matrix estimator [26]. Moreover, as the $\ell_1$-norm represents the convex envelope of the cardinality function on the unit hypercube, the $\ell_1$-norm regularized maximum likelihood estimation problem promotes sparse precision matrices that encode interpretable Gaussian graphical models [1, 18]. Indeed, under the given normality assumption one can show that $X_{ij} = 0$ if and only if the random variables $\xi_i$ and $\xi_j$ are conditionally independent given all other components of $\xi$ [29]. The sparsity pattern of the precision matrix thus captures the conditional independence structure of $\mathbb{P}$.

In theory, the $\ell_1$-norm regularized maximum likelihood estimation problem can be solved in polynomial time via modern interior point algorithms. In practice, however, scalability to high dimensions remains challenging due to the problem's semidefinite nature, and larger problem instances must be addressed with special-purpose methods such as the Newton-type QUIC algorithm [26].
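For experimentation, one readily available implementation of the $\ell_1$-regularized Gaussian maximum likelihood problem is scikit-learn's GraphicalLasso (a coordinate-descent solver rather than the QUIC algorithm of [26]); a minimal usage sketch with synthetic data and an arbitrarily chosen penalty weight follows.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)
n, p = 200, 10
xi = rng.standard_normal((n, p))                 # synthetic data with true precision matrix I

# alpha is the l1 penalty weight; larger values yield sparser precision matrices
model = GraphicalLasso(alpha=0.2).fit(xi)

precision = model.precision_                     # sparse precision matrix estimate
n_zeros = np.sum(np.abs(precision) < 1e-8)
print(f"{n_zeros} of {p * p} entries shrunk to (numerically) zero")
```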

Instead of penalizing the $\ell_1$-norm of the precision matrix, one may alternatively penalize the $\ell_1$-norm of its inverse with the goal of promoting sparsity in the covariance matrix and thus controlling the marginal independence structure of $\xi$ [5]. Despite its attractive statistical properties, this alternative model leads to a hard non-convex and non-smooth optimization problem, which can only be solved approximately.

By the Fisher-Neyman factorization theorem, $\widehat\Sigma$ is a sufficient statistic for the true covariance matrix $\Sigma$ of a normally distributed random vector, that is, $\widehat\Sigma$ contains the same information about $\Sigma$ as the entire training dataset. Without any loss of generality, we may thus focus on estimators that depend on the data only through $\widehat\Sigma$. If neither the covariance matrix $\Sigma$ nor the precision matrix $\Sigma^{-1}$ are known to be sparse and if there is no prior information about the orientation of their eigenvectors, it is reasonable to restrict attention to rotation equivariant estimators. A precision matrix estimator $\widehat X(\widehat\xi_1, \dots, \widehat\xi_n)$ is called rotation equivariant if $\widehat X(R\widehat\xi_1, \dots, R\widehat\xi_n) = R\, \widehat X(\widehat\xi_1, \dots, \widehat\xi_n)\, R^\top$ for any rotation matrix $R$. This definition requires that the estimator for the rotated data coincides with the rotated estimator for the original data. One can show that rotation equivariant estimators have the same eigenvectors as the sample covariance matrix (see, e.g., [40, Lemma 5.3] for a simple proof) and are thus uniquely determined by their eigenvalues. Hence, imposing rotation equivariance reduces the degrees of freedom from $p(p+1)/2$ to $p$. Using an entropy loss function introduced in [28], Stein was the first to demonstrate that superior covariance estimators in the sense of statistical decision theory can be constructed by shrinking the eigenvalues of the sample covariance matrix [46, 47]. Unfortunately, his proposed shrinkage transformation may alter the order of the eigenvalues and even undermine the positive semidefiniteness of the resulting estimator when $p > n$, which necessitates an ad hoc correction step involving an isotonic regression. Various refinements of this approach are reported in [14, 23, 56] and the references therein, but most of these works focus on the low-dimensional case where $n \ge p$.

Jensen's inequality suggests that the largest (smallest) eigenvalue of the sample covariance matrix is biased upwards (downwards), which implies that $\widehat\Sigma$ tends to be ill-conditioned [52]. This effect is most pronounced when $p$ is comparable to or larger than $n$. A promising shrinkage estimator for the covariance matrix is thus obtained by forming a convex combination of the sample covariance matrix and the identity matrix scaled by the average of the sample eigenvalues [32]. If its convex weights are chosen optimally in view of the Frobenius risk, the resulting shrinkage estimator can be shown to be both well-conditioned and more accurate than $\widehat\Sigma$. Alternative shrinkage targets include the constant correlation model, which preserves the sample variances but equalizes all pairwise correlations [31], the single index model, which assumes that each random variable is explained by one systematic and one idiosyncratic risk factor [30], and the diagonal matrix of the sample eigenvalues [49], among others.
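The linear shrinkage estimator of [32] is available off the shelf; the following sketch (synthetic data with arbitrary dimensions satisfying $p > n$) fits scikit-learn's LedoitWolf estimator and checks that the resulting covariance estimate is well-conditioned and invertible even though the sample covariance matrix is singular.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(3)
n, p = 30, 60                                    # p > n: sample covariance matrix is singular
xi = rng.standard_normal((n, p))

lw = LedoitWolf().fit(xi)
sigma_lw = lw.covariance_                        # convex combination of sample covariance and scaled identity
print("shrinkage weight:", lw.shrinkage_)
print("condition number:", np.linalg.cond(sigma_lw))   # finite, unlike for the sample covariance matrix
precision_lw = np.linalg.inv(sigma_lw)           # invertible even though p > n
```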

The linear shrinkage estimators described above are computationally attractive because evaluating convex combinations is cheap. Computing the corresponding precision matrix estimators requires a matrix inversion and is therefore more expensive. We emphasize that linear shrinkage estimators for the precision matrix itself, obtained by forming a cheap convex combination of the inverse sample covariance matrix and a shrinkage target, are not available in the big data regime, where $p \ge n$ and $\widehat\Sigma$ fails to be invertible.

More recently, insights from random matrix theory have motivated a new rotation equivariant shrinkage estimator that applies an individualized shrinkage intensity to every sample eigenvalue [33]. While this nonlinear shrinkage estimator offers significant improvements over linear shrinkage, its evaluation necessitates the solution of a hard nonconvex optimization problem, which becomes cumbersome for large values of $p$. Alternative nonlinear shrinkage estimators can be obtained by imposing an upper bound on the condition number of the covariance matrix in the underlying maximum likelihood estimation problem [55].

Alternatively, multi-factor models familiar from the arbitrage pricing theory can be used to approximate the covariance matrix by a sum of a low-rank and a diagonal component, both of which have only a few free parameters and are thus easier to estimate. Such a dimensionality reduction leads to stable estimators [8, 16].

This paper endeavors to develop a principled approach to precision matrix estimation, which is inspired by recent advances in distributionally robust optimization [11, 22, 54]. For the sake of argument, assume that the true distribution of $\xi$ is given by $\mathbb{P} = \mathcal{N}(\mu, \Sigma)$ with $\Sigma \succ 0$. If $\mu$ and $\Sigma$ were known, the quality of estimators $\nu$ and $X$ for the mean $\mu$ and the precision matrix $\Sigma^{-1}$, respectively, could conveniently be measured by Stein's loss [28]

$$\mathcal{L}(\nu, X) \;=\; \mathbb{E}^{\mathbb{P}}\big[(\xi - \nu)^\top X\, (\xi - \nu)\big] - \log\det X - \log\det\Sigma - p, \qquad (3)$$

which is reminiscent of the log-likelihood function (1). It is easy to verify that Stein's loss is nonnegative for all $\nu$ and $X \succ 0$ and vanishes only at the true mean $\mu$ and the true precision matrix $\Sigma^{-1}$. Of course, we cannot minimize Stein's loss directly because $\mathbb{P}$ is unknown. As a naïve remedy, one could instead minimize an approximation of Stein's loss obtained by removing the (unknown but irrelevant) normalization constant and replacing $\mathbb{P}$ in (3) with the empirical distribution. However, in doing so we simply recover the standard maximum likelihood estimation problem, which is unbounded for $p \ge n$ and outputs the sample mean and the inverse sample covariance matrix for $p < n$. This motivates us to robustify the empirical loss minimization problem by exploiting that the nominal distribution $\widehat{\mathbb{P}} = \mathcal{N}(\widehat\mu, \widehat\Sigma)$ is close to $\mathbb{P}$ in Wasserstein distance.
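For concreteness, the following numpy helper (the function name is ours) evaluates Stein's loss $\operatorname{tr}(\Sigma X) - \log\det(\Sigma X) - p$ for a precision estimate $X$ against a known covariance matrix $\Sigma$, i.e., equation (3) with the mean estimate fixed to the true mean; it is nonnegative and vanishes exactly at $X = \Sigma^{-1}$.

```python
import numpy as np

def stein_loss(sigma_true, X):
    """Stein's loss tr(Sigma X) - log det(Sigma X) - p (mean term omitted)."""
    p = sigma_true.shape[0]
    M = sigma_true @ X
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p

sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(stein_loss(sigma, np.linalg.inv(sigma)))   # 0.0 at the true precision matrix
print(stein_loss(sigma, np.eye(2)))              # strictly positive elsewhere
```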

Definition 1.1 (Wasserstein distance).

The type-2 Wasserstein distance between two arbitrary distributions $\mathbb{Q}_1$ and $\mathbb{Q}_2$ on $\mathbb{R}^p$ with finite second moments is defined as

$$W(\mathbb{Q}_1, \mathbb{Q}_2) \;=\; \left( \inf_{\pi \in \Pi(\mathbb{Q}_1, \mathbb{Q}_2)} \int_{\mathbb{R}^p \times \mathbb{R}^p} \|\xi_1 - \xi_2\|^2 \, \pi(\mathrm{d}\xi_1, \mathrm{d}\xi_2) \right)^{1/2},$$

where $\Pi(\mathbb{Q}_1, \mathbb{Q}_2)$ denotes the set of all joint distributions of $\xi_1$ and $\xi_2$ with marginals $\mathbb{Q}_1$ and $\mathbb{Q}_2$, respectively.

The squared Wasserstein distance between $\mathbb{Q}_1$ and $\mathbb{Q}_2$ can be interpreted as the cost of moving the distribution $\mathbb{Q}_1$ to the distribution $\mathbb{Q}_2$, where $\|\xi_1 - \xi_2\|^2$ quantifies the cost of moving unit mass from $\xi_1$ to $\xi_2$.

A central limit type theorem for the Wasserstein distance between empirical normal distributions implies that the appropriately scaled squared Wasserstein distance between $\widehat{\mathbb{P}} = \mathcal{N}(\widehat\mu, \widehat\Sigma)$ and $\mathbb{P}$ converges weakly to a quadratic functional of independent normal random variables as the number of training samples tends to infinity [42, Theorem 2.3]. We may thus conclude that for every $\eta \in (0, 1)$ there exists a radius $\rho > 0$ such that $W(\widehat{\mathbb{P}}, \mathbb{P}) \le \rho$ with probability at least $1 - \eta$ for all $n$ large enough. In the following we denote by $\mathcal{N}_p$ the family of all normal distributions on $\mathbb{R}^p$ and by

$$\mathbb{B}_\rho(\widehat{\mathbb{P}}) \;=\; \big\{ \mathbb{Q} \in \mathcal{N}_p \,:\, W(\mathbb{Q}, \widehat{\mathbb{P}}) \le \rho \big\}$$

the ambiguity set of all normal distributions whose Wasserstein distance to $\widehat{\mathbb{P}}$ is at most $\rho$. Note that $\mathbb{B}_\rho(\widehat{\mathbb{P}})$ depends on the unknown true distribution $\mathbb{P}$ only through the training data and, for $\rho$ chosen as above, contains $\mathbb{P}$ with confidence $1 - \eta$ asymptotically as $n$ tends to infinity. It is thus natural to formulate a distributionally robust estimation problem for the precision matrix that minimizes Stein's loss—modulo an irrelevant normalization constant—in the worst case across all reference distributions $\mathbb{Q} \in \mathbb{B}_\rho(\widehat{\mathbb{P}})$.

$$\min_{X \in \mathcal{X}} \;\; \sup_{\mathbb{Q} \in \mathbb{B}_\rho(\widehat{\mathbb{P}})} \; \mathbb{E}^{\mathbb{Q}}\big[ (\xi - \widehat\mu)^\top X\, (\xi - \widehat\mu) \big] - \log\det X \qquad (4)$$

Here, $\mathcal{X} \subseteq \mathbb{S}^p$ denotes the set of admissible precision matrices. In the absence of any prior structural information, the only requirement is that $X$ be positive semidefinite and invertible, in which case $\mathcal{X} = \{X \in \mathbb{S}^p : X \succ 0\}$. Known conditional independence relationships impose a sparsity pattern on $X$, which is easily enforced through linear equality constraints in $\mathcal{X}$. By adopting a worst-case perspective, we hope that the minimizers of (4) will have low Stein's loss with respect to all distributions in $\mathbb{B}_\rho(\widehat{\mathbb{P}})$, including the unknown true distribution $\mathbb{P}$. As Stein's loss with respect to the empirical distribution is proportional to the log-likelihood function (1), problem (4) can also be interpreted as a robust maximum likelihood estimation problem that hedges against perturbations in the training samples. As we will show below, this robustification is tractable and has a regularizing effect.

Recently it has been discovered that distributionally robust optimization models with Wasserstein ambiguity sets centered at discrete distributions on $\mathbb{R}^p$ (and without any normality restrictions) are often equivalent to tractable convex programs [36, 57]. Extensions of these results to general Polish spaces are reported in [6, 20]. The explicit convex reformulations of Wasserstein distributionally robust models have not only facilitated efficient solution procedures but have also revealed insightful connections between distributional robustness and regularization in machine learning. Indeed, many classical regularization schemes of supervised learning such as the Lasso method can be explained by a Wasserstein distributionally robust model. This link was first discovered in the context of logistic regression [45] and later extended to other popular regression and classification models [6, 44] and even to generative adversarial networks in deep learning [19].

Model (4) differs fundamentally from all existing distributionally robust optimization models in that the ambiguity set $\mathbb{B}_\rho(\widehat{\mathbb{P}})$ contains only normal distributions. As the family of normal distributions fails to be closed under mixtures, this ambiguity set is nonconvex. In the remainder of the paper we devise efficient solution methods for problem (4), and we investigate the properties of the resulting precision matrix estimator.

The main contributions of this paper can be summarized as follows.

  • Leveraging an analytical formula for the Wasserstein distance between two normal distributions derived in [21], we prove that the distributionally robust estimation problem (4) is equivalent to a tractable semidefinite program—despite the nonconvex nature of the underlying ambiguity set.

  • We prove that problem (4) and its unique minimizer depend on the training data only through $\widehat\Sigma$ (but not through $\widehat\mu$), which is reassuring because $\widehat\Sigma$ is a sufficient statistic for the precision matrix.

  • In the absence of any structural information, we demonstrate that problem (4) has an analytical solution that is naturally interpreted as a nonlinear shrinkage estimator. Indeed, the optimal precision matrix estimator shares the eigenvectors of the sample covariance matrix, and as the radius of the Wasserstein ambiguity set grows, its eigenvalues are shrunk towards 0 while preserving their order. At the same time, the condition number of the optimal estimator steadily improves and eventually converges to 1 even for $p > n$. These desirable properties are not enforced ex ante but emerge naturally from the underlying distributionally robust optimization model.

  • In the presence of conditional independence constraints, the semidefinite program equivalent to (4) is beyond the reach of general purpose solvers for practically relevant problem dimensions $p$. We thus devise an efficient sequential quadratic approximation method reminiscent of the QUIC algorithm [26], which can solve large instances of problem (4) on a standard PC.

  • We derive an analytical formula for the extremal distribution that attains the supremum in (4).

The paper is structured as follows. Section 2 demonstrates that the distributionally robust estimation problem (4) admits an exact reformulation as a tractable semidefinite program. Section 3 derives an analytical solution of this semidefinite program in the absence of any structural information, while Section 4 develops an efficient sequential quadratic approximation algorithm for the problem with conditional independence constraints. The extremal distribution that attains the worst-case expectation in (4) is characterized in Section 5, and numerical experiments based on synthetic and real data are reported in Section 6.

Notation. For any $A \in \mathbb{R}^{p \times p}$ we use $\operatorname{tr}(A)$ to denote the trace and $\|A\|$ to denote the Frobenius norm of $A$. By slight abuse of notation, the Euclidean norm of a vector $v \in \mathbb{R}^p$ is also denoted by $\|v\|$. Moreover, $I$ stands for the identity matrix; its dimension is usually evident from the context. For any $A, B \in \mathbb{R}^{p \times p}$, we use $\langle A, B \rangle = \operatorname{tr}(A^\top B)$ to denote the inner product and $A \otimes B$ to denote the Kronecker product of $A$ and $B$. The space of all symmetric matrices in $\mathbb{R}^{p \times p}$ is denoted by $\mathbb{S}^p$. We use $\mathbb{S}^p_+$ ($\mathbb{S}^p_{++}$) to represent the cone of symmetric positive semidefinite (positive definite) matrices in $\mathbb{R}^{p \times p}$. For any $A, B \in \mathbb{S}^p$, the relation $A \succeq B$ ($A \succ B$) means that $A - B \in \mathbb{S}^p_+$ ($A - B \in \mathbb{S}^p_{++}$).

2. Tractable Reformulation

Throughout this paper we assume that the random vector $\xi$ is normally distributed. This is in line with the common practice in statistics and in the natural and social sciences, whereby normal distributions are routinely used to model random vectors whose distributions are unknown. The normality assumption is often justified by the central limit theorem, which suggests that random vectors influenced by many small and unrelated disturbances are approximately normally distributed. Moreover, the normal distribution maximizes entropy across all distributions with given first- and second-order moments, and as such it constitutes the least prejudiced distribution compatible with a given mean vector and covariance matrix.

In order to facilitate rigorous statements, we first provide a formal definition of normal distributions.

Definition 2.1 (Normal distributions).

We say that $\mathbb{Q}$ is a normal distribution on $\mathbb{R}^p$ with mean $\mu \in \mathbb{R}^p$ and covariance matrix $\Sigma \in \mathbb{S}^p_+$, that is, $\mathbb{Q} = \mathcal{N}(\mu, \Sigma)$, if $\mathbb{Q}$ is supported on the affine subspace $\mu + \mathrm{range}(\Sigma)$, and if the density function of $\mathbb{Q}$ with respect to the Lebesgue measure on this subspace is given by

$$f(\xi) \;=\; \frac{1}{\sqrt{(2\pi)^r \det \Lambda}}\, \exp\!\Big(-\tfrac{1}{2}\, (\xi - \mu)^\top V \Lambda^{-1} V^\top (\xi - \mu)\Big),$$

where $r = \operatorname{rank}(\Sigma)$, $\Lambda$ is the diagonal matrix of the positive eigenvalues of $\Sigma$, and $V$ is the matrix whose columns correspond to the orthonormal eigenvectors of the positive eigenvalues of $\Sigma$. The family of all normal distributions on $\mathbb{R}^p$ is denoted by $\mathcal{N}_p$, while the subfamily of all distributions in $\mathcal{N}_p$ with zero means and arbitrary covariance matrices is denoted by $\mathcal{N}_p^0$.

Definition 2.1 explicitly allows for degenerate normal distributions with rank deficient covariance matrices.

The normality assumption also has distinct computational advantages. In fact, while the Wasserstein distance between two generic distributions is only given implicitly as the solution of a mass transportation problem, the Wasserstein distance between two normal distributions is known in closed form. It can be expressed explicitly as a function of the mean vectors and covariance matrices of the two distributions.

Proposition 2.2 (Givens and Shortt [21, Proposition 7]).

The type-2 Wasserstein distance between two normal distributions $\mathbb{Q}_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $\mathbb{Q}_2 = \mathcal{N}(\mu_2, \Sigma_2)$ with $\mu_1, \mu_2 \in \mathbb{R}^p$ and $\Sigma_1, \Sigma_2 \in \mathbb{S}^p_+$ amounts to

$$W(\mathbb{Q}_1, \mathbb{Q}_2) \;=\; \sqrt{\;\|\mu_1 - \mu_2\|^2 + \operatorname{tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_2^{1/2}\, \Sigma_1\, \Sigma_2^{1/2}\big)^{1/2}\Big)}.$$

If $\mathbb{Q}_1$ and $\mathbb{Q}_2$ share the same mean vector (e.g., if $\mathbb{Q}_1, \mathbb{Q}_2 \in \mathcal{N}_p^0$), then the Wasserstein distance reduces to a function of the covariance matrices $\Sigma_1$ and $\Sigma_2$ only, thereby inducing a metric on the cone $\mathbb{S}^p_+$.
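The closed-form expression in Proposition 2.2 is straightforward to evaluate numerically. The following sketch (a minimal illustration; the helper name and the test matrices are arbitrary) uses scipy.linalg.sqrtm for the matrix square roots.

```python
import numpy as np
from scipy.linalg import sqrtm

def wasserstein_gaussian(mu1, sigma1, mu2, sigma2):
    """Type-2 Wasserstein distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    s2_half = sqrtm(sigma2)
    cross = sqrtm(s2_half @ sigma1 @ s2_half)
    squared = np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * cross)
    return np.sqrt(max(np.real(squared), 0.0))   # clip tiny negative round-off errors

mu1, mu2 = np.zeros(3), np.ones(3)
sigma1 = np.diag([1.0, 2.0, 3.0])
sigma2 = np.eye(3)
print(wasserstein_gaussian(mu1, sigma1, mu2, sigma2))
```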

Definition 2.3 (Induced metric on $\mathbb{S}^p_+$).

Let $\mathbb{W}$ be the metric on $\mathbb{S}^p_+$ induced by the type-2 Wasserstein metric on the family of normal distributions with equal means. Thus, for all $\Sigma_1, \Sigma_2 \in \mathbb{S}^p_+$ we set

$$\mathbb{W}(\Sigma_1, \Sigma_2) \;=\; \sqrt{\operatorname{tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_2^{1/2}\, \Sigma_1\, \Sigma_2^{1/2}\big)^{1/2}\Big)}.$$

The definition of $\mathbb{W}$ implies via Proposition 2.2 that $\mathbb{W}(\Sigma_1, \Sigma_2) = W(\mathcal{N}(\mu, \Sigma_1), \mathcal{N}(\mu, \Sigma_2))$ for all $\Sigma_1, \Sigma_2 \in \mathbb{S}^p_+$ and $\mu \in \mathbb{R}^p$. Thanks to its interpretation as the restriction of the Wasserstein metric $W$ to the space of normal distributions with a fixed mean, it is easy to verify that $\mathbb{W}$ is symmetric and positive definite and satisfies the triangle inequality. In other words, $\mathbb{W}$ inherits the property of being a metric from $W$.

Corollary 2.4 (Commuting covariance matrices).

If $\Sigma_1, \Sigma_2 \in \mathbb{S}^p_+$ commute ($\Sigma_1 \Sigma_2 = \Sigma_2 \Sigma_1$), then the induced Wasserstein distance simplifies to the Frobenius distance between the square roots of $\Sigma_1$ and $\Sigma_2$, that is,

$$\mathbb{W}(\Sigma_1, \Sigma_2) \;=\; \big\| \Sigma_1^{1/2} - \Sigma_2^{1/2} \big\|.$$

Proof.

The commutativity of $\Sigma_1$ and $\Sigma_2$ implies that $\big(\Sigma_2^{1/2}\, \Sigma_1\, \Sigma_2^{1/2}\big)^{1/2} = \Sigma_1^{1/2}\, \Sigma_2^{1/2}$, whereby

$$\mathbb{W}(\Sigma_1, \Sigma_2)^2 \;=\; \operatorname{tr}\big(\Sigma_1 + \Sigma_2 - 2\,\Sigma_1^{1/2}\Sigma_2^{1/2}\big) \;=\; \operatorname{tr}\big((\Sigma_1^{1/2} - \Sigma_2^{1/2})^2\big) \;=\; \big\|\Sigma_1^{1/2} - \Sigma_2^{1/2}\big\|^2.$$

Thus, the claim follows. ∎
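As a quick numerical sanity check of the commuting case, the sketch below compares the general formula of Definition 2.3 with the simplified expression of Corollary 2.4 for two diagonal (and hence commuting) covariance matrices.

```python
import numpy as np
from scipy.linalg import sqrtm

sigma1 = np.diag([1.0, 4.0, 9.0])
sigma2 = np.diag([2.0, 2.0, 2.0])        # diagonal matrices commute

# general formula of Definition 2.3
cross = sqrtm(sqrtm(sigma2) @ sigma1 @ sqrtm(sigma2))
w_general = np.sqrt(np.real(np.trace(sigma1 + sigma2 - 2 * cross)))

# simplified formula of Corollary 2.4
w_commuting = np.linalg.norm(sqrtm(sigma1) - sqrtm(sigma2), ord="fro")

print(np.isclose(w_general, w_commuting))   # True
```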

Proposition 2.2 reveals that the Wasserstein distance between any two (possibly degenerate) normal distributions is finite. In contrast, the Kullback-Leibler divergence between degenerate and non-degenerate normal distributions is infinite.

Remark 2.5 (Kullback-Leibler divergence between normal distributions).

A simple calculation shows that the Kullback-Leibler divergence from $\mathbb{Q}_1 = \mathcal{N}(\mu_1, \Sigma_1)$ to $\mathbb{Q}_2 = \mathcal{N}(\mu_2, \Sigma_2)$ amounts to

$$\mathrm{KL}(\mathbb{Q}_1 \,\|\, \mathbb{Q}_2) \;=\; \frac{1}{2}\Big( \operatorname{tr}\big(\Sigma_2^{-1}\Sigma_1\big) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - p + \log\frac{\det\Sigma_2}{\det\Sigma_1} \Big)$$

whenever $\Sigma_1 \succ 0$ and $\Sigma_2 \succ 0$. If either $\mathbb{Q}_1$ or $\mathbb{Q}_2$ is degenerate (that is, if $\Sigma_1$ is singular and $\Sigma_2$ invertible or vice versa), then $\mathbb{Q}_1$ fails to be absolutely continuous with respect to $\mathbb{Q}_2$, which implies that $\mathrm{KL}(\mathbb{Q}_1 \,\|\, \mathbb{Q}_2) = \infty$. Moreover, from the above formula it is easy to verify that $\mathrm{KL}(\mathbb{Q}_1 \,\|\, \mathbb{Q}_2)$ diverges if either $\Sigma_1$ or $\Sigma_2$ tends to a singular matrix.

In the big data regime ($p \ge n$) the sample covariance matrix $\widehat\Sigma$ is singular even if the samples are drawn from a non-degenerate normal distribution $\mathbb{P} = \mathcal{N}(\mu, \Sigma)$ with $\Sigma \succ 0$. In this case, the Kullback-Leibler divergence between $\widehat{\mathbb{P}} = \mathcal{N}(\widehat\mu, \widehat\Sigma)$ and $\mathbb{P}$ is infinite, and thus $\widehat{\mathbb{P}}$ and $\mathbb{P}$ are perceived as maximally dissimilar despite their intimate relation. In contrast, their Wasserstein distance is finite.
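This contrast is easy to observe numerically. In the sketch below (arbitrary dimensions with $p > n$), the fitted normal distribution $\mathcal{N}(\widehat\mu, \widehat\Sigma)$ has a singular covariance matrix, so the log-determinant entering the Kullback-Leibler divergence diverges, while the Wasserstein distance of Proposition 2.2 to the true distribution $\mathcal{N}(0, I)$ remains finite.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 20, 10                                    # p > n: sample covariance matrix is singular
xi = rng.standard_normal((n, p))                 # samples from the true distribution N(0, I)
mu_hat = xi.mean(axis=0)
sigma_hat = np.cov(xi, rowvar=False, bias=True)

# KL(N(mu_hat, sigma_hat) || N(0, I)) involves -log det(sigma_hat), which diverges here
sign, logdet = np.linalg.slogdet(sigma_hat)
print("slogdet of sigma_hat:", sign, logdet)     # determinant is (numerically) zero

# the type-2 Wasserstein distance to N(0, I) remains finite (Proposition 2.2 with Sigma_2 = I)
w, V = np.linalg.eigh(sigma_hat)
sqrt_sigma_hat = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
w2_sq = np.sum(mu_hat ** 2) + np.trace(sigma_hat) + p - 2 * np.trace(sqrt_sigma_hat)
print("Wasserstein distance:", np.sqrt(max(w2_sq, 0.0)))
```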

In the remainder of this section we develop a tractable reformulation for the distributionally robust estimation problem (4). Before investigating the general problem, we first address a simpler problem variant where the true mean of $\xi$ is known to vanish. Thus, we temporarily assume that $\xi$ follows $\mathbb{P} = \mathcal{N}(0, \Sigma)$. In this setting, it makes sense to focus on the modified ambiguity set $\mathbb{B}^0_\rho(\widehat{\mathbb{P}}) = \{\mathbb{Q} \in \mathcal{N}_p^0 : W(\mathbb{Q}, \widehat{\mathbb{P}}) \le \rho\}$, which contains all normal distributions with zero mean that have a Wasserstein distance of at most $\rho$ from the nominal distribution $\widehat{\mathbb{P}} = \mathcal{N}(0, \widehat\Sigma)$. Under these assumptions, the estimation problem (4) thus simplifies to

$$\min_{X \in \mathcal{X}} \;\; \sup_{\mathbb{Q} \in \mathbb{B}^0_\rho(\widehat{\mathbb{P}})} \; \mathbb{E}^{\mathbb{Q}}\big[ \xi^\top X\, \xi \big] - \log\det X \qquad (5)$$

We are now ready to state the first main result of this section.

Theorem 2.6 (Convex reformulation).

For any fixed $\rho > 0$ and $\widehat\Sigma \in \mathbb{S}^p_+$, the simplified distributionally robust estimation problem (5) is equivalent to

(6)

Moreover, the optimal value function is continuous in $\widehat\Sigma$.

The proof of Theorem 2.6 relies on several auxiliary results. A main ingredient to derive the convex program (6) is a reformulation of the worst-case expectation function defined through

$$f(X, \widehat\Sigma) \;=\; \sup_{\mathbb{Q} \in \mathbb{B}^0_\rho(\widehat{\mathbb{P}})} \mathbb{E}^{\mathbb{Q}}\big[ \xi^\top X\, \xi \big]. \qquad (7)$$

In Proposition 2.8 below we will demonstrate that $f$ is continuous and coincides with the optimal value of an explicit semidefinite program, a result which depends on the following preparatory lemma.

Lemma 2.7 (Continuity properties of partial infima).

Consider a function $h(u, v)$ on two normed spaces $\mathcal{U}$ and $\mathcal{V}$, and define the partial infimum with respect to $v$ as $\underline h(u) = \inf_{v \in \mathcal{V}} h(u, v)$ for every $u \in \mathcal{U}$.

  • If $h(u, v)$ is continuous in $u$ at $u_0 \in \mathcal{U}$ for every $v \in \mathcal{V}$, then $\underline h$ is upper-semicontinuous at $u_0$.

  • If $h(u, v)$ is calm from below at $u_0 \in \mathcal{U}$ uniformly in $v \in \mathcal{V}$, that is, if there exists a constant $L \ge 0$ such that $h(u, v) \ge h(u_0, v) - L\,\|u - u_0\|$ for all $u \in \mathcal{U}$ and $v \in \mathcal{V}$, then $\underline h$ is lower-semicontinuous at $u_0$.

Proof.

As for assertion (i), we have

$$\limsup_{u \to u_0}\, \underline h(u) \;=\; \limsup_{u \to u_0}\, \inf_{v \in \mathcal{V}} h(u, v) \;\le\; \inf_{v \in \mathcal{V}}\, \limsup_{u \to u_0} h(u, v) \;=\; \inf_{v \in \mathcal{V}} h(u_0, v) \;=\; \underline h(u_0),$$

where the inequality follows from interchanging the infimum and supremum operators, while the penultimate equality relies on the continuity assumption. As for assertion (ii), note that

$$\liminf_{u \to u_0}\, \underline h(u) \;=\; \liminf_{u \to u_0}\, \inf_{v \in \mathcal{V}} h(u, v) \;\ge\; \liminf_{u \to u_0} \Big( \underline h(u_0) - L\, \|u - u_0\| \Big) \;=\; \underline h(u_0),$$

where the inequality holds due to the calmness assumption. ∎

Proposition 2.8 (Worst-case expectation function).

For any fixed $X \in \mathbb{S}^p_{++}$, $\rho > 0$ and $\widehat\Sigma \in \mathbb{S}^p_+$, the worst-case expectation $f(X, \widehat\Sigma)$ defined in (7) coincides with the optimal value of the tractable semidefinite program

(8)

Moreover, the optimal value function is continuous in $\widehat\Sigma$.

Proof.

Using the definitions of the worst-case expectation $f$ and the ambiguity set $\mathbb{B}^0_\rho(\widehat{\mathbb{P}})$, we find

$$f(X, \widehat\Sigma) \;=\; \sup_{\substack{\mathbb{Q} \in \mathcal{N}_p^0 \\ W(\mathbb{Q}, \widehat{\mathbb{P}}) \le \rho}} \mathbb{E}^{\mathbb{Q}}\big[\xi^\top X\, \xi\big] \;=\; \sup_{\substack{S \in \mathbb{S}^p_+ \\ \mathbb{W}(S, \widehat\Sigma) \le \rho}} \mathbb{E}^{\mathcal{N}(0, S)}\big[\xi^\top X\, \xi\big] \;=\; \sup_{\substack{S \in \mathbb{S}^p_+ \\ \mathbb{W}(S, \widehat\Sigma) \le \rho}} \big\langle S, X \big\rangle,$$

where the second equality holds because the metric $\mathbb{W}$ on $\mathbb{S}^p_+$ is induced by the type-2 Wasserstein metric on $\mathcal{N}_p^0$, meaning that there is a one-to-one correspondence between distributions $\mathbb{Q} = \mathcal{N}(0, S) \in \mathcal{N}_p^0$ with $W(\mathbb{Q}, \widehat{\mathbb{P}}) \le \rho$ and covariance matrices $S \in \mathbb{S}^p_+$ with $\mathbb{W}(S, \widehat\Sigma) \le \rho$. The continuity of $f$ thus follows from Berge's maximum theorem [2, pp. 115–116], which applies because $\langle S, X\rangle$ and $\mathbb{W}(S, \widehat\Sigma)$ are continuous in $S$ and $\widehat\Sigma$, while the feasible set $\{S \in \mathbb{S}^p_+ : \mathbb{W}(S, \widehat\Sigma) \le \rho\}$ is nonempty and compact for every $\widehat\Sigma \in \mathbb{S}^p_+$ and $\rho > 0$.

By the definition of the induced metric $\mathbb{W}$ we then obtain

(9)

To establish the equivalence between (8) and (9), we first assume that . The generalization to rank deficient sample covariance matrices will be addressed later. By dualizing the explicit constraint in (9) and introducing the constant matrix , which inherits invertibility from , we find

(10)

Here, the first equality exploits the identity for any , the second equality follows from strong duality, which holds because constitutes a Slater point for problem (9) when , and the third equality relies on the substitution , which implies that . Introducing the shorthand allows us to simplify the inner maximization problem over in (10) to

(11)

If , then (11) is unbounded. To see this, denote by the largest eigenvalue of and by a corresponding eigenvector. If , then the objective value of in (11) grows quadratically with . If , then for otherwise contrary to our assumption, and thus the objective value of in (11) grows linearly with . In both cases (11) is indeed unbounded.

If , then (11) becomes a convex optimization problem that can be solved analytically. Indeed, the objective function of (11) is minimized by , which satisfies the first-order optimality condition

(12)

and is strictly feasible in (11) because . Moreover, as (12) is naturally interpreted as a continuous Lyapunov equation, its solution can be shown to be unique; see, e.g.[25, Theorem 12.5]. We may thus conclude that is the unique maximizer of (11) and that the maximum of (11) amounts to .

Adding the constraint to the outer minimization problem in (10), thus excluding all values of for which and the inner supremum is infinite, and replacing the optimal value of the inner maximization problem with yields (8). This establishes the claim for .

In the second part of the proof, we show that the claim remains valid for rank deficient sample covariance matrices. To this end, we denote the optimal value of problem (8) by . From the first part of the proof we know that for all . We also know that is continuous in . It remains to be shown that for all and .

Fix any and , and note that for every . Defining the intervals and as well as the auxiliary functions

it follows from (8) that

One can show via Lemma 2.7 that is continuous at . Indeed, is linear and thus continuous in for every , which implies via Lemma 2.7(a) that is upper-semicontinuous at . Moreover, is calm from below at with uniformly in because

Here, the inequality holds for all due to the conditions , which are equivalent to and imply . Lemma 2.7(b) thus ensures that is lower-semicontinuous at . In summary, we conclude that  is indeed continuous at .

Combining the above results, we find

where the five equalities hold due to the continuity of in , the fact that for all , the definition of , the continuity of at and once again from the definition of , respectively. The claim now follows because and were chosen arbitrarily. ∎
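The continuous Lyapunov equation invoked in the proof can be solved with off-the-shelf routines. As a generic illustration of the tool (not of the specific coefficient matrices appearing in (12)), the sketch below solves $A S + S A^\top = Q$ with scipy.linalg.solve_continuous_lyapunov; the matrices $A$ and $Q$ are arbitrary.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(5)
A = -np.eye(4) - 0.1 * rng.standard_normal((4, 4))   # stable A guarantees a unique solution
Q = rng.standard_normal((4, 4))
Q = Q + Q.T                                          # symmetric right-hand side

S = solve_continuous_lyapunov(A, Q)
print(np.allclose(A @ S + S @ A.T, Q))               # True: S solves the Lyapunov equation
```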

We have now collected all necessary ingredients for the proof of Theorem 2.6.

Proof of Theorem 2.6.

By Proposition 2.8, the worst-case expectation in (5) coincides with the optimal value of the semidefinite program (8). Substituting this semidefinite program into (5) yields (6). Note that the condition , which ensures that is well-defined, is actually redundant because it is implied by the constraint . Nevertheless, we make it explicit in (6) for the sake of clarity.

It remains to show that is continuous. To this end, we first construct bounds on the minimizers of (6) that vary continuously with . Such bounds can be constructed from any feasible decision . Assume without loss of generality that , and denote by the objective value of in (6), which constitutes a linear function of . Moreover, define two continuous auxiliary functions

(13)

which are strictly positive because . Clearly, the infimum of problem (6) is determined only by feasible decisions with an objective value of at most . All such decisions satisfy

(14)

where the second and third inequalities exploit the estimates and , respectively, which are both implied by the constraint , and the last inequality holds because for all . By rearranging the above inequality and recalling the definition of , we thus find , which in turn implies that .

Denoting by the eigenvalues of the matrix and setting , we further find

where the first inequality follows from (14), while the second inequality is based on overestimating all but the smallest eigenvalue of by . By rearranging the above inequality and recalling the definition of , we thus find , which in turn implies that .

The above reasoning shows that the extra constraint has no impact on (6), that is,

where the second equality follows from Proposition 2.8. The continuity of now follows directly from Berge’s maximum theorem [2, pp. 115–116], which applies due to the continuity of  established in Proposition 2.8, the compactness of the feasible set and the continuity of and . ∎

An immediate consequence of Theorem 2.6 is that the simplified estimation problem (5) is equivalent to an explicit semidefinite program and is therefore in principle computationally tractable.

Corollary 2.9 (Tractability).

For any fixed $\rho > 0$ and $\widehat\Sigma \in \mathbb{S}^p_+$, the simplified distributionally robust estimation problem (5) is equivalent to the tractable semidefinite program

(15)
Proof.

We know from Theorem 2.6 that the estimation problem (5) is equivalent to the convex program (6). As represents a decision variable instead of a parameter, however, problem (6) fails to be a semidefinite program per se. Indeed, its objective function involves the nonlinear term , which is interpreted as outside of its domain . However, constitutes a matrix fractional function as described in [7, Example 3.4] and thus admits the semidefinite reformulation

where the second equality holds because implies