# Sparse and Low-Rank Covariance Matrices Estimation

This paper aims at achieving a simultaneously sparse and low-rank estimator from the semidefinite population covariance matrices. We first benefit from a convex optimization which develops l_1-norm penalty to encourage the sparsity and nuclear norm to favor the low-rank property. For the proposed estimator, we then prove that with large probability, the Frobenious norm of the estimation rate can be of order O(√(s(r)/n)) under a mild case, where s and r denote the number of sparse entries and the rank of the population covariance respectively, n notes the sample capacity. Finally an efficient alternating direction method of multipliers with global convergence is proposed to tackle this problem, and meantime merits of the approach are also illustrated by practicing numerical simulations.

## Authors

• 2 publications
• 3 publications
• 1 publication
• 3 publications
• ### A finite sample estimator for large covariance matrices

The present paper concerns large covariance matrix estimation via compos...
11/24/2017 ∙ by Matteo Farnè, et al. ∙ 0

• ### Square-root nuclear norm penalized estimator for panel data models with approximately low-rank unobserved heterogeneity

This paper considers a nuclear norm penalized estimator for panel data m...
04/19/2019 ∙ by Jad Beyhum, et al. ∙ 0

• ### Low-Rank Covariance Function Estimation for Multidimensional Functional Data

Multidimensional function data arise from many fields nowadays. The cova...
08/29/2020 ∙ by Jiayi Wang, et al. ∙ 0

• ### Convex Optimization without Projection Steps

For the general problem of minimizing a convex function over a compact c...
08/04/2011 ∙ by Martin Jaggi, et al. ∙ 0

• ### Simultaneously sparse and low-rank abundance matrix estimation for hyperspectral image unmixing

In a plethora of applications dealing with inverse problems, e.g. in ima...
04/07/2015 ∙ by Paris Giampouras, et al. ∙ 0

• ### Exact and Stable Covariance Estimation from Quadratic Sampling via Convex Programming

Statistical inference and information processing of high-dimensional dat...
10/02/2013 ∙ by Yuxin Chen, et al. ∙ 0

• ### A Low-Rank Approach to Off-The-Grid Sparse Deconvolution

We propose a new solver for the sparse spikes deconvolution problem over...
12/23/2017 ∙ by Paul Catala, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Estimation of population covariance matrices from samples of multivariate data has draw many attentions in the last decade owing to its fundamental importance in multivariate analysis. With dramatic advances in technology in recent years, various research fields, such as genetic data, brain imaging, spectroscopic imaging, climate data and so on, have been used to deal with massive high-dimensional data sets, whose sample sizes can be very small relative to dimension. In such settings, the standard and the most usual sample covariance matrices often performs poorly

[1, 2, 11].

Fortunately, regularization as a class of new methods to estimate covariance matrices has recently emerged to overcome those shortages of using traditional sample covariance matrices. These methods encompass several specified forms, banding [1, 6, 17], tapering [4, 10] and thresholding [2, 5, 8, 16] for instance.

Moreover, there are many cases where the model is known to be structured in several ways at the same time. In recent years, one of research contents is to estimate a covariance matrix possessing both sparsity and positive definiteness. For instance, Rothman [15] gave the following model:

 ^Σ+=argminΣ≻0 12∥Σ−Σn∥2F−τlogdet(Σ)+λ∥∥Σ−∥∥1, (1)

where is the sample covariance matrix, is the Frobenious norm, is the element-wise -norm and . From the optimization viewpoint, (1) is similar to the graphical lasso criterion [9] which also has a log-determinant part and the element-wise -penalty. Rothman [15] derived an iterative procedure to solve (1). While Xue, Ma and Zou [18] omitted the log-determinant part and considered the positive definite constraint for some arbitrarily small :

 ^Σε=argminΣ⪰ε\emph{I} 12∥Σ−Σn∥2F+λ∥∥Σ−∥∥1, (2)

They utilized an efficient alternating direction method (ADM) to solve the challenging problem (2) and established its convergence properties.

Most of the literatures, e.g.,[12, 15, 18], required the population covariance matrices being positive definite, and thus there is no essence of pursuing the low-rank of the estimator. By contrast, newly appeared research topic is to consider simultaneously the sparsity and low-rank of a structured model, which implies that the population covariance matrices are no longer restricted to the positive definite matrix cone and can be relaxed to the positive semidefinite cone. In addition, the models with structure of being simultaneously the sparsity and low-rank are widely applied into practice, such as sparse signal recovery from quadratic measurements and sparse phase retrieval, see [13] for example. Moreover, Richard et al. [14]

showed that both sparse and low-rank model can be derived in covariance matrix when the random variables are highly correlated in groups, which means this covariance matrix has a block diagonal structure.

With stimulations of those ideas, we construct the following convex model encompassing the -norm and nuclear norm for estimating the covariance matrix:

 ^Σ=argminΣ⪰0 12∥Σ−Σn∥2F+λ∥Σ∥1+τ∥Σ∥∗, (3)

where are tuning parameters. The -norm penalty is also called lasso-type penalty and is used to encourage sparse solutions. The nuclear norm , with

being the eigenvalue of

, is the trace norm when and ensures low-rank solutions of (3). Here we inroduce the approximate rank to interpret the low-rank, which is defined as being the smallest number such that

 σar+1(A)σ1(A)≤γ, (4)

where

are the singular value of

with , and could be chosen based on the needs, throughout our paper we fix for simplicity.

The contributions of this paper mainly center on two aspects. For one thing, being different from [13, 14], we establish the theoretical statistical theory under different assumptions rather than giving the generalized error bound of the estimation. Especially, we acquire the estimation rate under the Frobenious norm error, which improves the optimal rate where the low-rank property of the estimator does not be considered [7, 15, 18] and is the samples’ dimension with . For another, we take advantage of the alternating direction method of multipliers (ADMM), also can be seen in [18, 19], to combat our problem (3).

The organization of this paper is as follows. In Section 2 we will present some theoretical properties of the estimator derived by the proposed model (3). After that the alternating direction method of multipliers (ADMM) is going to be introduced to combat the problem, and numerical experiments are projected to show the performance of this method in Sections 3 and 4 respectively. We make a conclusion in the last section.

## 2 A Sparse and Low-Rank Covariance Estimator

Before the main part, we hereafter introduce some notations. and denote the expectation of and the probability of the incident occurring respectively. is the number of entries of the set

. Normal distribution with mean

and covariance is written as . Say if for every , there is a such that for all , and say if . If there are two constants such that , we write as .

For given observed independently and identically distributed (i.i.d. for short) -variate random variables with covariance matrix and , the goal is to estimate the unknown matrix based on the sample . This problem is called covariance matrix estimation which is of fundamental importance in multivariate analysis.

Given a random sample from (without loss of generality) and a population covariance matrix , the sample covariance matrix is

 Σn=(σnij)1≤i,j≤p=1n−1n∑l=1(Xl−¯X)(Xl−¯X)⊤,

where . Denote the support set of the population covariance matrix as

 S={(i,j):σ0ij≠0}, s=\textmdCard(S), r=\textmdrank(Σ0).
###### Assumption 2.1

For all , , where is a constant.

Assumption 2.1 is a common used condition in covariance matrix estimation, on which a useful lemma based is recalled here for the sequel analysis. One can also refer it in [1].

###### Lemma 2.2

Let   be i.i.d. and Assumption 2.1 holds, (i.e., for all , ). Then, if  ,

 P{∣∣ ∣∣n∑l=1(XliXlj−σ0ij)∣∣ ∣∣≥nν}≤C1\emphexp{−C2nν2}, for |ν|≤δ (5)

where constants and depend on only.

Another lemma which plays an important role in our main results is stated below.

###### Lemma 2.3

Suppose that Assumption 2.1 holds, . Then for sufficiently small,

 \emph{max}1≤i,j≤p∣∣σ0ij−σnij∣∣≤λ  implies  ∥^Σ−Σ0∥F≤ε, (6)

where is defined as .

Proof  First make the eigenvalue decomposition of () as

 Σ0=UΛΣ0U⊤≡U(\textmdDiag(λ(Σ0))0)U⊤,

where with

is the matrix composed of eigenvectors,

is a diagonal matrix generated by eigenvalues with and .
By denoting with and , which implies , we consider the model

 ^Δ := \textmdargminΔ+ΛΣ0⪰0  F(Δ)=\textmdargminΔ1+\textmdDiag(λ(Σ0))⪰0  F(Δ) ≡ \textmdargminΔ+ΛΣ0⪰0  12∥∥UΔU⊤+Σ0−Σn∥∥2F+λ∥∥UΔU⊤+Σ0∥∥1+τ∥∥UΔU⊤+Σ0∥∥∗.

Clearly, from (3) we have which implies . For a given sufficiently small and any (i.e., ), we compute

 F(Δ)−F(0) = 12∥∥UΔU⊤+Σ0−Σn∥∥2F+λ∥∥UΔU⊤+Σ0∥∥1+τ∥∥UΔU⊤+Σ0∥∥∗ −12∥Σ0−Σn∥2F+λ∥Σ0∥1+τ∥Σ0∥∗ = 12∥Δ∥2F+⟨UΔU⊤,Σ0−Σn⟩+λ(∥UΔU⊤+Σ0∥1−∥Σ0∥1) +τ(∥UΔU⊤+Σ0∥∗−∥Σ0∥∗) ≡ 12∥Δ∥2F+\textmdI+\textmdII+\textmdIII.

For convenience we denote . Then for I, it holds

 \textmdI=⟨¯¯¯¯¯Δ,Σ0−Σn⟩ ≥ −∣∣\textmdtr(¯¯¯¯¯Δ(Σn−Σ0))∣∣ = −∣∣∑i,j¯¯¯¯¯Δij(σ0ij−σnij)∣∣≥−max1≤i,j≤p∣∣σ0ij−σnij∣∣∥¯¯¯¯¯Δ∥1,

For II, we obtain by noting and that,

 II = λ(∥Σ0+¯¯¯¯¯Δ∥1−∥Σ0∥1) = λ(∥Σ0S+¯¯¯¯¯ΔS∥1+∥¯¯¯¯¯ΔSC∥1−∥Σ0S∥1) ≥ λ(∥Σ0S+¯¯¯¯¯ΔS∥1+∥¯¯¯¯¯ΔSC∥1−(∥Σ0S+¯¯¯¯¯ΔS∥1+∥¯¯¯¯¯ΔS∥1)) = λ(∥¯¯¯¯¯ΔSC∥1−∥¯¯¯¯¯ΔS∥1).

From the Hlder Inequality, one can prove that

 ∥¯¯¯¯¯Δ∥∗=∥Δ∥∗=∥Δ1∥∗=r∑i=1|λi(Δ1)|≤√r∥Δ1∥F=√r∥Δ∥F, (8) ∥¯¯¯¯¯ΔS∥1≤√s∥¯¯¯¯¯ΔS∥F. (9)

For III, combining with (8) we get that

Since and (9),

 G(Δ) ≡ F(Δ)−F(0) = 12∥Δ∥2F+\textmdI+\textmdII+\textmdIII ≥ 12∥Δ∥2F−λ∥¯¯¯¯¯Δ∥1+λ(∥¯¯¯¯¯ΔSC∥1−∥¯¯¯¯¯ΔS∥1)−τ√r∥Δ∥F = 12∥Δ∥2F−λ(∥¯ΔS∥1+∥¯¯¯¯¯ΔSC∥1−(∥¯¯¯¯¯ΔSC∥1−∥¯¯¯¯¯ΔS∥1))−τ√r∥Δ∥F = 12∥Δ∥2F−2λ∥¯¯¯¯¯ΔS∥1−τ√r∥Δ∥F ≥ 12∥Δ∥2F−2λ√s∥¯¯¯¯¯ΔS∥F−τ√r∥Δ∥F ≥ 12∥Δ∥2F−2λ√s∥¯¯¯¯¯Δ∥F−τ√r∥Δ∥F = 12∥Δ∥2F−2λ√s∥Δ∥F−τ√r∥Δ∥F.

Therefore, by

 G(Δ)≥12∥Δ∥2F−2λ√s∥ΔS∥F−τ√r∥Δ∥F≥ε22−ε28−ε28=ε24>0.

Hence we prove that if and , for any , it holds

 G(Δ)≡F(Δ)−F(0)>0.

In addition, from (2), we have

 ^Δ=\textmdargminΔ+\textmdDiag(λΣ0)⪰0  F(Δ)−F(0) =\textmdargminΔ+\textmdDiag(λΣ0)⪰0  G(Δ), (10)

which implies that . Otherwise, we suppose , then . Since for any , it follows which is contradicted with the fact is a convex function and , because

 0

Finally indicates that due to . Hence the desired result is obtained..

Then in order to acquiring the rate of the estimation, the two following commonly used assumptions are needed to introduced, and also can be seen [1, 15]. Assumption 2.4 holds, for example, if are Gaussian.

###### Assumption 2.4

hold for all and , where and are two constants.

###### Assumption 2.5

hold for all , where some and is a constant.

Built on the two assumptions, we give our main results with regard to rates of the estimator of (3).

###### Theorem 2.6

For some , let be a sufficiently large constant and suppose Assumptions 2.1 and 2.4 hold. If and , then

 ∥^Σ−Σ0∥F=OP⎛⎝√slogrn+slogpnKδ1⎞⎠. (11)

Proof  Since Assumptions 2.1 and 2.4 hold, a fact employed by Rothman et al. [16] is that for sufficiently small,

 P{max1≤i,j≤n∣∣σ0ij−σnij∣∣≥ν}≤C3p2exp{−C4nν2},

where and are some constants. We then apply the bound and Lemma 2.3 with

 ε=16K1√slogrn+slogpnKδ1,  λ=K1√logrn+logpnKδ1

to obtain that

 P{∥^Σ−Σ0∥F≤ε}≥P{max1≤i,j≤n∣∣σ0ij−σnij∣∣≤λ}≥1−C3p2−C4K2−δ1r−C4K21.

Evidently, can be arbitrarily close to one by choosing sufficiently large..

###### Corollary 2.7

For some , let be a sufficiently large constant such that , and suppose Assumptions 2.1, 2.4 hold. If and , then

 ∥^Σ−Σ0∥F=OP(√slogrn). (12)
###### Remark 2.8

Clearly, if the in Assumption 2.1, the better rate would reduce to . It is worth mentioning that under the the Assumption 2.1, the minimax optimal rate of convergence under the Frobenius norm in Theorem 4 of [7] is which also has been obtained by [18]. However, to attain the same rate in the presence of the log-determinant barrier term (1), Rothman [15] instead would require that , the minimal eigenvalue of the true covariance matrix, should be bounded away from zero by some positive constant, and also that the barrier parameter should be bounded by some positive quantity. [18] illustrated this theory requiring a lower bound on is not very appealing.

###### Theorem 2.9

Let be a sufficiently large constant and suppose that Assumptions 2.1 and 2.5 hold. If and , then

 ∥^Σ−Σ0∥F=OP⎛⎝√sp4/αn⎞⎠. (13)

Proof  Since Assumptions 2.1 and 2.5 hold, one can modify a result of Bickel & Levina (2008a) and show that for sufficiently small

 P{max1≤i,j≤n∣∣σ0ij−σnij∣∣≥ν}≤p2C5n−α/2ν−α,

where is a constant. We then apply the bound and Lemma 2.3 with to get

 P{∥∥^Σ−Σ0∥∥F≤ε}≥P{max% 1≤i,j≤n∣∣σ0ij−σnij∣∣≤λ}≥1−C6K−α2.

Apparently, the bound can be arbitrarily close to one by taking sufficiently large..

## 3 Alternating Direction Method of multipliers

In this section, we will construct the alternating direction method of multipliers (ADMM) to solve problem (3). By introducing an auxiliary variable , problem (3) can be rewritten as

 \textmdmin  12∥Σ−Σn∥2F+λ∥Σ∥1+τ∥Γ∥∗, (14) \textmds.t.    Γ⪰0, Σ−Γ=0.

The constraint can be put into the objective function by using an indicator function:

 I(Γ⪰0)= 0,        Γ⪰0 I(Γ⪰0)= +∞,   \textmdotherwise.

This leads to the following equivalent reformulation of (14):

 \textmdmin  12∥Σ−Σn∥2F+λ∥Σ∥1+τ∥Γ∥∗+I(Γ⪰0), (15) \textmds.t.    Σ−Γ=0.

Recently, the alternating direction method of multipliers (ADMM) has been studied extensively for solving (15). A typical iteration of ADMM for solving (15) can be described as

 Γk+1:=\textmdargminΓ Lμ(Σk,Γ,Λk) (16) Σk+1:=\textmdargminΣ Lμ(Σ,Γk+1,Λk) (17) Λk+1:=Λk−1μ(Γk+1−Σk+1), (18)

where the augmented Lagrangian function is defined as

 Lμ(Σ,Γ,Λ) := (19) +I(Γ⪰0)−⟨Λ,Γ−Σ⟩+12μ∥Σ−Γ∥2F,

in which is the Lagrange multiplier and is a penalty parameter. Note that ADMM (16-18) can be written explicitly as

 Γk+1:=\textmdargminΓ⪰0 τ∥Γ∥∗+12μ∥Γ−(Σk+μΛk)∥2F (20) Σk+1:=\textmdargminΣ λ∥Σ∥1+μ+12μ∥Σ−μμ+1(Σn+1μΓk+1−Λk)∥2F (21) Λk+1:=Λk−(Γk+1−Σk+1)/μ. (22)

We now show that the two subproblems (20) and (21) can be easily solved. For the subproblem (20),

 Γk+1 = \textmdargminΓ⪰0 τ∥Γ∥∗+12μ∥Γ−(Σk+μΛk)∥2F (23) = \textmdargminΓ⪰0 τ⟨Γ,I⟩+12μ∥Γ−(Σk+μΛk)∥2F = \textmdargminΓ⪰0 ∥Γ−(Σk+μΛk−μτI)∥2F = (Σk+μΛk−μτI)+,

where denote the projection of a matrix onto the convex positive semidefinite cone . Namely , where and

The solution of the second subproblem (21) is given by the -shrinkage operation

 Σk+1 = μμ+1\textmdShrink(Σn+1μΓk+1−Λk,λ) (24) = μμ+1(\textmdmax {∣∣Pij∣∣−λ,0}\textmdsign(Pij))1≤i,j≤p,

where and is a sign function.

Therefore, combining with (22)-(24), the whole algorithm is written as follows

To end this section, we prove that the sequence produced by the alternating direction method of multipliers (Table 1) converges to , where is an optimal solution of (15) and is the optimal dual variable. Now we label some necessary notations for the ease of presentation. Let be a matrix defined as

 H=(1μ\emphIp×p00μ\emphIp×p),

the weighted norm stands for and the corresponding inner product is . Before presenting the main theorem with regard to the global convergence of ADMM, we introduce the following lemma.

###### Lemma 3.1

Assume that is an optimal solution of and is the corresponding optimal dual variable associated with the equality constraint . Then the sequence produced by ADMM satisfies

 ∥^V−Vk∥2H−∥^V−Vk+1∥2H≥∥Vk+1−Vk∥2H, (25)

where and .

Based on the lemma above, the convergent theorem can be derived immediately.

###### Theorem 3.2

The sequence generated by Algorithm 1 from any starting point converges to an optimal solution of .

## 4 Numerical Simulations

In this section we will exploit the proposed method ADMM to tackle two examples, one of which possessed the block structured population covariance matrix, and another utilized the banded population covariance matrix. Actually as the constraint , our proposed model (3) is equivalent to

 (26)

So similar to the method in [18], one can solve the soft-thresholding estimator

 Σst:=argminΣ 12∥Σ−(Σn−τI)∥2F+λ∥Σ∥1

to initialize the . If the derived then the recovered sparse and low-rank semidefinite estimator . In our stimulation, we uniformly initialize as the matrix with all entries being 1,

as zero matrix and

respectively. Unlike and , does not change the final covariance estimator, thus we fixed just for simplicity and the stop criteria is set as

 max{ ∥Γk+1−Γk∥F, ∥Σk+1−Σk∥F }≤5×10−4.

For the sample dimensions, we always take and .

### 4.1 Example I: Block Structure

Analogous to the model, modified slightly here, emerged in [14] who synthesized samples for a block diagonal population covariance matrix , we will use blocks of random sizes, and each block is generated by where the entries of

are drawn i.i.d. from the uniform distribution on

. Evidently, the rank of produced in the way is . Corresponding MATLAB code of generating is , thereby deriving the sample covariance matrix

 Σn=1n−1n∑l=1(X