## 1 Introduction

In contemporary statistics, datasets are typically collected with high-dimensionality, where the dimension can be significantly larger than the sample size . For example, in genomics studies, the number of genes is typically much larger than the number of subjects (The Cancer Genome Atlas Network et al., 2012)

. In computer vision, the number of pixels in each image can be comparable to or exceed the number of images when the resolution of these images is relatively high

(Georghiades et al., 2001; Lee et al., 2005). When dealing with such high-dimensional datasets, covariance matrix estimation plays a central role in understanding the complex structure of the data and has received significant attention in various contexts, including latent factor models (Bernardo et al., 2003; Geweke and Zhou, 1996), Gaussian graphical models (Liu et al., 2012; Wainwright and Jordan, 2008), etc. However, in the high-dimensional setting, additional structural assumptions are often necessary in order to address challenges associated with statistical inference (Johnstone and Lu, 2009). For example, sparsity is introduced for sparse covariance/precision matrix estimation (Cai et al., 2016; Cai and Zhou, 2012; Friedman et al., 2008), and low-rank structure is enforced in spiked covariance matrix models (Cai et al., 2015; Johnstone, 2001). Readers can refer to Cai et al. (2016) for a recent literature review.In this paper we focus on the sparse spiked covariance matrix models under the Gaussian sampling distribution assumption. The spiked covariance matrix models, originally named in Johnstone (2001), is a class of models that can be described as follows: The observations are independently collected from the

-dimensional mean-zero normal distribution with covariance matrix

of the form(1) |

where is a matrix with orthonormal columns, is an diagonal matrix, and . Since the spectrum of the covariance matrix is (in non-increasing order), there exists an eigen-gap , where denotes the

-th largest eigenvalue of

. Therefore the first leading eigenvalues of can be regarded as “spike” or signal eigenvalues, and the remaining eigenvalues may be treated as “bulk” or noise eigenvalues. Here we assume that the eigenvector matrix is jointly sparse, the formal definition of which is deferred to Section 2.1. Roughly speaking, joint sparsity refers to a significant amount of rows inbeing zero, which allows for feature selection and brings easy interpretation in many applications. For example, in the analysis of face images, a classical method to extract common features among different face characteristics, expressions, illumination conditions, etc., is to obtain the eigenvectors of these face data, referred to as eigenfaces. Each coordinate of these eigenvectors corresponds to a specific pixel in the image. Nonetheless, the number of pixels (features) is typically much larger than the number of images (samples), and it is often desirable to gain insights of the face information via a relatively small number of pixels, referred to as key pixels. By introducing joint sparsity to these eigenvectors, one is able to conveniently model key pixels among multiple face images corresponding to non-zero rows of eigenvectors. A concrete real data example is provided in Section

4.2.The literature on sparse spiked covariance matrix estimation in high-dimensions from a frequentist perspective is quite rich. In Johnstone and Lu (2009)

, it is shown that the classical principal component analysis can fail when

. In Cai et al. (2013) and Vu and Lei (2013), the minimax estimation of the principal subspace (*i.e.*, the linear subspace spanned by the eigenvector matrix ) with respect to the projection Frobenius norm loss under various sparsity structure on is considered, and Cai et al. (2015) provides minimax estimation procedures of the principal subspace with respect to the projection operator norm loss under the joint sparsity assumption.

In contrast, there is comparatively limited literature on Bayesian estimation of sparse spiked covariance matrices providing theoretical guarantees. To the best of our knowledge, Gao and Zhou (2015) and Pati et al. (2014) are the only two works in the literature addressing posterior contraction rates for Bayesian estimation of sparse spiked covariance matrix models. In particular, in Pati et al. (2014) the authors discuss the posterior contraction behavior of the covariance matrix with respect to the operator norm loss under the Dirichlet-Laplace shrinkage prior (Bhattacharya et al., 2015), but the contraction rates are sub-optimal when the number of spikes grows with the sample size; In Gao and Zhou (2015), the authors propose a carefully designed prior on that yields rate-optimal posterior contraction of the principal subspace with respect to the projection Frobenius norm loss, but the tractability of computing the full posterior distribution is lost, except for the posterior mean as a point estimator. Neither Gao and Zhou (2015) nor Pati et al. (2014) discusses the posterior contraction behavior for sparse spiked covariance matrix models when the eigenvector matrix exhibits joint sparsity.

We propose a matrix spike-and-slab LASSO prior to model joint sparsity occurring in the eigenvector matrix

of the spiked covariance matrix. The matrix spike-and-slab LASSO prior is a novel continuous shrinkage prior that generalizes the classical spike-and-slab LASSO prior for vectors in

Ročková (2018) and Ročková and George (2016) to jointly sparse rectangular matrices. One major contribution of this work is that under the matrix spike-and-slab LASSO prior, we establish the rate-optimal posterior contraction for the entire covariance matrix with respect to the operator norm loss as well as that for the principal subspace with respect to the projection operator norm loss. Furthermore, we also focus on the two-to-infinity norm loss, a novel loss function measuring the closeness between linear subspaces. As will be seen in Section 2.1, the two-to-infinity norm loss is able to detect element-wise perturbations of the eigenvector matrix spanning the principal subspace. Under certain low-rank and bounded coherence conditions on , we obtain a tighter posterior contraction rate for the principal subspace with respect to the two-to-infinity norm loss than that with respect to the routinely used projection operator norm loss. Besides the contraction of the full posterior distribution, the Bayes procedure also leads to a point estimator for the principal subspace with a rate-optimal risk bound. In addition to the convergence results*per se*, we present a collection of concentration and large deviation inequalities for the matrix spike-and-slab LASSO prior that may be of independent interest. These technical results serve as the main tools for deriving the posterior contraction rates. Last but not least, unlike the prior proposed in Gao and Zhou (2015), the matrix spike-and-slab LASSO prior yields a tractable Metropolis-within-Gibbs sampler for posterior inference.

The rest of the paper is organized as follows. In Section 2 we briefly review the background for the sparse spiked covariance matrix models and propose the matrix spike-and-slab LASSO prior. Section 3 elaborates on our theoretical contributions, including the concentration and large deviation inequalities for the matrix spike-and-slab LASSO prior and the posterior contraction results. The numerical performance of the proposed methodology is presented in Section 4 through synthetic examples and the analysis of a real-world computer vision dataset. Further discussion is included in Section 5.

Notations: Let and be positive integers.
We adopt the shorthand notation . For any finite set , we use to denote the cardinality of . The symbols and mean the inequality up to a universal constant, *i.e.*, (, resp.) if () for some absolute constant . We write if and .
The zero matrix is denoted by , and the -dimensional zero column vector is denoted by . When the dimension is clear, the zero matrix is simply denoted by . The identity matrix is denoted by , and when the dimension is clear, is denoted by .
An orthonormal -frame in is a matrix with orthonormal columns, *i.e.*, .
The set of all orthonormal -frames in is denoted by . When , we write .
For a -dimensional vector , we use to denote its th component, to denote its -norm, to denote its -norm, and to denote its -norm.
For a symmetric square matrix , we use to denote the th-largest eigenvalue of . For a matrix , we use to denote the row vector formed by the th row of , to denote the column vector formed by the th column of ,
the lower case letter to denote the -th element of , to denote the Frobenius norm of , to denote the operator norm of , to denote the two-to-infinity norm of , and to denote the (matrix) infinity norm of . The prior and posterior distributions appearing in this paper are denoted by , and the densities of with respect to the underlying sigma-finite measure are denoted by .

## 2 Sparse Bayesian spiked covariance matrix models

### 2.1 Background

In the spiked covariance matrix model (1), the matrix is of the form We focus on the case where the leading eigenvectors of (the columns of ) are jointly sparse (Cai et al., 2015; Vu and Lei, 2013). Formally, the row support of is defined as

and is said to be jointly -sparse, if

. Heuristically, this assumption asserts that the signal comes from at most

features among all features. Geometrically, joint sparsity has the interpretation that at most coordinates of generate the subspace (Vu and Lei, 2013). Noted that due to the orthonormal constraint on the columns of .This paper studies a Bayesian framework for estimating the covariance matrix . We quantify how well the proposed methodology estimates the entire covariance matrix and the principal subspace

in the high-dimensional and jointly sparse setup. Leaving the Bayesian framework for a moment, we first introduce some necessary background. Throughout the paper, we write

to be the true covariance matrix that generates the data from the-dimensional multivariate Gaussian distribution

, where . The parameter space of interest for is given byThe following minimax rate of convergence for under the operator norm loss Cai et al. (2015) serves as a benchmark for measuring the performance of any estimation procedure for .

###### Theorem 1 (Cai et al., 2015).

Let . Suppose that and are bounded away from and . Then the minimax rate of convergence for estimating is

(2) |

Estimation of the principal subspace is less straightforward due to the fact that may not uniquely determine the eigenvector matrix .
In particular, when there exist replicates among the eigenvalues (*i.e.*, for some ), the corresponding eigenvectors can only be identified up to orthogonal transformation.
One solution is to focus on the Frobenius norm loss (Cai et al., 2013; Vu and Lei, 2013) or the operator norm loss (Cai et al., 2015) of the corresponding projection matrix , which is uniquely determined by and vice versa. The corresponding minimax rate of convergence for with respect to the projection operator norm loss is given by Cai et al. (2015):

(3) |

Though convenient, the direct estimation of the projection matrix does not provide insight into the element-wise errors of the principal eigenvectors . Motivated by a recent paper (Cape et al., 2018b), which presents a collection of technical tools for the analysis of element-wise eigenvector perturbation bounds with respect to the two-to-infinity norm, we also focus on the following two-to-infinity norm loss

(4) |

for estimating in addition to the projection operator norm loss, where

is the orthogonal matrix given by

Here, corresponds to the orthogonal alignment of so that and are close in the Frobenius norm sense. As pointed out in Cape et al. (2018b), the use of as the orthogonal alignment matrix is preferred over the two-to-infinity alignment matrix

because is not analytically computable in general, whereas can be explicitly computed (Stewart and Sun, 1990), facilitating the analysis: Let

admit the singular value decomposition

, then .The following lemma formalizes the connection between the projection operator norm loss and the two-to-infinity norm loss.

###### Lemma 1.

Let and be two orthonormal -frames in , where . Then there exists an orthonormal -frame in depending on and , such that

where is the Frobenius orthogonal alignment matrix.

When the projection operator norm loss is much smaller than one, Lemma 1 states that the two-to-infinity norm loss can be upper bounded by the product of the projection operator norm loss
and , where is an orthonormal -frame in . In particular, under the sparse spiked covariance matrix models in high dimensions, the number of spikes can be much smaller than the dimension (*i.e.*, is a “tall and thin” rectangular matrix), and hence the factor can be much smaller than .

We provide the following motivating example for the preference on the two-to-infinity norm loss (4) over the projection operator norm loss for .

###### Example.

Let be even and . Suppose the truth is given by

and consider the following two perturbations of :

where is some sufficiently small perturbation, , and is related to by

The perturbed matrices and are designed such that their projection operator norm losses are identical, *i.e.*, .
In contrast, and perturb in different fashions: all nonzero elements in are perturbed in , whereas only two nonzero elements in are perturbed in .

We examine the two candidate losses and for different values of and present them in Figure 1. It can clearly be seen that the two-to-infinity norm loss is smaller than the projection operator norm loss. Furthermore, the projection operator norm loss is unable to detect the difference between and . In contrast, the two-to-infinity norm loss indicates that has larger element-wise deviation from than does. Thus the two-to-infinity norm loss is capable of detecting element-wise perturbations of the eigenvector compared to the projection operator norm loss for estimating .

### 2.2 The matrix spike-and-slab LASSO prior for joint sparsity

We first illustrate the general Bayesian strategies in modeling sparsity occurring in high-dimensional statistics and then elaborate on the proposed prior model. Consider a simple yet canonical sparse normal mean problem. Suppose we observe independent normal data

, , with the goal of estimating the mean vector , which is assumed to be sparse in the sense that with the sparsity level as . To model sparsity on , classical Bayesian methods impose the spike-and-slab prior of the following form on : for any measurable set ,(5) | ||||

where is the indicator that ,

represents the prior probability of

being non-zero, is the point-mass at (called the “spike” distribution), and is the density of an absolutely continuous distribution (called the “slab” distribution) with respect to the Lebesgue measure ongoverned by some hyperparameter

. Theoretical justifications for the use of spike-and-slab prior (5) for sparse normal means and sparse Bayesian factor models have been established in Castillo and van der Vaart (2012) and Pati et al. (2014), respectively. Therein, the spike-and-slab prior (5) involves point-mass mixtures, which can be daunting in terms of posterior simulations (Pati et al., 2014). To address this issue, the spike-and-slab LASSO prior (Ročková, 2018) is designed as a continuous relaxation of (5):(6) | ||||

where is the Laplace distribution with mean

and variance

. When , the spike-and-slab LASSO prior (6) closely resembles the spike-and-slab prior (5). The continuity feature of the spike-and-slab LASSO prior (6), in contrast to the classical spike-and-slab prior (5), is highly desired in high-dimensional settings in terms of computation efficiency.Motivated by the spike-and-slab LASSO prior, we develop a matrix spike-and-slab LASSO prior to model joint sparsity in sparse spiked covariance matrix models (1) with the covariance matrix . The orthonormal constraint on the columns of makes it challenging to incorporate prior distributions. Instead, we consider the following reparametrization of :

(7) |

where , and is an arbitrary orthogonal matrix in . Clearly, in contrast to the orthonormal constraint on , there is no constraint on except that . Furthermore, joint sparsity of is inherited from : Specifically, for , there exists some permutation matrix and , such that

It follows directly that

implying that . Therefore, working with allows us to circumvent the orthonormal constraint while maintaining the jointly sparse structure of . We propose the following matrix spike-and-slab LASSO prior on : given hyperparameters and , for each , we independently impose the prior on as follows:

where are binary group assignment indicators, and

is the density function of the double Gamma distribution with shape parameter

and rate parameter :We further impose hyperpriors on

and aswhere is the inverse Gamma distribution with density , and is some fixed constant. We refer to the above hierarchical prior on as the matrix spike-and-slab LASSO prior and denote . The hyperparameter is fixed throughout. In the single-spike case (), we observe that reduces to the density function of the Laplace distribution, and hence the matrix spike-and-slab LASSO prior coincides with the spike-and-slab LASSO prior (Ročková, 2018). Clearly, it can be seen that a priori, is much larger than , so that corresponds to rows that are close to , and represents that the th row is decently away from . It should be noted that unlike the spike-and-slab prior (5), the group indicator variable or corresponds to small or large values of rather than the exact sparsity of . In addition, indicates that the matrix spike-and-slab LASSO prior favors a large proportion of rows of being close to . These features of the matrix spike-and-slab LASSO prior are in accordance with the joint sparsity assumption on . We complete the prior specification by imposing for some for the sake of conjugacy.

Lastly, we remark that the parametrization (7) of the spiked covariance matrix models (1) has another interpretation. The sampling model can be equivalently characterized in terms of the latent factor model

(8) |

where , , are -dimensional latent factors, is a factor loading matrix, and , are homoscedastic noisy vectors. Since by our earlier discussion is also sparse, this formulation is related to the sparse Bayesian factor models presented in Bhattacharya and Dunson (2011) and Pati et al. (2014), the differences being the joint sparsity of and prior specifications on . In addition, the latent factor formulation (8

) is convenient for posterior simulation through Markov chain Monte Carlo, as discussed in Section 3.1 of

Bhattacharya and Dunson (2011).## 3 Theoretical properties

### 3.1 Properties of the matrix spike-and-slab LASSO prior

The theoretical properties of the classical spike-and-slab LASSO prior (6) have been partially explored by Ročková (2018) and Ročková and George (2016) in the context of sparse linear models and sparse normal means problems, respectively. It is not clear whether the properties of the spike-and-slab LASSO priors adapt to other statistical context, including sparse spiked covariance matrix models, high-dimensional multivariate regression (Bai and Ghosh, 2018), etc.
In this subsection we present a collection of theoretical properties of the matrix spike-and-slab LASSO prior that not only are useful for deriving posterior contraction under the spiked covariance matrix models, but also may be of independent interest for other statistical tasks, *e.g.*

, sparse Bayesian linear regression with multivariate response

Ning and Ghosal (2018).Let be a matrix, and let be a jointly -sparse matrix with , corresponding to the underlying truth. In the sparse spiked covariance matrix model, represents the scaled eigenvector matrix up to an orthonormal matrix in , but for generality, we do not impose the statistical context in this subsection. A fundamental measure of goodness for various prior models with high dimensionality is the prior mass assignment on a small neighborhood around the true but unknown value of the parameter. This is referred to as the *prior concentration* in the literature of Bayes theory. Formally, we consider the prior probability of the non-centered ball under the prior distribution for small values of .

###### Lemma 2.

Suppose for some fixed positive constants and , and is jointly -sparse, where . Then for small values of with for some , it holds that

for some absolute constant .

We next formally characterize how the matrix spike-and-slab LASSO prior imposes joint sparsity on the columns of using a probabilistic argument. Unlike the classical spike-and-slab prior (5), which allows occurrence of exact zeros in the mean vector with positive probability, the spike-and-slab LASSO prior (6) (the matrix spike-and-slab LASSO prior) is absolutely continuous with respect to the Lebesgue measure on (, respectively), and with probability one. Rather than forcing elements of to be exactly , the matrix spike-and-slab LASSO prior shrinks elements of toward . This behavior suggests the following generalization of the row support of a matrix : for taken to be small, we define . Namely, consists of row indices of whose Euclidean norms are greater than . Intuitively, one should expect that under the matrix spike-and-slab LASSO prior, should be small with large probability. The following lemma formally confirms this intuition.

###### Lemma 3.

Suppose for some fixed positive constants and , . Let be a small number with for some , and let be an integer such that is sufficiently small. Then for any , it holds that

We conclude this section by providing a large deviation inequality for the matrix spike-and-slab LASSO prior.

###### Lemma 4.

Suppose for some fixed positive and , and is jointly -sparse, where , and is sufficiently small. Let and be positive sequences such that and . Then for sufficiently large and for all , it holds that

for some absolute constant .

### 3.2 Posterior contraction for the sparse Bayesian spiked covariance matrix model

We now present the posterior contraction rates for sparse spiked covariance matrix models under the matrix spike-and-slab LASSO prior with respect to various loss functions, which are the main results of this paper. We point out that the posterior contraction rates presented in the following theorem are minimax-optimal as they coincide with (2) and (3).

###### Theorem 2.

Assume the data are independently sampled from with , , , and . Suppose , , and . Let for some positive and , and for some . Then there exists some constants , , and depending on and , and hyperparameters, such that the following posterior contraction for holds for all when is sufficiently large:

(9) |

For each , let be the left-singular vector matrix of . Then the following posterior contraction for holds for all :

(10) |

###### Remark 1.

We briefly compare the posterior contraction rates obtained in Theorem 2 with some related results in the literature. In Pati et al. (2014) the authors consider the posterior contraction with respect to the operator norm loss of the entire covariance matrix, while in Gao and Zhou (2015), the authors consider the posterior contraction with respect to the projection Frobenius norm loss for estimating . In Pati et al. (2014), the notion of sparsity is slightly different than the joint sparsity notion presented here, as they assume that under the latent factor model representation (8), the individual supports of columns of are not necessarily the same. When , the assumption in Pati et al. (2014) coincides with this paper, and our rate is superior to the rate obtained in Pati et al. (2014) by a logarithmic factor. The assumptions in Gao and Zhou (2015) are the same as those in Pati et al. (2014), and in Gao and Zhou (2015) the authors focus on designing a prior that yields rate-optimal posterior contraction with respect to the Frobenius norm loss of the projection matrices as well as adapting to the prior sparsity and the rank . Our result in equation (10), which focuses on the projection operator norm loss, serves as a complement to the rate-optimal posterior contraction for principal subspaces under the joint sparsity assumption in constrast to Gao and Zhou (2015), in which the authors work on the projection Frobenius norm loss.

To derive the posterior contraction rate for the principal subspace with respect to the two-to-infinity norm loss, we need the posterior contraction result for with respect to the stronger matrix infinity norm. These two results are summarized in the following theorem.

###### Theorem 3.

Assume the conditions in Theorem 2 hold. Further assume that the eigenvector matrix exhibits bounded coherence: for some constant , and the number of spikes is sufficiently small in the sense that . Then there exists some constants depending on and , and hyperparameters, such that the following posterior contraction for holds for all when is sufficiently large:

(11) |

For each , let be the left-singular vector matrix of . Then the following posterior contraction for holds for all :

(12) |

where is the Frobenius orthogonal alignment matrix

###### Remark 2.

We also present some remarks concerning the posterior contraction with respect to the two-to-infinity norm loss . In Cape et al. (2018b), the authors show that

meaning that can be coarsely upper bounded by the projection operator norm loss . This naive bound immediately yields

for some large , which is the same as (10).
Our result (12) improves this rate by a factor of and, thus yielding a tighter posterior contraction rate with respect to the two-to-infinity norm loss. In particular, when (*i.e.*, is a “tall and thin” rectangular matrix), the factor can be much smaller than .

The posterior contraction rate (10) also leads to the following risk bound for a point estimator of the principal subspace :

###### Theorem 4.

Assume the conditions in Theorem 2 hold. Let

be the posterior mean of the projection matrix , and set be the orthonormal -frame in with columns being the first eigenvectors corresponding to the first largest eigenvalues of . Then the following risk bound holds for for sufficiently large :

The setup so far is concerned with the case where is known and fixed. When is unknown, Cai et al. (2013) provides a diagonal thresholding method for consistently estimating . In such a setting, the posterior contraction in Theorem 2 reduces to the following weaker version:

###### Corollary 1.

Assume the data are independently sampled from with ,
,
, and .
Suppose , , and , but is unknown and instead is consistently estimated by (*i.e.*, ). Let for some positive and , and for some . Then there exists some large constant , such that the following posterior contraction for holds for all :

For each , let be the left-singular vector matrix of . Then the following posterior contraction for holds for all :

### 3.3 Proof Sketch and Auxiliary Results

Now we sketch the proof of Theorem 2 along with some important auxiliary results. The proof strategy is based on a modification of the standard testing-and-prior-concentration approach, which was originally developed in Ghosal et al. (2000) for proving convergence rates of posterior distributions, and later adopted to a variety of statistical contexts. Specialized to the sparse spiked covariance matrix models, let us consider the posterior contraction for with respect to the operator norm loss as an example. The posterior contraction for with respect to the infinity norm loss can be proved in a similar fashion. Denote , and write the posterior distribution as

(13) |

where

Comments

There are no comments yet.