Minimax Estimation of Bandable Precision Matrices

10/19/2017 ∙ by Addison Hu, et al. ∙ Yale University

The inverse covariance matrix provides considerable insight for understanding statistical models in the multivariate setting. In particular, when the distribution over variables is assumed to be multivariate normal, the sparsity pattern in the inverse covariance matrix, commonly referred to as the precision matrix, corresponds to the adjacency matrix representation of the Gauss-Markov graph, which encodes conditional independence statements between variables. Minimax results under the spectral norm have previously been established for covariance matrices, both sparse and banded, and for sparse precision matrices. We establish minimax estimation bounds for estimating banded precision matrices under the spectral norm. Our results greatly improve upon the existing bounds; in particular, we find that the minimax rate for estimating banded precision matrices matches that of estimating banded covariance matrices. The key insight in our analysis is that we are able to obtain barely-noisy estimates of k × k subblocks of the precision matrix by inverting slightly wider blocks of the empirical covariance matrix along the diagonal. Our theoretical results are complemented by experiments demonstrating the sharpness of our bounds.


1 Introduction

Imposing structure is crucial to performing statistical estimation in the high-dimensional regime where the number of observations can be much smaller than the number of parameters. In estimating graphical models, a long line of work has focused on understanding how to impose sparsity on the underlying graph structure.

Sparse edge recovery is generally not easy for an arbitrary distribution. However, for Gaussian graphical models, it is well-known that the graphical structure is encoded in the inverse of the covariance matrix $\Sigma$, commonly referred to as the precision matrix $\Omega = \Sigma^{-1}$ [lauritzen, Meinshausen06, cai_constrained_2011]. Therefore, accurate recovery of the precision matrix is paramount to understanding the structure of the graphical model. As a consequence, a great deal of work has focused on sparse recovery of precision matrices under the multivariate normal assumption [FriedHasTib2007, cai_estimating_2012, cai2016estimating, Rot09, ren_asymptotic_2015]. Beyond revealing the graph structure, the precision matrix also turns out to be highly useful in a variety of applications, including portfolio optimization, speech recognition, and genomics [lauritzen, YuaLin07, 5947493].

Although there has been a rich literature exploring the sparse precision matrix setting for Gaussian graphical models, less work has emphasized understanding the estimation of precision matrices under additional structural assumptions, with some exceptions for block structured sparsity [hoslee16] or bandability [bicgel11]. One would hope that extra structure should allow us to obtain more statistically efficient solutions. In this work, we focus on the case of bandable precision matrices, which capture a sense of locality between variables. Bandable matrices arise in a number of time-series contexts and have applications in climatology, spectroscopy, fMRI analysis, and astronomy [fri94, vis95, pad16]. For example, in the time-series setting, we may assume that an edge between variables $i$ and $j$ is more likely when $i$ is temporally close to $j$, as is the case in an auto-regressive process; the precision matrix of a Gaussian AR(1) process, for instance, is tridiagonal. The precision and covariance matrices corresponding to distributions with this property are referred to as bandable, or tapering. We will discuss the details of this model in the sequel.

Past work:

Previous work has explored the estimation of both bandable covariance and precision matrices [cai_optimal_2010, pad16]. Closely related work includes the estimation of sparse precision and covariance matrices [cai_constrained_2011, Rot09, cai_estimating_2012]. Asymptotically-normal entrywise precision estimates as well as minimax rates for operator norm recovery of sparse precision matrices have also been established [ren_asymptotic_2015]. A line of work developed concurrently to our own establishes a matching minimax lower bound [leelee17].

When considering an estimation technique, a powerful criterion for evaluating whether the technique performs optimally in terms of convergence rate is minimaxity. Past work has established minimax rates of convergence for sparse covariance matrices, bandable covariance matrices, and sparse precision matrices [CaiZhou10, cai_optimal_2010, cai_estimating_2012, Rot09].

The technique for estimating bandable covariance matrices proposed in [cai_optimal_2010] is shown to achieve the optimal rate of convergence. However, no such theoretical guarantees have been shown for the bandable precision estimator proposed in recent work for estimating sparse and smooth precision matrices that arise from cosmological data [pad16].

Of note is the fact that the minimax rate of convergence for estimating sparse covariance matrices matches that for estimating sparse precision matrices. In this paper, we introduce an adaptive estimator and show that it achieves the optimal rate of convergence when estimating bandable precision matrices over the banded parameter space (3). Satisfyingly, we find that the analogy carries over: just as the sparse case enjoys the same minimax rate for both precision and covariance matrices, the minimax rate of convergence for estimating bandable precision matrices matches the rate for estimating bandable covariance matrices established in the literature [cai_optimal_2010].

Our contributions:

Our goal is to estimate a banded precision matrix $\Omega$ based on $n$ i.i.d. observations. We consider a parameter space of precision matrices with a power law decay structure nearly identical to that of the bandable covariance matrices considered for covariance matrix estimation [cai_optimal_2010]. We present a simple-to-implement algorithm for estimating the precision matrix. Furthermore, we show that the algorithm is minimax optimal with respect to the spectral norm. The upper and lower bounds given in Section LABEL:sect:optimality together imply the following optimal rate of convergence for estimating bandable precision matrices under the spectral norm. Informally, our results show the following bound for recovering a banded precision matrix with bandwidth $k$.

[Informal] The minimax risk for estimating the precision matrix $\Omega$ over the class $\mathcal{P}_\alpha$ given in (3) satisfies

$$\inf_{\hat{\Omega}} \sup_{\mathcal{P}_\alpha} \mathbb{E}\, \big\| \hat{\Omega} - \Omega \big\|^2 \asymp \min\left\{ n^{-\frac{2\alpha}{2\alpha+1}} + \frac{\log p}{n},\ \frac{p}{n} \right\}, \qquad (1)$$

where this bound is achieved by the tapering estimator defined in Equation (7). An important point to note, which is shown more precisely in the sequel, is that the rate of convergence as compared to sparse precision matrix recovery is improved by a factor of the bandwidth $k$.

We establish a minimax upper bound by detailing an algorithm for obtaining an estimator given $n$ observations and a pre-specified bandwidth $k$, and by studying the resultant estimator's risk properties under the spectral norm. We show that an estimator using our algorithm with the optimal choice of bandwidth attains the minimax rate of convergence with high probability.

To establish the optimality of our estimation routine, we derive a minimax lower bound to show that the rate of convergence cannot be improved beyond that of our estimator. The lower bound is established by constructing subparameter spaces of (3) and applying testing arguments through Le Cam’s method and Assouad’s Lemma [Yu97, cai_optimal_2010].

To supplement our analysis, we conduct numerical experiments to explore the performance of our estimator in the finite sample setting. The numerical experiments confirm that even in the finite sample case, our proposed estimator exhibits the minimax rate of convergence.

The remainder of the paper is organized as follows. In Section 2, we detail the exact model setting and introduce a blockwise inversion technique for precision matrix estimation. In Section LABEL:sect:optimality, theorems establishing the minimaxity of our estimator under the spectral norm are presented. An upper bound on the estimator’s risk is given in high probability with the help of a result from set packing. The minimax lower bound is derived by way of a testing argument. Both bounds are accompanied by their proofs. Finally, in Section LABEL:sect:simulations, our estimator is subjected to numerical experiments. Owing to space constraints, proofs for auxiliary lemmas may be found in Appendix LABEL:aux-lemma-proofs.

Notation:

We will now collect notation that will be used throughout the remaining sections. Vectors will be denoted as lower-case $v$, while matrices are upper-case $M$. The spectral or operator norm of a matrix $M$ is defined to be $\|M\| = \sup_{\|x\|_2 = 1} \|Mx\|_2$, while the $\ell_1$ matrix norm of a symmetric matrix $M$ is defined to be $\|M\|_1 = \max_j \sum_i |m_{ij}|$.
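Both norms are straightforward to evaluate numerically. The following small NumPy sketch (our own, for reference under the definitions above) computes each:

```python
import numpy as np

def spectral_norm(M):
    # Largest singular value: sup over ||x||_2 = 1 of ||Mx||_2.
    return np.linalg.norm(M, ord=2)

def l1_matrix_norm(M):
    # Maximum absolute column sum; for symmetric M this upper-bounds the spectral norm.
    return np.abs(M).sum(axis=0).max()
```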

2 Background and problem set-up

In this section we present details of our model and the estimation procedure. If one considers observations $X_1, \ldots, X_n \in \mathbb{R}^p$ drawn from a distribution with precision matrix $\Omega$ and zero mean, the goal then is to estimate the unknown matrix $\Omega$ based on the observations $\{X_l\}_{l=1}^n$. Given a random sample of $p$-variate observations drawn from a multivariate distribution with population covariance $\Sigma = \Omega^{-1}$, our procedure is based on a tapering estimator derived from blockwise estimates of the precision matrix $\Omega$.

The maximum likelihood estimator of $\Sigma$ is

$$\hat{\Sigma} = \frac{1}{n} \sum_{l=1}^{n} (X_l - \bar{X})(X_l - \bar{X})^\top, \qquad (2)$$

where $\bar{X} = \frac{1}{n} \sum_{l=1}^{n} X_l$ is the empirical mean of the vectors $X_l$. We will construct estimators of the precision matrix by inverting blocks of $\hat{\Sigma}$ along the diagonal, and averaging over the resultant subblocks.
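As a concrete illustration, the MLE (2) takes a few lines of NumPy; the function and variable names here are ours:

```python
import numpy as np

def covariance_mle(X):
    """Maximum likelihood covariance estimate from an (n, p) sample matrix.

    Uses the 1/n normalization of Equation (2), not the unbiased 1/(n-1)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)   # subtract the empirical mean from each row
    return (Xc.T @ Xc) / n
```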

Throughout this paper we adhere to the convention that $m_{ij}$ refers to the element in the $i$th row and $j$th column of a matrix $M$. Consider the parameter space $\mathcal{P}_\alpha$, with associated probability measures $\mathbb{P}_\Omega$, given by:

$$\mathcal{P}_\alpha = \left\{ \Omega : \max_j \sum_{i : |i - j| > k} |\omega_{ij}| \le M k^{-\alpha} \ \text{ for all } k, \ \text{ and } \ 0 < M_0^{-1} \le \lambda_i(\Omega) \le M_0 < \infty \ \text{ for all } i \right\} \qquad (3)$$

where $\lambda_i(\Omega)$ denotes the $i$th eigenvalue of $\Omega$. We also constrain $p \ge n^{1/(2\alpha+1)}$, as in [cai_optimal_2010]. Observe that this parameter space is nearly identical to that given in Equation (3) of [cai_optimal_2010]. We take on an additional assumption on the minimum eigenvalue of $\Omega$, which is used in the technical arguments where the risk of estimating $\Omega$ under the spectral norm is bounded in terms of the error of estimating $\Sigma$.

Observe that the parameter space intuitively dictates that the magnitude of the entries of $\Omega$ decays according to a power law as we move away from the diagonal. As with the parameter space for bandable covariance matrices given in [cai_optimal_2010], we may understand $\alpha$ in (3) as a rate of decay for the precision entries as they move away from the diagonal; it can also be understood in terms of the smoothness parameter in nonparametric estimation [Tsybakov:2008:INE:1522486]. As will be discussed in Section LABEL:sect:optimality, the optimal choice of the bandwidth $k$ depends on both $n$ and the decay rate $\alpha$.
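For intuition, one can generate a member of a class like (3) by imposing power law decay on the off-diagonals and then shifting the diagonal so the minimum eigenvalue stays bounded away from zero. The construction below is our own, purely illustrative, and not one used in the paper:

```python
import numpy as np

def bandable_precision(p, alpha, rho=0.5, lam_min=0.1):
    """Illustrative bandable precision matrix: off-diagonal entries decay like
    |i - j|^{-(alpha + 1)}, so column tails beyond bandwidth k sum to O(k^{-alpha})."""
    D = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
    np.fill_diagonal(D, 1.0)            # avoid 0 ** negative on the diagonal
    Omega = rho * D ** -(alpha + 1.0)
    np.fill_diagonal(Omega, 1.0)
    # Enforce the eigenvalue condition by shifting the diagonal if necessary.
    shortfall = lam_min - np.linalg.eigvalsh(Omega).min()
    if shortfall > 0:
        Omega += shortfall * np.eye(p)
    return Omega
```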

2.1 Estimation procedure

We now detail the algorithm for obtaining minimax estimates for bandable $\Omega$, which is also given as pseudo-code in Algorithm 1. In the pseudo-code, we adhere to the NumPy conventions (1) that arrays are zero-indexed, (2) that slicing an array arr with the operation arr[a:b] includes the element indexed at a and excludes the element indexed at b, and (3) that if b is greater than the length of the array, only elements up to the terminal element are included, with no errors.
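Convention (3) mirrors NumPy's native clipping of slice bounds, which is what keeps the corner cases in Algorithm 1 tidy; a tiny check:

```python
import numpy as np

arr = np.arange(5)       # [0, 1, 2, 3, 4]
print(arr[2:4])          # [2 3]  -> index a=2 included, index b=4 excluded
print(arr[3:100])        # [3 4]  -> b past the end is clipped, with no error
# Caution: a negative *start* index would wrap around in NumPy, so lower
# bounds must be clamped at zero explicitly rather than left to slicing.
```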

The algorithm is inspired by the tapering procedure introduced by Cai, Zhang, and Zhou [cai_optimal_2010] in the case of covariance matrices, with modifications in order to estimate the precision matrix. Estimating the precision matrix introduces new difficulties, as we do not have direct access to estimates of the elements of the precision matrix. For a given integer $k$, we construct a tapering estimator as follows. First, we calculate the maximum likelihood estimator $\hat{\Sigma}$ for the covariance, as given in Equation (2). Then, for all integers $m$ and $l$, we define the matrices $\hat{\Sigma}^{(m)}_l \in \mathbb{R}^{p \times p}$ with square blocks of size at most $2m$ along the diagonal:

$$\hat{\Sigma}^{(m)}_l = \Big( \hat{\sigma}_{ij} \, \mathbf{1}\{ l - m/2 \le i, j < l + 3m/2 \} \Big). \qquad (4)$$

For each $l$, we replace the nonzero block with its inverse to obtain $\tilde{\Omega}^{(m)}_l$. For a given $l$, we refer to the individual entries of this intermediate matrix as follows:

$$\tilde{\Omega}^{(m)}_l = \big( \tilde{\omega}_{ij} \big). \qquad (5)$$

For each $l$, we then keep only the central (at most) $m \times m$ subblock of $\tilde{\Omega}^{(m)}_l$ to obtain the blockwise estimate $\hat{\Omega}^{(m)}_l$:

$$\hat{\Omega}^{(m)}_l = \big( \tilde{\omega}_{ij} \, \mathbf{1}\{ l \le i, j < l + m \} \big). \qquad (6)$$

Note that this notation allows for $l < 0$ and $l + m > p$; in each case, this out-of-bounds indexing allows us to cleanly handle corner cases where the subblocks are smaller than $m \times m$.

For a given bandwidth $k$ (assume $k$ is divisible by 2), we calculate these blockwise estimates for both $m = k$ and $m = k/2$. Finally, we construct our estimator by averaging over the block matrices:

$$\hat{\Omega} = \frac{2}{k} \left( \sum_{l} \hat{\Omega}^{(k)}_l - \sum_{l} \hat{\Omega}^{(k/2)}_l \right). \qquad (7)$$

We note that within $k/2$ entries of the diagonal, each entry of the sum in (7) is effectively the sum of $k/2$ blockwise estimates, and as we move from $k/2$ to $k$ entries from the diagonal, each entry is progressively the sum of one fewer estimate.

Therefore, within $k/2$ of the diagonal, the entries are not tapered; and from $k/2$ to $k$ of the diagonal, the entries are linearly tapered to zero. The analysis of this estimator makes careful use of this tapering schedule and of the fact that our estimator is constructed through the average of block matrices of size at most $2k$.

function FitBlockwise($\hat{\Sigma}$, k)
     $\hat{\Omega} \leftarrow 0_{p \times p}$
     for $l \in \{-k+1, \ldots, p-1\}$ do
          $\hat{\Omega} \leftarrow \hat{\Omega} + \text{BlockInverse}(\hat{\Sigma}, k, l)$
     end for
     for $l \in \{-k/2+1, \ldots, p-1\}$ do
          $\hat{\Omega} \leftarrow \hat{\Omega} - \text{BlockInverse}(\hat{\Sigma}, k/2, l)$
     end for
     return $(2/k)\, \hat{\Omega}$
end function
function BlockInverse($\hat{\Sigma}$, m, l)
      ▷ Obtain block inverse.
     $a \leftarrow \max(l - m/2,\ 0)$
     $b \leftarrow l + 3m/2$
     $T \leftarrow \text{inv}(\hat{\Sigma}[a:b,\ a:b])$
      ▷ Preserve central block of inverse.
     $c \leftarrow \max(l, 0) - a$;  $d \leftarrow l + m - a$
     $C \leftarrow T[c:d,\ c:d]$
      ▷ Restore block inverse to appropriate indices.
     $\hat{\Omega}_l \leftarrow 0_{p \times p}$
     $s \leftarrow \max(l, 0)$
     $\hat{\Omega}_l[s : l+m,\ s : l+m] \leftarrow C$
     return $\hat{\Omega}_l$
end function
Algorithm 1 Blockwise Inversion Technique
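For concreteness, the following is a minimal NumPy sketch of the blockwise inversion technique. It assumes the block windows as reconstructed in Equations (4)-(7), with padding of $m/2$ on each side of the kept central block; the function and variable names are ours, and this is an illustrative sketch rather than a transcription of the authors' code.

```python
import numpy as np

def block_inverse(S, m, l):
    """One blockwise estimate: invert a widened diagonal block of S and keep
    only the central (at most) m x m subblock, per Equations (4)-(6)."""
    p = S.shape[0]
    a = max(l - m // 2, 0)                 # widened block start, clamped at 0
    b = min(l + 3 * m // 2, p)             # widened block end, clipped at p
    T = np.linalg.inv(S[a:b, a:b])         # invert the widened block
    s, e = max(l, 0), min(l + m, p)        # central block, in global indices
    out = np.zeros((p, p))
    out[s:e, s:e] = T[s - a:e - a, s - a:e - a]
    return out

def fit_blockwise(S, k):
    """Tapering estimator (7): combine blockwise inverses at widths k and k/2.

    Assumes k is even, matching the text."""
    p = S.shape[0]
    O = np.zeros((p, p))
    for l in range(-k + 1, p):
        O += block_inverse(S, k, l)
    for l in range(-k // 2 + 1, p):
        O -= block_inverse(S, k // 2, l)
    return (2.0 / k) * O
```

With $\hat{\Sigma}$ computed as in Equation (2), fit_blockwise(Sigma_hat, k) returns the tapering estimate: entries within $k/2$ of the diagonal receive full weight, and entries between $k/2$ and $k$ are linearly tapered to zero, matching the schedule described above.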

2.2 Implementation details

The naive algorithm performs $O(p)$ inversions of square matrices with size at most $2k$. This method can be sped up considerably through an application of the Woodbury matrix identity and the Schur complement relation [MR0038136, Boyd02]. Doing so reduces the computational complexity of the algorithm from $O(pk^3)$ to $O(pk^2)$. We discuss the details of the modified algorithm and its computational complexity below.
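To illustrate the kind of update involved, the sketch below grows a known inverse by one row and column via the Schur complement, and shrinks it back using the same identity in reverse. This is our own illustrative code, assuming the blocks are symmetric positive definite, not the authors' exact implementation:

```python
import numpy as np

def grow_inverse(A_inv, u, c):
    """Given A_inv = inv(A), return inv([[A, u], [u.T, c]]) via the Schur complement."""
    v = A_inv @ u                          # A^{-1} u
    s = c - u @ v                          # scalar Schur complement
    top_left = A_inv + np.outer(v, v) / s
    top_right = -v[:, None] / s
    return np.block([[top_left, top_right],
                     [top_right.T, np.array([[1.0 / s]])]])

def shrink_inverse(M_inv):
    """Given inv([[A, u], [u.T, c]]), recover inv(A) without re-inverting."""
    E, f, g = M_inv[:-1, :-1], M_inv[:-1, -1], M_inv[-1, -1]
    return E - np.outer(f, f) / g
```

Sliding the inverted window from $l$ to $l+1$ then amounts to one shrink (dropping the departing index) followed by one grow (appending the arriving index), each costing $O(k^2)$ instead of an $O(k^3)$ re-inversion.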

Suppose we have $\tilde{\Omega}^{(m)}_l$ and are interested in obtaining $\tilde{\Omega}^{(m)}_{l+1}$. We observe that the nonzero block of $\tilde{\Omega}^{(m)}_{l+1}$ corresponds to the inverse of the nonzero block of $\hat{\Sigma}^{(m)}_{l+1}$, which only differs by one row and one column from $\hat{\Sigma}^{(m)}_l$, the matrix for which the inverse of the nonzero block corresponds to $\tilde{\Omega}^{(m)}_l$, which we have already computed. We may understand the movement from $\hat{\Sigma}^{(m)}_l$ (to which we already have direct access) to $\hat{\Sigma}^{(m)}_{l+1}$ as two rank-1 updates. Let us view the nonzero blocks of $\hat{\Sigma}^{(m)}_l$ and $\hat{\Sigma}^{(m)}_{l+1}$ as the block matrices: