Precision Matrix Estimation with Noisy and Missing Data

04/07/2019 · Roger Fan, et al.

Estimating conditional dependence graphs and precision matrices are some of the most common problems in modern statistics and machine learning. When data are fully observed, penalized maximum likelihood-type estimators have become standard tools for estimating graphical models under sparsity conditions. Extensions of these methods to more complex settings where data are contaminated with additive or multiplicative noise have been developed in recent years. In these settings, however, the relative performance of different methods is not well understood and algorithmic gaps still exist. In particular, in high-dimensional settings these methods require using non-positive semidefinite matrices as inputs, presenting novel optimization challenges. We develop an alternating direction method of multipliers (ADMM) algorithm for these problems, providing a feasible algorithm to estimate precision matrices with indefinite input and potentially nonconvex penalties. We compare this method with existing alternative solutions and empirically characterize the tradeoffs between them. Finally, we use this method to explore the networks among US senators estimated from voting records data.


1 Introduction

Undirected graphs are often used to describe high-dimensional distributions. Under sparsity conditions, these graphs can be estimated using penalized methods such as

$$\hat{\Theta} \in \operatorname*{arg\,min}_{\Theta \succ 0}\ \operatorname{tr}(\hat{\Gamma}\Theta) - \log\det(\Theta) + \sum_{j \neq k} \rho_\lambda(\Theta_{jk}), \qquad (1)$$

where $\hat{\Gamma}$ is the sample covariance or correlation matrix and $\rho_\lambda$ is a separable (entry-wise) sparsity-inducing penalty function. Although this approach has proven successful in a variety of application areas such as neuroscience and genomics, its soundness hinges on the positive semidefiniteness (PSD) of $\hat{\Gamma}$. If $\hat{\Gamma}$ is indefinite, the objective may be unbounded from below.

In order to ensure this penalized M-estimator is well-behaved, Loh and Wainwright (2015) impose a side constraint of the form $g(\Theta) \le R$, where $g$ is a convex function. Here we focus on the estimator using the operator norm as a side constraint:

$$\hat{\Theta} \in \operatorname*{arg\,min}_{\Theta \succ 0,\ \|\Theta\|_2 \le R}\ \operatorname{tr}(\hat{\Gamma}\Theta) - \log\det(\Theta) + \sum_{j \neq k} \rho_\lambda(\Theta_{jk}). \qquad (2)$$

Loh and Wainwright (2017) adopt this approach and establish the superior statistical properties of this constrained estimator. Their results suggest that the addition of a side constraint is not only sufficient but also almost necessary to effectively untangle the aforementioned complications.

Unfortunately, this additional constraint precludes using existing methods to solve the penalized objective with non-PSD input. To close this gap, we develop an alternating direction method of multipliers (ADMM) algorithm to implement (2) efficiently. We conduct empirical studies comparing this new method to several other precision matrix estimators. Our simulation study reveals several trends that are not present in the fully observed case. Finally, we illustrate the performance of our methods in analyzing the US senate voting data, uncovering both known and novel phenomena from the modern political landscape.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of existing related work and describe in detail the optimization issues that arise from indefinite inputs and nonconvex penalties. In Section 3, we present the proposed ADMM algorithm and establish some convergence results. Section 4 provides numerical examples and comparisons. Section 5 presents an exploratory analysis of US Senate voting records data using this method and details several interesting conclusions that can be drawn from the estimated graphs. Finally, we summarize the empirical results and their practical implications regarding choice of method in Section 6.

2 Problem formulation and existing work

There is a wide body of work proposing methods to perform precision matrix estimation in the fully observed case, including Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Rothman et al. (2008), Friedman et al. (2008), Banerjee et al. (2008), and Zhou et al. (2010), most of which essentially take the $\ell_1$-penalized likelihood approach (1), which we will refer to as the graphical Lasso.

Recent work has focused on using nonconvex regularizers such as SCAD and MCP for model selection in the regression setting (Fan and Li, 2001; Zhang, 2010; Breheny and Huang, 2011; Zhang and Zhang, 2012). Loh and Wainwright (2015, 2017) extend this analysis to general M-estimators, including variants of the graphical Lasso objective, and show their statistical convergence and support recovery properties. Estimators with these penalties have been shown to attain model selection consistency under weaker theoretical conditions, but they require more sophisticated optimization algorithms, such as the local linear approximation (LLA) method of Fan et al. (2014).

In a fully observed and noiseless setting, $\hat{\Gamma}$ is the sample covariance matrix and is guaranteed to be at least positive semidefinite. Then, if $\rho_\lambda$ is the $\ell_1$ penalty, the objective of (1) is convex and bounded from below. In this setting, one can show that a unique optimum exists with bounded eigenvalues and that the iterates of any descent algorithm will also have bounded eigenvalues (for example, see Lemma 2 in Hsieh et al., 2014).

When working with missing, corrupted, and dependent data, the likelihood is nonconvex, and the expectation-maximization (EM) algorithm has traditionally been used to perform statistical inference. However, in these noisy settings, the convergence of the EM algorithm is difficult to guarantee and is often slow in practice. For instance, Städler and Bühlmann (2012) implement a likelihood-based method for inverse covariance estimation with missing values, but their EM algorithm requires solving a full graphical Lasso optimization problem in each M-step.

An alternative approach is to develop -estimators that account for missing and corrupted data. For graphical models, Loh and Wainwright (2015) establish that the graphical Lasso, including a version using nonconvex penalties, can be modified to accommodate noisy or missing data by adjusting the sample covariance estimate.

These modified estimators depend on the observation that statistical theory for the graphical Lasso generally requires only that the error of the input covariance estimate converges to zero at a sufficiently fast rate (e.g. Rothman et al., 2008; Zhou et al., 2010; Loh and Wainwright, 2017). When considering missing or corrupted data, it is often possible to construct covariance estimates that satisfy this convergence criterion but are not necessarily positive semidefinite. In fact, in high-dimensional settings $\hat{\Gamma}$ may even be guaranteed to be indefinite. Attempting to input these indefinite covariance estimates into the graphical Lasso, however, presents novel optimization issues.

Unbounded objective. When attempting to move beyond the $\ell_1$-penalized case with positive semidefinite input, the problem in (1) becomes unbounded from below, so an optimum may not necessarily exist. This issue comes from two potential sources: 1) negative eigenvalues in $\hat{\Gamma}$, or 2) zero eigenvalues combined with the boundedness of the nonconvex penalty $\rho_\lambda$. For example, consider the restriction of the objective in (1) to a ray defined by an eigenvalue-eigenvector pair $(\mu, v)$ of $\hat{\Gamma}$, say $\Theta(t) = I + t\, v v^\top$ for $t \ge 0$:

$$h(t) = \operatorname{tr}(\hat{\Gamma}) + t\mu - \log(1 + t) + \sum_{j \neq k} \rho_\lambda(t\, v_j v_k). \qquad (3)$$

If $\mu < 0$, we see that $h$ is unbounded from below due to the $t\mu$ and $-\log(1+t)$ terms. In fact, if $\mu = 0$ and $\rho_\lambda$ is bounded from above, as is the case when using standard nonconvex penalties, the objective is also unbounded from below.

So unboundedness can occur any time the input matrix has a negative eigenvalue, or whenever it has zero eigenvalues combined with a bounded nonconvex penalty function $\rho_\lambda$. Unboundedness creates optimization issues, as an optimum no longer necessarily exists.

Handling unboundedness. In order to guarantee that an optimum exists for (1), an additional constraint of the form $g(\Theta) \le R$ can be imposed, where $g$ is some convex function. In this paper, we consider the estimator (2), which uses a side constraint of the form $\|\Theta\|_2 \le R$. Loh and Wainwright (2017) derive the rates of convergence of this estimator and show that it can attain model selection consistency and spectral norm convergence without the incoherence assumption when used with a nonconvex penalty (see Appendix E therein), but they do not discuss implementation or optimization aspects of the problem.

To our knowledge, there is currently no feasible optimization algorithm for the estimator defined in (2), particularly when the input is indefinite. Loh and Wainwright (2015) present a composite gradient descent method for optimizing a subset of side-constrained versions of (1). However, their algorithm requires a side constraint of a particular form that does not include the spectral norm constraint, and it therefore cannot attain the stronger theoretical results that the spectral norm constraint achieves (Section C.7 compares the performance of different side constraints). It may be possible to develop heuristic algorithms that alternate between a proximal gradient update ignoring the side constraint and a projection onto the constraint set, but as far as we know there has not been any analysis of algorithms of this type (we discuss this in more detail in Section C.4).

An alternative approach to resolving this unboundedness is to project the input matrix onto the positive semidefinite cone before inputting it into (1). We discuss this further in Section 4.1, but this only resolves the unboundedness when using the $\ell_1$ penalty; nonconvex penalties still require a side constraint to have a bounded objective, and therefore our algorithm remains useful even for the projected methods.

3 ADMM Algorithm

Our algorithm is similar to the algorithm in Guo and Zhang (2017), which applies ADMM to the closely related problem of condition number-constrained sparse precision matrix estimation using the same splitting scheme as below. We discuss their method in more detail in Section A.6. The following algorithm is specialized to the case where the spectral norm is used as the side constraint. In Section B we derive a similar ADMM algorithm that can be used for any side constraint with a computable projection operator.

Rewrite the objective from (2) as

$$\min_{\Theta}\ f(\Theta) + g(\Theta), \qquad (4)$$

where $f(\Theta) = \sum_{j \neq k} \rho_\lambda(\Theta_{jk})$ and $g(\Theta) = \operatorname{tr}(\hat{\Gamma}\Theta) - \log\det(\Theta) + \mathbb{1}\{\Theta \succ 0,\ \|\Theta\|_2 \le R\}$.

Let $\alpha > 0$ be a penalty parameter and let $\operatorname{prox}_{\rho_\lambda/\alpha}$ be the element-wise prox operator of the penalty scaled by $1/\alpha$. We derive these updates for SCAD and MCP in Section A.2. Let $\mathcal{T}_R^{\alpha}$ be the following operator, which solves the prox step for $g$ and which we derive in Section A.3,

$$\mathcal{T}_R^{\alpha}(C) = \operatorname*{arg\,min}_{\Psi \succ 0,\ \|\Psi\|_2 \le R}\ \Big\{-\log\det(\Psi) + \frac{\alpha}{2}\|\Psi - C\|_F^2\Big\},$$

where $C = V\operatorname{diag}(\gamma)V^\top$ is the eigendecomposition of the operator's input. Then the ADMM algorithm for solving (4), which we derive in Section A.2, is described in Algorithm 1. Computationally this algorithm is dominated by the eigendecomposition used to evaluate $\mathcal{T}_R^{\alpha}$, and therefore has a per-iteration complexity of $O(p^3)$, which matches the scaling of other graphical Lasso solvers (e.g. Meinshausen and Bühlmann, 2006; Friedman et al., 2008; Hsieh et al., 2014).

Input: $\hat{\Gamma}$, $\lambda$, $R$, $\alpha$
Output: $\hat{\Theta}$
Initialize $\Psi^0$ and the scaled dual variable $U^0$;
while not converged do
       $\Theta^{k+1} \leftarrow \operatorname{prox}_{\rho_\lambda/\alpha}\big(\Psi^{k} - U^{k}\big)$
       $\Psi^{k+1} \leftarrow \mathcal{T}_R^{\alpha}\big(\Theta^{k+1} + U^{k} - \hat{\Gamma}/\alpha\big)$
       $U^{k+1} \leftarrow U^{k} + \Theta^{k+1} - \Psi^{k+1}$
end while
Algorithm 1 ADMM for graphical Lasso with a side constraint
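As a concrete illustration, the following is a minimal NumPy sketch of Algorithm 1 specialized to the $\ell_1$ penalty. The function names (prox_logdet_constrained, admm_glasso_constrained) and the parameterization (ADMM penalty parameter alpha, scaled dual variable U) are our own choices for exposition, and the closed-form eigenvalue update mirrors the derivation sketched in Section A.3; it is a sketch, not a definitive implementation.

```python
import numpy as np

def soft_threshold(Z, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def prox_logdet_constrained(C, alpha, R):
    """Minimize -log det(Psi) + (alpha/2)||Psi - C||_F^2 over {Psi > 0, ||Psi||_2 <= R}.
    The problem separates over the eigenvalues of C; each one is solved with the
    quadratic formula and then truncated at R (cf. Section A.3)."""
    gamma, V = np.linalg.eigh(C)
    x = (gamma + np.sqrt(gamma ** 2 + 4.0 / alpha)) / 2.0  # unconstrained root
    x = np.minimum(x, R)                                   # enforce the spectral bound
    return (V * x) @ V.T

def admm_glasso_constrained(Gamma_hat, lam, R, alpha=1.0, max_iter=500, tol=1e-6):
    """Sketch of Algorithm 1 with an l1 penalty on the off-diagonal entries.
    Gamma_hat may be indefinite; R is the spectral-norm side constraint."""
    p = Gamma_hat.shape[0]
    Psi = np.eye(p)
    U = np.zeros((p, p))                    # scaled dual variable
    off_diag = ~np.eye(p, dtype=bool)
    for _ in range(max_iter):
        # Theta-update: element-wise prox of the scaled penalty
        Z = Psi - U
        Theta = Z.copy()
        Theta[off_diag] = soft_threshold(Z[off_diag], lam / alpha)
        # Psi-update: eigenvalue prox applied to the shifted argument
        Psi = prox_logdet_constrained(Theta + U - Gamma_hat / alpha, alpha, R)
        # dual update
        U = U + Theta - Psi
        if np.linalg.norm(Theta - Psi) < tol * p:   # primal residual check
            break
    return Psi
```

Swapping soft_threshold for one of the SCAD or MCP prox operators derived in Section A.2 gives the nonconvex-penalty variants.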

3.1 Convergence

The following proposition applies standard results on the convergence of ADMM for convex problems to show convergence when a convex penalty such as the $\ell_1$ norm is used. Details are in Section A.4.

Proposition 1.

If the penalty is convex and satisfies the conditions in Section A.1, Algorithm 1 converges to a global minimum of (4).

Remark.

Regarding the nonconvex penalty, recent work has established ADMM convergence results in some nonconvex settings (see Hong et al., 2016; Wang et al., 2015), but to our knowledge there is no convergence result that encompasses this nonsmooth and nonconvex application. We can show convergence if a fairly strong assumption is made on the iterates, but we are currently working on extending existing results to this case.

Proposition 2 shows that any limit point of Algorithm 1 is a stationary point of the original objective (4). This is proved in Section A.5. When using the $\ell_1$ penalty, or a nonconvex penalty whose weak convexity constant is small enough relative to the curvature of the log-determinant term on the constraint set, the objective is convex and therefore any stationary point is unique and is also the global optimum. See Section C.5 for a more detailed discussion.

Proposition 2.

Assume that the penalty $\rho_\lambda$ satisfies the conditions in Section A.1. Then any limit point of the ADMM algorithm defined in Algorithm 1 is also a stationary point of the objective as defined in (4).

The assumptions on $\rho_\lambda$ in Section A.1 are the same as those assumed in Loh and Wainwright (2015, 2017), and are satisfied by the $\ell_1$ (Lasso), SCAD, and MCP penalties.

Note that if a limit point exists when using a nonconvex penalty, the result in Proposition 2 still holds. Empirically we find that the algorithm performs well and converges consistently when used with nonconvex penalties, but there is no existing theoretical guarantee that a limit point of ADMM will exist in that setting.

4 Simulations

We evaluate the proposed estimators using the relative Frobenius norm error and the sum of the false positive rate and false negative rate (FPR + FNR). We present results over a range of penalty levels $\lambda$, noting that all the compared methods would use similar techniques to perform model tuning. Section C.1 presents an example of how to use BIC or cross-validation to tune these methods. We present results using covariance matrices from auto-regressive and Erdős-Rényi random graph models. See Section C for descriptions of these models as well as additional simulation results.
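For reference, the two evaluation metrics can be computed as in the following sketch; the function names are ours, and the support is determined by comparing absolute entries to a small tolerance.

```python
import numpy as np

def relative_frobenius_error(Theta_hat, Theta_true):
    """Relative Frobenius norm error ||Theta_hat - Theta||_F / ||Theta||_F."""
    return np.linalg.norm(Theta_hat - Theta_true) / np.linalg.norm(Theta_true)

def fpr_plus_fnr(Theta_hat, Theta_true, tol=1e-8):
    """Sum of false positive and false negative rates on the off-diagonal support."""
    p = Theta_true.shape[0]
    off = ~np.eye(p, dtype=bool)
    est = np.abs(Theta_hat[off]) > tol
    truth = np.abs(Theta_true[off]) > tol
    fpr = np.logical_and(est, ~truth).sum() / max((~truth).sum(), 1)
    fnr = np.logical_and(~est, truth).sum() / max(truth.sum(), 1)
    return fpr + fnr
```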

4.1 Alternative methods

When faced with indefinite input, there are two alternative graphical Lasso-style estimators that can be used besides (2), which involve either projection to the positive semidefinite cone or nodewise regression in the style of Meinshausen and Bühlmann (2006).

Projection. Given an indefinite input matrix $\hat{\Gamma}$, Park (2016) and Greenewald et al. (2017) propose projecting $\hat{\Gamma}$ onto the positive semidefinite cone and then using the projected matrix as the input to the optimization problem (1). This is similar to the projection done in Datta and Zou (2017). In terms of the upper bound on statistical convergence rates, this method pays only a constant factor cost, though in practice projection may result in a loss of information and therefore a decrease in efficiency.

After projecting the input, existing algorithms can be used to optimize (1) with the $\ell_1$ penalty. However, as mentioned in Section 2, using a nonconvex penalty still leads to an unbounded objective and therefore still requires our ADMM algorithm to solve (2).

Nodewise regression. Loh and Wainwright (2012) and Rudelson and Zhou (2017) both study the statistical and computational convergence properties of using errors-in-variables regression to handle indefinite input matrices in high-dimensional settings. Following the nodewise regression ideas of Meinshausen and Bühlmann (2006) and Yuan (2010), we can regress each variable on the remaining variables using Lasso-type errors-in-variables programs with column-wise side constraints, as in (5), to obtain coefficient estimates $\hat{\beta}^{(j)}$ and residual variance estimates $\hat{\sigma}_j^2$, and then combine them to get $\hat{\Theta}$ with $\hat{\Theta}_{jj} = 1/\hat{\sigma}_j^2$ and $\hat{\Theta}_{jk} = -\hat{\beta}^{(j)}_k/\hat{\sigma}_j^2$. Finally, we symmetrize the result to obtain $\tilde{\Theta} = \operatorname*{arg\,min}_{Q \in \mathcal{S}^p} \|Q - \hat{\Theta}\|$, where $\mathcal{S}^p$ is the set of symmetric matrices.

These types of nodewise estimators have gained popularity as they require less restrictive incoherence conditions to attain model selection consistency and often perform better in practice in the fully observed case. They have not, however, been as well studied when used with indefinite input.
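To make the construction concrete, the sketch below implements the fully observed Meinshausen-Bühlmann-style version of this estimator using ordinary Lasso regressions from scikit-learn; the estimator studied here would instead replace each regression with the errors-in-variables program referenced in (5), which accepts indefinite covariance input. The conversion of coefficients and residual variances into precision entries and the symmetrization by averaging are standard choices for illustration, not necessarily the exact ones used in this paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_precision(X, lam):
    """Fully observed nodewise (neighborhood regression) precision estimate:
    regress each column on the others with a Lasso, convert the coefficients
    and residual variance into a row of the precision matrix, then symmetrize."""
    n, p = X.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        others = [k for k in range(p) if k != j]
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, others], X[:, j])
        resid = X[:, j] - X[:, others] @ fit.coef_
        sigma2 = resid @ resid / n              # residual variance estimate
        Theta[j, j] = 1.0 / sigma2
        Theta[j, others] = -fit.coef_ / sigma2  # Theta_jk = -beta_k / sigma_j^2
    return (Theta + Theta.T) / 2.0              # project onto symmetric matrices
```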

4.2 Data models

We test these methods on two models that result in indefinite covariance estimators, the non-separable Kronecker sum model from Rudelson and Zhou (2017) and the missing data graphical model described in Loh and Wainwright (2015). In the main paper we focus on the missing data model, but Section C contains a detailed description of the Kronecker sum model as well as simulation results using it.

(a) $\ell_1$ penalty, (b) MCP, (c) $\ell_1$ penalty, (d) MCP
Figure 1: Convergence of the ADMM algorithm for several initializations. Blue lines show the relative optimization error, measured against the result of running our algorithm to convergence, while red lines show the statistical error relative to the true precision matrix. All panels use the same covariance model and penalty level. The left panels use an $\ell_1$ penalty, while the right panels use MCP. $R$ is set to be three times the oracle spectral norm.

Missing data (MD). As discussed above, Loh and Wainwright (2013, 2015) propose an estimator for a graphical model with missing-completely-at-random observations.

Let $X \in \mathbb{R}^{n \times p}$ be a mean-zero subgaussian random matrix. Let $W \in \{0,1\}^{n \times p}$ have independent Bernoulli entries, $W_{ij} \sim \operatorname{Bernoulli}(p_j)$, independent of $X$. This corresponds to entries of the $j$th column of the data matrix being observed with probability $p_j$. Then we have an unobserved matrix $X$ and an observed matrix $Z = X \odot W$, where $\odot$ denotes the Hadamard, or element-wise, product. Here the covariance estimate for $\operatorname{Cov}(X)$ is

$$\hat{\Gamma} = \frac{1}{n} Z^\top Z \oslash M, \qquad M_{jk} = \begin{cases} p_j p_k & j \neq k, \\ p_j & j = k, \end{cases} \qquad (6)$$

where $\oslash$ denotes element-wise division. As we divide off-diagonal entries by smaller values, $\hat{\Gamma}$ will not necessarily be positive semidefinite.
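A minimal sketch of this correction, assuming the observed data matrix Z has unobserved entries filled with zeros and p_obs[j] holds the observation probability of column j (the names are ours):

```python
import numpy as np

def missing_data_covariance(Z, p_obs):
    """Bias-corrected covariance estimate in the spirit of (6): off-diagonal
    entries of Z'Z/n are divided by p_j * p_k and diagonal entries by p_j.
    Because the off-diagonal entries are inflated more than the diagonal,
    the result is not guaranteed to be positive semidefinite."""
    n, p = Z.shape
    S = Z.T @ Z / n
    M = np.outer(p_obs, p_obs)       # p_j * p_k off the diagonal
    np.fill_diagonal(M, p_obs)       # p_j on the diagonal
    return S / M

# In high dimensions the smallest eigenvalue is often negative, e.g.:
# np.linalg.eigvalsh(missing_data_covariance(Z, p_obs)).min()
```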

4.3 Simulation results

Optimization performance. Figure 1 shows the optimization performance of Algorithm 1 using nonprojected input matrices from the missing data model with both $\ell_1$ and nonconvex (MCP) penalties. The top two panels present an “easy” scenario with a higher sampling rate, while the bottom two present a more challenging scenario with significant missing data. Blue lines report the optimization error while red lines report the statistical error.

All the plots in Figure 1 have their optimization error quickly converge to below the statistical error. These plots also suggest that our algorithm can attain linear convergence rates. We find that the algorithm consistently converges well over a range of tested scenarios.

Comparing the statistical error of the top two plots, we see that MCP achieves significantly lower error in the easier scenario. But in the bottom two plots, where there is more missing data, it struggles relative to the $\ell_1$ penalty. This is a common trend throughout our simulations: the performance of estimators using MCP degrades as missingness increases, while the $\ell_1$-penalized versions are more robust.

Figure 2: The performance of the various estimators for the missing data model in terms of relative Frobenius error and model selection as measured by FPR + FNR. Settings are chosen so that the effective sample size is roughly equivalent across panels, and $R$ is set to be a fixed multiple of the oracle value for each method.

Method comparisons. Figure 2 demonstrates the statistical performance along the full regularization path. Across the panels from left to right, the sampling rate decreases and therefore the magnitude of the most negative eigenvalue increases (see Table 4).

In terms of Frobenius error, both projected methods and the nonprojected estimator with the $\ell_1$ penalty get slightly worse across panels, but the nodewise regression and the nonprojected MCP estimator react much more negatively to more indefinite input. The nodewise regression in particular goes from being among the best estimators to among the worst as the sampling rate decreases.

Comparing the projected and nonprojected curves in Figure 2, we see that the optimal value of $\lambda$, as well as the range of near-optimal values, shrinks for the projected method as the sampling rate decreases. This pattern is repeated consistently across models and scenarios, likely because the projection shrinks the off-diagonal entries of the input matrix. We find that the nonprojected graphical Lasso performs slightly better than the projected version when used with the $\ell_1$ penalty, likely due to the information lost in this shrinkage.

Figure 2 also shows how these methods perform in terms of model selection. We can see that the nonconvex penalties perform essentially identically to their $\ell_1$-penalized counterparts. In particular, the degradation of the nonprojected MCP estimator in terms of norm error does not seem to affect its model selection performance. The nodewise regression, however, still shows the same pattern, as its model selection performance degrades across the panels. For scenarios with more missing data, the nonprojected estimators seem to be easier to tune, maintaining a wider range of $\lambda$ values where they perform near-optimally. In Section C of the supplement we perform similar experiments in a variety of different noise and model settings.

Sensitivity to $R$. Figure 3 demonstrates the sensitivity of the nonprojected estimators to the choice of $R$, the size of the side constraint. We can see that all these methods are sensitive to the choice of $R$ for small values of $R$ in terms of norm error. None of the methods are sensitive in terms of model selection.

Figure 3: The performance of the missing data estimators over different choices of $R$. The non-nodewise estimators set $R$ to R_scale times the oracle spectral norm, while each node’s regression in the nodewise estimator sets its constraint to R_scale times that node’s oracle value.

The nonprojected graphical Lasso with MCP is the most sensitive to $R$ and is also sensitive for larger choices of $R$, which is important since it never reaches its oracle minimum norm errors when $R$ is chosen to be larger than the oracle value. The nonprojected graphical Lasso with the $\ell_1$ penalty and the projected graphical Lasso with MCP both still achieve the same best-case performance when $R$ is misspecified, though tuning becomes more difficult.

The nodewise regression results are also plotted here. Here $R$ is the side constraint level in (5). For smaller values of $R$ the nodewise estimator levels off, corresponding to the point where the side constraint becomes active over the penalty. Different values of $R$ change when this occurs and, if $R$ is chosen large enough, do not significantly affect ideal performance. Note that these results use a stronger oracle that knows each column-wise norm, but they do show that this method can be improved with careful tuning.

5 Senate voting analysis

Based on the missing data model from Section 4.2, we estimate the conditional dependence graph among senators using the ADMM algorithm from Section 3. The dataset includes voting records from the United States Senate during the 112th Congress (2011-2013). We drop senators who served partial terms as well as unanimous votes, resulting in a dataset of voting records for 99 senators over 426 votes. Appendix D contains further details regarding data processing and the methods used, as well as additional analysis.

Missing values in this data correspond to votes that senators missed and make up roughly 2.6% of all votes. Note that only 109 of the votes are fully observed, so some type of correction or imputation should be used instead of omitting rows with missing entries.

A major story at this time was the rise of the tea party movement within the Republican party. Across the US government, tea party challengers rose to prominence. Though it was not an official party, politicians associated with the tea party movement tended to be more conservative and less likely to compromise than establishment Republicans, leading to a particularly polarized period of government.

Figure 4 plots the estimated graph among senators. As expected, the distinction between Republicans and Democrats is stark. Both independent senators caucus with the Democrats, and indeed they appear in the Democratic component of the graph.

(a) full graph, (b) full graph at a lower penalization level, (c) Republican subgraph
Figure 4: Graphs among senators estimated from Senate voting records of the 112th US Congress using an $\ell_1$ penalty, with the penalization level varying by panel. After estimation, the precision matrix is thresholded at 0.04 for the top and bottom panels and 0.055 for the middle one.

We identify senators who were present at the inaugural meeting of the unofficial Senate Tea Party Caucus as well as those elected in 2010 with significant tea party support. (The marked tea party senators are Marco Rubio, Mike Lee, Jerry Moran, Jim DeMint, Rand Paul, Ron Johnson, and Pat Toomey.) These senators are colored in black, and we can see that within the Republican party they are clustered together.

In Figure 4(a) we can see that the sole connection between the parties runs through the tea party (Rand Paul) and Jeff Merkley, a Democratic senator. This may be surprising, as Rand Paul is one of the most conservative senators and Merkley one of the most liberal. Paul is, however, regarded as a relatively libertarian conservative, so though he is extremely conservative along some dimensions, he may share liberal views with Merkley on others.

Figure 4(b) plots the same graph estimated at a lower penalization level. The Republicans who have cross-party connections include both some of the most conservative (Paul) and the most moderate (Thad Cochran, Lisa Murkowski) members. (Here we measure ideology by NOMINATE, a standard method in political science for assessing a representative’s position on the political spectrum (Poole, 2005); see Appendix D for more details.) On the Democratic side the cross-connected senators also include both the most liberal (Sanders, Merkley, Tom Udall) and relatively moderate (Claire McCaskill) members. As expected, moderates are among those most connected to the opposing party, but this shows that the most extreme members of a party can also be linked to the opposing party. Appendix D discusses these cross-party links in more detail.

Figure 4(c) shows the Republican subgraph from Figure 4(a). Here we can identify other senators who are closely associated with the tea party. In particular, two nodes near the tea party cluster are marked ‘H’ and ‘C,’ corresponding to Senators Orrin Hatch and Tom Coburn. Both have been linked to the tea party in the media, either as candidates supported by it or as being supportive of the movement.

It is also of interest that one marked senator, Jerry Moran, is not clustered with the others. This suggests that he is not as closely connected to the tea party movement as the others we have identified.

6 Summary and discussion

In this paper, we study the estimation of sparse precision matrices from noisy and missing data. To close an existing algorithmic gap, we propose an ADMM algorithm that allows for fast optimization of the side-constrained graphical Lasso, which is needed to implement the graphical Lasso with either indefinite input and/or nonconvex penalties. We investigate its convergence properties and compare its performance with other methods that handle the indefinite sample covariance matrices that arise with dirty data.

We find that methods with nonconvex penalties are quite sensitive to the indefiniteness of the input covariance estimate, and in particular to the magnitude of its negative eigenvalues. They may have better existing theoretical guarantees, but in practice we find that with nontrivial missingness or noise they perform worse than or, at best, recover the performance of their $\ell_1$-penalized counterparts. The nonconvex methods can outperform the $\ell_1$-penalized ones when there is a small amount of missingness or noise, but in these cases we often find the nodewise estimator to perform best.

In difficult settings with significant noise or missingness, the most robust and efficient method seems to be the graphical Lasso with nonprojected input and an $\ell_1$ penalty. As the problem becomes easier, with more observations or less missing data, the nodewise estimator becomes more competitive, just as it is understood to be with fully observed data.

The projected graphical Lasso estimator with an $\ell_1$ penalty seems to be slightly worse than its nonprojected counterpart. Projection does, however, allow nonconvex penalties to be used in more difficult settings without the large degradation in performance we have observed. This may be desirable in scenarios where the nonzero off-diagonal precision matrix entries are expected to be large.

Finally, we also use this new algorithm to estimate conditional dependence graphs among US senators using voting records data. We identify several interesting patterns in these graphs, especially regarding the rise of the tea party movement and cross-party connections between senators.

Acknowledgements

The research is supported in part by the NSF under grants DMS-1316731 and DMS-1830247.

References

  • Agarwal et al. (2012) Agarwal, A., Negahban, S., and Wainwright, M. J. (2012). Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482.
  • Banerjee et al. (2008) Banerjee, O., Ghaoui, L. E., and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485–516.
  • Belloni et al. (2017) Belloni, A., Rosenbaum, M., and Tsybakov, A. B. (2017). Linear and conic programming estimators in high dimensional errors-in-variables models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):939–956.
  • Bertsekas and Tsitsiklis (1989) Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and distributed computation: Numerical methods. Prentice Hall Englewood Cliffs, NJ. Republished by Athena Scientific in 1997.
  • Boyd et al. (2010) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122.
  • Breheny and Huang (2011) Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232.
  • Datta and Zou (2017) Datta, A. and Zou, H. (2017). Cocolasso for high-dimensional error-in-variables regression. The Annals of Statistics, 45(6):2400–2426.
  • Fan and Li (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
  • Fan et al. (2014) Fan, J., Xue, L., and Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42(3):819.
  • Farrell (1985) Farrell, R. H. (1985). Multivariate calculation: Use of the continuous groups.
  • Friedman et al. (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
  • Greenewald et al. (2017) Greenewald, K., Park, S., Zhou, S., and Giessing, A. (2017). Time-dependent spatially varying graphical models, with application to brain fMRI data analysis. In Advances in Neural Information Processing Systems, pages 5834–5842.
  • Guo and Zhang (2017) Guo, X. and Zhang, C. (2017). The effect of penalization on condition number constrained estimation of precision matrix. Statistica Sinica, 27:1299–1317.
  • Hong et al. (2016) Hong, M., Luo, Z.-Q., and Razaviyayn, M. (2016). Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364.
  • Hornstein et al. (2018) Hornstein, M., Fan, R., Shedden, K., and Zhou, S. (2018). Joint mean and covariance estimation with unreplicated matrix-variate data. Journal of the American Statistical Association.
  • Hsieh et al. (2014) Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15(1):2911–2947.
  • Loh and Wainwright (2012) Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664.
  • Loh and Wainwright (2013) Loh, P.-L. and Wainwright, M. J. (2013). Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. The Annals of Statistics, 41(6):3022.
  • Loh and Wainwright (2015) Loh, P.-L. and Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616.
  • Loh and Wainwright (2017) Loh, P.-L. and Wainwright, M. J. (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455–2482.
  • Meinshausen and Bühlmann (2006) Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462.
  • Mota et al. (2011) Mota, J. F., Xavier, J. M., Aguiar, P. M., and Püschel, M. (2011). A proof of convergence for the alternating direction method of multipliers applied to polyhedral-constrained functions. arXiv preprint arXiv:1112.2295.
  • Park (2016) Park, S. (2016). Selected Problems for High-Dimensional Data: Quantile and Errors-in-Variables Regressions. PhD thesis, University of Michigan.
  • Park et al. (2017) Park, S., Shedden, K., and Zhou, S. (2017). Non-separable covariance models for spatio-temporal data, with applications to neural encoding analysis. arXiv preprint arXiv:1705.05265.
  • Poole (2005) Poole, K. T. (2005). Spatial models of parliamentary voting. Cambridge University Press.
  • Rosenbaum and Tsybakov (2010) Rosenbaum, M. and Tsybakov, A. B. (2010). Sparse recovery under matrix uncertainty. The Annals of Statistics, pages 2620–2651.
  • Rosenbaum and Tsybakov (2013) Rosenbaum, M. and Tsybakov, A. B. (2013). Improved matrix uncertainty selector. In From Probability to Statistics and Back: High-Dimensional Models and Processes–A Festschrift in Honor of Jon A. Wellner, pages 276–290. Institute of Mathematical Statistics.
  • Rothman et al. (2008) Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.
  • Rudelson and Zhou (2017) Rudelson, M. and Zhou, S. (2017). Errors-in-variables models with dependent measurements. Electronic Journal of Statistics, 11(1):1699–1797.
  • Städler and Bühlmann (2012) Städler, N. and Bühlmann, P. (2012). Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 22(1):219–235.
  • Wang et al. (2015) Wang, Y., Yin, W., and Zeng, J. (2015). Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324.
  • Yuan (2010) Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11(Aug):2261–2286.
  • Yuan and Lin (2007) Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.
  • Zhang (2010) Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.
  • Zhang and Zhang (2012) Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, pages 576–593.
  • Zhou (2014) Zhou, S. (2014). GEMINI: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532–562.
  • Zhou (2019) Zhou, S. (Forthcoming, 2019). Sparse Hanson-Wright inequalities for subgaussian quadratic forms. Bernoulli. Available at https://arxiv.org/abs/1510.05517.
  • Zhou et al. (2010) Zhou, S., Lafferty, J., and Wasserman, L. (2010). Time varying undirected graphs. Machine Learning, 80:295–319.


Appendix A Auxiliary Results

A.1 Nonconvex penalties

The nonconvex penalties we will focus on are the SCAD and MCP functions, introduced in Fan and Li (2001) and Zhang (2010), respectively. Following Loh and Wainwright (2015), we make the following assumptions regarding the (univariate) penalty function $\rho_\lambda$.

  1. $\rho_\lambda(0) = 0$ and $\rho_\lambda(t) = \rho_\lambda(-t)$.

  2. $\rho_\lambda(t)$ is nondecreasing for $t \ge 0$.

  3. $\rho_\lambda(t)/t$ is nonincreasing for $t > 0$.

  4. $\rho_\lambda'(t)$ exists for all $t \neq 0$ and $\lim_{t \to 0^+} \rho_\lambda'(t) = \lambda$.

  5. $\rho_\lambda$ is weakly convex, i.e. there exists $\mu \ge 0$ such that $\rho_\lambda(t) + \frac{\mu}{2} t^2$ is convex.

Note that Loh and Wainwright (2017) show stronger model selection results under the following additional assumption.

  6. There exists a constant $\gamma > 0$ such that $\rho_\lambda'(t) = 0$ for all $t \ge \gamma\lambda$.

This excludes the $\ell_1$ penalty, but is satisfied by the nonconvex penalties we consider.

The SCAD penalty takes the form

$$\rho_\lambda(t) = \begin{cases} \lambda |t| & \text{if } |t| \le \lambda, \\ \dfrac{2 a \lambda |t| - t^2 - \lambda^2}{2(a-1)} & \text{if } \lambda < |t| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} & \text{if } |t| > a\lambda, \end{cases} \qquad (7)$$

for some parameter $a > 2$. Note that this penalty is weakly convex with constant $\mu = 1/(a-1)$.

The MCP penalty has the form

$$\rho_\lambda(t) = \lambda \int_0^{|t|} \Big(1 - \frac{z}{\lambda b}\Big)_+ \, dz = \begin{cases} \lambda |t| - \dfrac{t^2}{2b} & \text{if } |t| \le b\lambda, \\ \dfrac{b\lambda^2}{2} & \text{if } |t| > b\lambda, \end{cases} \qquad (8)$$

for some parameter $b > 0$. This penalty is weakly convex with $\mu = 1/b$.
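For completeness, the two penalties can be evaluated element-wise as in the sketch below; the default parameter values (a = 3.7, b = 3.0) are common choices rather than values prescribed by this paper.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty (7), evaluated element-wise; requires a > 2."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
            (a + 1) * lam ** 2 / 2,
        ),
    )

def mcp_penalty(t, lam, b=3.0):
    """MCP penalty (8), evaluated element-wise; requires b > 0."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= b * lam, lam * t - t ** 2 / (2 * b), b * lam ** 2 / 2)
```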

A.2 Derivation of Algorithm 1

Recall that we can rewrite the objective as

$$\min_{\Theta}\ f(\Theta) + g(\Theta),$$

where $f(\Theta) = \sum_{j \neq k} \rho_\lambda(\Theta_{jk})$ and $g(\Theta) = \operatorname{tr}(\hat{\Gamma}\Theta) - \log\det(\Theta) + \mathbb{1}\{\Theta \succ 0,\ \|\Theta\|_2 \le R\}$.

We then introduce an auxiliary optimization variable $\Psi$ and reformulate the problem as

$$\min_{\Theta,\, \Psi}\ f(\Theta) + g(\Psi) \quad \text{subject to} \quad \Theta = \Psi.$$

For a penalty parameter $\alpha > 0$ and Lagrange multiplier $\Lambda$, we consider the augmented Lagrangian

$$L_\alpha(\Theta, \Psi, \Lambda) = f(\Theta) + g(\Psi) + \langle \Lambda,\, \Theta - \Psi \rangle + \frac{\alpha}{2}\|\Theta - \Psi\|_F^2. \qquad (9)$$

The ADMM algorithm is then, given current iterates $\Theta^k$, $\Psi^k$, and $\Lambda^k$,

$$\Theta^{k+1} = \operatorname*{arg\,min}_{\Theta}\ L_\alpha(\Theta, \Psi^k, \Lambda^k), \qquad (10)$$
$$\Psi^{k+1} = \operatorname*{arg\,min}_{\Psi}\ L_\alpha(\Theta^{k+1}, \Psi, \Lambda^k), \qquad (11)$$
$$\Lambda^{k+1} = \Lambda^k + \alpha\,(\Theta^{k+1} - \Psi^{k+1}). \qquad (12)$$

Considering the $\Theta$-subproblem and writing $U^k = \Lambda^k/\alpha$ for the scaled dual variable, we can show that the minimization problem in (10) is equivalent to

$$\Theta^{k+1} = \operatorname*{arg\,min}_{\Theta}\ f(\Theta) + \frac{\alpha}{2}\big\|\Theta - (\Psi^k - U^k)\big\|_F^2,$$

which is the prox operator of $f/\alpha$. Let $Z = \Psi^k - U^k$ and let $S_c(\cdot)$ denote soft-thresholding at level $c$. If $\rho_\lambda$ is the $\ell_1$ penalty then these updates simply soft-threshold the off-diagonal elements of $Z$ at level $\lambda/\alpha$. For SCAD, these updates have the element-wise form

$$\Theta^{k+1}_{jk} = \begin{cases} S_{\lambda/\alpha}(Z_{jk}) & \text{if } |Z_{jk}| \le \lambda(1 + 1/\alpha), \\ \dfrac{(a-1)\alpha Z_{jk} - \operatorname{sign}(Z_{jk})\, a\lambda}{(a-1)\alpha - 1} & \text{if } \lambda(1 + 1/\alpha) < |Z_{jk}| \le a\lambda, \\ Z_{jk} & \text{if } |Z_{jk}| > a\lambda, \end{cases} \qquad (13)$$

while for MCP the updates are

$$\Theta^{k+1}_{jk} = \begin{cases} \dfrac{S_{\lambda/\alpha}(Z_{jk})}{1 - 1/(b\alpha)} & \text{if } |Z_{jk}| \le b\lambda, \\ Z_{jk} & \text{if } |Z_{jk}| > b\lambda, \end{cases} \qquad (14)$$

for $j \neq k$, with diagonal entries left unpenalized. See Loh and Wainwright (2015) for the derivations of these updates.
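A sketch of these element-wise prox operators in NumPy is given below; `step` plays the role of $1/\alpha$ in the ADMM updates, and the formulas require step < a - 1 (SCAD) and step < b (MCP) to be valid, which holds for the usual choices of $\alpha \ge 1$.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_scad(z, lam, a=3.7, step=1.0):
    """Element-wise prox of the SCAD penalty with step size `step` (cf. (13)).
    Requires step < a - 1."""
    z = np.asarray(z, dtype=float)
    return np.where(
        np.abs(z) <= lam * (1.0 + step),
        soft_threshold(z, step * lam),
        np.where(
            np.abs(z) <= a * lam,
            ((a - 1.0) * z - np.sign(z) * step * a * lam) / (a - 1.0 - step),
            z,
        ),
    )

def prox_mcp(z, lam, b=3.0, step=1.0):
    """Element-wise prox of the MCP penalty with step size `step` (cf. (14)).
    Requires step < b."""
    z = np.asarray(z, dtype=float)
    return np.where(
        np.abs(z) <= b * lam,
        soft_threshold(z, step * lam) / (1.0 - step / b),
        z,
    )
```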

For the $\Psi$-subproblem, we can similarly show that (11) is equivalent to

$$\Psi^{k+1} = \operatorname*{arg\,min}_{\Psi \succ 0,\ \|\Psi\|_2 \le R}\ -\log\det(\Psi) + \frac{\alpha}{2}\big\|\Psi - \big(\Theta^{k+1} + U^k - \hat{\Gamma}/\alpha\big)\big\|_F^2. \qquad (15)$$

For any symmetric matrix $C$ with corresponding eigendecomposition $C = V\operatorname{diag}(\gamma)V^\top$, let us define the operator

$$\mathcal{T}_R^{\alpha}(C) = \operatorname*{arg\,min}_{\Psi \succ 0,\ \|\Psi\|_2 \le R}\ \Big\{-\log\det(\Psi) + \frac{\alpha}{2}\|\Psi - C\|_F^2\Big\}, \qquad (16)$$

whose solution is derived in Section A.3. Then the solution to (11) is $\Psi^{k+1} = \mathcal{T}_R^{\alpha}\big(\Theta^{k+1} + U^k - \hat{\Gamma}/\alpha\big)$.

Using these results, the algorithm in (10)-(12) becomes

$$\Theta^{k+1} = \operatorname{prox}_{\rho_\lambda/\alpha}\big(\Psi^k - U^k\big), \qquad \Psi^{k+1} = \mathcal{T}_R^{\alpha}\big(\Theta^{k+1} + U^k - \hat{\Gamma}/\alpha\big), \qquad U^{k+1} = U^k + \Theta^{k+1} - \Psi^{k+1}, \qquad (17)$$

which is the iteration summarized in Algorithm 1.

A.3 Solution of the operator (16)

Recall that in (16) we define

$$\mathcal{T}_R^{\alpha}(C) = \operatorname*{arg\,min}_{\Psi \succ 0,\ \|\Psi\|_2 \le R}\ \Big\{-\log\det(\Psi) + \frac{\alpha}{2}\|\Psi - C\|_F^2\Big\}.$$

Let $\Psi = Q\operatorname{diag}(x)Q^\top$ and $C = V\operatorname{diag}(\gamma)V^\top$ be the eigendecompositions of the optimization variable and of $C$. Then, similar to the derivation in Guo and Zhang (2017), we can rewrite this problem as a minimization over the eigenvalues $x$ alone: denoting by $\mathcal{O}^p$ the set of orthonormal matrices, the cross term is maximized over $Q \in \mathcal{O}^p$ at $Q = V$, so the optimal eigenvectors are those of $C$. Note that this last step follows from Theorem 14.3.2 of Farrell (1985).

We therefore get a problem that is separable by eigenvalue: for each $i$, we minimize $-\log x_i + \frac{\alpha}{2}(x_i - \gamma_i)^2$ over $0 < x_i \le R$. Ignoring the constraints in the indicator function for now, we can set the derivative equal to zero and solve with the quadratic formula to show that each univariate objective has a unique unconstrained minimizer over $x_i > 0$. Adding back the constraint and noting that the univariate objective is strictly convex over $x_i > 0$, we get that we simply need to truncate this value at $R$.
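Putting the pieces together, and under the parameterization used above (ADMM penalty parameter $\alpha$ and spectral bound $R$), the resulting closed form is the following sketch, obtained from the quadratic formula for each eigenvalue followed by truncation at $R$:

$$x_i^{*} = \operatorname*{arg\,min}_{0 < x \le R}\Big\{-\log x + \frac{\alpha}{2}(x - \gamma_i)^2\Big\} = \min\Big\{R,\ \frac{\gamma_i + \sqrt{\gamma_i^2 + 4/\alpha}}{2}\Big\}, \qquad \mathcal{T}_R^{\alpha}(C) = V\operatorname{diag}(x^{*})V^\top.$$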

A.4 Proof of Proposition 1

Proof.

The optimization problem (4) is equivalently

$$\min_{x,\, z}\ f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c, \qquad (18)$$

where $x = \Theta$, $z = \Psi$, $A = I$, $B = -I$, and $c = 0$.

Boyd et al. (2010) show that if $f$ and $g$ are proper convex functions and if (18) is solvable, then ADMM converges in terms of the objective value and the dual variable $\Lambda$. Bertsekas and Tsitsiklis (1989, Proposition 4.2) and Mota et al. (2011) show that if, in addition, $A$ and $B$ have full column rank, then we also get convergence of the primal iterates $\Theta^k$ and $\Psi^k$ to the solution of (18). ∎

A.5 Proof of Proposition 2

Before we prove Proposition 2, we first define directional derivatives and stationary points.

Definition.

The directional derivative of a lower semi-continuous function $h$ at $x$ in the direction $v$ is

$$h'(x; v) = \liminf_{t \downarrow 0}\ \frac{h(x + t v) - h(x)}{t}.$$

Note that we allow this limit to be infinite. We say that $\bar{x}$ is a stationary point of $h$ if it satisfies the first-order necessary condition to be a local extremum, i.e. $h'(\bar{x}; v) \ge 0$ for all directions $v$.

Note that this coincides with the definition of stationary point used in Loh and Wainwright (2017), though they use slightly different notation. Also note that a stationary point satisfies $\nabla h(\bar{x}) = 0$ when $h$ is continuously differentiable.

Proof.

From the first-order necessary conditions of the subproblems (10)-(11), we get that, for all ,

(19)

And recall that

(20)

We can rewrite (19)-(20) as

(21)
(22)
(23)

Now consider a limit point of the iterates and evaluate (21)-(23) at this point. From (23) we get that the primal residual vanishes at the limit, so $\Theta = \Psi$ there. This combined with (21) gives the first-order condition associated with the penalty term, and (22) gives the corresponding condition for the smooth and constraint terms.

Using the above and recalling the objective as defined in (4), we get that the directional derivative of the objective at the limit point is nonnegative in every direction. So the limit point is a stationary point of the objective by definition. ∎

A.6 Comparison to Guo and Zhang (2017)

Guo and Zhang (2017) study the problem of condition number-constrained precision matrix estimation, where they consider the estimator

$$\hat{\Theta} \in \operatorname*{arg\,min}_{\Theta \succ 0,\ \lambda_{\max}(\Theta)/\lambda_{\min}(\Theta) \le \kappa}\ \operatorname{tr}(\hat{\Gamma}\Theta) - \log\det(\Theta) + \lambda \sum_{j \neq k} |\Theta_{jk}|. \qquad (24)$$

Note that this is quite similar to the estimators we consider in (2), as they simply replace the maximum eigenvalue constraint with a constraint on the ratio of the maximum to minimum eigenvalues.

However, they do not study the application of their estimator to cases with indefinite input or its performance in noisy and missing data situations. In particular, constraining the condition number does not necessarily guarantee that the graphical Lasso objective (1) will be lower bounded, especially when using nonconvex penalties.

As a simple example, consider an indefinite input matrix such as $\hat{\Gamma} = \operatorname{diag}(1, -1)$ and iterates $\Theta_t = t I$ for $t > 0$. In this case the objective is

$$\operatorname{tr}(\hat{\Gamma}\Theta_t) - \log\det(\Theta_t) = (t - t) - 2\log t = -2\log t,$$

which is unbounded below as $t$ grows even though the condition numbers of the iterates are constant.

More generally, whenever has eigenvalues , where and . Denote and . Let be the eigendecomposition of the covariance estimate. Then for some condition number bound , we can consider iterates of the form , where is a diagonal matrix with entries

which we note has the prescribed condition number. Then we can see that the objective becomes

So if then this objective is still unbounded below.

Using a spectral norm bound as the side constraint with an indefinite input guarantees a lower bound on the graphical Lasso objective regardless of the choice of penalty, and it is therefore a more natural side constraint to use.

Appendix B ADMM for general side constraints

In this section we develop an ADMM algorithm for general side constraints, i.e. the following variant of (2). (Note that we change the notation of the side constraint function here to avoid confusion with the ADMM penalty parameter.)

This algorithm has the same convergence guarantees as Algorithm 1, but in practice we find that Algorithm 1 converges faster and more consistently when the spectral norm side constraint is used.

B.1 Derivation

We first rewrite the objective as

(25)

where and

We can then introduce auxiliary optimization variables and reformulate the optimization problem as

For a penalty parameter and Lagrange multiplier matrices , we consider the augmented Lagrangian of this problem

(26)

The ADMM algorithm is then, given current iterates , , , , and ,