1 Introduction
The principle of maximum entropy was proposed by Jaynes (1957) for probability density estimation. It states that among the probability densities that represent the current state of knowledge one should choose the one with the largest entropy, that is, the one which does not introduce additional biases. The state of knowledge is often given by sample points from a sample space and some fixed functions (sufficient statistics) on the sample space. The knowledge is then encoded naturally in the form of constraints on the probability density by requiring that the expected values of the functions equal their respective sample means. Here, we assume the particularly simple multivariate sample space
and functions. Suppose we are given sample points . Then formally, for estimating the distribution from which the sample points are drawn, the principle of maximum entropy suggests solving the following entropy maximization problem
where is the set of all probability distributions on , the expectation is with respect to the distribution , and is the entropy. We denote the matrix of sample means compactly by and the matrix of functions by . Then, the entropy maximization problem becomes
Dudík et al. (2004) observed that invoking the principle of maximum entropy tends to overfit when the number of features is large. Requiring that the expected values of the functions equal their respective sample means can be too restrictive. Consequently, they proposed to relax the constraint using the maximum norm as
for some . That is, for every function the expected value only needs to match the sample mean up to a tolerance of
. The dual of the relaxed problem has a natural interpretation as a feature-selective $\ell_1$-regularized log-likelihood maximization problem
where is the set of symmetric matrices, is the matrix of dual variables for the constraint , and
is the log-likelihood function for pairwise Ising models with the standard matrix dot product and normalizer (log-partition function)
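For small problems, the log-likelihood of a pairwise Ising model can be evaluated by brute force, which may help to make the roles of the dot product and the normalizer concrete. The sketch below is only an illustration under assumptions that are not confirmed by the text above: binary states in {0,1}, features of the form x_i x_j, and the helper names `second_moment`, `dot`, `log_partition`, and `log_likelihood`, which are all hypothetical.

```python
import itertools
import math

def second_moment(samples):
    """Empirical second-moment matrix: the average of x x^T over the samples."""
    n, d = len(samples), len(samples[0])
    return [[sum(x[i] * x[j] for x in samples) / n for j in range(d)]
            for i in range(d)]

def dot(a, b):
    """Standard matrix dot product <A, B> = sum_ij A_ij * B_ij."""
    return sum(u * v for ra, rb in zip(a, b) for u, v in zip(ra, rb))

def log_partition(theta):
    """Log-partition function by enumerating all 2^d states (assumed {0,1}^d)."""
    d = len(theta)
    energies = []
    for x in itertools.product((0, 1), repeat=d):
        outer = [[xi * xj for xj in x] for xi in x]
        energies.append(dot(theta, outer))
    m = max(energies)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(e - m) for e in energies))

def log_likelihood(theta, samples):
    """Average log-likelihood <Theta, second moment> minus the log-partition."""
    return dot(theta, second_moment(samples)) - log_partition(theta)
```

The enumeration has cost exponential in the number of variables, so this is only usable for a handful of variables.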
In this paper, we restrict the relaxation of the entropy maximization problem by also enforcing the alternative constraint
where and denotes the spectral norm on . In contrast to the maximum norm constraint, the expected values of the functions now only need to collectively match the sample means up to a tolerance of instead of individually. The dual of the more strictly relaxed entropy maximization problem
is the regularized log-likelihood maximization problem
see Appendix A. Here, the regularization term promotes a low rank of the positive-semidefinite matrix . This implies that the matrix in the log-likelihood function also has low rank. Thus, a solution of the dual problem is the sum of a sparse matrix and a low-rank matrix . This can be interpreted as follows: the variables interact indirectly through the low-rank matrix , while some of the direct interactions through the matrix are turned off by setting entries in to zero. We get a more intuitive interpretation of the dual problem if we consider a weakening of the spectral norm constraint. The spectral norm constraint is equivalent to the two constraints
that bound the spectrum of the matrix from above and below. If we replace the spectral norm constraint by only the second of these two constraints in the maximum-entropy problem, then the dual problem becomes
This problem also arises as the log-likelihood maximization problem for a conditional Gaussian model (see Lauritzen (1996)) that exhibits observed binary variables and unobserved, latent conditional Gaussian variables. The sample space of the full mixed model is , where is the sample space for the unobserved variables. We want to write down the density of the conditional Gaussian model on this sample space. For that we respectively denote the interaction parameters between the observed binary variables by , the ones between the observed binary and latent conditional Gaussian variables by , and the ones between the latent conditional Gaussian variables by , where . Then, for and up to normalization, the density of the conditional Gaussian model is given as
One can check, see also Lauritzen (1996), that the conditional densities are variate Gaussians on . Here, we are interested in the marginal distribution
on that is obtained by integrating over the unobserved variables in , see Appendix B. The matrix is symmetric and positive semidefinite. The log-likelihood function for the marginal model and the given data is thus given as
where , and is once again the normalizer of the density.
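The marginalization step can be made explicit with a standard Gaussian integral. The following sketch uses hypothetical notation that is not taken from the text above: $S$ for the interactions among the $d$ observed binary variables, $R \in \mathbb{R}^{l \times d}$ for the interactions between latent and observed variables, and a positive definite $\Lambda \in \mathbb{R}^{l \times l}$ for the interactions among the $l$ latent variables. The binary sample space $\{0,1\}^d$ and the sign and scaling conventions are assumptions as well.

```latex
\begin{align*}
p(x, z) &\propto \exp\Big( x^\top S x + z^\top R x - \tfrac{1}{2} z^\top \Lambda z \Big),
  \qquad x \in \{0,1\}^d,\; z \in \mathbb{R}^l, \\
\int_{\mathbb{R}^l} \exp\Big( z^\top R x - \tfrac{1}{2} z^\top \Lambda z \Big)\, dz
  &= (2\pi)^{l/2} \det(\Lambda)^{-1/2}
     \exp\Big( \tfrac{1}{2} x^\top R^\top \Lambda^{-1} R x \Big), \\
p(x) &\propto \exp\Big( x^\top (S + L)\, x \Big),
  \qquad L = \tfrac{1}{2} R^\top \Lambda^{-1} R \succeq 0,
  \quad \operatorname{rank}(L) \le l .
\end{align*}
```

Under these assumptions the marginal has the form of a pairwise Ising model whose interaction matrix is the sum of $S$ and a symmetric positive semidefinite matrix of rank at most $l$.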
If only a few of the binary variables interact directly, then is sparse, and if the number of unobserved variables is small compared to , then is of low rank. Hence, one could attempt to recover and from the data using the regularized log-likelihood maximization problem
(ML) 
that we encountered before.
We are now in a situation similar to the one discussed by Chandrasekaran et al. (2012), who studied Gaussian graphical models with latent Gaussian variables. They were able to consistently estimate both the number of latent components, in our case , and the conditional graphical model structure among the observed variables, in our case the zeroes in . Their result holds in the high-dimensional setting, where the number of variables (latent and observed) may grow with the number of observed sample points. Here, we show a similar result for the Ising model with latent conditional Gaussian variables, that is, the one that we have introduced above.
2 Related Work
Graphical Models. The introduction of decomposed sparse + low-rank models followed a period of quite extensive research on sparse graphical models in various settings, for example Gaussians (Meinshausen and Bühlmann (2006), Ravikumar et al. (2011)), Ising models (Ravikumar et al. (2010)), discrete models (Jalali et al. (2011)), and more general conditional Gaussian and exponential family models (Lee and Hastie (2015), Lee et al. (2015), Cheng et al. (2017)). All estimators of sparse graphical models maximize some likelihood including a penalty that induces sparsity.
Most of the referenced works contain high-dimensional consistency analyses that particularly aim at the recovery of the true graph structure, that is, the information about which variables are not conditionally independent and thus interact. A prominent proof technique used throughout is the primal-dual-witness method originally introduced in Wainwright (2009) for the LASSO, that is, sparse regression. Generally, the assumptions necessary in order to be able to successfully identify the true interactions for graphical models (or rather the active predictors for the LASSO) are very similar. For example, one of the conditions that occurs repeatedly is irrepresentability, sometimes also referred to as incoherence. Intuitively, this condition limits the influence the active terms (edges) can have on the inactive terms (non-edges), see Ravikumar et al. (2011).
Sparse + low-rank models. The seminal work of Chandrasekaran et al. (2012) is the first to propose learning sparse + low-rank decompositions as an extension of classical graphical models. It has received a lot of attention since then, including several published discussions, for example Candès and Soltanolkotabi (2012), Lauritzen and Meinshausen (2012), and Wainwright (2012). Notably, Chandrasekaran et al. (2012)’s high-dimensional consistency analysis generalizes the proof technique previously employed in graphical models. Hence, unsurprisingly, one of their central assumptions is a generalization of the irrepresentability condition.
Surprisingly, little effort has gone into generalizing sparse + low-rank models to broader domains of variables. The particular case of multivariate binary models featuring a sparse + low-rank decomposition is related to Item Response Theory (IRT, see for example Hambleton et al. (1991)). In IRT the observed binary variables (test items) are usually assumed to be conditionally independent given some continuous latent variable (trait of the test taker). Chen et al. (2018) argued that measuring conditional dependence by means of sparse + low-rank models might improve results from classical IRT. They estimate their models using pseudo-likelihood, a strategy that they also proposed in an earlier work, see Chen et al. (2016).
Chen et al. (2016) show that their estimator recovers the algebraic structure, that is, the conditional graph structure and the number of latent variables, with probability tending to one. However, their analysis only allows a growing number of sample points whereas they keep the number of variables fixed. Their result thus departs from the tradition of analyzing the more challenging high-dimensional setting, where the number of variables is also explicitly tracked.
Placement of our work.
Our main contribution is a high-dimensional consistency analysis of a likelihood estimator for multivariate binary sparse + low-rank models. Furthermore, our analysis is the first to show parametric consistency of the likelihood estimates and to provide explicit rates for this type of model. It thus complements the existing literature. Our other contribution is the connection to a particular type of relaxed maximum-entropy problems that we established in the introduction. We have shown that this type of relaxation leads to an interpretation as the marginal model of a conditional Gaussian distribution. Interestingly, this has not drawn attention before, though our semidefiniteness constraints can be obtained as special cases of the general relaxed maximum-entropy problem discussed in Dudík and Schapire (2006).
3 Parametric and Algebraic Consistency
This section constitutes the main part of this paper. Here, we discuss assumptions that lead to consistency properties of the solution to the likelihood problem ML and state our consistency result. We are interested in the high-dimensional setting, where the number of samples , the number of observed binary variables , and the number of latent conditional Gaussian variables are allowed to grow simultaneously. Meanwhile, there are some other problem-specific quantities that concern the curvature of the problem that we assume to be fixed. Hence, we keep the geometry of the problem fixed.
For studying the consistency properties, we use a slight reformulation of Problem ML from the introduction. First, we switch from a maximization to a minimization problem, and let be the negative log-likelihood from now on. Furthermore, we change the representation of the regularization parameters, namely
(SL) 
where controls the tradeoff between the two regularization terms and controls the tradeoff between the negative log-likelihood term and the regularization terms.
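To illustrate how the two tradeoff parameters enter, the following sketch evaluates a regularized objective of the kind described above for small problems. It rests on assumptions that the text does not confirm in this form: the objective is taken to be the negative log-likelihood of the sum of a sparse part `S` and a low-rank part `L`, plus `lam * (gamma * ||S||_1 + trace(L))` with `L` positive semidefinite, and binary states in {0,1}. All function and parameter names are hypothetical.

```python
import itertools
import numpy as np

def neg_log_likelihood(theta, phi_n):
    """Negative log-likelihood: log-partition minus <theta, phi_n> (brute force)."""
    d = theta.shape[0]
    states = np.array(list(itertools.product((0, 1), repeat=d)), dtype=float)
    # energies[k] = x_k^T theta x_k for every binary state x_k
    energies = np.einsum("ki,ij,kj->k", states, theta, states)
    m = energies.max()
    log_z = m + np.log(np.exp(energies - m).sum())
    return log_z - np.sum(theta * phi_n)

def sl_objective(S, L, phi_n, lam, gamma):
    """Assumed objective: nll(S + L) + lam * (gamma * ||S||_1 + tr(L)), L PSD."""
    if np.linalg.eigvalsh(L).min() < -1e-10:
        return np.inf  # infeasible: L is not positive semidefinite
    penalty = gamma * np.abs(S).sum() + np.trace(L)
    return neg_log_likelihood(S + L, phi_n) + lam * penalty
```

In this form, `gamma` only changes the relative weight of the sparsity term against the trace term, while `lam` scales both regularization terms against the data fit.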
Our consistency proof follows the lines of the seminal work of Chandrasekaran et al. (2012), who investigate a convex optimization problem for the parameter estimation of a model with observed and latent Gaussian variables. The main difference to the Ising model is that the Gaussian case requires a positive-definiteness constraint on the pairwise interaction parameter matrix that is necessary for normalizing the density. Furthermore, in the Gaussian case the pairwise interaction parameter matrix is the inverse of the covariance matrix. This is no longer the case for the Ising model, see Loh and Wainwright (2012).
In this work, we want to answer the question whether it is possible to recover the parameters from data that has been drawn from a hypothetical true model distribution parametrized by and . We focus on two key concepts of successful recovery in an asymptotic sense with high probability. The first is parametric consistency. This means that should be close to w.r.t. some norm. Since the regularizer is the composed norm , a natural norm for establishing parametric consistency is its dual norm
The second type of consistency that we study is algebraic consistency. It holds if recovers the true sparse support of , and if has the same rank as .
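The two notions of consistency can be checked numerically for a given estimate. The sketch below assumes that the composed regularizer is `gamma * ||S||_1 + ||L||_*`, whose dual norm on a pair of matrices is the maximum of the scaled entrywise maximum norm and the spectral norm; the function names, the tolerance `tol`, and the argument names are hypothetical.

```python
import numpy as np

def dual_norm(err_S, err_L, gamma):
    """max(||err_S||_inf / gamma, ||err_L||_2), dual to gamma*||S||_1 + ||L||_*."""
    return max(np.abs(err_S).max() / gamma, np.linalg.norm(err_L, 2))

def algebraically_consistent(S_hat, L_hat, S_true, L_true, tol=1e-8):
    """True if the sign patterns of S agree and the ranks of L agree."""
    signs_hat = np.sign(np.where(np.abs(S_hat) > tol, S_hat, 0.0))
    signs_true = np.sign(np.where(np.abs(S_true) > tol, S_true, 0.0))
    same_signs = np.array_equal(signs_hat, signs_true)
    same_rank = (np.linalg.matrix_rank(L_hat, tol=tol)
                 == np.linalg.matrix_rank(L_true, tol=tol))
    return bool(same_signs and same_rank)
```

Parametric consistency then corresponds to `dual_norm(S_hat - S_true, L_hat - L_true, gamma)` being small, while algebraic consistency only compares supports, signs, and ranks.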
In the following we discuss the assumptions for our consistency result. For that we proceed as follows: First, we discuss the requirements for parametric consistency of the compound matrix in Section 3.1. Next, we work out the three central assumptions that are sufficient for individual recovery of and in Section 3.2. We state our consistency result in Section 3.3. Finally, in Section 3.4 we outline the proof, the details of which can be found in Section 5.
3.1 Parametric consistency of the compound matrix
In this section, we briefly sketch how the negative log-likelihood part of the objective function in Problem SL drives the compound matrix that is constructed from the solution to parametric consistency with high probability. We only consider the negative log-likelihood part because we assume that the relative weight of the regularization terms in the objective function goes to zero as the number of sample points goes to infinity. This implies that the estimated compound matrix is not affected much by the regularization terms since they contribute mostly small (but important) adjustments. More specifically, the $\ell_1$-norm regularization on shrinks entries of such that entries of small magnitude are driven to zero, so that will likely be a sparse matrix. Likewise, the trace norm (or nuclear norm) can be thought of as diminishing the singular values of the matrix such that small singular values become zero, that is, will likely be a low-rank matrix.
The negative log-likelihood function is strictly convex and thus has a unique minimizer . We can assume that . Let and . Then, consistent recovery of the compound matrix is essentially equivalent to the estimation error being small. Now, consider the Taylor expansion
with remainder . It turns out that if the number of samples is sufficiently large, then the gradient is small with high probability, and if is small, then the remainder is also small. In this case, the Taylor expansion implies that locally around the true parameters the negative log-likelihood is well approximated by the quadratic form induced by its Hessian, namely
This quadratic form is obviously minimized at , which would entail consistent recovery of in a parametric sense. However, this does not explain how the sparse and low-rank components of can be recovered consistently. In the next section we elaborate sufficient assumptions for the consistent recovery of these components.
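The shrinkage effect of the two regularization terms mentioned in this section can be illustrated with the standard proximal operators of the entrywise $\ell_1$-norm and of the trace on the positive semidefinite cone. These operators are not part of the analysis in this paper; they are shown here only as a sketch of how small entries and small eigenvalues are set exactly to zero, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold_entries(M, t):
    """Shrink every entry towards zero by t; entries with |m| <= t become zero."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def soft_threshold_eigenvalues(M, t):
    """Shrink the eigenvalues of a symmetric matrix by t and clip them at zero."""
    w, V = np.linalg.eigh(M)
    w = np.maximum(w - t, 0.0)
    # reassemble V diag(w) V^T; zeroed eigenvalues reduce the rank
    return (V * w) @ V.T
```

Applied to a matrix with a few large and many small entries (or eigenvalues), the first operator returns a sparse matrix and the second a low-rank positive semidefinite matrix.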
3.2 Assumptions for individual recovery
Consistent recovery of the components, more specifically parametric consistency of the solutions and , requires the two errors and to be small (in their respective norms). Both errors together form the joint error . Note though that the minimum of the quadratic form from the previous section at does not imply that the individual errors and are small. We can only hope for parametric consistency of and if they are the unique solutions to Problem SL.
For uniqueness of the solutions we need to study optimality conditions. Problem SL is the Lagrange form of the constrained problem
for suitable regularization parameters and , where we have neglected the positive-semidefiniteness constraint on . The constraints can be thought of as convex relaxations of constraints that require to have a certain sparsity and require to have at most a certain rank. That is, should be contained in the set of symmetric matrices of a given sparsity and should be contained in the set of symmetric low-rank matrices. To formalize these sets we briefly review the varieties of sparse and low-rank matrices.
Sparse matrix variety.
For the support is defined as
and the variety of sparse symmetric matrices with at most nonzero entries is given as
Any matrix with is a smooth point of with tangent space
Low-rank matrix variety.
The variety of matrices with rank at most is given as
Any matrix with rank is a smooth point of with tangent space
where
is the restricted eigenvalue decomposition of
, that is, has orthonormal columns and is diagonal.
Next, we formulate conditions that ensure uniqueness in terms of the tangent spaces of the introduced varieties.
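The tangent spaces of the two varieties admit simple orthogonal projections, which can be sketched as follows. The code assumes the standard descriptions of these spaces from the matrix decomposition literature: the tangent space at a sparse matrix consists of the matrices supported on its support, and the tangent space at a symmetric low-rank matrix with eigenvector matrix `U` consists of the matrices of the form `U X^T + X U^T`. The function names and tolerances are hypothetical.

```python
import numpy as np

def project_sparse_tangent(M, S, tol=1e-12):
    """Projection onto the matrices whose support is contained in that of S."""
    return np.where(np.abs(S) > tol, M, 0.0)

def project_lowrank_tangent(M, L, tol=1e-10):
    """Projection of symmetric M onto the tangent space at symmetric low-rank L."""
    w, V = np.linalg.eigh(L)
    U = V[:, np.abs(w) > tol]   # eigenvectors of the nonzero eigenvalues
    P = U @ U.T                 # projector onto the column space of L
    return P @ M + M @ P - P @ M @ P
```

Both maps are idempotent, and each leaves the base point itself unchanged, which is a quick sanity check for an implementation.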
Transversality.
Remember that we understand the constraints in the constrained formulation of Problem SL as convex relaxations of constraints of the form and . Because the negative log-likelihood function is a function of , its gradient with respect to and its gradient with respect to coincide at . Hence, the first-order optimality conditions for the non-convex problem require that the gradient of the negative log-likelihood function needs to be normal to and at any (locally) optimal solutions and , respectively. If the solution is not (locally) unique, then basically the only way to get an alternative optimal solution that violates (local) uniqueness is by translating and by an element that is tangential to at and tangential to at , respectively. Thus, it is necessary for (local) uniqueness of the optimal solution that such a tangential direction does not exist. Hence, the tangent spaces and need to be transverse, that is, . Intuitively, if we require that transversality holds for the true parameters , that is, , then provided that is close to , the tangent spaces and should also be transverse.
We do not require transversality explicitly since it is implied by stronger assumptions that we motivate and state in the following. In particular, we want the (locally) optimal solutions and not only to be unique, but also to be stable under perturbations. This stability needs some additional concepts and notation that we introduce now.
Stability assumption.
Here, stability means that if we perturb and in the respective tangential directions, then the gradient of the negative log-likelihood function should be far from being normal to the sparse and low-rank matrix varieties at the perturbed and , respectively. As for transversality, we require stability for the true solution and expect that it carries over to the optimal solutions and , provided they are close. More formally, we consider perturbations of in directions from the tangent space , and perturbations of in directions from tangent spaces to the low-rank variety that are close to the true one . The reason for considering tangent spaces close to is that there are low-rank matrices close to that are not contained in because the low-rank matrix variety is locally curved at any smooth point.
Now, in light of a Taylor expansion the change of the gradient is locally governed by the data-independent Hessian of the negative log-likelihood function at . To make sure that the gradient of the tangentially perturbed (true) solution cannot be normal to the respective matrix varieties we require that it has a significant component in the tangent spaces at the perturbed solution. This is achieved if the minimum gains of the Hessian in the respective tangential directions
are large, where are tangent spaces to the low-rank matrix variety that are close to in terms of the twisting
between these subspaces given some . Here, we denote projections onto a matrix subspace by subindexed by the subspace.
Note though that only requiring and to be large is not enough if the maximum effects of the Hessian in the respective normal directions
are also large, because then the gradient of the negative log-likelihood function at the perturbed (true) solution could still be almost normal to the respective varieties. Here, is the normal space at orthogonal to , and is the space orthogonal to .
Overall, we require that is bounded away from zero and that the ratio is bounded from above, where . Note that in our definitions of the minimum gains and maximum effects we used the $\ell_\infty$- and the spectral norm, which are dual to the $\ell_1$- and the nuclear norm, respectively. Ultimately, we want to express the stability assumption in the norm which is the dual norm to the regularization term in Problem SL. For that we need to compare the $\ell_\infty$- and the spectral norm. This can be accomplished by using norm compatibility constants that are given as the smallest possible and such that
where and are the tangent spaces at points and from the sparse matrix variety and the lowrank matrix variety , respectively. Let us now specify our assumptions in terms of the stability constants from above.
Assumption 1 (Stability)
We set and assume that

, and

there exists such that , where .
The second assumption is essentially a generalization of the well-known irrepresentability condition, see for example Ravikumar et al. (2011). The next assumption ensures that there are values of for which stability can be expressed in terms of the norm, that is, a coupled version of stability.
feasibility assumption.
The norm compatibility constants and allow further insights into the realm of problems for which consistent recovery is possible. First, it can be shown, see Chandrasekaran et al. (2011), that , where is the maximum number of nonzero entries per row/column of , that is, constitutes a lower bound for . Intuitively, if is large, then the nonzero entries of the sparse matrix could be concentrated in just a few rows/columns and thus would be of low rank. Hence, in order not to confuse with a low-rank matrix we want the lower bound on the maximum degree to be small.
Second, constitutes a lower bound on the incoherence of the matrix . Incoherence measures how well a subspace is aligned with the standard coordinate axes. Formally, the incoherence of a subspace is defined as where the are the standard basis vectors of . It is known, see again Chandrasekaran et al. (2011), that
where is the incoherence of the subspace spanned by the rows/columns of the symmetric matrix . A large value means that the row/column space of is well aligned with the standard coordinate axes. In this case, the entries of do not need to be spread out and thus could have many zero entries, that is, it could be a sparse matrix. Hence, in order not to confuse with a sparse matrix we want the lower bound on the incoherence , or equivalently , to be small.
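The two quantities that bound the norm compatibility constants from below, the maximum degree of the sparse matrix and the incoherence of the row/column space of the low-rank matrix, are easy to compute. The sketch below assumes the usual definition of incoherence as the largest Euclidean norm of a projected standard basis vector; the function names and tolerances are hypothetical.

```python
import numpy as np

def max_degree(S, tol=1e-12):
    """Maximum number of nonzero entries in any row of S."""
    return int((np.abs(S) > tol).sum(axis=1).max())

def incoherence(L, tol=1e-10):
    """max_i ||P_U e_i||_2 for the row/column space U of the symmetric matrix L."""
    w, V = np.linalg.eigh(L)
    U = V[:, np.abs(w) > tol]
    # since U has orthonormal columns, ||P_U e_i|| equals the norm of row i of U
    return float(np.linalg.norm(U, axis=1).max())
```

For example, a rank-one matrix supported on a single diagonal entry has incoherence 1 (perfectly aligned with a coordinate axis), whereas the all-ones matrix of size 4 has incoherence 1/2, the smallest possible value for rank one in that dimension.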
Altogether, we want both and to be small to avoid confusion of the sparse and the low-rank parts. Now, in Problem SL, the parameter controls the tradeoff between the regularization term that promotes sparsity, that is, the $\ell_1$-norm term, and the regularization term that promotes low rank, that is, the nuclear norm term. It turns out that the range of values for that are feasible for our consistency analysis becomes larger if and are small. Indeed, the following assumption ensures that the range of values of that are feasible for our consistency analysis is nonempty.
Assumption 2 (feasibility)
The range with
is nonempty. Here, we use the additional problem-specific constant with
The feasibility assumption is equivalent to
Note that this upper bound on the product is essentially controlled by the product . It is easier to satisfy when the latter product is large. This is well aligned with the stability assumption, because in terms of the stability assumption the good case is that the product is large, or more specifically that is large and is close to .
Gap assumption.
Intuitively, if the smallest-magnitude nonzero entry of is too small, then it is difficult to recover the support of . Similarly, if the smallest nonzero eigenvalue of is too small, then it is difficult to recover the rank of . Hence, we make the following final assumption.
Assumption 3 (Gap)
We require that
where and are problem-specific constants that are specified more precisely later.
Recall that the regularization parameter controls how strongly the eigenvalues of the solution and the entries of the solution are driven to zero. Hence, the required gaps get weaker as the number of sample points grows, because the parameter goes to zero as goes to infinity.
3.3 Consistency theorem
We state our consistency result using problem-specific data-independent constants , and . Their exact definitions can be found alongside the proof in Section 5.1. Also note that the norm compatibility constant is implicitly related to the number of latent variables . This is because as we have seen above and , see Chandrasekaran et al. (2011). Hence, the smaller , the better the upper bound on can be. Therefore, we track and explicitly in our analysis.
Theorem 3.3 (Consistency) Let be a sparse matrix and let be a low-rank matrix. Denote by and the tangent spaces at and , respectively to the variety of symmetric sparse matrices and to the variety of symmetric low-rank matrices. Suppose that we observed samples drawn from a pairwise Ising model with interaction matrix such that the stability assumption, the feasibility assumption, and the gap assumption hold. Moreover let , and assume that for the number of sample points it holds that
and that the regularization parameter is set as
Then, it follows with probability at least that the solution to the convex program SL is

parametrically consistent, that is, , and

algebraically consistent, that is, and have the same support (actually, the signs of corresponding entries coincide), and and have the same ranks.
3.4 Outline of the proof
The proof of Theorem 3.3 is similar to the one given in Chandrasekaran et al. (2012) for latent variable models with observed Gaussians. More generally, it builds on a version of the primal-dual-witness proof technique. The proof consists of the following main steps:

First, we consider the correct model set whose elements are all parametrically and algebraically consistent under the stability, feasibility, and gap assumptions. Hence, any solution to our problem, if additionally constrained to , is consistent.

Second, since the set is nonconvex, we consider a simplified and linearized version of the set and show that the solution to the problem constrained to the linearized model space is unique and equals . Since it is the same solution, consistency follows from the first step.

Third, we show that the solution also solves Problem SL. More precisely, we show that this solution is strictly dual feasible and hence can be used as a witness as required for the primal-dual-witness technique. This implies that it is also the unique solution, with all the consistency properties from the previous steps.

Finally, we show that the assumptions from Theorem 3.3 entail all those made in the previous steps with high probability. Thereby, the proof is concluded.
4 Discussion
Our result, which constitutes the first high-dimensional consistency analysis for sparse + low-rank Ising models, requires slightly more samples (in the sense of an additional logarithmic factor , and polynomial probability) than were required for consistent recovery for the sparse + low-rank Gaussian models considered by Chandrasekaran et al. (2012). This is because the strong tail properties of multivariate Gaussian distributions do not hold for multivariate Ising distributions. Hence, it is more difficult to bound the sampling error
of the second-moment matrices, which results in weaker probabilistic spectral norm bounds of this sampling error. Under our assumptions, we believe that the sampling complexity, that is, the number of samples required for consistent recovery of sparse + low-rank Ising models, cannot be improved. We also provided a detailed discussion of why all of our assumptions are important.
It would be interesting to test for consistency experimentally, but this is better done using a pseudo-likelihood approach because it avoids the problem of computing costly normalizations. We believe that likelihood and pseudo-likelihood behave similarly, but so far only much weaker guarantees are known for the pseudo-likelihood approach than the ones that we prove here.
We gratefully acknowledge financial support from the German Science Foundation (DFG) grant (GI711/51) within the priority program (SPP 1736) Algorithms for Big Data.
References
Bach et al. (2012) Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
Candès and Soltanolkotabi (2012) Emmanuel J. Candès and Mahdi Soltanolkotabi. Discussion: Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1996–2004, 2012.
Chandrasekaran et al. (2011) Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
Chandrasekaran et al. (2012) Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1935–1967, 2012.
Chen et al. (2016) Yunxiao Chen, Xiaoou Li, Jingchen Liu, and Zhiliang Ying. A fused latent and graphical model for multivariate binary data. Technical report, arXiv preprint arXiv:1606.08925, 2016.
Chen et al. (2018) Yunxiao Chen, Xiaoou Li, Jingchen Liu, and Zhiliang Ying. Robust measurement via a fused latent and graphical item response theory model. Psychometrika, pages 1–25, 2018.
Cheng et al. (2017) Jie Cheng, Tianxi Li, Elizaveta Levina, and Ji Zhu. High-dimensional mixed graphical models. Journal of Computational and Graphical Statistics, 26(2):367–378, 2017.
Dudík and Schapire (2006) Miroslav Dudík and Robert E. Schapire. Maximum entropy distribution estimation with generalized regularization. In Conference on Learning Theory (COLT), pages 123–138, 2006.
Dudík et al. (2004) Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Performance guarantees for regularized maximum entropy density estimation. In Conference on Learning Theory (COLT), pages 472–486, 2004.
Hambleton et al. (1991) Ronald K. Hambleton, Hariharan Swaminathan, and H. Jane Rogers. Fundamentals of item response theory. Sage, 1991.
Jalali et al. (2011) Ali Jalali, Pradeep Ravikumar, Vishvas Vasuki, and Sujay Sanghavi. On learning discrete graphical models using group-sparse regularization. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 378–387, 2011.
Jaynes (1957) Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
Lauritzen (1996) Steffen L. Lauritzen. Graphical models. Oxford University Press, 1996.
Lauritzen and Meinshausen (2012) Steffen L. Lauritzen and Nicolai Meinshausen. Discussion: Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1973–1977, 2012.
Lee and Hastie (2015) Jason D. Lee and Trevor J. Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230–253, 2015.
Lee et al. (2015) Jason D. Lee, Yuekai Sun, and Jonathan E. Taylor. On model selection consistency of regularized estimators. Electronic Journal of Statistics, 9(1):608–642, 2015.
Loh and Wainwright (2012) Po-Ling Loh and Martin J. Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In Conference on Neural Information Processing Systems (NIPS), pages 2096–2104, 2012.
Meinshausen and Bühlmann (2006) Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
Ravikumar et al. (2010) Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
Ravikumar et al. (2011) Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Technical report, arXiv preprint arXiv:1011.3027, 2010.
Wainwright (2009) Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55(5):2183–2202, 2009.
Wainwright (2012) Martin J. Wainwright. Discussion: Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1978–1983, 2012.
Watson (1992) G. Alistair Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:33–45, 1992.
5 Proof of the Consistency Theorem
In this section, we prove Theorem 3.3.
5.1 Preliminaries
Here, we give an overview of basic definitions and constants that are used throughout the paper. The constants are also necessary to refine the problem-specific constants that appear in the assumptions and claims of Theorem 3.3.
Duplication operator.
Throughout we use the duplication operator
Norms.
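During the course of the paper we use several matrix norms. For $M \in \mathbb{R}^{d \times d}$, the $\ell_1$-norm and the nuclear norm are given by
$$\|M\|_1 = \sum_{i,j} |M_{ij}| \quad\text{and}\quad \|M\|_* = \sum_{i} \sigma_i,$$
where $\sigma_i$ are the singular values of $M$. Note that for $M \succeq 0$ it holds $\|M\|_* = \operatorname{tr}(M)$. We also use the respective dual norms. They are the $\ell_\infty$-norm and the spectral norm given by
$$\|M\|_\infty = \max_{i,j} |M_{ij}| \quad\text{and}\quad \|M\| = \max_{\|x\|_2 = 1} \|Mx\|_2,$$
where $\|\cdot\|_2$ is the standard Euclidean norm for vectors.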
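The duality between the two norm pairs can be checked numerically on a small example. The following is a minimal sketch in plain Python, assuming symmetric 2×2 matrices so that the eigenvalues are available in closed form. The helper names (`l1_norm`, `nuclear_norm_sym2`, and so on) and the matrices `psd` and `other` are illustrative and not taken from the paper.

```python
import math


def l1_norm(m):
    """Entrywise l1 norm: sum of the absolute values of all entries."""
    return sum(abs(x) for row in m for x in row)


def linf_norm(m):
    """Entrywise l-infinity norm: largest absolute entry."""
    return max(abs(x) for row in m for x in row)


def eigvals_sym2(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]] in closed form."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def nuclear_norm_sym2(m):
    """Nuclear norm; for symmetric matrices the singular values are |eigenvalues|."""
    return sum(abs(ev) for ev in eigvals_sym2(m))


def spectral_norm_sym2(m):
    """Spectral norm: largest singular value, here the largest |eigenvalue|."""
    return max(abs(ev) for ev in eigvals_sym2(m))


def dot(a, b):
    """Standard matrix dot product <A, B> = sum_ij A_ij * B_ij."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Illustrative matrices (assumptions, not from the paper).
psd = [[2.0, 1.0], [1.0, 2.0]]        # positive semidefinite, eigenvalues 1 and 3
other = [[1.0, -3.0], [-3.0, 0.5]]    # symmetric, indefinite

# Dual-norm (Hoelder-type) bounds on the inner product.
holder_entrywise = abs(dot(psd, other)) <= l1_norm(psd) * linf_norm(other)
holder_spectral = abs(dot(psd, other)) <= nuclear_norm_sym2(psd) * spectral_norm_sym2(other)
```

For the positive semidefinite matrix `psd`, the nuclear norm coincides with the trace, and both boolean flags evaluate to true, which is the dual-norm inequality for each pair.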
Second-moment matrices and norm of the Hessian.
In the introduction we used and which actually are the empirical and the population version of the second-moment matrix, that is, and , where the expectation is taken w.r.t. the true Ising model distribution with parameter matrix . Note that the gradient of the negative log-likelihood satisfies , where we denoted . Moreover, we denote the Hessian as and its operator norm is given by
Norm compatibility constant.
Since we will encounter the following constant several times in the proof we give it its own symbol
Later in Section 5.4, we will show that is essentially a norm compatibility constant between the norm and the spectral norm.
Problem-specific constants.
Minimum number of samples required (precise).
We require at least
samples for consistent recovery, where is a positive constant that is used to control the probability with which consistent recovery is possible.
Choice of (precise).
For our consistency analysis we choose the following value for the tradeoff parameter between the negative loglikelihood and the regularization terms
Gap assumption (precise).
The precise gap assumptions on the smallest-magnitude nonzero entry of and the smallest nonzero eigenvalue of are given by
5.2 Tangent space lemmas
Low-rank tangent spaces play a fundamental role throughout the proof. Therefore, we characterize the tangent spaces at smooth points of the low-rank variety before moving on.
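Suppose $M$ is a symmetric $d \times d$ matrix of rank $r$. Then, the tangent space to $\mathcal{L}(r)$, the variety of symmetric matrices of rank at most $r$, at $M$ is given by
$$T(M) = \{ U X^\top + X U^\top : X \in \mathbb{R}^{d \times r} \},$$
where $M = U D U^\top$ is the (restricted) eigenvalue decomposition of $M$, that is, $U \in \mathbb{R}^{d \times r}$ has orthonormal columns and $D \in \mathbb{R}^{r \times r}$ is diagonal with the nonzero eigenvalues on the diagonal. The tangent space at $M$ is given by the span of all tangent vectors at zero to smooth curves $\gamma$ in $\mathcal{L}(r)$ initialized at $M$, that is, $\gamma(0) = M$. Because $M$ has rank $r$ it is a smooth point of $\mathcal{L}(r)$, and we can assume that $\gamma(t) = A(t) E A(t)^\top$ with rank-$r$ matrices $A(t) \in \mathbb{R}^{d \times r}$ for all $t$, where $E$ is the diagonal matrix whose diagonal entries are the signs of the eigenvalues of $M$, that is, they are in $\{-1, +1\}$. We can assume the signs of the eigenvalues along the curve to be fixed because we only consider smooth curves. In particular $\gamma(0) = A(0) E A(0)^\top = U D U^\top$, so we can take
$A(0) = U |D|^{1/2}$. Now, by the chain rule it holds
$$\gamma'(0) = A'(0) E A(0)^\top + A(0) E A'(0)^\top = X U^\top + U X^\top \quad\text{with}\quad X = A'(0) E |D|^{1/2}.$$
We still need to show that $X$ can take arbitrary values. To do so, for any $Y \in \mathbb{R}^{d \times r}$ consider $A(t) = U |D|^{1/2} + tY$, which has rank $r$ for sufficiently small $t$ since $A(0)$ has rank $r$ and the curve is smooth. Moreover, it holds $A'(0) = Y$. Now with the particular choice of $Y = X |D|^{-1/2} E$, since
$$A'(0) E |D|^{1/2} = X |D|^{-1/2} E E |D|^{1/2} = X,$$
the tangential vector of the corresponding curve at zero is $U X^\top + X U^\top$.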
Note that the variety of symmetric $d \times d$ matrices of rank at most $r$ has dimension $rd - r(r-1)/2$. Since a rank-$r$ matrix is a smooth point of this variety, the tangent space at such a point has the same dimension.
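The dimension count can be verified by brute force for small sizes. The sketch below assumes the standard parametrization of the tangent space as $\{UX^\top + XU^\top\}$ with $U$ having orthonormal columns, and the standard dimension $rd - r(r-1)/2$ of the variety of symmetric $d \times d$ matrices of rank at most $r$. For simplicity, $U$ is taken to be the first $r$ standard basis vectors. All function names are hypothetical.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def tangent_vector(u, x):
    """Return U X^T + X U^T for d x r matrices U and X."""
    ux = matmul(u, transpose(x))
    xu = matmul(x, transpose(u))
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(ux, xu)]


def rank(rows):
    """Rank via exact Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                factor = rows[i][col] / rows[rk][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk


def tangent_space_dimension(d, r):
    """Dimension of the image of X -> U X^T + X U^T, with U = [e_1, ..., e_r]."""
    u = [[1 if i == j else 0 for j in range(r)] for i in range(d)]
    images = []
    for i in range(d):
        for j in range(r):
            x = [[0] * r for _ in range(d)]
            x[i][j] = 1  # basis matrix of the parameter space R^{d x r}
            t = tangent_vector(u, x)
            images.append([v for row in t for v in row])  # vectorize the image
    return rank(images)


def expected_dimension(d, r):
    """Closed-form dimension r*d - r*(r-1)/2."""
    return r * d - r * (r - 1) // 2
```

The map $X \mapsto UX^\top + XU^\top$ has a kernel of dimension $r(r-1)/2$ (the matrices $X = UK$ with $K$ skew-symmetric), which explains why the image is smaller than the $rd$-dimensional parameter space.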
One consequence of the form of the tangent spaces is the following lemma that concerns the norms of projections on certain tangent spaces and their orthogonal complements.
For any two tangent spaces and at any smooth points w.r.t. the varieties and , respectively, we can bound the norms of projections of matrices in the following manner:
In particular, for we have
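Recall that from Lemma 5.2 we have for smooth points $L$ of the low-rank variety that
$$T(L) = \{ U X^\top + X U^\top : X \in \mathbb{R}^{d \times r} \},$$
where $L = U D U^\top$ is the (restricted) eigenvalue decomposition of $L$. Then, for symmetric $M$, we have more explicitly that
$$P_{T(L)} M = P_U M + M P_U - P_U M P_U,$$
where $P_U = U U^\top$ projects onto the column space of $U$. Note that $P_{T(L)} M \in T(L)$ since
$$P_U M + M P_U - P_U M P_U = U X^\top + X U^\top \quad\text{with}\quad X = \left(I - \tfrac{1}{2} P_U\right) M U,$$
and that $M - P_{T(L)} M$ is orthogonal to $T(L)$ since
$$M - P_{T(L)} M = (I - P_U) M (I - P_U)$$
and since for any $X \in \mathbb{R}^{d \times r}$ we have
$$\begin{aligned}
\langle (I - P_U) M (I - P_U),\, U X^\top + X U^\top \rangle
&= \operatorname{tr}\big((I - P_U) M (I - P_U) U X^\top\big) + \operatorname{tr}\big((I - P_U) M (I - P_U) X U^\top\big) \\
&= \operatorname{tr}\big((I - P_U) M (I - P_U) X U^\top\big) \\
&= \operatorname{tr}\big(U^\top (I - P_U) M (I - P_U) X\big) \\
&= 0,
\end{aligned}$$
where the second and last equality follow from $(I - P_U) U = 0$, and we used $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ in the third equality. Thus, $P_{T(L)} M$ is indeed the orthogonal projection of $M$ onto $T(L)$. Now, by submultiplicativity of the spectral norm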
since and
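The projection onto a low-rank tangent space used in this proof can be sanity-checked with exact rational arithmetic. The sketch below assumes the standard closed form $P_T M = P_U M + M P_U - P_U M P_U$ with $P_U = UU^\top$ for the projection onto $T = \{UX^\top + XU^\top\}$. The example matrices `u`, `m`, and `x` and all function names are hypothetical.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(a, b)]


def sub(a, b):
    return [[p - q for p, q in zip(r1, r2)] for r1, r2 in zip(a, b)]


def inner(a, b):
    """Standard matrix dot product <A, B> = sum_ij A_ij * B_ij."""
    return sum(p * q for r1, r2 in zip(a, b) for p, q in zip(r1, r2))


def project_onto_tangent_space(u, m):
    """Assumed closed form: P_U M + M P_U - P_U M P_U with P_U = U U^T."""
    p = matmul(u, transpose(u))
    pm = matmul(p, m)
    mp = matmul(m, p)
    pmp = matmul(pm, p)
    return sub(add(pm, mp), pmp)


# U has a single orthonormal column (3/5, 4/5, 0), so r = 1 and d = 3.
u = [[Fraction(3, 5)], [Fraction(4, 5)], [Fraction(0)]]
# An arbitrary symmetric matrix to project.
m = [[Fraction(v) for v in row] for row in [[1, 2, 3], [2, -1, 4], [3, 4, 5]]]

pt_m = project_onto_tangent_space(u, m)
residual = sub(m, pt_m)

# An arbitrary element U X^T + X U^T of the tangent space.
x = [[Fraction(2)], [Fraction(-7)], [Fraction(1)]]
tangent = add(matmul(u, transpose(x)), matmul(x, transpose(u)))

is_idempotent = project_onto_tangent_space(u, pt_m) == pt_m
is_orthogonal = inner(residual, tangent) == 0
```

Using `Fraction` keeps every comparison exact, so idempotence and orthogonality of the residual hold as equalities rather than up to floating-point tolerance.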