 # Discussion: Latent variable graphical model selection via convex optimization

Discussion of "Latent variable graphical model selection via convex optimization" by Venkat Chandrasekaran, Pablo A. Parrilo and Alan S. Willsky [arXiv:1008.1290].


## Latent variable model selection

The proposed scheme is an extension of the graphical lasso of Yuan and Lin [15] (see also Banerjee et al. [1] and Friedman et al. [6]), which is a popular approach for learning the structure of an undirected Gaussian graphical model. In this setup, we assume we have independent samples from a distribution whose covariance matrix exhibits a sparse dependence structure but is otherwise unknown; that is to say, most pairs of variables are conditionally independent given all the others. Formally, the concentration (inverse covariance) matrix is assumed to be sparse. A natural fitting procedure is then to regularize the likelihood by adding a term proportional to the ℓ₁ norm of the estimated inverse covariance matrix S:

 minimize −ℓ(S; Σₙ) + λ∥S∥₁ (1)

under the constraint S ≻ 0, where Σₙ is the empirical covariance matrix, ℓ(S; Σₙ) = log det S − trace(SΣₙ) is the Gaussian log-likelihood, and ∥S∥₁ = Σᵢⱼ |Sᵢⱼ|. (Variants are possible depending upon whether or not one would want to penalize the diagonal elements.) This problem is convex.
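The premise behind the graphical lasso — that sparsity of the concentration matrix encodes conditional independence — can be made concrete with a small simulation. Below is a toy sketch (not the penalized estimator itself): we pick a chain-graph precision matrix of our own choosing, sample from the corresponding Gaussian, and check that inverting the empirical covariance recovers the sparsity pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200_000

# A sparse concentration (inverse covariance) matrix for a chain graph:
# variable i interacts only with its neighbors i-1 and i+1.
K = np.eye(p) * 2.0
for i in range(p - 1):
    K[i, i + 1] = K[i + 1, i] = 0.6

Sigma = np.linalg.inv(K)                      # the covariance itself is dense...
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# ...yet the inverse of the empirical covariance reveals the sparse structure:
# entries for non-neighboring pairs are near zero.
K_hat = np.linalg.inv(np.cov(X, rowvar=False))
print(np.round(K_hat, 2))
```

With ample data the zero pattern is evident by inspection; the point of the ℓ₁ penalty in (1) is to recover this pattern reliably when n is small relative to p.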

When some variables are unobserved—the observed and hidden variables are still jointly Gaussian—the model above may not be appropriate because the hidden variables can have a confounding effect. An example is this: we observe stock prices of companies and would like to infer conditional (in)dependence. Suppose, however, that all these companies rely on a commodity, a source of energy, for instance, which is not observed. Then the stock prices might appear dependent even though they may not be once we condition on the price of this commodity. In fact, the marginal inverse covariance of the observed variables decomposes into two terms, S − L. The first term S is the concentration matrix of the observed variables in the full model conditioned on the latent variables. The second term L is the effect of marginalization over the hidden variables. Assuming a sparse graphical model, S is sparse, whereas L may have low rank; in particular, the rank is at most the number of hidden variables. The authors then penalize the negative log-likelihood with a term proportional to

 γ∥S∥₁ + trace(L) (2)

since the trace functional is the usual convex surrogate for the rank over the cone of positive semidefinite matrices. The constraints are S − L ≻ 0 and L ⪰ 0.
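The sparse-minus-low-rank decomposition above is just the Schur complement identity: if the joint concentration matrix over (observed, hidden) is partitioned into blocks K_OO, K_OH, K_HH, the marginal concentration of the observed block equals K_OO − K_OH K_HH⁻¹ K_HO, where the subtracted term has rank at most the number of hidden variables. A minimal numerical check (with an arbitrary sparse joint model of our own construction):

```python
import numpy as np

rng = np.random.default_rng(1)
p, h = 8, 2                       # 8 observed variables, 2 hidden variables

# A sparse joint concentration matrix over (observed, hidden).
K = np.eye(p + h) * 5.0
K[0, 1] = K[1, 0] = 1.0           # one sparse edge among the observed variables
K[:p, p:] = rng.choice([0.0, 0.5], size=(p, h))   # observed-hidden edges
K[p:, :p] = K[:p, p:].T

K_OO, K_OH, K_HH = K[:p, :p], K[:p, p:], K[p:, p:]

# Marginal concentration of the observed block = Schur complement:
# the sparse term K_OO minus a term of rank at most h.
low_rank = K_OH @ np.linalg.inv(K_HH) @ K_OH.T
K_marginal = K_OO - low_rank

Sigma_OO = np.linalg.inv(K)[:p, :p]               # marginal covariance
print(np.allclose(np.linalg.inv(Sigma_OO), K_marginal),
      np.linalg.matrix_rank(low_rank))
```

This is exactly why the penalty (2) treats S with an ℓ₁ norm and L with a trace/nuclear surrogate: the two terms of the marginal concentration have qualitatively different structure.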

The penalty (2) is simple and flexible since it does not really make special parametric assumptions. To be truly appealing, it would also need to be adaptive in the following sense: suppose there is no hidden variable; then does the low-rank + sparse (L + S) model behave as well, or nearly as well, as the graphical lasso? When there are few hidden variables, does it behave nearly as well? Are there such theoretical guarantees? If this were the case, it would say that using the L + S model protects against the danger of not having accounted for all possible covariates, while, if there were no hidden variable, one would not suffer any loss of performance. Thus, we would get the best of both worlds.

At first sight, the analysis presented in this paper does not allow us to reach this conclusion. If the observed variable is p-dimensional, the number of samples needed to show that one can obtain accurate estimates scales like p/ξ⁴, where ξ is a modulus of continuity introduced in the paper that is typically much smaller than 1. We can think of 1/ξ as being related to the maximum degree d of the graph, so that the condition may be interpreted as having a number of observations very roughly scaling like p d⁴. In addition, accurate estimation holds with the proviso that the signal is strong enough; here, both the minimum nonzero singular value of the low-rank component and the minimum nonzero entry of the sparse component scale like √(p/n). On the other hand, when there are no hidden variables, a line of work [11], [13], [14] has established that we could estimate the concentration matrix with essentially the same accuracy if n ≳ d² log p and the magnitude of the minimum nonvanishing entry of the concentration matrix scales like √((log p)/n). As before, d is the maximum degree of the graphical model. In the high-dimensional regime, the results offered by this literature seem considerably better. It would be interesting to know whether this gap could be bridged, and if so, under what types of conditions—if any.

Interestingly, such adaptivity properties have been established for related problems. For instance, the L + S model has been used to suggest the possibility of a principled approach to robust principal component analysis [2]. Suppose we have incomplete and corrupted information about an n × n low-rank matrix L. More precisely, we observe Mᵢⱼ = Lᵢⱼ + Sᵢⱼ for (i, j) ∈ Ω_obs. We think of S as a corruption pattern, so that some entries are totally unreliable, but we do not know which ones. Then [2] shows that under rather broad conditions, the solution to

 minimize ∥L∥∗ + λ∥S∥₁ (3)
 subject to Mᵢⱼ = Lᵢⱼ + Sᵢⱼ, (i, j) ∈ Ω_obs,

where ∥L∥∗ is the nuclear norm, recovers L exactly. Now suppose there are no corruptions. Then we are facing a matrix completion problem and, instead, one would want to minimize the nuclear norm of L under the data constraints. In other words, there is no need for S in (3). The point is that there is a fairly precise understanding of the minimal number of samples needed for this strategy to work; for incoherent matrices [3], the number of samples must be on the order of nr, up to logarithmic factors, where r is the rank of L. Now some recent work [10] establishes the adaptivity in question. In detail, (3) recovers L from a minimal number of samples, in the sense defined above, even though a positive fraction may be corrupted. That is, the number of reliable samples one needs, regardless of whether corruption occurs, is essentially the same. Results of this kind extend to other settings as well. For instance, in sparse regression or compressive sensing we seek a sparse solution to y = Xb by minimizing the ℓ₁ norm of b. Again, we may be worried that some equations are unreliable because of gross errors and would solve, instead,

 minimize ∥b∥₁ + λ∥e∥₁ (4)
 subject to y = Xb + e

to achieve robustness. Here, [10] shows that the minimal number of reliable samples/equations required, regardless of whether the data are clean or corrupted, is essentially the same.
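Problem (3) in the fully observed case (Ω_obs containing all entries) can be solved by a short alternating scheme built from two proximal operators: entrywise soft-thresholding for the ℓ₁ term and singular value thresholding for the nuclear norm. The sketch below is a minimal ADMM-style solver, not the exact algorithm of [2]; the parameter choices λ = 1/√n and a fixed μ set from ∥M∥₁ follow common practice and are assumptions.

```python
import numpy as np

def shrink(M, tau):
    """Entrywise soft-thresholding (prox of the l1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding (prox of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca(M, lam=None, mu=None, n_iter=300):
    """Alternating scheme for: minimize ||L||_* + lam*||S||_1  s.t.  M = L + S."""
    n = max(M.shape)
    lam = lam if lam is not None else 1.0 / np.sqrt(n)
    mu = mu if mu is not None else 0.25 * M.size / (np.abs(M).sum() + 1e-12)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)          # scaled dual variable
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
    return L, S

# Synthetic check: a rank-2 matrix plus ~5% gross corruptions.
rng = np.random.default_rng(2)
n = 60
L0 = rng.standard_normal((n, 2)) @ rng.standard_normal((2, n))
S0 = np.zeros((n, n))
mask = rng.random((n, n)) < 0.05
S0[mask] = 10.0 * rng.standard_normal(mask.sum())

L, S = rpca(L0 + S0)
print(np.linalg.norm(L - L0) / np.linalg.norm(L0))   # relative recovery error
```

On easy instances like this one, the low-rank component is recovered to within a small relative error even though the corrupted entries are large and their locations unknown.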

## The versatility of the L+S model

We now move on to discuss the L + S model more generally and survey a set of circumstances where it has proven useful and powerful. To begin with, methods which simply minimize an ℓ₁ norm, or a nuclear norm, or a combination thereof are seductive because they are flexible and apply to a rich class of problems. The L + S model is nonparametric and does not make many assumptions. As a result, it is widely applicable to problems ranging from latent variable model selection [4] (arguably one of the most subtle and beautiful applications of this method) to video surveillance in computer vision and document classification in machine learning [2]. In any given application, when much is known about the problem, it may not return the best possible answer, but our experience is that it is always fairly competitive. That is, the little performance loss we might encounter is more than compensated for by the robustness we gain vis-à-vis various modeling assumptions, which may or may not hold in real applications. A few recent applications of the L + S model demonstrate its flexibility and robustness.

## Applications in computer vision

The L + S model has been applied to several problems in computer vision, most notably by the group of Yi Ma and colleagues. Although the low-rank + sparse model may not hold precisely, the nuclear-norm relaxation appears practically robust. This stands in contrast with algorithms which use detailed modeling assumptions and may not perform well under slight model mismatch or variation.

### Video surveillance

An important task in computer vision is to separate background from foreground. Suppose we stack a sequence of video frames as columns of a matrix (rows are pixels and columns are time points); then it is not hard to imagine that the background will have low rank since it does not change much over time, while foreground objects, such as cars, pedestrians and so on, can be seen as a sparse disturbance. Hence, finding an L + S decomposition offers a new way of modeling the background (and the foreground). This method has been applied with some success [2]; see also the online videos Video 1 and Video 2.
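The stacked-frames argument can be illustrated with synthetic "frames" (a toy construction of our own, not the real surveillance pipeline of [2]): a static background repeated in every column gives a rank-1 matrix, and a small moving object touches only a few pixels per frame, giving a sparse one.

```python
import numpy as np

rng = np.random.default_rng(3)
n_pixels, n_frames = 100, 20

# Static background: the same image in every column -> a rank-1 matrix L.
background = rng.random(n_pixels)
L = np.tile(background[:, None], (1, n_frames))

# Foreground: a small bright "object" occupying 3 pixels per frame and
# moving over time -> a sparse matrix S.
S = np.zeros((n_pixels, n_frames))
for t in range(n_frames):
    S[5 * t : 5 * t + 3, t] = 1.0

M = L + S                            # the stacked video frames
print(np.linalg.matrix_rank(L), (S != 0).sum() / S.size)
```

In real footage the background is only approximately static (lighting drift, camera jitter), so L is only approximately low rank; this is precisely where the robustness of the convex relaxation matters.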

### From textures to 3D

One of the most fundamental steps in computer vision consists of extracting relevant features that are subsequently used for high-level vision applications such as 3D reconstruction, object recognition and scene understanding. There has been limited success in extracting stable features across variations in lighting, rotations and viewpoints. Partial occlusions further complicate matters. For certain classes of 3D objects, such as images with regular symmetric patterns/textures, one can bypass the extraction of local features to recover 3D structure from 2D views. To fix ideas, a vertical or horizontal strip can be regarded as a rank-1 texture and a corner as a rank-2 texture. Generally speaking, surfaces may exhibit a low-rank texture when seen from a suitable viewpoint; see Figure 1. However, their 2D projections as captured by a camera will typically not be low rank. To see why, imagine there is a low-rank texture on a planar surface: the image we observe is a transformed version of this texture. A technique named TILT [16] recovers the texture (and the transformation) simply by seeking a low-rank + sparse superposition. In spite of idealized assumptions, Figures 1 and 2 show that the L + S model works well in practice.

Figure 1: (a) Pair of images from distinct viewpoints. (b) 3D reconstruction (TILT) from the photographs in (a) using the L + S model. The geometry is recovered from two images.

Figure 2: We are given the 16 images on the right. The task is to remove the clutter and align the images. Stacking each image as a column of a matrix, we look for planar homographies that reveal a low-rank plus sparse structure [12]. From left to right: original data set, aligned images, low-rank component (columns of L), sparse component (columns of S).

### Compressive acquisition

In the spirit of compressive sensing, the L + S model can also be used to speed up the acquisition of large data sets or lower the sampling rate. At the moment, the theory of compressive sensing relies on the sparsity of the object we wish to acquire; however, in some setups the L + S model may be more appropriate. To explain our ideas, it might be best to start with two concrete examples. Suppose we are interested in the efficient acquisition of either (1) a hyper-spectral image or (2) a video sequence. In both cases, the object of interest is a data matrix X, where each column is an n-pixel image and each column corresponds to a specific wavelength (as in the hyper-spectral example) or frame (or time point, as in the video example). In the first case, the data matrix may be thought of as X(i, λ), where i indexes position and λ wavelength, whereas in the second example we have X(i, t), where t is a time index. We would like to obtain a sequence of highly resolved images from just a few measurements; an important application concerns dynamic magnetic resonance imaging, where it is only possible to acquire a few samples in k-space per time interval.

Clearly, frames in a video sequence are highly correlated in time. And in just the same way, two images of the same scene at nearby wavelengths are also highly correlated. Obviously, images are correlated in space as well. Suppose that W ⊗ F is a tensor basis, where W sparsifies images and F sparsifies time traces (W might be a wavelet transform and F a Fourier transform). Then we would expect WXF to be a nearly sparse matrix. With undersampled data of the form y = A(X) + z, where A is the operator supplying information about X and z is a noise term, this leads to the low-rank + sparse decomposition problem

 minimize ∥X∥∗ + λ∥WXF∥₁ (5)
 subject to ∥A(X) − y∥₂ ≤ ε,

where ε is determined by the noise power. A variation, which is more in line with the discussion paper, is a model in which L is a low-rank matrix modeling the static background and S is a sparse matrix roughly modeling the innovation from one frame to the next; for instance, S might encode the moving objects in the foreground. This would give

 minimize λ∥L∥∗ + ∥WSF∥₁ (6)
 subject to ∥A(L + S) − y∥₂ ≤ ε.

One could imagine that these models might be useful in alleviating the tremendous burden on system resources in the acquisition of ever larger 3D, 4D and 5D data sets.

We note that proposals of this kind have begun to emerge. As we were preparing this commentary, we became aware of [8], which suggests a model similar to (5) for hyperspectral imaging. The difference is that the second term in (5) is replaced by the sum of the total variations of the individual images; that is, one minimizes the total variation of each image instead of looking for sparsity simultaneously in space and wavelength/frequency. The results in [8] show that dramatic undersampling ratios are possible. In medical imaging, movement due to respiration can degrade the image quality of computed tomography (CT), which can lead to incorrect dosage in radiation therapy. Using time-stamped data, 4D CT has more potential for precise imaging. Here, one can think of the object as a matrix with rows labeling spatial variables and columns labeling time. In this context, we have a low-rank (static) background and a sparse disturbance corresponding to the dynamics, for example, of the heart in cardiac imaging. The recent work [7] shows how one can use the L + S model in a fashion similar to (6). This has interesting potential for dose reduction since the approach also supports substantial undersampling.

## Connections with theoretical computer science and future directions

A class of problems where further study is required concerns situations in which the low-rank and sparse components have a particular structure. One such problem is the planted clique problem. It is well known that finding the largest clique in a graph is NP-hard; in fact, it is even NP-hard to approximate the size of the largest clique in an n-vertex graph to within a factor n^(1−ε) for any ε > 0. Therefore, much research has focused on an "easier" problem. Consider a random graph G on n vertices where each edge is selected independently with probability 1/2. The expected size of its largest clique is known to be about 2 log₂ n. The planted clique problem adds a clique of size k to G. One hopes that it is possible to find the planted clique in polynomial time whenever k is substantially larger than log n. At this time, this is only known to be possible if k is on the order of √n or larger. In spite of its seemingly simple formulation, this problem has eluded theoretical computer scientists since 1998, and is regarded as a notoriously difficult problem in modern combinatorics. It is also fundamental to many areas in machine learning and pattern recognition. To emphasize its wide applicability, we mention a new connection with game theory. Roughly speaking, the recent work [9] shows that finding a near-optimal Nash equilibrium in two-player games is as hard as finding hidden cliques of size C log n, where C is some universal constant.

One can think about the planted clique as a low-rank + sparse decomposition problem. To be sure, the adjacency matrix of the graph can be written as the sum of two matrices: the low-rank component is of rank 1 and represents the clique of size k (a submatrix with all entries equal to one, with a one on the diagonal if and only if the vertex belongs to the hidden clique); the sparse component stands for the random edges. Interestingly, low-rank + sparse regularizations based on the nuclear and ℓ₁ norms have been applied to this problem [5]. (Here the clique is both low-rank and sparse and is the object of interest, so that one minimizes a combination of the nuclear norm and the ℓ₁ norm subject to data constraints.) The proofs show that these methods find cliques of size on the order of √n, thus recovering the best known results, but they may not be able to break this barrier. It would be interesting to investigate whether tighter relaxations, taking into account the specific structure of the low-rank and sparse components, can do better.
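At the √n scale mentioned above, even a simple spectral heuristic succeeds (this is a classical baseline, not the nuclear-norm relaxation of [5]): center the adjacency matrix, read the planted clique off the top eigenvector, and refine by counting neighbors. The sketch below is illustrative; all sizes and thresholds are our own choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 400, 100                      # n vertices, planted clique of size ~5*sqrt(n)

# G(n, 1/2) adjacency matrix with a clique planted on the first k vertices.
A = np.triu(rng.random((n, n)) < 0.5, 1).astype(float)
A = A + A.T
clique = np.arange(k)
A[np.ix_(clique, clique)] = 1.0
np.fill_diagonal(A, 0.0)

# Spectral step: subtract the expected edge density. The clique contributes a
# rank-one bump of size ~k/2, which dominates the ~sqrt(n) noise level once k
# is on the order of sqrt(n).
B = A - 0.5 * (np.ones((n, n)) - np.eye(n))
eigvals, eigvecs = np.linalg.eigh(B)
v = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
candidates = np.argsort(-np.abs(v))[:k]

# Refinement: keep the k vertices adjacent to most of the candidate set.
scores = A[:, candidates].sum(axis=1)
recovered = set(np.argsort(-scores)[:k])
print(len(recovered & set(clique)) / k)   # fraction of the clique recovered
```

The interesting open question raised in the text is whether any relaxation, convex or otherwise, can go below the √n threshold in polynomial time.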

## References

• (1) Banerjee, O., El Ghaoui, L. and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516.
• (2) Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37.
• (3) Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9 717–772.
• (4) Chandrasekaran, V., Sanghavi, S., Parrilo, P. A. and Willsky, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21 572–596.
• (5) Doan, X. V. and Vavasis, S. A. (2010). Finding approximately rank-one submatrices with the nuclear norm and ℓ₁ norm. Available at http://arxiv.org/abs/.
• (6) Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
• (7) Gao, H., Cai, J., Shen, Z. and Zhao, H. (2011). Robust principal component analysis-based four-dimensional computed tomography. Phys. Med. Biol. 56 3181.
• (8) Golbabaee, M. and Vandergheynst, P. (2012). Joint trace/TV norm minimization: A new efficient approach for spectral compressive imaging. In IEEE International Conference on Image Processing (ICIP), Orlando, FL.
• (9) Hazan, E. and Krauthgamer, R. (2011). How hard is it to approximate the best Nash equilibrium? SIAM J. Comput. 40 79–91.
• (10) Li, X. (2011). Compressed sensing and matrix completion with constant proportion of corruptions. Available at http://arxiv.org/abs/1104.1041.
• (11) Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
• (12) Peng, Y., Ganesh, A., Wright, J., Xu, W. and Ma, Y. (2010). RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA.
• (13) Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing ℓ₁-penalized log-determinant divergence. Electron. J. Stat. 5 935–980.
• (14) Rothman, A. J., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat. 2 494–515.
• (15) Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35.
• (16) Zhang, Z., Ganesh, A., Liang, X. and Ma, Y. (2012). TILT: Transform-invariant low-rank textures. International Journal of Computer Vision. To appear.