GLAD: Learning Sparse Graph Recovery

06/01/2019 · by Harsh Shrivastava, et al.

Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an ℓ_1 regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation and recover the graph structure. Recently, there has been a surge of interest in learning algorithms directly from data, which in this case means learning to map the empirical covariance matrix to the sparse precision matrix. However, this is a challenging task, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to the precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as its model inductive bias and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data.


1 Introduction

Recovering sparse conditional independence graphs from data is a fundamental problem in high-dimensional and time-series analysis, and it has found applications in diverse areas. In computational biology, a sparse graph structure between gene expression data may be used to understand gene regulatory networks; in finance, a sparse graph structure between financial time series may be used to understand the relationships between different financial assets. A popular formulation of the problem is an ℓ_1-regularized log-determinant estimation of the precision matrix. Based on this convex formulation, many algorithms have been designed to solve this problem efficiently, and one can formally prove that, under a list of conditions, the solution of the optimization problem is guaranteed to recover the graph structure with high probability.

However, convex optimization based approaches have their own limitations. The hyperparameters, such as the regularization parameter and the learning rate, may depend on unknown constants and need to be tuned carefully to achieve good recovery results. Furthermore, the formulation uses a single regularization parameter for all entries in the precision matrix, which may not be optimal. It is intuitive that one may obtain better recovery results by allowing the regularization parameters to vary across the entries of the precision matrix. However, such flexibility leads to a quadratic increase in the number of hyperparameters, and it is hard for traditional approaches to search over a large number of hyperparameters. Thus, a new paradigm may be needed for designing more effective sparse recovery algorithms.

Recently, there has been a surge of interest in a new paradigm of algorithm design, where algorithms are augmented with learning modules trained directly on data, rather than prescribing every step of the algorithm by hand. This is meaningful because, very often, a family of optimization problems needs to be solved again and again, similar in structure but different in data. A data-driven algorithm may be able to leverage this distribution of problem instances and learn an algorithm which performs better than the traditional convex formulation. In our case, the sparse graph recovery problem may also need to be solved again and again, where the underlying graphs are different but share similar degree distributions, magnitudes of the precision matrix entries, etc. For instance, gene regulatory networks may rewire depending on time and conditions, and we want to estimate them from gene expression data. Company relations may evolve over time, and we want to estimate their graph from stock data. Thus, we will also explore data-driven algorithm design in this paper.

However, it is very challenging to design a data-driven algorithm for precision matrix estimation. First, the input and output of the problem may be large. A neural network parametrization of direct mapping from the input covariance matrix to the output precision matrix may require as many parameters as the square of the number of dimensions. Second, there are many structure constraints in the output. The resulting precision matrix needs to be positive definite and sparse, which is not easy to enforce by a simple deep learning architecture. Third, direct mapping may result in a model with lots of parameters, and hence may require lots of data to learn. Thus a data-driven algorithm needs to be designed carefully to achieve a better bias-variance trade-off and satisfy the output constraints.

In this paper, we propose a deep learning model ‘GLAD’ with the following attributes:


  • Uses an unrolled Alternating Minimization (AM) algorithm as an inductive bias.

  • The regularization and the square penalty terms are parameterized as entry-wise functions of intermediate solutions, allowing GLAD to learn to perform entry-wise regularization update.

  • Furthermore, this data-driven algorithm is trained with a collection of problem instances in a supervised fashion, by directly comparing the algorithm outputs to the ground truth graphs.

In our experiments, we show that the AM architecture provides a very good inductive bias, allowing the model to learn a very effective sparse graph recovery algorithm with a small amount of training data. In all cases, the learned algorithm can recover sparse graph structures with far fewer data points from a new problem, and it also works well in recovering gene regulatory networks based on realistic gene expression data generators.

Related works.

Previous works have parametrized optimization algorithms as recurrent neural networks or as policies in reinforcement learning. For instance, Andrychowicz et al. [1] considered directly parametrizing an optimization algorithm as an RNN in a learning-to-learn framework. Li and Malik [9] approach the problem of automating algorithm design from a reinforcement learning perspective and represent a particular optimization algorithm as a policy. Khalil et al. [8] learn combinatorial optimization over graphs via deep Q-learning. These works did not consider the structure of our sparse graph recovery problem. Another interesting line of work develops deep neural networks by unfolding an iterative algorithm [7, 5, 10]. Liu et al. [10] developed ALISTA, which is based on unrolling the Iterative Shrinkage Thresholding Algorithm (ISTA). Sun et al. [16] developed ADMM-Net for compressive sensing of MRI data. Though these seminal works were primarily developed for compressive sensing applications, they allude to the general theme of using unrolled algorithms as inductive biases.

2 Sparse Graph Recovery Problem and Convex Formulation

Given m observations of a d-dimensional multivariate Gaussian random variable X = (X_1, …, X_d), the sparse graph recovery problem aims to estimate its covariance matrix Σ* and precision matrix Θ* = (Σ*)^{-1}. The (i, j)-th component of Θ* is zero if and only if X_i and X_j are conditionally independent given the other variables. Therefore, it is popular to impose an ℓ_1 regularization on the estimation of Θ* to increase its sparsity and lead to easily interpretable models. Following Banerjee et al. [2], the problem is formulated as the ℓ_1-regularized maximum likelihood estimation

Θ̂ = argmin_{Θ ∈ S^d_{++}}  −log det Θ + tr(Σ̂ Θ) + ρ ‖Θ‖_{1,off}        (1)

where Σ̂ is the empirical covariance matrix based on the m samples, S^d_{++} is the space of d × d symmetric positive definite (SPD) matrices, and ‖Θ‖_{1,off} = Σ_{i≠j} |Θ_{ij}| is the off-diagonal ℓ_1 regularizer with regularization parameter ρ. This estimator is sensible even for non-Gaussian X, since it minimizes an ℓ_1-penalized log-determinant Bregman divergence [14]. The sparse precision matrix estimation problem in Eq. (1) is a convex optimization problem which can be solved by many algorithms. Here we give a few canonical and advanced examples, which are compared in our later experiments:

G-ISTA. G-ISTA is a proximal gradient method which updates the precision matrix iteratively as

Θ_{k+1} = η_{ζ_k ρ}( Θ_k − ζ_k (Σ̂ − Θ_k^{-1}) )        (2)

where η_r(x) = sign(x)·max(|x| − r, 0) is the entry-wise soft-thresholding operator with threshold r. The step size ζ_k is determined by line search such that Θ_{k+1} is an SPD matrix [15].
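To make the update concrete, here is a minimal NumPy sketch of one such proximal-gradient step, i.e., soft-thresholding applied to a gradient step on the log-determinant objective. The fixed step size and the omission of the line search are simplifying assumptions for illustration, not the tuned procedure of [15].

import numpy as np

def soft_threshold(X, r):
    # Entry-wise soft-thresholding: eta_r(X) = sign(X) * max(|X| - r, 0).
    return np.sign(X) * np.maximum(np.abs(X) - r, 0.0)

def gista_step(theta, sigma_hat, rho, step):
    # Gradient of -logdet(Theta) + tr(Sigma_hat @ Theta) is Sigma_hat - Theta^{-1};
    # the proximal operator of the l1 penalty is soft-thresholding with threshold step*rho.
    grad = sigma_hat - np.linalg.inv(theta)
    return soft_threshold(theta - step * grad, step * rho)

# Toy usage: start from a diagonal initialization and iterate a few steps.
rng = np.random.default_rng(0)
d = 5
X = rng.standard_normal((100, d))
sigma_hat = np.cov(X, rowvar=False)
theta = np.diag(1.0 / (np.diag(sigma_hat) + 1e-3))
for _ in range(50):
    theta = gista_step(theta, sigma_hat, rho=0.1, step=0.1)

In the actual G-ISTA algorithm, the step size is backtracked at every iteration so that the iterate stays SPD; the fixed step above is only safe for well-conditioned toy problems.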

ADMM. The alternating direction method of multipliers transforms the problem into an equivalent constrained form, min_{Θ,Z} −log det Θ + tr(Σ̂ Θ) + ρ ‖Z‖_{1,off} subject to Θ = Z, which decouples the log-determinant term and the regularization term, and results in the following augmented Lagrangian with a penalty parameter λ:

L_λ(Θ, Z, U) = −log det Θ + tr(Σ̂ Θ) + ρ ‖Z‖_{1,off} + (λ/2) ‖Θ − Z + U‖_F² − (λ/2) ‖U‖_F²        (3)

Taking U as the scaled dual variable, the update rules for the ADMM algorithm are

Θ_{k+1} = argmin_Θ  −log det Θ + tr(Σ̂ Θ) + (λ/2) ‖Θ − Z_k + U_k‖_F²        (4)

Z_{k+1} = η_{ρ/λ}(Θ_{k+1} + U_k),    U_{k+1} = U_k + Θ_{k+1} − Z_{k+1}        (5)

where the Θ-update in Eq. (4) has a closed-form solution through an eigendecomposition.
BCD. Block-coordinate descent methods [6] update each column (and the corresponding row) of the precision matrix iteratively by solving a sequence of lasso problems. The algorithm is very efficient for large-scale problems involving thousands of variables.
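In practice, a BCD-style solver is available off the shelf; the snippet below shows typical usage of scikit-learn's GraphicalLasso, which is the package our BCD baseline relies on (see Section 5.2). The regularization value alpha=0.1 is only an illustrative choice.

import numpy as np
from sklearn.covariance import GraphicalLasso

# X: (num_samples, d) data matrix drawn from the multivariate Gaussian.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))

# GraphicalLasso solves the l1-penalized log-determinant problem;
# alpha plays the role of the regularization parameter rho in Eq. (1).
model = GraphicalLasso(alpha=0.1, max_iter=200).fit(X)
theta_hat = model.precision_   # estimated sparse precision matrix
sigma_fit = model.covariance_  # corresponding covariance estimate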

Apart from various algorithms, rigorous statistical analysis has also been provided for the optimal solution of the convex formulation in Eq. (1). Ravikumar et al. [14] established consistency of the estimator in Eq. (1) in terms of both Frobenius and spectral norms, at a rate that depends on the number of nonzero entries in Θ* and the sample size, with high probability. This statistical analysis also reveals certain limitations of the convex formulation:


  • The established consistency is based on a set of carefully chosen conditions, including a lower bound on the sample size, the sparsity level of Θ*, the degree of the graph, the magnitude of the entries in the covariance matrix, and the strength of interaction between the edge and non-edge terms in the precision matrix (i.e., mutual incoherence of the Hessian). In practice, a problem may not satisfy those recovery conditions.

  • The consistency also relies on choosing a specific regularization parameter ρ to formulate the correct convex optimization problem. The choice of ρ depends on the tail behavior of the maximum deviation between the sample and true covariance entries, which may not be accessible in practice. Furthermore, a set of optimization parameters also needs to be chosen to ensure that the optimization problem is solved to sufficient accuracy.

Therefore, it seems that there is still room for improving the above convex optimization algorithms for recovering the true graph structure. Furthermore, since the log-determinant estimator in Eq. (1) is NOT directly optimizing the recovery objective, there is also a mismatch between the optimization objective and the final evaluation objective. This increases the hope that one may improve the results by directly optimizing the recovery objective with algorithms learned from data.

3 Learning Data-Driven Algorithm for Graph Recovery

In the remainder of the paper, we will present a data-driven method to learn an algorithm for precision matrix estimation, and we call the resulting algorithm GLAD (which stands for Graph recovery Learning Algorithm using Data-driven training). We ask the question:

Given a family of precision matrices, is it possible to improve recovery results for sparse graphs by learning a data-driven algorithm?

More formally, suppose we are given precision matrices {Θ*_{(i)}} from a family of graphs, together with samples associated with each Θ*_{(i)}. These samples can be used to form sample covariance matrices {Σ̂_{(i)}}. We are interested in learning an algorithm GLAD_f for precision matrix estimation by solving a supervised learning problem, min_f Σ_i L(GLAD_f(Σ̂_{(i)}), Θ*_{(i)}), where f denotes the set of parameters in GLAD_f and the output GLAD_f(Σ̂_{(i)}) is expected to be a good estimate of Θ*_{(i)} in terms of an evaluation metric of interest L. The benefit is that this approach can directly optimize the final evaluation metric, which is related to the desired structural or graph properties of the family of problems. However, it is a challenging task to design a good parameterization of GLAD_f for this graph recovery problem. We will explain the challenges below and then present our solution.

3.1 Challenges in Designing Learning Models

Figure 1: A recurrent unit GLADcell.

In the literature on learning data-driven algorithms, most models are designed using traditional deep learning architectures, such as fully connected DNNs or recurrent neural networks. However, for graph recovery problems, directly using these architectures does not work well, for the following reasons.

First, using a fully connected neural network is not practical. Since both the input and the output of the graph recovery problem are d × d matrices, the number of parameters scales at least quadratically in the dimension d. Such a large number of parameters would need many input-output training pairs to estimate well. Thus some structure needs to be imposed in the network to reduce the number of parameters and the sample complexity.

Second, structured models such as convolutional neural networks (CNNs) have been applied to learn a mapping from the sample covariance matrix to the graph structure [4]. Due to the structure of CNNs, the number of parameters can be much smaller than in fully connected networks. However, a recovered graph should be permutation invariant with respect to the matrix rows/columns, and this constraint is very hard for CNNs to learn unless there are lots of samples. Also, the structure of a CNN is a bias imposed on the model, and there is no guarantee that this structure is appropriate for the problem.

Third, the intermediate results produced by both fully connected networks and CNNs are not interpretable, making it hard to diagnose the learned procedures and progressively output increasingly improved precision matrix estimators.

Fourth, the SPD constraint is hard to impose in traditional deep learning architectures.

The above limitations do, however, suggest a list of desiderata when designing learning models: small model size; minimalist learning; interpretable architecture; progressive improvement; and SPD output. These desiderata motivate the design of our deep architecture based on unrolled algorithms.

3.2 Glad: Deep Learning Model based on Unrolled Algorithm

To take into account the above desiderata, we use an unrolled algorithm as the template for the architecture design of GLAD. The unrolled algorithm already incorporates some problem structure, such as permutation invariance and interpretable intermediate results; but traditionally it does not have a learning component and is not directly amenable to gradient-based training. We will leverage this inductive bias in our architecture design, augment the unrolled algorithm with suitable and flexible learning components, and then train these embedded models with stochastic gradient descent.

The GLAD model is based on a reformulation of the original optimization problem in Eq. (1) with a squared penalty term, and an alternating minimization (AM) algorithm for it. More specifically, we consider a modified optimization problem with a quadratic penalty parameter λ:

min_{Θ, Z}  −log det Θ + tr(Σ̂ Θ) + ρ ‖Z‖_1 + (1/(2λ)) ‖Z − Θ‖_F²        (6)

and the alternating minimization (AM) method for solving it:

Θ_{k+1} = (λ/2) ( −Y_k + (Y_k² + (4/λ) I)^{1/2} ),   where Y_k = Σ̂ − Z_k/λ        (7)
Z_{k+1} = η_{ρλ}(Θ_{k+1})        (8)

where η is the entry-wise soft-thresholding operator from Eq. (2). The derivation of these steps is given in Appendix A. We replace the penalty constants ρ and λ by problem-dependent neural networks, ρ_nn and Λ_nn. These neural networks are very minimalist in terms of the number of parameters. The update equations for our unrolled AM-based model, GLAD, are summarized in Algorithm 1. Besides the parameters in ρ_nn and Λ_nn, the constant t used for the initialization Θ_0 = (Σ̂ + t I)^{-1} is also a learnable scalar parameter. This unrolled algorithm with neural network augmentation can be viewed as a highly structured recurrent architecture, as illustrated in Fig. 1. Section 5 and Appendix C.5 contain more details about the exact structure of the neural networks used in our experiments.

Function GLADcell(Σ̂, Θ, Z, λ):
       Y ← Σ̂ − Z/λ
       Θ ← (λ/2) ( −Y + (Y² + (4/λ) I)^{1/2} )
       For all entries (i, j) do
              ρ_ij ← ρ_nn(Θ_ij, Σ̂_ij, Z_ij)
              Z_ij ← η_{ρ_ij}(Θ_ij)
       λ ← Λ_nn(‖Z − Θ‖_F², λ)
       return Θ, Z, λ
Function GLAD(Σ̂):
       Θ_0 ← (Σ̂ + t I)^{-1},  Z_0 ← Θ_0,  λ_0 ← initial value
       For k = 0 to K − 1 do
              Θ_{k+1}, Z_{k+1}, λ_{k+1} ← GLADcell(Σ̂, Θ_k, Z_k, λ_k)
       return Θ_K
Algorithm 1 GLAD

There are many traditional algorithms for solving the graph recovery problem. We choose AM as our basis because: First, empirically, we tried models built upon other algorithms, including G-ISTA and ADMM, but the AM-based model gives consistently better performance. Appendix C.10 discusses the different parameterizations we tried. Second, and more importantly, the AM-based architecture has the nice property of maintaining Θ_k as an SPD matrix throughout the iterations as long as the quadratic penalty parameter λ stays positive.
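A PyTorch-style sketch of one GLADcell is given below to make the recurrent structure concrete. The matrix square root in Eq. (7) is computed through an eigendecomposition so that autograd can differentiate through it, and the small MLPs rho_nn and lambda_nn stand in for ρ_nn and Λ_nn. Their exact input features, sizes, and non-linearities here are illustrative assumptions rather than the tuned configuration of Appendix C.5.

import torch
import torch.nn as nn

def sqrtm_spd(A):
    # Matrix square root of an SPD matrix via eigendecomposition (autograd-friendly).
    w, Q = torch.linalg.eigh(A)
    return Q @ torch.diag(torch.sqrt(torch.clamp(w, min=1e-12))) @ Q.T

class GLADCell(nn.Module):
    # One unrolled AM step with learned entry-wise threshold and penalty update.
    def __init__(self, hidden=3):
        super().__init__()
        # rho_nn: maps (Theta_ij, Sigma_ij, Z_ij) -> per-entry threshold (assumed inputs).
        self.rho_nn = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1), nn.Sigmoid())
        # lambda_nn: maps (||Z - Theta||_F^2, lambda) -> new lambda (assumed inputs).
        self.lambda_nn = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                       nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, sigma_hat, theta, Z, lam):
        d = sigma_hat.shape[0]
        Y = sigma_hat - Z / lam
        # Closed-form Theta-update of Eq. (7); SPD for any lam > 0.
        theta = 0.5 * lam * (-Y + sqrtm_spd(Y @ Y + (4.0 / lam) * torch.eye(d)))
        # Entry-wise learned soft-thresholding (Eq. (8) with learned rho).
        feats = torch.stack([theta, sigma_hat, Z], dim=-1)   # shape (d, d, 3)
        rho = self.rho_nn(feats).squeeze(-1)                 # shape (d, d)
        Z = torch.sign(theta) * torch.relu(torch.abs(theta) - rho)
        # Learned update of the quadratic-penalty parameter (kept positive by the sigmoid).
        gap = ((Z - theta) ** 2).sum().reshape(1)
        lam = self.lambda_nn(torch.cat([gap, lam.reshape(1)])).squeeze()
        return theta, Z, lam

Unrolling K such cells from the initialization Θ_0 = (Σ̂ + t I)^{-1}, Z_0 = Θ_0 (with λ passed along as a 0-dimensional tensor) and training end-to-end with the loss of Section 3.3 gives the full GLAD network.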

3.3 Training algorithm

To learn the parameters of the GLAD architecture, we directly optimize the recovery objective rather than the log-determinant objective. A nice property of our deep learning architecture is that each iteration of our model outputs a valid precision matrix estimate. This allows us to add auxiliary losses to regularize the intermediate results of our GLAD architecture, guiding it to learn parameters which generate a smooth solution trajectory.

Specifically, we use the Frobenius norm in our experiments and design an objective which has some resemblance to the discounted cumulative reward in reinforcement learning:

min_f  Σ_i Σ_{k=1}^{K}  γ^{K−k} ‖Θ_k^{(i)} − Θ*_{(i)}‖_F²        (9)

where Θ_k^{(i)} is the output of the recurrent unit GLADcell at the k-th iteration for the i-th training instance, K is the number of unrolled iterations, and γ ≤ 1 is a discounting factor.
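A minimal sketch of this discounted loss for a single training instance is shown below, assuming theta_trajectory is the list of per-iteration GLADcell outputs [Θ_1, …, Θ_K] as PyTorch tensors and theta_star is the ground-truth precision matrix; the value gamma=0.9 is an illustrative choice.

import torch

def glad_loss(theta_trajectory, theta_star, gamma=0.9):
    # Discounted sum of squared Frobenius errors over the unrolled iterations (Eq. (9)).
    K = len(theta_trajectory)
    loss = 0.0
    for k, theta_k in enumerate(theta_trajectory, start=1):
        loss = loss + gamma ** (K - k) * torch.sum((theta_k - theta_star) ** 2)
    return loss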

We use a stochastic gradient descent algorithm to train the parameters in the GLADcell. A key step in the gradient computation is to propagate gradients through the matrix square root in the GLADcell. To do this efficiently, we make use of the property of an SPD matrix Θ that Θ^{1/2} Θ^{1/2} = Θ, and the product rule of derivatives, to obtain

dΘ = d(Θ^{1/2}) Θ^{1/2} + Θ^{1/2} d(Θ^{1/2})        (10)

The above equation is a Sylvester equation for d(Θ^{1/2}). Since the derivative dΘ is easy to obtain, the derivative of Θ^{1/2} can then be obtained by solving the Sylvester equation in (10).
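The sketch below checks Eq. (10) numerically with SciPy: for an SPD matrix Θ and a symmetric perturbation dΘ, the differential of the matrix square root is obtained by solving the Sylvester equation S·dS + dS·S = dΘ with S = Θ^{1/2}, and it matches a finite-difference estimate. This is the relation one exploits to back-propagate through the matrix square root; the toy matrices here are illustrative.

import numpy as np
from scipy.linalg import sqrtm, solve_sylvester

rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((d, d))
theta = A @ A.T + d * np.eye(d)          # an SPD test matrix
d_theta = rng.standard_normal((d, d))
d_theta = (d_theta + d_theta.T) / 2.0    # symmetric perturbation direction

S = sqrtm(theta)
# Differential of the square root: solve S dS + dS S = d_theta, i.e., Eq. (10).
dS = solve_sylvester(S, S, d_theta)

# Finite-difference check of the same directional derivative.
eps = 1e-6
dS_fd = (sqrtm(theta + eps * d_theta) - S) / eps
print(np.max(np.abs(dS - dS_fd)))   # small: the Sylvester solution matches the finite difference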

4 Theoretical Analysis

Since our GLAD architecture is obtained by augmenting an unrolled optimization algorithm with learnable components, the question is what kind of guarantees can be provided for such a learned algorithm, and whether learning can bring benefits to the recovery of the precision matrix. In this section, we first analyze the statistical guarantee of running the AM algorithm in Eq. (7) and Eq. (8) for k steps with a fixed quadratic penalty parameter λ, and then interpret its implications for the learned algorithm. First, we need two standard assumptions from the literature [14]:

Assumption 1 (Tail conditions).

There exist a constant v* and a function f(m, δ), monotonically increasing in the sample size m, that control the tail probability of the entry-wise deviation between the sample covariance Σ̂ and the true covariance Σ*; the corresponding inverse tail function δ̄_f(m, r) is defined as in [14].

Assumption 2 (Incoherence condition).

Denote the Hessian by Γ* = Σ* ⊗ Σ*, the set of indices of nonzero entries of Θ* by S, and its complement by S^c. There exists α ∈ (0, 1] such that ‖Γ*_{S^c S} (Γ*_{S S})^{-1}‖_∞ ≤ 1 − α.

The assumption on the tail condition ensures that the sample size is large enough for an accurate estimation of the covariance matrix Σ*, while the incoherence assumption restricts the interaction between the edge and non-edge terms in the precision matrix. These two assumptions characterize the fundamental limitation of the sparse graph recovery problem, beyond which recovery is not possible. Under these assumptions, we prove the consistency of the AM algorithm (the proof is in Appendix B). Under Assumptions 1 and 2, suppose the regularization parameter ρ and the sample size satisfy the conditions detailed in Appendix B. Then, with high probability,

(11)

where the leading constant is an increasing function of the initialization error and of λ. From the theorem, one can see that by optimizing the quadratic penalty parameter λ, one can trade off the first two terms in the bound. Furthermore, if one takes the most recent iterate Θ_k as the initialization for the next stage of the optimization, the theorem implies that

(12)

Therefore, at each stage k, an optimal penalty parameter λ_k can be chosen depending on the most recent iterate Θ_k. Such an adaptive sequence of penalty parameters should achieve a better error bound than a fixed λ.

Besides, the consistency in this theorem is based on a specific choice of the sparsity regularization parameter ρ. However, the quantity characterizing the tail behavior of the sample covariance may not be accessible in practice.

In summary, the implications of this theorem are:


  • An adaptive sequence {λ_k} should lead to an algorithm with better convergence than a fixed λ, but such a sequence may not be easy to choose manually.

  • Both ρ and the optimal λ_k depend on the tail behavior of the sample covariance and the corresponding error terms, which makes these parameters hard to prescribe manually.

Our learning-augmented deep architecture, GLAD, can tune this sequence of λ_k and the ρ parameters jointly using gradient descent. Moreover, we also make ρ a function of the most up-to-date solution Θ_k, and allow different regularization strengths for different entries of the precision matrix. Such flexibility can potentially further improve the ability of the GLAD model to recover the sparse graph.

Although the above analysis provides good intuition on how learning can help improve the recovery guarantee, the error bound is loose, because it is derived from existing analytical results which assume a fixed ρ and λ. A tighter analysis may be obtained by directly bounding the error of the GLAD output after k steps, which will be an interesting direction for future work.

5 Experiments

In this section, we report several experiments to compare GLAD with traditional algorithms and other data-driven algorithms. The results validate the list of desiderata mentioned previously. In particular, they show the potential of pushing the boundary of traditional graph recovery algorithms by utilizing data. A Python implementation of the experiments is available (code to be released). Details of some experimental settings are covered in Appendix C.

Evaluation metric. We use the normalized mean square error (NMSE) and the probability of success (PS) to evaluate algorithm performance. NMSE is the normalized mean squared error between the estimated and the true precision matrix, and PS is the probability of correct signed edge-set recovery, i.e., the probability that the estimator recovers every edge of Θ* with the correct sign.

Notation. In all reported results, D stands for the dimension of the random variable, M stands for the sample size, and N stands for the number of graphs (precision matrices) used for training.
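For reference, the sketch below shows one plausible way to compute the two metrics. The dB-scaled definition of NMSE (10·log10 of the normalized squared Frobenius error) is our assumption, chosen to match the scale of the values reported in the tables; the exact convention may differ.

import numpy as np

def nmse_db(theta_hat, theta_star):
    # Normalized mean squared error, assumed to be reported on a log10 (dB) scale.
    err = np.linalg.norm(theta_hat - theta_star, "fro") ** 2
    return 10.0 * np.log10(err / np.linalg.norm(theta_star, "fro") ** 2)

def signed_edge_success(theta_hat, theta_star, tol=1e-6):
    # 1 if the signed off-diagonal support of theta_star is exactly recovered, else 0.
    d = theta_star.shape[0]
    off = ~np.eye(d, dtype=bool)
    sign_hat = np.sign(np.where(np.abs(theta_hat) > tol, theta_hat, 0.0))
    sign_star = np.sign(np.where(np.abs(theta_star) > tol, theta_star, 0.0))
    return float(np.array_equal(sign_hat[off], sign_star[off]))

# PS is then the average of signed_edge_success over repeated trials.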

5.1 Benefit of data-driven gradient-based algorithm

Figure 2: Convergence of ADMM in terms of NMSE and optimization objective. (Refer to Appendix C.2).

Inconsistent optimization objective. Traditional algorithms are typically designed to optimize the ℓ_1-penalized log-likelihood. Since this is a convex optimization problem, convergence to the optimal solution is usually guaranteed. However, this optimization objective is different from the true recovery error. Taking ADMM as an example, Figure 2 reveals that, although the optimization objective always converges, the error in recovering the true precision matrices, measured by NMSE, behaves very differently for different regularization parameters ρ, which indicates the necessity of directly optimizing NMSE and of careful hyperparameter tuning.

   5 1 0.5 0.1 0.01
   -2.51 -2.25 -2.06 -2.06 -2.69
   -5.59 -9.05 9.48 -9.61 -9.41
   -9.53 -7.58 -7.42 -7.38 -7.46
   -9.38 -6.51 -6.43 -6.41 -6.50
   -6.76 -4.68 -4.55 -4.47 -4.80
Table 1: NMSE results for ADMM.

Expensive hyperparameter tuning. Although the hyperparameters of traditional algorithms can be tuned if the true precision matrices are provided as a validation dataset, we want to emphasize that hyperparameter tuning by grid search is a tedious and hard task. Table 1 shows that the NMSE values are very sensitive to both the regularization parameter ρ and the quadratic penalty λ of the ADMM method. For instance, the optimal NMSE in this table is attained at one particular (ρ, λ) pair, but it increases by a large amount if either hyperparameter is changed only slightly. There are many other similar observations in this table, where slight changes in the parameters lead to significant NMSE differences, which in turn makes grid search very expensive. G-ISTA and BCD follow similar trends.

For a fair comparison against GLAD, which is data-driven, in all following experiments the hyperparameters of the traditional algorithms are fine-tuned using validation datasets, on which we spent extensive effort (see more details in Appendices C.3 and C.6). In contrast, the gradient-based training of GLAD turns out to be much easier.

5.2 Convergence

We follow the experimental settings in [15, 12, 11] to generate data and perform synthetic experiments on multivariate Gaussians. Each off-diagonal entry of the precision matrix is drawn from a uniform distribution and then set to zero according to an Erdos-Renyi sparsity pattern with a given edge probability (refer to Appendix C.1). We use 30 recurrent steps for GLAD and compare it to G-ISTA, ADMM and BCD. All algorithms are trained/fine-tuned using 10 randomly generated graphs and tested over 100 graphs.

Time/itr D=25 D=100
ADMM 1.45 16.45
G-ISTA 37.51 41.47
GLAD 2.81 20.23
Table 2: ms per iteration.

Convergence results and the average runtime of the different algorithms on Nvidia P100 GPUs are shown in Figure 3 and Table 2, respectively. GLAD consistently converges faster and gives lower NMSE. Although the fine-tuned G-ISTA also has decent performance, its computation time per iteration is much longer than GLAD's because it requires line search steps. Besides, we can also see a progressive improvement of GLAD across its iterations.

Figure 3: GLAD vs traditional methods. Left 3 plots: the sparsity level is fixed. Right 3 plots: the sparsity level of each graph is randomly sampled. Results are averaged over 100 test graphs, where each graph is estimated 10 times using 10 different sample batches. Standard error is plotted but not visible. Intermediate steps of BCD are not evaluated because we use the sklearn package [13] and can only access the final output. Appendices C.4 and C.5 give details about the experiment setup and the GLAD architecture.
Figure 4: Sample complexity for model selection consistency.

5.3 Recovery probability

As analyzed by Ravikumar et al. [14], the recovery guarantee (e.g., in terms of Frobenius norm) of the ℓ_1-regularized log-determinant optimization depends significantly on the sample size and other conditions. Our GLAD directly optimizes the recovery objective based on data, and it has the potential to push the sample complexity limit. We experimented with this and found positive results.

We follow Ravikumar et al. [14] and conduct experiments on GRID graphs, which satisfy the conditions required in [14]. Furthermore, we conduct a more challenging task of recovering restricted but randomly constructed graphs (see Appendix C.7 for more details). The probability of success (PS) is shown in Figure 4. It is noteworthy that PS is non-zero only if the algorithm recovers all edges with correct signs. GLAD consistently outperforms traditional methods in terms of sample complexity: it is able to recover the true edges with considerably fewer samples.

5.4 Data Efficiency

Having a good inductive bias makes GLAD's architecture quite data-efficient compared to other deep learning models. For instance, the prominent work on DeepGraph [4] is based on convolutional neural networks and contains orders of magnitude more parameters than GLAD; it also takes many more training samples and several hours to train their DG-39 model. In contrast, GLAD learns well with a very small number of parameters, far fewer training samples, and notably less training time. Table 3 also shows that GLAD significantly outperforms the DG-39 model in terms of AUC (area under the ROC curve) while using only a handful of training graphs, which is typically the case in real-world settings. Fully connected deep learning models are unable to learn from such small data and hence are skipped in the comparison.

Methods    M=15           M=35           M=100
BCD        0.578 ± 0.006  0.639 ± 0.007  0.704 ± 0.006
DG-39      0.664 ± 0.008  0.738 ± 0.006  0.759 ± 0.006
DG-39+P    0.672 ± 0.008  0.740 ± 0.007  0.771 ± 0.006
GLAD       0.788 ± 0.003  0.811 ± 0.003  0.878 ± 0.003

Table 3: AUC on test graphs for D=39. For the experiment settings, refer to Table 1 of [4]: Gaussian random graphs with a fixed sparsity level were chosen and edge values sampled from a uniform distribution (refer to Appendix C.8).

5.5 Gene regulation data

SynTReN [17] is a synthetic gene expression data generator specifically designed for analyzing sparse graph recovery algorithms. It models different types of biological interactions and produces biologically plausible synthetic gene expression data. Figure 5 shows that GLAD performs favourably for structure recovery in terms of NMSE on the gene expression data. As the governing equations of the underlying distribution of SynTReN are unknown, these experiments also emphasize the ability of GLAD to handle non-Gaussian data.

Figure 6 visualizes the edge-recovery performance of the trained GLAD models on a sub-network of the true E. coli bacteria data. We denote TPR: true positive rate, FPR: false positive rate, FDR: false discovery rate. Although GLAD was trained on lower-dimensional graphs, it was able to robustly recover the structure of a higher-dimensional graph. This also shows that the learned GLAD model is able to generalize to larger graphs.

Figure 5: Performance on SynTReN-generated gene expression data, with Erdos-Renyi graphs of fixed sparsity. Refer to Appendix C.9 for experiment details.
(a) True graph
(b) M=10, fdr=0.613, tpr=0.913, fpr=0.114
(c) M=100, fdr=0.236, tpr=0.986, fpr=0.024
Figure 6: Recovered graph structures for a sub-network of the E. coli regulatory network with an increasing number of samples. Increasing the number of samples reduces the FDR by discovering more true edges.

6 Conclusion & Future work

We presented a novel neural network, GLAD, for the sparse graph recovery problem, based on an unrolled Alternating Minimization algorithm. We showed, both theoretically and empirically, that learning can improve sparse graph recovery. The learned GLAD model is able to push the sample complexity limits, thereby highlighting the potential of using algorithms as inductive biases for deep learning architectures. Further development of the theory is needed to fully understand and realize the potential of this new direction.

References

  • Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
  • Banerjee et al. [2008] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine learning research, 9(Mar):485–516, 2008.
  • Beck [2015] Amir Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015.
  • Belilovsky et al. [2017] Eugene Belilovsky, Kyle Kastner, Gaël Varoquaux, and Matthew B Blaschko. Learning to discover sparse graphical models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 440–448. JMLR. org, 2017.
  • Chen et al. [2018] Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ista and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pages 9061–9071, 2018.
  • Friedman et al. [2008] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  • Gregor and LeCun [2010] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 399–406. Omnipress, 2010.
  • Khalil et al. [2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
  • Li and Malik [2016] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
  • Liu et al. [2018] Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Alista: Analytic weights are as good as learned weights in lista. 2018.
  • Lu [2010] Zhaosong Lu. Adaptive first-order methods for general sparse inverse covariance selection. SIAM Journal on Matrix Analysis and Applications, 31(4):2000–2016, 2010.
  • Mazumder and Agarwal [2011] Rahul Mazumder and Deepak K Agarwal. A flexible, scalable and efficient algorithmic framework for primal graphical lasso. arXiv preprint arXiv:1110.5508, 2011.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Ravikumar et al. [2011] Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, Bin Yu, et al. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
  • Rolfs et al. [2012] Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, and Arian Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 1574–1582, 2012.
  • Sun et al. [2016] Jian Sun, Huibin Li, Zongben Xu, et al. Deep admm-net for compressive sensing mri. In Advances in neural information processing systems, pages 10–18, 2016.
  • Van den Bulcke et al. [2006] Tim Van den Bulcke, Koenraad Van Leemput, Bart Naudts, Piet van Remortel, Hongwu Ma, Alain Verschoren, Bart De Moor, and Kathleen Marchal. Syntren: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC bioinformatics, 7(1):43, 2006.

Appendix A Derivation of Alternating Minimization Steps

Given the optimization problem

min_{Θ, Z}  −log det Θ + tr(Σ̂ Θ) + ρ ‖Z‖_1 + (1/(2λ)) ‖Z − Θ‖_F²        (13)

alternating minimization performs

Θ_{k+1} = argmin_Θ  −log det Θ + tr(Σ̂ Θ) + (1/(2λ)) ‖Z_k − Θ‖_F²        (14)
Z_{k+1} = argmin_Z  ρ ‖Z‖_1 + (1/(2λ)) ‖Z − Θ_{k+1}‖_F²        (15)

Setting the gradient of the objective in Eq. (14) with respect to Θ to zero, we have

−Θ^{-1} + Σ̂ + (1/λ)(Θ − Z_k) = 0        (16)

Setting the (sub)gradient of the objective in Eq. (15) with respect to Z to zero, we have

ρ ∂‖Z‖_1 + (1/λ)(Z − Θ_{k+1}) ∋ 0        (17)

where the solution is given by the entry-wise soft-thresholding operator

η_r(x) = sign(x) · max(|x| − r, 0)        (18)

Solving the above two equations, we obtain:

Θ² + (λ Σ̂ − Z_k) Θ − λ I = 0        (19)
Θ_{k+1} = (λ/2) ( −Y_k + (Y_k² + (4/λ) I)^{1/2} ),   Y_k = Σ̂ − Z_k/λ        (20)
Z_{k+1} = η_{ρλ}(Θ_{k+1})        (21)
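As a sanity check on the closed-form solution above, the following sketch verifies numerically that Eq. (20) satisfies the stationarity condition in Eq. (16) and stays SPD; the values of λ and the test matrices are arbitrary illustrative choices.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
d, lam = 8, 0.5
A = rng.standard_normal((d, 2 * d))
sigma_hat = np.cov(A)                       # an arbitrary sample covariance
Z = np.diag(rng.uniform(0.5, 1.5, size=d))  # an arbitrary symmetric iterate Z_k

Y = sigma_hat - Z / lam
theta = 0.5 * lam * (-Y + sqrtm(Y @ Y + (4.0 / lam) * np.eye(d)))

# Stationarity condition of Eq. (16): -Theta^{-1} + Sigma_hat + (Theta - Z)/lam = 0.
residual = -np.linalg.inv(theta) + sigma_hat + (theta - Z) / lam
print(np.max(np.abs(residual)))             # numerically ~0
print(np.min(np.linalg.eigvalsh(theta)))    # > 0: Theta stays SPD for lam > 0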

Appendix B Proof of Theorem

We restate and prove the theorem stated in Section 4.

In this proof, we will use the following notations:

The proof of this theorem is based on [3] and [14]. Let us introduce the results that we need to use.


  • Existing result in Section 3.2 in [3]:

    (22)

    where is the diameter defined as

    (23)
  • Existing result in [14]: [Section 3.4 in [14]] Under assumption 1 and 2, suppose the regularization parameter is for some and the sample size is lower bounded by . Then with probability at least ,

    (24)

    The constant is defined as

    where is defined as .

Now we are ready for the proof.

Proof.

Observe that

Then we can rewrite as

Therefore,

Also note that by the optimality condition

and the definition of , we have

Combining the previous two observations, we conclude

Moreover, since is -strongly convex,

(25)

Therefore,

(26)

Since is -strongly convex with respect to , then

where . Combining with Equation 22, we have

(27)

Finally, under the conditions in Corollary B, we use triangle inequality to combine the above results and Corollary B.

(28)
(29)
(30)
(31)

where the second term follows from and

(32)

Appendix C Experimental details

This section contains the detailed settings used in the experimental evaluation section.

C.1 Synthetic dataset generation

For Sections 5.1 and 5.2, the synthetic data was generated based on the procedure described in [15]. A D-dimensional precision matrix Θ* was generated by initializing a D × D matrix with its off-diagonal entries sampled i.i.d. from a uniform distribution. These entries were then set to zero based on the sparsity pattern of a corresponding Erdos-Renyi random graph with a given edge probability. Finally, an appropriate multiple of the identity matrix was added to the current matrix, so that the resulting matrix had its smallest eigenvalue bounded away from zero. In this way, Θ* was ensured to be a well-conditioned, sparse and positive definite matrix. This matrix was then used in the multivariate Gaussian distribution N(0, (Θ*)^{-1}) to obtain the i.i.d. samples.
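A NumPy sketch of this generation procedure is given below; the uniform range U(−1, 1), the edge probability, and the target minimum eigenvalue are illustrative placeholders for the values used in the experiments.

import numpy as np

def generate_instance(d=25, edge_prob=0.1, m=100, min_eig=1.0, rng=None):
    # Sample a sparse SPD precision matrix and m Gaussian observations from it.
    rng = np.random.default_rng() if rng is None else rng
    # Off-diagonal entries from a uniform distribution, masked by an
    # Erdos-Renyi sparsity pattern (kept symmetric).
    B = rng.uniform(-1.0, 1.0, size=(d, d))
    mask = rng.random((d, d)) < edge_prob
    theta = np.triu(B * mask, k=1)
    theta = theta + theta.T
    # Shift the diagonal so the smallest eigenvalue equals min_eig.
    theta += (min_eig - np.linalg.eigvalsh(theta).min()) * np.eye(d)
    # Draw i.i.d. samples from N(0, theta^{-1}) and form the sample covariance.
    X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(theta), size=m)
    return theta, np.cov(X, rowvar=False)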

C.2 Experiment details: Benefit of data-driven gradient-based algorithm

Figure 2: the plots are for the ADMM method on Erdos-Renyi graphs with fixed sparsity, for a fixed dimension and number of samples. The results are averaged over the test graphs, with multiple sample batches per graph, and the standard error is shown. Refer to Appendix C.1 for more details on the data generation process.

C.3 Experiment details: Expensive hyperparameter tuning

Table 1 shows the final NMSE values for the ADMM method on random graphs with fixed sparsity, for a fixed dimension and number of samples. We fixed the initialization offset and chose an appropriate update rate for λ. It is important to note that the NMSE values are also very sensitive to the choice of the initialization. These parameter values change substantially for a new problem setting. Refer to Appendix C.1 for more details on the data generation process.

C.4 Experiment details: Convergence on synthetic datasets

Figure 3 shows the NMSE comparison plots for fixed-sparsity and mixed-sparsity synthetic Erdos-Renyi graphs. The dimension was fixed and the number of samples was varied across the panels. The top row uses a fixed sparsity probability for the Erdos-Renyi random graphs, whereas for the bottom-row plots the sparsity probabilities are sampled uniformly from a range. For fine-tuning the traditional algorithms, a validation dataset of graphs was used. For the GLAD algorithm, the training graphs were randomly chosen and the same validation set was used.

C.5 GLAD: Architecture details for Section 5.2

GLAD parameter settings: ρ_nn was a 4-layer neural network and Λ_nn was a 2-layer neural network. Both used 3 hidden units in each layer, and the final layer of both networks used a sigmoid non-linearity. The offset parameter used for the initialization of Θ_0 was learnable, starting from a fixed value. The model was unrolled for 30 iterations. A multi-step learning rate scheduler was used with the Adam optimizer. The best model in terms of NMSE was selected based on the validation data performance. Figure 7 explores the performance of GLAD when varying the number of unrolled iterations.

Figure 7: Varying the number of unrolled iterations. The results are averaged over the test graphs. We observe that the higher the number of unrolled iterations, the better the performance.

C.6 Additional notes on hyperparameter fine-tuning for traditional methods

Table 1 shows the average NMSE values over the test graphs obtained by the ADMM algorithm on the synthetic data as we vary the ℓ_1 penalty ρ and the quadratic penalty parameter λ. The NMSE values are also very sensitive to the choice of the initialization offset. These parameter values change substantially for a new problem setting. G-ISTA and BCD follow similar trends.

Additional plots highlighting the hyperparameter sensitivity of the traditional methods in the model selection consistency experiments are shown in Figure 8.

Figure 8: We illustrate how sensitive the traditional methods are to their hyperparameters and how tedious an exercise it is to fine-tune them. The problem setting is the same as described in Section 5.3. For all the methods shown above, the algorithm-specific parameters have already been tuned to a reasonable setting. We then vary the penalty term and observe how sensitive the probability of success is to even a slight change of its value.

C.7 Tolerance of Noise: Experiment details

Details for the experiments in Figure 4. Two different graph types were chosen for this experiment, inspired by [14]. In the ‘grid’ graph setting, the edge weights for different precision matrices were sampled uniformly, with all edges within a graph carrying equal weights. The other setting was more general: the graph was a random Erdos-Renyi graph with a fixed edge probability, and the off-diagonal entries of the precision matrix were sampled uniformly. The parameter settings for GLAD were the same as described in Appendix C.5. The model with the best PS performance on the validation dataset was selected. train/valid/test = 10/10/100 graphs were used, with 10 sample batches per graph.

C.8 GLAD: Comparison with other deep learning based methods

Table 3 shows AUC comparisons (with standard errors) against the DeepGraph model. For the experiment settings, refer to Table 1 of [4]: Gaussian random graphs with a fixed sparsity level were chosen and edge values were sampled from a uniform distribution. GLAD was trained on only a few graphs, with multiple sample batches per graph. The dimension of the problem is D=39. The architecture parameter choices for GLAD were the same as described in Appendix C.5, and it performs consistently better across all the settings by a significant AUC margin.

C.9 SynTReN gene expression simulator details

SynTReN [17] is a synthetic gene expression data generator specifically designed for analyzing structure learning algorithms. The topological characteristics of the synthetically generated networks closely resemble those of real transcriptional networks. The generator models different types of biological interactions and produces biologically plausible synthetic gene expression data, enabling the development of data-driven approaches to recover the underlying network.

SynTReN simulator details for Section 5.5: for performance evaluation, a connected Erdos-Renyi graph was generated with a fixed edge probability. The precision matrix entries were sampled from a uniform distribution and the minimum eigenvalue was adjusted by adding an appropriate multiple of the identity matrix. The SynTReN simulator then generated samples from these graphs by incorporating biological noise, correlation noise and other input noise, with all noise levels sampled uniformly from a fixed range. Figure 5 shows the NMSE comparisons for a fixed dimension and a varying number of samples. The number of training/validation graphs was set to 20/20 and the results are reported on the test graphs. In these experiments, a single batch of samples was taken per graph to better mimic the real-world setting.

Figure 6 visualizes the edge-recovery performance of the above trained GLAD models on a sub-network of the true E. coli bacteria data, which contains a known set of nodes and interactions. The E. coli sub-network graph was fed to the SynTReN simulator and samples were obtained; the SynTReN noise levels and the precision matrix edge values were fixed. For the GLAD models, the training was done with the same settings as for the gene-data NMSE plots and with the corresponding number of samples.

C.10 Different designs tried for the data-driven algorithm

We tried multiple unrolled parameterizations of the optimization techniques used for solving the graphical lasso problem, which worked with varying levels of success. We list a few here, in the interest of helping researchers further pursue this recent approach to data-driven algorithm design.

  1. ADMM + ALISTA parameterization: The thresholding update for Z can be replaced by an ALISTA network [10]. Stage I of ALISTA determines the weight matrix W, which is trivial in our case. Combining the ALISTA updates with the AM updates gives an interesting unrolled algorithm for our optimization problem.

  2. G-ISTA parameterization: We parameterized the line search hyperparameter, and replaced the step-size determination step of Algorithm 1 in [15] with a problem-dependent neural network.

  3. Mirror Descent Net: We get a similar set of update equations for the graphical lasso optimization. We identify some learnable parameters, use neural networks to make them problem dependent and train them end-to-end.

  4. For all these methods, we also tried unrolling the neural networks themselves. In our experience, the performance does not improve much, but the convergence becomes unstable.