I. Introduction
Multivariate Gaussian distributions are widely used in modeling real-world problems where the relationship between approximately normally distributed variables is of interest. Examples are the distribution of stock returns [1, 2], protein-protein interactions [3], and brain functional activities [4]. Let $X$ be a $p$-dimensional random vector with multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$. The inverse of the covariance matrix, $K = \Sigma^{-1}$, is known as the precision matrix. If the $(i,j)$ entry of the precision matrix is zero, $K_{ij} = 0$, then $X_i$ and $X_j$ are conditionally independent given all the other variables. Therefore, the conditional independence relationships of a multivariate Gaussian can be visualized by an unweighted, undirected graph $G$ whose adjacency matrix $A = [a_{ij}]$ is such that

$$a_{ij} = \begin{cases} 1, & K_{ij} \neq 0 \text{ and } i \neq j, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$

The graph $G$ is known as the conditional independence graph. The sparsity pattern of $K$ determines the topology of $G$. Figure 1 shows an example precision matrix and its associated conditional independence graph.
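In practice, an estimated precision matrix has entries that are numerically small rather than exactly zero, so the mapping in (1) is applied with a tolerance. A minimal Python sketch (the tolerance value is an illustrative assumption):

```python
import numpy as np

def ci_graph(K, tol=1e-8):
    """Adjacency matrix of the conditional independence graph, per (1):
    nodes i and j are connected iff K[i, j] is nonzero (up to tolerance)."""
    A = (np.abs(K) > tol).astype(int)
    np.fill_diagonal(A, 0)  # the graph has no self-loops
    return A
```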
Often, we are interested in the problem of estimating $K$ (or, equivalently, $G$) from $n$ i.i.d. observations $x_1, \ldots, x_n$. The problem is especially difficult when the Gaussian distribution has high dimensionality $p$ but few observations ($n < p$). In most approaches, the estimator of $K$ (or $G$) contains a regularization term to 1) prevent overfitting and 2) control the sparsity of the estimated solution. In the case of estimating the precision matrix, the weight of the regularization term (as determined, for example, by a regularization parameter $\lambda$) has a high impact on the structure of the conditional independence graph $G$.

In this paper, we develop a new methodology for choosing the weighting of the regularization term, i.e., $\lambda$. Our method is based on Rissanen's minimum description length (MDL) principle, which codifies model selection by balancing how well a model describes the given data against the complexity of the model [5]. In practice, however, it is often difficult to compute an exact description length for some statistical models. For example, previous work on model selection for Gaussian graphical models does not account for the exact topology of the conditional independence graph.
Previously, we developed methods for lossless coding of arbitrary unweighted, undirected graph structures [6]. In this paper, we use the description length (the minimum number of bits required to describe the data losslessly) associated with these coding methods, in conjunction with the MDL principle, to learn the best graph model of the data. More specifically, we use the description length for model selection and anomaly detection in Gaussian graphical models. We leverage existing work in universal graph coding [6, 7], based on graph statistics such as edge probability, degree distribution, and triangle distribution, to compute the exact description length of a Gaussian graphical model.
An overview of previous studies on model selection and anomaly detection in Gaussian graphical models is given in Section II. In Section III, we use our method for model selection in graphical lasso [8], a popular method for estimating a sparse precision matrix. We show that our method outperforms competing model selection methods, especially when the number of observations is less than the number of variables. In Section IV, we use description length for anomaly detection based on the principle of atypicality [9]. We show that atypicality can capture anomalous data by describing it in fewer bits in itself rather than with the optimum coder for typical data. Experimental results on a real-world electrocardiogram (ECG) dataset show that our approach can find anomalies in multivariate normally distributed data with high accuracy.
II. Previous Work
Different estimators have been introduced to fit a sparse graphical model to multivariate Gaussian data. Reference [10] proposed a method based on neighborhood selection, where conditional independence for each node is estimated by regressing on the other variables with lasso. Another popular estimator is graphical lasso [8], which works with an $\ell_1$-penalized log-likelihood; it is easy to implement and fast to compute. Other penalized methods have been introduced to solve the problem efficiently [11, 12, 13, 14, 15]. All of these estimators have a regularization parameter $\lambda$ that controls the sparsity level of the conditional independence graph, ranging from fully disconnected nodes to a fully connected graph. Therefore, their performance is highly influenced by the value of the regularization parameter $\lambda$.
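For concreteness, graphical lasso [8] estimates the precision matrix by solving the $\ell_1$-penalized log-likelihood problem

$$\hat{K} = \arg\max_{K \succ 0} \; \log\det K - \operatorname{tr}(SK) - \lambda \|K\|_1,$$

where $S$ is the sample covariance matrix and $\|K\|_1$ is the sum of the absolute values of the entries of $K$; larger $\lambda$ drives more entries of $\hat{K}$, and hence more edges of the graph, to zero.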
To pick the best value of $\lambda$, different model selection techniques have been used in the literature. We can divide them into two main classes: information-based methods and resampling methods. Information-based methods such as the Bayesian information criterion (BIC) [12], the Akaike information criterion (AIC) [16], and the extended Bayesian information criterion (EBIC) [17] are based on the performance on the training dataset and the complexity of the model. This class offers a good statistical interpretation by adding a penalty term to compensate for the overfitting problem of more complex models. EBIC uses an extra penalization parameter that controls the preference for simpler models and is set manually depending on the application and the desired level of caution.
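As a reference point, the EBIC of [17] for an estimated edge set $E$ has the form

$$\mathrm{EBIC}_\gamma = -2\,\ell_n(\hat{K}_E) + |E| \log n + 4\,|E|\,\gamma \log p,$$

where $\ell_n$ is the maximized log-likelihood, $|E|$ is the number of edges, and $\gamma \in [0, 1]$ is the extra penalization parameter; $\gamma = 0$ recovers the ordinary BIC. Note that the graph enters only through the edge count $|E|$.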
Resampling methods measure performance on out-of-sample data. They split the data into a subset of samples used to fit a model and a remaining set used to estimate the efficacy of the model. This process is repeated multiple times, and the results are aggregated to find the best model. The most common method in this class is cross-validation (CV) [18]. Other methods are Generalized Approximate Cross-Validation (GACV) [19], the Rotation Information Criterion (RIC) [20], and the Stability Approach to Regularization Selection (StARS) [21]. The major shortcoming of resampling methods is their high computational cost, since they require solving the problem across all sets of subsamples.
Besides these two approaches, several studies propose other methods. The authors in [22] suggested a method based on the KL loss that improves on the performance of BIC when the sample size is small. Several methods based on graph structure were proposed in [23], assuming high-level knowledge about the graph structure is available. The paper [24] provides information-theoretic bounds for correct recovery of the conditional independence graph based on the sample size, the number of variables, and the maximum node degree.
All of the aforementioned methods miss a key aspect of graphical models: they do not treat the graph structure as an informative part of model selection. For example, EBIC only considers the number of edges in the estimated graph structure, while one may argue that different graph structures with the same number of edges can have very different degrees of complexity. In this paper, we consider the whole graph structure for data analysis in Gaussian graphical models.
In addition to using description length for model selection, we show another application: anomaly detection in Gaussian graphical models. Previous studies of anomaly detection in Gaussian graphical models are limited to change-point detection in graph structures over time. A time-varying graphical lasso method was developed in [25] by imposing a temporal penalty term that limits the evolution of the network structure over time. Reference [26] introduced a multiscale method for change-point detection that attains a tradeoff between sensitivity and accuracy. In [27], structural changes are determined based on the difference between a learned background precision matrix and a sliding foreground precision matrix. Our approach is different in that it uses description length for anomaly detection in the data itself; that is, we encode the data using a lossless source coder and use the resulting codelength as the decision criterion.
III. Model Selection
In this section, we apply the graph-coding techniques introduced in [6, 7] to model selection in Gaussian graphical models. In other words, our goal is to identify the zeros in the precision matrix. For that, we use the MDL principle and select the $\lambda$ that minimizes the sum of the description lengths of the graph structure and of the data under that graph structure:

$$\hat{\lambda} = \arg\min_{\lambda} \left[ L(G_\lambda) + L(\mathbf{x} \mid G_\lambda) \right], \tag{2}$$

where $L(G_\lambda)$ is the description length of the conditional independence graph structure $G_\lambda$, which depends on $\lambda$, and $L(\mathbf{x} \mid G_\lambda)$ denotes the description length of the data when encoded with $G_\lambda$. Our results show that the $\hat{\lambda}$ minimizing (2) gives a better model of the data than competing methods such as CV, BIC, and EBIC.

In the following, we provide the details of how to compute the terms in (2).
III-A. Encoding Graph Structure
To compute $L(G_\lambda)$, we first need to find the underlying conditional independence graph $G_\lambda$. By applying an estimator to the sequence of $n$ i.i.d. observations from a $p$-variate Gaussian distribution, we obtain an estimated precision matrix for each realization of $\lambda$. We drop the explicit dependence on $\lambda$ from $G$ and other related parameters for the sake of notational simplicity.
In this paper, we use graphical lasso [8], one of the most commonly used estimators for Gaussian graphical models; our approach can easily be applied to any other estimator. The estimated precision matrix obtained from the graphical lasso estimator is used to form the adjacency matrix of the unweighted, undirected graph $G$ as described in (1). Next, we can use any of the following graph coders to compute $L(G)$:

- Degree distribution: this coder uses the degree distribution of the underlying graph for encoding [6].
- Triangle: this coder looks for triangles to encode the graph structure [6].
- IID structure: this coder uses the edge probability to encode the graph structure. It is designed for Erdős–Rényi graphs and uses a two-stage compression algorithm [7]. First, the structure of the graph is converted into two binary strings; then an arithmetic encoder compresses these strings. The decoder reconstructs a graph that is isomorphic to the original graph.
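As an illustration of how a structure codelength is obtained, the following Python sketch computes a simplified iid-style codelength: first the number of edges is transmitted, then the index of the graph among all graphs with that many edges (enumerative coding). This is a stand-in for, not a reimplementation of, the structural coder of [7], which additionally exploits graph isomorphism.

```python
import numpy as np
from scipy.special import gammaln

def iid_codelength_bits(A):
    """Two-part codelength (bits) for an undirected graph on p nodes:
    log2(N + 1) bits for the edge count m, plus log2 C(N, m) bits to
    index the graph among all graphs with m edges, where N = p(p-1)/2."""
    p = A.shape[0]
    N = p * (p - 1) // 2                        # number of possible edges
    m = int(A[np.triu_indices(p, k=1)].sum())   # number of actual edges
    log2_binom = (gammaln(N + 1) - gammaln(m + 1)
                  - gammaln(N - m + 1)) / np.log(2.0)
    return np.log2(N + 1) + log2_binom
```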
III-B. Encoding Data with Graph Structure
Computing $L(\mathbf{x} \mid G)$ is not as straightforward as computing $L(G)$. There are two challenges. First, we need to deal with real-valued data, to which lossless source coding does not directly generalize. Second, we have to encode the data based on the underlying conditional independence graph $G$. To encode real-valued data, we assume a fixed-point representation with a (large) finite number, $r$, of bits after the binary point and an unlimited number of bits before it [28]. The number of bits required to represent a sample $x$ according to the pdf $f$ is then given by

$$L(x) \approx -\log f(x) + pr, \tag{3}$$

where the $pr$ term accounts for quantizing each of the $p$ components to $r$ fractional bits. Since we are only interested in relative codelengths between different models, the dependency on $r$ cancels out.
The second challenge is how to encode the data with respect to the graph structure. As mentioned, the data is generated by a multivariate Gaussian distribution, which can be characterized by its covariance matrix $\Sigma$. Since we do not have access to the true covariance matrix $\Sigma$, we find an estimate $\hat{\Sigma}$ based on the sample covariance matrix $S$. The structure of $\hat{\Sigma}$ should match the structure of the graph $G$ obtained in the previous step:

$$\hat{\Sigma}_{ij} = S_{ij} \;\; \text{if } a_{ij} = 1 \text{ or } i = j, \qquad \big(\hat{\Sigma}^{-1}\big)_{ij} = 0 \;\; \text{if } a_{ij} = 0,\; i \neq j. \tag{4}$$

This is known as a matrix completion problem and is widely studied in the literature [29]. Dempster [30] proved the existence and uniqueness of the maximum likelihood solution for $\hat{\Sigma}$ when the sample covariance matrix is positive definite, and presented a coordinate-descent algorithm to find the solution iteratively. Note that the precision matrix obtained from the graphical lasso solution is not necessarily the inverse of the estimated covariance matrix $\hat{\Sigma}$.
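A minimal sketch of the completion step is given below. It follows the modified-regression algorithm for a Gaussian graphical model with known structure (see Hastie et al., The Elements of Statistical Learning, Algorithm 17.1) rather than Dempster's original coordinate descent; both solve the same maximum likelihood completion when $S$ is positive definite.

```python
import numpy as np

def covariance_completion(S, A, n_sweeps=100, tol=1e-8):
    """ML estimate of Sigma matching S on the diagonal and on edges of A,
    with zeros in Sigma^{-1} on non-edges (Dempster's covariance selection).
    Cycles over variables, refitting each row/column via its neighbors only."""
    p = S.shape[0]
    W = S.copy()
    for _ in range(n_sweeps):
        W_prev = W.copy()
        for j in range(p):
            rest = [i for i in range(p) if i != j]
            nbrs = [i for i in rest if A[i, j]]
            if nbrs:
                # regress the j-th column on its neighbors in the graph
                beta = np.linalg.solve(W[np.ix_(nbrs, nbrs)], S[nbrs, j])
                w12 = W[np.ix_(rest, nbrs)] @ beta
            else:
                w12 = np.zeros(p - 1)  # isolated node: decouple it
            W[rest, j] = w12
            W[j, rest] = w12
        if np.max(np.abs(W - W_prev)) < tol:
            break
    return W
```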
Once we have estimated $\hat{\Sigma}$, we use predictive minimum description length (MDL) [5] to compute the codelength of the data under the graph structure:

$$L(\mathbf{x} \mid G) = -\sum_{i} \log f\!\left(x_i \mid \hat{\Sigma}(x_1, \ldots, x_{i-1})\right), \tag{5}$$

where $f(x_i \mid \hat{\Sigma}(x_1, \ldots, x_{i-1}))$ is the conditional probability density of $x_i$ and $\hat{\Sigma}(x_1, \ldots, x_{i-1})$ denotes the maximum likelihood estimate of the parameter, in this case the covariance estimated under $G$ from the first $i-1$ samples. The codelength in (5) is a valid codelength since it is sequentially decodable. It does not work for the first few samples, for which there is no estimate of the covariance matrix; instead, we encode the first few samples with a default distribution that is the same across different realizations of $\lambda$.
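A sketch of how (5) can be computed follows. The startup length `n0`, the block size, and the assumption that enough samples are available for the running sample covariance to be positive definite are illustrative choices of ours; `covariance_completion` refers to the sketch above, and `block=1` recovers (5) exactly at higher cost.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predictive_mdl_bits(X, A, n0=None, block=10):
    """Predictive MDL codelength (bits) of zero-mean rows of X under graph A.
    Each block of samples is coded with the Gaussian density whose covariance
    is the ML completion, under A, of the sample covariance of all earlier
    samples. The first n0 samples would be coded with a fixed default
    distribution, identical for every lambda, and so are omitted here."""
    n, p = X.shape
    n0 = n0 if n0 is not None else p + 1    # enough samples for a stable estimate
    bits = 0.0
    for start in range(n0, n, block):
        S = X[:start].T @ X[:start] / start      # sample covariance so far
        Sigma = covariance_completion(S, A)      # see sketch above
        stop = min(start + block, n)
        ll = multivariate_normal.logpdf(X[start:stop], mean=np.zeros(p), cov=Sigma)
        bits -= np.sum(ll) / np.log(2.0)         # nats -> bits
    return bits
```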
Finally, the best conditional independence graph structure is obtained by minimizing the sum of the codelength of $G$ and of the data encoded under $G$. Algorithm 1 summarizes the procedure for finding the best graph model of the data, associated with $\hat{\lambda}$ in (2).
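The following Python sketch conveys the structure of this procedure under our sketches above; it is an illustration, not a verbatim transcription of Algorithm 1. scikit-learn's GraphicalLasso stands in for the estimator, and any of the graph coders can be plugged in for `graph_codelength`.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def select_model(X, lambdas, graph_codelength=iid_codelength_bits):
    """Selection rule (2): for each lambda, fit graphical lasso, read off
    the conditional independence graph, and keep the lambda minimizing the
    two-part codelength L(G) + L(x | G)."""
    best_bits, best_lam, best_A = np.inf, None, None
    for lam in lambdas:
        K = GraphicalLasso(alpha=lam, max_iter=200).fit(X).precision_
        A = ci_graph(K)                      # adjacency per (1)
        bits = graph_codelength(A) + predictive_mdl_bits(X, A)
        if bits < best_bits:
            best_bits, best_lam, best_A = bits, lam, A
    return best_lam, best_A, best_bits
```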
III-C. Experiments
We now provide simulation results on synthetic data generated from a zero-mean multivariate Gaussian distribution with known precision matrix $K$. We compare the performance of our approach against other methods in recovering the conditional independence graph. We use the F1-score, a widely used metric: the harmonic mean of precision and recall, where precision is the ratio of the number of correctly estimated edges to the total number of edges in the estimated graph, and recall is the ratio of the number of correctly estimated edges to the total number of edges in the true graph [21].
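Explicitly,

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

which equals 1 only when the estimated and true edge sets coincide, and penalizes both spurious and missing edges.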
We consider two cases: 1) $n = 2p$ and 2) $n = p/2$, with $p \in \{100, 200\}$. We applied the graphical lasso method of [8] for 50 realizations of $\lambda$ in the range $(0, \lambda_{\max}]$, where $\lambda_{\max}$ is the minimum value of $\lambda$ that yields totally isolated nodes. The multivariate Gaussian data was generated with precision matrix structures that have been frequently considered as test cases in the literature [31, 32]:
- Cycle structure: the conditional independence graph is a single cycle through all the nodes, so the precision matrix has nonzero entries on the first off-diagonals and in the two corner entries.
- Autoregressive process of order one, AR(1): the precision matrix is tridiagonal.
- Autoregressive process of order two, AR(2): the precision matrix is banded, with nonzeros within two positions of the diagonal.
- Erdős–Rényi (ER) structure with $K = B + \delta I$, where $B$ is a matrix whose off-diagonal elements take values randomly chosen from a uniform distribution with a given edge probability and whose diagonal elements are zero. To keep $K$ positive definite, $\delta$ is set according to the absolute value of the minimum eigenvalue of $B$. Here $I$ is the identity matrix of size $p$ (see the sketch after this list).
- Hub structure with two hubs. First, we create the adjacency matrix by setting off-diagonal elements to one with a small probability and to zero otherwise. Next, we randomly select two hub nodes and set the elements of the corresponding rows and columns to one with a larger probability and to zero otherwise. Then, for each nonzero element $b_{ij}$, we draw its value randomly from a uniform distribution and set $b_{ji} = b_{ij}$ to keep $B$ symmetric. The final precision matrix is obtained by adding $\delta I$, as in the setup for the Erdős–Rényi graph.
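As an example of the construction, the following sketch generates an Erdős–Rényi precision matrix; the edge probability, the range of the uniform distribution, and the margin are illustrative placeholders, not the values used in our experiments.

```python
import numpy as np

def er_precision(p, edge_prob=0.05, low=0.5, high=1.0, margin=0.1, seed=0):
    """Erdős–Rényi precision matrix: random symmetric off-diagonal support
    with uniformly drawn signed weights, shifted by delta*I so that the
    result is positive definite. All numeric defaults are illustrative."""
    rng = np.random.default_rng(seed)
    B = np.zeros((p, p))
    iu, ju = np.triu_indices(p, k=1)
    mask = rng.random(iu.size) < edge_prob
    vals = rng.uniform(low, high, mask.sum()) * rng.choice([-1.0, 1.0], mask.sum())
    B[iu[mask], ju[mask]] = vals
    B = B + B.T
    delta = abs(np.linalg.eigvalsh(B)[0]) + margin   # |min eigenvalue| + margin
    return B + delta * np.eye(p)
```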
Table I and Table II show the results of applying different methods to recover the conditional independence graph, using graphical lasso as the estimator, for the $n = 2p$ and $n = p/2$ cases, respectively. The results for CV are given for 5-fold CV, and the results for EBIC are given for the recommended value of its penalization parameter [17]. The results for the graph-coding methods are obtained by following the steps outlined in Algorithm 1. The values are the average of 50 Monte Carlo trials. It can be seen that the graph-coding techniques (columns Degree, IID, and Triangle) outperform the other methods, particularly when $n < p$. We observed that coding with the degree distribution and triangle coders gives better performance than coding with the IID structure in most cases. This finding confirms the need for graph coders capable of reflecting more information about the graph in their codelength.
Table I: F1-scores for the $n = 2p$ case (CV, BIC, EBIC: benchmark methods; Degree, IID, Triangle: graph-coding methods).

Type  | p   | n   | CV   | BIC  | EBIC | Degree | IID  | Triangle
------|-----|-----|------|------|------|--------|------|---------
Cycle | 100 | 200 | 0.26 | 0.65 | 0.75 | 1.00   | 1.00 | 1.00
Cycle | 200 | 400 | 0.24 | 0.62 | 0.69 | 0.99   | 0.99 | 0.99
AR(1) | 100 | 200 | 0.25 | 0.64 | 0.76 | 0.99   | 0.88 | 0.98
AR(1) | 200 | 400 | 0.24 | 0.62 | 0.70 | 1.00   | 0.99 | 0.99
AR(2) | 100 | 200 | 0.28 | 0.64 | n/a  | 0.64   | 0.62 | 0.63
AR(2) | 200 | 400 | 0.24 | 0.71 | n/a  | 0.71   | 0.68 | 0.72
ER    | 100 | 200 | 0.28 | 0.63 | 0.51 | 0.67   | 0.60 | 0.67
ER    | 200 | 400 | 0.27 | 0.68 | 0.76 | 0.76   | 0.76 | 0.76
Hub   | 100 | 200 | 0.23 | 0.47 | 0.01 | 0.49   | 0.47 | 0.48
Hub   | 200 | 400 | 0.24 | 0.44 | 0.20 | 0.44   | 0.42 | 0.44
Table II: F1-scores for the $n = p/2$ case (CV, BIC, EBIC: benchmark methods; Degree, IID, Triangle: graph-coding methods).

Type  | p   | n   | CV   | BIC  | EBIC | Degree | IID  | Triangle
------|-----|-----|------|------|------|--------|------|---------
Cycle | 100 | 50  | 0.27 | 0.26 | 0.75 | 0.89   | 0.79 | 0.89
Cycle | 200 | 100 | 0.23 | 0.29 | 0.68 | 0.96   | 0.80 | 0.96
AR(1) | 100 | 50  | 0.28 | 0.26 | 0.73 | 0.89   | 0.80 | 0.88
AR(1) | 200 | 100 | 0.23 | 0.29 | 0.70 | 0.95   | 0.78 | 0.94
AR(2) | 100 | 50  | 0.32 | 0.30 | n/a  | 0.36   | 0.37 | 0.36
AR(2) | 200 | 100 | 0.27 | 0.33 | n/a  | 0.52   | 0.53 | 0.52
ER    | 100 | 50  | 0.28 | 0.24 | 0.39 | 0.36   | 0.44 | 0.46
ER    | 200 | 100 | 0.29 | 0.35 | 0.62 | 0.60   | 0.63 | 0.63
Hub   | 100 | 50  | 0.18 | 0.16 | 0.02 | 0.25   | 0.28 | 0.18
Hub   | 200 | 100 | 0.15 | 0.19 | 0.01 | 0.24   | 0.24 | 0.10
IV. Anomaly Detection
Another application of our graph-coding approach is anomaly detection in multivariate Gaussian data. For detecting anomalous data, we utilize the atypicality framework developed in [33]: a sequence is atypical if it can be coded in fewer bits using a coder that is not the optimum coder for typical sequences. Theoretical properties and the experimental practicality of atypicality are described in [33, 34, 6]. In the following, we describe how atypicality can be used to find anomalies in graph-based data.
IV-A. Anomalous Graph Detection Algorithm
In anomalous graph detection, we are given a set of $n$ i.i.d. observations from a $p$-variate Gaussian distribution as the training set. The question is to determine whether a test set of i.i.d. observations, generated from a possibly different distribution, is anomalous or not.
The general procedure is as follows:

1) On the training data, we use Algorithm 1 to find the best conditional independence graph $G$; that is, we learn the parameters of the graph structure (e.g., the degree distribution) that describe the data best. The typical coder encodes data with this estimated conditional independence graph structure $G$. Both encoder and decoder know the values of the parameters associated with $G$ (e.g., the degree distribution), so these do not need to be encoded.

2) On the test data, we first compute the typical codelength, i.e., the codelength of the test data and of the graph structure obtained in the previous step. We then use Algorithm 1 to find the best conditional independence graph describing the test data, one for each realization of $\lambda$; here we must also add the overhead of encoding the parameters associated with each specific conditional independence graph. The atypical codelength is the minimum of these codelengths.

3) The atypicality measure is the difference between the atypical codelength and the typical codelength. If this difference is negative, or smaller than some threshold, the test data is considered anomalous (a minimal sketch of this decision rule follows the list).
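The decision rule in step 3) reduces to a codelength comparison; a minimal sketch, with hypothetical codelength inputs computed by coders such as those above, is:

```python
def is_anomalous(typical_bits, atypical_bits, threshold=0.0):
    """Atypicality decision: the test data is anomalous when describing it
    with its own best coder (including the bits spent describing that
    coder's parameters) beats the typical coder by more than the threshold."""
    return atypical_bits - typical_bits < threshold
```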
IV-B. Experiments
We tested our method for anomaly detection on a real-world dataset containing 12-lead ECG signals of a group of youths 16–21 years of age. The dataset of 102 ECG variables was initially processed to select features with empirical distributions close to the normal distribution. After screening all features and selecting those that were approximately normally distributed, we ended up with 43 ECG variables for analysis.
The idea is that different age groups may show different patterns in their feature space, patterns that may be too subtle for medical professionals or conventional ECG reading algorithms to recognize. Therefore, if we learn the underlying conditional independence graph for a specific age group, it should not perform as well when applied to another age group. In the experiments, females 16 years of age are considered the typical data, and females of any other age group are considered the atypical data. We expect the universal source coder to give a better model, i.e., a shorter codelength, for the atypical data than the typical coder does.
To measure performance, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used as metrics. We estimated the typical coder on 80% of the typical data. The test data comprises two sets: 1) the 20% holdout portion of the typical dataset, and 2) randomly chosen samples of the same size from each age group in the atypical data (one set per age group). To compute the ROC curve, we repeated the experiment 50 times. Figure 2 depicts the ROC curve of the females of each age group versus 16-year-old females when the degree distribution is used for coding the graph structure. Our methodology can distinguish a single age group (e.g., 17-year-olds) from the training age group (16-year-olds) with a high level of accuracy. Despite the small age gap between the typical and atypical data, and the fact that both datasets were from healthy individuals, there was a distinctive pattern of ECG variables that made the groups distinguishable by our approach. In comparison, established ECG variables used to predict certain heart conditions provide an AUC of only 0.49–0.73 [35].
One may ask how the other graph coders perform on this task, i.e., whether more informative graph coders are needed at all. Our experiments show that with the IID structure coder the performance drops compared to the degree distribution or triangle coder, as given in Table III. This means that the coders based on degree distribution and triangles extract more information from the graph and turn it into shorter codelengths, which leads to better performance.
Table III: AUC of each age group versus 16-year-old females for different graph coders.

Age group   | Degree | IID  | Triangle
------------|--------|------|---------
17-year-old | 0.90   | 0.85 | 0.89
18-year-old | 0.92   | 0.90 | 0.90
19-year-old | 0.94   | 0.93 | 0.95
20-year-old | 0.93   | 0.92 | 0.92
21-year-old | 0.89   | 0.85 | 0.89
V. Conclusion
In this paper, we extended description length analysis to graph-based data by encoding the graph structure and the data according to that graph structure. We introduced two applications: 1) model selection and 2) anomaly detection in Gaussian graphical models. First, we demonstrated that our approach to model selection gives a more accurate graph model of the data than commonly used methods, which overlook the graph structure as an informative part of decision-making. While in most cases coding with the degree distribution and triangle coders outperforms coding with the IID structure by a wide margin, there are a few cases where the IID structure offers a slight improvement. In [36], we address this by combining the degree distribution and triangle coders into a single coder that works well in all cases. Second, we utilized description length for anomaly detection by using codelength as the decision criterion. In the future, we will extend this approach to other types of graph-based data and applications.
References
 [1] S. J. Kon, “Models of stock returns—a comparison,” The Journal of Finance, vol. 39, no. 1, pp. 147–165, 1984.
 [2] V. Golosnoy, B. Gribisch, and R. Liesenfeld, “The conditional autoregressive wishart model for multivariate stock market volatility,” Journal of Econometrics, vol. 167, no. 1, pp. 211–223, 2012.
 [3] C. Baldassi, M. Zamparo, C. Feinauer, A. Procaccini, R. Zecchina, M. Weigt, and A. Pagnani, “Fast and accurate multivariate gaussian modeling of protein families: predicting residue contacts and protein-interaction partners,” PloS one, vol. 9, no. 3, p. e92721, 2014.
 [4] G. Varoquaux, A. Gramfort, J.-B. Poline, and B. Thirion, “Brain covariance selection: better individual functional connectivity models using population prior,” Advances in neural information processing systems, vol. 23, pp. 2334–2342, 2010.
 [5] J. Rissanen, “Stochastic complexity and modeling,” The Annals of Statistics, vol. 14, no. 3, pp. 1080–1100, Sep. 1986.
 [6] A. Høst-Madsen and J. Zhang, “Coding of graphs with application to graph anomaly detection,” in 2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018, pp. 1829–1833.
 [7] Y. Choi and W. Szpankowski, “Compression of graphical structures: Fundamental limits, algorithms, and experiments,” IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 620–638, 2012.
 [8] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
 [9] A. Høst-Madsen, E. Sabeti, and C. Walton, “Data discovery and anomaly detection using atypicality: Theory,” IEEE Transactions on Information Theory, vol. 65, no. 9, pp. 5302–5322, 2019.
 [10] N. Meinshausen and P. Bühlmann, “High-dimensional graphs and variable selection with the lasso,” The Annals of Statistics, vol. 34, no. 3, pp. 1436–1462, 2006.
 [11] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data,” Journal of Machine Learning Research, vol. 9, no. Mar, pp. 485–516, 2008.
 [12] M. Yuan and Y. Lin, “Model selection and estimation in the gaussian graphical model,” Biometrika, vol. 94, no. 1, pp. 19–35, 2007.
 [13] C. Lam and J. Fan, “Sparsistency and rates of convergence in large covariance matrix estimation,” Annals of statistics, vol. 37, no. 6B, p. 4254, 2009.
 [14] D. M. Witten, J. H. Friedman, and N. Simon, “New insights and faster computations for the graphical lasso,” Journal of Computational and Graphical Statistics, vol. 20, no. 4, pp. 892–900, 2011.
 [15] C.J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, “Quic: quadratic approximation for sparse inverse covariance estimation,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 2911–2947, 2014.
 [16] P. Menéndez, Y. A. Kourmpetis, C. J. ter Braak, and F. A. van Eeuwijk, “Gene regulatory networks from multifactorial perturbations using graphical lasso: application to the dream4 challenge,” PloS one, vol. 5, no. 12, p. e14147, 2010.
 [17] R. Foygel and M. Drton, “Extended bayesian information criteria for gaussian graphical models,” in Advances in neural information processing systems, 2010, pp. 604–612.
 [18] A. J. Rothman, P. J. Bickel, E. Levina, J. Zhu et al., “Sparse permutation invariant covariance estimation,” Electronic Journal of Statistics, vol. 2, pp. 494–515, 2008.
 [19] H. Lian, “Shrinkage tuning parameter selection in precision matrices estimation,” Journal of Statistical Planning and Inference, vol. 141, no. 8, pp. 2839–2848, 2011.
 [20] S. Lysen, “Permuted inclusion criterion: a variable selection technique,” Publicly accessible Penn Dissertations, p. 28, 2009.
 [21] H. Liu, K. Roeder, and L. Wasserman, “Stability approach to regularization selection (stars) for high dimensional graphical models,” in Advances in neural information processing systems, 2010, pp. 1432–1440.
 [22] I. Vujačić, A. Abbruzzo, and E. Wit, “A computationally fast alternative to cross-validation in penalized gaussian graphical models,” Journal of Statistical Computation and Simulation, vol. 85, no. 18, pp. 3628–3640, 2015.
 [23] A. C. Mestres, N. Bochkina, and C. Mayer, “Selection of the regularization parameter in graphical models using network characteristics,” Journal of Computational and Graphical Statistics, pp. 1–11, 2018.
 [24] W. Wang, M. J. Wainwright, and K. Ramchandran, “Informationtheoretic bounds on model selection for gaussian markov random fields,” in 2010 IEEE International Symposium on Information Theory. IEEE, 2010, pp. 1373–1377.
 [25] D. Hallac, Y. Park, S. Boyd, and J. Leskovec, “Network inference via the timevarying graphical lasso,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 205–213.
 [26] V. Avanesov, N. Buzun et al., “Change-point detection in high-dimensional covariance structure,” Electronic Journal of Statistics, vol. 12, no. 2, pp. 3254–3294, 2018.
 [27] A. Maurya and M. Cheung, “Contrastive structured anomaly detection for gaussian graphical models,” in 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2018, pp. 736–739.
 [28] J. Rissanen, “A universal prior for integers and estimation by minimum description length,” The Annals of Statistics, vol. 11, no. 2, pp. 416–431, 1983.
 [29] R. Grone, C. R. Johnson, E. M. Sá, and H. Wolkowicz, “Positive definite completions of partial hermitian matrices,” Linear algebra and its applications, vol. 58, pp. 109–124, 1984.
 [30] A. P. Dempster, “Covariance selection,” Biometrics, pp. 157–175, 1972.
 [31] L. Li and K.-C. Toh, “An inexact interior point method for ℓ1-regularized sparse covariance selection,” Mathematical Programming Computation, vol. 2, no. 3–4, pp. 291–315, 2010.
 [32] K. M. Tan, P. London, K. Mohan, S.I. Lee, M. Fazel, and D. Witten, “Learning graphical models with hubs,” arXiv preprint arXiv:1402.7349, 2014.
 [33] A. Høst-Madsen, E. Sabeti, and C. Walton, “Data discovery and anomaly detection using atypicality: Theory,” IEEE Transactions on Information Theory, vol. 65, no. 9, pp. 5302–5322, 2019.
 [34] E. Sabeti, C. Walton, S. J. Lim et al., “Universal data discovery using atypicality,” in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016, pp. 3474–3483.
 [35] A. Bratincsák, M. Williams, C. Kimata et al., “The electrocardiogram is a poor diagnostic tool to detect left ventricular hypertrophy in children: A comparison with echocardiographic assessment of left ventricular mass,” Congenit Heart Dis., vol. 10, pp. 164–171, 2015.
 [36] M. Abolfazli, A. Høst-Madsen, J. Zhang, and A. Bratincsak, “Graph coding with application to model selection and anomaly detection,” unpublished.