Bayesian Networks (BNs) are probabilistic graphical models that have been widely used for representing dependencies and independencies among variables of a problem (Zhang et al., 2017; Koller and Friedman, 2009)
. We can consider a BN as a compact representation of a joint probability distribution function(Margaritis, 2003). A BN consists of a structure which is a Directed Acyclic Graph (DAG), and a parameter set which represents the quantitative information about dependencies among a set of variables.These networks have been widely used in the fields of machine vision (Wu et al., 2020), bioinformatics (Luo et al., 2017), data fusion (Akbar et al., 2018), and decision-making support systems (Cypko et al., 2017). Learning the structure of a BN has attracted a great deal of attention over the last decades. Although many approaches have been developed for learning a continuous BN, there is a lack for the discrete one. The reason is learning a discrete BN is a challenging problem due to the large parameter space and the difficulty in searching for an efficient structure. However, the problem becomes more challenging when we need to learn a sparse BN from high-dimensional discrete data. There is great demand for sparse structures in a broad collection of problems from brain sciences (Hastie et al., 2019) to biological sciences (Cassidy et al., 2014). For example, identifying brain regions interacting in task performance (Li et al., 2008), or modeling the interacting patterns between genes from microarray gene expression data (Rodin and Boerwinkle, 2005) represents high-dimensional data in which the number of samples is equal or less than the number of variables. In addition, many real-world networks, such as gene association networks and brain connectivity networks are sparse (Shuai et al., 2013). Hence, accurately learning a sparse structure from such datasets is of great importance.
Most methods that have been proposed for BN structure learning fall into three categories: 1) constraint-based methods (Jiang et al., 2018) such as PC (Spirtes et al., 2000), MMPC (Tsamardinos et al., 2006), and FCI (Colombo et al., 2012), 2) score-based methods (Shuai et al., 2013; Adabor et al., 2015), and 3) hybrid methods (Perrier et al., 2008; Dai et al., 2020) such as MMHC (Tsamardinos et al., 2006). With growing tendency toward sparse modeling, score-based methods have attracted more attention since their capability of applying constraints to the score functions.
Score-based methods assign a score to each structure and then search for the structure with the best score. Different score functions such as Bayesian Dirichlet (BD) metric, Bayesian Information Criterion (BIC), Minimum Description Length (MDL), and entropy-based metrics are used in score-based structure learning. After assigning scores, a search algorithm is used to find the optimal structure (with the optimal score). Various search algorithms have been proposed for search step in score-based structure learning methods, such as Hill Climbing (HC), k2 (Cooper and Herskovits, 1992) and Monte Carlo methods (Niinimaki et al., 2016; Zhou, 2011). The most important weakness of score-based methods is that finding an exact structure is an NP-Hard problem (Chickering et al., 2004; Malone, 2015)
, hence, heuristic search techniques(Scutari et al., 2019)2020; Contaldi et al., 2019) and simulated annealing method (Lee and Kim, 2019) were employed for solving this problem.
BN structure learning can be modeled as an optimization problem. By adding a penalty term to the score function, conditions such as sparsity and DAG property of BN can be modeled in the score function. By optimizing the new score function, we can find the best sparse structure. Sparse Candidate (SC) (Friedman et al., 2013), L1MB-DAG (Schmidt et al., 2007) which uses LASSO to select a sparse structure, and SBN (Zhang et al., 2017) which proposed for Gaussian data are some examples of this approach. In another research (Shojaie and Michailidis, 2010)2013) used a maximum likelihood function with a L1 penalty term as score function. Afterward, this algorithm has been expanded to use a concave penalty term (Aragam and Zhou, 2015).
However, most of the research about BNs is concentrated on continuous one. Learning discrete BNs is challenging because of large parameter space and difficulty in searching for a sparse structure. Recently, a block-wised coordinate descent (BCD) method has been proposed for learning discrete BN’s structure (Gu et al., 2019). The proposed algorithm is a deterministic method which uses Newton-Raphson technique to optimize a score function. BCD algorithm employs an L1 regularized likelihood score function and a constraint has been applied to the score function to guarantee the sparsity. The algorithm needs further efforts to ensure the acyclicity constraint. In addition, stochastic methods have shown better performance in high-dimensional problems, but the problem with the stochastic gradient descent (SGD) is that its variance is large, and in particular does not decay to zero with the iteration index. Hence, we have to decay the size to zero step-by-step and hence the convergence rate is slow. Since, new algorithms have to be developed for discrete BN structure learning to accommodate these difficulties.
As a solution, we have proposed a stochastic score-based method called stochastic variance reduction coordinate descent (SVRCD) for discrete BN sparse structure learning in this paper. We have proposed a new penalized log-likelihood score function and developed an optimization algorithm accordingly. To achieve a sparse directed acyclic structure, we have added two penalties to the likelihood term to define the new score function, one for controlling sparsity, and another for achieving a directed acyclic structure (Shuai et al., 2013). The optimization algorithm uses a BCD method in which each coordinate is optimized using SGD. We have utilized a reducing variance method (Johnson and Zhang, 2013) in SGD to accelerate the optimization process. The efficiency, scalability, and robustness of the algorithm are measured. The results show that SVRCD outperforms some of the well-known algorithms in BN structure learning.
The rest of the paper is organized as follows: Section 2 briefly reviewed the BN and required definitions. Section 3 presents our novel score function definition and its components. Section 4 presents the SVRCD algorithm. Section 5 reports the result of the proposed algorithm for simulated networks. Concluding remarks are presented in Section 6.
A BN is a graphical representation for a joint probability distribution function. This graphical model consists of a parameter set and a structure. The structure is a DAGthat is the set of nodes, and is the set of edges , where and are respectively parent and child. Given , joint probability distribution function is represented as:
is a random variable for, and is parent set of . Because of the big parameter space and the difficulty in finding a sparse structure, BN structure learning is a challenge. The score-based approach is a popular structure learning method. In this approach, the score function assigns a numeric value to each structure showing how the data fits the structure. Given dataset and graph , the score is represented as:
We consider the equal probability for all structures, and therefore, just the numerator of the equation 2 need to be optimized. If the structure score is a summation of all variables scores, the score function is called decomposable. Given a variable and its parents, the variable score is:
Most score functions have a Penalized Log-Likelihood (PLL) form. We can formulate the constraints by adding a penalty term to Log-Likelihood (LL) function. Given an LL function is defined as:
where is the value of in data row h, and is the value of ’s parents in data row h. A PLL function is defined as:
where is the penalty term that has been added to the score function.
There are different optimization methods for optimizing the score function. A coordinate descent (CD) method is an iterative optimization method which in each iteration optimizes the function given one component. This method decomposes the problem to some reduced dimension sub-problems (Wright, 2015). A BCD algorithm is a CD algorithm but optimizes the score function using a subset of all components in each iteration.
Algorithm 1 is a CD algorithm which has two steps: choosing a component of the score function, and optimizing the score function with regard to the selected component using a gradient descent (GD) method. Using SGD method in the second step, we could have better performance in high-dimensional datasets. This method optimizes the score function using just a mini batch of data which is selected stochastically.
A convex function has just one global and no local optimum. Convex functions are very important in optimization problems, since formulating a problem as a convex function could help to find the global optimum. Commonly, a quadratic optimization method such as Newton-Raphson is used for optimizing convex functions. Quadratic methods need calculating the Hessian of the function that is impossible or too complex in most problems. Instead of quadratic methods we can use GD and SGD with an appropriate learning rate . GD and SGD could find global optimum in a convex function. In addition, these methods use the first derivative of a function that is simply calculable.
3 The Proposed Score Function
Structure learning could be modelled as an optimization problem by defining an appropriate score function, and hence, it could be solved using an optimization algorithm. We have proposed a PLL score function with two penalty terms to control structure acyclicity and sparsity simultaneously. Consider as a discrete variable of a BN which its domain has value . is ’s parents that has value in its domain. The parameter set of a BN is:
The number of parameters is:
If for all , the number of parameters is:
We use a multi-logit model(Gu et al., 2019) to decrease the number of parameters. Using this model the likelihood function is:
where we code each using dummy variables as a vector ; if then and , and if then we say has the reference value.
is a coefficient vector for predicting how probable is. If then . We use a symmetric multi-logit model and add two constraints to the model since the model is identifiable (Friedman et al., 2010):
The number of parameters in this model is:
Consider for all , then:
Multi-logit model is an appropriate estimation of the product multinomial model. In this model, the number of parameters increases linearly with the number of edges. This model is much more efficient than the product multinomial model which grows exponentially as the size of the parent set increases. We consider dataset , that has been sampled from , where is the value of in data row . Using multi-logit model, could be defined as:
where is index variable, and for all . The score function controls the structure sparsity using a penalty term. We use a group norm penalty term that penalizes all components of the score function at the same. Let be a graph that is learned from parameter set . The sparsity penalty term is defined as:
We use a different penalty term that causes the structure to not have any directed cycles. Let us consider and are adjacency and path matrices, respectively. Path matrix would be defined as:
Using a BFS algorithm and an adjacency matrix, we can check the existence of a path between every two nodes. Lemma A sufficient and necessary condition for existing no path between any pair of nodes is:
Proof Consider is a DAG. First, imagine for every and , . Hence, there is a directed edge between and , and at least there is a directed path between and . It means that there is a directed cycle in graph, which is a contradiction to our presumption that is a DAG. On the other side, suppose that for every pair of and , . If is not a DAG, there is at least one directed cycle in . Hence, there are two nodes and that there is a path from to . This is a contradiction to our presumption that . Hence, the lemma proved (Shuai et al., 2013).
Regarding the lemma, we can define the following penalty term that guarantees the obtained structure is a DAG.
Hence, the score function is:
where and are regulation parameters. The greater , the more sparse structure and the less accuracy. On the other side, selecting a large , we would have a more sparse structure, but we might miss some edges.
4 The Proposed Optimization Algorithm
In our algorithm, all the weights are updated at the same time. It means, we update instead of just one component of the vector. In each iteration, one weight vector is selected, then using SGD method, the score function is updated regarding the selected weight vector. SGD has lower convergence time in comparison to GD algorithm. Therefore, SGD works efficiently in high-dimensional datasets.
SGD calculates gradient using just one or a part of the data, not all the data, and it causes big variance in estimations. Selecting an inappropriate data could mislead the optimization. We need a small since a large causes big variances and hence the algorithm does not converge correctly. SGD uses smaller , but still, the variance problem exists.
There are some methods and solutions that reduce the estimation variance. Using these solutions, we can select a larger . One practical issue for SGD is that in order to ensure convergence the learning rate has to decay to zero. This leads to slower convergence. The need for a small learning rate is due to the variance of SGD. Using variance reduction method in SGD, the learning rate for SGD does not have to decay, which leads to faster convergence as one can use a relatively large (Johnson and Zhang, 2013). We use a variance reduced SGD in the optimization step in our proposed algorithm.
To optimize with regard to , would be defined as:
Gradient of the cost function is:
To reduce the estimation variance, we have used SVRG (Johnson and Zhang, 2013) method. SVRG saves after each iteration and then calculates and for .
Weight vectors are updated as follow:
where , hence:
It means this algorithm averagely is equal to ordinary GD.
Our proposed algorithm SVRCD is a BCD algorithm that selects a weight vector in each iteration and then optimizes the cost function. The pseudocode is shown in Algorithm 2. SVRCD needs determined regulation parameters , , and also learning rate .
5 Experimental Results
|Number of estimated edges||Predicted|
|Number of edges which are in skeleton and have right direction||Expected|
|Number of edges which are in skeleton but have wrong direction||Reverse|
|Number of edges which are not recognized||Missing|
|Number of edges which are not in original BN but are in estimated BN||False Positive|
|Right edges learning rate||True Positive Rate|
|Wrong edges learning rate||False Discovery Rate|
|Distance between estimated and original BN||Structural Hamming Distance|
|Similarity of estimated and original BN||Jaccard Index|
In this section, we evaluate the proposed algorithm. At first, we tested the algorithm with different parameter settings to find the best values for , , and . Then we compare SVRCD with other known algorithms such as HC, PC, MMHC, and a new competitive algorithm CD (Gu et al., 2019). PC and MMHC are respectively constraint based and hybrid methods proposed for BN structure learning while the other methods are score-based BN structure learning methods. At the end of this section, the algorithm scalability and noise robustness are evaluated.
We have generated simulated datasets using bipartite, scale-free, and random graphs. To produce bipartite, and scale-free graphs, we have used the igraph package (Csardi et al., 2006) in the R environment. Bipartite graphs have upper and lower nodes where is the number of all nodes in the graph, and directed edges from upper nodes to lower ones. Scale-free networks are produced using the Barabasi-Albert (Barabási and Albert, 1999) model which has edges. We have set and , which are two needed parameters for producing scale-free networks in the igraph package. Random graphs have been generated using sparsebnUtils package in R. There are edges in our random graphs.
We have run HC and MMHC algorithms using bnlearn package (Scutari, 2009), and PC algorithm using Pcalg package in R. For CD algorithm, we have used its implementation in discretedAlgorithm package (Aragam et al., 2019) in R. We have set CD parameters to their default values. All variables in graphs are considered to be binary in all implementations. Weight vectors have been initialized randomly with a value in range . We have considered different values for the number of data rows and variables in evaluations. The goal is to evaluate SVRCD using high dimensional discrete data; hence we have set in data generation process. Different evaluation metrics are used in this paper that are introduced in table 1. All results are averages over 20 datasets for each setting.
In the first step of evaluations, we measure the impact of on evaluation metrics. We have reported the results in Table 2. As shown in the table, in , has the greatest value, since determines how sparse the structure can be. We should set to a proper value that controls sparsity but does not decrease other metrics such as . SVRCD has the best SHD and JI in for . In addition, our proposed algorithm has a greater P in , but a greater FP, FDR, SHD, and a lower JI as well. Also, SVRCD estimates a sparser structure using but E has a low value in this setting.
Table 3 shows the results for different values of . This parameter puts a constraint on the structure to be a DAG. Also, has a small effect on sparsity, hence selecting a proper value for this parameter is crucial. A large value of causes a large FP and a low TP, and a low value of causes the structure not to be a DAG. The results show that with our generated dataset, the best value for this parameter is 0.2. Although SVRCD has the lowest FP in , E, TPR, FDR, SHD, and JI have the best values in . Hence one will observe a better overall performance in .
Table 4 reports measurement metrics with regard to different values for learning rate . A large value for results in a wrong learnt structure and a large FDR and FP. Furthermore, a low increases the runtime. We have found the best setting in our evaluations. SVRCD has a better TPR in , but a larger SHD. SHD and JI specify the quality of the estimated structure. Hence, is the best setting in our evaluations.
The second step of evaluations compares SVRCD with some known algorithms. First, the datasets are sampled from a bipartite graph with setting. As shown in Table 5, our proposed algorithm outperforms BCD, PC, HC, and MMHC algorithms regarding FDR, SHD, and JI metrics. Although HC demonstrates better TPR and E than SVRCD, it estimates 387.1 edges for a BN with 200 edges. Hence, the estimated structure is not sparse and has more edges than the original structure. SVRCD estimated two more FP edges than the PC algorithm. The reason is that the PC has estimated fewer edges, and has a lower E. SVRCD with has defeated PC with . In addition, SVRCD has better FDR, TPR, SHD, and JI than PC. Hence, SVRCD generally outperformed PC. BCD is one of the recently proposed algorithms for estimating sparse BN using high-dimensional discrete data. We compared SVRCD to this algorithm as well. Our proposed algorithm leads to much better SHD, JI, FDR, TPR, and E than BCD, but BCD has estimated a sparser structure and has a lower FP. However, SHD and JI are two important metrics that show the overall performance of a structure learning algorithm. We have also repeated our evaluations with data sampled from random graphs, and scale-free networks. The results are reported in Table 5.
The next set of evaluations examines the scalability of SVRCD with respect to . The results have been reported in Table 6. The greater and the lower are, the more sophisticated problem is. We obtain from Table 6 that SVRCD is more efficient for than . On the other hand, comparing to , we observe that SVRCD achieves the same TPR, FDR, and JI. It means that increasing the number of variables does not have a significant negative effect on the TPR, FDR, and JI. SVRCD can obtain admissible results on high-dimensional data.
In the fourth step of evaluations, we have tested SVRCD using datasets with settings generated from bipartite graphs, random graphs, and scale-free networks to measure the performance of our proposed algorithm in estimating each graph. For each test, we generated 20 datasets from each graph. Figure 1 represents the means and variances of E, FP, FDR, and SHD while estimating each graph using 20 datasets. We have reported JI, and TPR means and variances in Figure 2 with more details since the variances are too small. In the horizontal axis, 1, 2, and 3 represent the bipartite graph, random graph, and scale-free network, respectively.
The final part of evaluations calculates the algorithm robustness against noise. We have sampled data from a bipartite graph with . Then we have swapped a specific portion of data to their complement values. The results are reported in Table 7 and Figure 3. From Figure 3, we observe that increasing the noise causes an increase in E and TPR errors. But, FDR increases at first and then decreases, since increasing the noise causes the algorithm to predict fewer edges (smaller P).
In this paper, we proposed a novel BN structure learning algorithm, SVRCD, for learning BN structures from discrete high-dimensional data. Also, we proposed a score function consisting of three parts: likelihood, sparsity penalty term, and DAG penalty term. SVRCD is a BCD algorithm that uses a variance reduction method in the optimization step, which is an SGD method. Variance reduction method is used to prevent large variances in estimations and hence, increases the convergence speed. As shown in the previous section, our algorithm outperforms some known algorithms in BN structure learning. SVRCD shows significant results in learning BN structure from discrete high-dimensional data. One of the advantages of SVRCD is that more data rows causes a lower algorithm runtime. Because it uses an SGD method and when the data rows increase, the algorithm iterations should be set on a lower value to prevent the algorithm overfitting. Obtained results show that SVRCD dominates the recently proposed BCD algorithm in SHD and JI metrics.
In future work, we will investigate and prove the algorithm convergence for different discrete data distributions. Furthermore, this algorithm is proposed for learning discrete BN, but the same formulation may be adopted for Gaussian BN. In addition, since BNs have many applications in medical studies, SVRCD may help medical problems such as constructing gene networks. Another potential future work can be finding tuning parameters causing the algorithm to converge to the optimum point.
- SAGA: a hybrid search algorithm for bayesian network structure learning of transcriptional regulatory networks. Journal of biomedical informatics 53, pp. 27–35. Cited by: §1.
- Real-time probabilistic data fusion for large-scale iot applications. Ieee Access 6, pp. 10015–10027. Cited by: §1.
- Learning large-scale bayesian networks with the sparsebn package. Journal of Statistical Software 91 (1), pp. 1–38. Cited by: §5.
Concave penalized estimation of sparse gaussian bayesian networks.
The Journal of Machine Learning Research16 (1), pp. 2273–2328. Cited by: §1.
- Emergence of scaling in random networks. science 286 (5439), pp. 509–512. Cited by: §5.
- Brain activity: connectivity, sparsity, and mutual information. IEEE transactions on medical imaging 34 (4), pp. 846–860. Cited by: §1.
- Large-sample learning of bayesian networks is np-hard. Journal of Machine Learning Research 5. Cited by: §1.
- Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pp. 294–321. Cited by: §1.
- Bayesian network hybrid learning using an elite-guided genetic algorithm. Artificial Intelligence Review 52 (1), pp. 245–272. Cited by: §1.
- A bayesian method for the induction of probabilistic networks from data. Machine learning 9 (4), pp. 309–347. Cited by: §1.
- The igraph software package for complex network research. InterJournal, complex systems 1695 (5), pp. 1–9. Cited by: §5.
- Validation workflow for a clinical bayesian network model in multidisciplinary decision making in head and neck oncology treatment. International journal of computer assisted radiology and surgery 12 (11), pp. 1959–1970. Cited by: §1.
- An improved evolutionary approach-based hybrid algorithm for bayesian network structure learning in dynamic constrained search space. Neural Computing and Applications 32 (5), pp. 1413–1434. Cited by: §1.
- Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33 (1), pp. 1. Cited by: §3.
- Learning bayesian network structure from massive datasets: the” sparse candidate” algorithm. arXiv preprint arXiv:1301.6696. Cited by: §1.
- Learning sparse causal gaussian networks with experimental intervention: regularization and coordinate descent. Journal of the American Statistical Association 108 (501), pp. 288–300. Cited by: §1.
- Penalized estimation of directed acyclic graphs from discrete data. The Journal of Statistics and Computing 19 (1), pp. 161–176. Cited by: §1, §3, §5.
- Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC. Cited by: §1.
- An improved constraint-based bayesian network learning method using gaussian kernel probability density estimator. Expert Systems with Applications 113, pp. 544–554. Cited by: §1.
- Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems 26, pp. 315–323. Cited by: §1, §4, §4.
- Probabilistic graphical models: principles and techniques. MIT press. Cited by: §1.
- Parallel simulated annealing with a greedy algorithm for bayesian network structure learning. IEEE Transactions on Knowledge and Data Engineering 32 (6), pp. 1157–1166. Cited by: §1.
- Dynamic bayesian network modeling of fmri: a comparison of group-analysis methods. Neuroimage 41 (2), pp. 398–407. Cited by: §1.
- Unraveling biophysical interactions of radiation pneumonitis in non-small-cell lung cancer via bayesian network analysis. Radiotherapy and Oncology 123 (1), pp. 85–92. Cited by: §1.
- Empirical behavior of bayesian network structure learning algorithms. In Workshop on Advanced Methodologies for Bayesian Networks, pp. 105–121. Cited by: §1.
- Learning bayesian network model structure from data. Technical report Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science. Cited by: §1.
- Structure discovery in bayesian networks by sampling partial orders. The Journal of Machine Learning Research 17 (1), pp. 2002–2048. Cited by: §1.
- Finding optimal bayesian network given a super-structure.. Journal of Machine Learning Research 9 (10). Cited by: §1.
- Mining genetic epidemiology data with bayesian networks i: bayesian networks and example application (plasma apoe levels). Bioinformatics 21 (15), pp. 3273–3278. Cited by: §1.
- Learning graphical model structure using l1-regularization paths. In AAAI, Vol. 7, pp. 1278–1283. Cited by: §1.
- Learning bayesian networks from big data with greedy search: computational complexity and efficient implementation. Statistics and Computing 29 (5), pp. 1095–1108. Cited by: §1.
- Learning bayesian networks with the bnlearn r package. arXiv preprint arXiv:0908.3817. Cited by: §5.
- Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika 97 (3), pp. 519–538. Cited by: §1.
- A sparse structure learning algorithm for gaussian bayesian network identification from high-dimensional data. IEEE Trans on Pattern Analysis and Machine Intelligence 35 (6), pp. 1328–1342. Cited by: §1, §1, §1, §3.
- Causation, prediction, and search. MIT press. Cited by: §1.
- The max-min hill-climbing bayesian network structure learning algorithm. Machine learning 65 (1), pp. 31–78. Cited by: §1.
- Coordinate descent algorithms. Mathematical Programming 151 (1), pp. 3–34. Cited by: §2.
Multilayer fractional-order machine vision classifier for rapid typical lung diseases screening on digital chest x-ray images. IEEE Access 8, pp. 105886–105902. Cited by: §1.
- Improved population-based incremental learning of bayesian networks with partly known structure and parallel computing. Engineering Applications of Artificial Intelligence 95, pp. 103920. Cited by: §1.
- Privbayes: private data release via bayesian networks. ACM Transactions on Database Systems (TODS) 42 (4), pp. 1–41. Cited by: §1, §1.
- Multi-domain sampling with applications to structural inference of bayesian networks. Journal of the American Statistical Association 106 (496), pp. 1317–1330. Cited by: §1.