Bayesian networks (BN) are used to portray probabilistic dependencies. They are graphical models that encode a set of conditional independence statements through absence or presence of directed edges among nodes in a graph [1, 2]
. The child and parent relations formed in a such a graph help capture the dependency structure of the domain. BNs have found great application in machine learning, knowledge modeling, desicion systems, and causal learning[3, 4, 5].
A significant statistical problem is to learn a BN from observational data. A promising approach for tackling this learning problem consists of a group of algorithms under the heading of score based learning algorithms (SBL) [6, 7, 8]. The initial step in the majority of the SBL algorithms consists of computing the local scores for all possible child and parent-set combinations [9, 10, 11]. In the case of Bayesian networks over continous domain, the local score for a particular child and parent-set pair is usually calculated as a function of the regression of the child variable over the parent-set. The sheer number of regressions models that need solving presents a computational challenge. For a network with nodes, there are candidate parent sets for a particular node and a total of candidate families (a family consisting of a particular child and parent-set pair). The exponential number of node and parent-set combinations neccesiates the need for efficient numerical strategies.
In this paper, we propose an algorithm for an efficient and exact calculation of regressions for all child and parent-set combinations of a given set of variables. The main data structure employed in the proposed algorithm is QR decomposition (QRD) as it both provides high numerical percision and can capture the dependency between the regressions for different families. Using QRD together with Givens rotations (a low-cost operation), we show how to form a sequence of R matrices that provide the necessary information required to solve the regressions for all the families. In the proposed algorithm, we find the shortest of such sequences.
We further compare the theoretical runtime of the suggested method with different algorithms, mainly those arising in all subset regression problems, and show that our algorithm has the fastest runtime. We also explain how to parallelize the proposed algorithm providing a linear speedup proportional to the number of processors utilized.
1.1 Previous Work
A brute strategy for calculating all regression models would be to solve the regression equations for each family independent of the other families. Given the QRD of a family with parents, the number of flops (a flop consisting only of basic arithmetic operations such as a multipication, a division, a subtraction, or an addition) required for calculating the regression coefficients is of the order of . Forming the QRD itself requires flops given samples. Therefore, the total number of flops required for calculating the regressions for all parent-set and child pairs is:
where is the number of possible combinations of objects out of , and is the number of variables.
We can achieve a faster runtime by using the covariance matrix of the data and Cholesky decompositions and form each QRD in . This however comes at cost of numerical accuracy . Using (1) and replacing with , it can be shown that forming the QRDs for all parent-set and child pairs requires flops.
A better strategy would be to consider the problem of calculating the regression models for all the families as that of solving for multiple all-subset regression problems [13, 14, 15]. In other words, calculating the regression models for a BN over variables can be thought of as all-subset regression problems, one for each node. Using algorithms specialized to all-subset regression problems such as Clarke’s algorithm or DCA, it is possible to achieve a linear speed up in solving for regression models [16, 17, 18]. However, such algorithms still solve the all-subset regression problem for each node independent of the other nodes and do not utilize the full structure of the problem. Thus we conjecture that it is possible to improve the runtime even further.
1.2 Givens Rotations
Reviewing the basics of Givens rotations would help us explain the workings of our proposed algorithm better. Pre-mutliplying a vector with a Givens rotation matrix,, rotates the vector in the plane:
where and in appear at the intersections of and rows and columns. By choosing appropriately, it is possible to rotate a vector such that its component becomes zero while preserving its norm .
1.3 Retriangularization with Givens Rotation
Consider a data matrix for the variables and its corresponding QRD (with the same variable ordering). Having this QRD, it is then possible to compute the QRD of the data matrix with variables ordered as (variables and are transposed in the new order) by pre-multiplying the factor (after having its and columns swapped) by an appropriate Givens rotation matrix:
where can be calculated using :
Such matrix transformation requires flops. We can regard this procedure as an operator that given a starting QRD outputs another QRD by column swapping and retriangularization. We will call this operator and will refer to it by . We use to emphasize that column swapping occurs at index .
2 Problem Definition
We will describe our problem’s framework and our proposed algorithm by considering an example Bayesian network with three nodes, . We use this example to construct a more general framework by the end of this section. In a BN over three variables, there are a total of nine regression models: , where represents the regression model with
as predictor variables and
as the response variable.
For these variables, there are also six QRDs, each corresponding to a specific premutation of these variables. Each QRD can, in turn, be used to solve three regression models:
where we have used for the QRD of the data matrix with variables ordered as . We can go from one permutation to another (from one QRD to another) by using the operation introduced above and transposing two adjacent columns and retriangularization of the resulting matrix by using Givens transformation:
The GRC operation allows us to traverse between the six QRDs for the variables as shown in the graph of Fig 1.
Now consider the sequence of column transpositions giving rise to sequence of permutations:
Note that all 9 regression models are included in this permutation set. Therefore, it is possible to obtain all the QRDs necessary for solving all the regression models in a BN with three variables starting with an initial QRD, and forming four more by four adjacent column transpositions and application of four Givens rotations. Thus the problem of solving for all nine regression models is reduced to performing four simple Givens rotations.
Using this example as a basis, we propose the following algorithm. The algorithm starts with calculating the QRD of the data matrix for variables. This initial QRD can be used to solve regression models for families. Then, through a sequence of adjacent column transpositions and retriangularizations, new QRDs are calculated. Each new QRD provides the information required to solve further regression models until all regression models are accounted for. The only remaining decision is to choose the sequence of column transpositions optimally such that minimum number of column transpositions are employed.
3 A Greedy Algorithm Using Givens Rotations
In our algorithm, we choose the sequence of column transpositions in a greedy manner, leading to a very simple and intuitive algorithm. The greedy choice at each step consists of chooing a column for swapping such that the number of new models specified by the new permutation is the highest. Following is the sequence of greedy column transpositions that provides a sequence of QRDs sufficient for solving all the regressions of a BN with four variables, (the total number of models specified at each step is shown next to the permutations):
On closer examination, one will see that the above greedy algorithm follows a recursive structure. More specifically, consider the variables . Assume that we are given the QR decomposition of the variables in the same order as written above. Note that the greedy algorithm starts at the left most position. Further, note that transposing at , the last variable, at any step, leads to a permutation that only adds one single model. Therefore, a greedy algorithm can always limit its operation to transposing the first variables until all the regression models having predictors in are solved. When the permutations are such exhausted that transpositions in these positions add no new models, then the first transposition of occurs: . In the following steps, unless is at first position, then no transposition on other variables is allowed since such transpositions add no new models. This sequence continues until through transpositions on , this variable comes to the first position of the permutation, . At this step, a greedy algorithm for applies the same sequence of transpositions to that it previously applied to . Therefore, we can implement the proposed greedy algorithm through recursion.
In particular, assume that we have found a sequence of greedy column transpositions that generates a set of QRDs sufficient for finding all the regression models for a BN with nodes. Let us call this sequence of column transpositions . The greedy sequence of column transpositions for a graph with nodes can then be formed in three steps. (1) starts with the same sequence of column transpositions as that of , leading to permuatations necessary for calculation of the regressions of every node over all possible parent-sets not containing . (2) To calculate the regressions of nodes over parent-sets containing , we first move the variable to the start of the ordering by applying column transpositions. (3) The final sequence of column swapping in consists of swappings at indexes for . The resulting sequence of column transposition then form the sequence of greedy column transpositions that generate for finding permutaitons necessary the score table of a BN with nodes. The pseudocode GreedySwaps in Algorithm 1 uses this recursive structure to find the sequence of greedy swaps.
Clarke has proposed an algorithm for solving all subset regression problem with a similar recursive structure [18, 14]. There are, however, major differences between the two algorithms. First, in every QRD, we consider all possible combinations of regressor and predictor variables. For example, in the case of having the QRD of variables , given in alphabetic ordering, Clarke is only concerned with the three regression models , while we consider the regression models in addition to that of Clarke’s. Furthermore, as discussed later in section 6, our algorithmic complexity analysis shows that the proposed algorithm results in a linear speed up compared to that of Clarke’s.
Using the recurrence relation described above, we can find the length of the greedy sequence of column transpositions as a function of number of variables in the BN:
4 Optimality of Greedy Algorithm
In this section, we show that the proposed greedy algorithm is optimal; that the number of operations or column transpositions required for generating a sufficient set of QRDs for solving all the regression models of a BN using the greedy algorithm is minimal. Our proof makes use of an auxiliary problem. This problem is that of solving all subset regression problem for predictors and one (imaginary) response variable. We first show that our original problem is equivalent to this auxiliary problem when the only operation allowed is that of column transposition and retriangularization. Since the optimal solution for the auxiliary problem is known , we can then show that our solution is optimal for the original problem.
In the first step of our proof we show that the two following problems are equivalent:
Problem I (Original). Forming a sequence of QRDs such that all the regression models for a BN with variables are included in the QRD set. The constraints are that we only have access to the starting QRD with variables ordered as . Further, the only operation available is transposing two immediate column and retriangularization using the operator.
Problem II (Auxiliary). Forming a sequence of QRDs such that all the regression models in an all subset rgression problem with predictor variables are included in the QRD set. Again we are given a starting QRD with variables ordered as ( is the response variable). We are also again constrained to only using the operator.
We first show that every solution to Problem I is also a solution to problem II. Specifically, assume that we are looking to solve the regression of over predictor set . In other words, we want a QRD where the variables, , are ordered as , and and are some permutations of variables in the sets and . Assume that the size of the predictor set is smaller than . Since we know the solution to Problem I, then we have access to QRDs that provide the solution to the regressions of node , , over all its possible parent sets. Thus, the solution to Problem I, has to provide a QRD of the form . Choosing and equal to and , we get the desired QRD. If , we simply choose a node , and the solution to Problem I, provides access to a QRD of the form , which is the desired QRD for solving problem II.
In the same manner, we can show that every solution to Problem II is also a solution to Problem I. Assume we wish to solve a regression for a specific parent set and node pair where the parents are in the , and the node is . In other words, we want a QRD where the variables are ordered as , where , and and are some permutations of variables in the sets and . Given the solution to Problem II, we have QRDs for all subset regression models. Therefore, we have access to a QRD where the first variables are ordered as . Choosing and equal to and , we get the desired QRD. This concludes our proof that the above two problems are equivalent.
Our solution to Problem I, uses column transpositions. It has been proven that the minimal number of column transpositions required for solving Problem II, is in fact . Therfore, the proposed greedy algorithm is optimal and no other algorithm can generate a sequence of QRDs of smaller length such that all regression models of a score table for a BN are included in the QRD set.
5 Algorithmic Complexity Analysis
In this section we calculate the runtime of our algorithm and compare it to three other methods for solving the regression models for all possible families of a BN. To analyze the runtime of the proposed algorithm, we make use of the following recurrence relation for the sequence of column transpositions employed by our algorithm:
where is the sequence of column transpositions for variable case, is a sequence of column transpositions starting at index down to the first position, and is concatanation operator. Noting that the number of operations required for applying Given’s transformation to a matrix of size is six more than its counterpart for a matrix of size , we can write the following recurrence to describe the runtime of our algorithm:
where is the number of column transpositions employed by our algorithm when applied to a network with nodes:
The runtime of the proposed method for forming the QRDs of all possible parent-set and child combination is compared to that of Clakre’s all subset regression algorithm , Dropping Column Algorithm (DCA) , and direct brute force using Cholesky decomposition and covariance structure in Table 1.
|Greedy Column Swapping|
In order to design an efficient parallel version of the proposed algorithm, let us re-state the general problem framework introduced in section 4 so as to account for employing of multiple processors. Given processors, we wish to find initial QRDs and sequences of adjacent column transpositions for each of the processors, such that the resulting variable permutations and their corresponding QRDs among all the processing cores solves for all the regression models of a Bayesian network with nodes.
To find the optimal performance gain using processors, we first derive a bound on possible performance improvment in the case of all subset regression problem. Due to discussions in section 5, we know that this bound would be still in effect for the problem of forming all QRDs for all families in a BN. We then propose a near optimal parallelization scheme that achieves a performance gain close to this bound.
Assume that we have found initial QRDs, , and sequences of column transpositions, , , for solving the all subset regression problem using processors. We will denote the length of the sequence of column transpositions performed by processor by . Since the initial QRDs can at most account for regression models and since a column transposition can at most add one new regression model, we have:
Therefore we have:
Thus the number of required column transpositions when using processors is at best of the order of .
We propose a parallelization method that is a direct consequence of the following recurrence relation:
where denotes the sequence of column transpositions performed by the single core algorithm of the previous section for the variable case.
Consider the case where the number of processors, , is two. In this case, we initialize the first core with the QRD of variables ordered as and the second core with the QRD of variables ordered as . For the first processor, we choose the sequence of column transpositions equal to , and for the second processor we employ the sequence of column transpositions . In general, given processors, we can use this recurrence relation times to find the initial QRDs and the sequence of column transpositions for each processor.
More specifically let us represent each of the processors with a binary array of -bits, mapping processor to a binary array equivalent to its binary repressentation. Then, Algorithm 2 can provide the initial permutation and the sequence of column swapping required to be performed at processor .
The number of column transpositions performed by each processors in general is of the order of or .
In this paper, we proposed an algorithm for an efficient and exact calculation of regressions for all the families of a BN. Noting that the regressions for the different families are dependent on each other, we utilized QR decomposition as a data structure for capturing these dependencies. We then used Givens rotations and column transpositions as low-cost operations to efficiently trace a greedy path through the space of QRDs such that all the regression models are included. We showed how the proposed greedy method could be more easily implemented using recursions in section 3. In section 4 we provided a lower bound on the number of column transpositions required for solving the regressions for all the families and showed that the proposed greedy algorithm achieves this lower bound.
We further compared the runtime of our algorithm with specialized algorithms for all-subset regression problems in Table 1. We argued that spcialized all-subset regression algorithms and brute force algorithms do not utilize the whole of the dependency structure among the families. The faster runtime of our proposed method then proves that we make better use of the dependency structure. Although in terms of algorithmic complexity our proposed algorithm has only a constant factor of improvement compared to that of DCA, the memmory requirements of our algorithm is much lower than theirs. Specifically, the proposed algorithm utilizes a storage of size (only a single
matrix needs to be stored at any moment) while DCA requires a storage of size. In section 7 we further provided a near optimal parallelization scheme for our prosed algorithm.
-  Steffen L Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.
-  Finn V Jensen. An introduction to Bayesian networks, volume 210. UCL press London, 1996.
Peter Spirtes, Clark N Glymour, and Richard Scheines.
Causality from probability, volume 112. Carnegie-Mellon University, Laboratory for Computational Linguistics, 1989.
-  Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303(5659):799–805, 2004.
-  Judea Pearl. Causality. Cambridge university press, 2009.
-  Gregory F Cooper and Edward Herskovits. A bayesian method for the induction of probabilistic networks from data. Machine learning, 9(4):309–347, 1992.
-  David Heckerman, Dan Geiger, and David M Chickering. Learning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995.
David Maxwell Chickering and Christopher Meek.
Finding optimal bayesian networks.
Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pages 94–102. Morgan Kaufmann Publishers Inc., 2002.
-  David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002.
-  Tomi Silander and Petri Myllymaki. A simple approach for finding the globally optimal bayesian network structure. arXiv preprint arXiv:1206.6875, 2012.
-  Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning bayesian networks. arXiv preprint arXiv:1207.1429, 2012.
-  James M Ortega. Numerical analysis: a second course. SIAM, 1990.
-  Alan Miller. Subset selection in regression. Chapman and Hall/CRC, 2002.
-  DM Smith and JM Bremner. All possible subset regressions using the qr decomposition. Computational Statistics & Data Analysis, 7(3):217–235, 1989.
-  Petko Yanev, Paolo Foschi, and Erricos John Kontoghiorghes. Algorithms for computing the qr decomposition of a set of matrices with common columns. Algorithmica, 39(1):83–93, 2004.
-  Cristian Gatu, Petko I Yanev, and Erricos J Kontoghiorghes. A graph approach to generate all possible regression submodels. Computational Statistics & Data Analysis, 52(2):799–815, 2007.
-  Cristian Gatu and Erricos J Kontoghiorghes. Parallel algorithms for computing all possible subset regression models using the qr decomposition. Parallel Computing, 29(4):505–521, 2003.
-  MRB Clarke. Algorithm as 163: A givens algorithm for moving from one linear model to another without going back to the data. Journal of the Royal Statistical Society. Series C (Applied Statistics), 30(2):198–203, 1981.