Probabilistic graphical models (PGMs) combine graph theory and probability theory to represent structured distributions. They play an important role in many computationally oriented fields, including combinatorial optimisation, statistical physics, bioinformatics, machine learning, control theory and economics. PGMs are widely used in evolutionary optimisation, especially in Estimation of Distribution Algorithms (EDAs) when interactions among variables are considered (multivariate EDAs).
EDAs are a class of Evolutionary Algorithms (EAs) that explore the search space by building a probabilistic model from a set of the current best candidate solutions. Since new solutions are sampled from the probabilistic model, the evolution is likely to be guided towards more promising areas of the search space. Most multivariate EDAs, such as the Estimation of Bayesian network Algorithm (EBNA) and the Bayesian Optimisation Algorithm (BOA), use PGMs, and in particular Bayesian networks (BNs), to capture multivariate interactions between variables.
Finding the optimal structure of a Bayesian network is NP-hard. Structure learning methods have been extensively studied in [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 2], resulting in a range of algorithms for various settings. However, constructing Bayesian network models remains particularly expensive as the number of variables and interactions increases.
In this work, we consider the Bayesian Optimisation Algorithm (BOA), for which we show experimentally that the changes in the PGM follow a certain pattern. This pattern is then used to propose an alternative version of BOA, called Fast BOA (FBOA), that decides whether to adjust the PGM (adjusting a PGM here means generating a new PGM, either based on the previous one or not) following a probability distribution.
To achieve our goals, we start by analysing the behaviour of BOA and the similarities between consecutive Bayesian networks. Additionally, we explore the impact of problem features on the search performance of both BOA versions, while proposing a new runtime estimation metric that accounts for the computational complexity of both evaluation and PGM generation. We consider the NK-model as our benchmark problem for this study. The results indicate that our model-based BOA is about three times faster than the standard BOA while producing solutions of comparable quality.
This paper is organised as follows: Section 2 provides the mathematical formulation of the NK-landscape model and briefly introduces Bayesian networks as PGM in EDAs. Section 3 presents the BOA framework used in our investigation and introduces a new runtime estimation approach. In Section 4, our model-based variant of BOA (FBOA) is introduced. Results from numerical experiments are reported and discussed in Section 5. Finally, Section 6 presents some conclusions and future directions.
In this section, we first cover the basics of the family of problems with tunable ruggedness that we consider. Then, we describe the Bayesian networks and their use as probabilistic graphical models.
2.1 NK-Landscape Model
The NK-landscape models are a family of combinatorial problems proposed in  aiming to explore the way in which the neighborhood structure and the strength of the interactions between neighboring variables (subfunctions) are linked to the search space ruggedness.
Let $x = (x_1, \ldots, x_N)$ denote a vector of discrete variables and $a = (a_1, \ldots, a_N)$ an assignment to the variables.
An NK fitness landscape is defined by the following components:
Number of variables, $N$.
Number of neighbors per variable, $K$.
A set of neighbors, $\Pi_i \subseteq \{x_1, \ldots, x_N\}$, for each $x_i$, $i \in \{1, \ldots, N\}$, where $\Pi_i$ contains $K$ neighbors.
A subfunction $f_i$ defining a real value for each combination of values of $x_i$ and $\Pi_i$, $i \in \{1, \ldots, N\}$.
Both the subfunction for each variable and the neighborhood structure are randomly set.
The function to be maximized is defined as:
$$f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x_i, \Pi_i).$$
For a set of given parameters, the problem consists of finding the global maximum of the function $f$.
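As an illustration, the definition above can be sketched in a few lines of Python; `make_nk_instance` and `nk_fitness` are hypothetical helper names, not part of any benchmark library.

```python
import random

def make_nk_instance(n, k, rng):
    """Randomly draw a neighbourhood structure and subfunction lookup tables
    for an NK landscape with n binary variables and k neighbours each."""
    neighbours = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    # One table per variable: a random value in [0, 1) per bit combination.
    tables = [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]
    return neighbours, tables

def nk_fitness(x, neighbours, tables):
    """Average the n subfunction values; the bits of x_i and its neighbours,
    read as an integer, index variable i's lookup table."""
    n = len(x)
    total = 0.0
    for i in range(n):
        bits = [x[i]] + [x[j] for j in neighbours[i]]
        idx = int("".join(map(str, bits)), 2)
        total += tables[i][idx]
    return total / n

rng = random.Random(42)
neigh, tabs = make_nk_instance(8, 2, rng)
x = [rng.randint(0, 1) for _ in range(8)]
print(nk_fitness(x, neigh, tabs))  # a value in [0, 1)
```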
2.2 Bayesian networks as PGM
A Bayesian network is a mathematical structure developed to represent a joint probability distribution over a set of variables. Bayesian networks are among the most general probabilistic models for discrete variables used in EDAs [5, 8]. In this paper, we use Bayesian networks to model multinomial data with discrete variables, generating new solutions using the conditional probability described by Equation 2:
$$p(x \mid S, \theta) = \prod_{i=1}^{n} p(x_i \mid pa_i^{S}),$$
where $x = (x_1, \ldots, x_n)$ is a vector representation of the random variables and $x_i$ its $i$-th component; $S$ is the structure and $\theta$ a set of local parameters; $Pa_i^S$ represents the set of parents of $X_i$, with $pa_i^{j,S}$ denoting a particular combination of values for $Pa_i^S$; $q_i$ is the total number of different possible instantiations of the parent variables of $X_i$, given by $q_i = \prod_{X_g \in Pa_i^S} r_g$, with $r_g$ defining the total number of possible values (states) that $X_g$ can assume. The parameter $\theta_{ijk}$ represents the conditional probability that variable $X_i$ takes its $k$-th value ($x_i^k$), knowing that its parent variables have taken their $j$-th combination of values ($pa_i^{j,S}$). This way, the parameter set is given by $\theta = \{\theta_{ijk}\}$, with $i = 1, \ldots, n$, $j = 1, \ldots, q_i$, $k = 1, \ldots, r_i$, where $n$ is the total number of nodes in the Bayesian network.
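The factorisation in Equation 2 directly yields a sampler: visit the nodes in a topological order and draw each variable conditioned on its already-sampled parents. A minimal sketch for binary variables, with illustrative names and a toy three-node chain:

```python
import random

def sample_from_bn(order, parents, theta, rng):
    """Ancestral sampling: visit nodes in a topological order and draw each
    binary variable conditioned on the already-sampled values of its parents.
    theta[i] maps a tuple of parent values to P(X_i = 1 | parents)."""
    x = {}
    for i in order:
        pa_vals = tuple(x[p] for p in parents[i])
        p1 = theta[i][pa_vals]
        x[i] = 1 if rng.random() < p1 else 0
    return [x[i] for i in sorted(x)]

# A toy 3-node chain: X0 -> X1 -> X2.
parents = {0: (), 1: (0,), 2: (1,)}
theta = {0: {(): 0.7},
         1: {(0,): 0.2, (1,): 0.9},
         2: {(0,): 0.5, (1,): 0.5}}
rng = random.Random(0)
print(sample_from_bn([0, 1, 2], parents, theta, rng))
```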
The parameters of $S$ and $\theta$ are usually unknown, and the literature presents two possibilities to estimate them: the Maximum Likelihood Estimate (MLE) and the more general Bayesian Estimate. In this work, we adopt the latter.
In terms of the Bayesian network structure learning process, there are mainly three different approaches: score-based learning, constraint-based learning, and hybrid methods. Score-based techniques apply heuristic optimisation methods to search over structures, selecting the one that maximizes the value of a scoring metric, like the K2 metric in Equation 3:
$$p(D \mid S) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!,$$
where $N_{ijk}$ is the number of observations in the data set $D$ for which $X_i$ assumes its $k$-th value given the $j$-th combination of values from its parents, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.
Constraint-based learning methods typically use statistical tests to identify conditional independence relations from data and build a Bayesian network structure that best fits those relations, such as Incremental Association Markov Blanket (IAMB) and Semi-Interleaved Hiton-PC (SI-HITON-PC). Hybrid methods aim to integrate the two approaches, like Max-Min Hill Climbing (MMHC).
In this paper we consider the K2 algorithm, which is a commonly-used, score-based greedy local search technique that applies the K2 metric (Equation 3). It starts by assuming that a node, in a (pre-defined) ordered list, does not have any parent, then at each step it gradually adds the edges that increase the scoring metric the most, until no edge increases the metric anymore.
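A compact sketch of this procedure, assuming binary variables and computing the K2 metric in log form via `lgamma` to avoid overflow (`log_k2_score` and `k2_search` are illustrative names):

```python
from math import lgamma

def log_k2_score(data, i, parents):
    """Log of node i's contribution to the K2 metric for binary variables:
    prod_j (r-1)!/(N_ij + r - 1)! * prod_k N_ijk!  (here r = 2)."""
    r = 2
    score = 0.0
    counts = {}  # group observations by the joint value of the parents
    for row in data:
        j = tuple(row[p] for p in parents)
        c = counts.setdefault(j, [0, 0])
        c[row[i]] += 1
    for n0, n1 in counts.values():
        # lgamma(m + 1) == log(m!)
        score += lgamma(r) - lgamma(n0 + n1 + r) + lgamma(n0 + 1) + lgamma(n1 + 1)
    return score

def k2_search(data, order, max_parents=2):
    """Greedy K2: for each node (in the given ordering), repeatedly add the
    predecessor that most improves the score, stopping when no edge helps."""
    parents = {i: [] for i in order}
    for pos, i in enumerate(order):
        current = log_k2_score(data, i, parents[i])
        candidates = set(order[:pos])
        while len(parents[i]) < max_parents:
            best, best_score = None, current
            for c in candidates - set(parents[i]):
                s = log_k2_score(data, i, parents[i] + [c])
                if s > best_score:
                    best, best_score = c, s
            if best is None:
                break
            parents[i].append(best)
            current = best_score
    return parents

# Toy data where X1 always copies X0: K2 should add the edge X0 -> X1.
data = [[0, 0]] * 10 + [[1, 1]] * 10
print(k2_search(data, [0, 1], max_parents=1))
```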
3 Revisiting Bayesian optimisation algorithms
In this section, we provide basic concepts used throughout this article (Sub-section 3.1) and also present some contributions (Sub-sections 3.2 and 3.3) that will support the analysis performed later on.
Among the most general probabilistic models for discrete variables used in EDAs are Bayesian networks. Several EDAs have been proposed based on this model, such as EBNA and EBNA$_{K2}$, BOA and hBOA.
In this paper, we adopted BOA, whose framework encompasses the general steps and assumptions discussed in the next section.
3.1 The Framework
The general framework for the EDA considered here is presented in Algorithm 1. The algorithm starts by generating a random initial Bayesian network (Step 2). At each iteration, a set of solutions is generated in Step 4 by sampling solutions from the current model; and evaluated in Step 5. Then, the best solutions in are stored in the current population . Afterwards, the Bayesian network is adjusted based on the current population in Step 7. This process is repeated until a specified stopping criterion is met (e.g. maximum number of iterations, maximum fitness reached, etc.).
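The steps above can be sketched as a generic loop. In the usage example, a univariate frequency model merely stands in for the Bayesian network so the sketch stays self-contained; all names are illustrative.

```python
import random

def eda_loop(sample_model, evaluate, adjust_model, init_model,
             pop_size, n_select, max_iters):
    """Generic EDA skeleton mirroring Algorithm 1: sample, evaluate,
    keep the best solutions, adjust the model, repeat."""
    model = init_model()                                             # Step 2
    population = []
    for _ in range(max_iters):
        offspring = [sample_model(model) for _ in range(pop_size)]   # Step 4
        scored = [(evaluate(s), s) for s in offspring] + population  # Step 5
        scored.sort(key=lambda t: t[0], reverse=True)
        population = scored[:n_select]                               # Step 6
        model = adjust_model(model, [s for _, s in population])      # Step 7
    return population[0][0], population

# Toy usage on onemax with a univariate frequency model (UMDA-style stand-in).
n = 10
random.seed(1)
def adjust(_, pop):
    return [sum(x[i] for x in pop) / len(pop) for i in range(n)]
best, _pop = eda_loop(
    sample_model=lambda m: [1 if random.random() < p else 0 for p in m],
    evaluate=sum,
    adjust_model=adjust,
    init_model=lambda: [0.5] * n,
    pop_size=50, n_select=20, max_iters=30)
print(best)  # best onemax fitness found
```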
3.2 Tracing PGM adjustment
In a standard BOA implementation (Algorithm 1), the PGM is updated at each iteration. If two or more consecutive PGMs are very similar, then the algorithm performs a pointless, time-consuming task.
This is particularly costly for the problem addressed in this paper, as it is done many times for the multiple ruggedness levels and the multiple landscapes of the NK-models; moreover, it can be an issue for most PGM-based approaches.
Aiming to quantify similarities between consecutive PGMs, we use in this work the Structural Hamming Distance (SHD)  as a metric. At each iteration of BOA, we store the SHD value between two consecutive PGM structures, the current and the previous one, generating, at the end of evolution, an SHD vector as large as the total number of iterations that were necessary to converge.
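A simplified SHD between two DAG adjacency matrices can be sketched as follows; this variant counts an added, removed, or reversed edge as one operation, whereas the full SHD of Tsamardinos et al. operates on equivalence classes:

```python
def shd(a, b):
    """Simplified Structural Hamming Distance between two DAG adjacency
    matrices: an added, removed, or reversed edge each costs one operation."""
    n = len(a)
    diff = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Compare the (possibly directed) edge between i and j in both graphs.
            if (a[i][j], a[j][i]) != (b[i][j], b[j][i]):
                diff += 1
    return diff

g1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # edges 0->1, 1->2
g2 = [[0, 0, 0], [1, 0, 1], [0, 0, 0]]  # edges 1->0, 1->2
print(shd(g1, g2))  # the 0-1 edge is reversed: distance 1
```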
In order to extract a common pattern across all runs of the algorithm, we need to aggregate these SHD vectors. As the resulting SHD vectors can be of different sizes due to the convergence speed of each run, we use a methodology to normalise the vector size, which is described in the following steps:
Define the normalised size as the maximum size among all the SHD vectors (i.e. the maximum number of iterations before convergence).
For all the SHD vectors with a size smaller than the normalised size, fill in the gaps by adding a '*' character in a uniform manner.
To obtain the final aggregated normalised vector, join all SHD vectors into a matrix with one row per run. Then we average each column, simply ignoring any value that corresponds to a '*'. Afterwards, we normalise the vector elements into the range $[0, 1]$.
At the end, we have the aggregated normalised information stored in a vector, which describes on average the pattern of PGM adjustments followed by BOA. We believe that this approach is a better fit for aggregating vectors in this particular situation than, for instance, interpolation. The reason we did not opt for a curve-fitting approach was to avoid modifying the original SHD values and including additional values we are not sure are adequate.
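One plausible reading of these steps, with `None` standing in for the '*' filler (a value may be overwritten when a vector is nearly full, a corner case this sketch ignores):

```python
def pad_uniform(v, size):
    """Place the known values at evenly spaced positions of a size-long
    vector; None stands in for the '*' filler character."""
    out = [None] * size
    for rank, value in enumerate(v):
        pos = round(rank * (size - 1) / (len(v) - 1)) if len(v) > 1 else 0
        out[pos] = value
    return out

def aggregate_shd(vectors):
    """Average each column ignoring the fillers, then rescale into [0, 1]."""
    size = max(len(v) for v in vectors)
    rows = [pad_uniform(v, size) for v in vectors]
    means = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        means.append(sum(known) / len(known))
    lo, hi = min(means), max(means)
    return [(m - lo) / (hi - lo) if hi > lo else 0.0 for m in means]

# Two runs of different lengths (their SHD values typically shrink over time).
agg = aggregate_shd([[4, 2, 1], [6, 4, 3, 2, 1, 0]])
print([round(a, 2) for a in agg])
```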
3.3 Estimating the runtime
Let $p_s$ be the probability of success of the algorithm and let $T_u$ be the random variable measuring the number of function evaluations for unsuccessful runs.
After $(R - 1)$ failures, each one requiring $T_u$ evaluations, with the final successful run taking $T_s$ evaluations, the total runtime is then defined as $T = \sum_{i=1}^{R-1} T_u^i + T_s$, where $R$ is the random variable measuring the number of runs. The random variable $R$ follows a geometric distribution with parameter $p_s$. By taking the expectation and by considering independent runs for each instance, stopping at the first success, we have:
$$E[T] = \frac{1 - p_s}{p_s} E[T_u] + E[T_s].$$
The expected runtime for successful runs, $E[T_s]$, is estimated as the average number of function evaluations performed by successful runs, and we denote by $E[T_u]$ the expected runtime for unsuccessful runs. As the expectation of a geometric distribution with parameter $p_s$ is equal to $1/p_s$, the estimated runtime can be expressed as follows:
$$\hat{E}[T] = \frac{1 - \hat{p}_s}{\hat{p}_s} \hat{E}[T_u] + \frac{1}{t_s} \sum_{i=1}^{t_s} T_s^i,$$
where $t_s$ is the number of successful runs and $T_s^i$ is the number of evaluations for successful run $i$.
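This estimator can be sketched as follows, assuming we record the evaluation counts of successful and unsuccessful runs separately (function and variable names are illustrative):

```python
def estimated_runtime(succ_evals, unsucc_evals):
    """ERT-style estimate: mean cost of successful runs plus the expected
    number of failures, (1 - ps)/ps, times the mean cost of a failure."""
    ts = len(succ_evals)
    runs = ts + len(unsucc_evals)
    ps = ts / runs
    e_succ = sum(succ_evals) / ts
    e_unsucc = (sum(unsucc_evals) / len(unsucc_evals)) if unsucc_evals else 0.0
    return (1 - ps) / ps * e_unsucc + e_succ

# 3 successes out of 5 runs; failures hit an evaluation budget of 1000.
print(estimated_runtime([200, 300, 250], [1000, 1000]))  # 250 + (0.4/0.6)*1000
```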
In the context of BOA, it is clear that this approach is far from accurate. In fact, the process of adjusting the PGM has a higher computational complexity than that of evaluating a population of solutions, even for fast greedy algorithms such as K2. (In terms of complexity theory, finding the optimal PGM structure is NP-complete. However, some real-world problems could require more wallclock time for calculating the objective value of a single solution, in which case surrogate models are often used.) Therefore, we propose an alternative approach, specific to our case study, based on the expected time complexity instead of the number of evaluations. More precisely, we take into consideration both the cost of the objective function and the cost of generating a new PGM using a given structure learning algorithm such as K2.
Let us start by estimating the time complexity of the PGM adjustments of a BOA. We assume that it generates a Bayesian network based on a given probability distribution. Let $M_s$ and $M_u$ denote the number of PGM adjustments for successful and unsuccessful runs, respectively. We can define the number of PGM adjustments as $M = \sum_{i=1}^{R-1} M_u^i + M_s$. Thus, the expected number of PGM adjustments is given by:
$$\hat{E}[M] = \frac{1 - \hat{p}_s}{\hat{p}_s} \hat{E}[M_u] + \hat{E}[M_s].$$
By noticing that the complexity of evaluating a population of $\lambda$ solutions, each one represented by an $n$-sized vector, is $O(\lambda n)$, and that the complexity of the K2 algorithm is polynomial in $n$, we unify the runtime unit to the number of elementary operations instead of the number of evaluations, obtaining the runtime equation of Equation (7), which weights the expected numbers of evaluations and PGM adjustments by their respective costs; in our case, $\lambda$ is the sampled population size and $n$ the bitstring length.
4 Designing a model-based BOA
In this section, we start by analysing the similarity patterns between PGM based on the output of a standard BOA implementation. The outcome is then used to propose a model-based algorithm based on the BOA framework that adopts a strategy to reduce the frequency of PGM adjustments.
4.1 Analysis of PGM adjustment patterns
In Figure 1, we show the evolution of SHD values (solid lines) and objective values (dashed lines) for 5 randomly selected runs (each color is associated with a specific run), for each ruggedness level $K$. The experiments are conducted for different problem sizes, and the SHD curves (each one with a different convergence speed, i.e. a different number of iterations until the best solution is found) have been fitted with polynomials.
The first observation we can make is that BOA converges faster for small values of $K$. The explanation of this behaviour is straightforward: these instances are easier and therefore require less computational effort to solve. More importantly, we can see that significant improvements in the objective values do not necessarily correspond to significant changes in the PGM. This is our first indicator that adjusting the PGM at each iteration might not be the best strategy to adopt.
Looking at the patterns in Figure 2, one would be tempted to use the Boltzmann criterion to decide whether to adjust the PGM or resample from the previous one. A good argument not to use the Boltzmann criterion is that the patterns in Figure 2 do not necessarily embed a Boltzmann distribution. In fact, we can identify a small, but significant, peak towards the end of the evolutionary process. This means that the latest generated PGMs are different from their predecessors. In other terms, the algorithm tends to diversify the search process before convergence by sampling from new probability distributions.
4.2 The proposed algorithm
Our proposed approach is summarised in Algorithm 2. It follows almost the same steps performed by BOA (Algorithm 1); the exceptions are the use of a probability vector in line 3 and the decision rule in line 8.
In line 3, FBOA creates the PGM adjustment probability vector, which in our experimental study corresponds to the aggregated normalised SHD vector from Section 3.2. The PGM adjustment at a given iteration is then performed with the corresponding probability, as shown in line 8. We will refer to this algorithm as FBOA, which stands for Fast Bayesian Optimisation Algorithm.
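The decision rule can be sketched as follows; the pattern vector shown is purely hypothetical:

```python
import random

def should_adjust(iteration, prob_vector, rng):
    """FBOA-style rule: adjust the PGM at this iteration with the probability
    stored at the matching position of the aggregated SHD pattern vector.
    Iterations beyond the vector length reuse the last entry."""
    idx = min(iteration, len(prob_vector) - 1)
    return rng.random() < prob_vector[idx]

pattern = [1.0, 0.8, 0.4, 0.2, 0.1, 0.3]  # hypothetical normalised SHD* values
rng = random.Random(3)
decisions = [should_adjust(t, pattern, rng) for t in range(6)]
print(decisions)
```

When the decision is `False`, the algorithm keeps the previous PGM and simply resamples from it, skipping the costly structure learning step.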
5 Experiments and Results
In this section, we are interested in assessing whether FBOA (the algorithm presented in Section 4.2) obtains performance similar to that of BOA. In particular, we investigate the estimated runtime that FBOA and BOA need to identify the optimal solution on particular NK-landscape instances.
For both algorithms, we consider the population size and sample size . The addressed NK-landscapes have the problem size and the epistatic degree (as applicable). In order to enumerate the solution space exhaustively for each , we used the largest value of that can still be analysed with reasonable computational resources. A set of different landscapes are independently generated at random for each and . Unless a near-optimal solution is found, a maximum number of evaluations is used as a stopping criterion. Finally, each algorithm is executed times per instance.
5.1 Comparison of the optimisation results of BOA and FBOA
First, in Table 1, we report the average gaps between the optimal objective values (obtained through enumeration) and the best objective values found by BOA and FBOA on average for each combination of $N$ and $K$. The p-values in the last column of Table 1 have been obtained using the Friedman test, as the results are not normally distributed according to the Shapiro-Wilk normality test. The Friedman test has been executed at a fixed confidence level, indicating statistically significant differences whenever the p-value falls below the corresponding threshold. According to Table 1, there are no statistically significant differences between BOA and FBOA for any instance. It means that BOA and FBOA performances cannot be differentiated based on the average gaps for all the considered combinations of $N$ and $K$.
Next, Table 2 presents the average success rates for both BOA and FBOA for multiple $N$ and $K$ settings. The values represent averages of the number of successful runs over the total numbers of landscapes and runs. The last row of each $N$ reports the average success rate over $K$.
We apply the McNemar Chi-Square statistical test to compare the BOA and FBOA approaches. The test is useful to show the difference between paired proportions and can determine whether there is marginal homogeneity between the two approaches. There are a few statistically significant differences between the success rates of the two approaches (confidence level of 99%), which are highlighted. However, overall, the success rates are comparable.
We can observe from Table 2 that there is no statistically significant difference for 26 of the 30 NK-landscape configurations; the four highlighted exceptions show some advantage for BOA. By considering the average over $K$ for each $N$ (last row of each $N$), there is no statistically significant difference between the considered approaches. It means that BOA and FBOA performances cannot be differentiated regarding the success rate for all the considered combinations of $N$ and $K$.
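For reference, the McNemar chi-square statistic (with continuity correction) only needs the two discordant counts; the numbers below are hypothetical:

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square with continuity correction; b and c count the
    paired trials where exactly one of the two algorithms succeeded."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical discordant counts: BOA-only successes vs FBOA-only successes.
stat = mcnemar_chi2(12, 5)
print(stat)  # compare with 6.635, the chi-square(1) critical value at the 1% level
```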
5.2 Comparison of the estimated runtimes of BOA and FBOA
In this section, we are interested in comparing the estimated runtimes of BOA and FBOA for different values of $K$ (ruggedness). A correlation analysis is performed between $K$ and the estimated runtime given by Equation (7), and a linear regression model is built according to Equation (8) using $K$ as the explanatory variable, with a usual error term. A log-transformation is applied to both variables to better approximate linearity. The accuracy of the linear regression model is measured by the coefficient of determination, which ranges from 0 to 1.
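This analysis can be sketched with an ordinary least-squares fit in log-log space; the data below are synthetic:

```python
from math import log

def loglog_fit(k_values, runtimes):
    """Least-squares fit of log(runtime) = b0 + b1*log(K); returns the
    slope b1, intercept b0 and coefficient of determination R^2."""
    xs = [log(k) for k in k_values]
    ys = [log(t) for t in runtimes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return b1, b0, 1 - ss_res / ss_tot

# Synthetic data: runtime growing polynomially in K, hence linear in log-log.
k = [2, 4, 6, 8, 10, 12]
t = [50.0 * v ** 1.7 for v in k]
slope, intercept, r2 = loglog_fit(k, t)
print(round(slope, 3), round(r2, 6))  # slope recovers the exponent 1.7
```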
Table 3 shows the coefficients of determination for BOA and FBOA. The average distances between the respective linear regression curves are not log-transformed. According to Table 3, well-adjusted linear regressions are obtained in almost all cases. For the best-adjusted regression model (greatest coefficient of determination for each algorithm), FBOA is several times faster than BOA. Additionally, according to the average distance between the regression curves, FBOA is about three times faster than BOA.
Figure 3 shows both scatter plots and linear regression curves for all $N$ in a log-log scale. Each combination of $N$ and $K$ has values corresponding to ten different landscapes, and each landscape has its corresponding runtime estimate given by Equation (7). According to the regression curves in Figure 3, FBOA is always faster than BOA. It is worth noticing that the runtime advantage of FBOA increases for high values of $K$, as the slopes of the regression curves of FBOA are always smaller than those of BOA.
The results shown in Figure 3 and Table 3 might indicate that the computational effort saved with PGM adjustments could be beneficial as the problem difficulty increases (by increasing and ). We do not claim that FBOA guarantees the best tradeoff between optimization results and runtime; this requires further investigation. However, the experiments show clearly that adjusting the PGM less often than what BOA suggests yields a significant runtime improvement without noticing any statistically significant differences between gaps obtained from BOA and FBOA to the optimal solutions.
6 Conclusions and future directions
In this study, we investigated similarity patterns between consecutive Bayesian networks using the Structural Hamming Distance (SHD) metric, i.e. a proxy measuring the similarity between two networks, over the evolutionary process of BOA. The experiments clearly show patterns across a wide range of NK-landscapes in which the SHD values decrease during BOA's runs, implying that generating a new Bayesian network at every iteration of BOA might not be necessary, except for an increase again towards the end that is necessary for exploitation.
Based on this observation, we proposed a faster alternative called FBOA, which conducts the adjustment of the PGM with a probability that is a function of the FBOA iteration. Furthermore, we tested the performance of FBOA in terms of its solution quality and computational burden. The experiments demonstrate that FBOA provides a solution quality comparable to that of BOA while dramatically saving computation time.
To further improve the performance of FBOA, one possibility is to incorporate objective improvement information into its decision of adjusting the PGM or resampling from the previous one. For example, we can employ self-adaptive methods that (1) enforce the generation of a new PGM or (2) increase the probability of generating a new PGM, if the best found objective value is not improved by sampling from the current PGM. Alternatively, the probability of generating a new PGM can be represented as a function of the iteration number and the improvements in the objective value. At a given iteration, the likelihood of resampling instead of generating a new PGM is increased when the objective value improves.
Although we have proposed an alternative implementation of BOA, this work is mainly to show that the frequency of adjusting the PGM can be tuned to save computational time. The subject requires more investigation in the future from both theoretical and empirical perspectives.
-  Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
-  Yu Cheng, Ilias Diakonikolas, Daniel Kane, and Alistair Stewart. Robust learning of fixed-structure bayesian networks. In NeurIPS, pages 10304–10316, 2018.
-  Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
-  H. Mühlenbein and G. Paab. From recombination of genes to the estimation of distributions I. Binary parameters. In Parallel Problem Solving from Nature, PPSN IV - Lecture Notes in Computer Science 1411, pages 178–187, London, UK, 1996. Springer-Verlag.
-  P. Larrañaga and J. A. Lozano. Estimation of distribution algorithms: A new tool for evolutionary computation, volume 2. Springer, Netherlands, 2002.
-  Ramon Etxeberria and Pedro Larrañaga. Global optimization using Bayesian networks. In Proceedings of the Second Symposium on Artificial Intelligence, CIMAF'99, pages 332–339, Havana, Cuba, 1999. Editorial Academia.
-  Martin Pelikan, David E. Goldberg, and Erick Cantú-Paz. BOA: The Bayesian optimization algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference, volume I of GECCO’99, pages 525–532, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
-  D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, Cambridge, MA, 2009.
-  D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
-  G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.
-  Ioannis Tsamardinos, Constantin F Aliferis, Alexander R Statnikov, and Er Statnikov. Algorithms for Large Scale Markov Blanket Discovery. In FLAIRS conference, volume 2, pages 376–380, St. Augustine, Florida, USA, 2003. AAAI Press.
-  Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.
-  Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D Koutsoukos. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, 11(Jan):171–234, 2010.
-  Pieter Abbeel, Daphne Koller, and Andrew Y Ng. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7(Aug):1743–1788, 2006.
-  Luis M De Campos, Juan M Fernández-Luna, Juan F Huete, and Miguel A Rueda-Morales. Combining content-based and collaborative recommendations: A hybrid approach based on bayesian networks. International Journal of Approximate Reasoning, 51(7):785–799, 2010.
-  Jim Q Smith and Alireza Daneshkhah. On the robustness of bayesian networks to learning from non-conjugate sampling. International Journal of Approximate Reasoning, 51(5):558–572, 2010.
-  Anima Anandkumar, Daniel J Hsu, Furong Huang, and Sham M Kakade. Learning mixtures of tree graphical models. In Advances in Neural Information Processing Systems, pages 1052–1060, 2012.
-  Narayana P Santhanam and Martin J Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Trans. Information Theory, 58(7):4117–4134, 2012.
-  Po-Ling Loh and Martin J Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In Advances in Neural Information Processing Systems, pages 2087–2095, 2012.
-  Guy Bresler, David Gamarnik, and Devavrat Shah. Structure learning of antiferromagnetic ising models. In Advances in Neural Information Processing Systems, pages 2852–2860, 2014.
-  Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), pages 771–782. ACM, 2015.
-  Stuart A Kauffman. The origins of order: Self-organization and selection in evolution. Oxford University Press, USA, 1993.
-  Roberto Santana, Alexander Mendiburu, and Jose A Lozano. Evolving MNK-landscapes with structural constraints. In IEEE Congress on Evolutionary Computation, CEC’15, pages 1364–1371, Sendai, Japan, 2015. IEEE.
-  Max Henrion. Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Machine Intelligence and Pattern Recognition, volume 5, pages 149–163. Elsevier, 1988.
-  Endika Bengoetxea. Inexact Graph Matching Using Estimation of Distribution Algorithms. Phd thesis, University of the Basque Country, Basque Country, 2002.
-  Changhe Yuan and Brandon Malone. Learning Optimal Bayesian Networks: A Shortest Path Perspective. Journal of Artificial Intelligence Research, 48(1):23–65, 2013.
-  J. Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Mateo, CA, 1988.
-  Martin Pelikan. Analysis of estimation of distribution algorithms and genetic algorithms on NK landscapes. In Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, GECCO'08, pages 1033–1040, Atlanta, Georgia, 2008. ACM.
-  Arnaud Liefooghe, Sébastien Verel, Fabio Daolio, Hernan Aguirre, and Kiyoshi Tanaka. A feature-based performance analysis in evolutionary multiobjective optimization. In International Conference on Evolutionary Multi-Criterion Optimization, pages 95–109, Guimaraes, Portugal, 2015. Springer.
-  Marcella S.R. Martins, Mohamed El Yafrani, Roberto Santana, Myriam R.B.S. Delgado, Ricardo Lüders, and Belaïd Ahiod. On the performance of multi-objective estimation of distribution algorithms for combinatorial problems. In IEEE Conference on Evolutionary Computation, CEC’18, pages 1–8 in arXiv:1806.09935, 2018.
-  Fabio Daolio, Arnaud Liefooghe, Sébastien Verel, Hernán Aguirre, and Kiyoshi Tanaka. Global vs local search on multi-objective NK-landscapes: contrasting the impact of problem features. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO’15, pages 369–376. ACM, 2015.
-  Sidney Siegel and N. J. Castellan. The Friedman two-way analysis of variance by ranks. In Nonparametric Statistics for the Behavioral Sciences, pages 174–184, 1988.
-  W.J. Conover. Practical Nonparametric Statistics. Wiley, New York, third edition, 1999.
-  Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.