In Bayesian statistics, it is a common problem to collect and compute random samples from a probability distribution. Markov Chain Monte Carlo (MCMC) is an intensive technique commonly used to address this problem when direct sampling is often arduous or impossible. MCMC using Bayesian inference is often used to solve problems in biology, forensics , education , and chemistry 
, among other areas making it one of the most widely used algorithms when a collection of samples from a probability distribution is needed. Monte Carlo applications are generally considered embarrassingly parallel since each chain can run independently on two or more independent machines or cores. Despite that, the main problem is that each chain is not embarrassingly parallel, and when the feature space and the proposal are computationally expensive, we can not do much to improve the running time and get results faster. When we have to handle huge state-spaces and complex compound states, it takes significant time for an MCMC simulation to converge on an adequate model not only in terms of the number of iterations required but also the complexity of the calculations occurring in each iteration(such as searching for the best features and tree shape of a decision tree). For example, running an MCMC on a single chain Decision tree for a dataset ofdatapoints and features took upwards of 6 hours to converge when run on a Intel Core i7-10875H. In , an approach aiming to parallelise a single chain is presented, and the improvement achieved is at its best 2.2 times faster. The functionality of this kind of solution is therefore limited as in real-time, and life applications run time is critical. The work presented in this paper aims to find methods to significantly reduce the MCMC Decision tree’s runtime by emphasising on the implementation of MCMC rather than the statistical algorithm itself. We aim to reduce significantly and up to an order of magnitude the run time of the MCMC Decision Tree on a single laptop or personal computer which is going to make the algorithm widely applicable and suitable for non tecinacl users. The remainder of this paper is organised as follows. Section 2 explains the MCMC in General and the Most Recent Work. Section 3 presents the MCMC in Decision trees. Our method is outlined in section 4, with the possible theoretical improvements. We introduce the case study in which we applied our method and reviewed results in section 5. Section 6 concludes the paper.
2 Markov Chain Monte Carlo in General and Most Recent Work
. Given a partition of the state vector into components, i.e.,, and that we wish to update the component, the Metropolis-Hastings update proceeds as follows. We first have a density for producing candidate observation , such that , which is denoted by . Given the chains ergodic condition, the definition of is arbitrary, and it has a stationary distribution which is selected so that the observations may be generated relatively easily. After the new state generation from density , the new state is accepted or rejected using the Rejections Sampling principle with acceptance probability given by equation 1. If the proposed state is rejected, the chain remains in the current state.
It is worth mentioning that acceptance probability in this form is not unique, considering there are many acceptance functions that supplies a chain with the required properties. Nevertheless, Peskun(1973)  proved that MH is the optimal one where the proper states are rejected least often, which maximises the statistical efficiency meaning that more samples are collected with fewer iterations.
On a Markov process, the next step depends on the current state, which makes it hard for a single Markov chain to be processed contemporaneously by several processing elements. Byrd . proposed a method to parallelise a single Markov chain(Multithreading on SMP Architectures), where we consider backup move “B” in a separate thread of execution as it is not possible to determine whether move “A” will be accepted. If “A” is accepted, the backup move ‘B’ - whether accepted or rejected - must be discarded as it was based upon a now supplanted chain state. If “A” is rejected, control will pass to “B”, saving much of the real-time spent considering “A” had “A” and “B” been evaluated sequentially. Of course, we may have as many concurrent threads as desired.
At this point, it is worth mentioning that the single chain parallelisation can become quickly problematic as the efficiency of the parallelisation is not guaranteed, especially for computationally cheap proposal distributions. Also, we need to consider that nowadays, computers make serial computations much faster than in 2008, when the single parallelisable chain was proposed.
Another way of making faster MCMC applications is to reduce the convergence rate by requiring fewer iterations. Metropolis-Coupled MCMC() utilised multiple MCMC  chains to run at the same time, while one chain is treated as the "cold" where its parameters are set to normal while the other chains are treated as "hot", which are expected to accept the proposed moves. The space will be explored faster through the "hot" chains than the "cold" as they are more possible to make disadvantageous transitions and not to remain at near-optimal solutions. The speedup increased when more chains and cores were added.
Our work is focused on achieving a faster execution time of the MCMC algorithm on Decision trees through multiprocessor architectures. We aim to reduce the number of iterations while the number of samples collected is not affected. Multi-threading on SMP Architectures and differs from our work as the former targets rejected moves as a place for optimisation, and the latter requires communication between the chains. Moreover, the aims are different as expands the combination of the chain, enhancing the possibilities of discovering different solutions and assisting avoid the simulation getting stuck in local optima.
2.1 Probabilistic trees packages and level of parallelism
Most of the existing probabilistic tree packages are only supported by the R programming language.
BART  software included in the CRAN package 111https://cran.r-project.org/web/packages/BART/index.html supports multi threading based on OpenMP, where there are numerous exceptions for operating systems, so it is difficult to generalise. Generally, Microsoft Windows lacks OpenMP detection since the GNU autotools do not natively exist on this platform and Apple macOS since the standard Xcode toolkit is also not provided. The parallel package provides multi-threading via forking, only available in Unix. BART under CRAN uses parallelisation for the predict function and running concurrent chains.
BartMachine 222https://cran.r-project.org/web/packages/bartMachine/index.html, which is written in Java and its interface is provided by rJava package, which requires Java Development Kit(JDK), provides multi-threading features similar to BART. BartMachine is recommended only for those users who have a firm grounding in the java language and its tools to upgrade the package and get the best performance out of it. Similar to BART, its parallelisation is based on running concurrent chains.
The rest of the available packages, BayesTree333https://cran.r-project.org/web/packages/BayesTree/BayesTree.pdf,dbarts444https://cran.r-project.org/web/packages/dbarts/index.html,Bartpy555https://pypi.org/project/bartpy/,XBART666https://jingyuhe.com/xbart.html and imptree777https://cran.r-project.org/web/packages/imptree/index.html does not support any kind of parallelisation.
Concurrent chains can not solve the problem of long hours of execution time. For example, if a single chain needs 50 hours to execute, 5 chains will still need 50 hours if run concurrently. In contrast, in our case, a chain that serially needs 50 hours now takes approximately 2 hours for each chain. Moreover, we can to run concurrent chains where each chain is parallelised. If our implementation is compared to a package like BartMachine and BART, the runtime improvement we achieved is around 18 times faster, and if we compare it with a package that does not offer any parallelisation like most of the existing ones, the run time improvement for 5 chains is around 85 times faster.
3 Markov Chain Monte Carlo in Decision Tree
A decision tree typically starts with a root node, which branches into possible outcomes. Each of those outcomes leads to additional decision nodes, which branch off into other possibilities ending up in leaf nodes. This gives it a treelike shape.
Our model describes the conditional distribution of given , where is a vector of predictors . The main components of the includes the depth of the tree, the features and the thresholds for each node where , and the possibilities for each leaf node. If lies in the region corresponding to the terminal node, then has distribution , where f represents a parametric family indexed by . The model is called a probabilistic classification tree, according to the quantitative response y.
As Decision Trees are identified by
, a Bayesian analysis of the problem proceeds by specifying a prior probability distribution. Because
indexes the parametric model for each, it will usually be convenient to use the relationship
In our case it is possible to analytically obtain eq 2 and calculate the posterior of as follows:
Equation 3 describes the product of the probabilities of every data point(
) classified correctly given the datapoints features(), the tree structure(), and the features/thresholds() on each node on the tree. Equation 4 describes the product of possibilities of picking the specific feature() and threshold() on every node given the tree structure(). Equation 5 is used as the prior for tree . This formula is recommended by  and three aspects specify it: the probability that a node at depth is nonterminal, the parameter which controls how likely a node would split, with larger values increasing the probability of split, and the parameter which controls the number of terminal nodes, with larger values of reducing the number of terminal nodes. This feature is crucial as this is the penalizing feature of our probabilistic tree which prevents it from overfitting and allowing convergence to the target function , and it puts higher probability on "bushy" trees, those whose terminal nodes do not vary too much in depth.
An exhaustive evaluation of equation 2 over all trees
will not be feasible, except in trivially small problems, because of the sheer number of possible trees, which makes it nearly impossible to determine precisely which trees have the largest posterior probability.
Despite these limitations, Metropolis-Hastings algorithms can still be used to explore the posterior. Such algorithms simulate a Markov chain sequence of trees such as:
which are converging in distribution to the posterior in equtaion 2.
Because such a simulated sequence will tend to gravitate toward regions of higher posterior probability, the simulation can be used to search for high-posterior probability trees stochastically. We next describe the details of such algorithms and their implementation.
3.1 Specification of the Metropolis-Hastings Search Algorithm on Decision Trees
The Metropolis-Hastings(MH) algorithm for simulating the Markov chain in Decision trees (see equation 7) is defined as follows. Starting with an initial tree , iteratively simulate the transitions from to by these two steps:
Generate a candidate value with probability distribution .
Set with probability
Otherwise set .
To implement the algorithm, we need to specify the transition kernel . We consider kernels , which generate from by randomly choosing among four steps:
Grow(G) : add a new and choose uniformly a and a
Prune(P) : choose uniformly a to become a leaf
Change(C) = choose uniformly a and change randomly a and a
Swap(S) = choose uniformly two and swap their and
The rules are chosen by picking a number uniformly between 0 and 1 and each action have its own interval. For example,
The probabilities (see equation 8) represent the sum of the probabilities of every accepted forward move. P(G), p(P), p(C), p(S) are set by the user who chooses how often each move wants to be proposed.
Equation 9 can be described as the possibility of proposing the grow move including the probability of choosing the specific feature(), threshold() and leaf node() to grow. P(G) is multiplied by the number of features(), the unique number of datapoints() and the number of leaf nodes(). For example, given a dataset with 100 unique datapoints(), 5 features(), a tree structure() with 7 leaf nodes() and a eq9 will be
Equation 10 is the possibility of proposing the prune move, where p(P) is multiplied by the number of decision nodes subtracting one( we are not allowed to prune the root node). In practise, given a and a tree structure() with 7 decision nodes() eq10 will be
Equation 11 is the possibility of proposing the change move where p(C) is multiplied by the number of decision nodes(), the number of features(), and the number of unique datapoints(). For example, given a dataset with with 100 unique datapoints(), 5 features(), a tree structure() with 12 decision nodes() and a eq9 will be
Equation 12 is the possibility of proposing the swap move where p(S) is multiplied by the number of paired decision nodes().In practise, given a tree structure() with 12 decision nodes() and a eq12 will be
Transition kernel(see equation 13) yields a reversible Markov chain, as every step from to has a counterpart that can move from to .
Assume a Markov chain, starting from its unique invariant distribution . Now, take into consideration that for every sample have the same joint probability mass function(p.m.f) as their time reversal , so as we can call the Markov chain reversible, as well as its invariant distribution is reversible. This can be explained as a simulation of a reversible chain that looks the same if it runs backward.
The first thing we have to look for is if the Markov chain starts at , and it can be checked by equation14
Equation14 is only dependent on and where this expression for reversibility must be the same as the forward transition probability . If, both original and the reverse Markov chains have the same transition probabilities, then their p.f.m must be the same as well.
An example for our probabilistic tree is the following:
Assume a tree structure() with 5 leaf nodes() and 11 decision nodes() sampling from a given dataset with 4 features() and 100 unique datapoints() for each feature.
If for example the forward proposal() = () with p(C) = 0.2, we end up with the following equation: . At the same time the reverse proposal(going from the current position to the previous) () equation looks exactly the same as the forward proposal. Given the above practical example we have strengthen our proof which shows that () = (), which shows in practise the reverse transition kernel nature of our model.
4 Parallelising a single Decision Tree MCMC Chain
Given a Decision’s Tree MCMC chain with iterations, we propose a method that utilises number of cores aiming to enhance the running time of a single chain by at least an order of magnitude. As stated in Section1 and Section3, at each iteration, a new sample is drawn from the proposal distribution. Our method requires sampling from number of cores, number of samples in parallel. We then accept the sample with the greatest and repeat the same method until the Markov Chain converges to a stationary distribution. In our method, we check the Markov chain convergence when the F1-score fluctuates less than for at least 100 iterations. Once the Chain has converged, we proceed to the second phase of our method. We now keep producing samples using cores, but we can now collect more than one sample which satisfies . ( is a random uniform number), otherwise we collect From this point, we will propose new samples from the sample with the greatest until we are happy with the number of samples collected. Using this method, we can collect the same number of samples and explore the feature space as effectively as the serial implementation, but 18 times faster using an average laptop or personal computer.
Our algorithm reduces the number of iterations and explores the feature space faster as we use more cores. This provides us with a significant run time improvement up to 18 times faster when the feature space is big and the proposal is expensive. The following sections will evaluate the running time improvement and the quality of the samples produced.
4.0.1 Theoretical gains
Using cores simultaneously, the programme cycle consists of repeated "steps," each performing the equivalent of between 1 and iterations. We need to calculate the number of iterations based on the acceptance rate to produce the same number of samples() when we increase . The moves are considered in parallel, where they are accepted or rejected. Given that the average probability of a single arbitrary move being rejected is , the probability of the in every single concurrent core is . This step continues for iterations where equations 15, 16, and 17 show the iterations needed, run time, and speedup improvement, respectively, given a time() in minutes for each iteration. Theoretical speedup (see figure1) were plotted for varying cores.
For example, given a single MCMC Decision Tree chain running for iterations and an acceptance rate of , after burn-in iterations, we end up with samples.
For the parallel MCMC chain, given the same settings as the serial one, with a burn-in period and 25 cores, we will collect the same number of samples with 500 iterations. This provides us with a 25 times faster execution time. Algorithm 2 indicates the part implemented on a parallel environment, and figure 3 the maximum theoretical benefits from utilising our method. Considering the communication overhead, the parts of the algorithm that are not parallelised, and the fact that the cores does not receive constant utilisation, in practise the speedups of this order are not achievable. Therefore, we will test it in practise and find out how it performs on real-life scenarios respect to the accuracy and the runtime improvement
5.1 Quality of the samples between serial and parallel implementation
We have used the Wine dataset from scikit learn datasets888https://scikit-learn.org/stable/datasets/toydataset.html repository as well as Pima Indians Diabetes and Dermatology from UCI machine learning repository 999https://archive.ics.uci.edu/ml/index.php
, which are publicly available, to examine the quality of the samples on several testing hypotheses, including the different number of cores per iteration given the average F1-Score. Precision and Recall were also calculated for more depth and detailed insights about the performance and quality of the samples. The results, including the F1-Score, Precision, Recall, and Accuracy(see table2, table3 and table4) produced through 25-Fold Cross-Validation, ensure that every observation from the original dataset has the possibility of appearing in the training and test sets and also reduce any statistical error. Before any performance comparison, we need to examine whether the samples produced for each test case(25 cores and 40 cores) have any statistical difference from the serial implementation. We next examine if extracted samples by utilising 25 and 40 cores are representative of the family of the data they come. We use as ground truth data the F1-scores on each sample collected on every fold of the serial implementation, and we compare them with the corresponding collected samples from the other two test cases. In order to check any statistical difference between the samples, we performed the two-sample t-test for unpaired data, which is defined as follows:
is used to calculate the t-Test statistic equation whereand are the sample sizes, and are the sample means, and and 19 holds, which is the critical value of the distribution with degrees of freedom.
For our first dataset(Wine), we first examine the serial implementation with the parallel using 25 cores. The absolute value of the t-Test, 0.62, is less than the critical value of 1.964, so we prove the null hypothesis and conclude that the samples drawn by using 25 cores have not any statistical difference at the 0.05 significance level. We then compare the serial implementation with the parallel using 40 cores. In this case, the absolute value of the t-Test, 0.63, is less than the critical value of 1.964, so we prove the null hypothesis and conclude that the samples drawn by using 40 cores have not any statistical difference at the 0.05 significance level. We also examine the parallel implementations between them(using 25 and 40 cores accordingly). In this case, the absolute value of t-Test, 0.64, is less than the critical value of 1.964, so we reject the null hypothesis and conclude that the samples drawn using 40 cores(in comparison with the samples drawn with 25 cores) have not a statistical difference at the 0.05 significance level. The results for the rest of the datasets are presented explicitly in table 1
|Datasets||1 vs 25 cores||1 vs 40 cores||25 vs 40 cores||Critical T value|
|Pima Indians Diabetes||0.51||0.63||0.63||1.97|
T-test proves that if we use up to 40 cores for sampling(rarely laptops and personal computers have more than 40cores), the quality of the samples are the same, ending up with statistically same samples as the serial implementation.Table 1, table 2, and table 3 shows that when we sample in parallel using up to 40 cores, the accuracy and the F1-score remain on the same levels as the serial implementation. Tables 1, 2, 3, 4 indicate that a single chain on MCMC Decision Trees can not be an embarrassingly parallel algorithm as we can only improve the running time of a single chain by utilising a specific number of cores. The running time improvement we achieved( faster) is the maximum run-time enhancement we can achieve on an MCMC decision tree to maintain the high metrics produced by the serial implementation. If we try to extract up to 40 samples per iteration, it is highly probable to get samples that does not affect the final results negatively. According to our results, the maximum number of cores can that be used is 40. Furthermore, Precision, Recall, and F1-score metrics indicate that no overfit is observed even when more than 3 labels exist, proving the samples’ quality even in multi class classification problems.
|1 core||20cores||40cores||1 core||20cores||40cores||1 core||20cores||40cores|
|1 core||20cores||40cores||1 core||20cores||40cores||1 core||20cores||40cores|
|1 core||20cores||40cores||1 core||20cores||40cores||1 core||20cores||40cores|
5.2 Practical gains
Figure 1 presents the theoretical and practical speedup achieved given the number of cores used demonstrating a remarkable runtime improvement, especially when the feature space is ample and the proposal is expensive. Figure 1 demonstrates that in practise, theoretical speedups of this order can not be achieved for various reasons, including communications overhead, and as well as the cores do not receive constant utilisation. The practical improvement achieved used the novel method we proposed, speeds up the process up to 18 times depending on the number of cores the user may choose to utilise. Moreover, figure 1 demonstrates that even if we use more than 25 cores, the speedup achieved is the same, because of the architecture of the cores. When we run on a local machine(laptop or personal computer) a medium size dataset(500,000 entries), the memory is not enough to run every single parallel tree on a separate core. Given that, we have to wait for a core to finish the task, in order to allocate its memory to another core. Given that, we end up that it is not always beneficial to use more cores, as faster execution time and speedup is not guaranteed. Scaling up to 25 cores is the ideal, having in mind that any number above that, might not benefit the run time. To the best of our knowledge it is the first time where a single chain in general, specifically on decision trees, is parallelised with our proposed method.
Our novel proposed method for parallelising a single MCMC Decision tree chain takes advantage of multicore machines without altering any properties of the Markov Chain. Moreover, our method can be easily and safely used in conjunction with other parallelisation strategies, i.e., where each parallel chain can be processed on a separate machine, each being sped up using our method.
Furthermore, our approach can be applied to any MCMC Decision tree algorithm which needs to process hundreds of thousands of data given an expensive proposal where an execution time of 18 times faster can be easily achieved. As multicore technology improves, CPUs with multiple processing cores will provide speed-ups closer to the theoretical limit calculated. By taking advantage of the improvements in modern processor designs our method can help make the use of MCMC Decision tree-based solutions more productive and increasingly applicable to a broader range of applications. Future work includes on expanding our method on a High Performance Computer(HPC), servers, and cloud which are build for this kind of tasks to compare and demonstrate possible runtime improvements, and discover the merits of such technologies. Moreover, we plan to implement more MCMC single chain parallelisation techniques, including data partitioning, and conduct experiments with various size and shapes datasets, to find the most effective technique, given the type and shape of the dataset.
-  (2004) Parallel metropolis coupled markov chain monte carlo for bayesian phylogenetic inference. Bioinformatics. Cited by: §2.
-  (2008) Reducing the run-time of mcmc programs by multithreading on smp architectures. In 2008 IEEE International Symposium on Parallel and Distributed Processing, Cited by: §2.
-  (2008) Speculative moves: multithreading markov chain monte carlo programs. High-Performance Medical Image Computing and Computer Aided Intervention (HP-MICCAI). Cited by: §1.
-  (2010) BART: bayesian additive regression trees. The Annals of Applied Statistics. Cited by: §2.1, §3.
-  (2021) Early predictor for student success based on behavioural and demographical indicators. In International Conference on Intelligent Tutoring Systems, Cited by: §1.
-  (2021) Quantification of uncertainties in the assessment of an atmospheric release source applied to the autumn 2017. Atmospheric Chemistry and Physics. Cited by: §1.
-  (1978) 3—two-sample t-test. Fixed effects analysis of variance. Probability and mathematical statistics: a series of monographs and textbooks. Cited by: §5.1.
-  (1970) Monte carlo sampling methods using markov chains and their applications. Cited by: §2.
-  (1953) Equation of state calculations by fast computing machines. The journal of chemical physics. Cited by: §2.
-  (1973) Optimum monte-carlo sampling using markov chains. Biometrika 60 (3), pp. 607–612. Cited by: §2.
On theory for bart.
The 22nd International Conference on Artificial Intelligence and Statistics, Cited by: §3.
-  (2014) Interpreting forensic dna profiling evidence without specifying the number of contributors. Forensic Science International: Genetics. Cited by: §1.
Mcmc techniques for parameter estimation of ode based models in systems biology. Frontiers in Applied Mathematics and Statistics. Cited by: §1.