1 Introduction
Methods from both machine learning and statistical science have contributed to advances in probabilistic graphical models, and specifically in learning Bayesian Networks (BNs). The structure of a BN is typically represented by a Directed Acyclic Graph (DAG), containing nodes that represent variables and edges that represent conditional dependencies between them. The process of learning BNs from data is separated into two tasks: an unsupervised structure learning approach is first used to determine the graph of the BN, followed by parameter estimation of the conditional distributions given the learnt structure.
The task of structure learning is known to be both difficult and computationally expensive, and it is NP-hard even for small networks containing just 10 variables. These challenges are amplified when noise is present in the data, or when the data are incomplete (Constantinou et al., 2021). One such example is when the input data do not capture all the relevant variables, often referred to as the problem of learning under the assumption of causal insufficiency (Spirtes et al., 2001). A special case of a latent variable is the latent confounder, which represents a missing common cause of two or more observed variables and tends to lead to spurious edges between its children.
Structure learning algorithms that assume causal insufficiency generate ancestral graphs that represent an extension of a DAG. These algorithms generate either a Maximal Ancestral Graph (MAG), which contains bidirected edges indicating confounding and directed edges indicating causal or ancestral relationships, or a Partial Ancestral Graph (PAG), which represents the Markov equivalence class of a set of MAGs, in the same way a Completed Partial DAG (CPDAG) represents the Markov equivalence class of a set of DAGs. As we later explain in subsection 2.1, a PAG also captures the possible existence of multiple latent confounders. Well-established algorithms that generate PAGs tend to be constraint-based or hybrid, and largely rely on FCI by Spirtes et al. (2001). Some variants of FCI include the constraint-based RFCI algorithm by Colombo et al. (2012), cFCI by Ramsey et al. (2012), mFCI by Colombo and Maathuis (2014), and the hybrid GFCI algorithm by Ogarrio et al. (2016), which combines elements of constraint-based and score-based learning.
In this work, we describe two learning strategies that take a PAG as an input, along with observed data, to determine the existence of latent confounders and learn their underlying latent distributions. We propose two algorithms: one that maximises accuracy and another that balances accuracy with computational complexity. The paper is organised as follows: Section 2 provides preliminary information through past related works, Section 3 describes the two algorithms, Section 4 describes the experimental setup, Section 5 presents the results, and we provide our conclusions and directions for future work in Section 6.
2 Preliminaries
2.1 Ancestral Graphs
As discussed in the introduction, a MAG represents a DAG extension that captures information about possible latent confounders. A PAG, which represents a set of Markov equivalent MAGs that entail the same set of Conditional Independencies (CIs) or m-separation criteria, is represented by a tuple ⟨V, E⟩, where V is the set of observed variables and E is the set of edges. The edges in a PAG can be: ↔, →, o→, o–o or –, where ↔ indicates latent confounding, → indicates that all MAGs in the equivalence class contain this directed edge, the circle mark o indicates that the corresponding MAGs may contain either an arrowhead or a tail at that endpoint, and the undirected edge – indicates selection variables. Because we do not deal with selection bias in this paper, we will not be considering ancestral graphs that contain undirected edges. Both PAGs and MAGs are acyclic and do not permit partially directed cycles; a directed edge X → Y implies that X is an ancestor of Y, whereas a bidirected edge X ↔ Y implies that neither X is an ancestor of Y nor Y an ancestor of X (Richardson and Spirtes, 2000).
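To make the edge notation concrete, the sketch below shows one possible way to encode PAG endpoint marks and identify bidirected edges that signal latent confounding. The encoding and all names are our own illustration, not part of any algorithm in this paper.

```python
# Hypothetical encoding of PAG edge marks (illustration only). Each edge
# X *-* Y stores one mark per endpoint: an arrowhead, a tail, or a circle
# (an uncertain mark). An arrowhead at both ends denotes a bidirected edge,
# which signals a possible latent confounder between its endpoints.
ARROW, TAIL, CIRCLE = "arrow", "tail", "circle"

def edge_kind(mark_x, mark_y):
    """Classify a PAG edge from its two endpoint marks."""
    if mark_x == ARROW and mark_y == ARROW:
        return "bidirected"          # X <-> Y: latent confounding indicated
    if mark_x == TAIL and mark_y == ARROW:
        return "directed"            # X -> Y: present in every MAG in the class
    if mark_x == CIRCLE and mark_y == ARROW:
        return "partially_directed"  # X o-> Y: invariant arrowhead, uncertain tail
    if mark_x == CIRCLE and mark_y == CIRCLE:
        return "non_directed"        # X o-o Y: both endpoint marks uncertain
    return "other"                   # e.g., undirected edges (selection bias)

pag = {("X1", "X2"): (ARROW, ARROW), ("X2", "X3"): (TAIL, ARROW)}
confounded = [e for e, marks in pag.items() if edge_kind(*marks) == "bidirected"]
print(confounded)  # → [('X1', 'X2')]
```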
2.2 Conjugate-exponential family models
We consider conjugate-exponential family models for discrete data. We assume a Dirichlet prior, which serves as the conjugate prior of a multinomial likelihood (Bishop, 2006), so that the posterior distribution is also Dirichlet. We use the empirical Bayes method by Gelman et al. (2003) to determine the prior parameters from data. For density estimation of latent confounders, we assume a Dirichlet prior Dir(α), where α is a hyperparameter set to 1 to give a uniform distribution, and
θ denotes the set of parameters, with r representing the number of states. Since we perform structure learning and density estimation under causal insufficiency, some variables will not be observed in the data, leading to the incomplete-data marginal likelihood p(D | G) of a DAG G.

2.3 Variational Bayesian Expectation-Maximization (VBEM) algorithm
Marginalising out the parameters θ over the latent confounders in p(D | G) makes the task of learning prohibitively expensive and intractable. We address this issue by approximating the distributions of latent variables using the computationally efficient Variational Bayesian Expectation-Maximization (VBEM) algorithm (Beal and Ghahramani, 2003), which enables tractable solutions. The VBEM algorithm combines elements of variational inference (Jordan et al., 1999) and Expectation-Maximisation (EM; Friedman, 2013). It uses an alternating optimisation technique to find a surrogate distribution q from an exponential family (e.g., Gaussian, Dirichlet, multinomial) and optimises it towards the true distribution p. VBEM offers an approximate solution that is guaranteed to monotonically increase the objective score, and scales better to large data than MCMC sampling (Hastings, 1970).

The objective of VBEM is to minimise the discrepancy between the two distributions q and p. It uses the reverse Kullback-Leibler (KL) divergence for this task, which is the standard choice for variational inference, defined as follows:

KL(q ‖ p) = ∫ q(Z, θ) ln [ q(Z, θ) / p(Z, θ | D) ] dZ dθ    (1)
Because the incomplete-data marginal likelihood ln p(D) is intractable to compute, we treat it as a constant. The aim is to minimise KL(q ‖ p), which is equivalent to maximising the Evidence Lower Bound (ELBO), since ln p(D) = ELBO(q) + KL(q ‖ p). Therefore, we can minimise the KL divergence without having to know the true distribution p(Z, θ | D). We can describe the ELBO as the objective function:
ELBO(q) = ∫ q(Z, θ) ln [ p(D, Z, θ) / q(Z, θ) ] dZ dθ    (2)
where q(Z, θ) = q(Z) q(θ) is assumed to be the factorisation of the free distributions q(Z) and q(θ). We maximise the ELBO as a function of both q(Z) and q(θ) as follows (Beal and Ghahramani, 2006):
ELBO(q(Z), q(θ)) = ∫∫ q(Z) q(θ) ln [ p(D, Z | θ) p(θ) / ( q(Z) q(θ) ) ] dZ dθ    (3)
To maximise the ELBO, VBEM alternately calculates q(Z) and q(θ), holding the other fixed at each iteration t. The two steps of each iteration are:
1) VBE step: estimates the posterior distribution q^(t+1)(Z) over the latent confounders, given q^(t)(θ) from the last iteration, by taking the functional derivatives of Equation (3) with respect to each q(Z_i) for i = 1, …, n, where n is the number of latent confounders.
2) VBM step: estimates q^(t+1)(θ), given the posterior distribution q^(t+1)(Z) taken from the VBE step, by taking the functional derivatives of Equation (3) with respect to q(θ).
VBEM iterates over the VBE and VBM steps until the difference in ELBO between successive iterations becomes smaller than a given threshold, indicating convergence. A revised version called pELBO was proposed by Rodriguez-Sanchez et al. (2020) to avoid the equivalent ways of assigning sets of parameters that produce the same distribution (non-identifiability); it is defined as the ELBO minus a penalty term that is a function of the number of states of the latent variables.
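As a minimal numerical illustration of the decomposition that VBEM exploits (with toy numbers, not the paper's model), the following verifies that ln p(D) = ELBO(q) + KL(q ‖ p(Z | D)) for a single binary latent variable and one observed data point:

```python
import math

# Toy check of the ELBO/KL decomposition described above, for a binary
# latent Z and a single observed value x, with known p(Z) and p(x | Z).
# All numbers are invented purely for illustration.
p_z = [0.6, 0.4]              # prior p(Z)
p_x_given_z = [0.9, 0.2]      # likelihood p(x | Z = z) of the observed x

p_joint = [p_z[z] * p_x_given_z[z] for z in (0, 1)]   # p(x, z)
log_evidence = math.log(sum(p_joint))                 # ln p(x)
posterior = [pj / sum(p_joint) for pj in p_joint]     # p(z | x)

def elbo(q):
    return sum(q[z] * (math.log(p_joint[z]) - math.log(q[z])) for z in (0, 1))

def kl(q, p):
    return sum(q[z] * math.log(q[z] / p[z]) for z in (0, 1))

q = [0.5, 0.5]  # an arbitrary free distribution over Z
assert math.isclose(elbo(q) + kl(q, posterior), log_evidence)

# The bound is tight (KL = 0) exactly when q matches the true posterior,
# which is what the VBE step approximates at each iteration:
assert math.isclose(elbo(posterior), log_evidence)
```

Because ln p(D) is constant in q, maximising the ELBO and minimising the reverse KL divergence are the same optimisation, which is why VBEM never needs the intractable evidence itself.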
2.4 Past relevant work
The ELBO was used as the objective function of a neural network in the Variational Autoencoder (VAE) by Kingma and Welling (2013). The VAE for heterogeneous Mixed type data (VAEM) was used by Ma et al. (2020) for density estimation of latent variables in deep generative models. VAE assumes each observed variable has a latent parent, whereas VAEM is an extension of VAE that assumes an additional latent confounder that serves as a parent of all latent variables.

The ELBO score was extended to pELBO by Rodriguez-Sanchez et al. (2020; 2022), and was used as the objective score in the Constrained Incremental Learner (CIL) and Greedy Latent Structure Learner (GLSL) algorithms. CIL learns a tree-structured BN, in which any two nodes are connected by at most one directed path, whereas GLSL learns a DAG-structured BN. Both algorithms start from an empty graph and perform various search operations, including a) adding or removing latent variables as parents of observed variables, b) increasing the number of states of latent variables, and c) edge operations such as adding, removing, or reversing edges, aiming to maximise pELBO. Searching for latent confounders often means iterating over all pairs of observed variables, which can be computationally expensive. Instead, these algorithms offer a strategy that restricts the search to the set of variable pairs with the highest Mutual Information (MI). Empirical results show that GLSL outperforms CIL, but at the expense of high computational complexity.
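The MI-based pair-selection strategy described above can be sketched as follows, using made-up data; the helper names and the data are our own, not taken from the CIL or GLSL implementations:

```python
import math
from itertools import combinations
from collections import Counter

# Sketch (with invented data) of restricting the latent-confounder search
# to the k pairs of observed discrete variables with the highest empirical
# Mutual Information, as described above.

def mutual_information(xs, ys):
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def top_mi_pairs(data, k):
    """data: dict mapping variable name -> list of discrete observations."""
    scores = {(a, b): mutual_information(data[a], data[b])
              for a, b in combinations(sorted(data), 2)}
    return sorted(scores, key=scores.get, reverse=True)[:k]

data = {"A": [0, 0, 1, 1, 0, 1],
        "B": [0, 0, 1, 1, 0, 1],   # identical to A: maximal MI
        "C": [1, 0, 1, 0, 0, 1]}
print(top_mi_pairs(data, 1))        # → [('A', 'B')]
```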
3 Two new algorithms for learning latent confounders
This section describes the two learning strategies we have implemented for latent confounder discovery and density estimation. Subsection 3.1 describes how we use existing algorithms to draw a PAG that is then given as input to the two algorithms we propose, which in turn use the PAG to search over different MAGs and DAGs with parameterised latent confounders. We describe the two algorithms in subsections 3.2 and 3.3 respectively. Both algorithms assume that the input data are discrete, and that the latent confounders have no parents but have at least two children. We further assume a Dirichlet prior over all parameters as described in subsection 2.2, and we use pELBO as the objective function, computed using the VBEM algorithm as described in subsection 2.3.
3.1 Searching for MAGs and DAGs given a PAG input
The FCI algorithm and some of its variants discussed in the introduction represent the state-of-the-art in recovering ancestral graphs under the assumption of causal insufficiency (Kitson et al., 2021). Any of these algorithms can be used to draw a PAG to be given as input to the two algorithms described in subsections 3.2 and 3.3. A set of Markov equivalent MAGs can then be derived from the input PAG. However, because the number of possible latent confounders that can be explored for a given MAG is generally intractable, we shall assume the minimum number of latent confounders that satisfy the m-separation criteria.
Assumption 1: The optimal number of latent confounders is the minimum number of latent confounders that retain the CIs of a given MAG.
Figure 1 presents a simple PAG that contains two bidirected edges, along with a MAG and three DAGs that satisfy the CI statements of the PAG. Converting a MAG into possible DAGs implies that each DAG retains the CIs of that MAG by reducing the criteria of m-separation to d-separation. In this example, the DAG that contains the minimum number of latent confounders, with reference to the MAG in Figure 1b, is shown in Figure 1c. The DAGs in Figures 1d and 1e contain a higher number of latent confounders than the minimum required to satisfy all the CIs of the given MAG. Because the algorithms we describe in subsections 3.2 and 3.3 rely on Assumption 1, they will never explore DAGs that contain a higher number of latent confounders than the minimum required, and so will not visit DAGs such as those shown in Figures 1d and 1e.
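One way to sketch the intuition behind Assumption 1 is as follows. A latent confounder with a set S of children makes every pair in S dependent, so, roughly speaking, S should form a clique in the bidirected subgraph of the MAG; covering the bidirected edges with as few cliques as possible then yields a small number of latents. This is an illustrative heuristic of our own, not the paper's exact conversion procedure:

```python
from itertools import combinations
from collections import defaultdict

# Illustrative heuristic (not the paper's exact procedure): greedily cover
# the bidirected edges of a MAG with cliques of the bidirected subgraph,
# introducing one latent confounder per clique.

def minimal_latent_confounders(bidirected_edges):
    edges = {frozenset(e) for e in bidirected_edges}
    adj = defaultdict(set)
    for x, y in bidirected_edges:
        adj[x].add(y)
        adj[y].add(x)
    covered, latents = set(), []
    for edge in sorted(edges, key=sorted):       # deterministic order
        if edge in covered:
            continue
        clique = set(edge)
        for v in sorted(adj):                    # greedily grow a maximal clique
            if v not in clique and all(v in adj[u] for u in clique):
                clique.add(v)
        covered.update(frozenset(p) for p in combinations(clique, 2))
        latents.append(sorted(clique))
    return [("L%d" % i, c) for i, c in enumerate(latents, 1)]

# X1 <-> X2, X2 <-> X3 and X1 <-> X3 form a clique that a single latent can
# explain, whereas X4 <-> X5 requires its own latent confounder:
print(minimal_latent_confounders(
    [("X1", "X2"), ("X2", "X3"), ("X1", "X3"), ("X4", "X5")]))
# → [('L1', ['X1', 'X2', 'X3']), ('L2', ['X4', 'X5'])]
```

By contrast, a path X1 ↔ X2 ↔ X3 with no X1 ↔ X3 edge needs two latents under this heuristic, since a single common parent of all three would induce a dependence between X1 and X3 that the MAG does not entail.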
3.2 Alg 1: Incremental Latent Confounder search with VBEM (ILCV)
The first algorithm, which we call Incremental Latent Confounder search with VBEM (ILCV), is described in Algorithm 1. It takes a PAG as input (Step 1) and uses the ZML algorithm available in R (Malinsky and Spirtes, 2017) to enumerate all Markov equivalent MAGs of that PAG (Step 3). It then constructs DAGs for each MAG, starting from the MAGs that contain the minimum number of bidirected edges (Step 4). Each latent confounder modelled at Step 4 is assumed to be binary, and the optimal DAG is the one that maximises pELBO using the VBEM algorithm, made available as a Java library by Rodriguez-Sanchez (2021).
At Step 5, Algorithm 1 calls Algorithm 1b to determine the optimal number of states for each latent confounder. This is achieved by iterating over each latent confounder present in the highest-scoring DAG determined at Step 4, and greedily increasing its number of states, one at a time. Algorithm 1b returns a DAG that contains the optimal number of states for each latent confounder, or the maximum permitted number of states if the objective score continues to increase with the number of states. To reduce computational cost, the objective score pELBO is applied to the subgraph that contains the auxiliary latent confounders and their children, since the conditional distributions of the remaining nodes in the BN are unchanged. The final Step 6 generates the final DAG BN and revises the pELBO score.
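The greedy cardinality search of Algorithm 1b can be sketched as follows. The score function here is a hypothetical stand-in for pELBO computed by VBEM on the relevant subgraph; `mock_pelbo` and the state bounds are invented for illustration:

```python
# Sketch of the greedy state-count search described above: starting from a
# minimum cardinality, add one state at a time while the score improves,
# up to a maximum. The score is a stand-in for pELBO, not the real thing.

def optimal_states(score, min_states=2, max_states=10):
    """Greedily increase the number of states while the score improves."""
    best_k, best_score = min_states, score(min_states)
    for k in range(min_states + 1, max_states + 1):
        s = score(k)
        if s <= best_score:
            break                  # score stopped improving: keep best_k
        best_k, best_score = k, s
    return best_k

def mock_pelbo(k):                 # invented score that peaks at k = 3 states
    return -(k - 3) ** 2

print(optimal_states(mock_pelbo))  # → 3
```

If the score never stops improving, the loop runs to `max_states`, mirroring the cap described above.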
3.3 Alg 2: Hill-Climbing Latent Confounder search with VBEM (HCLCV)
Because ILCV (Algorithm 1) is computationally expensive, as we later show in Section 5, one might be interested in a computationally efficient version that minimally decreases the objective score achieved by Algorithm 1. A problem with ILCV is that when the input PAG contains a high number of non-invariant edges o–o or o→, enumerating all possible MAGs can quickly cause memory allocation problems. To address this, we introduce a modified version of ILCV, which we call Hill-Climbing Latent Confounder search with VBEM (HCLCV), that skips the Markov equivalence checks. This means that HCLCV no longer needs to check the CIs of each DAG it visits, which saves enormous computational time. Instead, HCLCV iterates over possible edge orientations as described in Step 4 of Algorithm 2, and continues to follow the incremental search strategy of ILCV in terms of the number of bidirected edges. Moreover, a list of the best-found latent confounders from one iteration is carried forward to the next iteration (see Steps 3 and 4 in Algorithm 2). Lastly, since HCLCV relies on hill-climbing search, it stops exploring when a local maximum is reached.
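A simplified sketch of this hill-climbing behaviour is shown below. The encoding of orientations as binary endpoint marks and the score function are our own illustrative assumptions, not the exact operators of Algorithm 2:

```python
# Simplified hill-climbing over edge orientations (illustration only):
# each uncertain mark is resolved to a tail (0) or an arrowhead (1), and
# the search greedily flips one mark at a time, stopping at a local
# maximum. The score stands in for pELBO and is invented for illustration.

def hill_climb(n_marks, score):
    state = (0,) * n_marks                 # start with all marks set to tails
    current = score(state)
    while True:
        # all neighbours that differ in exactly one endpoint mark
        neighbours = [state[:i] + (1 - state[i],) + state[i + 1:]
                      for i in range(n_marks)]
        best = max(neighbours, key=score)
        if score(best) <= current:
            return state, current          # local maximum reached
        state, current = best, score(best)

score = lambda s: sum(s) - 2 * s[0]        # toy score favouring marks 2..n set
state, value = hill_climb(4, score)
print(state, value)  # → (0, 1, 1, 1) 3
```

Because only single-mark flips are considered and the search stops at the first local maximum, far fewer candidate graphs are scored than in an exhaustive enumeration of Markov equivalent MAGs, which is the trade-off HCLCV makes.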
4 Case studies and evaluation setup
The experimental setup involves four realworld BNs taken from the Bayesys repository (Constantinou et al., 2020), described in Table 1. We generated synthetic data of 1k and 10k samples for each network using the bnlearn R package (Scutari, 2010). One data set is created for each latent confounder listed in Table 1. This process was applied to both sample sizes, and led to a total of 22 data sets.
We used the constraint-based FCI and the hybrid GFCI algorithms to generate the PAGs provided as input to ILCV and HCLCV. This produced four different combinations, which we refer to as ILCV-FCI, HCLCV-FCI, ILCV-GFCI and HCLCV-GFCI in Section 5. The GFCI algorithm was tested using the Tetrad-based rcausal R package (Wongchokprasitti, 2019), and the FCI algorithm was tested using the pcalg R package (Kalisch et al., 2012). Regarding the hyperparameters of FCI and GFCI, we set the G-square significance threshold to 0.05 and the Sepset size to −1 for an unlimited size of conditioning sets. For ILCV and HCLCV, we set the maximum number of bidirected edges to 4, to enable us to carry out experiments within reasonable runtime, and the convergence threshold of VBEM to 0.01.
We assess the accuracy of ILCV and HCLCV in terms of the objective score pELBO and learning runtime, with reference to the scores obtained by the GLSL and CIL algorithms discussed in subsection 2.4. GLSL and CIL are tested using the Java library by Rodriguez-Sanchez (2021), with the number of variable pairs with the highest MI to be considered set to 10, and with the relevant GLSL hyperparameter set to 1 so that no parents are assumed during density estimation of latent confounders, to enable us to carry out experiments within reasonable runtime.
We impose a runtime limit of 12 hours on each experiment, and set the corresponding timeout hyperparameter of both ILCV and HCLCV to 12 hours to ensure that they return a result within this limit. Experiments in which the other algorithms do not complete learning within the specified runtime limit are denoted as Timeout. All experiments are based on 8GB of RAM. The experiments involving the Asia, Sports and Property networks were carried out on an Intel Core i5-8250 CPU at 1.80 GHz, whereas the experiments involving the Alarm network were carried out on an M1 CPU at 3.2 GHz.
5 Empirical results
5.1 The difference in search space explored by ILCV and HCLCV
This subsection investigates the difference in search space explored between the two proposed algorithms, ILCV and HCLCV. The comparison assumes that the PAG inputs are produced by GFCI, and relies on Step 4 (which represents the main difference between the two algorithms) where the latent confounders are assumed to be binary.
Figure 2 presents the results based on the Property network (27 nodes) for both sample sizes, 1k and 10k. Figure 2a shows that ILCV-GFCI produces a slightly higher pELBO score than HCLCV-GFCI, but that ILCV-GFCI achieved this by exploring considerably more of the search space than HCLCV-GFCI; i.e., it visited a total of 170 DAGs versus 20 DAGs. The charts use different colours to illustrate how the two algorithms differ in visiting DAGs derived from MAGs that contain increasing numbers of bidirected edges. Specifically, Figure 2a shows that ILCV-GFCI visited all DAGs derived from MAGs containing up to three bidirected edges, whereas HCLCV-GFCI stopped at a local maximum while visiting DAGs derived from MAGs containing up to two bidirected edges.
Figure 2b, on the other hand, shows that the higher sample size helped ILCV-GFCI both find a higher objective score and complete learning faster than HCLCV-GFCI. This is because ILCV-GFCI found no DAG derived from MAGs containing two bidirected edges with a higher score than the highest-scoring DAG derived from MAGs containing one bidirected edge, which caused ILCV-GFCI to skip MAGs containing three bidirected edges. HCLCV-GFCI, on the other hand, ended up visiting a higher number of DAGs, but note that this does not necessarily imply that the algorithm was slower; i.e., recall that HCLCV skips checking for Markov equivalence between graphs.
Figures 2c and 2d repeat the analysis of Figures 2a and 2b with application to the Alarm network (37 nodes), and show that the results are consistent with those produced for the Property network. The only difference here is that, at the 10k sample size, the pELBO score of HCLCV-GFCI matched that of the generally slower ILCV-GFCI.
5.2 Performance of ILCV and HCLCV relative to other algorithms
We compare the results produced by ILCV and HCLCV to those produced by the CIL and GLSL algorithms described in subsection 2.4, which are, to the best of our knowledge, the two algorithms most relevant to this work, in that they also involve both the discovery and density estimation of latent confounders.
Table 2 presents the pELBO score for each algorithm and data set combination, plus the pELBO scores of the true DAGs, for both sample sizes 1k and 10k. The average ranks show that ILCV-GFCI performs best in terms of maximising the pELBO score across both sample sizes, followed by HCLCV-GFCI. The CIL algorithm is found to be the worst-performing algorithm at sample size 10k, whereas GLSL mostly outperforms both ILCV-FCI and HCLCV-FCI, but not ILCV-GFCI and HCLCV-GFCI. This means that ILCV and HCLCV benefit from the PAG input of GFCI, and suggests that the hybrid GFCI might be better than FCI at recovering PAGs; an observation consistent with previous studies (Constantinou et al., 2021). Note that while the true DAG will not always have the highest pELBO score, the highest scores produced by the algorithms tend to be very close to those of the true DAG, which helps to validate the results.
While ILCV performs best in general, it does not scale well with the size of the network. As shown in Table 2, ILCV returns an out-of-memory error (for 8GB of RAM) for most experiments with Alarm, specifically when paired with FCI, caused by the large number of possible MAGs derived from the input PAG. The cumulative runtime across all 10k-sample experiments was 14, 34, 46 and 88 hours for CIL, HCLCV-GFCI, ILCV-GFCI and GLSL respectively, with a similar trend observed across the 1k-sample experiments. On average, HCLCV is found to be 1.4 times faster than ILCV, which in turn is found to be 1.6 times slower than CIL and 4.5 times faster than GLSL, which failed to complete the Alarm network experiments at the 10k sample size; suggesting that its computational efficiency might not scale well with sample size.
6 Conclusions
This work investigated two novel algorithms that can be used for both discovery and density estimation of latent confounders in BN structure learning from discrete observational data. The first algorithm (ILCV) aims to maximise model selection accuracy by exploring sets of Markov equivalent MAGs, starting from the set of MAGs that contain the lowest number of bidirected edges and, while the objective score increases with each set, moving to sets of MAGs with increasing numbers of bidirected edges. The second algorithm (HCLCV) aims to balance accuracy relative to computational efficiency by employing hillclimbing over the MAG space, enabling application to larger networks.
Both algorithms require a PAG to be provided as an input, which means that the proposed algorithms need to be paired with a structure learning algorithm that recovers ancestral graphs. Because the input PAG will typically indicate multiple possible latent confounders, the ILCV and HCLCV algorithms use pELBO as the objective function to determine the number as well as the position of the latent confounders, thereby contributing to the discovery process, in addition to density estimation, of latent confounders.
The two proposed algorithms were evaluated relative to two recent and relevant implementations that also optimise for pELBO. The empirical results show meaningful improvements in maximising the objective score and, to some extent, in reducing runtime, although computational complexity remains a major issue. Two important limitations are that a) both algorithms rely on a PAG input provided by some other structure learning algorithm, and b) the results are based on experiments that assume a single latent confounder only, which was necessary to ensure that most experiments complete within the 12-hour runtime limit.
References
Beal, M. J. and Ghahramani, Z. The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures. Bayesian Statistics.
Beal, M. J. and Ghahramani, Z. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis 1, pp. 793–831.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer.
Colombo, D., Maathuis, M. H., Kalisch, M. and Richardson, T. S. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics.
Colombo, D. and Maathuis, M. H. Order-Independent Constraint-Based Causal Structure Learning. Journal of Machine Learning Research 15 (116), pp. 3921–3962.
Constantinou, A. C., Liu, Y., Chobtham, K., Guo, Z. and Kitson, N. K. Large-scale empirical validation of Bayesian Network structure learning algorithms with noisy data. International Journal of Approximate Reasoning 131, pp. 151–188.
Constantinou, A. C., Liu, Y., Chobtham, K., Guo, Z. and Kitson, N. K. The Bayesys data and Bayesian network repository. Queen Mary University of London, London, UK. [Online].
Friedman, N. The Bayesian Structural EM Algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 129–138.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC Texts in Statistical Science, Taylor & Francis.
Hastings, W. K. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57 (1), pp. 97–109.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul, L. K. An Introduction to Variational Methods for Graphical Models. In Learning in Graphical Models, pp. 105–161.
Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H. and Bühlmann, P. Causal Inference Using Graphical Models with the R Package pcalg. Journal of Statistical Software 47.
Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. arXiv.
Kitson, N. K., Constantinou, A. C., Guo, Z., Liu, Y. and Chobtham, K. A survey of Bayesian Network structure learning. arXiv.
Ma, C., Tschiatschek, S., Turner, R., Hernández-Lobato, J. M. and Zhang, C. VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data. In Advances in Neural Information Processing Systems.
Malinsky, D. and Spirtes, P. Estimating bounds on causal effects in high-dimensional and possibly confounded systems. International Journal of Approximate Reasoning 88.
Ogarrio, J. M., Spirtes, P. and Ramsey, J. A Hybrid Causal Search Algorithm for Latent Variable Models. In Proceedings of the Eighth International Conference on Probabilistic Graphical Models, Proceedings of Machine Learning Research.
Ramsey, J., Zhang, J. and Spirtes, P. Adjacency-Faithfulness and Conservative Causal Inference. CoRR abs/1206.6843.
Richardson, T. and Spirtes, P. Ancestral Graph Markov Models. Annals of Statistics.
Rodriguez-Sanchez, F., Larrañaga, P. and Bielza, C. Multi-partition clustering of mixed data with Bayesian networks. International Journal of Intelligent Systems.
Rodriguez-Sanchez, F., Larrañaga, P. and Bielza, C. Incremental Learning of Latent Forests. IEEE Access 8, pp. 224420–224432.
Rodriguez-Sanchez, F. Mpc-mixed library. [Online].
Scutari, M. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software 35, pp. 1–22.
Spirtes, P., Glymour, C. and Scheines, R. Causation, Prediction, and Search, 2nd Edition. The MIT Press.
Wongchokprasitti, C. rcausal: R wrapper for the Tetrad library, v1.2.1. [Online].