# Bayesian Networks, Total Variation and Robustness

Now that Bayesian Networks (BNs) have become widely used, an appreciation is developing of just how critical it is to understand the sensitivity and robustness of certain target variables to changes in the model. When time resources are limited, such issues impact directly on the chosen level of complexity of the BN as well as the quantity of missing probabilities we are able to elicit. Currently most such analyses are performed once the whole BN has been elicited and are based on Kullback-Leibler information measures. In this paper we argue that robustness methods based instead on the familiar total variation distance provide simple and more useful bounds on robustness to misspecification, which are both formally justifiable and transparent. We demonstrate how such formal robustness considerations can be embedded within the process of building a BN. Here we focus on two particular choices a modeller needs to make: the choice of the parents of each node and the number of levels to choose for each variable within the system. Our analyses are illustrated throughout using two BNs drawn from the recent literature.

## 1 Introduction

Bayesian Networks (BNs) are now a widely used probabilistic modelling tool, particularly in the field of decision support. It is now acknowledged as best practice (Cowell et al., 1999; Laskey and Mahoney, 2000; Smith, 2010) that these models are set up in two distinct stages. Firstly the structure of the BN, as expressed by its Directed Acyclic Graph (DAG), is either directly elicited from domain experts or, when sufficient supporting data exists, learned from the data using a model search algorithm with default priors on the hyperparameters (see Boneh, 2010; Korb and Nicholson, 2010). Once this graphical framework has been discovered, the graph is embellished into a full probabilistic model. In the case of a discrete BN, this second stage involves eliciting or estimating, using priors on probabilities informed by expert judgements, the entries of its conditional probability tables (CPTs). These CPTs provide the numerical prespecification of all the conditional probabilities needed to generate the full joint probability mass function and hence a fully specified probability model.

When engaging in this two stage process the analyst needs to be fully aware of precisely which inputs might be critical to the inferences made through the BN (see Albrecht et al., 2014b). One key requirement in an elicitation, or statistical estimation, of the graph is to ensure that these critical features are specified as accurately as possible. This is especially important when elicitation or estimation is resource limited, as is usually the case in practice. The modeller can then optimise their allocation of resources to concentrate on eliciting those elements of the model whose misspecification might most influence the required outputs.

To this end, the practitioner, prompted by the functionality of various software, is currently encouraged to develop awareness of the robustness of a chosen model to its inputs by performing a one-at-a-time numerical sensitivity analysis of the preliminary BN. Here various different forms of numerical contaminations of the model are investigated, where effects are usually measured in terms of mutual information/Kullback-Leibler divergence (Albrecht et al., 2014a; Friedman et al., 1997; Nicholson and Jitnah, 1998; Zaragoza et al., 2011). This type of study is obviously extremely useful. On the other hand it has drawbacks. First, it relies on the chosen enacted perturbations covering the entire space, which becomes more challenging as models become increasingly large. Furthermore, even if such a search is performed systematically, the impacts (mostly measured at present by mutual information) are not directly relevant to the impact on ensuing decisions; see below for further clarification. Additionally, such an analysis must perforce be performed after the model has been fully specified. This means that the whole probability model is needed before the sensitivity analysis can be performed. One interesting recent attempt to provide such assessments after the structural elicitation phase, but before the probabilistic embellishment, is through the use of distance weighted sensitivity measures (see Albrecht et al., 2014a). However, these do not dovetail with the mutual information measures described above and have a level of arbitrariness in the choice of weight function needed to use this method.

Over recent years more formal and systematic robustness analyses have appeared. Robustness of probability models has been studied by statisticians for many decades, and specific methodology for Bayesian Networks has also been recently developed: Coupé and van der Gaag (2002), Gómez-Villegas et al. (2013), Laskey (1995), O’Neill (2009), Renooij (2010). These fall into two main streams: local robustness studies and global studies. In the former, a chosen probability model is perturbed using a finite parametrised modification. The latter, termed global analyses, does not rely on perturbation lying within a given parametric family (O’Neill, 2009; Smith and Daneshkhah, 2010). Instead, an appropriate divergence measure is applied to first specify a neighbourhood system around each model. Bounds are then calculated for the maximum deviation in the inference that could be achieved by a model in this neighbourhood. If this deviation is small then the model is deemed to be robust (Gustafson and Wasserman, 1995; Smith and Rigat, 2012). Both types of robustness analysis have been applied to BNs in work such as Smith and Daneshkhah (2010). In this paper we focus solely on global robustness studies as applied to finite discrete BNs.

Thus far, global robustness studies for BNs have mainly centred around the analysis of how robust a model might be to perturbations, with respect to Kullback-Leibler (KL) or Chan-Darwiche divergences (see Chan and Darwiche, 2005; Gómez-Villegas et al., 2013; Leonelli et al., 2017). Both of these divergence measures benefit from some helpful technical properties which allow various measures of dependence to have explicit formulae. These measures are specified in terms of log probabilities in the KL case or equivalently ratios of probabilities in the Chan-Darwiche instance. Therefore, both have the disadvantage that they depend very heavily on the accurate specification of small probabilities. However, it is well documented that it is precisely these small probabilities that typically exhibit the largest elicitation error (see O’Hagan et al., 2006; Smith, 2010). Furthermore, when BNs are learned from data, any associated small probabilities are difficult to reliably estimate from data, because almost by definition we will see very few of these events in any training set we use to estimate a model.

In many circumstances (especially in decision analysis), the misspecification of improbable event probabilities has only a small impact on the required outputs of a decision analysis: see below. For the purposes of the two stage process described above, the Kullback-Leibler and Chan-Darwiche divergence measures are hardly ideal as they can be highly sensitive to misspecification which may have little effect on any supported decision analysis.

In this paper we demonstrate that an alternative robustness study based on a more conventional divergence measure widely used in probability theory and stochastic analysis, the total variation distance, has some serious practical and theoretical advantages over its main competitors. Although it is often difficult to derive explicit formulae for the impacts of deviation in variation, it is nevertheless straightforward to tightly bound such deviations in variation distance. Deviation in variation corresponds much more closely to the types of error we would envisage experiencing within either an elicitation exercise or through misestimation. Perhaps most importantly, the deviations in the expectation of a fixed bounded utility function, under various decisions, induced by an approximation are simply bounded by linear functions of the total variation in the probability distributions of the attributes of that utility function (see e.g. Smith, 2010). Note that in a BN these attributes will typically constitute a small subset of the totality of its variables. Hence small variation distances (between probability mass functions) on these small subsets translate directly into small effects in the pertinent expected utilities. Conversely, large deviations translate into large effects that might have a greater impact on some specification of a utility.

In the following section we review the BN framework and introduce our examples. Then in Section 3 we review some simple properties of the total variation distance and show that the effect in variation distance of the misspecification of the probability mass function of one random variable in a BN on another diminishes exponentially. We then derive explicit bounds for this error both when the BN is decomposable and more generally. We demonstrate that this impact can be bounded explicitly in terms of a simple function of the extreme entries of the CPTs within the BN. These results have the useful spin-off that CPTs do not necessarily need to be fully elicited before the robustness analysis can take place. In Section 4 we show how these explicit measures of robustness can be applied to determine the effect of approximating simplifications on the topology of the BN and, additionally, to decide the number of levels into which to categorise each variable. We demonstrate how, by using total variation, robustness analyses can be performed in a harmonious composite way that directly bounds the impact on decision making of various types of expedient approximations. Finally in Section 5 we provide some guidelines to best employ our results in practice and discuss some enhancements of our strategy.

## 2 Hypotheses of a Bayesian Network

We begin by giving a short review of BNs and some of their properties that we use later in the paper. A discrete Bayesian Network (BN), or DAG, on a random vector $X = (X_1, X_2, \ldots, X_m)$ represents a family of models which respect a set of conditional independence hypotheses so that for $i = 2, \ldots, m$

$$X_i \perp\!\!\!\perp X_{R(i)} \mid X_{Pa(i)}$$

where $Pa(i) \subseteq \{1, \ldots, i-1\}$ are the parents of $X_i$, i.e. those indices of the previously listed variables on which $X_i$ depends, and $R(i) = \{1, \ldots, i-1\} \setminus Pa(i)$.

An equivalent expression is that the joint probability mass function $p(x)$ of $X$ factorises as

$$p(x) = p(x_1)\, p(x_2 \mid x_{Pa(2)}) \cdots p(x_i \mid x_{Pa(i)}) \cdots p(x_m \mid x_{Pa(m)}). \tag{1}$$

In either formulation the directed graph of the BN has vertex set $\{X_1, \ldots, X_m\}$ and a directed edge from $X_j$ to $X_i$ iff $j \in Pa(i)$.
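As a minimal sketch of the factorisation in Equation 1, the snippet below evaluates the joint mass function of a small BN as a product of one CPT entry per node. The three-node graph and all probabilities are hypothetical and chosen only for illustration; they are not taken from either application in this paper.

```python
import itertools
import numpy as np

# Hypothetical BN (not from the paper): X1 -> X3 <- X2, all variables binary.
p_x1 = np.array([0.3, 0.7])                       # p(x1)
p_x2 = np.array([0.6, 0.4])                       # p(x2); Pa(2) is empty here
p_x3_given = np.array([[[0.9, 0.1], [0.5, 0.5]],  # p(x3 | x1, x2), indexed [x1, x2, x3]
                       [[0.4, 0.6], [0.2, 0.8]]])

def joint(x1, x2, x3):
    """Evaluate p(x) as the product of one CPT entry per node (Equation 1)."""
    return p_x1[x1] * p_x2[x2] * p_x3_given[x1, x2, x3]

# Summing the factorised joint over every configuration must give one.
total = sum(joint(*x) for x in itertools.product(range(2), repeat=3))
```

Because each CPT row sums to one, the product automatically defines a valid joint mass function, which is why the second stage of model building only needs the CPT entries.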

An important subclass of BNs, whose properties we discuss later, consists of those called decomposable. A decomposable BN is a BN in which the parent set of each node forms a complete subgraph of the DAG. It is simple to show that any BN can (albeit inefficiently) be re-expressed as a decomposable BN containing it (Lauritzen, 1996; Korb and Nicholson, 2010; Smith, 2010). This property, widely used for propagation algorithms, can also be used for robustness analyses.

When a BN is decomposable it can be shown (see Lauritzen, 1996; Smith, 2010) that the joint mass function factorises in the following way. The cliques $C_1, \ldots, C_m$, i.e. the maximal fully connected subsets of the decomposable graph, can be totally ordered starting with any clique; label this $C_1$. We call $S_i \triangleq C_i \cap \bigcup_{j<i} C_j$, where $i = 2, \ldots, m$, the separator of $C_i$ from the earlier cliques. An indexing is said to satisfy the running intersection property if for all $i = 2, \ldots, m$ there exists some index $j(i) < i$ such that $S_i \subseteq C_{j(i)}$. This implies that the result of intersecting a clique with all previous cliques is contained within one or more earlier cliques (Lauritzen, 1996; Smith, 2010). Note that the choice of $j(i)$ may not be unique.

We can depict one of these choices of order and containment by a junction tree. This is an undirected tree whose vertices are the cliques $C_1, \ldots, C_m$ and whose undirected edges simply connect $C_i$ to $C_{j(i)}$. Note that these edges can be labelled by the corresponding separator $S_i$. Here we will for simplicity assume that the entries of the joint mass function are all strictly positive, although this is not strictly necessary (see Lauritzen and Spiegelhalter, 1988). In fact this is advised from a practical point of view by a number of authors, e.g. Korb and Nicholson (2010), when dealing with no known functional relations. It can then be proved (e.g. see Cowell et al., 2007; Smith, 2010) that the joint mass function $p(x)$ of any such decomposable BN respects the following formula:

$$p(x) = \frac{p(x_{C_1})\, p(x_{C_2})\, p(x_{C_3}) \cdots p(x_{C_m})}{p(x_{S_2})\, p(x_{S_3}) \cdots p(x_{S_m})}.$$

One straightforward but important consequence of this decomposition, used later, is that given any BN and an associated junction tree, then for any two cliques, relabelled $C_1$ and $C_k$, there is a unique sequence of cliques $C_1, C_2, \ldots, C_k$ with no repeats, and separators $S_2, \ldots, S_k$, between $C_1$ and $C_k$ within the junction tree, called a simple path. If we write $\overline{C}_k \triangleq \bigcup_{i=1}^{k} C_i$, then since $S_i \subseteq C_{i-1}$ along this path we know that $x_{S_i}$ is a subvector of $x_{C_{i-1}}$, giving

$$\begin{aligned}
p(x_{\overline{C}_k}) &= \frac{p(x_{C_1})\, p(x_{C_2})\, p(x_{C_3}) \cdots p(x_{C_k})}{p(x_{S_2})\, p(x_{S_3}) \cdots p(x_{S_k})} \\
&= p(x_{C_1})\, p(x_{C_2} \mid x_{S_2})\, p(x_{C_3} \mid x_{S_3}) \cdots p(x_{C_k} \mid x_{S_k}).
\end{aligned} \tag{2}$$

###### Lemma 2.1.

It follows from Equation 2 and the conditional independences in the junction tree that if $T_{1,k} \triangleq \overline{C}_k \setminus (C_1 \cup C_k)$ then

$$p(x_{C_1 \cup C_k}) = \sum_{x_{T_{1,k}}} p(x_{C_1})\, p(x_{S_3} \mid x_{S_2})\, p(x_{S_4} \mid x_{S_3}) \cdots p(x_{S_k} \mid x_{S_{k-1}})\, p(x_{C_k} \mid x_{S_k}).$$

Thus we have a formula for the joint mass function of a “donating” clique $C_1$ and a “target” clique $C_k$ depending on it, expressed in terms of a sequence of transitions in a non-homogeneous Markov Chain. Although this property derives directly from the elementary properties of trees, it is an important and often overlooked property. It means that standard results from non-homogeneous Markov Chain theory can be used to measure the extent of the diminishing effect of information as it passes along this simple path. In particular it is well-known that variation distance in an ergodic, acyclic Markov Chain contracts as information is propagated through the system. The observation in Lemma 2.1 is therefore critical to the development of some of the robustness bounds we develop here.

### 2.1 Applying a Bayesian Network in Practice

A BN is generally selected in one of two ways. Occasionally we may have access to a complete training data set from which we can select the most promising explanatory BN, whose associated mass function respecting Equation 1 appears to best fit the data. There are many ways to do this, including using software packages such as ‘bnlearn’ in R (see Scutari and Denis, 2014). However, when applying such a model selection method in practice, even for low dimensional BNs, it is common to find that many models score similarly well. A BN may not adequately describe all features in the data set. Even if we know this model to be true, as in a simulation exercise or even a moderately sized problem, it has been demonstrated that the best model is only close to the generating process, unless the training data set is absolutely enormous (Cussens, 2011). There are also the obvious statistical errors associated with the representativeness of the data set used, even if sampling is performed at random. Hence it is rare for a single data generating model to be unequivocally identified. Considering the robustness of the critical outputs of the fitted model is therefore an essential element of any ensuing statistical analysis.

The second way to create a BN is by performing a direct elicitation from an expert. Here, having listed the variables in an order which might be compatible with the sequence in which those measurements may occur, the expert is asked, for each variable $X_i$ ($i = 2, \ldots, m$), which of the previously listed variables might be relevant to forecasting it. Building on this qualitative framework, hopefully faithful to the expert’s actual judgements, we then proceed to embellish the graph by supplementing the structure with the specification of the corresponding CPTs. These probabilities will be subject to elicitation error, although the preceding structural elicitation process aims to mitigate this specification error (Korb and Nicholson, 2010; EFSA, 2014; Smith, 2010). Again, an understanding of the robustness of any inferences we make to perturbations of the hypothesised graphical framework, and also of the entries in the CPTs, will clearly be critical to a good statistical analysis.

### 2.2 Applications

#### 2.2.1 Food Security System

To illustrate the uses and practicalities of our results we shall be using the Food Security Integrated Decision Support System (IDSS) described in greater depth in Barons et al. (2018b) and Smith et al. (2015). The aim of this massive multi-layered dynamic BN is to ascertain which local government policies influence or improve the level of food poverty within the UK. However, the targeted user here is primarily interested in three specific classes of outputs: Health (of its local constituents), Educational Attainment of children and measures of Social Cohesion (Smith et al., 2015), measures of which, in the terminology of this paper, will form our final vector of target variables.

The overarching IDSS model is a DBN, as shown in Figure 1, in which the target nodes are classified as Level 1. Each component of this model can be broken down into detailed subnetworks. For example, the Level 2 ‘UK Food Costs’ depends on the availability of food, production costs and so on. Specifically we may be interested in access to healthy food necessities such as fruit and vegetables, which rely heavily on pollinator abundance. A sub-subnetwork to determine the factors which affect the pollinator abundance is therefore required, a fragment of which is shown in Figure 2. A subset of this BN has been elicited from experts and the results can be found in Barons et al. (2018a).

#### 2.2.2 An Ecological Demonstration

To illustrate our approach we also use a well known ecological BN called the “Native Fish” example, as introduced in Nicholson et al. (2010) and discussed further in Nicholson and Flores (2011). This BN was designed specifically for demonstration purposes, notably introducing non-statisticians to BNs, and is therefore a simplified version of a much more complicated process. However, because the meanings of its variables are transparent and its topology (Version 2 of this model) is just large enough to demonstrate our arguments, this DSS is ideal for illustrating some of our methods.

This ecological BN is used to model the impact on native fish abundance of pesticide usage on surrounding fields as well as levels of rainfall. The structure of the BN is given in Figure 3. Our target node is ‘Native Fish Abundance’.

## 3 Properties of Total Variation Distance for BNs

We begin by outlining the total variation distance, highlighting some of its useful properties which we can directly apply to our robustness analyses.

Assume $X = (X_1, \ldots, X_m)$ is a vector of finite discrete random variables taking values $x \in \mathbb{X}$. Let $X_A$, taking values $x_A \in \mathbb{X}_A$, denote the subvector of $X$ comprising those components with indices $i \in A$, where $A$ denotes a subset of $\{1, \ldots, m\}$. Let $p$ and $q$ denote a hypothesised and an alternative joint mass function on $X$, and let $p_A(x_A)$ and $q_A(x_A)$ denote the probability, with respect to the mass functions $p$ and $q$ respectively, of the event $\{X_A = x_A\}$, where $x_A \in \mathbb{X}_A$. Nearly all inferential methodology and certainly all robustness analyses focus on properties of such events (Smith, 2010).

###### Definition 3.1.

The (Total) Variation distance, $d_V(p_A, q_A)$, is defined in the discrete case by

$$d_V(p_A, q_A) \triangleq \frac{1}{2} \sum_{x_A \in \mathbb{X}_A} \left| p_A(x_A) - q_A(x_A) \right|$$
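The definition above can be sketched in a few lines of Python. The helper name `total_variation` and the two mass functions are illustrative assumptions, not objects from the paper. The snippet also checks, by brute force, the standard fact that this half-$L_1$ distance equals the largest difference in probability that the two mass functions assign to any event.

```python
import itertools
import numpy as np

def total_variation(p, q):
    """Half the L1 distance between two mass functions on the same finite space."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Hypothetical mass functions on three levels (illustrative values only).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
d = total_variation(p, q)

# Brute-force check of the event form: max over non-empty events E of |p(E) - q(E)|.
events = itertools.chain.from_iterable(
    itertools.combinations(range(3), r) for r in range(1, 4))
d_events = max(abs(p[list(E)].sum() - q[list(E)].sum()) for E in events)
```

The event form is what makes the distance directly interpretable in an elicitation: it is the worst disagreement between two experts about the probability of any single event.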

### 3.1 Variation under Marginalisation and Conditioning

Measures of variation distance can be applied directly to CPTs. In this section we define some new objects which will be especially useful in our later development.

Let $P$ and $Q$, with rows $p_i$ and $q_i$ respectively for $i = 1, \ldots, n$, be two CPT matrices of a random vector $Y$, taking $n'$ levels, given another random vector $X$, taking $n$ levels. For a BN, $Y$ will typically be a random variable whilst $X$ will be the vector of its parents; nevertheless when studying junction trees it is also helpful to consider cases when $Y$ is a vector.

There is a natural variation distance we can now define between $P$ and $Q$:

###### Definition 3.2.

Let the variation distance between conditional probability tables $P$ and $Q$ be defined by

$$d_V^+(P, Q) \triangleq \max_{1 \le i \le n} d_V(p_i, q_i).$$

###### Example 3.1.

Assume that the CPT of ‘Tree Condition’ in Nicholson et al. (2010), represented by the transition matrix $P$ below, gives the combined matrix elicited from a panel of experts using a standard protocol (see EFSA, 2014, for example). Suppose the experts disagree on a couple of probabilities, so that expert A’s individually elicited probabilities are given by the alternative CPT $Q$:

$$P = \begin{pmatrix}
0.20 & 0.60 & 0.20 \\
0.25 & 0.60 & 0.15 \\
0.30 & 0.60 & 0.10 \\
0.70 & 0.25 & 0.05 \\
0.80 & 0.18 & 0.02 \\
0.90 & 0.09 & 0.01
\end{pmatrix}, \qquad
Q = \begin{pmatrix}
0.20 & 0.60 & 0.20 \\
0.30 & 0.50 & 0.20 \\
0.30 & 0.60 & 0.10 \\
0.65 & 0.25 & 0.10 \\
0.80 & 0.18 & 0.02 \\
0.90 & 0.10 & 0.00
\end{pmatrix}.$$

We can now compute $d_V^+(P, Q) = 0.1$, which is attained in the second row. Expert A will often be concerned that the substitution of $P$ for $Q$ might significantly affect the conclusions about the target variables of this panel. We show below how we can use the diameter calculation to directly measure, in a formal sense, the effect of this substitution and thus allay Expert A’s fears that the panel’s judgement might be substantially at variance with their own.
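The row-wise calculation behind Definition 3.2 can be sketched as follows, using the two matrices of Example 3.1. The variable names are illustrative choices, not notation from the paper.

```python
import numpy as np

# The panel CPT P and expert A's CPT Q from Example 3.1 (rows are parent configurations).
P = np.array([[0.20, 0.60, 0.20],
              [0.25, 0.60, 0.15],
              [0.30, 0.60, 0.10],
              [0.70, 0.25, 0.05],
              [0.80, 0.18, 0.02],
              [0.90, 0.09, 0.01]])
Q = np.array([[0.20, 0.60, 0.20],
              [0.30, 0.50, 0.20],
              [0.30, 0.60, 0.10],
              [0.65, 0.25, 0.10],
              [0.80, 0.18, 0.02],
              [0.90, 0.10, 0.00]])

# Variation distance between matching rows, then the maximum over rows.
row_tv = 0.5 * np.abs(P - Q).sum(axis=1)
d_plus_v = row_tv.max()            # d_V^+(P, Q)
worst_row = int(row_tv.argmax())   # zero-based index of the row where experts differ most
```

Only three of the six rows differ at all, and the largest row-wise distance is 0.1, so the disagreement is concentrated in a single parent configuration.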

Note that if $\rho(P) = \pi P$ and $\rho(Q) = \pi Q$ are the vectors of marginal mass functions of $Y$ and $\pi$ is a margin on $X$, then for all possible margins $\pi$

$$d_V(\rho(P), \rho(Q)) \le d_V^+(P, Q),$$

with equality whenever $\pi$ puts all its mass on atoms indexed by $i^+$, where

$$i^+ \triangleq \arg\max_{1 \le i \le n} d_V(p_i, q_i).$$

This therefore gives rather coarse, but quick, bounds which require only comparisons of the pairs of individual rows of the perturbed CPT.
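A quick numerical check of this bound, again with the matrices of Example 3.1, is sketched below. The randomly drawn parent margins are an assumption made purely for illustration (Dirichlet draws with a fixed seed); any margin on the six parent configurations could be used instead.

```python
import numpy as np

P = np.array([[0.20, 0.60, 0.20], [0.25, 0.60, 0.15], [0.30, 0.60, 0.10],
              [0.70, 0.25, 0.05], [0.80, 0.18, 0.02], [0.90, 0.09, 0.01]])
Q = np.array([[0.20, 0.60, 0.20], [0.30, 0.50, 0.20], [0.30, 0.60, 0.10],
              [0.65, 0.25, 0.10], [0.80, 0.18, 0.02], [0.90, 0.10, 0.00]])

row_tv = 0.5 * np.abs(P - Q).sum(axis=1)
d_plus_v = row_tv.max()

# Hypothetical parent margins: 1000 random probability vectors over the 6 rows.
rng = np.random.default_rng(0)
margins = rng.dirichlet(np.ones(P.shape[0]), size=1000)

# Distance between the induced margins of Y under P and under Q, for each parent margin.
margin_tv = 0.5 * np.abs(margins @ P - margins @ Q).sum(axis=1)

# A point mass on the worst row attains the bound.
point_mass = np.zeros(P.shape[0])
point_mass[row_tv.argmax()] = 1.0
attained = 0.5 * np.abs(point_mass @ P - point_mass @ Q).sum()
```

For a typical margin the induced distance is well below the bound, which is why the text describes the bound as coarse but quick.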

Earlier we highlighted that when eliciting a BN we first elicit hypotheses of conditional independence. Only then do we expand this with a full probability specification through the numerical values in its CPTs. So we next consider robustness measures associated with small deviations from conditional independence. The definition we present below is, to our knowledge, a new construction using variation distance on CPTs to determine the measure of dependence between variables.

###### Definition 3.3.

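The diameter, $d^+(P)$, and the $I$-local diameter, $d_I^+(P)$, are respectively defined as

$$d^+(P) = \frac{1}{2} \max_{1 \le i, i' \le n} \left\{ \sum_{j=1}^{n'} \left| p_{ij} - p_{i'j} \right| \right\}, \qquad
d_I^+(P) = \frac{1}{2} \max_{i, i' \in I} \left\{ \sum_{j=1}^{n'} \left| p_{ij} - p_{i'j} \right| \right\}.$$

###### Example 3.2.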

The values of diameters (typical of those found in many exercises) for each of the CPTs of the Native Fish BN from Nicholson and Flores (2011), together with those obtained in an elicitation exercise associated with the pollinator example (Barons et al., 2018a), are given in the tables below. Discrepancies passing through CPTs with diameters close to 1 might be retained as different target distributions. However, once discrepancies pass through more than two CPTs with diameters well below this, they usually quickly dissolve, for reasons we discuss below.

The size of the diameter $d^+(P)$ of a conditional probability table $P$ is a measure of the dependence of $Y$ on $X$. This is because whenever $Y \perp\!\!\!\perp X$ all rows of $P$ will be equal and so $d^+(P) = 0$. It is easy to check that whenever some non-trivial function of $Y$ can be written as a deterministic function of $X$ then $d^+(P) = 1$, its maximum value. So when there is only a weak relationship between $X$ and $Y$, in the sense that changing the different levels of $X$ impacts only slightly on the conditional mass function of $Y$, then $d^+(P) \approx 0$. Note that unless the joint distribution of $X$ and $Y$ is symmetric, the diameter of $Y$ on $X$ is not the same as the diameter of $X$ on $Y$; in fact the difference between these can be arbitrarily close to 1 (see Wright, 2018).

The $I$-local diameter has the same property, where this time it is conditional on $X$ taking values only in the set of levels $I$. This is useful when comparing the efficacy of deleting a parent in a BN or when combining a collection of rows of the CPT (levels of $X$) into a single entry: see below.
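The diameter and its local version can be sketched as below. The function name `diameter`, the optional `rows` argument and the choice $I = \{1, 2, 3\}$ (the first three rows) are illustrative assumptions. The matrix is the CPT $P$ of Example 3.1, and the two small matrices at the end illustrate the independence and deterministic extremes described above.

```python
import numpy as np

def diameter(P, rows=None):
    """Largest variation distance between two rows of a CPT.

    If `rows` is given, only those rows are compared (the I-local diameter).
    """
    P = np.asarray(P, dtype=float)
    if rows is not None:
        P = P[list(rows)]
    pairwise = 0.5 * np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    return pairwise.max()

P = np.array([[0.20, 0.60, 0.20], [0.25, 0.60, 0.15], [0.30, 0.60, 0.10],
              [0.70, 0.25, 0.05], [0.80, 0.18, 0.02], [0.90, 0.09, 0.01]])

d_full = diameter(P)                    # compares every pair of rows
d_local = diameter(P, rows=[0, 1, 2])   # hypothetical I: first three rows only

# Extremes: identical rows (independence) and a deterministic relationship.
d_indep = diameter([[0.3, 0.7], [0.3, 0.7]])
d_determ = diameter([[1.0, 0.0], [0.0, 1.0]])
```

Here the full diameter is driven by the first and last rows, while the first three rows are close to one another, which is the kind of pattern that suggests those levels could be merged with little loss.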

### 3.2 Variation and Mixtures

#### 3.2.1 Approximations associated with mixing

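A useful and well-known property of total variation is its convexity under mixing in the following sense. Let $q_1, \ldots, q_n$ and $p_1, \ldots, p_{n'}$ be mass functions on a common space, let $\pi = (\pi_1, \ldots, \pi_n)$ and $\pi' = (\pi'_1, \ldots, \pi'_{n'})$ be probability vectors, and define

$$q_\pi \triangleq \sum_{i=1}^{n} \pi_i q_i, \qquad p_{\pi'} \triangleq \sum_{i=1}^{n'} \pi'_i p_i,$$

then

###### Lemma 3.1.

$$d_V(p_{\pi'}, q_\pi) \le \sum_{i=1}^{n} \sum_{i'=1}^{n'} \pi_i \pi'_{i'}\, d_V(p_{i'}, q_i).$$

###### Proof.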

See Appendix A.1. ∎
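The convexity property of Lemma 3.1 can be checked numerically with a short sketch. All mass functions and mixing weights below are randomly generated placeholders (Dirichlet draws with a fixed seed), not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_prime, levels = 4, 3, 5   # hypothetical sizes

# Component mass functions (one per row) and mixing weights.
q_rows = rng.dirichlet(np.ones(levels), size=n)
p_rows = rng.dirichlet(np.ones(levels), size=n_prime)
pi = rng.dirichlet(np.ones(n))
pi_prime = rng.dirichlet(np.ones(n_prime))

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

# Left-hand side: distance between the two mixtures.
lhs = tv(pi_prime @ p_rows, pi @ q_rows)

# Right-hand side: weighted average of all pairwise distances d_V(p_i', q_i).
pairwise = 0.5 * np.abs(p_rows[:, None, :] - q_rows[None, :, :]).sum(axis=2)
rhs = pi_prime @ pairwise @ pi
```

The comparison `lhs <= rhs` holds for any choice of components and weights, since it follows from the triangle inequality applied term by term.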

In particular, if we know that the variation distances between extremal distributions are small then so are those between convex linear combinations of these. Such processes occur for example in the calculation of a margin: here of a target variable. This enables us to prove a number of useful results concerning the contraction of error under learning in a BN: see below.

Combining our new definitions of diameter with variation distance we prove the following result that enables us to track this distance through a given BN:

###### Theorem 3.2.

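Let $\pi_1$ and $\pi_2$ be two possible margins of a vector $X$ of random variables, suppose that $P(Y \mid X)$ is the (shared) CPT, indexed by the concatenated levels of $X$, of the conditional distribution of $Y$ given $X$, and that $\rho_1$ and $\rho_2$ are the corresponding margins of $Y$. Then

$$d_V(\rho_1, \rho_2) \le d^+(P(Y \mid X))\, d_V(\pi_1, \pi_2).$$

###### Proof.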

See Appendix A.2. ∎

This property will be exploited below in the study of BNs. Note for example that if $P(Y \mid X)$ has been specified accurately, but the margin of $X$ is uncertain, then our marginal beliefs about $Y$ are no more uncertain than those about $X$, because by definition $d^+(P(Y \mid X)) \le 1$. More importantly we have a bound on how much our uncertainty, quantified in terms of total variation, reduces in terms of $d^+(P(Y \mid X))$, a measure of how far away $Y$ is from independence of $X$.

###### Example 3.3.

Let us once again look at the CPT of ‘Tree Condition’, $P$, which had a binary parent ‘Drought’ and a three-state parent ‘Rainfall’. The joint distribution $\pi_1$ of these parents can be calculated from their CPTs. Suppose another expert proposed the different probability vector $\pi_2$. We have previously calculated $d^+(P(Y \mid X)) = 0.7$ and can calculate that $d_V(\pi_1, \pi_2) = 0.125$. Therefore Theorem 3.2 gives:

$$d_V(\rho_1, \rho_2) \le d^+(P(Y \mid X))\, d_V(\pi_1, \pi_2) = 0.7 \times 0.125 = 0.0875.$$

We can of course calculate the distance between these margins exactly. However, if we knew only the extreme entries of $P$ then we could still calculate our bound, which is of the right order of magnitude: a property we have found to be typical of the types of CPTs we habitually elicit.
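The calculation in this example can be sketched numerically. The CPT is the matrix $P$ of Example 3.1; the two parent margins `pi1` and `pi2` below are hypothetical stand-ins, chosen only so that their variation distance is 0.125 as in the example, and are not the margins used by the authors.

```python
import numpy as np

P = np.array([[0.20, 0.60, 0.20], [0.25, 0.60, 0.15], [0.30, 0.60, 0.10],
              [0.70, 0.25, 0.05], [0.80, 0.18, 0.02], [0.90, 0.09, 0.01]])

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def diameter(M):
    return (0.5 * np.abs(M[:, None, :] - M[None, :, :]).sum(axis=2)).max()

# Hypothetical margins over the six (Drought, Rainfall) configurations.
# They differ by moving 0.125 of mass from the second row to the fifth.
pi1 = np.array([0.10, 0.250, 0.20, 0.15, 0.150, 0.15])
pi2 = np.array([0.10, 0.125, 0.20, 0.15, 0.275, 0.15])

bound = diameter(P) * tv(pi1, pi2)   # right-hand side of Theorem 3.2
exact = tv(pi1 @ P, pi2 @ P)         # distance between the induced margins of Y
```

With these stand-in margins the exact distance is smaller than the bound but of the same order of magnitude, in line with the remark above. The bound itself needs only the diameter of the CPT, not its full set of entries.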

#### 3.2.2 A Global Bound Approximation

There is another bound which applies when not only a margin $\pi_1$ of $X$ is perturbed to $\pi_2$, but the conditional mass functions of $Y$ given $X$ are also simultaneously perturbed. Occasionally we need variation bounds on the consequent perturbation of the margins of $Y$:

###### Definition 3.4.

Let the superbound, $d_V^*(P, Q)$, between stochastic matrices $P$ and $Q$ be defined by

$$d_V^*(P, Q) \triangleq \max_{1 \le i, i' \le n} d_V(p_i, q_{i'}) \le 1.$$

So here we compare variation distances between each row of $P$ and possibly different rows of $Q$ before selecting the largest difference. Note by definition and the triangle inequality that

$$d_V^+(P, Q) \le d_V^*(P, Q) \le \min\left\{ d_V^+(P, Q) + \max\{d^+(P), d^+(Q)\},\; 1 \right\}. \tag{3}$$

###### Example 3.4.

Let us compare the two alternative CPTs, $P$ and $Q$, for the ‘Tree Condition’ node as introduced in Example 3.1. The value of $d_V^*(P, Q)$ can be calculated directly from the total variation distance between every possible pairwise combination of rows in $P$ and $Q$. For this example $d_V^*(P, Q) = 0.7$, attained for instance by comparing the first row of $P$ with the last row of $Q$.
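The superbound of Definition 3.4 for this example can be computed with the sketch below, which also checks that it sits between the row-wise distance and that distance plus the larger of the two diameters (a consequence of the triangle inequality). The cap at one is included because no variation distance can exceed one. Variable names are illustrative.

```python
import numpy as np

P = np.array([[0.20, 0.60, 0.20], [0.25, 0.60, 0.15], [0.30, 0.60, 0.10],
              [0.70, 0.25, 0.05], [0.80, 0.18, 0.02], [0.90, 0.09, 0.01]])
Q = np.array([[0.20, 0.60, 0.20], [0.30, 0.50, 0.20], [0.30, 0.60, 0.10],
              [0.65, 0.25, 0.10], [0.80, 0.18, 0.02], [0.90, 0.10, 0.00]])

def diameter(M):
    return (0.5 * np.abs(M[:, None, :] - M[None, :, :]).sum(axis=2)).max()

# Entry [i, i'] is the variation distance between row i of P and row i' of Q.
pairwise = 0.5 * np.abs(P[:, None, :] - Q[None, :, :]).sum(axis=2)

d_star = pairwise.max()             # superbound: any row of P against any row of Q
d_plus_v = np.diag(pairwise).max()  # matching rows only
upper = min(d_plus_v + max(diameter(P), diameter(Q)), 1.0)
```

The superbound is much larger than the row-wise distance here because it is dominated by the spread of rows within each CPT rather than by the disagreement between the two experts.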

Let $P_{A|B}$, $Q_{A|B}$ represent respectively the conditional probability mass functions of $X_A$ under the hypothesis and the alternative given $X_B$, where without loss we can assume that $A$ and $B$ are disjoint. Notice that these can be seen as CPTs whose rows correspond to the different values of the vector $X_B$. Then under our definitions of transition matrices above, whenever $X_A \perp\!\!\!\perp X_B$ under both mass functions,

$$d_V^+(P_{A|B}, Q_{A|B}) = d_V^*(P_{A|B}, Q_{A|B}) = d_V(p_A, q_A).$$

This arises simply because $X_A \perp\!\!\!\perp X_B$ implies that all rows in the CPT matrix are equal to each other and so equal to the corresponding margin on $X_A$. Thus we see that standard analyses that elicit irrelevances or independences translate here into equations on variation distance. We will see later that this enables us to study the implications of models where the embedded conditional independences are only approximately true.

###### Definition 3.5.

The stochastic variation matrix of a CPT $P$ with $n$ rows is the symmetric $n \times n$ matrix whose entries are the variation distances between the different rows of the matrix $P$.
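A minimal sketch of this construction, applied to the CPT $P$ of Example 3.1, is given below; the variable name `V` is an illustrative choice. Its largest entry is the diameter of $P$, and blocks of small entries indicate groups of parent configurations that lead to nearly the same conditional distribution.

```python
import numpy as np

P = np.array([[0.20, 0.60, 0.20], [0.25, 0.60, 0.15], [0.30, 0.60, 0.10],
              [0.70, 0.25, 0.05], [0.80, 0.18, 0.02], [0.90, 0.09, 0.01]])

# Entry [i, i'] is the variation distance between rows i and i' of P.
V = 0.5 * np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)

largest = V.max()   # coincides with the diameter of P
```

For this CPT the first three rows are mutually close, as are the last three, while distances across the two groups are large.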

We will later use this construction to draw out useful functions of the explanatory variables associated with a particular variable of focus.

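Now note that we can write

$$\pi_1 = (1 - \beta)\pi_1^* + \beta\, \pi_{1 \wedge 2}, \qquad \pi_2 = (1 - \beta)\pi_2^* + \beta\, \pi_{1 \wedge 2},$$

where $\beta = 1 - d_V(\pi_1, \pi_2)$ and where without loss we can assume the mixing process is shared by the two mass functions, so points are drawn either from $\pi_{1 \wedge 2}$ or alternatively from either $\pi_1^*$ or $\pi_2^*$ (see Supplementary Material for a more detailed construction). Using the same convexity argument as before,

$$\begin{aligned}
d_V(\rho_1, \rho_2) &= d_V(\pi_1 P_1, \pi_2 P_2) \\
&= d_V\big( ((1 - \beta)\pi_1^* + \beta \pi_{1 \wedge 2}) P_1,\; ((1 - \beta)\pi_2^* + \beta \pi_{1 \wedge 2}) P_2 \big) \\
&\le \beta\, d_V(\pi_{1 \wedge 2} P_1, \pi_{1 \wedge 2} P_2) + (1 - \beta)\, d_V(\pi_1^* P_1, \pi_2^* P_2) \\
&\le \beta\, d_V^+(P_1, P_2) + (1 - \beta)\, d_V^*(P_1, P_2).
\end{aligned}$$

Since $\beta \le 1$ and $1 - \beta = d_V(\pi_1, \pi_2)$, we can then show

$$d_V(\rho_1, \rho_2) \le d_V^+(P_1, P_2) + d_V(\pi_1, \pi_2)\, d_V^*(P_1, P_2).$$

Using Equation 3 in particular we have that

$$d_V(\rho_1, \rho_2) \le \{1 + d_V(\pi_1, \pi_2)\}\, d_V^+(P_1, P_2) + d_V(\pi_1, \pi_2) \max\{d^+(P_1), d^+(P_2)\}.$$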
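The bound $d_V(\rho_1, \rho_2) \le d_V^+(P_1, P_2) + d_V(\pi_1, \pi_2)\, d_V^*(P_1, P_2)$ can be checked by simulation. The sketch below draws random margins and random CPTs (hypothetical Dirichlet draws with a fixed seed, not elicited values) and records the smallest slack between the two sides.

```python
import numpy as np

rng = np.random.default_rng(2)
n, levels = 6, 3   # hypothetical numbers of parent configurations and child levels

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def row_distance(P1, P2):
    """d_V^+: largest distance between matching rows."""
    return (0.5 * np.abs(P1 - P2).sum(axis=1)).max()

def superbound(P1, P2):
    """d_V^*: largest distance between any row of P1 and any row of P2."""
    return (0.5 * np.abs(P1[:, None, :] - P2[None, :, :]).sum(axis=2)).max()

worst_slack = np.inf
for _ in range(500):
    P1 = rng.dirichlet(np.ones(levels), size=n)
    P2 = rng.dirichlet(np.ones(levels), size=n)
    pi1 = rng.dirichlet(np.ones(n))
    pi2 = rng.dirichlet(np.ones(n))
    lhs = tv(pi1 @ P1, pi2 @ P2)
    rhs = row_distance(P1, P2) + tv(pi1, pi2) * superbound(P1, P2)
    worst_slack = min(worst_slack, rhs - lhs)
```

A non-negative `worst_slack` is consistent with the bound; the simulation does not replace the derivation but gives a feel for how conservative the bound is when both the margin and the CPT are perturbed at once.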

## 4 Approximations of the CPTs in a known BN

Suppose all clients are content that the conditional independences in a given BN are valid. Without changing the random variables in the system we are now interested in finding ways of approximating the graphical model and refining initial probability estimates within this given BN.

### 4.1 Diameter Bounds when Marginalising or Conditioning

We now present some basic results about diameters of the transition matrices between two vectors of random variables under various marginalisations and conditioning of the subvectors. These bounds are particularly helpful when moving from a BN to a junction tree.

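Let $P_{Y|X}$ and $P_{Y|X_1}$ be, respectively, the transition matrices associated with the conditional distribution of $Y$ given $X = (X_1, X_2)$ and of $Y$ given $X_1$ (the same conditional distribution but now with $X_2$ marginalised out). Let $d^+(P_{Y|X})$ and $d^+(P_{Y|X_1})$ denote their respective diameters.

###### Lemma 4.1.

$$d^+(P_{Y|X_1}) \le d^+(P_{Y|X}).$$

###### Proof.

This is immediate since each of the rows of $P_{Y|X_1}$ is a weighted average of rows of $P_{Y|X}$ (the weight on the row labelled $(x_1, x_2)$ corresponding to the mass $P(X_2 = x_2 \mid X_1 = x_1)$). ∎

Note that this bound is tight in the sense that it is attained for a particular distribution on $X$. Suppose $d^+(P_{Y|X})$ is attained when we compare the row $(x_1, x_2)$ with the row $(x'_1, x'_2)$ and

$$P(X_2 = x_2 \mid X_1 = x_1) = 1 \quad \text{and} \quad P(X_2 = x'_2 \mid X_1 = x'_1) = 1,$$

then it is easy to check that $d^+(P_{Y|X_1}) = d^+(P_{Y|X})$.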
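The effect of marginalising out a parent on the diameter (Lemma 4.1) can be illustrated with a short sketch. The CPT of $Y$ given $(X_1, X_2)$ and the conditional distribution of $X_2$ given $X_1$ are random placeholders with hypothetical sizes, not values from either application.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, levels = 3, 4, 3   # hypothetical numbers of levels of X1, X2 and Y

# P(Y | X1, X2): one mass function on Y for each pair (x1, x2); shape (a, b, levels).
P_full = rng.dirichlet(np.ones(levels), size=(a, b))
# P(X2 | X1): one mass function on X2 for each x1; shape (a, b).
w = rng.dirichlet(np.ones(b), size=a)

# Marginalising X2 out: each row of P(Y | X1) is a weighted average of rows of P(Y | X1, X2).
P_marg = np.einsum("ab,abl->al", w, P_full)

def diameter(M):
    return (0.5 * np.abs(M[:, None, :] - M[None, :, :]).sum(axis=2)).max()

d_full = diameter(P_full.reshape(a * b, levels))   # diameter over all (x1, x2) rows
d_marg = diameter(P_marg)                          # diameter after marginalisation
```

Averaging rows can only pull them closer together, so `d_marg` never exceeds `d_full`; this is what allows diameters elicited for the original CPTs to serve as bounds along a junction tree path.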

###### Lemma 4.2.

Using the obvious notation, for any two joint probability mass functions over $(X, Y)$

$$d_V\big(p_{X,Y}(x, y), p'_{X,Y}(x, y)\big) \le \inf\left\{ d_V\big(p_X(x), p'_X(x)\big) + \sup_x d_V\big(p_{Y|X}(y \mid x), p'_{Y|X}(y \mid x)\big),\; 1 \right\}.$$

###### Proof.

See Appendix A.3. ∎
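Lemma 4.2 can also be checked numerically. In the sketch below the two joint mass functions are built from randomly drawn margins and conditionals (hypothetical placeholders with a fixed seed), and the joint distance is compared with the sum of the marginal distance and the worst conditional distance, capped at one.

```python
import numpy as np

rng = np.random.default_rng(4)
nx, ny = 4, 3   # hypothetical numbers of levels of X and Y

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

# Two margins on X and two CPTs for Y given X.
p_x, q_x = rng.dirichlet(np.ones(nx)), rng.dirichlet(np.ones(nx))
P_cond = rng.dirichlet(np.ones(ny), size=nx)
Q_cond = rng.dirichlet(np.ones(ny), size=nx)

# Joint mass functions p(x, y) = p(x) p(y | x), stored as nx-by-ny arrays.
joint_p = p_x[:, None] * P_cond
joint_q = q_x[:, None] * Q_cond

lhs = tv(joint_p, joint_q)
worst_conditional = (0.5 * np.abs(P_cond - Q_cond).sum(axis=1)).max()
rhs = min(tv(p_x, q_x) + worst_conditional, 1.0)
```

The decomposition separates error in the margin of the parents from error in the CPT itself, so the two can be elicited and bounded independently.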

Finally, we can determine a bound on the diameter of a CPT in which many variables are dependent on the same set. This will often be the case when we are looking at a simple path of a junction tree in which a separator contains more than one variable:

###### Proof.

See Appendix A.4. ∎

These results may seem trivial; however, they enable us to bound the diameters of CPTs in our junction tree path using the diameters already calculated from the original CPTs in the BN. This enables us to study the robustness to misspecification without calculating any new information.

### 4.2 Diminishing tree propagated approximation error

The following result explains why, when standard propagation algorithms are used to update one of the clique margins, the knock-on effect on the other clique margins becomes weaker and weaker as those cliques become progressively more remote from the updated one, a property Albrecht et al. (2014a) exploit in their work. Furthermore, the extent of the deviation can be measured, in the sense that it can be bounded above. This enables us to bound the potential extent of error in the distributions of focus variables induced by the misspecification of structure or of various CPTs in the BN. This is particularly useful when we elicit a large BN and want to know how far away from the target nodes we need to elicit the corresponding CPTs accurately.

###### Theorem 4.4.

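Let $C_1, \dots, C_k$, from $C_1$ to $C_k$, be the minimal sequence of cliques with associated separators $S_2, \dots, S_k$. Let each undirected edge of the marginalised junction tree be denoted by $\delta_i$ for $i = 1, \dots, k$, the diameter of the conditional probability table between the two sequential nodes, for example $\delta_1 = d^+(P(S_2|C_1))$. Then

$$d_V\big(p_{C_k}(x_{C_k}),\, q_{C_k}(x_{C_k})\big) \le d_V\big(p_{C_1}(x_{C_1}),\, q_{C_1}(x_{C_1})\big) \prod_{i=1}^{k} \delta_i.$$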
###### Proof.

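By Lemma 2.1 we can rewrite our junction tree to marginalise over internal cliques, leaving us with the graphical structure:

$$C_1 \,\text{--}\, S_2 \,\text{--}\, S_3 \,\text{--}\, \cdots \,\text{--}\, S_k \,\text{--}\, C_k$$

Let each undirected edge be denoted by $\delta_i$ for $i = 1, \dots, k$, the diameter of the conditional probability table between the two sequential nodes, giving for instance $\delta_1 = d^+(P(S_2|C_1))$ and $\delta_k = d^+(P(C_k|S_k))$. By successive application of Theorem 3.7: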

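$$
\begin{aligned}
d_V\big(p_{C_k}(x_{C_k}),\, q_{C_k}(x_{C_k})\big)
&\le d^+(P(C_k|S_k))\, d_V\big(p_{S_k}(x_{S_k}),\, q_{S_k}(x_{S_k})\big) \\
&\le d^+(P(C_k|S_k))\, d^+(P(S_k|S_{k-1}))\, d_V\big(p_{S_{k-1}}(x_{S_{k-1}}),\, q_{S_{k-1}}(x_{S_{k-1}})\big) \\
&\le d^+(P(C_k|S_k))\, d^+(P(S_k|S_{k-1})) \cdots d^+(P(S_3|S_2))\, d^+(P(S_2|C_1))\, d_V\big(p_{C_1}(x_{C_1}),\, q_{C_1}(x_{C_1})\big) \\
&= \Big(\prod_{i=1}^{k} \delta_i\Big)\, d_V\big(p_{C_1}(x_{C_1}),\, q_{C_1}(x_{C_1})\big).
\end{aligned}
$$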
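The chain of inequalities can be illustrated by simulation. The sketch below assumes that the contraction step used in the proof is the standard bound $d_V(pP, qP) \le d^+(P)\, d_V(p, q)$ for a transition matrix $P$, and it models the marginalised path as a chain of random transition matrices; the matrices, sizes and function names are hypothetical.

```python
import random
from itertools import combinations

def tv(p, q):
    """Total variation distance between two pmfs stored as equal-length lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def diameter(matrix):
    """Largest TV distance between any two rows of a transition matrix."""
    return max(tv(r, s) for r, s in combinations(matrix, 2))

def random_pmf(n, rng):
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def propagate(p, matrix):
    """Push a pmf through a transition matrix: (pP)_j = sum_i p_i P_ij."""
    n_cols = len(matrix[0])
    return [sum(p[i] * matrix[i][j] for i in range(len(p))) for j in range(n_cols)]

rng = random.Random(1)
k, n_states = 5, 3
# One random transition matrix per edge of the path (stand-ins for the CPTs).
edges = [[random_pmf(n_states, rng) for _ in range(n_states)] for _ in range(k)]

p, q = random_pmf(n_states, rng), random_pmf(n_states, rng)
bound = tv(p, q)          # d_V at the first clique
checks = []
for matrix in edges:
    bound *= diameter(matrix)             # multiply in the next delta_i
    p, q = propagate(p, matrix), propagate(q, matrix)
    checks.append((tv(p, q), bound))      # (actual distance, propagated bound)

for step, (actual, upper) in enumerate(checks, start=1):
    print(step, round(actual, 6), round(upper, 6))
```

At every step the actual distance between the two propagated margins stays below the running product of diameters times the initial distance, and both shrink as the path lengthens.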

Next we define the impact of one clique upon another in order to ascertain the diminishing effect of errors downstream in the causal chain.

###### Definition 4.1.

Define the impact of $C_1$ on $C_k$ to be $I(C_k|C_1) = \prod_{i=1}^{k} \delta_i$.

The impact of one clique on another is a simple measure of the maximum possible influence that misspecifying one set of clique probabilities could have on another, as measured by a bound on the variation distance. Note that in general each edge of a junction tree (which is also labelled by the separator between its adjacent cliques) can be labelled by two diameters, one measuring the impact of the first clique on the second and the other the impact of the second on the first. These two impacts are not necessarily equal, and are often very different. However, in the contexts we consider here (where our primary interest concerns the robustness of the margins of an identified subset of attributes) we usually need to focus on propagation in a single direction. Furthermore, if the BN is constructed with a conjectured causal directionality in mind, then this directionality often tends to place the attributes of interest at the end of the causal chain. This means that the diameters we need can often be calculated directly from the diameters of the elicited CPTs of the BN.

###### Example 4.1.

The two simple BNs we have used in our running example are not deep enough to illustrate the usefulness of this result, whilst the full IDSS is far too complicated. So instead we use here a simplification of a BN, elicited by one of the authors, that models radicalisation processes. The precise meaning of the nodes is confidential but not relevant to the points we mean to illustrate.

Let us label the cliques to satisfy the running intersection property:

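$$
\begin{aligned}
&C_1 = \{X_1, X_2\},\quad C_2 = \{X_2, X_3, X_4\},\quad C_3 = \{X_4, X_5\},\quad C_4 = \{X_5, X_6, X_7\},\\
&C_5 = \{X_6, X_7, X_8\},\quad C_6 = \{X_3, X_{10}\},\quad C_7 = \{X_7, X_9\}
\end{aligned}
$$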

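This gives us the separators:

$$
\begin{aligned}
&S_2 = \{X_2\},\quad S_3 = \{X_4\},\quad S_4 = \{X_5\},\quad S_5 = \{X_6, X_7\},\\
&S_6 = C_2 \cap C_6 = \{X_3\},\quad S_7 = C_5 \cap C_7 = \{X_7\}
\end{aligned}
$$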

Suppose we wish to determine the effect on $X_9$ if we perturb $C_1$. Drawing the ancestral graph of the nodes $X_9$ and $C_1$, we derive the impact formula (which is simply the product of the diameters of each separator conditional on the previous separator):

$$
\begin{aligned}
I(X_9|C_1) &= p(X_2|X_1)\, p(X_4|X_2)\, p(X_5|X_4)\, p(X_7|X_5)\, p(X_9|X_7) \\
&\le d^+(X_2)\, d^+(X_4)\, d^+(X_5)\, d^+(X_7)\, d^+(X_9)
\end{aligned}
$$

Extending this further, we can determine the impact on $X_6$ and $X_7$ simultaneously if we perturb both $X_1$ and $X_2$. Following the same steps of creating cliques and separators for the ancestral graph of these nodes, the impact is given as:

$$I(X_6, X_7|X_1, X_2) = p(X_2|X_1)\, p(X_4|X_2)\, p(X_5|X_4)\, p(X_6, X_7|X_5).$$

This can be written in terms of the original BN CPTs using Lemma 4.3, as some separators contain more than one node:

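$$
\begin{aligned}
I(X_6, X_7|X_1, X_2) &\le d^+(X_2)\, d^+(X_4)\, d^+(X_5)\, \big[\inf\{ d^+(X_6|X_5) + d^+(X_7|X_6, X_5),\ 1 \}\big] \\
&\le d^+(X_2)\, d^+(X_4)\, d^+(X_5)\, \big[\inf\{ d^+(X_6|X_5) + d^+(X_7|X_5),\ 1 \}\big]
\end{aligned}
$$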
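To make the arithmetic of Example 4.1 concrete, the sketch below evaluates both impact bounds for made-up diameters. The numerical values are hypothetical (the elicited CPTs of the radicalisation BN are not given in the text), and the dictionary keys and the helper `impact_bound` are illustrative names only.

```python
from math import prod

# Hypothetical diameters of the CPTs along the path; not elicited values.
diam = {
    "X2|X1": 0.8,
    "X4|X2": 0.6,
    "X5|X4": 0.7,
    "X7|X5": 0.5,
    "X9|X7": 0.9,
    "X6|X5": 0.4,
    "X7|X6,X5": 0.3,
}

def impact_bound(path):
    """Product of the diameters along a path of CPTs."""
    return prod(diam[name] for name in path)

# Bound on the impact of C1 on X9: one diameter per separator on the path.
impact_x9 = impact_bound(["X2|X1", "X4|X2", "X5|X4", "X7|X5", "X9|X7"])

# The separator {X6, X7} holds two variables, so its diameter is bounded by
# the sum of the two component diameters, capped at one.
d_x6_x7 = min(diam["X6|X5"] + diam["X7|X6,X5"], 1.0)
impact_x6_x7 = impact_bound(["X2|X1", "X4|X2", "X5|X4"]) * d_x6_x7

print(impact_x9, impact_x6_x7)
```

Under these assumed values the perturbation of $C_1$ can shift the margin of $X_9$ by at most about 15% of the original variation distance, and the joint margin of $(X_6, X_7)$ by at most about 24%.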

There are various practical corollaries to the simple theorem above:

###### Corollary 4.4.1.

If $G$ is decomposable and $C_j$ lies on the minimal sequence between $C_1$ and $C_k$, then, if all attributes of interest are in $C_k$, the probabilities of $C_j$ have higher influence on $C_k$ than those of $C_1$.

As we indicated above, these bounds can be applied to any BN. We recommend following the construction below to ensure that your BN is in a suitable format to apply Theorem 4.4:

• Begin with a BN, the diameters of whose CPTs have been provisionally elicited.

• Identify a donating variable (or complete vector of donating variables) and the vector of focus variables.

• Find the ancestral set of the donating and focus variables in the BN.

• Construct the ancestral graph on this set, with the order of its vertices chosen to be compatible with the BN.

• Create a triangulated version of the ancestral graph and find its junction tree. Denote the clique containing the donating variable by $C_1$ and the clique containing the focus vector by $C_k$.

• Find the single path from clique $C_1$ to $C_k$, labelling the cliques in order $C_1, \dots, C_k$.

• Remove all variables that are not in one of these cliques.
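The first graph-theoretic steps of this construction can be sketched in a few lines. The DAG below is hypothetical: it is one structure compatible with the cliques of Example 4.1, not the confidential BN itself. The sketch also moralises the ancestral graph, which is the usual preliminary to triangulation but is an assumption here, since the construction above does not spell it out. Triangulation and junction tree construction are left to a graph library.

```python
# Hypothetical parent sets for a DAG on X1..X10 (illustration only).
parents = {
    "X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X2", "X3"],
    "X5": ["X4"], "X6": ["X5"], "X7": ["X5", "X6"],
    "X8": ["X6", "X7"], "X9": ["X7"], "X10": ["X3"],
}

def ancestral_set(parents, targets):
    """The targets together with all of their ancestors in the DAG."""
    seen, stack = set(), list(targets)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def moral_ancestral_graph(parents, nodes):
    """Undirected edges of the moralised graph induced on an ancestral set."""
    edges = set()
    for child in nodes:
        pa = list(parents.get(child, ()))
        for p in pa:
            edges.add(frozenset((p, child)))      # drop edge directions
        for i, a in enumerate(pa):
            for b in pa[i + 1:]:
                edges.add(frozenset((a, b)))      # marry co-parents
    return edges

donating, focus = ["X1"], ["X9"]
anc = ancestral_set(parents, donating + focus)
edges = moral_ancestral_graph(parents, anc)
print(sorted(anc, key=lambda name: int(name[1:])))
print(len(edges))
```

For this assumed DAG the ancestral set drops $X_8$ and $X_{10}$, which are descendants only and so cannot lie on the path from the donating to the focus clique.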

Note that these influences provide a very useful tool for prioritising the elicitation in a BN. For example, if we can obtain estimates of influence across a junction tree (either from direct elicitation of the diameters $\delta_i$ or, alternatively, after a preliminary coarse elicitation of the corresponding CPTs) then we can use these influences to identify which of those CPTs to refine. Suppose, for instance, that all the attributes making up the subvectors of variables of interest lie in a single clique. We can then follow these simple guidelines:

• Refine the elicitation of the CPTs whose attributes and parents lie in this clique,

• Elicit the CPTs associated with parents/separators with the most influence,

• Use the influence formula (Theorem 4.4) to guide the refinement of the CPTs associated with other parents or parents of parents.
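One way to turn these guidelines into a rough screening rule is to accumulate the diameters outward from the focus clique and flag only those CPTs whose cumulative impact still exceeds a chosen tolerance. This is a sketch of one possible reading of the guidelines, not a procedure given in the text; the diameters, the tolerance and the function names are hypothetical.

```python
def cumulative_impacts(diameters_from_focus):
    """Running product of diameters, ordered from the focus clique outwards.

    Each entry bounds how much of an error introduced at or beyond that
    edge can reach the focus clique.
    """
    impacts, running = [], 1.0
    for name, d in diameters_from_focus:
        running *= d
        impacts.append((name, running))
    return impacts

def cpts_to_refine(impacts, tolerance):
    """CPTs whose cumulative impact on the focus is at least the tolerance."""
    return [name for name, value in impacts if value >= tolerance]

# Hypothetical coarse diameters along a path, nearest the focus first.
path = [("X9|X7", 0.9), ("X7|X5", 0.5), ("X5|X4", 0.7),
        ("X4|X2", 0.6), ("X2|X1", 0.8)]

impacts = cumulative_impacts(path)
shortlist = cpts_to_refine(impacts, tolerance=0.3)
print(shortlist)
```

Because every diameter is at most one, the cumulative impacts are non-increasing along the path, so the shortlist is always an initial segment of CPTs closest to the focus.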

### 4.3 Approximations associated with a general BN

In a junction tree each vector has just a single parent within a given compatible ordering. Of course in the case of a BN this is no longer necessarily true. We would still like to find the impact bound of one variable on another and so annotate each of its directed edges with a value between zero and one which reflects this. The result below gives us a way of coding this impact in a useful way.

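Suppose $Y$, taking values $y \in \mathbb{Y}$, is potentially dependent on a vector $X$ of variables, taking values $x \in \mathbb{X}$. For each component $X_j$ of $X$, taking values $x_j \in \mathbb{X}_j$, let $x_{\hat{j}} \in \mathbb{X}_{\hat{j}}$ be a vector of values of the other variables $X_{\hat{j}}$. Let the CPT of $Y$ given $X$ be $P = (p_{xy})$, so that its diameter is given by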

$$d^+(P) = \frac{1}{2} \max_{x, x' \in \mathbb{X}} \Big\{ \sum_{y \in \mathbb{Y}} \big| p_{xy} - p_{x'y} \big| \Big\}.$$
###### Definition 4.2.

Let the diameter of $X_j$ be defined by

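$$d^+_j = \frac{1}{2} \max_{x_{\hat{j}} \in \mathbb{X}_{\hat{j}}}\ \max_{x_j, x_j' \in \mathbb{X}_j} \Big\{ \sum_{y \in \mathbb{Y}} \big| p_{xy} - p_{x'y} \big| \Big\}.$$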

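Here $x = (x_j, x_{\hat{j}})$ and $x' = (x_j', x_{\hat{j}})$ differ only in their $j$th component. So the diameter $d^+_j$ is the maximum extra effect that varying the value of $X_j$ can have on the distribution of $Y$ for any fixed value of the other variables. Notice in particular that

$$Y \perp\!\!\!\perp X_j \mid X_{\hat{j}} \iff d^+_j = 0.$$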
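The per-parent diameter can be computed directly from a CPT. The sketch below assumes a CPT stored as a dictionary from parent configurations to rows, with parents indexed from zero; the example CPT, in which $Y$ ignores the second parent, and the function name `diameter_j` are hypothetical.

```python
from itertools import product

def tv(p, q):
    """Total variation distance between two pmfs stored as equal-length lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def diameter_j(cpt, levels, j):
    """Diameter of parent j: vary x_j while the other parents are held fixed.

    cpt    maps each parent configuration (a tuple) to the pmf of Y.
    levels lists the number of levels of each parent.
    j      is the zero-based index of the parent being varied.
    """
    others = [range(n) for i, n in enumerate(levels) if i != j]
    best = 0.0
    for rest in product(*others):
        # Rows that share the same values of the other parents.
        rows = [cpt[rest[:j] + (xj,) + rest[j:]] for xj in range(levels[j])]
        for a in range(len(rows)):
            for b in range(a + 1, len(rows)):
                best = max(best, tv(rows[a], rows[b]))
    return best

# Hypothetical CPT with two binary parents; Y depends on the first only.
cpt = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.9, 0.1],
    (1, 0): [0.3, 0.7],
    (1, 1): [0.3, 0.7],
}
d_first = diameter_j(cpt, [2, 2], 0)
d_second = diameter_j(cpt, [2, 2], 1)
print(d_first, d_second)
```

The second parent has diameter zero because $Y$ is conditionally independent of it given the first, matching the equivalence above.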

Thus in a formal sense, $d^+_j$ is a measure of the extent by which this conditional independence is violated and the merit of knowing the value of