1 Introduction
Symbolic derivations that result in complicated expressions are encountered in many fields that work with mathematical notation. These expressions can be derived manually or they can be outputs of a computer algorithm. In both cases, the expressions may be correct but unnecessarily complex in the sense that some unrecognized identities or properties would lead to simpler expressions.
We will consider simplification in the context of causal inference in graphical models (Pearl, 2009). Advances in causal inference have led to algorithmic solutions to problems such as identifiability of causal effects and conditional causal effects (Huang and Valtorta, 2006; Shpitser and Pearl, 2006a, b), z-identifiability (Bareinboim and Pearl, 2012), transportability and meta-transportability (Bareinboim and Pearl, 2013b; Bareinboim and Pearl, 2013a), among others. The aforementioned algorithmic solutions operate symbolically on the joint distribution of the variables of interest and return expressions for the desired queries. These algorithms have been previously implemented in the R package causaleffect (Tikka and Karvanen, 2017). Another implementation of an identifiability algorithm can be found in the CIBN software by Jin Tian and Lexin Liu, freely available from http://web.cs.iastate.edu/~jtian/Software/CIBN.htm. However, the algorithms themselves are imperfect in the sense that they often output an expression that is complicated and far from ideal. The question is whether there exists a simpler expression that is still a solution to the original problem.
Simplification of expressions may provide significant benefits. First, a simpler expression can be understood and reported more easily. Second, evaluating a simpler expression will be less of a computational burden due to reduced dimensionality of the problem. Third, in situations where estimation of causal effects is of interest and missing data is a concern, eliminating variables with missing data from the expression has clear advantages. The same applies to variables with measurement error.
We begin by presenting in Section 2 a general form of probabilistic expressions that are often encountered in causal inference. In this paper probabilistic expressions are formed by products of nonparametric conditional distributions of some variables and summations over the possible values of these variables. Simplification in this case is the process of eliminating terms from these expressions by carrying out summations. As our expressions correspond to causal effects, the expressions themselves take a specific form.
Causal models are typically associated with a directed acyclic graph (DAG) which represents the functional relationships between the variables of interest. In situations where the joint distribution is faithful, meaning that no additional conditional independences are generated by the joint distribution (Spirtes et al., 2000), the conditional independence properties of the variables can be read from the graph itself through a concept known as d-separation (Geiger et al., 1990). We will use d-separation as our primary tool for operating on the probabilistic expressions. The reader is assumed to be familiar with a number of graph theoretic concepts that are used throughout the paper and explained, for example, in Koller and Friedman (2009).
Our simplification procedure is built on the definition of simplification sets, which is presented in Section 3. We continue by introducing a sound and complete simplification algorithm for probabilistic expressions defined in Section 2 for which these simplification sets exist. The algorithm takes as an input the expression to be simplified and the graph induced by the underlying causal model, and proceeds to construct a joint distribution of the variables contained in the expression by using the d-separation criteria. Higher level algorithms that use this simplification procedure are presented in Section 4. These include an algorithm for the simplification of a nested expression and an algorithm for the simplification of a quotient of two expressions. Section 5 contains examples of the application of these algorithms. We have also updated the causaleffect R package to automatically apply these simplification procedures to causal effect expressions.
As a motivating example we present an expression of a causal effect given by the ID algorithm of Shpitser and Pearl (2006a) that can be simplified. The complete derivation of this effect can be found in Appendix C.
The causal effect of on and is identifiable in the graph of Figure 1 and application of the ID algorithm gives
It turns out that there exists a significantly simpler expression,
(1) 
for the same causal effect. This expression can be obtained without any knowledge of the underlying model by using standard probability manipulations. However, this requires that a favorable choice is made for the ordering of the nodes of the graph in the ID algorithm. In the case that we had chosen an ordering where
precedes , the term for would instead be and simplification would require knowledge about the underlying graph. We will take another look at this example later in Section 5 where we describe in detail how our procedure can be used to find expression (1).
Our simplification procedure is different from the well-known exact inference method of minimizing the amount of numerical computations when evaluating expressions for conditional and marginal distributions by changing the order of summations and multiplications in the expression. Variants of this method are known by different names depending on the context, such as Bayesian variable elimination (Koller and Friedman, 2009) and the sum-product algorithm (Bishop, 2006), which is a generalization of belief propagation (Pearl, 1988; Lauritzen and Spiegelhalter, 1988). Efficient computational methods exist for causal effects as well, such as that of Shpitser et al. (2011). The general principle is the same in all of the variants, and no symbolic simplification is performed.
In our setting simplification can be defined explicitly, but in general it is difficult to say what makes one expression simpler than another. Carette (2004) provides a formal definition for simplification in the context of Computer Algebra Systems (CAS) that operate on algebraic expressions. Modern CAS such as Mathematica (Wolfram Research Inc., 2015) and Maxima (Maxima, 2014) implement techniques for symbolic simplification. Bailey et al. (2014) and references therein discuss simplification techniques in CAS further. However, to the best of our knowledge, the symbolic simplification procedures for probabilistic expressions described in this paper have neither been given previous attention nor implemented in any existing system.
2 Probabilistic Expressions
Every expression that we consider is defined in terms of a set of variables
. As we are interested in probabilistic expressions, we also assume a joint probability distribution
for the variables of . The most basic expressions are called atomic expressions, which will be the main focus of this paper.
[Atomic expression] Let be a set of discrete random variables and let be any joint distribution of . An atomic expression is a pair
where

is a set of pairs such that for each and it holds that , , and for .

is a set such that for each it holds that for some .
The value of an atomic expression is
The probabilities are referred to as the terms of the atomic expression. A term is said to contain a variable if or . A term for a variable refers to a term . We also use the shorthand notation . As is a set, we will only sum over a certain variable once. All variables are assumed to be univariate and discrete for clarity, but we may also consider multivariates and situations where some of the variables are continuous and the respective sums are interpreted as integrals instead.
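As a rough numerical illustration of the definition above, the following sketch evaluates an atomic expression by brute force. It assumes that an atomic expression can be stored as a list of (variable, conditioning set) pairs together with a set of summation variables, and that its value is the sum over those variables of the product of the corresponding conditional probabilities. The joint distribution, the variable names and the helper functions (`joint`, `prob`, `cond`, `atomic_value`) are hypothetical and chosen only for this example.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables with the
# structure X -> Y -> Z; the probability tables are made up.
VARS = ("X", "Y", "Z")


def joint(x, y, z):
    px = 0.3 if x else 0.7
    py = (0.8 if y else 0.2) if x else (0.4 if y else 0.6)
    pz = (0.9 if z else 0.1) if y else 0.5
    return px * py * pz


def prob(assign):
    """Marginal probability of a partial assignment {name: value}."""
    total = 0.0
    for vals in product((0, 1), repeat=len(VARS)):
        full = dict(zip(VARS, vals))
        if all(full[k] == v for k, v in assign.items()):
            total += joint(*vals)
    return total


def cond(var, given, assign):
    """P(var | given), evaluated at the values stored in `assign`."""
    num = prob({k: assign[k] for k in (var, *given)})
    den = prob({k: assign[k] for k in given})
    return num / den


def atomic_value(terms, summed, fixed):
    """Sum over `summed` of the product of P(v | c) for (v, c) in `terms`."""
    summed = sorted(summed)
    total = 0.0
    for vals in product((0, 1), repeat=len(summed)):
        assign = {**fixed, **dict(zip(summed, vals))}
        p = 1.0
        for var, given in terms:
            p *= cond(var, given, assign)
        total += p
    return total


# sum_y P(z | y) P(y | x) evaluated at x = 1, z = 1 ...
lhs = atomic_value([("Z", ("Y",)), ("Y", ("X",))], {"Y"}, {"X": 1, "Z": 1})
# ... agrees with the single term P(z | x), since Z is independent of X given Y here.
rhs = cond("Z", ("X",), {"X": 1, "Z": 1})
```

In this toy distribution the two-term expression with a summation and the single term without one have the same value, which is the kind of equivalence the simplification procedure aims to detect symbolically rather than numerically.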
As an example we will construct an atomic expression describing the following formula
which is a part of the motivating example in the introduction. We let , which is the set of nodes of the graph of Figure 1. The sets and can now be defined as
respectively. Next we define a more general probabilistic expression.
[Expression] Let be a set of variables and let be the joint distribution of . An expression is a triple
where

is a subset of .

For , is a set of atomic expressions
If then .

For , is a set of expressions
such that , for all . If then .
The value of an expression is
where an empty product should be understood as being equal to 1.
The recursive definition ensures the finiteness of the resulting expression by requiring that each subexpression has fewer subexpressions of its own than the expression above it. A single value might be shared by multiple expressions, as the terms of the product in the value of the expression are exchangeable. Expressions and are equivalent if their values and are equal for all . Equivalence is defined similarly for atomic expressions. Every expression is formed by nested atomic expressions by definition. Because of this, we focus on the simplification of atomic expressions.
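The recursive structure can be sketched in code as follows. The sketch assumes that an expression is stored as a triple of a summation set, a list of atomic expressions and a list of subexpressions, and that its value is the sum over the summation set of the product of the values of its components, with an empty product equal to 1. Atomic expressions are represented here simply as functions of an assignment; this representation and the name `expression_value` are assumptions made for illustration.

```python
from itertools import product
from math import prod


def expression_value(expr, assign, domain=(0, 1)):
    """Recursively evaluate a nested expression at the given assignment."""
    summed, atomics, subexpressions = expr
    summed = sorted(summed)
    total = 0.0
    # With an empty summation set the loop runs exactly once.
    for vals in product(domain, repeat=len(summed)):
        local = {**assign, **dict(zip(summed, vals))}
        value = prod(a(local) for a in atomics)  # empty product is 1
        value *= prod(expression_value(s, local, domain) for s in subexpressions)
        total += value
    return total


# Hypothetical conditional distribution P(Y = y | X = x) for binary X and Y.
def p_y_given_x(a):
    return 0.8 if a["Y"] == a["X"] else 0.2


inner = (set(), [p_y_given_x], [])  # no summation, one atomic expression
outer = ({"Y"}, [], [inner])        # sums the subexpression over Y

total_mass = expression_value(outer, {"X": 1})
empty_value = expression_value((set(), [], []), {})
```

Because each recursive call descends into a strictly smaller structure, the evaluation terminates, mirroring the finiteness argument above.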
As an example we construct an expression for the causal effect formula (1). We define and let the sets and be empty. We define the set to consist of three atomic expressions and defined as follows
In the context of probabilistic graphical models, we are provided additional information about the joint distribution of the variables of interest in the form of a DAG. As we are concerned with the simplification of the results of causal effect derivations in such models, the general form of the atomic expressions can be further narrowed down by using the structure of the graph and the ordering of vertices called a topological ordering.
[Topological ordering] A topological ordering of a DAG is an ordering of its vertices, such that if is an ancestor of in then in .
The symbol is used to denote the subset of vertices of that are less than in . For sets we may define to contain those vertices of that are less than every vertex of in . Consider a DAG and a topological ordering of its vertices. We use the notation to denote indexing over the vertex set of in the ordering given by , that is where . For any atomic expression such that we also define the induced ordering . This ordering is an ordering of the variables in such that if in then also in . From now on in this paper, any indexing over the variables of an atomic expression will refer to the induced ordering of the set when is given, i.e., in . In other words, is obtained from by leaving out variables that are not contained in .
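A minimal sketch of these ordering concepts, using the standard library module `graphlib` (Python 3.9 or later), is given below. The DAG, the vertex names and the helper functions `induced_ordering` and `vertices_before` are hypothetical and serve only to illustrate how an ordering of the full vertex set induces an ordering on the variables of an expression.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG, stored as a mapping from each vertex to its parents.
parents = {"X": set(), "Z": {"X"}, "W": set(), "Y": {"Z", "W"}}

# One topological ordering: every vertex appears after all of its ancestors.
order = list(TopologicalSorter(parents).static_order())
position = {v: i for i, v in enumerate(order)}


def induced_ordering(order, variables):
    """Restrict an ordering to `variables`, keeping the relative order."""
    return [v for v in order if v in variables]


def vertices_before(order, v):
    """Vertices that are less than `v` in the ordering."""
    return order[: order.index(v)]


# Ordering induced on the variables of an expression that contains X and Y only.
induced = induced_ordering(order, {"X", "Y"})
```

A DAG generally admits several topological orderings, so the sketch fixes one of them and derives everything else from it, in the same way the text fixes a single ordering for the graph.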
The ID algorithm performs the so-called C-component factorization. These components are subgraphs of the original graph where every node is connected by a path consisting entirely of bidirected edges. The resulting expressions of these factors serve as the basis for our simplification procedure.
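The vertex sets of such components can be found as connected components of the graph restricted to its bidirected edges. The following is a minimal sketch under that reading; the function name `c_components` and the example graph are hypothetical, and the sketch is not taken from any implementation of the ID algorithm.

```python
def c_components(vertices, bidirected):
    """Partition `vertices` into sets connected by bidirected paths."""
    adjacent = {v: set() for v in vertices}
    for a, b in bidirected:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, components = set(), []
    for v in vertices:
        if v in seen:
            continue
        # Depth-first search along bidirected edges only.
        stack, component = [v], set()
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adjacent[u] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical graph with bidirected edges X <-> Y and Y <-> W.
components = c_components(["X", "Z", "Y", "W"], [("X", "Y"), ("Y", "W")])
```

A vertex without any bidirected edges, such as Z in this example, forms a component of its own.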
[Topological consistency] Let be a DAG with a subgraph and let be a topological ordering of the vertices of . An atomic expression is topologically consistent (or consistent for short) if
Here denotes the ancestors of in . To motivate this definition we note that the outputs of the algorithms of Shpitser and Pearl (2006a, b) can always be represented by using products and quotients of topologically consistent atomic expressions. An expression is topologically consistent when every atomic expression contained by it is topologically consistent with respect to a topological ordering of a subgraph. We provide a proof for this statement in Appendix A. This also shows that any manual derivation of a causal effect can always be represented by a topologically consistent expression. The assumption that is not necessary for the simplification to be successful. This assumption is used to speed up the performance of our procedure in Section 3.
3 Simplification
Simplification in our context is the procedure of eliminating variables from the set of variables that are to be summed over in expressions. In atomic expressions, a successful simplification in terms of a single variable should result in another expression that holds the same value, but with the respective term eliminated and the variable removed from the summation. As we are interested in causal effects, we consider only simplification of topologically consistent atomic expressions.
Our approach to simplification is that the atomic expression has to represent a joint distribution of the variables present in the expression to make the procedure feasible. The question is whether the expression can be modified to represent a joint distribution. Before we can consider simplification, we have to define this property explicitly.
[Simplification sets] Let be a DAG and let be a subgraph of over a vertex set with a topological ordering . Let , where , be a consistent atomic expression and let . Suppose that and that and let be the set
If there exists a set and the sets for all such that the conditional distribution of the variables can be factorized as
(2) 
and
(3) 
then the sets and are the simplification sets of with respect to .
This definition is tailored for the next result that can be used to determine the existence of a simpler expression when simplification sets exist. Afterwards we will show how this result can be applied in practice via an example. The definition characterizes consistent atomic expressions that represent joint distributions. It is apparent that simplification sets are not always unique, which can lead to different but still simpler expressions. Hence the next result considers simplification in terms of a single variable. The proof is available in Appendix B.
[Simplification] Let be a DAG and let be a subgraph of over a vertex set with a topological ordering . Let be a consistent atomic expression and let and be its simplification sets with respect to a variable . Then there exists an expression such that , and no term in contains .
Note that even if in Definition 3, the existence of simplification sets still requires that = . In many cases there exist variables such that the expression does not contain a term for . Condition (2) of Definition 3 guarantees that if these terms were contained in the expression it would represent a joint distribution. Our goal is thus to introduce these terms into the original expression temporarily, carry out the desired summation, and finally remove the added terms. This can only be achieved if the variables in the set are conditionally independent of the variable currently being summed over, hence the assumption of condition (3) of Definition 3.
We show how simplification sets can be used in practice to derive a simpler expression via an example. We consider the causal effect of on in the graph of Figure 2.
The effect in question is identifiable and the ID algorithm readily gives atomic expression
We consider simplification sets with respect to . The topological order is . The atomic expression does not contain a term for so we have . By noting that we are able to satisfy condition (3) of Definition 3. We can write
as required by condition (2) of Definition 3 by setting . Thus, the simplification sets and for the atomic expression with respect to are and , respectively. Finally, we obtain the simpler atomic expression by carrying out the summation over :
Neither Definition 3 nor Theorem 3 provide a method to obtain simplification sets or to determine whether they exist in general. To solve this problem we present a simplification algorithm for consistent atomic expressions that operates by constructing simplification sets iteratively for each variable in the summation set.
Algorithm 1 always attempts to perform maximal simplification, meaning that as many variables of the set are removed as possible. If the simplification in terms of the entire set cannot be completed, the intermediate result with as many variables simplified as possible is returned. If simplification in terms of specific variables or a subset is preferred, the set should be defined accordingly.
The function simplify takes three arguments: an atomic expression that is to be simplified, a graph and a topological ordering of its vertices. is assumed to be consistent.
On line 10 the function index.of returns the corresponding index of the term containing . Since is consistent, we only have to iterate through the variables as the terms outside this range contain no relevant information about the simplification of . The variables without a corresponding term in the atomic expression are retrieved on line 11 by the function get.missing. This function returns the set of Definition 3 with respect to the current variable to be summed over.
In order to show that the terms of represent some joint distribution, we proceed in the order dictated by the topological ordering of the vertices. The sets and keep track of the variables that have been successfully processed and of the conditioning set of the joint term that was constructed on the previous iteration. Similarly, the sets and keep track of the variables and conditioning sets of the corresponding variables that the atomic expression does not originally contain a term for. Iteration through relevant terms begins on line 13. Next, we take a closer look at the function join, which is called on line 14.
Here denotes the power set, denotes the symmetric difference and denotes the ancestors with the argument included. The function join attempts to combine the joint term , obtained from the previous iteration steps, with the term of the current iteration step. The d-separation statements of are evaluated to determine whether this can be done. In practice this means finding a suitable subset of , where is the largest possible conditioning set of the new combined term. The set is computed on line 4 of Algorithm 2. A valid subset satisfies and which allow us to write the product as .
In order to find this valid subset, we compute the sets and for each candidate on lines 8 and 9. These sets characterize the necessary change in the conditioning sets of the terms and that would enable a joint term to be formed by these two terms. The validity of the candidate set is finally checked on line 10, which determines if the necessary change is allowed by d-separation criteria in the graph . If no valid subset can be found, we can still attempt to insert a missing variable of by calling insert. If this does not succeed either, the original sets and are returned, which instructs simplify to terminate simplification in terms of and attempt simplification in the next variable.
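The d-separation checks that the procedure relies on can be sketched with the standard moralized ancestral graph criterion. The sketch below assumes a DAG without bidirected edges, stored as a mapping from each vertex to its parents; latent confounders would have to be added as explicit vertices. The function names `ancestors_incl` and `d_separated` are hypothetical, and the code is an illustration rather than the implementation used in Algorithm 2.

```python
def ancestors_incl(parents, nodes):
    """Ancestors of `nodes` in the DAG, with the nodes themselves included."""
    result, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in result:
            result.add(v)
            stack.extend(parents[v])
    return result


def d_separated(parents, xs, ys, zs):
    """True if `xs` and `ys` are d-separated by `zs` in the DAG."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors_incl(parents, xs | ys | zs)
    # Moralize the ancestral subgraph: link each child to its parents
    # and link all parents of a common child to each other.
    adjacent = {v: set() for v in keep}
    for child in keep:
        ps = list(parents[child])
        for i, p in enumerate(ps):
            adjacent[child].add(p)
            adjacent[p].add(child)
            for q in ps[i + 1:]:
                adjacent[p].add(q)
                adjacent[q].add(p)
    # Search from xs in the undirected graph with the vertices of zs removed.
    stack, seen = list(xs - zs), set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(adjacent[v] - zs)
    return not (seen & ys)


# Hypothetical chain X -> Z -> Y and collider X -> C <- Y.
chain = {"X": set(), "Z": {"X"}, "Y": {"Z"}}
collider = {"X": set(), "Y": set(), "C": {"X", "Y"}}

chain_blocked = d_separated(chain, {"X"}, {"Y"}, {"Z"})
chain_open = d_separated(chain, {"X"}, {"Y"}, set())
collider_blocked = d_separated(collider, {"X"}, {"Y"}, set())
collider_open = d_separated(collider, {"X"}, {"Y"}, {"C"})
```

Conditioning on the middle vertex blocks the chain, whereas conditioning on the collider opens the path between its parents, which is why the choice of conditioning set matters when terms are combined.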
A special case where the first variable of the joint distribution forms alone is processed on line 2 of Algorithm 2. In this case, we have an immediate result without having to iterate through the subsets of . The formulation of the set ensures that the resulting factorization is consistent if it exists. Knowing that the ancestral set has to be a subset of the new conditioning set also greatly reduces the number of subsets we have to iterate through. In a typical situation, the size of is not very large. Let us now inspect the insertion procedure in greater detail.
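The reduction in the number of candidates can be illustrated with a short sketch that enumerates only those subsets of a universe that contain a required subset. The function name `candidate_sets` and the example sets are hypothetical; the sketch shows the counting argument only and does not reproduce the candidate construction of Algorithm 2.

```python
from itertools import combinations


def candidate_sets(universe, required):
    """Yield every subset of `universe` that contains all of `required`."""
    required = set(required)
    free = sorted(set(universe) - required)
    for size in range(len(free) + 1):
        for extra in combinations(free, size):
            yield required | set(extra)


# With four variables, two of which are required, only 2**2 = 4 of the
# 2**4 = 16 subsets remain as candidates.
restricted = list(candidate_sets({"A", "B", "C", "D"}, {"A", "B"}))
unrestricted = list(candidate_sets({"A", "B", "C", "D"}, set()))
```

Each required variable halves the number of subsets that need to be examined.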
In essence, the function insert is a simpler version of join, because the only restriction on the conditioning set of is imposed by the conditioning set of and the fact that has to be conditionally independent of the current variable to be summed over. If join or insert was unsuccessful in forming a new joint distribution, we have that . In this case simplification in terms of the current variable cannot be completed. If we have that the iteration continues.
Together the functions join and insert capture the two conditions of Definition 3. They are essentially two variations of the underlying procedure of determining whether the terms of the atomic expression actually represent a joint distribution. The only difference is that join is called when we are processing terms that already exist in the expression, and insert is called when there are variables without corresponding terms in the expression, that is the set of Definition 3 is not empty.
If the innermost while-loop of Algorithm 1 succeeded in iterating through the relevant variables, we are ready to complete the simplification process in terms of . We carry out the summation over which results in . This is done on line 27 by calling which checks whether the joint term can be factorized back into a product of terms. In practice this means that if the function succeeds, it will return an atomic expression obtained by removing each inserted term such that and from atomic expression . The status of the atomic expression is updated on lines 31 and 32 to reflect this. If the function fails, it will return unchanged.
If the innermost while-loop did not iterate completely through the relevant variables, the simplification was not successful in terms of at this point. In this case we reset to its original state on line 29 and attempt simplification in terms of the next variable. If there are no further variables to be eliminated, the outermost while-loop will also terminate. In the next theorem, we show that Algorithm 1 is both sound and complete in terms of simplification sets. The proof for the theorem can be found in Appendix D.
Let be a DAG and let be a subgraph of over a vertex set with a topological ordering . Let be a consistent atomic expression. Then if succeeds, it has constructed a collection of simplification sets of with respect to . Conversely, if there exists a collection of simplification sets of with respect to , then will succeed.
4 High Level Algorithms
In this section, we present an algorithm to simplify all atomic expressions in the recursive stack of an expression. We will also provide a simple procedure to simplify quotients defined by two expressions: one representing the numerator and another representing the denominator. In some cases it is also possible to eliminate the denominator by subtracting common terms. First, we present a general algorithm to simplify topologically consistent expressions.
Algorithm 4 begins by simplifying all atomic expressions contained in the expressions. If an atomic expression contains no summations after the simplification but does contain multiple terms, each individual term is converted into an atomic expression of its own. After this, we iterate through all subexpressions contained in the expression. The purpose of this is to carry out the simplification of every atomic expression in the stack and collect the results into as few atomic expressions as possible. First, we traverse to the bottom of the stack on line 8 by deconstructing subexpressions until they have no subexpressions of their own. Afterwards, it must be the case that consists of atomic subexpressions only.
If contains no summations on line 9 then the atomic expressions contained in this expression do not require an additional expression to contain them, but can instead be transferred to be a part of the expression above the current one in the recursive stack. On line 6 we lift the atomic expressions contained in the atomic subexpressions up to the current recursion stage.
There is no guarantee that the resulting atomic expression is still consistent after this procedure. The function deconstruct operates on the principle of simplifying as many atomic expressions as possible, combining the results into new atomic expressions and simplifying them once more. We do not claim that this procedure is complete in the sense that Algorithm 4 would always find the simplest representation for a given expression. This method is nonetheless sound and finds drastically simpler expressions in almost every situation where such an expression exists.
We may also consider quotients often formed by deriving conditional distributions. For this purpose we need a subroutine to extract terms from atomic subexpressions that are independent of the summation index, that is and .
The procedure of Algorithm 5 is rather straightforward. First, we attempt to simplify by using deconstruct on line 2. Next, we simply recurse as deep as possible without encountering a sum in an expression. If a sum is encountered, extraction is attempted. At any stage where a sum was not encountered, we may still have atomic subexpressions that contain sums. Because the recursion has reached this far, we know that there are no summations above them in the stack, so we can attempt extraction on them as well.
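The core of the extraction step, separating the terms that do not involve any summation variable from those that do, can be sketched as follows. The representation of terms as (variable, conditioning set) pairs and the function name `split_terms` are assumptions made for this illustration; the recursion over nested expressions performed by Algorithm 5 is not reproduced.

```python
def split_terms(terms, summed):
    """Split terms into those free of the summation variables and the rest."""
    summed = set(summed)
    outside, inside = [], []
    for var, given in terms:
        if var in summed or summed & set(given):
            inside.append((var, given))   # must stay under the sum
        else:
            outside.append((var, given))  # can be moved outside the sum
    return outside, inside


# Hypothetical terms P(y | x, z), P(z | x) and P(w | x) with a sum over z.
terms = [("Y", ("X", "Z")), ("Z", ("X",)), ("W", ("X",))]
outside, inside = split_terms(terms, {"Z"})
```

Only the term for W can be moved outside the sum here, so it becomes available for cancellation against a matching term in the other part of a quotient.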
5 Examples
In this section we present examples of applying the algorithms of the previous sections. We denote line i of Algorithm j by Aj:i. We begin with a simple example on the necessity of the insert procedure in the graph of Figure 3.
The causal effect of on is identifiable in this graph, and expression
is obtained by direct application of the ID algorithm or by the truncated factorization formula for causal effects in Markovian models (Pearl, 2009). We let be this atomic expression. The topological ordering is and . The call to will first attempt simplification in terms of , by calling
which results in . At the second call
we already run into trouble since we cannot find a conditioning set that would allow to be joined with . However, since is nonempty and and this means that the next call is
Insertion fails in this case, as one can see from the fact that no conditioning set exists that would make conditionally independent of . Thus we recurse back to join and back to simplify and end up on line A1:15, which breaks out of the while-loop. Thus cannot be simplified in terms of . Simplification is attempted next in terms of . The first two calls are in this case
and in the second call we run into trouble again and have to attempt insertion
This time we find that we can add a term for which is because . The other calls to join also succeed and we can write the value of as
and complete the summation in terms of . After the call to factorize we are left with the final expression
We continue by considering again the graph depicted in Figure 1. The topological ordering is . The atomic expression given by
is a part of the expression to be simplified.
We will first simplify and take a closer look at how the function join operates. The call to will attempt simplification in terms of the set in the ordering that agrees with the topological ordering , which is . After initializing the required sets, we find the index of the term with as a variable on line 10. There is one missing variable, , so as returned by get.missing on line A1:11. The first call to join results in , because line A2:3 is triggered. The condition on line A1:15 is not satisfied since . Thus we update the status of and on lines A1:18 and A1:19. Since on line A1:20 we do not have to update the status of and on lines A1:21, A1:22 and A1:23. The innermost while-loop is now complete and we call factorize on line A1:27, which succeeds in removing the term by completing the sum. Now we update the status of the atomic expression on line A1:31 and remove from the set of variables to be summed over on line A1:32. The resulting value of the expression at this point is
Next, the summation in terms of is attempted. join is once again successful, because is the first variable to be joined and line A2:3 is triggered. Next we attempt to join the terms and . Computation of the set on line A2:4 results in
The power set computed on line A2:5 contains only the empty set. For we have
on line 8, and
on line 9. The condition on line A2:10 evaluates to true and we return with . The innermost while-loop terminates, allowing the summation over to be performed. The function factorize provides us with the final expression
(4) 
Next, we will consider the full example and see how simplify is applied. Using the ID algorithm we obtain the causal effect of on and in graph of Figure 1 and it is
We will represent this as a quotient of expressions using Definition 2. Let be the atomic expression of the previous example and let also be an atomic expression given by
which is essentially the same as , but with the variable removed from the summation set . Similarly, we let be an atomic expression given by
We also define the atomic expressions with the value and with the value . Now, we define two expressions and for the quotient as follows:
We now call . First, we must trace the calls to extract for both expressions on lines A6:2 and A6:3. For and this immediately results in a call to deconstruct on line A5:2. First, the function applies simplify to each atomic expression contained in the expressions on line A4:4.
Let us first consider the simplification of . As before with , we have that join first succeeds in forming , but this time is not in the summation set, so we continue. Next, the algorithm attempts to join with . The set