Probabilistic graphical models Geman and Geman (1984); Pearl (1988); Richardson and Domingos (2006); Koller and Friedman (2009) are a powerful formalism for reasoning under uncertainty. However, these models typically represent precise information by encoding a single probability distribution over the variables of interest.
Real-world applications often deal with imprecise information. For example, the exact probability of an event may be hard to assess and thus it may be given by a probability interval, i.e., lower and upper probability bounds. Similarly, the precise dependency between two events may not be known and thus it may be described by an interval for a conditional probability. In general, there may be multiple sources of imprecise information: e.g., a Bayesian network learned from data as well as knowledge from human experts in the form of logic formulas annotated with confidence scores or probability intervals. Aggregating multiple sources of imprecise information in an effective manner is therefore critical for inference tasks.
Over the past decades, numerous formalisms have been proposed to represent imprecise information Zadeh (1965); Shafer (1976); Nilsson (1986); Fagin, Halpern, and Megiddo (1990); Heinsohn (1994); Jaeger (1994); Andersen and Hooker (1994); Chandru and Hooker (1999); Cozman (2000); Cano and Moral (2002); Dürig and Studer (2005); De Raedt, Kimmig, and Toivonen (2007); Cozman and Polastro (2008); Lukasiewicz (2008); Riegel et al. (2020). Some of these formalisms do not use probability functions to represent uncertainty.
This paper builds on earlier proposals for probabilistic logic Nilsson (1986); Fagin, Halpern, and Megiddo (1990); Nilsson (1994), credal networks Andersen and Hooker (1994); Cozman (2000); Cano and Moral (2002) and their variants Cozman and Polastro (2008, 2009). A common feature of these formalisms is that imprecise information is typically expressed by upper and lower probability bounds, which are treated as constraints that must be satisfied by each probability distribution in the entailed set of distributions. In many practical situations, these upper and lower probability values have intuitive frequentist interpretations and therefore can be naturally incorporated in a probabilistic logic program Nilsson (1986); Cozman and Polastro (2008).
We highlight two important yet contrasting features of the above formalisms. On the one hand, some probabilistic logics Nilsson (1986); Fagin, Halpern, and Megiddo (1990); Nilsson (1994) impose few restrictions on the logic formulas and therefore are quite expressive. However, they lack independence declarations, which leads to excessively wide intervals in the inference results. Consequently, they are less useful for decision making in real-world applications. On the other hand, credal networks and their variants inherit the Markov condition of Bayesian networks Pearl (1988). Yet these models either can express only the structure of a Bayesian network Andersen and Hooker (1994); Cozman (2000); Cano and Moral (2002) or require acyclicity and other strong restrictions on logic formulas Cozman and Polastro (2008, 2009).
Contribution: In this paper, we present Logical Credal Networks (LCN), a new probabilistic logic model designed to exploit the best of both worlds. On the one hand, the model allows probability bounds and conditional probability bounds for arbitrary propositional and finite-domain first-order logic formulas without requiring acyclicity. On the other hand, we define a Markov condition that allows random variables to have cyclic dependencies, and we show that our proposed Markov condition matches the Markov condition in Bayesian/credal networks for acyclic graphs. Subsequently, we describe exact and approximate inference algorithms to compute the posterior probability of a query formula. We evaluate the proposed model on a maximum a posteriori (MAP) inference task using benchmark problems derived from Mastermind games with uncertainty and a realistic credit card fraud detection application. Our experimental results are quite promising and show that the proposed method outperforms existing approaches. In particular, they highlight the ability of LCNs to aggregate multiple sources of imprecise information in an effective manner.
2.1 Bayesian and Credal Networks
A Bayesian network (BN) Pearl (1988) is defined by a tuple $\langle X, D, G, P \rangle$, where $X = \{X_1, \ldots, X_n\}$ is a set of variables over multi-valued domains $D = \{D_1, \ldots, D_n\}$, $G$ is a directed acyclic graph (DAG) over $X$ as nodes, and $P = \{P_i\}$, where $P_i = P(X_i \mid \Pi_i)$ are conditional probability tables (CPTs) associated with each variable $X_i$ and $\Pi_i$ are the parents of $X_i$ in $G$. The Markov condition in BNs states that each node is independent of its non-descendants given its parents. Consequently, the joint probability distribution is $P(X) = \prod_{i=1}^{n} P(X_i \mid \Pi_i)$.
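The factorization implied by the Markov condition can be illustrated with a minimal sketch. The three-node chain, the variable names, and the CPT values below are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical chain A -> B -> C over binary variables (not from the paper).
parents = {"A": (), "B": ("A",), "C": ("B",)}

# cpt[var][parent_values] = P(var = True | parents = parent_values)
cpt = {
    "A": {(): 0.3},
    "B": {(True,): 0.9, (False,): 0.2},
    "C": {(True,): 0.5, (False,): 0.1},
}

def joint(assignment):
    """Product of P(X_i | parents of X_i) over all variables."""
    prob = 1.0
    for var, pa in parents.items():
        p_true = cpt[var][tuple(assignment[p] for p in pa)]
        prob *= p_true if assignment[var] else 1.0 - p_true
    return prob

# Summing the joint over all interpretations should give 1.
names = list(parents)
total = sum(joint(dict(zip(names, values)))
            for values in product([True, False], repeat=len(names)))
```

Because every CPT row is a proper distribution, the products sum to one over all assignments, which is a quick sanity check on any hand-written network.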
Bayesian networks are typically formulated and learned using a combination of expert opinion and tabular data. Prior work has also studied other forms of knowledge to construct Bayesian networks, besides structural information and point estimates of conditional probabilities. For instance, there is a line of research on exploiting qualitative information to learn the parameters of a Bayesian network Wellman (1990); Druzdzel and van der Gaag (1995); Wittig and Jameson (2000).
Credal networks Andersen and Hooker (1994); Cozman (2000); Cano and Moral (2002) extend Bayesian networks to deal with imprecise probabilities. A credal set is a set of probability distributions. A credal network (CN) is defined by a pair , where is a DAG over discrete variables and is a set of conditional credal sets each one associated with a variable and its parents in . We consider separately specified credal networks where each variable and each configuration of has a conditional credal set which is specified separately from all others.
The following Markov condition for credal networks implies that two variables $X_i$ and $X_j$ are independent iff the vertices of the credal set factorize, i.e., each distribution $P$ that is a vertex of the set satisfies $P(X_i, X_j) = P(X_i) P(X_j)$ for all values of $X_i$ and $X_j$ (and likewise for conditional independence). Similar to how Bayesian networks specify a joint probability distribution over variables $X$, credal networks can be used to specify a set of joint probability distributions over $X$. The largest extension of a credal network that complies with the Markov condition above is called the strong extension of the network Cozman (2000).
A common representation of credal sets, which we also assume in this paper, is probability intervals.
2.2 Propositional and First-Order Theories
In propositional logic, propositions can take only two values and are denoted by lowercase letters such as and . Propositional literals (i.e., , ) stand for being True or being False. In first-order logic (FOL), a term is a variable, a constant, or a function applied to terms. An atom (or atomic formula) is either a proposition or a predicate of arity where the are terms. A formula (either propositional or FOL) is built out of atoms using universal and existential quantifiers (for FOL) and the usual logical connectives , , and , respectively.
2.3 Probabilistic Logics
Syntactically, a probabilistic logic program is a set of logic formulas (either propositional or FOL), each one of them being annotated with a probability value Nilsson (1986). Let $(q, p)$ be a pair such that $q$ is a formula and $p$ is the associated probability value. The semantics is the set of probability distributions over all interpretations such that $P(q) = p$. Point values can be easily replaced by intervals. Each inference is a pair of optimization problems to compute the upper and lower bounds of the probability of interest. The logic in Fagin, Halpern, and Megiddo (1990) is more general and can, for example, represent bounds on conditional probabilities, which are important in real-world tasks Nilsson (1994). (Footnote 1: Throughout this paper, we use as shorthand notation for and for .)
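To make the semantics concrete, the sketch below checks whether a candidate joint distribution over interpretations satisfies a small annotated program. The two atoms, the formulas, and all numeric values are hypothetical; the helper names `probability` and `is_model` are introduced here only for illustration.

```python
ATOMS = ("a", "b")

def probability(dist, formula):
    """P(formula): total mass of the interpretations that satisfy it."""
    return sum(p for world, p in dist.items()
               if formula(dict(zip(ATOMS, world))))

def is_model(dist, program, tol=1e-9):
    """True if every annotated formula (formula, lower, upper) holds in dist."""
    return all(lo - tol <= probability(dist, f) <= hi + tol
               for f, lo, hi in program)

# Hypothetical program: P(a) = 0.6 and 0.7 <= P(a or b) <= 0.9.
program = [
    (lambda w: w["a"], 0.6, 0.6),
    (lambda w: w["a"] or w["b"], 0.7, 0.9),
]

# Two candidate distributions over the four interpretations of (a, b).
dist_ok = {(True, True): 0.3, (True, False): 0.3,
           (False, True): 0.2, (False, False): 0.2}
dist_bad = {(True, True): 0.25, (True, False): 0.25,
            (False, True): 0.25, (False, False): 0.25}
```

Here `dist_ok` belongs to the set of distributions that the program denotes, while `dist_bad` violates the point value on `a`. Inference then amounts to optimizing a query probability over all distributions that pass this check.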
A major weakness of these probabilistic logics is the lack of a Markov condition: there are no independence relations that are implied in a probabilistic logic program. Consider, for example, the following simple logic program:
where are atomic formulas. Following Nilsson (1986), computing the bounds on results in the interval . Indeed, there exists a joint distribution over such that (1) and (2) are satisfied and that is always false, and there exists another such that is always true. The inference result of , however, is not informative for most purposes and is often not what is intended when one writes down (1) and (2) for an application. This example illustrates that, due to the lack of a Markov condition, arbitrary dependence between variables is considered possible, which results in excessively wide intervals in inference results. More recent works on description logic Heinsohn (1994); Jaeger (1994); Lukasiewicz (2008) share the same weakness. A more practical approach in this case is to assume that and are independent of each other unless there is information saying otherwise. With this independence assumption, computing bounds for results in the interval . (Footnote 2: Symbol stands for XOR.)
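The effect of an independence assumption on the width of the result can be sketched for the XOR of two atoms. The marginal intervals used below are hypothetical stand-ins, not the values of sentences (1) and (2); the closed forms follow from the Fréchet bounds on the conjunction of two events.

```python
from itertools import product

def xor_bounds_no_independence(pa, pb):
    """Bounds on P(a xor b) when any dependence between a and b is allowed.

    pa, pb are (lower, upper) intervals for P(a) and P(b). For fixed
    marginals p, q the Frechet bounds give
    |p - q| <= P(a xor b) <= min(p + q, 2 - p - q).
    """
    (pl, pu), (ql, qu) = pa, pb
    lower = max(0.0, pl - qu, ql - pu)   # smallest achievable |p - q|
    s_lo, s_hi = pl + ql, pu + qu        # achievable range of p + q
    if s_lo <= 1.0 <= s_hi:
        upper = 1.0
    else:
        upper = max(min(s_lo, 2 - s_lo), min(s_hi, 2 - s_hi))
    return lower, upper

def xor_bounds_independent(pa, pb):
    """Bounds on P(a xor b) = p + q - 2pq when a and b are independent.

    The expression is bilinear in (p, q), so its extremes over the box
    of intervals are attained at the corners.
    """
    values = [p + q - 2 * p * q for p, q in product(pa, pb)]
    return min(values), max(values)

# Hypothetical marginal intervals for both atoms.
wide = xor_bounds_no_independence((0.4, 0.6), (0.4, 0.6))
narrow = xor_bounds_independent((0.4, 0.6), (0.4, 0.6))
```

With these assumed inputs, the unconstrained result spans the whole unit interval, while the independent result is roughly [0.48, 0.52], which mirrors the contrast described above.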
A straightforward way to allow for a Markov condition in a probabilistic logic program is to treat it as a credal network with probability intervals. Consider the following probabilistic logic program / credal network:
are atomic formulas (i.e., binary variables). The Markov condition in this credal network is that is independent of given , in mathematical terms:
which is the network’s strong extension Cozman (2000).
Clearly, the credal network above represents the set of probability distributions over all interpretations such that constraints (3–9) are satisfied. However, this representation comes with several restrictions: (1) the only non-atomic logic formulas allowed are AND over atomic formulas and negation of atomic formulas; (2) there must not be cyclic dependencies among atomic formulas; (3) an atomic formula must be specified by either a marginal probability interval or by a set of conditional probability intervals, and not both; (4) the conditions in the conditional probabilities must enumerate all possible interpretations of the parent variables. We will refer to the last two requirements as the unique-assessment assumption. In practice, there is often knowledge that cannot be expressed by a simple AND, and there are often multiple sources of information that, when aggregated, break the acyclicity or unique-assessment requirements. More recently, Cozman and Polastro (2008, 2009) relax some of the restrictions above and allow, for example, specifying a conditional probability interval for , where can be an arbitrary logic formula but is an atom.
3 Logical Credal Networks
In this section, we introduce a new probabilistic logic called a Logical Credal Network (LCN) that is designed to have the best of both worlds: as few restrictions as possible on logic formulas when specifying probability bounds, and a set of implied independence relations that are similar to the Markov condition in Bayesian and credal networks.
Syntactically, an LCN is specified by a set of probability-assessment sentences in one of the following two forms:
where and can be arbitrary propositional and finite-domain first-order logic formulas and , . Each sentence is further associated with a Boolean label , which indicates whether formula implies graphical dependence and will be explained in the next section.
An LCN represents the set of all its models. A model of an LCN is a probability distribution over all interpretations that satisfies a set of constraints given explicitly by (10)–(11) and a set of independence constraints which are implied by the LCN. The latter are similar to the independence relations implied by a Markov condition in graphical models. It is important that the independence constraints are implied rather than explicitly stated, because requiring a user to explicitly specify independence constraints would be tedious and potentially error-prone in a real-world application. (Footnote 3: The semantics is not model-theoretic because there exist implied constraints that are jointly derived from multiple sentences.)
Therefore, the critical aspect of LCN semantics is how to define the implied independence constraints. In contrast to previous work Andersen and Hooker (1994); Cozman (2000); Cozman and Polastro (2009), we propose a generalized Markov condition that accommodates the LCN’s much more relaxed requirements on logic formulas, including cyclic dependencies. In particular, our definition is backward compatible and matches the Markov condition in Bayesian networks when (10)–(11) happen to specify the marginal and conditional probabilities of a Bayesian network.
We begin by defining the dependency graph of an LCN. Given an LCN , its dependency graph contains a node for each atomic and non-atomic formula in . A sentence in induces a set of directed edges in which we call a stamp. Assuming , then for each sentence (10) such that has atomic formulas, its stamp contains directed edges from to each , and from each to , respectively. Similarly, for each sentence (11) such that (resp. ) are the atomic formulas in (resp. ), its stamp contains a directed edge from to , a set of directed edges from each to , as well as directed edges from to each and from each to , respectively. Setting to indicates that the sentences (10) or (11) do not imply graphical dependency among atomic formulas in . In this case, the stamp of (10) would be empty, while the stamp of (11) would not involve directed edges from to . Figure 1 illustrates stamps of the two types of sentences in LCNs, for and , respectively.
The intuition behind the stamps is the need to capture two types of dependencies. For a sentence (10) or (11) with , the dependency among atomic formulas in is similar to a clique in Markov random fields (see the bi-directional edges between and its atomic formulas in Figures 1(a)(c)). For a sentence (11), the dependency between and is similar to the dependencies in Bayesian networks (see the one-directional connections in Figures 1(b)(c)(d)).
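The stamp construction described above can be sketched as a small function that returns the directed edges induced by one sentence. This is one reading of the description, under stated assumptions: the argument name `label` stands for the Boolean label of the sentence, formulas are identified by plain strings, an atomic formula shares a node with itself (so self-loops are dropped), and a false label is taken to remove the edges from the atoms of the first formula back to that formula.

```python
def stamp(q, q_atoms, r=None, r_atoms=(), label=True):
    """Directed edges induced by one sentence.

    q, r     : string names of the formulas (r is None for a marginal sentence)
    q_atoms  : atomic formulas occurring in q
    r_atoms  : atomic formulas occurring in r
    label    : Boolean label of the sentence (hypothetical argument name)
    """
    edges = set()
    if r is None:
        # Marginal sentence: bi-directional edges only if the label is true.
        if label:
            for x in q_atoms:
                edges |= {(q, x), (x, q)}
    else:
        # Conditional sentence: atoms of r -> r -> q -> atoms of q.
        edges.add((r, q))
        edges |= {(y, r) for y in r_atoms}
        edges |= {(q, x) for x in q_atoms}
        if label:
            edges |= {(x, q) for x in q_atoms}
    # An atomic formula is its own node, so drop self-loops.
    return {(u, v) for u, v in edges if u != v}

# Hypothetical sentences: a bound on P(a xor b) and a bound on P(c | s).
marginal_edges = stamp("a xor b", ["a", "b"])
bn_like_edges = stamp("c", ["c"], r="s", r_atoms=["s"])
```

When both formulas are atomic, the stamp reduces to a single edge from the conditioning atom to the conditioned atom, which is the Bayesian-network-like case.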
Consider the following LCN derived from the Smokers and Friends example of Richardson and Domingos (2006). We abbreviate the predicates , and by , and , respectively. Predicate is symmetric.
Here, (12) states that friends of friends are likely friends; (13) states that, if two people are friends, they likely either both smoke or neither does; (14) and (15) state that smoking likely causes cancer. for all sentences. Figure 2 illustrates the dependency graph of the LCN grounded on a domain of three people (as before, symbol is XOR).
With the dependency graph, we are ready to define the generalized Markov condition for LCNs.
The parents of an atomic formula , denoted by , are the set of atomic formulas such that there exists a directed path in the dependency graph from each of them to in which all intermediate nodes are non-atomic.
The descendants of an atomic formula , denoted by , are the set of atomic formulas such that there exists a directed path in the dependency graph from to each of them in which no intermediate node is in .
Let denote the non-descendant non-parent variables of : .
Definition 3 (Markov condition)
In a model of an LCN, every atomic formula is conditionally independent of given .
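The three definitions above translate directly into graph searches. The sketch below assumes the dependency graph is given as a set of directed edges plus the set of atomic nodes, and it assumes the non-descendant non-parent set is all atoms other than the formula itself, its parents, and its descendants; function names are chosen here for illustration.

```python
from collections import deque

def parents(edges, atoms, x):
    """Atoms with a directed path to x whose intermediate nodes are non-atomic."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
    found, seen, queue = set(), {x}, deque([x])
    while queue:
        node = queue.popleft()
        for u in preds.get(node, ()):
            if u in seen:
                continue
            seen.add(u)
            if u in atoms:
                found.add(u)       # the path ends at an atom
            else:
                queue.append(u)    # keep walking back through formula nodes
    return found

def descendants(edges, atoms, x):
    """Atoms reachable from x via paths with no intermediate node in parents(x)."""
    succs = {}
    for u, v in edges:
        succs.setdefault(u, set()).add(v)
    blocked = parents(edges, atoms, x)
    found, seen, queue = set(), {x}, deque([x])
    while queue:
        node = queue.popleft()
        for v in succs.get(node, ()):
            if v in seen:
                continue
            seen.add(v)
            if v in atoms:
                found.add(v)
            if v not in blocked:
                queue.append(v)    # v may serve as an intermediate node
    return found

def non_descendant_non_parents(edges, atoms, x):
    return set(atoms) - {x} - parents(edges, atoms, x) - descendants(edges, atoms, x)

# Hypothetical acyclic example: a -> c <- b, c -> d (all nodes atomic).
bn_edges = {("a", "c"), ("b", "c"), ("c", "d")}
bn_atoms = {"a", "b", "c", "d"}

# Hypothetical cyclic example: one formula node q linked both ways to a and b.
cyc_edges = {("q", "a"), ("a", "q"), ("q", "b"), ("b", "q")}
cyc_atoms = {"a", "b"}
```

On the acyclic example the functions return the usual Bayesian-network parents and descendants. On the cyclic example, `b` is both a parent and a descendant of `a`, so the Markov condition imposes no independence between them.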
Consider the special case of using LCNs to represent Bayesian networks. The upper and lower bounds in an LCN sentence (10) or (11) are equal; formulas in (10)(11) are atomic; and the LCN sentences specify the marginal and conditional probabilities in a Bayesian network. The LCN dependency graph is constructed by stamps only in the form of Figure 1(b). Consequently, the parents of an atomic node are the same as the parents of in the Bayesian network, and the descendants of an atomic formula are the same as the descendants of in the Bayesian network. Therefore, the Markov condition of the LCN is identical to that of the Bayesian network. The LCN has only one model, which is the same probability distribution as the Bayesian network.
As in the probabilistic logics discussed in Section 2.3, inference in an LCN means computing upper and lower bounds on a probability of interest. This entails solving a pair of optimization problems comprising the constraints stated explicitly by the LCN sentences (10)(11) and those derived from the generalized Markov condition. Specifically, (10)(11) are linear constraints while constraints from the Markov condition are quadratic. The objective function can be in the form of either , where is a logic formula, or , where are logic formulas and represents evidence. In some scenarios we may be interested in the model with the maximum entropy Cheeseman (1983) and the objective function can be a measure of entropy. The appendix includes examples of these optimization problems.
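One way to write the pair of optimization problems is sketched below. This is an illustration under assumptions rather than the formulation from the appendix: the decision variables are the probabilities $p(\omega)$ of the interpretations $\omega$, sentence (10) is taken to bound a marginal probability $P(q')$, sentence (11) is taken to bound a conditional probability $P(q' \mid r)$ with $P(r) > 0$, and $\pi$ and $n$ range over joint values of the parents and of the non-descendant non-parent atoms of $x$.

```latex
\begin{align*}
\min_{p} \;/\; \max_{p} \quad & P(q) = \sum_{\omega \models q} p(\omega) \\
\text{s.t.} \quad
& \sum_{\omega} p(\omega) = 1, \qquad p(\omega) \ge 0, \\
& l \le \sum_{\omega \models q'} p(\omega) \le u
  && \text{for each sentence of form (10)}, \\
& l \, P(r) \le P(q' \wedge r) \le u \, P(r)
  && \text{for each sentence of form (11)}, \\
& P(x, n, \pi) \, P(\pi) = P(x, \pi) \, P(n, \pi)
  && \text{for each atom } x \text{ and all values of } x, n, \pi .
\end{align*}
```

Every probability on the right is a sum of entries $p(\omega)$, so the first two families of constraints are linear and the last family, which encodes conditional independence, is quadratic. A conditional objective $P(q \mid e)$ is a ratio of two such sums.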
For exact inference, we use the IPOPT solver Wächter and Biegler (2006) to solve the optimization problems. The experiments in Section 4 have been conducted with this approach. Since the optimization problems are over the space of joint distributions, the complexity is exponential in the number of atomic formulas, as in prior work Chandru and Hooker (1999). The experimental results suggest that, despite limits on problem size, the exact approach is able to perform inference for meaningful tasks such as solving Mastermind puzzles and credit card fraud detection.
For large-scale problems, approximate inference algorithms are needed. In order to handle imprecise probabilities, the classical belief propagation algorithms Pearl (1982); Weiss (2000) for probabilistic graphical models have been extended to propagate intervals. Specifically, the 2U algorithm Fagiuoli and Zaffalon (1998) is exact on credal networks that are polytrees; the L2U and IPE algorithms Ide and Cozman (2008) are built on top of 2U for approximate inference on credal networks with loops; Antonucci et al. (2010) generalize L2U beyond binary variables. Cozman and Polastro (2009) suggested that L2U is the inference algorithm of choice for the general case of its semantics.
Incompatibility with Belief Propagation
A closer examination, however, reveals that L2U, or any other method based on sum-product message passing, is incompatible with LCNs or the work by Cozman and Polastro (2008, 2009) when the unique-assessment assumption is broken. Consider the following example:
As discussed in Section 2, the unique-assessment assumption is a property of credal/Bayesian networks. Here, it allows (17)(18) or (19), but not both. Therefore, this example is not a credal network but is legitimate as an LCN or by the semantics of Cozman and Polastro (2008, 2009). Under both semantics, if we query , the correct answer is [0.3,0.35]. However, 2U or L2U gives an incorrect answer of [0.1,0.26], even though the dependency graph is a polytree.
The inconsistency stems from the fact that sum-product message passing is fundamentally a Markov random field solver and 2U/L2U treats probability values as factor potentials. When the unique-assessment assumption is upheld, as in credal/Bayesian networks, products of factor potentials coincide with probabilities and therefore L2U works for credal networks. As soon as the unique-assessment assumption is broken, the solvers lose correctness regardless of network topology. Hence 2U/L2U/IPE are not fit for LCNs and not fit for the semantics of Cozman and Polastro (2008, 2009). In real-world applications, the unique-assessment assumption is often broken when multiple sources of information are aggregated, hence there is a need for approximate inference algorithms that can handle such scenarios.
Modified Belief Propagation
We propose a modified belief propagation algorithm. The high-level flow is identical to classical belief propagation. We build a factor graph Frey (2003) with variable nodes and factor nodes, and iteratively update messages between variables and factors until convergence. The difference lies in how messages are computed: not by sum and product but by tightening bounds at variable nodes and solving a local constraint program at factor nodes. The discussion is limited to binary variables. For multi-valued variables, messages will have more complex structures Antonucci et al. (2010).
An LCN factor graph is a bipartite graph with variable nodes, which represent atomic formulas, and factor nodes, each of which represents one or more sentences in the LCN. Sentences that involve the same set of atomic formulas are grouped into one factor. For example, an LCN by (3–7) has three factors: (3), (4)(5) and (6)(7). Let denote a variable node. Let denote a factor node. Let denote the neighbors of a node. A message is an interval where . Let denote the message from to and from to .
If a variable node has degree one, it sends a message of to its only factor neighbor. If the degree of is more than one, it sends the following message to neighbor :
A factor node computes its message to neighbor by solving a local constraint program, which is composed of:
Sentences of factor ;
Quadratic constraints that the variables in are independent of each other.
The objective function is , and the message and are the results of minimizing and maximizing the objective with the local constraint program.
All messages are initialized to be . Convergence is guaranteed because the lower bound of each message monotonically increases while the upper bound of each message monotonically decreases. The role of variable nodes is to tighten the bounds, which is fundamentally different from that in previous belief propagation methods. More details and illustrating examples are in the appendix.
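The bound-tightening step at a variable node can be sketched as follows. This is an assumption-laden illustration: the outgoing message is taken to be the intersection of the intervals received from all other factor neighbors, and the uninformative message is taken to be the interval [0, 1]. The factor-node step, which requires a constraint solver, is not shown.

```python
def variable_to_factor(incoming, target):
    """Message from a variable node to the factor `target`.

    incoming maps each neighboring factor to its current message
    (lower, upper) about the variable. The outgoing message keeps the
    tightest bounds received from all neighbors other than `target`.
    """
    others = [interval for factor, interval in incoming.items()
              if factor != target]
    if not others:
        # Degree-one variable: nothing to tighten with (assumed [0, 1]).
        return (0.0, 1.0)
    lower = max(lo for lo, _ in others)
    upper = min(hi for _, hi in others)
    return (lower, upper)

# Hypothetical messages from three factors about one binary variable.
incoming = {"f1": (0.2, 0.9), "f2": (0.4, 0.8), "f3": (0.1, 0.7)}
message_to_f1 = variable_to_factor(incoming, "f1")
```

In this example the message to `f1` combines the lower bound from `f2` with the upper bound from `f3`. Since intersections can only shrink an interval, the sketch is consistent with the monotone behavior of the bounds described above.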
Comparison with Other Algorithms
The bound tightening operation at variable nodes is similar to the inference algorithm in Riegel et al. (2020). The message computation at factor nodes bears some similarity to 2U; the independence assumption in local constraint programs is a mechanism to approximate the Markov condition, and the same approach is used in classical belief propagation. There is also a relation to the IPE algorithm Ide and Cozman (2008): IPE cuts out a number of polytree subgraphs, solves each subgraph, and then chooses the tightest bounds from the subgraphs; our proposed algorithm implicitly enumerates an exponential number of subgraphs and chooses one subgraph for each or that computes the tightest bound. The algorithm guarantees exact results on polytree credal/Bayesian networks, with and without additional marginal probability sentences that break the unique-assessment assumption. In general, however, there is no guarantee on correctness; this is the same as classical belief propagation.
We evaluate the proposed approach on the following variants of MAP inference for LCNs. Given an LCN and a subset of query (or MAP) variables, the task is to find the assignment to the query variables such that its posterior marginal probability interval has the largest upper bound (maximax) or, alternatively, the largest lower bound (maximin).
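The two decision criteria can be sketched in a few lines once the posterior interval of each candidate assignment is available. The candidate names and intervals below are hypothetical.

```python
def map_decision(bounds, criterion="maximax"):
    """Pick the assignment with the largest upper bound (maximax)
    or the largest lower bound (maximin).

    bounds maps each candidate assignment to (lower, upper).
    """
    index = {"maximin": 0, "maximax": 1}[criterion]
    return max(bounds, key=lambda assignment: bounds[assignment][index])

# Hypothetical posterior intervals for three candidate assignments.
bounds = {
    "code_1": (0.10, 0.60),
    "code_2": (0.25, 0.45),
    "code_3": (0.05, 0.30),
}
best_maximax = map_decision(bounds, "maximax")
best_maximin = map_decision(bounds, "maximin")
```

The two criteria can disagree, as they do here: `code_1` has the largest upper bound while `code_2` has the largest lower bound.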
4.1 Mastermind Puzzles with Uncertainty
Mastermind is a classic two-player code breaking game, made popular in computer science by Knuth (1976). We consider Mastermind games with uncertainty which differ from the classic game in that the code-maker lies randomly. If given the probabilities of lying, a game is completely specified by a Bayesian network. A large number of puzzles, i.e., game boards, are generated by sampling from such Bayesian networks. Given imprecise information of the underlying Bayesian network of each puzzle, the task is to guess the hidden code. The ground truth is the MAP hidden code(s) given the board as observations. Details on puzzle generation and an example puzzle are in the appendix.
Let be the number of rounds in a puzzle and let be a boolean variable indicating whether the code-maker lied in the round. Given a puzzle, the imprecise information available to the inference algorithms is, for example:
Here, (22) is a typical credal network specification and is used by all inference algorithms, while only a subset of the algorithms is able to utilize the remaining equations. We use ten random seeds to generate ten sets of puzzles (each set having 729 puzzles on average) and report the mean and standard deviation of the accuracy (Table 1).
We consider the following competing methods:
Bayesian_midpoint simply assumes that is equal to the midpoint of the interval specified in (22) and performs MAP inference on the resulting Bayesian network. It does not utilize logic formulas like (23)–(25).
ProbLog_midpoint and cProbLog_midpoint are ProbLog De Raedt, Kimmig, and Toivonen (2007) and cProbLog Fierens et al. (2012) that use the midpoints of the intervals because they do not allow probability intervals. ProbLog is unable to represent formulas like (23)–(25) and hence is the same as Bayesian_midpoint. cProbLog is able to represent (23)–(25) as soft constraints.
MLN_midpoint is the Markov Logic Network (MLN) Richardson and Domingos (2006) that uses all logic formulas as well as the midpoint of each interval. Following Fierens et al. (2012), each formula is associated with weight , and with these weights MLN_midpoint is equivalent to cProbLog_midpoint. Because each puzzle is sampled from a different underlying Bayesian network, it is impossible to train the MLN weights.
Nilsson_maximax and Nilsson_maximin are the original probabilistic logic formulations from Nilsson (1986, 1994) that compute the maximax and maximin MAP assignments respectively. Although Nilsson’s method is able to utilize all the logic formulas and their bounds, it does not allow modeling variables as being independent of each other. Given a puzzle, each possible hidden code corresponds to a truth assignment to the variables; without independence relations, there always exists a joint probability distribution of the variables such that the particular truth assignment has zero probability. Consequently, the lower bound of the posterior probability given a puzzle is zero for any , and therefore the maximin criterion has no basis to make decisions upon. Intuitively, a similar effect should also impair the accuracy of Nilsson_maximax. However, we are unable to verify this empirically: computing the upper bound of the posterior by solving Nilsson’s constraint program is computationally prohibitive due to the complexity of the objective function and the size of the program.
LCN_maximax and LCN_maximin are the proposed methods computing the maximax and maximin MAP assignments respectively. In this case, sentences like (23)–(25) are annotated with so that they do not imply dependency among the variables.
LCN_maxent is a variant of the proposed method that, given all the constraints, computes a joint probability distribution of the variables with the greatest entropy and assumes that it is the underlying distribution.
Table 1 demonstrates clearly that our proposed approach outperforms all its competitors. Furthermore, it is the only method able to properly aggregate multiple sources of knowledge specified as logic formulas with probability bounds. We also notice that MLN_midpoint and cProbLog_midpoint are slightly worse than Bayesian_midpoint and ProbLog_midpoint. cProbLog/MLN do not gain accuracy by exploiting additional knowledge. This is because they treat each logic formula as a statistically independent factor and the probabilities as factor potentials; in other words, the probability midpoints are not treated as probabilities in the joint distribution. This is a major weakness of cProbLog and MLN, in addition to their inability to handle intervals. In contrast, the proposed LCN aggregates knowledge as multiple sources of constraints and treats the given bounds as actual probability bounds without making any unwarranted independence assumptions among different pieces of knowledge.
4.2 Credit Card Fraud Detection
We consider a realistic credit card fraud detection task based on the UCSD-FICO Data Mining Contest dataset FICO-UCSD (2009) which contains 100,000 transactions over a period of 98 days out of which 2,654 are fraudulent. Each transaction is characterized by transaction amount, transaction time, state, hashed zip code, hashed email address together with eleven other anonymized features.
We use the hashed email addresses as account IDs and split the data into two parts. The first contains 55,750 accounts, each with a single transaction. The second contains 14,374 accounts that have multiple transactions, and the total number of transactions is 44,250. The first and second subsets form training and test data respectively. In addition to learning from training data, we provide additional knowledge regarding fraudulent transactions and account history through the following three logic rules from Li et al. (2020):
We create ten randomized tasks by sampling half of the training data and half of the test data for each. Note that we sample accounts rather than transactions; in other words, transactions from the same account are either all included in a test set or all excluded from it. Subsequently, we emulate expert knowledge as follows. For each of the three logic rules and each of the ten test sets, we measure the conditional probability that the consequent is true given that the antecedent is true. We take the min and max over the test sets and obtain the following probability intervals for the three rules: , and , respectively.
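The procedure for emulating expert knowledge can be sketched as follows. Each test set is assumed to be a list of (antecedent, consequent) truth values for one rule, and every test set is assumed to contain at least one transaction with a true antecedent; the data below are made up for illustration.

```python
def conditional_frequency(records):
    """Fraction of records with a true consequent among those
    whose antecedent is true."""
    hits = [consequent for antecedent, consequent in records if antecedent]
    return sum(hits) / len(hits)

def rule_interval(test_sets):
    """[min, max] of the conditional frequency across the test sets."""
    freqs = [conditional_frequency(records) for records in test_sets]
    return min(freqs), max(freqs)

# Hypothetical truth values of one rule on three small test sets.
test_sets = [
    [(True, True), (True, False), (False, False), (True, True)],
    [(True, True), (True, True), (False, True)],
    [(True, False), (True, True)],
]
interval = rule_interval(test_sets)
```

The resulting pair serves as the lower and upper bound attached to the rule.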
Prediction is based on a binary posterior distribution and consequently the maximax and maximin decision criteria give the same results. We consider the following methods.
Naive_Bayes uses only the training data and does not use expert knowledge.
Bayesian_midpoint uses a Bayesian network that expands Naive_Bayes by adding the antecedents in the three logic rules as three parent nodes of the Is-Fraud node. The midpoints of the three intervals are used; they are not enough to fully specify the dependencies between the three new nodes and the Is-Fraud node, and we use the noisy-OR model to address this issue. The prior probability of the three new nodes is 0.5, which does not change the inference result.
Credal uses a credal network with the same structure as Bayesian_midpoint but with the three intervals. The same noisy-OR model is applied and the prior on the three new nodes is the probability interval .
ProbLog_midpoint contains the Naive_Bayes model expressed in ProbLog and uses the three logic rules annotated with the midpoint of the corresponding probability interval.
MLN_midpoint contains the Naive_Bayes model expressed as an MLN and adds the three logic rules as follows. Let be the midpoint of one of the three intervals. We add a factor of with weight and another factor of with weight , respectively. Because the training data contains no account history, it is impossible to train the MLN weights.
LCN is the proposed method.
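The noisy-OR model used by Bayesian_midpoint and Credal combines one probability per parent into a full conditional distribution. The sketch below shows the standard noisy-OR form without a leak term; the exact parameterization used in the experiments is not given in the text, and the strengths are hypothetical.

```python
def noisy_or(strengths, active):
    """P(effect | parents) under a noisy-OR model without a leak term.

    strengths[i] is P(effect | only parent i is active);
    active[i] says whether parent i is true.
    """
    p_no_effect = 1.0
    for strength, on in zip(strengths, active):
        if on:
            p_no_effect *= 1.0 - strength
    return 1.0 - p_no_effect

# Hypothetical strengths standing in for the three rule midpoints.
strengths = (0.5, 0.2, 0.6)
p_all = noisy_or(strengths, (True, True, True))
p_none = noisy_or(strengths, (False, False, False))
```

With three parents, noisy-OR needs only three numbers instead of the eight rows of a full conditional probability table, which is why the three midpoints suffice under this model.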
Table 2 reports the mean and standard deviation of the F1 scores obtained on the test set of the credit card fraud dataset. We see again that our proposed LCN method substantially outperforms its competitors. Bayesian_midpoint and Credal perform quite poorly in this case because of the unique-assessment requirement of Bayesian/credal networks. Specifically, since the Is-Fraud node has three parents, we are no longer allowed to specify . In the Naive Bayes models, values are measured on the training data. This information is lost in Bayesian_midpoint and Credal models and consequently, false positive predictions increase dramatically.
We propose a new probabilistic logic that expresses both probability bounds for propositional and first-order logic formulas with few restrictions and a Markov condition that is similar to those of Bayesian networks and Markov random fields. The formula bounds allow for flexibility in the form and precision of background knowledge that can be utilized, while the Markov condition restricts the space of distributions to enable a meaningful representation of uncertainties. Evaluation on a set of MAP inference tasks shows promising results, particularly in aggregating multiple sources of imprecise information. Potential future directions include extending to temporal models, further algorithmic innovations and experiments on a wider array of applications.
- Andersen and Hooker (1994) Andersen, K.; and Hooker, J. N. 1994. Bayesian logic. Decision Support Systems, 11(2): 191–210.
- Antonucci et al. (2010) Antonucci, A.; Sun, Y.; De Campos, C. P.; and Zaffalon, M. 2010. Generalized loopy 2U: A new algorithm for approximate inference in credal networks. International Journal of Approximate Reasoning, 51(5): 474–484.
- Cano and Moral (2002) Cano, A.; and Moral, S. 2002. Using probability trees to compute marginals with imprecise probabilities. International Journal of Approximate Reasoning, 29(1): 1–46.
- Chandru and Hooker (1999) Chandru, V.; and Hooker, J. 1999. Optimization Methods for Logical Inference. John Wiley & Sons.
- Cheeseman (1983) Cheeseman, P. 1983. A method of computing generalized Bayesian probability values for expert systems. In Proceedings of the International Joint Conference on Artificial Intelligence, 198–202.
- Cozman (2000) Cozman, F. G. 2000. Credal networks. Artificial Intelligence, 120(2): 199–233.
- Cozman and Polastro (2008) Cozman, F. G.; and Polastro, R. B. 2008. Loopy propagation in a probabilistic description logic. In International Conference on Scalable Uncertainty Management, 120–133. Springer.
- Cozman and Polastro (2009) Cozman, F. G.; and Polastro, R. B. 2009. Complexity analysis and variational inference for interpretation-based probabilistic description logics. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 117–125.
- De Raedt, Kimmig, and Toivonen (2007) De Raedt, L.; Kimmig, A.; and Toivonen, H. 2007. ProbLog: A probabilistic prolog and its application in link discovery. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 7, 2462–2467.
- Druzdzel and van der Gaag (1995) Druzdzel, M. J.; and van der Gaag, L. C. 1995. Elicitation of probabilities for belief networks: Combining qualitative and quantitative information. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 141–148.
- Dürig and Studer (2005) Dürig, M.; and Studer, T. 2005. Probabilistic ABox reasoning: Preliminary results. In Description Logics, 104–111.
- Fagin, Halpern, and Megiddo (1990) Fagin, R.; Halpern, J. Y.; and Megiddo, N. 1990. A logic for reasoning about probabilities. Information and Computation, 87(1-2): 78–128.
- Fagiuoli and Zaffalon (1998) Fagiuoli, E.; and Zaffalon, M. 1998. 2U: An exact interval propagation algorithm for polytrees with binary variables. Artificial Intelligence, 106(1): 77–107.
- FICO-UCSD (2009) FICO-UCSD. 2009. FICO Credit Card Dataset. https://ebiquity.umbc.edu/blogger/2009/05/24/ucsd-data-mining-contest.
- Fierens et al. (2012) Fierens, D.; Van den Broeck, G.; Bruynooghe, M.; and De Raedt, L. 2012. Constraints for probabilistic logic programming. In Proceedings of the NIPS Probabilistic Programming Workshop, 1–4.
- Frey (2003) Frey, B. J. 2003. Extending factor graphs so as to unify directed and undirected graphical models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 257–264.
- Geman and Geman (1984) Geman, S.; and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6): 721–741.
- Heinsohn (1994) Heinsohn, J. 1994. Probabilistic description logics. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 311–318.
- Ide and Cozman (2008) Ide, J. S.; and Cozman, F. G. 2008. Approximate algorithms for credal networks with binary variables. International Journal of Approximate Reasoning, 48(1): 275–296.
- Jaeger (1994) Jaeger, M. 1994. Probabilistic reasoning in terminological logics. In Principles of Knowledge Representation and Reasoning, 305–316. Elsevier.
- Knuth (1976) Knuth, D. 1976. The computer as Master Mind. Journal of Recreational Mathematics, 9(1): 1–6.
- Koller and Friedman (2009) Koller, D.; and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Li et al. (2020) Li, S.; Wang, L.; Zhang, R.; Chang, X.; Liu, X.; Xie, Y.; Qi, Y.; and Song, L. 2020. Temporal logic point processes. In International Conference on Machine Learning, 5990–6000. PMLR.
- Lukasiewicz (2008) Lukasiewicz, T. 2008. Expressive probabilistic description logics. Artificial Intelligence, 172(6-7): 852–883.
- Nilsson (1986) Nilsson, N. J. 1986. Probabilistic logic. Artificial Intelligence, 28(1): 71–87.
- Nilsson (1994) Nilsson, N. J. 1994. Probabilistic logic revisited. Artificial Intelligence, 59(1-2): 39–42.
- Pearl (1982) Pearl, J. 1982. Reverend Bayes on inference engines: A distributed hierarchical approach. In AAAI Conference on Artificial Intelligence, 133–136.
- Pearl (1988) Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
- Richardson and Domingos (2006) Richardson, M.; and Domingos, P. 2006. Markov logic networks. Machine Learning, 62(1-2): 107–136.
- Riegel et al. (2020) Riegel, R.; Gray, A.; Luus, F.; Khan, N.; Makondo, N.; Akhalwaya, I. Y.; Qian, H.; Fagin, R.; Barahona, F.; Sharma, U.; et al. 2020. Logical neural networks. arXiv preprint arXiv:2006.13155.
- Shafer (1976) Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton University Press.
- Wächter and Biegler (2006) Wächter, A.; and Biegler, L. T. 2006. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1): 25–57.
- Weiss (2000) Weiss, Y. 2000. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1): 1–41.
- Wellman (1990) Wellman, M. P. 1990. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44: 257–303.
- Wittig and Jameson (2000) Wittig, F.; and Jameson, A. 2000. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 644–652.
- Zadeh (1965) Zadeh, L. A. 1965. Fuzzy sets. Information and Control, 8(3): 338–353.
Appendix A Examples of Exact Inference
Consider the following LCN:
where for the sentences. The implied constraints by the Markov condition are: and are independent; is conditionally independent of given and ; is conditionally independent of given and .
Define sixteen variables to represent the probabilities of the sixteen interpretations, where the four bits in subscript represent the truth values of , , and respectively. For example, is the probability that , , and are all false.
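The bit-string indexing described above can be sketched in code. This is an illustrative sketch only: the names `INTERPRETATIONS`, `label`, `formula_probability` and `marginal` are hypothetical, and the uniform distribution is used purely as a stand-in for a feasible solution.

```python
from itertools import product

NUM_ATOMS = 4  # the example has four propositions, hence 2**4 interpretations

# Each interpretation is a tuple of four truth values (0 = false, 1 = true).
INTERPRETATIONS = list(product((0, 1), repeat=NUM_ATOMS))


def label(bits):
    """Four-bit subscript of the variable for this interpretation, e.g. '0000'."""
    return "".join(str(b) for b in bits)


def formula_probability(dist, formula):
    """Probability of a formula: the sum over the interpretations satisfying it.

    This is linear in the sixteen variables, which is why bounds on a
    formula translate into linear constraints.
    """
    return sum(p for bits, p in dist.items() if formula(bits))


def marginal(dist, atom):
    """Marginal probability that the proposition at position `atom` is true."""
    return formula_probability(dist, lambda bits: bits[atom] == 1)


# Stand-in distribution: every interpretation equally likely.
uniform = {bits: 1.0 / len(INTERPRETATIONS) for bits in INTERPRETATIONS}
```

Under this encoding, a query on a marginal is a sum of eight of the sixteen variables, and a conditional query is a ratio of two such sums.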
Consider a query on the marginal probability of . We formulate two optimization problems:
Constraints (A.6)(A.7) ensure that the sixteen variables are a valid probability distribution. (A.8–A.12) are explicit and linear constraints from the LCN sentences (A.1–A.5). (A.13–A.21) are implicit and quadratic constraints from the Markov condition, and they have been reduced using techniques from Andersen and Hooker (1994). By maximizing and minimizing the objective function (A.22), we obtain the upper and lower bounds for , which are 0.33 and 0 respectively. For another query on the posterior probability of , we replace (A.22) with the following objective:
and the resulting interval is . In some scenarios we may be interested in the model with the maximum entropy Cheeseman (1983) and therefore minimize the following objective instead.
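As a hedged illustration of a maximum-entropy objective, the sketch below uses the standard fact that maximizing Shannon entropy is equivalent to minimizing the sum of p log p over the interpretation probabilities. The function name `negative_entropy` and the use of natural logarithms are assumptions made for this sketch.

```python
import math


def negative_entropy(probs):
    """Sum of p * log(p) over a distribution, with 0 * log(0) taken as 0.

    Minimizing this quantity is equivalent to maximizing Shannon entropy.
    """
    return sum(p * math.log(p) for p in probs if p > 0.0)
```

Among sixteen-way distributions, the uniform one attains the smallest value, -log 16, while any point mass attains 0; the LCN constraints then select the feasible distribution closest to the former.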
Appendix B Details of the Modified Belief Propagation Algorithm
Let’s use the example from Section 3.3, copied here:
If we query , the correct answer is [0.3,0.35] according to both our semantics and the semantics in Cozman and Polastro (2009). However, both the 2U Fagiuoli and Zaffalon (1998) and L2U Ide and Cozman (2008) algorithms compute an incorrect answer of [0.1,0.26].
The high-level flow of the new algorithm is identical to that of classical belief propagation. We build a factor graph with variable nodes and factor nodes, and iteratively update the messages from variables to factors and from factors to variables until convergence. The factor graph is a bipartite graph with variable nodes, which represent atomic formulas, and factor nodes, each of which represents one or more sentences in the LCN. Sentences that involve the same set of atomic formulas are grouped into one factor. For example, the LCN given by (B.1–B.4) has three factors: is (B.1), is (B.2)(B.3), and is (B.4).
Let denote a variable node. Let denote a factor node. Let denote the neighbors of a node. A message is an interval where . Let denote the message from to and from to . If a variable node has degree one, it sends a message of to its only factor neighbor. If the degree of is more than one, it sends the following message to neighbor :
A factor node computes its message to neighbor by solving a local constraint program, which is composed of:
Sentences of factor ;
Quadratic constraints that the variables in are independent of each other.
The objective function is , and the message and are the results of minimizing and maximizing the objective with the local constraint program.
In the example, the constraint program to update and is:
For another example, suppose factor is one sentence of , and its constraint program to update and would be:
All ’s are initialized to 0 and ’s initialized to 1, and the messages are updated until convergence. Intuitively, the role of factor nodes is to solve local constraint programs, and the role of variable nodes is to tighten the bounds. The independence assumptions in the local constraint programs are a mechanism to approximate the Markov condition, and the same approach is used in classical belief propagation. There is a notable relation to the LNN inference algorithm Riegel et al. (2020), which also iteratively tightens bounds. There is also a relation to the IPE algorithm Ide and Cozman (2008): IPE cuts out a number of polytree subgraphs, solves each subgraph, and then chooses the tightest bounds from the subgraphs; the new algorithm implicitly enumerates an exponential number of subgraphs and chooses one subgraph for each or that computes the tightest bound. The algorithm guarantees correctness on polytree Bayesian and credal networks, with and without additional marginal probability sentences that break the unique-assessment assumption. In general, however, there is no guarantee of correctness; this is the same as in classical belief propagation.
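The bound-tightening role of a variable node can be sketched as follows. The update rule itself is not reproduced in this text, so the sketch assumes one plausible reading: a variable sends to a factor the intersection of the intervals received from its other factor neighbors, and the vacuous interval when it has no other neighbor. The function name `variable_to_factor` and the factor labels are hypothetical.

```python
def variable_to_factor(incoming, target):
    """Interval message from a variable node to the factor `target`.

    incoming maps each neighboring factor to the (lower, upper) message it
    sent to this variable. ASSUMPTION: the outgoing message is the
    intersection of the intervals from all neighbors other than `target`.
    """
    others = [interval for factor, interval in incoming.items() if factor != target]
    if not others:
        # Degree-one variable: no information to pass on, send the vacuous interval.
        return (0.0, 1.0)
    lower = max(lo for lo, _ in others)  # tightest lower bound
    upper = min(up for _, up in others)  # tightest upper bound
    return (lower, upper)
```

Because every update takes a maximum of lower bounds and a minimum of upper bounds, the intervals sent by a variable can only be as tight as, or tighter than, each incoming interval it combines.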
Finally, let’s make sure that the modified algorithm solves the example correctly. The following are the messages at convergence.
Therefore, we get the correct lower and upper bounds for :
Appendix C Details of the Mastermind Experiments
Algorithm C.1 specifies how the puzzles are generated. We run Knuth’s algorithm three times in order to obtain a longer board and thereby reduce the number of ties among MAP codes with the same posterior probability. With three runs, most of the puzzles have a single MAP code to be used as the ground truth. In the case that there are multiple MAP codes, we consider an inference algorithm correct if it guesses any one of them.
Figure C.1 illustrates one puzzle generated by Algorithm C.1. Each row has two parts: the first is a guess by Knuth’s algorithm, which is composed of four colored pegs chosen from six possible colors; the second is the feedback, which may or may not be a lie.
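For reference, truthful feedback in the standard Mastermind game can be computed as in the sketch below. It assumes the usual convention of black pegs (right color, right position) and white pegs (right color, wrong position); the function name `feedback` and the digit encoding of colors are hypothetical, and the sketch does not model lying feedback.

```python
from collections import Counter


def feedback(secret, guess):
    """Return (black, white) peg counts for a guess against a secret code.

    black: positions where guess and secret agree.
    white: further color matches that sit in the wrong position.
    """
    if len(secret) != len(guess):
        raise ValueError("secret and guess must have the same length")
    black = sum(s == g for s, g in zip(secret, guess))
    # Multiset intersection counts color matches regardless of position.
    common = sum((Counter(secret) & Counter(guess)).values())
    return black, common - black
```

For example, with colors encoded as the digits 1 to 6, `feedback("1234", "1325")` counts one peg in the correct position and two correct colors in the wrong position.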
In addition to puzzles, we also need to generate knowledge as in (23-25) and the like. As shown, the formulas are the AND/OR of two consecutive ’s, and they alternate between AND and OR. Note that we could replace them with arbitrary formulas. For each formula, we first compute the exact point probability based on the values from step 2 of Algorithm C.1. Then we compute the widest probability interval for this formula if could take any value in [0.3,0.7]: for AND, and ; for OR, and . Then we sample a number uniformly from as the lower bound for the formula, and sample a number uniformly from as the upper bound. This ensures that the knowledge in sentences like (23-25) is correct.
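The sampling step can be sketched as follows. The sampling ranges are not reproduced in this text, so the sketch assumes that the lower bound is drawn between the lower end of the widest interval and the exact probability, and the upper bound between the exact probability and the upper end of the widest interval. The function name `sample_bounds` and its parameters are hypothetical.

```python
import random


def sample_bounds(exact, widest_lower, widest_upper, rng):
    """Sample a probability interval that is guaranteed to contain `exact`.

    ASSUMPTION: lower ~ U[widest_lower, exact], upper ~ U[exact, widest_upper].
    """
    if not widest_lower <= exact <= widest_upper:
        raise ValueError("exact probability must lie inside the widest interval")
    # min/max guard against floating-point rounding at the end points.
    lower = min(exact, rng.uniform(widest_lower, exact))
    upper = min(widest_upper, max(exact, rng.uniform(exact, widest_upper)))
    return lower, upper
```

Because the exact probability always lies between the two sampled numbers, the generated sentence is correct by construction, while its width varies from puzzle to puzzle.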
Appendix D Details of the Credit Card Fraud Detection Experiments
The LCN for the credit card fraud detection task is the following. Let denote the binary Is-Fraud variable; let denote the feature variable in the Naive Bayes classifier, and let denote the possible value for ; let denote the probability of in legitimate transactions in the training data, and let denote that for fraudulent ones; let denote the fraction of fraudulent transactions in the training data; let denote the antecedents in the three logic rules.
The first three equations (D.1)(D.2)(D.3) are exactly the Naive Bayes model. The latter six equations use three auxiliary variables , and the effect is similar to the noisy-OR model in Bayesian_midpoint and Credal. However, unlike credal/Bayesian networks, LCN does not require the unique-assessment assumption and therefore (D.1) is still allowed. Table 2 shows that this flexibility of LCN results in substantial performance gains.