The Dempster–Shafer (DS) theory is a framework for probabilistic reasoning based on observed data, and modeling of knowledge via the formal operations of projection and combination. In the seminal work Dempster (1963, 1966, 1967a, 1967b, 1968a, 1968b, 1969, 1972) developed a novel approach for inference which utilized the expressions of upper and lower probabilities. Together with Shafer (1976, 1979)
, this approach was refined to become known as the Dempster–Shafer theory of belief functions. Over the past decades, the DS theory saw continued adaptations to applications in the fields of signal processing, computer vision, artificial intelligence, and so on. As outlined most recently inDempster (2008, 2014), the Dempster–Shafer Extended Calculus of Probability (DS-ECP) represents a st century view of the theory as a potentially important tool for carrying out statistical inferences in scientific practice. This article presents an algorithm to carry out some of the computation involved in the task of DS inference for Categorical distributions with observed counts. It is intended as a computational companion to Dempster (1966, 1972) as suggested by the title, and presented in a self-contained way.
In the DS framework, inferences on user-defined assertions are expressed probabilistically. These assertions can be statements concerning model parameters, of the form “the parameter belongs to a certain subset of the parameter space”, or concerning future observations, such as “the next observation will belong to a certain subset of the observation space”. Contrary to Bayesian inference, no prior distribution is required of the parameters of interest within the DS framework, for which the randomness on the parameter space stems from the data sampling mechanism rather than from prior probabilities. Rather than Bayesian posterior probabilities, DS inference is expressed as a three-valued assessment of uncertainty. The probabilities “for”, “against”, and “don’t know” associated with the assertion of interest, or, add up to one. In Categorical distributions, if the parameter associated with the first category is denoted by , consider the assertion “” where is an interval contained in . Upon specifying a prior distribution, a Bayesian analysis will provide a posterior probability for “”, which is a number in and is complement to the posterior probability for “”. The DS analysis, on the other hand, does not require a prior distribution, but an explicit specification of how data are generated given parameters and given a random number generator. Instead of the two-valued posterior probabilities, the DS analysis reports three probabilities “for”, “against”, and “don’t know”, or , associated with “”, such that the three numbers add up to one.
Among different models considered in the early developments of DS, Categorical distributions for arbitrary dimensions were front and center, due to their generality and applicability to statistical subjects such as contingency tables. With the exception of the two-category case (the Bernoulli case), the computation required by the DS Categorical model proved to be demanding. As described below, DS inference involves distributions of convex polytopes within the simplex. The Categorical distribution model was described in Dempster (1966), and the marginal distribution of extreme vertices of the polytope was examined in Dempster (1972)
. Unfortunately, no closed-form joint distribution of the vertices exists, hindering both theoretical exposition and simple Monte Carlo sampling of the random object. The challenge promptedDenœux (2006) to comment that, “Dempster studied the trinomial case […] However, the application of these results to compute the marginal belief function of X has proved, so far and to our knowledge, mathematically intractable.” Likewise, Lawrence et al. (2009) commented: “[…] his method for the multinomial model is seldom used, partly because of its computational complexity.” Over the past fifty years, the literature saw a handful of alternative methods for Categorical inference using generalized types of probability structures, such as the Imprecise Dirichlet Model (Walley, 1996), the Dirichlet-DSM method (Lawrence et al., 2009), and the Poisson-projection method (Gong, 2018, unpublished PhD thesis). All of these alternatives are accompanied by analytical solutions to typical upper and lower probability and expectation calculations in the posterior, and the latter two were motivated in part by circumventing the computational hurdle put forward by the original DS formulation of Categorical inference. The early work of Dempster (1972) described some properties of the random convex polytopes of interest, and suggested that computational strategies should be within reach for simulating such objects.
We propose a Gibbs sampler to implement DS inference for Categorical distributions with categories, according to the modeling setup originally described by Dempster (1966, 1972). The method is applicable for arbitrary numbers of observations and numbers of categories . The cost of each iteration of the proposed algorithm increases with (linearly) and with (polynomially). Numerical experiments suggest that the algorithm is useful for non-trivial values of and . For the remainder of this section, we review the approach to Categorical inference under consideration. In Section 2, we describe the computational setup for general Categorical distributions with , and observe a mathematical equivalence between the feasibility constraint of the sampling problem and shortest path finding in fully connected graphs. Section 3 describes the proposed Gibbs sampler and its execution variants. Section 4 illustrates the proposed algorithms with numerical examples, including the testing of independence in contingency tables and parameter estimation of the linkage model. Results are compared with alternative methods of Categorical inference using imprecise probabilities. Section 5 concludes. Code in R (R Core Team, 2018) is available at github.com/pierrejacob/dempsterpolytope to reproduce the figures of the article, and the appendices provide further information.
We denote random variables and their realizations by the same letters. Bold letters refer to arrays of variables. The setfor is also denoted by or . Therefore we write for , or , or . We use calligraphic letters for sets. We write for the indicator function, i.e. is one if and otherwise. To refer to a uniform variable over a set we write or . The -dimensional simplex will be denoted by , and is defined as . We denote its vertices by . In barycentric coordinates, , and so on. A polytope is a set of points satisfying linear inequalities, of the form , where the inequality is understood component-wise, is a matrix with columns and
is a vector.
1.3 Inference approach under consideration
We first consider the case where the observations are each in . For this case Dempster (1966) obtains various analytical results. We use this simple setting as a starting point to present the general strategy that will apply to the case . The probabilistic model is , independently for all , and is the unknown parameter. Probabilistically this is equivalent to the sampling mechanism
The approach under consideration “propagates” the randomness associated with the variables to the parameter space, which enables probabilistic statements about measurable subsets of the parameter space, without directly specifying a prior distribution on the parameter as in the Bayesian approach. We will denote by the set of indices such that , for , i.e. . The cardinalities of these sets will be denoted by , so that . Consider the set
of all possible realizations of which could have produced the data for some . It is possible to check that for any , any is such that for all . We will denote the set by and refer to it as a “feasible set”. Here and more generally, is called a multi-valued mapping (Dempster, 1967a) from the auxiliary variable space (of ) to subsets of the parameter space.
For an arbitrary , the set can very well be empty; but by definition if then . Indeed can be expressed also as
and then it is apparent that is the set of ’s such that is non-empty. The approach under consideration relies on the distribution of when
follows the uniform distribution on, denoted by
, which has probability density function
where denotes the volume of . Consider a subset of , corresponding to an “assertion” about the parameter. For instance we may wonder about being in a certain sub-interval of . We define the set of points in such that there exists . This leads to the definition of upper probabilities:
The set on the other hand contains the points such that no makes the data compatible with such ; in other words contains . This leads to the definition of lower probabilities:
with for all , quantifying support “for”, “against”, and “don’t know” about the assertion . As illustrated in Dempster (2008) and Gong (2019), the triple draws a stochastic parallel to the three-valued logic, with r representing weight of evidence in a third, indeterminate logical state. A p or q value close to 1 is interpreted as strong evidence towards or , respectively. If r is large, it suggests that the model and data at hand is structurally deficient in making a judgment about the assertion , or equivalently its negation. Using sampling methods, we can approximately sample from and thus approximate all quantities from (1.4) through (1.6) following the Monte Carlo approach. For completeness Appendix A gives some details on how to obtain triplets associated with assertions on future observations.
Upper and lower probability functions, also called plausibility and belief functions (Shafer, 1976), are axiomatic generalizations of probability functions. Both functions satisfy that the empty set must be assigned probability 0: , and that the whole set must be assigned probability 1: . The upper probability is sub-additive and the lower probability is super-additive. Furthermore, and are conjugate in the sense that for all
. As part of the DS extended calculus of probability, these calculations generalize Bayesian reasoning by propagating information through a graph defined by structural equation models, extending for example belief propagation in Bayesian networks.
1.4 A Gibbs sampler for the Bernoulli case
We quickly outline a Gibbs sampling strategy to sample from in the case . From (1.2) and (1.3) we can obtain the conditional distribution of , also denoted by , given , denoted by . Similarly we can obtain the conditional distribution of given . These conditional distributions under are given by
so that both conditional distributions take the form of products of Bernoulli distributions. These can be sampled from as part of a Gibbs sampling strategy; see e.g.Robert and Casella (2004) for an introduction to Monte Carlo methods. The sampler generates in , and associated feasible sets , that can be used to approximate triplets via Monte Carlo approximations of (1.4)-(1.5). The main contribution of this article is to extend the above reasoning to the general case .
We note as a side remark that the above conditionals only involve the statistics and
which can be sampled directly from translated and scaled Beta distributions. Doing so would not improve the convergence rate of the Gibbs sampler, but would decrease the cost per iteration fromto . Whether such a shortcut can be implemented for the general case is an open question.
2 Categorical distributions
2.1 Sampling mechanism
The goal is to perform inference on the parameters of a Categorical distribution with categories, using observations for all . Probabilistically, the model states that for all , . Sampling from a Categorical distribution can be done in different ways. In likelihood-based approaches this ambiguity has no bearing on the inference, but in the DS approach it does; and indeed different sampling schemes lead to different inferences as we will see when comparing the approaches of Dempster (1966) and Lawrence et al. (2009). We consider the sampling mechanism as in Dempster (1966) defined as
The map takes a pair as input and returns a value in . To describe the map we define the sets , where each refers to a polytope, with the same vertices as except that vertex is replaced by . The sets form a partition of : their union is and their interiors are mutually exclusive. Then is the index such that . We can write equivalently or . We can also write , which gives an explicit form to the map . We will use the notation for the indices such that , as before. Each has components denoted by for .
We recall the following characterization of points in , which plays an important role in the forthcoming developments.
(Lemma 5.2 in Dempster (1966)). For with coordinates , a point is within if and only if for all , .
In the case , Figure 1 (left) shows a triangle split into three sets . Recall that the points such that is equal to a constant value , lie in a hyperplane that contains all except the th and th vertices of the simplex , i.e. a line when , a plane when , and so on. Also, an inequality is equivalent to , so it is a linear inequality on . Figure 1 (right) shows an equality constraint of the form as a dashed line emitting from the second vertex, and a linear inequality as a shaded region within the simplex .
2.2 Feasible sets
Mimicking the Bernoulli case and (1.1), we can define the set as
Given a realization of by definition there is a non-empty feasible set defined as
which constitutes the basis of the statistical approach under consideration.
We consider data with , , observed counts equal to . For this data set, an example of draws in is represented in Figure 2 (left). In the figure, is shown in red if , in green if and in blue if . Thus according to the observed counts, the figure shows red dots, green dots and blue dots in total. The figure also shows six dashed lines, each of them corresponding to a linear constraint , for and . The feasible set is shown as a black polygon. Figure 2 (right) shows a hundred feasible sets generated by the proposed Gibbs sampler.
In the Bernoulli case, we first went from (1.1) to (1.2), where is represented without explicit mention to a parameter , and from there we could derive the conditional distribution of given the other components of under . We will follow the same steps in the general case. We first proceed to obtain a generalization of (1.2), that is, a representation of involving only relationships between the components of and no explicit mention of .
We first find an equivalent definition of the feasibility of for a given . Above it was defined by the requirement that, for all . For each , considering the points with and using Lemma 2.1, we have equivalently
We next define
Thus we see that is equivalent to for , upon defining .
Next we assume and consider some of the implications on the values . First, for all
A geometric interpretation of “” is that, considering “line constraints” represented e.g. in Figure 2 (left), the pair of lines and coming from vertex should be ordered in a certain way, i.e. the angle between the lines should be of a certain sign. This is represented in Figure 3 (left).
Similarly we can write as , obtaining for all the inequalities
An illustration of these inequalities in the case can be found in Figure 3 (right); see also Appendix B for examples where inequalities of the form are satisfied while some inequalities of the form are violated, and vice versa.
If , we can iterate the above reasoning, expressing as , and so on. Overall we can write, for all , with any number of indices , the following constraints:
We can drop “” from the notation for simplicity. Note that the case gives inequalities of the form , which are always satisfied provided that we define for all . Furthermore, we can consider only indices that are different from one another. Indeed if some of the above indices are equal, then the inequality is implied by inequalities obtained for smaller values of . For instance is implied by and . More formally, suppose with . Then is implied by and ”, an argument with recursive applicability.
At this point, we observe an important connection between (2.4) and directed graphs, which helps with both the intuition for the validity of the proposed algorithm and accelerating its computation. The indices in
can be viewed as vertices of a fully connected, directed graph. Directed edges can be represented as ordered pairs. Then we can associate the product with a sequence of edges, , , up to . That sequence of edges forms a path from vertex back to vertex , of length . In other words, it is a directed cycle of length .
Define for all , and consider it to be the “weight” of edge . Then the inequality is equivalent to . The sum of the weights along a path can be referred to as the “value” of the path (as opposed to its “length”, which refers to the number of edges constituting the path). The constraints in (2.4) are then equivalent to the following statement: in a fully connected graph with vertices and with weights on edge for all , the value of any cycle is non-negative. Detecting whether graphs contain negative cycles is a standard question in the study of graphs, and can be done with e.g. the Bellman–Ford algorithm (e.g. Bang-Jensen and Gutin, 2008).
To summarize, we have established that the existence of is equivalent to for all with defined in (2.3). In turn, this implies the inequalities of (2.4), which can be understood as graph constraints. Our next result states that the converse holds also, and therefore that the existence of is equivalent to the values , defined as for all , satisfying (2.4).
There exists satisfying for all if and only if the values satisfy
Furthermore we can consider only indices different from one another in the above statement.
The forward implication has been shown above. The proof of the reverse implication explicitly constructs a feasible based on the values , assuming that they satisfy (2.5).
Introduce the fully connected graph with vertices, and assign as a weight to edge . For any path in the graph from vertex to vertex , there is a value defined as the sum of the weights of the edges along that path. Thanks to the values satisfying (2.5), there are no negative cycles in the graph. Thus one cannot decrease the value of a path by appending a cycle to it. Since there are only finitely many paths without cycles from vertex to vertex , there is a finite value corresponding to the smallest value of any path from to . Let us denote the minimum value over all paths from to by . We have just used (2.5) to argue that , and we also have , thus is finite for all .
We next choose a vertex in arbitrarily, for instance let us select vertex . We define through , for all ; this is possible thanks to the finiteness of for all . We can check that , by construction. Then, the ratio is equal to . But we can always write
because the right hand side is the value of a path from to (via ), while the left hand side is the smallest value over all paths from to . Upon taking the exponential, the above inequality is equivalent to . Thus we have constructed a point satisfying for all . ∎
where we recall that is defined as . This can be seen as the generalization of (1.2) to the case of arbitrary . Indeed is now expressed in terms of inequalities that the components of must satisfy with respect to one another, without explicit mention of parameters . We can next find convenient expressions for conditional distributions under , that will lead to the proposed Gibbs sampler.
2.3 Conditional distributions
Equipped with the representation of in (2.6), we now consider the problem of sampling (also denoted as ) given (also denoted as ) from their conditional distribution under , which we denote by . We fix , assume that , and note that the points enter the definition of (which stands for , but not the definition of for . Thus, updating the entries of leads to new values of , but all the other ’s remain unchanged. We next consider what constraints can be imposed on to guarantee that satisfies (2.6)
As previously discussed, we can consider the indices in (2.6) to be unique. Thus there is at most one such index equal to . Listing the inequalities that feature and keeping only the terms on the left side of the inequality sign, we obtain the inequalities,
Thus, for to remain in after updating its components , it is enough to check that the ratios for and are lower bounded by expressions that depend only on with , and thus only on . We obtain the following result that characterizes the conditional distribution .
Let , and define for all . For each , given , there is a that can be constructed explicitly using such that is the uniform distribution on , the -ary Cartesian product of .
In other words is the product measure where the component of follows the uniform distribution on for all .
Let and let us construct as follows. Consider the fully connected directed graph with vertices, and weight on edge as in the proof of Proposition 2.4, except that we only consider edges with . Define to be the minimum value of all paths from to . Finiteness of results from a similar reasoning as in the proof of Proposition 2.4 since belongs to . Note that can be constructed without the entries of , because the shortest path from to should pay no attention to any directed edges that stem from , that is, .
Next we define , for , and by normalizing so that . It remains to be shown that the support of is exactly .
Let . By Lemma 2.1 and the definition of , we have
Finally, let us show that the support is exactly by considering . Then, for some and some , we have . Denote by the path attaining the value . We obtain
We now have all the elements to design a Gibbs sampler targeting .
3 Gibbs sampler
We start the algorithm by drawing a vector as follows. We set in the simplex; it could be the center of the simplex, or equal to the observed proportions of each category, or a draw from a Dirichlet distribution. We then sample, for each , for each , uniformly in , and then set . Then because at least is in . Sampling from can be done following Algorithm 1, see also equation (5.7) in Dempster (1966).
3.2 Gibbs step
Assume that , and define for all as before. We next consider drawing entries for an arbitrary from the distribution . This will constitute one iteration of a Gibbs sampler; the overall algorithm will cycle through the categories, either deterministically or randomly, and generate a sequence iteratively, such that it converges to as .
According to Proposition 2.5 we can draw independently from the uniform distribution on for a certain , e.g. using Algorithm 1. The vector was constructed in the proof of the proposition and implementation is discussed below.
3.2.1 Using shortest path algorithms
The point is defined in the proof of Proposition 2.5 by first computing , the minimum value over all paths from to , for . Enumerating all paths would be cumbersome for even moderate values of , but we can use shortest path algorithms designed for this task. For instance we can use the Bellman–Ford algorithm as implemented in the igraph package (Csardi and Nepusz, 2006). We can thus compute as follows.
Set weight of edge to be for all , in a fully connected graph with vertices.
For all , define as follows.
Find the shortest among paths from to , with value .