 # In principle determination of generic priors

Probability theory as extended logic is completed such that essentially any probability may be determined. This is done by considering propositional logic (as opposed to predicate logic) as syntactically sufficient and imposing a symmetry from propositional logic. It is shown how the notions of ‘possibility’ and ‘property’ may be sufficiently represented in propositional logic such that 1) the principle of indifference drops out and becomes essentially combinatoric in nature and 2) one may appropriately represent assumptions where one assumes there is a space of possibilities but does not assume the size of the space.


## I Introduction

It can be argued that Bayesian probability theory is the general framework for scientific prediction and inference Jaynes:2003; Finetti. It is a calculus for normative statements

$$P(A|B),$$

where $A$ and $B$ are propositions. These statements encode degrees of belief/plausibility of $A$, given $B$. Fully half of probability theory is encoded into two simple rules (footnote 1: in the notation we are using, $AB$ corresponds to "$A$ and $B$", $A+B$ to "$A$ or $B$", and $\bar{A}$ to "not $A$"),

$$
\begin{aligned}
\text{Product rule:}\quad & P(AB|C) = P(A|C)\,P(B|AC) = P(B|C)\,P(A|BC) \\
\text{Sum rule:}\quad & P(A|B) + P(\bar{A}|B) = 1,
\end{aligned}
$$

from which we have a generalised sum rule

 P(A+B|C)=P(A|C)+P(B|C)−P(AB|C).
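The step from the two basic rules to the generalised sum rule is not spelled out here. A sketch of the standard derivation, assuming De Morgan's law $\overline{A+B} = \bar{A}\bar{B}$ from Boolean algebra and applying the sum and product rules alternately:

$$
\begin{aligned}
P(A+B|C) &= 1 - P(\bar{A}\bar{B}|C) \\
&= 1 - P(\bar{A}|C)\,P(\bar{B}|\bar{A}C) \\
&= 1 - P(\bar{A}|C)\,\{1 - P(B|\bar{A}C)\} \\
&= P(A|C) + P(\bar{A}B|C) \\
&= P(A|C) + P(B|C)\,\{1 - P(A|BC)\} \\
&= P(A|C) + P(B|C) - P(AB|C).
\end{aligned}
$$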

These rules give relationships between different probabilities but do not constrain the probabilities enough to uniquely determine them Cox:1946 . This is the incompleteness of probability theory.

In particular, in inference we often want to determine the probability of a hypothesis $H$ given data $D$ and perhaps some background knowledge $I$,

$$P(H|DI) = \frac{P(D|HI)\,P(H|I)}{P(D|I)}.$$

Probabilities like $P(H|I)$ are often called ‘priors’. These need to be determined to work out $P(H|DI)$.

Many methods have been invented to determine priors for different situations. If one is a Subjective Bayesian where a probability is a degree of belief relative to some agent, the (perhaps counterfactual) agent is free to just choose the value of their priors based on intuition or gut instinct. If one is an Objective Bayesian, one wants to find methods of derivation such that each probability assignation can be considered unique and agent independent (although different agents may make different assumptions and so may still consider different degrees of plausibility; i.e., Objective Bayesians still consider probability theory subjective in one sense). Such methods include Laplace’s principle of indifference, transformation group methods, and maximum entropy methods Shore:1980 . However, methods such as these are not always of use for calculation of generic priors.

I propose a method whereby in principle any prior may be determined. Moreover, it is a method for calculating essentially any probability. (Footnote 2, on "essentially": we shall be using a finite sets policy where one always starts with finite sets and only then takes limits. See Jaynes:2003 for detailed motivation. Footnote 3, on "any": there is also no general method for calculating factors $P(D|HI)$, called likelihoods, unless $H$ predicts $D$ or $\bar{D}$ for certain; the calculation often reduces to calculation of a prior.) This will be accomplished by completing the Objective Bayesian approach of probability theory as extended logic Jaynes:2003; Cox:1961 by imposing a symmetry and treating probability theory as syntactically complete.

## II Logic and Extended Logic

To understand the completion I am proposing, we must understand certain aspects of logic that I propose to impose on probability theory.

Consider the following logical argument:

$$
\begin{array}{l}
A \\
B \\
\therefore\; C
\end{array}
$$

This is not a valid argument. The basic propositions may have meanings for us that are not reflected in the formulation of the argument. For example, we may want the correspondence

$$
\begin{aligned}
A &: \text{Socrates is a man.} \\
B &: \text{If Socrates is a man, then Socrates is mortal.} \\
C &: \text{Socrates is mortal.}
\end{aligned}
$$

A better formulation of the argument will then be to use the logical implication $A \to C$ instead of $B$. We then get the new argument (footnote 4: note that here we are not considering uppercase propositions to be propositional variables; within each argument, propositions are constant. It is the arguments that change.),

$$
\begin{array}{l}
A \\
A \to C \\
\therefore\; C
\end{array}
$$

which is a valid argument. One may see from this a trivial aspect of logic; the validity of an argument is dependent on the structure of the argument. One needs to sufficiently define the structure in order to make an argument that is appropriate.
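The dependence of validity on structure alone can be checked mechanically with a truth table. The following is a minimal sketch, not part of the original argument; the helper name `is_valid` and the encoding of propositions as Python functions are assumptions made for illustration.

```python
from itertools import product

def is_valid(premises, conclusion, names):
    """True if every truth assignment satisfying all premises also satisfies the conclusion."""
    for values in product([True, False], repeat=len(names)):
        env = dict(zip(names, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises true, conclusion false
    return True

names = ["A", "B", "C"]

# First formulation: A, B, therefore C (B is an unrelated basic proposition).
first = is_valid([lambda e: e["A"], lambda e: e["B"]], lambda e: e["C"], names)

# Second formulation: A, A -> C, therefore C (material implication).
second = is_valid(
    [lambda e: e["A"], lambda e: (not e["A"]) or e["C"]],
    lambda e: e["C"],
    names,
)

print(first, second)  # False True
```

Nothing in the check refers to what the propositions mean; only the form of the premises and the conclusion enters.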

More importantly, the structure of an argument is the only thing the validity is dependent upon. Any influence on the validity of a logical argument beyond the form of the stated argument is extra-logical. One may have extra-logical influences in two ways; the choice of rules that define the logic used may be changed such that one uses a different logic; and one may have some meaning for a proposition in mind that is not defined within the argument. For this second influence, we shall take the position that one has insufficiently represented one's premises and thus the argument is not well formulated. For the first influence, this is a legitimate endeavour. We shall however stick to propositional logic and see how far we can go.

An argument is also independent of the labels one uses for the propositions; what is important is logical structure.

The position we are taking is sometimes called a syntactic logical interpretation of probability theory. Many philosophers take the position that there is meaning for some propositions relevant to scientific inference and prediction that cannot be defined through logical structure alone. These issues and others are discussed in the remarks.

We start by following Cox Cox:1961 who derived the product and sum rules from basic desiderata to extend logic. In the system, our primitive objects are degrees of plausibility

 A|B,

of $A$ given $B$, that are equal to real numbers. Probabilities are positive, continuous, monotonic functions, $f$, of plausibilities

$$P(A|B) = f(A|B).$$

The function, $f$, is chosen such that certainty corresponds to the number 1. This choice is arbitrary but it leads to the particularly simple forms of the product and sum rules we have given.

We shall consider a degree of plausibility as directly analogous to a logical argument. This means two things:

1. The value of a plausibility (analogous to the validity of an argument) is dependent on only the explicit logical structure. From now on, any non-basic proposition will be written in a form such as $X[A_1,A_2]$ or $X[A_i]$, which displays the basic propositions it is built from, as opposed to a bare $X$. When calculating a prior for such a proposition, the product and sum rules will constrain it to be functionally related to probabilities of basic propositions. We may then isolate the probabilities that cannot be calculated by using only the product and sum rules, and the rules of Boolean algebra.

2. Directly related to the above, the value of a degree of plausibility will not depend on the labels used on basic propositions. This gives us a powerful trick that is fundamental to derivations Jaynes:2003 of indifference and transformation group methods. For example,

 P(A1|X[A1,A2])=P(A2|X[A2,A1]).

If $X$ is permutation symmetric; i.e., $X[A_1,A_2] = X[A_2,A_1]$, then

 P(A1|X[A1,A2])=P(A2|X[A1,A2]),

which gives us a non-trivial constraint. Relabelling can also be used for individual probabilities separately within a functional relationship. For example,

$$
\begin{aligned}
P(A_1+A_2|A_3) &= P(A_1|A_3) + P(A_2|A_3) - P(A_1A_2|A_3) \\
&= 2P(A_1|A_3) - P(A_1A_2|A_3).
\end{aligned}
$$

## III Exclusivity, Exhaustivity and Indifference

In nearly every probability calculation exclusivity and exhaustivity for some set of possibilities are (usually implicitly) assumed. Exclusivity means that if you assume that $A_i$ is true, then you may infer that any $A_j$, in some set $\{A_j\}$ where $j \neq i$, must be false. Exhaustivity means that you assume at least one of the set is true. These assumptions are often parametrised by the conditions

$$
P(A_iA_j|I[A_i]) = P(A_i|I[A_i])\,\delta_{ij}; \qquad P\Big(\sum_{i=1}^{n}A_i\Big|I[A_i]\Big) = 1.
$$

If one wants to write a probability in terms of a mixture of others in the normal way, then it is necessary to make these assumptions;

$$
\begin{aligned}
P(B|X[A_i,B]) &= P\Big(B\sum_{i=1}^{n}A_i\Big|X[A_i,B]\Big) \\
&= \sum_{i=1}^{n}P(BA_i|X[A_i,B]) \\
&= \sum_{i=1}^{n}P(A_i|X[A_i,B])\,P(B|A_iX[A_i,B]).
\end{aligned}
$$

If one doesn’t assume exclusivity, then one gets extra terms in the above equation.
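To make the extra terms concrete, consider the smallest case, $n=2$, assuming exhaustivity but not exclusivity. This is only an illustration using the generalised sum rule and the product rule, with $X$ standing for the assumptions:

$$
\begin{aligned}
P(B|X) &= P(B[A_1+A_2]|X) \\
&= P(BA_1|X) + P(BA_2|X) - P(BA_1A_2|X) \\
&= P(A_1|X)\,P(B|A_1X) + P(A_2|X)\,P(B|A_2X) - P(A_1A_2|X)\,P(B|A_1A_2X).
\end{aligned}
$$

The final term vanishes only when $P(A_1A_2|X)=0$, i.e., when exclusivity is assumed.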

One may consider exclusivity and exhaustivity as the sufficient logical definition of ‘possibility’ and ‘property’ Harrigan:2010 ; Pusey:2012 . A property may be considered a coarse grained possibility; for example, any classical observable in Hamiltonian mechanics segments the space of possible states, defining a property. In particular, the energy of a system is a property with various values of the energy related to various sets of possible states. The meaning that differentiates different types of properties and possibilities is defined by other logical relationships one assumes. So energy is a classification differentiated by causal structure (i.e., the equations one uses); we assume that if a system has a certain value for its energy at a certain time and is isolated, then the system will have the same energy at a later time; we have logical correlation.

One problem is that the conditions for exclusivity and exhaustivity are not derived from the explicit form of our assumptions $I[A_i]$. The explicit form of $I[A_i]$ for the simple set $\{A_1, A_2, A_3\}$ is as follows:

$$
M_3[A_i] = A_1\bar{A}_2\bar{A}_3 + \bar{A}_1A_2\bar{A}_3 + \bar{A}_1\bar{A}_2A_3 + \bar{A}_1\bar{A}_2\bar{A}_3
$$

for exclusivity and

$$
\begin{aligned}
X_3[A_i] ={}& A_1A_2A_3 + A_1A_2\bar{A}_3 + A_1\bar{A}_2A_3 \\
&+ \bar{A}_1A_2A_3 + A_1\bar{A}_2\bar{A}_3 + \bar{A}_1A_2\bar{A}_3 \\
&+ \bar{A}_1\bar{A}_2A_3
\end{aligned}
$$

for exhaustivity. This gives us

$$
I_3[A_i] = M_3[A_i]\,X_3[A_i] = A_1\bar{A}_2\bar{A}_3 + \bar{A}_1A_2\bar{A}_3 + \bar{A}_1\bar{A}_2A_3.
$$

These functions have been written in a non-minimal way where the function is a sum of terms with each term being a product of $A_i$'s and $\bar{A}_i$'s and containing either $A_i$ or $\bar{A}_i$ for all $i \le n$ for some $n$. In this form, which is often called disjunctive normal form, the terms are mutually exclusive by definition. Every propositional function can be written in this form. Each function can then be associated with a subset of the power set of $\{A_i\}$, where each term is associated with an element of the power set (the set of propositions that appear un-negated in the term). The sum of the terms associated with the whole power set is exhaustive by definition. This will give us great flexibility in calculation.
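The bookkeeping above can be reproduced by treating each term of the non-minimal disjunctive normal form as one truth assignment to $(A_1, A_2, A_3)$. The sketch below is illustrative only; representing terms as tuples of booleans and conjunction as set intersection is an assumption of this example, not notation from the text.

```python
from itertools import product

N = 3
# Each full DNF term corresponds to one truth assignment to (A1, A2, A3).
assignments = list(product([True, False], repeat=N))

# M3: at most one proposition is true (exclusivity).
M3 = {a for a in assignments if sum(a) <= 1}
# X3: at least one proposition is true (exhaustivity).
X3 = {a for a in assignments if sum(a) >= 1}
# The conjunction of two functions in full DNF keeps the terms they share.
I3 = M3 & X3

def show(term):
    """Render a term, writing ~Ai for a negated proposition."""
    return "".join(f"A{i + 1}" if v else f"~A{i + 1}" for i, v in enumerate(term))

print(len(M3), len(X3), len(I3))  # 4 7 3
print([show(t) for t in sorted(I3, reverse=True)])
```

The three surviving terms are exactly those in which a single proposition appears un-negated.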

Let us now calculate

$$
\begin{aligned}
P(A_1|I_3[A_i]) &= \frac{P(A_1\bar{A}_2\bar{A}_3|)}{P(A_1\bar{A}_2\bar{A}_3|) + P(\bar{A}_1A_2\bar{A}_3|) + P(\bar{A}_1\bar{A}_2A_3|)} \\
&= \frac{1}{3}\times\frac{P(A_1\bar{A}_2\bar{A}_3|)}{P(A_1\bar{A}_2\bar{A}_3|)} \\
&= \frac{1}{3},
\end{aligned}
$$

where we have used relabelling. Probabilities of the form $P(\cdot\,|)$ refer to probabilities with minimal assumptions. This will be defined later. Indifference can be seen as fundamentally combinatoric in nature, where the probability is directly related to the number of ways one may assign a single non-negated proposition in a product of propositions. We can generalise to any $n$ in a simple manner.

One aspect of the above derivation is that the probabilities cancel out such that they do not need to be calculated. This is suggestive of why indifference could be derived in the past Jaynes:2003 without needing to go beyond the basic sum and product rules.

The above derivation also shows us that exclusivity and exhaustivity are not just necessary but also sufficient in deriving indifference.
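The generalisation to any $n$ can be sketched by counting. Because every term of $I_n[A_i]$ is a relabelling of every other, the terms carry equal weight and that weight cancels, so only the number of terms matters. The function name `prob_A1_given_I` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob_A1_given_I(n):
    """P(A1 | I_n[A_i]) by counting full-DNF terms, each with equal weight."""
    # Terms of I_n: exactly one of A1..An is un-negated.
    terms = [a for a in product([True, False], repeat=n) if sum(a) == 1]
    # Terms in which A1 is the un-negated proposition.
    favourable = [a for a in terms if a[0]]
    return Fraction(len(favourable), len(terms))

for n in (2, 3, 5):
    print(n, prob_A1_given_I(n))
```

For each $n$ the result is $1/n$, which is the usual statement of indifference.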

## IV Determining Generic Probabilities

We define a working set as the set of all propositions we are working with for a particular probability. This set may be made arbitrarily large:

$$
\begin{aligned}
P(A|B) &= P(A|B)\,P(C+\bar{C}|AB) \\
&= P(A[C+\bar{C}]|B) \\
&= P(A|[C+\bar{C}]B).
\end{aligned}
$$

From this one can see we may add an arbitrary number of tautologies to the premises. One may consider this arbitrariness an important criterion for probability theory; we want our probabilities to be stable under arbitrary additions of tautologies to our assumptions. It is interesting that the product and sum rules give this to us for free.
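The stability under added tautologies can be checked numerically on any joint assignment of weights. In the sketch below the weights over $(A, B, C)$ are arbitrary, hypothetical numbers chosen only for illustration; any positive values would serve.

```python
from fractions import Fraction
from itertools import product

# Hypothetical, arbitrary positive weights over the truth values of (A, B, C).
weights = dict(zip(product([True, False], repeat=3), [3, 1, 4, 1, 5, 9, 2, 6]))

def P(event, given):
    """Conditional probability of `event` given `given` under the weights above."""
    num = sum(w for s, w in weights.items() if event(*s) and given(*s))
    den = sum(w for s, w in weights.items() if given(*s))
    return Fraction(num, den)

A = lambda a, b, c: a
B = lambda a, b, c: b
taut = lambda a, b, c: c or not c  # the tautology C + not-C

lhs = P(A, B)                                           # P(A|B)
mid = P(lambda a, b, c: a and taut(a, b, c), B)         # P(A[C + not-C]|B)
rhs = P(A, lambda a, b, c: taut(a, b, c) and b)         # P(A|[C + not-C]B)

print(lhs == mid == rhs)  # True
```

The equality holds whatever weights are chosen, since the tautology never excludes a term from either the numerator or the denominator.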

An important thing to note is that we are allowing propositions like $C$ in the conclusions that have no representation in the premises.

Probabilities with minimal assumptions may be written as

$$
P(Z[A_1,\dots,A_n]|) = P(Z[A_1,\dots,A_n]|Q_n[A_1,\dots,A_n]),
$$

where $Q_n[A_1,\dots,A_n]$ is a tautology in the working set, such as $\prod_{i=1}^{n}(A_i+\bar{A}_i)$. We may thus consider probabilities $P(\cdot\,|)$ as ones either assuming nothing or only tautologies.

Let us now turn our attention to a generic probability

 P(Z[Ai]|Y[Ai]).

We may write

$$
P(Z[A_i]|Y[A_i]) = \frac{P(Z[A_i]Y[A_i]|)}{P(Y[A_i]|)}.
$$

Both of these factors may be written as sums of terms of the form $P(A_1\cdots A_j\bar{B}_1\cdots\bar{B}_k|)$. These terms may be decomposed using the product rule. At this point the derivations of Jaynes Jaynes:2003 and Cox Cox:1961 give us no further support. Jaynes appeared Jaynes:2003 (P35) to consider probability theory complete as he expected one to always have background knowledge to determine the terms. Here we are explicitly assuming only tautologies with probabilities $P(\cdot\,|)$ and hence cannot rely on such background knowledge. Moreover, as we are aiming at generality, we do not want to rely on such background knowledge.

To determine the terms, consider the following symmetry: The validity of an argument is invariant under swapping a basic proposition with its negation in both the premises and the conclusions. Moreover, the swapping symmetry is a symmetry of logical structure. It may be seen as directly related to the double negation rule; imposing the symmetry on a trivial identity gives us the rule:

$$
\bar{A} = \bar{A} \;\longrightarrow\; A = \bar{\bar{A}}.
$$

Consider also that a possible state of affairs may be referred to by either $A$ or $\bar{A}$, with both choices giving equal consequences for argumentation. The two propositions are defined in contrast to one another (their truth values are opposed) and are not distinguished within the system in any other way. This lack of distinguishing factors is made more apparent when possibility is seen as an explicit extra assumption; the proposition $\bar{A}$ does not by definition mean that a proposition from a set of possibilities, other than $A$, must be true. Such meaning comes from an explicit assumption about the set of possibilities.

I assert that our degrees of plausibility must satisfy the symmetry in order to not introduce an extra-logical bias into our framework. This may be considered part of the desideratum of consistency used in Jaynes:2003 .

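Consider the notation $x^{j}_{k} \equiv P(A|A_1\cdots A_j\bar{B}_1\cdots\bar{B}_k)$. From our symmetry, one may impose (footnote 5: this condition is imposed on the plausibilities but may be stated in terms of probabilities)

$$
x^{j+1}_{k} = x^{j}_{k+1} \quad \forall j,k \ge 0. \qquad (1)
$$

We now prove a lemma: if $x^{j_0}_{k_0} = P(A|)$ then $x^{j_0+1}_{k_0} = x^{j_0}_{k_0+1} = P(A|)$.

Proof: Assume $x^{j_0}_{k_0} = P(A|)$ for some $j_0, k_0$. Then

$$
\begin{aligned}
P(A|) &= P(A[A_q+\bar{A}_q]|A_1\cdots A_{j_0}\bar{B}_1\cdots\bar{B}_{k_0}) \\
&= P(A|)\,P(A|A_1\cdots A_{j_0}A_q\bar{B}_1\cdots\bar{B}_{k_0}) + \{1-P(A|)\}\,P(A|A_1\cdots A_{j_0}\bar{B}_1\cdots\bar{B}_{k_0}\bar{A}_q) \\
&= P(A|)\,x^{j_0+1}_{k_0} + \{1-P(A|)\}\,x^{j_0}_{k_0+1},
\end{aligned}
$$

where the weights $P(A_q|A_1\cdots A_{j_0}\bar{B}_1\cdots\bar{B}_{k_0})$ and $P(\bar{A}_q|A_1\cdots A_{j_0}\bar{B}_1\cdots\bar{B}_{k_0})$ have been replaced by $P(A|)$ and $1-P(A|)$ using relabelling and the assumption. Then by eq. (1),

$$
P(A|) = x^{j_0+1}_{k_0} = x^{j_0}_{k_0+1}.
$$

From our lemma and (1), we get by induction from $x^{0}_{0} = P(A|)$,

$$
\forall j,k \ge 0,\quad x^{j}_{k} = P(A|).
$$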

Now see

$$
P(A_1\cdots A_j\bar{B}_1\cdots\bar{B}_k|) = \prod_{l=1}^{j}P(A_l|\mu_l)\prod_{r=1}^{k}\{1-P(B_r|\nu_r)\} = P^{j}(A|)\,\{1-P(A|)\}^{k},
$$

where $\mu_l$ and $\nu_r$ are products of various $A$'s and $\bar{B}$'s. To determine $P(A|)$, we impose the symmetry again:

$$
A| = \bar{A}|.
$$

From this condition, together with the sum rule, one arrives at

$$
P(A|) = \frac{1}{2}.
$$

Our generic probability then becomes

$$
P(Z[A_i]|Y[A_i]) = \frac{M}{N}\,2^{n-m},
$$

where $M$ and $N$ are just the number of terms in $Z[A_i]Y[A_i]$ and $Y[A_i]$ respectively when the functions are written in minimal disjunctive normal form. Moreover, $M$ can heuristically be thought of as an unnormalised overlap between $Z[A_i]$ and $Y[A_i]$. The numbers $m$ and $n$ are the numbers of basic propositions in the terms of $Z[A_i]Y[A_i]$ and $Y[A_i]$ respectively, written in minimal disjunctive normal form.
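The counting behind the generic probability can be sketched directly. The example below is an illustration under a simplifying assumption: both functions are expanded in full (non-minimal) disjunctive normal form over one shared working set, so that every term has probability $2^{-n}$ and the power-of-two factor equals 1. The function name `prob` and the encoding of propositional functions as Python predicates are hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob(Z, Y, n):
    """P(Z|Y) over n basic propositions, each full-DNF term weighted 2**-n.

    Z and Y take a tuple of n booleans (the truth values of A1..An).
    """
    terms = list(product([True, False], repeat=n))
    N = sum(1 for t in terms if Y(t))            # number of terms in Y
    M = sum(1 for t in terms if Z(t) and Y(t))   # number of terms in ZY
    if N == 0:
        raise ValueError("the premises Y are contradictory")
    return Fraction(M, N)  # the common factor 2**-n cancels

# A single proposition given only a tautology.
print(prob(lambda t: t[0], lambda t: True, 1))                   # 1/2
# Both A1 and A2, given that at least one of them is true.
print(prob(lambda t: t[0] and t[1], lambda t: t[0] or t[1], 2))  # 1/3
```

Working in full rather than minimal form trades compactness for simplicity: the term counts grow as $2^n$, but no minimisation step is needed.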

## V Applications

The precision and generality of our scientific statements are directly related to the precision and generality of the language used to make the statements. With the formulation of probability theory I have just proposed, we are able to determine precise probabilistic statements for a greater variety of situations than we were able to before. Immediate applications include situations where we do not or cannot assume exhaustivity and exclusivity:

1. We can calculate probabilities for propositions when we know only that they are one of m exhaustive possibilities;

   $$P(A_i|X_m[A_i]) = \frac{2^{m-1}}{2^{m}-1}.$$
2. Consider a situation similar to one presented by Walley Walley:1996 : We have a bag of marbles. Suppose we know that they are labelled in a distinguishable way. In particular, they are numbered and we know there is a marble that is labelled with ‘1’. We know nothing about the number of marbles in the bag (perhaps the bag is magical, with the ability to hold an unlimited number of marbles). We want to know the probability that if we pick a marble from the bag, that marble will be the one labelled with ‘1’. This will generally depend on our knowledge of how we pick the marble. We are not interested in this particular aspect and if we know our method of picking cannot discriminate the labels, we may neglect this knowledge for our current purposes.

Walley and others have proposed solutions to problems of this sort which go beyond the Bayesian framework. One sought after property of a probability in this situation is called regrouping invariance; i.e., it should somehow be invariant to changes in the ‘size of the sample space’. This presupposes that our probabilities are defined in terms of ‘sample spaces’.

Within the framework just proposed the solution requires only properly stating the salient assumptions; we have positive knowledge that there is a set of exclusive and exhaustive possibilities, we just do not know the size of the set. An appropriate probability will then be of the form

$$
\lim_{n\to\infty}P\Big(A_1\Big|\sum_{j=1}^{n}I_j[A_i]\Big).
$$

Note, assuming $\sum_{j=1}^{n}I_j[A_i]$ does not assume the various sample spaces are exclusive to each other. Exclusivity of sample spaces would require additional assumptions and change the probability. This is just one example of the precise choices we could make in our assumptions, exemplifying the generality of our approach.

3. Quantum theory has severe ontological problems. Our difficulty in solving these problems may be an insufficient formulation of probability theory Fuchs:2010 . Most if not all no-go theorems for ontological models of quantum theory Bell:1987 ; Harrigan:2010 ; Spekkens:2005 ; Hardy:2004 ; Pusey:2012 implicitly assume exclusivity and exhaustivity for the space of ontological states. The framework presented here allows for a whole class of models, which do not assume exclusivity and exhaustivity, to be explored.
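The closed form quoted in the first application, for $m$ exhaustive but not necessarily exclusive possibilities, can be checked by enumeration. This is a sketch under the assumption that every full term over $A_1,\dots,A_m$ carries equal weight; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob_Ai_given_exhaustive(m):
    """P(A1 | X_m[A_i]) by counting terms in which at least one A_i is true."""
    terms = [t for t in product([True, False], repeat=m) if any(t)]  # X_m
    favourable = sum(1 for t in terms if t[0])                       # A1 true
    return Fraction(favourable, len(terms))

# Compare the count with 2**(m-1) / (2**m - 1) for small m.
for m in range(1, 7):
    assert prob_Ai_given_exhaustive(m) == Fraction(2 ** (m - 1), 2 ** m - 1)

print(prob_Ai_given_exhaustive(3))  # 4/7
```

As $m$ grows the value approaches $1/2$ from above, in contrast to the $1/m$ obtained when exclusivity is also assumed.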

## VI Remarks

The probabilistic framework here is considered as a symbolic system rather than a system of functions or measures on a predefined set. The framework is general enough to deal with situations where sets of possibilities are not assumed. The principle of indifference is derived as a consequence of our ability to relabel and the explication of the assumptions we implicitly make to define possibility. Indifference is thus not a principle imposed a priori or arbitrarily.

Probability theory as extended logic is completed by imposing a symmetry from propositional logic.

The degree to which one is convinced by the framework proposed here partly depends on whether one is convinced that propositional logic is sufficient for the task of scientific inference. We have seen how one may represent basic notions of possibility and property while still maintaining logical consistency. What propositional logic does not handle is universals. I argue that universals are not directly relevant for scientific inference; a scientist would never be able to test the statement ‘all ravens are black’.

I propose the notion of universality is related to notions of induction and simplicity.

The framework just proposed does not directly justify induction. This is a good thing. An approach Carnap:1950 by Carnap - that has similar motivations to the approach here - tries to build induction directly into the framework. One problem is that the inductive predictions do not take into account one's assumptions; whether or not one predicts a sequence to continue at all and precisely how one predicts this depends on one's assumptions. Moreover, I submit these things should only depend on one's assumptions; if you make no assumptions you have no reason to predict the continuation of a sequence.

One may still perform inductive reasoning given certain assumptions such as a constant causal mechanism. There is, however, still a problem of induction: One may make valid predictions based on assumptions but those assumptions may not necessarily be justified.

The Bayesian framework has some built in notion of simplicity Jaynes:2003 (Ch.20). Consider two sets of propositional functions we’ll call models, and , where is parametrised by parameters and by parameters. Suppose the ’th parameter is and the subset has a one to one correspondence with where each element in both sets is identified with one that produces the same likelihood for some data . We may take and as compound models, i.e., models where the parameters are unknown. If the elements are exclusive for both and and the point in the parameter space that gives a maximum likelihood (for data ) is near and sharply peaked, then the likelihood for the compound model of will generally be greater than the likelihood for ; a set of models that predicts the observations as well as another but with less parameters will generally be better.

One limitation to this is that one has already chosen the sets of models to consider in a certain way. This has partly to do with one's preferences; do I judge a model with various parameters on the best choice of values of those parameters or do I judge a model on the total parameter space given to me? The choice also has to do with the choice of using a mathematical framework in the first place. In principle there are an infinite number of propositional functions that one may use as a model that have no discernible or consistent pattern. Can the restriction to propositional functions with consistent patterns be justified? This question becomes manifest in the proposed framework where we do not rely on calculating things with respect to a predefined set of alternative models; we may ask where those alternatives come from and why.

Note that the framework presented here manifests a primitive notion of simplicity for propositional functions themselves. The probability of some $Z[A_i]$, given no assumptions, is proportional to $2^{-m}$ where $m$ is the minimum number of propositions required to write $Z[A_i]$ in disjunctive normal form. The smaller the value of $m$, the ‘simpler’ $Z[A_i]$ is.

I speculate that a justification for induction and simplicity comes from an assumption that restricts the set of propositional functions one may use. This restriction could be justified by epistemological considerations. Models with consistent patterns may then emerge due to combinatoric reasons.

The concept of possibility that is outlined in this article is suggestive of how scientific concepts may be defined generally. Possibility is a pattern of propositions within a model. Crucially, this pattern is not unique; different models with different sizes for possibility spaces will use different patterns (e.g., $I_n[A_i]$ is a different pattern for each $n$). Moreover, the pattern may be nested such that the different possibilities are propositional functions rather than basic propositions. Within this framework, the concept of possibility cannot be defined as a form of classification, in contrast to some other attempts at the definition of a concept Valiant:1984. I speculate that concepts like possibility and property may instead be associated with algorithms.

Universality may be defined as a concept.

This definition of concept suggests a motivation for its use. Consider an agent with some data and some assumptions. There will likely be an infinite set of models to consider. Calculation for decisions may be computationally intractable. The agent may choose some scheme that best approximates the inferences one would ideally achieve. This scheme could involve algorithms for generating models. It may be the case that the best algorithms come from collections of nested concepts we may call general hypotheses. These general hypotheses may not give unique results but rather generate propositional functions dependent on input. Some of these general hypotheses may be well parametrised by mathematics.

Further work is required.

## References

• (1) E. T. Jaynes, “Probability Theory: The Logic of Science,” Cambridge University Press, Cambridge (2003).
• (2) B. de Finetti, “Theory of Probability: A Critical Introductory Treatment,” John Wiley & Sons Ltd, New York (1974-75).
• (3) J. E. Shore and R. W. Johnson, IEEE transactions on information theory, 26 (1980).
• (4) R. T. Cox, “The Algebra of Probable Inference,” Johns Hopkins Press, Baltimore (1961).
• (5) P. Walley, J. R. Statist. Soc. B, 58:3-57 (1996).
• (6) C. Fuchs, eprint arXiv:quant-ph/1003.5209v1 (2010).
• (7) J. S. Bell, “Speakable and Unspeakable in Quantum Mechanics,” Cambridge Univ. Press (1987).
• (8) N. Harrigan, R. W. Spekkens, Found. Phys. 40:125-157 (2010).
• (9) R. W. Spekkens, Phys. Rev. A, 71:052108 (2005).
• (10) L. Hardy, Stud. Hist. Phil. Mod. Phys., 35:267-276 (2004).
• (11) M. F. Pusey, J. Barrett, T. Rudolph, Nature Phys., 8:475 (2012).
• (12) R. Carnap, “Logical Foundations of Probability,” University of Chicago Press (1950).
• (13) L. G. Valiant, Communications of the ACM, 27:1134 (1984).
• (14) R. T. Cox, Am. J. Phys., 14:1-13 (1946).