1 Introduction
Motivation.
Sophisticated computer applications generally require expressive languages for knowledge representation and reasoning. In particular, such languages need to be able to represent both structured knowledge and uncertainty [Nil86, Hal03, Mug96, DK03, RD06, Háj01, Wil02]. A suitable language for this purpose is higher-order logic [Chu40, Hen50, And02, Llo03, vBD83, Lei94, Sha01], which admits higher-order functions that can take functions as arguments and/or return functions as results. This facility is convenient for probabilistic modeling since it means that theories can contain probability densities [Far08, Pfe07, GMR08]. In particular, many forms of probabilistic reasoning can be done in higher-order logic using the traditional axiomatic method: a theory can be written down which has the intended interpretation as a model and then conventional proof and computation techniques can be used to answer queries [NL09, NLU08]. While such a computational approach is effective, it is sometimes more natural to pose a problem as one where the probability of some sentences in the theory being true may be strictly less than one and/or the query sentence (and its negation) may not be a logical consequence of the theory. In such cases, deductive reasoning does not suffice for answering queries and it becomes necessary to use probabilistic methods [Par94, KD07, RD06, Mug96, MR07].
Main aim.
These considerations lead to the main technical issue studied in this paper:
Given a set of sentences, each having some probability of being true,
what probability should be ascribed to other (query) sentences?
We build on the work of Gaifman [Gai64], whose paper with Snir [GS82] develops a quite comprehensive theory of probabilities on sentences in first-order Peano arithmetic. We take up these ideas, using nondogmatic priors [GS82] and additionally the minimum relative entropy principle as in [Wil08a], but for general theories and in a higher-order setting. We concentrate on developing probabilities on sentences in a higher-order logic. This sets the stage for combining it with the probabilities-inside-sentences approach [NL09, NLU08].
Summary of key concepts.
Section 2 introduces higher-order logic and its relevant properties. We use the higher-order logic (Definitions 2, 2, and 2) based on Church’s simple theory of types [Chu40, Hen50, And02]. We employ the Henkin semantics and make use of a particular class of interpretations, called separating interpretations (Definition 2).
Section 3 gives the definition of probabilities on sentences in higher-order logic (Definition 3), introduces the Gaifman condition, and develops some basic properties of such probabilities. Section 4 then introduces probabilities on interpretations and shows their close connection with probabilities on sentences. Gaifman [Gai64] (generalized in Definition 3 and Propositions 3, 3, 3) introduced a condition, called Gaifman in [SK66], that connects probabilities of quantified sentences to limits of probabilities of finite conjunctions. In our case, it effectively restricts probabilities to separating interpretations while maintaining countable additivity.
Countable additivity (CA) is generally accepted in probability theory (Definition 4), but some circles argue that it does not have a good philosophical justification, and/or that it is not needed: real experience is always finite, so only non-asymptotic statements are of practical relevance, and these do not require CA. On the other hand, it is usually much easier to first obtain asymptotic statements, which require CA, and then improve upon them. Furthermore, we will show that CA can guide us in the right direction to find good finitary prior probabilities.
Another principle, which has received much less attention than CA but is equally if not more important, is that of Cournot [Cou43, Sha06]: an event of probability (close to) zero singled out in advance is physically impossible; or conversely, an event of probability 1 will physically happen for sure. In short: zero probability means impossibility. The history of the semantics of probability is stony [Fin73]. Cournot’s “forgotten” principle is one way of giving meaning to probabilistic statements like, “the relative frequency of heads of a fair coin converges to 1/2 with probability 1”. The contraposition of Cournot is that one must assign nonzero probability to possible events. If “events” are described by sentences and “possible” means it is possible to satisfy these sentences, i.e. they possess a model, then we arrive at the strong Cournot principle that satisfiable sentences should be assigned nonzero probability (Definitions 3 and 4). This condition has been appropriately called ‘nondogmatic’ in [GS82]. As long as something is not proven false, there is a (small) chance it is valid in the intended interpretation. This nondogmatism is crucial in Bayesian inductive reasoning, since no evidence (however strong) can increase a zero prior belief to a nonzero posterior belief [RH11]. The Gaifman condition is inconsistent with the strong Cournot principle (Example 5), but consistent with a weaker version (Definition 3).
Probabilities that are Gaifman and (plain, not strong) Cournot allow learning in the limit (Theorem 3 and Corollary 8).
A standard way to construct (general / Cournot / Gaifman) probabilities on sentences is to construct (general / nondogmatic / separating) probabilities on interpretations, and then transfer them to sentences (Propositions 4, 4, and 4). At the same time we give model-theoretic characterizations of the Gaifman condition (Corollary 4) and the Cournot condition (Definition 4). In Section 5, we give a particularly simple construction of a probability that is Cournot and Gaifman (Theorem 5) and a complete characterization of general/Cournot/Gaifman probabilities (Theorems 5 and 5 and Corollary 5). We also give various examples of (strong) (non)Cournot and/or Gaifman probabilities and (non)separating interpretations for countable domains (Examples 5, 5, and 5) and finite domains (Examples 5, 5, 5, 5).
Section 7 considers the important practical question of whether a real-valued function on a set of sentences can be extended to a probability on all sentences; a necessary and sufficient condition is given for this, as is a method for determining such probabilities using minimum relative entropy, introduced in Section 6. Prior knowledge and data constrain our (belief) probabilities in various ways, which we need to take into account when constructing probabilities. Prior knowledge is usually given in the form of probabilities on sentences like “the coin has head probability 1/2”, or facts like “all electrons have the same charge”, or nonlogical axioms like “there are infinitely many natural numbers”. They correspond to requiring their probability to be 1/2, extremely close to 1, and 1, respectively. It is therefore necessary to be able to go from probabilities on sentences to probabilities on interpretations (Proposition 4). This allows us to prove various necessary and sufficient conditions under which such partial probability specifications can be completed and what properties they have (Propositions 7 and 7). In particular we show that hierarchical probabilistic knowledge (Definition 7) is always probabilistically consistent (Proposition 7). Further, knowledge seldom constrains the probability on all sentences to be uniquely determined. In this case it is natural to choose a probability that is least dogmatic or biased [Nil86, Wil08a]. The minimum relative entropy principle (Definition 6) can be used to construct such a unique minimally more informative probability that is consistent with our prior knowledge (Definition 6 and Propositions 6 and 7).
Section 8 is a brief outlook on how the developed theory might be used and approximated in autonomous reasoning agents. In particular, certain knowledge, learning in the limit (8), the infamous black raven paradox, and the Monty Hall problem are discussed, but only briefly. The paper ends with a more detailed discussion in Section 9 of the broader context and motivation of this work, as well as related results in the literature, the outline of a framework for probabilistic reasoning and modeling in higher-order logic, and future research directions.
While some of the results presented in this paper are known in the first-order case and their extension to the higher-order case is straightforward, it nevertheless seems useful to provide a survey of this material (with proofs included). Also, many beautiful ideas in the long and technical paper by Gaifman and Snir [GS82] deserve wider attention than they have received. We hope our exposition helps to rectify this situation.
2 Logic
We review here a standard formulation of higher-order logic [And02] that is based on Church’s simple theory of types [Chu40]. Other references on higher-order logic include [Llo03, Far08, vBD83, Lei94, Sha01]. Some discussion of the interesting history of the simple theory of types is given in [And02, Far08].
The best way to think about higher-order logic is that it is the formalization of everyday informal mathematics: whatever mathematical description one might give of some situation, the formalization of that situation in higher-order logic is likely to be a straightforward translation of the informal description. In particular, higher-order logic provides a suitable foundation for mathematics itself which has several advantages over more traditional approaches that are based on axiomatizing sets in first-order logic. Furthermore, higher-order logic is the logical formalism of choice for much of theoretical computer science and also application areas such as software and hardware verification. For a convincing account of the advantages of higher-order over first-order logic in computer science, see [Far08].
The logic presented here differs in a minor way from that in [And02] in that we omit the description operator $\iota$, for reasons that are discussed later. All the results from [And02] that are used here also hold for the logic with $\iota$ omitted, by obvious changes to their proofs. In addition the notation for the logic used here differs somewhat from that in [And02], but the correspondences will always be clear. There are also a few differences in terminology here compared to [And02] that are noted along the way.
We begin with the definition of a type.
Definition (type)
A type is defined inductively as follows.
1. $o$ is a type.
2. $\iota$ is a type.
3. If $\alpha$ and $\beta$ are types, then $\alpha \to \beta$ is a type.
In this definition, $o$ is the type of the truth values, $\iota$ is the type of individuals, and $\alpha \to \beta$ is the type of functions from elements of type $\alpha$ to elements of type $\beta$. We use the convention that $\to$ is right associative. So, for example, when we write $\alpha \to \beta \to \gamma$ we mean $\alpha \to (\beta \to \gamma)$. A function type is a type of the form $\alpha \to \beta$, for some $\alpha$ and $\beta$.
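The inductive definition of types and the right-associativity convention can be sketched in a few lines of Python. This is only an illustration: the class names `Base` and `Arrow`, the ASCII names `o` and `i` for the two base types, and the printer `show` are assumptions made here, not notation from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Base:
    name: str          # "o" (truth values) or "i" (individuals); names are assumptions

@dataclass(frozen=True)
class Arrow:
    dom: object        # argument type
    cod: object        # result type

O, I = Base("o"), Base("i")

def show(t):
    """Print a type, omitting the parentheses that right associativity makes redundant."""
    if isinstance(t, Base):
        return t.name
    left = show(t.dom)
    if isinstance(t.dom, Arrow):
        left = f"({left})"          # a function type on the left still needs parentheses
    return f"{left} -> {show(t.cod)}"

curried = Arrow(I, Arrow(I, O))     # i -> (i -> o)
higher = Arrow(Arrow(I, O), O)      # (i -> o) -> o
print(show(curried))  # i -> i -> o
print(show(higher))   # (i -> o) -> o
```

The second example is a genuinely higher-order type: its argument is itself a function type, so the parentheses cannot be dropped.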
There is a denumerable list of variables of each type. The logical constants are $=_{\alpha \to \alpha \to o}$, for each type $\alpha$. The denotation of $=_{\alpha \to \alpha \to o}$ is the identity relation between individuals of type $\alpha$. In addition, there may be other nonlogical constants of various types. The alphabet is the set of all variables and constants.
Next comes the definition of a term.
Definition (term)
A term, together with its type, is defined inductively as follows.
1. A variable of type $\alpha$ is a term of type $\alpha$.
2. A constant of type $\alpha$ is a term of type $\alpha$.
3. If $t$ is a term of type $\beta$ and $x$ a variable of type $\alpha$, then $\lambda x.t$ is a term of type $\alpha \to \beta$.
4. If $s$ is a term of type $\alpha \to \beta$ and $t$ a term of type $\alpha$, then $(s\, t)$ is a term of type $\beta$.
A formula is a term of type $o$. A closed term is a term with no free variables. A sentence is a closed formula. A theory is a set of formulas.
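As a rough sketch of how the four term-forming clauses determine the type of a term, and of what "formula", "closed" and "sentence" mean operationally, consider the following Python fragment. The class names, the use of the strings `"o"` and `"i"` for the base types, and the helper names are all assumptions of this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arrow:
    dom: object
    cod: object

@dataclass(frozen=True)
class Var:
    name: str
    type: object

@dataclass(frozen=True)
class Const:
    name: str
    type: object

@dataclass(frozen=True)
class Abs:
    var: Var
    body: object

@dataclass(frozen=True)
class App:
    fun: object
    arg: object

def type_of(t):
    """Type of a term, following the four clauses of the definition."""
    if isinstance(t, (Var, Const)):
        return t.type
    if isinstance(t, Abs):
        return Arrow(t.var.type, type_of(t.body))
    fun_type = type_of(t.fun)
    if not isinstance(fun_type, Arrow) or fun_type.dom != type_of(t.arg):
        raise TypeError("ill-typed application")
    return fun_type.cod

def free_vars(t):
    if isinstance(t, Var):
        return {t}
    if isinstance(t, Const):
        return set()
    if isinstance(t, Abs):
        return free_vars(t.body) - {t.var}
    return free_vars(t.fun) | free_vars(t.arg)

def is_sentence(t):
    """A sentence is a closed term of the truth-value type."""
    return type_of(t) == "o" and not free_vars(t)

x = Var("x", "i")
p = Const("p", Arrow("i", "o"))
a = Const("a", "i")
print(type_of(Abs(x, App(p, x))))   # Arrow(dom='i', cod='o')
print(is_sentence(App(p, a)))       # True
print(is_sentence(App(p, x)))       # False
```

The last term is a formula but not a sentence, because the variable `x` occurs free in it.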
If the set of nonlogical constants is countable, then the set of terms is denumerable. As shown in [And02, p.212], using equality, it is easy to define $\top$ (truth), $\bot$ (falsity), $\wedge$ (conjunction), $\vee$ (disjunction), $\neg$ (negation), $\forall$ (universal quantification), and $\exists$ (existential quantification). The axioms for the logic are as follows [And02, p.213]:
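Axiom (logical axioms)
1. Truth values: $(g_{o \to o}\, \top \wedge g_{o \to o}\, \bot) = \forall x_o.(g_{o \to o}\, x_o)$
2. Leibniz’ law: $(x_\alpha = y_\alpha) \rightarrow (h_{\alpha \to o}\, x_\alpha = h_{\alpha \to o}\, y_\alpha)$
3. Extensionality: $(f_{\alpha \to \beta} = g_{\alpha \to \beta}) = \forall x_\alpha.(f_{\alpha \to \beta}\, x_\alpha = g_{\alpha \to \beta}\, x_\alpha)$
4. $\beta$-reduction: $(\lambda \mathbf{x}_\alpha.\mathbf{t}_\beta)\, \mathbf{s}_\alpha = \mathbf{t}_\beta\{\mathbf{x}_\alpha/\mathbf{s}_\alpha\}$
In the above, $x_\alpha$, $y_\alpha$, $f_{\alpha \to \beta}$, … are variables of the indicated type, $\mathbf{x}_\alpha$ is a syntactical variable for variables of type $\alpha$, and $\mathbf{s}_\alpha$, $\mathbf{t}_\beta$, … are syntactical variables for terms of the indicated type. Also $\mathbf{t}_\beta\{\mathbf{x}_\alpha/\mathbf{s}_\alpha\}$ is the result of simultaneously substituting $\mathbf{s}_\alpha$ for all free occurrences of $\mathbf{x}_\alpha$ in $\mathbf{t}_\beta$.
Axiom (1) expresses the idea that truth and falsity are the only truth values; Axioms (2) (for each type $\alpha$) express a basic property of equality; Axioms (3) (for each type $\alpha \to \beta$) are the axioms of extensionality; and Axiom schema (4) is the axiom for $\beta$-reduction.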
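The substitution used in the reduction axiom can be illustrated with a small untyped sketch. Terms are encoded here as nested tuples, and bound variables are renamed when a free variable of the substituted term would otherwise be captured; the encoding, the helper names and the renaming scheme (`v0`, `v1`, …) are assumptions of this example, not the paper's definitions.

```python
import itertools

def free_vars(t):
    kind = t[0]
    if kind == "var":
        return {t[1]}
    if kind == "const":
        return set()
    if kind == "abs":
        return free_vars(t[2]) - {t[1]}
    return free_vars(t[1]) | free_vars(t[2])

def fresh(avoid):
    """First name v0, v1, ... that does not occur in `avoid`."""
    return next(f"v{i}" for i in itertools.count() if f"v{i}" not in avoid)

def subst(t, x, s):
    """Replace the free occurrences of variable x in t by s, avoiding capture."""
    kind = t[0]
    if kind == "var":
        return s if t[1] == x else t
    if kind == "const":
        return t
    if kind == "app":
        return ("app", subst(t[1], x, s), subst(t[2], x, s))
    y, body = t[1], t[2]
    if y == x:                       # x is rebound here, so nothing below is free
        return t
    if y in free_vars(s):            # rename the bound variable to avoid capture
        z = fresh(free_vars(s) | free_vars(body) | {x})
        body = subst(body, y, ("var", z))
        y = z
    return ("abs", y, subst(body, x, s))

def beta(t):
    """One reduction step at the root, if the term is an abstraction applied to an argument."""
    if t[0] == "app" and t[1][0] == "abs":
        _, (_, x, body), arg = t
        return subst(body, x, arg)
    return t

redex = ("app", ("abs", "x", ("app", ("const", "f"), ("var", "x"))), ("const", "a"))
print(beta(redex))  # ('app', ('const', 'f'), ('const', 'a'))
```

Types are ignored in this sketch; in the typed logic the argument must have the type of the bound variable.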
Here is the single rule of inference [And02, p.213]:
Rule (rule of inference; equality substitution)
From $\varphi$ and $s_\alpha = t_\alpha$, infer the result of replacing one occurrence of $s_\alpha$ in $\varphi$ by an occurrence of $t_\alpha$, provided that the occurrence of $s_\alpha$ in $\varphi$ is not (an occurrence of a variable) immediately preceded by a $\lambda$.
The logic also has an equational reasoning system that has been used as the computational basis for a functional logic programming language [Llo03, NL09, NLU08, LN11].
In the following, to simplify the notation, we usually omit the type subscripts on terms; the type of a term will always either be unimportant or clear from the context. We use for sentences and sometimes for formulas, and for terms. With this notation, and .
The logic includes Church’s $\lambda$-calculus: a term of the form $\lambda x.t$ is an abstraction and a term of the form $(s\, t)$ is an application.
The logic is given a conventional Henkin semantics [Hen50].
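Definition (frame)
A frame is a collection $\{D_\alpha\}_\alpha$ of nonempty sets, one for each type $\alpha$, satisfying the following conditions.
1. $D_o = \{\mathsf{T}, \mathsf{F}\}$.
2. $D_{\alpha \to \beta}$ is some collection of functions from $D_\alpha$ to $D_\beta$.
For each type $\alpha$, $D_\alpha$ is called a domain.
The members of $D_o$ are called the truth values and the members of $D_\iota$ are called individuals.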
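Definition (valuation)
Given a frame $\{D_\alpha\}_\alpha$, a valuation $V$ is a function that maps each constant having type $\alpha$ to an element of $D_\alpha$ such that $V(=_{\alpha \to \alpha \to o})$ is the function from $D_\alpha$ into $D_{\alpha \to o}$ defined by
$V(=_{\alpha \to \alpha \to o})\, x\, y = \mathsf{T}$ if $x = y$, and $\mathsf{F}$ otherwise,
for $x, y \in D_\alpha$.
Definition (variable assignment)
A variable assignment $\nu$ with respect to a frame $\{D_\alpha\}_\alpha$ is a function that maps each variable of type $\alpha$ to an element of $D_\alpha$.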
An interpretation can now be defined.
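Definition (interpretation)
A pair $I = \langle \{D_\alpha\}_\alpha, V \rangle$ is an interpretation if there is a function $\mathcal{V}$ such that, for each variable assignment $\nu$ and for each term $t$ of type $\alpha$, $\mathcal{V}(t, I, \nu) \in D_\alpha$ and the following conditions are satisfied.
1. $\mathcal{V}(x, I, \nu) = \nu(x)$, where $x$ is a variable.
2. $\mathcal{V}(C, I, \nu) = V(C)$, where $C$ is a constant.
3. $\mathcal{V}(\lambda x.t, I, \nu) =$ the function whose value for each $d \in D_\alpha$ is $\mathcal{V}(t, I, \nu')$, where $x$ has type $\alpha$ and $\nu'$ is $\nu$ except $\nu'(x) = d$.
4. $\mathcal{V}((s\, t), I, \nu) = \mathcal{V}(s, I, \nu)(\mathcal{V}(t, I, \nu))$.
If $I$ is an interpretation, then the function $\mathcal{V}$ is uniquely defined. $\mathcal{V}(t, I, \nu)$ is called the denotation of $t$ with respect to $I$ and $\nu$. If $t$ is a closed term, then $\mathcal{V}(t, I, \nu)$ is independent of $\nu$ and we write it as $\mathcal{V}(t, I)$. Not every pair $\langle \{D_\alpha\}_\alpha, V \rangle$ is an interpretation; to be an interpretation, every term must have a denotation with respect to each variable assignment.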
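The four conditions on denotations translate directly into a recursive evaluator. The sketch below is a toy: terms are nested tuples, the valuation is a dictionary, and Python functions stand in for elements of function domains. It does not check that a denotation actually lies in the corresponding domain of a given frame, which is exactly the requirement that can fail for an arbitrary pair. All names (`denote`, `zero`, `succ`, `eq`) are assumptions of this example.

```python
def denote(t, valuation, nu):
    """Denotation of a term under a valuation of constants and a variable assignment nu."""
    kind = t[0]
    if kind == "var":                       # condition 1: use the variable assignment
        return nu[t[1]]
    if kind == "const":                     # condition 2: use the valuation
        return valuation[t[1]]
    if kind == "abs":                       # condition 3: a function of the bound variable
        _, x, body = t
        return lambda d: denote(body, valuation, {**nu, x: d})
    _, fun, arg = t                         # condition 4: apply function to argument
    return denote(fun, valuation, nu)(denote(arg, valuation, nu))

# A toy valuation over the individuals {0, 1, 2}.
valuation = {
    "zero": 0,
    "succ": lambda n: (n + 1) % 3,          # successor modulo 3
    "eq": lambda a: lambda b: a == b,       # curried equality
}
twice = ("abs", "x", ("app", ("const", "succ"), ("app", ("const", "succ"), ("var", "x"))))
term = ("app", ("app", ("const", "eq"), ("app", twice, ("const", "zero"))), ("var", "y"))
print(denote(term, valuation, {"y": 2}))  # True
print(denote(term, valuation, {"y": 0}))  # False
```

The term has a free variable `y`, so its denotation depends on the variable assignment, as the two calls show.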
What is called an interpretation here is called a general model in [And02], following Henkin. In [And02], a general model is called a standard model if, for each $\alpha$ and $\beta$, $D_{\alpha \to \beta}$ is the set of all functions from $D_\alpha$ to $D_\beta$. Moving from standard models to general models was the crucial step that allowed Henkin to prove the completeness of the logic [Hen50].
Definition (satisfiable)
Let $\varphi$ be a formula, $I$ an interpretation, and $\nu$ a variable assignment with respect to $I$.
1. $\nu$ satisfies $\varphi$ in $I$ if $\mathcal{V}(\varphi, I, \nu) = \mathsf{T}$.
2. $\varphi$ is satisfiable in $I$ if there is a variable assignment which satisfies $\varphi$ in $I$.
3. $\varphi$ is valid in $I$ if every variable assignment satisfies $\varphi$ in $I$.
4. $\varphi$ is valid if $\varphi$ is valid in every interpretation.
5. A model for a theory is an interpretation in which each formula in the theory is valid.
Definition (consistency)
A theory is consistent if $\bot$ cannot be derived from the theory.
Definition (logical consequence)
A formula $\varphi$ is a logical consequence of a theory if $\varphi$ is valid in every model of the theory.
We will have need for a particular class of interpretations, defined as follows.
Definition (separating interpretation/model)
An interpretation $I$ for an alphabet is separating if, for every pair $s$, $t$ of closed terms of the same function type, say, $\alpha \to \beta$, such that $\mathcal{V}(s, I) \neq \mathcal{V}(t, I)$, there exists a closed term $r$ of type $\alpha$ such that $\mathcal{V}((s\, r), I) \neq \mathcal{V}((t\, r), I)$.
A separating model is a separating interpretation that is a model (for some set of formulas).
We emphasize that, in the definition of a separating interpretation, the closed term $r$ is formed only from symbols in the given alphabet. Intuitively, an interpretation is separating if, for every pair $s$, $t$ of closed terms of the same type $\alpha \to \beta$, whose respective denotations in the interpretation are different, there exists a closed term $r$ of type $\alpha$ for which the respective denotations in the interpretation of $(s\, r)$ and $(t\, r)$ are different. Thus, in a separating interpretation, closed terms that have distinct functions as denotations must be distinct on an argument in the domain that is the denotation of some closed term using the given alphabet and thus is ‘accessible’ or ‘nameable’ via that term.
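The role of 'nameable' elements can be seen in a very small finite toy. The sketch below is a deliberate simplification and not the paper's construction: it only considers constants of individual type and unary function constants on individuals, ignores abstractions and higher types, and represents functions as dictionaries. The names `nameable` and `is_separating` are hypothetical.

```python
def nameable(constants, functions):
    """Individuals denoted by some closed term built from the given constants
    and unary function constants (closure under application)."""
    reached = set(constants.values())
    changed = True
    while changed:
        changed = False
        for f in functions.values():
            for d in list(reached):
                if f[d] not in reached:
                    reached.add(f[d])
                    changed = True
    return reached

def is_separating(constants, functions):
    """Distinct function denotations must differ on some nameable argument."""
    names = nameable(constants, functions)
    for f in functions.values():
        for g in functions.values():
            if f != g and all(f[d] == g[d] for d in names):
                return False
    return True

# Individuals {0, 1, 2}; f and g differ only on the argument 2.
f = {0: 0, 1: 1, 2: 0}
g = {0: 0, 1: 1, 2: 1}
small_alphabet = {"a": 0, "b": 1}               # no closed term denotes 2
print(is_separating(small_alphabet, {"f": f, "g": g}))               # False
print(is_separating({**small_alphabet, "c": 2}, {"f": f, "g": g}))   # True
```

Adding a constant that names the element 2 turns the non-separating toy into a separating one, which mirrors the idea of expanding the alphabet.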
The concept of a separating interpretation is closely related to the concept of an extensionally complete theory that plays a crucial part in the proof of completeness [And02, p.248].
Definition (extensionally complete)
A set $T$ of sentences is extensionally complete if, for every pair $s$, $t$ of closed terms of the same function type, say, $\alpha \to \beta$, there exists a closed term $r$ of type $\alpha$ such that $((s\, r) = (t\, r)) \rightarrow (s = t)$ is derivable from $T$.
A connection with separating interpretations is provided by the following result.
Proposition (extensionally complete $\Rightarrow$ separating)
Every model of an extensionally complete set of sentences is separating.
Proof
Let $T$ be a set of sentences that is extensionally complete and $I$ be a model for $T$. Suppose that $s$, $t$ is a pair of closed terms of the same function type, say, $\alpha \to \beta$, such that $\mathcal{V}(s, I) \neq \mathcal{V}(t, I)$. By extensional completeness, there exists a closed term $r$ such that $((s\, r) = (t\, r)) \rightarrow (s = t)$ is derivable from $T$. Since $I$ is a model for $T$ and the proof system is sound, it follows that $\mathcal{V}((s\, r), I) \neq \mathcal{V}((t\, r), I)$. Hence $I$ is separating.
Now we show that, if we are willing to expand the alphabet, any set of sentences having a model also has a separating model in an expanded alphabet.
Proposition (existence of separating models)
If a set $T$ of sentences has a model, then there exists an alphabet that includes the original alphabet and an interpretation based on the expanded alphabet which is a separating model for $T$.
Proof
Since $T$ has a model, $T$ is consistent. By [And02, Theorem 5500], there is an expansion of the original alphabet and a set of sentences $T'$ such that $T \subseteq T'$, $T'$ is consistent, and $T'$ is extensionally complete in the expanded alphabet. Since $T'$ is consistent, by Henkin’s Theorem [And02, Theorem 5501], it has a model (based on the expanded alphabet). By Proposition 2, this model must be a separating one, and it is also a model for $T$.
The most important property of the logic that we will need is compactness [And02, Theorem 5503].
Theorem (compactness)
If every finite subset of a set $T$ of sentences has a model, then $T$ has a model.
In fact, most of the development in the paper can be carried out in any logic that has the compactness property.
While the version of higher-order logic introduced in this section generally provides much more direct and succinct formalisations than first-order logic, for practical applications a number of extensions are highly desirable. Some of these extensions are nothing more than abbreviations, such as those used to introduce the connectives and quantifiers, and some are deeper. These extensions include many-sortedness, which allows more than one domain of individuals; tuples and product types; and type constructors and polymorphism. The logic of [Llo03], which is also used in [NL09, NLU08], includes all these extensions. These and other extensions are discussed in [Far08].
3 Probabilities on Sentences
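We now define probabilities on sentences. They are not probabilities in the conventional sense of probability theory (on $\sigma$-algebras); however, a connection between probabilities on sentences and (conventional) probabilities on a $\sigma$-algebra on the set of interpretations will be made below.
Definition (probability on sentences)
Let $\mathcal{S}$ be the set of all sentences (for some alphabet). A probability (on sentences) is a nonnegative function $\mu : \mathcal{S} \to \mathbb{R}$ satisfying the following conditions:
1. If $\varphi$ is valid, then $\mu(\varphi) = 1$.
2. If $\neg(\varphi \wedge \psi)$ is valid, then $\mu(\varphi \vee \psi) = \mu(\varphi) + \mu(\psi)$.
For a sentence $\psi$, where $\mu(\psi) > 0$, one can define the conditional probability $\mu(\cdot \mid \psi) : \mathcal{S} \to \mathbb{R}$ by
$\mu(\varphi \mid \psi) = \mu(\varphi \wedge \psi) / \mu(\psi)$,
for each sentence $\varphi$.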
A probability on sets of sentences has the following intended meaning:
For a sentence $\varphi$, $\mu(\varphi)$ is the degree of belief that $\varphi$ is true.
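A minimal way to get a feel for such a function is a propositional toy in which sentences are Boolean-valued Python functions over a handful of truth assignments and the probability of a sentence is the total weight of the assignments where it holds. The atoms, the weights and the names `mu` and `mu_given` are assumptions made for this sketch; the setting of the paper is far richer.

```python
from fractions import Fraction
from itertools import product

ATOMS = ("rain", "wet")
# All four truth assignments, in the order FF, FT, TF, TT.
WORLDS = [dict(zip(ATOMS, bits)) for bits in product([False, True], repeat=len(ATOMS))]
# Hypothetical weights on the truth assignments; they sum to 1.
WEIGHTS = [Fraction(4, 10), Fraction(1, 10), Fraction(1, 10), Fraction(4, 10)]

def mu(sentence):
    """Degree of belief in a sentence: total weight of the worlds where it is true."""
    return sum(w for world, w in zip(WORLDS, WEIGHTS) if sentence(world))

def mu_given(sentence, condition):
    """Conditional probability; only defined when mu(condition) > 0."""
    return mu(lambda w: sentence(w) and condition(w)) / mu(condition)

rain = lambda w: w["rain"]
wet = lambda w: w["wet"]

print(mu(lambda w: rain(w) or not rain(w)))   # 1, a valid sentence
both = lambda w: rain(w) and wet(w)
only_rain = lambda w: rain(w) and not wet(w)
# The two sentences exclude each other, so their probabilities add up.
print(mu(lambda w: both(w) or only_rain(w)) == mu(both) + mu(only_rain))  # True
print(mu_given(wet, rain))                    # 4/5
```

Using `Fraction` keeps the arithmetic exact, so the additivity check is an equality rather than a floating-point comparison.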
Definition (pairwise disjoint sentences)
The sentences $\varphi_1, \ldots, \varphi_n$ are pairwise disjoint if, for each $i, j$ such that $i \neq j$, $\neg(\varphi_i \wedge \varphi_j)$ is valid.
Proposition (properties of probability on sentences)
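Let $\mu$ be a probability on sentences. Then the following hold:
1. $\mu(\neg\varphi) = 1 - \mu(\varphi)$, for each $\varphi \in \mathcal{S}$.
2. $\mu(\varphi) \leq 1$, for each $\varphi \in \mathcal{S}$.
3. If $\varphi$ is unsatisfiable, then $\mu(\varphi) = 0$.
4. If $\varphi \rightarrow \psi$ is valid, then $\mu(\varphi) \leq \mu(\psi)$.
5. If $\varphi \leftrightarrow \psi$ is valid, then $\mu(\varphi) = \mu(\psi)$.
6. If $\{\varphi_i\}_{i=1}^n$ is a finite subset of pairwise disjoint sentences in $\mathcal{S}$, then $\mu(\bigvee_{i=1}^n \varphi_i) = \sum_{i=1}^n \mu(\varphi_i)$.
7. If $\{\varphi_i\}_{i=1}^n$ is a finite subset of $\mathcal{S}$, then $\mu(\bigvee_{i=1}^n \varphi_i) \leq \sum_{i=1}^n \mu(\varphi_i)$.
8. The following are equivalent:
(a) For each $\varphi \in \mathcal{S}$, $\mu(\varphi) = 1$ implies $\varphi$ is valid.
(b) For each $\varphi \in \mathcal{S}$, $\mu(\varphi) = 0$ implies $\varphi$ is unsatisfiable.
9. If $\mu(\psi) > 0$, then $\mu(\cdot \mid \psi)$ is a probability.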

.
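Proof
The proof is elementary and standard, and only included for completeness.
1. Since $\varphi \vee \neg\varphi$ is valid, $\mu(\varphi \vee \neg\varphi) = 1$. Also, since $\neg(\varphi \wedge \neg\varphi)$ is valid, $\mu(\varphi \vee \neg\varphi) = \mu(\varphi) + \mu(\neg\varphi)$. Thus $\mu(\neg\varphi) = 1 - \mu(\varphi)$.
2. Since $\mu(\neg\varphi) \geq 0$, we have that $\mu(\varphi) = 1 - \mu(\neg\varphi) \leq 1$.
3. Note that $\varphi$ is unsatisfiable iff $\neg\varphi$ is valid. Thus $\mu(\neg\varphi) = 1$, so that $\mu(\varphi) = 0$.
4. Note first that $\varphi \rightarrow \psi$ is valid iff $\neg(\varphi \wedge \neg\psi)$ is valid. Thus $\mu(\varphi) + \mu(\neg\psi) = \mu(\varphi \vee \neg\psi) \leq 1$. Hence $\mu(\varphi) \leq 1 - \mu(\neg\psi) = \mu(\psi)$.
5. This follows immediately from Part 4.
6. The proof is by induction on $n$. When $n = 1$ the result is obvious. Assume now the result is true for $n$. Note that $\neg(\varphi_i \wedge \varphi_{n+1})$ is valid for $i = 1, \ldots, n$ and so $\neg((\bigvee_{i=1}^n \varphi_i) \wedge \varphi_{n+1})$ is valid. Then
$\mu(\bigvee_{i=1}^{n+1} \varphi_i) = \mu(\bigvee_{i=1}^n \varphi_i) + \mu(\varphi_{n+1})$ [$\neg((\bigvee_{i=1}^n \varphi_i) \wedge \varphi_{n+1})$ is valid]
$= \sum_{i=1}^{n+1} \mu(\varphi_i)$ [induction hypothesis]
7. The proof is by induction on $n$. When $n = 1$ the result is obvious. Assume now the result is true for $n$. Then
$\mu(\bigvee_{i=1}^{n+1} \varphi_i) = \mu(\bigvee_{i=1}^n \varphi_i) + \mu(\varphi_{n+1} \wedge \neg\bigvee_{i=1}^n \varphi_i)$ [the two disjuncts are disjoint]
$\leq \sum_{i=1}^{n+1} \mu(\varphi_i)$ [Part 4 and induction hypothesis]
8. Suppose that, for each $\varphi$, $\mu(\varphi) = 1$ implies $\varphi$ is valid. Now let $\varphi$ satisfy $\mu(\varphi) = 0$. By Part 1, $\mu(\neg\varphi) = 1$. Thus $\neg\varphi$ is valid and so $\varphi$ is unsatisfiable.
Conversely, suppose that, for each $\varphi$, $\mu(\varphi) = 0$ implies $\varphi$ is unsatisfiable. Now let $\varphi$ satisfy $\mu(\varphi) = 1$. By Part 1, $\mu(\neg\varphi) = 0$. Thus $\neg\varphi$ is unsatisfiable and so $\varphi$ is valid.
9. Suppose that $\varphi$ is valid. Then $\mu(\varphi \mid \psi) = \mu(\varphi \wedge \psi)/\mu(\psi) = \mu(\psi)/\mu(\psi) = 1$.
Suppose that $\neg(\varphi \wedge \chi)$ is valid. Then
$\mu(\varphi \vee \chi \mid \psi) = \mu((\varphi \wedge \psi) \vee (\chi \wedge \psi))/\mu(\psi)$
$= (\mu(\varphi \wedge \psi) + \mu(\chi \wedge \psi))/\mu(\psi)$ [$\neg((\varphi \wedge \psi) \wedge (\chi \wedge \psi))$ is valid]
$= \mu(\varphi \mid \psi) + \mu(\chi \mid \psi)$.
Thus $\mu(\cdot \mid \psi)$ is a probability.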
Next we introduce Gaifman probabilities.
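Definition (Gaifman probability)
Let $\mu$ be a probability on sentences. Then $\mu$ is Gaifman if
$\mu(s = t) = \inf_R \mu\big(\bigwedge_{r \in R} ((s\, r) = (t\, r))\big)$,
for every pair $s$ and $t$ of closed terms having the same function type, say, $\alpha \to \beta$, and where $R$ ranges over all finite sets of closed terms of type $\alpha$.
Proposition (Gaifman probability)
Let $\mu$ be a probability on sentences. Then the following are equivalent.
1. $\mu$ is Gaifman.
2. $\mu(s \neq t) = \sup_R \mu\big(\bigvee_{r \in R} ((s\, r) \neq (t\, r))\big)$, for every pair $s$ and $t$ of closed terms having the same function type, say, $\alpha \to \beta$, and where $R$ ranges over all finite sets of closed terms of type $\alpha$.
3. $\mu(\exists x.\varphi) = \sup_R \mu\big(\bigvee_{r \in R} \varphi\{x/r\}\big)$, for every formula $\varphi$ having a single free variable $x$ of type $\alpha$, say, and where $R$ ranges over all finite sets of closed terms of type $\alpha$.
4. $\mu(\forall x.\varphi) = \inf_R \mu\big(\bigwedge_{r \in R} \varphi\{x/r\}\big)$, for every formula $\varphi$ having a single free variable $x$ of type $\alpha$, say, and where $R$ ranges over all finite sets of closed terms of type $\alpha$.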
Proof
1. implies 2. Suppose that the probability is Gaifman. Then
Hence 2. holds.
2. implies 3. Suppose that 2. holds. Then
Hence 3. holds.
3. implies 4. Suppose that 3. holds. Then
Hence 4. holds.
4. implies 1. Suppose that 4. holds. Then
[Axioms of Extensionality]  
Hence 1. holds.
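Proposition (limits for countable alphabet)
Let the alphabet be countable, $\mu$ a probability on sentences, and $\varphi$ a formula having a single free variable $x$ of type $\alpha$.
1. $\sup_R \mu\big(\bigvee_{r \in R} \varphi\{x/r\}\big) = \lim_{n \to \infty} \mu\big(\bigvee_{i=1}^n \varphi\{x/r_i\}\big)$,
2. $\inf_R \mu\big(\bigwedge_{r \in R} \varphi\{x/r\}\big) = \lim_{n \to \infty} \mu\big(\bigwedge_{i=1}^n \varphi\{x/r_i\}\big)$,
where, on the LHS, $R$ ranges over all finite sets of closed terms of type $\alpha$ and, on the RHS, $r_1, r_2, \ldots$ is an enumeration of all closed terms of type $\alpha$.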
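Proof
Since the alphabet is countable, the set of all closed terms of type $\alpha$ is countable and hence can be enumerated.
1. Let $R = \{s_1, \ldots, s_m\}$ be a finite set of closed terms of type $\alpha$. Let $n$ be sufficiently large so that each $s_j$, for $j = 1, \ldots, m$, appears among the first $n$ terms $r_1, \ldots, r_n$ of an enumeration of all closed terms of type $\alpha$.
Then $\bigvee_{r \in R} \varphi\{x/r\} \rightarrow \bigvee_{i=1}^n \varphi\{x/r_i\}$ is valid, so that
$\mu\big(\bigvee_{r \in R} \varphi\{x/r\}\big) \leq \mu\big(\bigvee_{i=1}^n \varphi\{x/r_i\}\big)$
by Proposition 3.4. By first taking the supremum on the RHS and then the supremum on the LHS we get
$\sup_R \mu\big(\bigvee_{r \in R} \varphi\{x/r\}\big) \leq \sup_n \mu\big(\bigvee_{i=1}^n \varphi\{x/r_i\}\big)$.
Conversely we have
$\sup_R \mu\big(\bigvee_{r \in R} \varphi\{x/r\}\big) \geq \mu\big(\bigvee_{i=1}^n \varphi\{x/r_i\}\big)$
since the sup on the LHS includes $R = \{r_1, \ldots, r_n\}$. Now taking the supremum over $n$ and combining both inequalities gives equality. Proposition 3.4 gives that $\mu\big(\bigvee_{i=1}^n \varphi\{x/r_i\}\big) \leq \mu\big(\bigvee_{i=1}^{n+1} \varphi\{x/r_i\}\big)$; hence $\mu\big(\bigvee_{i=1}^n \varphi\{x/r_i\}\big)$ is monotone nondecreasing in $n$, which allows the replacement of $\sup_n$ by $\lim_{n \to \infty}$.
2. The proof is similar.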
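The monotonicity argument can be checked numerically on a toy model. Assume, purely for illustration, that the instances of the formula are pairwise disjoint and that the instance for the i-th closed term has probability 2^(-i); then the probability of any finite disjunction is a finite sum. The function names below are hypothetical.

```python
from fractions import Fraction

def mu_over(indices):
    """Probability of the disjunction of the instances with the given indices,
    in a toy model where instance i has probability 2**-i and instances are disjoint."""
    return sum(Fraction(1, 2**i) for i in set(indices))

def mu_prefix(n):
    """Probability of the disjunction over the first n terms of the enumeration."""
    return mu_over(range(1, n + 1))

prefix = [mu_prefix(n) for n in range(1, 11)]
# Monotone nondecreasing in n, so the supremum over n is a limit.
assert all(a <= b for a, b in zip(prefix, prefix[1:]))

# Any finite set R of instances is dominated by a long enough prefix.
R = {2, 5, 7}
assert mu_over(R) <= mu_prefix(max(R))

print(prefix[-1])  # 1023/1024
```

In this toy the prefix probabilities increase towards 1, which is the value a Gaifman probability would have to assign to the corresponding existential sentence.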
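We can reduce the class of terms that is necessary to “browse” through even further, by considering only one term from each equivalence class, where two terms $s$ and $t$ are equivalent iff $s = t$ is valid.
Proposition (Gaifman for countable alphabet)
Let the alphabet be countable and $\mu$ a probability on sentences. Then the following are equivalent.
1. $\mu$ is Gaifman.
2. $\mu(s = t) = \lim_{n \to \infty} \mu\big(\bigwedge_{i=1}^n ((s\, r_i) = (t\, r_i))\big)$, for every pair $s$ and $t$ of closed terms having the same function type, say, $\alpha \to \beta$, and where $r_1, r_2, \ldots$ is an enumeration of all closed terms of type $\alpha$.
3. $\mu(s \neq t) = \lim_{n \to \infty} \mu\big(\bigvee_{i=1}^n ((s\, r_i) \neq (t\, r_i))\big)$, for every pair $s$ and $t$ of closed terms having the same function type, say, $\alpha \to \beta$, and where $r_1, r_2, \ldots$ is an enumeration of all closed terms of type $\alpha$.
4. $\mu(\exists x.\varphi) = \lim_{n \to \infty} \mu\big(\bigvee_{i=1}^n \varphi\{x/r_i\}\big)$, for every formula $\varphi$ having a single free variable $x$ of type $\alpha$, say, and where $r_1, r_2, \ldots$ is an enumeration of all closed terms of type $\alpha$.
5. $\mu(\forall x.\varphi) = \lim_{n \to \infty} \mu\big(\bigwedge_{i=1}^n \varphi\{x/r_i\}\big)$, for every formula $\varphi$ having a single free variable $x$ of type $\alpha$, say, and where $r_1, r_2, \ldots$ is an enumeration of all closed terms of type $\alpha$.
In each case, the enumeration of closed terms of type $\alpha$ can be reduced to one where a single representative is chosen from each equivalence class under the equivalence relation: $s$ and $t$ are equivalent if $s = t$ is valid.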
Proof
While these forms of the Gaifman condition closely resemble the continuity condition (countable additivity (CA) axiom) in measure theory, we will see that CA over (general) interpretations is derived from the compactness theorem and not from the Gaifman condition (see Definition 4 and Proposition 4 in the next section). But the Gaifman condition confines probabilities to separating interpretations while preserving CA (Propositions 4 and 4).
Example (natural numbers Nat)
Consider the standard type Nat of natural numbers, as the type of individuals, and the usual Peano axioms. Let be the constant of type Nat whose denotation is the natural number 0, and be the term of type Nat whose denotation is the natural number , where is a constant of type whose denotation is the successor function. In practice one usually defines denumerably many constants , one for each natural number, directly. Further, let be functions with their usual axioms and meaning. Now there are many closed terms that represent the same natural number. For instance , , , are different terms, all having the number as denotation. For type Nat, it is sufficient to choose in Proposition 3.4, and so the condition in Definition 3 (indeed) reduces to the one used by Gaifman [GS82].
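The point that many closed terms share one denotation, so that a single representative per natural number suffices, can be made concrete with a small evaluator. The tuple encoding, the symbol names `zero`, `succ`, `add`, `mul`, and the choice of the number 4 are assumptions of this sketch and are not taken from the example's original wording.

```python
def value(term):
    """Denotation of a closed term built from zero, successor, addition and multiplication."""
    op = term[0]
    if op == "zero":
        return 0
    if op == "succ":
        return value(term[1]) + 1
    if op == "add":
        return value(term[1]) + value(term[2])
    if op == "mul":
        return value(term[1]) * value(term[2])
    raise ValueError(f"unknown symbol {op!r}")

def numeral(n):
    """Canonical representative for n: successor applied n times to zero."""
    term = ("zero",)
    for _ in range(n):
        term = ("succ", term)
    return term

two = numeral(2)
terms = [numeral(4), ("add", two, two), ("mul", two, two)]
print({value(t) for t in terms})  # {4}

# One representative per denotation is enough when enumerating closed terms.
representatives = {value(t): numeral(value(t)) for t in terms}
print(len(representatives))       # 1
```

Three syntactically different terms collapse to one equivalence class, represented here by the numeral.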
Of particular interest are probabilities that are strictly positive on satisfiable sentences since this is a desirable property of a prior. This suggests the following definition.
Definition (strongly Cournot probability)
A probability $\mu$ is strongly Cournot if, for each $\varphi \in \mathcal{S}$, $\varphi$ is satisfiable implies $\mu(\varphi) > 0$.
By Part 8 of Proposition 3, a probability $\mu$ is strongly Cournot iff, for each $\varphi \in \mathcal{S}$, $\varphi$ is not valid implies $\mu(\varphi) < 1$, or, by contraposition, $\mu(\varphi) = 1$ implies $\varphi$ is valid. This is akin to Cournot’s principle as discussed in the introduction that an event of probability 1 singled out in advance will happen for sure in the real world. We will see this general idea plays an important role for inductive inference.
However, the following weaker form of the Cournot principle will turn out to be more useful.
Definition (Cournot probability)
A probability $\mu$ is Cournot if, for each $\varphi \in \mathcal{S}$, $\varphi$ has a separating model implies $\mu(\varphi) > 0$.
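One reason positivity matters for a prior is that conditioning can never revive a hypothesis whose prior probability is zero. The following sketch applies Bayes' rule with exact fractions; the function name `posterior` and the particular numbers are hypothetical.

```python
from fractions import Fraction

def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule for a single hypothesis and a single piece of evidence."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Evidence that is 1000 times more likely if the hypothesis is true.
strong = (Fraction(1), Fraction(1, 1000))

print(posterior(Fraction(0), *strong))         # 0
print(posterior(Fraction(1, 1000), *strong))   # 1000/1999
```

A dogmatic prior of zero stays at zero however strong the evidence, while even a prior of one in a thousand is lifted to about one half by the same evidence.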
Clearly a strongly Cournot probability is Cournot. It will be the Cournot probabilities (not the strongly Cournot ones) that will be of most interest in the subsequent development. The major reasons for this are as follows. First, Theorem 5 below shows that, if the alphabet is countable, there exists a probability on sentences that is Cournot and Gaifman. Such a probability makes a good prior. Second, the Cournot and Gaifman conditions are necessary and sufficient to do learning in the limit of universal hypotheses as the following theorem shows and as discussed in more detail in Section 8.
Theorem (confirming universal hypotheses)
Let the alphabet be countable, a probability on sentences, a formula having a single free variable of some type , an enumeration of (representatives of) all closed terms of type . Then
If the left hand side (hence also the r.h.s.) holds, we say that can confirm universal hypothesis . It also holds that
Proof
[]  
[]  
[] 
As can be seen from the proof, if one or both of the conditions fail, then does not converge to 1.
For the bottom we abbreviate the statements
In this notation, the top reads iff and .
Assume is Gaifman and Cournot and . This implies and . By we get . We have shown that for any , if is Gaifman and Cournot, then implies .
Case 1 [ is true]
Then by assumption, . Then by we get
and .
Note that every sentence can be written as with being a formula having a
single free variable . Therefore, for all that have a separating model. Hence
is Cournot.
Case 2 [ is false] That is, has no
separating model, therefore must have (at
least one) separating model, say . Since is a separating model of ,
Definition 2 implies that there exists a
closed term such that is also a separating
model of . Now
[ and are disjoint]  
[ is not free in ]  
[since , Case 1 implies ]  
[ is not free in ]  
[ for some , and false] 
This proves for false.
Case 1 and 2 together prove for all , hence is Gaifman.
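The flavour of confirming a universal hypothesis can be illustrated with a two-component toy mixture. This is an assumption-laden sketch, not the construction used in the proof: with weight `p_all` every instance of the formula holds, and otherwise the instances hold independently with probability `q` below 1, so that the probability of the first n instances is `p_all + (1 - p_all) * q**n` and the universal sentence gets probability `p_all`.

```python
from fractions import Fraction

def confirm(p_all, q, n):
    """Probability of the universal hypothesis given its first n instances,
    in a toy mixture: weight p_all on 'all instances hold', otherwise
    independent instances each holding with probability q < 1."""
    mu_instances = p_all + (1 - p_all) * q**n   # probability of the first n instances
    return p_all / mu_instances

q = Fraction(9, 10)
for n in (0, 10, 100):
    print(n, float(confirm(Fraction(1, 100), q, n)))   # increases towards 1

print(confirm(Fraction(0), q, 100))   # 0
```

With a positive prior on the universal hypothesis the conditional probability climbs towards 1 as instances accumulate; with prior zero it stays at zero, which is in line with the remark that convergence to 1 fails when the conditions are violated.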
4 Probabilities on Interpretations
We now study probabilities defined on sets of interpretations.
Consider the set of interpretations (for the alphabet). A Borel $\sigma$-algebra can be defined on . For that, a topology needs to be defined first. Given some alphabet, let denote the set of sentences based on the alphabet. For each sentence , let denote the set
Consider the set . Since is closed under finite intersections, it is a basis for a topology on . is also an algebra, since it is closed under complementation and finite unions, and . Let