# Entailment Relations on Distributions

In this paper we give an overview of partial orders on the space of probability distributions that carry a notion of information content and serve as a generalisation of the Bayesian order given in (Coecke and Martin, 2011). We investigate what constraints are necessary in order to get a unique notion of information content. These partial orders can be used to give an ordering on words in vector space models of natural language meaning relating to the contexts in which words are used, which is useful for a notion of entailment and word disambiguation. The construction used also points towards a way to create orderings on the space of density operators which allow a more fine-grained study of entailment. The partial orders in this paper are directed complete and form domains in the sense of domain theory.

06/22/2015


## 1 Introduction

Distributional models of natural language form a popular way to study language in the context of automated natural language processing. These models rely on the Distributional Hypothesis: words that occur in similar contexts have similar meanings.

The categorical compositional distributional model of natural language meaning developed by Coecke, Sadrzadeh and Clark [7, 6] also gives a way of composing distributions. It has been corroborated empirically that on some tests this model performs better than the state of the art [10].

Recently this model has been expanded to take into account lexical ambiguity, notions of homonymy and polysemy [13] and entailment at the word and sentence level by passing from a vector space model to a density matrix model [2, 3, 4]. In these papers various relations between the density matrices were explored to get a definition of entailment on the distributional level, such as the fidelity and the relative entropy. In [4] a modified Löwner ordering was used to get a notion of graded entailment. In [11] a nonsymmetric similarity measure was constructed based on a modified measure of feature inclusion.

In any of these cases the goal is to get a relation between pairs of words or sentences that captures the idea of information content. We say that the word dog entails the word animal because in most contexts where the word dog is used, we could use the more general (less specific, less informative) word animal.

The same is true for word disambiguation. Consider the word bank. It might mean river bank or investment bank. Without any further context we don’t know which one is meant. The word bank offers less information than either of these more specific words. We can consider it to be in a mixed state of these pure meanings, which collapses to a pure state when given the right context.

It therefore makes sense that to get a notion of disambiguation or lexical entailment we should be looking for a relation that captures the idea of information content. The obvious properties that we would require of such a relation are those of a partial order: reflexive, transitive and antisymmetric.

Word vectors are usually constructed by counting the co-occurrence of a word with some set of basis words. After normalisation, the components can be interpreted as probabilities of the word occurring at the same time as each specific basis word. So in fact, the word vector can be seen as a probability distribution.
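The construction described above can be sketched in a few lines. This is a minimal illustration rather than the procedure used in any of the cited works: the function name `cooccurrence_distribution`, the symmetric `window` parameter and the toy sentence are all assumptions made for the example.

```python
from collections import Counter


def cooccurrence_distribution(tokens, target, basis, window=2):
    """Count how often each basis word occurs within `window` tokens of
    `target`, then normalise the counts to a probability distribution."""
    counts = Counter({b: 0 for b in basis})
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the target position itself; only basis words are counted.
            if j != i and tokens[j] in counts:
                counts[tokens[j]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("target never co-occurs with a basis word")
    return [counts[b] / total for b in basis]


# Hypothetical toy corpus and basis, chosen only for illustration.
tokens = "the dog chased the cat and the dog ate the food".split()
basis = ["the", "cat", "food", "chased", "ate"]
dist = cooccurrence_distribution(tokens, "dog", basis)
```

The returned list has one entry per basis word and sums to one, so it can be treated as a point in the space of probability distributions discussed below.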

If a word is instead represented by a density matrix then when it is diagonalised we have a probability distribution on the diagonal. This means that a relation capturing the notion of information content should at least be a partial order on the space of probability distributions.

There is a well-known partial order on the space of positive semi-definite matrices called the Löwner order, but on the space of density operators no two different density operators are comparable (if we have $\rho \sqsubseteq_L \rho'$ then we must have $\rho = \rho'$). This is a direct effect of the normalisation of the trace of the operators. A modification to the Löwner order was made in [4] in order to get a notion of graded entailment. The resulting structure was no longer a partial order, since the modification removed transitivity and replaced it with a weaker condition. In this paper we will show two different modifications to the Löwner order that do result in proper partial orders.
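The triviality of the Löwner order on density operators follows in one line from the trace normalisation. As a short worked derivation, writing $\rho$ and $\rho'$ for two density operators (notation assumed here) and $\geq 0$ for positive semi-definiteness:

```latex
\rho' - \rho \geq 0
\quad\text{and}\quad
\operatorname{Tr}(\rho' - \rho) = \operatorname{Tr}\rho' - \operatorname{Tr}\rho = 1 - 1 = 0
\quad\Longrightarrow\quad
\rho' - \rho = 0 .
```

The last step uses that a positive semi-definite operator has non-negative eigenvalues, so a vanishing trace forces every eigenvalue to be zero.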

An example of a nontrivial partial order on the space of probability distributions that has suitable information-like properties is the Bayesian order outlined in [8, 9]. This is in fact the only example the author could find in the literature. The Bayesian order served as the inspiration for this paper and the results outlined here can be seen as generalisations of the results related to the Bayesian order.

In this paper we will explore what conditions we need in order for the resulting partial order to represent information content. We will also look at what kind of conditions we need in addition to get a unique notion of information content. Since there has been surprisingly little work in the area of partial orders representing information we will focus on partial orders on probability distributions instead of on the bigger space of density operators. We will also just be looking at entailment on the word level and leave compositionality for further research.

Note also that the results in this paper might prove useful in resource theory and quantum information theory as density matrices are quantum states and probability distributions are classical states. The partial orders studied in this paper turn out to be domains: directed complete partial orders which are exact.

## 2 Background

We begin by stating the definition of a partial order.

###### Definition 1.

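A partial order on a space $X$ is a binary relation $\leq$ which is

1. Reflexive: $x \leq x$ for all $x \in X$.

2. Transitive: $x \leq y$ and $y \leq z$ imply $x \leq z$.

3. Antisymmetric: $x \leq y$ and $y \leq x$ imply $x = y$.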
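On a finite set the three axioms of a partial order can be checked by brute force. The following sketch is only an illustration of the definition; the helper name `is_partial_order` and the divisibility example are assumptions, not part of the paper.

```python
from itertools import product


def is_partial_order(elements, leq):
    """Check reflexivity, transitivity and antisymmetry of `leq`
    on a finite collection of elements by exhaustive search."""
    elements = list(elements)
    reflexive = all(leq(x, x) for x in elements)
    transitive = all(
        leq(x, z)
        for x, y, z in product(elements, repeat=3)
        if leq(x, y) and leq(y, z)
    )
    antisymmetric = all(
        x == y
        for x, y in product(elements, repeat=2)
        if leq(x, y) and leq(y, x)
    )
    return reflexive and transitive and antisymmetric


def divides(a, b):
    """Divisibility: a is below b when a divides b."""
    return b % a == 0


def always_related(a, b):
    """A relation that is reflexive and transitive but not antisymmetric."""
    return True
```

Divisibility on `[1, 2, 3, 6]` passes all three checks, while `always_related` fails antisymmetry as soon as there are two distinct elements.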

We can restrict a partial order on the density matrices to the diagonal density matrices. The latter space is equivalent to the space of finite probability distributions $\Delta^n = \{x \in \mathbb{R}^n : x_i \geq 0, \sum_i x_i = 1\}$, which can be interpreted geometrically as the $(n-1)$-simplex.

We can then wonder when this procedure can be reversed: which partial orders on the diagonal density matrices extend to a partial order on the entire space of density matrices? The naive approach is to define $\rho \leq \rho'$ iff $\mathrm{Diag}(\rho) \leq \mathrm{Diag}(\rho')$, where $\mathrm{Diag}(\rho)$ is the probability distribution of the eigenvalues of $\rho$. However, if we take an arbitrary density matrix the diagonalisation will not be fully determined: we can still freely permute the basis vectors. Reflexivity would then imply that any permutation of the basis must be equivalent, which would in turn break antisymmetry. We must require that $\rho$ and $\rho'$ be diagonalised simultaneously in order for them to be comparable by a partial order on $\Delta^n$.

If two density matrices can be diagonalised simultaneously then there is still a freedom of permuting the basis vectors, so a necessary condition for $\leq$ to be extended to the entire space of density operators is for it to be invariant under basis vector permutation:

###### Definition 2.

Let $\leq$ be a partial order on $\Delta^n$. We call it permutation invariant if for any permutation $\sigma$ of the coordinates:

$$x \leq y \iff \sigma(x) \leq \sigma(y).$$

It can be shown that a permutation invariant partial order extends to a partial order on the density matrices (a density matrix is completely determined by its eigenvalues and an orthonormal basis).

A notion of information content is Shannon entropy. On $\Delta^n$ the element with the highest amount of entropy is the uniform distribution $\bot_n = (1/n, \ldots, 1/n)$. The elements with the lowest amount of entropy are the pointed distributions that have $x_i = 1$ for some $i$ and the rest equal to zero, also called the ‘pure’ states. Denote these as $\top_i$. Intuitively $\bot_n$ is the element with the lowest amount of information, and the $\top_i$ are the elements with the highest amount of information. We require that our partial order on $\Delta^n$ respects this: every distribution contains more information than $\bot_n$ and every distribution is smaller than at least one maximal element $\top_i$.

Linguistically a word would be represented by the uniform distribution if it occurred the same number of times in any context, but such a word would of course not add any information to the sentence. A candidate for such a word would for instance be ‘the’. Realistically no word will be represented by the uniform distribution, but we would find examples of words that are uniformly distributed on a subset of contexts, such as the word ‘bank’, which we would expect to occur somewhat uniformly in the contexts of finance and rivers. Stating that each word can be compared to some pure state is akin to stating that each word can be resolved to some pure meaning.
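The extremes of Shannon entropy mentioned above are easy to check numerically. The sketch below assumes entropy measured in bits; the four-context distributions are hypothetical and only meant to mirror the ‘the’ and ‘bank’ examples.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # a word spread evenly over all contexts
pure = [1.0, 0.0, 0.0, 0.0]          # a word tied to a single context
bank_like = [0.5, 0.5, 0.0, 0.0]     # uniform on a subset of contexts
```

On four contexts the uniform distribution attains the maximal entropy of two bits, the pure state has entropy zero, and the distribution that is uniform on two contexts sits in between with one bit.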

In order to restrict ourselves to nontrivial partial orders we will require one further property: that the partial order respects the mixing of information content, defined as such:

###### Definition 3.

We say that a partial order $\leq$ on $\Delta^n$ allows mixing when we have for any $x, y \in \Delta^n$ and $t \in [0,1]$:

$$x \leq y \implies x \leq (1-t)x + ty \leq y$$

This states that when an element contains less information than another and this information is comparable, then mixing the information content will give something with an information content in between. Note that the space of probability distributions is convex. This demand makes the partial order respect that convexity in a natural way. We are now ready to give a minimal definition of partial order that represents information content.

###### Definition 4.

A partial order on $\Delta^n$ which is permutation invariant, allows mixing and has the uniform distribution as the minimal element and the pointed distributions as the maximal elements is an information ordering.

There is a unique partial order satisfying the conditions of Definition 4 on $\Delta^2$ as seen in Figure 1. The pure distributions are at the ends while the uniform distribution is in the middle.

We might hope that these conditions also uniquely determine a partial order for higher values of $n$, but this is not the case. The inductive procedure in [9] uniquely determines a partial order that does have the right properties, but as we will see we can create other partial orders without using this inductive procedure. The structure that is fixed by these conditions is illustrated for $n = 3$ in Figure 2.

We see that the space is cut up into natural regions. We will refer to these as sectors, and will come back to those later.

For reference we will state a definition of the Bayesian order here. The other partial orders in this paper will have a similar format.

###### Definition 5.

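The Bayesian order is defined as $x \sqsubseteq_B y$ iff there is a permutation $\sigma$ such that the coordinates of $\sigma(x)$ and $\sigma(y)$ are both monotonically decreasing and we have

$$\sigma(x)_i \, \sigma(y)_{i+1} \leq \sigma(x)_{i+1} \, \sigma(y)_i$$

for all $1 \leq i \leq n-1$. [9]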
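A comparison of this shape can be sketched as follows. The code assumes the symbolic characterisation of the Bayesian order from [9]: after a common permutation that sorts both distributions in decreasing order, $x_i y_{i+1} \leq x_{i+1} y_i$ must hold for every adjacent pair of coordinates. The function name `bayesian_leq` and the tolerance `tol` are choices made for this illustration.

```python
def bayesian_leq(x, y, tol=1e-12):
    """Return True when x is below y in the (assumed) Bayesian order."""
    if len(x) != len(y):
        raise ValueError("distributions must have the same length")
    n = len(x)
    # Sort positions by decreasing x, breaking ties by decreasing y.
    # If any common sorting permutation exists, this one sorts both.
    order = sorted(range(n), key=lambda i: (-x[i], -y[i]))
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    # y must also be decreasing under the same permutation,
    # otherwise x and y lie in different sectors.
    if any(ys[i] < ys[i + 1] - tol for i in range(n - 1)):
        return False
    # Adjacent-coordinate condition, written without division
    # so that zero coordinates need no special treatment.
    return all(
        xs[i] * ys[i + 1] <= xs[i + 1] * ys[i] + tol for i in range(n - 1)
    )
```

For instance, `bayesian_leq([0.4, 0.35, 0.25], [0.5, 0.3, 0.2])` holds, while `[0.5, 0.3, 0.2]` and `[0.3, 0.5, 0.2]` are incomparable because no single permutation sorts both.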

The condition that comparable elements must both be able to be permuted in the same way might be seen as odd, but it in fact ensures that the elements are part of the same sector (one of the smaller triangles in Figure 2). As we will see in Section 4, the Bayesian order belongs to a class of partial orders that have this property.

## 3 Non-Uniqueness of Information Orderings

We will start by showing that the requirements of Definition 4 are not strong enough to give a unique definition of information content. That is: there exist information orderings $\sqsubseteq_1$ and $\sqsubseteq_2$ such that there are distinct points $x, y$ with $x \sqsubseteq_1 y$ but $y \sqsubseteq_2 x$.

### 3.1 Renormalising the Löwner order

As stated in the introduction, the Löwner order given by $x \sqsubseteq_L y$ iff $x_i \leq y_i$ for all $i$ is trivial ($x \sqsubseteq_L y$ iff $x = y$) on $\Delta^n$. This is due to the fact that the components of $x$ and $y$ both need to sum up to 1. By renormalising the components so that they no longer sum up to the same value, we are able to create a nontrivial order.

There are at least two natural choices for renormalisation: we can set the largest coordinate equal to 1, or we can set the smallest coordinate equal to 1.

The normalisation to the largest coordinate gives the partial order

$$x \sqsubseteq_L^+ y \iff x^+ y_k \leq y^+ x_k \text{ for all } k,$$

where $x^+$ is defined as $\max_k x_k$. This partial order satisfies all the conditions specified in Definition 4, so it is an information ordering.
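The order obtained by normalising to the largest coordinate translates directly into code. The sketch below implements the condition $x^+ y_k \leq y^+ x_k$ for all $k$, with $x^+$ the largest coordinate of $x$; the function name `lowner_plus_leq`, the helper `mix` and the tolerance are assumptions of this illustration.

```python
def lowner_plus_leq(x, y, tol=1e-12):
    """Return True when x^+ * y_k <= y^+ * x_k for every coordinate k."""
    if len(x) != len(y):
        raise ValueError("distributions must have the same length")
    x_max, y_max = max(x), max(y)
    return all(x_max * yk <= y_max * xk + tol for xk, yk in zip(x, y))


def mix(x, y, t):
    """Convex combination (1 - t) * x + t * y of two distributions."""
    return [(1 - t) * xk + t * yk for xk, yk in zip(x, y)]


uniform = [1 / 3, 1 / 3, 1 / 3]
pure = [1.0, 0.0, 0.0]
x = [0.4, 0.35, 0.25]
y = [0.5, 0.2, 0.3]   # coordinates ordered differently from x
```

Here `x` is below `y` even though the two distributions order their second and third coordinates differently, and the mixture `mix(x, y, 0.5)` lies between them, as the mixing condition requires.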

The normalisation to the smallest coordinate is slightly more difficult since the smallest coordinate could be equal to zero. If both elements have the same number of zeroes we can ignore those and use the smallest nonzero element. If an element $y$ has strictly more zeroes than $x$ we can view $y$ as being blown up to infinity while $x$ stays finite, so we would simply define $x \sqsubseteq_L^- y$, as long as their common zeroes are in the same positions.

Keeping this in mind, we can define the second renormalised Löwner order by induction on $n$ as $x \sqsubseteq_L^- y$ if and only if one of the following holds:

1. There is a such that , and .

2. There is a such that , .

3. For all : and .

Here $x^-$ is defined as the smallest nonzero coordinate of $x$.

This is a well-defined partial order and it satisfies all the conditions specified in Definition 4. In Figure 3 you can see that these two renormalisations make a big difference to the resulting partial order. There are points $x$ and $y$ with $x \sqsubseteq_L^+ y$ and $y \sqsubseteq_L^- x$. So there is at least one pair of points where $\sqsubseteq_L^+$ and $\sqsubseteq_L^-$ contradict each other. The conditions specified in Definition 4 are not strong enough to get a unique notion of information content.

A very useful tool for studying the relation between different partial orders is the notion of a measurement, the definition of which we take from [9] and [12].

###### Definition 6.

A measurement is a Scott-continuous strict monotonic map $\mu : P \to Q$ between partial orders.

Monotonicity means that when $x \sqsubseteq y$ we have $\mu(x) \sqsubseteq \mu(y)$, and strictness states that when $x \sqsubseteq y$ and $\mu(x) = \mu(y)$ we have $x = y$. Scott-continuity is not important for us, but it is a useful property in relation to proving that a partial order is directed complete. All the strict monotonic maps in this paper are also Scott-continuous.

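Define the monotone sector of $\Delta^n$ as $\Lambda^n = \{x \in \Delta^n : x_i \geq x_{i+1} \text{ for all } i\}$. This corresponds to the lower rightmost triangle in Figure 2. For each $x \in \Delta^n$ there is a unique $x' \in \Lambda^n$ such that $x' = \sigma(x)$ for some permutation $\sigma$. This gives us a natural retraction $r : \Delta^n \to \Lambda^n$. The map $r$ is a measurement for any information ordering on $\Delta^n$. This means that if we have a measurement $\mu$ of $\Lambda^n$, it extends to a measurement of $\Delta^n$ by composition with $r$.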
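Concretely, the retraction onto the monotone sector amounts to sorting the coordinates in decreasing order. A minimal sketch, assuming the monotone sector consists of the distributions with decreasing coordinates (the names `retract` and `in_monotone_sector` are hypothetical):

```python
def in_monotone_sector(x):
    """True when the coordinates of x are monotonically decreasing."""
    return all(a >= b for a, b in zip(x, x[1:]))


def retract(x):
    """Send a distribution to its unique decreasing rearrangement."""
    return sorted(x, reverse=True)


def extend(mu):
    """Extend a function on the monotone sector to all distributions
    by first applying the retraction."""
    return lambda x: mu(retract(x))
```

The retraction is idempotent and leaves elements of the monotone sector fixed, which is what makes composition with it a natural way to extend a map defined only on that sector.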

The measurements we will be using are of the form $\mu : \Lambda^n \to [0,\infty)^*$ (any partial order that allows such a measurement is a dcpo [12]). Here $[0,\infty)^*$ is the positive interval with the reversed order, so monotonicity means that $x \sqsubseteq y$ implies $\mu(x) \geq \mu(y)$.

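The order $\sqsubseteq_L^+$ has the measurement $\mu^+(x) = 1 - x^+$. The order $\sqsubseteq_L^-$ has a slightly more complicated measurement. Define the zero counting function $Z(x)$ as the number of coordinates of $x$ that are equal to zero. Then when $x \sqsubseteq_L^- y$ we have $Z(x) \leq Z(y)$. If $x \sqsubseteq_L^- y$ and $Z(x) = Z(y)$, then $x^- \geq y^-$, and if additionally $x^- = y^-$ then $x = y$. Putting this together we see that a function $\mu^-$ combining $Z(x)$ and $x^-$ with a constant offset is a measurement of $\sqsubseteq_L^-$. We can read this as first counting the number of zeroes, and then looking at the lowest coordinate. The constant is added such that $\mu^-(x) = 0$ iff $x$ is a maximal element.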

These two measurements capture different ideas of what we “care” about in our information ordering. Respecting the measurement $\mu^+$ states that the head of a distribution is important, while respecting $\mu^-$ means we care about the tail of a distribution.
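The role of a head-based measurement can be illustrated numerically. The sketch below takes $\mu(x) = 1 - \max_k x_k$ as a candidate measurement of the order normalised to the largest coordinate; both this formula and the random sampling scheme are assumptions of the example, and a passing check is evidence rather than proof.

```python
import random


def lowner_plus_leq(x, y, tol=1e-12):
    """x is below y when max(x) * y_k <= max(y) * x_k for all k."""
    x_max, y_max = max(x), max(y)
    return all(x_max * yk <= y_max * xk + tol for xk, yk in zip(x, y))


def mu_plus(x):
    """Candidate measurement: one minus the largest coordinate."""
    return 1.0 - max(x)


def random_distribution(n, rng):
    """Sample a point of the simplex by normalising uniform weights."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def check_monotone(n=3, trials=20000, seed=0):
    """Look for comparable pairs that violate reversed monotonicity."""
    rng = random.Random(seed)
    comparable = 0
    for _ in range(trials):
        x = random_distribution(n, rng)
        y = random_distribution(n, rng)
        if lowner_plus_leq(x, y):
            comparable += 1
            # Reversed order on [0, inf): more information, smaller value.
            if mu_plus(x) < mu_plus(y) - 1e-9:
                return False, comparable
    return True, comparable
```

Calling `check_monotone()` returns a flag and the number of comparable pairs that were sampled; the flag stays true because summing the defining inequalities over all coordinates gives $\max_k x_k \leq \max_k y_k$ whenever $x$ is below $y$.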

Suppose we have two partial orders $\sqsubseteq_1$ and $\sqsubseteq_2$ that have the same measurement $\mu$. Then if $x \sqsubseteq_1 y$ and $y \sqsubseteq_2 x$ we get $\mu(x) \geq \mu(y) \geq \mu(x)$, so $\mu(x) = \mu(y)$, which by strictness gives $x = y$. So partial orders with the same measurement can’t contradict each other. (Note that two partial orders that do not have the same measurement don’t necessarily have to contradict each other.) This gives us a tool for ensuring that a class of partial orders won’t contradict each other.

## 4 Restricted Information Orders

We can extend an information order on $\Delta^n$ to one on the density operators by allowing comparisons if two density operators can be diagonalised simultaneously. Since we have a measurement from $\Delta^n$ to $\Lambda^n$ we can wonder if we can do the same sort of procedure for transitioning from an information order on $\Lambda^n$ to one on $\Delta^n$. That is: we allow comparisons when two elements $x, y \in \Delta^n$ can be brought to $\Lambda^n$ simultaneously by some permutation $\sigma$. In that case $\sigma(x), \sigma(y) \in \Lambda^n$ and we proceed with comparing $\sigma(x)$ and $\sigma(y)$ using a partial order on $\Lambda^n$. This does however not always result in a valid partial order on $\Delta^n$. Suppose we have $x \sqsubseteq y$ where $y$ is a border element of $\Lambda^n$. Then $y$ also lies in a neighbouring sector. Suppose there is an element $z$ in this neighbouring sector such that $y \sqsubseteq z$. Then by transitivity $x \sqsubseteq z$. But $x$ and $z$ are in different sectors, so this is a contradiction. A necessary and sufficient condition to prevent this and ensure we can extend a partial order on $\Lambda^n$ to $\Delta^n$ is the following:

###### Definition 7.

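A partial order $\sqsubseteq$ on $\Delta^n$ (or $\Lambda^n$) satisfies the degeneracy condition when for all $x, y \in \Delta^n$ (or $\Lambda^n$) where $x \sqsubseteq y$ and $y_i = y_j \neq 0$ we have $x_i = x_j$.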

We call this property the degeneracy condition as it ensures that border elements, elements with a degenerated spectrum, are not above any nondegenerated elements. There is a one-to-one correspondence between information orders satisfying the degeneracy condition on $\Delta^n$ and those orders on $\Lambda^n$. We will call an information order that satisfies the degeneracy condition a restricted information order as comparisons between elements are restricted to within sectors. The renormalised Löwner orders are not restricted information orders, while the Bayesian order is a restricted information order.

We are interested in information-like properties of a distribution $x$. If we suppose that all these features can be encoded in terms of real numbers, this would give rise to a feature vector $F(x)$ that is an element of $\mathbb{R}^k$ for some $k$. Comparing the information content of distributions is then translated to comparing the feature vectors of the distributions: $x \sqsubseteq y$ iff $F(x) \leq F(y)$, where $\leq$ is the standard product order on $\mathbb{R}^k$: $v \leq w$ iff $v_i \leq w_i$ for all $i$. For instance, for $\sqsubseteq_L^+$ the feature vector components are $F_k(x) = -x_k/x^+$. For the Bayesian order on $\Lambda^n$ the feature vector is $F_i(x) = x_i/x_{i+1}$ and for majorization it would be $F_k(x) = \sum_{i=1}^{k} x_i$, with the coordinates of $x$ sorted in decreasing order. We can classify these types of orders.
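The feature-vector construction can be phrased generically: pick a feature map and compare its values in the product order. The sketch below is an illustration under stated assumptions. The two feature maps are derived here from the definitions of the corresponding orders (the order normalised to the largest coordinate, and majorization as comparison of partial sums of the decreasing rearrangement), and all function names are hypothetical.

```python
from itertools import accumulate


def product_leq(u, v, tol=1e-12):
    """Standard product order on real vectors of equal length."""
    return all(a <= b + tol for a, b in zip(u, v))


def order_from_features(feature_map):
    """Build a comparison x <= y from a feature map F via F(x) <= F(y)."""
    return lambda x, y: product_leq(feature_map(x), feature_map(y))


def head_features(x):
    """Assumed features for the order normalised to the largest coordinate."""
    x_max = max(x)
    return [-xk / x_max for xk in x]


def majorization_features(x):
    """Partial sums of the coordinates sorted in decreasing order."""
    return list(accumulate(sorted(x, reverse=True)))


head_leq = order_from_features(head_features)
majorized_by = order_from_features(majorization_features)
```

With these definitions `head_leq(x, y)` agrees with the condition $x^+ y_k \leq y^+ x_k$ for all $k$, and `majorized_by(x, y)` holds when every partial sum of the sorted coordinates of `x` is at most the corresponding partial sum for `y`.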

###### Theorem 1.

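Classification of Restricted Information Orderings. All restricted information orderings of the form $x \sqsubseteq y \iff F(x) \leq F(y)$ for some function $F$ can be written as the join or meet of the set of partial orders $\sqsubseteq_A$ defined as

$$x \sqsubseteq_A y \iff f_i(x)\, g_i(y) \leq f_i(y)\, g_i(x) \quad \text{for all } 1 \leq i \leq n-1,$$

where $f_i(x) = x_i - x_{i+1}$ and $g_i(x) = x_{i+1} + \sum_{j=i+2}^{n} A_{ij} x_j$, and where the parameters satisfy lower bounds of the form $1 + \sum_{j=i+2}^{k} A_{ij} > 0$ for $2 \leq i$ and $1 + k A_{10} + \sum_{j=3}^{k} A_{1j} > 0$.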

Furthermore, all these partial orders allow $\mu^+$ as a measurement, which means they are all dcpo’s. The space of these restricted orders is a complete lattice.

Note that the feature vectors of these partial orders are $F_i(x) = f_i(x)/g_i(x)$. Using $f_i$ and $g_i$ instead of $F_i$ turns out to be easier because we can deal more naturally with possible zeroes in $g_i$.

We see that all the parameters are bounded from below, but not from above. In general, higher values for the parameters correspond to partial orders that are less strict. The Bayesian order is retrieved when setting all parameters to zero. In general, the restricted orders don’t respect the ordering given by Shannon entropy. It can be shown that the subset of restricted orders that allow $\mu^-$ as an additional measurement have Shannon entropy as a measurement as well. All the partial orders seen above also have the property that if $x \sqsubseteq y$ and $x_i = 0$ then $y_i = 0$. In other words: the support of $y$ is included in the support of $x$. This ensures that the relative entropy between $x$ and $y$ is finite. These partial orders are therefore somewhat comparable with the entailment relation of [2].

Sharing $\mu^+$ as a measurement ensures that these partial orders don’t contradict each other. So the degeneracy condition is a sufficient condition to get a unique direction of information content. Because this space of orderings is a complete lattice there is a unique minimal order and a unique maximal order. The difference between these and the Bayesian order is shown in Figure 4.

Note that all the restricted orders and $\sqsubseteq_L^+$ share the measurement $\mu^+$, so they don’t contradict each other. It can also be shown that $\sqsubseteq_L^-$ doesn’t contradict any restricted order, so both these renormalisations can serve as valid extensions of the restricted orders.

Such an extension is probably necessary. Restricted information orders only allow comparisons within sectors. The number of sectors in $\Delta^n$ is equal to $n!$. In an empirical natural language model we would usually have $n$ in the hundreds or thousands, so then there are many more sectors than there are words. This probably means that restricted information orders are too restrictive to be used in practice in the study of natural language, as each word will be in its own sector. The renormalised Löwner orders might be better suited to the task.
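To get a feeling for how quickly the number of sectors outgrows any vocabulary, one can simply evaluate the factorial. A small sketch (the function name `sector_count` and the chosen dimensions are only illustrative):

```python
import math


def sector_count(n):
    """Number of sectors of the simplex of distributions on n points,
    one for each ordering of the n coordinates."""
    return math.factorial(n)


def decimal_digits(k):
    """Number of decimal digits of a positive integer."""
    return len(str(k))


for n in (3, 10, 100):
    print(n, decimal_digits(sector_count(n)))
```

Already for one hundred basis words the sector count has well over a hundred decimal digits, far more than the number of words in any corpus.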

These partial orders are all constructed by comparing some feature vectors with each other. This allows for a natural modification to support graded entailment. Suppose we have the partial order $x \sqsubseteq y \iff F(x) \leq F(y)$, then we can define the $k$-entailment as $x \sqsubseteq_k y \iff kF(x) \leq F(y)$ for some number $k$. This is no longer a partial order, but a nonsymmetric entailment measure. This generalises the idea of [4].

## 5 Information Orders on Density Operators

The central idea behind classifying information orders on probability distributions is that we transition to a feature vector. Let’s look at this more closely. We have $\Delta^n \subseteq \mathbb{R}^n$. The space $\mathbb{R}^n$ has a natural partial order, the product order, which is trivial on $\Delta^n$. By using a feature map to transform $\Delta^n$ to a different subset of $\mathbb{R}^k$, we can make this partial order nontrivial.

The same sort of procedure can be used on the space of density operators $DO(n)$. This space can be seen as a subset of the positive operators $PO(n)$. The space $PO(n)$ has a natural partial order in the form of the Löwner order, which is trivial on $DO(n)$. We can again consider a “feature map” $F$ on $DO(n)$, which possibly gives rise to a partial order $\rho \sqsubseteq \rho' \iff F(\rho) \sqsubseteq_L F(\rho')$ where $\sqsubseteq_L$ is the Löwner order. For instance, rescaling $\rho$ by $1/\rho^+$, where $\rho^+$ is the highest eigenvalue of $\rho$, is the natural extension to the density operators of the first renormalised Löwner order described above. In fact, since the Löwner order restricted to diagonal matrices is equal to the product order on $\mathbb{R}^n$, this is a natural generalisation of the construction of information orders on $\Delta^n$.

## 6 Conclusion and Further Research

We have shown that there is a wide variety of partial orders on the space of probability distributions that satisfy the necessary conditions to capture the notion of information content. With an extra restriction (the degeneracy condition) we can make sure that this notion is unique. Unfortunately in practical linguistic applications this condition might prove to be too strict. The renormalised Löwner orderings are less strict in what they can compare and might prove to be more useful, although empirical research is needed to confirm this. The construction of the restricted information orders also points towards a way to create information orderings on the space of density operators, but studying this in detail is outside of the scope of this paper.

In the pursuit of methods that make comparisons between distributions easier we might look at rescaling distributions to study graded entailment (a generalisation of the approach taken in [4]). Another avenue of attack that might work is using the fact that in a high dimensional space words are probably far apart, so that we can be less picky with the comparisons, and set $x \sqsubseteq y$ whenever some elements within a certain radius of $x$ and $y$ are comparable. This procedure would break antisymmetry when considering the entire space, but not when only comparing words (assuming they are far enough apart). Doing this might allow elements in different sectors to be compared by a restricted information order.

## Acknowledgements

The author would like to thank the anonymous reviewers for their valuable feedback. Thanks also go to Bob Coecke, supervisor of the author’s Master’s thesis, of which this publication is part, and to Martha Lewis and Daniel Marsden for valuable comments.