Monads for Measurable Queries in Probabilistic Databases

12/28/2021
by   Swaraj Dash, et al.
University of Oxford

We consider a bag (multiset) monad on the category of standard Borel spaces, and show that it gives a free measurable commutative monoid. Firstly, we show that a recent measurability result for probabilistic database queries (Grohe and Lindner, ICDT 2020) follows quickly from the fact that queries can be expressed in monad-based terms. We also extend this measurability result to a fuller query language. Secondly, we discuss a distributive law between probability and bag monads, and we illustrate that this is useful for generating probabilistic databases.


1 Introduction

Probabilistic databases cater for uncertainty in data. There may be uncertainty about whether rows should be in a database, or uncertainty about what values certain attributes should have.

For example, consider a database of movies. We might have a table that assigns the gross amount to each movie, which may be quite uncertain for older movies. We might have a table that records which actors appeared in which movies, and there may be uncertainty about whether a particular actor appeared in a given movie. The uncertainty might come from incorrect text processing, for example if the information was scraped off internet forums, or just noise in measurement, e.g. if the gross amount is difficult to calculate precisely. This is a simple example, but probabilistic databases have applications in other areas of information extraction as well as in scientific data management, medical records, and in data cleaning. See the textbook [29] for further examples.

In this paper, we argue that the semantics of probabilistic databases lies in combining a probability monad, $P$, with a bag (aka multiset) monad, $B$. This builds on the long-established tradition of using monads to structure computational effects in functional programming [31, 5].

A good semantic analysis is important in view of the recent work of Grohe and Lindner [12, 13], which builds on [19, 6]. This new line of work breaks with the traditional approach of having a fixed finite support for the probabilistic database, and argues that the support should be infinite, possibly uncountable. For example, it may be that the gross takings from a movie are approximated as a real number taken from a normal distribution, and it may be that the number of actors appearing in a movie is unknown and unbounded. This leads to semantic complications and introduces issues of measurability.

1.1 Two monads

We argue that probabilistic databases are best understood as inhabitants of a set, or space, $P(B(X))$, where:

  • $X$ is a space of all records (aka rows, tuples) that are allowed according to the schema. For example, in the movie database above, we put $X = \mathsf{MovieFact}$ where

    since we can either record that an actor appeared in a movie, or that a movie had a certain gross. (Here we are using a standard notation for tagged disjoint unions.)

  • $B$ is a monad of bags (aka multisets). So $B(X)$ is the space of bags over $X$, and these are the deterministic databases for the given schema.

  • $P$ is a probability monad. So $P(B(X))$ is the space of probability distributions (or measures) over the space of deterministic databases, and these are the probabilistic databases for the given schema.

In the traditional case, studied in [29], the probability distributions have finite support. In the general setting proposed by [13], the support of a distribution may be uncountable. This is formalized using measure theory, by placing a $\sigma$-algebra on $X$, and deriving $\sigma$-algebras on $B(X)$ and $P(B(X))$. We can regard this as moving from the category of sets to a category of measurable spaces. As we will show (Theorem 3), the bag monad extends to a monad on the category of measurable spaces. We can then regard $P$ as the Giry monad on the category of measurable spaces [9].

We clarify a subtle point. The support of the distributions in $P(B(X))$ might be infinite, and this means that the set of records that have a chance of appearing in the database can be infinite. But this is a different issue from the sizes of the bags under consideration, which will always be finite. For example, there are infinitely many possibilities as to what the gross from a movie is, but the number of movies will always be finite. The number of actors in a movie is unbounded, but there is never an infinite cast list for a particular movie.

1.2 Measurable queries

In the deterministic setting, a query (aka view) translates a database from one schema to another. For example, we might ask,

(1)

This is a function on deterministic databases. For probabilistic databases, the usual approach is to consider queries on deterministic databases, and then lift them to probabilistic databases. Semantically, this can be regarded as the functorial action of the monad $P$, which gives a translation between probabilistic databases:

Notice that if there is uncertainty about whether an actor appeared in a movie, or about what the gross of the movie was, then this will lead to uncertainty about whether that actor should appear in this view.

This functorial action amounts to pushing forward the probability measure. But this is only legitimate if the query is measurable. In Theorem 4, we show that all queries are measurable provided they are definable in the standard BALG query language for bags [14].
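The pushforward described above can be illustrated in the finite-support case, where measurability is automatic. The following is a minimal sketch, not the construction used in this paper: bags are modelled with `collections.Counter`, a probabilistic database is a dictionary from (frozen) bags to probabilities, and the names `freeze`, `thaw`, `pushforward` and `actors` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def freeze(bag):
    """Turn a Counter into a hashable value so it can be a dictionary key."""
    return frozenset(bag.items())

def thaw(frozen):
    """Inverse of freeze: rebuild the Counter."""
    return Counter(dict(frozen))

def pushforward(dist, query):
    """Push a finite-support distribution over bags forward along `query`."""
    out = defaultdict(Fraction)
    for frozen_db, prob in dist.items():
        # Outcomes that the query maps to the same bag have their mass added.
        out[freeze(query(thaw(frozen_db)))] += prob
    return dict(out)

def actors(db):
    """Deterministic query: the bag of actors credited in the database."""
    return Counter(actor for (actor, _movie) in db.elements())

# Two possible worlds: it is uncertain whether bob appeared in movie m1.
sure = Counter([("alice", "m1")])
unsure = Counter([("alice", "m1"), ("bob", "m1")])
prob_db = {freeze(sure): Fraction(1, 4), freeze(unsure): Fraction(3, 4)}

view = pushforward(prob_db, actors)
```

Here the uncertainty about bob's credit carries over to the view: the bag containing both actors has probability 3/4. In the uncountable setting the sum is replaced by a measure-theoretic pushforward, which is exactly where measurability of the query is needed.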

Our proof of measurability is straightforward, because most of the BALG query operations are directly definable from the monad structure of $B$ (Theorem 3). The remaining operations are easily definable from a fold construction (Theorem 1), which is connected to the fact that $B(X)$ is the free commutative monoid on $X$.

Measurability of a fragment of BALG is perhaps the main technical result of [13]. That work was groundbreaking, but here we have two additional contributions:

  1. we show that the full language BALG is measurable, which allows us to also treat aggregation queries within the same framework, and

  2. we demonstrate that the proof of measurability is almost immediate from the categorical properties of the monad $B$.

We give the full details of BALG in Section 4. But for now we note that another way to see that the particular query (1) is measurable is that it can be written in the monad comprehension syntax as

This comprehension syntax works for any strong monad (Section 2.2.1 and  [31]), indeed it is merely a convenient shorthand for

where is the monad bind (Kleisli composition) and is the monad unit. The predicates (, ) are well-known to be measurable on the domains where they are used here, and so the query must be measurable.
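To make the desugaring of comprehensions into bind and unit concrete, here is a small sketch for finite bags. It is an illustration under stated assumptions, not the query (1) of this paper: the query `high_gross_actors`, its threshold parameter, and the split of the database into two bags `credits` and `grosses` are all hypothetical.

```python
from collections import Counter

def unit(x):
    """Monad unit: the singleton bag."""
    return Counter([x])

def bind(bag, f):
    """Monad bind: disjoint union of f(x) over every occurrence of x in bag."""
    out = Counter()
    for x, n in bag.items():
        for y, m in f(x).items():
            out[y] += n * m
    return out

EMPTY = Counter()  # zero element; never mutated

def high_gross_actors(credits, grosses, threshold):
    """Comprehension [ a | (a, m) <- credits, (m2, g) <- grosses, m == m2, g > threshold ]
    written out as nested binds ending in a unit."""
    return bind(credits, lambda c:
           bind(grosses, lambda g:
           unit(c[0]) if c[1] == g[0] and g[1] > threshold else EMPTY))

credits = Counter([("alice", "m1"), ("bob", "m1"), ("alice", "m2")])
grosses = Counter([("m1", 5.0), ("m2", 1.0)])
result = high_gross_actors(credits, grosses, 2.0)

# The same query as an ordinary Python comprehension over bag elements.
expected = Counter(a for (a, m) in credits.elements()
                   for (m2, g) in grosses.elements()
                   if m == m2 and g > 2.0)
```

The only non-monadic ingredients are the equality and ordering tests, which matches the observation above that measurability reduces to measurability of those predicates.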

As an aside, we remark that much work in the database literature is on computing the results of queries efficiently. In the probabilistic setting, this is even more of a problem. But in this paper (as in [13]) we are focusing on the semantic aspects.

1.3 Generating probabilistic databases

Having established the measurability of the query language, in Section 5 we turn to investigate languages for generating probabilistic databases. For this we turn to the composite of the monads, $P \circ B$, which we have already shown to be a monad in [7] (see also [17, 20]). As we demonstrate, the language for the monad $P \circ B$ appears to be ideal for generating probabilistic databases, at least as an intermediate language.

The paradigm for using infinite support probabilistic databases is still under debate, but typically one would begin from a deterministic database, and then add some randomness. Very simple kinds of randomness include

  • adding noise to certain attributes, such as the movie gross, or blood pressure in a medical database;

  • adding or deleting records at random, if there was uncertainty in the accuracy of those records.

We demonstrate how this can be done easily in the monad $P \circ B$. We also investigate a more elaborate model based on a GDatalog program, which translates very cleanly into the language of the monad.

1.4 Connection with other work on programming semantics

Our work discusses probabilistic databases in the context of monads and functional programming, and so we bring the general ideas of probabilistic databases to the language of functional probabilistic programming languages. We have already prototyped our examples simply by implementing a bag monad in Haskell and using a standard Haskell library for probabilistic inference. The idea of applying ideas from probabilistic programming to databases already has some momentum on the practical side, through languages such as BayesDB [28] and PClean [21]. Slightly further afield are probabilistic logic languages such as Blog [32] and ProbLog [8].

Probabilistic programming is a general approach for statistics. Within statistics, inhabitants of $P(B(X))$ are well-known and important, and called ‘point processes’.

Further over to the semantic side, we note that the relevance of bags for probability has recently been emphasised by Jacobs [17, 18]. Bags are a form of non-determinism, and the problem for combining non-determinism and probability is notoriously subtle, although there has been plenty of recent progress [10, 20, 23, 24, 15, 30]. The particular combination we use here is trouble-free.

1.5 Summary

In this paper we show the following.

  • The Bag monad extends to a strong monad on standard Borel spaces (Thm. 3).

  • The Bag monad gives a free commutative monoid, and has a ‘fold’ construction (Thm. 2, Thm. 1).

  • The BALG language for database queries always yields functions that are measurable (Thm. 4).

  • The composite monad $P \circ B$ combines probability and bags and is useful for generating probabilistic databases.

Acknowledgements.

We are grateful to Peter Lindner for discussions. It has also been helpful to discuss this work with Martin Grohe, Bart Jacobs, Sean Moss, and Philip Saville. We acknowledge funding from Royal Society University Research Fellowship, the ERC BLAST grant, and the Air Force Office of Scientific Research under award number FA9550-21-1-0038. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.

2 Mathematical preliminaries

2.1 Measure theory

Definition 1.

The Borel sets form the least collection of subsets of $\mathbb{R}$ containing the intervals which is closed under complementation and countable unions.

Definition 2.

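A $\sigma$-algebra on a set $X$ is a nonempty family $\Sigma_X$ of subsets of $X$ that is closed under complements and countable unions. The pair $(X, \Sigma_X)$ is called a measurable space (we just write $X$ when $\Sigma_X$ can be inferred from context).

Given $(X, \Sigma_X)$, a measure is a function $\mu : \Sigma_X \to [0, \infty]$ such that for all countable collections of disjoint sets $A_i \in \Sigma_X$, $\mu\left(\bigcup_i A_i\right) = \sum_i \mu(A_i)$. In particular, $\mu(\emptyset) = 0$. It is a probability measure if $\mu(X) = 1$.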

Definition 3.

Let $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ be measurable spaces. A measurable function $f : X \to Y$ is a function such that $f^{-1}(U) \in \Sigma_X$ when $U \in \Sigma_Y$.

Definition 4.

A measurable space is a standard Borel space if it is either measurably isomorphic to $\mathbb{R}$ or it is countable and discrete.

(This is equivalent to the usual definition of standard Borel spaces, which involves Polish spaces.)

Standard Borel spaces include the measurable spaces of real numbers and the integers, as well as all finite discrete spaces such as the booleans. So all the measurable spaces that arise in probabilistic databases are standard Borel, and indeed the restriction to standard Borel spaces is also made in Grohe and Lindner (see [13, Section 3.1]). Standard Borel spaces are closed under countable products and countable coproducts. Moreover, the equality predicate $(=) : X \times X \to 2$ is measurable when $X$ is a standard Borel space.

2.2 Monads

Definition 5.

A monad on a category $\mathcal{C}$ is given by an object $T(X)$ for each object $X$, a morphism $\eta_X : X \to T(X)$ for each object $X$, and for objects $X$ and $Y$ and morphism $f : X \to T(Y)$ a morphism $(\mathbin{>\!\!>\!\!=} f) : T(X) \to T(Y)$, satisfying identity and associativity laws (see e.g. [25]).

A strong monad on a category with products is equipped with a morphism $\mathrm{st}_{X,Y} : X \times T(Y) \to T(X \times Y)$ that respects the structure (see e.g. [25]).

The construction $\mathbin{>\!\!>\!\!=}$ is sometimes called bind or Kleisli composition.

2.2.1 Monad comprehension notation

For any strong monad $T$ we can use a comprehension notation, which is just syntactic sugar for chaining together compositions of $\mathbin{>\!\!>\!\!=}$. The name comes from the fact that this notation resembles set comprehension notation, and when $T$ is the powerset monad on the category of sets, this is exactly set comprehension. But it makes sense for any monad, and is often used with the list monad [31]. As we will see (Section 4), for the bag monad, it gives an alternative notation for queries based on products, projection and selection (see also [5]).

Given , , , , and given , we write

for the composite morphism

where

2.3 The Giry monad

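The Giry monad [9] is a first key monad on measurable spaces. It also restricts to standard Borel spaces. If $X$ is a measurable space, then $P(X)$ is the set of probability measures on $X$ equipped with the $\sigma$-algebra generated by the sets $\{\mu \mid \mu(U) < r\}$ for $U \in \Sigma_X$ and $r \in [0, 1]$. The unit is given by the Dirac measures ($\delta_x(U) = 1$ if $x \in U$, otherwise $0$). The bind is given by Lebesgue integration: if $f : X \to P(Y)$ and $\mu \in P(X)$ then $(\mu \mathbin{>\!\!>\!\!=} f)(U) = \int_X f(x)(U)\, \mu(\mathrm{d}x)$. The strength is given by $\mathrm{st}(x, \mu) = \delta_x \otimes \mu$.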
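For intuition, the Giry monad has a familiar finite-support analogue in which integration becomes a weighted sum. The sketch below is only that discrete analogue, with distributions represented as dictionaries from outcomes to exact probabilities; the names `dirac`, `bind` and `strength` are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

def dirac(x):
    """Unit: the Dirac distribution concentrated at x."""
    return {x: Fraction(1)}

def bind(mu, f):
    """Bind: integrate the kernel f against mu (a finite weighted sum here)."""
    out = defaultdict(Fraction)
    for x, p in mu.items():
        for y, q in f(x).items():
            out[y] += p * q
    return dict(out)

def strength(x, mu):
    """Strength: pair a fixed value x with a random value drawn from mu."""
    return {(x, y): p for y, p in mu.items()}

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Distribution of the sum of two independent fair coin flips.
two_flips = bind(coin, lambda a: bind(coin, lambda b: dirac(a + b)))
```

The unit laws can be checked directly on examples: binding a Dirac distribution with a kernel returns the kernel's value, and binding any distribution with `dirac` returns it unchanged.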

3 The bag monoid and monad on measurable spaces

Let $X$ be a set. A bag, aka multiset, is a finite unordered list of elements of $X$, or more formally an equivalence class of lists under permutation. Equivalently, a bag is a function $b : X \to \mathbb{N}$ such that $\{x \mid b(x) \neq 0\}$ is finite, or more formally it is an integer valued finite measure.

In this section we will focus on bags in the category of standard Borel spaces. We will show that the bags form a free commutative monoid, and support a ‘fold’ operation. We will also show that the bag construction forms a strong monad.

We begin by defining the measurable space of bags on some measurable space.

Definition 6.

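Let $(X, \Sigma_X)$ be a measurable space. Let $B(X)$ be the set of bags on the set underlying $X$. Equip $B(X)$ with the least $\sigma$-algebra $\Sigma_{B(X)}$ containing the generating sets $\{b \in B(X) \mid b \text{ has exactly } k \text{ elements in } U\}$ for $U \in \Sigma_X$ and $k \in \mathbb{N}$.

Then $(B(X), \Sigma_{B(X)})$ is the measurable space of bags of $X$.

Some observations about the space of bags are helpful in what follows. First we note that for any measurable space $X$ we can decompose the set of bags of $X$ into a disjoint union of the sets $B_n(X)$ of bags of size $n$ for all $n \in \mathbb{N}$. We can equip each $B_n(X)$ with the sub-$\sigma$-algebra. Then,

$$B(X) \cong \biguplus_{n \in \mathbb{N}} B_n(X).$$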

Second we record the following lemma.

Lemma 1.

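Let $X$ be a non-empty standard Borel space, and write $B_n(X)$ for the space of bags of size $n$. The quotient function $q : X^n \to B_n(X)$, which takes a tuple $(x_1, \dots, x_n)$ to the bag $[x_1, \dots, x_n]$, is measurable, and has a measurable section $s : B_n(X) \to X^n$.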

Proof.

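The idea is that we can regard the space $B_n(X)$ of bags of size $n$ as a space of sorted lists in $X^n$, as is common in practice in databases. Any standard Borel space is either isomorphic to the reals or countable and discrete. All of these spaces have a measurable total order $\leq$. There may or may not be a canonical choice for a particular $X$, but it doesn’t matter for the sake of this proof.

We can then use this within the language of measurable functions to write a measurable sorting function $\mathrm{sort} : X^n \to X^n$ that takes a list and returns the sorted version of it. (For example, if $n = 2$, let $\mathrm{sort}(x, y) = (x, y)$ if $x \leq y$ and otherwise $\mathrm{sort}(x, y) = (y, x)$.)

As a set-theoretic function, this sorting function factors through the quotient map $q$, via a section $s$:

$$\mathrm{sort} = s \circ q : X^n \xrightarrow{\;q\;} B_n(X) \xrightarrow{\;s\;} X^n.$$

It remains to show that these two functions are measurable. That $q$ is measurable is well-known, and in fact the $\sigma$-algebra on $B_n(X)$ can be characterized as $\{U \mid q^{-1}(U) \in \Sigma_{X^n}\}$ [22, 26]. Finally, to see that $s$ is measurable, suppose $U \in \Sigma_{X^n}$; then we must show that $s^{-1}(U) \in \Sigma_{B_n(X)}$, i.e. that $q^{-1}(s^{-1}(U)) \in \Sigma_{X^n}$. Since $q^{-1}(s^{-1}(U)) = \mathrm{sort}^{-1}(U)$, and $\mathrm{sort}$ is measurable, we are done. ∎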
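The factorisation of sorting through the quotient map can be seen concretely for finite data. This is a minimal sketch assuming the elements are totally ordered Python values; `quotient` and `section` are illustrative names, and a bag is modelled as a `collections.Counter`.

```python
from collections import Counter

def quotient(xs):
    """Forget the order of a tuple, keeping multiplicities."""
    return Counter(xs)

def section(bag):
    """Choose the sorted tuple as the canonical representative of a bag."""
    return tuple(sorted(bag.elements()))

xs = (3, 1, 2, 1)
round_trip = section(quotient(xs))        # the same as sorting xs
back_again = quotient(section(quotient(xs)))
```

Composing `section` after `quotient` is sorting, while composing `quotient` after `section` is the identity on bags, which is what it means for `section` to be a section of the quotient map.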

Proposition 1.

The space $B(X)$ is standard Borel when $X$ is standard Borel.

Proof.

If $X$ is standard Borel, then so is $X^n$, since any countable product of standard Borel spaces is standard Borel. So each space $B_n(X)$ of bags of size $n$ is standard Borel, since any retract of a standard Borel space is standard Borel. So $B(X)$ is a countable disjoint union of standard Borel spaces, hence also standard Borel. ∎

Note. Since all the spaces involved in probabilistic databases are standard Borel, in the remainder of this paper we only consider standard Borel spaces.

3.1 Measurable structural recursion on bags

All the computations we are interested in rely on a form of structural recursion over bags, which we now introduce. This is reminiscent of the fold construction from functional programming. For example, given a list of integers, it is possible to compute the sum of its elements by extracting elements starting at the head and calculating a running sum until we reach the end of the list. In this way the function sum can be defined as a fold of plus with initial value $0$, where plus plays the role of the accumulating function and $0$ is the initial argument provided to plus along with the head of the list. If the list being considered is empty, the result of the fold is simply the initial argument provided, which in this case is $0$. The same approach works for bags too, provided that the result of the accumulating function does not depend on the order in which it receives its arguments. This leads us to the definition of commutative functions. In the rest of this section we define what it means to measurably fold a bag.
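As a rough illustration of folding over a bag, the following sketch models bags as `collections.Counter` objects. The function name `fold` and its argument order are choices made for this example only; the result is well defined only when the accumulating function is commutative in the sense defined next.

```python
from collections import Counter

def fold(f, init, bag):
    """Apply the accumulating function f to every element of the bag,
    starting from init. The traversal order is arbitrary, so f must
    give the same result whatever order the elements arrive in."""
    acc = init
    for x in bag.elements():
        acc = f(x, acc)
    return acc

def plus(x, acc):
    return x + acc

def bag_sum(bag):
    return fold(plus, 0, bag)

# The same bag built in two different insertion orders.
b1 = Counter([1, 2, 2, 5])
b2 = Counter([5, 2, 1, 2])
```

Both `bag_sum(b1)` and `bag_sum(b2)` give 10, and folding the empty bag returns the initial value.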

Definition 7.

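A function $f : X \times Y \to Y$ is commutative if $f(x_1, f(x_2, y)) = f(x_2, f(x_1, y))$ for all $x_1, x_2 \in X$ and $y \in Y$.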

Definition 8.

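Let $f : X \times Y \to Y$ be a measurable commutative function. Then define $\mathrm{fold}(f) : B(X) \times Y \to Y$ to be the function which applies the accumulating function $f$ with initial value $y \in Y$ to each element of a bag $b \in B(X)$ one-by-one. Note that the order of selection of elements does not matter as $f$ is commutative. We first define $\mathrm{fold}_n(f)$ for bags of size $n$. When $n = 0$, $\mathrm{fold}_0(f)(\emptyset, y) = y$. For non-zero $n$,

$$\mathrm{fold}_n(f)([x_1, \dots, x_n], y) = f(x_n, \mathrm{fold}_{n-1}(f)([x_1, \dots, x_{n-1}], y)).$$

From this, we obtain fold as the unique function coming out of the coproduct of each of the $\mathrm{fold}_n(f)$’s above, giving us

$$\mathrm{fold}(f) : B(X) \times Y \to Y.$$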

Theorem 1.

$\mathrm{fold}(f)$ is measurable for commutative measurable $f$.

Proof.

We use Lemma 1. First we define fold on lists:

This is clearly measurable, because it is just built from composition of measurable functions and product operations. Next we note that for any section of the quotient map,

The commutativity of $f$ means that the choice of section $s$ does not matter. This again is a composition of measurable functions and so is measurable. The full function is a copairing of measurable functions, and so it is measurable too. ∎

3.2 The space of bags as the free commutative monoid

In order to define a monoidal structure on the space of bags we first consider the function $\mathrm{add} : X \times B(X) \to B(X)$ which adds a single element to a bag, incrementing its multiplicity by one. It is clear that add is commutative.

Proposition 2.

$\mathrm{add}$ is measurable.

Proof.

Consider a generating measurable set $S_{U,k} \subseteq B(X)$, for $U \in \Sigma_X$ and $k \in \mathbb{N}$. This is the set of bags with exactly $k$ elements belonging to $U$. Then $\mathrm{add}^{-1}(S_{U,k})$ is the set of pairs $(x, b)$ such that $\mathrm{add}(x, b) \in S_{U,k}$. In other words, each bag in $S_{U,k}$ is decomposed into a set of pairs consisting of an element from the bag and the remaining bag. We consider the cases when $k = 0$ and when $k > 0$. In both these cases the inverse image is in $\Sigma_{X \times B(X)}$.

  • $k = 0$: Here we consider the set of bags such that no element belongs to $U$. Then it is guaranteed that any element removed from the bag will be in $X \setminus U$ and the remaining bag still in $S_{U,0}$, resulting in the inverse image being $(X \setminus U) \times S_{U,0}$, which is in $\Sigma_{X \times B(X)}$.

  • $k > 0$: Here each bag in the set has a non-zero number of elements in $U$ and as a result we have two further cases depending on whether or not the element extracted from the bag is in $U$. If it is not in $U$, the pair consisting of the element and the remaining bag belongs to $(X \setminus U) \times S_{U,k}$. If it belongs to $U$, the pair is an element of $U \times S_{U,k-1}$. And so the inverse image of $S_{U,k}$ under add is the union of these two sets. Each of these sets is an element of $\Sigma_{X \times B(X)}$; consequently so too is their union.

With add as an accumulating function we can define the disjoint union of two bags as a measurable function $\uplus : B(X) \times B(X) \to B(X)$ by considering the fold of add, where one bag provides all the new elements to be added to the other bag, which acts as the base case.
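The definition of disjoint union as a fold of add can be sketched for finite bags as follows. This is an illustration only, with bags modelled as `collections.Counter` and with `fold`, `add` and `dunion` as names chosen for the example.

```python
from collections import Counter

def fold(f, init, bag):
    """Fold a commutative accumulating function over the elements of a bag."""
    acc = init
    for x in bag.elements():
        acc = f(x, acc)
    return acc

def add(x, bag):
    """Return a new bag with the multiplicity of x incremented by one."""
    out = bag.copy()
    out[x] += 1
    return out

def dunion(b1, b2):
    """Disjoint union: add every element of b1 to b2, which is the base case."""
    return fold(add, b2, b1)

b1 = Counter(["a", "a", "b"])
b2 = Counter(["b", "c"])
both = dunion(b1, b2)
```

The result agrees with the built-in `Counter` addition, and the empty bag acts as the unit for `dunion`.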

Theorem 2.

For any standard Borel space $X$, $(B(X), \uplus, \emptyset)$ is a free commutative monoid.

Proof.

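First note that $(B(X), \uplus, \emptyset)$ is a commutative monoid. Given any commutative monoid $(M, \cdot, e)$ and a map $f : X \to M$ we can define $\hat{f} = \mathrm{fold}(\mathrm{monAcc})(-, e)$ where monAcc is the composite

$$X \times M \xrightarrow{\; f \times \mathrm{id} \;} M \times M \xrightarrow{\;\cdot\;} M.$$

From this we obtain the unique commutative monoid homomorphism $\hat{f} : B(X) \to M$ extending $f$. ∎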
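The universal property can be illustrated with a small sketch: given a map into a commutative monoid, fold produces the induced homomorphism out of bags. The helper `extend` and the example monoid (integers under addition) are assumptions made for this illustration.

```python
from collections import Counter

def fold(f, init, bag):
    """Fold a commutative accumulating function over the elements of a bag."""
    acc = init
    for x in bag.elements():
        acc = f(x, acc)
    return acc

def extend(f, op, e):
    """Extend f : X -> M to a map from bags on X to the commutative monoid (M, op, e)."""
    def mon_acc(x, m):
        # First apply f to the element, then combine using the monoid operation.
        return op(f(x), m)
    return lambda bag: fold(mon_acc, e, bag)

# Example: X = strings, M = integers under addition, f = string length.
total_length = extend(len, lambda a, b: a + b, 0)

b1 = Counter(["ab", "c"])
b2 = Counter(["ab", "def"])
```

On examples one can check the homomorphism conditions: the empty bag goes to the unit, a singleton bag goes to the image of its element, and disjoint union goes to the monoid operation.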

As an aside, we remark that a fold-like operation is sometimes regarded as immediate from the free (commutative) monoid property. For example, in a cartesian closed category with list objects $L(X)$, the space $(Y \Rightarrow Y)$ is a monoid (under composition), and hence any map $X \to (Y \Rightarrow Y)$ induces a canonical monoid homomorphism $L(X) \to (Y \Rightarrow Y)$, which is a curried form of fold. However, the category of measurable spaces is not cartesian closed [2, 16], and so we have recorded the existence of fold as a separate fact to the free commutative monoid property.

3.3 The bag monad

We now use this universal property to describe the structure of the bag monad on standard Borel spaces.

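  • The unit $\eta_X : X \to B(X)$ is given by the singleton bag: $\eta_X(x) = [x]$. This is measurable because the inverse image under $\eta_X$ of the generating set of bags with exactly $k$ elements in $U$ is $U$ if $k = 1$, $X \setminus U$ if $k = 0$, and $\emptyset$ otherwise.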

  • The bind is given as follows. Informally, for , let ,

    Formally, we apply fold to the composite measurable function

    to get a measurable function , and then pass in the empty set as the initial argument. Equivalently, the monad multiplication can be given by applying fold to the function

    to get a measurable function , by passing in as the initial argument.

  • The strength is given by applying fold to the function

    to get a measurable function , and passing in the empty set as the initial argument and projecting the first result.

As an aside we note that in the statistics literature, it is quite common to regard $B(X)$ as a space of integer valued measures on $X$. With this perspective, regarding $B$ as a monad of measures, the strong monad structure on $B$ is entirely analogous to the monad structure of the Giry monad $P$.

Theorem 3.

$B$ is a strong monad on the category of standard Borel spaces.

4 Measurable query operations on bags

In the standard theory of database modelling, relations are assumed to be sets, disallowing the existence of duplicates. Most database software, however, relaxes this restriction, often to save the cost of duplicate elimination. BALG (“bag algebra”), an algebra for manipulating bags, was first introduced in [14]. In that paper BALG was presented as an extension of the nested relation algebra (RALG), with a focus on the study of its expressive power and its complexity relative to RALG. The authors showed that BALG as a query language was more expressive than RALG.

In this Section we will consider the entire BALG query language and show that it extends to measurable functions on bags.

For now we briefly review the query language BALG; we discuss these queries and their semantics in more detail later in this Section. The singleton operation returns a singleton bag consisting of the input. Restructuring rows of tables is possible using the map query, which applies a given function $f$ to every row in the table. The queries product, dunion, difference, union, and intersect compute the product, disjoint union, difference, union, and intersection of the input tables respectively. The project query projects out user-specified columns. flatten transforms a bag of bags to a bag consisting of the disjoint unions of all the internal bags. Duplicate elimination, or deduplication, is possible using the dedup query. Finally, we can compute the bag of sub-bags of any bag using powerbag, and the bag of subsets of any bag using powerset.

Given the expressiveness of the BALG query language it comes as no surprise that many operations can be defined in terms of each other. For example, the powerset of a bag is simply the deduplicated version of the powerbag. It is also known that the union and intersection of bags can be defined using the disjoint union and difference operators. To this end we will only consider the following minimal subset of BALG queries, in terms of which all other queries can be defined: {singleton, flatten, map, product, project, select, dunion, difference, powerbag, dedup}.
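The interdefinability of union and intersection with disjoint union and difference can be checked on small examples. The sketch below uses the built-in arithmetic of `collections.Counter` as a stand-in for the BALG operations (its `+` is disjoint union and its `-` is truncated difference); the particular defining equations are the standard multiset identities, stated here as an illustration rather than as the definitions used in [14].

```python
from collections import Counter

def dunion(a, b):
    """Disjoint union: multiplicities are added."""
    return a + b

def difference(a, b):
    """Bag difference: multiplicities are subtracted, truncating at zero."""
    return a - b

def union(a, b):
    """Union (maximum of multiplicities), from dunion and difference."""
    return dunion(a, difference(b, a))

def intersect(a, b):
    """Intersection (minimum of multiplicities), from difference alone."""
    return difference(a, difference(a, b))

def dedup(a):
    """Duplicate elimination: every element present gets multiplicity one."""
    return Counter(set(a))

a = Counter("aab")    # a:2, b:1
b = Counter("abbc")   # a:1, b:2, c:1
```

On these inputs `union(a, b)` and `intersect(a, b)` agree with `Counter`'s own `|` and `&` operators.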

Previous work: In their work, Grohe and Lindner [12] considered BALG$^1$, a subset of BALG restricted to bags of nesting level 1. That is, the queries of BALG$^1$ are defined on bags of type $B(X)$ where $X$ cannot have another bag type in its definition. The minimal set of queries for BALG$^1$ is the same set we consider here minus flatten and powerbag, since they operate on bags of bags. In their work Grohe and Lindner showed that BALG$^1$ queries extend to measurable functions on bags. We generalise their results and show, using our monadic and monoidal structure on bags, that all of BALG extends to measurable functions on bags, and give a clearer picture of how it comes together. Furthermore, we discuss the actions of grouping and aggregation as measurable queries in BALG.

4.1 Measurability of BALG queries

We provide a semantics to BALG queries by mapping each query to a measurable function on bags. The measurability of the semantics of the singleton, flatten, map, product, project, and select queries is guaranteed by defining their semantics as monad comprehensions. The measurability of the semantics of the remaining queries, dunion, difference, powerbag, and dedup, is obtained by defining their semantics using our fold construction introduced in Section 3. Note that commutativity holds for all the measurable accumulating functions in the fold-based definitions to follow. The condition of commutativity is easy to check.

Bagging and flattening

The semantics for the singleton and flatten queries are given by the unit and multiplication maps for the bag monad $B$. The measurability of these maps is proved in Theorem 3.

This can be neatly written using monad comprehension notation: .
A similar treatment can be given to the project query which projects out the indices of the input schema.

In monad comprehension syntax, for a monad with a given zero element (e.g. ), a shorthand notation is often used.

Disjoint union

dunion simply computes the disjoint union of its arguments. In Section 3.2 we defined the measurable disjoint union as the fold of add.

To get the final bag after removal we project out the second element of the pair returned by remAcc.
Using remove we define the bag difference of $b_1$ and $b_2$ by letting $b_1$ be the initial input and from it remove-ing each element in $b_2$ one-by-one. The measurability of bag difference follows from the commutativity and measurability of remove.
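Bag difference as a fold of a removal function can be sketched as follows for finite bags. This is a simplification for illustration: here `remove` returns the remaining bag directly rather than a pair, and all names are chosen for this example.

```python
from collections import Counter

def fold(f, init, bag):
    """Fold a commutative accumulating function over the elements of a bag."""
    acc = init
    for x in bag.elements():
        acc = f(x, acc)
    return acc

def remove(x, bag):
    """Remove one occurrence of x from the bag, if there is one."""
    out = bag.copy()
    if out[x] > 0:
        out[x] -= 1
        if out[x] == 0:
            del out[x]  # keep only elements with positive multiplicity
    return out

def difference(b1, b2):
    """Start from b1 and remove each element of b2 one by one."""
    return fold(remove, b1, b2)

b1 = Counter("aabc")
b2 = Counter("abbd")
left_over = difference(b1, b2)
```

Removals of different elements commute, so the traversal order of `b2` does not affect the result, which matches `Counter`'s truncated subtraction.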

The function is measurable since and , both of which are measurable sets (due to being standard Borel).

Theorem 4.

BALG queries yield measurable functions on bags.

4.2 Grouping and aggregation

Consider, for example, the table from the database MovieFact introduced at the start of this paper. A natural query that one may want to compute is, “How many movies has each actor appeared in?” In order to calculate the answer to this we first need to be able to group actors with the bag of all the movies they appeared in. To this resultant table we can map a size function to the second column to get the numbers we need. Here we introduce a group query to BALG and show that it is a measurable operation on bags.

Definition 9.

The query acts on tables of schema and is parametrized by two projection functions and . The result of this query is a table with schema where the elements of the first column are paired with the bag of elements they were related to in the input table. In other words, we group the rows of the table by the elements in the -projection of the table.

The measurable bag-semantics for is given by

In the monad comprehension we first project out the columns of interest using and deduplicate the resultant bag. From this bag we extract out the rows indices by which we index the rows of the input bag. For each index we return the pair consisting of along with the -projection of the input where the -projection of the rows is equal to . We can conclude that this query is measurable by defining it as a monad comprehension composed of other measurable functions.

Recall the actor grouping example suggested earlier. Given an input bag from we can apply to create a table of rows relating actors to the bag of movies they appeared in. To this table we can apply to arrive at the final result. The function can be measurably defined as a fold, for example.
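The actor-counting example can be sketched on a finite table as follows. The representation is an assumption made for illustration: the grouped table is a dictionary from keys to bags (each key occurs once after deduplication), and `group`, `size` and the sample data are hypothetical.

```python
from collections import Counter

def group(table, key, val):
    """Group the rows of a bag by key(row), collecting the bag of val(row)."""
    keys = set(key(row) for row in table)  # deduplicated key projection
    return {k: Counter(val(row) for row in table.elements() if key(row) == k)
            for k in keys}

def size(bag):
    """Number of elements of a bag, counted with multiplicity."""
    return sum(bag.values())

credits = Counter([("alice", "m1"), ("alice", "m2"), ("bob", "m1")])

# Group movies by actor, then aggregate each group with size.
grouped = group(credits, key=lambda row: row[0], val=lambda row: row[1])
counts = {actor: size(movies) for actor, movies in grouped.items()}
```

Here `counts` records that alice appears in two movies and bob in one.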

A second option for defining a group/aggregation query comes from an extension to monad comprehensions in Haskell where the syntax has been extended with the keywords group by [27]. This extension works for any strong monad, but the user needs to provide a grouping function which, in our case, needs to map a bag on $X$ to a bag of bags on $X$ where each sub-bag contains copies of the same element. This can be written in BALG as

So , with:

The entire actor query can now be concisely written in the notation of [27] as

This modified comprehension syntax of [27] works by implicitly changing the types of and from Actor and Movie in the right half of the comprehension to and on the left half. This allows us to apply the measurable aggregation function to . The aggregation function used on is , which is some measurable function such that when is a bag that only contains copies of . (For example, we could sort and return the first element.)

5 Generating probabilistic databases

The main focus of this paper has been measurable query languages (Theorem 4). We now turn to the question of where probabilistic databases come from in the first place, particularly in the setting where they have infinite support. A good language for generating infinite probabilistic databases remains a topic of active research, but we now illustrate that the monads for probability $P$ and bags $B$ could be the basis of a good intermediate language.

5.1 A distributive law

We recall the distributive law between the monads $B$ and $P$ that we provided in an earlier paper [7] (see also [17, 20, 33]):

$$B(P(X)) \to P(B(X)).$$

Using our fold technology (Thm. 1) we can define the distributive law as , where

As usual, this distributive law determines a monad structure on $P \circ B$ [3].
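In the finite-support case, a distributive law of this shape can be sketched as a fold that samples each element independently and collects the results into a bag. The sketch below is an illustration under assumptions, not the definition from [7]: distributions are dictionaries with exact probabilities, the input bag of distributions is given as a list (whose order is irrelevant), and output bags are represented as sorted tuples so that they can be dictionary keys.

```python
from collections import defaultdict
from fractions import Fraction

def dirac(x):
    """Unit of the finite distribution monad."""
    return {x: Fraction(1)}

def bind(mu, f):
    """Bind of the finite distribution monad."""
    out = defaultdict(Fraction)
    for x, p in mu.items():
        for y, q in f(x).items():
            out[y] += p * q
    return dict(out)

def add(x, bag):
    """Add x to a bag represented as a sorted tuple."""
    return tuple(sorted(bag + (x,)))

def dist_law(bag_of_dists):
    """Turn a bag of distributions into a distribution over bags."""
    acc = dirac(())  # start from the empty bag with probability one
    for mu in bag_of_dists:
        acc = bind(acc, lambda b, mu=mu:
                   bind(mu, lambda x, b=b: dirac(add(x, b))))
    return acc

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
two_coins = dist_law([coin, coin])
```

Because bags are unordered, the outcomes (0, 1) and (1, 0) are identified, so the bag containing one 0 and one 1 has probability 1/2.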

5.2 Randomizing attributes

For a first example of a probabilistic database, suppose we are given a deterministic database of (movie,gross) pairs. We may then decide that the gross figure is inaccurate and should be subject to a noise from a normal distribution, yielding a probabilistic database. This can be done categorically by the following map:

Since $P \circ B$ is a monad, we can use comprehension notation for it, and equivalently write the above generation method as

Here we are implicitly casting $B$ to $P \circ B$ and $P$ to $P \circ B$, implicitly using the units of $P$ and $B$.

Another use-case for random attributes, studied in [13], is to deal with null attributes by drawing them randomly from a vague prior distribution. This would also be easy to express using the monad.
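Adding noise to an attribute can be prototyped with a sampler, which draws one deterministic database from the intended probabilistic database. This is a sketch under assumptions: the function `add_noise`, the standard deviation parameter `sigma`, and the sample data are hypothetical, and sampling stands in for the distribution itself.

```python
import random
from collections import Counter

def add_noise(db, sigma, rng):
    """Sample a noisy copy of a bag of (movie, gross) pairs, replacing each
    gross by a draw from a normal distribution centred on the recorded value."""
    return Counter((movie, rng.gauss(gross, sigma))
                   for (movie, gross) in db.elements())

def size(bag):
    """Number of elements of a bag, counted with multiplicity."""
    return sum(bag.values())

db = Counter([("m1", 100.0), ("m2", 250.0), ("m3", 40.0)])
rng = random.Random(0)  # fixed seed so that runs are repeatable
noisy = add_noise(db, sigma=5.0, rng=rng)
movies_before = Counter(movie for (movie, _g) in db.elements())
movies_after = Counter(movie for (movie, _g) in noisy.elements())
```

Each row is perturbed independently, which corresponds to mapping a noise kernel over the bag and then applying the distributive law; the bag of movie names is unchanged by the perturbation.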

5.3 Adding random records

We can also add and remove random records straightforwardly. We note a few helpful facts.

  • The disjoint union operation lifts to $P(B(X))$ by composing with the strength of $P$. In this way the composite monad has a commutative monoid structure.

  • Since $B(1) \cong \mathbb{N}$, we can regard the Poisson distribution as a map into $P(B(1))$, parameterized by rate.

Now supposing we also have reasonably uniform distributions

and , we can delete some credits and generate random additional actors for movies, modelling the fact that some actors are unlisted:

The first line deletes random rows with probability , and the second line adds in some extra actors (on average, 3 extra actors). Of course, a more sophisticated model could take into account other prior information such as relationships and ages between actors.
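The two steps described above (random deletion, then a Poisson number of extra records) can be prototyped as a sampler. This is a sketch under several assumptions that are not taken from the paper: the function names, the choice to add extras per movie, the uniform choice of extra actors from a given list, and Knuth's multiplication method for Poisson sampling.

```python
import math
import random
from collections import Counter

def poisson(rate, rng):
    """Sample from a Poisson distribution using Knuth's multiplication method."""
    limit = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1

def perturb(credits, movies, actors, p_delete, rate, rng):
    """Delete each credit independently with probability p_delete, then add a
    Poisson(rate) number of extra (actor, movie) credits for every movie."""
    kept = Counter(row for row in credits.elements()
                   if rng.random() >= p_delete)
    for movie in movies:
        for _ in range(poisson(rate, rng)):
            kept[(rng.choice(actors), movie)] += 1
    return kept

credits = Counter([("alice", "m1"), ("bob", "m1"), ("alice", "m2")])
rng = random.Random(0)  # fixed seed so that runs are repeatable
sample = perturb(credits, ["m1", "m2"], ["carol", "dave"], 0.1, 3.0, rng)
```

With `p_delete` and `rate` both zero the input is returned unchanged, and with `p_delete` equal to one every original credit is deleted, which gives two simple sanity checks on the sampler.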

5.4 Towards GDatalog

The GDatalog language has recently been proposed as a generative language for probabilistic databases [4, 11]. The language combines datalog-style features with continuous probability distributions.

In general, GDatalog is recursive. We have not treated recursion in this paper, so we focus on the non-recursive fragment. This can easily be translated into the $P \circ B$ monad. For example, consider the following GDatalog program taken from [11]. The idea is to simulate possibly faulty burglar alarms which either go off because of a burglary or because of an earthquake.