Learning Over Dirty Data Without Cleaning

04/05/2020 ∙ by Jose Picado, et al. ∙ Oregon State University 0

Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions for a dirty database. We propose DLearn, a novel relational learning system that learns directly over dirty databases effectively and efficiently without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean instances of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Users often would like to learn interesting relationships over relational databases (Kimmig et al., 2020; Raedt et al., 2017; Domingos, 2018; De Raedt, 2010; Getoor and Taskar, 2007; Zeng et al., 2014). Consider the IMDb database (imdb.com) that contains information about movies whose schema fragments are shown in Table 1 (top). Given this database and some training examples, a user may want to learn a new relation highGrossing(title), which indicates that the movie with a given title is high grossing. Given a relational database and training examples for a new relation,

relational machine learning

(relational learning) algorithms learn (approximate) relational models and definitions of the target relation in terms of existing relations in the database (De Raedt, 2010; Getoor and Taskar, 2007; Richardson and Domingos, 2006; Mihalkova and Mooney, 2007; Muggleton et al., 2009; Quinlan, 1990). For instance, the user may provide a set of high grossing movies as positive examples and a set of low grossing movies as negative examples to a relational learning algorithm. Given the IMDb database and these examples, the algorithm may learn:

which indicates that high grossing movies are often released in May and their genre is comedy. One may assign weights to these definitions to describe their prevalence in the data according their training accuracy (Kimmig et al., 2020; Richardson and Domingos, 2006). As opposed to other machine learning algorithms, relational learning methods do not require the data points to be statistically independent and follow the same identical distribution (IID) (Domingos, 2018). Since a relational database usually contain information about multiple types of entities, the relationships between these entities often violate the IID assumption. Also, the data about each type of entities may follow a distinct distribution. This also holds if one wants to learn over the data gathered from multiple data sources as each data source may have a distinct data distribution. Thus, using other learning methods on these databases results in biased and inaccurate models (Kimmig et al., 2020; Raedt et al., 2017; Domingos, 2018). Since relational learning algorithms leverage the structure of the database directly to learn new relations, they do not need the tedious process of feature engineering. In fact, they are used to discover features for the downstream non-relational models (Lao et al., 2015). Thus, they have been widely used over relational data, e.g., building usable query interfaces (Abouzeid et al., 2013; Li et al., 2015; Kalashnikov et al., 2018), information extraction (Kimmig et al., 2020; Domingos, 2018), and entity resolution (Evans and Grefenstette, 2018).

movies(id, title, year) mov2countries(id, name)
mov2genres(id, name) mov2releasedate(id, month, year)
mov2totalGross(title, gross)
Table 1. Schema fragments for the IMDb and BOM.

Real-world databases often contain inconsistencies (Bertossi et al., 2011; Doan et al., 2012; Fan et al., 2009; Getoor and Machanavajjhala, 2013; Cong et al., 2007; Yakout et al., 2011; Fan and Geerts, 2012), which may prevent the relational learning algorithms from finding an accurate definition. In particular, the information in a domain is sometimes spread across several databases. For example, IMDb does not contain the information about the budget or total grossing of movies. This information is available in another database called Box Office Mojo (BOM) (boxofficemojo.com), for which schema fragments are shown in Table 1 (bottom). To learn an accurate definition for highGrossing, the user has to collect data from the BOM database. However, the same entity or value may be represented in various forms in the original databases, e.g., the titles of the same movie in IMDb and BOM have different formats, e.g., the title of the movie Star Wars: Episode IV is represented in IMDb as Star Wars: Episode IV - 1977 and in BOM as Star Wars - IV. A single database may also contain these type of heterogeneity as a relation may have duplicate tuples for the same entity, e.g., duplicate tuples fo the same movie in BOM. A database may have other types of inconsistencies that violate the integrity of the data. For example, a movie in IMDb may have two different production years (Cong et al., 2007; Yakout et al., 2011; Fan and Geerts, 2012).

Users have to resolve inconsistencies and learn over the repaired database, which is very difficult and time-consuming for large databases (Doan et al., 2012; Getoor and Machanavajjhala, 2013). Repairing inconsistencies usually leads to numerous clean instances as the information about the correct fixes is not often available (Bertossi et al., 2011; Burdick et al., 2016; Fan et al., 2009). An entity may match and be a potential duplicate of multiple distinct entities in the database. For example, title Star Wars may match both titles Star Wars: Episode IV - 1977 and Star Wars: Episode III - 2005. Since we know that the Star Wars: Episode IV - 1977 and Star Wars: Episode III - 2005 refer to two different movies, the title Star Wars must be unified with only one of them. For each choice, the user ends up with a distinct database instance. Since a large database may have many possible matches, the number of clean database instances will be enormous. Similarly, it is not often clear how to resolve data integrity violations. For instance, if a movie has multiple production years, one may not know which year is correct. Due to the sheer number of volumes, it is not possible to generate and materialize all clean instances for a large dirty database (Fan and Geerts, 2012). Cleaning systems usually produce a subset of all clean instances, e.g., the ones that differ minimally with the original data (Fan and Geerts, 2012). This approach still generates many repaired databases (Bertossi et al., 2011; Yakout et al., 2011; Fan and Geerts, 2012). It is also shown that these conditions may not produce the correct instances (Ilyas, 2016). Thus, the cleaning process may result in many instances where it is not clear which one to use for learning. It takes a great deal of time for users to manage these instances and decide which one(s) to use for learning. Most data scientists spend more than 80% of their time on such cleaning tasks (Krishnan et al., 2016).

Some systems aim at producing a single probabilistic database that contain information about a subset of possible clean instances (Rekatsinas et al., 2017). These systems, however, do not address the problem of duplicates and value heterogeneities as they assume that there always is a reliable table, akin to a dictionary, which gives the unique value that should replace each potential duplicate in the database. However, given that different values represent the same entity, it is not clear what should replace the final value in the clean database, e.g., whether Star War represents Star Wars: Episode IV - 1977 or Star Wars: Episode III - 2005. They also allow violations of integrity constraints to generate the final probabilistic database efficiently, which may lead to inconsistent repairs. Moreover, to restrict the set of clean instances, they require attributes to have finite domains that does not generally hold in practice.

We propose a novel learning method that learns directly over dirty databases without materializing its clean versions, thus, it substantially reduces the effort needed to learn over dirty. The properties of clean data are usually expressed using declarative data constraints, e.g., functional dependencies, (Abiteboul et al., 1994; Abedjan et al., 2015; Chu et al., 2016; Fan, 2008; Fan et al., 2009; Bahmani et al., 2012; Fan and Geerts, 2012; Burdick et al., 2016; Rekatsinas et al., 2017; Galhardas et al., 2001). Our system uses the declarative constraints during learning. These constraints may be provided by users or discovered from the data using profiling techniques (Abedjan et al., 2015; Koumarelas et al., 2020). Our contributions are as follows:

  • We introduce and formalize the problem of learning over an inconsistent database (Section 3).

  • We propose a novel relational learning algorithm called DLearn to learn over inconsistent data (Section 4).

  • Every learning algorithm chooses the final result based on its coverage of the training data. We propose an efficient method to compute the coverage of a definition directly over the heterogeneous database (Section 4.2).

  • We provide an efficient implementation of DLearn over a relational database system (Section 5).

  • We perform an extensive empirical study over real-world datasets and show that DLearn scales to and learns efficiently and effectively over large data.

2. Background

2.1. Relational Learning

In this section, we review the basic concepts of relational learning over databases without any heterogeneity (De Raedt, 2010; Getoor and Taskar, 2007). We fix two mutually exclusive sets of relation and attribute symbols. A database schema is a finite set of relation symbols , . Each relation is associated with a set of attribute symbols denoted as . We denote the domain of values for attribute as . Each database instance of schema maps a finite set of tuples to every relation in . Each tuple is a function that maps each attribute symbol in to a value from its domain. We denote the value of the set of attributes of tuple in the database by or if is clear from the context. Also, when it is clear from the context, we refer to an instance of a relation simply as . An atom is a formula in the form of , where is a relation symbol and are terms. Each term is either a variable or a constant, i.e., value. A ground atom is an atom that only contains constants. A literal is an atom, or the negation of an atom. A Horn clause (clause for short) is a finite set of literals that contains exactly one positive literal. A ground clause is a clause that only contains ground atoms. Horn clauses are also called Datalog rules (without negation) or conjunctive queries. A Horn definition is a set of Horn clauses with the same positive literal, i.e., non-recursive Datalog program or union of conjunctive queries. Each literal in the body is head-connected if it has a variable shared with the head literal or another head-connected literal.

Relational learning algorithms learn first-order logic definitions from an input relational database and training examples. Training examples are usually tuples of a single target relation, and express positive () or negative () examples. The input relational database is also called background knowledge. The hypothesis space is the set of all possible first-order logic definitions that the algorithm can explore. It is usually restricted to Horn definitions to keep learning efficient. Each member of the hypothesis space is a hypothesis. Clause covers an example if , where is the entailment operator, i.e., if and are true, then is true. Definition covers an example if at least one its clauses covers . The goal of a learning algorithm is to find the definition in the hypothesis space that covers all positive and the fewest negative examples as possible.

Example 2.1 ().

IMDb contains the tuples movie (10,‘Star Wars: Episode IV - 1977’, 1977), mov2genres(10, ‘comedy’), and
mov2releasedate(10, ‘May’, 1977). Therefore, the definition that indicates that high grossing movies are often released in May and their genre is comedy shown in Section 1 covers the positive example highGrossing(‘Star Wars: Episode IV - 1977’).

Most relational learning algorithms follow a covering approach illustrated in Algorithm 1 (Mihalkova and Mooney, 2007; Muggleton et al., 2009; Picado et al., 2017; Quinlan, 1990; Zeng et al., 2014). The algorithm constructs one clause at a time using the LearnClause function. If the clause satisfies a criterion, e.g., covers at least a certain fraction of the positive examples and does not cover more than a certain fraction of negative ones, the algorithm adds the clause to the learned definition and discards the positive examples covered by the clause. It stops when all positive examples are covered by the learned definition.

Input : Database instance , examples
Output : Horn definition
1 while  is not empty  do
2       LearnClause if  satisfies minimum criterion then
Algorithm 1 Covering approach algorithm.

2.2. Matching Dependencies

Learning over databases with heterogeneity in representing values may deliver inaccurate answers as the same entities and values may be represented under different names. Thus, one must resolve these representational differences to produce a high-quality database to learn an effective definition. The database community has proposed declarative matching and resolution rules to express the domain knowledge about matching and resolution (Arasu et al., 2008; Bahmani et al., 2012; Benjelloun et al., 2008; Burdick et al., 2016; Fan et al., 2009; Galhardas et al., 2001; Hernández et al., 2013; Hernández and Stolfo, 1995; Weis et al., 2008). Matching dependencies (MD) are a popular type of such declarative rules, which provide a powerful method of expressing domain knowledge on matching values (Fan et al., 2009; Bertossi et al., 2011; Bahmani et al., 2015; Fan and Geerts, 2012; Koumarelas et al., 2020). Let be the schema of the original database and and two distinct relations in . Attributes and from relations and , respectively, are comparable if they share the dame domain. MD is a sentence of the form , where and are comparable to and , respectively, , and . Operation is a similarity operator defined over domain and ,, indicates that the values of and refer to the same value, i.e., are interchangeable. Intuitively, the aforementioned MD says that if the values of and are sufficiently similar, the values of and are different representations of the same value. For example, consider again the database that contains relations from IMDb and BOM whose schema fragments are shown in Table 1. According to our discussion in Section 1, one can define the following MD . The exact implementation of the similarity operator depends on the underlying domains of attributes. Our results are orthogonal to the implementation details of the similarity operator. In the rest of the paper, we use operation only between comparable attributes. For brevity, we eliminate the domain from when it is clear from the context or the results hold for any domain . We also denote in an MD as . An MD is equivalent to a set of MDs , , , . Thus, for the rest of the paper, we assume that each MD is in the form of , where and are comparable attributes of and , respectively. Given a database with MDs, one must enforce the MDs to generate a high-quality database. Let tuples and belong to and in database of schema , respectively, such that , , denoted as for brevity. To enforce the MD on , one must make the values of and identical as they actually refer to the same value (Fan et al., 2009; Bertossi et al., 2011). For example, if attributes and contain titles of movies, one unifies both values Star Wars - 1977 and Star Wars - IV to Star Wars Episode IV - 1977 as it deems this value as the one to which and refer. The following definition formalizes the concept of applying an MD to the tuples and on .

Definition 2.2 ().

Database is the immediate result of enforcing MD on and in , denoted by if

  1. , but ;

  2. ; and

  3. and agree on every other tuple and attribute value.

One may define a unification function over some domains to map the values that refer to the same value to the correct value in the cleaned instance. It is, however, usually difficult to define such a function due to the lack of knowledge about the correct value. For example, let and in Definition 2.2 contain information about names of people and and have values J. Smth and Jn Sm, respectively, which according to an MD refer to the same actual name, which is Jon Smith. It is not clear how to compute Jon Smith using the values of and . We know that the values of and will be identical after enforcing , but we do not usually know their exact values. Because we aim at developing learning algorithms that are efficient and effective over databases from various domains, we do not fix any matching method in this paper. We assume that matching every pair of values and in the database creates a fresh value denoted as .

Given the database with the set of MDs , is stable if for all and all tuples . In a stable database instance, all values that represent the same data item according to the database MDs are assigned equal values. Thus, it does not have any heterogeneities. Given a database with set of MDs , one can produce a stable instance for by starting from and iteratively applying each MD in according to Definition 2.2 finitely many times (Fan et al., 2009; Bertossi et al., 2011). Let denote the sequence of databases produced by applying MDs according to Definition 2.2 starting from such that is stable. We say that satisfy and denote it as . A database may have many stable instances depending on the order of MD applications (Bertossi et al., 2011; Fan et al., 2009).

Example 2.3 ().

Let (10,‘Star Wars: Episode IV - 1977’, 1977) and (40,‘Star Wars: Episode III - 2005’, 2005) be tuples in relation movies and (‘Star Wars’) be a tuple in relation highBudgetMovies whose schemas are shown in Table 1. Consider MD highBudgetMovies[title]. Let ‘Star Wars: Episode IV - 1977’ ‘Star Wars’ and ‘Star Wars: Episode III - 2005’ ‘Star Wars’ be true. Since the movies with titles ‘Star Wars: Episode IV - 1977’ and ‘Star Wars: Episode III - 2005’ are different movies with distinct titles, one can unify the title in the tuple (‘Star Wars’) in highBudgetMovies with only one of them in each stable instance. Each alternative leads to a distinct instance.

MDs may not have perfect precision. If two values are declared similar according to an MD, it does not mean that they represent the same real-world entities. But, it is more likely for them to represent the same value than the ones that do not match an MD. Since it may be cumbersome to develop complex MDs that are sufficiently accurate, researchers have proposed systems that automatically discover MDs from the database content (Koumarelas et al., 2020).

2.3. Conditional Functional Dependencies

Users usually define integrity constraints (IC) to ensure the quality of the data. Conditional functional dependencies (CDF) have been useful in defining quality rules for cleaning data (Fan, 2008; Yakout et al., 2011; Golab et al., 2008; Cong et al., 2007; Yakout et al., 2011; Fan and Geerts, 2012). They extend functional dependencies, which are arguably the most widely used ICs (GarciaMolina et al., 2008). Relation with sets of attributes and satisfies FD if every pairs of tuples in that agree on the values of will also agree on the values of . A CFD over is a form where is an FD over and is a tuple pattern over . For each attribute , is either a constant in domain of or an unnamed variable denoted as ‘-’ that takes values from the domain of . The attributes in and are separated by in . For example, consider relation mov2locale(title, language, country) in BOM. The CFD : (title, language country, (-, English -) ) indicates that title uniquely identifies country for tuples whose language is English. Let be a predicate over data values and unnamed variable ‘-’, where if either or is a value and is ‘-’. The predicate naturally extends to tuples, e.g., (‘Bait’, English, USA) (‘Bait’, -, USA). Tuple matches if . Relation satisfies the CFD iff for each pair of tuples in the instance if , then . In other words, if and are equal and match pattern , and are equal and match . A relation satisfies a set of CFDs , if it satisfies every CFD in . For each set of CFDs , we can find an equivalent set of CFDs whose members have a single attribute on their right-hand side (Cong et al., 2007; Yakout et al., 2011; Fan and Geerts, 2012). For the rest of the paper, we assume that each CFD has a single attribute on its right-hand side.

CFDs may be violated in real-world and heterogeneous datasets (Yakout et al., 2011; Golab et al., 2008). For example, the pair of tuples (‘Bait’, English, USA) and (‘Bait’, English, Ireland) in movie2locale violate . One can use attribute value modifications to repair violations of a CFD in a relation and generate a repaired relation that satisfy the CFD (Bohannon et al., 2005; Franconi1 et al., 2001; Wijsen, 2003; Cong et al., 2007; Yakout et al., 2011; Kolahi and Lakshmanan, 2009; Fan and Geerts, 2012). For instance, one may repair the violation of in and by updating the value of title or language in one of the tuples to value other than Bait or English, respectively. One may also repair this violation by replacing the countries in these tuples with the same value. Inserting new tuples do not repair CFD violations and one may simulate tuple deletion using value modifications. Moreover, removing tuples leads to unnecessary loss of information for attributes that do not participate in the CFD. Modifying attribute values is also sufficient to resolve CFD violations (Cong et al., 2007; Yakout et al., 2011). Thus, given a pair of tuples and in that violate CFD , to resolve the violation, one must either modify (resp. ) such that and , update (resp. ) such that (resp. ) or . Let be a relation that violates CFD . Each updated instance of that is generated by applying the aforementioned repair operations and does not contain any violation of is a repair of . As there are multiple fixes for each violation, there may be many repairs for each relation.

As opposed to FDs, a set of CFDs may be inconsistent, i.e., there is not any non-empty database that satisfies them (Bohannon et al., 2007; Cong et al., 2007; Yakout et al., 2011; Fan and Geerts, 2012). For example, the CFDs and over relation cannot be both satisfied by any non-empty instance of . The set of CFDs used in cleaning is consistent (Bohannon et al., 2007; Cong et al., 2007; Yakout et al., 2011; Fan and Geerts, 2012). We refer the reader to (Bohannon et al., 2007) for algorithms to detect inconsistent CFDs.

3. Semantic of Learning

3.1. Different Approaches

Let be an instance of schema with MDs that violate some CFDs . A repair of is a stable instance of that satisfy . The values in are repaired to satisfy using the method explained in Section 2.3. Given and a set of training examples , we wish to learn a definition for a target relation in terms of the relations in . Obviously, one may not learn an accurate definition by applying current learning algorithms over as the algorithm may consider different occurrences of the same value to be distinct or learn patterns that are induced based on tuples that violate CFDs. One can learn definitions by generating all possible repairs of , learning a definition over each repair separately, and computing a union (disjunction) of all learned definitions. Since the discrepancies are resolved in repaired instances, this approach may learn accurate definitions.

However, this method is neither desirable nor feasible for large databases. As a large database may have numerous repairs, it takes a great deal of time and storage to compute and materialize all of them. Moreover, we have to run the learning algorithm once for each repair, which may take an extremely long time. More importantly, as the learning has been done separately over each repair, it is not clear whether the final definition is sufficiently effective considering the information of all stable instances. For example, let database have two repairs and over which the aforementioned approach learns definitions and , respectively. and must cover a relatively small number of negative examples over and , respectively. However, and may cover a lot of negative examples over and , respectively. Thus, the disjunction of and will not be effective considering the information in both and . Hence, it is not clear whether the disjunction of and is the definition that covers all positive and the least negative examples over and . Also, it is not clear how to encode usably the final result as we may end up with numerous definitions.

Another approach is to consider only the information shared among all repairs for learning. The resulting definition will cover all positive and the least negative examples considering the information common among all repaired instances. This idea has been used in the context of query answering over inconsistent data, i.e., consistent query answering (Arenas et al., 1999; Bertossi et al., 2011). However, this approach may lead to ignoring many positive and negative examples as their connections to other relations in the database may not be present in all stable instances. For example, consider the tuples in relations movies and highBudgetMovies in Example 2.3. The training example (‘Star Wars’) has different values in different stable instances of the database, therefore, it will be ignored. It will also be connected to two distinct movies with vastly different properties in each instance. Similarly, repairing the instance to satisfy the violated CFDs may further reduce the amount of training examples shared among all repairs. The training examples are usually costly to obtain and the lack of enough training examples may results in inaccurate learned definitions. Because in a sufficiently heterogeneous database, most positive and negative examples may not be common among all repairs, the learning algorithm may learn an inaccurate or simply an empty definition.

Thus, we hit a middle-ground. We follow the approach of learning directly over the original database. But, we also give the language of definitions and semantic of learning enough flexibility to take advantage of as much (training) information as possible. Each definition will be a compact representation of a set of definitions, each of which is sufficiently accurate over some repairs. If one increases the expressivity of the language, learning and checking coverage for each clause may become inefficient (Eiter et al., 1997). We ensure that the added capability to the language of definitions is minimal so learning remains efficient.

3.2. Heterogeneity in Definitions

We represent the heterogeneity of the underlying data in the language of the learned definitions. Each new definition encapsulates the definitions learned over the repairs of the underlying database. Thus, we add the similarity operation, , to the language of Horn definitions. We also add a set of new (built-in) relation symbols with arity two called repair relations to the set of relation symbols used by the Datalog definitions over schema . A literal with a repair relation symbol is a repair literal. Each repair literal in a definition represents replacing the variable (or constant) in (other) existing literals in with variable if condition holds. Condition is a conjunction of , , and relations over the variables and constants in the clause. Each repair literal reflects a repair operation explained in Sections 2.2 and 2.3 for an MD or violated CFD over the underlying database. The condition is computed according to the corresponding MD or CFD. Finally, we add a set of literals with , , and relations called restriction literals to establish the relationship between the replacement variables, e.g, , according to the corresponding MDs and CFDs. Consider again the database created by integrating IMDb and BOM datasets, whose schema fragments are in Table 1, with MD . We may learn the following definition for the target relation highGrossing.

The repair literals and represent the repairs applied to and to unify their values to a new one according to . We add equality literal to restrict the replacements according to the corresponding MD.

We also use repair literals to fix a violation of a CFD in a clause. These repair literals reflect the operations explained in Section 2.3 to fix the violation of a CFD in a relation. The resulting clause represents possible repairs for a violation of a CFD in the clause. A variable may appear in multiple literals in the body of a clause and some repairs may modify only some of the occurrences of the variable, e.g., the example on BOM database in Section 2.3. Thus, before adding repair literals for both MDs and CFDs, we replace each occurrence of a variable with a fresh one and add equality literals, i.e., induced equality literals, to maintain the connection between their replacements. Similarly, we replace each occurrence of the constant with a fresh variable and use equality literals to set the value the variable equal to the constant in the clause.

Example 3.1 ().

Consider the following clause, that may be a part of a learned clause over the integrated IMDb and BOM database for highGrossing.

This clause reflects a violation of CFD from Section 2.3 in the underlying database as it indicates that English movies with the same title are produced in different countries. We first replace each occurrence of repeated variable with a new variable and then add the repair literals. Due to the limited space, we do not show the repair literals and their conditions for modifying the values of constant ’English’. Let condition be .

We call a clause (definition) repaired if it does not have any repair literal. Each clause with repair literals represents a set of repaired clauses. We convert a clause with repair literals to a set of repaired clauses by iteratively applying repair literals to and eliminating them from the clause. To apply a repair literal to a clause, we first evaluate considering the (restriction) literals in the clause. If holds, we replace all occurrences of with in all literals and the conditions of the other repair literals in the clause and remove . Otherwise, we only eliminate from the clause. We progressively apply all repair literals until no repair literal is left. Finally, we remove all restriction and induced equality literals that contain at least one variable that does not appear in any literal with a schema relation symbol. The resulting set is called the repaired clauses of the input clause.

Example 3.2 ().

Consider the following clause over the movie database of IMDb and BOM.

The application of repair literals and results in the following clause.

Similar to the repair of a database based on MDs and CFDs, the application of a set of repair literals to a clause may create multiple repaired clauses depending on the order by which the repair literals are applied.

Example 3.3 ().

Consider a target relation , an input database with schema , , and MDs and . The definition over this schema has two repaired definitions: and As another example, the application of each repair literal in the clause of Example 3.1 results in a distinct repaired clause. For instance, applying replaces with in all literals and conditions of the repair literals and results in the following.

As Example 3.3 illustrates, repair literals provide a compact representation of multiple learned clauses where each may explain the patterns in the training data in some repair of the input database. Given an input definition , the repaired definitions of are a set of definitions where each one contains exactly one repaired clause per each clause in .

3.3. Coverage Over Heterogeneous Data

A learning algorithm evaluates the score of a definition according to the number of its covered positive and negative examples. One way to measure the score of a definition is to compute the difference of the number of positive and negative examples covered by the definition (De Raedt, 2010; Quinlan, 1990; Zeng et al., 2014). Each definition may have multiple repaired definitions each of which may cover a different number of positive and negative examples on the repairs of the underlying database. Thus, it is not clear how to compute the score of a definition.

One approach is to consider that a definition covers a positive example if at least one of its repaired definitions covers it in some repaired instances. Given all other conditions are the same, this approach may lead to learning a definition with numerous repaired definitions where each may not cover sufficiently many positive examples. Hence, it is not clear whether each repaired definition is accurate. A more restrictive approach is to consider that a definition covers a positive example if all its repaired definitions cover it. This method will deliver a definition whose repaired definitions have high positive coverage over repaired instances. There are similar alternatives for defining coverage of negative examples. One may consider that a definition covers a negative example if all of its repaired definitions cover it. Thus, if at least one repaired definition does not cover the negative example, the definition will not cover it. This approach may lead to learning numerous repaired definitions, which cover many negative examples. On the other hand, a restrictive approach may define a negative example covered by a definition if at least one of its repaired definitions covers it. In this case, generally speaking, each learned repaired definition will not cover too many negative examples. We follow a more restrictive approach.

Definition 3.4 ().

A definition covers a positive example w.r.t. to database iff every repaired definition of covers in some repairs of .

Example 3.5 ().

Consider again the schema, MDs, and definition in Examples 3.3 and the database of this schema with training example and tuples . Assume that and are true. The database has two stable instances , and , . Definition covers the single training example in the original database according to Definition 3.4 as its repaired definitions and cover the training example in repaired instances and , respectively.

Definition 3.4 provides a more flexible semantic than considering only the common information between all repaired instances as described in Section 3.1. The latter semantic considers that the definition covers a positive example if it covers the example in all repaired instances of a database. As explained in Section 3.1, this approach may lead to ignoring many if not all examples.

Definition 3.6 ().

A definition covers a negative example with regard to database if at least one of the repaired definitions of covers in some repairs of .

4. DLearn

In this section, we propose a learning algorithm called DLearn for learning over heterogeneous data efficiently. It follows the approach used in the bottom-up relational learning algorithms (Muggleton and Feng, 1990; Muggleton et al., 2009; Mihalkova and Mooney, 2007; Picado et al., 2017). In this approach, the LearnClause function in Algorithm 1 has two steps. It first builds the most specific clause in the hypothesis space that covers a given positive example, called a bottom-clause. Then, it generalizes the bottom-clause to cover as most positive and as fewest negative examples as possible. DLearn extends these algorithms by integrating the input MDs and CFDs into the learning process to learn over heterogeneous data.

4.1. Bottom-clause Construction

A bottom-clause associated with an example is the most specific clause in the hypothesis space that covers relative to the underlying database . Let be the input database of schema and the set of MDs and CFDs . The bottom-clause construction algorithm consists of two phases. First, it finds all the information in relevant to . The information relevant to example is the set of tuples that are connected to . A tuple is connected to if we can reach using a sequence of exact or approximate (similarity) matching operations, starting from . Given the information relevant to , DLearn creates the bottom-clause .

movies(m1,Superbad (2007),2007) mov2genres(m1,comedy)
movies(m2,Zoolander (2001),2001) mov2genres(m2,comedy)
movies(m3,Orphanage (2007),2007) mov2genres(m3,drama)
mov2countries(m1,c1) countries(c1,USA)
mov2countries(m2,c1) countries(c2,Spain)
mov2countries(m3,c2) englishMovies(m1)
mov2releasedate(m1,August,2007) englishMovies(m2)
mov2releasedate(m2,September,2001) senglishMovies(m3)
Table 2. Example movie database.
Example 4.1 ().

Given example highGrossing(Superbad), database in Table 2, and MD , DLearn finds the relevant tuples movies(m1, Superbad (2007), 2007), mov2genres(m1, comedy), mov2countries(m1, c1), englishMovies(m1),
mov2releasedate(m1, August, 2007), and countries(c1, USA). As the movie title in the training example, e.g., Superbad, does not exactly match with the movie title in the movies relation, e.g., Superbad (2007), the tuple movies(m1, Superbad (2007), 2007) is obtained through an approximate match and similarity search according to . We get others via exact matches.

To find the information relevant to , DLearn uses Algorithm 2. It maintains a set that contains all seen constants. Let be a training example. First, DLearn adds to . These constants are values that appear in tuples in . Then, DLearn searches all tuples in that contain at least one constant in and adds them to . For exact search, DLearn uses simple SQL selection queries over the underlying relational database. For similarity search, DLearn uses MDs in . If contains constants in some relation and given an MD , DLearn performs a similarity search over , to find relevant tuples in , denoted by . We store these pairs of tuples that satisfy the similarity match in in a table in main memory. We will discuss the details of the implementation of DLearn over relational database systems in Section 5. For each new tuple in , the algorithm extracts new constants and adds them to . It repeats this process for a fixed number of iterations .

Input : example , # of iterations
Output : bottom-clause
1 // stores known constants add constants in to for  to  do
2       foreach relation  do
3             foreach attribute in  do
4                   // select tuples with constants in M if  MD  then
5                         ,
6                  foreach tuple  do
7                         add to and constants in to
create clause from and return
Algorithm 2 DLearn bottom-clause construction algorithm.

To create the bottom-clause from , DLearn first maps each constant in to a new variable. It creates the head of the clause by creating a literal for and replacing the constants in with their assigned variables. Then, for each tuple , DLearn creates a literal and adds it to the body of the clause, replacing each constant in with its assigned variable. If there is a variable that appears in more than a single literal, we add the equality literals according to the method explained in Section 3.2. If satisfies a similarity match according to the table of similarity matches with tuple , we add a similarity literal per each value match in and to the clause. Let be the corresponding MD of this similarity match. We will also add repair literals and and restriction equality literal to the clause according to .

Example 4.2 ().

Given the relevant tuples found in Example 4.1, DLearn creates the following bottom-clause:

Then, we scan to find violations of each CFD in and add their corresponding repair literals. Since each CFD is defined over a single table, we first group literals in based on their relation symbols. For each group with the relation symbol and CFD on , our algorithm scans the literals in the group, finds every pair of literals that violate , and adds the repair and restriction literals to the group. We add the repair and restriction literals corresponding to the repair operations explained in Section 2.3 to the group and consequently as illustrated in Example 3.1. The added repair literals will not induce any new violation of in the clause (Cong et al., 2007; Yakout et al., 2011; Fan and Geerts, 2012). However, repairing a violation of may induce violations for anther CFD over (Fan and Geerts, 2012). For example, consider CFD and on relation . Given literals and that violate , our method adds repair literals that replaces in with a fresh variable. This repair literal produces a repaired clause that violates . Thus, the algorithm repeatedly scans the clause and adds repair and restriction literals to it for all CFDs until there is a repair for every violation of CFDs both the ones in the original clause and the ones induced by the repair literals added in the preceding iterations. The repaired literals for the violations induced by other repair literals will use the replacement variables from the violating repair literals as their arguments and conditions.

It may take a long time to generate the clause that contains all repair literals for all original and induced violations of every CFD in a large input bottom-clause. Hence, we reduce the number of repair literals per CFD violation by adding only the repair literals for the variables of the right-hand side attribute of the CFD that use current variables in the violation. For instance, in Example 3.1, the algorithm does not introduce literals , , and and only uses literals and to repair the clause in Example 3.1. The repair literals for the variables corresponding to the left-hand side of the CFD will be used as explained before. This approach follows the popular minimal repair semantic for repairing CFDs (Bohannon et al., 2005; Franconi1 et al., 2001; Wijsen, 2003; Cong et al., 2007; Yakout et al., 2011; Kolahi and Lakshmanan, 2009; Fan and Geerts, 2012) as it repairs the violation by modifying fewer variable than the repair literals that introduce fresh variables to the both literals of the violation, e.g., one versus two modifications induced by , in the repair of the clause in Example 3.1. Since each CFD is defined over a single relation, the aforementioned steps are applied separately to literals of each relation, which are usually a considerably smaller set than the set of all literals in the bottom-clause. Moreover, the bottom-clause is significantly smaller than the size of the whole database. Thus, the bottom-clause construction algorithm takes significantly less time than producing the repairs of the underlying database.

Current bottom-clause constructions methods do not induce inequality literal between distinct constants in the database and their corresponding variables and represent their relationship by replacing them with distinct variables. If the inequality literal is used, the eventual generalization of the bottom-clause may be too strict and lead to a learned clause that does not cover sufficiently many positive examples (Muggleton et al., 2009; Muggleton, 1995; De Raedt, 2010; Picado et al., 2017). For example, let be a bottom-clause. This clause will not cover positive examples such as for which we have . However, the bottom-clause has more generalization power and may cover both positive examples such as and such that . As the goal of our algorithm is to simulate relational learning over repaired instances of the original database, we follow the same approach and remove the inequality literals between variables. As our repair operations ensure that the arguments of inequality literals are distinct variables, our method exactly emulates bottom-clause construction in relational learning. The inequalities remain in the condition of each repair literal and will return true if the variables are distinct and there is no equality literal between them in the body of the clause and false otherwise. They are not used in learning and are used to apply repair literals on the final clause.

Proposition 4.3 ().

The bottom-clause construction algorithm for positive example and database with MDs and CFD terminates. Also, the bottom-clause created from using the algorithm covers .

4.2. Generalization

After creating the bottom-clause for example , DLearn generalizes to produce a clause that is more general than . Clause is more general than clause if and only if covers at least all positive examples covered by . A more general clause than may cover more positive examples than . DLearn iteratively applies the generalization to find a clause that covers the most positive and fewest negative examples as possible. It extends the algorithm in ProGolem (Muggleton et al., 2009) to produce generalizations of in each step efficiently. This algorithm is based on the concept of -subsumption, which is widely used in relational learning (De Raedt, 2010; Muggleton, 1995; Muggleton et al., 2009). We first review the concept of -subsumption for repaired clauses (De Raedt, 2010; Muggleton et al., 2009), then, we explain how to extend this concept and its generalization methods for non-stable clauses.

Repaired clause -subsumes repaired clause , denoted by , iff there is some substitution such that (De Raedt, 2010; Abiteboul et al., 1994), i.e., the result of applying substitution to literals in creates a set of literals that is a subset of or equal to the set of literals in . For example, clause -subsumes as for substitution , we have . We call each literal in where there is a literal in such that a mapped literal under . For Horn definitions, we have -subsumes iff , i.e., logically entails (Abiteboul et al., 1994; De Raedt, 2010). Thus, -subsumption is sound for generalization. If clauses and contain equality and similarity literals, the subsumption checking requires additional testings, which can be done efficiently (Abiteboul et al., 1994; De Raedt, 2010; Afrati et al., 2004). Roughly speaking, current learning algorithms generalize a clause efficiently by eliminating some of its literals which produces a clause that -subsumes . We define -subsumption for clauses with repair literals using its definition for the repaired ones. Given a clause , a repair literal in is connected to a non-repair literal in iff or appear in or in the arguments of a repair literal connected to .

Definition 4.4 ().

Let denote the set of all repair literals in -subsumes , denoted by , iff

  • there is some substitution such that where repair literals are treated as normal ones and

  • every repair literal connected to a mapped literal in is also a mapped literal under .

Definition 4.4 ensures that each repair literal that modifies a mapped one in has a corresponding repair literal in . Intuitively, this guarantees that there is subsumption mapping between corresponding repaired versions of and . The next step is to examine whether -subsumption provides a sound bases for generalization of clauses with repair literals. We first define logical entailment following the semantics of Definition 3.4.

Definition 4.5 ().

We have if and only if there is an onto relation from the set of repairs of to the one of such that for each repaired clause of , , and each , we have .

According to Definitions 4.5, if one wants to follow the generalization method used in the current learning algorithm to check whether generalizes , one has enumerate and check -subsumption of almost every pair of repaired clauses of and in the worst case. Since both clauses normally contain many literals and -subsumption is NP-hard (Abiteboul et al., 1994), this method is not efficient. The problem is more complex if one wants to generalize a given clause . It may have to generate all repaired clauses of and generalize each of them separately. It is not clear how to unify and represent all produced repaired clauses in a single non-repaired one. It quickly explodes the hypothesis space if we cannot represent them in a single clause as the algorithm may have to keep track and generalize of almost as many clauses as repairs of the underlying database. Also, because the learning algorithm performs numerous generalizations and coverage tests, learning a definition may take an extremely long time. The following theorem establishes that -subsumption is sound for generalization of clauses with repair literals.

Theorem 4.6 ().

Given clauses and , if -subsumes , we have .

To generalize , DLearn randomly picks a subset of positive examples. For each example in , DLearn generalizes to produce a candidate clause , which is more general than and covers . Given clause and positive example , DLearn produces a clause that -subsumes and covers by removing the blocking literals. It first creates a total order between the relation symbols and the symbols of repair literals in the schema of the underlying database, e.g., using a lexicographical order and adding the condition and argument variables to the symbol of the repair literals. Thus, it establishes an order in each clause in the hypothesis space. Let be the bottom-clause. The literal with relation symbol is a blocking literal if and only if is the least value such that for all substitutions where , does not cover  (Muggleton et al., 2009).

Example 4.7 ().

Consider the bottom-clause in Example 4.2 and positive example highGrossing(‘Zoolander’). To generalize to cover , DLearn drops the literal
mov2releasedates because the movie Zoolander was not released in August.

DLearn removes all blocking literals in to produce the generalized clause . DLearn also ensures that all literals in the resulting clause are head-connected. For example, if a non-repair literal is dropped so as the repair literals whose only connection to the head literal is through . Since is generated by dropping literals, it -subsumes . It also covers by construction. DLearn generates one clause per example in . From the set of generalized clauses, DLearn selects the highest scoring candidate clause. The score of a clause is the number of positive minus the number of negative examples covered by the clause. DLearn then repeats this with the selected clause until its score is not improved.

During each generalization step, the algorithm should ensure that the generalization is minimal with respect to -subsumption, i.e., there is not any other clause such that -subsumes and -subsumes (Muggleton et al., 2009). Otherwise, the algorithm may miss some effective clauses and produce a clause that is overly general and may cover too many negative examples. The following proposition states that DLearn produces a minimal generalization in each step.

Proposition 4.8 ().

Let be a head-connected and ordered clause generated from a bottom-clause using DLearn generalization algorithm. Let clause be the generalization of produced in a single generalization step by the algorithm. Given the clause that -subsumes , if -subsumes , then and are equivalent.

4.3. Efficient Coverage Testing

DLearn checks whether a candidate clause covers training examples in order to find blocking literals in a clause. It also computes the score of a clause by computing the number of training examples covered by the clause. Coverage tests dominate the time for learning (De Raedt, 2010). One approach to perform a coverage test is to transform the clause into a SQL query and evaluate it over the input database to determine the training examples covered by the clause. However, since bottom-clauses over large databases normally have many literals, e.g., hundreds of them, the SQL query will involve long joins, making the evaluation extremely slow. Furthermore, it is challenging to evaluate clauses using this approach over heterogeneous data (Bertossi et al., 2011). It is also not clear how to evaluate clauses with repair literals.

We use the concept of -subsumption for clauses with repair literals and the result of Theorem 4.6 to compute coverage efficiently. To evaluate whether covers a positive example over database , we first build a bottom-clause for in called a ground bottom-clause. Then, we check whether using -subsumption. We first check whether . Based on Theorem 4.6, if we find a substitution for such that , and logically entails , thus, covers . However, if we cannot find such a substitution, it is not clear whether logically entails as Theorem 4.6 does not provide the necessity of -subsumption for logical entailment. Fortunately, this is true if we have only repair literals for MDs in and .

Theorem 4.9 ().

Given clauses and such that every repair literal in and corresponds to an MD, if , -subsumes .

We leverage Theorem 4.9 to check whether covers efficiently as follows. Let and be the clauses that have the same head literal as and and contain all body literals in and without any connected repair literal and the ones where all their connected repair literals correspond to some MDs, respectively. Thus, if there is no subsumption between and , our algorithm tries to find a subsumption between and . If there is no subsumption mapping between and , does not cover . Otherwise, let and be the set of body literals of and that do not appear in the body of and , respectively. We apply the repair literals in and in and and perform subsumption checking for pairs of resulting clauses. If every obtained clause of -subsumes at least one resulting clause of , covers . Otherwise, does not cover . We note than the resulting clauses are not repairs of and as they sill have the repair literals that correspond to some MD.

We follow a similar method to the one explained in the preceding paragraph to check whether clause covers a negative example with the difference that we use the semantic introduced in Definition 3.6 to determine the coverage of negative examples. Let be the ground bottom-clause for the negative example . We generate all repaired clauses of the clause as described in Section 3. Then, we check whether each repaired clause of -subsumes the same way as checking -subsumption for and a ground bottom-clause for a positive example. -subsumes as soon as one repaired clause of -subsumes .

Proposition 4.10 ().

Given the clause and ground bottom-clause for negative example relative to database , clause covers iff a repair of -subsumes .

Commutativity of Cleaning & Learning: An interesting question is whether our algorithm produces essentially the same answer as the one that learns a repaired definition over each repair of separately. We show that, roughly speaking, our algorithm delivers the same information as the one that separately learns over each repaired instance. Thus, our algorithm learns using the compact representation without any loss of information. Let denote the set of all repaired clauses of clause . Let BC(, , ) denote the bottom-clause generated by applying the bottom-clause construction algorithm in Section 4.1 using example over database with the set of MDs and CFDs . Also, let BC(, ) be the set of repaired clauses generated by applying the bottom-clause construction to each repair of for .

Theorem 4.11 ().

Given database with MDs , CFDs and set of positive examples , for every positive example