A Hybrid Data Cleaning Framework using Markov Logic Networks

03/14/2019 ∙ by Yunjun Gao, et al. ∙ Zhejiang University, Shanghai Jiao Tong University

With the growth of dirty data, data cleaning has become a crux of data analysis. Most existing algorithms rely on either qualitative techniques (e.g., data rules) or quantitative ones (e.g., statistical methods). In this paper, we present a novel hybrid data cleaning framework built on Markov logic networks (MLNs), termed MLNClean, which is capable of cleaning both schema-level and instance-level errors. MLNClean consists of two cleaning stages: it first cleans multiple data versions separately (each of which corresponds to one data rule), and then derives the final clean data from these versions. Moreover, we propose a series of techniques and concepts, e.g., the MLN index, the reliability score, and the fusion score, to facilitate the cleaning process. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of MLNClean over the state-of-the-art approach in terms of both accuracy and efficiency.

1 Introduction

Data analysis benefits from a wide variety of reliable information. The existence of dirty data not only leads to erroneous decisions or unreliable analysis, but can also deal a blow to the corporate economy [1]. As a consequence, there has been a surge of interest from both industry and academia in data cleaning [2]. The purpose of data cleaning is generally to correct errors, to remove duplicate information, and to provide data consistency. It usually includes two steps, i.e., error detecting and error repairing. The first step finds where the dirty data hide, and the second one corrects the dirty data detected in the first step.

Example 1

Table I depicts a group of sampled tuples from a dirty hospital information dataset $T$. It contains four attributes, including hospital name HN, city CT, state ST, and phone number PN. The dataset needs to comply with three integrity constraints, i.e., one functional dependency (FD), one denial constraint (DC), and one conditional functional dependency (CFD).
($r_1$) FD: CT $\Rightarrow$ ST
($r_2$) DC: $\forall t, t' \in T$, $\neg$(PN($t$) = PN($t'$) $\wedge$ ST($t$) $\neq$ ST($t'$))
($r_3$) CFD: HN(“ELIZA”), CT(“BOAZ”) $\Rightarrow$ PN(“2567688400”)

Specifically, the rule $r_1$ means that a city uniquely determines a state, the rule $r_2$ indicates that two hospitals located in different states have different phone numbers, and the rule $r_3$ means that a hospital named “ELIZA” and located in city “BOAZ” has the specific phone number “2567688400”. Errors appearing in the tuples are highlighted in colored cells, and they can be treated at two different levels, i.e., the schema level and the instance level [3].

The schema-level errors refer to the values that violate integrity constraints. For example, tuples $t_4$, $t_5$, and $t_6$ violate $r_1$ on the attribute ST. The instance-level errors contain replacement errors, typos, and duplicates in this example. In particular, a replacement error signifies that a value is incorrectly recorded as another value, i.e., the value is completely wrong. For instance, $t_3$.CT being “DOTHAN” is a replacement error, and the correct value in this cell should be “BOAZ”. Typos, also called misprints, are caused by the typing process. For example, $t_2$.CT being “DOTH” is a typo, and the correct value in this cell should be “DOTHAN”. In addition, duplicates indicate that there are multiple tuples corresponding to the same real entity, e.g., tuples $t_4$, $t_5$, and $t_6$.

TID     HN       CT      ST  PN
$t_1$   ALABAMA  DOTHAN  AL  3347938701
$t_2$   ALABAMA  DOTH    AL  3347938701
$t_3$   ELIZA    DOTHAN  AL  2567638410
$t_4$   ELIZA    BOAZ    AK  2567688400
$t_5$   ELIZA    BOAZ    AL  2567688400
$t_6$   ELIZA    BOAZ    AL  2567688400
TABLE I: A Sample of a Hospital Information Dataset
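
To make the schema-level notion concrete, the following minimal Python sketch checks the FD "CT determines ST" over the six sample tuples. The tuple identifiers t1 to t6 (assigned in row order) and the helper name `fd_violations` are assumptions made for illustration; this is not part of MLNClean itself.

```python
from collections import defaultdict

# Sample tuples from Table I: (TID, HN, CT, ST, PN).
# The TIDs t1..t6 are assumed to follow the row order of the table.
ROWS = [
    ("t1", "ALABAMA", "DOTHAN", "AL", "3347938701"),
    ("t2", "ALABAMA", "DOTH", "AL", "3347938701"),
    ("t3", "ELIZA", "DOTHAN", "AL", "2567638410"),
    ("t4", "ELIZA", "BOAZ", "AK", "2567688400"),
    ("t5", "ELIZA", "BOAZ", "AL", "2567688400"),
    ("t6", "ELIZA", "BOAZ", "AL", "2567688400"),
]


def fd_violations(rows):
    """Return {city: set of states} for every city mapped to more than one state."""
    states_by_city = defaultdict(set)
    for _tid, _hn, ct, st, _pn in rows:
        states_by_city[ct].add(st)
    return {ct: sts for ct, sts in states_by_city.items() if len(sts) > 1}


violations = fd_violations(ROWS)
print(violations)
```

Only the city “BOAZ” is reported. The typo “DOTH” and the replacement error “DOTHAN” in the third row violate no rule, which is why a purely rule-based check cannot find them.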

Data cleaning methods can be divided into two major categories, namely qualitative techniques and quantitative ones. The qualitative techniques [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17] mainly rely on integrity constraints to express data quality rules. They detect errors that violate integrity constraints, and repair errors following the principle of minimality (i.e., minimizing the impact on the dataset by trying to preserve as many tuples as possible). Take the dataset in Table I as an example. Tuples $t_4$ and $t_5$ violate $r_1$ on the attribute ST. Thus, according to the principle of minimality, such a technique replaces the value “AK” with “AL” on the attribute ST of $t_4$, whereas it fails to repair the attributes CT and PN of $t_3$. In addition, the attribute CT of $t_2$ cannot be repaired since it does not violate any rule. In contrast, the second category, containing quantitative techniques [18, 19, 20], employs statistical approaches to detect possible errors, and finds probable repair candidates for errors based on probability theory. Qualitative techniques guarantee that the cleaning results are in accordance with data quality rules, and quantitative techniques ensure that the cleaning results conform to statistical characteristics. Recently, new attempts have been made by [21, 22], which combine qualitative and quantitative techniques. Nevertheless, they either consider only one kind of integrity constraint (e.g., FDs), or keep the error detecting and repairing steps in isolation and focus on the data repairing process, incurring redundant computation.

In this paper, we propose a novel hybrid data cleaning framework, termed MLNClean. It aims to address two key challenges: (i) how to combine the advantages of both quantitative and qualitative techniques to deal with multiple error types; and (ii) how to boost cleaning efficiency as much as possible. Regarding the first challenge, we seamlessly integrate data quality rules and Markov logic networks (MLNs) into MLNClean, such that it combines the advantages of both qualitative and quantitative techniques, and it is able to cope with schema-level errors (that violate integrity constraints) and instance-level errors (including replacement errors, typos, and duplicates). Moreover, we present a critical cleaning criterion based on a new concept of reliability score, which is defined by considering both the principle of minimality (using a distance metric) and the statistical characteristics (adopting the weight learning of MLNs).

Regarding the second challenge, we enable MLNClean to seamlessly handle both the error detecting and error repairing stages. This helps to avoid redundant computation, and therefore reduces the computation cost of the entire cleaning process. Furthermore, we develop an effective MLN index to shrink the search space. Specifically, the MLN index is built as a two-layer hash table, where each block in the first layer includes a set of groups in the second layer. Each block corresponds to a data rule that involves a set of data attributes. In particular, dirty values within each block are cleaned independently, without access to the information outside the block. In addition, it is noteworthy that, instead of deciding whether a single value is clean or not at a time as traditional methods do, MLNClean decides whether one piece of data (i.e., several attribute values involved in one rule) is clean or not at a time. Hence, the efficiency of MLNClean is further improved. In a nutshell, MLNClean makes the following contributions.

  • Our proposed data cleaning framework MLNClean combines the advantages of both qualitative and quantitative techniques via integrating data quality rules and Markov logic networks (MLNs). MLNClean consists of two major cleaning stages, i.e., cleaning rule-based multiple data versions and deriving the unified clean data, which seamlessly performs error detecting and repairing.

  • In the first stage, for a data version w.r.t. each block (built in the MLN index), MLNClean first processes abnormal groups, and then, it cleans errors within one group using a novel concept of reliability score.

  • In the second stage, MLNClean unifies the final clean data set based on multiple data versions in the previous stage, where a newly defined fusion score is employed to eliminate conflicts among data versions.

  • Extensive experiments on both real and synthetic datasets confirm that MLNClean outperforms the state-of-the-art approach in terms of both accuracy and efficiency.

The rest of this paper is organized as follows. We review related work in Section 2. Then, Section 3 introduces data cleaning semantics as well as some concepts related to Markov logic network. In Section 4, we overview the cleaning framework MLNClean. Section 5 elaborates the two-stage data cleaning process. Section 6 details the distributed version of MLNClean on Spark. In Section 7, we report the experimental results and our findings, and then, we conclude our work with future work in Section 8.

2 Related Work

Existing data cleaning methods can be partitioned into two categories: (i) qualitative techniques and (ii) quantitative ones. The qualitative techniques mainly utilize integrity constraints to clean errors, including ones using FDs [6, 7, 9, 16], or CFDs [8, 11, 13], or DCs [5, 10, 17]. In addition to the above methods that repair data violating only one specific constraint, Temporal [4], LLUNATIC [14], NADEEF [12], BigDansing [15], and CleanM [23] support data cleaning against violations of at least two kinds of those constraints. In particular, Temporal is extended with temporal dimension, to capture the duration information for data cleaning. The generic data cleaning platform NADEEF supports the customization for application-specific data quality problems. It provides a programming interface that allows users to specify multiple types of integrity constraints. BigDansing translates the insights of NADEEF into the Map-Reduce framework. In contrast, the recent work CleanM integrates the physical and logical optimizations used in BigDansing to demonstrate its superiority. It is worthwhile to mention that, CleanM focuses on error detecting regardless of error repairing, but MLNClean considers both error detecting and repairing.

The quantitative techniques use the data itself to construct appropriate models and predict repair solutions on the basis of data distributions. ERACER [19], SCARE [20], and ActiveClean [18] are among this group. ERACER is an iterative statistical framework based on belief propagation and relationship-dependent networks. SCARE cleans data by combining machine learning and probability models. It repairs errors based on maximum likelihood estimation. ActiveClean is a stepwise cleaning method in which models are updated incrementally rather than retrained, and thus, the cleaning accuracy could gradually increase. This group of methods is more suitable for the cases where the fraction of dirty data is far less than that of clean data. The less the dirty data, the more reliable the learned parameters. Theoretically, sophisticated statistical models can benefit from large-scale datasets.

One hybrid method [21] that combines both qualitative and quantitative techniques was then proposed. However, it only combines FDs and statistical methods without considering other types of integrity constraints. Thereafter, the state-of-the-art method HoloClean [22] unifies several data repair signals, including integrity constraints and external data, to construct a knowledge-base probabilistic graphical model by using DeepDive [24]. HoloClean aims at error repairing, and employs existing approaches to detect errors. By contrast, our proposed framework MLNClean tackles both error detecting and repairing. As empirically confirmed, MLNClean is superior to HoloClean in terms of both efficiency and accuracy. Also, MLNClean can clean various instance-level errors (e.g., replacement errors and typos), while HoloClean fails to solve them in some cases.

3 Preliminaries

In this section, we describe data cleaning semantics and some concepts related to Markov logic networks. Table II summarizes the symbols used frequently throughout this paper.

Notation      Description
$T$           a dataset with dirty values
$t_i$         a tuple belonging to the dataset $T$
$t_i.[A_j]$   the value of $t_i$ on attribute $A_j$
$r_i$         an integrity constraint/rule
$w_i$         the weight of a rule $r_i$
$\gamma$      a piece of data that contains attribute values of a tuple w.r.t. a rule
$B_i$         a block (corresponding to a rule $r_i$) in the MLN index over the dataset $T$
$G_{ij}$      a group containing a set of $\gamma$s that share the same values on the reason part of the rule w.r.t. the block $B_i$
TABLE II: Symbols and Description

A dataset $T$ with $d$ dimensions consists of a set of tuples $\{t_1, t_2, \dots, t_n\}$, and each tuple $t_i$ has the value $t_i.[A_j]$ on the attribute $A_j$. $dom(A_j)$ denotes the domain of attribute $A_j$. There are usually some integrity constraints that should hold on the dataset $T$, such as functional dependencies (FDs), denial constraints (DCs), as well as conditional functional dependencies (CFDs). In addition, each integrity constraint can be considered as two parts, i.e., the reason part and the result part, where the reason part determines the result part. In other words, the same reason cannot determine multiple different results. As an example, for the rule $r_1$, CT is the reason part, while ST is the result part. CT uniquely determines ST.

According to Markov logic theory [25], every integrity constraint can be converted into a unified form. For ease of presentation, we call a rule in the unified form an MLN rule. The MLN rule is expressed as $l_1 \vee l_2 \vee \dots \vee l_n$, where $l_i$ is a literal for $i = 1, \dots, n$. A literal is any expression that contains a predicate symbol applied to a variable or a constant, e.g., CT($v$), HN(“ELIZA”). The rules shown in Example 1 can be transformed into the following MLN rules.

($r_1$) FD: $\neg$CT $\vee$ ST
($r_2$) DC: $\neg$(PN($t$) = PN($t'$)) $\vee$ $\neg$(ST($t$) $\neq$ ST($t'$))
($r_3$) CFD: $\neg$HN(“ELIZA”) $\vee$ $\neg$CT(“BOAZ”) $\vee$ PN(“2567688400”)
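
Assuming the unified form is a disjunction of literals (clausal form), the conversion of the three constraint types rests on standard propositional equivalences, written here with generic literals $a$, $b$, and $c$ that are introduced only for illustration:

```latex
\begin{align*}
a \Rightarrow b &\;\equiv\; \neg a \lor b
  && \text{(FD: one antecedent, one consequent)} \\
\neg (a \land b) &\;\equiv\; \neg a \lor \neg b
  && \text{(DC: De Morgan's law)} \\
(a \land b) \Rightarrow c &\;\equiv\; \neg a \lor \neg b \lor c
  && \text{(CFD: conjunctive antecedent)}
\end{align*}
```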

IC rule              MLN rule              Ground MLN rules
CT $\Rightarrow$ ST  $\neg$CT $\vee$ ST    $\neg$CT(“DOTHAN”) $\vee$ ST(“AL”)
                                           $\neg$CT(“DOTH”) $\vee$ ST(“AL”)
                                           $\neg$CT(“BOAZ”) $\vee$ ST(“AL”)
                                           $\neg$CT(“BOAZ”) $\vee$ ST(“AK”)
TABLE III: Example of Ground MLN Rules w.r.t. $r_1$

From the traditional viewpoint of data cleaning work, if a value violates one rule, it has zero probability of being correct. Nevertheless, in most cases, one cannot guarantee the full correctness of the rules owing to the lack of specific domain knowledge, and hence, it is not desirable to clean data using this kind of hard constraint. Fortunately, the attraction of Markov logic networks (MLNs) lies in their ability to soften those constraints. The formal definition of a Markov logic network is stated below.

Definition 1

Markov logic network [25]. A Markov logic network $L$ is defined as a set of rule-weight pairs ($r_i$, $w_i$), where $r_i$ is a rule and $w_i$ is a real-number weight of $r_i$.

Each MLN rule has an associated weight that reflects how strong a constraint is. The higher the weight is, the more reliable the rule is, which indicates the higher probability of a value satisfying the rule. Without loss of generality, we use the terms of integrity constraint, rule, and MLN rule interchangeably throughout this paper.

On top of one dataset, each MLN rule can be converted to a set of ground MLN rules through a grounding process. The term grounding refers to a process that replaces variables in the MLN rule with the corresponding constants (i.e., attribute values) in the dataset. For instance, based on the dataset in Table I, the ground MLN rules of the rule $r_1$ are shown in Table III. Accordingly, the weight of a ground MLN rule reflects the probability of the attribute values w.r.t. this ground MLN rule being clean.
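
The grounding step for the FD over CT and ST can be sketched as follows. This is a minimal illustration, not the Tuffy grounding procedure; the function name `ground_rule` and the dictionary layout of the rows are assumptions.

```python
# CT and ST values of the six sample tuples in Table I, in row order.
ROWS = [
    {"CT": "DOTHAN", "ST": "AL"},
    {"CT": "DOTH", "ST": "AL"},
    {"CT": "DOTHAN", "ST": "AL"},
    {"CT": "BOAZ", "ST": "AK"},
    {"CT": "BOAZ", "ST": "AL"},
    {"CT": "BOAZ", "ST": "AL"},
]


def ground_rule(rows, reason, result):
    """Replace the variables of `reason => result` with constants from rows.

    Each distinct (reason value, result value) pair is kept once,
    in first-seen order.
    """
    groundings = []
    for row in rows:
        grounding = (row[reason], row[result])
        if grounding not in groundings:
            groundings.append(grounding)
    return groundings


ground_rules = ground_rule(ROWS, "CT", "ST")
for ct, st in ground_rules:
    print(f'CT("{ct}") => ST("{st}")')
```

The six tuples yield four distinct groundings, matching the four ground MLN rules listed in Table III (up to ordering).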

4 MLNClean Framework

In this section, we briefly introduce the procedure of MLNClean. Figure 1 depicts the general framework of MLNClean.

Fig. 1: MLNClean architecture

The framework MLNClean receives a dirty dataset together with a set of integrity constraints (ICs). It outputs clean data through three steps, including pre-processing, MLN index construction, and two-stage data cleaning. In the pre-processing phase, MLNClean first transforms integrity constraints into MLN rules, and derives ground MLN rules of each MLN rule based on the dirty dataset. Then, MLNClean builds a two-layer MLN index with a set of blocks in the first layer and a set of groups in the second layer. The MLN index is a vital structure, which helps to narrow the search space of repair candidates for the subsequent data cleaning phase. Next, MLNClean enters the two-stage data cleaning phase, which first cleans multiple data versions independently (with each data version coming from different blocks), and then derives the final unified clean data on top of the previous multi-version data. The procedure of MLNClean is shown in Algorithm 1.

MLN index construction. An MLN index is a two-layer hash table. There is a set of blocks in the first layer, each of which has a set of groups in the second layer. One block corresponds to one MLN rule. In other words, the data attribute values related to the ground MLN rules that belong to the same MLN rule are put together to form one block. Thus, the number of blocks equals the number of MLN rules. For convenience, we call the data attribute values of each ground MLN rule a piece of data, denoted as $\gamma$. Then, within each block, we put the $\gamma$s with the same reason part into one group. As a result, in the second layer, each block is divided into several groups, and the pieces of data (i.e., $\gamma$s) within each group share the same reason part (referring to lines 1-13 in Algorithm 1). To be more specific, we decide the reason and result parts for different types of rules as follows. First, for implication formulas (i.e., FDs and CFDs), the antecedent belongs to the reason part, while the consequent pertains to the result part. In contrast, for DC formulas, we simply treat the last predicate as the result part, and the other predicates as the reason part.

Take the sample dataset in Table I as an example. We depict the MLN index structure in Figure 2. There are three blocks $B_1$, $B_2$, and $B_3$, corresponding to the three rules $r_1$, $r_2$, and $r_3$ shown in Example 1, respectively. They have 3, 3, and 2 groups, respectively. Let $|B|$ and $|T|$ be the number of blocks (or rules) and tuples in the dataset, respectively. The time complexity of MLN index construction is $O(|B| \times |T|)$. In addition, there might be multiple pieces of data pertaining to each tuple in the dataset, each of which comes from a different block. In other words, for each tuple, there are at most $|B|$ pieces of data derived from it. Hence, we say that there are multiple data versions, each of which comes from a different block.
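
The two-layer structure can be sketched with nested dictionaries, as below. This is a simplified illustration under stated assumptions: only the FD and the DC from Example 1 are indexed, the DC is treated as having PN as its reason part and ST as its result part (following the "last predicate" convention above), and the names `RULES` and `build_index` are hypothetical.

```python
from collections import Counter, defaultdict

# The six sample tuples from Table I, in row order.
ROWS = [
    {"HN": "ALABAMA", "CT": "DOTHAN", "ST": "AL", "PN": "3347938701"},
    {"HN": "ALABAMA", "CT": "DOTH", "ST": "AL", "PN": "3347938701"},
    {"HN": "ELIZA", "CT": "DOTHAN", "ST": "AL", "PN": "2567638410"},
    {"HN": "ELIZA", "CT": "BOAZ", "ST": "AK", "PN": "2567688400"},
    {"HN": "ELIZA", "CT": "BOAZ", "ST": "AL", "PN": "2567688400"},
    {"HN": "ELIZA", "CT": "BOAZ", "ST": "AL", "PN": "2567688400"},
]

# rule name -> (reason attributes, result attributes); an assumed encoding.
RULES = {
    "r1": (("CT",), ("ST",)),
    "r2": (("PN",), ("ST",)),
}


def build_index(rows, rules):
    """First layer: one block per rule. Second layer: groups keyed by reason values.

    Each group maps a piece of data (attribute/value pairs) to the number
    of tuples related to it.
    """
    index = {}
    for name, (reason, result) in rules.items():
        block = defaultdict(Counter)
        for row in rows:
            key = tuple(row[a] for a in reason)
            piece = tuple((a, row[a]) for a in reason + result)
            block[key][piece] += 1
        index[name] = block
    return index


index = build_index(ROWS, RULES)
for name, block in index.items():
    print(name, {key: dict(group) for key, group in block.items()})
```

Each tuple is visited once per rule and inserted with a constant-time hash lookup, which is consistent with a construction cost proportional to the number of rules times the number of tuples.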

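Input: a dirty dataset $T$ and a set of data rules $R$
Output: a clean dataset $T_c$
   /* MLN index construction */
1  $\mathcal{B} \leftarrow \emptyset$   // a block collection
2  foreach $r_i \in R$ do
3      $B_i \leftarrow \emptyset$   // a block w.r.t. $r_i$
4      foreach tuple $t \in T$ do
5          $\gamma \leftarrow \emptyset$   // a piece of data for $t$ w.r.t. $r_i$
6          $\gamma_{reason} \leftarrow$ attribute values of $t$ w.r.t. the reason part of $r_i$
7          $\gamma_{result} \leftarrow$ attribute values of $t$ w.r.t. the result part of $r_i$
8          insert $\gamma_{reason}$ and $\gamma_{result}$ into $\gamma$
9          if there is no group sharing the same $\gamma_{reason}$ from $B_i$ then
10             $G \leftarrow \emptyset$   // a group
11             add $G$ to $B_i$
12         add $\gamma$ to the group $G$ that shares the same $\gamma_{reason}$
13     insert $B_i$ into $\mathcal{B}$
   /* Two-stage cleaning process */
14 foreach $B_i \in \mathcal{B}$ do
15     AGP($B_i$)   // abnormal group process
16     foreach group $G \in B_i$ do
17         RSC($G$, $B_i$)   // R-score based cleaning
18 $T_c \leftarrow$ FSCR($T$, $\mathcal{B}$)   // F-score based conflict resolution
19 return $T_c$
Algorithm 1 The general procedure of MLNClean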

Two-stage data cleaning. For the data version w.r.t. each block (built in the MLN index), the first cleaning stage involves processing abnormal groups (which arise when there are errors located in a rule’s reason part), and cleaning the pieces of data (i.e., $\gamma$s) within each group using a new concept of reliability score (i.e., r-score). These correspond to an abnormal group process strategy (i.e., AGP) and an r-score based cleaning method (i.e., RSC), respectively, in lines 14-17 of Algorithm 1, which will be elaborated in Section 5. After cleaning the multiple data versions independently, there naturally exist conflicts among different data versions. Thus, in the second cleaning stage, MLNClean strives to eliminate those conflicts using a newly defined concept of fusion score (i.e., f-score), in order to get the final clean data. This corresponds to an f-score based conflict resolution strategy (i.e., FSCR) in line 18 of Algorithm 1 (to be detailed in Section 5).

Fig. 2: Illustration of the MLN index over the sample dataset

5 Two-Stage Data Cleaning

In this section, we describe the two-stage data cleaning process in MLNClean, including cleaning multiple data versions and deriving the unified clean data.

5.1 Cleaning Multiple Data Versions

As mentioned in Section 4, when cleaning multiple data versions, there are two major missions, i.e., processing abnormal groups and cleaning each group. Through them, schema-level errors (violating integrity constraints) are addressed as well as some instance-level errors including replacement errors and typos.

5.1.1 Processing Abnormal Groups

Based on an MLN index, for a tuple with error(s) in the reason part of a data rule, the corresponding piece of data (i.e., $\gamma$) might erroneously form or belong to a group, and we call such a group an abnormal group. For example, there is a typo $t_2$.[CT] being “DOTH” in the sample dataset shown in Table I. It forms a group of its own in the MLN index, as depicted in Figure 2. Indeed, the piece of data {CT: DOTH, ST: AL} w.r.t. $t_2$ should be in the same group as {CT: DOTHAN, ST: AL}. Hence, the group formed by “DOTH” is actually an abnormal group. To this end, we propose an abnormal group process strategy, termed AGP, to first identify those abnormal groups and then merge them into the corresponding normal groups.

First, we observe that the group size and the distance to other groups within the same block are key factors for identifying abnormal groups. In general, relatively small groups (that have fewer pieces of data) are prone to be abnormal, and the closer a group is to other groups, the more likely it is to be abnormal. Thus, AGP adopts a simple but effective method to identify abnormal groups. Specifically, if the number of tuples related to the $\gamma$s contained in a group is not larger than a threshold $\tau$, AGP regards this group as an abnormal group; otherwise, it regards the group as a normal group. Note that the optimal value of $\tau$ is chosen empirically. Then, for each abnormal group in a block $B_i$, AGP merges it with its nearest normal group within $B_i$. Specifically, let $\gamma^*$ be the piece of data related to the most tuples in a group. The distance between two groups is defined as the distance between their respective $\gamma^*$s.

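For instance, in terms of the MLN index shown in Figure 2, when setting $\tau$ as 1, the three groups that relate to a single tuple each are identified as abnormal groups: the group with CT = “DOTH” in $B_1$, the group with PN = “2567638410” in $B_2$, and the group with HN = “ELIZA” and CT = “DOTHAN” in $B_3$. Then, for the “DOTH” group, its nearest group based on the Levenshtein distance is the group with CT = “DOTHAN”, and hence the two are merged. Similarly, the group with PN = “2567638410” is merged with the group with PN = “2567688400”, and the group with HN = “ELIZA” and CT = “DOTHAN” is merged with the group with HN = “ELIZA” and CT = “BOAZ”.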
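
The identify-then-merge behaviour of AGP can be sketched as follows for the block of the FD over CT and ST. This is a simplified illustration, not the authors' implementation: the group layout, the function names, the string concatenation used to compare pieces of data, and the requirement that at least one normal group exists are all assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def representative(group):
    """The piece of data related to the most tuples in a group."""
    return max(group, key=group.get)


def agp(groups, tau):
    """Merge every group with at most `tau` tuples into its nearest normal group.

    Assumes the block contains at least one normal group.
    """
    normal = {k: dict(g) for k, g in groups.items() if sum(g.values()) > tau}
    for key, group in groups.items():
        if sum(group.values()) > tau:
            continue
        rep = "".join(representative(group))
        nearest = min(
            normal,
            key=lambda k: levenshtein(rep, "".join(representative(normal[k]))),
        )
        for piece, count in group.items():
            normal[nearest][piece] = normal[nearest].get(piece, 0) + count
    return normal


# Groups of the block for CT => ST: reason value -> {(CT, ST): tuple count}.
B1 = {
    "DOTHAN": {("DOTHAN", "AL"): 2},
    "DOTH": {("DOTH", "AL"): 1},
    "BOAZ": {("BOAZ", "AK"): 1, ("BOAZ", "AL"): 2},
}

merged = agp(B1, tau=1)
print(merged)
```

With the threshold set to 1, only the “DOTH” group is abnormal; its representative is two edits away from the “DOTHAN” representative and three edits away from the “BOAZ” one, so it joins the “DOTHAN” group.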

The time complexity of AGP is $O(|B| \times |G| \times |G_a|)$, where $|B|$ is the total number of blocks, and $|G|$ (and $|G_a|$) is the average number of groups (and abnormal groups) in a block. In addition, it is worth mentioning that identifying abnormal groups as accurately as possible is essential to the overall performance of the cleaning framework, since this step has the largest propagated impact on the final cleaning accuracy. Therefore, we plan to explore this issue in depth in our future work.

5.1.2 Cleaning within Each Group

In the constructed MLN index, the pieces of data in one group share the same value(s) on the reason part of the corresponding rule. Ideally, if the data are clean, one group contains one and only one piece of data, meaning that the same values on the reason part cannot derive different values on the result part. However, when one group contains several pieces of data (that are the same on the reason part), there definitely exist dirty values. In light of this, we present a cleaning strategy using a concept of reliability score, called r-score based cleaning (RSC for short), to clean dirty values within each group.

RSC judges which piece of data included in the group is clean using the reliability score. Then, RSC corrects the other dirty ones with the detected clean one, so that each group eventually has one and only one piece of data. It is noteworthy that RSC cleans the pieces of data within every block independently, without needing information outside the block. In fact, apart from the Markov weight learning, RSC cleans the data within each group separately. In other words, the MLN index structure indeed helps to minimize the search space for RSC.

The concept of the reliability score is stated in Definition 2, which is defined to evaluate the possibility of a piece of data (i.e., $\gamma$) being clean. The $\gamma$ with the highest reliability score is most likely clean. As a result, all the other pieces of data within the same group are replaced with the piece of data having the highest reliability score.

Definition 2

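Reliability score. For a piece of data $\gamma$ in a group $G$, its reliability score, denoted as r-score($\gamma$), is defined in Eq. 1.

$$\text{r-score}(\gamma) = \min_{\gamma' \in G - \{\gamma\}} \{ N(dist(\gamma, \gamma') \times n) \} \times \Pr(\gamma) \qquad (1)$$

where $\gamma' \in G - \{\gamma\}$, $N(\cdot)$ is a normalization function to make $dist(\gamma, \gamma') \times n$ within the interval $[0, 1]$, $n$ denotes the number of tuples related to $\gamma$ in group $G$, and $\Pr(\gamma)$ is the probability of $\gamma$ being clean.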

The definition of the reliability score considers two factors: (i) Distance, which represents the minimum cost of replacing a $\gamma$ with others. The greater the distance is, the more likely the $\gamma$ is clean. (ii) Probability, which indicates the possibility of the $\gamma$ being clean. The higher the probability is, the more likely the $\gamma$ is clean.

We derive the probability $\Pr(\gamma)$ in Definition 2 as follows. First, the probability distribution of values $x$ specified by the Markov network is given by Eq. 2 from [25].

$$\Pr(X = x) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{k} w_i \, n_i(x)\Big) \qquad (2)$$

where $Z$ is the normalization function, which can be treated as a constant, $n_i(x)$ is the number of true groundings of rule $r_i$ in $x$, $w_i$ is the weight of $r_i$, and $k$ represents the number of rules. When it computes $\Pr(\gamma)$ involving ground MLN rules in our case, $n_i$ equals one for the ground MLN rule corresponding to $\gamma$, and is zero for other ground MLN rules. Hence, we have

$$\Pr(\gamma) = \frac{1}{Z} \exp(w) \qquad (3)$$

where $w$ is the weight of the ground MLN rule corresponding to $\gamma$. In particular, $\frac{1}{Z}$ is a constant, and $\exp(\cdot)$ is a monotonically increasing function. Thus, the higher the probability $\Pr(\gamma)$, the greater the weight $w$. As a result, instead of deriving the probability directly, we leverage the weight of $\gamma$ to compute a reliability score, i.e., r-score($\gamma$).
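
The step from Eq. 2 to Eq. 3 can be spelled out as follows. The index $i^*$ for the ground MLN rule of $\gamma$ and the shorthand $w = w_{i^*}$ are notation introduced here for illustration only.

```latex
\begin{align*}
\Pr(\gamma)
  &= \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(\gamma) \Big) \\
  &= \frac{1}{Z} \exp\Big( w_{i^*} \cdot 1 + \sum_{i \neq i^*} w_i \cdot 0 \Big) \\
  &= \frac{1}{Z} \exp(w),
  \qquad\text{so that}\qquad
  w = \ln\big( Z \cdot \Pr(\gamma) \big).
\end{align*}
```

Since $Z$ is treated as a constant and the logarithm is monotonically increasing, ranking pieces of data by $w$ gives the same order as ranking them by $\Pr(\gamma)$.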

In implementation, the weight is computed by the MLN weight learning method from Tuffy [26], which adopts the diagonal Newton method. In particular, the prior weight of each $\gamma$ for weight learning is given by

(4)

where represents the number of tuples related to , and denotes the number of different s within a block. For example, for the piece of data being {CT: BOAZ, ST: AK} in group from block , the initial weight is set as .

Fig. 3: Illustration of reliability score computation

Example 2

Take the group with CT = “BOAZ” belonging to block $B_1$ (as depicted in Figure 2) as an example. It includes two pieces of data, denoted as $\gamma_1$ and $\gamma_2$, namely, $\gamma_1$ is {CT: BOAZ, ST: AK} and $\gamma_2$ is {CT: BOAZ, ST: AL}. They have the same value on the reason part but different values on the result part. Obviously, there is at least one error in the value of attribute ST within the group, according to the data rule $r_1$. The reliability score of each $\gamma$ is derived as shown in Figure 3, where the Levenshtein distance is used. The piece of data $\gamma_2$ being {CT: BOAZ, ST: AL} has a higher reliability score than $\gamma_1$ being {CT: BOAZ, ST: AK}. Thus, $\gamma_2$ is regarded as the clean one in this group, and $\gamma_1$ is finally replaced with $\gamma_2$ by the RSC strategy.
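
The comparison in this example can be sketched in Python as below. Several parts are assumptions rather than the paper's exact computation: the normalization divides by the largest distance-times-count value in the group, the learned weight stands in for the probability term, and the weight values passed in are hypothetical placeholders, not outputs of Tuffy.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def r_scores(counts, weights):
    """Score every piece of data in a group that holds at least two pieces.

    counts:  piece -> number of tuples related to it
    weights: piece -> weight of its ground MLN rule (hypothetical values here)
    """
    pieces = list(counts)
    # Minimum cost of replacing a piece with another piece of the group.
    raw = {
        p: min(
            levenshtein("".join(p), "".join(q)) * counts[p]
            for q in pieces
            if q != p
        )
        for p in pieces
    }
    top = max(raw.values()) or 1  # assumed normalization into [0, 1]
    return {p: raw[p] / top * weights[p] for p in pieces}


# The group with CT = "BOAZ": one tuple says AK, two tuples say AL.
counts = {("BOAZ", "AK"): 1, ("BOAZ", "AL"): 2}
weights = {("BOAZ", "AK"): 0.3, ("BOAZ", "AL"): 0.6}  # placeholder weights

scores = r_scores(counts, weights)
best = max(scores, key=scores.get)
print(scores, best)
```

Under these assumptions the piece {CT: BOAZ, ST: AL} wins on both factors: replacing it would touch two tuples instead of one, and its placeholder weight is larger.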

Fig. 4: The three clean data versions over the sample dataset

In addition, for ease of understanding, Figure 4 illustrates the three clean data versions after applying the AGP and RSC strategies consecutively. In particular, the reliability score computation is skipped for the group with PN = “3347938701” in $B_2$, because this group has already reached the ideal state with only one $\gamma$ in it. Finally, the first clean data version contains {CT: DOTHAN, ST: AL} (w.r.t. $t_1$, $t_2$, and $t_3$) and {CT: BOAZ, ST: AL} (w.r.t. $t_4$, $t_5$, and $t_6$). The second clean data version incorporates {PN: 3347938701, ST: AL} (w.r.t. $t_1$ and $t_2$) and {PN: 2567688400, ST: AL} (w.r.t. $t_3$, $t_4$, $t_5$, and $t_6$). The third clean data version consists of {HN: ELIZA, CT: BOAZ, PN: 2567688400} (w.r.t. $t_3$, $t_4$, $t_5$, and $t_6$).

5.2 Deriving the Unified Clean Data

The first cleaning stage of MLNClean has obtained the rule-based multiple clean data versions. Now, we are ready to enter the second cleaning stage. It aims to execute data fusion on top of the multi-version data in order to get the final clean data, during which the conflicts among different data versions have to be resolved. Note that this stage provides a second opportunity to clean as many as possible of the erroneous data that were not repaired (or were incorrectly repaired) in the first cleaning stage.

Take the tuple $t_3$ depicted in Table I as an example. After finishing the first cleaning stage, the piece of data w.r.t. $t_3$ in $B_1$ (w.r.t. the first data version) is {CT: DOTHAN, ST: AL}, while in $B_3$ (w.r.t. the third data version), the piece of data related to $t_3$ is {HN: ELIZA, CT: BOAZ, PN: 2567688400}. Obviously, $t_3$.[CT] has two different values (i.e., “DOTHAN” and “BOAZ”) from the two versions. That is to say, there exist conflicts on the attribute CT of $t_3$, which should be eliminated to get the final clean $t_3$. Besides, although {CT: DOTHAN, ST: AL} conforms to the rule $r_1$, there might still exist errors w.r.t. $t_3$ in the case that it is erroneously classified into a group and thereby cannot be correctly repaired in the first cleaning stage.

Accordingly, when executing the data fusion on top of multiple data versions, we identify a set of candidates to resolve conflicts, which refers to all the possible fusion versions for a tuple. Hence, we introduce a novel concept of fusion score (i.e., f-score) to get the most likely clean fusion version. Specifically, for a tuple $t$, the fusion score of $t$, denoted by f-score($t$), is defined as the product of the weights of the data pieces (related to $t$) from different data versions, as written in Eq. 5.

$$\text{f-score}(t) = \prod_{i=1}^{m} w_i \qquad (5)$$

where $w_i$ denotes the weight of $\gamma_i$ related to $t$, and $m$ is the number of data pieces related to $t$. The larger the f-score, the more likely the corresponding fusion version of tuple $t$ is clean.

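Input: a dirty dataset $T$ and multiple clean data versions contained in blocks $\mathcal{B}$
Output: a clean dataset $T_c$
1  $T_c \leftarrow \emptyset$
2  foreach tuple $t \in T$ do
3      $\Gamma \leftarrow \{\gamma_1, \dots, \gamma_m\}$   // all the versions of $t$
4      $f_{max} \leftarrow 0$; $t_f \leftarrow \emptyset$
5      foreach $\gamma_i$ from $\Gamma$ do
6          $f \leftarrow$ weight of $\gamma_i$; $t' \leftarrow \gamma_i$
7          GetFusionT($t'$, $\Gamma - \{\gamma_i\}$, $f$)
8      replace the corresponding attribute values of $t$ with $t_f$
9      add $t$ to $T_c$
10 return $T_c$
11 Function GetFusionT($t'$, $\Gamma'$, $f$)
12     if $\Gamma'$ is empty or $f = 0$ then
13         if $f > f_{max}$ then
14             $f_{max} \leftarrow f$; $t_f \leftarrow t'$
15         return
16     foreach $\gamma_j$ from $\Gamma'$ do
17         if $\gamma_j$ and $t'$ have conflicts then
18             if there exists $\gamma'$ with the highest weight such that there is no conflict between $\gamma'$ and $t'$ then
19                 $\gamma_j \leftarrow \gamma'$
20             else
21                 $f \leftarrow 0$
22                 return
23         $f \leftarrow f \times$ weight of $\gamma_j$; $t' \leftarrow t'$ merged with $\gamma_j$
24         GetFusionT($t'$, $\Gamma' - \{\gamma_j\}$, $f$)
25     return
Algorithm 2 F-Score based Conflict Resolution (FSCR)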

As a result, we develop an f-score based conflict resolution (FSCR for short) strategy, with the pseudo code presented in Algorithm 2. It receives a dirty dataset $T$ and the multiple clean data versions obtained in the first cleaning stage, which are stored in different blocks. To begin with, FSCR initializes the clean dataset $T_c$ as an empty set (line 1). Then, for each tuple $t$ in the dirty dataset $T$, it attempts to derive the unified clean tuple (lines 3-9). FSCR first puts all the pieces of data related to $t$ into a set $\Gamma = \{\gamma_1, \dots, \gamma_m\}$, where $m$ is not larger than $|B|$ (line 3), i.e., it collects all the versions of tuple $t$. Next, a temporary variable $f_{max}$ is set to zero, which is used to store the maximal f-score, and the corresponding fusion version of $t$, i.e., $t_f$, is set to empty (line 4). Then, for each piece of data $\gamma_i$ related to $t$, FSCR merges it with the other data pieces of $t$ contained in $\Gamma$ using the function GetFusionT, in order to find the optimal fusion version with the highest f-score as the final unified one (lines 5-7). Thereafter, the tuple $t$ is updated using the derived optimal fusion version $t_f$, and is added to $T_c$ (lines 8-9). FSCR proceeds to process the remaining tuples in the dirty dataset one by one. Finally, it returns the clean dataset $T_c$ (line 10).

Since the fusion version of a tuple depends on the order in which its versions are merged, the number of possible fusion versions is factorial in the number of versions. Hence, GetFusionT is a recursive function (lines 11-25). It terminates if the f-score is zero, or if one fusion version of the tuple has been obtained, i.e., the set of remaining versions becomes empty (lines 12-15). Given the current fusion version, for each remaining piece of data in the set, GetFusionT has to decide whether the two conflict. A conflict exists only if the two have some common attribute(s) and their values differ on at least one of those attributes. If there is a conflict, GetFusionT attempts to find a candidate piece of data from the same block to replace the conflicting one. The candidate is the one with the highest Markov weight that has no conflict with the current fusion version. If no such candidate exists, the fusion for the tuple fails and terminates (lines 18-22). If there is no conflict, or a proper candidate exists, GetFusionT continues the tuple fusion and updates the value of the f-score (line 23). Note that, in line 6, the value of the f-score is first set to the weight of the current piece of data. Subsequently, GetFusionT invokes itself at line 24 to merge the remaining versions of the tuple.

Example 3

We illustrate how FSCR works on a tuple of the sample dataset, based on three versions of the tuple obtained from the first cleaning stage, each coming from a different block. In particular, the first version is {CT: DOTHAN, ST: AL}, the second version is {PN: 2567688400, ST: AL}, and the third version is {HN: ELIZA, CT: BOAZ, PN: 2567688400}. Hence, following Algorithm 2, FSCR collects these three versions for the tuple. For simplicity, we show two fusion attempts: (i) merging the first and second versions and then the third one, and (ii) merging the second and third versions and then the first one. For the first attempt, the first and second versions are merged directly, since there is no conflict between them. Then, GetFusionT proceeds to merge the result with the third version. Since there is a conflict on the attribute CT, GetFusionT tries to find another piece of data from the block of the third version that has the same value on the common attribute CT as the fusion of the first two versions. Unfortunately, no such piece of data exists in that block. Thus, GetFusionT terminates the current fusion. For the second attempt, the second and third versions are merged first, as there is no conflict between them. Then, GetFusionT merges their fusion with the first version. Again, there is a conflict on the attribute CT. This time, GetFusionT successfully finds a candidate {CT: BOAZ, ST: AL} in the block of the first version. Finally, the final fusion version of the tuple, i.e., {HN: ELIZA, CT: BOAZ, ST: AL, PN: 2567688400}, is obtained.
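The recursive fusion walked through in Example 3 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the block ids, the function names, and all weights are hypothetical, and the f-score is assumed to be the product of the weights of the merged pieces of data (the text only states that it is derived from those weights).

```python
from typing import Dict, List, Tuple

Piece = Dict[str, str]                 # attribute -> value
Version = Tuple[str, Piece, float]     # (block id, piece of data, assumed weight)


def conflicts(a: Piece, b: Piece) -> bool:
    """Pieces conflict if they disagree on at least one shared attribute."""
    return any(a[k] != b[k] for k in a.keys() & b.keys())


def get_fusion(fused: Piece, score: float, remaining: List[Version],
               blocks: Dict[str, List[Tuple[Piece, float]]], best: dict) -> None:
    """Recursively merge the remaining versions into `fused`, trying each order."""
    if not remaining:                      # one complete fusion version obtained
        if score > best["score"]:
            best["score"], best["tuple"] = score, dict(fused)
        return
    for i, (block_id, piece, weight) in enumerate(remaining):
        if conflicts(fused, piece):
            # Look for the highest-weight candidate of the same block that
            # does not conflict with the current fusion version.
            candidates = [(p, w) for p, w in blocks[block_id]
                          if not conflicts(fused, p)]
            if not candidates:
                return                     # this fusion attempt fails
            piece, weight = max(candidates, key=lambda pw: pw[1])
        rest = remaining[:i] + remaining[i + 1:]
        # Assumption: the f-score is the product of the merged weights.
        get_fusion({**fused, **piece}, score * weight, rest, blocks, best)


def fscr_tuple(versions: List[Version],
               blocks: Dict[str, List[Tuple[Piece, float]]]):
    """Return the fusion version with the highest f-score, and that f-score."""
    best = {"score": 0.0, "tuple": None}
    for i, (_, piece, weight) in enumerate(versions):
        rest = versions[:i] + versions[i + 1:]
        get_fusion(dict(piece), weight, rest, blocks, best)  # f-score starts at the weight
    return best["tuple"], best["score"]


# Data of Example 3; block ids and weights are made up for illustration.
blocks = {
    "B1": [({"CT": "DOTHAN", "ST": "AL"}, 0.5), ({"CT": "BOAZ", "ST": "AL"}, 0.4)],
    "B2": [({"PN": "2567688400", "ST": "AL"}, 0.9)],
    "B3": [({"HN": "ELIZA", "CT": "BOAZ", "PN": "2567688400"}, 0.8)],
}
versions = [
    ("B1", {"CT": "DOTHAN", "ST": "AL"}, 0.5),
    ("B2", {"PN": "2567688400", "ST": "AL"}, 0.9),
    ("B3", {"HN": "ELIZA", "CT": "BOAZ", "PN": "2567688400"}, 0.8),
]
fused_tuple, f_score = fscr_tuple(versions, blocks)
```

With these assumed weights, every merge order that starts from {CT: DOTHAN, ST: AL} fails on the attribute CT, while the orders that merge the other two versions first succeed by swapping in {CT: BOAZ, ST: AL}.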

Let be the number of tuples in the dataset and be the average number of versions for a tuple. is bounded by the number of blocks/rules. There are at most fusion versions for a tuple. Each fusion version needs time to find the candidate version . Therefore, FSCR takes time. In addition, after eliminating conflicts via FSCR, MLNClean automatically detects and removes duplicate tuples. Take the sample dataset as an example, and are duplicates, and and are duplicates. MLNClean deletes extra duplicate tuples among them to finish the cleaning process.

6 Distributed MLNClean

In order to enable MLNClean to work well even for large-scale datasets with a large number of data rules, we deploy MLNClean on Spark. In this section, we describe the distributed MLNClean program.

First, data skew is a critical issue in a distributed system, since it may delay the overall process. Hence, an effective data partition strategy is needed for the distributed MLNClean version. The distributed MLNClean version executes as follows. It first partitions the whole dataset into several parts, and allocates each part to a worker node. Then, it cleans each part using the stand-alone MLNClean. When each part has been cleaned, those parts are gathered to derive the final clean dataset, during which conflicts and duplicates are eliminated in the same way as in the stand-alone MLNClean.

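Input: a dataset and the number of parts
Output: the data partition
1 set the maximum capacity of each part
2 initialize each part as an empty max-heap
3 for each part, randomly select a centroid and insert it into the part
4 collect those centroids into a set
5 for each tuple in the dataset do
6     find the centroid with the minimal distance to the tuple
7     if the part of that centroid is not full then
8         insert the tuple into the part
9     else
10         mark the tuple as the one to be relocated; get the top node of the part
11         if the tuple is closer to the centroid than the top node then
12             mark the top node as the one to be relocated; replace the top node with the tuple
13         find the closest centroid to the relocated tuple whose part is not full
14         add the relocated tuple to that part
15 return the data partition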
Algorithm 3 Data partition method
Fig. 5: Illustration of data partition

Our data partition method is depicted in Algorithm 3, which aims to divide the dataset into a given number of parts. The maximum capacity of each part is bounded so that tuples are distributed uniformly (line 1). Tuples are stored in a max-heap for each part, where each node also records the distance between the corresponding tuple and the centroid of the part. Then, each tuple is allocated to the part whose centroid is closest to it, provided that the number of tuples in that part is smaller than the maximum capacity (lines 6-8). When the part is already full, the distance between the tuple and the centroid is compared with the distance from the top node of the heap to the centroid. If the former is smaller, the top node is replaced with the tuple in the part, and the evicted top node is inserted into its closest part that is not full. Otherwise, the tuple is directly added to its closest part that is not full (lines 10-14).
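The capacity-bounded partitioning described above can be sketched in Python with the standard `heapq` module, emulating a max-heap by storing negated distances. The function name, the `seed` parameter, and the capacity formula (the ceiling of the number of tuples divided by the number of parts) are assumptions made for illustration; the toy data uses numbers with absolute difference as the distance.

```python
import heapq
import math
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def partition(data: Sequence[T], k: int, dist: Callable[[T, T], float],
              seed: int = 0) -> List[List[T]]:
    rng = random.Random(seed)
    capacity = math.ceil(len(data) / k)          # assumed capacity per part
    centroid_ids = rng.sample(range(len(data)), k)
    centroids = [data[i] for i in centroid_ids]
    # Each part is a heap of (-distance to centroid, tuple index); the root
    # is therefore the tuple farthest from the centroid.
    parts = [[(-0.0, i)] for i in centroid_ids]
    for idx in range(len(data)):
        if idx in centroid_ids:
            continue
        t = data[idx]
        home = min(range(k), key=lambda j: dist(t, centroids[j]))
        d = dist(t, centroids[home])
        if len(parts[home]) < capacity:
            heapq.heappush(parts[home], (-d, idx))
            continue
        moving = idx                              # tuple that still needs a part
        neg_top, top_idx = parts[home][0]
        if d < -neg_top:
            # The new tuple is closer than the farthest one: swap them.
            heapq.heapreplace(parts[home], (-d, idx))
            moving = top_idx
        # Put the relocated tuple into its closest part that is not full.
        for j in sorted(range(k), key=lambda j: dist(data[moving], centroids[j])):
            if len(parts[j]) < capacity:
                heapq.heappush(parts[j], (-dist(data[moving], centroids[j]), moving))
                break
    return [[data[i] for _, i in part] for part in parts]


data = [1, 2, 3, 10, 11, 12, 20, 21]
parts = partition(data, 3, lambda a, b: abs(a - b))
```

Because the total capacity is at least the number of tuples, a part that is not full always exists when a tuple has to be relocated, so every tuple ends up in exactly one part.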

Take the sample dataset shown in Table I as an example. Figure 5 illustrates the partition result. The number inside each part denotes the distance between a tuple and the centroid of that part. For instance, the number 4 in a part represents the distance between the corresponding tuple and the centroid of that part.

The time complexity of data partition algorithm is , where is the total number of tuples in the dataset, and is the complexity of one insertion operation in a maximum heap.

For the distributed MLNClean program, since only a small number of tuples is allocated to each part, the weights learned on each worker (w.r.t. each part) might not be very reliable. For example, in one part, the learned weight of the piece of data {CT: DOTHAN, ST: AL} may be unreliable, since that part contains no relevant evidence for learning it. Conversely, another part contains a tuple that provides the relevant evidence. Thus, we adjust the weight of each piece of data via Eq. 6 by using the relevant evidence from other parts.

(6)

Eq. 6 involves the number of parts in the data partition, the number of tuples related to the piece of data in each part, and the weight of the piece of data learned in each part. As a result, each piece of data corresponds to a unique global weight, which is used in the subsequent cleaning process.
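One way to combine the per-part weights from the quantities named above is an average weighted by the number of related tuples in each part. This is an assumption made for illustration and may differ from the exact form of Eq. 6; the function name and the numbers are hypothetical.

```python
from typing import Sequence


def global_weight(counts: Sequence[int], weights: Sequence[float]) -> float:
    """Average the per-part weights, weighted by the number of related tuples."""
    total = sum(counts)
    if total == 0:
        return 0.0                      # no evidence in any part
    return sum(n * w for n, w in zip(counts, weights)) / total


# Three parts: the first has no tuple related to the piece of data, so its
# locally learned weight does not influence the global weight.
w = global_weight(counts=[0, 3, 1], weights=[0.2, 0.8, 0.4])
```

Under this assumption, a part without relevant evidence contributes nothing, which matches the motivation given for adjusting the weights.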

7 Experiments

In this section, we present a comprehensive experimental evaluation of our proposed data cleaning framework MLNClean, using both real-world and synthetic datasets, in the following scenarios: (i) the experimental comparisons between MLNClean and the state-of-the-art method HoloClean [22], (ii) the effect of various parameters on the performance of MLNClean, and (iii) the performance of the distributed MLNClean version on the Spark platform.

Dataset   Rules
HAI       PhoneNumber → ZIPCode
          PhoneNumber → State
          ZIPCode → City
          MeasureID → MeasureName
          ZIPCode → CountyName
          ProviderID → City, PhoneNumber
          ¬(PhoneNumber(t) = PhoneNumber(t′) ∧ State(t) ≠ State(t′))
CAR       Make(“acura”), Type → Doors
          Model, Type → Make
TPC-H     CustKey → Address
TABLE IV: Rules Used in Each Dataset

7.1 Experimental Setup

Fig. 6: Effect of error percentage on comparison evaluation: (a) accuracy on CAR, (b) accuracy on HAI, (c) execution time on CAR, (d) execution time on HAI.

In the experiments, we use two real-world datasets, i.e., HAI and CAR, and one synthetic dataset, i.e., TPC-H.

HAI (https://data.medicare.gov/data/hospital-compare) is a real dataset that provides information about healthcare associated infections that occurred in hospitals. It contains 231,265 tuples.

CAR (https://www.cars.com/) contains used vehicle information, including the model, make, type, year, condition, wheelDrive, doors, and engine attributes. It consists of 30,760 tuples.

TPC-H (http://www.tpc.org/tpch/) is a benchmark for measuring the performance of database systems. Its two largest tables, lineitem and customer, are used to create a synthetic dataset. It contains 6,001,115 tuples.

Table IV summarizes the integrity constraints of each dataset used in our experiments, which are given by domain experts. We add errors randomly, including typos and replacement errors, on the attributes related to the integrity constraints shown in Table IV for each dataset. Specifically, we randomly delete a letter of an attribute value to construct a typo. For a replacement error, we replace a value with another value from the same domain. For each dataset, we generate a 5% error rate by default, half of which are typos and the other half replacement errors. It is necessary to mention that, enterprises typically find data error rates are approximately [27], and the reported error rates are no more than in many case studies [28]. Note that the error rate is defined as the ratio of the number of erroneous values to the total number of attribute values. In addition, we use the Levenshtein distance as the distance metric, unless otherwise stated.
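The error generation described above can be sketched as follows. The function name, the seed, and the toy rows are hypothetical, and the sketch makes two simplifying assumptions: the error rate is taken over the cells of the listed attributes only, and typos and replacements alternate (falling back to a typo when the domain has no other value).

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def inject_errors(rows: List[Row], attrs: Sequence[str], error_rate: float,
                  seed: int = 0) -> Tuple[List[Row], List[Tuple[int, str]]]:
    rng = random.Random(seed)
    dirty = [dict(r) for r in rows]                 # leave the input untouched
    cells = [(i, a) for i in range(len(rows)) for a in attrs]
    targets = rng.sample(cells, round(error_rate * len(cells)))
    for n, (i, a) in enumerate(targets):
        value = rows[i][a]
        others = sorted({r[a] for r in rows} - {value})
        if n % 2 == 0 or not others:
            # Typo: delete one randomly chosen letter (assumes non-empty values).
            pos = rng.randrange(len(value))
            dirty[i][a] = value[:pos] + value[pos + 1:]
        else:
            # Replacement error: another value from the same attribute domain.
            dirty[i][a] = rng.choice(others)
    return dirty, targets


rows = [{"CT": "DOTHAN", "ST": "AL"}, {"CT": "BOAZ", "ST": "AL"}] * 5
dirty, targets = inject_errors(rows, ["CT", "ST"], error_rate=0.1)
changed = sum(d[a] != r[a] for d, r in zip(dirty, rows) for a in ("CT", "ST"))
```

Both error types always change the selected cell, so the number of changed cells equals the number of injected errors.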

We utilize F1-score to evaluate the accuracy of data cleaning methods. Specifically, it is defined as

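$$\text{F1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (7)$$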

where precision is equal to the ratio of correctly repaired attribute values to the total number of updated attribute values, and recall equals the ratio of correctly repaired attribute values to the total number of erroneous values. In addition, unless otherwise stated, the experiments were conducted on a Dell PowerEdge T620 with one Intel(R) Xeon(R) E5-2620 v2 2.10GHz processor (6 physical cores and 12 CPU threads) and 188GB RAM.
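As a small illustration of these definitions, the sketch below computes precision, recall, and the F1-score from three aligned lists of attribute values: the dirty values, the repaired values, and the ground truth. The function name and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def repair_scores(dirty: Sequence[str], repaired: Sequence[str],
                  truth: Sequence[str]) -> Tuple[float, float, float]:
    erroneous = updated = correct = 0
    for d, r, g in zip(dirty, repaired, truth):
        if d != g:
            erroneous += 1              # the value was an error before cleaning
        if r != d:
            updated += 1                # the cleaner changed this value
            if r == g:
                correct += 1            # and the change is a correct repair
    precision = correct / updated if updated else 0.0
    recall = correct / erroneous if erroneous else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = ["DOTHAN", "AL", "BOAZ", "AL"]
dirty = ["DOTHN", "AL", "BOAZ", "AK"]        # two erroneous values
repaired = ["DOTHAN", "AL", "BOAS", "AK"]    # one correct and one wrong update
precision, recall, f1 = repair_scores(dirty, repaired, truth)
```

Here one of the two updates is correct and one of the two errors is repaired, so precision, recall, and the F1-score are all 0.5.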

7.2 Comparisons with HoloClean

In this section, we verify the performance of MLNClean and HoloClean. Since HoloClean adopts external modules for error detection and it can only fix errors caught by the error detection phase, we set the detection accuracy of HoloClean as 100% for an absolutely fair evaluation, which helps avoid the effect of the detection accuracy on the subsequent error repairing phase.

Effect of error percentage. We vary the error percentage from 5% to 30%, and report the corresponding experimental results in Figure 6. As observed from Figure 6(a) and Figure 6(b), for both MLNClean and HoloClean, the accuracy decreases slightly as the error percentage increases. For MLNClean, there are two reasons for the decline. First, as the error percentage increases, AGP is prone to wrongly treating more normal groups as abnormal ones, and the subsequent cleaning steps inherit these mistakes, resulting in degraded accuracy. Second, RSC employs a reliability score based on Markov weight learning, and a larger error percentage makes the learned weights less reliable. HoloClean, on the other hand, separates the whole dataset into noisy and clean parts. It uses the clean values picked by error detection methods to learn the parameters of its statistical model, and then employs the trained model to infer the probability of each noisy value. When the error rate increases, the statistical difference between the noisy and clean parts grows, which makes the learned parameters less suitable for inferring the noisy values. The results also show that MLNClean has much higher accuracy than HoloClean in all cases, which reflects the superiority of the two-stage cleaning of MLNClean. This is because, as erroneous values become more numerous, HoloClean, which relies solely on probabilistic reasoning, becomes weaker, whereas MLNClean, which considers both statistical characteristics and the principle of minimality, remains robust and thus performs better.

As shown in Figure 6(c) and Figure 6(d), the execution time of both MLNClean and HoloClean increases as the error percentage grows. Note that the overall runtime of MLNClean includes both error detection time and repairing time, whereas the total runtime of HoloClean involves only the error repairing time, since HoloClean relies on external modules for error detection. For MLNClean, the growth of the total time cost mainly results from the Markov weight learning, which occupies almost 95% of the total time. Specifically, the increase of error percentage makes it more difficult to determine whether a value is clean or erroneous, thus leading to slower convergence of Markov weight learning. For HoloClean, the runtime is mainly determined by the compile and repair phase. In particular, with more errors, the candidate set of each value becomes larger, incurring more overhead.

Furthermore, MLNClean is consistently faster than HoloClean, even though MLNClean deals with both the error detection and repairing stages while HoloClean only focuses on error repairing. This advantage comes from the cleaning scheme of MLNClean. In MLNClean, the smallest unit of data cleaning is a piece of data, which contains multiple attribute values. Thus, it is able to clean multiple values at a time. In HoloClean, however, the minimum cleaning unit is a single attribute value, i.e., it cleans only one value at a time. Hence, HoloClean needs more time to clean all errors.

Effect of error type ratio. In order to investigate the effect of different error types on the performance of MLNClean and HoloClean, we vary the error type ratio, and set the total error percentage to 5% by default. We consider two error types, i.e., replacement errors and typos. Specifically, we change the proportion of replacement errors among the total errors from 0 to 100%. A proportion of zero means that there is no replacement error in the 5% total errors, namely, all of them are typos. A proportion of 100% indicates that all the 5% total errors are replacement errors.

The corresponding experimental results are plotted in Figure 7. We observe that HoloClean is rather sensitive to the error type ratio on the CAR dataset, whereas the performance of MLNClean is stable on both datasets. There are two reasons. From the method aspect, as explained previously, HoloClean trains its model on the clean part and uses the trained model to infer values in the dirty part. By construction, replacement errors are incorrect values drawn from the same domain, and thus they appear in both the clean and dirty parts. In contrast, typos are absent from the clean part, which makes the trained model weak on many typos. Thus, HoloClean is expected to be sensitive to typos. From the dataset aspect, CAR is rather sparse while HAI is much denser. Hence, HoloClean is much more sensitive to the error type ratio on CAR than on HAI. The F1-score of HoloClean on CAR grows as the proportion of replacement errors varies from 0 to 100%, and the cleaning result is the worst when the dataset contains only typos. On the other hand, MLNClean handles both types of errors via the two-stage cleaning strategy, and thus it is much more stable as the ratio changes. This further confirms the superiority of MLNClean. In addition, the execution time is not very sensitive to the error type. Hence, we omit the related description due to the space constraint.

7.3 Results on MLNClean

In this section, we study the effect of different parameters (i.e., the threshold in the AGP strategy, the total error percentage, and the distance metric) on the performance of MLNClean. For an in-depth investigation, we also explore the effect of these parameters on the three components of MLNClean, namely the AGP strategy, the RSC method, and the FSCR strategy, each of which has an impact on the data cleaning accuracy of MLNClean. In order to appropriately measure the accuracy of each component, we introduce a series of metrics for them.

For the AGP strategy, we define Precision-A as the fraction of correctly merged abnormal groups over the total number of detected abnormal groups, and Recall-A as the fraction of correctly merged abnormal groups over the total number of real abnormal groups. For the RSC method, Precision-R is defined as the ratio of correctly repaired pieces of data to the total number of repaired pieces of data, and Recall-R is equal to the ratio of correctly repaired pieces of data to the number of pieces of data that contain errors. In addition, for the FSCR strategy, Precision-F (resp. Recall-F) corresponds to the fraction of attribute values correctly repaired by FSCR over the number of erroneous attribute values that include detected conflicts (resp. the total number of erroneous attribute values).

Fig. 7: Effect of error type ratio on comparison evaluation: (a) CAR, (b) HAI.

7.3.1 Effect of Threshold

Fig. 8: The performance of AGP vs. the value of the threshold: (a) CAR, (b) HAI.

Effect on the performance of AGP. First, we evaluate the performance of the abnormal group process (i.e., AGP) strategy when varying the value of the threshold. The corresponding results are shown in Figure 8. Note that we also report the total number of pieces of data within the abnormal groups detected by AGP under different thresholds. For simplicity, we call it the number of detected abnormal pieces of data (#dag for short).

The first observation is that the accuracy of AGP (both precision and recall) first ascends and then drops as the threshold grows on both datasets. AGP achieves the highest accuracy on the CAR dataset when the threshold is 1, and on the HAI dataset when the threshold is 10. The second observation is that, when the threshold is zero, the accuracy is close to zero on both datasets. The reason is that no group is treated as abnormal in this case, which corresponds to the smallest #dag depicted in the diagrams. In contrast, when the threshold exceeds the optimal value, the accuracy deteriorates, while #dag increases on both datasets. This is because more normal groups are detected as abnormal groups. In the extreme case, when the threshold is larger than 30 on the HAI dataset, the accuracy drops sharply to zero while #dag grows significantly, because the vast majority of normal groups are wrongly detected as abnormal ones.

Fig. 9: The performance of RSC vs. the value of the threshold: (a) CAR, (b) HAI.

Effect on the performance of RSC. Then, we investigate the impact of the threshold on the accuracy of the reliability score based cleaning strategy (i.e., RSC). As shown in Figure 9, an appropriate threshold value (i.e., 1 on CAR and 10 on HAI) contributes to the higher accuracy of RSC. When the threshold deviates from the optimal value, the accuracy gets worse. The reason is that the further the threshold is from the optimal value, the more groups are processed wrongly by AGP, and thus the fewer pieces of data in groups are correctly repaired by RSC. Besides, the precision of RSC (Precision-R) remains higher than its recall (Recall-R). This is because, when more groups are wrongly processed by AGP in the previous step, RSC, which is executed within each group, is not able to repair the additional errors caused by AGP, resulting in lower recall. In the extreme case, the recall drops sharply to nearly zero when the threshold is larger than 30 on the HAI dataset.

Fig. 10: The performance of FSCR vs. the value of the threshold: (a) CAR, (b) HAI.

Effect on the performance of FSCR. Next, we study the performance of our presented conflict resolution method (i.e., FSCR) when varying the value of the threshold. The corresponding results are depicted in Figure 10. As expected, the appropriate threshold value (i.e., 1 on CAR and 10 on HAI) yields the optimal accuracy. Moreover, the precision remains high even when the threshold deviates from the optimal value. According to the definition of precision, this means that few detected conflicts are wrongly repaired. Besides, the recall being lower than the precision indicates that some errors have not been detected by FSCR. The recall drops sharply, even to nearly zero on HAI, when the threshold is larger than 30. This signifies that more and more erroneous values are not detected by FSCR, since those errors have not been correctly processed by AGP or RSC in the previous phases.

Effect on the performance of MLNClean. Last but not least, we explore the effect of the threshold on the overall framework MLNClean, and report the corresponding experimental results in Figure 11. Unsurprisingly, MLNClean achieves the highest accuracy when the threshold is 1 on the CAR dataset (where F1 is 0.96), and when it equals 10 on the HAI dataset (where F1 equals 0.98). Any deviation of the threshold from the most appropriate value lowers the accuracy. On the other hand, the total execution time of MLNClean grows as the threshold increases. This is because the bigger the threshold, the larger the number of abnormal groups detected by AGP, which leads to longer processing time. Note that, without loss of generality, we set the threshold to 1 on the CAR dataset (and to 10 on the HAI dataset) in the rest of the experiments.

Fig. 11: The performance of MLNClean vs. the value of the threshold: (a) CAR, (b) HAI.

7.3.2 Effect of Error Percentage

Effect on the performance of AGP. First, we verify the effect of error percentage on the performance of AGP. The corresponding results are shown in Figure 12 with various error percentages. We observe that the accuracy of AGP decreases as the error percentage grows. This is because the higher the error rate, the more (abnormal) groups there are, and hence the fewer pieces of data there are within one group. Thus, as the pieces of data within a group become fewer with the increase of error rate, AGP tends to wrongly treat more and more normal groups as abnormal ones, for a fixed threshold value on each dataset. As a result, both the precision and recall of AGP get lower according to their definitions.

Fig. 12: The performance of AGP vs. the error percentage: (a) CAR, (b) HAI.

Effect on the performance of RSC. Then, we investigate the accuracy of the RSC method when changing the error percentage, and report the corresponding results in Figure 13. We observe that both the precision and recall of RSC drop slightly with the increasing error rate. There are two major reasons for this trend. The first one is the propagated influence of the decreasing accuracy of AGP in the previous step. The second one comes from the statistical characteristic of RSC, which employs a reliability score based on Markov weight learning. The larger the error rate, the less reliable the learned weights, and the lower the accuracy of RSC. On the other hand, RSC is quite robust to the change of error rate. In particular, the drop of precision is around 10%, and the drop of recall is around 1%. In addition, the recall is higher than the precision in most cases. The reason is that, with the growth of error rate, the number of pieces of data repaired by RSC increases faster than the number of pieces of data containing errors, which partly results from more groups being wrongly processed by AGP (as explained earlier).

Fig. 13: The performance of RSC vs. the error percentage: (a) CAR, (b) HAI.
Dataset   Levenshtein distance   Cosine distance
CAR       0.968                  0.730
HAI       0.970                  0.947
TABLE V: F1-scores under Different Distance Metrics

Effect on the performance of FSCR. We also study the impact of error percentage on the performance of FSCR. The corresponding results are depicted in Figure 14. One can observe that the accuracy shows no significant fluctuation as the error percentage changes. The values of both precision and recall are always above 90%, and their fluctuation is within 6%. The high accuracy of FSCR reflects that FSCR is indeed capable of cleaning those errors which have not been correctly cleaned by AGP or RSC in the previous stages. Furthermore, the recall is higher than the precision, which can be attributed to more conflicts being wrongly detected by FSCR, due to the relatively lower accuracy of AGP and RSC in the previous stages. In addition, we have already analyzed the overall performance of MLNClean under varying error percentages in terms of the experimental results in Figure 6. Thus, we omit the related description here due to the space constraint.

Fig. 14: The performance of FSCR vs. the error percentage: (a) CAR, (b) HAI.

7.3.3 Effect of Distance Metrics

Fig. 15: The performance of distributed MLNClean: (a) HAI, (b) TPC-H.
The number of workers          2        4        6        8        10
Total time of MLNClean (sec)   50,759   27,574   16,289   11,572   7,578
TABLE VI: Experiments under Different Numbers of Workers

The distance metric plays an important role in MLNClean in two respects. First, it is the basis for measuring the similarity of two groups in the AGP strategy. Second, it is a significant factor in computing the reliability score, which is employed by the RSC method. Thus, we also evaluate the effect of different distance metrics, including the Levenshtein distance and the cosine distance, on the accuracy of MLNClean. As shown in Table V, the accuracy of MLNClean using the Levenshtein distance is higher than that using the cosine distance on both datasets. The reason is that, for the cosine distance, if the first few characters of a string are misspelled, the cosine distance from it to a similar string might be large. In contrast, the Levenshtein distance counts the character edits needed to turn one string into the other, regardless of the positions of those characters. Thus, the Levenshtein distance is more suitable for dealing with various error types.
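To make the position-independence of the Levenshtein distance concrete, here is a minimal dynamic-programming sketch. It is a textbook implementation written for illustration, not the code used in the experiments, and the sample strings are typos of a city name from the running example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


# A deleted letter costs one edit, whether it is the first or a middle one.
first_letter_typo = levenshtein("DOTHAN", "OTHAN")
middle_letter_typo = levenshtein("DOTHAN", "DOTHN")
```

Both typos are at distance 1 from the clean value, which illustrates why the position of a misspelled character does not affect this metric.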

7.4 Results on Distributed MLNClean

In this section, we evaluate the performance of our proposed distributed MLNClean version using the larger HAI and TPC-H datasets. This set of experiments was implemented on Spark 1.0.2 and executed on an 11-node Dell cluster (1 master with 10 workers), where each node has two Intel(R) Xeon(R) E5-2620 v3 2.4GHz processors (12 physical cores, 24 CPU threads) and 64GB RAM.

Figure 15 plots the corresponding results when varying the error percentage. As expected, with the growth of error percentage, the execution time of MLNClean gets longer, and its accuracy gets lower on both the HAI and TPC-H datasets. The reason is similar to that analyzed in Section 7.2. We would like to point out that, when the error percentage increases from 5% to 30%, the accuracy of MLNClean is always above 95%, and the drop of accuracy is less than 3% on both datasets. Consequently, MLNClean remains robust on the Spark platform.

In addition, we change the number of workers from 2 to 10 on the TPC-H dataset, and present the corresponding results in Table VI. One can observe that the time cost drops as the number of workers grows, while the accuracy fluctuates only slightly. When the number of workers changes from 2 to 10, the speedup is about 6.7 times.
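The reported speedup follows directly from the running times in Table VI. The short computation below uses those times and takes the 2-worker run as the baseline; the variable names are chosen for illustration.

```python
# Total time of MLNClean in seconds for each number of workers (Table VI).
times = {2: 50759, 4: 27574, 6: 16289, 8: 11572, 10: 7578}

# Speedup of each configuration relative to the 2-worker baseline.
baseline = times[2]
speedup = {workers: baseline / t for workers, t in times.items()}

for workers, s in speedup.items():
    print(f"{workers:>2} workers: {s:.2f}x")
```

The ratio for 10 workers is 50,759 / 7,578, which rounds to 6.7.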

8 Conclusions

In this paper, we propose a novel hybrid data cleaning framework MLNClean on top of Markov logic networks (MLNs). It combines the advantages of quantitative methods and qualitative ones, and is capable of cleaning both schema-level and instance-level errors. With the help of an effective two-layer MLN index, MLNClean consists of two major cleaning stages, i.e., first cleaning multiple data versions independently and then deriving the final unified clean data from multi-version data. In the first cleaning stage, an AGP strategy is presented to process abnormal groups (built on the MLN index). Based on a new concept of reliability score, an RSC method is developed to clean data within each group. Moreover, in the second cleaning stage, with a newly defined concept of fusion score, an FSCR algorithm is proposed to eliminate conflicts when unifying multiple data versions. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of MLNClean to the state-of-the-art approach in terms of both accuracy and efficiency. In the future, we intend to establish more sophisticated strategies to process abnormal groups, since the performance of this step significantly affects the overall performance of MLNClean.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant No. 2018YFB1004003, the 973 Program under Grant No. 2015CB352502, the NSFC under Grant No. 61522208, the NSFC-Zhejiang Joint Fund under Grant No. U1609217, and the ZJU-Hikvision Joint Project. Both Yunjun Gao and Xiaoye Miao are the corresponding authors of the work.

References

  • [1] W. W. Eckerson, “Data quality and the bottom line: Achieving business success through a commitment to high quality data,” The Data Warehousing Institute, pp. 1–36, 2002.
  • [2] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang, “Data cleaning: Overview and emerging challenges,” in SIGMOD, pp. 2201–2206, 2016.
  • [3] E. Rahm and H. H. Do, “Data cleaning: Problems and current approaches,” IEEE Data Eng. Bull., vol. 23, no. 4, pp. 3–13, 2000.
  • [4] Z. Abedjan, C. G. Akcora, M. Ouzzani, P. Papotti, and M. Stonebraker, “Temporal rules discovery for web data cleaning,” PVLDB, vol. 9, no. 4, pp. 336–347, 2015.
  • [5] L. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko, “Complexity and approximation of fixing numerical attributes in databases under integrity constraints,” in International Workshop on Database Programming Languages, pp. 262–278, Springer, 2005.
  • [6] G. Beskales, I. F. Ilyas, and L. Golab, “Sampling the repairs of functional dependency violations under hard constraints,” PVLDB, vol. 3, no. 1, pp. 197–207, 2010.
  • [7] G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin, “On the relative trust between inconsistent data and inaccurate constraints,” in ICDE, pp. 541–552, 2013.
  • [8] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for data cleaning,” in ICDE, pp. 746–755, 2007.
  • [9] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi, “A cost-based model and effective heuristic for repairing constraints by value modification,” in SIGMOD, pp. 143–154, 2005.
  • [10] X. Chu, I. F. Ilyas, and P. Papotti, “Holistic data cleaning: Putting violations into context,” in ICDE, pp. 458–469, 2013.
  • [11] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Consistency and accuracy,” in PVLDB, pp. 315–326, 2007.
  • [12] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang, “NADEEF: A commodity data cleaning system,” in SIGMOD, pp. 541–552, 2013.
  • [13] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for capturing data inconsistencies,” ACM Trans. Database Syst., vol. 33, no. 2, pp. 6:1–6:48, 2008.
  • [14] F. Geerts, G. Mecca, P. Papotti, and D. Santoro, “The LLUNATIC data-cleaning framework,” PVLDB, vol. 6, no. 9, pp. 625–636, 2013.
  • [15] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin, “BigDansing: A system for big data cleansing,” in SIGMOD, pp. 1215–1230, 2015.
  • [16] S. Kolahi and L. V. S. Lakshmanan, “On approximating optimum repairs for functional dependency violations,” in ICDT, pp. 53–62, 2009.
  • [17] A. Lopatenko and L. Bravo, “Efficient approximation algorithms for repairing inconsistent databases,” in ICDE, pp. 216–225, 2007.
  • [18] S. Krishnan, J. Wang, E. Wu, M. J. Franklin, and K. Goldberg, “Activeclean: Interactive data cleaning for statistical modeling,” PVLDB, vol. 9, no. 12, pp. 948–959, 2016.
  • [19] C. Mayfield, J. Neville, and S. Prabhakar, “ERACER: A database approach for statistical inference and data cleaning,” in SIGMOD, pp. 75–86, 2010.
  • [20] M. Yakout, L. Berti-Équille, and A. K. Elmagarmid, “Don’t be scared: Use scalable automatic repairing with maximal likelihood and bounded changes,” in SIGMOD, pp. 553–564, 2013.
  • [21] N. Prokoshyna, J. Szlichta, F. Chiang, R. J. Miller, and D. Srivastava, “Combining quantitative and logical data cleaning,” PVLDB, vol. 9, no. 4, pp. 300–311, 2015.
  • [22] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré, “Holoclean: Holistic data repairs with probabilistic inference,” PVLDB, vol. 10, no. 11, pp. 1190–1201, 2017.
  • [23] S. Giannakopoulou, M. Karpathiotakis, B. Gaidioz, and A. Ailamaki, “CleanM: An optimizable query language for unified scale-out data cleaning,” PVLDB, vol. 10, no. 11, pp. 1466–1477, 2017.
  • [24] F. Niu, C. Zhang, C. Ré, and J. W. Shavlik, “DeepDive: Web-scale knowledge-base construction using statistical learning and inference,” VLDS, vol. 12, pp. 25–28, 2012.
  • [25] P. M. Domingos and D. Lowd, Markov Logic: An Interface Layer for Artificial Intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2009.
  • [26] F. Niu, C. Ré, A. Doan, and J. W. Shavlik, “Tuffy: Scaling up statistical inference in markov logic networks using an RDBMS,” PVLDB, vol. 4, no. 6, pp. 373–384, 2011.
  • [27] W. Fan and F. Geerts, Foundations of Data Quality Management. Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2012.
  • [28] T. C. Redman, “The impact of poor data quality on the typical enterprise,” Commun. ACM, vol. 41, no. 2, pp. 79–82, 1998.