Improve3C: Data Cleaning on Consistency and Completeness with Currency

07/31/2018 ∙ by Xiaoou Ding, et al. ∙ Harbin Institute of Technology ∙ NetEase, Inc

Data quality plays a key role in big data management today. With the explosive growth of data from a variety of sources, data quality faces multiple problems at once. Motivated by this, we study joint data quality improvement on completeness, consistency and currency in this paper. For the proposed problem, we introduce a 4-step framework, named Improve3C, for the detection and quality improvement of incomplete and inconsistent data without timestamps. We compute a relative currency order among records, derived from given currency constraints, according to which inconsistent and incomplete data can be repaired effectively while considering the temporal impact. For both effectiveness and efficiency, we carry out inconsistency repair ahead of incompleteness repair. A currency-related consistency distance is defined to measure the similarity between dirty records and clean ones more accurately. In addition, currency orders are treated as an important feature in the training process of incompleteness repair. The solution algorithms are introduced in detail with examples. Thorough experiments on one real-life data set and one synthetic data set verify that the proposed method improves the cleaning of dirty data with multiple quality problems, which are hard to clean effectively with existing approaches.


I Introduction

Data quality plays a key role in data-centric applications [1]. Quality problems in data are often quite serious and disrupt data transaction steps (e.g., acquisition, copy, querying). Specifically, currency, consistency and completeness (3C for short) are three important issues in data quality [2]. For example, various information systems store data with different formats or semantics, which may lead to costly consistency problems in multi-source data integration. In addition, because of imperfect integrity standards in information systems, the records in a database may have missing values. Worse still, infrequent updates make data out-of-date to some degree, and this is hard to detect when timestamps are missing or incomplete under loose and imprecise copy functions among sources. These three problems lower the reliability of data, which adds to confusion and misunderstanding in data applications. Low-quality data sets may have a negative impact on many fields.

Researchers have gone a long way in data quality and data cleaning, particularly in consistency and completeness. It is acknowledged that consistency and completeness are likely to affect each other during repairing, rather than being completely isolated [3, 2]. We find that currency issues also seriously impact the repair of inconsistent and incomplete values. These mixed data problems are challenging both to detect and to repair, as illustrated in the following example.

Example 1

Table 1 shows a part of personal career information collected from the talent pool of different companies, describing two individuals (entities), Mike and Helen. Each record has 9 attributes. Level is an industry-recognized career rank, while Title is the post of the employee. City describes the place where the company is located, and Address records the commercial districts the Company belongs to. Email reports the current professional email.

Specifically, “ME”, “RE”, “RA” and “MR” represent major engineer, research engineer, assistant researcher, and major researcher, respectively. “Zhongguancun”, “Xuhui”, etc. are well-known landmarks of different cities in China. Abbreviations are used in Email. As the data came from multiple sources, the timestamps are missing. Inconsistency and incompleteness problems also exist in the attributes.

Name | Level | Title | Company | Address | City | Salary | Email | Group
Mike | P2 | E | Baidu | Zhongguancun | Beijing | 13k | M@Bai | Java
Mike | P2 | E | Baidu | Tongzhou | Beijing | 13k | M@Bai | Map
Mike | P2 | ME | Baidu | Zhongguancun | Beijing | 15k | M@Bai | Map
Mike | P3 | ME | Baidu | Zhongguancun | Beijing | 20k | M@Bai | Map
Mike | P4 | E | Alibaba | Zhongguancun | Beijing | 22k | M@Bai | Tmall
Mike | P4 | E | Alibaba | XiXi | Hangzhou | 22k | M@ali | Financial
Mike | P4 | RE | Alibaba | XiXi | Hangzhou | 23k | M@ali | Financial
Helen | P2 | RA | Tencent | Binhai | Shenzhen | 15k | H@QQ | Game
Helen | P3 | R | Tencent | Binhai | Shenzhen | (missing) | H@QQ | Game
Helen | P3 | R | Tencent | (missing) | (missing) | 18k | H@QQ | Financial
Helen | P3 | R | Tencent | Xuhui | Shanghai | 20k | H@QQ | Social Network
Helen | P4 | R | Microsoft | Zhongguancun | Beijing | 22k | H@outlook | Social Computing
Helen | P4 | MR | Microsoft | Zhongguancun | Beijing | 22k | H@outlook | Social Computing
Helen | P5 | (missing) | Microsoft | Zhongguancun | Beijing | (missing) | H@outlook | Social Computing
TABLE I: Personal career information for Mike and Helen
Dirty attributes | After repair | Explanation
[Address]=“Tongzhou” | “Zhongguancun” | Can be well repaired by a CFD or by record similarity.
[Address], [City], [Email] | “Xixi”, “Hangzhou”, “M@ali” | An effective repair from the currency-related consistency method.
[Company], [Group] | “Baidu”, “ML” | A poor repair that does not take currency issues into account.
[Salary] missing | “15K” | A proper clean value.
[Address] missing, [City] missing | “Xuhui”, “Shanghai” | An effective repair from the currency-related completeness method.
(same attributes) | “Binhai”, “Shenzhen” | A poor repair that fails to capture the closest current values.
[Title] missing | “MR” | An accurate and current repair.
(same attributes) | “R” | A less accurate and less current repair.
TABLE II: Repair dirty data

As outlined in red in Table 1, dirty values exist in 5 records. An incorrect address appears in Mike's second record, since Baidu (Beijing) is not located in the Tongzhou district. Mike's fifth record states that he works at Alibaba (Hangzhou). However, it reports that the city is Beijing, and that he is using a Baidu email at the same time. This is confusing, and we can conclude that inconsistent values exist in [Address], [City] and [Email], or even in [Company] and [Group], of this record. For Helen, her second, third and seventh records contain missing values. We cannot tell when she began working in Shanghai or what her current salary is.

With existing data repairing methods, we can adopt some optional repair schemes from Table 2. The incorrect address in Mike's second record can be repaired to “Zhongguancun” according to a CFD. We can give a relatively clean value, “15K”, to the missing salary in Helen's second record by referring to its most similar record, but things are not so simple when repairing the other dirty values. The company and group that Mike works in do not coincide with the city and his working email in his fifth record. It is possible to clean this record with the values of a similar Baidu record. However, Mike had actually begun working at Alibaba by the time of this record, which implies that it is more current than the Baidu record. Thus, this repair is a poor one because it does not consider the temporal issues. For Helen, the edit distance makes it difficult to distinguish which of her neighboring records is closer to the incomplete third record, and it also reveals no currency difference among them. Similarly, there seems to be no difference between repairing her last record with “R” or with “MR”, because the two candidate records are equally distant from it.

From the above, without the guidance of available timestamps, it is difficult to clean inconsistent and incomplete values. If we clean them simply with the values from their most similar records, we are likely to obtain wrongly repaired data. Thus, repairing data quality problems in currency, consistency and completeness together is in demand.

However, repairing mixed quality issues faces several challenges. Firstly, as attributes change and evolve over time, the temporal and currency features of records influence the repairing accuracy, which becomes the key point in data quality management. Moreover, as some fundamental problems are already known to be computationally hard [4, 5], repairing data with multiple error types is even more challenging. Worse still, repairing some errors may cause another kind of error. Without a sophisticated method, it may be costly to repair dirty data, because the errors introduced by data repairing must themselves be repaired iteratively.

As yet, work on cleaning multiple errors in completeness, consistency and currency is still inadequate. On the one hand, currency orders are difficult to determine when timestamps are unavailable. Existing currency repairing methods mostly depend on definite timestamps, and few works provide feasible algorithms or even models for data that lacks valid timestamps. On the other hand, though inconsistency and incompleteness coexist in databases, both issues fail to be solved explicitly.

Motivated by this, we study the repairing approach of incompleteness and inconsistency with currency. Both incompleteness and inconsistency can be solved more effectively with currency information. We use an example to illustrate the benefit of currency in data repairing.

For instance, better repairs are marked in green in Table 2. We deduce a currency order from the fact that the title of an employee in a company only rises in the real world. Thus, Mike’s title can only change from E to ME while he works in the same company. Similarly, the salary is always monotonically increasing. Mike's fifth record, with the higher salary at Alibaba, is therefore expected to be more current than his Baidu records. We repair the address, city and email of this record with “Xixi”, “Hangzhou” and “M@ali”. The dirty data possibly occurs because of the delay between changes in the real world and the database update. If working emails fail to be well repaired, both employees and companies will suffer losses.

For the dirty records of Helen, we repair the missing address and city of her third record with a CFD. It reveals that Helen had already moved to the financial group in Shanghai at P3, which improves the accuracy of her career information. With a currency order stating that R precedes MR, we know that Helen had become an MR at P4, so her sixth record is more current than her fifth. According to another currency order, her last record is the most current one now. Its missing title and salary are supposed to be filled with the most current values available, i.e., “MR” and “22k”, respectively. It indicates that Helen’s salary is no less than 22k at P5 as an MR in her group. These cases indicate the complex conditions in dirty data, and the necessity of an interactive method for data cleaning on 3C. As Table 2 shows, combining these three issues improves the accuracy of data cleaning.

Contributions. In this paper, we propose a framework for data repair that jointly considers currency, consistency and completeness, named Improve3C. To make sufficient use of the currency information hidden in the database, we propose a currency order computation method with currency constraints, which provides a reliable replacement for timestamps when the timestamps of the database are not valid. In this way, we are able to discover and exploit the internal knowledge of records in databases to maximize the repairing effectiveness. We summarize our contributions in this paper as follows:

(1) We propose a comprehensive data repairing approach for consistency, completeness and currency. To the best of our knowledge, it is the first study on data quality improvement on completeness and consistency of the data sets without reliable timestamps.

(2) We propose a 4-step framework, Improve3C, for detecting multiple data quality problems and improving quality. A total currency order is obtained by processing the currency order graph with currency constraints.

(3) Moreover, we propose a currency and consistency difference metric between the dirty data and the standard one to repair inconsistent attributes together with CFDs and currency orders. In addition, we propose a solution for repairing incomplete values with naive Bayes, where the currency order is considered as a key feature in the classification training process.

(4) We conduct a thorough experiment on both real-life and synthetic data. The experimental results verify Improve3C can detect and repair the mixed dirty data effectively. Our framework can improve the performance of the existing methods in low-quality data repairing. Our strategy also achieves high efficiency compared with the treatment of the dimensions independently.

Organization. The rest of the paper is organized as follows: Section 2 discusses the basic definitions and the overview of our method. Section 3 introduces construction and conflict detection on currency graph, and Section 4 discusses algorithms and examples for currency order determination. Section 5 (resp. Section 6) presents inconsistency repairing (resp. incompleteness imputation) process. Experimental study is reported in Section 7. Section 8 reviews the related work, and Section 9 draws the conclusion.

II Overview

In this section, we first introduce necessary background and fundamental definitions in Section II-A, and then present our framework Improve3C in Section II-B.

II-A Basic Definitions

The currency constraints (also named currency rules) are used to determine the currency of data when timestamps are not available. Definition 1 presents the semantics of the currency constraints adopted in our method, following the one proposed in [4]. We use CCs for short in the rest of this paper.

Definition 1

(Currency constraints). In the set of currency constraints, , is the total record number in dataset . and are two records in . represents the predicate in an instance of a CC. eID represent ID number to identify the same person. There are mainly three kinds of constraints regarding :
:
:
:
where , and is the set of attributes in . Value[] is the value of the attribute. is the currency order determined on .

Accordingly, we can draw the currency constraints adopted in Table 1 as follows:
: .
: .
: .
: .
: .
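As an illustration only, the sketch below encodes two currency constraints suggested by the prose of Example 1 (a title can only move from E to ME, or from R to MR, within the same company, and salary never decreases) as Python predicates over record pairs. The function names, the `TITLE_STEPS` set and the dictionary layout of a record are assumptions made for this sketch; they are not the formal CCs listed above.

```python
# Hypothetical title promotions, taken from the prose of Example 1.
TITLE_STEPS = {("E", "ME"), ("R", "MR")}


def salary_k(record):
    """Parse a salary such as '13k' into a float; return None if missing."""
    value = record.get("Salary")
    return float(value.rstrip("kK")) if value else None


def cc_salary(r1, r2):
    """r1 is less current than r2 if both salaries are known and r1's is lower."""
    s1, s2 = salary_k(r1), salary_k(r2)
    return s1 is not None and s2 is not None and s1 < s2


def cc_title(r1, r2):
    """r1 is less current than r2 if the title was promoted within one company."""
    same_company = r1.get("Company") == r2.get("Company")
    return same_company and (r1.get("Title"), r2.get("Title")) in TITLE_STEPS


def derive_orders(records, constraints):
    """Return index pairs (i, j) meaning record i is less current than record j.

    Only records of the same entity (same Name here) are compared.
    """
    orders = set()
    for i, r1 in enumerate(records):
        for j, r2 in enumerate(records):
            if i == j or r1["Name"] != r2["Name"]:
                continue
            if any(cc(r1, r2) for cc in constraints):
                orders.add((i, j))
    return orders
```

Applied to Mike's first four rows of Table 1, `derive_orders` leaves the first two records incomparable (same title, same salary), which is the kind of partial order that the later currency graph has to complete.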
Conditional functional dependencies (CFDs for short) have been developed to detect and resolve inconsistency within a data set or among data sets [6]. Sound research has been done on inconsistency repairing [7, 8]. Based on this, we adopt CFDs in our framework to improve data consistency, as discussed in Definition 2.

Definition 2

(Conditional functional dependencies). On a relation schema , is the set of all the CFDs. A CFD is defined as , where (resp. ) is denoted as the antecedent (resp. consequent) of , i.e., LHS(), (resp. RHS()). , where
is a standard FD, and
is a tableau that either is a constant value from the attribute value domain or an unnamed variable “_” which draws values from .

Accordingly, below are some of the CFDs the records in Table 1 should satisfy.
: .
: .
: .
: .
: .
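To make the role of the pattern tableau concrete, here is a minimal sketch of checking a single record against a constant CFD. The example rule (a Baidu record in Beijing should have the address Zhongguancun) is a hypothetical CFD inferred from the discussion of Example 1, not one of the CFDs listed above, and the helper names are assumptions of this sketch. Variable CFDs, whose right-hand side is the unnamed variable “_”, need a pairwise comparison between records and are not covered here.

```python
WILDCARD = "_"  # the unnamed variable of the pattern tableau


def matches(record, pattern):
    """True if the record agrees with every constant in the pattern.

    A wildcard matches any value; a missing attribute never matches a constant.
    """
    return all(
        expected == WILDCARD or record.get(attr) == expected
        for attr, expected in pattern.items()
    )


def violates(record, lhs, rhs):
    """A record violates a constant CFD if it matches the LHS but not the RHS."""
    return matches(record, lhs) and not matches(record, rhs)


# Hypothetical CFD: ([Company]="Baidu", [City]="Beijing") -> ([Address]="Zhongguancun")
LHS = {"Company": "Baidu", "City": "Beijing"}
RHS = {"Address": "Zhongguancun"}
```

With this rule, the record of Table 1 whose address is Tongzhou is flagged, while the other Baidu records pass. A record with a missing right-hand-side attribute is also flagged, which is how a CFD can detect some incomplete values.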
Further, we introduce the low-quality data with mixed problems. As mentioned above, we focus on three vital quality problems on completeness, consistency and currency, thus, the low-quality data in our study is defined in Definition 3. We outline our problem definition of 3C-data-quality repairing in Definition 4.

Definition 3

(Low-quality Data ). The schema ,…,) has no timestamps. Some missing values exist in () in , and at the same time some value pairs violate the consistency (including CFDs in Definition 2) measures. is a set including massive instances like .

Definition 4

(Problem Definition). Given a low-quality data , data quality rules including a set of CCs and a set of CFDs, and a confidence for each attributes. Data quality improvement on with completeness, consistency and currency is to detect the dirty data in and repair it into a clean one, denoted by , where
(a) has a reliable currency order value satisfying the set of CCs, denoted by .
(b) is consistent referring to the set of CFDs, i.e., .
(c) The missing values in are repaired with the clean ones whose confidence into .
(d) The repair cost is as small as possible.

Fig. 1: Framework overview of Improve3C

II-B Framework

We present the proposed 3C data repairing method Improve3C in Figure 1. Completeness and consistency are metrics that measure quality through features of values, while currency describes the temporal order or the volatility of records in the whole data set. We therefore process consistency and completeness repairing in turn, along the currency order defined in this paper. Improve3C is constructed to serve two purposes. First, each repair operation in Improve3C will not cause any new dirty data that violates one of the 3C issues. Second, no dirty data on 3C exists after Improve3C has been applied, according to the definitions proposed in this paper. We achieve an overall data repair on currency, consistency and completeness with Improve3C, which consists of four main steps.
(1) We first construct currency graphs for records with the adopted CCs, and make conflict detections in the currency graphs. If conflicts exist, the conflicted CCs and the involved records will be returned. They are supposed to be fixed by domain experts or revised from business process. This step is introduced in Section III.
(2) We then determine the currency order of records extracted from CCs. We update valid edges and find the longest currency order chain in the currency graph iteratively, and compute currency values to each record. This currency order is obtained as a direct and unambiguous metric among records on currency. Currency order determination is discussed in detail in Section IV.
(3) After that, we repair consistency issues with the global currency orders. We input consistency constraints (CFDs in this paper) first, and extract potential consistency schema from the original data set to capture undiscovered consistent tableau. After the consistency schema is determined, we define a metric Diffcc to measure the distance between dirty data and clean ones, combining consistency difference with currency orders. We repair the inconsistent data not only according to the consistency schema, but also taking into account the currency order, i.e., we repair the dirty data with proper values that have the closest current time point. The process is reported in Section V.

(4) We repair incomplete values with a Bayesian strategy in the final step because of its obvious advantages in training on both discrete and continuous attributes in relational databases. We treat currency orders as a weighted feature and train on the complete records to fill in the missing values if the filling probability is no less than a confidence measure. At this point, we achieve high-quality data on 3C. Incompleteness imputation is presented in Section VI.
Specifically, we use CFDs as consistency constraints, and other kinds of dependencies can be similarly adopted in our framework. We detect and repair consistency problems ahead of completeness in Improve3C, because we are able to repair some missing values in Table 1 that can be detected by the given CFDs. In this case, data completeness is slightly improved by the consistency solution. The data becomes more complete, which benefits the accuracy of the completeness training model. We can then clean the data more effectively for the remaining missing values, which fail to be captured and fixed by the CFDs. Moreover, the repaired part will not give rise to new violations of either currency or consistency. On the one hand, currency order has been taken into account as an important feature in the training process, so the algorithm will provide clean values with the nearest currency metrics. On the other hand, the consistency constraints will not let any record escape that has missing and inconsistent values at the same time. With respect to time costs, the computing time is also decreased in Improve3C.

III Conflict Detection in CCs

Conflict resolution of currency constraints is a necessary preprocessing step for achieving accurate and unambiguous currency order determination. As defined in Definition 5, we first construct the directed currency graph for each entity, where each vertex represents a set of records with the same currency order referring to the same entity. Accordingly, the conflicts on CCs can be identified by checking whether loops exist in the graph. Conflicts may result from either ambiguous currency constraints or definite currency problems in some records. Without credible external knowledge, these conflicts cannot be resolved. As the conflicts only happen in a small part of the data, we detect and return them for manual processing (e.g., repairing by domain experts or assigning crowdsourcing tasks [9, 10]). The time cost of conflict detection is , where is the total number of records in .

Definition 5

(Currency Graph). An entity has records in , denoted by . The directed graph is the currency graph of , where represents the currency order of the records () in concluded by CCs. Each represents a set of records with the same currency order, denoted by . For , in , if has higher currency order than , i.e., , there is an edge from pointing to , , and otherwise .
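A minimal sketch of conflict detection on a currency graph follows. It assumes that edges point from the less current record to the more current one, that vertices are plain record indices (the merging of records with equal currency order into one vertex is omitted), and it uses Kahn's algorithm to find loops; the function names are hypothetical and this is not claimed to be the authors' implementation.

```python
from collections import defaultdict, deque


def build_currency_graph(n, orders):
    """Build an adjacency map for n records from (older, newer) index pairs."""
    succ = defaultdict(set)
    for older, newer in orders:
        succ[older].add(newer)
    return {v: succ[v] for v in range(n)}


def find_conflicts(graph):
    """Return the vertices that cannot be topologically removed.

    These vertices lie on a loop or are reachable only through one,
    so they are the records to hand over for manual inspection.
    An empty result means the graph is acyclic.
    """
    indeg = {v: 0 for v in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1
    queue = deque(v for v, d in indeg.items() if d == 0)
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return sorted(v for v, d in indeg.items() if d > 0)
```

For three records ordered 0, 1, 2 the result is empty; adding a contradictory order from 2 back to 0 makes all three records part of a reported conflict.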

Example 2

According to Definition 5, we construct the currency graph for and in Example 1 in Figure 2. We deduce from the CCs in Section 2.1 that , , , , and in Figure 2(a). (resp. ) is merged to be vertices (resp. ), as they share the same currency order. Thus, is constructed in Figure 2(b), where and . Similarly, with , , , , , is constructed in Figure 2(d), where , and , .

Fig. 2: Currency graphs for Example 1, panels (a)-(d); panels (b) and (d) show the constructed currency graphs of the two entities.

IV Currency Order Determination

Since CCs can only describe partial orders among values on several target attributes, the currency order of some records still cannot be deduced. Under these circumstances, data for which no currency order can be reasoned from CCs is hard to evaluate on currency. This motivates us to determine data currency on the whole data set. We compute and assign currency values to all the vertices in the currency graph, which achieves an approximate currency order for records.
The currency graph becomes a directed acyclic graph after conflict detection. We assign currency order values to all of its vertices to make all the records comparable on currency. An intuitive approach is to perform topological sorting on the graph and determine the currency order from the sorting result. Unfortunately, the topological sorting result is not always stable [11]: it can be influenced by the order of graph construction or other external factors. For this reason, we propose a currency order determination method that computes currency values more precisely. To some extent, the currency order is a replacement for timestamps when the real timestamps are not available in the database. Accordingly, the currency of data is uncovered, and the metrics on it assist data quality resolution on both consistency and completeness.
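The instability of plain topological sorting can be seen on a four-vertex graph. The sketch below is an illustration with a hypothetical `topo_order` helper: it runs Kahn's algorithm with a configurable tie-breaking key, and two different keys give two different, equally valid orders for the incomparable vertices `b` and `c`.

```python
import heapq


def topo_order(vertices, edges, key):
    """Kahn's algorithm; ties among ready vertices are broken by key(v)."""
    succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    heap = [(key(v), v) for v in vertices if indeg[v] == 0]
    heapq.heapify(heap)
    order = []
    while heap:
        _, u = heapq.heappop(heap)
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                heapq.heappush(heap, (key(v), v))
    return order


VERTICES = ["a", "b", "c", "d"]
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

forward = topo_order(VERTICES, EDGES, key=lambda v: v)          # a, b, c, d
backward = topo_order(VERTICES, EDGES, key=lambda v: -ord(v))   # a, c, b, d
```

Ranking records by their position in either list would make `b` look older than `c` in one run and newer in the other, although the constraints say nothing about the two.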
In currency graphs like in Example 2, the currency-comparable records of the same entity make up chains, which assists to determine currency values of the graph. We now present the definition of the currency order chains in Definition 6. Accordingly, the directed edge connects two elements (vertices) and in a currency order chain, where , i.e., the records represented by are more current than the ones in .

Definition 6

(The Currency Order Chain) is a currency order chain of the currency graph , iff.
(a) , there exists an edge , and and , and
(b) , then .

When determining currency orders, we are supposed to assign values to the currency order chains in first. In order to achieve a uniform and accurate determination of currency orders, we propose the currency value computing approach following two steps: (1) We compute and update the currency order bounds of the vertices in , and (2) find the present longest valid chains and value each element in it in ascending order, denoted by CurrValue(), (). We recursively repeat the two steps until all the chains have been visited and all the vertices are valued.

Input: the currency graph of the entity
Output: = (CurrValue(), )
1 add s and t to , let s points to all 0 in-degree edges and t be pointed from all 0 out-degree edges;
2CurrValue(s), sup(s), inf(s) , CurrValue(t), sup(t), inf(t) ;
3 while  CurrValue() has not been determined,  do
4      UpdateValid();
5       getMaxCandS(), ;
6       Value inf(), Inc ;
7       for  do
8            CurrValue() Value + Inc;
9      
return = (CurrValue(), );
Algorithm 1 CurrValue

When finding , each CurrValue() is computed depended on the possible minimum and maximum values of , as well as the relative position of in the involved . We adopt the currency order bound to describe these possible min and max values in Definition 7. and are vital factors for discovering currency order and updating currency values for vertices. The bounds make the value range of CurrValue() as accurate as possible.

Definition 7

(The Currency Order Bounds). When determining currency values, the upper and lower bound of a vertex in , () is defined as:
(a) The upper currency order bound of is
{CurrValue()}. represents the descendant vertex connecting from .
(b) The lower currency order bound of is
{CurrValue()}, where represents the ancestor vertex connecting to .

The whole computing process is shown in Algorithm 1. We first add a global start node s and a terminal node t to the graph to ensure that all currency values are located in the domain [0, 1]. s points to all 0-in-degree vertices, and its currency value and bounds are set to 0. Similarly, t is pointed to from all 0-out-degree vertices, and CurrValue(t)=sup(t)=inf(t)=1. After that, we begin to compute currency values of vertices.
In lines 3-11, we repeatedly find the longest candidate chain in and compute currency values of the elements in it (Algorithm 2). In the loop, we update ’s present bounds, and determine the validation of the involved edges (line 4). This function will be outlined in Algorithm 2 below.
After that, we find the present longest candidate chain in line 5 (Algorithm 3), where is the length of , i.e., the number of elements in . Next, we assign normalized currency values to each in in lines 7-10. Since that bounds are determined, we use the lower (resp. upper) bound of the first (resp. last) element inf() (resp. sup()) in to compute currency values of all elements in . Finally we obtain the valued currency graph of .
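The value assignment along one chain can be sketched as evenly spaced interpolation between the two bounds. The formula below is an assumption reconstructed from the numbers reported in Example 5 (two vertices placed between determined values 0.25 and 0.75 receive 0.417 and 0.583); the function name and its parameters are hypothetical.

```python
def assign_chain_values(lower, upper, k):
    """Spread k undetermined vertices evenly between two bounds.

    lower: lower bound of the first undetermined vertex in the chain
    upper: upper bound of the last undetermined vertex in the chain
    k:     number of undetermined vertices between the two bounds
    """
    inc = (upper - lower) / (k + 1)
    return [lower + inc * i for i in range(1, k + 1)]
```

For instance, `assign_chain_values(0.25, 0.75, 2)` gives values that round to 0.417 and 0.583, and a chain of three vertices between the global start (0) and terminal (1) nodes receives 0.25, 0.5 and 0.75.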

Example 3

We now determine currency values in and . In Figure 3(a), is found after insert s, t to the graph. For each vertex in , CurrValue()= CurrValue() + . For in Figure 3(b), we find and compute CurrValue() in it to be . After that, only remain ’s currency value has not been determined. We use and to obtain CurrValue()= 0.335.

(a)
(b)
Fig. 3: Determine currency values for and

Next, we address the two main steps in currency order determination in detail. We introduce bounds and valid edges update process in Section IV-A, and discuss the longest candidate chain discovery in IV-B.

IV-A Updating Bounds and Valid Edges

As mentioned above, a chain reveals a length of transitive currency orders deduced from part of currency order described by CCs, and different chains may come cross through vertices. Thus, not all edges contribute to find the longest chain of during each iteration. During the computing course, we are supposed to determine whether a vertex can make up by computing the bounds of it.
The edges selected to form are called valid edges in this paper. That is, the candidate exists in the currency order chains forms with valid edges. We update the validation of the present edges with Definition 8 during each iteration. Thus, we can effectively find according to these valid edges (discussed in Section IV-B).

Definition 8

(Validation of Edges). The edge is a valid edge () under three cases:
(a) If both CurrValue() and CurrValue() has not determined, is a valid edge iff. = , and = .
(b) If CurrValue() is determined and CurrValue() is not, is a valid edge iff. = .
(c) If CurrValue() is determined and CurrValue() is not, is a valid edge iff. = .

Note that if the currency values on both and is determined, is certainly not a valid edge, because and have been already visited in previous iterations. As we have obtained their currency values, will not be valid in the present updating function.

Input: the currency graph , sup and inf
Output: the updated , sup and inf.
1 mark all the of as invalid edges;
2 UpdateOneWay(, inf, );
3 UpdateOneWay(, sup, );
4 foreach  do
5       if (inf[] = inf[] CurrValue() is not determined) and (sup[] = sup[] CurrValue() is not determined) then
6            label as a valid edge;
7      
8Function UpdateOneWay(, bound {sup, inf}, {});
9 while  with 0 in-degree do
10       foreach  do
11             if (bound[], bound[]) then
12                   bound[] bound[];
13            
14      ;
15      
16end Function;
17 restore all ;
18 return the updated , sup, inf;
Algorithm 2 UpdateValid

Algorithm 2 shows the update process of the valid edges and the bounds. We first mark all edges in as invalid edges. We use the vertex with its determined CurrValue() to update the lower bound inf() of the vertices reachable from . Similarly, we update sup() on the converse graph of (Line 2-3). Both bounds are updated via a one-way function UpdateOneWay. During the function, (we might as well take inf updating for example), we recursively chose a with 0 in-degree, and enumerate all . We compare the inf values between and . If inf() inf(), inf() will be updated with inf() (Lines 12-13). After all are processed, we (temporarily) removed from (Line 16). After the function, we enumerate all edges in , and determined whether the edge is a valid one according to Definition 8 (Lines 5-6). Finishing validation determination, we recover the vertices deleted in previous iterations and with updated bounds and labeled valid edges will be returned to Algorithm 1.
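The bound propagation can be sketched as one pass in topological order per direction. This is a simplified reading of UpdateOneWay under stated assumptions: the comparison lost from the pseudocode is taken to be "keep the larger value" for inf and "keep the smaller value" for sup on the reversed graph, undetermined vertices start with inf = 0 and sup = 1, and a determined vertex has both bounds equal to its currency value. Edge validation is left out, and all names are hypothetical.

```python
from collections import deque


def update_one_way(vertices, edges, bound, pick):
    """Push bound values along the edges, visiting vertices in topological order."""
    succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            bound[v] = pick(bound[v], bound[u])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return bound


def update_bounds(vertices, edges, curr_value):
    """Return (inf, sup) for every vertex given the determined currency values."""
    inf = {v: curr_value.get(v, 0.0) for v in vertices}
    sup = {v: curr_value.get(v, 1.0) for v in vertices}
    update_one_way(vertices, edges, inf, max)                         # from ancestors
    update_one_way(vertices, [(v, u) for u, v in edges], sup, min)    # from descendants
    return inf, sup
```

On a small graph where `a` = 0.25 and `c` = 0.75 are already determined, a vertex lying between them is bounded by [0.25, 0.75], while a vertex reached only from the start node but also leading to `c` is bounded by [0, 0.75].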
Since the structure of and is simple, we discuss another case in Example 4 to present the steps of our method. It is clear and valid to show how the method works on the records with a more complex currency relations.

Example 4

Figure 4 shows a currency graph , and the present longest chain is in Figure 4(a), with the present valid edges are marked in blue lines. With the computed CurrValue() (), we update bounds of the rest vertices, i.e., , and find next in the rest chains. In Figure 4(a), and all reach , which is the vertex with the min currency value among descendant vertices of them. According to Definition 7, sup(, ) = CurrValue()=0.75. only reaches , so sup() = CurrValue()=0.875. Similarly, in the converse graph of , the max{CurrValue()} reachable from is , while and reach in Figure 4(b). inf() = CurrValue() =0.125, and inf()=CurrValue()=0.25.

(a) Update sup for
(b) Update inf for
(c) Find present valid edges
(d) Find candidate
(e) Find out
(f) Find present valid edges
(g) Find out
(h) Find out
Fig. 4: Examples of updating valid edges

As currency values of are not determined, and (resp. ) has the same sup and inf. and are marked valid according to Definition 8(a). Similarly, and (resp. and ) are valid referring to Definition 8(a) (resp. Definition 8(c)). The valid edges are marked in orange lines in Figure 4(c).

Input: the currency graph
Output: the longest candidate chain
1 Depth 0, pre[ ] Null;
2 endDepth , endPoint Null;
3 while ) with 0 in-degree do
4       if CurrValue() is determined then
5             if Depth[] endDepth then
6                  endDepth Depth[];
7                   endPoint ;
8                  
9            Depth[] 0;
10            
11      foreach  do
12             if  is a valid edge and Depth[]+1 Depth[] then
13                   Depth[] Depth[]+1;
14                   pre[] ;
15                  
16            
17      delete from ;
18      
19 the with endPoint and pre[ ];
20 restore all ;
return ;
Algorithm 3 getMaxCandS

IV-B Finding the Longest Candidate Chain

We now introduce how to find the longest candidate chain . As the bounds and valid edges are updated (in each iteration), we discover among the vertices connected by valid edges. We first present the definition of candidate chains in Definition 9.

Definition 9

(The Candidate Currency Order Chain). A currency order chain is a candidate one, denoted by , iff
CurrValue() and CurrValue() are known, where and are the first and last elements of , respectively.
, the directed edge is a valid edge.

Based on breadth-first search, the algorithm getMaxCandS finds the current longest candidate currency order chain among all valid edges. The pseudocode is outlined in Algorithm 3. We perform topological sorting in lines 3-18 until all vertices in have been visited. According to Definition 9, cannot contain such that CurrValue() is determined. Thus, when the sorting process arrives at line 4, we update the current chain. For whose CurrValue() is not computed, we enumerate all edges beginning from , and update the depth of each along valid edges (lines 12-16). If we reach an invalid edge, we abandon the present chain because it can no longer form a . We finally restore the edges deleted in the previous steps and obtain .
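The search described above can be sketched in Python as a longest-path computation over a topological order. This is an illustrative sketch rather than the paper's Algorithm 3. The name `get_max_cand_chain` and its arguments are hypothetical. It also assumes that a chain must start at a determined vertex (tracked with depth -1 for "no chain yet") and must contain at least one undetermined interior vertex, which is why `end_depth` starts at 1.

```python
from collections import defaultdict, deque


def get_max_cand_chain(vertices, edges, valid, determined):
    """Return the longest candidate chain as a list of vertices, or [].

    vertices:   list of vertex ids of a DAG
    edges:      list of (u, v) pairs
    valid:      set of edges currently marked valid
    determined: set of vertices whose currency value is already known
    """
    adj = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1

    depth = {v: -1 for v in vertices}   # -1: not on any chain from a determined start
    pre = {v: None for v in vertices}
    end_depth, end_point = 1, None      # assumption: require an interior vertex

    queue = deque(v for v in vertices if indeg[v] == 0)
    while queue:                        # Kahn-style topological traversal
        u = queue.popleft()
        if u in determined:
            if depth[u] > end_depth:    # a chain closes at this determined vertex
                end_depth, end_point = depth[u], u
            depth[u] = 0                # and a new chain may start from it
        for v in adj[u]:
            if (u, v) in valid and depth[u] >= 0 and depth[u] + 1 > depth[v]:
                depth[v] = depth[u] + 1
                pre[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)

    # Walk predecessors back from the end point to the determined start.
    chain, node = [], end_point
    while node is not None:
        chain.append(node)
        if node in determined and node != end_point:
            break
        node = pre[node]
    return chain[::-1]
```

Unlike the pseudocode, this version does not delete and restore edges; decrementing in-degree counters has the same effect on the traversal order.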

Example 5

We continue from Example 4 and find in . As the valid edges have been marked in orange in Figure 4(d), we find the candidate chains beginning with the 0-in-degree valid vertices, i.e., , and , and let them be , and , respectively. We update the depth of with the valid and . When it reaches , it is no longer a , because is not valid. Thus, the depth of is 3. Similarly, as and are valid, we update Depth() = 4 when it finally reaches , while reaches and Depth() = 2. Thus, the present longest candidate chain is obtained, i.e., {} in Figure 4(e). We compute CurrValue() = 0.417 and CurrValue() = 0.583, according to the currency values of and .
The determined vertices are marked in blue in Figure 4(f), and we iteratively carry out the above steps. The third longest is {}; thus, CurrValue() = 0.222 and CurrValue() = 0.320. Finally, we obtain CurrValue() = 0.729, which completes the currency value determination on the whole graph.
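The currency values reported in this example are consistent with spacing the interior vertices of a candidate chain evenly between its two determined endpoints. The sketch below works through that arithmetic. The even-spacing rule and the endpoint values are inferred from the numbers in Examples 4 and 5 rather than stated explicitly in this section, and `interpolate_chain` is a hypothetical helper name.

```python
def interpolate_chain(low, high, n_interior):
    """Evenly space n_interior currency values strictly between low and high.

    low, high:  currency values of the determined start and end of the chain
    n_interior: number of undetermined vertices on the chain
    Values are rounded to three decimals, as in the example.
    """
    step = (high - low) / (n_interior + 1)
    return [round(low + step * k, 3) for k in range(1, n_interior + 1)]


# Chain with endpoints 0.25 and 0.75 and two interior vertices.
second = interpolate_chain(0.25, 0.75, 2)

# Chain with endpoints 0.125 and 0.417 (the rounded value found above).
third = interpolate_chain(0.125, 0.417, 2)

# Single remaining vertex between 0.583 and 0.875.
last = interpolate_chain(0.583, 0.875, 1)
```

Under these assumptions, the three calls give the pairs and the single value quoted in the example (0.417 and 0.583, then 0.222 and 0.320, then 0.729).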

Complexity. UpdateValid (Algorithm 2) and getMaxCandS (Algorithm 3) are the two steps within the outer loop of Algorithm 1, CurrValue. In Algorithm 2, the UpdateOneWay function costs time to update the bounds of all vertices. It takes to determine the valid edges. Thus, Algorithm 2 runs in time. Algorithm 3 takes in total to find . When computing currency values, the outer loop (lines 3-11) in Algorithm CurrValue costs in the worst case. Thus, Algorithm CurrValue takes in total.
After the currency orders of the records are determined, we repair the inconsistent and incomplete data. To ensure that no violation of either consistency or completeness remains after the whole repair, we address the inconsistency issues first and then resolve the incompleteness ones.

V Inconsistency Repair

As mentioned above, as attribute values evolve across records, currency and consistency issues, as well as the interaction between them, are critical to repairing dirty data that violates the constraints (such as CFDs). To clean inconsistencies effectively, we propose an inconsistency repair method that uses the currency orders obtained above. We first present the idea of potential consistency schema extraction in Section V-A, and then introduce the consistency repair algorithm ImpCCons together with a case study in Section V-B.

V-A Consistency Schema Extraction

CFDs are used as a general consistency measure and data quality rule to describe whether the data is clean or not [4]. Meanwhile, it cannot be ignored that high-quality CFDs are difficult both to design manually and to discover automatically. As a result, some relation schemas among attribute values in a given data set may fail to be captured. In the third step of Improve3C, we therefore extract reliable relations among the records in , in addition to the CFDs, in order to detect and repair violations in the data more precisely and sensitively.
For a potential relation on some attributes that cannot be captured by CFDs, we count the total number of occurrences of a schema in , in the same form as a CFD in Definition 2. If the ratio between and the total number of records reaches a given threshold, i.e., , we call a reliable schema in . Such is added to the consistency constraint set . The expanded set is then applied to guide the repair of the inconsistent data in .
This step can be treated as an optional one that captures consistency dependencies specific to a given data set beyond the CFDs, if necessary. Work on discovering reasonable functional dependencies [12, 13] can guide the setting of in our framework. We omit the detailed explanation of the extraction due to limited space.
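Since the extraction details are omitted, the following is only a minimal sketch of the counting idea described above: count how often each value pattern occurs over a chosen attribute set and keep the patterns whose share of the records reaches the threshold. The function name `reliable_patterns`, the dictionary-based record layout, and the choice to skip records with missing values are all assumptions made for illustration.

```python
from collections import Counter


def reliable_patterns(records, attrs, threshold):
    """Return {pattern: ratio} for value patterns frequent enough to be reliable.

    records:   list of dicts (attribute name -> value, None if missing)
    attrs:     attribute names that make up the candidate schema
    threshold: minimum ratio of records that must share a pattern
    """
    counts = Counter(
        tuple(r.get(a) for a in attrs)
        for r in records
        if all(r.get(a) is not None for a in attrs)  # assumed: skip incomplete rows
    )
    total = len(records)
    return {p: c / total for p, c in counts.items() if c / total >= threshold}
```

A pattern returned here would then be added to the constraint set in the same form as a CFD; how the attribute sets themselves are chosen is left open, as in the text.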

V-B Algorithm ImpCCons

We now propose the repair method, which considers the effect of both consistency and currency. Intuitively, to repair a dirty record with an (at least relatively) clean one, we need to measure the distance (sometimes, the cost) between and the standard schema. In our method, we first detect each record that violates any in , and then compute the consistency-currency distance between and its neighboring clean records. We also compute the distance between and the tableau of the violated by . We repair with the candidate at the minimum distance.
To address the interaction between consistency and currency, we measure the difference between and the standard one by the consistency distance together with the currency distance, i.e., we look for the consistent data with the closest currency value. We first present the distance functions on consistency and currency, respectively.
Equation (1) shows the consistency distance between two records, denoted by cons. Bin is a Boolean function that if , and , otherwise. Equation (2) shows the consistency distance between a record and a , measuring the distance on both LHS() and RHS() as in Definition 2. (resp. ) is the number of attributes in (resp. ). In general, Equations (1) and (2) measure the consistency distance as the ratio of the number of violations among the involved attributes. This kind of distance is widely adopted in record distance and similarity measurement [4].
It is traditionally assumed that there are fewer violations in LHS() than in RHS(), and repair methods usually focus on the violations in RHS(). However, violations in LHS() make things even worse and may result in detection mistakes. Thus, we treat LHS and RHS equally when computing the consistency distance.

(1)
(2)

Equation (3) measures the currency distance as the difference in currency values. represents the difference between the currency values of and as determined above. () is a threshold, which can be set by users or learned from data, describing the maximum tolerable difference between the currency values of and . If , which means the currency gap between and is too large for to serve as a reference in currency comparison, we set .
Specifically, we set . The currency distance guarantees that is closer to its neighbor records in currency order, and has a certain distance from the CFD schema, whose currency is indefinite to some degree.
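To make the three distance functions concrete, here is a hedged Python sketch. It follows the prose description (a ratio of violating attributes, and a thresholded currency gap) rather than the exact Equations (1)-(3). Several details are assumptions: Bin is taken to be 1 when two values differ and 0 otherwise, LHS and RHS violations are normalised jointly, "_" stands for an unnamed variable in a pattern tableau, and `far_value` is a hypothetical placeholder for the value assigned when the gap exceeds the threshold.

```python
def cons_distance(r1, r2, attrs):
    """Ratio of attributes in attrs on which two records disagree."""
    differing = sum(1 for a in attrs if r1.get(a) != r2.get(a))
    return differing / len(attrs)


def cons_distance_to_pattern(record, lhs, rhs):
    """Ratio of LHS and RHS constants of a CFD pattern that the record violates.

    lhs, rhs: dicts mapping attribute -> constant, or "_" for a wildcard.
    LHS and RHS are weighted equally, one unit per attribute (assumption).
    """
    def violations(side):
        return sum(1 for a, c in side.items() if c != "_" and record.get(a) != c)

    return (violations(lhs) + violations(rhs)) / (len(lhs) + len(rhs))


def curr_distance(cv1, cv2, theta, far_value=1.0):
    """Absolute currency gap, replaced by far_value when it exceeds theta."""
    gap = abs(cv1 - cv2)
    return gap if gap <= theta else far_value
```

With currency values 0.667 and 0.5 and a threshold of 0.2, `curr_distance` returns roughly 0.167, matching the kind of gap reported later in Example 6.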

Input: after algorithm 1, , , , , and .
Output: the data after consistency repair:
1 add s into with ;
2 foreach  do
3       foreach  Vio() do
4             Null, Diffcc ;
5             foreach  do
6                  Diffcc ;
7                   Diffcc Diffcc, ;
8                  
9            update with (, Diffcc);
10            
11      
12return ;
Algorithm 4 ImpCCons
(3)

We now propose the distance metric between records over both dimensions, named Diffcc, in Definition 10.

Definition 10

(Diffcc). The currency-consistency difference between two records and is denoted by,

(4)

where and are the distance functions defined on consistency and currency, respectively. and are weight values, and .

Algorithm 4 outlines the consistency repair process with currency. We first extract potential relation schemas from and add them to . In the outer loop (lines 2-11), we check whether each is satisfied. The records violating a certain are collected in the set Vio(). In the inner loop (lines 3-9), for each in Vio(), we enumerate its neighbor records (selected by ) in and the schema in to compute the Diffcc between and each of them. We update the present minDiffcc and store the corresponding (line 7). After finishing this loop, we repair with according to minDiffcc, and obtain a consistent data set .
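The inner loop of this process can be sketched for a single violating record as follows. This is a simplified illustration, not the paper's ImpCCons: it considers only neighbor records (not the CFD tableau), takes Diffcc to be a weighted sum of the consistency and currency distances with assumed default weights of 0.5 each, and uses a hypothetical function name and record layout.

```python
def repair_record(dirty, dirty_cv, clean_neighbors, attrs, theta,
                  alpha=0.5, beta=0.5):
    """Repair one violating record from its closest clean neighbor.

    dirty:           dict, the violating record
    dirty_cv:        currency value of the dirty record
    clean_neighbors: list of (record, currency_value) pairs
    attrs:           attributes involved in the violated constraint
    theta:           maximum tolerable currency gap for a neighbor
    alpha, beta:     weights of consistency and currency (assumed defaults)
    Returns (repaired_record, min_diffcc); min_diffcc is None if no neighbor fits.
    """
    best, best_score = None, float("inf")
    for rec, cv in clean_neighbors:
        gap = abs(dirty_cv - cv)
        if gap > theta:               # too far apart in currency to be a neighbor
            continue
        cons_d = sum(dirty.get(a) != rec.get(a) for a in attrs) / len(attrs)
        score = alpha * cons_d + beta * gap   # Diffcc as a weighted sum
        if score < best_score:
            best, best_score = rec, score

    if best is None:
        return dict(dirty), None
    repaired = dict(dirty)
    repaired.update({a: best.get(a) for a in attrs})  # copy the involved values
    return repaired, best_score
```

Running this once per record in each violation set, and once per constraint, gives the nested loop structure described above.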

Example 6

We now present the repair of in Table 1. We first find the neighbor records of with Curr() = 0.2. As belongs to in , and are selected, since Curr() = 0.667 - 0.5 = 0.167 and Curr() = 0. With algorithm ImpCCons, we then detect that violates and , as mentioned in Section 3.1. Thus, we compute the Diffcc of with , and with . Diffcc() is computed as

(5)

Similarly, Diffcc()=0.177, and Diffcc() is,

(6)

Also, Diffcc() = 0.267. turns out to be the closest neighbor of . We repair the dirty part of as [Address], [City] and [Email]. Errors in can also be captured by Algorithm 4, and we are able to repair it to be [Address] and [City] ahead of the incompleteness repair step.

Complexity. In Algorithm 4, the outer loop in lines 2-10 takes time to detect the violations of each , where is the number of consistency constraints. Within the loop, it costs on average to compute and repair the values that violate consistency, where represents the number of violating values, is the number of neighbor records (far smaller than ), and is the number of in a . Putting it together, Algorithm ImpCCons costs on average.
During consistency repair, we treat missing values captured by the given as a kind of consistency violation. We are able to repair them by in the third step of Improve3C, so we do not need to repair those values in the completeness repair step. Note that Algorithm 4 assumes that there is no conflict or ambiguity between the given CFDs and CCs. Work has been done (e.g., [8]) on conflict resolution between CFDs and CCs, which is applied in the preprocessing of our method.

an incomplete record,
the domain of the missing values,