Modern data-intensive applications critically rely on high-quality data to ensure that analyses are meaningful and do not fall prey to the garbage in, garbage out (GIGO) syndrome. In constraint-based data cleaning, dependencies are used to specify data quality requirements. Data that are inconsistent with respect to the dependencies are identified as erroneous, and modifications to the data are performed to re-align the data with the dependencies. Past work has focused mainly on functional dependencies (FDs). In recent years, several extensions to the notion of an FD have been studied, including order dependencies (ODs) [2, 3, 4, 5, 6, 7], which express rules involving order. Many interesting application domains manage datasets that include ordered attributes, such as timestamps, numbers and strings.
We introduce a novel data quality rule, the approximate band conditional OD (abcOD). Band ODs express order relationships between attributes while tolerating small variations that cause traditional ODs [2, 3, 4, 5, 6, 7] to be violated without an actual violation of application semantics. To match real-world scenarios, we allow band ODs to hold conditionally over subsets of the data and approximately with some exceptions, with a mix of ascending and descending order.
Table I contains sample releases of the Music dataset (Reprise Records) from Discogs (www.discogs.com). For tracking purposes, music companies assign a catalog number to each release of a particular label. A certain pattern between the release date (encoded using the year and month attributes) and the catalog number can be observed. When the data are lexicographically ordered by the catalog number, the release date is also approximately ordered over subsets of the data (with ascending and descending directions).
|release||country||year||month||catalog number|
|Take A Picture||US||2000||11||9 16889-4|
|One Week||US||1998||9||9 17174-2|
|Only If…||US||1997||11||9 17266-2|
|We Run …||US||1994||12||9 18069-2|
|The Jimi…||US||1982||8||9 22306-1|
|Vonda Shepard||US||1989||3||9 25718-2|
|Ancient Heart||US||Null||7||9 25839-2|
|Twenty 1||US||1991||5||9 26391-2|
The year and month attribute pair is approximately ordered within subsets of the tuples called series. Release dates are ordered within each series with some small exceptions, e.g., a tuple may have a smaller catalog number than another tuple and yet be released a few months later. This is common in the music industry, as a catalog number is often assigned to a record at the production stage, before it is actually released. Thus, tuples with delayed release dates will slightly violate an OD between the catalog number and the release date. A permissible range to accommodate these small variations is called a band.
The year attribute also has a missing value and an erroneous value that severely break the OD between the catalog number and the release date, despite the ascending trend within the series. We verified the correct year of the erroneous tuple through an online search. Table II shows statistics of violations of the order dependency between the catalog number and the release year for the top-5 labels in the Music dataset, reporting the number of missing and inconsistent year values.
|label||# total releases||# missing years||# incorrect years|
As another example, the order in which a vehicle leaves an assembly line is tied to a particular factory of the car manufacturer. The first character of the vehicle identification number (VIN) denotes where a car was built and the next two characters denote the manufacturer of the car. Since the following characters of VINs are assigned sequentially over time, the VIN and the attribute recording the completion time of the product are conditionally ordered over subsets of the data. There are small variations to the OD between these attributes, as a VIN is assigned to a car before it is manufactured, whereas the other attribute denotes the time of completion of the product. There are also actual errors with respect to this OD (as illustrated in Fig. 1), due to data quality issues such as incorrect human entry and imprecise information extraction methods. Fig. 1 plots a small sample of the real-world Car dataset (www.classicdriver.com) and Music dataset series (separated by vertical lines). Within each series, the data are approximately ordered with small variations. This illustrates that abcODs are needed to express the semantics of real-world data.
While data dependencies to identify data quality errors can be obtained manually through consultation with domain experts, this is known to be an expensive, time-consuming, and error-prone process [1, 2, 3, 6]. Thus, automatic approaches to discover data dependencies that identify data quality issues in ordered data are needed. The key technical problem that we study is how to automatically discover abcODs. Automatically discovered abcODs can then be manually validated by domain experts, which is a much easier task than manual specification.
There are two variants of data dependency discovery algorithms. The first is a global approach that automatically finds all dependencies that hold in the data [1, 3, 6, 7]. The second is a relativistic approach that finds subsets of the data obeying the expected semantics [2, 8], which is laborious to do manually. We apply a hybrid approach to the discovery of abcODs that combines these two approaches.
To automatically identify candidates for the embedded band ODs without human intervention, we use a global approach from prior work by some of the authors of this paper [6, 7] to find all traditional ODs that hold within an approximation ratio, as the discovery of traditional ODs is less computationally intensive. The approach in [6, 7] is limited in two ways. First, it identifies ODs that do not permit small variations within a band; thus, we deliberately set the approximation ratio higher. Second, it identifies ODs that hold over the entire dataset rather than over subsets of the data; thus, we separate the data into segments using a divide-and-conquer approach. We use the traditional ODs found as candidate embedded band ODs to solve the problem of discovering abcODs, and we rank the discovered abcODs by a measure of interestingness, as not all candidate abcODs are necessarily correct.
We define the problem of abcOD discovery as an optimization problem that seeks parsimonious segments identifying large fractions of the data (gain) that satisfy the embedded band OD with few violations (cost). This ensures that in each series, the majority of tuples satisfy a band OD, and outlier values that severely violate monotonicity are few and sparse. To decrease the cognitive burden of human verification, the normalized ratio of gain to cost serves as a measure of interestingness of discovered abcODs, which allows the dependencies to be ranked.
We make the following contributions in this paper.
We enhance dependency-based data cleaning with abcODs as a novel integrity constraint. We define band ODs based on small variations causing traditional ODs to be violated without an intrinsic violation of application semantics. To make band ODs applicable to real-world datasets, we relax their requirements to hold approximately with some exceptions and conditionally on subsets of the data with a mix of ascending and descending directions. We develop a method to automatically compute the band-width to allow for small variations (based on longest monotonic bands defined below).
We formulate the abcOD discovery problem as a constrained optimization problem to identify large subsets of the data that satisfy band ODs with a small number of violations. We propose algorithms to automatically discover abcODs from data in order to alleviate the burden of specifying them manually. The naive solution of considering all possible segmentations of tuples is prohibitively expensive, as it leads to exponential time complexity in the size of the data. Thus, we devise discovery algorithms based on the proposed notion of longest monotonic bands (LMBs) to identify the longest subsequences of tuples that satisfy a band OD. LMBs are at the heart of abcOD discovery, as they reduce the search space and can be computed efficiently in super-linear time. We devise a dynamic programming algorithm based on LMBs that finds the optimal solution in polynomial time (super-cubic complexity). To further decrease the search space over large datasets, we propose a greedy algorithm based on pieces, which are contiguous sequences of tuples. Our greedy algorithm is orders of magnitude faster than the optimal algorithm without sacrificing precision. When bidirectionality is removed to consider unidirectional abcODs, we show that the pieces-based algorithm finds the optimal solution.
We experimentally demonstrate the effectiveness and scalability of our solution, and compare our techniques with baseline methods on both real-world and synthetic datasets.
We use the following notational conventions.
Relations. denotes a relation schema and denotes a specific table instance. Italic letters from the beginning of the alphabet and denote single attributes. Also, and denote tuples in and denotes the value of an attribute in a tuple . denotes the domain of an attribute .
Lists. Bold letters from the end of the alphabet X, Y and Z denote lists of attributes. denotes an explicit list of attributes. denotes the domain of X, where . denotes the value of the list of attributes X in the tuple .
Let be a distance function defined on the domain of X. Distance function satisfies the following properties: anti-symmetry, triangle inequality and identity of indiscernibles. We consider , where denotes the norm of the value list .
We model an order specification as a directive to sort a dataset in ascending or descending order.
Definition II.1 (Order Specification)
An order specification is a list of marked attributes, denoted as . There are two directionality operators: and , indicating ascending and descending ordering, respectively. As shorthand, indicates Y and indicates Y .
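To illustrate what an order specification directs, the following is a minimal Python sketch that sorts rows lexicographically by a list of marked attributes. The function name `sort_by_order_spec`, the dictionary-based row representation and the `"asc"`/`"desc"` markers are illustrative assumptions and not notation from this paper.

```python
def sort_by_order_spec(rows, spec):
    """Sort rows lexicographically by a list of marked attributes.

    rows: list of dicts mapping attribute name -> value (assumed layout).
    spec: list of (attribute, direction) pairs, direction in {"asc", "desc"}.
    """
    result = list(rows)
    # Python's sort is stable, so sorting by the last attribute first and the
    # first attribute last yields a lexicographic order over the whole list.
    for attribute, direction in reversed(spec):
        result.sort(key=lambda row: row[attribute],
                    reverse=(direction == "desc"))
    return result


releases = [
    {"year": 1998, "month": 9},
    {"year": 1997, "month": 11},
    {"year": 1998, "month": 12},
    {"year": 1994, "month": 12},
]
ordered = sort_by_order_spec(releases, [("year", "asc"), ("month", "desc")])
```

Here the rows are ordered by ascending year and, within the same year, by descending month.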
Definition II.2 (Operator )
Let be a list of marked attributes in a relation and let be a constant value. For two tuples , if
and ; or
Let be the operator , where .
Definition II.3 (Band Order Dependency)
Given a band-width , a list of attributes X, a list of marked attributes over a relation schema , a band order dependency (band OD) denoted by holds over a table if implies for every tuple pair .
Without loss of generality, band ODs specify that when tuples are ordered increasingly on the antecedent of the dependency, their values on the consequent of the dependency must be ordered non-decreasingly or non-increasingly within a specified band-width. (Since both ascending and descending trends are allowed, the consequent of the dependency is a list of marked attributes.) We describe how to automatically compute band-width in Sec. III-D. Note that traditional ODs [3, 4, 5, 6, 7] are a special case of band ODs, where .
In real-world applications, band ODs often hold approximately with some exceptions to accommodate errors (missing and inaccurate values) and conditionally over subsets of the data (series).
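The following Python sketch shows one plausible reading of a band OD check over a single numeric consequent attribute. It assumes the tuples are already sorted by the antecedent and that the band operator tolerates a value falling at most `delta` below any earlier value (for the ascending direction). The function name `satisfies_band_od`, the treatment of `None` as a missing value to be skipped, and this reading of the operator are assumptions, not the paper's formal definition.

```python
def satisfies_band_od(values, delta, ascending=True):
    """Check whether values (already sorted by the antecedent) stay in a band.

    Assumed reading: for every pair of positions i < j, values[j] may fall
    below values[i] by at most delta (mirrored for the descending direction).
    Missing values (None) are skipped.
    """
    running_extreme = None  # largest value seen so far (after sign flip)
    for value in values:
        if value is None:
            continue
        x = value if ascending else -value
        if running_extreme is not None and x < running_extreme - delta:
            return False
        running_extreme = x if running_extreme is None else max(running_extreme, x)
    return True


years = [1982, 1989, 1988, 1991]
within_band = satisfies_band_od(years, delta=1)     # 1988 is within 1 of 1989
traditional = satisfies_band_od(years, delta=0)     # a traditional OD is violated
```

With `delta=0` the check degenerates to a traditional OD test, which matches the remark that traditional ODs are a special case of band ODs.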
There are three series in Table I: wrt , wrt and wrt . There is a tuple with an erroneous (; correct is 1995), and a tuple with a missing (; correct is 1988).
We desire parsimonious series that identify large subsets of the data that satisfy a band OD with few violations. We formally define the approximate band conditional OD (abcOD) discovery problem as an optimization problem in Sec. IV.
III Longest Monotonic Bands
A naive solution to the abcOD discovery problem is to consider all possible segmentations of tuples over the dataset. The exponential time complexity is prohibitively expensive over large datasets. In order to reduce the search space, we introduce the notion of a longest monotonic band (LMB) to identify the longest subsequences of tuples that satisfy a band OD. We formally define LMBs in Sec. III-A and present how to efficiently calculate LMBs in Sec. III-B, with computation details in Sec. III-C. We use LMBs in Sec. III-D to automatically compute band-width and in Sec. IV to efficiently discover abcODs.
III-A Defining LMB
In contrast to previous methods [2, 3, 6], our definition below of longest monotonic bands allows for slight variations. Recall that when we consider a band OD and a series in Figure 2 over Table I, tuples and are considered to be correct and only a tuple is assumed to be incorrect. We define LMBs with respect to a band OD . In the remainder, denotes a sequence of tuples ordered lexicographically by X in ascending order.
Definition III.1 (Longest Monotonic Band)
Given a sequence of tuples , list of marked attributes and band-width , a monotonic band (MB) is a subsequence of tuples over , such that , where = or . The longest subsequence satisfying this condition over is called a longest monotonic band (LMB). A sequence is called an increasing band (IB) (and a longest IB (LIB) when is a LMB) if and a decreasing band (DB) (and a longest DB (LDB) when is a LMB) if .
Consider Table I ordered by and let . Suppose tuples = form one series. There is a LDB in and there are two LIBs and in . Thus, a LMB over is .
Note that LMBs are not necessarily contiguous subsequences of tuples. For example, in Fig. 2 that illustrates LMBs over three series, a LMB (that happens to be a LIB) over series with tuples – includes tuples –.
Definition III.3 (Maximal & Minimal Tuples)
Given a sequence of tuples and a list of attributes Y, a tuple is called a maximal tuple if and a minimal tuple if .
III-B Calculating LMBs
We propose an efficient approach to calculate LMBs by reducing the problem of finding a LMB in a sequence of tuples to the sub-problem of finding monotonic bands of all possible lengths. Once MBs are enumerated, the longest one is picked as a LMB. denotes the prefix of a sequence of length , i.e., , where and . The following theorem leads to an effective solution for calculating LMBs. The proofs of all theorems and lemmas can be found in the Appendix.
Theorem 1
Given a band-width , a sequence of tuples and a list of attributes Y, let be an IB with the smallest maximal tuple among all IBs with length in a prefix , and be a DB with the largest minimal tuple among all DBs with length in a prefix .
If and is a maximal tuple from , then there is a new IB of length in a prefix and its maximal tuple is the smallest so far, i.e., ; otherwise, .
If and is a minimal tuple from , then there is a new DB of length in a prefix and its minimal tuple is the largest so far, i.e., ; otherwise, .
Based on Theorem 1, to find a LMB in a sequence of tuples, it is sufficient to maintain two tuples for each possible : (1) the smallest maximal tuple of IBs of length in a prefix , and (2) the largest minimal tuple of DBs of length in .
Definition III.4 (Best tuples)
Given a sequence of tuples , band-width and a list of attributes Y, for each , are the best tuples of MBs of length in if (1) is the smallest maximal tuple of an IB with length in , and (2) is the largest minimal tuple of a DB with length in a prefix . Let initially and equal to () for and . The best tuples of monotonic band with length in a prefix satisfy the recurrence in Equation (1), where and , , .
Consider a sequence and a band-width . There are three IBs of length : , and in among which is the smallest maximal tuple. Accordingly, the same three sequences are also DBs of length , where is the largest minimal tuple. Thus, are the best tuples of MBs with length . Similarly, there is one DB with length : , where is the largest minimal tuple. There are three IBs with length 2 and among which is the smallest maximal value. Thus, are the best values of MBs of length in .
Based on Equation 1 in Def. III.4, the best tuples for monotonic bands can be computed recursively. Assume for each the best tuples for MBs with length in a prefix are found. If holds, then a new IB of length is found, where the maximal tuple is . Thus, the smallest maximal tuple is chosen between and , i.e., . Otherwise, the smallest maximal tuple among IBs with length remains unchanged, i.e., . DBs are computed analogously to IBs.
Assume over Table I, and . Fig. 3 presents a calculated matrix, where the entry denotes the smallest maximal tuple for IBs with length , and the largest minimal tuple for DBs with length in a prefix . Initially, is set to and is set to for and . We first test if can extend any MB, i.e., , . Since , a new IB with length with a maximal tuple is found. Similarly, since , a new DB with length with a minimal tuple is found. We set to . For each , and . Thus, is set to . Tuples are processed accordingly with results of best tuples reported in Fig. 3.
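The best-tuple recurrence can be sketched in Python as follows. This is a simplified quadratic-time illustration over a single numeric attribute, not the array-based procedure of Sec. III-C. It assumes that a tuple can extend an increasing band when its value is at most `delta` below the band's maximal value, and it stores the band itself next to each best value for easy reconstruction. All function and variable names are hypothetical.

```python
import math


def longest_increasing_band(values, delta):
    """Return indices of a longest increasing band (assumed semantics:
    each value is at most delta below the largest earlier value in the band)."""
    n = len(values)
    # best[k]: smallest maximal value among increasing bands of length k so far.
    best = [-math.inf] + [math.inf] * n
    bands = [[] for _ in range(n + 1)]  # bands[k]: one band achieving best[k]
    for idx, v in enumerate(values):
        # Go from long to short so best[k - 1] still refers to the prefix
        # that excludes the current tuple.
        for k in range(n, 0, -1):
            prev = best[k - 1]
            if prev == math.inf or v < prev - delta:
                continue  # no band of length k - 1, or v falls outside the band
            candidate = max(prev, v)  # maximal value of the extended band
            if candidate < best[k]:
                best[k] = candidate
                bands[k] = bands[k - 1] + [idx]
    longest = max(k for k in range(n + 1) if best[k] < math.inf)
    return bands[longest]


def longest_monotonic_band(values, delta):
    """Pick the longer of the longest increasing and decreasing bands."""
    lib = longest_increasing_band(values, delta)
    # A decreasing band is an increasing band over the negated values.
    ldb = longest_increasing_band([-v for v in values], delta)
    return ("asc", lib) if len(lib) >= len(ldb) else ("desc", ldb)


years = [1982, 1989, 1988, 1991, 2012, 1992, 1995]
direction, band = longest_monotonic_band(years, delta=1)
```

In this made-up sequence the value 2012 is left out of the band, while 1988 is kept because it lies within the band-width of 1989.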
III-C Computation Details
To find a LMB in a sequence of tuples, two arrays of size are used to store the best tuples of MBs. Our LMB algorithm proceeds as follows. Arrays and store the best tuples for each , i.e., . For each element in , and are updated by finding the best position of in and , denoted by to , as follows.
is the smallest index in that satisfies . It is the shortest length of IBs with a smallest maximal tuple that ends at in .
is the smallest index in that satisfies . It is the longest length of IBs with a smallest maximal tuple that ends at in .
is the smallest index in that satisfies . It is the shortest length of DBs with a largest minimal tuple that ends at in .
is the smallest index in that satisfies . It is the longest length of DBs with a largest minimal tuple that ends at in .
and are two arrays of size that store the set of lengths of MBs with best tuples ending at for each , i.e., and . For each , is updated by and adding to ; and for each , is updated by and adding to .
Assume over Table I, and . Initially, and for each and and are empty. We start with . (and , respectively) is computed by finding the position of () in array so that () is the leftmost value in that is greater than (). In both cases, , thus, is replaced by , and is inserted into . Next, is considered. With the updated array , and are computed and is set. Similarly, there is one IB with best tuples ending at , i.e., , and the length is stored in . The remaining tuples are processed accordingly with results reported in Figure 4.
Next, we describe how to compute a LMB over given the best tuple matrix stored in and . The path of a LIB is constructed in a sequence of tuples in reverse order by scanning the array . Let be the largest value in ; a LIB in with length and is found as the tuple in a LIB. Based on and , is scanned in reverse order until the first tuple is found that contains . Then, is found, the tuple in an IB. The scan continues until all tuples in an IB are found. A LDB is computed accordingly. A LMB is chosen as the longer of a LIB and a LDB.
Our LMB algorithm correctly finds a LMB in a sequence of tuples of size in time and space.
Consider over Table I, and . Fig. 4 shows the arrays and for finding a LIB and a LDB, respectively. To find a LIB is scanned to find the largest value in . Thus, is the eighth tuple in a LIB. By a reverse scan (marked with blue arrows) from , the -th tuple in a LIB is found: . The operation is continued until all tuples in a LIB are found; i.e., , , , , , , , . Since the length of a LDB over found in a similar fashion is , a LIB becomes a LMB.
III-D Automatic Band-Width Estimation
Our goal is to effectively identify outliers in a sequence of tuples, while being tolerant to tuples that slightly violate an OD. Since band ODs hold over subsets of data called series (with ascending and descending trends), to identify the correct band-width, we separate the entire sequence of tuples (ordered by X) over a table into contiguous subsequences of tuples . We identify contiguous subsequences of tuples by using a divide-and-conquer method, such that tuples in satisfy a traditional OD X Y within an approximation ratio. (Details of how to generate candidate abcODs based on the global approach to find traditional ODs [6, 7] are in Sec. IV-A.)
We would like to include a large number of tuples from each sequence into a LMB by setting a “proper” band-width , such that the distances of outliers from a LMB are large. To capture this, we propose a method to automatically compute a band-width. For a particular band-width , denotes a distance of outliers from a LMB and denotes a distinctive degree of in a sequence of tuples .
For each outlier over a tuple in , let be a repair of . If is a sequence where LMB is a LIB, then denotes the maximal value among and denotes the minimal value among . Accordingly, if is a sequence where LMB is a LDB, then denotes the minimal value among and denotes the maximal value among . We define the estimated repair as .
Since the value 2012 of the tuple over an attribute is incorrect and is part of a LIB wrt , the repair is calculated as (1992 + 1995) / 2 = 1993.5, which is rounded down to 1993.
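The repair estimate in this example can be sketched in Python as the midpoint between the nearest bounding values of the LMB on either side of the outlier. The function name `estimate_repair`, the index-based band representation and the use of integer floor division (chosen only because it reproduces the value 1993 above) are assumptions. Returning `None` when the outlier has no band tuple on one side is likewise an assumption, since that case is not covered by the text.

```python
def estimate_repair(values, band, outlier, increasing=True):
    """Estimate a repair for values[outlier] from the surrounding band tuples.

    values: consequent values ordered by the antecedent.
    band: indices of the tuples in the longest monotonic band.
    outlier: index of the tuple to repair (not in band).
    """
    before = [values[i] for i in band if i < outlier]
    after = [values[i] for i in band if i > outlier]
    if not before or not after:
        return None  # assumption: no estimate without band tuples on both sides
    if increasing:
        low, high = max(before), min(after)
    else:
        low, high = min(before), max(after)
    return (low + high) // 2  # midpoint, rounded down


years = [1982, 1989, 1988, 1991, 1992, 2012, 1995]
repair = estimate_repair(years, band=[0, 1, 2, 3, 4, 6], outlier=5)
```

The sequence of years here is made up for illustration; only the bounding values 1992 and 1995 and the outlier 2012 come from the example.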
The distance from a LMB is computed as . The distance of outliers from a LMB is calculated as the average distance i.e., = / .
The band-width that maximizes the distinctive degree is chosen. If is set too low (e.g., ), a large number of tuples that slightly violate monotonicity are detected as outliers; hence, the average distance of outliers from a LMB is rather small. When the band-width is increased to a “proper” value, the outliers that are very close to the LMB are no longer considered errors, and thus the average distance of outliers and the distinctive degree increase dramatically. However, if the band-width continues to be increased, a small number of outliers may disappear; thus, the average distance changes slightly and the distinctive degree is reduced. Note that since the entire sequence is divided into contiguous subsequences , the band-width is an aggregated value computed over all contiguous subsequences .
IV Discovery of abcODs
IV-A Discovery Problem
To make band ODs relevant to real-world applications, we make them less strict to hold approximately with some exceptions and conditionally on subsets of the data. Given a band OD , where is a sequence of tuples ordered by X, our goal is to segment into multiple contiguous, non-overlapping subsequences of tuples, called series, such that (1) a large fraction of tuples in each series satisfy a band OD, and (2) outlier tuples that severely violate a band OD in each series are few and sparse. (We verified in the experimental evaluation in Section VI that in practice errors are few and sparse over real-world datasets.)
We define the approximate band conditional OD (abcOD) discovery problem as a constrained optimization problem.
Definition IV.1 (abcOD Discovery Problem)
Let be a band OD, be a sequence of tuples, ordered by X over a table and be an approximation error rate parameter. Among all possible non-overlapping segmentations of , the approximate band conditional OD (abcOD) discovery problem is to find the optimal segmentation denoted as , where that satisfies the following condition.
defines the gain in terms of portions of satisfying a band OD, and defines a cost quantifying the number of errors in that violate a band OD. For each segment in , let be the number of non-null tuples in , and be a LMB in . The gain and the cost are defined respectively as follows.
is the maximum number of contiguous outliers that violate a band OD in .
We call band ODs that hold conditionally over subsets of the data and approximately with some exceptions approximate band conditional ODs (abcODs). In Equation 4, we consider a gain function that rewards correct tuples weighted by the length of series excluding tuples with null values. Weighting by the length of series is necessary to achieve high recall. Otherwise, small series would be ranked high, with the extreme optimal case of all segments being individual tuples, which is not desirable.
Note that we describe how to automatically estimate band-width in Section III-D. To identify candidate band ODs, we use a global approach to find all traditional ODs within an approximation ratio (from prior work by some of the authors of this paper [6, 7]) to narrow the search space, as discovering traditional ODs is less computationally intensive. Since band ODs hold over subsets of the data (with a mix of ascending and descending ordering), we separate an entire sequence of tuples into contiguous subsequences of tuples by using a divide-and-conquer approach, such that tuples over contiguous subsequences satisfy a traditional OD within an approximation ratio. As not all traditional ODs within an approximation ratio are valid abcODs, we rank discovered abcODs by the normalized ratio of gain to cost (functions developed in Definition IV.1) to present to the users only the most interesting dependencies.
Our problem of abcOD discovery is not a simple matter of finding splitting points. We study a technically challenging joint optimization problem of finding splits, monotonic bands and approximation (to account for outliers), which is not easily solved by simple visualization. Also, note that Figure 1 presents only a small sample of the data extracted from the entire dataset to illustrate the intuition. In general, the number of data series can be in the hundreds or thousands over large datasets (see Sec. VI); thus, data cannot be split easily into a few partitions. We argue that an automatic approach to discover abcODs is needed, as formulating constraints manually requires domain expertise, is prone to human errors, and is excessively time consuming, especially over large datasets. Automatically discovered dependencies can be manually validated by domain experts, which is a much easier task than manual specification. (In our experimental evaluation in Sec. VI, it turned out that all the discovered series are true.) The purpose of our framework is to alleviate the cognitive burden of human specification.
IV-B Computing abcODs
We provide an algorithm to compute series for abcODs with an optimal solution denoted as , where denotes the number of tuples over a dataset. The solution to the abcOD discovery problem has an optimal substructure property. The optimal solution in a prefix contains optimal solutions to the subproblems in prefixes .
We develop a dynamic-programming algorithm to solve the abcOD discovery problem. Two arrays of size are maintained: array stores the overall benefits of optimal solutions to the subproblems, i.e., ; and array stores the corresponding series, i.e., stores a segment ID that tuples – belong to in a prefix .
Our dynamic-programming algorithm solves the abcOD discovery problem optimally in time in a sequence of tuples of size .
Note that the complexity of the dynamic-programming algorithm is assuming that LMBs are precomputed in all subsequences of .
Consider a band OD and an error rate . Over Table I a tuple is examined first. It forms a singleton series with the benefit . A tuple can either form its own series with the benefit equal to (with overall benefit ) or be merged into the same series with with the benefit . Thus, and are merged as well as and are set. The rest of tuples are processed accordingly with the results reported in Table III and series presented in Figure 2.
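The optimal-substructure idea behind this example can be sketched in Python as a dynamic program over prefixes. This is a generic illustration: the segment score `benefit(i, j)` is a caller-supplied placeholder, and the `toy_benefit` used here (squared length of a fully monotonic segment) is a made-up stand-in, not the paper's gain and cost functions. All names are hypothetical.

```python
import math


def optimal_segmentation(n, benefit):
    """Split positions 0..n-1 into contiguous segments of maximal total benefit.

    benefit(i, j) scores the half-open segment [i, j); it stands in for the
    gain/cost trade-off and is an assumption of this sketch.
    """
    opt = [0.0] + [-math.inf] * n   # opt[j]: best total benefit of prefix [0, j)
    start = [0] * (n + 1)           # start[j]: start of the last segment in that solution
    for j in range(1, n + 1):
        for i in range(j):
            total = opt[i] + benefit(i, j)
            if total > opt[j]:
                opt[j], start[j] = total, i
    # Recover the segments by walking the stored split points backwards.
    segments = []
    j = n
    while j > 0:
        segments.append((start[j], j))
        j = start[j]
    segments.reverse()
    return opt[n], segments


def toy_benefit(values):
    """Toy score: squared length if the segment is monotonic, else infeasible."""
    def benefit(i, j):
        seg = values[i:j]
        asc = all(a <= b for a, b in zip(seg, seg[1:]))
        desc = all(a >= b for a, b in zip(seg, seg[1:]))
        return float((j - i) ** 2) if asc or desc else -math.inf
    return benefit


values = [5, 3, 1, 4, 6, 8]
total, segments = optimal_segmentation(len(values), toy_benefit(values))
```

Squaring the length in the toy score mirrors the motivation for weighting by series length: without it, splitting everything into singleton segments would score as well as longer series.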
V Pruning via Pieces
V-A Pieces Decomposition
Since the dynamic programming algorithm (Sec. IV-B) that finds the optimal solution has super-cubic time complexity, we develop a greedy algorithm for abcOD discovery that is based on pieces to further prune the search space. Pieces split a sequence of tuples (based on pre-pieces defined below) into contiguous subsequences that are monotonic within a band-width. Pieces are used to drastically improve the performance of abcOD discovery (Sec. V-B) without sacrificing precision (details in Sec. VI-D).
Definition V.1 (Pre-Piece)
Given a sequence of tuples , ,