Skyblocking for Entity Resolution

05/31/2018 ∙ by Jingyu Shao, et al.

In this paper, for the first time, we introduce the concept of skyblocking, which aims to efficiently identify the "most preferred" blocking schemes in terms of a given set of selection criteria for entity resolution blocking. To capture all possible preferred blocking schemes, we study scheme skylines (i.e. blocking schemes on the skyline) in a multi-dimensional scheme space whose dimensions correspond to selection criteria for blocking (e.g. PC and PQ). However, applying traditional skyline techniques to learn scheme skylines is a non-trivial task. Due to the unique characteristics of blocking schemes, we face several challenges, such as: how to find a balanced number of match and non-match labels to effectively approximate a blocking scheme in a scheme space, and how to design efficient skyline algorithms that explore a scheme space for finding scheme skylines. To overcome these challenges, we propose a scheme skyline learning approach, which incorporates skyline techniques into an active learning process for blocking schemes. We have conducted experiments over four real-world datasets. The experimental results show that our approach is able to efficiently identify scheme skylines in a large scheme space using only a limited number of labels. Our approach also outperforms the state-of-the-art approaches for learning blocking schemes in several aspects, including label efficiency, blocking quality and learning efficiency.


1 Introduction

Entity Resolution (ER) is of great importance in many applications, such as matching product records from different online stores for customers, detecting people's financial status for national security, or analyzing health conditions from different medical organizations christen2012data ; fisher2015clustering . Generally, entity resolution refers to the process of identifying records which represent the same real-world entity from one or more datasets wang2016clustering . Common ER approaches require the calculation of similarity between records in order to classify record pairs as matches or non-matches. However, as the size of a dataset grows, the similarity computation time increases quadratically. For example, given a dataset $D$, the total number of record pairs to be compared is $|D|(|D|-1)/2$ (i.e., each record is paired with all other records in $D$). To reduce the number of compared record pairs, blocking techniques are widely applied, which group potentially matched records into the same block wang2016semantic . Since comparison only occurs between records in the same block, blocking can reduce the number of compared record pairs to no more than $b \cdot n_{\max}(n_{\max}-1)/2$, where $n_{\max}$ is the number of records in the largest block and $b$ is the number of blocks.

In past years, a good number of techniques have been proposed for blocking draisbach2009comparison ; wang2016semantic ; fisher2015clustering ; papadakis2014meta ; papadakis2014supervised ; shao2018active . Among these techniques, using blocking schemes is an efficient and declarative way to generate blocks. Intuitively, a blocking scheme takes records from a dataset as input, and groups the records using a logical combination of blocking predicates, where each blocking predicate specifies an attribute and its corresponding function. Learning a blocking scheme is thus the process of deciding which attributes to use, which methods to compare values in those attributes, and how different attributes and methods are logically combined so that desired blocks can be generated to satisfy the needs of an application. Compared with blocking techniques that consider data at the instance level draisbach2009comparison ; fisher2015clustering , blocking schemes have several advantages: (1) they only require deciding what metadata, such as attributes and the corresponding methods, is needed, rather than data from individual records; (2) they provide a more human-readable description of how attributes and methods are involved in blocking; and (3) they enable more natural and effective interaction for blocking across heterogeneous datasets.

However, ER applications often require multi-criteria analysis of the varying preferences of blocking schemes. More specifically, given a scheme space that contains a large set of possible blocking schemes and a number of criteria for choosing blocking schemes, such as the commonly used pair completeness (PC), pair quality (PQ) and reduction ratio (RR) christen2012survey , how can we identify the most preferred blocking scheme? Ideally, a good blocking scheme should yield blocks that minimize the number of record pairs to compare, while still preserving as many true matches as possible, i.e. optimizing all criteria simultaneously. Unfortunately, the criteria for selecting blocking schemes often compete with each other; for example, PC and PQ are negatively correlated in many applications, as are RR and PQ christen2012survey . In Fig. 1, we can see that a blocking scheme with an improved PC leads to a decrease in PQ, and conversely, a blocking scheme with an increase in PQ may have a decrease in PC. Nonetheless, depending on specific applications, different blocking schemes can be considered as being "preferred" in different situations. For example, for a crime investigation where individuals need to be matched against a large database of people, a high PC would be desired to make sure that potential criminals are included for investigation. On the other hand, in public health studies where each match corresponds to a patient with certain medical conditions, one may require a high PQ and want to be sure to only include patients that do have the medical condition under study christen2012data .

Figure 1: An example of a scheme skyline (schemes on the red line) from the Cora dataset, where blocking schemes (green points) are presented in a 2-dimensional space and their corresponding PC and PQ values are shown in the accompanying table.

Thus, to effectively learn a blocking scheme that is optimal in one or more criteria, previous work has specified conditions on the other, negatively correlated criteria in the learning process kejriwal2015dnf . For example, some approaches cao2011leveraging ; michelson2006learning aimed to learn a blocking scheme that maximizes both the reduction ratio of record pairs with and without blocking, and the coverage of matched pairs. Others bilenko2006adaptive ; kejriwal2013unsupervised aimed to find a blocking scheme that generates blocks containing all or nearly all matched record pairs with a minimal number of non-matched record pairs. Shao and Wang shao2018active considered selecting a blocking scheme with a minimal number of non-matched pairs constrained under a given coverage rate for matched pairs. However, setting such conditions perfectly is a challenging task, because conditions vary across applications and it is unknown which condition is appropriate for a specific application. If a condition is set too high, no blocking scheme can be learned; on the other hand, if a condition is set too low, the learned blocking scheme is not necessarily the most preferred one. Thus, the question that naturally arises is: can we efficiently identify all possible most preferred blocking schemes in terms of a given set of selection criteria for blocking?

To answer this question, we propose a scheme skyline learning approach for blocking, which incorporates skyline techniques into the learning process of blocking schemes on the skyline, i.e., the scheme skyline. The aim of learning a scheme skyline is to find a range of blocking schemes under different user preferences, each of which is optimal with respect to one or more selection criteria. As shown in Fig. 1, possible blocking schemes are marked as green points, and preferred blocking schemes correspond to the points on the skyline (i.e., the red line). This provides a user with a direct view of all possible preferred blocking schemes for a given dataset and helps users decide on the most preferred blocking scheme from their own perspective.

Challenges To the best of our knowledge, no scheme skyline learning approach has been previously proposed for entity resolution. Traditionally, skyline techniques have been widely used for database queries kalyvas2017survey ; chomicki2013skyline ; sarma2011representative . The result of a skyline query consists of multi-dimensional points for which there is no other point having better values in all the dimensions chester2015explanations . However, applying skyline query techniques to learning scheme skylines is a non-trivial task. Different from skyline queries in the database context chomicki2013skyline , in which the dimensions of a skyline correspond to attributes and points correspond to records, scheme skylines in our work have dimensions corresponding to selection criteria for blocking, and points corresponding to blocking schemes in this multi-dimensional space, called the scheme space. Learning scheme skylines becomes challenging due to the following unique characteristics of blocking schemes. Firstly, it is hard to obtain match and non-match labels in real-world ER applications. Furthermore, it is well known that match and non-match labels for entity resolution are often highly imbalanced, which is called the class imbalance problem fisher2016active ; christen2015efficient ; wang2015efficient . For example, given a dataset with two tables, each containing 1,000 non-duplicate records, the total number of record pairs will be 1,000,000, but the number of true matches is no more than 1,000. The class imbalance ratio of this dataset is thus at most 1:1,000, which indicates that the probability of randomly selecting a matched pair is 0.001. However, to find scheme skylines effectively, we need a balanced training set with both match and non-match labels. Hence, the first challenge we must address is how to find a balanced number of match and non-match labels to effectively approximate a blocking scheme in a scheme space. Secondly, a scheme space can be very large, making an exhaustive search for blocking schemes on the skyline computationally expensive. As will be discussed in Section 4.5, if a blocking scheme is composed of at most $n$ different blocking predicates, the number of all possible blocking schemes can grow asymptotically as $2^{\binom{n}{\lfloor n/2 \rfloor}}$. Nevertheless, only a small portion of these blocking schemes are on the skyline. Thus, the second challenge we need to act on is how to design efficient skyline algorithms to explore a large scheme space for finding blocking schemes on the skyline. Enumerating the entire scheme space by checking one blocking scheme at a time is obviously not an efficient solution in practice and is infeasible for large datasets with hundreds or thousands of attributes.

Contributions To overcome the above challenges, in our approach, we hierarchically learn a scheme skyline based on blocking predicates to avoid enumerating all possible blocking schemes. A balanced active sampling strategy is also developed, which aims to select informative samples (i.e., a more balanced set of match and non-match samples) within a limited budget of labels. In summary, the contributions of this paper are as follows.

  • We formalize the scheme skyline learning problem and propose three novel algorithms for learning a scheme skyline. These enable us to efficiently learn the scheme skyline with significantly less label cost and help users to select the blocking scheme they need in terms of different constraints.

  • To efficiently use limited labels, we tackle the class imbalance problem by converting it into the balanced sampling problem. We then propose an active sampling strategy to solve the balanced sampling problem by actively selecting representative samples.

  • Three scheme skyline learning algorithms are proposed based on the active sampling and scheme extension strategies. The scheme extension strategy aims to reduce the search space and label cost by only considering potentially dominating schemes.

  • We have evaluated our approach over four real-world datasets. Our experimental results show that our approach outperforms the state-of-the-art approaches in several aspects, including: label efficiency, blocking quality and learning efficiency.

Outline The rest of the paper is structured as follows. Section 2 introduces the notation used in the paper and the problem definition. Section 3 discusses our active sampling strategy for the class imbalance problem. After that, three skyline algorithms for learning scheme skylines are presented in Section 4, where the hierarchical scheme skyline learning strategy is also discussed. Section 5 discusses our experimental results, and Section 6 introduces the related work. We conclude the paper in Section 7.

2 Problem Definition

Let $D$ be a dataset consisting of a set of records, where each record $r \in D$ is associated with a set of attributes $A$. We use $r.a$ to refer to the value of attribute $a \in A$ in a record $r$. Each attribute $a$ has a domain $Dom(a)$. A blocking function $h$ takes an attribute value from $Dom(a)$ as input and returns a value in its range $U$ as output. A blocking predicate $\langle a, h \rangle$ is a pair of an attribute $a$ and a blocking function $h$. Given a pair of records $(r_1, r_2)$, a blocking predicate $\langle a, h \rangle$ returns true if $h(r_1.a) = h(r_2.a)$ holds; otherwise false. For example, we may have soundex as a blocking function for attribute author, and accordingly, a blocking predicate $\langle author, soundex \rangle$. For two records with values "Gale" and "Gaile" of attribute author, $\langle author, soundex \rangle$ returns true because soundex(Gale) = soundex(Gaile) = G400.
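To make blocking functions and predicates concrete, here is a minimal Python sketch (not the authors' implementation) with a simplified Soundex encoding; the helper names `soundex` and `predicate` are ours:

```python
def soundex(value: str) -> str:
    """Simplified Soundex: keep the first letter, then encode consonants."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    value = value.lower()
    result, last = value[0].upper(), codes.get(value[0], "")
    for ch in value[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "hw":          # 'h' and 'w' do not reset the previous code
            last = code
    return (result + "000")[:4]     # pad/truncate to a 4-character code

def predicate(attribute, func):
    """Blocking predicate <attribute, func>: true iff the records agree on the encoding."""
    def holds(r1, r2):
        return func(r1[attribute]) == func(r2[attribute])
    return holds

p_author = predicate("author", soundex)
print(p_author({"author": "Gale"}, {"author": "Gaile"}))   # True: both encode to G400
```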

A (blocking) scheme is a disjunction of conjunctions of blocking predicates (i.e. it is in disjunctive normal form). For example, we may have $(\langle author, soundex \rangle \wedge \langle title, exact\text{-}match \rangle) \vee \langle year, exact\text{-}match \rangle$ as a blocking scheme. A blocking scheme is called an $n$-ary blocking scheme if it contains $n$ distinct blocking predicates. For example, the blocking scheme above is a 3-ary blocking scheme, because it contains three blocking predicates, i.e. $\langle author, soundex \rangle$, $\langle title, exact\text{-}match \rangle$ and $\langle year, exact\text{-}match \rangle$.
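Continuing the sketch above, a DNF scheme can be represented as a list of conjunctions (each a list of predicates) and evaluated on a record pair; this encoding is our illustration, not the paper's implementation:

```python
exact = lambda v: v  # Exact-Match blocking function: compare values as-is

# (<author, soundex> AND <title, exact>) OR <year, exact>, a 3-ary scheme
scheme = [[predicate("author", soundex), predicate("title", exact)],
          [predicate("year", exact)]]

def scheme_holds(scheme, r1, r2):
    # Two records fall into the same block iff at least one conjunction
    # has all of its blocking predicates returning true.
    return any(all(p(r1, r2) for p in conj) for conj in scheme)

r1 = {"author": "Gale",  "title": "Skyblocking", "year": "2018"}
r2 = {"author": "Gaile", "title": "Skyblocking", "year": "2019"}
print(scheme_holds(scheme, r1, r2))  # True: the first conjunction holds
```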

A training set $T = (X, Y)$ consists of a set of feature vectors $X = \{x_1, x_2, \dots, x_{|X|}\}$ and a set of labels $Y = \{y_1, y_2, \dots, y_{|X|}\}$, where each $y_i$ is the label (i.e. match or non-match) of the feature vector $x_i$. Given a pair of records $(r_1, r_2)$ and a set of blocking predicates $P$, a feature vector of $r_1$ and $r_2$ is a tuple $x = (v_1, v_2, \dots, v_{|P|})$, where each $v_k$ ($1 \leq k \leq |P|$) is an equality value of either 1 or 0, describing whether the corresponding blocking predicate returns true or false. For each feature vector $x_i$, a human oracle $H$ is used to provide a label $y_i$. If $y_i = M$, it indicates that the given pair of records refers to the same entity (i.e. a match); analogously, $y_i = N$ indicates that the given pair of records refers to two different entities (i.e. a non-match). The human oracle is associated with a budget limit $budget$, which indicates the total number of labels the human oracle can provide.
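A sketch of the feature-vector construction, plus a hypothetical budget-limited oracle wrapper (the class name `Oracle` and its interface are our assumptions, not the paper's):

```python
def feature_vector(r1, r2, predicates):
    """One equality value per blocking predicate: 1 if it returns true, else 0."""
    return tuple(int(p(r1, r2)) for p in predicates)

class Oracle:
    """Hypothetical budget-limited human oracle: fails once the budget is spent."""
    def __init__(self, label_fn, budget):
        self.label_fn, self.budget, self.used = label_fn, budget, 0

    def label(self, pair):
        if self.used >= self.budget:
            raise RuntimeError("label budget exhausted")
        self.used += 1
        return self.label_fn(pair)   # 'M' for match, 'N' for non-match
```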

Given a blocking scheme $s = s_1 \vee \dots \vee s_k$, where each $s_i$ ($1 \leq i \leq k$) is a conjunction of blocking predicates, we can generate a set of pairwise disjoint blocks $B_s = \{b_1, \dots, b_{|B_s|}\}$, where $b_i \cap b_j = \emptyset$ ($i \neq j$). For any two records $r_1$ and $r_2$ in the dataset $D$, $r_1$ and $r_2$ are placed into the same block iff there exists a conjunction of blocking predicates $s_i$ in $s$ such that, for each blocking predicate $\langle a, h \rangle$ in $s_i$, we have that $h(r_1.a) = h(r_2.a)$ holds.

Now we characterize blocking schemes under different perspectives using the notion of scheme skyline. Given a dataset $D$ and a set of blocking schemes $S$, a scheme space over $S$ is a $d$-dimensional space consisting of points in $[0, 1]^d$. Each blocking scheme $s \in S$ can be mapped to a point $x_s = (x_1, \dots, x_d)$ in this scheme space, where each $x_i$ indicates the value of $s$ in the $i$-th dimension of the scheme space.

Definition 1

(Dominance) Given two blocking schemes $s_1$ and $s_2$, we say $s_1$ dominates $s_2$, denoted as $s_1 \succ s_2$, iff $x_i(s_1) \geq x_i(s_2)$ holds in every dimension $1 \leq i \leq d$, and $x_j(s_1) > x_j(s_2)$ holds in at least one dimension $1 \leq j \leq d$.

Based on the notion of dominance between two blocking schemes, we now define the notion of scheme skyline.

Definition 2

(Scheme skyline) Given a dataset $D$ and a set of blocking schemes $S$, a scheme skyline over $S$ is a subset $S^* \subseteq S$, where each scheme $s \in S^*$ is not dominated by any other blocking scheme in $S$, i.e. $S^* = \{s \in S \mid \nexists s' \in S : s' \succ s\}$.
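For concreteness, a quadratic-time sketch of dominance checking and skyline extraction over (PC, PQ) points, matching Definitions 1 and 2:

```python
def dominates(a, b):
    """a dominates b iff a is >= in every dimension and > in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    """Keep exactly the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

print(skyline([(0.3, 0.9), (0.6, 0.7), (0.5, 0.5), (0.9, 0.2)]))
# [(0.3, 0.9), (0.6, 0.7), (0.9, 0.2)]: (0.5, 0.5) is dominated by (0.6, 0.7)
```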

Without loss of generality, we will discuss a scheme space with $d = 2$ in this paper, which has two dimensions: one indicates the PC of a blocking scheme and the other indicates the PQ. Both PC and PQ are commonly used measures for blocking christen2012survey . For a pair of records that are placed into the same block, we call it a true positive if it refers to a match; otherwise, it is a false positive. Similarly, a pair of records is called a false negative if it refers to a match but the records are placed into two different blocks. For convenience, we use $tp(s)$, $fp(s)$ and $fn(s)$ to denote the numbers of true positives, false positives and false negatives in the blocks generated by a blocking scheme $s$, respectively. The pair completeness (PC) of a blocking scheme $s$ is the number of true positives divided by the total number of true matches, i.e. $tp(s) + fn(s)$, which measures the rate of true matches remaining in blocks. We have:

$$PC(s) = \frac{tp(s)}{tp(s) + fn(s)}$$ (1)

The pair quality (PQ) of a blocking scheme $s$ is the number of true positives divided by the total number of record pairs that are placed into the same blocks, i.e. $tp(s) + fp(s)$, which measures the rate of true positives in blocks. We have:

$$PQ(s) = \frac{tp(s)}{tp(s) + fp(s)}$$ (2)
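A sketch of computing both measures from generated blocks, assuming record pairs are represented as frozensets of record ids and the inputs are non-empty:

```python
from itertools import combinations

def pc_pq(blocks, true_matches):
    """blocks: iterable of record-id collections; true_matches: set of frozenset pairs."""
    candidate_pairs = {frozenset(pair)
                       for block in blocks
                       for pair in combinations(block, 2)}
    tp = len(candidate_pairs & true_matches)
    pc = tp / len(true_matches)      # Eq. (1): tp / (tp + fn)
    pq = tp / len(candidate_pairs)   # Eq. (2): tp / (tp + fp)
    return pc, pq

blocks = [["r1", "r2", "r3"], ["r4", "r5"]]
matches = {frozenset(("r1", "r2")), frozenset(("r1", "r4"))}
print(pc_pq(blocks, matches))  # (0.5, 0.25): 1 of 2 matches kept; 1 of 4 candidate pairs matches
```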

Note that, it is possible and straightforward to extend the dimensions of a scheme space by taking into account other measures for blocking schemes, such as reduction ratio (RR) and execution time (ET) kopcke2010frameworks .

We now define the problem of scheme skyline learning in a scheme space, which allows us to select desired blocking schemes depending upon different blocking criteria.

Definition 3

(Scheme skyline learning problem) Let $H$ be a human oracle, $D$ be a dataset, $S$ be a set of blocking schemes over $D$, and $\Delta$ be a $d$-dimensional scheme space. Then the scheme skyline learning problem is to learn the scheme skyline $S^*$ over $\Delta$ through actively selecting a training set $T$ from $D$, such that $|T| \leq budget$ holds.

3 Class Imbalance Learning

Figure 2: A comparison of the sample distributions of 100 samples from the Cora dataset: (a) random sampling, and (b) active sampling (based on the similarity of the two attributes title and authors), where a red circle indicates a match sample and a blue triangle indicates a non-match sample.

In this section, we propose an active sampling strategy to tackle the class imbalance problem. This strategy aims to generate an informative training set with a small amount of label usage.

The class imbalance problem has been known to impede the performance of learning approaches in entity resolution bilenko2006adaptive ; cao2011leveraging ; kejriwal2013unsupervised . Although the ratio of non-matches to matches may vary across datasets, random sampling always leads to many more non-matches than matches in a training set, as shown in Fig. 2(a). Thus, supervised learning approaches that are based on random sampling need a large number of labels in order to guarantee the learning performance.

Previously, it has been reported arasu2010active ; bellare2012active that the more similar two records are, the higher the probability that they are a match. To tackle the class imbalance problem, we observe that, by virtue of this correlation between similarity and labels, a balanced set of similar and dissimilar records likely represents a training set with balanced matches and non-matches. Hence, we define the notion of balance rate.

Definition 4

(Balance rate) Let $s$ be a blocking scheme and $X$ a non-empty feature vector set. The balance rate of $X$ in terms of $s$, denoted as $\gamma(s, X)$, is defined as:

$$\gamma(s, X) = \frac{|X_{s+}| - |X_{s-}|}{|X|}$$ (3)

where $X_{s+}$ and $X_{s-}$ denote the sets of similar and dissimilar samples in $X$ with regard to $s$, respectively.

Conceptually, the balance rate describes how balanced or imbalanced $X$ is, by comparing the number of similar samples with that of dissimilar samples in terms of a given blocking scheme $s$. The range of the balance rate is $[-1, 1]$. If $\gamma(s, X) = 1$, all samples in $X$ are similar samples with regard to $s$, whereas $\gamma(s, X) = -1$ means all samples in $X$ are dissimilar pairs. In these two cases, $X$ is highly imbalanced. If $\gamma(s, X) = 0$, there is an equal number of similar and dissimilar samples, indicating that $X$ is balanced.
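A sketch under this definition, where a feature vector counts as a similar sample w.r.t. a scheme when the scheme's DNF is satisfied by its equality values; the index-based scheme encoding is our assumption:

```python
def covers(scheme, x):
    """scheme: DNF over predicate indices; x: 0/1 feature vector.
    x is a 'similar' sample iff some conjunction is fully satisfied."""
    return any(all(x[i] == 1 for i in conj) for conj in scheme)

def balance_rate(scheme, X):
    similar = sum(1 for x in X if covers(scheme, x))
    return (2 * similar - len(X)) / len(X)   # (|X+| - |X-|) / |X|, in [-1, 1]

X = [(1, 1, 0), (0, 1, 0), (1, 0, 1)]
print(balance_rate([[0, 1]], X))   # scheme = p0 AND p1 -> 1 similar of 3 -> -1/3
```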

Then, based on the notion of balance rate, we convert the class imbalance problem into the balanced sampling problem, which is defined as follows:

Definition 5

(Balanced sampling problem) Given a set of blocking schemes $S$ and a label budget $budget$, the balanced sampling problem is to select a training set $T = (X, Y)$, where $|X| = budget$, in order to:

$$\text{minimize} \sum_{s \in S} |\gamma(s, X)|$$ (4)

Two different blocking schemes $s_1$ and $s_2$ may have different balance rates over the same feature vector set $X$, i.e. $\gamma(s_1, X) \neq \gamma(s_2, X)$ is possible. The goal here is to find a training set that minimizes the balance rates in terms of the given set of blocking schemes. The optimal case is $\gamma(s, X) = 0$ for each $s \in S$, according to the objective function above, but sometimes it is impossible to achieve this in real-world applications. For example, a dataset may contain all the records from the same county, which would mean that all samples are similar samples for a blocking scheme over the county attribute.

In Fig. 2, we compare samples identified using our active sampling strategy with the ones found using random sampling. Due to the class imbalance problem, the samples from random sampling are almost all non-matches, whereas the samples from active sampling contain many more matches than the samples from random sampling.

4 Scheme Skyline Learning Approaches

Now we propose three algorithms for scheme skyline learning. Our algorithms are designed upon the scheme extension strategy, which begins with blocking schemes containing only one blocking predicate each, and then extends them with other blocking schemes in either conjunction or disjunction form.

4.1 Scheme Extension Strategy

In general, given a blocking scheme $s$, there are two kinds of extensions through which we can extend $s$ to another blocking scheme: conjunction and disjunction. Hence, if the choice between conjunction and disjunction can be decided before an extension, the search space is reduced by half.

Let $s_1$ and $s_2$ be two blocking schemes. We have the monotonicity property of PC, in terms of either the disjunction or the conjunction of $s_1$ and $s_2$, based on the following lemma.

Lemma 1

$$PC(s_1 \vee s_2) \geq \max(PC(s_1), PC(s_2))$$ (5)

$$PC(s_1 \wedge s_2) \leq \min(PC(s_1), PC(s_2))$$ (6)

Proof 1

Given $s_1$ and $s_2$, any true positive record pair generated by $s_1 \wedge s_2$ must also be generated by $s_1$ (and by $s_2$), and any true positive record pair generated by $s_1$ must also be generated by $s_1 \vee s_2$. That is, the number of true positives generated by either scheme $s_1$ or $s_2$ cannot be larger than that generated by $s_1 \vee s_2$ and cannot be smaller than that generated by $s_1 \wedge s_2$, i.e. $tp(s_1 \wedge s_2) \leq tp(s_i) \leq tp(s_1 \vee s_2)$ for $i \in \{1, 2\}$. Since the sum of true positives and false negatives is constant, namely the total number of matched pairs in the dataset, the lemma follows from the definition of PC.

The lemma states that, if a scheme generates blocks which discard more matches than expected, a disjunctive extension can be applied in order to increase the matches contained in blocks. On the contrary, if a conjunctive extension is applied, the completeness of the matches can only stay the same or be reduced.

4.2 Grid-based Naive Learning

Figure 3: An illustration of the grid-based naive learning algorithm, where optimal blocking schemes are learned in parallel under different thresholds as shown in (a), and the merged scheme skyline is depicted in (b); some learned schemes are dominated by the skyline, and different thresholds may yield the same scheme.

A naive way to learn skyline blocking schemes is to learn the optimal blocking schemes under different thresholds in one or more dimensions. Then, based on these optimal schemes, a scheme skyline can be learned.

In the following, we first present an Active Scheme Learning (ASL) algorithm which can actively learn the optimal blocking scheme under a given PC threshold and a specified label budget. In this algorithm, the concept of optimal blocking scheme is defined as follows.

Definition 6

(Optimal blocking scheme) Given a human oracle $H$ and a PC threshold $\epsilon$, the optimal blocking scheme is a blocking scheme $s^*$ w.r.t. the following objective function, through selecting a training set $T$:

$$\text{maximize } PQ(s)$$ (7)

subject to $PC(s) \geq \epsilon$ and $|T| \leq budget$.

Additionally, a locally optimal blocking scheme over a training set is the optimal scheme among a given set of candidate schemes which satisfy the above criteria.

At its core, the ASL algorithm adopts two active learning strategies: active sampling aims to tackle the class imbalance problem by solving the balanced sampling problem and thus reduce the label cost; active branching is a scheme extension strategy that aims to reduce the search space and thus further reduce the label cost.

Input: Dataset: $D$
         PC threshold $\epsilon$
         Human oracle $H$
         Set of blocking predicates $P$
         Sample size $k$
Output: A blocking scheme $s^*$
1  $S \leftarrow P$, $budget\_used \leftarrow 0$, $T \leftarrow \emptyset$, $X \leftarrow \emptyset$
2  $X \leftarrow$ seed samples selected from $D$
3  while $budget\_used < budget$ do
4        for each $s \in S$ do
5              if $\gamma(s, X) < 0$ then
6                    $X \leftarrow X \cup \{$a similar sample w.r.t. $s$ selected from $D\}$
7              else
8                    $X \leftarrow X \cup \{$a dissimilar sample w.r.t. $s$ selected from $D\}$
9        $T \leftarrow T \cup \{(x, H(x)) \mid x \in X\}$
10      $budget\_used \leftarrow budget\_used + |X|$
11      $s^* \leftarrow$ the locally optimal scheme in $S$ over $T$ w.r.t. $\epsilon$
12      if $s^*$ exists then
13            $S \leftarrow \{s^* \wedge s \mid s \in S\}$
14      else
15            $s' \leftarrow$ the scheme in $S$ with the maximum PC over $T$
16            $S \leftarrow \{s' \vee s \mid s \in S\}$
Return $s^*$
Algorithm 1 Active Scheme Learning (ASL)

A high-level description of the algorithm is presented in Algorithm 1. Let $S$ be a set of blocking schemes, where each blocking scheme is a single blocking predicate at the beginning. The budget usage is initially set to zero, i.e. $budget\_used = 0$, and a set of feature vectors $X$ is selected from the dataset as seed samples (lines 1 and 2). Then, the algorithm iterates until the number of samples in the training set reaches the budget limit (line 3). At the beginning of each iteration, the active sampling strategy is applied to generate a training set (lines 4 to 10). For each blocking scheme $s \in S$, the samples are selected in two steps: (1) firstly, the balance rate of this blocking scheme is calculated (lines 5 and 7); (2) secondly, a feature vector that reduces this balance rate is selected from the dataset (lines 6 and 8). Then the samples are labeled by the human oracle and stored in the training set $T$, and the usage of the label budget is increased accordingly (lines 9 and 10).

A locally optimal blocking scheme $s^*$ is searched among the set of blocking schemes $S$ over the training set, according to the specified PC threshold (line 11). Then the active branching strategy is applied to determine whether a conjunction or disjunction of this optimal blocking scheme with the other blocking schemes will be used, based on Lemma 1 (lines 12-16). If a locally optimal blocking scheme is found, new blocking schemes are generated by extending $s^*$ into a conjunction with each of the blocking schemes in $S$, so that the PQ of the new blocking schemes may be further increased (lines 12 and 13). Otherwise, a blocking scheme with the maximum PC value is selected and new schemes are generated using disjunctions, so that the PC of the new schemes can be further increased (lines 14 to 16).
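As an illustration of the active sampling step (lines 4-10), here is a sketch reusing `covers` and `balance_rate` from the Section 3 sketch; the selection heuristic shown (scan for the first unlabeled sample of the needed kind) is our simplification:

```python
def next_sample(scheme, X, unlabeled):
    """Pick an unlabeled feature vector that moves the balance rate toward zero."""
    rate = balance_rate(scheme, X) if X else 0.0
    need_similar = rate < 0            # too many dissimilar samples so far
    for x in unlabeled:
        if covers(scheme, x) == need_similar:
            return x
    return next(iter(unlabeled))       # fall back when no such sample exists
```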

Now, we propose the naive learning algorithm for scheme skylines, which is based on the idea of using ASL to learn optimal blocking schemes under different PC thresholds. The naive learning algorithm is described in Algorithm 2. The input a user needs to specify is the total label budget, the maximum arity of schemes in the scheme skyline, and the size of the threshold interval $\delta$, i.e. the difference between two consecutive thresholds. For example, if $\delta$ is set to 0.1, in total 10 thresholds will be used, i.e. 0.1, 0.2, …, 1.0, to learn at most 10 optimal blocking schemes. It is possible that the same blocking scheme is learned under two different thresholds. As shown in Algorithm 2, this approach consists of two steps. In the first step, called the parallel step, the threshold interval is defined (line 1), and the optimal blocking schemes are learned in parallel under each threshold using the ASL algorithm (lines 2-5). Here, the sample size is uniformly decided in terms of the total label budget, the threshold interval $\delta$ and the number of predicates $|P|$. In the second step, called the merging step, all optimal blocking schemes are merged and their domination is checked, which generates the scheme skyline (lines 6-7). Fig. 3(a) illustrates two optimal blocking schemes learned in parallel, with their thresholds set to 0.3 and 0.7, respectively. In Fig. 3(b), all the optimal schemes learned under the thresholds 0.1, 0.2, …, 1.0 are merged together and the scheme skyline is generated.

Input: Dataset: $D$
         Human oracle $H$
         Set of blocking predicates $P$
         Size of threshold interval $\delta$
         Sample size $k$
Output: Scheme skyline $S^*$
1 $\epsilon \leftarrow \delta$, $S_{opt} \leftarrow \emptyset$, $S^* \leftarrow \emptyset$
2 while $\epsilon \leq 1$ do
3       $s \leftarrow ASL(D, \epsilon, H, P, k)$
4       $S_{opt} \leftarrow S_{opt} \cup \{s\}$
5       $\epsilon \leftarrow \epsilon + \delta$
6 $S^* \leftarrow S_{opt}$
7 remove from $S^*$ every scheme dominated by another scheme in $S^*$
Return $S^*$
Algorithm 2 Grid-based Naive Learning (Naive-Sky)
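A compact sketch of Naive-Sky's two steps, assuming `asl(eps)` wraps Algorithm 1 and returns a hashable scheme, `pc_pq_of` evaluates a scheme's (PC, PQ) point, and `dominates` is the helper from the Section 2 sketch:

```python
def naive_sky(delta, asl, pc_pq_of):
    """Grid-based naive learning: one ASL run per threshold, then merge."""
    thresholds = [round(delta * i, 10) for i in range(1, int(1 / delta) + 1)]
    learned = {asl(eps) for eps in thresholds}             # parallel step
    points = {s: pc_pq_of(s) for s in learned}
    return [s for s in learned                             # merging step
            if not any(dominates(points[t], points[s]) for t in learned)]
```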

4.3 Grid-based Active Learning

There are some limitations in the grid-based naive learning approach. For example, how should one choose an appropriate size of the threshold interval $\delta$? If $\delta$ is set too high, the blocking scheme points in the skyline will be sparse. On the contrary, if $\delta$ is set too low, users may obtain dense blocking scheme points in the skyline, but a large number of parallel iterations will be involved in learning optimal blocking schemes under different thresholds, and some of them are often redundant. This leads to unnecessary computational costs.

Figure 4: Grid-based active learning algorithm, which actively chooses the PC threshold. We use the same example as presented in Fig. 3.

During our study of the grid-based naive learning algorithm, we have noticed that the PC threshold specified by a user may not be (and can even be far from) the actual PC value of the optimal blocking scheme learned by the grid-based naive learning algorithm. For example, when the threshold is set to 0.3, the actual PC value of the learned optimal blocking scheme can be 0.53. Moreover, when the threshold is then raised to 0.4, the learned optimal blocking scheme still remains unchanged. To leverage this observation, we propose another algorithm for learning scheme skylines, called grid-based active learning, which can efficiently learn the scheme skyline even for a small $\delta$, based on the following lemma. The process of this algorithm is shown in Fig. 4.

Lemma 2

Assume $s_1$ and $s_2$ are two optimal schemes learned by ASL under the PC thresholds $\epsilon_1$ and $\epsilon_2$, respectively, and the same label budget. The following property holds:

$$\epsilon_1 \leq \epsilon_2 \leq PC(s_1) \implies s_2 = s_1$$ (8)

Proof 2

Given a threshold $\epsilon_1$ and the set of blocking schemes $S_1 = \{s \mid PC(s) \geq \epsilon_1\}$, we have $s_1 = \arg\max_{s \in S_1} PQ(s)$. Given any threshold $\epsilon_2$ s.t. $\epsilon_1 \leq \epsilon_2 \leq PC(s_1)$, we have the set of blocking schemes $S_2 = \{s \mid PC(s) \geq \epsilon_2\}$. We can also have $S_2 \subseteq S_1$ and $s_1 \in S_2$. Based on Definition 6, the optimal blocking scheme is unique; hence $s_2 = s_1$.

The grid-based active learning algorithm is presented in Algorithm 3. By Lemma 2, the algorithm is designed to actively select the next PC threshold based on the PC value of the current optimal blocking scheme (line 5).

Input: Dataset: $D$
         Human oracle $H$
         Set of blocking predicates $P$
         Size of threshold interval $\delta$
         Sample size $k$
Output: Scheme skyline $S^*$
1 $\epsilon \leftarrow \delta$, $S_{opt} \leftarrow \emptyset$, $S^* \leftarrow \emptyset$
2 while $\epsilon \leq 1$ do
3       $s \leftarrow ASL(D, \epsilon, H, P, k)$
4       $S_{opt} \leftarrow S_{opt} \cup \{s\}$
5       $\epsilon \leftarrow PC(s) + \delta$
6 $S^* \leftarrow S_{opt}$
7 remove from $S^*$ every scheme dominated by another scheme in $S^*$
Return $S^*$
Algorithm 3 Grid-based Active Learning (Active-Sky)
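The only change from Naive-Sky is the threshold update on line 5; a sketch of the active jump, under the same assumptions as before (`asl` returns the learned scheme or None, `pc_of` its PC value):

```python
def active_sky(delta, asl, pc_of):
    """By Lemma 2, any threshold in (eps, PC(s)] re-learns the same scheme s,
    so jump straight to PC(s) + delta."""
    eps, learned = delta, []
    while eps <= 1.0:
        s = asl(eps)
        if s is None:              # no scheme satisfies this threshold
            break
        learned.append(s)
        eps = pc_of(s) + delta     # actively chosen next threshold
    return learned                 # dominance check as in Naive-Sky follows
```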

4.4 Progressive Skyline Learning

Figure 5: Progressive learning algorithm in progress: (a) shows the scheme space separated by the PC and PQ of a given scheme into areas where skyline schemes may be found (blue and red areas) or are dominated by the current skyline point (green area); (b)-(d) present the progress of our algorithm.

The grid-based active learning approach has solved the problem that the lower $\delta$ is set, the more iterations (each iteration learns one optimal blocking scheme) the ASL algorithm has to take. However, it still has to construct a training set for each iteration, and to distribute the label budget across iterations for sampling. This gives rise to a new question: can we reduce the number of iterations to further reduce the label cost?

Here we propose a progressive skyline learning approach. This approach learns a scheme skyline progressively, i.e., it starts by learning the 1-ary scheme skyline and ends by learning the n-ary scheme skyline, as illustrated in Fig. 5. In this approach, a user does not need to set the size of the threshold interval $\delta$.

Given a blocking scheme, we can partition the scheme space into four parts, as shown in Fig. 5(a). Blocking schemes in the green area, called the Dominated Space, are dominated by the given blocking scheme. Blocking schemes on the skyline can be found in the red area, called the Scheme Skyline Space. Blocking schemes in the blue area, called the Dominating Space, dominate the given blocking scheme. Clearly, if a blocking scheme is on the skyline, its dominating space must contain no other schemes. Therefore, our objective for progressive skyline learning is: given a scheme skyline $S^*$, we try to extend each $s \in S^*$ by discarding schemes in its dominated space, verifying schemes in its scheme skyline space, and finding schemes in its dominating space.

The algorithm is described in Algorithm 4, which does not change much compared with the previous algorithms in the sampling steps (lines 1-11). However, in the scheme extension step, we take the property illustrated in Fig. 5(a) into consideration. For each scheme in the skyline, we extend it with all the other predicates that are not included in this scheme (lines 12-13). We extend the existing schemes by conducting both conjunctions and disjunctions with all such predicates, obtain both the PC and PQ of the new schemes in terms of the training set, and select the ones with incremental PC or PQ (line 14). That is to say, if a new scheme falls in the red area shown in Fig. 5(a), we add this scheme to the candidate set; if it appears in the blue area, we replace the previous scheme with this new scheme; and if it appears in the green area, we discard it (see the sketch after Algorithm 4).

Input: Dataset: $D$
         Human oracle $H$
         Set of blocking predicates $P$
         Arity of schemes $n$
Output: Scheme skyline $S^*$
1  $S \leftarrow P$, $budget\_used \leftarrow 0$, $T \leftarrow \emptyset$, $X \leftarrow \emptyset$, $S^* \leftarrow \emptyset$
2  $X \leftarrow$ seed samples selected from $D$
3  while $budget\_used < budget$ do
4        for each $s \in S$ do // Begin sampling
5              if $\gamma(s, X) < 0$ then
6                    $X \leftarrow X \cup \{$a similar sample w.r.t. $s$ selected from $D\}$
7              else
8                    $X \leftarrow X \cup \{$a dissimilar sample w.r.t. $s$ selected from $D\}$
9        // End sampling
10      $T \leftarrow T \cup \{(x, H(x)) \mid x \in X\}$ // Add samples
11      $budget\_used \leftarrow budget\_used + |X|$; $S^* \leftarrow$ skyline of $S$ over $T$
12      for each $s \in S^*$ do
13            for each $p \in P$ not contained in $s$ (while $s$ has arity $< n$) do
14                  $S \leftarrow S \cup \{s' \in \{s \wedge p, s \vee p\} \mid s'$ has incremental PC or PQ over $T\}$
Return $S^*$
Algorithm 4 Progressive Learning (Pro-Sky)
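The three-way decision of Fig. 5(a) can be expressed directly with the `dominates` helper from the Section 2 sketch; a sketch of the per-candidate classification used in the extension step (the string labels are ours):

```python
def classify(candidate, sky_point):
    """Partition of the (PC, PQ) plane induced by a skyline point (Fig. 5(a))."""
    if dominates(candidate, sky_point):
        return "dominating"          # blue area: candidate replaces the skyline point
    if dominates(sky_point, candidate):
        return "dominated"           # green area: candidate is discarded
    return "skyline candidate"       # red area: kept for verification
```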
| Datasets | Attributes | Records | True Matches | Class Imbalance Ratio | # Blocking Predicates |
| --- | --- | --- | --- | --- | --- |
| Cora | 4 | 1,295 | 17,184 | 1 : 49 | 16 |
| DBLP-Scholar | 4/4 | 2,616 / 64,263 | 2,360 | 1 : 71,233 | 16/16 |
| DBLP-ACM | 4/4 | 2,616 / 2,294 | 2,224 | 1 : 2,698 | 16/16 |
| NCVR | 18/18 | 6,233,785 / 6,981,877 | 6,122,579 | 1 : 6.6 | 72/72 |

Table 1: Characteristics of datasets

4.5 Complexity Analysis

In this section, we discuss the search complexity and the time complexity for learning the scheme skyline.

4.5.1 Search complexity

Given $n$ blocking predicates, there are $2^n - 1$ possible blocking schemes composed of distinct blocking predicates in conjunctions. Furthermore, if a blocking scheme is composed of at most $n$ different blocking predicates (i.e. an n-ary blocking scheme) in a disjunction of conjunctions, we can regard blocking schemes as monotonic boolean functions, which are defined as expressions combining the inputs (which may appear more than once) using only the operators conjunction and disjunction (in particular, "not" is forbidden) kisielewicz1988solution . Hence the search space of all possible blocking schemes grows asymptotically as $2^{\binom{n}{\lfloor n/2 \rfloor}}$; the exact count is known as the Dedekind number. Learning a scheme skyline this way would no doubt be accurate, because all blocking schemes would be considered. However, it is space-consuming and label-wasting when the number of blocking predicates is large but the number of blocking schemes in a scheme skyline is small.
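As an illustration of how quickly this space grows, the first few Dedekind numbers (counting monotone boolean functions of $n$ variables, including the two constant functions) are:

```latex
% Growth of the scheme space: Dedekind numbers M(n) and their asymptotics.
\begin{align*}
M(1) &= 3, & M(2) &= 6, & M(3) &= 20, & M(4) &= 168,\\
M(5) &= 7{,}581, & M(6) &= 7{,}828{,}354, & \log_2 M(n) &\sim \binom{n}{\lfloor n/2 \rfloor}.
\end{align*}
```

Already at $n = 16$ predicates, as in our smaller datasets, exhaustive enumeration is far beyond reach, which motivates the pruning strategies of our algorithms.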

To analyze the search complexity of the algorithms Grid-based Naive Learning and Grid-based Active Learning, we first analyze the complexity of ASL. Given $n$ blocking predicates, with a sufficient label budget, in the worst case we learn an n-ary blocking scheme as output. During this process, we first need to search through all $n$ 1-ary blocking schemes; then, based on the locally optimal one, we search through the extended blocking schemes of 2-ary, then of 3-ary, and so on. Accordingly, the search complexity of this process is $O(n^2)$. Given a threshold interval $\delta$, the total complexity of Grid-based Naive Learning is $O(n^2/\delta)$, and in the worst case the total complexity of Grid-based Active Learning is the same as that of Grid-based Naive Learning. For the Progressive Learning algorithm, if we use $N_i$ to denote the number of schemes selected at arity $i$, the total complexity is $O(2n \sum_i N_i)$, where the factor 2 accounts for both the conjunction and the disjunction of two schemes. In the worst case where all schemes are on the skyline, $\sum_i N_i$ is of the order of the Dedekind number. However, if $\delta$ is introduced into this algorithm, i.e., there must be at least a distance of $\delta$ between two schemes on the skyline, then $N_i \leq 1/\delta$ and the total complexity is $O(2n^2/\delta)$ in the worst case.

4.5.2 Time complexity

We further discuss the time complexity of sampling. To tackle the class imbalance problem, we select both similar and dissimilar samples w.r.t. a given blocking scheme $s$. However, as explained in Section 3, the ratio of similar to dissimilar samples may be highly imbalanced, and it may even be impossible to select one kind of sample w.r.t. a blocking predicate. In the worst case, we have to traverse the whole dataset to obtain one sample. Hence the time complexity to generate samples is $O(k \cdot |D|)$, where $D$ is the dataset and $k$ is the sample size for one predicate.

To make the algorithms efficient under a large sample budget, we adopt index tables that store the encoded values for each candidate scheme, so that the algorithm can choose the similar or dissimilar samples it needs in linear time. In this way, the time complexity is $O(|D|)$ for any candidate scheme.
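A sketch of such an index table, reusing the `soundex` function from the Section 2 sketch; it is built in a single pass over the dataset, after which similar pairs share a key and dissimilar pairs come from different keys:

```python
from collections import defaultdict

def build_index(records, attribute, func):
    """Inverted index from encoded value to record ids, built in O(|D|)."""
    index = defaultdict(list)
    for rid, record in records.items():
        index[func(record[attribute])].append(rid)
    return index

index = build_index({1: {"author": "Gale"}, 2: {"author": "Gaile"},
                     3: {"author": "Smith"}}, "author", soundex)
# index == {'G400': [1, 2], 'S530': [3]}
```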

5 Evaluation

We have evaluated our algorithms to experimentally verify their performance. All our experiments have been run on a server with 6-core 64-bit Intel Xeon 2.4 GHz CPUs and 128 GBytes of memory.

5.1 Experimental Setup

We present the datasets, blocking predicates, baseline approaches and measures used in our experiments.

5.1.1 Datasets

We have used four datasets in our experiments: (1) Cora (available from http://secondstring.sourceforge.net), which contains bibliographic records of machine learning publications. (2) DBLP-Scholar (available from the same source), which contains bibliographic records from the DBLP and Google Scholar websites. (3) DBLP-ACM kopcke2010evaluation , which contains bibliographic records from the DBLP and ACM websites. (4) North Carolina Voter Registration (NCVR, available from http://alt.ncsbe.gov/data/), which contains real-world voter registration information of people from North Carolina in the USA; two sets of records collected in October 2011 and December 2011, respectively, are used in our experiments. We summarize the characteristics of these datasets in Table 1. A complete list of the attributes in these datasets is presented in Table 2.

| Datasets | Attributes |
| --- | --- |
| Cora | authors, pub_details, title, affiliation, conf_journal, location, publisher, year, pages, editors, appear, month |
| DBLP-Scholar | title, authors, venue and year |
| DBLP-ACM | |
| NCVR | county_id, county_desc, voter_reg_num, voter_status_desc, voter_status_reason_desc, absent_ind, last_name, first_name, midl_name, full_name_rep, full_name_mail, reason_cd, status_cd, house_num, street_name, street_type_cd, res_city_desc and state_cd |

Table 2: Attributes of datasets

5.1.2 Blocking predicates

We have used the following blocking functions in our experiments:

  • Exact-Match: This blocking function takes two strings as input and compares the string values. The function returns true if the strings are exactly the same.

  • Soundex: This blocking function takes two strings as input and transfers them into soundex codes based on a reference table holmes2002improving . It returns true if the two codes are the same.

  • Double-Metaphone: This blocking function also takes two strings as input and transfers each string into two codes based on two reference tables philips2000double . It returns true if either code is the same. Compared with Soundex, it performs better when applied to Asian names.

  • Get-substring: This blocking function takes only a segment (e.g. the first four letters) of the whole string for comparison, and returns true if the segments from the two strings are the same.

The above blocking functions have been applied to all attributes in the datasets depicted in Table 2, which accordingly leads to 16 or 72 blocking predicates in each dataset as shown in Table 1.

5.1.3 Baseline approaches

We have used the following approaches as baselines: (1) Fisher kejriwal2013unsupervised , the state-of-the-art unsupervised scheme learning approach proposed by Kejriwal and Miranker; details of this approach are outlined in Section 6. (2) TBlo fellegi1969theory , a traditional blocking approach based on expert-selected attributes; in the survey christen2012survey , this approach has a better performance than the other approaches in terms of the F-measure results. (3) RSL (Random Scheme Learning), an algorithm with the same structure as the ASL algorithm but which uses random sampling, instead of active sampling, to build a training set and learn blocking schemes. In each experiment, we have run RSL ten times and we present the average results of the blocking schemes it has learned.

5.1.4 Measures

We use the following measures christen2012survey to evaluate the blocking quality of our approach. Reduction Ratio (RR) is one minus the total number of record pairs in blocks divided by the total number of record pairs without blocks; i.e., RR measures the reduction in the number of compared pairs with and without blocking. Pairs Completeness (PC) and Pairs Quality (PQ) have been defined in Section 2. F-measure (FM) is the harmonic mean of PC and PQ, i.e. $FM = \frac{2 \cdot PC \cdot PQ}{PC + PQ}$.

The size of a training set directly affects the scheme skyline learning results. If the training set is small, an algorithm may learn different skylines in different runs. This is because undersampling can lead to biased samples which do not represent the characteristics of the whole sampling space. Hence, we define the notion of constraint satisfaction as $CS = k / N$ to describe the learning stability of an algorithm, where $k$ denotes the number of times that the algorithm learns a given blocking scheme (or scheme skyline) and $N$ is the total number of times the algorithm runs. For example, if an algorithm runs ten times and learns three different scheme skylines 2, 3, and 5 times, respectively, then the CS value for the third scheme skyline is 50% in this case.

In the rest of this section, we will present the experimental results of our skyline algorithms, and compare the performance of our approach with the baseline approaches.

5.2 Label Efficiency

Figure 6: Comparison on constraint satisfaction by ASL and RSL under different label budgets over four datasets

We have evaluated the label efficiency of our algorithms.

5.2.1 Label cost

The label costs of our proposed algorithms, namely Naive-Sky (Algorithm 2), Active-Sky (Algorithm 3) and Pro-Sky (Algorithm 4), are shown in Table 3, which records the label cost of each algorithm for learning the scheme skylines over the four datasets. For Naive-Sky, the label budget begins with 50 and increases by 50 for each run of ASL; for Active-Sky, the budget is the same, and the total label cost is accumulated as the PC threshold increases. For Pro-Sky, the label budget begins with 500 and increases by 500. The label cost is recorded when CS reaches 90% in ten runs. Table 3 shows that Active-Sky uses fewer labels than Naive-Sky; moreover, Pro-Sky reduces the label cost to between about one third and one half of the label usage of Naive-Sky.

| Algorithm | Cora | DBLP-Sc. | DBLP-ACM | NCVR |
| --- | --- | --- | --- | --- |
| Naive-Sky | 6000 | 5000 | 3000 | 3500 |
| Active-Sky | 4200 | 3500 | 1250 | 1400 |
| Pro-Sky | 2500 | 2000 | 1000 | 1000 |

Table 3: Comparison on label cost of scheme skyline algorithms Naive-Sky, Active-Sky and Pro-Sky with CS = 90%
| PC threshold | Cora | DBLP-Scholar | DBLP-ACM | NCVR |
| --- | --- | --- | --- | --- |
| 0.2 | 600 | 500 | 300 | 300 |
| 0.4 | **400** | 350 | 200 | 350 |
| 0.6 | 450 | **250** | **150** | 250 |
| 0.8 | 550 | 300 | 200 | **200** |
| 0.9 | 500 | **250** | 300 | 250 |
| RSL | 7,900 | 10,000+ | 2,200 | 10,000+ |

Table 4: Comparison on label cost of ASL and RSL with CS = 90%
| Algorithm | δ | Cora | DBLP-Scholar | DBLP-ACM | NCVR |
| --- | --- | --- | --- | --- | --- |
| Naive-Sky (budgets 6000 / 5000 / 3000 / 3500) | 0.1 | 18.74 | 78.18 | 76.95 | 225.08 |
|  | 0.05 | 21.55 | 89.9 | 90.8 | 267.85 |
| Active-Sky (budgets 4200 / 3500 / 1250 / 1400) | 0.1 | 12.48 | 56.28 | 33.23 | 118.4 |
|  | 0.05 | 14.68 | 60.73 | 34.55 | 122.16 |
| Pro-Sky (budgets 2200 / 1600 / 800 / 800) | 0.1 | 5.6 | 32.56 | 11.38 | 63.09 |
|  | 0.05 | 6.27 | 38.09 | 13.28 | 70.56 |

Table 5: Comparison on the running time (RT, in seconds) of different algorithms over four datasets

Both the Naive-Sky and Active-Sky algorithms use ASL to build scheme skylines. In order to analyze the factors that affect the label cost, and to show that our active sampling strategy outperforms random sampling, we have evaluated the label efficiency of the ASL algorithm. Here, we present the label costs for learning a stable blocking scheme with CS = 90% under different PC thresholds, and compare them with the number of labels required by RSL (the random sampling algorithm) in Table 4. The minimum label cost for each dataset is marked in bold. In our experiments, the label budget for ASL under a given PC threshold starts at 50 and is then increased by 50 each time; the label budget for RSL likewise starts at 50 and is increased by 50 each time. The experiments for both ASL and RSL terminate when the CS values of the learned blocking schemes reach 90% in ten consecutive runs.

5.2.2 Constraint satisfaction evaluation

To further analyze the label efficiency of our algorithms, especially the Naive-Sky and Active-Sky algorithms, we have monitored the constraint satisfaction under different label budgets (ranging from 20 to 500) in terms of various PC thresholds over the four datasets. The results are presented in Fig. 6.

We use the total label budget as the training label size for RSL to make a fair comparison between active sampling and random sampling. Our experimental results show that random sampling with a limited number of labels fails to identify an optimal blocking scheme. Additionally, both the PC threshold and the label budget affect the constraint satisfaction. In general, when the label budget increases or the PC threshold decreases, the CS value grows. It is worth noting that an extremely high PC threshold is usually harder to achieve, and thus sometimes no scheme that satisfies the threshold can be learned due to the limitation of samples (e.g. the black line with square marks under a sample budget of 100). On the other hand, if the PC threshold is set extremely low under a low label budget (e.g. the red line, corresponding to the lowest threshold), quite a number of blocking schemes may satisfy the threshold with different PQ values, which also leads to a low CS value. This happens because, in our approach, only the optimal scheme is selected as a candidate; with a limited number of samples, a sub-optimal blocking scheme may be selected instead.

5.2.3 Label efficiency analysis

Here we analyze the factors that can affect the label cost. First, as explained before, both extremely high and extremely low PC thresholds may cost more labels than other cases. Second, datasets of a smaller size (i.e., a smaller number of record pairs), such as DBLP-ACM, often need fewer labels. Furthermore, datasets with more attributes (e.g. NCVR) have a larger number of blocking predicates and thus need a higher label budget for sampling all their blocking predicates. The quality of a dataset may also affect the label cost. Generally, the cleaner a dataset is, the fewer labels it costs. For example, even though Cora has the smallest numbers of attributes and records, it still needs the largest number of labels, because it contains fuzzy values (e.g. mis-spelled names, exchanged first and last names). On the contrary, NCVR needs a lower label budget even though it has more attributes and records. The cleanness of a dataset also affects the distribution of label costs under different thresholds. For example, with a PC threshold of 0.8, we need only 200 labels for NCVR but 550 labels for Cora to generate blocking schemes with CS = 90%.

5.3 Time Efficiency

We show the run time (RT) of our three algorithms over the four datasets in Table 5. The label budgets we use are listed in Table 5, chosen so that the algorithms can learn consistent results as output. Two threshold intervals are used, i.e. $\delta = 0.1$ and $\delta = 0.05$. Generally speaking, the run time depends on two factors: (1) the size of the dataset and (2) the number of samples we need. As the label budget and the number of records in a dataset increase, the run time grows. For all algorithms, the threshold interval may affect the run time, but not significantly.

Figure 7: The progressive process for learning scheme skylines by Pro-Sky over four datasets

5.4 Blocking Quality

Now we discuss the blocking quality of our algorithms.

Figure 8: Blocking quality of PC (a) and PQ (b) under different PC thresholds over four datasets

5.4.1 Under different arities

The experimental results of the Pro-Sky algorithm under different arities are presented in Fig. 7. We have also tested Naive-Sky and Active-Sky with $\delta = 0.1$ and $\delta = 0.05$. However, because the blocking schemes in the scheme skylines learned by these algorithms are subsets of the scheme skyline generated by Pro-Sky, in which no $\delta$ is defined, we omit the results for Naive-Sky and Active-Sky. Fig. 7 presents the progressive process of generating the scheme skylines, as well as the blocking schemes from 1-ary to 3+-ary, over the four datasets. We do not present the further process beyond 3-ary, as the scheme skylines have already been generated within 3-ary. In Fig. 7(h) and 7(l) for the NCVR dataset, we present detailed schemes within a narrow PC range, because most of the scheme skyline points are located in this range. Such schemes cannot be learned by Naive-Sky or Active-Sky unless $\delta$ is set to be very small (e.g. 0.0001).

Figure 9: Comparison on blocking quality by different blocking approaches over four datasets using the measures: (a) FM, (b) RR, (c) PC, and (d) PQ

5.4.2 Under different PC thresholds

To make a detailed comparison with the existing work, we have conducted experiments based on our ASL algorithm under different PC thresholds. We have monitored the blocking schemes learned by the ASL algorithm, as well as their PC and PQ values, under different PC thresholds, and the results are presented in Fig. 8. They show how PC and PQ vary as the learned schemes change with increasing PC threshold, given a sufficient label budget.

As the threshold increases, the learned blocking schemes generate lower PC and higher PQ. However, there are still some points with the same PC and PQ even though the threshold increases. This indicates that under certain thresholds the algorithm learns the same blocking scheme, which confirms, from another angle, the efficiency of Active-Sky compared with Naive-Sky. For example, the performance of our approach on the NCVR dataset is consistent whatever the PC threshold is, because the learned blocking scheme can generate blocks with both high PC and high PQ. On the contrary, for other datasets such as Cora, a lower PC threshold allows the algorithm to seek blocking schemes that generate higher PQ, but the PC decreases. We can also notice that for the DBLP-ACM and NCVR datasets, blocking schemes with both high PC and PQ are learned with a low threshold, but for DBLP-Scholar, no blocking scheme can be learned with high PQ (i.e. higher than 0.6). In the figure, blocking schemes with a PC threshold of 1.0 are normally hard to achieve; hence we present the maximum PC our schemes can achieve.

5.4.3 Compared with baselines

Existing work largely focuses on learning a single blocking scheme, while our approach aims to learn a scheme skyline, which is a set of blocking schemes. Hence we first conduct experiments to present our scheme skyline learning results and show that, in a 2-dimensional space of PC and PQ, the scheme points learned by the baselines TBlo and Fisher are dominated by, or contained in, our scheme skylines over the four datasets, as shown in Fig. 7(i)-(l).

| | PC (TBlo) | PC (ASL) | PQ (TBlo) | PQ (ASL) |
| --- | --- | --- | --- | --- |
| Cora | 0.3296 | 0.3167 | 0.6758 | 0.9898 |
| DBLP-Scholar | 0.7492 | 0.7492 | 0.2869 | 0.2869 |
| DBLP-ACM | 0.2801 | 0.8826 | 0.4387 | 0.8854 |
| NCVR | 0.9981 | 0.9979 | 0.6558 | 0.9640 |

| | PC (Fisher) | PC (ASL) | PQ (Fisher) | PQ (ASL) |
| --- | --- | --- | --- | --- |
| Cora | 0.9249 | 0.9249 | 0.2219 | 0.2219 |
| DBLP-Scholar | 0.9928 | 0.9928 | 0.0320 | 0.0320 |
| DBLP-ACM | 0.9661 | 0.9686 | 0.0522 | 0.6714 |
| NCVR | 0.9990 | 0.9990 | 0.0774 | 0.0774 |

Table 6: Comparison on blocking quality where the baseline PC values are selected as thresholds

To make a fair point-to-point comparison with the baselines, we conduct experiments which take the baseline PC values as the PC thresholds for the ASL algorithm, and compare the resulting PC and PQ values, which are listed in Table 6. Compared with TBlo, our approach generates blocks with much higher PQ (i.e. from about 50% to 101% higher) while retaining a similar PC, except on DBLP-Scholar, where the results are the same. Compared with Fisher, in most cases the results are the same, except on DBLP-ACM, where the results generated by our approach have a nearly 12 times higher PQ. In general, our approach generates results as good as or better than the baseline approaches under the same thresholds, given a sufficient label budget.

We also select the schemes with the highest F-measure values in the scheme skylines and compare them with the baseline schemes in terms of FM, RR, PC and PQ. The FM results are shown in Fig. 9(a), in which our approach outperforms all the baselines over all the datasets. In Fig. 9(b), all the approaches yield high RR values over the four datasets. In Fig. 9(c), the PC values of our approach are not the highest over the four datasets, but they are not much lower than the highest ones (i.e. within 10% lower, except on DBLP-Scholar). However, our approach generates higher PQ values than all the other approaches, from 15% higher on NCVR (0.9956 vs 0.8655) to 20 times higher on DBLP-ACM (0.6714 vs 0.0320), as shown in Fig. 9(d).

6 Related Work

Scheme-based blocking techniques for entity resolution were first mentioned by Fellegi and Sunter fellegi1969theory , who used a pair (attribute, matching-method) to define blocks. For example, the soundex code of both names Gail and Gayle is "G400", so records that contain either of these names will be placed into the same block. At that time, a blocking scheme was chosen by domain experts without using any learning algorithm. Since then, scheme learning approaches have generally fallen into two categories: (1) supervised blocking scheme learning approaches michelson2006learning ; cao2011leveraging , and (2) unsupervised blocking scheme learning approaches kejriwal2013unsupervised ; kejriwal2014two ; kejriwal2015dnf .

Michelson and Knoblock michelson2006learning first proposed a blocking scheme learning algorithm, called the Blocking Scheme Learner, which is the first supervised algorithm to learn blocking schemes. This approach adopts the Sequential Covering Algorithm (SCA) mitchell1997machine to learn schemes with both high PC and high RR. In the same year, Bilenko et al. bilenko2006adaptive proposed two blocking scheme learning algorithms, called ApproxRBSetCover and ApproxDNF, to learn disjunctive blocking schemes and DNF (i.e. Disjunctive Normal Form) blocking schemes, respectively. Both algorithms are supervised and need training samples with labels. Later, Cao et al. cao2011leveraging noticed that obtaining labels for training samples is very expensive, and thus used both labeled and unlabeled samples in their approach. Their algorithm can learn a blocking scheme using conjunctions of blocking predicates which satisfy both a minimum true-match coverage and a minimum precision criterion.

Later on, Kejriwal et al. kejriwal2013unsupervised proposed an unsupervised algorithm for learning blocking schemes, where no labels from human labelers are needed. Instead, a weak training set is applied, where both positive and negative labels are generated by calculating the similarity of two records in terms of TF-IDF. The predicate with the highest score is selected as part of the result, and if a lower-ranking predicate can cover more positively labeled pairs in the training set, it will be selected in a disjunctive form. After traversing all the predicates, a blocking scheme is learned. Although this approach circumvents the need for human labelers, using unlabeled samples, or samples labeled only based on string similarity, is not reliable wang2016semantic , and hence the blocking quality cannot be guaranteed wang2016clustering . We also note that the meta-blocking approaches papadakis2014meta ; papadakis2014supervised build upon existing blocking results and do not target learning blocking schemes.

However, so far, all related approaches have focused on learning a single blocking scheme under given constraints. No work has been reported that provides an overview of all possible blocking schemes under different constraints, so that users can choose their preferred blocking schemes under their own constraints. To fill this gap, in this paper, we use skyline techniques to present a set of blocking schemes under different constraints.

The concept of skyline query has been widely studied in the context of databases. A good number of approaches have been developed in recent years chomicki2013skyline ; lin2007selecting ; tao2009distance ; sarma2011representative , which primarily focus on learning representative skylines, such as top-k RSP (representative skyline points) lin2007selecting , k-center (i.e. choosing k centers and one skyline point for each center) tao2009distance , and threshold-based preference sarma2011representative . A survey by Kalyvas and Tzouramanis kalyvas2017survey reviews the major approaches in this area.

From an algorithmic perspective, a naive algorithm for skyline queries (e.g., the nested-loop algorithm) has time complexity $O(n^2 d)$, where $n$ is the number of records and $d$ is the number of attributes in a given database. Later, several algorithms were proposed to improve the efficiency of skyline queries by exploiting properties ignored by the naive algorithm. In the early days, Borzsonyi et al. borzsony2001skyline proposed the BNL (block nested-loop) algorithm based on the transitivity of the dominance relation (e.g. if $a$ dominates $b$ and $b$ dominates $c$, then $a$ dominates $c$). Then, Chomicki et al. chomicki2003skyline ; chomicki2005skyline proposed the SFS (sort-filter-skyline) algorithm, which improves on BNL by being progressive and performing an optimal number of comparisons. Sheng and Tao proposed an EM (external memory) model based on the attribute order discussed by Borzsonyi et al. borzsony2001skyline . Morse et al. proposed the LS-B (lattice skyline) algorithm based on the low cardinality of some attributes (e.g. movie ratings are integers within the small range [1, 5]) morse2007efficient . Papadias et al. proposed the BBS (branch-and-bound skyline) algorithm, which indexes all input records by an R-tree papadias2003optimal .
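
To make the dominance relation and the BNL idea concrete, here is a minimal sketch, assuming larger values (e.g. PC and PQ) are preferred on every dimension; the sample points are hypothetical.

```python
def dominates(a, b):
    """a dominates b if a is no worse on every dimension and strictly
    better on at least one (larger values assumed preferable)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def bnl_skyline(points):
    """Block-nested-loop style skyline: maintain a window of non-dominated points."""
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue                                          # p is dominated; discard it
        window = [w for w in window if not dominates(p, w)]   # drop points p dominates
        window.append(p)
    return window

points = [(0.9, 0.2), (0.7, 0.6), (0.8, 0.5), (0.6, 0.4)]
print(bnl_skyline(points))   # [(0.9, 0.2), (0.7, 0.6), (0.8, 0.5)]
```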

Nevertheless, existing work on skyline queries aims to efficiently compute the skyline over a database in which records and attributes are known. In contrast, our study shifts the focus to learning the skyline of blocking schemes in terms of a given set of selection criteria whose actual values are not available in a database. In particular, in many real-world applications, only a limited number of labels is allowed for assessing blocking schemes. Thus, how to efficiently and effectively learn the skyline of blocking schemes is a difficult task, as previously discussed. To overcome the difficulty of limited label budgets, in this paper, we leverage active learning techniques to find informative samples and improve the performance of learning.

Active learning has been extensively studied in the past settles2012active . Ertekin et al. ertekin2007learning showed that active learning may provide almost the same or even better results for the class imbalance problem, compared with random sampling approaches such as oversampling the minority class and/or undersampling the majority class chawla2002smote . In recent years, a number of active learning approaches have been introduced for entity resolution arasu2010active ; bellare2012active ; fisher2016active . For example, Arasu et al. arasu2010active proposed an active learning algorithm based on the monotonicity assumption, i.e. the more textually similar a pair of records is, the more likely it is a matched pair. Their algorithm aims to maximize recall under a manually specified precision constraint. To reduce the label and computational complexity, Bellare et al. bellare2012active proposed an approach that solves the maximal recall with precision constraint problem by converting it into a classifier learning problem. Their main algorithm, called ConvexHull, aims to find the classifier with maximal recall by finding the classifier with minimal 01-Loss, where the 01-Loss refers to the total number of false negatives and false positives. The authors then designed another algorithm, called RejectionSampling, which uses a black-box to compute the 01-Loss of each classifier. The black-box invokes the IWAL algorithm of beygelzimer2010agnostic .
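
As a small illustration of the 01-Loss criterion described above (a sketch with hypothetical candidate classifiers, not the ConvexHull or RejectionSampling algorithms themselves):

```python
def zero_one_loss(pred, labels):
    """01-Loss: the total number of false positives and false negatives."""
    fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
    return fp + fn

labels = [1, 0, 1, 1, 0]              # ground-truth match labels
classifiers = {                        # hypothetical candidate predictions
    "threshold=0.5": [1, 1, 1, 0, 0],  # one false positive, one false negative
    "threshold=0.8": [1, 0, 0, 1, 0],  # one false negative
}
best = min(classifiers, key=lambda c: zero_one_loss(classifiers[c], labels))
print(best)                            # 'threshold=0.8', the minimal-01-Loss candidate
```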

In this paper, for the first time, we study the problem of scheme skyline learning. Previous work on blocking scheme learning aims to learn a single scheme that matches the user requirements. Although skyline techniques can help to provide a set of schemes, they traditionally regard all attribute values (e.g. PC and PQ) as known, which does not hold for scheme skylines. We thus propose novel algorithms to efficiently learn a scheme skyline.

7 Conclusions

In this paper, we have proposed a scheme skyline learning approach called skyblocking, which uses skyline query techniques and active learning techniques to learn a set of optimal blocking schemes under different constraints and a limited label budget. We have tackled the class imbalance problem by solving the balanced sampling problem, which is proven to be more label-efficient than random sampling. We have also proposed a scheme extension strategy to reduce the search space and the label cost. Three algorithms are proposed for efficiently learning scheme skylines. Additionally, our approach overcomes the weaknesses of existing blocking scheme learning approaches in that: (1) previous supervised blocking scheme learning approaches require a large number of labels for learning a blocking scheme, which is expensive for entity resolution; and (2) existing unsupervised approaches generate training sets based on the similarity of record pairs, instead of true labels, so the training quality cannot be guaranteed.

References

  • [1] A. Arasu, M. Götz, and R. Kaushik. On active learning of record matching packages. In SIGMOD, pages 783–794. ACM, 2010.
  • [2] K. Bellare, S. Iyengar, A. G. Parameswaran, and V. Rastogi. Active sampling for entity matching. In SIGKDD, pages 1131–1139. ACM, 2012.
  • [3] A. Beygelzimer, D. J. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, pages 199–207, 2010.
  • [4] M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage. In ICDM, pages 87–96. IEEE, 2006.
  • [5] S. Borzsony, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421–430. IEEE, 2001.
  • [6] Y. Cao, Z. Chen, J. Zhu, P. Yue, C.-Y. Lin, and Y. Yu. Leveraging unlabeled data to scale blocking for record linkage. In IJCAI, volume 22, page 2211, 2011.
  • [7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
  • [8] S. Chester and I. Assent. Explanations for skyline query results. In EDBT, pages 349–360, 2015.
  • [9] J. Chomicki, P. Ciaccia, and N. Meneghetti. Skyline queries, front and back. SIGMOD Record, 42(3):6–18, 2013.
  • [10] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In ICDE, pages 717–719. IEEE, 2003.
  • [11] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting: Theory and optimizations. In IIPWM, pages 595–604. Springer, 2005.
  • [12] P. Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012.
  • [13] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. TKDE, 24(9):1537–1555, 2012.
  • [14] P. Christen, D. Vatsalan, and Q. Wang. Efficient entity resolution with adaptive and interactive training data selection. In ICDM, pages 727–732. IEEE, 2015.
  • [15] U. Draisbach and F. Naumann. A comparison and generalization of blocking and windowing algorithms for duplicate detection. In International Workshop on QDB, pages 51–56, 2009.
  • [16] S. Ertekin, J. Huang, L. Bottou, and L. Giles. Learning on the border: active learning in imbalanced data classification. In CIKM, pages 127–136. ACM, 2007.
  • [17] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
  • [18] J. Fisher, P. Christen, and Q. Wang. Active learning based entity resolution using markov logic. In PAKDD, pages 338–349. Springer, 2016.
  • [19] J. Fisher, P. Christen, Q. Wang, and E. Rahm. A clustering-based framework to control block sizes for entity resolution. In SIGKDD, pages 279–288. ACM, 2015.
  • [20] D. Holmes and M. C. McCabe. Improving precision and recall for soundex retrieval. In ITCC, pages 22–26. IEEE, 2002.
  • [21] C. Kalyvas and T. Tzouramanis. A survey of skyline query processing. arXiv preprint arXiv:1704.01788, 2017.
  • [22] M. Kejriwal and D. P. Miranker. An unsupervised algorithm for learning blocking schemes. In ICDM, pages 340–349. IEEE, 2013.
  • [23] M. Kejriwal and D. P. Miranker. A two-step blocking scheme learner for scalable link discovery. In OM, pages 49–60, 2014.
  • [24] M. Kejriwal and D. P. Miranker. A dnf blocking scheme learner for heterogeneous datasets. arXiv preprint arXiv:1501.01694, 2015.
  • [25] A. Kisielewicz. A solution of Dedekind’s problem on the number of isotone boolean functions. J. reine angew. Math., 386:139–144, 1988.
  • [26] H. Köpcke and E. Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2):197–210, 2010.
  • [27] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. VLDB Endowment, 3(1-2):484–493, 2010.
  • [28] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang. Selecting stars: The k most representative skyline operator. In ICDM, pages 86–95. IEEE, 2007.
  • [29] M. Michelson and C. A. Knoblock. Learning blocking schemes for record linkage. In AAAI, pages 440–445, 2006.
  • [30] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
  • [31] M. Morse, J. M. Patel, and H. V. Jagadish. Efficient skyline computation over low-cardinality domains. In VLDB, pages 267–278. VLDB Endowment, 2007.
  • [32] G. Papadakis, G. Koutrika, T. Palpanas, and W. Nejdl. Meta-blocking: Taking entity resolution to the next level. TKDE, 26(8):1946–1960, 2014.
  • [33] G. Papadakis, G. Papastefanatos, and G. Koutrika. Supervised meta-blocking. VLDB Endowment, 7(14):1929–1940, 2014.
  • [34] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In SIGMOD, pages 467–478. ACM, 2003.
  • [35] L. Philips. The double metaphone search algorithm. C/C++ users journal, 18(6):38–43, 2000.
  • [36] A. D. Sarma, A. Lall, D. Nanongkai, R. J. Lipton, and J. Xu. Representative skylines using threshold-based preference distributions. In ICDE, pages 387–398. IEEE, 2011.
  • [37] B. Settles. Active learning. Synthesis Lectures on AIML, 6(1):1–114, 2012.
  • [38] J. Shao and Q. Wang. Active blocking scheme learning for entity resolution. In PAKDD, 2018.
  • [39] Y. Tao, L. Ding, X. Lin, and J. Pei. Distance-based representative skyline. In ICDE, pages 892–903. IEEE, 2009.
  • [40] Q. Wang, M. Cui, and H. Liang. Semantic-aware blocking for entity resolution. TKDE, 28(1):166–180, 2016.
  • [41] Q. Wang, J. Gao, and P. Christen. A clustering-based framework for incrementally repairing entity resolution. In PAKDD, pages 283–295. Springer.
  • [42] Q. Wang, D. Vatsalan, and P. Christen. Efficient interactive training selection for large-scale entity resolution. In PAKDD, pages 562–573. Springer, 2015.

Acknowledgment

This work was partially funded by the Australian Research Council (ARC) under Discovery Project DP160101934.