Gradual Machine Learning for Entity Resolution

10/29/2018, by Boyi Hou, et al.

Usually considered as a classification problem, entity resolution can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning, which aims to enable effective machine learning without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the proposed approach performs considerably better than its unsupervised alternatives, and it is highly competitive with the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.


1. Introduction

The task of entity resolution (ER) aims at finding the records that refer to the same real-world entity (Christen, 2012). Usually considered as a classification problem, ER has been extensively studied in the literature (Christen, 2012; Elmagarmid et al., 2007). However, ER remains a very challenging task in many real scenarios due to the prevalence of dirty values in the data. The existing unsupervised solutions (e.g., rule-based (Singla and Domingos, 2006; Fan et al., 2009; Li et al., 2015) and clustering-based (Jin and Han, 2016)) usually have limited efficacy. Resolution accuracy can be effectively improved by training a variety of learning models (most notably deep neural networks (DNN)) for a task. However, the supervised solutions have the limitation that they usually require a lot of accurately labeled training data to achieve good performance. Unfortunately, in many real scenarios, high-quality labeled training data may require expensive manual work, and are therefore not readily available.

It can be observed that the dependence of popular learning models (e.g., DNN) on high-quality labeled data is not limited to the task of ER. The dependence is actually crucial for their success in various domains (e.g., image and speech recognition (Yu and Deng, 2014), and sentiment analysis (Tang et al., 2015)). However, in real scenarios where high-quality labeled data is scarce, their efficacy can be severely compromised. To address the limitation resulting from such dependence, this paper proposes a novel learning paradigm, called gradual machine learning, in which gradual means proceeding in small stages. Inspired by the gradual nature of human learning, which is adept at solving problems of increasing hardness, gradual machine learning begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually reasons about the labels of the more challenging instances based on the observations provided by the labeled instances.

Consider an ER task consisting of record pairs. A solution labels each record pair in the task as matching or unmatching. We observe that even though accurately labeling all the pairs in an ER task by the machine is usually very challenging, automatically labeling only some easy instances among them is usually not. It can simply be performed based on user-specified rules or unsupervised clustering techniques. Statistically speaking, a record pair with a high (resp. low) similarity can be considered to have a correspondingly high probability of being an equivalent (resp. inequivalent) pair. It follows that setting a high similarity lower bound for the matching label would result in the desired set of easy matching instances, while setting a low similarity upper bound for the unmatching label would result in the desired set of easy unmatching instances. However, the following two properties of gradual machine learning make it distinct and challenging:

  • Distribution misalignment between easy and hard instances in a task. The scenario of gradual machine learning does not satisfy the i.i.d. (independent and identically distributed) assumption underlying most machine learning models: the labeled easy instances are not representative of the unlabeled hard instances. The distribution misalignment between the labeled and unlabeled instances renders most existing learning models unfit for gradual machine learning.

  • Gradual learning by small stages. Gradual machine learning proceeds in small stages. At each stage, it chooses to only label the instance with the highest degree of evidential certainty based on the observations provided by the labeled instances. The process of iterative labeling can be performed in an unsupervised manner without requiring any human intervention.

We note that there already exist many learning paradigms for a variety of classification tasks, including transfer learning (Pan and Yang, 2010), lifelong learning (Chen and Liu, 2018), curriculum learning (Bengio et al., 2009) and self-paced learning (Kumar et al., 2010), to name a few. Unfortunately, none of them is applicable to the scenario of gradual machine learning. Transfer learning focuses on using the labeled training data in one domain to help learning in another target domain. It can handle the data distribution misalignment between the source and target domains. However, its efficacy usually depends on high-quality training data in the source domain. Gradual machine learning instead focuses on gradual learning within a task: it does not enjoy access to high-quality training data or a good classifier in a source domain to kick-start learning. Furthermore, transfer learning transfers instances or knowledge in a batch manner. Hence, it cannot perform gradual learning in small stages. Similar to transfer learning, lifelong learning studies how to leverage the knowledge mined from past tasks to serve the current task. One of their major differences is that lifelong learning usually supposes that the current task also has good training data, and aims to further improve the learning using both the target-domain training data and the knowledge gained in past learning.

Both curriculum learning (Bengio et al., 2009) and self-paced learning (Kumar et al., 2010) investigate how to organize a curriculum (the presenting order of training examples) for better performance. In curriculum learning, the curriculum is assumed to be given by an oracle beforehand, and remains fixed thereafter. In self-paced learning, the curriculum is instead dynamically generated by the learner itself, according to what the learner has already learned. It is worth pointing out that both curriculum learning and self-paced learning depend on the i.i.d. assumption and require good-coverage training samples for their efficacy. In contrast, the scenario of gradual machine learning does not satisfy the i.i.d. assumption. More recently, noticing the limitation of DNNs' dependence on large quantities of manually labeled data, Snorkel (Ratner et al., 2017a; Bach et al., 2017; Ratner et al., 2017b; Ratner et al., 2016a) aims to enable automatic and massive machine labeling by specifying a wide variety of labeling functions. The results of machine labeling are supposed to be fed to DNN models. It is noteworthy that Snorkel aims to provide good-coverage training data for DNNs. In contrast, gradual machine learning does not depend on DNNs; it instead aims to address the limitation resulting from the dependence on good-coverage training data.

To address the challenges of gradual machine learning, we propose to first extract a wide variety of common features shared between the easy and hard instances. Features serve as the medium to convey the knowledge obtained from the labeled easy instances to the unlabeled harder ones. We then construct a factor graph consisting of the instances and their features to facilitate effective knowledge conveyance. Finally, gradual learning is fulfilled by iterative inference on the constructed factor graph. We summarize the major contributions of this paper as follows:

  1. We propose a novel learning paradigm of gradual machine learning, which aims to alleviate the burden of manually labeling large quantities of training data for challenging classification tasks;

  2. We present an effective solution based on the proposed paradigm for entity resolution. We propose a package of techniques, including easy instance labeling, feature extraction and influence modeling, and scalable gradual inference, to enable effective and efficient gradual machine learning for ER.

  3. Our extensive experiments on real data validate the efficacy of the proposed approach. Our empirical results show that the proposed approach performs considerably better than the unsupervised alternatives, and it is highly competitive with the state-of-the-art supervised techniques. It also scales well with task workload.

The rest of this paper is organized as follows: Section 2 reviews more related work. Section 3 introduces the learning paradigm. Section 4 proposes the solution for ER. Section 5 proposes the scalable solution for gradual learning. Section 6 presents our empirical evaluation results. Finally, Section 7 concludes this paper with some thoughts on future work.

2. Related Work

Machine Learning Paradigms

There exist many machine learning paradigms proposed for a wide variety of classification tasks. Due to space limits, we cannot exhaustively review all of them here. We will instead review those closely related to our work and discuss their differences from the proposed gradual machine learning.

Traditional machine learning algorithms make predictions on future data using statistical models that are trained on previously collected labeled or unlabeled training data (Fellegi and Sunter, 1969; Yin et al., 2006; Singla and Domingos, 2006; Sarawagi and Bhamidipaty, 2002; Kuncheva and Rodriguez, 2007; Christen, 2008; Arasu et al., 2010; Bellare et al., 2012). In many real scenarios, the labeled data may be too scarce to build a good classifier. Semi-supervised classification (Blum and Mitchell, 1998; Joachims, 1999; Nigam et al., 2000) addresses this problem by making use of a large amount of unlabeled data together with a small amount of labeled data. Nevertheless, most of these techniques assume that the labeled and unlabeled data share the same distribution.

In contrast, transfer learning (Pan and Yang, 2010) allows the distributions of the data used in training and testing to be different. It focuses on using the labeled training data in one domain to help learning in another target domain. Other learning techniques closely related to transfer learning include lifelong learning (Chen and Liu, 2018) and multi-task learning (Caruana, 1997). Lifelong learning is similar to transfer learning in that it also focuses on leveraging the experience gained on past tasks for the current task. However, different from transfer learning, it usually assumes that the current task has good training data, and aims to further improve the learning using both the target-domain training data and the knowledge gained in past learning. Multi-task learning instead tries to learn multiple tasks simultaneously even when they are different. A typical approach for multi-task learning is to uncover the pivot features shared among multiple tasks. However, none of these learning paradigms can be applied to the scenario of gradual machine learning. Firstly, focusing on unsupervised learning within a task, gradual machine learning does not enjoy access to good labeled training data or a well-trained classifier to kick-start learning. Secondly, the existing techniques transfer instances or knowledge between tasks in a batch manner. As a result, they do not support gradual learning in small stages on the instances with increasing hardness within the same task.

The other closely related machine learning paradigms include curriculum learning (CL) (Bengio et al., 2009) and self-paced learning (SPL) (Kumar et al., 2010). Both of them are similar to gradual machine learning in that they were also inspired by the learning principle underlying the cognitive process of humans, which generally starts with learning the easier aspects of a task and then gradually takes more complex examples into consideration. However, both of them depend on a curriculum, which is a sequence of training samples essentially corresponding to a list of samples ranked in ascending order of learning difficulty. A major disparity between them lies in the derivation of the curriculum. In CL, the curriculum is assumed to be given by an oracle beforehand, and remains fixed thereafter. In SPL, the curriculum is instead dynamically generated by the learner itself, according to what the learner has already learned. It is worth pointing out that, being based on the traditional learning models, both CL and SPL depend on the i.i.d. assumption and require good-coverage training examples for their efficacy. In contrast, the target scenario of gradual machine learning does not satisfy the i.i.d. assumption; gradual machine learning instead aims to address the limitation resulting from the dependence on good-coverage training data.

Online learning (Kivinen et al., 2004) and incremental learning (Schlimmer and Granger, 1986) have also been proposed for the scenarios where training data only become available gradually over time or their size exceeds the system memory limit. Online learning is usually used to update the best predictor for future data at each step, as opposed to the batch learning techniques, which generate the best predictor by learning on the entire training data set at once. In online learning, it is possible that the model forgets previously learned inferences, a phenomenon called catastrophic interference. The latter can however be addressed by incremental learning, whose aim is for the learning model to adapt to new data without forgetting its existing knowledge. Being based on the traditional learning models, both online learning and incremental learning depend on high-quality training data for their efficacy. Therefore, they cannot be applied to gradual learning.

Work on Entity Resolution

Research efforts on unsupervised entity resolution have mainly been dedicated to devising various distance functions to measure pair-wise similarity (Monge and Elkan, 1996; Cohen, 2000). Two records are considered to be equivalent if their similarity exceeds a pre-specified threshold. An alternative unsupervised technique based on graph-theoretic fusion was also proposed for ER in (Ravikumar and Cohen, 2004; Zhang et al., 2015). However, the effectiveness of these unsupervised techniques is limited. Moreover, it is usually very challenging to design an appropriate metric for a new dataset (Bilenko et al., 2003).

To overcome the limitations of the unsupervised techniques, supervised alternatives based on probabilistic theory or machine learning have been widely used in practice. They view the problem of ER as a binary classification task and then apply various learning models (e.g., SVM (Tong and Koller, 2001; Platt et al., 1999; Christen, 2008; Arasu et al., 2010; Bellare et al., 2012; Sarawagi and Bhamidipaty, 2002), naive Bayes (Berger, 1985), stochastic models (Ravikumar and Cohen, 2004; Singla and Domingos, 2006), and DNN models (Mudgal et al., 2018)) to the task. It has been empirically shown that, compared with the unsupervised techniques, they can effectively improve resolution accuracy. However, the limitation of the supervised techniques is that they require a lot of accurately labeled training data for good performance, which may not be readily available in real applications.

The progressive paradigm for ER (Whang et al., 2013b; Altowim et al., 2014) has also been proposed for the application scenarios in which ER should be processed efficiently but does not necessarily need to produce high-quality results. Taking a pay-as-you-go approach, it studies how to maximize result quality given a pre-specified resolution budget. It fulfills this purpose by constructing various resolution hints that can be used by a variety of existing ER algorithms as guidance on which entities to resolve first. It is worth pointing out that its target scenario is different from that of gradual machine learning, whose major challenge is to label the instances with increasing hardness without a resolution budget. The proposed techniques therefore cannot be applied to gradual machine learning.

It has been well recognized that pure machine algorithms may not be able to produce satisfactory results in many practical scenarios (Li et al., 2016). Therefore, many researchers  (Altowim et al., 2018; Mozafari et al., 2014; Chai et al., 2016; Wang et al., 2012; Vesdapunt et al., 2014; Gruenheid et al., 2012; Getoor and Machanavajjhala, 2012; Chu et al., 2015; Firmani et al., 2016; Whang et al., 2013a; Gokhale et al., 2014; Wang et al., 2015; Verroios et al., 2017) have studied how to crowdsource an ER workload. In (Wang et al., 2012), the authors studied how to generate Human Intelligence Tasks (HIT), and how to incrementally select the instance pairs for human verification such that the required human cost can be minimized. In (Whang et al., 2013a), the authors focused on how to select the most beneficial questions for humans in terms of expected accuracy. The authors of (Chai et al., 2016) proposed a cost-effective framework that employs the partial order relationship on instance pairs to reduce the number of asked pairs. While these researchers addressed the challenges specific to crowdsourcing, we instead investigate a different problem in this paper: how to enable unsupervised learning given an ER task without the requirement for human intervention.

3. Learning Paradigm

In this section, we first define the task of ER, and then give an overview of the proposed paradigm using ER as a test case.

3.1. Task Statement

Entity resolution reasons about the equivalence between two records. Two records are deemed to be equivalent if and only if they correspond to the same real-world entity; otherwise, they are deemed to be inequivalent. We call a record pair an equivalent pair if and only if its two records are equivalent; otherwise, it is called an inequivalent pair. Given an ER workload consisting of record pairs, a solution labels each pair in the workload as matching or unmatching.

Notation | Description
D | an ER workload consisting of record pairs
S | a subset of D
L | a labeling solution for D
d_i | a record pair in D
|S| | the total number of pairs in S
|S|_e | the total number of equivalent pairs in S
p_i | the estimated equivalence probability of d_i
f | a feature of a record pair
F(d) | the feature set of a pair d
D_f | the set of record pairs having the feature f
Table 1. Frequently Used Notations.

For the sake of presentation simplicity, we summarize the frequently used notations in Table 1. As usual, we measure the quality of a labeling solution by the metrics of precision and recall. Precision denotes the fraction of equivalent pairs among all the pairs labeled as matching, while recall denotes the fraction of the equivalent pairs labeled as matching among all the equivalent pairs. Given an ER workload D and a labeling solution L, suppose that D_M denotes the set of record pairs labeled as matching, and D_U denotes the set of record pairs labeled as unmatching. Then, the precision of L on D can be represented by

\mathrm{precision}(L, D) = \frac{|D_M|_e}{|D_M|},    (1)

in which |S| denotes the total number of pairs in a set S and |S|_e denotes the total number of equivalent pairs in S. Similarly, the recall of L can be represented by

\mathrm{recall}(L, D) = \frac{|D_M|_e}{|D_M|_e + |D_U|_e}.    (2)

The overall quality of entity resolution is usually measured by the unified metric of F-1, as represented by

F_1(L, D) = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.    (3)

Finally, the task of entity resolution is defined as follows:

Definition 3.1 (Entity Resolution).

Given a workload D consisting of record pairs, the task of entity resolution is to give a labeling solution L for D such that F_1(L, D) is maximized.

3.2. Paradigm Overview

Figure 1. Paradigm Overview.

The general idea of gradual machine learning is to first accurately label some easy instances in a task by the machine, and then gradually label the more challenging instances in it in iterative small stages. At each stage, it ranks the unlabeled instances by the metric of evidential certainty and chooses to label only the instance with the highest degree of certainty. The framework of gradual machine learning, as shown in Figure 1, consists of the following three essential steps:

  • Easy Instance Labeling. Given a classification task, it is usually very challenging to accurately label all the instances in the task without good-coverage training examples. However, the work becomes much easier if we only need to automatically label some easy instances in the task with high accuracy. In the case of ER, while the pairs with medium similarities are usually challenging for machine labeling, highly similar (resp. dissimilar) pairs have fairly high probabilities of being equivalent (resp. inequivalent). They can therefore be chosen as easy instances. In real scenarios, easy instance labeling can be performed based on simple user-specified rules or existing unsupervised learning techniques. For instance, in unsupervised clustering, an instance close to the center of a cluster in the feature space can be considered an easy instance, because it has only a remote chance of being misclassified. Gradual machine learning begins with the observations provided by the labels of easy instances. Therefore, the high accuracy of automatic machine labeling on easy instances is critical for its ultimate performance on a given task.

  • Feature Extraction and Influence Modeling. Features serve as the medium to convey the knowledge obtained from the labeled easy instances to the unlabeled harder ones. This step extracts the common features shared by the labeled and unlabeled instances. To facilitate effective knowledge conveyance, it is desirable that a wide variety of features are extracted to capture as much information as possible. For each extracted feature, this step also needs to model its influence over the labels of its relevant instances.

  • Gradual Inference. This step gradually labels the instances with increasing hardness in a task. Unfortunately, the scenario of gradual learning does not satisfy the i.i.d. assumption: the labeled instances are not representative of the more challenging unlabeled ones, so the traditional learning models cannot be applied for this purpose. We instead propose to fulfill gradual learning from the perspective of evidential certainty. As shown in Figure 1, we construct a factor graph consisting of the labeled and unlabeled instances and their common features. Gradual learning is conducted over the factor graph by iterative factor graph inference. At each iteration, it chooses the unlabeled instance with the highest degree of evidential certainty for labeling. The iteration is repeatedly invoked until all the instances in a task are labeled. Note that in gradual inference, an instance newly labeled at the current iteration serves as an evidence observation in the following iterations.

According to the above description, we summarize the key properties of gradual machine learning that collectively make it distinct from the existing learning paradigms:

  1. It begins with some easy instances in a classification task, which can be automatically labeled by the machine with high accuracy;

  2. Its major challenge is to gradually label the increasingly hard instances in the task;

  3. Knowledge conveyance between the labeled and unlabeled instances is facilitated by their common features, which serve as the learning medium.

The framework we have laid out in Figure 1 is general: building a practical solution for a specific task on top of it requires additional technical work. It is noteworthy that many techniques proposed in the existing learning models can be applied in the different steps of gradual machine learning. For instance, the existing rule-based and unsupervised techniques can be used to label easy instances. There also exist many techniques to extract features for supervised and unsupervised learning; they can likewise be used in the step of feature extraction and influence modeling.

In the rest of this paper, we will focus on the technical solution for ER based on the proposed paradigm. We want to emphasize that the provided solution is open-ended. For a classification task other than ER, a different technical solution may be required.

4. Solution for ER

In this section, we present the technical solution for ER based on the proposed learning paradigm.

4.1. Easy Instance Labeling

Given an ER task consisting of record pairs, the solution identifies the easy instances by simple rules specified on record similarity. The set of easy instances labeled as matching is generated by setting a high lower bound on record similarity. Similarly, the set of easy instances labeled as unmatching is generated by setting a low upper bound on record similarity. To explain the effectiveness of the rule-based approach, we introduce the monotonicity assumption of precision, which was first defined in (Arasu et al., 2010), as follows:

Assumption 1 (Monotonicity of Precision).

A value interval I_1 is dominated by another interval I_2, denoted by I_1 \prec I_2, if every value in I_1 is less than every value in I_2. We say that precision is monotonic with respect to a pair metric if, for any two value intervals I_1 \prec I_2 in [0,1], we have R(I_1) \leq R(I_2), in which R(I) denotes the equivalence precision of the set of instance pairs whose metric values are located in I.

With the metric of pair similarity, the underlying intuition of Assumption 1 is that the more similar two records are, the more likely they refer to the same real-world entity. According to the monotonicity assumption, we can statistically state that a pair with a high (resp. low) similarity has a correspondingly high probability of being an equivalent (resp. inequivalent) pair. These record pairs can be deemed easy in that they can be automatically labeled by the machine with high accuracy. In comparison, the instance pairs with medium similarities are more challenging, because labeling them either way by the machine would introduce considerable errors.

Figure 2. Empirical Validation of the Monotonicity Assumption.

We have empirically validated the monotonicity assumption on the real datasets DBLP-Scholar (available at https://dbs.uni-leipzig.de/file/DBLP-Scholar.zip) and Abt-Buy (available at https://dbs.uni-leipzig.de/file/Abt-Buy.zip). The precision levels of different similarity intervals are shown in Figure 2. It can be observed that, statistically speaking, precision increases with similarity value, with only rare exceptions. The proposed approach assumes that the monotonicity of precision is a statistical trend; it does not expect the monotonicity assumption to be strictly satisfied on real data. On DBLP-Scholar, if the similarity lower bound is set to 0.8, the achieved precision is 0.992, nearly 100%. On the other hand, if the similarity upper bound is set to 0.32, the ratio of inequivalent pairs is similarly very high at 0.997. We have similar observations on Abt-Buy.
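As a minimal illustrative sketch, rule-based easy instance labeling can be implemented as follows; the function and its interface are our own, and the default bounds follow the DBLP-Scholar values discussed above (they are dataset-specific in practice):

```python
# A sketch of rule-based easy instance labeling; the default bounds follow
# the DBLP-Scholar values discussed above and are dataset-specific.
def label_easy_instances(pairs, similarity, lower=0.8, upper=0.32):
    """Pairs above `lower` are labeled matching; pairs below `upper` are
    labeled unmatching; the rest are the hard instances left for gradual
    inference."""
    easy_matching, easy_unmatching, hard = [], [], []
    for pair in pairs:
        s = similarity(pair)
        if s >= lower:
            easy_matching.append(pair)
        elif s <= upper:
            easy_unmatching.append(pair)
        else:
            hard.append(pair)
    return easy_matching, easy_unmatching, hard
```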

Given a machine metric for a classification task, the monotonicity assumption of precision actually underlies its effectiveness as a classification metric. Therefore, the easy instances in an ER task can be similarly identified by other classification metrics. However, for presentation simplicity, we use pair similarity as the example of machine metric in this paper.

4.2. Feature Extraction and Influence Modeling

The guiding principle of feature extraction is to extract a wide variety of features that can capture as much information as possible from the record pairs. Given an ER workload, we extract three types of features from its pairs, which include:

  1. Attribute value similarity. This type of feature measures a pair's value similarity at each record attribute. Different attributes may require different similarity metrics. For instance, on the DBLP-Scholar dataset, the appropriate metric for the venue attribute is the edit distance, while the appropriate metric for the title attribute is instead a hybrid metric combining Jaccard similarity and edit distance.

  2. The maximal number of common consecutive tokens in string attributes. Given a record pair, the length of the longest consecutive token sequence shared by the two records' corresponding attributes can usually, to a large extent, affect its equivalence probability, especially for the attributes with string values (e.g., the title attribute in the literature data and the product data). It can be observed that consecutive tokens can usually provide additional information besides that implied by the attribute value similarity features.

  3. The tokens occurring in both records or in one and only one record. Suppose that we denote a token by t, the feature that t occurs in both records by same(t), and the feature that t occurs in one and only one record by diff(t). Note that the feature same(t) serves as evidence for equivalence, while the feature diff(t) indicates the opposite. Unlike the previous two types of features, which treat attribute values as a whole, this type of feature considers the influence of each individual token on pair equivalence probability.

The three types of features listed above can provide good coverage of the information contained in record pairs; a concrete sketch of their extraction follows. It is worth pointing out that our proposed solution for feature extraction is open-ended. Depending on the task domain, other types of features may also be extracted and modeled for gradual learning. The topic is however beyond the scope of this paper.
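To make the three feature types concrete, the following sketch extracts them for a single record pair; the function names, the feature naming scheme and the whitespace tokenization are our own simplifying assumptions:

```python
# A sketch of the three feature types; the naming scheme and whitespace
# tokenization are our own simplifying assumptions.
def longest_common_run(ta, tb):
    # Length of the longest consecutive token sequence shared by two lists.
    best = 0
    for i in range(len(ta)):
        for j in range(len(tb)):
            k = 0
            while (i + k < len(ta) and j + k < len(tb)
                   and ta[i + k] == tb[j + k]):
                k += 1
            best = max(best, k)
    return best

def extract_features(rec_a, rec_b, attr_sim_fns):
    features = {}
    tokens_a, tokens_b = set(), set()
    for attr, sim_fn in attr_sim_fns.items():
        va, vb = str(rec_a[attr]), str(rec_b[attr])
        # Type 1: attribute value similarity.
        features[f"sim({attr})"] = sim_fn(va, vb)
        # Type 2: maximal number of common consecutive tokens.
        features[f"run({attr})"] = longest_common_run(va.split(), vb.split())
        tokens_a.update(va.split())
        tokens_b.update(vb.split())
    # Type 3: same(t)/diff(t) token-occurrence features.
    for t in tokens_a & tokens_b:
        features[f"same({t})"] = 1.0
    for t in tokens_a ^ tokens_b:
        features[f"diff({t})"] = 1.0
    return features
```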

Since the scenario of gradual learning does not satisfy the i.i.d. assumption, the traditional techniques of modeling the influence of features over pair equivalence probability cannot be applied here. We can see that all three types of features can be supposed to satisfy the monotonicity assumption of precision. Therefore, for each feature, we model its influence over pair labels by a monotonic sigmoid function with two parameters, x_0 and β, as shown in Figure 3, which denote the x-value of the function's midpoint and the steepness of the curve respectively. The x-value of the sigmoid function represents the value of a pair w.r.t the corresponding feature, and the y-value represents the probability of the pair being labeled as matching. Since the third type of features has the constant value of 1, we first align them with record similarity and then model their influence by sigmoid functions.
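Concretely, a sigmoid with midpoint x_0 and steepness β takes the standard form

\varphi(x) = \frac{1}{1 + e^{-\beta (x - x_0)}},

which maps a pair's feature value x to its probability of being labeled as matching; this is the function instantiated per feature in Eq. 4 below.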

Figure 3. Examples of the Sigmoid Function.

We illustrate the sigmoid function by the examples shown in Figure 3. It can be observed that different value combinations of x_0 and β can result in vastly different influence curves. Given a sigmoid model, gradual machine learning essentially reasons about the labels of the middle points, which correspond to the hard instances, provided with the labels of the more extreme points at both sides, which correspond to the easy instances. If it were not for the monotonicity assumption, estimating the labels of the middle points by regression would be too erroneous, because the more extreme observations at both sides are not their valid representatives. Our solution overcomes this hurdle by assuming monotonicity of precision and proceeding in small stages, in each of which the regression results of only a few instances close to the labeled easy instances are considered for equivalence reasoning. Fortunately, as pointed out before, monotonicity of precision is a universal assumption underlying the effectiveness of the existing machine metrics for classification tasks. Our proposed solution for modeling feature influence can therefore potentially be generalized to other classification tasks.

4.3. Gradual Inference

To enable gradual machine learning, we construct a factor graph G consisting of the labeled easy instances, the unlabeled hard instances and their common features. Gradual machine learning is attained by iterative factor graph inference on G. In G, the labeled easy instances are represented by evidence variables, the unlabeled hard instances by inference variables, and the features by factors. The value of each variable represents its corresponding pair's equivalence probability. A variable, once labeled, becomes an evidence variable in the following iterations. An evidence variable has the constant value of 0 or 1, which indicates the status of unmatching or matching respectively. It participates in gradual inference, but its value remains unchanged during the inference process. The values of the inference variables are instead inferred based on G.

Figure 4. An Example of Factor Graph.

An example of a factor graph is shown in Figure 4. Each variable has multiple factors, each of which corresponds to one of its features. Since a feature can be shared among multiple pairs, for presentation simplicity, we represent a feature by a single factor and connect it to multiple variables. Note that the influence of a feature over a pair is specified by a sigmoid function. Given a feature f and a pair d, the influence of f w.r.t d is represented by

\varphi_f(d) = \frac{1}{1 + e^{-\beta_f (x_d^f - x_{0,f})}},    (4)

in which x_d^f represents d's value w.r.t f, which is known beforehand, and x_{0,f} and \beta_f represent the parameters of f's sigmoid function, which need to be learned. Accordingly, in the factor graph, we represent the factor weight of f w.r.t d by

w_f(d) = \theta_f \cdot \ln \frac{\varphi_f(d)}{1 - \varphi_f(d)},    (5)

in which \ln \frac{\varphi_f(d)}{1 - \varphi_f(d)} codes the estimated influence of f on d by sigmoid regression, and \theta_f represents the confidence on the influence estimation. It can be observed that in practical implementation, \theta_f can be easily estimated based on the theory of regression error bounds (Chen, 1994). More details on the computation of \theta_f will be discussed in Subsection 5.1.

Denoting the feature set of a pair d by F(d), the factor graph infers the equivalence probability of d, P(d), by

P(d) = \frac{1}{1 + e^{-\sum_{f \in F(d)} w_f(d)}}.    (6)

The process of gradual inference essentially learns the parameter values (x_{0,f} and \beta_f) of all the features such that the inferred results maximally match the evidence observations on the labeled instances. Formally, the objective function can be represented by

(\hat{x}_0, \hat{\beta}) = \arg\max_{x_0, \beta} P_{x_0, \beta}(\Lambda) = \arg\max_{x_0, \beta} \sum_{V_I} P_{x_0, \beta}(\Lambda, V_I),    (7)

in which \Lambda denotes the observed labels of the evidence variables, V_I denotes the inference variables in G, and P(\Lambda, V_I) denotes the joint probability of the variables in G, which can be represented by

P(\Lambda, V_I) = \frac{1}{Z} \prod_{v \in \Lambda \cup V_I} \prod_{f \in F(v)} e^{w_f(v) \cdot y_v},    (8)

in which Z denotes a normalizing constant and y_v the value of the variable v. Since the variables in G are conditionally independent, the objective function can be simplified into

(\hat{x}_0, \hat{\beta}) = \arg\min_{x_0, \beta} \; - \sum_{v \in \Lambda} \ln P_{x_0, \beta}(y_v).    (9)

In our practical implementation, we deploy SciPy (Jones et al., 2001) to implement the parameter optimization process. The process tries to find the optimal parameters that minimize the objective function of Eq. 9 within a given search space. To avoid overfitting, the search range of the midpoint parameter of a feature is set to be between the expectation value of a certain proportion (e.g., 10%) of the largest feature values of the pairs labeled as unmatching, and the expectation value of the same proportion of the smallest feature values of the pairs labeled as matching, keeping the midpoint parameter always within a proper range between the two classes. The search space of the steepness parameter is simply set to the same fixed range for all features.
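As an illustrative sketch of the per-feature parameter search (assuming the negative log-likelihood form of Eq. 9 as reconstructed above; the bound handling is simplified, and scipy.optimize.minimize stands in for whatever SciPy routine the original implementation deploys):

```python
# A sketch of per-feature sigmoid fitting with SciPy, assuming the negative
# log-likelihood objective of Eq. 9; the bound handling is simplified.
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(x, y, x0_bounds, beta_bounds=(0.1, 100.0)):
    """x: feature values of the labeled pairs having this feature;
    y: their labels (1 = matching, 0 = unmatching).
    Returns the fitted (x0, beta)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def neg_log_likelihood(params):
        x0, beta = params
        p = 1.0 / (1.0 + np.exp(-beta * (x - x0)))
        p = np.clip(p, 1e-9, 1.0 - 1e-9)  # numerical safety
        return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    init = [np.mean(x0_bounds), np.mean(beta_bounds)]
    res = minimize(neg_log_likelihood, init,
                   bounds=[x0_bounds, beta_bounds], method="L-BFGS-B")
    return res.x
```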

Given the factor graph G, at each stage, gradual inference first reasons about the parameter values of the features and the equivalence probabilities of the unlabeled pairs by maximum likelihood, and then labels the unlabeled pair with the highest degree of evidential certainty. We define evidential certainty as the inverse of entropy (Shannon, 1948), which is formally defined by

H(d) = -\big(P(d) \cdot \ln P(d) + (1 - P(d)) \cdot \ln (1 - P(d))\big),    (10)

in which H(d) denotes the entropy of the pair d. According to the definition, the degree of evidential certainty varies inversely with the estimated value of entropy. The value of H(d) reaches its maximum when P(d) = 0.5, and it decreases as the value of P(d) becomes more extreme (close to 0 or 1). Therefore, at each iteration, gradual inference selects the instance pair with the minimal entropy for labeling. It labels the chosen instance pair as matching if P(d) ≥ 0.5, and as unmatching if P(d) < 0.5. An inference variable, once labeled, becomes an evidence variable and serves as an evidence observation in the following iterations. The iteration is repeatedly invoked until all the inference variables are labeled.
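A minimal sketch of the entropy-based selection step of Eq. 10 (the names are ours):

```python
import math

def entropy(p):
    # Eq. 10: entropy of an inferred equivalence probability.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def label_most_certain(probabilities):
    """probabilities: pair id -> inferred equivalence probability.
    Returns (pair_id, label), with label True meaning matching."""
    d = min(probabilities, key=lambda u: entropy(probabilities[u]))
    return d, probabilities[d] >= 0.5
```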

Unfortunately, repeated inference by maximum likelihood estimation over a large factor graph containing all the variables is usually very time-consuming (Zhou et al., 2016). As a result, the above-mentioned approach cannot scale well to a large ER task. In the next section, we will propose a scalable approach that can effectively fulfill gradual learning without repeatedly inferring over the entire factor graph.

5. Scalable Gradual Inference

In this section, we propose a scalable solution for gradual inference. Our solution is crafted based on the following observations:

  • Many unlabeled inference variables in the factor graph are only weakly linked through low-confidence factors to the evidence variables. Due to the lack of evidential support, their inferred probabilities would be quite ambiguous, i.e., close to 0.5. As a result, only the inference variables that receive considerable support from the evidence variables need to be considered for labeling;

  • With regard to the probability inference of a single variable v in a large factor graph, it can be effectively approximated by considering the potentially much smaller subgraph consisting of v and its neighboring variables. The inference over the subgraph can usually be much more efficient than over the original entire graph.

while there exists any unlabeled variable in G do
    V ← all the unlabeled variables in G;
    for each v ∈ V do
        measure the evidential support of v in G;
    end for
    select the top-m unlabeled variables with the most evidential support, denoted by V_m;
    for each v ∈ V_m do
        estimate the probability of v in G by approximation;
    end for
    select the top-k most certain variables in V_m in terms of entropy based on the approximate probabilities, denoted by V_k;
    for each v ∈ V_k do
        compute the probability of v by factor graph inference over a subgraph of G;
    end for
    label the variable with the minimal entropy in V_k;
end while
Algorithm 1: Scalable Gradual Inference

The process of scalable gradual inference is sketched in Algorithm 1. It first selects the top-m unlabeled variables with the most evidential support in G as the candidates for probability inference. To reduce the invocations of factor graph inference, it then approximates probability inference on the candidates by a more efficient algorithm. Finally, it infers via maximum likelihood the probabilities of only the top-k most promising unlabeled variables among the candidates. For each variable in the final set of candidates, its probability is inferred not over the entire graph G, but over a potentially much smaller subgraph.
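For concreteness, the following Python sketch restates Algorithm 1; the graph interface (has_unlabeled, unlabeled_variables, evidential_support, approx_entropy, subgraph_inference, label) is hypothetical, and entropy is the helper from the sketch in Subsection 4.3:

```python
# A compact Python restatement of Algorithm 1; the `graph` interface is
# hypothetical, and `entropy` is the helper defined in the earlier sketch.
def scalable_gradual_inference(graph, m=1000, k=5):
    while graph.has_unlabeled():
        candidates = graph.unlabeled_variables()
        # Stage 1: keep the top-m variables by evidential support (Eq. 16).
        top_m = sorted(candidates, key=graph.evidential_support,
                       reverse=True)[:m]
        # Stage 2: rank them cheaply by approximate entropy (Eq. 19).
        top_k = sorted(top_m, key=graph.approx_entropy)[:k]
        # Stage 3: exact inference, each over a small subgraph, on the top-k.
        probs = {v: graph.subgraph_inference(v) for v in top_k}
        # Label the single most certain variable; it becomes evidence.
        v = min(probs, key=lambda u: entropy(probs[u]))
        graph.label(v, probs[v] >= 0.5)
```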

The rest of this section is organized as follows: Subsection 5.1 presents the technique to measure evidential support. Subsection 5.2 presents the approximation algorithm to efficiently rank the probabilities of unlabeled variables. Subsection 5.3 describes how to construct an inference subgraph for a target unlabeled variable.

5.1. Measurement of Evidential Support

Since the influence of a feature over the pairs is modeled by a sigmoid function, we consider the evidential support that an unlabeled variable receives from a feature f to be the confidence on the regression result provided by its corresponding function, denoted by θ_f. Note that θ_f is also used to compute the confidence-aware factor weight in Eq. 5. Given an unlabeled variable v, we first estimate the evidential support provided by each of its factors based on the theory of regression error bounds (Chen, 1994), and then aggregate them to estimate its overall evidential support based on the Dempster-Shafer theory (Shafer, 1976).

Formally, for the influence estimation of a single feature f on the variables, the process of parameter optimization corresponds to a linear regression between the natural-logarithm-coded influence in Eq. 5, hereinafter denoted by y_d^f = \ln \frac{\varphi_f(d)}{1 - \varphi_f(d)}, and the feature value x_d^f, as follows:

y_d^f = \beta_f \cdot (x_d^f - x_{0,f}) + \epsilon,    (11)

in which \epsilon denotes the regression residual. The parameters \beta_f and x_{0,f} are optimized by minimizing the regression residual as follows:

(\hat{\beta}_f, \hat{x}_{0,f}) = \arg\min_{\beta_f, x_{0,f}} \sum_{d \in D_f^L} \big(y_d^f - \beta_f \cdot (x_d^f - x_{0,f})\big)^2,    (12)

in which D_f^L denotes the set of labeled pairs having the feature f.

According to the theory of linear regression error bounds, given a pair d, its prediction error bound \delta_d and the confidence level 1 - \alpha satisfy the following formula:

\delta_d = t_{\alpha/2, n-2} \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_d^f - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}},    (13)

in which t_{\alpha/2, n-2} represents the Student's t-value with n-2 degrees of freedom at the \alpha/2 quantile, n denotes the number of labeled pairs having the feature f, and

s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}    (14)

and

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.    (15)

Given an error bound \delta_d, we measure the evidential support of an unlabeled variable provided by f by estimating the corresponding regression confidence level according to Eq. 13. Finally, we use the classical theory of evidence, the Dempster-Shafer (D-S) theory (Shafer, 1976), to combine the evidential support from different features and arrive at a degree of belief that takes into account all the available evidence. In our case, given a variable v, the evidential support provided by a feature f can be considered to be the extent to which f supports the inference on the probability of v: a value of 1 means complete support, while a value of 0 corresponds to the lack of any support. Suppose that an unlabeled variable v has n features, {f_1, ..., f_n}, and the evidential support that v receives from f_i, normalized to fall into the value range of [0, 1], is denoted by s_i. Then, according to Dempster's rule, the combined evidential support of v provided by its features can be represented by

s(v) = 1 - \prod_{i=1}^{n} (1 - s_i).    (16)
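Under the forms of Eqs. 13 and 16 as reconstructed above, the evidential support computation might be sketched as follows; the function names and interface are ours, and scipy.stats.t supplies the Student's t distribution:

```python
# A sketch of evidential support measurement, following Eqs. 13-16 as
# reconstructed above; the interface is ours.
import numpy as np
from scipy import stats

def regression_confidence(x_new, x, residuals, error_bound):
    """Confidence level that the regression prediction at x_new lies within
    error_bound, per the prediction-interval form of Eq. 13."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s_e = np.sqrt(np.sum(np.asarray(residuals) ** 2) / (n - 2))
    spread = np.sqrt(1.0 + 1.0 / n
                     + (x_new - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
    t_value = error_bound / (s_e * spread)
    # Two-sided confidence level for this t-value with n-2 degrees of freedom.
    return 1.0 - 2.0 * stats.t.sf(t_value, df=n - 2)

def combine_support(supports):
    # Dempster's rule for sources supporting the same hypothesis (Eq. 16).
    disbelief = 1.0
    for s in supports:
        disbelief *= (1.0 - s)
    return 1.0 - disbelief
```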

On time complexity, each iteration of evidential support measurement takes O(|D| \cdot |F|) time, in which |D| denotes the total number of instances in a task and |F| denotes the total number of extracted features. Therefore, we have Lemma 5.1, whose proof is straightforward and thus omitted here.

Lemma 5.1.

Given an ER task, the total computational cost of evidential support measurement can be represented by O(|D|^2 \cdot |F|).

5.2. Approximate Estimation of Inferred Probability

Due to the prohibitive cost of factor graph inference, at each iteration, reasoning about the probabilities of all the top-m inference variables ranked by evidential support via factor graph inference may still be too time-consuming. In this subsection, we propose an efficient approach to approximate the inferred probabilities of these top-m variables such that only a small portion (the top-k) of them needs to be inferred by factor graph inference.

As previously mentioned, since a feature's natural-logarithm-coded influence w.r.t a pair can be estimated by the linear regression value based on Eq. 11, the approximate factor weight of f w.r.t d can be estimated by

\hat{w}_f(d) = \theta_f \cdot \hat{\beta}_f \cdot (x_d^f - \hat{x}_{0,f}),    (17)

in which \theta_f represents f's confidence level on the regression result w.r.t d, and \hat{\beta}_f and \hat{x}_{0,f} are the regression parameter values estimated by Eq. 12. A pair's equivalence probability can thus be computed by combining the approximate factor weights of all its features:

\hat{P}(d) = \frac{1}{1 + e^{-\sum_{f \in F(d)} \hat{w}_f(d)}},    (18)

in which F(d) denotes the feature set of d. Accordingly, the entropy of d is approximated by

\hat{H}(d) = -\big(\hat{P}(d) \cdot \ln \hat{P}(d) + (1 - \hat{P}(d)) \cdot \ln (1 - \hat{P}(d))\big).    (19)
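A minimal sketch of the approximation in Eqs. 17-19, assuming each feature carries its fitted regression parameters and confidence (the tuple layout is our own convention):

```python
import math

def approx_probability(pair_features):
    """pair_features: iterable of (x, x0, beta, theta) tuples, where x is
    the pair's value w.r.t a feature, (x0, beta) the fitted regression
    parameters and theta the regression confidence."""
    # Eq. 17: confidence-weighted logit-space factor weights.
    total = sum(theta * beta * (x - x0)
                for x, x0, beta, theta in pair_features)
    # Eq. 18: combine the weights through a sigmoid.
    return 1.0 / (1.0 + math.exp(-total))

def approx_entropy(pair_features):
    # Eq. 19: entropy of the approximate probability.
    p = approx_probability(pair_features)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
```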

In practical implementation, due to the high efficiency of evidential support measurement, the number of candidate inference variables selected for approximate probability estimation (m) can usually be set to a large value (in the order of thousands). However, the number of candidate inference variables chosen for factor graph inference (k) is usually set to a much smaller value (k ≪ m), due to the inefficiency of factor graph inference. In the initial stages, factor graph inference can even be skipped if the entropy of the top variable among the candidates is considerably smaller than that of any other variable. Due to space limits, we do not present more details on parameter tuning. It is noteworthy that our empirical evaluation in Section 6 shows that the performance of gradual inference is not sensitive to the parameter settings of m and k.

On time complexity, each iteration of approximate probability estimation takes O(|D| \cdot |F|) time, in which |D| denotes the total number of pairs in a task and |F| denotes the total number of extracted features. Therefore, we have Lemma 5.2, whose proof is straightforward and thus omitted here.

Lemma 5.2.

Given an ER task, the total computational cost of approximate probability estimation can be represented by O(|D|^2 \cdot |F|).

5.3. Construction of Inference Subgraph

Pair similarity thresholds under four increasing ratios of easy instances:
Dataset | ub(UM) lb(M) | ub(UM) lb(M) | ub(UM) lb(M) | ub(UM) lb(M)
DS | 0.27 0.87 | 0.29 0.83 | 0.31 0.79 | 0.33 0.75
AB | 0.135 0.45 | 0.15 0.43 | 0.165 0.4 | 0.175 0.4
SG | 0.52 0.98 | 0.53 0.98 | 0.55 0.95 | 0.57 0.95

Labeling accuracy under the same four ratios of easy instances:
Dataset | UM M | UM M | UM M | UM M
DS | 0.999 0.995 | 0.998 0.996 | 0.997 0.989 | 0.995 0.984
AB | 0.992 0.966 | 0.990 0.925 | 0.986 0.838 | 0.985 0.838
SG | 1.000 1.000 | 1.000 1.000 | 1.000 0.997 | 1.000 0.997
Table 2. The Statistics of Easy Instance Labeling on Test Workloads

Given a target inference variable v in a large factor graph G, inferring v's equivalence probability over the entire graph is usually very time-consuming. Fortunately, it has been shown that factor graph inference can be effectively approximated by considering the subgraph consisting of v and its neighboring variables in G (Zhou et al., 2016). Specifically, consider the subgraph consisting of v and its r-hop neighbors. It has been shown that increasing the diameter of the neighborhood (the value of r) can effectively improve the approximation accuracy. On the other hand, it has been empirically shown (Zhou et al., 2016) that even with a small value of r (e.g., 3-5), the approximation by r-hop inference can be sufficiently accurate in many real scenarios.

Unfortunately, in the scenario of gradual learning, some factors (e.g., attribute value similarity) may be shared among almost all the variables. As a result, the simple approach of considering the r-hop neighborhood may result in a subgraph covering almost all the variables. Therefore, we propose to limit the inference subgraph size in the following manner:

  1. Gradual learning infers the label of a pair d based on its features. Approximate factor graph inference considers only the factors corresponding to the features of d; the other factors in G are excluded from the constructed subgraph;

  2. The influence distribution of a factor is estimated based on its evidence variables. Approximate factor graph inference considers only the evidence variables sharing at least one feature with the target inference variable v. The remaining variables, including the unlabeled inference variables other than v and the evidence variables not sharing any common feature with v, are excluded from the constructed subgraph;

  3. In the case that applying the previous two guidelines still results in an exceedingly large subgraph, we propose to limit the total number of evidence variables for any given feature. According to (Chen, 1994), the accuracy of function regression generally increases with the number of sample observations. However, the validity of this proposition depends on the uniform distribution of the samples; additional samples very similar to the existing ones produce only marginal improvement in prediction accuracy. Therefore, given a feature, we divide its evidence variables into two clusters, one consisting of the variables with the matching label and the other consisting of those with the unmatching label. Then, for each cluster, we divide the space of feature values into multiple intervals, and limit the number of observations in each interval by sampling. Given any interval, if its number of observations exceeds a pre-specified threshold n_t, we randomly select n_t observations among them for factor graph inference.

It is worth pointing out that our proposed approach for subgraph construction is consistent with the principle of r-hop approximation in that it essentially opts to include in the subgraph those factors and variables in the close neighborhood of a target variable. In practical implementation, we suggest that the feature value range of [0,1] be divided into ten uniform intervals, [0,0.1], [0.1,0.2], ..., [0.9,1.0], and that the number of observations for each interval be set between 50 and 200.
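A small sketch of the per-interval sampling described above (the function and its interface are ours):

```python
import random

def sample_evidence(values, n_intervals=10, cap=100):
    """Cap the number of evidence observations per feature-value interval
    for one feature and one label cluster; `cap` plays the role of the
    threshold n_t discussed above.
    values: list of (feature_value, variable_id), feature_value in [0,1]."""
    buckets = [[] for _ in range(n_intervals)]
    for v, var in values:
        idx = min(int(v * n_intervals), n_intervals - 1)
        buckets[idx].append((v, var))
    sampled = []
    for bucket in buckets:
        sampled.extend(random.sample(bucket, cap) if len(bucket) > cap
                       else bucket)
    return sampled
```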

6. Empirical Evaluation

In this section, we empirically evaluate the performance of our proposed approach (denoted by GML) on real data. We compare GML with both unsupervised and supervised alternative techniques, which include

  • Unsupervised Rule-based (denoted by UR). The unsupervised rule-based approach reasons about pair equivalence based on rules handcrafted by humans. Based on human experience and knowledge of the test data, the rules are specified in terms of record similarity. For a fair comparison, in our implementation, we set the equivalence threshold based on the upper and lower bounds used for identifying easy instances.

  • Unsupervised Clustering (denoted by UC). The approach of unsupervised clustering maps the record pairs to points in a multi-dimensional feature space and then clusters them into distinct classes based on the distance between them. The features usually include different similarity metrics specified at different attributes. In our implementation, we use the classical k-means technique to classify pairs into two classes.

  • Learning based on Support Vector Machines (denoted by SVM). The SVM-based approach (Christen, 2008) also maps the record pairs to points in a multi-dimensional feature space. Unlike unsupervised clustering, it fits an optimal SVM classifier on labeled training data and then uses the trained model to label the pairs in the test data.

  • Deep Learning (denoted by DNN). The deep learning approach (Mudgal et al., 2018) is the state-of-the-art supervised learning approach for ER. Representing each record pair by a vector, it first trains a deep neural network (DNN) on labeled training data, and then uses the trained DNN to classify the pairs in the test data.

The four compared approaches listed above provide good coverage of the techniques used for ER in practice: UR and UC are representative of the unsupervised techniques, and SVM and DNN are the popular supervised techniques. Note that we do not compare GML with Snorkel (Bach et al., 2017) for two reasons: (1) a Snorkel solution for ER does not exist; it would require a wide variety of labeling functions, which are unfortunately not readily available for ER tasks; (2) the output of Snorkel is supposed to be fed to DNN models; therefore, the performance of DNN provided with manually labeled training data can serve as a good reference for the performance of Snorkel.

The rest of this section is organized as follows: Subsection 6.1 describes the experimental setup. Subsection 6.2 compares GML with the other alternatives. Subsection 6.3 evaluates the sensitivity of GML w.r.t various parameter settings. Finally, Subsection 6.4 evaluates the scalability of GML w.r.t workload size.

6.1. Experimental Setup

GML / UR / UC
Dataset | GML: recall precision F1 | UR: recall precision F1 | UC: recall precision F1
DS | 0.887 0.936 0.911 | 0.923 0.840 0.880 | 0.793 0.939 0.860
AB | 0.456 0.815 0.585 | 0.645 0.428 0.514 | 0.806 0.268 0.402
SG | 0.944 0.998 0.970 | 0.993 0.825 0.901 | 0.995 0.808 0.892

SVM
Dataset | 10%: recall precision F1 | 20%: recall precision F1 | 30%: recall precision F1 | 40%: recall precision F1
DS | 0.893 0.923 0.908 | 0.895 0.920 0.908 | 0.896 0.921 0.908 | 0.899 0.925 0.912
AB | 0.349 0.721 0.482 | 0.706 0.374 0.489 | 0.398 0.721 0.513 | 0.643 0.429 0.515
SG | 0.995 0.855 0.920 | 0.992 0.925 0.957 | 0.991 0.945 0.968 | 0.991 0.947 0.969

DNN
Dataset | 10% (5%:5%): recall precision F1 | 20% (15%:5%): recall precision F1 | 30% (25%:5%): recall precision F1 | 40% (35%:5%): recall precision F1
DS | 0.949 0.869 0.907 | 0.945 0.956 0.950 | 0.982 0.929 0.955 | 0.957 0.955 0.956
AB | 0.043 0.254 0.074 | 0.441 0.601 0.509 | 0.444 0.707 0.546 | 0.564 0.753 0.645
SG | 0.777 0.830 0.802 | 0.952 0.900 0.925 | 0.938 0.970 0.954 | 0.949 0.977 0.963

Table 3. Comparative Evaluation of GML

Our evaluation is conducted on three real datasets, denoted by DS (DBLP-Scholar), AB (Abt-Buy) and SG (a dataset of song records).

In the empirical study, GML uses pair similarity as the machine metric to identify easy instances. The rule-based approach also uses pair similarity to specify the equivalence condition. Pair similarity is computed by aggregating the attribute similarities via a weighted sum (Christen, 2012). Specifically, on the DS dataset, Jaccard similarity of the attributes title, authors and year, and Jaro-Winkler distance of the attributes title, authors and venue are used; on the AB dataset, Jaccard similarity of the attributes product name and product description is used; on the SG dataset, Jaccard similarity of the attributes song title and artist name, Jaro-Winkler distance of the attributes song title and release information, and number similarity of the attribute duration are used. The weight of each attribute is determined by the number of its distinct values. As in the previous study (Mudgal et al., 2018), we use the blocking technique to filter out the instance pairs having only a small chance of being equivalent. After blocking, the DS workload has 10482 pairs, 4771 of which are equivalent; the AB workload has 8924 pairs, 774 of which are equivalent; the SG workload has 8312 pairs, 1412 of which are equivalent.
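The weighted aggregation might be sketched as follows; averaging multiple metrics per attribute is our own simplifying assumption, and the attribute metrics and weights are supplied per dataset as described above:

```python
# A sketch of weighted-sum pair similarity; averaging multiple metrics per
# attribute is our simplifying assumption.
def pair_similarity(rec_a, rec_b, metrics, weights):
    """metrics: attr -> list of similarity functions for that attribute.
    weights: attr -> weight, e.g., derived from the number of the
    attribute's distinct values as described above."""
    total = weight_sum = 0.0
    for attr, fns in metrics.items():
        sim = sum(fn(rec_a[attr], rec_b[attr]) for fn in fns) / len(fns)
        total += weights[attr] * sim
        weight_sum += weights[attr]
    return total / weight_sum if weight_sum else 0.0
```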

Pair d1:
  Record 1 | Title: Fast Algorithms for Mining Association Rules in Large Databases | Authors: R Agrawal, R Srikant | Venue: VLDB | Year: 1994
  Record 2 | Title: Mining Quantitative Association Rules in Large Tables | Authors: R Srikant, R Agrawal | Venue: VLDB | Year: -
Pair d2:
  Record 1 | Title: STING: A Statistical Information Grid Approach to Spatial Data Mining | Authors: W Wang, J Yang, R Muntz | Venue: VLDB | Year: 1997
  Record 2 | Title: J., and Muntz, RR 1997. STING: A statistical information grid approach to spatial data | Authors: WY Wang | Venue: Proc. Int. Conf. Very Large Databases | Year: -
Table 4. Two Example Record Pairs: d1 and d2

6.2. Comparative Study

This section compares GML with its alternatives. On all three workloads, the easy instances are generated by specifying a lower bound of pair similarity for matching pairs, and an upper bound of pair similarity for unmatching pairs. The similarity thresholds and the resulting labeling accuracies of the easy instances on the test workloads are shown in Table 2, in which ub(UM) denotes the similarity upper bound for the unmatching easy instances, lb(M) the lower bound for the matching easy instances, M the label of matching and UM the label of unmatching.

For a fair comparison, the rule-based approach sets the equivalence condition on pair similarity at the middle point between the lower and upper bounds. In real scenarios, the ratios of identified easy instances may vary. Our evaluation results in Subsection 6.3 will show that GML is robust w.r.t easy instance labeling in that its performance is, to a large extent, insensitive to the initial ratio of easy instances.

F-1 under four increasing ratios of easy instances:
DS | 0.915 0.906 0.914 0.911
AB | 0.573 0.585 0.568 0.576
SG | 0.970 0.970 0.970 0.970
Table 5. Sensitivity Evaluation w.r.t Easy Instance Labeling

F-1 under four increasing settings of m:
DS | 0.903 0.905 0.908 0.911
AB | 0.565 0.573 0.568 0.585
SG | 0.917 0.921 0.945 0.970
Table 6. Sensitivity Evaluation w.r.t the Parameter m

F-1 under four increasing settings of k:
DS | 0.911 0.914 0.912 0.914
AB | 0.585 0.582 0.582 0.582
SG | 0.970 0.970 0.970 0.970
Table 7. Sensitivity Evaluation w.r.t the Parameter k

F-1 under four increasing settings of n_t:
DS | 0.908 0.909 0.909 0.909
AB | 0.558 0.553 0.553 0.553
SG | 0.951 0.951 0.943 0.951
Table 8. Sensitivity Evaluation w.r.t the Parameter n_t

The detailed evaluation results are presented in Table 3, with F-1 as the primary point of comparison. For the supervised approaches, SVM and DNN, we report their performance provided with different sizes of training data, measured by the fraction of training data in the whole workload. In the empirical evaluation, training data are randomly selected from the workload. Since the performance of SVM and DNN depends on the randomly selected training data, the reported results are averages over ten runs. For DNN, the training data consist of the data used for model training and the data used for validation; we therefore report the fractions of both parts in the table.

The results show that GML performs considerably better than the unsupervised alternatives, UR and UC; in most cases, their performance differences on F-1 are larger than 5%. It can also be observed that GML beats both supervised approaches, SVM and DNN, when they are provided with only small sets of training data (e.g., less than 20% of the workload). As the size of the training data increases, the performance of SVM and DNN generally improves, as expected. It is worth pointing out that even when given 40% of the workload as training examples, SVM still does not beat GML: on the AB workload, GML achieves an F-1 value of 58.5% while SVM achieves only 51.5%; on the other two workloads, their performance is very similar. On AB, GML also outperforms DNN when DNN is given 30% of the workload as training examples (58.5% vs 54.6%); only when given 40% of the workload as training examples can DNN clearly beat GML.

We also illustrate the effectiveness of GML by the examples shown in Table 4. Both UR and UC mistakenly classify the pair d1 as matching due to the apparent similarity of its two records on the attributes. GML instead correctly classifies it as unmatching: its inferred equivalence probability is 0.34 (below 0.5), owing to the features of the partially occurring tokens “algorithms”, “tables”, “quantitative” and “fast”. Similarly, both UR and UC misclassify the pair d2 as unmatching due to the seeming dissimilarity of its two records. Again, GML correctly labels it as matching: due to the long run of common consecutive tokens in the title attribute and the co-occurrence of many tokens, including “statistical”, “grid”, and “sting”, its inferred equivalence probability is 0.66, larger than 0.5.

6.3. Sensitivity Evaluation

In this section, we evaluate the sensitivity of GML w.r.t different parameter settings. We first vary the ratio of the initial easy instances in a workload and track the performance of GML under the different ratios. Note that the ratio is set by specifying the lower and upper bounds of pair similarity. For scalable gradual inference, we vary the number of pair candidates selected for inference probability approximation, the number of pair candidates selected for factor graph inference (both parameters of Algorithm 1), and the limit on the size of the inference subgraph. The number of candidates for probability approximation is varied between 500 and 2000, and the number of candidates for factor graph inference between 1 and 10. We limit the size of the inference subgraph by setting an upperbound on the number of observations in each unit interval; this threshold is varied between 50 and 200.
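To make the roles of these knobs concrete, the sketch below gives a schematic reading of the control flow of scalable gradual inference. All names are our own, and the approximate probability, the evidential support measure and the factor graph inference are replaced by simple stand-ins; it is a sketch under our assumptions, not the actual implementation.

```python
import random

def scalable_gradual_inference(similarities, easy_labels,
                               n_approx=1000, n_infer=5, subgraph_cap=100):
    """Schematic sketch of scalable gradual inference with stand-in components.

    similarities: dict pair_id -> pair similarity in [0, 1].
    easy_labels:  dict pair_id -> 'matching' or 'unmatching'.
    n_approx:     candidates kept after the cheap probability ranking.
    n_infer:      candidates labeled per factor graph invocation.
    subgraph_cap: bound on the labeled observations used as evidence.
    """
    labels = dict(easy_labels)
    unlabeled = [p for p in similarities if p not in labels]
    while unlabeled:
        # Step 1: cheap ranking. Confidence derived from raw similarity
        # stands in for the distance-based probability approximation.
        candidates = sorted(unlabeled,
                            key=lambda p: abs(similarities[p] - 0.5),
                            reverse=True)[:n_approx]

        # Step 2: evidential support. Counting labeled pairs with a nearby
        # similarity stands in for the real evidential support measure.
        def support(p):
            return sum(1 for q in labels
                       if abs(similarities[q] - similarities[p]) < 0.1)
        chosen = sorted(candidates, key=support, reverse=True)[:n_infer]

        # Step 3: inference over a size-bounded evidence subgraph. Blending
        # similarity with the evidence stands in for factor graph inference.
        for pid in chosen:
            evidence = list(labels)[:subgraph_cap]  # bounded evidence set
            match_rate = (sum(labels[q] == "matching" for q in evidence)
                          / max(1, len(evidence)))
            prob = 0.8 * similarities[pid] + 0.2 * match_rate
            labels[pid] = "matching" if prob > 0.5 else "unmatching"
            unlabeled.remove(pid)
    return labels

# Toy usage with illustrative parameter values:
random.seed(0)
sims = {"p%d" % i: random.random() for i in range(200)}
easy = {p: ("matching" if s > 0.9 else "unmatching")
        for p, s in sims.items() if s > 0.9 or s < 0.1}
final = scalable_gradual_inference(sims, easy, n_approx=50, n_infer=5,
                                   subgraph_cap=20)
print(sum(lab == "matching" for lab in final.values()), "matching pairs")
```

The point of this structure is that the expensive inference step touches only n_infer pairs per stage, over an evidence subgraph whose size is bounded by subgraph_cap, while the cheap ranking over n_approx candidates keeps the stages focused on the most promising pairs.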

The detailed evaluation results w.r.t the initial ratio of easy instances are reported in Table 5. We can see that over a reasonable range of ratios, between 30% and 60%, the performance of GML is highly stable. In many real scenarios, it is usually challenging to automatically label a large proportion of a given workload with high accuracy. Our experimental results demonstrate that GML performs robustly w.r.t the initial set of easy instances, which bodes well for its performance in real applications.

The detailed evaluation results w.r.t the parameter settings of scalable gradual inference are reported in Tables 6, 7 and 8. It can be observed that, overall, the performance of GML is very stable across the different settings. GML's performance tends to improve as the number of candidates selected for probability approximation is set higher; the improvement is most visible on the SG workload. Fortunately, as shown in Lemma 5.1, the measurement of evidential support can be executed efficiently; implementation-wise, this parameter can therefore be freely set to a high value. It can also be observed that the performance of GML fluctuates only slightly w.r.t the other two parameters. Note that labeling more candidates per invocation reduces how often factor graph inference has to be invoked, while tightening the limit on the inference subgraph size directly improves the efficiency of each invocation. These results bode well for the efficient implementation of GML in real applications.

It is worth pointing out that even though these parameter settings only marginally affect the performance of GML, this does not mean that factor graph inference is unnecessary and can simply be replaced by the more efficient distance-based approximate probability estimation. On the contrary, we observe in the experiments that there exist many pair instances whose factor graph inference results differ sufficiently from their approximated probabilities that their labels are flipped by factor graph inference, especially in the final stages of gradual inference.

6.4. Scalability Evaluation

Figure 5. Scalability Evaluation.

In this section, we evaluate the scalability of the proposed scalable approach for GML. Based on the entities in DBLP and Scholar, we generate DS workloads of different sizes, ranging from 10000 to 50000. We fix the proportion of identified easy instances at 60% and keep all the other parameters fixed across the workloads. The detailed evaluation results are presented in Figure 5. Our experiments show that most of the runtime is spent on factor graph inference, and that the average cost of scalable GML per unlabeled pair fluctuates only slightly as the workload grows. As a result, the total consumed time increases nearly linearly with the workload size. These experimental results clearly demonstrate the high scalability of the proposed approach.

7. Conclusion

In this paper, we have proposed a novel learning paradigm, called gradual machine learning, which begins with some easy instances in a given task, and then gradually labels more challenging instances in the task based on iterative factor graph inference without requiring any human intervention. We have also developed an effective solution based on it for the task of entity resolution. Our extensive empirical study on real datasets has shown that the proposed solution performs considerably better than the unsupervised alternatives, and it is also highly competitive with the state-of-the-art supervised techniques. Using ER as a test case, we have demonstrated that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.

Our research on gradual machine learning is an ongoing effort, and future work can be pursued on multiple fronts. Firstly, for specific ER tasks, gradual machine learning can be further improved by extracting more complex features than those used in this paper and modeling them as factors in the inference graph. For instance, it has been shown in progressive ER (Altowim et al., 2014) that in many ER tasks it is not rare that the resolution of one pair benefits the resolution of another; this type of information can obviously be modeled as binary factors in the inference graph. A technical investigation of this topic is, however, beyond the scope of this paper. Secondly, even though gradual machine learning is proposed as an unsupervised learning paradigm in this paper, human work can be easily integrated into its process for improved performance. An interesting open challenge is how to effectively improve the performance of gradual machine learning with minimal human intervention, which includes but is not limited to manually labeling some instances and adjusting feature weights. Finally, it is also very interesting to develop solutions for other challenging classification tasks (e.g., sentiment analysis (Tang et al., 2015) and financial fraud detection (Yue et al., 2007)) based on the proposed paradigm.

References

  • Altowim et al. (2014) Yasser Altowim, Dmitri V Kalashnikov, and Sharad Mehrotra. 2014. Progressive approach to relational entity resolution. Proceedings of the VLDB Endowment 7, 11 (2014), 999–1010.
  • Arasu et al. (2010) Arvind Arasu, Michaela Götz, and Raghav Kaushik. 2010. On active learning of record matching packages. Proceedings of the ACM International Conference on Management of Data (SIGMOD) (2010), 783–794.
  • Bach et al. (2017) Stephen H. Bach, Bryan He, Alexander Ratner, and Christopher Ré. 2017. Learning the Structure of Generative Models without Labeled Data. In International Conference on Machine Learning (ICML).
  • Bellare et al. (2012) Kedar Bellare, Suresh Iyengar, Aditya G Parameswaran, and Vibhor Rastogi. 2012. Active sampling for entity matching. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 1131–1139.
  • Bengio et al. (2009) Yoshua Bengio, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. 41–48.
  • Berger (1985) James O Berger. 1985. Statistical decision theory and Bayesian analysis. Springer Series in Statistics, New York: Springer, 2nd ed.
  • Bilenko et al. (2003) Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. Adaptive Name Matching in Information Integration. IEEE Intelligent Systems 18, 5 (2003), 16–23.
  • Blum and Mitchell (1998) Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Conference on Computational Learning Theory. 92–100.
  • Caruana (1997) Rich Caruana. 1997. Multitask Learning. Machine Learning 28, 1 (1997), 41–75.
  • Chai et al. (2016) Chengliang Chai, Guoliang Li, Jian Li, Dong Deng, and Jianhua Feng. 2016. Cost-effective crowdsourced entity resolution: A partial-order approach. Proceedings of the ACM International Conference on Management of Data (SIGMOD) (2016), 969–984.
  • Chen (1994) S. X. Chen. 1994. Empirical Likelihood Confidence Intervals for Linear Regression Coefficients. Academic Press, Inc. 24–40 pages.
  • Chen and Liu (2018) Zhiyuan Chen and Bing Liu. 2018. Lifelong Machine Learning, Second Edition. Morgan & Claypool Publishers.
  • Christen (2008) Peter Christen. 2008. Automatic record linkage using seeded nearest neighbour and support vector machine classification. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 151–159.
  • Christen (2012) Peter Christen. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, Chapter 2, 32–34.
  • Chu et al. (2015) Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). ACM, 1247–1261.
  • Cohen (2000) William W. Cohen. 2000. Data Integration Using Similarity Joins and a Word-based Information Representation Language. ACM Trans. Inf. Syst. 18, 3 (2000), 288–321.
  • Elmagarmid et al. (2007) Ahmed K Elmagarmid, Panagiotis G Ipeirotis, and Vassilios S Verykios. 2007. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19, 1 (2007), 1–16.
  • Fan et al. (2009) Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. 2009. Reasoning about record matching rules. Proceedings of the VLDB Endowment 2, 1 (2009), 407–418.
  • Fellegi and Sunter (1969) Ivan P Fellegi and Alan B Sunter. 1969. A theory for record linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183–1210.
  • Firmani et al. (2016) Donatella Firmani, Barna Saha, and Divesh Srivastava. 2016. Online entity resolution using an Oracle. Proceedings of the VLDB Endowment 9, 5 (2016), 384–395.
  • Getoor and Machanavajjhala (2012) Lise Getoor and Ashwin Machanavajjhala. 2012. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment 5, 12 (2012), 2018–2019.
  • Gokhale et al. (2014) Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F Naughton, Narasimhan Rampalli, Jude Shavlik, and Xiaojin Zhu. 2014. Corleone: Hands-off crowdsourcing for entity matching. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). ACM, 601–612.
  • Gruenheid et al. (2012) Anja Gruenheid, Donald Kossmann, Ramesh Sukriti, and Florian Widmer. 2012. Crowdsourcing Entity Resolution: When is A=B? Technical Report, ETH Zurich, Department of Computer Science, Systems Group (2012).
  • Jin and Han (2016) Xin Jin and Jiawei Han. 2016. K-Means Clustering. In Encyclopedia of Machine Learning and Data Mining. Springer, 1–3.
  • Joachims (1999) Thorsten Joachims. 1999. Transductive Inference for Text Classification using Support Vector Machines. In 16th International Conference on Machine Learning. 200–209.
  • Jones et al. (2001) Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open source scientific tools for Python. http://www.scipy.org/ [Online].
  • Kivinen et al. (2004) J. Kivinen, A. J. Smola, and R. C. Williamson. 2004. Online learning with kernels. IEEE Transactions on Signal Processing 52, 8 (2004), 2165–2176.
  • Kumar et al. (2010) M. Pawan Kumar, Benjamin Packer, and Daphne Koller. 2010. Self-paced Learning for Latent Variable Models. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1 (NIPS’10). USA, 1189–1197.
  • Kuncheva and Rodriguez (2007) L. I. Kuncheva and J. J. Rodriguez. 2007. Classifier Ensembles with a Random Linear Oracle. IEEE Transactions on Knowledge and Data Engineering 19, 4 (2007), 500–508.
  • Li et al. (2016) G. Li, J. Wang, Y. Zheng, and M. J. Franklin. 2016. Crowdsourced Data Management: A Survey. IEEE Transactions on Knowledge and Data Engineering 28, 9 (2016), 2296–2319.
  • Li et al. (2015) Lingli Li, Jianzhong Li, and Hong Gao. 2015. Rule-Based method for entity resolution. IEEE Transactions On Knowledge And Data Engineering 27, 1 (2015), 250–263.
  • Monge and Elkan (1996) Alvaro Monge and Charles Elkan. 1996. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. 267–270.
  • Mozafari et al. (2014) Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling up crowd-sourcing to very large datasets: a case for active learning. Proceedings of the VLDB Endowment 8, 2 (2014), 125–136.
  • Mudgal et al. (2018) Sidharth Mudgal, Han Li, Theodoros Rekatsinas, et al. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD). 19–34.
  • Nigam et al. (2000) Kamal Nigam, Andrew Kachites Mccallum, Sebastian Thrun, and Tom Mitchell. 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39, 2-3 (2000), 103–134.
  • Pan and Yang (2010) S. J. Pan and Q. Yang. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
  • Platt et al. (1999) John Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10, 3 (1999), 61–74.
  • Ratner et al. (2017b) A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. 2017b. Snorkel: Rapid Training Data Creation with Weak Supervision. Proceedings of the VLDB Endowment 11, 3 (2017).
  • Ratner et al. (2016a) A. Ratner, C. De Sa, S. Wu, D. Selsam, and C. Ré. 2016a. Data Programming: Creating Large Training Sets, Quickly. In Advances in Neural Information Processing Systems (NIPS), Vol. 29. 3567–3575.
  • Ratner et al. (2016b) A. Ratner, C. De Sa, S. Wu, D. Selsam, and C. Ré. 2016b. Data Programming: Creating Large Training Sets, Quickly. Advances in Neural Information Processing Systems 29 (2016), 3567–3575.
  • Ratner et al. (2017a) Alexander J. Ratner, Stephen H. Bach, Henry R. Ehrenberg, and Chris Ré. 2017a. Snorkel: Fast Training Set Generation for Information Extraction. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). ACM, 1683–1686.
  • Ravikumar and Cohen (2004) Pradeep Ravikumar and William W. Cohen. 2004. A Hierarchical Graphical Model for Record Linkage. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI ’04). AUAI Press, 454–461.
  • Sarawagi and Bhamidipaty (2002) Sunita Sarawagi and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 269–278.
  • Schlimmer and Granger (1986) Jeffrey C. Schlimmer and Richard H. Granger, Jr. 1986. Incremental Learning from Noisy Data. Machine Learning 1, 3 (1986), 317–354.
  • Shafer (1976) Glenn Shafer. 1976. A mathematical theory of evidence. Princeton University Press.
  • Shannon (1948) C. E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal 27, 3 (1948), 379–423.
  • Singla and Domingos (2006) Parag Singla and Pedro Domingos. 2006. Entity resolution with Markov logic. IEEE 6th International Conference on Data Mining (ICDM) (2006), 572–582.
  • Tang et al. (2015) Duyu Tang, Bing Qin, and Ting Liu. 2015. Deep Learning for Sentiment Analysis: Successful Approaches and Future Challenges. Wiley Int. Rev. Data Min. and Knowl. Disc. 5, 6 (2015), 292–303.
  • Tong and Koller (2001) Simon Tong and Daphne Koller. 2001. Support vector machine active learning with applications to text classification. Journal of machine learning research 2, Nov (2001), 45–66.
  • Verroios et al. (2017) Vasilis Verroios, Hector Garcia-Molina, and Yannis Papakonstantinou. 2017. Waldo: An Adaptive Human Interface for Crowd Entity Resolution. In Proceedings of the ACM International Conference on Management of Data (SIGMOD) (2017), 1133–1148.
  • Vesdapunt et al. (2014) Norases Vesdapunt, Kedar Bellare, and Nilesh Dalvi. 2014. Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment 7, 12 (2014), 1071–1082.
  • Wang et al. (2012) Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. 2012. CrowdER: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment 5, 11 (2012), 1483–1494.
  • Wang et al. (2015) Sibo Wang, Xiaokui Xiao, and Chun-Hee Lee. 2015. Crowd-based deduplication: An adaptive approach. Proceedings of the ACM International Conference on Management of Data (SIGMOD) (2015), 1263–1277.
  • Whang et al. (2013a) Steven Euijong Whang, Peter Lofgren, and Hector Garcia-Molina. 2013a. Question selection for crowd entity resolution. Proceedings of the VLDB Endowment 6, 6 (2013), 349–360.
  • Whang et al. (2013b) Steven Euijong Whang, David Marmaros, and Hector Garcia-Molina. 2013b. Pay-as-you-go entity resolution. IEEE Transactions on Knowledge and Data Engineering 25, 5 (2013), 1111–1124.
  • Yang et al. (2018) Jingru Yang, Ju Fan, and Zhewei Wei. 2018. Cost-Effective Data Annotation using Game-Based Crowdsourcing. Proceedings of the VLDB Endowment 12, 1 (2018), 57–70.
  • Yin et al. (2006) Xiaoxin Yin, Jiawei Han, Jiong Yang, and Philip S. Yu. 2006. Efficient Classification across Multiple Database Relations: A CrossMine Approach. IEEE Transactions on Knowledge & Data Engineering 18, 6 (2006), 770–783.
  • Yu and Deng (2014) Dong Yu and Li Deng. 2014. Automatic Speech Recognition: A Deep Learning Approach. Springer Publishing Company, Incorporated.
  • Yue et al. (2007) D. Yue, X. Wu, Y. Wang, Y. Li, and C. Chu. 2007. A Review of Data Mining-Based Financial Fraud Detection Research. In 2007 International Conference on Wireless Communications, Networking and Mobile Computing. 5519–5522.
  • Zhang et al. (2018) Dongxiang Zhang, Long Guo, Xiangnan He, Jie Shao, Sai Wu, and Heng Tao Shen. 2018. A Graph-Theoretic Fusion Framework for Unsupervised Entity Resolution. IEEE 34th International Conference on Data Engineering (ICDE) (2018), 219–230.
  • Zhou et al. (2016) Xiaofeng Zhou, Yang Chen, and Daisy Zhe Wang. 2016. ArchimedesOne: Query Processing over Probabilistic Knowledge Bases. Proc. VLDB Endow. 9, 13 (2016), 1461–1464.