Entity Matching (EM) refers to the problem of determining whether two data entries refer to the same real-world entity. Consider the two datasets about products in Figure 1. The goal is to determine the set of pairs of data entries, one entry from each table, such that the two entries in each pair refer to the same product.
If the datasets are large, it can be expensive to determine the pairs of matching entries. For this reason, EM is typically accompanied by a pre-processing step, called blocking, to prune pairs of entries that are unlikely matches to reduce the number of candidate pairs to consider. As we will illustrate, correctly matching the candidate pairs requires substantial language understanding and domain-specific knowledge. Hence, entity matching remains a challenging task even for the most advanced EM solutions.
We present Ditto, a novel EM solution based on pre-trained Transformer-based language models (or pre-trained language models in short). We cast EM as a sequence-pair classification problem to leverage such models, which have been shown to generate highly contextualized embeddings that capture better language understanding compared to traditional word embeddings. Ditto further improves its matching capability through three optimizations: (1) It allows domain knowledge to be added by highlighting important pieces of the input that may be useful for matching decisions. (2) It summarizes long strings so that only the most essential information is retained and used for EM. (3) It augments training data with (difficult) examples, which challenges Ditto to learn “harder” and also reduces the amount of training data required. Figure 2 depicts Ditto in the overall architecture of a complete EM workflow.
There are 9 candidate pairs of entries to consider for matching in total in Figure 1. The blocking heuristic that matching entries must have at least one word in common in the title reduces the number of pairs to only 3: the first entry on the left with the first entry on the right, and so on. Perhaps more surprisingly, even though the 3 pairs are highly similar and look like matches, only the first and last pairs of entries are true matches. Our system, Ditto, is able to discern the nuances in the 3 pairs and reach the correct conclusion for every pair, while some state-of-the-art systems are unable to do so.
The example illustrates the power of language understanding given by Ditto’s pre-trained language model. It understands that instant immersion spanish deluxe 2.0 is the same as instant immers spanish dlux 2 in the context of software products even though they are syntactically different. Furthermore, one can explicitly emphasize that certain parts of a value are more useful for making matching decisions. For books, the domain knowledge that the grade level or edition is important for matching can be made explicit to Ditto simply by placing tags around the grade/edition values. Hence, for the second candidate pair, even though the titles are highly similar (i.e., they overlap in many words), Ditto is able to focus on the grade/edition information when making the matching decision. The third candidate pair shows the power of language understanding in the opposite situation. Even though the entries look dissimilar, Ditto is able to attend to the right parts of a value (i.e., the manf./modelno under different attributes) and also understand the semantics of the model number to make the right decision.
Contributions In summary, the following are our contributions:
We present Ditto, a novel EM solution based on pre-trained language models (LMs) such as BERT, DistilBERT, and ALBERT. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. To the best of our knowledge, Ditto is the first EM solution that leverages pre-trained Transformer-based LMs, which are powerful LMs that have been shown to provide deeper language understanding.
We also developed three optimization techniques to further improve Ditto’s matching capability through injecting domain knowledge, summarizing long strings, and augmenting training data with (difficult) examples. The first two techniques help Ditto focus on the right information for making matching decisions. The last technique, data augmentation, is adapted from [miao2020snippext] for EM. It helps Ditto learn “harder” to understand the data invariance properties that may exist but are beyond the provided labeled examples, and it also reduces the amount of training data required.
We evaluated the effectiveness of Ditto on three benchmark datasets: the Entity Resolution benchmark [kopcke2010evaluation], the Magellan dataset [Konda:2016:Magellan], and the WDC product matching dataset [Primpeli:2019:WDC] of various sizes and domains. Our experimental results show that Ditto consistently outperforms the previous SOTA EM solutions in all datasets and by up to 25% in F1 scores. Furthermore, Ditto consistently performs better on dirty data and is more label efficient: it achieves the same or higher previous SOTA accuracy using less than half the labeled data.
We applied Ditto to a real-world large-scale matching task on two company datasets, containing 789K and 412K entries respectively. To deploy an end-to-end EM pipeline efficiently, we developed an advanced blocking technique to help reduce the number of pairs to consider for Ditto. Ditto obtains high accuracy, 96.5% F1 on a holdout dataset. The blocking phase also helped speed up the end-to-end EM deployment significantly, by up to 3.8 times, compared to naive blocking techniques.
Finally, we will open-source Ditto in the future.
Outline Section 2 overviews Ditto and pre-trained LMs. Section 3 describes how we optimize Ditto with domain knowledge, summarization, and data augmentation. Our experimental results are described in Section 4 and the case study is presented in Section 5. We discuss related work in Section 6 and conclude in Section 7.
2 Background and Architecture
We present the main concepts behind EM and provide some background on pre-trained LMs before we describe how we fine-tune the LMs on EM datasets to train EM models. We also present a simple method for reducing EM to a sequence-pair classification problem so that pre-trained LMs can be used for solving the EM problem.
Notations Ditto’s EM pipeline takes as input two collections $D$ and $D'$ of data entries (e.g., rows of relational tables, XML documents, JSON files, text paragraphs) and outputs a set of pairs $(e, e')$ with $e \in D$ and $e' \in D'$, where each pair is thought to represent the same real-world entity (e.g., person, company, laptop, etc.). A data entry $e$ is a set of key-value pairs $e = \{(attr_i, val_i)\}_{1 \le i \le k}$ where $attr_i$ is the attribute name and $val_i$ is the attribute’s value represented as text. Note that our definition of data entries is general enough to capture both structured and semi-structured data such as JSON files.
As described earlier, an end-to-end EM system consists of a blocker and a matcher. The goal of the blocking phase is to quickly identify a small subset of $D \times D'$ of candidate pairs of high recall (i.e., a high proportion of actual matching pairs are in that subset). The goal of a matcher (i.e., Ditto) is to accurately predict, given a pair of entries, whether they refer to the same real-world entity.
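The shared-word blocking heuristic mentioned in the introduction, together with the recall of a blocker, can be sketched as follows. This is only an illustration and not Ditto's blocker: the function names `block_by_shared_word` and `recall`, the whitespace tokenization, and the toy titles are assumptions made for the example.

```python
from itertools import product


def block_by_shared_word(left, right, attr="title"):
    """Keep only pairs whose `attr` values share at least one word.

    `left` and `right` are lists of dicts (attribute name -> text value).
    Returns candidate pairs as (left index, right index) tuples.
    """
    candidates = []
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        words_a = set(a.get(attr, "").lower().split())
        words_b = set(b.get(attr, "").lower().split())
        if words_a & words_b:  # at least one word in common
            candidates.append((i, j))
    return candidates


def recall(candidates, true_matches):
    """Fraction of the true matching pairs that survive blocking."""
    if not true_matches:
        return 1.0
    return len(set(candidates) & set(true_matches)) / len(true_matches)


# Toy data (hypothetical, not the exact entries of Figure 1).
left = [{"title": "instant immersion spanish deluxe 2.0"},
        {"title": "adventure workshop 4th-6th grade"}]
right = [{"title": "instant immers spanish dlux 2"},
         {"title": "encore inc adventure workshop 4th-6th grade 8th edition"}]

pairs = block_by_shared_word(left, right)
print(pairs, recall(pairs, [(0, 0)]))
```

Comparing every pair is quadratic in the input size, so this form only conveys the pruning criterion; a practical blocker would typically use an inverted index from words to entries.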
2.1 Pre-trained language models
Unlike prior learning-based EM solutions that rely on word embeddings and customized RNN architectures to train the matching model (see Section 6 for a detailed summary), Ditto trains the matching models by fine-tuning pre-trained LMs in a simpler architecture.
Pre-trained LMs such as BERT [Devlin:2019:BERT], ALBERT [lan2019albert], and GPT-2 [gpt2] have demonstrated good performance on a wide range of NLP tasks. They are typically deep neural networks with multiple Transformer layers [vaswani2017attention], typically 12 or 24 layers, pre-trained on large text corpora such as Wikipedia articles in an unsupervised manner. During pre-training, the model is self-trained to perform auxiliary tasks such as missing token and next-sentence prediction. Studies [Clark:2019:WhatDoesBERTLookAt, Tenney:2019:BertRediscovers] have shown that the shallow layers capture lexical meaning while the deeper layers capture syntactic and semantic meanings of the input sequence after pre-training.
A specific strength of pre-trained LMs is that they learn the semantics of words better than conventional word embedding techniques such as word2vec, GloVe, or FastText. This is largely because the Transformer architecture calculates token embeddings from all the tokens in the input sequence and thus, the embeddings it generates are highly contextualized and capture the semantic and contextual understanding of the words. Consequently, such embeddings can capture polysemy, i.e., discern that the same word may have different meanings in different phrases. For example, the word Sharp has different meanings in “Sharp resolution” versus “Sharp TV”. Pre-trained LMs will embed “Sharp” differently depending on the context while traditional word embedding techniques such as FastText always produce the same vector independent of the context. Such models can also understand the opposite, i.e., that different words may have the same meaning. For example, the words immersion and immers (respectively, (deluxe, dlux) and (2.0, 2)) are likely the same given their respective contexts. Thus, such language understanding capability of pre-trained LMs can improve the EM performance.
2.2 Fine-tuning pre-trained language models
A pre-trained LM can be fine-tuned with task-specific training data so that it becomes better at performing that task. Here, we fine-tune a pre-trained LM for the EM task with a labeled training dataset consisting of positive and negative pairs of matching and non-matching entries as follows:
Add task-specific layers after the final layer of the LM. For EM, we add a simple fully connected layer and a softmax output layer for binary classification.
Initialize the modified network with parameters from the pre-trained LM.
Train the modified network on the training set until it converges.
The result is a model fine-tuned for the EM task. In Ditto, we fine-tune the popular base 12-layer BERT model [Devlin:2019:BERT] and its distilled variant DistilBERT [sanh2019distilbert], which is smaller but more efficient. However, our proposed techniques are independent of the choice of pre-trained LMs and our experimental results (Table 6) indicate that Ditto can potentially perform even better with larger pre-trained LMs. We illustrate the model architecture in Figure 3. The pair of data entries is serialized (see next section) as input to the LM and the output is a match or no-match decision. Ditto’s architecture is much simpler when compared to many state-of-the-art EM solutions today [Mudgal:2018:DeepMatcher, Ebraheem:2018:DeepER]. Even though the bulk of the “work” is simply off-loaded to pre-trained LMs, we show that this simple scheme works surprisingly well in our experiments.
2.3 Serializing the data entries for Ditto
Since LMs take token sequences (i.e., text) as input, a key challenge is to convert the candidate pairs into token sequences so that they can be meaningfully ingested by Ditto.
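Ditto serializes data entries as follows: for each data entry $e = \{(attr_i, val_i)\}_{1 \le i \le k}$, we let
$$\mathrm{serialize}(e) ::= \texttt{[COL]}\ attr_1\ \texttt{[VAL]}\ val_1\ \ldots\ \texttt{[COL]}\ attr_k\ \texttt{[VAL]}\ val_k,$$
where [COL] and [VAL] are special tokens for indicating the start of attribute names and values respectively. For example, the first entry of the second table is serialized as: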
[COL] title [VAL] instant immers spanish dlux 2 [COL] manf./modelno [VAL] NULL [COL] price [VAL] 36.11
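To serialize a candidate pair $(e, e')$, we let
$$\mathrm{serialize}(e, e') ::= \texttt{[CLS]}\ \mathrm{serialize}(e)\ \texttt{[SEP]}\ \mathrm{serialize}(e')\ \texttt{[SEP]},$$
where [SEP] is the special token separating the two sequences and [CLS] is the special token necessary for BERT to encode the sequence pair into a 768-dimensional vector which will be fed into the fully connected layers for classification.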
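The serialization scheme above can be sketched in a few lines. This is a minimal illustration rather than Ditto's code: the function names are hypothetical, entries are represented as lists of (attribute, value) pairs, and mapping a missing value to the string NULL is an assumption based on the example entry.

```python
def serialize_entry(entry):
    """Serialize one data entry given as a list of (attribute, value) pairs."""
    parts = []
    for attr, val in entry:
        text = "NULL" if val is None else str(val)  # assumed handling of missing values
        parts.append(f"[COL] {attr} [VAL] {text}")
    return " ".join(parts)


def serialize_pair(left, right):
    """Serialize a candidate pair into a single sequence-pair string."""
    return f"[CLS] {serialize_entry(left)} [SEP] {serialize_entry(right)} [SEP]"


entry = [("title", "instant immers spanish dlux 2"),
         ("manf./modelno", None),
         ("price", "36.11")]
print(serialize_entry(entry))
```

In practice, tokenizers for BERT-style models usually insert [CLS] and [SEP] themselves when given two text segments, so only the per-entry serialization may need to be written by hand.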
Other serialization schemes There are different ways to serialize data entries so that LMs can treat the input as a sequence classification problem. For example, one can also omit the special tokens “[COL]” and/or “[VAL]”, or exclude attribute names during serialization. We found that including the special tokens to retain the structure of the input does not hurt the performance in general and that excluding the attribute names tends to help only when the attribute names do not contain useful information (e.g., names such as attr1, attr2, …) or when the entries contain only one column. A more rigorous study on this matter is left for future work.
Heterogeneous schemas As shown, the serialization method of Ditto does not require data entries to adhere to the same schema. It also does not require the attributes of data entries to be matched prior to executing the matcher, which is in sharp contrast to other EM systems such as DeepER [Ebraheem:2018:DeepER] or DeepMatcher [Mudgal:2018:DeepMatcher] (in DeepMatcher, the requirement that both entries have the same schema can be removed by treating the values in all columns as one value under one attribute). Furthermore, Ditto can also ingest and match hierarchically structured data entries by serializing nested attribute-value pairs with special start and end tokens (much like Lisp or XML-style parentheses structure).
3 Optimizations in Ditto
As we will describe in Section 4, the basic version of Ditto, which leverages only the pre-trained LM, is already outperforming the SOTA on average. Here, we describe three further optimization techniques that will facilitate and challenge Ditto to learn “harder”, and consequently make better matching decisions.
3.1 Leveraging Domain Knowledge
Our first optimization allows domain knowledge to be injected into Ditto through pre-processing the input sequences (i.e., serialized data entries) to emphasize what pieces of information are potentially important. This follows the intuition that when human workers make a matching/non-matching decision on two data entries, they typically look for spans of text that contain key information before making the final decision. Even though we can also train deep learning EM solutions to learn such knowledge, we will require a significant amount of training data to do so. As we will describe, this pre-processing step on the input sequences is lightweight and yet can yield significant improvements. Our experiment results show that with less than 5% of additional training time, we can improve the model’s performance by up to 8%.
There are two main types of domain knowledge that we can provide Ditto.
Span Typing The type of a span of tokens is one kind of domain knowledge that can be provided to Ditto. Product id, street number, publisher are examples of span types. Span types help Ditto avoid mismatches. With span types, for example, Ditto is likelier to avoid matching a street number with a year or a product id.
Table 1 summarizes the main span types that human workers would focus on when matching three types of entities in our benchmark datasets.
| Entity Type | Types of Important Spans |
|---|---|
| Publications, Movies, Music | Persons (e.g., Authors), Year, Publisher |
| Organizations, Employers | Last 4 digits of phone, Street number |
| Products | Product ID, Brand, Configurations (num.) |
The developer specifies a recognizer to type spans of tokens from attribute values. The recognizer takes a text string $v$ as input and returns a list of start/end positions of the spans in $v$ and the corresponding type of each span. Ditto’s current implementation leverages an open-source Named-Entity Recognition (NER) model [spacy:ner] to identify known types such as persons, dates, or organizations and uses regular expressions to identify specific types such as product IDs, last 4 digits of phone numbers, etc.
After the types are recognized, the original text is replaced by a new text where special tokens are inserted to reflect the types of the spans. For example, a phone number “(866) 246-6453” may be replaced with “( 866 ) 246 - [LAST] 6453 [/LAST]” where [LAST]/[/LAST] indicates the start/end of the last 4 digits and additional spaces are also added because of tokenization. In our implementation, when we are sure that the span type has only one token or the NER model is inaccurate in determining the end position, we drop the end indicator and keep only the start indicator token.
Intuitively, these newly added special tokens are additional signals to the self-attention mechanism that already exists in pre-trained LMs, such as BERT. If two spans have the same type, then Ditto picks up the signal that they are likelier to be the same and hence, they are aligned together for matching. In the above example, when the model sees two encoded sequences with the [LAST] special tokens, it is likely to take the hint to align “6453” with “0000” without relying on other patterns elsewhere in the sequence that may be harder to learn.
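Span typing with a regular-expression recognizer can be sketched as below. The names `recognize_last4` and `insert_tags` and the phone-number pattern are hypothetical choices for illustration; the sketch follows the interface described above (a list of start/end positions with a type) but does not reproduce the extra spaces that tokenization introduces in the example from the text.

```python
import re

# Assumed pattern for phone numbers written like "(866) 246-6453".
PHONE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")


def recognize_last4(text):
    """Return (start, end, type) for the last 4 digits of each phone number."""
    return [(m.start(3), m.end(3), "LAST") for m in PHONE.finditer(text)]


def insert_tags(text, spans):
    """Surround every recognized span with [TYPE] ... [/TYPE] tokens.

    Spans are assumed not to overlap.
    """
    out, prev = [], 0
    for start, end, kind in sorted(spans):
        out.append(text[prev:start])
        out.append(f" [{kind}] {text[start:end]} [/{kind}] ")
        prev = end
    out.append(text[prev:])
    return " ".join("".join(out).split())  # collapse repeated whitespace


phone = "(866) 246-6453"
print(insert_tags(phone, recognize_last4(phone)))
```

Separating recognition from tag insertion lets an NER model and several regular expressions contribute spans to the same rewriting step.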
Span Normalization The second kind of domain knowledge that can be passed to Ditto rewrites syntactically different but equivalent spans into the same string. This way, they will have identical embeddings and it becomes easier for Ditto to detect that the two spans are identical. For example, we can enforce that “VLDB journal” and “VLDBJ” are the same by writing them as VLDBJ. Similarly, we can enforce the general knowledge that “5 %” and “5.00 %” are equal by writing them as “5.0%”.
The developer specifies a set of rewriting rules to rewrite spans. The specification consists of a function that first identifies the spans of interest before it replaces them with the rewritten spans. Ditto contains a number of rewriting rules for numbers, including rules that round all floating point numbers to 2 decimal places and drop all commas from integers (e.g., “2,020” → “2020”). For abbreviations, we allow the developers to specify a dictionary of synonym pairs to normalize all synonym spans to be the same.
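The rewriting rules described above might look like the following sketch. The function name `normalize_spans`, the regular expressions, and the choice to print rounded numbers with exactly two decimals are assumptions for illustration; the text does not specify the output format of rounded numbers.

```python
import re

INT_WITH_COMMAS = re.compile(r"\b\d{1,3}(?:,\d{3})+\b")
FLOAT = re.compile(r"\d+\.\d+")


def normalize_spans(text, synonyms=None):
    """Rewrite equivalent spans into one canonical string."""
    # Drop thousands separators: "2,020" -> "2020".
    text = INT_WITH_COMMAS.sub(lambda m: m.group().replace(",", ""), text)
    # Round floating point numbers to 2 decimal places (assumed output format).
    text = FLOAT.sub(lambda m: f"{float(m.group()):.2f}", text)
    # Map every variant to its canonical form using a developer-supplied dictionary.
    for variant, canonical in (synonyms or {}).items():
        text = text.replace(variant, canonical)
    return text


print(normalize_spans("VLDB journal, 2,020 pages, 3.14159",
                      {"VLDB journal": "VLDBJ"}))
```

A simple float pattern like this one would also rewrite dotted identifiers such as version strings, so a real rule set would likely need to restrict where it applies.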
3.2 Summarizing long entries
When the value is an extremely long string, it becomes harder for the LM to understand what to pay attention to when matching. In addition, one limiting factor of Transformer-based pre-trained LMs is that there is a limit on the sequence length of the input. For example, the input to BERT can have at most 512 sub-word tokens. It is thus important to summarize the serialized entries down to the maximum allowed length while retaining the key information. A common practice is to truncate the sequences so that they fit within the maximum length. However, the truncation strategy does not work well for EM in general because the important information for matching is usually not at the beginning of the sequences.
There are many ways to perform summarization [mihalcea2004textrank, radford2019language, rush2015neural]. In Ditto’s current implementation, we use a TF-IDF-based summarization technique that retains the non-stopword tokens with high TF-IDF scores. We ignore the start and end tags generated by span typing in this process and use the list of stop words from the scikit-learn library. By doing so, Ditto feeds only the most informative tokens to the LM. We found that this technique works well in practice. Our experiment results show that it improves the F1 score of Ditto on a text-heavy dataset from 40% to over 92%. We plan to add more summarization techniques to Ditto’s library in the future.
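A minimal sketch of TF-IDF-based summarization follows, under stated assumptions: the tiny stop-word set stands in for the scikit-learn list, the smoothed IDF formula and the tie-breaking rule are choices made for the example, and the function names are hypothetical. Sequences that already fit within the limit are returned unchanged.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (a stand-in for a full list).
STOPWORDS = {"the", "a", "an", "of", "and", "for", "with", "in", "to"}


def idf_table(docs):
    """Smoothed inverse document frequency for every token in `docs`."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {tok: math.log((1 + n) / (1 + cnt)) + 1.0 for tok, cnt in df.items()}


def summarize(tokens, idf, max_len):
    """Keep at most `max_len` non-stopword tokens with the highest TF-IDF.

    The surviving tokens keep their original relative order.
    """
    if len(tokens) <= max_len:
        return list(tokens)
    tf = Counter(tokens)
    scored = [(tf[tok] * idf.get(tok, 1.0), i)
              for i, tok in enumerate(tokens) if tok not in STOPWORDS]
    # Highest score first; earlier position wins ties.
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))[:max_len]
    keep = sorted(i for _, i in ranked)
    return [tokens[i] for i in keep]


docs = [["sony", "camera", "with", "zoom", "lens"],
        ["sony", "tv", "with", "remote"],
        ["sony", "camera", "bag"]]
idf = idf_table(docs)
print(summarize(docs[0], idf, max_len=3))
```

Here the token that appears in every document ("sony") and the stop word are dropped first, which is the behavior the technique relies on.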
3.3 Augmenting training data
We describe how we apply data augmentation to augment the training data for entity matching.
Data augmentation (DA) is a commonly used technique in computer vision for generating additional training data from existing examples by simple transformations such as cropping, flipping, rotation, padding, etc. The DA operators not only add more training data; the augmented data also allows the model to learn to make predictions that are invariant to these transformations.
Similarly, DA can add training data that will help EM models learn “harder”. Although labeled examples for EM are arguably not hard to obtain, invariance properties are very important to help make the solution more robust to dirty data, such as missing values (NULLs), values that are placed under the wrong attributes or missing some tokens.
Next, we introduce a set of DA operators for EM that will help train more robust models.
Augmentation operators for EM The proposed DA operators are summarized in Table 2. If $s$ is a serialized pair of data entries with a match or no-match label $y$, then an augmented example is a pair $(s', y)$, where $s'$ is obtained by applying an operator $o$ on $s$ and $s'$ has the same label $y$ as before.
| Operator | Explanation |
|---|---|
| span_del | Delete a randomly sampled span of tokens |
| span_shuffle | Randomly sample a span and shuffle the tokens’ order |
| attr_del | Delete a randomly chosen attribute and its value |
| attr_shuffle | Randomly shuffle the orders of all attributes |
| entry_swap | Swap the order of the two data entries $e$ and $e'$ |
The operators are divided into 3 categories. The first category consists of span-level operators, such as span_del and span_shuffle. These two operators are used in NLP tasks [wei2019eda, miao2020snippext] and have been shown to be effective for text classification. For span_del, we randomly delete from $s$ a span of tokens of length at most 4 without special tokens (e.g., [SEP], [COL], [VAL]). For span_shuffle, we sample a span of length at most 4 and randomly shuffle the order of its tokens.
These two operators are motivated by the observation that making a match/no-match decision can sometimes be “too easy” when the candidate pair of data entries contains multiple spans of text supporting the decision. For example, suppose our negative examples for matching company data in the existing training data are similar to what is shown below.
[CLS] [VAL] Google LLC [VAL] (866) 246-6453 [SEP]
[VAL] Alphabet inc [VAL] (650) 253-0000 [SEP]
The model may learn to predict “no-match” based on the phone number alone, which is insufficient in general. On the other hand, by corrupting parts of the input sequence (e.g., dropping phone numbers), DA forces the model to learn beyond that, by leveraging the remaining signals, such as the company name, to predict “no-match”.
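The two span-level operators can be sketched on a whitespace-tokenized sequence as follows. The helper `_sample_span`, the uniform sampling over all admissible spans, and the explicit random generator argument are assumptions for illustration; only the constraints stated above (length at most 4, no special tokens inside the span) come from the text.

```python
import random

SPECIAL = {"[CLS]", "[SEP]", "[COL]", "[VAL]"}


def _sample_span(tokens, max_len, rng):
    """Pick a random (start, end) span containing no special tokens, or None."""
    candidates = [
        (i, i + n)
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
        if not any(t in SPECIAL for t in tokens[i:i + n])
    ]
    return rng.choice(candidates) if candidates else None


def span_del(tokens, rng, max_len=4):
    """Delete a randomly sampled span of at most `max_len` tokens."""
    span = _sample_span(tokens, max_len, rng)
    if span is None:
        return list(tokens)
    start, end = span
    return tokens[:start] + tokens[end:]


def span_shuffle(tokens, rng, max_len=4):
    """Shuffle the token order inside a randomly sampled span."""
    span = _sample_span(tokens, max_len, rng)
    if span is None:
        return list(tokens)
    start, end = span
    middle = tokens[start:end]
    rng.shuffle(middle)
    return tokens[:start] + middle + tokens[end:]


tokens = ("[CLS] [VAL] Google LLC [VAL] (866) 246-6453 [SEP] "
          "[VAL] Alphabet inc [VAL] (650) 253-0000 [SEP]").split()
print(" ".join(span_del(tokens, random.Random(0))))
```

Passing the random generator explicitly keeps augmentation reproducible, which helps when comparing operator combinations.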
The second category of operators is attribute-level operators: attr_del and attr_shuffle. The operator attr_del randomly deletes an attribute (both name and value) and attr_shuffle randomly shuffles the order of the attributes of both data entries. The motivation for attr_del is similar to span_del and span_shuffle but it gets rid of an attribute entirely. The attr_shuffle operator allows the model to learn the property that the matching decision should be independent of the ordering of attributes in the sequence.
The last operator, entry_swap, swaps the order of the pair $(e, e')$ with a given probability. This teaches the model to make symmetric decisions (i.e., the prediction for $(e, e')$ should equal the prediction for $(e', e)$) and helps double the size of the training set if both input tables are from the same data source.
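The attribute-level operators and entry_swap can be sketched on entries represented as lists of (attribute, value) pairs. The representation, the guard that keeps the last remaining attribute, and the default swap probability `p=0.5` are assumptions made for the example, not values taken from the text.

```python
import random


def attr_del(entry, rng):
    """Delete one randomly chosen attribute (name and value) from an entry."""
    if len(entry) <= 1:
        return list(entry)  # assumption: never delete the only attribute
    drop = rng.randrange(len(entry))
    return [pair for i, pair in enumerate(entry) if i != drop]


def attr_shuffle(entry, rng):
    """Return a copy of the entry with its attributes in random order."""
    shuffled = list(entry)
    rng.shuffle(shuffled)
    return shuffled


def entry_swap(left, right, rng, p=0.5):
    """Swap the two entries with probability p (p=0.5 is an assumed default)."""
    return (right, left) if rng.random() < p else (left, right)


rng = random.Random(1)
left = [("name", "Google LLC"), ("phone", "(866) 246-6453")]
right = [("name", "Alphabet inc"), ("phone", "(650) 253-0000")]
augmented = entry_swap(attr_shuffle(left, rng), attr_shuffle(right, rng), rng)
print(augmented)
```

Each operator returns a new list and leaves its input untouched, so the original example can still be paired with the augmented one.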
MixDA: interpolating the augmented data
Unlike DA operators for images which almost always preserve the image labels, the operators for EM can distort the input sequence so much that the label becomes incorrect. For example, the attr_del operator may drop the company name entirely and the remaining attributes may contain no useful signals to distinguish the two entries.
To address this issue, Ditto applies MixDA, a recently proposed data augmentation technique for NLP tasks [miao2020snippext] illustrated in Figure 4. Instead of using the augmented example directly, MixDA computes a convex interpolation of the original example with the augmented examples. Hence, the interpolated example is somewhere in between, i.e., it is a “partial” augmentation of the original example and this interpolated example is expected to be less distorted than the augmented one.
The idea of interpolating two examples was originally proposed for computer vision tasks [zhang2017mixup]. For EM or text data, since we cannot directly interpolate sequences, MixDA interpolates their representations by the language model instead. We omit the technical details and refer the interested readers to [miao2020snippext]. In practice, augmentation with MixDA slows down training because the LM is called twice. However, the prediction time is not affected since the DA operators are only applied to training data.
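The convex interpolation at the core of MixDA can be illustrated with plain vectors. This sketch is not the MixDA implementation: the short lists stand in for LM representations of the original and augmented sequences, the function names are hypothetical, and the Beta parameters used in the example are placeholders rather than values from the text.

```python
import random


def interpolate(original, augmented, lam):
    """Convex combination lam * original + (1 - lam) * augmented."""
    if len(original) != len(augmented):
        raise ValueError("representations must have the same dimension")
    return [lam * x + (1.0 - lam) * y for x, y in zip(original, augmented)]


def sample_lambda(alpha, beta, rng):
    """Draw the interpolation weight from a Beta(alpha, beta) distribution."""
    return rng.betavariate(alpha, beta)


rng = random.Random(0)
h_original = [1.0, 0.0]   # stand-in for the LM encoding of the original example
h_augmented = [0.0, 1.0]  # stand-in for the LM encoding of the augmented example
lam = sample_lambda(2.0, 2.0, rng)  # placeholder Beta parameters
print(interpolate(h_original, h_augmented, lam))
```

With `lam` close to 1 the mixed representation stays near the original example, which is how interpolation limits the distortion introduced by an aggressive operator.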
4 Experiments
We present the experiment results on benchmark datasets for EM: the ER Benchmark datasets [kopcke2010evaluation], the Magellan datasets [Konda:2016:Magellan] and the WDC product data corpus [Primpeli:2019:WDC]. Ditto achieves new SOTA results on all these datasets and outperforms the previous best results by up to 25% in F1 score. The results show that Ditto is more robust to dirty data and performs well when the training set is small. Ditto is also more label-efficient as it achieves the previous SOTA results using only 1/2 of the training data across multiple subsets of the WDC corpus. Our ablation analysis shows that (1) using pre-trained LMs contributes to over 50% of Ditto’s performance gain and (2) all 3 optimizations, domain knowledge (DK), summarization (SU) and data augmentation (DA), are effective. For example, SU improves the performance on a text-heavy dataset by 41%, DK leads to 1.98% average improvement on the ER-Magellan datasets and DA improves on the WDC datasets by 2.53% on average.
4.1 Benchmark datasets
We experimented with all the 13 publicly available datasets used for evaluating DeepMatcher [Mudgal:2018:DeepMatcher]. These datasets are from the ER Benchmark datasets [kopcke2010evaluation] and the Magellan data repository [magellandata]. We summarize the datasets in Table 3 and refer to them as ER-Magellan. These datasets are for training and evaluating matching models for various domains including products, publications, and businesses. Each dataset consists of candidate pairs from two structured tables of entity records of the same schema. The pairs are sampled from the results of blocking and manually labeled. The positive rate (i.e., the ratio of matched pairs) ranges from 9.4% (Walmart-Amazon) to 25% (Company). The number of attributes ranges from 1 to 8.
Among the datasets, the Abt-Buy and Company datasets are text-heavy, meaning that at least one attribute contains long text. Also, following [Mudgal:2018:DeepMatcher], we use the dirty version of the DBLP-ACM, DBLP-Scholar, iTunes-Amazon, and Walmart-Amazon datasets to measure the robustness of the models against noise. These datasets are generated from the clean version by randomly emptying attributes and appending their values to another randomly selected attribute.
Each dataset is split into the training, validation, and test sets using the ratio of 3:1:1. We list the size of each dataset in Table 5.
| Datasets | Domains |
|---|---|
| Amazon-Google, Walmart-Amazon | software / electronics |
| DBLP-ACM*, DBLP-Scholar*, iTunes-Amazon* | citation / music |
| Company, Fodors-Zagats | company / restaurant |
The WDC product data corpus [Primpeli:2019:WDC] contains 26 million product offers and descriptions collected from e-commerce websites [wdc]. The goal is to find product offer pairs that refer to the same product. To evaluate the accuracy of product matchers, the dataset provides 4,400 manually created golden labels of offer pairs from 4 categories: computers, cameras, watches, and shoes. Each category has a fixed number of 300 positive and 800 negative pairs. For training, the dataset provides for each category pairs that share the same product ID such as GTINs or MPNs mined from the product’s webpage. The negative examples are created by selecting pairs that have high textual similarity but different IDs. These labels are further reduced to different sizes to test the models’ label efficiency. We summarize the different subsets in Table 4. We refer to these subsets as the WDC datasets.
Each entry in this dataset has 5 attributes. Ditto uses only the title attribute because it contains rich product information such as brands, IDs, and price, making the rest of the attributes redundant. Meanwhile, DeepMatcher is allowed to use any subsets of attributes to determine the best attribute set as in [Primpeli:2019:WDC].
4.2 Implementation and experimental setup
We implemented Ditto using PyTorch [paszke2019pytorch] and the Transformers library [wolf2019transformers]. The default setting uses the uncased 6-layer DistilBERT [sanh2019distilbert] pre-trained model and half-precision floating-point (fp16) to accelerate the training and prediction speed. In all the experiments, we fix the learning rate to be 3e-5 and the max sequence length to be 256. The batch size is 32 if MixDA is used and 64 otherwise. The training process runs a fixed number of epochs (10, 15, or 40 depending on the dataset size) and returns the checkpoint with the highest F1 score on the validation set. We conducted all experiments on a p3.8xlarge AWS EC2 machine with 4 V100 GPUs (one GPU per run).
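The checkpoint selection rule described above depends on the F1 score, the harmonic mean of precision and recall on the match class. The sketch below shows both pieces in plain Python; the function names and the per-epoch scores in the example are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive (match = 1) class for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_checkpoint(val_f1_per_epoch):
    """Return (epoch index, F1) of the epoch with the highest validation F1."""
    best = max(range(len(val_f1_per_epoch)), key=lambda i: val_f1_per_epoch[i])
    return best, val_f1_per_epoch[best]


print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
print(best_checkpoint([0.71, 0.83, 0.80]))  # hypothetical per-epoch scores
```

Selecting on validation F1 rather than accuracy matters here because matching pairs are a minority of the candidate pairs.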
Compared methods. We compare Ditto with the SOTA EM solution DeepMatcher and its variants. We also compare with variants of Ditto without the data augmentation (DA) and/or domain knowledge (DK) optimization to evaluate the effectiveness of each component. We summarize these methods below. We report the average F1 of 5 repeated runs in all the settings.
DeepMatcher: DeepMatcher [Mudgal:2018:DeepMatcher] is the SOTA matching solution. Compared to Ditto, DeepMatcher customizes the RNN architecture to aggregate the attribute values, then compares/aligns the aggregated representations of the attributes. DeepMatcher leverages FastText [fasttext] to train the word embeddings. When reporting DeepMatcher’s F1 scores, we use the numbers in [Mudgal:2018:DeepMatcher] for the ER-Magellan datasets and numbers in [Primpeli:2019:WDC] for the WDC datasets. We also reproduced those results using the open-sourced implementation and report the training time.
DeepMatcher+: Follow-up work [Kasai:2019:LowResourceER] slightly outperforms DeepMatcher in the DBLP-ACM dataset and [Fu:2019:End2End] achieves better F1 in the Walmart-Amazon and Amazon-Google datasets. According to [Mudgal:2018:DeepMatcher], the Magellan system ([Konda:2016:Magellan], based on classical ML models) outperforms DeepMatcher in the Beer and iTunes-Amazon datasets. For these cases, we denote by DeepMatcher+ the best F1 scores among DeepMatcher and the aforementioned works.
Ditto: This is the full version of our system with all 3 optimizations, domain knowledge (DK), TF-IDF summarization (SU), and data augmentation (DA) turned on. See the details below.
Ditto(DA): This version only turns on the DA (with MixDA) and SU but does not have the DK optimization. We apply one of the span-level or attribute-level DA operators listed in Table 2 with the entry_swap operator. We compare the different combinations and report the best one. Following [miao2020snippext], we apply MixDA with the interpolation parameter sampled from a Beta distribution.
Ditto(DK): With only the DK and SU optimizations on, this version of Ditto is expected to have lower F1 scores but train much faster. We apply the span-typing to datasets of each domain according to Table 1 and apply the span-normalization on the number spans.
Baseline: This base form of Ditto corresponds simply to fine-tuning a pre-trained LM (DistilBERT) on the EM task. We did not apply any optimizations on the baseline. We pick DistilBERT instead of larger models such as BERT or ALBERT because DistilBERT is faster to train and it also makes a tougher comparison for Ditto since larger models are generally perceived to have more powerful language understanding capabilities [yang2019xlnet, liu2019roberta, lan2019albert].
4.3 Main results
Table 5 shows the results of the ER-Magellan datasets. Overall, Ditto (with optimizations) achieves significantly higher F1 scores than the SOTA results (DeepMatcher+). Ditto without optimizations (i.e., the baseline) achieves comparable results with DeepMatcher+. Ditto outperforms DeepMatcher+ in 10/13 cases and by up to 25% (Dirty, Walmart-Amazon) while the baseline outperforms DeepMatcher+ in 8/13 cases and by up to 16% (Dirty, Walmart-Amazon). On the 3 cases that Ditto performs slightly worse than DeepMatcher+, it turns out that using a larger pre-trained LMs such as BERT or ALBERT helps fill the gaps (see Table 6). These initial results led us to believe that larger pre-trained language models will further improve Ditto’s results and we leave as future work to further verify this hypothesis.
In addition, we found that Ditto performs better on datasets with small training sets. In particular, the average improvement on the 7 smallest datasets is 9.96% vs. 0.32% on average on the rest of the datasets. Ditto is also more robust against data noise than DeepMatcher+. In the 4 dirty datasets, the performance degradation of Ditto is only 0.68 on average while the performance of DeepMatcher+ degrades by 8.21. These two properties make Ditto more attractive in practical EM settings.
Ditto also achieves promising results on the WDC datasets (Table 7). Ditto achieves the highest F1 score of 94.08 when using all of the 215k training examples, outperforming the previous best result by 3.92. Similar to what we found on the ER-Magellan datasets, the improvements are higher in settings with fewer training examples (to the right of Table 7). The results also show that Ditto is more label efficient than DeepMatcher. For example, when using only 1/2 of the data (Large), Ditto already outperforms DeepMatcher with all the training data (xLarge) by 2.89 in All. When using only 1/8 of the data (Medium), the performance is within 1% of DeepMatcher’s F1 when 1/2 of the data (Large) is in use. The only exception is the shoes category. This may be caused by the large gap in the positive label ratios between the training set and the test set (9.76% vs. 27.27% according to Table 4).
[Table 6 (header only; rows not recovered): Datasets | DM+ | Ditto (BERT) | delta | Ditto (ALBERT) | delta]
[Table 7 (header only; rows not recovered): Size | xLarge (1/1) | Large (1/2) | Medium (1/8) | Small (1/20)]
Training time. We plot the training time required by DeepMatcher and Ditto in Figure 6. We do not plot the time for Ditto(DA) because it differs from the full Ditto only in the DK optimization, which merely pre-processes the data and adds no more than 5% to the training time. The running time ranges from 69 seconds (450 examples) to 5.2 hours (113k examples). Ditto has a training time similar to DeepMatcher’s even though DistilBERT, which is used by Ditto, has a Transformer-based architecture that is deeper and more complex. The speed-up is due to DistilBERT and the fp16 optimization. Ditto with MixDA is about 2-3x slower than Ditto(DK) without MixDA. This is because MixDA requires additional time for generating the augmented pairs and for running the LM twice. However, this overhead only affects offline training and does not affect online prediction.
4.4 Ablation study
The use of a pre-trained LM contributes to a large portion of the performance gain. In the ER-Magellan datasets (excluding Company), the average improvement of the baseline compared to DeepMatcher+ is 3.49, which accounts for 58% of the improvement of the full Ditto (6.0). While DeepMatcher+ and the baseline Ditto (essentially fine-tuning DistilBERT) are comparable on the Structured datasets, the baseline performs much better on all the Dirty datasets and the Abt-Buy dataset. This confirms our intuition that the language understanding capability is a key advantage of Ditto over existing EM solutions. The Company dataset is a special case because the length of the company articles (3,123 words on average) is much greater than the max sequence length of 256. The SU optimization increases the F1 score of this dataset from 41% to over 92%. In the WDC datasets, across the 20 settings, LM contributes to 3.41 F1 improvement on average, which explains 55.3% of improvement of the full Ditto (6.16).
The DK optimization is more effective on the ER-Magellan datasets. Compared to the baseline, the improvement of Ditto(DK) is 1.98 on average and is up to 9.67 on the Beer dataset while the improvement is only 0.22 on average on the WDC datasets. We inspected the span-typing output and found that only 66.2% of entry pairs have spans of the same type. This is caused by the current NER module not extracting product-related spans with the correct types. We expect DK to be more effective if we use an NER model trained on the product domain.
DA is effective on both benchmarks, and more so on the WDC datasets. The average F1 score of the full Ditto improves upon Ditto(DK) (without DA) by 0.53 on the ER-Magellan datasets and by 2.53 on the WDC datasets. In the WDC datasets, we found that the span_del operator always performs the best, while the best operators are diverse in the ER-Magellan datasets. We list the best operator for each dataset in Table 8. We note that there is a large space for tuning these operators (e.g., the MixDA interpolation parameter, the maximal span length, etc.) and for designing new operators to further improve the performance. Finding the best DA operators for EM is future work beyond the scope of this paper.
|Operator||Datasets|
|span_shuffle||DBLP-ACM (Both), DBLP-Google (Both), Abt-Buy|
|span_del||Walmart-Amazon(D), Company, all of WDC|
|attr_del||Beer, iTunes-Amazon(S), Walmart-Amazon(S)|
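To make the three operators named in Table 8 concrete, the following is a minimal sketch under stated assumptions: span_del removes a random contiguous span of tokens, span_shuffle permutes the tokens inside a random span, and attr_del drops one attribute from an entry. These behaviors are inferred from the operator names; the `max_len` default, the token-list and dict input types, and the function signatures are hypothetical choices for illustration.

```python
import random


def _pick_span(n_tokens, max_len, rng):
    """Choose a random (start, length) span inside a sequence of n_tokens."""
    length = rng.randint(1, min(max_len, n_tokens))
    start = rng.randint(0, n_tokens - length)
    return start, length


def span_del(tokens, max_len=4, rng=random):
    """Delete one random contiguous span of at most max_len tokens."""
    if not tokens:
        return list(tokens)
    start, length = _pick_span(len(tokens), max_len, rng)
    return tokens[:start] + tokens[start + length:]


def span_shuffle(tokens, max_len=4, rng=random):
    """Shuffle the tokens inside one random span, leaving the rest in place."""
    if not tokens:
        return list(tokens)
    start, length = _pick_span(len(tokens), max_len, rng)
    span = tokens[start:start + length]
    rng.shuffle(span)
    return tokens[:start] + span + tokens[start + length:]


def attr_del(record, rng=random):
    """Drop one randomly chosen attribute from a record given as a dict."""
    if not record:
        return dict(record)
    victim = rng.choice(sorted(record))
    return {k: v for k, v in record.items() if k != victim}


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "instant immersion spanish deluxe 2.0".split()
    print(span_del(tokens, rng=rng))
    print(span_shuffle(tokens, rng=rng))
    print(attr_del({"title": "immersion", "price": "36.11"}, rng=rng))
```

Each operator returns a new object and leaves its input unchanged, so the original entry remains available for pairing with its augmented version.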
5 Case Study: Employer Matching
We present a case of applying Ditto to a real-world EM task. An online recruiting platform would like to join its internal employer records with newly collected public records to enable downstream aggregation tasks. Formally, given two tables of employer records, one internal and one public, the goal of the task is to find, for every record in one table, a record in the other table that represents the same employer. Both tables have 6 attributes: name, addr, city, state, zipcode, and phone. Our goal is to find matching record pairs with both high precision and recall.
Basic blocking. Our first challenge is the size of the datasets. As shown in Table 9, both tables are of nontrivial size even after deduplication, so a naive pairwise comparison is not feasible. The first blocking method we designed only matches companies with the same zipcode. However, since 60% of the records in one of the tables do not have the zipcode attribute and some large employers have multiple sites, we use a second blocking method that returns, for each record in one table, the top-20 most similar records in the other table, ranked by the TF-IDF cosine similarity of the name and addr attributes. We use the union of these two methods as our blocker, which produces 10 million candidate pairs.
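The union of the two blocking methods described above can be sketched as follows. This is a simplified illustration, not the blocker used in the case study: the smoothed IDF formula, whitespace tokenization, the brute-force comparison loop, and the function names are all assumptions, and a deployment at the scale of Table 9 would need an inverted index or similar structure instead of comparing every pair.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return L2-normalized TF-IDF vectors (as dicts) for a list of strings."""
    docs = [t.lower().split() for t in texts]
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d))
    vectors = []
    for d in docs:
        # Smoothed IDF; the exact weighting scheme is an assumption.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in Counter(d).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two normalized sparse vectors, i.e. cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def record_text(record):
    """Concatenate the name and addr attributes used for similarity."""
    return f"{record.get('name', '')} {record.get('addr', '')}"


def block(left, right, k=20):
    """Union of same-zipcode pairs and top-k TF-IDF neighbours per left record."""
    vectors = tfidf_vectors([record_text(r) for r in left + right])
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    pairs = set()
    for i, l in enumerate(left):
        # Method 1: exact zipcode match (skipped when the zipcode is missing).
        for j, r in enumerate(right):
            if l.get("zipcode") and l.get("zipcode") == r.get("zipcode"):
                pairs.add((i, j))
        # Method 2: top-k right records by cosine similarity of name + addr.
        ranked = sorted(range(len(right)),
                        key=lambda j: cosine(left_vecs[i], right_vecs[j]),
                        reverse=True)
        pairs.update((i, j) for j in ranked[:k])
    return pairs
```

The second method covers records without a zipcode, since every left record contributes its top-k neighbours regardless of which attributes are missing.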
Data labeling. We labeled 10,000 pairs sampled from the results of each blocking method (20,000 labels in total). We sampled pairs of high similarity with higher probability to increase the difficulty of the dataset to train more robust models. The positive rate of all the labeled pairs is 39%. We split the labeled pairs into training, validation, and test sets by the ratio of 3:1:1.
Applying Ditto. The user of Ditto does not need to extensively tune the hyperparameters but only needs to specify the domain knowledge and choose a data augmentation operator. We observe that the street number and the phone number are both useful signals for matching. Thus, we implemented a simple rule that tags the first number string in the addr attribute and the last 4 digits of the phone attribute. Since we would like the trained model to be robust against the large number of missing values, we choose the attr_del operator for data augmentation.
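The tagging rule described above might look like the following sketch. The tag tokens `[NUM]` and `[LAST4]` and the helper names are hypothetical; the text only states which spans are tagged, not how the tags are written.

```python
import re


def tag_first_number(addr):
    """Wrap the first run of digits in an address with hypothetical [NUM] tags."""
    return re.sub(r"\d+", lambda m: f"[NUM] {m.group(0)} [/NUM]", addr, count=1)


def tag_last4(phone):
    """Wrap the trailing 4 consecutive digits of a phone string with [LAST4] tags.

    Strings that do not end in 4 consecutive digits (ignoring trailing
    non-digit characters) are returned unchanged.
    """
    return re.sub(r"(\d{4})(\D*)$", r"[LAST4] \1 [/LAST4]\2", phone, count=1)


def tag_employer(record):
    """Return a copy of the record with addr and phone tagged when present."""
    tagged = dict(record)
    if record.get("addr"):
        tagged["addr"] = tag_first_number(record["addr"])
    if record.get("phone"):
        tagged["phone"] = tag_last4(record["phone"])
    return tagged


if __name__ == "__main__":
    print(tag_employer({"name": "Acme Corp",
                        "addr": "12 Main St Apt 4",
                        "phone": "(415) 555-0134"}))
```

Missing attributes are left untouched, which matters here because many records lack values for some attributes.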
We plot the model’s performance in Figure 7. Ditto achieves the highest F1 score of 96.53 when using all the training data. Ditto outperforms DeepMatcher (DM) in F1 and trains faster (even when using MixDA) than DeepMatcher across different training set sizes.
Advanced blocking. Optionally, before applying the trained model to all the candidate pairs, we can use the labeled data to improve the basic blocking method. We leverage Sentence-BERT [reimers2019sentence], a variant of the BERT model that trains sentence embeddings for sentence similarity search. The trained model generates a high-dimensional (e.g., 768 for BERT) vector for each record. Although this model has a relatively low F1 (only 92%) and thus cannot replace Ditto, we can use it with vector similarity search to quickly find record pairs that are likely to match. We can greatly reduce the matching time by testing only those pairs with high cosine similarity. We list the running time for each module in Table 10. With this technique, the overall EM process is accelerated by 3.8x (1.69 hours with advanced blocking vs. 6.49 hours without).
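The candidate-filtering idea behind advanced blocking can be sketched with a brute-force cosine search. The short toy vectors stand in for the per-record embeddings, and the function name, `k`, and `min_sim` threshold are illustrative assumptions; a real system would use an approximate or indexed vector similarity search rather than scanning all vectors.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k_similar(query_vecs, index_vecs, k=2, min_sim=0.0):
    """For each query vector, return up to k (query_id, index_id, cosine)
    triples with cosine similarity of at least min_sim, best first."""
    index = [normalize(v) for v in index_vecs]
    results = []
    for qi, q in enumerate(query_vecs):
        q = normalize(q)
        sims = [(sum(a * b for a, b in zip(q, v)), j)
                for j, v in enumerate(index)]
        sims.sort(key=lambda item: item[0], reverse=True)
        results.extend((qi, j, s) for s, j in sims[:k] if s >= min_sim)
    return results


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real record embeddings are much larger.
    queries = [[1.0, 0.0], [0.0, 1.0]]
    index = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
    for qi, j, sim in top_k_similar(queries, index, k=1):
        print(qi, j, round(sim, 3))
```

Only the pairs returned by such a search would then be passed to the more expensive matching model.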
6 Related Work
EM solutions have tackled the blocking problem [blocking1, blocking2, blocking3, Papadakis:2019:BlockingSurvey, blocking4] and the matching problem with rules [dalvi2013optimal, elmagarmid2014nadeef, singh2017synthesizing, wang2011entity], crowdsourcing [gokhale2014corleone, karger2011human, wang2012crowder], or machine learning [sarawagi2002interactive, cohen2002learning, bilenko2003adaptive, gokhale2014corleone, Konda:2016:Magellan].
Recently, EM solutions used deep learning and achieved promising results [Ebraheem:2018:DeepER, Fu:2019:End2End, Kasai:2019:LowResourceER, Mudgal:2018:DeepMatcher, Zhao:2019:AutoEM]. DeepER [Ebraheem:2018:DeepER] trains EM models based on the LSTM [lstm] neural network architecture with word embeddings such as word2vec [word2vec] or GloVe [glove]. DeepER also proposed a blocking technique to represent each entry by the LSTM’s output. Our advanced blocking technique based on Sentence-BERT [reimers2019sentence], described in Section 5, is inspired by this. Auto-EM [Zhao:2019:AutoEM] improves deep learning-based EM models by pre-training the EM model on an auxiliary task of entity type detection. Ditto also leverages transfer learning by fine-tuning pre-trained LMs, which are more powerful models in language understanding. We did not compare Ditto with Auto-EM in experiments because the entity types required by Auto-EM are not available in our benchmarks. However, we expect that pre-training Ditto with EM-specific data/tasks can improve the performance of Ditto further and is part of our future work. DeepMatcher introduced a design space for applying deep learning methods to EM. Following their template architecture, one can think of Ditto as replacing both the attribute embedding and similarity representation components in the architecture with a single pre-trained LM such as BERT, thus providing a much simpler overall architecture.
All four systems (Auto-EM, DeepER, DeepMatcher, and Ditto) formulate matching as a binary classification problem. The first three take a pair of data entries of the same arity as input and align the attributes before passing them to the system for matching. Ditto, on the other hand, serializes both data entries as one input with structural tags intact. This way, data entries of different schemas can be uniformly ingested, including hierarchically formatted data such as JSON. Our serialization scheme is applicable not only to Ditto but also to other systems such as Auto-EM, DeepMatcher, and DeepER. In fact, we serialized data entries to DeepMatcher under one attribute using our scheme and observed that DeepMatcher improved by as much as 1.94% on some datasets.
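A minimal sketch of such a serialization is shown below. It assumes `[COL]` and `[VAL]` as the structural tags and `[CLS]`/`[SEP]` as the sequence-pair delimiters; the dotted attribute path used for nested (JSON-like) entries and the function names are illustrative assumptions rather than a description of the actual implementation.

```python
def flatten(entry, prefix=""):
    """Yield (attribute_path, value) pairs, descending into nested dicts."""
    for key, value in entry.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Hypothetical convention: nested attributes get a dotted path.
            yield from flatten(value, path)
        else:
            yield path, "" if value is None else str(value)


def serialize(entry):
    """Serialize one entry as '[COL] attr [VAL] value ...' with tags intact."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in flatten(entry))


def serialize_pair(left, right):
    """Combine two entries into a single sequence-pair classification input."""
    return f"[CLS] {serialize(left)} [SEP] {serialize(right)} [SEP]"


if __name__ == "__main__":
    a = {"title": "instant immersion spanish", "price": 36.11}
    b = {"name": "instant immers spanish", "offer": {"price": 17.1}}
    print(serialize_pair(a, b))
```

Because each entry carries its own attribute names, the two entries in the example do not need to share a schema or be aligned beforehand.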
External knowledge is known to be effective in improving neural network models in NLP tasks [chen2017neural, sun2019ernie]. Instead of directly modifying the network architecture [Wang:2019:KGAT, Yang:2017:KBLSTM] or the loss function [zhang-etal-2019-ernie] to incorporate domain knowledge, Ditto modularizes the way domain knowledge is incorporated by allowing users to specify and customize rules for preprocessing input entries. Data augmentation has been extensively studied in computer vision and has recently received more attention in NLP [miao2020snippext, wei2019eda, xie2019unsupervised]. We designed a set of data augmentation operators suitable for EM and apply them with MixDA [miao2020snippext], a recently proposed DA strategy based on convex interpolation. To the best of our knowledge, this is the first time data augmentation has been applied to EM.
We present Ditto, the first EM system based on fine-tuned pre-trained Transformer-based language models. Ditto uses a simple architecture to leverage pre-trained LMs and is further optimized by injecting domain knowledge, text summarization, and data augmentation. Our results show that it outperforms existing EM solutions on all three benchmark datasets with significantly less training data. Ditto’s good performance can be attributed to the improved language understanding capability mainly through pre-trained LMs, the more accurate text alignment guided by the injected knowledge, and the data invariance properties learned from the augmented data. We plan to further explore our design choices for injecting domain knowledge, text summarization, and data augmentation. In addition, we plan to extend Ditto to other data integration tasks beyond EM, such as entity type detection and schema matching, with the ultimate goal of building a BERT-like model for tables.