|ID||Model||Resolution (A) MP (B)||Sensor|
|A.1 B.1||Canon 4000D||18.0Mp (A) 17 (B)||CMOS|
Data integration is the fundamental problem of providing a unified view over different data sources. The sheer number of ways humans represent and misrepresent information makes data integration a challenging problem. Data integration is typically thought of as a pipeline of multiple tasks rather than an end-to-end problem. Different works identify different tasks as constituting the data integration pipeline, but some tasks are widely recognized as its core, namely Schema Matching and Entity Resolution. An illustrative example is provided in Table 1.
) fueled by recent advances in the field of natural language processing and by the unprecedented availability of real-world benchmarks for training and testing (see, for instance, the meta-repository in [7]). The same trend, i.e., benchmarks influencing the design of innovative solutions, was previously observed in other areas. For instance, the Transaction Processing Performance Council proposed the family of TPC benchmarks for OLTP and OLAP, and NIST promoted several initiatives such as TREC and MNIST.
Unfortunately, in the data integration landscape, only separate task-specific benchmarks are available. As a result, the techniques influenced by such benchmarks can be difficult to integrate into a more complex data integration pipeline.
Our approach. Since it is difficult to get a benchmark for the entire pipeline by just composing benchmarks for different tasks, we provide a benchmark that can support by design the development of end-to-end data integration methods. Our intuition is to take the best of both real-world benchmarks and synthetic data generators, by providing a real-world benchmark that can be used flexibly to generate a variety of data integration tasks and instances. As for the tasks, we start from the core of the data integration pipeline and focus on popular variants of Schema Matching and Entity Resolution tasks. As for the instances, we include data sources with different data distributions and complexity, yielding different configurations of the dataset (e.g., by selecting sources with longer textual descriptions or larger size). We refer to our benchmark as Alaska.
The Alaska benchmark. The Alaska dataset (https://github.com/merialdo/research.alaska) consists of collections of almost 70k product specifications extracted from 71 different web sources. Currently, three domains, or verticals, are available: camera, monitor and notebook. Figure 1(a) shows a sample record from the www.buy.net data source of the camera domain. Each record is extracted from a different web page and refers to a single product. Records contain attributes with the product properties, such as the camera resolution, and are represented in a flat JSON format. The default <page title> attribute shows the title of the original web page and typically contains some of the product properties in an unstructured format. The entire dataset comprises 15k different attributes. Data collection is a three-step process: (i) web sources in our benchmark are discovered and crawled with the Dexter method in [26], (ii) product specifications are extracted from web pages with an ad-hoc method (see Section 4), (iii) ground truth is manually annotated by a small crowd of domain experts, prioritizing annotations with the notion of benefit introduced in [15]. (More details about the data collection process are given in Section 5.) A preliminary version of our Alaska dataset has been recently used for the SIGMOD Programming Contest (www.inf.uniroma3.it/db/sigmod2020contest) and for other related contests (di2kg.inf.uniroma3.it).
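As a minimal sketch of how a record in the flat JSON format described above might be read, the snippet below separates the <page title> attribute from the remaining attribute-value pairs. The record content, the helper name `split_record` and the constant `TITLE_KEY` are hypothetical and only serve as an illustration; they are not taken from the dataset.

```python
import json

# Hypothetical record in the flat JSON format; names and values are invented.
RAW_RECORD = """
{
  "<page title>": "Cannon EOS 1100D - Buy",
  "brand": "Canon",
  "resolution": "32 mp",
  "battery": "NP-400"
}
"""

TITLE_KEY = "<page title>"


def split_record(text):
    """Parse one flat JSON record and return (page title, other attributes)."""
    record = json.loads(text)
    title = record.get(TITLE_KEY, "")
    # Every key other than the page title is treated as a product attribute.
    attributes = {k: v for k, v in record.items() if k != TITLE_KEY}
    return title, attributes


title, attributes = split_record(RAW_RECORD)
print(title)
print(sorted(attributes))
```

Keeping the page title apart from the other attributes reflects its different nature: it is unstructured text, while the remaining pairs are extracted specifications.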
The main properties of the Alaska dataset and their implications are summarized below.
Multi-task. Alaska supports a variety of data integration tasks. The version described in this work focuses on Schema Matching and Entity Resolution, but our benchmark could be easily extended to support other tasks (e.g., Data Extraction). For this reason, Alaska is suitable for assessing complex data integration pipelines, solving more than one individual task.
Heterogeneous. Data sources included in Alaska cover a large spectrum of data characteristics, from clean and small sources to large and dirty ones. Dirty sources may present a variety of records, each providing a different set of attributes, sometimes with different representations, while clean sources tend to have records with similar attributes and value formats. To this end, Alaska comes with a set of profiling metrics that can be used to select subsets of the data, yielding different use-cases with tunable properties.
Manually curated. Alaska comes with a large ground truth manually curated by domain experts, both in terms of the number of sources included in the ground truth and the number of records and attributes. For this reason, Alaska is suitable for evaluating methods with high accuracy, without overlooking tail knowledge.
2 Related Works
We now review recent benchmarks for data integration tasks - namely Schema Matching (SM) and Entity Resolution (ER) - either comprising real-world data or providing utilities for generating synthetic ones.
Real-world data for SM. In the context of SM, the closest works to ours are XBenchMatch and T2D. XBenchMatch [12, 13] is a popular benchmark for XML data, comprising data from a variety of domains, such as finance, biology, business and travel. Each domain includes two or a few more schemata. T2D [28] is a more recent benchmark for SM over HTML tables, including hundreds of thousands of tables from the Web Data Commons archive, with the DBPedia source serving as the target schema. Both XBenchMatch and T2D focus on dense schemata where most of the attributes have non-null values. In contrast, we provide both dense and sparse sources, allowing each of the sources in our benchmark to be selected as the target schema, yielding a variety of SM problems with varying difficulty. Smaller benchmarks are described in [16, 1, 9]: in [16, 1] we have pairs of matching schemata like in XBenchMatch, while in [9] we have several schemata matching to one like in T2D. Finally, we mention the Ontology Alignment Evaluation Initiative (OAEI) [2] and SemTab (Tabular Data to Knowledge Graph Matching) [18] for the sake of completeness. Such datasets are related to the problem of matching ontologies (OAEI) and HTML tables (SemTab) to relations of a Knowledge Graph and can be potentially used for building more complex datasets for the SM task.
Real-world data for ER. There is a wide variety of real-world datasets for the ER task in the literature (see the meta-repository in [7]). We review some of the most used ones in the following. The Cora Citation Matching Dataset [23] is a popular collection of bibliography data, consisting of the text of almost 2K citations. The Leipzig DB Group Datasets [21, 20] provide a collection of bibliography and product data, obtained by querying respectively publication titles and product names over different websites. The Magellan Data Repository [7] includes a collection of 24 datasets created by students in the CS 784 data science class at UW-Madison. Each dataset represents a domain (e.g., movies and restaurants) and consists of two tables extracted from websites of that domain (e.g., Rotten Tomatoes and IMDB). The rise of deep neural network based ER methods [5, 14, 22, 24], and their need for large amounts of training data, has recently motivated the development of the Web Data Commons collection [25], consisting of more than 26M product offers originating from 79K websites. Analogously to SM, each of the above benchmarks comes with its own variant of the ER task, focusing for instance on the problem of finding matches between two sources only (Magellan Data Repository) or matching data based on title or product identifiers. We consider instead sources with different data distributions and schemata, yielding a variety of ER problems, some easier and others more difficult. We also note that, except for the smaller Cora Dataset, the ground truth of all the mentioned datasets is built automatically (e.g., using a key attribute such as the UPC code) with only a limited amount of manually verified data (2K matching pairs per dataset). In contrast, the entire ground truth available in our dataset has been manually verified.
Data generators. The main limitation of real-world data is that they have fixed size and characteristics. In contrast, data generators can be used flexibly and in a more controlled setting. MatchBench [16], for instance, can generate various SM tasks by injecting different types of synthetic noise on real schemata [19]. STBenchmark [1] provides tools for generating synthetic schemata and instances with complex correspondences. The Febrl system [6] includes a data generator for the ER task that can create fictitious hospital patient data with a variety of cluster size distributions and error probabilities. More recently, EMBench++ [17] provides a system to generate data for benchmarking ER techniques, consisting of a collection of matching scenarios, capturing basic real-world situations (e.g., syntactic variations and structural differences) as well as more advanced situations (i.e., evolving information). Finally, the iBench [3] metadata generator can be used to evaluate a wide range of integration tasks, including SM and ER, allowing control over the size and characteristics of the data, such as schemata, constraints, and mappings. In our benchmark, we aim at getting the best of both worlds: considering real-world data and providing more flexibility in terms of task definition and choice of data characteristics.
3 Benchmark Tasks
Consider a set of data sources providing information about a set of entities, where each entity is representative of a product (e.g., Canon EOS 1100D). Each source consists of a set of records. Each record refers to an entity, and the same entity can be referred to by different records. Each record is represented as a title and a set of attribute-value pairs. Each attribute refers to one or more properties, and the same property can be referred to by different attributes. We describe below a toy example that will be used in the rest of this section.
Example 1 (Toy example)
Consider the sample records in Figure 1. Records and refer to entity “Canon EOS 1100D” while refers to “Sony A7”. Attribute resolution of and manufacturer refer to the same underlying property. Attribute battery of refer to the same properties than battery_model and battery_chemistry of .
We identify each attribute with the combination of the source name and the attribute name – such as, .battery. We refer (i) to the set of attributes specified for a given record as and (ii) to the set of attributes specified for a given source , that is, the source schema, as . Note that in each record we only specify the non-null attributes.
In our toy example, .brand, .resolution, .battery, .digital_screen, .size, while is the union of and , i.e., .manufacturer, .resolution, .battery_model, .battery_chemistry, .size .rating.
Finally, we refer (i) to the set of text tokens appearing in a given record as and (ii) to the set of text tokens appearing in a given source, that is, the source vocabulary, as , as .
In our toy example, Cannon, EOS, 1100D, Buy, Canon, 32, mp, NP-400, , while is the union of and .
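The data model above can be sketched in a few lines, assuming each record is a flat dictionary with a `<page title>` key. The source names `S1` and `S2`, the record contents, the tokenization rule and all function names are assumptions made for illustration only.

```python
import re

TITLE_KEY = "<page title>"

# Hypothetical toy sources: source name -> list of flat records.
SOURCES = {
    "S1": [
        {TITLE_KEY: "Cannon EOS 1100D - Buy", "brand": "Canon",
         "resolution": "32 mp", "battery": "NP-400"},
    ],
    "S2": [
        {TITLE_KEY: "Canon EOS 1100D", "manufacturer": "Canon",
         "battery_model": "NP-400"},
        {TITLE_KEY: "Sony A7", "manufacturer": "Sony", "rating": "4 stars"},
    ],
}


def record_attributes(source_name, record):
    """Non-null attributes of a record, qualified by the source name."""
    return {f"{source_name}.{name}" for name in record if name != TITLE_KEY}


def source_schema(source_name, records):
    """Source schema: union of the attribute sets of its records."""
    schema = set()
    for record in records:
        schema |= record_attributes(source_name, record)
    return schema


def record_tokens(record):
    """Text tokens of a record (title and values); hyphenated codes stay whole."""
    tokens = set()
    for value in record.values():
        tokens.update(re.findall(r"\w+(?:-\w+)*", str(value)))
    return tokens


def source_vocabulary(records):
    """Source vocabulary: union of the token sets of its records."""
    vocabulary = set()
    for record in records:
        vocabulary |= record_tokens(record)
    return vocabulary


print(sorted(source_schema("S2", SOURCES["S2"])))
print(sorted(record_tokens(SOURCES["S1"][0])))
```

Qualifying attribute names with the source name keeps homonyms from different sources apart, which matters for the Schema Matching task.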
Based on the presented data model, we now define the tasks and task variants considered in this work to demonstrate the flexibility of our benchmark. The main tasks are Schema Matching (SM) and Entity Resolution (ER). For SM we consider two variants, namely catalog-based and mediated. For ER we consider three variants, namely self-join, similarity join and a schema-agnostic variant. Finally, we discuss the evaluation metrics used in our experiments.
3.1 Schema Matching
Given a set of sources, consider the set of all their attributes and a set of properties of interest. The set of properties is referred to as the target schema and can be either defined manually or set equal to the schema of a given source. Schema Matching (SM) is the problem of finding correspondences between elements of two schemata, such as the source attributes and the target schema [4, 27]. A formal definition is given below.
Definition 1 (Schema Matching)
Let be a set s.t. iff the attribute refers to the property , SM can be defined as the problem of finding pairs in .
Let and with , , , , , , , , and . Let , with “battery model”, “battery chemistry” and “brand”. Then , , , , , .
We consider two popular variants for the SM problem.
Catalog SM. Given a set of sources and a catalog source s.t. , find the correspondences in .
Mediated SM. Given a set of sources and a mediated schema , defined manually with a selection of different real-world properties, find the correspondences in .
In both variants above, the input can consist of one or more sources. State-of-the-art algorithms typically consider one or two sources, and the input does not need to be the entire set of sources available in our benchmark.
Challenges. The main challenges provided by our dataset for the SM problem are detailed below.
Synonyms. Our dataset contains attributes with different names but referring to the same property, such as .brand and .manufacturer;
Homonyms. Our dataset contains attributes with the same names but referring to different properties, such as .size (the size of the digital screen) and .size (dimension of the camera body);
Granularity. In addition to one-to-one correspondences (e.g., .brand with .manufacturer), our dataset contains one-to-many and many-to-many correspondences, due to different attribute granularities. .battery, for instance, corresponds to both .battery_model and .battery_chemistry.
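As a minimal sketch of the granularity challenge, SM correspondences can be represented as a set of (attribute, property) pairs and grouped by attribute to spot those that map to more than one property. The source names `S1` and `S2`, the listed pairs and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical correspondences between source attributes and target properties.
MATCHES = {
    ("S1.brand", "brand"),
    ("S2.manufacturer", "brand"),
    ("S1.battery", "battery model"),
    ("S1.battery", "battery chemistry"),
    ("S2.battery_model", "battery model"),
    ("S2.battery_chemistry", "battery chemistry"),
}


def properties_by_attribute(pairs):
    """Group the target properties matched by each attribute."""
    grouped = defaultdict(set)
    for attribute, prop in pairs:
        grouped[attribute].add(prop)
    return dict(grouped)


def coarse_attributes(pairs):
    """Attributes that correspond to more than one target property."""
    grouped = properties_by_attribute(pairs)
    return {attribute for attribute, props in grouped.items() if len(props) > 1}


print(coarse_attributes(MATCHES))
```

An algorithm restricted to one-to-one correspondences could return at most one of the two pairs involving the coarse attribute.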
3.2 Entity Resolution
Given a set of sources, consider the set of all records in those sources. Entity Resolution (ER) is the problem of finding records referring to the same entity, or duplicates. A formal definition is given below.
Definition 2 (Entity Resolution)
Let be a set s.t. iff and refer to the same entity, ER can be defined as the problem of finding pairs in .
We note that is transitively closed, i.e., if and , then , and each connected component is a clique representing a distinct entity. We call each clique a cluster of .
Let and . Then and there are two clusters, namely representing a “Canon EOS 1100D” and representing a “Sony A7”.
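The transitive closure property can be illustrated with a small union-find sketch that turns a set of duplicate pairs into clusters and then expands each cluster back into all of its pairs. The record identifiers and function names below are hypothetical.

```python
from itertools import combinations


def clusters(records, duplicate_pairs):
    """Group records into clusters using union-find over the duplicate pairs."""
    parent = {record: record for record in records}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for record in records:
        groups.setdefault(find(record), set()).add(record)
    return sorted(groups.values(), key=lambda group: sorted(group))


def closure(cluster_list):
    """All unordered pairs implied by the clusters (each cluster is a clique)."""
    return {frozenset(pair)
            for cluster in cluster_list
            for pair in combinations(sorted(cluster), 2)}


found = clusters(["r1", "r2", "r3", "r4"], [("r1", "r2"), ("r2", "r3")])
print(found)
print(len(closure(found)))
```

In this example the pair (r1, r3) is never given explicitly but follows by transitivity, while r4 forms a cluster of size one.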
We consider three variants for the ER problem.
Similarity Join. Given two sources find the duplicates in .
Self Join. Given a set of sources find the duplicates in the entire .
Schema-agnostic ER. In both variants above, we assume that a solution for the SM problem is available, that is, the attributes of every input record for ER are previously aligned to a manually specified mediated schema. This variant is the same as self-join ER but has no SM information available.
Challenges. The main challenges provided by our dataset for the ER problem are detailed below.
Format. Different sources may use different formats and naming conventions. For instance, two records can have conceptually the same resolution value but use different formats. In addition, records can contain textual descriptions, such as the battery description, and data sources of different countries can use different conventions (e.g., inches/cm and “EOS”/“Rebel”).
Noise. Records can contain noisy or erroneous values, due to entity misrepresentation in the original web page or in the data extraction process. An example of noise in the web page is “Cannon” in place of “Canon” in a page title. An example of noise due to data extraction is the rating attribute, as it does not strictly represent a product property.
As a result, records referring to different entities can have more similar representations than records referring to the same entity. Finally, the cluster size distribution over the entire set of sources is skewed, meaning that some entities are over-represented and others are under-represented.
3.3 Performance measures
We use standard performance measures of precision, recall and F1-measure. More specifically, let and . For each of the tasks defined in this section, we consider manually verified subsets of , , and that we refer to as ground truth, and then we define precision, recall and F1-measure with respect to the manually verified sets.
Let such ground truth subsets be denoted as , , and . It is important to notice that in our benchmark they all satisfy closed-world assumption. That is, yields a complete bipartite sub-graph of and yields a complete sub-graph of . More details on the ground truth can be found in Section 4.
Finally, in case the SM or ER algorithm at hand has prior restrictions, such as an SM algorithm restricted to one-to-one correspondences or a similarity-join ER algorithm restricted to inter-source pairs, we only consider the portion of the ground truth satisfying those restrictions.
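A minimal sketch of the evaluation for ER is given below, with pairs represented as frozensets of record identifiers. Ignoring predicted pairs that involve records outside the manually verified set is our assumption about how the comparison with the ground truth might be carried out; the identifiers and function names are hypothetical.

```python
def restrict_to_labeled(predicted, labeled_records):
    """Keep only predicted pairs whose records are both manually verified."""
    return {pair for pair in predicted if pair <= labeled_records}


def precision_recall_f1(predicted, truth):
    """Standard precision, recall and F1 over two sets of pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth over three verified records.
LABELED = {"r1", "r2", "r3"}
TRUTH = {frozenset({"r1", "r2"})}

# Hypothetical output of an ER algorithm; r4 and r5 are not verified.
PREDICTED = {
    frozenset({"r1", "r2"}),
    frozenset({"r1", "r3"}),
    frozenset({"r4", "r5"}),
}

scored = restrict_to_labeled(PREDICTED, LABELED)
precision, recall, f1 = precision_recall_f1(scored, TRUTH)
print(precision, recall, round(f1, 3))
```

The same functions apply to SM by replacing record pairs with (attribute, property) pairs and adapting the restriction step accordingly.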
Alaska contains data from three different e-commerce domains, dubbed camera, monitor and notebook. We refer to such domains as verticals. Table 2 includes, for each vertical, the number of sources, the number of records, the number of records in the largest source, the total number of attributes, the average number of record attributes, and the average number of record tokens (i.e., the record length).
In Table 2 we also show the number of target attributes in our manually identified ground truth, the number of entities, and the size of the largest entity.
Size statistics. Figures 2(a) and 2(b) show respectively the distribution of Alaska sources with respect to the number of records and the average number of attributes. As for the number of records (Figure 2(a)), the camera and notebook verticals have few big sources and a long tail of small sources, while the monitor vertical has more sources of medium size. As for the number of attributes, camera and notebook follow the same distribution, with both sources having fewer attributes and sources having more attributes, while monitor has notably more sources with a higher average number of attributes.
The plots in Figure 3 show respectively the distribution of target attributes and entities in our manually curated ground truth, with respect to the number of sources in which they appear. We note from Figure 3(a) that a significant fraction of attributes in our ground truth are present in most sources, but there are tail attributes present in only a few sources as well. Regarding entities, Figure 3(b) shows that most entities in notebook span a small fraction of sources, while in monitor and camera there are popular entities that are present in a larger fraction of sources. For the sake of completeness, we also show in Figure 4 the cluster size distribution of entities. We observe that the three Alaska verticals are significantly different: camera has the largest entities, monitor has both small and medium-size entities, and notebook has the most skewed cluster size distribution, with many small entities and few large ones.
4.1 Source hardness
We define three hardness metrics, dubbed Attribute Sparsity, Vocabulary Overlap and Vocabulary Size, that can be used to select subsets of sources for each vertical and generate instances for SM and ER with different difficulty levels, from easy to challenging.
AS. Attribute Sparsity is defined as follows.
AS is a measure of how dirty a source is in terms of attribute specification. In a dense source (low AS), most records have the same set of attributes. Conversely, in a sparse source (high AS), many attributes are non-null only for a few records. Denser sources represent easier instances for both SM and ER, being cleaner and providing more non-null values for each attribute and more complete information for each record. Figure 5 shows the distribution of Alaska sources according to their AS value: notebook has the densest sources, while monitor has the largest proportion of sparse sources.
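As an illustration, AS can be sketched under an assumed definition that is consistent with the description above: one minus the fraction of filled attribute cells, i.e., 1 - (total number of non-null attributes over all records) / (schema size × number of records). This formula, the function name `attribute_sparsity` and the toy records are assumptions, not necessarily the exact definition used by the benchmark.

```python
TITLE_KEY = "<page title>"


def attribute_sparsity(records):
    """Assumed AS: 1 - filled cells / (schema size * number of records)."""
    schema = set()
    filled = 0
    for record in records:
        # Records only list their non-null attributes; the title is skipped.
        names = {name for name in record if name != TITLE_KEY}
        schema |= names
        filled += len(names)
    if not records or not schema:
        return 0.0
    return 1 - filled / (len(schema) * len(records))


# Hypothetical dense source: every record has the same attributes.
DENSE = [{"brand": "Canon", "size": "3 in"}, {"brand": "Sony", "size": "3 in"}]
# Hypothetical sparse source: each attribute appears in a single record.
SPARSE = [{"brand": "Canon"}, {"rating": "4 stars"}, {"size": "3 in"}]

print(attribute_sparsity(DENSE))
print(round(attribute_sparsity(SPARSE), 3))
```

Under this assumed definition, a source in which all records share the same attributes scores 0, and the score grows as attributes become specific to few records.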
VO. Vocabulary Overlap is defined as the Jaccard index of the two source vocabularies.
VO is a measure of how similar two sources are in terms of attribute values. Source sets with pair-wise high VO values typically have similar representations of entities and properties, and are easier instances for SM. Figure 6 shows the distribution of source pairs according to their VO value. In all three Alaska verticals the majority of source pairs have medium-small overlap. In camera there are notable cases with VO close to zero, providing extremely challenging instances. Finally, in notebook there are pairs of sources that use approximately the same vocabulary (VO close to one), which can be used as a frame of comparison when testing vocabulary-sensitive approaches.
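The Jaccard index of two source vocabularies can be computed as in the minimal sketch below. The two vocabularies are invented, and returning 0 for two empty vocabularies is a convention chosen here for convenience.

```python
def vocabulary_overlap(vocabulary_a, vocabulary_b):
    """Jaccard index: size of the intersection over size of the union."""
    union = vocabulary_a | vocabulary_b
    if not union:
        return 0.0  # convention chosen here for two empty vocabularies
    return len(vocabulary_a & vocabulary_b) / len(union)


# Hypothetical source vocabularies.
VOCABULARY_1 = {"canon", "eos", "mp"}
VOCABULARY_2 = {"canon", "mp", "sony", "a7"}

print(vocabulary_overlap(VOCABULARY_1, VOCABULARY_2))
```

Two identical vocabularies yield a value of one, while disjoint vocabularies yield zero.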
VS. Vocabulary Size is defined as follows.
VS is a measure of how verbose the attribute values of a source are. Sources with high VS values typically have longer textual descriptions, as opposed to sources with low VS, which have mostly categorical or numerical values. Less verbose sources represent easier instances for the schema-agnostic ER task. Figure 7 shows the distribution of Alaska sources according to their VS value. The camera vertical has the most verbose sources, while notebook has the largest proportion of sources with a smaller vocabulary. Therefore, camera sources are more suitable on average for testing NLP-based approaches for ER such as [24].
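A possible reading of VS, assumed here for illustration, is the number of distinct tokens appearing in the records of a source. The lowercase word tokenization, the function name `vocabulary_size` and the toy sources are assumptions rather than the benchmark's exact definition.

```python
import re


def vocabulary_size(records):
    """Assumed VS: number of distinct lowercase word tokens in a source."""
    vocabulary = set()
    for record in records:
        for value in record.values():
            vocabulary.update(t.lower() for t in re.findall(r"\w+", str(value)))
    return len(vocabulary)


# Hypothetical source with categorical values only.
CATEGORICAL = [{"color": "black"}, {"color": "black"}, {"color": "red"}]
# Hypothetical source with a longer textual description.
VERBOSE = [{"description": "Compact camera with a bright wide-angle lens"}]

print(vocabulary_size(CATEGORICAL))
print(vocabulary_size(VERBOSE))
```

Even a single verbose record can contribute more distinct tokens than many records with categorical values.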
5 Data collection
The first task of a dataset creation process is, typically, to find data sources with the desired kind of resources: in our case, websites with product pages.
This phase is known as Source Discovery.
For the goal of finding suitable data sources, the Dexter research project [26] (https://github.com/disheng/DEXTER) provided a great starting point.
The URLs related to the camera, monitor and notebook verticals were collected and the associated HTML pages were downloaded (some of the URLs were broken); finally, an extraction tool, called Carbonara Extractor, was run on the pages.
This tool extracts attribute-value pairs from the input page: the pairs are the specifications the tool managed to extract about the main product of the page. The output of the tool is a JSON object that, in addition to the successfully extracted key-value pairs, also contains a pair, called <page title>, whose value is the HTML page title.
Carbonara Extractor works as follows:
(i) it uses an ANN to classify HTML tables and lists (1 if the current table or list contains relevant information about products, 0 otherwise); the ANN was trained using DOM information as features (e.g., number of URLs, number of bold tags, etc.) plus a feature for vertical knowledge based on the presence of a set of hot words. The training set was manually created;
(ii) it applies extraction rules to the tables and lists classified as relevant, and produces the attribute-value pairs as output.
Then, the JSON extraction files of each vertical were filtered with what we call 3-3-100 filtering, which reduces noise, i.e., fake attributes produced by the extraction process, and also favors larger data sources. The 3-3-100 filtering works as follows:
(i) key-value pairs with an attribute name that is not present in at least 3 pages of the same source were removed from the extraction files they belong to (not considering the <page title> attribute);
(ii) after that, extraction files with fewer than 3 key-value pairs were discarded;
(iii) finally, only data sources with at least 100 extraction files made it through the filtering.
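The three filtering steps above can be sketched as follows for a single source. The thresholds 3, 3 and 100 are read off the name of the filter, and excluding <page title> from the key-value pair count is an assumption; the function name `filter_source` and its parameters are hypothetical.

```python
from collections import Counter

TITLE_KEY = "<page title>"


def filter_source(pages, min_pages=3, min_pairs=3, min_files=100):
    """Apply the three steps to one source; return [] if the source is dropped.

    `pages` is a list of flat dicts, one per extraction file.
    """
    # Step 1: count in how many pages of the source each attribute appears.
    counts = Counter(name for page in pages for name in page if name != TITLE_KEY)
    kept = []
    for page in pages:
        cleaned = {name: value for name, value in page.items()
                   if name == TITLE_KEY or counts[name] >= min_pages}
        # Step 2: discard files with too few pairs (title not counted, assumed).
        n_pairs = sum(1 for name in cleaned if name != TITLE_KEY)
        if n_pairs >= min_pairs:
            kept.append(cleaned)
    # Step 3: keep the source only if enough extraction files survive.
    return kept if len(kept) >= min_files else []
```

For example, a source with 100 pages sharing three attributes is retained, an attribute occurring on a single page is removed from that page, and a source with only 10 pages is dropped entirely.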
Finally, we labeled sources with several versions for different countries as clones, and only retained the largest source (e.g., www.ebay.com versus www.ebay.ca, www.ebay.co.uk and so on). Furthermore, we did not consider sources that were clear copiers, or sources that are basically search engines for products on sale by other sources (e.g., www.shopping.com): these choices were made on the basis of domain knowledge and a manual checking process.
Table 2 shows information (number of sources, specifications and attributes) about the data collected for each vertical.
We proposed a flexible benchmark for testing several variations of schema matching and entity resolution systems.
For this purpose, we made available a real-world dataset, composed of sources with different characteristics extracted from the Web, providing specifications about three categories of products, with a manually curated ground truth for both schema matching and entity resolution tasks.
We presented a profiling of the dataset under several dimensions, to show the variety of our dataset, but also to allow users to choose subsets of sources with different difficulties that best reflect the target setting of their systems.
We finally showed possible usages of our dataset, by testing existing state-of-the-art systems on different configurations of our dataset.
We plan to extend the Alaska benchmark in several directions:
(i) include support for new tasks (e.g., Data Fusion and Data Extraction);
(ii) add different domains, including non-product ones (e.g., biological data);
(iii) automate part of the ground truth construction pipeline, in order to speed up the process and label more data.
References
[1] (2008) STBenchmark: towards a benchmark for mapping systems. Proceedings of the VLDB Endowment 1 (1), pp. 230–244.
[2] (2019) Results of the ontology alignment evaluation initiative 2019. In CEUR Workshop Proceedings, Vol. 2536, pp. 46–85.
[3] (2015) The iBench integration metadata generator. Proceedings of the VLDB Endowment 9 (3), pp. 108–119.
[4] (2011) Generic schema matching, ten years later. Proceedings of the VLDB Endowment 4 (11), pp. 695–701.
[5] (2020) Entity matching with transformer architectures - a step forward in data integration. In International Conference on Extending Database Technology, Copenhagen, 30 March-2 April 2020.
[6] Febrl - an open source data cleaning, deduplication and record linkage system with a graphical user interface. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1065–1068.
[7] The Magellan data repository. University of Wisconsin-Madison. https://sites.google.com/site/anhaidgroup/useful-stuff/data
[8] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[9] (2002) Comparison of schema matching evaluations. In Net.ObjectDays: International Conference on Object-Oriented and Internet-Based Technologies, Concepts, and Applications for a Networked World, pp. 221–237.
[10] (2012) Principles of data integration. Elsevier.
[11] (2015) Big data integration. Synthesis Lectures on Data Management 7 (1), pp. 1–198.
[12] (2007) XBenchMatch: a benchmark for XML schema matching tools. In The VLDB Journal, Vol. 1, pp. 1318–1321.
[13] (2014) Designing a benchmark for the assessment of schema matching tools. Open Journal of Databases 1 (1), pp. 3–25.
[14] (2017) DeepER - deep entity resolution. arXiv preprint arXiv:1710.00597.
[15] (2016) Online entity resolution using an oracle. Proceedings of the VLDB Endowment 9 (5), pp. 384–395.
[16] (2013) MatchBench: benchmarking schema matching algorithms for schematic correspondences. In British National Conference on Databases, pp. 92–106.
[17] (2014) EMBench: generating entity-related benchmark data. ISWC, pp. 113–116.
[18] (2020) SemTab 2019: resources to benchmark tabular data to knowledge graph matching systems. In European Semantic Web Conference, pp. 514–530.
[19] (1991) Classifying schematic and data heterogeneity in multidatabase systems. Computer 24 (12), pp. 12–18.
[20] The Leipzig DB group datasets. https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution
[21] (2010) Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment 3 (1-2), pp. 484–493.
[22] (2020) Deep entity matching with pre-trained language models. arXiv preprint arXiv:2004.00584.
[23] (2004) http://www.cs.umass.edu/~mcallum/data/cora-refs.tar.gz
[24] (2018) Deep learning for entity matching: a design space exploration. In Proceedings of the 2018 International Conference on Management of Data, pp. 19–34.
[25] (2019) The WDC training dataset and gold standard for large-scale product matching. In Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pp. 381–386.
[26] (2015) Dexter: large-scale discovery and extraction of product specifications on the web. Proceedings of the VLDB Endowment 8 (13), pp. 2194–2205.
[27] (2001) A survey of approaches to automatic schema matching. The VLDB Journal 10 (4), pp. 334–350.
[28] (2015) Matching HTML tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, pp. 1–6.