Multi-Source Spatial Entity Linkage
Besides traditional cartographic data sources, spatial information can also be derived from location-based sources. However, even though different location-based sources refer to the same physical world, each one covers the spatial entities only partially, describes them with different attributes, and sometimes provides contradictory information. Hence, we introduce the spatial entity linkage problem, which consists of finding the pairs of spatial entities that refer to the same physical spatial entity. Our proposed solution (QuadSky) starts with a spatial blocking technique (QuadFlex) that creates blocks of nearby spatial entities with the time complexity of the quadtree algorithm. After pairwise comparison of the spatial entities within each block, the SkyRank algorithm ranks the compared pairs using Pareto optimality. We then introduce the SkyEx-* family of algorithms, which classify the pairs with 0.85 precision and 0.85 recall on a manually labeled dataset of 1,500 pairs and 0.87 precision and 0.6 recall on a semi-manually labeled dataset of 777,452 pairs. Moreover, our fully unsupervised algorithm SkyEx-D approximates the optimal result with an F-measure loss of just 0.01. Finally, QuadSky provides the best trade-off between precision and recall and the best F-measure compared to the existing baselines.
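To make the idea of ranking compared pairs by Pareto optimality more concrete, the sketch below assigns each pair to a Pareto layer (often called a skyline): layer 1 holds the pairs that no other pair dominates, layer 2 holds those that become non-dominated once layer 1 is removed, and so on. This is a generic illustration, not the SkyRank algorithm itself. The function names `dominates` and `skyline_layers`, the three similarity attributes, and the sample scores are all assumptions made for the example.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is at least as good as `b` on every attribute
    and strictly better on at least one (higher similarity is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_layers(points: List[Sequence[float]]) -> List[int]:
    """Assign each point a Pareto layer index, starting at 1.

    Layer 1 contains the points not dominated by any other point; each
    later layer is the non-dominated set of what remains.
    """
    remaining = set(range(len(points)))
    layer_of = [0] * len(points)
    layer = 1
    while remaining:
        # A point belongs to the current front if nothing left dominates it.
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            layer_of[i] = layer
        remaining -= front
        layer += 1
    return layer_of


# Hypothetical similarity vectors for five candidate pairs:
# (name similarity, address similarity, spatial proximity), each in [0, 1].
pairs = [
    (0.9, 0.8, 0.95),
    (0.9, 0.7, 0.90),
    (0.4, 0.9, 0.50),
    (0.3, 0.2, 0.10),
    (0.4, 0.2, 0.50),
]

print(skyline_layers(pairs))  # prints [1, 2, 1, 4, 3]
```

Pairs in the lowest-numbered layers are the strongest match candidates. A classifier in the spirit described above would then only need to choose the layer at which to cut between matches and non-matches. This sketch runs in quadratic time per layer, which is tolerable only because blocking is assumed to have already reduced the number of compared pairs.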