Spread2RML: Constructing Knowledge Graphs by Predicting RML Mappings on Messy Spreadsheets

10/25/2021
by   Markus Schröder, et al.
DFKI GmbH
0

The RDF Mapping Language (RML) allows to map semi-structured data to RDF knowledge graphs. Besides CSV, JSON and XML, this also includes the mapping of spreadsheet tables. Since spreadsheets have a complex data model and can become rather messy, their mapping creation tends to be very time consuming. In order to reduce such efforts, this paper presents Spread2RML which predicts RML mappings on messy spreadsheets. This is done with an extensible set of RML object map templates which are applied for each column based on heuristics. In our evaluation, three datasets are used ranging from very messy synthetic data to spreadsheets from data.gov which are less messy. We obtained first promising results especially with regard to our approach being fully automatic and dealing with rather messy data.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

11/18/2019

Using Mapping Languages for Building Legal Knowledge Graphs from XML Files

This paper presents our experience on building RDF knowledge graphs for ...
05/18/2022

Carbon Figures of Merit Knowledge Creation with a Hybrid Solution and Carbon Tables API

Nowadays there are algorithms, methods, and platforms that are being cre...
08/31/2020

FunMap: Efficient Execution of Functional Mappings for Knowledge Graph Creation

Data has exponentially grown in the last years, and knowledge graphs con...
11/28/2015

Column-Oriented Datalog Materialization for Large Knowledge Graphs (Extended Technical Report)

The evaluation of Datalog rules over large Knowledge Graphs (KGs) is ess...
12/14/2020

Fork or Fail: Cycle-Consistent Training with Many-to-One Mappings

Cycle-consistent training is widely used for jointly learning a forward ...
05/08/2018

Phoneme-to-viseme mappings: the good, the bad, and the ugly

Visemes are the visual equivalent of phonemes. Although not precisely de...
12/02/2021

Narrative Cartography with Knowledge Graphs

Narrative cartography is a discipline which studies the interwoven natur...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The use of mapping languages is a well-known method to construct knowledge graphs from data. Their declarative rules enable diverse transformations from data formats to RDF graphs. R2RML [19] is one such language that allows to map relational databases to RDF datasets. Its successor RML [5] extends its capabilities to also map semi-structured data like JSON, XML, CSV and recently Excel spreadsheet files [16]. Unlike databases, such data can be schema-free and is therefore usually less constrained. As a result, agents (users or systems) are able to create content more freely in a variety of ways. Without a data management strategy, content can easily become messy (some examples are later illustrated). Due to this unexpected modeling in semi-structured data, we observe that the definition of proper RML mappings tends to be time-consuming, since several decisions have to be made how data should be mapped: An RML mapping111https://rml.io/specs/rml/ usually consists of several triples maps each containing a subject map and typically many predicate-object maps. While triples maps iterate over entities found in logical sources, subject maps assign URIs and possible classes to them. To eventually form RDF statements (SPO-triples), predicate-object maps attach to subjects certain properties together with their objects. In particular, the containing predicate maps state what properties are used, whereas object maps define where objects can be found (e.g. reference or constant) and what kind of objects (term type) there are: resources or literals. If an object happens to be a literal, its corresponding datatype can also be stated. Once objects have to be processed during mapping, functions can be called which are typically defined with the Function Ontology222https://fno.io (FnO) [14].

Usually, mapping experts make aforementioned modeling decisions when defining how data should be mapped. To simplify mapping creation for them, human-friendly languages like YARRRML [12] and graphical user interfaces such as RMLEditor [10] can be used. However, there is still a considerable effort to inspect the data, recognize possible discrepancies and choose the right mapping options accordingly. Especially when working on rather large collections of unexamined files, a system could support users by producing initial mapping suggestions. For example, some aspects in RML mappings could already be recommended by existing approaches: Given an ontology, a subject map’s class could be recommended by applying named entity typing or a predicate map’s property could be predicted by using relation extraction. For object maps a system may predict their term types and in case of literals proper datatypes. In case data needs to be transformed, the right function could be suggested. While such recommendations are imaginable for usual semi-structured data (JSON, XML and CSV), we focus on spreadsheets in this paper, since they offer interesting challenges due to their potential messiness.

We understand the prediction of an RML object map as the selection of best fitting mappings based on heuristics. Therefore, our main research question is stated as follows: “How well can a heuristic-based approach predict mappings of messy spreadsheets given expected RDF graphs?”. To answer this question, we propose our approach Spread2RML which defines several mapping templates with heuristics in order to select the most probable ones for various columns. This method is evaluated on synthetically generated messy spreadsheets by our tool

[15], manually annotated spreadsheets found on the U.S. Government’s open data platform333https://www.data.gov/ and from industry.

The contributions of this paper are the following:

  • Spread2RML predicts RML mappings instead of performing a direct conversion from spreadsheets to RDF statements. Having explicit mapping definitions enable users to understand, adjust and correct them (e.g. using a GUI).

  • Several messy patterns we identified in prior work [15] are supported and thus such data is correctly mapped.

  • Since our method directly works on (Excel) spreadsheets, peculiarities such as partially formatted texts and data formats are considered.

  • In our evaluation method, our approach is compared against ground truth on RDF statement level.

This paper is structured as follows: in the next section (2), we explain in more detail how RML mappings are defined for spreadsheets. This is followed by a list of related work (Sec. 3) an explanation of our approach in more detail (4) and its evaluation (5). Finally, we give a conclusion and outlook in Section 6.

2 RML Mappings for Spreadsheets

Figure 1: Motivational examples what predicate-object maps we expect for various columns in a sheet. Column A: integer literal mapping, Column B: boolean literal mapping, Column C: date parsing, Column D: entity linking and Column E: red colored date extraction. Bold text shows where crucial decisions are made.

Spreadsheets (such as Excel or OpenOffice) are commonly used in industry and have a complex data structure that provides rich data modeling capabilities: each cell in a sheet can either store a floating-point number, a complex function expression or arbitrary text. In case of a numeric value, a separate data format determines how the number is interpreted, for example the number together with the data format “MM/DD/YYYY HH:MM AM/PM” is interpreted as the time “02/01/2021 08:21 AM”. For string values, any text can be stored in cells, even partially formatted (e.g. some words are bold or red colored). As a result, users are able to model and design sheets in various ways: they may fill cells with different data types or change their appearances in the form of formatted text, colors, styles and borders [15]. Without a data management strategy, we often observe that such sheets easily become “messy” which makes their mapping more challenging.

To be able to express RML mappings for spreadsheets, we utilize our previous work [16] and make some assumptions about the sheets’ structure and mappings. Regarding [3], we assume that tables in spreadsheets are already correctly localized, segmented and functional as well as structural analyzed. They are expected to have a 1-dimensional vertical layout without nested headers. To illustrate examples of the problem, Figure 1 presents expected predicate-object maps on a given sheet. Our expected sheets contain specific rows which represent entities (see the ones with white backgrounds), whereas columns (A-E) model their properties. One RML predicate-object map is defined for each column and is written in RDF Turtle syntax. Well-known types given by the XML Schema Definition444https://www.w3.org/TR/xmlschema11-2/#built-in-primitive-datatypes (XSD) are used in the mappings. The namespace rr refers to R2RML concepts. Crucial options in the mappings are highlighted in bold text. In the following, we will present a possible mapping for each column.

  • Since column A solely consists of cells with integer numbers, we map literals of type xsd:integer.

  • Although, column B stores actually numbers, we would expect literals of type xsd:boolean, since the data format “"Yes";;"No";” and numeric values encode boolean values ( for true displayed as “Yes” and for false displayed as “No”).

  • Column C lists dates, thus literals of type xsd:date are expected, but different formats and types (strings and numbers) are used. In order to still acquire the prescribed date format YYYY-MM-DD, a transformation function parseDate needs to be called.

  • In case of Column D, names of editors are listed, but instead of putting texts in string literals, we would like to recognize them as named entities and use their identifiers (typically IRIs) in the mapping. To achieve this, the function entityLinking

    is called on the text to perform the necessary named entity recognition.

  • Looking at Column E, again dates are written in the cells, however some formatting is applied. To only obtain the red colored dates, a function getEntitiesByColor is executed on the cells’ rich text. Additionally, it transforms the date to its prescribed format (like Column C). The italic formatted date can be extracted with another function analogously.

3 Related Work

Our research contributes to the discipline of table interpretation in the field of table understanding (for a recent survey see [3]

). Interpretation of tables is usually approached by applying schema matching and schema mapping techniques. However, this requires that schemata, often in form of ontologies, already exist in advance to be able to match with them. Public knowledge graphs such as DBpedia

[2], Wikidata [18] or YAGO [17] are usually exploited to interpret tables about common knowledge (e.g. HTML tables in the web). For instance, approaches utilize these knowledge bases to find named entities in the tables to recommend a column’s type or properties between columns. For primitive datatypes, pre-defined regular expressions (or in general finite state machines) are frequently utilized in this field [20, DBLP:conf/sofsem/BonfittoCTVM21, DBLP:journals/datamine/CeritliWG20]. In our case, spreadsheets typically contain very specific and personal information (e.g. private company data) that is often not covered by public datasets. That is why we cannot rely on external data or already annotated sheets (for training).

There are a lot of tools that allow to map tabular data to RDF. For a list of tools see the survey by Fiorelli and Stellato [8]. Some systems, such as Any23555https://any23.apache.org/, convert tabular data in a predefined and fixed manner. They often require the data to be in a certain shape (structure and content) to work properly, like in excel2rdf666https://pypi.org/project/excel2rdf. Another class of tools lets users decide how data should be mapped by providing a mapping language. For example, spread2rdf777https://github.com/marcelotto/spread2rdf uses a domain specific language (DSL) of the Ruby programming language to do that. Other tools are so called data wranglers which usually come with a GUI to let users clean and transform data until the desired output is formed. For example, this can be an add-in in Excel [9] to exploit existing spreadsheet workflows during mapping.

In this paper we do not intend to provide another mapping approach, language or GUI. Instead, we would like to automatically guess mappings that would have been defined by users. This concept is also approached by Ermilov et al. [7]. In their work, rather simple default mappings for CSV files are created that are later revised by users in a semantic wiki environment. Similarly, such a mapping generation is also performed for databases, for example in the MIRROR system [13]. It identifies relationship patterns of database tables and generates with algorithms suitable R2RML [19] mappings. Our approach also recognizes certain patterns in spreadsheet columns to suggest appropriate mappings.

Regarding the prediction of RML definitions, the most related paper is the work by Heyvaert et al. [11]. Given some examples in the form of expected RDF statements, their approach also suggests RML mappings by aligning examples with data sources and building mappings with an algorithm. However, this approach differs from ours in several ways: It needs initial inputs by users, does not consider possible FnO function calls and does not seem to be robust enough against discrepancies we expect in the data.

4 Approach

Our approach Spread2RML fully automatically suggests an RML mapping for a given sheet. It checks an extensible set of RML object map templates for each column and picks the most promising one based on heuristics. The templates are meaningful combinations of possible RML references, term types, datatypes and FnO function calls. During our evaluation (Section 5) we observed that a manageable number of such templates already map messy data in practice sufficiently. However, selecting the suitable ones in all possible cases is a difficult task. We use heuristics to assess how suitable the templates are and choose the one that received the highest score. Still, it may happen that some cells in the column are not covered by the selected mapping. The remaining cells are again recursively checked for suitable templates until no cells are left. This approach forms a tree of scored mapping templates for a single column.

We identified fifteen RML object map templates which are listed in Table 1. The table shows what cell reference, term type, datatype and function (see Section 4.1) each object map uses. A formal definition of their heuristics are given. The rationale behind them will be discussed in more detail in Section 4.2. Should two or more templates reach the highest score, a fixed rank value decides which template is finally selected.

Object Map Template Heuristic Rank Reference Term Type Datatype FnO Function
Formatted Text - - valueRichText child dependent child dependent getEntitiesByTag
getEntitiesByColor
getEntitiesByUnformatted
Integer as String 2 json Literal xsd:integer parseNumber
Decimal as String 1 json Literal xsd:decimal parseNumber
(“.” decimal point)
Decimal as String 1 json Literal xsd:decimal parseNumber
(“,” decimal point)
Date as String 2 json Literal xsd:date parseDate
DateTime as String 3 json Literal xsd:dateTime parseDateTime
Integer List as String 0 json Literal xsd:integer parseNumber
Boolean as String if then … 4 json Literal xsd:boolean parseBoolean
String 0 value Literal xsd:string -
Single Entity 3 valueString IRI - entityLinking
Multiple Entities 4 valueString IRI - entityLinking
Native Boolean 0 valueBoolean Literal xsd:boolean -
Native Integer 3 valueInt Literal xsd:integer -
Native Decimal 4 valueNumeric Literal xsd:decimal -
Numeric with 5 json Literal xsd:date parseDate
Data Format xsd:dateTime parseDateTime
Table 1: Fifteen RML object map templates with their used reference, term type, datatype and FnO function. For each template a heuristic is formally stated and a rank is given. The auxiliary sets and functions are discussed in text.

4.1 Functions

Functions need to be applied once a standard RML mapping is not sufficient to map data correctly. Our use case includes eight functions. Since rather messy spreadsheets are assumed, we expect that literals (such as numeric or boolean values) can also be represented as character strings and named entities are mentioned in texts (possibly several). We also expect to extract information exclusively from typographical emphasized or colored text. To obtain proper values in all of these cases, the following functions can be applied.

parseNumber

This function considers different decimal separators (point vs. comma) and returns numbers with or without decimal places using regular expressions.

parseBoolean

In order to acquire boolean values, this method receives lists of strings that represent true and false values and decides for one side through string matching and majority voting.

parseDate & parseDateTime

Both functions find typical date formats in texts and return them in the prescribed format by the XML Schema Definition888https://www.w3.org/TR/xmlschema11-2/#built-in-primitive-datatypes (XSD). Numeric values are converted to dates as specified in Excel999https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DateUtil.html#getJavaDate-double-. An example is presented in Figure 1 in Column C.

entityLinking

To perform named entity recognition,
entityLinking receives a text as well as entities with labels and returns IRIs of found entities. It uses an implementation101010https://github.com/robert-bor/aho-corasick of the Aho-Corasick algorithm [1].

Since formatted text is represented with an HTML syntax [16], we can extract certain styles with an XML parser111111https://jsoup.org/. These functions also have parameters for term type and datatype to properly call previously mentioned functions on extracted text as subroutines.

getEntitiesByTag

This function extracts typographical emphasized text such as bold (<b>), italic (<i>), underlined (<u>) or crossed out (<strike>) words by passing the regarding HTML-Tag.

getEntitiesByColor

This method obtains text in a certain color using its hex value
(e.g. <font color="#ff0000">, see Column E in Figure 1).

getEntitiesByUnformatted

This procedure gets text that is neither typographical emphasized nor colored (not black).

4.2 Template Heuristics

In this section the heuristics of the templates are discussed (see Table 1).

However, the first template Formatted Text does not have any heuristic: it is automatically applied once formatted cells are found in a column (e.g. Figure 1 Column E). Their formatted texts are grouped according to their combinations of coloring and typographical emphasis (e.g. bold red, italic blue, just underlined or just green). This also includes a group for unformatted texts. Each of these groups form a new virtual column that contains string cells having texts in plain text format. Since one predicate-object map is generated for one column, multiple maps per formatting are defined. Hence, different properties are used for differently formatted information, because we assume that partial formatting indicates relations. The correct getEntitiesBy-function is selected based on the format (typographical emphasize is given preference). However, the term types and datatypes are decided on further reviews of the virtual columns (now containing plain texts).

Before the heuristics of the next templates are explained, we have to formally define some concepts first. Given a column , for all cells we can assign them to one of the disjoint sets: boolean valued cells , numeric valued cells and plain text valued cells (strings). Let a function determine the number of decimal places of a numeric cell, for instance, . The set contains data format types such as date and date time, while the function determines the used data format of a numeric cell. Let be a function that returns the string value of a given cell. Let a function separate a string based on non-alphabetical delimiters found in the text, for example, since semicolon and curved brackets are assumed to be delimiters. Based on this, we define a set which contains all separated substrings of a column. The predicates , , and decide if a string valued cell can be parsed successfully as an integer, decimal number (given the decimal point character), date with and without time information (following common English and German formats).

In order to discover integer numbers, decimal numbers or dates represented in textual forms, our approach tries to parse them respectively. The proportion of successfully parsed strings serves as a heuristic to use the template. Since the parsing functions are also able to deal with numerical values

, they are considered in the estimation as well. We also support integer lists (such as “42, 15; 3”) by calculating the amount of separated strings

that can be parsed to integers. The detection of boolean values encoded as strings is a more challenging task. First, we assume that there should be at most only two distinct string values in the column (like “yes”/“no” or just “y”). However, such strings could also refer to entities such as “DFKI”/“TUKL”. We know from our experience that users tend to write logical values (true and false) in rather short texts or even single symbols (e.g. “x”). That is why we use the average string length as an indicator: if it is below , boolean values are assumed, otherwise (one or two) named entities are expected.

A similar but even more challenging problem is the distinction between simple strings (like names or descriptions) and references to named entities (i.e. resources in a knowledge graph, like organizations or categories). We observed in data from practice that users tend to refer to named entities frequently and recurringly in a column. When they enter string literals their values rather differ from each other. To express this issue in a number, we calculate a certain “degree of duplication” for cell values in in the following way:

Our definition consists of two parts: element multiplicity (proportion of elements occurring more than once) and element uniformity (proportion of same elements). The auxiliary set contains all cells having string values which occurs only once (i.e. which are unique). Thus, if all string elements occur at least two times which indicates a high degree of duplication. Conversely, if all elements occur once which lowers the value of the function. The set contains by definition distinct string values of the cells. If all cells have the exact same string content, is still a one-element set (), which suggests a high uniformity and therefore high duplication rate. However, should all contents differ from each other, then and no uniformity occurs (i.e. diversity is at a maximum). Numeric and boolean values are always considered to be distinct. In the following, we illustrate some application examples:

The function is used as a heuristic in the String, Single Entity and Multiple Entities templates. To detect string literals is calculated which means that is the threshold that decides between the string or entity assumption. The Multiple Entities template is checked by using all separated substrings in . A higher degree of duplication can indicate that per cell multiple entities are mentioned instead of single ones. Because these templates use the entityLinking function, they additionally provide an RDF graph of the presumed entities as resources with labels.

Besides texts, spreadsheets can also have cells with numeric values. The Native Boolean template assumes boolean columns when there is a high proportion of boolean valued cells. Similarly, Native Integer suggests integers once numeric values have noticeably often no decimal places, while Native Decimal behaves in the opposite way. The Numeric with Data Format template checks for a given set of possible data formats the proportion of cells which have a certain data format. Currently, several date and date time formats are supported.

5 Evaluation

To evaluate our approach, we use three sources of datasets. The first one is a synthetic generated dataset121212https://www.dfki.uni-kl.de/~mschroeder/demo/datasprout/data by our approach Data Sprout [15]. This generator produces especially messy spreadsheets based on generation patterns found in industrial use cases. It also provides the ground truth knowledge graph that is expected to be constructed from the sheets. For a second dataset, we utilize the U.S. Government’s open data platform131313https://www.data.gov/ (data.gov): first spreadsheets are downloaded by using its Comprehensive Knowledge Archive Network (CKAN) API141414https://catalog.data.gov/. With a faceted query on the resource format “EXCEL”, 8689 URLs to spreadsheet resources are collected. 7948 files are downloaded from them, however only 3692 can be parsed using Apache’s POI Library151515https://poi.apache.org/. From this collection we pick 12 spreadsheets which contain a certain degree of messiness and which meet our structural expectations (1-dimensional vertical layout without nested headers). They are manually annotated by creating expected RML mappings for them. To ensure that our defined rules map the data correctly, we run RML Mapper and check its outcome. As a third dataset, three spreadsheets about meta data of documents from an industry partner are used. Like for the data.gov dataset, expected RML mappings are defined manually by us. Unfortunately, we are not allowed to publish its raw data because of confidentiality reasons.

As a baseline algorithm Apache’s Any23 tool161616https://any23.apache.org/ is utilized. This program is able to automatically extract RDF data from CSV files, besides a variety of Web documents. Since the tool does not support Excel as input, the command-line tool ssconvert from Gnumeric171717http://www.gnumeric.org/ is used to convert Excel files to one CSV file per sheet (using -S argument). A similar tool xlsx2csv181818https://github.com/dilshod/xlsx2csv (version 0.7.8) is also used, since different converters may lead to different CSV files that can effect the outcome of Any23. Note that in these transformation steps all values become strings and potential style information is lost (e.g. struck out text).

The next section covers how the results of Spread2RML and Any23 are compared with the ground truth.

5.1 Matching

In order to evaluate an approach, we would like to know to which extent the procedure extracted the expected RDF statements for each sheet individually. We call the approach’s output actual graph, while the ground truth forms the expected graph. Since approaches may output resource URIs (including classes and properties) different to the URIs in the ground truth, simple statement equality checks will not show the actual ground truth coverage. To solve this problem, subgraph isomorphism between the two graphs has to be applied, but this is NP-complete [4]. In order to reduce the complexity, we use the provenance information of RDF statements to assign them to the spreadsheet cells where they were extracted. These newly formed “cell graphs” are always planar, since they only contain outgoing edges from a resource that represents the sheet’s row. Planar subgraph isomorphism is known to be solvable in linear time [6].

When comparing an actual RDF graph (set of statements) with an expected RDF graph , we would like to find the best match of their resources (URIs) which is a injective function such that has the largest possible number of elements. The notation means that for every statement in resources will be substituted according to . To obtain such a matching , we designed a greedy-based procedure (Algorithm 1) that takes two lists of statements and returns a best match. At its core, enumerate (Line 1) recursively permutes statement list and zips it with (from right to left) to acquire all possible statement matches (Line 2–5, 8–9). Note that each copy is based upon . During the permutation a distance sum of all fixed statements in the current permutation is calculated (Line 6–7). If the distance is over a certain threshold, we know that it is meaningless to continue the permutation recursively (Line 7–8), because the distance can only become greater. The distance procedure (Line 11) takes two statements and checks with the current match and its inverse if the subject pair (Line 14–15), predicate pair (Line 17–18) and object pair (Line 20–23) are still matched consistently. If this is not the case, a distance value is returned, otherwise the resources are matched in (Line 16, 19, 24) and no distance is returned (Line 25). In the end, enumerate returns the that matches most resources as well as already identical ones (Line 10).

1: and are lists of statements, , ,
2:a function that matches resources from to resources from such that has the largest possible number of elements.
3:procedure enumerate(, , , , )
4:     if  then return      
5:     for  do
6:         swap(, , )
7:         for  do
8:               distance(, , )          
9:         if  is under a certain threshold then
10:              enumerate(, , , , )          
11:         swap(, , )      
12:     return largest with largest
13:procedure distance(, , )
14:      is a distance
15:     , the inverse function of
16:     if  then
17:         if  or  then return               
18:     
19:     if  then
20:         if  or  then return               
21:     
22:     if  and are literals then
23:         if  then return          
24:     else if  and are resources then
25:         if  or  then return               
26:     
27:     return 0
Algorithm 1 Greedy Matching

The algorithm’s correctness was tested by a consistent substitution of all URIs of our Data Sprout dataset. Our procedure was able to match all resources in the way they were substituted. We also discovered in our experiments, that it performs faster when statements with literals are checked first, since right at the beginning many matches can be ruled out. Thus, we sort such RDF triples to the end of the lists and .

We can now formulate our evaluation metrics. In our experiments, we compare an

actual sheet that is returned from a procedure with the ground truth expected sheet . The row index and column index are used to locate the same located cell in the sheets, namely and . Using our matching procedure (Algorithm 1), we obtain the best match between them and map actual statements to expected statements in order to compare them. This way, we can acquire true positive (correct), false negative (missed) and false positive (false alarm) statements:

Based on that we are able to define the well-known precision (), recall () and f-measure metrics. They are applied for the whole sheet by summing up , and values for all cells in the sheet.

5.2 Results

Evaluation results are shown in Table 2 which shows precision, recall and f-measure values for Any23 and Spread2RML. Our approach performs better then the baseline for all datasets. In Data Sprout four knowledge graphs (BSBM, GL, LUBM and SP2B) were used to generate messy spreadsheet for each generation pattern [15]

, including no patterns are applied (Clean) as well as all are used at once (All). For the four datasets the average and standard deviation values per pattern are calculated and shown in the table. Spread2RML archives an f-measure between

and

on average, however with a higher variance compared to Any23. Better results could be archived for the data.gov dataset, since realistic data from practice is used. Here, we receive for the 12 datasets on average a

f-score. The industry dataset receives the worst measures with an f-measure of . We assume that for our private company dataset a data management strategy was rather less pursued, unlike for the publicly accessible and therefore cleaner government data.

Although, our approach shows a good performance, there is still potential for improvements. In the following, we discuss some typical error classes we discovered.

Strings and Numbers. It may happen that columns with numbers are mixed with textual expressions, such as “Baseline” or “N/A”. Although they are infrequent, Spread2RML suggest to map the whole column to string values, while numeric values would be a better choice. The problem is that the String template and the Native templates count and score differently. As a consequence, the String template often receives a higher heuristic.

Single & Multiple Entities vs. Strings. Our heuristic is not able to detect entities when they do not occur often enough for several times. It can also be very ambiguous, if single or multiple entities should be extracted. In such cases feedback from an expert is required.

Boolean Misinterpretation. In cases columns contain one single entity (like “CRMO” or “lung”), they can be interpreted as boolean values.

In summary, despite some misinterpretations, Spread2RML shows first promising results predicting the right mappings in various cases of messy data. All resources regarding the evaluation can be accessed online191919https://www.dfki.uni-kl.de/~mschroeder/demo/spread2rml.

Any23 Spread2RML
Data Sprout Precision Recall F-Measure Precision Recall F-Measure
Acronyms Or Symbols
All
Clean
Intra Cell Additional Information
Manual
Multiple Surface Forms
Multiple Types In A Table
Numeric Information As Text
Outdated Is Formatted
Partial Formatting Indicates Relations
Property Value As Color
Any23 Spread2RML
data.gov Precision Recall F-Measure Precision Recall F-Measure
0154ae39 0.03 0.01 0.02 0.81 0.89 0.85
12920281 0.40 0.21 0.27 1.00 1.00 1.00
17aecc3e 0.13 0.06 0.08 0.60 0.50 0.55
29a77cd1 0.01 0.00 0.00 1.00 1.00 1.00
5a0c9f12 0.19 0.18 0.18 0.65 0.87 0.74
614e3bd1 0.40 0.28 0.33 0.76 0.88 0.82
73b7a715 0.23 0.21 0.22 1.00 1.00 1.00
8cc689af 0.43 0.29 0.35 0.68 0.56 0.61
b193a34e 0.00 0.00 0.00 1.00 1.00 1.00
bc01528b 0.05 0.27 0.08 0.92 0.96 0.94
bf52a35a 0.40 0.28 0.33 0.93 0.95 0.94
c34bdd57 0.38 0.18 0.25 0.48 0.66 0.56
Any23 Spread2RML
Precision Recall F-Measure Precision Recall F-Measure
Industry 0.11 0.07 0.09 0.29 0.30 0.30
Table 2: Precision, recall and f-measure results of Any23 and Spread2RML on the datasets Data Sprout, data.gov and Industry. In case of Data Sprout, average and standard deviation values per messy pattern are presented.

6 Conclusion and Outlook

In this paper, we presented and evaluated our approach Spread2RML which is able to predict RML mappings on especially messy spreadsheets. This is accomplished by an extensible set of RML predicate-object map templates which are chosen for each column based on heuristics. Our templates make use of several FnO functions to parse texts, link entities or extract formatted information. For evaluation, synthetic messy data and real data was used to examine to which extent our approach extracts expected RDF statements. For comparison, a baseline procedure was applied as well. The assessment showed that Spread2RML archives first promising results.

In the future, we plan to apply a human-in-the-loop (HumL) approach in the prediction process. Having mapping suggestions, experts are able to review them and give valuable feedback. In very ambiguous cases, experts can give Spread2RML necessary hints to decide correctly. This can boost the prediction performance, for instance, when learning special rules or individual exceptions.

Acknowledgements

This work was funded by the BMBF project SensAI (grant no. 01IW20007).

References

  • [1] Aho, A.V., Corasick, M.J.: Efficient string matching: An aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975). https://doi.org/10.1145/360825.360855, https://doi.org/10.1145/360825.360855
  • [2] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: Dbpedia: A nucleus for a web of open data. In: Aberer, K., Choi, K., Noy, N.F., Allemang, D., Lee, K., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007. Lecture Notes in Computer Science, vol. 4825, pp. 722–735. Springer (2007), https://doi.org/10.1007/978-3-540-76298-0_52
  • [3] Bonfitto, S., Casiraghi, E., Mesiti, M.: Table understanding approaches for extracting knowledge from heterogeneous tables. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 11(4) (2021). https://doi.org/10.1002/widm.1407, https://doi.org/10.1002/widm.1407
  • [4]

    Cook, S.A.: The complexity of theorem-proving procedures. In: Harrison, M.A., Banerji, R.B., Ullman, J.D. (eds.) Proceedings of the 3rd Annual ACM Symposium on Theory of Computing, May 3-5, 1971, Shaker Heights, Ohio, USA. pp. 151–158. ACM (1971).

    https://doi.org/10.1145/800157.805047,
    https://doi.org/10.1145/800157.805047
  • [5] Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: a generic language for integrated RDF mappings of heterogeneous data. In: Proceedings of the 7th Workshop on Linked Data on the Web. CEUR Workshop Proceedings, vol. 1184 (Apr 2014)
  • [6] Eppstein, D.: Subgraph isomorphism in planar graphs and related problems. In: Clarkson, K.L. (ed.) Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 22-24 January 1995. San Francisco, California, USA. pp. 632–640. ACM/SIAM (1995), http://dl.acm.org/citation.cfm?id=313651.313830
  • [7] Ermilov, I., Auer, S., Stadler, C.: User-driven semantic mapping of tabular data. In: Sabou, M., Blomqvist, E., Noia, T.D., Sack, H., Pellegrini, T. (eds.) I-SEMANTICS 2013 - 9th International Conference on Semantic Systems, ISEM ’13, Graz, Austria, September 4-6, 2013. pp. 105–112. ACM (2013). https://doi.org/10.1145/2506182.2506196, https://doi.org/10.1145/2506182.2506196
  • [8] Fiorelli, M., Stellato, A.: Lifting tabular data to RDF: A survey. In: Garoufallou, E., Ovalle-Perandones, M.A. (eds.) Metadata and Semantic Research - 14th International Conference, MTSR 2020, Madrid, Spain, December 2-4, 2020, Revised Selected Papers. Communications in Computer and Information Science, vol. 1355, pp. 85–96. Springer (2020). https://doi.org/10.1007/978-3-030-71903-6_9, https://doi.org/10.1007/978-3-030-71903-6_9
  • [9] Hammar, K.: Linked data creation with excelrdf. In: Harth, A., Presutti, V., Troncy, R., Acosta, M., Polleres, A., Fernández, J.D., Parreira, J.X., Hartig, O., Hose, K., Cochez, M. (eds.) The Semantic Web: ESWC 2020 Satellite Events - ESWC 2020 Satellite Events, Heraklion, Crete, Greece, May 31 - June 4, 2020, Revised Selected Papers. Lecture Notes in Computer Science, vol. 12124, pp. 104–109. Springer (2020). https://doi.org/10.1007/978-3-030-62327-2_18, https://doi.org/10.1007/978-3-030-62327-2_18
  • [10] Heyvaert, P., Dimou, A., Herregodts, A., Verborgh, R., Schuurman, D., Mannens, E., de Walle, R.V.: Rmleditor: A graph-based mapping editor for linked data mappings. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) The Semantic Web. Latest Advances and New Domains - 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Proceedings. Lecture Notes in Computer Science, vol. 9678, pp. 709–723. Springer (2016), https://doi.org/10.1007/978-3-319-34129-3_43
  • [11] Heyvaert, P., Dimou, A., Verborgh, R., Mannens, E.: Semi-automatic example-driven linked data mapping creation. In: Gentile, A.L., Nuzzolese, A.G., Zhang, Z. (eds.) Proceedings of the 5th International Workshop on Linked Data for Information Extraction co-located with the 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 22, 2017. CEUR Workshop Proceedings, vol. 1946, pp. 14–25. CEUR-WS.org (2017), http://ceur-ws.org/Vol-1946/paper-03.pdf
  • [12] Heyvaert, P., Meester, B.D., Dimou, A., Verborgh, R.: Declarative rules for linked data generation at your fingertips! In: Gangemi, A., Gentile, A.L., Nuzzolese, A.G., Rudolph, S., Maleshkova, M., Paulheim, H., Pan, J.Z., Alam, M. (eds.) The Semantic Web: ESWC 2018 Satellite Events - ESWC 2018 Satellite Events, Heraklion, Crete, Greece, June 3-7, 2018, Revised Selected Papers. Lecture Notes in Computer Science, vol. 11155, pp. 213–217. Springer (2018), https://doi.org/10.1007/978-3-319-98192-5_40
  • [13] de Medeiros, L.F., Priyatna, F., Corcho, Ó.: MIRROR: automatic R2RML mapping generation from relational databases. In: Cimiano, P., Frasincar, F., Houben, G., Schwabe, D. (eds.) Engineering the Web in the Big Data Era - 15th International Conference, ICWE 2015, Rotterdam, The Netherlands, June 23-26, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9114, pp. 326–343. Springer (2015). https://doi.org/10.1007/978-3-319-19890-3_21, https://doi.org/10.1007/978-3-319-19890-3_21
  • [14] Meester, B.D., Dimou, A., Verborgh, R., Mannens, E.: An ontology to semantically declare and describe functions. In: The Semantic Web - ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 - June 2, 2016, Revised Selected Papers. Lecture Notes in Computer Science, vol. 9989, pp. 46–49 (2016), https://doi.org/10.1007/978-3-319-47602-5_10
  • [15] Schröder, M., Jilek, C., Dengel, A.: Dataset generation patterns for evaluating knowledge graph construction. In: The Semantic Web: ESWC 2021 Satellite Events - Poster and Demo Track (2021), https://openreview.net/forum?id=PTJhN-wHBEy
  • [16] Schröder, M., Jilek, C., Dengel, A.: Mapping spreadsheets to rdf: Supporting excel in rml. In: The Semantic Web: ESWC 2021 Satellite Events - Knowledge Graph Construction Workshop (2021), https://openreview.net/forum?id=Jfw5sfITi7R
  • [17] Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: Williamson, C.L., Zurko, M.E., Patel-Schneider, P.F., Shenoy, P.J. (eds.) Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007. pp. 697–706. ACM (2007). https://doi.org/10.1145/1242572.1242667, https://doi.org/10.1145/1242572.1242667
  • [18] Vrandecic, D.: Wikidata: a new platform for collaborative data collection. In: Mille, A., Gandon, F., Misselis, J., Rabinovich, M., Staab, S. (eds.) Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, April 16-20, 2012 (Companion Volume). pp. 1063–1064. ACM (2012), https://doi.org/10.1145/2187980.2188242
  • [19] World Wide Web Consortium: R2RML: RDB to RDF mapping language. https://www.w3.org/TR/r2rml/ (2012), accessed: 2021-02-01
  • [20] Zhang, Z.: Effective and efficient semantic table interpretation using tableminer. Semantic Web 8(6), 921–957 (2017), https://doi.org/10.3233/SW-160242