OBDA for the Web: Creating Virtual RDF Graphs On Top of Web Data Sources

05/22/2020 ∙ by Konstantina Bereta, et al. ∙ University of Athens

Due to Variety, Web data come in many different structures and formats, with HTML tables and REST APIs (e.g., social media APIs) being among the most popular ones. A big subset of Web data is also characterised by Velocity, as data gets frequently updated so that consumers can obtain the most up-to-date version of the respective datasets. At the moment, though, these data sources are not effectively supported by Semantic Web tools. To address variety and velocity, we propose Ontop4theWeb, a system that maps Web data of various formats into virtual RDF triples, thus allowing for querying them on-the-fly without materializing them as RDF. We demonstrate how Ontop4theWeb can use SPARQL to uniformly query popular, but heterogeneous Web data sources, like HTML tables and Web APIs. We showcase our approach in a number of use cases, such as Twitter, Foursquare, Yelp and HTML tables. We carried out a thorough experimental evaluation which verifies the high efficiency of our framework, which goes beyond the current state-of-the-art in this area, in terms of both functionality and performance.




1. Introduction

Querying Web data sources on-the-fly is an important task for several reasons: (i) Having full access to such data sources may involve a high economic cost (e.g., the price of subscribing to the entire Twitter stream). (ii) The constantly changing terms of use and the corresponding legislation complicate data crawling (e.g., the constraints defined by the recent EU General Data Protection Regulation; see https://eugdpr.org for more details). (iii) The high frequency of updates (Velocity) makes it difficult for data consumers to keep up with the content published in popular Web sources like social media applications. For example, approximately 6,000 tweets are posted on Twitter per second (http://www.internetlivestats.com/twitter-statistics/).

Moreover, querying on-the-fly non-RDF Web data using SPARQL has become a major issue (ISWC18; ldow18; eswc18), because many Web data sources rely on non-RDF formats, such as REST APIs and HTML tables. To address this Variety, SPARQL is extended in (eswc18) so that it allows for querying RDF data in combination with data coming from Web APIs in the form of JSON files. In (ldow18), the authors propose an architecture based on micro-services that extends the SPARQL protocol with the ability to query APIs on-the-fly. Finally, an extension of the R2RML mapping language is proposed in (ISWC18), providing primitives for querying various kinds of Web data sources, such as APIs.

However, these works merely support relational data or specific file formats (e.g., XML, CSV). They also rely on custom SPARQL/R2RML extensions that hamper their adoption, while combining them with third-party added-value services is a very complicated procedure. Some of them also implement a caching mechanism (eswc18; ldow18), but are inherently incapable of making the most of it, as demonstrated by our experiments in Section 7.

Figure 1. Tables with 100 movies from Rotten Tomatoes and Wikipedia.

For example, let us assume that we would like to query data available in the HTML tables shown in Figure 1 using SPARQL. The first table presents a listing of movies with the best rank, provided by Rotten Tomatoes (https://www.rottentomatoes.com/). The second one is a Wikipedia table that lists the best 100 movies. If we would like to query this data using SPARQL, we would have to:

  1. create a custom parser to parse the data

  2. convert the data into RDF

  3. store the RDF data in a triple store, and eventually,

  4. query the data using SPARQL.

The only alternative approach would be to convert and store the data into relational tables, instead of using a triple store, and then use an OBDA or RDB2RDF system to query the data using SPARQL. In both cases, the convert-and-store tasks would be inevitable. The approach presented in (eswc18) could not be applied, as it is not designed for HTML tables.

In this paper, we go beyond these state-of-the-art works, introducing a system, called Ontop4theWeb, that extends existing ontology-based data access (OBDA) techniques to support uniform queries over data from different Web sources, such as WebTables and REST APIs. Ontop4theWeb relies on virtual relational tables that allow for executing any SPARQL query on top of an OBDA system, using the necessary ontology and mappings, but without requiring the data to be available a-priori, i.e., before the query is posed. In this way, Ontop4theWeb offers a series of unique characteristics:

  1. It accommodates any format of Web data, such as the increasingly popular HTML tables and the omnipresent REST APIs, and it is especially suitable for data sources with high Velocity.

  2. It is easy to use and incorporate into any Semantic Web application, as it relies on standard SPARQL and R2RML (and its equivalents).

  3. Based on micro-services, it facilitates the seamless enrichment of retrieved data with third-party added-value services (e.g., sentiment analysis).

  4. It is able to fully exploit an effective caching mechanism that can be configured in line with the update rate of the respective data source.

  5. It goes beyond traditional convert-and-store approaches by requiring no materialisation of the original data following the new schema.

Overall, the main contributions of this paper are as follows:

  • We introduce Ontop4theWeb, an OBDA-based system for posing SPARQL queries on top of non-RDF Web data on-the-fly, i.e., they are fetched at query time, rather than importing or downloading them a-priori. To achieve this, virtual table operators are embedded in the SQL queries that are included in R2RML mappings. These mappings specify which part and source of Web data will be fetched and how they will be mapped to virtual RDF terms. Combining these mappings with an ontology allows for returning the virtual relational data that are involved in the query as RDF results.

  • We showcase the applicability of Ontop4theWeb in three use cases that: (i) involve a significant amount of heterogeneous crowd-sourced information (Variety), (ii) get updated so frequently that a snapshot of the respective information at a given time might soon become outdated (Velocity), and (iii) are widely used by application developers.

  • We experimentally evaluate Ontop4theWeb, demonstrating its feasibility and scalability in the three, realistic, highly diverse and demanding applications we consider. The results show that our approach is able to process queries on WebTables of up to 100,000 rows in size within minutes. We also compared the performance of our approach to the state-of-the-art method described in (eswc18), with the results verifying that our framework provides more functionality, while being more efficient, as well.

The rest of the paper is organized as follows: in Section 2, we discuss the state-of-the-art in the field, while in Section 3, we briefly present basic background knowledge. Section 4 describes our approach and methodology, whereas Section 5 documents our system, which is applied to practical use cases in Section 6. Section 7 presents our experimental evaluation, with Section 8 concluding the paper along with directions for future work.

2. Related Work

OBDA systems (poggi) are primarily useful in cases where users store their data in relational databases, but do not want to materialize them as RDF triples, particularly when these databases are large and/or get frequently updated (Velocity) (mastro). As a result, many OBDA and RDB2RDF systems have been developed in recent years, such as Ontop (ontop), Ultrawrap (ultrawrap), Morph (morph), Sparqlify (http://aksw.org/Projects/Sparqlify.html), and Oracle Spatial and Graph (http://www.oracle.com/technetwork/database/options/spatialandgraph/overview/index.html). These systems are able to connect to existing relational data sources and create virtual RDF graphs using ontologies and mappings. The common assumption of these systems is that the source data are materialized and connection details are provided in the mappings. Most of them support the R2RML mapping language or provide translators from their native mapping languages to R2RML. For example, Ontop also supports its own native OBDA language for encoding mappings. Once connected to the data source, OBDA systems make the most of the underlying database by collecting information about data characteristics (e.g., statistics, constraints).

On another line of research, there are RDB2RDF systems that focus on converting data into RDF using mappings to produce RDF dumps. Initially, only relational data sources were supported through the R2RML language (rtorml). Given, though, that data can be found in many formats other than relational, the RML language was created as a superset of R2RML, encoding how various data formats, like XML and CSV, can be mapped to RDF triples (rml). Another recent work in this direction is the approach described in (sparql-generate), which aims at converting Web data from various formats (e.g., CSV, JSON) into RDF using SPARQL queries; to this end, it also extends SPARQL 1.1 primitives and extension functions.

Recently, a mapping language, called D2RML, was proposed in (ISWC18), inspired by R2RML and RML. This work extends R2RML to support more data formats, including REST APIs. Although an implementation of a D2RML processor exists, it is not part of a standalone SPARQL query engine, to the best of our knowledge.

Closer to our work is the SERVICE-to-API system (eswc18), which proposes an extension of SPARQL that enables users to combine the responses of JSON APIs with results from the evaluation of standard triple patterns. We deviate from this approach in that:

  1. we do not extend SPARQL syntax,

  2. we allow users to query APIs using standard SPARQL triple patterns directly, without having to combine them with stored RDF data,

  3. we provide a general approach that is not limited to JSON APIs,

  4. we produce significantly fewer API calls, which translates to improved performance.

Section 7 provides a detailed qualitative and quantitative comparison between the two systems.

Also close to our work is the architecture proposed in (ldow18), which is based on the development of SPARQL wrappers for Web APIs. To this end, it extends HTTP requests to SPARQL endpoints to include arguments that are used to retrieve a fragment of the data that can be accessed via the Web API. This fragment is converted into RDF and stored using an in-memory triple store. In this way, the SPARQL query that is contained in the original SPARQL HTTP request is evaluated against the RDF graph that is stored in the triple store, which is only a fragment of the original dataset. This fragment can be considered as a linked data fragment (LDF) interface, as described in (ldf). Note that the original linked data fragment approach considers the evaluation of single triple patterns on the server-side, leaving the rest to the client, in order to improve the sustainability of linked data endpoints (ldf). However, there is no limit to the expressivity of queries that can be executed on the server in (ldow18). In short, (ldow18) converts a fragment of the dataset into RDF and stores the converted data into an in-memory triple store. In our approach, the conversion is performed on-the-fly using mappings and an in-memory virtual table is constructed instead.

We deviate from this approach in that:

  1. no data are materialized into RDF, as the query is converted on-the-fly using mappings,

  2. our approach can be adapted to a different schema simply by modifying the mapping file, without requiring a change in the system code, as in (ldow18),

  3. the translation of the original SPARQL query is completely transparent to the end-user, whereas (ldow18) requires the end user to be fully aware of the Web API documentation so as to specify the fragment of the Web API that needs to be accessed.

In the area of databases and data integration, (data-integration) gives an overview of how web sources can be accessed using wrappers. The approach described in this paper follows the same principles, as the architecture that we propose contains two layers that can be considered as wrappers on top of Web data sources, as we describe later on. However, we target specifically the problem of how one can pose SPARQL queries on top of Web data sources using ontologies and mappings, and this problem goes beyond the approaches discussed in (data-integration) (although ontologies are mentioned).

3. Preliminaries

We now present the background knowledge that forms the basis for defining the problem we are tackling as well as Ontop4theWeb.

3.1. RDF and SPARQL

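We denote as $I$, $B$ and $L$ the pairwise disjoint infinite sets of IRIs, blank nodes and literals, respectively. $V$ stands for the infinite set of variables, which is disjoint from $I$, $B$ and $L$. Based on (SPARQL), we provide the following definitions:

RDF triple. An RDF triple is an element of the form $(s, p, o) \in (I \cup B) \times I \times (I \cup B \cup L)$, where $s$ is the subject, $p$ is the predicate and $o$ is the object.

RDF graph. An RDF graph is a finite set of RDF triples.

Triple pattern. A triple pattern is an element of the form $(s, p, o) \in (I \cup B \cup V) \times (I \cup V) \times (I \cup B \cup L \cup V)$.

Graph pattern. A (basic) graph pattern (BGP) is a finite set of triple patterns.

Evaluation of triple patterns over an RDF graph. Let $G$ be an RDF graph over $I \cup B \cup L$, $t$ a triple pattern and $P_1, P_2$ graph patterns. The mapping $\mu$ is a partial function $\mu : V \rightarrow I \cup B \cup L$, while $\mu(t)$ is the triple obtained if we replace every variable $v$ included in $t$ (i.e., $v \in var(t)$) with its binding according to $\mu$ (i.e., $\mu(v)$). In this context, the evaluation of a graph pattern $P$ over $G$, denoted by $[\![P]\!]_G$, is recursively defined as follows:

  • $[\![t]\!]_G = \{\mu \mid dom(\mu) = var(t) \text{ and } \mu(t) \in G\}$, where $var(t)$ is the set of variables occurring in $t$.

  • $[\![P_1 \text{ AND } P_2]\!]_G = [\![P_1]\!]_G \bowtie [\![P_2]\!]_G$.

  • $[\![P_1 \text{ OPT } P_2]\!]_G = [\![P_1]\!]_G \mathbin{⟕} [\![P_2]\!]_G$.

  • $[\![P_1 \text{ UNION } P_2]\!]_G = [\![P_1]\!]_G \cup [\![P_2]\!]_G$.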
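The evaluation of a single triple pattern can be illustrated with a minimal Python sketch. It assumes that terms are plain strings, that variables are strings starting with "?", and that a graph is a list of 3-tuples; the names eval_triple_pattern, compatible and join are hypothetical helpers introduced only for this illustration. The join of two solution sets follows the standard notion of compatible mappings.

```python
# Illustrative sketch only: terms are plain strings, variables start with "?".
GRAPH = [
    ("ex:1", "rdf:type", "ex:Student"),
    ("ex:1", "ex:hasName", "Alice"),
    ("ex:2", "ex:hasName", "Bob"),
]

def is_var(term):
    """A term is a variable if it starts with '?'."""
    return term.startswith("?")

def eval_triple_pattern(pattern, graph):
    """Return every mapping mu with dom(mu) = var(pattern) and mu(pattern) in graph."""
    solutions = []
    for triple in graph:
        mu = {}
        matched = True
        for p_term, g_term in zip(pattern, triple):
            if is_var(p_term):
                # A repeated variable must be bound to the same term.
                if mu.setdefault(p_term, g_term) != g_term:
                    matched = False
                    break
            elif p_term != g_term:
                matched = False
                break
        if matched and mu not in solutions:
            solutions.append(mu)
    return solutions

def compatible(mu1, mu2):
    """Two mappings are compatible if they agree on their shared variables."""
    return all(mu2[v] == val for v, val in mu1.items() if v in mu2)

def join(omega1, omega2):
    """Join two solution sets by merging all pairs of compatible mappings."""
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]
```

For instance, joining the solutions of ("?s", "ex:hasName", "?n") with those of ("?s", "rdf:type", "ex:Student") keeps only the bindings for ex:1.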

3.2. Ontology-based data access (OBDA)

The OBDA paradigm (obda) proposes the creation of virtual RDF graphs on top of relational databases using ontologies and mappings. Given a database schema $S$, an ontology $O$, and a set of mappings $M$, an OBDA specification is defined as $P = (O, M, S)$. Then, an OBDA instance $(P, D)$ is defined given the OBDA specification $P$ and the database $D$ that follows the database schema $S$. Mappings encode how relational data get mapped into RDF terms. A virtual RDF graph $G$ of the database instance $D$ is produced if we apply the mappings $M$ to $D$. Then, if $[\![Q]\!]_{(P,D)}$ is the evaluation of the SPARQL query $Q$ over the OBDA instance $(P, D)$, it is equivalent to $[\![Q]\!]_{G}$.

The W3C standard language for encoding mappings is R2RML (rtorml). As an example, we can map a relational table named Students with columns id and name into RDF with the following R2RML mapping:

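@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# The ex: namespace below is illustrative.
@prefix ex: <http://example.org/> .

<r2rml_mapping> a rr:TriplesMap ;
  rr:logicalTable [ rr:sqlQuery """select id, name from Students""" ] ;
  rr:subjectMap [ rr:template "http://example.org/{id}" ; rr:class ex:Student ] ;
  rr:predicateObjectMap [ rr:predicate ex:hasName ;
    rr:objectMap [ rr:column "name" ; rr:datatype xsd:string ] ] .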

Yet, many OBDA systems support their own native mapping languages. In this work, we use the native language of Ontop (ontop), as it combines brevity with readability. The equivalent to the above R2RML mapping is the following:

[[mappingId obda_mapping
  target  ex:{id} a ex:Student ; ex:hasName {name}^^xsd:string .
  source  select id, name from Students ]]

In this mapping, the target part provides templates of virtual triples that are generated using the values of these columns, while the source part contains an SQL query that retrieves the id and name columns of the table Students.
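To make the role of the target and source parts concrete, the following Python sketch instantiates the target templates of the example mapping for every row returned by the source query. It is only an illustration under stated assumptions: an in-memory SQLite table stands in for the relational database, the sample rows are invented, apply_mapping is a hypothetical helper, and the triples are listed explicitly here, whereas an OBDA system answers queries by rewriting them without producing such a list.

```python
import sqlite3

def apply_mapping(rows):
    """Instantiate the target templates of the example mapping for each source row."""
    triples = []
    for student_id, name in rows:
        subject = f"ex:{student_id}"  # template ex:{id}
        triples.append((subject, "rdf:type", "ex:Student"))
        triples.append((subject, "ex:hasName", f'"{name}"^^xsd:string'))
    return triples

# Illustrative stand-in for the relational source (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# The source part of the mapping is an ordinary SQL query.
rows = conn.execute("select id, name from Students").fetchall()
virtual_triples = apply_mapping(rows)
conn.close()
```

Each source row yields two triples: one for the class assertion and one for the name.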

4. Approach

We now elaborate on how we extend SQL with virtual table operators that access heterogeneous Web data sources (Section 4.1), on how we evaluate standard SPARQL queries on top of Web APIs (Section 4.2) as well as on the steps comprising our methodology (Section 4.3).

4.1. Extending SQL with virtual table operators

The core concept of our approach is to model a data source as a virtual relational table. For this reason, we define a virtual table operator for each kind of data source. Each virtual table operator has the syntax $VT(\vec{a}, f)$, where the vector $\vec{a}$ denotes the arguments that are given as input to the virtual table operator, while $f$ is optional, denoting the cache update rate.

To understand the form of the SQL queries that use virtual tables, consider the extension of the SQL syntax in Listing 1.

<query specification> ::= SELECT [ <set quantifier> ] <select list>
                                    <table expression>
<table expression>    ::=  <from clause>
                           [ <where clause> ]
                           [ <group by clause> ]
                           [ <having clause> ]
<from clause>         ::= FROM <table references>
<table references>    ::= <table reference>
                           [ { <comma> <table reference> }..]  |
Listing 1: SQL syntax for virtual tables

We extend the SQL syntax provided in Listing 1 with virtual table support as shown in the last lines. The SQL standard defines two types of tables: (i) the base ones, which are materialized in a database, and (ii) the derived ones, which are produced from relational algebra expressions. At the relational algebra level, a virtual table is just another relational algebra operator. Thus, we consider virtual tables generated by virtual table operators as another kind of derived tables; any mapping language that is able to use SQL queries in mappings (e.g., R2RML, OBDA) is compatible.

To improve performance, each virtual table can optionally use a cache. The cache feature is useful in cases where:

  1. not all data sources get updated with the same frequency,

  2. some data sources might not be accessible at the next query time (e.g., due to API limitations),

  3. a minimal query execution time is required, due to a large number of queries, i.e., the frequency of queries is much higher than the update frequency of data sources.

To support these cases, the parameter $f$ indicates the length of the time window (in milliseconds), during which the retrieved data are temporarily stored. If the virtual table operator with the same input arguments is invoked twice (or more) before this time window ends, the cached data will be used, improving query time. If the query is repeated after the end of the time window, fresh data is fetched from the data source and gets stored in the system. If $f$ has a negative value or is completely absent, nothing is stored and the virtual table operator fetches fresh data every time it is invoked. To support this functionality, we store meta-data about when and where the data resulting from a virtual table signature was last stored.

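Input : a, the arguments of the operator; f, the (optional) cache update rate
Output : T, the generated virtual table
1 begin
2        T ← ∅;
3        t ← getLastUpdate(a);
4        if NOW − t < f then
5               T ← getTableFromCache(a);
6               return T;
7        end
8        R ← retrieveData(a);
9        for each record r ∈ R do
10               row ← {id};
11               id ← id + 1;
12               for each attribute value w ∈ r do
13                      w’[] ← processAttribute(w);
14                      row ← row ∪ {w’[]};
15               end
16              T ← T ∪ {row};
17        end
18       UpdateCache(a, NOW, T);
19        return T;
Algorithm 1 Virtual Table Operator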

Since our approach and our caching mechanism deviate considerably from related works (ldow18; eswc18), we now explain in more detail how virtual tables work. Each virtual table operator is implemented differently, but a generalized description is provided in Algorithm 1. First, the operator checks the time the last query with the same arguments was executed (Line 3). If it is within the given cache update rate, $f$, the already retrieved results are returned as output (Lines 4-6). Otherwise, the operator retrieves the data from scratch, using the given arguments (Line 8). For each record, it creates a new tuple with a unique id (Lines 9-11). Next, it iterates over its attribute values, adding them to the tuple after the necessary processing (Lines 12-14). Note that the functionality of the processAttribute function ranges from simple tasks (e.g., data manipulation functions like value transformation/correction) to more complicated tasks (e.g., data mining tasks). For this reason, w’[] may contain more than one element. E.g., it can be the text of a tweet together with information about its polarity (i.e., positive or negative sentiment). Finally, after all tuples have been processed and added to the virtual table (Line 16), the cache is updated (Line 18) and the table is returned as output.

The result of a virtual table operator is a virtual table with the following schema: $(id, c_1, \dots, c_n)$, where $id$ is the unique identifier of a tuple and $c_1, \dots, c_n$ are the requested attributes. Note that some of these attributes might not exist originally in the data source, but they could introduce new knowledge derived from processing the original data, as shown in Section 6.2.
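The behaviour of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the MadIS implementation: the class VirtualTableOperator, its fetch and process_attribute callbacks, and the injectable clock are all hypothetical names, the cache lives in a dictionary keyed by the argument tuple, and the window f is expressed in the units of the clock (seconds for time.time), which is an assumption made for the example.

```python
import time

class VirtualTableOperator:
    """Sketch of a virtual table operator with an optional cache window f."""

    def __init__(self, fetch, process_attribute=lambda w: [w], clock=time.time):
        self.fetch = fetch                          # retrieves records for the given arguments
        self.process_attribute = process_attribute  # may return several values per attribute
        self.clock = clock                          # injectable clock, eases testing
        self._cache = {}                            # argument tuple -> (timestamp, table)

    def __call__(self, *args, f=None):
        now = self.clock()
        # A missing or negative f disables caching entirely.
        use_cache = f is not None and f >= 0
        if use_cache and args in self._cache:
            stored_at, table = self._cache[args]
            if now - stored_at < f:
                return table                        # still inside the time window
        table = []
        for row_id, record in enumerate(self.fetch(*args), start=1):
            row = [row_id]                          # unique identifier of the tuple
            for w in record:
                row.extend(self.process_attribute(w))
            table.append(tuple(row))
        if use_cache:
            self._cache[args] = (now, table)
        return table
```

Calling the operator twice with the same arguments inside the window triggers a single retrieval; once the window has elapsed, or when f is omitted, the data source is contacted again.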

4.2. Evaluation of SPARQL queries on top of Web APIs

As described above, the semantics of RDF and SPARQL (SPARQL) assume that the evaluation of SPARQL queries is performed over an RDF knowledge base and the OBDA paradigm (obda) defines the creation of virtual RDF graphs on top of materialised databases, for which the schema is known a-priori.

Our aim is to support the evaluation of SPARQL queries on top of different kinds of Web data (e.g., APIs, WebTables, etc.) without extending SPARQL or the mapping languages, as suggested by the related work (ISWC18; ldow18; eswc18). Instead, we extend the OBDA paradigm to support virtual relational data, for which the schema is not known a-priori, i.e., not before a SPARQL query is fired.

Let us model the response $R$ of an API call as a set of sets of (attribute, value) pairs. Let $A$ be the set of all attributes of a response of an API call. For each $a \in A$, we define a mapping $m_a \in M$ that maps $a$ to a virtual predicate $p_a \in I$, where $I$ is the set of IRIs. Then, the value of $a$ defines the object $o_a$ as follows: $o_a \in \{v_a, u(v_a)\}$, where $v_a$ is the value of $a$ and $u(v_a)$ is a URI template populated by the (API) value of $a$, as the object of a triple can either be a literal or a URI. All URI templates are defined in the mappings. Finally, we create a virtual graph $G_V$ that consists of triples of the form $(s, p_a, o_a)$. The evaluation of a SPARQL triple pattern $t$ over a virtual RDF graph $G_V$ on top of an API, given the set of mappings $M$, is the following:

$[\![t]\!]_{G_V} = \{\mu \mid dom(\mu) = var(t) \text{ and } \mu(t) \in G_V\}$

Notably, even though we mention only Web APIs as data sources, our approach applies uniformly to any other non-RDF Web data source as well, such as HTML tables.
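As an illustration of this construction, the sketch below turns an API response into virtual triples. It assumes that the response has already been decoded into a list of dictionaries, that each mapping associates an attribute with a predicate and an optional URI template, and that subjects are built from a template over the record; the function virtual_triples and all IRIs in the usage example are hypothetical.

```python
def virtual_triples(response, mappings, subject_template):
    """Build virtual triples from an API response.

    response: list of dicts, one per record (attribute -> value)
    mappings: attribute -> (predicate, uri_template or None)
    subject_template: format string over the record, e.g. "ex:venue/{id}"
    """
    triples = []
    for record in response:
        subject = subject_template.format(**record)
        for attribute, (predicate, uri_template) in mappings.items():
            if attribute not in record:
                continue  # attribute absent from this record
            value = record[attribute]
            # The object is either the literal value or a URI built from it.
            obj = uri_template.format(value) if uri_template else value
            triples.append((subject, predicate, obj))
    return triples
```

With a record {"id": "v1", "name": "Cafe X", "category": "coffee"} and mappings for name (literal) and category (URI template "ex:category/{}"), the function returns one triple with the literal "Cafe X" and one with the URI ex:category/coffee as objects.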

4.3. Methodology

With the following steps, we can pose SPARQL queries to non-RDF data sources on-the-fly with the help of the virtual table operator:

  1. We construct an ontology that models the data of interest.

  2. We create a virtual table operator for the data source at hand, implementing Algorithm 1, if one is not already available for the kind of data source we want to access (e.g., Twitter API).

  3. We create the mappings, where the source part comprises an extended-SQL query, i.e., an SQL query that uses the virtual table operator for the selected data source along with the respective parameters. The caching parameter f is optionally included as a parameter of the respective virtual tables.

  4. Given the ontology and the mappings, we set up a virtual RDF repository using our extended OBDA system in combination with an SQL engine that is able to process the extended-SQL queries included in the mappings. Note that the selected OBDA system should be (made) “database-agnostic” in the sense that it does not require access to the data beforehand. This feature goes beyond the existing RDB2RDF and OBDA systems, which require that the data to be mapped already reside in a database, to which they connect in order to a-priori extract meta-data (ontop-journal; ultrawrap). In our case, we change the OBDA paradigm so that the data is fetched on-the-fly, after a SPARQL query is fired.

  5. Once a SPARQL query arrives, the OBDA system translates it to SQL. The resulting SQL embeds the virtual table operator(s) involved in the query.

  6. The extended-SQL query, including the virtual table operators it invokes, is evaluated by a system that supports extended-SQL queries and virtual tables. In our case, this system is MadIS. According to the caching parameter f that is defined in the mappings, MadIS decides whether results will be accessed on-the-fly from the data source (Step 6a) or cached results will be returned instead (Step 6b).

  7. Eventually, the query result returns back to the OBDA system to be presented as virtual RDF triples.

  8. If applicable, reasoning is applied to the fetched data (e.g., OWL 2 QL reasoning (motik2009owl) is performed in (ontop-journal)).

Example. The extended-SQL query described in Listing 2 includes the virtual table operator foursqr, which connects to the Foursquare API, retrieves the requested attributes, and populates a virtual table on-the-fly. This is not performed a-priori; the virtual table is populated only when the SQL query is executed. In this way, the most recent version of the data is retrieved, unless the optional parameter f is provided. This parameter defines the length of the window for which cached data can be used. In the case of this example, if the same operator with the same parameters had been executed less than 10 minutes ago, then the cached data would be returned directly.

 select id, category, name,
 hereNow_count as h, contact  from
 (foursqr key:coffee near:Chicago f:10)
Listing 2: Virtual table operator for Foursquare data
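To show how the invocation in Listing 2 separates the operator name, its arguments and the caching parameter, the following Python sketch splits such an invocation into its parts. It is a toy parser written for illustration; parse_invocation is a hypothetical helper and does not reflect how MadIS actually parses operator calls.

```python
def parse_invocation(text):
    """Split '(operator key:value ... f:N)' into name, arguments and cache window."""
    tokens = text.strip().strip("()").split()
    name, params = tokens[0], {}
    for token in tokens[1:]:
        key, _, value = token.partition(":")
        params[key] = value
    # The caching parameter f is optional; None means that no cache is used.
    f = params.pop("f", None)
    return name, params, (float(f) if f is not None else None)
```

Applied to the invocation of Listing 2, the function yields the operator name foursqr, the arguments key and near, and a cache window of 10.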

Given that our approach is generic, we do not associate it with a specific mapping language or OBDA system. Instead, we set the specifications such that, once they are met, any RDB2RDF mapping language or system can implement our approach. Our own implementation is described in Sections 5 and 6.

5. System Architecture and Implementation

In this section, we describe how the above methodology is implemented in our system Ontop4theWeb, which is available as open source as an extension of the system Ontop-spatial (https://github.com/ConstantB/ontop-spatial). The requirements that the system addresses are the following:

  1. To be suitable for data that get updated frequently.

  2. To be compliant with existing W3C standards for querying RDF data (either materialised or virtual), i.e., the W3C standard SPARQL query language should be supported. Applications that build on top of SPARQL should be able to use this system regardless of the underlying implementation and/or the format of the original data sources.

  3. To support different data formats, but to represent them, on the high level, using a uniform schema.

  4. To represent data as virtual RDF terms, thus saving users from converting Web data via a set of tools specialized for parsing, converting and storing data as RDF triples.

The design choices that address these requirements are the following:

  1. We create a virtual table operator for every kind of data source (not for every data source) that we want to represent as a virtual RDF graph. These operators are embedded in SQL queries and, once invoked, they connect to the original data sources and return the data in tabular format (virtual tables).

  2. Virtual table operators include a caching mechanism, which reuses the data retrieved in a previous execution for a time window $f$. The length of this window is given as a parameter so that it can be adjusted according to the requirements of a specific use (e.g., Velocity is typically different for every data source).

  3. Standard SPARQL queries are provided as input to the system.

  4. OWL2 QL ontologies are also provided as input to the system to model the data regardless of the original format of the data source. The W3C standard mapping language R2RML is used to encode how the data from the virtual tables can be mapped to virtual RDF terms. For the first time, R2RML mappings include SQL queries with embedded virtual table operators, thus connecting data that are not materialised in a DBMS to virtual RDF terms.

  5. We extend the OBDA paradigm in order to connect to virtual relational data and, using ontologies and mappings, create virtual RDF graphs on top of them. By enabling on-the-fly SPARQL-to-SQL translation, this data can be queried using SPARQL, in the same way that one could query the data as if it had been converted into RDF and stored in triple stores.

The architecture of Ontop4theWeb is shown in Figure 2. The system consists of the following components:

Figure 2. System architecture for Ontop4theWeb


The back-end is based on MadIS (madis; http://madgik.github.io/madis), a relational database system that relies on SQLite (http://www.sqlite.org), but extends it via the Python wrapper APSW (https://github.com/rogerbinns/apsw). The SQLite database can be extended with user-defined operators that can be used as row, aggregate, or virtual table operators. To this end, MadIS exploits the APSW SQLite wrapper, which provides an interface for implementing such operators in an extensible way through Python. Using APSW, we define our own operators to create virtual tables and populate them with data retrieved from the Web. To query the retrieved data, we use MadQL, the MadIS implementation of the extended-SQL language described in Section 4, which contains the virtual table operators. Instead of using MadIS, we could implement the same virtual table operators in C, extending SQLite directly, but this would be less user-friendly and re-usable than the plug-and-play MadIS Python operators; it would also undermine the modularity and extensibility of Ontop4theWeb.
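The idea of populating a table only at query time can be approximated with the Python standard library alone. The sketch below is not the MadIS or APSW mechanism, which registers true SQLite virtual tables; it simply loads freshly fetched rows into a temporary in-memory table and then runs ordinary SQL over it. The function query_virtual_table, the table name vt and the sample rows in the usage note are assumptions made for the example.

```python
import sqlite3

def query_virtual_table(fetch_rows, columns, sql, table_name="vt"):
    """Fetch rows at query time, expose them as a temporary table and run SQL on it.

    fetch_rows: callable returning an iterable of tuples (stands in for a Web source)
    columns: column names of the temporary table
    sql: query that refers to table_name
    """
    conn = sqlite3.connect(":memory:")
    try:
        column_list = ", ".join(columns)
        conn.execute(f"CREATE TEMP TABLE {table_name} ({column_list})")
        placeholders = ", ".join("?" for _ in columns)
        # The data is retrieved only now, i.e., when the query is posed.
        conn.executemany(
            f"INSERT INTO {table_name} VALUES ({placeholders})", fetch_rows()
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

For example, with rows (1, "Cafe X", 3) and (2, "Cafe Y", 0) over the columns id, name and hereNow_count, the query select name from vt where hereNow_count > 0 returns only Cafe X. Nothing persists after the call, which mirrors the fact that no data is materialised beforehand.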

Third party applications are external micro-services that can be invoked by a virtual table operator in MadIS. For example, a virtual table operator is able to identify the sentiment of a tweet by calling a micro-service that implements a Sentiment Analysis classifier (see Section 6.2). This feature enables Ontop4theWeb to perform data analysis tasks without facing any compatibility issues between the virtual table operator and any data analysis software: the server can be written in any language or platform, but the client can still use it as a service.

Ontop (ontop; https://github.com/ontop/ontop) is a state-of-the-art, open-source OBDA system that supports R2RML and its native mapping language. We extended the MadIS JDBC connector so that it is compatible with Ontop, while Ontop was extended to use MadIS as a back-end. The latter modification is the most significant one, enabling Ontop to operate in a “database-agnostic” manner that supports non-materialized databases and relies on MadIS as back-end. The reason is that Ontop, like all other OBDA systems, connects only to populated and materialized databases, using their data for optimization, before querying them. Instead, Ontop4theWeb retrieves the data to be queried only after the user fires a query, creating a virtual table on-the-fly. As a result, no prior knowledge of the data can be used.

6. Practical Scenarios

We now showcase how we can pose SPARQL queries on data coming from WebTables or REST APIs using Ontop4theWeb.

6.1. HTML tables use case

HTML tables constitute one of the most common tabular formats for publishing data on the Web. A lot of research activities and applications have focused on retrieving, mining, annotating, and semantically enriching information available in WebTables (WebTables). As an example, consider a semantic-based recommendation engine that tries to address the cold-start problem for new users. To make meaningful suggestions for users with an empty profile and no history, it uses the American Film Institute list of the 100 best movies from Wikipedia (http://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies) in combination with the latest list of user reviews from Rotten Tomatoes (http://www.rottentomatoes.com/top/bestofrt/), as shown in Figure 1. This is expressed with the SPARQL query described in Listing 3.

PREFIX wiki: <http://en.wikipedia.org/movies/ontology#>
PREFIX r: <http://www.rottentomatoes.com/top/bestofrt/>
select distinct  ?title ?rrank ?wrank
where {
?s r:title ?title .
?s2 wiki:title ?title .
?s r:rank ?rrank .
?s2 wiki:rank ?wrank  }
Listing 3: Querying WebTables using SPARQL

The SPARQL query provided in Listing 3 retrieves the titles of movies that are included in both tables and the respective ranks. This is performed by executing a join on the “title” column of both tables. We now explain how we can accommodate this application using Ontop4theWeb to query data contained in HTML tables based on ontologies and mappings.

First, we use the virtual table operator WebTable, extending the respective MadIS operator. This operator creates a virtual table and populates it with data contained in the HTML table that is given as input so that this data can be queried using MadQL queries. These queries can then be embedded in mappings as a data source, creating virtual RDF graphs.

The mappings provided in Listing 4 describe how the information contained in these tables is translated into RDF terms. From the Rotten Tomatoes WebTable, we retrieve the rank number of films according to reviews along with the title of the film. From the Wikipedia WebTable, we retrieve the title, the ranks for years 1998 and 2007 and the release date. The aim of this task is to compare and combine two different sources of information (Wikipedia and Rotten Tomatoes) based on the ranks of movies. To retrieve this information, we use the WebTable virtual table operator that parses an HTML table and returns the results as a virtual table. The MadQL query that uses this operator can be seen in both mappings. Its first argument is the HTML page that contains the respective WebTable, while the second one is the index of the WebTable in the page. In our example, we want the 3rd HTML table that appears in the Rotten Tomatoes page and the 2nd one that appears in the respective Wikipedia page.
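To make the role of the two arguments concrete, the following Python sketch selects the n-th table of an HTML page and returns its rows. It is only an illustration and not the MadIS operator: the function name web_table, the 1-based index and the treatment of the first row as the header are assumptions made here.

```python
from html.parser import HTMLParser


class _TableCollector(HTMLParser):
    """Collects the rows of every top-level <table> of an HTML page."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per top-level table
        self._depth = 0    # nesting depth of <table> elements
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self.tables.append([])
        elif self._depth == 1 and tag == "tr":
            self._row = []
        elif self._depth == 1 and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            if self._depth > 0:
                self._depth -= 1
        elif self._depth == 1 and tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif self._depth == 1 and tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def web_table(html, index):
    """Return the rows of the index-th table (1-based, an assumption) as dicts."""
    collector = _TableCollector()
    collector.feed(html)
    rows = collector.tables[index - 1]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]
```

For a page held in a string, web_table(page, 2) would return the body rows of its second table, keyed by the header cells.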

Note that the Rotten Tomatoes website includes the release date of every film in parentheses next to the film title, while the Wikipedia table provides it in a separate column. Since we want to join the two WebTables on the “Title” field, we align this attribute so that it has the same format in both tables. To achieve this, we concatenate the columns “Title” and “Release year” of the Wikipedia table so that the format of the resulting title is exactly the same as the one in the Rotten Tomatoes WebTable.

[MappingDeclaration] @collection [[
mappingId WebTable_rotten_tomatoes
target rot:{rank} rot:rank {rank};
       rot:title {Title};
       rot:reviews  {reviews}^^xsd:int;
       rot:rating {RatingTomatometer}^^xsd:int .
source select  rid as rank,
       "No. of Reviews" as reviews,
       Title, RatingTomatometer from
mappingId WebTable_wikipedia
target wiki:{rid}  wiki:title {Title};
       wiki:rank98 {rank98}^^xsd:int ;
       wiki:rank {rank} .
source  select rid, rank,Title from
       (select rid,Film||" ("||"Release year"||")"
       as Title,"2007 rank" as rank from WebTable(
Listing 4: Mapping for WebTable data
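The title alignment used in the second mapping can be tried in isolation. The sketch below uses Python's sqlite3 module as a stand-in for MadIS, and two small in-memory tables with invented film names and ranks in place of the real WebTables; only the concatenation expression follows the mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-ins for the two WebTables; all values are invented.
    CREATE TABLE wikipedia ("Film" TEXT, "Release year" TEXT, "2007 rank" INTEGER);
    CREATE TABLE rotten (Title TEXT, rrank INTEGER);
    INSERT INTO wikipedia VALUES ('Film A', '1950', 1);
    INSERT INTO wikipedia VALUES ('Film B', '1960', 2);
    INSERT INTO rotten VALUES ('Film A (1950)', 5);
    INSERT INTO rotten VALUES ('Film C (2001)', 6);
""")

# Build "Film (Release year)" on the Wikipedia side so that both
# title columns share one format, then join on the aligned title.
rows = conn.execute("""
    SELECT r.Title, r.rrank, w.wrank
    FROM rotten AS r
    JOIN (SELECT Film || ' (' || "Release year" || ')' AS Title,
                 "2007 rank" AS wrank
          FROM wikipedia) AS w
      ON r.Title = w.Title
""").fetchall()
```

Only the film present in both tables survives the join, which is the behaviour the SPARQL query in Listing 3 relies on.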

6.2. Twitter Use Case

Twitter is a popular social network whose popularity is increasing to the extent that many people use it as a news stream (twitter). Collecting its data is important for many academic and commercial activities to perform data mining, integration, and analysis tasks (twitteranalysis). Twitter data sources have the following characteristics (Velocity): (i) They get frequently updated (about 8,000 tweets are posted per second and around 700M are posted per day, according to http://www.internetlivestats.com/twitter-statistics/). (ii) They are more important when they are fresh, since the primary use of Twitter is to find out information about what is happening now. (iii) They are frequently used by data scientists as input datasets to data analysis and data mining tasks (e.g., sentiment analysis (sentanalysis)).

Typically, users write crawlers to retrieve Twitter data and store it in files or in a database. Since the Twitter API has a limit of 100 tweets per request, the crawlers perform multiple requests and accumulate data over a large period of time. Let us now imagine a semantic-based application that tracks user-generated content about active academic events, collecting the latest relevant tweets and processing them with a sentiment analysis service that identifies their polarity. For example, it uses the SPARQL query described in Listing 5 to retrieve positive tweets about EDBT 2020.

select distinct ?s
where {
?s twitter:tweetsAbout
<https://diku-dk.github.io/edbticdt2020> .
?s twitter:sentiment "positive"}
Listing 5: Querying Twitter using SPARQL

Traditionally, this SPARQL query would be answered through the following steps: (i) retrieve the relevant Twitter data, (ii) transform it into RDF, and (iii) store it in an RDF store. Alternatively, one would store the data in a database, using an OBDA system to query it with mappings. The sentiment analysis task would be performed as a pre- or post-processing step.

In contrast, using Ontop4theWeb requires fewer steps for answering this query. After a SPARQL query like the one described above is fired, a virtual table is created, containing information about every tweet along with its sentiment, i.e., whether it is positive or negative. Then, this information gets mapped into virtual RDF terms. To this end, we implemented a virtual table operator that (i) searches data using the Twitter REST API, (ii) uses a binary classifier to identify whether each tweet is positive or negative, and (iii) populates a virtual table with the results. Its data are then accessed via MadQL queries that can be incorporated in mappings so that virtual RDF triples can be produced on-the-fly. An exemplary mapping appears in Listing 6.

[MappingDeclaration] @collection [[
mappingId       twitter_mapping
target          twitter:{username} twitter:tweetsAbout
                twitter:sentiment {sentiment}.
source          select distinct id, sentiment
                from (twitterapi key:edbt2020) ]]
Listing 6: Mappings for Twitter data

The source part of this mapping contains a MadQL query that uses the virtual table operator named twitterapi. This virtual table operator takes as input a search keyword, which in our example is edbt2020. The result of this query is the creation of a virtual table with information about tweets for EDBT 2020. Note that the attribute sentiment is not part of the data retrieved from Twitter API, but is derived from the sentiment analysis classifier that is used internally, in the twitterapi virtual table operator.
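The three steps performed by the operator (search, classify, populate) can be sketched as a Python generator. This is not the twitterapi implementation: the search and classification functions are injected stubs invented for this example, no call to the Twitter API is made, and the sample tweets are fabricated placeholders.

```python
def twitterapi_rows(keyword, search, classify):
    """Yield one (id, sentiment) row per tweet that matches the keyword."""
    for tweet in search(keyword):
        # The sentiment column is derived locally, not returned by the API.
        yield tweet["id"], classify(tweet["text"])


def fake_search(keyword):
    """Hypothetical stand-in for a call to a tweet search endpoint."""
    sample = [
        {"id": 1, "text": "Great keynote at edbt2020"},
        {"id": 2, "text": "The coffee queue at edbt2020 was awful"},
        {"id": 3, "text": "Unrelated post"},
    ]
    return [tweet for tweet in sample if keyword in tweet["text"]]


def fake_classify(text):
    """Hypothetical stand-in for the binary sentiment classifier."""
    return "positive" if "great" in text.lower() else "negative"


rows = list(twitterapi_rows("edbt2020", fake_search, fake_classify))
```

Each yielded pair corresponds to one row of the virtual table with the columns id and sentiment that the mapping selects.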

In this context, the SPARQL query in Listing 5 is translated into the SQL query in Listing 7.

1 AS "sQuestType", NULL AS "sLang", (’http://twitter.com/’ ||
   ’!’, ’%21’),...)) AS "s" FROM
(select distinct id,
sentiment from (twitterapi key:edbt2020)) QVIEW1,
(select distinct id,
sentiment from (twitterapi key:edbt2020)) QVIEW2
(QVIEW1.id = QVIEW2.id) AND
(QVIEW2.sentiment = positive’)) SUB_QVIEW;
Listing 7: SQL query for the virtual table of Twitter

This query contains the virtual table operator twitterapi that creates a virtual table. The columns id and sentiment of this table populate a view that is created on-the-fly by the OBDA system. In traditional OBDA systems, the views are constructed on-the-fly from existing, materialized tables (or other views). In Ontop4theWeb, this table does not exist, but is created and populated on-the-fly, after the SPARQL query is fired and translated into MadQL. The MadQL query will create and populate the table, but this procedure is completely invisible to the user: exactly the same SPARQL query would be used even if the data did not come from a REST API, but was stored in a database, or a triple store.

To classify each tweet according to its polarity, we employed an open-source sentiment classifier for Twitter (https://github.com/dkakkar/Twitter-Sentiment-Classifier), which uses an SVM model that is already trained with the following datasets: (i) the Stanford Sentiment140 dataset (http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip), (ii) the Polarity Dataset v2.0 (http://www.cs.cornell.edu/people/pabo/movie-review-data/), and (iii) a dataset from the University of Michigan (https://inclass.kaggle.com/c/si650winter11) that contains 7,086 sentences extracted from various social media.

We have modified this classifier so that it follows a client-server model, where the server and the client communicate through a socket. In this way, we avoid incorporating the whole classifier into the virtual table operator and save the cost of loading the classifier every time the virtual table operator is invoked. When the server starts, it loads the classifier and waits for connection. The client part is incorporated into the twitterapi virtual table operator and sends every tweet of the results to the server for classification through a socket. The server performs sentiment analysis and returns whether the tweet is positive or negative. The result is returned as an additional column of the produced virtual table, called sentiment.
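The client-server arrangement described above can be sketched with Python's standard socket modules. The sketch makes several assumptions that are not taken from the actual implementation: a line-based protocol (one tweet per line, one label per line), a trivial keyword rule in place of the trained SVM, and the helper names load_classifier, start_server and classify_remote.

```python
import socket
import socketserver
import threading


def load_classifier():
    """Stand-in for loading a trained model (hypothetical keyword rule)."""
    return lambda text: "positive" if "great" in text.lower() else "negative"


class ClassifierHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one tweet per line and answer with one label per line.
        for line in self.rfile:
            text = line.decode("utf-8").rstrip("\n")
            label = self.server.classify(text)
            self.wfile.write((label + "\n").encode("utf-8"))


def start_server(host="127.0.0.1", port=0):
    """Load the classifier once, then serve requests in a background thread."""
    server = socketserver.ThreadingTCPServer((host, port), ClassifierHandler)
    server.classify = load_classifier()
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def classify_remote(address, tweets):
    """Client side: send every tweet to the server and collect the labels."""
    labels = []
    with socket.create_connection(address) as sock:
        with sock.makefile("rwb") as stream:
            for tweet in tweets:
                stream.write((tweet.replace("\n", " ") + "\n").encode("utf-8"))
                stream.flush()
                labels.append(stream.readline().decode("utf-8").strip())
    return labels
```

Because the model is loaded in start_server, repeated invocations of the client pay only the cost of the socket round trip, which is the saving the paragraph above refers to.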

6.3. Foursquare Use Case

Foursquare (https://foursquare.com) is a mobile application that offers location-based search for venues with multiple criteria (e.g., nearby restaurants ranked by rating or distance). The descriptions of these venues are enriched with user reviews and ratings, thus facilitating location recommendations. Foursquare also allows users to share their location with their friends, informs them how many other users are simultaneously at the same location, and alerts them when many people check in at the same time in a place nearby.

Having around 55 million monthly users and a platform that contains crowd-sourced information for 105 million venues worldwide (according to its website), Foursquare has become a useful data source for various applications. Developers can access its API (https://developer.foursquare.com/) and get part of this information for free (e.g., venue description, location, rating, check-ins), while more data is available for a fee. Foursquare has approximately 40,000 registered developers using its API (https://en.wikipedia.org/wiki/Foursquare#Foursquare_API). As an example of applications built on top of Foursquare, consider the “Mr Jitters” app, which uses Foursquare data to find the best coffee places nearby (https://developer.foursquare.com/docs/sample-apps).

Semantic Web agents could also exploit this valuable data source. Imagine a Semantic Web alternative to “Mr Jitters” that uses Foursquare venues as RDF so as to interlink them with datasets from the linked open data cloud (e.g., LinkedGeodata). Supposing that it searches for information about coffee places in Chicago, it would pose the SPARQL query described in Listing 8.

select ?venue ?checkins
where {
?venue four:name ;
four:hereNow ?checkins;
four:category "Coffee";
four:near "Chicago"}
Listing 8: Querying Foursquare using SPARQL

Ontop4theWeb can be used to map the free Foursquare data to virtual RDF graphs and perform this query on top of them. First, we create an ontology that describes all venue categories that appear in the Foursquare venue category taxonomy (https://developer.Foursquare.com/docs/resources/categories). The resulting ontology (http://pyravlos-vm5.di.uoa.gr/foursquare.owl) contains a rich hierarchy of 961 classes that represent venue categories, enabling reasoning over it.

Next, we implement a virtual table operator, called foursqr, which receives as input some keywords for searching venues and returns as output a list of venues. The operator is implemented as a Python MadIS virtual table operator, which internally uses a Python library for the Foursquare API (https://github.com/mLewisLogic/foursquare.git). When the Foursquare virtual table operator is invoked, it accesses the Foursquare API with the input parameters as arguments, and the result is presented as a virtual table that in turn gets mapped into RDF terms using the mapping described in Listing 9.

[PrefixDeclaration] four: http://foursquare.com/
[MappingDeclaration] @collection [[
 mappingId foursquare_mapping
 target    four:{id} four:hasID {id} ;
           four:name {name} ;
           four:hereNow {h}^^xsd:integer;
           four:category four:{category};
           four:near "Chicago"  .
 source    select id, category, name,
           hereNow_count as h, contact
           from (foursqr key:coffee near:Chicago) ]]
Listing 9: Mapping for Foursquare data

In this mapping, we want to retrieve coffee places in Chicago. The foursqr operator used in the MadQL query of the mapping takes the respective parameters as input. It generates a virtual table populated with information about coffee places in Chicago. The target part of the mapping encodes how these attributes are translated into RDF terms according to the Foursquare Ontology.
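To illustrate what the target part does with each row of the virtual table, the Python sketch below instantiates the triple templates of Listing 9 for a single row. The helper names literal and foursquare_triples and the sample row are invented for this example; in Ontop4theWeb the triples remain virtual and are never produced in this explicit form.

```python
def literal(value, datatype=None):
    """Render a value as a quoted literal, optionally with a datatype."""
    text = '"' + str(value) + '"'
    return text + "^^" + datatype if datatype else text


def foursquare_triples(row, near="Chicago"):
    """Instantiate the triple templates of the mapping for one row."""
    subject = "four:" + str(row["id"])
    return [
        (subject, "four:hasID", literal(row["id"])),
        (subject, "four:name", literal(row["name"])),
        (subject, "four:hereNow", literal(row["h"], "xsd:integer")),
        (subject, "four:category", "four:" + str(row["category"])),
        (subject, "four:near", literal(near)),
    ]


# A made-up row, shaped like the output of the source query.
example_row = {"id": "v42", "name": "Some Cafe", "h": 3, "category": "Coffee"}
triples = foursquare_triples(example_row)
```

The subject IRI and the category IRI are built from column values through the four: prefix, whereas the remaining objects are literals, as in the mapping.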

7. Experimental evaluation

7.1. Experimental setup

Execution environment. All experiments were run on a PC with an Intel Core™2 Quad CPU Q9650 at 3.00GHz, 8GB RAM and Ubuntu 14.04. In all experiments, we measure the query execution time, which includes a full iteration over the result set. We repeat every execution 3 times and consider the average running time. We execute all experiments in both cold and warm cache. In warm cache, we execute a query once before all executions of the same query that we measure. In cold cache, we configure all virtual tables that are involved so that they do not use the caching mechanism described in Section 4 (i.e., we set the rate parameter of the virtual table operator in the mappings to a negative value). These two configurations allow for measuring the impact of the caching mechanism on the query execution time.
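The measurement protocol can be sketched in Python as follows. The sketch assumes that the rate parameter acts as a maximum age in seconds for cached rows and that a negative value disables caching; only the latter is stated above. The class CachedVirtualTable and the function average_time are hypothetical names.

```python
import time


class CachedVirtualTable:
    """Wraps a fetch function with an optional cache controlled by `rate`."""

    def __init__(self, fetch, rate):
        self.fetch = fetch
        self.rate = rate          # negative: never reuse rows (cold cache)
        self._rows = None
        self._fetched_at = None

    def rows(self):
        now = time.monotonic()
        fresh = (
            self.rate >= 0
            and self._rows is not None
            and now - self._fetched_at <= self.rate
        )
        if not fresh:
            self._rows = self.fetch()
            self._fetched_at = now
        return self._rows


def average_time(run, repetitions=3, warm_up=False):
    """Average wall-clock time of `run`, optionally after one unmeasured call."""
    if warm_up:
        run()
    total = 0.0
    for _ in range(repetitions):
        start = time.perf_counter()
        run()
        total += time.perf_counter() - start
    return total / repetitions
```

With a negative rate every measured run fetches the data again, whereas with a warm-up call and a sufficiently large rate the three measured runs reuse the rows fetched during the warm-up.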

Data sources and queries. We query data from WebTables and REST APIs, using queries that are similar to the ones described in Section 6. More specifically, we pose queries for tweets that contain the EDBT2020 hashtag, retrieving also the sentiment for each tweet. We look for coffee places in Chicago from Foursquare and we join two HTML tables with films, one from Wikipedia and one from Rotten Tomatoes. The mappings and part of the queries that we used are explained in Section 6. For each data source, we begin with a query that involves a single triple pattern and then, we increment the number of triple patterns to increase the complexity of the query.

To evaluate the scalability of Ontop4theWeb, we use synthetic WebTables. We employed an original Wikipedia table about Italian election opinion polls (https://en.wikipedia.org/wiki/Opinion_polling_for_the_Italian_general_election,_2018) as a template, which we multiplied so that we can execute queries for tables with 10, 100, 1,000, 10,000, and 100,000 rows. Then, we posed the same queries over these tables in order to measure the scalability of Ontop4theWeb.
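One simple way to multiply a template table up to a target size is to cycle through its rows, as in the Python sketch below. The function name replicate_rows and the template rows are invented for this example, and the actual procedure used to build the synthetic WebTables may differ.

```python
from itertools import cycle, islice


def replicate_rows(template_rows, target_size):
    """Repeat the template rows in order until target_size rows exist."""
    if not template_rows:
        raise ValueError("the template table must contain at least one row")
    return list(islice(cycle(template_rows), target_size))


# Invented template rows with two columns.
template = [("day 1", "1.5"), ("day 2", "2.0"), ("day 3", "0.5")]
tables = {size: replicate_rows(template, size) for size in (10, 100, 1000)}
```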

7.2. Experimental Results

Figure 3. Query execution time as dataset size increases.
Figure 4. Execution times for real workload queries.

Real workload. The query execution times of the real workload experiments in both cold and warm cache are presented in Figure 4. The label of each query is suffixed by the number of triple patterns it incorporates (e.g., Q2 indicates two triple patterns).

We observe that the execution times in warm cache are at least an order of magnitude lower than in cold cache and that they remain stable, regardless of the query complexity. In contrast, as the number of triple patterns in the queries increases, the execution time in cold cache increases considerably. This happens because more triple patterns yield more joins in the translated SQL query. When these joins produce more intermediate results, instead of filtering them down, they introduce additional cost in the evaluation. In other words, we add triple patterns to retrieve more information, rather than to pose restrictions. The main reason is that the data is not materialized in the database and, thus, the OBDA system is not aware of database constraints, or other hints that could accelerate SQL translation and execution, as described in (ontop).

Note also that all film queries use two data sources, joining the WebTables described in Section 6.1 to retrieve the movies that are common between the two tables. This involves a higher cost than the Twitter and Foursquare use cases, as the query execution time also includes the time to parse the HTML table(s).

Synthetic workload. The goal of our scalability analysis is to assess the maximum size of input data that we can query efficiently. We used two queries with two triple patterns posed against the synthetic WebTables of varying size. The first query, which is provided in Listing 10, is not selective, returning as many results as the rows of the table. The second query, which is described in Listing 11, is very selective, returning two results in all cases.

select distinct ?s1 ?d ?l
where {
?s1 :date ?d .
?s1 :lead ?l .}
Listing 10: Query of low selectivity for WebTables

select distinct ?s1 ?d
where {
?s1 :date ?d .
?s1 :lead "1.5"^^<http://www.w3.org/2001/XMLSchema#float> . }
Listing 11: Query of high selectivity for WebTables

The outcomes of the scalability test appear in Figure 3. We observe that as the number of rows in a WebTable increases, the query execution time increases superlinearly, but the extent of this increase depends heavily on the selectivity of the query. We observe, though, that Ontop4theWeb can process queries against WebTables with up to 100,000 rows within minutes, when the selectivity is high.

7.3. Comparison with the state-of-the-art

7.3.1. Qualitative comparison

We now compare Ontop4theWeb with the SERVICE-to-API system (eswc18). (We also attempted to compare Ontop4theWeb with the work described in (ldow18), but we could not build an instance of their platform by following the online instructions.) Recall that the goal of SERVICE-to-API is to enrich RDF data with data from external sources, such as REST APIs. Thus, its query language requires at least one triple pattern that is evaluated in the RDF repository, and its variables are bound to values that populate the URI templates. Every variable binding actually yields a separate API call. A cache mechanism aims to minimize the API calls.

An example of its query language is presented in Listing 12. The value of the SERVICE keyword creates a URI template for each one of the values bound to the variable ?l, which is used in the query’s triple pattern. In this case, a call to the Yelp API is produced for each binding of the variable ?l, returning a JSON file. This JSON file is parsed according to the JSON pattern included in the query, which binds the variables ?i, ?name and ?rating to the values of the respective attributes of the JSON file.

SELECT   ?i ?name ?rating WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?l .
SERVICE <https://api.yelp.com/v3/businesses/{l}>{
( $.["id"], $.["name"],
$.["rating"]) AS (?i, ?name, ?rating)}}
Listing 12: SERVICE-to-API query, equivalent to SPARQL query Q1 in Listing 13
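The evaluation strategy described for Listing 12, namely one API call per binding followed by extraction of the JSON attributes, can be sketched in Python. This is not the SERVICE-to-API implementation: the function call_per_binding, the placeholder URL on example.org and the stubbed HTTP function are assumptions, and no request is sent to the Yelp API.

```python
import json


def call_per_binding(labels, template, http_get):
    """Instantiate the URI template once per binding and parse each reply."""
    rows = []
    for label in labels:
        url = template.replace("{l}", label)   # fill the {l} placeholder
        document = json.loads(http_get(url))   # one API call per binding
        rows.append((document["id"], document["name"], document["rating"]))
    return rows


requested = []


def fake_http_get(url):
    """Hypothetical stand-in for an HTTP GET that returns a JSON string."""
    requested.append(url)
    name = url.rsplit("/", 1)[-1]
    return json.dumps({"id": "id-" + name, "name": name, "rating": 4.0})


result = call_per_binding(
    ["burger-a", "burger-b"],
    "https://api.example.org/businesses/{l}",
    fake_http_get,
)
```

The list requested grows by one URL per binding, which makes the cost model of this strategy visible.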

In this context, there are two major qualitative differences between Ontop4theWeb and SERVICE-to-API (eswc18):

  1. The query language. For SERVICE-to-API, the JSON attributes are directly bound to variables by parsing the JSON response, as instructed by the JSON patterns included in the query. As a result, the users need to know the documentation of the API in order to identify the information they need. Only in this way are they able to combine API data with the RDF data in the triplestore, formulating accurate queries that extend SPARQL with JSON patterns (eswc18). In contrast, Ontop4theWeb creates virtual semantic graphs on top of APIs using mappings, thus allowing users to pose standard SPARQL queries as if the contents of the APIs were transformed into RDF. The trade-off for not having to convert, materialise and store the data into an RDF store is the use of mappings. For any virtual Ontop4theWeb RDF repository, a mapping file should be provided. Writing the mappings can be an overhead, but this approach has the following advantages: (i) mappings need to be written only once, unless the schema changes; (ii) the mapping language R2RML is a W3C standard, as is the SPARQL query language, which ensures compatibility with applications built on top of SPARQL; (iii) materialisation is not avoided in SERVICE-to-API either, since part of the data needs to be stored in a triple store.

  2. Every API call in Ontop4theWeb retrieves an entire virtual table, which is mapped to a virtual RDF graph. In contrast, SERVICE-to-API merely retrieves one entry of this table per API call, which has a significant impact on time efficiency, as explained in the following experiments. However, both systems use a caching mechanism.

Figure 5. Execution times for Yelp queries in warm and cold cache.
Figure 6. API calls for Yelp queries in warm and cold cache.

7.3.2. Quantitative comparison

For this comparison, we consider data retrieved from the REST API of Yelp (https://www.yelp.ie/dublin), as SERVICE-to-API does not apply to WebTables. We chose the Yelp API, as it is the only data source for which both systems offer the same functionality (our Twitter operator involves a microservice for performing sentiment analysis). However, the findings of this experiment are representative of the general behaviour of the two systems with respect to any Web API.

For SERVICE-to-API, we stored data about businesses (burger joints in Chicago) in an RDF repository, because it does not support queries that include API calls without triple patterns. Then, we used the SERVICE keyword to join them with their names and IDs that are retrieved from the REST API of Yelp. Note that we used the original implementation of SERVICE-to-API, which was kindly provided to us by the authors of (eswc18). For Ontop4theWeb, we implemented a virtual table operator for Yelp and posed the SPARQL query Q1, which appears in Listing 13, to retrieve the same data.

SPARQL query Q2 contains one more triple pattern (i.e., we also retrieve the rating of businesses) and it is described in Listing 14. There are different ways to express this query in SERVICE-to-API, depending on the configuration of the repository. The closest definition seems to be the query shown in Listing 19, which is query Q5. However, the fact that it returns different results suggests that this is not the case. Instead of returning the name and rating of the requested businesses, it returned the Cartesian product of all different burger businesses and all different rating values. So, if the SPARQL query Q2 is expected to return n results, the query in Listing 19 returns n × m results, where m is the number of different rating values. We could briefly describe this phenomenon as a difference in semantics between SPARQL and the new language proposed in (eswc18).

Despite this significant difference between the two systems, we want to perform an exhaustive evaluation that highlights their pros and cons. To this end, we created and evaluated all different variations of configurations for the standard SPARQL queries Q1 and Q2. We explain the differences in the SERVICE-to-API queries below.

In SERVICE-to-API query Q1 (Listing 15), we have stored the names of businesses, so we only need to retrieve the IDs from the Yelp API. In SERVICE-to-API query Q2 (Listing 16), we want to retrieve both pieces of information from the Yelp API. In both cases, we want to retrieve the names and IDs of burger businesses in Chicago, so both queries are supposed to be equivalent to query Q1. However, these queries do not return the same results. SERVICE-to-API query Q1 returns the same results as the standard SPARQL query Q1 that was evaluated in our system, but SERVICE-to-API query Q2 returned many false positives. These false positives were produced because the values that are bound to the variables involved in the query do not get joined, unlike the case in which the names are materialised, as in SERVICE-to-API query Q1.

We did the same for SPARQL query Q2, which contains one more triple pattern in its standard SPARQL representation. Once we have at least one triple for each entity stored, we retrieve only the missing values using the SERVICE-to-API queries Q3 and Q4, which are described in Listings 17 and 18, respectively. In this way, SERVICE-to-API returns the correct results, since the underlying triple store is forced to perform a JOIN between the materialized values and the values that are returned from the API, instead of a Cartesian product. The trade-off, on the other hand, is that SERVICE-to-API cannot pose a query to retrieve the results directly through the API, as some form of materialization needs to be performed in order to retrieve correct results.
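The difference between joining the values per entity and taking the Cartesian product of independently bound variables can be reproduced on a toy example in Python. The three businesses and their ratings are invented; they only serve to show how the number of results changes.

```python
from itertools import product

# Invented (id, name, rating) records; two distinct rating values.
businesses = [
    ("b1", "Burger A", 4.5),
    ("b2", "Burger B", 3.0),
    ("b3", "Burger C", 4.5),
]

# Join semantics: every name stays paired with its own rating.
joined = {(name, rating) for _, name, rating in businesses}

# Product semantics: names and ratings are bound independently,
# so every name is combined with every distinct rating.
names = {name for _, name, _ in businesses}
ratings = {rating for _, _, rating in businesses}
crossed = set(product(names, ratings))

false_positives = crossed - joined
```

Here the join yields 3 pairs, whereas the product yields 3 × 2 = 6 pairs, of which 3 are false positives.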

SERVICE-to-API query Q6 (Listing 20) differs from query Q5 only in that it uses the BIND operator instead of a triple pattern (i.e., instead of storing the respective triple in a triple store). We performed this experiment in order to execute a query that materialises nothing, like the query that is executed in Ontop4theWeb, where nothing is materialised in a database. The results of this query were eventually the same as the results of the SERVICE-to-API query Q5.

select distinct ?id ?name
where {
 ?s yelp:name ?name .  ?s yelp:hasID ?id }
Listing 13: SPARQL query Q1

select distinct ?id ?name
where { ?s yelp:name ?name .
 ?s yelp:rating ?rating .
 ?s yelp:hasID ?id }
Listing 14: SPARQL query Q2

SELECT distinct  ?i ?name  WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?l .
?x <http://yelp.com/ontology#name> ?name .
SERVICE <https://api.yelp.com/v3/businesses/{l}>{
( $.["id"]) AS (?i)}}
Listing 15: SERVICE-to-API query Q1 (eq. to SPARQL Q1)

SELECT distinct ?id ?name  WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?l  SERVICE
$.["businesses"][0:20]["name"]) AS (?id, ?name)}
Listing 16: SERVICE-to-API query Q2 (eq. to SPARQL Q1)

SELECT distinct  ?i ?name ?rating  WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?l .
?x <http://yelp.com/ontology#name> ?name .
SERVICE <https://api.yelp.com/v3/businesses/{l}>{
( $.["id"],
$.["businesses"][0:20]["rating"] ) AS (?id, ?r) }}
Listing 17: SERVICE-to-API query Q3 (eq. to SPARQL Q2)

SELECT distinct  ?i ?name    WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?l .
 ?x <http://yelp.com/ontology#name> ?name .
?x <http://yelp.com/ontology#rating> ?rating .
SERVICE <https://api.yelp.com/v3/businesses/{l}>{
( $.["id"]) AS (?i)}}
Listing 18: SERVICE-to-API query Q4 (eq. to SPARQL Q2)

SELECT distinct ?id ?b  WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?l
AS (?id, ?b, ?r)}
Listing 19: SERVICE-to-API query Q5 (eq. to SPARQL Q2)

 SELECT distinct ?id ?b ?r
 WHERE {
 bind("Chicago" as ?l)
 { ($.["businesses"][0:20]["name"],
 $.["businesses"][0:20]["rating"] )
 AS (?id, ?b, ?r)} }
Listing 20: SERVICE-to-API query Q6 (eq. to SPARQL Q2)

Response time. We evaluated these queries in both systems and we present the results in Figures 5 and 6. The former depicts the query execution times and the latter the number of API calls invoked. In both cases, we consider warm and cold caches (on the left and right, respectively). We observe that Ontop4theWeb is three times faster than SERVICE-to-API. The main reason is that Ontop4theWeb retrieves a set of tuples for each API call, which are then mapped into virtual RDF graphs. In contrast, SERVICE-to-API retrieves one entry for each API call, yielding many more API calls in order to get the same information.

Another observation is that Ontop4theWeb by design benefits more from caching than the system in comparison. We cache the entire table for each API call, while SERVICE-to-API performs an API call for each tuple, which means that only one tuple is cached each time. Hence, for a result set consisting of n tuples, Ontop4theWeb will cache the entire result set as a virtual table that is retrieved from a single API call. In contrast, SERVICE-to-API needs at least n calls, of which at most one will be cached.
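The difference in the number of API calls can be made explicit with a small counting experiment in Python. The class CountingApi and its two endpoints are hypothetical; they mimic an API that can either return a whole result list or a single entry, and they do not model either system's cache.

```python
class CountingApi:
    """Toy API that counts how many requests it receives."""

    def __init__(self, businesses):
        self.businesses = businesses
        self.calls = 0

    def search(self):
        """Return all entries in a single call (table-at-a-time access)."""
        self.calls += 1
        return list(self.businesses)

    def lookup(self, name):
        """Return a single entry per call (tuple-at-a-time access)."""
        self.calls += 1
        return next(b for b in self.businesses if b["name"] == name)


# Twenty invented businesses.
data = [{"name": "burger-%d" % i, "rating": 4.0} for i in range(20)]

table_api = CountingApi(data)
table_rows = table_api.search()

tuple_api = CountingApi(data)
tuple_rows = [tuple_api.lookup(b["name"]) for b in data]
```

Both strategies end up with the same 20 rows, but the first one issues 1 call and the second one issues 20.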

One could argue that the comparison between the two systems might not be fair, as the two systems have differences (e.g., they implement different languages). However, our motivation for these experiments was to compare the performance and functionality of Ontop4theWeb with a system that offers similar functionality, answering the following question: "If Ontop4theWeb was not in place, what would be the system that we would use in order to have similar functionality?" SERVICE-to-API was the only alternative in this direction.

Another argument could be the fact that, given implementation internals of SERVICE-to-API (e.g., retrieving tuples instead of virtual tables), the results of the experimental evaluation are reasonable. However, these internal implementation details were not obvious until we executed the experiments. Our experiments highlighted these issues and led us to discover these differences in the design and implementation that are the cause of the performance results presented above.

Accuracy. We now investigate how accurate the results of both systems are when posing the queries described in the previous section. Table 1 shows the precision of the results returned by the two systems, Table 2 presents the recall, Table 3 the accuracy and Table 4 the F1-score of all six queries that were evaluated.

We observe that neither system produces false negatives, so the recall is always 1. SERVICE-to-API produces false positives that reduce the system’s precision and accuracy and, inevitably, its F1-score. The reason is that, as discussed in previous sections, it returns the Cartesian product of the bindings of all variables involved in an API call.
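For reference, precision, recall and F1-score can be computed from the expected and the returned result sets as in the Python sketch below. The function name and the example sets are illustrative only; the sizes are chosen for the example and are not the measurements behind Tables 1 to 4.

```python
def precision_recall_f1(expected, returned):
    """Compute precision, recall and F1-score of a returned result set."""
    expected, returned = set(expected), set(returned)
    true_positives = len(expected & returned)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative case: 20 correct pairs hidden in a 20 x 20 Cartesian product.
expected = {(i, i) for i in range(20)}
returned = {(i, j) for i in range(20) for j in range(20)}
precision, recall, f1 = precision_recall_f1(expected, returned)
```

Since every expected pair is contained in the product, the recall stays at 1 while the precision drops to 20/400, which shows how false positives alone lower the F1-score.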

Summary. The findings of this experiment show that not only is Ontop4theWeb more efficient in terms of response time in comparison with the current state-of-the-art, but it also produces accurate results in all cases. Our experiments demonstrate that in order to obtain correct results from SERVICE-to-API, one needs to partially store the data and use a REST API complementarily. Even in this case, however, the functionality that is offered is a subset of the functionality that is offered by our system, while the execution time of queries, even with all optimisations enabled, is considerably larger.

System          Q1  Q2    Q3    Q4  Q5     Q6
SERVICE-to-API  1   0.05  0.05  1   0.016  0.016
Ontop4theWeb    1   1     1     1   1      1
Table 1. Precision

System          Q1  Q2    Q3    Q4  Q5     Q6
SERVICE-to-API  1   1     1     1   1      1
Ontop4theWeb    1   1     1     1   1      1
Table 2. Recall

System          Q1  Q2    Q3    Q4  Q5     Q6
SERVICE-to-API  1   0.05  0.05  1   0.016  0.016
Ontop4theWeb    1   1     1     1   1      1
Table 3. Accuracy

System          Q1  Q2    Q3    Q4  Q5     Q6
SERVICE-to-API  1   0.09  0.09  1   0.015  0.015
Ontop4theWeb    1   1     1     1   1      1
Table 4. F1-score

8. Conclusions

This paper presents Ontop4theWeb, a novel system for querying Web data on-the-fly using SPARQL. Ontop4theWeb extends SQL with virtual table operators, embeds them into mappings and makes an OBDA system compliant with them. Our extensive experimental evaluation verified that Ontop4theWeb goes beyond the state of the art, not only in terms of functionality, but also in terms of performance. Our approach complements traditional approaches of querying data using SPARQL, accommodating the Variety and Velocity of Web data.

In the future, we will use our system as a framework to solve more research problems that include data analysis tasks and make the results available as virtual RDF triples on-the-fly. We will also exploit the extensibility of our system to support more use cases, such as creating virtual RDF graphs on top of XML documents.