SchenQL -- A Domain-Specific Query Language on Bibliographic Metadata

06/14/2019
by   Christin Katharina Kreutz, et al.
0

Information access needs to be uncomplicated, users rather use incorrect data which is easily received than correct information which is harder to obtain. Querying bibliographic metadata from digital libraries mainly supports simple textual queries. A user's demand for answering more sophisticated queries could be fulfilled by the usage of SQL. As such means are highly complex and challenging even for trained programmers, a domain-specific query language is needed to provide a straightforward way to access data. In this paper we present SchenQL, a simple query language focused on bibliographic metadata in the area of computer science while using the vocabulary of domain-experts. By facilitating a plain syntax and fundamental aggregate functions, we propose an easy-to-learn domain-specific query language capable of search and exploration. It is suitable for domain-experts as well as casual users while still providing the possibility to answer complicated queries. A user study with computer scientists directly compared our query language to SQL and clearly demonstrated SchenQL's suitability and usefulness for given queries as well as users' acceptance.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 16

05/13/2022

SchenQL: A query language for bibliographic data with aggregations and domain-specific functions

Current search interfaces of digital libraries are not suitable to satis...
03/16/2020

Duoquest: A Dual-Specification System for Expressive SQL Queries

Querying a relational database is difficult because it requires users to...
01/27/2019

CRAQL: A Composable Language for Querying Source Code

This paper describes the design and implementation of CRAQL (Composable ...
08/17/2015

Reasoning in complex environments with the SelectScript declarative language

SelectScript is an extendable, adaptable, and declarative domain-specifi...
09/17/2020

Extensible Data Skipping

Data skipping reduces I/O for SQL queries by skipping over irrelevant da...
05/29/2018

Sapphire: Querying RDF Data Made Simple

RDF data in the linked open data (LOD) cloud is very valuable for many d...
08/17/2017

A Semiotics-inspired Domain-Specific Modeling Language for Complex Event Processing Rules

Complex Event Processing (CEP) is one technique used to the handling dat...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Scientific writing almost always starts with thorough bibliographic research on relevant publications, authors, conferences, journals and institutions. While web search is excellent for query answering and intuitively performed, not all retrieved information is correct, unbiased and categorized [4]. The arising problem is people’s tendency of rather using poor information sources which are easy to query than more reliable sources which might be harder to access [5]. This introduces the need for more formal and structured information sources such as digital libraries specialized on the underlying data which are also easy to query. Oftentimes, interfaces of digital libraries offer the possibility to execute search on all metadata or query attributes. In many cases, they are not suitable to fulfil users’ information needs directly when confronted with advanced query conditions such as ”Which are the five most cited publications written by ?”. Popular examples of full-text search engines with metadata-enhanced search for a subset of their attributes [6] are dblp [20] or semantic scholar [30]. Complex relations of all bibliographic metadata can be precisely traversed using textual queries with restricted focus and manual apposition of constraints. While SQL is a standard way of querying databases, it is highly difficult to master [36]. Domain experts are familiar with the schema but are not experienced in using all-purpose query languages such as SQL [1, 23]. Casual users of digital libraries are not versed in either.

To close this gap, we propose SchenQL, a query language (QL) specified on the domain of bibliographic metadata. It is constructed to be easily operated by domain-experts as well as casual users as it uses the vocabulary of digital libraries in its syntax. While domain-specific query languages (DSLs) provide a multitude of advantages [9], the most important aspect in the conception of SchenQL was that no programming skills or database schema knowledge is required to use it.

The contribution of this paper lies in the presentation of a domain-specific query language on bibliographic metadata in computer science which is the first to the best of our knowledge. It focuses on retrieval and exploration aspects as well as aggregate functions. The proposed QL is evaluated two-fold: 1) interviews with domain experts were used to find real applications as well as room for further development and 2) a quantitative user-study thoroughly evaluated effectiveness, efficiency and user satisfaction of the new DSL against SQL.

The remainder of this paper is structured as follows: In Section 2 related work is discussed before our dataset is presented in Section 3. Section 4 introduces the structure and syntax of SchenQL which is thoroughly evaluated in two parts in the following Section 5. The last Section 6 describes possible future research.

2 Related Work

Areas adjacent to the one we are tackling are search on digital libraries, formalized query languages, query languages deriving queries from natural language and domain-specific query languages.

With search on digital libraries, several aspects need to be taken into consideration: The MARC format [11] is a standard for information exchange in digital libraries [4]. While it is useful for known-item search, topical search might be problematic as contents of the corresponding fields can only be interpreted by domain-experts [4]. Most interfaces on digital libraries provide field-based boolean search [29] which can lead to difficulties in formulating queries that require the definition and concatenation of multiple attributes. This might cause a substantial cognitive workload on the user [7]. Withholding or restriction of faceted search on these engines fails to answer complex search tasks [6]. We provide a search option on topical information which even casual users can operate while also offering the possibility to clearly define search terms for numerous attributes in one query. Faceted search is possible on almost all attributes.

Numerous works come from the area of formalized query languages. Examples for SQL-like domain-unspecific QLs on heterogeneous networks are BiQL [15] which is suitable for network analysis and focuses on create/update functions or SnQL [28], a social network query language specialized on transformation and construction operations. Other SQL-like QLs which are not restricted on a domain, focus on graph traversal while being document-centric [32] or social network analysis while being person-centric [31]. Some SQL-like query languages operate on RDF [10] or bibliographic databases using the MARC format [22]. While SchenQL is as structured as these formalized query languages, it does not depend on complicated SQL-like syntax. We argue that SQL is too complex to be effectively operated by casual users and hardly even by computer scientists. Our DSL is neither person-centric nor document-centric but both. Its focal point is the retrieval and exploration of data contrasting the creation/transformation focus of several of the presented languages.

Further research applies QLs in a way that users do not interact with it directly in using the system but in their back end. In many cases, graph-like structured data of heterogeneous networks is used to locate information semantically relevant to a unspecific query [2, 14]. Such an indirection could be a future step in the development of SchenQL.

Whether it be translation of natural language for the formulation of XQuery [21], the processing of identified language parts to build parse trees [3], the representation of natural language queries posed to digital libraries in SPARQL [8] or the recent audio to SQL translation available for different databases and domains [36], analysis of natural language and its conversion to machine processable queries is an active field of research. Our proposed DSL does not translate natural language to SQL but offers a syntax which is similar to natural language.

Domain-specific query languages come in many shapes. They can be SQL-like [19], visual QLs [1, 12] or use domain-specific vocabulary [33] but are typically specialized on a certain area. They also come in different complexities: For example MathQL [16] is a query language in markup style on RDF repositories but a user needs to be mathematician to be able to operate it. The DSL proposed by [23] stems from the medical domain and is designed to be used by inexperienced patients as well as medical staff. Naturally, there are hybrid forms: Some natural language to machine-readable query options are domain-specific [27] and some DSLs might be transferable to other domains [9]. With SchenQL, we want to provide a DSL which uses vocabulary from the domain of bibliographic metadata while being useful for experts as well as casual users.

3 Dataset

Contrary to attempts of modelling every particularity of bibliographic metadata as seen with MARC format [11] or Dublin Core, we concentrate on a few basic objects in our data model. In our concept, bibliographic metadata consists of persons and publications which they authored or edited. These persons can be affiliated with certain institutions. Manuscripts can be of type book, chapter, article, master or PhD thesis and are possibly published in different venues (conferences or journals). They can reference other papers and oftentimes are cited themselves. Figure 1 describes the difference between citations and references.

References of

Citations of

Figure 1: Nodes symbolize publications, edges between papers symbolize citations. and are citations of , and are references of .

The dataset we evaluated on stems from the area of computer science. Our structures were filled with data from dblp [20] mapped on data from SemanticScholar [30] and enriched with information about institutions from Wikidata [35] with application specific extensions. As of June 2019, dblp contains data on more than 4.6 million publications, 2.3 million persons and several thousands of venues. Figure 2 shows the relations, specializations and attributes of data objects in our dataset.

Figure 2: Relations, specializations and attributes of data objects from our extended dblp dataset.

4 SchenQL

The SchenQL Query Language was developed to pose the possibility to access bibliographic metadata in a textual manner which resembles natural language for casual as well as expert users of digital libraries in computer science. The fundamental idea in the development of the query language was to hide possibly complex operations behind plain domain-specific vocabulary. Such functionality would enable usage from anyone versed in the vocabulary of the domain without experience in sophisticated query languages such as SQL. SchenQL queries are formulated declarative, not procedural.

4.1 Building Blocks

PUBLICATION PERSON CONFERENCE JOURNAL INSTITUTION
L dblp key, title dblp key, primary name, orcid dblp key, acronym dblp key, acronym
S ARTICLE, MASTERTHESIS, CHAPTER, PHDTHESIS, BOOK AUTHOR, EDITOR
F PUBLISHED BY (I), ABOUT (keywords), WRITTEN BY (PE), EDITED BY (PE), APPEARED IN (C|J), BEFORE year, IN YEAR year, AFTER year, TITLED title, REFERENCES (PU), CITED BY (PU) PUBLISHED IN (C|J), PUBLISHED WITH (I), WORKS FOR (I), NAMED name, ORCID orcid, AUTHORED (PU), REFERENCES (PU), CITED BY (PU) ACRONYM acronym, ABOUT (keywords), BEFORE year, IN YEAR year, AFTER year NAMED name, ACRONYM acronym,
ABOUT (keywords), BEFORE year, IN YEAR year, AFTER year, VOLUME volume
NAMED name,
CITY city, COUNTRY country, MEMBERS (PE)
V title primary name acronym acronym primary name+primary location
Table 1: SchenQL base concepts Publications (PU), persons (PE), conferences (C), journals (J) and institutions (I) with their respective literals (L), specializations (S), filters (F) and standard return values (V).

Base concepts are the basic return objects of the query language. A base concept is connected to an entity of the dataset and has multiple attributes. Those base concepts are publications, persons, conferences, journals and institutions. Upon these concepts, queries can be constructed. Base concepts can be specialized. For example publications can be refined by specializations books, chapters, articles, master or PhD theses. A specialization can be used instead of a base concept in a query.

Restrictions on base concepts are possible by using filters. A filter extracts a subset of the data of a base concept. Literals can be used as identifiers for objects from base concepts, they can be used to query for specific data. Attributes of base concepts can be queried. Table 1 gives an overview of literals, specializations, filters and the standard return value for every base concept. In Figure 2, where base concepts are underlined and written in blue, their attributes are shown.

Functions are used to aggregate data or offer domain-specific operations. Right now, only three functions are implemented in SchenQL: MOST CITED, COUNT and KEYWORDS OF. The function MOST CITED (PUBLICATION) can be applied on publications. It counts and orders the number of citations of papers in the following set, and returns their titles as well as their number of citations. By default, the top 5 results are returned. COUNT returns the number of objects contained in the following subquery. KEYWORDS OF (PUBLICATION | CONFERENCE | JOURNAL) returns the keywords associated with the following base concept. The LIMIT operator with can be appended at the end of any query to change the number of displayed results to .

4.2 Syntax

The syntax of SchenQL follows some simple rules with the goal of being similar to queries formulated in natural language and therefore understandable and easy to construct. Queries are completed with a semicolon, subqueries have to be surrounded by parentheses. It is possible to write singular or plural when using base concepts or specializations (e.g. CONFERENCE; or CONFERENCES;). Filters follow base concepts or their specializations, can be in arbitrary order and are connected via conjunction. Most filters expect a base concept as parameter (e.g. WRITTEN BY (PERSONS)), several filters expect a string as parameter (e.g. COUNTRY "de"). Specializations can be used in place of base concepts. Instead of a query PERSON NAMED "Ralf Schenkel"; a specialization like AUTHOR NAMED "Ralf Schenkel"; would be possible.

If a filter requires a base concept, parentheses are needed except for the case of using literals for uniquely identifying objects of the base concept. For example PUBLICATIONS WRITTEN BY "Ralf Schenkel"; is semantically equivalent to PUBLICATIONS WRITTEN BY (PERSONS NAMED "Ralf Schenkel");.

COUNT can process any kind of subquery (e.g. COUNT (INSTITUTIONS);). LIMIT can be appended to any query, MOST CITED requires a subquery which produces objects of base concept PUBLICATION (e.g. MOST CITED (ARTICLES APPEARED IN "icadl") LIMIT 10; returns the ten most cited articles which have appeared in the conference with acronym ICADL). KEYWORDS OF requires a subquery, which returns objects of type PUBLICATION, JOURNAL or CONFERENCE. Attributes of base concepts can be accessed by putting the queried for attribute in quotation marks in front of a base concept and connecting both parts with an OF (e.g. "dblpKey" OF JOURNALS IN YEAR 2010;)

4.3 Implementation

Our dataset is stored in a MySQL 8.0.16 database. Lexer and parser of the compiler were built using ANTLR [24] with Java as target language. The compiler translates queries from SchenQL to SQL and runs them on the database. SchenQL can be used in a terminal client similar to the MySQL shell.

5 Evaluation

Our evaluation consists of two parts. In an initial, quantitative study we compared our domain-specific query language SchenQL against the all-purpose query language SQL. In the second, qualitative study, use-cases and possible further improvements were accessed. The quantitative study intended to measure the effectiveness and efficiency of SchenQL and users’ satisfaction with it.

5.1 Quantitative Study

In the quantitative study, SQL was used in a terminal client as a widespread alternative query language to SchenQL, as it is not feasible to compare a specialized system to a commercial search engine and the differences between the compared systems should be minor [18]. The underlying data was stored in a MySQL database in version 8.0.16. Therefore, the requirement for test users was to be familiar with SQL.

5.1.1 Queries

What are the titles of publications written by author ?
What are the names of authors which published on conference ?
What are the titles of the publications referenced by author in year ?
What are the titles of the five most cited publications written by author ?
Table 2: Overview of types of queries used in the evaluation, are authors, is a conference and is a year.

Our evaluation queries are inspired by everyday search tasks of users of digital libraries [8, 25]. We formulated four different types of queries targeting several base concepts and functionalities of SchenQL. Table 2 gives an overview of types of questions used in the evaluation. Variables were switched between query languages to prevent learning effects based on results of queries. , and are publication searches while targets person search.

A formulation of in SQL would be:

SELECT title
FROM publication NATURAL JOIN person_authored_publication NATURAL JOIN person_names
WHERE person_names.name = "A";

In SchenQL, the same query could be expressed by the following:

PUBLICATIONS WRITTEN BY "A";

A formulation of in SQL would be:

SELECT primaryName
FROM person NATURAL JOIN person_authored_publication NATURAL JOIN publication
WHERE publication.conferenceKey = "C";

In SchenQL, the same query could be expressed by the following:

AUTHORS PUBLISHED IN (CONFERENCE ACRONYM "C");

A formulation of in SQL would be:

SELECT DISTINCT title
FROM publication p JOIN publication_references pr ON p.publicationKey = pr.pub2Key
WHERE pr.pub1Key IN (
  SELECT publicationKey
  FROM person_authored_publication NATURAL JOIN person_names NATURAL JOIN publication
  WHERE person_names.name = "A" AND year = Y
);

In SchenQL, the same query could be expressed by the following:

PUBLICATIONS CITED BY (PUBLICATIONS WRITTEN BY "A" IN YEAR Y);

And lastly, a formulation of in SQL would be:

SELECT title, COUNT(*)
FROM publication p NATURAL JOIN person_authored_publication NATURAL JOIN person_names JOIN publication_references pr ON p.publicationKey = pr.pub2Key
WHERE person_names.name = "A"
GROUP BY title
ORDER BY COUNT(*) DESC
LIMIT 5;

In SchenQL, the same query could be expressed by the following:

MOST CITED (PUBLICATIONS WRITTEN BY "A");

As SQL queries tend to become complex relatively fast, the construction of more sophisticated queries was omitted.

5.1.2 Setting

After a preliminary study with two participants, we defined the evaluation process of our archetypical interactive information retrieval study [18] as follows: Every user performed the evaluation alone in presence of a passive investigator on a computer with two monitors. The screens were captured in order to measure times used to formulate the queries. A query language, with which a user was going to begin the evaluation was assigned. Users were permitted to use the internet at any stage of the evaluation. For the evaluation, the presented tables in the database were restricted to the ones which were needed for formulation of the queries as the full ER diagram of the database could overwhelm users. A similar strategy was executed with SchenQL. Not all base concept, attributes and specializations were explained to test subjects but only a smaller subset which was roughly equivalent to the specified tables in SQL. Test users were given the ER diagram which is shown in Figure 4, Figure 5 which shows examples of data in the database schema and a SchenQL cheat sheet as shown in Figure 6.

At first, a video tutorial [37] for the introduction and usage of SQL and SchenQL was shown, afterwards subjects were permitted to formulate queries using the system they were starting to work with. Following this optional step, users were asked to answer a first online questionnaire to assess their current and highest level of SQL knowledge, the number of times they used SQL in the last three months as well as their familiarity with the domain of bibliographic metadata. The next part was the formulation of the four queries and their subjective rating of difficulty in the first query language before the query language was switched. Test users starting with SQL are contained in group A, those beginning with SchenQL are part of group B. Participants were asked to submit the queries in SQL and SchenQL respectively.

The evaluation was concluded with a second online questionnaire regarding the overall impression of SchenQL, the rating of SchenQL and SQL for the formulation of queries as well as several open questions targeting possible advantages and improvements of SchenQL.

We evaluated 21 participants (23 counting the subjects of the preliminary study) from the area of computer science. As the system to start with was rotated between users, ten subjects started by using SQL while eleven participants began the evaluation using SchenQL. We assume gender does not influence the measured values so it is not seen as additional factor in the evaluation [18].

5.1.3 Numerical Results

Figure 3: Required time for all (on the left) and only correct (on the right) answers for all queries.

l—l—l—l Correlation& CQ all & CQ SQL & CQ SchenQL
Current SQL skill&0.35&0.2&0.28
Highest SQL skill&0.11&0.14&0
Frequency of SQL&0.65&0.57&0.29

Table 4: Pearsons correlation coefficients for different combinations of measures of skill in SQL and number of correctly formulated queries (CQ) for every user.

l—l—l—l User group&CORR in general&CORR SQL&CORR SchenQL
Non-experts&63.89& 47.22&80.56
Domain-experts&75 & 64.58 & 85.41

Table 5: Comparison of correctness (CORR) in percent for all queries and queries seperated by system for domain-experts as well as non-experts.
Table 3: Correctness (CORR) in percent, assessed average difficulty (DIFF) and difference of average difficulty of the four queries for SQL and SchenQL.