Supporting Schema References in Keyword Queries over Relational Databases

by   Paulo Martins, et al.
Federal University of Amazonas

Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without knowing schema details or query languages. These systems take the keywords from the input query, locate the elements of the target database that correspond to these keywords, and look for ways to "connect" these elements using information on referential integrity constraints, i.e., key/foreign key pairs. Although several such systems have been proposed in the literature, most of them only support queries whose keywords refer to the contents of the target database and just very few support queries in which keywords refer to elements of the database schema. This paper proposes LATHE, a novel R-KwS designed to support such queries. To this end, in our work, we first generalize the well-known concepts of Query Matches (QMs) and Candidate Joining Networks (CJNs) to handle keywords referring to schema elements and propose new algorithms to generate them. Then, we introduce an approach to automatically select the CJNs that are more likely to represent the user intent when issuing a keyword query. This approach includes two major innovations: a ranking algorithm for selecting better QMs, yielding the generation of fewer but better CJNs, and an eager evaluation strategy for pruning void useless CJNs. We present a comprehensive set of experiments performed with query sets and datasets previously used in experiments with state-of-the-art R-KwS systems and methods. Our results indicate that LATHE can handle a wider variety of keyword queries while remaining highly effective, even for large databases with intricate schemas.


page 23

page 24

page 25

page 26


DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks

Translating Natural Language Queries (NLQs) to Structured Query Language...

Quantum-Inspired Keyword Search on Multi-Model Databases

With the rising applications implemented in different domains, it is ine...

Bridging the Semantic Gap with SQL Query Logs in Natural Language Interfaces to Databases

A critical challenge in constructing a natural language interface to dat...

AutoMode: Relational Learning With Less Black Magic

Relational databases are valuable resources for learning novel and inter...

Domain Representative Keywords Selection: A Probabilistic Approach

We propose a probabilistic approach to select a subset of a target domai...

Subjective Databases

Online users are constantly seeking experiences, such as a hotel with cl...

Understanding Queries by Conditional Instances

A powerful way to understand a complex query is by observing how it oper...

1 Introduction

Keyword Search over Relational Databases (R-KwS) enables naive/informal users to retrieve information from relational databases (DBs) without any knowledge about schema details or query languages. The success of search engines shows that untrained users are at ease using keyword search to find information of interest. However, enabling users to search relational DBs using keyword queries is a challenging task because the information sought frequently spans multiple relations and attributes, depending on the schema design of the underlying DB. As a result, R-KwS systems face the challenge of automatically determining which pieces of information to retrieve from the database and how to connect them to provide a relevant answer to the user.

In general, the keywords from a query may refer to both database values, such as tuples containing these keywords, and schema elements, such as relation and attribute names. For instance, consider the query over a database on movies. The keywords and may refer to values of person names. The keyword on the other hand is more likely to refer to a schema element, the name of the relation about movies. Although a significant number of query keywords correspond to schema references Bergamaschi@SIGMOD11Keymantic , the majority of previous work on R-KwS systems in the literature does not support references to schema information such as the one in the query above. As a result, given a query, they will search for attributes whose tuples include the keyword , which is unlikely to yield a useful answer for the user.

In this work, we study new techniques for supporting schema references in keyword queries over relational databases. Specifically, we propose Lathe111The name Lathe refers to the fact that our system assigns a structure or form to an unstructured keyword-based query, a new R-KwS system to generate a suitable SQL query from a keyword query, considering that keywords may refer either to instance values or to database schema elements, i.e., relations and attributes. Lathe follows the Schema Graph approach for R-KwS systems Oliveira@ICDE18MatCNGen ; Coffman@CIKM10Framework . Given a keyword query, this approach consists of generating relational algebra expressions called Candidate Joining Networks222Most of the previous work uses the term Candidate Networks instead. Here, we use Candidate Joining Networks because we consider it more meaningful. (CJNs), which are likely to express user intent when formulating the original query. The generated CJNs are evaluated, that is, they are translated into SQL queries and executed by a DBMS, resulting in several Joining Networks of Tuples (JNTs) which are collected and supplied to the user.

In the literature, the most well-known algorithm for CJN Generation is CNGen, which was first presented in the system DISCOVER Hristidis@VLDB02DISCOVER , but was adopted by most R-KwS systems Agrawal@ICDE02DBXplorer ; Hristidis@VLDB03Efficient ; Luo@SIGMOD07Spark ; Coffman@KEYS10CD . Despite the possibly large number of CJNs, most works in the literature focused on improving CN Evaluation and ranking of JNTs instead. Specifically, DISCOVER-II Hristidis@VLDB03Efficient , SPARK Luo@SIGMOD07Spark , and CD Coffman@KEYS10CD used information retrieval (IR) style score functions to rank the top-K JNTs. KwS-F Baid@VLDB10KwSF imposed a time limit for CJN evaluation, returning potentially partial results as well as a summary of the CJNs that have yet to be evaluated. Later, CNRank Oliveira@ICDE15CNRank introduces a CJN ranking, requiring only the top-ranked CJNs to be evaluated. MatCNGen Oliveira@ICDE18MatCNGen ; Oliveira@TKDE20 proposed a novel method for generating CJNs that efficiently enumerated the possible matches for the query in the DB. These Query Matches (QMs) are then used to guide the CJN generation process, greatly decreasing the number of generated CJNs and improving the performance of CJN evaluation.

Among the methods based on the Schema Graph approach, Lathe is, to the best of our knowledge, the first method to address the problem of generating and ranking CJNs considering queries with keywords that can refer to either schema elements or attribute values. We revisited and generalized concepts introduced in previous approaches Hristidis@VLDB02DISCOVER ; Oliveira@ICDE15CNRank ; Oliveira@ICDE18MatCNGen ; Oliveira@TKDE20 , such as tuples-sets, QMs, and the CJNs themselves, to enable schema references. In addition, we proposed a more effective approach to CJN Generation that included two major innovations: QM ranking and Eager CJN Evaluation. Lathe roughly matches keywords to the values of the attributes or to schema elements such as names of attributes and relations. Next, the system combines the keyword matches into QMs that cover all the keywords from the query. The QMs are ranked and only the most relevant ones are used to generate CJNs. The CJN generation explores the primary key/foreign key relationships to connect all the elements of the QMs. In addition, Lathe employs an eager CJN evaluation strategy, which ensures that all CJNs generated will yield non-empty results when evaluated. The CJNs are then ranked and evaluated. Finally, the CJN evaluation results are delivered to the user. Unlike the previous methods, Lathe provides the user with the most relevant answer without relying on JNTs rankings. This is due to the effective rankings of QMs and CJNs that we propose, which are absent in the majority of previous work.

We performed several experiments to assess the effectiveness and efficiency of Lathe. First we compared the quality of its results with those obtained with several previous R-KwS systems, including the state-of-the-art QUEST Bergamaschi@VLDBDEMO13_QUEST system using a benchmark proposed by Coffman & Weaver Coffman@CIKM10Framework . Second we assessed the quality of our ranking of QMs. The ranking of CJNs was then evaluated by comparing different configurations in terms of the number of QMs, the number of CJNs generated per QM, and the use of the eager evaluation strategy. Finally, we assessed the performance of each phase of Lathe, as well as the trade off between quality and performance of various system configurations. Lathe achieved better results than all of the R-KwS systems tested in our experiments. Also, our results indicate that the ranking of QMs and the eager CJN evaluation greatly improved the quality of the CJN generation.

Our key contributions are: (i) a novel method for generating and ranking CJNs with support for keywords referring to schema elements; (ii) a novel algorithm for ranking QMs, which avoids the processing of less likely answers to a keyword query; (iii) an eager CJN evaluation for discarding spurious CJNs; (iv) a simple and yet effective ranking of CJNs which exploits the ranking of QMs.

The remainder of this paper is organized as follows: Section 2 reviews the related literature on relational keywords search systems based on schema graphs and support to schema references. Section 3 summarizes all of the phases of our method, which are discussed in detail in Sections 4-6. Section 7 summarizes the findings of the experiments we conducted. Finally, Section 8 summarizes the findings and outlines our plans for future.

2 Background and Related Work

In this section, we discuss the background and related work on keyword search systems over relational databases and on supporting schema references in such systems. For a more comprehensive view of the state-of-the-art in keyword-based and natural language queries over databases, we refer the interested reader to a recent survey Affolter@VLDBJ2019_SurveyNLIDB .

2.1 Relational Keyword Search Systems

Current R-KwS systems fall in one of two distinct categories: systems based on Schema Graphs and systems based on Instance Graphs. Systems in the first category are based on the concept of Candidate Joining Networks (CJNs), which are networks of joined relations that are used to generate SQL queries and whose evaluation return several Joining Networks of Tuples (JNTs) which are collected and supplied to the user. This method was proposed in DISCOVER Hristidis@VLDB02DISCOVER and DBXplorer Agrawal@ICDE02DBXplorer , and it was later adopted by several other systems, including DISCOVER-II Hristidis@VLDB03Efficient , SPARK Luo@SIGMOD07Spark , CD Coffman@KEYS10CD , KwS-F Baid@VLDB10KwSF , CNRank Oliveira@ICDE15CNRank , and MatCNGen Oliveira@ICDE18MatCNGen ; Oliveira@TKDE20 . Systems in this category make use of the underlying basic functionality of the RDBMS by generating appropriate SQL queries to retrieve answers to keyword queries posed by users.

Systems in the second category are based on a structure called Instance Graph, whose nodes represent tuples associated with the keywords they contain, and the edges connect these tuples based on referential integrity constraints. BANKS Aditya@VLDB02BANKS , BANKS-II Kacholia@VLDB05Bidirectional , BLINKS He@SIGMOD07BLINKS and, Effective Liu@SIGMOD06Effective use this approach to compute keyword queries results by finding subtrees in a data graph that minimizes the distance between nodes matching the given keywords. These systems typically generate the query answer in a single phase that combines the tuple retrieval task and the answer schema extraction. However, the Instance Graph approach requires a materialization of the DB and requests a higher computational cost to deliver answers to the user. Furthermore, the important structural information provided by the database schema is ignored, once the data graph has been built.

2.2 R-KwS Systems based on Schema Graphs

In our research, we focus on systems based on Schema Graphs, since we assume that the data we want to query are stored in a relational database and we want to use an RDBMS capable of processing SQL queries. Also, our work expands on the concepts and terminology introduced in DISCOVER Hristidis@VLDB02DISCOVER ; Hristidis@VLDB03Efficient and expanded in CNRank Oliveira@ICDE15CNRank and MatCNGen Oliveira@ICDE18MatCNGen ; Oliveira@TKDE20 . This formal framework is used and expanded to handle keyword queries that may refer to attribute values or to database schema elements. As a result, we can inherit and maintain all guarantees regarding the generation of minimal, complete, sound, and meaningful CJNs.

The best-known algorithm for CJN Generation is CNGen, which was introduced in DISCOVER Hristidis@VLDB02DISCOVER but was later adopted as a default in most of the R-KwS systems proposed in the literature Agrawal@ICDE02DBXplorer ; Hristidis@VLDB03Efficient ; Luo@SIGMOD07Spark ; Coffman@KEYS10CD . To generate a complete, non-redundant set of CJNs, this algorithm employs a Breadth-First Search approach Cormen@09Algorithms . As a result, CNGen frequently generates a large number of CJNs, resulting in a costly CJN generation and evaluation process.

Initially, most of the subsequent work focused on the CJN evaluation only. Specifically, as many CJNs were generated by CNGen that should be evaluated, producing a larger number of JNTs, such systems as DISCOVER-II Hristidis@VLDB03Efficient , SPARK Luo@SIGMOD07Spark , and CD Coffman@KEYS10CD introduced algorithms for ranking JNTs using IR style score functions.

KwS-F Baid@VLDB10KwSF addressed the efficiency and scalability problems in CJN evaluation in a different way. Their approach consists of two steps. First, a limit is imposed on the time the system spends evaluating CJNs. After this limit is reached, the system must return the (possibly partial) top-K JNTs. Second, if there are any CJNs that have yet to be evaluated, they are presented to the user in the form of query forms, from which the user can choose one and the system will evaluate the corresponding CJN.

CNRank Oliveira@ICDE15CNRank proposed a method for lowering the cost of CJN evaluation by ranking them based on the likelihood that they will provide relevant answers to the user. Specifically, CNRank presented a probabilistic ranking model that uses a Bayesian Belief Network

to estimate the relevance of a CJN given the current state of the underlying database. A score is assigned to each generated CJN, so that only a few CJNs with the highest scores need to be evaluated.

MatCNGen Oliveira@ICDE18MatCNGen ; Oliveira@TKDE20 introduced a match-based approach for generating CJNs. The system enumerates the possible ways which the query keywords can be matched in the DB beforehand, to generate query answers. MatCNGen then generates a single CJN, for each of these QMS, drastically reducing the time required to generate CJNs. Furthermore, because the system assumes that answers must contain all of the query keywords, each keyword must appear in at least one element of a CJN. As a result of the generation process avoiding generating too many keyword occurrence combinations, a smaller but better set of CJNs is generated.

Lastly, Coffman & Weaver Coffman@CIKM10Framework

proposed a framework for evaluating R-KwS systems and reported experimental results over three representative standardized datasets they built, namely MONDIAL, IMDb, and Wikipedia, along with respective query workloads. The authors compare nine R-KwS systems, assessing their effectiveness and performance in a variety of ways. The resources of this framework were also used in the experiments of several other studies on R-KwS systems 

Luo@SIGMOD07Spark ; Oliveira@ICDE15CNRank ; Oliveira@ICDE18MatCNGen ; Oliveira@TKDE20 ; Coffman@TKDE12Evaluation .

2.3 Support to Schema References in R-KwS

Overall there are few systems in the literature that support schema references in keywords queries. One of the first such systems was BANKS Bhalotia@ICDE02BANKS , a R-KwS system based on Instance Graphs. However, hence the query evaluation with keywords matching metadata can be relatively slow, since a large number of tuples may be defined to be relevant to the keyword.

Support for schema references in keyword queries was extensively addressed by Bergamaschi et al. in Keymantic Bergamaschi@SIGMOD11Keymantic , KEYRY Bergamaschi@ER2011KEYRY , and QUEST Bergamaschi@VLDBDEMO13_QUEST

. All these systems can be classified as schema-based since they aim at generating a suitable SQL query given an input keyword query. They do not, however, rely on the concept of CJNs, as Lathe and all DISCOVER-based systems do. Keymantic 

Bergamaschi@SIGMOD11Keymantic and KEYRY Bergamaschi@ER2011KEYRY

consider a scenario in which data instances are not acessible, such as in databases on the hidden web and sources hidden behind wrappers in data integration settings, where typically only metadata is made available. Both systems rely on similarity techniques based on structural and lexical knowledge that can be extracted from the available metadata, e.g., names of attributes and tables, attribute domains, regular expressions, or from other external sources, such as ontologies, vocabularies, domain terminologies, etc. The two systems mainly differ in the way they rank the possible interpretations they generate for an input query. While Keymantic relies on an extension the authors proposed for the Hungarian algorithm, KEYRY is based on the Hidden Markov Model, a probabilistic sequence model, adapted for keyword query modeling. QUEST 

Bergamaschi@VLDBDEMO13_QUEST can be thought of as an extension of KEYRY because it uses a similar strategy to rank the mappings from keywords to database elements. QUEST, on the other hand, considers the database instance to be accessible and includes features derived from it for ranking interpretations, in contrast to KEYRY.

From these systems, QUEST is the one most similar to Lathe. However, it is difficult to draw a direct comparison between the two systems as QUEST does not rely on the formal framework from CJN-related previous work Hristidis@VLDB02DISCOVER ; Hristidis@VLDB03Efficient ; Oliveira@ICDE15CNRank ; Oliveira@ICDE18MatCNGen ; Oliveira@TKDE20 and it also resolves a smaller set of keyword queries then Lathe. QUEST, in particular, does not support keyword queries whose resolution necessitates SQL queries with self-joins. As a result, when comparing QUEST to other approaches, the authors limited the experimentation to 35 queries rather then the 50 included in the original benchmark Bergamaschi@VLDBDEMO13_QUEST ; Coffman@CIKM10Framework . Lathe, on the other hand, supports all 50 queries.

Finally, there are systems that propose going beyond the retrieval of tuples that fulfill a query expressed using keywords and try to provide a functionality close to structured query languages. This is the case of SQAK Tata@SIGMOD08SQAK that allows users to specify aggregation functions over schema elements. Such an approach was later expanded in systems such as SODA Blunschi@VLDB2012_soda and SQUIRREL Ramada@InfoSys2020SQUIRREL , which aim to handle not only aggregation functions, but also keywords that represent predicates, groupings, orderings and so on. To support such features, these systems rely on a variety of resources that are not part of the database schema or instances. Among these are conceptual schemas, generic and domain-specific ontologies, lists of reserved keywords, and user-defined metadata patterns. We see such useful systems as being closer to natural language query systems Affolter@VLDBJ2019_SurveyNLIDB . In contrast, Lathe, like any typical R-KwS system, aims at retrieving sets of JNTs that fulfill the query, and not computing results with the tuples. In addition, it does not rely on any external resources.

3 Lathe Overview

In this section we present an overview of Lathe. We begin by presenting a simple example of the task carried out by our system. For this, we illustrate in Figure 1 a simplified excerpt from the well-known IMDB333Internet Movie Database

max width= PERSON ID Name     1 Will Smith     1 Will Theakston     1 Maggie Smith     1 Sean Bean     1 Elijah Wood     1 Angelina Jolie MOVIE ID Title Year 1 Men in Black 1997 1 I am Legend 2007 1 Harry Potter and the Sorcerer’s Stone 2001 1 The Lord of the Rings: The Fellowship of the Ring 2001 1 The Lord of the Rings: The Return of the King 2003 1 Mr. & Mrs. Smith 2005 CHARACTER ID Name 1 Agent J 1 Robert Neville 1 Marcus Flint 1 Minerva McGonagall 1 Boromir 1 Frodo Baggins 1 Jane Smith ROLE ID Name 1 Actor 1 Actress 1 Producer 1 Writer 1 Director 1 Editor CASTING ID PID MID ChID RID 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Figure 1: A simplified excerpt from IMDB

Consider that a user inputs the keyword query and assume that she wants the system to list the movies in which Will Smith appears. Notice that, informally, the terms “will” and “smith” are likely to match the contents of a relation from the DB, while the term “films” is likely to match the name of a relation or attribute.

As other methods previously proposed in the literature, such as CNGen Hristidis@VLDB02DISCOVER and MatCNGen Oliveira@ICDE18MatCNGen ; Oliveira@TKDE20 , the main goal of Lathe is, given a query such as , generating a SQL query that, when executed, fulfills the information needed for the user. The difference between Lathe and these previous methods is that they are not able to handle references to schema elements, such as “films” in .

For query , two of the possible SQL queries that would be generated are queries and , presented in Figures 2 (a) and (b), respectively. The respective results of these queries for the database of Figure 1 are presented in Figures 2(c) and (d). Query retrieves the movies which Will Smith was in, and thus, satisfies the original user intent. On the other hand, query retrieves movies in which two different persons, whose names respectively include the terms “will” and “smith”, participated in.

Figure 2: SQL queries generated for the keyword query “will smith movies” and their returned results.

As this example indicates, there may be several plausible SQL queries related to a given keyword query. Therefore, it is necessary to decide which alternative is more likely to fulfill the user intent. This task is also carried out by Lathe.

Next, we present an overview of the components and the functioning of Lathe.

3.1 System Architecture

In this section, we present the overall architecture of Lathe. We base our discussion on Figure 3, which illustrates the main phases that comprise the operation of the method.

Figure 3: Main phases and architecture of Lathe

The process begins with an input keyword query posed by the user. The system then attempts to associate each of the keywords from the query with a database schema element, such as a relation or an attribute. The system relies on the DB schema, i.e., the names of relations and attributes, or on the DB instance, i.e., on the values of the attributes, for this. This phase, called Keyword Matching 

, generates sets of Value-Keyword Matches (VKMs), which associate keywords with sets of tuples whose attribute values contain these keywords, and Schema-Keyword Matches (SKMs), which associate keywords with names of relations or attributes deemed as similar to these keywords.

In the next phase, Query Matching 

, Lathe combines keyword matches (VKMs and SKMs) so that they form a total and minimal cover for the keywords from the query. That is, the keyword matches comprise all the keywords from the query and no keyword match is redundant. These keyword match combinations are referred to as Query Matches (QMs). Although the Query Matching phase may generate a large number of QMs due to its combinatorial nature, only a few of them are useful in producing plausible answers to the user. As a result, we propose the first algorithm for Ranking Query Matches in the literature. This ranking assigns a score to QMs based on their likelihood of satisfying the needs of the user when formulating the keyword query. Thus, the system only outputs a few top-ranked QMs to the next phases. By doing so, it avoids having to process less likely QMs.

Lastly, in the Candidate Joining Network Generation 

phase, the system searches for interpretations for the keyword query. That is, the system tries to connect all the keyword matches from the QMs through CJNs, which are based on the schema graph. CJNs can be thought as relational algebra joining expressions that can be directly translated into SQL queries. Then, we use the ranked QMs to generate a Candidate Joining Network Ranking. This ranking favors CJNs that are more concise in terms of the number of relations they employ. Once we have identified the most likely CJNs, they can be evaluated as SQL queries that are executed by a DBMS to the users. We notice that some of the generated CJNs may return empty results when they are evaluated. Thus, Lathe can alternatively evaluate CJNs before ranking them and prune such void CJNs. We call this process instance-based pruning.

During the whole process of generating CJNs, Lathe uses two data structures which are created in a Preprocessing stage 

: the Value Index and the Schema Index. During the whole process of generating CJNs, Lathe uses two data structures which are created in a Preprocessing stage 

: the Value Index and the Schema Index.

The Value Index is an inverted index that stores keyword occurrences in the database, indicating the relations, attributes, and tuples where a keyword appears. These occurrences are retrieved to generate VKMs. Furthermore, the Value Index is used to calculate term frequencies for the QMs and CJNs Rankings. The Schema Index is an inverted index that stores database schema information, as well as statistics about relations and attributes. While database schema information, such as PK/FK relationships, are used for the generation of CJNs, the statistics about attributes, such as norm and inverted frequency, are used for rankings of QMs and CJNs.

In the following sections we present each of the phases of Figure 3, describing the steps, definitions, data structures, and algorithms we used.

4 Keyword Matching

In this section, we present the details on keyword matches and their generation. Their role in our work is to associate each keyword from the query to some attribute or relation in the database schema. Initially, we classify them as either VKMs and SKMs, according to the type of associations they represent. Later, we provide a generalization of the keyword matches and we introduce the concept of Keyword-Free Matches, which will be used in the next phases of our method.

4.1 Value-Keyword Matching

We may associate the keywords from the query to some attribute in the database schema-based on the values of this attribute in the tuples that contain these keywords using value-keyword matches, according to Definition 1.

Definition 1.

Let be a keyword query and be a relation state over the relation schema . A value-keyword match from over is given by:

where is the set of keywords from that are associated to the attribute , returns the set of words in for attribute and denotes a match of keywords to the database values.

Notice that each tuple from the database can be a member of only one value-keyword match. Therefore, the VKMs of a given query are disjoint sets of tuples.

Throughout our discussion, for the sake of compactness in the notation, we often omit mappings of attributes to empty keyword sets in the representation of a VKM. For instance, we use the notation to represent .

Example 1.

Consider the database instance of Figure 1. The following VKMs can be generated for the query “will smith films”.

VKMs play a similar role to the tuple-sets from related literature Hristidis@VLDB02DISCOVER ; Oliveira@ICDE18MatCNGen . They are, however, more expressive because they specify which attribute is associated with each keyword. Previous R-KwS systems based on the DISCOVER system, on the other hand, are unable to create tuple-sets that span multiple attributes Hristidis@VLDB02DISCOVER ; Hristidis@VLDB03Efficient ; Oliveira@ICDE15CNRank . Example 2 shows a keyword query that includes more than one attribute.

Example 2.

Consider the query “lord rings 2001” whose intent is to return which Lord of the Rings movie was launched in 2001. We can represent it with the following value-keyword match:

The generation of VKMs uses a structure we call the Value Index. This index stores the occurrences of keywords in the database, indicating the relations and tuples a keyword appears and which attributes are mapped to the keyword. Lathe creates the Value Index during a preprocessing phase that scans all target relations only once. This phase comes before the query processing and it is not expected to be repeated frequently. As a result, without further interaction with the DBMS, answers are generated for each query. The Value Index has following the structure, which is shown in Example 3.

Example 3.

The VKMs presented in Example 1 are based on the following keyword occurrences:.

In Lathe, the generation of VKMs is carried out by the VKMGen algorithm, presented in details in A.

4.2 Schema-Keyword Matching

We may associate the keywords from the query to some attribute or relation in the database schema based on the name of the attribute or relation using Schema-Keyword Matches, according to Definition 2. Specifically, our method matches keywords to the names of relations and attributes using similarity metrics.

Definition 2.

Let be a keyword from the query, be a relation schema. A schema-keyword match from over is given by:

where , is the set of keywords from that are associated with the schema element , gives the similarity between the name of a schema element and the keyword , which must be above a threshold , and denotes a match of keywords to the database schema.

In this representation, we use the artificial attribute when we match a keyword to the name of a relation. Example 4 shows an instance of a schema-keyword match wherein the keyword “” is matched to the relation .

Example 4.

The following schema-based relation matches are created for the query “will smith films”, considering a threshold .

where gives the similarity between the schema element and the keyword , , and .

Despite their similarity to VKMs, the schema-keyword matches serve a different purpose in our method, ensuring that the attributes of a relation appear in the query results. As a result, they do not “filter” any of the tuples from the database, implying that they do not represent any selection operation over database relations.

Similarity Metrics

For the matching of keywords to schema elements, we used two similarity metrics based on the lexical database WordNet: the Path similarity Miller@98Wordnet ; Pedersen@ACL04WordNetSimilarity and the Wu-Palmer similarity Wu@ACL94WuPalmer ; Pedersen@ACL04WordNetSimilarity . We introduce the WordNet database and the two similarity metrics below.

WordNet Database

WordNet Miller@98Wordnet is a large lexical database that resembles a thesaurus, as it groups words based on their meanings. One use of WordNet is to measure similarity between words based on the relatedness of their senses, the many different meanings that words can have Keselj@09 . As a result, the word “film” can refer to a movie, as well as the act of recording or the plastic film. Each of these senses have a different relation to the sense of a “show". Wordnet represents sense relationships , such as synonymy, hyponymy, and hypernymy, to measure similarity between words. Synonyms are two word senses that share the same meaning. In addition, we say that the sense is a hyponym of the sense if is more specific, denoting a subclass of . For instance, “protagonist” is a hyponym of “character”; “actor” is a hyponym of “person”, and “movie” is a hyponym of “show”. The hypernymy is the opposite of hyponymy relation. Thus, us a hypernymy of .

Path Similarity

The Path similarity Miller@98Wordnet ; Pedersen@ACL04WordNetSimilarity exploits the structure and content of the WordNet database. The relatedness score is inversely proportional to the number of nodes along the shortest path between the senses of two words. If the two senses are synonyms, the path between them has length 1. The relatedness score is calculated as follows:

Wu-Palmer Similarity

The Wu-Palmer measure (WUP) Wu@ACL94WuPalmer ; Pedersen@ACL04WordNetSimilarity calculates relatedness by considering the depths of the two synsets and in the WordNet taxonomies, along with the depth of the Least Common Subsumer(LCS). The most specific synset is the LCS, which is the ancestor of both synsets and . Because the depth of the LCS is never zero, the score can never be zero (the depth of the root of a taxonomy is one). Also, the score is 1 if the two input synsets are the same. The WUP similarity for two words and is given by:

As in the case of VKMs, we detail the SKMGen algorithm used in Lathe in B.

4.3 Generalization of Keyword Matches

Initially, we presented Definitions 1 and 2 which, respectively, introduce VKMs and SKMs. We chose to explain the specificity of these concepts separately for didactic purposes. They are, however, both components of a broader concept, Keyword Match (KM), which we define in Definition 3. In the following phases, this generalization will be useful when merging VKMs and SKMs.

Definition 3.

Let be a keyword query and be a relation state over the relation schema . Let be a value-keyword match from R over Q. Let be a schema-keyword match from R over Q. A general keyword match from over is given by:

The representations of VKMs and SKMs in the general notation are given as follows:

Another concept required for the generation of QMs and CNs is keyword-free matches, which we describe in Definition 4. They are KMs that are not associated with any keyword but are used as auxiliary structures, such as intermediate nodes in CJNs.

Definition 4.

We say that a keyword match given by:

is a keyword-free match if, and only if, , where .

For the sake of simplifying the notation, we will represent a keyword-free match as or simply by .

5 Query Matching

In this section, we describe the processes of generating and ranking QMs, which are combinations of the keyword matches generated in the previous phases that comprise every keyword from the keyword query.

5.1 Query Matches Generation

We combine the associations present in the KMs to form total and non-redundant answers for the user. In other words, Lathe looks for KM combinations that satisfy two conditions: (i) every keyword from the query must appear in at least one of the KMs and (ii) if any KM is removed from the combination, the combination no longer meets the first condition. These combinations, called Query Matches (QMs), are described in Definition 5

Definition 5.

Let be a keyword query. Let be a set of keyword matches for Q in a certain database instance , where:

Also, let and be the sets of all keywords associated with and with , respectively. We say that is a query match for if, and only if, forms a minimal set cover of the keywords in . That is, and , .

Notice that a QM cannot contain any keyword-free match, as it would not be minimal anymore. Example 5 presents combinations of KMs which are or are not QMs.

Example 5.

Considering the KMs from the Examples 1 and 4, only some of the following sets are considered QMs for the query “will smith films”:

The sets and are considered QMs. In contrast, the sets of keyword matches , and are not QMs. While does not include all query keywords, and are not minimal, that is, they have unnecessary KMs.

We present the QMGen algorithm for generating QMs in C.

5.2 Query Matches Ranking

As described in Section 3, Lathe performs a ranking of the QMs generated in the previous step. This ranking is necessary because frequently many QMs are generated, yet, only a few of them are useful to produce plausible answers to the user.

Lathe estimates the relevance of QMs based on a Bayesian Belief Network model for the current state of the underlying database. In practice, this model assess two types of relevance when ranking query matches. The the TF-IDF model is used to calculate the value-based score

, which adapts the traditional Vector space model to the context of relational databases, as done in LABRADOR 

Mesquita@IPM07LABRADOR and CNRank Oliveira@ICDE15CNRank . The schema-based score, on the other hand, is calculated by estimating the similarity between keywords and schema elements names.

In Lathe, only the top-k QMs in the ranking are considered in the succeeding phases. By doing so, we avoid generating CJNs that are less likely to properly interpret the keyword query.

Belief Bayesian Network

We adopt the Bayesian framework proposed by Ribeiro@1996BBN and Cristo@2003BBN for modeling distinct IR problems. This framework is simple and allows for the incorporation of features from distinct models into the same representational scheme. Other keyword search systems, such as LABRADOR Mesquita@IPM07LABRADOR and CNRank Oliveira@ICDE15CNRank , have also used it.

In our model, we interpret the QMs as documents, which are ranked for the keyword query. Figure 4

illustrates an example of the adopted Bayesian Network. The nodes that represent the keyword query are located at the top of the network, on the Query Side. The Database Side, located at the bottom of the network, contains the nodes that represent the QM that will be scored. The center of the network is present on both sides and is made up of sets of keywords: the set

of all terms present in the values of the database and the set of all schema element names.

Figure 4: Bayesian Network corresponding to the query

In our Bayesian Network, we rank QMs based on their similarities with the keyword query. This similarity is interpreted as the probability of observing a query match

given the keyword query , that is, , where is a normalizing constant, as used in Pearl@2014Probabilistic .

Initially, we define a random binary variable associated with each keyword from the sets


, which indicates whether the keyword was observed in the keyword query. As these random variables are the root nodes of our Bayesian Network, all of the probabilities of the other nodes are dependent on them. Therefore, if we consider

and as the sets of keywords observed, we can derive the probability of any non-root node as follows: .

As all the possibilities of and are equally likely a priori, we can calculate them as and , respectively.

The instantiation of the root nodes of the network separates the query match nodes from the query nodes, making them mutually independent. Therefore:

The probability of the keyword query is split between the probability of each of its keywords:

A keyword from the query is observed, given the sets and , either if occurs in the values of the database or if has a similarity above a threshold with a schema element.

Similarly, in our network, the probability of a query match is splited between the probability of each of its KMs.

We compute the probability of KMs using two different metrics: a schema score based on the same similarities used in the generation of SKMs; and a value score based on a Vector model Baeza@2011ModernRI ; Salton@IPM88TFIDF

using the cosine similarity.

where .

It is important to distinguish the documents from the Bayesian Network model and the Vector Model. The documents of the Bayesian Network are QMs, and the query is the keyword query itself, whereas the documents of the Vector model are database attributes, and the query is the set of keywords associated with the KM.

Once we know the document and the query of the Vector model, we can calculate the cosine similarity by taking inner product of the document and the query. The cosine similarity formula is given as follows:

where is the constant that represents the norm of the query, which is not necessary for the ranking.

The weights for each term are calculated using the TF-IDF measure. This measure is based on the term frequency and specificity in the collection. We use the raw frequency and inverse frequency, which are the most recommended form of TF-IDF weights Baeza@2011ModernRI .

where can be either the document or the query, is the number of attributes in the database, and is the number of attributes that are mapped to the occurrences of the term . In the case of be the query, gives the number of occurrences of a term in the keyword query, which is generally 1. In the case of be an attribute(document), gives the occurrences of a term in an attribute, which is obtained from the Value Index.

We present the algorithm for ranking QMs in D.

6 Candidate Joining Networks

In this section we present the details on our method for generating and ranking Candidate Joining Networks (CJNs), which represent different interpretations of the the keyword query. We recall that our definition of CJNs expands on the definition presented in Hristidis@VLDB02DISCOVER to support keywords referring to schema elements.

The generation of CJNs uses a structure we call a Schema Graph. In this graph, there is a node representing each relation in the database and the edges correspond to the referential integrity constraints (RIC) in the database schema. In practice, this graph is built in a preprocessing phase based on information gathered from the database schema.

Definition 6.

Let be a set of relation schemas from the database. Let

be a subset of the ordered pairs from

given by:

where gives the number of Referential Integrity Constraints from a relation to a relation . We say that a schema graph is an ordered pair , where is the set of vertices (nodes) of , and is the set of edges of .

Example 6.

Considering the sample movie database introduced in Figure 1, our method generates the schema graph bellow.

In Figure 5, we represent a graphical illustration of .

Figure 5: A schema graph for the sample movie database of Figure 1

Once we defined the schema graph, we can introduce an important concept, the Joining Network of Keyword Matches (JNKM). Intuitively, a joining network of keyword matches contains every KM from a query match . may also contain some free-keyword matches for the sake of connectivity. Finally, is a connected graph that is structured according to the schema graph . The definition of candidate joining networks is given as follows:

Definition 7.

Let be a query match for a keyword query . Let be a schema graph. Let be a set of keyword-free matches from the relations of . Consider a graph of keyword matches , where and are the vertices and edges of . We say that is a joining network of keyword matches from over if the following conditions hold:

For the sake of simplifying the notation, we will use a graphical illustration to represent CJNs, which is shown in Example 7.

Example 7.

Considering the query match previously generated in Example 5, the following JNKMs can be generated:

The JNKMs and cover the query match . The interpretation of looks for the movies of the person will smith. looks for the movies of the person will smith and which character will smith played in these movies.

Notice that a JNKM might have unnecessary information for the keyword query, which was the case of presented in Example 7. One approach to avoid generating unnecessary information is to generate Minimal Joining Networks of Keyword Matches(MJNKM), which are addressed in Definition 8. Roughly, a MJNKM cannot have any keyword-free match as a leaf, that is, a keyword-free match incident to a single edge.

Definition 8.

Let be a schema graph. Let be a query match for a query Q. We say that from over is minimal joining network of keyword matches (MJNKM) if, and only if, the following condition holds:

Example 8.

Considering the query match previously generated in Example 5, the following MJNKMs can be generated:

Another issue that a JNKM might have is representing an inconsistent interpretation. For instance, it is impossible for presented in Example 8 to return any results from the database. By Definition 1, the VKMs and are disjoint. However, a tuple from cannot refer to two different tuples of . Thus is inconsistent. We notice that previous work in literature for CJN generation had addressed this kind of inconsistency Hristidis@VLDB02DISCOVER ; Oliveira@ICDE18MatCNGen . They did not, however, consider the situation in which there exist more than one RIC from one relation to another. In contrast, based on the theorems and definitions presented in Hristidis@VLDB02DISCOVER , Lathe proposes a novel approach for checking consistency in CJNs that support such scenarios. Theorem 1 presents a criterion that determines when a JNKM is sound, that is, it can only produce JNTs that do not have more than one occurrences of a tuple. The proof of Theorem 1 is presented in E.

Theorem 1.

Let be a schema graph. Let be a joining network of keyword matches. We say that is sound, that is, it does not have more than one occurrences of the same tuple for every instance of the database if, and only if, the following condition holds

where indicates the number of Referential Integrity Constraints from a relation to a relation .

Example 9 presents a JNKM that is sound, although it would be deemed not sound by previous approaches Hristidis@VLDB02DISCOVER ; Oliveira@ICDE18MatCNGen .

Example 9.

Consider a simplified excerpt from the MONDIAL database May@99MONDIAL , presented in Figure 6. As there exists 2 RICs from the relation to , represented by the attributes Ctry1_Code e Ctry2_Code, a tuple from can be joined to at most two distinct tuples from , which is the case of . Thus, the following MJNKM is sound:

max width= COUNTRY Code Name Capital_ID CO Colombia 1 BR Brazil 2 PE Peru 3 BORDER Ctry1_Code Ctry2_Code Length CO BR 1643 PE BR 1560 CITY ID Name Population 1 Bogota 1643 2 Brasilia 1560 3 Lima 1560

Figure 6: A simplified excerpt from MONDIAL

Finally, Definition 9 describes a Candidate Joining Network (CJN), which is roughly a sound minimal joining network of keyword matches.

Definition 9.

Let be a query match for the keyword query . Let be a schema graph. Let be a joining network of keyword matches from over given by . We say that is a candidate joining network if, and only if, is minimal and sound.

Example 10.

Considering the query match previously generated in Example 5, the following CJN can be generated:

The candidate joining networks covers the query match . is a minimal and sound JNKM. The interpretation of searches for the movies where both persons “will” (e.g. Will Theakston) and “smith” (e.g. Maggie Smith) participate in. The two keyword-free matches from the are treated as different nodes in the candidate joining network .

The details on how we generate CJNs in Lathe are described by the CNKMGen Algorithm in F.

6.1 Candidate Joining Network Ranking

In this section, we present CJNKMRank, a novel ranking of CJNs based on the ranking of QMs. This ranking is necessary because often many CJNs are generated, yet, only a few of them are indeed useful to produce relevant answers.

We present in Section 5.2 a QM ranking that advances the majority of the features present in the ranking of CJNs of other proposed systems, such as CNRank Oliveira@ICDE15CNRank . Thus, we can exploit the scores of the QMs to rank the CJNs. For this reason, CJNKMRank provides a straightforward yet effective ranking of candidate joining networks.

CJNKMRank is described in Algorithm 1. Roughly, it uses the ranking of QMs adding a penalization for large CJNs. Therefore, the score of a candidate joining network from a query match is given by:

To ensure that CJNs with the same score are placed in the same order that they were generated, in Line 1, we used a stable sorting algorithm Cormen@09Algorithms .

Input: A set of candidate networks
Output: The set of candidate networks
1 for  do
2       let  be the query match used to generate cn_score = .append( )
Sort in descending order return
Algorithm 1 CNKMRank()

6.2 Candidate Joining Network Pruning

In this section we present an eager evaluation strategy for pruning CJNs. Even if CJNs contain valid interpretations of the keyword query, some of them may fail to produce any JNTs as a result. Thus, we can improve the results of our CJN generation and ranking if by pruning what we call void CJNs, which are CJNs with no JNTs in their results.