Example-Driven Query Intent Discovery: Abductive Reasoning using Semantic Similarity

06/25/2019 · by Anna Fariha, et al. · University of Massachusetts Amherst

Traditional relational data interfaces require precise structured queries over potentially complex schemas. These rigid data retrieval mechanisms pose hurdles for non-expert users, who typically lack expertise in the query language and are unfamiliar with the details of the schema. Query by Example (QBE) methods offer an alternative mechanism: users provide examples of their intended query output and the QBE system needs to infer the intended query. However, these approaches focus on the structural similarity of the examples and ignore the richer context present in the data. As a result, they typically produce queries that are too general, and fail to capture the user's intent effectively. In this paper, we present SQuID, a system that performs semantic similarity-aware query intent discovery. Our work makes the following contributions: (1) We design an end-to-end system that automatically formulates select-project-join queries in an open-world setting, with optional group-by aggregation and intersection operators; a much larger class of queries than prior QBE techniques support. (2) We express the problem of query intent discovery using a probabilistic abduction model that infers a query as the most likely explanation of the provided examples. (3) We introduce the notion of an abduction-ready database, which precomputes semantic properties and related statistics, allowing SQuID to achieve real-time performance. (4) We present an extensive empirical evaluation on three real-world datasets, including user-intent case studies, demonstrating that SQuID is efficient and effective, and outperforms machine learning methods, as well as the state-of-the-art in the related query reverse engineering problem.




1 Introduction

Database technology has expanded drastically, and its audience has broadened, bringing on a new set of usability requirements. A significant group of current database users are non-experts, such as data enthusiasts and occasional users. These non-expert users want to explore data, but lack the expertise needed to do so. Traditional database technology was not designed with this group of users in mind, and hence poses hurdles to them. Traditional query interfaces allow data retrieval through well-structured queries. To write such queries, one needs expertise in the query language (typically SQL) and knowledge of the potentially complex database schema. Unfortunately, occasional users typically lack both. Query by Example (QBE) offers an alternative retrieval mechanism, where users specify their intent by providing example tuples for their query output [45].

Unfortunately, traditional QBE systems [51, 48, 16] for relational databases make a strong and oversimplifying assumption in modeling user intent: they implicitly treat the structural similarity and data content of the example tuples as the only factors specifying query intent. As a result, they consider all queries that contain the provided example tuples in their result set as equally likely to represent the desired intent. (More nuanced QBE systems exist, but they typically place additional requirements or significant restrictions on the supported queries; see Figure 3.) This ignores the richer context in the data that can help identify the intended query more effectively.

academics:
  id   name
  100  Thomas Cormen
  101  Dan Suciu
  102  Jiawei Han
  103  Sam Madden
  104  James Kurose
  105  Joseph Hellerstein

research:
  aid  interest
  100  algorithms
  101  data management
  102  data mining
  103  data management
  103  distributed systems
  104  computer networks
  105  data management
  105  distributed systems

Figure 1: Excerpt of two relations of the CS Academics database. Dan Suciu and Sam Madden (in bold) both have research interests in data management.
Example 1.1.

In Figure 1, the relations academics and research store information about CS researchers and their research interests. Given the user-provided set of examples {Dan Suciu, Sam Madden}, a human can posit that the user is likely looking for data management researchers. However, a QBE system that looks for queries based only on the structural similarity of the examples produces Q1 to capture the query intent, which is too general:
Q1: SELECT name FROM academics
In fact, the QBE system will generate the same generic query Q1 for any set of names from the relation academics. Even though the intended semantic context is present in the data (by associating academics with research interest information using the relation research), existing QBE systems fail to capture it. The more specific query that better represents the semantic similarity among the example tuples is Q2:
Q2: SELECT name FROM academics, research
       WHERE research.aid = academics.id AND
             research.interest = 'data management'
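The contrast between Q1 and Q2 can be reproduced end-to-end on the Figure 1 excerpt. Below is a minimal sketch in Python with sqlite3; the table and column names come from Figure 1, and the two queries are Q1 and Q2 verbatim:

```python
# Sketch of Example 1.1: both Q1 and Q2 contain the example tuples,
# but Q2 is far more specific. Data is the Figure 1 excerpt.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE academics (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE research (aid INTEGER, interest TEXT);
INSERT INTO academics VALUES
  (100,'Thomas Cormen'),(101,'Dan Suciu'),(102,'Jiawei Han'),
  (103,'Sam Madden'),(104,'James Kurose'),(105,'Joseph Hellerstein');
INSERT INTO research VALUES
  (100,'algorithms'),(101,'data management'),(102,'data mining'),
  (103,'data management'),(103,'distributed systems'),
  (104,'computer networks'),(105,'data management'),
  (105,'distributed systems');
""")

q1 = "SELECT name FROM academics"
q2 = """SELECT name FROM academics, research
        WHERE research.aid = academics.id
          AND research.interest = 'data management'"""

out1 = {r[0] for r in conn.execute(q1)}
out2 = {r[0] for r in conn.execute(q2)}
examples = {"Dan Suciu", "Sam Madden"}

print(examples <= out1, len(out1))   # True 6: valid, but generic
print(examples <= out2, sorted(out2))
```

Any set of names satisfies Q1, which is why structural similarity alone cannot distinguish intents; Q2's output is restricted to the three data management researchers.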

Example 1.1 shows how reasoning about the semantic similarity of the example tuples can guide the discovery of the correct query structure (join of the academics and research tables), as well as the discovery of the likely intent (research interest in data management).

We can often capture semantic similarity through direct attributes of the example tuples. These are attributes associated with a tuple within the same relation, or through simple key-foreign key joins (such as research interest in Example 1.1). Direct attributes capture intent that is explicit, precisely specified by the particular attribute values. However, sometimes query intent is more vague, and not expressible by explicit semantic similarity alone. In such cases, the semantic similarity of the example tuples is implicit, captured through deeper associations with other entities in the data (e.g., type and quantity of movies an actor appears in).

Example 1.2.

The IMDb dataset contains a wealth of information related to the movies and entertainment industry. We query the IMDb dataset (Figure 2) with a QBE system, using two different sets of examples:

ET1 = {Arnold Schwarzenegger, Sylvester Stallone, Dwayne Johnson}
ET2 = {Eddie Murphy, Jim Carrey, Robin Williams}

ET1 contains the names of three actors from a public list of "physically strong" actors (https://www.imdb.com/list/ls050159844); ET2 contains the names of three actors from a public list of "funny" actors (https://www.imdb.com/list/ls000025701). ET1 and ET2 represent different query intents (strong actors and funny actors, respectively), but a standard QBE system produces the same generic query for both:
Q3: SELECT person.name FROM person

Explicit semantic similarity cannot capture these different intents, as there is no attribute that explicitly characterizes an actor as “strong” or “funny”. Nevertheless, the database encodes these associations implicitly, in the number and type of movies an actor appears in (“strong” actors frequently appear in action movies, and “funny” actors in comedies).

Figure 2: Partial schema of the IMDb database. The schema contains 2 entity relations: movie, person; and a semantic property relation: genre. The relations castinfo and movietogenre associate entities and semantic properties.

Standard QBE systems typically produce queries that are too general, and fail to capture nuanced query intents, such as the ones in Examples 1.1 and 1.2. Some prior approaches attempt to refine the queries based on additional, external information, such as external ontologies [38], provenance information of the example tuples [16], and user feedback on multiple (typically a large number of) system-generated examples [12, 37, 18]. Other work relies on a closed-world assumption (where a tuple not specified as an example output is assumed to be excluded from the query result) to produce more expressive queries [37, 57, 65], and thus requires complete examples of input databases and output results. Providing such external information is typically complex and tedious for a non-expert.

In contrast with prior approaches, in this paper, we propose a method and present an end-to-end system for discovering query intent effectively and efficiently, in an open-world setting, without the need for any additional external information beyond the initial set of example tuples. (Figure 3 provides a summary of prior work and contrasts it with our contributions; we detail this classification and the metrics in Appendix F and discuss related work in Section 8.) SQuID, our semantic similarity-aware query intent discovery framework [23], relies on two key insights: (1) It exploits the information and associations already present in the data to derive the explicit and implicit similarities among the provided examples. (2) It identifies the significant semantic similarities among them using abductive reasoning, a logical inference mechanism that aims to derive a query as the simplest and most likely explanation for the observed results (example tuples). We explain how SQuID uses these insights to handle the challenging scenario of Example 1.2.

Query classes: QBE = Query by Example; QRE = Query Reverse Engineering; DX = Data Exploration; KG = Knowledge Graph; ! = supported with significant restrictions. Capability columns in the original figure: join, projection, selection, aggregation, semi-join, implicit property, scalable, open-world.

QBE (relational):
  SQuID (supports all eight capabilities)
  Bonifati et al. [12] (!, requires user feedback)
  QPlain [16] (requires provenance input)
  Shen et al. [51]
  FastTopK [48]
QBE (knowledge graph):
  Arenas et al. [4] (!)
  SPARQLByE [17] (!, requires negative examples)
  GQBE [29] (!)
  QBEES [43] (!)
QRE (relational):
  PALEO-J [47] (top-k queries only)
  SQLSynthesizer [65] (requires schema knowledge)
  Scythe [57] (requires schema knowledge)
  Zhang et al. [64]
  REGAL [53]
  REGAL+ [54]
  FastQRE [33]
  QFE [37] (requires user feedback)
  TALOS [55]
DX (relational):
  AIDE [18] (requires user feedback)
  REQUEST [24] (requires user feedback)
Figure 3: SQuID captures complex intents and more expressive queries than prior work in the open-world setting.
Example 1.3.

We query the IMDb dataset with SQuID, using the example tuples in ET2 (Example 1.2). SQuID discovers the following semantic similarities among the examples: (1) all are Male, (2) all are American, and (3) all appeared in more than 40 Comedy movies. Out of these properties, Male and American are very common in the IMDb database. In contrast, a very small fraction of persons in the dataset are associated with such a high number of Comedy movies; this means that it is unlikely for this similarity to be coincidental, as opposed to the other two. Based on abductive reasoning, SQuID selects the third semantic similarity as the best explanation of the observed example tuples, and produces the query:
  Q4: SELECT person.name
         FROM person, castinfo, movietogenre, genre
         WHERE person.id = castinfo.person_id
            AND castinfo.movie_id = movietogenre.movie_id
            AND movietogenre.genre_id = genre.id
            AND genre.name = 'Comedy'
         GROUP BY person.id
         HAVING count(*) >= 40

In this paper, we make the following contributions:

  • We design an end-to-end system, SQuID, that automatically formulates select-project-join queries with optional group-by aggregation and intersection operators, based on a few user-provided example tuples. SQuID does not require users to have any knowledge of the database schema or the query language. In contrast with existing approaches, SQuID does not need any additional user-provided information, and achieves very high precision with very few examples in most cases.

  • SQuID infers the semantic similarity of the example tuples, and models query intent using a collection of basic and derived semantic property filters (section 3). Some prior work has explored the use of semantic similarity in knowledge graph retrieval tasks [66, 43, 29]. However, these prior systems do not directly apply to the relational domain, and do not model implicit semantic similarities, derived from aggregating properties of affiliated entities (e.g., the number of comedy movies an actor appeared in).

  • We express the problem of query intent discovery using a probabilistic abduction model (section 4). This model allows SQuID to identify the semantic property filters that represent the most probable intent given the examples.

  • SQuID achieves real-time performance through an offline strategy that pre-computes semantic properties and related statistics to construct an abduction-ready database (section 5). During the online phase, SQuID consults the abduction-ready database to derive relevant semantic property filters, based on the provided examples, and applies abduction to select the optimal set of filters towards query intent discovery (section 6). We prove the correctness of the abduction algorithm in Theorem 1.

  • Our empirical evaluation includes three real-world datasets, 41 queries covering a broad range of complex intents and structures, and three case studies (section 7). We further compare with TALOS [55], a state-of-the-art system that captures very expressive queries, but in a closed-world setting. We show that SQuID is more accurate at capturing intent and produces better queries, often reducing the number of predicates by orders of magnitude. We also empirically show that SQuID outperforms a semi-supervised machine learning system [21], which learns classification models from positive examples and unlabeled data.

2 SQuID overview

In this section, we first discuss the challenges in example-driven query intent discovery and highlight the shortcomings of existing approaches. We then formalize the problem of query intent discovery using a probabilistic model and describe how SQuID infers the most likely query intent using abductive reasoning. Finally, we present the system architecture for SQuID, and provide an overview of our approach.

2.1 The Query Intent Discovery Problem

SQuID aims to address three major challenges that hinder existing QBE systems:

Large search space. Identifying the intended query given a set of example tuples can involve a huge search space of potential candidate queries. Aside from enumerating the candidate queries, validating them is expensive, as it requires executing the queries over potentially very large data. Existing approaches limit their search space in three ways: (1) They often focus on project-join (PJ) queries only. Unfortunately, ignoring selections severely limits the applicability and practical impact of these solutions. (2) They assume that the user provides a large number of examples or interactions, which is often unreasonable in practice. (3) They make a closed-world assumption, thus needing complete sets of input data and output results. In contrast, SQuID focuses on a much larger and more expressive class of queries, select-project-join queries with optional group-by aggregation and intersection operators (the queries derived by SQuID limit joins to key-foreign key joins, and use conjunctive selection predicates of the form attribute OP value, where OP is a comparison operator and value is a constant), and is effective in the open-world setting with very few examples.

Distinguishing candidate queries. In most cases, a set of example tuples does not uniquely identify the target query, i.e., there are multiple valid queries that contain the example tuples in their results. Most existing QBE systems do not distinguish among the valid queries [51] or only rank them according to the degree of input containment, when the example tuples are not fully contained by the query output [48]. In contrast, SQuID exploits the semantic context of the example tuples and ranks the valid queries based on a probabilistic abduction model of query intent.

Complex intent. A user’s information need is often more complex than what is explicitly encoded in the database schema (e.g., Example 1.2). Existing QBE solutions focus on the query structure and are thus ill-equipped to capture nuanced intent. While SQuID still produces a structured query in the end, its objectives focus on capturing the semantic similarity of the examples, both explicit and implicit. SQuID thus draws a contrast between the traditional query-by-example problem, where the query is assumed to be the hidden mechanism behind the provided examples, and the query intent discovery problem that we focus on in this work.

Figure 4: SQuID’s operation includes an offline module, which constructs an abduction-ready database (DB) and precomputes statistics of semantic properties. During normal operation, SQuID’s query intent discovery module interacts with the DB to identify the semantic context of the provided examples and abduces the most likely query intent.

We proceed to formalize the problem of query intent discovery. We use D to denote a database, and Q(D) to denote the set of tuples in the result of query Q operating on D.

Definition 2.1 (Query Intent Discovery).

For a database D and a user-provided example tuple set E, the query intent discovery problem is to find a query Q* such that:

    Q* = argmax_{Q : E ⊆ Q(D)} Pr(Q | E)

More informally, we aim to discover a query that contains E within its result set and maximizes the query posterior, i.e., the conditional probability Pr(Q | E).

2.2 Abductive Reasoning

SQuID solves the query intent discovery problem (Definition 2.1) using abduction. Abduction or abductive reasoning [42, 32, 11, 5] refers to the method of inference that finds the best explanation (query intent) of an often incomplete observation (example tuples). Unlike deduction, in abduction, the premises do not guarantee the conclusion. So, a deductive approach would produce all possible queries that contain the example tuples in their results, and it would guarantee that the intended query is one of them. However, the set of valid queries is typically extremely large, growing exponentially with the number of properties and the size of the data domain. In our work, we model query intent discovery as an abduction problem and apply abductive inference to discover the most likely query intent. More formally, given two candidate queries Q1 and Q2, we infer Q1 as the intended query if Pr(Q1 | E) > Pr(Q2 | E).

Example 2.1.

Consider again the scenario of Example 1.1. SQuID identifies that the two example tuples share the semantic context interest = data management. Q1 and Q2 both contain the example tuples in their result set. However, the probability that two tuples drawn randomly from the output of Q1 would display the identified semantic context is low: in the data excerpt, only three of the six academics have that interest. In contrast, the probability that two tuples drawn randomly from the output of Q2 would display the identified semantic context is 1, since every tuple in Q2's output has interest = data management. Assuming that Q1 and Q2 have equal priors (Pr(Q1) = Pr(Q2)), Bayes' rule gives Pr(Q2 | E) > Pr(Q1 | E).
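The likelihood comparison in Example 2.1 can be spelled out numerically. The sketch below assumes two independent draws with replacement from the Figure 1 excerpt, where 3 of the 6 academics have the data management interest; the exact sampling model is an illustrative assumption:

```python
# Bayes-rule comparison from Example 2.1 (a sketch; the numbers assume
# two independent draws with replacement from the Figure 1 excerpt).
# Context c: both drawn tuples have interest = 'data management'.
n_academics = 6   # output size of Q1 (all academics)
n_dm = 3          # academics with the 'data management' interest

pr_c_given_q1 = (n_dm / n_academics) ** 2   # 0.25
pr_c_given_q2 = 1.0   # every tuple in Q2's output has the interest

# Equal priors cancel, so the posterior ratio is the likelihood ratio:
ratio = pr_c_given_q2 / pr_c_given_q1
print(ratio)  # 4.0 -> Q2 is the better explanation of the examples
```

Under these assumptions the posterior for Q2 is four times that of Q1, matching the abductive choice of Q2.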

Figure 5: A genre value (e.g., genre=Comedy) is a basic semantic property of a movie (through the movietogenre relation). A person is associated with movie entities (through the castinfo relation); aggregates of basic semantic properties of movies are derived semantic properties of person, e.g., the number of comedy movies a person appeared in. The DB stores the derived property in the new relation persontogenre. (For ease of exposition, we depict attributes genre and person instead of genre.id and person.id.)

2.3 Solution Sketch

At the core of SQuID is an abduction-ready database, DB (Figure 4). The DB (1) increases SQuID’s efficiency by storing precomputed associations and statistics, and (2) simplifies the query model by reducing the extended family of queries on the original database to equivalent SPJ queries on the DB.

Example 2.2.

The IMDb database has, among others, relations person and genre (Figure 2). SQuID’s DB stores a derived semantic property that associates the two entity types in a new relation, persontogenre(person.id, genre.id, count), which stores how many movies of each genre each person appeared in. SQuID derives this relation through joins with castinfo and movietogenre, and aggregation (Figure 5). Then, the query Q4 (Example 1.3) is equivalent to the simpler SPJ query Q5 on the DB:
 Q5: SELECT person.name
       FROM person, persontogenre, genre
       WHERE person.id = persontogenre.person_id AND
          persontogenre.genre_id = genre.id AND
          genre.name = 'Comedy' AND persontogenre.count >= 40

By incorporating aggregations in precomputed, derived relations, SQuID can reduce queries on the original data to SPJ queries on the DB. SQuID starts by inferring a PJ query on the DB as a query template (the base query); it then augments this template with selection predicates, driven by the semantic similarity of the examples. section 3 formalizes SQuID's model of query intent as a combination of the base query and a set of semantic property filters. Then, section 4 analyzes the probabilistic abduction model that SQuID uses to solve the query intent discovery problem (Definition 2.1).
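The offline reduction of Example 2.2 amounts to materializing a derived relation. Below is a minimal sketch with sqlite3, using the relation names of Figure 2 but invented toy rows (the column name cnt stands in for the figure's count):

```python
# Sketch of the abduction-ready precomputation from Example 2.2:
# materialize persontogenre(person_id, genre_id, cnt) so that Q4's
# join + group-by collapses to the simpler SPJ query Q5.
# Relation names follow Figure 2; the toy rows are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER, name TEXT);
CREATE TABLE genre (id INTEGER, name TEXT);
CREATE TABLE castinfo (person_id INTEGER, movie_id INTEGER);
CREATE TABLE movietogenre (movie_id INTEGER, genre_id INTEGER);
INSERT INTO person VALUES (1,'A'),(2,'B');
INSERT INTO genre VALUES (10,'Comedy');
INSERT INTO castinfo VALUES (1,100),(1,101),(2,100);
INSERT INTO movietogenre VALUES (100,10),(101,10);

-- offline phase: derived semantic property as a precomputed relation
CREATE TABLE persontogenre AS
  SELECT c.person_id, m.genre_id, COUNT(*) AS cnt
  FROM castinfo c JOIN movietogenre m ON c.movie_id = m.movie_id
  GROUP BY c.person_id, m.genre_id;
""")

rows = sorted(conn.execute("SELECT * FROM persontogenre"))
print(rows)  # [(1, 10, 2), (2, 10, 1)]
```

Online queries then filter persontogenre.cnt directly, with no aggregation at query time.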

After the formal models, we describe the system components of SQuID. section 5 describes the offline module, which is responsible for making the database abduction-ready, by precomputing semantic properties and statistics in derived relations. section 6 describes the query intent discovery module, which abduces the most likely intent as an SPJ query on the DB.

3 Modeling Query Intent

SQuID's core task is to infer the proper SPJ query on the DB. We model an SPJ query Q as a pair of a base query QB and a set of semantic property filters F: Q = (QB, F). The base query is a project-join query that captures the structural aspect of the example tuples. SQuID can handle examples with multiple attributes, but, for ease of exposition, we focus on example tuples that contain a single attribute of a single entity (e.g., name of person).

In contrast to existing approaches that derive PJ queries from example tuples, the base query in SQuID does not need to be minimal with respect to the number of joins: While a base query on a single relation with projection on the appropriate attribute (e.g., Q1 in Example 1.1) would capture the structure of the examples, the semantic context may rely on other relations (e.g., research, as in Q2 of Example 1.1). Thus, SQuID considers any number of joins among DB relations for the base query, but limits these to key-foreign key joins.

We discuss a simple method for deriving the base query in Section 6.2. SQuID's core challenge is to infer F, the set of semantic property filters that are added as conjunctive selection predicates to the base query. The base query and semantic property filters for Q2 of Example 1.1 are:
   QB: SELECT name FROM academics, research
       WHERE research.aid = academics.id
   F = { research.interest = 'data management' }

3.1 Semantic Properties and Filters

Semantic properties encode characteristics of an entity. We distinguish two types of semantic properties. (1) A basic semantic property is affiliated with an entity directly. In the IMDb schema of Figure 2, gender=Male is a basic semantic property of a person. (2) A derived semantic property of an entity is an aggregate over a basic semantic property of an associated entity. In Example 2.2, the number of movies of a particular genre that a person appeared in is a derived semantic property for person. We represent a semantic property of an entity from a relation R as a triple (A, V, θ). In this notation, V denotes a value or a value range for attribute A associated with entities in R. (SQuID can support disjunction for categorical attributes, e.g., gender=Male or gender=Female, so V could be a set of values; for ease of exposition, we keep our examples limited to properties without disjunction.) The association strength parameter θ quantifies how strongly an entity is associated with the property. It corresponds to a threshold on derived semantic properties (e.g., the number of comedies an actor appeared in); it is not defined for basic properties.

A semantic property filter φ is a structured language representation of the semantic property (A, V, θ). In the data of Figure 6, the filters (gender, Male) and (age, [50, 90]) represent two basic semantic properties on gender and age, respectively. Expressed in relational algebra, filters on basic semantic properties map to standard selection predicates, e.g., σ[gender=Male](person) and σ[50≤age≤90](person). For derived properties, filters specify conditions on the association across different entities. In Example 2.2, for person entities, the filter (genre, Comedy, 30) denotes the property of a person being associated with at least 30 movies with the basic property genre=Comedy. In relational algebra, filters on derived properties map to selection predicates over derived relations in the DB, e.g., σ[genre=Comedy ∧ count≥30](persontogenre).

3.2 Filters and Example Tuples

To construct F, SQuID needs to infer the proper set of semantic property filters given a set of example tuples. Since all example tuples should be in the result of the abduced query Q = (QB, F), F cannot contain filters that the example tuples do not satisfy. Thus, we only consider valid filters, which map to selection predicates that all examples satisfy.

Definition 3.1 (Filter validity).

Given a database D, an example tuple set E, and a base query QB, a filter φ is valid if and only if E ⊆ Q(D), where Q = (QB, {φ}).

Figure 6 shows a set of example tuples over the relation person. Given the base query SELECT name FROM person, the filters (gender, Male) and (age, [50, 90]) on relation person are valid, because all of the example entities of Figure 6 are Male and fall in the age range [50, 90].

Lemma 3.1 (Validity of conjunctive filters).

The conjunction of a set of filters F = {φ1, ..., φn} is valid, i.e., E ⊆ Q(D) for Q = (QB, F), if and only if every filter φi ∈ F is valid.

Relaxing a filter (loosening its conditions) preserves validity. For example, if (age, [50, 90]) is valid, then (age, [40, 100]) is also valid. Out of all valid filters, SQuID focuses on minimal valid filters, which have the tightest bounds. (Bounds can be derived in different ways, potentially informed by the result set cardinality; however, we found that the choice of the tightest bounds works well in practice.)

Definition 3.2 (Filter minimality).

A basic semantic property filter (A, V) is minimal if it is valid and no filter (A, V′) with V′ ⊂ V is valid. A derived semantic property filter (A, V, θ) is minimal if it is valid and (A, V, θ′) is not valid for any θ′ > θ.

In the example of Figure 6, (age, [50, 90]) is a minimal filter and (age, [40, 100]) is not.
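A small sketch of how minimal valid filters fall out of the Figure 6 data: for a categorical attribute, a valid filter requires all examples to share one value; for a numeric attribute, the minimal valid filter is the tightest range covering all examples. The dictionary representation is illustrative, not SQuID's internal one:

```python
# Minimal valid filters (Definitions 3.1-3.2) over the Figure 6 data.
people = [
    {"name": "Tom Cruise",     "gender": "Male",   "age": 50},
    {"name": "Clint Eastwood", "gender": "Male",   "age": 90},
    {"name": "Tom Hanks",      "gender": "Male",   "age": 60},
    {"name": "Julia Roberts",  "gender": "Female", "age": 50},
    {"name": "Emma Stone",     "gender": "Female", "age": 29},
    {"name": "Julianne Moore", "gender": "Female", "age": 60},
]
examples = {"Tom Cruise", "Clint Eastwood"}
rows = [p for p in people if p["name"] in examples]

filters = {}
# categorical attribute: valid only if all examples share one value
genders = {r["gender"] for r in rows}
if len(genders) == 1:
    filters["gender"] = genders.pop()
# numeric attribute: the minimal filter is the tightest covering range
ages = [r["age"] for r in rows]
filters["age"] = (min(ages), max(ages))

print(filters)  # {'gender': 'Male', 'age': (50, 90)}
```

Any wider age range, such as [40, 100], remains valid but is not minimal.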

4 Probabilistic Abduction Model

We now revisit the problem of query intent discovery (Definition 2.1) and recast it based on our model of query intent (section 3). Specifically, Definition 2.1 aims to discover a query on the original database; this is reduced to an equivalent SPJ query Q = (QB, F) on the DB (as in Example 2.2). SQuID's task is to find the query Q* that maximizes the posterior probability Pr(Q | E), for a given set of example tuples E. In this section, we analyze the probabilistic model to compute this posterior, and break it down into three components.

person:
  id  name            gender  age
  1   Tom Cruise      Male    50
  2   Clint Eastwood  Male    90
  3   Tom Hanks       Male    60
  4   Julia Roberts   Female  50
  5   Emma Stone      Female  29
  6   Julianne Moore  Female  60

Example tuples:
  Tom Cruise
  Clint Eastwood

Figure 6: Sample database with example tuples

4.1 Notations and Preliminaries

Semantic context x. Observing a semantic property in a set of 10 examples is more significant than observing the same property in a set of 2 examples. We denote this distinction with the semantic context x, which encodes the size of the example set (|E|) where the semantic property is observed. We denote with X(E) the set of semantic contexts exhibited by the set of example tuples E.

Candidate SPJ query Q. Let F = {φ1, φ2, ..., φm} be the set of minimal valid filters (we omit the association strength θ in the filter notation when the context is clear), from hereon simply referred to as filters, where each φi encodes a semantic context xi. Our goal is to identify the subset of filters in F that best captures the query intent. A set of filters F′ ⊆ F defines a candidate query Q = (QB, F′), and E ⊆ Q(D) (from Lemma 3.1).

Filter event. A filter may or may not appear in a candidate query Q. With slight abuse of notation, we denote the presence of filter φi (φi ∈ F′) by the event φi, and its absence (φi ∉ F′) by the event ¬φi; the occurrence event of φi in Q is one of these two.


Notation        Description
(A, V, θ)       Semantic property defined by attribute A, value V, and association strength θ
φ(A, V, θ), φ   Semantic property filter for (A, V, θ)
F               Set of minimal valid filters
Q = (QB, F′)    SPJ query with semantic property filters F′ applied on base query QB
x               Semantic context of a filter φ for E
X(E)            Set of semantic contexts of E

Figure 7: Summary of notations

4.2 Modeling Query Posterior

We first analyze the probabilistic model for a fixed base query QB, and then generalize the model in Section 4.3. We use X as a shorthand for X(E). We model the query posterior Pr(Q | E) using Bayes' rule:

    Pr(Q | E) = Pr(E | Q) Pr(Q) / Pr(E)

By definition, X is the set of semantic contexts that E exhibits, so observing E implies observing X; therefore:

    Pr(Q | E) = Pr(E, X | Q) Pr(Q) / Pr(E, X)

Using the fact that Pr(E, X | Q) = Pr(E | X, Q) Pr(X | Q), and applying Bayes' rule on the prior Pr(E, X) = Pr(E | X) Pr(X), we get:

    Pr(Q | E) = Pr(E | X, Q) Pr(X | Q) Pr(Q) / (Pr(E | X) Pr(X))

Finally, E is conditionally independent of Q given the semantic context X, i.e., Pr(E | X, Q) = Pr(E | X). Thus:

    Pr(Q | E) = Pr(X | Q) Pr(Q) / Pr(X)                     (1)

In Equation 1, we have modeled the query posterior in terms of three components: (1) the semantic context prior Pr(X), (2) the query prior Pr(Q), and (3) the semantic context posterior Pr(X | Q). We proceed to analyze each of these components.

4.2.1 Semantic Context Prior

The semantic context prior Pr(X) denotes the probability that any set of example tuples of size |E| exhibits the semantic contexts X. This probability is not easy to compute analytically, as it involves computing a marginal over a potentially infinite set of candidate queries. In this work, we model the semantic context prior as proportional to the selectivity of the filters that encode the contexts in X:

    Pr(X) ∝ ∏_{x ∈ X} sel(φx),  where φx is the filter that encodes context x

Selectivity. The selectivity of a filter φ denotes the portion of tuples from the result of the base query that satisfy φ:

    sel(φ) = |Q(D)| / |QB(D)|,  where Q = (QB, {φ})

Similarly, for a set of filters F′, sel(F′) = |Q′(D)| / |QB(D)|, where Q′ = (QB, F′). Intuitively, a selectivity value close to 1 means that the filter is not very selective and most tuples satisfy it; a selectivity value close to 0 denotes that the filter is highly selective and rejects most of the tuples. For example, in Figure 6, (gender, Male) is more selective than (age, [50, 90]), with selectivities 0.5 and 0.83, respectively.

Selectivity captures the rarity of a semantic context: uncommon contexts are present in fewer tuples and thus appear in the output of fewer queries. Intuitively, a rare context has a lower prior probability of being observed, which supports modeling the prior as proportional to selectivity.
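Selectivity is straightforward to compute on the Figure 6 data. The sketch below encodes the table as (gender, age) tuples and evaluates the gender=Male and age-in-[50, 90] filters:

```python
# Selectivity sketch (Section 4.2.1) on the Figure 6 data: the fraction
# of the base query's output that satisfies a filter. Values near 1
# mean a weak filter; values near 0 mean a highly selective one.
people = [("Male", 50), ("Male", 90), ("Male", 60),
          ("Female", 50), ("Female", 29), ("Female", 60)]

def sel(pred):
    """Fraction of base-query tuples satisfying the predicate."""
    return sum(1 for p in people if pred(p)) / len(people)

sel_gender = sel(lambda p: p[0] == "Male")     # 3/6 = 0.5
sel_age    = sel(lambda p: 50 <= p[1] <= 90)   # 5/6 ~ 0.83
print(sel_gender, round(sel_age, 2))
```

The gender filter, being rarer, would contribute a lower semantic context prior than the age filter.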


4.2.2 Query Prior

The query prior Pr(Q) denotes the probability that Q is the intended query, prior to observing the example tuples. We model the query prior as the joint probability of all filter events, i.e., of the presence or absence of each filter φ ∈ F in Q. By further assuming filter independence (reasoning about database queries commonly assumes independence across selection predicates, which filters represent, even though it may not hold in general), we reduce the query prior to a product of the probabilities of the filter events:

    Pr(Q) = ∏_{φ ∈ F} Pr(eφ)

where eφ denotes the occurrence event of φ in Q, i.e., either φ (presence) or ¬φ (absence). The filter event prior Pr(φ) denotes the prior probability that filter φ is included in the intended query; Pr(¬φ) = 1 - Pr(φ) is the probability that it is excluded. We compute Pr(φ) for each filter as the product of a base prior and three discounting factors:

    Pr(φ) = α · δφ · λφ · ωφ

Here, α is a base prior parameter, common across all filters, and represents the default value for the prior. The other factors (δφ, λφ, and ωφ) reduce the prior, depending on characteristics of each filter. We describe these parameters next.

Domain selectivity impact δφ. Intuitively, a filter that covers a large range of values in an attribute's domain is unlikely to be part of the intended query. For example, if a user is interested in actors of a certain age group, that age group is more likely to be narrow (e.g., [41, 50]) than broad (e.g., [10, 90]). We penalize broad filters with the parameter δφ: it is equal to 1 for filters that do not exceed a predefined ratio in the coverage of their domain, and decreases for filters that exceed this threshold. (Details on the computation of δφ are in the Appendix.)

Association strength impact. Intuitively, a derived filter with low association strength is unlikely to appear in the intended query, as the filter denotes a weak association with the relevant entities: a filter asserting a weak association is less likely to represent a query intent than one asserting a strong association. We label filters with association strength lower than a threshold as insignificant, and reduce their prior through this parameter. All other filters, including basic filters, are unaffected by it.

Outlier impact. While the association strength impact characterizes a filter individually, the outlier impact characterizes it in consideration with other derived filters over the same attribute. Figure 8 demonstrates two cases of derived filters on the same attribute (genre), corresponding to two different sets of example tuples. In Case A, the Comedy and SciFi filters are more significant than the other filters of the same family (higher association strength). Intuitively, this corresponds to the intent to retrieve actors who appeared in mostly Comedy and SciFi movies. In contrast, Case B does not have filters that stand out, as all have similar association strengths: the actors in this example set are not strongly associated with particular genres, and thus, intuitively, this family of filters is not relevant to the query intent.

Figure 8: Two cases for derived semantic property filters. Top two filters of Case A are interesting, whereas no filter is interesting in Case B.

We model the outlier impact of a filter using the skewness of the association strength distribution within the family of derived filters sharing the same attribute. Our assumption is that highly-skewed, heavy-tailed distributions (Case A) are likely to contain the significant (intended) filters as outliers. We set the outlier impact to 1 for a derived filter whose association strength is an outlier in the association strength distribution of filters of the same family, as well as for basic filters; all other filters receive a reduced outlier impact (details are in the Appendix).
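The outlier test can be sketched as follows (the skewness and z-score thresholds, and the 0.1 penalty, are placeholder values chosen for illustration; SQuID's actual parameters are given in the Appendix):

```python
def skewness(xs):
    """Sample skewness of a distribution of association strengths."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    if sd == 0:
        return 0.0
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

def outlier_impacts(strengths, skew_threshold=0.5, z_threshold=1.2):
    """Impact 1.0 for filters whose association strength stands out
    within a skewed family (Case A); reduced impact 0.1 otherwise."""
    n = len(strengths)
    mean = sum(strengths) / n
    sd = (sum((x - mean) ** 2 for x in strengths) / n) ** 0.5
    skewed = skewness(strengths) > skew_threshold
    return [1.0 if skewed and sd > 0 and (x - mean) / sd > z_threshold
            else 0.1 for x in strengths]

case_a = outlier_impacts([30, 28, 3, 2, 2, 1])  # two filters stand out
case_b = outlier_impacts([5, 5, 4, 4, 4, 3])    # no filter stands out
```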

4.2.3 Semantic Context Posterior

The semantic context posterior is the probability that a set of example tuples of size n, sampled from the output of a particular query, exhibits the observed set of semantic contexts:

Two semantic contexts are conditionally independent given the query. Therefore:

Recall that each filter encodes a corresponding semantic context (Section 4.1). We assume that each semantic context is conditionally independent of the other filters, given the state of its own filter:


For each semantic context, we compute its probability based on the state of the corresponding filter event (included or excluded):

Filter included: By definition, all tuples in the output of the query exhibit the property of the filter. Hence, the probability of observing the context is 1.

Filter excluded: This is the probability that a set of tuples drawn uniformly at random from the output of the base query (the filter is not applied) exhibits the context. The portion of tuples in the base query output that exhibit the property of the filter is the filter's selectivity; therefore, the probability that all n example tuples exhibit the context is the selectivity raised to the power n.
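The two cases combine into the following sketch (a hypothetical representation, where each context maps to a pair of the corresponding filter's state and its selectivity):

```python
def context_posterior(contexts, n_examples):
    """Probability that n example tuples drawn from the query output
    exhibit every semantic context.  With the filter applied, every
    output tuple exhibits the context (probability 1); without it,
    each of the n uniformly drawn tuples exhibits the context
    independently with probability equal to the filter selectivity."""
    p = 1.0
    for included, sel in contexts.values():
        p *= 1.0 if included else sel ** n_examples
    return p

p = context_posterior(
    {"genre=Comedy": (True, 0.3),   # filter included in the query
     "country=USA": (False, 0.8)},  # coincidental context
    n_examples=4)                   # contributes 0.8 ** 4
```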

Using Equations (1)–(4), we derive the final form of the query posterior, up to a normalization constant:


4.3 Generalization

So far, our analysis focused on a fixed base query. Given an SPJ query, the underlying base query is deterministic. Hence:

We assume the prior to be equal for all valid base queries. We then find, over all valid base queries, the query that maximizes the query posterior.

5 Offline Abduction Preparation

In this section, we discuss system considerations to perform query intent discovery efficiently. SQuID employs an offline module that performs several pre-computation steps to make the database abduction-ready. The abduction-ready database (DB) augments the original database with derived relations that store associations across entities and precomputes semantic property statistics. Deriving this information is relatively straightforward; the contributions of this section lie in the design of the DB, the information it maintains, and its role in supporting efficient query intent discovery. We describe the three major functions of the DB.

Entity lookup. SQuID’s goal is to discover the most likely query, based on the user-provided examples. To do that, it first needs to determine which entities in the database correspond to the examples. SQuID uses a global inverted column index [51], built over all text attributes and stored in the DB, to perform fast lookups, matching the provided example data to entities in the database.

Semantic property discovery. To reason about intent, SQuID first needs to determine what makes the examples similar. SQuID looks for semantic properties within entity relations (e.g., gender appears in table person), other relations (e.g., genre appears in a separate table joining with movie through a key-foreign-key constraint), and other entities (e.g., the number of movies of a particular genre that a person has appeared in). The DB precomputes and stores such derived relations (e.g., persontogenre), as these frequently involve several joins and aggregations, and performing them at runtime would be prohibitive (the data cube [26] can serve as an alternative mechanism to model this data, but is much less efficient compared to the DB; details are in Appendix F.4). For example, SQuID computes the persontogenre relation (Figure 5) and stores it in the DB with the SQL query below:

Q6: CREATE TABLE persontogenre AS
    (SELECT person_id, genre_id, count(*) AS count
     FROM castinfo, movietogenre
     WHERE castinfo.movie_id = movietogenre.movie_id
     GROUP BY person_id, genre_id)

For the DB construction, SQuID relies only on very basic information about the data organization. It uses (1) the database schema, including the specification of primary and foreign key constraints, and (2) additional metadata, which a database administrator can provide once, specifying which tables describe entities (e.g., person, movie), and which tables and attributes describe direct properties of entities (e.g., genre, age). SQuID then automatically discovers fact tables, which associate entities and properties, by exploiting the key-foreign-key relationships. It also automatically discovers derived properties up to a certain predefined depth, using paths in the schema graph that connect entities to properties. Since the number of possible values for semantic properties is typically very small and remains constant as the number of entities grows, the DB grows linearly with the data size. In our implementation, we restrict derived property discovery to a depth of two fact tables (e.g., SQuID derives persontogenre through castinfo and movietogenre). SQuID can support deeper associations, but we found these are not common in practice. SQuID generally assumes that different entity types appear in different relations, which is the case in many commonly-used schema types, such as star, galaxy, and fact-constellation schemas. SQuID can perform inference in a denormalized setting, but would not be able to produce and reason about derived properties in those cases.

Smart selectivity computation. For basic filters involving categorical values, SQuID stores the selectivity of each value. However, for numeric ranges, the number of possible filters can grow quadratically with the number of possible values. SQuID avoids wasted computation and space by only precomputing the cumulative selectivities sel(x ≤ A ≤ v) for all values v of attribute A in the corresponding relation, where x is the minimum value of A. The DB can then derive the selectivity of a filter with any value range [v1, v2] as the difference of two precomputed cumulative selectivities: sel(v1 ≤ A ≤ v2) = sel(x ≤ A ≤ v2) − sel(x ≤ A < v1).

In the case of derived semantic properties, SQuID precomputes selectivities analogously, for every observed value of association strength of each derived property.
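The idea can be sketched as follows (illustrative code; the DB stores these cumulative selectivities as precomputed statistics rather than computing them per query):

```python
import bisect

class RangeSelectivity:
    """Cumulative selectivities sel(min <= A <= v) over the sorted
    values of a numeric attribute; any range selectivity is then
    the difference of two cumulative values."""
    def __init__(self, values):
        self.vals = sorted(values)
        self.n = len(values)

    def cumulative(self, v):
        """sel(min <= A <= v)"""
        return bisect.bisect_right(self.vals, v) / self.n

    def range_sel(self, lo, hi):
        """sel(lo <= A <= hi) = sel(A <= hi) - sel(A < lo)"""
        return self.cumulative(hi) - bisect.bisect_left(self.vals, lo) / self.n

ages = RangeSelectivity([25, 45, 50, 52, 70])
sel = ages.range_sel(45, 52)  # 3 of 5 tuples fall in [45, 52]
```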

6 Query Intent Discovery

During normal operation, SQuID receives example tuples from a user, consults the DB, and infers the most likely query intent (Definition 2.1). In this section, we describe how SQuID resolves ambiguity in the provided examples, how it derives their semantic context, and how it finally abduces the intended query.

6.1 Entity and Context Discovery

SQuID’s probabilistic abduction model (Section 4) relies on the set of semantic contexts and determines which of these contexts are intended vs. coincidental, through the inclusion or exclusion of the corresponding filters in the inferred query. To derive the set of semantic contexts from the examples, SQuID first needs to identify the entities in the DB that correspond to the provided examples.

6.1.1 Entity disambiguation

User-provided examples are not complete tuples, but often single-column values that correspond to an entity. As a result, there may be ambiguity that SQuID needs to resolve. For example, suppose the user provides the examples {Titanic, Pulp Fiction, The Matrix}. SQuID consults the precomputed inverted column index to identify the attributes (movie.title) that contain all the example values, and classifies the corresponding entity (movie) as a potential match. However, while the dataset contains unique entries for Pulp Fiction (1994) and The Matrix (1999), there are 4 possible mappings for Titanic: (1) a 1915 Italian film, (2) a 1943 German film, (3) a 1953 film by Jean Negulesco, and (4) the 1997 blockbuster film by James Cameron.

The key insight for resolving such ambiguities is that the provided examples are more likely to be alike. SQuID selects the entity mappings that maximize the semantic similarities across the examples. Therefore, based on the year and country information, it determines that Titanic corresponds to the 1997 film, as it is most similar to the other two (unambiguous) entities. In case of derived properties, e.g., nationality of actors appearing in a film, SQuID aims to increase the association strength (e.g., the number of such actors). Since the examples are typically few, SQuID can determine the right mappings by considering all combinations.
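A brute-force version of this disambiguation is feasible precisely because example sets are small; the similarity function below is a hypothetical stand-in for SQuID's semantic similarity:

```python
from itertools import product

def disambiguate(candidates, similarity):
    """candidates: one list of candidate entities per example value.
    Returns the combination that maximizes total pairwise similarity."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        score = sum(similarity(a, b)
                    for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return list(best)

# Toy similarity: two movies are alike if released within a decade.
sim = lambda a, b: 1 if abs(a["year"] - b["year"]) <= 10 else 0
titanic = [{"title": "Titanic", "year": y} for y in (1915, 1943, 1953, 1997)]
pulp = [{"title": "Pulp Fiction", "year": 1994}]
matrix = [{"title": "The Matrix", "year": 1999}]
mapping = disambiguate([titanic, pulp, matrix], sim)  # picks the 1997 Titanic
```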

6.1.2 Semantic context discovery

Once SQuID identifies the right entities, it then explores all the semantic properties stored in the DB that match these entities (e.g., year, genre, etc.). Since the DB precomputes and stores the derived properties, SQuID can produce all the relevant properties using queries with at most one join. For each property, SQuID produces semantic contexts as follows:

Basic property on categorical attribute. If all examples share value v for a basic property on attribute A, SQuID produces the semantic context A = v. For example, a user provides three movies: Dunkirk, Logan, and Taken. The attribute genre corresponds to a basic property for movies, and all these movies share the values Action and Thriller for this property. SQuID generates two semantic contexts: genre = Action and genre = Thriller.

Basic property on numerical attribute. If v1 and v2 are the minimum and maximum values, respectively, that the examples demonstrate for a basic property on attribute A, SQuID creates a semantic context on the range [v1, v2]. For example, if the example set contains three persons with ages 45, 50, and 52, SQuID will produce the context 45 ≤ age ≤ 52.

Derived property. If all examples share value v for a derived property on attribute A, SQuID produces a semantic context that also records the minimum association strength for v among all examples. For example, if the example set contains two persons who have appeared in 3 and 5 Comedy movies, respectively, SQuID will produce the context “appeared in at least 3 Comedy movies.”
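The rules above can be sketched for basic properties as follows (derived properties work analogously, additionally recording the minimum association strength; the representation here is illustrative):

```python
def discover_contexts(examples):
    """Derive candidate semantic contexts shared by all examples.
    Each example is a property -> value dict; numeric properties yield
    a range context, categorical ones an equality context (only when
    every example shares the value)."""
    contexts = []
    for prop in sorted(examples[0]):
        vals = [e[prop] for e in examples]
        if all(isinstance(v, (int, float)) for v in vals):
            contexts.append((prop, "range", (min(vals), max(vals))))
        elif len(set(vals)) == 1:
            contexts.append((prop, "equals", vals[0]))
    return contexts

people = [{"age": 45, "gender": "Female"},
          {"age": 50, "gender": "Female"},
          {"age": 52, "gender": "Female"}]
ctx = discover_contexts(people)
```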

6.2 Query Abduction

SQuID starts abduction by constructing a base query that captures the structure of the example tuples. Once it identifies the entity and attribute that matches the examples (e.g., person.name), it forms the minimal PJ query (e.g., SELECT name FROM person). It then iterates through the discovered semantic contexts and appends the corresponding relations to the FROM clause and the appropriate key-foreign key join conditions in the WHERE clause. Since the DB precomputes and stores the derived relations, each semantic context will add at most one relation to the query.

The number of candidate base queries is typically very small. For each base query, SQuID abduces the best set of filters to construct an SPJ query, by augmenting the WHERE clause of the base query with the corresponding selection predicates. (SQuID also removes from the query any joins that are not relevant to the selected filters.)

While the number of candidate SPJ queries grows exponentially in the number of minimal valid filters, we prove that we can make the decision to include or exclude each filter independently. Algorithm 1 iterates over the set of minimal valid filters and includes a filter only if its addition to the query increases the query posterior probability. Our query abduction algorithm runs in time linear in the number of minimal valid filters and is guaranteed to produce the query that maximizes the query posterior.

Input: set of entities, base query, set of minimal valid filters
Output: SPJ query, formed from the base query and a subset of the filters, that maximizes the query posterior
1  derive the semantic contexts of the example tuples
2  foreach filter in the set of minimal valid filters do
3      compute the query posterior with the filter included (Equation (5))
4      compute the query posterior with the filter excluded (Equation (5))
5      include the filter in the query if doing so yields the higher posterior
Algorithm 1: QueryAbduction
Theorem 1.

Given a base query, a set of examples, and a set of minimal valid filters, Algorithm 1 returns the query, formed by augmenting the base query with a subset of the minimal valid filters, that maximizes the query posterior.
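Because each filter's contribution to the posterior factors out, the per-filter decision rule of Algorithm 1 can be sketched as follows (the priors and selectivities are illustrative inputs, not values from the paper):

```python
def abduce_filters(filters, n_examples):
    """Greedy sketch of Algorithm 1: include a filter iff doing so
    increases the posterior.  Including a filter contributes its
    prior p_on (context probability 1); excluding it contributes
    (1 - p_on) times sel ** n, the chance that the shared context is
    a coincidence among n uniformly drawn example tuples."""
    chosen = []
    for name, p_on, sel in filters:
        if p_on > (1.0 - p_on) * sel ** n_examples:
            chosen.append(name)
    return chosen

# A rare context (Comedy) is abduced as intended; a very common one
# (USA) is dismissed as coincidental.
picked = abduce_filters([("genre=Comedy", 0.3, 0.2),
                         ("country=USA", 0.3, 0.9)], n_examples=5)
```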

7 Experiments

In this section, we present an extensive experimental evaluation of SQuID over three real-world datasets, with a total of 41 benchmark queries of varying complexities. Our evaluation shows that SQuID is scalable and effective, even with a small number of example tuples. Our evaluation extends to qualitative case studies over real-world user-generated examples, which demonstrate that SQuID succeeds in inferring the query intent of real-world users. We further demonstrate that, when used as a query reverse engineering system in a closed-world setting, SQuID outperforms the state-of-the-art. Finally, we show that SQuID is superior to semi-supervised PU-learning in terms of both efficiency and effectiveness.

7.1 Experimental Setup

We implemented SQuID in Java. All experiments were run on a machine with 12 2.66 GHz cores and 16 GB RAM, running CentOS 6.9 and PostgreSQL 9.6.6.

Datasets and benchmark queries

Our evaluation includes three real-world datasets and a total of 41 benchmark queries, designed to cover a broad range of intents and query structures. We summarize the datasets and queries below and provide detailed descriptions in Appendix D.

IMDb (633 MB): The dataset contains 15 relations with information on movies, cast members, film studios, etc. We designed a set of 16 benchmark queries varying the number of joined relations (1 to 8), the number of selection predicates (0 to 7), and the result cardinality (12 to 2512 tuples).

DBLP (22 MB): We used a subset of the DBLP data [2], with 14 relations, covering 16 years (2000–2015) of publications in 81 top conferences. We designed 5 queries varying the number of joined relations (3 to 8), the number of selection predicates (2 to 4), and the result cardinality (15 to 468 tuples).

Adult (4 MB): This is a single-relation dataset containing census data on individuals and their income brackets. We generated 20 queries with randomized attributes and predicate values, varying the number of selection predicates (2 to 7) and the result cardinality (8 to 1404 tuples).

Case study data

We retrieved several public lists (sources listed in Appendix D) with human-generated examples, and identified the corresponding intent. For example, a user-created list of “115 funniest actors” reveals a query intent (funny actors), and provides us with real user examples (the names in the list). We used this method to design 3 case studies: funny actors (IMDb), 2000s Sci-Fi movies (IMDb), and prolific database researchers (DBLP).


Metrics

We report query discovery time as a metric of efficiency. We measure effectiveness using precision, recall, and f-score. If Out(Q_I) is the output of the intended query and Out(Q_A) is the output of the query inferred by SQuID, precision is computed as |Out(Q_I) ∩ Out(Q_A)| / |Out(Q_A)| and recall as |Out(Q_I) ∩ Out(Q_A)| / |Out(Q_I)|; f-score is their harmonic mean. We also report the total number of predicates in the produced queries and compare them with the actual intended queries.
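Under these definitions, the computation is a straightforward set comparison; a minimal sketch (the output sets below are hypothetical):

```python
def effectiveness(intended, abduced):
    """Precision, recall, and f-score of the abduced query's output
    against the intended query's output."""
    overlap = len(intended & abduced)
    precision = overlap / len(abduced)
    recall = overlap / len(intended)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f

p, r, f = effectiveness(intended={1, 2, 3, 4}, abduced={2, 3, 4, 5})
```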

Figure 9: Average abduction time over the benchmark queries in (a) IMDb (top), DBLP (bottom), and (b) 4 versions of the IMDb dataset in different sizes.

To the best of our knowledge, existing QBE techniques do not produce SPJ queries without (1) a large number of examples, or (2) additional information, such as provenance. For this reason, we cannot meaningfully compare SQuID with those approaches. Removing the open-world requirement, SQuID is most similar to the QRE system TALOS [55] with respect to expressiveness and capabilities (Figure 3). We compare the two systems on query reverse engineering tasks in Section 7.5. We also compare SQuID against PU-learning methods [21] in Section 7.6.

Figure 10: SQuID achieves high accuracy with few examples in most benchmark queries ((a) IMDb, (b) DBLP).
Figure 11: SQuID rarely produces queries that are slower than the original with respect to query runtime ((a) IMDb, (b) DBLP).

7.2 Scalability

In our first set of experiments, we examine the scalability of SQuID with an increasing number of examples and varying dataset sizes. Figure 9(a) displays the abduction time for the IMDb and DBLP datasets as the number of provided examples increases, averaged over all benchmark queries in each dataset. Since SQuID retrieves semantic properties and computes the context for each example, the runtime should increase linearly with the number of examples, which is what we observe in practice.

Figure 9(b) extends this experiment to datasets of varied sizes. We generate three alternative versions of the IMDb dataset: (1) sm-IMDb (75 MB), a downsized version that keeps 10% of the original data; (2) bs-IMDb (1330 MB), which doubles the entities of the original dataset and creates associations among the duplicate entities (person and movie) by replicating their original associations; and (3) bd-IMDb (1926 MB), which is the same as bs-IMDb but also introduces associations between the original entities and the duplicates, creating denser connections (details of the data generation process are in Appendix D). SQuID’s runtime increases for all datasets with the number of examples, and, predictably, larger datasets face longer abduction times. Query abduction involves point queries that retrieve semantic properties of the entities using B-tree indexes. As the data size increases, the runtime of these queries grows logarithmically. SQuID is slower on bd-IMDb than on bs-IMDb: both datasets include the same entities, but bd-IMDb has denser associations, which results in additional derived semantic properties.

7.3 Abduction Accuracy

Intuitively, with a larger number of examples, abduction accuracy should increase: SQuID has access to more samples of the query output, and can more easily distinguish coincidental from intended similarities. Figure 10(a) confirms this intuition: precision, recall, and f-score increase, often very quickly, with the number of examples for most of our benchmark queries. We discuss a few particular queries here.

IQ4 & IQ11: These queries include a statistically common property (USA movies), and SQuID needs more examples to confirm that the property is indeed intended, not coincidental; hence, the precision converges more slowly.

IQ6: In many movies where Clint Eastwood was a director, he was also an actor. SQuID needs to observe sufficient examples to discover that the property role:Actor is not intended, so recall converges more slowly.

IQ10: SQuID performs poorly for this query. The query looks for actors appearing in more than 10 Russian movies that were released after 2010. While SQuID discovers the derived properties “more than 10 Russian movies” and “more than 10 movies released after 2010”, it cannot compound the two into “more than 10 Russian movies released after 2010”. This query is simply outside of SQuID’s search space, and SQuID produces a query with more general predicates than was intended, which is why precision drops.

IQ3: The query is looking for actresses who are Canadian and were born after 1970. SQuID successfully discovers the properties gender:Female, country:Canada, and birth year after 1970; however, it fails to capture the property of “being an actress,” which corresponds to having appeared in at least 1 film. The reason is that SQuID is programmed to ignore weak associations (a person associated with only 1 movie). This behavior can be fixed by adjusting the association strength parameter to allow for weaker associations.

Figure 12: Effect of disambiguation on IMDb
Execution time

While the accuracy results demonstrate that the abduced queries are semantically close to the intended queries, SQuID could be deriving a query that is semantically close, but more complex and costly to compute. In Figures 11(a) and 11(b) we graph the average runtime of the abduced queries and the actual benchmark queries. We observe that in most cases the abduced queries and the corresponding benchmarks are similar in execution time. Frequently, the abduced queries are faster because they take advantage of the precomputed relations in the DB. In a few cases (IQ1, IQ5, and IQ7) SQuID discovered additional properties that, while not specified by the original query, are inherent in all intended entities. For example, in IQ5, all movies with Tom Cruise and Nicole Kidman are also English-language movies released between 1990 and 2014.

Effect of entity disambiguation

Finally, we found that entity disambiguation never hurts abduction accuracy, and may significantly improve it. Figure 12 displays the impact of disambiguation for five IMDb benchmark queries, where disambiguation significantly improves the f-score.

Figure 13: Precision, recall, and f-score for (a) Funny actors (b) 2000s Sci-Fi movies (c) Prolific DB researchers
Figure 14: Both systems achieve perfect f-score on Adult (not shown). SQuID produces significantly smaller queries, often by orders of magnitude, and is often much faster.

7.4 Qualitative Case Studies

In this section, we present qualitative results on the performance of SQuID, through a simulated user study. We designed 3 case studies, by constructing queries and examples from human-generated, publicly-available lists.

Funny actors (IMDb). We created a list of names of 211 “funny actors”, collected from human-created public lists and Google Knowledge Graph (sources are in Appendix D), and used these names as examples of the query intent “funny actors.” Figure 13(a) demonstrates the accuracy of the abduced query over a varying number of examples. Each data point is an average across 10 different random samples of example sets of the corresponding size. For this experiment, we tuned SQuID to normalize the association strength, which means that the relevant predicate would consider the fraction of movies in an actor’s portfolio classified as comedies, rather than the absolute number.

2000s Sci-Fi movies (IMDb). We used a user-created list of 165 Sci-Fi movies released in the 2000s as examples of the query intent “2000s Sci-Fi movies”. Figure 13(b) displays the accuracy of the abduced query, averaged across 10 runs for each example set size.

Prolific database researchers (DBLP). We collected a list of database researchers who served as chairs, group leaders, or program committee members in SIGMOD 2011–2015 and selected the top 30 most prolific. Figure 13(c) displays the accuracy of the abduced query, averaged across 10 runs for each example set size.


In our case studies, there is no (reasonable) SQL query that models the intent well and produces an output that exactly matches our lists. Public lists have biases, such as not including less well-known entities even if these match the intent (to counter this bias, our case study experiments use popularity masks, derived from public lists, to filter the examples and the abduced query outputs; see Appendix D). In our prolific researchers case study, some well-known and prolific researchers may happen not to serve in service roles frequently, or their commitments may be in venues we did not sample. Therefore, it is not possible to achieve high precision, as the data is bound to contain, and retrieve, entities that do not appear on the lists, even if the query is a good match for the intent. For this reason, our precision numbers in the case studies are low. However, our recall rises quickly with enough examples, which indicates that the abduced queries converge to the correct intent.

(a) IMDb
(b) DBLP
Figure 15: SQuID produces queries with significantly fewer predicates than TALOS and is more accurate on both IMDb and DBLP. SQuID is almost always faster on IMDb, but TALOS is faster on DBLP.

7.5 Query Reverse Engineering

We present an experimental comparison of SQuID with TALOS [55], a state-of-the-art Query Reverse Engineering (QRE) system (other related methods either focus on more restricted query classes [33, 64] or do not scale to data sizes large enough for this evaluation [65, 57]; see Figure 3 for an overview). QRE systems operate in a closed-world setting, assuming that the provided examples comprise the entire query output. In contrast, SQuID assumes an open-world setting, and only needs a few examples. In the closed-world setting, SQuID is handicapped against a dedicated QRE system, as it does not take advantage of the closed-world constraint in its inference.

For this evaluation under the QRE setting, we use the IMDb and DBLP datasets, as well as the Adult dataset, on which TALOS was shown to perform well [55]. For each dataset, we provided the entire output of the benchmark queries as input to SQuID and TALOS. Since there is no need to drop coincidental filters for query reverse engineering, we set the parameters so that SQuID behaves optimistically (e.g., high filter prior, low association strength threshold; details on the system parameters are in Appendix E). We adopt the notion of instance-equivalent query (IEQ) from the QRE literature [55] to express that two queries produce the same set of results on a particular database instance. A QRE task is successful if the system discovers an IEQ of the original query (f-score = 1). For the IMDb dataset, SQuID was able to successfully reverse engineer 11 out of 16 benchmark queries. Additionally, in 4 cases where exact IEQs were not abduced, SQuID's queries generated output with high f-score. SQuID failed only for IQ10, which is a query that falls outside the supported query family, as discussed in Section 7.3. For the DBLP and Adult datasets, SQuID successfully reverse-engineered all benchmark queries.

Comparison with TALOS

We compare SQuID to TALOS on three metrics: number of predicates (including join and selection predicates), query discovery time, and f-score.

Adult. Both SQuID and TALOS achieved perfect f-score on the 20 benchmark queries. Figure 14 compares the systems in terms of the number of predicates in the queries they produce (top) and query discovery time (bottom). SQuID almost always produces simpler queries, close in the number of predicates to the original query, while TALOS queries contain more than 100 predicates in 20% of the cases.

SQuID is faster than TALOS when the input cardinality is low (around 100 tuples), and becomes slower for the largest input sizes. SQuID was not designed as a QRE system, and in practice, users rarely provide large example sets. SQuID’s focus is on inferring simple queries that model the intent, rather than covering all examples with potentially complex and lengthy queries.

IMDb. Figure 15(a) compares the two systems on the 16 benchmark queries of the IMDb dataset. SQuID produced better queries in almost all cases: our abduced queries were significantly smaller, and our f-score is higher for most queries. SQuID was also faster than TALOS for most of the benchmark queries. We now delve deeper into some particular cases.

For IQ1 (cast of Pulp Fiction), TALOS produces a query with f-score = 0.7. We attempted to provide guidance to TALOS through a system parameter that specifies which attributes to include in the selection predicates (which would give it an unfair advantage). TALOS first performs a full join among the participating relations (person and castinfo) and then performs classification on the denormalized table (with attributes person, movie, role). TALOS gives all rows referring to a cast member of Pulp Fiction a positive label (based on the examples), regardless of the movie that row refers to, and then builds a decision tree based on these incorrect labels. This is a limitation of TALOS, which SQuID overcomes by looking at the semantic similarities of the examples, rather than treating them simply as labels.

SQuID took more time than TALOS in IQ4, IQ7, and IQ15. The result sets of IQ4 and IQ15 are large, so this is expected. IQ7, which retrieves all movie genres, does not require any selection predicate. As a decision tree approach, TALOS has the advantage here, as it stops at the root and does not need to traverse the tree. In contrast, SQuID retrieves all semantic properties of the example tuples only to discover that either there is nothing common among them, or the property is not significant. While SQuID takes longer, it still abduces the correct query. These cases are not representative of QBE scenarios, as users are unlikely to provide a large number of example tuples or to have very general intents (PJ queries without selection).

DBLP. Figure 15(b) compares the two systems on the DBLP dataset. Here, SQuID successfully reverse engineered all five benchmark queries, whereas TALOS failed on two of them. TALOS also produced very complex queries, with 100 or more predicates in four of the cases. In contrast, SQuID’s abduced queries were orders of magnitude smaller, on par with the original queries. On this dataset, SQuID was slower than TALOS, though not by a large margin.

7.6 Comparison with learning methods

Query intent discovery can be viewed as a one-class classification problem, where the task is to identify the tuples that satisfy the desired intent. Positive and Unlabeled (PU) learning addresses this setting by learning a classifier from positive examples and unlabeled data in a semi-supervised manner. We compare SQuID against an established PU-learning method [21] on the 20 benchmark queries of the Adult dataset. The setting of this experiment conforms to the technique’s requirements [21]: the dataset comprises a single relation and the examples are chosen uniformly at random from the positive data.

Figure 16: (a) PU-learning needs a large fraction of the query results (positive data) as examples to achieve accuracy comparable to SQuID. (b) The total required time for training and prediction in PU-learning increases linearly with the data size. In contrast, abduction time for SQuID increases logarithmically.

Figure 16(a) compares the accuracy of SQuID and PU-learning using two different estimators, decision tree (DT) and random forest (RF). We observe that PU-learning needs a large fraction of the query result to achieve an f-score comparable to SQuID. PU-learning favors precision over recall, and the latter drops significantly when the number of examples is low. In contrast, SQuID achieves robust performance, even with few examples, because it can encode problem-specific assumptions (e.g., that there exists an underlying SQL query that models the intent, or that some filters are more likely than others); this cannot be done in straightforward ways for machine learning methods.

To evaluate scalability, we replicated the Adult dataset with a scale factor of up to 10x. Figure 16(b) shows that PU-learning becomes significantly slower than SQuID as the data size increases, whereas SQuID’s runtime remains largely unchanged. This is because SQuID does not directly operate on the data outside of the examples (the unlabeled data); rather, it relies on the DB, which contains a highly compressed summary of the semantic property statistics (e.g., filter selectivities) of the data. In contrast, PU-learning builds a new classifier over all of the data for each query intent discovery task. We provide more discussion on the connections between SQuID and machine learning approaches in Section 8.

8 Related Work

Query-by-Example (QBE) was an early effort to assist users without SQL expertise in formulating SQL queries [67]. Existing QBE systems [51, 48] identify relevant relations and joins in situations where the user lacks schema understanding, but are limited to project-join queries. These systems focus on the common structure of the example tuples, and do not try to learn the common semantics as SQuID does. QPlain [16] uses user-provided provenance of the example tuples to learn the join paths and improve intent inference. However, this assumes that the user understands the schema, content, and domain to provide these provenance explanations, which is often unrealistic for non-experts.

Set expansion is the problem corresponding to QBE in knowledge graphs [66, 58, 60]. SPARQLByE [17], built on top of a SPARQL QRE system [4], allows querying RDF datasets by annotated (positive/negative) example tuples. In semantic knowledge graphs, systems address the entity set expansion problem using maximal-aspect-based entity models, semantic-feature-based graph queries, and entity co-occurrence information [38, 29, 27, 43]. These approaches exploit the semantic context of the example tuples, but they cannot learn new semantic properties, such as aggregates involving numeric values, that are not explicitly stored in the knowledge graph, and they cannot express derived semantic properties without exploding the graph size. (For example, to represent “appearing in more than K comedies”, the knowledge graph would require one property for each possible value of K.)

Interactive approaches rely on relevance feedback on system-generated tuples to improve query inference and result delivery [1, 12, 18, 24, 37]. Such systems typically expect a large number of interactions, and are often not suitable for non-experts who may not be sufficiently familiar with the data to provide effective feedback.

Query Reverse Engineering (QRE) [59, 6] is a special case of QBE that assumes that the provided examples comprise the complete output of the intended query. Because of this closed-world assumption, QRE systems can build data classification models on denormalized tables [55], labeling the provided tuples as positive examples and the rest as negative. Such methods are not suitable for our setting, because we operate with few examples, under an open-world assumption. While a few QRE approaches [33] relax the closed-world assumption (the superset QRE problem), they are also limited to PJ queries, similar to existing QBE approaches. Most QRE methods are limited to narrow classes of queries, such as PJ [64, 33], aggregation without joins [53], or top-k queries [47]. REGAL+ [54] handles SPJA queries, but only considers the schema of the example tuples to derive the joins and ignores other semantics. In contrast, SQuID considers joining relations without attributes in the example schema (Example 1.1).

A few QRE methods do target expressive SPJ queries [65, 57], but they only work for very small databases and do not scale to the datasets used in our evaluation. Moreover, the user needs to specify the database in its entirety, which presumes complete schema knowledge, while SCYTHE [57] additionally expects hints from the user toward precise discovery of the constants in the query predicates.

Machine learning methods can model QBE settings as classification problems, and relational machine learning targets relational settings in particular [25]. However, while the provided examples serve as positive labels, QBE settings do not provide explicit negative examples. Semi-supervised statistical relational learning techniques [61] can learn from unlabeled and labeled data, but require an unbiased sample of negative examples. There is no straightforward way to obtain such a sample in our problem setting without significant user effort.

Our problem setting is better handled by one-class classification [40, 34], more specifically, Positive and Unlabeled (PU) learning [62, 39, 10, 21, 9, 44], which learns from positive examples and unlabeled data in a semi-supervised setting [14]. Most PU-learning methods assume denormalized data, but relational PU-learning methods do exist. However, all PU-learning methods rely on one or more strong assumptions [10] (e.g., all unlabeled entities are negative [46], examples are selected completely at random [21, 8], positive and negative entities are naturally separable [62, 39, 52], similar entities are likely from the same class [35]). These assumptions are a poor fit for our problem setting, where the example set is very small and may exhibit user bias, responses must be real-time, and intents may involve deep semantic similarity.

Other approaches that assist users in query formulation involve query recommendation based on collaborative filtering [20], query autocompletion [36], and query suggestion [22, 19, 31]. Another approach to facilitating data exploration is keyword-based search [3, 28, 63]. User-provided examples and interactions appear in other problem settings, such as learning schema mappings [50, 49, 13]. The query likelihood model in IR [41] resembles our technique, but does not exploit the similarity of the input entities.

9 Summary and Future Directions

In this paper, we focused on the problem of query intent discovery from a set of example tuples. We presented SQuID, a system that performs query intent discovery effectively and efficiently, even with few examples in most cases. The insights of our work rely on exploiting the rich information present in the data to discover similarities among the provided examples, and to distinguish between those that are coincidental and those that are intended. Our contributions include a probabilistic abduction model and the design of an abduction-ready database, which allow SQuID to capture both explicit and implicit semantic contexts. Our work includes an extensive experimental evaluation of the effectiveness and efficiency of our framework over three real-world datasets, case studies based on real user-generated examples and abstract intents, and comparison with the state-of-the-art in query reverse engineering (a special case of query intent discovery) and with PU-learning. Our empirical results highlight the flexibility of our method, as it is extremely effective in a broad range of scenarios. Notably, even though SQuID targets query intent discovery with a small set of examples, it outperforms the state-of-the-art in query reverse engineering in most cases, and is superior to learning techniques.

There are several possible improvements and research directions that can stem from our work, including smarter semantic context inference using log data, example recommendation to increase sample diversity and improve abduction, techniques for adjusting the depth of association discovery, on-the-fly DB construction, and efficient DB maintenance for dynamic datasets.


  • [1] A. Abouzied, D. Angluin, C. Papadimitriou, J. M. Hellerstein, and A. Silberschatz. Learning and verifying quantified boolean queries by example. In PODS, pages 49–60, 2013.
  • [2] S. Agarwal, A. Sureka, N. Mittal, R. Katyal, and D. Correa. DBLP records and entries for key computer science conferences, 2016.
  • [3] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: a system for keyword-based search over relational databases. In ICDE, pages 5–16, 2002.
  • [4] M. Arenas, G. I. Diaz, and E. V. Kostylev. Reverse engineering SPARQL queries. In WWW, pages 239–249, 2016.
  • [5] O. Arieli, M. Denecker, B. V. Nuffelen, and M. Bruynooghe. Coherent integration of databases by abductive logic programming. J. Artif. Intell. Res., 21:245–286, 2004.
  • [6] P. Barceló and M. Romero. The Complexity of Reverse Engineering Problems for Conjunctive Queries. In ICDT, volume 68, pages 7:1–7:17, 2017.
  • [7] C. A. Becker. Semantic context effects in visual word recognition: An analysis of semantic strategies. Memory & cognition, 8(6):493–512, 1980.
  • [8] J. Bekker and J. Davis. Positive and unlabeled relational classification through label frequency estimation. In ILP, pages 16–30, 2017.
  • [9] J. Bekker and J. Davis. Estimating the class prior in positive and unlabeled data through decision tree induction. In AAAI, pages 2712–2719, 2018.
  • [10] J. Bekker and J. Davis. Learning from positive and unlabeled data: A survey. CoRR, abs/1811.04820, 2018.
  • [11] L. E. Bertossi and B. Salimi. Causes for query answers from databases: Datalog abduction, view-updates, and integrity constraints. Int. J. Approx. Reasoning, 90:226–252, 2017.
  • [12] A. Bonifati, R. Ciucanu, and S. Staworko. Learning join queries from user examples. TODS, 40(4):24:1–24:38, 2016.
  • [13] A. Bonifati, U. Comignani, E. Coquery, and R. Thion. Interactive mapping specification with exemplar tuples. In SIGMOD, pages 667–682, 2017.
  • [14] O. Chapelle, B. Schlkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.
  • [15] G. DeJong. Prediction and substantiation: A new approach to natural language processing. Cognitive Science, 3(3):251–273, 1979.
  • [16] D. Deutch and A. Gilad. Qplain: Query by explanation. In ICDE, pages 1358–1361, 2016.
  • [17] G. I. Diaz, M. Arenas, and M. Benedikt. SPARQLByE: Querying RDF data by example. PVLDB, 9(13):1533–1536, 2016.
  • [18] K. Dimitriadou, O. Papaemmanouil, and Y. Diao. AIDE: An active learning-based approach for interactive data exploration. TKDE, 28(11):2842–2856, 2016.
  • [19] M. Drosou and E. Pitoura. Ymaldb: Exploring relational databases via result-driven recommendations. VLDBJ, 22(6):849–874, 2013.
  • [20] M. Eirinaki, S. Abraham, N. Polyzotis, and N. Shaikh. Querie: Collaborative database exploration. TKDE, 26(7):1778–1790, 2014.
  • [21] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In SIGKDD, pages 213–220. ACM, 2008.
  • [22] J. Fan, G. Li, and L. Zhou. Interactive sql query suggestion: Making databases user-friendly. In ICDE, pages 351–362, 2011.
  • [23] A. Fariha, S. M. Sarwar, and A. Meliou. SQuID: Semantic similarity-aware query intent discovery. In SIGMOD, pages 1745–1748, 2018.
  • [24] X. Ge, Y. Xue, Z. Luo, M. A. Sharaf, and P. K. Chrysanthis. Request: A scalable framework for interactive construction of exploratory queries. In Big Data, pages 646–655, 2016.
  • [25] L. Getoor and B. Taskar. Introduction to statistical relational learning, 2007.
  • [26] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data mining and knowledge discovery, 1(1):29–53, 1997.
  • [27] J. Han, K. Zheng, A. Sun, S. Shang, and J. R. Wen. Discovering neighborhood pattern queries by sample answers in knowledge base. In ICDE, pages 1014–1025, 2016.
  • [28] V. Hristidis and Y. Papakonstantinou. DISCOVER: keyword search in relational databases. In VLDB, pages 670–681, 2002.
  • [29] N. Jayaram, A. Khan, C. Li, X. Yan, and R. Elmasri. Querying knowledge graphs by example entity tuples. TKDE, 27(10):2797–2811, 2015.
  • [30] C. Ji, X. Zhou, L. Lin, and W. Yang. Labeling images by integrating sparse multiple distance learning and semantic context modeling. In ECCV (4), volume 7575 of Lecture Notes in Computer Science, pages 688–701. Springer, 2012.
  • [31] L. Jiang and A. Nandi. SnapToQuery: Providing interactive feedback during exploratory query specification. PVLDB, 8(11):1250–1261, 2015.
  • [32] A. C. Kakas. Abduction. In Encyclopedia of Machine Learning and Data Mining, pages 1–8. Springer, 2017.
  • [33] D. V. Kalashnikov, L. V. S. Lakshmanan, and D. Srivastava. FastQRE: Fast query reverse engineering. In SIGMOD, pages 337–350, 2018.
  • [34] S. S. Khan and M. G. Madden. A survey of recent trends in one class classification. In AICS, pages 188–197. Springer, 2009.
  • [35] T. Khot, S. Natarajan, and J. W. Shavlik. Relational one-class classification: A non-parametric approach. In AAAI, pages 2453–2459, 2014.
  • [36] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. SnipSuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22–33, 2010.
  • [37] H. Li, C. Chan, and D. Maier. Query from examples: An iterative, data-driven approach to query construction. PVLDB, 8(13):2158–2169, 2015.
  • [38] L. Lim, H. Wang, and M. Wang. Semantic queries by example. In EDBT, pages 347–358, 2013.
  • [39] B. Liu, Y. Dai, X. Li, W. S. Lee, and S. Y. Philip. Building text classifiers using positive and unlabeled examples. In ICDM, volume 3, pages 179–188. Citeseer, 2003.
  • [40] L. M. Manevitz and M. Yousef. One-class svms for document classification. JMLR, 2(Dec):139–154, 2001.
  • [41] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, 2008.
  • [42] T. Menzies. Applications of abduction: knowledge-level modelling. Int. J. Hum.-Comput. Stud., 45(3):305–335, 1996.
  • [43] S. Metzger, R. Schenkel, and M. Sydow. QBEES: query-by-example entity search in semantic knowledge graphs based on maximal aspects, diversity-awareness and relaxation. J. Intell. Inf. Syst., 49(3):333–366, 2017.
  • [44] F. Mordelet and J.-P. Vert. A bagging svm to learn from positive and unlabeled examples. Pattern Recognition Letters, 37:201–209, 2014.
  • [45] D. Mottin, M. Lissandrini, Y. Velegrakis, and T. Palpanas. Exemplar queries: A new way of searching. VLDBJ, 25(6):741–765, 2016.
  • [46] A. Neelakantan, B. Roth, and A. McCallum. Compositional vector space models for knowledge base completion. In ACL, pages 156–166, 2015.
  • [47] K. Panev, N. Weisenauer, and S. Michel. Reverse engineering top-k join queries. In BTW, pages 61–80, 2017.
  • [48] F. Psallidas, B. Ding, K. Chakrabarti, and S. Chaudhuri. S4: Top-k spreadsheet-style search for query discovery. In SIGMOD, pages 2001–2016, 2015.
  • [49] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. In SIGMOD, pages 73–84, 2012.
  • [50] A. D. Sarma, A. G. Parameswaran, H. Garcia-Molina, and J. Widom. Synthesizing view definitions from data. In ICDT, pages 89–103, 2010.
  • [51] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding, and L. Novik. Discovering queries based on example tuples. In SIGMOD, pages 493–504, 2014.
  • [52] A. Srinivasan. The aleph manual.
  • [53] W. C. Tan, M. Zhang, H. Elmeleegy, and D. Srivastava. Reverse engineering aggregation queries. PVLDB, 10(11):1394–1405, 2017.
  • [54] W. C. Tan, M. Zhang, H. Elmeleegy, and D. Srivastava. REGAL+: reverse engineering SPJA queries. PVLDB, 11(12):1982–1985, 2018.
  • [55] Q. T. Tran, C. Chan, and S. Parthasarathy. Query reverse engineering. VLDBJ, 23(5):721–746, 2014.
  • [56] D. Vallet, M. Fernández, P. Castells, P. Mylonas, and Y. Avrithis. Personalized information retrieval in context. In MRC, pages 16–17, 2006.
  • [57] C. Wang, A. Cheung, and R. Bodik. Synthesizing highly expressive sql queries from input-output examples. In PLDI, pages 452–466, 2017.
  • [58] R. C. Wang and W. W. Cohen. Language-independent set expansion of named entities using the web. In ICDM, pages 342–350, 2007.
  • [59] Y. Y. Weiss and S. Cohen. Reverse engineering spj-queries from examples. In PODS, pages 151–166, 2017.
  • [60] Word grabbag. http://wordgrabbag.com, 2018.
  • [61] R. Xiang and J. Neville. Pseudolikelihood EM for within-network relational learning. In ICDM, pages 1103–1108, 2008.
  • [62] H. Yu, J. Han, and K. C. Chang. PEBL: positive example based learning for web page classification using SVM. In SIGKDD, pages 239–248, 2002.
  • [63] Z. Zeng, M. Lee, and T. W. Ling. Answering keyword queries involving aggregates and GROUPBY on relational databases. In EDBT, pages 161–172, 2016.
  • [64] M. Zhang, H. Elmeleegy, C. M. Procopiuc, and D. Srivastava. Reverse engineering complex join queries. In SIGMOD, pages 809–820, 2013.
  • [65] S. Zhang and Y. Sun. Automatically synthesizing sql queries from input-output examples. In ASE, pages 224–234, 2013.
  • [66] X. Zhang, Y. Chen, J. Chen, X. Du, K. Wang, and J. Wen. Entity set expansion via knowledge graphs. In SIGIR, pages 1101–1104, 2017.
  • [67] M. M. Zloof. Query-by-example: The invocation and definition of tables and forms. In PVLDB, pages 1–24, 1975.

Appendix A Domain selectivity impact

We use the notion of domain coverage of a filter φ on an attribute A to denote the fraction of A’s domain that φ covers. For example, for the attribute age with domain [L, H], the filter age ≤ x has domain coverage (x − L) / (H − L). We use a threshold parameter to specify up to how much domain coverage the domain selectivity impact is not reduced; beyond that threshold, the impact decreases as domain coverage increases. A second parameter states how strongly we want to penalize a filter for having large domain coverage: a value of zero implies no penalty at all, i.e., all filters retain full impact, and larger values reduce the impact more for larger domain coverages. We compute the domain selectivity impact as a function of the filter’s domain coverage and these two parameters.
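The penalty described above can be sketched as follows. Since the exact equation is not reproduced in this excerpt, the decay shape, the threshold name `theta`, and the penalty exponent `gamma` are illustrative assumptions that merely satisfy the stated properties (no penalty below the threshold, stronger penalty as coverage or `gamma` grows, no penalty at all when `gamma` is zero):

```python
def domain_selectivity_impact(coverage, theta=0.2, gamma=2.0):
    """Illustrative domain selectivity impact of a filter.

    `coverage` is the fraction of the attribute's domain the filter covers.
    Up to the threshold `theta` the filter is not penalized (impact 1.0);
    beyond it, the impact decays, and `gamma` controls how sharply.
    gamma = 0 disables the penalty entirely. This is a hypothetical
    instantiation, not the paper's published equation.
    """
    if coverage <= theta:
        return 1.0
    return ((1.0 - coverage) / (1.0 - theta)) ** gamma

print(domain_selectivity_impact(0.1))  # below threshold: no penalty
print(domain_selectivity_impact(0.9))  # large coverage: heavy penalty
```

Any function with these monotonicity properties would serve the same role in scoring candidate filters.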

Appendix B Outlier impact

To compute the outlier impact of a filter φ, we first compute the skewness of the association strength distribution within the family of derived filters involving the same attribute, and then check whether φ is an outlier among them. We compute the sample skewness of the distribution, with sample mean μ and sample standard deviation σ, using the standard moment-based formula: skewness = (1/n) Σᵢ ((xᵢ − μ) / σ)³.

Title | Source
IMDb dataset | https://datasets.imdbws.com/
DBLP dataset | https://data.mendeley.com/datasets/3p9w84t5mr
Adult dataset | https://archive.ics.uci.edu/ml/datasets/adult
Physically strong actors | https://www.imdb.com/list/ls050159844/
Top 1000 Actors and Actresses* | http://www.imdb.com/list/ls058011111/
Sci-Fi Cinema in the 2000s | http://www.imdb.com/list/ls000097375/
Funny Actors | https://www.imdb.com/list/ls000025701/
100 Random Comedy Actors | https://www.imdb.com/list/ls000791012/
BEST COMEDY ACTORS | https://www.imdb.com/list/ls000076773/
115 funniest actors | https://www.imdb.com/list/ls051583078/
Top 35 Male Comedy Actors | https://www.imdb.com/list/ls006081748/
Top 25 Funniest Actors Alive | https://www.imdb.com/list/ls056878567/
the top funniest actors in hollywood today | https://www.imdb.com/list/ls007041954/
Google knowledge graph: Actors: Comedy | https://www.google.com/search?q=funny+actors
The Best Movies of All Time* | https://www.ranker.com/crowdranked-list/the-best-movies-of-all-time
Top H-Index for Computer Science & Electronics* | http://www.guide2research.com/scientists/
* Used as popularity mask
Figure 17: Source of datasets


IMDb & variations:
IMDb: DB size 633 MB; #Relations 15; Precomputed DB size 2310 MB; Precomputation time 150 min; Rel. Card.: person 6,150,949, movie 976,719, castinfo 14,915,325
bd-IMDb: DB size 1926 MB; #Relations 15; Precomputed DB size 5971 MB; Precomputation time 370 min; Rel. Card.: person 12,301,898, movie 1,953,438, castinfo 59,661,300
bs-IMDb: DB size 1330 MB; #Relations 15; Precomputed DB size 4831 MB; Precomputation time 351 min; Rel. Card.: person 12,301,898, movie 1,953,438, castinfo 29,830,650
sm-IMDb: DB size 75 MB; #Relations 15; Precomputed DB size 317 MB; Precomputation time 14 min; Rel. Card.: person 65,865, movie 335,705, castinfo 1,364,890

DBLP: DB size 22 MB; #Relations 14; Precomputed DB size 98 MB; Precomputation time 42 min; Rel. Card.: author 126,094, publication 148,521, authortopub 416,445
Adult: DB size 4 MB; #Relations 1; Precomputed DB size 5 MB; Precomputation time 3 min; Rel. Card.: adult 32,561

Figure 18: Description of the IMDb, DBLP, and Adult datasets

A distribution is skewed if its skewness exceeds a threshold. For outlier detection, we use the mean and standard deviation method: for sample mean μ, sample standard deviation σ, and a constant k, an element x is an outlier if x > μ + kσ. When the distribution is too small for skewness to be defined, we assume all elements to be outliers. We compute the outlier impact of a filter based on this outlier test.
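The skewness and outlier checks above can be sketched in a few lines. The moment-based skewness formula and the default threshold constant `k` are assumptions of this sketch:

```python
def sample_skewness(xs):
    """Moment-based sample skewness: mean cubed standardized deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    if sigma == 0:
        raise ValueError("skewness undefined for zero-variance samples")
    return sum(((x - mu) / sigma) ** 3 for x in xs) / n

def is_outlier(x, xs, k=1.5):
    """Mean-and-standard-deviation outlier test: x > mu + k * sigma."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((v - mu) ** 2 for v in xs) / n) ** 0.5
    return x > mu + k * sigma
```

For a right-skewed family of association strengths such as [1, 1, 1, 1, 10], the skewness is large and positive, and the strong association (10) is flagged as an outlier while the weak ones are not.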

Appendix C Proof of Theorem 1


We prove Theorem 1 by contradiction. Let F* be the optimal set of filters, i.e., the corresponding query Q* is the most likely query, and suppose that F* is minimal for this optimality: no proper subset of F* yields a query as likely as Q*. Now suppose that Algorithm 1 returns a suboptimal query Q with filter set F, i.e., Q is strictly less likely than Q*. Since Q is suboptimal, F ≠ F*; therefore at least one of the following two cases must hold:

Case 1: There exists a filter φ ∈ F* such that φ ∉ F. Since Algorithm 1 did not include φ, it must be the case that including φ does not strictly increase the query’s likelihood according to Equation 5. Therefore, we can exclude φ from F* to obtain a smaller set that is at least as likely, which contradicts our assumption about the optimality and minimality of F*.

Case 2: There exists a filter φ ∈ F such that φ ∉ F*. Since Algorithm 1 included φ, it must be the case that including φ strictly increases the query’s likelihood according to Equation 5. Therefore, we can add φ to F* to obtain a strictly more likely set, which again contradicts our assumption about the optimality of F*.

Hence, Q cannot be suboptimal, which implies that Algorithm 1 returns the most likely query. ∎

Note that, in the special case where including a filter leaves the query’s likelihood unchanged, Algorithm 1 drops the filter, following the Occam’s razor principle to keep the query as simple as possible. This, however, never yields a query that is strictly less likely than the best query.

Appendix D Datasets and Benchmark Queries

We collected the datasets from various sources, listed in Figure 17. A detailed description of the datasets is given in Figure 18. We report the cardinalities of the large relations to provide a sense of the data and their associations.

D.1 Alternative IMDb Datasets

For the scalability experiment, we generated 3 versions of the IMDb database. To obtain a downsized database, sm-IMDb, we dropped persons with fewer than 2 affiliated movies and/or with too much missing semantic information, as well as movies that have no cast information. We produced two upsized databases: one with dense associations, bd-IMDb, and one with sparse associations, bs-IMDb. bd-IMDb contains duplicate entries for all movies, persons, and companies (with different primary keys), and the associations among persons and movies are duplicated to produce denser associations. For example, if P1 acted in M1 in IMDb, i.e., (P1,M1) exists in IMDb’s castinfo, we added a duplicate person P2, a duplicate movie M2, and 3 new associations, (P1,M2), (P2,M2), and (P2,M1), to bd-IMDb’s castinfo. For bs-IMDb, we only duplicated the old associations, i.e., we added P2 and M2 in a similar fashion, but only added (P2,M2) to castinfo.
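The duplication scheme above can be sketched as follows; the `_dup` suffix for the duplicate primary keys is an illustrative convention:

```python
def duplicate_associations(castinfo, dense=True):
    """Upsize a (person, movie) association table as for bd-/bs-IMDb.

    For each original pair (p, m), a duplicate person p' and a duplicate
    movie m' are introduced. The dense variant (bd-IMDb) adds the three
    associations (p, m'), (p', m'), and (p', m); the sparse variant
    (bs-IMDb) adds only (p', m').
    """
    dup = lambda x: x + "_dup"  # illustrative duplicate-key convention
    out = list(castinfo)
    for p, m in castinfo:
        if dense:
            out += [(p, dup(m)), (dup(p), dup(m)), (dup(p), m)]
        else:
            out.append((dup(p), dup(m)))
    return out
```

This matches the reported castinfo cardinalities: bd-IMDb is 4x the original (59,661,300 = 4 × 14,915,325) and bs-IMDb is 2x (29,830,650).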

D.2 Benchmark Queries

We discuss the benchmark queries for all datasets here. Figures 21 and 21 display benchmark queries that we use to run different experiments on the IMDb and DBLP datasets, respectively. The tables show the query intents, details of the corresponding queries in SQL (number of joining relations (J) and selection predicates (S)), and the result set cardinality. Figure 22 shows 20 benchmark queries along with their result set cardinality for the Adult dataset.

Appendix E SQuID Parameters

We list the four most important SQuID parameters in Figure 21, along with brief descriptions. We now discuss the impact of these parameters on SQuID and provide a few empirical results.

The base filter prior parameter defines SQuID’s tendency toward including filters. A small value makes SQuID pessimistic about including a filter, and thus favors recall; a large value makes SQuID optimistic about including a filter, which favors precision. A low value helps get rid of coincidental filters, particularly with very few example tuples. However, with sufficient example tuples, coincidental filters eventually disappear, and the parameter’s effect diminishes. Figure 26 shows the effect of varying this parameter for a few benchmark queries on the IMDb dataset. While a low value favors some queries (IQ2, IQ16), it degrades accuracy for others (IQ3, IQ4, IQ11), where a high value works better. It is a tradeoff, and we found empirically that a moderate value (e.g., 0.1) works best on average.

The domain coverage penalty parameter specifies SQuID’s leniency toward filters with large domain coverage. A low value penalizes filters with large domain coverage less, and a high value penalizes them more. Figure 26 shows the effect of varying this parameter. A very low value favors some queries (IQ3, IQ4, IQ11) but degrades accuracy for others (IQ2, IQ16), where a high value works better. Like the base filter prior, it is a tradeoff, and we found empirically that moderate values (e.g., 2) work well on average.

The association strength threshold is required to define the association strength impact (Section 4.2.2). Figure 26 illustrates the effect of different threshold values on the benchmark query IQ5 on the IMDb dataset. The figure shows that with very few example tuples, a high threshold is preferable, since it helps drop coincidental filters with weak associations. As with the other parameters, the effect diminishes as the number of example tuples increases.

The skewness threshold is required to classify an association strength distribution as skewed or not (Appendix B). Figure 26 illustrates the effect of different threshold values on the benchmark query IQ1 on the IMDb dataset. One setting refers to the experiment where the outlier impact was not taken into account (i.e.,