Schema2QA: Answering Complex Queries on the Structured Web with a Neural Model

Virtual assistants today require every website to submit skills individually into their proprietary repositories. The skill consists of a fixed set of supported commands and the formal representation of each command. The assistants use the contributed data to create a proprietary linguistic interface, typically using an intent classifier. This paper proposes an open-source toolkit, called Schema2QA, that leverages the Schema.org markup found in many websites to automatically build skills. Schema2QA has several advantages: (1) Schema2QA handles compositional queries involving multiple fields automatically, such as "find the Italian restaurant around here with the most reviews", or "what W3C employees on LinkedIn went to Oxford"; (2) Schema2QA translates natural language into executable queries on the up-to-date data from the website; (3) natural language training can be applied to one domain at a time to handle multiple websites using the same representations. We apply Schema2QA to two different domains, showing that the skills we built can answer useful queries with little manual effort. Our skills achieve an overall accuracy between 74% and 78%, and can answer questions that span three or more properties with 65% accuracy; new domains can be supported by transferring knowledge. The open-source Schema2QA lets each website create and own its linguistic interface.




1. Introduction

The adoption of virtual assistants grew at an unprecedented rate, reaching 50 million American adults in just the first two years (Perez, 2018). Alexa and Google Assistant are rapidly growing their third-party skill platforms so consumers can access different websites and IoT devices by voice (Kinsella, 2019). Skill builders supply information on how their sites can be accessed to each of the assistant platforms, with sample natural language invocations; the virtual assistant platforms use proprietary linguistic technology to support arbitrarily phrased commands.

There are over 1.5 billion websites on the world wide web today. Does each company have to enter a skill on each of the platforms it wishes to run on? Given the high cost of data acquisition needed to develop natural language technology and the virality of network effects, would there be oligopolistic virtual assistants that control separate proprietary linguistic webs? Would non-profit organizations be well served? Would rare natural languages be supported?

Our overall goal is to create one open non-proprietary linguistic web. This paper presents Schema2QA, an open-source virtual assistant skill authoring tool that produces fully custom natural language models that answer questions on websites, based on markup.

1.1. Our Solution

Linguistic Questions & Answers (Q&A) on websites. Ideally, a linguistic interface can be created for each website automatically. We take a step towards this goal by leveraging the Schema.org markup included in millions of websites, originally added to facilitate web search. Schema2QA can generate a neural semantic parser that translates natural language into queries on the Schema.org markup on websites, without any real user training data.

Domain-by-domain improvement. Schema2QA lets developers improve the quality, one domain at a time, with relatively little manual effort. Because this tool generates open-source neural models that companies can own and use in their own websites, apps, or phone services, it encourages open collaboration to build a neural model that leverages data from many domains and languages.

Complex queries. The Linguistic User Interface (LUI) is much more flexible than the Graphical User Interface (GUI). We can easily ask for information that requires joining arbitrary fields or performing computation. Table 1 shows a sample of questions that can be answered by Schema2QA; in contrast, today’s commercial assistants cannot answer most of these questions. Schema2QA supports complex queries effectively by training a neural model with millions of synthesized sentences, less than 1% of which are paraphrased by crowdsource workers.

Open web. Search engines can crawl the markup in websites and use Schema2QA to answer questions across many websites at once, eliminating the need for proprietary centralized skill repositories. By making all the source code, training data, and neural models publicly available, we wish to encourage open collaboration in LUI research.

Figure 1. Schema2QA pipeline

1.2. Approach

To answer questions using up-to-date web information, it is not possible to train our system with question and answer pairs. Instead, we create a semantic parser that automatically translates natural language queries into a formal query language. To address the difficulty in acquiring training data, we use grammar-based rules to generate pairs of sentences and their formal counterparts, and ask crowdsource workers to paraphrase the natural language sentences (Wang et al., 2015). To handle complex queries, we write templates to generate a wide variety of synthesized data, which is used in conjunction with paraphrase data to train a neural semantic parser (Campagna et al., 2019).

Previous techniques require significant effort to create one skill at a time. To handle the millions of available websites, we wish to provide an automatic baseline solution that can handle all the websites automatically, while allowing incremental improvements one domain at a time. With the help of Schema.org, we can train semantic models for each domain, rather than for each skill or website. Illustrated in Fig. 1, our approach consists of the following:

  1. We create DBTalk, a high-level query language optimized for translation from natural language. It supports common computed functions such as aggregation in addition to typical join, selection and projection operations.

  2. We extend Schema.org to the use case of natural language Q&A, and create NL-Schema, which is a natural-language-friendly representation.

  3. Templates are used to generate training data; each training sample is a pair of a natural language sentence and its corresponding DBTalk query. There are two kinds of templates: generic templates and domain-specific templates. The former are hand-coded once and for all (for each targeted natural language); the latter are automatically generated from NL-Schema.

  4. We use Genie (Campagna et al., 2019), which accepts the definition of DBTalk, the NL-Schema, the templates, as well as a parameter dataset collected from existing websites, and produces synthetic training data, part of which are paraphrased, to generate a neural semantic parser. Note that we can also supply additional manual natural language templates to improve the quality of specific domains.

  5. The semantic parser produces queries which are executed on a website database index to return results. Schema2QA automatically builds the index from the website data, and can be called by the website author when the data changes, or invoked periodically to crawl new website data.

1.3. Contributions

The contributions of this paper include:

  1. A high-level query language, called DBTalk, optimized for translation from natural language.

  2. The NL-Schema data model for structured website information, derived from Schema.org, and adapted for building natural language QA skills.

  3. The Schema2QA tool that allows website owners to build skills for their website, in a fully automated fashion for existing domains. Additionally, improvements requiring only a small per-domain effort can be made.

  4. We have demonstrated an end-to-end system by incorporating the results of applying Schema2QA on the domains of restaurants, people and hotels in the open-source Almond virtual assistant.

  5. A large dataset with more than 1.3M training sentences for restaurant questions, and 900,000 training sentences for questions about people. This dataset also includes 215 real-world crowdsourced restaurant questions and 234 questions about people, which refer to one to three properties in a single question. This test set can serve as a benchmark for future QA systems based on Schema.org.

  6. Experimental results showing that skills built from Schema2QA can understand a variety of real-world questions, with an accuracy between 74% and 78%. Furthermore, Schema2QA can support querying across more than 750 websites that use Schema.org today.

1.4. Outline

The rest of the paper is organized as follows. Section 2 presents an overview of the system. Sections 3 and 4 present the design of DBTalk and NL-Schema, respectively. Section 5 discusses how we use templates to generate the training set. Lastly, we present experimental results, related work, and conclusions.

2. Overview

Our goal is to provide a basic automatically generated semantic parser that can handle questions in each domain, and give domain experts the ability to improve the accuracy of the system. Here we present a high-level overview with the help of a couple of examples drawn from questions about restaurants, as shown in Fig. 1.

2.1. DBTalk Design

We introduce DBTalk as the formal query language. It is designed to handle this important use case with as little extraneous notation as possible. Here is an example of a DBTalk query to find the restaurants that serve Chinese cuisine:
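In the table-filter form defined in Section 3, such a query might look like the following sketch (the surface syntax here is illustrative, not necessarily Schema2QA's exact notation):

```
Restaurant, servesCuisine == "Chinese"
```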

skill Restaurant {
  Thing(
    name: String #_[nnp=["name"]],
    image: Picture #_[nnp=["image", "picture"]],
    description: String #_[nnp=["description"]], ...
  );
  Place extends Thing(
    aggregateRating: {
      reviewCount: Number #_[nnp=["review count"]],
      ratingValue: Number #_[
        jj=[{prefix="rated", suffix="star"}]
      ]
    }, ...
  );
  Organization extends Thing();
  LocalBusiness extends Place, Organization();
  FoodEstablishment extends LocalBusiness(
    servesCuisine: String #_[
      nnp=["cuisine", "food type"],
      vb=[{prefix="serves", suffix="cuisine"}]
    ], ...
  );
  Restaurant extends FoodEstablishment();
}
Figure 2. NL-Schema representation for Restaurant

2.2. NL-Schema Design

The NL-Schema data model is adapted from Schema.org for the purpose of answering natural language questions. NL-Schema uses a relational model that supports class hierarchies, fixed tables and fixed fields of a single static type, whereas Schema.org uses a graph-based representation with union-typed properties. The example of a Restaurant class is shown in Fig. 2.

NL-Schema is used to facilitate the generation of training data for our neural semantic parser. Each training sample consists of a natural language sentence and the corresponding DBTalk code.

“Which restaurants serve Chinese cuisine?”

Thus, to establish the relationship between natural language and a query that uses a certain field, each field is annotated with a canonical representation and a part-of-speech (POS). For example, the annotation in Fig. 2 says that the phrases “cuisine” and “serves … cuisine” can be generated as a noun phrase (nnp) and a verb phrase (vb), respectively, and that they both refer to the property “servesCuisine”.

Schema2QA includes a tool to automatically transform the whole Schema.org representation into NL-Schema. Additional annotations can be supplied manually if desired to improve its quality.

2.3. Templates for Training Data Generation

Templates are grammar production rules that describe how to generate a sentence and its formal representation. The generic templates cover all the domains in Schema.org. The template shown in Fig. 1 can be used generically to generate “Which restaurants serve Chinese cuisine”, “Which person received Turing awards”, and “Which hotels offer free parking”.

Domain-specific templates further increase the variety of training samples generated. For example, we can refer to restaurants serving Chinese cuisine simply as Chinese restaurants. This can be captured by a domain-specific template, such as

  $cuisine “restaurants” ⇒ Restaurant, servesCuisine == $cuisine

which generates “Chinese restaurants”, “Mexican restaurants”, etc.

The generic templates are curated by hand, but to handle the scale of the Schema.org definition, we have developed a data-driven template generator to automatically create domain-specific templates. As with annotations, manual domain-specific templates can be added. For example, developers can add the following template to query the top-rated restaurants:

“top rated restaurants”

This template will map sorting Restaurant by “aggregateRating.ratingValue” to the phrase “top rated restaurants”, which cannot be produced by Schema2QA based on generic templates.
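Using the sorting and indexing operators of Section 3.2, this manual template might be sketched as follows (notation illustrative):

```
"top rated" table ⇒ sort aggregateRating.ratingValue desc of table
```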

2.4. Generating a semantic parser for a domain

Here is how we can create a semantic parser that translates natural language questions on a given domain into formal DBTalk queries:

  1. Download the Schema.org markup of representative websites in the domain. We may need to download information from just one website, or even a fraction thereof, if a large aggregator exists, such as Yelp for restaurant reviews.

  2. Apply the NL-Schema converter on the information for that domain to generate NL-Schema.

  3. Apply the automatic template generator on the generated NL-Schema to create domain-specific templates.

  4. Extract the list of values for each property from the downloaded data.

  5. Obtain a set of validation data by hand-labeling crowdsourced questions.

  6. Apply Genie on the DBTalk definition, the generic templates, and all the results obtained in steps 2 to 5 above. Genie uses all this information to synthesize data, a fraction of which is paraphrased by crowdsource workers. Both the synthetic and paraphrase data sets are used to train a Multi-Task Question Answering Network (MQAN) (McCann et al., 2018) model to translate natural language into DBTalk. MQAN is an encoder-decoder model (Sutskever et al., 2014) that uses both LSTM (Hochreiter and Schmidhuber, 1997) and self-attention (Vaswani et al., 2017) layers to encode the input question, then produces the output query token by token. More details of the neural model are described elsewhere (McCann et al., 2018).

The domain parser that this process produces can handle natural language queries for any website in the same domain. Developers are expected to use the validation set to identify weaknesses in the synthesized data, and augment the data with additional annotations as well as manually written domain-specific templates.

2.5. Schema2QA System

Once the parser is generated, we can use it to answer questions for any website in that domain that uses Schema.org. We apply the translated DBTalk queries on databases constructed from the markup of the websites of interest. If multiple websites are available in the database, the generated skill can answer aggregate queries, such as “which is the cheapest of all the restaurants” among the downloaded sites.

3. The DBTalk Query Language

Schema2QA translates natural language into a formal query language called DBTalk, which was created to facilitate translation with a neural model. As shown by Genie (Campagna et al., 2019), it is important that the formal language resembles natural language. We follow the same principles in the design of DBTalk.

DBTalk assumes a relational database model. DBTalk queries have the general form

  [fn] of (table, filter)

where table is the type of entity being retrieved (similar to a table name in SQL), filter applies a selection predicate that can make use of the fields in the table, and fn is an optional list of field names to project on. The full grammar is shown in Fig. 3.

Figure 3. The formal definition of DBTalk: grammar productions for selection (sel), projection (pr), aggregation (agg), computation (cmp), sorting (sort), indexing (idx), join (join), expressions (expr), and comparison operators (cmpop), over table name identifiers tn and field name identifiers fn.

Here is an example of a DBTalk query, corresponding to the query “who wrote the 0 star review for Din Tai Fung?”
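A plausible rendering, assuming the Schema.org properties itemReviewed, reviewRating, and author (a sketch, not necessarily the exact DBTalk surface syntax):

```
[author] of (Review, itemReviewed == "Din Tai Fung" && reviewRating.ratingValue == 0)
```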


3.1. Type System

DBTalk uses a static type system similar to ThingTalk, a previously proposed programming language for virtual assistants (Campagna et al., 2017). This type system includes domain-specific types like Location, Measure, Date and Time. It also includes high-level concepts such as “here”.

All DBTalk tables implicitly include an “id” column, which can be used to compare two rows for equality and to join tables. Furthermore, DBTalk has native support for named entities, such as people, brands, countries, etc. DBTalk tables that contain named entities have a “name” column; for those tables, a query can look up a specific row by specifying the name and type of the entity.

Unlike ThingTalk, DBTalk has record types, which we introduce to avoid creating tables for objects that represent structured values and have no identity of their own. Fields in each record type are recursively flattened so they can be accessed as fields in the table that uses the record type. We do not support recursive record types.

Compared to SQL, we also introduce array types to express joins in a more intuitive way, and to avoid the use of glue tables for many-to-many joins.

3.2. Sorting, Aggregation, Computation

DBTalk queries support the simple computation operators found in SQL for sorting a table, indexing and slicing a table, aggregating all results, and computing a new field for each row. These operators can be combined: for example, the distance operator can be used to compute the distance of a place from another location. Combined with sorting and indexing, this allows us to express the query “what is the nearest restaurant?” as:

  (sort distance asc of (compute distance(geo, here) of Restaurant))[1]

The query reads as: select all restaurants, compute the distance between the field geo and the user’s current location (and by default, store it in the distance field), sort by increasing distance, and then choose the first result (with index 1).

3.3. Limitations of DBTalk

To keep DBTalk simple and understandable to end users, we omit set operations, including union, intersection, and set difference. We also omit group-by operations, subqueries, and quantifiers. While providing limited function, this design does guarantee that the language is canonicalizable. There is a unique canonical syntactic form for every query, which has been shown to improve accuracy of semantic parsers (Campagna et al., 2019).

4. Adapting Schema.org for Natural Language Processing

Schema.org is a markup vocabulary created to help search engines understand and index structured data across the web. Here, we introduce Schema.org and describe how we extend it to natural language queries.

4.1. Data Model of Schema.org

Schema.org is based on RDF, and uses a graph data model, where nodes represent objects. Nodes are connected by properties and are grouped in classes. Classes are arranged in a multiple-inheritance hierarchy, where each class can be a subclass of multiple other classes, and the “Thing” class is the superclass of all classes. There is also a parallel hierarchy where the literal data types are defined. By convention, all class names start with an uppercase letter, while property names start with a lowercase letter.

Each property’s domain consists of one or more classes that can possess that property. The same property can be used in separate domains; e.g., “Person” and “Organization” classes both use the “owns” property. Subclasses of a class in the domain of a certain property can also make use of that property.

Each property’s range consists of one or more classes or primitive types. Additionally, as having any data is considered better than none, the “Text” type (free text) is always implicitly included in a property’s range. For properties where free text is the recommended or only expected type, e.g. the “name” property, “Text” is explicitly declared as the property type.

Schema.org is organized in layers, with a “core” layer representing the portion agreed upon by all users, a “pending” layer including proposed additions to the core, and various domain-specific extension layers (e.g., “bib”, “auto”). Here we only consider the core layer.

4.2. NL-Schema Representation

NL-Schema, and templates derived from it, provide the domain-specific information used in the grammar-driven generation of training data. To facilitate the generation, NL-Schema uses a relational data model, where each table contains a fixed set of properties with fixed types. The grammar is shown in Fig. 4, and an example is shown in Fig. 2. Schema2QA includes a converter tool that leverages the Schema.org definitions and the data in websites to translate the graph-based Schema.org representation automatically into NL-Schema.

Figure 4. The formal definition of NL-Schema: grammar productions for table definitions (tdef), field definitions (fdef), annotations (ann), canonical forms (cf), part-of-speech tags (pos), and types (type), over table name identifiers tn and field name identifiers fn.

Tables and record types. In NL-Schema, we distinguish between entity and non-entity classes. Entity classes are those that refer to well-known meaningful identities, such as people, organizations, places, events, and URLs, with names that the user can recognize. All other classes are considered non-entity; they can only be referred to as properties of other classes.

An entity class is represented as a table, whose properties make up the columns. A column may be of a primitive type, a reference to another entity class, or a record type. Non-entity classes are represented as anonymous record types, with the exception of recursive classes. DBTalk does not have recursive record types, where a record type has a field with the same type. Recursive non-entity classes are mapped to nameless tables, instead of record types. For example, the “Review” class is recursive, because it inherits the “review” property from “CreativeWork”, and that property also refers to the “Review” class.

When a class is used as a property, it often uses only a subset of all the possible fields. Consider the non-entity class “Rating”. When referred to as the “aggregateRating” property of the “Restaurant” class, it uses the fields “reviewCount” and “ratingValue”; when referred to as the “reviewRating” property of the “Review” class, it uses the field “ratingValue” and might use the “author” field. We use data scraped from websites to determine the fields used in practice for each class’s property, and create a custom record type holding only such fields for that property. This limits the vocabulary to only the relevant terms for each class’s property.

Array types. Since correctly distinguishing singular and plural properties is necessary to generate good training sentences, we introduce cardinality to NL-Schema fields. For example, for plural properties such as “review” we can ask “how many reviews does a restaurant have?”, and “what restaurant has the most reviews?”.

Schema2QA considers any property with the “ItemList” type as an array. It also heuristically analyzes the documentation comment provided on Schema.org to identify arrays. Furthermore, when the element type of the array is not provided by Schema.org, Schema2QA heuristically infers it from the property name. Empirically, we found our heuristics work well in the 3 domains we evaluated, with the exception of properties of the “Thing” class such as “image” and “description”, which are described as plural in the comment, but have one value per object in practice.
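A minimal sketch of such a heuristic in Python (the function name and plurality hints here are illustrative, not Schema2QA's actual implementation):

```python
def is_array_property(prop_name: str, declared_type: str, comment: str) -> bool:
    """Heuristically decide whether a Schema.org property holds an array.

    Illustrative only: Schema2QA's real heuristics also infer the element
    type and handle exceptions such as "image" and "description".
    """
    # Properties declared with the ItemList type are always arrays.
    if declared_type == "ItemList":
        return True
    # Otherwise, look for plurality hints in the documentation comment.
    hints = ("multiple", "one or more", "a list of")
    return any(h in comment.lower() for h in hints)

print(is_array_property("review", "Review",
                        "A review of the item. Multiple reviews may exist."))  # True
print(is_array_property("name", "Text", "The name of the item."))              # False
```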

Union types. For compatibility with existing websites and extensibility, the range of many properties in Schema.org includes multiple classes, effectively creating a union type. To avoid ambiguity when parsing natural language, and to avoid having to resolve the type dynamically at runtime, which can lead to missing data and confusion, NL-Schema does not support union types. For each property, it picks among the types in its range the one with the highest priority, defined in decreasing order: record types, DBTalk primitive types, entity references, and finally strings. All website data is cast to the chosen type as follows: if the data contains a primitive value or a record where an entity reference is expected, Schema2QA creates new entities and assigns new unique IDs. Conversely, if the data contains a record where a primitive value or an entity reference is expected, Schema2QA reports a warning and selects a property of that record as the primitive value. Note that website data often does not respect the type declared in Schema.org; in that case, Schema2QA will automatically cast the data to the correct type or discard the value.
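The priority rule can be sketched in Python as follows (the type-kind names and function are illustrative assumptions, not Schema2QA's code):

```python
# Decreasing priority of type kinds when collapsing a Schema.org union type
# into a single NL-Schema type.
PRIORITY = ("record", "primitive", "entity", "string")

def pick_type(range_types: dict) -> str:
    """Pick the single NL-Schema type for a property.

    `range_types` maps each candidate type in the property's range to its
    kind; the candidate whose kind has the highest priority wins.
    """
    return min(range_types, key=lambda t: PRIORITY.index(range_types[t]))

# The range of "aggregateRating" includes a record type and free text:
print(pick_type({"Rating": "record", "Text": "string"}))  # Rating
```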

4.3. Natural Language Annotations

Schema2QA uses a template-based method to synthesize a large training set mapping natural language questions to DBTalk queries. To do so, each property in NL-Schema is annotated with its canonical forms: a list of short phrases that indicates how the property and its value are referred to in natural language. Each canonical form also indicates the part-of-speech it belongs to so they are used in natural language templates correctly. For example, the “author” property of “Review” has canonical forms “author” (noun phrase) and “written by” (adjective phrase).

Canonical forms might have a component that precedes the property value (“prefix”) and one that follows it (“suffix”). For example, one of the canonical forms of the “servesCuisine” property has prefix “serves” and suffix “cuisine”. This allows Schema2QA to generate sentences of the form “restaurants that serve Italian cuisine”, where the value “Italian” is in between “serves” and “cuisine”.
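The prefix/suffix mechanism can be sketched as follows (the helper function is illustrative; the field names mirror the annotations in Fig. 2):

```python
def apply_canonical(form: dict, value: str) -> str:
    """Place a property value between a canonical form's prefix and suffix."""
    parts = (form.get("prefix", ""), value, form.get("suffix", ""))
    # Drop empty components so forms with only a prefix or suffix also work.
    return " ".join(p for p in parts if p)

serves_cuisine = {"prefix": "serves", "suffix": "cuisine"}
print(apply_canonical(serves_cuisine, "Italian"))  # serves Italian cuisine
```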

Automatic Generation of Canonical Forms. When converting Schema.org to NL-Schema, Schema2QA automatically generates a canonical form for each property, based on the property name and type. It converts the camel-cased names into multiple words and removes redundant words at the end of property names. For example, it converts “worksFor” to “works for”, and “ratingValue” to “rating”. Schema2QA removes the table name or the names of the parent record type from the property name, if present. For example, “reviewRating” is converted to “rating”. Schema2QA also recognizes the verbs “has” and “is”, which are commonly used as prefixes of property names in Schema.org, and uses them to identify the correct part of speech for the generated canonical form. Developers can manually add more canonical forms or refine the generated ones.
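A simplified sketch of this conversion in Python (the function and the list of redundant trailing words are assumptions; the real tool also consults a POS tagger and type information):

```python
import re

def canonical_form(prop_name: str, table_name: str = "") -> str:
    """Derive a canonical form from a Schema.org property name.

    Illustrative sketch of the camelCase splitting and redundant-word
    removal described above, not Schema2QA's actual implementation.
    """
    # Split camelCase into lowercase words: "worksFor" -> ["works", "for"]
    words = [w.lower() for w in re.findall(r"[A-Z]?[a-z]+", prop_name)]
    # Drop a redundant trailing word: "ratingValue" -> "rating"
    if len(words) > 1 and words[-1] in ("value", "content"):
        words = words[:-1]
    # Drop a leading table/record name: "reviewRating" (in Review) -> "rating"
    if table_name and words and words[0] == table_name.lower():
        words = words[1:]
    return " ".join(words)

print(canonical_form("worksFor"))                # works for
print(canonical_form("ratingValue"))             # rating
print(canonical_form("reviewRating", "review"))  # rating
```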

The canonical forms and types are also used in displaying the results of the query. If the query returns many fields, a default priority is used to present the most important ones. For each table, developers can override the formatting information.

Parts-of-Speech Annotations. Previous work by Wang et al. (Wang et al., 2015) asked developers to provide canonical forms in one of two parts-of-speech (POS): a noun phrase or a verb phrase. This simplistic POS characterization generates low-quality sentences that are unsuitable for training. Subsequent work shows that allowing developers to supply domain-specific templates can generate higher-quality synthetic sentences, which can be combined with a small amount of paraphrase data for effective training (Campagna et al., 2019). We propose classifying canonical forms into more POS categories (the abbreviations are based on POS tags from the Penn Treebank tagset) so that generic templates can be used to synthesize more varied and useful training sets. Schema2QA applies an off-the-shelf part-of-speech tagger to identify the POS tag of generated canonical forms.

The noun-phrase for property field (NNP) tag denotes a noun phrase representing what a subject has. For example, “reviews” is an NNP canonical form for the “review” property; examples of generated sentences are “show me the reviews of the restaurant” and “which restaurant has more than 5 reviews?”. Most properties defined in Schema.org have at least one NNP canonical form.

The noun-phrase for identity field (NNI) tag denotes a noun phrase representing what a subject is. For example, “alumni of” is an NNI canonical form for the “alumniOf” property; an example is “who is an alumni of NTU?”.

The verb-phrase field (VB) tag denotes a verb phrase representing what a subject does. For example, “serves” is a VB canonical form for the “servesCuisine” property; an example is “what restaurants serve tacos?”.

The adjective-phrase field (JJ) tag denotes an adjective or passive verb. For example, “rated” is a JJ canonical form for the “rating” property; an example is “restaurants rated 4.5 or above”. Similarly, the preposition “by” is a JJ canonical form for the “author” property of “Review”; an example is “show me reviews by Bob Smith”.

Additionally, in some cases the property is not explicit in the question, and the value is sufficient to infer what property the query should use. Thus, we create two categories based on values:

The noun-phrase for value (NNV) tag on the property denotes that the value of a property indicates what a subject is. For example, the property “jobTitle” is annotated NNV, and an example is “who is a CEO?”. The property “jobTitle” is implicit.

The adjective-phrase for value (JJV) tag denotes that the value of a property is an adjective. For example, the property “servesCuisine” is annotated JJV, and an example is “show me Mexican restaurants”. Here, the property “servesCuisine” is implicit.

5. Training Set Generation

Template | Example Sentence
projection := “the” ⟨field⟩ “of” ⟨table⟩ | “the rating of Panda Express”
table := ⟨table⟩ “that have” ⟨field⟩ “more than” ⟨value⟩ | “restaurants that have rating more than 4”
table := ⟨table⟩ “with the highest” ⟨field⟩ | “restaurants with the highest rating”
question := “how far is” ⟨table⟩ | “how far is Starbucks?”
Table 2. Examples of Generic Templates used by Schema2QA to Generate Training Data
Template Candidate | Positive Example Queries | Negative Example Queries
table := ⟨value⟩ ⟨table⟩ | “Mexican restaurants” | “4.5 hotels”
table := ⟨table⟩ “in” ⟨value⟩ | “hotels in Florida” | “restaurants in Mexican”
table := ⟨table⟩ “with” ⟨value⟩ | “hotels with fitness center” | “restaurants with Mexican”
table := ⟨table⟩ “containing” ⟨value⟩ | “hotels containing fitness center” | “restaurants containing Mexican”
table := ⟨table⟩ ⟨verb⟩ ⟨value⟩ | “person works for Google” | “restaurants serves Mexican”
table := ⟨value⟩ ⟨field⟩ ⟨table⟩ | “Mexican cuisine restaurants” | “Nobel prize award person”
table := ⟨table⟩ “with” ⟨value⟩ ⟨field⟩ | “restaurants with Mexican cuisine” | “hotels with Florida state”
table := ⟨value⟩ ⟨field⟩ ⟨passive-verb⟩ ⟨table⟩ | “5 star rated restaurants” | “Nobel prize award received person”
Table 3. Template Candidates used by Schema2QA to Generate Domain-Specific Templates

Schema2QA uses Genie (Campagna et al., 2019) to generate training data for neural semantic parsers. Genie accepts templates, which are production rules mapping the grammar of natural language to a semantic function that produces the corresponding code. Formally, a template is expressed as:
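Schematically, a template relates a non-terminal, an expansion, and a semantic function, along the following lines (notation illustrative; the nt_i are non-terminals, the t_i are terminal phrases, and sf is the semantic function):

```
nt := t0 nt1 t1 nt2 t2 ... ntk tk  ⇒  sf(nt1, nt2, ..., ntk)
```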

The non-terminal nt is produced by expanding the terminals and non-terminals on the right-hand side of the rule; the bound variables are used by the semantic function sf to produce the corresponding code. Genie expands the templates by substituting the non-terminals with the previously generated derivations and applying the semantic function. The expansion continues recursively, up to a configurable depth.

When generating complex queries, Genie depends on generic templates to form the sketch of the query and on domain-specific templates to refer to the details. We introduce the two kinds of templates in the following sections.

5.1. Generic templates

Generic templates are written by hand, and map natural-language compositional constructs to formal-language operators. Many sentences can be covered with generic DBTalk templates, such as:

    table := ⟨table⟩ "that" ⟨VB-property⟩ ⟨value⟩

This template can generate both “hotel that offers free parking” and “person that works for Google”. In the former case, the table is “Hotel”, and the property is “amenityFeature” with VB canonical form “offers”. In the latter case, the table is “Person” and the property is “worksFor”, with VB canonical form “works for”.

Templates rely on the type system to form sentences that are meaningful and map to correct queries. For example, in the following template, only properties of a comparable type can be used:

    table := ⟨table⟩ "whose" ⟨property⟩ "is less than" ⟨value⟩

This template would generate “restaurants whose rating is less than 3”, but it would not generate “restaurants whose cuisine is less than Italian”, as the latter would not typecheck. Typing is also used to automatically add computation, aggregation, and sorting.
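A minimal sketch of this type gating is below; the set of comparable type names is an assumption for illustration, not the actual DBTalk type system:

```python
# Assumed set of comparable DBTalk-style types; String is deliberately absent,
# so string-valued properties like "cuisine" never receive comparison templates.
COMPARABLE_TYPES = {"Number", "Currency", "Measure", "Date", "Time"}

def comparison_templates(table, prop, prop_type):
    """Emit "less than" comparison templates only when the property's
    type supports ordering; otherwise the template simply never fires."""
    if prop_type not in COMPARABLE_TYPES:
        return []
    return [f"{table} whose {prop} is less than ⟨value⟩"]
```

With this gate, `comparison_templates("restaurants", "rating", "Number")` yields a template, while `comparison_templates("restaurants", "cuisine", "String")` yields none, exactly as the typecheck argument above requires.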

Schema2QA includes 512 hand-curated generic templates. Of these, 208 (41%) are related to filters. Table 1 shows that filters are used in many meaningful queries, so Schema2QA puts particular emphasis on generating understandable and varied questions for them. Schema2QA also includes 94 templates for argmax and argmin operations: these are templates that combine the sort and index operation in DBTalk to select the minimum or maximum row in the table. Argmin and argmax templates are used to understand questions that choose the most highly rated restaurant (Table 1).

Schema2QA makes use of DBTalk’s native support for the Location type to understand “where” questions and “how far” (distance) questions. These questions use the “geo” property, and are available in all tables that support that property.

Additional examples of templates, together with possible sentences that they generate, are shown in Table 2. The table shows one example of a projection asking for a specific field of the table, one example of a comparison filter, one example of argmax aggregation operator, and one example of a computed field.

5.2. Domain-Specific Templates

We wish to augment generic templates with domain-specific ones so the generated training data can include terminology specific to each domain. To scale across all domains, we have developed a data-driven generator for domain-specific templates.

For English queries, we have identified 8 common templates for specifying filters on tables, as shown in column 1 of Table 3. Columns 2 and 3 illustrate that the choice of template is highly dependent on the specific property. We exhaustively apply each template to every property in the NL-Schema, substituting the parameters with canonical forms of the right POS and replacing values with data sampled from the websites. We test whether the resulting phrases are used in practice by counting the number of exact matches returned by a web search engine, normalizing the count to adjust for the fact that longer queries have fewer exact matches. The 4 most commonly used templates are adopted for each property.
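The selection procedure above can be sketched as follows. The `hit_count` callback stands in for a search-engine API, and the exponential length normalization is one plausible choice; the paper does not specify the exact formula.

```python
import math

# A subset of the candidate shapes from Table 3, shown here for brevity;
# {table} and {value} are filled from the NL-Schema and sampled site data.
CANDIDATES = [
    "{value} {table}",
    "{table} in {value}",
    "{table} with {value}",
    "{table} containing {value}",
]

def select_templates(table, sample_values, hit_count, keep=4):
    """Score each candidate template by exact-match search hits, normalized
    so that longer phrases (which naturally match less often) are not
    penalized, and keep the top `keep` templates for the property."""
    scored = []
    for tmpl in CANDIDATES:
        total = 0.0
        for v in sample_values:
            phrase = tmpl.format(table=table, value=v)
            # Quoted query approximates an exact-match web search.
            total += hit_count(f'"{phrase}"') * math.exp(len(phrase.split()))
        scored.append((total / len(sample_values), tmpl))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [tmpl for _, tmpl in scored[:keep]]
```

For the “servesCuisine” property, a phrase like “Mexican restaurants” returns far more exact matches than “restaurants containing Mexican”, so the ⟨value⟩ ⟨table⟩ template wins.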

6. Experimental Results

We used Schema2QA to create three Q&A skills in the open-source Almond virtual assistant (Campagna et al., 2017), as an end-to-end demonstration of its functionality. The skills and Schema2QA will be released upon publication. A screenshot of the Schema2QA interface is shown in Fig. 5. At the moment, our system supports only American English questions, and can be trained only on English websites. Extensions to other languages are left as future work.

In this section, we evaluate the performance of Schema2QA. We first describe the dataset we used for training and evaluation. Then we present experimental results to answer the following questions: (1) How does Schema2QA perform on aggregator websites? (2) How does paraphrase data affect the accuracy? (3) How does the synthesized data compare with prior work? (4) Can the knowledge we learn from one domain be transferred to a related domain? And finally, (5) how would Schema2QA work in the real world?

Figure 5. Screenshot of a skill generated by Schema2QA, running in the Almond virtual assistant.

6.1. Training and Evaluation Dataset

Schema2QA does not use any real data for training, which significantly reduces the cost of data acquisition. The training data is generated automatically with templates; a small fraction of the generated data is paraphrased by crowdsource workers. In addition, the dataset is augmented with property values extracted from the website data.

We, however, use realistic data for validation and testing, which has been shown to be significantly more challenging than testing with paraphrase data (Campagna et al., 2019). Crowdsource workers are presented with a list of properties in the relevant domain and a few example queries, and are asked to come up with 5 questions. We do not show them any sentences or website data used for training, and allow workers to freely choose the value of each property. We ask half of the workers to produce queries about a single property, and the other half to produce queries involving two properties of their choice. The questions are then annotated by hand with their DBTalk representation by an author of the paper, and split into a dev set and a test set. Our query-accuracy metric counts a prediction as correct only if the entire generated query matches the annotation.
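The metric can be sketched in a few lines; the whitespace normalization is our assumption (the paper only states that the entire generated query must match the annotation):

```python
def query_accuracy(predicted, annotated):
    """Exact-match query accuracy: a prediction counts as correct only if
    the whole generated DBTalk query equals the hand annotation, after
    collapsing runs of whitespace."""
    norm = lambda q: " ".join(q.split())
    hits = sum(norm(p) == norm(a) for p, a in zip(predicted, annotated))
    return hits / len(annotated)
```

Note that this metric gives no partial credit: a query with the right table but one wrong filter counts as fully incorrect.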

6.2. Applying Schema2QA to Aggregator Sites

In our first experiment, we apply Schema2QA to two major aggregator websites in two different domains: Yelp for restaurants and LinkedIn for people. We chose them because they aggregate many entities within their domain and they make extensive use of Schema.org markup. They have abundant structured information, which allows Schema2QA to answer rich and interesting questions.

The Yelp data contains restaurants with 10 properties, including “servesCuisine”, “reviews”, and “aggregateRating”, as well as reviews with 4 properties: “reviewRating”, “author”, “datePublished”, and “description”. The LinkedIn data contains data about people, with 5 properties: “alumniOf”, “worksFor”, “address”, “award”, and “name”.

The sizes of the training and evaluation data are shown in Tables 4 and 5, respectively. We observe that crowdsource workers for the restaurant domain generate about 100 questions that refer to three or more properties, despite being instructed to generate queries involving only one or two. This is not observed with the LinkedIn data, probably because LinkedIn has fewer properties to choose from.

                      Restaurants    People
Synthesized             1,294,278   553,067
Paraphrase                  6,288     6,000
Total (augmented)       1,808,109   930,564
Table 4. Training set sizes.

                      Restaurants    People
Dev    1 property             134        16
       2 properties            47       144
       3+ properties           59         0
       Total                  240       160
Test   1 property              96       127
       2 properties            79       106
       3+ properties           40         0
       Total                  215       233
Table 5. Size of evaluation sets, by number of properties.

Based on the development set, we refined the generic templates. New templates we found from validation include: projection on two properties (“what is the address and the telephone of …?”), filters that use “both” (“who works for both Google and Amazon?”, “what restaurant serves both ramen and sushi?”), comparisons that use “or more” (“restaurants with 4 stars or more”). We also used the dev set to refine the canonical forms of the properties. Note that the author who annotated the test set did not refine the templates and canonical forms after annotating. This guarantees that NL-Schema is only tuned based on observations from the dev set.

We train and evaluate on each of the two domains separately. The accuracy of Schema2QA is shown in Figure 6. On Yelp data, Schema2QA achieves 74% overall and 81% on questions with one property. For more complicated questions, Schema2QA still gets 71% on questions with two properties, and 65% with three or more properties. On LinkedIn data, Schema2QA achieves 78% overall: 80% on questions with one property and 76% on questions with two properties. In the figure, the column for three or more properties is marked “N/A” because there is no test data.

Overall, this result, which was achieved without any real data in training, shows that Schema2QA can build an effective parser at a low cost. Additionally, developers can add more domain-specific templates and canonical forms to further increase the accuracy with little additional cost. Schema2QA is able to achieve reasonably high accuracy for complex queries involving multiple properties because the synthesis generates many combinations. This allows us to outperform commercial assistants like Google, Alexa, and Siri on a variety of questions, as shown in Table 1.

Figure 6. Accuracy on aggregator websites

6.3. Impact of Paraphrase Data

Figure 7. Impact of Paraphrase Data on Restaurant Domain

To understand the contribution of the synthetic and paraphrase training data, we evaluate the accuracy of the parser with different amounts of paraphrase data. Results are shown in Fig. 7. Without any paraphrase data, the model already achieves an accuracy of 57% for simple queries and drops to 44% for queries with two or more properties. Adding just 25% of the paraphrase data improves the accuracy to 74% for questions with one property, 62% with two properties, and 58% with three or more properties. This indicates that the model trained with no paraphrase overfits on the templatized sentences; just the addition of a small amount of paraphrase data forces it to pay more attention to the language, improving the accuracy. Using the full paraphrase set leads to 81% for one property, 71% for two properties and 65% for three or more.

6.4. Comparison With Sempre Templates

Schema2QA uses a significantly more sophisticated template language and synthesis algorithm than Sempre, the first system to use canonical sentences to create synthetic data for Q&A semantic parsers. Here we evaluate the quality of our synthesized data set compared to one generated using the Sempre template language.

We apply the same data augmentation to both sets, and train with only synthesized data. On the Restaurants domain, training with Sempre templates achieves less than 1% accuracy at all three complexity levels. On the other hand, training with only synthesized data produced with Schema2QA templates achieves 57% accuracy for one property, 44% for two properties, and 45% for three or more properties. On the People domain, training with Sempre templates achieves 10% accuracy for questions with one property, and 2% accuracy for questions with two properties, whereas Schema2QA templates achieve 53% for one property, and 30% for two.

This result shows that the synthesized data we generate matches our realistic test questions more closely. Our templates are better tuned to the variety of filters that commonly appear in the test set. Sempre templates, on the other hand, are tuned for questions with many joins, which are not common in the domains we tested. Synthesizing data that matches the test data more closely means we do not need to rely as much on expensive paraphrasing.

Figure 8. Proportion of crawled websites which use each property of the “Restaurant” and “Hotel” classes. Only properties used by at least 10% of the websites are included.

6.5. Transfer Learning Across Domains

Many domains in Schema.org share common classes and properties. Is it possible to transfer the learning from one domain to another and create a semantic parser for a new domain without manually producing new annotations, domain-specific templates, or even paraphrases? We take the Schema2QA skill for restaurants and apply it to hotels; restaurants and hotels share many of the same fields. The Hotel class has the additional properties “checkinTime”, “checkoutTime”, and “amenityFeature”, but it lacks the “servesCuisine” property found in the Restaurant class.

For training-data synthesis, besides the automatically generated canonical forms, we also use the canonical forms manually created for the fields shared with restaurants. We adapt the paraphrases from the restaurant domain to the hotel domain by replacing words such as “restaurant”, “diner”, and “canteen” with the word “hotel”. We augment the synthesized and paraphrase data sets with data from the Hyatt hotel chain.
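The paraphrase-adaptation step amounts to a careful word substitution. A minimal sketch follows; the noun list is illustrative (the paper says “restaurant”, “diner”, “canteen”, etc., so the full list is an assumption):

```python
import re

# Restaurant-domain nouns to rewrite; singular and plural forms are
# matched so plurality can be preserved in the substitution.
_NOUNS = re.compile(r"\b(restaurants?|diners?|canteens?|eateries|eatery)\b",
                    re.IGNORECASE)

def adapt_to_hotels(sentence):
    """Port a restaurant-domain paraphrase to the hotel domain by
    substituting the domain noun, preserving plurality."""
    def repl(match):
        return "hotels" if match.group(0).lower().endswith("s") else "hotel"
    return _NOUNS.sub(repl, sentence)
```

For example, `adapt_to_hotels("find cheap restaurants near me")` yields “find cheap hotels near me”, so the surrounding filter and comparison phrasing carries over unchanged.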

We acquire an evaluation set of 362 questions, crowdsourced from MTurk and annotated by hand. These are divided into 181 for validation and 181 for test. Of the test questions, 86 use one property, 62 use two properties, and 33 use three or more. On the test set, the generated parser achieves an overall accuracy of 55%. Furthermore, on the subset of the test set that does not use any hotel-specific property, the accuracy is 63%. This shows that the knowledge from one domain can be transferred to a new domain with no manual effort, provided the domains share the same properties.

6.6. Applying Schema2QA To The Web

The semantic parser generated by Schema2QA can be applied to new websites in the same domain. To validate this capability, we created two Q&A systems, one for restaurants and one for hotels, in the cities of Washington DC and New York, and in the state of Hawaii. We use the Google Custom Search Engine and identify 311 hotel and 475 restaurant websites that include Schema.org markup. They exhibit wide variability in the properties used. Fig. 8 shows the proportion of websites using each of the properties that appear in at least 10% of the sites.

We unify the crawled data into a single knowledge base. The neural models trained on the Yelp and the Hyatt websites can immediately be used to answer questions that aggregate across all these websites. Yelp alone covers 9 of the top 15 common properties for restaurants, and Hyatt covers 9 of the top 13 common properties for hotels. For example, we can ask the hotel and restaurant questions in Table 1 across the crawled websites without any skill-by-skill engineering. We expect more websites will add Schema.org markup to their pages to support natural language queries about their content.

7. Related Work

Question Answering.

Question answering (QA) is a well-studied problem in natural language processing, with work dating back to the 1960s (Green Jr et al., 1961). A subset of the QA field is knowledge-based question answering (KB QA), where the answer can be found in a graph or relational database by executing an appropriate query.

Semantic parsing techniques for KB QA (Zelle and Mooney, 1994, 1996; Tang and Mooney, 2001; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Yahya et al., 2012; Pasupat and Liang, 2015; Wang et al., 2015; Xiao et al., 2016) have focused on generating an executable query in a domain-specific query language. More recently, work has been focused on generating SQL directly (Zhong et al., 2017; Xu et al., 2017; Iyer et al., 2017; Yu et al., 2018; Yavuz et al., 2018), which allows the QA system to interact with an unmodified traditional database. Semantic parsing has also been applied to event-driven virtual assistant commands (Quirk et al., 2015; Beltagy and Quirk, 2016; Dong and Lapata, 2016; Yin and Neubig, 2017; Campagna et al., 2017, 2019), instructions to robotic agents (Kate et al., 2005; Kate and Mooney, 2006; Wong and Mooney, 2006; Chen and Mooney, 2011), and trading card games (Ling et al., 2016; Yin and Neubig, 2017; Rabinovich et al., 2017).

In general, though, training a semantic parser requires a large corpus of questions annotated with the corresponding query, which is expensive. Previous work has proposed crowdsourcing paraphrases to bootstrap new semantic parsers (Wang et al., 2015). The previously proposed Genie toolkit further suggested training with data synthesized from manually tuned templates (Campagna et al., 2019). Genie requires each skill to provide domain-specific templates mapping to website-specific APIs. In this paper, we make use of Genie, but propose a larger and more varied set of generic templates, as well as automatically generated domain-specific templates, which reduce the amount of manual effort. Furthermore, by leveraging Schema.org markup, the effort in building skills with Schema2QA is per-domain rather than per-skill.

Using Schema.org in Virtual Assistants.

Prior work has also investigated answering questions based on Semantic Web data and Linked Data knowledge-graph repositories (Yahya et al., 2012, 2013). More recently, the Schema.org vocabulary has also been used by commercial virtual assistants as the intermediate representation for their built-in skills, for example in the Alexa Meaning Representation Language (Kollar et al., 2018). Efforts to support complex and compositional queries based on Schema.org require expert annotation on large training sets (Perera et al., 2018). Furthermore, because of the cost of building such training sets, compositional query capabilities are not available to third parties, which are limited to an intent-classifier and slot-tagger system (Kumar et al., 2017).

Google Assistant is also able to automatically generate skills for websites that use Schema.org markup, and supports five domains. Each skill is automatically built by pairing the crawled website data with predefined models. While this approach supports multiple websites, it requires a substantial amount of work to annotate the training set, and the models are not transferable. In addition, the automatically generated skills cannot answer aggregated questions.

Our approach scales not only with the number of websites, but also with the number of domains, with only a small amount of developer effort. Furthermore, each website can own its generated semantic parser and improve it for its own use case, instead of relying on a proprietary one.

8. Conclusion

This paper presents Schema2QA, a semi-automated tool that can build QA virtual assistant skills across a variety of websites, based on their existing Schema.org markup.

Schema2QA translates natural language questions into a query language we designed, called DBTalk, using a neural semantic parser. Schema2QA uses NL-Schema, an extension of the Schema.org definitions that includes natural language annotations, to automatically generate a large data set to train the neural network. This training set is generated from a combination of built-in generic templates as well as automatically generated and developer-provided domain-specific templates.

Experimental results suggest that the skills produced by Schema2QA can be effective at answering a variety of questions, with an overall accuracy of 74% for restaurants and 78% for LinkedIn. Furthermore, Schema2QA can answer 65% of questions that use 3 or more properties, a significant improvement over existing assistants, which support at most one or two filters.

By making Schema2QA publicly available to every developer, we wish to encourage the creation of a linguistic web that is open to every assistant.

We thank Ramanathan V. Guha for his help and suggestions on Schema.org. This work is supported in part by the National Science Foundation under Grant No. 1900638 and the Stanford MobiSocial Laboratory, sponsored by AVG, Google, HTC, Hitachi, ING Direct, Nokia, Samsung, Sony Ericsson, and UST Global.


  • I. Beltagy and C. Quirk (2016) Improved semantic parsers for if-then statements. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), External Links: Document, Link Cited by: §7.
  • G. Campagna, R. Ramesh, S. Xu, M. Fischer, and M. S. Lam (2017) Almond: the architecture of an open, crowdsourced, privacy-preserving, programmable virtual assistant. In Proceedings of the 26th International Conference on World Wide Web - WWW ’17, New York, New York, USA, pp. 341–350. External Links: Document, ISBN 9781450349130, Link Cited by: §3.1, §6, §7.
  • G. Campagna, S. Xu, M. Moradshahi, R. Socher, and M. S. Lam (2019) Genie: a generator of natural language semantic parsers for virtual assistant commands. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, New York, NY, USA, pp. 394–410. External Links: ISBN 978-1-4503-6712-7, Link, Document Cited by: item 4, §1.2, §3.3, §3, §4.3, §5, §6.1, §7, §7.
  • D. L. Chen and R. J. Mooney (2011) Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pp. 859–865. Cited by: §7.
  • L. Dong and M. Lapata (2016) Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), External Links: Document, Link Cited by: §7.
  • B. F. Green Jr, A. K. Wolf, C. Chomsky, and K. Laughery (1961) Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference, pp. 219–224. Cited by: §7.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: item 6.
  • S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer (2017) Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), External Links: Document, Link Cited by: §7.
  • R. J. Kate, Y. W. Wong, and R. J. Mooney (2005) Learning to transform natural to formal languages. In Proceedings of the 20th national conference on Artificial intelligence-Volume 3, pp. 1062–1068. Cited by: §7.
  • R. J. Kate and R. J. Mooney (2006) Using string-kernels for learning semantic parsers. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL - ACL '06, External Links: Document, Link Cited by: §7.
  • B. Kinsella (2019) Amazon alexa has 100k skills but momentum slows globally. here is the breakdown by country. Cited by: §1.
  • T. Kollar, D. Berry, L. Stuart, K. Owczarzak, T. Chung, L. Mathias, M. Kayser, B. Snow, and S. Matsoukas (2018) The alexa meaning representation language. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), External Links: Document, Link Cited by: §7.
  • A. Kumar, A. Gupta, J. Chan, S. Tucker, B. Hoffmeister, and M. Dreyer (2017) Just ASK: building an architecture for extensible self-service spoken language understanding. CoRR abs/1711.00549. External Links: Link, 1711.00549 Cited by: §7.
  • W. Ling, P. Blunsom, E. Grefenstette, K. M. Hermann, T. Kočiský, F. Wang, and A. Senior (2016) Latent predictor networks for code generation. External Links: Document, Link Cited by: §7.
  • B. McCann, N. S. Keskar, C. Xiong, and R. Socher (2018) The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730. Cited by: item 6.
  • P. Pasupat and P. Liang (2015) Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), External Links: Document, Link Cited by: §7.
  • V. Perera, T. Chung, T. Kollar, and E. Strubell (2018) Multi-task learning for parsing the alexa meaning representation language. In American Association for Artificial Intelligence (AAAI), pp. 181–224. Cited by: §7.
  • S. Perez (2018) 47.3 million u.s. adults have access to a smart speaker, report says. TechCrunch. Cited by: §1.
  • C. Quirk, R. Mooney, and M. Galley (2015) Language to code: learning semantic parsers for if-this-then-that recipes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), External Links: Document, Link Cited by: §7.
  • M. Rabinovich, M. Stern, and D. Klein (2017) Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1139–1149. External Links: Document, Link Cited by: §7.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: item 6.
  • L. R. Tang and R. J. Mooney (2001) Using multiple clause constructors in inductive logic programming for semantic parsing. In Machine Learning: ECML 2001, pp. 466–477. External Links: Document, Link Cited by: §7.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: item 6.
  • Y. Wang, J. Berant, and P. Liang (2015) Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1332–1342. External Links: Document, Link Cited by: §1.2, §4.3, §7, §7.
  • Y. W. Wong and R. J. Mooney (2006) Learning for semantic parsing with statistical machine translation. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pp. 439–446. External Links: Document, Link Cited by: §7.
  • Y. W. Wong and R. Mooney (2007) Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 960–967. Cited by: §7.
  • C. Xiao, M. Dymetman, and C. Gardent (2016) Sequence-based structured prediction for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1341–1350. External Links: Document, Link Cited by: §7.
  • X. Xu, C. Liu, and D. Song (2017) Sqlnet: generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436. Cited by: §7.
  • M. Yahya, K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp, and G. Weikum (2012) Natural language questions for the web of data. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 379–390. Cited by: §7, §7.
  • M. Yahya, K. Berberich, S. Elbassuoni, and G. Weikum (2013) Robust question answering over the web of linked data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pp. 1107–1116. Cited by: §7.
  • S. Yavuz, I. Gur, Y. Su, and X. Yan (2018) What it takes to achieve 100% condition accuracy on WikiSQL. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1702–1711. External Links: Document Cited by: §7.
  • P. Yin and G. Neubig (2017) A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 440–450. Cited by: §7.
  • T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. R. Radev (2018) Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. CoRR abs/1809.08887. External Links: Link, 1809.08887 Cited by: §7.
  • J. M. Zelle and R. J. Mooney (1994) Inducing deterministic prolog parsers from treebanks: a machine learning approach. In AAAI, pp. 748–753. Cited by: §7.
  • J. M. Zelle and R. J. Mooney (1996) Learning to parse database queries using inductive logic programming. In Proceedings of the thirteenth national conference on Artificial intelligence-Volume 2, pp. 1050–1055. Cited by: §7.
  • L. S. Zettlemoyer and M. Collins (2005) Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 658–666. Cited by: §7.
  • V. Zhong, C. Xiong, and R. Socher (2017) Seq2SQL: generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103. Cited by: §7.