The adoption of virtual assistants grew at an unprecedented rate, reaching 50 million American adults in just the first two years (Perez, 2018). Alexa and Google Assistant are rapidly growing their third-party platform of skills so consumers can access different websites and IoT devices by voice (Kinsella, 2019). Skill builders supply information on how their sites can be accessed to each of the assistant platforms, with sample natural language invocations; the virtual assistant platforms use proprietary linguistic technology to support arbitrarily phrased commands.
There are over 1.5 billion websites on the World Wide Web today. Does each company have to enter a skill on each of the platforms it wishes to run on? Given the high cost of data acquisition needed to develop natural language technology and the virality of network effects, would there be oligopolistic virtual assistants that control separate proprietary linguistic webs? Would non-profit organizations be well served? Would rare natural languages be supported?
Our overall goal is to create one open non-proprietary linguistic web. This paper presents Schema2QA, an open-source virtual assistant skill authoring tool that produces fully custom natural language models that answer questions on websites, based on Schema.org markup.
1.1. Our Solution
Linguistic Questions&Answers (Q&A) on websites. Ideally, a linguistic interface can be created for each website automatically. We take a step towards this goal by leveraging the Schema.org markup included in millions of websites originally to facilitate web search. Schema2QA can generate a neural semantic parser that translates natural language into queries on the Schema markup on websites, without any real user training data.
Domain-by-domain improvement. Schema2QA lets developers improve the quality, one domain at a time, with relatively little manual effort. Because this tool generates open-source neural models that companies can own and use in their own websites, apps, or phone services, it encourages open collaboration to build a neural model that leverages data from many domains and languages.
Complex queries. The Linguistic User Interface (LUI) is much more flexible than the Graphical User Interface (GUI). We can easily ask for information that requires joining arbitrary fields or performing computation. Table 1 shows a sample of questions that can be answered by Schema2QA; in contrast, today’s commercial assistants cannot answer most such questions. Schema2QA supports complex queries effectively by training a neural model with millions of synthesized sentences, less than 1% of which are paraphrased by crowdsource workers.
Open web. Search engines can crawl the Schema.org markup in websites and use Schema2QA to answer questions across many websites at once, eliminating the need for proprietary centralized skill repositories. By making all the source code, training data, and neural models publicly available, we wish to encourage open collaboration in LUI research.
To answer questions using up-to-date web information, it is not possible to train our system with question and answer pairs. Instead, we create a semantic parser that automatically translates natural language queries into a formal query language. To address the difficulty in acquiring training data, we use grammar-based rules to generate pairs of sentences and their formal counterparts, and ask crowdsource workers to paraphrase the natural language sentences (Wang et al., 2015). To handle complex queries, we write templates to generate a wide variety of synthesized data, which is used in conjunction with paraphrase data to train a neural semantic parser (Campagna et al., 2019).
Previous techniques require significant effort to create one skill at a time. To handle the millions of available websites, we wish to provide an automatic baseline solution that can handle all the websites automatically, while allowing incremental improvements one domain at a time. With the help of Schema.org, we can train semantic models for each domain, rather than for each skill or website. Illustrated in Fig. 1, our approach consists of the following:
We create DBTalk, a high-level query language optimized for translation from natural language. It supports common computed functions such as aggregation in addition to typical join, selection and projection operations.
We extend Schema.org to the use case of natural language Q&A, and create NL-Schema, which is a natural-language-friendly representation.
Templates are used to generate training data, where each sample is a pair of a natural language sentence and its corresponding DBTalk query. There are two kinds of templates: generic templates and domain-specific templates. The former are hand-coded once and for all (for each targeted natural language); the latter are automatically generated from NL-Schema.
We use Genie (Campagna et al., 2019), which accepts the definition of DBTalk, the NL-Schema, the templates, as well as a parameter dataset collected from existing websites, and produces synthetic training data, part of which are paraphrased, to generate a neural semantic parser. Note that we can also supply additional manual natural language templates to improve the quality of specific domains.
The semantic parser produces queries which are executed on a website database index to return results. Schema2QA automatically builds the index from the website data, and can be called by the website author when the data changes, or invoked periodically to crawl new website data.
The contributions of this paper include:
A high-level query language, called DBTalk, optimized for translation from natural language.
The NL-Schema data model for structured website information, derived from Schema.org, and adapted for building natural language QA skills.
The Schema2QA tool that allows website owners to build skills for their website, in a fully automated fashion for existing domains. Additionally, improvements can be made that require small effort for each domain.
We have demonstrated an end-to-end system by incorporating the results of applying Schema2QA on the domains of restaurants, people and hotels in the open-source Almond virtual assistant.
A large dataset with more than 1.3M training sentences for restaurant questions, and 900,000 training sentences for questions about people. This dataset also includes 215 real-world crowdsourced restaurant questions and 234 questions about people, that refer to one to three properties in a single question. This test set can serve as a benchmark for future QA systems based on Schema.org.
Experimental results showing that skills built from Schema2QA can understand a variety of real-world questions, with an accuracy between 74% and 78%. Furthermore, Schema2QA can support querying across more than 750 websites that use Schema.org today.
The rest of the paper is organized as follows. Section 2 presents an overview of the system. Sections 3 and 4 present the design of DBTalk and NL-Schema, respectively. Section 5 discusses how we use templates to generate the training set. Lastly, we present experimental results, related work, and conclusions.
Our goal is to provide a basic automatically generated semantic parser that can handle questions in each domain, and give domain experts the ability to improve the accuracy of the system. Here we present a high-level overview with the help of a couple of examples drawn from questions about restaurants, as shown in Fig. 1.
2.1. DBTalk Design
We introduce DBTalk as the formal query language. It is designed to handle the important use cases with as little extraneous notation as possible. Here is an example of a DBTalk query to find the restaurants that serve Chinese cuisine:
2.2. NL-Schema Design
The NL-Schema data model is adapted from Schema.org for the purpose of answering natural language questions. NL-Schema uses a relational model that supports class hierarchies, fixed tables and fixed fields of a single static type, whereas Schema.org uses a graph-based representation with union-typed properties. The example of a Restaurant class is shown in Fig. 2.
NL-Schema is used to facilitate the generation of training data for our neural semantic parser. Each training sample consists of a natural language sentence and the corresponding DBTalk code.
|“Which restaurants serve Chinese cuisine?”|
Thus, to establish the relationship between natural language and a query that uses a certain field, each field is annotated with a canonical representation and a part-of-speech (POS). For example, the annotation on the “servesCuisine” field says that the phrases “cuisine” and “serves … cuisine” can be generated as a noun phrase (nnp) and a verb phrase (vb), respectively, and that they both refer to the Schema.org property “servesCuisine”.
Schema2QA includes a tool to automatically transform the whole Schema.org representation into NL-Schema. Additional annotations can be supplied manually if desired to improve its quality.
2.3. Templates for Training Data Generation
Templates are grammar production rules that describe how to generate a sentence and its formal representation. The generic templates cover all the domains in Schema.org. The template shown in Fig. 1 can be used generically to generate “Which restaurants serve Chinese cuisine”, “Which person received Turing awards”, and “Which hotels offer free parking”.
Domain-specific templates further increase the variety of training samples generated. For example, we can refer to restaurants serving Chinese cuisine simply as Chinese restaurants. This can be captured by a domain-specific template, such as $cuisine “restaurants” Restaurant, servesCuisine $cuisine which generates “Chinese restaurants”, “Mexican restaurants”, etc.
The generic templates are curated by hand, but to handle the scale of the Schema.org definition, we have developed a data-driven template generator to automatically create domain-specific templates. Similar to annotations, manual domain-specific templates can be added. For example, developers can add the following template to query the top rated restaurants:
|“top rated restaurants”|
This template will map sorting Restaurant by “aggregateRating.ratingValue” to the phrase “top rated restaurants”, which cannot be produced by Schema2QA based on generic templates.
2.4. Generating a semantic parser for a domain
Here is how we can create a semantic parser that translates natural language questions on a given domain into formal DBTalk queries:
(1) Download the Schema.org markup of representative websites in the domain. We may only need to download information from just one website, or even a fraction thereof, if a large aggregator exists, such as Yelp for restaurant reviews.
(2) Apply the NL-Schema converter on the Schema.org information for that domain to generate NL-Schema.
(3) Apply the automatic template generator on the generated NL-Schema to create domain-specific templates.
(4) Extract the list of values for each property from the downloaded data.
(5) Obtain a set of validation data by hand-labeling crowdsourced questions.
(6) Apply Genie on the DBTalk definition, the generic templates, and all the results obtained in steps 2 to 5 above. Genie uses all this information to synthesize data, a fraction of which is paraphrased by crowdsource workers. Both the synthetic and paraphrase data sets are used to train a Multi-Task Question Answering Network (MQAN) (McCann et al., 2018) model to translate natural language into DBTalk. MQAN is an encoder-decoder model (Sutskever et al., 2014) that uses both LSTM (Hochreiter and Schmidhuber, 1997) and self-attention (Vaswani et al., 2017) layers to encode the input question, then produces the output query token by token. More details of the neural model are described elsewhere (McCann et al., 2018).
The domain parser that this process produces can handle natural language queries for any website in the same domain. Developers are expected to use the validation set to identify weaknesses in the synthesized data, and augment the data with additional annotations as well as manually written domain-specific templates.
2.5. Schema2QA System
Once the parser is generated, we can use it to answer questions for any websites in that domain that use Schema.org. We apply the translated DBTalk queries on databases constructed from the Schema.org markup of the websites of interest. If multiple websites are available in the database, the generated skill can answer aggregate queries, such as “which is the cheapest of all the restaurants” among the downloaded sites.
3. The DBTalk Query Language
Schema2QA translates natural language into a formal query language called DBTalk, which was created to facilitate translation with a neural model. As shown by Genie (Campagna et al., 2019), it is important that the formal language resembles natural language. We follow the same principles in the design of DBTalk.
DBTalk assumes a relational database model. DBTalk queries have the form:
where table is the type of entity being retrieved (similar to a table name in SQL), filter applies a selection predicate that can make use of the fields in the table, and fn is an optional list of field names to project on. The full grammar is shown in Fig. 3.
|Table name tn||identifier|
|Field name fn||identifier|
Here is an example of a DBTalk query, corresponding to the query “who wrote the 0 star review for Din Tai Fung?”
3.1. Type System
DBTalk uses a static type system similar to ThingTalk, a previously proposed programming language for virtual assistants (Campagna et al., 2017). This type system includes domain-specific types like Location, Measure, Date and Time. It also includes high-level concepts such as “here”.
All DBTalk tables implicitly include an “id” column, which can be used to compare two rows for equality and to join tables. Furthermore, DBTalk has native support for named entities, such as people, brands, countries, etc. DBTalk tables that contain named entities have a “name” column; for those tables, a query can look up a specific row by specifying the name and type of the entity.
Unlike ThingTalk, DBTalk has record types, which we introduce to avoid creating tables for objects that represent structured values and have no identity of their own. Fields in each record type are recursively flattened so they can be accessed as fields in the table that uses the record type. We do not support recursive record types.
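The flattening of record fields described above can be illustrated with a minimal Python sketch. The dotted naming scheme follows the “aggregateRating.ratingValue” form used elsewhere in this paper; the function name `flatten_record` and the sample values are hypothetical and are not part of Schema2QA.

```python
def flatten_record(row, prefix=""):
    """Recursively flatten nested records (dicts) into dotted field names."""
    flat = {}
    for name, value in row.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            # A record-typed field: expose its fields on the enclosing table.
            flat.update(flatten_record(value, prefix=key + "."))
        else:
            flat[key] = value
    return flat


# Illustrative row; the numbers are placeholders, not real data.
restaurant = {
    "name": "Din Tai Fung",
    "aggregateRating": {"ratingValue": 4.5, "reviewCount": 120},
}
flat_restaurant = flatten_record(restaurant)
```

After flattening, the rating can be addressed as a plain column of the table, which is what makes a separate table for the record unnecessary.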
Compared to SQL, we also introduce array types to express joins in a more intuitive way, and to avoid the use of glue tables for many-to-many joins.
3.2. Sorting, Aggregation, Computation
DBTalk queries support the simple computation operators found in SQL for sorting a table, indexing and slicing a table, aggregating all results, and computing a new field for each row. These operators can be combined: for example, the distance operator can be used to compute the distance of a place from another location. Combined with sorting and indexing, this allows us to express the query “what is the nearest restaurant?” as: (sort distance asc of comp distance(geo, here) of Restaurant) The query reads as: select all restaurants, compute the distance between the field geo and the user’s current location (and by default, store it in the distance field), sort by increasing distance, and then choose the first result (with index 1).
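To make the reading of this query concrete, the following Python sketch mirrors the three steps (compute, sort, index) over an in-memory list of rows. It is only an illustration of the semantics: the haversine formula is our choice of distance function, and the names `distance_km` and `nearest`, as well as the sample coordinates, are assumptions rather than part of DBTalk.

```python
import math


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest(rows, here):
    # comp: add a computed "distance" field to every row.
    computed = [dict(row, distance=distance_km(row["geo"], here)) for row in rows]
    # sort ... asc: order by increasing distance.
    ranked = sorted(computed, key=lambda row: row["distance"])
    # index 1 in DBTalk is the first result (Python index 0).
    return ranked[0]


# Hypothetical rows and user location.
restaurants = [
    {"name": "Far", "geo": (37.7749, -122.4194)},
    {"name": "Near", "geo": (37.4419, -122.1430)},
]
here = (37.4275, -122.1697)
closest = nearest(restaurants, here)
```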
3.3. Limitations of DBTalk
To keep DBTalk simple and understandable to end users, we omit set operations, including union, intersection, and set difference. We also omit group-by operations, subqueries, and quantifiers. While providing limited function, this design does guarantee that the language is canonicalizable. There is a unique canonical syntactic form for every query, which has been shown to improve accuracy of semantic parsers (Campagna et al., 2019).
4. Adapting Schema.org for Natural Language Processing
Schema.org is a markup vocabulary created to help search engines understand and index structured data across the web. Here, we introduce Schema.org and describe how we extend it to natural language queries.
4.1. Data model of Schema.org
Schema.org is based on RDF, and uses a graph data model, where nodes represent objects. Nodes are connected by properties and are grouped in classes. Classes are arranged in a multiple inheritance hierarchy where each class can be a subclass of multiple other classes, and the “Thing” class is the superclass of all classes. There is also a parallel hierarchy where the literal data types are defined. By convention, all class names start with an uppercase letter, while property names start with a lowercase letter.
Each property’s domain consists of one or more classes that can possess that property. The same property can be used in separate domains; e.g., “Person” and “Organization” classes both use the “owns” property. Subclasses of a class in the domain of a certain property can also make use of that property.
Each property’s range consists of one or more classes or primitive types. Additionally, as having any data is considered better than none, the “Text” type (free text) is always implicitly included in a property’s range. For properties where free text is the recommended or only type expected, e.g. the “name” property, “Text” is explicitly declared as the property type.
Schema.org is organized in layers, with a “core” layer representing the portion agreed by all users, a “pending” layer including proposed addition to the core, and various domain-specific extension layers (e.g., “bib”, “auto”). Here we only consider the core layer.
4.2. NL-Schema Representation
NL-Schema, and templates derived from it, provide the domain-specific information used in the grammar-driven generation of training data. To facilitate the generation, NL-Schema uses a relational data model, where each table contains a fixed set of properties with fixed types. The grammar is shown in Fig. 4, and an example is shown in Fig. 2. Schema2QA includes a converter tool that leverages the Schema.org definitions and the Schema.org data in websites to translate the graph-based Schema.org representation automatically into NL-Schema.
|Table Definition tdef|
|Field Definition fdef|
|Canonical Form cf|
|Table name tn||identifier|
|Field name fn||identifier|
Tables and record types. In NL-Schema, we distinguish between entity and non-entity classes. Entity classes are those that refer to well-known meaningful identities, such as people, organization, places, events, URLs, with names that the user can recognize. All other classes are considered non-entity; they can only be referred to as properties of other classes.
An entity class is represented as a table, whose properties make up the columns. A column may be of a primitive type, a reference to another entity class, or a record type. Non-entity classes are represented as anonymous record types, with the exception of recursive classes. DBTalk does not support recursive record types, in which a record type has a field of the same type, so recursive non-entity classes are mapped to nameless tables instead of record types. For example, the “Review” class is a recursive non-entity class, because it inherits the “review” property from “CreativeWork”, and that property in turn refers to the “Review” class.
When a class is used as a property, it often uses only a subset of all the possible fields. Consider the non-entity class “Rating”. When referred to as the “aggregateRating” property of the “Restaurant” class, it uses the fields “reviewCount” and “ratingValue”; when referred to as the “reviewRating” property of the “Review” class, it uses the field “ratingValue” and might use the “author” field. We use Schema.org data scraped from websites to determine the fields used in practice for each class’s property, and create a custom record type holding only such fields for that property. This limits the vocabulary to only the relevant terms for each class’s property.
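One way to picture this data-driven step is the sketch below, which scans scraped objects and records which sub-fields actually occur for each (class, property) pair. It assumes the objects are JSON-LD-style dictionaries with an “@type” key; the function name `observed_record_fields` and the sample objects are hypothetical and do not reflect the actual converter.

```python
from collections import defaultdict


def observed_record_fields(objects):
    """Map (class, property) to the set of sub-fields seen in scraped data."""
    fields = defaultdict(set)
    for obj in objects:
        for prop, value in obj.items():
            if isinstance(value, dict):
                # Skip JSON-LD keywords such as "@type" inside the nested record.
                fields[(obj["@type"], prop)].update(
                    key for key in value if not key.startswith("@"))
    return dict(fields)


# Placeholder objects shaped like Schema.org markup; values are invented.
scraped = [
    {"@type": "Restaurant",
     "aggregateRating": {"@type": "AggregateRating",
                         "ratingValue": 4.5, "reviewCount": 10}},
    {"@type": "Review",
     "reviewRating": {"@type": "Rating", "ratingValue": 5, "author": "Bob"}},
]
record_fields = observed_record_fields(scraped)
```

Each resulting set would then define the custom record type for that property.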
Array types. Since correctly distinguishing singular and plural properties is necessary to generate good training sentences, we introduce cardinality to NL-Schema fields. For example, for plural properties such as “review” we can ask “how many reviews does a restaurant have?”, and “what restaurant has the most reviews?”.
Schema2QA considers any Schema.org property with “ItemList” type as an array. It also heuristically analyzes the documentation comment provided on Schema.org to identify arrays. Furthermore, when the element type of the array is not provided by Schema.org, Schema2QA heuristically infers it from the property name. Empirically, we found that our heuristics work well in the 3 domains we evaluated, with the exception of properties of the “Thing” class such as “image” and “description”, which are described as plural in the comment, but have one value per object in practice.
Union types. For compatibility with existing websites and extensibility, the range of many properties in Schema.org includes multiple classes, effectively creating a union type. To avoid ambiguity when parsing natural language, and to avoid having to resolve the type dynamically at runtime, which can lead to missing data and confusion, NL-Schema does not support union types. For each property, it picks among the types in its range the one with the highest priority, which is defined in decreasing order: record types, DBTalk primitive types, entity references, and finally strings. All website data is cast to the chosen type as follows: If the data contains a primitive value or a record where an entity reference is expected, Schema2QA creates new entities, and assigns new unique IDs. Conversely, if the data contains a record where a primitive value or an entity reference is expected, Schema2QA reports a warning and selects a property of that record as the primitive value. Note that website data often do not respect the type declared in Schema.org; in that case, Schema2QA will automatically cast the data to the correct type or discard the value.
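The priority rule for resolving union types can be sketched in a few lines of Python. The priority order is taken from the text above; how each Schema.org type is classified into a kind (for example, treating “PostalAddress” as a record type) is an assumption made for illustration, and `pick_type` is a hypothetical name.

```python
# Decreasing priority, as described in the text.
PRIORITY = ["record", "primitive", "entity", "string"]


def pick_type(range_types):
    """Pick the highest-priority type from a property's range.

    range_types is a list of (type_name, kind) pairs, where kind is one of
    the entries in PRIORITY.
    """
    return min(range_types, key=lambda t: PRIORITY.index(t[1]))


# Assumed classification of the range of the "address" property.
address_range = [("Text", "string"), ("PostalAddress", "record")]
chosen = pick_type(address_range)
```

Website data that arrives in one of the non-chosen types would then be cast to the chosen one following the rules above.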
4.3. Natural Language Annotations
Schema2QA uses a template-based method to synthesize a large training set mapping natural language questions to DBTalk queries. To do so, each property in NL-Schema is annotated with its canonical forms: a list of short phrases that indicate how the property and its value are referred to in natural language. Each canonical form also indicates the part-of-speech it belongs to, so that it is used correctly in natural language templates. For example, the “author” property of “Review” has canonical forms “author” (noun phrase) and “written by” (adjective phrase).
Canonical forms might have a component that precedes the property value (“prefix”) and one that follows it (“suffix”). For example, one of the canonical forms of the “servesCuisine” property has prefix “serves” and suffix “cuisine”. This allows Schema2QA to generate sentences of the form “restaurants that serve Italian cuisine”, where the value “Italian” is in between “serves” and “cuisine”.
Automatic Generation of Canonical Forms. When converting to NL-Schema, Schema2QA automatically generates a canonical form for each property, based on the property name and type. It converts the camel-cased names into multiple words and removes redundant words at the end of property names. For example, it converts “worksFor” to “works for”, and “ratingValue” to “rating”. Schema2QA removes the table name or the names of the parent record type from the property name, if present. For example, “reviewRating” is converted to “rating”. Schema2QA also recognizes the verbs “has” and “is”, which are commonly used as prefixes of property names in Schema.org, and uses them to identify the correct part of speech for the generated canonical form. Developers can manually add more canonical forms or refine the generated ones.
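The name-based rules above can be approximated with the following Python sketch, which reproduces the three examples given in the text. The list of redundant trailing words and the choice to strip the owner name only from the start of the property name are assumptions; `canonical_form` is a hypothetical helper, not the Schema2QA implementation.

```python
import re

# Assumed, illustrative list of redundant trailing words.
REDUNDANT_SUFFIXES = {"value"}


def canonical_form(property_name, owner_name=None):
    """Derive a canonical phrase from a camel-cased property name."""
    # Split at lower-to-upper boundaries: "worksFor" -> ["works", "for"].
    words = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", property_name).lower().split()
    # Drop a redundant final word: "rating value" -> "rating".
    if len(words) > 1 and words[-1] in REDUNDANT_SUFFIXES:
        words = words[:-1]
    # Drop the table or parent record name if it leads the property name.
    if owner_name and len(words) > 1 and words[0] == owner_name.lower():
        words = words[1:]
    return " ".join(words)
```

For example, `canonical_form("reviewRating", owner_name="Review")` yields “rating”, matching the conversion described above.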
The canonical forms and types are also used in displaying the results of the query. If the query returns many fields, a default priority is used to present the most important ones. For each table, developers can override the formatting information.
Parts-of-Speech Annotations. Previous work by Wang et al. (2015) asked developers to provide canonical forms in one of two parts-of-speech (POS): a noun phrase or a verb phrase. This simplistic POS characterization generates low-quality sentences that are unsuitable for training. Subsequent work shows that allowing developers to supply domain-specific templates can generate higher quality synthetic sentences, which can be included with a small amount of paraphrase data for effective training (Campagna et al., 2019). We propose classifying canonical forms into more POS categories (the abbreviations are based on POS tags from the Penn Treebank tagset) so generic templates can be used to synthesize more varied and useful training sets. Schema2QA applies an off-the-shelf part-of-speech tagger to identify the POS tag of generated canonical forms.
The noun-phrase for property field (NNP) tag denotes a noun phrase representing what a subject has. For example, “reviews” is an NNP canonical form for the “review” property; examples of generated sentences are “show me the reviews of the restaurant?” and “which restaurant has more than 5 reviews?”. Most properties defined in Schema.org have at least one NNP canonical form.
The noun-phrase for identity field (NNI) tag denotes a noun phrase representing what a subject is. For example, “alumni of” is an NNI canonical form for the “alumniOf” property; an example is “who is an alumni of NTU?”.
The verb-phrase field (VB) tag denotes a verb phrase representing what a subject does. For example, “serves” is a VB canonical form for the “servesCuisine” property; an example is “what restaurants serve tacos?”.
The adjective-phrase field (JJ) tag denotes an adjective or passive verb. For example, “rated” is a JJ canonical form for the “rating” property; an example is “restaurants rated 4.5 or above”. Similarly, the preposition “by” is a JJ canonical form for the “author” property of “Review”; an example is “show me reviews by Bob Smith”.
Additionally, in some cases the property is not explicit in the question, and the value is sufficient to infer what property the query should use. Thus, we create two categories based on values:
The noun-phrase for value (NNV) tag on the property denotes that the value of a property indicates what a subject is. For example, the property “jobTitle” is annotated NNV, and an example is “who is a CEO?”. The property “jobTitle” is implicit.
The adjective-phrase for value (JJV) tag denotes that the value of a property is an adjective. For example, the property “servesCuisine” is annotated JJV, and an example is “show me Mexican restaurants”. Here, the property “servesCuisine” is implicit.
5. Training Set Generation
|projection “the” “of”||“the rating of Panda Express”|
|table “that have” “more than”||“restaurants that have rating more than 4”|
|table “with the highest”||“restaurants with the highest rating”|
|question “how far is”||“how far is Starbucks?”|
|Template Candidates||Positive Example Queries||Negative Example Queries|
|table||“Mexican restaurants”||“4.5 hotels”|
|table “in”||“hotels in Florida”||“restaurants in Mexican ”|
|table “with”||“hotels with fitness center”||“restaurants with Mexican”|
|table “containing”||“hotels containing fitness center”||“restaurants containing Mexican”|
|table||“person works for Google”||“restaurants serves Mexican”|
|table||“Mexican cuisine restaurants”||“Nobel prize award person”|
|table “with”||“restaurants with Mexican cuisine”||“hotels with Florida state”|
|table||“5 star rated restaurants”||“Nobel prize award received person”|
Schema2QA uses Genie (Campagna et al., 2019) to generate training data for neural semantic parsers. Genie accepts templates, which are production rules mapping the grammar of natural language to a semantic function that produces the corresponding code. Formally, a template is expressed as:
The non-terminal nt is produced by expanding the terminals and non-terminals on the right-hand side of the sign; the bound variables are used by the semantic function sf to produce the corresponding code. Genie expands the templates by substituting the non-terminals with the previously generated derivations and applying the semantic function. The expansion continues recursively, up to a configurable depth.
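The expansion process can be pictured with the following Python sketch. It is a simplified stand-in for Genie and not its actual implementation: templates are triples of a non-terminal, a right-hand side, and a semantic function; items starting with “$” are non-terminals; and the “code” produced is a Python tuple rather than real DBTalk syntax. All names here are hypothetical.

```python
from itertools import product

# (non-terminal, right-hand side, semantic function)
TEMPLATES = [
    ("table", ["restaurants"], lambda: "Restaurant"),
    ("value", ["Chinese"], lambda: "Chinese"),
    ("table", ["$table", "that serve", "$value", "cuisine"],
     lambda table, value: ("filter", table, "servesCuisine", value)),
]


def expand(templates, max_depth):
    """Return {non-terminal: [(sentence, code), ...]} after max_depth rounds."""
    derivations = {}
    for _ in range(max_depth):
        new = {nt: list(items) for nt, items in derivations.items()}
        for nt, rhs, sf in templates:
            # Non-terminals draw from earlier derivations; terminals are fixed.
            slots = [derivations.get(s[1:], []) if s.startswith("$")
                     else [(s, None)] for s in rhs]
            for combo in product(*slots):
                sentence = " ".join(text for text, _ in combo)
                args = [code for (_, code), s in zip(combo, rhs)
                        if s.startswith("$")]
                item = (sentence, sf(*args))
                if item not in new.setdefault(nt, []):
                    new[nt].append(item)
        derivations = new
    return derivations


pairs = expand(TEMPLATES, max_depth=2)
```

With a depth of 1 only the terminal-only templates fire; at depth 2 the third template combines them into “restaurants that serve Chinese cuisine” paired with its filter tuple.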
When generating complex queries, Genie depends on generic templates to form the sketch of the query and domain-specific templates to refer to the details. We introduce the two kinds of templates in the following sections.
5.1. Generic templates
Generic templates are written by hand, and map natural language compositional constructs to formal language operators. Many sentences can be covered with generic DBTalk templates, such as: table “that”
This template can generate both “hotel that offers free parking” and “person that works for Google”. In the former case, the table is “Hotel”, and the property is “amenityFeature” with VB canonical form “offers”. In the latter case, the table is “Person” and the property is “worksFor”, with VB canonical form “works for”.
Templates rely on the type system to form sentences that are meaningful and map to correct queries. For example, in the following template, only properties of comparable type can be used: table “whose” “is less than” The template would generate “restaurants whose rating is less than 3”, but it would not generate “restaurants whose cuisine is less than Italian”, as the latter would not typecheck. Typing is also used to automatically add computation, aggregation and sorting.
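As a rough illustration of how typing prunes meaningless sentences, the sketch below applies a “less than” template only to fields whose type is comparable. The set of comparable types, the field table, and the function name are assumptions made for this example; they are not the actual Schema2QA type checker.

```python
# Assumed set of types that support ordering comparisons.
COMPARABLE = {"Number", "Measure", "Date", "Time"}

# Hypothetical fields: canonical form -> (type, sample value).
FIELDS = {
    "rating": ("Number", 3),
    "cuisine": ("String", "Italian"),
}


def less_than_phrases(table_phrase, fields):
    """Instantiate the 'whose ... is less than ...' template where it typechecks."""
    return [
        f"{table_phrase} whose {name} is less than {value}"
        for name, (type_name, value) in fields.items()
        if type_name in COMPARABLE
    ]


phrases = less_than_phrases("restaurants", FIELDS)
```

Here only the “rating” field passes the check, so the “cuisine” variant is never generated.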
Schema2QA includes 512 hand-curated generic templates. Of these, 208 (41%) are related to filters. Table 1 shows that filters are used in many meaningful queries, so Schema2QA puts particular emphasis on generating understandable and varied questions for them. Schema2QA also includes 94 templates for argmax and argmin operations: these are templates that combine the sort and index operation in DBTalk to select the minimum or maximum row in the table. Argmin and argmax templates are used to understand questions that choose the most highly rated restaurant (Table 1).
Schema2QA makes use of DBTalk’s native support for the Location type to understand “where” questions and “how far” (distance) questions. These questions use the “geo” property, and are available in all tables that support that property.
Additional examples of templates, together with possible sentences that they generate, are shown in Table 2. The table shows one example of a projection asking for a specific field of the table, one example of a comparison filter, one example of an argmax operator, and one example of a computed field.
5.2. Domain-Specific Templates
We wish to augment generic templates with domain-specific ones so the generated training data can include terminology specific to each domain. To scale across all domains, we have developed a data-driven generator for domain-specific templates.
For English queries, we have identified 8 common templates for specifying filters on tables, as shown in column 1 in Table 3. Columns 2 and 3 illustrate that which template gets used is highly dependent on the specific property. We exhaustively apply each of the templates to every property in the NL-Schema by substituting the parameters with canonical forms of the right POS, and by replacing values with sampled data from the websites. We test if the resulting phrases are used in practice by counting the number of exact matches returned by a web search engine. We normalize the count to adjust for the fact that longer queries have fewer exact matches. The 4 most commonly used templates are adopted for each property.
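The selection step above can be sketched as follows. The text does not give the normalization formula, so multiplying the hit count by the number of words is purely an assumption; the hit counts are made-up placeholders, and `hit_count` stands in for a real search-engine query. `select_templates` is a hypothetical name.

```python
def select_templates(phrases, hit_count, normalize, keep=4):
    """Rank candidate phrases by normalized exact-match count; keep the top ones."""
    scores = {p: normalize(hit_count(p), p) for p in phrases}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Placeholder counts for illustration only; not measured data.
FAKE_HITS = {
    "Mexican restaurants": 1000,
    "restaurants in Mexican": 2,
    "restaurants with Mexican": 5,
    "restaurants with Mexican cuisine": 300,
    "Mexican cuisine restaurants": 200,
    "restaurants serves Mexican": 1,
}

selected = select_templates(
    list(FAKE_HITS),
    hit_count=FAKE_HITS.get,
    # Assumed normalization: favor longer phrases, which match less often.
    normalize=lambda count, phrase: count * len(phrase.split()),
)
```

The templates behind the surviving phrases would be the ones adopted for the property.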
6. Experimental Results
We used Schema2QA to create three Q&A skills in the open-source Almond virtual assistant (Campagna et al., 2017), as an end-to-end demonstration of its functionality. The skills and Schema2QA will be released upon publication. A screenshot of the Schema2QA interface is shown in Fig. 5. At the moment, our system supports only American English questions, and can be trained only on English websites. Extensions to other languages are left as future work.
In this section, we evaluate the performance of Schema2QA. We first describe the dataset we used for training and evaluation. Then we present experimental results to answer the following questions: (1) How does Schema2QA perform on aggregator websites? (2) How does paraphrase data affect the accuracy? (3) How does the synthesized data compare with prior work? (4) Can the knowledge we learn from one domain be transferred to a related domain? And finally, (5) how would Schema2QA work in the real world?
6.1. Training and Evaluation Dataset
Schema2QA does not use any real data for training, which significantly reduces the cost of data acquisition. The training data is generated automatically with templates; a small fraction of the generated data is paraphrased by crowdsource workers. In addition, the dataset is augmented with property values extracted from the website data.
We, however, use realistic data for validation and testing, which has been shown to be significantly more challenging than testing with paraphrase data (Campagna et al., 2019). Crowdsource workers are presented with a list of properties in the relevant domain and a few examples of queries, and are asked to come up with 5 questions. We do not show them any sentences or website data we used for training, and allow workers to freely choose the value of each property. We ask half of the workers to produce queries about a single property, and the other half to produce queries related to two properties of their choice. The questions are then annotated by hand with their DBTalk representation, by an author of the paper, and then split into the dev set and test set. Our metric of query accuracy measures if the entire generated query matches the annotation.
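As a minimal sketch of the query accuracy metric, the following Python function counts a prediction as correct only if the whole query string equals the annotation. Collapsing whitespace before comparing is an assumption of this sketch, not a detail stated in the paper, and queries are treated as opaque strings rather than parsed DBTalk.

```python
def normalize(query: str) -> str:
    # ASSUMPTION: collapse runs of whitespace so that formatting differences
    # alone do not count as errors.
    return " ".join(query.split())


def query_accuracy(predictions, annotations):
    """Fraction of examples whose predicted query matches the annotation exactly."""
    if len(predictions) != len(annotations):
        raise ValueError("predictions and annotations must have the same length")
    if not annotations:
        return 0.0
    correct = sum(
        normalize(p) == normalize(a) for p, a in zip(predictions, annotations)
    )
    return correct / len(annotations)
```

Because the comparison covers the entire query, a prediction that gets one filter right and another wrong receives no partial credit.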
6.2. Applying Schema2QA to Aggregator Sites
In our first experiment, we apply Schema2QA to two major aggregator websites in two different domains: Yelp for restaurants and LinkedIn for people. We chose them because they aggregate many entities within their domain and they make extensive use of Schema.org markup. They have abundant structured information, which allows Schema2QA to answer rich and interesting questions.
The Yelp data contains restaurants with 10 properties including “servesCuisine”, “reviews”, “aggregateRating”, as well as reviews with 4 properties: “reviewRating”, “author”, “datePublished”, and “description”. The LinkedIn data contains data about people, with 5 properties: “alumniOf”, “worksFor”, “address”, “award”, and “name”.
We observe that crowdsource workers for the Restaurant domain generate about 100 questions that refer to three or more properties, despite being instructed to just generate queries involving one or two. This is not observed with LinkedIn data probably because LinkedIn has fewer properties to choose from.
Based on the development set, we refined the generic templates. New templates we found from validation include: projection on two properties (“what is the address and the telephone of …?”), filters that use “both” (“who works for both Google and Amazon?”, “what restaurant serves both ramen and sushi?”), comparisons that use “or more” (“restaurants with 4 stars or more”). We also used the dev set to refine the canonical forms of the properties. Note that the author who annotated the test set did not refine the templates and canonical forms after annotating. This guarantees that NL-Schema is only tuned based on observations from the dev set.
We train and evaluate on each of the two domains separately. The accuracy of Schema2QA is shown in Figure 6. On Yelp data, Schema2QA achieves 74% overall and 81% on questions with one property. For more complicated questions, Schema2QA still gets 71% on questions with two properties, and 65% with three or more properties. On LinkedIn data, Schema2QA achieves 78% overall: 80% on questions with one property and 76% on questions with two properties. In the figure, the column for three or more properties is marked “N/A” because there is no test data.
Overall, this result, which was achieved without any real data in training, shows that Schema2QA can build an effective parser at a low cost. Additionally, developers can add more domain-specific templates and canonical forms to further increase the accuracy with little additional cost. Schema2QA is able to achieve reasonably high accuracy for complex queries involving multiple properties because the synthesis generates many combinations. This allows us to outperform commercial assistants like Google, Alexa, and Siri on a variety of questions, as shown in Table 1.
6.3. Impact of Paraphrase Data
To understand the contribution of the synthetic and paraphrase training data, we evaluate the accuracy of the parser with different amounts of paraphrase data. Results are shown in Fig. 7. Without any paraphrase data, the model already achieves an accuracy of 57% for queries with one property, which drops to 44% for queries with two or more properties. Adding just 25% of the paraphrase data improves the accuracy to 74% for questions with one property, 62% with two properties, and 58% with three or more properties. This indicates that the model trained with no paraphrases overfits on the templatized sentences; adding even a small amount of paraphrase data forces it to pay more attention to the language, improving the accuracy. Using the full paraphrase set leads to 81% for one property, 71% for two properties, and 65% for three or more.
6.4. Comparison With Sempre Templates
Schema2QA uses a significantly more sophisticated template language and synthesis generation algorithm in comparison to Sempre, the first to introduce the use of canonical sentences to create synthetic data for Q&A semantic parsers. Here we evaluate the quality of our synthesized data set compared to the one generated using the Sempre language.
We apply the same data augmentation to both sets, and train with only synthesized data. On the Restaurants domain, training with Sempre templates achieves less than 1% accuracy at all three complexity levels. On the other hand, training with only synthesized data produced with Schema2QA templates achieves 57% accuracy for one property, 44% for two properties, and 45% for three or more properties. On the People domain, training with Sempre templates achieves 10% accuracy for questions with one property, and 2% accuracy for questions with two properties, whereas Schema2QA templates achieve 53% for one property, and 30% for two.
This result shows that the synthesized data we generate matches our realistic test questions more closely. Our templates are more tuned to understand the variety of filters that commonly appear in the test. On the other hand, Sempre templates are tuned for questions with many joins, which are not common in the domains we tested. Synthesizing data that matches the test data more closely means we do not need to rely as much on expensive paraphrasing.
6.5. Transfer Learning Across Domains
Many domains in Schema.org share common classes and properties. Is it possible to transfer the learning from one domain to another and create a semantic parser for a new domain without getting new annotations, domain-specific templates, or even paraphrases manually? We take the Schema2QA skill for restaurants, and apply it to hotels; restaurants and hotels share many of the same fields. The Hotel class has additional properties “checkinTime”, “checkoutTime”, and “amenityFeature”, but it does not have the “servesCuisine” property found in the Restaurant class.
For training data synthesis, besides using the automatically generated canonical forms, we also use canonical forms manually created for those common fields with restaurants. We adapt the paraphrases for the restaurant domain to the hotel domain by replacing the words “restaurant”, “diner”, “canteen”, etc. with the word “hotel”. We augment the synthesized and paraphrase data sets with data from the Hyatt hotel chain.
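The paraphrase adaptation described above amounts to a word substitution, which can be sketched as follows. The word list contains only the three words named in the text (the "etc." implies the actual list is longer), and preserving a plural "s" is an assumption of this sketch.

```python
import re

# Only the words named in the text; the list used in practice is longer.
RESTAURANT_WORDS = ["restaurant", "diner", "canteen"]

# Match whole words only, with an optional plural "s" captured separately.
_PATTERN = re.compile(
    r"\b(" + "|".join(RESTAURANT_WORDS) + r")(s?)\b", re.IGNORECASE
)


def adapt_to_hotels(sentence: str) -> str:
    # Replace each restaurant word with "hotel", keeping the plural "s"
    # when the original word had one (ASSUMPTION of this sketch).
    return _PATTERN.sub(lambda m: "hotel" + m.group(2), sentence)


adapted = adapt_to_hotels("show me diners near the canteen")
```

The word-boundary anchors keep unrelated words such as "dinner" untouched.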
We acquire an evaluation set of 362 questions, crowdsourced from MTurk, and annotated by hand. These are divided into 181 for validation and 181 for test. 86 of the test questions use one property, 62 use two properties, and 33 use three or more. On the test set, the generated parser achieves an overall accuracy of 55%. Furthermore, on the subset of the test set that does not use any hotel-specific property, the accuracy is 63%. This shows that the knowledge from one domain can be transferred to a new domain with no manual effort, if the domain shares the same properties.
6.6. Applying Schema2QA To The Web
The semantic parser generated by Schema2QA can be applied to new websites of the same domain. To validate this capability, we created two Q&A systems, one for restaurants and one for hotels, in the cities of Washington DC and New York, and in the state of Hawaii. We use Google Custom Search Engine and identify 311 hotel and 475 restaurant websites that include Schema.org markup. They exhibit wide variability in the properties used. Fig. 8 shows the proportion of websites using each of the properties that appear in at least 10% of the sites.
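The statistic behind such a plot can be computed as sketched below. The input representation (one set of property names per crawled website) is an assumption made for illustration; the 10% threshold is the one stated in the text.

```python
from collections import Counter


def property_coverage(sites, min_fraction=0.10):
    """Fraction of websites using each property, keeping only common ones.

    sites: list of sets, each holding the Schema.org property names that
    appear in one website's markup (a hypothetical representation).
    """
    if not sites:
        return {}
    # Count each property at most once per website.
    counts = Counter(prop for props in sites for prop in set(props))
    total = len(sites)
    return {
        prop: n / total
        for prop, n in counts.most_common()
        if n / total >= min_fraction
    }


# Made-up crawl of 20 sites: "name" everywhere, "geo" on 2, "award" on 1.
sites = [{"name"}] * 17 + [{"name", "geo"}] * 2 + [{"name", "award"}]
coverage = property_coverage(sites)
```

In this made-up crawl, "award" appears on only 5% of the sites and is therefore filtered out.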
We unify the crawled data into a single knowledge base. The neural models trained on the Yelp and the Hyatt websites can be used to answer questions that aggregate across all these websites immediately. Yelp alone covers 9 of the top 15 common properties for restaurants, and Hyatt covers 9 of the top 13 common properties for hotels. For example, we can ask the hotel and restaurant questions in Table 1 across the crawled websites without any skill-by-skill engineering. We expect more websites will add Schema.org markup into their websites to support natural language queries about their content.
7. Related Work
Question answering (QA) is a well-studied problem in Natural Language Processing, with work dating back to the 60s (Green Jr et al., 1961). A subset of the QA field is knowledge-based question answering (KB QA), where the answer can be found from a graph or relational database by executing an appropriate query.
Semantic parsing techniques for KB QA (Zelle and Mooney, 1994, 1996; Tang and Mooney, 2001; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Yahya et al., 2012; Pasupat and Liang, 2015; Wang et al., 2015; Xiao et al., 2016) have focused on generating an executable query in a domain-specific query language. More recently, work has been focused on generating SQL directly (Zhong et al., 2017; Xu et al., 2017; Iyer et al., 2017; Yu et al., 2018; Yavuz et al., 2018), which allows the QA system to interact with an unmodified traditional database. Semantic parsing has also been applied to event-driven virtual assistant commands (Quirk et al., 2015; Beltagy and Quirk, 2016; Dong and Lapata, 2016; Yin and Neubig, 2017; Campagna et al., 2017, 2019), instructions to robotic agents (Kate et al., 2005; Kate and Mooney, 2006; Wong and Mooney, 2006; Chen and Mooney, 2011), and trading card games (Ling et al., 2016; Yin and Neubig, 2017; Rabinovich et al., 2017).
In general, though, training a semantic parser requires a large corpus of questions annotated with the corresponding query, which is expensive. Previous work has proposed crowdsourcing paraphrases to bootstrap new semantic parsers (Wang et al., 2015). The previously proposed Genie toolkit further suggested training with data synthesized from manually tuned templates (Campagna et al., 2019). Genie requires each skill to provide domain-specific templates mapping to website-specific APIs. In this paper, we make use of Genie, but propose a larger and more varied set of generic templates, as well as automatically generated domain-specific templates, that reduce the amount of manual effort. Furthermore, by leveraging Schema.org markup, the effort in building skills with Schema2QA is per-domain rather than per-skill.
Using Schema.org in Virtual Assistants.
Prior work has also investigated answering questions based on Semantic Web data and Linked Data knowledge-graph repositories (Yahya et al., 2012, 2013). More recently, the schema.org vocabulary has also been used by commercial virtual assistants as the intermediate representation for their builtin skills, for example in the Alexa Meaning Representation Language (Kollar et al., 2018). Efforts to support complex and compositional queries based on schema.org require expert annotation on large training sets (Perera et al., 2018). Furthermore, because of the cost of building such training sets, compositional query capabilities are not available to third parties, which are limited to an intent classifier and slot tagger system (Kumar et al., 2017).
Google Assistant is also able to automatically generate skills for websites that use schema.org markup, and supports five domains. Each skill is automatically built by pairing the crawled website data with predefined models. While this approach supports multiple websites, it requires a substantial amount of work in annotating the training set, and the models are not transferable. In addition, automatically generated skills do not answer aggregated questions.
Our approach not only scales to the number of websites, but also to the number of domains, with only a small amount of developer effort. Furthermore, each website can own their generated semantic parser and improve it for their own use case, instead of relying on a proprietary one.
8. Conclusion
This paper presents Schema2QA, a semi-automated tool that can build QA virtual assistant skills across a variety of websites, based on the existing Schema.org markup.
Schema2QA translates natural language questions into a query language we designed, called DBTalk, using a neural semantic parser. Schema2QA uses NL-Schema, an extension of the Schema.org definitions that includes natural language annotations, to automatically generate a large data set to train the neural network. This training set is generated based on a combination of built-in generic templates as well as automatically generated and developer provided domain-specific templates.
Experimental results suggest that the skill produced by Schema2QA can be effective at answering a variety of questions, with an accuracy of 74% overall for restaurants, and 78% for LinkedIn. Furthermore, Schema2QA can answer 65% of questions that use three or more properties, which is a significant improvement over existing assistants that can at most support one or two filters.
By making Schema2QA publicly available to every developer, we wish to encourage the creation of a linguistic web that is open to every assistant.
Acknowledgements. We thank Ramanathan V. Guha for his help and suggestions on Schema.org. This work is supported in part by the National Science Foundation under Grant No. 1900638 and the Stanford MobiSocial Laboratory, sponsored by AVG, Google, HTC, Hitachi, ING Direct, Nokia, Samsung, Sony Ericsson, and UST Global.
- Improved semantic parsers for if-then statements. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Almond: the architecture of an open, crowdsourced, privacy-preserving, programmable virtual assistant. In Proceedings of the 26th International Conference on World Wide Web - WWW ’17, New York, New York, USA, pp. 341–350.
- Genie: a generator of natural language semantic parsers for virtual assistant commands. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, New York, NY, USA, pp. 394–410.
- Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pp. 859–865.
- Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference, pp. 219–224.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
- Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Learning to transform natural to formal languages. In Proceedings of the 20th national conference on Artificial intelligence-Volume 3, pp. 1062–1068.
- Using string-kernels for learning semantic parsers. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL - ACL '06.
- Amazon alexa has 100k skills but momentum slows globally. here is the breakdown by country. Voicebot.ai. Note: https://voicebot.ai/2019/10/01/amazon-alexa-has-100k-skills-but-momentum-slows-globally-here-is-the-breakdown-by-country/
- The alexa meaning representation language. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers).
- Just ASK: building an architecture for extensible self-service spoken language understanding. CoRR abs/1711.00549.
- Latent predictor networks for code generation.
- The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730.
- Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Multi-task learning for parsing the alexa meaning representation language. In American Association for Artificial Intelligence (AAAI), pp. 181–224.
- 47.3 million u.s. adults have access to a smart speaker, report says. TechCrunch. Note: https://techcrunch.com/2018/03/07/47-3-million-u-s-adults-have-access-to-a-smart-speaker-report-says/
- Language to code: learning semantic parsers for if-this-then-that recipes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1139–1149.
- Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112.
- Using multiple clause constructors in inductive logic programming for semantic parsing. In Machine Learning: ECML 2001, pp. 466–477.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1332–1342.
- Learning for semantic parsing with statistical machine translation. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pp. 439–446.
- Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 960–967.
- Sequence-based structured prediction for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1341–1350.
- SQLNet: generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.
- Natural language questions for the web of data. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 379–390.
- Robust question answering over the web of linked data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pp. 1107–1116.
- What it takes to achieve 100% condition accuracy on WikiSQL. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1702–1711.
- A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 440–450.
- Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. CoRR abs/1809.08887.
- Inducing deterministic prolog parsers from treebanks: a machine learning approach. In AAAI, pp. 748–753.
- Learning to parse database queries using inductive logic programming. In Proceedings of the thirteenth national conference on Artificial intelligence-Volume 2, pp. 1050–1055.
- Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 658–666.
- Seq2SQL: generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.