SmBoP: Semi-autoregressive Bottom-up Semantic Parsing

10/23/2020 ∙ by Ohad Rubin, et al. ∙ Tel Aviv University

The de-facto standard decoding method for semantic parsing in recent years has been to autoregressively decode the abstract syntax tree of the target program using a top-down depth-first traversal. In this work, we propose an alternative approach: a Semi-autoregressive Bottom-up Parser (SmBoP) that constructs at decoding step t the top-K sub-trees of height ≤ t. Our parser enjoys several benefits compared to top-down autoregressive parsing. First, since sub-trees in each decoding step are generated in parallel, the theoretical runtime is logarithmic rather than linear. Second, our bottom-up approach learns representations with meaningful semantic sub-programs at each step, rather than semantically vague partial trees. Last, SmBoP includes Transformer-based layers that contextualize sub-trees with one another, allowing us, unlike traditional beam-search, to score trees conditioned on other trees that have been previously explored. We apply SmBoP on Spider, a challenging zero-shot semantic parsing benchmark, and show that SmBoP is competitive with top-down autoregressive parsing. On the test set, SmBoP obtains an EM score of 60.5%, similar to the best published score for a model that does not use database content, which is at 60.6%.




1 Introduction

Semantic parsing, the task of mapping natural language utterances into formal language (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Clarke et al.; Liang et al., 2011), has converged in recent years on a standard architecture. First, the input utterance is encoded with a bidirectional encoder. Then, the abstract syntax tree of the target program is decoded autoregressively (grammar-based decoding), often combined with beam search (Yin and Neubig, 2017; Krishnamurthy et al., 2017; Rabinovich et al., 2017).

Bottom-up semantic parsing, conversely, has received less attention (Cheng et al., 2019; Odena et al., 2020). In this work, we propose a bottom-up semantic parser, and demonstrate that, equipped with recent developments in Transformer-based (Vaswani et al., 2017) architectures, it offers several attractive advantages. First, our approach is semi-autoregressive: at each decoding step t, we generate in parallel the top-K program sub-trees of height ≤ t (akin to beam search). This leads to theoretical runtime complexity that is logarithmic in the tree size, rather than linear. Second, neural bottom-up parsing provides learned representations for meaningful (and executable) sub-programs, which are sub-trees computed during the search procedure. This is in contrast to top-down parsing, where hidden states represent partial trees without clear semantics. Last, working with sub-trees allows us to naturally contextualize and re-rank trees as part of the model. Specifically, before scoring trees on the beam, they are contextualized with a Transformer, allowing us to globally score each tree conditioned on other trees it may combine with in future steps. Similarly, a Transformer-based re-ranking layer takes all constructed tree representations and re-ranks them. This is in contrast to prior work (Goldman et al., 2018; Bogin et al., 2019; Yin and Neubig, 2019), where re-ranking programs was done using a completely separate model.

Figure 1: An overview of the decoding procedure of SmBoP. Z_t is the beam at step t, Z̃_t is the contextualized beam, F_t is the frontier (its node labels are logical operations applied on trees, as explained below), the pruned frontier keeps the top-K frontier trees, and Z_{t+1} is the new beam. At the top we see the new trees created in this step. For t = 0 (depicted here), the beam contains the predicted DB constants.

Figure 1 illustrates a single decoding step of our parser. Given a beam Z_t with K trees of height t (blue vectors), a beam Transformer contextualizes the trees on the beam together with the input utterance (orange). Then, the frontier, which contains all trees of height t+1 that can be constructed, using a grammar, from the current beam, is scored, and the top-K trees are kept (purple). Last, a representation for each of the K new trees is generated and placed in the new beam Z_{t+1}. After T decoding steps, SmBoP gathers all constructed tree representations and passes them through another, different Transformer layer that re-ranks them. Because we have gold trees at training time, the entire model is trained jointly using maximum likelihood.

We evaluate our model, SmBoP (SeMi-autoregressive Bottom-up semantic Parser; the name rhymes with ‘MMMBop’), on Spider (Yu et al., 2018), a challenging zero-shot text-to-SQL dataset with long compositional queries. When using an identical encoder and swapping an autoregressive decoder with the semi-autoregressive SmBoP, we observe on the Spider development set almost identical performance for a model based on BART-large (Lewis et al., 2020), and a slight drop in performance (1.6 EM points) for a model based on BART-base. This competitive performance is promising, seeing that SmBoP is the first semi-autoregressive semantic parser (we discuss potential advantages in §3.4). On the hidden Spider test set, SmBoP obtains 60.5% EM, comparable to RyanSQLv2, which obtains 60.6% EM. RyanSQLv2 is the highest-scoring model that, like SmBoP, does not use DB content as input. All our code will be made publicly available.

2 Background

Problem definition

We focus in this work on text-to-SQL semantic parsing. Given a training set of triples (x, y, S), where x is an utterance, y is its translation to a SQL query, and S is the schema of the target database (DB), our goal is to learn a model that maps new question-schema pairs (x, S) to the correct SQL query y. A DB schema S includes: (a) a set of tables, (b) a set of columns for each table, and (c) a set of foreign key-primary key column pairs describing relations between table columns. Schema tables and columns are termed DB constants.

RATSQL encoder

This work is focused on decoding, and thus we use the state-of-the-art RAT-SQL encoder (Wang et al., 2020), which we implement and now briefly review for completeness.

The RAT-SQL encoder is based on two main ideas. First, it provides a joint contextualized representation of the utterance and schema. Specifically, the utterance x is concatenated to a linearized form of the schema S, and both are passed through a stack of Transformer (Vaswani et al., 2017) layers. Then, tokens that correspond to a single DB constant are aggregated, which results in a final contextualized representation of the utterance tokens and DB constants, where each DB constant is represented by a single vector s. This contextualization of x and S leads to better representation and alignment between the utterance and schema.

Second, RAT-SQL uses relation-aware self-attention (Shaw et al., 2018) to encode the structure of the schema and other prior knowledge on relations between encoded tokens. Specifically, given a sequence of token representations x_1, …, x_n, relation-aware self-attention computes a scalar similarity score e_ij ∝ x_i W_Q (x_j W_K + r_ij)^T between pairs of token representations x_i, x_j. This is identical to standard self-attention (W_Q and W_K are the query and key parameter matrices), except for the term r_ij, which is an embedding that represents a relation between x_i and x_j from a closed set of possible relations. For example, if both tokens correspond to schema tables, an embedding will represent whether there is a primary-foreign key relation between the tables. If one of the tokens is an utterance word and the other is a table column, a relation will denote whether there is a string match between them. The same principle is also applied when computing the self-attention values, where another relation embedding matrix is used. We refer the reader to the RAT-SQL paper for exact details.
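To make this concrete, here is a minimal NumPy sketch of relation-aware attention scores. The function and variable names (`relation_aware_scores`, `W_q`, `W_k`, `R`) are ours, not RAT-SQL's, and the scaling follows the standard Transformer convention; this is an illustration of the scoring idea, not the authors' implementation.

```python
import numpy as np

def relation_aware_scores(X, W_q, W_k, R):
    """Relation-aware self-attention scores (a sketch after Shaw et al., 2018).
    X: (n, d) token representations; W_q, W_k: (d, d) query/key projections;
    R: (n, n, d) relation embeddings r_ij for every token pair."""
    n, d = X.shape
    Q = X @ W_q                       # queries q_i
    K = X @ W_k                       # keys k_j
    # e_ij = q_i . (k_j + r_ij) / sqrt(d): the relation embedding biases the
    # key of token j depending on its relation to token i.
    E = np.einsum('id,ijd->ij', Q, K[None, :, :] + R) / np.sqrt(d)
    # softmax over j for each query i
    E = E - E.max(axis=1, keepdims=True)
    A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)
    return A
```

With all-zero relation embeddings this reduces exactly to standard scaled dot-product attention.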

Overall, RAT-SQL jointly encodes the utterance, schema, the structure of the schema and alignments between the utterance and schema, and leads to state-of-the-art results in text-to-SQL parsing.

Autoregressive top-down decoding

The prevailing method for decoding in semantic parsing has been grammar-based autoregressive top-down decoding (Yin and Neubig, 2017; Krishnamurthy et al., 2017; Rabinovich et al., 2017), which guarantees decoding of syntactically valid programs. Specifically, the target program is represented as an abstract syntax tree under the grammar of the formal language, and linearized to a sequence of rules (or actions) using a top-down depth-first traversal. Once the program is represented as a sequence, it can be decoded using a standard sequence-to-sequence model with encoder attention (Dong and Lapata, 2016), often combined with beam search. We refer the reader to the aforementioned papers for further details on grammar-based decoding.
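The linearization step above can be illustrated with a small sketch; the tree labels are invented for illustration and do not correspond to any particular grammar from the cited papers.

```python
# A (label, children) tree linearizes to the rule sequence of a top-down
# depth-first traversal -- the sequence an autoregressive grammar-based
# decoder would emit one rule at a time.
def linearize(tree):
    label, children = tree
    seq = [label]                 # emit the rule for this node first
    for child in children:
        seq.extend(linearize(child))   # then recurse depth-first
    return seq
```

For example, a tree with root `Select` over `Project(col)` and `Filter(pred)` linearizes to `['Select', 'Project', 'col', 'Filter', 'pred']`, and the decoder predicts this sequence left to right.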

We now turn to describe our method, which provides a radically different approach for decoding in semantic parsing.

3 The SmBoP parser

Algorithm 1 SmBoP
1: input: utterance x, schema S
2: encode x and S with the RAT-SQL encoder
3: initialize the beam Z_0 with the top-K DB constants
4: for t = 0 … T−1 do
5:     contextualize the beam Z_t together with x
6:     construct the frontier of (t+1)-high trees and score it
7:     Z_{t+1} ← top-K frontier trees, with new representations
8: return the top tree after re-ranking all returnable trees

We first provide a high-level overview of SmBoP (see Algorithm 1 and Figure 1). As explained in §2, we encode the utterance and schema with a RAT-SQL encoder. We initialize the beam Z_0 with the top-K trees of height 0, i.e., DB constants, using a beam-initialization function. This function scores each DB constant independently and in parallel, and is defined in §3.2.

Next, we start the search procedure. At every step t, the entire beam is contextualized jointly with the utterance, providing a global representation for each sub-tree on the beam. This global representation is used to score every tree on the frontier: the set of sub-trees of height t+1 that can be constructed from sub-trees on the beam of height t. After choosing the top-K frontier trees, we compute a new representation for them. After T steps, we collect all trees constructed during the search procedure and re-rank them with another Transformer. Steps in our model operate on tree representations independently, and thus each step can be parallelized.

SmBoP resembles beam search in that at each step it holds the top-K trees of a fixed height. It is also related to (pruned) chart parsing, since trees at step t+1 are computed from trees that were found at step t. This is unlike sequence-to-sequence models, where items on the beam are competing hypotheses without any interaction.
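The search loop can be sketched as follows. Every callable here (`score_leaf`, `contextualize`, `expand_frontier`, `score_tree`, `represent`) is a hypothetical stand-in for the components defined in §3.2, not the authors' code; the skeleton only shows the control flow of semi-autoregressive bottom-up decoding.

```python
def smbop_decode(db_constants, score_leaf, contextualize, expand_frontier,
                 score_tree, represent, K, T):
    """Skeleton of semi-autoregressive bottom-up decoding. Each step scores
    the whole frontier at once and keeps the top-K, so the number of steps
    is T ~ tree height, not tree size."""
    # step 0: the beam holds the K best DB constants (trees of height 0)
    beam = sorted(db_constants, key=score_leaf, reverse=True)[:K]
    for t in range(T):
        ctx = contextualize(beam)            # beam Transformer (contextualize)
        frontier = expand_frontier(ctx)      # all candidate (t+1)-high trees
        frontier.sort(key=score_tree, reverse=True)
        beam = [represent(tree) for tree in frontier[:K]]  # new beam Z_{t+1}
    return beam
```

In the real model a final re-ranking layer would pick one tree out of everything constructed; the sketch just returns the last beam.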

We now provide the details of our parser. First, we describe the formal language representation we use (§3.1), then we provide precise details of our model architecture (§3.2), we describe the loss functions and training procedure (§3.3), and last, we discuss the properties of our parser compared to prior work (§3.4).

3.1 Representation of Query Trees

Relational algebra

Guo et al. (2019) have shown recently that the mismatch between natural language and SQL leads to parsing difficulties. Therefore, they proposed SemQL, a formal query language with better alignment to natural language.

Operation | Notation | Input | Output
Set union | ∪ | R × R | R
Set intersection | ∩ | R × R | R
Set difference | − | R × R | R
Cartesian product | × | R × R | R
Constant union | ∪ | C × C | C_s
Order by | τ | C_s × R | R
Group by | γ | C_s × R | R
In / Not in | ∈ / ∉ | C × R | P
Like / Not like | like / ¬like | C × C | P
Table 1: Our relational algebra grammar, along with the input and output semantic types of each operation. P: Predicate, R: Relation, C: Constant, C_s: Constant set.

In this work, we follow their intuition, but instead of SemQL, we use the standard query language relational algebra (Codd, 1970). Relational algebra describes queries as trees, where leaves (terminals) are DB constants and inner nodes (non-terminals) are operations (see Table 1). Similar to SemQL, its alignment with natural language is better than that of SQL. However, unlike SemQL, it is an existing query language, commonly used by SQL execution engines for query planning.

We write a grammar for relational algebra, augmented with SQL operators that are missing from relational algebra. We then implement a bidirectional compiler (transpiler), converting SQL queries to relational algebra for parsing, and then back from relational algebra to SQL for evaluation (we will also release the compiler code). Table 1 shows the full grammar, including the input and output semantic types of all operations. A relation (R) is a tuple (or tuples), a predicate (P) is a Boolean condition (evaluating to True or False), a constant (C) is a DB constant or value, and a constant set (C_s) is a set of DB constants. Figure 2 shows an example relational algebra tree with the corresponding SQL query. More examples illustrating the correspondence between SQL and relational algebra (e.g., for the SQL JOIN operation) are in Appendix A.
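To illustrate the direction of the transpiler that maps relational algebra back to SQL, here is a toy sketch. The node encodings (`'pi'` for projection, `'sigma'` for selection, `'gt'`, `'col'`, `'table'`, `'val'`) are our own invention and cover only a sliver of the grammar, far simpler than the authors' compiler.

```python
def to_sql(node):
    """Render a tiny relational-algebra tree as a SQL string.
    Nodes are tuples: ('col', name), ('table', name), ('val', v),
    ('gt', lhs, rhs), ('sigma', predicate, relation), ('pi', cols, relation)."""
    kind = node[0]
    if kind in ('col', 'table'):
        return node[1]
    if kind == 'val':
        return str(node[1])
    if kind == 'gt':                       # comparison predicate
        return f"{to_sql(node[1])} > {to_sql(node[2])}"
    if kind == 'sigma':                    # selection -> WHERE clause
        return f"{to_sql(node[2])} WHERE {to_sql(node[1])}"
    if kind == 'pi':                       # projection -> SELECT clause
        cols = ', '.join(to_sql(c) for c in node[1])
        return f"SELECT {cols} FROM {to_sql(node[2])}"
    raise ValueError(f"unknown node kind: {kind}")
```

For the example of Figure 2, the tree Π(name, σ(age > 60, actor)) renders as `SELECT name FROM actor WHERE age > 60`.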

Tree balancing

Conceptually, at each step SmBoP should generate new trees of height t+1 and keep the top-K trees computed so far. In practice, we do not want to compare trees from different decoding steps, since they are not directly comparable. Thus, we want the beam at step t to only contain trees that are of height exactly t (t-high trees).

To achieve this, we introduce a unary Keep operation that does not change the semantics of the sub-tree it is applied on. Hence, we can always grow the height of trees in the beam without changing the formal query. For training (which we elaborate on in §3.3), we balance all relational algebra trees in the training set using the Keep operation, such that the distance from the root to all leaves is equal. For example, in Figure 2, two Keep operations are used to balance the name column. After tree balancing, all DB constants are at height 0, and the goal of the parser at each step is to generate the gold set of trees of the corresponding height.

(a) Unbalanced tree (b) Balanced tree
Figure 2: An unbalanced and balanced relational algebra tree (with the unary Keep operation) for the utterance “What are the names of actors older than 60?”, where the corresponding SQL query is SELECT name FROM actor WHERE age > 60.
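The balancing procedure can be sketched in a few lines, assuming trees are `(label, children)` tuples (our representation, not the paper's): each child is padded with Keep nodes until all children reach the height of the tallest one.

```python
def balance(tree):
    """Pad a (label, children) tree with unary Keep nodes so every leaf sits
    at the same depth. Keep has no semantics, so the query is unchanged.
    Returns (balanced_tree, height)."""
    label, children = tree
    if not children:
        return (label, []), 0
    balanced, heights = [], []
    for child in children:
        b, h = balance(child)
        balanced.append(b)
        heights.append(h)
    target = max(heights)
    for i, (b, h) in enumerate(zip(balanced, heights)):
        for _ in range(target - h):       # pad shorter children with Keep
            b = ('Keep', [b])
        balanced[i] = b
    return (label, balanced), target + 1
```

For the tree of Figure 2, the leaf-level actor relation under the selection gets wrapped in Keep so that it reaches the height of the age > 60 predicate.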

3.2 Model Architecture

To fully specify Alg. 1, we need to define the following components: (a) scoring of trees on the frontier, (b) representation of trees, (c) the re-ranking function, and (d) representing and scoring DB constants (leaves). We now describe these components. Figure 3 illustrates the scoring and representation of a binary operation.

Scoring with contextualized beams

SmBoP maintains at each decoding step a beam of K trees, where each tree has a symbolic representation and a corresponding vector representation z. Unlike standard beam search, trees on our beams do not only compete with one another, but also compose with each other (similar to chart parsing). For example, in Fig. 1, the beam contains the column age and an anonymized value, which compose using a comparison operator to form a predicate tree.

We contextualize the beam representations and the utterance, allowing information to flow between them. Concretely, we concatenate the tree representations z to the utterance representation x and pass them through a Transformer layer to obtain the contextualized beam representations z̃.

Next, we compute scores for all (t+1)-high trees on the frontier. Trees can be generated by applying either a unary operation (including Keep) or a binary operation on beam trees. Let w_u be a scoring vector for a unary operation u, let W_b be a scoring matrix for a binary operation b (such as ∪, ∩, etc.), and let z̃_i, z̃_j be contextualized tree representations on the beam. We define a linear scoring function for frontier trees, where the score for a new tree generated by applying a unary rule u on a tree with representation z̃_i is w_u^T z̃_i, and similarly the score for a tree generated by applying a binary rule b is z̃_i^T W_b z̃_j. We use semantic types to detect invalid rule applications and fix their score to −∞.
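In matrix form, all frontier scores for one step can be computed with a couple of products; the following NumPy sketch uses our own names (`frontier_scores`, `unary_w`, `binary_W`) and is an illustration of the linear scoring scheme, not the authors' code.

```python
import numpy as np

def frontier_scores(Z_ctx, unary_w, binary_W):
    """Score all frontier trees from the contextualized beam Z_ctx of shape
    (K, d). unary_w maps a unary op to a (d,) scoring vector; binary_W maps
    a binary op to a (d, d) scoring matrix."""
    # unary rule u applied to beam tree i scores w_u . z~_i  -> (K,) per op
    uni = {op: Z_ctx @ w for op, w in unary_w.items()}
    # binary rule b applied to beam trees (i, j) scores z~_i W_b z~_j
    # -> a full (K, K) score matrix per op, all pairs at once
    bi = {op: Z_ctx @ W @ Z_ctx.T for op, W in binary_W.items()}
    return uni, bi
```

In the full model, invalid (type-mismatched) entries of these score tensors would be masked to −∞ before taking the top-K.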

Figure 3: Illustration of our tree scoring and tree representation mechanisms: the symbolic tree, its vector representation z, and its contextualized representation z̃.

Overall, the total number of trees on the frontier is K per unary operation and K² per binary operation. Because scores of different trees on the frontier are independent, they can be computed in parallel for efficiency. Note that we score new trees from the frontier before creating a representation for them, which we describe next.

Recursive tree representation

After scoring the frontier, we generate a recursive vector representation for the top-K trees. While scoring is done with a contextualized beam, representations are not contextualized: we found that a contextualized tree representation slows down training, as every tree becomes a function of all other beam trees (see §4).

We represent trees with a standard LSTM (Hochreiter and Schmidhuber, 1997) for unary operations and a TreeLSTM (Tai et al., 2015) for binary operations. Let z be the representation for a new tree, let e_u (e_b) be an embedding for a unary (binary) operation, and let z_i, z_j be non-contextualized tree representations from the beam we are extending. We compute a new representation as follows:

z = LSTM(e_u, z_i) for a unary operation u,
z = TreeLSTM(e_b, z_i, z_j) for a binary operation b,

where for the unary Keep operation, we simply copy the representation from the previous step.
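For reference, here is one binary TreeLSTM composition step in NumPy, following the N-ary formulation of Tai et al. (2015) with one forget gate per child; the parameter layout is an assumption of ours and not necessarily the authors' exact variant.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_treelstm_cell(x, hL, cL, hR, cR, P):
    """One binary TreeLSTM composition (sketch after Tai et al., 2015).
    x: operation embedding e_b used as the cell input; (hL, cL) / (hR, cR):
    left/right child hidden and cell states; P maps each gate name g to
    weight matrices W<g>, UL<g>, UR<g>, all of shape (d, d)."""
    def gate(g, act):
        return act(P['W' + g] @ x + P['UL' + g] @ hL + P['UR' + g] @ hR)
    i, o = gate('i', sigmoid), gate('o', sigmoid)     # input / output gates
    fL, fR = gate('fL', sigmoid), gate('fR', sigmoid) # one forget gate per child
    u = gate('u', np.tanh)                            # candidate cell update
    c = i * u + fL * cL + fR * cR                     # merge both children
    h = o * np.tanh(c)                                # new tree representation
    return h, c
```

The returned hidden state h plays the role of the new (non-contextualized) tree representation z.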


Re-ranking

As described below in §3.3, SmBoP is trained at search time to give high scores to trees that are sub-trees of the gold tree, which is different from giving a high score to the full gold query tree. Moreover, we decode for a fixed number of steps, but the gold program might be found earlier. Consequently, we add a separate fully-differentiable re-ranker that ranks all constructed trees, and is trained to output correct trees only.

The re-ranker takes as input the RAT-SQL-enriched utterance x, concatenated with the representations of all returnable trees from all decoding steps, where a returnable tree is one whose root has the output semantic type Relation (R). The re-ranker passes them through a re-ranking Transformer layer, which outputs contextualized representations. Then, each tree representation is independently scored with a feed-forward network, and the top-scoring tree is returned.

Beam initialization

We now describe the beam-initialization function, which populates the initial beam Z_0 with DB constants. Recall that each DB constant has a representation s that is already contextualized by the rest of the schema and the utterance. The function is simply a feed-forward network that scores each DB constant independently, and the top-K DB constants populate the initial beam.

3.3 Training

To specify our loss functions, we first need to define the supervision signal. Recall that given the gold SQL program, we convert it into a gold balanced relational algebra tree, as explained in §3.1 and Figure 2. This lets us define for every decoding step t the set G_t of t-high gold sub-trees. For example, G_0 includes all gold DB constants, G_1 includes all 1-high gold trees, etc.

During training, we apply “bottom-up teacher forcing” (Williams and Zipser, 1989): at step t we populate the beam with all trees from G_t, and then fill the rest of the beam (of size K) with the top-scoring non-gold predicted trees. This guarantees that we will be able to compute a loss at each decoding step, as described below.
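The beam-population rule above can be sketched directly; this is a schematic of bottom-up teacher forcing with invented helper names, not the authors' implementation.

```python
def teacher_forced_beam(gold_trees, predicted, scores, K):
    """Populate the training-time beam: all gold t-high sub-trees first,
    then the top-scoring non-gold predictions fill the remaining K slots."""
    beam = list(gold_trees)
    ranked = sorted((t for t in predicted if t not in gold_trees),
                    key=lambda t: scores[t], reverse=True)
    beam.extend(ranked[:K - len(beam)])     # fill up to beam size K
    return beam
```

Because every gold sub-tree is forced onto the beam, the frontier at the next step is guaranteed to contain every gold tree of the next height, so a loss term is always well-defined.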

SmBoP is trained end-to-end, but it is not fully differentiable due to the top-K operation in Alg. 1. Fortunately, we have gold trees that allow us to define a loss term at every decoding step, and we can train the parameters as usual with backpropagation.

Loss functions

During search, our goal is to give high scores to the (possibly multiple) sub-trees of the gold tree. Because of teacher forcing, the frontier at step t is guaranteed to contain all gold trees in G_{t+1}. We first apply a softmax over all frontier trees, obtaining a probability p(z) for each frontier tree z, and then maximize the probabilities of gold trees:

L_search = − (1 / |y|) Σ_t Σ_{z ∈ G_{t+1}} log p(z),

where the loss is normalized by |y|, the number of nodes in the gold tree, which corresponds to the total number of summed terms.

In the re-ranker, our goal is to maximize the probability of the gold tree y. However, multiple trees might be semantically equivalent to the gold tree when ignoring the Keep operation, which has no semantics. Let T be the set of all returnable trees found during search and T_y ⊆ T be the subset equivalent to y. We compute a probability p(z) for each tree z ∈ T with a softmax over re-ranker scores, and use maximum marginal likelihood (Guu et al., 2017):

L_rerank = − log Σ_{z ∈ T_y} p(z).
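The maximum marginal likelihood objective marginalizes over all trees equivalent to the gold one; a minimal NumPy sketch (our names, standard log-sum-of-softmax form):

```python
import numpy as np

def max_marginal_likelihood_loss(scores, gold_mask):
    """-log of the total probability mass on gold-equivalent trees.
    scores: (n,) re-ranker scores for all returnable trees;
    gold_mask: boolean (n,), True for trees semantically equal to gold."""
    scores = scores - scores.max()                 # numerically stable softmax
    p = np.exp(scores) / np.exp(scores).sum()
    return -np.log(p[gold_mask].sum())             # marginalize over gold set
```

With uniform scores and two of four trees marked gold, the marginal probability is 0.5 and the loss is log 2, illustrating that credit is shared across equivalent trees rather than forced onto one.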

3.4 Discussion

Figure 4: A histogram showing the distribution of size and height of trees in the development set of Spider.

To our knowledge, this work is the first to present a semi-autoregressive bottom-up semantic parser. We discuss potential benefits of our approach.


has theoretical runtime complexity that is logarithmic in the size of the tree instead of linear for autoregressive models. Figure 

4 shows the distribution over the height and size (number of nodes) of trees on Spider. Clearly, the height of most trees is around 7, while the size is 25-35, illustrating the efficiency potential of our approach. In practice, achieving inference speedup requires fully parallelizing all decoding operations. Our current implementation does not support that and so inference is not faster than autoregressive models.
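Under the simplifying assumption of a roughly balanced binary tree, the step counts compare as follows (an illustration of the complexity argument, not a benchmark):

```python
import math

def theoretical_steps(n_nodes):
    """Decoding steps as a function of tree size n: an autoregressive decoder
    emits one node per step (n steps); a semi-autoregressive bottom-up parser
    emits one tree level per step, so a balanced tree needs ~log2(n) steps."""
    return {'autoregressive': n_nodes,
            'semi_autoregressive': max(1, math.ceil(math.log2(n_nodes + 1)))}
```

For a Spider-sized tree of ~31 nodes this gives 31 autoregressive steps versus 5 level-parallel steps, matching the height-around-7, size-25-35 picture in Figure 4 up to tree imbalance.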

Unlike top-down autoregressive models, SmBoP naturally computes representations z for all sub-trees constructed at decoding time, which are well-defined semantic objects. These representations can be used in setups such as contextual semantic parsing, where a semantic parser answers a sequence of questions. For example, given the questions “How many students are living in the dorms?” and then “what are their last names?”, the pronoun “their” refers to a sub-tree from the SQL tree of the first question. Having a representation for such sub-trees can be useful when parsing the second question, in benchmarks such as SPARC Yu et al. (2019).

In this work, we score trees based on a contextualized representation, since trees do not only compete with one another, but also combine with each other. For example, a tree might get a higher score conditioned on another tree in the beam that it can compose with. Conceptually, beam contextualization can be done also in autoregressive parsing, but the incentive is lower, as different items on the beam represent independent hypotheses.

Last, re-ranking has been repeatedly shown to be useful in semantic parsing (Goldman et al., 2018; Yin and Neubig, 2019). However, in all prior work a re-ranker is a separate model, trained independently, that consumes the input examples from scratch. In SmBoP, we compute tree representations that can be ranked inside the model with a single Transformer-based re-ranking layer.

4 Experimental Evaluation

We conduct our experimental evaluation on Spider (Yu et al., 2018), a challenging large-scale dataset for text-to-SQL parsing. Spider has become a common benchmark for evaluating semantic parsers because it includes complex SQL queries and a realistic zero-shot setup, where schemas at test time are different from training time.

4.1 Experimental setup

We encode the input utterance and schema using the BART encoder (Lewis et al., 2020), a pre-trained language model that follows the encoder-decoder architecture, before passing it to the RAT-SQL encoder. We use BART, rather than the previously-used BERT (Devlin et al., 2019), since it is trained with inputs of length 1024, and the length of 8% of the examples in the Spider development set is more than 512. We use BART-Base for development and ablations, and BART-large for our hidden test set submission.

We use the semantic parsing package from AllenNLP (Gardner et al., 2018), the Adam optimizer, a fixed beam size K, and a constant number of decoding steps T at inference time. The beam Transformer has 3 heads and 3 layers. We train for 90K steps with a batch size of 16, and perform early stopping based on the development set.

We evaluate with the official Spider evaluation script, which computes exact match (EM), that is, whether the predicted SQL query is equivalent to the gold SQL query after some query normalizations. The official evaluation is over anonymized SQL queries, that is all DB values (that are not schema items) are anonymized to value, and thus we train and test our model with anonymized SQL queries (see more about using DB content below).


The most direct way to compare our method to prior work is the Spider leaderboard. However, this comparison is scientifically problematic for several reasons. First, some models use the content of the DB as input to the model (values), which requires a separate step for filtering the possibly huge DB. We avoid this step for simplicity, and are thus not directly comparable with such models. Second, different models vary widely in optimization and implementation details. For example, we use BART while RAT-SQL uses BERT, and the only difference between RAT-SQLv2 and v3 is longer and more careful optimization. Last, some leaderboard entries do not have any description of their method available. Consequently, we compare with the best entries that provide a description of their model: (i) RATSQL-v3, the current state of the art, and (ii) RATSQL-v2, where (i) and (ii) both use the DB content; (iii) RyanSQL-v2 (Choi et al., 2020), which is the best entry that does not use the DB content; and (iv) IRNet v2, which also does not use DB content.

For a more scientific comparison, we implement the RAT-SQL model over our own BART encoder. This allows us to directly compare SmBoP against a top-down autoregressive decoder, by having the two use an identical encoder and an identical training procedure, where we only swap the decoder. For the autoregressive decoder, we use the grammar-based decoder from Bogin et al. (2019), combined with beam search at inference time. This decoder is very similar to the RAT-SQL decoder, but does not use a pointer network, which is unnecessary when working with anonymized SQL. Our re-implementation of RAT-SQL does not use DB content as input. We term these models OurRATSQL-large and OurRATSQL-base, corresponding to BART-large and BART-base.


SmBoP uses a new decoding approach, and so it is important to examine the contribution of its components. We report results for the following ablations and oracle experiments:


  • No Re-Ranker: We remove the re-ranker and return the highest-scoring tree from the final decoding step.

  • No Beam Cntx: We remove the beam Transformer that computes the contextualized representations z̃, and use the representations z directly to score the frontier.

  • No Re-ranker Cntx: In the re-ranker, we remove the one-layer Transformer and apply the feed-forward scorer directly on tree representations from the different decoding steps.

  • Cntx Rep.: We use the contextualized representations z̃ not only for scoring, but also as input for creating the representations of new trees. This tests whether contextualized representations on the beam hurt or improve performance.

  • Z_0-Oracle: An oracle experiment where Z_0 is populated only with the gold set of DB constants, to provide an upper bound given perfect schema matching.

Model Test
RATSQLv3 (DB content used) 65.6%
RATSQLv2 (DB content used) 61.9%
RYANSQLv2 60.6%
IRNetv2 55.0%
SmBoP-large 60.5%
Table 2: Results on the Spider test set.
Model EM BEM Z_0 recall
OurRATSQL-large 66.0% n/a n/a
OurRATSQL-base 65.3% n/a n/a
SmBoP-large 66.0% 80.8% 98.1%
SmBoP-base 63.7% 76.0% 95.2%
- No Re-ranker 60.0% 78.9% 95.9%
- No Beam Cntx 62.5% 76.3% 95.7%
- No Re-ranker Cntx 62.0% 77.4% 95.1%
- Cntx Rep. 48.4% 61.7% 92.7%
SmBoP-base-Z_0-Oracle 72.5% 84.6% n/a
Table 3: Development set EM, beam EM (BEM) and recall on DB constants (Z_0 recall) for all models.

4.2 Results

Table 2 shows test results of our model compared to the top-4 (non-anonymous) leaderboard entries, at the time of writing. Our model is comparable to RyanSQLv2, which like us does not use the DB content, and substantially outperforms IRNetv2. The performance of RATSQLv3, which uses the DB content, is 5 points higher than SmBoP. As mentioned, apples-to-apples comparisons are difficult in a leaderboard setup, and we conclude that our model’s performance is competitive with current state-of-the-art.

Table 3 shows results and ablations on the development set. Comparing SmBoP to OurRATSQL, which have an identical encoder and only differ in the decoder, we observe a slight EM drop for the base model, and very similar performance for the large model. We view this as a promising result, given that SmBoP is the first semi-autoregressive parser with logarithmic runtime complexity.

Table 3 also presents our ablations. Removing the re-ranker and returning the top-scoring tree from the final decoding step reduces performance by almost 4 points. Removing the contextualizing Transformers leads to a moderate drop in performance of 1.2-1.7 points, but our analysis later shows that contextualization improves performance specifically on deep trees. Using contextualized representations as input for new tree representations leads to difficulties in optimization and reduces performance dramatically, to 48.4%.

We also present beam EM (BEM), which measures whether a correct tree was found anywhere during the decoding steps. We observe that BEM is 76%-79% for the base models, showing that a perfect re-ranker would improve performance by over 12 points, and that the rest of the errors are due to search.

Last, we evaluate the performance and importance of predicting all the correct DB constants at step 0 of our decoder. We evaluate Z_0 recall, that is, the fraction of examples for which all gold DB constants required to generate the gold tree are on the beam at decoding step 0. We observe that our models perform relatively well at this task (95%-98%), despite the fact that this step is fully non-autoregressive. However, an oracle model that always outputs the gold DB constants, and only the gold DB constants, leads to a significant improvement in performance (72.5% EM), with BEM at 84.6%, showing there is still headroom on Spider.

4.3 Analysis

Figure 5: Breakdown of EM across different gold tree heights. We observe contextualizing trees improves performance specifically for examples with deep trees.

Figure 5 breaks down EM performance based on the height of the gold tree. Similar to autoregressive models, performance drops as the height increases. Interestingly, the beam and re-ranking Transformers substantially improve performance for heights 7-8, showing the benefit of contextualizing trees on the more difficult examples. All models fail on the 15 examples with the tallest trees.

Figure 6: Recall across decoding steps.

We extend the notion of recall to all decoding steps, where recall at step t is whether all gold t-high sub-trees were generated at step t. Figure 6 shows recall across decoding steps. The drop in early steps and subsequent rise indicates that the model maintains in the beam, using the Keep operation, trees that are sub-trees of the gold tree, and expands them in later steps. This means that the parser can recover from errors in early decoding steps as long as the relevant trees are kept on the beam.

Figure 7: Distribution of the failure step and gold height for the incorrect examples.

SmBoP fails if at some decoding step, it does not add to the beam a sub-tree necessary for the final gold tree, and no other beam trees can be combined to create it. To measure the performance of our search procedure, the histogram in Figure 7 shows the distribution over steps where failure occurs, as well as the distribution over tree heights of errors. We observe that more than half the errors are in decoding steps 3-4, where the beam is already full and many gold trees still need to be produced in each step.

We also randomly sample 50 errors from SmBoP and categorize them into the following types:


  • Search errors (56%): we find that most search errors (60%) are due to either extra or missing join conditions. The rest involve operators that are relatively rare in Spider, such as LIKE.

  • Schema encoding errors (30%): Missing or extra DB constants in the predicted query.

  • Equivalent queries (14%): Predicted trees that are equivalent to the gold tree, but that the automatic evaluation script does not handle.

Last, we randomly sampled 50 examples where SmBoP-base and OurRATSQL-base disagree. We find that the models disagree on 16% of these examples, and inspecting the errors we could not discern a particular error category that one model makes but the other does not.

5 Related Work

Generative models whose runtime complexity is sub-linear have been explored recently outside semantic parsing. The Insertion Transformer Stern et al. (2019) is a partially autoregressive model for machine translation, where multiple target tokens are generated at each step. Unlike machine translation, in semantic parsing the target is a tree, which induces a particular way to generate sub-trees in parallel. More generally, there has been ample work in machine translation aiming to speed up inference through non-autoregressive or semi-autoregressive generation (Wang et al. (2018); Ghazvininejad et al. (2020); Saharia et al. (2020); among others).

In program synthesis, Odena et al. (2020) recently proposed a bottom-up parser, where search is improved by executing partial queries and conditioning on the resulting values, similar to some work in semantic parsing Berant et al. (2013); Liang et al. (2017). We do not explore this advantage of bottom-up parsing here, since executing queries at training time slows it down.

6 Conclusions

In this work we present the first semi-autoregressive bottom-up semantic parser, which enjoys logarithmic theoretical runtime, and show it is competitive with the commonly used autoregressive top-down parser. Our work shows that bottom-up parsing, where the model learns representations for semantically meaningful sub-trees, is a promising research direction that can contribute in the future to setups such as contextual semantic parsing, where sub-trees often repeat, and that can enjoy the benefits of execution at training time. We believe future work can also leverage work on learning tree representations (e.g., Shiv and Quirk (2019)) to further improve parser performance.


Acknowledgments

We thank Ben Bogin, Jonathan Herzig, Inbar Oren, Elad Segal and Ankit Gupta for their useful comments. This research was partially supported by The Yandex Initiative for Machine Learning, and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC DELPHI 802800).


  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §5.
  • B. Bogin, M. Gardner, and J. Berant (2019) Global reasoning over database structures for text-to-SQL parsing. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §4.1.
  • J. Cheng, S. Reddy, V. Saraswat, and M. Lapata (2019) Learning an executable neural semantic parser. Computational Linguistics 45 (1), pp. 59–94. External Links: Document, Link, Cited by: §1.
  • D. Choi, M. C. Shin, E. Kim, and D. R. Shin (2020) RYANSQL: recursively applying sketch-based slot fillings for complex text-to-SQL in cross-domain databases. External Links: 2004.03125 Cited by: §4.1.
  • J. Clarke, D. Goldwasser, M. Chang, and D. Roth Driving semantic parsing from the world’s response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL), External Links: Link Cited by: §1.
  • E. F. Codd (1970) A relational model of data for large shared data banks. Commun. ACM 13 (6), pp. 377–387. External Links: ISSN 0001-0782, Link, Document Cited by: §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §4.1.
  • L. Dong and M. Lapata (2016) Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany, pp. 33–43. External Links: Link, Document Cited by: §2.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, pp. 1–6. External Links: Link, Document Cited by: §4.1.
  • M. Ghazvininejad, O. Levy, and L. Zettlemoyer (2020) Semi-autoregressive training improves mask-predict decoding. ArXiv abs/2001.08785. Cited by: §5.
  • O. Goldman, V. Latcinnik, E. Nave, A. Globerson, and J. Berant (2018) Weakly supervised semantic parsing with abstract examples. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers), Melbourne, Australia, pp. 1809–1819. External Links: Link, Document Cited by: §1, §3.4.
  • K. Guu, P. Pasupat, E. Liu, and P. Liang (2017) From language to programs: bridging reinforcement learning and maximum marginal likelihood. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, pp. 1051–1062. External Links: Link, Document Cited by: §3.3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link, Document Cited by: §3.2.
  • J. Krishnamurthy, P. Dasigi, and M. Gardner (2017) Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, pp. 7871–7880. External Links: Link, Document Cited by: §1, §4.1.
  • C. Liang, J. Berant, Q. Le, and K. D. F. N. Lao (2017) Neural symbolic machines: learning semantic parsers on Freebase with weak supervision. In Association for Computational Linguistics (ACL), Cited by: §5.
  • P. Liang, M. Jordan, and D. Klein (2011) Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), Portland, Oregon, USA, pp. 590–599. External Links: Link Cited by: §1.
  • A. Odena, K. Shi, D. Bieber, R. Singh, and C. Sutton (2020) BUSTLE: bottom-up program-synthesis through learning-guided exploration. External Links: 2007.14381 Cited by: §1.
  • M. Rabinovich, M. Stern, and D. Klein (2017) Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, pp. 1139–1149. External Links: Link, Document Cited by: §1, §2.
  • C. Saharia, W. Chan, S. Saxena, and M. Norouzi (2020) Non-autoregressive machine translation with latent alignments. ArXiv abs/2004.07437. Cited by: §5.
  • P. Shaw, J. Uszkoreit, and A. Vaswani (2018) Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 2 (Short Papers), New Orleans, Louisiana, pp. 464–468. External Links: Link, Document Cited by: §2.
  • V. L. Shiv and C. Quirk (2019) Novel positional encodings to enable tree-structured transformers. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.
  • M. Stern, W. Chan, J. Kiros, and J. Uszkoreit (2019) Insertion transformer: flexible sequence generation via insertion operations. In Proceedings of The International Conference on Machine Learning (ICML), Cited by: §5.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL), Beijing, China, pp. 1556–1566. External Links: Link, Document Cited by: §3.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), External Links: Link Cited by: §1, §2.
  • B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richardson (2020) RAT-SQL: relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §2.
  • C. Wang, J. Zhang, and H. Chen (2018) Semi-autoregressive neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §5.
  • R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (2), pp. 270–280. External Links: Document, Link Cited by: §3.3.
  • P. Yin and G. Neubig (2017) A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, pp. 440–450. External Links: Link, Document Cited by: §1, §2.
  • P. Yin and G. Neubig (2019) Reranking for neural semantic parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), External Links: Link, Document Cited by: §1, §3.4.
  • T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. Radev (2018) Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, pp. 3911–3921. External Links: Link, Document Cited by: §1, §4.
  • T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, and D. Radev (2019) SParC: cross-domain semantic parsing in context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy. Cited by: §3.4.
  • J. M. Zelle and R. J. Mooney (1996) Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI), Cited by: §1.
  • L. S. Zettlemoyer and M. Collins (2005) Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §1.

Appendix A Appendix: Examples for Relational Algebra Trees

We show multiple examples of relational algebra trees along with the corresponding SQL query, to clarify the mapping between the two.

Figure 8: Unbalanced and balanced relational algebra trees for the utterance “How many flights arriving in Aberdeen city?”, where the corresponding SQL query is SELECT COUNT( * ) FROM flights JOIN airports ON flights.destairport = airports.airportcode WHERE = ’value’.
Figure 9: Unbalanced and balanced relational algebra trees for the utterance “When is the first transcript released? List the date and details.”, where the corresponding SQL query is SELECT transcripts.transcript_date , transcripts.other_details FROM transcripts ORDER BY transcripts.transcript_date ASC LIMIT ’value’.
Figure 10: Unbalanced and balanced relational algebra trees for the utterance “How many dog pets are raised by female students?”, where the corresponding SQL query is SELECT COUNT( * ) FROM student JOIN has_pet ON student.stuid = has_pet.stuid JOIN pets ON has_pet.petid = pets.petid WHERE = ’value’ AND pets.pettype = ’value’.
Figure 11: Unbalanced and balanced relational algebra trees for the utterance “Find the number of distinct name of losers.”, where the corresponding SQL query is SELECT COUNT( DISTINCT matches.loser_name ) FROM matches.
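The mapping illustrated by the last example can be sketched as a tiny tree-to-SQL serializer. This is a hypothetical, minimal sketch: the `Node` class, the operation names (`CountDistinct`, `Column`, `Table`), and the leaf-as-string convention are our own simplifications, not the paper's grammar, which contains many more relational algebra operations:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A node in a relational algebra tree: an operation label and its
    children. Leaves abuse `op` to carry a DB constant (table or column)."""
    op: str
    children: List["Node"] = field(default_factory=list)

def to_sql(node: Node) -> str:
    """Serialize a tiny fragment of relational algebra to SQL; only the
    operations needed for the example in Figure 11 are implemented."""
    if node.op == "Table":
        return node.children[0].op  # leaf child holds the table name
    if node.op == "Column":
        return node.children[0].op  # leaf child holds the column name
    if node.op == "CountDistinct":
        col, rel = node.children
        return f"SELECT COUNT( DISTINCT {to_sql(col)} ) FROM {to_sql(rel)}"
    raise ValueError(f"unsupported op: {node.op}")

tree = Node("CountDistinct", [
    Node("Column", [Node("matches.loser_name")]),
    Node("Table", [Node("matches")]),
])
print(to_sql(tree))
# -> SELECT COUNT( DISTINCT matches.loser_name ) FROM matches
```

A full serializer would additionally handle joins, selections (WHERE), ordering, and limits, as in Figures 8-10.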