Question Answering on tabular data (TableQA) is the task of generating answers to natural language questions given tables as extra knowledge [20, 29, 6, inter alia]. It has drawn increasing attention in research. Compared to the standard question answering task, TableQA better reflects the real-world situation: users interact with a QA system through natural language questions, which requires the system to understand the question and reason over external knowledge in order to provide the correct answer. State-of-the-art (SOTA) TableQA systems are reported to have exceeded human performance [11, 12]. Despite their impressive results, an important question remains unanswered: can these systems really understand natural language questions and reason with the given tables, or do they merely capture statistical patterns in the datasets that generalize poorly?
We leverage adversarial examples to answer this question, as they have proven useful for evaluating and improving machine learning systems on various tasks [14, 17]. Adversarial examples were first introduced to computer vision systems by injecting small adversarial perturbations into the input to maximize the chance of misclassification. This works because small perturbations on continuous pixel values do not affect overall human perception. Most current work on producing semantic-preserving natural language adversarial examples [9, 22, 28] is constrained to local attack models (§2.2), under the assumption that small changes are less likely to cause a large semantic shift from the original sentence. However, our experiments demonstrate that these changes impair language fluency or lead to noticeable differences in meaning, which makes the generated adversarial examples invalid.
In this paper, we propose SAGE (Semantically valid Adversarial GEnerator), which generates semantically valid and fluent adversarial questions for TableQA systems.
Main contributions of this paper include:
We propose SAGE, a novel white-box attack model to generate adversarial questions at the sentence level for TableQA. To the best of our knowledge, this work is the first attempt to bridge the gap between white-box adversarial attack and text generation model for TableQA.
SAGE is based on our proposed stochastic Wasserstein Sequence-to-sequence (Seq2seq) model (Wseq). To improve the attack success rate, it incorporates the adversarial loss directly from the target system with Gumbel-Softmax. To tackle the problem of semantic shift, we use delexicalization to tag the entities in the original instances and employ the semantic similarity score SimiLe to guide the model to generate semantically valid questions.
Through our experiments, we demonstrate that our approach can generate more fluent and semantically valid adversarial questions than baselines using local methods (§2.2) while keeping high success rate (§4.1, §4.2). Moreover, SAGE-generated examples can further improve the test performance and robustness of the target QA systems with adversarial training (§4.3).
2.1 Data and target system
Because of its well-defined syntax and wide usage, using a SQL query as the answer form for TableQA has advantages over other forms, such as a text string or an operation sequence in table look-up. In this work, we use the WikiSQL dataset, which is one of the largest TableQA datasets, consisting of Wikipedia tables and pairs of human-edited questions and SQL queries (see Table 1 for an example).
TableQA systems on WikiSQL are trained to generate the SQL query and the final answer to a natural language question on a single table using ONLY the table schema (table header), without seeing the table content. The final answer is obtained deterministically by executing the generated SQL query on the corresponding relational database. SOTA WikiSQL systems [11, 12] use pretrained representation models such as BERT as the encoder, and constrain the output space of the SQL query by casting text-to-SQL generation into multiple classification tasks that predict the slots of SQL keywords and values for the SELECT and WHERE columns. They have achieved superhuman test performance on WikiSQL in both query accuracy (Q-Acc, the fraction of test examples whose predicted SQL query is correct) and answer accuracy (A-Acc, the fraction whose executed answer is correct).
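To make the difference between query accuracy and answer accuracy concrete, the following minimal sketch executes a gold and a predicted SQL query on a toy table with Python's built-in `sqlite3` module. The table name `t`, its contents, the `FROM` clause and the `>` condition are hypothetical illustrations rather than WikiSQL data, and exact string comparison is only a simplified stand-in for comparing the predicted slots.

```python
import sqlite3

# Hypothetical toy table (not taken from WikiSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Year INTEGER, Wins INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1998, 3), (2000, 1), (2001, 0), (2002, 0)],
)


def execute(sql):
    """Run a query and return its single scalar answer."""
    return conn.execute(sql).fetchone()[0]


gold_sql = "SELECT SUM(Wins) FROM t WHERE Year > 1999"
pred_sql = "SELECT MAX(Wins) FROM t WHERE Year > 1999"

# Query-level correctness: simplified here to exact string match.
query_correct = pred_sql == gold_sql
# Answer-level correctness: compare the executed results.
answer_correct = execute(pred_sql) == execute(gold_sql)

print(query_correct, answer_correct)
```

In this toy case the predicted query uses the wrong aggregation but happens to return the same value, which illustrates why answer accuracy can exceed query accuracy.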
2.2 Problem definition and local attack models
Given the table schema $t$ and the original question $y$, a TableQA system is trained to predict the correct slots of the SQL query:

$$\hat{\ell} = \arg\max_{\ell \in \mathcal{L}} \; p(\ell \mid y, t)$$

where $\hat{\ell}$ is the combined set of predicted labels and $\mathcal{L}$ is the set of all possible label combinations.
Ideally, to attack this system, we want to generate an adversarial question $y'$ that is semantically valid with respect to the original question $y$, but causes the system to output a wrong answer:

$$\arg\max_{\ell \in \mathcal{L}} \; p(\ell \mid y', t) \neq \ell^{*}$$

where $t$ is the table schema, $\mathcal{L}$ is the set of all possible label combinations and $\ell^{*}$ is the correct label set for $y$.
We define semantic validity in the context of WikiSQL as whether the generated question and the original one can be expressed by the same SQL query. Maintaining semantic validity is difficult, so local attack models such as token manipulation are widely used to limit semantic shift.
We apply three white-box local attack models [9, 17] as our local attack baselines. All of them can be formulated as searching for the best token embedding under the first-order approximation of the adversarial loss $\mathcal{L}_{adv}$ around the input token embeddings:

$$\hat{e} = \arg\min_{e' \in \mathcal{E}} \; (e' - e_i)^{\top} \nabla_{e_i} \mathcal{L}_{adv}$$

where $e_i$ is the embedding of the $i$-th input token and $\mathcal{E}$ is the candidate search space. This only requires one forward and one backward pass to compute the input token gradients. Specifically, the three local attack models are: Unconstrained, which searches within the whole embedding space; kNN, which constrains the search space to the 10 nearest neighbors of the original token embedding; and CharSwap, which swaps or adds a character in the original token and changes it to <unk>.
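The first-order search used by the Unconstrained and kNN baselines can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function name `first_order_swap`, the toy embedding table and the convention that the adversarial loss is minimized are all assumptions, and the sketch handles a single token position only (CharSwap is not covered).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def first_order_swap(embeddings, token_id, grad, k=None):
    """Pick the replacement token with the lowest first-order estimate
    (e' - e)^T grad of the change in adversarial loss.

    embeddings: one vector per vocabulary token.
    grad: gradient of the adversarial loss w.r.t. the current token embedding.
    k: if given, only the k nearest neighbours are candidates (kNN variant).
    """
    e = embeddings[token_id]
    candidates = [i for i in range(len(embeddings)) if i != token_id]
    if k is not None:
        candidates.sort(key=lambda i: sq_dist(embeddings[i], e))
        candidates = candidates[:k]

    def delta(i):
        diff = [a - b for a, b in zip(embeddings[i], e)]
        return dot(diff, grad)

    return min(candidates, key=delta)


# Toy 2-dimensional embedding table (hypothetical values).
E = [[0.0, 0.0], [1.0, 0.0], [0.1, 0.1], [-2.0, 0.0]]
g = [1.0, 0.0]

unconstrained = first_order_swap(E, token_id=0, grad=g)
knn = first_order_swap(E, token_id=0, grad=g, k=2)
print(unconstrained, knn)
```

The unconstrained search is free to jump to a distant embedding, whereas the kNN variant stays close to the original token, which is the trade-off between flip rate and semantic shift discussed in this section.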
3 Semantically valid adversarial generator
In this section, we describe SAGE. Figure 1 illustrates its main architecture with the losses of three main components. SAGE takes the meaning representation of the question, i.e. SQL query, as input and aims to generate semantically valid and fluent adversarial questions that can fool TableQA systems without changing the gold-standard SQL query.
We discuss the three main components of SAGE in detail: 1) Stochastic Wasserstein Seq2seq model (Wseq) used for question generation (§3.1); 2) Delexicalization and minimum risk training with SimiLe to enhance semantic validity (§3.2); 3) End-to-end training with adversarial loss from the target system using Gumbel-Softmax (§3.3).
3.1 Stochastic Wasserstein Seq2seq model (Wseq)
The Seq2seq model encodes the input SQL query with a one-layer bidirectional gated recurrent unit (GRU), and the latent representation of the whole SQL sequence, i.e. the last hidden state, is used to initialize the hidden state of the decoder, which is another one-layer GRU. At each decoding step, we apply general global attention and a copy mechanism, then output the predicted token with a Softmax distribution.
The Seq2seq model encodes a SQL query into a deterministic latent representation $z$, which can lead to poor generalization during inference. When an unseen SQL query is encoded to a new $z'$, the deterministic Seq2seq model can output nonsensical or unnatural questions from this unseen $z'$, even if it is very close to a training instance $z$, which negatively impacts the fluency of the generated questions. Recent advances in deep generative models based on variational inference have shown great success in learning smooth latent representations by modeling $z$ as a distribution in order to generate more meaningful text [4, 1]. To improve the fluency of the generated questions, we thus propose the new Wseq model based on the Wasserstein autoencoder, which has been shown to achieve more stable training and better performance than other variational models. Specifically, Wseq models $z$ as a Gaussian distribution conditioned on the input $x$:

$$z \sim q(z \mid x) = \mathcal{N}\big(\mu(x), \operatorname{diag}(\sigma^{2}(x))\big)$$
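The training objective is to minimize the expected reconstruction loss regularized by the Wasserstein distance $\mathcal{W}$ between the aggregated posterior $q(z)$ and a normal prior $p(z)$:

$$J = \mathbb{E}_{(x, y) \sim \mathcal{D}} \, \mathbb{E}_{q(z \mid x)} \big[ -\log p(y \mid z) \big] + \lambda \, \mathcal{W}\big(q(z), p(z)\big)$$

by assuming $y = f(x)$, i.e. the question $y$ is a function of the SQL query $x$ in the variational encoder-decoder. We use the maximum mean discrepancy (MMD) with the inverse multiquadratic kernel as $\mathcal{W}$. During training, we sample $z$ from $q(z \mid x)$ and approximate MMD with the samples in each mini-batch, so the loss function for each batch can be written as

$$\mathcal{L}_{wseq} = -\frac{1}{N} \sum_{n=1}^{N} \log p\big(y^{(n)} \mid z^{(n)}\big) + \lambda \, \widehat{\mathrm{MMD}}\big(\{z^{(n)}\}_{n=1}^{N}, \{\tilde{z}^{(n)}\}_{n=1}^{N}\big)$$

where $N$ is the size of the batch, $z^{(n)}$ is sampled from the posterior $q(z \mid x^{(n)})$, $\tilde{z}^{(n)}$ is sampled from the prior $p(z)$, and $\lambda$ is a hyperparameter controlling the degree of regularization.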
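The mini-batch regularizer described above can be illustrated with a small, dependency-free sketch. It assumes the commonly used unbiased MMD estimator and an inverse multiquadratic kernel of the form $k(a, b) = c / (c + \lVert a - b \rVert^2)$; the scale `c`, the function names and the toy inputs are assumptions for illustration, not values from the experiments.

```python
import random


def imq_kernel(a, b, c=2.0):
    """Inverse multiquadratic kernel k(a, b) = c / (c + ||a - b||^2)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return c / (c + sq)


def mmd(zs, zs_prior, c=2.0):
    """Unbiased mini-batch MMD estimate between two equally sized sample sets."""
    n = len(zs)
    same = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                same += imq_kernel(zs[i], zs[j], c)
                same += imq_kernel(zs_prior[i], zs_prior[j], c)
    cross = sum(imq_kernel(a, b, c) for a in zs for b in zs_prior)
    return same / (n * (n - 1)) - 2.0 * cross / (n * n)


def sample_posterior(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def wseq_batch_loss(recon_nll, zs, zs_prior, lam):
    """Mean reconstruction loss plus the lam-weighted MMD regularizer."""
    return sum(recon_nll) / len(recon_nll) + lam * mmd(zs, zs_prior)


# Toy usage: posterior samples for a batch of 4 and matching prior samples.
rng = random.Random(0)
posterior = [sample_posterior([0.5, -0.5], [1.0, 1.0], rng) for _ in range(4)]
prior = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(4)]
loss = wseq_batch_loss([2.1, 1.7, 2.4, 1.9], posterior, prior, lam=1.0)
print(loss)
```

Because the estimator is unbiased, it can take small negative values on finite batches; it grows as the posterior samples drift away from the prior samples.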
3.2 Enhancing the semantic validity (Wseq-S)
To enhance the semantic validity of the generated questions, we introduce entity delexicalization to improve the sentence level entity coverage rate and employ SimiLe in minimum risk training to improve the semantic similarity between the generated and the original questions.
If a generated question does not contain all the entities in the WHERE columns of the SQL query, it cannot be semantically valid, because all current TableQA systems rely on entity information to locate the correct cells in the table to perform reasoning. To address this problem, we delexicalize the entities appearing in the WHERE columns of the SQL query and its corresponding question in WikiSQL. We replace entities with et_i to denote the $i$-th entity in the query/question. Delexicalization dramatically shortens the entity spans our model needs to predict and improves the entity coverage at the sentence level.
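Entity delexicalization as described above can be sketched as a simple string substitution. This is a simplified illustration: it assumes exact, case-sensitive matching of entity strings and 1-based placeholder indices, and the helper names and the `=` operator in the example query are hypothetical.

```python
def delexicalize(question, sql, where_values):
    """Replace each WHERE-clause value with a placeholder et_i in both
    the question and the SQL query, and return the mapping needed to
    restore the entities afterwards."""
    mapping = {}
    for i, value in enumerate(where_values, start=1):
        tag = f"et_{i}"
        question = question.replace(value, tag)
        sql = sql.replace(value, tag)
        mapping[tag] = value
    return question, sql, mapping


def relexicalize(text, mapping):
    """Put the original entities back; longer tags first so that
    et_10 is not partially replaced by et_1."""
    for tag in sorted(mapping, key=len, reverse=True):
        text = text.replace(tag, mapping[tag])
    return text


def covers_all_entities(generated, mapping):
    """Sentence-level entity coverage check on whitespace tokens."""
    tokens = generated.split()
    return all(tag in tokens for tag in mapping)


question = "What was the date when the opponent was at South Carolina ?"
sql = "SELECT Date WHERE Opponent = at South Carolina"
q_delex, sql_delex, mapping = delexicalize(question, sql, ["at South Carolina"])
print(q_delex)
print(sql_delex)
```

A generated question that drops a placeholder can then be detected with `covers_all_entities` before the entities are restored with `relexicalize`.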
Minimum risk training with SimiLe
Preserving the original semantics is the main challenge in natural language adversarial attacks. While including human judgement in the loop is beneficial, it is expensive and time consuming. Instead, we opt for SimiLe, an automatic semantic similarity score between two sentences that correlates well with human judgement, to guide our model training. SimiLe calculates the cosine similarity between the embeddings of two sentences, using an embedding model trained on a large amount of paraphrase text. We choose SimiLe over string-matching-based metrics such as BLEU because our generated question can differ substantially in lexical or syntactic realization from the original question while keeping high semantic similarity. In order to incorporate SimiLe into our model, we follow [26] and apply minimum risk training [24] on a set of generated questions, i.e. a hypothesis set $\mathcal{H}(x)$, to approximate the whole generated question space given the SQL query $x$:

$$\mathcal{L}_{sim} = \sum_{y' \in \mathcal{H}(x)} \big(1 - \mathrm{SimiLe}(y, y')\big) \, \frac{p(y' \mid x)}{\sum_{y'' \in \mathcal{H}(x)} p(y'' \mid x)}$$

where $y$ is the original question.
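The minimum risk objective over a hypothesis set can be sketched as an expected cost under the model distribution renormalized over the sampled hypotheses. The sketch below assumes the cost of a hypothesis is one minus its similarity to the original question and takes the similarity scores as given inputs; it does not compute SimiLe itself, and the function name is hypothetical.

```python
import math


def expected_risk(log_probs, similarities):
    """Expected cost over a hypothesis set.

    log_probs: model log-probability of each generated hypothesis.
    similarities: similarity in [0, 1] between each hypothesis and the
        original question (assumed to be precomputed).
    """
    # Renormalize the model probabilities over the hypothesis set,
    # subtracting the maximum for numerical stability.
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]
    total = sum(weights)
    return sum((w / total) * (1.0 - s) for w, s in zip(weights, similarities))


# Two equally likely hypotheses with similarities 0.8 and 0.6.
risk = expected_risk([-1.0, -1.0], [0.8, 0.6])
print(risk)
```

Minimizing this quantity shifts probability mass toward hypotheses that are more similar to the original question.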
| Model | BLEU | METEOR | SimiLe | Ecr (%) | Qfr (%) | Afr (%) | Perplexity |
|---|---|---|---|---|---|---|---|
| Seq2seq w/o delex | 32.69 | 35.77 | 80.09 | 68.97 | 12.62 | 11.25 | 515 |

Table 2: Automatic evaluation metrics on the WikiSQL test set for generated adversarial questions. Delex represents entity delexicalization. Ecr, Qfr and Afr represent sentence-level entity coverage rate, query flip rate and answer flip rate. The best scores for Seq2seq-based models are in bold.
3.3 End-to-end training with adversarial loss
In order to apply a white-box attack, we employ end-to-end training by sending the generated questions to the target system and back-propagating the adversarial loss through our model. The adversarial loss from the TableQA system, shown in Equation 1, maximizes the probability of the target system making incorrect predictions. However, it is not possible to directly back-propagate the adversarial loss to our attack model through the discrete question tokens, which are generated by applying argmax to the Softmax output. To overcome this issue, we adopt Gumbel-Softmax to replace Softmax:

$$\hat{p}_i = \frac{\exp\big((\log p_i + g_i)/\tau\big)}{\sum_{j=1}^{|V|} \exp\big((\log p_j + g_j)/\tau\big)}$$

where $p_i$ is the probability after Softmax for token $i$ in the output vocabulary with size $|V|$, $g_i$ is a sample from the Gumbel(0, 1) distribution, and $\tau$ controls the smoothness of the distribution. We still use argmax to discretize the output at each time step during generation, but approximate the backward gradients with the Straight-Through (ST) Gumbel estimator to enable end-to-end training.
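The forward pass of the Gumbel-Softmax relaxation can be illustrated without any deep learning framework. The sketch below assumes strictly positive input probabilities and a hypothetical temperature value; it returns both the relaxed distribution and the argmax index used for discretization, and it does not implement the backward pass.

```python
import math
import random


def sample_gumbel(rng):
    """Draw g ~ Gumbel(0, 1) via g = -log(-log(u)), u ~ Uniform(0, 1)."""
    # Clamp u away from 0 and 1 so both logarithms are finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(probs, tau, rng):
    """Return the relaxed sample and the token index chosen by argmax."""
    logits = [(math.log(p) + sample_gumbel(rng)) / tau for p in probs]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    relaxed = [e / total for e in exps]
    hard = max(range(len(relaxed)), key=relaxed.__getitem__)
    return relaxed, hard


rng = random.Random(0)
relaxed, hard = gumbel_softmax([0.7, 0.2, 0.1], tau=0.5, rng=rng)
print(relaxed, hard)
```

In an automatic differentiation framework, the straight-through variant is commonly obtained by emitting the one-hot vector of `hard` in the forward pass while letting gradients flow through `relaxed`, for example as `one_hot - stop_gradient(relaxed) + relaxed`.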
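Finally, for each batch, SAGE combines the losses of the previous three components:

$$\mathcal{L} = \mathcal{L}_{wseq} + \alpha \, \mathcal{L}_{sim} + \beta \, \mathcal{L}_{adv}$$

where $\mathcal{L}_{wseq}$, $\mathcal{L}_{sim}$ and $\mathcal{L}_{adv}$ are the Wseq, SimiLe and adversarial losses, and $\alpha$ and $\beta$ are hyperparameters. It takes text fluency, semantic validity, and the adversarial attack into consideration.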
In this section, we use the publicly released SQLova with BERT large encoder as our target system (we use the released version without execution-guided decoding at https://github.com/naver/sqlova), because the techniques used in SQLova are representative of the SOTA TableQA systems on WikiSQL and it is also one of the best-performing systems. However, it should be noted that our method can be applied to any other differentiable target system as well.
4.1 Automatic evaluation
We evaluate the generated questions in three aspects: semantic validity, flip rate, and fluency.
As discussed in §3.2, the generated adversarial questions can only be semantically valid if they contain all required entities and preserve the original meaning of the questions. We use sentence-level entity coverage rate and semantic similarity to evaluate these two criteria. Sentence-level entity coverage rate (Ecr) is defined as the ratio between the number of generated questions with all required entities ($v$) and the total number of generated adversarial questions ($m$): $\mathrm{Ecr} = v / m$. To measure semantic similarity, we use BLEU, METEOR and SimiLe. BLEU is based on exact n-gram matching, so it does not give any credit to semantically similar sentences that differ in lexical realization. METEOR computes the unigram matching F-score using stemming, synonymy and paraphrasing information, allowing for certain lexical variations. SimiLe is the only embedding-based similarity metric free from string matching.
WikiSQL systems are evaluated with both query accuracy and answer accuracy (§2.1). Correspondingly, we use query flip rate (Qfr) and answer flip rate (Afr) to measure the attack success of our generated questions. Out of $v$ valid adversarial questions, $a$ questions cause a SQL query error and $b$ cause an answer error, so Qfr and Afr can be calculated as:

$$\mathrm{Qfr} = \frac{a}{v}, \qquad \mathrm{Afr} = \frac{b}{v}$$
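The coverage and flip-rate metrics can be computed from per-question outcomes as in the sketch below. It assumes that the flip rates are taken over the questions that contain all required entities, which is one reading of the definitions above; the field names and the toy outcomes are hypothetical.

```python
def attack_metrics(results):
    """Compute Ecr, Qfr and Afr as fractions.

    results: list of dicts with boolean fields
        has_all_entities, query_flipped, answer_flipped.
    """
    m = len(results)
    valid = [r for r in results if r["has_all_entities"]]
    v = len(valid)
    a = sum(r["query_flipped"] for r in valid)
    b = sum(r["answer_flipped"] for r in valid)
    return {
        "Ecr": v / m if m else 0.0,
        "Qfr": a / v if v else 0.0,
        "Afr": b / v if v else 0.0,
    }


# Toy outcomes for four generated questions.
results = [
    {"has_all_entities": True, "query_flipped": True, "answer_flipped": True},
    {"has_all_entities": True, "query_flipped": True, "answer_flipped": False},
    {"has_all_entities": True, "query_flipped": False, "answer_flipped": False},
    {"has_all_entities": False, "query_flipped": True, "answer_flipped": True},
]
metrics = attack_metrics(results)
print(metrics)
```

The last toy question flips the target system but misses an entity, so it counts against Ecr and is excluded from both flip rates.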
We want the generated questions to be fluent and natural. Following [7], we use GPT-2 [21] based perplexity as an automatic evaluation metric for fluency. Fluency is different from semantic validity: a generated question that differs in meaning from its original question (i.e. is not semantically valid) can still be completely fluent to humans.
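As a reminder of how the fluency metric is computed, the sketch below derives perplexity from per-token log-probabilities. It assumes the natural-log probabilities have already been obtained from a language model such as GPT-2; the values used here are hypothetical and no language model is run.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A sequence whose four tokens each receive probability 0.25.
ppl = perplexity([math.log(0.25)] * 4)
print(ppl)
```

Lower perplexity means the language model finds the question more predictable, which is used here as a proxy for fluency.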
We compare SAGE with two groups of baselines. The first group comprises the three local attack models discussed in §2.2. For a fair comparison with SAGE, which employs entity delexicalization, we mask entity tokens during local attacks so that all the entities in the original question are preserved. (Local attack models without entity masking have comparable scores in BLEU, METEOR, SimiLe and perplexity, but much worse Ecr, Qfr and Afr, e.g. 24.47 Qfr for Unconstrained compared to 49.46 in Table 2.) The second group includes Seq2seq-based models, i.e., the deterministic Seq2seq models with and without entity delexicalization, as well as ablated SAGE models.
Table 2 shows the results of automatic evaluation. Since local models only change a single token, they easily achieve much higher scores than Seq2seq-based models in BLEU and METEOR, which are based on string matching. However, this gap between local and Seq2seq-based models narrows significantly in SimiLe. This suggests that although questions generated by Seq2seq-based models differ in textual realization from the original questions, their meaning is largely preserved. Since SimiLe is still based on the bag-of-words assumption, which ignores syntactic structure, questions from local models inevitably score higher than those generated from scratch by Seq2seq-based models, which contain more significant lexical and syntactic variations. We argue that these variations are beneficial because they mimic the diversity of human language.
The flip rates should be read together with perplexity, as our goal is to generate fluent adversarial questions. As demonstrated in the table, local models tend to achieve high flip rates by generating non-fluent or nonsensical questions that rarely appear in natural language. This is further confirmed in human evaluation (§4.2).
Among the Seq2seq-based models, Wseq-S performs the best in terms of semantic validity whereas Seq2seq (w/o delex) ranks the worst. This demonstrates the effectiveness of entity delexicalization and minimum risk training with SimiLe in enhancing semantic validity.
SAGE achieves the highest flip rates, outperforming all of the Seq2seq baselines, with a 9.85 absolute Qfr increase over Wseq-S. This indicates that SAGE is better at attacking the target system than Wseq and Wseq-S. However, this comes at the cost of semantic validity, which suggests that the objectives of adversarial loss and semantic validity are not perfectly aligned.
Wseq is our most fluent model that keeps a good balance between semantic validity and fluency, achieving the best perplexity while maintaining comparable similarity scores to other Seq2seq-based models.
| Model | Validity (%) | Fluency (rank) |
|---|---|---|
| Seq2seq w/o delex | 78.7 | 2.99 |
: Significant compared to kNN ().
: Significant compared to kNN () and Seq2seq w/o delex ().
4.2 Human evaluation
Due to the limitations of automatic metrics in evaluating semantic validity and fluency, we introduce human evaluation to substantiate our findings. We sample 100 questions from the WikiSQL test set and recruit three native expert annotators to annotate the adversarial examples generated by each model on semantic validity and fluency. Here we focus on the Seq2seq-based models as well as the best-performing local attack models from Table 2.
It is extremely difficult and error-prone to require annotators to write the corresponding SQL queries for the generated and original questions in order to measure semantic validity. Instead, we ask the annotators to make a binary decision on whether the generated and original questions have the same query process given the table, i.e. whether they use the same columns and rows of the table for the same answer.
According to human evaluation (as shown in Table 3), our models substantially outperform the local models in terms of semantic validity. For example, 90.3% of the questions generated by Wseq-S are semantically valid, confirming the effectiveness of entity delexicalization and SimiLe. Moreover, all Seq2seq-based models achieve higher semantic validity than the local models, which demonstrates their capability to generate more semantically equivalent questions. Although local models only introduce minor word-level changes, most of these changes actually alter the meaning of the original questions considerably and make the generated adversarial questions semantically invalid. This also suggests that existing automatic metrics are not sufficient to evaluate semantic validity. On the other hand, SAGE scores lower than Wseq-S and Wseq in terms of semantic validity. However, given its high flip rates, SAGE still generates many more semantically valid adversarial questions than any other Seq2seq-based model.
| | Question | SQL | Validity / Fluency |
|---|---|---|---|
| Semantic Validity | What is the sum of wins after 1999 ? (Original) | SELECT SUM(Wins) WHERE Year 1999 | - |
| | What is the sum of wins downs 1999 ? (Unconstrained) | ✓ | N |
| | What is the sum of wins after 1999 is (kNN) | SELECT Wins WHERE Year 1999 | N |
| | How many wins in the years after 1999 ? (Seq2seq) | ✓ | Y |
| | What is the total wins for the year after 1999 ? (Wseq) | ✓ | Y |
| | What is the sum of wins in the year later than 1999 ? (Wseq-S) | SELECT COUNT(Wins) WHERE YEAR 1999 | Y |
| | How many wins have a year later than 1999 ? (SAGE) | SELECT COUNT(Wins) WHERE YEAR 1999 | Y |
| Fluency | What was the date when the opponent was at South Carolina ? (Original) | SELECT Date WHERE Opponent at South Carolina | 3.0 |
| | What was the date when the jord was at South Carolina ? (Unconstrained) | ✓ | 5.3 |
| | What was the date when the opponent was at South Carolina , (kNN) | ✓ | 4.0 |
| | What date was the opponent at South carolina ? (Seq2seq) | ✓ | 1.3 |
| | What is the date of the game against at South Carolina ? (Wseq) | ✓ | 4.7 |
| | What is the date of the opponent at South Carolina ? (Wseq-S) | ✓ | 4.0 |
| | On what date was the opponent at South Carolina ? (SAGE) | ✓ | 1.7 |
To compare the fluency of the questions generated by each model, we follow the practice of [18] and ask annotators to rank a set of generated questions, including the original one, in terms of fluency and naturalness. We adopt ranking instead of scoring in measuring fluency because we care more about comparing the models than about absolute scores. To facilitate annotation, we additionally provide a coarse-grained three-level guideline to the annotators.
Table 3 shows that Seq2seq-based models outperform local models markedly in fluency. Consistent with the automatic fluency evaluation (perplexity), Wseq tops all models in human evaluation, ranking behind the original question only by a small margin. This verifies that Wseq is effective in improving generation fluency. SAGE yields good fluency, which is significantly better than that of the Seq2seq model without entity delexicalization. The decrease in fluency of SAGE compared to Wseq has two causes: (1) adding the adversarial loss harms text quality; (2) SimiLe drives the model to generate questions closer to the original ones, which increases semantic validity but sacrifices fluency, as is also the case for Wseq-S.
Overall, human evaluation confirms the advantage of Seq2seq-based models over local models in both semantic validity and fluency.
4.3 Adversarial Training with SAGE
We augment training data for TableQA systems with SAGE-generated adversarial examples, then test the performance and robustness of the retrained systems.
Specifically, we train a SQLova system with BERT base encoder (SQLova-B) for efficiency. This is different from the released SQLova system with BERT large encoder. We then use SAGE to attack both SQLova-B and the released system, generating two sets of adversarial examples, AdvData-B and AdvData-L. We retrain SQLova-B using the original training data augmented with 30k and the full (56k) adversarial examples from AdvData-B and AdvData-L, respectively. Table 4 shows that adding SAGE-generated adversarial examples improves the performance of SQLova-B on the original WikiSQL test data. Although AdvData-L is not targeted at SQLova-B, it can still boost the performance of SQLova-B.
We further attack the two retrained SQLova-B systems augmented with the full adversarial examples using different attack models. Table 5 demonstrates that both AdvData-B and AdvData-L help SQLova-B defend against various attacks, with all flip rates decreased.
In summary, SAGE-generated adversarial examples can improve the performance and robustness of TableQA systems, regardless of which target system is used for generation.
5 Qualitative Analysis
We study some generated examples and analyze SAGE qualitatively with the help of model output and human evaluations in Table 6. In the first example, adversarial questions generated by either local model are semantically invalid, showing their limitation in meaning preservation, especially when the local edit happens at content words. On the contrary, Seq2seq-based models generate questions from the continuous semantic space, which can incorporate more lexical and syntactic variations. In particular, SAGE is able to generate questions that are both semantically valid and challenging to the target system.
The second example demonstrates the low fluency of local models, due to the occurrence of a nonsensical word or inappropriate punctuation. Questions from Seq2seq-based models are generally more meaningful and fluent. Wseq and Wseq-S can generate very fluent questions, except for the prepositional phrase entity “at South Carolina”. This is because, by default, all delexicalized entities from the SQL query are inferred to be proper nouns. However, SAGE manages to avoid this pitfall by using another syntactic structure. Additionally, when comparing Wseq-based models, the word “opponent”, which is used by the original question, is generated by models with SimiLe but not by plain Wseq. This suggests that the SimiLe loss encourages the model to generate words from the original question, which hints at why the fluency of Wseq-S is relatively low in §4.2.
We proposed SAGE, a Wasserstein Seq2seq model with entity delexicalization and semantic similarity regularization, to generate white-box adversarial questions for TableQA systems. We include the adversarial loss with Gumbel-Softmax in training to enforce the adversarial attack. Experiments showed that our model is effective in consolidating semantic validity and fluency while maintaining high flip rates. The generated adversarial examples can promote better evaluation and interpretation of TableQA systems. Moreover, our results demonstrated that they can improve TableQA systems’ performance in question understanding and knowledge reasoning, as well as their robustness against various attacks.
[1] (2019) Stochastic Wasserstein autoencoder for probabilistic sentence generation. In NAACL.
[2] (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
[3] Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR.
[4] (2016) Generating sentences from a continuous space. In CoNLL.
[5] (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP.
[6] (2018) Adversarial TableQA: attention supervision for question answering on tables. In ACML.
[7] (2020) Plug and play language models: a simple approach to controlled text generation. In ICLR.
[8] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
[9] (2018) HotFlip: white-box adversarial examples for text classification. In ACL.
[10] (2015) Explaining and harnessing adversarial examples. In ICLR.
[11] (2019) X-SQL: reinforce schema representation with context. CoRR.
[12] (2019) A comprehensive exploration on WikiSQL with table-aware word contextualization. CoRR.
[13] (2017) Categorical reparameterization with Gumbel-Softmax. In ICLR.
[14] (2017) Adversarial examples for evaluating reading comprehension systems. In EMNLP.
[15] (2013) Auto-encoding variational Bayes. CoRR.
[16] Effective approaches to attention-based neural machine translation. In EMNLP.
[17] (2019) On evaluation of adversarial perturbations for sequence-to-sequence models. In NAACL.
[18] (2018) RankME: reliable human ratings for natural language generation. In NAACL.
[19] (2002) BLEU: a method for automatic evaluation of machine translation. In ACL.
[20] (2015) Compositional semantic parsing on semi-structured tables. In ACL.
[21] (2019) Language models are unsupervised multitask learners.
[22] (2019) Generating natural language adversarial examples through probability weighted word saliency. In ACL.
[23] (2017) Get to the point: summarization with pointer-generator networks. In ACL.
[24] (2016) Minimum risk training for neural machine translation. In ACL.
[25] (2018) Wasserstein auto-encoders. In ICLR.
[26] (2019) Beyond BLEU: training neural machine translation with semantic similarity. In ACL.
[27] (2018) ParaNMT-50M: pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In ACL.
[28] (2019) Generating fluent adversarial examples for natural languages. In ACL.
[29] Seq2SQL: generating structured queries from natural language using reinforcement learning. CoRR.
[30] (2017) Morphological inflection generation with multi-space variational encoder-decoders. In CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection.