
Structurally Diverse Sampling Reduces Spurious Correlations in Semantic Parsing Datasets

A rapidly growing body of research has demonstrated the inability of NLP models to generalize compositionally and has tried to alleviate it through specialized architectures, training schemes, and data augmentation, among other approaches. In this work, we study a different, relatively under-explored approach: sampling diverse train sets that encourage compositional generalization. We propose a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs. Evaluating on 5 semantic parsing datasets of varying complexity, we show that our algorithm performs competitively with or better than prior algorithms in not only compositional template splits but also traditional IID splits of all but the least structurally diverse datasets. In general, we find that diverse train sets lead to better generalization than random training sets of the same size in 9 out of 10 dataset-split pairs, with over 10-point gains in exact match accuracy in 5, providing further evidence of their sample efficiency. Moreover, we show that structural diversity also makes for more comprehensive test sets that require diverse training to succeed on. Finally, we use information theory to show that reduction in spurious correlations between substructures may be one reason why diverse training sets improve generalization.



1 Introduction

Systematic compositionality—expressing novel complex concepts as a systematic composition of expressions for simpler concepts—is the property underlying human languages’ immense expressive power and productivity Lake et al. (2017); Fodor and Pylyshyn (1988). However, NLP models, in spite of large-scale pretraining, have been shown to struggle to generalize compositionally to novel composite expressions in the context of semantic parsing Lake and Baroni (2018); Loula et al. (2018); Kim and Linzen (2020).

Data augmentation has been extensively explored for compositional generalization Akyürek et al. (2021); Guo et al. (2021); Wang et al. (2021); Guo et al. (2020); Qiu et al. (2021). However, training instances that overlap in their structure—such as abstract syntax trees (ASTs) of target programs—are sub-optimal, and any augmentation mechanism that ignores this will require unnecessarily large training data to achieve compositional generalization. One way to improve sample efficiency is to train on a diverse training set—one that contains a variety of structures. Prior work exploring the efficiency of diverse training sets has been very limited in scope. Oren et al. (2021) demonstrate that template diversity improves sample efficiency in subsamples from a pool of instances as compared to random subsamples. However, they only experiment with a single dataset, and given that templates themselves overlap in structure, diversifying over them can itself be inefficient. Bogin et al. (2022) demonstrate the benefits of diversifying over bigrams—adjacent nodes in the AST of a logical form—which are too fine-grained. It is reasonable to expect that the optimal granularity of substructures over which to diversify lies somewhere between the extremes of templates and bigrams. Moreover, while they show benefits from diversity in compositional splits, there is limited evidence for whether diversity also helps in the traditional IID setting.

In this study, we take a broader look at the efficacy of diversity in datasets. We propose a general, substructure-agnostic recipe for structurally diverse subsampling algorithms and use it to formulate a novel algorithm that diversifies over program subtrees. Evaluating on template and IID splits of 5 different semantic parsing datasets of varying complexities, we find that structurally diverse train sets perform best in 9 out of 10 splits, with over 10-point improvements in exact match accuracy in 5. In particular, our proposed Subtree diversity algorithm is the most consistent, performing better than Template and Bigram diversity in 6 out of 10 splits. Interestingly, we find that the original versions of both the bigram and template diversity algorithms often perform worse than random subsampling in IID splits but improve with some simple modifications.

We further show that diverse subsampling can also be used to sample test sets that more comprehensively cover the space of instance structures and hence are far more challenging than IID splits for models trained on random train sets than for those trained on diverse sets. Finally, we use information theory to explore the improved generalization from diverse train sets and find that the weakening of spurious correlations between substructures might be one way diversity helps.

2 Setup

While the idea of diverse subsampling is applicable to any task with structured inputs or outputs, the sampling algorithms we use in this work operate on trees, and we focus in particular on abstract syntax trees of output programs in semantic parsing tasks. In semantic parsing, an utterance x has to be parsed into an executable program or logical form y. Further, following Oren et al. (2021) and Bogin et al. (2022), we assume access to a pool of instances P = {z_1, ..., z_N}, where z_i = (x_i, y_i), from which we want to subsample n instances. We view each instance z as a bag of substructures s(z) ⊆ S, where s is the mapping from instances to their substructures and S is the set of all substructures. The substructures may be shared between instances.

The goal of a structurally-diverse subsampling algorithm is to select instances that contain among them as many substructures in as many different contexts as possible. When used for training, the hypothesis is that such a subsample will give the model more information about the system underlying the instances than a random subsample and hence improve generalization and sample efficiency. Intuitively, randomly sampling instances from a grammar, or from human annotators, is likely to produce skewed distributions with spurious correlations between substructures; a diverse sampling algorithm minimizes these correlations. Similarly, when subsampling to create test sets, we argue that a diverse subsample also makes for more comprehensive test sets by better covering the space of possible structures. This is analogous to why one would choose to evaluate on a test set with balanced classes even when the training data might have considerable imbalance.

3 Structurally Diverse Datasets

3.1 Substructures

Oren et al. (2021) and Bogin et al. (2022) used program templates and bigrams, respectively, extracted from program ASTs. While we follow them in extracting substructures from programs alone, we choose different substructures, as we believe templates and bigrams are not at the right granularity to diversify over. Templates are too coarse; they share a lot of structure and may themselves be large in number, making it inefficient to even cover all of them. Bigrams, on the other hand, are too fine and may fail to capture many salient structural patterns. We thus use subtrees of the program AST up to a certain maximum size. The ASTs are constructed as in Bogin et al. (2022): tokens in a program are categorized as either functions, values, or structural tokens (such as parentheses or commas) that define the hierarchical structure of the program. Figure 1 shows an example program along with the different substructures extracted from it that we consider in this work: templates, bigrams and subtrees.
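To make the subtree substructures concrete, the following minimal Python sketch (our own illustration, not the paper's code) enumerates all connected subtrees of a toy AST up to a maximum number of nodes. Nodes are represented as `(label, [children])` pairs and subtrees as hashable nested tuples; the function names are ours.

```python
from itertools import product

def tree_size(t):
    """Number of nodes in a subtree encoded as (label, (children...))."""
    return 1 + sum(tree_size(c) for c in t[1])

def rooted_subtrees(node, max_size):
    """All connected subtrees rooted at `node` with at most `max_size` nodes.
    A rooted subtree keeps the node itself plus, for any subset of its
    children, one rooted subtree of each kept child."""
    label, children = node
    # For each child: either drop it (None) or keep one of its rooted subtrees.
    options = [[None] + rooted_subtrees(c, max_size - 1) for c in children]
    out = []
    for combo in product(*options):
        kept = tuple(t for t in combo if t is not None)
        candidate = (label, kept)
        if tree_size(candidate) <= max_size:
            out.append(candidate)
    return out

def all_subtrees(node, max_size):
    """Union of rooted subtrees over every node of the AST."""
    label, children = node
    result = set(rooted_subtrees(node, max_size))
    for c in children:
        result |= all_subtrees(c, max_size)
    return result

# Toy AST for: count ( filter ( black , find ( dog ) ) )
ast = ("count", [("filter", [("black", []), ("find", [("dog", [])])])])
```

With `max_size=2`, this yields the 5 isolated nodes plus the 4 parent-child pairs; isolated nodes are included, matching the note in Figure 1.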

Figure 1: An example instance from the COVR dataset (§ 4) along with its AST and the different types of substructures extracted from it: its template, bigrams, and subtrees. Note that the subtrees additionally include all isolated nodes in the AST.

3.2 Structurally Diverse Subsampling

As described in §2, the goal of structurally diverse subsampling is to sample instances that contain many different substructures in many different contexts to maximize the pressure on the model to learn the underlying compositional system. One way to approach this would be to set it up as an optimization problem that directly selects an optimally diverse set of instances. This, however, requires defining a measure of diversity that is tractable to optimize over potentially very large pools, which makes it harder to experiment with different measures of diversity.

We thus choose to take the route of an iterative algorithm (pseudo-code in Algorithm 1) that picks instances from the pool one by one until the requisite number of instances has been sampled. At a high level, in each iteration it first selects a substructure based on a substructure-weighting scheme and then selects an instance containing that substructure using an instance-weighting scheme. It also keeps track of the substructures sampled so far and resets this set once all substructures have been covered, allowing it to cycle over them repeatedly. The weighting schemes can be used to encode our interpretation of diversity. Additionally, being iterative has the usual benefit that the algorithm can be stopped at any time to return the currently sampled instances.

In its current state, Algorithm 1 is just a recipe; instantiating it into a subsampling algorithm requires specifying the substructure definition and the substructure- and instance-weighting schemes. In the following sections we describe our proposed subsampling algorithm, which improves subtree diversity within the framework of this recipe, and show that the framework also subsumes the template diversity and bigram diversity algorithms of Oren et al. (2021) and Bogin et al. (2022).

Input: sample pool P; instance-to-substructure mapping s(·); template mapping T(·); substructure weighting function W_s; instance weighting function W_z; training budget n
D ← ∅ (sampled instances); C ← ∅ (substructures sampled in the current cycle)
while |D| < n do
     select a substructure c with probability proportional to W_s(c)
     select an instance z ∈ P with c ∈ s(z), with probability proportional to W_z(z)
     D ← D ∪ {z}; P ← P \ {z}; C ← C ∪ s(z)
     if C contains every substructure present in P then
          C ← ∅ (start a new cycle)
     end if
end while
return D
Algorithm 1 Iterative algorithm for Diverse Subsampling
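The recipe above can be sketched in Python as follows. This is our own illustrative implementation, not the authors' released code, using a frequency-based substructure weight and uniform instance weight (RandEx) as defaults; all function and variable names are ours.

```python
import random
from collections import Counter

def freq_unsampled(s, state):
    """Weight of a substructure: its frequency in the remaining pool
    if it has not been sampled in the current cycle, else 0."""
    return state["counts"][s] if s not in state["cycle"] else 0.0

def rand_ex(z, state):
    """Uniform instance weight (RandEx)."""
    return 1.0

def diverse_subsample(pool, subs_fn, budget, sub_weight=freq_unsampled,
                      inst_weight=rand_ex, seed=0):
    """Iteratively sample `budget` instances, alternating between picking
    a substructure (by `sub_weight`) and an instance containing it
    (by `inst_weight`). With the default weights, the cycle set resets
    once every substructure still present in the pool has been covered."""
    rng = random.Random(seed)
    remaining = list(pool)
    state = {"cycle": set(), "counts": Counter()}
    selected = []
    while remaining and len(selected) < budget:
        state["counts"] = Counter(s for z in remaining for s in subs_fn(z))
        cands = list(state["counts"])
        weights = [sub_weight(s, state) for s in cands]
        if sum(weights) == 0:            # all substructures covered: new cycle
            state["cycle"].clear()
            weights = [sub_weight(s, state) for s in cands]
        c = rng.choices(cands, weights=weights)[0]
        holders = [z for z in remaining if c in subs_fn(z)]
        iw = [inst_weight(z, state) for z in holders]
        if sum(iw) == 0:                 # fall back to uniform if all zero
            iw = [1.0] * len(holders)
        z = rng.choices(holders, weights=iw)[0]
        selected.append(z)
        remaining.remove(z)
        state["cycle"] |= subs_fn(z)
    return selected
```

For example, with `subs_fn = lambda p: set(p.split())`, sampling two instances from a pool dominated by one structure is guaranteed to also cover a rare structure, since covered substructures get zero weight within a cycle.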

3.3 Subtree Diversity

Our subsampling algorithm diversifies over subtrees as defined in § 3.1, i.e., s(z) is the set of subtrees, up to a fixed maximum size, in the AST of z's program. Further, it uses the following substructure and instance weighting schemes:

Substructure Selection: We use a substructure weighting function that prioritizes the most frequent substructures not yet sampled in the current cycle: W_s(c) = |{z ∈ P : c ∈ s(z)}| if c ∉ C and 0 otherwise, where P is the pool, C is the set of substructures sampled in the current cycle, and |·| denotes set cardinality.

Instance Selection: We experiment with the following instance weighting schemes:

  1. RandEx samples an instance uniformly at random: W_z is a constant function.

  2. RandNewT samples an instance with an unseen template: W_z(z) = 1[T(z) ∉ 𝒯], where T maps an instance to its template, 𝒯 is the set of templates sampled so far, and 1[·] is the indicator function.

  3. FreqNewT samples an instance with the most frequent unsampled template: W_z(z) = |{z' ∈ P : T(z') = T(z)}| if T(z) ∉ 𝒯 and 0 otherwise. This and RandNewT try to improve coverage over templates.
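The three schemes can be written as small weighting functions. The snippet below is an illustrative sketch with a toy `template` anonymizer (numbers replaced by NUM) and a `state` dictionary holding the sampled templates and their pool frequencies; all names here are ours, not the paper's.

```python
from collections import Counter

def template(program):
    """Toy anonymization: replace numeric literals with NUM."""
    return " ".join("NUM" if tok.isdigit() else tok for tok in program.split())

def rand_ex(z, state):
    """RandEx: uniform over instances containing the chosen substructure."""
    return 1.0

def rand_new_t(z, state):
    """RandNewT: uniform over instances whose template is still unseen."""
    return 1.0 if template(z) not in state["templates"] else 0.0

def freq_new_t(z, state):
    """FreqNewT: weight an instance by the pool frequency of its template,
    but only if that template is still unseen."""
    t = template(z)
    return state["template_counts"][t] if t not in state["templates"] else 0.0
```

These plug into the recipe of Algorithm 1 as the instance-weighting scheme.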

3.4 Template and Bigram Diversity

Template Diversity from Oren et al. (2021), henceforth referred to as Template, can be implemented in the framework of Algorithm 1 as:

  • s(z) is a singleton set containing the template of z.

  • W_s(c) = 1 if c ∉ C and 0 otherwise, i.e., a uniformly random unsampled template.

  • W_z is a constant function (RandEx).

The Bigram Diversity algorithm from Bogin et al. (2022), henceforth referred to as Bigram, can also be implemented in the framework of Algorithm 1 as:

  • s(z) is the set of bigrams in z's AST as defined by Bogin et al. (2022).

  • Bogin et al. (2022)'s algorithm samples uniformly from unsampled bigrams as long as any remain, and thereafter from all bigrams. This can be formulated as: W_s(c) = 1 if c ∉ B, or if B already contains all bigrams, and 0 otherwise, where B is the set of all sampled bigrams. Note that B is different from C; the latter is emptied to enable repeatedly cycling through substructures.

  • W_z is a constant function (RandEx).

Additionally, we experiment with the template and bigram diversity algorithms combined with the substructure weighting scheme of §3.3 that prioritizes more frequent substructures. These are referred to as Template[Freq] and Bigram[Freq] respectively.

4 Experiments

Given a dataset of utterance-program pairs, we first create three different types of splits, described below, with each split consisting of a training pool and a test set. We then compare the various subsampling algorithms described in §3 by using them to sample training sets of varying budget sizes from the pool and evaluating on the test set.

4.1 Splits

IID split For this we randomly sample some instances from the dataset to use as the test set, keeping the rest for the training pool.

Template split This is a type of compositional split proposed by Finegan-Dollak et al. (2018). Here, instances are grouped based on their program template, which is obtained by applying an anonymization function that replaces certain program tokens, such as strings and numbers, with their abstract type. The split is then created by randomly partitioning the set of templates into a train set and a test set and assigning the instances of train or test templates to the training pool or the test set respectively. We follow the procedure of Bogin et al. (2022) to obtain solvable template splits, i.e., splits in which every token in the test set also appears in the training pool.
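A template split of this kind can be sketched as below. This is our own minimal illustration: `template_fn` stands in for the anonymization function, and the solvability filtering of Bogin et al. (2022) is omitted for brevity.

```python
import random
from collections import defaultdict

def template_split(instances, template_fn, test_frac=0.2, seed=0):
    """Group instances by program template and hold out a random subset
    of templates: instances of held-out templates form the test set,
    the rest form the training pool."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for z in instances:
        groups[template_fn(z)].append(z)
    templates = sorted(groups)
    rng.shuffle(templates)
    n_test = max(1, round(test_frac * len(templates)))
    held_out = set(templates[:n_test])
    pool = [z for t in templates[n_test:] for z in groups[t]]
    test = [z for t in sorted(held_out) for z in groups[t]]
    return pool, test
```

By construction, no template appears on both sides of the split, which is what makes the split compositional rather than IID.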

Subtree split As discussed in §2, diversely subsampled sets of instances may also make for more comprehensive test sets. To test this, we also evaluate on a third type of split: we use the Subtree[FreqNewT] diverse subsampling algorithm to sample test sets, using the rest of the instances as the training pool.

4.2 Datasets

Dataset Input Utterance Target Program
COVR (synthetic) What is the number of black dog that is chasing mouse ? count ( with_relation ( filter ( black , find ( dog ) ) , chasing , find ( mouse ) ) )
Overnight (synthetic / natural) person whose height is 180 cm and whose birthdate is 2004 / what person born in 2004 is 180 cm tall (listValue (filter (filter (getProperty (singleton en.person) (string !type)) (string height) (string =) (number 180 (string birthdate) (string =) (date 2004 -1 -1)))
Schema2QA (synthetic) what people are named aideliz li ( Person ) filter id = "aideliz li"
ATIS (natural) a flight on continental airlines leaving boston and going to denver ( lambda $0 e ( and ( flight $0 ) ( airline $0 co : al ) ( from $0 boston : ci ) ( to $0 denver : ci ) ) )
SM-CalFlow (natural) When is my next staff meeting scheduled for? (Yield (Event.start (FindNumNextEvent (Event.subject? (? = "staff meeting")) 1L)))
Table 1: Examples of input utterance and target program pairs for the datasets used in this work.
Dataset Instances Bigrams Subtrees Templates
Atis 5037 5091 41536 1149
COVR 100000 298 4490 29141
Overnight 4419 354 3015 87
SM-CalFlow 121910 43689 146386 21083
Schema2QA 1577860 223 3060 139
Table 2: Number of instances, bigrams, subtrees (up to the maximum size used in this work), and templates in the datasets used for structural diversity experiments.

We use five different semantic parsing datasets from a variety of domains, with varied numbers of program templates and both synthetic and natural language input utterances. Tables 1 and 2 show a few examples and statistics on the number of instances and the different types of compounds, respectively.

COVR: A synthetic dataset that uses a variable-free functional query language and is generated using a synchronous context-free grammar (SCFG) adapted from the VQA dataset of Bogin et al. (2021a). We use the SCFG to generate 100K examples for our experiments.

ATIS Hemphill et al. (1990); Dahl et al. (1994): A dataset of natural language queries about aviation paired with λ-calculus programs.

Overnight Wang et al. (2015): A dataset containing both synthetic and natural language utterances from 11 domains (e.g. socialnetwork, restaurants, etc.) paired with Lambda-DCS logical forms.

Schema2QA Xu et al. (2020): Uses the ThingTalk language Campagna et al. (2019). We use the synthetic instances from the people domain generated by Oren et al. (2021).

SM-CalFlow Andreas et al. (2020): Consists of dialogs paired with LISP programs. Each instance is a single dialogue turn from one of two domains pertaining to creating calendar events or querying an org chart.

With the exception of SM-CalFlow, which we took from Andreas et al. (2020), we used the pre-processed versions of the above datasets provided by Bogin et al. (2022) and used their code to anonymize programs and produce ASTs. For SM-CalFlow, we anonymized strings and numbers, such as “staff meeting” and “1L” in Table 1, and used nltk to produce ASTs.
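The parenthesized programs above can be turned into ASTs with a short recursive parser; the sketch below is a minimal stand-in for the preprocessing described here (not the code of Bogin et al. (2022) or an nltk call), with simplified quoting and number conventions and names of our own choosing.

```python
import re

def parse_sexpr(program):
    """Parse a parenthesized program string into a nested
    (label, [children]) AST."""
    tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()]+', program)
    def parse(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        children, i = [], i + 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = parse(i)
            else:                       # bare token becomes a leaf node
                child, i = (tokens[i], []), i + 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = parse(0)
    return tree

def anonymize(node):
    """Replace string and number leaves with their abstract type."""
    label, children = node
    if not children:
        if label.startswith('"'):
            label = "STR"
        elif re.fullmatch(r"\d+L?", label):
            label = "NUM"
    return (label, [anonymize(c) for c in children])
```

Applied to the SM-CalFlow example of Table 1, this yields a tree whose "staff meeting" and 1L leaves are abstracted to STR and NUM.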

4.3 Model and Training

We use the pretrained BART-base model Lewis et al. (2020) for our experiments, fine-tuning it on subsamples for each dataset, split and subsampling algorithm. For each dataset we use 5 different random seeds to create 5 splits of each type. Then for each split and training budget, we use 3 different seeds to subsample 3 training sets for each subsampling algorithm. See App. A for more details.

5 Results

5.1 Structural Diversity in Train Sets

Figures 2 and 3 show the results on Template and the IID splits of the different datasets.

Diverse subsampling is more sample-efficient than random subsampling. Diverse subsampling outperforms random subsampling in both IID and template splits, being more efficient in 9 out of 10 dataset-split combinations, with over 10-point improvements in the template splits and over 5-point improvements in the IID splits of all datasets except SM-CalFlow. The only splits where random subsampling performs better are the IID splits of SM-CalFlow. We believe this is because target programs in SM-CalFlow contain free-form strings, which inflate the number of substructures (bigrams or subtrees), as shown in Table 2.

Subtree diversity is most consistent. Among the diverse subsampling algorithms, Subtree diversity is the most consistent. It is more efficient than Bigram diversity on all datasets and splits and outperforms Template diversity everywhere except on the template splits of Atis, Schema2QA and Overnight, which all have very few templates (as shown in Table 2) and hence little to gain from diversifying over smaller substructures. These results suggest that for most tasks, except those with little structural diversity in the first place, diversifying over more granular substructures is more sample efficient.

Prioritizing more frequent compounds improves efficiency. The original Bigram and Template diversity algorithms (Bigram and Template), while good in template splits, are worse than even random subsampling in IID splits. However, replacing their compound selection scheme with one that prioritizes frequent compounds (Bigram[Freq] and Template[Freq]) improves both their efficiencies in IID splits with minor degradation in template splits. The only exception is the template diversity in COVR. This is expected given that it was generated from a synchronous grammar with production rules sampled uniformly at random and hence is dominated by instances with shorter templates. Thus, Template[Freq] will only pick these.

Improving coverage over templates makes subtree diversity more consistent. Comparing Subtree[RandEx] and Subtree[RandNewT] we see that prioritizing examples with unseen templates improves efficiency in template splits with relatively minor reductions in IID splits.

Figure 2: Accuracies of different subsampling algorithms on Template splits with budgets of 6000 instances for SM-CalFlow, and 300 for the rest.
Figure 3: Accuracies of different subsampling algorithms on IID splits with budgets of 6000 instances for SM-CalFlow, and 100 for the rest.

5.2 Structural Diversity in Test Sets

Figure 4: Sample efficiency of random subsamples vs. Subtree[RandEx] subsamples on IID and Subtree splits.

Figure 4 compares the performance of random and diverse (Subtree[RandEx]) subsamples on IID and Subtree splits. It is clear that Subtree splits are harder than IID splits. More importantly, while random training sets perform very poorly on them, diverse subsamples quickly climb close to their IID-split performance on all datasets except Atis. Note that this is not merely an effect of matching distributions: these test splits were created using a different variant of subtree diverse subsampling (Subtree[FreqNewT]), and the bigram and template diversity algorithms also regain much of their IID performance (see App. B). We thus believe this demonstrates the comprehensiveness of diverse samples as test sets.

6 Information-theoretic Analysis

Having seen that structurally diverse datasets indeed improve generalization, we now attempt to explain why they do so. Our hypothesis is that they improve generalization by weakening the spurious correlations between substructures that would exist in any dataset of bounded size. To test this, we define a measure for correlations and compare its value for random and diverse subsamples.

Figure 5: Normalized Cumulative Mutual Information (NCMI) of random subsamples compared with that of diverse subsamples.

Measure for Correlations We measure correlations between substructures in a set of instances as the sum of pairwise mutual information (MI) between them, normalized to take into account the larger number of distinct substructures in diverse subsamples. Formally, as earlier, we view each instance z as a bag of substructures, specifically subtrees, with S as the set of all subtrees. We define an indicator random variable X_u for each subtree u ∈ S that is 1 if u is present in a random instance and 0 otherwise. MI between two subtrees u and v is then defined as

MI(u, v) = Σ_{a ∈ {0,1}} Σ_{b ∈ {0,1}} p(X_u = a, X_v = b) log [ p(X_u = a, X_v = b) / ( p(X_u = a) p(X_v = b) ) ]

where p(X_u = 1) is the empirical probability of u occurring in an instance and p(X_u = 1, X_v = 1) is the empirical probability that u and v co-occur in an instance, i.e.

p(X_u = 1, X_v = 1) = |{z ∈ D : u ∈ s(z) and v ∈ s(z)}| / |D|

for a set of instances D. Other probabilities are defined analogously. Finally, our measure of correlations, the normalized cumulative mutual information (NCMI), is the average of MI(u, v) across all subtree pairs.
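Under these definitions, the measure can be computed directly from the bags of substructures. The following sketch (our own, unoptimized, with names of our choosing) estimates the average pairwise MI over all substructure pairs:

```python
from itertools import combinations
from math import log

def ncmi(bags):
    """Average pairwise mutual information between substructure
    indicator variables, estimated from a list of instances, each
    given as a set (bag) of substructures."""
    n = len(bags)
    subs = sorted(set().union(*bags))
    p = {u: sum(u in b for b in bags) / n for u in subs}
    total, pairs = 0.0, 0
    for u, v in combinations(subs, 2):
        p_uv = sum((u in b) and (v in b) for b in bags) / n
        mi = 0.0
        for a in (0, 1):
            for c in (0, 1):
                pa = p[u] if a else 1 - p[u]
                pc = p[v] if c else 1 - p[v]
                # joint probability of (X_u = a, X_v = c)
                if a and c:
                    pj = p_uv
                elif a:
                    pj = p[u] - p_uv
                elif c:
                    pj = p[v] - p_uv
                else:
                    pj = 1 - p[u] - p[v] + p_uv
                if pj > 0 and pa > 0 and pc > 0:
                    mi += pj * log(pj / (pa * pc))
        total += mi
        pairs += 1
    return total / pairs if pairs else 0.0
```

Independent substructures give MI of zero for their pair, while perfectly co-occurring ones contribute positive MI, so lower values indicate weaker correlations.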

Split Type Budget COVR Overnight Schema2QA
IID 100 -0.50 -0.83 -0.39
300 -0.70 -0.76 -0.56
Template 100 -0.57 -0.58 -0.68
300 -0.60 -0.56 -0.64
Table 3: Spearman correlations between NCMI and accuracy for train sets sampled randomly and using the Subtree[RandEx] and Subtree[RandNewT] algorithms, averaged across 4 split seeds (from § 4).

Results For each dataset, we took 3 samples using the random and subtree subsampling algorithms, treating the entire dataset as the pool, and computed their NCMI. Figure 5 shows that diverse subsamples have lower NCMI than random subsamples across training budgets and split types, confirming that diverse datasets have weaker correlations. Moreover, examining Spearman correlations between NCMI and accuracy (Table 3) for train sets subsampled using the random, Subtree[RandEx] and Subtree[RandNewT] algorithms from the experiments in §4, we find that they are significantly negatively correlated, giving further credence to our measure. These results substantiate our hypothesis that reduction of spurious correlations is indeed one way in which diverse subsamples improve generalization.

7 Related Work

Benchmarks Multiple benchmarks for compositional generalization have been created. These include those created manually specifically for the purpose of measuring generalization Lake and Baroni (2018); Bastings et al. (2018); Bahdanau et al. (2019); Kim and Linzen (2020); Ruis et al. (2020); Bogin et al. (2021a). Compositional splits have also been created automatically from existing semantic parsing datasets by splitting by output length Lake and Baroni (2018), holding out program templates Finegan-Dollak et al. (2018), and by maximizing compound divergence between the training and test sets Keysers et al. (2020); Shaw et al. (2021).

Improving Generalization Many approaches to examining and improving compositional generalization have been proposed, including specialized architectures with inductive bias for compositional generalization Herzig and Berant (2021); Bogin et al. (2021b); Chen et al. (2020); Gordon et al. (2020); Yin et al. (2021), data augmentation Andreas (2020); Akyürek et al. (2021); Guo et al. (2021), modifications to training methodology Oren et al. (2020); Csordás et al. (2021) and meta learning Conklin et al. (2021); Lake (2019). Data-based approaches have the advantage of being model-agnostic and hence can be used in conjunction with pretrained models. However, without a more informed choice of which instances are added, and given the infinite productivity of language, they require large amounts of data to induce compositional generalization. Selecting structurally diverse instances is one such choice that leads to improved generalization and sample efficiency.

Data Selection Our work is most closely related to that of Oren et al. (2021) and Bogin et al. (2022), both of which take into account the structure of output programs in semantic parsing and select structurally diverse instances for training to improve compositional generalization and sample efficiency. Our work, especially that on sampling diverse test sets, is also related to work on creating compositional splits from existing datasets Keysers et al. (2020); Shaw et al. (2021); Bogin et al. (2022) and reducing biases in datasets via adversarial filtering or other means Bras et al. (2020); Sakaguchi et al. (2021); Gardner et al. (2021) and representation debiasing Li and Vasconcelos (2019).

8 Conclusion

In this work we broadly explored the efficacy of structural diversity for the task of semantic parsing. We proposed a novel subtree diversity algorithm that diversifies over subtrees in program abstract syntax trees. Evaluating on multiple datasets with varying complexities and logical formalisms and on both IID and compositional splits, we demonstrated that diversity almost always, and often greatly, improves generalization and sample efficiency with our proposed algorithm. We further demonstrated that structurally diverse sets of instances also make for more comprehensive test sets and showed that reduction in spurious correlations might be one reason why diverse train sets are more sample efficient.

We hope that our experiments demonstrating the benefits of diversity across a wide variety of datasets will encourage future research into other, potentially better substructures for other structured prediction tasks and into sampling algorithms with better sample efficiency. In particular, the current algorithms assume the availability of a pool of labeled instances. In many semantic parsing scenarios it is possible to sample a large pool of candidate programs from which diverse programs could be selected for annotation with corresponding utterances, but this is one of the few scenarios where our current algorithms apply. Future research could thus look at applying diversity when no pool is available, e.g., when generating a synthetic dataset from a synchronous grammar, or at preferring structural diversity in the input space instead of the output space (if a large collection of unlabeled inputs can be obtained or generated). Another important direction is to apply principles of diversity to natural language utterances, for which a hierarchical representation may not be as readily available and which also exhibit considerable variability. Additionally, future work may look at better measures of spurious correlations, which we believe may lead to more efficient diverse sampling algorithms.

Finally, and quite importantly, in the context of compositional generalization, our results indicate that the apparent lack of it in NLP models may be due in part to the lack of diversity in datasets, suggesting the need to create more diverse compositional generalization benchmarks.


Appendix A Training

Models for the COVR, ATIS, Schema2QA and Overnight datasets were trained for different numbers of epochs depending on train set size, as shown in Table 4. Models for SM-CalFlow were all trained for 240 epochs. Training was run with batch sizes ranging from 8 to 20, depending on the maximum number of example tokens in each dataset, and a learning rate with polynomial decay. Each experiment was run on an Nvidia Titan RTX GPU. Following Bogin et al. (2022), we do early stopping using the test set. As our goal is to estimate train set quality and not model quality, we argue this is an acceptable choice in our setting.

Budget #Epochs
50 160
300 128
600 96
1000 80
Table 4: Number of training epochs for COVR, ATIS, Schema2QA and Overnight depending on train set size.

Appendix B Full Results

Tables 5, 6, 7, 8 and 9 show results for all the subsampling algorithm, split type and budget size combinations we tried for COVR, ATIS, Schema2QA, Overnight and SM-CalFlow respectively.

Split Type iid subtree template
Budget 50 100 300 50 100 300 50 100 300
Random 56.31 79.61 95.90 40.27 69.80 95.96 32.87 64.25 92.02
Bigram 47.22 71.98 96.70 49.80 80.28 97.99 35.51 67.29 94.48
Bigram[Freq] 62.18 79.43 96.79 58.70 82.60 97.45 45.86 67.70 93.80
Template 27.59 46.08 61.56 NaN NaN NaN 50.75 75.76 93.80
Template[Freq] 43.37 50.27 63.54 19.95 24.54 42.15 6.17 11.58 41.02
Subtree[RandEx] 61.22 86.04 98.37 41.88 76.77 98.95 39.09 73.78 97.25
Subtree[RandNewT] 61.81 85.50 98.39 42.87 78.90 98.48 38.87 75.06 97.61
Subtree[FreqNewT] 48.95 59.60 77.25 25.90 38.91 87.90 10.48 24.66 60.73
Table 5: Complete subsampling results for COVR.
Split Type iid subtree template
Budget 100 300 600 1000 100 300 600 1000 100 300 600 1000
Random 40.53 59.42 68.87 75.52 18.92 38.88 52.96 63.07 13.98 33.89 46.62 56.05
Bigram 39.89 59.97 70.93 77.72 22.66 45.22 59.95 68.89 15.65 38.84 52.80 61.54
Bigram[Freq] 43.65 62.77 72.72 78.93 25.07 46.34 60.83 70.18 17.76 38.52 52.63 61.37
Template 37.58 63.00 71.69 77.55 NaN NaN NaN NaN 20.29 45.21 58.42 65.18
Template[Freq] 45.73 65.83 73.86 77.76 24.46 51.51 64.24 71.35 16.09 46.13 59.39 65.37
Subtree[RandEx] 43.20 63.10 71.60 77.73 24.77 45.36 60.79 69.71 17.90 39.94 52.25 60.97
Subtree[RandNewT] 44.24 64.98 73.24 79.19 28.89 52.90 65.13 71.73 21.54 44.88 57.28 62.34
Subtree[FreqNewT] 46.18 65.88 73.67 79.11 26.97 52.48 64.97 71.43 20.04 44.66 57.13 62.35
Table 6: Complete subsampling results for ATIS.
Split Type iid subtree template
Budget 50 100 300 600 100 300 600 1000 100 300 600 1000
Random 66.08 74.93 86.48 91.72 5.10 16.68 29.88 43.27 13.66 44.99 63.58 76.52
Bigram 53.29 65.18 89.97 95.28 32.82 64.63 78.77 87.12 10.13 45.15 64.86 77.36
Bigram[Freq] 62.81 70.28 90.74 94.35 33.15 64.49 77.35 85.43 10.37 43.17 60.17 74.13
Template 28.31 59.17 89.38 94.22 25.22 65.87 81.38 89.17 22.09 66.56 87.65 93.75
Template[Freq] 55.61 67.34 89.89 94.93 26.97 64.02 80.92 88.56 20.59 64.32 85.80 93.12
Subtree[RandEx] 73.14 83.98 94.36 96.29 21.67 65.68 88.34 94.88 18.29 53.84 67.30 82.24
Subtree[RandNewT] 63.45 79.05 93.22 95.45 29.84 75.84 88.90 95.57 21.70 58.94 70.79 84.15
Subtree[FreqNewT] 63.75 78.38 93.35 95.73 29.92 76.25 89.58 95.52 20.08 60.05 71.92 84.89
Table 7: Complete subsampling results for Schema2QA.
Split Type iid subtree template
Budget 50 100 300 100 300 600 1000 100 300 600 1000
Random 47.52 72.62 93.86 33.03 65.62 79.80 88.34 29.65 48.98 59.96 63.51
Bigram 25.89 57.37 95.63 65.38 88.96 93.23 93.57 27.37 51.25 59.45 63.25
Bigram[Freq] 44.66 73.61 96.03 67.15 89.99 93.59 94.27 34.77 52.82 60.19 63.47
Template 26.75 67.60 96.62 67.89 92.43 94.53 94.83 39.19 58.62 68.42 68.90
Template[Freq] 42.26 76.93 97.99 68.95 92.41 94.49 94.76 38.80 59.03 66.77 69.11
Subtree[RandEx] 57.14 85.95 99.01 67.52 94.30 94.42 94.60 38.23 55.80 63.82 66.64
Subtree[RandNewT] 59.68 88.31 99.18 70.50 94.19 94.58 94.51 37.13 57.80 64.86 65.24
Subtree[FreqNewT] 59.84 88.84 99.25 70.57 94.16 94.37 94.62 38.73 57.61 67.13 65.61
Table 8: Complete subsampling results for Overnight.
Split Type iid subtree template
Budget 1000 3000 6000 1000 3000 6000 1000 3000 6000
Random 36.38 48.06 52.97 18.54 35.63 44.43 12.77 26.49 33.02
Bigram 15.49 29.78 37.30 15.78 32.61 40.78 8.94 22.90 30.93
Bigram[Freq] 29.71 43.28 47.21 24.55 40.50 46.86 15.46 29.22 35.68
Template 21.04 32.29 37.61 NaN NaN NaN 16.42 30.80 37.34
Template[Freq] 31.08 40.74 43.46 25.65 43.11 49.49 14.64 31.05 37.69
Subtree[RandEx] 31.78 43.84 48.70 25.48 42.98 48.94 16.56 31.19 37.46
Subtree[RandNewT] 32.20 42.84 47.15 25.32 43.11 49.32 16.89 32.17 38.59
Subtree[FreqNewT] 29.78 41.66 47.12 27.76 47.51 51.60 15.98 31.71 38.43
Table 9: Complete subsampling results for SM-CalFlow.