Natural Language Sentence Matching (NLSM) has gained substantial attention from both academia and industry, and rich public datasets have contributed a lot to this progress. However, biased datasets can also hurt the generalization performance of trained models and give untrustworthy evaluation results. For many NLSM datasets, the providers select some pairs of sentences into the datasets, and this sampling procedure can easily introduce unintended patterns, i.e., selection bias. One example is the QuoraQP dataset, where some content-independent naive features are unreasonably predictive. Such features reflect the selection bias and are termed leakage features. In this paper, we investigate the problem of selection bias on six NLSM datasets and find that four of them are significantly biased. We further propose a training and evaluation framework to alleviate the bias. Experimental results on QuoraQP suggest that the proposed framework can improve the generalization ability of trained models and give more trustworthy evaluation results for real-world adoption.
Natural Language Sentence Matching (NLSM) aims at comparing two sentences and identifying the relationships (Wang et al., 2017), and serves as the core of many NLP tasks such as question answering and information retrieval (Wang et al., 2016b). Natural Language Inference (NLI) (Bowman et al., 2015) and Semantic Textual Similarity (STS) (Wang et al., 2016b) are both typical NLSM problems. A large number of publicly available datasets have benefited the research to a great extent (Kim et al., 2018; Wang et al., 2017; Tien et al., 2018), including QuoraQP (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), SNLI (Bowman et al., 2015), SICK (Marelli et al., 2014), etc. These datasets provide resources for both training and evaluation of different algorithms (Torralba and Efros, 2011).
However, most of the datasets are prepared by procedures that involve a sampling process, which can easily introduce a selection bias (Heckman, 1977; Zadrozny, 2004). It gets even worse when the bias reveals label information, resulting in "leakage features", which are irrelevant to the content or semantics of the sentences but are predictive of the label. One example is QuoraQP, a dataset for classifying whether two sentences are duplicated (the positive class) or not (the negative class), which has been widely used to evaluate STS models (Gong et al., 2017; Kim et al., 2018; Wang et al., 2017; Devlin et al., 2018). In QuoraQP, three leakage features have been identified: S1_freq, the number of occurrences of the first sentence in the dataset; S2_freq, the number of occurrences of the second sentence; and S1S2_inter, the number of sentences that are paired with both the first and the second sentences in the dataset for comparison.
Consider the distributions of the WMD and of the normalized leakage features versus the labels in QuoraQP. The features are all normalized to their quantiles. As illustrated, the leakage features are more predictive than the WMD, as the differences between the distributions of positive and negative pairs are more significant. Moreover, combining S1_freq and S2_freq can make even more accurate predictions, as illustrated in Figure 2, where we calculate the averages of the labels under different S1_freq and S2_freq. We find that when both features' values are large, the pairs tend to be duplicated (marked in red), while when one is large and the other is small, the pairs tend to be not_duplicated (marked in blue).
These leakage features play a critical role in the QuoraQP competition (see https://www.kaggle.com/c/quora-question-pairs/discussion/34355 and https://www.kaggle.com/c/quora-question-pairs/discussion/33168). As the evaluations are conducted with the same biased datasets, models that fit the bias pattern gain an additional advantage over unbiased models, making the benchmark results untrustworthy. On the other hand, the bias pattern does not exist in the real world, so if a model fits the bias pattern (intentionally or unintentionally), its generalization performance will be hurt, limiting the value of these datasets for further applications (Torralba and Efros, 2011).
In this paper, we study this problem and demonstrate the impact of the selection bias by a series of experiments. We focus on the selection bias embodied in the comparing relationships of sentences, and the main contributions of this paper are the answers to the following questions:
Does selection bias exist in other NLSM datasets? We identify four out of six publicly available datasets that suffer from the selection bias.
Would DNN-based methods learn from the bias pattern unintentionally? We find that Siamese-LSTM models trained on QuoraQP do capture the bias pattern.
Can we help the model learn the useful semantic pattern from the content without fitting the bias pattern? We propose an easy-to-adopt method to mitigate the bias. Experiments show that this method can improve the generalization performance of the trained models.
Can we build an evaluation framework that gives us more reliable results for real-world adoption? We propose a more trustworthy evaluation method that demonstrates consistent results with unbiased cross-dataset evaluations.
The rest of the paper is organized as follows. Section 2 gives an empirical look at the selection bias on a variety of NLSM datasets and analyzes why the leakage features are effective. Section 3 examines whether DNN-based methods fit the bias pattern unintentionally. Section 4 introduces the training and evaluation framework to alleviate the biasedness. Taking QuoraQP as an example, we report the experimental results in Section 5. Section 6 summarizes related work, and Section 7 draws the conclusion.
In this section, we investigate the problem of selection bias on six NLSM datasets and then analyze why the leakage features are effective.
To quantify the severity of the leakage from the selection bias, we formulate a toy problem for NLSM. We predict the semantic relationship of two sentences based on the comparing relationships between sentences. We refer to the semantic relationship of two sentences as their label (for example, duplicated for STS and entailment for NLI), and to the comparing relationship as whether they are paired for comparison in the dataset. Here we only consider the index of each sentence, and the actual content is not used. The formal problem definition is as follows:
Given a set of sentence ids and a set of comparing relationships between those sentences, the goal is to infer the semantic relationship between given pairs of sentence ids from the comparing relationships alone.
This toy problem is in fact an edge classification problem (Aggarwal et al., 2016), as we can construct a graph using the comparing relationships as illustrated in Figure 3. In addition, from the graph perspective, S1_freq and S2_freq are the degrees of nodes, and S1S2_inter is the number of 2-hop paths connecting two nodes. Learning on the graph for this toy problem follows a transductive setting (Ji et al., 2010), where the graph is built with the comparing relationships of all the examples.
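The three leakage features can be computed from the comparison graph alone. The following is a minimal sketch, not the authors' implementation: the function name `leakage_features` and the input format (a list of sentence-id pairs) are assumptions made for illustration, and S1S2_inter is taken here as the number of common neighbors of the two nodes.

```python
from collections import defaultdict


def leakage_features(pairs):
    """Return (S1_freq, S2_freq, S1S2_inter) for each (s1, s2) id pair.

    `pairs` is a hypothetical input format: a list of (id1, id2) tuples,
    one per compared sentence pair in the dataset.
    """
    neighbors = defaultdict(set)  # sentence id -> ids it is compared with
    freq = defaultdict(int)       # sentence id -> number of occurrences
    for s1, s2 in pairs:
        neighbors[s1].add(s2)
        neighbors[s2].add(s1)
        freq[s1] += 1
        freq[s2] += 1

    features = []
    for s1, s2 in pairs:
        # Sentences paired with both s1 and s2 (2-hop paths between them).
        common = (neighbors[s1] & neighbors[s2]) - {s1, s2}
        features.append((freq[s1], freq[s2], len(common)))
    return features
```

No sentence content is read at any point, which is what makes these features "leakage": any predictive power they have comes from how the pairs were selected.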
Based on the new problem definition, we investigate six NLSM datasets, including SNLI, MultiNLI (Williams et al., 2018), QuoraQP, MSRP (Dolan et al., 2004), SICK and ByteDance (https://www.kaggle.com/c/fake-news-pair-classification-challenge). We apply two different methods to classify the edges on the graph: Leakage, which uses the three leakage features introduced in Section 1, and Advanced, which uses some more advanced graph-based features (Perozzi et al., 2014; Zhou et al., 2009; Liben-Nowell and Kleinberg, 2007) together with the three leakage features. (The features are selected carefully to describe the local structure between two nodes and to prevent the model from remembering the exact ID of sentences to make inferences.) We also report the results of three baselines: Majority, which predicts the most frequent label; Unlexicalized, which uses 15 handcrafted features from the content of sentences (Bowman et al., 2015); and LSTM, which is a DNN-based method using sequences of word embeddings. All classifiers are Random Forests if no specific configuration is mentioned. The classifiers are trained with the training set, and we report the results on the testing set. More detailed settings are introduced in Appendix A. The results are reported in Table 1.
Predicting semantic relationships without using sentence contents seems impossible. However, we find that the graph-based features (Leakage and Advanced) make the problem feasible on a wide range of datasets. Specifically, on datasets like QuoraQP and ByteDance, the leakage features are even more effective than the unlexicalized features. One exception is that on MultiNLI, Majority outperforms Leakage and Advanced significantly. Another interesting finding is that on SNLI and ByteDance, advanced graph-based features improve a lot over the leakage features, while on QuoraQP, the difference is very small. Among all the tested datasets, only MSRP and SICK are almost neutral to the leakage features. Note that they are relatively small, with fewer than 10k samples each. The results in Table 1 raise concerns about the impact of selection bias on the models and evaluation results.
As discussed in Section 1, the leakage features are the reflection of selection bias. Intuitively, if we construct a dataset for NLSM by randomly sampling some pairs of sentences, the resulting dataset would be extremely imbalanced, with most of the pairs being neutral for NLI or not_duplicated for STS. Thus, to make the dataset relatively balanced, a sampling strategy is often required. If the strategy is not well designed, it will introduce a bias pattern into the dataset, which can be revealed by leakage features. Here we try to figure out why the leakage features are effective in the aforementioned datasets. Since we do not have every detail about how they were constructed, we base our analysis only on SNLI and QuoraQP.
During the preparation of SNLI, as introduced by Bowman et al. (2015), human workers are presented with "premise scene descriptions" and asked to supply "hypotheses" for each of the three labels (i.e., entailment, neutral and contradiction). However, it is found that some workers are "reusing the same sentence for many different prompts", which might cause SNLI to suffer from selection bias. To validate this, we calculate the percentage of each label versus S2_freq, and the results are shown in Figure 4. We see that the percentages of the three labels are similar when S2_freq is small, but as S2_freq increases, the label is more likely to be entailment.
For the QuoraQP dataset, the providers state that "Our original sampling method returned an imbalanced dataset with many more true examples of duplicate pairs than non-duplicates. Therefore, we supplemented the dataset with negative examples. One source of negative examples were pairs of "related questions" which, although pertaining to similar topics, are not truly semantically equivalent." Our hypothesis is that the way in which negative samples were supplemented is the reason why QuoraQP is so biased. For example, the newly added sentences of "related questions" may appear in the dataset only a limited number of times. This would produce the phenomenon in Figure 2: if two sentences both appear many times, the pair is likely to be duplicated, while if one of them appears only a few times, the pair is likely to be not_duplicated.
We conduct ablation experiments on the datasets where the leakage features are effective, i.e., SNLI, QuoraQP, SICK and ByteDance. The results are reported in Table 2. We can see that S2_freq is more effective in SNLI, and S1_freq plays a more critical role in SICK, while in QuoraQP and ByteDance, S1S2_inter is the most predictive. Based on the experiments and observations, we conclude that existing datasets tend to be biased for various reasons. Further study is required to understand the problem and to prevent bias from being introduced into future research datasets.
In this section, we investigate whether DNN models are unintentionally fitting the bias pattern in addition to the semantic pattern. We train a classical Siamese-LSTM model (the detailed settings are introduced in Section 5.2) on the training set of QuoraQP and make predictions on a synthetic dataset. Interestingly, we find that the results are significantly influenced by the bias pattern.
The synthetic dataset is built in the following way. We extract the distinct sentences from the training set of QuoraQP and pair each sentence with itself, obtaining 517,970 pairs in total. Since the two sentences in each pair are identical, the labels are all duplicated. For each pair, all three leakage features are the same, i.e., the number of occurrences of the sentence in the dataset. If the model perfectly learned the semantic relationships between sentences, the predictions would be substantially the same for all the pairs.
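The construction above can be sketched in a few lines. This is an illustrative sketch only: the function name `build_synthetic_pairs`, the tuple-based input format, and the use of `1` for the duplicated label are assumptions, not details taken from the paper.

```python
def build_synthetic_pairs(training_pairs):
    """Pair every distinct training sentence with itself.

    `training_pairs` is a hypothetical list of (sentence1, sentence2) tuples.
    Each output triple is (sentence, sentence, label), where label 1 is
    assumed to mean "duplicated".
    """
    seen = set()
    synthetic = []
    for s1, s2 in training_pairs:
        for sentence in (s1, s2):
            if sentence not in seen:
                seen.add(sentence)
                synthetic.append((sentence, sentence, 1))
    return synthetic
```

Because every pair is trivially a duplicate, any systematic variation in a model's scores across these pairs cannot come from semantics.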
To illustrate the predicted scores of duplication, we visualize them versus the leakage features in Figure 5, where the boxplot follows the Tukey boxplot style (Frigge et al., 1989). Intriguingly, we find that even though the sentences in each pair are identical, the model still tends to give lower scores of duplication to the pairs with low leakage feature values. This result is consistent with the bias pattern shown in Figure 2, i.e., the data points in the bottom left corner tend to be not_duplicated, compared with the data points in the top right corner, which represent larger values of S1_freq and S2_freq.
The results indicate that the model is unintentionally capturing the undesired bias pattern that only exists in the particular dataset. This has an adverse effect on the generalization performance of the trained models (to be illustrated in Section 5.4).
Given a biased dataset, can we eliminate the bias to train completely unbiased models? Unfortunately, this is very difficult because the bias is related to the labels, and we do not have access to the labels of unselected samples (Zadrozny, 2004). In this paper, we propose to take a step back and define a leakage-neutral distribution, which is closer to the real world than the biased one. We make a few reasonable assumptions about it and about how the biased dataset is generated from it. We demonstrate that, with only the biased dataset, we can train and evaluate models that are unbiased with respect to the leakage-neutral distribution.
Assume that there is a leakage-neutral distribution over four components: the semantic features, the (binary) semantic label, the sampling strategy features, and the (binary) sampling intention. The sampling intention represents whether the dataset providers want to select a positive sample or a negative sample. For example, a positive sampling intention means that the providers want to select a positive sample here.
We assume that samples are drawn independently from the leakage-neutral distribution. If the label matches the sampling intention, the sample is selected into the dataset; otherwise, it is discarded. This operation results in the biased distribution that is observed from the dataset.
We make the following assumptions about the leakage-neutral distribution. The first one is the leakage-neutral assumption: the distribution of the label does not depend on the sampling strategy features. This means that the sampling strategy is independent of the labels, making the leakage-neutral distribution closer to the real world.
The second one is that, given the sampling strategy features, the sampling intention is independent of the semantic features and the label. This means that the sampling strategy features completely control the sampling intentions.
Based on the assumptions above, given a biased dataset, the proposed method works in the following way.
Firstly, we estimate, for every sample in the dataset, the probability of its label given its sampling strategy features under the biased distribution. In practice, this can be achieved by training classifiers and making cross-predictions. Since we do not have access to the true sampling strategy features, we use the leakage features from the graph instead, as they are the reflection of the biased sampling strategy.
Then we can obtain the conditional probability of the sampling intention given the sampling strategy features, using Equation (1) with the prior probability of the label given.
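Afterwards, we weight each sample by the inverse of the probability that the sampling intention matches its label, given its sampling strategy features (note that the labels are needed here). Training and evaluating with these weights gives results that are unbiased with respect to the leakage-neutral distribution.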
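One way to write the two assumptions and the resulting estimate in symbols is sketched below. The notation is assumed for illustration, since the original symbols are not recoverable from the text: $\mathcal{D}$ denotes the leakage-neutral distribution, $\hat{\mathcal{D}}$ the biased one, $X$, $Y$, $Z$, $S$ the semantic features, label, sampling strategy features and sampling intention, $p = P_{\mathcal{D}}(Y=1)$ the prior, and $r(z) = P_{\hat{\mathcal{D}}}(Y=1 \mid Z=z)$ the quantity estimated by cross-prediction. The last two lines follow from the selection rule (a sample is kept when $S = Y$) together with the two assumptions, and may differ in form from the paper's Equation (1).

```latex
% Notation is assumed for illustration; it is not taken from the source.
\begin{align*}
P_{\mathcal{D}}(Y \mid Z) &= P_{\mathcal{D}}(Y)
  && \text{leakage-neutral assumption} \\
P_{\mathcal{D}}(S \mid X, Y, Z) &= P_{\mathcal{D}}(S \mid Z)
  && \text{$Z$ fully controls $S$} \\
P_{\mathcal{D}}(S = 1 \mid Z = z) &= \frac{r(z)\,(1 - p)}{r(z)\,(1 - p) + p\,\bigl(1 - r(z)\bigr)}
  && \text{estimated sampling intention} \\
w(y, z) &= \frac{1}{P_{\mathcal{D}}(S = y \mid Z = z)}
  && \text{weight of a selected sample with label $y$}
\end{align*}
```

As a sanity check on this form, with a balanced prior $p = 0.5$ the third line reduces to $P_{\mathcal{D}}(S = 1 \mid Z = z) = r(z)$.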
The step-by-step procedure for leakage-neutral learning and evaluation is presented in Algorithm 1. Note that our analyses and the proposed method are general enough for a variety of bias, as long as a sampling strategy feature is given.
Algorithm 1: Leakage-neutral Training and Evaluation
Input: the dataset, the number of folds for cross-prediction, and the prior probability.
01. Extract the leakage features from the dataset.
02. Estimate the label probability given the leakage features for all samples by training classifiers and using the cross-prediction strategy over the given number of folds.
03. Calculate the sampling intention probability for all samples according to Equation (1).
04. Obtain the weights for all samples and normalize the mean of the weights.
05. Train and validate models with the training set and validation set respectively, using the weights as the sample weights.
06. Evaluate the models with the testing set, using the weights as the sample weights.
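Steps 03 and 04 of the algorithm can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the clipping constant `eps`, and the closed form used for the sampling intention probability are assumptions. The closed form is what the selection rule (keep a sample when intention equals label) and the two independence assumptions imply; the cross-predicted label probabilities from step 02 are taken as given.

```python
def intention_probability(label_prob, prior):
    """Probability of a positive sampling intention given the features.

    label_prob: cross-predicted P(label = 1 | leakage features) on the
                biased dataset (output of step 02).
    prior:      assumed prior P(label = 1) under the leakage-neutral
                distribution.
    """
    numerator = label_prob * (1.0 - prior)
    return numerator / (numerator + prior * (1.0 - label_prob))


def leakage_neutral_weights(label_probs, labels, prior, eps=1e-6):
    """Inverse-probability weights, normalized to have mean 1 (step 04)."""
    weights = []
    for r, y in zip(label_probs, labels):
        r = min(max(r, eps), 1.0 - eps)  # keep probabilities off 0 and 1
        q = intention_probability(r, prior)
        # Weight by the inverse probability that intention matches the label.
        weights.append(1.0 / (q if y == 1 else 1.0 - q))
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]
```

A sample whose label is unlikely given its leakage features receives a large weight, which counteracts the over-representation of label-revealing structure in the biased dataset.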
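Assuming that we know the conditional probabilities of the sampling intention given the sampling strategy features, and that they are greater than zero for all values of those features, the following theorem shows that we can obtain a loss that is unbiased with respect to the leakage-neutral distribution after using the sample weights.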
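For any classifier and for any loss function, if we weight each sample by the ratio of the overall probability that the sampling intention matches the label to the same probability conditioned on the sample's sampling strategy features, then the expected weighted loss under the biased distribution equals the expected loss under the leakage-neutral distribution.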
The proof is presented in Appendix B.2. Since the numerator of the weight is only a constant that does not affect the models, we can concentrate on the denominator and use its inverse as the weights instead. The loss can be used for both training and evaluation that are unbiased with respect to the leakage-neutral distribution.
In this section, we present the experimental results for leakage-neutral learning on QuoraQP. We demonstrate that the proposed learning framework can mitigate the bias and improve the generalization performance of trained models. Besides, the corresponding evaluation method can serve as a more reliable in-domain benchmark compared with the biased one.
We use QuoraQP as our experimental dataset. We use the same dataset partition as (Wang et al., 2017).
We use the three leakage features for generating the weights. We use Random Forest classifiers to estimate the label probabilities given the leakage features, and take the 100-fold cross-predictions as the estimated values. The prior probability is chosen to keep the proportion of the weights of positive and negative samples unchanged, and the mean of the weights is normalized to 1. The minimum weight of all the samples is , and the maximum weight is .
The model is implemented with a TensorFlow backend. Sequences of the embeddings of word tokens are fed into the LSTM layer with a hidden size of 128. Then the representations of both sentences, as well as the dot product of the representations, go through a two-layer MLP where Batch Normalization (Ioffe and Szegedy, 2015) is applied after every hidden layer. Dropout (Srivastava et al., 2014) with rate 0.5 is applied after the last hidden layer. We use the RMSProp (Tieleman and Hinton, 2012) optimizer to train all the parameters. The learning rate starts at 1e-3 and decays at a fixed rate of 0.2 when performance does not improve on the validation set. We also use a gradient clipping of 5.0. The batch size is set to 256. All the results reported in this section are averages over ten runs using the same hyper-parameters with different random initializations. Our implementation achieves slightly better performance than the results of the original Siamese-LSTM from Wang et al. (2017). (The code and the weights will be published upon acceptance of the paper.)
We initialize our word embeddings with pre-trained GloVe 840B 300D vectors (Pennington et al., 2014), and the embeddings are kept fixed during training. All the sentences are truncated to a maximum of 35 word tokens.
Note that the scale of the weights varies greatly across samples. To prevent the model from oscillating during mini-batch training, we use a sampling strategy for model training, i.e., we sample examples with probabilities proportional to the weights to get the data for every mini-batch.
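The weighted mini-batch sampling described above can be sketched with the standard library. This is an assumption-laden illustration: the function name `sample_minibatch` is hypothetical, and sampling with replacement is an assumption, since the text does not say whether examples can repeat within a batch.

```python
import random


def sample_minibatch(examples, weights, batch_size, rng=random):
    """Draw a mini-batch with probability proportional to each weight.

    Sampling is done with replacement (an assumption). `rng` can be the
    `random` module or a seeded `random.Random` instance.
    """
    return rng.choices(examples, weights=weights, k=batch_size)
```

Sampling by weight keeps every example's contribution to a gradient step on the same scale, whereas multiplying the per-example loss by widely varying weights lets a few heavy examples dominate a batch.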
To evaluate the effectiveness of leakage-neutral learning, we use the following strategy in our experiments. Firstly, we train and validate a model using the data from QuoraQP without any weights. The model is referred to as Biased Model. Then we train and validate a model using the data from QuoraQP with the weights, and the model is referred to as Debiased Model. These two models are evaluated with the following methods.
Testing set evaluation. We evaluate the models with the testing set of QuoraQP. Evaluation without the weights is named Biased Eva, and evaluation with the weights is named Debiased Eva. This shows how the leakage-neutral evaluation proposed in Section 4 affects the evaluation results.
Synthetic dataset evaluation. We evaluate the performance of models with the synthetic dataset introduced in Section 3. A better model is expected to give higher accuracy and to be less affected by the bias pattern.
Cross-dataset evaluation. We evaluate how the models perform on other STS datasets, i.e., MSRP and SICK. We use the entire datasets for evaluation. As the preparation strategies of different datasets are different, cross-dataset evaluations will not reward fitting the selection bias of QuoraQP. Although different datasets may have different contexts, a better model trained with QuoraQP is still expected to perform better.
Among all the evaluation methods, using the testing set for evaluation without weights (Biased Eva) is biased, and we will show that the Debiased Eva is more consistent with the unbiased synthetic dataset evaluation and cross-dataset evaluations.
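The difference between Biased Eva and Debiased Eva amounts to whether the accuracy is computed with or without sample weights. A minimal sketch follows; the function name `weighted_accuracy` is hypothetical, and accuracy is used here only as an example metric.

```python
def weighted_accuracy(predictions, labels, weights=None):
    """Accuracy where each example counts in proportion to its weight.

    With weights=None every example counts equally, which corresponds to
    the unweighted (biased) evaluation.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    correct = sum(w for p, y, w in zip(predictions, labels, weights) if p == y)
    return correct / sum(weights)
```

Passing the leakage-neutral weights of the testing set as `weights` gives the weighted variant; the same predictions can therefore score differently under the two evaluations.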
|Method||Biased Eva||Debiased Eva|
The evaluation results on the testing set of QuoraQP are reported in Table 3. From the accuracy of the method Leakage, we can see that although the influence isn’t completely eliminated, the evaluation result of Debiased Eva is less impacted by the bias pattern in the original distribution. This makes the results more reliable for evaluations.
As for the Biased Model and the Debiased Model, we find that the Biased Model performs significantly better under the Biased Eva. This is the effect of fitting the bias pattern in addition to the semantic pattern, thus taking some extra advantage that cannot be generalized to real-life cases. On the other hand, under the Debiased Eva, we can find that the Debiased Model performs the best.
Table 4 reports the results on the datasets that are not biased to the leakage pattern of QuoraQP. We find that the Debiased Model significantly outperforms the Biased Model on all three datasets. This indicates that the Debiased Model better captures the true semantic similarities of the input sentences. We further visualize the predictions on the synthetic dataset in Figure 6. As illustrated, the predictions are more neutral to the leakage feature.
From the experimental results, we can see that the proposed leakage-neutral training method is effective, as the Debiased Model performs significantly better on the synthetic dataset, MSRP and SICK, showing better generalization. Moreover, the Debiased Eva gives results that are more consistent with the results on unbiased datasets, so it can serve as a more reliable in-domain way to evaluate models trained with QuoraQP. In conclusion, our constructed leakage-neutral distribution is closer to the real-world one than the biased distribution that is directly observed from the given datasets.
In this section, we summarize the related work and distinguish it from our contributions.
Usually, the Inverse Propensity Score (IPS) is used to reduce the selection bias (Schonlau et al., 2009; d'Agostino, 1998), where the propensity score (Rosenbaum and Rubin, 1983) is the probability that a sample will be selected into the dataset. Zadrozny (2004) studies the learning and evaluation of classifiers under sample selection bias, but focuses on the "missing-at-random" (MAR) (Little and Rubin, 2014) problem, where the biasedness depends only on the feature vector.
For NLSM datasets, the selection bias is “not-missing-at-random” (NMAR) (Little and Rubin, 2014), thus we cannot hope to estimate the true propensity scores directly as it requires the labels of unselected samples (Zadrozny, 2004). In this paper, we propose to fit a constructed leakage-neutral distribution, which could be achieved with only the selected samples that we can access.
Although dataset bias is often mentioned, the research community pays less attention to it than to models and algorithms. Torralba and Efros (2011) studied the dataset bias of image recognition datasets and categorized the bias into Selection Bias, Capture Bias and Negative Set Bias. Selection bias is widely studied in the search ranking field as position bias (Wang et al., 2016a, 2018; Joachims et al., 2017). Usually the propensity scores are estimated through online Result Randomization (Joachims et al., 2017).
In the NLP field, Minka and Robertson (2008) studied the selection bias in the LETOR datasets, and found that Reverse BM25 performs unreasonably well due to the selection procedure. Dixon et al. (2018) studied the potential unfairness for toxic comments classification due to unintended bias, and proposed methods to mitigate it by balancing the training dataset with additional data. Gururangan et al. (2018) and Poliak et al. (2018) found that in some NLI datasets, there is biasedness of specific linguistic phenomena.
In this paper, we study the selection bias embodied in the comparing relationships in NLSM datasets. To the best of our knowledge, this is the first study on this kind of selection bias.
In this paper, we take a close look at the selection bias of NLSM datasets and focus on the selection bias embodied in the comparing relationships of sentences. To mitigate the bias, we propose an easy-to-adopt method for leakage-neutral learning and evaluation.
However, there is still much to do to form a clearer picture of this problem. For example, we still do not know the details of how many other NLSM datasets were prepared, and we cannot say to what extent the assumptions in Section 4 hold in QuoraQP or what the relationship is between the leakage-neutral distribution and the real-world distribution. We suggest that providers of future NLSM datasets pay more attention to this problem. Furthermore, they could reveal the sample selection strategy in more detail, and might publish official weights to eliminate the bias.
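Martín Abadi et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4144–4150. AAAI Press.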
For SICK, both entailment_label and relatedness_score are provided. We treat the sentence pairs with relatedness_score above a fixed threshold as duplicated, and the others as not_duplicated. This threshold gives roughly 50% positive pairs and 50% negative pairs.
For ByteDance, since no existing dataset partition is available, we randomly divide the dataset into a training set, a validation set, and a testing set in a ratio of 8:1:1. We use the sentences in English during our experiments.
The BLEU score of both sentences, using n-gram lengths from 1 to 4, giving 4 features in total.
The length difference between the two sentences, as one real-valued feature.
The count and percentage of overlapping words between both sentences, over all words and over just nouns, verbs, adjectives and adverbs, giving 10 features in total.
We list the features we used in method Advanced in Section 2.1. As mentioned above, if we use a node to represent a sentence and add an undirected edge if two sentences are compared in the dataset, the whole dataset can be viewed as a graph as illustrated in Figure 3. To classify the edges in the graph, we use 3 types of graph-based features:
The original and extended leakage features: degrees of both nodes, number of 2-hop and 3-hop paths between the two nodes, and number of 2-hop and 3-hop neighbors of both nodes, giving 8 features in total.
The element-wise product and dot product of Deepwalk (Perozzi et al., 2014) embedding of the two nodes, all together as 65 features.
Here we present the derivation of Equation (1).
By solving the above equation, we have the result in Equation (1). ∎
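The algebra behind this step can be sketched as follows. The notation is assumed for illustration, as the original symbols are not recoverable: $\mathcal{D}$ is the leakage-neutral distribution, $\hat{\mathcal{D}}$ the biased one obtained by keeping samples with $S = Y$, $p = P_{\mathcal{D}}(Y=1)$, $q = P_{\mathcal{D}}(S=1 \mid Z=z)$ and $r = P_{\hat{\mathcal{D}}}(Y=1 \mid Z=z)$. The two assumptions of Section 4 give $P_{\mathcal{D}}(Y=y, S=s \mid Z=z) = P_{\mathcal{D}}(Y=y)\,P_{\mathcal{D}}(S=s \mid Z=z)$, which is used in the second equality.

```latex
% A reconstruction under assumed notation, not the paper's exact derivation.
\begin{align*}
r &= \frac{P_{\mathcal{D}}(Y=1, S=1 \mid Z=z)}
          {P_{\mathcal{D}}(Y=1, S=1 \mid Z=z) + P_{\mathcal{D}}(Y=0, S=0 \mid Z=z)}
   = \frac{p\,q}{p\,q + (1-p)(1-q)} \\
\Rightarrow\quad r\,(1-p) &= q\,\bigl[\,p\,(1-r) + r\,(1-p)\,\bigr] \\
\Rightarrow\quad q &= \frac{r\,(1-p)}{r\,(1-p) + p\,(1-r)}
\end{align*}
```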
Here we present the proof for Theorem 1, i.e., the unbiased expectation theorem.
As illustrated above, by adding specific weights to the samples, we can obtain a loss that is unbiased with respect to the leakage-neutral distribution. The unbiased loss can be used for both training and evaluation.
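A short argument for the unbiasedness claim can be sketched as follows, under assumed notation (the original symbols are not recoverable): $\mathcal{D}$ is the leakage-neutral distribution, $\hat{\mathcal{D}}$ the biased one, $f$ a classifier, $\Delta$ a loss function, and $w(y, z) = P_{\mathcal{D}}(S=Y) / P_{\mathcal{D}}(S=y \mid Z=z)$ the sample weight. The first line uses the selection rule (keep a sample when $S = Y$) and the assumption that $S$ depends only on $Z$; the argument requires $P_{\mathcal{D}}(S=y \mid Z=z) > 0$.

```latex
% A proof sketch under assumed notation, not the paper's exact proof.
\begin{align*}
P_{\hat{\mathcal{D}}}(x, y, z)
  &= \frac{P_{\mathcal{D}}(x, y, z)\,P_{\mathcal{D}}(S=y \mid x, y, z)}{P_{\mathcal{D}}(S=Y)}
   = \frac{P_{\mathcal{D}}(x, y, z)\,P_{\mathcal{D}}(S=y \mid Z=z)}{P_{\mathcal{D}}(S=Y)} \\
\mathbb{E}_{(x,y,z) \sim \hat{\mathcal{D}}}\bigl[\,w(y, z)\,\Delta(f(x), y)\,\bigr]
  &= \sum_{x, y, z} P_{\hat{\mathcal{D}}}(x, y, z)\,
     \frac{P_{\mathcal{D}}(S=Y)}{P_{\mathcal{D}}(S=y \mid Z=z)}\,\Delta(f(x), y) \\
  &= \sum_{x, y, z} P_{\mathcal{D}}(x, y, z)\,\Delta(f(x), y)
   = \mathbb{E}_{(x,y,z) \sim \mathcal{D}}\bigl[\,\Delta(f(x), y)\,\bigr]
\end{align*}
```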