, have shown impressive improvements in various natural language processing (NLP) tasks. With the help of the massive universal knowledge captured by pre-trained language models (LMs) such as BERT, we can solve downstream tasks with less task-specific knowledge; in other words, we may need less labeled data for training. Much recent work has focused on boosting downstream task performance with pre-trained LMs. In this paper, we instead ask: can we use less labeled data with these models when learning downstream NLP tasks?
Active learning approaches such as uncertainty sampling Lewis and Gale (1994) are a straightforward choice for reducing the labeled data needed for training. Uncertainty sampling traverses all unlabeled data to find informative samples, namely those near the decision boundary, where the predictive entropy is large. However, this traversal is very time-consuming and thus cannot be conducted frequently Settles and Craven (2008). A common choice is to perform the sampling process periodically, usually each time another 10% or 20% of the data has been labeled and the model has been trained on it Deng et al. (2018).
We argue that uncertainty sampling at fixed, infrequent intervals is not necessarily the best choice. Performing uncertainty sampling infrequently may lead to the “ineffective sampling” problem: in the early phase of training, the decision boundary changes quickly, so the selected uncertain samples become less effective after several updates of the model. Ideally, uncertainty sampling should be performed very frequently in the early phase of model training.
In this paper, we propose adversarial uncertainty sampling in discrete space (AUSDS) to address the ineffective sampling problem in active sentence learning, aiming to reduce the labeled data needed for sentence prediction. Here, sentence learning refers to learning NLP tasks such as text classification and sequence labeling. We first bring the idea of adversarial attacks Goodfellow et al. (2014); Kurakin et al. (2016) into uncertainty sampling. The rationale is that both uncertainty sampling and adversarial attacks look for uncertain samples near the decision boundary of the current model. Traditional uncertainty sampling finds uncertain samples through a costly traversal of all unlabeled samples (a cost that grows with the size of the unlabeled pool for each sampling run), while adversarial attack algorithms directly find local approximations by simply computing partial derivatives on the current training batch (a cost that grows only with the batch size for each sampling process). The latter is much more efficient given a large unlabeled dataset, so uncertainty sampling can be performed much more frequently.
However, it is non-trivial to perform adversarial uncertainty sampling for sentence learning. We cannot directly perform adversarial attacks by computing adversarial gradients in the sentence space, since that space is discrete. We therefore include a neural encoder that maps unlabeled sentences into a continuous space in which adversarial attacks can be performed. Specifically, we use a pre-trained LM such as BERT as the encoder, which provides a continuous hidden space for representing sentences. We map every unlabeled sentence into the encoding space and then obtain adversarial data points for these sentences in that space. Because not every data point in the encoding space corresponds to an unlabeled sentence, we use the k-nearest neighbor (KNN) algorithm Altman (1992) to find the unlabeled sentences most similar to the adversarial data points (the adversarial samples). Note that KNN search can be very fast on GPUs with an open-source implementation; we also compare the running time in the experiments.
Fig. 1 shows the difference between uncertainty sampling and AUSDS. In addition, we empirically mix some random samples into the uncertainty samples to alleviate the sampling bias discussed by Huang et al. (2010). We deploy AUSDS for active sentence learning and conduct experiments on five datasets across two NLP tasks, namely sequence classification and sequence labeling. Experimental results show that AUSDS outperforms random sampling and uncertainty sampling strategies. Further analyses show that AUSDS achieves the best sampling effectiveness, with a running time comparable to that of random sampling.
Our contributions are summarized as follows:
- We propose AUSDS for active sentence learning, which is the first to introduce adversarial attacks into sentence uncertainty sampling, alleviating the ineffective sampling problem.
- We propose to map sentences into the pre-trained LM encoding space, which makes adversarial uncertainty sampling possible in the discrete sentence space.
- Experimental results demonstrate that the AUSDS-assisted learning framework outperforms strong baselines in sampling effectiveness with acceptable running time.
2 Related Work
This work focuses on reducing the amount of labeled data needed to solve sequence learning tasks with the help of pre-trained LMs. The proposed AUSDS approach is related to two research topics, namely active learning and adversarial attacks.
2.1 Active Learning
Active learning algorithms can be categorized into three scenarios, namely membership query synthesis, stream-based selective sampling, and pool-based active learning Settles (2009). Our work is related to pool-based active learning, which assumes that there is a small set of labeled data and a large pool of unlabeled data available Lewis and Gale (1994). To reduce the label complexity, the learner starts from the labeled data and selects one or more queries from the unlabeled data pool for the annotation, then learns from the new labeled data and repeats.
The pool-based active learning scenario has been studied in many real-world applications, such as text classification Lewis and Gale (1994); Hoi et al. (2006), information extraction Settles and Craven (2008), and image classification Joshi et al. (2009). Among the query strategies of existing active learning approaches, the uncertainty sampling strategy Joshi et al. (2009); Lewis and Gale (1994) is the most popular and widely used. The basic idea of uncertainty sampling is to enumerate the unlabeled samples and compute an uncertainty measure, such as information entropy, for each sample. The enumeration and uncertainty computation make the sampling process costly, so it cannot be performed frequently, which induces the ineffective sampling problem.
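To make the cost of this enumeration concrete, the following is a minimal sketch of entropy-based uncertainty sampling over a pool. It is an illustration under stated assumptions, not the implementation used in this work: the names `uncertainty_sample` and `toy_model` are hypothetical, and the toy classifier merely stands in for a trained model. Every sampling run calls the model once per unlabeled sample.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_sample(
    pool: Sequence[str],
    predict_proba: Callable[[str], Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k pool items with the highest predictive entropy.

    Every item in the pool is scored, so one sampling run costs one model
    call per unlabeled sample.
    """
    scores = [entropy(predict_proba(x)) for x in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical stand-in for a classifier: confidence grows with sentence length.
def toy_model(sentence: str) -> List[float]:
    p = min(0.5 + 0.05 * len(sentence.split()), 0.99)
    return [p, 1.0 - p]


pool = ["good", "not bad at all really", "a fine film", "it was okay I guess maybe"]
picked = uncertainty_sample(pool, toy_model, k=2)
```

With this toy model, the two shortest sentences are the least confidently classified, so they are the ones selected.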
Some works focus on accelerating the costly uncertainty sampling process. Jain et al. (2010) propose a hashing method that performs the sampling process in sub-linear time. Deng et al. (2018) propose to train an adversarial discriminator that selects informative samples directly and avoids computing the rather costly sequence entropy. Nevertheless, these methods are still computationally expensive and cannot be performed frequently, which means the ineffective sampling problem still exists.
2.2 Adversarial Attack
As machine learning models are often vulnerable to adversarial samples, adversarial attacks have been used as an important surrogate for evaluating the robustness of deep learning models before they are deployed Biggio et al. (2013); Szegedy et al. (2013). Existing adversarial attack approaches can be categorized into three groups: one-step gradient-based approaches Goodfellow et al. (2014); Rozsa et al. (2016), iterative methods Kurakin et al. (2016), and optimization-based methods Szegedy et al. (2013).
Inspired by the similar goals of adversarial attacks and uncertainty sampling, in this paper, instead of treating adversarial attacks as a threat, we combine the two approaches to achieve real-time uncertainty sampling. Some works share a related but different idea. Li et al. (2018) introduce active learning strategies into black-box attacks to enhance query efficiency. Zhu and Bento (2017) propose to train Generative Adversarial Networks to generate samples by directly minimizing the distance to the decision boundary, which belongs to the query synthesis scenario and thus differs from ours. Ducoffe and Precioso (2018) also introduce adversarial attacks into active learning by augmenting the training set with adversarial samples of unlabeled data, which differs from our work in that it operates in a continuous space. Note that none of the works above share our problem setting.
3 Adversarial Uncertainty Sampling in Discrete Space
In this section, we introduce Adversarial Uncertainty Sampling in Discrete Space (AUSDS) together with the AUSDS-assisted active sentence learning framework, since the two are strongly coupled. The learning framework consists of two blocks, a training block and a sampling block. They interact frequently at the batch level, aiming at real-time effective sampling (Fig. 2.a).
The framework starts from a training batch. The training block encodes the training samples into latent states and generates adversarial data points based on the latent states and the gradients of the loss. Given the adversarial data points, the sampling block finds the adversarial samples by KNN search over the encoding space and generates the next training batch. The procedure is outlined in Algorithm 1, and some notation can be found in Fig. 2 along with the corresponding components. We split the framework into four stages, namely initialization, training, sampling, and fine-tuning.
The initialization stage corresponds to lines 1-7 in Algorithm 1. As shown in Devlin et al. (2018), the two NLP tasks we consider, sequence classification and sequence labeling, can be solved in an encoder-decoder framework. We first load the pre-trained LM, implemented following Devlin et al. (2018) or Peters et al. (2018), as our encoder. Then we build the decoder and define the latent states and the encoding space according to the downstream task. Note that the decoder differs between the two NLP tasks.
Since the sampling approach requires a basic model to provide a prediction of the decision boundary in the encoding space, we initialize the accumulated labeled data set and train the basic model on it. With the defined encoding space and the well-trained encoder, we can then construct a bidirectional mapper between the unlabeled sequences and their latent states, which means we can easily trace a latent state back to its original textual input. Finally, we initialize the training batch and the fine-tuning counter, which are used in the remaining stages.
The training stage corresponds to lines 9-10 in Algorithm 1. With the defined decoder and the prepared training batch, we train the decoder parameters directly with a cross-entropy loss (Fig. 2.b). Here we fix the encoder, because otherwise the mapper would have to be updated every time the encoder changes, which is costly. The mapper is a bidirectional mapping between the unlabeled sequences and their latent states under a given encoder, as described in Algorithm 1. In other words, it is a memory buffer that holds the bijection between the sequences and the corresponding latent states. Since the encoder is already well-trained, fine-tuning it only infrequently does not noticeably affect the performance of the model. Therefore, we fine-tune the encoder for a fixed number of steps after every fixed number of training steps; both step counts are hyperparameters.
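The mapper described above can be pictured as a small memory buffer. The sketch below is only an illustration under stated assumptions: the class name `SampleMapper`, its methods, and the toy encoder are hypothetical, and the pool is assumed to contain no duplicate sequences. It shows why the buffer has to be rebuilt whenever the encoder changes.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


class SampleMapper:
    """Memory buffer pairing each unlabeled sequence with its latent state."""

    def __init__(self, encode: Callable[[str], Sequence[float]]):
        self._encode = encode
        self.sequences: List[str] = []
        self.latents: List[Vector] = []
        self._index_of: Dict[str, int] = {}

    def rebuild(self, unlabeled: Sequence[str]) -> None:
        """Re-encode the whole pool; needed whenever the encoder changes."""
        self.sequences = list(unlabeled)
        self.latents = [tuple(self._encode(s)) for s in self.sequences]
        self._index_of = {s: i for i, s in enumerate(self.sequences)}

    def latent_of(self, sequence: str) -> Vector:
        """Sequence -> latent state."""
        return self.latents[self._index_of[sequence]]

    def sequence_at(self, index: int) -> str:
        """Position of a latent state -> original text."""
        return self.sequences[index]


# Hypothetical toy encoder: (number of words, number of characters).
def toy_encode(s: str) -> List[float]:
    return [float(len(s.split())), float(len(s))]


mapper = SampleMapper(toy_encode)
mapper.rebuild(["a fine film", "not bad"])
```

Because `rebuild` encodes the entire pool, calling it after every decoder update would dominate the running time, which is consistent with keeping the encoder fixed between fine-tuning phases.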
Then, we perform adversarial attacks on the current model using the gradients of the current batch. The following adversarial attack approaches are considered:
- Fast Gradient Value (FGV) Rozsa et al. (2016): a highly efficient one-step gradient-based approach. The adversarial data points are generated by adding to each latent state the gradient of the cross-entropy loss with respect to that latent state, scaled by a step-size hyperparameter.
- C&W Carlini and Wagner (2017): an optimization-based approach that minimizes the distance between the original data point and the adversarial data point plus a weighted penalty term. The penalty is a manually designed function that is non-positive if and only if the label of the adversarial data point is a specific target label, and the distance is a measure such as the Minkowski distance.
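As a minimal sketch of the one-step attack in the list above, the following computes an FGV update on a single latent state. It assumes a linear softmax decoder so that the gradient has a closed form; the decoder form and the names `fgv_step`, `W`, `b`, and `epsilon` are assumptions made for illustration, not the decoders or notation used in this work.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fgv_step(
    h: Sequence[float],
    label: int,
    weights: Sequence[Sequence[float]],  # one row of weights per class
    bias: Sequence[float],
    epsilon: float,
) -> List[float]:
    """One Fast Gradient Value step on a latent state for a linear decoder.

    For logits = W h + b with cross-entropy loss, the gradient of the loss
    with respect to h is W^T (softmax(logits) - onehot(label)).
    """
    logits = [sum(w * x for w, x in zip(row, h)) + b_c for row, b_c in zip(weights, bias)]
    probs = softmax(logits)
    residual = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
    grad = [
        sum(residual[c] * weights[c][j] for c in range(len(weights)))
        for j in range(len(h))
    ]
    # FGV moves along the raw gradient value (not its sign), scaled by epsilon.
    return [x + epsilon * g for x, g in zip(h, grad)]


W = [[1.0, 0.0], [0.0, 1.0]]  # identity decoder, so the logits equal h
b = [0.0, 0.0]
h = [2.0, 0.0]                # confidently predicted as class 0
h_adv = fgv_step(h, label=0, weights=W, bias=b, epsilon=1.0)
```

In this toy setting the step increases the loss, so the adversarial point is classified less confidently than the original one, that is, it lies closer to the decision boundary.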
The sampling stage corresponds to lines 11-17 in Algorithm 1. In our sentence learning scenario, the adversarial data points may not map back to any unlabeled sample. We therefore perform k-nearest neighbor (KNN) search Altman (1992) to find the unlabeled samples most similar to the generated data points.
We implement the KNN search using Faiss (https://github.com/facebookresearch/faiss) Johnson et al. (2017), an efficient GPU-based similarity search library. The computation cost of KNN search comes from two procedures: constructing the sample mapper and searching for similar latent states. The mapper construction procedure is performed infrequently, as described in Section 3.2. The searching procedure is very efficient (100× faster than generating ) thanks to Faiss. Thus, AUSDS can be performed frequently at the batch level.
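The search step itself can be illustrated with an exhaustive pure-Python stand-in. This sketch deliberately does not use Faiss; it only shows what the search returns, and the function name `knn_search` and the toy vectors are hypothetical. A GPU similarity search library would replace the inner loop in practice.

```python
import math
from typing import List, Sequence


def knn_search(
    queries: Sequence[Sequence[float]],
    latents: Sequence[Sequence[float]],
    k: int,
) -> List[List[int]]:
    """For each query point, return the indices of its k nearest latent
    states under Euclidean distance (exhaustive search)."""
    results = []
    for q in queries:
        order = sorted(range(len(latents)), key=lambda i: math.dist(q, latents[i]))
        results.append(order[:k])
    return results


# Latent states of four unlabeled sentences (toy values).
latents = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
# One adversarial data point that does not coincide with any stored latent state.
adversarial_points = [[1.0, 1.1]]
neighbors = knn_search(adversarial_points, latents, k=2)
```

The returned indices point into the stored latent states, so a sequence-to-latent mapper can translate them back into the original sentences.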
After acquiring adversarial samples using KNN search, we mix them with random samples drawn from the unlabeled data at a fixed ratio, which is a hyperparameter. The motivation for adding random samples is to balance exploration and exploitation, which can alleviate the problem of sampling bias Huang et al. (2010).
Then we perform top-k ranking over the information entropy of the mixed samples. Since the number of mixed samples is comparable to the batch size, the computation cost is acceptable. The retained samples are then labeled by the oracle and added to the current labeled data set as well as the accumulated labeled data set.
Finally, we sample a training batch from the current labeled data set and the accumulated labeled data set at a fixed ratio, which is a hyperparameter. The training samples in the current labeled data set are all close to the current decision boundary, which can induce the problem of sampling bias Huang et al. (2010). Therefore, we also draw from the accumulated labeled data set to balance exploration and exploitation. The details of sampling bias are discussed in Sec. 4.2.2.
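The mixing and ranking steps above can be sketched as follows. This is a sketch under stated assumptions rather than the procedure of Algorithm 1: the function name `select_for_labeling`, the parameter `random_ratio` (interpreted here as the number of random samples per adversarial sample), and the toy classifier are all hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(
    adversarial: Sequence[str],
    pool: Sequence[str],
    predict_proba: Callable[[str], Sequence[float]],
    random_ratio: float,
    top_k: int,
    rng: random.Random,
) -> List[str]:
    """Mix adversarial samples with random ones, then keep the top_k by entropy."""
    n_random = min(len(pool), round(random_ratio * len(adversarial)))
    # dict.fromkeys removes duplicates while preserving order.
    mixed = list(dict.fromkeys(list(adversarial) + rng.sample(list(pool), n_random)))
    # Only the small mixed set is scored, not the whole pool.
    mixed.sort(key=lambda s: entropy(predict_proba(s)), reverse=True)
    return mixed[:top_k]


# Hypothetical stand-in for a classifier: confidence grows with sentence length.
def toy_model(sentence: str) -> List[float]:
    p = min(0.5 + 0.05 * len(sentence.split()), 0.99)
    return [p, 1.0 - p]


pool = ["good", "a fine film", "not bad at all really", "it was okay I guess maybe"]
adversarial = ["a fine film", "not bad at all really"]
chosen = select_for_labeling(
    adversarial, pool, toy_model, random_ratio=0.5, top_k=2, rng=random.Random(0)
)
```

The entropy ranking here touches only a handful of candidates, which is why its cost stays comparable to processing one batch.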
Table 1: Statistics of the datasets.
| Dataset | Task | Size |
|---|---|---|
| SST-2 Socher et al. (2013) | sequence classification | 11.8k sentences, 215k phrases |
| SST-5 Socher et al. (2013) | sequence classification | 11.8k sentences, 215k phrases |
| MRPC Dolan et al. (2004) | sequence classification | 5,801 sentence pairs |
| AG News Zhang et al. (2015) | sequence classification | 12k sentences |
| CoNLL’03 Sang and De Meulder (2003) | sequence labeling | 22k sentences, 300k tokens |
The fine-tuning stage corresponds to lines 18-22 in Algorithm 1. We fine-tune the encoder for a fixed number of steps after every fixed number of training steps, as described in Section 3.2. During fine-tuning, both the encoder and the decoder are trained on the accumulated labeled data set. After fine-tuning, we update the mapper for the subsequent KNN search. The algorithm terminates when the unlabeled text corpus is used up.
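The interplay of the training, sampling, and fine-tuning stages can be summarized by a short control-flow skeleton. It is a sketch under stated assumptions, not a transcription of Algorithm 1: the callbacks and the names `run_schedule` and `finetune_every` are hypothetical, and a fixed number of batches stands in for the real stopping condition (an exhausted unlabeled corpus).

```python
from typing import Callable, List


def run_schedule(
    n_batches: int,
    finetune_every: int,
    train_decoder: Callable[[int], None],
    sample_next_batch: Callable[[int], None],
    finetune_and_rebuild: Callable[[int], None],
) -> None:
    """Alternate training and sampling per batch; fine-tune periodically."""
    counter = 0
    for step in range(n_batches):
        train_decoder(step)              # training stage (encoder frozen)
        sample_next_batch(step)          # sampling stage (attack, KNN, ranking)
        counter += 1
        if counter == finetune_every:
            finetune_and_rebuild(step)   # fine-tuning stage, then mapper update
            counter = 0


log: List[str] = []
run_schedule(
    n_batches=4,
    finetune_every=2,
    train_decoder=lambda s: log.append(f"train {s}"),
    sample_next_batch=lambda s: log.append(f"sample {s}"),
    finetune_and_rebuild=lambda s: log.append(f"finetune {s}"),
)
```

Sampling happens after every batch, while the expensive encoder update and mapper rebuild happen only once per `finetune_every` batches.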
We evaluate the AUSDS-assisted active sentence learning framework on sequence classification and sequence labeling tasks. For the oracle labeler, we directly use the labels provided by the datasets. In all experiments, we report the average over 5 runs with different random seeds to reduce the influence of randomness.
We use five datasets for experiments, namely Stanford Sentiment Treebank (SST-2 / SST-5) Socher et al. (2013), Microsoft Research Paraphrase Corpus (MRPC) Dolan et al. (2004), AG’s News Corpus (AG News) Zhang et al. (2015), and the CoNLL 2003 Named Entity Recognition dataset (CoNLL’03) Sang and De Meulder (2003). The statistics can be found in Table 1. The data split ratios for train, development, and test follow the original settings in those papers. We use accuracy as the metric for sequence classification and F1 score for sequence labeling.
Our aim here is to show that AUSDS can achieve better sampling effectiveness within acceptable time. We compare our framework with two common baseline approaches in NLP active learning, namely random sampling (RM) and entropy-based uncertainty sampling (US). For sequence classification tasks, we use the widely used Max Entropy (ME) Berger et al. (1996) as the uncertainty measure, which is given by:

$$\mathrm{ME}(x) = -\sum_{i=1}^{c} P(y_i \mid x)\,\log P(y_i \mid x),$$

where $c$ is the number of classes. For sequence labeling tasks, we use the total token entropy (TTE) Settles and Craven (2008) as the uncertainty measure, which is given by:

$$\mathrm{TTE}(\mathbf{x}) = -\sum_{t=1}^{N}\sum_{m=1}^{l} P(y_t = m \mid \mathbf{x})\,\log P(y_t = m \mid \mathbf{x}),$$

where $N$ is the sequence length and $l$ is the number of labels.
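For concreteness, the two uncertainty measures can be computed as in the sketch below. The function names are hypothetical, natural logarithms are assumed, and the inputs are assumed to be already-normalized probability distributions (one per sentence for classification, one per token for labeling).

```python
import math
from typing import Sequence


def max_entropy(probs: Sequence[float]) -> float:
    """Entropy of one predicted class distribution (sequence classification)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_token_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Sum of per-token label entropies over a sequence (sequence labeling)."""
    return sum(max_entropy(p) for p in token_probs)


uniform_binary = max_entropy([0.5, 0.5])
# Two tokens: the first is maximally uncertain, the second is certain.
two_tokens = total_token_entropy([[0.5, 0.5], [1.0, 0.0]])
```

Because the total token entropy sums over positions, it grows with sequence length, so longer sequences tend to score higher under this measure.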
We implement the BERT model based on the pytorch-pretrained-BERT repository (https://github.com/huggingface/pytorch-pretrained-BERT) and the ELMo model based on the AllenNLP repository (https://github.com/allenai/allennlp). The model configurations are the same as reported in Devlin et al. (2018); Peters et al. (2018). The implementation of KNN search is introduced in Section 3.3. The accumulated labeled data set is initialized identically for all approaches, using 0.1% of the whole unlabeled data (0.5% for MRPC because the dataset is relatively small). We will release our code with full configurations for reproducibility after acceptance.
4.2 Main Results
4.2.1 Computational Efficiency
AUSDS is computationally more efficient than uncertainty sampling. As described in Section 3, the training block and the sampling block interact frequently at the batch level, so AUSDS can achieve real-time effective sampling. We conduct experiments in the real-time sampling setting, in which the sampling process is performed at the batch level.
Table 2 shows the average sampling cost per sampling step for the different approaches. We observe that uncertainty sampling can hardly work in the real-time sampling setting because of its costly sampling process. Our AUSDS sampling approaches are more than 10x faster than common uncertainty sampling, and the larger the unlabeled data pool is, the more significant the acceleration becomes. Our framework takes slightly longer than the random sampling baseline because of the extra computation for adversarial examples, but it is still fast enough for real-time batch-level sampling. Moreover, the experimental results on sampling effectiveness show that the extra computation is worthwhile, given the clear performance gains on the same amount of labeled data.
Figure 3: The margin of outputs on samples selected by different sampling strategies on SST-5. The margin denotes the difference between the largest and the second-largest output probabilities over the classes. The lower the margin is, the closer the sample is to the decision boundary. Fig. (a) shows the average margin at each sampling step during training. The margins of samples selected by RM and US on the whole unlabeled data are also plotted as references. Fig. (b) shows the margin distribution of samples selected from sampling step 800 to 1000, where the average uncertainty becomes steady. US is omitted in Fig. (b) for better visualization.
4.2.2 Sampling Effectiveness
AUSDS can achieve higher sampling effectiveness than uncertainty sampling because uncertainty sampling suffers from the sampling bias problem. Simply training the model until convergence after each sampling step, which we call the continuous training setting, can easily induce sampling bias Huang et al. (2010) and cannot reflect the informativeness of the selected samples. Sampling bias refers to the bias that uncertainty-based methods exhibit when selecting informative unlabeled examples. In the early phase, the decision boundary of the model is determined by only a small number of labeled examples. The resulting biased decision boundary may lead to ineffective selection of examples: the selected examples may be informative, with high uncertainty, but not representative of the whole unlabeled data. The error accumulates and results in poorer final performance of the model. Delayed uncertainty sampling can also encounter this problem because of the frequent oscillation of the decision boundary in the early phase of training.
Thus we propose another training setting, named training from scratch, for convergence results. In the training from scratch setting, we train models from scratch using the labeled data sampled by different approaches with various label sizes. We argue that this setting is more suitable to measure the sampling efficiency. The results are shown in Table 3. (The results on SST-2 with ELMo as the encoder are demonstrated in our supplemental material to show the generalization ability of our AUSDS to other pre-trained LM encoding space.)
Active learning focuses on training with a limited amount of labeled data by selecting more valuable examples to label. When enough labeled data is available, it makes little difference whether active learning is performed or not. We therefore label at most 10% of the whole training data for each sampling approach. We believe that with less labeled data, the performance gap, namely the difference in sampling effectiveness, is more obvious.
Our framework consistently outperforms the random baselines because it selects more informative samples for identifying the shape of the decision boundary. It also outperforms common uncertainty sampling in most cases under the same label size limits, because the frequent sampling in our approach alleviates the sampling bias issue. From the results on the five standard benchmarks across two NLP tasks, we observe that AUSDS achieves better sampling effectiveness.
To show that our AUSDS framework does not depend heavily on BERT, we conduct experiments on SST-2 with ELMo as the encoder, which has a totally different network structure. The results in Table 4 show that in this setting our AUSDS framework still achieves higher sampling effectiveness, while the original uncertainty sampling suffers from a more severe sampling bias problem. These results provide supporting evidence that our framework generalizes to the encoding spaces of other pre-trained LMs.
4.2.3 Samples Uncertainty
AUSDS can indeed select examples with higher uncertainty. We plot the margins of outputs on samples selected with different sampling strategies on the SST-5 dataset in Fig. 3. We use the margin as a measure of the distance to the decision boundary: a lower margin indicates a position closer to the decision boundary. As shown in Fig. 3(a), the samples selected by our AUSDS sampling strategies with different attack approaches achieve lower average margins during the entire sampling process. We pool the samples selected from step 800 to 1000 to estimate the margin distribution, as shown in Fig. 3(b). It shows that our AUSDS sampling strategies are better at capturing samples with higher uncertainty, as their margin distributions lie further to the left. Uncertainty sampling performed on the whole unlabeled data obtains the most uncertain samples. However, it is very time-consuming and is outperformed by our proposed AUSDS in the above experiments.
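The margin used in this analysis is straightforward to compute from the output probabilities. The following minimal sketch uses hypothetical function names and made-up probability vectors purely for illustration; it does not reproduce any value from the figures.

```python
from typing import Sequence


def margin(probs: Sequence[float]) -> float:
    """Difference between the largest and second-largest class probabilities."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def average_margin(batch: Sequence[Sequence[float]]) -> float:
    """Mean margin over the samples selected in one sampling step."""
    return sum(margin(p) for p in batch) / len(batch)


# Illustrative five-class outputs: one confident, one near the decision boundary.
confident = [0.80, 0.05, 0.05, 0.05, 0.05]
uncertain = [0.30, 0.28, 0.20, 0.12, 0.10]
avg = average_margin([confident, uncertain])
```

A strategy that selects samples closer to the decision boundary yields a smaller average margin at each sampling step.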
In short, the sampling speed comparison shows that AUSDS has time efficiency comparable to the random baseline and exposes the expensive computation cost of uncertainty sampling. We revealed the existence of sampling bias and showed, with experiments in the training from scratch setting, that AUSDS can alleviate this problem. The larger performance gains under low label size limits indirectly support our hypothesis, since the sampling bias problem is more severe in the early phase.
Uncertainty sampling can be an effective way of reducing the labeled data size of sentence learning. However, uncertainty sampling with latency may lead to the ineffective sampling problem. To address this problem, in this paper, we propose the adversarial uncertainty sampling in discrete space for active sentence learning. By introducing the adversarial attack into uncertainty sampling and mapping discrete sentences into pre-trained LM space, the proposed AUSDS is more efficient than traditional uncertainty sampling. Experimental results on five datasets show that our proposed approach outperforms strong baselines in most cases and can achieve better sampling effectiveness.
- An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46 (3), pp. 175–185. Cited by: §1, §3.3.
- A maximum entropy approach to natural language processing. Computational linguistics 22 (1), pp. 39–71. Cited by: §4.1.
- Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Cited by: §2.2.
- Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: 3rd item.
- Adversarial active learning for sequences labeling and generation.. In IJCAI, pp. 4012–4018. Cited by: §1, §2.1.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3.1, §4.1.
- Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the 20th international conference on Computational Linguistics, pp. 350. Cited by: Table 1, §4.1.
- Adversarial active learning for deep networks: a margin based approach. arXiv preprint arXiv:1802.09841. Cited by: §2.2.
- Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §2.2.
- Large-scale text categorization by batch mode active learning. In Proceedings of the 15th international conference on World Wide Web, pp. 633–642. Cited by: §2.1.
- Active learning by querying informative and representative examples. In Advances in neural information processing systems, pp. 892–900. Cited by: §1, §3.3, §3.3, §4.2.2.
- Hashing hyperplane queries to near points with applications to large-scale active learning. In Advances in Neural Information Processing Systems, pp. 928–936. Cited by: §2.1.
- Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §3.3.
- Multi-class active learning for image classification. In , pp. 2372–2379. Cited by: §2.1.
- Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §1, §2.2.
- A sequential algorithm for training text classifiers. In SIGIR’94, pp. 3–12. Cited by: §1, §2.1, §2.1.
- Query-efficient black-box attack by active learning. arXiv preprint arXiv:1809.04913. Cited by: §2.2.
- DeepFool: a simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: 2nd item.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §1, §3.1, §4.1.
- Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: §1.
- Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 25–32. Cited by: §2.2, 1st item.
- Introduction to the conll-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Cited by: Table 1, §4.1.
- An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the conference on empirical methods in natural language processing, pp. 1070–1079. Cited by: §1, §2.1, §4.1.
- Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §2.1.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: Table 1, §4.1.
- Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §2.2.
- Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: Table 1, §4.1.
- Generative adversarial active learning. arXiv preprint arXiv:1702.07956. Cited by: §2.2.