An effective approach to semi-supervised learning has long been a goal for the NLP community, as unlabeled data tends to be plentiful compared to labeled data. Early work emphasized using unlabeled data drawn from the same distribution as the labeled data (Nigam et al., 2000), but larger and more reliable gains have been obtained by using contextual embeddings trained with a language modeling (LM) objective on massive amounts of text from domains such as Wikipedia or news (Peters et al., 2018a; Devlin et al., 2019; Radford et al., 2018; Howard and Ruder, 2018). The latter approaches play to the strengths of high-resource settings (e.g., access to web-scale corpora and powerful machines), but their computational and data requirements can make them less useful in resource-limited environments. In this paper, we instead focus on the low-resource setting (§2.1), and develop a lightweight approach to pretraining for semi-supervised text classification.
Our model, which we call vampire, combines a variational autoencoder (VAE) approach to document modeling (Kingma and Welling, 2013; Miao et al., 2016; Srivastava and Sutton, 2017) with insights from LM pretraining (Peters et al., 2018a). By operating on a bag-of-words representation, we avoid the time complexity and difficulty of training a sequence-to-sequence VAE (Bowman et al., 2016; Xu et al., 2017; Yang et al., 2017) while retaining the freedom to use a multi-layer encoder that can learn useful representations for downstream tasks. Because vampire ignores sequential information, it leads to models that are much cheaper to train, and offers strong performance when the amount of labeled data is small. Finally, because vampire is a descendant of topic models, we are able to explore model selection by topic coherence, rather than validation-set perplexity, which results in better downstream classification performance (§6.1).
In order to evaluate the effectiveness of our method, we experiment with four text classification datasets. We compare our approach to a traditional semi-supervised baseline (self-training), alternative representation learning techniques that have access to the in-domain data, and the full-scale alternative of using large language models trained on out-of-domain data, optionally fine-tuned to the task domain.
Our results demonstrate that effective semi-supervised learning is achievable for limited-resource settings, without the need for computationally demanding sequence-based models. While we observe that fine-tuning a pretrained BERT model to the domain provides the best results, this depends on the existence of such a model in the relevant language, as well as GPUs to fine-tune it. When this is not an option, our model offers equivalent or superior performance to the alternatives with minimal computational requirements, especially when working with limited amounts of labeled data.
The major contributions of this paper are:
- We demonstrate experimentally that our method is an efficient and effective approach to semi-supervised text classification when data and computation are limited (§5).
- We confirm that fine-tuning is essential when using contextual embeddings for document classification, and provide a summary of practical advice for researchers wishing to use unlabeled data in semi-supervised text classification (§8).
- We release code to pretrain variational models on unlabeled data and use learned representations in downstream tasks (http://github.com/allenai/vampire).
2.1 Resource-limited Environments
In this paper, we are interested in the low-resource setting, which entails limited access to computation, labels, and out-of-domain data. Labeled data can be obtained cheaply for some tasks, but for others, labels may require expensive and time-consuming human annotations, possibly from domain experts, which will limit their availability.
While there is a huge amount of unlabeled text available for some languages, such as English, this scale of data is not available for all languages. In-domain data availability, of course, varies by domain. For many researchers, especially outside of STEM fields, computation may also be a scarce resource, such that training contextual embeddings from scratch, or even incorporating them into a model, could be prohibitively expensive.
Moreover, even when such pretrained models are available, they inevitably come with potentially undesirable biases baked in, based on the data on which they were trained (Recasens et al., 2013; Bolukbasi et al., 2016; Zhao et al., 2019). Particularly for social science applications, it may be preferable to exclude such confounders by only working with in-domain or curated data.
Given these constraints and limitations, we seek an approach to semi-supervised learning that can leverage in-domain unlabeled data, achieve high accuracy with only a handful of labeled instances, and can run efficiently on a CPU.
2.2 Semi-supervised Learning
, and representation learning using generative models or word vectors (Mikolov et al., 2013; Pennington et al., 2014). Contextualized embeddings have recently emerged as a powerful way to use out-of-domain data (Peters et al., 2018a; Radford et al., 2018), but training these large models requires a massive amount of appropriate data (typically on the order of hundreds of millions of words), and industry-scale computational resources (hundreds of hours on multiple GPUs). (For example, ULMFiT was trained on 100 million words, and BERT used 3.3 billion.) While many pretrained models have been made available, they are unlikely to cover every application, especially for rare languages.
There have also been attempts to leverage VAEs for semi-supervised learning in NLP, mostly in the form of sequence-to-sequence models (Xu et al., 2017; Yang et al., 2017), which use sequence-based encoders and decoders (see §3). These papers report strong performance, but there are many open questions which necessitate further investigation. First, given the reported difficulty of training sequence-to-sequence VAEs (Bowman et al., 2016), it is questionable whether such an approach is useful in practice. Moreover, it is unclear if such complex models (which are expensive to train) are actually required for good performance on tasks such as text classification.
In this work, we assume that we have $L$ documents, $\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{L}$, with observed categorical labels $y \in \mathcal{Y}$. We also assume access to a larger set of $U$ documents drawn from the same distribution, but for which the labels are unobserved, i.e., $\mathcal{D}_U = \{x_i\}_{i=L+1}^{U+L}$. Our primary goal is to learn a probabilistic classifier, $p(y \mid x)$.
Our approach heavily borrows from past work on VAEs (Kingma and Welling, 2013; Miao et al., 2016; Srivastava and Sutton, 2017), which we adapt to semi-supervised text classification (see Figure 1). We do so by pretraining the document model on unlabeled data (§3.1), and then using learned representations in a downstream classifier (§3.3). The downstream classifier makes use of multiple internal states of the pretrained document model, as in Peters et al. (2018b). We also explore how to best do model selection in a way that benefits the downstream task (§3.2).
3.1 Unsupervised Pretraining
In order to learn useful representations, we initially ignore labels, and assume each document is generated from a latent variable, $z$. The functions learned in estimating this model then provide representations which are used as features in supervised learning.
Using a variational autoencoder for approximate Bayesian inference, we simultaneously learn an encoder, which maps from the observed text to an approximate posterior $q(z \mid x)$, and a decoder, $p(x \mid z)$, which reconstructs the text from the latent representation. In practice, we instantiate both the encoder and decoder as neural networks and assume that the encoder maps to a normally distributed posterior, i.e., for document $i$,
$$q(z_i \mid x_i) = \mathcal{N}\left(z_i \mid f_\mu(x_i), f_\sigma(x_i)\right)$$
$$p(x_i \mid z_i) = f_d(z_i)$$
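Using standard principles of variational inference, we derive a variational bound on the marginal log-likelihood of the observed data,
$$\mathcal{B}(x_i) = \mathbb{E}_{q(z_i \mid x_i)}\left[\log p(x_i \mid z_i)\right] - \mathrm{KL}\left[q(z_i \mid x_i) \,\|\, p(z)\right] \tag{3}$$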
Intuitively, the first term in the bound can be thought of as a reconstruction loss, ensuring that generated words are similar to the original document. The second term, the KL divergence, encourages the variational approximation to be close to the assumed prior, $p(z)$, which we take to be a spherical normal distribution.
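Using the reparameterization trick (Kingma and Welling, 2013; Rezende et al., 2014), we replace the expectation with a single-sample approximation (we leave experimentation with multi-sample approximation, e.g., importance sampling, to future work), i.e.,
$$\mathcal{B}(x_i) \approx \log p\left(x_i \mid z_i^{(s)}\right) - \mathrm{KL}\left[q(z_i \mid x_i) \,\|\, p(z)\right]$$
$$z_i^{(s)} = f_\mu(x_i) + f_\sigma(x_i) \cdot \epsilon^{(s)}$$
where $\epsilon^{(s)} \sim \mathcal{N}(0, I)$ is sampled from an independent normal. All parameters can then be optimized simultaneously by performing stochastic gradient ascent on the variational bound.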
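As a worked illustration of the KL term in the bound: assuming (this parameterization is an assumption made here for concreteness) that the approximate posterior is a diagonal Gaussian with mean vector $\mu$ and per-dimension standard deviations $\sigma$ over $K$ latent dimensions, the divergence from a spherical normal prior has the standard closed form below. Under that assumption, only the reconstruction term needs to be estimated by sampling.

```latex
\mathrm{KL}\left[\mathcal{N}\left(\mu, \mathrm{diag}(\sigma^2)\right) \,\|\, \mathcal{N}(0, I)\right]
  = \frac{1}{2} \sum_{k=1}^{K} \left( \sigma_k^2 + \mu_k^2 - 1 - \log \sigma_k^2 \right)
```

The expression is zero exactly when $\mu = 0$ and $\sigma = 1$ in every dimension, i.e., when the posterior coincides with the prior.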
A powerful way of encoding and decoding text is to use sequence models. That is, $f_\mu(x)$ and $f_\sigma(x)$ would map from a sequence of tokens to a pair of vectors, $\mu$ and $\sigma$, and $f_d(z)$ would similarly decode from $z$ to a sequence of tokens, using recurrent, convolutional, or attention-based networks. Some authors have adopted this approach (Bowman et al., 2016; Xu et al., 2017; Yang et al., 2017), but as discussed above (§2.2), it has a number of disadvantages.
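In this paper, we adopt a more lightweight and directly interpretable approach, and work with word frequencies instead of word sequences. Using the same basic structure as Miao et al. (2016) but employing a softmax in the decoder, we encode $f_\mu(\cdot)$ and $f_\sigma(\cdot)$ with multi-layer feed forward neural networks operating on an input vector of word counts, $c_i$:
$$\begin{aligned}
c_i &= \mathrm{counts}(x_i) \\
h_i &= \mathrm{MLP}(c_i) \\
\mu_i &= f_\mu(x_i) = W_\mu h_i + b_\mu \\
\sigma_i &= f_\sigma(x_i) = \exp\left(W_\sigma h_i + b_\sigma\right) \\
z_i^{(s)} &= \mu_i + \sigma_i \cdot \epsilon^{(s)}
\end{aligned}$$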
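For a decoder, we use the following form, which reconstructs the input in terms of topics (coherent distributions over the vocabulary):
$$\begin{aligned}
\theta_i &= \mathrm{softmax}\left(z_i^{(s)}\right) \\
\eta_i &= \mathrm{softmax}\left(b + B\theta_i\right) \\
\log p\left(x_i \mid z_i^{(s)}\right) &= \sum_{j=1}^{V} c_{ij} \log \eta_{ij}
\end{aligned}$$
where $j$ ranges over the vocabulary.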
By placing a softmax on $z$, we can interpret $\theta$ as a distribution over latent topics, as in a topic model (Blei et al., 2003), and $B$ as representing positive and negative topical deviations from a background $b$. This form (essentially a unigram LM) allows for much more efficient inference on $z$, compared to sequence-based encoders and decoders.
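The encoder and decoder described above can be sketched end to end for a single document. The following is a minimal pure-Python illustration, not the released implementation: the single ReLU hidden layer, the parameter names (`W_h`, `W_mu`, `W_sigma`, `B`, `b`), the initialization scale, and the toy dimensions are all assumptions chosen for readability. It computes the single-sample estimate of the bound (reconstruction log-likelihood minus the closed-form KL to a spherical normal prior) together with the document's topic distribution.

```python
import math
import random


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, x):
    # W is a list of rows; returns W x + b.
    return [sum(w * a for w, a in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_params(vocab_size, hidden, topics, rng, scale=0.1):
    # Hypothetical initialization: small Gaussian weights, zero biases.
    def mat(rows, cols):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_h": mat(hidden, vocab_size), "b_h": [0.0] * hidden,
        "W_mu": mat(topics, hidden), "b_mu": [0.0] * topics,
        "W_sigma": mat(topics, hidden), "b_sigma": [0.0] * topics,
        "B": mat(vocab_size, topics), "b": [0.0] * vocab_size,
    }


def elbo_single_sample(counts, params, rng):
    """Single-sample estimate of the variational bound for one document."""
    # Encoder: word counts -> hidden state -> mean and log standard deviation.
    h = [max(0.0, a) for a in affine(params["W_h"], params["b_h"], counts)]
    mu = affine(params["W_mu"], params["b_mu"], h)
    log_sigma = affine(params["W_sigma"], params["b_sigma"], h)
    sigma = [math.exp(s) for s in log_sigma]
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    # Decoder: topic proportions, then a distribution over the vocabulary.
    theta = softmax(z)
    eta = softmax(affine(params["B"], params["b"], theta))
    recon = sum(c * math.log(p) for c, p in zip(counts, eta))
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    kl = 0.5 * sum(s * s + m * m - 1.0 - 2.0 * ls
                   for m, s, ls in zip(mu, sigma, log_sigma))
    return recon - kl, theta


rng = random.Random(0)
params = init_params(vocab_size=6, hidden=4, topics=3, rng=rng)
bound, theta = elbo_single_sample([2, 0, 1, 0, 0, 3], params, rng)
```

Because every step is a matrix product or an element-wise operation on fixed-size vectors, the cost per document does not depend on document length, which is the efficiency argument made in the text.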
3.2 Model Selection via Topic Coherence
Because our pretraining ignores document labels, it is not obvious that optimizing it to convergence will produce the best representations for downstream classification. When pretraining using a LM objective, models are typically trained until model fit stops improving (i.e., perplexity on validation data). In our case, however, $\theta_i$ has a natural interpretation as the distribution (for document $i$) over the latent “topics” learned by the model ($B$). As such, an alternative is to use the quality of the topics as a criterion for early stopping.
It has repeatedly been observed that different types of topic models offer a trade-off between perplexity and topic quality (Chang et al., 2009; Srivastava and Sutton, 2017). Several methods for automatically evaluating topic coherence have been proposed (Newman et al., 2010; Mimno et al., 2011), such as normalized pointwise mutual information (NPMI), which Lau et al. (2014) found to be among the most strongly correlated with human judgement. As such, we consider using either log likelihood or NPMI as the stopping criterion for vampire pretraining (§6.1), and evaluate them in terms of which leads to the better downstream classifier.
NPMI measures the probability that two words collocate in an external corpus (in our case, the validation data). For each topic $t$ in $B$, we collect the top ten most probable words and compute NPMI between all pairs:
$$\mathrm{NPMI}(t) = \sum_{i, j \le 10;\; j \ne i} \frac{\log \frac{P(t_i, t_j)}{P(t_i)\,P(t_j)}}{-\log P(t_i, t_j)}$$
We then arrive at a global NPMI for $B$ by averaging the NPMIs across all topics. We evaluate NPMI at the end of each epoch during pretraining, and stop training when NPMI has stopped increasing for a pre-defined number of epochs.
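The NPMI-based stopping rule described above can be sketched as follows. This is an illustrative version under several stated assumptions: probabilities are estimated from document-level co-occurrence in the reference corpus, pair scores are averaged rather than summed, a pair that never co-occurs is scored -1, and a pair that co-occurs in every document is scored 1 (the limiting value). The function names and the patience rule are hypothetical.

```python
import math
from itertools import combinations


def topic_npmi(top_words, reference_docs):
    """Mean pairwise NPMI of a topic's top words, from document co-occurrence."""
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # assumed convention: words never co-occur
        elif p12 == 1.0:
            scores.append(1.0)   # assumed convention: limiting value
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


def global_npmi(topics, reference_docs):
    """Average of the per-topic scores across all topics."""
    return sum(topic_npmi(t, reference_docs) for t in topics) / len(topics)


def should_stop(npmi_history, patience):
    """True once NPMI has not improved during the last `patience` epochs."""
    if len(npmi_history) <= patience:
        return False
    best_before = max(npmi_history[:-patience])
    return max(npmi_history[-patience:]) <= best_before


docs = [["good", "film"], ["good", "film"], ["bad", "plot"], ["bad", "plot"]]
coherent = topic_npmi(["good", "film"], docs)
incoherent = topic_npmi(["good", "plot"], docs)
```

In this toy corpus, "good" and "film" always appear together, so their topic scores at the top of the range, while "good" and "plot" never co-occur and score at the bottom.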
3.3 Using a Pretrained VAE for Text Classification
Kingma et al. (2014) proposed using the latent variable of an unsupervised VAE as features in a downstream model for classifying images. However, work on pretraining for NLP, such as Peters et al. (2018a), found that LMs encode different information in different layers, each of which may be more or less useful for certain tasks. Here, for an $n$-layer MLP encoder on word counts $c_i$, we build on that idea, and use as representations a weighted sum over $\theta_i$ and the internal states of the MLP, $h_i^{(k)}$, with weights to be learned by the downstream classifier. (We also experimented with the joint training and combined approaches discussed in Kingma et al. (2014), but found that neither of these reliably improved performance over our pretraining approach.)
That is, for any sequence-to-vector encoder, $f_{s2v}(x)$, we propose to augment the vector representations for each document by concatenating them with a weighted combination of the internal states of our variational encoder (Peters et al., 2018a). We can then train a supervised classifier on the weighted combination,
$$p(y_i \mid x_i) = f_c\left(\left[f_{s2v}(x_i);\ \lambda_0 \theta_i + \sum_{k=1}^{n} \lambda_k h_i^{(k)}\right]\right)$$
where $f_c$ is a neural classifier and $\{\lambda_0, \ldots, \lambda_n\}$ are softmax-normalized trainable parameters.
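The weighted combination can be illustrated with a small sketch. It assumes, for simplicity, that the topic distribution and every internal encoder state have the same dimensionality so that they can be summed directly; the function names `scalar_mix` and `classifier_input` are hypothetical, and in a real model the raw weights would be trained jointly with the classifier.

```python
import math


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, raw_weights):
    """Weighted sum of equally sized vectors with softmax-normalized weights."""
    lam = softmax(raw_weights)
    dim = len(layers[0])
    return [sum(l * layer[d] for l, layer in zip(lam, layers)) for d in range(dim)]


def classifier_input(s2v_vector, theta, hidden_states, raw_weights):
    """Concatenate the sequence-to-vector encoding with the mixed VAE states."""
    mixed = scalar_mix([theta] + hidden_states, raw_weights)
    return s2v_vector + mixed  # list concatenation


mixed = scalar_mix([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
features = classifier_input([0.3, 0.7, 0.1], [0.9, 0.1], [[0.2, 0.4]], [2.0, 0.0])
```

With equal raw weights the mix is a plain average of the layers; initializing one raw weight higher than the others biases the classifier toward that layer at the start of training.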
In all cases, we optimize models using Adam (Kingma and Ba, 2014). In order to prevent divergence during pretraining, we make use of a batch-norm layer on the reconstruction of $x$ (Ioffe and Szegedy, 2015). We also use KL-annealing (Bowman et al., 2016), placing a scalar weight on the KL divergence term in Eq. (3), which we gradually increase from zero to one. Because our model consists entirely of feedforward neural networks, it is easily parallelized, and can run efficiently on either CPUs or GPUs.
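KL-annealing can be sketched as a scalar schedule applied to the KL term of the loss. The linear ramp and the `warmup_steps` parameter below are assumptions for illustration; the text only states that the weight increases gradually from zero to one.

```python
def kl_weight(step, warmup_steps):
    """Linear ramp from 0 to 1 over `warmup_steps` updates (assumed schedule)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def annealed_loss(recon_log_lik, kl, step, warmup_steps):
    """Negative bound with the KL term scaled by the current annealing weight."""
    return -(recon_log_lik - kl_weight(step, warmup_steps) * kl)


early = annealed_loss(recon_log_lik=-120.0, kl=4.0, step=0, warmup_steps=1000)
late = annealed_loss(recon_log_lik=-120.0, kl=4.0, step=1000, warmup_steps=1000)
```

Early in training the loss is dominated by reconstruction, which lets the encoder learn informative latent codes before the prior-matching pressure is applied in full.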
4 Experimental Setup
We evaluate the performance of our approach on four text classification tasks, as we vary the amount of labeled data, from 200 to 10,000 instances. In all cases, we assume the existence of about 75,000 to 125,000 unlabeled in-domain examples, which come from the union of the unused training data and any additional unlabeled data provided by the corpus. Because we are working with a small amount of labeled data, we run each experiment with five random seeds, each with a different sample of labeled training instances, and report the mean performance on test data.
4.1 Datasets and Preprocessing
We experiment with text classification datasets that span a variety of label types. The datasets we use are the familiar AG News (Zhang et al., 2015), imdb (Maas et al., 2011), and Yahoo! Answers (Chang et al., 2008) datasets, as well as a dataset of tweets labeled in terms of four Hatespeech categories (Founta et al., 2018). Summary statistics are presented in Table 1. In all cases, we either use the official test set, or take a random stratified sample of 25,000 documents as a test set. We also sample 5,000 instances as a validation set.
We tokenize documents with spaCy, and use up to 400 tokens for sequence encoding ($f_{s2v}(x)$). For vampire pretraining, we restrict the vocabulary to the 30,000 most common words in the dataset, after excluding tokens shorter than three characters, those with digits or punctuation, and stopwords (http://snowball.tartarus.org/algorithms/english/stop.txt). We leave the vocabulary for downstream classification unrestricted.
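The vocabulary restriction for pretraining can be sketched as below. This is a simplified illustration, not the released preprocessing code: documents are assumed to be already tokenized, `str.isalpha()` is used as an approximation of "no digits or punctuation", stopwords are matched case-insensitively, and the helper names are hypothetical.

```python
from collections import Counter


def keep_token(token, stopwords):
    # At least three characters, alphabetic only, and not a stopword.
    return len(token) >= 3 and token.isalpha() and token.lower() not in stopwords


def build_vocab(tokenized_docs, stopwords, max_size=30000):
    """Most frequent surviving tokens, in descending order of frequency."""
    freq = Counter(t for doc in tokenized_docs for t in doc
                   if keep_token(t, stopwords))
    return [word for word, _ in freq.most_common(max_size)]


def to_counts(tokens, vocab):
    """Bag-of-words count vector over the restricted vocabulary."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = [0] * len(vocab)
    for t in tokens:
        if t in index:
            counts[index[t]] += 1
    return counts


docs = [["The", "movie", "was", "great", "!!", "movie", "2019", "ok"],
        ["great", "movie", "plot-twist"]]
vocab = build_vocab(docs, stopwords={"the", "was"})
vector = to_counts(["movie", "great", "movie", "unseen"], vocab)
```

Tokens outside the restricted vocabulary are simply dropped from the count vector, so the input dimensionality of the encoder is fixed by `max_size`.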
4.2 vampire Architecture
In order to find reasonable hyperparameters for vampire, we utilize a random search strategy for pretraining. For each dataset, we take the model with the best NPMI for use in the downstream classifiers. We detail sampling bounds and final assignments for each hyperparameter in Table 5 in Appendix A.1.
4.3 Downstream Classifiers
For all experiments we make use of the Deep Averaging Network (DAN) architecture (Iyyer et al., 2015) as our baseline sequence-to-vector encoder, $f_{s2v}(x)$. That is, embeddings corresponding to each token are summed and passed through a multi-layer perceptron,
$$p(y_i \mid x_i) = \mathrm{MLP}\left(\frac{1}{|x_i|} \sum_{j=1}^{|x_i|} E(x_i)_j\right)$$
where $E(x)$ converts a sequence of tokens to a sequence of vectors, using randomly initialized vectors, off-the-shelf GloVe embeddings (Pennington et al., 2014), or contextual embeddings.
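To incorporate the document representations learned by vampire in a downstream classifier, we concatenate them with the average of randomly initialized trainable embeddings, i.e.,
$$p(y_i \mid x_i) = \mathrm{MLP}\left(\left[\frac{1}{|x_i|} \sum_{j=1}^{|x_i|} E(x_i)_j;\ \lambda_0 \theta_i + \sum_{k=1}^{n} \lambda_k h_i^{(k)}\right]\right)$$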
Preliminary experiments found that DANs with one-layer MLPs and moderate dropout provide more reliable performance on validation data than more expressive models, such as CNNs or LSTMs, with less hyperparameter tuning, especially when working with few labeled instances (details in Appendix A.2).
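A Deep Averaging Network with a one-layer MLP, as described above, can be sketched in a few lines. The version below is a toy forward pass only: the ReLU nonlinearity, the hand-set weights, and the function names are assumptions for illustration, and dropout and training are omitted.

```python
import math


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, x):
    # W is a list of rows; returns W x + b.
    return [sum(w * a for w, a in zip(row, x)) + bi for row, bi in zip(W, b)]


def dan_predict(token_ids, embeddings, W_h, b_h, W_out, b_out):
    """Average token embeddings, apply one hidden layer, return class probabilities."""
    dim = len(embeddings[0])
    avg = [sum(embeddings[t][d] for t in token_ids) / len(token_ids)
           for d in range(dim)]
    hidden = [max(0.0, a) for a in affine(W_h, b_h, avg)]
    return softmax(affine(W_out, b_out, hidden))


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d embedding table
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = dan_predict([0, 0, 1], embeddings,
                    W_h=identity, b_h=[0.0, 0.0],
                    W_out=[[1.0, -1.0], [-1.0, 1.0]], b_out=[0.0, 0.0])
```

Because the averaging step discards word order, the model has few parameters beyond the embedding table, which is one plausible reason it needs little tuning when labeled data is scarce.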
4.4 Resources and Baselines
In these experiments, we consider baselines for both low-resource and high-resource settings, where the high-resource baselines have access to greater computational resources and either a massive amount of unlabeled data or a pretrained model, such as ELMo or BERT. (As discussed above, we consider these models to be representative of the high-resource setting, both because they were computationally intensive to train, and because they were made possible by the huge amount of English text that is available online.)
In the low-resource setting we assume that computational resources are at a premium, so we are limited to lightweight approaches such as vampire, which can run efficiently on a CPU. As baselines, we consider a) a purely supervised model, with randomly initialized 50-dimensional embeddings and no access to unlabeled data; b) the same model initialized with 300-dimensional GloVe vectors, pretrained on 840 billion words (http://nlp.stanford.edu/projects/glove/); c) 300-dimensional GloVe vectors trained on only in-domain data; and d) self-training, which has access to the in-domain unlabeled data. For self-training, we iterate over training a model, predicting labels on all unlabeled instances, and adding to the training set all unlabeled instances whose label is predicted with high confidence, repeating this up to five times and using the model with highest validation accuracy. On each iteration, the threshold for a given label is equal to the 90th percentile of predicted probabilities for validation instances with the corresponding label.
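The per-label confidence threshold used in self-training can be sketched as follows. Two points are assumptions made for this illustration: "validation instances with the corresponding label" is read as instances whose gold label matches, and the percentile uses linear interpolation between order statistics. The function names are hypothetical.

```python
import math


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def label_thresholds(val_probs, val_labels, q=90):
    """For each label, the q-th percentile of its predicted probability
    over validation instances carrying that gold label."""
    thresholds = {}
    for label in set(val_labels):
        own = [p[label] for p, y in zip(val_probs, val_labels) if y == label]
        thresholds[label] = percentile(own, q)
    return thresholds


def select_confident(unlabeled_probs, thresholds):
    """Indices and predicted labels of unlabeled instances above threshold."""
    selected = []
    for i, probs in enumerate(unlabeled_probs):
        label = max(range(len(probs)), key=probs.__getitem__)
        if label in thresholds and probs[label] >= thresholds[label]:
            selected.append((i, label))
    return selected


val_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
val_labels = [0, 0, 1, 1]
thresholds = label_thresholds(val_probs, val_labels)
pseudo_labeled = select_confident([[0.95, 0.05], [0.5, 0.5], [0.15, 0.85]], thresholds)
```

The selected pairs would be added to the training set before the next round; the outer loop (retrain, re-predict, keep the model with the best validation accuracy) is omitted here.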
In the high-resource setting, we assume access to plentiful computational resources and massive amounts of out-of-domain data, which may be indirectly accessed through pretrained models. Specifically, we evaluate the performance of a Transformer-based ELMo (Peters et al., 2018b) and BERT, both (a) off-the-shelf with frozen embeddings and (b) after semi-supervised fine-tuning to both unlabeled and labeled in-domain data. To perform semi-supervised fine-tuning, we first use ELMo and BERT’s original objectives to fine-tune to the unlabeled data. To fine-tune ELMo to the labeled data, we average over the LM states and add a softmax classification layer. We obtain the best results applying slanted triangular learning rates and gradual unfreezing (Howard and Ruder, 2018) to this fine-tuning step. To fine-tune BERT to labeled data, we feed the hidden state corresponding to the [CLS] token of each instance to a softmax classification layer. We use AllenNLP (https://allennlp.org/elmo) to fine-tune ELMo, and Pytorch-pretrained-BERT (https://github.com/huggingface/pytorch-pretrained-BERT) to fine-tune BERT.
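For readers unfamiliar with slanted triangular learning rates, the schedule proposed by Howard and Ruder (2018) can be sketched as below: a short linear warm-up to the maximum rate followed by a long linear decay. The default values (`lr_max`, `cut_frac`, `ratio`) are illustrative and are not the settings used in these experiments; the guard against a zero-length warm-up is an addition made here.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at update t under a slanted triangular schedule."""
    # Number of warm-up updates (at least one, to avoid division by zero).
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut  # linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


schedule = [slanted_triangular_lr(t, total_steps=100) for t in range(101)]
```

Here `ratio` controls how much smaller the lowest rate is than `lr_max`, so the schedule starts and ends at `lr_max / ratio`.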
We also experiment with ELMo trained only on in-domain data as an example of high-resource LM pretraining methods, such as Dai and Le (2015), when there is no out-of-domain data available. Specifically, we generate contextual word representations with a Transformer-based ELMo. During downstream classification, the resulting vectors are frozen and concatenated to randomly initialized word vectors prior to the summation in Eq. (17).
In the low-resource setting, we find that vampire achieves the highest accuracy of all low-resource methods we consider, especially when the amount of labeled data is small. Table 2 shows the performance of all low-resource models on all datasets as we vary the amount of labeled data, and a subset of these are also shown in Figure 2 for easy comparison.
| Dataset | Model | Labeled set 1 (smallest) | Labeled set 2 | Labeled set 3 | Labeled set 4 (largest) |
|---|---|---|---|---|---|
| imdb | Baseline | 68.5 (7.8) | 79.0 (0.4) | 84.4 (0.1) | 87.1 (0.3) |
| | Self-training | 73.8 (3.3) | 80.0 (0.7) | 84.6 (0.2) | 87.0 (0.4) |
| | GloVe (ID) | 74.5 (0.8) | 79.5 (0.4) | 84.7 (0.2) | 87.1 (0.4) |
| | GloVe (OD) | 74.1 (1.2) | 80.0 (0.2) | 84.6 (0.3) | 87.0 (0.6) |
| | vampire | 82.2 (2.0) | 84.5 (0.4) | 85.4 (0.4) | 87.1 (0.4) |
| AG | Baseline | 68.8 (2.0) | 77.3 (1.0) | 84.4 (0.1) | 87.5 (0.2) |
| | Self-training | 77.3 (1.7) | 81.3 (0.8) | 84.8 (0.2) | 87.7 (0.1) |
| | GloVe (ID) | 70.4 (1.2) | 78.0 (1.0) | 84.1 (0.3) | 87.1 (0.2) |
| | GloVe (OD) | 68.8 (5.7) | 78.8 (1.1) | 85.3 (0.3) | 88.0 (0.3) |
| | vampire | 83.9 (0.6) | 84.5 (0.4) | 85.8 (0.2) | 87.7 (0.1) |
| Yahoo! | Baseline | 54.5 (2.8) | 63.0 (0.5) | 69.5 (0.3) | 73.6 (0.2) |
| | Self-training | 57.5 (2.0) | 63.2 (0.6) | 69.8 (0.3) | 73.6 (0.2) |
| | GloVe (ID) | 55.2 (2.3) | 63.5 (0.3) | 69.7 (0.3) | 73.5 (0.3) |
| | GloVe (OD) | 55.4 (2.4) | 63.9 (0.3) | 70.1 (0.5) | 73.8 (0.4) |
| | vampire | 59.9 (0.9) | 65.1 (0.3) | 69.8 (0.3) | 73.6 (0.2) |
| Hatespeech | Baseline | 67.7 (1.8) | 71.3 (0.2) | 75.6 (0.4) | 77.8 (0.2) |
| | Self-training | 68.5 (0.6) | 71.3 (0.2) | 75.5 (0.3) | 78.1 (0.2) |
| | GloVe (ID) | 69.7 (1.2) | 71.9 (0.5) | 76.0 (0.3) | 78.3 (0.2) |
| | GloVe (OD) | 69.7 (0.7) | 72.2 (0.8) | 76.1 (0.8) | 77.6 (0.5) |
| | vampire | 74.1 (0.8) | 74.4 (0.5) | 76.2 (0.6) | 78.0 (0.3) |
In the high-resource setting, we find, not surprisingly, that fine-tuning the pretrained BERT model to in-domain data provides the best performance. For both BERT and ELMo, we find that using frozen off-the-shelf vectors results in surprisingly poor performance, compared to fine-tuning to the task domain, especially for Hatespeech and imdb (see also Howard and Ruder, 2018). For these two datasets, an ELMo model trained only on in-domain data offers far superior performance to frozen off-the-shelf ELMo (see Figure 3). This difference is smaller, however, for Yahoo! and AG (please see Appendix B for full results).
These results taken together demonstrate that although pretraining on massive amounts of web text offers large improvements over purely supervised models, access to unlabeled in-domain data is critical, either for fine-tuning a pretrained language model in the high-resource setting, or for training vampire in the low-resource setting. Similar findings have been reported by Yogatama et al. (2019) for tasks such as natural language inference and question answering.
6.1 NPMI versus NLL as Stopping Criteria
To analyze the effectiveness of different stopping criteria in vampire, we pretrain 200 vampire models on imdb: 100 selected via NPMI, and 100 selected via negative log likelihood (NLL) on validation data. Interestingly, we observe that vampire NPMI and NLL values are negatively correlated (correlation of −0.72; Figure 4A), suggesting that upon convergence, trained models that better fit the data also tend to have more coherent topics. We then train 200 downstream classifiers with the same hyperparameters, on a fixed 200-document random subset of the imdb dataset, uniformly sampling over the NPMI- and NLL-selected vampire models as additional features. In Figure 4B and Figure 4C, we observe that better pretrained vampire models (according to either criterion) tend to produce better downstream performance (correlations of 0.55 and −0.53, for NPMI and NLL respectively).
However, we also observe higher variance in accuracy among the vampire models obtained using NLL as a stopping criterion (Figure 4D). Such models selected via NLL have poor topic coherence and downstream performance. As such, doing model selection using NPMI is the preferred alternative, and all vampire results in Table 2 are based on pretrained models selected using this criterion.
The experiments in Ding et al. (2018) provide some insight into this behaviour. They find that when training neural topic models, model fit and NPMI initially tend to improve on each epoch. At some point, however, perplexity continues to improve, while NPMI starts to drop, sometimes dramatically. We also observe this phenomenon when training vampire (see Appendix C). Using NPMI as a stopping criterion, as we propose to do, helps to avoid degenerate models that result from training too long.
In some preliminary experiments, we also observe cases where NPMI is artificially high because of redundancy in topics. Applying batchnorm to the reconstruction markedly improves diversity of collocating words across topics, which has also been noted by Srivastava and Sutton (2017). Future work may explore assigning a word diversity regularizer to the NPMI metric, so as to encourage models that have both stronger coherence and word diversity across topics.
6.2 Learned Latent Topics
In addition to being lightweight, one advantage of vampire is that it produces document representations that can be explicitly interpreted in terms of topics. Although the input we feed into the downstream classifier combines this representation with internal states of the encoder, the topical interpretation helps to summarize what the pretraining has learned. Examples of topics learned by vampire are provided in Table 3 and Appendix D.
6.3 Learned Scalar Layer Weights
Since the scalar weight parameters $\lambda$ (the softmax-normalized layer weights of §3.3) are trainable, we are able to investigate which layers of the pretrained VAE the classifier tends to prefer. We consistently find that the model tends to upweight the first layer of the VAE encoder, $h^{(1)}$, and $\theta$, and downweight the other layers of the encoder. To improve learning, especially under low resource settings, we initialize the scalar weights applied to the first encoder layer and $\theta$ with high values and downweight the intermediate layers, which increases validation performance. However, we also have observed that using a multi-layer encoder in vampire leads to larger gains downstream.
6.4 Computational Requirements
An appealing aspect of vampire is its compactness. Table 4 shows the computational requirements involved in training vampire on a single GPU or CPU, compared to training an ELMo model from scratch on the same data on a GPU. It is possible to train vampire orders of magnitude faster than ELMo, even without expensive hardware, making it especially suitable for obtaining fast results when resources are limited.
| Model | Parameters | Training time |
|---|---|---|
| vampire (GPU) | 3.8M | 7 min |
| vampire (CPU) | 3.8M | 22 min |
| ELMo (GPU) | 159.2M | 12 hr 35 min |
7 Related Work
In addition to references given throughout, many others have explored ways of enhancing performance when working with limited amounts of labeled data. Early work on speech recognition demonstrated the importance of pretraining and fine-tuning deep models in the semi-supervised setting (Yu et al., 2010). Chang et al. (2008) considered “dataless” classification, where the names of the categories provide the only supervision. Miyato et al. (2016) showed that adversarial pretraining can offer large gains, effectively augmenting the amount of data available. A long line of work in active learning similarly tries to maximize performance when obtaining labels is costly (Settles, 2012). Xie et al. (2019) describe novel data augmentation techniques leveraging back translation and tf-idf word replacement. All of these approaches could be productively combined with the methods proposed in this paper.
Based on our findings in this paper, we offer the following practical advice to those who wish to do effective semi-supervised text classification.
- When resources are unlimited, the best results can currently be obtained by using a pretrained model such as BERT, but fine-tuning to in-domain data is critically important (see also Howard and Ruder, 2018).
- When computational resources and annotations are limited, but there is plentiful unlabeled data, vampire offers large gains over other low-resource approaches.
- Training a language model such as ELMo on only in-domain data offers performance comparable to or somewhat better than vampire, but may be prohibitively expensive, unless working with GPUs.
- Alternatively, resources can be invested in getting more annotations; with sufficient labeled data (tens of thousands of instances), the advantages offered by additional unlabeled data become negligible. Of course, other NLP tasks may involve different trade-offs between data, speed, and accuracy.
The emergence of models like ELMo and BERT has revived semi-supervised NLP, demonstrating that pretraining large models on massive amounts of data can provide representations that are beneficial for a wide range of NLP tasks. In this paper, we confirm that these models are useful for text classification when the number of labeled instances is small, but demonstrate that fine-tuning to in-domain data is also of critical importance. In settings where BERT cannot easily be used, either due to computational limitations, or because an appropriate pretrained model in the relevant language does not exist, vampire offers a competitive lightweight alternative for pretraining from unlabeled data in the low-resource setting. When working with limited amounts of labeled data, we achieve superior performance to baselines such as self-training, or using word vectors pretrained on out-of-domain data, and approach the performance of ELMo trained only on in-domain data at a fraction of the computational cost.
We thank the members of the AllenNLP and ARK teams for useful comments and discussions. We also thank the anonymous reviewers for their insightful feedback. Computations on beaker.org were supported in part by credits from Google Cloud.
- Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.
- Blum and Mitchell (1998) Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of COLT.
- Bolukbasi et al. (2016) Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems.
- Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.
- Card et al. (2018) Dallas Card, Chenhao Tan, and Noah A. Smith. 2018. Neural models for documents with metadata. In Proceedings of ACL.
- Chang et al. (2009) Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems.
- Chang et al. (2008) Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In Proceedings of AAAI.
- Charniak (1997) Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of AAAI.
- Dai and Le (2015) Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
- Ding et al. (2018) Ran Ding, Ramesh Nallapati, and Bing Xiang. 2018. Coherence-aware neural topic modeling. In Proceedings of EMNLP.
- Founta et al. (2018) Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. In Proceedings of AAAI.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of ACL.
- Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML.
- Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of ACL.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems.
- Kingma and Welling (2013) Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. CoRR, abs/1312.6114.
- Lau et al. (2014) Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of EACL.
- Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of ACL.
- McClosky et al. (2006) David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of NAACL.
- Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of ICML.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.
- Mimno et al. (2011) David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of EMNLP.
- Miyato et al. (2016) Takeru Miyato, Andrew M. Dai, and Ian J. Goodfellow. 2016. Virtual adversarial training for semi-supervised text classification. CoRR, abs/1605.07725.
- Newman et al. (2010) David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Proceedings of NAACL.
- Nigam et al. (2000) Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3).
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.
- Peters et al. (2018a) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of NAACL.
- Peters et al. (2018b) Matthew E. Peters, Mark Neumann, Luke S. Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of EMNLP.
- Phang et al. (2018) Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. CoRR, abs/1811.01088.
- Radford (2018) Alec Radford. 2018. Improving language understanding by generative pre-training.
- Radford et al. (2018) Alec Radford, Rafal Józefowicz, and Ilya Sutskever. 2018. Learning to generate reviews and discovering sentiment. CoRR, abs/1704.01444.
- Recasens et al. (2013) Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2013. Linguistic models for analyzing and detecting biased language. In Proceedings of ACL.
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of ICML.
- Settles (2012) Burr Settles. 2012. Active Learning. Morgan & Claypool.
- Srivastava and Sutton (2017) Akash Srivastava and Charles A. Sutton. 2017. Autoencoding variational inference for topic models. In Proceedings of ICLR.
- Xie et al. (2019) Qizhe Xie, Zihang Dai, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation. CoRR, abs/1904.12848.
- Xu et al. (2017) Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In Proceedings of AAAI.
- Yang et al. (2017) Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of ICML.
- Yogatama et al. (2019) Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomás Kociský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and evaluating general linguistic intelligence. CoRR, abs/1901.11373.
- Yu et al. (2010) Dong Yu, Li Deng, and George E. Dahl. 2010. Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems.
- Zhao et al. (2019) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In Proceedings of NAACL.
- Zhou and Li (2005) Zhi-Hua Zhou and Ming Li. 2005. Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 17:1529–1541.
Appendix A Hyperparameter Search
In this section, we describe the hyperparameter search we used to choose model configurations, and include plots illustrating the range of validation performance observed in each setting.
|Computing Infrastructure||GeForce GTX 1080 GPU|
|Number of search trials||60 trials per dataset|
|Search strategy||uniform sampling|
|number of epochs||50||50||50||50||50|
|KL divergence annealing||choice[sigmoid, linear, constant]||linear||linear||linear||constant|
|KL annealing sigmoid weight 1||0.25||N/A||N/A||N/A||N/A|
|KL annealing sigmoid weight 2||15||N/A||N/A||N/A||N/A|
|KL annealing linear scaling||1000||1000||1000||1000||N/A|
|VAMPIRE hidden dimension||uniform-integer[32, 128]||80||81||118||125|
|Number of encoder layers||choice[1, 2, 3]||2||2||3||3|
[relu, tanh, softplus]
|Mean projection layers||1||1||1||1||1|
|Mean projection activation||linear||linear||linear||linear||linear|
|Log variance projection layers||1||1||1||1||1|
|Log variance projection activation||linear||linear||linear||linear||linear|
|Number of decoder layers||1||1||1||1||1|
|learning rate optimizer||Adam||Adam||Adam||Adam||Adam|
|learning rate||loguniform-float[1e-4, 1e-2]||0.00081||0.00021||0.00024||0.0040|
|update background frequency||choice[True, False]||False||False||False||False|
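The table names three KL annealing schedules (sigmoid, linear, constant) but does not give their functional forms. As a rough illustration only, the sketch below assumes that the linear schedule ramps the KL weight from 0 to 1 over "linear scaling" steps, and that the two sigmoid weights act as the slope and midpoint of a logistic curve. The function name `kl_weight` and these parameter interpretations are assumptions, not details confirmed by the paper.

```python
import math


def kl_weight(step, schedule, sigmoid_w1=0.25, sigmoid_w2=15.0, linear_scaling=1000.0):
    """Multiplier applied to the KL term of the VAE objective at a training step.

    The functional forms below are assumed for illustration; only the schedule
    names and the default numeric values come from the hyperparameter table.
    """
    if schedule == "constant":
        # No annealing: the KL term always has full weight.
        return 1.0
    if schedule == "linear":
        # Assumed: ramp linearly from 0 to 1 over `linear_scaling` steps, then hold at 1.
        return min(1.0, step / linear_scaling)
    if schedule == "sigmoid":
        # Assumed: logistic curve with slope `sigmoid_w1` and midpoint `sigmoid_w2`.
        return 1.0 / (1.0 + math.exp(-sigmoid_w1 * (step - sigmoid_w2)))
    raise ValueError(f"unknown schedule: {schedule}")


# The annealed objective would then be: reconstruction_loss + kl_weight(...) * kl_divergence
weights = [kl_weight(s, "linear") for s in (0, 500, 1000, 2000)]
```

Under these assumptions, all three schedules differ only in how quickly the KL term reaches full weight, which is why the table can treat the schedule as a single categorical choice.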
A.1 VAMPIRE Search
For the results presented in the paper, we varied the hyperparameters of vampire across a number of different dimensions, outlined in Table 5.
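A minimal sketch of how one trial of such a search could be drawn, assuming independent sampling of each searched hyperparameter from the distributions listed in Table 5. The function name, the dictionary keys, the inclusive integer bounds, and the base-10 log-uniform parameterization are illustrative assumptions rather than the authors' code.

```python
import random


def sample_vampire_config(rng):
    """Draw one hyperparameter assignment from the search space in Table 5."""
    return {
        # Searched hyperparameters.
        "kl_annealing": rng.choice(["sigmoid", "linear", "constant"]),
        "hidden_dim": rng.randint(32, 128),          # uniform-integer[32, 128], bounds assumed inclusive
        "num_encoder_layers": rng.choice([1, 2, 3]),
        "learning_rate": 10 ** rng.uniform(-4, -2),  # loguniform-float[1e-4, 1e-2]
        "update_background_freq": rng.choice([True, False]),
        # Fixed hyperparameters.
        "num_epochs": 50,
        "num_decoder_layers": 1,
        "optimizer": "Adam",
    }


# 60 trials per dataset, as reported in the table.
rng = random.Random(0)
trials = [sample_vampire_config(rng) for _ in range(60)]
```

Sampling the learning rate in log space spreads trials evenly across orders of magnitude, which a plain uniform draw over [1e-4, 1e-2] would not do.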
A.2 Classifier Search
To choose a baseline classifier with which to evaluate all pretrained models, we performed a mix of manual tuning and random search over four basic classifiers: CNN, LSTM, Bag-of-Embeddings (i.e., Deep Averaging Networks), and Logistic Regression.
Figure 6 shows the distribution of validation accuracies using 200 and 10,000 labeled instances, respectively, for different classifiers on the imdb and AG datasets. Under the low-resource setting, we observe that logistic regression and DAN-based classifiers tend to lead to more reliable validation accuracies. With enough compute, CNN-based classifiers tend to produce marginally higher validation accuracies, but most of the probability mass of their distribution lies below that of the logistic regression and DAN classifiers. LSTM-based classifiers tend to have extremely high variance under the low-resource setting. For this work, we choose to experiment with the DAN classifier, which offers the richness of vector-based representations along with the reliability that comes with having very few hyperparameters to tune.
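To make concrete why a DAN has few hyperparameters, the following is a minimal sketch of its forward pass in the spirit of Iyyer et al. (2015): average the word embeddings, apply a feed-forward layer, and project to class logits. The single hidden layer, the ReLU nonlinearity, and all names and toy weights are assumptions chosen for illustration, not the configuration used in our experiments.

```python
def dan_forward(token_ids, embeddings, w_hidden, b_hidden, w_out, b_out):
    """Deep Averaging Network forward pass: average embeddings, one hidden layer, logits."""
    dim = len(embeddings[0])
    # Unordered composition: the mean of the token embeddings ignores word order.
    avg = [sum(embeddings[t][d] for t in token_ids) / len(token_ids) for d in range(dim)]
    # One feed-forward layer with ReLU (depth and activation are assumptions).
    hidden = [max(0.0, sum(w * x for w, x in zip(row, avg)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Linear projection to one logit per class.
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w_out, b_out)]


# Toy parameters: 3-word vocabulary, 2-dimensional embeddings, 2 classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_hidden, b_hidden = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
w_out, b_out = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

logits = dan_forward([0, 2], embeddings, w_hidden, b_hidden, w_out, b_out)
logits_reordered = dan_forward([2, 0], embeddings, w_hidden, b_hidden, w_out, b_out)
```

Because the input is averaged before any nonlinearity, reordering the tokens leaves the logits unchanged, which mirrors the bag-of-words view of documents taken by vampire.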
Appendix B Results in the High Resource Setting
Table 6 shows the results of all high-resource methods (along with vampire) on all datasets, as we vary the amount of labeled data. As can be seen, training ELMo only on in-domain data results in performance similar to or better than that of an off-the-shelf ELMo or BERT model that has not been fine-tuned to in-domain data.
Except for one case in which it fails badly (Yahoo! with 200 labeled instances), fine-tuning BERT to the target domain achieves the best performance in every setting. Although we performed a substantial hyperparameter search under this regime, we attribute this failure to hyperparameter decisions that could be improved with further tuning. Other work has suggested that random initializations have a significant effect on the failure cases of BERT, pointing to the brittleness of fine-tuning (Phang et al., 2018).
The performance gap between fine-tuned ELMo and frozen ELMo on the AG News corpus is much smaller than on the other datasets, perhaps because the ELMo model we used was pretrained on the Billion Words Corpus, which is a news crawl. This dataset is also an example where frozen ELMo tends to outperform vampire. We take the strength of frozen, pretrained ELMo under this setting as further evidence of the importance of in-domain data for effective semi-supervised text classification.
|imdb||ELMo (FR)||75.1 (1.4)||80.3 (1.1)||85.3 (0.1)||87.3 (0.3)|
|BERT (FR)||81.5 (1.0)||83.9 (0.4)||86.8 (0.3)||88.2 (0.3)|
|ELMo (ID)||81.7 (1.3)||84.5 (0.2)||86.3 (0.4)||88.0 (0.4)|
|vampire||82.2 (2.0)||84.5 (0.4)||85.4 (0.4)||87.1 (0.4)|
|ELMo (FT)||86.4 (0.6)||87.9 (0.4)||90.0 (0.4)||91.6 (0.2)|
|BERT (FT)||88.1 (0.7)||89.4 (0.7)||91.4 (0.1)||93.1 (0.1)|
|AG||ELMo (FR)||84.5 (0.5)||85.7 (0.5)||88.3 (0.2)||89.4 (0.3)|
|BERT (FR)||84.6 (1.1)||85.7 (0.7)||88.0 (0.4)||89.0 (0.3)|
|ELMo (ID)||84.5 (0.6)||85.8 (0.8)||87.9 (0.2)||89.2 (0.2)|
|vampire||83.9 (0.6)||84.5 (0.4)||85.8 (0.2)||87.7 (0.1)|
|ELMo (FT)||85.2 (0.5)||86.6 (0.4)||88.6 (0.2)||89.5 (0.1)|
|BERT (FT)||87.1 (0.6)||88.0 (0.4)||90.1 (0.5)||91.9 (0.1)|
|Yahoo!||ELMo (FR)||54.3 (1.6)||64.2 (0.6)||71.2 (1.3)||74.1 (0.3)|
|BERT (FR)||57.0 (1.3)||64.2 (0.5)||70.0 (0.3)||73.8 (0.2)|
|ELMo (ID)||60.9 (1.7)||66.9 (0.9)||72.8 (0.5)||75.6 (0.1)|
|vampire||59.9 (0.9)||65.1 (0.3)||69.8 (0.3)||73.6 (0.2)|
|ELMo (FT)||60.5 (1.9)||66.1 (0.7)||71.7 (0.7)||75.8 (0.3)|
|BERT (FT)||45.3 (7.5)||69.2 (1.6)||76.9 (0.6)||81.0 (0.1)|
|Hatespeech||ELMo (FR)||70.5 (1.7)||72.4 (0.9)||76.0 (0.5)||78.3 (0.2)|
|BERT (FR)||75.1 (0.6)||76.3 (0.3)||77.8 (0.4)||79.0 (0.2)|
|ELMo (ID)||73.3 (0.8)||74.1 (0.8)||77.2 (0.3)||78.9 (0.2)|
|vampire||74.1 (0.8)||74.4 (0.5)||76.2 (0.6)||78.0 (0.3)|
|ELMo (FT)||73.9 (0.6)||75.4 (0.4)||78.1 (0.3)||78.7 (0.1)|
|BERT (FT)||76.2 (1.8)||78.3 (1.0)||79.8 (0.4)||80.2 (0.3)|
Appendix C Further Details on NPMI vs. NLL as Stopping Criteria
In the main paper, we note that we have observed cases in which training vampire for too long results in NPMI degradation, while NLL continues to improve. In Figure 5, we display example learning curves that point to this phenomenon.
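For readers who wish to track NPMI during training, the sketch below computes the average pairwise NPMI of a topic's top words from document-level co-occurrence counts in a reference corpus, using the standard definition NPMI(a, b) = log(P(a, b) / (P(a) P(b))) / (-log P(a, b)). The function name, the toy corpus, and the handling of the boundary cases are illustrative assumptions.

```python
import math
from itertools import combinations


def topic_npmi(top_words, documents):
    """Average pairwise NPMI of a topic's top words over a reference corpus."""
    if len(top_words) < 2:
        raise ValueError("need at least two words to form a pair")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab == 0.0:
            scores.append(-1.0)  # the words never co-occur: NPMI's lower limit
        elif p_ab == 1.0:
            scores.append(1.0)   # the words co-occur everywhere: avoid 0 / 0
        else:
            scores.append(math.log(p_ab / (prob(a) * prob(b))) / -math.log(p_ab))
    return sum(scores) / len(scores)


# Toy reference corpus of tokenized documents.
corpus = [["dog", "leash"], ["dog", "leash"], ["router", "wifi"], ["router", "wifi"]]
coherent = topic_npmi(["dog", "leash"], corpus)
incoherent = topic_npmi(["dog", "router"], corpus)
```

Averaging this score over all topics after each epoch gives a curve that can be monitored alongside NLL, so that training can stop when coherence begins to degrade.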
Appendix D Additional Learned Topics
In Table 7 we display some additional topics learned by vampire on the Yahoo! dataset.
|Canine Care||Networking||Multiplayer Gaming||Harry Potter|