vsl: Code for "Variational Sequential Labelers for Semi-Supervised Learning" (EMNLP 2018)
We introduce a family of multitask variational methods for semi-supervised sequence labeling. Our model family consists of a latent-variable generative model and a discriminative labeler. The generative models use latent variables to define the conditional probability of a word given its context, drawing inspiration from word prediction objectives commonly used in learning word embeddings. The labeler helps inject discriminative information into the latent space. We explore several latent variable configurations, including ones with hierarchical structure, which enables the model to account for both label-specific and word-specific information. Our models consistently outperform standard sequential baselines on 8 sequence labeling datasets, and improve further with unlabeled data.
Sequence labeling tasks in natural language processing (NLP) often have limited annotated data available for model training. In such cases regularization can be important, and it can be helpful to use additional unlabeled data. One approach for both regularization and semi-supervised training is to design latent-variable generative models and then develop neural variational methods for learning and inference Kingma and Welling (2014); Rezende and Mohamed (2015). Neural variational methods have been quite successful for both generative modeling and representation learning, and have recently been applied to a variety of NLP tasks Mnih and Gregor (2014); Bowman et al. (2016); Miao et al. (2016); Serban et al. (2017); Zhou and Neubig (2017); Hu et al. (2017). They are also very popular for semi-supervised training; when used in such scenarios, they typically have an additional task-specific prediction loss Kingma et al. (2014); Maaløe et al. (2016); Zhou and Neubig (2017); Yang et al. (2017b). However, it is still unclear how to use such methods in the context of sequence labeling.
In this paper, we apply neural variational methods to sequence labeling by combining a latent-variable generative model and a discriminatively-trained labeler. We refer to this family of procedures as variational sequential labelers (VSLs). Learning maximizes the conditional probability of each word given its context and minimizes a classification loss computed from the latent variables. We explore several models within this family that use different kinds of conditional independence structure among the latent variables within each time step. Intuitively, the multiple latent variables can disentangle information pertaining to label-oriented and word-specific properties.
We study VSLs in the context of named entity recognition (NER) and several part-of-speech (POS) tagging tasks, both on English Twitter data and on data from six additional languages. Without unlabeled data, our models consistently show 0.5–0.8% accuracy improvements across tagging datasets and a 0.8 $F_1$ improvement for NER. Adding unlabeled data further improves model performance by 0.1–0.3% accuracy or 0.2 $F_1$. We obtain the best results with a hierarchical structure using two latent variables at each time step.

Our models, like generative latent variable models in general, have the ability to naturally combine labeled and unlabeled data. We obtain small but consistent performance improvements by adding unlabeled data. In the absence of unlabeled data, the variational loss acts as a regularizer on the learned representation of the supervised sequence prediction model. Our results demonstrate that this regularization improves performance even when only labeled data is used. We also compare different ways of applying the classification loss when using a latent variable hierarchy, and find that the most effective structure also provides the cleanest separation of information in the latent space.
There is a growing amount of work applying neural variational methods to NLP tasks, including document modeling Mnih and Gregor (2014); Miao et al. (2016); Serban et al. (2017), machine translation Zhang et al. (2016), text generation Bowman et al. (2016); Serban et al. (2017); Hu et al. (2017), language modeling Bowman et al. (2016); Yang et al. (2017b), and sequence transduction Zhou and Neubig (2017), but we are not aware of any such work for sequence labeling. Before the advent of neural variational methods, there were several efforts in latent variable modeling for sequence labeling Quattoni et al. (2007); Sun and Tsujii (2009).

There has been a great deal of work on using variational autoencoders in semi-supervised settings Kingma et al. (2014); Maaløe et al. (2016); Zhou and Neubig (2017); Yang et al. (2017b). Semi-supervised sequence labeling has a rich history Altun et al. (2006); Jiao et al. (2006); Mann and McCallum (2008); Subramanya et al. (2010); Søgaard (2011). The simplest methods, which are also popular currently, use representations learned from large amounts of unlabeled data Miller et al. (2004); Owoputi et al. (2013); Peters et al. (2017). Recently, Zhang et al. (2017) proposed a structured neural autoencoder that can be jointly trained on both labeled and unlabeled data. Our work involves multi-task losses and is therefore also related to the rich literature on multi-task learning for sequence labeling (Plank et al., 2016; Augenstein and Søgaard, 2017; Bingel and Søgaard, 2017; Rei, 2017, inter alia).
Another related thread of work is learning interpretable latent representations. Zhou and Neubig (2017) factorize an inflected word into lemma and morphology labels, using continuous and categorical latent variables. Hu et al. (2017) interpret a sentence as a combination of an unstructured latent code and a structured latent code, which can represent attributes of the sentence.
There have been several efforts in combining variational autoencoders and recurrent networks Gregor et al. (2015); Chung et al. (2015); Fraccaro et al. (2016). While the details vary, these models typically contain latent variables at each time step in a sequence. This prior work mainly focused on ways of parameterizing the time dependence between the latent variables, which gives them more power in modeling distributions over observation sequences. In this paper, we similarly use latent variables at each time step, but we adopt stronger independence assumptions, which lead to simpler models and inference procedures. Also, the models cited above were developed for modeling data distributions, rather than for supervised or semi-supervised learning, which is our focus here.
The key novelties in our work compared to the prior work mentioned above are the proposed sequential variational labelers and the investigation of latent variable hierarchies within these models. The empirical effectiveness of latent hierarchical structure in variational modeling is a key contribution of this paper and may be applicable to the other applications discussed above. Recent work, contemporaneous with this submission, similarly showed the advantages of combining hierarchical latent variables and variational learning for conversational modeling, in the context of a non-sequential model Park et al. (2018).
We begin by describing variational autoencoders and the notation we will use in the following sections. We denote the input word sequence by $x = \langle x_1, \dots, x_T \rangle$, the corresponding label sequence by $y = \langle y_1, \dots, y_T \rangle$, the input words other than the word at position $t$ by $x_{-t}$, the generative model by $p_\theta$, and the posterior inference model by $q_\phi$.
We review variational autoencoders (VAEs) by describing a VAE for an input sequence $x$. When using a VAE, we assume a generative model that generates an input $x$ using a latent variable $z$, typically assumed to follow a multivariate Gaussian distribution. We seek to maximize the marginal likelihood of inputs $x$ when marginalizing out the latent variable $z$. Since this is typically intractable, especially when using continuous latent variables, we instead maximize a lower bound on the marginal log-likelihood (Kingma and Welling, 2014):

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \quad (1)$$

where we have introduced the variational posterior $q_\phi(z \mid x)$, parametrized by new parameters $\phi$. $q_\phi$ is referred to as an “inference model” as it encodes an input into the latent space. We also have the generative model probabilities $p_\theta(x \mid z)$, parametrized by $\theta$. The parameters are trained in a way that reflects a classical autoencoder framework: encode the input into a latent space, then decode the latent variable to reconstruct the input. These models are therefore referred to as “variational autoencoders”.

The lower bound consists of two terms: the reconstruction loss and the KL divergence. The KL divergence term provides a regularizing effect during learning by ensuring that the learned posterior remains close to the prior over the latent variables.
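To make the two terms of the bound concrete, here is a minimal PyTorch sketch (our own illustration, not from the paper's released code) of the negative ELBO for a diagonal-Gaussian posterior and a standard-normal prior; `decoder` is assumed to map a latent sample to vocabulary logits.

```python
import torch
import torch.nn.functional as F

def gaussian_vae_loss(decoder, encoder_mu, encoder_logvar, target_ids):
    """Negative ELBO of Eq. (1) for a diagonal Gaussian posterior
    q(z|x) = N(mu, diag(exp(logvar))) and a standard-normal prior p(z)."""
    # One reparametrized sample: z = mu + sigma * eps, eps ~ N(0, I).
    z = encoder_mu + torch.exp(0.5 * encoder_logvar) * torch.randn_like(encoder_mu)
    # Reconstruction term: cross-entropy of the decoder's vocabulary logits.
    recon = F.cross_entropy(decoder(z), target_ids)
    # Analytic KL( q(z|x) || N(0, I) ) for a diagonal Gaussian.
    kl = -0.5 * torch.sum(
        1 + encoder_logvar - encoder_mu.pow(2) - encoder_logvar.exp(), dim=-1).mean()
    return recon + kl
```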
Figure 1: Variational sequential labelers. The first row shows the original graphical model of each variant, where shaded circles are observed variables. The second row shows how we perform inference and learning, with inference models shown in dashed lines, generative models in solid lines, and classifiers in dotted lines. All models are trained to maximize $p(x_t \mid x_{-t})$ and to predict the label $y_t$.

We now introduce variational sequential labelers (VSLs) and propose several variants for sequence labeling tasks. Although the latent structure varies, a VSL maximizes the conditional probability of each word $x_t$ given its context $x_{-t}$ and minimizes a classification loss that uses the latent variables as input to the classifier. Unlike VAEs, VSLs do not autoencode the input, so they are more similar to recent conditional variational formulations (Sohn et al., 2015; Miao et al., 2016; Zhou and Neubig, 2017). Intuitively, the VSL variational objective seeks the information that is useful for predicting a word from its surrounding context, which has similarities to objectives used for learning word embeddings (Collobert et al., 2011; Mikolov et al., 2013). This objective serves as regularization for the labeled data and as an unsupervised objective for the unlabeled data.
All of our models use latent variables at each position in the sequence; this structure is shown in the visual depictions of our models in Figure 1. We consider variants with multiple latent variables per time step and attach the classifier to only particular variables, which encourages the different latent variables to capture different kinds of information. In the following subsections, we describe the latent variable configurations that we evaluate empirically in our experiments.
We begin by defining a basic VSL and its parametrization, which will also be used in the other variants. This first model, which we call VSL-G and show in Figure 1(a), has a single Gaussian latent variable $z_t$ at each time step. VSL-G uses two training objectives; the first is similar to the lower bound on log-likelihood used by VAEs:

$$\log p_\theta(x_t \mid x_{-t}) \;\ge\; \mathbb{E}_{q_\phi(z_t \mid x)}\big[\log p_\theta(x_t \mid z_t)\big] - \mathrm{KL}\big(q_\phi(z_t \mid x) \,\|\, p(z_t \mid x_{-t})\big) \quad (2)$$

We denote the negation of this lower bound by $U(x, t)$. VSL-G additionally uses a classifier on the latent variable $z_t$, which is trained with the following objective:

$$C(x, y, t) = -\,\mathbb{E}_{q_\phi(z_t \mid x)}\big[\log p_\psi(y_t \mid z_t)\big] \quad (3)$$

where $p_\psi(y_t \mid z_t)$ is the classifier with parameters $\psi$.
The final loss is

$$\sum_{t=1}^{T}\Big(U(x, t) + \alpha\, C(x, y, t)\Big)$$

where $\alpha$ is a trade-off hyperparameter. A second hyperparameter $\beta$ weights the contribution of unlabeled sequences, for which only the variational term $U$ is available; $\beta$ is set to zero during supervised training but is tuned based on development set performance during semi-supervised training. The same procedure is adopted for the other VSL models below.

For the generative model, we parametrize $p_\theta(x_t \mid z_t)$ as a feedforward neural network with two hidden layers and ReLU Nair and Hinton (2010) activation functions. As the reconstruction loss, we use cross-entropy over the words in the vocabulary. We defer the description of the parametrization of the prior $p(z_t \mid x_{-t})$ to Section 3.6.

We now discuss how we parametrize the inference model $q_\phi(z_t \mid x)$. We use a bidirectional gated recurrent unit (BiGRU; Chung et al., 2014) network to produce a hidden vector $h_t$ at position $t$. The BiGRU is run over the input sequence, where the input at each position is the concatenation of a word embedding and the concatenated final hidden states of a character-level BiGRU. The inference model is then a single-layer feedforward neural network that uses $h_t$ as input. When parametrizing the posteriors over latent variables in the models below, we use this same procedure: produce hidden vectors with a BiGRU and use them as input to feedforward networks. The structure of our inference model is similar to those used in previous state-of-the-art models for sequence labeling Lample et al. (2016); Yang et al. (2017a).

In order to focus on the effect of our variational objective, the classifier we use is always the same as in our baseline model (see Section 4.3): a single-layer feedforward neural network without a hidden layer, which is also used for test-time prediction.
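To ground this parametrization, here is a condensed PyTorch sketch of a VSL-G-style model (our own reconstruction, not the authors' released code): a word-level BiGRU produces $h_t$, single-layer networks map $h_t$ to the Gaussian posterior over $z_t$, a two-hidden-layer ReLU network reconstructs the word from $z_t$, and a linear classifier predicts the tag from $z_t$. The character-level BiGRU and the learned, iteratively-updated prior are omitted; a standard-normal prior is used instead for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSLG(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hid_dim=100, z_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        # Inference model q(z_t | x): single-layer maps to Gaussian parameters.
        self.to_mu = nn.Linear(2 * hid_dim, z_dim)
        self.to_logvar = nn.Linear(2 * hid_dim, z_dim)
        # Generative model p(x_t | z_t): two ReLU hidden layers, then vocab logits.
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, vocab_size))
        # Classifier p(y_t | z_t): a single linear layer, as in the baseline.
        self.classifier = nn.Linear(z_dim, num_tags)

    def forward(self, word_ids, tag_ids=None, alpha=1.0):
        h, _ = self.bigru(self.emb(word_ids))          # (batch, T, 2*hid)
        mu, logvar = self.to_mu(h), self.to_logvar(h)  # q(z_t | x)
        if self.training:
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # one sample
        else:
            z = mu                                                   # mean at test time
        # Reconstruction term of U(x, t): cross-entropy over the vocabulary.
        recon = F.cross_entropy(self.decoder(z).transpose(1, 2), word_ids)
        # KL to a standard-normal prior (the paper instead updates the prior iteratively).
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu ** 2 - logvar.exp(), dim=-1))
        loss = recon + kl
        if tag_ids is not None:  # labeled data: add the classification loss C(x, y, t)
            loss = loss + alpha * F.cross_entropy(
                self.classifier(z).transpose(1, 2), tag_ids)
        return loss
```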
We next consider ways of factorizing the functionality of the latent variable into label-specific and other word-specific information. We introduce VSL-GG-Flat (shown in Figure 1(b)), which has two conditionally independent Gaussian latent variables at each time step, denoted $z_t$ and $v_t$ for time step $t$. The variational lower bound is derived as follows:

$$\log p_\theta(x_t \mid x_{-t}) \;\ge\; \mathbb{E}_{q_\phi(z_t \mid x)\,q_\phi(v_t \mid x)}\big[\log p_\theta(x_t \mid z_t, v_t)\big] - \mathrm{KL}\big(q_\phi(z_t \mid x) \,\|\, p(z_t \mid x_{-t})\big) - \mathrm{KL}\big(q_\phi(v_t \mid x) \,\|\, p(v_t \mid x_{-t})\big) \quad (4)$$

and we again denote its negation by $U(x, t)$. The classifier is attached to the latent variable $z_t$ and its loss is

$$C(x, y, t) = -\,\mathbb{E}_{q_\phi(z_t \mid x)}\big[\log p_\psi(y_t \mid z_t)\big] \quad (5)$$
The final loss for the model is

$$\sum_{t=1}^{T}\Big(U(x, t) + \alpha\, C(x, y, t)\Big) \quad (6)$$

where $\alpha$ is a trade-off hyperparameter. Similarly to the VSL-G model, $q_\phi(z_t \mid x)$ and $q_\phi(v_t \mid x)$ are parametrized by single-layer feedforward neural networks using the hidden state $h_t$ as input.
We also explore hierarchical relationships among the latent variables. In particular, we introduce the VSL-GG-Hier model, which has two Gaussian latent variables at each time step with the hierarchical structure shown in Figure 1(c). This model encodes the intuition that the word-specific latent information $v_t$ may differ depending on the label-specific information $z_t$ of the word $x_t$. The variational lower bound (7) takes the same form as Eq. (4), except that the posterior and prior over $v_t$ are conditioned on $z_t$, i.e., $q_\phi(v_t \mid z_t, x)$ and $p(v_t \mid z_t, x_{-t})$; we again write $U(x, t)$ for its negation.

The classifier uses $z_t$ as input and is trained with the following loss:

$$C(x, y, t) = -\,\mathbb{E}_{q_\phi(z_t \mid x)}\big[\log p_\psi(y_t \mid z_t)\big] \quad (8)$$

Note that Eq. (8) and Eq. (5) have the same form. The final loss is

$$\sum_{t=1}^{T}\Big(U(x, t) + \alpha\, C(x, y, t)\Big) \quad (9)$$

where $\alpha$ is a trade-off hyperparameter.
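To make the difference between the flat and hierarchical configurations concrete, the sketch below (our own illustration; class names are hypothetical) contrasts the two posterior factorizations: the flat variant computes $q_\phi(z_t \mid x)$ and $q_\phi(v_t \mid x)$ independently from $h_t$, while the hierarchical variant feeds $h_t$ together with the sampled $z_t$ into the network for $q_\phi(v_t \mid z_t, x)$, as described just below.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Single-layer feedforward network producing a diagonal Gaussian."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, out_dim)
        self.logvar = nn.Linear(in_dim, out_dim)

    def forward(self, inp):
        mu, logvar = self.mu(inp), self.logvar(inp)
        sample = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return sample, mu, logvar

class FlatPosteriors(nn.Module):
    """VSL-GG-Flat: q(z_t | x) and q(v_t | x) both read only h_t."""
    def __init__(self, hid_dim, z_dim, v_dim):
        super().__init__()
        self.q_z = GaussianHead(hid_dim, z_dim)
        self.q_v = GaussianHead(hid_dim, v_dim)

    def forward(self, h):
        return self.q_z(h), self.q_v(h)

class HierPosteriors(nn.Module):
    """VSL-GG-Hier: q(v_t | z_t, x) conditions on the sample of z_t."""
    def __init__(self, hid_dim, z_dim, v_dim):
        super().__init__()
        self.q_z = GaussianHead(hid_dim, z_dim)
        self.q_v = GaussianHead(hid_dim + z_dim, v_dim)

    def forward(self, h):
        z, mu_z, logvar_z = self.q_z(h)
        # Concatenate h_t with the sampled z_t, then apply a single-layer network.
        v, mu_v, logvar_v = self.q_v(torch.cat([h, z], dim=-1))
        return (z, mu_z, logvar_z), (v, mu_v, logvar_v)
```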
The hierarchical posterior $q_\phi(v_t \mid z_t, x)$ is parametrized by concatenating the hidden vector $h_t$ and the sampled random variable $z_t$ and then using them as input to a single-layer feedforward network.

Traditional variational models assume extremely simple priors (e.g., multivariate standard Gaussian distributions). Recently there have been efforts to learn the prior and posterior jointly during training Fraccaro et al. (2016); Serban et al. (2017); Tomczak and Welling (2018). In this paper, we follow the same idea, but we do not explicitly parametrize the prior $p(z_t \mid x_{-t})$. This is partially due to the lack of computationally-efficient parametrization options for this context-dependent prior. In addition, since we are not seeking to do generation with our learned models, we can let part of the generative model be parametrized implicitly.
More specifically, the approach we use is to learn the priors by updating them iteratively. During training, we first initialize the priors of all examples as multivariate standard Gaussian distributions. As training proceeds, we use the last optimized posterior as our current prior based on a particular “update frequency” (see supplementary material for more details).
Our learned priors are implicitly modeled as

$$p^{(i+1)}(z_t \mid x_{-t}) = \mathbb{E}_{\tilde{x}_t \sim p_{\mathcal{D}}(\tilde{x}_t \mid x_{-t})}\big[\, q^{(i)}_\phi(z_t \mid \tilde{x}_t, x_{-t}) \,\big] \quad (10)$$

where $p_{\mathcal{D}}$ is the empirical data distribution, $\tilde{x}_t$ is a random variable corresponding to the observation at position $t$, and $i$ is the prior update time step. The intuition here is that the prior is obtained by marginalizing over values for the missing observation represented by the random variable $\tilde{x}_t$. The posterior $q_\phi$ is as defined in our latent variable models. We assume $p^{(0)}(z_t \mid x_{-t}) = \mathcal{N}(0, I)$ for all positions $t$. For a context $x_{-t}$ that can pair with multiple values of $x_t$, its prior is the data-dependent weighted average of the posteriors. For simplicity of implementation and efficient computation, however, if a context $x_{-t}$ can pair with multiple values in our training data, we ignore this fact and simply use instance-dependent posteriors. Another way to view this is as conditioning on the index of the training example while parametrizing the prior above. That is,

$$p^{(i+1)}(z_t \mid x_{-t}, n) = q^{(i)}_\phi\big(z_t \mid x^{(n)}\big) \quad (11)$$

where $n$ is the index of the instance.
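A minimal sketch of how this instance-dependent prior could be maintained (our interpretation of Eq. (11); the released code may organize this differently): cache detached posterior parameters per training instance at each prior-update step, and use them in place of $\mathcal{N}(0, I)$ in the KL term of $U(x, t)$.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)

class InstancePrior:
    """Per-instance prior p^{(i)}(z_t | x_{-t}, n), initialized to N(0, I)."""
    def __init__(self):
        self.cache = {}  # instance index n -> (mu, logvar), detached

    def get(self, n, shape, device):
        if n not in self.cache:
            zero = torch.zeros(shape, device=device)
            return zero, zero  # mu = 0, logvar = 0, i.e., the standard-normal prior
        return self.cache[n]

    def store(self, n, mu, logvar):
        # Posteriors from update step i become the prior at step i + 1 (Eq. 11).
        self.cache[n] = (mu.detach().clone(), logvar.detach().clone())
```

During training, `store` would be called according to the chosen update frequency, and `get` would supply the prior parameters for the KL term.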
In this subsection, we introduce techniques we have used to address difficulties during training.
It is challenging to use gradient descent for a random variable as it involves a non-differentiable sampling procedure. Kingma and Welling (2014) introduced a reparametrization trick to tackle this problem. They parametrize a Gaussian random variable $z$ as $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\mu$, $\sigma$ are deterministic and differentiable functions, so the gradient can flow through $\mu$ and $\sigma$. In our experiments, we use one sample for each time step during training. For evaluation at test time, we use the mean value $\mu$.
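A short sketch of the sampling behavior described here (assuming diagonal-Gaussian posteriors): one reparametrized sample per time step during training, and the posterior mean at test time.

```python
import torch

def sample_latent(mu, logvar, training):
    """z = mu + sigma * eps during training; z = mu at evaluation time."""
    if not training:
        return mu
    eps = torch.randn_like(mu)                 # eps ~ N(0, I)
    return mu + torch.exp(0.5 * logvar) * eps  # gradients flow through mu, sigma
```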
Although the use of prior updating lets us avoid tuning the weight of the KL divergence, the simple priors can still hinder learning during the initial stages of training. To address this, we follow the method described by Bowman et al. (2016): we add weights to all KL divergence terms and anneal the weights from a small value to 1.
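The KL-weight annealing of Bowman et al. (2016) can be implemented with a simple schedule; the sketch below is one common linear variant (the paper does not specify the exact schedule, so the shape and constants here are illustrative).

```python
def kl_weight(step, warmup_steps=10000, start=1e-3):
    """Linearly anneal the KL weight from `start` to 1.0 over `warmup_steps`."""
    if step >= warmup_steps:
        return 1.0
    return start + (1.0 - start) * step / warmup_steps

# Usage inside the training loop (illustrative):
# loss = recon + kl_weight(global_step) * kl + alpha * classification_loss
```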
We describe key details of our experimental setup in the subsections below but defer details about hyperparameter tuning to the supplementary material. Our implementation is available at https://github.com/mingdachen/vsl
We evaluate our model on the CoNLL 2003 English NER dataset Tjong Kim Sang and De Meulder (2003) and 7 POS tagging datasets: the Twitter tagging dataset of Gimpel et al. (2011) and Owoputi et al. (2013), and 6 languages from the Universal Dependencies (UD) 1.4 dataset McDonald et al. (2013).
The Twitter dataset has 25 tags. We use Oct27Train and Oct27Dev as the training set, Oct27Test as the development set, and Daily547 as the test set. We randomly sample {1k, 2k, 3k, 4k, 5k, 10k, 20k, 30k, 60k} tweets from 56 million English tweets as our unlabeled data and tune the amount of unlabeled data based on development set accuracy.
The UD datasets have 17 tags. We use French, German, Spanish, Russian, Indonesian and Croatian. We follow the same setup as Zhang et al. (2017), randomly sampling 20% of the original training set as our labeled data and 50% as unlabeled data. There is no overlap between the labeled and unlabeled data. See Zhang et al. (2017) for more details about the setup.
For NER, we use the BIOES labeling scheme and report micro-averaged $F_1$. We preprocess the text by replacing all digits with 0. We randomly sample 10% of the original training set as our labeled data and 50% as unlabeled data, and again ensure there is no overlap between the labeled and unlabeled data.
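Micro-averaged $F_1$ for NER is computed over predicted and gold entity spans rather than individual tags; the following small utility (ours, not part of the paper's code) shows one way to extract spans from BIOES tags and score them.

```python
def bioes_spans(tags):
    """Extract (start, end, type) entity spans from a BIOES tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            spans.append((i, i, etype))
            start, label = None, None
        elif prefix == "B":
            start, label = i, etype
        elif prefix == "I" and start is not None and etype == label:
            continue  # span stays open
        elif prefix == "E" and start is not None and etype == label:
            spans.append((start, i, label))
            start, label = None, None
        else:  # "O" or an ill-formed sequence: discard any open span
            start, label = None, None
    return spans

def micro_f1(gold_sequences, pred_sequences):
    """Micro-averaged F1 over entity spans from parallel tag sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = set(bioes_spans(gold)), set(bioes_spans(pred))
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```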
For all experiments, we use pretrained 100-dimensional word embeddings. For Twitter, we trained skip-gram embeddings Mikolov et al. (2013) on a dataset of 56 million English tweets. For the UD datasets, we trained skip-gram embeddings on Wikipedia for each of the six languages. For NER, we use 100-dimensional pretrained GloVe Pennington et al. (2014) embeddings. Our models perform better with the word embeddings kept fixed during training, while for the baselines the word embeddings are fine-tuned, as this improves the baseline performance.
Our primary baseline is a BiGRU tagger whose input is the concatenation of a word embedding and the concatenated final hidden states of a character-level BiGRU. This BiGRU architecture is identical to that used in the inference networks of our VSL models. Predictions are made by applying a linear transformation to the current hidden state; the output dimensionality of the transformation is task-dependent (e.g., 25 for Twitter tagging). We use the standard per-position cross-entropy loss for training.
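A compact sketch of this baseline (our own illustration; the character-level BiGRU input is omitted for brevity):

```python
import torch.nn as nn
import torch.nn.functional as F

class BiGRUTagger(nn.Module):
    """Baseline: BiGRU over word embeddings + per-position linear classifier."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hid_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, num_tags)  # e.g., 25 tags for Twitter

    def forward(self, word_ids, tag_ids):
        h, _ = self.bigru(self.emb(word_ids))   # (batch, T, 2*hid)
        logits = self.out(h)                    # (batch, T, num_tags)
        return F.cross_entropy(logits.transpose(1, 2), tag_ids)
```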
We also report results from the best systems from Zhang et al. (2017), namely the NCRF and NCRF-AE models. Both use feedforward networks as encoders and conditional random field layers for capturing sequential information. The NCRF-AE model additionally can benefit from unlabeled data.
(Table 1: (a) results on the Twitter tagging development and test sets; (b) results on the CoNLL 2003 NER development and test sets.)
Table 1(a) shows results on the Twitter development and test sets. All of our VSL models outperform the baseline, and our best VSL models outperform the BiGRU baseline by 0.8–1% absolute. When comparing different latent variable configurations, we find that a hierarchical structure performs best. Without unlabeled data, our models already outperform the BiGRU baseline; adding unlabeled data enlarges the gap between the baseline and our models by a further 0.1–0.3% absolute.
Table 2: POS tagging accuracies on six UD 1.4 languages. "UL" is the absolute improvement from adding unlabeled data.

 | French acc. | UL | German acc. | UL | Indonesian acc. | UL | Spanish acc. | UL | Russian acc. | UL | Croatian acc. | UL |
---|---|---|---|---|---|---|---|---|---|---|---|---|
NCRF | 93.4 | - | 90.4 | - | 88.4 | - | 91.2 | - | 86.6 | - | 86.1 | - |
NCRF-AE | 93.7 | +0.2 | 90.8 | +0.2 | 89.1 | +0.3 | 91.7 | +0.5 | 87.8 | +1.1 | 87.9 | +1.2 |
BiGRU baseline | 95.9 | - | 92.6 | - | 92.2 | - | 94.7 | - | 95.2 | - | 95.6 | - |
VSL-G | 96.1 | +0.0 | 92.8 | +0.0 | 92.3 | +0.0 | 94.8 | +0.1 | 95.3 | +0.0 | 95.6 | +0.1 |
VSL-GG-Flat | 96.1 | +0.0 | 93.0 | +0.1 | 92.4 | +0.1 | 95.0 | +0.1 | 95.5 | +0.1 | 95.8 | +0.1 |
VSL-GG-Hier | 96.4 | +0.1 | 93.3 | +0.1 | 92.8 | +0.1 | 95.3 | +0.2 | 95.9 | +0.1 | 96.3 | +0.2 |
Table 1(b) shows results on the CoNLL 2003 NER development and test sets. We observe similar trends as in the Twitter data, except that the model does not show improvement on the test set when adding unlabeled data.
Table 2 shows our results on the UD datasets. The trends are broadly consistent with those of Tables 1(a) and 1(b). The best performing models use hierarchical structure in the latent variables. There are some differences across languages: for French, German, Indonesian, and Russian, VSL-G does not show improvement when using unlabeled data. This may be resolved with better tuning, since the model actually shows improvement on the dev set.
Note that results reported by Zhang et al. (2017) and ours are not strictly comparable as their word embeddings were only pretrained on the UD training sets while ours were pretrained on Wikipedia. Nonetheless, they also mentioned that using embeddings pretrained on larger unlabeled data did not help. We include these results to show that our baselines are indeed strong compared to prior results reported in the literature.
Table 3: Effect of attaching the classifier to different latent variables in the VSL-GG-Hier model. "UL" is the absolute improvement from adding unlabeled data.

 | Twitter acc. | UL | NER $F_1$ | UL | UD average acc. | UL |
---|---|---|---|---|---|---|
classifier on $z_t$ | 91.6 | +0.3 | 88.4 | +0.2 | 95.0 | +0.1 |
classifier on $v_t$ | 91.1 | +0.2 | 87.8 | +0.1 | 94.4 | +0.0 |
We investigate the effect of attaching the classifier to different latent variables. In particular, for the VSL-GG-Hier model, we compare attaching the classifier to $z_t$ versus $v_t$ (see Figure 2). The results in Table 3 suggest that attaching the reconstruction and classification losses to the same latent variable ($v_t$) harms accuracy, even though attaching the classifier to $v_t$ effectively gives the classifier an extra layer. We can observe why this occurs by looking at the latent variable visualizations in Figure 3(d). Compared with Figure 3(e), where the two variables are more clearly disentangled, the latent variables in Figure 3(d) appear to be capturing highly similar information.
(Figure 3: t-SNE visualizations of the latent spaces: (a) BiGRU baseline, (b) VSL-G, (c) VSL-GG-Flat, (d) VSL-GG-Hier with the classifier on $v_t$, (e) VSL-GG-Hier.)
To verify our assumptions about the latent structure, we visualize the latent space of the Gaussian models using t-SNE Maaten and Hinton (2008) in Figure 3. The BiGRU baseline (Figure 3(a)) and VSL-G (Figure 3(b)) do not show significant differences. However, when using multiple latent variables, the different latent variables capture different characteristics. In the VSL-GG-Flat model (Figure 3(c)), the variable $z_t$ (the upper plot) reflects the clustering of the tagging space much more closely than the variable $v_t$ (the lower plot). Since both variables are used to reconstruct the word, but only $z_t$ is trained to predict the tag, it appears that $v_t$ is capturing other information useful for reconstructing the word. However, since they are both used for reconstruction, the two spaces show signs of alignment; that is, the "tag" latent variable $z_t$ does not show as clean a separation into tag clusters as the corresponding variable in the VSL-GG-Hier model in Figure 3(e).

In Figure 3(e) (VSL-GG-Hier), the clustering of words with respect to the tag is clearest. This may account for the consistently better performance of this model relative to the others. The variable $v_t$ reflects a space that is conditioned on $z_t$ but that diverges from it, presumably in order to better reconstruct the word. The closer the latent variable is to the decoder output, the weaker the tagging information becomes, while other word-specific information becomes more salient.

Figure 3(d) shows that VSL-GG-Hier with the classification loss on $v_t$, which consistently underperforms both the VSL-GG-Flat and VSL-GG-Hier models in our experiments, appears to be capturing the same latent space in both variables. Since the variable $v_t$ is used to both predict the tag and reconstruct the word, it must capture both the tag and word reconstruction spaces, and may be limited by capacity in doing so. The variable $z_t$ does not seem to be contributing much modeling power, as its space is closely aligned to that of $v_t$.
Table 4: Results with and without variational regularization ("no VR"). No unlabeled data is used in these experiments.

 | Twitter acc. | no VR | NER $F_1$ | no VR |
---|---|---|---|---|
BiGRU baseline | 90.8 | - | 87.6 | - |
VSL-G | 91.1 | 90.9 | 87.8 | 87.7 |
VSL-GG-Flat | 91.4 | 90.9 | 88.0 | 87.8 |
VSL-GG-Hier | 91.6 | 91.0 | 88.4 | 87.9 |
We investigate the beneficial effects of variational frameworks (“variational regularization”) by replacing our variational components in VSLs with their deterministic counterparts, which do not have randomness in the latent space and do not use the KL divergence term during optimization. Note that these BiGRU encoders share the same architectures as their variational posterior counterparts and still use both the classification and reconstruction losses. While other subsets of losses could be considered in this comparison, our motivation is to compare two settings that correspond to well-known frameworks. The “no VR” setting corresponds roughly to the combination of a classifier and a traditional autoencoder.
We note that these experiments do not use any unlabeled data.
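Concretely, under our reading of this setup, the "no VR" variant differs from the corresponding VSL in two ways, sketched below: the posterior mean is used deterministically instead of a sample, and the KL term is dropped while the reconstruction and classification losses are kept.

```python
# Hypothetical ablation switch for the "no VR" setting (illustrative only):
def vsl_loss(recon, kl, cls, alpha=1.0, variational=True):
    """Full VSL loss vs. its deterministic counterpart.

    recon: reconstruction loss, kl: KL divergence term,
    cls: classification loss, alpha: trade-off hyperparameter.
    """
    if variational:
        return recon + kl + alpha * cls  # VSL objective
    return recon + alpha * cls           # "no VR": classifier + plain reconstruction loss

# In the encoder, "no VR" additionally uses z = mu (no sampling noise).
```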
The results in Table 4 demonstrate that compared to the baseline BiGRU, adding the reconstruction loss (“VSL-G, no VR”) yields only 0.1 improvement for both Twitter and NER. Although adding hierarchical structure further improves performance, the improvements are small (+0.1 and +0.2 for Twitter and NER respectively). For VSL-GG-Hier, variational regularization accounts for relatively large differences of 0.6 for Twitter and 0.5 for NER. These results show that the improvements do not come solely from adding a reconstruction objective to the learning procedure. In limited preliminary experiments, we did not find a benefit from adding unlabeled data under the “no VR” setting.
In order to examine the effect of unlabeled data, we report Twitter development accuracies while varying the amount of unlabeled data. We choose VSL-GG-Hier for this experiment since it benefits the most from unlabeled data. As Figure 4 shows, adding a small amount of unlabeled data helps only a little at first, and adding more boosts the accuracy of the model. The improvements from unlabeled data quickly plateau once the amount of unlabeled data exceeds 10,000 tweets. This suggests that with little unlabeled data, the model cannot fully utilize the information in the unlabeled data, while if the amount of unlabeled data is too large, the supervised training signal becomes too weak for the model to extract useful information from the unlabeled data.
We also notice that when there is a large amount of unlabeled data, it is always better to first pretrain the prior using a small unlabeled-data weight $\beta$ (e.g., 0.1) and then use it as a warm start to train a new model with a larger $\beta$ (e.g., 1.0).
Tuning the weight of the KL divergence could achieve a similar effect, but it may require tuning the weight for labeled data and unlabeled data separately. We prefer to pretrain the prior as it is simpler and involves less hyperparameter tuning.
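One way to realize this warm start, as a sketch under our assumption that the tuned quantity is the unlabeled-data weight (called `beta` here) and that the learned priors are cached separately from the model parameters:

```python
# Illustrative two-stage schedule for semi-supervised training (names are ours,
# not from the released code).
def train_with_prior_warm_start(build_model, train_fn, prior_cache):
    # Stage 1: small unlabeled-data weight, mainly to settle the learned priors.
    model = build_model()
    train_fn(model, prior_cache, beta=0.1)

    # Stage 2: fresh model, larger unlabeled-data weight, warm-started priors.
    model = build_model()
    train_fn(model, prior_cache, beta=1.0)
    return model
```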
We introduced variational sequential labelers for semi-supervised sequence labeling. They consist of latent-variable generative models with flexible parametrizations for the variational posterior (using RNNs over the entire input sequence) and a classifier at each time step. Our best models use multiple latent variables arranged in a hierarchical structure.
We demonstrate systematic improvements in NER and POS tagging accuracy across 8 datasets over a strong baseline.
We also find small, but consistent, improvements by using unlabeled data.
We would like to thank NVIDIA for donating GPUs used in this research, the anonymous reviewers for their comments that improved this paper, and Google for a faculty research award to K. Gimpel that partially supported this research. This research was funded by NSF grant 1433485.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418. Association for Computational Linguistics.
Jakub M. Tomczak and Max Welling. 2018. VAE with a VampPrior. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1214–1223. PMLR.
Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. 2016. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530. Association for Computational Linguistics.