Recently, distributed word representations have grown to become a mainstay of natural language processing (NLP), and have been shown to have empirical utility in a myriad of tasks [Collobert2008, turian2010word, baroni:2014, Andreas:Klein:2014]. The underlying idea behind distributed word representations is simple: to map each word in our vocabulary onto a continuous-valued vector of fixed dimensionality. Words that are similar (e.g., with respect to syntax or lexical semantics) will ideally be mapped to similar regions of the vector space, implicitly supporting both generalisation across in-vocabulary (IV) items, and countering the effects of data sparsity for low-frequency and out-of-vocabulary (OOV) items.
Without some means of automatically deriving the vector representations from unlabelled data, however, word embeddings would have little practical utility. Fortunately, it has been shown that they can be “pre-trained” from unlabelled text data using various algorithms to model the distributional hypothesis (i.e., that words which occur in similar contexts tend to be semantically similar). Pre-training methods have been refined considerably in recent years, and scaled up to increasingly large corpora.
As with other machine learning methods, it is well known that the quality of the pre-trained word embeddings depends heavily on factors including parameter optimisation, the size of the training data, and the fit with the target application. For example, turian2010word showed that the optimal dimensionality for word embeddings is task-specific. One factor which has received relatively little attention in NLP is the effect of “updating” the pre-trained word embeddings as part of the task-specific training, based on self-taught learning [raina2007self]. Updating leads to word representations that are task-specific, but often at the cost of over-fitting low-frequency and OOV words.
In this paper, we perform an extensive evaluation of four recently proposed word embedding approaches under fixed experimental conditions, applied to four sequence labelling tasks: POS-tagging, full-text chunking, named entity recognition (NER), and multiword expression (MWE) identification. Compared to previous empirical studies [collobert2011natural, turian2010word, pennington2014glove], we consider a wider range of word embedding approaches and evaluate them over more sequence labelling tasks. In addition, we explore the following research questions:
RQ1: are these word embeddings better than baseline approaches of one-hot unigram features and Brown clusters?
RQ2: do word embeddings require less training data (i.e. generalise better) than one-hot unigram features? If so, to what degree can word embeddings reduce the amount of labelled data?
RQ3: what is the impact of updating word embeddings in sequence labelling tasks, both empirically over the target task and geometrically over the vectors?
RQ4: what is the impact of these word embeddings (with and without updating) on both OOV items (relative to the training data) and out-of-domain data?
RQ5: overall, are some word embeddings better than others in a sequence labelling context?
2 Word Representations
2.1 Types of Word Representations
turian2010word identifies three varieties of word representations: distributional, cluster-based, and distributed.
Distributional representation methods map each word $w$ to a context-word vector $\vec{c}_w$, which is constructed directly from co-occurrence counts between $w$ and its context words. The learning methods either store the co-occurrence counts between two words $w_i$ and $w_j$ directly in $\vec{c}_{w_i}$ [sahlgren2006word, turney2010frequency, honkela1997self], or project the co-occurrence counts between words into a lower-dimensional space [vrehuuvrek2010software, lund1996producing], using dimensionality reduction techniques such as SVD [dumais1988using] and LDA [blei2003latent].
Cluster-based representation methods build clusters of words by applying either soft or hard clustering algorithms [lin2009phrase, li2005semi]. Some of them also rely on a co-occurrence matrix of words [pereira1993distributional]. The Brown clustering algorithm [Brown92class-basedn-gram] is the best-known method in this category.
Distributed representation methods usually map words into dense, low-dimensional, continuous-valued vectors $\vec{w} \in \mathbb{R}^d$, where $d$ is referred to as the word dimension.
2.2 Selected Word Representations
Over a range of sequence labelling tasks, we evaluate five methods for inducing word representations: Brown clustering [Brown92class-basedn-gram] (“Brown”), the neural language model of Collobert & Weston (“CW”) [collobert2011natural], the continuous bag-of-words model (“CBOW”) [Mikolov13], the continuous skip-gram model (“Skip-gram”) [Mikolov13NIPS], and Global Vectors (“Glove”) [pennington2014glove]. With the exception of CW, all have been shown to be at or near state-of-the-art in recent empirical studies [turian2010word, pennington2014glove]. CW is included because it was highly influential in earlier research, and the pre-trained embeddings are still used to some degree in NLP. The training of these word representations is unsupervised: the common underlying idea is to predict the occurrence of words in the neighbouring context. Their training objectives share the same form, which is a sum of local training factors $l(w, C(w))$:

$$\mathcal{L} = \sum_{w \in V} l(w, C(w))$$

where $V$ is the vocabulary of a given corpus, and $C(w)$ denotes the local context of word $w$. The local context of a word can either be its previous words, or the words surrounding it. Local training factors are designed to capture the relationship between $w$ and its local contexts of use, either by predicting $w$ based on its local context, or using $w$ to predict the context words. Other than Brown, which utilises a cluster-based representation, all the other methods employ a distributed representation.
The starting point for CBOW and Skip-gram is to employ softmax to predict word occurrence:

$$l(w, C(w)) = \log \frac{\exp(\vec{w}^{\top}\vec{h}_w)}{\sum_{w' \in V} \exp(\vec{w'}^{\top}\vec{h}_w)}$$

where $\vec{h}_w$ denotes the distributed representation of the local context of word $w$. CBOW derives $\vec{h}_w$ from the embeddings of the context words, and thus predicts $w$ given its local context. In contrast, Skip-gram applies softmax to each context word of a given occurrence of word $w$; in this case, $\vec{h}_w$ corresponds to the representation of one of its context words, and the model can be characterised as predicting context words based on $w$. In practice, softmax is too expensive to compute over large corpora, and thus Mikolov13NIPS use hierarchical softmax and negative sampling to scale up the training.
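As a concrete illustration of the negative-sampling approximation mentioned above, the following is a minimal Python sketch of the Skip-gram local factor for one (word, context) pair; the function names and toy vectors are our own illustration, not the original word2vec implementation:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def _dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def skipgram_local_factor(word_vec, ctx_vec, neg_vecs):
    """Negative-sampling approximation of the Skip-gram local factor:
    log sigma(w.c) + sum_k log sigma(-w.n_k), for negative samples n_k.
    Higher values mean the model better separates true context words
    from randomly sampled ones."""
    score = math.log(_sigmoid(_dot(word_vec, ctx_vec)))
    score += sum(math.log(_sigmoid(-_dot(word_vec, n))) for n in neg_vecs)
    return score
```

Maximising this score pulls the word vector towards its observed context words and pushes it away from the sampled negatives, avoiding the normalisation over the full vocabulary that the softmax requires.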
CW considers the local context of a word $w$ to be the $k$ words to the left and $k$ words to the right of $w$. The concatenation of the embeddings of $w$ and all its context words is taken as input to a neural network with one hidden layer, which produces a higher-level representation $s(w, C(w))$. Then the learning procedure replaces the embedding of $w$ with that of a randomly sampled word $w'$, and generates a second representation $s(w', C(w))$ with the same neural network. The training objective is to maximise the difference between them, via the ranking loss:

$$l(w, C(w)) = \max\left(0,\; 1 - s(w, C(w)) + s(w', C(w))\right)$$
This approach can be regarded as negative sampling with only one negative example.
Glove assumes the dot product of two word embeddings should be similar to the logarithm of the co-occurrence count of the two words. As such, the local factor becomes:

$$l(w_i, w_j) = f(X_{ij}) \left( \vec{w}_i^{\top}\vec{w}_j + b_i + b_j - \log X_{ij} \right)^2$$

where $b_i$ and $b_j$ are the bias terms of words $w_i$ and $w_j$, respectively, and $f(X_{ij})$ is a weighting function based on the co-occurrence count $X_{ij}$. This weighting function controls the degree of agreement between the parametric function $\vec{w}_i^{\top}\vec{w}_j + b_i + b_j$ and $\log X_{ij}$. Frequently co-occurring word pairs are given larger weight than infrequent pairs, up to a threshold.
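The local factor above can be sketched in a few lines of Python, using the weighting function of pennington2014glove with its published defaults ($x_{\max} = 100$, $\alpha = 0.75$); the function names are our own:

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: grows with co-occurrence count x,
    and is capped at 1 once x reaches x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_local_factor(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the parametric score
    (dot product plus biases) and the log co-occurrence count."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2
```

The factor is zero exactly when the parametric score matches the log co-occurrence count, and the cap on the weight prevents extremely frequent pairs (e.g. stop-word combinations) from dominating the objective.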
Brown partitions words into a finite set of word classes $\mathcal{C}$. The conditional probability of seeing the next word is defined to be:

$$P(w_i \mid w_{i-1}, \ldots, w_1) = P(w_i \mid c_i)\, P(c_i \mid c_{i-1}, \ldots, c_1)$$

where $c_i$ denotes the word class of word $w_i$, $w_{i-1}, \ldots, w_1$ are the previous words, and $c_{i-1}, \ldots, c_1$ are their respective word classes. The training objective is then the likelihood of the corpus under this model. Since there is no tractable method to find an optimal partition of word classes, the method uses only a bigram class model $P(c_i \mid c_{i-1})$, and utilises hierarchical clustering as an approximation method to find a sufficiently good partition of words.
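As a toy illustration of the class bigram factorisation, the probability of a word given the previous class decomposes into an emission term and a class transition term; the dictionary-based representation and names here are hypothetical, not the actual Brown implementation:

```python
def class_bigram_prob(word, prev_class, p_word_given_class,
                      p_class_transition, word_class):
    """P(w_i | w_{i-1}) under a class bigram model:
    P(w_i | c_i) * P(c_i | c_{i-1}), where c_i is w_i's (hard) class."""
    c = word_class[word]                      # each word has exactly one class
    emission = p_word_given_class[(word, c)]  # P(w_i | c_i)
    transition = p_class_transition[(prev_class, c)]  # P(c_i | c_{i-1})
    return emission * transition
```

Because each word belongs to exactly one class, the likelihood depends on the word-class partition only through these two tables, which is what makes the (approximate) hierarchical clustering search feasible.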
2.3 Building Word Representations
For a fair comparison, we train Brown, CBOW, Skip-gram, and Glove on a fixed corpus, comprised of freely available corpora, as detailed in Tab. 1. The joint corpus was preprocessed with the Stanford CoreNLP sentence splitter and tokeniser. All consecutive digit substrings were replaced by NUMf, where f is the length of the digit substring (e.g., 10.20 is replaced by NUM2.NUM2). Due to the computational complexity of the pre-training, for CW, we simply downloaded the pre-compiled embeddings from http://metaoptimize.com/projects/wordreprs.
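The digit normalisation step can be sketched with a one-line regular expression; the function name is our own:

```python
import re

def normalise_digits(token):
    """Replace each maximal run of digits with NUMf, where f is the
    length of the run, e.g. "10.20" -> "NUM2.NUM2"."""
    return re.sub(r"\d+", lambda m: "NUM%d" % len(m.group()), token)
```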
The dimensionality of the word embeddings and the size of the context window are the key hyperparameters when learning distributed representations. We use all combinations of the following values to train word embeddings on the combined corpus:
Context window size
Brown requires only the number of clusters as a hyperparameter. We perform clustering with clusters.
3 Sequence Labelling Tasks
We evaluate the different word representations over four sequence labelling tasks: POS-tagging (“POS-tagging”), full-text chunking (“Chunking”), NER (“NER”) and MWE identification (“MWE”). For each task, we feed features into a first-order linear-chain graph transformer [collobert2011natural] made up of two layers: the upper layer is identical to a linear-chain CRF [lafferty2001conditional], and the lower layer consists of word representation and hand-crafted features. If we treat the word representations as fixed, the graph transformer is a simple linear-chain CRF. On the other hand, if we treat the word representations as model parameters, the model is equivalent to a neural network with word embeddings as the input layer. We train all models using AdaGrad [duchi2011adaptive].
As in turian2010word, at each word position, we construct word representation features from the words in a context window of size two to either side of the target word, based on the pre-trained representation of each word type. For Brown, the features are the prefix features extracted from word clusters in the same way as turian2010word. As a baseline (and to test RQ1), we include a one-hot representation (which is equivalent to a linear-chain CRF with only lexical context features).
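The window-based feature construction described above can be sketched as follows; the flat feature layout and the zero-vector handling of boundary and OOV positions are our own illustrative assumptions:

```python
def window_features(tokens, i, embeddings, dim, window=2):
    """Concatenate pre-trained embeddings for the tokens in a window of
    `window` positions to either side of position i. Positions that fall
    outside the sentence, or whose word type has no pre-trained vector,
    contribute a zero vector of the same dimensionality."""
    feats = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(tokens) and tokens[j] in embeddings:
            feats.extend(embeddings[tokens[j]])
        else:
            feats.extend([0.0] * dim)
    return feats
```

For a window of two to either side, each position thus yields a feature vector of five concatenated word embeddings.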
Our hand-crafted features for POS-tagging, Chunking and MWE, are those used by collobert2011natural, turian2010word and mwecorpus, respectively. For NER, we use the same feature space as turian2010word, except for the previous two predictions, because we want to evaluate all word representations with the same type of model – a first-order graph transformer.
In training the distributed word representations, we consider two settings: (1) the word representations are fixed during sequence model training; and (2) the graph transformer updates the token-level word representations during training.
| Task | Training | Development | In-domain Test | Out-of-domain Test |
|---|---|---|---|---|
| POS-tagging | WSJ Sec. 0–18 | WSJ Sec. 19–21 | WSJ Sec. 22–24 | EWT |
| Chunking | WSJ | WSJ (1K sentences) | WSJ (CoNLL-00 test) | Brown |
| NER | Reuters (CoNLL-03 train) | Reuters (CoNLL-03 dev) | Reuters (CoNLL-03 test) | MUC7 |
| MWE | EWT (500 docs) | EWT (100 docs) | EWT (123 docs) | — |
As outlined in Tab. 2, for each sequence labelling task, we experiment over the de facto corpus, based on pre-existing training–dev–test splits where available (for the MWE dataset, no such split pre-existed, so we constructed our own):
- the Wall Street Journal portion of the Penn Treebank (Marcus:1993: “WSJ”) with Penn POS tags
- the Wall Street Journal portion of the Penn Treebank (“WSJ”), converted into IOB-style full-text chunks using the CoNLL conversion scripts for training and dev, and the WSJ-derived CoNLL-2000 full-text chunking test data for testing [TjongKimSang:Buchholz:2000]
- the English portion of the CoNLL-2003 Named Entity Recognition data set, for which the source data was taken from Reuters newswire articles (TjongKimSang:DeMeulder:2003: “Reuters”)
- the MWE dataset of mwecorpus, over a portion of text from the English Web Treebank (https://catalog.ldc.upenn.edu/LDC2012T13; “EWT”)
For all tasks other than MWE (unfortunately, there is no second domain which has been hand-tagged with MWEs using the method of mwecorpus to use as an out-of-domain test corpus), we additionally have an out-of-domain test set, in order to evaluate the out-of-domain robustness of the different word representations, with and without updating. These datasets are as follows:
- the English Web Treebank with Penn POS tags (“EWT”)
- the Brown Corpus portion of the Penn Treebank (“Brown”), converted into IOB-style full-text chunks using the CoNLL conversion scripts
- the MUC-7 named entity recognition corpus (https://catalog.ldc.upenn.edu/LDC2001T02; “MUC7”)
For reproducibility, we tuned the hyperparameters with random search over the development data for each task [bergstra2012random]. In this, we randomly sampled 50 distinct hyperparameter sets with the same random seed for the non-updating models (i.e. the models that don’t update the word representation), and sampled 100 distinct hyperparameter sets for the updating models (i.e. the models that do). For each set of hyperparameters and task, we train a model over its training set and choose the best one based on its performance on development data [turian2010word]. We also tune the word representation hyperparameters – namely, the word vector size and context window size (distributed representations), and in the case of Brown, the number of clusters.
For the updating models, we found that the results over the test data were always inferior to those that do not update the word representations, due to the higher number of hyperparameters and small sample size (i.e. 100). Since the two-layer model of the graph transformer contains a distinct set of hyperparameters for each layer, we reuse the best-performing hyperparameter settings from the non-updating models, and only tune the hyperparameters of AdaGrad for the word representation layer. This method requires only 32 additional runs and achieves consistently better results than 100 random draws.
In order to test the impact of the volume of training data on the different models (RQ2), we split the training set into 10 partitions based on a base-2 log scale (i.e., each partition is twice the size of the one before it), created 10 successively larger training sets by cumulatively merging these partitions from smallest to largest, and used each of these to train a model. From these, we construct learning curves over each task.
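One plausible reading of this partitioning scheme can be sketched in Python; the helper name and the rounding of fractional sizes are our own assumptions:

```python
def log_scale_splits(n_items, n_parts=10):
    """Cumulative training-set sizes on a base-2 log scale: partition k
    contains 2^k "units", so the k-th cumulative training set covers
    (2^(k+1) - 1) / (2^n_parts - 1) of the full training set."""
    total_units = 2 ** n_parts - 1
    sizes = []
    cum_units = 0
    for k in range(n_parts):
        cum_units += 2 ** k
        sizes.append(round(n_items * cum_units / total_units))
    return sizes
```

For example, with 1023 training instances this yields cumulative sizes 1, 3, 7, ..., 1023, so each point on the learning curve roughly doubles the amount of training data.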
For ease of comparison with previous results, we evaluate both in- and out-of-domain using chunk/entity/expression-level F1-measure (“F1”) for all tasks except POS-tagging, for which we use token-level accuracy (“Acc”). To test performance over OOV (unknown) tokens – i.e., the words that do not occur in the training set – we use token-level accuracy for all tasks (e.g. for Chunking, we evaluate whether the full IOB tag is correct or not), due to the sparsity of all-OOV chunks/NEs/MWEs.
4 Experimental Results and Discussion
| Task | Benchmark | In-domain Test set | Out-of-domain Test set |
|---|---|---|---|
| POS-tagging (Acc) | 0.972 [Toutanova:2003] | 0.959 (Skip-gram+UP) | 0.910 (Skip-gram) |
| Chunking (F1) | 0.942 [Sha:2003] | 0.938 (Brown) | 0.676 (Glove) |
| NER (F1) | 0.893 [Ando:2005] | 0.868 (Skip-gram) | 0.736 (Skip-gram) |
| MWE (F1) | 0.625 [Schneider+:2014] | 0.654 (CBOW+UP) | — |
We structure our evaluation by stepping through each of our five research questions (RQ1–5) from the start of the paper. In this, we make reference to: (1) the best-performing method both in- and out-of-domain vs. the state-of-the-art (Tab. 3); (2) a heat map for each task indicating the convergence rate for each word representation, with and without updating (Fig. 1); (3) OOV accuracy both in-domain and out-of-domain for each task (Fig. 2); and (4) visualisation of the impact of updating on word embeddings, based on t-SNE (Fig. 3).
RQ1: Are the selected word embeddings better than one-hot unigram features and Brown clusters?
As shown in Tab. 3, the best-performing method for every task except in-domain Chunking is a word embedding method, although the precise method varies greatly. Fig. 1, on the other hand, tells a more subtle story: the difference between Unigram and the other word representations is relatively modest, especially as the amount of training data increases. Additionally, the difference between Brown and the word embedding methods is modest across all tasks. So, the overall answer would appear to be: yes for unigrams when there is little training data, but not really for Brown.
RQ2: Do word embedding features require less training data?
Fig. 1 shows that for POS-tagging and NER, with only several hundred training instances, word embedding features achieve superior results to Unigram. For example, when trained with 561 instances, the POS-tagging model using Skip-gram+UP embeddings is 5.3% above Unigram; and when trained with 932 instances, the NER model using Skip-gram is 11.7% above Unigram. Similar improvements are also found for other types of word embeddings and Brown, when the training set is small. However, all word representations perform similarly for Chunking regardless of training data size. For MWE, Brown performs slightly better than the other methods when trained with approximately 25% of the training instances. Therefore, we conjecture that the POS-tagging and NER tasks benefit more from distributional similarity than Chunking and MWE.
RQ3: Does task-specific updating improve all word embeddings across all tasks?
Based on Fig. 1, updating word representations can both correct poorly-learned representations and harm pre-trained representations, due to overfitting. For example, Glove performs significantly worse than Skip-gram on both POS-tagging and NER without updating, but with updating, the gap between its results and those of the best-performing method becomes smaller. In contrast, Skip-gram performs worse over the test data with updating, despite the results on the development set improving by 1%.
To further investigate the effects of updating, we sampled 60 words and plotted the changes in their word embeddings under updating, using 2-d vector fields generated by using matplotlib and t-SNE [vanderMaaten:Hinton:2008]. Half of the words were chosen manually to include known word clusters such as days of the week and names of countries; the other half were selected randomly. Additional plots with 100 randomly-sampled words and the top-100 most frequent words, for all the methods and all the tasks, can be found in the supplementary material and at https://123abc123abd.wordpress.com/. In each plot, a single arrow signifies one word, pointing from the position of the original word embedding to the updated representation.
In Fig. 3, we show vector fields plots for Chunking and NER using Skip-gram embeddings. For Chunking, most of the vectors were changed with similar magnitude, but in very different directions, including within the clusters of days of the week and country names. In contrast, for NER, there was more homogeneous change in word vectors belonging to the same cluster. This greater consistency is further evidence that semantic homogeneity appears to be more beneficial for NER than Chunking.
RQ4: What is the impact of word embeddings cross-domain and for OOV words?
As shown in Tab. 3, results predictably drop when we evaluate out of domain. The difference is most pronounced for Chunking, where there is an absolute drop in F1 of around 30% for all methods, indicating that word embeddings and unigram features provide similar information for Chunking.
Another interesting observation is that updating often hurts out-of-domain performance, because the data distribution differs between domains. This suggests that, if the objective is to optimise performance across domains, it is best not to perform updating.
We also analyse performance on OOV words both in-domain and out-of-domain in Fig. 2. As expected, word embeddings and Brown excel in out-of-domain OOV performance. Consistent with our overall observations about cross-domain generalisation, the OOV results are better when updating is not performed.
RQ5: Overall, are some word embeddings better than others?
Comparing the different word embedding techniques over our four sequence labelling tasks, for the different evaluations (overall, out-of-domain and OOV), there is no clear winner among the word embeddings: for POS-tagging, Skip-gram appears to have a slight advantage, but this does not generalise to other tasks.
While the aim of this paper was not to achieve the state of the art over the respective tasks, it is important to concede that our best (in-domain) results for NER, POS-tagging and Chunking are slightly worse than the state of the art (Tab. 3). The 2.7% difference between our NER system and the best-performing system is due to the fact that we use a first-order instead of a second-order CRF [Ando:2005], and for the other tasks, there are similar differences in the learner and the complexity of the features used. Another difference is that we tuned the hyperparameters with random search, to enable replication using the same random seed. In contrast, the hyperparameters for the state-of-the-art methods are tuned more extensively by experts, making them more difficult to reproduce.
5 Related Work
collobert2011natural proposed a unified neural network framework that learns word embeddings, and applied it to POS-tagging, Chunking, NER and semantic role labelling. When they combined word embeddings with hand-crafted features (e.g., word suffixes for POS-tagging; gazetteers for NER) and applied other tricks like cascading and classifier combination, they achieved state-of-the-art performance. Similarly, turian2010word evaluated three different word representations on NER and Chunking, and concluded that unsupervised word representations improve NER and Chunking. They also found that combining different word representations can further improve performance. guo2014revisiting also explored different ways of using word embeddings for NER. owoputi2013improved and Schneider+:2014 found that Brown clustering enhances Twitter POS-tagging and MWE identification, respectively. Compared to previous work, we consider more word representations, including the most recent work, and evaluate them on more sequence labelling tasks, wherein the models are trained with training sets of varying size.
Bansal+:2014 reported that direct use of word embeddings in dependency parsing did not show improvement. They achieved an improvement only when they performed hierarchical clustering of the word embeddings, and used features extracted from the cluster hierarchy. In a similar vein, Andreas:Klein:2014 explored the use of word embeddings for constituency parsing and concluded that the information contained in word embeddings might duplicate that acquired by a syntactic parser, unless the training set is extremely small. Other syntactic parsing studies that reported improvements by using word embeddings include Koo:2008, Koo:2010, Haffari:2011, Tratz:2011 and chen:2014.
Word embeddings have also been applied to other (non-sequential NLP) tasks like grammar induction [Spitkovsky:2011], and semantic tasks such as semantic relatedness, synonymy detection, concept categorisation, selectional preference learning and analogy [baroni:2014].
Huang:2009 demonstrated that using distributional word representation methods (like TF-IDF and LSA) as features improves the labelling of OOV words, when tested on POS-tagging and Chunking. In our study, we evaluate the labelling performance over OOV words for updated vs. non-updated word embeddings, relative to the training set and over out-of-domain data.
6 Conclusion
We have performed an extensive extrinsic evaluation of four word embedding methods under fixed experimental conditions, and evaluated their applicability to four sequence labelling tasks: POS-tagging, Chunking, NER and MWE identification. We found that word embedding features reliably outperformed unigram features, especially with limited training data, but that there was relatively little difference over Brown clusters, and no one embedding method was consistently superior across the different tasks and settings. Word embeddings and Brown clusters were also found to improve performance over out-of-domain data and OOV words. We expected a performance gap between the fixed and task-updated embeddings, but the observed difference was marginal. Indeed, we found that updating can result in overfitting. We also carried out a preliminary analysis of the impact of updating on the vectors, a direction which we intend to pursue further.