Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representation on Sequence Labelling Tasks

04/21/2015, by Lizhen Qu et al. (CSIRO and Carnegie Mellon University)

Word embeddings -- distributed word representations that can be learned from unlabelled data -- have been shown to have high utility in many natural language processing applications. In this paper, we perform an extrinsic evaluation of five popular word embedding methods in the context of four sequence labelling tasks: POS-tagging, syntactic chunking, NER and MWE identification. A particular focus of the paper is analysing the effects of task-based updating of word representations. We show that when using word embeddings as features, as few as several hundred training instances are sufficient to achieve competitive results, and that word embeddings lead to improvements for out-of-vocabulary (OOV) words and on out-of-domain data. Perhaps more surprisingly, our results indicate there is little difference between the different word embedding methods, and that simple Brown clusters are often competitive with word embeddings across all tasks we consider.







1 Introduction

Recently, distributed word representations have grown to become a mainstay of natural language processing (NLP), and have been shown to have empirical utility in a myriad of tasks [Collobert2008, turian2010word, baroni:2014, Andreas:Klein:2014]. The underlying idea behind distributed word representations is simple: to map each word w in our vocabulary onto a continuous-valued vector of dimensionality d. Words that are similar (e.g., with respect to syntax or lexical semantics) will ideally be mapped to similar regions of the vector space, implicitly supporting both generalisation across in-vocabulary (IV) items, and countering the effects of data sparsity for low-frequency and out-of-vocabulary (OOV) items.

Without some means of automatically deriving the vector representations without reliance on labelled data, however, word embeddings would have little practical utility. Fortunately, it has been shown that they can be “pre-trained” from unlabelled text data using various algorithms to model the distributional hypothesis (i.e., that words which occur in similar contexts tend to be semantically similar). Pre-training methods have been refined considerably in recent years, and scaled up to increasingly large corpora.

As with other machine learning methods, it is well known that the quality of the pre-trained word embeddings depends heavily on factors including parameter optimisation, the size of the training data, and the fit with the target application. For example, turian2010word showed that the optimal dimensionality for word embeddings is task-specific. One factor which has received relatively little attention in NLP is the effect of “updating” the pre-trained word embeddings as part of the task-specific training, based on self-taught learning [raina2007self]. Updating leads to word representations that are task-specific, but often at the cost of over-fitting low-frequency and OOV words.

In this paper, we perform an extensive evaluation of four recently proposed word embedding approaches under fixed experimental conditions, applied to four sequence labelling tasks: POS-tagging, full-text chunking, named entity recognition (NER), and multiword expression (MWE) identification. Compared to previous empirical studies [collobert2011natural, turian2010word, pennington2014glove], we fill gaps in the literature by considering more word embedding approaches and evaluating them over more sequence labelling tasks. In addition, we explore the following research questions:

  1. are these word embeddings better than baseline approaches of one-hot unigram features and Brown clusters?

  2. do word embeddings require less training data (i.e. generalise better) than one-hot unigram features? If so, to what degree can word embeddings reduce the amount of labelled data?

  3. what is the impact of updating word embeddings in sequence labelling tasks, both empirically over the target task and geometrically over the vectors?

  4. what is the impact of these word embeddings (with and without updating) on both OOV items (relative to the training data) and out-of-domain data?

  5. overall, are some word embeddings better than others in a sequence labelling context?

2 Word Representations

2.1 Types of Word Representations

turian2010word identifies three varieties of word representations: distributional, cluster-based, and distributed.

Distributional representation methods map each word w to a context-word vector v_w, which is constructed directly from co-occurrence counts between w and its context words. The learning methods either store the co-occurrence counts between two words w_i and w_j directly in v_{w_i} [sahlgren2006word, turney2010frequency, honkela1997self], or project the co-occurrence counts between words into a lower-dimensional space [vrehuuvrek2010software, lund1996producing], using dimensionality reduction techniques such as SVD [dumais1988using] and LDA [blei2003latent].
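As a concrete illustration of the distributional approach, the following sketch builds a small word-by-word co-occurrence matrix and projects it into a lower-dimensional space with truncated SVD. The corpus, window size, and dimensionality are toy values chosen for illustration, not taken from the paper.

```python
import numpy as np

# Toy corpus: build a word-by-word co-occurrence matrix with a +/-1 window,
# then project it into a lower-dimensional space with truncated SVD.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                counts[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k left singular vectors, scaled by the
# singular values, as k-dimensional word vectors.
k = 2
U, S, Vt = np.linalg.svd(counts)
word_vectors = U[:, :k] * S[:k]    # each row is one word's vector
print(word_vectors.shape)          # (vocab size, k)
```

Words with similar co-occurrence rows (here, e.g., "cat" and "dog") end up with nearby rows in `word_vectors`.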

Cluster-based representation methods build clusters of words by applying either soft or hard clustering algorithms [lin2009phrase, li2005semi]. Some of them also rely on a co-occurrence matrix of words [pereira1993distributional]. The Brown clustering algorithm [Brown92class-basedn-gram] is the best-known method in this category.

Distributed representation methods usually map words into dense, low-dimensional, continuous-valued vectors v_w ∈ R^d, where d is referred to as the word dimension.

2.2 Selected Word Representations

Over a range of sequence labelling tasks, we evaluate five methods for inducing word representations: Brown clustering [Brown92class-basedn-gram] (“Brown”), the neural language model of Collobert & Weston (“CW”) [collobert2011natural], the continuous bag-of-words model (“CBOW”) [Mikolov13], the continuous skip-gram model (“Skip-gram”) [Mikolov13NIPS], and Global vectors (“Glove”) [pennington2014glove]. With the exception of CW, all have been shown to be at or near state-of-the-art in recent empirical studies [turian2010word, pennington2014glove]. CW is included because it was highly influential in earlier research, and its pre-trained embeddings are still used to some degree in NLP. The training of these word representations is unsupervised: the common underlying idea is to predict the occurrence of words in the neighbouring context. Their training objectives share the same form, namely a sum of local training factors s(w, C(w)):

L = Σ_{w ∈ V} s(w, C(w))

where V is the vocabulary of a given corpus, and C(w) denotes the local context of word w. The local context of a word can either be its previous words, or the words surrounding it. The local training factors are designed to capture the relationship between w and its local contexts of use, either by predicting w based on its local context, or by using w to predict the context words. Other than Brown, which utilises a cluster-based representation, all the other methods employ a distributed representation.

The starting point for CBOW and Skip-gram is to employ softmax to predict word occurrence:

p(w | C(w)) = exp(v_w · h) / Σ_{w′ ∈ V} exp(v_{w′} · h)

where h denotes the distributed representation of the local context of word w. CBOW derives h by averaging over the embeddings of the context words; that is, it estimates the probability of each word w given its local context. In contrast, Skip-gram applies softmax to each context word of a given occurrence of word w; in this case, h corresponds to the representation of one of its context words, and the model can be characterised as predicting context words based on w. In practice, softmax is too expensive to compute over large corpora, and thus Mikolov13NIPS use hierarchical softmax and negative sampling to scale up the training.
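The full-softmax factor can be sketched as follows; the embeddings are random and the vocabulary is tiny, purely to show why the normalisation over all of V is the expensive part.

```python
import numpy as np

# Sketch of Skip-gram's full-softmax factor: the probability of a context
# word c given centre word w is a softmax over the whole vocabulary.
rng = np.random.default_rng(0)
V, d = 100, 16                      # vocabulary size and embedding dimension
in_vecs = rng.normal(size=(V, d))   # "input" embeddings for centre words
out_vecs = rng.normal(size=(V, d))  # "output" embeddings for context words

def p_context_given_word(c, w):
    scores = out_vecs @ in_vecs[w]           # one score per vocabulary word
    scores -= scores.max()                   # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c]

# The distribution sums to one over all V words -- computing this
# normaliser for every training pair is what hierarchical softmax and
# negative sampling avoid.
total = sum(p_context_given_word(c, 3) for c in range(V))
print(round(total, 6))  # 1.0
```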

CW considers the local context of a word w to be the n words to the left and n words to the right of w. The concatenation of the embeddings of w and all its context words is taken as input to a neural network with one hidden layer, which produces a higher-level representation s(w, C(w)). The learning procedure then replaces the embedding of w with that of a randomly sampled word w′ and generates a second representation s(w′, C(w)) with the same neural network. The training objective is to maximise the difference between the two scores, via the margin loss:

max(0, 1 − s(w, C(w)) + s(w′, C(w)))

This approach can be regarded as negative sampling with only one negative example.
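A minimal sketch of this ranking criterion follows: score a genuine window with a one-hidden-layer scorer, corrupt the centre word, and apply the margin loss. The network weights are randomly initialised and the dimensions are illustrative.

```python
import numpy as np

# Sketch of the CW ranking criterion: score a real window, replace its
# centre embedding with a random one, and take a hinge loss between scores.
rng = np.random.default_rng(1)
d, ctx, hidden = 8, 2, 16           # embedding dim, context radius, hidden units
W1 = rng.normal(size=(hidden, (2 * ctx + 1) * d))
W2 = rng.normal(size=hidden)

def score(window_embeddings):
    x = np.concatenate(window_embeddings)    # concatenated window as input
    return W2 @ np.tanh(W1 @ x)              # one hidden layer -> scalar score

window = [rng.normal(size=d) for _ in range(2 * ctx + 1)]
corrupted = list(window)
corrupted[ctx] = rng.normal(size=d)          # negative sample: random centre word

# Hinge loss: push the true score above the corrupted score by a margin of 1.
loss = max(0.0, 1.0 - score(window) + score(corrupted))
print(loss >= 0.0)  # True
```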

Glove assumes that the dot product of two word embeddings should be similar to the logarithm of the co-occurrence count of the two words. As such, the local factor becomes:

f(C_{ij}) (v_i · v_j + b_i + b_j − log C_{ij})²

where b_i and b_j are the bias terms of words w_i and w_j, respectively, C_{ij} is their co-occurrence count, and f(C_{ij}) is a weighting function based on the co-occurrence count. This weighting function controls the degree of agreement between the parametric function v_i · v_j + b_i + b_j and log C_{ij}: frequently co-occurring word pairs receive larger weight than infrequent pairs, up to a threshold.
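The local factor can be sketched directly; the weighting function below uses the commonly cited form f(x) = (x / x_max)^α capped at 1, with x_max and α as illustrative defaults.

```python
import numpy as np

# Sketch of GloVe's local factor for one word pair (i, j):
#   f(C_ij) * (v_i . v_j + b_i + b_j - log C_ij)^2
# f caps the weight at 1, so very frequent pairs stop gaining influence.
def glove_factor(v_i, v_j, b_i, b_j, count, x_max=100.0, alpha=0.75):
    weight = min(1.0, (count / x_max) ** alpha)   # larger weight for frequent pairs
    residual = v_i @ v_j + b_i + b_j - np.log(count)
    return weight * residual ** 2

rng = np.random.default_rng(2)
v_i, v_j = rng.normal(size=8), rng.normal(size=8)
factor = glove_factor(v_i, v_j, 0.1, -0.2, count=2.0)
print(factor >= 0.0)  # True: each local factor is a weighted squared error
```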

Brown partitions words into a finite set of word classes. The conditional probability of seeing the next word w_i is defined to be:

p(w_i | w_{i−1}, …, w_1) = p(c_i | c_{i−1}, …, c_1) p(w_i | c_i)

where c_i denotes the word class of the word w_i, w_{i−1}, …, w_1 are the previous words, and c_{i−1}, …, c_1 are their respective word classes. Since there is no tractable method to find an optimal partition of word classes, the method uses only a bigram class model, i.e. p(w_i | w_{i−1}) = p(c_i | c_{i−1}) p(w_i | c_i), and utilises hierarchical clustering as an approximation method to find a sufficiently good partition of words.
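The class-bigram decomposition can be sketched with a hand-made class assignment (real Brown clustering learns the partition greedily); the words, classes, and token sequence below are illustrative.

```python
from collections import Counter

# Sketch of the class-bigram decomposition
#   p(w_i | w_{i-1}) = p(c_i | c_{i-1}) * p(w_i | c_i)
# over a toy corpus with a fixed, hand-made word-to-class mapping.
word_class = {"monday": "DAY", "friday": "DAY", "runs": "VERB", "sleeps": "VERB"}
tokens = ["monday", "runs", "friday", "sleeps", "monday", "sleeps"]

class_seq = [word_class[w] for w in tokens]
class_bigrams = Counter(zip(class_seq, class_seq[1:]))
class_counts = Counter(class_seq)
word_counts = Counter(tokens)

def p_next(word, prev_word):
    c, prev_c = word_class[word], word_class[prev_word]
    p_class = class_bigrams[(prev_c, c)] / class_counts[prev_c]  # p(c_i | c_{i-1})
    p_word = word_counts[word] / class_counts[c]                 # p(w_i | c_i)
    return p_class * p_word

print(round(p_next("runs", "monday"), 3))  # 0.333
```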

2.3 Building Word Representations

For a fair comparison, we train Brown, CBOW, Skip-gram, and Glove on a fixed corpus, comprised of freely available corpora, as detailed in Tab. 1. The joint corpus was preprocessed with the Stanford CoreNLP sentence splitter and tokeniser. All consecutive digit substrings were replaced by NUMf, where f is the length of the digit substring (e.g., 10.20 is replaced by NUM2.NUM2). Due to the computational complexity of the pre-training, for CW, we simply downloaded the pre-compiled embeddings.

Data set Size Words
UMBC 48.1GB 3G
One Billion 4.1GB 1G
English Wikipedia 49.6GB 3G
Table 1: Corpora used to pre-train the word embeddings
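The digit normalisation described above is a one-line substitution; the following sketch implements the behaviour as stated (each maximal digit run becomes NUMf, where f is its length).

```python
import re

# Replace every maximal digit run with "NUM" + its length, as described in
# the text: each consecutive digit substring becomes NUMf.
def normalise_digits(text):
    return re.sub(r"\d+", lambda m: f"NUM{len(m.group())}", text)

print(normalise_digits("10.20"))        # NUM2.NUM2
print(normalise_digits("in 1993, 42"))  # in NUM4, NUM2
```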

The dimensionality of the word embeddings and the size of the context window are the key hyperparameters when learning distributed representations. We use all combinations of the following values to train word embeddings on the combined corpus:

  • Embedding dim. 

  • Context window size

Brown requires only the number of clusters as a hyperparameter; we perform clustering over a range of cluster sizes.

3 Sequence Labelling Tasks

We evaluate the different word representations over four sequence labelling tasks: POS-tagging, full-text chunking (Chunking), named entity recognition (NER), and MWE identification (MWE). For each task, we feed features into a first-order linear-chain graph transformer [collobert2011natural] made up of two layers: the upper layer is identical to a linear-chain CRF [lafferty2001conditional], and the lower layer consists of word representations and hand-crafted features. If we treat the word representations as fixed, the graph transformer is a simple linear-chain CRF; if instead we treat the word representations as model parameters, the model is equivalent to a neural network with word embeddings as the input layer. We trained all models using AdaGrad [duchi2011adaptive].

As in turian2010word, at each word position we construct word representation features from the words in a context window of size two to either side of the target word, based on the pre-trained representation of each word type. For Brown, the features are prefix features extracted from the word clusters, in the same way as turian2010word. As a baseline (and to test RQ1), we include a one-hot representation (which is equivalent to a linear-chain CRF with only lexical context features).
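Because Brown clusters are learned by hierarchical merging, each word's cluster is a bit-string path in the merge tree, and prefixes of that path give clusterings at coarser granularities. The cluster paths and prefix lengths in this sketch are illustrative, not the settings used in the experiments.

```python
# Sketch of Brown prefix features: a word's cluster is a bit-string path in
# the hierarchical merge tree; path prefixes act as coarser cluster IDs.
# The paths and prefix lengths below are illustrative.
brown_paths = {"monday": "0110", "friday": "0111", "bank": "1010"}

def brown_prefix_features(word, prefix_lengths=(1, 2, 4)):
    path = brown_paths.get(word)
    if path is None:
        return []                              # OOV for the clustering
    return [f"brown_p{n}={path[:n]}" for n in prefix_lengths if len(path) >= n]

print(brown_prefix_features("monday"))  # ['brown_p1=0', 'brown_p2=01', 'brown_p4=0110']
```

Note that "monday" and "friday" share the coarse prefixes `0` and `01`, so the coarse features generalise across the two days even though their full clusters differ.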

Our hand-crafted features for POS-tagging, Chunking and MWE, are those used by collobert2011natural, turian2010word and mwecorpus, respectively. For NER, we use the same feature space as turian2010word, except for the previous two predictions, because we want to evaluate all word representations with the same type of model – a first-order graph transformer.

In training the distributed word representations, we consider two settings: (1) the word representations are fixed during sequence model training; and (2) the graph transformer updates the token-level word representations during training.

Training Development In-domain Test Out-of-domain Test
POS-tagging WSJ Sec. 0-18 WSJ Sec. 19–21 WSJ Sec. 22–24 EWT
Chunking WSJ WSJ (1K sentences) WSJ (CoNLL-00 test) Brown
NER Reuters (CoNLL-03 train) Reuters (CoNLL-03 dev) Reuters (CoNLL-03 test) MUC7
MWE EWT (500 docs) EWT (100 docs) EWT (123 docs)
Table 2: Training, development and test (in- and out-of-domain) data for each sequence labelling task.

As outlined in Tab. 2, for each sequence labelling task, we experiment over the de facto corpus, based on pre-existing training–dev–test splits where available (for the MWE dataset, no such split pre-existed, so we constructed our own):

  1. the Wall Street Journal portion of the Penn Treebank (Marcus:1993: “WSJ”) with Penn POS tags

  2. the Wall Street Journal portion of the Penn Treebank (“WSJ”), converted into IOB-style full-text chunks using the CoNLL conversion scripts for training and dev, and the WSJ-derived CoNLL-2000 full text chunking test data for testing [TjongKimSang:Buchholz:2000]

  3. the English portion of the CoNLL-2003 Named Entity Recognition data set, for which the source data was taken from Reuters newswire articles (TjongKimSang:DeMeulder:2003: “Reuters”)

  4. the MWE dataset of mwecorpus, over a portion of text from the English Web Treebank (“EWT”)

For all tasks other than MWE (unfortunately, there is no second domain which has been hand-tagged with MWEs using the method of mwecorpus to use as an out-of-domain test corpus), we additionally have an out-of-domain test set, in order to evaluate the out-of-domain robustness of the different word representations, with and without updating. These datasets are as follows:

  1. the English Web Treebank with Penn POS tags (“EWT”)

  2. the Brown Corpus portion of the Penn Treebank (“Brown”), converted into IOB-style full-text chunks using the CoNLL conversion scripts

  3. the MUC-7 named entity recognition corpus (“MUC7”)

For reproducibility, we tuned the hyperparameters with random search over the development data for each task [bergstra2012random]. In this, we randomly sampled 50 distinct hyperparameter sets with the same random seed for the non-updating models (i.e. the models that don’t update the word representation), and sampled 100 distinct hyperparameter sets for the updating models (i.e. the models that do). For each set of hyperparameters and task, we train a model over its training set and choose the best one based on its performance on development data [turian2010word]. We also tune the word representation hyperparameters – namely, the word vector size and context window size (distributed representations), and in the case of Brown, the number of clusters.

For the updating models, we found that the results over the test data were always inferior to those that do not update the word representations, due to the higher number of hyperparameters and small sample size (i.e. 100). Since the two-layer model of the graph transformer contains a distinct set of hyperparameters for each layer, we reuse the best-performing hyperparameter settings from the non-updating models, and only tune the hyperparameters of AdaGrad for the word representation layer. This method requires only 32 additional runs and achieves consistently better results than 100 random draws.

In order to test the impact of the volume of training data on the different models (RQ2), we split the training set into 10 partitions based on a base-2 log scale (i.e., the second smallest partition will be twice the size of the smallest partition), and created 10 successively larger training sets by merging these partitions from the smallest one to the largest one, and used each of these to train a model. From these, we construct learning curves over each task.
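The base-2 log-scale splitting can be sketched as follows; the rounding scheme and the example total of 10,230 instances are our assumptions, chosen so the sizes come out exact.

```python
# Sketch of the base-2 log-scale training splits: 10 partitions, each twice
# the size of the previous, merged cumulatively into 10 nested training sets.
def log_scale_splits(n_instances, n_parts=10):
    unit = n_instances / (2 ** n_parts - 1)        # smallest partition size
    sizes = [round(unit * 2 ** i) for i in range(n_parts)]
    sizes[-1] += n_instances - sum(sizes)          # absorb rounding error
    cumulative, total = [], 0
    for s in sizes:
        total += s
        cumulative.append(total)
    return cumulative                              # nested training-set sizes

sets = log_scale_splits(10230)
print(sets[0], sets[1], sets[-1])  # 10 30 10230
```

Each nested set doubles (plus the base) over the last, giving evenly spaced points on a log-scale learning curve.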

For ease of comparison with previous results, we evaluate both in- and out-of-domain using chunk/entity/expression-level F1-measure (“F1”) for all tasks except POS-tagging, for which we use token-level accuracy (“Acc”). To test performance over OOV (unknown) tokens – i.e., the words that do not occur in the training set – we use token-level accuracy for all tasks (e.g. for Chunking, we evaluate whether the full IOB tag is correct or not), due to the sparsity of all-OOV chunks/NEs/MWEs.
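The OOV evaluation described above reduces to scoring only the tokens absent from the training vocabulary; this sketch uses hypothetical IOB tags and made-up tokens.

```python
# Sketch of token-level OOV accuracy: score only tokens absent from the
# training vocabulary, comparing full predicted tags (e.g. IOB tags) to gold.
def oov_accuracy(tokens, gold, predicted, train_vocab):
    pairs = [(g, p) for t, g, p in zip(tokens, gold, predicted)
             if t not in train_vocab]
    if not pairs:
        return None                      # no OOV tokens to score
    return sum(g == p for g, p in pairs) / len(pairs)

train_vocab = {"the", "bank", "of"}
tokens = ["the", "zorp", "bank", "quux"]
gold = ["B-NP", "I-NP", "I-NP", "B-VP"]
pred = ["B-NP", "I-NP", "B-NP", "B-VP"]
print(oov_accuracy(tokens, gold, pred, train_vocab))  # 1.0 (both OOV tokens correct)
```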

4 Experimental Results and Discussion

Task Benchmark In-domain Test set Out-of-domain Test set
POS-tagging (Acc) 0.972 [Toutanova:2003] 0.959 (Skip-gram+UP) 0.910 (Skip-gram)
Chunking (F1) 0.942 [Sha:2003] 0.938 (Brown) 0.676 (Glove)
NER (F1) 0.893 [Ando:2005] 0.868 (Skip-gram) 0.736 (Skip-gram)
MWE (F1) 0.625 [Schneider+:2014] 0.654 (CBOW+UP)
Table 3: State-of-the-art results vs. our best results for in-domain and out-of-domain test sets.
Figure 1: Results for each type of word representation over POS-tagging (Acc), Chunking (F1), NER (F1) and MWE (F1), optionally with updating (“+UP”). The x-axis indicates the training data sizes (on a log scale). Green = high performance, and red = low performance, based on a linear scale of the best to worst result for each task.
Figure 2: Acc over out-of-vocabulary (OOV) words for in-domain and out-of-domain test sets.

We structure our evaluation by stepping through each of our five research questions (RQ1–5) from the start of the paper. In this, we make reference to: (1) the best-performing method both in- and out-of-domain vs. the state-of-the-art (Tab. 3); (2) a heat map for each task indicating the convergence rate for each word representation, with and without updating (Fig. 1); (3) OOV accuracy both in-domain and out-of-domain for each task (Fig. 2); and (4) visualisation of the impact of updating on word embeddings, based on t-SNE (Fig. 3).

RQ1: Are the selected word embeddings better than one-hot unigram features and Brown clusters?

As shown in Tab. 3, the best-performing method for every task except in-domain Chunking is a word embedding method, although the precise method varies greatly. Fig. 1, on the other hand, tells a more subtle story: the difference between Unigram and the other word representations is relatively modest, especially as the amount of training data increases. Additionally, the difference between Brown and the word embedding methods is modest across all tasks. So, the overall answer would appear to be: yes for unigrams when there is little training data, but not really relative to Brown.

RQ2: Do word embedding features require less training data?

Fig. 1 shows that for POS-tagging and NER, with only several hundred training instances, word embedding features achieve superior results to Unigram. For example, when trained with 561 instances, the POS-tagging model using Skip-gram+UP embeddings is 5.3% above Unigram; and when trained with 932 instances, the NER model using Skip-gram is 11.7% above Unigram. Similar improvements are also found for other types of word embeddings and Brown, when the training set is small. However, all word representations perform similarly for Chunking regardless of training data size. For MWE, Brown performs slightly better than the other methods when trained with approximately 25% of the training instances. Therefore, we conjecture that the POS-tagging and NER tasks benefit more from distributional similarity than Chunking and MWE.

RQ3: Does task-specific updating improve all word embeddings across all tasks?

Based on Fig. 1, updating of word representations can both correct poorly-learned word representations and harm pre-trained representations, the latter due to overfitting. For example, Glove performs significantly worse than Skip-gram on both POS-tagging and NER without updating, but with updating, the gap between its results and those of the best-performing method narrows. In contrast, Skip-gram performs worse over the test data with updating, despite its results on the development set improving by 1%.

To further investigate the effects of updating, we sampled 60 words and plotted the changes in their word embeddings under updating, using 2-d vector fields generated with matplotlib and t-SNE [vanderMaaten:Hinton:2008]. Half of the words were chosen manually to include known word clusters such as days of the week and names of countries; the other half were selected randomly. Additional plots with 100 randomly-sampled words and the top-100 most frequent words, for all the methods and all the tasks, can be found in the supplementary material. In each plot, a single arrow signifies one word, pointing from the position of the original word embedding to the updated representation.
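The vector-field construction can be sketched as follows: project the original and updated embeddings jointly into 2-d (t-SNE in the paper; random 2-d points stand in for the projection here), then each arrow runs from a word's old position to its new one. With matplotlib, `plt.quiver(x, y, u, v)` would render the arrows; the data below are synthetic.

```python
import numpy as np

# Sketch of the vector-field data: arrow tails at the projected original
# embeddings, arrow directions given by the displacement to the updated ones.
rng = np.random.default_rng(3)
n_words = 60
before_2d = rng.normal(size=(n_words, 2))                        # projected original embeddings
after_2d = before_2d + rng.normal(scale=0.3, size=(n_words, 2))  # projected updated embeddings

x, y = before_2d[:, 0], before_2d[:, 1]      # arrow tails
u, v = (after_2d - before_2d).T              # arrow directions
magnitudes = np.hypot(u, v)                  # per-word change in the 2-d view
print(magnitudes.shape)  # (60,)
```

Comparing the directions of arrows within a known cluster (e.g. days of the week) is what distinguishes the homogeneous NER updates from the scattered Chunking updates discussed below.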

In Fig. 3, we show vector fields plots for Chunking and NER using Skip-gram embeddings. For Chunking, most of the vectors were changed with similar magnitude, but in very different directions, including within the clusters of days of the week and country names. In contrast, for NER, there was more homogeneous change in word vectors belonging to the same cluster. This greater consistency is further evidence that semantic homogeneity appears to be more beneficial for NER than Chunking.

Figure 3: A t-SNE plot of the impact of updating on Skip-gram

RQ4: What is the impact of word embeddings cross-domain and for OOV words?

As shown in Tab. 3, results predictably drop when we evaluate out of domain. The difference is most pronounced for Chunking, where there is an absolute drop in F1 of around 30% for all methods, indicating that word embeddings and unigram features provide similar information for Chunking.

Another interesting observation is that updating often hurts out-of-domain performance, because the data distribution differs between domains. This suggests that, if the objective is to optimise performance across domains, it is best not to perform updating.

We also analyse performance on OOV words, both in-domain and out-of-domain, in Fig. 2. As expected, word embeddings and Brown excel in out-of-domain OOV performance. Consistent with our overall observations about cross-domain generalisation, the OOV results are better when updating is not performed.

RQ5: Overall, are some word embeddings better than others?

Comparing the different word embedding techniques over our four sequence labelling tasks, for the different evaluations (overall, out-of-domain and OOV), there is no clear winner among the word embeddings – for POS-tagging, Skip-gram appears to have a slight advantage, but this does not generalise to other tasks.

While the aim of this paper was not to achieve the state of the art over the respective tasks, it is important to concede that our best (in-domain) results for NER, POS-tagging and Chunking are slightly worse than the state of the art (Tab. 3). The 2.7% difference between our NER system and the best-performing system is due to the fact that we use a first-order rather than a second-order CRF [Ando:2005], and for the other tasks, there are similar differences in the learner and the complexity of the features used. Another difference is that we tuned the hyperparameters with random search, to enable replication using the same random seed. In contrast, the hyperparameters for the state-of-the-art methods are tuned more extensively by experts, making them more difficult to reproduce.

5 Related Work

collobert2011natural proposed a unified neural network framework that learns word embeddings, and applied it to POS-tagging, Chunking, NER and semantic role labelling. When they combined word embeddings with hand-crafted features (e.g., word suffixes for POS-tagging; gazetteers for NER) and applied other tricks like cascading and classifier combination, they achieved state-of-the-art performance. Similarly, turian2010word evaluated three different word representations on NER and Chunking, and concluded that unsupervised word representations improved both tasks. They also found that combining different word representations can further improve performance. guo2014revisiting also explored different ways of using word embeddings for NER. owoputi2013improved and Schneider+:2014 found that Brown clustering enhances Twitter POS-tagging and MWE identification, respectively. Compared to previous work, we consider more word representations, including the most recent work, and evaluate them on more sequence labelling tasks, wherein the models are trained with training sets of varying size.

Bansal+:2014 reported that direct use of word embeddings in dependency parsing did not show improvement. They achieved an improvement only when they performed hierarchical clustering of the word embeddings, and used features extracted from the cluster hierarchy. In a similar vein, Andreas:Klein:2014 explored the use of word embeddings for constituency parsing, and concluded that the information contained in word embeddings might duplicate that acquired by a syntactic parser, unless the training set is extremely small. Other syntactic parsing studies that report improvements from using word embeddings include Koo:2008, Koo:2010, Haffari:2011, Tratz:2011 and chen:2014.

Word embeddings have also been applied to other (non-sequential NLP) tasks like grammar induction [Spitkovsky:2011], and semantic tasks such as semantic relatedness, synonymy detection, concept categorisation, selectional preference learning and analogy [baroni:2014].

Huang:2009 demonstrated that using distributional word representation methods (like TF-IDF and LSA) as features improves the labelling of OOV words, when tested on POS-tagging and Chunking. In our study, we evaluate the labelling performance on OOV words for updated vs. non-updated word embeddings, relative to the training set and on out-of-domain data.

6 Conclusions

We have performed an extensive extrinsic evaluation of four word embedding methods under fixed experimental conditions, and evaluated their applicability to four sequence labelling tasks: POS-tagging, Chunking, NER and MWE identification. We found that word embedding features reliably outperformed unigram features, especially with limited training data, but that there was relatively little difference over Brown clusters, and no one embedding method was consistently superior across the different tasks and settings. Word embeddings and Brown clusters were also found to improve performance out-of-domain and for OOV words. We expected a performance gap between the fixed and task-updated embeddings, but the observed difference was marginal; indeed, we found that updating can result in overfitting. We also carried out a preliminary analysis of the impact of updating on the vectors, a direction which we intend to pursue further.