ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings

03/11/2019 ∙ by Xuan-Son Vu, et al. ∙ CSIRO ∙ Umeå universitet ∙ University of Tasmania

In this paper, we introduce a comprehensive toolkit, ETNLP, which can evaluate, extract, and visualize multiple sets of pre-trained word embeddings. First, for evaluation, ETNLP analyses the quality of pre-trained embeddings based on an input word analogy list. Second, for extraction, ETNLP provides a subset of the embeddings to be used in downstream NLP tasks. Finally, ETNLP has a visualization module for exploring the embedded words interactively. We demonstrate the effectiveness of ETNLP on our pre-trained word embeddings in Vietnamese. Specifically, we create a large Vietnamese word analogy list to evaluate the embeddings. We then use the pre-trained embeddings for the named entity recognition (NER) task in Vietnamese and achieve new state-of-the-art results on a benchmark dataset for the NER task. A video demonstration of ETNLP is available at https://vimeo.com/317599106. The source code and data are available at https://github.com/vietnlp/etnlp.


1 Introduction

Word embedding, also known as word representation, represents a word as a vector capturing both syntactic and semantic information, so that words with similar meanings have similar vectors Levy and Goldberg (2014). Classic embedding models, such as Word2Vec Mikolov et al. (2013), GloVe Pennington et al. (2014), and fastText Bojanowski et al. (2017), have been shown to improve the performance of existing models in a variety of tasks such as parsing Bansal et al. (2014), topic modeling Nguyen et al. (2015); Batmanghelich et al. (2016), and document classification Taddy (2015). However, these models associate each word with a single vector, which makes it difficult to represent how the word's meaning varies across linguistic contexts Peters et al. (2018). To overcome this problem, contextual embeddings (e.g., ELMO Peters et al. (2018) and BERT Devlin et al. (2018)) have recently been proposed and have helped existing models achieve new state-of-the-art results on many NLP tasks. Unlike non-contextual embeddings, ELMO and BERT can capture different latent syntactic-semantic information of the same word based on its contextual uses. Thus, this paper incorporates both classical embeddings (i.e., Word2Vec, fastText) and contextual embeddings (i.e., ELMO, BERT) to evaluate their performance.

Given the fact that there are many different types of word embedding models, we argue that building a unified toolkit, which can evaluate, extract, and visualize word embeddings for NLP tasks, is important. However, to our knowledge, there is no single toolkit which can perform all the tasks of evaluation, extraction, and visualization. For example, the recent framework called flair Akbik et al. (2018) is famous for stacking multiple embeddings but not for quick evaluation or visualization.

In this paper, we propose ETNLP, a comprehensive embedding toolkit, which can extract, evaluate and visualize the pre-trained embeddings. We detail the three main components which are evaluator, extractor, and visualizer as follows.

  • Evaluator: given multiple sets of pre-trained embeddings, how do we choose the embeddings that will potentially work best for downstream NLP tasks (e.g., NER)? Mikolov et al. (2013) presented a large benchmark for embedding evaluation based on a series of analogies. However, the benchmark is only for English, and there is no publicly available large benchmark for under-resourced languages like Vietnamese.

  • Extractor: given multiple sets of pre-trained embeddings, how do we take advantage of all of them? For instance, if people want to use character embeddings to handle the out-of-vocabulary (OOV) issue in the Word2Vec model, they must implement their own extractor to combine two different sets of embeddings. It becomes more complicated when they want to evaluate the performance of each set of embeddings separately as well as the combination of the two sets. The extractor API provided in ETNLP handles this seamlessly, which simplifies the process in NLP applications.

  • Visualizer: given a new set of word embeddings, people usually want to sample from the embedding set to inspect the semantic similarity between different words. To fulfill this requirement, we employ the well-known Embedding Projector (projector.tensorflow.org) to let users explore the embedding space interactively. Moreover, users can also compare the word similarity lists produced by multiple embeddings side by side (see the demo).

To demonstrate the effectiveness of ETNLP, we apply the toolkit to a use case in Vietnamese. Evaluating pre-trained embeddings in Vietnamese is a challenge because there is no publicly available large lexical resource, comparable to the English word analogy list, for evaluating the performance of pre-trained embeddings (a couple of datasets are available Nguyen et al. (2018b), but they are small, containing only 400 words). Moreover, in English every element of a word analogy record is a single word (e.g., grandfather — grandmother — king — queen), whereas in Vietnamese (e.g., ông nội — bà ngoại — vua — nữ_hoàng) there are many cases where only compound words can express the semantic relationship between the two word pairs of a word analogy record.

We propose a large word analogy list in Vietnamese which addresses these problems. With that word analogy list constructed, we train different embedding models, namely Word2Vec, fastText, ELMO and BERT, on Vietnamese Wikipedia data to generate different sets of word embeddings. We then use the word analogy list to select suitable sets of embeddings for the named entity recognition (NER) task in Vietnamese. We achieve new state-of-the-art results on VLSP 2016 (http://vlsp.org.vn/vlsp2016/eval/ner), a Vietnamese benchmark dataset for the NER task.

Here are our main contributions in this work:

  • Provide a general embedding toolkit (ETNLP) to let users evaluate, extract, and visualize multiple sets of word embeddings. The system is designed so that it can be used with any language.

  • Release a large word analogy list in Vietnamese for evaluating multiple word embeddings.

  • Train and release multiple sets of word embeddings for NLP tasks in Vietnamese, whose effectiveness is verified through new state-of-the-art results on a NER task in Vietnamese.

The rest of this paper is organized as follows. Section 2 describes how different embedding models are trained. Section 3 shows how to use the toolkit to evaluate, extract, and visualize word embeddings. Section 4 shows how the word embeddings are evaluated through the word analogy task and the NER task. Section 5 concludes the paper and outlines future work.

2 Embedding Models

This section details the word embedding models supported in our ETNLP toolkit.

  • Word2Vec (W2V) Mikolov et al. (2013): a widely used method in NLP for generating word embeddings.

  • W2V_C2V: the Word2Vec model faces the OOV issue on unseen text. We therefore provide a character2vec (C2V) embedding that builds vectors for unseen words at the character level. The C2V embedding can easily be calculated from a W2V model by averaging, for each character, the vectors of all words in which that character occurs.

  • fastText Bojanowski et al. (2016): fastText associates embeddings with character-based n-grams, and a word is represented as the summation of the representations of its character-based n-grams. Based on this design, fastText attempts to capture morphological information to induce word embeddings, and hence deals better with OOV words.

  • ELMO Peters et al. (2018): a model that generates an embedding for a word based on the context in which it appears. We therefore take the contexts in which the word appears in the training corpus and generate an embedding for each of its occurrences. The final embedding vector is the average of all its context embeddings.

  • BERT_{Base, Large} Devlin et al. (2018): BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. Unlike directional models such as ELMO, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words simultaneously. It is therefore considered bidirectional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word). BERT comes in two configurations, BERT_Base (12 layers) and BERT_Large (24 layers).
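
The character-level fallback described for W2V_C2V can be sketched as follows. This is a minimal illustration rather than ETNLP's actual implementation: the function names `build_c2v` and `embed_oov`, the toy vectors, and the choice to represent an unseen word by averaging its character vectors are all assumptions made for the example.

```python
from collections import defaultdict


def build_c2v(w2v):
    """Average, per character, the vectors of all words containing that character."""
    sums = {}
    counts = defaultdict(int)
    for word, vec in w2v.items():
        for ch in set(word):  # count each word once per distinct character
            if ch not in sums:
                sums[ch] = [0.0] * len(vec)
            sums[ch] = [s + v for s, v in zip(sums[ch], vec)]
            counts[ch] += 1
    return {ch: [s / counts[ch] for s in total] for ch, total in sums.items()}


def embed_oov(word, c2v, dim):
    """Assumed fallback: average the character vectors of an unseen word."""
    known = [c2v[ch] for ch in word if ch in c2v]
    if not known:
        return [0.0] * dim  # no known character at all
    return [sum(col) / len(known) for col in zip(*known)]


# Toy 2-dimensional "Word2Vec" vectors (hypothetical values).
toy_w2v = {"ab": [1.0, 0.0], "bc": [0.0, 1.0]}
toy_c2v = build_c2v(toy_w2v)
oov_vector = embed_oov("ac", toy_c2v, dim=2)
```

In this toy setting the character "b" occurs in both words, so its vector is the mean of the two word vectors, and the unseen word "ac" receives a vector built only from the characters it shares with the vocabulary.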

3 Basic Usages

Figure 1: General process of the ETNLP toolkit.

Figure 1 shows the general process of the toolkit. The four main processes of ETNLP are simple to call from either the command line or the Python API.

  • Pre-processing: since we use Word2Vec (W2V) format as the standard format for the whole process of ETNLP, we provide a pre-processing tool for converting different embedding formats to the W2V format. Figure 2 shows the command-line to convert from GloVe format to W2V format.

    Figure 2: Conversion from GloVe format to W2V format.
  • Evaluator: to evaluate multiple sets of embeddings on the word analogy task, users have to set the location of the word embeddings and of the word analogy list. To make it convenient to represent compound words, we use “ — ” to separate the different parts of a word analogy record instead of a space as in the English word analogy list. Figure 3 shows an example of two records in the Vietnamese word analogy list (on the left) and their translation (on the right). The lower part shows a command line to evaluate multiple sets of word embeddings on this task.

    Figure 3: Run evaluation on multiple word embeddings on the word analogy task.
  • Extractor: extracts embedding vectors at word level for other NLP tasks. For instance, the popular implementation of Reimers and Gurevych (2017) for the sequence tagging task allows users to set the location of the word embeddings. The file is text-based, i.e., each line contains the embedding of a word, and is then compressed in .gz format. Figure 4 shows a command line to extract multiple embeddings for an NLP task. The option “solveoov:1” tells the extractor to use the Character2Vec (C2V) embedding to solve OOV words in the first embedding “<emb_in#1>”. The “-input_c2v” argument can be omitted if users simply wish to extract embeddings from the embedding list given after the “-input_embs” argument.

    Figure 4: Run extractor to export single or multiple embeddings for NLP tasks.
  • Visualizer: visualizes the word embeddings given in the argument “-input_embs”. After execution, the embedding vectors are transformed into tensors for visualization with the Embedding Projector. Each set of word embeddings is served on a different local port, from which users can explore the embedding space using a Web browser. Figure 6 shows an example of the interactive visualization of “Hà_Nội” using ELMO embeddings. See Figure 5 for an example command line.

    Figure 5: Run visualizer to explore given pre-trained embedding models.
    Figure 6: Interactive visualization for the word “Hà_Nội” with ELMO embeddings.
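
To make the pre-processing step concrete, the sketch below shows the one structural difference between the two text formats: a word2vec text file starts with a header line giving the vocabulary size and the vector dimension, while a GloVe text file does not. The function name `glove_to_w2v_lines` and the sample vectors are hypothetical; this is not the ETNLP command-line tool itself, only an illustration of what such a conversion has to do.

```python
def glove_to_w2v_lines(glove_lines):
    """Return word2vec-format text lines for the given GloVe-format text lines."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        return ["0 0"]
    dim = len(rows[0].split(" ")) - 1  # the first field is the word itself
    for row in rows:
        if len(row.split(" ")) - 1 != dim:
            raise ValueError("inconsistent vector length: " + row[:30])
    # word2vec text format: "<vocab_size> <dim>" header, then one word per line
    return ["%d %d" % (len(rows), dim)] + rows


# Hypothetical GloVe-style input; compound words use "_" as in the text.
glove_sample = ["vua 0.1 0.2 0.3\n", "nữ_hoàng 0.4 0.5 0.6\n"]
w2v_sample = glove_to_w2v_lines(glove_sample)
```

Writing compound words with an underscore keeps each record on a single space-separated line, which is why the split on spaces is safe here.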

4 Evaluations: a use-case in Vietnamese

4.1 Training word embeddings

We trained the embedding models detailed in Section 2 on the Vietnamese Wikipedia dump (https://goo.gl/8WNfyZ). We applied sentence tokenization and word segmentation provided by VnCoreNLP Vu et al. (2018); Nguyen et al. (2018a) to pre-process all documents. For the BERT model, we had to (1) format the data differently for the next sentence prediction task; and (2) use SentencePiece Kudo and Richardson (2018) to tokenize the data for learning the pre-trained embedding. Due to limited computing resources, we could only run BERT_Base for 900,000 update steps and BERT_Large for 60,000 update steps. For a fair comparison, we therefore do not report the results of BERT_Large. We also create MULTI embeddings by concatenating four sets of embeddings, i.e., W2V_C2V, fastText, ELMO and BERT_Base (we do not use W2V here because W2V_C2V is W2V with the addition of character embeddings to deal with OOV words).
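
The MULTI construction can be illustrated with a small sketch. The component dimensions below are the "wemb dim" values reported in Table 3, and their sum matches the 2392 listed there for MULTI. The helper name `concat_embeddings`, the toy tables, and the zero-vector fallback for a word missing from one table are assumptions made for the example, not a description of ETNLP's code.

```python
def concat_embeddings(word, tables, dims):
    """Concatenate the vectors of `word` from several embedding tables.

    A table that lacks the word contributes a zero vector (an assumption
    made for this sketch).
    """
    merged = []
    for table, dim in zip(tables, dims):
        merged.extend(table.get(word, [0.0] * dim))
    return merged


# Dimensions as reported in Table 3 for the four component embeddings.
component_dims = {"W2V_C2V": 300, "fastText": 300, "ELMO": 1024, "BERT_Base": 768}
multi_dim = sum(component_dims.values())

# Toy tables with tiny dimensions (hypothetical values).
toy_tables = [{"vua": [0.1, 0.2]}, {"vua": [0.3]}, {}]
toy_vector = concat_embeddings("vua", toy_tables, dims=[2, 1, 2])
```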

Model       MAP@10   P-value
W2V         0.4796   -
W2V_C2V     0.4796   -
fastText    0.4970   See [1] & [2]
ELMO        0.4999   vs. fastText: 0.95 [1]
BERT_Base   0.4609   -
MULTI       0.4906   vs. fastText: 0.025 [2]
Table 1: Evaluation results of different word embeddings on the word analogy task. The P-value column shows results from paired t-tests.

Hyper-parameter              Search space
cemb dim (char embedding)    50, 100, 500
drpt (dropout rate)          0.3, 0.5, 0.7
lstm-s (LSTM size)           50, 100, 500
lrate (learning rate)        0.0005, 0.001, 0.005
Table 2: Grid search for hyper-parameters.
Model                          F1      wemb dim   cemb dim   drpt   lstm-s   lrate
BiLC3 Ma and Hovy (2016)       88.28   300        -          -      -        -
VNER Dong and Nguyen (2018)    89.58   300        300        0.6    -        0.001
VnCoreNLP Vu et al. (2018)     88.55   300        -          -      -        -
VnCoreNLP*                     91.30   1024       -          -      -        -
BiLC3 + W2V                    89.01   300        50         0.5    100      0.0005
BiLC3 + BERT_Base              88.26   768        500        0.3    100      0.0005
BiLC3 + W2V_C2V                89.46   300        100        0.5    500      0.0005
BiLC3 + fastText               89.65   300        500        0.3    100      0.001
BiLC3 + ELMO                   89.67   1024       100        0.7    500      0.0005
BiLC3 + MULTI                  91.09   2392       100        0.7    100      0.001
Table 3: Performance on the NER task using different embedding models. MULTI is the concatenation of four embeddings: W2V_C2V, fastText, ELMO, and BERT_Base. “wemb dim” is the dimension of the embedding model. VnCoreNLP* means we retrain VnCoreNLP with our pre-trained embeddings.

4.2 Dataset

The named entity recognition (NER) shared task at the 2016 VLSP workshop provides a dataset of 16,861 manually annotated sentences for training and development, and a set of 2,831 manually annotated sentences for testing, with four NER labels: PER, LOC, ORG, and MISC. The data was published in 2016 and recently reported in Nguyen et al. (2019). It is a standard benchmark for the NER task and has been used in Vu et al. (2018); Dong and Nguyen (2018). In the original dataset, each word representing a full personal name is separated into the syllables that constitute the word. Because this annotation scheme results in an unrealistic scenario for a pipeline evaluation Vu et al. (2018), we tested on a “modified” VLSP 2016 corpus in which we merge the contiguous syllables constituting a full name to form a word. A similar setup was also used in Vu et al. (2018); Dong and Nguyen (2018), the current state-of-the-art approaches.
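
The syllable-merging step can be pictured with the sketch below. It is an illustration under explicit assumptions: the BIO tag scheme (`B-PER`, `I-PER`), the underscore used to join syllables (chosen to match the compound-word notation used elsewhere in this paper, e.g. “Hà_Nội”), the function name `merge_name_syllables`, and the sample sentence are all hypothetical and are not taken from the VLSP 2016 data or from the scripts used to modify it.

```python
def merge_name_syllables(tokens, tags, entity="PER"):
    """Merge contiguous syllables of one entity span into a single token."""
    merged_tokens, merged_tags = [], []
    for token, tag in zip(tokens, tags):
        continues_span = (
            tag == "I-" + entity
            and merged_tags
            and merged_tags[-1] == "B-" + entity
        )
        if continues_span:
            merged_tokens[-1] += "_" + token  # extend the current name
        else:
            merged_tokens.append(token)
            merged_tags.append(tag)
    return merged_tokens, merged_tags


# Hypothetical syllable-level input with assumed BIO tags.
syllables = ["Nguyễn", "Văn", "An", "sống", "ở", "Huế"]
bio_tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC"]
words, word_tags = merge_name_syllables(syllables, bio_tags)
```

After merging, the three syllables of the personal name form one word carrying a single tag, which is the input a word-level tagger in a realistic pipeline would see.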

4.3 Word Analogy Task

To measure the quality of different sets of embeddings in Vietnamese, similar to Mikolov et al. (2013), we define a word analogy list consisting of 9802 word analogy records. To create the list, we selected suitable categories from the English word analogy list and translated them into Vietnamese. We also added customized categories which are suitable for Vietnamese (e.g., cities and their zones in Vietnam). Since most of this process is automated, it can easily be applied to other languages. To find out which set of word embeddings potentially works better for a downstream task, we restrict the vocabulary of the embeddings to the vocabulary of the task. Thus, only 3135 word analogy records are evaluated for the NER dataset (Section 4.2).
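
One way to read the vocabulary restriction is that an analogy record is kept only if all four of its entries occur in the vocabulary of the downstream task. The sketch below follows that reading, which is an assumption; the function name `filter_records` and the toy records and vocabulary are hypothetical. Under the figures given above, 3135 of 9802 records (about 32%) survive for the NER dataset.

```python
def filter_records(records, task_vocab):
    """Keep analogy records whose four entries all occur in the task vocabulary."""
    return [rec for rec in records if all(word in task_vocab for word in rec)]


# Hypothetical records and vocabulary; compound words use "_" as in the text.
toy_records = [
    ("vua", "nữ_hoàng", "ông", "bà"),
    ("Hà_Nội", "Việt_Nam", "Tokyo", "Nhật_Bản"),
]
toy_vocab = {"vua", "nữ_hoàng", "ông", "bà", "Hà_Nội", "Việt_Nam"}
kept = filter_records(toy_records, toy_vocab)
coverage = len(kept) / len(toy_records)
```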

Regarding the evaluation metric, Mikolov et al. (2013) used accuracy to measure the quality of word embeddings on this task: the model is credited with a true positive only when the expected word is at the top of the prediction list. However, this metric is not well suited to low-resource languages, where the training corpus is relatively small, i.e., 233M tokens in Vietnamese Wikipedia compared to 6B tokens in the Google News corpus. Therefore, we instead use the mean average precision (MAP) metric to measure quality on the word analogy task. MAP is widely used in information retrieval to evaluate results based on the top K returned results Manning et al. (2008). We use MAP@10 in this paper. Table 1 shows the evaluation results of different sets of embeddings on the word analogy task. The evaluator of ETNLP also reports P-values from paired t-tests on the raw MAP@10 scores (i.e., before averaging) between different sets of embeddings. The P-values (Table 1) show that the top three sets of word embeddings (i.e., fastText, ELMO, and MULTI) perform significantly better than the remaining ones, but that there is no significant difference among the three. Therefore, these sets of embeddings are selected for the NER task.
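
The difference between top-1 accuracy and MAP@10 can be seen in a short sketch. It uses the standard information-retrieval definition of average precision truncated at K; when each analogy record has a single expected word, this reduces to the reciprocal of that word's rank if it appears in the top 10, and 0 otherwise. Whether ETNLP's evaluator computes the score in exactly this way is an assumption, and the function names and toy predictions are hypothetical.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """Average precision over the top-k predictions for one record."""
    hits, score = 0, 0.0
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0


def map_at_k(predictions, gold, k=10):
    """Mean of the per-record average precision scores."""
    scores = [average_precision_at_k(p, g, k) for p, g in zip(predictions, gold)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ranked predictions for three analogy records.
toy_predictions = [["nữ_hoàng", "vua"], ["a", "bà", "c"], ["x", "y"]]
toy_gold = [{"nữ_hoàng"}, {"bà"}, {"z"}]
toy_map = map_at_k(toy_predictions, toy_gold)
```

In the toy data the second record has its expected word at rank 2, so it contributes 0.5 to MAP@10 but would count as a plain miss under top-1 accuracy. This partial credit is what makes the metric more forgiving for embeddings trained on a small corpus.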

4.4 NER Task

Model: We apply the currently best-known neural network architecture for the NER task, that of Ma and Hovy (2016), namely BiLSTM-CRF+CNN-char (BiLC3), with no modification to the architecture. Only in the embedding layer is a different set of word embeddings used, in order to evaluate their effectiveness. For the experiments, we perform a grid search over hyper-parameters and select the best parameters on the validation set to run on the test set. Table 2 presents the value ranges we searched. We also follow the same setting as Vu et al. (2018) and use the last 2000 records of the training data as the validation set. Moreover, since the VnCoreNLP code is available, we also retrain their model with our pre-trained embeddings (VnCoreNLP*).
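
The grid search over Table 2 can be sketched as below. The search space is copied from Table 2 and yields 3 × 3 × 3 × 3 = 81 configurations. The helper names and the `dummy_validation_f1` scoring function are placeholders for illustration only; in a real run the score would come from training BiLC3 with each configuration and measuring F1 on the validation set.

```python
from itertools import product

# Search space copied from Table 2.
search_space = {
    "cemb_dim": [50, 100, 500],
    "drpt": [0.3, 0.5, 0.7],
    "lstm_s": [50, 100, 500],
    "lrate": [0.0005, 0.001, 0.005],
}


def grid(space):
    """Yield every hyper-parameter combination as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def select_best(space, validation_score):
    """Return the configuration with the highest validation score."""
    return max(grid(space), key=validation_score)


def dummy_validation_f1(config):
    # Placeholder score; a real run would train BiLC3 and return validation F1.
    return -abs(config["drpt"] - 0.5) - abs(config["lstm_s"] - 100) / 1000


all_configs = list(grid(search_space))
best = select_best(search_space, dummy_validation_f1)
```

Selecting on the validation set and evaluating the chosen configuration once on the test set keeps the test score from being tuned indirectly.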

Main results: Table 3 shows the results of the NER task using different word embeddings. It shows that, by using embeddings pre-trained on Vietnamese Wikipedia data, we can achieve new state-of-the-art results on the task. The reason might be that fastText, ELMO and MULTI can handle OOV words and better capture the context of the words. Moreover, learning the embeddings from a formal dataset like Wikipedia is beneficial for the NER task. This is also supported by the fact that using our pre-trained embeddings in VnCoreNLP significantly boosts its performance (from 88.55 to 91.30 F1 in Table 3). Table 3 also shows that the F1 scores of the W2V, W2V_C2V and BERT_Base embeddings are lower than those of the three selected embeddings (i.e., fastText, ELMO and MULTI). This might indicate that using word analogy to select embeddings for downstream NLP tasks is sensible.

5 Conclusions

We have presented a new toolkit, ETNLP, for evaluating, extracting, and visualizing multiple pre-trained embeddings. The toolkit was designed with three principles in mind: (1) ease of use, (2) better performance, and (3) the ability to handle unknown vocabulary in real-world data (i.e., using C2V). The evaluation of the toolkit on the Vietnamese NER task showed its effectiveness. In the future, we plan to support more embeddings in different languages, especially low-resource languages. We will also apply the toolkit to other downstream NLP tasks, such as part-of-speech (POS) tagging Nguyen et al. (2017).

References