During the last few years, neural networks have significantly affected many Natural Language Processing (NLP) tasks. One of those tasks is representation learning for words, also known as word embeddings, which is a very useful unsupervised learning technique. Word embeddings can be beneficial for most NLP applications by increasing the overall system accuracy and capturing different aspects of similarity among words. The main idea behind word embeddings is the distributional hypothesis, which states that the meaning of a word can be captured by the context in which it appears.
Even though NLP is moving from static word representations to dynamic (contextualized) ones, there are still applications where static word embeddings (word2vec, fastText, GloVe) are used, for example in various RNN and CNN models. It is also known that in various NLP tasks the best results are achieved by concatenating context-aware word embeddings with static word embeddings [peters-etal-2018-deep, akbik2018contextual].
Many different approaches have been proposed for producing static word vectors. Two of the most popular are the Skip-gram and the Continuous Bag-of-Words (CBOW) architectures, as implemented in word2vec [Mikolov2013a] and fastText [Bojanowski2017]. In the Skip-gram model, nearby words are predicted given a source word, while in the CBOW model, the source word is predicted from its context. Even though both methods produce high-quality word representations, each achieves its highest accuracy in different categories of the word analogy questions. More precisely, the Skip-gram method achieves high accuracy in the semantic categories, while the CBOW method performs best in the syntactic ones.
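To make the difference between the two architectures concrete, the following minimal sketch lists the training examples each one would draw from a tokenized sentence. It assumes a fixed symmetric window and ignores subsampling, negative sampling and subword information; the function names are hypothetical and are not taken from word2vec or fastText.

```python
def context(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_examples(tokens, window):
    """Skip-gram: one (source word, predicted context word) pair per context word."""
    return [(w, c) for i, w in enumerate(tokens) for c in context(tokens, i, window)]


def cbow_examples(tokens, window):
    """CBOW: one (bag of context words, predicted source word) pair per position."""
    return [(context(tokens, i, window), w) for i, w in enumerate(tokens)]


if __name__ == "__main__":
    sentence = "the cat sat".split()
    print(skipgram_examples(sentence, window=1))
    print(cbow_examples(sentence, window=1))
```

Skip-gram therefore produces one update per (word, context word) pair, whereas CBOW produces one update per position, which is consistent with CBOW being the faster of the two to train.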
Our newly proposed architecture aims to combine the benefits of both the Skip-gram and CBOW approaches. The Continuous Bag-of-Skip-grams (CBOS) model achieves competitively high accuracy in both the semantic and the syntactic categories compared to the aforementioned models. These results lead to an overall increased accuracy of the word embeddings. In addition, the CBOS architecture does not increase the computational cost significantly, thanks to its efficient implementation. Thus, CBOS can be trained on very large text corpora within a reasonable time.
The rest of the paper is organized as follows. Section 2 briefly describes the data and tools used for training our model. Section 3 explains our proposed CBOS model and how it differs from other popular models. Section 4 presents the evaluation methods used for comparing models in the experimental setup, and Section 5 shows the results of the different experiments. Finally, Section 6 provides conclusions based on these results.
2 Data Sources and Tools
In this section, we describe the datasets used in this research, along with their sources. We also present the tools and libraries used for developing the word embedding models.
2.1 Wikipedia Corpus
Wikipedia is the largest free online encyclopedia, available in more than 200 distinct languages. Its text is considered to be of high quality because the articles are edited, which makes Wikipedia an excellent resource for natural language processing in most languages. It has been used for tasks such as information extraction [Wu2010] and word sense disambiguation [Mihalcea2007].
In this work, we used the first 10^9 bytes of the English Wikipedia dump of March 3, 2006, provided by Matt Mahoney (http://mattmahoney.net/dc/textdata.html). The data is UTF-8 encoded XML consisting primarily of English text. The English Wikipedia corpus contains 243K article titles. The primary preprocessing step was to extract the text content from the XML dump. For this purpose, we used the script wikifil.pl as published by Matt Mahoney. The final preprocessed file consists of 680MB of text data and 124M words.
In addition, the Greek Wikipedia dump from December 2018 was used for training. A few basic preprocessing steps were applied: all words were lowercased and punctuation was removed. The final text file used for training contains 800MB of text data and 68M words.
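The two preprocessing steps described above can be sketched as follows. This is only an illustration under our own assumptions, not the script that was actually used: punctuation is taken to mean any character in a Unicode punctuation category, each such character is replaced by a space, and the function name `preprocess` is hypothetical.

```python
import unicodedata


def preprocess(line: str) -> str:
    """Lowercase a line and remove punctuation (assumed: Unicode categories P*)."""
    lowered = line.lower()
    # Replace punctuation with a space so that adjacent words are not glued together.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in lowered
    )
    # Collapse the runs of whitespace left behind by the removed characters.
    return " ".join(no_punct.split())


if __name__ == "__main__":
    print(preprocess("Καλημέρα, κόσμε!"))
```

Replacing punctuation with a space rather than deleting it is a design choice of this sketch; it keeps tokens such as "word,word" separate.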
2.2 Greek Web Content Corpus
The most extensive Greek corpus available was collected by crawling about 20M URLs with Greek-language content. The crawled content was first extracted in Web Archive (WARC) format, and several pre-processing and extraction steps were then applied. This process produced a single uncompressed text file, which was provided to us and used in this work. Greek-language n-grams were also made available. Some details of the Greek corpus are listed below:
Raw crawled text size: 10TB
Text after pre-processing size: 50GB
Unique sentences: 120M
2.3 FastText Library
FastText is a free, open-source, lightweight library that allows users to learn text representations and text classifiers. It is written in C++, supports multiprocessing during training, and allows the user to train supervised and unsupervised representations of words and sentences. FastText supports training Continuous Bag-of-Words or Skip-gram models using negative sampling, softmax or hierarchical softmax loss functions. Furthermore, fastText offers a variety of tuning parameters, so users can find the combination of parameters that best suits their needs.
Our contribution to the fastText library is the CBOS method, which can be used for training. The source code is publicly available at https://github.com/mikeliou/greek_word_embeddings.
3 Proposed Model
3.1 Continuous Bag-of-Skip-grams
The new model proposed in this work, Continuous Bag-of-Skip-grams (CBOS), is a combination of the CBOW and Skip-gram models and is named accordingly. The main idea behind CBOS is that, given a word w and a context window c, training should use both techniques in order to combine their benefits. We therefore consider two training phases:
using w for predicting every word in c and
using every word in c for predicting a single word in the context window.
More specifically, training is implemented as follows:
A phase where w is trained by predicting every word in the context window c (Skip-gram)
A phase where a bag-of-words is created from all words in the context window c, excluding a word p, which serves as the prediction target, and the word w, which was used for training in the previous phase (CBOW).
It is essential to note that in the second phase, the word p is selected randomly among the words included in the window. Furthermore, the CBOS method includes every feature and tuning parameter proposed by [Mikolov2013a, Mikolov2013, Bojanowski2017] as implemented in the fastText library (e.g. subword information, negative sampling).
For example, consider the sentence "I am reading a paper about word embeddings" with a window of 2 words before and after the current word. Suppose the current word for the first phase of training is "paper" and the randomly selected word for the second-phase prediction is "about". In the first phase, "paper" makes four predictions, one for each word in the context window ("reading", "a", "about", "word"). In the second phase, the vectors of all window words except the randomly selected one ("about") and the one used for training in the previous phase ("paper"), that is "reading", "a" and "word", are summed into a single vector, which is used to predict the word "about". This example is illustrated in Figure 1.
This simple step added to the training of each word appears to be important for improving the quality of the word embeddings. The additional work introduced by this step does not change the complexity class of the algorithm, as indicated by the execution times of the different models reported below.
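The two phases can be sketched in a few lines. This is a simplified illustration of how the training examples for one position are formed, not the fastText C++ implementation: the window is fixed and symmetric, the actual model updates (negative sampling, subwords) are omitted, and the names `cbos_examples` and `pick` are hypothetical. The `pick` argument only exists so that the random choice of p can be fixed when reproducing the example from the text.

```python
import random


def cbos_examples(tokens, i, window, pick=random.randrange):
    """Return (phase1, phase2) training examples for the word at position i.

    phase1: list of (w, context word) pairs, as in Skip-gram.
    phase2: (bag of words, p), where p is one randomly chosen context word
            and the bag holds the remaining context words; None if the
            context has fewer than two words.
    """
    w = tokens[i]
    ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

    # Phase 1 (Skip-gram): w predicts every word in the context window.
    phase1 = [(w, c) for c in ctx]

    # Phase 2 (CBOW-like): choose the target p at random among the context
    # words; the bag is the rest of the context (w is not part of ctx).
    if len(ctx) < 2:
        return phase1, None
    k = pick(len(ctx))
    bag = ctx[:k] + ctx[k + 1:]
    return phase1, (bag, ctx[k])


if __name__ == "__main__":
    sentence = "I am reading a paper about word embeddings".split()
    # Fix the random choice to index 2 of the context so that p is "about".
    phase1, phase2 = cbos_examples(sentence, sentence.index("paper"), 2, pick=lambda n: 2)
    print(phase1)
    print(phase2)
```

In this sketch each position yields one prediction more than Skip-gram does (the single bag-of-words prediction), which matches the observation that the second phase adds only modest extra work.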
4.1 Word Analogy
Word embeddings are very helpful for a broad range of NLP prediction tasks. A simple way to evaluate word embeddings is to use the produced vectors to predict syntactic and semantic relationships such as "king is to queen as father is to ?".
Word analogy was first studied in [Mikolov2013], which showed that relationships between words can be captured as the offset between their vectors. Solving analogies has become one of the most common benchmarks for word embeddings, on the assumption that the quality of an embedding is reflected in linear relationships between word pairs (e.g., man:king :: woman:queen). The evaluation is based on the idea that such relationships can be recovered with arithmetic operations in the word vector space: given three words a, b and c, the task is to identify the word d such that the relationship c:d is the same as the relationship a:b [Pereira2016, Turian2010]. For instance, if a=Paris, b=France and c=Moscow, the target word is Russia: the relationship a:b is capital:country, so one must find the country whose capital is Moscow. This evaluation method, called analogical reasoning, was presented by [Mikolov2013]. The evaluation dataset published by Mikolov and colleagues was used for the evaluation of the English word embeddings in this work.
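The vector-offset idea can be illustrated with a small sketch. It assumes the commonly used formulation in which d is the vocabulary word whose vector has the highest cosine similarity to b - a + c, with a, b and c themselves excluded from the candidates. The three-dimensional vectors below are made-up toy values chosen for readability, not trained embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes a:b :: c:d via the offset b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))


if __name__ == "__main__":
    # Toy vectors (assumed values): axes ~ "French", "Russian", "is a capital".
    toy = {
        "Paris": [1.0, 0.0, 1.0],
        "France": [1.0, 0.0, 0.0],
        "Moscow": [0.0, 1.0, 1.0],
        "Russia": [0.0, 1.0, 0.0],
        "Berlin": [0.2, 0.2, 1.0],
    }
    print(solve_analogy(toy, "Paris", "France", "Moscow"))
```

With these toy values the query vector France - Paris + Moscow equals the vector of Russia exactly; with real embeddings the match is only approximate, which is why the nearest neighbour is taken.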
An evaluation framework for Greek word embeddings has recently been introduced by [Outsios2019]. This framework focuses on intrinsic evaluation, assessing the trained word embeddings on word similarity and on word analogy with semantic and syntactic questions. In this work, we use its word analogy dataset for the evaluation of the Greek word embeddings.
5 Experimental Results
5.1 Alternative CBOS implementations
Before settling on the CBOS model proposed earlier, we implemented several versions of CBOS in order to achieve the highest accuracy on the Greek word analogy task. The different versions of CBOS are described below:
Next-word incremental CBOS: After the first phase of predictions, the bag-of-words is formed incrementally starting from the first word at the left. After the addition of each word to the bag-of-words, a prediction is made on the next word.
Central-word incremental CBOS: The same process as in the previous method is followed but, instead of predicting the next word, the prediction is made on the central word of the window.
Non-random CBOS: This implementation follows the same steps as CBOS, except that the word predicted in the second phase is not chosen randomly; it is always the central word.
Variable context window CBOS: In the second phase of CBOS, the context window is changed to a random number between 1 and 5. Thus, the bag-of-words could contain different words used for the second phase of training.
Non-repeated words CBOS: This method does not add any word to the bag-of-words that is already contained.
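As an illustration of the two incremental variants listed above, the sketch below shows one possible reading of how their second-phase examples could be formed. The descriptions leave some details open, so the following are assumptions of this sketch rather than the implemented behaviour: the bag grows from left to right over the context words only (the central word is not added to it), "next word" means the next context word, and the function name `incremental_examples` is hypothetical.

```python
def incremental_examples(context_words, center, mode):
    """Second-phase (bag, predicted word) examples for the incremental variants.

    mode="next":    after adding each word, predict the next context word.
    mode="central": after adding each word, predict the central word.
    """
    if mode not in ("next", "central"):
        raise ValueError("mode must be 'next' or 'central'")
    examples = []
    for k in range(1, len(context_words) + 1):
        bag = context_words[:k]  # bag grows by one word per step, left to right
        if mode == "central":
            examples.append((bag, center))
        elif k < len(context_words):  # the last word has no next word to predict
            examples.append((bag, context_words[k]))
    return examples


if __name__ == "__main__":
    ctx = ["reading", "a", "about", "word"]
    for mode in ("next", "central"):
        print(mode, incremental_examples(ctx, "paper", mode))
```

Under this reading both variants make several second-phase predictions per position, compared with the single prediction of the CBOS model described in Section 3.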
For the comparison presented in Table 1, the Greek Wikipedia dataset and the default parameters were used for training. For the evaluation, the closest vector is evaluated and the out-of-vocabulary (OOV) words are excluded.
|Model||Semantic||Syntactic||Total|
|Variable context window||48.49||47.49||47.92|
5.2 English Wikipedia Corpus
For the first evaluation, the English Wikipedia dataset was used, which consists of 680MB of text data and 124M words. The three models were trained using the default parameters provided by the fastText library and were evaluated using the word analogy task for the English language [Mikolov2013a]. Only the closest vector (top-1) is considered for a successful prediction. Questions containing out-of-vocabulary (OOV) words are excluded. Results are presented in Table 2.
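The scoring rule used throughout the experiments (top-1 prediction, OOV questions excluded) can be sketched as follows. This is a minimal illustration rather than the evaluation script used here: `predict` stands for any function that returns the single closest word for a question (a, b, c), and all names are hypothetical.

```python
def analogy_accuracy(questions, vocab, predict):
    """Top-1 accuracy (in %) over analogy questions a:b :: c:d.

    Questions containing any out-of-vocabulary word are skipped and do not
    count towards the accuracy. Returns (accuracy, evaluated, skipped).
    """
    correct = evaluated = skipped = 0
    for a, b, c, d in questions:
        if not all(word in vocab for word in (a, b, c, d)):
            skipped += 1
            continue
        evaluated += 1
        # Only the single closest word counts as a successful prediction.
        if predict(a, b, c) == d:
            correct += 1
    accuracy = 100.0 * correct / evaluated if evaluated else 0.0
    return accuracy, evaluated, skipped


if __name__ == "__main__":
    vocab = {"man", "king", "woman", "queen", "paris", "france"}
    questions = [
        ("man", "king", "woman", "queen"),
        ("paris", "france", "moscow", "russia"),  # skipped: OOV words
        ("king", "man", "queen", "woman"),
    ]
    # Stub predictor used only for demonstration; it always answers "queen".
    print(analogy_accuracy(questions, vocab, lambda a, b, c: "queen"))
```

Because skipped questions are not counted, accuracies obtained with different vocabularies are computed over different question sets, which is worth keeping in mind when comparing corpora.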
The CBOW model outperforms the other two in the syntactic category and in execution time, and the Skip-gram model leads the semantic category. Even though the CBOS model does not achieve the highest accuracy in either the semantic or the syntactic category, it outperforms the other two models in the total score.
Since the CBOS model performs a few more iterations over the training data than the Skip-gram and CBOW models, due to the second phase of training, we ran additional experiments in order to have a fair comparison. In this round of experiments, we trained the CBOW and Skip-gram models with twice as many epochs (10) as the CBOS model (5). The results are shown in Table 3.
|Model||Semantic||Syntactic||Total||Execution time|
|CBOW (10 epochs)||40.21||71.45||50.47||20m 14.019s|
|Skip-gram (10 epochs)||47.26||63.80||52.69||29m 27.670s|
|CBOS (5 epochs)||42.94||68.08||51.20||21m 10.789s|
The CBOW model achieves the best accuracy in the syntactic category and, together with CBOS, has the shortest training time. The Skip-gram model outperforms the other two in the semantic category and the total score but has the worst execution time.
5.3 Greek Wikipedia Corpus
The Greek Wikipedia dataset, which consists of 800MB of text data and 68M words (see Section 2.1), was used as the training corpus for the second set of evaluations. The three methods (CBOS, CBOW and Skip-gram) were trained using the default tuning parameters of the fastText library. The evaluation was performed using the word analogy task for the Greek language [Outsios2019]. Only the closest vector (top-1) is used for evaluation, and questions containing OOV words are excluded. Results are presented in Table 4.
On the Greek Wikipedia dataset, the CBOS approach achieves the highest accuracy in all categories; only in execution time is it surpassed, by the CBOW approach, which is the fastest.
|Model||Semantic||Syntactic||Total||Execution time|
|CBOW (10 epochs)||32.71||45.16||39.81||16m 25.623s|
|Skip-gram (10 epochs)||58.73||41.73||49.03||22m 39.659s|
|CBOS (5 epochs)||52.72||48.23||50.16||16m 55.784s|
Furthermore, we trained the Skip-gram and CBOW models with twice as many epochs (10) as the CBOS model (5). The results in Table 5 show that the Skip-gram method trained with double the epochs achieves the highest accuracy in the semantic category, but the CBOS method outperforms the other two methods in the syntactic category and in total accuracy. The CBOW method has the shortest execution time, by a slight margin over CBOS.
5.4 Greek Web Content Corpus
The next round of evaluations used the Greek Web Content dataset, which contains 50GB of pre-processed text (see Section 2.2) and 3B tokens, for the training of the three models. Every model was trained using the same parameters as in the previous two evaluations. The word analogy task for the Greek language was used for the evaluation, considering only the closest vector, and questions containing OOV words were not evaluated.
In this round of evaluations shown in Table 6, the CBOW model is the fastest, and the Skip-gram model leads the semantic category. The CBOS model, though, achieves the highest accuracy in the syntactic category and in the total accuracy.
Once more, we also evaluated the three models after doubling the epochs (10) of the Skip-gram and CBOW models.
|Model||Semantic||Syntactic||Total||Execution time|
|CBOW (10 epochs)||21.01||55.26||43.16||791m 46.704s|
|Skip-gram (10 epochs)||44.35||51.07||48.69||1395m 20.622s|
|CBOS (5 epochs)||41.16||62.39||54.89||810m 43.196s|
The results in Table 7 show that the CBOS method outperforms the other two models in the syntactic category and in total accuracy even when they are trained with twice as many epochs. The Skip-gram method leads the semantic category, and the CBOW method has the shortest execution time.
This paper focused on producing high-quality word embeddings for the Greek language by devising a new embedding method, the Continuous Bag-of-Skip-grams (CBOS). CBOS combines the benefits of the CBOW and Skip-gram approaches introduced in [Mikolov2013]. Thanks to its efficient implementation, CBOS does not significantly increase the computational cost of the training phase. In addition, we show that CBOS achieves higher total accuracy than the CBOW and Skip-gram models when they are trained on the same data with the same parameters.
Future work could include training our newly proposed model on the Common Crawl dataset for the Greek language. Comparing those results with the ones presented in this work should give a more complete picture. Moreover, a comparison with other state-of-the-art methods could be considered as an extension.
We would like to thank Ion Androutsopoulos for the useful discussions and suggestions.