1 Introduction & Related work
In this work, we present Greek word embeddings trained on, to the best of our knowledge, the largest corpus available so far for the language, collected by crawling about 20M URLs with Greek content. The vocabulary and word vectors are available on request. We also developed a live web tool that lets users interact with the Greek word embeddings: it provides similarity scores, most-similar-word queries, and analogy operations. In addition, we present a vector explorer that projects a sample of the word vectors.
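The similarity and analogy operations mentioned above can be sketched with a toy example. The vectors and Greek words below are purely illustrative stand-ins (the live tool uses the trained 300-dimensional vectors); the analogy follows the standard vector-offset formulation.

```python
import math

# Toy 3-dimensional vectors for illustration only; the real embeddings
# are 300-dimensional and trained on the crawled corpus.
vectors = {
    "βασιλιάς": [0.9, 0.1, 0.3],   # king
    "βασίλισσα": [0.8, 0.9, 0.3],  # queen
    "άνδρας": [0.7, 0.0, 0.1],     # man
    "γυναίκα": [0.6, 0.8, 0.1],    # woman
    "πόλη": [0.1, 0.2, 0.9],       # city (distractor)
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("άνδρας", "γυναίκα", "βασιλιάς"))  # expected: βασίλισσα
```

With real embeddings the candidate set is the whole vocabulary, so the query is answered by a nearest-neighbour search over all word vectors.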
Recently, pre-trained word vectors for 294 languages were introduced, trained on Wikipedia using FastText. These 300-dimensional vectors were obtained with the skip-gram model. Their Greek variant was trained only on the Greek Wikipedia articles.
Visualization tools for word embeddings are of great importance, since they aid the interpretation of their nature. Similarly to our tool, the TensorFlow Embedding Projector (https://projector.tensorflow.org/) illustrates a sample of word embeddings after applying dimensionality reduction techniques. Finally, the training process itself can be observed with WEVI, a word embedding visual inspector (https://ronxin.github.io/wevi/).
2 Crawling the Greek Web
For crawling the Greek Web (funded by the Stavros Niarchos Foundation, https://www.snf.org, on behalf of the National Library of Greece) we used the Heritrix crawler (http://crawler.archive.org/). The collected websites are stored in the international Web ARChive (WARC) format, which defines a method for combining multiple media resources into a single archive file. Below are some statistics about the crawled data:
Number of WARCs: 112K
Size of HTML (stored in WARC format): 10TB
Number of Greek domains: 350K
Number of URLs: 20M
Duration of crawling: 45 days
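A WARC record is a block of plain-text headers followed by the captured payload. The following minimal sketch parses the headers of one simplified, uncompressed record; the record bytes are fabricated for illustration, and a production pipeline would use a dedicated library such as warcio rather than hand-rolled parsing.

```python
# A fabricated, simplified WARC record: version line, header lines,
# a blank line, then the payload.
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.gr/\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
)

def parse_warc_headers(raw):
    """Split one raw WARC record into (version, header dict, payload)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return version, headers, body

version, headers, body = parse_warc_headers(record)
print(version, headers["WARC-Target-URI"], len(body))
```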
3 Pre-Processing & Text Extraction
Before training, we applied several pre-processing and extraction steps on the raw crawled text:
detect the encoding of each webpage, so that it can be read properly,
remove boilerplate text (https://en.wikipedia.org/wiki/Boilerplate_code),
remove all non-Greek characters,
preserve line-break characters, so that sentence boundaries survive,
produce compressed text files per domain.
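Two of the steps above can be sketched as follows: decoding a page with a (detected) encoding and stripping non-Greek characters. The Unicode ranges below cover the Greek and Coptic (U+0370–U+03FF) and Greek Extended (U+1F00–U+1FFF) blocks; whitespace is kept so sentence boundaries survive. The `utf-8` default is a stand-in for per-page encoding detection.

```python
import re

# Replace any run of non-Greek, non-whitespace characters with a space.
GREEK = re.compile(r"[^\u0370-\u03FF\u1F00-\u1FFF\s]+")

def extract_greek(raw_bytes, encoding="utf-8"):
    # In practice the encoding is detected per page (e.g. from HTTP
    # headers or a detection library); utf-8 is an illustrative default.
    text = raw_bytes.decode(encoding, errors="replace")
    return GREEK.sub(" ", text)

sample = "Η γλώσσα <b>HTML</b> tags 123".encode("utf-8")
print(extract_greek(sample))
```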
The boilerplate-removal step is crucial for text quality, since we require a corpus that can later be used to develop linguistic resources (language models, embeddings, etc.). Besides their main content, webpages contain navigation elements, headers, footers, and commercial banners. Such text is usually unrelated to the page's main content and can degrade the integrity of the collection. To remove it, we tried libraries such as BeautifulSoup, jusText, NLTK's clean_html, and Boilerpipe (https://boilerpipe-web.appspot.com/). Boilerpipe gave the best results, so it is the one we ultimately used to strip boilerplate text. We then removed identical sentences (de-duplication) and produced the final corpus in plain-text form, around 50GB in size. This yielded 3B tokens and a total of 498M sentences, 118M of which are unique.
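The sentence-level de-duplication can be sketched as follows: keep the first occurrence of each sentence, tracking what has been seen via a fixed-size digest so the seen-set stays small relative to a multi-gigabyte corpus. This is an illustrative sketch, not the exact production implementation.

```python
import hashlib

def deduplicate(sentences):
    """Return sentences with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for s in sentences:
        # Hash the normalized sentence instead of storing the text itself.
        digest = hashlib.md5(s.strip().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique

corpus = ["καλημέρα κόσμε", "δεύτερη πρόταση", "καλημέρα κόσμε"]
print(deduplicate(corpus))  # the repeated sentence appears once
```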
De-duplication per domain reduced the size of the raw corpus by 75%. With additional processing of the final corpus, we created Greek n-gram lists: 7M unigrams, 90M bigrams, and 300M trigrams.
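The n-gram extraction can be illustrated at toy scale; the real pipeline streams the 50GB corpus rather than holding token lists in memory, and the sentences below are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

sentences = [["το", "σπίτι", "μου"], ["το", "σπίτι", "σου"]]
unigrams, bigrams = Counter(), Counter()
for toks in sentences:
    unigrams.update(ngrams(toks, 1))
    bigrams.update(ngrams(toks, 2))

print(unigrams.most_common(2), bigrams.most_common(1))
```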
4 Training Greek Word Embeddings
To learn the Greek word embeddings we used the FastText library, which takes the morphology of words into account. Training on the raw uncompressed 50GB text of the Greek web took 2 days on an 8-core Ubuntu system with 32GB of RAM.
We produced several Greek vector models: 1. native FastText skip-gram with the parameters -minCount 11 -loss ns -thread 8 -dim 300, 2. native FastText CBOW, 3. gensim (https://radimrehurek.com/gensim/) word2vec skip-gram, 4. gensim FastText skip-gram, 5. gensim FastText skip-gram without subword information. Methods 3 and 5 lead to the same result, since without subword information they use the same technique. Evaluating the models on automatic spell correction along with similarity queries, method 1 yielded the most reliable results. In the future, we plan to also offer a set of handcrafted questions for evaluation purposes.
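The best-performing configuration (method 1) corresponds to a native FastText command-line invocation like the following sketch; `corpus.txt` and `el_vectors` are placeholder names standing in for the de-duplicated corpus file and the output model prefix.

```shell
# Native FastText skip-gram with the parameters listed above
# (negative-sampling loss, 300-dimensional vectors, min word count 11).
fasttext skipgram -input corpus.txt -output el_vectors \
  -minCount 11 -loss ns -thread 8 -dim 300
```

This produces `el_vectors.bin` (the full model) and `el_vectors.vec` (the word vectors in plain text).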
5 Demo Tools
Next, we designed tools to visualize examples of Greek word-vector relationships. The first demo offers linguistic functions enabled by the word embeddings, such as analogies, similarity scores, and most-similar-word queries. The second demo, for exploring and querying the word vectors, is based on word2vec-explorer (https://github.com/dominiek/word2vec-explorer). In this tool, a user can navigate through a sample of the Greek word embeddings, visualize it via t-SNE, and apply k-means clustering. Furthermore, we offer comparison functions for combinations of words. For the frontend, we used libraries such as Flask, Jinja, and Bootstrap.
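The clustering step of the explorer can be illustrated with a minimal k-means on 2-D points standing in for t-SNE-projected word vectors; this is a toy sketch (in practice a library implementation of k-means and t-SNE would be used), and the data points are fabricated.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(c) / len(pts) for c in zip(*pts)]

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign points to nearest center, recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[i].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated 2-D blobs standing in for projected embeddings.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # each blob forms one cluster
```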
6 Conclusion & Future Work
In this work, we presented the efforts that resulted in Greek word embeddings and other Greek language resources, trained on the largest corpus available so far for the Greek language. The resources (corpus, trained vectors, stopwords, vocabulary, as well as unigrams, bigrams, and trigrams) are available on request. We have also implemented a live web tool where a user can explore word relationships in the Greek language. In addition, we provide a word embedding explorer, where one can visualize a sample of the Greek vectors with t-SNE.