Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF

by H. J. Meijer, et al.

Over the last few years, neural network derived word embeddings have become popular in the natural language processing literature. Studies have mostly focused on the quality and application of word embeddings trained on publicly available corpora such as Wikipedia or other news and social media sources. However, these studies are limited to generic text and thus lack technical and scientific nuances such as domain-specific vocabulary, abbreviations, or scientific formulas, which are commonly used in academic contexts. This research focuses on the performance of word embeddings applied to a large-scale academic corpus. More specifically, we compare the quality and efficiency of trained word embeddings to TFIDF representations in modeling the content of scientific articles. We use a word2vec skip-gram model trained on titles and abstracts of about 70 million scientific articles. Furthermore, we have developed a benchmark to evaluate content models in a scientific context. The benchmark is based on a categorization task that matches articles to journals for about 1.3 million articles published in 2017. Our results show that content models based on word embeddings are better for titles (short text) while TFIDF works better for abstracts (longer text). However, the slight improvement of TFIDF for longer text comes at the expense of a 3.7 times higher memory requirement as well as up to 184 times higher computation times, which may make it inefficient for online applications. In addition, we have created a 2-dimensional visualization of the journals modeled via embeddings to qualitatively inspect the embedding model. This graph offers useful insights and can be used to find competing journals or gaps in which to propose new journals.




1 Introduction

Neural network derived word embeddings are dense numerical representations of words that are able to capture semantic and syntactic information[20]. Word embedding models are calculated by capturing word relatedness[10] in a corpus as derived from contextual co-occurrences. They have proven to be a powerful tool and have attracted the attention of many researchers over the last few years. The usage of word embeddings has improved various natural language processing (NLP) areas including named entity recognition[3], part-of-speech tagging[25], and semantic role labelling[8, 16]. Word embeddings have also given promising results on machine translation[7], search[6] and recommendation[21, 22].
Similarly, there are many potential applications of embeddings in the academic domain, such as improving search engines, enhancing NLP tasks for academic texts, or recommending journals for manuscripts. Published studies have mostly focused on generic text like Wikipedia[14, 23], or informal text like reviews[4, 12] and tweets[18, 33]. We aim to validate word embedding models for academic texts containing technical, scientific or domain-specific nuances such as exact definitions, abbreviations, or chemical/mathematical formulas. We evaluate the embeddings by matching articles to their journals. To quantify the match, we use the ranks derived by sorting the similarity of embeddings between each article and all journals. Furthermore, we plot the journal embeddings in a 2-dimensional scatter plot that visualizes journal relatedness[2, 9].

2 Data and environment

In this study, we compare content models based on TFIDF, embeddings, and various combinations of both. This section describes the training environment and parameters as well as other model specifications to create our content models.

2.1 Dataset

Previous studies have highlighted the benefits of learning embeddings in a context similar to the one in which they are later used[11, 30]. Thus, we trained our models on the titles and abstracts of approximately 70 million scientific articles from 30 thousand distinct sources such as journals and conferences. All articles are derived from the Scopus abstract and citation database[27]. After tokenization, stopword removal and stemming, the dataset contains a total of ca. 5.6 billion tokens (ca. 0.64 million unique tokens). The word occurrences in this training set follow a Pareto-like distribution as described by Wiegand et al[32]. This distribution indicates that our original data has properties similar to those of standard English texts.

2.2 TFIDF

We used 3 TFIDF alternatives, all created with the TFIDF and hashing implementations of the pySpark mllib[29]. We controlled the TFIDF alternatives in two ways: (a) adjusting the vocabulary size and (b) adjusting the number of hash buckets. We label the TFIDF alternatives as "vocabulary-size / number-of-hash-buckets". Thus, we label the TFIDF configuration that has a vocabulary size of 10,000 and 10,000 hash buckets as TFIDF 10K/10K. To select the TFIDF sets, we measured the memory footprint of multiple TFIDF configurations against our accuracy metric (see section 3 for a detailed definition). As seen in Table 1, the performance on both title and abstract stagnates as the configuration grows; the same is true for the memory usage. Given these results, we selected the 10K/10K, 10K/5K and 5K/5K configurations for our research as reasonable compromises between memory footprint and accuracy. We also do not expect significantly better performance for higher vocabulary sizes such as 20K.

Vocabulary size / hash buckets   Memory (GB)   Median rank (title)   Median rank (abstract)
1K/1K                            5.13          183                   44
4K/4K                            9.29          59                    20
7K/7K                            10.85         42                    16
10K/10K                          11.61         35                    14
Table 1: TFIDF accuracy and memory usage vs variable hash-bucket and vocabulary size
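
The interplay between vocabulary size and number of hash buckets can be illustrated with a minimal pure-Python sketch. It is not the pySpark mllib pipeline used in the study: the function names, the CRC32 hash, the smoothed IDF formula and the toy documents are all assumptions chosen for illustration.

```python
import math
import zlib
from collections import Counter


def top_vocabulary(docs, vocab_size):
    """Keep the vocab_size most frequent tokens across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    return {token for token, _ in counts.most_common(vocab_size)}


def hashed_tf(doc, vocab, n_buckets):
    """Sparse term-frequency vector: bucket index -> count."""
    tf = Counter()
    for token in doc:
        if token in vocab:
            # CRC32 is a stand-in hash (assumption), not the hasher used in the study.
            tf[zlib.crc32(token.encode("utf-8")) % n_buckets] += 1
    return dict(tf)


def fit_idf(tf_vectors):
    """Smoothed inverse document frequency per bucket (assumed formula)."""
    n_docs = len(tf_vectors)
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def tfidf(tf, idf):
    """Multiply each bucket count by the IDF of that bucket."""
    return {bucket: count * idf[bucket] for bucket, count in tf.items()}


# Toy, already stemmed documents (hypothetical data).
docs = [
    ["protein", "fold", "structur", "protein"],
    ["graph", "neural", "network"],
    ["protein", "network", "interact"],
]
vocab = top_vocabulary(docs, vocab_size=5)
tf_vectors = [hashed_tf(doc, vocab, n_buckets=8) for doc in docs]
idf = fit_idf(tf_vectors)
tfidf_vectors = [tfidf(tf, idf) for tf in tf_vectors]
```

With fewer buckets than vocabulary entries, several tokens share a bucket, which lowers memory at the cost of collisions. With at least as many buckets as vocabulary entries, collisions become rarer but the vectors grow.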

2.3 Embeddings

Our word embeddings are obtained using a Spark implementation[29] of the word2vec skip-gram model with hierarchical softmax as introduced by Mikolov et al[19]. In this shallow neural network architecture, the word representations are the weights learned during a simple prediction task. To be precise, given a word, the training objective is to maximize the mean log-likelihood of its context. We optimized the model parameters by means of a word similarity task using external evaluation sets[5, 1, 15] and consequently used the best performing model (see 4.1) as the reference embedding model throughout this article (referred to as embedding). Additionally, we created 4 variants that combine TFIDF with the embedding. All embedding models are listed below:

- embedding: unweighted mean embedding of all tokens
- TFIDF_embedding: tfidf-weighted mean embedding of all tokens
- 10K_embedding: tfidf-weighted mean embedding of the top 10,000 most occurring tokens
- 5K_embedding: tfidf-weighted mean embedding of the top 5,000 most occurring tokens
- 1K_6K_embedding: tfidf-weighted mean embedding of the top 6,000 most occurring tokens excluding the 1,000 most occurring tokens
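
The variants above differ only in how token vectors are pooled into one document vector. The following is a minimal sketch under stated assumptions: the function names, the toy vectors and the IDF values are hypothetical, and dividing by the sum of the weights is one possible reading of "tfidf-weighted mean", not a detail confirmed by the text.

```python
def mean_embedding(tokens, vectors):
    """Unweighted mean of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token survived, i.e. a null record
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def weighted_mean_embedding(tokens, vectors, idf, allowed=None):
    """IDF-weighted mean over token occurrences.

    Iterating over occurrences makes a repeated token count tf times,
    so its total weight is tf * idf. `allowed` optionally restricts the
    vocabulary, e.g. to the top 10,000 most occurring tokens.
    """
    total, acc = 0.0, None
    for t in tokens:
        if t not in vectors or t not in idf:
            continue
        if allowed is not None and t not in allowed:
            continue
        vec = vectors[t]
        if acc is None:
            acc = [0.0] * len(vec)
        for i, x in enumerate(vec):
            acc[i] += idf[t] * x
        total += idf[t]
    if acc is None or total == 0.0:
        return None  # every token was cut: a null record
    return [x / total for x in acc]


# Hypothetical 2-dimensional vectors and IDF weights for stemmed tokens.
vectors = {"protein": [1.0, 0.0], "studi": [0.0, 1.0], "fold": [1.0, 1.0]}
idf = {"protein": 2.0, "studi": 0.5, "fold": 1.5}
tokens = ["protein", "studi", "fold"]

plain = mean_embedding(tokens, vectors)
weighted = weighted_mean_embedding(tokens, vectors, idf)
limited = weighted_mean_embedding(tokens, vectors, idf, allowed={"protein", "fold"})
empty = weighted_mean_embedding(tokens, vectors, idf, allowed=set())
```

In this toy case the generic token "studi" pulls the unweighted mean toward its own direction, while the weighted variants reduce or remove its influence. A vocabulary cut that removes every token of a text yields no vector at all.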

3 Methodology

To measure the quality of the embeddings, we calculate a ranking between each article and its corresponding journal. This ranking, calculated by comparing the embedding of the article with the embeddings of all journals, reflects the performance of the embeddings in a categorization task. Articles published in 2017 are split into training and test sets (80% and 20%, respectively). Within the training set, we average the article embeddings per journal and define the result as the journal embedding. This study is limited to journals with at least 150 publications in 2017 that had papers in both the test and training sets (roughly 3700 journals and 1.3 million articles). We calculate the similarity of embeddings between each article in the test set and all journals in the training set. We order the similarity scores such that rank one corresponds to the journal with the most similar embedding, and we record the rank of the source journal of each article for evaluation. We do this for titles and abstracts separately.
We calculate the performance per set by combining the ranking results of all articles in a set into one score, using both the median and the average: the average rank is the mean of all ranks, while the median is the point at which 50% of the ranks are higher and 50% are lower. While ranking, we keep track of the source journal rank, the score, and the name of the best matching journal, for both abstract and title. We furthermore monitor the memory usage and computation time.
To plot the journal embeddings, we use PCA (Principal Component Analysis)-based tSNE. tSNE (t-Stochastic Neighbor Embedding) is a dimensionality reduction method introduced by van der Maaten and Hinton[17].
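
The ranking procedure can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the code used in the study: the function names, journal names and 2-dimensional vectors are hypothetical, and cosine similarity is used as the similarity measure.

```python
import math
from statistics import mean, median


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def journal_embeddings(train):
    """Average the article vectors of each journal in the training set."""
    sums, counts = {}, {}
    for journal, vec in train:
        acc = sums.setdefault(journal, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[journal] = counts.get(journal, 0) + 1
    return {j: [x / counts[j] for x in acc] for j, acc in sums.items()}


def source_rank(article_vec, source_journal, journals):
    """Rank 1 means the source journal has the most similar embedding."""
    ordered = sorted(
        journals,
        key=lambda j: cosine(article_vec, journals[j]),
        reverse=True,
    )
    return ordered.index(source_journal) + 1


# Hypothetical (journal, article vector) pairs.
train = [
    ("J. Chem", [1.0, 0.1]), ("J. Chem", [0.9, 0.0]),
    ("J. Phys", [0.0, 1.0]), ("J. Phys", [0.1, 0.9]),
    ("J. Mat", [0.7, 0.7]),
]
test = [("J. Chem", [0.8, 0.1]), ("J. Phys", [0.6, 0.7]), ("J. Mat", [0.5, 0.6])]

journals = journal_embeddings(train)
ranks = [source_rank(vec, journal, journals) for journal, vec in test]
median_rank, average_rank = median(ranks), mean(ranks)
```

The second test article lies between two journals and its source journal ends up at rank 2, which is the kind of near miss that the median and average rank summarize across the whole test set.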


4 Results

In this section, we present the results of our research; a detailed discussion of their meaning and implications is presented in section 5, Discussion.

4.1 Model Optimization

During optimization, we tested the effect of several learning parameters on training time and quality using three benchmark sets for evaluating word relatedness, i.e. the WordSimilarity-353[5], MEN[1] and Rare Word[16] test collections, which contain generic and rare English word pairs along with human-assigned similarity judgments. Only a few parameters, i.e. the number of iterations, the vector size and the minimum word count for a token to be included in the vocabulary, had a significant effect on the quality. The learning rate was 0.025 and the minimum word count was 25. Our scores were close to the external benchmarks from the above studies. We manually investigated the differences and found that they were mostly due to differences in word usage between academic and non-academic contexts. Indeed, the biggest difference was for the pair television and show (because in an academic context show rarely relates to television). Table 2 contains the average scores and training times when tuning one parameter while keeping the remaining ones fixed. Our final reference model is based on 300-dimensional vectors, a context window of 5 tokens, 1 iteration and 160 partitions.

Parameter           Value   Average score   Training time
vector size         100     0.447           3.2 h
                    200     0.46            - (a)
                    300     0.51            - (a)
no. of iterations   1       0.446           2.94 h
                    3       0.457           4.48 h
                    6       0.46            7.1 h
min. word count     15      0.467           - (b)
                    25      0.473           - (b)
                    50      0.447           - (b)
(a) ran in a different cluster due to memory issues.
(b) not significant.
Table 2: average accuracy scores and computation times during training.
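
A word similarity benchmark of this kind can be scored as sketched below. The text does not state which score was computed, so the use of Spearman rank correlation between human judgments and cosine similarities is an assumption, as are the function names, the toy vectors and the word pairs (none of which come from the benchmark sets).

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs, vectors):
    """pairs: (word_a, word_b, human_score); out-of-vocabulary pairs are skipped."""
    human, model = [], []
    for a, b, score in pairs:
        if a in vectors and b in vectors:
            human.append(score)
            model.append(cosine(vectors[a], vectors[b]))
    return spearman(human, model), len(human)


# Hypothetical stemmed tokens, vectors and human scores.
vectors = {
    "cell": [1.0, 0.2], "tissu": [0.9, 0.4],
    "graph": [0.1, 1.0], "network": [0.5, 0.8],
}
pairs = [
    ("cell", "tissu", 8.0), ("graph", "network", 7.5),
    ("cell", "graph", 1.5), ("tissu", "network", 1.0),
    ("cell", "televis", 0.5),  # skipped: "televis" has no vector
]
score, n_used = evaluate(pairs, vectors)
```

Pairs that contain a token missing from the vocabulary are skipped here, which is one common convention. A stricter protocol could instead assign them a default similarity.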

4.2 Ranking

Figures 1 and 2 show the results of the categorization task via ranking measures for titles and abstracts. The rank indicates the position of the correct journal in the sorted list of all journals. These graphs show both the average and the median ranks, based on the cosine similarity between the article and journal embeddings. These embedding vectors, whether calculated by word2vec, TFIDF or their combinations, can be considered as the feature vectors used elsewhere for machine-learning tasks.

Figure 1: Median and average title rankings
Figure 2: Median and average abstract rankings
Figure 3: Rank distribution for the title and the abstract: Y-axis shows the fraction of articles.

4.3 Rank Distribution

Figure 3 shows the distributions of the ranks for our default embeddings, the TFIDF weighted embedding and the 10K/10K TFIDF. The figure plots the cumulative percentage of articles as a function of rank. The plot gives a detailed view of the ranks presented in Figures 1 and 2.
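
A cumulative rank distribution of this kind can be computed as in the following minimal sketch. The function name and the example ranks are hypothetical and do not reproduce the data behind Figure 3.

```python
from collections import Counter


def cumulative_rank_distribution(ranks, max_rank):
    """Fraction of articles whose source journal is at rank r or better,
    for r = 1..max_rank."""
    counts = Counter(ranks)
    total = len(ranks)
    running, fractions = 0, []
    for r in range(1, max_rank + 1):
        running += counts.get(r, 0)
        fractions.append(running / total)
    return fractions


# Hypothetical source-journal ranks of eight test articles.
ranks = [1, 3, 1, 7, 2, 40, 5, 1]
cdf = cumulative_rank_distribution(ranks, max_rank=5)
```

The first value of the curve is the share of articles whose source journal is the top result, and the rank at which the curve crosses 0.5 is the median rank.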

4.4 Memory Usage and Computation Time

Table 3 shows the total memory usage of each test set in gigabytes. Moreover, it provides the absolute hit percentage for the titles and the abstracts, i.e. the percentage of articles that have their source journal as the first result in the ranking. The table furthermore lists the median title rank and the median abstract rank, as visualized in Figures 1 and 2. Thus, this table gives an overview of the memory usage of the sets, combined with their accuracy on the ranking task.
We furthermore investigated the compute efficiency of the different content models. To simulate what can happen in an online application, we selected 1000 random articles and measured the time needed for the dot products between all pairs. The recorded time excludes input/output time, and all calculations started from cached data with optimized numpy matrix/vector calculations. Table 4 shows the computation time in seconds as well as ratios. In general, dot products can be faster for dense vectors than for sparse vectors, and TFIDF vectors are stored as sparse vectors while embeddings are dense vectors. Hence, we also created a dense vector version of the TFIDF sets to isolate the effect of sparse vs dense representation.

Content model       Memory (GB)   Absolute hit           Median rank
                                  Title      Abstract    Title   Abstract
TFIDF 5K/5K                       5.42%      10.18%      50      27
TFIDF 5K/10K                      6.49%      11.08%      38      15
TFIDF 10K/10K       11.61         6.79%      11.32%      35      14
embedding                         7.92%      9.24%       27      23
5K_embedding                      6.34%      8.36%       42      27
10K_embedding                     7.03%      8.76%       34      25
TFIDF_embedding     3.13          7.89%      9.33%       27      22
1K_6K_embedding     3.06          5.16%      7.86%       64      31
Table 3: memory usage and performance for various content models

                        Seconds              Ratio vs embedding
                        Title     Abstract   Title     Abstract
TFIDF (sparse vector)   154.95    169.89     231.25    184.36
TFIDF (dense vector)    35.67     35.18      53.23     39.59
Embedding               0.67      0.89       1         1
Table 4: comparing computation time between embeddings and TFIDF models
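
The timing protocol can be sketched as follows. This is only an illustration of the measurement setup, under several assumptions: it uses plain Python lists instead of optimized numpy operations, 20 random vectors instead of 1000 articles, and random numbers instead of real content vectors, so the absolute timings and ratios will not match Table 4. The function names are hypothetical.

```python
import random
import time
from itertools import combinations


def random_vectors(n, dim, seed):
    """Dense random vectors standing in for cached content vectors."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def time_all_pairs(vectors):
    """Seconds spent on dot products between all pairs of vectors.

    The vectors are already in memory, so no input/output is timed.
    """
    start = time.perf_counter()
    n_pairs = 0
    for u, v in combinations(vectors, 2):
        sum(a * b for a, b in zip(u, v))  # result discarded; only time matters
        n_pairs += 1
    return time.perf_counter() - start, n_pairs


n_articles = 20  # reduced from the 1000 random articles used in the study
t_embedding, n_pairs = time_all_pairs(random_vectors(n_articles, 300, seed=0))
t_tfidf_dense, _ = time_all_pairs(random_vectors(n_articles, 10_000, seed=1))
print(f"{n_pairs} pairs, 10000-d vs 300-d time ratio: {t_tfidf_dense / t_embedding:.1f}")
```

Because both runs use the same dense representation, any difference in this sketch comes from the vector length alone, which separates the effect of dimensionality from the effect of sparse storage.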

4.5 Journal Plot

Figure 4 shows the 2-dimensional visualization of the (default) journal embeddings based on the abstracts. This plot is color coded to visualize publishers. Some journal names have been added to the plot to indicate research areas.

Figure 4: Journal plot of abstract embeddings after tSNE transformation. Red, green, blue, and gray represent respectively Wiley, Elsevier, Springer-Nature, and other/unknown publishers.

5 Discussion

5.1 Results Analysis

5.1.1 Highest Accuracy

The data, as presented in Figures 1 and 2, shows that the 10K/10K set performs better than all other TFIDF sets, although the difference with the 5K/10K set is small (a median rank difference of 1 on the abstracts and 3 on the titles). For the embeddings, the TFIDF weighted embedding outperforms the other embedding models by a narrow margin: its median rank is 1 position better on the abstracts and equal on the titles.

5.1.2 Dataset and Model Optimization

The choice of the final model parameters was constrained by computational cost. Hence, even though increasing the number of iterations could have led to better performing word embeddings, we chose 1 training iteration because of the increased training time. Similarly, we decided to stem tokens prior to training in order to decrease the vocabulary size. This might have led to a loss of syntactic information or caused ambiguous tokens.

5.1.3 TFIDF

The TFIDF feature vectors outperform the embeddings on abstracts, while the embeddings outperform TFIDF on titles. The main difference between abstract and title is the number of tokens. Hence, embeddings, which enrich tokens with their semantic meaning, outperform TFIDF on the title. On the other hand, the TFIDF model outperforms on the abstract, likely because the additional tokens specify the content further. In other words, longer text provides a better context and hence requires a less accurate semantic model for individual tokens.
Furthermore, none of our vocabulary size cut-offs improved the TFIDF ranks; indeed, increasing the vocabulary size monotonically increased the performance of the TFIDF. In other words, we could not find a cut-off strategy that reduces noise and enhances the TFIDF results. It is still possible that a cut-off at even higher vocabulary sizes would result in a sharper signal. However, since we noticed performance stagnation, we did not investigate vocabulary sizes beyond 10K (presented in Table 1).

5.1.4 Combination of TFIDF and embeddings

The limited TFIDF embeddings all fall short of the full TFIDF embedding. We did not find a vocabulary size cut-off strategy that increases accuracy by reducing noise from rare words, highly frequent words, or their combinations. In other words, it is best not to miss any word. This is in line with what we found in the TFIDF results: larger vocabulary sizes enhance the models.
Rank distribution: although the limited TFIDF embeddings underperform, we found that their rank distribution differs from that of the other embeddings. It shows the following pattern: a high to average performance on the top rankings, a below average performance on the middle rankings, and an increased share of articles with worse (numerically higher) ranks. The rank distribution seems to indicate that the cut-offs push ranks toward the margins: they moved the "middle-ranked" articles to either the higher end or the lower end, with the net effect of deteriorating the median ranks. However, the articles that did match with a limited TFIDF embedding had higher accuracy scores. The reduction in vocabulary size did not reduce the storage size of the embeddings, except for the 1K-6K case. This indicates that only the 1K-6K cut removed all tokens from some abstracts and titles, resulting in null records and hence lower memory usage.

5.1.5 TFIDF & embeddings

Our hypothesis on the difference between the TFIDF and the standard embedding is as follows:
The embeddings seem to outperform the TFIDF feature vectors in situations where little information is available (titles). This indicates that the embeddings store some word meaning that enables them to perform relatively well on the titles. The abstracts, on the other hand, contain much more information. Our data seems to indicate that the amount of information available in the abstracts enables the TFIDF to cope with the lack of an explicit semantic model. If this is the case, we would expect little performance increase on the title when we compare the embeddings to the TFIDF weighted embeddings, because the TFIDF lacks the information to perform well. This can be seen in our data: only the average rank changed, by 3, indicating that there is a difference between the two embeddings, but not a significant one. On the abstract, we would expect an increase in performance, since the TFIDF has more information in this context. We would expect that the weighting applied by the TFIDF model, an importance weighting, improves the performance of the embedding. Our data shows a minor improvement in performance: 1 median rank and 10 average ranks. While these improvements cannot be seen as significant, our data at least indicates that weighting the embeddings with TFIDF values has a positive effect on the embeddings.

5.1.6 Memory usage & Calculation time

TFIDF outperforms the embeddings on the abstracts, but requires more memory. The embedding uses 3.13 GB of RAM, while the top performing TFIDF, 10K/10K, uses 11.61 GB (a 3.7 times larger RAM footprint). This indicates that the embeddings are able to store the relatedness information more densely than the TFIDF. The embeddings furthermore need less calculation time for online calculations, as shown in Table 4. On average, embeddings are about 200 times faster than sparse TFIDF. When the TFIDF vectors are transformed to dense vectors, this is reduced to about 46 times. The difference between the sparse and dense vectors is due to dense vectors being processed more efficiently by low-level vector operations. The difference between the embedding and the dense TFIDF vectors is mainly due to the vector size: embeddings use a 300-dimensional vector, while TFIDF uses a 10,000-dimensional vector. Hence a time ratio of about 33 is to be expected and is indeed close to the measured values of 40 and 53 in Table 4. Note that even though the dense representation is roughly 4-5 times faster than the sparse one, it requires 33 times more RAM, which can be prohibitive.
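
The ratios quoted in this section follow directly from Tables 3 and 4, as the short calculation below shows. The variable names are illustrative; all numbers are copied from the tables and the text.

```python
# Memory footprint in GB: top performing TFIDF vs the embedding model.
memory_gb = {"tfidf_10k_10k": 11.61, "embedding": 3.13}

# Computation time ratios vs embedding, from Table 4.
ratio_vs_embedding = {
    "tfidf_sparse": {"title": 231.25, "abstract": 184.36},
    "tfidf_dense": {"title": 53.23, "abstract": 39.59},
}

memory_ratio = memory_gb["tfidf_10k_10k"] / memory_gb["embedding"]

# Average of the title and abstract ratios.
mean_sparse = sum(ratio_vs_embedding["tfidf_sparse"].values()) / 2
mean_dense = sum(ratio_vs_embedding["tfidf_dense"].values()) / 2

# Expected dense-vs-dense ratio from the vector sizes alone.
dimension_ratio = 10_000 / 300

# Speed-up of dense TFIDF over sparse TFIDF.
sparse_over_dense = {
    field: ratio_vs_embedding["tfidf_sparse"][field]
    / ratio_vs_embedding["tfidf_dense"][field]
    for field in ("title", "abstract")
}

print(f"memory ratio: {memory_ratio:.1f}")
print(f"mean sparse ratio: {mean_sparse:.0f}, mean dense ratio: {mean_dense:.0f}")
print(f"dimension ratio: {dimension_ratio:.0f}")
```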

5.2 Improvements

This research demonstrates that even though the embeddings can capture and preserve relatedness, TFIDF is able to outperform the embeddings on the abstracts. We used basic word2vec, but earlier research already shows additional improvement potential for word2vec. Dai et al[2] showed that using paragraph vectors improves the accuracy of word embeddings by 4.4% on triplet creation with the Wikipedia corpus and by 3.9% on the same task based on arXiv articles. Furthermore, Le et al[13] show that the usage of paragraph vectors decreases the error rate by 7.7% compared to averaging the word embeddings when categorizing text as either positive or negative. While the improvement looks promising, we have to keep in mind that our task differs from earlier research. We do not categorize into two categories but into about 3700 journals. Since the classification task is fundamentally the same, we would still expect an improvement from using paragraph vectors. However, the larger scale here complicates the task due to the "grey areas" between categories. These are the areas in which the classification algorithm is "in doubt" and could reasonably assign the article to either of two journals. There are many similar journals and hence we cannot expect a rank of 1 for most articles. Indeed, our classes are not exactly mutually exclusive, and in general the number of these grey areas increases with the number of target classes.
Pennington et al[24] showed that the GloVe model outperforms word2vec models, such as the skip-gram model used in this research, on a word analogy task. Wang et al[31] introduced the linked document embedding (LDE) method, which makes use of additional information about a document, such as citations. Their research, which specifically focused on categorizing documents, showed a 5.89% increase of the micro-F1 score and a 9.11% increase of the macro-F1 score for LDE compared to CBOW. We would expect that applying this technique to our dataset would improve our scores, given earlier results on comparable tasks. Although our results seem to indicate that the embeddings work for academic texts, Schnabel et al[26] found that the quality of embeddings depends on the validation task. Therefore, conservatively, we can only state that our research shows that embeddings work on academic texts for journal classification.
Despite the large body of existing research, we have not been able to find published results that are directly comparable to ours. This is due to our large number of target classes (3700 journals), which requires a ranking measure. Earlier research limited itself to a small number of groups, such as binary classes or 3 classes[28]. We have opted for the median rank as our key measure, but like existing research we have also reported the absolute hit[31]. Our conclusions did not depend on the exact metric used (median rank vs average rank vs absolute hit).

6 Conclusion

Our research, based on an academic corpus, indicates that embeddings provide a better content model for shorter texts such as titles and fall short of TFIDF for longer texts such as abstracts. The higher accuracy of TFIDF may not be worth it, as it requires 3.7 times more RAM and is roughly 200 times slower for online applications. The performance of the embeddings has been improved by weighting them with the TFIDF values on the word level, although this improvement cannot be seen as significant on our dataset. The visualization of the journal embeddings shows that similar journals are grouped together, indicating a preservation of relatedness between the journal embeddings.

7 Future work

7.0.1 Intelligent cutting

A better way of cutting could improve the quality of the embeddings. This improvement might be achieved by cutting the center of the vector space out before normalization. All generic words are in the center of the spectrum; removing these words prevents the larger texts from being pulled towards the middle of the vector space, where they lose the parts of their meaning that set them apart from other texts. We expect that this way of cutting, instead of word-occurrence cutting, can enhance embeddings, especially for longer texts.
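
One possible reading of this idea is sketched below, purely as an illustration of untested future work. It assumes that "the center" is the centroid of all raw word vectors and that cutting means dropping every word within a fixed Euclidean distance of it. The function names, the radius and the toy vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of all word vectors."""
    dim = len(next(iter(vectors.values())))
    n = len(vectors)
    return [sum(v[i] for v in vectors.values()) / n for i in range(dim)]


def cut_center(vectors, radius):
    """Drop words whose raw (unnormalized) vector lies within `radius`
    of the centroid; the remaining words are kept unchanged."""
    center = centroid(vectors)
    return {w: v for w, v in vectors.items() if math.dist(v, center) > radius}


# Hypothetical raw vectors: two generic tokens near the center,
# two domain-specific tokens far from it.
vectors = {
    "studi": [0.1, 0.0],
    "result": [0.0, 0.1],
    "protein": [2.0, -1.5],
    "graph": [-1.8, 1.6],
}
kept = cut_center(vectors, radius=0.5)
```

Document embeddings would then be averaged over the kept words only. Whether this improves the ranks for longer texts is an open question.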

7.0.2 TFIDFs performance point

In our research, TFIDF performed better on the abstracts than on the titles, which we think is caused by the difference in text size. Consequently, there could be a critical text length at which the best performing model switches from embedding to TFIDF. If this length is found, one could skip the TFIDF calculations in certain situations and skip the embedding training in other scenarios, reducing costs.

7.0.3 Reversed word pairs

At this point, there are no domain-specific word pair sets available. However, as we demonstrated, we can still test the quality of word embeddings. Once it has been established that the word vectors are of high quality, could word pairs be created from these embeddings? If so, we could create word pair sets using the embeddings and thereby reverse engineer domain-specific word pairs for future use.

8 Acknowledgement

We would like to thank Bob JA Schijvenaars for his support, advice and comments during this project.


  • [1] E. Bruni, N. Tran, and M. Baroni (2014) Multimodal distributional semantics. Journal of Artificial Intelligence Research 49, pp. 1–47.
  • [2] A. M. Dai, C. Olah, and Q. V. Le (2015) Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.
  • [3] H. Do, K. Than, and P. Larmande (2018) Evaluating named-entity recognition approaches in plant molecular biology. bioRxiv, pp. 360966.
  • [4] C. dos Santos and M. Gatti (2014) Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 69–78.
  • [5] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2001) Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20, pp. 116–131.
  • [6] D. Ganguly, D. Roy, M. Mitra, and G. J. Jones (2015) Word embedding based generalized language model for information retrieval. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 795–798.
  • [7] S. Gouws, Y. Bengio, and G. Corrado (2015) Bilbowa: fast bilingual distributed representations without word alignments. In International Conference on Machine Learning, pp. 748–756.
  • [8] L. He, K. Lee, O. Levy, and L. Zettlemoyer (2018) Jointly predicting predicates and arguments in neural semantic role labeling. arXiv preprint arXiv:1805.04787.
  • [9] G. E. Hinton and S. T. Roweis (2003) Stochastic neighbor embedding. In Advances in neural information processing systems, pp. 857–864.
  • [10] Y. Hou (2018) Enhanced word representations for bridging anaphora resolution. arXiv preprint arXiv:1803.04790.
  • [11] S. Lai, K. Liu, S. He, and J. Zhao (2016) How to generate a good word embedding. IEEE Intelligent Systems 31 (6), pp. 5–14.
  • [12] P. Lauren, G. Qu, J. Yang, P. Watta, G. Huang, and A. Lendasse (2018) Generating word embeddings from an extreme learning machine for sentiment analysis and sequence labeling tasks. Cognitive Computation, pp. 1–14.
  • [13] Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196.
  • [14] O. Levy and Y. Goldberg (2014) Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185.
  • [15] T. Luong, R. Socher, and C. D. Manning (2013) Better word representations with recursive neural networks for morphology. In CoNLL.
  • [16] T. Luong, R. Socher, and C. Manning (2013) Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113.
  • [17] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605.
  • [18] A. J. Masino, D. Forsyth, and A. G. Fiks (2018) Detecting adverse drug reactions on twitter with convolutional neural networks and word embedding features. Journal of Healthcare Informatics Research 2 (1-2), pp. 25–43.
  • [19] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
  • [21] C. Musto, G. Semeraro, M. de Gemmis, and P. Lops (2016) Learning word embeddings from wikipedia for content-based recommender systems. In European Conference on Information Retrieval, pp. 729–734.
  • [22] M. G. Ozsoy (2016) From word embeddings to item recommendation. arXiv preprint arXiv:1601.01356.
  • [23] D. Park, S. Kim, J. Lee, J. Choo, N. Diakopoulos, and N. Elmqvist (2018) ConceptVector: text visual analytics via interactive lexicon building using word embedding. IEEE Transactions on Visualization & Computer Graphics (1), pp. 361–370.
  • [24] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
  • [25] C. D. Santos and B. Zadrozny (2014) Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1818–1826.
  • [26] T. Schnabel, I. Labutov, D. Mimno, and T. Joachims (2015) Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 298–307.
  • [27] Scopus search home page.
  • [28] D. Shen, G. Wang, W. Wang, M. R. Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin (2018) Baseline needs more love: on simple word-embedding-based models and associated pooling mechanisms. arXiv preprint arXiv:1805.09843.
  • [29] Spark home page.
  • [30] J. Truong (2017) An evaluation of the word mover's distance and the centroid method in the problem of document clustering.
  • [31] S. Wang, J. Tang, C. Aggarwal, and H. Liu (2016) Linked document embedding for classification. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 115–124.
  • [32] M. Wiegand, S. Nadarajah, and Y. Si (2018) Word frequencies: a comparison of pareto type distributions. Physics Letters A.
  • [33] X. Yang, C. Macdonald, and I. Ounis (2018) Using word embeddings in twitter election classification. Information Retrieval Journal 21 (2-3), pp. 183–207.