Detecting keywords is a crucial task in several text-intensive applications. The news industry relies on keywords to organize, link and summarize articles according to the content and topics they cover. With the current trend of fast-paced writing and an ever-growing amount of generated news, manual keyword extraction has become infeasible for journalists, and the development of tools for automatic extraction has become essential for speeding up media production.
Keyword extraction can be tackled in a supervised or an unsupervised way. The current supervised state-of-the-art approaches are based on transformer-based deep neural networks and employ large-scale language model pretraining. Despite being very successful at solving the task, they require substantial amounts of labeled data, which is expensive to obtain or non-existent for some low-resource languages and domains. To cope with this, researchers in most cases employ unsupervised keyword extraction in these low-resource scenarios. Unsupervised approaches require no prior training and can be applied to most languages, making them a good fit for domains and languages with little or no labeled data. On the other hand, they offer non-competitive performance compared to supervised approaches, since they cannot be adapted to a specific language, domain and keyword assignment regime through training.
In this work, we explore another option for keyword extraction in low-resource settings, one which has not been extensively explored in the past: zero-shot cross-lingual keyword detection. More specifically, we investigate how multilingual pretrained language models, fine-tuned to detect keywords on a set of languages, perform when applied to a new language not included in the train set, and compare these results to those achieved by several state-of-the-art unsupervised keyword extractors. In addition, we investigate whether, in a setting where training data is available, supervised monolingual models can benefit from additional data from another language. The main contributions are the following:
We conduct an extensive zero-shot cross-lingual study of keyword extraction on six languages, four of them less-resourced European languages, and demonstrate that a multilingual BERT model fine-tuned on training data not matching the target language performs better than state-of-the-art unsupervised keyword extraction algorithms.
We evaluate supervised zero-shot cross-lingual models against supervised monolingual models in order to quantify the decrease in performance when no language-specific data is available.
We investigate whether the performance of monolingual models can be improved by including additional multilingual data, and whether there is a trade-off between the amount of available data and the language specificity of that data.
We produce new supervised keyword extraction models for a new Slovenian keyword extraction dataset, contributing to the development of language resources for a less-resourced European language.
The rest of this paper is organized as follows: Section 2 presents the related work in the field of keyword extraction, focusing also on cross-lingual zero-shot learning. Section 3 describes the data used in our experiments and Section 4 explains our experimental settings. Section 5 presents and discusses the results of our experiments, while Section 6 concludes the paper and proposes further work on this topic.
2 Related Work
We can divide approaches for keyword extraction into supervised and unsupervised. As stated above, state-of-the-art supervised approaches have become very successful at keyword extraction but are data- and time-intensive. Unsupervised keyword detectors address these two problems, usually requiring far fewer computational resources and no training data, yet this comes at the cost of reduced overall performance.
We can divide unsupervised approaches into four main categories: statistical, graph-based, embedding-based, and language model-based. Statistical methods, such as YAKE and KPMiner, leverage various text statistics to capture keywords, while graph-based methods, such as TextRank, SingleRank, KeyCluster, and RaKUn, build graphs and rank words according to their keyword potential based on their position in the graph. Among the most recent statistical approaches is YAKE, which we also test in this study. It is based on features such as casing, position, frequency, relatedness to context and dispersion of a specific term, which are heuristically combined to assign a single score to every keyword. KPMiner is an older, simpler method that focuses on the frequency and the position of appearance of a potential keyphrase. In order to improve the quality of the retrieved phrases, the model applies several filtering steps, e.g., removing rare candidate phrases that do not appear at least n times or that do not appear within some cutoff distance from the beginning of the document.
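The KPMiner-style candidate filtering described above can be sketched in a few lines. This is an illustrative simplification, not the method's actual implementation; the thresholds are placeholders.

```python
from collections import Counter

def filter_candidates(tokens, min_count=3, cutoff=400):
    """Keep candidate terms that occur at least `min_count` times and
    first appear within the first `cutoff` tokens of the document
    (KP-Miner-style frequency and position heuristics; the threshold
    values here are illustrative, not the ones used in the paper)."""
    counts = Counter(tokens)
    first_pos = {}
    for i, tok in enumerate(tokens):
        first_pos.setdefault(tok, i)  # record first occurrence only
    return [t for t in counts
            if counts[t] >= min_count and first_pos[t] < cutoff]
```

In the real method the same idea is applied to multi-word phrase candidates rather than single tokens.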
TextRank, which we evaluate in this study, is one of the first graph-based methods for keyword detection. It leverages Google's PageRank algorithm to rank vertices in the lexical graph according to their importance inside the graph. Another method that employs PageRank is PositionRank. The so-called MultipartiteRank algorithm encodes the potential candidate keywords of a given document into a multipartite graph structure that also considers topic information. In this graph, two nodes representing keyphrase candidates are connected only if they belong to different topics, and the edges are weighted according to the distance between the two candidates in the document. In order to rank the vertices, the method leverages PageRank, similarly to mihalcea2004textrank. One of the most recent graph-based keyword extractors is RaKUn. The main novelty of this algorithm is the expansion of the initial lexical graph with meta-vertices, i.e., aggregates of existing vertices. It employs the load centrality measure for ranking vertices and relies on several graph redundancy filters.
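The PageRank-style ranking at the heart of TextRank and its descendants can be sketched as a simple power iteration over a word co-occurrence graph. This is a minimal illustration of the idea, not the implementation used in our experiments:

```python
def pagerank(adj, damping=0.85, iters=50):
    """Rank the nodes of an undirected word graph with PageRank-style
    power iteration, as used by TextRank. `adj` maps each node to the
    set of its neighbours (co-occurring words)."""
    nodes = list(adj)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # each neighbour m distributes its score evenly over its edges
            rank = sum(score[m] / len(adj[m]) for m in adj[n] if adj[m])
            new[n] = (1 - damping) / len(nodes) + damping * rank
        score = new
    return sorted(nodes, key=score.get, reverse=True)
```

A densely connected word ends up ranked above words with few co-occurrence links, which is exactly the "keyword potential" signal the graph-based extractors exploit.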
Embedding-based keyword extraction methods are less popular but have recently been gaining traction. The first methods of this type were Key2Vec, proposed by wang2014corpus, and EmbedRank, proposed by bennani-smires-etal-2018-simple. Both of these methods employ semantic information from distributed word and sentence representations. The most recent state-of-the-art method of this type is KeyBERT, proposed by grootendorst2020keybert, which leverages pretrained BERT-based embeddings for keyword extraction. In this approach, embedding representations of candidate keyphrases are ranked according to their cosine similarity to the embedding of the entire document.
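The cosine-similarity ranking behind KeyBERT can be sketched as follows. The embeddings here are plain arrays standing in for vectors that would, in practice, come from a sentence-transformers model:

```python
import numpy as np

def rank_by_similarity(doc_vec, cand_vecs, candidates, top_n=10):
    """Rank candidate keyphrases by cosine similarity between their
    embeddings and the document embedding (the KeyBERT idea).
    `doc_vec` is a 1-D array, `cand_vecs` a 2-D array with one row
    per candidate in `candidates`."""
    doc = doc_vec / np.linalg.norm(doc_vec)
    cands = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = cands @ doc                      # cosine similarities
    order = np.argsort(-sims)[:top_n]       # highest similarity first
    return [candidates[i] for i in order]
```

The candidates closest to the document vector in embedding space are returned as keywords.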
Language model-based keyword extraction methods, such as the one proposed by tomokiyo2003language, use language-model-derived statistics to extract keywords from text. This type of keyword extraction model is quite rare and is not included in our study.
One of the first supervised approaches to keyword extraction was KEA, proposed by witten2005kea. It treats keyword identification as a classification task and employs a Naive Bayes classifier to determine, for each word or phrase in the text, whether it is a keyword. It uses only TF-IDF and the term's position in the text as classification features. A more recent non-neural supervised approach, proposed by gollapalli2017incorporating, treats keyword extraction as sequence labelling and relies on a Conditional Random Field (CRF) tagger. The first neural sequence labelling approach was proposed by luan2017scientific: a neural network comprising a bidirectional Long Short-Term Memory (BiLSTM) layer and a CRF tagging layer.
Keyword detection can also be considered a sequence-to-sequence generation task. This idea was first proposed by meng2017deep, who employed a recurrent generative model with an attention mechanism and a copying mechanism based on positional information for keyword prediction. What distinguishes this model from others is that, besides detecting keywords in the input text sequence, it can also find keywords that do not appear in the text.
The most recent approaches tackle keyword detection with transformer architectures and formulate keyword extraction as a sequence labelling task. In the study by sahrawat2019keyphrase, contextual embeddings generated by BERT, RoBERTa and GPT-2 were fed into a bidirectional Long Short-Term Memory network (BiLSTM) with an optional Conditional Random Field layer (BiLSTM-CRF). They conclude that contextual embeddings generated by transformer architectures outperform static embeddings. Another study employing a transformer architecture and a sequence labelling approach was conducted by martinc2020tnt. Their approach, named TNT-KID, does not rely on massive pretraining but rather pretrains a transformer-based language model on much smaller domain-specific corpora. They report good results with this tactic and claim that it makes their model more transferable to low-resource languages with limited training resources.
Most keyword detection studies still focus on English. Nevertheless, several multilingual and cross-lingual studies, which also include low-resource languages, have recently been conducted. One of them is the study by koloski-etal-2021-extending, where the performance of two supervised transformer-based models, multilingual BERT with a BiLSTM-CRF classification head and TNT-KID, was compared in a multilingual setting on Estonian, Latvian, Croatian and Russian news corpora. The authors also explored whether combining the outputs of the supervised models with the outputs of unsupervised models can improve the recall of the system.
Cross-lingual zero-shot transfer is an emerging hot topic in the research community. The main idea behind this family of approaches is that models can benefit from transfer from one language to another and can therefore conduct tasks in new, 'unseen' languages, on which they were not trained in a supervised way. These approaches are especially useful for low-resource languages without manually labeled resources. We are aware of two unsupervised cross-lingual approaches to keyword extraction. One of them is BiKEA, where the authors construct word graphs for documents in parallel corpora and rely on cross-lingual word statistics for keyword extraction. The other is the study by cross_ling_latent, which focuses on building a single latent space over two languages and then extracting keywords, to be used as topic categories for articles, from this common latent space.
Researchers have conducted various studies on applying zero-shot cross-lingual modelling to multiple domains of NLP, with most experiments showing promising results. For example, a zero-shot approach to automatic reading comprehension, in which a model was trained on one language and applied to another, was carried out by crossling_comperhension. Phoneme recognition is another task that cross-lingual zero-shot learning seems to improve: crossling_phoeme show that cross-lingual phoneme recognition offers performance comparable to state-of-the-art unsupervised models for the task at hand.
Recently, masked language models based on transformers, such as BERT, have taken the field by storm, achieving state-of-the-art results on many tasks. In a study by crossling_bert_study, the authors explored how well the multilingual variant of BERT performs when used in a zero-shot setting. The study included 39 languages and covered five tasks: document classification, natural language inference, named entity recognition, part-of-speech tagging, and dependency parsing. The results were very promising, with the model outscoring several unsupervised and non-transformer-based cross-lingual approaches. A zero-shot approach relying on multilingual BERT was also adopted to tackle news sentiment classification, offensive speech detection and abusive language detection. These studies concluded that pretrained models can be used in a cross-lingual fashion, serving as a strong baseline in low-resource scenarios. To the best of our knowledge, zero-shot transfer has not yet been investigated for the task of keyword extraction.
3 Data
For model evaluation we use six different datasets from the news domain. We include Russian, Croatian, Latvian, and Estonian news article datasets with manually labeled keywords from the clarin_kw dataset repository, using the same splits as in koloski-etal-2021-extending. Additionally, we include a benchmark English dataset, the KPTimes dataset, and the Slovenian SentiNews dataset, which was originally used for news sentiment analysis but does contain manually labeled keywords and was therefore identified as suitable for keyword extraction. Before feeding the datasets to the models, we lowercase them. Each dataset is split into three parts: train, validation and test. For English, we use the data splits introduced in , for the other languages besides Slovenian we use the same data splits as in , while for Slovene we first removed the articles without keywords and then randomly split the dataset into training, validation and test splits. We use the splits in the following manner:
train split - used for fine-tuning of the cross-lingual supervised model. The procedure is explained in detail in Section 4.3.
valid split - used for early stopping in order to prevent over-fitting during the fine-tuning phase of the supervised models.
test split - used for evaluation of the supervised and unsupervised methods. This split is not used during training of any of the methods.
The dataset statistics are available in Table 1. For each split we report the size (number of articles), the average number of keywords per document (kw_per_doc) and the percentage of keywords that actually appear in the text of the news articles (kw_present). The Latvian dataset has on average the fewest keywords per document, while the English and Russian datasets contain the most keywords per article.
Note that some of the keywords accompanying an article in the data do not appear in the text of the document. For evaluation purposes we only use the keywords present in the documents. English has the lowest share of present keywords, while Latvian has the highest. We consider a keyword or keyphrase present if its stemmed (English and Latvian) or lemmatized (other languages) version appears in the stemmed or lemmatized version of the document. We use NLTK's implementation of the PorterStemmer for English and LatvianStemmer (https://github.com/rihardsk/LatvianStemmer) for Latvian. For Croatian, Slovenian, Estonian and Russian we use the Lemmagen3 lemmatizer.
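The present-keyword filtering described above amounts to a normalised substring check. In this sketch a plain lowercasing function stands in for the stemmers and lemmatizers named above:

```python
def present_keywords(keywords, document, normalize=str.lower):
    """Keep only the gold keywords whose normalised form appears in
    the normalised document. In the paper, normalisation means
    stemming (English, Latvian) or lemmatisation (other languages);
    lowercasing is used here only as a stand-in."""
    doc = normalize(document)
    return [kw for kw in keywords if normalize(kw) in doc]
```

In practice, `normalize` would tokenize and stem/lemmatize both the keyphrase and the document before matching.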
4 Experimental Setup
In our experiments, we employ several unsupervised models to which we compare several supervised cross-lingual, multilingual and monolingual approaches.
4.1 Unsupervised Approaches
We evaluate three types of unsupervised keyword extraction methods, statistical, graph-based, and embedding-based, described in Section 2.
4.1.1 Statistical Methods
4.1.2 Embedding-based Methods
KeyBERT: For document embedding generation we employ sentence-transformers, more specifically the distiluse-base-multilingual-cased-v2 model available in the Huggingface library (https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2). Initially, we tested two different KeyBERT configurations: one considering only unigrams and another with n-grams ranging from 1 to 3, with MMR= and MaxSum=. The unigram model outscored the model that considered n-grams of sizes 1 to 3 as keyword candidates for all languages, therefore in the final report we only show the results for the unigram model.
4.1.3 Graph-based Methods
MultipartiteRank : We employ part-of-speech tagging during candidate weighting for supported languages, and we set the minimum similarity threshold for clustering at .
We use the PKE implementations of YAKE, KPMiner, TextRank and MultiPartiteRank, and the official implementations of the RaKUn and KeyBERT models. For unsupervised models, the number of returned keywords needs to be set in advance. Since we employ F1@10 as the main evaluation measure (see Section 4.4), we set the number of returned keywords to 10 for all models.
4.3 Supervised Approaches
We utilize the transformer-based BERT model with a token-classification head consisting of a simple linear layer for all our supervised approaches. We treat keyword extraction as a sequence classification task. We follow the approach proposed in martinc2020tnt and predict binary labels (1 for 'keyword' and 0 for 'not keyword') for all words in the sequence. A sequence of two or more consecutive keyword labels predicted by the model is always interpreted as a multi-word keyword. We do not follow the related work on adding a BiLSTM-CRF classification head on top of BERT for sequence classification. Since the classification head needs to be randomly initialized (i.e., it was not pretrained during BERT pretraining), and since, among other settings, we apply the model cross-lingually, we prefer to keep the token classification head simple, as the layers inside the head do not obtain any multilingual information during fine-tuning. The hypothesis is that a simple one-layer classification head will generalize better in a cross-lingual setting.
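The decoding step described above, merging consecutive positive labels into multi-word keyphrases, can be sketched as a small post-processing function over the model's per-token predictions:

```python
def labels_to_keyphrases(tokens, labels):
    """Convert per-token binary predictions (1 = keyword, 0 = not)
    into keyphrases: runs of consecutive 1-labelled tokens are merged
    into a single multi-word keyphrase, as described above."""
    phrases, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == 1:
            current.append(tok)          # extend the current run
        elif current:
            phrases.append(" ".join(current))  # close the run
            current = []
    if current:                          # flush a run ending the sequence
        phrases.append(" ".join(current))
    return phrases
```

In our pipeline the `labels` would be the argmax over the token-classification head's logits, aligned back to whole words.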
More specifically, we employ the bert-uncased-multilingual model from the HuggingFace library and fine-tune it using an adaptive learning rate (starting with a learning rate of ), for up to epochs with a batch size of .
4.3.1 Cross-lingual Setup
Let $C_k$ be the collection of all possible tuples of size $k$ that can be constructed from the $n$ languages. For example, $C_2$ denotes the collection of all possible two-language combinations in a set of $n$ languages, e.g., $C_2 = \{(\text{English}, \text{Croatian}), (\text{English}, \text{Slovenian}), \ldots\}$.
We denote the $i$-th tuple of size $k$ with $t_i^k$; for the previous example, $t_1^2$ would yield $(\text{English}, \text{Croatian})$. The cardinality of the collection $C_k$ is calculated as:
$$|C_k| = \binom{n}{k} = \frac{n!}{k!(n-k)!}$$
We create the $i$-th training dataset $D_i^k$ from the collection of tuples of size $k$ as a concatenation of the datasets in the tuple, or more formally:
$$D_i^k = \bigcup_{l \in t_i^k} \text{train-split}(l),$$
where train-split($l$) represents the respective data split of the given language $l$ as described in Section 3.
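The construction above, one training set per language tuple, maps directly onto `itertools.combinations`. A minimal sketch, where `splits` is a hypothetical mapping from language code to its list of training documents:

```python
from itertools import combinations

def training_sets(splits, k):
    """Build one training dataset per k-sized language tuple by
    concatenating the train splits of the languages in the tuple.
    `splits` maps language -> list of training documents."""
    return {tup: [doc for lang in tup for doc in splits[lang]]
            for tup in combinations(sorted(splits), k)}
```

With six languages and $k=2$ this yields $\binom{6}{2} = 15$ training sets, matching the cardinality formula above.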
Depending on the number of languages included in the training set and on the languages to which the trained models are applied, we define the following settings, for which we report results in Section 5:
MON - monolingual ($k=1$) - we fine-tune the model on a single language (for example English). In this setting we train a total of 6 monolingual models (note that even in this 'monolingual' setting we employ BERT pretrained on a multilingual corpus, since we are more interested in comparing fine-tuning regimes than in comparing different pretrained models) and we train and test each model on the same language. We use this setting as a baseline to which we compare the unsupervised, cross-lingual and multilingual settings; i.e., for the cross-lingual (LOO) and unsupervised settings, MON indicates how much we would gain if language-specific training data were available.
LOO - Leave One Out ($k=5$) - we fine-tune the model on a concatenation of five languages (for example Slovenian, Estonian, Latvian, Russian and Croatian) and test it on the sixth language not appearing in the train set (e.g., English). In this manner we obtain 6 different models. This is the so-called zero-shot cross-lingual setting, since the test language is not included at training time. The main idea behind this setting is to test how well a model does when no language-specific training data is available. This setting represents the core of our experiments.
MUL - multilingual ($k=6$) - we fine-tune just one model on all languages from the language set and apply it to all the test datasets. With this experiment we check whether adding more domain-specific data from other languages improves performance in comparison to the monolingual setting described above.
4.4 Evaluation Setting
In order to evaluate the models, we calculate F1, recall and precision at 10 retrieved words. We omit documents that have no gold keywords or whose gold keywords do not appear in the text, since we only use approaches that extract words (or multi-word expressions) from the given document and cannot handle keywords not appearing in the text. All approaches are evaluated on the same monolingual test splits, which are not used for training the supervised models. Lowercasing and stemming (for English and Latvian) or lemmatization (for the other languages) are performed on both the gold standard and the extracted keywords (keyphrases) during the evaluation.
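The metrics above can be sketched for a single document as follows; inputs are assumed to be already lowercased and stemmed or lemmatized:

```python
def precision_recall_f1_at_k(predicted, gold, k=10):
    """Compute precision@k, recall@k and F1@k for one document:
    the top-k predicted keywords against the gold keywords that
    are present in the document text."""
    pred = predicted[:k]
    tp = len(set(pred) & set(gold))          # correctly retrieved
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Corpus-level scores are then obtained by averaging these per-document values over the test split.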
5 Discussion of Results
All unsupervised approaches are outperformed by the cross-lingual approaches (see row LOO) across all of the datasets and according to all criteria. For all languages besides Slovenian, the cross-lingual model improves on the best-performing unsupervised model by more than 10 percentage points in terms of F1@10, the improvement being the smallest for Slovenian (about 8 percentage points) and the biggest for Latvian and Estonian (about 15 percentage points). The best-performing unsupervised model in terms of F1@10 is KeyBERT, which outperforms the other unsupervised models on all languages.
The difference in F1@10 between the cross-lingual and monolingual models (see row MON) is substantial. If no training data from the target language is used, performance is more than halved for three languages: Latvian, Estonian and Russian. For the other three languages, the drop is smaller yet still substantial. Similar drops can be observed for the two other measures, precision@10 and recall@10.
The monolingual and multilingual models offer comparable performance according to all measures and across all languages. This indicates that including languages other than the target language in the train set generally does not improve the performance of the models, especially if the training dataset is sufficiently large. This finding supports the so-called curse of multilinguality, i.e., a trade-off between the number of languages a model supports and the overall decrease in performance on monolingual and cross-lingual benchmarks. It is, however, very likely that transfer between languages would be more successful if the language set contained more similar languages.
5.1 Adding More Languages in a Cross-lingual Setting
Above we showed that adding other languages to a train set that already contains data matching the target language generally does not improve performance. Here, on the other hand, we explore whether it is worth adding more languages in a cross-lingual setting. We consider English as the test language and train on different combinations of languages that do not include English. Figure 1 presents the correlation between the number of languages and the performance of the model according to F1@10, and does indicate some positive correlation between the number of languages in the train set and the F1@10 improvement. Among single-language models, the best was the model trained on Croatian (labeled C). Overall, the best-performing model on English was trained on the concatenation of the Croatian and Estonian corpora (labeled CE). Adding further languages to the train set did not improve performance. It does, however, improve the stability of the models, i.e., models trained on more languages tend to have a higher performance minimum but also a lower performance maximum, as can be clearly seen in Figure 2.
5.2 Zero-shot Performance of the Monolingual Models
We explored how powerful the monolingual (MON) models described in Section 4.3 are in a cross-lingual zero-shot keyword extraction setting. Each of the six trained monolingual models was tested on all six languages to obtain the heatmap presented in Figure 3. No single monolingual model worked best for all of the remaining languages. For English, the best-performing model was trained on Croatian, most likely because both datasets contain news from 2019, suggesting some topic intersection. The best performance on the Estonian dataset was achieved by the model trained on the Latvian dataset, most likely because both datasets cover the same time period and were collected by the same media company, which covers news for both neighboring countries, Estonia and Latvia. Not surprisingly, the reverse also holds: the best-performing model on the Latvian dataset was trained on the Estonian dataset. The best performance on the Russian dataset was achieved by the Estonian model, since both datasets come from the same media house stationed in Estonia, as reported by koloski-etal-2021-extending. Finally, the best performance on the Slovenian data was achieved by the Croatian model, most likely because both languages belong to the South Slavic language group and Slovenia and Croatia are neighbouring countries.
We also conducted hierarchical clustering, using the cross-lingual scores of the monolingual models as affinities. We present the resulting dendrogram in Figure 4. The results mostly confirm the relations between languages, countries and data sources pointed out above. The Estonian and Latvian datasets appear most similar. The Russian dataset is a natural addition to this cluster, most likely due to language and content similarity. Interestingly, Croatian and English form a separate cluster, most likely because both contain news from 2019, while the Slovenian dataset appears most dissimilar to the others.
6 Conclusions and Further Work
In this work, we have presented a comprehensive comparison study covering multiple unsupervised, cross-lingual, multilingual and monolingual approaches to keyword extraction. While we did not manage to improve the performance of the supervised monolingual models by adding additional foreign-language data to the training dataset, the results clearly indicate that cross-lingual models outperform unsupervised methods by a large margin. This suggests that if a labeled keyword train set from a specific domain is not available for a specific low-resource language, one may opt to train a supervised model on a dataset covering the same domain in some other (preferably similar) language and employ that model in a zero-shot setting before resorting to unsupervised methods.
While cross-lingual models tend to outperform unsupervised approaches by a large margin, the discrepancy in performance between the supervised cross-lingual setting and the supervised monolingual setting is nevertheless substantial, and training the model on the target language remains the preferred option in terms of performance. This is in line with further experiments conducted during the study, which suggest that the models perform well for target languages similar to the languages on which the model was trained and when there is some intersection between the news content in the training and test datasets.
For further work we propose exploring few-shot scenarios, in which a small amount of target-language data is added to the multilingual train set. We plan to pinpoint the amount of target-language data needed to bridge the performance gap between the monolingual and cross-lingual models. Additionally, we propose ensembling multiple methods and exploring how that would benefit the performance of the approach.
This work was supported by the Slovenian Research Agency (ARRS) grants for the core programme Knowledge Technologies (P2-0103), the project Computer-assisted Multilingual News Discourse Analysis with Contextual Embeddings (CANDAS, J6-2581), as well as the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media). The third author was financed via a young researcher ARRS grant.
8 Bibliographical References
- (2009) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.
- (2016) PKE: an open source Python-based keyphrase extraction toolkit. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, pp. 69–73.
- (2018) Unsupervised keyphrase extraction with multipartite graphs. CoRR abs/1803.08721.
- (2017) Manually sentiment annotated Slovenian news corpus SentiNews 1.0. Slovenian language resource repository CLARIN.SI.
- (2018) YAKE! Collection-independent automatic keyword extractor. In European Conference on Information Retrieval, pp. 806–810.
- (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding.
- (2009) KP-Miner: a keyphrase extraction system for English and Arabic documents. Information Systems 34(1), pp. 132–144.
- (2017) PositionRank: an unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1105–1115.
- (2019) KPTimes: a large-scale dataset for keyphrase generation on news documents. In Proceedings of the 12th International Conference on Natural Language Generation, pp. 130–135.
- (2020) XHate-999: analyzing and detecting abusive language across domains and languages. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 6350–6365.
- (2020) KeyBERT: minimal keyword extraction with BERT. Zenodo.
- (2016) Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1631–1640.
- (2014) Cross-lingual information to the rescue in keyword extraction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 1–6.
- (2010) LemmaGen: multilingual lemmatisation with induced ripple-down rules. Journal of Universal Computer Science 16(9), pp. 1190–1214.
- (2021) Extending neural keyword extraction with TF-IDF tagset matching. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Online, pp. 22–29.
- (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- (2009) Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 257–266.
- (2018) Key2Vec: automatic ranked keyphrase extraction from scientific articles using phrase embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, USA, pp. 634–639.
- (2020) TNT-KID: transformer-based neural tagger for keyword identification. Natural Language Engineering, pp. 1–40.
- (2004) TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411.
- (2020) Zero-shot learning for cross-lingual news sentiment classification. Applied Sciences 10(17).
- (2021) Investigating cross-lingual training for offensive language detection. PeerJ Computer Science 7, pp. e559.
- (2019) Language models are unsupervised multitask learners. Technical report, OpenAI.
- (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
- (2010) Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, pp. 1–20.
- (2019) RaKUn: rank-based keyword extraction via unsupervised learning and meta vertex aggregation. CoRR abs/1907.06458.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- (2019) HuggingFace's Transformers: state-of-the-art natural language processing. CoRR abs/1910.03771.