
Applying Transformer-based Text Summarization for Keyphrase Generation

09/08/2022
by Anna Glazkova, et al.

Keyphrases are crucial for searching and systematizing scholarly documents. Most current methods for keyphrase extraction aim to select the most significant words present in the text. In practice, however, the list of keyphrases often includes words that do not appear in the text explicitly; in this case, the list of keyphrases represents an abstractive summary of the source text. In this paper, we experiment with popular transformer-based models for abstractive text summarization using four benchmark datasets for keyphrase extraction. We compare the results obtained with the results of common unsupervised and supervised methods for keyphrase extraction. Our evaluation shows that summarization models are quite effective at generating keyphrases in terms of the full-match F1-score and BERTScore. However, they produce many words that are absent from the author's list of keyphrases, which makes summarization models ineffective in terms of ROUGE-1. We also investigate several ordering strategies for concatenating target keyphrases. The results show that the choice of strategy affects the performance of keyphrase generation.


1 Introduction

A list of keyphrases is an important attribute of a scientific text. Keyphrases provide a brief representation of the contents of a text and help search engines find and systematize papers. A high-quality selection of keyphrases positively affects a paper's visibility and its number of citations [2, 17].

Many of the current approaches to keyphrase extraction involve selecting candidate words from the source text, ranking the candidates, and choosing the top $k$, where the value of $k$ is determined by the user. Methods that directly extract keyphrases from the text will produce only those phrases that are explicitly contained in it. But in practice, keyphrases often represent an abstractive summary of the text; this summary can include hypernyms and paraphrased sentences from the source text. Therefore, studying the applicability of abstractive text summarization methods is a major area of interest within the field of generating multiple keyphrases as a sequence.

Compared with traditional methods for keyphrase extraction, the approaches based on abstractive text summarization have the following properties: a) $k$ is a value determined by the model rather than an input parameter; b) the model takes into account both semantic and syntactic components of the source text; c) words or phrases that do not occur in the source text can also be proposed as keyphrases. To date, few studies have investigated generating keyphrases using text summarization [9, 35]. However, the performance of state-of-the-art models based on the transformer architecture has not been closely examined for this task. Moreover, no studies have compared the effectiveness of different ordering strategies for concatenating target keyphrases.

In this paper, we aim to fill this research gap by systematically evaluating transformer-based abstractive text summarization models on several keyphrase extraction benchmarks. We seek to answer the following research questions:

  • RQ1: Do transformer-based models for abstractive summarization outperform other baselines?

  • RQ2: What is the effect of different ordering strategies for concatenating target keyphrases?

The paper is organized as follows. Section 2 contains a brief review of related works. Next, we describe our methods and experiments in Section 3. In Section 4, we discuss the results. Finally, Section 5 concludes this paper.

2 Related Work

The aim of keyphrase extraction is to identify a set of phrases related to the main topics discussed in a given document [15]. To date, a large volume of published studies has presented unsupervised and supervised approaches to keyphrase extraction.

Unsupervised approaches basically consist of the following stages: selecting candidate words or n-grams according to some characteristics; ranking the candidates; and forming the keyphrases from the top-ranked words [29]. Unsupervised keyphrase extraction is mainly performed by statistical and graph-based methods. Statistical methods, such as TFIDF, KP-Miner [12], and YAKE! [8], utilize textual statistical features to identify the most important words in the text. The general idea of graph-based methods is to build a document graph consisting of candidate phrases as nodes and their relations as edges. Well-known examples of graph-based methods are TextRank [27], TopicRank [6], and PositionRank [13].
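To make this select-rank-choose scheme concrete, the following minimal Python sketch ranks unigram-to-trigram candidates of one document by TF-IDF weights computed over a corpus. It only illustrates the statistical approach; it is not the implementation of any cited system, and the toy corpus and stop-word list are assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_keyphrases(corpus, doc_index, top_k=10):
        # Select candidates: all unigrams to trigrams, minus English stop words.
        vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
        matrix = vectorizer.fit_transform(corpus)      # corpus-level statistics
        terms = vectorizer.get_feature_names_out()
        # Rank: TF-IDF weights of the candidates in the chosen document.
        weights = matrix[doc_index].toarray().ravel()
        ranked = sorted(zip(terms, weights), key=lambda p: p[1], reverse=True)
        # Choose: the top-k candidates with non-zero weight.
        return [term for term, w in ranked[:top_k] if w > 0]

    corpus = ["graph-based methods rank candidate phrases in a document graph",
              "statistical methods weight candidate words by frequency features"]
    print(tfidf_keyphrases(corpus, doc_index=0, top_k=5))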

Supervised methods for keyphrase extraction include methods based on traditional supervised algorithms as well as deep learning methods. In machine learning terms, the words in a document are "examples" and the objective of a keyphrase extraction system is to divide the examples into "keyphrases" and "non-keyphrases" [1]. One of the most common keyphrase extraction systems is KEA [38], which identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a Naïve Bayes classifier to predict the most probable keyphrases. Binary classification models for keyphrase extraction were also proposed in [24, 28, 36]. Zhang et al. [40] proposed a deep recurrent neural network model that combines keyphrases and context information to jointly handle keyphrase ranking and keyphrase generation. Meng et al. [26] presented a generative model for keyphrase prediction with an encoder-decoder framework (CopyRNN). Wang et al. [37] proposed a topic-based adversarial neural network (TANN), which uses the idea of transfer learning.

To date, transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT) [11], show state-of-the-art results in many natural language processing tasks. Several studies have investigated the use of BERT-based language models for keyphrase extraction. For instance, KeyBERT [14] calculates cosine similarity between BERT embeddings to find the sub-phrases that most fully reflect the content of the document. Other researchers have fine-tuned BERT-based models for keyphrase extraction framed as a classification or sequence labelling task [21, 32].

A number of studies have examined neural models that generate multiple keyphrases as a sequence [9, 35]. Chowdhury et al. [10] showed that fine-tuned BART [20] can achieve competitive results in keyphrase generation compared with existing neural models for keyphrase extraction. The authors ranked the produced keyphrases, selected a fixed number of generated keyphrases per source text, and separately calculated an F1-score for keyphrases present in the source text and recall for keyphrases absent from it. Meng et al. [25] explored the effect of different strategies for concatenating target keyphrases using the One2Seq model. They showed that the order in which target phrases are concatenated matters: the best results were achieved when target keyphrases were sorted by their first occurrences in the source text.

3 Experimental Setup

3.1 Datasets

Characteristic           Krapivin-A  Krapivin-T  KP20K      Inspec     SemEval2017
Size                     2294        2293        20000      2000       500
Domains                  CS          CS          CS         CS         CS, material science, physics
Type of texts            abstracts   texts       abstracts  abstracts  paragraphs
Avg symbols per text     1001.74     43807.85    995.85     777.25     1113.66
  STD                    381.37      12565.47    451.32     392.69     310.45
Avg tokens per text      169.06      8597.63     165.72     127.35     194.99
  STD                    68.58       2411.77     76.89      65.03      58.14
Avg keyphrases per text  5.34        5.34        5.28       14.11      17.3
  STD                    2.77        2.77        3.77       6.41       7
Absent keyphrases, %     51.3        18.04       43.67      43.8       0
  STD                    25.99       19.69       28.38      17.83      0
Table 1: Data statistics. CS stands for Computer Science. The number of tokens is obtained using NLTK [4]. STD is the standard deviation of the corresponding characteristic.

We use several English corpora of scientific texts to evaluate automatic keyphrase extraction methods. The main characteristics of the datasets are presented in Table 1. The percentage of absent keyphrases is the proportion of keyphrases from the author's list that do not appear in the text.

  • Krapivin2009 [19], a dataset that contains full papers divided into title, abstract, and text. In this work, we extract keyphrases from abstracts and texts separately (Krapivin-A and Krapivin-T, respectively).

  • Inspec [16], a dataset for keyphrase extraction from scientific abstracts.

  • KP20K [26], a large-scale scholarly abstracts corpus with 528K abstracts for training, 20K abstracts for validation and 20K abstracts for testing. In this work, we limit ourselves to utilizing only the test set.

  • SemEval2017 [3], a corpus of paragraphs manually selected from scientific papers across several domains.

3.2 Text Summarization Models

For each dataset, we fine-tune two pretrained language models for abstractive text summarization (Figure 1):

  • BART [20], a transformer-based denoising autoencoder for pre-training a sequence-to-sequence model. We use BART-base, which has 139M parameters, 12 layers, and a hidden size of 768;

  • T5 [30], an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format. We utilize T5-small, which has 60M parameters, 12 layers, and a hidden size of 512.

We do not fix the number of keyphrases to be generated; this value is therefore determined by the model. This approach can be useful when the number of keyphrases in the training set differs significantly depending on the topic and type of the source text. Another advantage is that the model can independently choose the optimal number of keyphrases based on the training set. However, a model trained in this manner cannot generate an exact number of keyphrases specified at the evaluation stage.

Figure 1: Keyphrase generation with pretrained language models.
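As a minimal sketch of the generation step in Figure 1 (the checkpoint path, the comma separator, and the sequence lengths are illustrative assumptions rather than the exact configuration of our models):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Hypothetical path to a checkpoint fine-tuned as described in Section 3.6.
    checkpoint = "./t5-small-keyphrases"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def generate_keyphrases(text):
        inputs = tokenizer(text, max_length=256, truncation=True,
                           return_tensors="pt")
        # The model itself decides how many phrases to emit before stopping.
        output_ids = model.generate(**inputs, max_length=256)
        decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Targets were concatenated with commas, so split the string back apart.
        return [kp.strip() for kp in decoded.split(",") if kp.strip()]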

3.3 Ordering for Concatenating Keyphrases

We define eight strategies for concatenating target keyphrases, the first six of which were presented in [25]: a) No-Sort, i.e. keyphrases in their original order; b) Random, in random order; c) Length, keyphrases sorted by their length in symbols; d) Alpha, in alphabetical order; e) Appear-Pre, present keyphrases sorted by their first occurrences, with absent keyphrases randomly shuffled and added at the beginning; f) Appear-Post, the same as the previous strategy, but with absent keyphrases added at the end.

In addition to these strategies, we consider the following ways to concatenate keyphrases:

  • Appear-Pre-CC, the same as Appear-Pre, but absent and present keyphrases are marked with the control codes <absent> and <present>, respectively, for example: <absent> kp_1, ..., <absent> kp_i, <present> kp_{i+1}, ..., <present> kp_n, where n is the number of keyphrases. The idea of using control codes was introduced in [18]. Control codes are certain words, sentences, or links added to the input of a text generation model to steer it toward coherent outputs. For example, control codes were used for the joint generation of titles and short summaries (TLDRs) for scientific papers in [7]: the authors appended each source with a control code for short-summary or title generation, which allowed the model parameters to learn to generate a TLDR or a title depending on the control code. In our work, we utilize control codes to distinguish between absent and present keyphrases.

  • Appear-Post-CC, same as Appear-Post but using control codes.

The presented strategies were applied to the target keyphrase sequences during data preparation, i.e. before model training.
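A minimal sketch of this target-side preprocessing is given below; the function name and the heuristic for detecting absent keyphrases (case-insensitive substring matching) are illustrative assumptions.

    import random

    def concat_keyphrases(text, keyphrases, strategy="appear-pre",
                          control_codes=False):
        """Concatenate target keyphrases according to an ordering strategy."""
        lower = text.lower()
        present = [kp for kp in keyphrases if kp.lower() in lower]
        absent = [kp for kp in keyphrases if kp.lower() not in lower]
        if strategy == "no-sort":
            ordered = list(keyphrases)
        elif strategy == "random":
            ordered = random.sample(keyphrases, len(keyphrases))
        elif strategy == "length":
            ordered = sorted(keyphrases, key=len)        # length in symbols
        elif strategy == "alpha":
            ordered = sorted(keyphrases, key=str.lower)
        elif strategy in ("appear-pre", "appear-post"):
            present.sort(key=lambda kp: lower.find(kp.lower()))  # first occurrence
            random.shuffle(absent)
            ordered = absent + present if strategy == "appear-pre" else present + absent
        else:
            raise ValueError(strategy)
        if control_codes:  # the Appear-Pre-CC / Appear-Post-CC variants
            ordered = [("<absent> " if kp in absent else "<present> ") + kp
                       for kp in ordered]
        return ", ".join(ordered)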

3.4 Baselines

To evaluate the performance of transformer-based text summarization models for keyphrase generation, we compare them with the following baselines:

  • TFIDF, a method based on statistical frequencies that expresses the importance of a word in a particular document of a corpus.

  • TopicRank [6], an unsupervised extractive method that extracts noun phrases, clusters them into topics, ranks the topics in a complete graph whose vertices are the topic clusters, and selects keyphrases from the top-ranked topics.

  • YAKE! [8], an unsupervised multilingual method that utilizes various features, such as term position, term frequency, and others.

  • KEA [38], a supervised method that extracts candidate keyphrases on the basis of several linguistic features and uses the Naïve Bayes algorithm to classify candidates as keyphrases or non-keyphrases.

  • KeyBERT [14], a method that utilizes document and word embeddings from BERT and cosine similarity to find the sub-phrases in a document that are the most similar to the document itself.

3.5 Evaluation Metrics

Since previous studies have highlighted that different metrics are necessary to capture different aspects of text generation [33, 34], we adopt three diverse metrics to evaluate the performance of the selected transformer-based text summarization models and baselines: ROUGE-1 [22], full-match F1-score, and BERTScore [41]. For methods that return a predefined number of keyphrases, we compute the metrics on the top 5, top 10, and top 15 returned keyphrases.

The ROUGE-1 score counts the matching unigrams between the model-generated text and the reference. To measure ROUGE-1, the keyphrases for each text are combined into a single string with a comma as a separator.

The full-match F1-score evaluates the number of exact matches between the original and generated sets of keyphrases. It is calculated as a harmonic mean of precision and recall.

BERTScore utilizes pre-trained contextual embeddings from BERT-based models and matches words in the source and generated texts using cosine similarity. This metric has been shown to correlate with human judgment at both the sentence level and the system level. To calculate BERTScore, we use contextual embeddings from RoBERTa-large [23], a modification of BERT pretrained with dynamic masking.
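Below is a minimal sketch of all three metrics, assuming the rouge-score and bert-score Python packages as tooling (our exact evaluation code may differ):

    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    def full_match_f1(reference, candidate):
        # Exact-match F1: harmonic mean of precision and recall over keyphrase sets.
        ref = {kp.strip().lower() for kp in reference}
        cand = {kp.strip().lower() for kp in candidate}
        matches = len(ref & cand)
        if matches == 0:
            return 0.0
        precision, recall = matches / len(cand), matches / len(ref)
        return 2 * precision * recall / (precision + recall)

    reference = ["electronic publishing", "e-books", "library automation"]
    candidate = ["e-book revolution", "electronic publishing", "reading devices"]
    print(full_match_f1(reference, candidate))   # P = R = 1/3, so F1 = 0.333...

    # ROUGE-1: keyphrases are joined into one comma-separated string.
    scorer = rouge_scorer.RougeScorer(["rouge1"])
    print(scorer.score(", ".join(reference), ", ".join(candidate))["rouge1"].fmeasure)

    # BERTScore with RoBERTa-large contextual embeddings.
    P, R, F1 = bert_score([", ".join(candidate)], [", ".join(reference)],
                          model_type="roberta-large")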

3.6 Implementation Details

We utilize PKE [5], a state-of-the-art open-source Python-based keyphrase extraction toolkit, to implement TFIDF, TopicRank, YAKE!, and KEA. We use the n-gram range (1, 3), which means that produced keyphrases can be unigrams, bigrams, or trigrams. To implement KeyBERT, we use the embeddings produced by the all-MiniLM-L6-v2 model [31], which maps texts to a 384-dimensional dense vector space.
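The following hedged sketch shows how these baselines can be invoked; the PKE and KeyBERT calls reflect the toolkits' public APIs, which are version-dependent, and the input text is a toy example.

    import pke
    from keybert import KeyBERT

    text = ("Keyphrases are crucial for searching and systematizing scholarly "
            "documents. We study transformer-based keyphrase generation.")

    # TopicRank via PKE: select candidates, weight them, take the n best.
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=text, language="en")
    extractor.candidate_selection()
    extractor.candidate_weighting()
    print(extractor.get_n_best(n=10))            # [(phrase, score), ...]

    # KeyBERT with all-MiniLM-L6-v2 embeddings and the same (1, 3) n-gram range.
    kw_model = KeyBERT(model="all-MiniLM-L6-v2")
    print(kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 3), top_n=10))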

We conduct our experiments on text summarization models using the Transformers library [39]. We fine-tune the models for 3 epochs on each dataset with a batch size of 8, a maximum sequence length of 256, and a learning rate of 4e-5. All summarization models are evaluated by three-fold cross-validation: we compute ROUGE-1, F1-score, and BERTScore for each fold separately and then report the mean value.
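A minimal fine-tuning sketch with the stated hyperparameters, using Seq2SeqTrainer from the Transformers library; the toy dataset, column names, and output directory are illustrative assumptions (text_target requires Transformers >= 4.21).

    from datasets import Dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    checkpoint = "t5-small"                      # or "facebook/bart-base"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Toy training pair; real targets are keyphrases concatenated per Section 3.3.
    raw = Dataset.from_dict({
        "text": ["We study keyphrase generation with sequence-to-sequence models."],
        "keyphrases": ["keyphrase generation, text summarization, transformers"],
    })

    def preprocess(batch):
        model_inputs = tokenizer(batch["text"], max_length=256, truncation=True)
        labels = tokenizer(text_target=batch["keyphrases"],
                           max_length=256, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    train_dataset = raw.map(preprocess, batched=True,
                            remove_columns=raw.column_names)

    args = Seq2SeqTrainingArguments(
        output_dir="kp-gen",
        num_train_epochs=3,                      # hyperparameters stated above
        per_device_train_batch_size=8,
        learning_rate=4e-5,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()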

4 Results and Discussion

To answer RQ1, we compare the results of all considered models in Tables 2, 3, and 4 in terms of F1-score, ROUGE-1, and BERTScore, respectively. For the baselines, we compute metrics on the top 5, 10, and 15 returned keyphrases. The best result for each dataset is highlighted. The results for Appear-Post, Appear-Pre-CC, and Appear-Post-CC are omitted for SemEval2017 because all keyphrases in this dataset are present in the corresponding texts; Appear-Pre therefore simply produces the list of keyphrases sorted by their first occurrences.

Full-match results on all corpora are reported in Table 2. Among unsupervised methods, TFIDF shows the highest results on Krapivin-A, Krapivin-T, and KP20K. TopicRank demonstrates the best performance on Inspec and SemEval2017. KEA outperforms other baselines on Krapivin-A (in terms of F1@5), Krapivin-T (F1@10), and Inspec (F1@15). KeyBERT does not show the highest performance on any of the datasets and demonstrates the worst F1-score on Inspec and SemEval2017. Overall, BART achieves the best full-match results on Krapivin-A and Krapivin-T. T5 outperforms other methods on Inspec, KP20K, and SemEval2017.

Table 2 indicates the superiority of abstractive summarization models on the datasets that have a large proportion of absent keyphrases. In particular, the number of summarization models that outperform the best baseline result is higher for the datasets consisting of abstracts: KP20K (43.67% of absent keyphrases) – 9 models; Inspec (43.8%) – 8 models; Krapivin-A (51.3%) – 3 models; Krapivin-T (18.04%) – 1 model; SemEval2017 (0%) – 1 model.

Data        Krapivin-A            Krapivin-T            Inspec                KP20K                 SemEval2017
            F1@5  F1@10  F1@15    F1@5  F1@10  F1@15    F1@5  F1@10  F1@15    F1@5  F1@10  F1@15    F1@5  F1@10  F1@15

TFIDF 10.8 10.5 9.48 7.85 8.59 8.45 9.76 13.27 14.81 11.12 10.69 9.66 14.86 19.33 21.96
TopicRank 7.61 7.73 7.38 5.4 5.72 5.44 12.01 14.91 16 8.51 8.13 7.54 17.31 22.83 24.93
YAKE! 6.93 8.33 8.1 7.68 8.34 7.79 10.53 13.46 13.96 7.71 8.65 8.58 13.03 18.52 21.04
KEA 10.88 10.47 9.53 8.2 9.19 9.1 9.76 13.17 16.61 7.85 7.49 6.84 14.85 19.42 21.72
KeyBERT 9.46 9.35 8.63 5.54 5.29 4.76 8.62 10.75 11.6 8.27 8.32 7.81 9.47 12.59 14.29
BART
No Sort 9.19 5.55 14.45 10.68 10.23
Random 9.29 5.95 12.14 11.14 12.8
Length 9.09 5.43 11.77 10.38 13.58
Alpha 8.71 5.74 9.64 10.72 10.74
Appear-Pre 10.1 5.17 14.55 11.72 18.22
Appear-Post 7.59 4.57 13.33 11.36 -
Appear-Pre-CC 11.24 6.38 15.11 11.76 -
Appear-Post-CC 9.6 9.65 15.58 11.31 -
T5
No Sort 10.2 6.05 22.29 11.65 16.62
Random 11 5.9 17.49 11.94 18.62
Length 10.69 5.31 17.3 10.96 17.51
Alpha 9.76 6.34 17.43 11.18 18.3
Appear-Pre 10.93 6.09 19.53 11.43 25.39
Appear-Post 10.02 6.18 19.04 10.92 -
Appear-Pre-CC 8.88 4.61 21.24 9.61 -
Appear-Post-CC 7.38 4.78 18.18 7.53 -
Table 2: Results (F1-score, %). The best result for each dataset is highlighted. BART and T5 produce a variable number of keyphrases, so a single score per dataset is reported for each strategy.

ROUGE-1 scores are reported in Table 3. TopicRank is the best on Inspec and SemEval2017, KEA demonstrates the highest performance on Krapivin-T, and KeyBERT achieves the best results on Krapivin-A and KP20K. The scores of BART and T5 are quite low on every dataset. To sum up, the results in terms of F1-score and ROUGE-1 show that abstractive text summarization models are relatively successful at predicting full-match keyphrases, but the generated sequence of keyphrases contains only a small number of words from the original list of keyphrases.

Data        Krapivin-A            Krapivin-T            Inspec                KP20K                 SemEval2017
            R1@5  R1@10  R1@15    R1@5  R1@10  R1@15    R1@5  R1@10  R1@15    R1@5  R1@10  R1@15    R1@5  R1@10  R1@15

TFIDF 27.66 29.91 29.14 21.04 24.84 25.21 26.55 36.45 41.45 27.9 30.11 29.3 23.67 34.92 41.72
TopicRank 24.91 25.36 24.19 19.66 21.92 21.62 31.77 40.49 44.09 25.74 25.77 24.31 29.99 41.67 47.78
YAKE! 24.63 27.7 28.25 22.85 25.98 26.36 30.29 37.44 40.9 25.14 27.82 28.21 26.43 35.65 41.39
KEA 28.09 30.05 29.37 21.88 26.54 27.26 26.49 36.03 40.88 19.86 21.36 20.89 23.51 34.4 40.83
KeyBERT 30.22 31.11 30.49 24.78 25.16 24.01 31.01 38.33 41.93 29.68 30.41 29.46 26.69 36.91 42.77
BART
No Sort 22.69 16.29 36.77 22.9 22.27
Random 22.88 16.48 31.17 22.97 26.61
Length 22.28 15.38 30.4 21.6 27.2
Alpha 21.51 15.72 27.51 22.91 20.23
Appear-Pre 22.53 16.09 34.38 23.57 38.54
Appear-Post 21.71 15.27 34.44 17 -
Appear-Pre-CC 22.22 15.98 34.25 23.52 -
Appear-Post-CC 21.46 16.47 35.08 17.19 -
T5
No Sort 21.68 14.37 41.51 21.76 27.98
Random 22.37 14.04 33.52 21.9 33.12
Length 22.73 13.62 34.27 20.84 30.52
Alpha 21.66 15.38 32.67 22.02 29.11
Appear-Pre 22.45 14.94 36.2 21.04 40.34
Appear-Post 21.07 14.89 36.7 22.13 -
Appear-Pre-CC 19.1 12.8 39.5 19.48 -
Appear-Post-CC 18.13 12.47 35.02 18.09 -
Table 3: Results (ROUGE-1, %). The best result for each dataset is highlighted.

The results in terms of BERTScore are shown in Table 4. Among unsupervised methods, the best scores on all datasets are obtained by TopicRank; KEA and KeyBERT perform worse on all considered text corpora. The results indicate the sustained superiority of BART, which achieves the highest scores on all datasets. The results of T5 vary with the data: for instance, T5 shows rather high scores on Inspec but never exceeds the best result obtained by traditional methods on Krapivin-T. In addition, BERTScore also shows that abstractive summarization models mostly outperform traditional methods on the datasets with a large proportion of absent keyphrases, as is the case with F1-score. For example, 15 of 16 models are superior to TopicRank on Inspec, and 11 models beat TopicRank on Krapivin-A. On the other hand, only two of ten summarization models outperform the best baseline result on SemEval2017, whose proportion of absent keyphrases is 0%. Overall, BERTScore indicates that abstractive summarization models can produce keyphrases that are close in meaning to the original keyphrases in terms of token similarity calculated with contextual embeddings.

Data        Krapivin-A            Krapivin-T            Inspec                KP20K                 SemEval2017
            BS@5  BS@10  BS@15    BS@5  BS@10  BS@15    BS@5  BS@10  BS@15    BS@5  BS@10  BS@15    BS@5  BS@10  BS@15

TFIDF 85.91 85.12 84.24 85.32 85.12 84.5 83.45 84.15 84.28 85.6 84.7 83.86 83.64 84.61 84.73
TopicRank 86.53 86.19 85.7 85.95 85.74 85.16 84.25 85.17 85.45 86.21 85.75 85.23 84.29 85.29 85.59
YAKE! 85.28 84.74 84.14 84.98 84.73 84.19 83.64 84.1 84.21 84.88 84.24 83.72 83.5 84.2 84.49
KEA 85.9 85.06 84.2 85.49 85.41 84.83 83.44 84.1 84.21 85.56 84.66 83.82 83.59 84.51 84.61
KeyBERT 86.43 85.46 84.72 85.45 84.21 83.33 84.39 84.64 84.56 85.72 84.71 84 84.3 84.82 84.91
BART
No Sort 87.9 86.59 87.72 86.97 84.42
Random 88 87.01 86.81 86.98 84.91
Length 87.85 86.97 86.94 87.13 84.95
Alpha 87.72 86.68 86.45 87.08 83.91
Appear-Pre 88.09 86.66 87.36 87.1 86.8
Appear-Post 87.59 86.64 87.53 85.75 -
Appear-Pre-CC 86.31 85.43 86.39 86.86 -
Appear-Post-CC 86.8 86.21 86.73 86.47 -
T5
No Sort 87.3 85.03 86.74 84.84 84.21
Random 87.02 85.07 85.3 86.79 83.98
Length 87.29 84.23 85.52 84.11 84.1
Alpha 86.35 84.51 86.1 85.75 83
Appear-Pre 86.59 83.57 86.03 86.05 86.43
Appear-Post 85.32 84.39 85.67 84 -
Appear-Pre-CC 86.06 83.72 86.31 85.07 -
Appear-Post-CC 85.05 84.28 85.77 84.25 -
Table 4: Results (BERTScore, %). The best result for each dataset is highlighted.
Figure 2: An example of the differences between the metrics. Original keyphrases (reference): e-books, library journal, library automation, electronic books, electronic publishing. T5 (candidate): e-book revolution, consumer market, reading devices, electronic publishing. F1-score: 22.22%, ROUGE-1: 26.67%, BERTScore: 88.86%

Table 5 shows two examples of keyphrase generation. The first row contains the original list of keyphrases provided by the authors; the rest of the table shows the keyphrases derived by the various methods. For BART and T5, we provide the list of keyphrases produced by the best models. Full matches with keyphrases from the original list are shown in bold, and exact word matches with words from the authors' list are underlined. These examples demonstrate that text summarization models generate more abstractive keyphrases and use fewer repeated words.

Our study demonstrates that different evaluation metrics estimate different aspects of keyphrase generation. For instance, Figure 2 visualizes the matching scores between tokens from the original and generated keyphrases obtained by BERTScore. The original example was taken from Inspec, and the generated list of keyphrases was produced by T5 (No Sort). Its F1-score is close to the mean F1-score for this dataset, while its ROUGE-1 score is much lower and its BERTScore is higher than the corresponding mean values. Therefore, the choice of metric is an important step of the research that directly affects the conclusions: in this work, ROUGE-1 indicates the superiority of traditional methods for keyphrase extraction, whereas BERTScore identifies BART as the most effective model.

Source Example 1 Example 2
Original female computer science doctorates, survey of earned doctorates, information science, computer science education, gender issues OS porting, application development, consumer operating system, hardware design, operating systems (computers), software portability
TFIDF doctorates, women, completing doctorates, science, computer science, 1997, academic year, sed, females, academic operating system, porting, deliver improved, deliver improved usability, improved usability, high-end portable, high-end portable consumer, portable consumer, portable consumer products, consumer products
TopicRank doctorates, computer, women, academic year, science, percentages, degrees, sed, prior research, females major complexity, device developer, appropriate consumer operating system, validation, porting, problem, support, system implementation, trend, real-time OS
YAKE! education statistics, national center, computer science, computer, science, doctorates, degrees, statistics, national, center portable consumer products, high-end portable consumer, deliver improved usability, appropriate consumer operating, consumer products, portable consumer, appropriate consumer, consumer operating system, deliver improved, improved usability
KEA doctorates, women, science, computer science, completing doctorates, 1997, academic year, academic, earned, education statistics operating system, deliver improved, deliver improved usability, improved usability, high-end portable, high-end portable consumer, portable consumer, portable consumer products, consumer products, appropriate consumer
KeyBERT doctorates computer, doctorate computer, include doctorates, earned doctorates, science doctorates, computer science, doctorates, doctorates sed, doctorates degrees, doctorates include OS porting, platform OS, supported OS, operating OS, OS, cation device, portable consumer, implementation porting, device developer, platform
BART academic year, computer science education, women education, national center for education statistics portable computing, software development, portable consumer products, software, hardware design, real-time OS, porting, validation, software aspects, portability, performance, user interfaces, user experience, portable products, consumer operating system, operating system
T5 computer science education, women, computer science doctorates, academic levels, gender issues, education porting, virtual reality, portable consumer products, consumer operating system, OS, commercially supported OS, complete operating system, real-time OS, complex platform OS, real-time operating systems, asynchronous operating systems, portable devices, mobile computing
Table 5: Keyphrases extracted by different models. Full matches are bolded, and exact matches are underlined.

To answer RQ2, we estimate the performance growth of keyphrase generation for each concatenation strategy in comparison with the No-Sort strategy. The performance growth is calculated as follows:

\Delta_s = \frac{P_s - P_{\mathrm{NoSort}}}{P_{\mathrm{NoSort}}} \times 100\% \qquad (1)

where P_{\mathrm{NoSort}} is the performance (F1-score or BERTScore) for the No-Sort strategy and P_s is the performance for a strategy s from the set {Random, Length, Alpha, Appear-Pre, Appear-Post, Appear-Pre-CC, Appear-Post-CC}.
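As a worked example of Eq. (1) using the F1-scores from Table 2: for BART on Inspec, No Sort gives P_{\mathrm{NoSort}} = 14.45 and Appear-Pre gives P_s = 14.55, so the growth is (14.55 - 14.45) / 14.45 \times 100\% \approx +0.7\%; for T5 on SemEval2017, No Sort gives 16.62 and Appear-Pre gives 25.39, a growth of approximately +52.8\%.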

Figure 3: Comparing strategies for summarization models. Panels: a) BART (F1-score); b) T5 (F1-score); c) BART (BERTScore); d) T5 (BERTScore).

As shown in Figure 3, Random can increase the results, but we found no stable improvement when using this strategy. Length, Alpha, and Appear-Post demonstrate low performance in most models. As can be seen in all charts, Appear-Pre improves the performance on SemEval2017; this ordering is therefore effective for datasets that do not include absent keyphrases. The results of Appear-Pre-CC and Appear-Post-CC are opposite for BART (F1-score) and the other cases: the use of control codes improves the full-match results of BART on all corpora, whereas in all other considered cases control codes appeared to be ineffective.

To illustrate the resource usage of keyphrase generation by different models, Figure 4 shows the time and memory consumption using SemEval2017 as an example. The time and memory used for training are not included: we loaded the trained model, generated keyphrases for all texts in the dataset, and measured the indicators using Python and Google Colaboratory tools. The figure shows that memory usage is, as expected, higher for transformer-based models than for traditional methods. Time consumption when running on a CPU is highest for TFIDF, KEA, and BART. However, using a GPU for the transformers reduces the generation time to less than one minute for KeyBERT and approximately two minutes for BART and T5.

Figure 4: Resource usage on the example of SemEval2017.

5 Conclusion

In this paper, we explored the effectiveness of transformer-based abstractive summarization models for the task of predicting keyphrases for scientific texts. We performed an extensive evaluation of unsupervised and supervised models for keyphrase extraction and compared several ordering strategies for concatenating keyphrases on several datasets. Our results reveal some pros and cons of using transformer-based summarization models for keyphrase extraction. First, we obtained promising results in terms of the full-match F1-score, although ROUGE-1 indicates the superiority of traditional methods for keyphrase extraction. Second, we showed that summarization models are more competitive at generating keyphrases that are not explicitly present in the source text. Finally, we demonstrated that some ordering strategies improve keyphrase generation, while others decrease performance.

References

  • [1] Alami Merrouni, Z., Frikh, B., Ouhbi, B.: Automatic keyphrase extraction: a survey and trends. Journal of Intelligent Information Systems 54(2), 391–424 (2020). 10.1007/s10844-019-00558-9
  • [2] Ale Ebrahim, N., Salehi, H., Embi, M. A., Habibi, F., Gholizadeh, H., Motahar, S. M., Ordi, A.: Effective strategies for increasing citation frequency. International Education Studies 6(11), 93–99 (2013). 10.5539/ies.v6n11p93
  • [3] Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 Task 10: ScienceIE-Extracting Keyphrases and Relations from Scientific Publications. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 546–555 (2017). 10.18653/v1/s17-2091
  • [4] Bird, S.: NLTK: The Natural Language Toolkit. In: Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, 69–72 (2006). 10.3115/1225403.1225421
  • [5] Boudin, F.: PKE: an open source python-based keyphrase extraction toolkit. In: Proceedings of COLING 2016, the 26th international conference on computational linguistics: system demonstrations, 69–73 (2016).
  • [6] Bougouin, A., Boudin, F., Daille, B.: TopicRank: Graph-based topic ranking for keyphrase extraction. In: International joint conference on natural language processing (IJCNLP), 543–551 (2013).
  • [7] Cachola, I., Lo, K., Cohan, A., Weld, D. S.: TLDR: Extreme Summarization of Scientific Documents. In: Findings of the Association for Computational Linguistics: EMNLP 2020, 4766–4777 (2020).
  • [8] Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A.: YAKE! Keyword extraction from single documents using multiple local features. Information Sciences 509, 257–289 (2020). 10.1016/j.ins.2019.09.013
  • [9] Çano, E., Bojar, O.: Keyphrase Generation: A Text Summarization Struggle. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 666–672 (2019). 10.18653/v1/N19-1070
  • [10] Chowdhury, M. F. M., Rossiello, G., Glass, M., Mihindukulasooriya, N., Gliozzo, A. Applying a Generic Sequence-to-Sequence Model for Simple and Effective Keyphrase Generation. arXiv (2022). 10.48550/ARXIV.2201.05302
  • [11] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (2019). 10.18653/v1/N19-1423
  • [12] El-Beltagy, S. R., Rafea, A.: KP-Miner: A keyphrase extraction system for English and Arabic documents. Information systems 34(1), 132–144 (2009). 10.1016/j.is.2008.05.002
  • [13] Florescu, C., Caragea, C.: PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1105–1115 (2017). 10.18653/v1/P17-1102
  • [14] Grootendorst, M.: KeyBERT: Minimal keyword extraction with BERT. Zenodo (2020). 10.5281/zenodo.4461265
  • [15] Hasan, K. S., Ng, V.: Automatic keyphrase extraction: A survey of the state of the art. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1262–1273 (2014). 10.3115/v1/P14-1119
  • [16] Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on Empirical methods in natural language processing, 216–223 (2003). 10.3115/1119355.1119383
  • [17] Ilango, D. V., Kumar, D. S. M.: Factors For Improving The Research Publications And Quality Metrics. International Journal of Civil Engineering & Technology 8(4), 477–496 (2017).
  • [18] Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., Socher, R.: CTRL: A conditional transformer language model for controllable generation. arXiv (2019). 10.48550/arxiv.1909.05858.
  • [19] Krapivin, M., Autaeu, A., Marchese, M.: Large dataset for keyphrases extraction (2009).
  • [20] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880 (2020). 10.18653/v1/2020.acl-main.703
  • [21] Lim, Y., Seo, D., Jung, Y.: Fine-tuning BERT Models for Keyphrase Extraction in Scientific Articles. Journal of advanced information technology and convergence 10(1), 45–56 (2020). 10.14801/jaitc.2020.10.1.45
  • [22] Lin, C. Y. Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp. 74-81 (2004).
  • [23] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv (2019). 10.48550/arXiv.1907.11692
  • [24] Medelyan, O., Frank, E., Witten, I. H.: Human-competitive tagging using automatic keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 1318–1327 (2009). 10.3115/1699648.1699678
  • [25] Meng, R., Yuan, X., Wang, T., Brusilovsky, P., Trischler, A., He, D.: Does order matter? an empirical study on generating multiple keyphrases as a sequence. arXiv (2019). 10.48550/ARXIV.1909.03590
  • [26] Meng, R., Zhao, S., Han, S., He, D., Brusilovsky, P., Chi, Y.: Deep Keyphrase Generation. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 582–592 (2017). 10.18653/v1/P17-1054
  • [27] Mihalcea, R., Tarau, P.: TextRank: Bringing order into text. In: Proceedings of the 2004 conference on empirical methods in natural language processing, 404-411 (2004).
  • [28] Nguyen, T. D., Luong, M. T.: WINGNUS: Keyphrase extraction utilizing document logical structure. In: Proceedings of the 5th international workshop on semantic evaluation, 166–169 (2010).
  • [29] Papagiannopoulou, E., Tsoumakas, G.: A review of keyphrase extraction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10(2), e1339 (2020). 10.1002/widm.1339
  • [30] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 1–67 (2020).
  • [31] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–3992 (2019). 10.18653/v1/D19-1410
  • [32] Sahrawat, D., Mahata, D., Zhang, R., Kulkarni, M., Sharma, A., Gosangi, R., Stent, A., Kumar, Y., Shah, R. R., Zimmermann, R.: Keyphrase Extraction as Sequence Labeling Using Contextualized Embeddings. Lecture Notes in Computer Science 12036, 328–335 (2020). 10.1007/978-3-030-45442-5_41
  • [33] Shen, L., Jiang, H., Liu, L., Shi, S.: Revisiting the Evaluation Metrics of Paraphrase Generation. arXiv (2022). 10.48550/arXiv.2202.08479
  • [34] Stowe, K., Beck, N., Gurevych, I.: Exploring Metaphoric Paraphrase Generation. In: Proceedings of the 25th Conference on Computational Natural Language Learning, 323–336 (2021). 10.18653/v1/2021.conll-1.26
  • [35] Swaminathan, A., Zhang, H., Mahata, D., Gosangi, R., Shah, R., Stent, A.: A preliminary exploration of GANs for keyphrase generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8021–8030 (2020). 10.18653/v1/2020.emnlp-main.645
  • [36] Wang, L., Li, S.: PKU_ICL at SemEval-2017 task 10: Keyphrase extraction with model ensemble and external knowledge. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 934–937 (2017). 10.18653/v1/S17-2161
  • [37] Wang, Y., Liu, Q., Qin, C., Xu, T., Wang, Y., Chen, E., Xiong, H.: Exploiting topic-based adversarial neural network for cross-domain keyphrase extraction. In: 2018 IEEE International Conference on Data Mining (ICDM), 597–606 (2018). 10.1109/icdm.2018.00075
  • [38] Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., Nevill-Manning, C. G.: Kea: Practical automated keyphrase extraction. Design and Usability of Digital Libraries: Case Studies in the Asia Pacific. IGI global, 129–152 (2005). 10.4018/978-1-59140-441-5.ch008
  • [39] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (2020). 10.18653/v1/2020.emnlp-demos.6
  • [40] Zhang, Q., Wang, Y., Gong, Y., Huang, X. J.: Keyphrase extraction using deep recurrent neural networks on Twitter. In: Proceedings of the 2016 conference on empirical methods in natural language processing, 836–845 (2016). 10.18653/v1/D16-1080
  • [41] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., Artzi, Y.: BERTScore: Evaluating Text Generation with BERT. In: International Conference on Learning Representations (2019).