The novel method that we present lies at the intersection of summarization and categorization tasks traditionally described as keyphrase generation, keyphrase extraction, and topic modeling [1, 2, 3, 4].
Topics have multiple purposes. One is to provide a succinct summary of a document by listing its core concepts [3, 5]. This is a document-centric version of topics that enables a reader to quickly grasp the most important themes within a text without reading it. On the other end of the spectrum, topics also serve the purpose of describing how groups of documents relate to each other and the distribution of themes across an entire corpus. This is essentially a categorization task, where the category labels may be predetermined — such as the various sections of a newspaper — or generated de novo from the corpus itself [2, 6].
This reveals two challenges worth noting. First, there is ambiguity around the term "topic" itself. Depending on the version of the task, a "topic" may refer to a string used to describe and label a single document or a cluster of multiple documents, or "topic" may refer to the document cluster itself. The method we describe in this paper serves primarily to generate labels for single documents. (Of course, the generated single-document topic labels, or embedding representations thereof, can serve as the basis for downstream document clustering.)
A second challenge worth noting is that the different versions of topic modeling conflict in their requirements. For example, for the task of summarizing the key themes of a single document, the optimum topic labels may call for highly specific keyphrases. But for the task of describing how that single document relates thematically to the rest of a corpus, more generic topic labels may be required. Due to the contrasting nature of these objectives, we put emphasis here on generating topic labels that are optimal for summarizing a single document.
A method for efficiently generating highly relevant topic labels for documents has many advantages. Traditional topic modeling methods that frame topics as generic themes, such as those based on LDA [7] or clustering, have often been found to produce topics too general or vague to provide a good summary. Additionally, identifying higher-level themes and document groupings typically requires computation on the entire corpus to assign topics, which is ill-suited to streaming contexts [8, 9].
Traditional topic label extraction methods can be split into two steps: first, the creation of a candidate list of potential topics, and second, the selection of representative topics from the candidates by ranking or filtering [10, 2]. The candidate lists can be generated through noun phrase extraction or other part-of-speech patterns [11, 5]. However, many of the ranking techniques are supervised and built on top of features based on TF-IDF or TopicRank [12]. These methods only leverage basic word statistics. As documents often have frequently occurring terms that don't relate to the task, effective ranking requires a representation of a term's semantic meaning and significance to the document.
We introduce a novel topic generation method that utilizes a universal language model trained for document summarization. This model selects candidate spans as it proceeds along a text generation path. Rejected candidate spans that lie near the generation path still have high significance to the document and thus provide valuable information for summarization. Though these second-rank candidate spans did not make the cut for the final summary output, they turn out to be valuable as topic prototypes. Thus a model trained only for the task of document summarization can be used as is for topic generation, with no need for topic-specific training.
Following the traditional extraction pattern, our method begins by extracting all noun phrases from a document as candidate spans, and then filters them by overlap with topic prototypes obtained by our summarization model. The resulting set of spans is a mixture of keyphrases suitable for summarizing the document as well as more generic concepts suitable for thematically grouping the document with others in a corpus.
To assess the quality of our machine-generated topic labels, we assembled a corpus of documents from the news websites of The Guardian and The Huffington Post, both of which include topic labels. We conducted a double-blind trial in which human annotators were presented with the articles and topic labels (either machine-generated or the original human-written labels) and asked to score their quality.
2.1 Our strategy
In order to produce topics for a given text document, we obtain two lists of spans from the text. The first list is a wide list of all possible well-formed candidate phrases, disregarding their importance for the text. This list is purpose-oriented, in the sense that it depends on what kinds of topics we want to have in the final result. The second list is importance-oriented: it is a list of spans that may not be perfect phrases but have importance for the text. The details of the different steps of our strategy are shown in Figure 1.
We chose to generate the purpose-oriented list as a list of noun phrases using part of speech tagging provided by Natural Language Toolkit (nltk library). This could easily be switched out for another list of candidates, whether a list of noun phrases generated by a different natural language processing library, or a list of concepts of interest created for a specific purpose. For instance, we looked into using the open source library spaCy to identify noun chunks and obtained similar results.
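As an illustration, such a tagger-based noun phrase extractor can be sketched in plain Python. The chunk pattern (an optional determiner, any adjectives, then one or more nouns) and the function name are our assumptions for this sketch, not the exact grammar of our implementation:

```python
def noun_phrases(tagged_tokens):
    """Greedy chunker over POS-tagged tokens: an optional determiner (DT),
    any number of adjectives (JJ), then one or more nouns (NN, NNS, NNP,
    NNPS, as produced by a tagger such as nltk.pos_tag)."""
    phrases, i, n = [], 0, len(tagged_tokens)
    while i < n:
        j = i
        if j < n and tagged_tokens[j][1] == "DT":
            j += 1
        while j < n and tagged_tokens[j][1] == "JJ":
            j += 1
        k = j
        while k < n and tagged_tokens[k][1].startswith("NN"):
            k += 1
        if k > j:  # at least one noun: emit the whole chunk
            phrases.append(" ".join(w for w, _ in tagged_tokens[i:k]))
            i = k
        else:
            i += 1
    return phrases

# Example with pre-tagged tokens (tags as a POS tagger would assign them).
tagged = [("The", "DT"), ("ceasefire", "NN"), ("deal", "NN"),
          ("was", "VBD"), ("announced", "VBN"), ("in", "IN"),
          ("Qatar", "NNP")]
print(noun_phrases(tagged))  # ['The ceasefire deal', 'Qatar']
```

Swapping this for nltk's RegexpParser or spaCy's noun_chunks only changes the purpose-oriented list, not the rest of the pipeline.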
Our novel approach in generating the importance-oriented list is in using a Universal Language Model trained for summarization/title generation. The model generates a summary or a title by sequentially selecting the best-suited words or spans. For use in topic generation, the only modification is that we output all highly ranked candidate spans, including those that did not make it into the final summary. Only the candidate spans that pass a quality threshold are retained as potential topics. We give more details on calculating this score below.
The two lists are then combined by prioritizing the overlap between the spans, resulting in an intermediate list of topics. The overlap needs to meet certain criteria; in particular, the noun in the noun phrase has to be present in the span overlap.
The last step is to apply a simple post-processing to enhance the final quality, such as reducing duplicated information.
2.2 Using a ULM to generate scores
In order to obtain a list of spans scored and ranked by importance (see Figure 1, Step 1-b), we utilize the by-product of our document summarization/title generation model. Trained using a question-answer paradigm to generate news article titles, this model generates titles or bullet-point summaries by using the text of a document in place of an external dictionary [13].
The span selection is possible thanks to the model’s prediction probabilities, represented by the logits of the start tokens and the end tokens of text spans. At each iteration, the start and end tokens with the highest logits are selected and used to generate the title span by span. When used for title generation or summarization, other spans are discarded. When we use this same model for keyphrase selection we investigate all candidate spans and assign them a score based on the sum of the logits of the first and last token.
The candidate spans s_i are sorted by score S_i (the sum of the logits of the span's first and last tokens) and assigned a rank r_i (the lower the value, the better the candidate) as well as a relative distance d_i, with:

d_i = (S_0 - S_i) / |S_0|,

where S_0 is the score of the best candidate, i.e. the span actually selected at that generation step.
The list of candidate spans is obtained by selecting the top ranked candidates and applying a maximal distance threshold to ensure quality. More information about the distribution of these values is given in section 3. From that investigation, we show that the best results are obtained by picking all candidate spans that have rank lower than 15 and that are within a distance of 0.05 from the generation path.
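A minimal sketch of this scoring and filtering step, assuming span scores are the summed start/end logits; the relative-distance formula and all names are ours:

```python
def filter_candidates(spans, max_rank=15, max_distance=0.05):
    """Score, rank, and filter candidate spans at one generation step.

    `spans` is a list of (text, start_logit, end_logit) tuples; the score
    of a span is the sum of the logits of its first and last tokens. The
    distance of each span is taken relative to the top-scoring span."""
    scored = sorted(spans, key=lambda s: s[1] + s[2], reverse=True)
    best = scored[0][1] + scored[0][2]
    denom = abs(best) if best else 1.0  # guard against a zero best score
    kept = []
    for rank, (text, logit_start, logit_end) in enumerate(scored):
        score = logit_start + logit_end
        distance = (best - score) / denom  # relative distance to the best span
        if rank < max_rank and distance <= max_distance:
            kept.append((text, rank, distance))
    return kept

# Two close candidates pass; a distant one is filtered out.
spans = [("A", 3.0, 3.0), ("B", 2.9, 3.0), ("C", 1.0, 1.0)]
print(filter_candidates(spans))
```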
2.3 Overlap between candidate spans and noun phrases
We find all conventionally defined noun phrases in the text, and sort them by how much they overlap with our candidate spans. We define the overlap of a noun phrase with the candidate spans as the sum of the lengths of all coinciding words divided by the length of the noun phrase. The pseudo-code is detailed in Figure 2. We select the noun phrases that have the highest overlap with the spans from the importance-oriented list.
Given a document text.
Find all noun phrases NPs in the text.
Produce a filtered list of candidate spans C by generation of a title:
    each candidate span s in C has an associated rank r_s and a distance d_s;
    note that if d_s is over a maximum quality threshold, s is filtered out.
Initialize Topics to an empty list.
for np in NPs:
    initialize Overlap to an empty list
    for word in np:
        for (s, r_s, d_s) in C:
            for w in s:
                if w == word:
                    add (w, r_s, d_s) to Overlap
    if the noun of np is in Overlap and the overlap of np with C is high enough:
        add (np, Overlap) to Topics
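In code, the overlap measure described above can be sketched as follows. Word-level matching and character-length weighting are our reading of the definition; the function names are ours:

```python
def overlap_fraction(noun_phrase, candidate_spans):
    """Sum of the lengths of the noun-phrase words that also occur in some
    candidate span, divided by the total length of the noun phrase's words."""
    span_words = {w.lower() for span in candidate_spans for w in span.split()}
    words = noun_phrase.split()
    covered = sum(len(w) for w in words if w.lower() in span_words)
    return covered / sum(len(w) for w in words)

def qualifies(head_noun, candidate_spans):
    """Enforce the criterion that the noun of the noun phrase itself
    must be present in the span overlap."""
    span_words = {w.lower() for span in candidate_spans for w in span.split()}
    return head_noun.lower() in span_words

print(overlap_fraction("ceasefire deal", ["a ceasefire", "peace deal"]))  # 1.0
print(qualifies("deal", ["a ceasefire"]))  # False
```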
2.4 Post-processing topics
The most crucial steps in the selection of topics were performed in the previous subsection. However, the list obtained by checking the overlap between noun phrases and spans with significant summarization value can still have redundancy that is detrimental to the overall usefulness of the topics. We thus implement a simple post-processing step to further de-duplicate the topic labels and select the best.
We remove redundant information with a very basic de-duplication step, where we remove phrases with 50% or more words contained in longer phrases. Other more sophisticated de-duplication methods could naturally be implemented.
For instance, with our simple approach the topics ‘Trump’, ‘Kurdish forces’, ‘Donald Trump’, ‘Trump’, ‘fighters’ and ‘US-backed Kurdish fighters’ would be de-duplicated to ‘Donald Trump’ and ‘US-backed Kurdish fighters’.
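A minimal sketch of this 50% containment rule; the tokenization and the longest-first ordering are our assumptions:

```python
def deduplicate(phrases):
    """Drop any phrase with 50% or more of its words contained in a longer
    kept phrase; exact repeats collapse to one copy."""
    # Longest first, so longer phrases absorb their shorter variants.
    ordered = sorted(set(phrases), key=len, reverse=True)
    kept = []
    for phrase in ordered:
        words = phrase.lower().split()
        absorbed = any(
            sum(w in k.lower().split() for w in words) / len(words) >= 0.5
            for k in kept)
        if not absorbed:
            kept.append(phrase)
    return kept

topics = ['Trump', 'Kurdish forces', 'Donald Trump', 'Trump',
          'fighters', 'US-backed Kurdish fighters']
print(deduplicate(topics))  # ['US-backed Kurdish fighters', 'Donald Trump']
```

Note that 'Kurdish forces' is absorbed because exactly half of its words ('Kurdish') appear in the longer kept phrase.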
We also select higher-value phrases by evaluating them against the following 5 dimensions:
Distance criterion: we compute the mean distance across all the overlapping candidate spans corresponding to a given phrase. A smaller mean distance increases the value of the phrase; mean distances over 0.4 do not change the value.
Rank criterion: we compute the mean rank across all the overlapping candidate spans corresponding to a given phrase. A better mean rank (i.e. a smaller numerical value) increases the overall value of the phrase; mean ranks above 4 do not change the value.
Number of candidate spans: a higher number of candidate spans increases the value of the phrase, and 4 or more spans gets the maximum value for this criterion.
Number of words: a higher number of words in the phrase increases the value of that phrase. A phrase with 3 or more words gets the maximum value for this criterion.
Number of capitalized words: a higher number of words in the phrase starting with a capital letter increases the value of that phrase. If there are 3 or more capitalized words, the phrase gets the maximum value for this criterion.
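These five criteria can be combined as in the sketch below. The caps (0.4 for distance, 4 for rank, 4 spans, 3 words, 3 capitalized words) come from the text; the equal weighting and linear scaling within each dimension are our assumptions:

```python
def phrase_value(mean_distance, mean_rank, n_spans, n_words, n_capitalized):
    """Combine the five criteria into a single value in [0, 5]."""
    value = 0.0
    # Smaller mean distance is better; distances >= 0.4 contribute nothing.
    value += max(0.0, 1.0 - min(mean_distance, 0.4) / 0.4)
    # Smaller mean rank is better; ranks >= 4 contribute nothing.
    value += max(0.0, 1.0 - min(mean_rank, 4.0) / 4.0)
    value += min(n_spans, 4) / 4.0        # 4+ spans: maximum contribution
    value += min(n_words, 3) / 3.0        # 3+ words: maximum contribution
    value += min(n_capitalized, 3) / 3.0  # 3+ capitalized: maximum contribution
    return value

# A phrase strong on every dimension scores the maximum of 5.0.
print(phrase_value(0.0, 0.0, 4, 3, 3))  # 5.0
```

Phrases are then re-ordered by this value, which produces re-orderings like the example below.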
The combination of these two steps allows us to proceed, for instance, from the highly repetitive and therefore less useful list of topics produced by the model:
‘Qatar’, ‘Palestinians’, ‘West Bank’, ‘ceasefire deal’, ‘Qatar’, ‘Qatar’, ‘ceasefire deal’, ‘Qatar’, ‘Palestinians’, ‘Israeli-occupied West Bank’, ‘Qatar’, ‘Qatar’, ‘Qatar’, ‘Palestinians’, ‘Palestinians’, ‘West Bank’, ‘Qatar’
to a de-duplicated list:
‘ceasefire deal’, ‘Qatar’, ‘Palestinians’, ‘Israeli-occupied West Bank’
to the final, re-ordered list:
‘Israeli-occupied West Bank’, ‘ceasefire deal’, ‘Qatar’.
This post-processing step allows us to have better quality topics, and could easily be tailored to choose the best topics for a given use-case.
3 Inspecting candidate spans
3.1 Generating all candidate spans
In order to generate the candidate spans, we use a model trained on a task of title generation. This model has therefore never been exposed to topics and is not in any way modified for generating topics.
At each step, the model finds the best span to include in the partial title it is tasked with generating. We track each step in the generation process through the span index. The model starts the generation at span index 0, getting the best span for this first iteration as well as all the other non-selected spans as a by-product. These are the candidate spans for span index 0. The model then searches for the next span to include in the title, at span index 1, and repeats the process. Typically a title consists of 4-5 spans, but in rare cases the number of spans can reach 20 or more. As detailed in section 2.2, the model can be used to give a score to each candidate span (i.e. the sum of the logits of its first and its last tokens). From the scores, we derive at each span index both the rank and the relative distance between a candidate span and the span actually selected for the title.
3.2 Properties of generated candidate spans
The model was used to generate titles for 2000 randomly chosen English-language news documents published in May 2019. The information on candidate spans was collected as a by-product.
In order to visualize the neighborhood of the generation path, we collected candidate spans from the 2000 generated titles. Figure 3 shows that information, truncated to 15 iterations of the title generation process (span index in sentence) and to the 50 best candidate spans (candidate rank).
Looking at the evolution of the distance along the "span index in sentence" axis, we observed that the shape of the distance curve stabilizes after approximately the 8th span index.
Looking at the evolution of the distance along the "candidate rank" axis, we see an initial sharp slope which then flattens out around ranks 10 to 20. This indicates a significant difference in quality between initial and late-phase candidate spans. We thus decided not to use the candidates with ranks higher than about 10 or 20 for the topic generation task. Figure 4, a 2D projection along the axes considered, helps refine the choice of a reasonable threshold.
The overall profile of the surface shown in Figure 3 clearly shows that there are a higher number of good quality candidate spans close to the best one (i.e. less than 0.04 in relative distance) when the model is first starting out on the generation path for the title (i.e. span index of 0) than when the model is trying to complete the title (i.e. span indices over 6). At that point, Figure 4 indicates that only perhaps the first five candidates are substantially better than the rest. This has an intuitive interpretation: It is easy to start a sentence—there are many valid ways to begin—but the options become increasingly constrained to maintain logical cohesion and fluency to the end of the sentence.
3.3 Using the candidate spans as topics
Our inspection of Figure 3 gives us a strong foundation for selecting an interesting subset of the scored candidate spans. We propose to use a threshold limit of 0.05 on the distance from the best candidate span, and to restrict ourselves to the candidate spans with a rank below 15. The model-produced importance list is therefore the set of candidate spans with rank below 15 and distance below 0.05.
As a side note, as we looked into candidate spans, we found that some of the filtered candidate spans already look like valid topics. Intuitively, this makes sense because the selected candidate spans almost made it into the title, and therefore reflect important concepts of the text. However, the title/summary generation aims for fluency, and some of the raw candidate spans include verbs, articles, and punctuation that are not as useful for generating good topics. This is why producing the importance-oriented list is only one step in our topic generation process.
4 Human evaluation
4.1 General setup
In order to assess the quality of the generated topics we turn to human evaluation. We trained a group of 10 annotators to evaluate a list of topic labels associated with a document. We then contrasted the model's performance with an external data set of topics created by journalists for news articles. We found the online news articles of The Huffington Post and The Guardian to be richly annotated with topic labels. We randomly selected 50 articles from The Guardian and 50 articles from The Huffington Post, and collected the text and corresponding topics (referred to hereafter as "real topics"). For the same texts we generated topics with our method (referred to as "generated topics").
We set up the evaluation task by asking each annotator, hired through Odetta.ai [14], to assess the quality of the topics on a 5-point scale:
0 = VERY BAD, 1 = BAD, 2 = OK, 3 = GOOD or 4 = VERY GOOD.
The annotators work independently from each other and have access to only one article at a time. The text is displayed alongside the corresponding group of topic labels (the topics of the group are either all real or all generated) through the text annotation tool LightTag [15]. Note that both the group containing the generated topics and the one with the real topics have about the same number of topics (most often between 3 and 5). The annotators are not given any information about the origin of the topics they see. The order of real and generated examples is random.
4.2 Evaluation with minimal instructions
Before the labeling, the annotators were provided with minimal instructions, in order to avoid imposing a bias. The instructions are shown in Figure 5.
The resulting distribution of scores is shown in Figure 6. For each article, the scores were averaged over all annotators.
The overall averages and medians of the scores are given in Table 1.
Documents | Real or generated | Average | Median
4.3 Evaluation with more guidance
In order to reveal the influence of the instructions and to prompt annotators to apply more rigorous criteria, we modified our instructions as shown in Figure 7, and also provided specific examples with suggested ranges of scores and descriptions of what the authors of this paper perceived as deficiencies in a set of topics.
This evaluation used a new set of randomly selected 50 articles from The Guardian and 50 articles from The Huffington Post.
Having the list of possible deficiencies in the instructions, the annotators were now less generous with their scores. However, the overall preference for the generated topics persisted. The results from this second evaluation are provided in more detail in the remainder of this section.
The distribution of the scores that annotators assigned to the real versus generated topics is shown in Figure 8.
Table 2 shows the main aggregated results of the evaluation.
Documents | Real or generated | Average | Median
In order to produce confidence estimates for the distribution of the gathered scores, we performed a bootstrap with 3 million samples (increasing the number of samples does not change our results).
In the bootstrap, each sample was obtained by two mutually independent random selections with replacement: selection of the 10 annotators and selection of the 100 articles (each article has two scores: one for real topics, another for generated topics).
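A sketch of this resampling scheme; all names are ours, and a small sample count stands in for the 3 million samples used in the paper:

```python
import random

def bootstrap_mean_diff(scores, n_annotators, n_articles, n_samples=10000, seed=0):
    """Bootstrap the mean difference between generated and real topic scores.

    `scores[a][i]` holds the (real, generated) score pair given by annotator
    `a` to article `i`. Annotators and articles are resampled independently,
    each with replacement."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_samples):
        annotators = [rng.randrange(n_annotators) for _ in range(n_annotators)]
        articles = [rng.randrange(n_articles) for _ in range(n_articles)]
        pairs = [scores[a][i] for a in annotators for i in articles]
        diffs.append(sum(g - r for r, g in pairs) / len(pairs))
    diffs.sort()
    # Percentile 95% confidence interval on the mean difference.
    lo = diffs[int(0.025 * n_samples)]
    hi = diffs[int(0.975 * n_samples)]
    return sum(diffs) / n_samples, (lo, hi)
```

With real data the interval excluding zero would indicate a consistent preference for one set of topics.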
The results of comparing the scores given by the same annotator to the real versus generated topics of the same document are shown in Figure 9 (left), with 95% confidence intervals. The distribution of the scores with 95% confidence intervals is shown in Figure 9 (right).
Table 3 shows several examples of scored topics. The examples are chosen to represent the spectrum of the difference between the scores given to the generated versus real topics.
5 Conclusion

In this paper we present a new approach for generating topics that requires neither distinct training data nor access to the entire corpus at inference time. We are able to generate topic labels for a single document by utilizing a model trained for document summarization. The quality of the generated topics was deemed by double-blind trial to be on par with topic labels written by humans.
While utilizing the span candidates for generating topics and other uses, we were fascinated by the rich neighborhood of the generation path. The topics we generate are essentially concepts that 'wanted to be' in a title for the document but did not quite make the cut. For the purposes of the evaluation presented in this paper, title generation was restricted to reading the first several paragraphs of the text, as much as allowed by the maximal input length of the standard BERT model. This is normally enough for capturing all useful topics because, if the text is not too long, the most important topics are mentioned near the top of the text. As our evaluations with annotators show, this is indeed enough for typical news articles published by The Guardian and The Huffington Post.
For longer articles we use our summarization model in multiple runs, generating title-like sentences for each successive chunk of text. In doing so, the generation picks up the most important concepts throughout the text. We can also change the criteria used for ranking.
Finally, we have not discussed here the use of our topics for clustering documents, but a large fraction of our topic labels are generic enough for this purpose.
We are thankful to Delenn Chin, Vedant Dharnidharka and Wei Gong for reviewing the paper.
-  Eirini Papagiannopoulou, Grigorios Tsoumakas. A Review of Keyphrase Extraction. arXiv preprint arXiv:1905.05044v2, 2019.
-  Erion Çano, Ondřej Bojar. Keyphrase Generation: A Multi-Aspect Survey. arXiv preprint arXiv:1910.05059, 2019.
-  Erion Çano, Ondřej Bojar. Keyphrase Generation: A Text Summarization Struggle. arXiv preprint arXiv:1904.00110v2, 2019.
-  Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. Advances in Neural Information Processing Systems 22, pages 288–296, 2009.
-  Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, Yu Chi. Deep Keyphrase Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 582–592, 2017.
-  David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
-  David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
-  Akash Srivastava, Charles Sutton. Autoencoding Variational Inference For Topic Models. arXiv preprint arXiv:1703.01488, 2017.
-  Jason Ren, Russell Kunes, Finale Doshi-Velez. Prediction Focused Topic Models via Vocab Selection. arXiv preprint arXiv:1910.05495, 2019.
-  Kazi Saidul Hasan and Vincent Ng. Automatic Keyphrase Extraction: A Survey of the State of the Art. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014.
-  Zhiyuan Liu, Wenyi Huang, Yabin Zheng, and Maosong Sun. Automatic keyphrase extraction via topic decomposition. In Proceedings of the 2010 conference on empirical methods in natural language processing. Association for Computational Linguistics, pages 366–376, 2010.
-  Adrien Bougouin, Florian Boudin, Béatrice Daille. TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP), 2013.
-  Oleg Vasilyev, Tom Grek, John Bohannon. Headline Generation: Learning from Decomposable Document Titles. arXiv preprint arXiv:1904.08455v3, 2019.
-  https://odetta.ai/
-  https://www.lighttag.io/