Recent years have witnessed a boom in the number of scientific works published in various online sources, such as arXiv.org, Google Scholar, Microsoft Academic Search, and IBM Science Summarizer [DBLP:conf/emnlp/EreraSFNBRCWMRL19]. For example, within less than a decade, the number of yearly submissions to the arXiv repository has nearly doubled (https://arxiv.org/help/stats/2019_by_area/index).
To help overcome this information overload, several online sources, such as paper a day (https://medium.com/@sharaf), The morning Paper (https://blog.acolyer.org/), TopBots (https://www.topbots.com/most-important-ai-research-papers-2018/), OpenReview.net, and ShortScience [cohen2017shortscience], now provide access to human-authored summaries of selected works, written by both experts and practitioners in their respective communities. Such summaries tend to be long and detailed, and to contain headlines and figures from the original papers.
1.1 Towards automatic summarization
Scientific papers have a complex structure and intricate content, making their summarization a hard task even for humans. To study the automatic scientific summarization task, several datasets, such as Scisumm [jaidka2016overview] and ScisummNet [yasunaga2019scisummnet], have been proposed. Yet, compared to real human summaries such as the ones now available in the various online sources, existing datasets focus only on the automatic generation of relatively short summaries ( words) which have an abstract-like structure and lack other summarization constructs used by humans, such as headlines and figures. Moreover, many existing summarization methods for scientific papers rely on citations in order to pinpoint the important parts within a paper [qazvinian2008scientific]. However, for most newly published papers, such as the ones usually summarized by humans, the citation volume is not large enough to perform a similar analysis.
To fill this gap, we study a dataset for scientific summarization based on long human summaries authored by ShortScience.org (https://shortscience.org) users. Our goal is to study the characteristics of human scientific summaries and to propose the use of such summaries, which are sometimes published as blogs, as a potential benchmark for automation of this difficult task.
ShortScience.org is an open platform for publishing summaries of scientific papers in the domains of Computer Science (CS), Physics and Biology. In this work, we focus on CS publications. The website provides minimal instructions on how to write a summary, and therefore there is a large variation in summary length and structure. We fetched summaries from the website associated with papers. To analyze papers and their summaries, we utilized NLTK for word tokenization and sentence segmentation. We disregard sentences having fewer than characters, to minimize the effect of parsing errors. The mean summary length is words and the median is words. The average number of sentences per summary is .
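The length statistics described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the regex-based `split_sentences` and `tokenize_words` helpers are naive stand-ins for NLTK's sentence segmentation and word tokenization (so that the sketch runs with the standard library alone), and `MIN_SENTENCE_CHARS` is a hypothetical threshold, since the actual value is not given here.

```python
import re
import statistics

# Hypothetical threshold: the actual minimum sentence length is not stated here.
MIN_SENTENCE_CHARS = 10


def split_sentences(text):
    """Naive stand-in for NLTK sentence segmentation: split after ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize_words(sentence):
    """Naive stand-in for NLTK word tokenization: runs of word characters."""
    return re.findall(r"\w+", sentence)


def summary_stats(summaries, min_chars=MIN_SENTENCE_CHARS):
    """Compute word and sentence statistics, ignoring very short sentences."""
    word_counts, sentence_counts = [], []
    for text in summaries:
        # Drop sentences shorter than the threshold (likely parsing errors).
        sentences = [s for s in split_sentences(text) if len(s) >= min_chars]
        sentence_counts.append(len(sentences))
        word_counts.append(sum(len(tokenize_words(s)) for s in sentences))
    return {
        "mean_words": statistics.mean(word_counts),
        "median_words": statistics.median(word_counts),
        "mean_sentences": statistics.mean(sentence_counts),
    }


toy_summaries = [
    "The paper proposes a new model. It works well. Ok.",
    "A short note on attention mechanisms.",
]
stats = summary_stats(toy_summaries)
print(stats)
```

In the toy input, the fragment "Ok." falls below the threshold and is excluded from both the sentence count and the word count.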
Each summary page links to a publisher website that hosts the source article, which we used to download its PDF version. We legally fetched articles from the following sources: arXiv, NeurIPS, ACL, and Springer.
We used Science-Parse (https://github.com/allenai/science-parse) to extract the PDF text of each article. Science-Parse outputs a JSON record for each PDF, which, among other fields, contains the title, abstract text, metadata (such as authors and year), and a flat list of the article sections, where each entry holds the section’s title and text.
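As an illustration of how such a record might be consumed, the sketch below reads one JSON record into a small dictionary. The field names used (`metadata`, `title`, `abstractText`, `year`, `sections`, `heading`, `text`) are assumptions about the Science-Parse output layout and should be checked against real output; the record itself is a toy example, and `load_paper` is a hypothetical helper.

```python
import json

# Toy record. The field names are assumptions about the Science-Parse
# output layout, not a verified schema.
raw_record = """
{
  "name": "paper.pdf",
  "metadata": {
    "title": "A Toy Paper",
    "authors": ["A. Author"],
    "year": 2019,
    "abstractText": "We study toys.",
    "sections": [
      {"heading": "1 Introduction", "text": "Toys are fun."},
      {"heading": "2 Method", "text": "We build toys."}
    ]
  }
}
"""


def load_paper(record_json):
    """Extract the title, abstract, year and flat section list from a record."""
    meta = json.loads(record_json).get("metadata") or {}
    # Each section becomes a (title, text) pair; missing values become "".
    sections = [
        (sec.get("heading") or "", sec.get("text") or "")
        for sec in meta.get("sections") or []
    ]
    return {
        "title": meta.get("title"),
        "abstract": meta.get("abstractText"),
        "year": meta.get("year"),
        "sections": sections,
    }


paper = load_paper(raw_record)
for heading, text in paper["sections"]:
    print(heading, "->", text)
```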
3 Human Summaries Analysis
3.1 Summary subjectivity
To assess to what extent the summaries represent a subjective account of the original scientific work, we explored the expression of opinions by human summarizers. For each summary, we extracted all sentences that contained the terms “i” or “my”. Overall, out of summaries we found summaries that include such sentences. To validate the assumption that these terms are associated with opinion expression, and to learn the polarity of the opinion (“positive”, “negative”, or “neutral”), we conducted a simple evaluation task. We asked five students to read the extracted opinion text from each summary and to indicate whether it indeed expresses an opinion and what its polarity is. To aggregate the results across judges, we used a majority vote rule. The analysis shows that, out of the summaries that were tagged as containing opinions, this tagging was erroneous in only cases. With respect to polarity, surprisingly, most of the summaries were marked as neutral , were marked as positive and the rest as negative. This suggests that when humans decide to publicly express their opinion on a scientific work, they tend to present a positive or balanced view rather than to criticize. It may also indicate that people choose to summarize papers they deem valuable.
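The two steps above, extracting candidate opinion sentences and aggregating the judges' labels, can be sketched as follows. This is an illustrative sketch under stated assumptions: the regular expression is one possible reading of "contains the terms i or my", and returning `None` when no label reaches a strict majority is our assumption, since the tie-handling rule is not specified in the text.

```python
import re
from collections import Counter

# Matches "i" or "my" as whole words, case-insensitively. Note that this
# also matches the "i" in "i.e.", one reason a manual validation is useful.
FIRST_PERSON = re.compile(r"\b(i|my)\b", re.IGNORECASE)


def opinion_candidates(sentences):
    """Keep sentences that contain a first-person term."""
    return [s for s in sentences if FIRST_PERSON.search(s)]


def majority_vote(labels):
    """Return the label chosen by more than half of the judges, else None.

    Returning None without a strict majority is an assumption of this sketch.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


sentences = [
    "I think this is a neat idea.",
    "The model is trained on public data.",
    "In my view the evaluation is thin.",
]
candidates = opinion_candidates(sentences)

# Five hypothetical judge labels for one summary.
votes = ["positive", "positive", "neutral", "positive", "negative"]
print(candidates, majority_vote(votes))
```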
3.2 Summary coverage
Scientific papers are usually structured into several logical categories which address various aspects of the reported research work. To assess to what extent human summaries cover these logical aspects of the papers being summarized, we aligned each summary sentence to its most probable category in the original paper. To this end, the section hierarchy of each paper was restored, with sub-sections merged into their containing high-level section. The following high-level sections were identified based on section categories: Introduction, Related work, Method, Results, Experiments, Discussion, Conclusions, Future work and Unknown. Section sentences inherit the title of their containing section. Each human summary sentence was then aligned to the paper sentence most similar to it, and was assigned the category of that sentence. We experimented with three similarity methods: F1 ROUGE-L; the average of F1 ROUGE-1, ROUGE-2, and ROUGE-L; and cosine similarity over word vectors (GoogleNews-Vectors). Overall, out of article sections were assigned a category (while the rest were classified as Unknown). Table 1 reports the distribution of summary sentences over the logical categories.
As depicted in the table, the weights are quite stable across the different similarity methods, providing further confidence in the correctness of this distribution. Potentially, a summarization algorithm can aim at assigning higher focus to more salient logical sections, reflecting how humans attend to different sections in their summaries.
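The alignment step can be illustrated with the first of the three similarity methods. The sketch below is a simplified, sentence-level ROUGE-L F1 (longest common subsequence over lowercased word tokens, with no stemming), not the ROUGE implementation used in this work; the toy paper sentences, their categories, and the helper names are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Harmonic mean of LCS-based precision and recall."""
    c, r = tokens(candidate), tokens(reference)
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def align(summary_sentences, paper_sentences):
    """Assign each summary sentence the category of its most similar paper sentence.

    paper_sentences is a list of (category, sentence) pairs.
    """
    aligned = []
    for s in summary_sentences:
        category, _ = max(paper_sentences, key=lambda p: rouge_l_f1(s, p[1]))
        aligned.append((s, category))
    return aligned


paper = [
    ("Introduction", "Summarizing scientific papers is hard."),
    ("Method", "We align each summary sentence to a paper sentence."),
    ("Results", "The model improves accuracy by a large margin."),
]
summary = [
    "Each summary sentence is aligned to a sentence in the paper.",
    "Accuracy improves by a large margin.",
]
for sentence, category in align(summary, paper):
    print(category, "<-", sentence)
```

Counting the categories assigned across all summary sentences then gives a distribution of the kind reported in Table 1.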
3.3 Summary style
3.3.1 Figures inclusion:
Figures are used in scientific papers to illustrate an architecture or a flow, to exemplify a statement, or to report results graphically. Some human summaries in the dataset include figures from the original paper, including image captures of equations or tables. Overall, about of the summaries include at least one such figure, with an average of figures per summary. This suggests that, in some cases, human summarizers consider visual aids alongside the accompanying summary text to be the best way to explain an idea and make it more understandable. This demonstrates the need to consider multi-modal summarization, which, to the best of our knowledge, has not been researched yet in this domain.
3.3.2 Summary Itemization:
Almost half of the summaries in the dataset utilized some form of structuring using itemization (i.e., bullets or numbering), with an average of items per summary. We further measured the amount of text associated with each item by counting the number of sentences between every two consecutive items. The average size of an item is sentences. This is consistent with the typical usage of such a structure: each item most probably conveys a single concise fact. This is also supported by a strong positive correlation between the number of items in a summary and the summary length measured in sentences.
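A rough sketch of these itemization measurements is given below. It assumes that items appear as lines starting with a bullet character or a number followed by "." or ")", and that any non-empty line following an item continues it; both are simplifying assumptions, as are the helper names and the naive sentence splitter. The Pearson correlation is computed directly from its definition.

```python
import re
from math import sqrt

# Assumed item markers: "-", "*", "+", "•", or a number followed by "." or ")".
ITEM_MARKER = re.compile(r"^\s*(?:[-*+•]|\d+[.)])\s+")


def split_items(summary_text):
    """Return the text of each item; following non-empty lines extend the last item."""
    items = []
    for line in summary_text.splitlines():
        if ITEM_MARKER.match(line):
            items.append(ITEM_MARKER.sub("", line).strip())
        elif items and line.strip():
            items[-1] += " " + line.strip()
    return items


def count_sentences(text):
    """Naive sentence count: split after ., ! or ? followed by whitespace."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


toy_summary = (
    "Key ideas:\n"
    "- The model uses attention. It is fast.\n"
    "- Training needs less data.\n"
    "1. Results are strong."
)
items = split_items(toy_summary)
sentences_per_item = [count_sentences(item) for item in items]
print(len(items), sum(sentences_per_item) / len(items))
```

Given per-summary item counts and sentence counts, `pearson(item_counts, sentence_counts)` would then estimate the correlation mentioned above.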
About of the summaries contain lines that start with “#”, which act as summary “headlines”. Out of summary authors used headlines in their summaries. Such headlines are commonly used by human summarizers for dividing their summary into small logical parts. In order to determine which subjects prevail in creating summary subdivisions, we analyzed the text of the summary headlines. Table 2 contains the most frequent unigrams and bigrams in summary headlines (after tokenization, stemming and stopword removal). From the table, it seems that many of the subdivisions either refer to a specific logical aspect of the paper (as hinted by words like “result”, “model”, “dataset”) or convey the summarizer’s take on the paper (“my two cents”, “key points”).
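The headline analysis can be sketched as follows. This is a simplified illustration: it uses a tiny hand-written stopword list and omits stemming, whereas the analysis behind Table 2 applies both stemming and stopword removal; the toy summary text and the helper names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "for"}


def headlines(summary_text):
    """Return the text of every line that starts with '#', without the marker."""
    return [
        re.sub(r"^\s*#+\s*", "", line).strip()
        for line in summary_text.splitlines()
        if line.lstrip().startswith("#")
    ]


def ngram_counts(headline_texts):
    """Count unigrams and bigrams over headlines after stopword removal.

    Stemming is omitted in this sketch.
    """
    unigrams, bigrams = Counter(), Counter()
    for text in headline_texts:
        toks = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams


toy_summary = (
    "# Key points\n"
    "Some text.\n"
    "## Results\n"
    "More text.\n"
    "# My two cents\n"
    "# Key points of the model"
)
heads = headlines(toy_summary)
unigrams, bigrams = ngram_counts(heads)
print(unigrams.most_common(3), bigrams.most_common(3))
```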