We present PubMed 200k RCT, a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.
Short-text classification is an important task in many areas of natural language processing, such as sentiment analysis, question answering, or dialog management. For example, in a dialog management system, one might want to classify each utterance into dialog acts Stolcke et al. (2000).
In the dataset we present in this paper, PubMed 200k RCT, each short text we consider is one sentence. We focus on classifying sentences in medical abstracts, and particularly in randomized controlled trials (RCTs), as they are commonly considered to be the best source of medical evidence Tianjing Li (2015). Since sentences in an abstract appear in a sequence, we call this task the sequential sentence classification task, in order to distinguish it from general text or sentence classification that does not have any context.
The number of RCTs published every year is steadily increasing, as Figure 1 illustrates. Over 1 million RCTs have been published so far and around half of them are in PubMed Mavergames (2013), which makes it challenging for medical investigators to pinpoint the information they are looking for. When researchers search for previous literature, e.g., to write systematic reviews, they often skim through abstracts in order to quickly check whether the papers match the criteria of interest. This process is easier when abstracts are structured, i.e., the text in an abstract is divided into semantic headings such as objective, method, result, and conclusion. However, over half of published RCT abstracts are unstructured, as shown in Figure 2, which makes it more difficult to quickly access the information of interest.
|Achilles tendinopathy (AT) is a common and difficult to treat musculoskeletal disorder. The purpose of this study is to examine whether 1 injection of platelet-rich plasma (PRP) would improve outcomes more effectively than placebo (saline) after 3 months when used to treat AT. A total of 24 male patients with chronic AT (median disease duration, 33 months) were randomized (1:1) to receive either a blinded injection of PRP (n = 12) or saline (n = 12). Patients were informed that they could drop out after 3 months if they were dissatisfied with the treatment. After 3 months, all patients were reassessed (no dropouts). No difference between the PRP and the saline group could be observed with regard to the primary outcome (VISA-A score: mean difference [MD], -1.3; 95% CI, -17.8 to 15.2; P = .868). Secondary outcomes were pain at rest (MD, 1.6; 95% CI, -0.5 to 3.7; P = .137), pain while walking (MD, 0.8; 95% CI, -1.8 to 3.3; P = .544), pain when tendon was squeezed (MD, 0.3; 95% CI, -0.2 to 0.9; P = .208). PRP injection did not result in an improved VISA-A score over a 3-month period compared with placebo. The conclusions are limited to the 3 months after treatment owing to the large dropout rate.|
Consequently, assigning each sentence of an abstract to an appropriate heading can significantly reduce the time needed to locate the desired information, as Figure 3 illustrates. Besides assisting humans, this task may also be useful for a variety of downstream applications such as automatic text summarization, information extraction, and information retrieval. In addition to the medical applications, we hope that the release of this dataset will help the development of algorithms for sequential sentence classification.
|Dataset||Size||Manual||RCT||Available|
|Hara and Matsumoto (2007)||200||y||y||email|
|Hirohata et al. (2008)||104k||n||n||no|
|Chung (2009)||327||y||y||no|
|Boudin et al. (2010)||29k||n||n||no|
|Kim et al. (2011)||1k||y||n||email|
|Huang et al. (2011)||23k||n||n||no|
|Robinson (2012)||1k||n||y||no|
|Zhao et al. (2012)||20k||y||n||no|
|Davis-Desmond and Mollá (2012)||194||n||y||public|
|Huang et al. (2013)||20k||n||y||no|
|PubMed 200k RCT||196k||n||y||public|
Existing datasets for classifying sentences in medical abstracts are either small, not publicly available, or do not focus on RCTs. Table 1 presents an overview of existing datasets.
The most studied dataset to our knowledge is the NICTA-PIBOSO corpus published by Kim et al. (2011). This dataset was the basis of the ALTA 2012 Shared Task Amini et al. (2012), in which 8 competing research teams participated.
Only the dataset published in Davis-Desmond and Mollá (2012) is publicly available: two datasets can only be obtained via email inquiries, and the other datasets are not accessible (unanswered email requests or negative replies). The only public dataset is also the smallest one.
Our dataset is built from the MEDLINE/PubMed Baseline Database published in 2016, which we will refer to as PubMed in this paper. PubMed can be accessed online by anyone, free of charge and without registration. It contains 24,358,442 records. A record typically consists of metadata on one article, as well as the article’s title and in many cases its abstract.
We use the following information from each PubMed record of an article to build our dataset: the PubMed ID (PMID), the abstract and its structure if available, and the Medical Subject Headings (MeSH) terms. MeSH is the NLM controlled vocabulary thesaurus used for indexing articles for PubMed.
We select abstracts from PubMed based on the two following criteria:
The abstract must belong to an RCT. We rely on the article’s MeSH terms only to select RCTs. Specifically, only the articles with the MeSH term D016449, which corresponds to an RCT, are included in our dataset. 399,254 abstracts fit this criterion.
The abstract must be structured. In order to qualify as structured, it has to contain between 3 and 9 sections (inclusive), and it should not contain any section labeled as “None”, “Unassigned”, or “” (empty string). Only 0.5% of abstracts have fewer than 3 sections or more than 9 sections: we chose to discard these outliers. The label of each section was originally given by the authors of the articles, typically following the guidelines given by journals: as many labels exist, PubMed maps them into a smaller set of standardized labels: background, objective, methods, results, conclusions, “None”, “Unassigned”, or “” (empty string).
195,654 abstracts fit these two criteria, i.e., belong to RCTs and are structured.
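The two selection criteria can be sketched as a simple filter. This is a minimal illustration, not the authors' actual pipeline: the `Record` class and its field names are hypothetical simplifications of a PubMed record, and only the MeSH term D016449, the 3 to 9 section bounds, and the excluded labels come from the text above.

```python
from dataclasses import dataclass

RCT_MESH_ID = "D016449"                       # MeSH term for RCTs, as stated in the text
EXCLUDED_LABELS = {"None", "Unassigned", ""}  # section labels that disqualify an abstract


@dataclass
class Record:
    """Hypothetical, simplified view of one PubMed record."""
    pmid: str
    mesh_ids: set          # MeSH identifiers attached to the article
    section_labels: list   # one standardized label per abstract section


def is_rct(record):
    # Criterion 1: the article carries the RCT MeSH term.
    return RCT_MESH_ID in record.mesh_ids


def is_structured(record):
    # Criterion 2: between 3 and 9 sections (inclusive), none with an excluded label.
    n_sections = len(record.section_labels)
    if not 3 <= n_sections <= 9:
        return False
    return all(label not in EXCLUDED_LABELS for label in record.section_labels)


def select_abstracts(records):
    # Keep only records that satisfy both criteria.
    return [r for r in records if is_rct(r) and is_structured(r)]
```

An unstructured abstract would appear here as a record with an empty or single-element `section_labels` list and is rejected by the section-count check.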
The dataset contains 195,654 abstracts and is randomly split into three sets: a validation set containing 2,500 abstracts, a test set containing 2,500 abstracts, and a training set containing the remaining 190,654 abstracts. Since 200k abstracts may be too many for some applications, we also provide a smaller dataset, PubMed 20k RCT, which contains 15,000 abstracts for the training set, 2,500 abstracts for the validation set, and 2,500 abstracts for the test set. The 20k abstracts were chosen from the 200k abstracts by taking the most recently published ones. Table 2 presents the number of abstracts and sentences for both PubMed 20k RCT and PubMed 200k RCT, for each split of the dataset.
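As a rough sketch of such a random split, assuming the abstracts are identified by a list of PMIDs: the function name and the seed are hypothetical, and the released split is fixed by the provided files rather than reproduced by this code.

```python
import random


def split_abstracts(pmids, n_valid=2500, n_test=2500, seed=0):
    """Shuffle abstract identifiers and carve out validation and test sets."""
    shuffled = list(pmids)
    random.Random(seed).shuffle(shuffled)  # seeded so that the split is repeatable
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Integer stand-ins for the 195,654 selected abstracts.
train, valid, test = split_abstracts(range(195654))
print(len(train), len(valid), len(test))  # 190654 2500 2500
```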
|Dataset||Vocabulary size||Train||Validation||Test|
|PubMed 20k||68k||15k (180k)||2.5k (30k)||2.5k (30k)|
|PubMed 200k||331k||190k (2.2M)||2.5k (29k)||2.5k (29k)|
The dataset is provided as three text files: one for the training set, one for the validation set, and one for the test set. Each file has the same format: each line corresponds to either a PMID or a sentence with its capitalized label at the beginning. Each token is separated by a space. Listing 1 shows an excerpt from these files.
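A file in this format could be read with a short parser along the following lines. This is a sketch under stated assumptions: the text above does not specify how PMID lines are marked or how the label is separated from the sentence, so the `###` prefix and the tab separator are assumed defaults exposed as parameters, and the sample lines are invented for illustration.

```python
def parse_rct_file(lines, id_prefix="###", sep="\t"):
    """Group labeled sentences by abstract.

    Assumptions (not stated in the text): PMID lines start with `id_prefix`,
    and each sentence line is `LABEL<sep>token token ...`.
    """
    abstracts = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines between abstracts
        if line.startswith(id_prefix):
            current = {"pmid": line[len(id_prefix):].strip(), "sentences": []}
            abstracts.append(current)
        elif current is not None:
            label, _, text = line.partition(sep)
            # Tokens are separated by single spaces in the files.
            current["sentences"].append((label.lower(), text.split()))
    return abstracts


# Invented sample lines, for illustration only.
sample = [
    "###12345678\n",
    "OBJECTIVE\tTo assess the effect of treatment .\n",
    "METHODS\tPatients were randomized ( 1:1 ) .\n",
    "\n",
]
parsed = parse_rct_file(sample)
print(parsed[0]["pmid"], [label for label, _ in parsed[0]["sentences"]])
```

With a real file, the same function can be called on an open file handle, for example `parse_rct_file(open(path, encoding="utf-8"))`, where `path` is wherever the training file was saved.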
For each abstract, sentence and token boundaries are detected using the Stanford CoreNLP toolkit Manning et al. (2014). We provide two versions of the dataset: one with the original text, and one where digits are replaced by the character @ (at sign).
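The digit-replaced version can be approximated with a one-line substitution. This sketch assumes that every individual digit becomes one @ character (rather than a whole number collapsing to a single @); the function name is hypothetical.

```python
import re


def mask_digits(text, placeholder="@"):
    """Replace every ASCII digit with the placeholder character."""
    return re.sub(r"[0-9]", placeholder, text)


print(mask_digits("mean difference , -1.3 ; 95 % CI , -17.8 to 15.2"))
# mean difference , -@.@ ; @@ % CI , -@@.@ to @@.@
```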
Figure 4 counts the number of sentences per label: the least common label (objective) is approximately four times less frequent than the most common label (results), which indicates that the dataset is not excessively unbalanced. Figure 5 shows the distribution of the number of tokens per sentence. Figure 6 shows the distribution of the number of sentences per abstract. Figures 4, 5 and 6 are based on PubMed 200k RCT.
We report the performance of several systems to characterize our dataset. The first baseline is a classifier based on logistic regression (LR) using n-gram features extracted from the current sentence: it does not use any information from the surrounding sentences. This baseline was implemented with scikit-learn Pedregosa et al. (2011).
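To make the n-gram features concrete, the following sketch counts all token n-grams up to a maximum size for a single sentence. It is a plain-Python illustration rather than the scikit-learn code used for the baseline; the function name and the default maximum size are assumptions.

```python
from collections import Counter


def ngram_features(tokens, max_n=2):
    """Count every contiguous token n-gram of size 1..max_n in one sentence."""
    features = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features[" ".join(tokens[start:start + n])] += 1
    return features


print(ngram_features(["patients", "were", "randomized"]))
```

Because the features come from the current sentence only, two identical sentences receive the same prediction regardless of where they appear in the abstract, which is the limitation the sequential baselines address.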
The second baseline (Forward ANN) uses the artificial neural network (ANN) model presented in Lee and Dernoncourt (2016): it computes sentence embeddings for each sentence, then classifies the current sentence given a few preceding sentence embeddings as well as the current sentence embedding.
The third baseline is a conditional random field (CRF) that uses n-grams as features: each output variable of the CRF corresponds to a label for a sentence, and the sequence the CRF considers is the entire abstract. The CRF baseline therefore uses both preceding and succeeding sentences when classifying the current sentence. CRFs have been shown to give strong performance for sequential sentence classification Amini et al. (2012). This baseline was implemented with CRFsuite Okazaki (2007).
The fourth baseline (bi-ANN) is an ANN consisting of three components: a token embedding layer (bi-LSTM), a sentence label prediction layer (bi-LSTM), and a label sequence optimization layer (CRF). The architecture is described in Dernoncourt et al. (2016) and has been demonstrated to yield state-of-the-art results for sequential sentence classification.
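The idea behind optimizing the label sequence, rather than picking the best label for each sentence independently, can be illustrated with a generic Viterbi decoder. This is a textbook sketch and not the implementation described in Dernoncourt et al. (2016): the per-sentence scores and transition scores below are invented, and missing transitions are assumed to score 0.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for one abstract.

    emissions: one dict per sentence mapping label -> score.
    transitions: dict mapping (previous_label, label) -> score (default 0.0).
    """
    if not emissions:
        return []
    best = [{label: emissions[0][label] for label in labels}]
    backpointers = []
    for scores in emissions[1:]:
        current, pointers = {}, {}
        for label in labels:
            # Best previous label for reaching `label` at this position.
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, label), 0.0))
            current[label] = best[-1][prev] + transitions.get((prev, label), 0.0) + scores[label]
            pointers[label] = prev
        best.append(current)
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[-1][label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["background", "objective"]
emissions = [
    {"background": 1.0, "objective": 0.9},
    {"background": 1.0, "objective": 0.9},
]
# Invented transition scores: a second background sentence is penalized here.
transitions = {("background", "background"): -1.0, ("objective", "background"): -1.0}
print(viterbi(emissions, transitions, labels))  # ['background', 'objective']
```

Without the transition scores, both sentences would be labeled background; with them, the decoder prefers the sequence background followed by objective.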
Table 3 compares the four baselines. As expected, LR performs the worst, followed by the Forward ANN. The bi-ANN outperforms the CRF, but the performance gap narrows as the dataset becomes larger.
Table 4 presents the precision, recall, F1-score and support for each class with the bi-ANN. Accurately classifying the background and objective classes is the most challenging. The confusion matrix in Table 5 shows that background sentences are often confused with objective sentences, and vice versa.
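For reference, the per-class metrics can be derived from a confusion matrix as in the sketch below. It assumes that rows correspond to true labels and columns to predicted labels, which is a convention chosen here rather than one stated in the text, and the counts in the example are invented.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall, F1 and support for each class.

    confusion[i][j] is the number of sentences whose true label is labels[i]
    and whose predicted label is labels[j] (assumed orientation).
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = confusion[k][k]
        support = sum(confusion[k])                  # sentences truly in this class
        predicted = sum(row[k] for row in confusion)  # sentences predicted as this class
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1, "support": support}
    return metrics


# Invented counts for two easily confused classes.
toy = per_class_metrics([[8, 2], [4, 6]], ["background", "objective"])
print(toy["background"])
```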
Table 6 gives more details on the LR baseline, and illustrates the impact of the choice of the n-gram size on the performance. Similarly, Table 7 shows the impact of the choice of the window size on the performance of the CRF.
|Model||PubMed 20k||PubMed 200k|
In this article we have presented PubMed 200k RCT, a dataset for sequential sentence classification. It is the largest such dataset that we are aware of. We have evaluated the performance of several baselines so that researchers may directly compare their algorithms against them without having to develop their own baselines. We hope that the release of this dataset will accelerate the development of algorithms for sequential sentence classification and increase the interest of the text mining community in the study of RCTs.
Lee and Dernoncourt (2016). Sequential short-text classification with recurrent and convolutional neural networks. In Human Language Technologies 2016: The Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT.
Pedregosa et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct):2825–2830.