Advancing PICO Element Detection in Medical Text via Deep Neural Networks

by Di Jin, et al.

In evidence-based medicine (EBM), structured medical questions are favored for efficiently searching for the best available evidence on treatments. PICO element detection helps structure clinical studies and questions by identifying the sentences in a given medical text that belong to one of four components: Participants (P), Intervention (I), Comparison (C), and Outcome (O). In this work, we propose a hierarchical deep neural network (DNN) architecture with dual bi-directional long short-term memory (bi-LSTM) layers to automatically detect PICO elements in medical texts. Within the model, the lower bi-LSTM layer encodes each sentence, while the upper one contextualizes the encoded sentence representation vectors. In addition, we adopt adversarial and virtual adversarial training to regularize the model. Overall, we advance PICO element detection to new state-of-the-art performance, outperforming previous work by at least 4% in F1 score for all P/I/O categories.




1 Introduction

In evidence-based medicine (EBM), well-formulated and structured documents and questions can help physicians efficiently identify appropriate resources and search for the best available evidence for medical treatment richardson1995well. In practice, clinical studies and questions either explicitly or implicitly contain four aspects: Population/Problem (P), Intervention (I), Comparison (C), and Outcome (O), which are known as PICO elements. Using this structure to aid the information retrieval (IR) of medical evidence within large medical citation databases is popular and advantageous huang2006evaluation; schardt2007utilization; boudin2010improving. However, this first requires accurately identifying the PICO elements in the medical documents as well as in the questions.

The PICO element detection process can be cast as a classification task at the sentence or segment level. Many previous studies have sought to develop algorithms for this problem with improved performance. Earlier work focused on classic machine learning techniques such as Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) hansen2008method, Conditional Random Field (CRF) kim2011automatic, and Multi-Layer Perceptron (MLP) huang2011classification. All of these methods rely heavily on carefully collected hand-engineered features. More recently, the bi-directional Long Short-Term Memory (bi-LSTM) model was adopted to encode each sentence into a representation vector for subsequent classification, and a CRF module was added to form the "bi-LSTM+CRF" architecture, which achieved state-of-the-art (SOTA) performance jin2018pico.

In this work, building on the last SOTA model, we stack another bi-LSTM layer over the encoded sentence representation vectors to aggregate features from surrounding sentences, inspired by the methods of jin2018hierarchical. In this way, contextual information from surrounding sentences can be used to help infer the label of the current sentence. In addition, we adopt adversarial and virtual adversarial training to regularize the model by stabilizing the classification function miyato2016adversarial. With these changes, we advance the PICO element detection task to new SOTA performance. Specifically, the absolute improvements in F1 score for the three P/I/O elements are 4.3%, 6.8%, and 5.0%, respectively.

2 Related Work

In the last decade, many researchers have sought to build stronger models for automatic PICO element detection, and various machine learning techniques have been proposed, including NB huang2013pico; boudin2010combining; demner2007answering, RF boudin2010combining, SVM boudin2010combining; hansen2008method, CRF kim2011automatic; chung2009sentence; chung2007study, and MLP boudin2010combining; huang2011classification. Most recently, inspired by the unprecedented success of deep neural networks (DNNs), Jin et al. jin2018pico were the first to use a bi-LSTM model to classify each sentence of a paragraph (e.g., an abstract) into PICO categories or "other," and showed that this model can boost accuracy by a large margin compared with non-deep-learning models (over a 5% absolute increase in F1 score for all PICO categories). Besides the performance improvement, DNN models rely only on pre-trained word embeddings as features and entirely remove the need for feature selection.

In terms of dataset generation, earlier works relied mainly on manual annotation, so the corpora they used are quite small, on the order of hundreds of abstracts demner2007answering; dawes2007identification; chung2009sentence; kim2011automatic. Later, the structural information embedded in some abstracts, in which the authors have clearly stated distinctive sentence headings such as "PATIENTS", "SAMPLE", or "OUTCOMES", started to be used to label PICO elements boudin2010combining; huang2011classification; huang2013pico; jin2018pico. In this way, tens of thousands of abstracts containing PICO elements can be automatically compiled from PubMed into a well-annotated dataset, which makes the application of DNN models feasible.

3 Methods

3.1 Model Architecture

The model architecture is illustrated in Fig. 1. It can be decomposed into four parts, which are described in detail below:

Figure 1: Model architecture.

Sentence Encoder

We first embed each word of the sentence into a word embedding vector and then use a layer of bi-LSTM to encode this sequence of word vectors, yielding a sequence of hidden state vectors, each of which corresponds to a word.

Attention Layer

To obtain a single vector to represent the sentence, attentive pooling is used to aggregate the sequence of hidden state vectors into one. Detailed equations are given in yang2016hierarchical .
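
As a concrete illustration, attentive pooling can be sketched in a few lines of NumPy. This is a minimal sketch of the attention mechanism described in yang2016hierarchical, not the exact implementation used here; the projection matrix `W`, bias `b`, and context vector `u_w` are learned parameters in the real model but are random placeholders below.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pooling(H, W, b, u_w):
    """Aggregate hidden states H (T x d) into one sentence vector.

    u_t = tanh(W h_t + b), alpha_t = softmax(u_t . u_w), s = sum_t alpha_t h_t
    """
    U = np.tanh(H @ W.T + b)   # (T, d_a) projected hidden states
    scores = U @ u_w           # (T,) similarity to the context vector
    alpha = softmax(scores)    # attention weights, sum to 1
    return alpha @ H           # (d,) weighted sum of hidden states

rng = np.random.default_rng(0)
T, d, d_a = 5, 8, 6            # words per sentence, hidden size, attention size
H = rng.normal(size=(T, d))    # stand-in for bi-LSTM hidden states
s = attentive_pooling(H, rng.normal(size=(d_a, d)), np.zeros(d_a),
                      rng.normal(size=d_a))
print(s.shape)  # (8,)
```

The pooled vector keeps the dimensionality of a single hidden state, so the subsequent layers treat each sentence like one "token" of the abstract.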

Sentence Contextualization

So far, for an abstract of several sentences, we have obtained a sequence of vectors, each corresponding to a sentence. In this step, these vectors are further processed by another bi-LSTM layer so that each sentence vector is contextualized with information from the surrounding sentences. This contextualized vector is later used to infer the label of the corresponding sentence.

CRF Layer

We finally use a CRF module to optimize the sequence of labels collobert2011natural. It models the dependencies between subsequent labels so that unlikely label sequences can be avoided.
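
To illustrate how the CRF decoding avoids unlikely label sequences, here is a minimal Viterbi sketch in NumPy with hypothetical emission and transition scores; it shows how a strong transition penalty can override per-position label preferences:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most likely label sequence under a linear-chain CRF score.

    emissions:   (T, L) per-position label scores (e.g., from the bi-LSTM)
    transitions: (L, L) score of moving from label i to label j
    """
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # total[i, j]: best score ending in label i at t-1 then label j at t
        total = score[:, None] + transitions + emissions[t]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

em = np.array([[2., 0.], [0., 1.], [0., 2.]])
no_switch = np.array([[0., -5.], [-5., 0.]])  # heavily penalise label changes
print(viterbi(em, np.zeros((2, 2))))  # [0, 1, 1]: per-position argmax
print(viterbi(em, no_switch))         # [1, 1, 1]: switching labels is too costly
```

With zero transition scores the decoder reduces to independent per-sentence argmax; with the penalty it prefers a consistent run of labels, which mirrors how contiguous P/I/O blocks appear in real abstracts.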

3.2 Adversarial Training

Based on the above-mentioned architecture, we further apply adversarial and virtual adversarial training as an effective way to regularize the classifier by adding small perturbations to the embeddings during training. For this, we first normalize the embeddings so that the embeddings and perturbations are on a similar scale, as shown below miyato2016adversarial:

$$\bar{v}_k = \frac{v_k - \mathrm{E}(v)}{\sqrt{\mathrm{Var}(v)}}, \qquad \mathrm{E}(v) = \sum_{j=1}^{K} f_j v_j, \qquad \mathrm{Var}(v) = \sum_{j=1}^{K} f_j \left(v_j - \mathrm{E}(v)\right)^2,$$

where $f_j$ is the frequency of the $j$-th word based on the statistics of the training samples, and $K$ is the vocabulary size.
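
A minimal NumPy sketch of this frequency-weighted normalization (the embedding matrix and word counts below are random placeholders standing in for pre-trained embeddings and corpus statistics):

```python
import numpy as np

def normalize_embeddings(V, counts):
    """Frequency-weighted embedding normalization (after miyato2016adversarial).

    V:      (K, d) embedding matrix
    counts: (K,) word counts from the training corpus
    Returns embeddings whose frequency-weighted mean is 0 and weighted
    variance is 1 per dimension, so a perturbation of fixed norm has a
    comparable effect across dimensions.
    """
    f = counts / counts.sum()                        # word frequencies f_j
    mean = (f[:, None] * V).sum(axis=0)              # E(v) = sum_j f_j v_j
    var = (f[:, None] * (V - mean) ** 2).sum(axis=0) # Var(v)
    return (V - mean) / np.sqrt(var)

rng = np.random.default_rng(1)
V = rng.normal(loc=3.0, scale=2.0, size=(1000, 4))
counts = rng.integers(1, 100, size=1000)
Vn = normalize_embeddings(V, counts)
f = counts / counts.sum()
print(np.allclose((f[:, None] * Vn).sum(axis=0), 0))  # True
```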

We denote the concatenation of a sequence of word embedding vectors as $s$ (this sequence can be a sentence or a paragraph), and the model's conditional probability of the gold label $y$ given $s$ as $p(y \mid s; \theta)$, where $\hat{\theta}$ denotes the current model parameters. Then the adversarial perturbation $r_{adv}$ is calculated using the following equation:

$$r_{adv} = -\epsilon \frac{g}{\|g\|_2}, \qquad g = \nabla_{s} \log p(y \mid s; \hat{\theta}),$$

where $\epsilon$ controls the scale of the $L_2$-norm of the perturbation. To make the classifier robust to the adversarial perturbation, we add the adversarial loss to the original classification loss, defined by:

$$L_{adv}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p\left(y_n \mid s_n + r_{adv,n}; \theta\right),$$

where $N$ is the number of labeled samples.
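
As an illustration, the adversarial perturbation can be computed in closed form for a simple model. The sketch below uses logistic regression as a stand-in for the full bi-LSTM/CRF network, since its gradient is available analytically; in the actual model, $g$ is obtained by backpropagation through the whole network. The inputs `s`, `w`, and `eps` are illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_perturbation(s, y, w, b, eps):
    """r_adv = -eps * g / ||g||_2 with g = grad_s log p(y | s).

    Logistic regression p(y=1 | s) = sigmoid(w.s + b) stands in for the
    full network; the gradient of the log-likelihood w.r.t. the input s
    is (y - p) * w for binary labels y in {0, 1}.
    """
    p = sigmoid(w @ s + b)
    g = (y - p) * w
    return -eps * g / np.linalg.norm(g)

s = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 2.0, -1.0])
r = adversarial_perturbation(s, 1, w, 0.0, eps=0.1)
print(round(np.linalg.norm(r), 6))  # 0.1: perturbation lies on the eps-sphere
```

By construction, $s + r_{adv}$ lowers the likelihood of the gold label the most among all perturbations of norm $\epsilon$ (to first order), which is exactly what the adversarial loss trains the model to resist.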

In terms of virtual adversarial training, we calculate the following approximated virtual adversarial perturbation:

$$r_{v\text{-}adv} = \epsilon \frac{g}{\|g\|_2}, \qquad g = \nabla_{s+d} \, \mathrm{KL}\left[ p(\cdot \mid s; \hat{\theta}) \,\middle\|\, p(\cdot \mid s + d; \hat{\theta}) \right],$$

where $d$ is a small random vector, and $\mathrm{KL}[p \| q]$ stands for the KL divergence between probability distributions $p$ and $q$. Then the virtual adversarial loss is defined as:

$$L_{v\text{-}adv}(\theta) = \frac{1}{N'} \sum_{n'=1}^{N'} \mathrm{KL}\left[ p\left(\cdot \mid s_{n'}; \hat{\theta}\right) \,\middle\|\, p\left(\cdot \mid s_{n'} + r_{v\text{-}adv,n'}; \theta\right) \right],$$

where $N'$ is the number of both labeled and unlabeled samples, since labels are not needed to calculate the virtual adversarial loss.
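
The virtual adversarial perturbation can be illustrated in the same spirit. The sketch below uses a small softmax classifier as a stand-in for the full network and estimates the gradient of the KL divergence by central finite differences; the step size `xi` of the initial random direction, the finite-difference step `h`, and the weights `W` are all illustrative placeholders (the paper's model instead backpropagates through the network).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def virtual_adversarial_perturbation(s, W, eps, xi=0.1, seed=0):
    """Approximate r_v-adv for a softmax model p(. | s) = softmax(W s).

    A small random direction d is refined via a finite-difference estimate
    of grad KL[p(.|s) || p(.|s+d)] evaluated at s + d, then rescaled to
    length eps, following the approximation of miyato2016adversarial.
    """
    rng = np.random.default_rng(seed)
    d = rng.normal(size=s.shape)
    d = xi * d / np.linalg.norm(d)       # small random starting direction
    p0 = softmax(W @ s)
    g = np.zeros_like(s)
    h = 1e-5
    for i in range(s.size):
        e = np.zeros_like(s)
        e[i] = h
        plus = kl(p0, softmax(W @ (s + d + e)))
        minus = kl(p0, softmax(W @ (s + d - e)))
        g[i] = (plus - minus) / (2 * h)  # central-difference gradient
    return eps * g / np.linalg.norm(g)

W = np.array([[1.0, -0.5], [0.2, 0.8], [-1.0, 0.3]])
s = np.array([0.4, -0.7])
r = virtual_adversarial_perturbation(s, W, eps=0.05)
print(round(np.linalg.norm(r), 6))  # 0.05
```

Note that no gold label appears anywhere in the computation: the perturbation only needs the model's own output distribution, which is why unlabeled samples can contribute to $L_{v\text{-}adv}$.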

4 Experiments

4.1 Datasets

The dataset used in this study is the benchmark dataset from jin2018pico. It was generated from MEDLINE, a freely accessible database of medical articles. In this dataset, each sentence of an abstract is annotated with one of seven labels: Aim (A), Participants (P), Intervention (I), Outcome (O), Method (M), Results (R), and Conclusion (C). Table 2 in the appendix shows the sentences of a typical abstract with their corresponding labels. There are 24,668 abstracts in total, each of which contains at least one of the P/I/O labels. In detail, there are 21,198 abstracts with P-labels, 13,712 with I-labels, and 20,473 with O-labels.

4.2 Training Settings

Ten-fold cross-validation was used to report the final performance results. For each fold, the test set was evaluated at the checkpoint with the highest development-set performance. The model is optimized with the Adam method kingma2014adam. For regularization, dropout is applied to each layer JMLR:v15:srivastava14a and L2 regularization is also used. The word embeddings were pre-trained on a large corpus combining PubMed and PMC texts using the word2vec tool, and they are fixed during the training phase to avoid over-fitting.

5 Results

Table 1 summarizes the performance of our proposed model in comparison with previous results. The previously published methods include LR, MLP, CRF, and BiLSTM+CRF, all from jin2018pico. Our proposed model has four variants: the baseline is the architecture illustrated in Fig. 1 without adversarial or virtual adversarial training; Adv. and V-Adv. mean that adversarial or virtual adversarial training, respectively, is used while optimizing the model; Adv.+V-Adv. means that both training methods are used.

As we can see from Table 1, our baseline model improves on the previous methods by a large margin for all three P/I/O elements. The improvement is largest for the I element, the worst-performing of the three labels, where the absolute increase in F1 score reaches 5%. This indicates that the contextual information that the newly added upper bi-LSTM layer extracts from surrounding sentences is most helpful for the I element.

Furthermore, when we apply adversarial training while optimizing the baseline model, the absolute increases in F1 score for all three P/I/O elements are around 1%, demonstrating the effectiveness of adversarial training as a means of regularization. The improvement brought by virtual adversarial training is smaller, possibly because the virtual adversarial loss is calculated in an unsupervised way and is thus not specific to this task. However, this characteristic opens the door to utilizing abundant unlabeled text from the same source as the labeled data for better model generalization. For this dataset specifically, PubMed abstracts without labels could all be used in such a semi-supervised strategy, and this unlabeled corpus could be at least 10 times larger than the labeled data, potentially leading to good improvements; we leave this for future work.

Finally, as the last row of Table 1 shows, combining adversarial and virtual adversarial training achieves a larger improvement than using either alone, indicating that the two techniques complement each other. With all of these modifications, the absolute improvements in F1 score for the three P/I/O labels are 4.3%, 6.8%, and 5.0%, respectively.

Models           | P-element p / r / F1 (%) | I-element p / r / F1 (%) | O-element p / r / F1 (%)
LR               | 66.9 / 68.5 / 67.7       | 55.6 / 55.0 / 55.3       | 65.4 / 67.0 / 66.2
MLP              | 77.8 / 74.1 / 75.8       | 64.3 / 65.9 / 64.9       | 73.8 / 77.9 / 75.8
CRF              | 82.2 / 77.5 / 79.8       | 67.8 / 70.3 / 69.0       | 76.0 / 76.3 / 76.2
BiLSTM+CRF       | 87.8 / 83.4 / 85.5       | 72.7 / 81.3 / 76.7       | 81.1 / 85.3 / 83.1
Ours–Baseline    | 90.0 / 86.6 / 88.3       | 79.6 / 84.0 / 81.7       | 85.5 / 87.8 / 86.6
Ours–Adv.        | 90.5 / 88.0 / 89.2       | 81.8 / 84.3 / 83.0       | 85.8 / 89.7 / 87.7
Ours–V-Adv.      | 90.2 / 87.8 / 89.0       | 80.7 / 83.3 / 81.9       | 86.3 / 88.6 / 87.4
Ours–Adv.+V-Adv. | 91.7 / 88.1 / 89.8       | 82.4 / 84.6 / 83.5       | 87.0 / 89.4 / 88.1
Table 1: Performance in terms of precision (p), recall (r) and F1 on the test set (average value based on 10-fold cross validation).
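
The precision, recall, and F1 values in Table 1 are the standard per-label metrics. A minimal sketch, with toy gold and predicted label sequences standing in for the real model output:

```python
def precision_recall_f1(gold, pred, label):
    """Per-label precision, recall, and F1, as reported in Table 1."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Toy example: one P sentence found, one missed, one false alarm.
gold = ["P", "P", "I", "O", "O", "A"]
pred = ["P", "I", "I", "O", "P", "A"]
print(precision_recall_f1(gold, pred, "P"))  # (0.5, 0.5, 0.5)
```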


  • [1] Florian Boudin, Jian-Yun Nie, Joan C Bartlett, Roland Grad, Pierre Pluye, and Martin Dawes. Combining classifiers for robust pico element detection. BMC medical informatics and decision making, 10(1):29, 2010.
  • [2] Florian Boudin, Lixin Shi, and Jian-Yun Nie. Improving medical information retrieval with pico element detection. In European Conference on Information Retrieval, pages 50–61. Springer, 2010.
  • [3] Grace Y Chung. Sentence retrieval for abstracts of randomized controlled trials. BMC medical informatics and decision making, 9(1):10, 2009.
  • [4] Grace Y Chung and Enrico Coiera. A study of structured clinical abstracts and the semantic classification of sentences. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 121–128. Association for Computational Linguistics, 2007.
  • [5] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
  • [6] Martin Dawes, Pierre Pluye, Laura Shea, Roland Grad, Arlene Greenberg, and Jian-Yun Nie. The identification of clinically important elements within medical journal abstracts: Patient_population_problem, exposure_intervention, comparison, outcome, duration and results (pecodr). Journal of Innovation in Health Informatics, 15(1):9–16, 2007.
  • [7] Dina Demner-Fushman and Jimmy Lin. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics, 33(1):63–103, 2007.
  • [8] Marie J Hansen, Nana Ø Rasmussen, and Grace Chung. A method of extracting the number of trial participants from abstracts describing randomized controlled trials. Journal of Telemedicine and Telecare, 14(7):354–358, 2008.
  • [9] Ke-Chun Huang, I-Jen Chiang, Furen Xiao, Chun-Chih Liao, Charles Chih-Ho Liu, and Jau-Min Wong. Pico element detection in medical text without metadata: Are first sentences enough? Journal of biomedical informatics, 46(5):940–946, 2013.
  • [10] Ke-Chun Huang, Charles Chih-Ho Liu, Shung-Shiang Yang, Furen Xiao, Jau-Min Wong, Chun-Chih Liao, and I-Jen Chiang. Classification of pico elements by text features systematically extracted from pubmed abstracts. In Granular Computing (GrC), 2011 IEEE International Conference on, pages 279–283. IEEE, 2011.
  • [11] Xiaoli Huang, Jimmy Lin, and Dina Demner-Fushman. Evaluation of pico as a knowledge representation for clinical questions. In AMIA annual symposium proceedings, volume 2006, page 359. American Medical Informatics Association, 2006.
  • [12] Di Jin and Peter Szolovits. Hierarchical neural networks for sequential sentence classification in medical scientific abstracts. arXiv preprint arXiv:1808.06161, 2018.
  • [13] Di Jin and Peter Szolovits. Pico element detection in medical text via long short-term memory neural networks. In Proceedings of the BioNLP 2018 workshop, pages 67–75, 2018.
  • [14] Su Nam Kim, David Martinez, Lawrence Cavedon, and Lars Yencken. Automatic classification of sentences to support evidence based medicine. In BMC bioinformatics, volume 12, page S5. BioMed Central, 2011.
  • [15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [16] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.
  • [17] W Scott Richardson, Mark C Wilson, Jim Nishikawa, and Robert SA Hayward. The well-built clinical question: a key to evidence-based decisions. ACP journal club, 123(3):A12–A12, 1995.
  • [18] Connie Schardt, Martha B Adams, Thomas Owens, Sheri Keitz, and Paul Fontelo. Utilization of the pico framework to improve searching pubmed for clinical questions. BMC medical informatics and decision making, 7(1):16, 2007.
  • [19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [20] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.