Structured Multi-Label Biomedical Text Tagging via Attentive Neural Tree Decoding

10/02/2018 ∙ by Gaurav Singh, et al. ∙ Northeastern University, King's College London, UCL

We propose a model for tagging unstructured texts with an arbitrary number of terms drawn from a tree-structured vocabulary (i.e., an ontology). We treat this as a special case of sequence-to-sequence learning in which the decoder begins at the root node of an ontological tree and recursively elects to expand child nodes as a function of the input text, the current node, and the latent decoder state. In our experiments the proposed method outperforms state-of-the-art approaches on the important task of automatically assigning MeSH terms to biomedical abstracts.




1 Introduction

We consider the task of multilabel text annotation, where labels are drawn from an ontology. We are motivated by problems in biomedical NLP Zweigenbaum et al. (2007); Demner-Fushman et al. (2016). Specifically, scientific abstracts in this domain are typically associated with multiple Medical Subject Heading (MeSH) terms. MeSH is a controlled, hierarchically structured vocabulary that facilitates semantic labeling of texts at varying levels of granularity. This in turn supports semantic indexing of biomedical literature, thus facilitating improved search and retrieval. (This problem also resembles tagging clinical notes with ICD codes Mullenbach et al. (2018).)

At present, MeSH annotation is largely performed manually by highly skilled annotators employed by the National Library of Medicine (NLM). Automating this annotation task is thus highly desirable, and there have been considerable efforts to do so. The BioASQ challenge, in particular, concerns MeSH annotation, and competitive systems have emerged from this in past years Liu et al. (2014); Tsoumakas et al. (2013); these constitute baseline approaches in the present work.


Figure 1: Illustration of the proposed Neural Tree Decoding (NTD) model. Input text is encoded, and a decoder then conditionally traverses the label tree to select all relevant nodes to apply, with node-wise attention induced over the input text.

More generally, MeSH annotation is a specific instance of multi-label classification, which has received substantial attention in general Elisseeff and Weston (2002); Fürnkranz et al. (2008); Read et al. (2011); Bhatia et al. (2015); Daumé III et al. (2017); Chen et al. (2017); Jernite et al. (2016). Our work differs from these prior efforts in that MeSH tagging involves structured multi-label classification: the label space is a tree (technically, MeSH comprises multiple trees, but we join these by insertion of an overarching root node) in which nodes represent nested semantic concepts, and the specificity of these increases with depth.

Past efforts in multi-label classification have considered hierarchical and tree-based approaches for tagging Jernite et al. (2016); Beygelzimer et al. (2009); Daumé III et al. (2017), but these have not assumed a given structured label space; instead, these efforts have attempted to induce trees to improve inference efficiency. By contrast, we propose to explicitly capitalize on a known output structure codified here by the target ontology from which tags are drawn. We realize this by recursively traversing the tree to make (conditional) binary tag application predictions.

The contribution of this work is a neural sequence-to-sequence (seq2seq) model Bahdanau et al. (2014) for structured multi-label classification. Our approach entails encoding the input text to be tagged using an RNN, and then decoding into the ontological output space. This involves a tree traversal beginning at the root of the tree. At each step, the decoder decides whether to ‘expand’ children as a function of a hidden state vector, node embeddings, and induced attention weights over the input text. This approach is schematized in Figure 1. Expanded nodes are added to the predicted tag set. This process is repeated recursively until either leaf nodes are reached or no children are selected for expansion. This neural tree decoding (NTD) model outperforms state-of-the-art models for MeSH tagging.

2 Model


Our model is an instance of an encoder-decoder architecture. For the encoder, we adopt a standard Gated Recurrent Unit (GRU) network Cho et al. (2014a), which yields hidden states for the tokens comprising an input document. The decoder network consumes these outputs and begins at the root of the ontological tree. It induces an attention distribution over encoder states, which is used together with the current decoder state vector to inform which (if any) of its immediate children are applicable to the input text (Figure 1). This decoding process proceeds recursively for all children deemed relevant. Below we provide more in-depth technical detail regarding the constituent modules.

The encoder (enc) consumes as input a raw sequence of words $w_1, \dots, w_T$, here composing an abstract. These are passed through an embedding layer, producing a sequence of word embeddings $x_1, \dots, x_T$ (for clarity we omit a document index here), which are then passed through a GRU Cho et al. (2014b) to obtain a sequence of hidden vectors $s_1, \dots, s_T$, where $s_t = \mathrm{GRU}(x_t, s_{t-1})$.

These are then passed to our neural tree decoder, which is responsible for tagging the encoded text with an arbitrary number of terms from the label tree, i.e., sequences in the structured output space. This module traverses the label space top-down, beginning at the root, thus exploiting the concept hierarchy codified by the tree structure.

At each step in the decoding process, the decoder is positioned at a particular node in the tree, $n_p$. Children (immediate descendants) of this node are then considered for expansion in turn, based on a hidden state vector $h_{n_p}$ and a context vector $c_n$. Both of these are initialized to zero vectors and recursively updated during traversal, i.e., as nodes are selected for expansion (and hence added to the predicted tag set). More specifically, the context vector $c_n$ that informs the decision to expand node $n$ in the label hierarchy from its parent node $n_p$ is a weighted sum of the encoder hidden states $s_1, \dots, s_T$, where the weights $\alpha_{n,t}$ reflect induced attention over inputs, conditioned on $h_{n_p}$. That is:

$$c_n = \sum_{t=1}^{T} \alpha_{n,t}\, s_t \tag{1}$$

$$\alpha_{n,t} = \frac{\exp\big(f_n(s_t, h_{n_p})\big)}{\sum_{t'=1}^{T} \exp\big(f_n(s_{t'}, h_{n_p})\big)} \tag{2}$$

where $f_n$ is a simple multi-layer perceptron (MLP), with node-specific parameters $\theta_n$. Here both sums range over the length $T$ of the input text.
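As a rough illustration of the attention step, the following Python sketch normalizes per-token scores with a softmax and returns the attention-weighted sum of encoder states. The function name `attention_context` and the toy inputs are hypothetical, and the scores are passed in directly as a stand-in for the node-specific MLP rather than being computed by one.

```python
import numpy as np

def attention_context(enc_states, scores):
    """Softmax-normalise per-token scores and return (weights, context).

    enc_states: (T, d) array of encoder hidden states.
    scores:     (T,) array of unnormalised scores (stand-in for the MLP output).
    """
    shifted = scores - np.max(scores)            # subtract max for numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()
    context = weights @ enc_states               # weighted sum over the T tokens
    return weights, context

# Toy input: three tokens with two-dimensional encoder states.
enc_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Equal scores give uniform attention; one dominant score focuses on that token.
uniform_weights, uniform_context = attention_context(enc_states, np.zeros(3))
peaked_weights, peaked_context = attention_context(enc_states, np.array([0.0, 0.0, 50.0]))
```

With equal scores the context is simply the mean of the encoder states, whereas a single dominant score makes the context approach the state of the corresponding token.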


Given $h_{n_p}$ and $c_n$, we then estimate the probability that child label $n$ is applicable to the current input text as a function of the decoder state vector ($h_{n_p}$), the current context vector ($c_n$) and the decoder parameters. In particular, this is realized via a standard linear layer with sigmoid activations, parameterized by a weight matrix $W$ comprising independent weight vectors for each output node. Thus the score for a particular output node $n$ is $\hat{y}_n = \sigma(w_n \cdot [h_{n_p}; c_n])$, where $w_n$ denotes the weight vector for output node $n$.
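The node-wise scoring step can be sketched in a few lines of Python. This is a minimal illustration only: combining the decoder state and context vector by concatenation is an assumption, and the function name `node_probability` and the toy vectors are hypothetical.

```python
import numpy as np

def node_probability(w_node, h, c):
    """Sigmoid of a linear score over the decoder state h and context vector c.

    Concatenating h and c is an assumption made for this sketch.
    """
    z = w_node @ np.concatenate([h, c])
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([0.5, -0.5])   # toy decoder state
c = np.array([1.0, 0.0])    # toy context vector

p_zero = node_probability(np.zeros(4), h, c)  # zero weights: score 0
p_ones = node_probability(np.ones(4), h, c)   # unit weights: score 0.5 - 0.5 + 1.0 + 0.0 = 1.0
```

Because each output node has its own weight vector, the probabilities for sibling nodes are computed independently, which is what allows any number of children to be selected.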

function NodeLoss(n, s, h, y)
    L ← 0
    c, h ← dec(s, n, h)
    for each child m ∈ children(n) do
        ŷ_m ← σ(w_m · [h; c])
        p ← probability set by depth of m in tree
        z ∼ Ber(p)
        if z = 1 then
            L ← L + loss(ŷ_m, y_m)
        if m is selected for expansion then
            L ← L + NodeLoss(m, s, h, y)
    return L

function Train(X, Y, tree, epochs)
    θ ← init
    e ← 0
    while e < epochs do
        for each instance (x, y) do
            s ← enc(x)
            h ← 0
            L ← NodeLoss(root, s, h, y)
            backprop(L)
        e ← e + 1
    return θ

Algorithm 1 RecursiveTreeDecoding

Pseudocode for the training and decoding procedures is presented in Algorithm 1. In the NodeLoss function, $n$ denotes a particular node. The set of hidden vectors induced by the encoder (corresponding to the inputs) is denoted by $s$, $h$ is the hidden state of the decoder, and $y$ is the reference label (this encodes a path in the output tree). We assume the decoder, dec, consumes input representations, a node index and a hidden state, and yields a context vector for $n$ and an updated state vector $h$; in our case the latter is implemented via a GRU. The advantage of using an RNN during decoding is that this allows the exploitation of learned, distributed hidden representations of partial tree paths, which inform node-wise attention and subsequent predictions.
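The recursive traversal can be illustrated at prediction time with a minimal Python sketch. The toy tree, the `score` callable (standing in for the attentive decoder and sigmoid layer) and the 0.5 expansion threshold are all hypothetical choices made for illustration, and the sketch omits the hidden-state updates performed by the decoder.

```python
from typing import Callable, Dict, List, Set

def decode(node: str,
           children: Dict[str, List[str]],
           score: Callable[[str, str], float],
           threshold: float = 0.5) -> Set[str]:
    """Return the descendants of `node` selected by top-down expansion."""
    selected: Set[str] = set()
    for child in children.get(node, []):
        # Expand a child only if its score clears the (assumed) threshold.
        if score(node, child) > threshold:
            selected.add(child)
            # Recurse: grandchildren are considered only under expanded children.
            selected |= decode(child, children, score, threshold)
    return selected

# Hypothetical toy ontology and fixed scores standing in for model outputs.
toy_tree = {
    "root": ["Diseases", "Chemicals"],
    "Diseases": ["Neoplasms", "Infections"],
    "Neoplasms": ["Carcinoma"],
}
toy_scores = {
    ("root", "Diseases"): 0.9,
    ("root", "Chemicals"): 0.2,
    ("Diseases", "Neoplasms"): 0.8,
    ("Diseases", "Infections"): 0.1,
    ("Neoplasms", "Carcinoma"): 0.4,
}
predicted = decode("root", toy_tree, lambda parent, child: toy_scores[(parent, child)])
```

In this toy run only "Diseases" and "Neoplasms" are selected: the subtree under "Chemicals" is never visited, and traversal stops below "Neoplasms" because no child clears the threshold.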

Incurring loss for all nodes along the path specified by $y$ would place a disproportionate amount of emphasis on correctly applying terms that are ‘higher’ in the ontology, as loss will be propagated for the initial predictions concerning the application of these and then also, due to recursive application, for all of their children (and so on). Thus we only incur (and hence backpropagate) loss for a node $n$ stochastically, according to a Bernoulli distribution with parameter $p_n$. We set $p_n$ to be proportional to the depth of node $n$ in the tree, so that loss is incurred more often for deeper (rarely occurring) nodes. We operationalize this as $p_n = k_{\min} / k_n$, where $k_{\min}$ is the count corresponding to the least frequently observed node in the training corpus and $k_n$ is the count for node $n$. In Section 4 we demonstrate the benefit of this approach.
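The frequency-based sampling rule can be sketched as follows. The sketch assumes one natural reading of the description above, namely that the Bernoulli parameter for a node is the count of the least frequent node divided by that node's own count; the function name `loss_sampling_probs` and the toy label sets are hypothetical.

```python
import random
from collections import Counter

def loss_sampling_probs(label_sets):
    """Map each label to min_count / count (assumed form of the sampling rule)."""
    counts = Counter(label for labels in label_sets for label in labels)
    min_count = min(counts.values())
    return {label: min_count / count for label, count in counts.items()}

# Toy training labels: frequent shallow term, rarer deeper terms.
toy_labels = [
    {"Diseases", "Neoplasms"},
    {"Diseases"},
    {"Diseases", "Neoplasms", "Carcinoma"},
    {"Diseases"},
]
probs = loss_sampling_probs(toy_labels)

# Draw one Bernoulli decision per node: incur loss for this node or skip it.
rng = random.Random(0)
incur_loss = {label: rng.random() < p for label, p in probs.items()}
```

Under this reading the rarest node always contributes to the loss, while a node observed four times as often contributes on roughly a quarter of the occasions on which it is visited.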

At train time we use teacher forcing Williams and Zipser (1989) during decoding. That is, we revert the model back to the correct (training) tree subsequence when it goes off-course, and continue decoding from there. We have elided this detail from the pseudocode for clarity.

3 Experimental setup

Below we describe experimental details concerning our implementation, datasets and baselines. Code and data to reproduce our results is available at

3.1 Implementation Details

We limited the vocabulary to the most frequent words. Word embeddings were initialized to pre-trained vectors induced via word2vec, trained over a large set of abstracts indexed on PubMed (a repository of biomedical literature). Ontology node embeddings were pre-trained using DeepWalk Perozzi et al. (2014), fit over PubMed.

3.2 Dataset

Our dataset comprises abstracts of articles describing randomized controlled trials (RCTs) from PubMed along with their MeSH terms. The MeSH annotations were manually applied by professionals at the National Library of Medicine (NLM). The label space underlying MeSH terms is codified by a publicly available ontology.

We split this dataset into disjoint sets for training/development and final evaluation (Table 1). We further separated the former into train, validation and development test subsets, to refine our approach. For our final evaluation we used a held-out set of 10,000 abstracts that were not seen in any way during model development and/or hyperparameter tuning. We performed extensive hyperparameter tuning for the baseline models to ensure fair comparison; details regarding this tuning are provided in the Appendix.

3.3 Baselines

We compare our proposed approach to three baselines, including two prior winners of the annual BioASQ challenge, which includes an automated MeSH annotation task. However, it is important to note that we used a different (and considerably smaller) dataset in the current work, as compared to the corpus used in the BioASQ challenge.

LSSI Tsoumakas et al. (2013) use an approach that involves predicting both the number of terms and which to apply to a given abstract. They use linear models for both tasks, which operate over TF-IDF representations of abstracts. Specifically, they train a regressor to predict $k$, the number of MeSH terms to be applied to an abstract. Simultaneously, a binary linear SVM is trained independently for each MeSH term appearing in the train set. At test time, these SVMs provide scores for each term and the top $k$ terms are applied, where $k$ is the estimate from the aforementioned regressor.
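The test-time selection step of this baseline can be sketched as below. This is an illustration rather than the baseline's implementation: the function name `apply_top_k`, the toy terms and scores, and the choice to round the regressor output to the nearest integer are all assumptions.

```python
import numpy as np

def apply_top_k(term_scores, k_estimate, terms):
    """Apply the k highest-scoring terms, with k taken from a regressor estimate."""
    # Rounding and clipping the real-valued estimate is an assumption of this sketch.
    k = max(0, min(len(terms), int(round(k_estimate))))
    top = np.argsort(-np.asarray(term_scores))[:k]   # indices of the k largest scores
    return [terms[i] for i in top]

terms = ["Humans", "Adult", "Neoplasms", "Mice"]
svm_scores = [2.1, 0.3, 1.7, -0.5]     # toy per-term classifier scores
chosen = apply_top_k(svm_scores, 2.4, terms)
```

Here the estimate 2.4 is rounded to two terms, so the two highest-scoring terms are applied regardless of the sign of their scores.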

UIUC Liu et al. (2014) uses a learning-to-rank model to identify the top MeSH terms for an abstract from a candidate set of terms, which is obtained from the nearest neighbours of the abstract. Additionally, one SVM classifier is trained for each of the MeSH terms (similar to the above approach), and scores for each are used to obtain additional terms to be added to the candidate set. In the end, a threshold (tuned on the validation set) is used to select the final set of terms to be assigned.

Finally, we consider a deep multilabel classification model DML Rios and Kavuluru (2015) that takes as input unstructured abstracts and activates the output nodes corresponding to the relevant MeSH terms. In brief, embedded tokens are fed through a CNN to induce a vector representation, which is then passed on to the dense output layer. Finally, this is passed through a sigmoid activation function. Note that this model exploits the same pre-trained word embeddings as our model does.

Train                            20000
Validation                        4000
Dev test                         18884
Test (held-out)                  10000
Mean MeSH terms per article      15.33
Total unique MeSH terms          27892
Unique MeSH terms in dataset      3781

Table 1: Dataset statistics.

3.4 Evaluation metrics

We first evaluate model performance via output node-wise precision, recall and F1 measure. However, these metrics are overly strict in the sense that a model will be penalized equally for all mistakes, regardless of whether they are nearby or far from the target in the label tree. This is problematic because whether to apply a specific MeSH term or its immediate parent may be somewhat subjective in practice. To quantify this, and to explore the extent to which explicitly decoding into the target label space yields improved predictions, we also consider a measure that we refer to as semantic distance (SD):


$$\mathrm{SD}(Y, \hat{Y}) = \frac{1}{|Y|} \sum_{y \in Y} \min_{\hat{y} \in \hat{Y}} \mathrm{dist}(y, \hat{y}) \tag{3}$$

where $Y$ and $\hat{Y}$ are the sets of target and predicted terms respectively, and dist is a function that returns the shortest distance between two nodes in the label ontology tree. The idea is that this penalizes less for ‘near misses’. Thus if a model fails to apply a particular tag $y$, but does apply one near to $y$ in the label tree, then it is penalized less. (This metric is equivalent to the sum of two metrics, "divergent path to gold standard" and "divergent path to prediction", defined in Perotte et al. (2013).) We hypothesize that our model will improve results markedly with respect to this metric, given our exploitation of the tree structure.

As in the case of recall, SD can be ‘gamed’: one can achieve a perfect score by predicting that all nodes apply to a given abstract. Thus this is only meaningful alongside complementary metrics like F1.
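To make the behaviour of this metric concrete, the sketch below computes a semantic distance on a toy tree. It assumes that each target term is matched to its nearest predicted term and that the result is averaged over the target terms; the averaging, the helper names and the toy ontology are assumptions made for illustration, and both sets are assumed to be non-empty.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, following parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_dist(a, b, parent):
    """Number of edges on the shortest path between a and b in the tree."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:            # first shared ancestor on b's path
            return steps_from_a[n] + j
    raise ValueError("nodes are not in the same tree")

def semantic_distance(targets, predicted, parent):
    """Mean distance from each target to its nearest predicted term (assumed form)."""
    total = sum(min(tree_dist(y, p, parent) for p in predicted) for y in targets)
    return total / len(targets)

# Hypothetical toy ontology given as child -> parent links.
parent = {
    "Diseases": "root", "Chemicals": "root",
    "Neoplasms": "Diseases", "Infections": "Diseases",
    "Carcinoma": "Neoplasms",
}
targets = {"Carcinoma", "Infections"}
near_miss = semantic_distance(targets, {"Neoplasms", "Infections"}, parent)
far_miss = semantic_distance(targets, {"Chemicals"}, parent)
everything = semantic_distance(targets, set(parent) | {"root"}, parent)
```

Predicting the parent of a missed target costs far less than predicting a term in a different subtree, and predicting every node drives the distance to zero, which is the sense in which the metric can be gamed.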

4 Results

Results on the test set (which was completely held out during development) are reported in Table 2. The proposed Neural Tree Decoding model with stochastic backpropagation (NTD-s) bests the most competitive baseline (LSSI) in F1 score by nearly 2 points (0.327 vs. 0.309).

To explore the effect of backpropagating loss from nodes in proportion to their depth in the ontology, we also include results for a deterministic variant that does not do this, NTD-d. This version does not perform as well, demonstrating the utility of the proposed training approach.

The metrics reported thus far do not account for the structure in the output space. We thus additionally report results with respect to the semantic distance (SD) metric (Eq. 3), for which lower values are better. We observe a marked improvement of about 21% over the best performing baseline (1.130 for NTD-s vs. 1.433 for UIUC). This is intuitive given that we are explicitly decoding into the label tree structure, and demonstrates the ability of our model to learn the ontological structure, thereby predicting semantically appropriate terms.

Method   Precision   Recall   F1      SD
LSSI     0.326       0.293    0.309   1.518
UIUC     0.236       0.388    0.291   1.433
DML      0.378       0.223    0.275   1.516
NTD-d    0.434       0.235    0.299   1.209
NTD-s    0.425       0.265    0.327   1.130

Table 2: Results on the held-out test dataset. SD refers to semantic distance, defined in Eq. 3.
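The size of the gaps in Table 2 can be checked with a few lines of arithmetic. The sketch below copies the F1 and SD columns from the table and compares NTD-s against the best baseline on each metric; the variable names are illustrative only.

```python
# F1 and SD values copied from Table 2.
results = {
    "LSSI":  {"f1": 0.309, "sd": 1.518},
    "UIUC":  {"f1": 0.291, "sd": 1.433},
    "DML":   {"f1": 0.275, "sd": 1.516},
    "NTD-d": {"f1": 0.299, "sd": 1.209},
    "NTD-s": {"f1": 0.327, "sd": 1.130},
}
baselines = ["LSSI", "UIUC", "DML"]

best_f1 = max(results[b]["f1"] for b in baselines)   # higher F1 is better
best_sd = min(results[b]["sd"] for b in baselines)   # lower SD is better

# Absolute F1 gain in points and relative SD reduction in percent.
f1_gain_points = (results["NTD-s"]["f1"] - best_f1) * 100
sd_reduction_pct = (best_sd - results["NTD-s"]["sd"]) / best_sd * 100
```

On these figures the F1 gain over the best baseline is 1.8 points and the reduction in semantic distance is roughly 21%.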

5 Conclusions, Discussion & Limitations

We developed a neural attentive sequence tree decoding model for structured multilabel classification where labels are drawn from a known ontology. The proposed method can decode an input text into a tree of labels, effectively using the structure in the output space. We demonstrated that this model outperformed SOTA approaches for the important task of tagging biomedical abstracts with Medical Subject Heading (MeSH) terms on a modestly sized training corpus. Code and data to reproduce these results are available at

One limitation of our model is that it is comparatively slow, due to having to traverse the tree structure during decoding. Prediction speed may not be a major issue in practice, as articles on PubMed could be batch tagged nightly as they arrive. However, slow decoding also means lengthy training (see Appendix, Section A.2 for details). For this reason we have here used a modest training set of 20k abstracts, which is smaller than corpora used in prior work on this task. Given the relative expressiveness of our model, we expect it to benefit substantially from additional training data, more so than the simpler baseline architectures. But at present this is only a conjecture.

In future work we thus hope to apply this model to larger datasets, and to address the efficiency issue. Concerning the latter, sibling subtrees may be traversed in parallel, conditioned on the hidden state of their parent. Another promising direction would be to move to convolutional encoder and decoder architectures, designing the latter in a way that similarly capitalizes on the label space tree structure.

6 Acknowledgements

JT and GS acknowledge support from Cochrane via the Transform project. BCW was supported by the National Library of Medicine (NLM) of the National Institutes of Health (NIH), grant R01LM012086. IJM acknowledges support from the MRC (UK), through its grant MR/N015185/1.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
  • Beygelzimer et al. (2009) Alina Beygelzimer, John Langford, Yuri Lifshits, Gregory Sorkin, and Alex Strehl. 2009. Conditional probability tree estimation analysis and algorithms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press.
  • Bhatia et al. (2015) Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. 2015. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems.
  • Chen et al. (2017) Sheng Chen, Akshay Soni, Aasish Pappu, and Yashar Mehdad. 2017. Doctag2vec: An embedding based multi-label learning approach for document tagging. In Proceedings of the Workshop on Representation Learning for NLP. Association for Computational Linguistics.
  • Cho et al. (2014a) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014a. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing.

  • Cho et al. (2014b) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014b. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  • Daumé III et al. (2017) Hal Daumé III, Nikos Karampatziakis, John Langford, and Paul Mineiro. 2017. Logarithmic time one-against-some. In International Conference on Machine Learning.

  • Demner-Fushman et al. (2016) D Demner-Fushman, N Elhadad, et al. 2016. Aspiring to unintended consequences of natural language processing: A review of recent developments in clinical and consumer-generated text processing. IMIA Yearbook.
  • Elisseeff and Weston (2002) André Elisseeff and Jason Weston. 2002. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems.
  • Fürnkranz et al. (2008) Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. 2008. Multilabel classification via calibrated label ranking. Machine learning, 73(2).
  • Jernite et al. (2016) Yacine Jernite, Anna Choromanska, and David Sontag. 2016. Simultaneous learning of trees and representations for extreme classification, with application to language modeling. arXiv preprint arXiv:1610.04658.
  • Liu et al. (2014) Ke Liu, Junqiu Wu, Shengwen Peng, Chengxiang Zhai, and Shanfeng Zhu. 2014. The fudan-uiuc participation in the bioasq challenge task 2a: The antinomyra system. Risk, 129816.
  • Mullenbach et al. (2018) James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In North American Chapter of the Association for Computational Linguistics.
  • Perotte et al. (2013) Adler Perotte, Rimma Pivovarov, Karthik Natarajan, Nicole Weiskopf, Frank Wood, and Noémie Elhadad. 2013. Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association, 21(2).
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining. ACM.
  • Read et al. (2011) Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification. Machine learning, 85(3).
  • Rios and Kavuluru (2015) Anthony Rios and Ramakanth Kavuluru. 2015. Convolutional neural networks for biomedical text classification: application in indexing biomedical articles. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Health Informatics. ACM.
  • Tsoumakas et al. (2013) Grigorios Tsoumakas, Manos Laliotis, Nikos Markantonatos, and Ioannis Vlahavas. 2013. Large-scale semantic indexing of biomedical publications at BioASQ. In BioASQ Workshop.
  • Williams and Zipser (1989) Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2).
  • Zweigenbaum et al. (2007) Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, and Kevin B Cohen. 2007. Frontiers of biomedical text mining: current progress. Briefings in bioinformatics, 8(5).

Appendix A

A.1 Parameter Tuning

We performed extensive parameter tuning for the baselines. For LSSI, we tuned over vocabulary sizes of 50K, 100K and 150K. We also tuned the regularization parameter used for the linear SVR and linear SVM on a validation set, over the values [0.1, 1.0, 10.0, 100.0]. For UIUC, we tuned over vocabulary sizes of 50K, 100K and 150K. We also tuned the number of neighbours used by the nearest-neighbour (NN) classifier over [5, 10, 15, 30], and the regularizer of the linear SVM over [0.1, 1.0, 10.0, 100.0]. Afterwards, we tuned the threshold of the classifier over ten equidistant values. For DML, we performed nested validation after each epoch to save the best performing model parameters. We also tuned the threshold for classification over ten equidistant values.

A.2 Training Details

Iterating over 1K training samples takes 30-40 minutes on a Tesla K60 GPU, although it can be faster with more advanced GPUs. The model converges with a learning rate of 0.01 in 30-40 epochs. Training over a set of 20K documents converges in 3-4 days.