The COVID-19 pandemic has made it a global priority to advance research on the disease at an unprecedented pace. Researchers in a wide variety of fields, from clinicians to epidemiologists to policy makers, all need effective access to the most up-to-date publications in their respective areas. Automated document classification can play an important role in organizing the stream of articles by field and topic, facilitating search and speeding up research efforts.
To explore how document classification models can help organize COVID-19 research papers, we use the LitCovid dataset Chen et al. (2020), a collection of 8,000 newly released scientific papers compiled by the NIH to facilitate access to the literature on all aspects of the virus. This dataset is updated daily and every new article is manually assigned one or more of the following 8 categories: General, Transmission Dynamics (Transmission), Treatment, Case Report, Epidemic Forecasting (Forecasting), Prevention, Mechanism and Diagnosis. We leverage these annotations and the articles made available by LitCovid to compile a timely new dataset for multi-label document classification.
Apart from addressing the pressing needs of the pandemic, this dataset also offers an interesting document classification setting which spans different biomedical specialities while sharing one overarching topic. This setting is distinct from other biomedical document classification datasets, which tend to exclusively distinguish between biomedical topics such as hallmarks of cancer Baker et al. (2016), chemical exposure methods Baker (2017) or diagnosis codes Du et al. (2019). The dataset’s shared focus on the COVID-19 pandemic also sets it apart from open-domain datasets and academic paper classification datasets such as IMDB or the arXiv Academic Paper Dataset (AAPD) Yang et al. (2018), in which most documents share no common topic, and it poses unique challenges for document classification models.
We evaluate a number of models on the LitCovid dataset and find that fine-tuning pre-trained language models yields higher performance than traditional machine learning approaches and neural models such as LSTMs Adhikari et al. (2019b); Kim (2014); Liu et al. (2017). We also notice that BioBERT Lee et al. (2019), a BERT model pre-trained on the original corpus for BERT plus a large set of PubMed articles, performs slightly better than the original BERT base model. We also observe that the novel Longformer Beltagy et al. (2020) model, which allows for processing longer sequences, matches BioBERT’s performance when 1024 subwords are used instead of 512, the maximum for BERT models.
We then explore the data efficiency and generalizability of these models as crucial aspects to address for document classification to become a useful tool against outbreaks like this one. Finally, we discuss some issues found in our error analysis, such as current models often (1) correlating certain categories too closely with each other and (2) failing to focus on discriminative sections of a document, instead getting distracted by introductory text about COVID-19. These issues suggest avenues for future improvement.
In this section, we describe the LitCovid dataset in more detail and briefly introduce the CORD-19 dataset which we sampled to create a small test set to evaluate model generalizability.
|Statistic||LitCovid||CORD-19 Test|
|# of Classes||8||8|
|# of Articles||8,002||100|
|Total # of tokens||9,771,284||286,065|
The LitCovid dataset is a collection of recently published PubMed articles which are directly related to the 2019 novel Coronavirus. The dataset contains upwards of 14,000 articles and approximately 2,000 new articles are added every week, making it a comprehensive resource for keeping researchers up to date with the current COVID-19 crisis.
For a large portion of the articles in LitCovid, either the full article or at least the abstract can be downloaded directly from the LitCovid website. For our document classification dataset, we select 8,002 of the original 14,000+ articles which contain full texts or abstracts. As seen in Table 1, these selected articles contain on average approximately 51 sentences and 1,200 tokens, reflecting the roughly even split between abstracts and full articles we observe from inspection.
Each article in LitCovid is assigned one or more of the following 8 topic labels: Prevention, Treatment, Diagnosis, Mechanism, Case Report, Transmission, Forecasting and General. Even though every article in the corpus can be labelled with multiple tags, most articles, around 76%, contain only one label. Table 2 shows the label distribution for the subset of LitCovid which is used in the present work. We note that there is a large class imbalance, with the most frequently occurring label appearing almost 20 times as often as the least frequent one. We split the LitCovid dataset into train, dev and test sets with a 7:1:2 ratio.
The COVID-19 Open Research Dataset (CORD-19) Wang et al. (2020) was one of the earliest datasets released to facilitate cooperation between the computing community and the many relevant actors of the COVID-19 pandemic. It consists of approximately 60,000 papers related to COVID-19 and similar coronaviruses such as SARS and MERS since the SARS epidemic of 2002. Due to their differences in scope, this dataset shares only around 1,200 articles with the LitCovid dataset.
In order to test how our models generalize to a different setting, we asked biomedical experts to label a small set of 100 articles found only in CORD-19. Each article was labelled independently by two annotators. For articles which received two different annotations (around 15%), a third annotator broke ties. Table 1 shows the statistics of this small set and Table 2 shows its category distribution.
In the following section, we provide a brief description of each model and the implementations used. We use micro-F1 (F1) and accuracy (Acc.) as our evaluation metrics, as done in Adhikari et al. (2019a). All reproducibility information can be found in Appendix A.
|Model||Dev Set Acc.||Dev Set F1||Test Set Acc.||Test Set F1|
|LSTM||57.7 ± 0.7||75.8 ± 0.5||59.1 ± 1.3||76.1 ± 0.5|
|LSTM||59.4 ± 2.4||74.6 ± 1.2||61.7 ± 1.9||75.9 ± 1.2|
|KimCNN||59.3 ± 1.1||75.7 ± 0.4||61.0 ± 0.1||76.2 ± 0.2|
|XML-CNN||61.9 ± 1.0||77.2 ± 0.3||64.6 ± 0.4||77.9 ± 0.3|
|BERT||66.1 ± 1.3||79.1 ± 0.1||68.1 ± 0.9||80.6 ± 0.2|
|BERT||66.4 ± 0.5||79.0 ± 0.7||68.1 ± 1.1||79.5 ± 1.2|
|Longformer||66.7 ± 1.1||79.9 ± 0.5||69.2 ± 0.2||80.7 ± 0.7|
|BioBERT||66.5 ± 0.6||80.2 ± 0.1||68.5 ± 1.0||81.2 ± 0.3|
3.1 Traditional Machine Learning Models
To compare with simpler but competitive traditional baselines, we use the default scikit-learn Pedregosa et al. (2011) implementations of logistic regression and a linear support vector machine (SVM) for multi-label classification, which train one classifier per class using a one-vs-rest scheme. Both models use TF-IDF weighted bag-of-words as input.
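The one-vs-rest setup described above can be sketched as follows. This is a minimal illustration rather than the exact configuration used in our experiments: the toy documents and label sets are invented, the `build` helper is hypothetical, and the choice of `SVC(kernel="linear")` for the linear SVM is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Toy documents and label sets, invented for illustration only.
docs = [
    "Patients received antiviral therapy and recovered.",
    "Masks and social distancing reduce the spread.",
    "A case of a 45-year-old patient treated with antivirals.",
    "Modeling predicts the epidemic peak under distancing measures.",
]
labels = [
    ["Treatment"],
    ["Prevention"],
    ["Case Report", "Treatment"],
    ["Forecasting", "Prevention"],
]

# Turn the label sets into a binary indicator matrix (one column per category).
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)


def build(base_estimator):
    """Hypothetical helper: TF-IDF features feeding one binary classifier per category."""
    return make_pipeline(TfidfVectorizer(), OneVsRestClassifier(base_estimator))


log_reg = build(LogisticRegression()).fit(docs, y)
# Assumption: a linear-kernel SVC stands in for the linear SVM baseline.
svm = build(SVC(kernel="linear")).fit(docs, y)

# Each prediction is a row of 0/1 flags, one per category in binarizer.classes_.
pred = log_reg.predict(["Antiviral treatment outcomes in hospitalized patients."])
```

Because each category gets its own binary classifier, a document can receive any number of labels, including none.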
3.2 Conventional Neural Models
Using Hedwig (https://github.com/castorini/hedwig), a document classification toolkit, we evaluate the following models: KimCNN Kim (2014), XML-CNN Liu et al. (2017) as well as an unregularized and a regularized LSTM Adhikari et al. (2019b). We notice that they all perform similarly and slightly better than traditional methods.
3.3 Pre-Trained Language Models
Using the same Hedwig document classification toolkit, we evaluate the performance of DocBERT Adhikari et al. (2019a) on this task with a few different pre-trained language models. We fine-tune BERT base, BERT large Devlin et al. (2019) and BioBERT Lee et al. (2019), a version of BERT base which was further pre-trained on a collection of PubMed articles. We find that all BERT models achieve their best performance with their highest possible sequence length of 512 subwords. Additionally, we fine-tune the pre-trained Longformer Beltagy et al. (2020) in the same way and find that it performs best when a maximum sequence length of 1024 is used. As in the original Longformer paper, we use global attention on the [CLS] token for document classification but find that performance improves by around 1% if we use the average of all tokens as input instead of only the [CLS] representation. We hypothesize that this effect can be observed because the LitCovid dataset contains longer documents on average than the Hyperpartisan dataset used in the original Longformer paper.
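The difference between using the [CLS] representation and the average of all tokens can be illustrated with a small NumPy sketch. The function names, the array shapes and the decision to exclude padding positions from the average are assumptions made for illustration, not details of our implementation.

```python
import numpy as np


def cls_pool(hidden, mask):
    """Use only the first ([CLS]) token of each document; `mask` is unused here."""
    return hidden[:, 0, :]


def mean_pool(hidden, mask):
    """Average the hidden states of all real tokens.

    hidden: (batch, seq_len, dim) final-layer token representations.
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding.
    Assumption: padding positions are excluded from the average.
    """
    weights = mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * weights).sum(axis=1)
    counts = np.clip(weights.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 documents, 4 token positions, 3 hidden dimensions.
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

doc_vectors_cls = cls_pool(hidden, mask)    # shape (2, 3)
doc_vectors_mean = mean_pool(hidden, mask)  # shape (2, 3)
```

Either pooled vector would then be passed to the classification layer; with mean pooling, every token in a long document contributes to the document representation.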
We find that all pre-trained language models outperform the previous traditional and neural methods by a sizable margin in both accuracy and micro-F1 score. The best performing models are the Longformer and BioBERT, both achieving a similar micro-F1 score of around 81% on the test set and an accuracy of 69.2% and 68.5% respectively.
4 Results & Discussion
In this section, we explore data efficiency, model generalizability and discuss potential ways to improve performance on this task in future work.
4.1 Data Efficiency
During a sudden healthcare crisis like this pandemic it is essential for models to obtain useful results as soon as possible. Since labelling biomedical articles is a very time-consuming process, achieving peak performance using less data becomes highly desirable. We thus evaluate the data efficiency of these models by training each of the ones shown in Figure 1 using 1%, 5%, 10%, 20% and 50% of our training data and report the micro-F1 score on the dev set. When selecting the data subsets, we sample each category independently to make sure they are all represented.
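One way to draw such subsets while keeping every category represented is sketched below. The function `sample_subset`, its rounding rule and the fixed seed are assumptions for illustration; because multi-label articles can be drawn under more than one category, the resulting subset size is only approximately the requested fraction.

```python
import random


def sample_subset(label_sets, fraction, seed=0):
    """Sample each category independently so that all of them are represented.

    label_sets: one collection of category names per training document.
    Returns the sorted indices of the selected documents.
    """
    rng = random.Random(seed)

    # Group document indices by category.
    by_label = {}
    for idx, labels in enumerate(label_sets):
        for label in labels:
            by_label.setdefault(label, []).append(idx)

    chosen = set()
    for label in sorted(by_label):
        indices = by_label[label]
        # Keep at least one document per category, even for tiny fractions.
        k = max(1, round(fraction * len(indices)))
        chosen.update(rng.sample(indices, k))
    return sorted(chosen)


# Toy, heavily imbalanced label sets (invented for illustration).
label_sets = (
    [["Prevention"]] * 50
    + [["Treatment"]] * 10
    + [["Forecasting", "Prevention"]] * 2
)
subset = sample_subset(label_sets, fraction=0.1)
covered = {label for i in subset for label in label_sets[i]}
```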
We observe that pre-trained models are much more data-efficient than other models and that BioBERT is the most efficient, demonstrating the importance of domain-specific pre-training. We also notice that BioBERT performs worse than other pre-trained models on 1% of the data, suggesting that its pre-training prevents it from leveraging potentially spurious patterns when there is very little data available.
4.2 CORD-19 Generalizability
To effectively respond to this pandemic, experts must not only learn as much as possible about the current virus but also thoroughly understand past epidemics and similar viruses. Thus, it is crucial for models trained on the LitCovid dataset to successfully categorize articles about related epidemics. We therefore evaluate some of our baselines on such articles using our labelled CORD-19 subset. We find that the micro-F1 and accuracy metrics drop by around 10 and 30 points respectively. This massive drop in performance from a minor change in domain indicates that the models have trouble ignoring the overarching COVID-19 topic and isolating relevant signals from each category.
It is interesting to note that Mechanism is the only category for which BioBERT performs better in CORD-19 than in LitCovid. This could be because Mechanism articles use technical language and there are enough samples for the models to learn, in contrast with Forecasting, which also uses specific language but has far fewer training examples. BioBERT’s binary F1 scores for each category on both datasets can be found in Appendix B.
4.3 Error Analysis
We analyze 50 errors made by both of the highest-scoring models, BioBERT and the Longformer, on LitCovid documents to better understand their performance. We find that 34% of these were annotation errors which our best performing model predicted correctly. We also find that 10% of the errors were nearly impossible to classify using only the text available on LitCovid; the full articles are needed to make a better-informed prediction. From the rest of the errors, we identify some aspects of this task which should be addressed in future work.
We first note that these models often correlate certain categories, namely Prevention, Transmission and Forecasting, much more closely than necessary. Even though these categories are semantically related and some overlap exists, the Transmission and Forecasting tags are predicted in conjunction with the Prevention tag much more frequently than is observed in the gold labels, as can be seen from the table in Appendix C. Future work should attempt to explicitly model correlation between categories to help the model recognize the particular cases in which labels should occur together. The first row in Table 4 shows a document labelled as Forecasting which is also incorrectly predicted with a Prevention label, exemplifying this issue.
Finally, we observe that models have trouble identifying discriminative sections of the document due to how much introductory content on the pandemic can be found in most articles. Future work should explicitly model the gap in relevance between introductory sections and crucial sentences such as thesis statements and article titles. In Table 4, the second and third examples would be more easily classified correctly if specific sentences were ignored while others were attended to more thoroughly. This could also increase interpretability, facilitating analysis and further improvement.
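The over-correlation described above can be quantified by comparing how often two categories co-occur in the annotations and in the predictions. The sketch below is illustrative only: the function name and the toy label sets are invented and do not reproduce the numbers in Appendix C.

```python
def cooccurrence_rate(label_sets, a, b):
    """Fraction of documents tagged with category `a` that are also tagged with `b`."""
    with_a = [labels for labels in label_sets if a in labels]
    if not with_a:
        return 0.0
    return sum(b in labels for labels in with_a) / len(with_a)


# Toy annotations and predictions (invented for illustration).
gold = [
    {"Forecasting"},
    {"Forecasting", "Prevention"},
    {"Transmission"},
    {"Prevention"},
]
pred = [
    {"Forecasting", "Prevention"},
    {"Forecasting", "Prevention"},
    {"Transmission", "Prevention"},
    {"Prevention"},
]

gold_rate = cooccurrence_rate(gold, "Forecasting", "Prevention")  # 0.5
pred_rate = cooccurrence_rate(pred, "Forecasting", "Prevention")  # 1.0
```

A predicted rate well above the annotated rate, as in this toy example, indicates that the model attaches Prevention to Forecasting articles more often than the annotators do.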
We provide an analysis of document classification models on the LitCovid dataset for the COVID-19 literature. We determine that fine-tuning pre-trained language models yields the best performance on this task. We study the generalizability and data efficiency of these models and discuss some important issues to address in future work.
This research was sponsored in part by the Ohio Supercomputer Center Center (1987). The authors would also like to thank Lang Li and Tanya Berger-Wolf for helpful discussions.
- DocBERT: BERT for document classification. ArXiv abs/1904.08398.
- Rethinking complex neural network architectures for document classification. In NAACL-HLT.
- Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics 32 (3), pp. 432–40.
- Corpus and Software.
- Longformer: the long-document transformer. arXiv:2004.05150.
- Ohio supercomputer center.
- Keep up with the latest coronavirus research. Nature 579 (7798), pp. 193.
- BERT: pre-training of deep bidirectional transformers for language understanding. ArXiv abs/1810.04805.
- ML-net: multi-label classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association (JAMIA).
- Convolutional neural networks for sentence classification. In EMNLP.
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.
- Deep learning for extreme multi-label text classification. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, pp. 2825–2830.
- CORD-19: the COVID-19 open research dataset. ArXiv abs/2004.10706.
- SGM: sequence generation model for multi-label classification. In COLING.
Appendix A Experimental Set-up
We split the LitCovid dataset into train, dev, test with the ratio 7:1:2.
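A 7:1:2 split of this kind can be sketched as follows. The function name is hypothetical, and the use of a seeded random shuffle without stratification is an assumption made for illustration.

```python
import random


def split_7_1_2(items, seed=0):
    """Shuffle the items and cut them into train/dev/test parts with ratio 7:1:2."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder, roughly 20%
    return train, dev, test


# Article indices standing in for the 8,002 selected LitCovid articles.
train, dev, test = split_7_1_2(range(8002))
```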
We adopt micro-F1 and accuracy as our evaluation metrics, as in Adhikari et al. (2019a). We use scikit-learn Pedregosa et al. (2011) and the Hedwig evaluation scripts to evaluate all the models. For preprocessing, tokenization and sentence segmentation, we use the NLTK library.
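For reference, the two metrics can be sketched in plain Python as below. This is not the evaluation script itself; in particular, treating accuracy as exact-match accuracy over whole label sets is an assumption, and the toy label sets are invented.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match_accuracy(gold, pred):
    """Assumed definition: a document counts as correct only if its full label set matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy label sets (invented for illustration).
gold = [{"Treatment"}, {"Prevention", "Transmission"}, {"Diagnosis"}]
pred = [{"Treatment"}, {"Prevention"}, {"Diagnosis", "Mechanism"}]

f1 = micro_f1(gold, pred)               # 3 TP, 1 FP, 1 FN -> 0.75
acc = exact_match_accuracy(gold, pred)  # only the first document matches exactly
```

Under this definition, partially correct predictions still contribute to micro-F1 but count as errors for accuracy, which is consistent with accuracy being the lower of the two scores for every model.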
All the document classification models used in the paper were run using the following implementations, strictly following their instructions: logistic regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), SVM (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), DocBERT (https://github.com/castorini/hedwig/blob/master/models/bert), Reg-LSTM (https://github.com/castorini/hedwig/blob/master/models/reg_lstm), XML-CNN (https://github.com/castorini/hedwig/blob/master/models/xml_cnn) and Kim CNN (https://github.com/castorini/hedwig/blob/master/models/kim_cnn). We used the following pre-trained language models: BioBERT (https://huggingface.co/monologg/biobert_v1.1_pubmed), BERT base (https://huggingface.co/bert-base-uncased), BERT large (https://huggingface.co/bert-large-uncased) and the Longformer (https://github.com/allenai/longformer).
For reproducibility, we list all the key hyperparameters, the tuning bounds and the number of parameters for each model in Table A1. For the logistic regression and the SVM, all hyperparameters were left at their scikit-learn defaults and are therefore excluded from this table. For all models we train for a maximum of 30 epochs with a patience of 5. We used micro-F1 score for all hyperparameter tuning. All models were run on NVIDIA GeForce GTX 1080 GPUs.
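The stopping rule of at most 30 epochs with a patience of 5 can be sketched as follows. The function name is hypothetical, and monitoring the dev-set micro-F1 with a strict-improvement criterion is an assumption for illustration; the per-epoch scores are invented.

```python
def select_best_epoch(dev_scores, max_epochs=30, patience=5):
    """Return (best_epoch, best_score), stopping after `patience` epochs without improvement.

    dev_scores: one dev-set score per epoch (assumed here to be micro-F1).
    """
    best_score, best_epoch, stale = float("-inf"), -1, 0
    for epoch, score in enumerate(dev_scores[:max_epochs]):
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score


# Invented scores: the late jump to 0.90 is never reached because training stops first.
scores = [0.50, 0.60, 0.70, 0.69, 0.68, 0.70, 0.66, 0.65, 0.90]
best_epoch, best_score = select_best_epoch(scores)
```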
|Model||Hyperparameters||Hyperparameter bounds||Number of Parameters|
Appendix B Performance by Category
|Category||Binary F1 Score|
Appendix C Category Correlation