Automatic Classification of Pathology Reports using TF-IDF Features

03/05/2019 ∙ by Shivam Kalra, et al. ∙ 0

A pathology report is arguably one of the most important documents in medicine, containing interpretive information about the visual findings from a patient's biopsy sample. Each pathology report has a retention period of up to 20 years after the treatment of a patient. Cancer registries all across the world process and encode high volumes of free-text pathology reports for surveillance of cancer and tumor diseases. In spite of the extremely valuable information they hold, pathology reports are not used in any systematic way to facilitate computational pathology. Therefore, in this study, we investigate automated machine-learning techniques to predict the primary diagnosis (based on ICD-O code) from pathology reports. We performed experiments by extracting TF-IDF features from the reports and classifying them using three different methods: SVM, XGBoost, and Logistic Regression. We constructed a new dataset with 1,949 pathology reports arranged into 37 ICD-O categories, collected from four different primary sites, namely lung, kidney, thymus, and testis. The reports were manually transcribed into text format after collecting them as PDF files from the NCI Genomic Data Commons public dataset. We subsequently pre-processed the reports by removing irrelevant textual artifacts produced by OCR software. The highest classification accuracy we achieved was 92%, using the XGBoost classifier on TF-IDF feature vectors; the linear SVM scored 87%. Furthermore, the study shows that TF-IDF vectors are suitable for highlighting the important keywords within a report, which can be helpful for cancer research and the diagnostic workflow. The results are encouraging in demonstrating the potential of machine learning methods for classification and encoding of pathology reports.




I Introduction

Cancer is one of the leading causes of death in the world, with over 80,000 deaths registered in Canada in 2017 (Canadian Cancer Statistics 2017). A computer-aided system for cancer diagnosis usually involves a pathologist rendering a descriptive report after examining the tissue glass slides obtained from the biopsy of a patient. A pathology report contains specific analysis of cells and tissues, and other histopathological indicators that are crucial for diagnosing malignancies. An average-sized laboratory may produce a large quantity of pathology reports annually (e.g., in excess of 50,000), but these reports are written mostly in unstructured text, with no direct link to the tissue sample. Furthermore, the report for each patient is a personalized document with very high variability in terminology due to a lack of standards; it may include misspellings and missing punctuation, clinical diagnoses interspersed with complex explanations, different terminology to label the same malignancy, and information about multiple carcinoma appearances in a single report [1].

In Canada, each Provincial and Territorial Cancer Registry (PTCR) is responsible for collecting data about cancer diseases and reporting them to Statistics Canada (StatCan). Every year, the Canadian Cancer Registry (CCR) uses the information sources of StatCan to compile an annual report on cancer and tumor diseases. Many countries have their own cancer registry programs. These programs rely on the acquisition of diagnostic, treatment, and outcome information through manual processing and interpretation of various unstructured sources (e.g., pathology reports, autopsy/laboratory reports, medical billing summaries). The manual classification of cancer pathology reports is a challenging, time-consuming task that requires extensive training [1].

With the continued growth in the number of cancer patients and the increase in treatment complexity, cancer registries face a significant challenge in manually reviewing the large quantity of reports [2, 1]. In this situation, Natural Language Processing (NLP) systems offer an opportunity to automatically encode the unstructured reports into structured data. Since the registries already have access to a large quantity of historically labeled and encoded reports, a supervised machine learning approach of feature extraction and classification is a compelling direction for making their workflow more effective and streamlined. If successful, such a solution would enable processing reports in much less time, allowing trained personnel to focus on their research and analysis. However, developing an automated solution with high accuracy and consistency across a wide variety of reports is a challenging problem.

For cancer registries, an important piece of information in a pathology report is the associated ICD-O code, which describes the patient’s histological diagnosis as defined by the World Health Organization’s (WHO) International Classification of Diseases for Oncology [3]. Prediction of the primary diagnosis from a pathology report provides a valuable starting point for exploring machine learning techniques for automated cancer surveillance. A major application would be “auto-reporting” based on analysis of whole slide images (the digitized biopsy glass slides). Structured, summarized, and categorized reports can be associated with the image content when searching in large archives. Such a system would be able to drastically increase the efficiency of diagnostic processes for the majority of cases where, in spite of an obvious primary diagnosis, time and effort are still required from pathologists to write a descriptive report.

The primary objective of our study is to analyze the efficacy of existing machine learning approaches for the automated classification of pathology reports into different diagnosis categories. We demonstrate that TF-IDF feature vectors combined with a linear SVM or XGBoost classifier can be an effective method for classification of the reports, achieving up to 92% accuracy. We also show that TF-IDF features are capable of identifying important keywords within a pathology report. Furthermore, we have created a new dataset consisting of 1,949 pathology reports across 37 primary diagnoses. Taken together, our exploratory experiments with a newly introduced dataset of pathology reports open many new opportunities for researchers to develop scalable and automatic information extraction from unstructured pathology reports.

II Background

NLP approaches for information extraction within the biomedical research areas range from rule-based systems [4], to domain-specific systems using feature-based classification [2], to the recent deep networks for end-to-end feature extraction and classification [1]. NLP has had varying degrees of success with free-text pathology reports [5]. Various studies have acknowledged the success of NLP in interpreting pathology reports, especially for classification tasks or for extracting a single attribute from a report [5, 6].

The Cancer Text Information Extraction System (caTIES) [7] is a framework, developed in a caBIG project, that focuses on information extraction from pathology reports. Specifically, caTIES extracts information from surgical pathology reports (SPR) with good precision as well as recall.

Another system, known as Open Registry [8], is capable of filtering the reports with disease codes containing cancer. In [9], an approach called Automated Retrieval Console (ARC) is proposed, which uses machine learning models to predict the degree of association of a given pathology or radiology report with cancer. The performance ranges from an F-measure of 0.75 for lung cancer to 0.94 for colorectal cancer. However, ARC uses domain-specific rules, which hinders the generalization of the approach to a variety of pathology reports.

This research work is inspired by themes emerging in many of the above studies. Specifically, we evaluate the task of predicting the primary diagnosis from the pathology report. Unlike previous approaches, our system does not rely on custom rule-based knowledge, domain-specific features, or a balanced dataset with a small number of classes.

III Materials and Methods

(a) Primary Diagnosis
Description Count
Clear cell adenocarcinoma, NOS 523
Squamous cell carcinoma, NOS 340
Papillary adenocarcinoma, NOS 300
Adenocarcinoma, NOS 233
Renal cell carcinoma, chromophobe type 113
Adenocarcinoma with mixed subtypes 89
Seminoma, NOS 68
Thymoma, type AB, malignant 31
Mixed germ cell tumor 30
Thymoma, type B2, malignant 26
Embryonal carcinoma, NOS 26
Thymoma, type A, malignant 15
Renal cell carcinoma, NOS 14
Thymoma, type B1, malignant 13
Bronchiolo-alveolar carcinoma, non-mucinous 13
Thymoma, type B3, malignant 13
Acinar cell carcinoma 13
Mucinous adenocarcinoma 11
Thymic carcinoma, NOS 11
Basaloid squamous cell carcinoma 9
Thymoma, type AB, NOS 7
Squamous cell carcinoma, keratinizing, NOS 7
Teratoma, benign 6
Solid carcinoma, NOS 5
Thymoma, type B2, NOS 5
Yolk sac tumor 4
Papillary squamous cell carcinoma 4
Bronchiolo-alveolar adenocarcinoma, NOS 3
Bronchiolo-alveolar carcinoma, mucinous 3
Teratoma, malignant, NOS 3
Micropapillary carcinoma, NOS 2
Thymoma, type A, NOS 2
Teratocarcinoma 2
Squamous cell carcinoma, large cell, nonkeratinizing 2
Thymoma, type B1, NOS 1
Squamous cell carcinoma, small cell, nonkeratinizing 1
Signet ring cell carcinoma 1
(b) Primary Site
Kidney 937
Lung 749
Testis 139
Thymus 124
TABLE I: Distribution of pathology reports across (a) primary diagnosis, used as the label for the study, and (b) primary site associated with a report.

We assembled a dataset of 1,949 cleaned pathology reports. Each report is associated with one of 37 different primary diagnoses based on ICD-O codes. The reports were collected from multiple patients across four different body parts, or primary sites. The distribution of reports across the primary diagnoses and primary sites is reported in Table I. The dataset was developed in three steps, as follows.

Collecting pathology reports: A total of 11,112 pathology reports were downloaded from NCI’s Genomic Data Commons (GDC) dataset in PDF format [10]. Out of all PDF files, 1,949 reports were selected across multiple patients from four specific primary sites: thymus, testis, lung, and kidney. The selection was primarily based on the quality of the PDF files.

Cleaning reports: The next step was to extract the text content from these reports. Because manually re-typing all the pathology reports would take a significant amount of time, we developed a new strategy to prepare our dataset. We applied Optical Character Recognition (OCR) software to convert the PDF reports to text files. Then, we manually inspected all generated text files to fix any grammar/spelling issues and to remove irrelevant characters produced as artifacts of the OCR system.

Splitting into training-testing data: We split the cleaned reports into 70% and 30% for training and testing, respectively. This split resulted in 1,364 training, and 585 testing reports.
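The 70/30 split above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name `split_reports`, the fixed seed, and the use of a plain (non-stratified) random shuffle are all assumptions, since the text does not say how reports were assigned to each side.

```python
import random

def split_reports(report_ids, train_fraction=0.7, seed=0):
    """Shuffle report identifiers and split them into train/test lists."""
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle (assumed)
    n_train = int(len(ids) * train_fraction)  # floor of the training share
    return ids[:n_train], ids[n_train:]

# 1,949 reports, identified here by hypothetical integer ids.
train_ids, test_ids = split_reports(range(1949))
print(len(train_ids), len(test_ids))  # 1364 585
```

With 1,949 reports, taking the floor of 70% reproduces the 1,364/585 counts stated above.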

III-A Pre-Processing of Reports

We pre-processed the reports by converting their text content to lowercase and filtering out any non-alphanumeric characters. We used the NLTK library to remove stop words, e.g., ‘the’, ‘an’, ‘was’, ‘if’, and so on [11]. We then analyzed the reports to find common bigrams, such as “lung parenchyma”, “microscopic examination”, “lymph node”, etc. We joined each bigram with a hyphen, converting it into a single word. We further removed words whose occurrence is below 2% in each diagnostic category, as well as words whose occurrence exceeds 90% across all categories. We stored each pre-processed report in a separate text file.
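The first pre-processing steps can be sketched in a few lines. This is an illustrative approximation only: the small `STOP_WORDS` and `BIGRAMS` sets are hypothetical stand-ins for NLTK's stop word list and for the bigrams mined from the corpus, and the frequency-based filtering (2% and 90% thresholds) is not shown.

```python
import re

# Hypothetical stand-ins: the study uses NLTK's stop word list and bigrams
# found in the corpus; these small sets only keep the sketch self-contained.
STOP_WORDS = {"the", "an", "a", "was", "if", "of", "and", "is", "in"}
BIGRAMS = {("lung", "parenchyma"), ("microscopic", "examination"), ("lymph", "node")}

def preprocess(text):
    """Lowercase, strip non-alphanumerics, drop stop words, join known bigrams."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in BIGRAMS:
            merged.append(tokens[i] + "-" + tokens[i + 1])  # bigram -> one word
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(preprocess("Microscopic examination: the lymph node was negative."))
# ['microscopic-examination', 'lymph-node', 'negative']
```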

III-B TF-IDF features

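TF-IDF stands for Term Frequency-Inverse Document Frequency, a widely used weighting scheme in information retrieval and text mining. TF-IDF signifies the importance of a term in a document within a corpus. Here, a document refers to a pathology report, the corpus refers to the collection of reports, and a term refers to a single word in a report. The TF-IDF weight for a term $t$ in a document $d$ is given by

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)},$$

where $\text{TF}(t, d)$ is the frequency of term $t$ in document $d$, $N$ is the total number of documents in the corpus, and $\text{DF}(t)$ is the number of documents that contain $t$.
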
We performed the following steps to transform a pathology report into a feature vector:

  1. Create a set of vocabulary containing all unique words from all the pre-processed training reports.

  2. Create a zero vector $\mathbf{v}$ of the same length as the vocabulary.

  3. For each word $t$ in a report $d$, set the corresponding index in $\mathbf{v}$ to the TF-IDF weight of $t$ in $d$.

  4. The resulting $\mathbf{v}$ is the feature vector for the report $d$; it is a highly sparse vector.
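The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the weight is taken to be term frequency times the natural logarithm of (number of training reports / document frequency), with no smoothing or normalization, and all function names are hypothetical. Words not seen in training are ignored.

```python
import math
from collections import Counter

def build_vocabulary(train_docs):
    """Step 1: map each unique training word to a fixed vector index."""
    words = sorted({w for doc in train_docs for w in doc})
    return {w: i for i, w in enumerate(words)}

def document_frequencies(train_docs):
    """Count in how many training documents each word appears."""
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))
    return df

def tfidf_vector(doc, vocab, df, n_docs):
    """Steps 2-4: start from zeros, then fill in tf * log(N / df) per word."""
    vec = [0.0] * len(vocab)
    for word, tf in Counter(doc).items():
        if word in vocab:  # words unseen in training are skipped
            vec[vocab[word]] = tf * math.log(n_docs / df[word])
    return vec

# Toy pre-processed reports (hypothetical content).
train_docs = [
    ["tumor", "necrosis", "lung"],
    ["tumor", "kidney"],
    ["thymoma", "epithelial", "cells", "tumor"],
]
vocab = build_vocabulary(train_docs)
df = document_frequencies(train_docs)
vec = tfidf_vector(train_docs[0], vocab, df, len(train_docs))
```

In this toy corpus, "tumor" occurs in every report and therefore receives weight zero, while "lung" and "necrosis" receive log(3); most entries of `vec` stay at zero, which is the sparsity noted in step 4.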

III-C Keyword extraction and topic modelling

Keyword extraction involves identifying the important words within a report that summarize its content. In contrast, topic modelling groups these keywords using an intelligent scheme, enabling users to focus on certain aspects of a document. All the words in a pathology report are sorted according to their TF-IDF weights. The top $k$ sorted words constitute the top keywords for the report; $k$ is empirically set to 50 in this research. The extracted keywords are further grouped into different topics using latent Dirichlet allocation (LDA) [12]. The keywords in a report are highlighted using a color theme based on their topics.
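The keyword selection step amounts to sorting words by weight and keeping the highest-ranked ones. A minimal sketch follows; the function name `top_keywords` and the alphabetical tie-breaking are assumptions, and the LDA grouping is not shown. The example weights are taken from the "Top 10 Keywords" listing in Fig. 1.

```python
def top_keywords(weights, n=50):
    """Return the n highest-weighted (word, weight) pairs.

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# A few TF-IDF weights as listed under "Top 10 Keywords" in Fig. 1.
weights = {
    "epithelial": 0.377,
    "presence": 0.269,
    "thymectomy": 0.232,
    "cells": 0.180,
    "small": 0.161,
    "lobulated": 0.161,
}
print(top_keywords(weights, n=3))
# [('epithelial', 0.377), ('presence', 0.269), ('thymectomy', 0.232)]
```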

III-D Evaluation metrics

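Each model is evaluated using two standard NLP metrics, the micro- and macro-averaged F-scores, where the F-score is the harmonic mean of the related metrics precision and recall. For each diagnostic category $c_j$ from the set $C$ of 37 different classes, let $TP_j$, $FP_j$, and $FN_j$ denote the number of true positives, false positives, and false negatives, respectively. The micro F-score is defined as

$$P_{\text{micro}} = \frac{\sum_{j=1}^{|C|} TP_j}{\sum_{j=1}^{|C|} (TP_j + FP_j)}, \qquad R_{\text{micro}} = \frac{\sum_{j=1}^{|C|} TP_j}{\sum_{j=1}^{|C|} (TP_j + FN_j)}, \qquad F_{\text{micro}} = \frac{2\, P_{\text{micro}}\, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}},$$

whereas the macro F-score is given by

$$P_{\text{macro}} = \frac{1}{|C|} \sum_{j=1}^{|C|} \frac{TP_j}{TP_j + FP_j}, \qquad R_{\text{macro}} = \frac{1}{|C|} \sum_{j=1}^{|C|} \frac{TP_j}{TP_j + FN_j}, \qquad F_{\text{macro}} = \frac{2\, P_{\text{macro}}\, R_{\text{macro}}}{P_{\text{macro}} + R_{\text{macro}}}.$$
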
In summary, micro-averaged metrics have class representation roughly proportional to their test set representation (same as accuracy for classification problem with a single label per data point), whereas macro-averaged metrics are averaged by class without weighting by class prevalence [13].
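The difference between the two averages is easiest to see on a small, imbalanced example. The sketch below is illustrative only: it assumes the macro F-score is the harmonic mean of macro-averaged precision and recall (another common convention averages per-class F-scores instead), treats a zero denominator as a score of zero, and uses a hypothetical function name and toy labels.

```python
def f_scores(y_true, y_pred):
    """Return (micro_f, macro_f) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for 0/0

    def harmonic(p, r):
        return ratio(2 * p * r, p + r)

    micro_p = ratio(sum(tp.values()), sum(tp[c] + fp[c] for c in classes))
    micro_r = ratio(sum(tp.values()), sum(tp[c] + fn[c] for c in classes))
    macro_p = sum(ratio(tp[c], tp[c] + fp[c]) for c in classes) / len(classes)
    macro_r = sum(ratio(tp[c], tp[c] + fn[c]) for c in classes) / len(classes)
    return harmonic(micro_p, micro_r), harmonic(macro_p, macro_r)

# A classifier that always predicts the majority class "a".
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "a"]
micro_f, macro_f = f_scores(y_true, y_pred)
```

Here the micro F-score equals the accuracy (4/6), while the macro F-score drops to 4/15 because the two rare classes are never predicted correctly.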

III-E Experimental setting

In this study, we performed two different series of experiments: i) evaluating the performance of TF-IDF features and various machine learning classifiers on the task of predicting the primary diagnosis from the text content of a given report, and ii) using TF-IDF and LDA techniques to highlight the important keywords within a report. For the first experiment series, training reports are pre-processed, and then their TF-IDF features are extracted. The TF-IDF features and the training labels are used to train different classification models. These classification models and their hyper-parameters are reported in Table II. The performance of the classifiers is measured quantitatively on the test dataset using the evaluation metrics discussed in the previous section. For the second experiment series, a random report is selected and its top 50 keywords are extracted using TF-IDF weights. These 50 keywords are highlighted using different colors based on their associated topics, which are extracted through LDA. A qualitative inspection by non-experts is performed on the extracted keywords and their corresponding topics.

IV Results and Discussion

IV-A Experiment Series 1

A classification model is trained to predict the primary diagnosis given the content of the cancer pathology report. The performance results on this task are reported in Table III. We can observe that the XGBoost classifier outperformed all other models on both the micro F-score, with a score of 0.92, and the macro F-score, with a score of 0.31. Relative to the next best model, SVM-L, this is an improvement of 0.05 in micro F-score (0.92 vs. 0.87) and of 0.07 in macro F-score (0.31 vs. 0.24). It is interesting to note that the SVM with a linear kernel performs much better than the SVM with an RBF kernel, scoring 0.06 higher on the macro F-score and 0.12 higher on the micro F-score. We suspect that, because words used in the primary diagnosis itself occur in some reports, linear models are able to outperform more complex models.

Code      Classifier           Parameters
SVM-L     SVM                  kernel: linear, C: 1.0, shrinking: true
SVM-RBF   SVM                  kernel: rbf, C: 1.0, shrinking: true
LR        Logistic Regression  penalty: l2, solver: liblinear, C: 1.0
XGBoost   XGBoost              max depth: 6, learning rate: 0.3
TABLE II: Different classifiers used in the study
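For readers who want to reproduce the configuration in Table II, the settings could be expressed with scikit-learn roughly as follows. This is a sketch under assumptions: the study does not state which library implemented the SVM and Logistic Regression models, and the XGBoost settings are kept as a plain dictionary so the example does not require the xgboost package.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Classifier codes and hyper-parameters as listed in Table II.
models = {
    "SVM-L": SVC(kernel="linear", C=1.0, shrinking=True),
    "SVM-RBF": SVC(kernel="rbf", C=1.0, shrinking=True),
    "LR": LogisticRegression(penalty="l2", solver="liblinear", C=1.0),
}

# XGBoost settings from Table II, held as keyword arguments; with the
# xgboost package installed they could be passed as
# xgboost.XGBClassifier(**xgb_params).
xgb_params = {"max_depth": 6, "learning_rate": 0.3}
```

Each model would then be fitted on the TF-IDF feature vectors of the training reports and scored on the test reports.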
Classifier Code   Micro F-score (Train)   Micro F-score (Test)   Macro F-score (Train)   Macro F-score (Test)
SVM-L             0.95                    0.87                   0.28                    0.24
SVM-RBF           0.80                    0.75                   0.19                    0.18
LR                0.82                    0.78                   0.20                    0.18
XGBoost           0.99                    0.92                   0.64                    0.31
TABLE III: Final train and test performance of classification models
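The gaps between models can be read directly off the test columns of Table III. The short sketch below only repeats that arithmetic; the dictionary layout and the helper name `gap` are illustrative, and the classifier codes follow Table II.

```python
# Test-set scores copied from Table III: (micro F-score, macro F-score).
test_scores = {
    "SVM-L": (0.87, 0.24),
    "SVM-RBF": (0.75, 0.18),
    "LR": (0.78, 0.18),
    "XGBoost": (0.92, 0.31),
}

def gap(a, b):
    """Difference a - b in (micro, macro) F-score, rounded to two decimals."""
    return tuple(round(x - y, 2) for x, y in zip(test_scores[a], test_scores[b]))

print(gap("XGBoost", "SVM-L"))   # (0.05, 0.07)
print(gap("SVM-L", "SVM-RBF"))   # (0.12, 0.06)
```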

IV-B Experiment Series 2

Figure 1 shows the top 50 keywords highlighted using TF-IDF and LDA. The proposed approach performs well in highlighting the important regions. For example, the topic highlighted in red, containing “presence range tumor necrosis”, provides useful biomarker information to readers.

Top 10 Keywords
1. Epithelial (0.377), 2. Presence (0.269), 3. Thymectomy (0.232)
4. Epithelial cells (0.210), 5. Cells (0.180), 6. Small (0.161), 7. Lobulated (0.161)
8. Lung parenchyma (0.151), 9. Appear (0.150), 10. Examination (0.139)
Topic # Keywords
Topic 1 examination, thymectomy, measuring, resection, showing,
inflammatory, lymph, spaces, lung, immunohistochemistry,
node, modified, complete
Topic 2 samples, epithelial, proliferation, mixed, CD20, CD5,
histological, according, classification, masaoka
Topic 3 lobulated, necrotic, small, right, architecture,
presence, medulla, range, tumor, necrosis, green, appear,
parenchyma, cells, cytokeratin, right, major
Fig. 1: The top 50 keywords in a report identified using TF-IDF weights. The keywords are color encoded as per the abstract “topics” extracted using LDA. Each topic is given a separate color scheme.

IV-C Conclusions

We proposed a simple yet efficient TF-IDF method to extract and corroborate useful keywords from cancer pathology reports. Encoding a pathology report for cancer and tumor surveillance is a laborious task, and it is sometimes subject to human error and variability in interpretation. One of the most important aspects of encoding a pathology report is extracting the primary diagnosis. This may be very useful for content-based image retrieval, where it can be combined with visual information. We used existing classification models and TF-IDF features to predict the primary diagnosis, achieving up to 92% accuracy with the XGBoost classifier. This prediction accuracy supports the adoption of machine learning methods for automated information extraction from pathology reports.


  • [1] S. Gao, M. T. Young, J. X. Qiu, H.-J. Yoon, J. B. Christian, P. A. Fearn, G. D. Tourassi, and A. Ramanthan, “Hierarchical attention networks for information extraction from cancer pathology reports,”
  • [2] R. Weegar, J. F. Nygård, and H. Dalianis, “Efficient encoding of pathology reports using natural language processing,” in RANLP, pp. 778–783.
  • [3] D. N. Louis, H. Ohgaki, O. D. Wiestler, W. K. Cavenee, P. C. Burger, A. Jouvet, B. W. Scheithauer, and P. Kleihues, “The 2007 WHO classification of tumours of the central nervous system,” Acta Neuropathologica, vol. 114, no. 2, pp. 97–109, 2007.
  • [4] N. Kang, B. Singh, Z. Afzal, E. M. van Mulligen, and J. A. Kors, “Using rule-based natural language processing to improve disease normalization in biomedical text,” Journal of the American Medical Informatics Association, vol. 20, no. 5, pp. 876–881, 2012.
  • [5] A. E. Wieneke, E. J. Bowles, D. Cronkite, K. J. Wernli, H. Gao, D. Carrell, and D. S. Buist, “Validation of natural language processing to extract breast cancer pathology procedures and results,” Journal of Pathology Informatics, vol. 6, 2015.
  • [6] T. D. Imler, J. Morea, C. Kahi, and T. F. Imperiale, “Natural language processing accurately categorizes findings from colonoscopy and pathology reports,” Clinical Gastroenterology and Hepatology, vol. 11, no. 6, pp. 689–694, 2013.
  • [7] R. S. Crowley, M. Castine, K. Mitchell, G. Chavan, T. McSherry, and M. Feldman, “caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research,” Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 253–264, 2010.
  • [8] P. Contiero, A. Tittarelli, A. Maghini, S. Fabiano, E. Frassoldi, E. Costa, D. Gada, T. Codazzi, P. Crosignani, R. Tessandori, et al., “Comparison with manual registration reveals satisfactory completeness and efficiency of a computerized cancer registration system,” Journal of Biomedical Informatics, vol. 41, no. 1, pp. 24–32, 2008.
  • [9] L. W. D’avolio, T. M. Nguyen, W. R. Farwell, Y. Chen, F. Fitzmeyer, O. M. Harris, and L. D. Fiore, “Evaluation of a generalizable approach to clinical information retrieval using the automated retrieval console (ARC),” Journal of the American Medical Informatics Association, vol. 17, no. 4, pp. 375–382, 2010.
  • [10] R. L. Grossman, A. P. Heath, V. Ferretti, H. E. Varmus, D. R. Lowy, W. A. Kibbe, and L. M. Staudt, “Toward a shared vision for cancer genomic data,” New England Journal of Medicine, vol. 375, no. 12, pp. 1109–1112, 2016.
  • [11] E. Loper and S. Bird, “NLTK: The Natural Language Toolkit,” in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02, pp. 63–70, Association for Computational Linguistics.
  • [12] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
  • [13] J. X. Qiu, H. Yoon, P. A. Fearn, and G. D. Tourassi, “Deep learning for automated extraction of primary sites from cancer pathology reports,” vol. 22, no. 1, pp. 244–251.