Automatic Classification of Pathology Reports using TF-IDF Features

03/05/2019
by Shivam Kalra, et al.

A pathology report is arguably one of the most important documents in medicine, containing interpretive information about the visual findings from a patient's biopsy sample. Each pathology report has a retention period of up to 20 years after the treatment of a patient. Cancer registries across the world process and encode high volumes of free-text pathology reports for surveillance of cancer and tumor diseases. Despite the extremely valuable information they hold, pathology reports are not used in any systematic way to facilitate computational pathology. Therefore, in this study, we investigate automated machine-learning techniques to predict the primary diagnosis (based on ICD-O code) from pathology reports. We performed experiments by extracting TF-IDF features from the reports and classifying them using three different methods: SVM, XGBoost, and Logistic Regression. We constructed a new dataset with 1,949 pathology reports arranged into 37 ICD-O categories, collected from four different primary sites, namely lung, kidney, thymus, and testis. The reports were manually transcribed into text format after collecting them as PDF files from the NCI Genomic Data Commons public dataset. We subsequently pre-processed the reports by removing irrelevant textual artifacts produced by OCR software. The highest classification accuracy we achieved was 92%, using an XGBoost classifier on TF-IDF feature vectors; the linear SVM scored 87%. Furthermore, the study shows that TF-IDF vectors are suitable for highlighting the important keywords within a report, which can be helpful for cancer research and the diagnostic workflow. The results are encouraging in demonstrating the potential of machine learning methods for classification and encoding of pathology reports.
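To make the keyword-highlighting idea concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It assumes the textbook variant (term frequency times log(N / document frequency)); the abstract does not say which weighting or tokenization the authors used, so this may differ from their setup. The three one-line "reports" and the helper names `tokenize`, `tfidf_vectors`, and `top_keywords` are invented for illustration and are not taken from the study's dataset or code.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, keep alphabetic tokens.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def top_keywords(vector, k=3):
    # Highest weight first; ties are broken alphabetically for determinism.
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, made-up snippets standing in for pathology reports (assumption).
reports = [
    "lung biopsy shows adenocarcinoma of the lung",
    "kidney biopsy shows renal cell carcinoma",
    "testis biopsy shows seminoma",
]
vectors = tfidf_vectors(reports)
for report, vec in zip(reports, vectors):
    print(top_keywords(vec, k=2), "<-", report)
```

Words that occur in every report (here "biopsy" and "shows") receive a weight of zero, while site-specific terms rise to the top, which is the property that makes the weights usable as keyword highlights. In a classification setting, vectors like these would be passed to a classifier such as a linear SVM, logistic regression, or XGBoost; that step is omitted here to keep the sketch dependency-free.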

