Computationally Efficient Labeling of Cancer Related Forum Posts by Non-Clinical Text Information Retrieval

03/24/2023
by Jimmi Agerskov, et al.

An abundance of information about cancer exists online, but categorizing it and extracting useful information from it is difficult. Almost all research within healthcare data processing is concerned with formal clinical data, but there is valuable information in non-clinical data too. The present study combines methods from distributed computing, text retrieval, clustering, and classification into a coherent and computationally efficient system that can clarify cancer patient trajectories based on non-clinical and freely available information. We produce a fully functional prototype that can retrieve, cluster, and present information about cancer trajectories from non-clinical forum posts. We evaluate three clustering algorithms (MR-DBSCAN, DBSCAN, and HDBSCAN) and compare them in terms of Adjusted Rand Index and total run time as a function of the number of posts retrieved and the neighborhood radius. Clustering results show that the neighborhood radius has the most significant impact on clustering performance. For small values, the data set is split accordingly, but high values produce a large number of possible partitions, and searching for the best partition is therefore time-consuming. With a properly estimated radius, MR-DBSCAN can cluster 50,000 forum posts in 46.1 seconds, compared to 143.4 seconds for DBSCAN and 282.3 seconds for HDBSCAN. We present our software prototype to the Danish Cancer Society in an interview. The organization sees potential in software that can democratize online information about cancer and foresees that such systems will be required in the future.
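
To make the evaluation criterion concrete, the following is a minimal sketch of how a density-based clustering can be scored with the Adjusted Rand Index for different neighborhood radii. It is not the authors' implementation: the function names `dbscan` and `adjusted_rand_index`, the toy 2-D points, and the parameter values are all assumptions chosen for illustration. It uses a plain single-machine DBSCAN with a brute-force neighbor search rather than the distributed MR-DBSCAN, and it treats noise points as one extra label when scoring.

```python
import math
from collections import Counter, deque

NOISE = -1  # label assigned to points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Plain DBSCAN with a brute-force O(n^2) neighbor search (illustrative only)."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within radius eps of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
        cluster += 1
    return labels


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts of the contingency table."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


# Toy data: two well-separated groups of four points each (hypothetical).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
truth = [0, 0, 0, 0, 1, 1, 1, 1]

for eps in (0.5, 1.5, 20.0):
    predicted = dbscan(points, eps=eps, min_pts=3)
    print(eps, predicted, adjusted_rand_index(truth, predicted))
```

On this toy data, a radius that is too small marks every point as noise and a radius that is too large merges both groups into one cluster, so only the intermediate radius recovers the reference labeling. Real forum posts would first need to be converted to numeric vectors, which this sketch does not cover.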


