An unsupervised and customizable misspelling generator for mining noisy health-related text sources

06/04/2018
by Abeed Sarker, et al.

In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms. The spelling variant generator relies on a dense vector model learned from large unlabeled text, which is used to find terms semantically close to the original/seed keyword; terms that are lexically dissimilar beyond a given threshold are then filtered out. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. Weighting of intra-word character sequence similarities allows further problem-specific customization of the system. On a dataset prepared for this study, our system outperforms the current state-of-the-art for medication name variant generation, with a best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms showed an increase of over 67 Twitter posts when the generated variants are included. Our proposed spelling variant generator has several advantages over the current state-of-the-art and other types of variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customization. The performance and relative simplicity of our proposed approach make it a much-needed misspelling generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research purposes.
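
The generate-and-filter loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the released source code: the two-dimensional toy vectors in `TOY_VECTORS`, the function names, the neighbour count `k`, the threshold of 0.7, and the use of normalized Levenshtein distance as the lexical measure (the weighted character sequence similarity mentioned above is not modelled here).

```python
from math import sqrt

# Hypothetical toy embeddings; a real system would load vectors learned from large unlabeled text.
TOY_VECTORS = {
    "metformin": [1.0, 0.0],
    "metformine": [0.98, 0.05],
    "metfomin": [0.97, 0.08],
    "metphormin": [0.95, 0.12],
    "glucophage": [0.96, 0.10],  # semantically close, lexically dissimilar
    "melatonin": [0.0, 1.0],     # semantically distant
    "aspirin": [0.1, 0.9],
}

def levenshtein(a, b):
    """Classic edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lexical_similarity(a, b):
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def nearest_terms(term, vectors, k):
    """Return the k terms whose vectors are closest to the vector of `term`."""
    scored = [(cosine(vectors[term], vec), other)
              for other, vec in vectors.items() if other != term]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

def generate_variants(seed, vectors, k=3, threshold=0.7):
    """Expand from the seed until no new lexically similar neighbours appear."""
    variants = set()
    frontier = [seed]
    while frontier:
        term = frontier.pop()
        for candidate in nearest_terms(term, vectors, k):
            if candidate == seed or candidate in variants:
                continue
            # Keep only semantic neighbours that are also lexically close to the seed.
            if lexical_similarity(seed, candidate) >= threshold:
                variants.add(candidate)
                frontier.append(candidate)
    return variants

if __name__ == "__main__":
    print(sorted(generate_variants("metformin", TOY_VECTORS)))
```

In this toy setup, "metphormin" is not among the three nearest neighbours of the seed and is reached only through the neighbours of "metfomin", which is the role the recursion plays. "glucophage" is semantically close throughout but is rejected by the lexical threshold.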

