Comparing Variation in Tokenizer Outputs Using a Series of Problematic and Challenging Biomedical Sentences

05/15/2023
by   Christopher Meaney, et al.

Background and Objective: Biomedical text data are increasingly available for research. Tokenization is an initial step in many biomedical text mining pipelines. It is the process of parsing an input biomedical sentence (represented as a digital character sequence) into a discrete set of word/token symbols that convey focused semantic/syntactic meaning. The objective of this study is to explore variation in tokenizer outputs when applied across a series of challenging biomedical sentences. Method: Diaz [2015] introduces 24 challenging example biomedical sentences for comparing tokenizer performance. In this study, we descriptively explore variation in the outputs of eight tokenizers applied to each example biomedical sentence. The tokenizers compared are the NLTK whitespace tokenizer, the NLTK Penn Treebank tokenizer, the spaCy and scispaCy tokenizers, the Stanza/Stanza-Craft tokenizers, the UDPipe tokenizer, and R-tokenizers. Results: For many examples, the tokenizers performed similarly; for certain examples, however, there was meaningful variation in the returned outputs. The whitespace tokenizer often performed differently from the other tokenizers. We observed performance similarities between tokenizers implementing rule-based systems (e.g. pattern matching and regular expressions) and tokenizers implementing neural architectures for token classification. Often, the challenging tokens that produce the greatest variation in outputs are words that convey substantive and focused biomedical/clinical meaning (e.g. x-ray, IL-10, TCR/CD3, CD4+ CD8+, and (Ca2+)-regulated). Conclusion: When state-of-the-art, open-source tokenizers from Python and R were applied to a series of challenging biomedical example sentences, we observed subtle variation in the returned outputs.
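To make the kind of variation described above concrete, the following is a minimal sketch using only the Python standard library. It is not one of the eight tokenizers compared in the study: `whitespace_tokenize` and `punct_split_tokenize` are hypothetical helper names, the punctuation-splitting rule is an assumed, deliberately naive regular expression, and the short phrases are made up for illustration around tokens named in the abstract rather than taken from the Diaz [2015] sentence set.

```python
import re


def whitespace_tokenize(sentence):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return sentence.split()


# Assumed illustrative rule (not from the study): a run of word characters
# is one token, and every other non-space character is its own token.
_PUNCT_SPLIT = re.compile(r"\w+|[^\w\s]")


def punct_split_tokenize(sentence):
    """Naive rule-based tokenizer that isolates every punctuation character."""
    return _PUNCT_SPLIT.findall(sentence)


# Made-up phrases built around tokens mentioned in the abstract.
examples = ["IL-10 levels rose.", "TCR/CD3 complex", "CD4+ CD8+ cells"]

for text in examples:
    print("whitespace :", whitespace_tokenize(text))
    print("punct-split:", punct_split_tokenize(text))
```

Under these assumptions, the whitespace rule keeps "IL-10" and "TCR/CD3" intact but leaves the sentence-final period attached to "rose.", while the punctuation-splitting rule separates the period but also breaks "IL-10" into three pieces. Real tokenizers sit between these two extremes, which is one plausible reason their outputs diverge on exactly such tokens.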
