An Information Extraction Study: Take In Mind the Tokenization!

03/27/2023
by Christos Theodoropoulos, et al.

Research on the advantages and trade-offs of using characters instead of tokenized text as input for deep learning models has evolved substantially. New token-free models remove the traditional tokenization step; however, their efficiency remains unclear. Moreover, the effect of tokenization is relatively unexplored in sequence tagging tasks. To this end, we investigate the impact of tokenization when extracting information from documents and present a comparative study and analysis of subword-based and character-based models. Specifically, we study Information Extraction (IE) from biomedical texts. The main outcome is twofold: tokenization patterns can introduce inductive bias that results in state-of-the-art performance, and character-based models produce promising results; thus, transitioning to token-free IE models is feasible.
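To make the contrast between subword-based and character-based input concrete, here is a minimal sketch in Python. It is not the models or tokenizers used in the study: `TOY_VOCAB` is a hypothetical three-piece vocabulary, and `greedy_subword_tokenize` is an illustrative WordPiece-style greedy longest-match routine written for this example only.

```python
def char_tokenize(text):
    """Character-level input: every character is its own token."""
    return list(text)


def greedy_subword_tokenize(word, vocab):
    """WordPiece-style greedy longest-match-first segmentation (illustrative).

    Non-initial pieces carry a '##' prefix. If some position cannot be
    matched by any vocabulary entry, the whole word maps to '[UNK]'.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest candidate first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for illustration only).
TOY_VOCAB = {"hyper", "##tens", "##ion"}

word = "hypertension"
subwords = greedy_subword_tokenize(word, TOY_VOCAB)
chars = char_tokenize(word)

print(subwords)    # subword view of the word
print(len(chars))  # number of character tokens for the same word
```

With this toy vocabulary the biomedical term is split into a few pieces whose boundaries depend entirely on the vocabulary, whereas the character view is vocabulary-independent but yields a much longer sequence. For sequence tagging, labels would have to be aligned to whichever units the model consumes.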

research
08/01/2021

Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

Commonly-used transformer language models depend on a tokenization schem...
research
04/24/2023

DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents

Information Extraction from visually rich documents is a challenging tas...
research
08/11/2018

From POS tagging to dependency parsing for biomedical event extraction

Given the importance of relation or event extraction from biomedical res...
research
04/30/2018

Syntactic Patterns Improve Information Extraction for Medical Search

Medical professionals search the published literature by specifying the ...
research
05/28/2021

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Most widely-used pre-trained language models operate on sequences of tok...
research
08/04/2023

Chinese Financial Text Emotion Mining: GCGTS – A Character Relationship-based Approach for Simultaneous Aspect-Opinion Pair Extraction

Aspect-Opinion Pair Extraction (AOPE) from Chinese financial texts is a ...
research
06/23/2021

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

State-of-the-art models in natural language processing rely on separate ...
