Tokenization Repair in the Presence of Spelling Errors

10/15/2020
by Hannah Bast, et al.

We consider the following tokenization repair problem: Given a natural language text with any combination of missing or spurious spaces, correct these. Spelling errors can be present, but it's not part of the problem to correct them. For example, given: "Tispa per isabout token izaionrep air", compute "Tis paper is about tokenizaion repair". It is tempting to think of this problem as a special case of spelling correction or to treat the two problems together. We make a case that tokenization repair and spelling correction should and can be treated as separate problems. We investigate a variety of neural models as well as a number of strong baselines. We identify three main ingredients of high-quality tokenization repair: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present. Our best methods can repair all tokenization errors on 97.5% of the correctly spelled test sentences and on 96.0% of the misspelled test sentences. With all spaces removed from the given text (the scenario from previous work), the accuracy falls to 94.5% [...] analysis.
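To make the problem statement concrete, here is a minimal sketch of what a valid repair looks like. It is not the authors' method, only an illustration of the task definition. A repair may insert or delete spaces, but must leave every other character untouched, misspellings included. The function names `space_flags` and `repair_operations` are hypothetical and chosen for this example.

```python
def space_flags(text):
    """Split text into its non-space characters and, for each one,
    a flag telling whether a space precedes it."""
    chars, flags = [], []
    pending_space = False
    for ch in text:
        if ch == " ":
            pending_space = True
        else:
            chars.append(ch)
            flags.append(pending_space)
            pending_space = False
    return "".join(chars), flags


def repair_operations(corrupt, repaired):
    """List the space edits that turn `corrupt` into `repaired`.

    Each operation is (kind, i): insert or delete the space in front of
    the i-th non-space character. Raises ValueError if the two texts
    differ in anything other than spacing, because that would be
    spelling correction rather than tokenization repair.
    """
    c_chars, c_flags = space_flags(corrupt)
    r_chars, r_flags = space_flags(repaired)
    if c_chars != r_chars:
        raise ValueError("texts differ in more than spacing")
    ops = []
    for i, (had_space, wants_space) in enumerate(zip(c_flags, r_flags)):
        if had_space and not wants_space:
            ops.append(("delete", i))
        elif wants_space and not had_space:
            ops.append(("insert", i))
    return ops


ops = repair_operations(
    "Tispa per isabout token izaionrep air",
    "Tis paper is about tokenizaion repair",
)
for kind, index in ops:
    print(kind, index)
```

On the example from the abstract, this yields three insertions and three deletions, while the misspellings "Tis" and "tokenizaion" pass through unchanged. Spaces already in the right place, such as the one before "is", produce no operation. This is the "space information already present" that a repair method can exploit.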


