Improving Structured Text Recognition with Regular Expression Biasing

11/10/2021
by Baoguang Shi, et al.
Microsoft

We study the problem of recognizing structured text, i.e. text that follows certain formats, and propose to improve its recognition accuracy by specifying regular expressions (regexes) for biasing. A biased recognizer recognizes text that matches the specified regexes with significantly improved accuracy, at the cost of a generally small degradation on other text. The biasing is realized by modeling the regexes as a Weighted Finite-State Transducer (WFST) and injecting it into the decoder via dynamic replacement. A single hyperparameter controls the biasing strength. The method is useful for recognizing text lines that have known formats or contain words from a domain vocabulary, such as driver license numbers or drug names in prescriptions. We demonstrate the efficacy of regex biasing on datasets of printed and handwritten structured text and measure its side effects.
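
To make the idea of a single biasing-strength hyperparameter concrete, here is a minimal Python sketch. It is not the WFST dynamic-replacement method described in the abstract. It swaps in a much simpler technique, rescoring an n-best list, and only illustrates how a regex can tip the choice between candidates. The function name `bias_rescore`, the parameter `bias_weight`, the license-number pattern, and the candidate scores are all hypothetical and chosen for illustration.

```python
import re


def bias_rescore(hypotheses, patterns, bias_weight=2.0):
    """Pick the best hypothesis after rewarding regex-matching candidates.

    hypotheses:  list of (text, log_score) pairs from a hypothetical recognizer.
    patterns:    list of regex strings; only a full match earns the bonus.
    bias_weight: single knob for biasing strength (0.0 disables biasing).
    """
    compiled = [re.compile(p) for p in patterns]

    def biased_score(item):
        text, score = item
        matched = any(c.fullmatch(text) for c in compiled)
        return score + (bias_weight if matched else 0.0)

    return max(hypotheses, key=biased_score)[0]


# Hypothetical n-best list: the top candidate confuses the digit 0 with the letter O.
nbest = [("WDL8O12345", -1.0), ("WDL8012345", -1.5)]

# Assumed format for illustration: three uppercase letters followed by seven digits.
license_pattern = r"[A-Z]{3}\d{7}"

unbiased = bias_rescore(nbest, [license_pattern], bias_weight=0.0)
biased = bias_rescore(nbest, [license_pattern], bias_weight=2.0)
```

With `bias_weight=0.0` the unbiased top candidate wins. With a large enough weight, the candidate that matches the format overtakes it. The same mechanism explains the side effect noted in the abstract: a strong bias can also override a correct transcription of text that was never meant to follow the format.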

09/15/2015

Regular expressions for decoding of neural network outputs

This article proposes a convenient tool for decoding the output of neura...
11/09/2021

DataWords: Getting Contrarian with Text, Structured Data and Explanations

Our goal is to build classification models using a combination of free-t...
01/19/2021

VML-MOC: Segmenting a multiply oriented and curved handwritten text lines dataset

This paper publishes a natural and very complicated dataset of handwritt...
05/08/2020

On Vocabulary Reliance in Scene Text Recognition

The pursuit of high performance on public benchmarks has been the drivin...
04/28/2019

TMIXT: A process flow for Transcribing MIXed handwritten and machine-printed Text

Handling large corpuses of documents is of significant importance in man...
04/22/2023

An approach to extract information from academic transcripts of HUST

In many Vietnamese schools, grades are still being inputted into the dat...
12/04/2020

Data-Driven Regular Expressions Evolution for Medical Text Classification Using Genetic Programming

In medical fields, text classification is one of the most important task...