Improving Structured Text Recognition with Regular Expression Biasing

by   Baoguang Shi, et al.

We study the problem of recognizing structured text, i.e. text that follows certain formats, and propose to improve the recognition accuracy of structured text by specifying regular expressions (regexes) for biasing. A biased recognizer recognizes text that matches the specified regexes with significantly improved accuracy, at the cost of a generally small degradation on other text. The biasing is realized by modeling regexes as a Weighted Finite-State Transducer (WFST) and injecting it into the decoder via dynamic replacement. A single hyperparameter controls the biasing strength. The method is useful for recognizing text lines with known formats or containing words from a domain vocabulary. Examples include driver license numbers and drug names in prescriptions. We demonstrate the efficacy of regex biasing on datasets of printed and handwritten structured text and measure its side effects.




I Introduction

A practical OCR system is often applied to recognize structured text, i.e. text that follows certain formats. Examples include dates, currencies, phone numbers, and addresses. Structured text poses challenges to OCR recognizers, as such text contains digits and symbols that are hard to tell apart, e.g. “1” (one), “l” (lower-case L), and “/” (slash). Meanwhile, structured text often carries important information and therefore demands higher recognition accuracy.

Figure 1: An illustration of regex biasing applied to recognizing a low-resolution driver license scan. The specified regex consists of 3 parts, concatenated by the OR (“|”) operator, biasing the recognition of license number, expiration date, and sex fields. The fields impacted by the biasing are highlighted by color-coded boxes. This image is a fake sample from California DMV.
Figure 2: A mini WFST decoder representing the language model comprising the words “foo” and “bar”. Thick-lined circles are start states; double-lined circles are final states. Transition labels are formatted as “<input label>:<output label>/<weight>”, or “<input label>:<output label>” when the weight is zero. The auxiliary symbol “#0” is for disambiguation when a word has more than one spelling (e.g. spelling in lower-case and upper-case letters) [povey2011kaldi]. In (a), the weight value 6.9 is calculated from $-\log(0.001)$, meaning unigram probabilities of 0.001 for the words “foo” and “bar”. The transition from state 1 to 2 means a 0.01 bigram probability for “foo bar”.
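The weight arithmetic in the Figure 2 caption can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `prob_to_weight` is hypothetical, and the natural logarithm is assumed because it reproduces the 6.9 value quoted in the caption.

```python
import math

def prob_to_weight(p: float) -> float:
    """Convert a probability into a WFST cost (negative natural log)."""
    return -math.log(p)

# Unigram probability 0.001 for "foo" and "bar" (from the Figure 2 caption).
unigram_weight = prob_to_weight(0.001)
# Bigram probability 0.01 for "foo bar" (from the Figure 2 caption).
bigram_weight = prob_to_weight(0.01)

print(round(unigram_weight, 1))  # 6.9
print(round(bigram_weight, 1))   # 4.6
```

Lower probabilities map to larger costs, which is why a decoder that minimizes total weight prefers likely word sequences.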

Improving the recognition of structured text can be seen as a case of domain adaptation. A common strategy for domain adaptation is to finetune recognition models with domain-specific data. However, finetuning requires collecting a sufficiently large dataset in the same domain. Finetuning can therefore be very expensive, and it is impractical in many cases because the data in the target domain is sensitive.

In many cases, the format of the structured text is known in advance and can be expressed by regular expressions. For example, California car license plate numbers follow “\d[A-Z]{3}\d{3}”, meaning “one digit, followed by three capital letters, followed by three digits”. Knowledge of the format may substantially improve the recognition of structured text, as candidate characters are limited by their positions and contexts.
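As a quick illustration of how such a format constrains candidates, the license plate pattern can be tested with Python's standard `re` module. This is a sketch of plain regex matching only, not of the biasing method itself; the sample strings are made up.

```python
import re

# The license plate format quoted in the text: one digit, three capital
# letters, three digits. re.ASCII restricts \d to the digits 0-9.
PLATE = re.compile(r"\d[A-Z]{3}\d{3}", re.ASCII)

def is_plate(text: str) -> bool:
    """Return True if the whole string follows the plate format."""
    return PLATE.fullmatch(text) is not None

print(is_plate("7ABC123"))  # True
print(is_plate("7A8C123"))  # False: a digit where a letter is required
print(is_plate("IABC123"))  # False: a letter where a digit is required
```

The last two strings differ from a valid plate only by visually similar characters, which is exactly the kind of confusion that format knowledge can resolve.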

In this paper, we propose to inject such knowledge into a text recognizer by biasing it towards user-specified regexes. A biased recognizer will favor text that matches the specified regexes over other similar candidates. Figure 1 illustrates the biasing using a regex describing the formats of license number, expiration date, and sex. A recognizer biased in this way will favor “DL I12345678” over “DL 112345678” because the former matches the specified regex while the latter does not. Consequently, a biased recognizer recognizes structured text with significantly improved accuracy at the cost of a generally small degradation on other text.

We realize regex biasing by expressing the regexes as a Weighted Finite-State Transducer (WFST) [MohriPR02] and using it to decode the outputs of the recognition model. A single hyperparameter controls the weight of the WFST and thus the biasing strength. Specifically, a lower weight makes the recognizer favor the specified regexes more strongly.

Regex biasing enables convenient domain adaptation without training data. This method is not only cost-effective but also particularly useful in situations where training data from the target domain is hard to obtain due to its cost or sensitivity. We test the efficacy of this method on both printed and handwritten text data. A side effect of biasing is the creation of false positives, especially when the biasing strength is high. We measure this side effect in the experiments.

Since our method only concerns the WFST language model, it can be applied to any recognizer that is compatible with WFST decoders. We focus on the type of text recognizer that outputs a sequence of character probabilities, e.g. [ShiBY17, 0001TDCSLH17]. Recognizers of this kind have been shown to deliver state-of-the-art performance on line-level recognition when working together with an n-gram language model.


The rest of the paper is organized as follows. In Section II, we discuss the related works. In Section III, we first briefly present the basics of WFST, then introduce how to model regular expressions as WFSTs and how to use them for decoding. In Section IV, we present the experiment results, demonstrating the efficacy and side-effects of regex biasing.

II Related works

Text recognition has attracted a lot of research effort in recent years [ShiBY17, ChengBXZPZ17, LiaoZWXLLYB19, ShiYWLYB19, AroraGWMSKCRBPE19, 0001TDCSLH17, CaiH17, CongHHG19, LitmanATLMM20, QiaoZYZ020]. These works model text recognition as a sequence prediction problem and employ models such as convolutional networks, LSTM, HMM (Hidden Markov Model), and WFST for sequence modeling. Some works focus on word-level scene text recognition [ShiBY17, ChengBXZPZ17, LiaoZWXLLYB19, ShiYWLYB19] while others address line-level handwriting recognition [AroraGWMSKCRBPE19, 0001TDCSLH17, CaiH17, CongHHG19]. Our character recognition model is based on the CNN-LSTM-CTC framework [ShiBY17], and we use a WFST for the language modeling part. The architecture is similar to the one in [CaiH17], which has been demonstrated to outperform attention-based models when trained with a large-scale dataset and coupled with a language model.

WFST has been highly successful in speech recognition [MohriPR02] and handwriting recognition [AroraGWMSKCRBPE19] research. Because of its flexibility, a WFST can be used for modeling n-gram language models, lexicons, etc. Our WFST building process follows the standard recipe that involves lexicon and grammar modeling.


The idea of biasing WFSTs with domain knowledge has been previously explored in the speech recognition community [AleksicAEKCM15, HaynorA20]. We drew our inspiration from these works. [AleksicAEKCM15] proposes to improve the recognition of contact names by dynamic WFST replacement, which we also use for injecting regex patterns. [HaynorA20] proposes to improve the recognition of numeric sequences by building a numeric grammar modeled by a WFST, which is similar to our idea of improving structured text recognition with regex-defined grammars.

To the best of our knowledge, the idea of regex biasing has not been previously proposed in the literature.

III Method

Figure 3: Example regex operations and their corresponding WFSTs. x can be a symbol or a regular expression. In the examples above, x is a symbol.

III-A Background: WFST decoder

WFST has a long-standing role in speech recognition [MohriPR02] for its language modeling capability. It provides a unified representation of n-gram language model, lexicon, CTC collapsing rule, etc. We refer the readers to [MohriPR02] for a complete tutorial.

A WFST is a finite-state machine whose state transitions are labeled with input symbols, output symbols, and weights. A state transition consumes the input symbol, writes the output symbol, and accumulates the weight. A special symbol $\epsilon$ (epsilon) means consuming no input when used as an input label, or outputting nothing when used as an output label. Therefore, a path through the WFST maps an input string to an output string with a total weight.

A common set of operations is available for WFSTs. Composition ($\circ$) combines two WFSTs: denoting the two WFSTs by $A$ and $B$, if the output space (symbol table) of $A$ matches the input space of $B$, they can be combined by the composition algorithm, as in $A \circ B$. Applying $A \circ B$ to any sequence is equivalent to applying $A$ first, then $B$ to the output of $A$. Determinization and minimization are two standard WFST optimization operations. Determinization makes each WFST state have at most one transition with any given input label and eliminates all input $\epsilon$ (empty) labels. Minimization reduces the number of states and transitions. The common optimization recipe combines the two operations, as in $\min(\det(A))$, and yields an equivalent WFST that is faster to decode and smaller in size.

In the context of OCR, a WFST can be used for decoding the output sequences of a text recognition model. Below we demonstrate the WFST decoder for a CTC-based [GravesFGS06] recognizer. The decoder is composed and optimized from three different WFSTs, denoted by $T$, $L$, and $G$:

$$T \circ \min(\det(L \circ G)) \qquad (2)$$

Here, $G$ (grammar) models n-gram probabilities. Its input and output symbols are words (or sub-word units, such as BPE [SennrichHB16a]) and its transition weights represent n-gram probabilities. $L$ models the lexicon, i.e. the spelling of every word in $G$. Its input space is the set of characters supported by the text recognizer and its output space is the set of words modeled by $G$. Since a CTC-based recognizer outputs extra blank symbols, an extra WFST $T$ is left-composed to perform the “collapsing rule” of CTC. In practice, $T$ is realized by inserting into $L$ states and transitions that consume all blanks and repeated characters. In Figure 2, we illustrate these WFSTs on a mini language model involving only two words, “foo” and “bar”.
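The CTC “collapsing rule” mentioned above can be sketched directly on a label sequence. The snippet below is a minimal illustration in plain Python, not the WFST realization used in the paper; the blank token name `<b>` is an assumption.

```python
from itertools import groupby

BLANK = "<b>"  # hypothetical name for the CTC blank symbol

def ctc_collapse(frames):
    """Apply the CTC collapsing rule: merge repeated labels, then drop blanks."""
    merged = [label for label, _group in groupby(frames)]
    return "".join(label for label in merged if label != BLANK)

# Per-frame labels for the word "foo"; the blank separates the two "o"s
# so that they are not merged into one.
print(ctc_collapse(["f", "f", BLANK, "o", BLANK, "o", "o"]))  # foo
```

A WFST that consumes blanks and repeated characters performs the same mapping, but as part of the decoding graph, so that it can be composed with the lexicon and grammar.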

Decoding with a WFST finds the most probable word (or sub-word unit) sequence given the character observations output by the CTC recognition model. A scaling factor (known as the acoustic weight in speech recognition) controls the weight of the character observations relative to the language model. The most probable path can be approximated by the beam search algorithm. Open-source toolkits such as Kaldi [povey2011kaldi] provide highly efficient decoding implementations.

III-B Modeling regex as WFST

Regular expressions are widely used in computer science for specifying search patterns. A regex can be translated into a finite automaton, for example by Thompson’s construction algorithm [AhoSU86], which yields a nondeterministic automaton that can then be converted into a deterministic finite automaton (DFA). Since a WFST is also a finite automaton, we can convert the DFA of a regex into a WFST by turning every transition label into a pair of identical input and output labels and assigning a unit weight. The resulting unweighted WFST is denoted by $R$.
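The conversion from an automaton to a transducer can be sketched in a few lines. The example below hand-writes the DFA for the regex “ab*” instead of running a regex compiler, and uses a hypothetical dictionary layout for automata; the unit weight is assumed to be 0.0, i.e. the identity of the cost semiring in which weights are added along a path.

```python
# Hand-written DFA for "ab*": state 0 --a--> state 1, state 1 --b--> state 1.
dfa = {"start": 0, "finals": {1}, "arcs": [(0, "a", 1), (1, "b", 1)]}

def dfa_to_transducer(dfa, unit_weight=0.0):
    """Turn each acceptor arc (src, label, dst) into a transducer arc
    (src, input label, output label, weight, dst) with identical labels."""
    arcs = [(src, lab, lab, unit_weight, dst) for src, lab, dst in dfa["arcs"]]
    return {"start": dfa["start"], "finals": set(dfa["finals"]), "arcs": arcs}

def transduce(fst, text):
    """Run a deterministic transducer on `text`.
    Returns (output string, total weight), or None if the text is rejected."""
    state, out, total = fst["start"], [], 0.0
    for ch in text:
        for src, ilab, olab, weight, dst in fst["arcs"]:
            if src == state and ilab == ch:
                state, total = dst, total + weight
                out.append(olab)
                break
        else:
            return None  # no transition consumes this character
    return ("".join(out), total) if state in fst["finals"] else None

regex_fst = dfa_to_transducer(dfa)
print(transduce(regex_fst, "abb"))  # ('abb', 0.0)
print(transduce(regex_fst, "ba"))   # None
```

The transducer maps every matching string to itself with zero total weight and rejects everything else, which is why it is called unweighted.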

Figure 3 demonstrates some basic regex operations and their corresponding WFSTs. Using these operators, we can build regexes that match complex patterns. In practice, we rely on the open-source grammar compiler Thrax [RoarkSARST12] to compile regexes directly to WFSTs. The syntax of Thrax is similar to common regex syntaxes such as that of Python, but it also supports more advanced features such as functions. The pseudo regex syntax we have used so far can be easily translated into equivalent Thrax statements. For example, the digit matcher “\d” in Figure 1 is replaced by a constant “DIGIT” in Thrax’s syntax, which is defined as “DIGIT = "0" | "1" | ... | "9"”.

We give regex WFSTs small or even negative transition weights so that the paths through them are favored by the decoder. In practice, we use a length-linear function to assign the weights of the WFST transitions. This is implemented by left-composing a scoring WFST $S_\alpha$ with the unweighted regex WFST $R$:

$$R_\alpha = S_\alpha \circ R$$

Here, $S_\alpha$ is a scoring WFST with a single state, which is both the start and the final state, and a number of self-loop transitions whose input and output labels are the supported symbols (characters). The weights of these transitions are set to a constant $\alpha$. After the composition, the total weight of a path in $R_\alpha$ for a matching text string is $\alpha n$, where $n$ is the length of the string. In this way, we can control the biasing strength by adjusting $\alpha$: lowering $\alpha$ increases the biasing strength.
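The effect of the length-linear weight can be illustrated with a toy rescoring example. This is a deliberately simplified sketch, not the WFST decoder: the license-number pattern, the candidate costs, and the function names are all hypothetical, and `alpha` stands for the constant per-character weight described above.

```python
import re

LICENSE = re.compile(r"[A-Z]\d{8}")  # hypothetical license-number format

def regex_path_weight(text, alpha):
    """Total weight alpha * n of the regex path for a matching string of
    length n, or None if the string does not match the regex."""
    if LICENSE.fullmatch(text) is None:
        return None
    return alpha * len(text)

def best_candidate(candidates, alpha):
    """Pick the cheapest candidate. Each candidate is a tuple
    (text, observation cost, language model cost). A matching candidate
    may take the regex path when that is cheaper than the language model."""
    def total_cost(candidate):
        text, observation_cost, lm_cost = candidate
        bias = regex_path_weight(text, alpha)
        path_cost = lm_cost if bias is None else min(lm_cost, bias)
        return observation_cost + path_cost
    return min(candidates, key=total_cost)[0]

# Made-up costs: the image slightly prefers the wrong reading "112345678".
candidates = [("112345678", 3.0, 9.0), ("I12345678", 3.6, 12.0)]

print(regex_path_weight("I12345678", -0.5))  # -4.5
print(best_candidate(candidates, alpha=2.0))  # 112345678 (regex path too costly)
print(best_candidate(candidates, alpha=0.0))  # I12345678 (regex path wins)
```

In this toy setting the bias already takes effect at a weight of zero, because the competing language model cost is positive; lowering the weight further only widens the margin.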

III-C Decoding with regex biasing

The weighted regex WFST cannot be used directly for decoding, since it only accepts text matching the regex. We combine it with the base language model so that the decoder can output any text.

To achieve this, we first modify the base language model to add special transitions that are labeled with a nonterminal [chomsky2002syntactic] symbol $REGEX. The modified WFST is known as a class-based language model [BrownPdLM92]. One way to add $REGEX is to add it as a unigram word to the grammar WFST, so that regex biasing can be applied to part of a sentence while the rest of the sentence is scored by the base language model. The lexicon WFST is also modified to add the spelling of $REGEX, which is a single epsilon ($\epsilon$). The modified grammar and lexicon WFSTs are then composed and optimized in the same way as before to obtain the modified base language model.

The modified base language model and the weighted regex WFST can be combined using the WFST replacement operation, which replaces transitions labeled with $REGEX with the regex WFST. Figure 4 illustrates this process on the aforementioned mini language model. The modified base language model has two additional transitions labeled with $REGEX. After replacement, state 0 and state 7 in the combined WFST (the latter corresponding to state 5 in the modified base language model) both have a transition to state 1, effectively acting as the entry and return points of the regex WFST. After the replacement, we can make the combined WFST into a CTC-compatible decoder using the composition in Equation 2.

In practice, we use an operation called dynamic replacement [AllauzenRSSM07] to perform the combination. We first build CTC-compatible WFSTs from the modified base language model and from the weighted regex WFST separately. Then, during decoding, the $REGEX transitions in the former are replaced by the latter on demand. Since the main language model WFST may contain millions of states and transitions and is costly to update, dynamic replacement allows us to keep the main language model fixed and only update the regex WFST.
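The replacement operation can be sketched on toy automata. The code below is a static, simplified illustration and makes several assumptions: the dictionary layout, the function names, and all weights are hypothetical; the base model is a character-level toy that only knows the word “foo”; and each $REGEX transition receives its own copy of the regex WFST, whereas a real toolkit performs the replacement lazily during decoding.

```python
EPS = "<eps>"     # epsilon: consumes no input and emits no output
REGEX = "$REGEX"  # nonterminal label to be replaced

def num_states(fst):
    return 1 + max(max(src, dst) for src, _i, _o, _w, dst in fst["arcs"])

def replace(base, nonterminal, sub):
    """Replace every arc of `base` labeled `nonterminal` by a copy of `sub`,
    entered and left through epsilon arcs."""
    arcs, next_state = [], num_states(base)
    for src, ilab, olab, weight, dst in base["arcs"]:
        if olab != nonterminal:
            arcs.append((src, ilab, olab, weight, dst))
            continue
        offset, next_state = next_state, next_state + num_states(sub)
        arcs.append((src, EPS, EPS, weight, offset + sub["start"]))  # entry
        arcs.extend((offset + s, i, o, w, offset + d) for s, i, o, w, d in sub["arcs"])
        arcs.extend((offset + f, EPS, EPS, 0.0, dst) for f in sub["finals"])  # return
    return {"start": base["start"], "finals": set(base["finals"]), "arcs": arcs}

def best_weight(fst, text):
    """Lowest total weight of a path consuming `text`, or None if rejected.
    Assumes the WFST has no epsilon-only cycles."""
    best = {(fst["start"], 0): 0.0}
    stack = [(fst["start"], 0)]
    while stack:
        state, pos = stack.pop()
        cost = best[(state, pos)]
        for src, ilab, _olab, weight, dst in fst["arcs"]:
            if src != state:
                continue
            if ilab == EPS:
                nxt = (dst, pos)
            elif pos < len(text) and text[pos] == ilab:
                nxt = (dst, pos + 1)
            else:
                continue
            if cost + weight < best.get(nxt, float("inf")):
                best[nxt] = cost + weight
                stack.append(nxt)
    ends = [best[(f, len(text))] for f in fst["finals"] if (f, len(text)) in best]
    return min(ends) if ends else None

# Toy base model: the word "foo" (cost 7.0) plus a $REGEX transition (cost 1.0).
base = {"start": 0, "finals": {0}, "arcs": [
    (0, "f", "foo", 7.0, 1), (1, "o", EPS, 0.0, 2), (2, "o", EPS, 0.0, 0),
    (0, REGEX, REGEX, 1.0, 0),
]}
# Weighted regex WFST for "ab*" with a per-character weight of -0.5.
regex_fst = {"start": 0, "finals": {1},
             "arcs": [(0, "a", "a", -0.5, 1), (1, "b", "b", -0.5, 1)]}

combined = replace(base, REGEX, regex_fst)
print(best_weight(base, "fooabb"))      # None: "abb" is not in the base model
print(best_weight(combined, "fooabb"))  # 6.5 = 7.0 + 1.0 - 3 * 0.5
```

Because only the small regex WFST is spliced in, the base model can stay fixed while the regex changes, which is the motivation for doing the replacement dynamically.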

Figure 4: Illustration of WFST replacement. (a) Modified WFST with the nonterminal symbol $REGEX; (b) weighted regex WFST representing the regex “ab*”; (c) unoptimized WFST after replacement.

IV Experiments

In this section, we demonstrate the efficacy of regex biasing through different combinations of dataset and regex setting. We focus on the improvements on the structured text of interest as well as the side effects, i.e. degradation on other text.

IV-A Evaluation datasets

Throughout the experiments, recognition is performed on text lines rather than isolated words. This setting fits real-world scenarios and is necessary for regex biasing since structured text often comes in multiple words that are separated by space (e.g. “Date of Birth: Aug 1, 1990”). To the best of our knowledge, there are no public datasets of such structured text. Therefore, we collect two datasets of printed text and simulate structured text recognition on a handwritten text dataset.

Driver Licenses This dataset consists of scans of US driver licenses from different states. The licenses are fake samples we collected from the Internet. To simulate real-world imaging conditions, we printed the samples on paper and took photos of them with phone cameras. We ran an in-house text detector to extract text lines and labeled them manually. There are 737 text lines from 50 distinct licenses. Many lines contain structured text, such as date of birth and driver’s license number. To simulate poor imaging conditions, we further augmented each image by degrading its quality, obtaining 4422 text line images for the final dataset.

Passport MRZ This dataset contains 8040 text lines extracted from the machine-readable zones (MRZ) of a collection of passport scans from different countries. Each MRZ contains two text lines. Usually, MRZs are scanned by specialized passport readers and their images are in high resolution. But in this dataset, images have much lower quality in terms of resolution, lighting condition, etc.

IAM [MartiB02] This dataset has been a standard dataset for handwriting recognition. The test set of IAM contains 1861 handwritten text lines. Some text lines of IAM contain out-of-vocabulary words such as people’s names. We use this dataset to test the efficacy of biasing a word list.

Table I: Regex definition for the driver license dataset in Thrax’s syntax. “UPPER” and “DIGIT” are predefined constants, meaning upper-case letters and digits respectively.

IV-B Recognition models

Our recognizer consists of a character-level recognition model (character model) and a language model. The character model follows the ConvNet-LSTM-CTC design [ShiBY17]. We use the compact model architecture from [0001TDCSLH17]. Since our model needs to be able to perform line-level recognition, we trained our model on an internal dataset of 1.6 million text lines. Images in this dataset come from multiple sources, such as documents, receipts, scene images, and are in printed style. This model is referred to as the printed model. Similarly, we train another model for handwriting recognition on a dataset consisting of 248k handwritten text lines (handwriting model).

We use the Kaldi toolkit [povey2011kaldi] to build the language model. The language model is trained on a large text corpus comprising 67 million lines of text, collected from various sources. The lexicon WFST is built from a vocabulary of 100k words and sub-word units, learned from the training data using the byte-pair encoding algorithm [SennrichHB16a] implemented in the sentencepiece library [KudoR18]. The grammar WFST is built from an n-gram model containing unigrams and bigrams.

We use the OpenGrm toolkit [RoarkSARST12] to build regex WFSTs and convert them into CTC-compatible decoders using a Python binding of Kaldi [pykaldi]. Building a regex WFST typically takes a few seconds.

IV-C Regex biasing for driver licenses

The formats of US driver licenses differ from state to state. We set regexes for a common set of fields, including date of birth, issuing date, expiration date, etc. The full regex is in Table I. We set different regexes for 9 types of fields. The final regex is the union of these regexes, joined by the OR operator.

We test the WFST with different biasing strengths and measure the performance in terms of word error rate (WER). The recognition model is the printed model. To test the efficacy of regex biasing on matching text lines and its side effects on other text lines, we divide the full set into two subsets, “fields” and “non-fields”, and calculate their WERs separately.
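For reference, word error rate and character error rate can both be computed from the edit distance between the reference and the recognized text. The sketch below shows the standard definition; the function names are illustrative and it is not the evaluation script used in the experiments.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the sequence `ref` into the sequence `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(pairs, unit="word"):
    """Error rate in percent over (reference, hypothesis) pairs.
    unit="word" gives WER, unit="char" gives CER."""
    errors = total = 0
    for ref, hyp in pairs:
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * errors / total

pairs = [("DL I12345678", "DL 112345678")]
print(error_rate(pairs, unit="word"))  # 50.0: one of two words is wrong
print(round(error_rate(pairs, unit="char"), 1))  # 8.3: one of twelve characters
```

A single confused character makes the whole word count as an error, so WER is the stricter of the two metrics for long identifiers such as license numbers.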

Subset        No bias   Biased (five settings, biasing strength increasing from left to right)
Full          5.5       5.0     4.9     4.9     4.9     5.6
  change                -9%     -11%    -11%    -11%    +2%
Fields        5.7       4.1     3.8     3.8     3.7     4.1
  change                -28%    -33%    -33%    -35%    -28%
Non-fields    5.4       5.4     5.4     5.4     5.5     6.4
  change                0%      0%      0%      +2%     +19%
Table II: Word error rate (%) on the driver license dataset. “Full” is the full dataset; “Fields” is the subset where the text matches the specified regex, partially or fully; “Non-fields” is the subset where the text does not match the regex. The “change” rows give the WER change relative to the “no bias” WER.

Table II summarizes the results. Without regex biasing, the WER on the full set is 5.5%, and the WERs on the two subsets are close. At the weakest biasing setting, the WER on fields drops to 4.1%, while the WER on non-fields does not change. It is worth noting that, since the weights in the main language model are mostly positive, the regex weight does not have to be negative to create a bias.

As we further increase the biasing strength, the WER on fields drops to 3.7% at the fourth biasing setting in Table II, reducing the errors by 35% compared with no biasing, while the WER on non-fields slightly increases to 5.5% due to false positives, i.e. text that is mistakenly recognized as matching the regex. The overall WER is at its lowest, 4.9%.

Beyond this setting, the number of false positives increases drastically and the overall WER rises, indicating that the regex biasing is too strong. The WER on matching fields also increases beyond this point. There are two factors behind this increase: 1) false positives appear in the part of the text that does not match the regex; 2) because beam search sets a maximum weight difference between the allowed candidates and the best one, a lower biasing weight leads to the pruning of some correct paths. As we increase the beam size, the WER on fields at the strongest setting moves back toward that of the weaker settings.

The optimal biasing weight depends on the application. A lower weight improves the accuracy of text matching the regex at the cost of more false positives. As a rule of thumb, the settings up to the one with the lowest WER on fields in Table II significantly reduce errors on matching text and bring limited regression.

Figure 5 shows the efficacy and side effects of regex biasing through examples. With regex biasing, the recognizer correctly recognizes some highly challenging examples that are hard to recognize without prior knowledge of their formats. On the other hand, over-biasing leads to more false positives. One such failure case is displayed in the last row.

Figure 5: Examples of regex biasing on the driver license dataset. “Biased” results are obtained with regex biasing. Recognition errors are highlighted in red.
Table III: Regex definition for passport MRZ. Definitions for “yy”, “mm”, and “dd” are reused from Table I.

IV-D Regex biasing for passport MRZ

Passport MRZ follows the international standard ISO/IEC 7501-1. The standard specifies the number of characters per line (44) and the allowed characters at each position. Using the standard, we wrote the regexes shown in Table III and compiled the final regex mrz_line into a WFST.

We measure the recognition performance on this dataset using character error rate (CER). We use the printed model for this dataset. Table IV summarizes the results under different biasing weights. Again, we observe a significant improvement in character error rate after biasing: the CER is reduced by about 36% across the first four biasing settings. We also observe that the biased recognizer avoids many common mistakes of the unbiased recognizer, such as confusing “0” with “O”. As the biasing weight decreases further, we see an increase in CER, again caused by early pruning in the beam search.

Subset        No bias   Biased (five settings, biasing strength increasing from left to right)
Full          8.5       5.4     5.4     5.4     5.5     6.8
  change                -36%    -37%    -36%    -35%    -20%
Table IV: Character error rate (%) on the MRZ dataset.

IV-E Biasing a domain-specific vocabulary

Regex biasing can be used for biasing a domain-specific vocabulary. A list of words can be expressed as a regex by joining them with the OR operator, as in “<word1>|<word2>| ... |<wordN>”. Biasing towards such regexes is useful for recognizing words in a domain-specific vocabulary. For example, when trying to recognize handwritten prescriptions, we can set a domain vocabulary of common drug names and bias the recognizer towards them.
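Turning a word list into such a regex is mechanical. The sketch below uses Python's `re` syntax rather than Thrax's, and the drug names are made-up examples; it only illustrates the construction of the OR-joined pattern.

```python
import re

def vocabulary_regex(words):
    """Join a word list into a single regex with the OR operator,
    escaping characters that have a special meaning in regex syntax."""
    return re.compile("|".join(re.escape(word) for word in words))

drug_names = ["ibuprofen", "amoxicillin", "metformin"]  # hypothetical vocabulary
pattern = vocabulary_regex(drug_names)

print(pattern.pattern)                              # ibuprofen|amoxicillin|metformin
print(pattern.fullmatch("metformin") is not None)   # True
print(pattern.fullmatch("metforrnin") is not None)  # False: "rn" misread for "m"
```

Escaping matters for vocabularies that contain characters such as “.” or “+”, which would otherwise be interpreted as regex operators.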

We use the IAM dataset to simulate such a scenario. Part of the text in IAM comes from a novel. Some character names have uncommon spellings and therefore are recognized with lower accuracies. Using the 37 character names, we set the regex shown in Table V.

Table V: Regex definition for IAM.

We measure the performance of regex biasing using two metrics: 1) the error rate on the 38 names, calculated as the sum of insertion, deletion, and substitution errors divided by the number of name appearances; 2) the WER on the whole dataset. The results are summarized in Table VI. As the biasing strength increases, we see a significant drop in the errors on the names. The WER on the full set also goes down until the second-strongest setting. As the biasing weight decreases further, the errors on names keep decreasing, but the number of false positives increases and the overall WER also increases.

Some examples are shown in Figure 6. With regex biasing, the recognizer can recognize highly ambiguous words from the domain vocabulary, while not affecting the recognition of other words.

Subset        No bias   Biased (six settings, biasing strength increasing from left to right)
Names         36.0      28.0    23.7    18.0    15.7    11.7    10.3
  change                -22%    -34%    -50%    -56%    -68%    -71%
Full          14.3      14.2    14.1    14.0    14.0    14.0    15.8
  change                -1%     -1%     -2%     -2%     -3%     +10%
Table VI: Name error rate (%) and full set WER (%) on IAM.
Figure 6: Examples of regex biasing on IAM. “Biased” results are obtained with regex biasing. Recognition errors are highlighted in red. Images are framed by thin black lines for clarity.

IV-F Runtime analysis

Dataset           No bias   Biased (three settings, biasing strength increasing from left to right)
Driver License    1.6       1.6     1.6     1.5
Passport MRZ      14.6      12.2    5.9     1.0
IAM               6.8       7.1     6.8     6.8
Table VII: Runtime analysis of regex biasing on different datasets. Numbers are the average time (in milliseconds) for recognizing one text line.

Finally, we analyze the impact of regex biasing on inference time in Table VII. Overall, the time added by regex biasing ranges from less than 1 ms to negative, i.e. a speed-up. On the driver license dataset, regex biasing has little impact on runtime, and the inference time decreases as the biasing weight decreases. This is because a lower weight reduces the number of search paths and therefore accelerates the search process. This phenomenon is more pronounced on the passport dataset, where the inference time drops by more than 10 times (from 14.6 ms to 1.0 ms) as the weight is lowered.

V Conclusion

We have proposed a novel method for biasing a recognizer using regular expressions. This method effectively improves the performance of a recognizer on domain-specific data, requires neither labeled data nor a training process, and has limited impact on runtime speed.

With some modifications, the decoder we have used in this paper may also work with auto-regressive text recognition models, such as attention-based recognizers [LitmanATLMM20, ShiYWLYB19, QiaoZYZ020] and RNN transducer models [abs-1211-3711]. This can be explored in the future.

Structured text comes in many other forms for which the proposed biasing method may be useful. For example, math equations (represented by LaTeX code) are strongly structured. A recognizer may benefit from limiting its search space to the one defined by the LaTeX syntax. We are also interested in exploring this direction in the future.