A Unified Multilingual Handwriting Recognition System using multigrams sub-lexical units

08/28/2018
by   Wassim Swaileh, et al.

We address the design of a unified multilingual system for handwriting recognition. Most multilingual systems rest on specialized models, each trained on a single language, one of which is selected at test time. While some recognition systems are based on a unified optical model, building a unified language model remains a major issue, as traditional language models are generally trained on corpora with a large word lexicon per language. Here, we propose a solution based on language models built from sub-lexical units, called multigrams. Using multigrams strongly reduces the lexicon size and thus decreases the language model complexity. This makes possible the design of an end-to-end unified multilingual recognition system in which both a single optical model and a single language model are trained on all the languages. We discuss the impact of language unification on each model and show that our system reaches the performance of state-of-the-art methods with a strong reduction in complexity.
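
To make the lexicon-size argument concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not say how multigrams are derived, so this example assumes a hand-picked unit inventory and a greedy longest-match segmentation. The function `segment`, the word list, and the unit set are all hypothetical, chosen only to show how a few shared sub-lexical units can cover a larger word list.

```python
def segment(word, units, max_len=4):
    """Greedy longest-match split of `word` into pieces found in `units`.

    Assumption for illustration only: single characters are always
    accepted as a fallback, so every word can be segmented.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + n]
            if n == 1 or piece in units:
                pieces.append(piece)
                i += n
                break
    return pieces


# Hypothetical toy word list (several entries are spelled the same
# in English and French) and a hand-picked unit inventory.
words = [
    "nation", "national", "station", "relation", "relational",
    "ration", "rational", "notation", "notational",
    "rotation", "rotational",
]
units = {"na", "tion", "al", "sta", "re", "la", "ra", "no", "ta", "ro"}

# Distinct units actually needed to spell every word in the list.
unit_lexicon = {p for w in words for p in segment(w, units)}

for w in words:
    print(w, "->", " ".join(segment(w, units)))
print("word lexicon size:", len(set(words)))
print("unit lexicon size:", len(unit_lexicon))
```

In this toy setting the unit lexicon is only slightly smaller than the word lexicon. The point of the sketch is the trend: each new word that reuses existing units adds nothing to the unit lexicon, so the gap widens as the vocabulary, or the number of languages, grows.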
