Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese

05/08/2020
by Marek Rychlik, et al.

We report on the results of Worldly OCR, a research and prototype-building project dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts Farsi and Pashto, as well as Latin cursive scripts. We also describe approaches geared towards Traditional Chinese, which is non-cursive but has an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and on Data Science, and it targets vast quantities of original documents, exceeding a billion pages. This paper is intended for a general audience interested in Digital Humanities or in retrieving accurate full text and metadata from digital images.

