Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT

10/01/2020
by Ehsan Doostmohammadi, et al.

Words are properly segmented in the Persian writing system; in practice, however, these writing rules are often neglected, resulting in single words being written disjointedly and multiple words written without any white spaces between them. This paper addresses the problems of word segmentation and zero-width non-joiner (ZWNJ) recognition in Persian, which we approach jointly as a sequence labeling problem. We achieve a macro-averaged F1-score of 92.40 on a carefully collected corpus of 500 sentences with a high level of difficulty.
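To make the joint sequence-labeling framing concrete, here is a minimal sketch of how space insertion and ZWNJ insertion can share a single tag set. The three-tag scheme (`N`, `S`, `Z`), the per-character granularity, and the function names `encode` and `decode` are assumptions made for illustration; the abstract does not specify the label set or tokenization the authors use. In a full system, a tagger such as a fine-tuned BERT model would predict the tags from noisy input; here they are derived from correctly written text.

```python
ZWNJ = "\u200c"  # zero-width non-joiner

# Hypothetical tag set (an assumption, not taken from the paper):
# "N" = nothing follows, "S" = a space follows, "Z" = a ZWNJ follows.
SEPARATOR_TO_TAG = {" ": "S", ZWNJ: "Z"}
TAG_TO_SEPARATOR = {"N": "", "S": " ", "Z": ZWNJ}


def encode(text):
    """Split correctly written text into characters and per-character tags."""
    chars, tags = [], []
    for ch in text:
        if ch in SEPARATOR_TO_TAG:
            if tags:  # separators before the first character are ignored
                tags[-1] = SEPARATOR_TO_TAG[ch]
        else:
            chars.append(ch)
            tags.append("N")
    return chars, tags


def decode(chars, tags):
    """Rebuild text by appending the separator that each tag calls for."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    return "".join(ch + TAG_TO_SEPARATOR[tag] for ch, tag in zip(chars, tags))


if __name__ == "__main__":
    # "می‌روم خانه": a ZWNJ inside the first word, a space between the words.
    gold = "می" + ZWNJ + "روم خانه"
    chars, tags = encode(gold)
    print(tags)
    print(decode(chars, tags) == gold)
```

Because both kinds of boundary are encoded in one tag per character, a single tagger can correct missing spaces, spurious spaces, and missing ZWNJs in one pass; the input to the tagger would simply be the character sequence with all separators stripped or left as written.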


