Urdu Word Segmentation using Conditional Random Fields (CRFs)

06/14/2018
by Haris Bin Zia, et al.

State-of-the-art Natural Language Processing algorithms rely heavily on efficient word segmentation. Urdu is among the languages for which word segmentation is a complex task, as it exhibits both space omission and space insertion issues. This is partly due to the Arabic script, which, although cursive in nature, consists of characters that have inherent joining and non-joining attributes regardless of word boundary. This paper presents a word segmentation system for Urdu that uses a Conditional Random Field sequence modeler with orthographic, linguistic and morphological features. Our proposed model automatically learns to predict white space as a word boundary and the Zero Width Non-Joiner (ZWNJ) as a sub-word boundary. Using a manually annotated corpus, our model achieves an F1 score of 0.97 on the word boundary identification task and 0.85 on the sub-word boundary identification task. We have made our code and corpus publicly available to make our results reproducible.
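To make the sequence-labelling framing concrete, the following is a minimal sketch of how segmented text could be turned into per-character labels that a CRF might be trained to predict, and how predicted labels could be turned back into text. The three-label scheme (`W`, `S`, `O`) and the function names `to_labeled_sequence` and `from_labels` are assumptions made for illustration; the abstract does not specify the label set or feature encoding actually used.

```python
ZWNJ = "\u200c"  # Zero Width Non-Joiner (U+200C)
SPACE = " "


def to_labeled_sequence(segmented):
    """Turn segmented text into (character, label) pairs.

    Hypothetical label scheme, assumed for illustration:
      "W" - a word boundary (white space) follows this character
      "S" - a sub-word boundary (ZWNJ) follows this character
      "O" - no boundary follows this character
    """
    pairs = []
    for ch in segmented:
        if ch == SPACE:
            # Mark the previous character as ending a word.
            if pairs:
                pairs[-1] = (pairs[-1][0], "W")
        elif ch == ZWNJ:
            # Mark the previous character as ending a sub-word,
            # unless it already carries a word-boundary label.
            if pairs and pairs[-1][1] == "O":
                pairs[-1] = (pairs[-1][0], "S")
        else:
            pairs.append((ch, "O"))
    return pairs


def from_labels(chars, labels):
    """Rebuild segmented text from characters and predicted labels."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == "W":
            out.append(SPACE)
        elif label == "S":
            out.append(ZWNJ)
    return "".join(out)


# Latin placeholders stand in for Urdu characters to keep the example readable.
pairs = to_labeled_sequence("ab c" + ZWNJ + "d")
labels = [label for _, label in pairs]
restored = from_labels("abcd", labels)
```

In this framing, the unsegmented character stream is the model input and the label sequence is the output, so space omission and space insertion errors both reduce to predicting the right label after each character.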


