A Clustering Framework for Lexical Normalization of Roman Urdu

03/31/2020
by   Abdul Rafae Khan, et al.

Roman Urdu is an informal form of the Urdu language written in Roman script, and it is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora. The framework comprises a phonetic algorithm (UrduPhone), a string matching component, a feature-based similarity function, and a clustering algorithm (Lex-Var). UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.
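To make the pipeline described above more concrete, the following is a minimal Python sketch of how a phonetic key, a string matching score, and a clustering step might fit together. It is an illustration under stated assumptions, not the paper's method: the consonant groups are a simplified Soundex-style table rather than the UrduPhone rules (which the abstract does not specify), the clustering is a plain greedy single-pass scheme rather than Lex-Var, and the names `toy_phonetic_key`, `toy_similarity`, `greedy_cluster`, the weight, and the threshold are all hypothetical. The sample words are illustrative spelling variants only.

```python
from difflib import SequenceMatcher

# Hypothetical consonant classes in the spirit of Soundex.
# These are NOT the UrduPhone rules; they only illustrate the idea
# of mapping similar-sounding letters to the same code.
_GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
_CODE = {ch: digit for digit, chars in _GROUPS.items() for ch in chars}


def toy_phonetic_key(word: str, length: int = 6) -> str:
    """Return a Soundex-style key: first letter plus consonant-class digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODE.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODE.get(ch, "")  # vowels and h, w, y carry no code
        if code and code != prev:  # collapse runs of the same class
            key += code
        prev = code
    return key[:length].ljust(length, "0")


def toy_similarity(a: str, b: str, phonetic_weight: float = 0.5) -> float:
    """Blend a phonetic-key match with a character-level similarity ratio."""
    phonetic = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return phonetic_weight * phonetic + (1.0 - phonetic_weight) * string


def greedy_cluster(words, threshold: float = 0.75):
    """Assign each word to the first cluster whose seed word is similar enough."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if toy_similarity(word, cluster[0]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])  # no match: the word seeds a new cluster
    return clusters


if __name__ == "__main__":
    variants = ["mohabbat", "muhabat", "kitab", "kitaab", "mohabat"]
    for cluster in greedy_cluster(variants):
        print(cluster)
```

With the assumed weight of 0.5 and threshold of 0.75, two words can only share a cluster if their toy phonetic keys match, so the string ratio acts as a tie-breaker among phonetically equivalent spellings. A real system would tune both values and would likely use richer features than this sketch does.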
