Language Lexicons for Hindi-English Multilingual Text Processing

06/29/2021
by Mohd Zeeshan Ansari, et al.

Language identification in textual documents is the process of automatically detecting the language of a document from its content. Current language identification techniques presume that a document contains text in exactly one of a fixed set of languages; this presumption fails for multilingual documents, which include content in more than one language. Because large standard corpora for Hindi-English mixed-language processing tasks are unavailable, we propose language lexicons, a novel kind of lexical database that supports several multilingual language processing tasks. These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary. Visualization techniques reveal that the designed lexicons possess richer quantitative characteristics than their primary source of collection.
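
As a rough illustration of building a lexicon by learning a classifier over transliterated vocabulary, the sketch below trains a word-level classifier and then sorts words into per-language sets. The abstract does not say which classifier or features the authors use, so the character-bigram naive Bayes model here is an assumption, and the names `char_ngrams`, `NgramNaiveBayes`, `build_lexicon`, and the tiny word lists are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real lexicon would be trained on far larger lists.
HINDI_WORDS = ["kya", "nahi", "acha", "hai"]      # transliterated (romanized) Hindi
ENGLISH_WORDS = ["the", "what", "good", "this"]


def char_ngrams(word, n=2):
    """Return character n-grams of a word, padded with start/end markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (an assumed model choice)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = {}          # label -> Counter of n-gram frequencies
        self.totals = {}          # label -> total number of n-grams seen
        self.priors = Counter()   # label -> number of training words
        self.vocab = set()        # all n-grams seen in training

    def fit(self, words, labels):
        for word, label in zip(words, labels):
            grams = char_ngrams(word, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.priors[label] += 1
            self.vocab.update(grams)
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def log_scores(self, word):
        """Log prior plus add-one-smoothed log likelihood for each label."""
        vocab_size = len(self.vocab)
        n_words = sum(self.priors.values())
        scores = {}
        for label, grams in self.counts.items():
            score = math.log(self.priors[label] / n_words)
            for g in char_ngrams(word, self.n):
                score += math.log((grams[g] + 1) / (self.totals[label] + vocab_size))
            scores[label] = score
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


def build_lexicon(model, words):
    """Group words into per-language sets using the classifier's predictions."""
    lexicon = {}
    for word in words:
        lexicon.setdefault(model.predict(word), set()).add(word)
    return lexicon


if __name__ == "__main__":
    labels = ["hi"] * len(HINDI_WORDS) + ["en"] * len(ENGLISH_WORDS)
    model = NgramNaiveBayes(n=2).fit(HINDI_WORDS + ENGLISH_WORDS, labels)
    print(build_lexicon(model, ["hai", "the", "kya", "this"]))
```

Character n-grams are used here because transliterated Hindi has no standard spelling, so sub-word features may generalize across spelling variants better than whole-word lookup; this is a design assumption of the sketch, not a claim about the paper's method.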
