LanideNN: Multilingual Language Identification on Character Window

01/12/2017
by Tom Kocmi, et al.

In language identification, a common first step in natural language processing, the goal is to automatically determine the language of an input text. Monolingual language identification assumes that the given document is written in a single language. In multilingual language identification, the document is usually written in two or three languages and only their names are wanted. We go one step further and propose a method for textual language identification in which languages can change arbitrarily and the goal is to identify the span of each language. Our method is based on Bidirectional Recurrent Neural Networks and performs well in both monolingual and multilingual language identification tasks on six datasets covering 131 languages. The method also maintains its accuracy on short documents and across domains, so it is well suited to off-the-shelf use without preparing training data.
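
To make the span-identification goal concrete, here is a minimal sketch of how per-character language predictions could be collapsed into language spans. It is not the authors' implementation: the function name `labels_to_spans`, the language codes, and the assumption that some model has already produced exactly one label per character are all hypothetical and chosen only for illustration.

```python
from itertools import groupby


def labels_to_spans(text, labels):
    """Collapse one-label-per-character predictions into (start, end, lang) spans.

    `labels` is assumed to come from some character-level classifier
    (hypothetical here); `end` is exclusive, as in Python slicing.
    """
    if len(text) != len(labels):
        raise ValueError("expected exactly one label per character")

    spans = []
    start = 0
    # groupby yields runs of consecutive identical labels.
    for lang, run in groupby(labels):
        length = sum(1 for _ in run)
        spans.append((start, start + length, lang))
        start += length
    return spans


# Illustrative input: an English word followed by a Czech word.
text = "Hello Ahoj"
labels = ["en"] * 6 + ["cs"] * 4

for start, end, lang in labels_to_spans(text, labels):
    print(lang, repr(text[start:end]))
```

Returning character offsets rather than substrings keeps the spans usable for downstream steps such as routing each segment to a language-specific tool.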

