
LanideNN: Multilingual Language Identification on Character Window

by Tom Kocmi, et al.
Charles University in Prague

Language identification, a common first step in natural language processing, is the task of automatically determining the language of an input text. Monolingual language identification assumes that the given document is written in a single language. In multilingual language identification, the document is usually written in two or three languages, and the goal is only to name them. We go one step further and propose a method for textual language identification in which the language can change arbitrarily and the goal is to identify the span of each language. Our method is based on Bidirectional Recurrent Neural Networks and performs well on monolingual and multilingual language identification tasks across six datasets covering 131 languages. It also maintains its accuracy on short documents and across domains, which makes it well suited to off-the-shelf use without preparing training data.
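To make the span-identification goal concrete, the sketch below shows one way a per-character labelling could be collapsed into language spans. It is an illustration of the output format only, not the authors' implementation: the function name `labels_to_spans`, the example text, and the hand-written labels are all hypothetical, and in practice the labels would come from a trained character-level model.

```python
from itertools import groupby


def labels_to_spans(labels):
    """Collapse per-character language labels into (start, end, lang) spans.

    `end` is exclusive, so text[start:end] yields the span's characters.
    """
    spans = []
    pos = 0
    # groupby merges consecutive identical labels into a single run
    for lang, run in groupby(labels):
        length = sum(1 for _ in run)
        spans.append((pos, pos + length, lang))
        pos += length
    return spans


# Hypothetical input: labels written by hand, one per character of `text`
text = "Hello svete"
labels = ["en"] * 6 + ["cs"] * 5

for start, end, lang in labels_to_spans(labels):
    print(lang, repr(text[start:end]))
```

Because the spans are derived from character-level labels, a language change can fall at any position in the text, which matches the setting described above where languages can change arbitrarily.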


Language Lexicons for Hindi-English Multilingual Text Processing

Language Identification in textual documents is the process of automatic...

A Fast, Compact, Accurate Model for Language Identification of Codemixed Text

We address fine-grained multilingual language identification: providing ...

Tuplemax Loss for Language Identification

In many scenarios of a language identification task, the user will speci...

Improved Text Language Identification for the South African Languages

Virtual assistants and text chatbots have recently been gaining populari...

Sideways Transliteration: How to Transliterate Multicultural Person Names?

In a global setting, texts contain transliterated names from many cultur...

Language Detection Engine for Multilingual Texting on Mobile Devices

More than 2 billion mobile users worldwide type in multiple languages in...

A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Language Identification is the task of identifying a document's language...
