Open-Set Language Identification

07/16/2017
by Shervin Malmasi, et al.

We present the first open-set language identification experiments using one-class classification. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine using only a monolingual corpus for each language. Each model is evaluated against a test set of data from all 10 languages and we achieve an average F-score of 0.99, highlighting the effectiveness of this approach for open-set language identification.
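
The hashing-based vectorization mentioned in the abstract can be illustrated with a minimal sketch. One likely motivation is that a vocabulary built from a single monolingual corpus cannot represent n-grams from unseen languages or scripts, whereas a hash function maps any n-gram to an index in a fixed-size vector. The abstract does not state the hash function, n-gram order, or dimensionality, so the function name `hash_char_ngrams`, the use of CRC32, `n=3`, and `n_features=1024` below are all assumptions for illustration, not the authors' settings.

```python
import zlib


def hash_char_ngrams(text, n=3, n_features=1024):
    """Map a string to a fixed-length vector of hashed character n-gram counts.

    Hypothetical helper: the hash function (CRC32), n and n_features are
    illustrative assumptions, not values taken from the paper.
    """
    vec = [0.0] * n_features
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        idx = zlib.crc32(gram.encode("utf-8")) % n_features
        vec[idx] += 1.0
    # L2-normalise so that text length does not dominate the representation.
    norm = sum(v * v for v in vec) ** 0.5
    if norm > 0:
        vec = [v / norm for v in vec]
    return vec


# Text in any script yields a vector of the same dimensionality,
# with no vocabulary learned in advance.
latin = hash_char_ngrams("open-set language identification")
cyrillic = hash_char_ngrams("Это предложение на русском языке")
print(len(latin), len(cyrillic))
```

Vectors like these could then be passed to a one-class classifier trained on a single language, for example scikit-learn's `OneClassSVM`, which is one possible implementation and not necessarily the one used in the paper.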

