Automatic Language Identification for Romance Languages using Stop Words and Diacritics

06/14/2018
by Ciprian-Octavian Truică, et al.

Automatic language identification is a natural language processing problem that seeks to determine the natural language of a given piece of content. In this paper we present a statistical method for automatic language identification of written text using dictionaries containing stop words and diacritics. We propose different approaches that combine the two dictionaries to accurately determine the language of textual corpora. This method was chosen because stop words and diacritics are highly specific to a language: although some languages share similar words and special characters, they do not share all of them. The languages considered are Romance languages because they are very similar and are usually hard to distinguish from a computational point of view. We tested our method on a Twitter corpus and a news article corpus. Both corpora consist of UTF-8 encoded text, so diacritics can be taken into account; when a text has no diacritics, only the stop words are used to determine its language. The experimental results show that the proposed method has an accuracy of over 90% for small texts and over 99.8% for large texts.
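To make the idea of combining the two dictionaries concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the tiny word and character lists, the function names (`tokenize`, `scores`, `identify`), and the scoring rule (share of stop-word tokens plus share of language-specific diacritics) are all assumptions made for this example. It does mirror one behavior described in the abstract: when the text contains no diacritics, only the stop words contribute to the decision.

```python
import re
import unicodedata

# Tiny illustrative dictionaries (assumptions for this sketch, not the paper's data).
STOP_WORDS = {
    "ro": {"și", "este", "care", "pentru", "sau", "din", "nu"},
    "es": {"el", "los", "las", "pero", "muy", "con", "y"},
    "fr": {"le", "les", "des", "est", "avec", "pour", "et"},
    "it": {"il", "gli", "che", "sono", "con", "per", "e"},
    "pt": {"os", "as", "não", "uma", "com", "para", "e"},
}
DIACRITICS = {
    "ro": "ăâîșț",
    "es": "áéíóúñü",
    "fr": "àâçéèêëîïôùûü",
    "it": "àèéìòù",
    "pt": "áâãàçéêíóôõú",
}

def _nfc(s):
    # Normalize to composed form so each accented letter is one code point.
    return unicodedata.normalize("NFC", s)

STOP_WORDS = {lang: {_nfc(w) for w in words} for lang, words in STOP_WORDS.items()}
DIACRITICS = {lang: set(_nfc(chars)) for lang, chars in DIACRITICS.items()}
ALL_DIACRITICS = set().union(*DIACRITICS.values())

def tokenize(text):
    # Lowercase and split into runs of letters (digits and punctuation are dropped).
    return re.findall(r"[^\W\d_]+", _nfc(text).lower())

def scores(text):
    tokens = tokenize(text)
    # Every diacritic character in the text that appears in any dictionary.
    marks = [ch for ch in _nfc(text).lower() if ch in ALL_DIACRITICS]
    result = {}
    for lang in STOP_WORDS:
        stop_hits = sum(1 for t in tokens if t in STOP_WORDS[lang])
        stop_score = stop_hits / len(tokens) if tokens else 0.0
        # With no diacritics in the text, only the stop words are used.
        if marks:
            mark_hits = sum(1 for ch in marks if ch in DIACRITICS[lang])
            mark_score = mark_hits / len(marks)
        else:
            mark_score = 0.0
        result[lang] = stop_score + mark_score
    return result

def identify(text):
    # Return the best-scoring language code, or None if nothing matched.
    result = scores(text)
    best = max(result, key=result.get)
    return best if result[best] > 0 else None
```

For example, `identify("El niño y los perros")` draws on both the Spanish stop words and the character `ñ`, while `identify("le chat est avec les enfants")` has no diacritics and is decided by stop words alone. A realistic system would need far larger dictionaries and a tie-breaking rule, which this sketch leaves out.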

