Language Identification on Massive Datasets of Short Message using an Attention Mechanism CNN

10/15/2019
by   Duy Tin Vo, et al.

Language Identification (LID) is a challenging task, especially when the input texts are short and noisy, such as posts and statuses on social media or chat logs on gaming forums. The task has been tackled either by designing a feature set for a traditional classifier (e.g. Naive Bayes) or by applying a deep neural network classifier (e.g. Bi-directional Gated Recurrent Unit, Encoder-Decoder). These methods are usually trained and tested on a huge amount of private data, then used and evaluated as off-the-shelf packages by other researchers on their own datasets; consequently, the various published results are not directly comparable. In this paper, we first create a new massive labelled dataset based on one year of Twitter data. We use this dataset to test several existing language identification systems in order to obtain a set of coherent benchmarks, and we make our dataset publicly available so that others can add to this set of benchmarks. Finally, we propose a shallow but efficient neural LID system, which is an ngram-regional convolutional neural network enhanced with an attention mechanism. Experimental results show that our architecture is able to predict tens of thousands of samples per second and surpasses all state-of-the-art systems with an improvement of 5
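
The abstract names the architecture without detailing it, so the following is only a rough sketch of the general idea (a convolution over character n-gram regions, pooled with attention, then a softmax over languages), not the authors' implementation. The class name `TinyAttentionLID`, its parameters, the layer sizes, and the character-code embedding are all assumptions made for illustration, and the weights are random and untrained.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyAttentionLID:
    """Hypothetical toy model: char embeddings -> n-gram conv -> attention -> softmax."""

    def __init__(self, languages, n=3, vocab=256, dim=8, filters=6, seed=0):
        rng = random.Random(seed)

        def rand(size):
            return [rng.gauss(0.0, 0.5) for _ in range(size)]

        self.languages = list(languages)
        self.n = n          # width of each n-gram region (assumed value)
        self.vocab = vocab  # number of character embedding rows (assumed value)
        self.char_embed = [rand(dim) for _ in range(vocab)]
        # One filter = n weight vectors, one per position inside the region.
        self.filters = [[rand(dim) for _ in range(n)] for _ in range(filters)]
        self.attn = rand(filters)                         # attention scoring vector
        self.out = [rand(filters) for _ in self.languages]  # output layer rows

    def conv_regions(self, text):
        """Slide a width-n window over the characters; one ReLU feature vector per region."""
        chars = text.ljust(self.n)  # pad very short inputs to a single region
        embedded = [self.char_embed[ord(c) % self.vocab] for c in chars]
        regions = []
        for start in range(len(embedded) - self.n + 1):
            window = embedded[start:start + self.n]
            feats = []
            for filt in self.filters:
                act = sum(dot(filt[k], window[k]) for k in range(self.n))
                feats.append(max(0.0, act))
            regions.append(feats)
        return regions

    def attention_weights(self, regions):
        """Score each region and normalise the scores into weights that sum to 1."""
        return softmax([dot(self.attn, feats) for feats in regions])

    def predict_proba(self, text):
        """Attention-weighted sum of region features, then a softmax over languages."""
        regions = self.conv_regions(text)
        weights = self.attention_weights(regions)
        pooled = [
            sum(w * feats[j] for w, feats in zip(weights, regions))
            for j in range(len(self.attn))
        ]
        logits = [dot(row, pooled) for row in self.out]
        return dict(zip(self.languages, softmax(logits)))
```

Calling `TinyAttentionLID(["en", "fr", "vi"]).predict_proba("bonjour tout le monde")` returns one probability per language code. Because nothing here is trained, the values carry no meaning; the sketch only shows how attention pooling lets the classifier weight some n-gram regions more heavily than others.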

research
12/04/2014

End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results

We replace the Hidden Markov Model (HMM) which is traditionally used in ...
research
07/28/2020

SalamNET at SemEval-2020 Task12: Deep Learning Approach for Arabic Offensive Language Detection

This paper describes SalamNET, an Arabic offensive language detection sy...
research
11/01/2017

Improved Text Language Identification for the South African Languages

Virtual assistants and text chatbots have recently been gaining populari...
research
09/11/2020

WOLI at SemEval-2020 Task 12: Arabic Offensive Language Identification on Different Twitter Datasets

Communicating through social platforms has become one of the principal m...
research
06/08/2022

Improved two-stage hate speech classification for twitter based on Deep Neural Networks

Hate speech is a form of online harassment that involves the use of abus...
research
10/31/2019

Attention Is All You Need for Chinese Word Segmentation

This paper presents a fast and accurate Chinese word segmentation (CWS) ...
research
03/15/2019

SemEval 2019 Task 6: An exploration of state-of-the-art methods for offensive language detection

We provide a comprehensive investigation of different custom and off-the...
