An Efficient Architecture for Predicting the Case of Characters using Sequence Models

01/30/2020
by   Gopi Ramena, et al.

A shortage of clean textual data is a bottleneck in many natural language processing applications. Available data often lacks proper case (uppercase or lowercase) information, particularly when the text comes from social media, messaging applications and other online platforms. This paper addresses the problem by restoring the correct case of characters, a task commonly known as truecasing. Doing so improves the accuracy of several processing tasks further down the NLP pipeline. Our proposed architecture combines convolutional neural networks (CNN), bi-directional long short-term memory networks (LSTM) and conditional random fields (CRF), and works at the character level without any explicit feature engineering. We compare our approach with previous statistical and deep learning based approaches. Our method shows an increment of 0.83 in F1 score over the current state of the art. Since truecasing acts as a preprocessing step in several applications, every increment in F1 score can translate into a meaningful improvement in downstream language processing tasks.
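To make the character-level framing concrete, the sketch below treats truecasing as per-character sequence labeling: each lowercased character receives a label, `U` (uppercase) or `L` (leave as is). The function names and the two-label scheme are assumptions chosen for illustration, not the authors' code, and the CNN, LSTM and CRF layers that would predict the labels are not shown here.

```python
def to_training_pair(text):
    """Split cased text into lowercased characters and per-character case labels.

    Hypothetical helper: "U" marks a character that was uppercase in the
    source text, "L" marks everything else (lowercase letters, digits, spaces).
    """
    chars = [c.lower() for c in text]
    labels = ["U" if c.isupper() else "L" for c in text]
    return chars, labels


def apply_labels(chars, labels):
    """Rebuild cased text from lowercased characters and predicted labels."""
    return "".join(c.upper() if tag == "U" else c for c, tag in zip(chars, labels))


# Round trip on a short example: the labels alone carry the case information.
chars, labels = to_training_pair("Paris is in France")
restored = apply_labels(chars, labels)
```

In this framing a sequence model only has to predict the label sequence from the lowercased characters; `apply_labels` then restores the case deterministically.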


research
12/23/2016

A CRF Based POS Tagger for Code-mixed Indian Social Media Text

In this work, we describe a conditional random fields (CRF) based system...
research
07/07/2022

Part-of-Speech Tagging of Odia Language Using statistical and Deep Learning-Based Approaches

Automatic Part-of-speech (POS) tagging is a preprocessing step of many n...
research
11/09/2019

Hate Speech Detection on Vietnamese Social Media Text using the Bi-GRU-LSTM-CNN Model

In recent years, Hate Speech Detection has become one of the interesting...
research
06/10/2019

Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media

Recognizing named entities in a document is a key task in many NLP appli...
research
09/20/2017

De-identification of medical records using conditional random fields and long short-term memory networks

The CEGS N-GRID 2016 Shared Task 1 in Clinical Natural Language Processi...
research
06/23/2022

DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations

Natural Language Processing has recently made understanding human intera...
research
03/12/2019

Offensive Language Analysis using Deep Learning Architecture

SemEval-2019 Task 6 (Zampieri et al., 2019b) requires us to identify and...
