Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models

09/05/2018
by Daniel Watson, et al.

Text normalization is an important enabling technology for several NLP tasks. Recently, neural-network-based approaches have outperformed well-established models on this task. However, this direction has seen little exploration in languages other than English, where both the scarcity of annotated data and the complexity of the language increase the difficulty of the problem. To address these challenges, we use a sequence-to-sequence model with character-based attention, which, in addition to its self-learned character embeddings, uses word embeddings pre-trained with an approach that also models subword information. This gives the neural model access to linguistic information that is especially suitable for text normalization, without requiring large parallel corpora. We show that providing the model with word-level features bridges the gap, allowing the neural approach to achieve a state-of-the-art F1 score on a standard Arabic language correction shared task dataset.
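To make the architecture concrete, here is a minimal, hypothetical sketch of the input side described above: each character of a word is represented by the concatenation of a character embedding and a word-level embedding built from character n-grams (as in fastText-style subword modeling, which the abstract alludes to). The hash-based vectors, dimensions, and function names below are illustrative stand-ins, not the authors' implementation; a real model would learn these embeddings during pre-training and training.

```python
import hashlib
import random

DIM = 8  # toy embedding dimensionality; real systems use hundreds

def ngrams(word, n_min=3, n_max=5):
    """Character n-grams with boundary markers, fastText-style."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def hashed_vec(key, dim=DIM):
    """Deterministic pseudo-random vector for a key — a stand-in for a
    trained embedding table lookup."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    rng = random.Random(h)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def subword_word_vec(word):
    """Word vector as the mean of its n-gram vectors. Because it is built
    from subword pieces, misspelled or out-of-vocabulary words still get
    a representation that overlaps with related spellings."""
    vecs = [hashed_vec(g) for g in ngrams(word)]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def encoder_inputs(word, char_emb):
    """Per-character encoder input: the character's embedding concatenated
    with the word-level (subword-aware) embedding of the word it sits in."""
    wvec = subword_word_vec(word)
    return [char_emb.setdefault(c, hashed_vec("char:" + c)) + wvec
            for c in word]
```

For example, `encoder_inputs("kitab", {})` yields five vectors of length 2×DIM, and a misspelling such as "ktab" shares n-grams (and thus representation mass) with "kitab" — the property that lets subword-aware word features help a character-level normalizer without large parallel corpora.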


