MoNoise: Modeling Noise Using a Modular Normalization System

10/10/2017
by   Rob van der Goot, et al.

We propose MoNoise, a normalization model focused on generalizability and efficiency that aims to be easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain; in our case, from social media data to standard language. Our proposed model is based on modular candidate generation, in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise beats the state of the art on different normalization benchmarks for English and Dutch, each of which defines the normalization task slightly differently.
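The modular generate-then-rank idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the MoNoise implementation: the toy `LOOKUP` and `VOCAB` resources, the module names, and the edit-distance stand-in for a spelling corrector are all hypothetical, and the hand-set `WEIGHTS` replace the trained random forest ranker. The word embeddings module and N-gram features are omitted. The sketch only shows how each module can contribute both candidates and a feature recording where each candidate came from.

```python
from collections import defaultdict

# Hypothetical toy resources (assumptions, not from the paper).
LOOKUP = {"u": ["you"], "2moro": ["tomorrow"]}
VOCAB = ["you", "tomorrow", "to", "too", "see", "sea"]


def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def original_module(word):
    # Keeping the word unchanged is always a candidate.
    return [word]


def lookup_module(word):
    # Static lookup list of known replacements.
    return LOOKUP.get(word, [])


def spelling_module(word):
    # Stand-in for a spelling corrector: vocabulary words within distance 1.
    return [v for v in VOCAB if edit_distance(word, v) <= 1]


MODULES = {
    "original": original_module,
    "lookup": lookup_module,
    "spelling": spelling_module,
}

# Hand-set weights standing in for a trained ranker (assumption).
WEIGHTS = {"original": 0.2, "lookup": 1.0, "spelling": 0.5}


def generate(word):
    """Return {candidate: {module_name: 1.0}}; each module adds one feature."""
    features = defaultdict(dict)
    for name, module in MODULES.items():
        for cand in module(word):
            features[cand][name] = 1.0
    return dict(features)


def rank(word):
    """Order candidates by weighted feature sum, ties broken alphabetically."""
    cands = generate(word)

    def score(cand):
        return sum(WEIGHTS[m] * v for m, v in cands[cand].items())

    return sorted(cands, key=lambda c: (-score(c), c))


print(rank("u"))
```

In a setup closer to the one the abstract describes, the per-candidate feature dictionaries would be turned into vectors and passed to a trained classifier, whose predicted probability would replace the weighted sum used in `rank`.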


