Patterns versus Characters in Subword-aware Neural Language Modeling

09/02/2017
by   Rustem Takhanov, et al.

Words in some natural languages have a composite structure: a root (which may itself be composite) plus prefixes and suffixes that express various nuances and relations to other words. To build a proper word representation, one must therefore take this internal structure into account. From a corpus of texts we extract a set of frequent subwords, and from this set we select patterns, i.e. subwords that encapsulate information on character n-gram regularities. The selection is made using a pattern-based Conditional Random Field model with l_1 regularization. For every word we then construct a new sequence over an alphabet of patterns. The symbols of this new alphabet capture a stronger local statistical context than individual characters, so they admit better representations in R^n and serve as better building blocks for word representations. In the task of subword-aware language modeling, pattern-based models outperform character-based analogues by 2-20 perplexity points. Moreover, a recurrent neural network in which a word is represented as the sum of the embeddings of its patterns is on par with a competitive and significantly more sophisticated character-based convolutional architecture.
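The core representational idea — a word embedding as the sum of the embeddings of its patterns — can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the pattern set, embedding values, and the greedy longest-match segmentation below are illustrative assumptions (the paper selects patterns with a pattern-based CRF and learns embeddings jointly with the language model).

```python
# Hypothetical sketch: a word is segmented into patterns (frequent
# subwords), and its vector is the sum of the pattern embeddings.
# All names and values here are illustrative assumptions.
from typing import Dict, List, Set


def segment(word: str, patterns: Set[str]) -> List[str]:
    """Greedy longest-match segmentation of a word into patterns,
    falling back to single characters when no pattern matches."""
    i, out = 0, []
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in patterns:
                out.append(word[i:j])
                i = j
                break
        else:  # no pattern matched at position i
            out.append(word[i])
            i += 1
    return out


def word_embedding(word: str, patterns: Set[str],
                   emb: Dict[str, List[float]]) -> List[float]:
    """Sum the embeddings of the word's patterns; unknown symbols
    contribute a zero vector."""
    dim = len(next(iter(emb.values())))
    vec = [0.0] * dim
    for p in segment(word, patterns):
        for k, v in enumerate(emb.get(p, [0.0] * dim)):
            vec[k] += v
    return vec


# Toy example with made-up patterns and 2-dimensional embeddings.
patterns = {"un", "break", "able"}
emb = {"un": [1.0, 0.0], "break": [0.0, 2.0], "able": [0.5, 0.5]}
print(segment("unbreakable", patterns))              # ['un', 'break', 'able']
print(word_embedding("unbreakable", patterns, emb))  # [1.5, 2.5]
```

In the paper's setup the same summation feeds the word vectors into a recurrent language model; a practical implementation would learn the pattern embeddings end-to-end rather than fix them as above.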

