Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings

08/18/2019
by Ahmed El-Kishky, et al.

Traditionally, many text-mining tasks treat individual word tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are formed by concatenating semantically meaningful subword structures, and word-level analysis cannot leverage the semantic information these structures carry. For word embedding techniques, this leads both to poor embeddings for infrequent words in long-tailed text corpora and to weak handling of out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest morphemes at each level of the hierarchy, which yields longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into human-verified morphemes. Additionally, we experimentally demonstrate that using MorphMine morphemes to enrich word embeddings consistently improves embedding quality on a variety of embedding evaluations and a downstream language modeling task.
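The parsimony criterion can be illustrated with a small dynamic program that splits a word into the fewest pieces drawn from a candidate morpheme set. This is a minimal sketch, not MorphMine itself: the abstract does not say how candidate morphemes are mined or how the hierarchy is built, so the function name `fewest_morphemes` and the hand-written `vocab` are assumptions made purely for illustration.

```python
from typing import List, Optional, Set


def fewest_morphemes(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split `word` into the fewest pieces that all occur in `vocab`.

    Returns None when no segmentation exists. `vocab` is a hypothetical,
    hand-written candidate set; MorphMine derives its candidates without
    supervision, which this sketch does not attempt.
    """
    n = len(word)
    # best[i] is a minimal segmentation of word[:i], or None if unreachable.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = word[start:end]
            if prefix is None or piece not in vocab:
                continue
            candidate = prefix + [piece]
            current = best[end]
            # Parsimony: keep the segmentation with the fewest pieces.
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


# Toy candidate set mixing long and short pieces (assumed, not from the paper).
vocab = {"un", "break", "able", "u", "n", "b", "re", "ak", "a", "ble"}
print(fewest_morphemes("unbreakable", vocab))
```

Because the objective counts pieces, the longer candidates "un", "break", and "able" win over chains of shorter ones such as "u" + "n" or "a" + "ble", which mirrors the abstract's remark that parsimony favors longer shared morphemes.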

research
10/21/2020

PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding

We look into the task of generalizing word embeddings: given a set of pr...
research
02/25/2020

Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction

Language-independent tokenisation (LIT) methods that do not require labe...
research
06/06/2020

ValNorm: A New Word Embedding Intrinsic Evaluation Method Reveals Valence Biases are Consistent Across Languages and Over Decades

Word embeddings learn implicit biases from linguistic regularities captu...
research
07/18/2016

Language classification from bilingual word embedding graphs

We study the role of the second language in bilingual word embeddings in...
research
02/07/2017

MORSE: Semantic-ally Drive-n MORpheme SEgment-er

We present in this paper a novel framework for morpheme segmentation whi...
research
01/02/2021

Superbizarre Is Not Superb: Improving BERT's Interpretations of Complex Words with Derivational Morphology

How does the input segmentation of pretrained language models (PLMs) aff...
research
11/02/2022

Boosting word frequencies in authorship attribution

In this paper, I introduce a simple method of computing relative word fr...
