A Systematic Study of Leveraging Subword Information for Learning Word Representations

by Yi Zhu, et al.

The use of subword-level information (e.g., characters, character n-grams, morphemes) has become ubiquitous in modern word representation learning. Its importance is attested especially for morphologically rich languages, which generate a large number of rare words. Despite a steadily increasing interest in such subword-informed word representations, their systematic comparative analysis across typologically diverse languages and different tasks is still missing. In this work, we deliver such a study focusing on the variation of two crucial components required for subword-level integration into word representation models: 1) segmentation of words into subword units, and 2) subword composition functions to obtain final word representations. We propose a general framework for learning subword-informed word representations that allows for easy experimentation with different segmentation and composition components, also including more advanced techniques based on position embeddings and self-attention. Using the unified framework, we run experiments over a large number of subword-informed word representation configurations (60 in total) on 3 tasks (general and rare word similarity, dependency parsing, fine-grained entity typing) for 5 languages representing 3 language types. Our main results clearly indicate that there is no "one-size-fits-all" configuration, as performance is both language- and task-dependent. We also show that configurations based on unsupervised segmentation (e.g., BPE, Morfessor) are sometimes comparable to or even outperform the ones based on supervised word segmentation.
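The two components named in the abstract, segmentation and composition, can be illustrated with a minimal sketch. It assumes character n-grams as the subword units and plain element-wise addition as the composition function. The function names (`char_ngrams`, `compose_by_addition`), the boundary markers `<` and `>`, and the toy embedding table are hypothetical choices made for illustration and are not taken from the authors' framework.

```python
from typing import Dict, List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 4) -> List[str]:
    """Segment a word into character n-grams (assumed segmentation method).

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    yield n-grams distinct from word-internal ones.
    """
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def compose_by_addition(
    subwords: List[str],
    table: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Compose a word vector by summing the embeddings of its subwords.

    Subwords missing from the table are skipped (an assumption of this sketch).
    """
    vec = [0.0] * dim
    for sub in subwords:
        emb = table.get(sub)
        if emb is None:
            continue
        for k in range(dim):
            vec[k] += emb[k]
    return vec


if __name__ == "__main__":
    # Toy 2-dimensional embedding table with made-up values.
    toy_table = {"<ca": [1.0, 0.0], "cat": [0.0, 2.0]}
    units = char_ngrams("cat", n_min=3, n_max=3)
    print(units)
    print(compose_by_addition(units, toy_table, dim=2))
```

Swapping `char_ngrams` for another segmenter (for example BPE or Morfessor output) or replacing the sum with a different composition function changes only one of the two functions, which is the kind of modularity the abstract describes.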




On the Importance of Subword Information for Morphological Tasks in Truly Low-Resource Languages

Recent work has validated the importance of subword information for word...

Enriching Word Vectors with Subword Information

Continuous word representations, trained on large unlabeled corpora are ...

Automatic Selection of Context Configurations for Improved Class-Specific Word Representations

This paper is concerned with identifying contexts useful for training wo...

Effective Subword Segmentation for Text Comprehension

Character-level representations have been broadly adopted to alleviate t...

LINSPECTOR: Multilingual Probing Tasks for Word Representations

Despite an ever growing number of word representation models introduced ...

A Classification Approach to Word Prediction

The eventual goal of a language model is to accurately predict the value...

Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Most undeciphered lost languages exhibit two characteristics that pose s...
