Probabilistic Modelling of Morphologically Rich Languages

08/18/2015
by Jan A. Botha, et al.

This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation and speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption fits morphologically complex languages poorly: their words can have rich internal structure, and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language model, with the intention that leveraging shared sub-word structure can improve model performance and help overcome the data sparsity that arises from morphological processes.

In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and yields benefits when the models are applied to other tasks, including machine translation.

We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given only an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structure leads to improvements on the task of segmenting words into their contiguous morphemes.
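The distributed model described above ties morphologically related words together through shared morpheme representations. A minimal sketch of that additive-composition idea follows; the toy morpheme inventory, the random stand-in embeddings, and the `word_vector` helper are all hypothetical illustrations, not the thesis's actual model or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Toy morpheme inventory; random vectors stand in for learned embeddings.
morpheme_vecs = {m: rng.standard_normal(DIM)
                 for m in ["haus", "tür", "schlüssel", "un", "glück", "lich"]}

def word_vector(morphemes):
    """Compose a word's representation as the sum of its morpheme vectors,
    so morphologically related words share parameters."""
    return np.sum([morpheme_vecs[m] for m in morphemes], axis=0)

# German compounds "Haustür" (front door) and "Haustürschlüssel"
# (front-door key) share the morphemes "haus" and "tür", so their
# composed vectors are linked even if one word form is rare.
v_tuer = word_vector(["haus", "tür"])
v_schluessel = word_vector(["haus", "tür", "schlüssel"])

cos = float(v_tuer @ v_schluessel /
            (np.linalg.norm(v_tuer) * np.linalg.norm(v_schluessel)))
print(cos)
```

Because every word's vector is built from shared morpheme vectors, an unseen compound still receives a sensible representation from its parts, which is the mechanism the abstract credits for mitigating morphological data sparsity.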


