Language Segmentation

by David Alfter, et al.

Language segmentation consists of finding the boundaries where one language ends and another begins in a text written in more than one language. This is important for all natural language processing tasks. The problem can be solved by training language models on language data. However, for low- or no-resource languages, this is problematic. I therefore investigate whether unsupervised methods perform better than supervised methods when it is difficult or impossible to train supervised approaches. A special focus is given to difficult texts, i.e. texts that are rather short (one sentence) or that contain abbreviations, low-resource languages, or non-standard language. I compare three approaches: supervised n-gram language models, unsupervised clustering, and weakly supervised n-gram language model induction. I devised the weakly supervised approach specifically to deal with difficult texts. To test the approaches, I compiled a small corpus of different text types, ranging from one-sentence texts to texts of about 300 words. The weakly supervised language model induction approach works well on short and difficult texts, outperforming the clustering algorithm and reaching scores in the vicinity of the supervised approach. The results look promising, but there is room for improvement, and a more thorough investigation should be undertaken.
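To make the supervised baseline concrete, the following is a minimal sketch of how character n-gram language models could be used to segment a mixed-language text. It is an illustration under stated assumptions, not the implementation evaluated in this work: the toy word lists, the function names (`char_ngrams`, `train`, `log_prob`, `segment`), the choice of bigrams, and the add-one smoothing are all hypothetical. Each token is assigned the language whose model gives it the highest probability, and a boundary is placed wherever the assigned language changes.

```python
import math
from collections import Counter

def char_ngrams(word, n=2):
    """Return the character n-grams of a word, with start/end padding."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(words, n=2):
    """Count character n-grams over a word list (a toy 'language model')."""
    counts = Counter(g for w in words for g in char_ngrams(w, n))
    return counts, sum(counts.values())

def log_prob(word, model, vocab_size, n=2):
    """Add-one smoothed log-probability of a word under one model (assumed smoothing)."""
    counts, total = model
    return sum(math.log((counts[g] + 1) / (total + vocab_size))
               for g in char_ngrams(word, n))

def segment(tokens, models, n=2):
    """Label each token with its most likely language, then merge equal neighbours."""
    vocab = {g for counts, _ in models.values() for g in counts}
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
    segments = []
    for tok in tokens:
        lang = max(models, key=lambda l: log_prob(tok, models[l], vocab_size, n))
        if segments and segments[-1][0] == lang:
            segments[-1][1].append(tok)   # same language: extend current segment
        else:
            segments.append((lang, [tok]))  # language change: new boundary
    return segments

# Hypothetical toy training data; a real model would need far more text.
EN = ["the", "house", "is", "small", "and", "weather", "good"]
DE = ["das", "haus", "ist", "klein", "und", "wetter", "gut"]
MODELS = {"en": train(EN), "de": train(DE)}

if __name__ == "__main__":
    print(segment(["the", "weather", "klein", "gut"], MODELS))
```

Labelling each token independently is the simplest possible design. On short or non-standard texts, such per-token decisions are fragile, which is the setting that the weakly supervised approach is meant to address.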


