Leveraging Twitter for Low-Resource Conversational Speech Language Modeling

04/09/2015
by Aaron Jaech, et al.

In applications involving conversational speech, data sparsity is a limiting factor in building a better language model. We propose a simple, language-independent method to quickly harvest large amounts of data from Twitter to supplement a smaller training set that is more closely matched to the domain. The technique leads to a significant reduction in perplexity on four low-resource languages, even though these languages have a relatively small presence on Twitter. We also find that the Twitter text is more useful for learning word classes than the in-domain text, and that using these word classes leads to further reductions in perplexity. Additionally, we introduce a method that uses social and textual information to prioritize the download queue during Twitter crawling. This maximizes the amount of useful data that can be collected, impacting both perplexity and vocabulary coverage.
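As a rough illustration of how out-of-domain text can lower perplexity on held-out in-domain data, the sketch below mixes two add-one smoothed unigram models by linear interpolation. The abstract does not say how the Twitter data is combined with the in-domain set, so the interpolation scheme, the toy sentences, the weight `lam`, and all function names are assumptions made for illustration only, not the authors' method.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def interpolate(p_in, p_out, lam):
    """Linear interpolation: lam * in-domain + (1 - lam) * out-of-domain."""
    return {w: lam * p_in[w] + (1 - lam) * p_out[w] for w in p_in}

def perplexity(model, tokens):
    """Perplexity is exp of the average negative log-probability per token."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Toy data (hypothetical): a tiny in-domain set and a larger "Twitter" set.
in_domain = "hello how are you".split()
twitter = "hello lol how are you doing today lol".split()
held_out = "are you doing today lol".split()

vocab = set(in_domain) | set(twitter) | set(held_out)
p_in = unigram_model(in_domain, vocab)
p_tw = unigram_model(twitter, vocab)
mixed = interpolate(p_in, p_tw, lam=0.5)  # lam is an arbitrary choice here

ppl_in = perplexity(p_in, held_out)
ppl_mixed = perplexity(mixed, held_out)
print(f"in-domain only: {ppl_in:.2f}, interpolated: {ppl_mixed:.2f}")
```

On this contrived example the interpolated model scores a lower perplexity because the extra text covers held-out words the small in-domain set never saw. In practice the interpolation weight would be tuned on development data rather than fixed.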

research
05/29/2019

Regularization Advantages of Multilingual Neural Language Models for Low Resource Domains

Neural language modeling (LM) has led to significant improvements in sev...
research
06/30/2019

Evaluating Language Model Finetuning Techniques for Low-resource Languages

Unlike mainstream languages (such as English and French), low-resource l...
research
06/14/2021

Overcoming Domain Mismatch in Low Resource Sequence-to-Sequence ASR Models using Hybrid Generated Pseudotranscripts

Sequence-to-sequence (seq2seq) models are competitive with hybrid models...
research
06/15/2022

Location-based Twitter Filtering for the Creation of Low-Resource Language Datasets in Indonesian Local Languages

Twitter contains an abundance of linguistic data from the real world. We...
research
11/09/2016

When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter

We bootstrap a state-of-the-art part-of-speech tagger to tag Italian Twi...
research
10/16/2018

Strategies for Language Identification in Code-Mixed Low Resource Languages

In the recent years, substantial work has been done on language tagging ...
research
02/12/2022

Wav2Vec2.0 on the Edge: Performance Evaluation

Wav2Vec2.0 is a state-of-the-art model which learns speech representatio...
