An Empirical Study of Factors Affecting Language-Independent Models

12/30/2019
by   Xiaotong Liu, et al.

Scaling existing applications and solutions to multiple human languages has traditionally proven difficult, mainly because the preprocessing and feature engineering techniques used in traditional approaches are language-dependent. In this work, we empirically investigate the factors affecting language-independent models built with multilingual representations, including task type, language set, and data resources. On two of the most representative NLP tasks, sentence classification and sequence labeling, we show that language-independent models can be comparable to or even outperform models trained on monolingual data, and that they are generally more effective on sentence classification. We experiment with language-independent models across many different languages and show that they are more suitable for typologically similar languages. We also explore the effects of different data sizes when training and testing language-independent models, and demonstrate that they are not only suitable for high-resource languages but also very effective for low-resource languages.
