A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models

06/04/2023
by Hyunwoong Ko, et al.

Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models. Despite the availability of various multilingual models such as mBERT (Devlin et al., 2019), XGLM (Lin et al., 2022), and BLOOM (Scao et al., 2022), researchers and developers often resort to building monolingual models in their respective languages because they are dissatisfied with the non-English capabilities of current multilingual models. To address this gap, we seek to develop advanced multilingual language models that offer improved performance in non-English languages. In this paper, we introduce the Polyglot Korean models, which focus specifically on Korean rather than being multilingual. In collaboration with TUNiB, our team collected 1.2TB of meticulously curated Korean data for this research. We made a deliberate decision to prioritize the development of Korean models before venturing into multilingual models. This choice was motivated by two factors: first, the Korean models facilitate performance comparisons with existing multilingual models; second, they cater to the specific needs of Korean companies and researchers. This paper presents our work in developing the Polyglot Korean models, which take some steps towards addressing the non-English language performance gap in multilingual language models.


