CINO: A Chinese Minority Pre-trained Language Model

02/28/2022
by Ziqing Yang, et al.

Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks, which greatly facilitates the application of natural language processing to low-resource languages. However, there are still languages on which existing multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Cantonese, and six other Chinese minority languages. To evaluate the cross-lingual ability of multilingual models on the minority languages, we collect documents from Wikipedia and build a text classification dataset, WCM (Wiki-Chinese-Minority). We test CINO on WCM and two other text classification tasks. Experiments show that CINO notably outperforms the baselines. The CINO model and the WCM dataset are available at http://cino.hfl-rc.com.
