TiBERT: Tibetan Pre-trained Language Model

05/15/2022
by   Yuan Sun, et al.

Pre-trained language models are trained on large-scale unlabeled text and achieve state-of-the-art results on many downstream tasks. However, current pre-trained language models are concentrated mainly on Chinese and English; for a low-resource language such as Tibetan, no monolingual pre-trained model is available. To promote the development of Tibetan natural language processing, this paper collects large-scale training data from Tibetan websites and uses SentencePiece to construct a vocabulary that covers 99.95% of the words in the corpus. We then train a Tibetan monolingual pre-trained language model, named TiBERT, on this data and vocabulary. Finally, we apply TiBERT to the downstream tasks of text classification and question generation and compare it with classical models and multilingual pre-trained models; the experimental results show that TiBERT achieves the best performance. Our model is published at http://tibert.cmli-nlp.com/
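The vocabulary-construction step can be sketched with the SentencePiece Python API. This is a minimal illustration, not the authors' exact configuration: the corpus filename, model prefix, vocabulary size, and model type below are assumptions, and mapping the reported 99.95% word coverage onto SentencePiece's character_coverage parameter is likewise an assumption made here for illustration.

```python
# Minimal sketch of training a SentencePiece vocabulary on a Tibetan corpus.
# Filenames, vocab_size, and model_type are illustrative assumptions; the paper
# reports 99.95% coverage but does not publish its exact SentencePiece settings.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",    # one sentence per line, collected from Tibetan websites
    model_prefix="tibert_spm",     # writes tibert_spm.model and tibert_spm.vocab
    vocab_size=30000,              # assumed size; the paper states coverage, not vocabulary size
    character_coverage=0.9995,     # assumption: stands in for the 99.95% coverage target
    model_type="unigram",          # SentencePiece default; not specified in the paper
)

# Tokenize a sentence with the trained model.
sp = spm.SentencePieceProcessor(model_file="tibert_spm.model")
pieces = sp.encode("...", out_type=str)  # "..." stands in for a Tibetan sentence
print(pieces)
```

The resulting vocabulary would then be used to pre-train the BERT-style model and to tokenize inputs for the downstream text classification and question generation tasks.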


