Training and Evaluation of a Multilingual Tokenizer for GPT-SW3

04/28/2023
by Felix Stollenwerk, et al.

This paper provides a detailed discussion of the multilingual tokenizer used for GPT-SW3. It was trained on the Nordic Pile using the SentencePiece library and the BPE algorithm. We outline the tokenizer's most important features and share details on its learned vocabulary. In addition, we systematically analyze the properties and evaluate the performance of the tokenizer with regard to the different languages present in the data.
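
As context for the abstract above: the SentencePiece library exposes a Python API for training BPE tokenizers of the kind described. The sketch below shows how such a training run and a simple per-language check might look. The corpus file name, vocabulary size, and all other parameter values are illustrative assumptions, not the configuration used for GPT-SW3.

    import sentencepiece as spm

    # Minimal sketch: train a BPE tokenizer on a multilingual corpus file.
    # "nordic_pile_sample.txt" and every parameter value below are assumed
    # for illustration; they are not the paper's actual settings.
    spm.SentencePieceTrainer.train(
        input="nordic_pile_sample.txt",   # one sentence or document per line
        model_prefix="multilingual_bpe",  # writes multilingual_bpe.model/.vocab
        model_type="bpe",                 # byte-pair encoding, as named in the abstract
        vocab_size=32000,                 # assumed; a key design choice for a language mix
        character_coverage=0.9999,        # high coverage for multilingual character sets
    )

    # Load the trained model and compute per-language fertility
    # (average tokens per whitespace-separated word), a common way to
    # compare tokenizer performance across languages.
    sp = spm.SentencePieceProcessor(model_file="multilingual_bpe.model")
    samples = {
        "sv": "Det här är en mening på svenska.",
        "en": "This is an English sentence.",
    }
    for lang, text in samples.items():
        tokens = sp.encode(text, out_type=str)
        fertility = len(tokens) / len(text.split())
        print(f"{lang}: {tokens} fertility={fertility:.2f}")

A lower fertility generally means a language is segmented more efficiently, which is one way a per-language evaluation can surface imbalances in the training data.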

Related research

10/11/2016 · Survey on the Use of Typological Information in Natural Language Processing
In recent years linguistic typology, which classifies the world's langua...

10/24/2020 · Improving Multilingual Models with Language-Clustered Vocabularies
State-of-the-art multilingual models depend on vocabularies that cover a...

07/26/2021 · Multilingual Coreference Resolution with Harmonized Annotations
In this paper, we present coreference resolution experiments with a newl...

11/08/2019 · Instance-based Transfer Learning for Multilingual Deep Retrieval
Perhaps the simplest type of multilingual transfer learning is instance-...

04/29/2022 · How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
A multilingual tokenizer is a fundamental component of multilingual neur...

10/13/2018 · Understanding Crosslingual Transfer Mechanisms in Probabilistic Topic Modeling
Probabilistic topic modeling is a popular choice as the first step of cr...

05/01/2023 · Contextual Multilingual Spellchecker for User Queries
Spellchecking is one of the most fundamental and widely used search feat...
