SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

08/19/2018
by   Taku Kudo, et al.

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, including Neural Machine Translation (NMT). It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to build a purely end-to-end and language-independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that direct subword training from raw sentences can achieve comparable accuracy. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
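To make "tokenizer and detokenizer on raw sentences" concrete, here is a minimal sketch of the underlying idea: treat whitespace as an ordinary symbol (SentencePiece writes it as "▁", U+2581) so that segmentation needs no pre-tokenization and can be reversed exactly. This is not the SentencePiece implementation. The function names `toy_encode` and `toy_decode`, the hand-written vocabulary, and the greedy longest-match rule are all assumptions made for illustration; the real tool learns its vocabulary from data with a trained subword model.

```python
# Toy illustration only: a hand-written vocabulary and greedy longest-match
# segmentation stand in for a trained subword model.
META = "\u2581"  # "▁", the visible marker used in place of whitespace


def toy_encode(text, vocab):
    """Segment a raw string into pieces without any word pre-tokenization."""
    # Whitespace becomes a normal symbol; a marker is also prepended so the
    # first word looks like every other word.
    marked = META + text.replace(" ", META)
    max_len = max((len(p) for p in vocab), default=1)
    pieces, i = [], 0
    while i < len(marked):
        # Try the longest candidate first; fall back to a single character
        # so that any input string can always be segmented.
        for length in range(min(max_len, len(marked) - i), 0, -1):
            cand = marked[i:i + length]
            if length == 1 or cand in vocab:
                pieces.append(cand)
                i += length
                break
    return pieces


def toy_decode(pieces):
    """Invert toy_encode: join pieces, restore spaces, drop the prefix marker."""
    return "".join(pieces).replace(META, " ")[1:]


# Hypothetical vocabulary chosen by hand for this example.
vocab = {META + "Hello", META + "wor", "ld", "."}
pieces = toy_encode("Hello world.", vocab)
print(pieces)              # ['▁Hello', '▁wor', 'ld', '.']
print(toy_decode(pieces))  # Hello world.
```

Because spaces are kept inside the pieces, decoding needs no language-specific rules, and the same code handles text written without spaces, such as Japanese, by falling back to single characters when no vocabulary entry matches.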


