SOTASTREAM: A Streaming Approach to Machine Translation Training

08/14/2023
by Matt Post, et al.

Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because it produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream can be manipulated by a set of user-definable operators that provide on-the-fly modifications, such as data normalization, augmentation, or filtering. We release an open-source toolkit, SOTASTREAM, that implements this approach: https://github.com/marian-nmt/sotastream. We show that it cuts training time, adds flexibility, reduces experiment management complexity, and reduces disk usage, all without affecting the accuracy of the trained models.
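The idea of separating data generation from consumption can be illustrated with a small, self-contained sketch. This is not the SOTASTREAM API: every name below (`infinite_permutations`, `lowercase_source`, `drop_long`, `pipeline`, `batches`) is hypothetical and chosen only for illustration. It assumes the raw data fits in memory as a list of (source, target) string pairs, and it stops short of tensorization, which the abstract assigns to the trainer.

```python
import itertools
import random
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)
Operator = Callable[[Iterator[Pair]], Iterator[Pair]]


def infinite_permutations(data: Sequence[Pair], seed: int = 0) -> Iterator[Pair]:
    """Yield the raw data forever, reshuffled on every pass (hypothetical helper)."""
    if not data:
        raise ValueError("data must not be empty")
    rng = random.Random(seed)
    while True:
        epoch = list(data)
        rng.shuffle(epoch)
        yield from epoch


def lowercase_source(stream: Iterator[Pair]) -> Iterator[Pair]:
    """Example normalization operator: lowercase the source side only."""
    for src, tgt in stream:
        yield src.lower(), tgt


def drop_long(max_tokens: int) -> Operator:
    """Example filtering operator: drop pairs with an overly long side."""
    def op(stream: Iterator[Pair]) -> Iterator[Pair]:
        for src, tgt in stream:
            if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
                yield src, tgt
    return op


def pipeline(data: Sequence[Pair], operators: Sequence[Operator], seed: int = 0) -> Iterator[Pair]:
    """Generation side: an infinite stream with operators applied on the fly."""
    stream = infinite_permutations(data, seed)
    for op in operators:
        stream = op(stream)
    return stream


def batches(stream: Iterator[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Consumption side: group the stream into batches as it is read.

    A real trainer would also convert each batch to tensors at this point.
    """
    while True:
        yield list(itertools.islice(stream, batch_size))


raw = [
    ("Hello World", "Hallo Welt"),
    ("Good Morning", "Guten Morgen"),
    ("This sentence is far too long", "Dieser Satz ist viel zu lang"),
]
stream = pipeline(raw, [lowercase_source, drop_long(3)], seed=1)
first_batch = next(batches(stream, batch_size=2))
```

Because nothing is written to disk, changing an operator (for example, the length threshold) only requires restarting the stream rather than rerunning a preprocessing job. Note that in this sketch a filter that rejects every pair would make the consumer wait forever, so a production implementation would need a safeguard.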


