Neural Machine Translation without Embeddings

08/21/2020
by Uri Shaham, et al.

Many NLP models follow the embed-contextualize-predict paradigm: each token in a sequence is represented as a dense vector via an embedding matrix and fed into a contextualization component that aggregates information from the entire sequence in order to make a prediction. Could NLP models work without the embedding component? To explore this question, we omit the input and output embeddings from a standard machine translation model and represent text as a sequence of bytes via UTF-8 encoding, using a fixed 256-dimensional one-hot representation for each byte. Experiments on 10 language pairs show that removing the embedding matrix consistently improves the performance of byte-to-byte models, which often outperform character-to-character models and sometimes even produce better translations than standard subword models.

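To make the representation concrete, here is a minimal sketch of the embedding-free byte input described in the abstract, written in PyTorch; the function and class names are ours, not from the paper. Text is encoded as UTF-8 bytes, and each byte ID is mapped to a fixed 256-dimensional one-hot vector with no trainable parameters.

```python
import torch
import torch.nn.functional as F

def bytes_to_ids(text: str) -> torch.Tensor:
    """Encode text as a sequence of UTF-8 byte IDs in [0, 255]."""
    return torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

class OneHotBytes(torch.nn.Module):
    """Fixed (non-learned) one-hot 'embedding' over the 256 byte values.

    Illustrative stand-in for nn.Embedding(256, d): it has no trainable
    parameters, so the downstream model's input dimension must be 256.
    """
    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        return F.one_hot(byte_ids, num_classes=256).float()

ids = bytes_to_ids("héllo")   # accented characters expand to multiple bytes
reps = OneHotBytes()(ids)     # shape: (num_bytes, 256)
print(ids.tolist(), reps.shape)
```

In this setup the one-hot vectors are fed directly to the contextualizer, tying the model's input dimension to 256; removing the output embedding analogously means predicting a distribution over the 256 byte values directly from the hidden states.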

Related research

10/20/2020  Word Shape Matters: Robust Machine Translation with Visual Embedding
10/20/2016  Neural Machine Translation with Characters and Hierarchical Encoding
02/28/2023  Are Character-level Translations Worth the Wait? An Extensive Comparison of Character- and Subword-level Models for Machine Translation
12/02/2022  Subword-Delimited Downsampling for Better Character-Level Translation
11/27/2019  Simultaneous Neural Machine Translation using Connectionist Temporal Classification
05/28/2021  Reinforcement Learning for on-line Sequence Transformation
05/23/2022  Local Byte Fusion for Neural Machine Translation
