Neural Machine Translation without Embeddings
Many NLP models follow the embed-contextualize-predict paradigm, in which each token in a sequence is mapped to a dense vector via an embedding matrix and then fed into a contextualization component that aggregates information from the entire sequence in order to make a prediction. Could NLP models work without the embedding component? To answer this question, we omit the input and output embeddings from a standard machine translation model and represent text as a sequence of bytes via UTF-8 encoding, using a constant 256-dimensional one-hot representation for each byte. Experiments on 10 language pairs show that removing the embedding matrix consistently improves the performance of byte-to-byte models, which often outperform character-to-character models and sometimes even produce better translations than standard subword models.
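To make the embedding-free input concrete, the following is a minimal sketch of the byte-level one-hot representation described above, assuming a PyTorch setting; the function name and example string are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def byte_one_hot(text: str) -> torch.Tensor:
    """Encode text as UTF-8 bytes and map each byte to a fixed
    256-dimensional one-hot vector, with no learned embedding matrix."""
    byte_ids = torch.tensor(list(text.encode("utf-8")), dtype=torch.long)
    # A constant, parameter-free representation: byte i -> e_i in R^256,
    # used in place of an embedding lookup.
    return F.one_hot(byte_ids, num_classes=256).float()

# The representation has one row per byte, not per character, so
# multi-byte UTF-8 characters expand into several rows.
x = byte_one_hot("héllo")   # 'é' occupies 2 bytes in UTF-8
print(x.shape)              # torch.Size([6, 256])
```

Because the representation is fixed, the model's first learned transformation is the encoder's input projection rather than an embedding lookup, which is what allows the embedding matrices to be removed entirely.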