Neural Machine Translation (NMT) is the application of deep neural networks to the translation of text. NMT is based on an end-to-end trainable algorithm that can learn to translate just by being presented with translated language pairs. Despite being a relatively new approach, NMT has in recent years surpassed classical statistical machine translation models and now holds state-of-the-art results [Wu et al., 2016, Luong and Manning, 2016, Chung et al., 2016].
NMT models are typically built as an encoder-decoder pair. The encoder compresses an input sequence of variable length from the source language into a fixed-length vector representing the meaning of the sentence. The decoder takes the fixed-length representation as input and produces a variable-length translation in the target language. However, due to the fixed-length representation, the naïve encoder-decoder approach has limitations when translating longer sequences.
To overcome this shortcoming, the attention mechanism proposed by Bahdanau et al. [2014] assists the decoder by learning to selectively attend to the parts of the input sequence that it deems most relevant for generating the next element in the output sequence, effectively reducing the distance from encoding to decoding. This approach has made it possible for encoder-decoder models to produce high-quality translations over longer sequences. However, it requires a significant amount of computational power and memory, since the relevance of every element of the input sequence must be computed for every element of the output sequence.
For this reason, training the models on individual characters is not practical, and most current solutions instead use word segmentation [Sutskever et al., 2014, Bahdanau et al., 2014] or multiple characters [Sennrich et al., 2015, Wu et al., 2016, Schuster and Nakajima, 2012] to represent sentences. However, this high-level segmentation approach has a set of drawbacks: most Latin-based languages have millions of words, the majority of which occur rarely. To handle this, current models use confined dictionaries containing only the most common words, with the remaining words represented by a special <UNK> token [Sutskever et al., 2014, Bahdanau et al., 2014]. As a result, names, places, and other rare words are typically translated as unknown by these models. Further, when all words are represented as separate entities, the model has to learn how every word is related to every other, which can be challenging for rare words even if they are obvious variations of more frequent words [Luong and Manning, 2016]. Figure 2 illustrates some of the challenges of characters versus words for the encoder-decoder model.
Two branches of methods to circumvent these drawbacks have been proposed. The first branch is based on extending the current word-based encoder-decoder model to incorporate modules, such as dictionary look-up for out-of-dictionary words [Luong et al., 2014, Jean et al., 2014]. The second branch, which we will investigate in this work, is moving towards smaller units of computation.
In this paper we demonstrate models that use a new char2word mechanism (illustrated in figure 1) during encoding, which reduces long character-level input sequences to word-level representations. This approach has the advantages of keeping a small alphabet of possible input values and preserving relations between words that are spelled similarly, while significantly reducing the number of elements that the attention mechanism needs to consider for each output element it generates. Using this method, the decoder’s memory and computational requirements are reduced, making it feasible to train the models on long sequences of characters as input and output, thus avoiding the drawbacks of word-based models described above. Lastly, we give a qualitative analysis of attention plots produced by a character-level encoder-decoder model with and without the hierarchical encoding mechanism, char2word. This sheds light on how a character-level model uses attention, which might explain some of the success behind the BPE and hybrid models.
2 Related work
Other approaches to circumventing the increase in sequence lengths while reducing the dictionary size have been proposed. First, Byte-Pair Encoding (BPE) [Sennrich et al., 2015], currently holding the state of the art on WMT’14 English-to-French and English-to-German [Wu et al., 2016], where the dictionary is built from combinations of the most common characters. In practice this means that the most frequent words are encoded using fewer bits than less frequent words. Secondly, a hybrid model [Luong and Manning, 2016] where a word encoder-decoder consults a character encoder-decoder when confronted with out-of-dictionary words. Thirdly, pre-trained word-based encoder-decoder models with character input used for creating word embeddings [Ling et al., 2015] have been shown to achieve similar results to word-based approaches. Finally, a character decoder with a BPE encoder has been shown to train successfully end-to-end [Chung et al., 2016].
Wu et al. [2016] provide a good summary and large-scale demonstration of many of the techniques that are known to work well for NMT and RNNs in general. The RNN encoder-decoder with attention approach is used not only within machine translation, but can be regarded as a general architecture to extract information from a sequence and answer some type of question related to it [Kumar et al., 2015].
3 Materials and Methods
First we give a brief description of the Neural Machine Translation model; afterwards we explain our proposed architecture for character and word segmentation in detail.
3.1 Neural Machine Translation
From a probabilistic perspective, translation can be defined as maximising the conditional probability $p(y \mid x)$, where $y = [y_1, \dots, y_{T_y}]$ is the target sequence and $x = [x_1, \dots, x_{T_x}]$ is the source sequence. The conditional probability $p(y \mid x)$ is modelled by an encoder-decoder model where the encoder and the decoder are modelled by separate Recurrent Neural Networks (RNNs) and the whole model is trained end-to-end on a large parallel corpus. The model uses memory-based RNN variants, as they enable modelling of longer dependencies in the sequences [Hochreiter and Schmidhuber, 1997, Cho et al., 2014b].
The encoder part (input RNN) computes a set of hidden representations, $h = [h_1, \dots, h_{T_x}]$, based on the input
$$h_t = f(x_t, h_{t-1}) \tag{1}$$
where $f$ is an RNN with memory cells and $h_t \in \mathbb{R}^{n}$ is a hidden state representation at time step $t$, with $n$ hidden units.
The decoder part (output RNN) then computes a context vector, $c$, based on the hidden representations from the encoder:
$$c = q(h) \tag{2}$$
where $q$ is a function that takes a set of hidden representations and returns a context vector $c \in \mathbb{R}^{m}$, where $m$ is the number of context units. For a decoder without attention, the value of $c$ is the same for all time steps.
Finally the decoder combines the previous predictions, $[y_1, \dots, y_{t-1}]$, and the context vector, $c$, to predict the next unit (word, BPE, or character), such that it maximises the log conditional probability
$$\log p(y \mid x) = \sum_{t=1}^{T_y} \log p(y_t \mid y_1, \dots, y_{t-1}, c) \tag{3}$$
$$p(y_t \mid y_1, \dots, y_{t-1}, c) = g(s_t) \tag{4}$$
where $g$ is a non-linear, potentially multi-layered function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the decoder RNN, such that
$$s_t = f(y_{t-1}, s_{t-1}, c) \tag{5}$$
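We minimise the cross entropy loss averaged over all time steps with $N$ sized mini-batches and add regularisation, such that
$$L = -\frac{1}{N T_y} \sum_{i=1}^{N} \sum_{t=1}^{T_y} \log p\big(y_t^{(i)} \mid y_1^{(i)}, \dots, y_{t-1}^{(i)}, c^{(i)}\big) + \frac{\lambda}{W} \sum_{k=1}^{W} w_k^2$$
where $\lambda$ is a tune-able hyper-parameter, $W$ is the number of non-bias weights and $w$ is the weights in the neural network.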
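The regularised loss described above can be illustrated with a short plain-Python sketch. It is only an illustration under stated assumptions: the function name `regularised_cross_entropy`, the input layout (one list of correct-unit probabilities per sequence) and the choice to divide the squared-weight penalty by the number of weights are ours and are not taken from the original implementation.

```python
import math


def regularised_cross_entropy(step_probs, weights, lam):
    """Cross entropy averaged over all time steps plus an L2 penalty.

    step_probs: one list per sequence, holding the probability the model
                assigned to the correct target unit at each time step.
    weights:    flat list of the non-bias weights of the network.
    lam:        regularisation strength (hypothetical name for the
                tune-able hyper-parameter).
    """
    n_steps = sum(len(seq) for seq in step_probs)
    # Negative log-likelihood of the correct units, averaged over time steps.
    cross_entropy = -sum(math.log(p) for seq in step_probs for p in seq) / n_steps
    # Assumption: the squared weights are averaged over the number of weights.
    l2_penalty = lam * sum(w * w for w in weights) / len(weights)
    return cross_entropy + l2_penalty
```

For example, `regularised_cross_entropy([[0.5, 0.5], [0.25]], [1.0, -2.0], 0.1)` averages three log-probabilities and adds a penalty of 0.25.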
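As motivated in section 1, the attention mechanism can compute a new context vector, $c_t$, for every time step $t$ by combining the hidden representations from the encoder, $h_1, \dots, h_{T_x}$, as well as the previous hidden state, $s_{t-1}$, of the decoder
$$c_t = \sum_{j=1}^{T_x} a_{tj} h_j$$
where the weight parameter $a_{tj}$ of each annotation $h_j$ is computed as
$$a_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}$$
and we have that
$$e_{tj} = \mathrm{attention}(s_{t-1}, h_j)$$
where $a_{tj}$ and $e_{tj}$ reflect the importance of $h_j$ w.r.t. the previous decoder state $s_{t-1}$. The attention function, $\mathrm{attention}(\cdot)$, is a nonlinear, possibly multi-layered neural network.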
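To make the computation of the context vector concrete, the following plain-Python sketch turns a list of importance scores into softmax-normalised attention weights and a weighted sum of the encoder states. The function name `attention_context` and the list-of-lists representation of vectors are illustrative assumptions; the scores are taken as given here rather than produced by a trained attention network.

```python
import math


def attention_context(scores, annotations):
    """Return (weights, context) for one decoding step.

    scores:      one importance score per encoder position.
    annotations: one encoder hidden state (list of floats) per position.
    """
    # Subtract the maximum for numerical stability; the softmax is unchanged.
    shift = max(scores)
    exps = [math.exp(e - shift) for e in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(annotations[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)
    ]
    return weights, context
```

Because the weights come from a softmax, they are non-negative and sum to one for every decoding step, which is what makes them readable as an attention plot.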
4 Our Model
We propose two models: the char-to-char NMT model and the char2word-to-char NMT model. Both models build on the encoder-decoder model with attention as defined in section 3.1 and section 3.1.1. Below we give the specific model definitions.
4.1 The char encoder
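The char encoder is a bi-directional RNN over the embedded input characters:
$$\overrightarrow{h}_t = \overrightarrow{f}(E^{\mathrm{enc}} x_t, \overrightarrow{h}_{t-1})$$
$$\overleftarrow{h}_t = \overleftarrow{f}(E^{\mathrm{enc}} x_t, \overleftarrow{h}_{t+1})$$
$$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$$
where $x_t \in \{0, 1\}^{K_x}$ is a one-hot encoded vector and $K_x$ is the amount of input classes, $E^{\mathrm{enc}} \in \mathbb{R}^{d \times K_x}$ is an embedding matrix with $d$ being the size of the embedding, $\overrightarrow{f}$ and $\overleftarrow{f}$ are RNN functions and the initial hidden state is initialised as .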
The char encoder is illustrated with the yellow arrows and blue circles in figure 3.
4.2 The char2word encoder
The character-to-word-level encoder (referred to as the char2word encoder) samples states from the forward pass of the char encoder defined in the above section. The states it samples are based on the locations of spaces in the original text, resulting in a sequence of outputs from the char encoder that essentially represents the words from the text and acts as their embeddings. We sample the indices from the forward states of the char encoder, such that only the states at the positions of spaces are kept.
The char2word encoder is illustrated with the yellow arrows and blue circles in figure 1.
As a result of this “downsampling” by spaces, the char2word encoder has only about a fifth of the hidden states that the char encoder has. As described in the introduction, the computationally expensive part of the encoder-decoder with attention is the attention itself. By significantly reducing the number of encoder states, we could train the char2word encoder in half the time compared to the char encoder.
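The sampling step of the char2word encoder can be sketched in a few lines of plain Python. This is a minimal illustration, not the original implementation: the function name `sample_word_states` is hypothetical, and the states can be any per-character objects (integers stand in for hidden vectors in the example).

```python
def sample_word_states(text, char_states):
    """Keep the forward char-encoder state at every space in `text`.

    text:        the source sentence as a string.
    char_states: one forward hidden state per character of `text`.
    """
    if len(text) != len(char_states):
        raise ValueError("expected exactly one state per character")
    # The forward state at a space has read the whole preceding word.
    space_indices = [i for i, ch in enumerate(text) if ch == " "]
    return [char_states[i] for i in space_indices]


# Integers stand in for hidden vectors: only positions 2 and 5 are kept.
word_states = sample_word_states("ab cd ", [0, 1, 2, 3, 4, 5])
```

Note that in this sketch a final word is only represented if the text ends with a space (or a comparable end marker); how the last word is handled in the original code is not stated in the text.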
4.3 The char decoder
Our character-level decoder (referred to as the char decoder) works with characters as the smallest unit of computation and decodes one character at a time. The decoder uses an RNN and the attention mechanism [Bahdanau et al., 2014] when decoding each character.
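The new state in our decoder RNN, $s_t$, as defined in equation 5, is computed as follows
$$s_t = f(E^{\mathrm{dec}} y_{t-1}, s_{t-1}, c_t)$$
where $y_{t-1} \in \{0, 1\}^{K_y}$ is a one-hot encoded vector with $K_y$ being the amount of input classes, $E^{\mathrm{dec}} \in \mathbb{R}^{d \times K_y}$ is an embedding matrix with $d$ being the size of the embedding, $f$ is an RNN function and the initial state $s_0$ is initialised as .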
4.3.1 Attention mechanism
The attention model (defined in equation 14) is used to compute the context $c_t$ for time step $t$, which is utilised by the decoder to perform variable length attention. The attention function, $\mathrm{attention}(\cdot)$, was parametrized as
$$e_{tj} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_j)$$
where $v_a$, $W_a$ and $U_a$ are learned parameters whose shapes are determined by $n_{dec}$, the amount of hidden units in the decoder, and $n_{enc}$, the amount of hidden units in the encoder. As $U_a h_j$ does not depend on $t$, we can pre-compute it in advance for optimisation purposes.
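The pre-computation mentioned above can be illustrated as follows, assuming the additive scoring function of Bahdanau et al. [2014], in which the score is a projection of the tanh of the summed decoder and encoder projections. The parameter names `v_a`, `W_a` and `U_a` and the helper functions are assumptions made for this sketch; the encoder-side projections are computed once per sentence and reused at every decoding step.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def precompute_encoder_projections(U_a, annotations):
    """Project every encoder state once; independent of the decoding step."""
    return [matvec(U_a, h) for h in annotations]


def attention_scores(v_a, W_a, s_prev, projected):
    """Additive attention scores for one decoding step.

    s_prev:    previous decoder state.
    projected: output of precompute_encoder_projections.
    """
    query = matvec(W_a, s_prev)  # computed once per decoding step
    return [
        sum(v * math.tanh(q + p) for v, q, p in zip(v_a, query, proj))
        for proj in projected
    ]
```

With this split, each decoding step costs one projection of the decoder state plus one cheap combination per encoder position, instead of re-projecting every encoder state at every step.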
4.3.2 Output function
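The output of the decoder uses a linear combination of the current hidden state in the decoder, $s_t$, followed by a softmax function
$$p(y_t \mid y_1, \dots, y_{t-1}, c_t) = \mathrm{softmax}(W_o s_t + b_o)$$
where $W_o \in \mathbb{R}^{K_y \times n_{dec}}$, $b_o \in \mathbb{R}^{K_y}$ and $K_y$ is the amount of output classes.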
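As a small illustration of the output function, the sketch below applies a linear layer to a decoder state and normalises the result with a softmax, giving a probability for every output class. The names `next_unit_distribution`, `W_o` and `b_o` are assumptions for this example, and the weights in the usage note are toy values, not trained parameters.

```python
import math


def next_unit_distribution(W_o, b_o, s_t):
    """Probability of every output class given the decoder state s_t.

    W_o: one row of weights per output class.
    b_o: one bias per output class.
    """
    logits = [
        sum(w * s for w, s in zip(row, s_t)) + b for row, b in zip(W_o, b_o)
    ]
    # Subtract the maximum logit for numerical stability.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For instance, with an identity `W_o`, zero biases and a decoder state whose two entries are equal, both classes receive probability 0.5; greedy decoding would then pick the class with the highest probability at each step.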
5.1 Data and Preprocessing
We trained our models on two different datasets of language pairs from the WMT’15: En-De (4.5M) and De-En (4.5M). For validation we used the newstest2013 and for testing we used newstest2014 and newstest2015.
The data preprocessing applied is identical to that of Chung et al. [2016] on En-De and De-En, with the source sentence length set to 250 characters instead of 50 BPE units. In short, that means: we normalise punctuation and tokenise using the Moses scripts normalize-punctuation.perl and tokenizer.perl (from Moses, https://github.com/moses-smt/mosesdecoder). We exclude all samples where the source sentence exceeds 250 characters and the target sentence exceeds 500 characters. The source and target languages have separate dictionaries, each containing the 300 most common characters. Characters not in the dictionary are replaced with an unknown token.
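The dictionary construction and length filtering can be sketched in plain Python as below. This is an illustration only: the function names and the `<UNK>` string are hypothetical, the Moses normalisation and tokenisation steps are not reproduced, and the filter assumes that a pair is dropped as soon as either side exceeds its limit, which is one reading of the description above.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical spelling of the unknown token


def build_char_dictionary(sentences, size=300):
    """Return the set of the `size` most common characters."""
    counts = Counter(ch for sentence in sentences for ch in sentence)
    return {ch for ch, _ in counts.most_common(size)}


def replace_unknown(sentence, dictionary):
    """Map a sentence to a list of characters, with rare ones replaced."""
    return [ch if ch in dictionary else UNK for ch in sentence]


def keep_pair(source, target, max_source=250, max_target=500):
    """Assumption: drop the pair if either side is too long."""
    return len(source) <= max_source and len(target) <= max_target
```

A separate dictionary would be built for each language, for example `build_char_dictionary(source_sentences)` and `build_char_dictionary(target_sentences)`, where the two sentence lists are whatever the corpus reader provides.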
5.2 Training details
The model hyperparameters are listed in tables 1 and 2. For the RNN functions in the encoder and the decoder we use gated recurrent units (GRU) [Cho et al., 2014b]. For training we use back-propagation with stochastic gradient descent using the Adam optimiser [Kingma and Ba, 2014] with a learning rate of . For L2 regularisation we set . In order to stabilise training and avoid exploding gradients, the norms of the gradients are clipped before updating the parameters, with a threshold of . All models are implemented using TensorFlow [Abadi et al., 2016] and the code and details of the setup are available on GitHub (https://github.com/Styrke/master-code).
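Gradient-norm clipping as described above can be sketched as follows. The sketch clips by the global norm over all gradients, which is an assumption on our part (the text does not say whether norms are taken per variable or globally), and both the function name `clip_by_global_norm` and the threshold used in the test values are hypothetical.

```python
import math


def clip_by_global_norm(gradients, threshold):
    """Rescale all gradients so that their joint L2 norm is at most threshold.

    gradients: one flat list of floats per parameter tensor.
    Returns the (possibly rescaled) gradients and the original global norm.
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= threshold:
        return gradients, global_norm
    # A common scale factor keeps the direction of the update unchanged.
    scale = threshold / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm
```

Scaling every gradient by the same factor bounds the size of a single update while preserving its direction, which is why it helps against exploding gradients in RNN training.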
Hyperparameters of the char-to-char model:
|Parameter|Value|
|input alphabet size|300|
|char RNN (forward)|400|
|char RNN (backward)|400|
|target alphabet size|300|

Hyperparameters of the char2word-to-char model:
|Parameter|Value|
|input alphabet size|300|
|char RNN (forward)|400|
|spaces RNN (forward)|400|
|spaces RNN (backward)|400|
|target alphabet size|300|
5.2.1 Batch details
When training with batches, all sequences must be padded to match the longest sequence in the batch, and the recurrent layers must do the full set of computations for all samples and all timesteps, which can result in a lot of wasted resources [Hannun et al., 2014] (see figure 4). Training translation models is further complicated by the fact that source and target sentences, while correlated, may have different lengths, and it is necessary to consider both when constructing batches in order to utilise computation power and RAM optimally.
To circumvent this issue, we start each epoch by shuffling all samples in the dataset and sorting them with a stable sorting algorithm according to both the source and target sentence lengths. This ensures that any two samples in the dataset that have almost the same source and target sentence lengths are located close to each other in the sorted list, while the exact order of samples varies between epochs. To pack a batch we simply add samples from the sorted sample list to the batch until we reach the maximal total allowed character threshold (which we set to) for the full batch with padding, after which we start on a new batch. Finally, all the batches are fed in random order to the model for training until all samples have been trained on, and a new epoch begins. Figure 5 illustrates what such dynamic batches might look like.
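The batching procedure described above can be sketched in plain Python. The sketch makes two assumptions that are not taken from the original code: the padded size of a batch is counted as the number of samples times the sum of the longest source and longest target length, and the function name `make_batches` and the `max_chars` parameter are hypothetical.

```python
import random


def make_batches(pairs, max_chars, seed=None):
    """Pack (source, target) pairs into length-homogeneous batches."""
    rng = random.Random(seed)
    samples = list(pairs)
    # Shuffle first, then sort stably: ties keep their shuffled order,
    # so the exact sample order differs between epochs.
    rng.shuffle(samples)
    samples.sort(key=lambda pair: (len(pair[0]), len(pair[1])))

    batches, current = [], []
    max_src = max_tgt = 0
    for src, tgt in samples:
        new_src = max(max_src, len(src))
        new_tgt = max(max_tgt, len(tgt))
        # Assumed cost: every sample is padded to the longest lengths.
        padded = (len(current) + 1) * (new_src + new_tgt)
        if current and padded > max_chars:
            batches.append(current)
            current, new_src, new_tgt = [], len(src), len(tgt)
        current.append((src, tgt))
        max_src, max_tgt = new_src, new_tgt
    if current:
        batches.append(current)
    rng.shuffle(batches)  # feed the batches in random order
    return batches
```

Calling `make_batches` again at the start of every epoch gives batches whose sizes vary with sentence length: many short sentences fit in one batch, while long sentences end up in small batches.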
The quantitative results of our models are listed in table 3. Notice that the char2word-to-char model outperforms the char-to-char model on all datasets (average BLEU performance increase). This could be an indication that either having hierarchical, word-like representations in the encoder, or simply the fact that the encoder was significantly smaller, helps in NMT when using a character decoder with attention.
Table 3: Results of the models on the validation set and the test sets.
Plotting the attention weights (defined at equation 14) is popular in NMT research, as they give an indication of where the model found relevant information while decoding. We have provided plots of both our char-to-char and char2word-to-char models in figures 6 and 7. The more intense the blue colour, the higher the value of the attention weight at that point. Notice that each column corresponds to the decoding of a single unit, resulting in each column summing to 1.
The char-to-char attention plot, attending over every character, interestingly indicates that words that would normally be considered out-of-dictionary (see Lisette Verhaig in figure 6) are translated character by character, whereas common words are attended at the end or start of each word (as we use a bi-directional RNN, full information will be available at both the end and start of a word to use as a single embedding). This observation might explain why using hierarchical encoding improves performance. BPE-based models and the hybrid word-char model by Luong and Manning [2016] effectively work in the same manner: when translating common words, BPE and hybrid word-char models work on a word level, whereas with rare words the BPE model works with sub-parts of the word (maybe even characters) and the hybrid approach uses character representations.
The char2word-to-char attention plot has words, or character-made embeddings of words, to perform attention over. The attention plot seems very similar to the BPE-to-Char plot presented by Chung et al. [2016]. This might indicate that it is possible to imitate lexeme (word) based models using smaller dictionaries while preserving the relationships between words.
We have proposed a purely character-based encoder-decoder model with attention using a hierarchical encoding. We find that the hierarchical encoding, using our newly proposed char2word encoding mechanism, improves the BLEU score by an average of compared to models using a standard character encoder.
Qualitatively, we find that the attention of a character model without hierarchical encoding learns to make hierarchical representations even without being explicitly told to do so, by switching between word and character embeddings for common and rare words. This observation is in line with current research on Byte-Pair Encoding and hybrid word-character models, as these models use word-like embeddings for common words and sub-words or characters for rare words.
Furthermore, qualitatively we find that our hierarchical encoding finds lexemes in the source sentence when decoding similarly to current models with much larger dictionaries using Byte-Pair-Encoding.
- Abadi et al. [2016] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016. URL https://arxiv.org/abs/1603.04467. Software available from tensorflow.org.
- Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL https://arxiv.org/abs/1409.0473.
- Cho et al. [2014a] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014a. URL https://arxiv.org/abs/1409.1259.
- Cho et al. [2014b] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014b. URL https://arxiv.org/abs/1406.1078.
- Chung et al. [2016] Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. A character-level decoder without explicit segmentation for neural machine translation. CoRR, abs/1603.06147, 2016. URL https://arxiv.org/abs/1603.06147.
- Hannun et al. [2014] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567, 2014. URL https://arxiv.org/abs/1412.5567.
- Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.
- Jean et al. [2014] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. CoRR, abs/1412.2007, 2014. URL https://arxiv.org/abs/1412.2007.
- Kalchbrenner and Blunsom [2013] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, volume 3, page 413, Seattle, October 2013. Association for Computational Linguistics.
- Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL https://arxiv.org/abs/1412.6980.
- Kumar et al. [2015] Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. CoRR, abs/1506.07285, 2015. URL https://arxiv.org/abs/1506.07285.
- Ling et al. [2015] Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. Character-based neural machine translation. CoRR, abs/1511.04586, 2015. URL https://arxiv.org/abs/1511.04586.
- Luong and Manning [2016] Minh-Thang Luong and Christopher D. Manning. Achieving open vocabulary neural machine translation with hybrid word-character models. CoRR, abs/1604.00788, 2016. URL https://arxiv.org/abs/1604.00788.
- Luong et al. [2014] Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. CoRR, abs/1410.8206, 2014. URL https://arxiv.org/abs/1410.8206.
- Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL http://dx.doi.org/10.3115/1073083.1073135.
- Schuster and Paliwal [1997] M. Schuster and K.K. Paliwal. Bidirectional recurrent neural networks. Trans. Sig. Proc., 45(11):2673–2681, November 1997. ISSN 1053-587X. doi: 10.1109/78.650093. URL http://dx.doi.org/10.1109/78.650093.
- Schuster and Nakajima [2012] Mike Schuster and Kaisuke Nakajima. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE, 2012.
- Sennrich et al. [2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909, 2015. URL https://arxiv.org/abs/1508.07909.
- Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014. URL https://arxiv.org/abs/1409.3215.
- Wu et al. [2016] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. URL https://arxiv.org/abs/1609.08144.