
Neural Machine Translation with Characters and Hierarchical Encoding

Most existing Neural Machine Translation models use groups of characters or whole words as their unit of input and output. We propose a model with a hierarchical char2word encoder, that takes individual characters both as input and output. We first argue that this hierarchical representation of the character encoder reduces computational complexity, and show that it improves translation performance. Secondly, by qualitatively studying attention plots from the decoder we find that the model learns to compress common words into a single embedding whereas rare words, such as names and places, are represented character by character.



1 Introduction

Neural Machine Translation (NMT) is the application of deep neural networks to translation of text. NMT is based on an end-to-end trainable algorithm that can learn to translate just by being presented with translated language pairs. Despite being a relatively new approach, NMT has in recent years surpassed classical statistical machine translation models and now holds state-of-the-art results [Wu et al., 2016, Luong and Manning, 2016, Chung et al., 2016].

Early NMT models introduced by Kalchbrenner and Blunsom [2013], Sutskever et al. [2014], and Cho et al. [2014a] are based on the encoder-decoder network architecture. Here the encoder compresses an input sequence of variable length from the source language to a fixed-length vector representing the meaning of the sentence. The decoder takes the fixed-length representation as input and produces a variable-length translation in the target language. However, due to the fixed-length representation, the naïve encoder-decoder approach has limitations when translating longer sequences.

Figure 1: Our char2word-to-char model using hierarchical encoding in English and decoding one character at a time in Latin. “-” marks sequence start for decoding.

To overcome this shortcoming, the attention mechanism proposed by Bahdanau et al. [2014] assists the decoder by learning to selectively attend to the parts of the input sequence that it deems most relevant for generating the next element in the output sequence, effectively reducing the distance from encoding to decoding. This approach has made it possible for encoder-decoder models to produce high-quality translations over longer sequences. However, it requires a significant amount of computational power and memory, because the relevance of every element of the input sequence must be computed for every element of the output sequence.

For this reason, training the models on individual characters is not practical, and most current solutions instead use word segmentation [Sutskever et al., 2014, Bahdanau et al., 2014] or multiple characters [Sennrich et al., 2015, Wu et al., 2016, Schuster and Nakajima, 2012] to represent sentences. However, this high-level segmentation approach has a set of drawbacks: most Latin-based languages have millions of words, with the majority occurring rarely. To handle this, current models use confined dictionaries with only the most common words, with the remaining words being represented by a special <UNK>-token [Sutskever et al., 2014, Bahdanau et al., 2014]. As a result, names, places, and other rare words are typically translated as unknown by these models. Further, when all words are represented as separate entities, the model has to learn how every word is related to every other word, which can be challenging for rare words even if they are obvious variations of more frequent words [Luong and Manning, 2016]. Figure 2 illustrates some of the challenges of characters versus words for the encoder-decoder model.

Figure 2: Challenges in encoder-decoder models: Characters versus Words.

Two branches of methods to circumvent these drawbacks have been proposed. The first branch is based on extending the current word-based encoder-decoder model to incorporate modules, such as dictionary look-up for out-of-dictionary words [Luong et al., 2014, Jean et al., 2014]. The second branch, which we will investigate in this work, is moving towards smaller units of computation.

In this paper we demonstrate models that use a new char2word mechanism (illustrated in figure 1) during encoding, which reduces long character-level input sequences to word-level representations. This approach has the advantages of keeping a small alphabet of possible input values and preserving relations between words that are spelled similarly, while significantly reducing the number of elements that the attention mechanism needs to consider for each output element it generates. Using this method, the decoder’s memory and computational requirements are reduced, making it feasible to train the models on long sequences of characters as input and output, thus avoiding the drawbacks of word-based models described above. Lastly, we give a qualitative analysis of attention plots produced by a character-level encoder-decoder model with and without the hierarchical encoding mechanism, char2word. This sheds light on how a character-level model uses attention, which might explain some of the success behind the BPE and hybrid models.

2 Related work

Other approaches to circumventing the increase in sequence lengths while reducing the dictionary size have been proposed. First, byte-pair encoding (BPE) [Sennrich et al., 2015], which currently holds the state of the art on WMT’14 English-to-French and English-to-German [Wu et al., 2016], builds the dictionary from the most common combinations of characters. In practice this means that the most frequent words are encoded using fewer symbols than less frequent words. Secondly, a hybrid model [Luong and Manning, 2016] where a word encoder-decoder consults a character encoder-decoder when confronted with out-of-dictionary words. Thirdly, pre-trained word-based encoder-decoder models with character input used for creating word embeddings [Ling et al., 2015] have been shown to achieve similar results to word-based approaches. Finally, a character decoder with a BPE encoder has been shown to train successfully end-to-end [Chung et al., 2016].

Wu et al. [2016] provide a good summary and large-scale demonstration of many of the techniques that are known to work well for NMT and RNNs in general. The RNN encoder-decoder with attention approach is used not only within machine translation, but can be regarded as a general architecture to extract information from a sequence and answer some type of question related to it [Kumar et al., 2015].

3 Materials and Methods

First we give a brief description of the Neural Machine Translation model. Afterwards we explain in detail our proposed architecture for character and word segmentation.

3.1 Neural Machine Translation

From a probabilistic perspective, translation can be defined as maximising the conditional probability $p(\mathbf{y} \mid \mathbf{x})$, where $\mathbf{y} = (y_1, \dots, y_{T_y})$ is the target sequence and $\mathbf{x} = (x_1, \dots, x_{T_x})$ is the source sequence. The conditional probability is modelled by an encoder-decoder model where the encoder and the decoder are modelled by separate Recurrent Neural Networks (RNNs) and the whole model is trained end-to-end on a large parallel corpus. The model uses memory-based RNN variants, as they enable modelling of longer dependencies in the sequences [Hochreiter and Schmidhuber, 1997, Cho et al., 2014b].

The encoder part (input RNN) computes a set of hidden representations, $h_1, \dots, h_{T_x}$, based on the input:

$$h_t = f(x_t, h_{t-1}) \tag{1}$$

where $f$ is an RNN with memory cells and $h_t \in \mathbb{R}^{n_h}$ is the hidden state representation at time step $t$, with $n_h$ hidden units.

The decoder part (output RNN) then computes a context vector, $c$, based on the hidden representations from the encoder:

$$c = q(h_1, \dots, h_{T_x}) \tag{2}$$

where $q$ is a function that takes a set of hidden representations and returns a context vector $c \in \mathbb{R}^{m}$, with $m$ the number of context units. For a decoder without attention, the value of $c$ is the same for all time steps.

Finally the decoder combines the previous predictions, $y_1, \dots, y_{t-1}$, and the context vector, $c$, to predict the next unit (word, BPE, or character), such that it maximises the log conditional probability

$$\log p(\mathbf{y} \mid \mathbf{x}) = \sum_{t=1}^{T_y} \log p(y_t \mid y_1, \dots, y_{t-1}, c) \tag{3}$$

$$p(y_t \mid y_1, \dots, y_{t-1}, c) = g(y_{t-1}, s_t, c) \tag{4}$$

where $g$ is a non-linear, potentially multi-layered function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the decoder RNN, such that

$$s_t = f(s_{t-1}, y_{t-1}, c) \tag{5}$$
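We minimise the cross entropy loss averaged over all time steps with $N$-sized mini-batches and add $L_2$ regularisation, such that

$$L = -\frac{1}{N} \sum_{n=1}^{N} \frac{1}{T_y^{(n)}} \sum_{t=1}^{T_y^{(n)}} \log p\big(y_t^{(n)} \mid y_1^{(n)}, \dots, y_{t-1}^{(n)}, c\big) + \lambda \sum_{i=1}^{M} w_i^2$$

where $\lambda$ is a tunable hyper-parameter, $M$ is the number of non-bias weights and $w$ are the weights in the neural network.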

3.1.1 Attention

As motivated in section 1, the attention mechanism can compute a new context vector $c_t$ for every time step $t$ by combining the hidden representations from the encoder as well as the previous hidden state, $s_{t-1}$, of the decoder:

$$c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$$

where the weight parameter $\alpha_{tj}$ of each annotation $h_j$ is computed as

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}$$

and we have that

$$e_{tj} = a(s_{t-1}, h_j)$$

where $\alpha_{tj}$ and $e_{tj}$ reflect the importance of $h_j$ w.r.t. the previous decoder state $s_{t-1}$. The attention function, $a$, is a nonlinear, possibly multi-layered neural network.

The encoder-decoder and attention model is trained jointly to minimise the loss function.
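As an illustration only, the normalisation of relevance scores into attention weights and the weighted sum that forms the context vector can be sketched in plain Python. The function names and the toy inputs are assumptions made for this example, not the authors' implementation; the scores are taken as given rather than produced by a trained attention network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, annotations):
    """Return attention weights and the weighted sum of encoder annotations.

    scores:      one unnormalised relevance score per encoder position
    annotations: one hidden-state vector (list of floats) per encoder position
    """
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)
    ]
    return weights, context


# Toy input (hypothetical): three encoder positions, two hidden units each.
# Equal scores give equal weights, so the context is the mean annotation.
weights, context = context_vector(
    [0.0, 0.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
)
```

Because the weights are recomputed for every output element, the work per decoding step grows with the number of encoder positions, which is the cost the hierarchical encoder aims to reduce.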

4 Our Model

We propose two models: The char-to-char NMT model and the char2word-to-char NMT model. Both models build on the encoder-decoder model with attention as defined in section 3.1 and section 3.1.1. Below we will give specific model definitions.

4.1 The char encoder

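Our character-level encoder (referred to as the char encoder) is built upon a bi-directional RNN [Schuster and Paliwal, 1997]. The encoder function, $f$, in equation 1 becomes

$$\overrightarrow{h}_t = \overrightarrow{f}(E^{\mathrm{src}} x_t, \overrightarrow{h}_{t-1})$$

$$\overleftarrow{h}_t = \overleftarrow{f}(E^{\mathrm{src}} x_t, \overleftarrow{h}_{t+1})$$

$$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$$

where $x_t \in \{0, 1\}^{V_{\mathrm{src}}}$ is a one-hot encoded vector and $V_{\mathrm{src}}$ is the number of input classes, $E^{\mathrm{src}} \in \mathbb{R}^{d \times V_{\mathrm{src}}}$ is an embedding matrix with $d$ being the size of the embedding, and $\overrightarrow{f}$ and $\overleftarrow{f}$ are RNN functions, each starting from an initial hidden state.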

The char encoder is illustrated with the yellow arrows and blue circles in figure 3.
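The data flow of a bi-directional encoder can be sketched as below. This is a minimal illustration under stated assumptions: `bidirectional_encode` and `running_sum_step` are hypothetical names, and the toy step function is a running sum standing in for a real recurrent cell, chosen only so that the result can be checked by hand.

```python
def bidirectional_encode(embedded, forward_step, backward_step, initial_state):
    """Run one recurrent pass in each direction and concatenate per position.

    embedded:      list of input vectors (lists of floats), one per character
    forward_step:  function (input, previous_state) -> new_state
    backward_step: function (input, previous_state) -> new_state
    """
    forward, state = [], initial_state
    for x in embedded:
        state = forward_step(x, state)
        forward.append(state)

    backward, state = [], initial_state
    for x in reversed(embedded):
        state = backward_step(x, state)
        backward.append(state)
    backward.reverse()  # align backward states with input positions

    # List concatenation: each position sees left and right context.
    return [f + b for f, b in zip(forward, backward)]


def running_sum_step(x, state):
    """Toy stand-in for a recurrent cell (an assumption, not a GRU)."""
    return [s + v for s, v in zip(state, x)]


states = bidirectional_encode(
    [[1.0], [2.0], [3.0]], running_sum_step, running_sum_step, [0.0]
)
```

In this toy run the first position already carries the full right-to-left sum and the last position the full left-to-right sum, which mirrors why information about a whole word is available at both its start and its end.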

4.2 The char2word encoder

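The character-to-word-level encoder (referred to as the char2word encoder) samples states from the forward pass of the char encoder defined in the above section. The states it samples are based on the locations of spaces in the original text, resulting in a sequence of outputs from the char encoder that essentially represents the words from the text and acts as their embeddings. We sample the indices from $\overrightarrow{h}$, such that

$$h^{\mathrm{spaces}}_k = \overrightarrow{h}_{i_k}, \quad i_k \in S$$

where $\overrightarrow{h}$ is the sequence of forward hidden states of the char encoder (section 4.1) and $S = (i_1, i_2, \dots)$ is an ordered list of indices defining spaces in the input sequence $x$. Given $h^{\mathrm{spaces}}$, the bi-directional encoder of section 4.1 is then applied with $h^{\mathrm{spaces}}_k$ replacing the embedded character input.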

The char2word encoder is illustrated with the yellow arrows and blue circles in figure 1.

As a result of this “downsampling” by using spaces, the char2word encoder only has about a fifth of the hidden states the char encoder has. As we described in the introduction, the computationally expensive part of the encoder-decoder with attention is the attention part. By significantly reducing the number of encoder states, we could train the char2word encoder in half the time compared to the char encoder.
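The sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, integers stand in for the forward hidden states, and the sketch samples only at space positions (how the final word of a sentence is handled is not stated in the text and is left out here).

```python
def space_indices(text):
    """Ordered positions of the space characters in the input text."""
    return [i for i, ch in enumerate(text) if ch == " "]


def char2word_sample(forward_states, text):
    """Keep only the forward states located at space positions."""
    return [forward_states[i] for i in space_indices(text)]


text = "a cat sat on mats"
# Stand-in for the forward char-encoder states: one entry per character.
forward_states = list(range(len(text)))

sampled = char2word_sample(forward_states, text)

# Number of positions the attention mechanism must score per output character.
positions_char = len(forward_states)
positions_char2word = len(sampled)
```

For this toy sentence the attention mechanism would score 4 positions instead of 17 for every decoded character.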

4.3 The char decoder

Our character-level decoder (referred to as the char decoder) works with characters as the smallest unit of computation and decodes one character at a time. The decoder uses an RNN and the attention mechanism [Bahdanau et al., 2014] when decoding each character.

The new state in our decoder RNN, $s_t$, as defined in equation 5 is computed as follows

$$s_t = f(s_{t-1}, E^{\mathrm{trg}} y_{t-1}, c_t)$$

where $y_{t-1} \in \{0, 1\}^{V_{\mathrm{trg}}}$ is a one-hot encoded vector with $V_{\mathrm{trg}}$ being the number of input classes, $E^{\mathrm{trg}} \in \mathbb{R}^{d \times V_{\mathrm{trg}}}$ is an embedding matrix with $d$ being the size of the embedding, and $f$ is an RNN function starting from an initial state $s_0$.

4.3.1 Attention mechanism

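The attention model (defined in section 3.1.1) is used to compute the context $c_t$ for time step $t$, which is utilised by the decoder to perform variable length attention. The attention function, $a$, was parametrised as

$$a(s_{t-1}, h_j) = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_j)$$

where $v_a \in \mathbb{R}^{n_a}$, $W_a \in \mathbb{R}^{n_a \times n_s}$, $U_a \in \mathbb{R}^{n_a \times n_h}$, $n_a$ is the number of attention units, $n_s$ is the number of hidden units in the decoder and $n_h$ is the number of hidden units in the encoder. As $U_a h_j$ does not depend on $t$, we can pre-compute it in advance for optimisation purposes.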
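The pre-computation mentioned above can be sketched as follows, assuming the additive scoring form of Bahdanau et al. [2014], $v^{\top} \tanh(W s_{t-1} + U h_j)$, without bias terms. The function names, the absence of biases, and the one-dimensional toy parameters are assumptions for illustration only.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def precompute_encoder_projection(U_a, annotations):
    """Project every encoder annotation once; reused at all decoding steps."""
    return [matvec(U_a, h) for h in annotations]


def attention_scores(v_a, W_a, prev_state, projected):
    """Score each encoder position against the previous decoder state."""
    query = matvec(W_a, prev_state)  # computed once per decoding step
    return [
        sum(v * math.tanh(q + p) for v, q, p in zip(v_a, query, proj))
        for proj in projected
    ]


# Toy parameters (hypothetical): one attention unit, one hidden unit.
W_a, U_a, v_a = [[1.0]], [[1.0]], [1.0]
annotations = [[0.0], [1.0]]

projected = precompute_encoder_projection(U_a, annotations)  # done once
scores = attention_scores(v_a, W_a, [0.0], projected)        # done per step
```

Only the decoder-side projection is recomputed at each step; the encoder-side projection is shared across all output characters of the sentence.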

4.3.2 Output function

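The output of the decoder uses a linear combination of the current hidden state in the decoder, $s_t$, followed by a softmax function:

$$p(\cdot \mid y_1, \dots, y_{t-1}, \mathbf{x}) = \mathrm{softmax}(W_o s_t + b_o)$$

where $W_o \in \mathbb{R}^{V_{\mathrm{trg}} \times n_s}$, $b_o \in \mathbb{R}^{V_{\mathrm{trg}}}$ and $V_{\mathrm{trg}}$ is the number of output classes.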

We use the same decoder with attention for both the char-to-char and the char2word-to-char model, which are illustrated in figure 3 and figure 1, respectively. The main difference is that the char2word-to-char model (figure 1) has a significantly lower number of units to attend over.

Figure 3: Our char-to-char model encoding and decoding a sentence from English to Latin on character level. “-” marks sequence start for decoding.

5 Experiments

All models were evaluated using the BLEU score [Papineni et al., 2002], computed with the multi-bleu.perl script from Moses.

5.1 Data and Preprocessing

We trained our models on two different datasets of language pairs from the WMT’15: En-De (4.5M) and De-En (4.5M). For validation we used the newstest2013 and for testing we used newstest2014 and newstest2015.

The data preprocessing applied is identical to that of Chung et al. [2016] on En-De and De-En, with the source sentence length set to 250 characters instead of 50 BPE units. In short, we normalise punctuation and tokenise using the Moses scripts normalize-punctuation.perl and tokenizer.perl. We exclude all samples where the source sentence exceeds 250 characters and the target sentence exceeds 500 characters. The source and target languages have separate dictionaries, each containing the 300 most common characters. Characters not in the dictionary are replaced with an unknown token.
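Building a character dictionary and mapping out-of-dictionary characters to an unknown token can be sketched as below. The function names and the token string `<UNK>` are hypothetical, and whether the unknown token counts towards the 300 entries is not stated in the text; the sketch keeps it outside the dictionary.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical name for the unknown token


def build_char_dictionary(sentences, max_size=300):
    """Return the set of the max_size most common characters."""
    counts = Counter(ch for sentence in sentences for ch in sentence)
    return {ch for ch, _ in counts.most_common(max_size)}


def replace_unknown(sentence, dictionary):
    """Map a sentence to a list of characters, masking unseen ones."""
    return [ch if ch in dictionary else UNK for ch in sentence]


# Toy corpus with a tiny dictionary size so that masking is visible.
dictionary = build_char_dictionary(["aab", "abc"], max_size=2)
encoded = replace_unknown("cab", dictionary)
```

A separate dictionary would be built in the same way for each language, from the source-side and target-side sentences respectively.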

5.2 Training details

The model hyperparameters are listed in tables 1 and 2. For the RNN functions in the encoder and the decoder we use gated recurrent units (GRU) [Cho et al., 2014b]. For training we use back-propagation with stochastic gradient descent using the Adam optimiser [Kingma and Ba, 2014], together with L2 regularisation. In order to stabilise training and avoid exploding gradients, the norms of the gradients are clipped at a threshold before updating the parameters. All models are implemented using TensorFlow [Abadi et al., 2016] and the code and details of the setup are available on GitHub.
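Gradient norm clipping can be sketched as follows. This is an illustration under assumptions: it clips by the global norm across all gradients (the text does not say whether clipping is global or per tensor), and the threshold of 1.0 in the example is a placeholder, not the value used in the experiments.

```python
import math


def clip_by_global_norm(gradients, threshold):
    """Rescale all gradients together if their joint L2 norm is too large.

    gradients: list of flat gradient vectors (lists of floats)
    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if norm <= threshold:
        return gradients, norm
    scale = threshold / norm
    return [[g * scale for g in grad] for grad in gradients], norm


# Placeholder threshold of 1.0; the joint norm of these gradients is 5.0.
clipped, norm = clip_by_global_norm([[3.0], [4.0]], threshold=1.0)
```

Rescaling all gradients by the same factor preserves the direction of the update while bounding its size.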

| layer | no. units |
|---|---|
| input alphabet size ($V_{\mathrm{src}}$) | 300 |
| embedding sizes | 256 |
| char RNN (forward) | 400 |
| char RNN (backward) | 400 |
| attention | 300 |
| char decoder | 400 |
| target alphabet size ($V_{\mathrm{trg}}$) | 300 |

Table 1: Hyperparameter values used for training the char-to-char model, where $V_{\mathrm{src}}$ and $V_{\mathrm{trg}}$ represent the number of classes in the source and target languages, respectively.

| layer | no. units |
|---|---|
| input alphabet size ($V_{\mathrm{src}}$) | 300 |
| embedding sizes | 256 |
| char RNN (forward) | 400 |
| spaces RNN (forward) | 400 |
| spaces RNN (backward) | 400 |
| attention | 300 |
| char decoder | 400 |
| target alphabet size ($V_{\mathrm{trg}}$) | 300 |

Table 2: Hyperparameter values used for training the char2word-to-char model, where $V_{\mathrm{src}}$ and $V_{\mathrm{trg}}$ represent the number of classes in the source and target languages, respectively.

5.2.1 Batch details

When training with batches, all sequences must be padded to match the longest sequence in the batch, and the recurrent layers must do the full set of computations for all samples and all timesteps, which can result in a lot of wasted resources [Hannun et al., 2014] (see figure 4). Training translation models is further complicated by the fact that source and target sentences, while correlated, may have different lengths, and it is necessary to consider both when constructing batches in order to utilise computation power and RAM optimally.

Figure 4: A regular batch with random samples.

To circumvent this issue, we start each epoch by shuffling all samples in the dataset and sorting them with a stable sorting algorithm according to both the source and target sentence lengths. This ensures that any two samples in the dataset that have almost the same source and target sentence lengths are located close to each other in the sorted list, while the exact order of samples varies between epochs. To pack a batch, we simply add samples from the sorted sample list to the batch until we reach the maximum total number of characters allowed for the full batch including padding, after which we start on a new batch. Finally, all the batches are fed to the model in random order for training until all samples have been trained on, and a new epoch begins. Figure 5 illustrates what such dynamic batches might look like.

Figure 5: Our dynamic batches of variable batch size and sequence length.
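The batching procedure can be sketched as below. This is a minimal sketch, not the released code: `make_batches` is a hypothetical name, and the padded size of a batch is assumed to be the batch size times the sum of its longest source and longest target length, which is one plausible reading of "total allowed characters".

```python
import random


def make_batches(pairs, max_chars, seed=0):
    """Group (source, target) string pairs into length-homogeneous batches."""
    rng = random.Random(seed)
    samples = list(pairs)
    rng.shuffle(samples)  # varies the order of equal-length samples per epoch
    # Python's sort is stable, so ties keep their shuffled order.
    samples.sort(key=lambda p: (len(p[0]), len(p[1])))

    batches, current = [], []
    max_src = max_trg = 0
    for src, trg in samples:
        new_src = max(max_src, len(src))
        new_trg = max(max_trg, len(trg))
        padded = (len(current) + 1) * (new_src + new_trg)
        if current and padded > max_chars:
            batches.append(current)  # close the full batch
            current = []
            new_src, new_trg = len(src), len(trg)
        current.append((src, trg))
        max_src, max_trg = new_src, new_trg
    if current:
        batches.append(current)

    rng.shuffle(batches)  # feed batches to the model in random order
    return batches


pairs = [("ab", "abc"), ("a", "b"), ("abcd", "abcde"), ("ab", "ab")]
batches = make_batches(pairs, max_chars=10)
```

In this sketch a single sample that exceeds the limit on its own still forms a batch of size one, so no data is dropped; a real pipeline might prefer to filter such samples beforehand.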

5.3 Results

5.3.1 Quantitative

The quantitative results of our models are listed in table 3. Notice that the char2word-to-char model outperforms the char-to-char model on all datasets (an average increase of about 1.28 BLEU across the test sets). This could be an indication that either having hierarchical, word-like representations in the encoder, or simply the fact that the encoder output was significantly smaller, helps in NMT when using a character decoder with attention.

| Model | Language | newstest2013 (validation) | newstest2014 (test) | newstest2015 (test) |
|---|---|---|---|---|
| char-to-char | De–En | 18.89 | 17.97 | 18.04 |
| char2word-to-char | De–En | **20.15** | **19.03** | **19.90** |
| char-to-char | En–De | 15.32 | 14.15 | 16.11 |
| char2word-to-char | En–De | **16.78** | **15.04** | **17.43** |

Table 3: Results on WMT’15. newstest2013 was used as the validation set; newstest2014 and newstest2015 were used as test sets. Bold indicates the best result on each dataset.
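The per-dataset gains and their averages can be recomputed directly from the scores in table 3. The snippet below is plain arithmetic on the reported numbers; the variable names are chosen for this example only.

```python
# (char-to-char, char2word-to-char) BLEU scores copied from table 3.
bleu = {
    ("De-En", "newstest2013"): (18.89, 20.15),
    ("De-En", "newstest2014"): (17.97, 19.03),
    ("De-En", "newstest2015"): (18.04, 19.90),
    ("En-De", "newstest2013"): (15.32, 16.78),
    ("En-De", "newstest2014"): (14.15, 15.04),
    ("En-De", "newstest2015"): (16.11, 17.43),
}

# Gain of the hierarchical encoder per language pair and dataset.
gains = {key: round(c2w - c2c, 2) for key, (c2c, c2w) in bleu.items()}

# newstest2013 is the validation set; the other two are test sets.
test_gains = [g for (_, name), g in gains.items() if name != "newstest2013"]
mean_test_gain = sum(test_gains) / len(test_gains)  # about 1.28 BLEU
mean_all_gain = sum(gains.values()) / len(gains)    # about 1.31 BLEU
```

The gain is positive for every language pair and dataset, ranging from 0.89 to 1.86 BLEU.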

5.3.2 Qualitative

Plotting the attention weights $\alpha$ (defined in section 3.1.1) is popular in NMT research, as these give an indication of where the model found relevant information while decoding. We have provided plots for both our char-to-char and char2word-to-char models in figures 6 and 7. The more intense the blue colour, the higher the value of $\alpha$ at that point. Notice that each column corresponds to the decoding of a single unit, resulting in each column summing to 1.

The char-to-char attention plot, attending over every character, interestingly indicates that words that would normally be considered out-of-dictionary (see Lisette Verhaig in figure 6) are translated character by character, whereas common words are attended at the end/start of each word to use as a single embedding. (As we use a bi-directional RNN, full information will be available at both the end and start of a word.) This observation might explain why using hierarchical encoding improves performance. BPE-based models and the hybrid word-char model by Luong and Manning [2016] effectively work in the same manner: when translating common words, BPE and hybrid word-char models work on a word level, whereas with rare words BPE works with sub-parts of the word (maybe even characters) and the hybrid approach uses character representations.

The char2word-to-char attention plot has words, or character-made embeddings of words, to perform attention over. The attention plot seems very similar to the BPE-to-char plot presented by Chung et al. [2016]. This might indicate that it is possible to imitate lexeme (word) based models while using smaller dictionaries and preserving the relationships between words.

Figure 6: Attention plot of our char-to-char model encoding and decoding a sentence from English to German.
Figure 7: Attention plot of our char2word-to-char model encoding and decoding a sentence from English to German.

6 Conclusion

We have proposed a pure character-based encoder-decoder model with attention using a hierarchical encoding. We find that the hierarchical encoding, using our newly proposed char2word encoding mechanism, improves the BLEU score by an average of about 1.28 on the test sets (table 3) compared to models using a standard character encoder.

Qualitatively, we find that the attention of a character model without hierarchical encoding learns to make hierarchical representations even without being explicitly told to do so, by switching between word and character embeddings for common and rare words. This observation is in line with current research on Byte-Pair-Encoding and hybrid word-character models, as these models use word-like embeddings for common words and sub-words or characters for rare words.

Furthermore, qualitatively we find that our hierarchical encoding finds lexemes in the source sentence when decoding similarly to current models with much larger dictionaries using Byte-Pair-Encoding.