Neural Machine Translation (NMT) models are typically trained using a fixed-size lexical vocabulary. In addition to controlling the computational load, this limitation also serves to maintain better distributed representations for the most frequent set of words included in the vocabulary. On the other hand, rare words in the long tail of the lexical distribution are often discarded during translation since they are not found in the vocabulary. The prominent approach to overcoming this limitation is to segment words into subword units (sennrich2016neural) and perform translation based on a vocabulary composed of these units. However, subword segmentation methods generally rely on statistical heuristics that lack any linguistic notion. Moreover, they are typically deployed as a pre-processing step before training the NMT model; hence, the predicted set of subword units is essentially not optimized for the translation task. Recently, cherry2018revisiting extended the subword-based NMT approach to implement the translation model directly at the level of characters, which could reach performance comparable to the subword-based model, although this requires much larger networks that may be more difficult to train. The major reason for this requirement may be that treating characters as individual tokens at the same level and processing the input sequences in linear time increases the difficulty of the learning task, as translation must then be modeled as a mapping between the characters of two languages. The increased sequence lengths resulting from processing sentences as sequences of characters also raise the computational cost and pose a possible limitation, since sequence models typically have limited capacity for remembering long-distance context.
In many languages, words are the core atomic units of semantic and syntactic structure, and their explicit modeling should be beneficial in learning distributed representations for translation.
There have been early studies in NMT which proposed to perform translation at the level of characters while also regarding the word boundaries in the translation model through a hierarchical decoding procedure, although these approaches were generally deployed through hybrid systems, either as a back-off solution to translate unknown words luong-hybrid, or as pre-trained components DBLP:journals/corr/LingTDB15 .
In this paper, we explore the benefit of achieving character-level NMT by processing sentences at multi-level dynamic time steps defined by the word boundaries, integrating a notion of explicit hierarchy into the decoder. In our model, all word representations are learned compositionally from character embeddings using bi-directional recurrent neural networks (bi-RNNs) (schuster1997bidirectional), and decoding is performed by generating each word character by character based on the predicted word representation, through a hierarchical beam search algorithm that takes advantage of the hierarchical architecture while generating translations.
We present the results of an extensive evaluation comparing conventional approaches for open-vocabulary NMT in the machine translation task from English into five morphologically-rich languages, where each language belongs to a different language family and has a distinct morphological typology. Our findings show that using the hierarchical decoding approach, the NMT models are able to obtain higher translation accuracy than the subword-based NMT models in many languages while using significantly fewer parameters, where the character-based models implemented with the same computational complexity may still struggle to reach comparable performance. Our analysis also shows that explicit modeling of word boundaries in character-level NMT is advantageous for capturing longer-term contextual dependencies and generalizing to morphological variations in the target language.
2 Neural Machine Translation
In this paper, we use recurrent NMT architectures based on the model developed by Bahdanau et al. (bahdanau2014neural), which translates a source sequence $x = (x_1, x_2, \ldots, x_m)$ into a target sequence $y = (y_1, y_2, \ldots, y_l)$ using the decomposition

$$p(y|x) = \prod_{j=1}^{l} p(y_j | y_{<j}, x)$$

where $y_{<j}$ is the target sentence history defined by the sequence $(y_1, y_2, \ldots, y_{j-1})$.
The inputs of the network are one-hot vectors representing the tokens in the source sentence, which are binary vectors with a single bit set to 1 to identify a specific token in the vocabulary. Each one-hot vector is then mapped to a dense continuous representation, i.e. an embedding, of the source tokens via a look-up table. The representation of the source sequence is computed using a multi-layer bi-RNN, also referred to as the encoder, which maps x into m dense vectors corresponding to the hidden states of the last bi-RNN layer updated in response to the input token embeddings.
The generation of the translation of the source sentence is called decoding, and it is conventionally implemented in an auto-regressive mode, where each token in the target sentence is generated based on a sequential classification procedure defined over the target token vocabulary. In this decoding architecture, a unidirectional recurrent neural network (RNN) predicts the most likely output token in the target sequence using an approximate search algorithm, based on the previous target token $y_{j-1}$, represented with the embedding of the previous token in the target sequence, the previous decoder hidden state, representing the sequence history, and the current attention context in the source sequence, represented by the context vector $c_j$. The latter is a linear combination of the encoder hidden states, whose weights are dynamically computed by a dot-product-based similarity metric called the attention model (luong2015effective).
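To make the attention step concrete, the following is a minimal sketch in plain Python (not the paper's implementation; the toy vectors and the `dot_product_attention` helper are ours): the context vector is formed as the softmax-weighted linear combination of encoder hidden states.

```python
import math

def dot_product_attention(decoder_state, encoder_states):
    """Compute a context vector as an attention-weighted sum of
    encoder hidden states (toy vectors as plain Python lists)."""
    # Dot-product similarity between the decoder state and each encoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    # Softmax-normalize the scores into attention weights.
    exp_scores = [math.exp(s - max(scores)) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # Context vector: linear combination of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Toy 2-dimensional states: the first encoder state is most similar
# to the decoder state, so it receives the largest weight.
weights, context = dot_product_attention(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```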
The probability of generating each target word is estimated via a softmax function

$$p(y_j = y_i | y_{<j}, x) = \frac{\exp(y_i^\top o_j)}{\sum_{k=1}^{V} \exp(y_k^\top o_j)}$$

where $y_i$ is the one-hot vector of the $i$-th token in the target vocabulary of size $V$, and $o_j$ is the decoder output vector for the target word $y_j$.
The model is trained by maximizing the log-likelihood of a parallel training set via stochastic gradient descent (SGD), where the gradients are computed with the backpropagation through time (werbos1990backpropagation) algorithm.
Due to the softmax function in Equation 2, the size of the target vocabulary plays an important role in defining the computational complexity of the model. In the standard architecture, the embedding matrices account for the vast majority of the network parameters, thus the number of embeddings that can be learned and stored efficiently needs to be limited. Moreover, for many words corresponding to the long tail of the lexical distribution, the model fails to learn accurate embeddings, as they are rarely observed in varying contexts; the model vocabulary is therefore typically limited to the most frequent set of words in the target language. This creates an important bottleneck in the vocabulary coverage of the model, which is especially crucial when translating into low-resource and morphologically-rich languages, which often exhibit a high level of sparsity in the lexical distribution.
The standard approach to overcome this limitation is now to apply a statistical segmentation algorithm on the training corpus which splits words into smaller and more frequent subword units, and to build the model vocabulary from these units. The translation problem is then modeled as a mapping between sequences of subword units in the source and target languages (sennrich2016neural; wu2016google; ataman2017linguistically). The most popular statistical segmentation method is Byte-Pair Encoding (BPE) (sennrich2016neural), which finds the optimal description of a corpus vocabulary by iteratively merging the most frequent character sequences. One problem with the subword-based NMT approach is that segmentation methods are typically implemented as pre-processing steps to NMT, and are thus not optimized jointly with the translation task in an end-to-end fashion. This can lead to morphological errors at different levels and cause loss of semantic or syntactic information (ataman2017linguistically), due to the ambiguity in subword embeddings. In fact, recent studies have shown that the same approach can be extended to implement the NMT model directly at the level of characters, which could alleviate potential morphological errors due to subword segmentation. Although character-level NMT models have shown the potential to obtain performance comparable to subword-based NMT models, this requires increasing the computational cost of the model, defined by the network parameters (kreutzer2018learning; cherry2018revisiting). As given in Figure 0(a), implementing the NMT decoder directly at the level of characters leads to repetitive passes over the attention mechanism and the RNNs modeling the target language for each character in the sentence. Since the distributed representations of characters are shared among different word- and sentence-level contexts, the translation task requires a network with high capacity to learn this vastly dynamic context.
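As an illustration of the iterative pair-merging idea behind BPE, here is a minimal pure-Python sketch of learning merge rules from a word-frequency vocabulary (the `learn_bpe` helper and the toy corpus are our own, not the reference implementation):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules: repeatedly merge the most frequent
    adjacent symbol pair in the (word-frequency) vocabulary."""
    vocab = Counter()
    for word in corpus:
        vocab[tuple(word) + ('</w>',)] += 1  # mark the word boundary
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "low", "lower", "lowest"], 3)
```

On this toy corpus the first merges join the shared stem, e.g. `('l', 'o')` then `('lo', 'w')`, illustrating how frequent character sequences become subword units.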
3 Hierarchical Decoding
In this paper, we explore the benefit of integrating a notion of hierarchy into the decoding architecture which could increase the computational efficiency in character-level NMT, following the work of luong-hybrid. In this architecture, the input embedding layer of the decoder is augmented with a character-level bi-RNN, which estimates a composition function over the embeddings of the characters in each word in order to compute the distributed representations of target words.
Given a bi-RNN with a forward ($f$) and backward ($b$) layer, the word representation $w_t$ of a token of $n$ characters is computed from the hidden states $h_n^f$ and $h_1^b$, i.e. the final outputs of the forward and backward RNNs, as follows:

$$w_t = W^f h_n^f + W^b h_1^b + b$$

where $W^f$ and $W^b$ are weight matrices associated to each RNN and $b$ is a bias vector. The embeddings of characters and the parameters of the word composition layer are jointly learned while training the NMT model. Since all target word representations are computed compositionally, the hierarchical decoding approach eliminates the necessity of storing word embeddings, significantly reducing the number of parameters.
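The compositional word representation can be illustrated with a deliberately simplified sketch (plain Python, identity recurrence weights in place of learned GRU parameters; `rnn_pass` and `compose_word` are hypothetical helpers, not the actual model):

```python
import math

def rnn_pass(char_embs, dim):
    """A minimal Elman-style recurrence, h_t = tanh(x_t + h_{t-1}),
    with identity weights for brevity; returns the final hidden state."""
    h = [0.0] * dim
    for x in char_embs:
        h = [math.tanh(xi + hi) for xi, hi in zip(x, h)]
    return h

def compose_word(char_embs):
    """Word representation from character embeddings: run the sequence
    forward and backward, then combine the two final states (here with
    a simple sum plus bias, standing in for the learned matrices
    W^f, W^b and the bias b)."""
    dim = len(char_embs[0])
    h_fwd = rnn_pass(char_embs, dim)
    h_bwd = rnn_pass(list(reversed(char_embs)), dim)
    bias = [0.0] * dim
    return [f + b + c for f, b, c in zip(h_fwd, h_bwd, bias)]

# Toy 2-dimensional "character embeddings" for a 3-character word.
word_vec = compose_word([[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]])
```

Because the representation is a function of the character sequence, morphological variants of a word share parameters instead of each requiring a stored embedding.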
Each word in the target sentence is predicted by an RNN operating at the level of words, using the compositional target word representations, the target sentence history, and the context vector computed by the attention mechanism only at the beginning of a new word generation. Instead of classifying the predicted target word in the vocabulary, its distributed representation is fed to a character-level RNN to generate the surface form of the word one character at a time, by modeling the probability of observing the $k$-th character of the $j$-th word of length $n$, given the previous words in the sequence and the previous characters in the word.
The translation probability is then decomposed as:

$$p(y|x) = \prod_{j=1}^{l} \prod_{k=1}^{n} p(y_j^k | y_{<j}, y_j^{<k}, x)$$
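Under this decomposition, the log-probability of a sentence is simply the sum of per-character log-probabilities across words; a toy sketch (the probability values below are illustrative stand-ins for the character RNN's outputs, not model predictions):

```python
import math

def hierarchical_log_prob(word_char_probs):
    """Log-probability of a target sentence under the hierarchical
    decomposition: sum over words, then over the characters within
    each word, of log p(char | previous words, previous chars)."""
    return sum(math.log(p) for word in word_char_probs for p in word)

# Two words: one of 2 characters, one of 3 characters.
lp = hierarchical_log_prob([[0.9, 0.8], [0.7, 0.6, 0.5]])
```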
Similar to luong-hybrid, the information necessary to generate the surface form is encoded into the attentional vector $\tilde{h}_t$:

$$\tilde{h}_t = \tanh(W [c_t ; h_t])$$
where $h_t$ is the hidden state of the word-level RNN representing the current target context. The attentional vector is used to initialize the character RNN; after the generation of the first character in the word, character decoding continues in an auto-regressive mode, where the embedding of each character is fed to the RNN to predict the next character in the word. The decoder consecutively iterates over the words and characters in the target sentence, where each RNN is updated at dynamic time steps based on the word boundaries.
4 Hierarchical Beam Search
In order to achieve efficient decoding with the hierarchical NMT decoder, we implement a hierarchical beam search algorithm. Similar to the standard algorithm, the beam search starts by predicting the most likely characters and storing them in a character beam along with their probabilities. The beams are reset each time the generation of a word is complete and the most likely words are used to update the hidden states of the word-level RNN, which are fed to the character RNN to continue the beam search. When the beam search is complete, the most likely character sequence is generated as the best hypothesis.
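A simplified single-word version of the character-level beam search can be sketched as follows (plain Python; the toy `step_probs` table stands in for the character RNN's softmax outputs, and the sketch omits the word-level state updates of the full hierarchical algorithm):

```python
import math

def char_beam_search(step_probs, beam_size, max_len, eow="$"):
    """Character-level beam search for a single word: expand each
    hypothesis with candidate next characters, keep the `beam_size`
    best by log-probability, and finish a hypothesis at the
    end-of-word symbol. `step_probs(prefix)` returns {char: prob}."""
    beams = [("", 0.0)]          # (prefix, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for ch, p in step_probs(prefix).items():
                candidates.append((prefix + ch, logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_size]:
            if prefix.endswith(eow):
                finished.append((prefix[:-1], logp))  # strip end symbol
            else:
                beams.append((prefix, logp))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])[0] if finished else ""

# A toy distribution that prefers spelling the word "cat".
def toy_model(prefix):
    table = {"": {"c": 0.7, "b": 0.3}, "c": {"a": 0.9, "o": 0.1},
             "ca": {"t": 0.8, "r": 0.2}, "cat": {"$": 1.0}}
    return table.get(prefix, {"$": 1.0})

best = char_beam_search(toy_model, beam_size=2, max_len=5)
```

In the full algorithm, the beams are reset at each word boundary and the best words feed back into the word-level RNN before the character search resumes.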
We evaluate decoding architectures using different levels of granularity in the vocabulary units and the attention mechanism, including the standard decoding architecture implemented either with subword (sennrich2016neural) or fully character-level (cherry2018revisiting) units, which constitute the baseline approaches, and the hierarchical decoding architecture, implementing all in Pytorch (paszke2017automatic) within the OpenNMT-py framework (klein2017opennmt). In order to evaluate how each generative method performs in languages with different morphological typology, we model the machine translation task from English into five languages from different language families exhibiting distinct morphological typology: Arabic (templatic), Czech (mostly fusional, partially agglutinative), German (fusional), Italian (fusional) and Turkish (agglutinative). We use the TED Talks corpora (mauro2012wit3) for training the NMT models, which range from 110K to 240K sentences, and the official development and test sets from IWSLT (The International Workshop on Spoken Language Translation) (mauro2017overview). The low-resource settings for the training data allow us to examine the quality of the internal representations learned by each decoder under high data sparseness. In order to evaluate how the performance of each method scales with increasing data size, we also train the models on multi-domain training data using the public data sets from WMT (The Conference on Machine Translation, with a shared task organized for news translation) (bojar2016findings) in the English-to-German direction, followed by an analysis of each model's capability in generalizing to morphological variations in the target language, using the Morpheval (burlot2018wmt) evaluation sets.
All models are implemented using gated recurrent units (GRUs) (cho2014properties) with the same number of parameters. The hierarchical decoding model implements a 3-layer GRU architecture, which is compared with a character-level decoder that also uses a 3-layer stacked GRU architecture. The subword-level decoder has a 2-layer stacked GRU architecture, to account also for its larger number of embedding parameters. The models using the standard architecture have the attention mechanism after the first GRU layer and residual connections after the second layer (barone2017deep). The hierarchical decoder implements the attention mechanism after the second layer and has a residual connection between the first and second layers.
The source sides of the data used for training character-level NMT models are segmented using BPE with 16,000 merge rules on the IWSLT data and 32,000 on WMT. For subword-based models we learn shared BPE merge rules for 16,000 (IWSLT) and 32,000 (WMT) units. The models use an embedding and hidden unit size of 512 under low-resource (IWSLT) and 1024 under high-resource (WMT) settings, and are trained using the Adam (kinga2015method) optimizer with a learning rate of 0.0003 and decay of 0.5, a batch size of 100 and a dropout of 0.2. Decoding in all models is performed with a beam size of 5. The accuracy of each output is measured in terms of the BLEU metric (papineni2002bleu) and the significance of the improvements is measured using bootstrap hypothesis testing (wasserman1989bootstrapping).
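The bootstrap significance test can be sketched as paired bootstrap resampling over sentence-level scores (a simplified stand-in: real MT evaluations resample the test set and recompute corpus-level BLEU rather than summing per-sentence scores; the helper and toy scores below are ours):

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Paired bootstrap resampling: estimate how often system A beats
    system B on test sets resampled (with replacement) from the
    original sentence-level scores."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample indices
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples  # fraction of resamples where A wins

# Toy sentence-level scores for two systems; A beats B on every sentence.
p = paired_bootstrap([0.4, 0.5, 0.6, 0.55], [0.3, 0.45, 0.5, 0.4])
```

A win fraction close to 1.0 (equivalently, a small p-value for the opposite outcome) indicates the improvement is unlikely to be an artifact of test-set sampling.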
| Paradigm contrast feature | | | |
|---|---|---|---|
| Positive vs. comparative adjective | 71.4 | 68.4 | 70.1 |
| Present vs. future tense | 85.7 | 92.0 | 90.6 |
| Singular vs. plural noun | 88.2 | 88.8 | 88.6 |
| Present vs. past tense | 92.0 | 93.3 | 95.4 |
| Indicative vs. conditional mode | 86.4 | 88.2 | 92.3 |
| Pronoun vs. nouns (gender) | 96.5 | 97.4 | 98.8 |
| Pronoun vs. nouns (number) | 95.4 | 96.0 | 93.4 |
| Positive vs. superlative adjective | 76.2 | 68.2 | 80.4 |
| Simple vs. coordinated verbs (number) | 96.4 | 93.4 | 97.2 |
| Simple vs. coordinated verbs (person) | 92.3 | 92.8 | 93.5 |
| Simple vs. coordinated verbs (tense) | 82.4 | 86.0 | 90.2 |
The results of the experiments given in Table 1 show that the hierarchical decoder can reach performance comparable to or better than the NMT model based on subword units in all languages while using almost three times fewer parameters. The improvements are especially evident in Arabic and Turkish, the languages with the most complex morphology, where the accuracy of the hierarchical decoder is 1.28 and 1.22 BLEU points higher, respectively, and comparable in Czech, Italian and German, which represent the fusional languages. In Czech, the hierarchical model outperforms the subword-based model by 0.19 BLEU points and in Italian by 0.41 BLEU points. The subword-based NMT model achieves the best performance in German, a language that is rich in compounding, where explicit subword segmentation might allow learning better representations for translation units.
The fully character-level NMT model, on the other hand, obtains higher translation accuracy than the hierarchical model in Turkish, with an improvement of 0.91 BLEU points, and in Czech, with 0.15 BLEU points. As can be seen from the statistical characteristics of the training sets, illustrated by plotting the token-to-type ratios in each language (Figure 2), these two directions constitute the most sparse settings: Turkish has the highest amount of sparsity in the benchmark, followed by Czech, and the improvements seem to be proportional to the amount of sparsity in the language. This suggests that in case of high lexical sparsity, learning to translate based on representations of characters might aid in reducing contextual sparsity, allowing the model to learn better distributed representations. As the training data size increases, one would expect the likelihood of observing rare words to decrease, especially in languages with low morphological complexity, along with the significance of representing rare and unseen words (cherry2018revisiting). Our results support this hypothesis: decreasing lexical sparsity, whether through larger training data or lower morphological complexity of the target language, eliminates the advantage of character-level translation. In Arabic and Italian, where the training data is almost twice as large as for the other languages, the hierarchical model provides improvements of 2.83 and 2.31 BLEU points over the character-level NMT model. In German, the fully character-level NMT model still achieves the lowest accuracy, 2.06 BLEU points below the subword-based model. This might be due to the increased level of contextual ambiguity making it difficult to learn reliable character embeddings when the model is trained over larger corpora.
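The token-to-type ratio used in this analysis is straightforward to compute (toy example with a hypothetical word list; lower ratios indicate higher lexical sparsity, i.e. more unique word forms per running token):

```python
def token_type_ratio(tokens):
    """Ratio of running tokens to unique word types in a corpus.
    Morphologically-rich languages tend to have lower ratios, since
    inflection multiplies the number of distinct surface forms."""
    return len(tokens) / len(set(tokens))

# 8 tokens, 6 unique types ("the" appears three times).
ratio = token_type_ratio("the cat sat on the mat the end".split())
```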
Another factor which might contribute to the lower performance of character-level models is the average sentence length, which is much longer than the sentence lengths resulting from subword segmentation (Figure 2).
| | |
|---|---|
| Input | when a friend of mine told me that I needed to see this great video about a guy protesting bicycle fines in New York City, I admit I wasn’t very interested. |
| Output (subword-based decoder) | bir arkadaşım New York’ta bisiklet protestosunu protesto etmek için bu filmi izlemeye ihtiyacım olduğunu söylemişti. |
| Output (character-based decoder) | bana bir arkadaşım bana New York’ta bir adam ile ilgili bir adam hakkında görmem gereken bir adam hakkında görmem gerektiğini söyledi. |
| Output (hierarchical decoder) | bir arkadaşım New York’ta bisiklet yapmaya ihtiyacım olduğunu söylediği zaman, |
| Reference | bir arkadaşım New York şehrindeki bisiklet cezalarını protesto eden bir adamın bu harika videosunu izlemem gerektiğini söylediğinde, kabul etmeliyim ki çok da ilgilenmemiştim. |
In the experiments conducted in the English-to-German translation direction, the results of which are given in Table 3, the accuracy obtained with the hierarchical and subword-based NMT decoders increases significantly with the extension of the training data; the subword-based model obtains the best accuracy, followed by the hierarchical model, while the character-level NMT model obtains significantly lower accuracy than both approaches. Studies have shown that character-level NMT models could potentially reach the same performance as subword-based NMT models (cherry2018revisiting), although this might require increasing the capacity of the network. On the other hand, the consistency of the accuracy obtained using the hierarchical decoding model from low- to mid-resource settings suggests that explicit modeling of word boundaries aids in achieving a more efficient solution to character-level translation.
Since solely relying on BLEU scores may not be sufficient for understanding the generative properties of different NMT models, we perform an additional evaluation to assess the capacity of the models in learning syntactic or morphological dependencies, using the Morpheval test suites, which consist of sentence pairs that differ by one morphological contrast; output accuracy is measured as the percentage of translations that convey the morphological contrast in the target language. Table 2 lists the performance of the NMT models implementing decoding at the level of subwords, characters, or hierarchical word-character units in capturing variances in each individual morphological paradigm and preserving the agreement between inflected words and their dependent lexical items. The results of our analysis support the benefit of using BPE in German as an open-vocabulary NMT solution, where the subword-based model obtains the highest accuracy in most of the morphological paradigm generation tasks. The character-level model proves promising in capturing a few morphological features better than the former, such as negation or comparative adjectives, and in capturing agreement features, the hierarchical decoding model generally performs better than the subword-based model. The dominance of subword-based models could be due to the high level of compounding in German, where segmentation is possibly beneficial in splitting compound words and aiding better syntactic modeling in some cases. These results generally suggest the importance of processing the sentence context at the word level in order to induce a better notion of syntax during generation.
In order to better illustrate the differences in the outputs of each NMT model, we also present some sample translations in Table 4, obtained by translating English into Turkish using the NMT models trained on the TED Talks corpus. The input sentences are selected to be sufficiently long so that one can see the ability of each model to capture long-distance dependencies in context. The input sentence is from a typical conversation, which requires remembering a long context with many references. We highlight the words in each output that are generated for the first time. Most of the models fail to generate a complete translation, starting to forget the sentence history after generating a few words, as indicated by repetitions of previously generated words. The character-level decoder seems to have the shortest memory span, followed by the subword-based decoder, which completely omits the second half of the sentence. Despite omitting the translations of the last four words in the input and some lexical errors, the hierarchical decoder is the only model that generates a meaningful and grammatically-correct sentence, suggesting that modeling translation based on a context defined at the lexical level might help to learn better grammatical and contextual dependencies and to remember longer history.
Although current methodology in NMT allows more efficient processing by implementing feed-forward architectures vaswani2017attention, our approach can conceptually be applied within these frameworks. In this paper, we limit the evaluation to recurrent architectures for comparison to previous work, including luong-hybrid, sennrich2016neural and cherry2018revisiting, and leave implementation of hierarchical decoding with feed-forward architectures to future work.
In this paper, we explored the idea of performing the decoding procedure in NMT in a multi-dimensional search space defined by word- and character-level units via a hierarchical decoding structure and beam search algorithm. Our model obtained performance comparable to or better than conventional open-vocabulary NMT solutions, including subword- and character-level NMT methods, in many languages while using a significantly smaller number of parameters, showing promising application under high-resource settings. Our software is available for public usage (https://github.com/d-ataman/Char-NMT).
This project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements 825299 (GoURMET) and 688139 (SUMMA).