I Introduction
The rapid adoption of neural network (NN) based approaches to machine translation (MT) has been attributed to the availability of massive datasets, the affordability of high-performing commodity computers, and the accelerated progress in fields such as image recognition, computational systems biology, and unmanned vehicles. Research activity in NN-based machine translation has been taking place since the 1990s, but statistical machine translation (SMT) soared along with the successes of machine learning. SMT takes a rule-based, data-driven approach and includes language models such as word-based (n-gram), phrase-based, syntax-based, and hierarchical approaches. Neural machine translation (NMT), on the other hand, does not require predefined rules, but learns linguistic regularities from statistical models, sequences, and occurrences in large corpora. Models trained using NNs produce even higher accuracy than existing SMT approaches, but training can take anywhere from days to weeks to complete. Optimal training strategies are often difficult to find, given the dimensionality of the parameter space and its effect on parameter exploration.
Training a neural network involves the estimation of a huge number of parameters. Ideally, optimization seeks the global optimum, but in non-convex problems such as neural networks, global optimality is given up and a local minimum in the parameter space is considered sufficient to obtain models that generalize beyond the training data. Besides yielding better performance, choosing an appropriate optimization strategy can accelerate the training phase of a neural network and provide higher translation accuracy.
Modeling and training problems are of utmost importance in neural machine translation systems. This work empirically investigates combinations of optimization methods to train an NMT system. The following concerns are addressed: translation performance, training stability, and convergence speed. Specifically, we investigate how well, how fast, and how stably different optimization algorithms find an appropriate local minimum, as well as how a combination of these optimizations can address problems that arise in model training. Results demonstrate that applying a combination of optimizations leads to faster convergence, better translation performance, and more regularized behavior, compared to selecting an optimizer in isolation.
II Related Work
Many efforts have characterized the behavior of different optimization techniques, and others have examined optimizer performance by investigating the loss surface of image classification tasks. Bergstra et al. discuss various techniques for hyperparameter tuning and search strategies, concluding that random search outperforms grid search [1]. Likewise, the authors in [2, 3] take a Bayesian approach to parameter estimation and optimization. However, these efforts apply their strategies to image recognition tasks.
Britz et al. present a massive analysis of NMT hyperparameters, aiming for optimization that is robust to hyperparameter variations [4]. Likewise, Bahar et al. compare various optimization strategies for NMT [5]. In addition, Wu et al. [6] utilize a combination of Adam and simple stochastic gradient descent (SGD): they run Adam for a fixed number of iterations and then switch to SGD to slow down learning during the final phase of training.
To the best of our knowledge, little prior work compares combinations of optimization strategies for NMT. Most work in this area focuses on the modeling problem for a vanilla NMT task without exploring the trade-offs of parameter selection in terms of performance and stability.
III Background
Machine translation involves model design and model training. In general, a learning algorithm can be viewed as the combination of selecting a model criterion (a family of functions and its parameterization) and a procedure for appropriately optimizing this criterion. The next subsections discuss how sentences are represented with a neural network and the optimization objectives used to train a model for a translation system.
III-A Neural Machine Translation
Given a source sequence $x = (x_1, \dots, x_{T_x})$ and a target sequence $y = (y_1, \dots, y_{T_y})$, NMT models the conditional probability $p(y \mid x)$ of the target words given the source sequence [7]. The NMT training objective is to minimize the cross-entropy over the $N$ training samples, defined as

$$J(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p_\theta\!\left(y^{(n)} \mid x^{(n)}\right).$$

Since computing the objective function over the whole training set is expensive, the mini-batch approach randomly selects a small number of samples and averages over them, resulting in mini-batch gradient estimates.
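To make the objective concrete, the following NumPy sketch (an illustration under assumed toy dimensions, not the training code used in this work) computes the mini-batch cross-entropy estimate from decoder logits and reference token ids:

```python
import numpy as np

def minibatch_cross_entropy(logits, targets):
    """Average per-token cross-entropy over a mini-batch.

    logits:  (batch, time, vocab) unnormalized scores from the decoder
    targets: (batch, time) integer ids of the reference target words
    """
    # log-softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # pick out the log-probability of each reference token
    b, t = np.indices(targets.shape)
    token_ll = log_probs[b, t, targets]

    # negative mean log-likelihood = cross-entropy estimate for the batch
    return -token_ll.mean()

# toy usage: batch of 2 sentences, 3 time steps, vocabulary of 5 symbols
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3, 5))
targets = rng.integers(0, 5, size=(2, 3))
print(minibatch_cross_entropy(logits, targets))
```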
III-A1 Recurrent Neural Networks
A recurrent neural network (RNN) is a neural network with a hidden state $h$ and an optional output, and operates on a variable-length sequence $x = (x_1, \dots, x_T)$. At each time step $t$, the hidden state $h_t$ of the RNN is updated by

$$h_t = f(h_{t-1}, x_t) \qquad (1)$$

where $f$ is a non-linear activation function, such as a sigmoid or an LSTM, which will be discussed in Section III-B2. RNNs can learn a probability distribution over a sequence by being trained to predict the next symbol in the sequence. The output at each time step $t$ is the conditional distribution $p(x_t \mid x_{t-1}, \dots, x_1)$. For instance, a multinomial distribution (1-of-K coding) can be output with the softmax activation function

$$p(x_{t,j} = 1 \mid x_{t-1}, \dots, x_1) = \frac{\exp(w_j h_t)}{\sum_{j'=1}^{K} \exp(w_{j'} h_t)} \qquad (2)$$

for all possible symbols $j = 1, \dots, K$, where $w_j$ are the rows of a weight matrix $W$. Thus, the probability of the sequence $x$ is computed by combining these conditionals using

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \dots, x_1). \qquad (3)$$

The learned distribution can then be used to sample a new sequence by iteratively drawing a symbol at each time step. The conditional distribution of the next symbol is defined as $p(x_t \mid x_{t-1}, \dots, x_1) = g(h_t)$. Note that of the activation functions $f$ and $g$, $g$ must produce valid probabilities, for which a softmax is typically employed, as defined in Eq. 2.
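As a minimal illustration of Eqs. 1-3 (a toy sketch with assumed dimensions and randomly initialized weights, not a trained model), the following code runs an RNN step and accumulates the sequence log-probability with a softmax output:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(h_prev, x_t, U, W, b):
    """Eq. 1: update the hidden state from the previous state and current input."""
    return np.tanh(U @ h_prev + W @ x_t + b)

def sequence_log_prob(tokens, E, U, W, b, V):
    """Eqs. 2-3: sum the log-probabilities of each next symbol under the softmax output."""
    h = np.zeros(U.shape[0])
    log_p = 0.0
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        h = rnn_step(h, E[prev], U, W, b)   # condition on the symbols seen so far
        probs = softmax(V @ h)              # Eq. 2: distribution over the vocabulary
        log_p += np.log(probs[nxt])         # Eq. 3: chain rule over time steps
    return log_p

# toy dimensions: vocabulary of 6 symbols, 8-dim embeddings, 10-dim hidden state
rng = np.random.default_rng(1)
K, d_emb, d_h = 6, 8, 10
E = rng.normal(size=(K, d_emb)) * 0.1
U = rng.normal(size=(d_h, d_h)) * 0.1
W = rng.normal(size=(d_h, d_emb)) * 0.1
b = np.zeros(d_h)
V = rng.normal(size=(K, d_h)) * 0.1
print(sequence_log_prob([0, 3, 2, 5], E, U, W, b, V))
```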
III-A2 RNN Encoder-Decoder
An RNN encoder-decoder (pictured in Fig. 1) encodes a variable-length sequence into a fixed-length vector representation and decodes that fixed-length vector back into a variable-length sequence [8]. The two components of the RNN encoder-decoder are jointly trained to maximize the conditional log-likelihood

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_\theta\!\left(y^{(n)} \mid x^{(n)}\right) \qquad (4)$$

where $\theta$ is the set of model parameters, each $(x^{(n)}, y^{(n)})$ is a pair of input and output sequences from the training set, and the output of the decoder starting from the input is differentiable. Gradient-based algorithms are used to estimate the model parameters, as discussed in Section III-B. A trained RNN encoder-decoder model can be used to generate a target sequence given an input sequence, or to score a given pair of input and output sequences, where the score is simply the probability from Eqs. 3 and 4.
In typical machine translation systems, the goal of the decoder is to find a translation $y$ for a given source sentence $x$ that maximizes

$$p(y \mid x) \propto p(x \mid y)\, p(y)$$

where $p(x \mid y)$ is the translation model and $p(y)$ represents the language model. In practice, most systems are modeled as a log-linear model with additional features and corresponding weights:

$$\log p(y \mid x) = \sum_{m=1}^{M} \lambda_m h_m(x, y) - \log Z(x)$$

where $h_m$ and $\lambda_m$ are the $m$-th feature and weight, and $Z(x)$ is a normalization constant that does not depend on the weights. The weights are optimized to maximize the BLEU score on the development set.
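A small sketch of the log-linear scoring idea follows; the feature names and values are hypothetical, and the normalization constant is dropped since it does not affect the ranking of candidate translations for a fixed source sentence:

```python
import math

def log_linear_score(features, weights):
    """Weighted sum of feature scores, i.e. the unnormalized log-linear model score.

    Z(x) is omitted: it does not depend on the weights, so it can be ignored
    when ranking candidate translations of the same source sentence.
    """
    return sum(weights[name] * value for name, value in features.items())

# hypothetical feature values for one candidate translation
features = {
    "translation_model": math.log(0.02),   # log p(x | y)
    "language_model": math.log(0.001),     # log p(y)
    "word_penalty": -7.0,                  # length penalty
}
weights = {"translation_model": 1.0, "language_model": 0.8, "word_penalty": 0.2}
print(log_linear_score(features, weights))
```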
III-B Optimization Objectives
The following subsections describe the tuning of hyperparameters that affect the performance of training an NMT system.
III-B1 SGD Optimizers
Stochastic gradient descent (SGD) is commonly used to train neural networks. SGD updates a set of parameters $\theta$ as $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, where $\eta$ is the learning rate, or how large the update should be, and $\nabla_\theta J(\theta)$ is the gradient of the cost function $J(\theta)$. SGD uses a scheduling-based step-size selection, which makes the learning rate an important hyperparameter that requires careful tuning. An adaptive optimizer, in contrast, separately adapts the learning rate for each parameter.

Adaptive moment estimation (Adam) accumulates a decaying average of past squared gradients $v_t$. Similar to RMSProp and AdaDelta, Adam also stores a decaying mean of past gradients $m_t$. The moments $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected terms that protect against instability from zero initialization. Equation 5 displays the SGD, AdaGrad, and Adam update rules:

$$\text{SGD:}\quad \theta_{t+1} = \theta_t - \eta\, g_t$$
$$\text{AdaGrad:}\quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$
$$\text{Adam:}\quad m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t},\qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t},\qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \qquad (5)$$

where $g_t = \nabla_\theta J(\theta_t)$ and $G_t$ accumulates the squared gradients up to step $t$.
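The following NumPy sketch implements the SGD and Adam updates of Eq. 5 (illustrative only; the hyperparameter defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 are the commonly used values, not necessarily those used in the experiments):

```python
import numpy as np

def sgd_update(theta, grad, lr=0.001):
    """Plain SGD: step against the gradient, scaled by the learning rate."""
    return theta - lr * grad

class Adam:
    """Adam: decaying averages of gradients (m) and squared gradients (v),
    with bias correction against their zero initialization."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = self.v = None
        self.t = 0

    def update(self, theta, grad):
        if self.m is None:
            self.m, self.v = np.zeros_like(theta), np.zeros_like(theta)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return theta - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta
theta = np.array([1.0, -2.0])
opt = Adam(lr=0.1)
for _ in range(100):
    theta = opt.update(theta, 2 * theta)
print(theta)  # approaches [0, 0]
```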
III-B2 Activation Functions

To address the vanishing gradients problem associated with learning long-term dependencies in RNNs, LSTMs [9] and GRUs [8] employ a gating mechanism when computing the hidden states. Equation 6 displays the LSTM (left) and GRU (right) activation functions:

$$\begin{aligned} i_t &= \sigma(W_i x_t + U_i h_{t-1}) & \qquad z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\ f_t &= \sigma(W_f x_t + U_f h_{t-1}) & r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\ o_t &= \sigma(W_o x_t + U_o h_{t-1}) & \tilde{h}_t &= \tanh\!\left(W x_t + U (r_t \odot h_{t-1})\right) \\ \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1}) & h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t & & \\ h_t &= o_t \odot \tanh(c_t) & & \end{aligned} \qquad (6)$$

For LSTMs (Eq. 6, left), note that the input $i_t$, forget $f_t$, and output $o_t$ gates share the same functional form with different parameter matrices; each gate is squashed by the sigmoid into a vector of values between 0 and 1. Multiplying element-wise by these vectors determines how much of the other vectors to let into the current state. $\tilde{c}_t$ is a candidate hidden state computed from the current input and the previous hidden state. $c_t$ serves as the internal memory, a combination of the previous memory scaled by the forget gate and the candidate state scaled by the input gate, balancing the two extremes of ignoring either the old memory or the new hidden state completely. Lastly, the hidden state $h_t$ is calculated as a combination of the internal memory and the output gate.
On the other hand, a GRU (Eq. 6, right) has two gates: a reset gate $r_t$ and an update gate $z_t$. The reset gate determines how to combine the new input with the previous memory, whereas the update gate defines how much of the previous memory to keep around. If the reset gates were set to all 1's and the update gates to all 0's, the GRU would reduce to a vanilla RNN.
The differences between the two approaches to computing hidden units are that GRUs have two gates, whereas LSTMs have three. GRUs have no internal memory or output gate, whereas an LSTM uses $c_t$ as its internal memory and $o_t$ as an output gate. In the GRU, the roles of the input and forget gates are coupled into the single update gate $z_t$, and the reset gate is applied directly to the previous hidden state. Also, there is no second non-linearity in GRUs, compared to LSTMs, which use two hyperbolic tangents.
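A minimal NumPy sketch of a single GRU step (Eq. 6, right) is shown below; the parameter shapes and initialization are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, p):
    """One GRU update (Eq. 6, right). p holds the parameter matrices."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)           # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)           # reset gate
    h_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))  # candidate state
    return z * h_prev + (1.0 - z) * h_cand                  # keep vs. replace memory

# toy dimensions: 4-dim input, 3-dim hidden state
rng = np.random.default_rng(2)
d_x, d_h = 4, 3
p = {name: rng.normal(size=(d_h, d_x if name.startswith("W") else d_h)) * 0.1
     for name in ["Wz", "Uz", "Wr", "Ur", "W", "U"]}
print(gru_step(np.zeros(d_h), rng.normal(size=d_x), p))
```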
III-B3 Dropout
In a fully-connected, feed-forward neural network, dropout randomly retains connections within hidden layers while discarding others [10]. Equation 7 displays a standard hidden-unit update on the left, and the dropout version that decides whether to retain each connection on the right:

$$y^{(l+1)} = f\!\left(W^{(l+1)} y^{(l)} + b^{(l+1)}\right) \qquad\qquad \begin{aligned} r^{(l)} &\sim \mathrm{Bernoulli}(p) \\ \tilde{y}^{(l)} &= r^{(l)} \odot y^{(l)} \\ y^{(l+1)} &= f\!\left(W^{(l+1)} \tilde{y}^{(l)} + b^{(l+1)}\right) \end{aligned} \qquad (7)$$

Here $\tilde{y}^{(l)}$ is the thinned output of layer $l$, and retaining a network connection is decided by a Bernoulli random variable $r^{(l)}$ with probability $p$.
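The two updates of Eq. 7 can be sketched as follows (illustrative NumPy, with a Bernoulli keep-mask applied to the layer input):

```python
import numpy as np

def dense(y, W, b):
    """Standard hidden update (Eq. 7, left)."""
    return np.tanh(W @ y + b)

def dense_dropout(y, W, b, keep_prob, rng):
    """Dropout update (Eq. 7, right): thin the layer input with a Bernoulli mask."""
    mask = rng.binomial(1, keep_prob, size=y.shape)  # 1 = keep connection, 0 = drop
    return np.tanh(W @ (mask * y) + b)

rng = np.random.default_rng(3)
W, b = rng.normal(size=(4, 5)) * 0.1, np.zeros(4)
y = rng.normal(size=5)
print(dense(y, W, b))
print(dense_dropout(y, W, b, keep_prob=0.8, rng=rng))  # e.g. dropout rate 0.2
```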
TABLE I: Datasets used in the experiments (number of sentences in parentheses).

         | RO-EN, EN-RO             | DE-EN, EN-DE
Train    | corpus.bpe (2603030)     | corpus.bpe (4497879)
Valid    | newsdev2016.bpe (1999)   | newstest2014.bpe (3003)
Test     | newstest2016.bpe (1999)  | newstest2016.bpe (2999)
TABLE II: GPU specifications.

                             | P100        | V100
CUDA capability              | 6.0         | 7.0
Global memory (MB)           | 16276       | 16152
Multiprocessors (MP)         | 56          | 80
CUDA cores per MP            | 64          | 64
CUDA cores                   | 3584        | 5120
GPU clock rate (MHz)         | 405         | 1380
Memory clock rate (MHz)      | 715         | 877
L2 cache size (MB)           | 4.194       | 6.291
Constant memory (bytes)      | 65536       | 65536
Shared mem per block (bytes) | 49152       | 49152
Registers per block          | 65536       | 65536
Warp size                    | 32          | 32
Max threads per MP           | 2048        | 2048
Max threads per block        | 1024        | 1024
CPU (Intel)                  | Ivy Bridge  | Haswell
Architecture family          | Pascal      | Volta

TABLE III: Host system specifications.

CPU architecture | Haswell (V100 system) | Ivy Bridge (P100 system)
Model            | E5-2698 v3            | Xeon X5650
Clock speed      | 2.30 GHz              | 2.67 GHz
Node count       | 4, 14                 | 6
GPUs             | 4 V100                | 4 P100
Memory           | 256 GB                | 50 GB
Linux kernel     | 3.10.0-229.14.1       | 2.6.32-642.4.2
Compiler         | CUDA v9.0.67
Flags            | {-g, -lineinfo, -arch=sm_cc}
III-C Combination of Optimizers
Since the learning trajectory significantly affects the training process, the proper types of hyperparameters must be selected and tuned to yield good performance. The construction of the RNN cell with its activation function, the optimizer and its learning rate, and the dropout rate all have an effect on how training progresses and on whether good accuracy can be achieved.
TABLE IV: BLEU scores for the four translation directions on the validation sets (top) and test sets (bottom), comparing learning rates, activation cells, and GPUs.

                     RO-EN           EN-RO           DE-EN           EN-DE
cell   learn rate    P100    V100    P100    V100    P100    V100    P100    V100
Validation
GRU    1e-3          35.53   35.43   19.19   19.28   28.00   27.84   20.43   20.61
       5e-3          34.37   34.05   19.07   19.16   26.05   22.16   n/a     19.01
       1e-4          35.47   35.46   19.45   19.49   27.37   27.81   dnf     21.41
LSTM   1e-3          34.27   35.61   19.29   19.64   28.62   28.83   21.70   21.69
       5e-3          35.05   34.99   19.48   19.43   n/a     24.36   18.53   18.01
       1e-4          35.41   35.28   19.43   19.48   n/a     28.50   dnf     dnf
Test
GRU    1e-3          34.22   34.17   19.42   19.43   33.03   32.55   26.55   26.85
       5e-3          33.13   32.74   19.31   18.97   31.04   26.76   n/a     26.02
       1e-4          33.67   34.44   18.98   19.69   33.15   33.12   dnf     28.43
LSTM   1e-3          33.10   33.95   19.56   19.08   33.10   33.89   28.79   28.84
       5e-3          33.10   33.52   19.13   19.51   n/a     29.16   24.12   24.12
       1e-4          33.29   32.92   19.14   19.23   n/a     33.44   dnf     dnf
TABLE V: BLEU score and total training time (h:mm) per GPU for varying dropout rates, comparing activation cells (learning rate 0.001).

                     RO-EN                             DE-EN
cell   dropout       P100           V100               P100            V100
GRU    0.0           34.47  6:29    34.47  4:43        32.29  9:48     31.61  6:15
       0.2           35.53  8:48    35.43  6:21        33.03  18:47    32.55  19:40
       0.3           35.36  12:21   35.15  7:28        31.36  10:14    31.50  9:33
       0.5           34.50  12:20   34.67  17:18       29.64  11:09    30.21  11:09
LSTM   0.0           34.84  6:29    34.65  4:46        32.84  12:17    32.88  7:37
       0.2           34.27  8:10    35.61  6:34        33.10  16:33    33.89  13:39
       0.3           35.67  9:56    35.37  11:29       33.45  20:02    33.51  15:51
       0.5           34.50  15:13   34.33  12:45       32.67  20:02    32.20  13:03
IV Experiments
The experiments were carried out on the WMT 2016 [11] translation tasks for the Romanian and German languages in four directions: EN→RO, RO→EN, EN→DE, and DE→EN. The datasets and their characteristics are listed in Table I, with the number of sentences in parentheses. Table I shows that for WMT 2016 EN→RO and RO→EN, the training data consisted of 2.6M English and Romanian sentence pairs, whereas for WMT 2016 EN→DE and DE→EN, the training corpus consisted of approximately 4.5M German and English sentence pairs. Validation was performed on 1000 sentences of the newsdev2016 corpus for RO, and on the newstest2014 corpus for DE. The newstest2016 corpus, consisting of 1999 sentences for RO and 2999 sentences for DE, was used as the test set. We evaluated and saved the models every 10K iterations and stopped training after 500K iterations.
All experiments used bilingual data without additional monolingual data. The models were trained with Marian [12], an efficient NMT framework written in C++ with multi-node, multi-GPU capabilities. We used the joint byte pair encoding (BPE) approach [13] on both the source and target sides, which converts words into sequences of subwords. For all four tasks, the number of joint BPE merge operations was 20K. All words were projected onto a 512-dimensional embedding space, with vocabulary dimensions of . The mini-batch size was determined automatically, based on the sentence lengths that could fit into GPU global memory, which was set to 13000 MB for each GPU.
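The length-based batching idea can be sketched as follows (an approximation of the concept, not Marian's implementation; the token budget here is arbitrary):

```python
def batch_by_budget(sentences, max_tokens):
    """Group sentences into mini-batches whose total token count stays under a budget,
    approximating batching that is sized by what fits into GPU workspace memory."""
    batch, used, batches = [], 0, []
    for sent in sorted(sentences, key=len):   # sorting by length reduces padding waste
        if batch and used + len(sent) > max_tokens:
            batches.append(batch)
            batch, used = [], 0
        batch.append(sent)
        used += len(sent)
    if batch:
        batches.append(batch)
    return batches

# toy usage with tokenized sentences and an arbitrary 12-token budget
sents = [["a", "b"], ["c", "d", "e"], ["f"], ["g", "h", "i", "j"], ["k", "l", "m"]]
for b in batch_by_budget(sents, max_tokens=12):
    print(b)
```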
Decoding was performed using beam search with a beam size of 12. Translation post-processing consisted of recasing and detokenizing the translated BPE segments. The trained models compared different hyperparameter strategies, including the type of optimizer, the activation function, and the amount of dropout applied, as discussed in Section III-C. The model parameters were initialized with the same random seed. The systems were evaluated using the case-sensitive BLEU score computed with the Moses SMT toolkit [14].
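For reference, a minimal beam search over a hypothetical step function that returns next-token log-probabilities could look like the sketch below (the interface and the dummy model are assumptions, not the decoder used in these experiments):

```python
import numpy as np

def beam_search(step_fn, bos, eos, beam_size=12, max_len=50):
    """Keep the `beam_size` best partial hypotheses by accumulated log-probability.

    step_fn(prefix) -> 1-D array of log-probabilities over the vocabulary.
    """
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            # expand each hypothesis with its best next tokens
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((prefix + [int(tok)], score + float(log_probs[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# toy usage with a uniform dummy model over a 5-token vocabulary (eos = 4)
dummy = lambda prefix: np.log(np.full(5, 0.2))
print(beam_search(dummy, bos=0, eos=4, beam_size=3, max_len=5))
```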
V Analysis
This section analyzes the results of the evaluated NMT systems in terms of translation quality, training stability and convergence speed.
V-A Translation Quality
Table IV shows BLEU scores for the four translation directions on the validation sets (top) and the test sets (bottom), comparing learning rates, activation functions, and GPUs. Note that entries marked n/a mean that no results were available, whereas entries marked dnf indicate runs that did not finish training within 24 hours. On the validation sets, LSTMs achieved higher accuracy, whereas on the test sets GRUs and LSTMs were about the same. Also, note that the best-performing learning rates were usually the lower values (e.g., 1e-3). The type of hidden unit (LSTM vs. GRU) and the learning rate both affect the overall accuracy achieved, as demonstrated by Table IV.
Table V displays various dropout rates applied to two translation directions, RO→EN and DE→EN, comparing hidden units, GPUs, and overall training time. The learning rate was fixed at 0.001, the rate that achieved the highest BLEU score in Table IV. Generally speaking, increasing the dropout rate also increased training time. This may be the result of losing network connections when applying the dropout mechanism, but it comes with the added benefit of avoiding overfitting. This is evident in Table V, where applying some form of dropout results in a trained model achieving higher accuracy. The best performance is seen with dropout rates of 0.2 to 0.3. This confirms that some form of connection-dropping mechanism is necessary to prevent overfitting of the models under training.
Figure 2 shows BLEU score results as a function of training time, comparing GPUs, activation units, learning rates, and translation directions. In most cases a learning rate of 0.001 achieves the highest accuracy, at the cost of longer training time. Also, note the correlation between longer training time and higher BLEU scores in most cases. In some cases, the models converged at a faster rate (e.g., Fig. 2 upper left, RO→EN, GRU with a learning rate of 0.005 vs. 0.001).
V-B Training Stability
Figure 3 shows the cross-entropy scores for the RO→EN and EN→RO translation tasks, comparing activation functions (GRU vs. LSTM) with the learning rate at 0.001. Note the training stability patterns that emerge from this plot, which are highly correlated with the translation direction. The activation functions (GRU vs. LSTM) during validation also performed similarly across GPUs and were likewise highly correlated with the translation direction. Cross-entropy scores for the EN→RO translation direction were more or less the same. However, for RO→EN, an LSTM executed on a P100 converged the earliest, by one iteration.
Figure 4 shows the same comparison of cross-entropy scores over epochs for the DE→EN and EN→DE translation tasks. The behavior for this translation pair was wildly different across systems. Not only did it take more epochs to converge compared to Fig. 3, but how well the systems progressed also varied, as evident in the cross-entropy scores during validation. When comparing hidden units, LSTMs outperformed GRUs in all cases. When comparing GPUs, the V100 performed better than the P100 in terms of cross-entropy, but took longer to converge in some cases (e.g., v100-deen-lstm, v100-ende-lstm). Also, note that the EN→DE task with a GRU hidden unit never stabilized, as evident in both the high cross-entropy scores and the peaks toward the end. The LSTM achieved a better cross-entropy score overall, with nearly an 8-point difference for DE→EN compared with the GRU.
V-C Convergence Speed
Figure 5 shows the average words-per-second for the RO→EN translation task, comparing systems. The average words-per-second executed remained consistent across epochs. The system that achieved the most words-per-second was v100-roen-gru-0.001, whereas the one that achieved the least was v100-roen-gru-0.005. Surprisingly, the best and the worst performer was the v100-roen-gru configuration, depending on its learning rate, with the sweet spot at 0.001. This confirms 0.001 as the learning rate that executes a decent number of words-per-second while achieving fairly high accuracy across all systems, consistent with the earlier observations.
Table VI displays words-per-second and the validation iteration at convergence, comparing activation units, learning rates, and GPUs. For a fixed learning rate, the V100 executed more words-per-second than the P100 and converged at an earlier iteration. When comparing hidden units, GRUs executed more words per second on a GPU and converged at a reasonable rate (around 18000 iterations) for most learning rates, except for 5e-3. For LSTMs, words-per-second on the V100 was similar across learning rates, although the lowest learning rate required substantially more iterations to converge, at the cost of longer training time.
Table VII shows the corresponding total training time for the four translation directions, comparing GPUs, activation units, and learning rates. The dropout rate was set to 0.2, which was the best performer in most cases (Table V). Table VII shows that training time increased as the learning rate was decreased. In general, the Romanian tasks took a fraction of the time to complete training (usually under 10 hours), whereas the German tasks took 18-22 hours.
TABLE VI: Average words-per-second and the validation iteration at which each model converged, comparing activation cells, learning rates, and GPUs.

                     words-per-sec           validation iter.     words-per-sec           validation iter.
cell   learn rate    P100        V100        P100      V100       P100        V100        P100      V100
                     RO-EN                                        EN-RO
GRU    1e-3          33009.23    45762.54    18000     18000      29969.14    42746.15    15000     15000
       5e-3          32965.23    24253.14    19000     8000       30223.89    23144.62    17000     10000
       1e-4          32828.61    24341.96    44000     16000      29959.34    23277.51    25000     14000
LSTM   1e-3          29412.87    40534.06    15000     16000      27282.54    38131.13    14000     14000
       5e-3          29536.65    40598.24    16000     16000      27245.42    37384.46    19000     21000
       1e-4          29478.51    41441.37    40000     35000      27002.60    38118.79    25000     25000
                     DE-EN                                        EN-DE
GRU    1e-3          28279.53    38026.87    20000     28000      28367.91    39995.48    10000     10000
       5e-3          28215.40    19819.59    25000     4000       n/a         39944.10    n/a       16000
       1e-4          28367.54    33218.70    26000     32000      dnf         39993.89    dnf       36000
LSTM   1e-3          24995.64    33507.31    16000     17000      25245.67    35122.54    13000     17000
       5e-3          25210.15    33740.92    14000     7000       25049.21    33649.20    9000      6000
       1e-4          dnf         34529.58    dnf       31000      dnf         dnf         dnf       dnf
TABLE VII: Total training time (h:mm) for the four translation directions, comparing GPUs, activation cells, and learning rates (dropout rate 0.2).

                     RO-EN           EN-RO           DE-EN           EN-DE
cell   learn rate    P100    V100    P100    V100    P100    V100    P100    V100
GRU    1e-3          8:48    6:21    7:47    5:26    18:47   19:40   9:26    6:41
       5e-3          9:41    4:52    8:38    6:02    23:57   4:36    n/a     10:56
       1e-4          21:58   9:43    12:33   8:59    23:50   21:09   dnf     23:58
LSTM   1e-3          8:10    6:34    7:49    5:36    16:33   13:39   13:50   12:24
       5e-3          9:02    6:34    10:44   8:32    n/a     5:12    9:37    4:35
       1e-4          22:29   14:05   13:46   9:45    n/a     23:57   dnf     dnf
VI Discussion
The variation in the results, in terms of language direction, hyperparameters, words-per-second executed, and BLEU scores, in addition to the hardware on which training was executed, demonstrates the complexity of learning the grammatical structure between two languages. In particular, the learning rate, the hidden unit selected for the activation function, the optimization criterion, and the amount of dropout applied to the hidden connections all have a drastic effect on overall accuracy and training time. Specifically, we found that a lower learning rate achieved the best performance in terms of convergence speed and BLEU score. Also, we found that the V100 was able to execute more words-per-second than the P100 in all cases. When looking at accuracy as a whole, LSTM hidden units outperformed GRUs in all cases. Lastly, the dropout applied to the network in all cases prevented the model from overfitting and allowed it to achieve higher accuracy.
The multi-dimensionality of hyperparameter optimization poses a challenge in selecting the architecture design for training NN models, as illustrated by the varying behavior across systems and its performance outcomes. This work investigated how these design decisions affect training outcomes and provides neural network designers with guidance on which parameters affect performance, whether measured by accuracy, words processed per second, or expected convergence. Coupled with massive parallel text corpora and commodity heterogeneous GPU architectures, the trained models were able to achieve WMT-grade accuracy with proper hyperparameter tuning.
VII Conclusion
We analyzed the performance of various hyperparameters for training an NMT system, including the optimization strategy, the learning rate, the activation cell, and the GPU, across various systems for the WMT 2016 translation task in four translation directions. Results demonstrate that a proper learning rate and a modest amount of dropout prevent overfitting and achieve high translation accuracy.
Future work includes developing optimization methods to evaluate how best to select hyperparameters. By statically analyzing the computational graph that represents an NN, in terms of the instruction operations executed and resource allocation constraints, one could derive execution performance estimates for a given dataset without running experiments.
References
 [1] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for hyper-parameter optimization," in Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.
 [2] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
 [3] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
 [4] D. Britz, A. Goldie, T. Luong, and Q. Le, "Massive exploration of neural machine translation architectures," arXiv preprint arXiv:1703.03906, 2017.
 [5] P. Bahar, T. Alkhouli, J.-T. Peter, C. J.-S. Brix, and H. Ney, "Empirical investigation of optimization algorithms in neural machine translation," The Prague Bulletin of Mathematical Linguistics, vol. 108, no. 1, pp. 13–25, 2017.
 [6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
 [7] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
 [8] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
 [9] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [10] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [11] M. Junczys-Dowmunt and R. Grundkiewicz, "Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing," in ACL. Berlin, Germany: Association for Computational Linguistics, 2016, pp. 751–758. [Online]. Available: http://www.aclweb.org/anthology/W16-2378
 [12] M. Junczys-Dowmunt, T. Dwojak, and H. Hoang, "Is neural machine translation ready for deployment? A case study on 30 translation directions," in IWSLT, 2016.
 [13] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," arXiv preprint arXiv:1508.07909, 2015.
 [14] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens et al., "Moses: Open source toolkit for statistical machine translation," in ACL. Association for Computational Linguistics, 2007, pp. 177–180.