Neural Machine Translation (NMT) has witnessed great success in recent years (Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). The main reason for its success is that it employs a mass of parameters to model sufficient context for translation decisions and, in particular, enjoys end-to-end training of all these parameters. Despite this success, NMT still faces a severe interpretation challenge: it is hard to understand its learning dynamics, i.e., how do the trainable parameters affect an NMT model during the learning process?
Understanding the learning dynamics of neural networks helps identify potential training issues and further improve training protocols (Smith et al., 2017; McCandlish et al., 2018). Learning dynamics have been extensively investigated for classification tasks (Shwartz-Ziv and Tishby, 2017; Raghu et al., 2017; Li et al., 2018; Bottou et al., 2018; des Combes et al., 2019). Unlike neural networks for classification tasks, NMT involves a complex architecture with massive parameters and requires large-scale data for training, which makes its learning dynamics more difficult to understand. To the best of our knowledge, there are no attempts at understanding the mechanism of learning dynamics for NMT, although it is acknowledged that the training process is critical to making advanced NMT architectures successful.
In this paper, we therefore propose to understand the learning dynamics of NMT. Specifically, we use a technique named Loss Change Allocation (LCA) (Lan et al., 2019) to decompose the overall loss change over individual parameters for each update during training, yielding a moment LCA value per parameter. By summing the LCA values of a parameter between two update steps, we can quantify how much certain groups of parameters contribute to the loss decrease in that learning phase. We utilize LCA to analyze the learning dynamics of model parameters with respect to the model's fitting ability on the training data. Since the original LCA requires calculating the gradient on the entire training data for each update, whose brute-force implementation is impractical for standard translation tasks, we instead approximately calculate it on a stochastic mini-batch from the training or test data for speedup. Our simulation shows that this approximate calculation is efficient and empirically delivers results consistent with the brute-force implementation. Furthermore, extensive experiments on two standard translation tasks reveal the following findings:
- Parameters of the encoder (decoder) word embeddings and the softmax matrix contribute very little to the loss decrease;
- Parameters of the last layers of both the encoder and the decoder contribute more to the loss decrease than those of other layers;
- Word embeddings of frequent words contribute far more to the loss decrease than those of infrequent words.
2.1 Loss Change Allocation
Loss Change Allocation (LCA) functions as a microscope for investigating the training process of any model trained with stochastic gradient methods (Lan et al., 2019). It is an optimizer-agnostic method for probing fine-grained learning dynamics. Roughly speaking, LCA tracks the contribution of each parameter $\theta^i$ to the loss change at each gradient update during training, where $i = 1, \ldots, D$ and $D$ is the number of model parameters. The basic idea of LCA is to take advantage of a first-order Taylor expansion to approximate the loss change at each mini-batch update.
Recall that at each update step $t$, the optimizer (e.g., SGD) samples a mini-batch from the training data for the forward computation and then backpropagates to update the parameters from $\theta_t$ to $\theta_{t+1}$. Given a dataset $\mathcal{D}$, the moment loss change over all the model parameters is approximated and decomposed into the LCA of each parameter as follows:

$$\mathcal{L}(\theta_{t+1}; \mathcal{D}) - \mathcal{L}(\theta_t; \mathcal{D}) \approx \sum_{i=1}^{D} \nabla_{\theta^i} \mathcal{L}(\theta_t; \mathcal{D}) \left(\theta^i_{t+1} - \theta^i_t\right) = \sum_{i=1}^{D} A^i_t \qquad (1)$$
where each $A^i_t = \nabla_{\theta^i} \mathcal{L}(\theta_t; \mathcal{D}) \left(\theta^i_{t+1} - \theta^i_t\right)$ denotes the LCA bound to the parameter $\theta^i$ at update $t$. Therefore, the loss change on $\mathcal{D}$ from update $t_1$ to update $t_2$ can be approximated by summing equations like Equation 1 for all $t$ in between:

$$\mathcal{L}(\theta_{t_2}; \mathcal{D}) - \mathcal{L}(\theta_{t_1}; \mathcal{D}) \approx \sum_{t=t_1}^{t_2-1} \sum_{i=1}^{D} A^i_t \qquad (2)$$
The above equation is the (discretized) path integral of the loss change for each specific parameter $\theta^i$ along the optimization trajectory from $\theta_{t_1}$ to $\theta_{t_2}$, that is, $A^i_{t_1 \to t_2} = \sum_{t=t_1}^{t_2-1} A^i_t$. This summation of moment LCA values reflects the effectiveness of $\theta^i$ with respect to the loss decrease on a given dataset over the update interval, so we call it the interval LCA value.
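As a concrete illustration, the decomposition in Equations 1 and 2 can be reproduced on a toy problem. The sketch below uses plain NumPy with ordinary least squares as a hypothetical stand-in for the NMT loss (all names are illustrative, not from the paper's code); it tracks the moment LCA values during gradient descent and checks that their sum approximates the total loss change:

```python
import numpy as np

def loss(theta, X, y):
    # Mean squared error of a toy linear model, standing in for L(theta; D).
    return np.mean((X @ theta - y) ** 2)

def grad(theta, X, y):
    return 2 * X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
theta_true = rng.normal(size=4)
y = X @ theta_true

theta = np.zeros(4)
lr = 0.05
moment_lca = []  # one vector of A_t^i values per update

for t in range(100):
    g = grad(theta, X, y)        # gradient on the evaluation dataset D
    theta_next = theta - lr * g  # one SGD update
    # Equation (1): first-order Taylor approximation of the loss change,
    # decomposed into one allocation per parameter.
    moment_lca.append(g * (theta_next - theta))
    theta = theta_next

# Equation (2): summing the moment allocations over the update interval
interval_lca = np.sum(moment_lca, axis=0)
true_change = loss(theta, X, y) - loss(np.zeros(4), X, y)
print(interval_lca.sum(), true_change)  # the two should be close
```

Since every step of gradient descent decreases the loss here, each moment allocation is negative and the interval LCA values are negative as well, matching the convention that a more negative LCA value means a larger contribution to the loss decrease.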
2.2 Approximate LCA
Theoretically, the calculation of $\nabla_{\theta} \mathcal{L}(\theta_t; \mathcal{D})$ in Equation 1 requires a forward computation of the model over the whole dataset $\mathcal{D}$, which brings too much computational overhead. Instead, for each computation at update $t$, we re-sample a new mini-batch as a representative of the whole dataset. Combined with the smoothing trick of Section 2.3, we are effectively evaluating a bootstrap of 15 mini-batches to represent the whole dataset, which reduces the variance to some extent. In Section 3.2, we empirically validate this sampling approach with simulated experiments, addressing an open question raised in the original LCA paper (Lan et al., 2019).
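A minimal sketch of this re-sampling idea, again on a hypothetical linear-regression loss rather than an actual NMT model (`minibatch_lca` and `grad_fn` are illustrative names, not functions from the paper's code):

```python
import numpy as np

def minibatch_lca(grad_fn, theta, theta_next, n_examples, batch_size, rng):
    """Approximate the moment LCA with a freshly sampled mini-batch.

    Rather than evaluating the gradient over the whole dataset (exact but
    expensive), draw a new mini-batch as a stochastic stand-in for it.
    grad_fn(theta, idx) is assumed to return the gradient on examples idx.
    """
    idx = rng.choice(n_examples, size=batch_size, replace=False)
    g = grad_fn(theta, idx)            # gradient on the re-sampled batch
    return g * (theta_next - theta)    # per-parameter allocation A_t

# Toy check: averaging many sampled allocations should recover the exact
# full-data computation, since the mini-batch gradient is unbiased.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_fn(theta, idx):
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

theta, theta_next = np.zeros(3), np.full(3, 0.01)
exact = grad_fn(theta, np.arange(200)) * (theta_next - theta)
approx = np.mean(
    [minibatch_lca(grad_fn, theta, theta_next, 200, 32, rng) for _ in range(500)],
    axis=0,
)
print(exact, approx)
```

A single sampled allocation is noisy; the variance reduction in practice comes from the 15-step averaging trick described in the next subsection.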
2.3 Implementation Tricks
Since the LCA values at each update represent the finest granularity of the discrete learning dynamics and should be stored to disk for subsequent analyses, they cause large storage overheads for a model with millions of trainable weights trained for up to 100k updates. Therefore, in practice, we adopt two tricks in our implementation. First, we store the LCA values only once every 15 updates, averaging the LCA values over those steps:

$$\bar{A}^i_k = \frac{1}{15} \sum_{t=15k}^{15(k+1)-1} A^i_t$$
for $k$ beginning with 0. Second, we divide the model parameters into several groups and calculate an LCA value for each group rather than for each parameter:

$$A^g_t = \frac{1}{|g|} \sum_{i \in g} A^i_t$$
where $|g|$ denotes the number of parameters in group $g$. More precisely, we mainly study LCA for the following parameter groups: the word embedding in the encoder (en_emb); the $l$-th layer parameters in the encoder (en_$l$); the $l$-th layer parameters in the decoder (de_$l$); the word embedding in the decoder (de_emb); and the softmax matrix in the decoder (de_softmax).
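Both tricks can be sketched together as a small post-processing routine. This is an illustrative NumPy version under the assumption that moment LCA values arrive as an (updates × parameters) array; the group names simply mirror those listed above:

```python
import numpy as np

def compress_lca(moment_lca, groups, chunk=15):
    """Apply both storage tricks to per-parameter moment LCA values.

    moment_lca: array of shape (T, D), one allocation per update and parameter.
    groups: dict mapping a group name (e.g. "en_emb") to parameter indices.
    Returns, per group, one averaged LCA value for every `chunk` updates.
    """
    T = (len(moment_lca) // chunk) * chunk  # drop any trailing partial chunk
    out = {}
    for name, idx in groups.items():
        # Average over the parameters of the group (the division by |g|) ...
        per_group = moment_lca[:T, idx].mean(axis=1)
        # ... then average every `chunk` consecutive updates before storing.
        out[name] = per_group.reshape(-1, chunk).mean(axis=1)
    return out

# Toy usage with hypothetical group names mirroring those above.
lca_stream = np.ones((30, 6))
groups = {"en_emb": [0, 1], "en_1": [2, 3], "de_softmax": [4, 5]}
compressed = compress_lca(lca_stream, groups)
print({k: v.shape for k, v in compressed.items()})  # two stored rows per group
```

Storing one row per group every 15 updates shrinks the on-disk trace by a factor of 15 times the average group size, at the cost of losing per-step, per-parameter resolution.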
3 Experiments and Analyses
3.1 Data and model
We conduct experiments on two widely used translation benchmarks, namely IWSLT14 De→En and WMT14 En→De. We use the Transformer base model (Vaswani et al., 2017) from fairseq (Ott et al., 2019) for training and for gathering the smoothed moment LCA values of each parameter group. Thanks to the sampling technique in Section 2.2, our training time is only doubled compared with standard training. Our NMT system achieves BLEU scores of 34.4 and 27.7 on the test sets of the IWSLT and WMT tasks respectively, which are close to the state of the art.
3.2 Evaluating the sampling approximation
To verify the effectiveness of our approximation, we conduct a simulated experiment as follows: we randomly sample 10 sentences from the IWSLT task and use this small sample as the training data for running the exact implementation. Figure 1 demonstrates the cumulative LCA occupation ratio of each module group of the Transformer described in Section 2.3. The ranking of each module's occupation ratio reflects its relative effectiveness with respect to loss minimization. If the rankings of the modules are similar between the sampling-based computation and the exact computation, we can rely on the sampling method for subsequent analyses. As shown in Figure 1, the rankings produced by the sampling-based (approx.) and exact methods are very similar to each other, with a Kendall rank correlation coefficient of 0.905 (Kendall, 1938).
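The ranking comparison can be sketched as follows. The occupation ratios below are invented for illustration (they are not the paper's actual numbers), and the small `kendall_tau` helper implements the tie-free version of the coefficient:

```python
def kendall_tau(a, b):
    """Kendall's rank correlation between two score lists (no tie handling)."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1   # pair ordered the same way in both lists
            elif s < 0:
                discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical occupation ratios for the same six module groups under the
# sampling-based and exact computations (illustrative numbers only).
approx_ratio = [0.02, 0.18, 0.25, 0.30, 0.20, 0.05]
exact_ratio  = [0.03, 0.20, 0.28, 0.24, 0.21, 0.04]
print(kendall_tau(approx_ratio, exact_ratio))
```

A coefficient of 1 means the two computations rank the modules identically; values near 1, as reported above, indicate the sampling-based ranking is a faithful proxy for the exact one.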
3.3 Experimental analyses
We conduct two main categories of analyses based on LCA: i) interval analysis, which tracks the LCA values of each group of model parameters during a certain interval of the training process; and ii) cumulative analysis, which tracks the cumulative LCA values from the beginning of training to the end.
3.3.1 Learning of sparse and dense weights
The current best-practice sequence-to-sequence learning paradigm makes an explicit distinction between encoder and decoder. This explicit separation may create a bottleneck for the gradient flow from the loss to the encoder. We visualize the cumulative LCA values of the sparse and dense weights of the Transformer in Figure 2.
Overall, the dense weights of the encoder and decoder contribute similarly on both the training and test data, whereas the sparse embeddings contribute very little. This might be because the update frequency of the dense weights is much higher than that of the sparse weights. However, the LCA value of the dense softmax weights (the decoder's output embedding) is still far less than that of the middle layers.
3.3.2 Layer-wise learning dynamics
To further analyze the contribution of each encoder and decoder layer, we summarize the cumulative LCA value of each dense layer in Figure 3. There is an interesting sandwich effect in the encoder, where the first and last layers contribute the most while the layers in between contribute less. For the decoder layers, the higher the layer, the more it contributes to the loss change. From this modular view, it is clear that the different neural blocks have similar effects on the loss change on the training data as on the test data.
To further examine the convergence property shown by Raghu et al. (2017), namely that lower layers converge earlier, we plot the grouped interval LCA values along the whole training process in Figure 4. Higher layers tend to have smaller (more negative) LCA values, meaning that they generally contribute more than lower layers at any training interval, which serves as evidence that higher layers continue to evolve their representations.
3.3.3 Learning of the embeddings
Since the vocabulary size is large, we cannot visualize the behavior of the embedding of every word in the vocabulary. We therefore split the vocabulary into 25 groups according to word frequency.
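A minimal sketch of this frequency-based split, assuming word counts have already been gathered from the training corpus (the vocabulary below is synthetic):

```python
import numpy as np

def frequency_groups(word_counts, n_groups=25):
    """Split a vocabulary into buckets of roughly equal size by frequency.

    word_counts: dict mapping word -> corpus frequency (hypothetical input).
    Returns a list of word lists, ordered from most to least frequent.
    """
    ranked = sorted(word_counts, key=word_counts.get, reverse=True)
    return [list(chunk) for chunk in np.array_split(ranked, n_groups)]

# Synthetic vocabulary with Zipf-like counts; a real vocabulary would be
# counted from the training corpus.
vocab = {f"w{i}": 1000 // (i + 1) for i in range(100)}
buckets = frequency_groups(vocab)
print(len(buckets), len(buckets[0]), buckets[0][0])
```

The embeddings in each bucket can then be treated as one parameter group for LCA, exactly as the dense-layer groups were in Section 2.3.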
Figure 5 visualizes the cumulative LCA values for all groups sorted by word frequency. From this figure, one can clearly see that high-frequency words account for far more of the LCA values than low-frequency words on both the training and test datasets. This fact offers an explanation for the well-known question of why infrequent words are difficult for NMT to translate while frequent words are easy.
4 Conclusion
In this paper we propose to use Loss Change Allocation (LCA) (Lan et al., 2019) to understand the learning dynamics of NMT. Since the exact calculation of LCA requires computing the gradient on an entire dataset at each update, we instead present an approximation to make it practical in the NMT scenario. Our simulated experiment shows that this approximate calculation is efficient and empirically delivers results consistent with the exact implementation. Extensive experiments on two standard translation tasks reveal some valuable findings: parameters of the encoder (decoder) word embeddings and the softmax matrix contribute little to the loss decrease, while those of the first layer of the encoder and the last layer of the decoder contribute the most during training. In future work, we will investigate the relation of the loss decrease to other interesting learning phenomena, such as the emergence of weight sparsity (Voita et al., 2019; Michel et al., 2019) and module criticality (Zhang et al., 2019; Chatterji et al., 2019).
References
- Léon Bottou, Frank E. Curtis, and Jorge Nocedal. 2018. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.
- Niladri S. Chatterji, Behnam Neyshabur, and Hanie Sedghi. 2019. The intriguing role of module criticality in the generalization of deep networks. arXiv preprint arXiv:1912.00528.
- Remi Tachet des Combes, Mohammad Pezeshki, Samira Shabanian, Aaron Courville, and Yoshua Bengio. 2019. On the learning dynamics of deep neural networks. In International Conference on Learning Representations.
- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 1243–1252.
- Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.
- Janice Lan, Rosanne Liu, Hattie Zhou, and Jason Yosinski. 2019. LCA: Loss change allocation for neural network training. In Advances in Neural Information Processing Systems.
- Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. 2018. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389–6399.
- Sam McCandlish, Jared Kaplan, Dario Amodei, and the OpenAI Dota Team. 2018. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
- Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32, pp. 14014–14024.
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL-HLT.
- Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems 30, pp. 6076–6085.
- Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
- Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. 2017. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5797–5808.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Chiyuan Zhang, Samy Bengio, and Yoram Singer. 2019. Are all layers created equal? arXiv preprint arXiv:1902.01996.