In this paper we propose a neural machine translation (NMT) model in which the encoder and decoder layers are randomly generated and then fixed throughout training. We show that despite the extreme simplicity of the model construction and training procedure, the model still performs surprisingly well, reaching 70-80% of the BLEU scores achieved by a fully trainable model of the same architecture.
Our work is inspired by the Echo State Network (ESN) (Jaeger, 2001), a special type of recurrent neural network (RNN) whose recurrent and input matrices are randomly generated and untrained. Such a model building procedure is counter-intuitive; however, as long as its dynamical behavior (characterized by a few key model hyperparameters) properly approximates the underlying dynamics of a given sequence processing task, a randomized model can also yield competitive performance. If we view language processing from a dynamical system's perspective (Elman, 1995), ESN can be an effective model for NLP tasks as well.
There are existing works that apply randomized approaches similar to ESN to NLP tasks (Tong et al., 2007; Hinaut and Dominey, 2012; Wieting and Kiela, 2019; Enguehard et al., 2019), which report the effectiveness of representations produced by random encoders. However, the capability of ESN to directly handle more general and complicated sequence-to-sequence (seq2seq) prediction tasks has not yet been investigated.
We propose an Echo State NMT model with a randomized encoder and decoder, extending ESN to a challenging seq2seq prediction task, and study its surprising effectiveness in MT. This also provides an interesting opportunity for model compression, as one only needs to store a single random seed offline, from which all randomized model components can be deterministically recovered.
Echo State Network (Jaeger, 2001) is a special type of recurrent neural network, in which the recurrent matrix (known as the “reservoir”) and input transformation are randomly generated and then fixed, and the only trainable component is the output layer (known as the “readout”). A very similar model named Liquid State Machine (LSM) (Maass et al., 2002) was independently proposed almost simultaneously, but with a stronger focus on computational neuroscience. This family of models initiated by ESN and LSM later became known as Reservoir Computing (RC) (Verstraeten et al., 2007).
A basic version of ESN has the following formulation:

$$h_t = \tanh(W h_{t-1} + U x_t), \qquad y_t = f(V h_t) \qquad (1)$$

in which $h_t$ and $x_t$ are the hidden state and input at time $t$, $y_t$ is the output and $f$ is a prediction function (for example softmax for classification). This formulation is almost equivalent to a simple RNN, except that the reservoir and input transformation matrices $W$ and $U$ are randomly generated and fixed. $W$ is also often required to be a sparse matrix. The only component that remains to be trained is the readout weights $V$.
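The update above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; it assumes the standard ESN notation (fixed reservoir `W`, fixed input transformation `U`, trainable readout `V`) and a softmax readout function:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, d_y = 64, 16, 8        # hidden, input, output dimensions (arbitrary)

W = rng.uniform(-1, 1, (d_h, d_h))   # fixed random reservoir
U = rng.uniform(-1, 1, (d_h, d_x))   # fixed random input transformation
V = rng.normal(0, 0.1, (d_y, d_h))   # readout: the only trainable weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def esn_step(h_prev, x):
    # h_t = tanh(W h_{t-1} + U x_t);  y_t = f(V h_t).  Only V would
    # receive gradient updates during training; W and U stay frozen.
    h = np.tanh(W @ h_prev + U @ x)
    y = softmax(V @ h)
    return h, y

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):  # drive the reservoir with a short sequence
    h, y = esn_step(h, x)
```
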
Despite the extremely simple construction process of ESN, it has been shown to perform surprisingly well in many regression and time-series prediction problems. A key condition for ESN to function properly is called the Echo State Property (ESP) (Jaeger, 2001; Yildiz et al., 2012), which basically claims that the ESN states asymptotically depend only on the driving input signals (hence states are “echos” of inputs), while the influence of the initial states vanishes over time. ESP essentially requires the recurrent network to have a “fading memory”, which is also shown to be critical in optimizing a dynamical system’s computational capacity (Legenstein and Maass, 2007; Dambre et al., 2012).
Theoretical analysis shows that in order for ESP to hold, the spectral radius $\rho(W)$ of the reservoir matrix $W$, defined as the largest absolute value of its eigenvalues, needs to be smaller than 1, although it has been argued that this is not a rigorous condition (Pascanu and Jaeger, 2011). Intuitively, $\rho(W)$ determines how long an input signal can be retained in memory: a smaller radius results in a shorter memory while a larger radius enables a longer memory. In addition, the scale of the input, which determines how strongly inputs influence the dynamics, remains a hyperparameter critical to the performance of the model.
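The usual way to control the spectral radius, sketched below under the common ESN convention (not taken from this paper's code), is to compute the largest absolute eigenvalue and rescale the reservoir to a target radius below 1:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, (100, 100))   # raw random reservoir

# Spectral radius: the largest absolute value among the eigenvalues.
rho = max(abs(np.linalg.eigvals(W)))

# Eigenvalues scale linearly with the matrix, so dividing by rho and
# multiplying by a target < 1 yields a reservoir with fading memory.
target = 0.9
W_scaled = W * (target / rho)
```
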
3 Echo State Neural Machine Translation
3.1 Model Architecture
Inspired by the simple yet effective construction of ESN, we are interested in extending ESN to challenging sequence-to-sequence prediction tasks, especially NMT. We propose an ESN-based NMT model whose architecture follows RNMT+ (Chen et al., 2018), the state-of-the-art RNN-based NMT model. Unlike RNMT+, which is fully trainable, we simply replace all recurrent layers in the encoder and decoder with echo state layers as shown in Eq. 1, and call this model ESNMT.
In addition to the simple RNN cell employed by the original ESN (Eq. 1), we also explore a variation of ESNMT which employs the LSTM cell (Hochreiter and Schmidhuber, 1997). That is, we randomly generate all weight matrices in the LSTM and keep them fixed. We call this version ESNMT-LSTM.
In the models above, the trainable components are word embedding, softmax and attention layers. Instead of freezing both encoder and decoder, we also investigate settings where only the encoder or decoder is frozen. We further consider cases where even the attention and embedding layers are randomized and fixed. These variations of architectures are compared in Sec.4.2.
We note that the size of the reservoirs can be cheaply increased since they do not need to be trained, which often leads to better performance. We nevertheless constrain the ESNMT model size to match the trainable baselines in our experiments, even though the latter contain far more trainable parameters.
3.2 Adaptive Echo State Layers
As described in Sec. 2, two critical hyperparameters that determine the dynamics of ESN and its performance are the spectral radius of the reservoir matrix and the input scale. While common practice manually tunes these hyperparameters for specific tasks, we treat them as trainable parameters and let the training procedure find suitable values. Specifically, we modify the ESN layer in Eq. 1 into

$$h_t^{(i)} = \tanh\big(\alpha^{(i)} W^{(i)} h_{t-1}^{(i)} + \beta^{(i)} U^{(i)} x_t^{(i)}\big) \qquad (2)$$

where $\alpha^{(i)}$ and $\beta^{(i)}$ are learnable scaling factors for the reservoir and input transformation matrices of the $i$-th layer respectively. A similar modification is applied to the LSTM state transition formulation in ESNMT-LSTM.
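A minimal sketch of one adaptive echo state layer, assuming a reservoir normalized to unit spectral radius and scalar factors at their initial values (in the actual model the two scalars are updated by back-propagation; here they are plain floats for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, d_x = 32, 16

W = rng.uniform(-1, 1, (d_h, d_h))      # fixed random reservoir
U = rng.uniform(-1, 1, (d_h, d_x))      # fixed random input transformation
W /= max(abs(np.linalg.eigvals(W)))     # normalize to unit spectral radius

# Learnable per-layer scaling factors; only these (not W, U) would be
# trained.  Initialized to 1 and 10 as in our experimental setup.
alpha, beta = 1.0, 10.0

def adaptive_esn_step(h_prev, x):
    # Eq. 2: h_t = tanh(alpha * W h_{t-1} + beta * U x_t); alpha sets the
    # effective spectral radius, beta the effective input scale.
    return np.tanh(alpha * (W @ h_prev) + beta * (U @ x))
```
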
3.3 Training

Our models are trained with back-propagation and cross-entropy loss as usual.1

1 Note that ESNs have commonly been applied to regression and time-series prediction problems, in which case the loss function is usually mean squared error and the readout weights can be updated in closed form without the need for back-propagation.

Note that since the recurrent layer weights are fixed and their gradients are not calculated, the challenging gradient explosion/vanishing problem (Pascanu et al., 2013) commonly observed in training RNNs is significantly alleviated. Therefore we expect no significant difference in quality between ESNMT and ESNMT-LSTM, since the LSTM architecture, which was originally designed to tackle the gradient instability problem, offers no advantage in this case. This is verified in our experimental results (Sec. 4.2).
3.4 Model Compression
Since the randomized components of ESNMT can be deterministically generated from one fixed random seed, to store the model offline we only need to save this single seed together with the remaining trainable model parameters. For example, in an ESNMT-LSTM model with a 6-layer encoder and decoder of dimension 512 and a vocabulary size of 32K, the parameters of the recurrent layers, a large fraction of the total, can be recovered from a single random seed.
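The recovery step can be sketched as follows; the function name and layer shapes are ours, and the point is only that generation from a fixed seed is deterministic, so storing the seed is equivalent to storing the matrices:

```python
import numpy as np

def materialize_recurrent_weights(seed, n_layers=6, dim=512):
    """Deterministically regenerate all fixed recurrent matrices from one seed."""
    rng = np.random.default_rng(seed)
    return [rng.uniform(-1, 1, (dim, dim)) for _ in range(n_layers)]

# Offline we store only the integer seed (plus the trainable parameters);
# loading the model regenerates bit-identical matrices.
first = materialize_recurrent_weights(1234)
second = materialize_recurrent_weights(1234)
```
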
4 Experiments

We train and evaluate our models on WMT'14 English→French, English→German and WMT'16 English→Romanian datasets. Sentences are processed into sequences of sub-word units using BPE (Sennrich et al., 2016). We use a shared vocabulary of 32K sub-word units for both source and target languages.
Our baselines are fully trainable RNMT+ models with LSTM cells. For the proposed ESNMT models, all reservoir and input transformation matrices are generated randomly from a uniform distribution between -1 and 1. The reservoirs are then randomly pruned so that $W$ and $U$ reach 20-25% sparsity,2 and $W$ is normalized so that its spectral radius equals 1. Note that the effective spectral radius and input scaling are determined by the learnable scaling factors shown in Eq. 2, which are initialized to 1 and 10 respectively for all layers. For all models the number of encoder and decoder layers is set to 6, and the model dimension to 512 or 2048. We also adopt training recipes similar to those used for RNMT+ (Chen et al., 2018), including dropout, label smoothing and weight decay, for all our models.

2 We also experimented with other sparsity levels, but did not observe significant differences in model quality.
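The three-step generation recipe (uniform sampling, random pruning, spectral normalization) can be sketched as below. This is an illustrative reconstruction, not the paper's code; in particular, the pruning step assumes "20-25% sparsity" means roughly that fraction of entries remain nonzero:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 200  # small dimension for illustration

# 1) Sample the reservoir uniformly from [-1, 1].
W = rng.uniform(-1, 1, (dim, dim))

# 2) Randomly prune so that roughly 25% of the entries stay nonzero
#    (our reading of the 20-25% sparsity target; U is pruned the same way).
W *= rng.random((dim, dim)) < 0.25

# 3) Normalize so the spectral radius equals 1; the learnable factors in
#    Eq. 2 then set the effective radius and input scale during training.
W /= max(abs(np.linalg.eigvals(W)))
```
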
Table 1 compares BLEU scores for all language pairs given by the different models.
The results show that ESNMT can reach 70-80% of the BLEU scores yielded by fully trainable baselines across all settings. Moreover, using LSTM cells yields more or less the same performance as a simple RNN cell. This verifies our hypothesis in Sec. 3.1 that an LSTM cell is not particularly advantageous compared to a simple RNN cell in the ESN setting.
As mentioned in Sec. 3.1, in addition to randomizing both the encoder and decoder, we explore other strategies for applying randomization, and conduct an ablation test as follows: we start by randomizing and freezing everything in the ESNMT-LSTM model (dimension 512) except the softmax layer, then gradually release the attention, encoder and/or decoder so that they become trainable. The results for En→Fr are shown in Table 2.
| Trainable components | BLEU |
|---|---|
| ESNMT-LSTM-512 (softmax only) | 4.44 |
| + Encoder only | 37.98 |
| + Decoder only | 35.21 |
| + Encoder & decoder (fully trainable) | 39.15 |
From the table we have the following interesting findings:
By randomizing the entire decoder and training only the encoder, the BLEU score (37.98) drops by just 1.17 from the fully trainable baseline (39.15).
Randomizing the encoder instead incurs a larger BLEU loss (35.21) than randomizing the decoder. This shows that training the encoder properly is more critical for seq2seq tasks.
The embedding layer deserves the most training: it immediately lifts the BLEU given by an almost purely randomized model from 4.44 to 26.63. It is also interesting to note that a model with only the embedding and softmax layers trainable already reaches this BLEU score.
Effect of spectral radius
To find out why ESNMT works, we examine the learned spectral radii for each layer, which are critical in characterizing the dynamics of ESNMT. In Fig. 1 we show the learning curves of the effective spectral radius for all layers in the forward encoder ESN. The figure shows a clear trend: the radius increases almost monotonically from the bottom to the top layer (0.55 to 1.8). This indicates that lower layers retain short memories and focus more on word-level representations, while upper layers keep longer memories and account for sentence-level semantics, which requires capturing long-term dependencies between inputs. Similar phenomena are observed for the backward encoder ESN and the decoder.
To further investigate how the spectral radius affects translation quality, we study BLEU scores on the En→Fr test set as a function of sentence length, using models in which the radii of all layers are fixed to 0.1, 0.9 or 2.0. The results are shown in Fig. 2. When the radius is small (0.1), the model favors shorter sentences, which require less memory. Increasing the radius to 2.0 equips the model with a non-fading memory in which remote inputs outweigh recent inputs, resulting in worse quality on short sentences. A radius of 0.9 maintains a good balance between short and long memories, yielding the best quality. Nevertheless, the overall quality in all these settings is worse than that of models whose radii are learned (Table 1).
5 Related Work
Perhaps most related to our work are Wieting and Kiela (2019) and Enguehard et al. (2019), which studied the effectiveness of randomized encoders on SentEval tasks. Similar randomization approaches have also been applied to other NLP problem settings (Tong et al., 2007; Hinaut and Dominey, 2012; Homma and Hagiwara, 2013; Alhama and Zuidema, 2016; Zhang and Bowman, 2018; Tenney et al., 2019; Ramamurthy et al., 2019). However, none of these works explores the potential of ESN for encoder-decoder models or more challenging tasks like MT, nor do they study why randomization works in these problem settings. These are the questions we aim to address in this paper.
6 Conclusion

We proposed Echo State NMT models whose encoder and decoder are composed of randomized and fixed ESN layers. Even without training these major components, the model can already reach 70-80% of the performance yielded by fully trainable baselines. These surprising findings encourage us to rethink the nature of encoding and decoding in NMT, and to design potentially more economical model architectures and training procedures.
ESNMT is based on the recurrent network architecture. One interesting research problem for future exploration is how to apply randomized algorithms to non-recurrent architectures like Transformer (Vaswani et al., 2017). This is potentially possible, as exemplified by randomized feedforward networks like Extreme Learning Machine (Huang et al., 2006).
- Alhama and Zuidema (2016) Raquel G. Alhama and Willem H. Zuidema. 2016. Pre-wiring and pre-training: What does a neural network need to learn truly general identity rules? Journal of Artificial Intelligence Research, 61:927–946.
- Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86, Melbourne, Australia. Association for Computational Linguistics.
- Dambre et al. (2012) Joni Dambre, David Verstraeten, Benjamin Schrauwen, and Serge Massar. 2012. Information processing capacity of dynamical systems. Scientific Reports.
- Elman (1995) Jeffrey L. Elman. 1995. Language as a dynamical system. In Mind as Motion, pages 195–225. MIT Press, Cambridge, MA, USA.
- Enguehard et al. (2019) Joseph Enguehard, Dan Busbridge, Vitalii Zhelezniak, and Nils Hammerla. 2019. Neural language priors. Computing Research Repository, arXiv:1910.03492.
- Gallicchio and Micheli (2017a) Claudio Gallicchio and Alessio Micheli. 2017a. Deep echo state network (deepesn): A brief survey. Computing Research Repository, arXiv:1712.04323.
- Gallicchio and Micheli (2017b) Claudio Gallicchio and Alessio Micheli. 2017b. Echo state property of deep reservoir computing networks. Cognitive Computation, 9(3):337–350.
- Gallicchio and Micheli (2019a) Claudio Gallicchio and Alessio Micheli. 2019a. Reservoir topology in deep echo state networks. Computing Research Repository, arXiv:1909.11022.
- Gallicchio and Micheli (2019b) Claudio Gallicchio and Alessio Micheli. 2019b. Richness of deep echo state network dynamics. CoRR, abs/1903.05174.
- Gallicchio et al. (2017) Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2017. Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268:87–99.
- Hinaut and Dominey (2012) Xavier Hinaut and Peter F. Dominey. 2012. On-line processing of grammatical structure using reservoir computing. In Artificial Neural Networks and Machine Learning – ICANN 2012, pages 596–603, Berlin, Heidelberg.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Homma and Hagiwara (2013) Yukinori Homma and Masafumi Hagiwara. 2013. An echo state network with working memories for probabilistic language modeling. In Artificial Neural Networks and Machine Learning – ICANN 2013, pages 595–602.
- Huang et al. (2006) Guang-Bin Huang, Qin-Yu Zhu, and Chee Kheong Siew. 2006. Extreme learning machine: Theory and applications. Neurocomputing, 70:489–501.
- Jaeger (2001) Herbert Jaeger. 2001. The “echo state” approach to analysing and training recurrent neural networks - with an erratum note. GMD Technical Report 148, German National Research Center for Information Technology, Bonn, Germany.
- Legenstein and Maass (2007) Robert Albin Legenstein and Wolfgang Maass. 2007. What makes a dynamical system computationally powerful?, 1 edition, pages 127–154. MIT Press.
- Maass et al. (2002) Wolfgang Maass, Thomas Natschläger, and Henry Markram. 2002. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560.
- Pascanu and Jaeger (2011) Razvan Pascanu and Herbert Jaeger. 2011. A neurodynamical model for working memory. Neural Netw., 24(2):199–207.
- Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA. PMLR.
- Ramamurthy et al. (2019) Rajkumar Ramamurthy, Robin Stenzel, Rafet Sifa, Anna Ladi, and Christian Bauckhage. 2019. Echo state networks for named entity recognition. In Artificial Neural Networks and Machine Learning – ICANN 2019: Workshop and Special Sessions, pages 110–120. Springer International Publishing.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
- Tenney et al. (2019) Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.
- Tong et al. (2007) Matthew H. Tong, Adam D. Bickett, Eric M. Christiansen, and Garrison W. Cottrell. 2007. Learning grammatical structure with echo state networks. Neural Networks, 20(3):424 – 432.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.
- Verstraeten et al. (2007) David Verstraeten, Benjamin Schrauwen, Michiel D’Haene, and Dirk Stroobandt. 2007. An experimental unification of reservoir computing methods. Neural Networks, 20(3):391–403.
- Wieting and Kiela (2019) John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. In International Conference on Learning Representations.
- Yildiz et al. (2012) Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. 2012. Re-visiting the echo state property. Neural Networks, 35:1–9.
- Zhang and Bowman (2018) Kelly Zhang and Samuel Bowman. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 359–361, Brussels, Belgium. Association for Computational Linguistics.