1 Introduction
Recurrent neural networks (RNNs) are notoriously difficult to train on long-term patterns with gradient-based methods because of the well-known vanishing and exploding gradient problems. Although backpropagation through time (BPTT) has been shown to work reasonably well in various RNN settings, it must be truncated in practice, which fundamentally limits its ability to learn long-term dependencies.
While a plethora of methods attempt to mitigate this issue through architectural innovation [6, 5, 2, 16], another line of research attacks the problem by looking for a better optimization method, including periodic warm restarts [11] and synthetic gradients [1, 9].
Periodic warm restarts anneal the learning rate relatively quickly and periodically reinitialize it to a predefined larger value. This approach has shown strong performance in feedforward networks [11].
Synthetic gradient methods such as meta neural optimizers [1] and Decoupled Neural Interfaces (DNIs) [9] try to find alternatives to backpropagation. Neural optimizers usually operate coordinatewise on the parameters of the base model, whereas a single DNI is responsible for a whole layer’s neurons, which allows it to consider the correlations between neurons. Another characteristic of neural optimizers is that they are defined on a global objective (e.g., the final training loss) and need to keep track of the whole training procedure (i.e., of the base model from its initial state to convergence), so they can only handle a limited number of optimization steps. In contrast, DNIs rely on local targets (e.g., the distance from the local ground-truth gradients), which makes them suitable for more complex training schemes. DNI-enabled RNNs have been shown to outperform pure BPTT-based RNNs with increased time horizons on small-scale language modeling tasks. However, DNIs perform poorly when applied to very deep feedforward networks [8] and, to the best of our knowledge, their effectiveness on large-scale RNNs remains unclear [9].
In this project, we propose a neuronwise architecture based on DNIs for large-scale RNNs, in which we train a separate DNI for each neuron in the base model. All of these DNIs, however, share the same set of parameters, similar to [1]. We then explore and benchmark DNIs in a language modeling setting that involves large-scale RNNs.
To explore the potential of warm restart methods for RNNs, we propose to alternate synthetic and real gradients with periodic warm restarts. Our key hypothesis is that the more diverse loss landscapes [4] that result from combining synthetic and real gradients, together with the perturbations from warm restarts, might enable SGD to explore more within the same number of iterations, which could lead to a better local minimum.
We summarize the contributions as follows:

We design a neuronwise DNI method, where a separate DNI is responsible for each neuron in the base model. Compared to previous methods such as layerwise DNIs or parameterwise neural optimizers, our method reduces the number of parameters by 1-2 orders of magnitude, which makes it possible to effectively train DNIs in a large-scale RNN.

We propose a new periodic warm restart algorithm that alternates training with synthetic and real gradients, which experimentally outperforms previous warm restart methods based on only synthetic or only real gradients.

We provide a comprehensive set of experiments that evaluate warm restart methods and canonical training methods in a large-scale RNN setting. Experiments are conducted on small (PTB) to medium-sized (WikiText-2) corpora. Because DNIs have not previously been applied in a large-scale RNN setting in the literature, we believe the outcome of this comparison will be a valuable reference for the community.
2 Methodology
2.1 Background
Decoupled neural interface.
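Following the notation in [9], given the previous hidden state $h_{t-1}$ and the current input $x_t$, the recurrent core module $f$ with parameters $\theta$ outputs $y_t$ and a new hidden state $h_t$. For language modeling tasks, we have a loss, $L_t$, at every timestep $t$. Ideally, we wish to update $\theta$ using $\sum_{\tau=t}^{\infty} \partial L_\tau / \partial \theta$ in the hope of capturing the infinite future information. But in practice we can only consider a finite horizon $T$ to update $\theta$:
$$\theta \leftarrow \theta - \alpha \sum_{\tau=t}^{\infty} \frac{\partial L_\tau}{\partial \theta} = \theta - \alpha \left( \sum_{\tau=t}^{T} \frac{\partial L_\tau}{\partial \theta} + \delta_T \frac{\partial h_T}{\partial \theta} \right), \qquad \delta_T = \sum_{\tau=T+1}^{\infty} \frac{\partial L_\tau}{\partial h_T} \tag{1}$$
Truncated BPTT only computes $\sum_{\tau=t}^{T} \partial L_\tau / \partial \theta$ and simply sets $\delta_T = 0$ in Eqn. 1. A layerwise DNI module $M$ with its own parameters aims to approximate the true gradient $\delta_T$ with $\hat{\delta}_T = M(h_T) \in \mathbb{R}^{n}$, where $n$ is the number of neurons. Since $\delta_T$ is not a tractable target, following [9], we use $\sum_{\tau=T+1}^{2T} \frac{\partial L_\tau}{\partial h_T} + \hat{\delta}_{2T} \frac{\partial h_{2T}}{\partial h_T}$ as the target. This target is bootstrapped from a synthetic gradient, and BPTT is used only within the span from $T+1$ to $2T$. This bootstrapping is responsible for propagating credit assignment across the boundaries of truncated BPTT.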
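The role of the future-gradient term in Eqn. 1 can be illustrated with a minimal sketch. It assumes a toy scalar RNN, $h_t = w h_{t-1} + x_t$, with a squared-error loss at every step; the model, the data, and the helper names `forward` and `window_grads` are illustrative assumptions and not the setup used in this paper. The sketch compares the truncated gradient (future term set to zero) with the gradient obtained when the exact future term is supplied, which is the quantity a DNI tries to predict.

```python
def forward(w, h0, xs):
    """Run the scalar RNN h_t = w * h_{t-1} + x_t and return [h_1, ..., h_n]."""
    hs, h = [], h0
    for x in xs:
        h = w * h + x
        hs.append(h)
    return hs


def window_grads(w, h0, xs, ys, delta_T=0.0):
    """BPTT inside one window with losses L_t = 0.5 * (h_t - y_t)**2.

    delta_T is the gradient of all losses beyond the window with respect to
    the last hidden state of the window (0.0 reproduces truncated BPTT).
    Returns (gradient w.r.t. w, gradient w.r.t. h0).
    """
    hs = forward(w, h0, xs)
    prev = [h0] + hs[:-1]             # prev[t] is the state fed into step t
    grad_w, grad_h = 0.0, delta_T
    for t in reversed(range(len(xs))):
        grad_h += hs[t] - ys[t]       # add the local term dL_t/dh_t
        grad_w += grad_h * prev[t]    # dh_t/dw = h_{t-1}
        grad_h *= w                   # dh_t/dh_{t-1} = w
    return grad_w, grad_h


# Toy data: two windows of length T = 2 (all values are illustrative).
w, h0, T = 0.9, 0.5, 2
xs = [1.0, -0.5, 0.3, 0.8]
ys = [0.2, 0.1, -0.4, 0.6]

h_T = forward(w, h0, xs[:T])[-1]
# Exact future gradient, obtained here by backpropagating through the next
# window; a DNI would predict this quantity from h_T instead.
_, delta_T = window_grads(w, h_T, xs[T:], ys[T:])

g_truncated, _ = window_grads(w, h0, xs[:T], ys[:T], delta_T=0.0)
g_with_delta, _ = window_grads(w, h0, xs[:T], ys[:T], delta_T=delta_T)
print(g_truncated, g_with_delta)
```

In this toy case the two gradients differ noticeably, because the first window's use of `w` also affects the losses in the second window through `h_T`. That contribution is exactly what truncation drops and what a synthetic gradient is meant to restore.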
Periodic warm restart.
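A warm restart [11] is performed every $T_i$ epochs, where $i$ is the index of the run (a run is defined as the period of time between two consecutive restarts). Within the $i$-th run, the learning rate is decayed as $\eta_t = \eta_{\min}^{i} + \frac{1}{2} (\eta_{\max}^{i} - \eta_{\min}^{i}) \left( 1 + \cos\left( \frac{T_{\mathrm{cur}}}{T_i} \pi \right) \right)$, where $\eta_{\min}^{i}$ and $\eta_{\max}^{i}$ bound the allowed values of the learning rate and $T_{\mathrm{cur}}$ records the number of epochs since the last restart.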
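As a minimal sketch of such a schedule, the function below assumes the cosine annealing rule of [11] with a fixed run length and fixed learning-rate bounds across runs. The function name `sgdr_learning_rate` and all numeric values are illustrative assumptions, not the settings used in our experiments.

```python
import math


def sgdr_learning_rate(epoch, run_length, lr_min, lr_max):
    """Cosine-annealed learning rate with a warm restart every `run_length` epochs."""
    t_cur = epoch % run_length                      # epochs since the last restart
    cosine = math.cos(math.pi * t_cur / run_length)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# Illustrative values only: 3 runs of 4 epochs each.
schedule = [sgdr_learning_rate(e, run_length=4, lr_min=1e-4, lr_max=1e-2)
            for e in range(12)]
for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

The learning rate starts each run at `lr_max`, decays towards `lr_min`, and jumps back to `lr_max` at the next restart.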
2.2 Neuronwise DNI for Language Modeling
To decouple the design of DNIs from a specific base model, we train DNIs in a neuronwise manner, which amounts to assuming that one can predict a neuron’s gradient without taking into account its relationship with the other neurons. The overall framework is shown in Fig. 1, where we assume for simplicity that the layer has only two neurons. We use a shared DNI module to approximate the gradient of each of the two neurons. During the forward pass, each copy of the DNI module has its own distinct hidden state because it receives a different input. In the backward pass, after the base RNN has been trained on a minibatch, we aggregate all the metagradients (by metagradients, we mean the gradients used for training the parameters of the DNI module) and use them to update the shared DNI module at once. That is, the synthetic gradients differ across neurons because of the different hidden states of the base model and of the DNI module itself, not because of the parameters, which are shared. It should be noted that although both the input and the output of the DNI module are scalars, its hidden state has more than one dimension so that it can learn richer representations. We therefore use MLPs to convert between the scalar inputs and outputs and the hidden state dimension.
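The parameter sharing described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class `SharedNeuronDNI` is a hypothetical stand-in with an elementwise tanh recurrence and fixed weights, not the single-layer QRNN with MLP adapters used in our experiments, and `layerwise_linear_params` counts the parameters of a hypothetical linear layerwise predictor purely for comparison.

```python
import math


class SharedNeuronDNI:
    """Hypothetical scalar-in, scalar-out recurrent module shared by all neurons."""

    def __init__(self, state_dim=4):
        # Shared parameters: input, elementwise recurrent, and output weights.
        self.w_in = [0.1 * (k + 1) for k in range(state_dim)]
        self.w_rec = [0.5] * state_dim
        self.w_out = [(-1) ** k * 0.2 for k in range(state_dim)]

    def num_params(self):
        return len(self.w_in) + len(self.w_rec) + len(self.w_out)

    def step(self, h, state):
        """Map one neuron's activation h (a scalar) to a synthetic gradient (a scalar)."""
        new_state = [math.tanh(r * s + i * h)
                     for r, s, i in zip(self.w_rec, state, self.w_in)]
        grad = sum(o * s for o, s in zip(self.w_out, new_state))
        return grad, new_state


def synthetic_gradients(dni, hidden, states):
    """Apply the one shared module to every neuron; each neuron keeps its own state."""
    outputs = [dni.step(h, s) for h, s in zip(hidden, states)]
    return [g for g, _ in outputs], [s for _, s in outputs]


def layerwise_linear_params(n):
    """Parameter count of a hypothetical linear layerwise predictor (n x n weights + n biases)."""
    return n * n + n


dni = SharedNeuronDNI(state_dim=4)
hidden = [0.3, -1.2]                       # a layer with two neurons, as in the text
states = [[0.0] * 4 for _ in hidden]       # one private DNI state per neuron
grads, states = synthetic_gradients(dni, hidden, states)
print(grads, dni.num_params(), layerwise_linear_params(1550))
```

The point of the sketch is that the number of DNI parameters does not depend on the layer width: widening the base layer only adds per-neuron states, whereas a layerwise predictor grows with the number of neurons.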
2.3 Alternating Synthetic and Real Gradients
[4] shows that DNIs lead to drastically different layerwise representations from the ones obtained with BPTT, even though the resulting networks are functionally identical. This observation prompts us to look for a way to take advantage of the diverse representational powers of synthetic and real gradients. We are also inspired by the nonconvex loss surface of deep neural networks and by the ability of SGD to converge to and escape from local minima with the help of warm restarts [11]. Drawing on these two sources of inspiration, we propose to alternate synthetic and real gradients after periodic warm restarts. Our key motivation is that the more diverse loss landscapes that result from combining synthetic and real gradients, together with the perturbations from warm restarts, might enable SGD to explore more within the same number of iterations, which might lead to a better local minimum.
In fact, the authors of DNI [9] also discuss one spatial way, called BP($\lambda$), of mixing synthetic and real gradients. BP($\lambda$) can be seen as a mixture of different estimates of the gradients weighted by $\lambda$ at every timestep. In contrast, we mix different gradients periodically after restarts, which is a temporal mixture. Another related work is snapshot ensembles [7]. Our approach differs in the following ways: 1) we consider RNNs, whereas [11, 7] focus on feedforward networks; 2) after each restart, we alternate the gradients used for training between synthetic and real gradients, whereas previous works only use real gradients.
3 Experiments
The experiments are designed for answering the following questions: Q1: How effective is the neuronwise DNI? Q2: What is the performance of alternating gradients for language modeling?
The two benchmark datasets used for evaluation are Penn Treebank (PTB) [14] and WikiText-2 (WT2) [13]. All experiments use a quasi-recurrent neural network (QRNN) [3] with 4 layers of 1550 neurons each for the actual language modeling tasks. We use truncated BPTT with 200 steps for both the baselines and the DNIs. Each DNI is a single-layer QRNN with 50 neurons. The learning rates for all models are grid-searched from to . We fix the random seed for initialization. Our settings are very similar to those of [12] (e.g., the dropout ratio), except that we do not use averaged stochastic gradient descent (AvSGD) [15], because the averaging operation is incompatible with warm restarts. Instead, we use Adam [10] as the optimizer. We also grid-search the other hyperparameters for both the baselines and our proposed method.
3.1 Evaluation of Neuronwise DNIs (Q1)
Table 1: Perplexity of truncated BPTT and neuronwise DNIs on PTB and WT2.
Method            PTB     WT2
Truncated BPTT    62.78   72.6
Neuronwise DNI    62.42   72.37
Tab. 1 shows the perplexity of neuronwise DNIs and truncated BPTT for QRNNs on the word-level PTB and WT2 datasets.
Neuronwise DNIs achieve lower perplexity than truncated BPTT on both datasets (62.42 vs. 62.78 on PTB and 72.37 vs. 72.6 on WT2). We should emphasize that this is achieved by adding a small QRNN with 50 neurons to the neural language model during training only. It is also worth mentioning that, to the best of our knowledge, this is the first time DNI-enabled RNNs have been evaluated in a large-scale setting.
3.2 Evaluation of Alternating Gradients (Q2)
We first train the base QRNN with real gradients from BPTT during the first run, and then, after a warm restart, switch to the gradients generated by the DNIs for the next run. That is, we set , and we grid-search . The total training length is 320 epochs.
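The alternation can be sketched as a per-epoch plan. The sketch assumes runs of equal length, cosine annealing within each run as in [11], and real gradients in even-numbered runs. The function name `alternating_plan`, the run length of 40 epochs, and the learning-rate bounds are hypothetical placeholders; only the total of 320 epochs comes from the text.

```python
import math


def alternating_plan(total_epochs, run_length, lr_min, lr_max):
    """Return one (epoch, gradient_source, learning_rate) tuple per epoch."""
    plan = []
    for epoch in range(total_epochs):
        run_index = epoch // run_length      # which run this epoch belongs to
        t_cur = epoch % run_length           # epochs since the last restart
        # Even runs use real (BPTT) gradients, odd runs use synthetic (DNI) gradients.
        source = "real" if run_index % 2 == 0 else "synthetic"
        cosine = math.cos(math.pi * t_cur / run_length)
        lr = lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)
        plan.append((epoch, source, lr))
    return plan


# 320 total epochs comes from the text; the run length and the learning-rate
# bounds below are hypothetical placeholders.
plan = alternating_plan(total_epochs=320, run_length=40, lr_min=1e-4, lr_max=1e-2)
for epoch, source, lr in plan[38:42]:
    print(epoch, source, lr)
```

A training loop would read the gradient source and learning rate for the current epoch from this plan, so that the switch between real and synthetic gradients always coincides with a warm restart.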
Unfortunately, we find that SGD with warm restarts for RNNs on language modeling datasets is very sensitive to the choice of restarting points and annealing strategy, and we could not find suitable hyperparameters despite an extensive grid search.
Table 2: Perplexity of the best-performing models trained with warm restarts on PTB and WT2.
Method                         PTB     WT2
Truncated BPTT with Restart    64.44   73.97
DNI with Restart               64.71   73.42
Alternating DNI                63.76   73.35
Tab. 2 shows the perplexity of the best-performing models. Together with Fig. 2 and 3, it shows that even truncated BPTT with restarts does not come close to the results of vanilla truncated BPTT, and that none of the methods with restarts performs well. However, even though SGD with warm restarts does not provide competitive performance, alternating synthetic and real gradients still outperforms its counterparts, BPTT with restarts and DNI with restarts. We argue that this still shows the potential of alternating synthetic and real gradients.
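To make the size of these gaps concrete, the short script below computes absolute and relative perplexity differences from the numbers reported in Tab. 1 and Tab. 2. It is plain arithmetic on the reported values; the helper name `gap` is an illustrative assumption.

```python
# Perplexities copied from Tab. 1 and Tab. 2 (lower is better).
perplexity = {
    "Truncated BPTT": {"PTB": 62.78, "WT2": 72.6},
    "Neuronwise DNI": {"PTB": 62.42, "WT2": 72.37},
    "Truncated BPTT with Restart": {"PTB": 64.44, "WT2": 73.97},
    "DNI with Restart": {"PTB": 64.71, "WT2": 73.42},
    "Alternating DNI": {"PTB": 63.76, "WT2": 73.35},
}


def gap(method, reference, dataset):
    """Absolute and relative (in percent) perplexity change of `method` against `reference`."""
    diff = perplexity[method][dataset] - perplexity[reference][dataset]
    return diff, 100.0 * diff / perplexity[reference][dataset]


comparisons = [
    ("Neuronwise DNI", "Truncated BPTT"),
    ("Alternating DNI", "Truncated BPTT with Restart"),
    ("Alternating DNI", "DNI with Restart"),
    ("Truncated BPTT with Restart", "Truncated BPTT"),
]
for method, reference in comparisons:
    for dataset in ("PTB", "WT2"):
        diff, rel = gap(method, reference, dataset)
        print(f"{method} vs. {reference} on {dataset}: {diff:+.2f} ({rel:+.2f}%)")
```

Negative values mean that the first method has lower perplexity than the reference. The last comparison quantifies how far the restart-based baseline falls behind vanilla truncated BPTT.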
4 Conclusion
We proposed neuronwise DNIs with alternating synthetic and real gradients for language modeling and investigated their performance on large-scale tasks. Experiments showed that neuronwise DNIs are able to propagate credit assignment across the boundaries of truncated BPTT and thus improve RNNs’ performance for language modeling. We also demonstrated that alternating synthetic and real gradients after periodic warm restarts can provide better performance than using either synthetic or real gradients alone. However, because our attempts to make SGD with warm restarts work for RNNs were unsuccessful, we were unable to fully evaluate the effectiveness of alternating synthetic and real gradients. We will investigate the issues with restarts further.
Acknowledgements
We would like to thank Jie Fu for many helpful discussions.
References
 [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 [2] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015.
 [3] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.
 [4] Wojciech Marian Czarnecki, Grzegorz Świrszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals, and Koray Kavukcuoglu. Understanding synthetic gradients and decoupled neural interfaces. arXiv preprint arXiv:1703.00522, 2017.
 [5] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
 [6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [7] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017.
 [8] Zhouyuan Huo, Bin Gu, and Heng Huang. Training neural networks using features replay. In Advances in Neural Information Processing Systems, pages 6658–6667, 2018.
 [9] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
 [10] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [11] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.
 [12] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
 [13] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 [14] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
 [15] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 [16] Trieu H Trinh, Andrew M Dai, Minh-Thang Luong, and Quoc V Le. Learning longer-term dependencies in RNNs with auxiliary losses. 2018.