Alternating Synthetic and Real Gradients for Neural Language Modeling

by   Fangxin Shang, et al.

Training recurrent neural networks (RNNs) with backpropagation through time (BPTT) has known drawbacks, such as difficulty in capturing long-term dependencies in sequences. Successful alternatives to BPTT have not yet been discovered. Recently, backpropagation with synthetic gradients produced by a decoupled neural interface (DNI) module has been proposed as a replacement for BPTT when training RNNs. It has also been shown that the representations learned with synthetic and real gradients are different even though they are functionally identical. In this project, we explore ways of combining synthetic and real gradients, with application to neural language modeling tasks. Empirically, we demonstrate the effectiveness of alternating training with synthetic and real gradients after periodic warm restarts on language modeling tasks.



1 Introduction

Recurrent neural networks (RNNs) are notoriously difficult to train to capture long-term patterns with gradient-based methods because of the well-known vanishing and exploding gradient problems. Although backpropagation through time (BPTT) has been shown to work reasonably well in various RNN settings, it suffers from truncation, which fundamentally limits its ability to learn long-term dependencies.

While a plethora of methods try to mitigate this issue through architectural innovation [6, 5, 2, 16], another line of research attacks the problem by looking for a better optimization method, including periodic warm restarts [11] and synthetic gradients [1, 9].

A periodic warm restart anneals the learning rate relatively quickly and periodically re-initializes it to some predefined larger value. It has shown strong performance in feed-forward networks [11].

Synthetic gradient methods, such as meta neural optimizers [1] and the Decoupled Neural Interface (DNI) [9], try to find alternatives to backpropagation. Neural optimizers usually operate coordinate-wise on the parameters of the base model, whereas a single DNI is responsible for all the neurons of a layer, which allows it to consider the correlations between neurons. Another characteristic of neural optimizers is that they are defined on a global objective (e.g. the final training loss) and need to keep track of the whole training procedure (i.e. the base model from its initial state to convergence), so they can only handle a limited number of optimization steps. In contrast, DNIs rely on local targets (e.g. the distance from the local ground-truth gradients), which makes them suitable for more complex training schemes.

DNI-enabled RNNs have been shown to outperform pure BPTT-based RNNs with increased time horizons on small-scale language modeling tasks. However, DNIs perform poorly when applied to very deep feed-forward networks [8], and to the best of our knowledge, their effectiveness on large-scale RNNs remains unclear [9].

In this project we propose a neuron-wise architecture based on the DNI for large-scale RNNs, in which we train a separate DNI for each neuron in the base model. All of these DNIs, however, share the same set of parameters, which is similar to [1]. We then explore and benchmark DNIs in a language modeling setting that involves large-scale RNNs.

To explore the potential of warm restart methods for RNNs, we propose to alternate synthetic and real gradients with periodic warm restarts. Our key hypothesis is that the more diverse loss landscapes [4] that result from combining synthetic and real gradients, together with the perturbations from warm restarts, might enable SGD to explore more within the same number of iterations, which could lead to a better local minimum.

We summarize the contributions as follows:

  1. We design a neuron-wise DNI method, where a separate DNI is responsible for each neuron in the base model. Compared to previous methods such as layer-wise DNIs or parameter-wise neural optimizers, our method reduces the number of parameters by 1-2 orders of magnitude, which makes it possible to train DNIs effectively in a large-scale RNN.

  2. We propose a new periodic warm restart algorithm that alternates training with synthetic and real gradients. In our experiments it outperforms warm restart methods that use only synthetic or only real gradients.

  3. We provide a set of experiments that evaluates warm restart methods and canonical training methods in a large-scale RNN setting. Experiments are conducted on a small (PTB) and a medium-sized (WikiText-2) corpus. Because DNIs have not previously been applied in a large-scale RNN setting in the literature, we believe the outcome of this comparison will be a valuable reference for the community.

2 Methodology

2.1 Background

Decoupled neural interface.

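Following the notations in [1], given the previous hidden state $h_{t-1}$ and the current input $x_t$, the recurrent core module $f$ with parameters $\theta$ outputs $y_t$ and a new hidden state $h_t$. For language modeling tasks, we have a loss, $L_t$, at every time-step $t$. Ideally, we wish to update $\theta$ using $\sum_{\tau=t}^{\infty} \partial L_\tau / \partial \theta$ in the hope of capturing the infinite future information. But in practice we can only consider a finite horizon $T$ to update $\theta$:

$$\theta \leftarrow \theta - \alpha \sum_{\tau=t}^{\infty} \frac{\partial L_\tau}{\partial \theta} = \theta - \alpha \left( \sum_{\tau=t}^{T} \frac{\partial L_\tau}{\partial \theta} + \delta_T \frac{\partial h_T}{\partial \theta} \right), \qquad \delta_T = \sum_{\tau=T+1}^{\infty} \frac{\partial L_\tau}{\partial h_T} \tag{1}$$

where $\alpha$ is the learning rate. Truncated BPTT only computes $\sum_{\tau=t}^{T} \partial L_\tau / \partial \theta$ and simply sets $\delta_T = 0$ in Eqn. 1. A layer-wise DNI module $g$ parameterized by $\phi$ aims to approximate the true gradients with $\hat{\delta}_T = g(h_T \mid \phi) \in \mathbb{R}^n$, where $n$ is the number of neurons. Since $\delta_T$ is not a tractable target, following [1], we use $\delta'_T = \sum_{\tau=T+1}^{2T} \frac{\partial L_\tau}{\partial h_T} + \hat{\delta}_{2T} \frac{\partial h_{2T}}{\partial h_T}$. This target is bootstrapped from a synthetic gradient, and BPTT is only used during this span. This bootstrapping is responsible for propagating credit assignment across the boundaries of truncated BPTT.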
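The role of the gradient that arrives at a truncation boundary can be checked on a toy problem. The following is a minimal sketch, assuming a scalar linear RNN with a squared-error loss at every step; the model, the function names and the numbers are illustrative assumptions and are not taken from the paper. It shows that truncated BPTT (boundary gradient set to zero) loses part of the full gradient, while supplying the exact boundary gradient recovers it. A DNI would supply a learned estimate in place of `delta_T`.

```python
# Toy scalar RNN (an assumption for illustration): h_t = w * h_{t-1} + x_t,
# with per-step loss L_t = 0.5 * (h_t - y_t)^2.

def forward(w, xs, h0=0.0):
    """Return the hidden states [h_0, h_1, ..., h_N]."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs

def window_grad(w, hs, ys, start, end, delta_end):
    """Backpropagate through steps start+1..end only.

    delta_end stands in for d(sum of losses after `end`) / d h_end.
    Setting it to 0 gives truncated BPTT. Returns the gradient w.r.t. the
    uses of w inside the window, and the gradient that flows into h_start.
    """
    grad_w = 0.0
    carry = delta_end                      # credit arriving from later steps
    for t in range(end, start, -1):
        g_h = (hs[t] - ys[t - 1]) + carry  # dL_t/dh_t plus future credit
        grad_w += g_h * hs[t - 1]          # dh_t/dw = h_{t-1}
        carry = w * g_h                    # dh_t/dh_{t-1} = w
    return grad_w, carry

w, T = 0.9, 3
xs = [1.0, -0.5, 0.3, 0.8, -0.2, 0.4]
ys = [0.5, 0.0, 0.2, 0.6, 0.1, 0.3]
hs = forward(w, xs)

# Reference: BPTT over the whole sequence.
full_grad, _ = window_grad(w, hs, ys, 0, len(xs), 0.0)

# Second window: nothing lies beyond it, so a zero boundary gradient is exact.
# The returned carry is the gradient of its losses w.r.t. h_T.
grad_2, delta_T = window_grad(w, hs, ys, T, 2 * T, 0.0)

# First window, once truncated and once with the exact boundary gradient.
grad_1_truncated, _ = window_grad(w, hs, ys, 0, T, 0.0)
grad_1_exact, _ = window_grad(w, hs, ys, 0, T, delta_T)

print("full:", full_grad)
print("truncated:", grad_1_truncated + grad_2)
print("with boundary gradient:", grad_1_exact + grad_2)
```

In this sketch the second return value of `window_grad` also plays the role of the bootstrapped target: if `delta_end` is a synthetic gradient rather than zero, the returned `carry` combines the in-window losses with that estimate, which is what a DNI at the previous boundary would be regressed onto.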

Periodic warm restart.

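A warm restart [11] is performed every $T_i$ epochs, where $i$ is the index of the run (a run is defined as the period of time between two consecutive restarts). Within the $i$-th run, the learning rate is decayed as $\eta_t = \eta_{\min}^i + \frac{1}{2} (\eta_{\max}^i - \eta_{\min}^i) \left( 1 + \cos\left( \frac{T_{cur}}{T_i} \pi \right) \right)$, where $\eta_{\min}^i$ and $\eta_{\max}^i$ bound the allowed values of the learning rate and $T_{cur}$ records the number of epochs since the last restart.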
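As a concrete illustration of this schedule, the sketch below computes a cosine-annealed learning rate with warm restarts. It assumes, for simplicity, that every run has the same length and the same learning-rate bounds; the function name and the example values are hypothetical.

```python
import math

def sgdr_learning_rate(epoch, run_length, lr_min, lr_max):
    """Cosine-annealed learning rate, restarted every `run_length` epochs.

    Assumes a fixed run length and fixed bounds for all runs.
    """
    t_cur = epoch % run_length  # epochs since the last restart
    cos_term = math.cos(math.pi * t_cur / run_length)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)

# Example: two runs of 10 epochs each, with hypothetical bounds.
schedule = [sgdr_learning_rate(e, 10, 1e-5, 1e-3) for e in range(20)]
```

The rate starts each run at `lr_max`, decays towards `lr_min`, and jumps back to `lr_max` at the next restart.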

2.2 Neuron-wise DNI for Language Modeling

Figure 1: Illustration of neuron-wise DNI training for a layer with two neurons unrolled horizontally. Green arrows indicate gradients delivered by BPTT. Blue arrows represent synthetic gradients generated by DNIs.

To decouple the design of DNIs from a specific base model, we train DNIs in a neuron-wise manner, which amounts to assuming that one can predict a neuron's gradients without taking into account its relationship with other neurons. The overall framework is shown in Fig. 1, where we assume for simplicity that the layer has only two neurons. We use a single shared DNI module to approximate the gradient of each of the two neurons. During the forward pass, each copy of the DNI module has its own distinct hidden state because it receives a different input. In the backward pass, after the base RNN has trained on a mini-batch, we aggregate all the meta-gradients (by meta-gradients, we mean the gradients used to train the parameters of the DNI module) and use them to update the shared DNI module at once. That is, the synthetic gradients differ across neurons because of the different hidden states of the base model and the different hidden states of the DNI itself, not because of the DNI parameters, which are shared. It should be noted that although both the input and the output of the DNI are scalars, its hidden state has more than one dimension so that it can learn richer representations. Thus we use MLPs to convert the dimensions of the inputs and outputs.
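The parameter sharing described above can be sketched in a few lines. This is a deliberately small stand-in rather than the architecture used in the experiments: it assumes a tiny recurrent module with element-wise recurrent weights in place of a QRNN with MLP adapters, and it backpropagates the meta-gradients through one step only. The class name, sizes and values are all hypothetical.

```python
import math
import random

class SharedNeuronDNI:
    """One set of parameters, applied to every neuron independently."""

    def __init__(self, state_size=4, seed=0):
        rng = random.Random(seed)

        def init():
            return [rng.uniform(-0.5, 0.5) for _ in range(state_size)]

        self.w_in, self.w_rec, self.w_out = init(), init(), init()

    def num_parameters(self):
        # Independent of the number of neurons in the base layer.
        return len(self.w_in) + len(self.w_rec) + len(self.w_out)

    def step(self, h, state):
        """Map one neuron's scalar value h to a scalar synthetic gradient."""
        new_state = [math.tanh(a * h + r * s)
                     for a, r, s in zip(self.w_in, self.w_rec, state)]
        delta_hat = sum(o * s for o, s in zip(self.w_out, new_state))
        return delta_hat, new_state

    def update(self, hs, states, targets, lr=0.1):
        """Aggregate meta-gradients over all neurons, then update once."""
        k, n = len(self.w_out), len(hs)
        g_in, g_rec, g_out = [0.0] * k, [0.0] * k, [0.0] * k
        new_states = []
        for h, state, target in zip(hs, states, targets):
            delta_hat, new_state = self.step(h, state)
            err = delta_hat - target  # gradient of 0.5 * err^2
            for j in range(k):
                local = err * self.w_out[j] * (1.0 - new_state[j] ** 2)
                g_out[j] += err * new_state[j] / n
                g_in[j] += local * h / n
                g_rec[j] += local * state[j] / n
            new_states.append(new_state)
        for j in range(k):  # a single update of the shared parameters
            self.w_in[j] -= lr * g_in[j]
            self.w_rec[j] -= lr * g_rec[j]
            self.w_out[j] -= lr * g_out[j]
        return new_states

dni = SharedNeuronDNI()
hs = [0.3, -0.7]                          # two neurons, as in Fig. 1
states = [[0.0] * 4 for _ in hs]          # one DNI hidden state per neuron
states = dni.update(hs, states, targets=[0.1, -0.2])
```

Each neuron keeps its own entry in `states`, while `w_in`, `w_rec` and `w_out` exist only once, so the parameter count does not grow with the width of the base layer.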

2.3 Alternating Synthetic and Real Gradients

[4] shows that DNI leads to drastically different layer-wise representations from the ones learned with BPTT, even though they are functionally identical. This observation prompts us to find a way to take advantage of the diverse representational powers of synthetic and real gradients. We are also inspired by the characteristics of the non-convex loss surface of deep neural networks and the ability of SGD to converge to and escape from local minima with the help of warm restarts [11]. Drawing on these two sources of inspiration, we propose to alternate synthetic and real gradients after periodic warm restarts. Our key motivation is that the more diverse landscapes that result from combining synthetic and real gradients, together with the perturbations from warm restarts, might enable SGD to explore more within the same number of iterations, which might lead to a better local minimum.

In fact, the authors of DNI [1] also discuss one spatial way, called BP($\lambda$), of mixing synthetic and real gradients. BP($\lambda$) can be seen as a mixture of different estimates of the gradients weighted by $\lambda$ at every time-step. In contrast, we mix different gradients periodically after restarts, which is a temporal mixture. Another related work is snapshot ensembles [7]. Our approach differs in the following ways: 1) we consider RNNs, whereas [11, 7] focus on feed-forward networks; 2) after each restart, we alternate the gradients used for training between synthetic and real gradients, whereas previous works only use real gradients.
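The temporal mixture can be written as a simple schedule over epochs. The sketch below assumes runs of equal length and a strict alternation that starts with real gradients; both the function names and the run length of 80 epochs are hypothetical choices for illustration.

```python
def gradient_source(epoch, run_length):
    """Return which gradients train the base model in the given epoch.

    Assumes equal-length runs that alternate, starting with real gradients.
    """
    run_index = epoch // run_length
    return "real" if run_index % 2 == 0 else "synthetic"

def restart_epochs(total_epochs, run_length):
    """Epochs at which a warm restart and a switch of gradient source occur."""
    return list(range(run_length, total_epochs, run_length))

# Example with a hypothetical run length of 80 epochs.
plan = [(e, gradient_source(e, 80)) for e in restart_epochs(320, 80)]
```

A training loop would query `gradient_source` at the start of each epoch, reset the learning rate at every restart epoch, and either backpropagate with truncated BPTT or use the DNI outputs accordingly.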

3 Experiments

The experiments are designed to answer the following questions. Q1: How effective is the neuron-wise DNI? Q2: How well does alternating gradients perform for language modeling?

The two benchmark datasets used for evaluation are Penn Treebank (PTB) [14] and WikiText-2 (WT2) [13]. All experiments use a quasi-recurrent neural network (QRNN) [3] with 4 layers, each having 1550 neurons, for the actual language modeling tasks. We use truncated BPTT with 200 steps for both the baselines and the DNIs. Each DNI is a single-layer QRNN with 50 neurons. The learning rates for all models are chosen by grid search. We fix the random seed for initialization. Our settings are very similar to [12] (e.g. the dropout ratio), except that we do not use averaged stochastic gradient (AvSGD) [15], because the averaging operation is incompatible with warm restarts. Therefore, we use Adam [10] as the optimizer. We also grid-search the other hyperparameters for both the baseline and our proposed method.

3.1 Evaluation of Neuron-wise DNIs (Q1)

Figure 2: Validation perplexity with QRNN on PTB. Results before 25 epochs are not shown.
Figure 3: Validation perplexity with QRNN on WT2. Results before 15 epochs are not shown.

Method             PTB      WT2
Truncated BPTT     62.78    72.6
Neuron-wise DNI    62.42    72.37

Table 1: Test perplexity of truncated BPTT and neuron-wise DNI for QRNNs on the PTB and WT2 datasets. Lower is better.

Tab. 1 shows the performance of neuron-wise DNIs and truncated BPTT for QRNNs on the word-level PTB and word-level WT2 datasets. Neuron-wise DNIs beat truncated BPTT on both datasets (62.42 vs. 62.78 on PTB and 72.37 vs. 72.6 on WT2). We should emphasize that this is achieved by adding a small QRNN with 50 neurons to the neural language model only during training. It is also worth mentioning that this is the first time DNI-enabled RNNs have been evaluated in a large-scale setting.
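For reference, the perplexity reported in the tables is the exponential of the average negative log-likelihood per token. The sketch below shows this standard definition, assuming natural-log token probabilities; the function name and the toy input are illustrative and do not reproduce the evaluation code used in the experiments.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4
# (up to floating-point rounding).
uniform_ppl = perplexity([math.log(0.25)] * 8)
```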

3.2 Evaluation of Alternating Gradients (Q2)

We first train the base QRNN with real gradients by BPTT for the first epochs, and then switch to the gradients generated from DNIs for the next epochs after a warm restart. That is, we set , and we grid-search . The total running time is 320 epochs.

Unfortunately, we find that SGD with warm restarts for RNNs on language modeling datasets is very sensitive to the choice of the restarting points and the annealing strategy, and we could not find suitable hyperparameters even after an extensive grid search.

Method                         PTB      WT2
Truncated BPTT with Restart    64.44    73.97
DNI with Restart               64.71    73.42
Alternating DNI                63.76    73.35

Table 2: Test perplexity of truncated BPTT with restart, neuron-wise DNI with restart, and alternating DNI for QRNNs on the PTB and WT2 datasets. Lower is better.

Tab. 2 shows the test perplexity of the best-performing models. Together with Fig. 2 and 3, we can see that even truncated BPTT with restart does not come close to the results of vanilla truncated BPTT, and none of the methods with restart perform well. However, it should be noted that even though SGD with warm restarts does not provide competitive performance, alternating synthetic and real gradients still outperforms its counterparts, BPTT with restart and DNI with restart, on both datasets. We argue that this still shows the potential of alternating synthetic and real gradients.

4 Conclusion

We proposed neuron-wise DNIs with alternating synthetic and real gradients for language modeling and investigated their performance on large-scale tasks. Experiments showed that neuron-wise DNIs are able to propagate credit assignment across the boundaries of truncated BPTT and thus improve RNNs' performance on language modeling. We also showed that, among the methods with periodic warm restarts, alternating synthetic and real gradients performs better than using either synthetic or real gradients alone. However, because our attempts to make SGD with warm restarts work for RNNs were unsuccessful, we were unable to fully evaluate the effectiveness of alternating synthetic and real gradients. We will investigate the issues with restarts further.


We would like to thank Jie Fu for many helpful discussions.