Learning Long Term Dependencies via Fourier Recurrent Units

03/17/2018 · by Jiong Zhang, et al.

It is a known fact that training recurrent neural networks for tasks that have long term dependencies is challenging. One of the main reasons is the vanishing or exploding gradient problem, which prevents gradient information from propagating to early layers. In this paper we propose a simple recurrent architecture, the Fourier Recurrent Unit (FRU), that stabilizes the gradients that arise in its training while giving us stronger expressive power. Specifically, FRU summarizes the hidden states h^(t) along the temporal dimension with Fourier basis functions. This allows gradients to easily reach any layer due to FRU's residual learning structure and the global support of trigonometric functions. We show that FRU has gradient lower and upper bounds independent of temporal dimension. We also show the strong expressivity of sparse Fourier basis, from which FRU obtains its strong expressive power. Our experimental study also demonstrates that with fewer parameters the proposed architecture outperforms other recurrent architectures on many tasks.


1 Introduction

Deep neural networks (DNNs) have shown remarkably better performance than classical models on a wide range of problems, including speech recognition, computer vision and natural language processing. Despite their tremendous expressive power to fit very complex functions, training them by back-propagation can be difficult. The two main issues are vanishing and exploding gradients. These become particularly troublesome for recurrent neural networks (RNNs), since the weight matrix is identical at each time step and any small change is amplified exponentially through the recurrent layers [BSF94]. Although exploding gradients can be somewhat mitigated by tricks like gradient clipping or normalization [PMB13], vanishing gradients are harder to deal with: when gradients vanish, little information propagates back through back-propagation, so deep RNNs have great difficulty learning long-term dependencies.

Many models have been proposed to address the vanishing/exploding gradient issue for DNNs. For example, Long Short Term Memory (LSTM) [HS97] adds memory gates, while residual networks [HZRS16] add shortcuts that skip intermediate layers. Recently, the approach of directly maintaining a statistical summary of past layers has drawn attention, as in the statistical recurrent unit (SRU) [OPS17]. However, as we show later, these models still suffer from vanishing gradients and have limited access to past layers.

In this paper, we present a novel recurrent architecture, the Fourier Recurrent Unit (FRU), which uses a Fourier basis to summarize the hidden states over past time steps. We show that this mitigates the vanishing gradient problem and gives access to any region of past time steps. In more detail, we make the following contributions:

  • We propose a method (FRU) that summarizes the hidden states of a recurrent neural network across past time steps with a Fourier basis. Any statistical summary of past hidden states can then be approximated by a linear combination of the summarized Fourier statistics.

  • Theoretically, we show the expressive power of a sparse Fourier basis and prove that FRU solves the vanishing gradient problem by establishing gradient norm bounds. Specifically, we show that in the linear setting, SRU only improves the gradient lower/upper bounds of RNN by a constant factor in the exponent (i.e., both bounds remain exponential in the temporal dimension T), while FRU's gradient is lower and upper bounded by constants independent of T.

  • We tested FRU together with RNN, LSTM and SRU on both synthetic and real-world datasets, such as pixel-by-pixel (permuted) MNIST and the IMDB movie review dataset. FRU shows its superiority on all of these tasks while using a smaller number of parameters than LSTM/SRU.

We now outline the paper. In Section 2 we discuss related work, while in Section 3 we introduce the FRU architecture and explain the intuition behind the statistical summary and residual learning. In Sections 4 and 5 we prove the expressive power of the sparse Fourier basis and show that in the linear case FRU has constant lower and upper bounds on the gradient magnitude. Experimental results on synthetic benchmark datasets as well as real datasets such as pixel MNIST and language data are presented in Section 6. Finally, we present our conclusions and suggest several interesting directions in Section 7.

2 Related Work

Numerous studies have attempted to address the vanishing and exploding gradient problems, such as the use of self-loops and gating units in the LSTM [HS97] and GRU [CVMBB14]. These models use trained gate units on inputs or memory states to keep the memory for a longer period of time, enabling them to capture longer-term dependencies than RNNs. However, it has also been argued that with a simple initialization trick, RNNs can outperform LSTM on some benchmark tasks [LJH15]. Apart from these advanced frameworks, straightforward methods like gradient clipping [Mik12] and spectral regularization [PMB13] have also been proposed.

Residual networks [HZRS16] brought wide attention to the idea of giving MLPs and CNNs shortcuts that skip intermediate layers, allowing gradients to flow back to the first layer without being diminished; it is also claimed that this helps preserve features that are already good. Although ResNet was originally developed for MLP and CNN architectures, many extensions to RNNs have shown improvements, such as the maximum entropy RNN (ME-RNN) [MDP11], highway LSTM [ZCY16] and Residual LSTM [KEKL17].

Another recently proposed method, the statistical recurrent unit (SRU) [OPS17], keeps moving averages of summary statistics through past time steps. Rather than using gated units to decide what should be memorized, at each step SRU memory cells incorporate new information at rate (1 − α) and forget old information at rate α. Thus, by linearly combining multiple memory cells with different α's, the SRU obtains a multi-scale view of the past. However, the weight of the moving average decays exponentially through time and goes to zero given enough time steps. This prevents the SRU from accessing hidden states more than a few time steps in the past, and allows gradients to vanish. Also, the span of exponential functions has limited expressive power, which limits the expressivity of the whole network.

The Fourier transform is a powerful mathematical tool that has been successful in many applications. However, previous studies of Fourier expressive power have concentrated on the dense Fourier transform. Price and Song [PS15] formalized the sparse Fourier transform problem in the continuous setting and provided an algorithm that requires a frequency gap. Building on that, [CKPS16] proposed a frequency-gap-free algorithm and characterized the expressive power of the sparse Fourier transform. A key observation behind the gap-free algorithm is that a low-degree polynomial behaves similarly to a Fourier-sparse signal. To understand the expressive power of the Fourier basis, we use the framework designed by [PS15] and techniques from [PS15, CKPS16].

There have been attempts to combine the Fourier transform with RNNs: the Fourier RNN [KS97] uses a sinusoidal activation function in the RNN model, and ForeNet [ZC00] notes the similarity between Fourier analysis of time series and RNN predictions, arriving at an RNN with a diagonal transition matrix. For CNNs, the FCNN [PWCZ17] replaces the sliding-window approach with the Fourier transform in the convolutional layer. Although some of these methods show improvements, they have not fully exploited the expressive power of the Fourier transform, nor avoided the vanishing/exploding gradient issue. Motivated by these shortcomings, we develop a method that has a thorough view of the past hidden states, has strong expressive power, and does not suffer from the vanishing/exploding gradient problem.

Notation. We use [n] to denote the set {1, 2, …, n}. We provide several definitions related to a matrix A. Let det(A) denote the determinant of a square matrix A, and A⊤ denote the transpose of A. Let ‖A‖ denote the spectral norm of A, and A^k denote the square matrix A multiplied by itself k times. Let σ_i(A) denote the i-th largest singular value of A. For any function f, we define Õ(f) to be f · log^{O(1)}(f). In addition, for two functions f, g, we use the shorthand f ≲ g (resp. f ≳ g) to indicate that f ≤ C g (resp. f ≥ C g) for an absolute constant C. We use f ≂ g to mean that c f ≤ g ≤ C f for constants c and C. The Appendix provides detailed proofs and additional experimental results.

3 Fourier Recurrent Unit

In this section, we first introduce our notation in the RNN framework and then describe our method, the Fourier Recurrent Unit (FRU), in detail. Given the hidden state vector h^(t−1) from the previous time step and the input x^(t), an RNN computes the next hidden state h^(t) and the output o^(t) as:

  h^(t) = σ(W h^(t−1) + U x^(t) + b),  o^(t) = V h^(t) + c,   (1)

where σ is the activation function, W, U, V are weight matrices, b, c are bias vectors, t ∈ [T] is the time step and h^(t) is the hidden state at step t. In an RNN, the output at each step depends locally on h^(t) and is only remotely linked with previous hidden states (through multiple weight matrices and activations). This gives rise to the idea of directly summarizing hidden states through time.
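As a concrete reference point, the update in Eq. (1) can be sketched in a few lines of NumPy. The tanh activation and the dimensions here are illustrative assumptions, not choices prescribed by the paper.

```python
import numpy as np

# Minimal sketch of the RNN update in Eq. (1): the next hidden state is a
# nonlinear function of the previous hidden state and the current input.
# sigma = tanh is an assumption; the paper leaves the activation generic.
def rnn_step(h_prev, x_t, W, U, b):
    return np.tanh(W @ h_prev + U @ x_t + b)

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W = 0.1 * rng.standard_normal((d_h, d_h))
U = 0.1 * rng.standard_normal((d_h, d_x))
b = np.zeros(d_h)

h = np.zeros(d_h)
for t in range(5):                 # unroll five time steps
    h = rnn_step(h, rng.standard_normal(d_x), W, U, b)
print(h.shape)  # (4,)
```

Note that the output at step t sees earlier inputs only through repeated applications of W and the nonlinearity, which is exactly the locality the text describes.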

Statistical Recurrent Unit.

For each t ∈ [T], [OPS17] propose the SRU with the following update rules:

  r^(t) = f(W_r u^(t−1) + b_r),
  φ^(t) = f(W_φ r^(t) + W_x x^(t) + b_φ),
  u^(t) = A u^(t−1) + (I − A) φ^(t),   (2)

where f is the activation function. Given the decay factors α_1, …, α_m, the decaying matrix A is block diagonal:

  A = diag(α_1 I, …, α_m I).

For each t ∈ [T] and i ∈ [m], u_i^(t) can be expressed as a summary statistic across previous time steps with the corresponding α_i:

  u_i^(t) = (1 − α_i) Σ_{τ=0}^{t−1} α_i^τ φ_i^(t−τ).   (3)

However, it is easy to see from (3) that the weight on φ_i^(t−τ) vanishes exponentially with τ, so the SRU cannot access hidden states from more than a few time steps ago. As we show later in Section 5, the statistical summary only improves the gradient lower bound by a constant factor in the exponent and still suffers from vanishing gradients. Also, the span of exponential functions has limited expressive power, so linear combinations of the entries of u^(t) also have limited expressive power.
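The exponential decay of the moving-average weights is easy to see numerically. The snippet below evaluates the weight that the SRU summary in Eq. (3) places on the statistic from τ steps in the past; the decay rate 0.9 is an arbitrary illustrative choice.

```python
import numpy as np

# Weight placed by the SRU moving average on the statistic from tau steps
# ago: (1 - alpha) * alpha**tau (cf. Eq. (3)).  Even for a slow decay
# rate, the weight is numerically negligible after a few hundred steps,
# which is the access limitation discussed in the text.
alpha = 0.9
taus = np.arange(0, 500)
weights = (1 - alpha) * alpha ** taus
print(weights[0], weights[100], weights[499])
```

With alpha = 0.9, the weight on a state 100 steps back is already below 1e-5, so gradients flowing through that path vanish at the same rate.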

Fourier Recurrent Unit.

Recall that the Fourier expansion of a continuous function f defined on [0, T] can be expressed as:

  f(t) = a_0/2 + Σ_{n=1}^∞ ( a_n cos(2πnt/T) + b_n sin(2πnt/T) ),

where the coefficients are:

  a_n = (2/T) ∫_0^T f(t) cos(2πnt/T) dt,  b_n = (2/T) ∫_0^T f(t) sin(2πnt/T) dt.

To utilize the strong expressive power of the Fourier basis, we propose the Fourier recurrent unit model. Let {f_1, …, f_K} denote a set of frequencies. For each t ∈ [T], we have the following update rules:

  r^(t) = f(W_r u^(t−1) + b_r),
  φ^(t) = f(W_φ r^(t) + W_x x^(t) + b_φ),
  u^(t) = u^(t−1) + (1/T) C^(t) φ^(t),   (4)

where C^(t) is the cosine matrix containing K square matrices:

  C^(t) = [C_1^(t); …; C_K^(t)],

and each C_k^(t) is a diagonal matrix with the cosine at a distinct frequency evaluated at time step t:

  C_k^(t) = cos(2π f_k t/T) I_d,

where k ∈ [K] and d is the dimension for each frequency. For every t ∈ [T], the entry u_k^(t) has the expression:

  u_k^(t) = (1/T) Σ_{τ=1}^t cos(2π f_k τ/T) φ^(τ).   (5)
Figure 1: The Fourier Recurrent Unit

As seen from (5), due to the global support of trigonometric functions, we can directly link with hidden states at any time step. Furthermore, because of the expressive power of the Fourier basis, given enough frequencies, can express any summary statistic of previous hidden states. As we will prove in later sections, these features prevent FRU from vanishing/exploding gradients and give it much stronger expressive power than RNN and SRU.
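The global support of the cosine weights in Eq. (5) can be made concrete with a small sketch. This is a hedged reconstruction of the Fourier accumulator only, under our reading of the update rules: the feed-forward statistics phi are taken as given, and the exact parameterisation (scaling, phases) may differ in detail from the released implementation.

```python
import numpy as np

# Sketch of the FRU summary in Eq. (5): each memory block u_k accumulates
# the per-step statistic phi weighted by a cosine at a fixed frequency
# f_k, so u_k at the final step is a weighted sum over *all* past steps,
# with weights of magnitude 1/T that never decay to zero.
def fru_summary(phis, freqs, T):
    """phis: (T, d) per-step statistics; returns (K, d) Fourier summaries."""
    t = np.arange(1, T + 1)
    u = np.zeros((len(freqs), phis.shape[1]))
    for k, f in enumerate(freqs):
        w = np.cos(2 * np.pi * f * t / T) / T   # globally supported weights
        u[k] = w @ phis                          # sum over all time steps
    return u

rng = np.random.default_rng(1)
T, d = 64, 8
phis = rng.standard_normal((T, d))
u = fru_summary(phis, freqs=[0.0, 1.0, 2.0], T=T)
print(u.shape)  # (3, 8)
```

Frequency 0 reduces to a plain running mean of the statistics, so the SRU-style average is a special case, while nonzero frequencies weight every past step without exponential decay.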

Connection with residual learning.

The Fourier recurrent update of u^(t) can also be written as:

  u^(t) = u^(t−1) + (1/T) C^(t) φ^(t).

Thus the information flows from layer t−1 to layer t along two paths. The second term, (1/T) C^(t) φ^(t), must pass through two layers of non-linearity and several weight matrices, and is scaled down by 1/T, while the first term, u^(t−1), goes directly into u^(t) through an identity mapping. Thus FRU directly incorporates the idea of residual learning while limiting the magnitude of the residual term. This not only helps information flow more smoothly along the temporal dimension, but also acts as a regularizer that keeps the gradient between adjacent layers close to the identity:

  ∂u^(t)/∂u^(t−1) ≈ I.

Intuitively, this mitigates the exploding/vanishing gradient issue. Later, in Section 5, we give a formal proof and a comparison with SRU/RNN.

4 Fourier Basis

Figure 2:  Temperature changes in Beijing from 2010 to 2012, and fits with Fourier basis functions: (a) 5; (b) 20; (c) 60; (d) 100 basis functions.

In this section we show that FRU has stronger expressive power than SRU by comparing the expressive power of a limited number of Fourier basis functions (a sparse Fourier basis) with that of exponential functions. On the one hand, we show that a sparse Fourier basis can approximate polynomials well. On the other hand, we prove that even infinitely many exponential functions cannot fit a polynomial of constant degree.
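The first claim can be checked numerically before proving it. The sketch below fits a cubic with a handful of cosine basis functions by least squares; the target t^3, the 16 half-period frequencies, and the grid are our illustrative choices, not the construction used in the lemma.

```python
import numpy as np

# Numerical companion to Section 4.1: a small number of cosine basis
# functions, fitted by least squares, approximate a low-degree polynomial
# on [0, 1] closely.
t = np.linspace(0.0, 1.0, 400)
target = t ** 3                                    # degree-3 polynomial
K = 16
A = np.cos(np.pi * np.outer(t, np.arange(K)))      # cos(pi*k*t), k = 0..15
coef, *_ = np.linalg.lstsq(A, target, rcond=None)  # best L2 fit
mse = np.mean((A @ coef - target) ** 2)
print(mse)
```

The mean squared residual is far below 1e-3, illustrating the approximation power that Lemma 4.2 quantifies; the contrasting claim for exponentials is Theorem 4.4.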

First, we state several basic facts which will be later used in the proof.

Lemma 4.1.

Given a square Vandermonde matrix V ∈ R^{n×n} with V_{i,j} = x_i^{j−1}, we have

  det(V) = Π_{1 ≤ i < j ≤ n} (x_j − x_i);

in particular, V is invertible whenever the x_i are distinct. Also recall the Taylor expansions of sin(x) and cos(x):

  sin(x) = Σ_{n=0}^∞ (−1)^n x^{2n+1}/(2n+1)!,  cos(x) = Σ_{n=0}^∞ (−1)^n x^{2n}/(2n)!.

4.1 Using Fourier Basis to Interpolate Polynomials

[CKPS16] proved an interpolation result which uses the complex Fourier basis (e^{2πi f_j t}) to fit a complex polynomial. However, in our application the target polynomial is over the real domain, i.e., P : R → R. Thus, we only use the real part of the Fourier basis. We extend the proof technique from previous work to this new setting and obtain the following result.

Lemma 4.2.

For any d-degree polynomial P(t), any T > 0 and any ε > 0, there always exist frequencies (which depend on d, T and ε) and a linear combination of cosines at those frequencies, with suitable coefficients, that approximates P(t) within ε for all t ∈ [0, T].

Proof.

First, we define as follows

Using Claim B.2, we can rewrite

where

It suffices to show and . We first show ,

Claim 4.3.

For any fixed and any fixed coefficients , there exists coefficients and such that, for all , .

Proof.

Recalling the definitions above, the problem becomes a regression problem. For any fixed frequencies and coefficients, we need to solve a linear system with as many unknown variables as constraints:

Further, we have and

Let denote the Vandermonde matrix where . Using Lemma 4.1, we know , then there must exist a solution to .

Let denote the Vandermonde matrix where , . Using Lemma 4.1, we know , then there must exist a solution to . ∎

We can also prove the bound on the remaining term (we defer the proof to Claim B.1 in Appendix B). Thus, combining Claim B.1 with Claim 4.3 completes the proof. ∎

4.2 Exponential Functions Have Limited Expressive Power

Given coefficients c_1, …, c_k and decay parameters α_1, …, α_k, we define the function g(t) = Σ_{j=1}^k c_j e^{−α_j t}. We provide an explicit counterexample, a polynomial of constant degree, and use it to show the following result; we defer the proof to Appendix B.

Theorem 4.4.

There is a polynomial P(t) of constant degree such that, for any k, any coefficients c_1, …, c_k and any decay parameters α_1, …, α_k, the function Σ_{j=1}^k c_j e^{−α_j t} fails to approximate P(t) to within the stated error.

5 Vanishing and Exploding Gradients

In this section, we analyze the vanishing/exploding gradient issue in various recurrent architectures. Specifically, we give lower and upper bounds on the gradient magnitude in the linear setting and show that the gradient of FRU does not explode or vanish with the temporal dimension T. We first analyze the RNN and SRU models as baselines and show that their gradients vanish/explode exponentially with T.

Gradient of linear RNN.

For a linear RNN, we have:

  h^(t) = W h^(t−1) + U x^(t),

where σ is the identity. Thus

  ∂h^(t)/∂h^(t−1) = W.

Let L denote the loss function. By the chain rule, we have the lower bound

  ‖∂L/∂h^(0)‖ ≥ σ_min(W)^T · ‖∂L/∂h^(T)‖.

Similarly, for the upper bound:

  ‖∂L/∂h^(0)‖ ≤ σ_max(W)^T · ‖∂L/∂h^(T)‖.

Gradient of linear SRU.

For linear SRU, we have:

Denoting and , we have

Claim 5.1.

Let . Then using SRU update rule, we have .

We provide the proof in Appendix C.

With , by Chain rule, we have the lower bound:

And similarly for the upper bound:

These bounds for RNN and SRU are tight, i.e., achievable by simple examples. It is easy to see that SRUs have better gradient bounds than RNNs. However, SRUs are only better by a constant factor in the exponent, and the gradients of both methods can still explode or vanish exponentially with the temporal dimension T.
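The contrast between the exponential bounds above and a residual-style update can be seen numerically. In the sketch below, W and R are random stand-ins, not trained weights: the end-to-start Jacobian of a linear RNN is W multiplied by itself T times, while a per-step Jacobian of the form I + R/T compounds to roughly exp(R), which is bounded independently of T.

```python
import numpy as np

# Linear RNN: the T-step Jacobian is W**T, so its norm behaves like
# sigma_max(W)**T and vanishes (or explodes) exponentially in T.
# Residual update: per-step Jacobian I + R/T compounds to about exp(R),
# whose norm does not grow or shrink with T.
rng = np.random.default_rng(2)
T, d = 200, 6
W = 0.1 * rng.standard_normal((d, d))          # contractive: gradient vanishes
J_rnn = np.linalg.matrix_power(W, T)
R = rng.standard_normal((d, d))
J_res = np.linalg.matrix_power(np.eye(d) + R / T, T)
print(np.linalg.norm(J_rnn, 2), np.linalg.norm(J_res, 2))
```

The RNN Jacobian norm underflows toward zero at T = 200, while the residual Jacobian norm stays within a moderate, T-independent range, which is the behavior Theorem 5.2 establishes for linear FRU.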

Gradient of linear FRU.

By design, FRU avoids vanishing/exploding gradients through its residual learning structure. Specifically, the linear FRU has a bounded gradient that is independent of the temporal dimension T. This means that no matter how deep the network is in time, the gradient of the linear FRU never vanishes or explodes. We have the following theorem:

Theorem 5.2.

With the FRU update rule in (4) and σ being the identity, the gradient is bounded below and above by constants independent of the temporal dimension T.

Proof.

For linear FRU, we have:

Let and , we can rewrite in the following way,

Claim 5.3.

We provide the proof of Claim 5.3 in Appendix C. By Chain rule

We define two sets and as follows

Thus, we have

The first term can be easily lower bounded by and the question is how to lower bound the second term. Since , we can use the fact ,

where the last step follows by and . Therefore, we complete the proof of lower bound. Similarly, we can show the following upper bound

Claim 5.4.

We provide the proof in Appendix C. Combining the lower bound and upper bound together, we complete the proof. ∎

6 Experimental Results

Figure 3: Test MSE of different models on mix-sin synthetic data. FRU uses .

We implemented the Fourier recurrent unit in TensorFlow [AAB16] and used the standard implementations of BasicRNNCell and BasicLSTMCell for RNN and LSTM, respectively. We also used the released source code of SRU [OPS17] with its default configurations (dimensions of 60 and 200 for its internal statistics). We release our code on github (https://github.com/limbo018/FRU). For a fair comparison, we construct one layer of each of the above cells with 200 units in the experiments. Adam [KB14] is adopted as the optimization engine. We explore learning rates in {0.001, 0.005, 0.01, 0.05, 0.1} and learning rate decays in {0.8, 0.85, 0.9, 0.95, 0.99}. The best results are reported after a grid search for the best hyperparameters. For simplicity, we describe a FRU cell by its number of sampled sparse frequencies and the dimension for each frequency.

6.1 Synthetic Data

We design two synthetic datasets to test our model: mixtures of sinusoidal functions (mix-sin) and mixtures of polynomials (mix-poly). For the mix-sin dataset, we first construct components, each a combination of sinusoidal functions at different frequencies and phases (sampled at the beginning and fixed). Then, for each data point, we mix the components with randomly sampled weights. Similarly, each data point in the mix-poly dataset is a random mixture of fixed-degree polynomials, with coefficients sampled at the beginning and fixed. Alg. 1 and Alg. 2 explain these procedures in detail. Among the sequences, some are used for training and the rest for testing. We picked the sequence length to be 176, the number of components to be 5, and the degree to be 15 for mix-sin and for mix-poly. At each time step, models are asked to predict the sequence value at the next time step. This requires the model to learn the underlying functions and uncover the mixture rates from the beginning time steps. Thus we can measure a model's ability to express sinusoidal and polynomial functions as well as its long-term memory.
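The generator described above can be sketched as follows. This is a hedged reconstruction, not the paper's Alg. 1: the frequency and phase ranges and the Dirichlet mixing weights are our assumptions for illustration.

```python
import numpy as np

# Hedged sketch of the mix-sin generator: m fixed sinusoidal components
# (frequencies and phases sampled once) are mixed per sequence with
# random convex weights, matching the description in the text.
def make_mix_sin(n_seq, length=176, m=5, seed=0):
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.5, 5.0, size=m)          # fixed components
    phases = rng.uniform(0.0, 2 * np.pi, size=m)
    t = np.arange(length) / length
    comps = np.sin(2 * np.pi * freqs[:, None] * t + phases[:, None])
    mix = rng.dirichlet(np.ones(m), size=n_seq)    # per-sequence weights
    return mix @ comps                             # shape (n_seq, length)

data = make_mix_sin(4)
print(data.shape)  # (4, 176)
```

Because the components are fixed across sequences, a model can only predict well by learning the components and inferring each sequence's mixture weights from its first few steps.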

Figures 3 and 4 plot the testing mean square error (MSE) of different models on the mix-sin/mix-poly datasets. We use a learning rate of 0.001 and a learning rate decay of 0.9 for training. FRU achieves orders of magnitude smaller MSE than the other models on both datasets, while using about half the number of parameters of SRU. This indicates FRU's ability to easily express these component functions.

Figure 4: Test MSE of different models on mix-poly synthetic data with different maximum degrees of polynomial basis. FRU uses .

To explicitly demonstrate the gradient stability and the ability of different models to learn long-term dependencies, we analyze the partial gradients at different distances. Specifically, we plot the norm of the partial derivative of the prediction error with respect to the initial hidden state, where the error is measured between the label and the model prediction. The norms of the gradients for FRU are very stable across time steps. As training converges, the amplitudes of the gradient curves gradually decrease. However, the gradients for SRU decrease by orders of magnitude as the number of time steps increases, indicating that SRU is not able to capture long-term dependencies. The gradients for RNN/LSTM are even more unstable, and their vanishing issues are rather severe.

Figure 5: L1, L2, and L∞ norms of gradients for different models during training on the mix-poly (5, 5) dataset. We evaluate the gradients of the loss at each time step with respect to the initial state. Each point in a curve is averaged over the gradients at 20 consecutive time steps. We plot the curves at different epochs with colors from dark to light.

6.2 Pixel-MNIST Dataset

We then explore the performance of Fourier recurrent units in classifying the MNIST dataset. Each 28×28 image is flattened into a long sequence of length 784. The RNN models are asked to classify the data into 10 categories after being fed all pixels sequentially. The batch size is set to 256 and dropout [SHK14] is not used in this experiment. A softmax function is applied to the 10-dimensional output of the last layer of each model. For FRU, frequencies are uniformly sampled in log space from 0 to 784.
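The data pipeline just described amounts to a one-line reshape; the random array below stands in for a real MNIST image.

```python
import numpy as np

# Pixel-by-pixel MNIST: a 28x28 image is flattened row-major into a
# sequence of 784 scalar inputs, one per time step, so the classifier
# must carry information across hundreds of recurrent steps.
rng = np.random.default_rng(3)
image = rng.random((28, 28))
sequence = image.reshape(784, 1)    # (time steps, features per step)
print(sequence.shape)
```

This framing is what makes the task a long-term dependency benchmark: the label depends on pixels seen up to 783 steps earlier.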

Figure 6: Testing accuracy of RNN, LSTM, SRU, and FRU for the pixel-by-pixel MNIST dataset. FRU uses 60 frequencies with a dimension of 10 for each frequency.
Networks  Testing Accuracy  #Variables  Variable Ratio
RNN  10.39%  42K  0.26
LSTM  98.17%  164K  1.00
SRU  96.20%  275K  1.68
FRU (40 frequencies)  96.88%  107K  0.65
FRU (60 frequencies)  97.61%  159K  0.97
Table 1: Testing Accuracy of MNIST Dataset

Fig. 6 plots the testing accuracy of different models during training. RNN fails to converge and LSTM converges very slowly. The fastest convergence comes from FRU, which achieves over 97.5% accuracy within 10 epochs, while LSTM reaches 97% at around the 40th epoch. Table 1 shows the accuracy at the end of 100 epochs for RNN, LSTM, SRU, and different configurations of FRU. LSTM ends up with 98.17% testing accuracy and SRU achieves 96.20%. FRU configurations with 40 and 60 frequencies provide accuracy close to LSTM. The number and ratio of trainable parameters are also shown in the table. The number of variables for FRU is much smaller than that of SRU, and comparable to that of LSTM, while FRU achieves smoother training and high testing accuracy. We ascribe these benefits of FRU to the better expressive power and greater robustness to vanishing gradients provided by the Fourier representations.

6.3 Permuted MNIST Dataset

Figure 7: Testing accuracy of RNN, LSTM, SRU, and FRU for permuted pixel-by-pixel MNIST. FRU uses 60 frequencies with the dimension of 10 for each frequency.
RNN LSTM SRU FRU
87.46% 90.26% 92.21% 96.93%
Table 2: Testing Accuracy of Permuted MNIST Dataset

We now use the same models as in the previous section and test on the permuted MNIST dataset, which is generated from pixel-MNIST with a random but fixed permutation of the pixels. It has been reported that the permutation increases the difficulty of classification [ASB16]. The training curves are plotted in Fig. 7 and the converged accuracy is shown in Table 2. In this task, FRU achieves 4.72% higher accuracy than SRU, 6.67% higher accuracy than LSTM, and 9.47% higher accuracy than RNN. The training curve of FRU is smoother and converges much faster than the other models. The benefit of FRU over SRU is more significant on permuted MNIST than on the original pixel-by-pixel MNIST. This can be explained by the higher model complexity of permuted MNIST and the stronger expressive power of FRU.

6.4 IMDB Dataset

Figure 8: Testing accuracy of RNN, LSTM, SRU, and FRU for the IMDB dataset. FRU uses 5 frequencies with a dimension of 10 for each frequency; we also include an extreme case of FRU with only frequency 0.
Networks  Testing Accuracy  #Variables  Variable Ratio
RNN  50.53%  33K  0.25
LSTM  83.64%  132K  1.00
SRU  86.40%  226K  1.72
FRU (5 frequencies)  86.71%  12K  0.09
FRU (frequency 0 only)  86.44%  4K  0.03
Table 3: Testing Accuracy of IMDB Dataset

We further evaluate FRU and the other models on the IMDB movie review dataset (25K training and 25K testing sequences). We integrate FRU and SRU into TFLearn [D16], a high-level API for TensorFlow, and test them together with LSTM and RNN. The average sequence length of the dataset is around 284 and the maximum sequence length is over 2800; we truncate all sequences to a length of 300. All models use a single layer with 128 units, a batch size of 32, and a dropout keep rate of 80%. FRU uses 5 frequencies with a dimension of 10 for each frequency. Learning rates and decays are tuned separately for each model for best performance.

Fig. 8 plots the testing accuracy of different models during training and Table 3 gives the final testing accuracy. FRU achieves 0.31% higher accuracy than SRU and 3.07% higher accuracy than LSTM. RNN fails to converge even after a large number of training steps. We draw attention to the fact that, with only 5 frequencies, FRU achieves the highest accuracy with 10X fewer variables than LSTM and 19X fewer variables than SRU, indicating its exceptional expressive power. We further explore a special case of FRU with only frequency 0, which reduces to an RNN-like cell. It uses 8X fewer variables than RNN, yet converges much faster and achieves the second-highest accuracy.

Besides the experimental results above, Section D in Appendix provides more experiments on different configurations of FRU for MNIST dataset, detailed procedures to generate synthetic data, and study of gradient vanishing during training.

7 Conclusion

In this paper, we have proposed a simple recurrent architecture called the Fourier recurrent unit (FRU), which has a residual learning structure and exploits the expressive power of the Fourier basis. We proved the expressivity of the sparse Fourier basis and showed that FRU does not suffer from vanishing/exploding gradients in the linear case. Due to the global support of the Fourier basis, FRU is in principle able to capture dependencies of any length. We empirically demonstrated FRU's ability to fit mixed sinusoidal and polynomial curves, and FRU outperforms LSTM and SRU on the pixel MNIST dataset with fewer parameters. On language datasets, FRU also shows its superiority over other RNN architectures. Although we have so far limited our models to recurrent structures, it would be exciting to extend the Fourier idea to address gradient issues and expressive power in non-recurrent deep neural networks, e.g. MLPs/CNNs. It would also be interesting to see how other basis functions, such as polynomial bases, behave in similar architectures; for example, Chebyshev polynomials are an interesting case to try.

References

Appendix

Appendix A Preliminaries

In this section we prove equations (3) and (5) and include more background on the sparse Fourier transform.

Claim A.1.

With the SRU update rule in (2), for each t ∈ [T] we have (3):

Proof.

We have

where the first step follows from the definition of u^(t), the second step from the definition of the decaying matrix, and the last step by applying the update rule (Eq. (2)) recursively. ∎

Claim A.2.

With the FRU update rule in (4), we have (5):

Proof.