1 Introduction
Deep neural networks (DNNs) have shown remarkably better performance than classical models on a wide range of problems, including speech recognition, computer vision and natural language processing. Although DNNs have tremendous expressive power to fit very complex functions, training them by backpropagation can be difficult. Two main issues are vanishing and exploding gradients. These issues become particularly troublesome for recurrent neural networks (RNNs), since the weight matrix is identical at each layer and any small changes get amplified exponentially through the recurrent layers [BSF94]. Although exploding gradients can be partly mitigated by tricks like gradient clipping or normalization [PMB13], vanishing gradients are harder to deal with. If gradients vanish, little information is propagated back through backpropagation. This means that deep RNNs have great difficulty learning long-term dependencies.
Many models have been proposed to address the vanishing/exploding gradient issue for DNNs. For example, Long Short-Term Memory (LSTM) [HS97] tries to solve it by adding memory gates, while residual networks [HZRS16] add a shortcut to skip intermediate layers. Recently, the approach of directly obtaining a statistical summary of past layers has drawn attention, such as statistical recurrent units (SRU) [OPS17]. However, as we show later, they still suffer from vanishing gradients and have limited access to past layers.
In this paper, we present a novel recurrent architecture, the Fourier Recurrent Unit (FRU), which uses a Fourier basis to summarize the hidden statistics over past time steps. We show that this solves the vanishing gradient problem and gives us access to any past time step region. In more detail, we make the following contributions:

We propose a method to summarize hidden states through past time steps in a recurrent neural network with a Fourier basis (FRU). Thus any statistical summary of past hidden states can be approximated by a linear combination of summarized Fourier statistics.

Theoretically, we show the expressive power of the sparse Fourier basis and prove that FRU can solve the vanishing gradient problem by analyzing gradient norm bounds. Specifically, we show that in the linear setting, SRU only improves the gradient lower/upper bound of RNN by a constant factor in the exponent (i.e., both bounds remain exponential in the temporal dimension), while FRU bounds the gradient from below and above by constants independent of the temporal dimension.

We test FRU together with RNN, LSTM and SRU on both synthetic and real-world datasets, including pixel MNIST, permuted MNIST and the IMDB movie review dataset. FRU shows its superiority on all of these tasks while using fewer parameters than LSTM/SRU.
We now present the outline of this paper. In Section 2 we discuss related work, while in Section 3 we introduce the FRU architecture and explain the intuition regarding the statistical summary and residual learning. In Sections 4 and 5 we prove the expressive power of sparse Fourier basis and show that in the linear case FRUs have constant lower and upper bounds on gradient magnitude. Experimental results on benchmarking synthetic datasets as well as real datasets like pixel MNIST and language data are presented in Section 6. Finally, we present our conclusions and suggest several interesting directions in Section 7.
2 Related Work
Numerous studies have been conducted to address the vanishing and exploding gradient problems, such as the use of self-loops and gating units in the LSTM [HS97] and GRU [CVMBB14]. These models use trained gate units on inputs or memory states to keep the memory for a longer period of time, thus enabling them to capture longer-term dependencies than RNNs. However, it has also been argued that with a simple initialization trick, RNNs can perform better than LSTM on some benchmark tasks [LJH15]. Apart from these advanced frameworks, straightforward methods like gradient clipping [Mik12] and spectral regularization [PMB13] have also been proposed.
Residual networks [HZRS16] brought wide notice to the idea of giving MLPs and CNNs shortcuts that skip intermediate layers, allowing gradients to flow back and reach the first layer without being diminished. It is also claimed that this helps to preserve features that are already good. Although ResNet was originally developed for MLP and CNN architectures, many extensions to RNNs have shown improvement, such as the maximum entropy RNN (MERNN) [MDP11], highway LSTM [ZCY16] and Residual LSTM [KEKL17].
Another recently proposed method, the statistical recurrent unit (SRU) [OPS17], keeps moving averages of summary statistics through past time steps. Rather than using gated units to decide what should be memorized, at each layer SRU memory cells incorporate new information and forget old information at rates set by a decay factor. Thus, by linearly combining multiple memory cells with different decay factors, SRU can have a multiscale view of the past. However, the weight of the moving averages decays exponentially through time and goes to zero given enough time steps. This prevents SRU from accessing the hidden states from a few time steps ago, and allows gradients to vanish. Also, the expressive power of a basis of exponential functions is small, which limits the expressivity of the whole network.
The Fourier transform is a strong mathematical tool that has been successful in many applications. However, previous studies of Fourier expressive power have concentrated on the dense Fourier transform. Price and Song [PS15] proposed a way to define the sparse Fourier transform problem in the continuous setting and also provided an algorithm that requires a frequency gap. Building on that, [CKPS16] proposed a frequency-gap-free algorithm and gave a well-defined notion of the expressive power of the sparse Fourier transform. One of the key observations in the frequency-gap-free algorithm is that a low-degree polynomial behaves similarly to a Fourier-sparse signal. To understand the expressive power of the Fourier basis, we use the framework designed by [PS15] and the techniques from [PS15, CKPS16].
There have been attempts to combine the Fourier transform with RNNs: the Fourier RNN [KS97] uses a sinusoidal activation function in the RNN model; ForeNet [ZC00] notices the similarity between Fourier analysis of time series and RNN predictions and arrives at an RNN with a diagonal transition matrix. For CNNs, the FCNN [PWCZ17] replaces the sliding-window approach with the Fourier transform in the convolutional layer. Although some of these methods show improvement over existing ones, they have not fully exploited the expressive power of the Fourier transform or avoided the gradient vanishing/exploding issue. Motivated by the shortcomings of the above methods, we have developed a method that has a thorough view of the past hidden states, has strong expressive power and does not suffer from the gradient vanishing/exploding problem.
Notation. We use to denote . We provide several definitions related to matrix . Let denote the determinant of a square matrix , and denote the transpose of . Let denote the spectral norm of matrix , and let denote the square matrix multiplied by itself times. Let denote the th largest singular value of . For any function , we define to be . In addition to notation, for two functions , we use the shorthand (resp. ) to indicate that (resp. ) for an absolute constant . We use to mean for constants and . The Appendix provides the detailed proofs and additional experimental results for comparison.
3 Fourier Recurrent Unit
In this section, we first introduce our notation in the RNN framework and then describe our method, the Fourier Recurrent Unit (FRU), in detail. Given a hidden state vector from the previous time step , input , RNN computes the next hidden state and output as:
(1)
where is the activation, , and , is the time step and is the hidden state at step . In an RNN, the output at each step depends locally on and is only remotely linked with previous hidden states (through multiple weight matrices and activations). This gives rise to the idea of directly summarizing hidden states through time.
Statistical Recurrent Unit.
For each , [OPS17] propose SRU with the following update rules
(2)  
where . Given the decay factors , the decaying matrix is:
For each and , can be expressed as the summary statistics across previous time steps with the corresponding :
(3) 
However, it is easy to see from (3) that the weight on vanishes exponentially with , so the SRU cannot access hidden states from a few time steps ago. As we show later in Section 5, the statistical factor only improves the gradient lower bound by a constant factor in the exponent, and SRU still suffers from vanishing gradients. Also, the span of exponential functions has limited expressive power, and thus linear combinations of entries of also have limited expressive power.
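The exponential decay of the SRU weights can be checked numerically. The sketch below is only illustrative: it assumes the moving-average form u_t = alpha * u_{t-1} + (1 - alpha) * h_t with u_0 = 0 for a single scalar statistic, and the names `sru_summary`, `sru_weights`, `alpha` and `h` are hypothetical rather than taken from the released SRU code.

```python
import numpy as np

def sru_summary(h, alpha):
    """Run the assumed moving average u_t = alpha*u_{t-1} + (1-alpha)*h_t, u_0 = 0."""
    u = 0.0
    for h_t in h:
        u = alpha * u + (1.0 - alpha) * h_t
    return u

def sru_weights(t, alpha):
    """Weight placed on h_tau (tau = 1..t) after unrolling the recursion."""
    tau = np.arange(1, t + 1)
    return (1.0 - alpha) * alpha ** (t - tau)

rng = np.random.default_rng(0)
h = rng.normal(size=200)      # stand-in hidden-state sequence
alpha = 0.9                   # assumed decay factor
w = sru_weights(len(h), alpha)

# The recursion and the explicit weighted sum agree.
assert np.isclose(sru_summary(h, alpha), np.dot(w, h))

# Weight on the oldest versus the newest hidden state.
print(w[0], w[-1])
```

Under these assumptions the weight on the oldest state is many orders of magnitude smaller than the weight on the newest one, which is the behaviour the paragraph above describes.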
Fourier Recurrent Unit.
Recall that Fourier expansion indicates that a continuous function defined on can be expressed as:
where :
where . To utilize the strong expressive power of Fourier basis, we propose the Fourier recurrent unit model. Let denote a set of frequencies. For each , we have the following update rules
(4)  
where is the Cosine matrix containing square matrices:
and each is a diagonal matrix with cosine at distinct frequencies evaluated at time step :
where , and is the dimension for each frequency. For every , the entry has the expression:
(5) 
As seen from (5), due to the global support of trigonometric functions, we can directly link with hidden states at any time step. Furthermore, because of the expressive power of the Fourier basis, given enough frequencies, can express any summary statistic of previous hidden states. As we will prove in later sections, these features prevent FRU from vanishing/exploding gradients and give it much stronger expressive power than RNN and SRU.
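The contrast with an exponentially decaying summary can be sketched in a few lines. This is a minimal illustration under stated assumptions: a single frequency, a scalar hidden state, and an update of the form u_t = u_{t-1} + (1/T) * cos(2*pi*f*t/T + theta) * h_t with u_0 = 0. The function names and the exact form of the cosine argument are assumptions made for the sketch.

```python
import numpy as np

def fru_summary(h, freq, theta=0.0):
    """Accumulate u_t = u_{t-1} + cos(2*pi*freq*t/T + theta) * h_t / T, u_0 = 0."""
    T = len(h)
    u = 0.0
    for t, h_t in enumerate(h, start=1):
        u += np.cos(2.0 * np.pi * freq * t / T + theta) * h_t / T
    return u

def fru_weights(T, freq, theta=0.0):
    """Weight placed on h_t (t = 1..T) in the unrolled sum."""
    t = np.arange(1, T + 1)
    return np.cos(2.0 * np.pi * freq * t / T + theta) / T

rng = np.random.default_rng(0)
h = rng.normal(size=200)      # stand-in hidden-state sequence
w = fru_weights(len(h), freq=3.0)

# The recursion and the explicit weighted sum agree.
assert np.isclose(fru_summary(h, freq=3.0), np.dot(w, h))

# The weight magnitude on the oldest state is comparable to that on the newest.
print(abs(w[0]), abs(w[-1]))
```

Because the cosine has global support, the weight on an early hidden state does not shrink with its distance from the current step, although individual weights pass through zero at some time steps for a nonzero frequency.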
Connection with residual learning.
The Fourier recurrent update of can also be written as:
Thus the information flows from layer to layer along two paths. The second term, must pass through two layers of nonlinearity and several weight matrices and is scaled down by , while the first term, goes directly to with only an identity mapping. Thus FRU directly incorporates the idea of residual learning while limiting the magnitude of the residual term. This not only helps the information flow more smoothly along the temporal dimension, but also acts as a regularization that keeps the gradient between adjacent layers close to the identity:
Intuitively this solves the gradient exploding/vanishing issue. Later in Section 5, we give a formal proof and comparison with SRU/RNN.
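The near-identity behaviour can be written out explicitly under one assumption about the update, namely that it has the form $u^{(t)} = u^{(t-1)} + \frac{1}{T} C^{(t)} h^{(t)}$, where $h^{(t)}$ depends on $u^{(t-1)}$ through the two nonlinear layers and $C^{(t)}$ is diagonal with cosine entries. The symbols are chosen for illustration and are an assumption of this sketch.

```latex
\frac{\partial u^{(t)}}{\partial u^{(t-1)}}
  = I + \frac{1}{T}\, C^{(t)}\, \frac{\partial h^{(t)}}{\partial u^{(t-1)}},
\qquad
\left\| \frac{\partial u^{(t)}}{\partial u^{(t-1)}} - I \right\|
  \le \frac{1}{T} \left\| \frac{\partial h^{(t)}}{\partial u^{(t-1)}} \right\| .
```

The inequality uses only $\|C^{(t)}\| \le 1$, which holds because every diagonal entry of $C^{(t)}$ is a cosine value. The deviation from the identity therefore shrinks as the sequence length $T$ grows, provided the Jacobian of $h^{(t)}$ stays bounded.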
4 Fourier Basis
In this section we show that FRU has stronger expressive power than SRU by comparing the expressive power of a limited number of Fourier basis functions (a sparse Fourier basis) with that of exponential functions. On the one hand, we show that a sparse Fourier basis is able to approximate polynomials well. On the other hand, we prove that even infinitely many exponential functions cannot fit a constant-degree polynomial.
First, we state several basic facts which will be later used in the proof.
Lemma 4.1.
Given a square Vandermonde matrix where , then
Also recall the Taylor expansion of and is
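Assuming Lemma 4.1 refers to the classical Vandermonde determinant identity, det(V) = prod over i < j of (alpha_j - alpha_i), a quick numerical check looks as follows. The sample nodes `alphas` are arbitrary and chosen only for illustration.

```python
import itertools
import numpy as np

def vandermonde_det_formula(alphas):
    """Product of (alpha_j - alpha_i) over all index pairs i < j."""
    return np.prod([alphas[j] - alphas[i]
                    for i, j in itertools.combinations(range(len(alphas)), 2)])

alphas = np.array([0.5, 1.0, 2.0, 3.5])   # distinct nodes (illustrative values)

# Row i is (1, alpha_i, alpha_i^2, alpha_i^3); the transpose has the same determinant.
V = np.vander(alphas, increasing=True)

det_numeric = np.linalg.det(V)
det_formula = vandermonde_det_formula(alphas)
assert np.isclose(det_numeric, det_formula)
```

Since the nodes are distinct, every factor in the product is nonzero, so the matrix is invertible. This is the property used later to argue that the linear systems in the interpolation proof have a solution.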
4.1 Using Fourier Basis to Interpolate Polynomials
[CKPS16] proved an interpolating result which uses Fourier basis ( , ) to fit a complex polynomial (). However, in our application, the target polynomial is over the real domain, i.e. . Thus, we only use the real part of the Fourier basis. We extend the proof technique from previous work to our new setting, and obtain the following result.
Lemma 4.2.
For any degree polynomial , any and any , there always exists frequency (which depends on and ) and with coefficients such that
Proof.
First, we define as follows
Claim 4.3.
For any fixed and any fixed coefficients , there exist coefficients and such that, for all , .
Proof.
Recalling the definitions of and , the problem becomes an regression problem. To guarantee , for any fixed and coefficients , we need to solve a linear system with unknown variables and constraints: ,
Further, we have and
Let denote the Vandermonde matrix where . Using Lemma 4.1, we know , then there must exist a solution to .
Let denote the Vandermonde matrix where , . Using Lemma 4.1, we know , then there must exist a solution to . ∎
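The idea that a few low-frequency trigonometric functions can mimic a low-degree polynomial on a bounded interval can be illustrated numerically. The sketch below is not the construction used in the proof: it assumes a basis made of a constant plus cosines and sines at the first two harmonics of a small base frequency, and it fits the coefficients by ordinary least squares. The function name `trig_design` and all numeric choices are hypothetical.

```python
import numpy as np

def trig_design(t, base_freq, num_harmonics):
    """Columns: 1, cos(2*pi*f*k*t), sin(2*pi*f*k*t) for k = 1..num_harmonics."""
    cols = [np.ones_like(t)]
    for k in range(1, num_harmonics + 1):
        cols.append(np.cos(2.0 * np.pi * base_freq * k * t))
        cols.append(np.sin(2.0 * np.pi * base_freq * k * t))
    return np.stack(cols, axis=1)

t = np.linspace(0.0, 1.0, 200)
target = 1.0 - 2.0 * t + 3.0 * t ** 2           # a degree-2 polynomial on [0, 1]

# A small base frequency makes the trigonometric basis behave like low-degree
# polynomials on the interval (compare the Taylor expansions of sin and cos).
A = trig_design(t, base_freq=0.01, num_harmonics=2)
coef, *_ = np.linalg.lstsq(A, target, rcond=None)

max_err = np.max(np.abs(A @ coef - target))
print(max_err)
```

The fitted coefficients are large because the basis functions are nearly linearly dependent at such a low frequency, which mirrors the role the Vandermonde determinant plays in the argument above.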
4.2 Exponential Functions Have Limited Expressive Power
Given coefficients and decay parameters , we define the function . We provide an explicit counterexample, which is a degree polynomial. Using that example, we show the following result; the proof is deferred to Appendix B.
Theorem 4.4.
There is a polynomial with degree such that, for any , for any , for any coefficients and decay parameters such that
5 Vanishing and Exploding Gradients
In this section, we analyze the vanishing/exploding gradient issue in various recurrent architectures. Specifically we give lower and upper bounds of gradient magnitude under the linear setting and show that the gradient of FRU does not explode or vanish with temporal dimension . We first analyze RNN and SRU models as a baseline and show their gradients vanish/explode exponentially with .
Gradient of linear RNN.
For linear RNN, we have:
where . Thus
Let denote the loss function. By the chain rule, we have
Similarly for the upper bound:
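The exponential dependence of these bounds on the number of steps can be illustrated for a linear recurrence of the form h_{t+1} = U h_t + (input terms). In the sketch below the matrix `U`, the dimension `n`, the number of steps, and the vector `grad_T` (a stand-in for the loss gradient at the last step) are all hypothetical. The check uses only the standard fact that multiplying by a matrix scales a vector norm by a factor between the smallest and largest singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 8, 30
U = rng.normal(size=(n, n)) / np.sqrt(n)       # assumed recurrent weight matrix
sigma = np.linalg.svd(U, compute_uv=False)     # singular values, descending
grad_T = rng.normal(size=n)                    # stand-in for dL/dh at the last step

# Backpropagating through `steps` linear layers multiplies by U^T each time.
grad_0 = np.linalg.matrix_power(U.T, steps) @ grad_T

lower = sigma[-1] ** steps * np.linalg.norm(grad_T)
upper = sigma[0] ** steps * np.linalg.norm(grad_T)
assert lower <= np.linalg.norm(grad_0) <= upper
print(lower, np.linalg.norm(grad_0), upper)
```

Both bounds are powers of a singular value with the number of steps in the exponent, so they shrink or grow exponentially unless the singular values are exactly one.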
Gradient of linear SRU.
For linear SRU, we have:
Denoting and , we have
Claim 5.1.
Let . Then using SRU update rule, we have .
We provide the proof in Appendix C.
With , by Chain rule, we have the lower bound:
And similarly for the upper bound:
These bounds for RNN and SRU are achievable; a simple example would be . It is easy to see that with , SRUs have better gradient bounds than RNNs. However, SRUs are only better by a constant factor in the exponent, and gradients for both methods could still explode or vanish exponentially with the temporal dimension .
Gradient of linear FRU.
By design, FRU avoids vanishing/exploding gradient by its residual learning structure. Specifically, the linear FRU has bounded gradient which is independent of the temporal dimension . This means no matter how deep the network is, gradient of linear FRU would never vanish or explode. We have the following theorem:
Theorem 5.2.
With the FRU update rule in (4), and being identity, we have: for any .
Proof.
For linear FRU, we have:
Let and , we can rewrite in the following way,
Claim 5.3.
We provide the proof of Claim 5.3 in Appendix C. By Chain rule
We define two sets and as follows
Thus, we have
The first term can be easily lower bounded by and the question is how to lower bound the second term. Since , we can use the fact ,
where the last step follows by and . Therefore, we complete the proof of lower bound. Similarly, we can show the following upper bound
Claim 5.4.
We provide the proof in Appendix C. Combining the lower bound and upper bound together, we complete the proof. ∎
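A numerical illustration of why a residual update with a 1/T scaling keeps gradients bounded: the sketch assumes that, in the linear case, the step-to-step Jacobian has the form I + C_t W / T, where C_t is a diagonal matrix of cosines and W collects the weight matrices. The bounds checked here, exp(-2 s) and exp(s) with s the spectral norm of W, follow from the elementary inequalities 1 + x <= exp(x) and 1 - x >= exp(-2x) for 0 <= x <= 1/2. They are assumptions of this sketch and not necessarily the constants of Theorem 5.2.

```python
import numpy as np

def fru_jacobian_product(W, freqs, T):
    """Product over t = 1..T of (I + C_t W / T), C_t = diag(cos(2*pi*f_i*t/T))."""
    n = W.shape[0]
    J = np.eye(n)
    for t in range(1, T + 1):
        C_t = np.diag(np.cos(2.0 * np.pi * freqs * t / T))
        J = (np.eye(n) + C_t @ W / T) @ J
    return J

rng = np.random.default_rng(0)
n = 6
W = rng.normal(size=(n, n))
W = 2.0 * W / np.linalg.norm(W, 2)        # rescale so the spectral norm is 2
freqs = np.arange(n, dtype=float)         # one assumed frequency per coordinate
s_max = np.linalg.norm(W, 2)

extremes = {}
for T in (10, 100, 1000):
    sv = np.linalg.svd(fru_jacobian_product(W, freqs, T), compute_uv=False)
    extremes[T] = (sv[-1], sv[0])         # smallest and largest singular value
    assert np.exp(-2.0 * s_max) <= sv[-1] and sv[0] <= np.exp(s_max)
print(extremes)
```

The smallest and largest singular values of the accumulated Jacobian stay inside the same fixed interval for every sequence length tried, in contrast with the bounds for the linear RNN, which scale exponentially with the number of steps.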
6 Experimental Results
We implemented the Fourier recurrent unit in TensorFlow [AAB16] and used the standard implementation of BasicRNNCell and BasicLSTMCell for RNN and LSTM, respectively. We also used the released source code of SRU [OPS17] and used the default configurations of , dimension of 60, and dimension of 200. We release our code on GitHub (https://github.com/limbo018/FRU). For fair comparison, we construct one layer of the above cells with 200 units in the experiments. Adam [KB14] is adopted as the optimization engine. We explore learning rates in {0.001, 0.005, 0.01, 0.05, 0.1} and learning rate decay in {0.8, 0.85, 0.9, 0.95, 0.99}. The best results are reported after a grid search over these hyperparameters. For simplicity, we use to denote sampled sparse frequencies and dimensions for each frequency in an FRU cell.
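The grid described above has 5 x 5 = 25 configurations. A minimal sketch of such a search is given below; `grid_search` and `toy_loss` are hypothetical names, and `toy_loss` is only a stand-in for a full training run that would return a validation loss.

```python
import itertools

learning_rates = [0.001, 0.005, 0.01, 0.05, 0.1]
decays = [0.8, 0.85, 0.9, 0.95, 0.99]
grid = list(itertools.product(learning_rates, decays))

def grid_search(evaluate):
    """Return the (learning_rate, decay) pair with the lowest score."""
    return min(grid, key=lambda cfg: evaluate(*cfg))

def toy_loss(lr, decay):
    """Hypothetical stand-in for training a model and measuring its loss."""
    return (lr - 0.01) ** 2 + (decay - 0.9) ** 2

best = grid_search(toy_loss)
print(len(grid), best)
```

In practice `toy_loss` would be replaced by a function that trains one model per configuration, so the cost of the search is 25 training runs per model type.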
6.1 Synthetic Data
We design two synthetic datasets to test our model: a mixture of sinusoidal functions (mixsin) and a mixture of polynomials (mixpoly). For the mixsin dataset, we first construct components, each being a combination of sinusoidal functions at different frequencies and phases (sampled at the beginning). Then, for each data point, we mix the components with randomly sampled weights. Similarly, each data point in the mixpoly dataset is a random mixture of fixed degree polynomials, with coefficients sampled at the beginning and then fixed. Alg. 1 and Alg. 2 explain these procedures in detail. Among the sequences, are used for training and are used for testing. We picked the sequence length to be 176, the number of components to be 5, and the degree to be 15 for mixsin and for mixpoly. At each time step , models are asked to predict the sequence value at time step . This requires the model to learn the underlying functions and uncover the mixture rates from the beginning time steps. Thus we can measure the model's ability to express sinusoidal and polynomial functions as well as its long-term memory.
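The mixsin construction can be sketched as follows. This is an illustrative reading of the description above, not a reproduction of Alg. 1: the sampling distributions, the frequency range, the normalisation of the mixing weights, the interpretation of "degree" as the number of sinusoids per component, and the one-step-ahead prediction target are all assumptions, and `make_mixsin` is a hypothetical name.

```python
import numpy as np

def make_mixsin(num_sequences, seq_len=176, num_components=5, degree=15, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, seq_len)
    # Fixed components: each is a sum of `degree` sinusoids whose amplitude,
    # frequency and phase are sampled once at the beginning.
    amp = rng.normal(size=(num_components, degree))
    freq = rng.uniform(0.5, 5.0, size=(num_components, degree))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(num_components, degree))
    components = np.sum(
        amp[..., None] * np.sin(2.0 * np.pi * freq[..., None] * t + phase[..., None]),
        axis=1,
    )                                               # (num_components, seq_len)
    # Each sequence mixes the fixed components with its own random weights.
    weights = rng.uniform(size=(num_sequences, num_components))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ components                     # (num_sequences, seq_len)

data = make_mixsin(num_sequences=1000)
# Assumed task: predict the value at the next time step from the values so far.
inputs, targets = data[:, :-1], data[:, 1:]
print(data.shape, inputs.shape, targets.shape)
```

Because the components are shared and only the mixing weights change per sequence, a model has to infer those weights from the first few time steps and then carry them through the rest of the sequence.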
Figures 3 and 4 plot the testing mean square error (MSE) of different models on the mixsin/mixpoly datasets. We use a learning rate of 0.001 and a learning rate decay of 0.9 for training. FRU achieves orders of magnitude smaller MSE than other models on the mixsin and mixpoly datasets, while using about half the number of parameters of SRU. This indicates FRU's ability to easily express these component functions.
To explicitly demonstrate the gradient stability of different models and their ability to learn long-term dependencies, we analyzed the partial gradient at different distances. Specifically, we plot the partial derivative norm of the error on digit w.r.t. the initial hidden state, i.e. where is the label and is the model prediction. The norms of gradients for FRU are very stable from to . As training converges, the amplitudes of the gradient curves gradually decrease. However, the gradients for SRU decrease by orders of magnitude as the number of time steps increases, indicating that SRU is not able to capture long-term dependencies. The gradients for RNN/LSTM are even more unstable and the vanishing issues are rather severe.
6.2 PixelMNIST Dataset
We then explore the performance of Fourier recurrent units in classifying the MNIST dataset. Each 28 × 28 image is flattened to a long sequence with a length of 784. The RNN models are asked to classify the data into 10 categories after being fed all pixels sequentially. Batch size is set to 256 and dropout [SHK14] is not included in this experiment. A softmax function is applied to the 10-dimensional output at the last layer of each model. For FRU, frequencies are uniformly sampled in log space from 0 to 784.
Table 1: Testing accuracy on pixel MNIST after 100 epochs.
Networks  Testing accuracy  #Variables  Variable ratio (relative to LSTM)
RNN  10.39%  42K  0.26
LSTM  98.17%  164K  1.00
SRU  96.20%  275K  1.68
FRU (40 frequencies)  96.88%  107K  0.65
FRU (60 frequencies)  97.61%  159K  0.97
Fig. 6 plots the testing accuracy of different models during training. RNN fails to converge and LSTM converges very slowly. The fastest convergence comes from FRU, which achieves over 97.5% accuracy in 10 epochs, while LSTM reaches 97% at around the 40th epoch. Table 1 shows the accuracy at the end of 100 epochs for RNN, LSTM, SRU, and different configurations of FRU. LSTM ends up with 98.17% testing accuracy and SRU achieves 96.20%. Different configurations of FRU with 40 and 60 frequencies provide accuracy close to that of LSTM. The number and ratio of trainable parameters are also given in the table. The number of variables for FRU is much smaller than that of SRU, and comparable to that of LSTM, while FRU achieves smoother training and high testing accuracy. We ascribe these benefits of FRU to the better expressive power and greater robustness to gradient vanishing provided by the Fourier representations.
6.3 Permuted MNIST Dataset
RNN  LSTM  SRU  FRU 
87.46%  90.26%  92.21%  96.93% 
We now use the same models as in the previous section and test them on the permuted MNIST dataset. The permuted MNIST dataset is generated from the pixel MNIST dataset by applying a random but fixed permutation to its pixels. It is reported that the permutation increases the difficulty of classification [ASB16]. The training curve is plotted in Fig. 7 and the converged accuracy is shown in Table 2. In this task, FRU achieves 4.72% higher accuracy than SRU, 6.67% higher accuracy than LSTM, and 9.47% higher accuracy than RNN. The training curve of FRU is smoother and converges much faster than those of the other models. The benefit of FRU over SRU is more significant on permuted MNIST than on the original pixel-by-pixel MNIST. This can be explained by the higher model complexity of permuted MNIST and the stronger expressive power of FRU.
6.4 IMDB Dataset
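Table 3: Testing accuracy on the IMDB dataset.
Networks  Testing accuracy  #Variables  Variable ratio (relative to LSTM)
RNN  50.53%  33K  0.25
LSTM  83.64%  132K  1.00
SRU  86.40%  226K  1.72
FRU (5 frequencies)  86.71%  12K  0.09
FRU (frequency 0 only)  86.44%  4K  0.03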
We further evaluate FRU and other models on the IMDB movie review dataset (25K training and 25K testing sequences). We integrate FRU and SRU into TFLearn [D16], a high-level API for TensorFlow, and test them together with LSTM and RNN. The average sequence length of the dataset is around 284 and the maximum sequence length exceeds 2800. We truncate all sequences to a length of 300. All models use a single layer with 128 units, a batch size of 32, and a dropout keep rate of 80%. FRU uses 5 frequencies, with the dimension for each frequency set to 10. Learning rates and decays are tuned separately for each model for best performance.
Fig. 8 plots the testing accuracy of different models during training and Table 3 gives the final testing accuracy. FRU with 5 frequencies achieves 0.31% higher accuracy than SRU and 3.07% higher accuracy than LSTM. RNN fails to converge even after a large number of training steps. We draw attention to the fact that with 5 frequencies, FRU achieves the highest accuracy with 10X fewer variables than LSTM and 19X fewer variables than SRU, indicating its exceptional expressive power. We further explore a special case of FRU with only frequency 0, which reduces to an RNN-like cell. It uses 8X fewer variables than RNN, but converges much faster and achieves the second highest accuracy.
Besides the experimental results above, Section D in Appendix provides more experiments on different configurations of FRU for MNIST dataset, detailed procedures to generate synthetic data, and study of gradient vanishing during training.
7 Conclusion
In this paper, we have proposed a simple recurrent architecture called the Fourier recurrent unit (FRU), which has the structure of residual learning and exploits the expressive power of the Fourier basis. We gave a proof of the expressivity of the sparse Fourier basis and showed that FRU does not suffer from vanishing/exploding gradients in the linear case. Ideally, due to the global support of the Fourier basis, FRU is able to capture dependencies of any length. We empirically showed FRU's ability to fit mixed sinusoidal and polynomial curves, and FRU outperforms LSTM and SRU on the pixel MNIST dataset with fewer parameters. On language modeling datasets, FRU also shows its superiority over other RNN architectures. Although we currently limit our models to recurrent structures, it would be very exciting to extend the Fourier idea to address gradient issues and expressive power in non-recurrent deep neural networks, e.g. MLP/CNN. It would also be interesting to see how other basis functions, such as a polynomial basis, behave on similar architectures. For example, Chebyshev polynomials are one interesting case to try.
References
[AAB16] Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.
[ASB16] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
[BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[CKPS16] Xue Chen, Daniel M Kane, Eric Price, and Zhao Song. Fourier-sparse interpolation without a frequency gap. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 741–750. IEEE, https://arxiv.org/pdf/1609.01361.pdf, 2016.
[CVMBB14] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[D16] Aymeric Damien et al. TFLearn. https://github.com/tflearn/tflearn, 2016.
[HIKP12a] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimal sparse Fourier transform. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 563–578. ACM, 2012.
[HIKP12b] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Simple and practical algorithm for sparse Fourier transform. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1183–1194. SIAM, 2012.
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[IK14] Piotr Indyk and Michael Kapralov. Sample-optimal Fourier sampling in any constant dimension. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 514–523. IEEE, 2014.
[IKP14] Piotr Indyk, Michael Kapralov, and Eric Price. (Nearly) Sample-optimal sparse Fourier transform. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 480–499. SIAM, 2014.
[KB14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[KEKL17] Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360, 2017.
[KS97] Renée Koplon and Eduardo D Sontag. Using Fourier-neural recurrent networks to fit sequential input/output data. Neurocomputing, 15(3-4):225–248, 1997.
[LJH15] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[MDP11] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černockỳ. Strategies for training large scale neural network language models. In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, pages 196–201. IEEE, 2011.
[Mik12] Tomáš Mikolov. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April, 2012.
[Moi15] Ankur Moitra. Super-resolution, extremal functions and the condition number of Vandermonde matrices. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (STOC), pages 821–830. ACM, 2015.
[OPS17] Junier B Oliva, Barnabás Póczos, and Jeff Schneider. The statistical recurrent unit. In International Conference on Machine Learning, pages 2671–2680, 2017.
[PMB13] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
[Pri13] Eric C Price. Sparse recovery and Fourier sampling. PhD thesis, Massachusetts Institute of Technology, 2013.
[PS15] Eric Price and Zhao Song. A robust sparse Fourier transform in the continuous setting. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 583–600. IEEE, https://arxiv.org/pdf/1609.00896.pdf, 2015.
[PWCZ17] Harry Pratt, Bryan Williams, Frans Coenen, and Yalin Zheng. FCNN: Fourier convolutional neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 786–798. Springer, 2017.
[She08] Yuri Shestopaloff. Properties of sums of some elementary functions and modeling of transitional and other processes. https://arxiv.org/pdf/0811.1213.pdf, 2008.
[She10] Yuri K Shestopaloff. Sums of exponential functions and their new fundamental properties. http://www.akvypress.com/sumsexpbook/SumsOfExponentialFunctionsChapters2_3.pdf, 2010.
[SHK14] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[ZC00] Ying-Qian Zhang and Lai-Wan Chan. ForeNet: Fourier recurrent networks for time series prediction. In Citeseer, 2000.
[ZCY16] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory RNNs for distant speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5755–5759. IEEE, 2016.
Appendix
Appendix A Preliminaries
In this section we prove equations (3) and (5) and include more background on the sparse Fourier transform.
Claim A.1.
With the SRU update rule in (2), for we have:
Proof.
We have
where the first step follows by the definition of , the second step follows by the definition of , and the last step follows by applying the update rule (Eq. (2)) recursively. ∎
Claim A.2.
With the FRU update rule in (4), we have: