1 Introduction
Recurrent neural networks are increasingly popular models for sequential data. The simple recurrent neural network (RNN) architecture (Elman, 1990) is, however, not suitable for capturing longer-distance dependencies. Architectures that address this shortcoming include the Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber 1997a), the Gated Recurrent Unit (GRU, Chung et al. 2014, 2015), and the structurally constrained recurrent network (SCRN, Mikolov et al. 2014). While these can capture some longer-term patterns (20 to 50 words), their structural complexity makes it difficult to understand what is going on inside. One exception is the SCRN architecture, which is by design simple to understand. It shows that the memory acquired by complex LSTM models on language tasks correlates strongly with simple weighted bags-of-words. This demystifies the abilities of the LSTM model to a degree: while some authors have suggested that the LSTM understands the language and even the thoughts being expressed in sentences (Choudhury, 2015), it is arguable whether this could be said about a model that performs equally well and is based on representations that are essentially equivalent to a bag of words.
One property of recurrent architectures that allows for the formation of longer-term memory is the self-connectedness of the basic units: this is most explicitly shown in the SCRN model, where one hidden layer contains neurons that do not have other recurrent connections except to themselves. Still, this architecture has several drawbacks: one has to choose the size of the fully connected and self-connected recurrent layers, and the model is not capable of modeling nonlinearities in the longer-term memory component.
In this work, we aim to increase representational efficiency, i.e., the ratio of performance to acquired parameters. We simplify the model architecture further and develop several variants under the Differential State Framework, where the hidden-layer state of the next time step is a function of its current state and the delta change computed by the model. We do not present the Differential State Framework as a model of human memory for language. However, we point out its conceptual origins in Surprisal Theory (Boston et al., 2008; Hale, 2001; Levy, 2008), which posits that the human language processor develops complex expectations of future words, phrases, and syntactic choices, and that these expectations and deviations from them (surprisal) guide language processing, e.g., in reading comprehension. How complex the models are (in the human language processor) that form the expectation is an open question. The cognitive literature has approached this with existing parsing algorithms, probabilistic context-free grammars, or n-gram language models. We take a connectionist perspective. The Differential State Framework proposes to not just generatively develop expectations and compare them with actual state changes caused by observing new input; it explicitly maintains gates as a form of high-level error correction and interpolation. An instantiation, the DeltaRNN, will be evaluated as a language model, and we will not attempt to simulate human performance in situations such as garden-path sentences that need to be reanalyzed because of a costly initial misanalysis.
2 The Differential State Framework and the DeltaRNN
In this section, we will describe the proposed Differential State Framework (DSF) as well as several concrete implementations one can derive from it.
2.1 General Framework
The most general formulation of the architectures that fall under DSF distinguishes two forms of the hidden state. The first is a fast state, which is generally a function of the data at the current time step and a filtration (or summary function of past states). The second is a slow, data-independent state. This concept can be specifically viewed as a composition of two general functions, formally defined as follows:
(1)  h_t = f_o(f_i(x_t, M_{t-1}; Θ_i), M_{t-1}; Θ_o)

where Θ = {Θ_i, Θ_o} are the parameters of the state-machine and M_{t-1} is the previous latent information the model is conditioned on. In the case of most gated architectures, M_{t-1} = h_{t-1}, but in some others, as in the SCRN or the LSTM, M_{t-1} = {h_{t-1}, c_{t-1}} (where c_{t-1} refers to the “cell-state” as in Hochreiter and Schmidhuber, 1997b), or M_{t-1} could even include information such as decoupled memory, and in general will be updated as symbols are iteratively processed. We define
f_i(·) to be any, possibly complicated, function that maps the previous hidden state and the currently encountered data point (e.g., a word, subword, or character token) to a real-valued vector of fixed dimensions using parameters Θ_i. f_o(·), on the other hand, is defined to be the outer function that uses parameters Θ_o to integrate the fast state, as calculated by f_i, and the slowly-moving, currently untransformed state h_{t-1}. In the subsections that follow, we will describe simple formulations of these two core functions and, later in Section 3, we will show how currently popular architectures, like the LSTM and various simplifications of it, are instantiations of this framework. The specific structure of Equation 1 was chosen to highlight our hypothesis that the success of gated neural architectures stems largely from treating next-step prediction tasks, like language modeling, as an interaction between two functions. One inner function focuses on integrating observed samples with a current filtration to create a new data-dependent hidden representation (or state “proposal”), while an outer function focuses on computing the difference, or “delta,” between the impression of the subsequence observed so far (i.e., h_{t-1}) and the newly formed impression. For example, as a sentence is iteratively processed, there might not be much new information (or “surprisal”) in a token's mapped hidden representation (especially if it is a frequently encountered token), thus requiring less change to the iteratively inferred global representation of the sentence. (One way to extract a “sentence representation” from a temporal neural language model would be to simply take the last hidden state calculated upon reaching a symbol such as punctuation, e.g., a period or exclamation point. This is sometimes referred to as encoding variable-length sentences or paragraphs into a real-valued vector of fixed dimensionality.) However, encountering a new or rare token (especially an unexpected one) might bias the outer function to allow the newly formed hidden impression to more strongly influence the overall impression of the sentence, which will be useful when predicting what token/symbol will come next. In Section 5, we will present a small demonstration using one of the trained word models to illustrate the intuition just described. In the subsections to follow, we will describe the ways we chose to formulate f_i and f_o in the experiments of our paper. The process we followed for developing the concrete implementations of f_i and f_o involved starting from the simplest possible form, using the fewest (if any) possible parameters to compose each function, and testing it in preliminary experiments to verify its usefulness.
It is important to note that Equation 1 is still general enough to allow for the future design of more clever or efficient functions that might improve the performance and long-term memory capabilities of the framework. More importantly, one might view the parameters that f_i uses as possibly encapsulating structures that can be used to store explicit memory vectors, as is the case in stack-based RNNs (Das et al., 1992; Joulin and Mikolov, 2015) or linked-list-based RNNs (Joulin and Mikolov, 2015).
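The two-function decomposition of Equation 1 can be sketched in a few lines of code. This is a minimal illustration only, assuming simple vector states with M_{t-1} = h_{t-1}; the function and variable names (f_inner, f_outer, dsf_step) are ours, and the particular inner/outer choices here are just one instantiation.

```python
import numpy as np

def f_inner(x, h_prev, W, V):
    # Fast state: a data-dependent proposal computed from the current
    # input and the previous hidden state (an Elman-style inner function).
    return np.tanh(W @ x + V @ h_prev)

def f_outer(d, h_prev, r):
    # Outer function: interpolate the proposal with the untransformed
    # previous state, leaving identity paths for gradient flow.
    return (1.0 - r) * d + r * h_prev

def dsf_step(x, h_prev, W, V, r):
    # Equation 1: the next state composes the outer and inner functions.
    return f_outer(f_inner(x, h_prev, W, V), h_prev, r)

rng = np.random.default_rng(0)
H, D = 4, 3
W = 0.1 * rng.standard_normal((H, D))
V = 0.1 * rng.standard_normal((H, H))
h = np.zeros(H)
for x in rng.standard_normal((5, D)):  # process a short input sequence
    h = dsf_step(x, h, W, V, r=np.full(H, 0.5))
```

Because the outer function mixes a bounded proposal with the previous state, the iterated state stays bounded regardless of the input sequence.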
2.2 Forms of the Outer Function
Keeping f_i as general as possible, here we will describe several ways one could design f_o, the function meant to decide how new and old hidden representations will be combined at each time step. We will strive to introduce as few additional parameters as necessary, and experimental results will confirm the effectiveness of our simple designs.
One form that f_o could take is a simple weighted summation, as follows:

(2)  h_t = Φ(β_f ⊗ d_t + β_s ⊗ h_{t-1}),  with  d_t = f_i(x_t, h_{t-1}),

where Φ is an element-wise activation applied to the final summation and β_f and β_s are bias vectors meant to weight the fast and slow states respectively. In Equation 2, if β_f = β_s = 1, no additional parameters have been introduced, making the outer function simply a rigid summation operator followed by a nonlinearity. However, one will notice that h_{t-1} is transmitted across a set of fixed identity connections in addition to being transformed by f_i. While β_f and β_s could be chosen to be hyperparameters and tuned externally (as sorts of per-dimension scalar multipliers), it might prove more effective to allow the model to learn these coefficients. If we introduce a vector of parameters r, we can choose the fast and slow weights to be β_f = (1 − r) and β_s = r, facilitating simple interpolation. Adding these negligibly few additional parameters to compose an interpolation mechanism yields the state-model:
(3)  h_t = Φ((1 − r) ⊗ d_t + r ⊗ h_{t-1})
Note that we define ⊗ to be the Hadamard product. Incorporating this interpolation mechanism can be interpreted as giving the Differential State Framework model a flexible mechanism for mixing various dimensions of its longer-term memory with its more localized memory. Interpolation, especially through a simple gating mechanism, can be an effective way to allow the model to learn how to turn on/off latent dimensions, potentially yielding improved generalization performance, as was empirically shown by Serban et al. (2016).
Beyond fixing r to some vector of pre-initialized values, there are two simple ways to parametrize r:
(4)  r = σ(b_r)

(5)  r = σ(W x_t + b_r)
where both forms only introduce an additional set of learnable bias parameters b_r; however, Equation 5 allows the data x_t at time step t to interact with the gate and thus takes into account additional information from the input distribution when mixing stable and local states together. Unlike Serban et al. (2016), we constrain the rates to lie in the range [0, 1] by using the logistic link function, σ(v) = 1/(1 + e^{−v}), which will transform the biases into rates much like the rates of the SCRN. We crucially choose to share W in this particular mechanism for two reasons: 1) we avoid adding yet another matrix of input-to-hidden parameters and, much to our advantage, reuse the computation of the linear pre-activation term W x_t, and 2) additionally coupling the data pre-activation to the gating mechanism will serve as further regularization of the input-to-hidden parameters (by restricting the number of learnable parameters, much as in classical autoencoders). Two error signals, one routed through the inner function's pre-activation and one routed through the gate, now take part in the calculation of the partial derivative with respect to W. Figure 1 depicts the architecture using the simple late-integration mechanism.
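A single outer-function step, covering both gate parametrizations, can be sketched as follows. This is an illustrative sketch under our own naming conventions (outer_step, b_r, data_driven); note how the data-driven gate of Equation 5 reuses the pre-activation W @ x, so only the bias vector b_r is new.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def outer_step(x, h_prev, d, W, b_r, phi=np.tanh, data_driven=True):
    # The gate constrains the mixing rates to [0, 1] via the logistic link.
    if data_driven:
        r = sigmoid(W @ x + b_r)   # Equation 5: gate sees the data
    else:
        r = sigmoid(b_r)           # Equation 4: a learned per-unit rate
    # Equation 3: interpolate the fast proposal with the slow state.
    return phi((1.0 - r) * d + r * h_prev)

rng = np.random.default_rng(1)
H, D = 4, 3
W = 0.1 * rng.standard_normal((H, D))
b_r = np.zeros(H)                  # sigmoid(0) = 0.5: equal mixing at init
x, h_prev = rng.standard_normal(D), np.zeros(H)
d = np.tanh(W @ x)                 # stand-in fast state for the demo
h = outer_step(x, h_prev, d, W, b_r, phi=lambda v: v)  # identity output
```

With a zero previous state and an identity output activation, the step reduces to (1 − r) ⊗ d, which makes the interpolation easy to verify by hand.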
2.3 Forms of the Inner Function – Instantiating the DeltaRNN
When a concrete form of the inner function f_i is chosen, we can fully specify the Differential State Framework. We will also show, in Section 3, how many other commonly used RNN architectures can, in fact, be treated as special cases of this general framework defined under Equation 1.
Starting from Equation 2, if we fix β_f = 1 and β_s = 0, we can recover the classical Elman RNN, where f_i is a linear combination of the projection of the current data point and the projection of the previous hidden state, followed by a nonlinearity φ. However, if we also set β_s = 1, we obtain a naive way to compute a delta change of states. Specifically, the simple-RNN's hidden state, where Φ(v) = v (the identity function), is:

(6)  h_t = φ(W x_t + V h_{t-1})

where h_t is the hidden-layer state at time t, x_t is the input vector, and Θ = {W, V} contains the weight matrices. In contrast, for the simple DeltaRNN, where instead β_f = β_s = 1, we have:

(7)  h_t = Φ(φ(W x_t + V h_{t-1}) + h_{t-1})

Thus, the state can be implicitly stable, assuming W and V are initialized with small values and Φ allows this by being partially linear. For example, we can choose Φ to be the linear rectifier (or initialize the model so as to start out in the linear regime of the hyperbolic tangent). In this case, the simple DeltaRNN does not need to learn anything to maintain a constant state over time: with near-zero weights the inner term vanishes and h_t ≈ h_{t-1}.
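The implicit stability property can be demonstrated directly. The sketch below (our own naming; weights taken to exactly zero as a limiting case of "small" initialization) runs the simple delta update of Equation 7 with Φ set to the linear rectifier: the inner proposal vanishes and the non-negative state is carried forward unchanged.

```python
import numpy as np

def simple_delta_step(x, h_prev, W, V):
    # Equation 7: inner (fast) proposal plus the previous state,
    # passed through Phi = ReLU (linear on non-negative states).
    d = np.tanh(W @ x + V @ h_prev)
    return np.maximum(0.0, d + h_prev)

H, D = 4, 3
W, V = np.zeros((H, D)), np.zeros((H, H))  # "small" weights, taken to zero
h = np.array([0.3, 0.0, 0.7, 0.1])
for x in np.random.default_rng(2).standard_normal((10, D)):
    h = simple_delta_step(x, h, W, V)
# With zero weights, tanh(0) = 0 and the state is held constant
# across all ten steps without any learning.
```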
Preliminary experimentation with this simple form (Equation 7) often yielded unsatisfactory performance. This further motivated the development of the simple interpolation mechanism presented in Equation 3. However, depending on how one chooses the nonlinearities Φ and φ, one can create different types of interpolation. Using an Elman RNN for f_i as in Equation 6 and substituting it into Equation 3 creates what we propose as the “late-integration” state model:
(8)  d_t = φ(W x_t + V h_{t-1})

(9)  h_t = Φ((1 − r) ⊗ d_t + r ⊗ h_{t-1})

where Φ could be any choice of activation function, including the identity function. This form of interpolation allows for a more direct error-propagation pathway, since gradient information, once transmitted through the interpolation gate, has two pathways: through the nonlinearity of the local state (through d_t) and through the pathway composed of implicit identity connections. (Late-integration might remind the reader of the phrase “late fusion,” as in the context of Wang and Cho (2015). However, Wang and Cho focused on merging the information from an external bag-of-words context vector with the standard cell state of the LSTM.) When using a simple Elman RNN, we have essentially described a first-order DeltaRNN. However, historically, second-order recurrent neural architectures have been shown to be powerful models in tasks such as grammatical inference (Giles et al., 1991) and noisy time-series prediction (Giles et al., 2001), as well as incredibly useful in rule extraction when treated as finite-state automata (Giles et al., 1992; Goudreau et al., 1994). Very recently, Wu et al. (2016) showed that the gating effect between the state-driven and data-driven components of a layer's pre-activations facilitated better propagation of gradient signals, as opposed to the usual linear combination. A second-order version of f_i would be highly desirable, not only because it further mitigates the vanishing gradient problem that plagues backpropagation through time (used in calculating parameter gradients of neural architectures), but because the form introduces negligibly few additional parameters. We do note that the second-order form we use, like in Wu et al. (2016), is a rank-1 matrix approximation of the actual tensor used in Giles et al. (1992); Goudreau et al. (1994). We can take the late-integration model, Equation 9, and replace d_t, similar to Giles et al. (1991), with:
(10)  d_t = φ((W x_t) ⊗ (V h_{t-1}))

or a more general form (Wu et al., 2016):

(11)  d_t = φ(α ⊗ (W x_t) ⊗ (V h_{t-1}) + β_1 ⊗ (V h_{t-1}) + β_2 ⊗ (W x_t) + b)

where we note that d_t can be a function of any arbitrary incoming set of information signals that are gated by the last known state. The DeltaRNN will ultimately combine this data-driven signal with its slow-moving state. More importantly, observe that even in the most general form (Equation 11), only a few further bias-vector parameters, α, β_1, and β_2, are required.
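The general second-order inner function can be sketched compactly. This follows the multiplicative-integration form of Wu et al. (2016) as a rank-1 approximation; the names (second_order_inner, alpha, beta1, beta2, b) are ours, and the all-ones/all-zeros initialization shown is just one plausible starting point.

```python
import numpy as np

def second_order_inner(x, h_prev, W, V, alpha, beta1, beta2, b):
    # Equation 11: a gated (Hadamard) interaction between the data-driven
    # and state-driven pre-activations, plus the usual first-order terms.
    # Only the bias vectors alpha, beta1, beta2 (and b) are added on top
    # of the W and V matrices already present in the first-order model.
    wx, vh = W @ x, V @ h_prev
    return np.tanh(alpha * wx * vh + beta1 * vh + beta2 * wx + b)

rng = np.random.default_rng(3)
H, D = 4, 3
W = 0.1 * rng.standard_normal((H, D))
V = 0.1 * rng.standard_normal((H, H))
alpha, b = np.ones(H), np.zeros(H)
beta1, beta2 = np.ones(H), np.ones(H)
d = second_order_inner(rng.standard_normal(D), rng.standard_normal(H),
                       W, V, alpha, beta1, beta2, b)
```

Setting alpha to zero and beta1 = beta2 = 1 recovers the first-order additive pre-activation, which makes the extra cost of the second-order form easy to see: three bias vectors, no new matrices.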
Assuming a single-hidden-layer language model, with H hidden units and V input units (where V corresponds to the cardinality of the symbol dictionary), a full late-integration DeltaRNN that employs a second-order f_i (Equation 11) has only (V × H) + H² + 5H parameters (counting the hidden bias, the full interpolation mechanism of Equation 5, and the second-order biases), which is only slightly larger than a classical RNN with (V × H) + H² + H parameters. This stands in stark contrast to the sheer number of parameters required to train commonly used complex architectures such as the LSTM (with peephole connections), with 4(V × H) + 4H² + 7H parameters, and the GRU, with 3(V × H) + 3H² + 3H parameters.
2.4 Regularizing the DeltaRNN
Regularization is often important when training large, over-parametrized models. To control for overfitting, approaches range from structural modifications to impositions of priors over parameters (Neal, 2012). Commonly employed modern approaches include dropout (Srivastava et al., 2014) and its variations (Gal and Ghahramani, 2016), or mechanisms to control for internal covariate shift, such as Batch Normalization (Ioffe and Szegedy, 2015), for large feedforward architectures. In this paper, we investigate the effect that dropout has on the DeltaRNN's performance. (In preliminary experiments, we also investigated incorporating layer normalization (Ba et al., 2016) into the DeltaRNN architecture, the details of which may be found in the Appendix. We did not observe noticeable gains using layer normalization over dropout, and thus only report the results of dropout in this paper.) To introduce simple (non-recurrent) dropout into the framework, our preliminary experiments uncovered that dropout was most effective when applied to the inner function f_i, as opposed to the outer function's computed delta-state. For the full DeltaRNN, under dropout probability p, this would lead to the following modification:

(12)  h_t = Φ((1 − r) ⊗ D(d_t; p) + r ⊗ h_{t-1})

where D(·; p) is the dropout operator, which masks its input argument with a binary vector sampled from independent Bernoulli distributions.
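The placement of the mask can be sketched as follows. This is an illustrative sketch using the common "inverted" dropout convention (surviving units rescaled by 1/(1 − p), our choice for the demo, not necessarily the paper's exact scaling): the proposal d_t is masked before interpolation, while the slow-state path h_{t-1} is left untouched.

```python
import numpy as np

def dropout(v, p, rng):
    # Inverted dropout: Bernoulli mask with keep probability 1 - p,
    # survivors rescaled so the expected activation is unchanged.
    mask = rng.random(v.shape) >= p
    return (v * mask) / (1.0 - p)

def delta_step_with_dropout(d, h_prev, r, p, rng):
    # Equation 12: only the inner function's output is masked.
    return (1.0 - r) * dropout(d, p, rng) + r * h_prev

rng = np.random.default_rng(4)
d, h_prev = np.ones(8), np.zeros(8)
h = delta_step_with_dropout(d, h_prev, r=np.zeros(8), p=0.5, rng=rng)
# Surviving units are rescaled to 2.0; dropped units are exactly zero.
```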
2.5 Learning under the DeltaRNN
Let (x_1, …, x_N) be a variable-length sequence of N symbols (such as the words that would compose a sentence). In general, the distribution over the variables follows the graphical model:

(13)  P_Θ(x_1, …, x_N) = ∏_{t=1}^{N} P(x_t | x_{<t}, Θ)

where Θ are the model parameters (of a full DeltaRNN).
No matter how the hidden state h_t is calculated, in this paper it will ultimately be fed into a maximum-entropy classifier (with the bias term omitted for clarity) defined as:

(14)  P(w | h_t) = exp(U_w · h_t) / Σ_{w′} exp(U_{w′} · h_t)

where U is the hidden-to-output weight matrix and the sum ranges over the vocabulary.
To learn the parameters of any of our models, we optimize with respect to the sequence negative log likelihood:

(15)  L(Θ) = − Σ_{t=1}^{N} log P(x_t | x_{<t}, Θ)
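The output layer and objective of Equations 14 and 15 can be sketched directly. This is a minimal reference implementation under our own naming (predict, sequence_nll); U is the hidden-to-output matrix and, as in the text, the output bias is omitted.

```python
import numpy as np

def predict(h, U):
    # Equation 14: softmax (maximum-entropy) classifier over the vocabulary.
    logits = U @ h
    logits = logits - logits.max()     # stabilize the exponentials
    e = np.exp(logits)
    return e / e.sum()                 # P(w | h_t)

def sequence_nll(hidden_states, targets, U):
    # Equation 15: sum of -log P(x_t | h_t) over the sequence.
    return -sum(np.log(predict(h, U)[t])
                for h, t in zip(hidden_states, targets))

rng = np.random.default_rng(5)
V, H = 10, 4
U = rng.standard_normal((V, H))
states = [rng.standard_normal(H) for _ in range(3)]
loss = sequence_nll(states, targets=[1, 7, 3], U=U)
```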
Model parameters Θ of the DeltaRNN are learned under an empirical risk minimization framework. We employ backpropagation of errors (or rather, reverse-mode automatic differentiation with respect to this negative log likelihood objective function) to calculate gradients and update the parameters using the method of steepest gradient descent. For all experiments conducted in this paper, we found that the ADAM adaptive learning rate scheme (Kingma and Ba, 2014) (followed by a Polyak average (Polyak and Juditsky, 1992) for the subword experiments) yielded the most consistent and near-optimal performance. We therefore use this setup for the optimization of parameters for all models (including baselines), unless otherwise mentioned. For all experiments, we unroll computation graphs T steps in time (where T varies across experiments/tasks), and, in order to approximate full backpropagation through time, we carry over the last hidden state from the previous mini-batch (within a full sequence). More importantly, we found that by furthermore using the derivative of the loss with respect to the last hidden state, we can improve the approximation and thus perform one step of iterative inference to update the last hidden state carried over (the step-size of this update was searched over a small grid for all experiments in this paper). We ultimately used this proposed improved approximation for the subword models (since in those experiments we could directly train all baseline and proposed models in a controlled, identical fashion to ensure a fair comparison).
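The truncated-unrolling-with-carry-over pattern can be sketched without any autodiff machinery. This is a schematic only, under our own naming (step, run_truncated): the sequence is split into windows of T steps, gradients would be computed within each window, and only the final hidden state (not its gradient path) is carried into the next window.

```python
import numpy as np

def step(x, h, W, V, r=0.5):
    # A stand-in DSF state update; any inner/outer pair could be substituted.
    d = np.tanh(W @ x + V @ h)
    return (1.0 - r) * d + r * h

def run_truncated(xs, T, W, V):
    h = np.zeros(V.shape[0])
    windows = []
    for start in range(0, len(xs), T):
        states = []
        for x in xs[start:start + T]:  # unroll T steps in time
            h = step(x, h, W, V)
            states.append(h)
        # Backprop would stop at this window boundary; h itself carries
        # over, approximating full backpropagation through time.
        windows.append(states)
    return windows

rng = np.random.default_rng(6)
W = 0.1 * rng.standard_normal((4, 3))
V = 0.1 * rng.standard_normal((4, 4))
windows = run_truncated(rng.standard_normal((10, 3)), T=4, W=W, V=V)
```

A sequence of 10 steps with T = 4 yields three windows of 4, 4, and 2 steps, with the state threaded continuously through all ten.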
For all DeltaRNNs experimented with in this paper, the output activation of the inner function was chosen to be the hyperbolic tangent. The output activation of the outer function was set to be the identity for the word and character benchmark experiments and the hyperbolic tangent for the subword experiments (these decisions were made based on preliminary experimentation on subsets of the final training data). The exact configuration of the implementation used in this paper involved the late-integration form, either the unregularized (Equation 9) or the dropout-regularized (Equation 12) variant, for the outer function, and Equation 11 for the inner function.
We compare our proposed models against a wide variety of unregularized baselines, as well as several state-of-the-art regularized baselines for the benchmark experiments. These baselines include the LSTM, GRU, and SCRN, as well as computationally more efficient formulations of each, such as the MGU. The goal is to see whether our proposed DeltaRNN is a suitable replacement for complex gated architectures and can capture longer-term patterns in sequential text data.
3 Related Work: Recovering Previous Models
A contribution of this work is that our general framework, presented in Section 2.1, offers a way to unify previous proposals for gated neural architectures (especially for use in next-step prediction tasks like language modeling) and to explore directions of improvement. Since we will ultimately compare our proposed DeltaRNN of Section 2.3 to these architectures, we will next present how to derive several key architectures from our general form, such as the Gated Recurrent Unit and the Long Short-Term Memory. More importantly, we will introduce them in the same notation and design as the DeltaRNN and highlight the differences between previous work and our own through the lens of f_i and f_o.
Simple models, largely based on the original Elman RNN (Elman, 1990), have often been shown to perform quite well in language modeling tasks (Mikolov et al., 2010, 2011). The Structurally Constrained Recurrent Network (SCRN, Mikolov et al. 2014), an important predecessor and inspiration for this work, showed that one fruitful path to learning longer-term dependencies was to impose a hard constraint on how quickly the values of hidden units could change, yielding more “stable” long-term memory. The SCRN itself is very similar to a combination of the RNN architectures of Jordan (1990) and Mozer (1993). The key element of its design is the constraint that part of the recurrent weight matrix must stay close to the identity, a constraint that is also satisfied by the DeltaRNN. These identity connections (and the corresponding context units that use them) allow for improved information travel over many time steps and can even be viewed as an exponential trace memory (Mozer, 1993). Residual Networks, though feedforward in nature, also share a similar motivation (He et al., 2016). Unlike the SCRN, the proposed DeltaRNN does not require a separation of the slow- and fast-moving units, but instead models this slower timescale through implicitly stable states.
The Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber 1997a) is arguably the currently most popular and most often used gated neural architecture, especially in the domain of Natural Language Processing. Starting from our general form, Equation 1, we can see how the LSTM can be deconstructed; setting M_{t-1} = {h_{t-1}, c_{t-1}} yields:

(16)  c_t = f_t ⊗ c_{t-1} + i_t ⊗ d_t

(17)  h_t = o_t ⊗ Φ(c_t)

where i_t, f_t, and o_t are the input, forget, and output gates, noting that c_t is the cell-state designed to act as the constant error carousel in mitigating the problem of vanishing gradients when using backpropagation through time. A great deal of recent work has attempted to improve the training of the LSTM, often by increasing its complexity, such as through the introduction of so-called “peephole connections” (Gers and Schmidhuber, 2000). To compute d_t and the gates using peephole connections, the standard set of equations is:

d_t = φ(W x_t + V h_{t-1})
i_t = σ(W_i x_t + V_i h_{t-1} + p_i ⊗ c_{t-1})
f_t = σ(W_f x_t + V_f h_{t-1} + p_f ⊗ c_{t-1})
o_t = σ(W_o x_t + V_o h_{t-1} + p_o ⊗ c_t)
The Gated Recurrent Unit (GRU, Chung et al. 2014, 2015) can be viewed as one of the more successful attempts to simplify the LSTM. We see that f_i and f_o are still quite complex, requiring many intermediate computations to reach an output. In the case of the outer mixing function, f_o, we see that:

(18)  z_t = σ(W_z x_t + V_z h_{t-1})

(19)  h_t = (1 − z_t) ⊗ h_{t-1} + z_t ⊗ d_t

noting that the state gate z_t is also a function of the RNN's previous hidden state and introduces parameters specialized for x_t. In contrast, the DeltaRNN does not use an extra set of input-to-hidden weights; more directly, the pre-activation of the input projection can be reused for the interpolation gate. The inner function of the GRU, f_i, is defined as:

r_t = σ(W_r x_t + V_r h_{t-1})
d_t = φ(W x_t + r_t ⊗ (V h_{t-1}))

where φ is generally set to be the hyperbolic tangent activation function. A mutated architecture (MUT, Jozefowicz et al. 2015) was an attempt to simplify the GRU somewhat: much like the DeltaRNN, its interpolation mechanism is not a function of the previous hidden state, but it is still largely as parameter-heavy as the GRU, shedding only a single extra parameter matrix, especially since its interpolation mechanism retains a specialized parameter matrix to transform the data. The DeltaRNN, on the other hand, shares this matrix with its primary calculation of the data's pre-activation values. The Minimally Gated Unit (MGU, Zhou et al. 2016) is yet a further attempt to reduce the complexity of the GRU by merging its reset and update gates into a single forget gate, essentially using the same outer function of the GRU defined in Equation 19, but simplifying the inner function to be quite close to the Elman RNN, conditioned on the forget gate, as follows:

f_t = σ(W_f x_t + V_f h_{t-1})
d_t = φ(W x_t + f_t ⊗ (V h_{t-1}))
While the MGU certainly does reduce the number of parameters, viewing it from the perspective of our general DeltaRNN framework, one can see that it still largely uses an f_i that is rather limited (possessing only the capabilities of the Elman RNN). The most effective version of our DeltaRNN emerged from the insight that a more powerful f_i could be obtained by (approximately) increasing its order, which requires only a few more bias parameters, and nesting it within a nonlinear interpolation mechanism that computes the delta-states. Our framework is general enough to also allow designers to incorporate functions that augment the general state-engine with an external memory, to create architectures that can exploit the strengths of models with decoupled memory architectures (Weston et al., 2014; Sukhbaatar et al., 2015; Graves et al., 2016) or data structures that serve as memory (Sun et al., 1998; Joulin and Mikolov, 2015).
A final related, but important, strand of work uses depth (i.e., the number of processing layers) to directly model various timescales, as emulated in models such as the hierarchical/multi-resolutional recurrent neural network (HMRNN) (Chung et al., 2016). Since the DeltaRNN is designed to allow its interpolation gate to be driven by the data, it is possible that the model might already be learning how to make use of boundary information (word boundaries at the character/subword level, sentence boundaries as marked by punctuation at the word level). The HMRNN, however, more directly attacks this problem by modifying an LSTM to learn how to manipulate its states when certain types of symbols are encountered. (This is different from models like the Clockwork RNN, which requires explicit boundary information (Koutnik et al., 2014).) One way to take advantage of the ideas behind the HMRNN would be to manipulate the Differential State Framework to incorporate the explicit modeling of timescales through layer depth (each layer is responsible for modeling a different timescale). Furthermore, it would be worth investigating how the HMRNN's performance would change when built from modifying a DeltaRNN instead of an LSTM.
4 Experimental Results
Language modeling is an important next-step prediction task, with downstream applications in speech recognition, parsing, and information retrieval. As such, we focus our experiments on this task domain to gauge the efficacy of our DeltaRNN framework, noting that the framework might also prove useful in, for instance, machine translation (Bahdanau et al., 2014) or light chunking (Turian et al., 2009). Beyond improving language modeling performance, the sentence (and document) representations iteratively inferred by our architectures might also prove useful in composing higher-level representations of text corpora, a subject we will investigate in future work.
4.1 Datasets
4.1.1 The Penn Treebank Corpus
The Penn Treebank corpus (Marcus et al., 1993) is often used to benchmark both word- and character-level models via perplexity or bits-per-character, and thus we start here. (To be directly comparable with previously reported results, we make use of the specific preprocessed train/valid/test splits found at http://www.fit.vutbr.cz/~imikolov/rnnlm/.) The corpus contains 42,068 sentences (971,657 tokens, average token length of about 4.727 characters) of varying length (the range is from 3 to 84 tokens, at the word level).
4.1.2 The IMDB Corpus
The large sentiment analysis corpus (Maas et al., 2011) is often used to benchmark algorithms for predicting the positive or negative tonality of documents. However, we opt to use this large corpus (training consists of 149,714 documents; 1,875,523 sentences; 40,765,697 tokens; average token length is about 3.4291415 characters) to evaluate our proposed DeltaRNN as a (subword) language model. The IMDB dataset serves as a case where the context extends beyond the sentence level, in the form of actual documents.

Penn Treebank: Word Models  PPL

NGram (Mikolov et al., 2014)  
NNLM (Mikolov, 2012)  
NGram+cache (Mikolov et al., 2014)  
RNN (Gulcehre et al., 2016)  
RNN (Mikolov, 2012)  
LSTM (Mikolov et al., 2014)  
SCRN (Mikolov et al., 2014)  
LSTM (Sundermeyer, 2016)  
MIRNN (Wu et al. 2016, our impl.)  
DeltaRNN (present work)  
DeltaRNN, dynamic #1 (present work)  
DeltaRNN, dynamic #2 (present work)  
LSTMrecurrent drop (Krueger et al., 2016)  
NRdropout (Zaremba et al., 2014)  
Vdropout (Gal and Ghahramani, 2016)  
DeltaRNNdrop, static (present work)  
DeltaRNNdrop, dynamic #1 (present work)  
DeltaRNNdrop, dynamic #2 (present work) 
Penn Treebank: Character Models  BPC 

Ndiscount Ngram (Mikolov et al., 2012)  
RNN+stabilization (Krueger et al., 2016)  
linear MIRNN (Wu et al., 2016)  
Clockwork RNN (Koutnik et al., 2014)  
RNN (Mikolov et al., 2012)  
GRU (Jernite et al., 2016)  
HFMRNN (Mikolov et al., 2012)  
MIRNN (Wu et al., 2016)  
MaxEnt Ngram (Mikolov et al., 2012)  
LSTM (Krueger et al., 2016)  
DeltaRNN (present work)  
DeltaRNN, dynamic #1 (present work)  
DeltaRNN, dynamic #2 (present work)  
LSTMnorm stabilizer (Krueger et al., 2016)  
LSTMweight noise (Krueger et al., 2016)  
LSTMstochastic depth (Krueger et al., 2016)  
LSTMrecurrent drop (Krueger et al., 2016)  
RBN (Cooijmans et al., 2016)  
LSTMzone out (Krueger et al., 2016)  
HLSTM + LN (Ha et al., 2016)  
TARDIS (Gulcehre et al., 2017)  
3HMLSTM + LN (Chung et al., 2016)  
DeltaRNNdrop, static (present work)  
DeltaRNNdrop, dynamic #1 (present work)  
DeltaRNNdrop, dynamic #2 (present work) 
4.2 Word- & Character-Level Benchmark
The first set of experiments examines our proposed DeltaRNN models against reported state-of-the-art models. These reported measures have been on traditional word- and character-level language modeling tasks: we measure the per-symbol perplexity of models. For the word-level models, we calculate the per-word perplexity (PPL) using the measure PPL = exp(−(1/N) Σ_{t=1}^{N} ln P(x_t | x_{<t})). For the character-level models, we report the standard bits-per-character (BPC), which can be calculated from the log likelihood using the formula BPC = −(1/N) Σ_{t=1}^{N} log₂ P(x_t | x_{<t}).
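Both measures are simple transformations of a model's summed negative log likelihood (in nats) over N predicted symbols, as the following sketch shows (our own helper names; the sanity checks use uniform models, for which the values are known in closed form).

```python
import math

def perplexity(nll, n_symbols):
    # Per-word perplexity: PPL = exp(NLL / N).
    return math.exp(nll / n_symbols)

def bits_per_character(nll, n_symbols):
    # Bits-per-character: BPC = NLL / (N * ln 2), i.e. average
    # negative log2-likelihood per predicted character.
    return nll / (n_symbols * math.log(2))

# Sanity checks with uniform models: a uniform distribution over a
# 10K-word vocabulary gives PPL = 10000, and a uniform distribution
# over 49 characters gives BPC = log2(49), about 5.61.
ppl = perplexity(nll=1000 * math.log(10000), n_symbols=1000)
bpc = bits_per_character(nll=1000 * math.log(49), n_symbols=1000)
```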
Over 100 epochs, wordlevel models with minibatches of 64 (padded) sequences. (Early stopping with a lookahead of 10 was used.) Gradients were clipped using a simple magnitudebased scheme
(Pascanu et al., 2013), with the magnitude threshold set to 5. A simple grid search was performed to tune the learning rate as well as the size of the hidden layer. Parameters (non-biases) were initialized from zero-mean Gaussian distributions with tuned variance. (We also experimented with other initializations, most notably the identity matrix for the recurrent weight parameters as in Le et al. (2015), but found that this initialization often worsened performance. For the activation functions of the first-order models, we experimented with the linear rectifier, the parametrized linear rectifier, and even our own proposed parametrized smoothened linear rectifier, but found that such activations lead to less-than-satisfactory results. The results of this inquiry are documented in the code that will accompany the paper.)
The character-level models, on the other hand, were updated using mini-batches of 64 samples over 100 epochs. (Early stopping with a lookahead of 10 was used.) The parameter initializations and the grid search for the learning rate and hidden layer size were the same as for the word models, with the exception of the range searched for the hidden layer size. (The largest size considered would yield nearly 4 million parameters, which was our upper bound on the total number of parameters allowed for experiments, in order to be commensurable with the work of Wu et al. (2016) on the Penn Treebank models.) A simple learning rate decay schedule was employed: if the validation loss did not decrease after a single epoch, the learning rate was halved (unless a lower bound on its value had been reached). When dropout was applied to the DeltaRNN (DeltaRNN-drop), the probability of dropping a unit was set separately for the character-level and the word-level models. We present results for both the unregularized and the regularized versions of the models.
For all of the DeltaRNNs, we furthermore experiment with two variations of dynamic evaluation, which facilitates a fair comparison to compression algorithms, inspired by the improvements observed in Mikolov (2012). DeltaRNN-drop, dynamic #1 refers to simply updating the model sample-by-sample after each evaluation; in this case, we update parameters using simple stochastic gradient descent (Mikolov, 2012) with a fixed step size. We develop a second variation of dynamic evaluation, DeltaRNN-drop, dynamic #2, in which we allow the model to first iterate (and update) once over the validation set and then over the test set, completely allowing the model to “compress” the Penn Treebank corpus. These two schemes are used for both the word- and character-level benchmarks. It is important to stress that the BPC and PPL measures reported for the dynamic models follow a strict “test-then-train” online paradigm, meaning that each next-step prediction is made before updating model parameters.
The standard vocabulary for the word-level models contains 10K unique words (including an unknown token for out-of-vocabulary symbols and an end-of-sequence token; a special “null” token, or zero vector, marks the start of a sequence), and the standard vocabulary for the character-level models includes 49 unique characters (including a symbol for spaces). Results for the word-level and character-level models are reported in Table 1.
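To make the “test-then-train” protocol concrete, the following sketch scores each symbol before adapting on it. The `neg_log_prob` and `sgd_step` methods are hypothetical stand-ins for the model's actual interface, not names from our implementation:

```python
import numpy as np

def dynamic_evaluate(model, sequence, step_size=1.0):
    """Strict "test-then-train" dynamic evaluation: each next-step
    prediction is scored BEFORE the parameters are updated on it."""
    total_nll, n = 0.0, 0
    for x in sequence:
        total_nll += model.neg_log_prob(x)   # predict first...
        model.sgd_step(x, lr=step_size)      # ...then adapt on the same symbol
        n += 1
    return np.exp(total_nll / n)             # perplexity (or 2**(bits) for BPC)
```

The second dynamic scheme simply calls this once over the validation set before the final pass over the test set, so the model arrives at the test data already adapted.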
4.3 Subword Language Modeling
We chose to measure the negative log likelihood of the various architectures on the task of subword modeling. Subwords are particularly appealing not only because the input distribution is of lower dimensionality but also because, as evidenced by the positive results of Mikolov et al. (2012), subword/character hybrid language models improve over the performance of pure character-level models. Subword models also enjoy the advantage held by character-level models when it comes to handling out-of-vocabulary words, avoiding the need for an “unknown” token. Research in psycholinguistics has long suggested that even human infants are sensitive to word boundaries at an early stage (e.g., Aslin et al. 1998), and that morphologically complex words enjoy dedicated processing mechanisms (Baayen and Schreuder, 2006). Subword-level language models may approximate such an architecture. Consistency in subword formation is critical in order to obtain meaningful results (Mikolov et al., 2012). Thus, we design our subword algorithm to partition a word according to the following scheme:

1. Split on vowels (using a predefined list).
2. Link/merge each vowel with a consonant to the immediate right, if applicable.
3. Merge straggling single characters to subwords on the immediate right, unless a subword of shorter character length is to the left.
This simple partitioning scheme was designed to ensure that no subword is shorter than two characters in length. Future work will entail designing a more realistic subword partitioning algorithm. Subwords below a certain frequency were discarded, and the remainder were combined with 26 single characters to create the final dictionary. For Penn Treebank, this yields a vocabulary of 2,405 symbols (2,378 subwords + 26 characters + 1 end-token). For the IMDB corpus, after replacing all emoticons and special non-word symbols with special tokens, we obtain a dictionary of 1,926 symbols (1,899 subwords + 26 single characters + 1 end-token). Results for all subword models are reported in Table 2.
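As a rough illustration, one possible reading of the three-step scheme could be implemented as follows. The tie-breaking in step 3 is simplified relative to the description above (we always merge stragglers rightward, attaching only a trailing straggler leftward), so this is a sketch rather than the exact algorithm used:

```python
VOWELS = set("aeiou")

def partition(word):
    """Sketch of the vowel-based subword partition described above."""
    # Step 1: split on vowels -- a segment ends at each vowel.
    segs, cur = [], ""
    for ch in word:
        cur += ch
        if ch in VOWELS:
            segs.append(cur)
            cur = ""
    if cur:
        segs.append(cur)
    # Step 2: link each vowel-final segment with one consonant taken
    # from the segment to its immediate right, when one can be spared.
    for i in range(len(segs) - 1):
        nxt = segs[i + 1]
        if nxt and nxt[0] not in VOWELS and len(nxt) > 1:
            segs[i] += nxt[0]
            segs[i + 1] = nxt[1:]
    # Step 3: fold straggling single characters into the subword on the
    # right; a trailing straggler attaches to its left neighbor.
    out, pending = [], ""
    for s in segs:
        if len(s) == 1:
            pending += s
        else:
            out.append(pending + s)
            pending = ""
    if pending:
        if out:
            out[-1] += pending
        else:
            out.append(pending)
    return out
```

Under this reading, a word such as "example" partitions into "ex", "am", "ple", and every subword of a multi-character word has at least two characters.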





Model  # Params  NLL
RNN
SCRN
MGU
MI-RNN
GRU
LSTM
DeltaRNN





Model  # Params  NLL
RNN
SCRN
MGU
MI-RNN
GRU
LSTM
DeltaRNN
Specifically, we test our implementations of the LSTM (with peephole connections as described in Graves 2013), the GRU, the MGU, the SCRN, as well as a classical Elman network of both first and second order (Giles et al., 1991; Wu et al., 2016). (We experimented with initializing the forget-gate biases of all LSTMs with a range of values, since previous work has shown this can improve model performance. We will publicly release code to build and train the architectures in this paper upon publication.) Subword models were trained in a similar fashion as the character-level models, updated (every 50 steps) using mini-batches of 20 samples, but over 30 epochs. Learning rates were tuned in the same fashion as for the word-level models, and the same parameter initialization schemes were explored. The notable difference between this experiment and the previous ones is that we fix the number of parameters for each model to be equivalent to that of an LSTM with 100 hidden units for PTB and 50 hidden units for IMDB. This ensures a controlled, fair comparison across models and allows us to evaluate whether the DeltaRNN can learn similarly to models with more complicated processing elements (an LSTM cell versus a GRU cell versus a DeltaRNN unit). Furthermore, this allows us to measure parameter efficiency, focusing on the value of specific cell types (for example, comparing a much more complex LSTM memory unit against a simple DeltaRNN cell) when the number of parameters is held roughly constant. We are currently running larger versions of the models shown in the tables above to determine whether the results hold at scale.
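The parameter-matching protocol can be sketched as follows. The counts below cover only cell parameters (embedding and softmax layers are excluded, and the exact bookkeeping we used may differ); given the LSTM's budget, one solves for the hidden size that brings a simpler cell up to the same total:

```python
def lstm_params(n_in, n_h):
    # 4 gates (input, forget, output, cell-input), each with an input->hidden
    # matrix, a hidden->hidden matrix, and a bias; 3 diagonal peephole vectors.
    return 4 * (n_h * n_in + n_h * n_h + n_h) + 3 * n_h

def rnn_params(n_in, n_h):
    # Elman cell: one input->hidden matrix, one recurrent matrix, one bias.
    return n_h * n_in + n_h * n_h + n_h

def matched_hidden_size(n_in, budget, count_fn):
    # Largest hidden size whose cell stays within the parameter budget.
    n_h = 1
    while count_fn(n_in, n_h + 1) <= budget:
        n_h += 1
    return n_h
```

For instance, matching the budget of an LSTM with 100 hidden units lets a simple RNN use a considerably wider hidden layer, which is exactly the trade-off the parameter-efficiency comparison is meant to expose.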
5 Discussion
With respect to the word- and character-level benchmarks, we see that the DeltaRNN outperforms all previous unregularized models and performs comparably to the regularized state of the art. As documented in Table 2, we further trained a second-order, word-level RNN (MI-RNN) to complete the comparison, and remark that the second-order connections appear to be quite useful in general, outperforming the SCRN and coming close to the LSTM. This extends the results of Wu et al. (2016) to the word level. However, the DeltaRNN, which also makes use of second-order units within its inner function, ultimately offers the best performance and performs better than the LSTM in all experiments. In both the Penn Treebank and IMDB subword language modeling experiments, the DeltaRNN is competitive with complex architectures such as the GRU and the MGU, nearly reaching the performance of the best baseline model on either dataset (the GRU on Penn Treebank and the MGU on IMDB). Surprisingly, on IMDB, a simple Elman network is quite performant, even outperforming the MI-RNN. We argue that this might be the result of constraining all neural architectures to only a small number of parameters for such a large dataset, a constraint we intend to relax in future work.
The DeltaRNN is far more efficient than a complex LSTM, and certainly than a memory-augmented network like TARDIS (Gulcehre et al., 2017). Moreover, it appears to learn how to make appropriate use of its interpolation mechanism to decide how and when to update its hidden state in the presence of new data. (At greater computational cost, a somewhat lower perplexity for an LSTM may be attainable, such as the perplexity of 107 reported by Sundermeyer (2016) (see Table 1); however, this requires many more training epochs and precludes batch training.) Given our derivations in Section 3, one could argue that nearly all previously proposed gated neural architectures are essentially trying to do the same thing under the Differential State Framework. The key advantage offered by the DeltaRNN is that this functionality is offered directly and cheaply (in terms of required parameters).
It is important to contrast these (unregularized) results with those that use some form of regularization. Zaremba et al. (2014) reported that a single LSTM (for word-level Penn Treebank) can reach a considerably lower PPL, but this was achieved via dropout regularization (Srivastava et al., 2014). There is a strong relationship between using dropout and training an ensemble of models. Thus, one can argue that a single model trained with dropout is not actually a single model, but an implicit ensemble (see also Srivastava et al. 2014). An ensemble of twenty simple RNNs and cache models did previously reach a PPL as low as 72, while a single RNN model gives only 124 (Mikolov, 2012). Zaremba et al. (2014) trained an ensemble of 38 LSTMs regularized with dropout, each with 100 times more parameters than the RNNs used by Mikolov (2012), achieving a PPL of 68. This is arguably a small improvement over 72, and seems to strengthen our claim that dropout is an implicit model ensemble and thus should not be used when one wants to report the performance of a single model. However, the DeltaRNN is amenable to regularization, including dropout. As our results show, when simple dropout is applied, the DeltaRNN can reach much lower perplexities, similar to those of much larger state-of-the-art models, especially when dynamic evaluation is permitted. This even extends to very complex architectures, such as the recently proposed TARDIS, a memory-augmented network, which the simple DeltaRNN can outperform when dynamic evaluation is used. Though we investigate only simple dropout in this paper, our comparative results suggest that more sophisticated variants, such as variational dropout (Gal and Ghahramani, 2016), could yield further improvements in performance.
What is the lesson to be learned from the Differential State Framework? First and foremost, we can obtain strong performance in language modeling with a simpler, more efficient (in terms of number of parameters), and thus faster architecture. Second, the DeltaRNN is designed from the interpretation that the computation of the next hidden state is the result of a composition of two functions. An inner function decides how to “propose” a new hidden state, while the outer function decides how to use this proposal in updating the previously calculated state. The data-driven interpolation mechanism is used by the model to decide how much impact the newly proposed state has in updating what is likely to be a slowly changing representation. The SCRN, which could be viewed as the predecessor of the DeltaRNN framework, was designed with the idea that some constrained units could serve as a sort of cache meant to capture longer-term dependencies. Like the SCRN, the DeltaRNN is designed to help mitigate the problem of vanishing gradients: through the interpolation mechanism, it has multiple pathways through which the gradient might be carried, boosting the error signal’s longevity down the propagation path through time. However, the SCRN combines the slow-moving and fast-changing hidden states through a simple summation and thus cannot model nonlinear interactions between its shorter- and longer-term memories; it furthermore requires tuning the sizes of these separated layers. The DeltaRNN, on the other hand, does not require special tuning of an additional hidden layer and can nonlinearly combine the two types of states in a data-dependent fashion, possibly allowing the model to exploit boundary information from text, which is quite powerful in the case of documents. The key intuition is that the gating mechanism allows the state proposal to affect the maintained memory state only if the currently observed datapoint carries any useful information.
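The two-function view can be made concrete with a minimal sketch of a single DeltaRNN-style step. The parameterization below (second-order and first-order terms inside a `tanh` proposal, with a data-driven sigmoid gate outside) follows the description above but is our own illustrative naming, not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rnn_step(h_prev, x, W, V, b, b_r, alpha, beta1, beta2):
    """One DeltaRNN-style step: inner proposal + gated outer interpolation.

    W (n_h, n_in) projects the input, V (n_h, n_h) the previous state;
    alpha/beta1/beta2 (n_h,) weight the second- and first-order terms.
    """
    dx, dh = W @ x, V @ h_prev
    # Inner function: propose a new state from multiplicative (second-order)
    # and additive (first-order) combinations of the two streams.
    proposal = np.tanh(alpha * dh * dx + beta1 * dh + beta2 * dx + b)
    # Outer function: a data-driven gate decides how much of the proposal
    # to mix into the slowly changing maintained state.
    r = sigmoid(dx + b_r)
    return (1.0 - r) * proposal + r * h_prev
```

When the gate saturates at 1 (an uninformative input), the state passes through essentially unchanged, which is the behavior the qualitative analysis below observes for high-frequency function words.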
This warrants a comparison, albeit an indirect one, to Surprisal Theory: the gate admits a larger state change precisely when the current word is informative, i.e., surprising. This “surprisal” proves useful in iteratively forming a sentence impression that helps to better predict the words that come later.
With respect to the last point made, we briefly examine the evolution of a trained DeltaRNN’s hidden state across several sample sentences. The first two sentences are hand-created (constrained to use only the vocabulary of Penn Treebank), while the last one is sampled from the Penn Treebank training split. Since the DeltaRNN iteratively processes the symbols of an ordered sequence, we measure the L1 norm across consecutive pairs of hidden states and report the (min-max) normalized L1 scores in Figure 2. (We calculate the L1 norm, or Manhattan distance, for every contiguous pair of state vectors across a sequence, excluding the state calculated for the start/null token, and obtain a score for each position by min-max normalization of the resulting sequence of distances.) In accordance with our intuition, the L1 norm is lower for high-frequency words (indicating a smaller delta) such as “the” or “of” or “is”, words that are generally less informative about the general subject of a sentence or document. As this qualitative demonstration illustrates, the DeltaRNN appears to learn what to do with its internal state in the presence of symbols of variable information content.
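The scoring procedure amounts to a few lines, assuming the hidden states are stacked into a `(T+1, n_h)` array whose first row is the start/null-token state:

```python
import numpy as np

def delta_scores(states):
    """Min-max-normalized L1 distances between consecutive hidden states,
    as used for the qualitative analysis above."""
    # T distances between the T+1 stacked states (rows).
    l1 = np.abs(np.diff(states, axis=0)).sum(axis=1)
    lo, hi = l1.min(), l1.max()
    return (l1 - lo) / max(hi - lo, 1e-12)   # scores in [0, 1]
```

A score near 0 then marks a word that barely perturbed the state (a small delta), while a score near 1 marks the most state-changing word in the sentence.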
6 Conclusions
We present the Differential State Framework, which affords us a useful perspective for viewing computation in recurrent neural networks. Instead of recomputing the whole state from scratch at every time step, the DeltaRNN only learns how to update the current state. This seems to be better suited for many types of problems, especially those that involve longer-term patterns in which part of the recurrent network’s state should be constant most of the time. Comparison to the currently popular LSTM and GRU architectures shows that the DeltaRNN can achieve similar or better performance on language modeling tasks, while being conceptually much simpler and having far fewer parameters. Comparison to the Structurally Constrained Recurrent Network (SCRN), which shares many of the main ideas and motivations, shows better accuracy and a simpler model architecture (since, in the SCRN, the sizes of two separate hidden layers must be tuned, and that model cannot learn nonlinear interactions within its longer-term memory).
Future work includes larger-scale language modeling experiments to test the efficacy of the DeltaRNN framework, as well as architectural variants that employ decoupled memory. Since the DeltaRNN can be stacked just as any other neural architecture can, we intend to investigate whether depth (in terms of hidden layers) might prove useful on larger-scale datasets. In addition, we intend to explore how useful the DeltaRNN might be in other tasks in which architectures such as the LSTM currently hold state-of-the-art performance. Finally, it would be useful to explore whether the DeltaRNN’s simpler, faster design can speed up grander architectures, such as the Differentiable Neural Computer (Graves et al., 2016), which is composed of multiple LSTM modules.
Acknowledgments
We thank C. Lee Giles and Prasenjit Mitra for their advice. We thank NVIDIA for providing GPU hardware that supported this paper. A.O. was funded by a NACME-Sloan scholarship; D.R. acknowledges funding from NSF IIS-1459300.
References
 Aslin et al. (1998) Aslin, R. N., Saffran, J. R., and Newport, E. L. (1998). Computation of conditional probability statistics by 8monthold infants. Psychological Science, 9(4):321–324.
 Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
 Baayen and Schreuder (2006) Baayen, R. H. and Schreuder, R. (2006). Morphological Processing. Wiley.
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 Boston et al. (2008) Boston, M. F., Hale, J., Kliegl, R., Patil, U., and Vasishth, S. (2008). Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2(1).

 Choudhury (2015) Choudhury, V. (2015). Thought vectors: Bringing common sense to artificial intelligence. www.iamwire.com.
 Chung et al. (2016) Chung, J., Ahn, S., and Bengio, Y. (2016). Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.
 Chung et al. (2014) Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

 Chung et al. (2015) Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2015). Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075.
 Cooijmans et al. (2016) Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
 Das et al. (1992) Das, S., Giles, C. L., and Sun, G.Z. (1992). Learning contextfree grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of the 14th Annual Conference of the Cognitive Science Society, page 14, Bloomington, IN.
 Elman (1990) Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2):179–211.
 Gal and Ghahramani (2016) Gal, Y. and Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.
 Gers and Schmidhuber (2000) Gers, F. A. and Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the IEEEINNSENNS International Joint Conference on Neural Networks, volume 3, pages 189–194. IEEE.
 Giles et al. (1991) Giles, C. L., Chen, D., Miller, C., Chen, H., Sun, G., and Lee, Y. (1991). Secondorder recurrent neural networks for grammatical inference. In International Joint Conference on Neural Networks, volume 2, pages 273–281.
 Giles et al. (2001) Giles, C. L., Lawrence, S., and Tsoi, A. C. (2001). Noisy time series prediction using recurrent neural networks and grammatical inference. Machine Learning, 44(12):161–183.
 Giles et al. (1992) Giles, C. L., Miller, C. B., Chen, D., Chen, H.H., Sun, G.Z., and Lee, Y.C. (1992). Learning and extracting finite state automata with secondorder recurrent neural networks. Neural Computation, 4(3):393–405.
 Goudreau et al. (1994) Goudreau, M. W., Giles, C. L., Chakradhar, S. T., and Chen, D. (1994). Firstorder versus secondorder singlelayer recurrent neural networks. IEEE Transactions on Neural Networks, 5(3):511–513.
 Graves (2013) Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
 Graves et al. (2016) Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., GrabskaBarwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.
 Gulcehre et al. (2016) Gulcehre, C., Moczulski, M., Denil, M., and Bengio, Y. (2016). Noisy activation functions. arXiv preprint arXiv:1603.00391.
 Ha et al. (2016) Ha, D., Dai, A., and Le, Q. V. (2016). Hypernetworks. arXiv preprint arXiv:1609.09106.
 Hale (2001) Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics, NAACL ’01, pages 1–8, Stroudsburg, PA, USA.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer.
 Hochreiter and Schmidhuber (1997a) Hochreiter, S. and Schmidhuber, J. (1997a). Long short-term memory. Neural Computation, 9(8):1735–1780.
 Hochreiter and Schmidhuber (1997b) Hochreiter, S. and Schmidhuber, J. (1997b). LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, pages 473–479.
 Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
 Jernite et al. (2016) Jernite, Y., Grave, E., Joulin, A., and Mikolov, T. (2016). Variable computation in recurrent neural networks. arXiv preprint arXiv:1611.06188.
 Jordan (1990) Jordan, M. I. (1990). Artificial neural networks. chapter Attractor Dynamics and Parallelism in a Connectionist Sequential Machine, pages 112–127. IEEE Press, Piscataway, NJ, USA.
 Joulin and Mikolov (2015) Joulin, A. and Mikolov, T. (2015). Inferring algorithmic patterns with stackaugmented recurrent nets. In Advances in Neural Information Processing Systems, pages 190–198.
 Jozefowicz et al. (2015) Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 2342–2350.
 Kingma and Ba (2014) Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Koutnik et al. (2014) Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork rnn. arXiv preprint arXiv:1402.3511.
 Krueger et al. (2016) Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., Goyal, A., Bengio, Y., Larochelle, H., Courville, A., et al. (2016). Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305.
 Le et al. (2015) Le, Q. V., Jaitly, N., and Hinton, G. E. (2015). A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.
 Levy (2008) Levy, R. (2008). Expectationbased syntactic comprehension. Cognition, 106(3):1126 – 1177.
 Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACLHLT2011, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
 Marcus et al. (1993) Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
 Mikolov (2012) Mikolov, T. (2012). Statistical Language Models Based on Neural Networks. PhD thesis, University of Brno, Brno, CZ.
 Mikolov et al. (2014) Mikolov, T., Joulin, A., Chopra, S., Mathieu, M., and Ranzato, M. (2014). Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.
 Mikolov et al. (2010) Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), volume 2, pages 1045–1048, Makuhari, Chiba, JP.
 Mikolov et al. (2011) Mikolov, T., Kombrink, S., Burget, L., Černocký, J., and Khudanpur, S. (2011). Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528–5531, Prague, Czech Republic.
 Mikolov et al. (2012) Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., and Černocký, J. (2012). Subword language modeling with neural networks. http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf. Accessed: 2017-06-01.
 Mozer (1993) Mozer, M. C. (1993). Neural net architectures for temporal sequence processing. In Santa Fe Institute Studies in the Sciences of Complexity, volume 15, pages 243–243. AddisonWesley Publishing Co.
 Neal (2012) Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
 Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. International Conference of Machine Learning (3), 28:1310–1318.
 Polyak and Juditsky (1992) Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
 Serban et al. (2016) Serban, I. V., Ororbia II, A. G., Pineau, J., and Courville, A. (2016). Piecewise latent variables for neural variational text processing. arXiv preprint arXiv:1612.00377.
 Srivastava et al. (2014) Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
 Sukhbaatar et al. (2015) Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). Endtoend memory networks. arXiv:1503.08895 [cs].
 Sun et al. (1998) Sun, G.Z., Giles, C. L., and Chen, H.H. (1998). The neural network pushdown automaton: Architecture, dynamics and training. In Adaptive processing of sequences and data structures, pages 296–345. Springer.
 Sundermeyer (2016) Sundermeyer, M. (2016). Improvements in Language and Translation Modeling. PhD thesis, RWTH Aachen University.
 Turian et al. (2009) Turian, J., Bergstra, J., and Bengio, Y. (2009). Quadratic features and deep architectures for chunking. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 245–248. Association for Computational Linguistics.
 Wang and Cho (2015) Wang, T. and Cho, K. (2015). Largercontext language modelling. arXiv preprint arXiv:1511.03729.
 Weston et al. (2014) Weston, J., Chopra, S., and Bordes, A. (2014). Memory networks. arXiv:1410.3916 [cs, stat].
 Wu et al. (2016) Wu, Y., Zhang, S., Zhang, Y., Bengio, Y., and Salakhutdinov, R. R. (2016). On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 2856–2864.
 Gulcehre et al. (2017) Gulcehre, C., Chandar, S., and Bengio, Y. (2017). Memory augmented neural networks with wormhole connections. arXiv:1701.08718 [cs, stat].
 Zaremba et al. (2014) Zaremba, W., Sutskever, I., and Vinyals, O. (2014). Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
 Zhou et al. (2016) Zhou, G.B., Wu, J., Zhang, C.L., and Zhou, Z.H. (2016). Minimal gated unit for recurrent neural networks. International Journal of Automation and Computing, 13(3):226–234.
Appendix A: Layer Normalized DeltaRNNs
In this appendix, we describe how layer normalization would be applied to a DeltaRNN. Our preliminary experiments did not show that layer normalization gave much improvement over dropout, though this was only observed on the Penn Treebank benchmark. Future work will investigate the benefits of layer normalization over dropout (as well as model ensembling) on larger-scale benchmarks.
A simple RNN requires layer normalization to be applied after calculating the full linear pre-activation (a sum of the filtration and the projected data point). A DeltaRNN, on the other hand, requires further care (like the GRU) to ensure that the correct components are normalized without damaging the favorable properties inherent to the model’s multiplicative gating. If layer normalization is applied to the pre-activations of the late-integration DeltaRNN proposed in this paper, the update equations become:
\mathbf{d}^{1}_{t} = \alpha \otimes \mathrm{LN}(V\mathbf{h}_{t-1}) \otimes \mathrm{LN}(W\mathbf{x}_{t})  (20)
\mathbf{d}^{2}_{t} = \beta_{1} \otimes \mathrm{LN}(V\mathbf{h}_{t-1}) + \beta_{2} \otimes \mathrm{LN}(W\mathbf{x}_{t})  (21)
\tilde{\mathbf{z}}_{t} = \phi(\mathbf{d}^{1}_{t} + \mathbf{d}^{2}_{t})  (22)
\mathbf{h}_{t} = \Phi\big((1 - \mathbf{r}) \otimes \tilde{\mathbf{z}}_{t} + \mathbf{r} \otimes \mathbf{h}_{t-1}\big), \quad \mathbf{r} = \sigma(\mathrm{LN}(W\mathbf{x}_{t}))  (23)
Note that the additional bias parameters introduced in the original update equations are now omitted. This can be done because the layer normalization operation now performs the work of shifting and scaling. Since the DeltaRNN takes advantage of parameter sharing, it requires substantially fewer layer normalizations than a more complex model (such as the GRU) would: a standard GRU would require nine layer normalizations, while the DeltaRNN requires only two.
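For reference, the layer normalization operation itself (Ba et al., 2016) is a per-vector standardization followed by a learned rescale and shift; in the DeltaRNN it would be applied only to the two pre-activation streams, the projected history and the projected input. A minimal sketch:

```python
import numpy as np

def layer_norm(z, g, b, eps=1e-5):
    """Layer normalization: standardize a pre-activation vector across its
    units, then rescale by gain g and shift by bias b (both learned)."""
    mu, sigma = z.mean(), z.std()
    return g * (z - mu) / (sigma + eps) + b
```

Because the same two normalized streams are reused by every term of the update, the per-step cost of adding layer normalization to a DeltaRNN stays small.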