Introduction
Despite their success, artificial neural networks (ANNs), especially recurrent neural networks (RNNs), have repeatedly been shown to struggle with generalizing in a sophisticated, systematic manner, often uncovering misleading statistical associations instead of true causal relations. Verifying what is learned by these black-box models remains an open challenge, centering around one central issue – the lack of interpretability and modularity. The fact that successful ANN optimization depends heavily on large quantities of data only serves to further worsen the problem. One research direction towards developing more interpretable ANNs focuses on rule extraction from and assimilation of rules into RNNs [angluin1983inductive, fu1977]. To solve difficult grammatical inference problems, various types of specialized RNNs have been designed [lstmcfg, boden2000context, tabor2000fractal, wiles1995learning, sennhauser2018evaluating, nam2019number]. However, it has been shown that RNNs augmented with external memory structures, such as the neural network pushdown automaton (NNPDA), are more powerful than RNNs without, both historically [giles1992learning, pollack1990recursive, zeng1994discrete] and recently, using differentiable memory [joulin2015inferring, grefenstette2015learning, graves2014neural, kurach2015neural, zeng1994discrete, hao2018context, yogatama2018memory, graves2016hybrid]. Yet most of these models lack interpretability, and how they learn any given grammar is still debatable. In the past, rule integration methods have been proposed to tackle the interpretability issue [giles1992learning, omlin1996constructing] and offer a promising path towards the design of ANNs with an underlying knowledge structure that is more understandable and transparent.
However, to the best of our knowledge, there exists no method for inserting rules into the states of the far more powerful class of higher order, memory-augmented RNNs. In working towards interpretable, memory-based neural models, our contributions are the following:

We propose the neural state pushdown automaton (NSPDA) and its incremental training method, which exploits the concept of iterative refinement.

We develop a novel regularization method that empirically yields better generalization in complex, memory-based RNNs. To our knowledge, we are the first to propose a weight regularizer that works with higher order RNNs.

We propose a method for programming states into a neural state machine with binary second- and third-order weights.

We develop a method for inserting rules into stack-based recurrent networks.

We compare our model with the NNPDA and other RNNs, trained using different learning algorithms.
Motivation & Related Work
Research related to integrating knowledge into ANNs has existed for quite some time, such as through the design of state machines [tivno1998finite, omlin1996constructing]. Recent efforts in the domain of natural language processing have shown the effectiveness of using state machines for tasks such as visual question answering, which allow an agent to directly use higher-level semantic concepts to represent visual and linguistic modalities
[manning2019nsm]. With respect to rule insertion itself, there exists a great deal of work showcasing its effectiveness when used with ANNs [abuMostafa1990hints] as well as with RNNs [giles1992learning, omlin1996constructing]. Notably, [omlin1996constructing] showed how deterministic finite automaton rules could be encoded into second order RNNs. One important, classical model that we draw inspiration from is the neural network pushdown automaton (NNPDA) [nndpa1998sun]. The structure of our proposed model is similar to the NNPDA, but, as we will discuss, the major difference is that our model works with a digital stack as opposed to an analog one. Interestingly enough, prior work has also shown how to insert “hints” into the NNPDA, where knowledge of “dead states” can be used to guide its learning process [nndpa1998sun]. In the spirit of this hint-based methodology, we will develop a method for encoding useful rules related to target context-free grammars (CFGs) into our neural state pushdown automaton (NSPDA). This, to our knowledge, is the first approach of its kind, since no rule-insertion methodology has previously been proposed for complex state-based models. Creating such a procedure allows us to both exploit the far greater representational capabilities of memory-augmented RNNs and offer an intuitive way to understand the knowledge contained in and acquired by RNNs. In this work, we focus on RNNs that control a discrete stack, particularly our proposed NSPDA. We empirically determine whether the inductive biases we encode into its synaptic weights speed up the parameter optimization process and, furthermore, improve model generalization over longer sequences at test time. The results of our experiments, which compare a wide variety of RNNs (of varying order, with and without memory), will strongly contradict the claim presented in recent work [gru2019pda] that first order RNNs, like the popular gated recurrent unit (GRU) RNN [chung2014empirical], are as powerful as a PDA. In essence, our work demonstrates that for an RNN to recognize a complex CFG, it will, at the very least, require external memory. Our results also demonstrate the value of encoding even partial PDA information, which positively impacts convergence time and model generalization.

The Neural State Pushdown Automaton
Neural Architecture
The model we propose, the NSPDA with iterative refinement, is shown in Figure 1. The NSPDA consists of fully connected recurrent neurons which we will label as state neurons, primarily to distinguish them from the neurons that function as output neurons. Introducing the concept of state neurons is important when considering the notion of higher-order networks, i.e., second or third order RNNs, which allows us to map state representations directly to outputs. In this model, at each time step t, the state neurons receive signals from the input neurons, the previous state, and the stack-read neurons. The input neurons process a string, one character at a time, while non-recurrent neurons, also labeled as “action neurons”, represent an operation to be performed on a stack data structure, i.e., Push/Pop/No-op. The action neurons are also designated as the controller, which can either be recurrent or linear (recurrent controllers usually perform better in practice, so we focus on these in this paper). Furthermore, “read” neurons are used to keep track of the symbol present at the top of the stack. To make the above high-level description concrete, consider a single hidden-layer NSPDA. A full sample is a symbol sequence paired with a binary label y that indicates whether the sequence is valid (y = 1) or not (y = 0). When processing a (binary) symbol/token x^t at discrete time step t, the NSPDA is engaged with computing a new state vector s^{t+1} \in \{0,1\}^n, where m is the total number of input/sensory neurons (or the dimensionality of the input space, sometimes classically referred to as the alphabet size) and n is the total number of state neurons. The action neuron vector is defined as a^t \in \{-1,0,1\}^m and the read neuron vector is defined as r^t \in \{0,1\}^m, i.e., the action and read spaces are of the same dimensionality as the input, or m. Taken together, the above sets of input, state, and read neurons represent a full NSPDA model with parameters \Theta = \{W^S, W^A\}. Crucially, W^S and W^A are both 4-dimensional (4D) synaptic weight tensors, i.e., the binary “to-state” tensor W^S and the ternary “to-action” tensor W^A (note that: -1 is “pop”, 0 is “no-op”, and 1 is “push”). At time t, inference (for a third order NSPDA) is conducted as follows:

(1)  s^{t+1}_j = g_s\big( \sum_{k,l,m} W^{S}_{jklm}\, s^{t}_k\, r^{t}_l\, x^{t}_m \big)
(2)  a^{t+1}_j = g_a\big( \sum_{k,l,m} W^{A}_{jklm}\, s^{t}_k\, r^{t}_l\, x^{t}_m \big)
(3)  r^{t+1} = \text{top}(\text{stack}^{t+1}), i.e., the read neurons are set to the symbol at the top of the stack after the action a^{t+1} has been applied
where \theta_s, \theta_a, and \theta_r are threshold values that determine what the next state of each discrete unit will be (sampled uniformly from a special interval to create continuous values for backprop to work with). Note that s^{t+1} is the next hidden state, a^{t+1} is the next stack action, and r^{t+1} is the next value of the neurons that read the content at the top of the stack. g_s and g_a are nonlinear activation functions, specifically, quantized sigmoidal functions, defined as:
(4)  \sigma(v) = 1 / (1 + e^{-v})
(5)  g_s(v) = 1 if \sigma(v) > \theta_s, and 0 otherwise
(6)  g_a(v) = 1 if \tanh(v) > \theta_a, -1 if \tanh(v) < -\theta_a, and 0 otherwise
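A minimal NumPy sketch of the third-order update in Equations 1 and 2 (the function names, the threshold value, and the use of a rounded tanh for the ternary action quantization are our illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def quantized_sigmoid(v, theta=0.5):
    """g_s: binarize a sigmoid output with threshold theta (illustrative)."""
    return (1.0 / (1.0 + np.exp(-v)) > theta).astype(float)

def nspda_step(Ws, Wa, s, r, x):
    """One third-order NSPDA inference step (sketch).

    Ws, Wa : 4-D weight tensors of shape (n_state, n_state, n_read, n_in)
    s      : binary state vector, shape (n_state,)
    r      : binary read vector,  shape (n_read,)
    x      : one-hot input,       shape (n_in,)
    """
    # Third-order contraction: every (state, read, input) triple contributes.
    pre_s = np.einsum('jklm,k,l,m->j', Ws, s, r, x)
    pre_a = np.einsum('jklm,k,l,m->j', Wa, s, r, x)
    s_next = quantized_sigmoid(pre_s)      # next binary state in {0, 1}
    a_next = np.round(np.tanh(pre_a))      # ternary action in {-1, 0, 1}
    return s_next, a_next
```

The einsum makes the 4D tensor product explicit: each output state neuron j pools evidence from every joint configuration of the previous state, the stack read, and the current input.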
As the NSPDA processes a string, a prediction of its validity is made at each step. Specifically, the output weights U (and bias scalar b) are used to map the state vector to the output space. The output model is defined as \hat{y}^t = \sigma(U \cdot s^t + b), where \sigma is the logistic link function. The actual external stack itself is manipulated by discrete-valued action neurons that trigger a discrete push or pop action (as given by Equation 2). Take, for example, a 2-letter alphabet, i.e., \{a, b\}. The dimensions of the action and read spaces would then, in this case, be 2. When using a digital stack, the following actions can be taken:

PUSH: The current input is pushed onto the top of the stack. Example: to push the symbol “a”, the action neuron outputs 1 and the read neurons subsequently reflect “a” at the top of the stack.

POP: The element at the top of the stack is removed. Example: to remove the symbol “b”, the action neuron outputs -1 and the read neurons subsequently reflect the newly exposed top of the stack.

NOOP: This simply means “no operation”, or, in other words, nothing is to be done with the stack. Example: the action neuron outputs 0 and the read neurons are left unchanged.
In the case of the vector r^t, we are reading the symbol currently located at the top of the stack at each time step. Our goal is to make sure the RNN chooses the correct action during training while still maintaining stable binary read states.
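The digital stack driven by these discrete actions can be sketched as follows (a minimal illustration; the class name and bottom-of-stack marker are our own):

```python
class DigitalStack:
    """Discrete external stack driven by action values in {-1, 0, 1}."""

    def __init__(self, bottom='$'):
        self.items = [bottom]  # '$' marks the initial stack symbol

    def apply(self, action, symbol):
        if action == 1:            # PUSH the current input symbol
            self.items.append(symbol)
        elif action == -1:         # POP the top element (never the bottom marker)
            if len(self.items) > 1:
                self.items.pop()
        # action == 0 -> NOOP: leave the stack untouched

    def read_top(self):
        """What the read neurons observe: the symbol at the top of the stack."""
        return self.items[-1]
```

For example, after pushing “a” then “b” and popping once, `read_top()` returns “a” – exactly the value the read neurons would carry into the next NSPDA step.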
Learning and Optimization
First, we define the loss function used to both measure the performance of the network and optimize its parameters. Classically, state neural models such as the NNPDA exclusively made use of a binary loss function that only considered whether a string was valid or invalid [das1993using]. Furthermore, these models only made a prediction/classification at the very end of the sequence. In contrast, the NSPDA is an iterative, step-by-step predictive model. Thus, we use a sequence loss based on binary cross entropy. (In preliminary experiments, models using a squared error loss, with and without regularization penalties, had great difficulty in converging; we found cross entropy to be far more effective.) The instantaneous loss, for a single sequence, is:

(7)  \mathcal{L}(\Theta) = -\sum_{t=1}^{T} \big[ y \log \hat{y}^t + (1 - y) \log(1 - \hat{y}^t) \big]

where \hat{y}^t is the t-th prediction/output from the final state neuron. Note that the label y is copied to each step in time, which injects an extra error signal throughout the sequence length, improving the optimization process (as opposed to relying on only a single output error signal being effectively propagated backwards through the underlying computation graph). To compute updates for the NSPDA’s parameters, we employed several gradient-based approaches, including the popular backpropagation through time (BPTT) procedure as well as online algorithms such as real-time recurrent learning (RTRL) [williams1989rtrl] and unbiased online recurrent optimization (UORO) [tallec2017uoro]. In short, all of these algorithms compute gradients of the loss function (Equation 7) with respect to the NSPDA weights. The primary difference between them is that BPTT is based on a reverse-mode differentiation routine while RTRL is based on forward-mode differentiation (and UORO is a faster, higher-variance approximation of RTRL). We describe UORO and RTRL in further detail in the appendix. While UORO and RTRL are not commonly used to train modern-day RNNs, they offer faster ways to train them without requiring graph unfolding. Thus, we compare the results of using each in our experiments.
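The per-step binary cross-entropy loss of Equation 7, with the single sequence label copied to every time step, can be sketched as follows (`sequence_bce` is an illustrative name):

```python
import numpy as np

def sequence_bce(preds, label, eps=1e-7):
    """Binary cross-entropy summed over every step of the sequence.

    preds : per-step validity predictions y_hat_t, each in (0, 1)
    label : the single sequence label y in {0, 1}, copied to every step
    """
    preds = np.clip(np.asarray(preds, dtype=np.float64), eps, 1 - eps)
    # Same target at each step -> an error signal at every point in time.
    return float(-np.sum(label * np.log(preds)
                         + (1 - label) * np.log(1 - preds)))
```

Each step contributes its own error term, so gradient signal reaches the recurrent weights even when the end-of-string error would otherwise vanish over long sequences.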
Iterative Refinement
One important element we introduced into the training protocol of the NSPDA is iterative refinement, an algorithm proposed in the signal processing literature for incorporating partial iterative inference into a next-step predictive RNN [ororbia2019iterdecode]. At a high level, this means that, during training, at step t, the NSPDA is forced to predict the same target multiple times (except for the state transitions that are provided as “hints”, which we describe in a later section). Crucially, the state vector is still carried over these inner steps, meaning the recurrent synapses relating the state of the model at one refinement step to the next are reused. To adapt iterative refinement to a next-step sequence model like the NSPDA, it can be cleanly introduced by manipulating the sequence loss of Equation 7 as follows:

(8)  \mathcal{L}(\Theta) = -\sum_{t=1}^{T} \sum_{p=1}^{K_t} \big[ y \log \hat{y}^{t,p} + (1 - y) \log(1 - \hat{y}^{t,p}) \big]
(9)  K_t = (1 - h_t) K + h_t

noting that we have introduced the integer sequence K = (K_1, ..., K_T) to augment the sample, computed from a binary “hint” vector h (automatically generated), where h_t = 1 signals a hint is used at step t (and thus no extra refinement) while h_t = 0 is “no hint”. Empirically, we found a small number of refinement steps worked well. In [ororbia2019iterdecode], using an RNN’s recurrent weights as a lateral processing mechanism [ororbia2019lifelong] was related to an RNN acting as a deep feedforward network with tied weights across hidden layers (a “prediction episode”). This means that additional nonlinearity (via depth) is being efficiently exploited without incurring the memory cost of storing extra weights. We found that iterative refinement introduces greater stability into the learning process, primarily when gradient noise is used. Note that, even in this case, while we work with full-precision weights for gradient computation, before evaluation is conducted, the weights are converted to discrete values.
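The inner refinement loop can be sketched as follows, assuming a generic `model_step(state, x) -> (new_state, prediction)` interface (our own simplification of the NSPDA's step):

```python
def refinement_loss(model_step, loss_fn, state, x_t, y_t, k=3):
    """Iterative refinement at one time step: the model re-predicts the
    same target k times, carrying its state across the inner steps.

    model_step : (state, x) -> (new_state, prediction)  [assumed interface]
    loss_fn    : (prediction, target) -> scalar loss
    """
    total = 0.0
    for _ in range(k):
        # Recurrent weights act as a lateral processing mechanism here:
        # the state is threaded through every inner step.
        state, y_hat = model_step(state, x_t)
        total += loss_fn(y_hat, y_t)  # same target -> extra error signal
    return state, total
```

Because the state is threaded through the k inner steps, the recurrent weights effectively act as a deep feedforward stack with tied weights, without storing any extra parameters.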
Two-Stage Incremental Learning
Incremental learning, or, in other words, training procedures that sort data samples based on their inherent difficulty and progressively present them to a neural agent, has been shown to be quite effective when training RNNs on input data that is known to have some structure [elman1993learning, das1993using]. Based on this prior finding, we developed a two-stage incremental learning approach for improving a higher order RNN’s ability to generalize to longer sequences. Formally, Algorithm 1 depicts the overall process. We found that using a stochastic learning rate [ororbia2019iterdecode] worked better in the first stage, while a fixed learning rate combined with a stochastic noise process applied to the weights (similar to gradient noise) worked better during the second stage.
As we will see later experimentally, whenever the data has some exploitable structure that allows for an automatic sorting of samples by increasing complexity, incremental learning is highly effective in training higher order RNNs. In the case of CFGs, we can sort samples based on string length and progressively build a model that can learn to generalize to increasingly longer string sequences. Algorithm 1 depicts the full process (note that e is a variable that marks the number of epochs so far).
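A rough sketch of the two-stage schedule described above (the `train_fn` interface, argument names, and stage boundary are illustrative assumptions, not the exact Algorithm 1):

```python
def two_stage_incremental(train_fn, samples, stage_one_epochs=50):
    """Sketch of two-stage incremental learning for CFG strings.

    Stage 1: present strings sorted by length, growing the working set,
             with a stochastic learning rate.
    Stage 2: a fixed learning rate plus weight noise, over the full set.
    train_fn : (data, lr, weight_noise, epochs) -> None  [assumed interface]
    samples  : list of (string, label) pairs
    """
    by_length = sorted(samples, key=lambda s: len(s[0]))
    seen = []
    for sample in by_length:                  # stage 1: grow the curriculum
        seen.append(sample)
        train_fn(seen, lr='stochastic', weight_noise=False,
                 epochs=stage_one_epochs)
    train_fn(by_length, lr='fixed', weight_noise=True,
             epochs=stage_one_epochs)         # stage 2: refine on everything
    return by_length
```

Sorting by string length is what makes the curriculum automatic for CFGs: shorter strings exercise shallower stack behavior, so the model masters them before longer nestings arrive.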
Regularizing Higher Order RNNs
When training any RNN for long periods of time, the model tends to memorize the input training data, which damages its ability to generalize to unseen sequence data, i.e., overfitting. Higher order RNNs are also susceptible to overfitting given their high capacity and complexity, and yet, no regularizer has ever been proposed to help these kinds of RNNs combat overfitting. In this work we extend an adaptive (layer-dependent) noise scheme that was originally proposed for training neurobiologically-plausible ANNs [ororbia2019biologically], which showed strong positive results for simple feedforward classification tasks, to RNNs. Notably, our noise-based regularizer applies to higher-dimensional tensors, which are fundamental to implementing any n-th order RNN.
We are also motivated by the fact that injecting noise into gradients can encourage exploration of an RNN’s error optimization landscape [Goodfellow16] in one of two ways: 1) at the input, i.e., data augmentation [Goodfellow16], or 2) at the recurrence [krueger2016zoneout]. Our regularizer falls under the second case. (We also implemented a data augmentation approach but found it yielded poor results when learning context-free grammars.) The key details of our noise-based regularizer are depicted in Algorithm 2. Based on preliminary experiments, we found that a noise level of less than 30% and more than 8% helps the network to converge faster and, more importantly, generalize better on unseen sequences longer than those found in the training set. Experimentally, we will later see that this regularizer improves generalization even when prior knowledge is not integrated into the RNN.
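A sketch of layer-dependent gradient noise for weight tensors of arbitrary rank (the attenuation rule `base_level / (1 + depth)` is our illustrative assumption; Algorithm 2 specifies the actual scheme):

```python
import numpy as np

def noisy_gradients(grads, base_level=0.1, rng=None):
    """Inject adaptive, tensor-shaped Gaussian noise into gradients.

    grads      : list of gradient arrays, one per weight tensor; works for
                 matrices and for the 3D/4D tensors of higher order RNNs
    base_level : base noise scale (the 8%-30% range is what the text
                 reports as working well)
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = []
    for depth, g in enumerate(grads):
        scale = base_level / (1.0 + depth)   # layer-dependent attenuation (assumed)
        noisy.append(g + rng.normal(0.0, scale, size=g.shape))
    return noisy
```

Because the noise is drawn with the same shape as each gradient tensor, the scheme applies unchanged to the 4D `W^S` and `W^A` tensors of a third order NSPDA.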
Integrating Prior Knowledge
Programming and Inserting Rules
We start by defining the data generating process that any RNN is to learn from, i.e., a PDA that generates a set of positive and negative strings. Formally, the state PDA is defined as a 7-tuple M = (Q, \Sigma, \Gamma, \delta, q_0, \bot, F) where:

\Sigma is the input alphabet

Q is the finite set of states

\Gamma is known as the stack alphabet (a finite set of tokens)

q_0 \in Q is the start state

\bot \in \Gamma is the initial stack symbol

F \subseteq Q is the set of accepting states

\delta : Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma \rightarrow Q \times \Gamma^{*} is the state transition function.
To insert rules related to known state transitions into the NSPDA, one needs to program its recurrent weights (which could be second or third order). Since the number of states in the PDA is not known beforehand, we assume that the number of state neurons is at least as large as the number of PDA states and that the network has enough capacity to learn an unknown context-free grammar. In order to program and insert rules, we propose adapting methodology originally developed for second order RNNs and deterministic finite state automata (DFA) [omlin1996constructing] to the case of PDA-based RNNs. Specifically, we will exploit the similarity between the state transitions of the target PDA and the underlying dynamics of a stack-driven RNN. Consider a known transition \delta(q_1, x, T) = (q_2, \gamma), where T is the top of the stack and \gamma is the sequence of symbols replacing T. We then identify PDA states q_1 and q_2, which correspond to state neurons s_i and s_j, respectively. Recall that each symbol has specific stack operations associated with it, which provide prior knowledge as to when to push and when to pop from the stack. It is desirable that the state neuron s_j has a high output close to 1 and s_i has a low output close to 0 after reading an input symbol using input neuron x_m and the top of the stack using read neuron r_l (remember that a read depends on an action neuron, as depicted in model Equation 3). This condition can be achieved by doing the following: 1) set the (third order) weight W^{S}_{jilm} to a large positive value, which helps to ensure that the state neuron s_j at the next time step will be high (and since g_s is sigmoidal, this tends towards 1), and 2) set W^{S}_{iilm} to a large negative value, which would make the output of the state neuron s_i low (tending towards 0). The next item to consider are the (ternary) action weights stored in W^{A}, which drive the action neurons that yield the stack operations (recall that [-1, 0, 1] maps to [pop, noop, push]).
First, we must assume that the total contribution of the weighted output of all other state neurons can be neglected – this can be achieved by setting all other state neurons to the lowest value. In addition, we assume that each state neuron can only be assigned to one known state of the PDA. If we have prior knowledge of accepting and non-accepting states related to a particular neuron, we may then bias its output. We start from the leftmost neuron in the state vector and work towards the last, programming each one by one. Armed with these assumptions, we can then stably encode rules into the NSPDA by programming the output weight of a state neuron to be a large positive value if the corresponding PDA state is an accepting state. Otherwise, we set it to be a large negative value if the state is non-accepting. If no such knowledge of the PDA is available, the weight remains unchanged. Though described for a third order NSPDA, the above approach for programming weights applies to a second order model as well. In a lower order NSPDA, with 3D weight tensors W^{S} and W^{A}, state updates and transitions are conducted by concatenating a read neuron with an input neuron to create a single vector. However, when programming a second order model, we are now working with a DFA [omlin1996constructing] instead of a PDA, which limits the capabilities of the NSPDA (as well as restricts its capacity) since we do not possess any knowledge about what to push or pop. Nevertheless, when combined with our proposed learning procedure that incorporates iterative refinement, we believe that the second order NSPDA can still learn what action to perform. However, the issue of dimensionality arises – the state space of a lower order model is very large when compared to that of a third order NSPDA. In the case of a PDA-based model, pushing multiple symbols might lead to reaching the same accepting state; however, in the case of a DFA-based model (the second order NSPDA), we create separate sets of accepting states for each symbol.
We found that this splitting mechanism was crucial in getting our network to work correctly with a digital stack. While the above rule insertion scheme seems simple enough, determining the actual values for the weights that are to be programmed can be quite problematic. In the case of third order synaptic connections (with binary weights), even a small number of neurons yields an exponential number of different weight combinations, which would quickly render our method impractical and near useless. However, we can sidestep this computational infeasibility by making use of “hints” [omlin1996constructing] within the framework of “orthogonal state encoding”. By assuming that the PDA starts generating a valid grammar at its initial state, we can then randomly choose a single state and make the output of one state neuron equal to 1, while the outputs of all the other neurons are set equal to 0. Following this, we set the values of the weights (according to known state transitions) using the approach described above. Notably, these weights, though initially programmed, are still adaptable, making them amenable to tuning to a target grammar underlying a data sample. Programming the weights of second or third order networks jointly impacts the behavior of the state neurons, the read neurons, and the input neurons. Following the scheme described above yields sparse NSPDA representations of PDA states. It is difficult to program an NSPDA with a minimal number of states, despite the fact that we have a theoretical guarantee that the third order model is equivalent to PDA dynamics [nndpa1998sun]. As we will observe in our results, the proposed methodology significantly reduces the NSPDA’s convergence time during optimization (leading to training times roughly comparable to those of first order RNNs), which is particularly important given the fact that its inference process entails 4D tensor products (which are far more expensive than the matrix computations of modern-day RNNs).
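The programming rule for a single known transition can be sketched as follows, using the index convention W^S[to-state, from-state, read, input]; the magnitude `H` and the function name are illustrative choices, not values from the paper:

```python
import numpy as np

H = 4.0  # a "large" programming magnitude; the exact value is a free choice

def program_transition(Ws, j_from, j_to, x_idx, r_idx, hint=H):
    """Encode a known PDA transition delta(q_from, x, top) = q_to into the
    third-order 'to-state' tensor Ws[j, k, l, m], where the indices are
    (to-state, from-state, read symbol, input symbol). Sketch only.
    """
    # Drive the destination state neuron high after (x, top) is read ...
    Ws[j_to, j_from, r_idx, x_idx] = hint
    # ... and drive the source state neuron low once the transition fires.
    Ws[j_from, j_from, r_idx, x_idx] = -hint
    return Ws
```

Repeating this for every known transition yields the sparse, orthogonal state encoding described above; the programmed entries remain trainable afterwards.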
[Table: convergence comparison by rule method – NNPDA w/o hints, NNPDA w/ dead neuron hints, NSPDA w/o hints, NSPDA w/ Hint #1, NSPDA w/ Hint #2; the surviving row (NSPDA w/ Hint #2) reads 70, 72, 150, 138, 389, 134, 222, 148 across the grammars.]
[Table: comparison of training methods – Standard, IL, and 2IL (ours); the surviving row (2IL) reads 2001, 2199, 9899, 10001, 130192, 129998, 177189, 177190 across the grammars.]
[Table: effect of regularization – w/o reg vs. w/ reg; the surviving row (w/ reg) reads 0.00, 0.00, 0.06, 0.01, 0.99, 0.00, 0.09, 0.00 across the grammars.]
[Table: generalization error by RNN type – RNN, LSTM, LSTMp, GRU, Stack RNN 40+10, Stack RNN 40+10+rounding, listRNN 40+5, 2nd Order RNN, 2nd Order RNN reg (ours), NNPDA, NNPDA reg (ours), NSPDA M1 (ours), NSPDA M2 (ours); the surviving row (NSPDA M2) reads 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.01, 0.88.]
Experimental Details
We focused on five context-free grammars, some labeled as Dyck(2) languages, which are among the more difficult CFGs to recognize. For each grammatical inference task, we created a dataset that contains positive and negative (string) samples. The length of each sequence was sampled from a uniform distribution defined over a fixed interval. From the samples generated, we randomly sampled a subset from the total number of tokens generated. The number of state neurons for the second order and third order NSPDA variants was set according to predetermined formulas. All models made use of the iterative refinement loss (Equation 9), and weight updates were computed using whichever algorithm, i.e., BPTT, truncated BPTT (TBPTT), RTRL, or UORO, yielded the best performance for a given model. For higher order networks, UORO performed better, and we used it to optimize all RNNs of this type in this study (for all first order RNNs, we found BPTT worked best and used that to train all RNNs of this type in our experiments; in the appendix, we offer a comparison of the various weight update rules when training an NSPDA). Gradients were hard clipped. Parameters were updated using stochastic gradient descent (SGD), which made use of the stochastic learning rate annealing scheme proposed in [ororbia2019iterdecode]. All models were trained for a fixed maximum number of epochs (or until convergence was reached, which was marked as 100% training accuracy). Experiments for each model were repeated several times. All of our models used our proposed rule encoding scheme and all of the RNNs were trained using our proposed two-stage incremental learning procedure. In Table 4, to demonstrate the value of our proposed two-stage incremental training procedure (2IL), we compare an NSPDA trained without any incremental learning, one with ours, and one with the incremental learning approach (IL) proposed in [das1993using], and find that our approach yields the best results across all grammars. All higher order RNNs made use of our proposed adaptive noise regularizer, though in Table 4, we examine how the NSPDA performs with and without the proposed regularizer. With respect to the hints used, for all tables presented in the main paper, whenever hint usage is indicated, we mean Hint #2 (which worked the best empirically). In the appendix, we provide a detailed breakdown and ablation for all of the models investigated in this paper. Specifically, we present results for models that were trained with and without our regularizer as well as under various hint insertion conditions (no hints, Hint #1, and Hint #2). Baseline Algorithms: In order to provide the proper context to demonstrate the effectiveness of our proposed NSPDA, we conduct a thorough comparison of our model against as many baseline RNN models as possible. These baselines include a plethora of first order RNNs, such as variations of the stack-RNN [joulin2015inferring] (all metaparameters set according to the original source), including the two stack variants as well as the linked-list model (using the same model labels as the original paper), the Long Short Term Memory RNN [hochreiter1997long] with peepholes (LSTMp) and without (LSTM), the Gated Recurrent Unit (GRU) RNN [chung2014empirical], and a simple Elman RNN. We also compared to gated first order RNNs with multiplicative units, but due to space constraints, we report these results in the appendix. We furthermore compare against second order RNNs with and without regularization, as well as the classical NNPDA with and without regularization. All baseline RNNs had a single layer of neurons, and individual hyperparameters for each were optimized based on validation set performance.
Results and Discussion
To the best of our knowledge, we are the first to conduct a comparison across such a wide variety of RNN models of first, second, and third order, with and without external (stack-based) memory. For simple algorithmic patterns (non-Dyck(2) CFGs), first order RNNs like the LSTM and GRU perform reasonably well, primarily because they utilize dynamic counting [lstmcfg, sennhauser2018evaluating], yet they do not learn any state transitions. This is evidenced when considering their performance on the complex Dyck(2) CFG, where the majority of RNNs exhibit great difficulty in generalizing to longer sequences. These results corroborate those of prior work, specifically those demonstrating that the LSTM essentially performs a form of dynamic counting, making it ill-suited to recognizing complex grammars [Lstmdynamiccounting]. As pointed out by [Lstmdynamiccounting], there is a strong need for neural architectures with external memory, i.e., a stack, to solve complex CFGs; in this study, we furthermore argue that prior knowledge is needed as well. This makes sense given that it is known that prior information often leads to greatly improved reasoning and better generalization [manning2019nsm]. The stack and list RNNs do make use of (continuous) external memory (in fact, multiple stacks/lists), but, theoretically, a single stack should be sufficient to recognize strings of arbitrary length from a context-free grammar, while a 2-stack PDA is as powerful as a Turing machine [hopcroft2pda]. However, quite surprisingly, a stack-RNN with even 10 stacks has difficulty in generalizing to a complex grammar. This lines up with the theory – [hopcroft2pda] has proven that adding any more than two stacks to a PDA does not provide any further computational advantage. Finally, it is impressive to see that higher order RNNs coupled with external memory, particularly with a discrete stack structure (as opposed to a continuous stack like that of the stack-RNN), perform so well across all CFGs.
It is important to note that even the way our state-based RNN operates is markedly different from the way those of the past did – the NSPDA works as a next-step prediction model, which allows us to use the powerful iterative refinement procedure as a way to aggressively error-correct its states when predicting string validity (at least during training time). Table 4 shows that our NSPDA model generalizes very well when trained on shorter sequences but tested on far longer ones. Finally, our results demonstrate the value of rule insertion, which, as we see empirically, in some cases improved convergence speed by a wide margin.
Conclusions
In this work, we proposed the neural state pushdown automaton (NSPDA) and its learning process, which utilizes an iterative refinement-based loss function, a two-stage incremental training procedure, an adaptive noise regularization scheme (which works with any higher order network), and a method for stably encoding rules into the model itself. Our experimental results, which focused on context-free grammars (CFGs), demonstrate that prior knowledge is essential to learning memory-augmented models that recognize complex CFGs well. Notably, we have empirically demonstrated the expressivity and flexibility of a higher order temporal neural model that learns how to manipulate an external discrete stack. While our proposed neural model works with a discrete stack, its underlying framework could be extended to manipulate other kinds of data structures, a subject of future work. When trained on various CFGs, the state-based neural models we optimize converge faster and are more expressive than even powerful classical models such as the neural network pushdown automaton. Furthermore, we have shown that modern-day, popular recurrent network structures (all of which are first order) struggle greatly to recognize complex grammars. These discovered limitations of first order RNNs indicate that ANN research should consider the exploration of more expressive, memory-augmented models that offer ways to better integrate prior knowledge.
References
Appendix
Additional Results
In Table 7, we report an expansion of the model performance table that appears in the main paper. In it, we report the performance of three modern gated RNNs with multiplicative gating units, i.e., the MI-RNN, MI-LSTM, and MI-GRU. Interestingly enough, one could consider the multiplicative units to be a crude approximation of second order state neurons. Table 7 also shows results for stably programming the weights of the NSPDA, which, in effect, demonstrates that a programmed NSPDA (without learning) is equivalent to the PDA of a complex grammar. In the other table (Table 7), we highlight how various learning algorithms affect the generalization ability of higher order recurrent networks. Here, we compare backpropagation through time (BPTT) to online learning algorithms such as real-time recurrent learning (RTRL) and unbiased online recurrent optimization (UORO). We describe these procedures in further detail in the next section. Notably, in our experiments, we observed that UORO boosts performance for higher order recurrent networks while being faster than RTRL, the original algorithm-of-choice when training higher order, state-based models. Furthermore, we remark that truncated BPTT (TBPTT), for some CFGs, can actually slightly improve model performance over BPTT (but in others, such as the palindrome CFG, leads to worse generalization).
[Tables referenced above; most numeric entries were lost in extraction. The recoverable structure is: a model table with rows for the 2nd-order NSPDA and 3rd-order NSPDA; a learning-algorithm table with rows for BPTT, TBPTT, RTRL, and UORO; and an RNN-type table with rows for RNN, LSTM, LSTM-p, GRU, Stack RNN 40+10, Stack RNN 40+10 + rounding, list-RNN 40+5, MI-RNN, MI-LSTM, MI-GRU, 2nd-order RNN, 2nd-order RNN + reg (ours), NNPDA, NNPDA + reg (ours), NSPDA M1 (ours), and NSPDA M2 (ours), whose surviving row of values reads 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.01, 0.88.]
On Training Algorithms
For all of the RNNs we study, we compared their (validation) performance when using various online and offline learning algorithms. As mentioned in the last section, we found that UORO worked best for the NSPDA, which is advantageous since UORO is considerably faster than RTRL (particularly in terms of computational complexity) and does not require model unfolding like the popular and standard BPTT/TBPTT algorithms do. These results, again, are summarized in Table 7. Below we briefly describe the non-standard approaches to learning RNNs, specifically RTRL and UORO. Notably, we are the first to implement and adapt UORO for calculating the weight updates of higher-order networks.
Real-Time Recurrent Learning
Real-time recurrent learning (RTRL) is a classical online learning procedure for training RNNs [williams1989rtrl]. The aim is to optimize the parameters $\theta$ of a state-based model in order to minimize a total (sequence) loss. The state model is abstracted as the following function:
(10) $s_{t+1} = F_{\text{state}}(x_{t+1}, s_t, \theta), \qquad o_{t+1} = F_{\text{out}}(x_{t+1}, s_t, \theta)$
RTRL computes the derivative of the model’s states and outputs with respect to the synaptic weights during the model’s forward computation, as the data points in the sequence are processed iteratively, i.e., without any unfolding as in BPTT. When the task is next-step prediction (predict $x_{t+1}$ given the history $x_1, \dots, x_t$), the loss to optimize, using RTRL, is defined as follows:
(11) $\mathcal{L}_T = \sum_{t=1}^{T} \ell_t(o_t, x_{t+1})$
Once we differentiate Equation 10 with respect to $\theta$, we obtain:
(12) $\frac{\partial s_{t+1}}{\partial \theta} = \frac{\partial F_{\text{state}}}{\partial \theta}(x_{t+1}, s_t, \theta) + \frac{\partial F_{\text{state}}}{\partial s}(x_{t+1}, s_t, \theta)\, \frac{\partial s_t}{\partial \theta}$
At each time $t$, we compute $\partial s_{t+1}/\partial \theta$ based on $\partial s_t/\partial \theta$. These values are then used to directly compute $\partial \mathcal{L}/\partial \theta$. The above is, in short, how RTRL calculates its gradients without resorting to backward transfer or computation-graph unfolding (as in reverse-mode differentiation). Since $\partial s_t/\partial \theta$ has shape $|s| \times |\theta|$, for standard RNNs with $n$ hidden units this calculation scales as $O(n^4)$ in time [williams1995gradient]. This high complexity makes RTRL highly impractical for training very wide and very deep recurrent models. However, in the case of a third-order model like the NSPDA (or an NNPDA), the number of states needed for learning a target grammar is generally far smaller than that required of second- or first-order models (as we mentioned in the main paper). This means that a procedure such as RTRL is still applicable and useful, at least for training RNNs to recognize context-free grammars (of low input dimensionality).
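To make the recursion in Equation 12 concrete, the following is a minimal numpy sketch of RTRL for a plain first-order tanh RNN (not the NSPDA itself); all dimensions, data, and variable names are illustrative placeholders. The forward Jacobian $\partial s / \partial W$ is carried along with the state, so no unfolding over time is ever performed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 4, 3, 6                   # hidden units, input size, sequence length
W = rng.normal(0.0, 0.5, (n, n))    # recurrent weights (the parameters we train here)
U = rng.normal(0.0, 0.5, (n, m))    # input weights (held fixed for brevity)

xs = rng.normal(size=(T, m))        # toy input sequence
ys = rng.normal(size=(T, n))        # toy per-step targets

s = np.zeros(n)
dsdW = np.zeros((n, n, n))          # dsdW[i, j, k] = d s_i / d W_jk, carried forward
grad = np.zeros_like(W)             # accumulates the exact gradient of the total loss

for x, y in zip(xs, ys):
    s_next = np.tanh(W @ s + U @ x)
    D = 1.0 - s_next ** 2           # tanh'(a), elementwise

    # Explicit term of Eq. 12: dF_i/dW_jk = D_i * delta_ij * s_k (uses the *old* state s)
    expl = np.zeros((n, n, n))
    idx = np.arange(n)
    expl[idx, idx, :] = D[:, None] * s[None, :]

    # Propagated term: (dF/ds) applied to the carried Jacobian, with (dF/ds)_il = D_i W_il
    dsdW = expl + np.einsum('il,ljk->ijk', D[:, None] * W, dsdW)

    # Chain rule for the per-step squared loss l_t = 0.5 * ||s_t - y_t||^2
    grad += np.einsum('i,ijk->jk', s_next - y, dsdW)
    s = s_next
```

Because the full Jacobian (an $n \times n \times n$ tensor here) is updated at every step, each update costs $O(n^4)$, which is exactly where RTRL's high time complexity comes from.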
Unbiased Online Recurrent Optimization
Unbiased online recurrent optimization (UORO) [tallec2017uoro] uses a rank-one trick to approximate the operations needed to make RTRL’s gradient computation tractable. This trick reduces the overall complexity of RTRL at the price of increasing the variance of its gradient estimates. When designing an optimizer like UORO, we start from the idea that we can form a stochastic matrix $\tilde{G}_t$ such that $\mathbb{E}[\tilde{G}_t] = \partial s_t / \partial \theta$, i.e., an unbiased estimate of $\partial s_t / \partial \theta$. Since Equations 11 and 12 are affine in $\partial s_t / \partial \theta$, the “unbiasedness” (of the gradient estimates) is preserved due to the linearity of expectation. Next, we compute the value of $\tilde{G}_t$ and plug it into Equations 11 and 12 to calculate estimates of $\partial \mathcal{L} / \partial \theta$ and $\partial s_{t+1} / \partial \theta$. In a rank-one, unbiased approximation, at time step $t$, $\tilde{G}_t = \tilde{v}_t \tilde{w}_t^{\top}$. To calculate $\tilde{G}_{t+1}$ at $t+1$, we plug $\tilde{G}_t$ into Equation 12. Nonetheless, mathematically, the above is still not yet a rank-one approximation of RTRL. In order to finally obtain a proper rank-one approximation, one must use an additional, efficient approximation technique, proposed in [ollivier2015training], to rewrite the above equation as:
(13) $\tilde{G}_{t+1} = \left( \rho_0 \frac{\partial F_{\text{state}}}{\partial s} \tilde{v}_t + \rho_1 \nu \right) \left( \frac{\tilde{w}_t^{\top}}{\rho_0} + \frac{\nu^{\top}}{\rho_1} \frac{\partial F_{\text{state}}}{\partial \theta} \right)$
Note that $\nu$ is a vector of independent random signs and $\rho_0$, $\rho_1$ are positive numbers. Thus, the rank-one trick can be applied for any $\nu$. In UORO, $\rho_0$ and $\rho_1$ are factors meant to control the variance of the estimator’s computed approximate derivatives. In practice, we define $\rho_0$ as:
(14) $\rho_0 = \sqrt{ \frac{\|\tilde{w}_t\|}{\left\| \frac{\partial F_{\text{state}}}{\partial s} \tilde{v}_t \right\|} }$
and $\rho_1$ is defined to be:
(15) $\rho_1 = \sqrt{ \frac{\left\| \nu^{\top} \frac{\partial F_{\text{state}}}{\partial \theta} \right\|}{\|\nu\|} }$
Initially, $\tilde{v}_0 = 0$ and $\tilde{w}_0 = 0$, which yields unbiased estimates at time $t = 0$. Given the construction of the UORO procedure, by induction, all subsequent estimates can be shown to be unbiased as well.
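As a companion sketch, here is the rank-one UORO update for the same toy first-order tanh RNN used above (again, not the NSPDA itself), with the parameters taken to be $\theta = \mathrm{vec}(W)$. The small `eps` guards in the variance factors are an implementation detail we add to avoid division by zero at the first step; they are not part of the formulation in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 4, 3, 50
W = rng.normal(0.0, 0.5, (n, n))     # theta = vec(W); U is held fixed for brevity
U = rng.normal(0.0, 0.5, (n, m))
xs = rng.normal(size=(T, m))
ys = rng.normal(size=(T, n))

s = np.zeros(n)
v = np.zeros(n)                       # tilde-v: rank-one factor over states
w = np.zeros(n * n)                   # tilde-w: rank-one factor over vec(W)
grad_est = np.zeros(n * n)            # accumulated stochastic gradient estimate
eps = 1e-7

for x, y in zip(xs, ys):
    s_next = np.tanh(W @ s + U @ x)
    D = 1.0 - s_next ** 2

    Js = D[:, None] * W               # dF_state/ds
    # Explicit dF_state/dvec(W): row i is nonzero only in block i, equal to D_i * s
    dFdW = np.zeros((n, n * n))
    for i in range(n):
        dFdW[i, i * n:(i + 1) * n] = D[i] * s

    nu = rng.choice([-1.0, 1.0], size=n)   # independent random signs
    Jv = Js @ v
    nJ = nu @ dFdW
    rho0 = np.sqrt((np.linalg.norm(w) + eps) / (np.linalg.norm(Jv) + eps))
    rho1 = np.sqrt((np.linalg.norm(nJ) + eps) / (np.linalg.norm(nu) + eps))

    v = rho0 * Jv + rho1 * nu              # first factor of the rank-one update
    w = w / rho0 + nJ / rho1               # second factor of the rank-one update

    # Since G_t ~ v w^T approximates ds/dtheta, dl/dtheta ~ ((dl/ds) . v) * w
    grad_est += ((s_next - y) @ v) * w
    s = s_next
```

Only the two rank-one factors (sizes $n$ and $n^2$ here) are stored and updated, which is what brings the per-step cost down from RTRL's full-Jacobian propagation, at the price of variance injected by the random sign vector $\nu$.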