Long Distance Relationships without Time Travel: Boosting the Performance of a Sparse Predictive Autoencoder in Sequence Modeling

by Jeremy Gordon, et al.
Project AGI
berkeley college
Numenta Inc

In sequence learning tasks such as language modelling, Recurrent Neural Networks must learn relationships between input features separated by time. State of the art models such as LSTM and Transformer are trained by backpropagation of losses into prior hidden states and inputs held in memory. This allows gradients to flow from present to past and effectively learn with perfect hindsight, but at a significant memory cost. In this paper we show that it is possible to train high performance recurrent networks using information that is local in time, and thereby achieve a significantly reduced memory footprint. We describe a predictive autoencoder called bRSM featuring recurrent connections, sparse activations, and a boosting rule for improved cell utilization. The architecture demonstrates near optimal performance on a non-deterministic (stochastic) partially-observable sequence learning task consisting of high-Markov-order sequences of MNIST digits. We find that this model learns these sequences faster and more completely than an LSTM, and offer several possible explanations why the LSTM architecture might struggle with the partially observable sequence structure in this task. We also apply our model to a next word prediction task on the Penn Treebank (PTB) dataset. We show that a 'flattened' RSM network, when paired with a modern semantic word embedding and the addition of boosting, achieves 103.5 PPL (a 20-point improvement over the best N-gram models), beating ordinary RNNs trained with BPTT and approaching the scores of early LSTM implementations. This work provides encouraging evidence that strong results on challenging tasks such as language modelling may be possible using less memory intensive, biologically-plausible training regimes.




1. Introduction

In the sequence learning domain, the challenge of modeling relationships between related elements separated by long temporal distances is well known. Language modeling, the task of next character or next word prediction, is an extensively studied paradigm that exhibits the need to capture such long-distance relationships that are inherent to natural language. Historically, a variety of architectures have achieved excellent language modelling performance. Although larger datasets and increased memory capacity have also improved results, architectural changes have been associated with more significant improvements on older benchmarks.

N-gram models are an intuitive baseline model and were developed early in this history. N-gram models learn a distribution over the corpus vocabulary conditioned on the prior tokens, e.g. a tri-gram ($N = 3$) model makes predictions based on the distribution:

$$P(w_t \mid w_{t-2}, w_{t-1})$$

Among N-gram models, smoothed 5-gram models achieve minimum perplexity on the Penn Treebank dataset (marcus1994penn), a result that illustrates constraints on the value of increasingly long temporal context.
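To make the N-gram idea concrete, the following is a minimal sketch of an unsmoothed tri-gram model estimated by counting. It is only an illustration: the toy corpus and the function names `train_trigram` and `trigram_prob` are hypothetical, and the smoothed models discussed above (e.g. Kneser-Ney) additionally redistribute probability mass to unseen continuations, which this sketch does not do.

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count how often each token follows every pair of preceding tokens."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def trigram_prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); 0.0 for unseen contexts."""
    following = counts.get(tuple(context))
    if not following:
        return 0.0
    return following[word] / sum(following.values())

# Toy corpus (hypothetical), whitespace tokenized.
corpus = "the cat sat on the mat and the cat ran".split()
model = train_trigram(corpus)
p_sat = trigram_prob(model, ("the", "cat"), "sat")   # "the cat" is followed by "sat" and "ran"
p_mat = trigram_prob(model, ("on", "the"), "mat")    # "on the" is only followed by "mat"
```

Because the estimate is a ratio of raw counts, any continuation absent from the training text receives probability zero, which is the problem smoothing addresses.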

More recent approaches have demonstrated the success of neural models such as Recurrent Neural Networks applied to language modeling. In 2011, Mikolov et al. presented a review of language models on the Penn Treebank (PTB) corpus showing that recurrent neural models at that time outperformed all other architectures (mikolovExtensionsRecurrentNeural2011).

Ordinary RNNs are known to suffer from the vanishing gradient problem, in which partial derivatives used to backpropagate error signals across many layers approach zero. Hochreiter et al. introduced a novel multi-gate architecture called Long Short-Term Memory (LSTM) as a potential solution (hochreiterLongShortTermMemory1997). Models featuring LSTM have achieved state of the art results in language modeling, demonstrating their ability to robustly learn long-range causal structure in sequential input.

Though RNNs appear to be a natural fit for language modeling due to the inherently sequential nature of the task, feed-forward networks utilizing novel convolutional strategies have also been competitive in recent years. WaveNet is a deep autoregressive model using dilated causal convolutions in order to achieve long temporal range receptive fields (oordWaveNetGenerativeModel2016). A recent review compared the wider family of temporal convolutional networks (TCN), of which WaveNet is a member, with recurrent architectures such as LSTM and GRU, finding that TCNs surpassed traditional recurrent models on a wide range of sequence learning tasks (bai2018empirical).

Extending the concept of replacing recurrence with autoregressive convolution, Vaswani et al. added attentional filtering to their Transformer network (vaswaniAttentionAllYou2017). The Transformer uses a deep encoder and decoder each composed of multi-headed attention and feed-forward layers. While the dilated convolutions of WaveNet allow it to learn relationships across longer temporal windows, attention allows the network to learn which parts of the input, as well as intermediate hidden states, are most useful for the present output.

Current state-of-the-art results are achieved by GPT-2, a 1.5 billion parameter Transformer (gong2018frage), which obtains 35.7 PPL on the PTB task (see Table 2). The previous state of the art was an LSTM with the addition of mutual gating of the current input and the previous output reporting 44.8 PPL (melisMogrifierLSTM2019).

Common to all the neural approaches reviewed here is the use of some form of deep backpropagation, either by unrolling through time (see section 3.1.2 for more detail) or through a finite window of recent inputs (WaveNet, Transformer). Since most of these models also benefit from deep multilayer architectures, backpropagation must flow across layers, and over time steps or input positions, resulting in very large computational graphs across which gradients must flow. By contrast, all other methods in the literature (such as traditional feed-forward ANNs and N-gram models) are not known to produce such good performance (i.e. none have achieved a perplexity below 100 PPL on PTB).

1.1. Motivation

Despite the impressive successes of the recurrent, autoregressive, and attention-based approaches reviewed above, the question remains whether similar performance can be achieved by models that do not depend on deep backpropagation. Models that avoid backpropagation across many layers or time steps are interesting for two reasons. First, computational efficiency is becoming an increasingly important consideration in deep learning, both due to the pragmatics of designing algorithms that must be trained in resource constrained environments such as edge computing, and as researchers begin to acknowledge the significant environmental footprint of the hardware that drives machine learning at scale. Second, to the extent that computational models may help us better understand the dynamics and perhaps mechanisms underlying our own cognitive abilities, architectures constrained by similar principles as those that govern the brain may offer more credible insights. Specifically, we are interested in models that lie within the biologically plausible criteria outlined by Rawlinson et al.: 1) local and immediate credit assignment, 2) no synaptic memory, and 3) no time-traveling synapses (rawlinsonLearningDistantCause2019). Our goal, then, is to explore and push the performance bounds of sequence learning models leveraging dynamics consistent with these bio-plausibility constraints.

2. Method

2.1. Original RSM Model

We began with the Recurrent Sparse Memory (RSM) architecture proposed by Rawlinson et al. (rawlinsonLearningDistantCause2019). RSM is a predictive recurrent autoencoder that receives sequential inputs (e.g. images or word tokens), and is trained to generate a prediction of the next item in the sequence (see schematic in Figure 1). Like Hierarchical Temporal Memory (Hawkins2016), the RSM memory is organized into $m$ groups (or mini-columns), each composed of $n$ cells. Cells within each group share a single set of weights from feed forward input, such that the feed-forward contribution $z^A$ is an $m$-dimensional vector computed as:

$$z^A = w^A x^A(t)$$

Each cell receives dense recurrent connections from all cells at the previous time step, and the recurrent contribution $z^B$ is an $m \times n$ matrix computed as:

$$z^B = w^B x^B(t)$$

$\sigma$ is an $m \times n$ matrix holding the weighted sum combining feed-forward and recurrent input to each cell $j$ in group $i$, and is given by:

$$\sigma_{ij} = z^A_i + z^B_{ij}$$

A top-k sparsity is used as per Makhzani and Frey (makhzaniKSparseAutoencoders2013). RSM implements this sparsity by computing two sparse binary masks, $M^\pi$ and $M^\lambda$, which indicate the most active cell (one per group), and most active group ($k$ per layer), respectively. An inhibition trace was used in the original model to encourage efficient resource utilization during the sparsening step, but is replaced with boosting in this work (see section 3.2.4 for discussion). The final output $y$ is calculated by applying a $\tanh$ nonlinearity to the sparsened activity:

$$y_{ij} = \tanh(\sigma_{ij} \cdot M^\pi_{ij} \cdot M^\lambda_i)$$

A memory trace $\psi$ is maintained with an exponential decay parameterized by $\epsilon$, such that $\psi(t) = \max(\psi(t-1) \cdot \epsilon,\ y(t))$. From $\psi$, the recurrent input at the next time step is calculated by normalizing with constant $\alpha$, chosen such that the activity in $x^B$ sums to 1:

$$x^B(t+1) = \alpha \cdot \psi(t)$$

Like other predictive autoencoders, RSM is trained to generate the next input by “decoding” from the max of each group’s sparse activity:

$$y^\lambda_i = \max(y_{i1}, \ldots, y_{in})$$

The prediction $\hat{x}^A$ is then computed as $\hat{x}^A = w^D y^\lambda$, where $w^D$ is a weight matrix with dimension equal to the transpose of $w^A$.

Finally, to read out labels or word distributions from the network, RSM uses a simple classifier network composed of a 2-layer fully connected ANN using leaky ReLU nonlinearities. The classifier network is trained concurrently but independently to the RSM network (not sharing gradients), and takes the RSM’s hidden state as input.
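The sparsening and memory-trace steps of the RSM forward pass described above can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the use of `tanh` as the nonlinearity and of an element-wise maximum to merge the decayed trace with current activity are assumptions, and weights, prediction and training are omitted entirely.

```python
import math

def sparsen(sigma, k):
    """Keep the most active cell in each of the k most active groups.

    sigma: list of m groups, each a list of n pre-activations.
    Returns a same-shaped list holding tanh(activation) at winners, 0.0 elsewhere.
    """
    # Winning cell in every group (one per group).
    best_cell = [max(range(len(group)), key=group.__getitem__) for group in sigma]
    best_value = [group[j] for group, j in zip(sigma, best_cell)]
    # The k groups whose winning cell is most active.
    top_groups = sorted(range(len(sigma)), key=best_value.__getitem__, reverse=True)[:k]
    y = [[0.0] * len(group) for group in sigma]
    for i in top_groups:
        y[i][best_cell[i]] = math.tanh(best_value[i])  # tanh is an assumption
    return y

def update_trace(trace, y, decay):
    """Decay the previous trace, keeping current activity wherever it is larger (assumed rule)."""
    return [[max(p * decay, v) for p, v in zip(p_row, y_row)]
            for p_row, y_row in zip(trace, y)]

def normalize(trace):
    """Scale the trace so it sums to 1 (returned unchanged if it is all zeros)."""
    total = sum(sum(row) for row in trace)
    if total == 0:
        return [row[:] for row in trace]
    return [[v / total for v in row] for row in trace]

sigma = [[0.2, 0.9], [0.1, 0.05], [0.7, 0.3]]        # m=3 groups, n=2 cells (toy values)
y = sparsen(sigma, k=2)                               # groups 0 and 2 win
trace = update_trace([[0.0, 0.0] for _ in sigma], y, decay=0.5)
recurrent_input = normalize(trace)                    # input to the next time step
```

In this toy case only two of the six cells are non-zero after sparsening, and the normalized trace that feeds the next time step sums to 1.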

Figure 1. Schematic of original RSM architecture, shown processing inputs from the stochastic sequential MNIST task (see section 3.1.1). Note that, as per original paper, the RSM network is trained only on the MSE loss: no gradients pass from the classifier network to the RSM, in order to keep credit assignment local.

2.2. Boosted RSM (bRSM)

We developed a variant of RSM that (among other architectural changes) replaces cell-inhibition with a cell activity ‘boosting’ scheme. For brevity, we refer to the modified algorithm as bRSM.

In an attempt to encourage better generalization, we explored a number of adjustments to the original model described in section 2.1. Additional model details and hyper-parameter settings for reported experiments are included in Appendix B, and the full code for the bRSM model and all experiments is publicly accessible at https://github.com/numenta/nupic.research/tree/master/projects/rsm.

We find that bRSM significantly improves performance on the language modeling task. We review each of our adjustments in the section below.

2.2.1. Flattened network

A fundamental dynamic of HTM-like architectures is that each mini-column learns some spatial structure in the input, and each cell within a mini-column learns a transition from a prior representation (Hawkins-et-al-2016-Book). A potential limitation of this architecture is that, while representations of the input via feed forward connections benefit from spatial semantics (similar representations for similar inputs), the predictive representations developed through recurrent connections lack this property: similar sequence items in different sequential contexts are highly orthogonal (rawlinsonLearningDistantCause2019).

To illustrate a potential inefficiency of this orthogonality, consider a network trained on sequences where some set of similar inputs $A$ predict both $B$ and $C$ at the next time step, prompting cells in the representations of both $B$ and $C$ to activate when exposed to inputs in $A$. These cells may contain nearly identical weights linked to a sparse representation generalizing across patterns in $A$. Such a redundancy might be avoided if some subset of cells having learned the transition from $A$ could be shared by both $B$ and $C$. This line of reasoning motivated experiments in which each group was set to have only one cell, thus removing shared feed-forward weights from the model, and enabling decoding from the full hidden state rather than a group-max bottleneck. The flexibility of allowing predictive cells to participate in multiple input representations may explain the improved performance of this flattened architecture in the language modeling task, though we suspect the grouped model may be beneficial on tasks with higher-order compositionality in space or time.

2.2.2. Boosting

Sparse networks may learn locally optimal configurations in which only a small fraction of a layer’s representational capacity is used. When this occurs, many units remain idle resulting in inefficient resource usage and limited performance. The original RSM model employs an inhibition strategy whereby a separate exponentially decaying trace is used to discourage recently active cells from re-activating.

An alternative strategy known as boosting has been proposed to achieve the same goal but exhibits different properties from inhibition. We used a boosted k-Winners algorithm suggested by Cui et al. (ahmadHowCanWe). This algorithm tracks the duty cycle $d_{ij}$ of each cell, which captures the probability of recent activation (sparsened via top-k masking):

$$d_{ij}(t) = (1 - \gamma)\, d_{ij}(t-1) + \gamma \cdot [\text{cell } ij \text{ is active at } t]$$

A per-cell boost term $b_{ij}$ is then computed based on this duty cycle, increasing the probability that less recently active cells fire, and inhibiting those more recently active:

$$b_{ij}(t) = e^{\beta\,(\hat{a} - d_{ij}(t))}$$

where $\hat{a}$ is the expected layer sparseness defined as the number of winners divided by the layer size, $k / (mn)$, $\gamma$ is the duty cycle update rate, and $\beta$ is the boost strength hyper-parameter which can be optionally configured to remain fixed or decay during training (see Appendix B). The per-cell weighted sum is then redefined as:

$$\sigma_{ij} = b_{ij} \cdot (z^A_i + z^B_{ij})$$
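As an illustration of the boosting scheme just described, the sketch below implements one step of a boosted k-winners selection over a flat list of cells. It is a sketch under stated assumptions rather than the authors' code: the exponential boost form, the multiplicative application of the boost to (non-negative) activations, the moving-average update rate `rate`, and all function and parameter names are assumptions chosen for illustration.

```python
import math

def boosted_k_winners(activations, duty_cycles, k, boost_strength, rate=0.01):
    """Select k winners after boosting, then update each cell's duty cycle.

    activations: non-negative pre-activations, one per cell.
    duty_cycles: running estimate of how often each cell has recently won.
    Returns (sorted winner indices, updated duty cycles).
    """
    n = len(activations)
    target = k / n  # expected layer sparseness: winners / layer size
    # Cells that won less often than the target get a boost > 1, others < 1.
    boosts = [math.exp(boost_strength * (target - d)) for d in duty_cycles]
    boosted = [a * b for a, b in zip(activations, boosts)]
    winners = set(sorted(range(n), key=boosted.__getitem__, reverse=True)[:k])
    # Exponential moving average of the binary "won this step" signal.
    new_duty = [(1 - rate) * d + rate * (1.0 if i in winners else 0.0)
                for i, d in enumerate(duty_cycles)]
    return sorted(winners), new_duty

activations = [1.0, 0.9, 0.8, 0.1]
duty_cycles = [0.9, 0.1, 0.1, 0.1]   # cell 0 has been winning far too often
plain, _ = boosted_k_winners(activations, duty_cycles, k=2, boost_strength=0.0)
boosted, new_duty = boosted_k_winners(activations, duty_cycles, k=2, boost_strength=2.0)
```

With a boost strength of zero the two most active cells win regardless of history; with a positive boost strength the over-used cell 0 is suppressed and cells 1 and 2 win instead.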

2.2.3. Semantic embedding

Rawlinson et al. tested RSM with a synthetic binary word embedding (see Appendix C.1) with no semantic properties in order to isolate the performance of the architecture from that of the embedding. Since RSM was not specifically designed to learn high quality language embeddings, we chose to use a modern embedding leveraging sub-word semantics. We pretrained a 100-dimensional FastText (bojanowski2016enriching) embedding on the training corpus, and used this as input for all experiments (see Appendix C.2 for generation details).

2.2.4. Trainable decay

In language modeling, some tokens may provide useful context to word prediction many tokens in the future (e.g. rare words unique to a particular topic), while others may be necessary for next word prediction (e.g. tokens composing multi-word proper nouns or phrases, or common words indicating syntactic structure). In the original RSM model, the rate of decay of the recurrent input is parameterized by a single scalar value $\epsilon$, which is multiplied into the prior memory state on each time step. While each cell participates in multiple input representations, it may be possible to improve generalization performance by learning a unique exponential decay scalar for each cell in the memory. We implemented trainable decay as a single tensor $\epsilon'$ of dimension $m \times n$ (equivalent to just $m$ in the flattened architecture), which we pass through a Sigmoid before applying to the memory $\psi$ in the decay step:

$$\psi(t) = \max\big(\psi(t-1) \cdot \mathrm{sigmoid}(\epsilon'),\ y(t)\big)$$

We found that applying a ceiling close to 1 to the $\mathrm{sigmoid}(\epsilon')$ term helped to avoid volatility likely caused by the memory state retaining too much history.

Moving to a trainable decay parameter requires only a nominal increase in parameters, and provides a consistent but small improvement (~5 PPL on next word prediction).
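The per-cell decay step can be illustrated with the short sketch below for the flattened (one cell per group) case. It is only a sketch: the ceiling value of 0.99, the element-wise maximum used to merge the decayed memory with current activity, and the names `decay_memory` and `decay_params` are assumptions, and in a real model `decay_params` would be updated by gradient descent rather than fixed.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decay_memory(memory, activity, decay_params, ceiling=0.99):
    """Apply a per-cell decay rate to the memory, then merge in current activity.

    decay_params: one unconstrained (trainable) value per cell; the sigmoid
    maps it to a rate in (0, 1), which is then capped at `ceiling`.
    """
    updated = []
    for previous, current, param in zip(memory, activity, decay_params):
        rate = min(sigmoid(param), ceiling)           # cap close to 1 (assumed value)
        updated.append(max(previous * rate, current))  # assumed merge rule
    return updated

memory = [1.0, 1.0, 0.0]
activity = [0.0, 0.0, 0.5]
decay_params = [0.0, 10.0, 0.0]   # cell 1 has learned to retain history for longer
new_memory = decay_memory(memory, activity, decay_params)
```

Here cell 0 halves its memory on each step, while cell 1 would retain almost everything were it not for the ceiling, which limits its rate to 0.99.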

2.2.5. Functional Partitioning

We found one final addition to be significantly beneficial on the stochastic sequential MNIST task (detailed in section 3.1.1). In this version of the model, the bRSM memory is partitioned into either two or three blocks: one taking feed-forward input only, one taking recurrent input only, and one integrating both input sources via addition. This third section is equivalent to the full memory in the original RSM model. To ensure utilization across all partitions while keeping target sparsity consistent, we applied the top-k nonlinearity to each partition separately, with the number of partition winners $k_p$ proportional to partition cell count $c_p$:

$$k_p = k \cdot \frac{c_p}{\sum_q c_q}$$

The motivation behind functional partitioning was an extension of the logic behind the use of a flattened memory. To the extent that it is useful for some cells to represent transitions from prior input, and others to represent current input, we wondered if an architecture in which these functional roles are enforced would improve performance.

The partitioned model whose ssMNIST results appear in Figure 4 uses a memory with cells allocated as follows: 7% feed-forward, 85% recurrent, and 8% integrated. The resultant model contains fewer parameters since a portion of cells are connected only to the input, which has lower dimensionality than the full memory.
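The per-partition top-k step can be sketched as follows. This is an illustrative sketch rather than the authors' code: the largest-remainder rounding used to make the per-partition winner counts sum exactly to the layer-wide count is an assumption, as are the function names, the 1000-cell layer size and the winner count of 40 used in the example.

```python
def partition_winners(k, cell_counts):
    """Split k winners across partitions in proportion to their cell counts.

    Uses integer arithmetic plus largest-remainder rounding (an assumption)
    so that the per-partition counts always sum to k.
    """
    total = sum(cell_counts)
    base = [k * c // total for c in cell_counts]
    remainders = [k * c % total for c in cell_counts]
    leftover = k - sum(base)
    by_remainder = sorted(range(len(cell_counts)), key=remainders.__getitem__, reverse=True)
    for i in by_remainder[:leftover]:
        base[i] += 1
    return base

def partitioned_top_k(activations, cell_counts, k):
    """Apply top-k separately within each contiguous partition of the layer."""
    winners, start = [], 0
    for count, k_p in zip(cell_counts, partition_winners(k, cell_counts)):
        block = range(start, start + count)
        winners += sorted(block, key=activations.__getitem__, reverse=True)[:k_p]
        start += count
    return sorted(winners)

# 7% feed-forward, 85% recurrent, 8% integrated, for a hypothetical 1000-cell layer.
allocation = partition_winners(40, [70, 850, 80])

# Small example: three partitions of 2, 4 and 2 cells with 4 winners overall.
toy_winners = partitioned_top_k([0.1, 0.9, 0.5, 0.4, 0.3, 0.8, 0.2, 0.7], [2, 4, 2], k=4)
```

Selecting winners within each block guarantees that even a small partition contributes active cells, which a single layer-wide top-k would not ensure.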

This partitioning method did not improve generalization on the language modeling task, hence these results are not reported.

3. Experiments

3.1. Tasks & Datasets

We selected tasks anticipated to be difficult for RNNs and RSM in particular, to enable empirical characterization of its limitations. We tested bRSM on two tasks: a non-deterministic version of the original partially-observable MNIST sequence task (rawlinsonLearningDistantCause2019), as well as next word prediction (language modeling) on the Penn Treebank dataset.

3.1.1. Stochastic Sequential MNIST (ssMNIST)

RSM was initially tested on a partially observable sequence learning task in which the network is exposed to higher-order sequences of randomly chosen MNIST images drawn according to a predetermined list of labels e.g. “0123 0123 0321”. It is then possible that algorithms could learn to ignore the images and simply keep count to make accurate predictions. A potential expansion to this task, then, is to require memorization of repeatable sub-sequences (e.g. the 12 digit example above) presented in a random order. This requires repeatable sub-sequences to be learned, while also learning to ignore sub-sequence order that has no predictive value. The image-observations and transitions are then both partially non-deterministic, and the images must be considered for optimal accuracy.

These randomly ordered sub-sequences can be described by a grammar. The grammar generating process is configured to specify the number of sub-sequences and the length, in digits, of each. Details of the grammar generated are described in Appendix A.

Using a single fixed grammar we can construct an observation generating process that randomly chooses between sub-sequences, but then follows each sub-sequence deterministically, as follows:

  1. Select one of the specified sub-sequences uniformly at random

  2. Select the first digit label in the sub-sequence

  3. Select a random MNIST digit according to the selected label

  4. Move through the sub-sequence, drawing random MNIST digits for each label, until the end is reached

  5. Go to 1
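The observation generating process listed above can be sketched as a Python generator. This is a minimal illustration, not the authors' code: the function name, the placeholder strings standing in for MNIST images, and the two-sub-sequence toy grammar are assumptions made for the example.

```python
import random

def ssmnist_stream(subsequences, images_by_label, num_steps, seed=0):
    """Yield (label, image) pairs following the five-step generating process."""
    rng = random.Random(seed)
    emitted = 0
    while emitted < num_steps:
        subsequence = rng.choice(subsequences)            # step 1: pick a sub-sequence
        for label in subsequence:                         # steps 2 and 4: walk its labels
            if emitted >= num_steps:
                return
            image = rng.choice(images_by_label[label])    # step 3: random image for the label
            yield label, image
            emitted += 1
        # step 5: fall through to the top of the loop and pick again

# Toy grammar of two sub-sequences; strings stand in for MNIST image arrays.
grammar = ["0123", "0321"]
images = {digit: [f"img_{digit}_{i}" for i in range(3)] for digit in "0123"}
sample = list(ssmnist_stream(grammar, images, num_steps=8))
labels = "".join(label for label, _ in sample)
```

Note that nothing in the emitted stream marks where one sub-sequence ends and the next begins, so a learner only ever sees a continuous sequence of images.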

We generated a test grammar composed of 8 sub-sequences of 9 MNIST digits each (dimension specified to minimize confusion, see sample sequence and predicted outputs in Figure 2). This specific “8x9” grammar for which we report results, along with a calculation for the theoretical limit on prediction accuracy, is included in Appendix A.

Figure 2. High-order, partially observable stochastic sequence learning predictions. Rows alternate between actual 9-digit samples from the grammar, and bRSM predictions. Sequences “6-4-1-3-9” and “3-4-1-3-1” (with common sub-sequence “4-1-3” outlined) are predicted correctly.

To ensure that solving the task would require the successful learning of higher order sequences, we confirmed that prediction of at least some of the transitions in the resultant grammar required knowledge of the sequence item two or more steps prior.

Unlike many RNN tasks, there is no flag or special token to indicate sub-sequence boundaries or task reset. Without any priors for the length or existence of sub-sequences, the ssMNIST task is challenging even for humans.

3.1.2. Baseline: tBPTT trained LSTM

We chose to use an LSTM as a ‘baseline’ algorithm to represent the deep-backpropagation approach and compare to bRSM. Modern recurrent neural networks such as LSTMs are trained using backpropagation through time (BPTT), which conceptually unrolls the network’s computational graph across multiple time steps, resulting in a standard multi-layer feed-forward network, and then backpropagates the loss from one or more output layers (or heads) towards the shallower layers representing earlier timesteps.

The LSTM was trained with Adam using a learning rate of . We set the hidden size of the LSTM layer to produce networks roughly consistent with the parameter count of bRSM. Results reported below are for an LSTM with 450 hidden units (2.57M parameters).

We implemented a training regime consistent with Williams and Peng’s improved truncated-BPTT algorithm (williamsEfficientGradientBasedAlgorithm) which is parameterized by two integers determining the flow of gradients through past states of the network. In BPTT($k_1$, $k_2$), $k_1$ specifies the interval at which to inject error from the last $k_1$ outputs, while $k_2$ specifies the length of the history through which gradients should propagate. We set $k_1 = 1$ to match the online “one digit, one prediction” dynamic of the ssMNIST task. After disappointing initial results with large $k_2$ values, we experimented with a range of $k_2$ values to empirically optimize LSTM performance (see schematic in Figure 3).
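To make the two-integer parameterization concrete, the sketch below lists, for each time step at which error is injected, the window of past time steps through which gradients would flow. It only enumerates the schedule and performs no training; the function name and the argument names `interval` and `history` are hypothetical labels for the two integers described above.

```python
def tbptt_windows(num_steps, interval, history):
    """Return (t, window) pairs for truncated BPTT.

    interval: error is injected every `interval` time steps.
    history:  gradients propagate back through at most `history` time steps.
    """
    windows = []
    for t in range(num_steps):
        if (t + 1) % interval == 0:
            start = max(0, t - history + 1)
            windows.append((t, list(range(start, t + 1))))
    return windows

# Online regime: one prediction (and one error injection) per input.
online = tbptt_windows(num_steps=5, interval=1, history=3)
# Sparser regime: inject error every second step.
sparse = tbptt_windows(num_steps=6, interval=2, history=4)
```

In the online regime every step triggers a backward pass over the most recent `history` states, so those states must all be held in memory.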

To confirm correctness of the LSTM baseline algorithm, we verified it is able to solve a simplified (fully observable) version of the task where the same MNIST image is used at each occurrence of a given label. Under these conditions, LSTM achieves the theoretical accuracy limit comparatively quickly, though it displays volatility even after approaching this accuracy ceiling (see Figure 5). This volatility in the fixed-image regime likely illustrates the tendency of these sequence learning models to attempt to ‘learn’ spurious higher order transitions between sub-sequences that are not in fact predictable.

Figure 3. Schematic of truncated backpropagation through time parameterization BPTT($k_1$, $k_2$), with $k_1$=1, $k_2$=6 for simple grammar [{0123}, {0321}]. $x$, $h$ and $y$ represent input, hidden and output respectively.

| Model | Params | Mean Acc | Max Acc |
|---|---|---|---|
| LSTM (cont) | 2.6M | 80.0% ± 9.1 | 81.4% |
| LSTM (mbs=100) | 2.6M | 73.4% ± 18.2 | 82.7% |
| bRSM | 2.5M | 86.4% ± 0.3 | 86.8% |
| bRSM (partitioned) | 1.8M | 88.8% ± 0.1 | 88.9% |

Table 1. ssMNIST results on 8x9 grammar. Accuracy is reported as mean ± one standard deviation, and max over 5 runs to account for observed inter-run variance. Theoretical ceiling on accuracy for this grammar is 88.8%.

Figure 4. LSTM and bRSM performance on ssMNIST. Mean accuracy (line), standard error (shadow) and range (light shadow) across repeated runs. Gray line is theoretical accuracy ceiling for the 8x9 grammar (see Appendix A).

Figure 5. LSTM and bRSM performance on ssMNIST when using a constant image for each digit. The partially observable aspect has been removed, and LSTM successfully solves the sequence learning task. Mean accuracy and standard error shown across repeated runs.

A second option distinct from the tBPTT parameterization was also observed to significantly impact LSTM performance. Maximum digit prediction accuracy was achieved by adjusting the training regime to periodically clear the LSTM’s memory cell state. In Figure 4, mbs indicates the number of time steps (and therefore mini-batches) after which we cleared the LSTM module’s hidden and cell state.

Together, optimization of the backpropagation window length ($k_2$) to small finite values and of the state clearing interval (mbs) provides the LSTM with two sources of an implicit prior on the length of salient temporal context. Intuitively, setting $k_2$ or mbs below our grammar’s sub-sequence length would make it impossible to learn high-order relationships, and too large a value might confound the network by offering far more temporal context than is useful for learning transitions within each sub-sequence. We anticipated and confirmed that maximum accuracy would be achieved when both parameters were tuned to convey a useful prior on context while supplying a sufficient history to robustly learn the higher-order temporal relationships in the data. Results from experiments with varying configurations of tBPTT and state clearing are shown in Figure 4 and appear to support this understanding.

Across the variety of training regimes tested, LSTM with the continuous configuration achieved the best mean accuracy across runs of 80.0% (90.0% of the theoretical limit for this grammar). The highest accuracy LSTM run was observed with mbs=100, reaching 82.7%, but inter-run variance was significantly higher in this configuration. In comparison, the non-partitioned and partitioned variants of bRSM achieved 86.4% and 88.8% respectively, with very little inter-run variance. A summary of results is shown in Table 1.

LSTM did not achieve the maximum achievable prediction accuracy even with the additional context-length clues implicitly provided by the training regime. LSTM showed slower convergence, increased volatility and lower eventual accuracy without these clues. The much better results using a constant image for each digit suggest that the combination of partial observability, sequential uncertainty and unmarked sub-sequence boundaries make this task especially difficult for conventional recurrent models. In contrast, bRSM was able to learn the partially observable sequence relationships without the need to tune hyper-parameters in accordance with the grammar’s true time horizon. Furthermore, as noted by Rawlinson et al., by avoiding BPTT, RSM has an asymptotic memory use of $O(N)$, where $N$ is the number of cells in the hidden layer. This is a significant reduction from deep backpropagation models which require $O(TN)$, where $T$ is the time-horizon, even when both models have the same number of parameters. For the empirically optimal tBPTT parameterization used in this analysis $T = 30$, which implies that 30× more memory is required. Overall, bRSM achieves better sequence learning performance than an ordinary LSTM in this partially observable condition, with less prior knowledge of the task and significantly less memory requirement.
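The memory comparison above amounts to simple arithmetic, sketched below. The layer size of 1000 cells is a hypothetical value chosen for illustration, the time horizon of 30 is taken from the 30-fold figure quoted in the text, and the count ignores weights and constant factors, so it reflects only the asymptotic scaling of stored hidden activations.

```python
def stored_activations(num_cells, time_horizon=1):
    """Hidden activations that must be kept in memory for a weight update."""
    return num_cells * time_horizon

num_cells = 1000                                               # hypothetical layer size
time_local = stored_activations(num_cells)                     # learning rule local in time
unrolled = stored_activations(num_cells, time_horizon=30)      # truncated BPTT over 30 steps
ratio = unrolled / time_local
```

The ratio depends only on the time horizon, which is why the saving holds even when both models have the same number of parameters.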

3.2. Language Modeling

3.2.1. Dataset

Consistent with the original RSM paper, we present language modeling results using the Penn Treebank (PTB) dataset with preprocessing as per Mikolov et al. (mikolov2010recurrent). RSM’s performance on this language modeling task was the weakest result of those originally reported, making it an ideal target to determine if the observed limitations could be overcome. Model evaluation was performed using the test corpus.

3.2.2. Training Regime

We observed that, consistent with previous findings (rawlinsonLearningDistantCause2019), the bRSM model overfits quickly to the PTB training set, as illustrated by increasing volatility and ultimately a quick rise in test loss after 40,000-60,000 mini-batches of training. To address this dynamic, we found it useful to pause training of the core RSM model prior to overfitting, and allow the classifier network to continue training. We noted that final test set perplexity was quite sensitive to the time of pause. For the results shared here, pause epoch is considered an additional hyper-parameter. A custom stopping criterion based on the derivative of validation loss would allow for more flexible experimentation, and is planned for future work.

3.2.3. Results

Towards our goal of exploring the performance bounds of models under our bio-plausibility constraints, we present results from experiments with bRSM on the PTB dataset. The lowest test perplexity (103.5 PPL) was achieved using the first four additions presented in section 2.2 (all but functional partitioning). A 7% word cache was effective, but an ensemble of bRSM and KN5 did not significantly improve test performance. KN5 results are shown to illustrate the performance of statistically defined n-gram models.

Table 2 reports results for the final bRSM model as well as versions of this model with each added feature ablated. bRSM, with and without the word cache, outperforms all early language modeling architectures, including ordinary (non-gated) recurrent neural language models trained with BPTT. While these results are not yet competitive with state-of-the-art deep models such as the Transformer and modern LSTM-based approaches, they demonstrate a significant step forward for resource efficient performance.

| Model | Test PPL | No. of params |
|---|---|---|
| KN5 † | 141.2 | |
| KN5 + cache † | 125.7 | |
| Random Forest LM † | 131.9 | |
| RNN LM (uses tBPTT) † | 124.7 | |
| LSTM | 78.9 | 13M |
| Mogrifier LSTM | 50.1 | 24M |
| GPT-2 | 35.7 | 1500M |
| bRSM + cache | 103.5 | 2.55M |
| Non-semantic embedding | 152.6 | 2.34M |
| Inhibition instead of boosting | 144.0 | 2.55M |
| Non-flattened (m=800, n=3) | 112.8 | 3.36M |
| Without cache | 112.0 | 2.55M |
| Untrained decay rate | 107.3 | 2.55M |

Table 2. Language modeling results. bRSM variants with each of the 4 added features ablated are shown. †: As reported by Mikolov et al. (mikolovEmpiricalEvaluationCombination).

3.2.4. Resource Utilization (Boosting vs Inhibition)

A possible explanation for the difference in performance seen between boosting and inhibition strategies involves the strength and temporal dynamics of each. Boosting integrates a moving average of individual cell activity across hundreds of time steps, promoting the use of idle cells. In contrast, inhibition produces a strong and immediate effect where cells are fully inhibited from firing after a single activation. Both strategies aim to improve resource utilization.

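One way to compare the effect of these strategies is to quantify the informational capacity of the RSM memory using layer entropy ($H$), which is calculated from the duty cycle $d_i$ of each cell $i$ as follows:

$$H = \sum_i \big(-d_i \log_2 d_i - (1 - d_i) \log_2 (1 - d_i)\big)$$

We can compare layer entropy during training and at inference time with the theoretical maximum binary entropy for an RSM layer of $N$ cells, which is a function only of layer sparseness ($\hat{a}$):

$$H_{\max} = N \big(-\hat{a} \log_2 \hat{a} - (1 - \hat{a}) \log_2 (1 - \hat{a})\big)$$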

In Figure 6, we compare the time course of binary entropy for two RSM models differing only in resource utilization strategy. As expected, both strategies have the effect of increasing layer entropy compared to having no strategy to promote the use of idle cells. We note that inhibition exhibits nearly identical entropy dynamics across training and test sets—approximately 425 bits, or 93% of maximum entropy—while the boosted model’s test entropy is reduced during exposure to unseen test sequences.
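The layer entropy and its theoretical maximum discussed above can be computed with a few lines of Python. This sketch assumes that layer entropy is the sum of per-cell binary entropies of the duty cycles, and that the maximum is reached when every cell is active with probability equal to the layer sparseness; the function names and the four-cell example are illustrative only.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable that is active with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def layer_entropy(duty_cycles):
    """Sum of per-cell binary entropies (assumed definition of layer entropy)."""
    return sum(binary_entropy(d) for d in duty_cycles)

def max_layer_entropy(num_cells, num_winners):
    """Entropy when every cell is active with probability num_winners / num_cells."""
    return num_cells * binary_entropy(num_winners / num_cells)

ceiling = max_layer_entropy(num_cells=4, num_winners=2)   # every cell active half the time
balanced = layer_entropy([0.5, 0.5, 0.5, 0.5])            # all cells used equally
degenerate = layer_entropy([1.0, 1.0, 0.0, 0.0])          # the same two cells always win
```

A layer in which the same cells always win carries no information in its choice of winners, whereas evenly used cells reach the ceiling, which is the property both boosting and inhibition aim to promote.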

This observation supports an understanding, based on the traditional bias-variance trade-off, of the relationship between encoding entropy and the generalization performance of sparse recurrent networks. In the high entropy case using inhibition, similar sequences are encoded in highly orthogonal patterns, which may support high capacity memorization. This is helpful when there is an opportunity to learn to interpret these patterns, but confounding when generalizing to unseen sequences, because similar contexts are encoded in dissimilar ways. This is consistent with our observation that inhibition produces worse perplexity and higher entropy on the test corpus.

However, some recent work has questioned the notion that high capacity function classes necessarily result in poor generalization performance (belkin2018reconciling), and so alternative explanations can be considered as well. For example, the strong inhibition of recently active cells may recruit arbitrary non-semantic encodings that struggle to generalize without implicating excessive capacity. In either case, encoding unseen sequences from the test corpus with relatively lower entropy implies that fewer unique encodings are produced. We hypothesize that the network falls back to known encodings of similar contexts, which the classifier network is able to interpret. Consequently, relatively better perplexity is observed from the lower-entropy test-corpus encoding.

Figure 6. Layer entropy comparison of boosting vs inhibition strategy. Maximum possible layer entropy shown by dashed gray line.

4. Conclusion

We presented results from a sparse predictive autoencoder with a slim memory footprint, trained on a time-local error signal. As far as we’re aware, this model demonstrates the best results to date on the PTB language modeling task among models not relying upon the use of memory-intensive deep backpropagation across many layers and/or time steps. Neural language models with better performance all use additional mechanisms to selectively filter and store historical state (e.g. attention and gating in Transformer and LSTM networks); our goal is not to beat them, but to show that learning rules which are local in time and space could be competitive, given further development. This work provides encouraging evidence that strong results on challenging tasks such as language modelling may be possible using less memory intensive, biologically-plausible training regimes.

We also showed that on tasks with particular characteristics—namely weak partial-observability and continual presentation of randomly-ordered sub-sequences without boundary markers—our approach outperformed the LSTM gated memory representation. This result also merits further investigation to understand the relationship between these task characteristics and local versus deep learning rules.


Appendix A ssMNIST Sequences

A.1. “8x9” sequence generation

A.1.1. Sub-sequences

The “8x9” grammar used in reported results is composed of the following sub-sequences, shown in rows below:

2, 4, 0, 7, 8, 1, 6, 1, 8
2, 7, 4, 9, 5, 9, 3, 1, 0
5, 7, 3, 4, 1, 3, 1, 6, 4
1, 3, 7, 5, 2, 5, 5, 3, 4
2, 9, 1, 9, 2, 8, 3, 2, 7
1, 2, 6, 4, 8, 3, 5, 0, 3
3, 8, 0, 5, 6, 4, 1, 3, 9
4, 7, 5, 3, 7, 6, 7, 2, 4

Note that several two- and three-digit transitions are shared between sub-sequences, but no two sub-sequences share the same first two digits.

A.1.2. “8x9” grammar accuracy ceiling calculation

Given the semi-deterministic nature of the sample generating process and grammar defined, we can calculate the theoretical limit on prediction accuracy as follows.

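1st digit: predict 2, correct with P = 3/8.
2nd digit:

Following 2: predict uniformly among {4, 7, 9}, correct with P = 1/3
Following 1: predict uniformly among {3, 2}, correct with P = 1/2
All remaining (following 5, 3 or 4): deterministic

Remaining digits: deterministic conditioned on the first 2 digits.

Correct predictions per sequence: 3/8 + (3/8 · 1/3 + 2/8 · 1/2 + 3/8 · 1) + 7 = 3/8 + 5/8 + 7 = 8
Accuracy ceiling: 8/9 ≈ 88.9%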
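The calculation above can be checked by enumerating the grammar directly. The sketch below assumes that sub-sequences are drawn uniformly at random and that an ideal predictor knows its position within the current sub-sequence; the function name is hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction

SUB_SEQUENCES = [
    [2, 4, 0, 7, 8, 1, 6, 1, 8],
    [2, 7, 4, 9, 5, 9, 3, 1, 0],
    [5, 7, 3, 4, 1, 3, 1, 6, 4],
    [1, 3, 7, 5, 2, 5, 5, 3, 4],
    [2, 9, 1, 9, 2, 8, 3, 2, 7],
    [1, 2, 6, 4, 8, 3, 5, 0, 3],
    [3, 8, 0, 5, 6, 4, 1, 3, 9],
    [4, 7, 5, 3, 7, 6, 7, 2, 4],
]


def accuracy_ceiling(sub_sequences):
    """Return (expected correct predictions per sub-sequence, accuracy ceiling)."""
    n = len(sub_sequences)
    length = len(sub_sequences[0])
    expected_correct = Fraction(0)
    for t in range(length):
        # Group sub-sequences by the digits seen so far within the sub-sequence.
        next_digit_counts = defaultdict(Counter)
        for seq in sub_sequences:
            next_digit_counts[tuple(seq[:t])][seq[t]] += 1
        # The best guess for each prefix is its most frequent next digit.
        best = sum(max(c.values()) for c in next_digit_counts.values())
        expected_correct += Fraction(best, n)
    return expected_correct, expected_correct / length


correct, ceiling = accuracy_ceiling(SUB_SEQUENCES)
print(correct, ceiling, float(ceiling))
```

Using exact fractions avoids rounding when the per-position probabilities are summed.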

Appendix B Model Details

B.1. Description of hyper-parameters

Probability of forgetting is a parameter used to expose the network to novel sequences by clearing the memory state at randomized intervals. It is the probability, at each time step and for each training sequence, of clearing the hidden state.
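A minimal sketch of this forgetting step is shown below, assuming the hidden state is held as one vector per training sequence in the batch and that clearing means resetting it to zeros. The function and argument names are hypothetical.

```python
import random


def apply_forgetting(hidden_states, p_forget, rng):
    """Independently clear each sequence's hidden state with probability p_forget.

    hidden_states: list of per-sequence hidden vectors (lists of floats).
    rng: a random.Random instance, passed in so that runs are reproducible.
    """
    return [
        [0.0] * len(h) if rng.random() < p_forget else h
        for h in hidden_states
    ]
```

Called once per time step, this gives each sequence in the batch its own randomized forgetting interval.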

Boost strength controls the influence of the per-cell boost computation within the top-k algorithm. It is a non-negative parameter, and disables boosting when set to 0.

Boost strength factor allows an exponential decay of boost strength, which has been shown to stabilize training.
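To illustrate how these two parameters could interact, the sketch below assumes an exponential boost term of the form exp(strength · (target − duty cycle)) applied to non-negative activations before top-k selection, with the target duty cycle set to the layer sparseness. The exact boost formula, the smoothing rate `alpha`, and the interval at which the decay factor is applied are all assumptions; they are not specified in the text above.

```python
import math


def boosted_top_k(activations, duty_cycles, k, boost_strength):
    """Indices of the k winning cells after boosting (assumed exponential form).

    With boost_strength == 0 every multiplier is 1, so boosting is disabled.
    Assumes non-negative activations.
    """
    target = k / len(activations)  # assumed target duty cycle
    boosted = [
        a * math.exp(boost_strength * (target - d))
        for a, d in zip(activations, duty_cycles)
    ]
    ranked = sorted(range(len(boosted)), key=lambda i: boosted[i], reverse=True)
    return sorted(ranked[:k])


def update_duty_cycles(duty_cycles, winners, alpha=0.01):
    """Moving average of per-cell activity; alpha is a hypothetical rate."""
    won = set(winners)
    return [
        (1.0 - alpha) * d + alpha * (1.0 if i in won else 0.0)
        for i, d in enumerate(duty_cycles)
    ]


def decayed_boost_strength(initial, factor, n_intervals):
    """Exponential decay of boost strength after n_intervals (e.g. epochs)."""
    return initial * factor ** n_intervals
```

In this formulation, cells that have been idle receive a multiplier above 1 and recently busy cells a multiplier below 1, so the effect builds gradually as the duty cycle average moves.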

Uniform mass weight controls the interpolation of a uniform distribution with the output of the main model. The final distribution used to compute loss is calculated as a weighted average of each interpolated model distribution.

Word cache weight controls the interpolation of the simple word cache used in some experiments.

Word cache decay rate controls the decay of the word cache, which is implemented as a tensor with dimension equivalent to the size of the corpus vocabulary. After each token is observed, its index in the cache is set to 1. The cache is decayed according to this parameter on each step.
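The cache update and the interpolation described in this section might be sketched as follows. The order of operations (decay, then set the observed token to 1), the normalization of the cache into a distribution, and the choice to give the main model the remaining weight are assumptions for illustration; the function names are hypothetical.

```python
def update_cache(cache, token_index, decay_rate):
    """Decay every cache entry, then set the observed token's entry to 1."""
    decayed = [c * decay_rate for c in cache]
    decayed[token_index] = 1.0
    return decayed


def mix_distributions(model_probs, cache, uniform_weight, cache_weight):
    """Weighted average of the model, uniform and cache distributions."""
    vocab_size = len(model_probs)
    total = sum(cache)
    if total > 0:
        cache_probs = [c / total for c in cache]
    else:
        # Empty cache: fall back to a uniform distribution (an assumption).
        cache_probs = [1.0 / vocab_size] * vocab_size
    model_weight = 1.0 - uniform_weight - cache_weight
    return [
        model_weight * p + uniform_weight / vocab_size + cache_weight * q
        for p, q in zip(model_probs, cache_probs)
    ]
```

Because the three weights sum to 1 and each component is a valid distribution, the mixture is also a valid distribution, and the uniform component keeps every word's probability strictly positive.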

B.2. Hyper-parameters used

Tables 3 and 4 list the configurations for hyper-parameters for the language modeling and ssMNIST experiments respectively.

Description                          Value
Batch size                           300
Probability of forgetting            0.025
Decoder L2 regularization            0.00001
No. of groups / mini-columns         1500
No. of cells per group               1
Number of winning groups / cells     80
Boost strength                       1.2
Boost strength factor                0.85
Predictor hidden size                1200
Uniform mass weight                  0.01
Word cache weight                    0.07
Word cache decay rate                0.99
Table 3. Hyper-parameters used (language modeling)

Description                          Value
Batch size                           300
Decoder L2 regularization            0.0
No. of groups / mini-columns         1000
No. of cells per group               1
Number of winning groups / cells     120
Boost strength                       1.2
Boost strength factor                0.85
Predictor hidden size                1200
Table 4. Hyper-parameters used (ssMNIST)

Appendix C Word Embeddings

C.1. Synthetic Embedding

The synthetic embedding was constructed as per the original RSM work as follows:

For each word in the corpus, a 28-dimensional binary embedding is generated. The binary vector is constructed as the 14-bit left-filled binary encoding of the word's vocabulary index, concatenated with its inverse.

For example, the second word in the corpus, , would be embedded as , and the 100th word in the corpus, , would be embedded as
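The construction can be sketched in a few lines, assuming "left-filled" means zero-padding the binary representation on the left and "inverse" means the bitwise complement. Whether vocabulary indices start at 0 or 1 is not stated here, so the function takes the index as given; its name is hypothetical.

```python
def synthetic_embedding(index, n_bits=14):
    """Return the 2 * n_bits binary embedding of a vocabulary index."""
    if not 0 <= index < 2 ** n_bits:
        raise ValueError("index does not fit in n_bits")
    # Zero-padded binary encoding, most significant bit first.
    bits = [int(b) for b in format(index, "0{}b".format(n_bits))]
    # Concatenate the encoding with its bitwise complement.
    return bits + [1 - b for b in bits]
```

Appending the complement means every embedding has exactly 14 active bits, regardless of the index.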

C.2. FastText Embedding

We used FastText’s unsupervised training method to generate a single fixed embedding vector for each word in the PTB vocabulary (code for generating FastText embeddings on custom corpora is available at https://github.com/facebookresearch/fastText). We used the skipgram model with a learning rate of 0.1, a vector dimension of 100, a minimum word occurrence count of 1, and softmax loss, and trained for 5 epochs. Embeddings were stored in a static dictionary once generated and treated as inputs to the RSM network.
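With the `fasttext` Python package, these settings would correspond roughly to the call below. This is a sketch rather than the authors' script: the corpus path and the helper name `train_embeddings` are hypothetical, and the package must be installed separately.

```python
# Settings taken from the description above, using fastText's argument names.
FASTTEXT_KWARGS = {
    "model": "skipgram",
    "lr": 0.1,         # learning rate
    "dim": 100,        # vector dimension
    "minCount": 1,     # minimal number of word occurrences
    "loss": "softmax",
    "epoch": 5,
}


def train_embeddings(corpus_path):
    """Train on a plain-text corpus and return a static word -> vector dict."""
    import fasttext  # imported lazily; requires `pip install fasttext`

    model = fasttext.train_unsupervised(corpus_path, **FASTTEXT_KWARGS)
    return {word: model.get_word_vector(word) for word in model.get_words()}
```

Setting `minCount` to 1 ensures that every word in the vocabulary, including rare ones, receives a vector.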