Strongly-Typed Recurrent Neural Networks

02/06/2016 ∙ by David Balduzzi, et al.

Recurrent neural networks are increasingly popular models for sequential learning. Unfortunately, although the most effective RNN architectures are perhaps excessively complicated, extensive searches have not found simpler alternatives. This paper imports ideas from physics and functional programming into RNN design to provide guiding principles. From physics, we introduce type constraints, analogous to the constraints that forbid adding meters to seconds. From functional programming, we require that strongly-typed architectures factorize into stateless learnware and state-dependent firmware, reducing the impact of side-effects. The features learned by strongly-typed nets have a simple semantic interpretation via dynamic average-pooling on one-dimensional convolutions. We also show that strongly-typed gradients are better behaved than in classical architectures, and characterize the representational power of strongly-typed nets. Finally, experiments show that, despite being more constrained, strongly-typed architectures achieve lower training error and comparable generalization error to classical architectures.




1 Introduction

Recurrent neural networks (RNNs) are models that learn nonlinear relationships between sequences of inputs and outputs. Applications include speech recognition (Graves et al., 2013), image generation (Gregor et al., 2015), machine translation (Sutskever et al., 2014) and image captioning (Vinyals et al., 2015; Karpathy & Fei-Fei, 2015). Training RNNs is difficult due to exploding and vanishing gradients (Hochreiter, 1991; Bengio et al., 1994; Pascanu et al., 2013). Researchers have therefore developed gradient-stabilizing architectures such as Long Short-Term Memories or LSTMs (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units or GRUs (Cho et al., 2014).

Unfortunately, LSTMs and GRUs are complicated and contain many components whose roles are not well understood. Extensive searches (Bayer et al., 2009; Jozefowicz et al., 2015; Greff et al., 2015) have not yielded significant improvements. This paper takes a fresh approach inspired by dimensional analysis and functional programming.

Intuition from dimensional analysis.

Nodes in neural networks are devices that, by computing dot products, measure the similarity of their inputs to representations encoded in weight matrices. Ideally, the representation learned by a net should “carve nature at its joints”. An exemplar is the system of measurement that has been carved out of nature by physicists. It prescribes units for expressing the readouts of standardized measuring devices (e.g. kelvin for thermometers and seconds for clocks) and rules for combining them.

A fundamental rule is the principle of dimensional homogeneity: it is only meaningful to add quantities expressed in the same units (Bridgman, 1922; Hart, 1995). For example, adding seconds to volts is inadmissible. In this paper, we propose to take the measurements performed by neural networks as seriously as physicists take their measurements, and apply the principle of dimensional homogeneity to the representations learned by neural nets; see section 2.

Intuition from functional programming.

Whereas feedforward nets learn to approximate functions, recurrent nets learn to approximate programs – suggesting lessons from language design are relevant to RNN design. Language researchers stress the benefits of constraints: eliminating GOTO (Dijkstra, 1968); introducing type-systems that prescribe the interfaces between parts of computer programs and guarantee their consistency (Pierce, 2002); and working with stateless (pure) functions.

For our purposes, types correspond to units as above. Let us therefore discuss the role of states. The reason for recurrent connections is precisely to introduce state-dependence. Unfortunately, state-dependent functions have side-effects – unintended knock-on effects such as exploding gradients.

State-dependence without side-effects is not possible. The architectures proposed below encapsulate states in firmware (which has no learned parameters) so that the learnware (which encapsulates the parameters) is stateless. It follows that the learned features and gradients in strongly-typed architectures are better behaved and more interpretable than their classical counterparts, see section 3.

Strictly speaking, the ideas from physics (to do with units) and functional programming (to do with states) are independent. However, we found that they complemented each other. We refer to architectures as strongly-typed when they both (i) preserve the type structure of their features and (ii) separate learned parameters from state-dependence.


The core of the paper is section 2, which introduces strongly-typed linear algebra. As partial motivation, we show how types are implicit in principal component analysis and feedforward networks. A careful analysis of the update equations in vanilla RNNs identifies a flaw in classical RNN designs that leads to incoherent features. Fixing the problem requires new update equations that preserve the type-structure of the features.

Section 3 presents strongly-typed analogs of standard RNN architectures. It turns out that small tweaks to the standard update rules yield simpler features and gradients, theorem 1 and corollary 2. Finally, theorem 3 shows that, despite their more constrained architecture, strongly-typed RNNs have similar representational power to classical RNNs. Experiments in section 4 show that strongly-typed RNNs have comparable generalization performance and, surprisingly, lower training error than classical architectures (suggesting greater representational power). The flipside is that regularization appears to be more important for strongly-typed architectures, see experiments.

Related work.

The analogy between neural networks and functional programming was proposed in (Olah, 2015), which also argued that representations should be interpreted as types. This paper extends Olah’s proposal. Prior work on typed-linear algebra (Macedo & Oliveira, 2013) is neither intended for nor suited to applications in machine learning. Many familiar RNN architectures already incorporate forms of weak-typing, see section 3.1.

2 Strongly-Typed Features

A variety of type systems have been developed for mathematical logic and language design (Reynolds, 1974; Girard, 1989; Pierce, 2002). We introduce a type-system based on linear algebra that is suited to deep learning. Informally, a type is a vector space with an orthogonal basis. A more precise definition along with rules for manipulating types is provided below. Section 2.2 provides examples; section 2.3 uses types to identify a design flaw in classical RNNs.

2.1 Strongly-Typed Quasi-Linear Algebra

Quasi-linear algebra is linear algebra supplemented with nonlinear functions that act coordinatewise.

Definition 1.

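Dot-products are denoted by $\langle w, x \rangle$ or $w^\top x$. A type $\mathcal{T} = (V, \langle \cdot, \cdot \rangle, \{t_i\}_{i=1}^{d})$ is a $d$-dimensional vector space equipped with an inner product and an orthogonal basis such that $\langle t_i, t_j \rangle = 1_{[i=j]}$.

Given type $\mathcal{T}$, we can represent vectors $v \in V$ as real-valued $d$-tuples via

$$v_{\mathcal{T}} \leftrightarrow (\langle v, t_1 \rangle, \ldots, \langle v, t_d \rangle) \in \mathbb{R}^d.$$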

Definition 2.

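The following operations are admissible:

  1. Unary operations on a type:
    Given a function $f: \mathbb{R} \to \mathbb{R}$ (e.g. scalar multiplication, sigmoid $\sigma$, tanh $\tau$ or relu $\rho$), define
    $$f(v) := (f(v_1), \ldots, f(v_d)) \in \mathcal{T}.$$

  2. Binary operations on a type:
    Given $v, w \in \mathcal{T}$ and an elementary binary operation $\mathrm{bin} \in \{+, -, \max, \min, \pi_1, \pi_2\}$ (note: $\pi_i$ is projection onto the $i$-th coordinate), define
    $$\mathrm{bin}(v, w) := (\mathrm{bin}(v_1, w_1), \ldots, \mathrm{bin}(v_d, w_d)).$$
    Binary operations on two different types (e.g. adding vectors expressed in different orthogonal bases) are not admissible.

  3. Transformations between types:
    A type-transform is a linear map $P: V_1 \to V_2$ such that $P(t_i^{(1)}) = t_i^{(2)}$ for $i = 1, \ldots, \min(d_1, d_2)$. Type-transformations are orthogonal matrices.

  4. Diagonalization:
    Suppose that $\mathcal{T}_1$ and $\mathcal{T}_2$ have the same dimension. Define
    $$\mathrm{diag}(v): w \mapsto v \odot w := (v_1 w_1, \ldots, v_d w_d) \in \mathcal{T}_2,$$
    where $v \in \mathcal{T}_1$ and $w \in \mathcal{T}_2$. Diagonalization converts type $\mathcal{T}_1$ into a new type, $\mathrm{diag}(\mathcal{T}_1)$, that acts on $\mathcal{T}_2$ by coordinatewise scalar multiplication.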

Definition 1 is inspired by how physicists have carved the world into an orthogonal basis of meters, amps, volts etc. The analogy is not perfect: e.g. $x \mapsto x^2$ maps meters to square-meters, whereas types are invariant to coordinatewise operations. Types are looser than physical units.
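The admissible operations of Definition 2 can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: a type is represented by a string label for its basis, and the names `Typed`, `unary`, `binary` and `diag` are hypothetical helpers introduced here, not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Typed:
    """Coordinates of a vector plus a label naming its orthogonal basis (assumed encoding)."""
    type_name: str
    coords: tuple


def unary(f, v):
    # Unary operations act coordinatewise and preserve the type.
    return Typed(v.type_name, tuple(f(c) for c in v.coords))


def binary(op, v, w):
    # Binary operations are admissible only when both arguments share a type.
    if v.type_name != w.type_name:
        raise TypeError(f"cannot combine {v.type_name} with {w.type_name}")
    return Typed(v.type_name, tuple(op(a, b) for a, b in zip(v.coords, w.coords)))


def diag(v, w):
    # Diagonalization: v acts on w by coordinatewise scalar multiplication.
    # The result keeps the type of w; only the dimensions must agree.
    if len(v.coords) != len(w.coords):
        raise ValueError("dimension mismatch")
    return Typed(w.type_name, tuple(a * b for a, b in zip(v.coords, w.coords)))


h = Typed("T_h", (1.0, -2.0))
z = Typed("T_h", (0.5, 0.5))
g = Typed("T_f", (0.0, 1.0))

added = binary(lambda a, b: a + b, h, z)      # same type: admissible
gated = diag(g, h)                            # a T_f vector gating a T_h vector
rectified = unary(lambda c: max(c, 0.0), h)   # relu, applied coordinatewise

try:
    binary(lambda a, b: a + b, h, g)          # different types: rejected
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True
```

The point of the design is that addition across types fails loudly, while gating across types is allowed because it goes through diagonalization rather than a binary operation.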

2.2 Motivating examples

We build intuition by recasting PCA and feedforward neural nets from a type perspective.

Principal component analysis (PCA).

Let $X \in \mathbb{R}^{n \times d}$ denote $n$ datapoints $\{x^{(1)}, \ldots, x^{(n)}\} \subset \mathbb{R}^d$. PCA factorizes $X^\top X = P^\top D P$ where $P$ is a $(d \times d)$-orthogonal matrix and the diagonal matrix $D$ contains the eigenvalues of $X^\top X$. A common application of PCA is dimensionality reduction. From a type perspective, this consists in: (i) transforming the standard orthogonal basis of $\mathbb{R}^d$ into the latent type given by the rows of $P$; (ii) projecting onto a subtype (subset of coordinates in the latent type); and (iii) applying the inverse to recover the original type.

Feedforward nets.

The basic feedforward architecture is stacked layers computing $h = f(W x)$ where $f(\cdot)$ is a nonlinearity applied coordinatewise. We present two descriptions of the computation.

The standard description is in terms of dot-products. Rows $w_i$ of $W$ correspond to features, and matrix multiplication is a collection of dot-products that measure the similarity between the input $x$ and the row-features:

$$W x = (\langle w_1, x \rangle, \ldots, \langle w_d, x \rangle)^\top.$$

Types provide a finer-grained description. Factorize $W = P D Q$ by singular value decomposition into a diagonal matrix $D$ and orthogonal matrices $P$ and $Q$. The layer-computation can be rewritten as $h = f(P D Q x)$. From a type-perspective, the layer thus: (i) transforms $x$ to a latent type; (ii) applies coordinatewise scalar multiplication to the latent type; (iii) transforms the result to the output type; and (iv) applies a coordinatewise nonlinearity. Feedforward nets learn interleaved sequences of type transforms and unary, type-preserving operations.
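The four-stage reading of a layer can be checked numerically. The sketch below assumes a 2×2 weight matrix built directly from two rotations and a diagonal scaling, so no SVD routine is needed, and it uses tanh as the coordinatewise nonlinearity. The helper names `matvec`, `matmul` and `rotation` are illustrative.

```python
import math


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Assumed factors: orthogonal P and Q, diagonal D with entries d.
P, Q = rotation(0.3), rotation(-1.1)
d = [2.0, 0.5]
D = [[d[0], 0.0], [0.0, d[1]]]
W = matmul(P, matmul(D, Q))

x = [1.0, -3.0]

# Standard description: one matrix multiplication, then the nonlinearity.
direct = [math.tanh(a) for a in matvec(W, x)]

# Typed description, stage by stage.
latent = matvec(Q, x)                               # (i) transform to a latent type
scaled = [di * li for di, li in zip(d, latent)]     # (ii) coordinatewise scaling
output = matvec(P, scaled)                          # (iii) transform to the output type
staged = [math.tanh(a) for a in output]             # (iv) coordinatewise nonlinearity
```

Both routes give the same activations up to floating-point rounding; the staged route simply makes the two type transforms explicit.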

2.3 Incoherent features in classical RNNs

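There is a subtle inconsistency in classical RNN designs that leads to incoherent features. Consider the updates:

$$\text{vanilla RNN:} \quad h_t = \sigma(V h_{t-1} + W x_t + b). \qquad (8)$$

We drop the nonlinearity, since the inconsistency is already visible in the linear case. Letting $z_t := W x_t$ and unfolding Eq. (8) over time (with $b = 0$ and $h_0 = 0$) obtains

$$h_t = \sum_{s=1}^{t} V^{t-s} z_s. \qquad (9)$$

The inconsistency can be seen via dot-products and via types. From the dot-product perspective, observe that multiplying an input by a matrix squared yields

$$(V^2 z)_i = \big\langle (\langle v_i, c_1 \rangle, \ldots, \langle v_i, c_d \rangle), \, z \big\rangle,$$

where $v_i$ refers to rows of $V$ and $c_j$ to columns. Each coordinate of $V^2 z$ is computed by measuring the similarity of a row of $V$ to all of its columns, and then measuring the similarity of the result to $z$. In short, features are tangled and uninterpretable.

From a type perspective, apply an SVD to $V = P D Q$ and observe that $V^2 = P D Q P D Q$. Each multiplication by $P$ or $Q$ transforms the input to a new type, obtaining

$$\mathcal{T}_h \xrightarrow{Q} \mathcal{T}_{lat_1} \xrightarrow{D} \mathcal{T}_{lat_1} \xrightarrow{P} \mathcal{T}_{lat_2} \xrightarrow{Q} \mathcal{T}_{lat_3} \xrightarrow{D} \mathcal{T}_{lat_3} \xrightarrow{P} \mathcal{T}_{lat_4}.$$

Thus $V$ sends $z \mapsto \mathcal{T}_{lat_2}$ whereas $V^2$ sends $z \mapsto \mathcal{T}_{lat_4}$. Adding terms involving $V$ and $V^2$, as in Eq. (9), entails adding vectors expressed in different orthogonal bases, which is analogous to adding joules to volts. The same problem applies to LSTMs and GRUs.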
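The dot-product description of the squared recurrent matrix can be verified on a small example. The sketch below assumes an arbitrary 2×2 matrix `V` and input `z` chosen for illustration. It compares applying `V` twice against the "row-to-column similarities, then similarity to the input" computation described in the text.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, x):
    return [dot(row, x) for row in M]


# Illustrative recurrent matrix and input (not taken from the paper).
V = [[0.5, -1.0],
     [2.0, 0.25]]
z = [1.0, 3.0]

rows = V
cols = [list(c) for c in zip(*V)]

# Direct route: apply V twice.
direct = matvec(V, matvec(V, z))

# Tangled route: for output coordinate i, compare row i of V with every
# column of V, then take the dot product of those similarities with z.
tangled = [dot([dot(v_i, c_j) for c_j in cols], z) for v_i in rows]
```

The two lists agree, which shows that each coordinate of the result mixes a row of `V` with all of its columns before the input is even consulted.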

Two recent papers provide empirical evidence that recurrent (horizontal) connections are problematic even after gradients are stabilized: (Zaremba et al., 2015) find that Dropout performs better when restricted to vertical connections and (Laurent et al., 2015) find that Batch Normalization fails unless restricted to vertical connections (Ioffe & Szegedy, 2015). More precisely, (Laurent et al., 2015) find that Batch Normalization improves training but not test error when restricted to vertical connections; it fails completely when also applied to horizontal connections.

Code using GOTO can be perfectly correct, and RNNs with type mismatches can achieve outstanding performance. Nevertheless, both lead to spaghetti-like information/gradient flows that are hard to reason about.

Type-preserving transforms.

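One way to resolve the type inconsistency, which we do not pursue in this paper, is to use symmetric weight matrices so that $V = P^\top D P$ where $P$ is orthogonal and $D$ is diagonal. From the dot-product perspective,

$$h_t = \sum_{s=1}^{t} P^\top D^{t-s} P \, z_s,$$

which has the simple interpretation that $z_s$ is amplified (or dampened) by $D$ in the latent type provided by $P$. From the type-perspective, multiplication by $V^k$ is type-preserving

$$\mathcal{T}_h \xrightarrow{P} \mathcal{T}_{lat} \xrightarrow{D^k} \mathcal{T}_{lat} \xrightarrow{P^\top} \mathcal{T}_h,$$

so addition is always performed in the same basis.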
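A small numerical check makes the type-preserving property concrete. The sketch assumes a symmetric 2×2 matrix built from a rotation `P` and diagonal entries `d` (values chosen for illustration). It confirms that, read in the latent basis given by `P`, repeated multiplication by `V` only rescales each coordinate.

```python
import math


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


theta = 0.7
P = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
d = [1.5, 0.5]
D = [[d[0], 0.0], [0.0, d[1]]]
V = matmul(transpose(P), matmul(D, P))    # symmetric by construction

z = [2.0, -1.0]
latent_z = matvec(P, z)

max_error = 0.0
h = z
for k in range(1, 4):
    h = matvec(V, h)                       # h now holds V^k z
    latent_h = matvec(P, h)                # the same vector in the latent basis
    expected = [(di ** k) * li for di, li in zip(d, latent_z)]
    max_error = max(max_error,
                    max(abs(a - b) for a, b in zip(latent_h, expected)))
```

Because every power of `V` lands in the same latent basis, summing contributions from different time steps never mixes bases.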

A familiar example of type-preserving transforms is autoencoders, under the constraint that the decoder $W^\top$ is the transpose of the encoder $W$. Finally, (Moczulski et al., 2015) propose to accelerate matrix computations in feedforward nets by interleaving diagonal matrices, $A$ and $D$, with the orthogonal discrete cosine transform, $C$. The resulting transform, $A C D C^{-1}$, is type-preserving.

3 Recurrent Neural Networks

We present three strongly-typed RNNs that purposefully mimic classical RNNs as closely as possible. Perhaps surprisingly, the tweaks introduced below have deep structural implications, yielding architectures that are significantly easier to reason about, see sections 3.3 and 3.4.

3.1 Weakly-Typed RNNs

We first pause to note that many classical architectures are weakly-typed. That is, they introduce constraints or restrictions on off-diagonal operations on recurrent states.

The memory cell in LSTMs is only updated coordinate-wise and is therefore well-behaved type-theoretically – although the overall architecture is not type consistent. The gating operation in GRUs reduces type-inconsistencies by discouraging (i.e. zeroing out) unnecessary recurrent information flows.

SCRNs, or Structurally Constrained Recurrent Networks (Mikolov et al., 2015), add a type-consistent state layer:


In MUT1, the best performing architecture in (Jozefowicz et al., 2015), the behavior of $z$ and $h$ is well-typed, although the gating by $r$ is not. Finally, I-RNNs initialize their recurrent connections as the identity matrix (Le et al., 2015). In other words, the key idea is a type-consistent initialization.

3.2 Strongly-Typed RNNs

The vanilla strongly-typed RNN is

$$
\text{T-RNN:} \quad
\begin{aligned}
z_t &= W x_t && (15) \\
f_t &= \sigma(V x_t + b) && (16) \\
h_t &= f_t \odot h_{t-1} + (1 - f_t) \odot z_t && (17)
\end{aligned}
$$

The T-RNN has similar parameters to a vanilla RNN, Eq. (8), although their roles have changed. A nonlinearity for $z_t$ is not necessary because: (i) gradients do not explode, corollary 2, so no squashing is needed; and (ii) coordinatewise multiplication by $f_t$ introduces a nonlinearity. Whereas relus are binary gates (0 if $z_t < 0$, 1 else), the forget gate $f_t$ is a continuous multiplicative gate on $z_t$.

Replacing the horizontal connection with a vertically controlled gate, Eq. (16), stabilizes the type-structure across time steps. Line for line, the type structure is:


We refer to lines (15) and (16) as learnware since they have parameters ($W$, $V$, $b$). Line (17) is firmware since it has no parameters. The firmware depends on the previous state $h_{t-1}$, unlike the learnware, which is stateless. See section 3.4 for more on learnware and firmware.
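The learnware / firmware split can be sketched as two separate functions. The code assumes the T-RNN updates take the form $z_t = W x_t$, $f_t = \sigma(V x_t + b)$ and $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot z_t$. The function names `learnware`, `firmware` and `t_rnn`, and the toy parameters, are illustrative choices rather than the authors' implementation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def learnware(params, x_t):
    """Stateless: depends only on the current input and the learned parameters."""
    W, V, b = params
    z_t = matvec(W, x_t)
    f_t = [sigmoid(a + b_i) for a, b_i in zip(matvec(V, x_t), b)]
    return z_t, f_t


def firmware(h_prev, z_t, f_t):
    """Parameter-free: a coordinatewise update of the previous state."""
    return [f * h + (1.0 - f) * z for f, h, z in zip(f_t, h_prev, z_t)]


def t_rnn(params, xs, h0):
    h, states = h0, []
    for x_t in xs:
        z_t, f_t = learnware(params, x_t)
        h = firmware(h, z_t, f_t)
        states.append(h)
    return states


# Toy parameters (assumed): zero gate weights and bias give f_t = 0.5 everywhere.
W = [[1.0, 0.0], [0.0, -1.0]]
V = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
xs = [[2.0, 4.0], [0.0, 0.0]]
states = t_rnn((W, V, b), xs, h0=[0.0, 0.0])
```

Note that `learnware` never sees the hidden state and `firmware` never sees a parameter, which is the separation the text describes.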

Strongly-typed LSTMs differ from LSTMs in two respects: (i) $x_{t-1}$ is substituted for $h_{t-1}$ in the first three equations so that the type structure is coherent; and (ii) the nonlinearities in $z_t$ and $h_t$ are removed as for the T-RNN.

LSTM (21)
T-LSTM (27)

We drop the input gate from the updates for simplicity; see (Greff et al., 2015). The type structure is


Strongly-typed GRUs adapt GRUs similarly to how LSTMs were modified. In addition, the reset gate is repurposed; it is no longer needed for weak-typing.

GRU (33)
T-GRU (38)

The type structure is


3.3 Feature Semantics

The output of a vanilla RNN expands as the uninterpretable


with even less interpretable gradient. Similar considerations hold for LSTMs and GRUs. Fortunately, the situation is more amenable for strongly-typed architectures. In fact, their semantics are related to average-pooled convolutions.


Applying a one-dimensional convolution to input sequence yields output sequence


Given weights associated with , average-pooling yields . A special case is when the convolution applies the same matrix to every input:


The average-pooled convolution is then a weighted average of the features extracted from the input sequence.

Dynamic temporal convolutions.

We now show that strongly-typed RNNs are one-dimensional temporal convolutions with dynamic average-pooling. Informally, strongly-typed RNNs transform input sequences into a weighted average of features extracted from the sequence


where the weights depend on the sequence. In detail:

Theorem 1 (feature semantics via dynamic convolutions).

Strongly-typed features are computed explicitly as follows.

  • T-RNN. The output is where

  • T-LSTM. Let and denote the vertical concatenation of the weight matrices and input vectors respectively. Then,


    where is defined as above.

  • T-GRU. Using the notation above,




Direct computation. ∎

In summary, T-RNNs compute a dynamic distribution over time steps, and then compute the expected feedforward features over that distribution. T-LSTMs store expectations in private memory cells that are reweighted by the output gate when publicly broadcast. Finally, T-GRUs drop the requirement that the average is an expectation, and also incorporate the output gate into the memory updates.
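The dynamic average-pooling view can be checked for a single coordinate. The sketch assumes the firmware update $h_t = f_t h_{t-1} + (1 - f_t) z_t$ with $h_0 = 0$. Unrolling it gives pooling weights $(1 - f_s) \prod_{r > s} f_r$ on each feature $z_s$. The gate and feature values below are made up for illustration.

```python
import math


def recurrence(fs, zs, h0=0.0):
    # Run the gated update step by step.
    h = h0
    for f, z in zip(fs, zs):
        h = f * h + (1.0 - f) * z
    return h


def pooling_weights(fs):
    # Weight on z_s after the final step: (1 - f_s) times every later gate.
    t = len(fs)
    weights = []
    for s in range(t):
        w = 1.0 - fs[s]
        for r in range(s + 1, t):
            w *= fs[r]
        weights.append(w)
    return weights


fs = [0.9, 0.2, 0.5, 0.7]      # gate values in (0, 1), one per time step
zs = [1.0, -2.0, 4.0, 0.5]     # features for one coordinate

weights = pooling_weights(fs)
pooled = sum(w * z for w, z in zip(weights, zs))
recurrent = recurrence(fs, zs)
leftover = math.prod(fs)       # weight that remains on the initial state h0
```

The weights are nonnegative and, together with the mass left on the initial state, sum to one, so the final state is a convex combination of the features seen so far. This is also why the state cannot blow up in this setting.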

Strongly-typed gradients are straightforward to compute and interpret:

Corollary 2 (gradient semantics).

The strongly-typed gradients are

  • T-RNN:


    and similarly for .

  • T-LSTM:

  • T-GRU:


It follows immediately that gradients will not explode for T-RNNs or T-LSTMs. Empirically we find they also behave well for T-GRUs.

3.4 Feature Algebra

A vanilla RNN can approximate any continuous state update since is dense in continuous functions on if is a nonpolynomial nonlinear function (Leshno et al., 1993). It follows that vanilla RNNs can approximate any recursively computable partial function (Siegelmann & Sontag, 1995).

Strongly-typed RNNs are more constrained. We show the constraints reflect a coherent design-philosophy and are less severe than appears.

The learnware / firmware distinction.

Strongly-typed architectures factorize into stateless learnware and state-dependent firmware. (A superficially similar factorization holds for GRUs and LSTMs. However, their learnware is state-dependent, since its gates depend on the previous hidden state.) For example, T-LSTMs and T-GRUs factorize as


Firmware decomposes coordinatewise, which prevents side-effects from interacting: e.g. for T-GRUs


and similarly for T-LSTMs. Learnware is stateless; it has no side-effects and does not decompose coordinatewise. Evidence that side-effects are a problem for LSTMs can be found in (Zaremba et al., 2015) and (Laurent et al., 2015), which show that Dropout and Batch Normalization respectively need to be restricted to vertical connections.

In short, under strong-typing the learnware carves out features which the firmware uses to perform coordinatewise state updates. Vanilla RNNs allow arbitrary state updates. LSTMs and GRUs restrict state updates, but allow arbitrary functions of the state. Translated from a continuous to discrete setting, the distinction between strongly-typed and classical architectures is analogous to working with binary logic gates (AND, OR) on variables learned by the vertical connections, versus working directly with $n$-ary boolean operations.

Representational power.

Motivated by the above, we show that a minimal strongly-typed architecture can span the space of continuous binary functions on features.

Theorem 3 (approximating binary functions).

The strongly-typed minimal RNN with updates


and parameters , , can approximate any set of continuous binary functions on features.

Proof sketch. Let be a feature of interest. Combining (Leshno et al., 1993) with the observation that for implies that . As many weighted copies of as necessary are obtained by adding rows to that are scalar multiples of .

Any set of binary functions on any collection of features can thus be approximated. Finally, vertical connections can approximate any set of features (Leshno et al., 1993).

4 Experiments

Model           | vanilla RNN                                      | T-RNN
Layers          | 1              | 2              | 3              | 1              | 2              | 3
64 (no dropout) | (1.365, 1.435) | (1.347, 1.417) | (1.353, 1.423) | (1.371, 1.452) | (1.323, 1.409) | (1.342, 1.423)
256             | (1.215, 1.274) | (1.242, 1.254) | (1.257, 1.273) | (1.300, 1.398) | (1.251, 1.276) | (1.233, 1.266)
Table 1: The (train, test) cross-entropy loss of RNNs and T-RNNs on WP dataset.

Model           | LSTM                                             | T-LSTM
Layers          | 1              | 2              | 3              | 1              | 2              | 3
64 (no dropout) | (1.496, 1.560) | (1.485, 1.557) | (1.500, 1.563) | (1.462, 1.511) | (1.367, 1.432) | (1.369, 1.434)
256             | (1.237, 1.251) | (1.098, 1.193) | (1.185, 1.213) | (1.254, 1.273) | (1.045, 1.189) | (1.167, 1.198)
Table 2: The (train, test) cross-entropy loss of LSTMs and T-LSTMs on WP dataset.

Model           | GRU                                              | T-GRU
Layers          | 1              | 2              | 3              | 1              | 2              | 3
64 (no dropout) | (1.349, 1.435) | (1.432, 1.503) | (1.445, 1.559) | (1.518, 1.569) | (1.337, 1.422) | (1.377, 1.436)
256             | (1.083, 1.226) | (1.163, 1.214) | (1.219, 1.227) | (1.142, 1.296) | (1.208, 1.240) | (1.216, 1.212)
Table 3: The (train, test) cross-entropy loss of GRUs and T-GRUs on WP dataset.

We investigated the empirical performance of strongly-typed recurrent nets for sequence learning. The performance was evaluated on character-level and word-level text generation. We conducted a set of proof-of-concept experiments. The goal is not to compete with previous work or to find the best performing model under a specific hyper-parameter setting. Rather, we investigate how the two classes of architectures perform over a range of settings.

4.1 Character-level Text Generation

The first task is to generate text from a sequence of characters by predicting the next character in a sequence. We used Leo Tolstoy’s War and Peace (WP), which consists of 3,258,246 characters of English text, split into train/val/test sets with 80/10/10 ratios. The characters are encoded as one-hot vectors whose dimension equals the size of the vocabulary. We follow the experimental setting proposed in (Karpathy et al., 2015). Results are reported for two configurations: “64” and “256”, which are models with the same number of parameters as a 1-layer LSTM with 64 and 256 cells per layer respectively. Dropout regularization was only applied to the “256” models. The dropout rate was selected from a set of candidate values based on validation performance. Tables 1, 2 and 3 summarize the performance in terms of cross-entropy loss.
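The encoding and loss used in this task can be sketched as follows. The example assumes a tiny stand-in string instead of the full novel, natural logarithms for the cross-entropy, and a uniform prediction in place of a trained model. The helper names are illustrative.

```python
import math


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    # Only the coordinate of the true character contributes for a one-hot target.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


text = "war and peace"                    # stand-in corpus for illustration
vocab = sorted(set(text))
char_to_index = {c: i for i, c in enumerate(vocab)}

target = one_hot(char_to_index["a"], len(vocab))
uniform = softmax([0.0] * len(vocab))     # an untrained, uniform prediction
loss = cross_entropy(target, uniform)
```

A uniform prediction over a vocabulary of size K gives a loss of log K, which is a useful baseline when reading the cross-entropy values in the tables.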

We observe that the training error of strongly-typed models is typically lower than that of the standard models for two or more layers. The test error of the two architectures is comparable. However, our results (for both classical and typed models) fail to match those reported in (Karpathy et al., 2015), where a more extensive parameter search was performed.

Model                       | Train  | Validation | Test
small, no dropout
vanilla RNN                 | 416.50 | 442.31     | 432.01
T-RNN                       | 58.66  | 172.47     | 169.33
LSTM                        | 36.72  | 122.47     | 117.25
T-LSTM                      | 28.15  | 215.71     | 200.39
GRU                         | 31.14  | 179.47     | 173.27
T-GRU                       | 28.57  | 207.94     | 195.82
medium, with dropout
LSTM (Zaremba et al., 2015) | 48.45  | 86.16      | 82.70
LSTM (3-layer)              | 71.76  | 98.22      | 97.87
T-LSTM                      | 50.21  | 87.36      | 82.71
T-LSTM (3-layer)            | 51.45  | 85.98      | 81.52
GRU                         | 65.80  | 97.24      | 93.44
T-GRU                       | 55.31  | 121.39     | 113.85
Table 4: Perplexity on the Penn Treebank dataset.

4.2 Word-level Text Generation

The second task was to generate word-level text by predicting the next word from a sequence. We used the Penn Treebank (PTB) dataset (Marcus et al., 1993), which consists of 929K training words, 73K validation words, and 82K test words, with vocabulary size of 10K words. The PTB dataset is publicly available on the web.

We followed the experimental setting in (Zaremba et al., 2015) and compared the performance of “small” and “medium” models. The parameter size of “small” models is equivalent to that of 2 layers of 200-cell LSTMs, while the parameter size of “medium” models is the same as that of 2 layers of 650-cell LSTMs. For the “medium” models, we selected the dropout rate from {0.4, 0.5, 0.6} according to validation performance. Single-run performance, measured via perplexity, i.e., $\exp\big(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i})\big)$, is reported in Table 4.
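Perplexity is the exponential of the average negative log-likelihood per word. The sketch below assumes natural logarithms and takes the model's probability for each observed word as input. The function name and the sample probabilities are illustrative.

```python
import math


def perplexity(token_probs):
    # token_probs[i] is the probability the model assigned to the i-th observed word.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that spreads its mass uniformly over a 10K-word vocabulary.
uniform_10k = perplexity([1.0 / 10000] * 5)

# Perplexity is the inverse geometric mean of the assigned probabilities.
mixed = perplexity([0.5, 0.25])
```

A uniform guess over a 10K-word vocabulary has perplexity 10,000, which gives a sense of scale for the values in Table 4.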


For the “small” models, we found that the training perplexity of strongly-typed models is consistently lower than their classical counterparts, in line with the result for War & Peace. Test error was significantly worse for the strongly-typed architectures. A possible explanation for both observations is that strongly-typed architectures require more extensive regularization.

An intriguing result is that the T-RNN performs in the same ballpark as LSTMs, with perplexity within a factor of two. By contrast, the vanilla RNN fails to achieve competitive performance. This suggests there may be strongly-typed architectures of intermediate complexity between RNNs and LSTMs with comparable performance to LSTMs.

The dropout-regularized “medium” T-LSTM matches the LSTM performance reported in (Zaremba et al., 2015). The 3-layer T-LSTM obtains slightly better performance. The results were obtained with almost identical parameters to Zaremba et al (the learning rate decay was altered), suggesting that T-LSTMs are viable alternatives to LSTMs for sequence learning tasks when properly regularized. Strongly-typed GRUs did not match the performance of GRUs, possibly due to insufficient regularization.


We investigated the effect of removing gradient clipping on medium-sized LSTM and T-LSTM. T-LSTM gradients are well-behaved without clipping, although test performance is not competitive. In contrast, LSTM gradients explode without clipping and the architecture is unusable. It is possible that carefully initialized T-LSTMs may be competitive without clipping. We defer the question to future work.


Since strongly-typed RNNs have fewer nonlinearities than standard RNNs, we expect that they should have lower computational complexity. Training on the PTB dataset on an NVIDIA GTX 980 GPU, we found that T-LSTM is on average faster than LSTM. Similarly, the T-GRU trains on average faster than GRU.

5 Conclusions

RNNs are increasingly important tools for speech recognition, natural language processing and other sequential learning problems. The complicated structure of LSTMs and GRUs has led to searches for simpler alternatives with limited success (Bayer et al., 2009; Greff et al., 2015; Jozefowicz et al., 2015; Le et al., 2015; Mikolov et al., 2015). This paper introduces strong-typing as a tool to guide the search for alternate architectures. In particular, we suggest searching for update equations that learn well-behaved features, rather than update equations that “appear simple”. We draw on two disparate intuitions that turn out to be surprisingly compatible: (i) that neural networks are analogous to measuring devices (Balduzzi, 2012) and (ii) that training an RNN is analogous to writing code.

The main contribution is a new definition of type that is closely related to singular value decomposition – and is thus well-suited to deep learning. It turns out that classical RNNs are badly behaved from a type-perspective, which motivates modifying the architectures. Section 3 tweaked LSTMs and GRUs to make them well-behaved from a typing and functional programming perspective, yielding features and gradients that are easier to reason about than classical architectures.

Strong-typing has implications for the depth of RNNs. It was pointed out in (Pascanu et al., 2014) that unfolding horizontal connections over time implies the concept of depth is not straightforward in classical RNNs. By contrast, depth has the same meaning in strongly-typed architectures as in feedforward nets, since vertical connections learn features and horizontal connections act coordinatewise.

Experiments in section 4 show that strongly-typed RNNs achieve comparable generalization performance to classical architectures when regularized with dropout and have consistently lower training error. It is important to emphasize that the experiments are not conclusive. Firstly, we did not deviate far from settings optimized for classical RNNs when training strongly-typed RNNs. Secondly, the architectures were chosen to be as close as possible to classical RNNs. A more thorough exploration of the space of strongly-typed nets may yield better results.

Towards machine reasoning.

A definition of machine reasoning, adapted from (Bottou, 2014), is “algebraically manipulating features to answer a question”. Hard-won experience in physics (Chang, 2004), software engineering (Dijkstra, 1968), and other fields has led to the conclusion that well-chosen constraints are crucial to effective reasoning. Indeed, neural Turing machines (Graves et al., 2014) are harder to train than more constrained architectures such as neural queues and deques (Grefenstette et al., 2015).

Strongly-typed features have a consistent semantics, theorem 1, unlike features in classical RNNs which are rotated across time steps – and are consequently difficult to reason about. We hypothesize that strong-typing will provide a solid foundation for algebraic operations on learned features. Strong-typing may then provide a useful organizing principle in future machine reasoning systems.


We thank Tony Butler-Yeoman, Marcus Frean, Theofanis Karaletsos, JP Lewis and Brian McWilliams for useful comments and discussions.