1 Introduction
Recurrent neural networks (RNNs) are models that learn nonlinear relationships between sequences of inputs and outputs. Applications include speech recognition (Graves et al., 2013), image generation (Gregor et al., 2015), machine translation (Sutskever et al., 2014) and image captioning (Vinyals et al., 2015; Karpathy & Fei-Fei, 2015). Training RNNs is difficult due to exploding and vanishing gradients (Hochreiter, 1991; Bengio et al., 1994; Pascanu et al., 2013). Researchers have therefore developed gradient-stabilizing architectures such as Long Short-Term Memories or LSTMs (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units or GRUs (Cho et al., 2014).
Unfortunately, LSTMs and GRUs are complicated and contain many components whose roles are not well understood. Extensive searches (Bayer et al., 2009; Jozefowicz et al., 2015; Greff et al., 2015) have not yielded significant improvements. This paper takes a fresh approach inspired by dimensional analysis and functional programming.
Intuition from dimensional analysis.
Nodes in neural networks are devices that, by computing dot products, measure the similarity of their inputs to representations encoded in weight matrices. Ideally, the representation learned by a net should “carve nature at its joints”. An exemplar is the system of measurement that has been carved out of nature by physicists. It prescribes units for expressing the readouts of standardized measuring devices (e.g. kelvin for thermometers and seconds for clocks) and rules for combining them.
A fundamental rule is the principle of dimensional homogeneity: it is only meaningful to add quantities expressed in the same units (Bridgman, 1922; Hart, 1995). For example, adding seconds to volts is inadmissible. In this paper, we propose to take the measurements performed by neural networks as seriously as physicists take their measurements, and apply the principle of dimensional homogeneity to the representations learned by neural nets; see section 2.
Intuition from functional programming.
Whereas feedforward nets learn to approximate functions, recurrent nets learn to approximate programs – suggesting lessons from language design are relevant to RNN design. Language researchers stress the benefits of constraints: eliminating GOTO (Dijkstra, 1968); introducing type systems that prescribe the interfaces between parts of computer programs and guarantee their consistency (Pierce, 2002); and working with stateless (pure) functions.
For our purposes, types correspond to units as above. Let us therefore discuss the role of states. The reason for recurrent connections is precisely to introduce state-dependence. Unfortunately, state-dependent functions have side-effects – unintended knock-on effects such as exploding gradients.
State-dependence without side-effects is not possible. The architectures proposed below encapsulate states in firmware (which has no learned parameters) so that the learnware (which encapsulates the parameters) is stateless. It follows that the learned features and gradients in strongly-typed architectures are better behaved and more interpretable than their classical counterparts; see section 3.
Strictly speaking, the ideas from physics (to do with units) and functional programming (to do with states) are independent. However, we found that they complemented each other. We refer to architectures as strongly-typed when they both (i) preserve the type structure of their features and (ii) separate learned parameters from state-dependence.
Overview.
The core of the paper is section 2, which introduces strongly-typed linear algebra. As partial motivation, we show how types are implicit in principal component analysis and feedforward networks. A careful analysis of the update equations in vanilla RNNs identifies a flaw in classical RNN designs that leads to incoherent features. Fixing the problem requires new update equations that preserve the type-structure of the features.
Section 3 presents strongly-typed analogs of standard RNN architectures. It turns out that small tweaks to the standard update rules yield simpler features and gradients (theorem 1 and corollary 2). Finally, theorem 3 shows that, despite their more constrained architecture, strongly-typed RNNs have similar representational power to classical RNNs. Experiments in section 4 show that strongly-typed RNNs have comparable generalization performance and, surprisingly, lower training error than classical architectures (suggesting greater representational power). The flip side is that regularization appears to be more important for strongly-typed architectures; see the experiments.
Related work.
The analogy between neural networks and functional programming was proposed in (Olah, 2015), which also argued that representations should be interpreted as types. This paper extends Olah’s proposal. Prior work on typed linear algebra (Macedo & Oliveira, 2013) is neither intended for nor suited to applications in machine learning. Many familiar RNN architectures already incorporate forms of weak typing; see section 3.1.
2 Strongly-Typed Features
A variety of type systems have been developed for mathematical logic and language design (Reynolds, 1974; Girard, 1989; Pierce, 2002). We introduce a type system based on linear algebra that is suited to deep learning. Informally, a type is a vector space with an orthogonal basis. A more precise definition, along with rules for manipulating types, is provided below. Section 2.2 provides examples; section 2.3 uses types to identify a design flaw in classical RNNs.
2.1 Strongly-Typed Quasi-Linear Algebra
Quasi-linear algebra is linear algebra supplemented with nonlinear functions that act coordinatewise.
Definition 1.
Dot-products are denoted by ⟨v, w⟩ or vᵀw. A type T = (V, ⟨·,·⟩, B) is a d-dimensional vector space V equipped with an inner product ⟨·,·⟩ and an orthogonal basis B = {b_1, …, b_d} such that ⟨b_i, b_j⟩ = δ_ij.
Given type T, we can represent vectors v in T as real-valued tuples via
v ↔ (v_1, …, v_d), where v_i := ⟨v, b_i⟩.   (1)
Definition 2.
The following operations are admissible:

Unary operations on a type:
Given a function f : ℝ → ℝ (e.g. scalar multiplication, sigmoid σ, tanh τ or relu ρ), define
f(v) := (f(v_1), …, f(v_d)).   (2)

Binary operations on a type:
Given v, w in a type T and an elementary binary operation op (e.g. +, −, ⊙, max, π_1, π_2, where π_i is projection onto the i-th coordinate), define
op(v, w) := (op(v_1, w_1), …, op(v_d, w_d)).   (3)
Binary operations on two different types (e.g. adding vectors expressed in different orthogonal bases) are not admissible.

Transformations between types:
A type-transform P : T_1 → T_2 is a linear map such that P(b_i^(1)) = b_i^(2) for i = 1, …, d. Type-transformations are orthogonal matrices.

Diagonalization:
Suppose that T_1 and T_2 have the same dimension d. Define
diag(v) · w := (v_1 · w_1, …, v_d · w_d),   (4)
where v ∈ T_1 and w ∈ T_2. Diagonalization converts type T_1 into a new type, diag(T_1), that acts on T_2 by coordinatewise scalar multiplication.
Definition 1 is inspired by how physicists have carved the world into an orthogonal basis of meters, amps, volts etc. The analogy is not perfect: e.g. squaring maps meters to square-meters, whereas types are invariant to coordinatewise operations. Types are looser than physical units.
2.2 Motivating examples
We build intuition by recasting PCA and feedforward neural nets from a type perspective.
Principal component analysis (PCA).
Let X denote a matrix whose columns are datapoints x_i. PCA factorizes X Xᵀ = Pᵀ D P, where P is an orthogonal matrix and the diagonal matrix D contains the eigenvalues of X Xᵀ. A common application of PCA is dimensionality reduction. From a type perspective, this consists in:
x ↦ Pᵀ ∘ Π_k ∘ P (x),   (5)
(i) transforming the standard orthogonal basis of the data into the latent type given by the rows of P; (ii) projecting onto a subtype via Π_k, which keeps a subset of coordinates in the latent type; and (iii) applying the inverse Pᵀ to recover the original type.
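The three steps can be sketched in a few lines of numpy (an illustrative example, not the paper's code; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 datapoints in R^5, one per row
X -= X.mean(axis=0)             # center the data

# Eigendecomposition of the covariance gives the latent type:
# rows of P form an orthogonal basis of principal directions.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
P = eigvecs[:, ::-1].T                   # rows sorted by decreasing eigenvalue

k = 2
Z = X @ P.T          # (i) transform each datapoint to the latent type
Z[:, k:] = 0.0       # (ii) project onto a subtype (first k coordinates)
X_hat = Z @ P        # (iii) inverse transform back to the original type

# P is a type transform: an orthogonal matrix.
assert np.allclose(P @ P.T, np.eye(5), atol=1e-8)
```

The final assertion checks the defining property of a type transform: the latent basis is orthogonal, so the inverse in step (iii) is simply the transpose.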
Feedforward nets.
The basic feedforward architecture is stacked layers computing x_out = f(W · x_in), where f is a nonlinearity applied coordinatewise. We present two descriptions of the computation.
The standard description is in terms of dot-products. Rows of W correspond to features, and matrix multiplication is a collection of dot-products that measure the similarity between the input and the row-features:
x_out = f( (⟨w_{i,•}, x_in⟩)_{i=1}^{d} ).   (6)
Types provide a finer-grained description. Factorize W by singular value decomposition into W = U S Vᵀ, where S is a diagonal matrix of singular values and U, V are orthogonal matrices. The layer-computation can be rewritten as x_out = f(U S Vᵀ · x_in). From a type-perspective, the layer thus:
x_in ↦ Vᵀ x_in ↦ S Vᵀ x_in ↦ U S Vᵀ x_in ↦ f(U S Vᵀ x_in),   (7)
(i) transforms x_in to a latent type; (ii) applies coordinatewise scalar multiplication to the latent type; (iii) transforms the result to the output type; and (iv) applies a coordinatewise nonlinearity. Feedforward nets learn interleaved sequences of type transforms and unary, type-preserving operations.
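The two descriptions compute the same quantity, which a short numpy sketch (illustrative, with our variable names) verifies by comparing a direct matrix multiply against the SVD factors applied step by step:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))     # layer weights
x = rng.normal(size=6)          # an input vector

U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Standard view: one matrix multiply (one dot-product per row of W).
y = W @ x

# Type view: transform to the latent type (Vt), scale coordinatewise (s),
# then transform to the output type (U).
y_typed = U @ (s * (Vt @ x))

assert np.allclose(y, y_typed)
```

The coordinatewise nonlinearity f would be applied identically to both `y` and `y_typed`, so it is omitted from the comparison.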
2.3 Incoherent features in classical RNNs
There is a subtle inconsistency in classical RNN designs that leads to incoherent features. Consider the updates:
h_t = f(V h_{t−1} + W x_t).   (8)
We drop the nonlinearity, since the inconsistency is already visible in the linear case. Setting f to the identity and unfolding Eq. (8) over time obtains
h_t = Σ_{i=1}^{t} V^{t−i} W x_i.   (9)
The inconsistency can be seen via dot-products and via types. From the dot-product perspective, observe that multiplying an input by a matrix squared yields
(V² x)_i = Σ_{j=1}^{d} ⟨v_{i,•}, v_{•,j}⟩ · x_j,   (10)
where v_{i,•} refers to rows of V and v_{•,j} to columns. Each coordinate of V²x is computed by measuring the similarity of a row of V to all of its columns, and then measuring the similarity of the result to x. In short, features are tangled and uninterpretable.
From a type perspective, apply an SVD to V = P S Qᵀ and observe that V² = P S (Qᵀ P) S Qᵀ. Each multiplication by Qᵀ or P transforms the input to a new type, obtaining
V x = P S Qᵀ x  whereas  V² x = P S (Qᵀ P) S Qᵀ x.   (11)
Thus V expresses its output in the basis given by P, whereas V² interposes the additional rotation Qᵀ P between the two coordinatewise scalings. Adding terms involving V and V², as in Eq. (9), entails adding vectors expressed in different orthogonal bases – which is analogous to adding joules to volts. The same problem applies to LSTMs and GRUs.
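The basis mismatch is easy to exhibit numerically. The following numpy sketch (ours, for illustration) takes the SVD of a generic square matrix and checks that the rotation sandwiched inside the squared matrix is not the identity, so the two applications of the matrix read their inputs in different latent bases:

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.normal(size=(3, 3))          # a generic recurrent weight matrix
U, s, Qt = np.linalg.svd(V)          # V = U diag(s) Qt

# Sanity check on the factorization.
assert np.allclose(U @ np.diag(s) @ Qt, V)

# V^2 = U diag(s) (Qt U) diag(s) Qt: between the two coordinatewise
# scalings the features are rotated by Qt @ U, which for a generic
# (non-symmetric) matrix is not the identity.
basis_mismatch = Qt @ U
assert not np.allclose(basis_mismatch, np.eye(3), atol=1e-6)
```

For a symmetric matrix the left and right singular bases coincide and the mismatch vanishes, which is exactly the type-preserving case discussed next.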
Two recent papers provide empirical evidence that recurrent (horizontal) connections are problematic even after gradients are stabilized: (Zaremba et al., 2015) find that Dropout performs better when restricted to vertical connections, and (Laurent et al., 2015) find that Batch Normalization (Ioffe & Szegedy, 2015) fails unless restricted to vertical connections. More precisely, (Laurent et al., 2015) find that Batch Normalization improves training but not test error when restricted to vertical connections; it fails completely when also applied to horizontal connections.
Code using GOTO can be perfectly correct, and RNNs with type mismatches can achieve outstanding performance. Nevertheless, both lead to spaghetti-like information/gradient flows that are hard to reason about.
Type-preserving transforms.
One way to resolve the type inconsistency, which we do not pursue in this paper, is to use symmetric weight matrices, so that V = P D Pᵀ where P is orthogonal and D is diagonal. From the dot-product perspective,
V x = P D Pᵀ x,   (12)
which has the simple interpretation that x is amplified (or dampened) by D in the latent type provided by P. From the type-perspective, multiplication by V is type-preserving:
Vⁿ = P Dⁿ Pᵀ for all n,   (13)
so addition is always performed in the same basis.
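To make this concrete, a small numpy sketch (illustrative, with our variable names) verifies that a symmetric matrix acts by coordinatewise scaling in a single latent basis, so repeated applications stay in that basis:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 3))
V = (A + A.T) / 2                    # a symmetric weight matrix
d, P = np.linalg.eigh(V)             # V = P diag(d) P^T, columns of P orthonormal

x = rng.normal(size=3)

# Multiplication by V amplifies/dampens x coordinatewise in the latent
# type given by P, then returns to the same basis: type-preserving.
assert np.allclose(V @ x, P @ (d * (P.T @ x)))

# Repeated application stays in one basis: V^2 = P diag(d^2) P^T,
# so adding terms in V and V^2 adds vectors of the same type.
assert np.allclose(V @ (V @ x), P @ (d ** 2 * (P.T @ x)))
```

Contrast this with the generic (non-symmetric) case above, where squaring introduces a rotation between the two scalings.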
A familiar example of type-preserving transforms is autoencoders – under the constraint that the decoder is the transpose of the encoder. Finally, (Moczulski et al., 2015) propose to accelerate matrix computations in feedforward nets by interleaving diagonal matrices, A and D, with the orthogonal discrete cosine transform, C. The resulting transform, A C D C⁻¹, is type-preserving.
3 Recurrent Neural Networks
We present three strongly-typed RNNs that purposefully mimic classical RNNs as closely as possible. Perhaps surprisingly, the tweaks introduced below have deep structural implications, yielding architectures that are significantly easier to reason about; see sections 3.3 and 3.4.
3.1 Weakly-Typed RNNs
We first pause to note that many classical architectures are weakly-typed. That is, they introduce constraints or restrictions on off-diagonal operations on recurrent states.
The memory cell in LSTMs is only updated coordinatewise and is therefore well-behaved type-theoretically – although the overall architecture is not type-consistent. The gating operation in GRUs reduces type-inconsistencies by discouraging (i.e. zeroing out) unnecessary recurrent information flows.
SCRNs, or Structurally Constrained Recurrent Networks (Mikolov et al., 2015), add a type-consistent state layer:
s_t = (1 − α) · B x_t + α · s_{t−1}.   (14)
In MUT1, the best performing architecture in (Jozefowicz et al., 2015), most of the update equations are well-typed, although the gating is not. Finally, IRNNs initialize their recurrent connections as the identity matrix (Le et al., 2015). In other words, the key idea is a type-consistent initialization.
3.2 Strongly-Typed RNNs
The vanilla strongly-typed RNN is
        z_t = W x_t                               (15)
TRNN:   f_t = σ(V x_t + b)                        (16)
        h_t = f_t ⊙ h_{t−1} + (1 − f_t) ⊙ z_t     (17)
The TRNN has similar parameters to a vanilla RNN, Eq. (8), although their roles have changed. A nonlinearity for z_t is not necessary because: (i) gradients do not explode (corollary 2), so no squashing is needed; and (ii) coordinatewise multiplication by f_t introduces a nonlinearity. Whereas relus are binary gates (0 if the input is negative, 1 otherwise), the forget gate f_t is a continuous multiplicative gate on z_t.
Replacing the horizontal connection V h_{t−1} with a vertically controlled gate, Eq. (16), stabilizes the type-structure across time steps. Line for line, the type structure is:
(18)
We refer to lines (15) and (16) as learnware since they have parameters (W, V, b). Line (17) is firmware since it has no parameters. The firmware depends on the previous state h_{t−1}, unlike the learnware, which is stateless. See section 3.4 for more on learnware and firmware.
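As a concrete illustration, here is a minimal numpy implementation of one TRNN step. It assumes the update rules described above, namely z_t = W x_t, f_t = σ(V x_t + b) and h_t = f_t ⊙ h_{t−1} + (1 − f_t) ⊙ z_t; the function and variable names are ours:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def trnn_step(x, h_prev, W, V, b):
    """One TRNN step: the learnware (z, f) depends only on the input x;
    the firmware update of the state h is purely coordinatewise."""
    z = W @ x                          # typed feature, no squashing nonlinearity
    f = sigmoid(V @ x + b)             # vertically controlled forget gate
    return f * h_prev + (1 - f) * z    # coordinatewise state update

rng = np.random.default_rng(4)
d_in, d_h = 5, 4
W = rng.normal(size=(d_h, d_in))
V = rng.normal(size=(d_h, d_in))
b = np.zeros(d_h)

h = np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):  # run over a short input sequence
    h = trnn_step(x, h, W, V, b)

assert h.shape == (d_h,) and np.all(np.isfinite(h))
```

Note that no matrix ever multiplies the state: the only state-dependent operation is the coordinatewise gate, which is what keeps the type structure stable across time steps.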
Strongly-typed LSTMs
differ from LSTMs in two respects: (i) x_{t−1} is substituted for h_{t−1} in the first three equations so that the type structure is coherent; and (ii) the nonlinearities in z_t and h_t are removed, as for the TRNN.
        f_t = σ(W_f x_t + V_f h_{t−1} + b_f)      (19)
        o_t = σ(W_o x_t + V_o h_{t−1} + b_o)      (20)
LSTM:   z_t = τ(W_z x_t + V_z h_{t−1} + b_z)      (21)
        c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ z_t     (22)
        h_t = o_t ⊙ τ(c_t)                        (23)

        f_t = σ(W_f x_t + V_f x_{t−1} + b_f)      (25)
        o_t = σ(W_o x_t + V_o x_{t−1} + b_o)      (26)
TLSTM:  z_t = W_z x_t + V_z x_{t−1} + b_z         (27)
        c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ z_t     (28)
        h_t = c_t ⊙ o_t                           (29)
We drop the input gate from the updates for simplicity; see (Greff et al., 2015). The type structure is
(31) 
Strongly-typed GRUs
adapt GRUs similarly to how LSTMs were modified. In addition, the reset gate is repurposed; it is no longer needed for weak typing.
        z_t = σ(W_z x_t + V_z h_{t−1} + b_z)                (32)
GRU:    r_t = σ(W_r x_t + V_r h_{t−1} + b_r)                (33)
        h̃_t = τ(W_h x_t + V_h (r_t ⊙ h_{t−1}) + b_h)        (34)
        h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t               (35)
(37)  
TGRU  (38)  
(39)  
(40) 
The type structure is
(42) 
3.3 Feature Semantics
The output of a vanilla RNN expands as the uninterpretable
h_t = f(W x_t + V f(W x_{t−1} + V f(W x_{t−2} + ⋯))),   (43)
with an even less interpretable gradient. Similar considerations hold for LSTMs and GRUs. Fortunately, the situation is more amenable for strongly-typed architectures. In fact, their semantics are related to average-pooled convolutions.
Convolutions.
Applying a one-dimensional convolution to input sequence ⟨x_1, …, x_T⟩ yields output sequence
⟨z_1, …, z_T⟩ = conv(⟨x_1, …, x_T⟩).   (44)
Given weights α_i associated with the z_i, average-pooling yields h = Σ_{i=1}^{T} α_i · z_i. A special case is when the convolution applies the same matrix W to every input:
h = Σ_{i=1}^{T} α_i · (W x_i).   (45)
The average-pooled convolution is then a weighted average of the features extracted from the input sequence.
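With scalar pooling weights for simplicity, the shared-matrix special case can be checked in a short numpy sketch (illustrative): by linearity, average-pooling the per-step features equals extracting features from the weighted-average input.

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))      # the same matrix applied to every input
xs = rng.normal(size=(6, 4))     # input sequence x_1, ..., x_T (one per row)

alphas = rng.random(6)
alphas /= alphas.sum()           # pooling weights summing to one

# Average-pooled convolution: a weighted average of per-step features W x_i.
pooled = sum(a * (W @ x) for a, x in zip(alphas, xs))

# Equivalently (by linearity), apply W once to the weighted-average input.
assert np.allclose(pooled, W @ (alphas @ xs))
```

The strongly-typed RNNs below replace the fixed weights `alphas` with weights computed dynamically from the input sequence itself.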
Dynamic temporal convolutions.
We now show that strongly-typed RNNs are one-dimensional temporal convolutions with dynamic average-pooling. Informally, strongly-typed RNNs transform input sequences into a weighted average of features extracted from the sequence,
h_t = Σ_{i=1}^{t} α_i ⊙ (W x_i),   (46)
where the weights α_i depend on the sequence. In detail:
Theorem 1 (feature semantics via dynamic convolutions).
Strongly-typed features are computed explicitly as follows.

TRNN. The output is h_t = Σ_{i=1}^{t} α_i ⊙ z_i, where
α_i = (1 − f_i) ⊙ ∏_{j=i+1}^{t} f_j  and  z_i = W x_i.   (47)

TLSTM. Let W̃ and x̃_t denote the vertical concatenation of the weight matrices and input vectors respectively. Then
c_t = Σ_{i=1}^{t} α_i ⊙ z_i  and  h_t = c_t ⊙ o_t,   (48)
where α_i is defined as above.

TGRU. Using the notation above,
(49) where
(50)
Proof.
Direct computation. ∎
In summary, TRNNs compute a dynamic distribution over time steps, and then compute the expected feedforward features over that distribution. TLSTMs store expectations in private memory cells that are reweighted by the output gate when publicly broadcast. Finally, TGRUs drop the requirement that the average is an expectation, and also incorporate the output gate into the memory updates.
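The TRNN case of theorem 1 can be verified numerically. The sketch below (ours, assuming the update h_t = f_t ⊙ h_{t−1} + (1 − f_t) ⊙ W x_t with f_t = σ(V x_t + b)) checks that running the recurrence matches an explicit pool with weights α_i = (1 − f_i) ⊙ ∏_{j>i} f_j:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(6)
T, d_in, d_h = 8, 5, 4
W = rng.normal(size=(d_h, d_in))
V = rng.normal(size=(d_h, d_in))
b = np.zeros(d_h)
xs = rng.normal(size=(T, d_in))

# Run the assumed TRNN recurrence.
h = np.zeros(d_h)
for x in xs:
    f = sigmoid(V @ x + b)
    h = f * h + (1 - f) * (W @ x)

# Unfold it as a dynamic average-pool over per-step features z_i = W x_i,
# with coordinatewise weight (1 - f_i) * prod_{j>i} f_j on step i.
fs = sigmoid(xs @ V.T + b)               # all gates at once, shape (T, d_h)
h_pooled = np.zeros(d_h)
for i in range(T):
    alpha_i = (1 - fs[i]) * np.prod(fs[i + 1:], axis=0)
    h_pooled += alpha_i * (W @ xs[i])

assert np.allclose(h, h_pooled)
```

Since each coordinate's weights (1 − f_i) ∏_{j>i} f_j lie in [0, 1], the pooled features stay bounded by the largest per-step feature, which is one way to see why these gradients do not explode.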
Strongly-typed gradients are straightforward to compute and interpret:
Corollary 2 (gradient semantics).
The strongly-typed gradients are

TRNN:
(51) (52) and similarly for .

TLSTM:
(53) (54) (55) 
TGRU:
(56) (57) (58)
It follows immediately that gradients will not explode for TRNNs or TLSTMs. Empirically, we find they also behave well for TGRUs.
3.4 Feature Algebra
A vanilla RNN can approximate any continuous state update, since networks of the form f(W x) with a nonpolynomial nonlinear function f are dense in the space of continuous functions (Leshno et al., 1993). It follows that vanilla RNNs can approximate any recursively computable partial function (Siegelmann & Sontag, 1995).
Strongly-typed RNNs are more constrained. We show that the constraints reflect a coherent design philosophy and are less severe than they appear.
The learnware / firmware distinction.
Strongly-typed architectures factorize into stateless learnware and state-dependent firmware. For example, TLSTMs and TGRUs factorize as follows (a superficially similar factorization holds for GRUs and LSTMs; however, their learnware is state-dependent, since the gates depend on the previous state):
(59)  
(60)  
(61)  
(62) 
Firmware decomposes coordinatewise, which prevents side-effects from interacting: e.g. for TGRUs
(63)  
(64) 
and similarly for TLSTMs. Learnware is stateless; it has no side-effects and does not decompose coordinatewise. Evidence that side-effects are a problem for LSTMs can be found in (Zaremba et al., 2015) and (Laurent et al., 2015), which show that Dropout and Batch Normalization, respectively, need to be restricted to vertical connections.
In short, under strong typing the learnware carves out features which the firmware uses to perform coordinatewise state updates. Vanilla RNNs allow arbitrary state updates. LSTMs and GRUs restrict state updates, but allow arbitrary functions of the state. Translated from a continuous to a discrete setting, the distinction between strongly-typed and classical architectures is analogous to working with binary logic gates (AND, OR) on variables learned by the vertical connections – versus working directly with n-ary boolean operations.
Representational power.
Motivated by the above, we show that a minimal strongly-typed architecture can span the space of continuous binary functions on features.
Theorem 3 (approximating binary functions).
The strongly-typed minimal RNN with updates
(65)
and suitable parameters can approximate any set of continuous binary functions on features.
Proof sketch. Let be a feature of interest. Combining (Leshno et al., 1993) with the observation that for implies that . As many weighted copies of as necessary are obtained by adding rows to that are scalar multiples of .
Any set of binary functions on any collection of features can thus be approximated. Finally, vertical connections can approximate any set of features (Leshno et al., 1993).
4 Experiments
Model  vanilla RNN  TRNN  

Layers  1  2  3  1  2  3 
64 (no dropout)  (1.365, 1.435)  (1.347, 1.417)  (1.353, 1.423)  (1.371, 1.452)  (1.323, 1.409)  (1.342, 1.423) 
256  (1.215, 1.274)  (1.242, 1.254)  (1.257, 1.273)  (1.300, 1.398)  (1.251, 1.276)  (1.233, 1.266) 
Model  LSTM  TLSTM  

Layers  1  2  3  1  2  3 
64 (no dropout)  (1.496, 1.560)  (1.485, 1.557)  (1.500, 1.563)  (1.462, 1.511)  (1.367, 1.432)  (1.369, 1.434) 
256  (1.237, 1.251)  (1.098, 1.193)  (1.185, 1.213)  (1.254, 1.273)  (1.045, 1.189)  (1.167, 1.198) 
Model  GRU  TGRU  

Layers  1  2  3  1  2  3 
64 (no dropout)  (1.349, 1.435)  (1.432, 1.503)  (1.445, 1.559)  (1.518, 1.569)  (1.337, 1.422)  (1.377, 1.436) 
256  (1.083, 1.226)  (1.163, 1.214)  (1.219, 1.227)  (1.142, 1.296)  (1.208, 1.240)  (1.216, 1.212) 
We investigated the empirical performance of strongly-typed recurrent nets for sequence learning, evaluated on character-level and word-level text generation. We conducted a set of proof-of-concept experiments. The goal is not to compete with previous work or to find the best-performing model under a specific hyperparameter setting. Rather, we investigate how the two classes of architectures perform over a range of settings.
4.1 Character-level Text Generation
The first task is to generate text from a sequence of characters by predicting the next character in a sequence. We used Leo Tolstoy’s War and Peace (WP), which consists of 3,258,246 characters of English text, split into train/val/test sets with 80/10/10 ratios. The characters are encoded into one-hot vectors whose dimension is the size of the vocabulary. We follow the experimental setting proposed in (Karpathy et al., 2015). Results are reported for two configurations: “64” and “256”, which are models with the same number of parameters as a 1-layer LSTM with 64 and 256 cells per layer respectively. Dropout regularization was only applied to the “256” models; the dropout rate was selected based on validation performance. Tables 2 and 3 summarize the performance in terms of cross-entropy loss.
We observe that the training error of strongly-typed models is typically lower than that of the standard models. The test errors of the two architectures are comparable. However, our results (for both classical and typed models) fail to match those reported in (Karpathy et al., 2015), where a more extensive parameter search was performed.
Model  Train  Validation  Test 

small, no dropout  
vanilla RNN  416.50  442.31  432.01 
TRNN  58.66  172.47  169.33 
LSTM  36.72  122.47  117.25 
TLSTM  28.15  215.71  200.39 
GRU  31.14  179.47  173.27 
TGRU  28.57  207.94  195.82 
medium, with dropout  
LSTM (Zaremba et al., 2015)  48.45  86.16  82.70 
LSTM (3layer)  71.76  98.22  97.87 
TLSTM  50.21  87.36  82.71 
TLSTM (3layer)  51.45  85.98  81.52 
GRU  65.80  97.24  93.44 
TGRU  55.31  121.39  113.85 
4.2 Word-level Text Generation
The second task was to generate word-level text by predicting the next word in a sequence. We used the Penn Treebank (PTB) dataset (Marcus et al., 1993), which consists of 929K training words, 73K validation words, and 82K test words, with a vocabulary of 10K words. The PTB dataset is publicly available online.^{3}^{3}3http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
We followed the experimental setting in (Zaremba et al., 2015) and compared the performance of “small” and “medium” models, with parameter counts matched to the correspondingly-sized LSTMs in that work. For the “medium” models, we selected the dropout rate from {0.4, 0.5, 0.6} according to validation performance. Single-run performance, measured via perplexity, i.e. the exponential of the cross-entropy, is reported in Table 4.
Perplexity.
For the “small” models, we found that the training perplexity of strongly-typed models is consistently lower than that of their classical counterparts, in line with the results for War & Peace. Test error was significantly worse for the strongly-typed architectures. A possible explanation for both observations is that strongly-typed architectures require more extensive regularization.
An intriguing result is that the TRNN performs in the same ballpark as LSTMs, with perplexity within a factor of two. By contrast, the vanilla RNN fails to achieve competitive performance. This suggests there may be strongly-typed architectures of intermediate complexity between RNNs and LSTMs with performance comparable to LSTMs.
The dropout-regularized “medium” TLSTM matches the LSTM performance reported in (Zaremba et al., 2015). The 3-layer TLSTM obtains slightly better performance. The results were obtained with almost identical hyperparameters to Zaremba et al. (only the learning rate decay was altered), suggesting that TLSTMs are viable alternatives to LSTMs for sequence learning tasks when properly regularized. Strongly-typed GRUs did not match the performance of GRUs, possibly due to insufficient regularization.
Gradients.
We investigated the effect of removing gradient clipping on the medium-sized LSTM and TLSTM. TLSTM gradients are well-behaved without clipping, although test performance is not competitive. In contrast, LSTM gradients explode without clipping and the architecture is unusable. It is possible that carefully initialized TLSTMs may be competitive without clipping. We defer the question to future work.
Runtime.
Since strongly-typed RNNs have fewer nonlinearities than standard RNNs, we expect them to have lower computational complexity. Training on the PTB dataset on an NVIDIA GTX 980 GPU, we found that the TLSTM trains faster than the LSTM on average; similarly, the TGRU trains faster than the GRU on average.
5 Conclusions
RNNs are increasingly important tools for speech recognition, natural language processing and other sequential learning problems. The complicated structure of LSTMs and GRUs has led to searches for simpler alternatives with limited success (Bayer et al., 2009; Greff et al., 2015; Jozefowicz et al., 2015; Le et al., 2015; Mikolov et al., 2015). This paper introduces strong typing as a tool to guide the search for alternate architectures. In particular, we suggest searching for update equations that learn well-behaved features, rather than update equations that “appear simple”. We draw on two disparate intuitions that turn out to be surprisingly compatible: (i) that neural networks are analogous to measuring devices (Balduzzi, 2012) and (ii) that training an RNN is analogous to writing code.
The main contribution is a new definition of type that is closely related to singular value decomposition – and is thus well-suited to deep learning. It turns out that classical RNNs are badly behaved from a type perspective, which motivates modifying the architectures. Section 3 tweaked LSTMs and GRUs to make them well-behaved from a typing and functional programming perspective, yielding features and gradients that are easier to reason about than their classical counterparts.
Strong typing has implications for the depth of RNNs. It was pointed out in (Pascanu et al., 2014) that unfolding horizontal connections over time means the concept of depth is not straightforward in classical RNNs. By contrast, depth has the same meaning in strongly-typed architectures as in feedforward nets, since vertical connections learn features and horizontal connections act coordinatewise.
Experiments in section 4 show that strongly-typed RNNs achieve comparable generalization performance to classical architectures when regularized with dropout, and have consistently lower training error. It is important to emphasize that the experiments are not conclusive. Firstly, we did not deviate far from settings optimized for classical RNNs when training strongly-typed RNNs. Secondly, the architectures were chosen to be as close as possible to classical RNNs. A more thorough exploration of the space of strongly-typed nets may yield better results.
Towards machine reasoning.
A definition of machine reasoning, adapted from (Bottou, 2014), is “algebraically manipulating features to answer a question”. Hard-won experience in physics (Chang, 2004), software engineering (Dijkstra, 1968), and other fields has led to the conclusion that well-chosen constraints are crucial to effective reasoning. Indeed, neural Turing machines (Graves et al., 2014) are harder to train than more constrained architectures such as neural queues and deques (Grefenstette et al., 2015).
Strongly-typed features have a consistent semantics (theorem 1), unlike features in classical RNNs, which are rotated across time steps and are consequently difficult to reason about. We hypothesize that strong typing will provide a solid foundation for algebraic operations on learned features. Strong typing may then provide a useful organizing principle in future machine reasoning systems.
Acknowledgements.
We thank Tony Butler-Yeoman, Marcus Frean, Theofanis Karaletsos, JP Lewis and Brian McWilliams for useful comments and discussions.
References
 Balduzzi (2012) Balduzzi, D. On the informationtheoretic structure of distributed measurements. Elect. Proc. in Theor. Comp. Sci., 88:28–42, 2012.
 Bayer et al. (2009) Bayer, Justin, Wierstra, Daan, Togelius, Julian, and Schmidhuber, Juergen. Evolving memory cell structures for sequence learning. In ICANN, 2009.
 Bengio et al. (1994) Bengio, Yoshua, Simard, P, and Frasconi, P. Learning longterm dependencies with gradient descent is difficult. IEEE Trans. Neur. Net., 5(2):157–166, 1994.
 Bottou (2014) Bottou, Léon. From machine learning to machine reasoning: An essay. Machine Learning, 94:133–149, 2014.
 Bridgman (1922) Bridgman, P W. Dimensional analysis. Yale University Press, 1922.
 Chang (2004) Chang, Hasok. Inventing Temperature: Measurement and Scientific Progress. Oxford University Press, 2004.
 Cho et al. (2014) Cho, Kyunghyun, van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In EMNLP, 2014.
 Dijkstra (1968) Dijkstra, Edsger W. Go To Statement Considered Harmful. Comm. ACM, 11(3):147–148, 1968.
 Girard (1989) Girard, JeanYves. Proofs and Types. Cambridge University Press, 1989.
 Graves et al. (2013) Graves, Alex, Mohamed, A, and Hinton, GE. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
 Graves et al. (2014) Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing Machines. In arXiv:1410.5401, 2014.
 Grefenstette et al. (2015) Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to Transduce with Unbounded Memory. In Adv in Neural Information Processing Systems (NIPS), 2015.
 Greff et al. (2015) Greff, Klaus, Srivastava, Rupesh Kumar, Koutník, Jan, Steunebrink, Bas R, and Schmidhuber, Juergen. LSTM: A Search Space Odyssey. In arXiv:1503.04069, 2015.
 Gregor et al. (2015) Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan. DRAW: A Recurrent Neural Network For Image Generation. In ICML, 2015.
 Hart (1995) Hart, George W. Multidimensional Analysis: Algebras and Systems for Science and Engineering. Springer, 1995.
 Hochreiter & Schmidhuber (1997) Hochreiter, S and Schmidhuber, J. Long ShortTerm Memory. Neural Comp, 9:1735–1780, 1997.
 Hochreiter (1991) Hochreiter, Sepp. Untersuchungen zu dynamischen neuronalen Netzen. Master’s thesis, Technische Universität München, 1991.
 Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv:1502.03167, 2015.
 Jozefowicz et al. (2015) Jozefowicz, Rafal, Zaremba, Wojciech, and Sutskever, Ilya. An Empirical Exploration of Recurrent Network Architectures. In ICML, 2015.
 Karpathy & Fei-Fei (2015) Karpathy, Andrej and Fei-Fei, Li. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR, 2015.
 Karpathy et al. (2015) Karpathy, Andrej, Johnson, Justin, and FeiFei, Li. Visualizing and understanding recurrent neural networks. In arXiv:1506.02078, 2015.
 Laurent et al. (2015) Laurent, C, Pereyra, G, Brakel, P, Zhang, Y, and Bengio, Yoshua. Batch Normalized Recurrent Neural Networks. In arXiv:1510.01378, 2015.
 Le et al. (2015) Le, Quoc, Jaitly, Navdeep, and Hinton, Geoffrey. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. In arXiv:1504.00941, 2015.

 Leshno et al. (1993) Leshno, Moshe, Lin, Vladimir Ya., Pinkus, Allan, and Schocken, Shimon. Multilayer Feedforward Networks With a Nonpolynomial Activation Function Can Approximate Any Function. Neural Networks, 6:861–867, 1993.
 Macedo & Oliveira (2013) Macedo, Hugo Daniel and Oliveira, José Nuno. Typing linear algebra: A biproduct-oriented approach. Science of Computer Programming, 78(11):2160–2191, 2013.
 Marcus et al. (1993) Marcus, Mitchell P., Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Comp. Linguistics, 19(2):313–330, 1993.
 Mikolov et al. (2015) Mikolov, Tomas, Joulin, Armand, Chopra, Sumit, Mathieu, Michael, and Ranzato, Marc’Aurelio. Learning Longer Memory in Recurrent Neural Networks. In ICLR, 2015.
 Moczulski et al. (2015) Moczulski, Marin, Denil, Misha, Appleyard, Jeremy, and de Freitas, Nando. ACDC: A Structured Efficient Linear Layer. In arXiv:1511.05946, 2015.
 Olah (2015) Olah, Christopher. Neural Networks, Types, and Functional Programming, 2015. URL http://colah.github.io/posts/2015-09-NN-Types-FP/.
 Pascanu et al. (2013) Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In ICML, 2013.
 Pascanu et al. (2014) Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Bengio, Yoshua. How to Construct Deep Recurrent Networks. In ICLR, 2014.
 Pierce (2002) Pierce, Benjamin C. Types and Programming Languages. MIT Press, 2002.
 Reynolds (1974) Reynolds, J C. Towards a theory of type structure. In Paris colloquium on programming, volume 19 of LNCS. Springer, 1974.
 Siegelmann & Sontag (1995) Siegelmann, Hava and Sontag, Eduardo. On the Computational Power of Neural Nets. Journal of Computer and System Sciences, 50:132–150, 1995.
 Sutskever et al. (2014) Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc. Sequence to Sequence Learning with Neural Networks. In Adv in Neural Information Processing Systems (NIPS), 2014.
 Vinyals et al. (2015) Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. In CVPR, 2015.
 Zaremba et al. (2015) Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent Neural Network Regularization. In arXiv:1409.2329, 2015.