Interest in applying neural networks to decoding has existed since the 1980s [6, 7, 8]. These early works, however, did not have a substantial impact on the field due to the limitations of the networks that were available at the time. More recently, deep neural networks were studied in [4, 5, 2].
A major challenge facing applications of deep networks to decoding is the avoidance of overfitting the codewords encountered during training. Specifically, training data is typically produced by randomly selecting codewords and simulating the channel transitions. Due to the large number of codewords (exponential in the block length), it is impossible to account for even a small fraction of them during training, leading to poor generalization of the network to new codewords. This issue was a major obstacle in [4, 5], constraining their networks to very short block lengths.
Nachmani et al. [2] proposed a deep learning framework which is modeled on the LDPC belief propagation (BP) decoder, and is robust to overfitting. A drawback of their design, however, is that to preserve symmetry, the design is constrained to closely mimic the message-passing structure of BP. Specifically, the connections between neurons are controlled to resemble BP’s underlying Tanner graph, as are the activations at neurons. This severely limits the freedom available to the neural network’s design, and precludes the application of powerful architectures that have emerged in recent years.
In this paper, we present a method which overcomes this drawback while maintaining the resilience to overfitting of [2]. Our framework allows unconstrained neural network design, paving the way to the application of powerful neural network designs that have emerged in recent years. Central to our approach is a preprocessing step, which extracts from the channel output only the reliabilities (absolute values) and the syndrome of its hard decisions, and feeds them into the neural network. The network’s output is later combined with the channel output to produce an estimate of the transmitted codeword.
Decoding methods that focus on the syndrome are well known in the literature on algebraic decoding (see e.g. [12, Sec. 3.2]). The approach decouples the estimation of the channel noise from that of the transmitted codeword. In our context of deep learning, its potential lies in the elimination of the need to simulate codewords during training, thus overcoming the overfitting problem. A few early works that used shallow neural networks have employed syndromes (e.g. ). However, these works did not discuss the potential of syndromes in terms of overfitting, presumably because the problem was not as acute in their relatively simple networks. Importantly, their approach does not apply in cases where the channel includes reliabilities (mentioned above).
Our approach in this paper extends syndrome decoding to include channel reliabilities, and applies it to overcome the overfitting issue mentioned above. We provide a rigorous analysis which proves that our framework incurs no loss in optimality, in terms of bit error rate (BER) and mean square error (MSE). Our analysis utilizes some techniques developed by Burshtein et al. [14], Wiechman and Sason [13] and Richardson and Urbanke [11, Sec. 4.11].
Building on our analysis, we propose two deep neural network architectures for decoding of linear codes. The first is a vanilla multilayer network, and the second a more elaborate recurrent neural network (RNN) based architecture. We also develop a preprocessing technique (beyond the above-mentioned computation of the syndrome and reliabilities), which applies a permutation (an automorphism) to the decoder input to facilitate the operation of the neural network. This technique builds on ideas by Fossorier [9, 10] and Dimnik and Be’ery [17] but does not involve list decoding. Finally, we provide simulation results for decoding of BCH codes which, for the case of BCH(63,45), demonstrate performance that approaches that of the ordered statistics decoding (OSD) algorithm of [9, 10].
Summarizing, our main contributions are:
- A novel deep neural network training framework for decoding, which is robust to overfitting. It is based on the extension of syndrome decoding that is described below.
- An extension of syndrome decoding which accounts for reliabilities. As with legacy syndrome decoding, we define a generic framework, leaving room for a noise-estimation algorithm which is specified separately. We provide analysis that proves that the framework involves no loss of optimality, and that regardless of the noise-estimation algorithm, performance is invariant to the transmitted codeword.
- Two neural network designs for the noise-estimation algorithm, including an elaborate RNN-based architecture.
- A simple preprocessing technique that applies permutations to boost the decoder’s performance.
Our work is organized as follows. In Sec. II we introduce some notation and in Sec. III we provide some background on neural networks. In Sec. IV we describe our syndrome-based framework and provide an analysis of it. In Sec. V we discuss deep neural network architectures as well as preprocessing by permutation. In Sec. VI we present simulation results. Sec. VII concludes the paper.
We will often use the superscripts and (e.g., and ) to denote binary values (i.e., over the alphabet ) and bipolar values (i.e., over ), respectively. We define the mapping from the binary to the bipolar alphabet, denoted , by and let denote the inverse mapping. Note that the following identity holds:
where denotes XOR. and denote the sign and absolute value of any real-valued , respectively. In our analysis below, when applied to a vector, the operations , , and are assumed to be applied independently to each of the vector’s components.
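As a small illustration of this notation, the sketch below implements a binary-to-bipolar mapping and checks numerically that XOR in the binary domain corresponds to multiplication in the bipolar domain. The convention 0 → +1, 1 → −1 is the one under which this identity holds; the function names `to_bipolar`, `to_binary` and `sign`, and the handling of zero in `sign`, are assumptions made for illustration.

```python
from itertools import product

def to_bipolar(b):
    """Map a binary value in {0, 1} to a bipolar value in {+1, -1} (0 -> +1, 1 -> -1)."""
    return 1 - 2 * b

def to_binary(s):
    """Inverse mapping from {+1, -1} back to {0, 1}."""
    return (1 - s) // 2

def sign(v):
    """Sign of a real value; zero is treated as positive (an arbitrary choice)."""
    return 1 if v >= 0 else -1

# XOR of two bits maps to the product of their bipolar images.
identity_holds = all(
    to_bipolar(u ^ v) == to_bipolar(u) * to_bipolar(v)
    for u, v in product((0, 1), repeat=2)
)

# Componentwise application to a vector, as assumed in the analysis.
y = [0.7, -1.2, 0.1]
signs = [sign(v) for v in y]
reliabilities = [abs(v) for v in y]
```

Note that the opposite convention (0 → −1) would break the identity, since the XOR of two zeros would map to −1 while the product of their images would be +1.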
III Brief Overview of Deep Neural Networks
We now briefly describe the essentials of neural networks. Our discussion is in no way comprehensive, and there are many variations on the simple setup we describe. For an elaborate discussion, see e.g. [15].
Fig. 1 depicts a simple multilayered neural network. The network is a directed graph whose structure is the blueprint for an associated computational algorithm. The nodes are called neurons and each performs (i.e., is associated with) a simple computation on inputs to produce a single output. The algorithm inputs are fed into the first-layer neurons, whose outputs are fed as inputs into the next layer and so forth. Finally, the outputs of the last layer neurons become the algorithm output. Deep networks are simply neural networks with many layers.
The inputs and output at each neuron are real-valued numbers. To compute its output, each neuron first computes an affine function of its input (a weighted sum plus a bias). It then applies a predefined activation function, which is typically nonlinear (e.g., sigmoid or hyperbolic tangent), to render the output.
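The computation performed by a single neuron, and by a layer of neurons, can be sketched as follows. This is a minimal illustration of the affine-plus-activation rule described above; the function names and the example weights are hypothetical and do not correspond to any trained network.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """Affine function of the inputs (weighted sum plus bias) followed by an activation."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(pre_activation)

def layer(inputs, weight_rows, biases, activation=math.tanh):
    """A fully-connected layer: one neuron per row of weights, all sharing the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Two-layer example: the outputs of the first layer feed the second layer.
x = [0.5, -1.0]
hidden = layer(x, [[1.0, 0.0], [0.5, 0.5]], [0.0, 0.25])
output = layer(hidden, [[1.0, -1.0]], [0.0])
```

Training amounts to adjusting the entries of `weight_rows` and `biases` so that `output` approaches the desired labels.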
The power of neural networks lies in their configurability. Specifically, the weights and biases at each neuron are parameters which can be tailored to produce a diverse range of computations. Typically, the network is configured by a training procedure. This procedure relies on a sample dataset, consisting of inputs, and in the case of supervised training, of desired outputs (known as labels). There are several training paradigms available, most of which are variations of gradient descent.
Overfitting occurs typically when the training dataset is insufficiently large or diverse to be representative of all valid network inputs. The resulting network does not generalize well to inputs that were not encountered in the training set.
IV The Proposed Syndrome-Based Framework and its Analysis
IV-A Framework Definition
We begin by briefly discussing the encoder. We assume standard transmission that uses a linear code . We let denote the input message, which is mapped to a codeword (recall that the superscript denotes binary vectors). We assume that is mapped to a bipolar vector using the mapping defined in Sec. II, and is transmitted over the channel. We let denote the channel output. For convenience, we define , , and to be column vectors.
Our decoder framework is depicted in Figure 2. The decoder’s main component is , whose role is to estimate the channel noise, and which we will later (Section V) implement using a deep neural network. The discussion in this section, however, applies to arbitrary . The inputs to are the absolute value and the syndrome , where is a parity check matrix of the code , is a vector of hard decisions, where is simply the inverse of the above defined bipolar mapping. The multiplication by is modulo-2. The output of is multiplied (componentwise) by , which is the sign of . Finally, when interested in hard decisions, we take the sign of the results and define this to be the estimate . This final hard decision step (which is depicted in Fig. 2) is omitted when the system is required to produce soft decisions.
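The data flow of Fig. 2 can be sketched in a few lines. The sketch below uses a (7,4) Hamming code and a simple syndrome-lookup noise estimator in place of the neural network. The choice of code, the matrix `H`, all function names, the bipolar convention 0 → +1, and the toy estimator (which ignores the reliabilities) are assumptions made for illustration only.

```python
def sign(v):
    """Sign of a real value; zero is treated as positive (an arbitrary choice)."""
    return 1 if v >= 0 else -1

def syndrome(H, hard_bits):
    """Modulo-2 product of the parity check matrix and the hard-decision vector."""
    return [sum(h * b for h, b in zip(row, hard_bits)) % 2 for row in H]

def decode(y, H, noise_estimator, hard_output=True):
    """Syndrome-based framework: the estimator sees only |y| and the syndrome."""
    reliabilities = [abs(v) for v in y]
    signs = [sign(v) for v in y]
    hard_bits = [(1 - s) // 2 for s in signs]          # bipolar -> binary
    estimate = noise_estimator(reliabilities, syndrome(H, hard_bits))
    soft = [e * s for e, s in zip(estimate, signs)]    # componentwise product with sign(y)
    return [sign(v) for v in soft] if hard_output else soft

def make_lookup_estimator(H):
    """Toy estimator: flags the single position whose column of H matches the syndrome."""
    columns = [tuple(row[j] for row in H) for j in range(len(H[0]))]
    def estimator(reliabilities, s):
        out = [1.0] * len(reliabilities)
        if any(s):
            out[columns.index(tuple(s))] = -1.0        # estimated sign flip
        return out
    return estimator

# Parity check matrix of a (7,4) Hamming code (all columns distinct and nonzero).
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

# All-zero codeword (all +1 in bipolar form) with one unreliable, flipped position.
y = [0.9, 1.1, -0.2, 0.8, 1.3, 0.7, 1.0]
x_hat = decode(y, H, make_lookup_estimator(H))
```

In the framework of this paper the lookup estimator would be replaced by a trained network that also exploits the reliabilities.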
IV-B Binary-Input Symmetric-Output (BISO) Channels
Our analysis in the following section applies to a broad class of channels known as binary-input symmetric-output (BISO) channels. This class includes binary-input AWGN channels, binary-symmetric channels (BSCs) and many others. Our definition below follows Richardson and Urbanke [1, Definition 1].
Consider a memoryless channel with input alphabet . The channel is BISO if its transition probability function satisfies:
for all in the channel output alphabet, where and denote the random channel input and output (respectively).
An important feature of BISO channels is that their random transitions can be modeled by [1, proof of Lemma 1],
where is random noise which is independent of the transmitted . The tilde in serves to indicate that this is an equivalent statistical model, which might differ from the true physical one. To prove (3), we simply define to be a random variable distributed as and independent of . The validity now follows from (2).
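The multiplicative model of (3) can be made concrete for the binary-input AWGN channel. Writing the bipolar input as x, the output as y and the multiplicative noise as z (labels chosen here for illustration), the physical channel y = x + n can be rewritten as y = x · z with z = x · y = 1 + x · n, because x is ±1. By the symmetry of the Gaussian distribution, z has mean 1 and the same distribution regardless of x. The function names and the noise level below are assumptions.

```python
import random

def awgn_channel(x, sigma, rng):
    """Physical model: additive Gaussian noise on a bipolar input vector."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def multiplicative_noise(x, y):
    """Equivalent multiplicative noise: z_i = x_i * y_i, valid because x_i is +1 or -1."""
    return [xi * yi for xi, yi in zip(x, y)]

rng = random.Random(0)
x = [1, -1, -1, 1, 1]
y = awgn_channel(x, 0.5, rng)
z = multiplicative_noise(x, y)

# The channel output is recovered exactly as the componentwise product x * z.
reconstructed = [xi * zi for xi, zi in zip(x, z)]
```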
We now show that the decoder framework involves no penalty in performance in terms of the mean-squared-error (MSE) or bit error rate (BER) metrics. That is, the decoder can be designed to be optimal with respect to either of them. Importantly, the framework addresses the overfitting problem that was described in Sec. I.
Theorem 1. The following holds with respect to the framework of Sec. IV-A, assuming communication over a BISO channel:
1. The framework incurs no intrinsic loss of optimality, in the sense that with an appropriately designed , the decoder can achieve maximum a-posteriori (MAP) or minimum MSE (MMSE) decoding.
2. For any choice of , the decoder’s BER and MSE, conditioned on transmission of any codeword , are both invariant to .
We provide an outline of the proof here, and defer the details to Appendix A.
In Part 1 we neglect implementation concerns, and focus on realizations of that try to optimally estimate the multiplicative noise (see (3)). By (3), given , such estimation is equivalent to estimation of . We argue that the pair and is a sufficient statistic for estimation of . To see this, observe that is equivalent to the pair and . The latter term, in turn, is equivalent to the pair and , where is as defined in Sec. II and is a pseudo-inverse of the code’s generator matrix . By (1) and (3), and so is the sum of the transmitted message and (the projection of onto the code subspace). We argue that is independent of the noise and thus irrelevant to its estimation. This follows because is independent of , and we assume it to be uniformly distributed within the message space.
To prove Part 2 of the theorem, we allow to be arbitrary and show that the decoder’s output can be modeled as for some vector-valued function . Thus, its relationship with (which determines the BER and MSE) depends on the noise alone. To see why this holds, first observe that the inputs to (and consequently, its outputs) are dependent on the noise only. This follows because the syndrome equals . This in turns follows from the relation and the fact that is a codeword, and so . By (3) and the bipolarity of , the absolute value equals . It now follows that the output of is dependent on alone. Multiplication by (see Fig. 2) is equivalent to multiplying by (by (3)) and the result follows. ∎
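The key step of Part 2, namely that the inputs to the noise estimator depend on the noise alone, can be checked numerically on a toy code. The sketch below enumerates all codewords of a (7,4) Hamming code, applies the same multiplicative noise to each, and collects the resulting (reliabilities, syndrome) pairs. The code, the matrix `H`, the noise vector and the function names are assumptions made for illustration.

```python
from itertools import product

# Parity check matrix of a (7,4) Hamming code.
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

def syndrome(bits):
    """Modulo-2 product of H with a binary vector."""
    return tuple(sum(h * b for h, b in zip(row, bits)) % 2 for row in H)

# The codewords are exactly the binary vectors with an all-zero syndrome.
codewords = [c for c in product((0, 1), repeat=7) if not any(syndrome(c))]

def estimator_inputs(codeword, z):
    """Reliabilities and syndrome seen by the estimator when `codeword` is sent."""
    x = [1 - 2 * b for b in codeword]                 # binary -> bipolar
    y = [xi * zi for xi, zi in zip(x, z)]             # multiplicative noise model
    reliabilities = tuple(abs(v) for v in y)
    hard = [1 if v < 0 else 0 for v in y]
    return reliabilities, syndrome(hard)

# One fixed noise realization; negative entries correspond to sign flips.
z = [0.9, 1.1, -0.2, 0.8, -0.4, 0.7, 1.0]
distinct_inputs = {estimator_inputs(c, z) for c in codewords}
```

A single element in `distinct_inputs` means that every codeword produces identical estimator inputs, which is why training on a single codeword suffices.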
V Implementation using Deep Neural Networks
Theorem 1 proves that our framework does not intrinsically involve a loss of optimality. To realize its potential, we propose efficient implementations of the function . In this section, we discuss deep neural network implementations as well as a simple preprocessing technique that enables the networks to achieve improved performance.
V-A Deep Neural Network Architectures
We consider the following two architectures:
Vanilla Multi-Layer: With this architecture, the neural network contains an array of fully-connected layers as illustrated in Fig. 3. It closely resembles simple designs  with the exception that we feed the network inputs into each of the layers, in addition to the output of the previous layer (this idea is borrowed from the belief propagation algorithm).
The network includes 11 fully-connected layers, roughly equivalent to the 5 LDPC belief-propagation iterations as in 15]. Each of the first 10 layers consists of the same number of nodes ( for block length and for block length ). The final fully-connected layer has nodes and produces the network output, using a hyperbolic tangent activation function.
Recurrent Neural Network (RNN): With this architecture, we build a deep RNN by stacking multiple recurrent hidden states on top of each other  as illustrated in Fig. 4. RNNs realize a design which is equivalent to a multi-layer architecture by maintaining a network hidden state (memory) which is updated via feedback connections. Note that from a practical perspective, this renders the network more memory-efficient. In many applications, this structure is useful to enable temporal processing, and stacking RNNs as in Fig. 4 enables operation at different timescales. In our setting, this interpretation does not apply, but we nonetheless continue to refer to RNN layers as “time steps.” Stacking multiple RNNs produces an effect that is similar to deepening the network.
We use Gated Recurrent Unit (GRU) [20] cells, which have shown performance similar to the well-known long short-term memory (LSTM) [21] cells but have fewer parameters due to the lack of an output gate, making them a faster alternative at inference. We use hyperbolic tangent nonlinear activation functions, the networks possess 4 RNN stacks (levels), the hidden state size is set to and the RNN performs 5 time steps.
To train the networks, we simulate transmission of the all-one codeword (assuming the bipolar alphabet, ). We also simulate the multiplicative noise , which in the case of an AWGN channel, is distributed as a mean-1 Gaussian random variable. In our training for Sec. VI, we set to 4 dB. This was selected arbitrarily, and could potentially be improved. We use Google’s TensorFlow library and the Adam optimizer. Testing proceeds along the same lines, except that we use randomly generated codewords rather than the all-one codeword. With each training batch, we generate a new set of noise samples. While this procedure produced our best results, an alternative approach which fixes the training noise and uses other techniques (e.g., dropout) to overcome overfitting the noise is worth exploring.
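The data-generation part of this training procedure can be sketched without any deep learning library. The sketch draws fresh mean-1 Gaussian multiplicative noise for every batch, forms the channel output for the all-one bipolar codeword, and builds the network inputs (reliabilities and syndrome) and labels (signs of the noise). Several points are assumptions made for illustration: the 4 dB figure is interpreted as Eb/N0 and converted to a noise standard deviation with the standard relation for unit-energy bipolar signalling, the two inputs are concatenated into one feature vector, a small Hamming parity check matrix stands in for the BCH one, and all function names are hypothetical.

```python
import math
import random

def noise_std(ebn0_db, rate):
    """Noise standard deviation for unit-energy bipolar signalling at a given Eb/N0 (dB)."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return math.sqrt(1.0 / (2.0 * rate * ebn0))

def training_batch(H, batch_size, sigma, rng):
    """Generate (features, labels) pairs using the all-one bipolar codeword."""
    n = len(H[0])
    batch = []
    for _ in range(batch_size):
        z = [rng.gauss(1.0, sigma) for _ in range(n)]   # multiplicative noise, mean 1
        y = z                                           # all-one codeword: y = 1 * z
        hard = [1 if v < 0 else 0 for v in y]
        synd = [sum(h * b for h, b in zip(row, hard)) % 2 for row in H]
        features = [abs(v) for v in y] + synd           # reliabilities, then syndrome
        labels = [1 if v >= 0 else -1 for v in z]       # sign of the noise
        batch.append((features, labels))
    return batch

# Stand-in parity check matrix of a (7,4) Hamming code.
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

sigma = noise_std(4.0, 45 / 63)      # code rate of BCH(63,45), for illustration
rng = random.Random(0)
batch = training_batch(H, batch_size=4, sigma=sigma, rng=rng)
```

Calling `training_batch` again with the same `rng` yields new noise samples, matching the practice of regenerating the noise with each batch.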
With the RNN architecture, the network produces multiple outputs (at each time-step) and we use the following loss function:
where is the cross-entropy function, is the sign of component of the multiplicative noise and is the network output corresponding to codebit at RNN time step (layer) . is a discount factor (in our simulations, we used ). The loss for the vanilla architecture is a degenerate version of the RNN one, with time steps and discount factors removed.
V-B Preprocessing by Permutation
The performance of the implementations described above can further be improved by applying simple preprocessing and postprocessing steps at the input and output of the decoder (respectively). Our approach is depicted in Fig. 5. Preprocessing involves applying a permutation to the components of the channel output vector, and postprocessing applies the inverse permutation to the decoder output. The approach draws on ideas from Fossorier [9, 10] and Dimnik and Be’ery [17]. Note that the approach deviates from these works in that it does not involve computing a list of vectors.
Similar to [9, 10], our decoder selects the preprocessing permutation so as to maximize the sum of the adjusted reliabilities of the first components of the permuted channel output vector. Borrowing an idea from [17], however, we confine our permutations to subsets of the code’s automorphism group. We assume that the parity check matrix, by which the syndrome in Fig. 5 is computed, is arranged so that the last columns are diagonal and correspond to parity bits of the code.
We define the adjusted reliability of a channel component , denoted by,
That is, equals the mutual information of random variables and , denoting the channel input and output, respectively, conditioned on the event that the absolute value equals . is uniformly distributed in and is related to it via the channel transition probabilities. With this definition, our permutation selection criterion is equivalent to concentrating as much as possible of the channel capacity within the first channel output components.
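For intuition, the adjusted reliability has a closed form in the special case of a binary-input AWGN channel, which the following sketch evaluates. Conditioned on the absolute value of the output being a, the output is either +a or −a, so the channel reduces to a binary symmetric channel with crossover probability 1/(1 + exp(2a/σ²)), and the mutual information equals one minus the binary entropy of that probability (in bits). The restriction to AWGN, the noise standard deviation `sigma` and the function names are assumptions made for illustration; the definition above applies to general channels.

```python
import math

def binary_entropy(p):
    """Binary entropy function in bits, with the convention 0 * log(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def adjusted_reliability_awgn(a, sigma):
    """Mutual information between input and output, given that |output| equals a."""
    t = 2.0 * abs(a) / sigma ** 2
    if t > 700.0:                      # avoid overflow in exp; crossover is negligible
        return 1.0
    crossover = 1.0 / (1.0 + math.exp(t))
    return 1.0 - binary_entropy(crossover)

# The adjusted reliability grows from 0 (useless output) towards 1 (fully reliable).
values = [adjusted_reliability_awgn(a, 0.5) for a in (0.0, 0.2, 0.5, 1.0, 2.0)]
```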
Unlike [9, 10], we borrow an idea from [17] and restrict the set of allowed permutations to the code’s automorphism group . Permutations in this group have the property that the permuted version of any codeword is guaranteed to be a codeword as well. In our context, confinement to such permutations ensures that the decoder input (the permuted channel output) continues to obey the communication model, namely being a noisy valid codeword. The decoder can thus continue to rely on the code’s structure to decode.
When compared to the framework of Fig. 2, the decoder of Fig. 5 has the added benefit of knowing that the first channel outputs are consistently more reliable than the remaining components. That is, the input exhibits additional structure that the neural network can rely on.
In Appendix B we discuss permutations for BCH codes like those we will use in Sec. VI below, as well as efficient methods for computing the optimal permutation. We also discuss formal aspects related to applying the analysis of Sec. IV to the framework of Fig. 5. Note that to achieve good results, the added steps of Fig. 5 need to be included during training of the neural network.
In Sec. VI we present simulation results for BCH(127,64) codes with and without the above preprocessing method, demonstrating the effectiveness of this approach. Note that with the shorter block length BCH(63,45) codes, preprocessing was not necessary, and we obtained performance that approaches the ordered statistics algorithm even without it.
VI Simulation Results
Fig. 6 presents simulation results for communication with the BCH(63,45) code over an AWGN channel. We simulated our two architectures, namely syndrome-based vanilla and stacked-RNN. Note that in this case, we did not simulate permutations (Sec. V-B). Also plotted are results for the best method of Nachmani et al., the belief propagation (BP) algorithm, and the ordered statistics decoding (OSD) algorithm [10] of order 2 (for this algorithm we simulated codewords for each point). As can be seen from the results, both our architectures substantially outperform the BP algorithm. Our stacked-RNN architecture, like that of Nachmani et al., approaches the OSD algorithm very closely.
Fig. 7 presents results for the BCH(127,64) code. We simulated our syndrome-based stacked RNN method, with and without preprocessing. As can be seen, the preprocessing step yields a substantial benefit. Also plotted are results for the BP and the OSD algorithms as well as the best results of  (their paper  does not include results for this case). Both our methods outperform the BP algorithm as well as the algorithm of . However, a gap remains to the OSD algorithm, which widens with  (note that for  of 4 dB or higher, we encountered no OSD errors for BCH(127,64) in our simulations).
With respect to the number of codewords simulated, with our algorithms, we simulated codewords for each point. With the OSD algorithm, we simulated codewords for each point. With the BP algorithm we simulated for each point.
The analysis of  also includes an mRRD framework, into which their algorithm can be plugged to obtain superior performance. While our algorithms can similarly be plugged into that framework, our interest in this paper is in methods whose primary components are neural networks.
Our work in this paper presents a promising new framework for the application of deep learning to error correction coding. An important benefit of our design is the elimination of the problem of overfitting to the training codeword set, which was experienced by [4, 5]. We achieve this by using syndrome decoding, and by extending it to account for soft channel reliabilities. Our approach enables the neural network to focus on estimating the noise alone, rather than the transmitted codeword.
It is interesting to compare our framework to that of Nachmani et al. [2]. While their approach also resolves the overfitting problem and achieves impressive simulation results, it is heavily constrained to follow the structure of the LDPC belief-propagation decoding algorithm. By contrast, our framework allows the unconstrained design of the neural network, and our architectures are free to draw from the rich experience that has emerged in recent years on neural network design.
Our simulation results demonstrate that our framework can be applied to achieve strong performance that approaches OSD. Further research will examine additional neural network architectures (beyond the RNN-based one) and preprocessing methods, to improve our performance further. It will also consider the questions of latency and complexity.
We hope that research along these lines will produce powerful algorithms and have a substantial impact on error correction coding, rivaling the dramatic effect of deep learning on other fields.
Appendix A Proof of Theorem 1
In this section we provide the rigorous details of the proof, whose outline was provided in Sec. IV-C.
We begin with the following lemma.
Lemma 1. The following two claims hold with respect to the framework of Sec. IV-A:
1. There exists a matrix with dimensions , such that for all and defined as in Sec. IV-A (recall that and are both column vectors).
2. Let be a matrix obtained by concatenating the rows of and (i.e., ). Then has full column rank, and is thus injective (one-to-one).
Note that in this lemma, we allow to contain redundant, linearly dependent rows, as long as its rank remains . While we have not used such matrices in our work, the extra redundant rows could in theory be helpful in the design of effective neural networks for .
Proof of Lemma 1.
Part 1 of the lemma follows simply from the properties of the generator matrix of the code , denoted here . This matrix has full column-rank and dimensions and satisfies . Thus, we can define to be its left-inverse, and Part 1 of the lemma follows.
To prove Part 2 of the lemma, we first observe that we can assume without loss of generality that the matrix has full row-rank. This is because by removing redundant (linearly dependent) rows from we can obtain a full-rank parity-check matrix, and such removal cannot affect the rank of the corresponding . Let be a right-inverse of the matrix , whose existence follows from the fact that has full row-rank. has dimensions . Consider the matrix (obtained by concatenating the columns of and ).
where and denote the identity matrices of dimensions and , respectively. The equality follows from the orthogonality of the generator and parity matrices, and equalities , follow from the definitions above of and . The resulting matrix has rank , and thus cannot have rank less than . ∎
We now proceed to prove Part 1 of the theorem. We use the following notation: Vectors are denoted by boldface (e.g., ) and scalars by normal face (e.g., ). Random variables are upper-cased () and their instantiations lower-cased ().
We use the notation of Fig. 2, replacing lowercase with uppercase wherever we need to denote random variables. Accordingly, we let and denote random variables corresponding to the channel input and output (respectively). is the realized channel output observed at the decoder and an arbitrary value.
Consider the string of equations ending in (5). In (a), is a random variable defined as (Fig. 2) and equality follows from the condition . In (b), we have relied on (3) to replace , where . In (c), is the absolute value of and . In (d), we have relied on (3) and the bipolarity of to obtain . We have also defined and and as in Fig. 2. In (e), the matrix was defined as in Lemma 1 and equality follows by the fact that is injective. In (f), we have used the definition where is as defined in Lemma 1. In (g), we have decomposed . This follows from (1) and (3). In (h), we have relied on which follows from the fact that is a valid codeword. We have also replaced , where is the random message (see Fig. 2), following Lemma 1. Finally, in (i) we have made the observation that is independent of and can therefore be omitted from the condition. This follows because , being the transmitted message, is uniformly distributed in (see Sec. IV-A) and independent of (and ).
The proof now follows from (5). To obtain the MAP decision for given we can define the components of as follows, for and .
By (5), we now have . Similarly, to obtain the MMSE estimate for , we can define:
Turning to Part 2, in Sec. IV-C we proved that the decoder’s output can always be modeled as for some vector-valued function . With respect to the BER metric, the indices where the vectors and diverge coincide with the indices where equals -1, and thus are independent of . With respect to the MSE metric (recall that in this case, the sign operation in Fig. 2 is omitted), we have and thus the error is independent of .
Appendix B Automorphisms of BCH codes
With the above codes, the blocklength equals for some positive integer . Codewords are binary vectors (i.e., defined over indices ). Permutations are bijective functions . Given a codeword and a permutation , we define the corresponding permuted codeword by .
The automorphism group of the above-mentioned BCH codes includes [18, p. 233] permutations of the form:
where and . The inverse permutation can be shown to equal , where and .
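The codeword-preserving property can be verified exhaustively on a small example. The sketch below assumes the permutations take the form i → 2^α · i + β (mod n), with α ranging over 0, …, m−1 and β over 0, …, n−1, which is a known family of automorphisms of primitive BCH codes but is stated here as an assumption because the displayed formula is not reproduced above. It also assumes that the permuted vector takes its i-th component from position π(i) of the original. The (7,4) cyclic Hamming code with generator polynomial 1 + x + x³ is used as the smallest BCH example; all function names are hypothetical.

```python
def cyclic_codewords(gen_poly, n):
    """All codewords m(x) * g(x) mod (x^n - 1) over GF(2), as tuples of n bits."""
    k = n - (len(gen_poly) - 1)
    words = set()
    for message in range(2 ** k):
        c = [0] * n
        for i in range(k):
            if (message >> i) & 1:
                for j, g in enumerate(gen_poly):
                    c[(i + j) % n] ^= g
        words.add(tuple(c))
    return words

def permutation(alpha, beta, n):
    """Assumed form of the permutation: i -> 2^alpha * i + beta (mod n)."""
    return [((2 ** alpha) * i + beta) % n for i in range(n)]

def apply_permutation(v, pi):
    """Permuted vector whose i-th component is v[pi(i)] (assumed convention)."""
    return tuple(v[pi[i]] for i in range(len(v)))

n, m = 7, 3                                   # block length n = 2^m - 1
code = cyclic_codewords([1, 1, 0, 1], n)      # g(x) = 1 + x + x^3

# Every permutation in the family maps every codeword to a codeword.
all_preserved = all(
    apply_permutation(c, permutation(alpha, beta, n)) in code
    for alpha in range(m) for beta in range(n) for c in code
)
```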
We now address the question of efficiently finding the optimal permutation in the sense of Sec. V-B, i.e., the permutation that maximizes the sum of adjusted reliabilities (see (4)) over the first components of the permuted codeword. For fixed , the set of permutations coincides with the set of cyclic permutations. Finding the optimal cyclic permutation can efficiently be achieved by computing the cumulative sum of the adjusted reliabilities. The case of arbitrary is addressed by observing that . The optimal permutation can be achieved by first applying the permutation and then repeating the above procedure for cyclic codes. Finally, the optimal permutation across all is computed by combining the above results for each individual . With respect to computation latency, we note that the cumulative sum can be computed in logarithmic time by recursively splitting the range .
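The search described above can be sketched as follows. For each scaling exponent the reliabilities are first rearranged by the scaling permutation, after which the best cyclic shift is found from a single cumulative sum over the doubled vector. As assumptions made for illustration, the permutations are taken to have the form i → 2^α · i + β (mod n), the adjusted reliabilities are given as a plain list, `k` denotes the number of leading components whose reliabilities are summed, and the function names are hypothetical. The sequential `accumulate` is used for simplicity and does not reflect the logarithmic-time parallel computation mentioned above.

```python
from itertools import accumulate

def best_cyclic_shift(rel, k):
    """Return (shift, total) maximizing the sum of rel[(i + shift) % n] for i < k."""
    n = len(rel)
    prefix = [0.0] + list(accumulate(rel + rel))     # cumulative sum over doubled vector
    def window(shift):
        return prefix[shift + k] - prefix[shift]
    best = max(range(n), key=window)
    return best, window(best)

def best_permutation(rel, m, k):
    """Return (alpha, beta, total) for the assumed family i -> 2^alpha * i + beta mod n."""
    n = len(rel)
    best = (None, None, float("-inf"))
    for alpha in range(m):
        scale = 2 ** alpha
        scaled = [rel[(scale * i) % n] for i in range(n)]   # apply the scaling first
        shift, total = best_cyclic_shift(scaled, k)
        if total > best[2]:
            # A shift applied after scaling corresponds to beta = scale * shift mod n.
            best = (alpha, (scale * shift) % n, total)
    return best

# Hypothetical adjusted reliabilities for a length-7 code (n = 2^3 - 1).
rel = [0.2, 1.5, 0.1, 0.9, 1.1, 0.3, 0.8]
alpha, beta, total = best_permutation(rel, m=3, k=4)
```

The result can be cross-checked against a brute-force search over all pairs of exponents and shifts, which is feasible for short block lengths.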
Strictly speaking, the decoder of Fig. 5 violates the framework of Sec. IV, because the preprocessing and postprocessing steps are not included in that framework. In the case of BCH codes, however, this formal obstacle is easily overcome by removing the two steps and redefining to compensate. Specifically, the preprocessing permutation can equivalently be realized within by permuting the vector and manipulating the syndrome using identities detailed in . The selection of the optimal permutation (which depends only on ) and the postprocessing step can be redefined to be included in . The results of Sec. IV (particularly, resilience to overfitting) thus carry over to our setting.
References
[1] T. Richardson and R. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inf. Theory, vol. 47, pp. 599–618, Feb. 2001.
[2] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein and Y. Be’ery, “Deep learning methods for improved decoding of linear codes,” arXiv:1706.07043, 2017.
[3] E. Nachmani, Y. Bachar, E. Marciano, D. Burshtein and Y. Be’ery, “Near maximum likelihood decoding with deep learning,” Int. Zurich Seminar on Inf. and Comm., 2018.
[4] T. J. O’Shea and J. Hoydis, “An introduction to machine learning communications systems,” arXiv:1702.00832, 2017.
[5] T. Gruber, S. Cammerer, J. Hoydis and S. ten Brink, “On deep learning-based channel decoding,” 51st Annual Conference on Inf. Sciences and Systems (CISS), 2017.
[6] L. G. Tallini and P. Cull, “Neural nets for decoding error-correcting codes,” Proc. IEEE Tech. Applicat. Conf. and Workshops Northcon95, pp. 89–94, Oct. 1995.
[7] J.-L. Wu, Y.-H. Tseng and Y.-M. Huang, “Neural network decoders for linear block codes,” Int. Journ. of Computational Engineering Science, vol. 3, no. 3, pp. 235–255, 2002.
[8] J. Bruck and M. Blaum, “Neural networks, error-correcting codes, and polynomials over the binary n-cube,” IEEE Trans. Inf. Theory, vol. 35(5), pp. 976–987, 1989.
[9] M. P. Fossorier, S. Lin and J. Snyders, “Reliability-based syndrome decoding of linear block codes,” IEEE Trans. Inf. Theory, vol. 44(1), pp. 388–398, Jan. 1998.
[10] M. P. Fossorier and S. Lin, “Soft-decision decoding of linear block codes based on ordered statistics,” IEEE Trans. Inf. Theory, vol. 41(5), pp. 1379–1396, Sep. 1995.
[11] T. Richardson and R. Urbanke, “Modern coding theory,” Cambridge University Press, 2008.
[12] S. Lin and D. J. Costello, “Error control coding,” 2nd edition, Prentice Hall, 2004.
[13] G. Wiechman and I. Sason, “Parity-check density versus performance of binary linear block codes: New bounds and applications,” IEEE Trans. Inf. Theory, vol. 53(2), pp. 550–579, Jan. 2007.
[14] D. Burshtein, M. Krivelevich, S. Litsyn and G. Miller, “Upper bounds on the rate of LDPC codes,” IEEE Trans. Inf. Theory, vol. 48(9), pp. 2437–2449, Sep. 2002.
[15] I. Goodfellow, Y. Bengio and A. Courville, “Deep learning,” MIT Press, 2016.
[16] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, Dec. 2014.
[17] I. Dimnik and Y. Be’ery, “Improved random redundant iterative HDPC decoding,” IEEE Trans. Commun., vol. 57(7), July 2009.
[18] F. J. MacWilliams and N. J. A. Sloane, “The theory of error-correcting codes,” North-Holland, 1978.
[19] A. Graves, A.-r. Mohamed and G. Hinton, “Speech recognition with deep recurrent neural networks,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[20] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.