I. Introduction
In recent years, deep learning methods have demonstrated significant improvements in various tasks. These methods surpass human-level performance in some object detection tasks
[1], and achieve state-of-the-art results in machine translation [2] and speech processing [3]. Additionally, deep learning combined with reinforcement learning techniques was able to beat human champions in challenging games such as Go
[4].
Error correcting codes for channel coding are used in order to enable reliable communications at rates close to the Shannon capacity. A well-known family of linear error correcting codes are the low-density parity-check (LDPC) codes [5]. LDPC codes achieve near Shannon channel capacity with the belief propagation (BP) decoding algorithm, but can typically do so only for relatively large block lengths. For short to moderate length high-density parity-check (HDPC) codes [6, 7, 8, 9, 10], such as common powerful linear algebraic codes, the regular BP algorithm obtains poor results compared to the optimal maximum likelihood decoder. On the other hand, the importance of close to optimal, low complexity, low latency and low power decoders for short to moderate length codes has grown with the emergence of applications driven by the Internet of Things.
Recently, it has been shown in [11] that deep learning methods can improve the BP decoding of HDPC codes using a neural network. The authors formalized the belief propagation algorithm as a neural network and showed that it can improve the decoding in the high SNR regime. A key property of the method is that it is sufficient to train the neural network decoder using a single codeword (e.g., the all-zero codeword), since the architecture guarantees the same error rate for any chosen transmitted codeword.
Later, Lugosch & Gross [12] proposed an improved neural network architecture that achieves similar results to [11] with fewer parameters and reduced complexity. The main difference is that they use the min-sum algorithm instead of the sum-product algorithm. Gruber et al. [13] proposed a neural net decoder with an unconstrained graph (i.e., a fully connected network) and showed that the network gets close to maximum likelihood results for very small block codes. Also, O’Shea & Hoydis [14] proposed to use an autoencoder as a communication system for small block codes.
In this work we modify the architecture of [11] to a recurrent neural network (RNN) and show that it can improve over the belief propagation algorithm in the high SNR regime. The advantage over the feed-forward architecture of [11] is that it reduces the number of parameters. We also investigate the performance of the RNN decoder on parity check matrices with lower densities and fewer short cycles, and show that despite the fact that we start with a reduced cycle matrix, the network can still improve the performance. The output of the training algorithm can be interpreted as a soft Tanner graph that replaces the original one. State-of-the-art decoding algorithms for short to moderate length algebraic codes, such as [15, 7, 10], utilize the BP algorithm as a component in their solution. Thus, it is natural to replace the standard BP decoder with our trained RNN decoder, in an attempt to improve either the decoding performance or its complexity. In this work we demonstrate, for a BCH(63,36) code, that such improvements can be realized by using the RNN decoder in the mRRD algorithm.
II. Related Work
II-A. Belief Propagation Algorithm
The BP decoder [5, 16] is a message passing algorithm. The algorithm operates on the Tanner graph, which is a graphical representation of the parity check matrix. The graphical representation consists of edges and nodes. There are two types of nodes:

- Check nodes, corresponding to rows in the parity check matrix.
- Variable nodes, corresponding to columns in the parity check matrix.
The edges correspond to ones in the parity check matrix, and the messages are transmitted over the edges. Consider a code with block length N. The input to the algorithm is a vector of size N that consists of the log-likelihood ratios (LLRs) of the channel outputs. We consider an algorithm with L decoding iterations. The LLR values, l_v, are given by

l_v = \log \frac{\Pr(C_v = 1 \mid y_v)}{\Pr(C_v = 0 \mid y_v)}

where y_v is the channel output corresponding to the v-th codebit, C_v. The iterations of the BP decoder are represented in [11] using the following trellis graph. The input layer consists of N nodes. The following 2L layers in the graph have size E each, where E is the number of edges in the Tanner graph (the number of ones in the parity check matrix). The last layer has size N, which is the length of the code.
The messages transmitted over the trellis graph are the following. Consider hidden layer i, 1 ≤ i ≤ 2L, and let e = (v, c) be the index of some processing element in that layer. We denote by x_{i,e} the output message of this processing element. For odd (even, respectively) i, this is the message produced by the BP algorithm after ⌊(i-1)/2⌋ iterations, from variable to check (check to variable) node. For odd i and e = (v, c) we have (recall that the self LLR message of v is l_v),
x_{i,e=(v,c)} = l_v + \sum_{e'=(v,c'),\, c' \neq c} x_{i-1,e'}    (1)
under the initialization x_{0,e'} = 0 for all edges e' (in the beginning there is no information at the parity check nodes). The summation in (1) is over all edges e' = (v, c') with variable node v, except for the target edge e = (v, c). Recall that this is a fundamental property of message passing algorithms [16].
Similarly, for even i and e = (v, c) we have,
x_{i,e=(v,c)} = 2 \tanh^{-1} \left( \prod_{e'=(v',c),\, v' \neq v} \tanh \left( \frac{x_{i-1,e'}}{2} \right) \right)    (2)
The final v-th output of the network is given by
o_v = l_v + \sum_{e'=(v,c')} x_{2L,e'}    (3)
which is the final marginalization of the BP algorithm.
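The message passing recursion of equations (1)-(3) can be sketched compactly with dense message arrays. This is a didactic implementation over the parity check matrix, not the trellis network of [11]:

```python
import numpy as np

def belief_propagation(H, llr, iterations):
    """Sum-product BP over the Tanner graph of parity check matrix H,
    following equations (1)-(3). Messages live on the edges (the ones of
    H) and are stored in dense (m, n) arrays, zero off the edges."""
    H = np.asarray(H)
    edges = H == 1
    c2v = np.zeros(H.shape)                     # check-to-variable, init to 0
    for _ in range(iterations):
        # eq. (1): variable-to-check = l_v plus incoming messages from all
        # other checks (extrinsic: subtract the target edge's own message)
        total_v = llr + c2v.sum(axis=0)
        v2c = np.where(edges, total_v[np.newaxis, :] - c2v, 0.0)
        # eq. (2): check-to-variable = 2 atanh of the product of tanh(x/2)
        # over the other edges of the check
        t = np.where(edges, np.tanh(v2c / 2.0), 1.0)
        prod_c = t.prod(axis=1, keepdims=True)
        safe_t = np.where(np.abs(t) < 1e-9, 1e-9, t)   # numerical guard
        extrinsic = np.clip(prod_c / safe_t, -0.999999, 0.999999)
        c2v = np.where(edges, 2.0 * np.arctanh(extrinsic), 0.0)
    # eq. (3): final marginalization
    return llr + c2v.sum(axis=0)
```

As a sanity check, for a single parity check over three bits where two soft values strongly agree and the third weakly disagrees, one iteration pulls the weak value toward satisfying the check.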
II-B. Neural Sum-Product Algorithm
Nachmani et al. [11]
have suggested a parameterized deep neural network decoder as a generalization of the BP algorithm. They use the trellis representation of the BP algorithm with weights associated with each edge of the Tanner graph. These weights are trained with stochastic gradient descent. More precisely, the equations of the neural sum-product algorithm are
x_{i,e=(v,c)} = \tanh \left( \frac{1}{2} \left( w_{i,v} l_v + \sum_{e'=(v,c'),\, c' \neq c} w_{i,e,e'} x_{i-1,e'} \right) \right)    (4)
for odd i,
x_{i,e=(v,c)} = 2 \tanh^{-1} \left( \prod_{e'=(v',c),\, v' \neq v} x_{i-1,e'} \right)    (5)
for even i, and
o_v = \sigma \left( w_{2L+1,v} l_v + \sum_{e'=(v,c')} \bar{w}_{2L+1,v,e'} x_{2L,e'} \right)    (6)
where \sigma(x) \equiv (1 + e^{-x})^{-1} is a sigmoid function. This algorithm coincides with the plain BP algorithm if all the weights are set to one (apart from the sigmoid at the output). Therefore the neural sum-product algorithm cannot be inferior to the plain BP algorithm.
The neural sum-product algorithm satisfies the message passing symmetry conditions [16, Definition 4.81]. Therefore the error rate is independent of the transmitted codeword. As a result, the network can be trained by using noisy versions of a single codeword. The time complexity of the neural sum-product algorithm is similar to that of the plain BP algorithm. However, the neural sum-product algorithm requires more multiplications and parameters than the plain BP algorithm. The neural network architecture is illustrated in Figure 1 for a BCH(15,11) code.
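As an illustrative sketch of the weighted variable-to-check update (4) — simplified to one learned weight per edge instead of the per-edge-pair weights w_{i,e,e'} above — note how setting all weights to one recovers the plain BP update of (1), passed through tanh(./2):

```python
import numpy as np

def neural_v2c(llr, c2v, edges, w_llr, w_msg):
    """One weighted variable-to-check layer in the spirit of eq. (4).
    Simplification: w_msg holds a single learned weight per edge rather
    than the paper's per-edge-pair weights. With all weights set to one,
    this is the plain BP update of eq. (1) under the tanh
    reparameterization."""
    weighted = w_msg * c2v                        # scale incoming messages
    total = w_llr * llr + weighted.sum(axis=0)    # per-variable weighted sum
    # extrinsic rule: exclude the target edge's own incoming message
    return np.where(edges, np.tanh((total[np.newaxis, :] - weighted) / 2.0), 0.0)
```

With zero incoming check messages and unit weights, the layer simply outputs tanh(l_v / 2) on every edge of a variable, matching the first BP iteration.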
II-C. Modified Random Redundant Iterative (mRRD) Algorithm
Dimnik and Be’ery [7] proposed an iterative algorithm for decoding HDPC codes based on the RRD [17] and the MBBP [18] algorithms. The mRRD algorithm is a close to optimal, low complexity decoder for short length algebraic codes such as BCH codes. This algorithm uses several parallel decoder branches, each comprising a sequence of applications of several (e.g., 2) BP decoding iterations, followed by a random permutation from the automorphism group of the code, as shown in Figure 2. The decoding process in each branch stops if the decoded word is a valid codeword. The final decoded word is selected by a least metric selector (LMS) as the one for which the channel output has the highest likelihood. More details can be found in [7].
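The mRRD control flow described above can be sketched as follows; the helper signatures (`bp_run` and the list of automorphism permutations) are assumptions of this sketch, not the actual interface of [7]:

```python
import numpy as np

def mrrd(llr, H, automorphisms, bp_run, num_branches, max_blocks, rng=None):
    """Sketch of the mRRD control flow under assumed helper signatures:
    `bp_run(soft, H)` performs a few BP iterations and returns updated soft
    values; `automorphisms` is a list of permutations from the code's
    automorphism group. Each branch alternates BP with a random permutation,
    stops on a valid codeword, and a least metric selector (LMS) picks the
    candidate that best matches the channel output."""
    rng = rng or np.random.default_rng(0)
    llr = np.asarray(llr, dtype=float)
    candidates = []
    for _ in range(num_branches):
        perm = np.arange(len(llr))               # composed permutation so far
        soft = llr.copy()
        for _ in range(max_blocks):
            soft = bp_run(soft, H)               # a few BP iterations
            bits = (soft < 0).astype(int)        # hard decision (negative soft -> bit 1)
            if not np.any((H @ bits) % 2):       # valid codeword: stop this branch
                break
            p = automorphisms[rng.integers(len(automorphisms))]
            perm, soft = perm[p], soft[p]        # permute and continue decoding
        candidates.append(bits[np.argsort(perm)])  # undo the composed permutation
    # LMS: smallest metric corresponds to highest likelihood w.r.t. the LLRs
    metrics = [np.sum(np.where(c == 1, llr, -llr)) for c in candidates]
    return candidates[int(np.argmin(metrics))]
```

The hard-decision convention (negative soft value maps to bit 1) and the correlation-style LMS metric are internally consistent choices of the sketch.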
III. Methods
III-A. BP-RNN Decoding
We suggest the following parameterized deep neural network decoder, which is a constrained version of the neural sum-product decoder of the previous section. We use the same trellis representation of the decoder as in [11]. The difference is that now the weights of the edges in the Tanner graph are tied, i.e., they are set to be equal in every iteration. This tying transforms the feed-forward architecture of [11] into a recurrent neural network architecture. More precisely, the equations of the proposed architecture for time step t are
x_{t,e=(v,c)} = \tanh \left( \frac{1}{2} \left( w_v l_v + \sum_{e'=(v,c'),\, c' \neq c} w_{e,e'} x_{t-1,e'} \right) \right)    (7)
x_{t,e=(v,c)} = 2 \tanh^{-1} \left( \prod_{e'=(v',c),\, v' \neq v} x_{t,e'} \right)    (8)
For time step t we have
o_{v,t} = \sigma \left( \bar{w}_v l_v + \sum_{e'=(v,c')} \bar{w}_{v,e'} x_{t,e'} \right)    (9)
where \sigma(x) \equiv (1 + e^{-x})^{-1} is a sigmoid function. We initialize the algorithm by setting x_{0,e} = 0 for all edges e. The proposed architecture also preserves the symmetry conditions. As a result, the network can be trained by using noisy versions of a single codeword. The training is done as before, with a cross entropy loss function at the last time step,
L(o, y) = -\frac{1}{N} \sum_{v=1}^{N} y_v \log(o_v) + (1 - y_v) \log(1 - o_v)    (10)
where o_v and y_v are the final deep recurrent neural network output and the actual v-th component of the transmitted codeword, respectively. The proposed recurrent neural network architecture has the property that after every time step we can add a final marginalization and compute the loss of these terms using (10). Using such multi-loss terms can increase the gradient update in the backpropagation through time algorithm and allows learning of the earliest layers. At each time step t we add the final marginalization to the loss:
L(o, y) = -\frac{1}{N} \sum_{t=1}^{T} \sum_{v=1}^{N} y_v \log(o_{v,t}) + (1 - y_v) \log(1 - o_{v,t})    (11)
where o_{v,t} and y_v are the deep neural network output at time step t and the actual v-th component of the transmitted codeword, respectively. This network architecture is illustrated in Figure 3. Nodes in the variable layer implement (7), while nodes in the parity layer implement (8). Nodes in the marginalization layer implement (9). The training goal is to minimize (11).
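A forward pass of the tied-weight decoder with the multi-loss of (11) can be sketched as follows, again with the simplification of one weight per edge rather than per edge pair:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_rnn_forward(llr, edges, w_llr, w_msg, w_out, steps, target=None):
    """Forward pass of the tied-weight recurrent decoder with multi-loss:
    the SAME weights are reused at every time step (this tying is what
    distinguishes it from the feed-forward decoder), and a cross-entropy
    term is accumulated from the marginalization after each step, in the
    spirit of eq. (11). One weight per edge replaces the per-edge-pair
    weights (a simplification of this sketch)."""
    m, n = edges.shape
    c2v = np.zeros((m, n))
    loss = 0.0
    for _ in range(steps):
        # variable-to-check with tied weights (cf. eq. (7))
        weighted = w_msg * c2v
        total = w_llr * llr + weighted.sum(axis=0)
        v2c = np.where(edges, np.tanh((total[np.newaxis, :] - weighted) / 2.0), 0.0)
        # check-to-variable (cf. eq. (8))
        t = np.where(edges, v2c, 1.0)
        prod = t.prod(axis=1, keepdims=True)
        safe_t = np.where(np.abs(t) < 1e-9, 1e-9, t)   # numerical guard
        c2v = np.where(edges, 2.0 * np.arctanh(np.clip(prod / safe_t, -0.999999, 0.999999)), 0.0)
        # per-step marginalization (cf. eq. (9)) and multi-loss term
        o = sigmoid(w_out * (llr + c2v.sum(axis=0)))
        if target is not None:
            eps = 1e-12
            loss -= np.mean(target * np.log(o + eps) + (1 - target) * np.log(1 - o + eps))
    return o, loss
```

With positive input LLRs the per-step marginals saturate near one, so the accumulated loss is small against an all-ones target and large against an all-zeros target, as the cross entropy of (11) dictates.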
III-B. mRRD-RNN Decoding
We propose to combine the BP-RNN decoding algorithm with the mRRD algorithm, by replacing the BP blocks in the mRRD algorithm with our BP-RNN decoding scheme. The proposed mRRD-RNN decoder should achieve near maximum likelihood performance with lower computational complexity.
IV. Experiments and Results
IV-A. BP-RNN
We apply our method to four linear codes: BCH(63,45), BCH(63,36), BCH(127,64) and BCH(127,99). In all experiments the results on the training, validation and test sets were identical; we did not observe overfitting in our experiments. Details of our experiments and results are as follows. It should be noted that we have not trained the channel weights w_v in (7), i.e., these were left fixed at one, as in plain BP.
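As discussed in Section II-B, the symmetry conditions make it sufficient to train on noisy versions of the zero codeword. A minimal sketch of generating such a batch of channel LLRs follows; the BPSK mapping and the Eb/N0-based noise normalization are assumptions of the sketch:

```python
import numpy as np

def zero_codeword_batch(n, snr_db_low, snr_db_high, batch_size, rate, rng=None):
    """Training batch of noisy versions of the all-zero codeword.
    Assumptions of this sketch: BPSK mapping (bit 0 -> +1) and noise std
    derived from Eb/N0 with the code rate taken into account."""
    rng = rng or np.random.default_rng(0)
    snr_db = rng.uniform(snr_db_low, snr_db_high, size=batch_size)
    # per-example noise std from Eb/N0, accounting for the code rate
    sigma = np.sqrt(1.0 / (2.0 * rate * 10.0 ** (snr_db / 10.0)))
    y = 1.0 + sigma[:, None] * rng.standard_normal((batch_size, n))
    # channel LLRs, l_v = log Pr(C_v=1|y)/Pr(C_v=0|y) = -2 y / sigma^2
    return -2.0 * y / sigma[:, None] ** 2
```

Under the LLR sign convention above, the zero codeword yields predominantly negative LLRs, which is what the decoder is trained to preserve.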
Training was conducted using stochastic gradient descent with mini-batches. The training data was created by transmitting the zero codeword through an AWGN channel at a range of SNR values. The mini-batch size and the number of training examples were chosen separately for the N = 63 codes, BCH(127,99) and BCH(127,64), as was the learning rate of the RMSProp [19] update rule that we applied. The neural network has two hidden layers (a variable layer and a parity layer) at each time step, and the recurrence is unfolded over a fixed number of time steps, each corresponding to a full iteration of the BP algorithm. At test time, we inject noisy codewords after transmitting through an AWGN channel and measure the bit error rate (BER) in the decoded codeword at the network output. The input to (7) is clipped such that its absolute value is always smaller than some positive constant; this is also required for a practical implementation of the BP algorithm.

IV-A1. BER For BCH With N = 63
In Figures 4 and 5, we provide the bit-error-rate results for BCH codes with N = 63 and a regular parity check matrix based on [20]. As can be seen from the figures, the BP-RNN decoder outperforms the BP feed-forward (BP-FF) decoder, while also using fewer parameters. Moreover, the BP-RNN decoder obtains comparable results to the BP-FF decoder when training with multi-loss. Furthermore, for BCH(63,45) and BCH(63,36) there is an improvement over the plain BP algorithm.
In Figures 6 and 7, we provide the bit-error-rate results for BCH codes with N = 63 and a cycle reduced parity check matrix [17]. For BCH(63,45) and BCH(63,36) we again obtain an improvement over plain BP. This observation shows that the method with a soft Tanner graph is capable of improving the performance of standard BP even for reduced cycle parity check matrices, answering in the affirmative the question raised in [11] regarding the performance of the neural decoder on a cycle reduced parity check matrix. The importance of this finding is that it enables a further improvement in the decoding performance, as BP (both standard BP and the new parameterized BP algorithm) yields lower error rates for sparser parity check matrices.
IV-A2. BER For BCH With N = 127
IV-B. mRRD-RNN
In this section we provide the bit error rate results for a BCH(63,36) code represented by a cycle reduced parity check matrix based on [17]. In all experiments we use the soft Tanner graph trained using the BP-RNN with multi-loss architecture and an unfold of 5, which corresponds to 5 BP iterations. The parameters of the mRRD-RNN decoder are as follows. We use 2 iterations for each BP block in Figure 2, and several values of the number of parallel branches, m = 1, 3, 5, denoting the corresponding decoder in the following by mRRD-RNN(m).
In Figure 11 we present the bit error rate for mRRD-RNN(1), mRRD-RNN(3) and mRRD-RNN(5). As can be seen, each achieves an improvement over the corresponding plain mRRD decoder. Hence, the mRRD-RNN decoder can improve on the plain mRRD decoder. Also note that there is only a small gap from the optimal maximum likelihood decoder, the performance of which was estimated using the implementation of [21], based on the OSD algorithm [22].
Figure 12 compares the average number of BP iterations for the various decoders using plain mRRD and mRRD-RNN. As can be seen, there is a small increase in complexity, of up to 8%, when using the RNN decoder. However, overall, with the RNN decoder one can achieve the same error rate with a significantly smaller computational complexity, due to the reduction in the required value of m.
V. Conclusion
We introduced an RNN architecture for decoding linear block codes. This architecture yields comparable results to the feed-forward architecture of [11] with fewer parameters. Furthermore, we showed that the neural network decoder improves on standard BP even for cycle reduced parity check matrices. We also showed a performance improvement of the mRRD algorithm with the new RNN architecture. We regard this work as a further step towards the design of deep neural network-based decoding algorithms.
Our future work includes possible improvements in performance by exploring new neural network architectures. Moreover, we will investigate end-to-end learning of the mRRD algorithm (i.e., learning the graph together with the permutations), and fine-tuning of the parameters of the mRRD-RNN algorithm. Finally, we are currently considering an extension of this work in which the weights of the RNN are quantized in order to further reduce the number of free parameters. It has been shown in the past [23, 24] that in various applications the loss in performance incurred by weight quantization can be small if the quantization is performed properly.
Acknowledgment
We thank Jacob Goldberger for his comments on our work, and Johannes Van Wonterghem and Joseph J. Boutros for making their OSD software available to us. We also thank Ethan Shiloh, Ilia Shulman and Ilan Dimnik for helpful discussions and their support of this research, and Gianluigi Liva for sharing his OSD Matlab package with us.
This research was supported by the Israel Science Foundation, grant no. 1082/13. The Tesla K40c used for this research was donated by the NVIDIA Corporation.
References
 [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
 [2] M.T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attentionbased neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
 [3] A. Graves, A.r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–6649.
 [4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [5] R. G. Gallager, Low Density Parity Check Codes. Cambridge, Massachusetts: M.I.T. Press, 1963.
 [6] J. Jiang and K. R. Narayanan, “Iterative softinput softoutput decoding of reedsolomon codes by adapting the paritycheck matrix,” IEEE Transactions on Information Theory, vol. 52, no. 8, pp. 3746–3756, 2006.
 [7] I. Dimnik and Y. Be’ery, “Improved random redundant iterative HDPC decoding,” IEEE Transactions on Communications, vol. 57, no. 7, pp. 1982–1985, 2009.

 [8] A. Yufit, A. Lifshitz, and Y. Be’ery, “Efficient linear programming decoding of HDPC codes,” IEEE Transactions on Communications, vol. 59, no. 3, pp. 758–766, 2011.
 [9] X. Zhang and P. H. Siegel, “Adaptive cut generation algorithm for improved linear programming decoding of binary linear codes,” IEEE Transactions on Information Theory, vol. 58, no. 10, pp. 6581–6594, 2012.
 [10] M. Helmling, E. Rosnes, S. Ruzika, and S. Scholl, “Efficient maximumlikelihood decoding of linear block codes on binary memoryless channels,” in 2014 IEEE International Symposium on Information Theory. IEEE, 2014, pp. 2589–2593.
 [11] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” 54th Annual Allerton Conf. on Communication, Control and Computing, Monticello, IL, September 2016. arXiv preprint arXiv:1607.04793, 2016.
 [12] L. Lugosch and W. J. Gross, “Neural offset minsum decoding,” arXiv preprint arXiv:1701.05931, 2017.
 [13] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel decoding,” accepted for CISS 2017, arXiv preprint arXiv:1701.07738, 2017.
 [14] T. J. O’Shea and J. Hoydis, “An introduction to machine learning communications systems,” arXiv preprint arXiv:1702.00832, 2017.
 [15] M. P. Fossorier, “Iterative reliabilitybased decoding of lowdensity parity check codes,” IEEE Journal on selected Areas in Communications, vol. 19, no. 5, pp. 908–917, 2001.
 [16] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge, UK: Cambridge University Press, 2008.
 [17] T. R. Halford and K. M. Chugg, “Random redundant softin softout decoding of linear block codes,” in Information Theory, 2006 IEEE International Symposium on. IEEE, 2006, pp. 2230–2234.
 [18] T. Hehn, J. B. Huber, S. Laendner, and O. Milenkovic, “Multiplebases beliefpropagation for decoding of short block codes,” in Information Theory, 2007. ISIT 2007. IEEE International Symposium on. IEEE, 2007, pp. 311–315.

 [19] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.
 [20] M. Helmling and S. Scholl, “Database of channel codes and ML simulation results,” www.uni-kl.de/channel-codes, University of Kaiserslautern, 2016.
 [21] J. Van Wonterghem, A. Alloumf, J. Boutros, and M. Moeneclaey, “Performance comparison of shortlength errorcorrecting codes,” in Communications and Vehicular Technologies (SCVT), 2016 Symposium on. IEEE, 2016, pp. 1–6.
 [22] M. P. Fossorier and S. Lin, “Softdecision decoding of linear block codes based on ordered statistics,” IEEE Transactions on Information Theory, vol. 41, no. 5, pp. 1379–1396, 1995.
 [23] I. Hubara, M. Courbariaux, D. Soudry, R. ElYaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv preprint arXiv:1609.07061, 2016.

 [24] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.