1 Introduction
Reliable digital communication, both wireline (Ethernet, cable and DSL modems) and wireless (cellular, satellite, deep space), is a primary workhorse of the modern information age. A critical aspect of reliable communication involves the design of codes that allow transmissions to be robustly (and computationally efficiently) decoded under noisy conditions. This is the discipline of coding theory; over the past century, and especially the past 70 years (since the birth of information theory [1]), much progress has been made in the design of near-optimal codes. Landmark codes include convolutional codes, turbo codes, low-density parity-check (LDPC) codes and, recently, polar codes. The impact on humanity is enormous: every cellular phone designed uses one of these codes, which feature in global cellular standards ranging from the 2nd generation to the 5th generation respectively, and are textbook material [2].
The canonical setting is one of point-to-point reliable communication over the additive white Gaussian noise (AWGN) channel, and the performance of a code in this setting is its gold standard. The AWGN channel fits much of wireline and wireless communications, although the front end of the receiver may have to be specifically designed before the signal is processed by the decoder (examples: intersymbol equalization in cable modems, beamforming and sphere decoding in multiple-antenna wireless systems); again this is textbook material [3]. There are two long-term goals in coding theory: (a) the design of new, computationally efficient codes that improve the state of the art (probability of correct reception) over the AWGN setting; since the current codes already operate close to the information-theoretic “Shannon limit", the emphasis is on robustness and adaptability to deviations from the AWGN setting (a list of channel models motivated by practical settings, such as urban, pedestrian, and vehicular, in the recent 5th-generation cellular standard is available in Annex B of 3GPP TS 36.101); and (b) the design of new codes for multiterminal (i.e., beyond point-to-point) settings; examples include the feedback channel, the relay channel, and the interference channel.

Progress on these long-term goals has generally been driven by individual human ingenuity and, befittingly, is sporadic. For instance, the time between convolutional codes (2nd-generation cellular standards) and polar codes (5th-generation cellular standards) is over four decades. Deep learning is fast emerging as capable of learning sophisticated algorithms from observed data (input, action, output) alone and has been remarkably successful in a large variety of human endeavors (ranging from language [4] to vision [5] to playing Go [6]). Motivated by these successes, we envision that deep learning methods can play a crucial role in achieving both of the aforementioned goals of coding theory.
While the learning framework is clear and there is virtually unlimited training data available, there are two main challenges. (a) The space of codes is vast and the sizes astronomical: for instance, a rate-1/2 code over 100 information bits involves designing $2^{100}$ codewords in a 200-dimensional space. Computationally efficient encoding and decoding procedures are a must, apart from high reliability over the AWGN channel. (b) Generalization is highly desirable: across block lengths and data rates, each code should work very well over a wide range of channel signal-to-noise ratios (SNRs). In other words, one is looking to design a family of codes (parametrized by data rate and number of information bits) whose performance is evaluated over a range of channel SNRs.
For example, it has been shown that when a neural decoder is exposed to nearly 90% of the codewords of a rate-1/2 polar code over 8 information bits, its performance on the unseen codewords is poor [7]. In part due to these challenges, recent deep learning works on decoding known codes using data-driven neural decoders have been limited to short or moderate block lengths [7, 8, 9, 10]. Other deep learning works on coding theory focus on decoding known codes by training a neural decoder that is initialized with an existing decoding algorithm but is more general than that algorithm [11, 12]. The main challenge is to restrict oneself to a class of codes that neural networks can naturally encode and decode. In this paper, we restrict ourselves to a class of sequential encoding and decoding schemes, of which convolutional and turbo codes are members. These sequential coding schemes naturally meld with the family of recurrent neural network (RNN) architectures, which have recently seen great success in a wide variety of time-series tasks. An ancillary advantage of sequential schemes is that arbitrarily long information sequences can be encoded, at a large variety of coding rates.
Working within sequential codes parametrized by RNN architectures, we make the following contributions.
(1) Focusing on convolutional codes, we aim to decode them on the AWGN channel using RNN architectures. Efficient optimal decoding of convolutional codes has historically represented fundamental progress in the broad arena of algorithms: optimal block error decoding is achieved by the ‘Viterbi decoder’ [13], which is simply dynamic programming or Dijkstra’s algorithm on a specific graph (the ‘trellis’) induced by the convolutional code, while optimal bit error decoding is achieved by the BCJR decoder [14], which is part of a family of forward-backward algorithms. While early work had shown that vanilla RNNs are in principle capable of emulating both Viterbi and BCJR decoders [15, 16], we show empirically, through a careful construction of RNN architectures and training methodology, that neural network decoding is possible at very near-optimal performance (both bit error rate (BER) and block error rate (BLER)). The key point is that we train an RNN decoder at a specific SNR and over short information bit lengths (100 bits), and show strong generalization capabilities by testing over a wide range of SNRs and block lengths (up to 10,000 bits). The specific training SNR is closely related to the Shannon limit of the AWGN channel at the rate of the code, providing strong information-theoretic collateral for our empirical results.
(2) Turbo codes are naturally built on top of convolutional codes, both in terms of encoding and decoding. A natural generalization of our RNN convolutional decoders allows us to decode turbo codes at BER comparable to, and in certain regimes even better than, state-of-the-art turbo decoders on the AWGN channel. That data-driven, SGD-learnt RNN architectures can decode comparably is fairly remarkable, since turbo codes already operate near the Shannon limit of reliable communication over the AWGN channel.
(3) We show that the aforementioned neural network decoders for both convolutional and turbo codes are robust to variations of the AWGN channel model. We consider a problem of contemporary interest: communication over a “bursty" AWGN channel (where a small fraction of the noise has much higher variance than usual), which models inter-cell interference in OFDM cellular systems (used in 4G and 5G cellular standards) or co-channel radar interference. We demonstrate empirically that the neural network architectures can adapt to such variations and beat state-of-the-art heuristics comfortably (despite evidence elsewhere that neural networks are sensitive to the models they are trained on [17]). Via an innovative local perturbation analysis (akin to [18]), we demonstrate that the neural network has learnt sophisticated preprocessing heuristics used in the engineering of real-world systems [19].

2 RNN decoders for sequential codes
Among the diverse families of coding schemes available in the literature, sequential coding schemes are particularly attractive. They (i) are used extensively in mobile telephone standards including satellite communications, 3G, 4G, and LTE; (ii) provably achieve performance close to the information-theoretic limit; and (iii) have a natural recurrent structure that is aligned with an established family of deep models, namely recurrent neural networks. We consider the basic sequential codes known as convolutional codes, and provide a neural decoder that can be trained to achieve the optimal classification accuracy.
A standard example of a convolutional code is the rate-1/2 Recursive Systematic Convolutional (RSC) code. The encoder performs a forward pass on the recurrent network shown in Figure 1 on a binary input sequence $\mathbf{b} = (b_1, \ldots, b_K)$, which we call the message bits, with binary vector states $\mathbf{s}_k$ and binary vector outputs $\mathbf{c}_k$, which we call the transmitted bits or a codeword. At time $k$, with binary input $b_k$ and the state a two-dimensional binary vector $\mathbf{s}_k = (s_{k,1}, s_{k,2})$, the output is a two-dimensional binary vector $\mathbf{c}_k = (c_{k,1}, c_{k,2})$, where $c_{k,1} = b_k$ and $c_{k,2} = a_k \oplus s_{k,2}$, with the feedback bit $a_k = b_k \oplus s_{k,1} \oplus s_{k,2}$. The state of the next cell is updated as $\mathbf{s}_{k+1} = (a_k, s_{k,1})$. The initial state is assumed to be zero, i.e., $\mathbf{s}_1 = (0,0)$.

The output bits are sent over a noisy channel, the canonical one being the AWGN channel: the received real-valued vectors $\mathbf{y}_k = (y_{k,1}, y_{k,2})$, which we call the received bits, are $y_{k,i} = 2c_{k,i} - 1 + z_{k,i}$ for all $k \in \{1, \ldots, K\}$ and $i \in \{1, 2\}$, where the $z_{k,i}$'s are i.i.d. Gaussian with zero mean and variance $\sigma^2$. Decoding a received signal $\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_K)$
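For concreteness, the encoder recurrence above can be sketched in a few lines of Python. The $(7,5)_8$ generator polynomials used here are an assumed concrete choice for illustration; the code's actual generator is specified by Figure 1.

```python
import numpy as np

def rsc_encode(bits):
    """Rate-1/2 RSC encoder sketch with the (7,5)_8 generator
    (an assumed concrete choice; the paper's exact polynomials
    are those of Figure 1). Returns (systematic, parity) arrays."""
    s1 = s2 = 0                      # two-bit encoder state, initialized to 0
    sys_out, par_out = [], []
    for b in bits:
        a = b ^ s1 ^ s2              # recursive feedback taps: 1 + D + D^2
        sys_out.append(b)            # systematic output c_{k,1} = b_k
        par_out.append(a ^ s2)       # parity output c_{k,2}: taps 1 + D^2
        s1, s2 = a, s1               # shift-register state update
    return np.array(sys_out), np.array(par_out)
```

The systematic stream simply copies the message, while the parity stream mixes in the feedback state, which is what gives the code its memory.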
refers to (the attempt at) finding the maximum a posteriori (MAP) estimate. Due to the simple recurrent structure, efficient dynamic-programming schemes are available for finding the MAP estimate for convolutional codes [13, 14]. There are two MAP decoders, depending on the error criterion used in evaluating the performance: bit error rate (BER) or block error rate (BLER). BLER counts the fraction of blocks that are wrongly decoded (assuming many such length-$K$ blocks have been transmitted), and the matching optimal MAP estimator is $\hat{\mathbf{b}} = \arg\max_{\mathbf{b}} \mathbb{P}(\mathbf{b} \,|\, \mathbf{y})$. Using dynamic programming, one can find this optimal MAP estimate in time linear in the block length $K$; this is the Viterbi algorithm. BER counts the fraction of bits that are wrong, and the matching optimal MAP estimator is $\hat{b}_k = \arg\max_{b_k \in \{0,1\}} \mathbb{P}(b_k \,|\, \mathbf{y})$, for all $k \in \{1, \ldots, K\}$. Again using dynamic programming, the optimal estimate can be computed in $O(K)$ time; this is the BCJR algorithm.
In both cases, the linear-time optimal decoder crucially depends on the recurrent structure of the encoder. This structure can be represented as a hidden Markov model (HMM), and both decoders are special cases of general efficient methods for solving inference problems on HMMs using the principle of dynamic programming (e.g., belief propagation). These methods efficiently compute the exact posterior distributions in two passes through the network: the forward pass and the backward pass. Our first aim is to train a (recurrent) neural network from samples, without explicitly specifying the underlying probabilistic model, and still recover the accuracy of the matching optimal decoders. At a high level, we want to prove by a constructive example that highly engineered dynamic programming can be matched by a neural network that only has access to samples. The challenge lies in finding the right architecture and showing the right training examples.
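To make the dynamic-programming structure concrete, here is a minimal Viterbi sketch for block-MAP decoding of a rate-1/2 RSC code over BPSK/AWGN. The $(7,5)_8$ generator and the BPSK mapping $2c-1$ are illustrative assumptions, not the paper's exact configuration.

```python
import itertools
import numpy as np

def rsc_step(state, b):
    """One step of an assumed (7,5)_8 RSC encoder: given state (s1, s2)
    and input bit b, return (output bits, next state)."""
    s1, s2 = state
    a = b ^ s1 ^ s2
    return (b, a ^ s2), (a, s1)

def viterbi_decode(received):
    """Minimal Viterbi sketch: dynamic programming over the 4-state trellis,
    minimizing squared Euclidean distance to the BPSK symbols.
    `received` has shape (K, 2): noisy versions of 2*c - 1."""
    states = list(itertools.product([0, 1], repeat=2))
    cost = {s: (0.0 if s == (0, 0) else np.inf) for s in states}  # start at 0
    back = []
    for y in received:
        new_cost = {s: np.inf for s in states}
        ptr = {}
        for s in states:
            if cost[s] == np.inf:
                continue
            for b in (0, 1):
                (c1, c2), ns = rsc_step(s, b)
                # branch metric: squared distance to BPSK symbols (2c - 1)
                m = cost[s] + (y[0] - (2*c1 - 1))**2 + (y[1] - (2*c2 - 1))**2
                if m < new_cost[ns]:
                    new_cost[ns], ptr[ns] = m, (s, b)
        back.append(ptr)
        cost = new_cost
    # trace back from the lowest-cost final state
    s = min(cost, key=cost.get)
    bits = []
    for ptr in reversed(back):
        s, b = ptr[s]
        bits.append(b)
    return bits[::-1]
```

On a noiseless received sequence the true path has cost zero and the decoder recovers the message exactly; the BCJR algorithm replaces the min-sum recursion here with sum-product forward and backward passes to obtain per-bit posteriors.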
Neural decoder for convolutional codes. We treat the decoding problem as a $K$-dimensional binary classification problem for each of the message bits $b_k$. The input to the decoder is a length-$K$ sequence of received bits $\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_K)$, each associated with its length-$K$ sequence of “true classes" $\mathbf{b} = (b_1, \ldots, b_K)$. The goal is to train a model to find an accurate sequence-to-sequence classifier. The input $\mathbf{y}$ is a noisy version of the classes $\mathbf{b}$, encoded according to the rate-1/2 RSC code defined earlier in this section. We generate training examples $(\mathbf{y}, \mathbf{b})$ according to this joint distribution to train our model.
We introduce a novel neural decoder for rate-1/2 RSC codes, which we call N-RSC. It consists of two layers of bidirectional Gated Recurrent Units (bi-GRU), each followed by batch normalization, and an output layer that is a single fully connected sigmoid unit. Let $\theta$ denote all the parameters in the model, whose dimensions are shown in Figure 2, and let $g_\theta(\mathbf{y}) = (g_\theta(\mathbf{y})_1, \ldots, g_\theta(\mathbf{y})_K)$ denote the output sequence. The $k$-th output $g_\theta(\mathbf{y})_k$ estimates the posterior probability $\mathbb{P}(b_k = 1 \,|\, \mathbf{y})$, and we train the weights $\theta$ to minimize the error with respect to a choice of a loss function $\ell(\cdot,\cdot)$ specified below:

$$\min_\theta \; \sum_{k=1}^{K} \ell\big(g_\theta(\mathbf{y})_k,\, b_k\big). \qquad (1)$$
As the encoder is a recurrent network, it is critical that we use recurrent neural networks as a building block. Among the several options for designing RNNs, we make three specific choices that are crucial in achieving the target accuracy: a bidirectional GRU as a building block instead of a unidirectional GRU; a 2-layer architecture instead of a single layer; and batch normalization. As we show in Table 2 in Appendix C, a unidirectional GRU fails because the underlying dynamic program requires bidirectional recursion, with both a forward pass and a backward pass through the received sequence. A single-layer bi-GRU fails to give the desired performance, while two layers are sufficient; we show how the accuracy depends on the number of layers in Table 2 in Appendix C. Batch normalization is also critical in achieving the target accuracy.
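The architectural choices above can be sketched at the shape level in a few lines of NumPy. This is an illustrative, randomly initialized forward pass with hypothetical dimensions (batch normalization omitted for brevity), not the trained decoder itself.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_layer(x, h_dim):
    """Single-direction GRU over a (T, d) sequence with randomly
    initialized weights (a shape-level sketch, not trained)."""
    d = x.shape[1]
    Wz, Wr, Wh = (rng.normal(0, 0.1, (h_dim, d)) for _ in range(3))
    Uz, Ur, Uh = (rng.normal(0, 0.1, (h_dim, h_dim)) for _ in range(3))
    h, out = np.zeros(h_dim), []
    for x_t in x:
        z = sigmoid(Wz @ x_t + Uz @ h)                 # update gate
        r = sigmoid(Wr @ x_t + Ur @ h)                 # reset gate
        h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h))     # candidate state
        h = (1 - z) * h + z * h_tilde
        out.append(h)
    return np.stack(out)

def bi_gru(x, h_dim):
    """Bidirectional GRU: forward and time-reversed passes, concatenated."""
    fwd = gru_layer(x, h_dim)
    bwd = gru_layer(x[::-1], h_dim)[::-1]
    return np.concatenate([fwd, bwd], axis=1)          # (T, 2*h_dim)

def neural_decoder(y, h_dim=16):
    """Two stacked bi-GRU layers followed by a per-position sigmoid unit,
    mirroring the decoder architecture described above."""
    h = bi_gru(bi_gru(y, h_dim), h_dim)
    w = rng.normal(0, 0.1, 2 * h_dim)
    return sigmoid(h @ w)                              # per-bit posteriors
```

The bidirectional concatenation is what lets every output position see both the past and the future of the received sequence, mirroring the forward and backward passes of the dynamic program.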

Training. We propose two novel training techniques that significantly improve the accuracy of the trained model. First, we propose a novel loss function, guided by the efficient dynamic programming, that significantly reduces the number of training examples we need to show. A natural loss (which gives better accuracy than cross-entropy in our problem) would be the squared error $\ell(g_\theta(\mathbf{y})_k, b_k) = (g_\theta(\mathbf{y})_k - b_k)^2$. Recall that the neural network estimates the posterior $\mathbb{P}(b_k = 1 \,|\, \mathbf{y})$, and the true label $b_k$ is a mere surrogate for this posterior, as typically the posterior distribution is simply not accessible. However, for decoding RSC codes, there exists an efficient dynamic program that can compute the posterior distribution exactly. This can significantly improve the sample complexity of our training, as we directly provide $\mathbb{P}(b_k = 1 \,|\, \mathbf{y})$ as the target, as opposed to a sample $b_k$ from this distribution. We use a Python implementation of BCJR in [20] to compute the posterior distribution exactly, and minimize the loss

$$\min_\theta \; \sum_{k=1}^{K} \big(g_\theta(\mathbf{y})_k - \mathbb{P}(b_k = 1 \,|\, \mathbf{y})\big)^2. \qquad (2)$$
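The two loss choices around Eqs. (1) and (2) can be written down concretely. This is a minimal sketch, assuming squared-error loss and precomputed per-bit posteriors (e.g., from a BCJR implementation such as [20]):

```python
import numpy as np

def surrogate_loss(p_hat, labels):
    """Eq. (1)-style loss: squared error against the 0/1 labels b_k,
    which are samples (surrogates) from the true posterior."""
    return np.mean((p_hat - labels) ** 2)

def posterior_loss(p_hat, p_map):
    """Eq. (2)-style loss: squared error against the exact per-bit MAP
    posteriors P(b_k = 1 | y), assumed precomputed by BCJR."""
    return np.mean((p_hat - p_map) ** 2)
```

Training against `p_map` removes the label-sampling variance from the regression target, which is the intuition behind the reduced sample complexity claimed above.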
Next, we provide a guideline for choosing training examples that improves the accuracy. As it is natural to sample the training data and test data from the same distribution, one might use the same noise level for testing and training. However, this is not reliable, as shown in Figure 3.
Channel noise is measured by the signal-to-noise ratio (SNR), defined as $\mathrm{SNR} = -10 \log_{10} \sigma^2$, where $\sigma^2$ is the variance of the Gaussian noise in the channel. For the rate-1/2 RSC code, we propose using training data with noise level $\mathrm{SNR}_{\mathrm{train}} = \min\{\mathrm{SNR}_{\mathrm{test}},\, 0\ \mathrm{dB}\}$. Namely, we propose matching the training SNR to the test SNR if the test SNR is below 0 dB, and otherwise fixing the training SNR at 0 dB independent of the test SNR. In Appendix D, we give a general formula for codes of general rate, and provide an information-theoretic justification and empirical evidence showing that this is a near-optimal choice of training data.
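A minimal sketch of the proposed training-SNR rule for the rate-1/2 code (the general-rate formula is deferred to Appendix D), together with the mapping from SNR to noise standard deviation under unit signal power:

```python
import numpy as np

def training_snr(test_snr_db):
    """Proposed rule for the rate-1/2 code: match the test SNR below
    0 dB, cap the training SNR at 0 dB above it. (The general-rate
    version of this rule is given in Appendix D.)"""
    return min(test_snr_db, 0.0)

def noise_sigma(snr_db):
    """Noise standard deviation for unit-power symbols:
    SNR = -10 log10(sigma^2)  =>  sigma = 10^(-SNR/20)."""
    return 10 ** (-snr_db / 20.0)
```

For example, a decoder evaluated at 3 dB would be trained at 0 dB (noise standard deviation 1), while one evaluated at -1.5 dB would be trained at -1.5 dB.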
Performance. In Figure 4, for various test SNRs, we train our N-RSC on randomly generated training data for the rate-1/2 RSC code of block length 100 over the AWGN channel, with the proposed training SNR of 0 dB. We trained the decoder with the Adam optimizer, learning rate $10^{-3}$, batch size 200, a total of 12,000 examples, and gradient norm clipping.¹ On the left we show the bit error rate when tested with a length-100 RSC encoder, matching the training data. We show that N-RSC is able to learn to decode and achieves the performance of the optimal dynamic programming (MAP decoder) almost everywhere. Perhaps surprisingly, we show in the right figure that we can take the neural decoder trained on length-100 codes, apply it directly to codes of length 10,000, and still meet the optimal performance. Note that we only give 12,000 training examples, while the number of unique codewords is $2^{100}$. This shows that the proposed neural decoder generalizes to unseen codewords, and seamlessly generalizes to significantly longer block lengths. More experimental results, including other types of convolutional codes, are provided in Appendix A. We also note that training with the true labels $b_k$ in decoding convolutional codes gives the same final BER performance as training with the posterior $\mathbb{P}(b_k = 1 \,|\, \mathbf{y})$.

¹Source code available at https://github.com/yihanjiang/SequentialRNNDecoder
Complexity.
When it comes to an implementation of a decoding algorithm, another important metric in evaluating a decoder is complexity. In this paper our comparisons focus on BER performance; the main claim of this paper is that there is an alternative decoding methodology, hitherto unexplored, and that this methodology can yield excellent BER performance. Regarding circuit complexity, we note that in computer vision there have been many recent ideas for making large neural networks practically implementable on a cell phone. For example, the idea of distilling the knowledge in a large network into a smaller network, and the idea of binarizing weights and data to do away with complex multiplication operations, have made it possible to run inference on much larger neural networks than the one in this paper on a smartphone [21, 22]. Such ideas can be utilized in our problem to reduce complexity as well. A serious and careful circuit-implementation complexity optimization and comparison is significantly complicated and beyond the scope of a single paper. Having said this, a preliminary comparison is as follows. The complexity of all the decoders (Viterbi, BCJR, neural decoder) is linear in the number of information bits (block length). The number of multiplications is quadratic in the dimension of the hidden states of the GRU (200) for the proposed neural decoder, and in the number of encoder states (4) for the Viterbi and BCJR algorithms.

Turbo codes are naturally built out of convolutional codes (both encoder and decoder) and represent some of the most successful codes for the AWGN channel [23]. A corresponding stacking of multiple layers of the convolutional neural decoders leads to a natural neural turbo decoder, which we show to match (and in some regimes even beat) the performance of standard state-of-the-art turbo decoders on the AWGN channel; the details are available in Appendix B. Unlike for convolutional codes, the state-of-the-art (message-passing) decoders for turbo codes are not the corresponding MAP decoders, so there is no contradiction in our neural decoder beating the message-passing ones. The training and architectural choices are similar in spirit to those for the convolutional code and are explored in detail in Appendix B.
3 Non-Gaussian channels: Robustness and Adaptivity
In the previous sections, we demonstrated that the neural decoder can perform as well as the turbo decoder. In practice, there is a wide variety of channel models suited to differing applications. We therefore test our neural decoder under some canonical channel models to see how robust and adaptive it is. Robustness refers to the ability of a decoder trained for a particular channel model to work well on a different channel model without retraining. Adaptivity refers to the ability of the learning algorithm to adapt and retrain for differing channel models. In this section, we demonstrate that the neural turbo decoder is both adaptive and robust by testing it on a set of non-Gaussian channel models.
Robustness. The robustness test is interesting in two directions, beyond its obvious practical value. Firstly, it is known from information theory that Gaussian noise is the worst-case noise among all noise distributions with a given variance [1, 24]. Shannon showed in his original paper [1] that, among all memoryless noise sequences (with the same average energy), Gaussian noise is the worst in terms of capacity. Much later, [24] showed that for any finite block length, the BER achieved by the minimum-distance decoder for any noise pdf is lower bounded by the BER for Gaussian noise, under the assumption of a Gaussian codebook. Since Viterbi decoding is the minimum-distance decoder for convolutional codes, it is naturally robust in the precise sense above. The turbo decoder, on the other hand, does not inherit this property, making it vulnerable to adversarial attacks. We show that the neural decoder is more robust than the turbo decoder to a non-Gaussian noise, namely t-distributed noise. Secondly, the robustness test poses an interesting challenge for neural decoders, since deep neural networks are known to misclassify when tested against small adversarial perturbations [17, 25]. While we are not necessarily interested in adversarial perturbations to the input in this paper, it is important for the learning algorithm to be robust against differing noise distributions. We leave research on robustness to small adversarial perturbations as future work.
For the non-Gaussian channel, we choose the t-distribution family, parameterized by its degrees of freedom $\nu$. We test the performance of both the neural and turbo decoders on this channel in Figure 4(a) and observe that the neural decoder performs significantly better than the standard turbo decoder (also see Figure 15(a) in Appendix E). In order to understand the reason for such poor performance of the standard turbo decoder, we plot the average output log-likelihood ratio (LLR) as a function of the bit position in Figure 4(b), when the input is the all-zero codeword. The main issue for the standard decoder is that the LLRs are not calculated accurately (see Figure 15(b) in Appendix E): the LLR is exaggerated under the t-distribution. While there is some exaggeration in the neural decoder as well, it is more modest in its predictions, leading to more contained error propagation.
Adaptivity. A great advantage of a neural channel decoder is that the neural network can learn a decoding algorithm even if the channel does not yield to a clean mathematical analysis. Consider a scenario where the transmitted signal always experiences Gaussian noise but, with a small probability, an additional high-variance noise is added. The channel model is mathematically described as follows, with $y_t$ denoting the received symbol and $x_t$ the transmitted symbol at time instant $t$: $y_t = x_t + z_t + w_t$, where $z_t \sim \mathcal{N}(0, \sigma^2)$, and $w_t \sim \mathcal{N}(0, \sigma_b^2)$ with probability $\rho$ and $w_t = 0$ with probability $1 - \rho$; i.e., $z_t$ denotes the Gaussian noise whereas $w_t$ denotes the bursty noise.
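The bursty channel described above is straightforward to simulate; a minimal sketch, with hypothetical parameter names:

```python
import numpy as np

def bursty_awgn(x, sigma, sigma_b, rho, rng=None):
    """Sketch of the bursty-AWGN channel: every symbol receives
    Gaussian noise with std `sigma`; with probability `rho` a second,
    high-variance (std `sigma_b`) burst component is added on top."""
    rng = rng or np.random.default_rng()
    z = rng.normal(0.0, sigma, size=x.shape)            # always-on AWGN
    mask = rng.random(x.shape) < rho                    # latent burst locations
    w = mask * rng.normal(0.0, sigma_b, size=x.shape)   # occasional bursts
    return x + z + w
```

Note that the burst location `mask` is latent: the decoder never observes it, which is precisely what makes a straightforward modification of the turbo decoder difficult.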
This channel model accurately describes how radar signals (which are bursty) can create interference for LTE in next-generation wireless systems. The model has attracted attention in the communications systems community due to its practical relevance [26, 27]. Under the aforesaid channel model, it turns out that the standard turbo decoder fails very badly [28]. The reason the turbo decoder cannot be modified in a straightforward way is that the location of the bursty noise is a latent variable that needs to be jointly decoded along with the message bits. In order to combat this particular noise model, we fine-tune our neural decoder on this noise model, initialized from the AWGN neural decoder, and term it the bursty neural decoder. There are two state-of-the-art heuristics [28]: (a) erasure-thresholding, where all LLRs whose magnitude exceeds a threshold are set to zero (erased); and (b) saturation-thresholding, where all LLRs whose magnitude exceeds a threshold are set to the (signed) threshold.
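The two heuristics can be sketched directly on an array of LLRs (the erasure value of zero and the threshold are assumptions for illustration; the actual threshold is a tuning parameter of [28]):

```python
import numpy as np

def erasure_threshold(llr, t):
    """Heuristic (a): LLRs whose magnitude exceeds t are erased
    (set to 0, i.e., treated as carrying no information)."""
    out = llr.copy()
    out[np.abs(out) > t] = 0.0
    return out

def saturation_threshold(llr, t):
    """Heuristic (b): LLRs are clipped to the signed threshold."""
    return np.clip(llr, -t, t)
```

Both heuristics limit the damage a single burst-corrupted symbol can do inside the iterative decoder, at the cost of also suppressing genuinely confident LLRs.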
We demonstrate the performance of our AWGN neural decoder (trained on Gaussian noise) as well as the standard turbo decoder (designed for Gaussian noise) on this problem in Figure 6. We summarize the results of Figure 6: (1) the standard turbo decoder, unaware of the bursty noise, fails completely; (2) the standard neural decoder still outperforms the standard turbo decoder; (3) the bursty neural decoder outperforms the turbo decoder equipped with either state-of-the-art heuristic at some burst variances, and approaches the performance of the better of the two schemes at the others.
Interpreting the Neural Decoder. We try to interpret the action of the neural decoder trained under bursty noise. To do so, we look at the following simplified model: the channel is as before, but the bursty noise is added only at the middle symbol of the codeword. We also fix the input to be the all-zero codeword. We look at the average output LLR as a function of position for one round of the neural decoder in Figure 6(a) and one round of the BCJR algorithm in Figure 6(b) (the BER as a function of position is shown in Figure 17 in Appendix E). A negative LLR implies correct decoding at that position and a positive LLR implies incorrect decoding. It is evident that both the RNN and the BCJR algorithm make errors concentrated around the midpoint of the codeword. What differs between the two figures, however, is the scale of the likelihoods: BCJR has a high sense of (misplaced) confidence, whereas the RNN is more modest in its assessment of its confidence. In the later stages of decoding, the exaggerated confidence of BCJR leads to an error-propagation cascade that eventually toggles other bits as well.
4 Conclusion
In this paper we have demonstrated that appropriately designed and trained RNN architectures can ‘learn’ the landmark Viterbi and BCJR decoding algorithms, based on the strong generalization capabilities we demonstrate. This is similar in spirit to recent works on ‘program learning’ in the literature [29, 30]. In those works, the learning is assisted significantly by a low-level program trace on an input; here we learn the Viterbi and BCJR algorithms from end-to-end training samples alone; we conjecture that this could be related to the strong “algebraic" nature of the Viterbi and BCJR algorithms. The representation capabilities and learnability of RNN architectures in decoding existing codes suggest the possibility that new codes could be learnt on the AWGN channel itself, improving the state of the art (constituted by turbo, LDPC, and polar codes). Also interesting is a new look at classical multiterminal communication problems, including the relay and interference channels. Both are active areas of present research.
References
 [1] C. E. Shannon, “A mathematical theory of communication, part i, part ii,” Bell Syst. Tech. J., vol. 27, pp. 623–656, 1948.
 [2] T. Richardson and R. Urbanke, Modern coding theory. Cambridge university press, 2008.
 [3] D. Tse and P. Viswanath, Fundamentals of wireless communication. Cambridge university press, 2005.

 [4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
 [5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [7] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel decoding,” in Information Sciences and Systems (CISS), 2017 51st Annual Conference on. IEEE, 2017, pp. 1–6.
 [8] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, “Scaling deep learning-based decoding of polar codes via partitioning,” in GLOBECOM. IEEE, 2017, pp. 1–6.
 [9] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep learning-based communication over the air,” arXiv preprint arXiv:1707.03384, 2017.
 [10] T. J. O’Shea and J. Hoydis, “An introduction to machine learning communications systems,” CoRR, vol. abs/1702.00832, 2017. [Online]. Available: http://arxiv.org/abs/1702.00832
 [11] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept 2016, pp. 341–346.
 [12] W. Xu, Z. Wu, Y. L. Ueng, X. You, and C. Zhang, “Improved polar decoder based on deep learning,” in 2017 IEEE International Workshop on Signal Processing Systems (SiPS), Oct 2017, pp. 1–6.
 [13] A. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
 [14] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate (corresp.),” IEEE Transactions on information theory, vol. 20, no. 2, pp. 284–287, 1974.
 [15] X.A. Wang and S. B. Wicker, “An artificial neural net viterbi decoder,” IEEE Transactions on Communications, vol. 44, no. 2, pp. 165–171, Feb 1996.
 [16] M. H. Sazlı and C. Işık, “Neural network implementation of the bcjr algorithm,” Digital Signal Processing, vol. 17, no. 1, pp. 353 – 359, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1051200406000029
 [17] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
 [18] M. T. Ribeiro, S. Singh, and C. Guestrin, “"why should i trust you?": Explaining the predictions of any classifier,” in Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 1135–1144. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939778
 [19] J. Li, X. Wu, and R. Laroia, OFDMA mobile broadband communications: A systems approach. Cambridge University Press, 2013.
 [20] V. Taranalli, “Commpy: Digital communication with python, version 0.3.0. available at https://github.com/veeresht/commpy,” 2015.
 [21] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015. [Online]. Available: http://arxiv.org/abs/1503.02531
 [22] I. Hubara, M. Courbariaux, D. Soudry, R. ElYaniv, and Y. Bengio, “Binarized neural networks,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 4107–4115. [Online]. Available: http://papers.nips.cc/paper/6573binarizedneuralnetworks.pdf
 [23] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1,” in Communications, 1993. ICC’93 Geneva. Technical Program, Conference Record, IEEE International Conference on, vol. 2. IEEE, 1993, pp. 1064–1070.
 [24] A. Lapidoth, “Nearest neighbor decoding for additive nongaussian noise channels,” IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1520–1529, 1996.
 [25] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
 [26] F. H. Sanders, J. E. Carroll, G. A. Sanders, and R. L. Sole, “Effects of radar interference on lte base station receiver performance,” NTIA, US Dept. of Commerce, 2013.
 [27] G. A. Sanders, Effects of radar interference on LTE (FDD) eNodeB and UE receiver performance in the 3.5 GHz band. US Department of Commerce, National Telecommunications and Information Administration, 2014.
 [28] H.A. SafaviNaeini, C. Ghosh, E. Visotsky, R. Ratasuk, and S. Roy, “Impact and mitigation of narrowband radar interference in downlink lte,” in Communications (ICC), 2015 IEEE International Conference on. IEEE, 2015, pp. 2644–2649.
 [29] S. Reed and N. De Freitas, “Neural programmerinterpreters,” arXiv preprint arXiv:1511.06279, 2015.
 [30] J. Cai, R. Shin, and D. Song, “Making neural programming architectures generalize via recursion,” arXiv preprint arXiv:1704.06611, 2017.
Appendix
Appendix A Neural decoder for other convolutional codes
The rate-1/2 RSC code introduced in Section 2 is one example among many convolutional codes. In this section, we show empirically that neural decoders can be trained to decode other types of convolutional codes as well as the MAP decoder does. We consider the following two convolutional codes.
Unlike the rate-1/2 RSC code in Section 2, the convolutional code in Figure 8(a) is not recursive, i.e., the state does not have feedback. Also, it is non-systematic, i.e., the message bits cannot be read off directly from the coded bits. The convolutional code in Figure 8(b) is another rate-1/2 RSC code, with a larger state dimension (dimension 3 instead of 2).
The neural network architectures used for the convolutional codes in Figure 8 are as follows. For the code in Figure 8(a), we used exactly the same architecture as for the rate-1/2 RSC code in Section 2. For the code in Figure 8(b), we used a larger network (an LSTM instead of a GRU, and 800 hidden units instead of 400), owing to the increased state dimension of the encoder.


To train the neural decoder for the code in Figure 8(a), we used training examples of block length 100 at a fixed training SNR. For the code in Figure 8(b), we used training examples of block length 500, as this code has a larger state space. We set the batch size to 200 and applied gradient norm clipping.
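A sketch of how such fixed-SNR training data can be generated is below. The encoding step is omitted for brevity, and the SNR-to-noise-variance mapping assumes unit-energy BPSK symbols; the function name and defaults are illustrative, not the paper's code.

```python
import numpy as np

def awgn_training_batch(n_examples, block_len, snr_db, rng=None):
    """Generate (noisy BPSK sequence, message bits) pairs at a fixed SNR.

    SNR(dB) = 10*log10(1/sigma^2) for unit-energy symbols 0 -> +1, 1 -> -1.
    A real pipeline would encode `bits` before modulation; that step is
    omitted here.
    """
    rng = rng or np.random.default_rng(0)
    sigma = 10 ** (-snr_db / 20)                  # noise standard deviation
    bits = rng.integers(0, 2, size=(n_examples, block_len))
    symbols = 1.0 - 2.0 * bits                    # BPSK mapping
    noisy = symbols + sigma * rng.standard_normal(symbols.shape)
    return noisy, bits
```

At a training SNR of 0dB this gives sigma = 1, i.e., noise power equal to the symbol energy.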
Performance. In Figure 10, we show the BER and BLER of the trained neural decoder for the convolutional code in Figure 8(a) under various SNRs and block lengths. As these figures show, a neural decoder trained at one SNR (0dB) and a short block length (100) generalizes, decoding as well as the MAP decoder across SNRs and block lengths. Similarly, in Figure 11 we show the BER and BLER performance of the trained neural decoder for the convolutional code in Figure 8(b), which again demonstrates the generalization capability of the trained neural decoder.
Appendix B Neural decoder for turbo codes
Turbo codes, also called parallel concatenated convolutional codes, are popular in practice as they significantly outperform RSC codes. We provide a neural decoder for turbo codes by stacking multiple layers of the neural decoder we introduced for RSC codes. An example of a rate-1/3 turbo code is shown in Figure 12. Two identical rate-1/2 RSC encoders are used: encoder 1 takes the original message sequence b as input, and encoder 2 takes a randomly permuted version of b as input; the interleaver performs this random permutation. The systematic output sequence of encoder 2 is simply a permutation of the systematic output of encoder 1, and hence redundant. This sequence is therefore discarded, and the remaining three sequences are transmitted; hence the rate is 1/3.
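The encoding structure can be sketched as follows. The RSC component generators here (feedback 1+D+D^2, parity 1+D^2, i.e., the (1, 5/7) code) are illustrative stand-ins for those in Figure 12.

```python
def rsc_encode(bits):
    """Rate-1/2 recursive systematic encoder, illustrative generators (1, 5/7).

    Returns (systematic stream, parity stream). The feedback makes it
    recursive; passing the message through makes it systematic.
    """
    s = [0, 0]                       # two memory elements
    systematic, parity = [], []
    for b in bits:
        a = b ^ s[0] ^ s[1]          # feedback taps 1 + D + D^2
        p = a ^ s[1]                 # parity taps 1 + D^2
        systematic.append(b)
        parity.append(p)
        s = [a, s[0]]                # shift in the feedback bit
    return systematic, parity

def turbo_encode(bits, perm):
    """Rate-1/3 turbo encoding: encoder 1 sees b, encoder 2 sees an
    interleaved copy of b; encoder 2's systematic stream is a permutation
    of b and is dropped as redundant."""
    sys1, par1 = rsc_encode(bits)
    _, par2 = rsc_encode([bits[i] for i in perm])
    return sys1, par1, par2          # three streams -> rate 1/3
```

The three returned streams (systematic, parity 1, interleaved parity 2) are what is transmitted over the channel.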
These sequences are transmitted over the AWGN channel, and the noisy received sequences are denoted y. Due to the interleaved structure of the encoder, exact MAP decoding is computationally intractable. Instead, an iterative decoder known as the turbo decoder is used in practice, with the RSC MAP decoder (the BCJR algorithm) as a building block. In the first iteration, the standard BCJR algorithm estimates the posterior of each message bit b_k under a uniform prior. Next, BCJR is run on the interleaved sequence, but now takes the output of the first layer as the prior on the b_k's. This process is repeated, refining the belief about the message bits, until convergence, and a final estimate is made for each bit.
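The iterative message-passing structure just described can be sketched generically. Here `soft_decoder` is a stand-in for any soft-in soft-out component decoder (BCJR, or a learned layer replacing it); everything operates on log-likelihood ratios where positive values favor bit 0. The function and its signature are illustrative assumptions, not the paper's implementation.

```python
def turbo_decode(y_sys, y_par1, y_par2, perm, soft_decoder, n_iters=6):
    """Iterative turbo decoding sketch: two component decoders exchange
    extrinsic information through the interleaver.

    soft_decoder(received_sys, received_parity, prior_llrs) -> posterior LLRs.
    """
    n = len(y_sys)
    inv = [0] * n
    for i, p in enumerate(perm):
        inv[p] = i                                 # de-interleaver
    prior = [0.0] * n                              # uniform prior: LLR 0
    for _ in range(n_iters):
        # Decoder 1 works in natural order.
        post1 = soft_decoder(y_sys, y_par1, prior)
        ext1 = [post1[i] - prior[i] for i in range(n)]      # extrinsic info
        # Decoder 2 works in interleaved order, with ext1 as its prior.
        post2 = soft_decoder([y_sys[p] for p in perm], y_par2,
                             [ext1[p] for p in perm])
        # Decoder 2's extrinsic info, mapped back to natural order,
        # becomes decoder 1's prior in the next iteration.
        prior = [post2[inv[i]] - ext1[i] for i in range(n)]
    llr = [post2[inv[i]] for i in range(n)]
    return [0 if l >= 0 else 1 for l in llr]       # hard decisions
```

With a real BCJR component this refines the bit beliefs over iterations; the sketch only fixes the data flow between the two decoders.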
Training. We propose a neural decoder for turbo codes, which we call N-Turbo, shown in Figure 12. Following the layered architecture of the turbo decoder, we stack layers of a variant of our N-RSC decoder, which we call N-BCJR. However, end-to-end training of such a deep stack of recurrent layers (from examples of received sequences and message sequences) is challenging. We therefore first train each layer separately, use these trained models as initializations, and then train the full N-Turbo decoder end to end starting from these initialized weights.
We first explain the N-BCJR architecture, a new variant of N-RSC that can take an arbitrary bitwise prior distribution as input; the N-RSC decoder proposed earlier is customized to a uniform prior. The architecture is similar to that of N-RSC; the main differences are the input size (3 instead of 2, to accommodate the prior) and the type of RNN (LSTM instead of GRU). To generate training examples, we generate turbo codewords, run the turbo decoder for 12 component-decoding steps, and collect the input-output pairs from these 12 intermediate steps of the turbo decoder, implemented in Python [20], as shown in Figure 13.
We train with codes of block length 100 at a fixed SNR of 1dB, using the mean squared error in (2) as the cost function.
To generate training examples with non-uniform priors, i.e., triplets of (prior probabilities, a received sequence, posterior probabilities of the message bits), we use the intermediate layers of a turbo decoder: we run the turbo decoder and, at each intermediate layer, record the triplet of input prior probabilities, input sequence, and output of the BCJR layer. We fix the training SNR at 1dB. We stack 6 layers of N-BCJR decoders with interleavers in between. The last layer of our neural decoder is trained slightly differently, to output the estimated message bits rather than posterior probabilities; accordingly, we use the binary cross-entropy loss as its cost function. We train each N-BCJR layer with 2,000 examples from the length-100 turbo encoder, and in the end-to-end training of N-Turbo we train with 1,000 examples from the length-1,000 turbo encoder. We train for 10 epochs with the ADAM optimizer at learning rate 0.001. For the end-to-end training, we again use a fixed noise SNR (1dB) and test on various SNRs; the choice of training SNR is discussed in detail in Appendix D.
Performance. As can be seen in Figure 14, the proposed N-Turbo matches the performance of the turbo decoder for block length 100 and, at some test SNRs, achieves a higher accuracy. As with N-RSC, N-Turbo generalizes to unseen codewords, since only a limited number of examples are shown in total during training. It also seamlessly generalizes across test SNRs, as the training SNR is fixed at 1dB.
Appendix C Other neural network architectures for N-RSC and N-BCJR
In this section, we show the performance of various recurrent network architectures in decoding the rate-1/2 RSC code and in learning the BCJR algorithm with non-uniform priors. Table 2 shows the BER of recurrent neural networks of various types trained under the same conditions as N-RSC (120,000 examples, code length 100). The BERs of the 1-layer networks and the unidirectional GRU are an order of magnitude worse than that of the 2-layer bidirectional GRU (N-RSC), and two layers suffice, as deeper networks give no further gain. The second table shows the performance of various recurrent architectures on the BCJR training task; again, two layers are needed, and unidirectional RNNs do not work as well as bidirectional ones.
Model  BER (at 4dB)  Training examples  Hidden units

bi-LSTM, 1 layer  0.01376  1.2e+5  200
bi-GRU, 1 layer  0.01400  1.2e+5  200
uni-GRU, 2 layers  0.01787  1.2e+5  200
bi-RNN, 2 layers  0.05814  1.2e+5  200
bi-GRU, 2 layers  0.00128  1.2e+5  200
bi-GRU, 3 layers  0.00127  1.2e+5  200
bi-GRU, 4 layers  0.00128  1.2e+5  200
bi-GRU, 5 layers  0.00132  1.2e+5  200
BCJR-like RNN performance (BD = bidirectional, SD = single-directional; the digit is the number of layers)

Model  Hidden units  BCJR val. MSE  Turbo BER (turbo, 6 iters: 0.002)
BD-1-LSTM  100  0.0031  0.1666
BD-1-GRU  100  0.0035  0.1847
BD-1-RNN  100  0.0027  0.1448
BD-1-LSTM  200  0.0031  0.1757
BD-1-GRU  200  0.0035  0.1693
BD-1-RNN  200  0.0024  0.1362
SD-1-LSTM  100  0.0033  0.1656
SD-1-GRU  100  0.0034  0.1827
SD-1-RNN  100  0.0033  0.2078
SD-1-LSTM  200  0.0032  0.137
SD-1-GRU  200  0.0033  0.1603
SD-1-RNN  200  0.0024  0.1462
BD-2-LSTM  100  4.4176e-04  0.1057
BD-2-GRU  100  1.9736e-04  0.0128
BD-2-RNN  100  7.5854e-04  0.0744
BD-2-LSTM  200  1.5917e-04  0.01307
BD-2-GRU  200  1.1532e-04  0.00609
BD-2-RNN  200  0.0010  0.11229
SD-2-LSTM  100  0.0023  0.1643
SD-2-GRU  100  0.0026  0.1732
SD-2-RNN  100  0.0023  0.1614
SD-2-LSTM  200  0.0023  0.1643
SD-2-GRU  200  0.0023  0.1582
SD-2-RNN  200  0.0023  0.1611
Appendix D Guidelines for choosing the training SNR for neural decoders
Since it is natural to sample training and test data from the same distribution, one might use the same noise level for training and testing. However, this matched-SNR choice is not reliable, as shown in Figure 3. We give an analysis that predicts an appropriate training SNR, which may differ from the test SNR, and justify the choice by comparing performance over various pairs of training and test SNRs.
We conjecture that the optimal training SNR, i.e., the one giving the best BER at a target test SNR, depends on the coding rate r, defined as the ratio between the length of the message bit sequence and the length of the transmitted codeword sequence. The example used in this paper is a rate-1/2 code. For a rate-r code, we propose choosing the training SNR according to

SNR_train = min( SNR_test, 10 log10( 2^{2r} − 1 ) ) dB,   (3)

and we call the knee of this curve the threshold. In particular, this gives a threshold of 0dB for rate-1/2 codes. In Figure 15 (left), we train our neural decoder for RSC encoders of varying rates; the threshold is plotted as a function of the rate in the right panel of Figure 15. Compared to the grey shaded region of empirically observed training SNRs achieving the best performance, the theoretical prediction is followed up to a small shift. The left panel shows the empirically best training SNR at each test SNR for codes of various rates; it follows the trend of the theoretical prediction of a curve with a knee. Below the threshold, it closely aligns with the 45-degree line SNR_train = SNR_test; above the threshold, the curves become constant.
We derive the formula in (3) in two parts. When the test SNR is below the threshold, the target bit error rate (and similarly block error rate) is relatively high, which implies that a significant portion of the test examples lie near the decision boundary of the problem. It then makes sense to match the training distribution, since a significant portion of the training examples will also lie near the boundary, which is what we want in order to make the most of the samples. Above the threshold, on the other hand, the target bit error rate can be much smaller; most test examples are then easy, and only a very small proportion lie near the decision boundary. If we matched the training SNR, most training examples would be wasted. Instead, we should show examples at the decision boundary, and we propose that training examples drawn at the threshold SNR lie near this boundary. This is a crude but effective estimate, and it can be computed from capacity-achieving random codes for the AWGN channel and the distances between codewords at capacity. Capacity is a fundamental limit on the rate that can be used at a given test SNR while achieving small error; in other words, for a given test SNR over the AWGN channel, the Gaussian capacity C = (1/2) log2(1 + SNR) tells us how densely the codewords (the classes in our classification problem) can be packed. This gives a sense of how the decision boundaries (as measured by the test SNR) depend on the rate: solving C = r for the SNR yields the threshold 2^{2r} − 1 that we seek.
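Assuming the threshold rule derived above (the training SNR matches the test SNR up to the point where the Gaussian capacity equals the code rate, then saturates), the guideline can be computed as:

```python
import math

def training_snr_db(test_snr_db, rate):
    """Training-SNR guideline: match the test SNR below the capacity
    threshold, then saturate at the SNR where (1/2)*log2(1 + SNR) = rate,
    i.e., SNR = 2**(2*rate) - 1, expressed in dB."""
    threshold_db = 10 * math.log10(2 ** (2 * rate) - 1)
    return min(test_snr_db, threshold_db)
```

For a rate-1/2 code the threshold is 10 log10(2^1 − 1) = 0dB, consistent with the 0dB training SNR used for the rate-1/2 codes in Appendix A.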