I. Introduction
Recently, deep learning (DL) has made remarkable achievements in the fields of computer vision and natural language processing, and it has also been adopted for channel decoding. The data-driven DL approach in [1] converted the decoding task into a pure learning-to-decode problem by optimizing a general black-box fully connected deep neural network (FC-DNN). Despite the advantage of one-shot decoding (i.e., no iterations), the FC-DNN-based decoder lacks expert knowledge, which renders it unaccountable and fundamentally restricted by dimensionality: training such a network is impossible in practice because the training complexity increases exponentially with the block length (e.g., for a turbo code with K information bits, 2^K different codewords exist) [2]. In addition, a recurrent neural network (RNN) architecture containing two layers of bidirectional gated recurrent units was adopted to learn the BCJR algorithm
[3]. The aforementioned data-driven decoding methods rely on large amounts of data to train numerous parameters; they therefore converge slowly and suffer from high computational complexity.
To address these issues, the model-driven DL approach can be used instead. The concept of a "soft" Tanner graph was proposed in [4], where weights were assigned to the Tanner graph of the belief propagation (BP) algorithm to obtain a deep neural network (DNN). These weights were learned to properly scale the messages transmitted in the Tanner graph, thereby improving the performance of the BP algorithm. Because a large number of multiplications were required in [4], the authors in [5] proposed a min-sum algorithm with trainable offset parameters to reduce the computational complexity. The aforementioned DNN-based BP decoder was transformed into an RNN architecture in [6], named the BP-RNN decoder, by sharing the weights across iterations, thereby reducing the number of parameters without sacrificing performance. In addition, a trainable relaxation factor was introduced to further improve the performance of this BP-RNN decoder.
In sum, two central limitations are inherent in current DL-based decoding methods. First, existing data-driven approaches rely on vast numbers of training parameters. Second, the aforementioned model-driven decoding algorithms are all based on the BP algorithm, and whether such algorithms can be applied to sequential codes (e.g., turbo codes) to improve performance remains unknown. To address these limitations, this paper presents TurboNet, a novel model-driven DL architecture for turbo decoding that combines DL with the traditional max-log-maximum a posteriori (max-log-MAP) algorithm. TurboNet is constructed from the domain knowledge embodied in turbo decoding algorithms and employs only a few learnable parameters. More specifically, the original iterative structure is unfolded to obtain an "unrolled" structure (i.e., each iteration is considered separately), and the max-log-MAP algorithm is parameterized. With this design, the parameters can be learned from training data more efficiently than with the existing black-box FC-DNN [2] and RNN [3] architectures. Our TurboNet decoder outperforms the traditional max-log-MAP algorithm for turbo decoding at different code rates and contains considerably fewer parameters than the neural BCJR decoder proposed in [3]. Furthermore, the proposed TurboNet decoder generalizes well; that is, TurboNet is trained at a single signal-to-noise ratio (SNR) and outperforms the max-log-MAP algorithm over a wide range of SNRs.
II. TurboNet
To obtain the model-driven DL architecture for turbo decoding, we briefly describe the system model in Section II-A and the traditional max-log-MAP algorithm in Section II-B. The architecture and details of TurboNet are elaborated in Section II-C, including a redefined loss function that evaluates the network loss.
II-A. System Model
At the transmitter, a binary information sequence is encoded by a turbo encoder that contains two identical recursive systematic convolutional encoders (RSCEs). The generator matrix of the RSCE is G(D) = [1, g_1(D)/g_0(D)], where g_0(D) = 1 + D^2 + D^3 and g_1(D) = 1 + D + D^3 [7]. The feedthrough passes one block of information bits u = (u_1, ..., u_K) to the output of the encoder, where they are referred to as systematic bits. The first RSCE generates a sequence of parity bits from the systematic bits, and the second RSCE generates a sequence of parity bits from an interleaved version of the systematic bits. S denotes the set of all RSCE states. The codeword consisting of the systematic and parity bits is then modulated and transmitted over an additive white Gaussian noise (AWGN) channel. At the receiver, a soft-output detector computes reliability information in the form of log-likelihood ratios (LLRs) for the transmitted bits. The resulting LLRs indicate the probability of the corresponding bits being a binary 1 or 0.
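As a concrete illustration of this last step, the following NumPy sketch (an illustrative assumption, not the paper's code) maps bits to BPSK symbols (0 → −1, 1 → +1), passes them through an AWGN channel, and computes the channel LLRs L = 2y/σ², under the convention that a positive LLR favors bit 1:

```python
import numpy as np

def channel_llrs(bits, sigma, rng):
    """BPSK-modulate bits (0 -> -1, 1 -> +1), add AWGN with standard
    deviation sigma, and return the channel LLRs L = 2*y / sigma**2."""
    symbols = 2.0 * np.asarray(bits, dtype=float) - 1.0
    y = symbols + sigma * rng.normal(size=symbols.shape)  # AWGN channel
    return 2.0 * y / sigma**2

rng = np.random.default_rng(0)
llrs = channel_llrs([0, 1, 1, 0], sigma=0.5, rng=rng)
# under this sign convention, llrs[k] > 0 favors bit 1
```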
II-B. Max-Log-MAP Algorithm
A traditional turbo decoder, introduced in [8], contains two soft-input soft-output (SISO) decoders with the same structure. Therefore, we only introduce decoder 1 in detail. The MAP algorithm is used in decoder 1 to compute the a posteriori LLR L(u_k) for each information bit u_k as follows:
L(u_k) = \ln \frac{\sum_{(s',s) \in S_1} p(s', s, y)}{\sum_{(s',s) \in S_0} p(s', s, y)}    (1)
where s' and s represent the states of the encoder at times k − 1 and k, respectively, and the sequence y is made up of the LLRs of the systematic bits and the corresponding parity bits. S_1 is the set of ordered pairs (s', s) corresponding to all state transitions caused by the data input u_k = 1, and S_0 is similarly defined for u_k = 0. All of these state transitions are given in detail in Table I.

TABLE I: State transitions of the RSCE

s'             0  1  2  3  4  5  6  7
s (u_k = 0)    0  4  5  1  2  6  7  3
s (u_k = 1)    4  0  1  5  6  2  3  7
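Table I can be reproduced mechanically from the RSCE feedback polynomial. The sketch below assumes the LTE feedback polynomial g_0(D) = 1 + D^2 + D^3 from [7] and encodes the register contents (d_1, d_2, d_3) as the state 4d_1 + 2d_2 + d_3:

```python
def rsc_next_state(state, u):
    """Next state of the 8-state RSCE with feedback 1 + D^2 + D^3.
    The register contents (d1, d2, d3) are encoded as 4*d1 + 2*d2 + d3."""
    d1, d2, d3 = (state >> 2) & 1, (state >> 1) & 1, state & 1
    a = u ^ d2 ^ d3                 # feedback bit shifted into the register
    return (a << 2) | (d1 << 1) | d2

row_u0 = [rsc_next_state(s, 0) for s in range(8)]  # [0, 4, 5, 1, 2, 6, 7, 3]
row_u1 = [rsc_next_state(s, 1) for s in range(8)]  # [4, 0, 1, 5, 6, 2, 3, 7]
```

Both rows match the table, which is a quick way to sanity-check the assumed polynomial.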
On the basis of the Bayes formula, we obtain

p(s', s, y) = \alpha_{k-1}(s') \, \gamma_k(s', s) \, \beta_k(s)    (2)
where \alpha_{k-1}(s') and \beta_k(s) can be computed through the forward and backward recursions [9]:

\alpha_k(s) = \sum_{s'} \alpha_{k-1}(s') \, \gamma_k(s', s)    (3)

\beta_{k-1}(s') = \sum_{s} \beta_k(s) \, \gamma_k(s', s)    (4)
with initial conditions \alpha_0(0) = 1, \alpha_0(s) = 0 for s \neq 0, and \beta_K(0) = 1, \beta_K(s) = 0 for s \neq 0. Moreover, \gamma_k(s', s) is computed as follows [10]:

\gamma_k(s', s) = \exp\left( \tfrac{1}{2} u_k \left( L^a(u_k) + L(y_k^s) \right) + \tfrac{1}{2} v_k L(y_k^{p_1}) \right)    (5)

where L^a(u_k) is the a priori probability LLR for the bit u_k and, with a slight abuse of notation, u_k and v_k denote the systematic and parity bits of the transition (s', s) mapped to ±1. Given that L(u_k) is the sum of the systematic bit LLR L(y_k^s), the a priori probability LLR L^a(u_k), and the extrinsic LLR L_e(u_k), we obtain
L_e(u_k) = L(u_k) - L(y_k^s) - L^a(u_k)    (6)
which can be used as the a priori probability LLR input of the subsequent SISO decoder 2 after it is interleaved.
The log-MAP algorithm introduced in [11] evaluates \alpha_k(s) and \beta_k(s) in the logarithmic domain using the Jacobian logarithm \max^*(x, y) = \ln(e^x + e^y) = \max(x, y) + \ln(1 + e^{-|x - y|}), as follows:

\bar{\alpha}_k(s) = \max^*_{s'} \left( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) \right)    (7)

and

\bar{\beta}_{k-1}(s') = \max^*_{s} \left( \bar{\beta}_k(s) + \bar{\gamma}_k(s', s) \right)    (8)

where \bar{\alpha}_k(s), \bar{\beta}_k(s), and \bar{\gamma}_k(s', s) represent the logarithmic values of \alpha_k(s), \beta_k(s), and \gamma_k(s', s), respectively. The a posteriori LLRs for the information bits are computed as

L(u_k) = \max^*_{(s',s) \in S_1} \left( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \right) - \max^*_{(s',s) \in S_0} \left( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \right)    (9)
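The Jacobian logarithm is easy to check numerically. The following sketch (function names are ours) implements max*(x, y) and compares it against a direct log-sum-exp:

```python
import numpy as np

def max_star(x, y):
    """Jacobian logarithm: ln(e^x + e^y) = max(x, y) + ln(1 + e^(-|x - y|))."""
    return max(x, y) + np.log1p(np.exp(-abs(x - y)))

# the max-log-MAP approximation simply drops the correction term:
approx = max(1.0, 2.0)          # 2.0
exact = max_star(1.0, 2.0)      # ln(e^1 + e^2), about 2.3133
```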
The max-log-MAP algorithm omits the logarithmic correction term in the Jacobian logarithm, i.e., \max^*(x, y) \approx \max(x, y) [12]. Hence, (7)–(9) can be approximately written as

\bar{\alpha}_k(s) = \max_{s'} \left( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) \right)    (10)

\bar{\beta}_{k-1}(s') = \max_{s} \left( \bar{\beta}_k(s) + \bar{\gamma}_k(s', s) \right)    (11)

and

L(u_k) = \max_{(s',s) \in S_1} \left( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \right) - \max_{(s',s) \in S_0} \left( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \right)    (12)
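To make the recursions concrete, here is a compact NumPy sketch of one max-log-MAP SISO pass over the 8-state trellis. It is illustrative only: it assumes the LTE generator polynomials, starts the forward metric in state 0, initializes the backward metric uniformly for brevity (no termination handling), and uses the convention that a positive LLR favors bit 1.

```python
import numpy as np

NEG = -1e9  # stands in for minus infinity in the max-log domain

def rsc_step(state, u):
    """One step of the 8-state RSCE (feedback 1 + D^2 + D^3, forward
    1 + D + D^3): returns (next_state, parity_bit)."""
    d1, d2, d3 = (state >> 2) & 1, (state >> 1) & 1, state & 1
    a = u ^ d2 ^ d3
    return (a << 2) | (d1 << 1) | d2, a ^ d1 ^ d3

def max_log_map(Ls, Lp, La):
    """A posteriori LLRs via eqs. (10)-(12); positive LLR favors bit 1."""
    K, S = len(Ls), 8
    nxt = np.zeros((S, 2), dtype=int)
    par = np.zeros((S, 2), dtype=int)
    for s in range(S):
        for u in (0, 1):
            nxt[s, u], par[s, u] = rsc_step(s, u)
    # branch metrics, eq. (5) in the log domain, with bits mapped to +/-1
    gamma = np.zeros((K, S, 2))
    for k in range(K):
        for s in range(S):
            for u in (0, 1):
                xu, xv = 2 * u - 1, 2 * par[s, u] - 1
                gamma[k, s, u] = 0.5 * xu * (La[k] + Ls[k]) + 0.5 * xv * Lp[k]
    # forward recursion (10), starting from state 0
    alpha = np.full((K + 1, S), NEG)
    alpha[0, 0] = 0.0
    for k in range(K):
        for s in range(S):
            for u in (0, 1):
                ns = nxt[s, u]
                alpha[k + 1, ns] = max(alpha[k + 1, ns],
                                       alpha[k, s] + gamma[k, s, u])
    # backward recursion (11), uniform initialization (no termination)
    beta = np.zeros((K + 1, S))
    for k in range(K - 1, 0, -1):
        for s in range(S):
            beta[k, s] = max(gamma[k, s, u] + beta[k + 1, nxt[s, u]]
                             for u in (0, 1))
    # LLR combining, eq. (12)
    llr = np.zeros(K)
    for k in range(K):
        best = [NEG, NEG]
        for s in range(S):
            for u in (0, 1):
                m = alpha[k, s] + gamma[k, s, u] + beta[k + 1, nxt[s, u]]
                best[u] = max(best[u], m)
        llr[k] = best[1] - best[0]
    return llr

# noise-free sanity check: encode a short block, then decode it
bits = [1, 0, 1, 1, 0, 0, 1]
state, parity = 0, []
for b in bits:
    state, v = rsc_step(state, b)
    parity.append(v)
Ls = 4.0 * (2 * np.array(bits) - 1)     # confident channel LLRs
Lp = 4.0 * (2 * np.array(parity) - 1)
llr = max_log_map(Ls, Lp, np.zeros(len(bits)))
```

With noise-free LLRs, the sign of each output LLR recovers the transmitted bit.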
In Section II-C, we provide an alternative graphical representation, called the neural max-log-MAP algorithm, to replace the traditional max-log-MAP algorithm.
II-C. TurboNet Architecture
The traditional iterative structure is unfolded, and each iteration is represented by a DNN decoding unit, yielding the "unrolled" structure shown in Fig. 1; a TurboNet with M decoding units is equivalent to M full iterations. For the m-th unit, 1 ≤ m ≤ M, the input is the a priori LLR produced by the preceding unit, and the output is the a posteriori LLR corresponding to the max-log-MAP algorithm with m iterations. The structure of the DNN decoding unit in Fig. 1 is shown in Fig. 2.
Fig. 3 shows that subnet 1, which is based on the neural max-log-MAP algorithm, consists of an input layer, several hidden layers, and an output layer. Subnet 2 has the same structure as subnet 1. The details of the subnet architecture are elaborated as follows:
II-C.1 Input Layer
The input layer consists of the neurons whose outputs are the systematic LLRs L(y_k^s), the parity LLRs L(y_k^{p_1}), and the a priori LLRs L^a(u_k), where 1 ≤ k ≤ K.
II-C.2 Hidden Layer 1
The first hidden layer computes the branch metrics: the output of all neurons constitutes the set {\bar{\gamma}_k(s', s)}, where (s', s) ranges over the ordered pairs in S_1 ∪ S_0, i.e., over all state transitions caused by the data input u_k. The neuron corresponding to \bar{\gamma}_k(s', s) in this layer is connected to the neurons corresponding to L(y_k^s), L(y_k^{p_1}), and L^a(u_k) in the input layer, where 1 ≤ k ≤ K.
II-C.3 Hidden Layers 2 to K
Each of the following hidden layers contains 16 neurons, which implement one step of the forward recursion (10) and one step of the backward recursion (11). For the k-th hidden layer, the output of all neurons constitutes the set {\bar{\alpha}_{k-1}(s)} ∪ {\bar{\beta}_{K-k+1}(s)}, where {\bar{\alpha}_{k-1}(s)} is the set of neuron outputs at the odd positions of the k-th hidden layer, {\bar{\beta}_{K-k+1}(s)} is the set of neuron outputs at the even positions of the k-th hidden layer, and 2 ≤ k ≤ K. The neuron corresponding to \bar{\alpha}_{k-1}(s) in the k-th layer is connected to all neurons corresponding to elements of {\bar{\alpha}_{k-2}(s')} in layer k − 1 and to all neurons corresponding to elements of {\bar{\gamma}_{k-1}(s', s)} in the first hidden layer; the neuron corresponding to \bar{\beta}_{K-k+1}(s') in the k-th layer is connected to all neurons corresponding to elements of {\bar{\beta}_{K-k+2}(s)} in layer k − 1 and to all neurons corresponding to elements of {\bar{\gamma}_{K-k+2}(s', s)} in the first hidden layer.

II-C.4 Last Hidden Layer
The last hidden layer consists of K neurons, and the output of all neurons constitutes the set {L(u_k)}. The neuron corresponding to L(u_k) in the last hidden layer is connected to all neurons corresponding to elements of the sets {\bar{\alpha}_{k-1}(s')}, {\bar{\gamma}_k(s', s)}, and {\bar{\beta}_k(s)} involved in (12), where 1 ≤ k ≤ K.
II-C.5 Output Layer
The output layer consists of K neurons, and the output of all neurons constitutes the set {L_e(u_k)}. In accordance with (6), the neuron corresponding to L_e(u_k) is connected to the neuron corresponding to L(u_k) in the last hidden layer and to the neurons corresponding to L(y_k^s) and L^a(u_k) in the input layer.
We assign weights to part of the edges in Fig. 3. These weights will be trained using the stochastic gradient descent (SGD) algorithm. Therefore, we calculate \bar{\gamma}_k(s', s), L(u_k), and L_e(u_k) as follows:

\bar{\gamma}_k(s', s) = \tfrac{1}{2} u_k \left( w_k^{(1)} L^a(u_k) + w_k^{(2)} L(y_k^s) \right) + \tfrac{1}{2} w_k^{(3)} v_k L(y_k^{p_1})    (13)

L(u_k) = \max_{(s',s) \in S_1} \left( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \right) - \max_{(s',s) \in S_0} \left( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \right)    (14)

and

L_e(u_k) = L(u_k) - w_k^{(4)} L(y_k^s) - w_k^{(5)} L^a(u_k)    (15)
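The role of the weights can be sketched as follows; the factorization here (one weight per LLR term in the branch metric, plus weights on the subtracted terms of the extrinsic LLR) is our illustrative assumption, and with all weights equal to 1 the expressions reduce to the standard log-domain branch metric and eq. (6):

```python
def gamma_weighted(u, v, La, Ls, Lp, w1=1.0, w2=1.0, w3=1.0):
    """Log-domain branch metric with trainable weights (hypothetical
    placement: one weight per LLR term); u, v are the systematic and
    parity bits mapped to +/-1."""
    return 0.5 * u * (w1 * La + w2 * Ls) + 0.5 * w3 * v * Lp

def extrinsic_weighted(L_post, Ls, La, w4=1.0, w5=1.0):
    """Extrinsic LLR with trainable weights on the subtracted terms;
    reduces to eq. (6) when w4 = w5 = 1."""
    return L_post - w4 * Ls - w5 * La

# with all weights at their default of 1, the standard quantities are recovered
g = gamma_weighted(u=+1, v=-1, La=0.5, Ls=2.0, Lp=-1.0)   # 1.25 + 0.5 = 1.75
e = extrinsic_weighted(L_post=3.0, Ls=1.0, La=0.5)        # 3 - 1 - 0.5 = 1.5
```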
Turbo codes usually have a large block size. For example, the minimum message length of a turbo code in the Long-Term Evolution (LTE) standard is 40 bits, and the maximum is 6144 bits. Parameterizing (10) and (11) would therefore make the neural network in Fig. 3 extremely "deep," which may lead to vanishing or exploding gradients. For this reason, we do not introduce any trainable parameters into the recursions (10) and (11).
Given that the output of the M-th DNN decoding unit is the a posteriori LLR L^{(M)}(u_k), the sigmoid function \sigma(x) = 1/(1 + e^{-x}) can be added such that the final network output o_k = \sigma(L^{(M)}(u_k)) is in the range (0, 1). Generally, the mean square error or the binary cross-entropy between o_k and u_k could be used to calculate the network loss, but we avoid both for the following two reasons:
The magnitude of the a posteriori LLR calculated by the traditional max-log-MAP algorithm is usually greater than 10, whereas the sigmoid function is already saturated near 1 or 0 when the magnitude of its input exceeds 10. Therefore, vanishing gradients are likely to occur if the loss is calculated with o_k;

The loss of the network mainly comes from a few erroneous bits. Hence, the loss becomes extremely small, thereby making the entire network difficult to train.
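The first reason is easy to verify numerically; for inputs of magnitude 10 the sigmoid is already saturated and its derivative is negligible:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

s = sigmoid(10.0)        # about 0.99995: already flat
g = s * (1.0 - s)        # derivative of the sigmoid: nearly zero
```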
A redefined loss function, computed as (16), is therefore used to evaluate the loss of TurboNet:

\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \left( L^{(M)}(u_k) - L^{t}(u_k) \right)^2    (16)
where L^{(M)}(u_k) represents the a posteriori LLR obtained by a TurboNet consisting of M decoding units, and L^{t}(u_k) represents the target a posteriori LLR calculated by the traditional log-MAP algorithm with a given number of iterations.
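Reading (16) as the mean squared error between the two sets of LLRs, a minimal sketch is:

```python
import numpy as np

def turbonet_loss(llr_net, llr_target):
    """Redefined loss of eq. (16): mean squared error between TurboNet's
    a posteriori LLRs and the log-MAP teacher LLRs (our reading of the
    text; the exact normalization is an assumption)."""
    d = np.asarray(llr_net, dtype=float) - np.asarray(llr_target, dtype=float)
    return float(np.mean(d ** 2))
```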
The goal is to make the loss of the network as small as possible by training the weights. The final decoding results are then obtained by a hard decision:

\hat{u}_k = \frac{1}{2} \left( 1 + \operatorname{sign}\left( L^{(M)}(u_k) \right) \right)    (17)
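Under the sign convention that a positive LLR favors bit 1, the hard decision is a one-liner:

```python
import numpy as np

def hard_decision(llr):
    """Decide bit 1 when the a posteriori LLR is positive, bit 0
    otherwise (positive LLR favors bit 1 in this convention)."""
    return (np.asarray(llr) > 0).astype(int)
```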
By setting all weights to 1, the results of (13)–(15) are identical to those of the original max-log-MAP algorithm. Hence, the performance of the neural max-log-MAP algorithm cannot be inferior to that of the max-log-MAP algorithm after the network parameters are trained. Moreover, the complexity of TurboNet is similar to that of a turbo decoder using the max-log-MAP algorithm.
III. Simulation Results
III-A. Parameter Settings
TurboNet was constructed on top of the TensorFlow framework, and an NVIDIA GeForce GTX 1080 Ti GPU was used to accelerate training. We trained TurboNet for the (40, 132) and (40, 92) turbo codes on randomly generated training data obtained over an AWGN channel at an SNR of 0 dB. TurboNet was composed of three DNN decoding units, corresponding to three full iterations. The loss function (16) was used with the target LLRs generated by the log-MAP algorithm with six iterations. We trained TurboNet with the ADAM optimizer [13], a variant of SGD, with a batch size of 500. The learning rate was .

III-B. Performance Analysis
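The ADAM update [13] used for training can be sketched in a few lines of NumPy; the toy quadratic objective below merely stands in for the decoder loss, and the learning rate here is an arbitrary choice for the demo, not the paper's setting:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# minimize f(w) = (w - 3)^2 as a stand-in for the decoder loss
w, m, v = 10.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, t, lr=0.05)
```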
The bit error rate (BER) performance curves obtained using the log-MAP algorithm, the max-log-MAP algorithm, and TurboNet are shown in Figs. 4 and 5.
III-B.1 BER Performance
Fig. 4 indicates that the BER of the TurboNet consisting of three decoding units is lower than those of the max-log-MAP algorithm and the log-MAP algorithm with three iterations over the entire SNR range. Notably, TurboNet also outperforms the max-log-MAP algorithm with five iterations in almost all cases. Fig. 5 shows that the TurboNet containing three decoding units outperforms the max-log-MAP algorithm with three iterations over the entire SNR range. The BER performance of TurboNet is comparable to those of the log-MAP algorithm with three iterations and the max-log-MAP algorithm with five iterations under most circumstances. These results suggest that TurboNet still works when handling punctured turbo codes with a high code rate.
Here, we discuss the SNR of the training data and the iteration number of the log-MAP algorithm in (16), both of which are closely related to the training of TurboNet.

The training data and the test data should have a similar distribution; thus, one might use the same SNR for testing and training. This is restrictive because the precise SNR may not be available in practice. Moreover, TurboNet is equivalent to the traditional max-log-MAP algorithm when all weights are set to 1, so TurboNet cannot learn to handle noise when the training SNR is too high and few errors occur. However, if the SNR is set too low, the max-log-MAP algorithm has poor error-correction capability, and TurboNet cannot learn effectively. We therefore deliberately chose a moderately low SNR so that TurboNet could learn more robust error correction, keeping the SNR of the training data at a single value of 0 dB, which helps TurboNet learn to correct errors as much as possible.
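For reference, generating training data at a target SNR reduces to choosing the AWGN standard deviation. Assuming the SNR denotes Eb/N0 for unit-energy BPSK at code rate R (whether the paper uses this exact convention is an assumption), one common choice is:

```python
import numpy as np

def noise_std(snr_db, rate=1.0):
    """AWGN standard deviation for unit-energy BPSK at a given Eb/N0 (dB):
    sigma^2 = 1 / (2 * rate * 10**(snr_db / 10))."""
    return float(np.sqrt(1.0 / (2.0 * rate * 10 ** (snr_db / 10.0))))

sigma_train = noise_std(0.0)   # 0 dB, rate 1: sigma is about 0.7071
```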

Notably, the target LLR values in (16) are generated by the log-MAP algorithm with a fixed iteration number. If the iteration number is large, then TurboNet can learn more accurate information. However, it should not be too large because TurboNet contains only three decoding units; if the iteration number is too large, a large gap will exist between TurboNet and the log-MAP algorithm, thereby decreasing the generalization capability of TurboNet. Therefore, we set the iteration number to six, which is exactly twice the number of decoding units.
The improvement in BER is achieved by properly configuring the weights such that the omitted logarithmic term of the Jacobian logarithm is appropriately compensated. In addition, the LLRs are related to the channel conditions; thus, part of the channel information might be learned by TurboNet, allowing these LLRs to be used more precisely.
TABLE II: Complexity comparison

Decoder                        # of parameters    Time
Max-Log-MAP (5 iterations)     --                 2.3e-4 s
Neural BCJR in [3] (3 units)   3.85M              5.89e-3 s
TurboNet (3 decoding units)    17.8K              1.39e-4 s
III-B.2 Computational Complexity
Table II compares the complexity of the decoders in terms of the number of parameters and the time required to complete a single forward pass of one codeword. TurboNet has a lower computational cost and exhibits a relatively faster computation speed with considerably fewer parameters than the data-driven neural BCJR decoder [3], in which each SISO decoder is replaced by two bidirectional LSTM layers with 800 hidden units per layer. In addition, TurboNet shows lower latency than the max-log-MAP algorithm with five iterations.
IV. Conclusion
In this work, we demonstrated the benefits of the proposed TurboNet decoder architecture over the traditional turbo decoder based on the max-log-MAP algorithm. In TurboNet, the original iterative structure is unfolded, and each iteration is represented by a DNN decoding unit. We obtain a neural max-log-MAP algorithm by assigning trainable weights to the max-log-MAP algorithm. Moreover, a redefined loss function is used to improve the training process. The BER performance of TurboNet is improved over the max-log-MAP algorithm without increasing the computational complexity. The error-correction capability of the proposed TurboNet can be further improved by applying advanced DL techniques, and we hope this letter encourages future research in this direction.
References
[1] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, "On deep learning-based channel decoding," in Proc. IEEE 51st Annu. Conf. Inf. Sci. Syst., Mar. 2017, pp. 1-6.
[2] X.-A. Wang and S. B. Wicker, "An artificial neural net Viterbi decoder," IEEE Trans. Commun., vol. 44, no. 2, pp. 165-171, Feb. 1996.
[3] H. Kim, Y. Jiang, R. B. Rana, S. Kannan, S. Oh, and P. Viswanath, "Communication algorithms via deep learning," arXiv preprint arXiv:1805.09317, 2018.
[4] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in Proc. IEEE Annu. Allerton Conf. Commun., Control, Comput., 2016, pp. 341-346.
[5] L. Lugosch and W. J. Gross, "Neural offset min-sum decoding," in Proc. IEEE Int. Symp. Inf. Theory, Jun. 2017, pp. 1361-1365.
[6] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be'ery, "Deep learning methods for improved decoding of linear codes," IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 119-131, Feb. 2018.
[7] 3rd Generation Partnership Project; Technical Specification; Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and Channel Coding (Release 9), 3GPP TS 36.212, Rev. 8.3.0, May 2008.
[8] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proc. Int. Conf. Commun., May 1993, pp. 1064-1070.
[9] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284-287, Mar. 1974.
[10] S. Talakoub, L. Sabeti, B. Shahrrava, and M. Ahmadi, "An improved max-log-MAP algorithm for turbo decoding and turbo equalization," IEEE Trans. Instrum. Meas., vol. 56, no. 3, pp. 1058-1063, Jun. 2007.
[11] P. Robertson, P. Hoeher, and E. Villebrun, "Optimal and suboptimal maximum a posteriori algorithms suitable for turbo decoding," Eur. Trans. Telecommun., vol. 8, no. 2, pp. 119-125, 1997.
[12] J. A. Erfanian, S. Pasupathy, and G. Gulak, "Reduced complexity symbol detectors with parallel structures for ISI channels," IEEE Trans. Commun., vol. 42, no. 2-4, pp. 1661-1671, Feb.-Apr. 1994.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.