1 Introduction
Reliable transmission over noisy channels has been an active research area for decades. Channel coding is the main tool for achieving reliable transmission: it maps the input data to higher-dimensional representations. In his seminal work [1], Shannon proved the existence of capacity-achieving sequences of codes by randomly constructing an ensemble and investigating the conditions under which reliable communication is feasible. Designing channel codes that approach or achieve the channel capacity has been an elusive goal ever since. Landmark codes designed thus far include turbo, LDPC and polar codes [2]–[4].
Traditionally, an $(N, K)$ channel code is designed by designing an encoder that maps a set of binary message words of length $K$ to a set of codewords of length $N$ for transmission over the channel. Typically, mathematical analysis is used to tailor the encoder and decoder to one another. For instance, under the maximum a posteriori (MAP) decoder, which minimizes the block error rate (BLER), the encoder is designed such that the pairwise distance properties of the code are optimized. MAP decoding is rarely used except when the code is very short or can be described via a trellis diagram of rather small size, e.g. convolutional codes. It is also possible to design the decoder first. Polar code design typically follows this approach, where the encoder design, suitably defined in [4], is carried out to optimize the performance under a successive cancellation (SC) decoder. Details aside, almost all classical code design approaches rely heavily on an information-theoretically well-defined channel model, which in most cases is the additive white Gaussian noise (AWGN) channel, and employ mathematical analysis as an essential tool. More importantly, progress in code design has thus far been sporadic and has relied heavily on human ingenuity.
There has been growing interest in automating the design of encoders and decoders using a deep learning framework. A deep-learning-based framework allows for the design of an encoder and decoder for channels that cannot be described by a well-defined model, or that can be described but whose model is too complex for code design. Although the ultimate goal of deep-learning-based design is envisioned to be arbitrary channels, a first step towards this end can be designing codes which can compete with the state-of-the-art classical channel codes over the AWGN channel. The code design can essentially be applied to any channel, provided that sufficient transmissions over the channel are performed so as to collect a sufficiently large training set.
Deep learning has been employed to design decoders for classical encoders [5]–[9]. It has also been used to design both the encoder and the decoder based on autoencoders (AEs). AEs are powerful deep learning frameworks with a wide variety of applications. They fall into two categories: undercomplete and overcomplete AEs. An undercomplete AE is used to learn a latent representation of the input data by transforming it to a smaller latent space. Undercomplete AEs are used for numerous tasks including denoising, generative models and representation learning [10]–[14]. Overcomplete AEs, on the other hand, add redundancy to the input data so as to transform it to a higher dimension. One of the main applications of overcomplete AEs is that the higher-dimensional representation can be transmitted over a noisy channel, allowing the receiver to reliably decode the input data [15]–[21]. In particular, the authors in [21] used a convolutional neural network (CNN) and a recurrent neural network (RNN) to mimic the architecture of the classical turbo encoder and decoder. The proposed TurboAE is reported to have performance competitive with the state-of-the-art classical codes while being trainable for an arbitrary channel model.
This paper is a further attempt to improve the design of AEs for reliable communication over noisy channels and is based on the concept of list decoding in classical coding theory. Our contributions are:

We present the list autoencoder (listAE) as a general deep learning framework applicable to any AE architecture where, unlike current AEs, the decoder network outputs a list of decoded message word candidates.

We provide specific loss functions which operate on the output list. The loss function aims to minimize the BLER of a genie-aided (GA) decoder: we assume a genie is available at the decoder output and, whenever the transmitted message word is present in the list, it tells us which candidate it is. In other words, with the GA decoder, a block error event is counted if and only if the transmitted word is not present in the output list.

Since a genie assumption is not realistic, during the testing phase we propose to emulate the functionality of a genie by using classical error detection codes. In particular, a cyclic redundancy check (CRC) code is appended to the message word prior to encoding by the encoder network. At the decoder, a CRC check is carried out for each output candidate to select a single candidate as the final output of the decoder.

listAE is a framework which can be applied to any AE architecture. We propose a specific architecture that decodes the received word on a sequence of component codes with decreasing rates. The architecture, referred to as incremental redundancy listAE (IRlistAE), shows meaningful coding gain over the state-of-the-art while providing error correction and detection capability simultaneously.
2 Problem definition
The problem of reliable transmission over a noisy channel can be defined as follows. As can be seen in Fig. 1, a message word of $K$ bits is formed as $\mathbf{u} = (u_1, \dots, u_K)$, where the $u_i$ take binary values from $\{0, 1\}$. The message word is encoded using an encoder neural network with an encoding function $f_\theta$ to obtain the codeword $\mathbf{x} = f_\theta(\mathbf{u}) = (x_1, \dots, x_N)$, where $\theta$ denotes the weights of the encoder neural network and $N$ denotes the code length. A power normalization block is applied to $\mathbf{x}$ to give a codeword $\mathbf{c} = (c_1, \dots, c_N)$ with zero-mean and unit-variance code symbols, i.e. $E[c_i] = 0$ and $E[c_i^2] = 1$ for $i = 1, \dots, N$. (Footnote 1: The power normalization can also be carried out per codeword or per batch of codewords. For details, see the power normalization section.) The codeword is transmitted over the channel. The channel takes the codeword $\mathbf{c}$ as input and outputs a noisy version $\mathbf{y} = (y_1, \dots, y_N)$, where the $y_i$ take real values. As mentioned before, having an information-theoretically defined channel model is not necessary, but if there is such a model, it is typically defined as a vector channel with transition probability density function (pdf) $p(\mathbf{y} \mid \mathbf{c})$. A widely used channel for code design is the additive white Gaussian noise (AWGN) channel, for which the output is $y_i = c_i + n_i$, where $n_i$ is a Gaussian random variable with zero mean and variance $\sigma^2$. For the AWGN channel, $p(\mathbf{y} \mid \mathbf{c}) = \prod_{i=1}^{N} p(y_i \mid c_i)$, where $p(y_i \mid c_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - c_i)^2}{2\sigma^2}\right)$. The decoder network receives the channel output vector $\mathbf{y}$ and applies a decoding function $g_\phi$ to give the decoded message word $\hat{\mathbf{u}} = g_\phi(\mathbf{y})$, where $\phi$ denotes the weights of the decoder neural network. The encoder and decoder networks together form an AE. The error correction performance is measured in terms of bit error rate (BER) and block error rate (BLER), defined as $\mathrm{BER} = \frac{1}{K} \sum_{i=1}^{K} \Pr(\hat{u}_i \neq u_i)$ and $\mathrm{BLER} = \Pr(\hat{\mathbf{u}} \neq \mathbf{u})$, respectively. The performance depends on the encoding and decoding functions as well as the amount of impairment added by the channel, which for AWGN is measured in terms of signal-to-noise ratio (SNR). The goal is to minimize the BLER or BER of the AE for different levels of impairment, e.g. over a wide range of SNR.
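The channel and the two error metrics above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function names `awgn` and `ber_bler` are hypothetical, the SNR is assumed to be defined as the inverse noise variance for unit-power code symbols, and a simple antipodal mapping stands in for the learned, power-normalized encoder output.

```python
import numpy as np

def awgn(c, snr_db, rng):
    """Add white Gaussian noise to unit-power code symbols.

    Assumption: SNR = 1 / sigma^2, hence sigma = 10^(-snr_db / 20).
    """
    sigma = 10.0 ** (-snr_db / 20.0)
    return c + sigma * rng.standard_normal(c.shape)

def ber_bler(u, u_hat):
    """Empirical BER and BLER for (batch, K) arrays of bits."""
    bit_errors = u != u_hat
    ber = bit_errors.mean()               # fraction of wrong bits
    bler = bit_errors.any(axis=1).mean()  # fraction of words with at least one wrong bit
    return ber, bler

rng = np.random.default_rng(0)
u = rng.integers(0, 2, size=(4, 8))       # 4 message words of 8 bits each

# Stand-in for a learned codeword: antipodal symbols have zero mean
# (on average) and unit power.
c = 2.0 * u - 1.0
y = awgn(c, snr_db=2.0, rng=rng)

# Emulate a decoder output with known error positions.
u_hat = u.copy()
u_hat[0, 0] ^= 1      # one bit error in the first word
u_hat[2, 3:5] ^= 1    # two bit errors in the third word
ber, bler = ber_bler(u, u_hat)
```

Here three of the 32 bits and two of the four words are in error, which is what the two returned averages measure.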
3 ListAE
Although designing new AE architectures can be chosen as a goal to improve the error correction performance, we choose to tackle the problem along a different dimension. We conjecture that it may be too difficult for the decoder network to reliably decide, with only one guess, which message word has been transmitted. Therefore, we propose a framework which allows the decoder network to output a list of candidates. A genie is assumed to be available at the decoder output to identify the transmitted message word if it is in the list. Such an AE framework is referred to as genie-aided (GA) listAE. In the next section we present the detailed design of a GAlistAE. Later we relax the assumption of having a genie and show how to realize a practical genie via classical error detection codes, e.g. cyclic redundancy check (CRC). It is noteworthy that the concept of list decoding is well studied in classical coding theory, and we have borrowed a great deal from that field. For example, successive cancellation list (SCL) decoding of polar codes and its variants have been well studied theoretically and also implemented for practical wireless communication systems [22], [23].
3.1 Description of GA listAE
A listAE is defined as any AE that outputs a list of candidates. Figure 2 shows a general listAE with a list size $L$. A conventional AE is a special case of listAE with a list size of $L = 1$. As can be seen, the only difference from a conventional AE is that the output is a list of $L$ candidates, which we denote by $\hat{\mathbf{U}}$, with rows $\hat{\mathbf{u}}_1, \dots, \hat{\mathbf{u}}_L$.
Since, during the testing phase, the decoder must output a single candidate $\hat{\mathbf{u}}$, there must be a selection process where a single candidate is chosen from the list. A GA decoder outputs the transmitted message word $\mathbf{u}$ if $\mathbf{u}$ is equal to one of the rows of $\hat{\mathbf{U}}$; otherwise it outputs a randomly chosen row of $\hat{\mathbf{U}}$. In other words,
$$\hat{\mathbf{u}} = \begin{cases} \mathbf{u}, & \text{if } \mathbf{u} = \hat{\mathbf{u}}_l \text{ for some } l \in \{1, \dots, L\}, \\ \hat{\mathbf{u}}_r, & \text{otherwise,} \end{cases} \tag{1}$$
where $r$ is a random number chosen uniformly from $1$ to $L$. During the training phase, each element of the vectors in the output list is made to take a real value between zero and one, e.g. by passing it through a Sigmoid activation, while in the testing phase the outputs are rounded to the nearest integer to give binary values so that the BER and BLER can be calculated between $\mathbf{u}$ and $\hat{\mathbf{u}}$.
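A minimal sketch of the genie-aided selection described above, assuming the decoder output list is available as an array of Sigmoid outputs with one candidate per row. The function name `ga_select` and the array layout are assumptions made for this illustration.

```python
import numpy as np

def ga_select(u, cand_probs, rng):
    """Genie-aided selection of a single candidate from the list.

    u          : (K,) transmitted message bits
    cand_probs : (L, K) decoder outputs in [0, 1], one candidate per row
    """
    cands = np.rint(cand_probs).astype(int)          # hard decision per bit
    hits = np.flatnonzero((cands == u).all(axis=1))  # rows equal to u
    if hits.size > 0:
        return cands[hits[0]]                        # the genie picks the correct row
    return cands[rng.integers(cands.shape[0])]       # otherwise a uniformly random row

rng = np.random.default_rng(1)
u = np.array([1, 0, 1])

# The second candidate rounds to the transmitted word, so no block error occurs.
in_list = np.array([[0.9, 0.8, 0.1],
                    [0.7, 0.2, 0.6]])
out_hit = ga_select(u, in_list, rng)

# Neither candidate matches, so a random row is returned and a block error is counted.
not_in_list = np.array([[0.1, 0.1, 0.1],
                        [0.9, 0.9, 0.9]])
out_miss = ga_select(u, not_in_list, rng)
```

With this rule, the GA BLER is simply the fraction of transmissions for which the transmitted word is absent from the rounded list.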
3.2 Loss functions for GA listAE
For a conventional AE, a loss function which minimizes a distance measure between the input and output is used to optimize the BER or BLER. There are a number of distance measures, such as mean squared error (MSE) and binary cross entropy (BCE), which are more suitable for BER optimization. Although BER optimization indirectly optimizes the BLER, finding a BLER-specific loss function with reasonably efficient training complexity remains an open problem.
With GA listAE, the performance metric to optimize is neither BER nor BLER. It is the GA BLER calculated between the transmitted message word $\mathbf{u}$ and the GA decoder output $\hat{\mathbf{u}}$ given by (1). The challenge in defining a loss function tailored to the GA listAE lies in how to mathematically model the genie operation. One may think of the genie operation as a processing block which takes the list of candidates as well as the transmitted message word and outputs a single candidate depending on the presence of the message word in the list. The condition for checking this presence involves rounding the candidate message words in the list to take binary values and then comparing them to the transmitted word. This operation a) introduces zero derivatives in the back propagation, and b) additionally complicates it due to the comparisons. It is not clear how the comparison can be handled in the back propagation. To tackle this problem, we propose a modified loss function that reflects how "close" the output list is to the message word without involving the precise genie operation. The loss function should take small values when the message word is close to any candidate in the list. One such loss function can be based on the minimum, over the candidates, of a loss metric between the transmitted message word and each candidate. In particular, with $\hat{\mathbf{u}}_1, \dots, \hat{\mathbf{u}}_L$ denoting the $L$ candidates in the output list $\hat{\mathbf{U}}$, the loss function is as follows.
$$\mathcal{L}(\mathbf{u}, \hat{\mathbf{U}}) = \min_{1 \le l \le L} \mathrm{BCE}(\mathbf{u}, \hat{\mathbf{u}}_l), \tag{2}$$
where $\mathrm{BCE}(\mathbf{a}, \mathbf{b})$ is the average BCE loss function, which takes two vectors $\mathbf{a}$ and $\mathbf{b}$ of length $K$:
$$\mathrm{BCE}(\mathbf{a}, \mathbf{b}) = -\frac{1}{K} \sum_{i=1}^{K} \left[ a_i \log b_i + (1 - a_i) \log (1 - b_i) \right]. \tag{3}$$
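The minimum-over-candidates loss described above can be sketched in NumPy as follows. This is an illustration only: the names `bce` and `list_min_bce`, the batch layout, and the clipping constant `eps` are assumptions introduced here, and a real training loop would use an automatic-differentiation framework instead of plain NumPy.

```python
import numpy as np

def bce(u, p, eps=1e-12):
    """Average binary cross entropy along the last axis."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0); eps is an illustrative choice
    return -(u * np.log(p) + (1.0 - u) * np.log(1.0 - p)).mean(axis=-1)

def list_min_bce(u, cand_probs):
    """Minimum BCE over the list, averaged over the batch.

    u          : (B, K) transmitted message words
    cand_probs : (B, L, K) Sigmoid outputs, L candidates per word
    """
    per_candidate = bce(u[:, None, :], cand_probs)  # (B, L)
    return per_candidate.min(axis=1).mean()

u = np.array([[1.0, 0.0]])
cand_probs = np.array([[[0.9, 0.1],    # close to u, small BCE
                        [0.5, 0.5]]])  # uninformative, larger BCE
loss = list_min_bce(u, cand_probs)
```

Because of the minimum, only the candidate closest to the transmitted word contributes to the loss value for a given word; under automatic differentiation, the gradient therefore flows through that candidate alone.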
As a side note, one may conjecture that due to suboptimality of the loss function, not only should the loss function attempt to ensure the existence of a candidate with small loss value, but also it should ensure that other candidates in the list also have small loss values. The following two loss functions attempt to output a list with smallest average loss and a small maximum loss, respectively.
where is a positive design parameter.
Our experiments reveal that the minimum-based loss function in (2) results in the best performance. It seems that if an additional restriction is put on the decoder network, it may compromise its capability to output the best list from the genie's point of view. In other words, it decreases the likelihood of the transmitted message word being present in the list as the price of decreasing the loss for the other candidates. Gaining a deeper understanding of how the loss function can be optimized under GA decoding remains open in our view. Since our best results are achieved with the loss function in (2), we focus on this loss function for the remainder of this paper.
3.3 How to realize the genie
In practice, a genie is not available at the output of the decoder. In the absence of a genie, and without any further constraints, it is impossible to tell whether the transmitted word is in the list. The reason is that there is no correlation between the message bits, hence every candidate is a valid message word. To realize a genie, we can introduce intentional correlation between the message bits, e.g. through linear equations over the binary field, and check at the decoder side whether the equations are satisfied. If they are not satisfied for a list candidate, the candidate cannot be the transmitted word. Such linear equations are typically implemented using CRC codes in classical coding theory.
With a CRC generated by a given generator polynomial, a word of message bits is generated and passed to the CRC calculator to generate the CRC bits. The CRC bits are appended to the end of the message word to form the encoder input vector. At the decoder side, each candidate in the list is checked against the CRC equations. Among the candidates which pass the CRC, one is randomly chosen as the final output of the decoder. We refer to this type of listAE as CRC-aided listAE (CAlistAE). The final decoded word is given as
(4) 
where is a random number chosen uniformly from to .
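The CRC-aided selection can be sketched as follows. Several choices here are assumptions made for illustration only: the 3-bit generator polynomial $x^3 + x + 1$ is a toy stand-in and not the polynomial used in the experiments, the helper names `crc_bits`, `passes_crc` and `ca_select` are hypothetical, and falling back to a uniformly random candidate when no candidate passes the check is an assumed tie-breaking rule.

```python
import numpy as np

def crc_bits(msg, poly):
    """Remainder of msg(x) * x^r divided by poly(x) over GF(2).

    msg  : list of message bits, most significant first
    poly : list of r + 1 generator coefficients with a leading 1
    """
    r = len(poly) - 1
    reg = list(msg) + [0] * r
    for i in range(len(msg)):
        if reg[i]:                      # cancel the leading 1 by XOR-ing the generator
            for j, p in enumerate(poly):
                reg[i + j] ^= p
    return reg[-r:]

def passes_crc(word, poly):
    """True if the last r bits of word are the CRC of the preceding bits."""
    r = len(poly) - 1
    return crc_bits(word[:-r], poly) == list(word[-r:])

def ca_select(cands, poly, rng):
    """Pick a random candidate among those passing the CRC check.

    cands : (L, K) binary array, one hard-decision candidate per row.
    Falls back to a random row if no candidate passes (assumed rule).
    """
    ok = [i for i, c in enumerate(cands) if passes_crc([int(b) for b in c], poly)]
    pool = ok if ok else list(range(len(cands)))
    return cands[rng.choice(pool)]

poly = [1, 0, 1, 1]                     # toy generator x^3 + x + 1 (illustrative only)
msg = [1, 1, 0, 1]
good = msg + crc_bits(msg, poly)        # encoder input: message bits followed by CRC bits

bad_first = good.copy(); bad_first[0] ^= 1    # single-bit error in the message part
bad_last = good.copy(); bad_last[-1] ^= 1     # single-bit error in the CRC part

rng = np.random.default_rng(2)
cands = np.array([bad_first, good, bad_last])
chosen = ca_select(cands, poly, rng)
```

In this toy list only the uncorrupted word satisfies the CRC equations, so it is the one selected; with longer CRCs the chance that a wrong candidate passes the check becomes small.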
It is noteworthy that, during the training phase, the CRC bits are treated as information bits. In other words, the correlation between the bits of the encoder input is not taken into account when minimizing the loss function. The reason is similar to the one which led us to avoid the precise genie operation. The CRC check requires rounding each candidate word in the list and then performing operations in a Galois field. It is not clear how these operations can be handled effectively in the back propagation. Therefore, we employ the loss function without CRC consideration in the training phase. In other words, the CRC is only added for performance evaluation in the testing phase.
As a side note, adding CRC to the message bits scales the effective code rate by the ratio of the number of message bits to the encoder input length. Although this reduction may be negligible, especially for larger message word lengths, it would be desirable to realize the genie without any rate reduction. One possible approach could be to associate a scalar metric with each candidate in the hope that it somehow reflects an a posteriori probability (APP) metric. The final decoder output is then chosen as the candidate with the maximum metric. In the training phase, a BCE loss calculated between the transmitted message word and the candidate with the maximum metric is minimized. Successive cancellation list decoding of polar codes follows this approach and brings meaningful coding gain over non-list decoding algorithms. Unfortunately, we have not observed a similar gain with this approach for listAE. However, we think that this method is worth more investigation.
3.4 An architecture for listAE: IRlistAE
In this section we propose a specific AE architecture referred to as incremental redundancy autoencoder (IRAE). The encoder of IRAE is essentially the same as the TurboAE architecture, and the decoder, too, relies on similar information exchange between the decoding blocks [21]. Roughly speaking, a rate-$1/n$ IRAE uses $n$ encoding blocks which are applied to the message word, with applicable interleavers, and together give the codeword after proper power normalization. Like TurboAE, the IRAE decoder consists of several iterations. At each iteration, a series of serially concatenated decoding blocks take as input a certain subset of the received vectors, with applicable interleavers, and a feature matrix which is to represent the APP of the message word, and output an updated feature matrix. The same architecture is replicated in every iteration, but with independent learnable weights. If a decoding block takes a subset consisting of $m$ of the received vectors, we say that the decoding block is a rate-$1/m$ decoding block, as one may associate it with an effective encoder which outputs the corresponding $m$ vectors. An AE as described above is said to be an IRAE if, within an iteration, each decoding block has a rate which is smaller than or equal to the rate of the previous block. The motivation behind the IRAE architecture is to allow more powerful codes with smaller rates to attempt decoding the message word based on an improved APP feature matrix given by previous weaker codes with higher rates. In this paper we mainly train and evaluate the performance of a rate-$1/3$ IRAE. The detailed encoder and decoder architecture and training methodology are as follows.
3.4.1 Rate-1/3 IRlistAE
Fig. 3 shows a rate-$1/3$ IRlistAE. The encoder is identical to the rate-$1/3$ TurboAE encoder [21]. The output of the encoder is the codeword with normalized power. The interleaver takes the message word and outputs an interleaved word of the same length. The decoder consists of serially concatenated iterations. As can be seen, each iteration consists of four decoding blocks with non-increasing rates, each taking its own subset of the received vectors with applicable interleavers. An iteration also takes a feature matrix as an input and outputs a feature matrix for the next iteration. When taking a matrix as input, the interleaver or deinterleaver is applied to each column and generates a matrix of the same size. The feature matrix output by the last iteration is additionally passed through a Sigmoid function and gives the output list of message word candidates. We observe that the intermediate feature matrices can be thought of as the log-likelihood ratios (LLRs) of the message bits: as we move from the output of the first iteration to later iterations, the BER resulting from the feature matrices decreases. The interleavers are employed to mimic their role in enhancing the distance properties of the code by introducing long-term memory [21]. At the decoder side, deinterleavers are applied similarly to TurboAE and turbo codes.
3.4.2 Power normalization
The encoder output $\mathbf{x} = (x_1, \dots, x_N)$ is given to a power normalization block, giving the codeword $\mathbf{c} = (c_1, \dots, c_N)$ that meets the power constraint. Three types of power normalization are proposed in [21].

Codeword-wise normalization: The normalized codeword is obtained from $\mathbf{x}$ as $c_i = (x_i - \mu_{\mathbf{x}}) / \sigma_{\mathbf{x}}$, where $\mu_{\mathbf{x}}$ is the mean of the code symbols, i.e. $\mu_{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} x_i$, and $\sigma_{\mathbf{x}}$ is the standard deviation of the code symbols, i.e. $\sigma_{\mathbf{x}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{\mathbf{x}})^2}$. This normalization method places each codeword on an $N$-dimensional sphere with radius $\sqrt{N}$.

Code-symbol-wise normalization: With a batch size of $B$, the $i$-th code symbol of the $b$-th codeword is normalized as $c_i^{(b)} = (x_i^{(b)} - \mu_i) / \sigma_i$, where $\mu_i$ is the mean of the $i$-th code symbol over the $B$ codewords in the batch, and $\sigma_i$ is the standard deviation of the $i$-th code symbol over the $B$ codewords in the batch. This normalization method places the set of $i$-th code symbols of the $B$ codewords in the batch on a $B$-dimensional sphere with radius $\sqrt{B}$, for $i = 1, \dots, N$.

Batch-wise normalization: With a batch size of $B$ and codewords of length $N$, there are $BN$ code symbols. Each code symbol is normalized as $c_i^{(b)} = (x_i^{(b)} - \mu_{\mathrm{b}}) / \sigma_{\mathrm{b}}$, where $\mu_{\mathrm{b}}$ is the mean of the $BN$ code symbols and $\sigma_{\mathrm{b}}$ is their standard deviation. This normalization method places the set of $BN$ code symbols in the batch on a $BN$-dimensional sphere with radius $\sqrt{BN}$.
With the second and third methods, in the testing phase the means and standard deviations can be precomputed from a large batch and used directly on a single message word.
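The three normalization options differ only in the set of code symbols over which the mean and standard deviation are computed. A minimal NumPy sketch, assuming the encoder outputs for a batch are stored as an array with one codeword per row and that the population standard deviation is used (the function name `normalize` and the mode labels are hypothetical):

```python
import numpy as np

def normalize(x, mode):
    """Power-normalize a (B, N) batch of encoder outputs.

    mode = "codeword": statistics over the N symbols of each codeword
    mode = "symbol"  : statistics over the B codewords, per symbol position
    mode = "batch"   : statistics over all B * N symbols
    """
    axis = {"codeword": 1, "symbol": 0, "batch": None}[mode]
    mu = x.mean(axis=axis, keepdims=True)
    sd = x.std(axis=axis, keepdims=True)   # population standard deviation (ddof=0)
    return (x - mu) / sd

rng = np.random.default_rng(3)
B, N = 16, 12
x = 3.0 * rng.standard_normal((B, N)) + 1.0   # arbitrary un-normalized encoder output

c_cw = normalize(x, "codeword")
c_sym = normalize(x, "symbol")
c_batch = normalize(x, "batch")

# Norms of the normalized vectors, to compare with the sphere radii.
row_norms = np.linalg.norm(c_cw, axis=1)
col_norms = np.linalg.norm(c_sym, axis=0)
total_norm = np.linalg.norm(c_batch)
```

With the population standard deviation, every normalized vector of $n$ symbols has squared norm exactly $n$, which is the sphere picture used in the text. At test time, the second and third modes would reuse `mu` and `sd` values precomputed from a large batch instead of recomputing them.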
3.4.3 Training methodology and hyperparameters
Table 1: Training settings and hyperparameters of the best IRlistAE model.

| Parameter | Value |
| --- | --- |
| Loss | BCE in Eq. (2) |
| Encoder blocks | 5 layers 1D-CNN plus one linear layer, kernel size 5, CNN output channels 100 |
| First three decoding blocks | 5 layers 1D-CNN plus one linear layer, kernel size 5; the first CNN layer and the subsequent layers use different input/output channel counts |
| Fourth decoding block | 5 layers 1D-CNN plus one linear layer, kernel size 5; the first CNN layer and the subsequent layers use different input/output channel counts |
| (Learning rate, batch size) | gradually changed during training |
| Encoder and decoder training SNR | a fixed SNR for the encoder and a range of SNRs for the decoder |
| Activation function | ELU for the CNN layers and linear for the linear layers |
| Power normalization | Batch-wise |
| Interleaver | Random |
| Optimizer | Adam |
| Iterations | 6 |
| List size | 64 |
| Number of epochs | 500 |

In general, any neural model can be used in place of the encoding and decoding blocks of the IRlistAE in Fig. 3. Fully connected neural networks (FCNN), convolutional neural networks (CNN) and recurrent neural networks (RNN) are natural choices. Our experiments show that FCNN is more difficult to train and provides inferior performance to CNN. RNN models such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are also widely used in sequential modeling problems and can bring global dependency. That is, unlike with CNN, a code symbol can be a function of a larger number of message bits, which in turn may improve the Euclidean distance properties of the code for improved performance under a MAP decoder. One caveat is that, at the code lengths of interest, the trained decoder may not operate based on the distance between the codewords. Therefore, distance properties may not be a dominant factor for performance. According to our experiments, CNN models were easier to train and had superior performance to FCNN and RNN. It is noteworthy that the local dependency issue of CNN can be mitigated by using a sufficiently large kernel size or number of layers. In this paper we mainly present the results for the CNN-based IRlistAE.
Table 1 shows the details of the training settings and hyperparameters of our best IRlistAE model. The model is trained for a maximum of 500 epochs. At each epoch, we train the encoder a fixed number of times while freezing the weights of the decoder, and then train the decoder a fixed number of times while freezing the weights of the encoder. This specific scheduled training was proposed in [21] to avoid getting stuck in a local minimum. With a batch size of $B$, in each training step a set of $B$ randomly generated message words is generated and encoded by the encoder network. A set of $B$ noise vectors, each as long as a codeword, is generated and added to the codewords corresponding to the message words. The loss function is evaluated between the input message words and the output lists, and the weights of the encoder or decoder are updated accordingly. Different noise vectors in the batch may be generated with the same or different SNRs. Following the methodology in [21], a fixed SNR is used for training the encoder while a range of SNRs is used to train the decoder. For the latter, an SNR value is randomly picked from the range for each noise vector and is used to generate that noise vector. Sufficiently large batch sizes with small learning rates are needed for fine-tuning the model. To accommodate larger batch sizes, we use several small batches and accumulate the gradient of the loss function over them without updating the weights. This is equivalent to directly applying a larger batch size equal to the total number of accumulated examples; see [20] for more details.
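Two ingredients of the training procedure above lend themselves to a short numerical illustration: drawing one SNR per noise vector from a range, and accumulating gradients over small batches. The sketch below is not the training code; the names `sample_noise` and `grad_mse`, the SNR range of 0 to 4 dB, the SNR convention (inverse noise variance for unit-power symbols), and the use of a linear least-squares model in place of the neural networks are all assumptions made to keep the example self-contained.

```python
import numpy as np

def sample_noise(batch, n, snr_db_range, rng):
    """Draw one SNR per noise vector and generate the matching Gaussian noise.

    Assumption: unit-power symbols and SNR = 1 / sigma^2.
    """
    lo, hi = snr_db_range
    snr_db = rng.uniform(lo, hi, size=(batch, 1))   # one SNR per noise vector
    sigma = 10.0 ** (-snr_db / 20.0)
    return sigma * rng.standard_normal((batch, n)), sigma

def grad_mse(w, X, y):
    """Gradient of 0.5 * mean((X w - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(4)

# Decoder-style training noise: a different SNR for every vector in the batch.
noise, sigma = sample_noise(batch=8, n=30, snr_db_range=(0.0, 4.0), rng=rng)

# Gradient accumulation on a toy model: 8 small batches of 8 examples each.
X = rng.standard_normal((64, 5))
y = rng.standard_normal(64)
w = rng.standard_normal(5)

accumulated = np.zeros_like(w)
for Xb, yb in zip(np.split(X, 8), np.split(y, 8)):
    accumulated += grad_mse(w, Xb, yb)   # accumulate, no weight update yet
accumulated /= 8                         # average over the small batches

full_batch = grad_mse(w, X, y)           # gradient of one large batch of 64
```

The averaged accumulated gradient matches the large-batch gradient because the loss is a mean over independent examples and the small batches have equal size. Operations that use batch statistics, such as batch-wise power normalization, would need those statistics fixed or computed over the full set for the equivalence to hold exactly.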
4 Experiment Results
We trained IRlistAE with the parameters given in Table 1. Fig. 4 shows the resulting test loss. For each epoch, after the model is trained, the test loss is calculated for a set of newly generated examples at the encoder training SNR. As can be seen, the test loss generally decreases from a smaller list size to a larger one. This is well justified, as the set of weights of an AE with a smaller list size is a subset of that of one with a larger list size. However, this relative comparison does not hold at every epoch, which is probably because the optimizer needs to observe more data or epochs to train for a larger list size. As expected, the smallest test loss after convergence corresponds to the largest list size. It is noteworthy that certain optimization approaches can be taken to design the interleavers for improved performance. For instance, it is possible to design interleavers with uniform positional BER, which in turn may improve the overall performance. Interested readers are referred to [24] and the references therein. For the sake of simplicity, in this paper a random interleaver is generated once and used throughout the training and testing phases.
Fig. 5 shows the BER and BLER of the GAlistAE as well as those of the corresponding CAlistAE described in Section 3.3. For the CAlistAE, 16 CRC bits are generated by a generator polynomial and appended to the message bits. An interesting observation is that the CRC realizes the genie fairly well, as the performance degradation relative to GA listAE is rather negligible. One caveat is that adding CRC slightly reduces the effective code rate compared to the case of GA decoding or the TurboAE. The code rate reduction vanishes as the block length increases. Therefore, one interesting open problem is designing listAE for larger block lengths. It is also noteworthy that practical codes typically encode CRC-appended message words to allow for error detection at the decoder. The CAlistAE provides both error correction and detection capability by exploiting such an appended CRC and outputting a list of candidates.
Enhancing the performance of listAE by further optimizing the loss function, so that it reflects the probability of the correct candidate being in the output list, is an interesting future direction. Another open problem is finding more efficient ways to replace the genie without the rate reduction caused by the appended CRC.
5 Conclusion
We presented a list autoencoder framework for reliable transmission of data over noisy channels. A listAE is simply defined as an AE which outputs a list of decoded candidates rather than a single one. Thanks to a genie and a properly chosen loss function that reflects the error rate performance, the listAE showed significant improvement in BLER over the state-of-the-art. We then removed the unrealistic genie assumption and showed that similar performance can be achieved by appending CRC to the message bits, with a slight reduction in code rate.
References
[1] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[2] C. Berrou, A. Glavieux and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1,” in Proceedings of IEEE International Conference on Communications (ICC), vol. 2, pp. 1064–1070, 1993.
[3] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[4] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
[5] A. Hamalainen and J. Henriksson, “A recurrent neural decoder for convolutional codes,” in Proc. IEEE International Conference on Communications (ICC), vol. 2, pp. 1305–1309, 1999.
[6] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning based channel decoding,” in Proc. 51st Annual Conference on Information Sciences and Systems (CISS), 2017, pp. 1–6.
[7] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. Viswanath, “Communication algorithms via deep learning,” in Proc. International Conference on Learning Representations (ICLR), 2018.
[8] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in Proc. 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 341–346, 2016.
[9] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, “Scaling deep learning-based decoding of polar codes via partitioning,” in Proc. IEEE Global Communications Conference (GLOBECOM), pp. 1–6, 2017.
[10] A. Makhzani and B. Frey, “k-Sparse autoencoders,” arXiv preprint arXiv:1312.5663, 2013.
[11] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
[12] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
[13] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. of the 25th International Conference on Machine Learning, ACM, 2008, pp. 1096–1103.
[14] A. Krizhevsky and G. E. Hinton, “Using very deep autoencoders for content-based image retrieval,” in ESANN, 2011.
[15] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.
[16] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep learning in physical layer communications,” IEEE Wireless Communications, vol. 26, no. 2, pp. 93–99, 2019.
[17] H. Ye, L. Liang, and G. Y. Li, “Circular convolutional auto-encoder for channel coding,” in Proc. 20th IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2019, pp. 1–5.
[18] Y. Jiang, H. Kim, H. Asnani, S. Kannan, S. Oh, and P. Viswanath, “LEARN codes: Inventing low-latency codes via recurrent neural networks,” in Proc. IEEE International Conference on Communications (ICC), 2019, pp. 1–7.
[19] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, “OFDM-autoencoder for end-to-end learning of communications systems,” in Proc. 19th IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
[20] M. V. Jamali, H. Saber, H. Hatami and J. H. Bae, “ProductAE: Towards training larger channel codes based on neural product codes,” arXiv preprint arXiv:2110.04466, 2021.
[21] Y. Jiang, H. Kim, H. Asnani, S. Kannan, S. Oh, and P. Viswanath, “Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels,” in Proc. 33rd Conf. on Neural Information Processing Systems (NeurIPS), 2019.
[22] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2213–2226, May 2015.
[23] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE Communications Letters, vol. 16, no. 10, pp. 1668–1671, October 2012.
[24] H. Yildiz, H. Hatami, H. Saber, Y. Cheng and J. Bae, “Interleaver design and pairwise codeword distance distribution enhancement for Turbo Autoencoder,” in Proc. IEEE Global Communications Conference (GLOBECOM), 2021.