I Introduction
Channel coding has emerged as an impactful field of research for the modern information age. Since its inception in [1], powered by the mathematical insight of information theory and principles of modern engineering, several capacity-achieving codes such as Polar, Turbo and LDPC codes [2][3][4] have come close to the Shannon limit with large block lengths under Additive White Gaussian Noise (AWGN) channels. These codes have since been successfully adopted in LTE and 5G data planes [5]. As 5G is under intensive development, designing codes with features such as low latency, robustness, and adaptivity has become increasingly important.
I-A Motivation
Ultra-Reliable Low Latency Communication (URLLC) code [6] requires minimal delay constraints, thereby enabling scenarios such as vehicular communication, virtual reality, and remote surgery. When speaking of low-latency requirements, it is instructive to observe that there is an interplay of three different types of delays: processing delay, propagation delay, and structural delay. Processing and propagation delays are affected mostly by computing resources and the varying environment [7]. Low latency channel coding, as is the case in this paper, aims to improve the structural delay caused by the encoder and/or decoder. Encoder structural delay refers to the delay between receiving the information bit and sending it out by the encoder. Decoder structural delay refers to the delay between receiving the bits from the channel and decoding the corresponding bits. Traditional AWGN capacity-achieving codes such as LDPC and Turbo codes with small block lengths show poor performance for URLLC requirements [7], [8]. There has also been recent interest in developing finite-block information theory to understand bounds on the reliability of codes in the small to medium block length regime [9].
We note that latency translates directly to block length when using a block code. However, when using a convolutional code, latency is given by the decoding window length. Thus there is an inherent difference between block codes and convolutional codes when considering latency. Since the latter incorporate locality in encoding, they can also be locally decoded. While convolutional codes with small constraint lengths are not capacity achieving, it is possible that they can be optimal under the low-latency constraint. Indeed, this possibility was raised by [7], who showed that convolutional codes beat a known converse on the performance of block codes. In this work, we develop this hypothesis further and show that we can invent codes similar to convolutional codes that beat the performance of all known codes in the low-latency regime. While convolutional codes are state-of-the-art in the low latency regime, in the moderate latency regime, the Extended Bose-Chaudhuri-Hocquenghem (eBCH) code is shown to perform well [10].
In addition, low latency constrained channel coding must take channel effects into account under non-AWGN settings, as pilot bits used for accurate channel estimation increase latency [7]. This calls for incorporating both robustness and adaptivity as desired features for URLLC codes. A practical channel coding system requires heuristics to compensate for channel effects, which leads to suboptimality when there exists a model mismatch [11]. Robustness refers to the ability to perform with acceptable degradation without retraining under model mismatch. Adaptivity refers to the ability to learn to adapt to different channel models with retraining. In general, channels without clean mathematical analysis lack a theory of an optimal communication algorithm, thus relying on suboptimal heuristics [12]. In short, current channel coding schemes fail to deliver under the challenges of low latency, robustness, and adaptivity.
I-B Deep Learning inspired Channel Coding: Prior Art
In the past decade, advances in deep learning have greatly benefitted several fields in engineering such as computer vision, natural language processing as well as gaming technology [13]. This has generated recent excitement in applying deep learning methods to communication system design [16][17]. Deep learning methods have typically been successful in settings where there is a significant model-deficit, i.e., the observed data cannot be well described by a clean mathematical model. Thus, many of the initial proposals applying deep learning to communication systems have also focused on this regime, where there is a model uncertainty due to lack of, say, channel knowledge [18, 19]. In developing codes for the AWGN channel under low-latency constraints, there is no model deficit, since the channel is well-defined mathematically and quite simple to describe. However, the main challenge is that optimal codes and decoding algorithms are not known; we term this regime the algorithm-deficit. In this regime, there has been very little progress in applying deep learning to communication system design. Indeed, there is no known code beating state-of-the-art codes in canonical channels. We construct the first state-of-the-art code for the AWGN channel in this paper (in the low-latency regime). Broadly, there have been two categories of works that apply deep learning to communications: (a) designing a neural network decoder (or neural decoder in short) for a given canonical encoder such as LDPC or turbo codes; (b) jointly designing both the neural network encoder and decoder, referred to as Channel Autoencoder (Channel AE) design [16] (as illustrated in Figure 1).
Neural decoders show promising performance by mimicking and modifying existing optimal decoding algorithms. Learnable Belief Propagation (BP) decoders for BCH and High-Density Parity-Check (HDPC) codes have been proposed in [21] and [22]. Polar decoding via neural BP is proposed in [23] and [24]. As mimicking learnable Tanner graphs requires a fully connected neural network, generalizing to longer block lengths is prohibitive. Capacity-achieving performance for Turbo codes under the AWGN channel is achieved via Recurrent Neural Networks (RNN) for arbitrary block lengths [25]. The joint design of neural codes (encoders) and decoders via Channel Autoencoder (AE), which is relevant to the problem under consideration in this paper, has witnessed scant progress. Deep autoencoders have been successfully applied to various problems such as dimensionality reduction, representation learning, and graph generation [15]. However, Channel AE significantly differs from the typical deep autoencoder models in the following two aspects, thereby making it highly challenging:

The number of possible messages grows exponentially with respect to block length, thus Channel AE must generalize to unseen messages with a capacity-restricted encoder and decoder [23].

The channel model adds noise between the encoder and the decoder, and the encoder needs to satisfy a power constraint, thus requiring high robustness in the code.
For Channel AE training, [16] and [17] introduce learning tricks emphasizing both channel coding and modulation. Learning Channel AE without channel gradient is shown in [27]. Modulation gain is reported in [28]. Beyond AWGN and fading channels, [29] extended RNN to design a code for the feedback channel, which outperforms the existing state-of-the-art code. Extending Channel AE to MIMO settings is reported in [20]. Despite these successes, existing research on Channel AE under canonical channels is currently restricted to very short block lengths (for example, achieving the same performance as a (7,4) Hamming code). Furthermore, existing works do not focus on the low-latency, robustness, and adaptivity requirements. In this paper, we ask the fundamental question:
Can we improve Channel AE design to construct new codes that comply with low-latency requirements?
We answer this in the affirmative, as shown in the next subsection.
I-C Our Contribution
The primary goal is to design a code under extremely low latency requirements. As pointed out earlier, convolutional codes beat block codes in the low latency regime [7]. RNN is a constrained neural structure with a natural connection to convolutional codes, since the encoded symbol has a locality of memory and is most strongly influenced by the recent past of the input bits. Furthermore, RNN-based codes have shown natural generalization across different block lengths in prior work [25][23]. We demonstrate that with a carefully designed learnable structure using Bidirectional RNN (Bi-RNN) for both encoder and decoder, as well as a novel training methodology developed specifically for the Channel AE model, our Bi-RNN based neural code outperforms convolutional code.
We then propose the Low-latency Efficient Adaptive Robust Neural (LEARN) code, which applies learnable RNN structures for both the encoder and the decoder with an additional low-latency constraint. LEARN achieves state-of-the-art performance under extremely low latency constraints. To the best of our knowledge, this is the first work that achieves an end-to-end design for a neural code achieving state-of-the-art performance on the AWGN channel (in any regime). In summary, the contributions of the paper are:

Beating convolutional codes: We propose the Bi-RNN network structure and a tailored learning methodology for Channel AE that can beat canonical convolutional codes. The proposed training methodology results in smoother training dynamics and better generalization, which are required to beat convolutional codes (Section II).

State-of-the-art performance in low latency settings: We design the LEARN code for low latency requirements with a specific network design. The LEARN code beats state-of-the-art performance under extremely low latency requirements (Section III).

Robustness and Adaptivity: When the channel conditions are varying, LEARN codes show robustness (ability to work well under an unseen channel) as well as adaptivity (ability to adapt to a new channel with few training symbols), showing an order of magnitude improvement in reliability over state-of-the-art codes (Section III).

Interpretations: We provide interpretations to aid in the fundamental understanding of why the jointly trained code works better than the canonical codes, which can be the fundamental basis for future research (Section IV).
II Designing Neural Code to beat Convolutional Code
In order to beat convolutional code using a learnt neural-network code, the network architecture as well as the training methodology need to be carefully crafted. In this section, we provide guidelines for designing the learning architecture as well as the training methods that are key to achieving high reliability of neural codes. Finally, we demonstrate that the aforementioned codes outperform Convolutional Code under a block coding setup with solely end-to-end learning on both AWGN and non-AWGN channels.
II-A Network Structure Design
Recent research on Channel AE does not show coding gain for even moderate block lengths [16][23] with fully connected neural networks, even with nearly unlimited training examples. We argue that a Recurrent Neural Network (RNN) architecture is more suitable as a deep learning structure for Channel AE.
Introduction of RNN
As illustrated in Figure 2 (left), RNN is defined as a general function $f(\cdot)$ such that $(y_t, h_t) = f(x_t, h_{t-1})$ at time $t$, where $x_t$ is the input, $y_t$ is the output, $h_t$ is the state sent to the next time slot and $h_{t-1}$ is the state from the last time slot. RNN can only emulate causal sequential functions. Indeed, it is known that an RNN can capture a general family of measurable functions from the input time-sequence to the output time-sequence [26]. As illustrated in Figure 2 (right), Bidirectional RNN (Bi-RNN) combines one forward and one backward RNN and can infer the current state by evaluating through both past and future. Bi-RNN is defined as $(y_t, h^f_t, h^b_t) = f(x_t, h^f_{t-1}, h^b_{t+1})$, where $h^f_t$ and $h^b_t$ stand for the state at time $t$ for the forward and backward RNNs [13].
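The recurrences above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the cell is a plain tanh (vanilla) RNN rather than a GRU, the output $y_t$ is taken to be the state itself, and the helper names (`rnn_step`, `run_rnn`, `run_birnn`, `init_params`) are hypothetical.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla-RNN step (y_t, h_t) = f(x_t, h_{t-1}); here y_t equals h_t."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)
    return h_t, h_t

def run_rnn(xs, params, reverse=False):
    """Run the cell over xs of shape (T, input_dim), forward or backward in time."""
    W_x, W_h, b = params
    h = np.zeros(W_h.shape[0])
    ys = np.zeros((len(xs), W_h.shape[0]))
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    for t in order:
        ys[t], h = rnn_step(xs[t], h, W_x, W_h, b)
    return ys

def run_birnn(xs, fwd_params, bwd_params):
    """Bi-RNN: concatenate a forward pass and a backward pass at every time step."""
    forward = run_rnn(xs, fwd_params)
    backward = run_rnn(xs, bwd_params, reverse=True)
    return np.concatenate([forward, backward], axis=1)

def init_params(rng, input_dim, hidden_dim):
    """Random illustrative weights (W_x, W_h, b); not trained values."""
    return (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))
```

The same parameters are reused at every time step, so the functions accept sequences of any length; the forward pass alone is causal, while the Bi-RNN output at time $t$ also depends on later inputs.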
RNN is a restricted structure which shares parameters between different time slots across the whole block, which makes it naturally generalizable to a longer block length. Moreover, RNN can be considered as an over-parameterized nonlinear convolutional code for both encoder and decoder, since a convolutional code encoder can be represented by a causal RNN and the BCJR forward-backward algorithm can be emulated by a Bi-RNN [25]. There are several parametric functions for RNN, such as vanilla RNN, Gated Recurrent Unit (GRU), or Long Short Term Memory (LSTM). Vanilla RNN is known to be hard to train due to diminishing gradients. LSTM and GRU are the most widely used RNN variants, which utilize gating schemes to alleviate the problem of diminishing gradients [13]. We empirically compare vanilla RNN with GRU and LSTM via the test loss trajectory, which shows the test loss over the course of training. The test loss trajectory is the mean of 5 independent experiments. Figure 2 right depicts the test loss along training time, which shows that vanilla RNN has slower convergence, GRU converges fast, while GRU and LSTM have similar final test performances. Since GRU has less computational complexity, in this paper we use GRU as our primary network structure [31]. In this paper, we use the terms GRU and RNN interchangeably.
RNN based Encoder and Decoder Design
Our empirical results comparing different Channel AE structures in Figure 3 show that for a longer block length, RNN outperforms simply applying a Fully Connected Neural Network (FCNN) for Channel AE (both encoder and decoder). The FCNN curve in Figure 3 refers to using FCNN for both encoder and decoder. RNN in Figure 3 refers to using Bi-RNN for both encoder and decoder. The training steps are the same for fair comparison. The repetition code and extended Hamming code performances are shown as a reference for both short and long block length cases.
Figure 3 (left) shows that for short block length (4), the performances of FCNN and RNN are close to each other, since for short block length enumerating all possible codes is not prohibitive. On the other hand, for longer block length (100), Figure 3 (right) shows that with FCNN the Bit Error Rate (BER) is even worse than that of the repetition code, which shows failure in generalization. RNN outperforms FCNN due to its generalization via parameter sharing and adaptive learnable dependency length. Thus in this paper, we use RNN for both the encoder and the decoder to gain generalization across block length. We can also see from Figure 3 that RNN with a tailored training methodology outperforms simply applying RNN or FCNN for Channel AE; we illustrate this training methodology in Section II-B below.
Power Constraint Module
The output of the RNN encoder can take arbitrary values and does not necessarily satisfy the power constraint. To impose the power constraint, we use a power constraint layer following the RNN. As shown in Figure 1, before transmitting codewords, we force the output of the power constraint module to generate codewords. Assume the message bit is $u$, the output of the encoder is $\tilde{x}$, and the power constraint output is $x$. The following three differentiable implementations are investigated:

hard power constraint: use the hyperbolic tangent function in training ($x = \tanh(\tilde{x})$) and threshold to $-1$ and $+1$ for testing, which only allows discrete coding schemes.

bit-wise normalization: $x_i = \frac{\tilde{x}_i - \mathrm{mean}(\tilde{x}_i)}{\mathrm{std}(\tilde{x}_i)}$. We have $\mathbb{E}[x_i^2] = 1$. For a given coding bit position, the bit power is normalized.

block-wise normalization: $x = \frac{\tilde{x} - \mathrm{mean}(\tilde{x})}{\mathrm{std}(\tilde{x})}$. We have $\frac{1}{n}\mathbb{E}\|x\|_2^2 = 1$, where $n$ is the number of coded symbols in the block, which allows us to reallocate power across the code block.
For bit-wise and block-wise normalization, the behavior differs between training and testing. During the training phase, bit-wise and block-wise normalization normalize the input mini-batch according to the training mini-batch statistics. During the testing phase, the mean and std are precomputed by passing through many samples to ensure that the estimates of mean and std are accurate.
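A minimal NumPy sketch of the three power constraint variants follows. It assumes the encoder output is a batch array of shape (batch, coded symbols); the function names and the `eps` stabilizer are illustrative choices, not taken from the paper. Passing precomputed `mean` and `std` mimics the testing phase described above.

```python
import numpy as np

def hard_constraint(x_tilde, training=True):
    """tanh during training; threshold to -1/+1 at test time."""
    if training:
        return np.tanh(x_tilde)
    return np.where(x_tilde >= 0, 1.0, -1.0)

def bitwise_normalize(x_tilde, mean=None, std=None, eps=1e-8):
    """Normalize each coded-bit position separately using batch statistics.

    x_tilde has shape (batch, n). Pass precomputed mean/std at test time.
    """
    if mean is None:
        mean = x_tilde.mean(axis=0, keepdims=True)
    if std is None:
        std = x_tilde.std(axis=0, keepdims=True)
    return (x_tilde - mean) / (std + eps)

def blockwise_normalize(x_tilde, mean=None, std=None, eps=1e-8):
    """Normalize with one mean/std shared by all positions of the block.

    Only the average power over the block is fixed, so individual
    positions may end up with more or less power than 1.
    """
    if mean is None:
        mean = x_tilde.mean()
    if std is None:
        std = x_tilde.std()
    return (x_tilde - mean) / (std + eps)
```

With bit-wise normalization every column of the output has unit power, whereas block-wise normalization only fixes the average over all columns.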
Figure 4 left shows that block-wise normalization does offer a better test loss trajectory, while bit-wise normalization shows slightly worse performance. The hard power constraint using $\tanh$, due to saturating gradients, results in high test error. Since the neural code operates on communication systems, reallocating power might violate hardware constraints. To satisfy the maximum power constraint, we use bit-wise normalization in this paper.
II-B Training Methodology
The following training methods result in a faster learning trajectory and better generalizations with the learnable structure discussed above.

Train with a large batch size.

Use Binary Cross-entropy (BCE) loss.

Train encoder and decoder separately. Train encoder once, and then train decoder 5 times.

Add minimum distance regularizer on encoder.

Use Adam optimizer.

Add more capacity (parameters) to the decoder than the encoder.
Some of the training methods are not common in deep learning due to the unique structure of Channel AE. In what follows, we show the empirical evidence and reason why the above training methods come with a better performance.
Large Batch Size
Deep learning models typically use mini-batching to improve generalization and reduce memory usage. Small random mini-batches result in better generalization [38][40], while a large batch size requires more training to generalize. However, Figure 4 middle shows that a larger batch size results in much better generalization for Channel AE, while a small batch size tends to saturate at a high test loss. A large batch size is required for the following reasons:
(1) A large batch offers better power constraint statistics [32]. With a large batch size, the normalization of the power constraint module offers a better estimate of the mean and standard deviation, which makes the output of the encoder less noisy, so the decoder can be better trained accordingly.
(2) A large batch size gives a better gradient. Since Channel AE is trained with self-supervision, the improvement of Channel AE originates from error back-propagation. As an extreme error might result in a wrong gradient direction, a large batch size can alleviate this issue.
The randomness of the training mini-batch results in better generalization [13]. Figure 4 right shows that even with a small block length, where enumerating all possible codes becomes possible, random mini-batching outperforms a fixed full batch which enumerates all possible codes in each batch. Note that training with all possible codewords leads to worse test performance, while training with a large random batch (5000) outperforms the full batch setting. Thus we conclude that using a large random batch gives a better result. In this paper, due to GPU memory limitation, we use batch size 1000.
Use Binary Cross-entropy Loss
The input and output are binary digits, which makes training Channel AE a binary self-supervised classification problem [16]. Binary Cross-entropy (BCE) is better due to the surrogate loss argument [41]. MSE and its variants can also be used as the loss function for Channel AE [20][23], as MSE offers implicit regularization on overconfidence of decoding. The comparison of MSE and BCE loss is shown in Figure 5 left.
Although the final test losses for BCE and MSE tend to converge, MSE leads to slower convergence. This is due to the fact that MSE actually punishes overconfident bits, which makes the learning gradient more sparse. As faster convergence and better generalization are always appreciated, BCE is used as the primary loss function.
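One standard way to see why the two losses can train at different speeds is to compare their gradients with respect to the decoder logit. The sketch below assumes the decoder ends in a sigmoid (an assumption for illustration, not a detail stated above) and uses hypothetical function names. Under that assumption the BCE gradient is $p - u$, while the MSE gradient carries an extra factor $p(1-p)$ that vanishes when the decoder is confident, even if it is confidently wrong.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(u, p, eps=1e-12):
    """Mean binary cross-entropy between bits u and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(u * np.log(p) + (1.0 - u) * np.log(1.0 - p))))

def mse_loss(u, p):
    """Mean squared error between bits u and probabilities p."""
    return float(np.mean((p - u) ** 2))

def bce_grad_wrt_logit(u, z):
    """d BCE / d z for p = sigmoid(z): simply p - u."""
    return sigmoid(z) - u

def mse_grad_wrt_logit(u, z):
    """d MSE / d z for p = sigmoid(z): 2 (p - u) p (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - u) * p * (1.0 - p)

# A confidently wrong bit: true bit 1, logit strongly negative.
u_wrong, z_wrong = np.array([1.0]), np.array([-8.0])
print("BCE gradient:", bce_grad_wrt_logit(u_wrong, z_wrong)[0])
print("MSE gradient:", mse_grad_wrt_logit(u_wrong, z_wrong)[0])
```

In this example the BCE gradient stays close to $-1$ while the MSE gradient is several orders of magnitude smaller, which is consistent with the slower convergence reported for MSE.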
Separately Train Encoder and Decoder
Training the encoder and decoder jointly with end-to-end back-propagation leads to saddle points. We argue that training Channel AE entails separately training the encoder and the decoder [27]. The accurate gradient of the encoder can be computed when the decoder is optimal for a given encoder. Thus, after training the encoder, training the decoder until convergence makes the gradient of the encoder more trustworthy. However, training the decoder until convergence at every step is computationally expensive. Empirically, we compare different training methods in Figure 5 right. Training the encoder and decoder jointly saturates easily. Training the encoder once and the decoder 5 times shows the best performance, and is used in this paper.
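The alternating schedule can be written as a small driver loop. The sketch below is framework-agnostic: `update_encoder` and `update_decoder` are hypothetical callbacks that would each perform one optimizer step on the corresponding sub-network while the other is held fixed.

```python
def alternating_schedule(num_rounds, encoder_steps=1, decoder_steps=5):
    """Yield which sub-network to update at each optimizer step."""
    for _ in range(num_rounds):
        for _ in range(encoder_steps):
            yield "encoder"
        for _ in range(decoder_steps):
            yield "decoder"

def run_alternating(update_encoder, update_decoder, num_rounds,
                    encoder_steps=1, decoder_steps=5):
    """Call the supplied update callbacks following the alternating schedule."""
    for phase in alternating_schedule(num_rounds, encoder_steps, decoder_steps):
        if phase == "encoder":
            update_encoder()   # decoder weights are held fixed in this step
        else:
            update_decoder()   # encoder weights are held fixed in this step

# Example: count how often each callback fires over 3 rounds.
counts = {"encoder": 0, "decoder": 0}
run_alternating(lambda: counts.update(encoder=counts["encoder"] + 1),
                lambda: counts.update(decoder=counts["decoder"] + 1),
                num_rounds=3)
print(counts)
```

The 1:5 ratio is the one reported above; the defaults can be changed to explore other ratios.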
Adding Minimum Distance Regularizer
Naively optimizing Channel AE results in a paired local optimum: a bad encoder and a bad decoder can be locked in a saddle point. Adding regularization to the loss is a common method to escape local optima [13]. Coding theory suggests that maximizing the minimum distance between all possible input messages [5] improves coding performance. However, since the number of possible messages increases exponentially with respect to code block length, computing a loss with maximized minimum distance for a long block code becomes prohibitive.
Exploiting the locality inherent to RNN codes, we introduce a different loss term solely for the encoder which we refer to as the partial minimum code distance regularizer. The partial minimum code distance $d(f_\theta)$ of the encoder $f_\theta$ is the minimum distance among the codewords of all messages with length $l$. Computing pairwise distance requires $O\left(\binom{2^l}{2}\right)$ computations. For a long block code with block length $L \gg l$, the RNN-encoded portion of length $l$ has minimum distance $d$, which guarantees that the block code with block length $L$ has minimum distance at least $d$, while the actual minimum distance can be larger. Partial minimum code distance is a compromise over computation, which still guarantees large minimum distance under small block length $l$, while hoping the minimum distance on longer block length would still be large. The loss objective of Channel AE with partial maximized minimum code distance is $\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} - \lambda\, d(f_\theta)$, where $\mathcal{L}_{\mathrm{BCE}}$ is the reconstruction loss and $\lambda > 0$ is the regularization weight. To beat convolutional code via RNN autoencoder, we add the minimum distance regularizer for block length 100, with a fixed partial length $l$ and weight $\lambda$. The performance is shown in Figure 6: the left graph shows the test loss trajectory, and the middle graph shows the BER.
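The partial minimum distance can be computed by brute force for small message lengths. The sketch below is only illustrative: a toy rate-1/2 feed-forward convolutional encoder (generators $1 + D + D^2$ and $1 + D^2$, mapped to $\pm 1$) stands in for the learned RNN encoder, Euclidean distance is assumed as the metric, and the names `toy_conv_encoder`, `partial_min_distance`, and `regularized_loss` are hypothetical.

```python
import itertools
import numpy as np

def toy_conv_encoder(bits):
    """Stand-in encoder: rate 1/2, generators 1+D+D^2 and 1+D^2, zero initial state."""
    s1 = s2 = 0
    out = []
    for b in bits:
        out.append(b ^ s1 ^ s2)
        out.append(b ^ s2)
        s1, s2 = b, s1
    return np.array(out, dtype=float) * 2.0 - 1.0  # map {0, 1} to {-1, +1}

def partial_min_distance(encoder, l):
    """Minimum Euclidean distance among the codewords of all 2^l messages."""
    messages = list(itertools.product([0, 1], repeat=l))
    codewords = np.stack([encoder(m) for m in messages])
    diffs = codewords[:, None, :] - codewords[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore the distance of a codeword to itself
    return float(dist.min())

def regularized_loss(reconstruction_loss, min_distance, lam):
    """Reconstruction loss minus lam times the partial minimum distance."""
    return reconstruction_loss - lam * min_distance

d = partial_min_distance(toy_conv_encoder, l=4)
print("partial minimum distance:", d)
print("regularized loss:", regularized_loss(0.5, d, lam=0.001))
```

The enumeration grows as $2^l$ messages and $\binom{2^l}{2}$ pairs, which is why it is only feasible for a short partial length. In actual training the distance would have to be computed with a differentiable framework so that its gradient reaches the encoder; the NumPy version here only evaluates it.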
Use Adam optimizer
The learning rate for the encoder and decoder has to be adaptive to compensate for noise realizations of different magnitudes. Also, as the training loss decreases, decoding errors become less likely, which makes both encoder and decoder gradients sparse. Adam [44] has an adaptive learning rate and an exponential moving average for sparse and non-stationary gradients, and is thus in theory a suitable optimizer. The comparison of different optimizers in Figure 6 right shows that Adam empirically outperforms all other optimizers, with faster convergence and better generalization. SGD with learning rate 0.001 is highly unstable and fails to converge. Thus we use Adam for training Channel AE.
Adding more capacity (parameters) to the decoder than the encoder
Channel AE can be considered as an overcomplete autoencoder model with the noise injected in the middle layer. In what follows, we analyze the effect of the injected noise as done in [39] and, for simplification, consider only the Mean Squared Error (MSE) loss (Binary Cross-entropy (BCE) loss would follow the same procedure). Assume $z \sim N(0, \sigma^2)$ is the added Gaussian noise, $u$ is the input binary bits, $f_\theta$ is the encoder, $g_\phi$ is the decoder, and $h$ is the power constraint module. Applying Taylor expansion (see appendix for more details), the loss of Channel AE can be written as: $\mathcal{L} = \|g_\phi(c) - u\|_2^2 + \sigma^2 \|J_{g_\phi}(c)\|_F^2$ with $c = h(f_\theta(u))$, where $J_{g_\phi}(c)$ is the Jacobian of function $g_\phi$ at $c$. The reconstruction error $\|g_\phi(c) - u\|_2^2$ can be interpreted as the coding error when no noise is added. With smaller $\|J_{g_\phi}(c)\|_F^2$, the decoder results in an invariance and robustness of the representation for small variations of the input [37], thus the Jacobian term is the regularizer encouraging the decoder to be locally invariant to noise [39]. The Jacobian term reduces the sensitivity of the decoder, which improves the generalization of decoders.
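The expansion can be sketched as follows. This is a first-order argument under the assumptions that the noise is i.i.d. Gaussian with variance $\sigma^2$ per coordinate and that higher-order terms of the decoder are negligible; the notation $c = h(f_\theta(u))$ for the noiseless codeword is introduced here for brevity.

```latex
% Noiseless codeword c = h(f_\theta(u)), noise z \sim N(0, \sigma^2 I).
\begin{align*}
g_\phi(c + z) &\approx g_\phi(c) + J_{g_\phi}(c)\, z
  && \text{(first-order Taylor expansion)} \\
\mathbb{E}_z \left\| g_\phi(c + z) - u \right\|_2^2
  &\approx \left\| g_\phi(c) - u \right\|_2^2
   + 2\,\bigl(g_\phi(c) - u\bigr)^{\top} J_{g_\phi}(c)\, \mathbb{E}[z]
   + \mathbb{E}\bigl[ z^{\top} J_{g_\phi}(c)^{\top} J_{g_\phi}(c)\, z \bigr] \\
  &= \left\| g_\phi(c) - u \right\|_2^2
   + \sigma^2\, \mathrm{tr}\bigl( J_{g_\phi}(c)^{\top} J_{g_\phi}(c) \bigr)
  && \text{(since } \mathbb{E}[z] = 0,\ \mathbb{E}[z z^{\top}] = \sigma^2 I \text{)} \\
  &= \left\| g_\phi(c) - u \right\|_2^2
   + \sigma^2 \left\| J_{g_\phi}(c) \right\|_F^2 .
\end{align*}
```

The cross term vanishes because the noise has zero mean, which leaves the noiseless reconstruction error plus a penalty on the decoder Jacobian scaled by the noise variance.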
Empirically there exists an optimal training SNR for neural decoders [25] and Channel AE [16]. When training with too large a noise variance $\sigma^2$, the Jacobian term dominates, hence the reconstruction error becomes non-zero, which degrades the performance. When training with too small a noise, the decoder is not locally invariant, which reduces the generalization ability.
Also, as the Jacobian term only applies regularization to the decoder, the decoder needs more capacity. Empirically the neural decoder has to be more complex than the encoder. After training for 120 epochs, the encoder/decoder sizes of 25/100 and 100/400 units show better test loss than the other configurations tested. As the encoder/decoder with 25/100 units works as well as the encoder/decoder with 100/400 units, we take a 25-unit encoder and a 100-unit decoder for most of our applications. Further hyperparameter optimization such as Bayesian Optimization [42] could lead to even better performance.
Encoder and Decoder Hyperparameter Design after 120 epochs

Enc Unit  Dec Unit  Test Loss 
25  100  0.180 
25  400  0.640 
100  100  0.690 
100  400  0.181 
II-C Design to beat Convolutional Code
Applying the network architecture guidelines and the training methodology improvements proposed so far, we design the neural code with Bi-GRU for both encoder and decoder as shown in Figure 7. The hyperparameters are shown in Figure 8.
Encoder  2-layer Bi-GRU with 25 units  Decoder  2-layer Bi-GRU with 100 units
Power constraint  bit-wise normalization  Batch size  1000
Learning rate  0.001, decay by 10 when saturated  Num epoch  240
Block length  100  Batch per epoch  100
Optimizer  Adam  Loss  Binary Cross-entropy (BCE)
Min Dist Regularizer  0.0  Train SNR at rate 1/2  mixture of 0 to 8dB
Train SNR at rate 1/3  mixture of -1 to 2dB  Train SNR at rate 1/4  mixture of -2 to 2dB
Train method  train encoder once decoder 5 times  Min Distance Regularizer  0.001
II-D Performance of RNN-based Channel AE: AWGN Setting
We design the block code under short block lengths and compare the performance with Tail-biting Convolutional Code (TBCC). The BER performance in the AWGN channel under various code rates is shown in Figure 9. The TBCC BER curve is generated by the best generator function from Figure 11 (left). Figure 9 shows that RNN-based Channel AE outperforms all convolutional codes with memory size under 7. RNN-based Channel AE empirically shows the advantage of jointly optimizing encoder and decoder over the AWGN channel.
II-E Performance of RNN-based Channel AE: Non-AWGN Setting
We test the robustness and adaptivity of RNN-based Channel AE on three families of channels:

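AWGN channel: $y = x + z$, where $z \sim N(0, \sigma^2)$.

Additive T-distribution Noise (ATN) channel: $y = x + z$, where $z \sim T(\nu, \sigma^2)$ is T-distributed noise with $\nu$ degrees of freedom. This noise is a model for heavy-tailed distributions.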
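The two additive channels listed above can be simulated with NumPy as sketched below. One point is an assumption rather than something stated in the text: the T-distributed noise is rescaled by $\sqrt{(\nu - 2)/\nu}$ so that its variance equals $\sigma^2$, which requires $\nu > 2$. The function names are illustrative.

```python
import numpy as np

def awgn_channel(x, sigma, rng):
    """y = x + z with z ~ N(0, sigma^2)."""
    return x + sigma * rng.standard_normal(x.shape)

def atn_channel(x, sigma, nu, rng):
    """y = x + z with heavy-tailed Student-t noise with nu degrees of freedom.

    Assumption: the noise is scaled to have variance sigma^2 (needs nu > 2).
    """
    if nu <= 2:
        raise ValueError("nu must exceed 2 for the variance to be finite")
    t = rng.standard_t(nu, size=x.shape)
    return x + sigma * np.sqrt((nu - 2.0) / nu) * t

rng = np.random.default_rng(0)
x = np.zeros(400_000)          # all-zero input isolates the noise
sigma = 0.7
noise_awgn = awgn_channel(x, sigma, rng)
noise_atn = atn_channel(x, sigma, 3.0, rng)
print("fraction beyond 4 sigma, AWGN:", np.mean(np.abs(noise_awgn) > 4 * sigma))
print("fraction beyond 4 sigma, ATN :", np.mean(np.abs(noise_atn) > 4 * sigma))
```

At equal variance, the ATN samples exceed a $4\sigma$ threshold far more often than the Gaussian samples, which is the heavy-tail behaviour that a decoder designed for AWGN does not anticipate.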
Robustness
Robustness means that when RNN-based Channel AE is trained for the AWGN channel, its test performance on a different channel (ATN and Radar) without retraining should not degrade much. Most existing codes are designed under AWGN since AWGN has a clean mathematical abstraction, and AWGN is the worst case noise under a given power constraint [1]. When both the encoder and the decoder are not aware of the non-AWGN channel, the BER performance degrades.
Robustness ensures both the encoder and the decoder perform well under channel mismatch, which is a typical use case for low latency scheme when channel estimation and compensation are not accurate [5].
Adaptivity
Adaptivity allows RNN-based Channel AE to learn a decoding algorithm from enough data even when no clean mathematical model is available [25]. We train RNN-based Channel AE under ATN and Radar channels with the same hyperparameters as shown in Figure 8 and with the same amount of training data to ensure that RNN-based Channel AE converges. With both encoder and decoder learnable, two cases of adaptivity are tested. The first is decoder adaptivity, where the encoder is fixed and the decoder can be further trained. The second is full adaptivity of both encoder and decoder. In our findings, encoder-only adaptivity does not show any further advantage, and is thus omitted.
In this part, we evaluate RNN-based Channel AE for robustness and adaptivity on ATN and Radar channels. The BER performance is shown in Figure 10. Note that under non-AWGN channels, RNN-based Channel AE trained on the AWGN channel performs better than the Convolutional Code with Viterbi Decoder. RNN-based Channel AE shows more robust decoding under channel mismatch compared to the best Convolutional Code. As shown in Figure 10, RNN-based Channel AE with decoder-only adaptivity improves over the RNN-based Channel AE robust decoder, while RNN-based Channel AE with full adaptivity, with both encoder and decoder trainable, shows the best performance.
The fully adapted RNN-based Channel AE is better than the Convolutional Code even with CSIR, which utilizes the log-likelihood of T-distribution noise. Thus jointly designing the encoder and decoder results in a further optimized code under given channels. Even when the underlying mathematical model is far from a clean abstraction, RNN-based Channel AE is able to learn the underlying functional code via self-supervised back-propagation.
To the authors' knowledge, RNN-based Channel AE is the first neural code to beat existing canonical codes under the AWGN channel coding setting, which opens a new field of constructing good neural codes under canonical settings. Furthermore, RNN-based Channel AE can also be applied to channels for which mathematical analysis is not available.
III Design Low Latency Codes: LEARN
Designing codes for low latency constraints is challenging as many existing block codes inevitably require long block lengths. In this section, to address this challenge, we propose a novel RNN-based encoder and decoder architecture that satisfies the low latency constraint, which we call LEARN. We show that the LEARN code is (a) significantly more reliable than convolutional codes, which are state-of-the-art under extreme low latency constraints [7], and (b) more robust and adaptive for various channels beyond AWGN channels. In the following, we first define the latency and review the literature under the low latency setting.
III-A Low Latency Convolutional Code
Formally, decoder structural delay is understood in the following setting: to send message $u_t$ at time $t$, the causal encoder sends code $c_t$, and the decoder has to decode $u_t$ as soon as it has received $c_{t+D}$. The decoder structural delay $D$ is the number of bits that the decoder can look ahead to decode. The convolutional code has zero encoder delay due to its causal encoder, and the decoder delay is controlled by the optimal Viterbi Decoder [30] with a decoding window of length $w$ which only uses the last $w$ future branches in the trellis to compute the current output. For a code rate $R = k/n$ convolutional code, the structural decoder delay is a function of $k$ and $w$ [8]. When one information bit is processed per step ($k = 1$), the structural decoder delay is $D = w$. Convolutional code is the state-of-the-art code under extreme low latency [7].
In this paper, we confine our scope to investigating extreme low latency with no encoder delay and extremely low structural decoder delay, with code rates 1/2, 1/3, and 1/4. The benchmark we use is convolutional code with variable memory length. Under the unbounded block length setting, longer memory results in better performance; however, under a low latency constraint longer memory does not necessarily mean better performance since the decoding window is shorter [7]. Hence we test all memory lengths under 7 to get the state-of-the-art performance of the Recursive Systematic Convolutional (RSC) Code, whose generating functions are shown in Figure 11 (left), with the convolutional code encoder shown in Figure 11 (right). The decoder is a Viterbi Decoder with decoding window $w$.
III-B LEARN network structure
Following the network design proposed in the previous section, we propose a novel RNN-based neural network architecture for LEARN (both the encoder and the decoder) that satisfies the low latency constraint. Our proposed LEARN encoder is illustrated in Figure 12 (left). The causal neural encoder is a causal RNN with two layers of GRU followed by a Fully Connected Neural Network (FCNN). The neural structure ensures that the optimal temporal storage can be learnt and extended to the nonlinear regime. The power constraint module is bit-wise normalization as described in the previous section.
Applying a Bi-RNN decoder for a low latency code requires computing look-ahead instances for each received information bit, which is computationally expensive in both time and memory. To improve efficiency, the LEARN decoder uses two GRU structures instead of Bi-RNN structures. The LEARN decoder has two GRUs: one GRU runs until the current time slot, another GRU runs further for $D$ steps, then the outputs of the two GRUs are summarized by an FCNN. The LEARN decoder ensures that all the information bits satisfying the delay constraint can be utilized with the forward pass only. When decoding a received signal, each GRU just needs to process one step ahead, which results in decoding computation complexity $O(1)$ per bit. Viterbi and BCJR low latency decoders need to go through the trellis and backtrack to the desired position, which requires going forward one step and backward $D$ steps under the delay constraint, thus resulting in $O(D)$ computation for decoding each bit. Although GRU has a large computational constant due to the complexity of the neural network, with emerging AI chips the computation time is expected to diminish [33]. The hyperparameters of LEARN differ from the block code settings as follows: (1) the encoder and decoder use GRU instead of Bi-GRU, (2) the number of training epochs is reduced to 120, and (3) no Partial Minimum Distance Regularizer is used.
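The two-stream look-ahead structure can be sketched as follows. This is a simplified illustration, not the LEARN implementation: a plain tanh recurrent cell stands in for each GRU, a single linear layer with a sigmoid stands in for the FCNN, the weights are random, and all names (`cell`, `lookahead_decode`, `D`) are hypothetical.

```python
import numpy as np

def cell(y_t, h, W_y, W_h):
    """Toy recurrent cell standing in for one GRU step."""
    return np.tanh(W_y @ y_t + W_h @ h)

def lookahead_decode(ys, D, p_cur, p_ahead, w_out):
    """Decode bit t from received symbols ys[0..t+D] using two forward-only streams.

    ys has shape (T, n): n received symbols per information bit.
    Each stream advances by exactly one step per decoded bit.
    """
    T = len(ys)
    h_cur = np.zeros(p_cur[1].shape[0])
    h_ahead = np.zeros(p_ahead[1].shape[0])
    for t in range(min(D, T)):                # warm up the look-ahead stream
        h_ahead = cell(ys[t], h_ahead, *p_ahead)
    probs = np.zeros(T)
    for t in range(T):
        h_cur = cell(ys[t], h_cur, *p_cur)    # stream that stops at time t
        if t + D < T:                         # stream that runs D steps further
            h_ahead = cell(ys[t + D], h_ahead, *p_ahead)
        logit = w_out @ np.concatenate([h_cur, h_ahead])
        probs[t] = 1.0 / (1.0 + np.exp(-logit))
    return probs

rng = np.random.default_rng(1)
n, hidden, T, D = 2, 5, 12, 3
p_cur = (rng.normal(scale=0.5, size=(hidden, n)), rng.normal(scale=0.5, size=(hidden, hidden)))
p_ahead = (rng.normal(scale=0.5, size=(hidden, n)), rng.normal(scale=0.5, size=(hidden, hidden)))
w_out = rng.normal(scale=0.5, size=2 * hidden)
ys = rng.normal(size=(T, n))
print(lookahead_decode(ys, D, p_cur, p_ahead, w_out).shape)
```

Because both streams only move forward, the estimate for bit $t$ never depends on symbols received after time $t + D$, and the work per decoded bit does not grow with $D$.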
III-C Performance of LEARN: AWGN Setting
Figure 13 shows the BER of the LEARN code and of the state-of-the-art RSC codes of varying memory lengths from Figure 11 (left), for rates 1/2, 1/3, and 1/4, as a function of SNR under the considered latency constraints. As the figure shows, for rates 1/3 and 1/4 under the AWGN channel, the LEARN code under extreme delay achieves a lower Bit Error Rate (BER) than the state-of-the-art RSC codes of varying memory lengths from Figure 11 (left). LEARN outperforms all RSC codes listed in Figure 11 (left) at code rates 1/3 and 1/4, demonstrating a very promising application of neural codes under a low latency constraint.
For higher code rates, LEARN shows performance comparable to convolutional codes but degrades at high SNR. We expect that further improvements can be made via improved structure design and hyperparameter optimization, especially at higher rates.
III-D Robustness and Adaptivity
The performance of LEARN with respect to robustness and adaptivity is shown in Figure 14 for three settings, each with its own delay constraint and code rate: (1) and (2) two ATN channel settings, and (3) a Radar channel setting. As shown in Figure 14 (left), under the ATN channel, which has heavy-tailed noise, LEARN with robustness outperforms the convolutional code. Adaptivity with both encoder and decoder performs best, and is better than when only the decoder is adaptive. By utilizing the degrees of freedom in designing both the encoder and the decoder, the neural designed coding scheme can match canonical convolutional codes with Channel State Information at Receiver (CSIR) at a low code rate, and outperform convolutional codes with CSIR at a high code rate. The same trend holds for the ATN channel in Figure 14 (middle) and the Radar channel in Figure 14 (right). Note that under the Radar channel, we apply the heuristic proposed in [12]. We observe that LEARN with full adaptation gives an order-of-magnitude improvement in reliability over the convolutional code heuristic [12]. The experiment shows that by jointly designing both encoder and decoder, LEARN can adapt to a broad family of channels. LEARN offers an end-to-end low latency coding design method that can be applied to any statistical channel and ensure good performance.
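To make the heavy-tail behavior concrete, the sketch below draws noise from a Student's t-distribution, which is how ATN is understood here (additive t-distributed noise). The function name `sample_t_noise`, the degrees-of-freedom value, and the scale are illustrative assumptions and are not the parameters of the experiments.

```python
import math
import random

def sample_t_noise(n, nu, sigma=1.0, rng=None):
    """Draw n samples of scaled Student's t noise with nu degrees of freedom.

    Uses the standard construction t = Z / sqrt(V / nu), with Z standard normal
    and V chi-square with nu degrees of freedom (a Gamma(nu/2, scale=2) variable).
    """
    rng = rng or random.Random()
    samples = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        v = rng.gammavariate(nu / 2.0, 2.0)
        samples.append(sigma * z / math.sqrt(v / nu))
    return samples
```

With few degrees of freedom, large-magnitude samples occur far more often than under Gaussian noise of comparable scale, which is the kind of outlier a code trained only on AWGN is not prepared for.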
IV Interpretability of Deep Models
The promising performance of LEARN and the RNN-based Channel AE raises a natural question: what have the encoder and decoder learned? Answering it could inspire future research and reveal caveats of the learned encoder and decoder. We perform our interpretation via local perturbation of the LEARN and RNN-based Channel AE encoders and decoders.
IV-A Interpretability of Encoder
The significant recurrent length of an RNN is an indicator of recurrent capacity that we use to interpret the neural encoders and decoders. The RNN encoder's significant recurrent length is defined as the length of the output sequence that the input at a given time can affect as the RNN operates recurrently. Consider two input sequences that differ only at a single time index. Taking a batch of such sequence pairs as input to the RNN encoder, we compare the absolute difference between the two outputs along the whole block to investigate how far the effect of the input flip at that time index extends.
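The perturbation procedure can be sketched as below. The encoder used here is a hypothetical stand-in (a rate-1/2 feedforward code with memory 2), chosen only because its response is easy to verify by hand; `flip_response` and `significant_length` are illustrative names, and the threshold is an assumption.

```python
import random

def toy_encoder(bits):
    """Hypothetical causal encoder: rate-1/2 feedforward code with memory 2."""
    padded = [0, 0] + list(bits)
    out = []
    for t in range(len(bits)):
        b0, b1, b2 = padded[t + 2], padded[t + 1], padded[t]  # current, t-1, t-2
        out.append((b0 ^ b1 ^ b2, b0 ^ b2))
    return out

def flip_response(encoder, block_len, flip_pos, batch_size=200, seed=0):
    """Average absolute output difference per position when one input bit flips."""
    rng = random.Random(seed)
    totals = [0.0] * block_len
    for _ in range(batch_size):
        u = [rng.randint(0, 1) for _ in range(block_len)]
        v = list(u)
        v[flip_pos] ^= 1                      # the two inputs differ at one index
        for t, (cu, cv) in enumerate(zip(encoder(u), encoder(v))):
            totals[t] += sum(abs(a - b) for a, b in zip(cu, cv)) / len(cu)
    return [x / batch_size for x in totals]

def significant_length(response, threshold=1e-3):
    """Span between the first and last positions whose difference exceeds threshold."""
    hits = [t for t, d in enumerate(response) if d > threshold]
    return hits[-1] - hits[0] + 1 if hits else 0
```

For a learned encoder, `toy_encoder` would be replaced by a call to the trained network, and the response curve would be read off in the same way.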
To investigate LEARN's RNN encoder, we flip only the first bit (position 0) of the two input sequences. The code position refers to the bit position within the block, and the y-axis shows the averaged difference between the outputs for the two sequences. Figure 15 (top left) shows that for an extremely short delay constraint, the encoder significant recurrent length is short: the effect of the current bit diminishes after 2 bits. As the delay constraint increases, the encoder significant recurrent length increases accordingly. The LEARN encoder learns to encode locally to optimize under the low latency constraint.
For the RNN-based Channel AE with BiRNN encoder, the block length is 100, and the flip is applied at the middle (50th) bit position. Figure 15 (top right) shows the encoders trained under AWGN and ATN channels. The encoder trained on ATN shows a longer significant dependency length. ATN is a heavy-tailed noise distribution; to alleviate the effect of extreme values, increasing the dependency of the code results in better reliability. Note that even the longest significant recurrent length is only 10 steps backward and 16 steps forward, so the GRU encoder did not actually learn very long recurrent dependency. AWGN capacity-achieving codes have built-in mechanisms to improve long-term dependency; for example, the Turbo encoder uses an interleaver [3]. Improving the significant recurrent length via better learnable structure design is an interesting future research direction.
IV-B Interpretability of Decoder
The decoder significant recurrent length illustrates how the decoder copes with different constraints and channels. Consider two received sequences: a noiseless coded sequence, and a copy that is identical except at a single position, where a large deterministic pulse noise is added. We compare the absolute difference between the decoder outputs for the two sequences along the whole block to investigate how far the effect of the pulse noise extends.
For the LEARN decoder, we inject the pulse noise at the starting position. Figure 15 (bottom left) shows that for all delay cases, the position most significantly affected by the noise equals the delay constraint, which shows that the LEARN decoder learns to coordinate with the causal LEARN encoder. With a delay constraint of 1, the maximum decoder difference along the block is at position 1, while with a delay constraint of 10 it is at position 10. Other code bits have a less significant but nonzero decoder difference.
The LEARN decoder's significant recurrent length implies that it learns not only to utilize the causal encoder's immediate output, but also to utilize outputs in other time slots to help decoding. Note that the maximum significant recurrent length is approximately twice the delay constraint, beyond which the impact diminishes. The LEARN decoder learns to decode locally to optimize under different low latency constraints.
For the RNN-based Channel AE with BiRNN encoder, the perturbation is applied at the middle (50th) position, again with block length 100. Figure 15 (bottom right) shows the decoders trained under AWGN and ATN channels. The AWGN-trained decoder is more sensitive to pulse noise with extreme values. By reducing the sensitivity to extreme noise and reducing its impact along the sequence, the ATN-trained decoder learns to improve performance under the ATN channel. The RNN-based Channel AE decoders thus learn how best to utilize the received signal under different channel settings.
V Conclusion
In this paper, we have demonstrated the power of neural network based architectures in achieving state-of-the-art performance for simultaneous code and decoder design. In the long block length case, we showed that our learned codes can significantly outperform convolutional codes. However, in order to beat state-of-the-art codes such as Turbo or LDPC codes, we require additional mechanisms such as interleaving to introduce long-term dependence. This promises to be a fruitful direction for future exploration.
In the low-latency regime, we showed that we can achieve state-of-the-art performance with LEARN codes. Furthermore, we showed that LEARN codes beat existing codes by an order of magnitude in reliability when there is channel mismatch. Our present design is restricted to extreme low latency; however, with additional mechanisms for introducing longer-term dependence [35, 36], it is possible to extend these designs to cover a larger range of delays. This is another interesting direction for future work.
Acknowledgment
This work was supported in part by NSF awards 1651236 and 1703403, as well as a gift from Intel.
References
 [1] Shannon, C. E. "A mathematical theory of communication." Bell System Technical Journal 27.3 (1948): 379-423.
 [2] Arikan, Erdal. "A performance comparison of polar codes and Reed-Muller codes." IEEE Communications Letters 12.6 (2008).
 [3] Berrou, Claude, Alain Glavieux, and Punya Thitimajshima. "Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1." Communications, 1993. ICC’93.
 [4] MacKay, David J. C., and Radford M. Neal. "Near Shannon limit performance of low density parity check codes." Electronics Letters 32.18 (1996).
 [5] Richardson, Tom, and Ruediger Urbanke. Modern coding theory. Cambridge University Press, 2008.
 [6] Sybis, Michal, et al. "Channel coding for ultra-reliable low-latency communication in 5G systems." Vehicular Technology Conference (VTC-Fall), IEEE 84th. IEEE, 2016.
 [7] Rachinger, Christoph, Johannes B. Huber, and Ralf R. Muller. "Comparison of convolutional and block codes for low structural delay." IEEE Transactions on Communications 63.12 (2015): 4629-4638.
 [8] Maiya, Shashank V., Daniel J. Costello, and Thomas E. Fuja. "Low latency coding: Convolutional codes vs. LDPC codes." IEEE Transactions on Communications 60.5 (2012): 1215-1225.
 [9] Polyanskiy, Yury, H. Vincent Poor, and Sergio Verdu. "Channel coding rate in the finite blocklength regime." IEEE Trans. on Info. Theory 56.5 (2010): 2307-2359.
 [10] Shirvanimoghaddam, Mahyar, et al. "Short Blocklength Codes for Ultra-Reliable Low-Latency Communications." arXiv:1802.09166 (2018).
 [11] Li, Junyi, Xinzhou Wu, and Rajiv Laroia. OFDMA mobile broadband communications: A systems approach. Cambridge University Press, 2013.
 [12] Safavi-Naeini, Hossein-Ali, et al. "Impact and mitigation of narrow-band radar interference in downlink LTE." 2015 IEEE ICC, 2015.
 [13] Goodfellow, Ian, et al. Deep learning. Vol. 1. Cambridge: MIT Press, 2016.
 [14] Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv:1510.00149 (2015).
 [15] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
 [16] O’Shea, Timothy J., Kiran Karra, and T. Charles Clancy. "Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention." Signal Processing and Information Technology (ISSPIT), 2016 IEEE International Symposium on. IEEE, 2016.
 [17] O’Shea, Timothy, and Jakob Hoydis. "An introduction to deep learning for the physical layer." IEEE Transactions on Cognitive Communications and Networking 3.4 (2017): 563-575.
 [18] Farsad, Nariman and Goldsmith, Andrea. "Neural Network Detection of Data Sequences in Communication Systems." IEEE Transactions on Signal Processing, Jan 2018.
 [19] Dörner, Sebastian and Cammerer, Sebastian and Hoydis, Jakob and Brink, Stephan ten. "Deep learning-based communication over the air." arXiv preprint arXiv:1707.03384.
 [20] O’Shea, Timothy J., Tugba Erpek, and T. Charles Clancy. "Deep learning based MIMO communications." arXiv preprint arXiv:1707.07980 (2017).
 [21] Nachmani, Eliya, Yair Be’ery, and David Burshtein. "Learning to decode linear codes using deep learning." Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE, 2016.
 [22] Nachmani, Eliya, et al. "Deep learning methods for improved decoding of linear codes." IEEE Journal of Selected Topics in Signal Processing 12.1 (2018): 119-131.
 [23] Gruber, Tobias, et al. "On deep learning-based channel decoding." Information Sciences and Systems (CISS), 2017 51st Annual Conference on. IEEE, 2017.
 [24] Cammerer, Sebastian, et al. "Scaling deep learning-based decoding of polar codes via partitioning." GLOBECOM 2017 - 2017 IEEE Global Communications Conference. IEEE, 2017.
 [25] Kim, Hyeji and Jiang, Yihan and Rana, Ranvir and Kannan, Sreeram and Oh, Sewoong and Viswanath, Pramod. "Communication Algorithms via Deep Learning." Sixth International Conference on Learning Representations (ICLR), 2018.
 [26] Hammer, Barbara. "On the approximation capability of recurrent neural networks." Neurocomputing Vol. 31, pp. 107–123, 2010.
 [27] Aoudia, Faycal Ait, and Jakob Hoydis. "End-to-End Learning of Communications Systems Without a Channel Model." arXiv preprint arXiv:1804.02276 (2018).
 [28] Felix, Alexander, et al. "OFDM-Autoencoder for End-to-End Learning of Communications Systems." arXiv preprint arXiv:1803.05815 (2018).
 [29] Kim, Hyeji and Jiang, Yihan and Rana, Ranvir and Kannan, Sreeram and Oh, Sewoong and Viswanath, Pramod. "Deepcode: Feedback codes via deep learning." NIPS 2018.
 [30] Viterbi, Andrew. "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm." IEEE Transactions on Information Theory 13.2 (1967): 260-269.
 [31] Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv:1412.3555 (2014).
 [32] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
 [33] Ovtcharov, Kalin, et al. "Accelerating deep convolutional neural networks using specialized hardware." MSR Whitepaper 2.11 (2015).
 [34] Sanders, Geoffrey A. "Effects of radar interference on LTE (FDD) eNodeB and UE receiver performance in the 3.5 GHz band." US Department of Commerce, National Telecommunications and Information Administration, 2014.
 [35] Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V. "Sequence to sequence learning with neural networks." NIPS 2014.
 [36] Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and others. "Spatial transformer networks." NIPS 2015.
 [37] Rifai, Salah, et al. "Contractive auto-encoders: Explicit invariance during feature extraction." Proceedings of the 28th International Conference on International Conference on Machine Learning. Omnipress, 2011.
 [38] Hoffer, Elad, Itay Hubara, and Daniel Soudry. "Train longer, generalize better: closing the generalization gap in large batch training of neural networks." Advances in Neural Information Processing Systems. 2017.
 [39] Poole, Ben, Jascha Sohl-Dickstein, and Surya Ganguli. "Analyzing noise in autoencoders and deep networks." arXiv preprint arXiv:1406.1831 (2014).
 [40] Smith, Samuel L., et al. "Don’t decay the learning rate, increase the batch size." arXiv preprint arXiv:1711.00489 (2017).
 [41] Buja, Andreas, Werner Stuetzle, and Yi Shen. "Loss functions for binary class probability estimation and classification: Structure and applications." Working draft, November 3 (2005).
 [42] Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. "Practical Bayesian optimization of machine learning algorithms." Advances in Neural Information Processing Systems. 2012.
 [43] Sabri, Motaz, and Takio Kurita. "Effect of Additive Noise for Multi-Layered Perceptron with AutoEncoders." IEICE Transactions on Information and Systems 100.7 (2017): 1494-1504.
 [44] Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." International Conference on Learning Representations (ICLR). 2015.
Appendix
A Alternative Minimum Distance Regularizer
As the RNN encoder has a small dependency length (see Section IV), messages at a small Hamming distance from each other may lead to a small minimum code distance. Another method of regularizing is to directly regularize the minimum distance between messages with small Hamming distances. The method starts by enumerating all messages within a given Hamming distance of a random message, then computes the minimum distance among the enumerated messages as a regularization term. However, this method does not guarantee any minimum distance property, even for short block lengths, and its computational complexity is high even for small problem sizes. Empirically, this method does not work well.
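The enumeration step can be sketched as follows. The encoder is a hypothetical toy (each bit followed by its XOR with the previous bit), the distance is taken between the encoded versions of the enumerated messages, which is an assumed reading of the regularization term, and all function names are illustrative.

```python
from itertools import combinations
from math import comb

def hamming_ball(message, radius):
    """All messages within Hamming distance `radius` of `message` (inclusive)."""
    n = len(message)
    ball = []
    for k in range(radius + 1):
        for positions in combinations(range(n), k):
            m = list(message)
            for p in positions:
                m[p] ^= 1                 # flip the chosen k positions
            ball.append(tuple(m))
    return ball

def toy_encoder(bits):
    """Hypothetical rate-1/2 encoder: each bit, then its XOR with the previous bit."""
    out, prev = [], 0
    for b in bits:
        out += [b, b ^ prev]
        prev = b
    return tuple(out)

def min_code_distance(messages, encoder):
    """Smallest pairwise Hamming distance between the encoded messages."""
    codewords = [encoder(m) for m in messages]
    return min(
        sum(a != b for a, b in zip(x, y))
        for x, y in combinations(codewords, 2)
    )
```

The ball around a length-n message with radius d contains the sum of `comb(n, k)` for k from 0 to d messages, and the pairwise comparison is quadratic in that count, which illustrates why the cost grows quickly.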
B Loss Analysis for Encoder and Decoder Size
This section gives the derivation of the loss analysis used in the main text. The Channel AE is trained with a Mean Squared Error (MSE) loss. Following [43], the decoder is approximated by a Taylor expansion around the encoder output, under the assumption that the noise is small and ignoring all higher order components.
Note that the Taylor expansion is a local approximation of the function. Hence, ignoring the higher order components is only valid locally, for small noise. Expanding the MSE loss under this approximation yields the expression used in the main text.
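For reference, one standard form of such a derivation is sketched below. The notation is assumed for illustration and is not taken from the text: $u$ is the message, $f$ the encoder, $x = f(u)$ the encoder output, $g$ the decoder with Jacobian $J_g$, and $z$ zero-mean noise with covariance $\sigma^2 I$; a first-order expansion is assumed.

```latex
\begin{align}
\mathcal{L} &= \mathbb{E}_{z}\,\lVert g(x+z) - u \rVert_2^2, \qquad x = f(u), \\
g(x+z) &\approx g(x) + J_g(x)\,z, \\
\mathcal{L} &\approx \lVert g(x)-u \rVert_2^2
  + 2\,(g(x)-u)^{\top} J_g(x)\,\mathbb{E}[z]
  + \mathbb{E}\!\left[ z^{\top} J_g(x)^{\top} J_g(x)\, z \right] \\
&= \lVert g(x)-u \rVert_2^2 + \sigma^2\,\lVert J_g(x) \rVert_F^2 .
\end{align}
```

Under these assumptions the cross term vanishes because the noise has zero mean, leaving a noiseless reconstruction term plus a penalty on the sensitivity of the decoder to its input.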
C Low Latency Benchmark: Convolutional Code with Different Memory Length
The benchmarks for applying convolutional codes under the extreme low latency constraint are shown in Figure 16. Note that no single convolutional code is uniformly best across delay constraints and code rates. Thus the convolutional code results reported in the main sections use, for each setting, the best convolutional code shown in Figure 16.