I. Introduction
I-A. $G_N$-coset codes
$G_N$-coset codes, defined by Arıkan in [1], are a class of linear block codes with the generator matrix $G_N$. $G_N$ is an $N \times N$ binary matrix defined as

$G_N = F^{\otimes n}$, (1)

in which $F = \bigl[\begin{smallmatrix}1 & 0\\ 1 & 1\end{smallmatrix}\bigr]$, $N = 2^n$, and $F^{\otimes n}$ denotes the $n$-th Kronecker power of $F$.

The encoding process is

$x_1^N = u_1^N \cdot G_N$, (2)

where $x_1^N$ and $u_1^N$ denote the code bit sequence and the information bit sequence, respectively.

An $(N, K)$ $G_N$-coset code [1] is defined by an information set $\mathcal{A}$ with $|\mathcal{A}| = K$. Its generator matrix $G_N(\mathcal{A})$ is composed of the rows indexed by $\mathcal{A}$ in $G_N$. Thus (2) is rewritten as

$x_1^N = u_{\mathcal{A}} \cdot G_N(\mathcal{A})$, (3)

where $u_{\mathcal{A}} = \{u_i : i \in \mathcal{A}\}$ and the frozen bits $u_{\mathcal{A}^c}$ are set to zero.
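As a concrete reference, the definitions (1)-(3) can be sketched in a few lines of NumPy. The information set used below is a toy example for illustration, not a construction from [1] or [2]:

```python
import numpy as np

def polar_transform(n):
    """G_N = F^{(x)n}: the n-th Kronecker power of F = [[1, 0], [1, 1]] over GF(2)."""
    G = np.array([[1]], dtype=np.uint8)
    F = np.array([[1, 0], [1, 1]], dtype=np.uint8)
    for _ in range(n):
        G = np.kron(G, F) % 2
    return G

def coset_encode(info_bits, info_set, n):
    """Eq. (3): place the K information bits on the positions in A
    (frozen positions stay 0) and multiply by G_N."""
    u = np.zeros(1 << n, dtype=np.uint8)
    u[info_set] = info_bits
    return u @ polar_transform(n) % 2

# Illustrative (N, K) = (8, 4) code; this information set is a toy choice.
x = coset_encode(np.array([1, 0, 1, 1], dtype=np.uint8), [3, 5, 6, 7], 3)
```

A useful property visible here is that $F^{\otimes n}$ is its own inverse over GF(2), so the same circuit encodes and "un-encodes".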
The key to constructing $G_N$-coset codes is to properly determine the information set $\mathcal{A}$. RM codes [2] and polar codes [1] are two well-known examples of $G_N$-coset codes; they determine $\mathcal{A}$ according to Hamming weight and subchannel reliability, respectively.
Recently, a parallel decoding framework for $G_N$-coset codes was proposed in [3]. As shown in Fig. 1(a), the encoding process of $G_N$-coset codes can be described by an $n$-stage encoding graph, in which the earlier stages correspond to outer codes and the later stages to inner codes. The inner codes are independent component codes that can be decoded in parallel [4].
This parallel framework first produces equivalent encoding/decoding graphs by permuting the inner and outer parts of the original decoding graph (see Fig. 1). During decoding, only the inner code parts of these equivalent graphs are processed, leaving the outer codes untouched. The LLRs about the same code bit from different graphs are exchanged iteratively until a consensus is reached. Since all inner codes are decoded in parallel, this decoding framework supports a very high degree of parallelism. Code construction under the parallel decoding algorithm differs from that of polar/RM codes and is studied separately in [3].
I-B. Motivations and Contributions
This paper focuses on further enhancing decoding throughput. The aforementioned iterative LLR exchange requires soft-output component decoders, such as SCL and SCAN, to provide extrinsic LLRs [3]. But for an ultra-high-throughput decoder, implementing soft-output component decoders gives rise to two problems. First, the area efficiency of SCL and SCAN is much lower than that of fast-SC. Second, the interconnections among the large number of component decoders consume considerable chip area.
To alleviate both problems, we propose to adopt hard-output SC as the component decoder. First, the complexity and storage are reduced within each component decoder: SC requires less decoding complexity and storage than single-iteration SCAN, and far less than soft-output SCL with list size 8 [5]. Second, the interconnections among component decoders are also reduced: compared with soft-output decoders, hard-output SC decoders significantly simplify routing because only hard bits are propagated among component decoders.
Besides, we introduce an error detector before each SC decoder to opportunistically reduce computation: if no error is detected, SC decoding is skipped and the hard decisions are output directly.
To minimize the performance loss due to the above simplifications, we propose a genetic-algorithm-based LLR generator. The LLR input for the current iteration is generated from the SC decoding outputs of previous iterations, with a set of damping factors determining the amplitudes. The damping factors have a significant impact on the decoding performance, and are “learned” offline through a genetic algorithm based on unsupervised learning. Compared with “hand-picked” parameters obtained by greedy stepwise optimization, the proposed genetic algorithm exhibits better performance.
II. Stage-permuted parallel decoding
$G_N$-coset codes [3] natively support parallel decoding, as the inner codes are independent. To decode these component codes, various soft-output decoders, e.g., SCL, SC permutation list and SCAN, are employed in [3]. In this work, we propose hard-output SC decoders to achieve higher area efficiency.
II-A. Parallel decoding framework
The parallel decoding framework in [3] is modified to support SC component decoders. In Algorithm 1, a $G_N$-coset code is alternately decoded on two factor graphs $\mathcal{G}_1$ and $\mathcal{G}_2$, as shown in Fig. 1. The stage-permuted graph $\mathcal{G}_2$ is generated by swapping the inner and outer codes of $\mathcal{G}_1$. Only the inner codes of each graph are decoded, and their decoding outputs are exchanged between the two graphs. The component decoders can be implemented in parallel.
Since SC is used to decode the component codes, its hard output must be converted into soft LLRs as input for the next iteration. Therefore, an LLR generator is placed before the SC decoder (line 8). Meanwhile, an error detector is placed before the LLR generator (line 6).
For decoding graph $\mathcal{G}_1$ (resp. $\mathcal{G}_2$), the $j$-th code bit of the $i$-th inner component code is denoted by $c_{i,j}$ (resp. $c'_{i,j}$). Take graph $\mathcal{G}_1$ for example: the hard outputs (HO) from the component decoders of the previous iteration are combined into the input vector of the $i$-th inner code and then sent for error detection (line 6).

If no error is detected, i.e., the error detection output (E) is $0$, the combined hard bits are directly taken as the new “HO” result of this iteration (line 8), and SC decoding is skipped.

Otherwise, if $E = 1$, the LLR of code bit $c_{i,j}$ in the $t$-th iteration, denoted by $L_{i,j}^{(t)}$, is generated from the channel LLR and the previous “HO&E” results (line 10). The generated LLRs are decoded by SC to output new “HO” results (line 11).
Either way, new “HO&E” results are sent to the next iteration.
After $T$ iterations, the algorithm outputs the estimated codeword of the last decoding iteration as the result.
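The iteration flow described above can be sketched as follows. This is a simplification of Algorithm 1, not the authors' listing: the SC decoder, error detector and LLR generator are passed in as callables, the two factor graphs are represented only by their inner-code index sets, and the first-iteration hard decisions are taken directly from the channel LLRs.

```python
import numpy as np

def parallel_decode(llr_ch, graphs, detect, gen_llr, sc_decode, T):
    """Alternate between the two stage-permuted graphs; for each inner
    component code, run the error detector first and only invoke the SC
    decoder (after LLR generation) when an error is detected."""
    n_bits = len(llr_ch)
    ho = (llr_ch < 0).astype(np.uint8)        # hard decisions on channel LLRs
    ho_prev = ho.copy()
    e_prev = np.ones(n_bits, dtype=np.uint8)  # no detection yet: mark unreliable
    for t in range(T):
        new_ho = ho.copy()
        new_e = np.zeros(n_bits, dtype=np.uint8)
        for idx in graphs[t % 2]:             # inner codes of the current graph
            c = ho[idx]                       # combined HO of the previous iteration
            if detect(c) == 0:                # already a codeword: skip SC decoding
                new_ho[idx] = c
            else:                             # recover soft LLRs, then SC-decode
                llr = gen_llr(llr_ch[idx], ho[idx], ho_prev[idx], e_prev[idx], t)
                new_ho[idx] = sc_decode(llr)
                new_e[idx] = 1
        ho_prev, ho, e_prev = ho, new_ho, new_e   # "HO&E" to the next iteration
    return ho
```

All component decoders inside one iteration are independent, so the inner `for` loop is exactly the part that maps to parallel hardware.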
II-B. SC as component decoder
A component decoder consists of three parts: an error detector, an LLR generator and an SC decoder (see Fig. 2). An SC decoder takes soft LLRs as input but generates hard bits as output. This mismatch between hard output and soft input poses a challenge for iterative decoding, since the hard output cannot be directly used as soft input for the next iteration. To solve this problem, an LLR generator is required to recover soft values from the hard output.
An error detector is placed before the LLR generator and serves two purposes. First, if its input vector (i.e., the HO from the previous iteration) is already a codeword (no error detected), LLR generation and SC decoding can be skipped to save computation. Second, it provides a way to estimate the reliability of the hard bits. Heuristically, if the input vector is already a codeword, its bits are deemed more reliable. Otherwise, if the error detection fails, there is a chance that the error cannot be corrected by an SC decoder, which implies lower reliability. Therefore, the error detection results facilitate the “recovery” of soft LLRs for the next iteration.
In practice, an error detector based on a syndrome check can be implemented by reusing the encoding circuit, costing almost no additional hardware resource.
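Reusing the encoder works because $F^{\otimes n}$ is its own inverse over GF(2): re-encoding a hard vector recovers its message, and the vector is a codeword exactly when all frozen positions of that message are zero. A NumPy sketch (the frozen set below is illustrative):

```python
import numpy as np

def gn(n):
    """G_N as the n-th Kronecker power of F = [[1, 0], [1, 1]] over GF(2)."""
    G = np.array([[1]], dtype=np.uint8)
    for _ in range(n):
        G = np.kron(G, np.array([[1, 0], [1, 1]], dtype=np.uint8)) % 2
    return G

def detect_error(x, frozen_set, n):
    """Syndrome check reusing the encoder: since G_N . G_N = I over GF(2),
    u = x . G_N recovers the message; x is a codeword iff every frozen
    u-bit is 0. Returns 1 if an error is detected, 0 otherwise."""
    u = x @ gn(n) % 2
    return int(u[frozen_set].any())
```

In hardware the matrix product is the existing butterfly encoder, and the check reduces to an OR over the frozen outputs.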
The LLR generator is activated when error detection fails. It takes four inputs: (i) the channel LLR, (ii, iii) the hard outputs of the previous two iterations, and (iv) the error detection output of the previous iteration.
Take the non-permuted graph $\mathcal{G}_1$ for example. For code bit $c_{i,j}$, its input LLR $L_{i,j}^{(t)}$ is generated based on the previous-iteration error detection output $E_i^{(t-1)}$.
If $E_i^{(t-1)} = 1$, meaning error detection failed and the hard output came from an SC decoder, the input LLR is the sum of the channel LLR and the damped hard outputs of the previous two iterations:

$L_{i,j}^{(t)} = L_{i,j}^{\mathrm{ch}} + \alpha^{(t)} (1 - 2\hat{c}_{i,j}^{(t-1)}) + \beta^{(t)} (1 - 2\hat{c}_{i,j}^{(t-2)})$, (4)

where $L_{i,j}^{\mathrm{ch}}$ is the channel LLR, $\hat{c}_{i,j}^{(t-1)}$ and $\hat{c}_{i,j}^{(t-2)}$ are the hard outputs for $c_{i,j}$ from the previous two iterations, and $\alpha^{(t)}$ and $\beta^{(t)}$ denote the damping factors, which determine the amplitudes.
If $E_i^{(t-1)} = 0$, meaning the hard output came directly from the error detector since no error was found, the input LLR is the sum of the channel LLR and the damped hard output of the previous iteration:

$L_{i,j}^{(t)} = L_{i,j}^{\mathrm{ch}} + \gamma^{(t)} (1 - 2\hat{c}_{i,j}^{(t-1)})$, (5)

where the damping factor is denoted by $\gamma^{(t)}$.
Finally, the input LLR vector is sent to an SC decoder to output new “HO” results.
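A direct transcription of (4) and (5), with hard bits mapped to signs via $1 - 2c$; the vectorized form below is our sketch, not the authors' implementation:

```python
import numpy as np

def generate_llr(llr_ch, ho1, ho2, e_prev, alpha, beta, gamma):
    """Recover soft input LLRs from hard outputs. ho1/ho2 are the hard
    outputs of the previous one/two iterations; the damping factors
    scale their BPSK-like signs before adding the channel LLR."""
    s1 = 1.0 - 2.0 * ho1
    s2 = 1.0 - 2.0 * ho2
    if e_prev:  # E = 1: previous HO came from an SC decoder, eq. (4)
        return llr_ch + alpha * s1 + beta * s2
    return llr_ch + gamma * s1  # E = 0: previous HO was a codeword, eq. (5)
```

Note that a hard bit $0$ contributes a positive LLR offset and a hard bit $1$ a negative one, matching the usual LLR sign convention.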
III. Genetic algorithm based LLR generator design
The LLR generator, parameterized by the three damping factors, has a significant impact on the overall performance. Unfortunately, a theoretical optimum is difficult to obtain for the following reasons. First, extrinsic information transfer analysis is hard for the proposed component decoder. Second, the output of the component decoder is correlated with its entire input vector due to the loopy decoding graph. Both make conventional density evolution methods inapplicable.
Artificial intelligence provides an alternative when a precise theoretical approach is unavailable. Recently, deep learning, reinforcement learning and genetic algorithms have been applied to design better code constructions [7] and decoding algorithms [8, 9].
Inspired by this, we exploit a genetic algorithm based on unsupervised learning to design the damping factors. Damping factors play a role similar to chromosomes in a genetic algorithm, because they both individually and collaboratively contribute to the fitness of a candidate. A good candidate requires all of its damping factors to be individually good. As such, a pair of good parents is likely to produce a good offspring, which suggests that the genetic algorithm may ultimately converge to a good candidate.
At first, we start the genetic algorithm by initializing a population of $P$ candidates. Each candidate contains $3T$ damping factors $\{\alpha^{(t)}, \beta^{(t)}, \gamma^{(t)}\}$, $t = 1, \ldots, T$, where $T$ denotes the maximum number of decoding iterations. We initialize each candidate as follows.

Without any prior knowledge, the initial damping factors are sampled from a uniform distribution $\mathcal{U}(0, r)$. By adjusting the parameter $r$, we can trade optimality (a larger $r$) for convergence rate (a smaller $r$).
We observe that $\alpha^{(1)}$, $\beta^{(1)}$ and $\gamma^{(1)}$ (used to calculate the LLRs for the first decoding iteration) can be directly set to $0$ without any performance loss, since there is no information from a previous iteration. Similarly, $\beta^{(2)}$ is directly set to $0$.
The population is evaluated through the Monte Carlo method and then ordered by decoding performance. The minimum signal-to-noise ratio required to achieve a target block error rate (SNR@targetBLER) is taken as the performance metric.
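The SNR@targetBLER metric can be evaluated by a simple bisection over SNR, assuming BLER decreases monotonically with SNR. In practice each probe of the curve would be a Monte Carlo simulation of the full decoder; the analytic curve used in the test below is only a stub.

```python
def snr_at_target_bler(bler_at_snr, target, lo=-2.0, hi=8.0, tol=0.05):
    """Minimum SNR (dB) at which bler_at_snr() meets the target BLER.
    bler_at_snr would wrap a Monte Carlo simulation of the decoder;
    bisection assumes BLER is monotonically decreasing in SNR."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bler_at_snr(mid) <= target:
            hi = mid   # target met: a lower SNR may suffice
        else:
            lo = mid
    return hi
```

Because every fitness evaluation is a full simulation campaign, this metric dominates the runtime of the genetic search.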
Then, the algorithm enters a loop consisting of four steps.

Select two distinct parents from the population. The $i$-th best candidate is selected with a (normalized) probability that decreases with its rank $i$ at a rate controlled by the sample focus $s$. In this way, a better candidate is selected with higher probability. By adjusting $s$, we can trade off exploitation (a larger $s$) against exploration (a smaller $s$).
Crossover between parents to produce an offspring. Specifically, each damping factor of the offspring is randomly selected from the corresponding ones of its parents.

Mutate the offspring randomly. This is implemented by independently mutating each damping factor with probability $p_m$. Specifically, if a damping factor is mutated, a random value sampled from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ is added to it. By adjusting $p_m$ and $\sigma^2$, we can trade optimality (larger $p_m$ and $\sigma^2$) for convergence rate (smaller $p_m$ and $\sigma^2$).
Insert the offspring back into the population according to its decoding performance.
The algorithm loop is terminated after reaching a maximum number of iterations.
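The four-step loop can be sketched as below. The geometric rank weight $(1-s)^i$ and the elitist insert-then-drop-worst policy are our assumptions (the paper does not give the exact selection formula); the parents may also coincide here, whereas the paper requires them to be distinct.

```python
import random

def genetic_search(init, fitness, pop_size=32, mut_prob=0.07, mut_sigma=0.3,
                   focus=0.01, iters=150):
    """Rank selection / crossover / mutation / ordered re-insertion loop.
    Lower fitness (SNR@targetBLER) is better; pop is kept sorted."""
    pop = sorted((init() for _ in range(pop_size)), key=fitness)
    weights = [(1.0 - focus) ** i for i in range(pop_size)]  # assumed rank weights

    for _ in range(iters):
        p1, p2 = random.choices(pop, weights=weights, k=2)   # select two parents
        child = [random.choice(g) for g in zip(p1, p2)]      # per-factor crossover
        child = [g + random.gauss(0.0, mut_sigma)            # Gaussian mutation
                 if random.random() < mut_prob else g for g in child]
        pop.append(child)
        pop.sort(key=fitness)   # insert by decoding performance
        pop.pop()               # drop the worst to keep the population size
    return pop[0]
```

Since the worst candidate is discarded each round, the best candidate can never degrade, which matches the monotone learning curves in Fig. 3.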
IV. Performance evaluation
We evaluate the performance gains brought by the proposed LLR generator and the genetic algorithm, respectively. The hyper-parameters of the genetic algorithm are provided in Table I. (Further optimization of the hyper-parameters may bring improved performance; this is outside the scope of this paper.)
Parameter | Value
Population size ($P$) | 32
Initialization range ($r$) | 2
Sample focus ($s$) | 0.01
Mutation probability ($p_m$) | 0.07
Mutation variance ($\sigma^2$) | 0.3
The learning trajectories of the required SNR to achieve the target BLER are presented in Fig. 3. The decoding performance first improves rapidly as the genetic algorithm iterates and then gradually converges.
$t$ | $\alpha^{(t)}$ | $\beta^{(t)}$ | $\gamma^{(t)}$
1 | 0 | 0 | 0
2 | 0.2680 | 0 | 1.9997
3 | 0.4236 | 0.2075 | 0.6695
4 | 0.5051 | 0.2542 | 0.8296
5 | 0.6147 | 0.3574 | 0.7598
6 | 1.2661 | 0.9922 | 0.7647
7 | 0.4054 | 0.2714 | 0.7851
8 | 0.5360 | 0.1566 | 0.8723
Two types of gains can be observed from Fig. 3. First, the proposed LLR generator brings a gain at the target BLER for both converged and non-converged cases: the error detection results facilitate the “recovery” of soft LLRs, leading to the observed gain.
Second, we illustrate the “learning gain” through three points on the learning curve and present their BLER performances in Fig. 4. On the one hand, this demonstrates the effectiveness of the genetic algorithm in designing good damping factors: with the converged damping factors in Table II, the proposed scheme outperforms the best “hand-picked” damping factors (obtained by a greedy stepwise optimization that chooses the best damping factors in every iteration). On the other hand, it confirms that the component decoder with the proposed LLR generator outperforms the one without it.
Next, we compare our scheme with several baselines from the literature.

Type-1: the same code construction decoded by the parallel soft-output decoding algorithm [3]. This scheme exhibits a similar degree of parallelism to the proposed algorithm, but incurs higher implementation complexity due to the difficulty of handling the internal decoder data flow.

Type-2: a polar code with the same length and code rate, evaluated under SC decoding. It enjoys more coding gain but incurs larger decoding latency due to the serial nature of SC decoding.

Type-3: a recently proposed polar coding scheme with a similar terabit/s throughput target [6], which employs an unrolled hardware architecture for high throughput. “Unrolling” is only applicable to relatively short codes and thus sacrifices coding gain.
The evaluation results are presented in Fig. 5. Compared with the Type-1 and Type-2 baselines, the proposed decoder trades only a small SNR loss at the target BLER for improved area efficiency and reduced decoding latency. Compared with the Type-3 baseline, the proposed scheme exhibits an SNR gain at the target BLER.
Then, we evaluate the complexity reduction due to skipped SC decoding, measured by the number of activated SC decoders. The results are presented in Fig. 6. The complexity reduction ratio varies with SNR: at higher SNR (lower BLER), more component codewords pass the error detection and bypass SC decoding, so more complexity is saved.
Finally, the area efficiency of the proposed decoder is presented in Table III (see details in our ASIC implementation [10]). With the TSMC 16nm process, the area efficiency is 75.46 Gbps/mm² when the maximum number of iterations is eight. The equivalent area efficiency under 7nm technology is about 322 Gbps/mm² with eight iterations and 533 Gbps/mm² with five iterations.
Info size | Iterations | Latency (ns) | Area eff. (Gbps/mm²) | 10nm equiv. | 7nm equiv.
 | 5 | 109.25 | 120.73 | 277.69 | 533.16
 | 6 | 131.1 | 100.61 | 231.41 | 444.30
 | 7 | 152.95 | 86.24 | 198.35 | 380.83
 | 8 | 174.8 | 75.46 | 173.55 | 322.22
V. Conclusions
In this work, we propose a low-complexity parallel decoding algorithm for $G_N$-coset codes. The framework exploits two equivalent decoding graphs; for each graph, the inner component codes are independent and support parallel decoding. The component decoder adopts a novel design comprising an error detector, an LLR generator and an SC decoder. The LLR generator, parameterized by a set of damping factors, is “learned” offline by genetic-algorithm-based unsupervised learning. The proposed decoding algorithm achieves performance comparable to soft-output component decoders and conventional polar codes, but requires much lower decoding and hardware implementation complexity.
References
 [1] E. Arıkan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051-3073, July 2009.
 [2] S. Kudekar, S. Kumar, M. Mondelli, H. D. Pfister, E. Sasoglu and R. L. Urbanke, “Reed-Muller codes achieve capacity on erasure channels,” IEEE Transactions on Information Theory, vol. 63, no. 7, pp. 4298-4316, July 2017.
 [3] X. Wang, H. Zhang, R. Li, J. Tong, Y. Ge, and J. Wang, “On the construction of $G_N$-coset codes for parallel decoding,” accepted by IEEE Wireless Communications and Networking Conference, 2020 (available: https://arxiv.org/abs/1904.13182).
 [4] H. Zhang et al., “A flip-syndrome-list polar decoder architecture for ultra-low-latency communications,” IEEE Access, vol. 7, pp. 1149-1159, 2018.
 [5] X. Liu et al., “A 5.16Gbps decoder ASIC for polar code in 16nm FinFET,” 2018 15th International Symposium on Wireless Communication Systems (ISWCS), Lisbon, 2018, pp. 1-5.
 [6] A. Süral, E. G. Sezer, Y. Ertuğrul, O. Arıkan and E. Arıkan, “Terabits-per-second throughput for polar codes,” 2019 IEEE 30th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC Workshops), Istanbul, Turkey, 2019, pp. 1-7.
 [7] L. Huang, H. Zhang, R. Li, Y. Ge and J. Wang, “AI coding: Learning to construct error correction codes,” IEEE Transactions on Communications, vol. 68, no. 1, pp. 26-39, Jan. 2020.
 [8] X. Wang et al., “Learning to flip successive cancellation decoding of polar codes with LSTM networks,” 2019 IEEE 30th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Istanbul, Turkey, 2019, pp. 1-5.
 [9] F. Carpi, C. Häger, M. Martalò, R. Raheli, and H. D. Pfister, “Reinforcement learning for channel coding: Learned bit-flipping decoding,” 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2019, pp. 922-929.
 [10] J. Tong, X. Wang, Q. Zhang, H. Zhang, S. Dai, R. Li, and J. Wang, “Toward terabits-per-second communications: A hardware implementation of high-throughput $G_N$-coset codes,” available: https://arxiv.org/, 2020.