Toward Terabits-per-second Communications: Low-Complexity Parallel Decoding of G_N-Coset Codes

04/21/2020 ∙ by Xianbin Wang, et al. ∙ HUAWEI Technologies Co., Ltd.

Recently, a parallel decoding framework of G_N-coset codes was proposed. High throughput is achieved by decoding the independent component polar codes in parallel. Various algorithms can be employed to decode these component codes, enabling a flexible throughput-performance tradeoff. In this work, we adopt SC as the component decoders to achieve the highest-throughput end of the tradeoff. The benefits over soft-output component decoders are reduced complexity and simpler (binary) interconnections among component decoders. To reduce performance degradation, we integrate an error detector and a log-likelihood ratio (LLR) generator into each component decoder. The LLR generator, specifically the damping factors therein, is designed by a genetic algorithm. This low-complexity design can achieve an area efficiency of 533Gbps/mm^2 under 7nm technology.




I Introduction

I-A G_N-coset codes

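G_N-coset codes, defined by Arıkan in [1], are a class of linear block codes with the generator matrix G_N.

G_N is an N×N binary matrix defined as

$$G_N = F^{\otimes n}, \qquad F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \tag{1}$$

in which N = 2^n and $F^{\otimes n}$ denotes the n-th Kronecker power of F.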

The encoding process is

$$x = u G_N, \tag{2}$$

where x and u denote the code bit sequence and the information bit sequence respectively.

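An (N, K) G_N-coset code [1] is defined by an information set $\mathcal{A} \subset \{0, 1, \dots, N-1\}$, $|\mathcal{A}| = K$. Its generator matrix $G_N(\mathcal{A})$ is composed of the rows indexed by $\mathcal{A}$ in G_N. Thus (2) is rewritten as

$$x = u(\mathcal{A})\, G_N(\mathcal{A}), \tag{3}$$

where $u(\mathcal{A}) = \{u_i \mid i \in \mathcal{A}\}$.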
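As a minimal sketch of the definitions above, the following pure-Python code builds the n-th Kronecker power of the 2×2 kernel F = [[1, 0], [1, 1]] and encodes a short information sequence. The function names, the 0-based indexing, and the choice of fixing all non-information bits to zero are assumptions made for illustration only.

```python
def kronecker_power(n):
    """Return the n-th Kronecker power of F = [[1, 0], [1, 1]] as a list of rows."""
    g = [[1]]
    for _ in range(n):
        size = len(g)
        # F (x) g has the block form [[g, 0], [g, g]].
        top = [row + [0] * size for row in g]
        bottom = [row + row for row in g]
        g = top + bottom
    return g


def encode(u, g):
    """Compute x = u * G over GF(2)."""
    n = len(g)
    return [sum(u[i] & g[i][j] for i in range(n)) % 2 for j in range(n)]


def coset_encode(info_bits, info_set, n):
    """Encode with the rows indexed by info_set (assumption: other bits are 0)."""
    g = kronecker_power(n)
    u = [0] * len(g)
    for bit, idx in zip(info_bits, sorted(info_set)):
        u[idx] = bit
    return encode(u, g)


if __name__ == "__main__":
    for row in kronecker_power(2):
        print(row)
    print(coset_encode([1, 1], {1, 3}, 2))
```

With N = 4 and information set {1, 3}, the codeword is the mod-2 sum of rows 1 and 3 of the generator matrix.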

The key to constructing G_N-coset codes is to properly determine the information set. RM codes [2] and polar codes [1] are two well-known examples of G_N-coset codes. They determine the information set according to Hamming weight and sub-channel reliability, respectively.
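To illustrate the Hamming-weight rule, the sketch below selects an RM information set from the row weights of the generator matrix. It relies on the fact that row i of the n-th Kronecker power of F has weight 2^popcount(i). The function names and the (n, r) parameterization are illustrative assumptions; the reliability-based polar rule is channel dependent and is not shown.

```python
def row_weight(i):
    """Hamming weight of row i of the Kronecker-power generator matrix."""
    return 1 << bin(i).count("1")


def rm_information_set(n, r):
    """Row indices kept by an order-r RM code of length 2**n.

    A row is kept when its weight is at least 2**(n - r).
    """
    threshold = 1 << (n - r)
    return [i for i in range(1 << n) if row_weight(i) >= threshold]


if __name__ == "__main__":
    print(rm_information_set(3, 1))
```

For n = 3 and r = 1 this keeps the four rows of weight at least 4, which gives a length-8, dimension-4 code.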

Fig. 1: For G_N-coset codes, equivalent encoding graphs may be obtained based on stage permutations: (a) Arıkan’s original encoding graph [1] and (b) stage-permuted encoding graph. Each node adds (mod-2) the signals on all incoming edges from the left and sends the result out on all edges to the right. (c) The parallel decoding framework only processes the inner code parts of the equivalent graphs, leaving the outer codes unprocessed.

Recently, a parallel decoding framework of G_N-coset codes was proposed in [3]. As shown in Fig. 1(a), the encoding process of G_N-coset codes can be described by an n-stage encoding graph. The former and latter stages respectively correspond to outer and inner codes. The inner codes are independent component codes that can be decoded in parallel [4].

This parallel framework first produces equivalent encoding/decoding graphs by permuting the inner and outer parts of the original decoding graph (see Fig. 1). During decoding, we only process the inner code parts of these equivalent graphs, leaving the outer codes unprocessed. The LLRs from different graphs about the same code bit are exchanged iteratively to reach a consensus. Since all inner codes are decoded in parallel, this decoding framework supports a very high degree of parallelism. The code construction under the parallel decoding algorithm is different from polar/RM codes, and is studied separately in [3].

I-B Motivations and Contributions

This paper mainly focuses on further enhancing decoding throughput. The aforementioned iterative LLR exchange procedure requires soft-output component decoders such as SCL and SCAN to provide extrinsic LLRs [3]. But if we aim at an ultra-high-throughput decoder, implementing soft-output component decoders gives rise to two problems. First, the area efficiency of SCL and SCAN is much lower than that of fast-SC. Second, the interconnections among the large number of component decoders consume considerable chip area.

To alleviate both problems, we propose to adopt hard-output SC as the component decoder. First, the complexity and storage are reduced within each component decoder. Compared with SCAN with one iteration, SC has decoding complexity and storage. Compared with soft-output SCL with list size 8, SC has decoding complexity and storage [5]. Second, the interconnections among component decoders are also reduced. Compared with soft-output decoders, hard-output SC decoders significantly simplify routing because only hard bits are propagated among component decoders.

Besides, we introduce an error detector before each SC decoder to opportunistically reduce computation. If no error is detected, we skip the SC decoding and directly output the hard decisions.

To minimize the performance loss due to the above simplifications, we propose a genetic-algorithm-based LLR generator. The LLR input (for this iteration) is generated from the SC decoding output (from previous iterations) via a set of damping factors that determine the amplitudes. The damping factors have a significant impact on the decoding performance, and are “learned” offline through a genetic algorithm based on unsupervised learning. Compared with “hand-picked” parameters based on greedy stepwise optimization, the damping factors found by the proposed genetic algorithm yield better performance.

II Stage permuted parallel decoding

G_N-coset codes [3] natively support parallel decoding, as the inner codes are independent. To decode these component codes, various soft-output decoders, e.g., SCL, SC permutation list and SCAN, are employed [3]. In this work, we propose hard-output SC decoders to achieve higher area efficiency.

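Input:    The received signal ;
Output:    The recovered codeword: ;
1:  Initialize ; ; ;
2:  for each decoding iteration do
3:     Select decoding graph (original or stage-permuted);
4:     if the original graph is selected then
5:        for inner component codes of the original graph (in parallel) do
6:           Run error detection on the “HO” vector from the previous iteration to obtain E;
7:           if E = 0 then
8:              Take the previous “HO” vector directly as the new “HO” result;
9:           else
10:              Generate LLRs from the channel LLR and the previous “HO&E” results;
11:              Decode the generated LLRs by SC to output the new “HO” result;
12:           end if
13:        end for
14:     else
15:        for inner component codes of the stage-permuted graph (in parallel) do
16:           Run error detection on the “HO” vector from the previous iteration to obtain E;
17:           if E = 0 then
18:              Take the previous “HO” vector directly as the new “HO” result;
19:           else
20:              Generate LLRs from the channel LLR and the previous “HO&E” results;
21:              Decode the generated LLRs by SC to output the new “HO” result;
22:           end if
23:        end for
24:     end if
25:  end for
Algorithm 1 Parallel decoding framework.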

II-A Parallel decoding framework

The parallel decoding framework in [3] is modified to support SC component decoders. In Algorithm 1, a G_N-coset code is alternately decoded on two factor graphs, the original graph and its stage-permuted counterpart, as shown in Fig. 1. The stage-permuted graph is generated by swapping the inner codes and outer codes of the original graph. Only the inner codes of each graph are decoded, and their decoding outputs are exchanged between the two graphs. The component decoders can be implemented in parallel.

Now that we use SC to decode the component codes, the hard output must be converted into soft LLRs as input for the next iteration. Therefore, an LLR generator is placed before the SC decoder (line 10). Meanwhile, an error detector is placed before the LLR generator (line 6).

For each decoding graph, a code bit is indexed by its position within an inner component code and by the index of that component code. Taking one graph as an example, the hard outputs (HO) produced by the different component decoders in the previous iteration are combined into the input vector of each component decoder and then sent for error detection (line 6).

  • If no error is detected, i.e., the error detection output (E) is 0, then the combined hard outputs are directly taken as the new “HO” result of this iteration (line 8), and SC decoding is skipped.

  • Otherwise, if E = 1, the LLR of each code bit in the current iteration is generated from the channel LLR and previous “HO&E” results (line 10). The generated LLRs are decoded by SC to output new “HO” results (line 11).

Either way, new “HO&E” results are sent to the next iteration.


After the maximum number of iterations, the algorithm outputs the estimated codeword of the last decoding iteration as the result.

II-B SC as component decoder

Fig. 2: A component decoder consists of an error detector, an LLR generator and an SC decoder. The “E” result of this iteration is taken to switch the MUX. If no error is detected (E=0), the “HO” result from the previous iteration is directly taken as the new “HO” result of this iteration. Otherwise, LLRs are generated for SC decoding to output new “HO” result.

A component decoder consists of three parts: an error detector, an LLR generator and an SC decoder (see Fig. 2). An SC decoder takes soft LLRs as input but generates hard bits as output. The mismatch between hard output and soft input poses a challenge for iterative decoding, as the hard output cannot be directly used as soft input for the next iteration. To solve this problem, an LLR generator is required to generate soft values from the hard output.
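The soft-in, hard-out behaviour of SC decoding can be sketched with a short recursive decoder. This is a simplified illustration and not the hardware architecture of the paper: it assumes the natural-order Kronecker-power generator matrix, frozen bits fixed to 0, positive LLRs favouring bit 0, and the min-sum approximation for the check-node update. All names are hypothetical.

```python
import math


def f_min_sum(a, b):
    """Check-node update using the min-sum approximation."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def sc_decode(llr, frozen):
    """Return (u_hat, x_hat): hard decisions on source bits and code bits.

    llr[j] > 0 favours code bit 0; frozen[i] is True when u[i] is fixed to 0.
    """
    n = len(llr)
    if n == 1:
        u = 0 if (frozen[0] or llr[0] >= 0) else 1
        return [u], [u]
    half = n // 2
    left, right = llr[:half], llr[half:]
    # The first half of u is carried by (x_left XOR x_right).
    u1, x1 = sc_decode([f_min_sum(a, b) for a, b in zip(left, right)],
                       frozen[:half])
    # Given x1, the left half gives a second observation of x_right.
    g = [b + (1 - 2 * s) * a for a, b, s in zip(left, right, x1)]
    u2, x2 = sc_decode(g, frozen[half:])
    return u1 + u2, [p ^ q for p, q in zip(x1, x2)] + x2


if __name__ == "__main__":
    frozen = [True, False, True, False]
    # One weakly wrong LLR in the last position; the codeword is [0, 0, 1, 1].
    print(sc_decode([2.0, 2.0, -2.0, 0.5], frozen))
```

Only the hard vector x_hat leaves the decoder, which is why a separate LLR generator is needed before the next iteration.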

An error detector is placed before the LLR generator and it serves two purposes. First, if its input vector (i.e., HO from the previous iteration) is a codeword (no error detected), LLR generation and SC decoding can be skipped to save computation. Second, it provides a way to estimate the reliability of hard bits. Heuristically, if an input vector is already a codeword, its bits are deemed more reliable. Otherwise, if error detection fails, there is a chance that the error cannot be corrected by an SC decoder, which implies lower reliability. Therefore, error detection results facilitate the “recovery” of soft LLRs for the next iteration.

In practice, an error detector based on syndrome check can be implemented by reusing the encoding circuit. It costs almost no additional hardware resource.
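One way to see why the encoder can be reused: over GF(2) the Kronecker-power matrix is its own inverse, so applying the encoding transform to a hard vector recovers the source bits, and the vector is a codeword exactly when all frozen positions come out as zero. The sketch below assumes frozen bits fixed to 0 and uses hypothetical function names.

```python
def polar_transform(bits):
    """Butterfly network computing bits * F^{(x)n} over GF(2), natural order."""
    x = list(bits)
    n = len(x)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x


def error_detect(hard_bits, frozen):
    """Return E = 0 if hard_bits is a codeword, else E = 1.

    Assumption: frozen source bits are fixed to 0.
    """
    u = polar_transform(hard_bits)  # the transform is an involution over GF(2)
    return 1 if any(bit for bit, is_frozen in zip(u, frozen) if is_frozen) else 0


if __name__ == "__main__":
    frozen = [True, False, True, False]
    print(error_detect([0, 0, 1, 1], frozen))  # a valid codeword
    print(error_detect([1, 0, 1, 1], frozen))  # first bit flipped
```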

The LLR generator is activated when error detection fails. It takes four inputs: (i) the channel LLR, (ii, iii) the hard outputs of the previous two iterations, and (iv) the error detection output of the previous iteration.

Take the non-permuted graph for example. For each code bit, the input LLR is generated based on the previous-iteration error detection output.

If the previous error detection output is 1, meaning error detection failed and the hard output is from an SC decoder, the input LLR is the sum of the channel LLR and the hard outputs from the previous two iterations, each weighted by its own damping factor. These two damping factors determine the amplitudes.

If the previous error detection output is 0, meaning the hard output comes directly from an error detector since no error was found, the input LLR is the sum of the channel LLR and the hard output from the previous iteration, weighted by a third damping factor.

Finally, the input LLR vector is sent to an SC decoder to output new “HO” results.
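The two LLR generation rules can be sketched as follows. The symbol names alpha, beta and gamma for the three damping factors, and the mapping of a hard bit b to the sign 1 − 2b before scaling, are assumptions made for this illustration; the text only states that the LLR is a damped sum of the channel LLR and previous hard outputs.

```python
def generate_llr(ch_llr, ho_prev, ho_prev2, e_prev, alpha, beta, gamma):
    """Generate SC input LLRs from channel LLRs and previous hard outputs.

    ch_llr   : channel LLRs, one per code bit
    ho_prev  : hard outputs of the previous iteration (0/1)
    ho_prev2 : hard outputs of the iteration before that (0/1)
    e_prev   : per-bit error detection output of the previous iteration (0/1)
    alpha, beta, gamma : damping factors (names are assumptions)
    """
    out = []
    for llr, b1, b2, e in zip(ch_llr, ho_prev, ho_prev2, e_prev):
        s1 = 1 - 2 * b1  # assumed mapping: bit 0 -> +1, bit 1 -> -1
        s2 = 1 - 2 * b2
        if e == 1:
            # Previous hard output came from an SC decoder: use two iterations.
            out.append(llr + alpha * s1 + beta * s2)
        else:
            # Previous hard output passed error detection: use one iteration.
            out.append(llr + gamma * s1)
    return out


if __name__ == "__main__":
    print(generate_llr([1.0, -0.5], [0, 1], [1, 1], [1, 0],
                       alpha=0.5, beta=0.25, gamma=1.0))
```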

III Genetic algorithm based LLR generator design

The LLR generator, parameterized by the three damping factors, has a significant impact on the overall performance. Unfortunately, a theoretical optimum is difficult to obtain for the following reasons. First, the extrinsic information transfer analysis is hard with the proposed component decoder. Second, the output of the component decoder is correlated with all its input vectors due to the loopy decoding graph. Both make conventional density evolution methods inapplicable.

Artificial intelligence provides an alternative method in the case where a precise theoretical approach is unavailable. Recently, deep learning, reinforcement learning and genetic algorithms have been applied to design better code constructions [7] and decoding algorithms [8, 9].

Inspired by this, we exploit a genetic algorithm based on unsupervised learning to design the damping factors. Damping factors play a role similar to that of chromosomes in the genetic algorithm, because they both individually and collaboratively contribute to the fitness of a candidate. A good candidate requires that all its damping factors are individually good. As such, a pair of good parents is likely to produce a good offspring, and this suggests that the genetic algorithm may ultimately converge to a good candidate.

At first, we start the genetic algorithm by initializing a population of fixed size (see Table I). Each candidate contains three damping factors for every decoding iteration, up to the maximum number of decoding iterations. We initialize each candidate as follows.

  • Without any given prior knowledge, the initial damping factors are sampled from a uniform distribution. By adjusting the parameter of this distribution, we can trade optimality (a larger value) for convergence rate (a smaller value).

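We observe that the three damping factors used to calculate the LLRs for the first decoding iteration can be directly set to 0 without any performance loss, since there is no information from a previous iteration. Similarly, the damping factor applied to the hard output from two iterations earlier is directly set to 0 for the second iteration.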

The population is evaluated through the Monte Carlo method and then ordered by decoding performance. The minimum signal-to-noise ratio to achieve a target block error rate (SNR@targetBLER) is taken as the performance metric.
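One possible way to read the SNR@targetBLER metric off a set of simulated points is to interpolate between the two measurements that bracket the target. The sketch below does this linearly in the log-BLER domain. The helper name, the input format and the interpolation rule are assumptions; the paper does not specify how the metric is extracted from the Monte Carlo results.

```python
import math


def snr_at_target_bler(points, target):
    """Estimate the minimum SNR (dB) at which BLER reaches the target.

    points : list of (snr_db, bler) pairs sorted by increasing SNR,
             with strictly positive BLER values
    target : target block error rate
    Returns None when the measured points never cross the target.
    """
    for (s0, b0), (s1, b1) in zip(points, points[1:]):
        if b0 >= target >= b1 and b0 > b1:
            # Linear interpolation of log10(BLER) versus SNR.
            t = (math.log10(b0) - math.log10(target)) / (
                math.log10(b0) - math.log10(b1))
            return s0 + t * (s1 - s0)
    return None


if __name__ == "__main__":
    measured = [(1.0, 1e-1), (2.0, 1e-3)]
    print(snr_at_target_bler(measured, 1e-2))
```

A lower returned value means a better candidate, so the population can be sorted in ascending order of this metric.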

Then, the algorithm enters a loop consisting of four steps.

  1. Select two distinct parents from the population. Each candidate is selected with a (normalized) probability determined by its rank in the ordered population and by a parameter called the sample focus. In this way, a better candidate will be selected with a higher probability. By adjusting the sample focus, we can trade off between exploitation (a larger value) and exploration (a smaller value).

  2. Crossover between parents to produce an offspring. Specifically, each damping factor of the offspring is randomly selected from the corresponding ones of its parents.

  3. Mutate the offspring randomly. This is implemented by independently mutating each damping factor with a fixed mutation probability. Specifically, if one damping factor is mutated, a random value sampled from a Gaussian distribution is added to it. By adjusting the mutation probability and the mutation variance, we can trade off optimality (larger values) and convergence rate (smaller values).

  4. Insert the offspring back to the population according to the decoding performance.

The algorithm loop is terminated after reaching a maximum number of iterations.
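The four steps above can be sketched as a short loop. Several details here are assumptions rather than the paper's specification: the rank-based selection weight exp(−alpha·i), the uniform initialization range, the Gaussian standard deviation, the rule of dropping the worst candidate after insertion, and the toy cost function that stands in for the Monte Carlo SNR@targetBLER evaluation. The population size, sample focus and mutation probability defaults follow Table I.

```python
import math
import random


def toy_cost(candidate):
    """Stand-in for SNR@targetBLER (lower is better); purely illustrative."""
    return sum((x - 0.5) ** 2 for x in candidate)


def genetic_search(cost, num_factors, pop_size=32, alpha=0.01, p_mut=0.07,
                   sigma=0.1, init_range=1.0, iterations=200, seed=0):
    rng = random.Random(seed)
    # Initialization: uniform samples, evaluated and sorted best-first.
    pop = [[rng.uniform(0.0, init_range) for _ in range(num_factors)]
           for _ in range(pop_size)]
    pop = sorted(((cost(c), c) for c in pop), key=lambda t: t[0])
    # Assumed selection weights: better rank (smaller i) gets a larger weight.
    weights = [math.exp(-alpha * i) for i in range(pop_size)]
    for _ in range(iterations):
        # 1. Select two distinct parents.
        i = rng.choices(range(pop_size), weights=weights)[0]
        j = i
        while j == i:
            j = rng.choices(range(pop_size), weights=weights)[0]
        # 2. Crossover: each factor is copied from one parent at random.
        child = [rng.choice(pair) for pair in zip(pop[i][1], pop[j][1])]
        # 3. Mutation: add Gaussian noise to each factor with probability p_mut.
        child = [x + rng.gauss(0.0, sigma) if rng.random() < p_mut else x
                 for x in child]
        # 4. Insert by performance; drop the worst to keep the size fixed.
        pop.append((cost(child), child))
        pop.sort(key=lambda t: t[0])
        pop.pop()
    return pop


if __name__ == "__main__":
    best_cost, best = genetic_search(toy_cost, num_factors=3 * 8)[0]
    print(best_cost)
```

Because the best candidate is never removed, the best cost in the population cannot get worse from one iteration to the next.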

IV Performance evaluation

We evaluate the performance gain brought by the proposed LLR generator and the genetic algorithm, respectively. The hyperparameters of the genetic algorithm are provided in Table I. (Further optimization of the hyperparameters may bring improved performance; this is outside the scope of this paper.)

Parameters | Value
Population size | 32
Sample focus | 0.01
Mutate probability | 0.07
Mutate variance |
TABLE I: Hyper Parameters of genetic algorithm

The learning trajectories of the required SNR to achieve BLER= are presented in Fig. 3. It shows that the decoding performance first improves rapidly as the genetic algorithm iterates and then converges gradually.

Fig. 3: The learning trajectories of the SNR@BLER. After about 600 iterations, the genetic algorithm “learned” better damping factors than the “hand-picked” ones based on greedy stepwise optimization. After about 3000 iterations, the algorithms in both cases converge.
0 0 0
0.2680 0 1.9997
0.4236 0.2075 0.6695
0.5051 0.2542 0.8296
0.6147 0.3574 0.7598
1.2661 0.9922 0.7647
0.4054 0.2714 0.7851
0.5360 0.1566 0.8723
TABLE II: The damping factors designed by genetic algorithm

Two types of gains can be observed from Fig. 3. First, the gain brought by the proposed LLR generator is dB at BLER, for both converged and non-converged cases. The error detection results facilitate the “recovery” of soft LLRs in the proposed LLR generator, leading to the observed gain.

Second, we exemplify the “learning gain” through three points on the learning curve and present their BLER performances in Fig. 4. On the one hand, this demonstrates the effectiveness of the genetic algorithm in designing good damping factors. With the converged damping factors in Table II, the proposed scheme is dB better than the best “hand-picked” damping factors (the “hand-picked” method is a greedy stepwise optimization that chooses the best damping factors in every iteration). On the other hand, it confirms that the component decoder with the proposed LLR generator exhibits better decoding performance than the case without it.

Fig. 4: With converged damping factors, the gains brought by the proposed LLR generator and the genetic algorithm are dB and dB at BLER.

Next, we compare our scheme with some baselines from the literature.

  1. The same code construction decoded by the parallel soft output decoding algorithm [3]. This scheme exhibits a similar degree of parallelism to the proposed decoding algorithm, but incurs higher implementation complexity due to the difficulty in handling the internal decoder data flow.

  2. A polar code with the same length and code rate, evaluated under SC decoding. It enjoys more coding gain but incurs larger decoding latency due to the serial nature of SC decoding.

  3. A recently proposed polar coding scheme with similar target for terabit/s throughput [6], which employs an unrolled hardware architecture for high throughput. “Unrolling” is only applicable for relatively short codes (e.g., ) and thus sacrifices coding gain.

The evaluation results are presented in Fig. 5. Compared with Type-1 and Type-2 baselines, the proposed decoder only trades dBdB loss at BLER for improved area efficiency and reduced decoding latency. Compared with Type-3 baseline, the proposed scheme exhibits dB gain at BLER.

Fig. 5: Compared with Type-1 and Type-2 baselines, the proposed decoder only trades dBdB loss at BLER for improved area efficiency and reduced decoding latency. Compared with Type-3 baseline, the proposed scheme exhibits dB gain at BLER. The polar codes are constructed by Gaussian approximation at EsN0dB, dB, dB and dB for code rates , , and , respectively.

Then, we evaluate the complexity reduction due to skipped SC decoding. The number of activated SC decoders is measured to evaluate the complexity. The results are presented in Fig. 6. It shows that the complexity reduction ratio varies with SNR. For the case with higher SNR (lower BLER), more complexity is reduced. At BLER=, bypassing SC decoding can reduce decoding complexity.

Fig. 6: At BLER=, bypassing SC decoding can reduce decoding complexity.

At last, the area efficiency of the proposed decoder is presented in Table III (see details in our ASIC implementation [10]). With TSMC 16nm process, the area efficiency for code rate is when the maximum number of iterations is eight. The equivalent throughput under 7nm technology is about with eight iterations and with five iterations.

Info size | Iteration | Latency (ns) | Area Eff (Gbps/mm^2) | Convert to 10nm | Convert to 7nm
 | 5 | 109.25 | 120.73 | 277.69 | 533.16
 | 6 | 131.1 | 100.61 | 231.41 | 444.30
 | 7 | 152.95 | 86.24 | 198.35 | 380.83
 | 8 | 174.8 | 75.46 | 173.55 | 322.22
TABLE III: Decoder Area Efficiency

V Conclusions

In this work, we propose a low-complexity parallel decoding algorithm of G_N-coset codes. The framework exploits two equivalent decoding graphs. For each graph, the inner component codes are independent and support parallel decoding. The component decoder adopts a novel design comprising an error detector, an LLR generator and an SC decoder. The LLR generator, parameterized by a set of damping factors, is “learned” offline by a genetic algorithm based on unsupervised learning. The proposed decoding algorithm achieves performance comparable to that of soft-output component decoders and conventional polar codes, but requires much lower decoding and hardware implementation complexity.