Prefix-free codes are generally used in source coding to transform source letters of an arbitrary probability distribution to code letters of a uniform probability distribution, thereby maximizing entropy per code letter. In this work, we use prefix-free codes for the reverse purpose, i.e., for an operation called distribution matching (DM) applied to source letters with equal probabilities, and construct prefix-free codes such that the code letters have the minimum average energy among all possible prefix-free codes of the same size for the desired entropy per code letter. We also propose a framing method with which variable-length prefix-free codewords can always be contained in a fixed-length frame, thereby facilitating the application of prefix-free codes to communications, as will be discussed below.
A prominent application of prefix-free code distribution matching (PCDM) is probabilistic constellation shaping (PCS) for capacity-approaching communications. PCS makes low-energy code letters appear with a higher probability than high-energy code letters, thereby reducing average transmit energy for the same information rate (IR) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
A long-standing challenge of PCS has been the incorporation of channel coding and DM. If PCDM is embedded between channel encoding and decoding (see Fig. 1 (a)), even a single error from the channel causes insertion or deletion of source letters in prefix-free decoding, leading to synchronization errors or catastrophic propagation of errors, thereby rendering the outer channel coding useless. In a reversed architecture where channel coding is embedded between prefix-free encoding and decoding (see Fig. 1 (b)), a symbol distribution formed by prefix-free encoding is not preserved by the subsequent channel encoding, since (linear) channel encoding generates almost equiprobable code symbols from source symbols of any distribution. There have been approaches to jointly optimizing channel coding and DM [6, 9, 10, 11], but they lack rate adaptability and involve the design of a customized channel coding scheme. Recently, however, an architecture called probabilistic amplitude shaping (PAS) solved the problem. In the PAS architecture, source bits are first encoded by a prefix-free code to produce amplitudes of transmit symbols with a desired probability distribution, as shown in Fig. 1 (c); a subsequent systematic channel encoder then generates parity bits that constitute the signs of the transmit symbols. In this manner, the amplitude distribution remains unchanged by the equiprobable parity bits, and error propagation and synchronization errors do not occur, since channel decoding corrects errors before prefix-free decoding. PAS enables separate design of channel coding and DM, hence allows for the use of off-the-shelf channel codes in conjunction with independently optimized prefix-free codes. In this paper, therefore, the performance of the proposed PCDM in communication systems will be numerically evaluated using the PAS architecture, in comparison with conventional communication schemes where transmit symbols are uniformly distributed.
II Prior Works
Let $\mathcal{X}$ be an $M$-ary code alphabet. Also, let $\mathcal{D}$ and $\mathcal{C}$, respectively, denote a set of sourcewords (called a dictionary) that are concatenations of an indefinite number of source letters from the binary alphabet $\{0, 1\}$ and a set of codewords (called a codebook) that are concatenations of an indefinite number of code letters from alphabet $\mathcal{X}$. A code defines a bijective function $f: \mathcal{D} \to \mathcal{C}$ that maps every sourceword in $\mathcal{D}$ to exactly one codeword in $\mathcal{C}$ such that the code is uniquely decodable. If a codeword $c$ is an ordered concatenation of two non-empty words $u$ and $v$, then $u$ is called a prefix of the word $c$. A code is a prefix-free code if no codeword is a prefix of another codeword. A prefix-free code is an instantaneously decodable code, whose decoding can be performed as soon as a codeword is found on successive receipt of the code letters. The size of a code is defined by the number of all codewords in the codebook, hence the number of all sourcewords in the dictionary.
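The prefix-free property defined above can be checked mechanically. The following is a minimal sketch; the function name `is_prefix_free` and the example codebooks are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Sequence


def is_prefix_free(codebook: Iterable[Sequence]) -> bool:
    """Return True if no codeword is a prefix of (or equal to) another codeword."""
    words = sorted(tuple(w) for w in codebook)
    # After lexicographic sorting, a word that is a prefix of any other word
    # is also a prefix of its immediate successor, so adjacent checks suffice.
    for a, b in zip(words, words[1:]):
        if b[:len(a)] == a:
            return False
    return True


# Hypothetical binary codebooks, written as strings of code letters.
print(is_prefix_free(["0", "10", "110", "111"]))  # prefix-free
print(is_prefix_free(["0", "01", "11"]))          # "0" is a prefix of "01"
```

Sorting first keeps the check at O(N log N) comparisons instead of testing all pairs of codewords.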
In a special case where all code letters have equal transmit energies, Huffman codes represent optimal prefix-free codes for a known sourceword distribution, optimal in the sense that no other codes can produce a shorter expected codeword length. In a more general case where code letter energies are not necessarily equal, the prefix-free code construction problem has been addressed in the context of PCS using Lempel-Even-Cohn (LEC) coding and Varn coding. LEC coding solves the problem when all codewords have an equal length. LEC coding is isomorphic to Huffman coding if we translate the transmit energy $e_i$ of the $i$-th codeword into a sourceword probability as $p_i = z^{-e_i}$, with $z$ being a root of $\sum_i z^{-e_i} = 1$ to ensure that the PMF sums to 1. A Huffman codebook therefore becomes the LEC dictionary for parsing source bits, mapped to an LEC codebook consisting of equal-length codewords, completing a variable-to-fixed length (V2F) code. However, LEC coding minimizes the energy per source letter, and the entropy rate is determined by the a priori given Huffman codebook, while we want to minimize the energy per code letter for an arbitrary entropy rate. Varn coding provides another solution when sourcewords have an equal length, hence constructing fixed-to-variable length (F2V) codes. However, Varn coding requires a non-uniform PMF for the equal-length sourcewords, which must be given a priori to realize a particular entropy rate determined by the PMF. Another approach to constructing F2V codes is to minimize the information divergence between the desired and realized PMFs for a binary code alphabet, which also assumes a priori knowledge of the optimal PMF for the desired entropy rate.
In a general case where both sourcewords and codewords are allowed to have unequal lengths, there are known algorithms to construct optimal prefix-free codes [3, 16, 17, 18, 19] using, e.g., dynamic programming [18, 19]; however, the resulting codes are optimal in terms of the energy per codeword, while we need a minimum energy per code letter. To the best of our knowledge, none of the existing approaches addresses the problem of constructing prefix-free codes that minimize the energy per code letter for arbitrary desired entropy rates.
III Prefix-Free Codes for Various Entropy Rates
Let $\ell(d_i)$ denote the length of sourceword $d_i$, and $\ell(c_i)$ and $e(c_i)$ denote the length and energy of codeword $c_i$, respectively. Let $L_d$, $L_c$, and $E_c$ be random variables taking values on the sets $\{\ell(d_i)\}$, $\{\ell(c_i)\}$, and $\{e(c_i)\}$, respectively, all of which are distributed with the PMF $P$. Since our source bits are independent and equiprobable, the probability $P(d_i)$ is dyadic, i.e., $P(d_i) = 2^{-\ell(d_i)}$ with $\sum_i P(d_i) = 1$.
Examples of size- V2F, F2V, and V2V codes for the unipolar 2-ary amplitude shift keying (2-ASK) code alphabet are shown in Tab. I, where the V2V code has , , and . The corresponding tree diagrams are illustrated in Fig. 2, where double circles, open circles, and closed circles represent the root nodes, branch nodes, and leaf nodes, respectively, which will be defined below. A code is formed by concatenating the roots of two ordered trees, called left and right trees depending on their relative root position. We use the following terminologies throughout the paper to describe the structure of a right tree (defined in a similar fashion for a left tree):
Root: The left-most node of a tree.
Child: A node directly connected to the right of a node.
Parent: The converse of a child.
Siblings: A group of nodes with the same parent.
Branch: A node with at least one child.
Leaf: A node with no children.
Degree: The number of children of a node.
Path: A sequence of nodes and edges connecting a node with another node.
Depth: The depth of a node is the number of edges from the root node to the node, i.e., the path length connecting the root node and the node.
Height: The height of a tree is the longest path length between the root and leaves.
Size: The size of a tree is the number of all leaf nodes of the tree.
Leaf of a left tree represents sourceword and that of a right tree represents codeword . The depth of leaf equals (in the left tree) or (in the right tree).
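With the above notations, we can characterize the energy and entropy rate of a root-concatenated tree. By the dyadic PMF of sourcewords and codewords, the expected energy per code letter, i.e., the ratio of the expected codeword energy to the expected codeword length, can be calculated as
$$\mathsf{E} = \frac{\mathbb{E}[e(c)]}{\mathbb{E}[\ell(c)]} = \frac{\sum_i P(d_i)\, e(c_i)}{\sum_i P(d_i)\, \ell(c_i)}, \tag{1}$$
where $\mathbb{E}[\cdot]$ denotes expectation with respect to the sourceword PMF $P$ throughout the paper, unless otherwise specified. The entropy rate is defined as the expected number of source bits per code letter
$$R = \frac{\mathbb{E}[\ell(d)]}{\mathbb{E}[\ell(c)]} = \frac{\sum_i P(d_i)\, \ell(d_i)}{\sum_i P(d_i)\, \ell(c_i)}. \tag{2}$$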
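As a concrete illustration of the two quantities just defined, the sketch below evaluates the energy per code letter and the entropy rate of a small code under the dyadic sourceword PMF. The code table, the letter energies (0 and 1 for a unipolar 2-ASK alphabet), and the function name `code_metrics` are assumptions made for this example only.

```python
from fractions import Fraction


def code_metrics(code, letter_energy):
    """Return (energy per code letter, entropy rate) of a prefix-free code.

    code: list of (sourceword, codeword) pairs; sourcewords are bit strings,
          codewords are tuples of code letters.
    letter_energy: mapping from code letter to its transmit energy.
    """
    # Dyadic PMF: a sourceword of length l is parsed with probability 2^-l.
    probs = [Fraction(1, 2 ** len(d)) for d, _ in code]
    assert sum(probs) == 1, "dictionary must be a complete prefix-free set"
    exp_src_len = sum(p * len(d) for p, (d, _) in zip(probs, code))
    exp_cw_len = sum(p * len(c) for p, (_, c) in zip(probs, code))
    exp_cw_energy = sum(
        p * sum(letter_energy[x] for x in c) for p, (_, c) in zip(probs, code)
    )
    return exp_cw_energy / exp_cw_len, exp_src_len / exp_cw_len


# Hypothetical V2F code: variable-length sourcewords, length-2 codewords.
example = [("0", (0, 0)), ("10", (0, 1)), ("110", (1, 0)), ("111", (1, 1))]
energy, rate = code_metrics(example, {0: 0, 1: 1})
print(energy, rate)  # 5/16 energy units and 7/8 bit per code letter
```

Exact rational arithmetic is used because dyadic probabilities make both ratios exact fractions, which simplifies comparing codes with nearly identical rates.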
Then, the optimal prefix-free coding problem can be written as:
III-A V2F Codes
To solve the above problem, we begin with a balanced -ary right tree, i.e., a right tree in which every branch has children and every leaf is at the same depth (see the right tree of Fig. 2 (a)), which represents codebook of a fixed codeword length for all , with . Without loss of generality, assume that code letters are chosen from such that can immediately be obtained for the fixed . In this case, (1) and (2) degenerate, respectively, to , and , where the last equation holds since is dyadic. Therefore, minimizing the expected energy per code letter is equivalent to minimizing the expected energy per codeword for V2F codes. Also, maximizing subject to is equivalent to maximizing subject to .
If we waive the dyadic constraint on , the entropy-maximizing distribution under an energy constraint is the Maxwell-Boltzmann (MB) distribution , represented by a PMF with , where the rate parameter determines the expected energy per code letter. Let denote expectation with respect to a PMF , then the energy per code letter and entropy rate of optimal V2F codes are given, respectively, by and . Indeed, LEC coding  yields a special instance of the MB distribution, for which the rate parameter is given by with being a root of the characteristic condition . The dashed curves in Fig. 3 (a) and (b) show and as a strictly monotonically decreasing function of (this is true in general, see [12, Section 5.C]), obtained from the codewords of Tab. I (a). Due to the monotonicity, the optimal of the MB PMF that maximizes subject to can simply be obtained by the bisection method. Then, an optimal dyadic approximate of can be obtained by Geometric Huffman coding (GHC) , i.e., . Notice that, since code alphabet and constant completely define the right tree of a V2F code, the optimal can immediately be obtained for an arbitrary desired entropy rate , using a series of operations . The solid lines in Figs. 3 (a) and (b) show all and obtained with and . In this example, there are only five non-zero distinct rates generated by dyadic approximation of .
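The bisection step mentioned above can be sketched as follows. This is a minimal illustration, assuming an MB PMF of the form proportional to exp(-lambda * energy) over the codeword energies and natural-exponent parametrization; the function names, the search interval, and the example energies are hypothetical, and the subsequent dyadic approximation by GHC is not shown.

```python
import math


def mb_pmf(energies, lam):
    """Maxwell-Boltzmann PMF over codewords with the given energies."""
    weights = [math.exp(-lam * e) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def entropy_bits(pmf):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def solve_rate_parameter(energies, target_entropy, lo=0.0, hi=50.0, tol=1e-10):
    """Find lam >= 0 whose MB PMF has the target entropy, by bisection.

    Relies on the entropy decreasing monotonically in lam on [lo, hi].
    """
    h_lo = entropy_bits(mb_pmf(energies, lo))
    h_hi = entropy_bits(mb_pmf(energies, hi))
    if not h_hi <= target_entropy <= h_lo:
        raise ValueError("target entropy is not reachable on the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if entropy_bits(mb_pmf(energies, mid)) > target_entropy:
            lo = mid   # entropy still too high: increase lam
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical energies of the four length-2 codewords over letters {0, 1}.
energies = [0, 1, 1, 2]
lam = solve_rate_parameter(energies, target_entropy=1.75)
pmf = mb_pmf(energies, lam)
mean_energy = sum(p * e for p, e in zip(pmf, energies))
print(lam, entropy_bits(pmf), mean_energy)
```

Dividing the resulting entropy and mean energy by the fixed codeword length gives the per-letter entropy rate and energy for this V2F setting.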
As we increase the codebook size, the entropy rates created by V2F codes become much more diverse, as shown in Fig. 4 for up to the 16-ASK alphabet (producing up to 1024-QAM in the PAS architecture). Here, we find all V2F codes from the desired entropy rates with a rate granularity . The entropy rates become sparser in relatively low entropy rate regimes, which can be supplemented by allowing variable codeword lengths in Section III-C. The energy efficiency of the constructed V2F codes is also shown in Fig. 4, evaluated by the ratio of the energy per code letter of the prefix-free code to that of the ideal MB-distributed code letters for the same entropy rate, i.e., , which we call the energy gap. It can be seen that the energy gap is below 0.1 dB across a wide range of entropy rates with , or even with for .
III-B F2V Codes
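For F2V codes, we now consider a balanced binary left tree representing a dictionary of fixed sourceword length $\ell(d_i) = k$ for all $i$, with $N = 2^k$, such that all sourcewords are parsed with equal probabilities $P(d_i) = 1/N$. In this case, the expected energy per code letter and entropy rate of a code simplify, respectively, to
$$\mathsf{E} = \frac{E_\Sigma}{D_\Sigma}, \tag{4}$$
$$R = \frac{N \log_2 N}{D_\Sigma}, \tag{5}$$
with the sum depth and sum energy of the tree being $D_\Sigma = \sum_i \ell(c_i)$ and $E_\Sigma = \sum_i e(c_i)$, respectively.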
We do not know of any existing method that can solve (3) with (4) and (5) for an arbitrary desired . In this paper, therefore, instead of directly solving the problem for a particular , we try to identify a set of trees that produces all possible sum depths under a finite tree size constraint, hence realizing all possible according to (5). The constructed trees are optimal in the sense that they have the minimum sum energy among all -trees with the same sum depth , where -tree is a tree in which every node except leaves has a degree not smaller than . Minimum sum energy of a tree leads to minimum energy per code letter for a given , and for a given due to (4) and (5). Indeed, any -tree is allowed for a right tree; however, we restrict the type of trees to -trees to avoid infinite tree expansion (see Fig. 5 (a) for example). In Fig. 5, and throughout the paper, denotes a tree that has leaves, is a tree whose sum depth is , and is a tree whose sum energy is . Also, let and be the sets of all trees and , respectively. Then, by (4) and (5), all trees in lead to same but not necessarily same . Since distinct sum depth generates distinct , we have as many optimal trees as the number of distinct sum depths of trees in , where an optimal tree with sum depth is defined as . A brute-force search of in requires exponential time in . There are known problems that are isomorphic to the problem of counting the number of all trees in , including the parenthesizations counting problem [21, Ch. 15.2]. For example, the trees in of Fig. 5 (b) have isomorphic representations of , , , , and in order, in which the edges connecting two siblings are parenthesized together in a recursive manner from the largest depth of the tree. The number of parenthesizations for is , with being the Catalan number defined as [22, 23], which grows as [21, p. 333].
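To give a feel for the growth of the brute-force search space, the following sketch counts ordered binary trees by their number of leaves using the Catalan numbers. It covers only the binary special case, and the function names are illustrative assumptions.

```python
from math import comb


def catalan(n: int) -> int:
    """n-th Catalan number, C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def num_binary_trees(num_leaves: int) -> int:
    """Number of ordered binary trees (every branch has two children) with
    the given number of leaves; equal to the number of parenthesizations of
    a product of num_leaves factors."""
    return catalan(num_leaves - 1)


for n in (4, 8, 16, 32):
    print(n, num_binary_trees(n))
```

For four leaves the count is five, which agrees with the five parenthesized representations listed above for the trees of Fig. 5 (b).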
We begin from a trivial size-1 tree that has only a root, then construct larger trees by appending smaller trees to the root, as shown in Fig. 6. A tree formed by appending sub-trees , satisfies the following relations:
where the last two equations hold since the -th edge from the root should be taken into account times to calculate sum depth and sum energy of the tree. For example, due to (6), a tree with and the -ASK alphabet (hence ) can be constructed from -tuples: for , for , and for . Here, we do not need to use all sub-trees of size to construct a set of size- trees, since we have an optimal substructure property:
Theorem 1. An optimal tree contains only optimal sub-trees.
Proof. Suppose that an optimal tree has a -th sub-tree . If the sub-tree is not an optimal tree, then there exists a sub-tree with . By replacing the sub-tree with from the tree , we can obtain a new tree that has a smaller sum energy than the optimal tree, i.e., with . This is a contradiction; hence, an optimal tree consists of only optimal sub-trees. ∎
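The optimal substructure property suggests a dynamic-programming construction. The sketch below is restricted to a binary code alphabet and assumes the composition rules take the form: size = sum of sub-tree sizes, sum depth = sum over sub-trees of (sub-tree sum depth + sub-tree size), and sum energy = sum over sub-trees of (sub-tree sum energy + sub-tree size times the energy of the connecting letter). The letter energies (0, 1) and the function name are hypothetical.

```python
def optimal_trees(max_size, letter_energy=(0, 1)):
    """Minimum sum energy of binary right trees, indexed by size and sum depth.

    Returns best, where best[n][d] is the minimum sum energy over all binary
    trees with n leaves and sum depth d. Only optimal sub-trees are stored
    and combined, which is justified by the optimal substructure property.
    """
    e_first, e_second = letter_energy
    best = {1: {0: 0}}  # a lone root: no edges, zero sum depth and sum energy
    for n in range(2, max_size + 1):
        table = {}
        for n1 in range(1, n):          # size of the sub-tree on the first edge
            n2 = n - n1                 # size of the sub-tree on the second edge
            for d1, en1 in best[n1].items():
                for d2, en2 in best[n2].items():
                    # Each root edge lies on the path to every leaf below it.
                    depth = d1 + d2 + n
                    energy = en1 + n1 * e_first + en2 + n2 * e_second
                    if energy < table.get(depth, float("inf")):
                        table[depth] = energy
        best[n] = table
    return best


best = optimal_trees(8)
for depth, energy in sorted(best[8].items()):
    print(depth, energy)
```

For a general M-ary alphabet the inner loops would range over M-tuples of sub-tree sizes; the binary case is shown only to keep the example short.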
It can be easily seen that the height of is minimized when its leaves have maximally uniform depths. The depths can differ from each other at most by one, and the minimum height is given by for the -ASK alphabet. The sum depth of a size- minimum-height tree is calculated as , which is equal to the smallest sum depth of all trees in . If is a positive integer power of , the minimum sum depth can be simplified as . On the other hand, if there are only two nodes at every depth of a tree (see, e.g., the last tree of Fig. 5 (b)), the tree has height that is the largest of all trees in . The sum depth in this case is calculated as . After some manipulation, it can be seen that the sum depth of a maximum-height tree is also the maximum sum depth of all trees in . Therefore, if we store only one optimal tree to for each , the size of is upper-bounded by ; i.e., grows as . Since there are at most choices of the optimal sub-tree sets (ranging from an empty set to ) for each of the edges of the root for enumerating all trees of size (using the optimal sub-tree property), the number of all choices to build is of the complexity . As aforementioned, each of the sub-tree sets is of size at most , hence we have the search space of size to identify . The search time is polynomial in , and exponential in , indicating the intractability of constructing F2V codes for a large alphabet.
To further reduce the search space, we can exploit the fact that a larger sub-tree is appended as far left as possible from the root of an optimal tree; i.e.,
Theorem 2. Let for constitute the -th sub-tree of an optimal tree such that the root of is the -th child of the root of , then the sub-trees satisfy .
Proof. Let and denote two of the sub-trees of an optimal tree with , and assume . Then, by exchanging the two sub-trees and , we can construct another tree of the same size and the same sum depth due to (6) and (7), which has a smaller sum energy than by (8). This is a contradiction; hence, an optimal tree must have for any . ∎
For example, some trees in can be generated from three sub-trees that have , , or , but we can discard the latter two triplets by Theorem 2.
We construct one optimal F2V code for each of the entropy rates that can be realized with for and . As shown in Fig. 7, F2V codes offer a greater diversity of entropy rates than V2F codes with the same size, albeit with a slightly larger energy gap. Therefore, a natural consequence is to construct V2V codes to exploit the complementary merits of V2F and F2V codes, as will be illustrated henceforth.
III-C V2V Codes
We readily have a set of optimal right trees, representing optimal F2V codes. From this, without imposing any fixed-length constraints on sourcewords and codewords, we can enumerate near-optimal V2V codes of size in the following manner:
For each of the desired entropy rates , where for some integer and the rate granularity , and for each of the optimal right trees in , identify the optimal PMF that achieves with the minimum energy per code letter.
For every obtained in Step 1, construct a set of optimal left trees representing that minimizes . The trees in and have one-to-one correspondence.
For each of , choose a pair of left and right trees in and that yield the minimum energy per code letter with a rate discrepancy for a small .
Indeed, an optimal V2V code does not necessarily consist of an optimal right tree, hence an exhaustive search of the left tree should be performed over all right trees in . However, due to the exponential growth of in , we restrict the search space to optimal right trees in ; under this constraint, the constructed V2V codes show surprisingly good performance with a very small .
Assume that a right tree is chosen from (hence and are given). Then, from (2), we have . Let , , then we have an equivalence relation by definition as
Therefore, by waiving the dyadic constraint on , the optimization problem of (3) translates into
where the equality in the right-hand side equation holds if and only if .
Also, the constraint functions in (11) and (12) are concave and affine, respectively; hence, the problem can efficiently be solved by a convex optimization solver such as CVX [24, 25], on condition that is known. Since we do not know , we can make an initial guess using a PMF , then attempt to reduce the error between the energy estimate of iteration and the true minimum energy in an iterative manner. This initial PMF maximizes in order to satisfy the rate condition (9), where the maximization of by the initial guess can be proven by using a Lagrangian . By the Karush-Kuhn-Tucker (KKT) conditions [26, Ch. 5.5.3], a PMF maximizes if the gradient of the Lagrangian vanishes at ; i.e.,
Since is a PMF, we have that in the last equation, hence the PMF that maximizes is given by the initial guess . If this does not fulfill the rate condition (9), then no other PMF can satisfy it, hence the corresponding right tree should be discarded. Otherwise, we can solve the convex optimization problem iteratively until the estimation error at iteration reaches within a termination threshold , as shown in Algorithm 1.
Theorem 3. Given a right tree , Algorithm 1 produces a PMF that is asymptotically optimal in iteration for an entropy rate , such that approaches .
Proof. The proof follows a similar structure to that of Proposition 4.5 in [27, Sec. 4.3.1]. Suppose that Algorithm 1 terminates at the iteration . Then, converges to as increases, since the termination condition on Line 5 is not satisfied at every , hence for , indicating that is a monotonically decreasing function of , while is lower-bounded by . Furthermore, the converged energy is indeed the minimum energy within a small error. To see this, let be the optimal value of the objective function at the iteration such that for any PMF , where the equality holds if and only if . Then, , since for any PMF , where the equality holds if and only if by (14). Also, since , the termination of Algorithm 1 at the iteration implies . This proves that Algorithm 1 can approach the minimum energy closely by choosing a sufficiently small , since and is a necessary and sufficient condition for . ∎
In our V2V code construction, Algorithm 1 is terminated mostly in 5 iterations with .
Once the optimal PMF is identified for every optimal right tree and for every desired entropy rate , an optimal dyadic estimate can be obtained as , which creates a left tree. Then, among all pairs of such constructed left and right trees, a pair can be chosen for each entropy rate that achieves the minimum .
Figure 8 shows all V2V codes enumerated by the rate granularity and the rate tolerance for the 2-ASK and 4-ASK alphabets with . For the 2-ASK alphabet, only is required to construct V2V codes to approach the theoretic minimum energy to within 0.05 dB across a wide range of entropy rates. For the 4-ASK alphabet, V2V codes are still within 0.1 to 0.2 dB of the limit across a wide range of entropy rates, although the obtained entropy rates are sparser and tends to grow with .
Some codes selected from Figs. 4 and 8 are shown in Fig. 9, whose entropy rates range from 0.15 to 3.83 in a step size of . This shows that we can approach the theoretic minimum energy per code letter to within 0.13 dB across a wide range of entropy rates for up to 1024-QAM, using a codebook size not larger than 256.
IV Framing for Fixed-Rate Transmission
A fundamental problem of PCDM is that the number of output code letters varies depending on input patterns; i.e., the entropy rate of a prefix-free code is inherently probabilistic, and its mean value over encoding iterations approaches the expected entropy rate only asymptotically in the number of encoding iterations. Application of PCDM to data transmission is critically hampered by this, since it can cause buffer overflow or underflow, insertion or deletion of bits that causes synchronization errors, and unbounded error propagation in the absence of re-synchronization. Framing in this paper refers to a method that enables a fixed-length frame to always contain a fixed number of information bits, thereby solving the variable-length problem of PCDM. In [28, Sec. 4.8], framing is performed by casting overflow symbols into errors; the probability of error decreases as the frame length grows, but zero-error prefix-free decoding is not possible within a finite-length frame. To the best of our knowledge, there is no framing method known to date that allows unique error-free decoding of prefix-free codes.
IV-A Algorithm for Framing
To achieve a fixed entropy rate in each fixed-length frame for the -ASK alphabet, we use two different codes: a prefix-free code with a probabilistic entropy rate whose mean value is close to , and a trivial code with a constant entropy rate (i.e., is a typical fixed-to-fixed length mapper for uniform -ASK). The idea is that we begin encoding with and then switch to at some point of the successive encoding process if an overflow is predicted. If we keep counting the numbers of input bits and output symbols encoded by in the first part of the encoding process, we can also calculate the number of symbols required to encode all the remaining input bits if we use instead of from that point onward, thereby predicting a potential overflow. In what follows, we will show that this prediction can be made in a way that unique decoding is possible, and that the penalty due to this switching is small.
Let (bits) and (symbols) be fixed input and output frame lengths, respectively. Then, framing enables PCDM to achieve a fixed entropy rate , where
in each frame. This shows that, for the chosen and , an additional rate adaptability is also offered by framing, albeit with a small loss of energy efficiency. Let and denote the lengths of sourcewords and codewords of at encoding iteration , respectively, then the cumulative sourceword and codeword lengths at iteration can be calculated respectively as , and .
Assume that the use of up to iteration was assured not to cause overflow as long as we switch to from the next iteration onwards, hence we have used up to iteration . Then, at the next iteration , in order to foresee if does not still cause overflow, we need to ensure that
where and define the available and required numbers of symbols in case we switch to after encoding with at , respectively. Codebook is used for encoding at iteration if condition (16) is fulfilled, otherwise from iteration onwards. Notice that (16) can be evaluated only after seeing the incoming bit pattern at iteration to obtain and . This makes unique decoding impossible, since, assuming that unique decoding was successfully performed up to iteration at the receiver such that and are known for all , decoding cannot identify which codebook was used at iteration due to the lack of knowledge of and . This suggests that, for unique decoding, the codebook at iteration must be identified without relying on and , which can be realized by using a bounding technique. Namely, in a pessimistic assumption that the shortest input bits are mapped to the longest output symbols defined in at iteration , we can find the lower bound of the available symbols and the upper bound of the required symbols as
where and . Now, without knowledge of and , the overflow-preventing condition (16) can be conservatively examined by
since and .
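The bounding argument above can be written out explicitly. The block below is a reconstruction under assumed notation, not the paper's own equations: $K$ and $N$ are the input and output frame lengths, $S_t$ and $C_t$ the cumulative sourceword and codeword lengths after iteration $t$, $\ell(d_t)$ and $\ell(c_t)$ the sourceword and codeword lengths at iteration $t$, and $M$ the alphabet size.

```latex
% Assumed notation (hypothetical, chosen for this sketch only):
%   K, N            input (bits) and output (symbols) frame lengths
%   S_t, C_t        cumulative sourceword / codeword lengths after iteration t
%   \ell_{\min}     length of the shortest sourceword of the prefix-free code
%   \ell_{\max}     length of the longest codeword of the prefix-free code
\begin{align}
  n_{\mathrm{avail}}(t) &= N - C_{t-1} - \ell(c_t), &
  n_{\mathrm{req}}(t)   &= \left\lceil \frac{K - S_{t-1} - \ell(d_t)}{\log_2 M} \right\rceil, \\
  \underline{n}_{\mathrm{avail}}(t) &= N - C_{t-1} - \ell_{\max}, &
  \overline{n}_{\mathrm{req}}(t)    &= \left\lceil \frac{K - S_{t-1} - \ell_{\min}}{\log_2 M} \right\rceil .
\end{align}
% Because \ell(c_t) \le \ell_{\max} and \ell(d_t) \ge \ell_{\min},
%   \underline{n}_{\mathrm{avail}}(t) \le n_{\mathrm{avail}}(t)  and
%   \overline{n}_{\mathrm{req}}(t)   \ge n_{\mathrm{req}}(t),
% so the conservative test
%   \underline{n}_{\mathrm{avail}}(t) \ge \overline{n}_{\mathrm{req}}(t)
% implies the exact overflow-free condition n_avail(t) >= n_req(t).
```

The conservative test depends only on quantities known before iteration $t$, which is what allows a decoder to evaluate it as well.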
Theorem 4. In the successive PCDM encoding process, frame overflow can be avoided by switching the code from to at the earliest iteration that does not fulfill (19).
Proof. If (19) is fulfilled at , we can use at without overflow. Otherwise, we can avoid overflow by using for encoding at every , since by (15). Assume that was used at , fulfilling (19), hence ensured that overflow will not occur if we switch to at iteration . If (19) is fulfilled also at , we keep using at , since encoding with at every suffices to avoid overflow; otherwise, we switch to at without overflow by assumption. This completes the proof by mathematical induction. ∎
Theorem 5. PCDM symbols framed by using Theorem 4 can be uniquely decoded.
Proof. At , it is trivial to see that a decoder can identify a proper decoding code using (19), which allows for unique decoding and gives knowledge of and . Assume that and for all are known at the beginning of decoding iteration . Then, by assessing (19) at , we identify the decoding code at iteration . This enables unique decoding at and provides and . This completes the proof by mathematical induction. ∎
In case where underflow occurs, i.e., if encoding of bits is finished using less than symbols, the encoder can simply fill the unoccupied slots in the output frame with dummy symbols of the smallest energy, e.g., for the -ASK alphabet. The decoder can discard these dummy symbols after bits are all decoded. There may also be incidents at the end of encoding that need minor manipulations as follows. At the last encoding iteration, it is possible that there is no sourceword in the dictionary that matches perfectly with the input bits. In this case, there must be multiple sourcewords whose -bit prefix matches perfectly with the input bits, where denotes the length of the last remaining input bits, hence an encoder can set a rule for unique encoding; e.g., it can pick the uppermost codeword in such a case. Unique decoding is straight-forward. The framing algorithm is summarized in Algorithm 2.
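The framing procedure described in this subsection can be sketched end to end as follows. This is an illustrative sketch, not the paper's Algorithm 2: it assumes that the overflow test compares a lower bound on the available symbols with an upper bound on the required symbols, that the alphabet size is a power of two with `m` bits per symbol, that the dictionary is a complete prefix-free set, and that "the uppermost codeword" means the first matching entry of the code table. All names and the example code are hypothetical.

```python
import math


def _keep_prefix_code(n_bits, n_syms, bits_done, syms_done, ld_min, lc_max, m):
    """Conservative overflow test using only past lengths and worst-case bounds."""
    available = n_syms - syms_done - lc_max
    required = math.ceil((n_bits - bits_done - ld_min) / m)
    return available >= required


def frame_encode(bits, code, m, n_syms, dummy=0):
    """Encode a bit string into exactly n_syms symbols.

    code: dict mapping sourceword (bit string) to codeword (tuple of symbols).
    """
    n_bits = len(bits)
    assert n_syms >= math.ceil(n_bits / m), "frame too short for the uniform code"
    ld_min = min(len(d) for d in code)
    lc_max = max(len(c) for c in code.values())
    out, pos, prefix_mode = [], 0, True
    while pos < n_bits:
        prefix_mode = prefix_mode and _keep_prefix_code(
            n_bits, n_syms, pos, len(out), ld_min, lc_max, m)
        if prefix_mode:
            rest = bits[pos:]
            # At the frame end, the remaining bits may only be a prefix of a
            # sourceword; the first such table entry is then used.
            d = next(d for d in code if rest.startswith(d) or d.startswith(rest))
            out.extend(code[d])
            pos += len(d)
        else:
            chunk = bits[pos:pos + m].ljust(m, "0")   # uniform fixed-to-fixed code
            out.append(int(chunk, 2))
            pos += m
    out.extend([dummy] * (n_syms - len(out)))          # fill on underflow
    return out


def frame_decode(symbols, code, m, n_bits):
    """Invert frame_encode by re-evaluating the same overflow test."""
    inverse = {c: d for d, c in code.items()}
    ld_min = min(len(d) for d in code)
    lc_max = max(len(c) for c in code.values())
    bits, pos, idx, prefix_mode = [], 0, 0, True
    while pos < n_bits:
        prefix_mode = prefix_mode and _keep_prefix_code(
            n_bits, len(symbols), pos, idx, ld_min, lc_max, m)
        if prefix_mode:
            word = ()
            while word not in inverse:                 # instantaneous decoding
                word += (symbols[idx],)
                idx += 1
            bits.append(inverse[word])
            pos += len(inverse[word])
        else:
            bits.append(format(symbols[idx], "0{}b".format(m)))
            idx += 1
            pos += m
    return "".join(bits)[:n_bits]                      # drop padding bits


# Hypothetical code over a 2-ary alphabet (m = 1 bit per symbol).
CODE = {"0": (0, 0), "10": (0, 1), "110": (1, 0), "111": (1, 1)}
frame = frame_encode("1011000111010010", CODE, m=1, n_syms=20)
print(len(frame), frame_decode(frame, CODE, m=1, n_bits=16))
```

Because the test never looks at the current sourceword or codeword, the decoder reaches the same switching decision as the encoder at every iteration.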
IV-B Analysis of Fixed-Length Penalty by Gaussian Approximation
Suppose that a prefix-free encoder produces an entropy rate using the -th codeword of a size- code , given by . The probability of occurrence of equals , i.e., the probability that the encoder's output symbol at a certain time belongs to codeword . Without a fixed-length constraint, the expected entropy rate of code , previously expressed as (2), can alternatively be calculated as , where is a random variable taking values on the ordered set , with a PMF . For example, with the V2V code in Tab. I (c), an entropy rate in {0.14, 0.43, 0.5, 0.6, 1, 1.67, 3, 6} is observed at the encoder's output with a PMF {0.57, 0.143, 0.122, 0.102, 0.041, 0.015, 0.005, 0.003}, which yields an entropy rate of about 0.36 on average. Suppose now that a fixed-length constraint is imposed by framing. We consider the ensemble of symbol frames and assume that all symbols encoded by are independent and identically distributed (IID). Also, though the true entropy rate of a symbol is a discrete random variable, we approximate the entropy rate by a continuous Gaussian random variable with mean and variance; i.e., . In this case, the entropy rate accumulated up to the -th symbol also follows a Gaussian distribution, if there has been neither overflow prediction nor early termination of encoding by until the -th symbol. To take into account overflow and early termination, we notice that (17) and (18) at the -th output symbol can respectively be transformed into and , where is a random variable representing the cumulative entropy rate at the -th output symbol.
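The mean and variance used in the Gaussian approximation can be obtained directly from the per-codeword entropy rates and their PMF. The sketch below uses the rounded values quoted above for the V2V code of Tab. I (c); the variable names are illustrative, and the PMF is renormalized because the rounded probabilities do not sum exactly to one.

```python
# Rounded per-codeword entropy rates and their symbol-level PMF, as quoted
# in the text for the V2V code of Tab. I (c).
rates = [0.14, 0.43, 0.5, 0.6, 1, 1.67, 3, 6]
pmf = [0.57, 0.143, 0.122, 0.102, 0.041, 0.015, 0.005, 0.003]

total = sum(pmf)  # slightly off from 1 because of rounding
mean_rate = sum(p * r for p, r in zip(pmf, rates)) / total
var_rate = sum(p * (r - mean_rate) ** 2 for p, r in zip(pmf, rates)) / total

print(round(mean_rate, 3), round(var_rate, 3))
```

The mean comes out close to 0.36 bit per symbol, and the variance is dominated by the rare codewords that carry many source bits in a single code letter.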
However, characterization of exact becomes infeasible as the overflow and early termination probabilities increase with , hence we approximate for all again by a Gaussian random variable with mean and variance such that whose evolution over is mathematically tractable, as shown in Fig. 10. Then, on condition that code is still being used at the -th symbol, i.e., on condition that no overflow prediction and early termination has been made before the -th symbol, the probability of an overflow prediction at the -th symbol is
where $\Phi(\cdot)$ denotes the Gaussian cumulative distribution function (CDF) with the mean and variance of the Gaussian approximation above. Equation (20) gives insight into the behavior of the proposed framing method, since it indicates that overflow is predicted if the entropy rate accumulated until the -th symbol is less than a threshold that increases linearly with . Here, and are both linear in with slopes and , respectively, and since by the framing rule, the probability of overflow prediction in (20) gradually increases with . Recall, however, that this is a conditional probability, given no overflow prediction and early termination prior to symbol .
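The qualitative behavior described above can be reproduced with a simplified model. The sketch below assumes that the accumulated rate after n IID symbols is Gaussian with mean n times mu and variance n times sigma squared, and that the overflow threshold is linear in n with slope m (bits per symbol of the uniform code). It ignores the conditioning on earlier overflow predictions and early terminations, so it is not the paper's equation (20); all names and numerical values are hypothetical.

```python
import math


def gaussian_cdf(x, mean, var):
    """CDF of a Gaussian random variable with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def overflow_prediction_prob(n, k_bits, n_syms, mu, var, m, ld_min, lc_max):
    """Probability that the accumulated rate at symbol n is below the threshold.

    The threshold follows from requiring available >= required symbols:
    accumulated bits >= k_bits - ld_min - m * (n_syms - n - lc_max).
    """
    threshold = k_bits - ld_min - m * (n_syms - n - lc_max)
    return gaussian_cdf(threshold, n * mu, n * var)


# Hypothetical frame: 350 bits in 1000 binary symbols, code rate about 0.362.
params = dict(k_bits=350, n_syms=1000, mu=0.362, var=0.209, m=1,
              ld_min=1, lc_max=7)
for n in (500, 900, 950, 1000):
    print(n, overflow_prediction_prob(n, **params))
```

Since the threshold grows with slope m while the mean grows with the smaller slope mu, the probability rises toward the end of the frame, in line with the discussion above.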
In order to take into account the history of previous overflow predictions and early terminations, let and denote the cumulative overflow prediction probability and cumulative early termination probability at symbol . Then, the unconditional probability that overflow is predicted at symbol is