Codes that correct deletions have several applications such as file synchronization and DNA-based storage. In remote file synchronization, for instance, the goal is to synchronize two edited versions of the same file with the minimum number of communicated bits. In general, the file edits are a result of a series of deletions and insertions. One way to achieve synchronization is by using systematic codes that can efficiently correct deletions [1, 2].
The study of deletion correcting codes goes back to the 1960s [3, 4, 5]. In 1965, Varshamov and Tenengolts constructed the VT codes for correcting asymmetric errors over the Z-channel [4]. In 1966, Levenshtein showed that the VT codes are capable of correcting a single deletion with zero error [5]. Also in [5], Levenshtein derived fundamental bounds on the redundancy needed to correct deletions, showing that the number of redundant bits needed to correct a constant number of deletions in a codeword of length n bits is logarithmic in n. The fundamental bounds on the redundancy were later generalized and improved by Cullina and Kiyavash [6]. The redundancy of VT codes is asymptotically optimal for correcting a single deletion. Finding VT-like codes for multiple deletions has been an open problem for several decades. The literature on multiple deletion correcting codes has mostly focused on constructing codes that can uniquely decode multiple deletions with zero error [7, 8, 9, 10, 11]. There have also been multiple recent works which study codes that can correct multiple deletions with a low probability of error [1, 12, 13, 14].
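The zero-error single-deletion property of VT codes mentioned above can be checked mechanically. The sketch below is our own illustration (function names are ours, not taken from the cited works): it computes the VT checksum, and verifies by brute-force re-insertion that every single deletion from a codeword of the code VT_0(8) leads back to exactly one codeword.

```python
from itertools import product

def vt_syndrome(bits):
    """VT checksum: sum of i * x_i over positions i = 1..n, modulo (n + 1)."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)

def vt_codewords(n, a=0):
    """All length-n binary words with VT syndrome a, i.e., the code VT_a(n)."""
    return [w for w in product([0, 1], repeat=n) if vt_syndrome(w) == a]

def vt_decode(received, n, a=0):
    """Recover candidates from a string with one deletion by trying every
    possible re-insertion and keeping the words whose syndrome equals a."""
    candidates = set()
    for pos in range(n):
        for bit in (0, 1):
            trial = received[:pos] + (bit,) + received[pos:]
            if vt_syndrome(trial) == a:
                candidates.add(trial)
    return candidates

# Exhaustive check for n = 8: every single deletion from every codeword of
# VT_0(8) is consistent with exactly one codeword (zero-error correction).
n = 8
for cw in vt_codewords(n):
    for i in range(n):
        received = cw[:i] + cw[i + 1:]
        assert vt_decode(received, n) == {cw}
```

Note that this brute-force decoder is only for illustration; Levenshtein's decoder recovers the deleted bit in linear time from the syndrome deficiency.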
In this work, we focus on the problem of designing efficient list decoders for deletions. A list decoder returns a list of candidate strings which is guaranteed to contain the actual codeword. The idea of list decoding was first introduced in the 1950s by Elias [15] and Wozencraft [16]. The main goal when studying list decoders is to find explicit codes that can return a small list of candidate strings in polynomial time (polynomial in the length of the codeword). The size of the list gives a lower bound on the time complexity of the list decoder. For instance, if the size of the list is superpolynomial, then polynomial time list decoding cannot be achieved. List decoding has been studied for various classes of error correcting codes, such as Reed-Solomon codes [17], Reed-Muller codes [18], and Polar codes [19]. However, the problem of finding list decoders for deletions has not received much attention in the literature. In [11], Guruswami and Wang proved the existence of codes that can list-decode a constant fraction of deletions; in the high-noise regime considered in [11], the codes have low rate. Recently in [20], Wachter-Zeh derived a Johnson-like upper bound on the list size for decoding deletions and insertions. An explicit list decoding algorithm that is based on VT codes was also proposed in [20]. For the case of a constant number of deletions δ, the maximum list size achieved by the algorithm in [20] grows polynomially in the length of the codeword.
In this paper, we study the list decoding performance of Guess & Check (GC) codes [1]. GC codes are explicit systematic codes that have logarithmic redundancy and can correct a constant number of deletions in polynomial time. The decoding algorithm of GC codes includes a preliminary step which generates an initial list of candidate strings that contains the actual codeword. Then, the decoder uses additional redundancy to try to eliminate as many of the candidate strings as possible. In the unique decoding setting, if the size of the final list obtained is greater than one, then a decoding failure is declared. In [1], it was shown that the probability of decoding failure of GC codes vanishes asymptotically in the length of the message k. In this work, we examine the list decoding setting by studying the size of the final list obtained by a GC decoder for a constant number of deletions δ.
Contributions: Based on the decoding algorithm of GC codes presented in [1], the decoding complexity and the list size are both upper bounded by polynomial functions of k. Therefore, by construction, GC codes can list-decode a constant number of deletions in polynomial time, and the resulting list size is at most polynomial in k. Our simulations show that the actual value of the list size obtained by the GC decoder is very small. In fact, the values of the list size obtained in the simulations are very close to one and do not increase with k. In this work, we quantify the value of the list size by studying the average and the maximum size of the list. Our theoretical results show that: (i) the average size of the list approaches one asymptotically in k for a uniform iid message and any deletion pattern; and (ii) there exists an infinite sequence of GC codes whose maximum list size is upper bounded by a constant that is independent of k, for any message and any deletion pattern. These results demonstrate that in the average case, lists of size strictly greater than one occur with low probability. Moreover, in the worst case, the list size is very small (constant), and hence the performance of GC codes is very close to that of codes that can uniquely decode multiple deletions with zero error. We also provide numerical simulations on the list decoding performance of GC codes for a range of message lengths k and numbers of deletions δ. The average list size recorded in the simulations is very close to one (less than 2), and the maximum list size detected within the performed simulations does not increase with k. These results improve on the list decoder presented in [20], whose maximum list size grows polynomially in the length of the codeword for a constant number of deletions δ.
In this section, we present a high-level overview of Guess & Check (GC) codes [1] and state some of the previous results on these codes which will be helpful in subsequent sections of this paper. We also introduce the necessary notation used throughout the paper.
II-A Guess & Check (GC) Codes
GC codes are explicit binary codes that can correct multiple deletions with high probability. GC codes are systematic and have deterministic polynomial time encoding and decoding algorithms. A high-level description of the encoding and decoding algorithms is given below. A more detailed description can be found in [1].
Let u be a message of length k bits, and let x be its corresponding codeword of length n bits. Let δ be a constant representing the number of deletions. Consider a deletion pattern d = (d_1, …, d_δ) representing the positions of the δ deletions, where 1 ≤ d_i ≤ n, for i = 1, …, δ.
Encoding: The encoding block diagram is illustrated in Fig. 1. Encoding is done based on the following steps. (i) The message u of length k bits is chunked into adjacent blocks of length ℓ bits each, and each block is then mapped to its corresponding symbol in GF(q), where q = 2^ℓ. (ii) The resulting q-ary string is encoded using a systematic MDS code, where c is a constant representing the number of MDS parity symbols. (iii) The q-ary symbols are mapped back to their binary representations. (iv) Only the parity bits are encoded using a repetition code. This encoding procedure results in the codeword x of length n bits.
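Steps (i) and (iii) above can be sketched in a few lines. The following is our own illustration (the MDS and repetition steps are omitted): it chunks a binary message into ℓ-bit blocks, maps each block to a symbol in GF(2^ℓ) represented as an integer in {0, …, 2^ℓ − 1}, and maps symbols back to bits.

```python
def chunk_to_symbols(message_bits, ell):
    """Step (i): chunk a k-bit message into adjacent blocks of ell bits each
    and map every block to a symbol of GF(2^ell), here an integer < 2^ell."""
    k = len(message_bits)
    assert k % ell == 0, "for simplicity, assume ell divides k"
    return [int("".join(map(str, message_bits[i:i + ell])), 2)
            for i in range(0, k, ell)]

def symbols_to_bits(symbols, ell):
    """Step (iii): map q-ary symbols back to their ell-bit binary representations."""
    bits = []
    for s in symbols:
        bits.extend(int(b) for b in format(s, "0{}b".format(ell)))
    return bits

msg = [1, 0, 1, 1, 0, 0, 1, 0]           # k = 8 bits
syms = chunk_to_symbols(msg, ell=4)      # two GF(16) symbols
assert syms == [11, 2]                   # 0b1011 = 11, 0b0010 = 2
assert symbols_to_bits(syms, 4) == msg   # the mapping is invertible
```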
Decoding: Suppose that the codeword x is affected by a deletion pattern d. The high-level idea of the decoding algorithm is based on the following steps. (i) The decoder recovers the parity bits, which are protected by a repetition code. (ii) The decoder makes t guesses, where each guess corresponds to a specific distribution of the δ deletions among the blocks. The total number of guesses t is equal to the total number of possible deletion patterns given in (1).
(iii) For each guess, the decoder specifies the block boundaries based on its assumption on the locations of the deletions. Then, it treats the blocks that are affected by deletions as hypothetical erasures and decodes these erasures using the first δ MDS parity symbols. (iv) The decoder then goes over the guesses and checks whether each decoded string is consistent with the remaining parities or not. At the end of this decoding procedure, successful unique decoding is declared if only one guess is consistent with the parities. If two or more guesses are consistent with the parities, the decoder declares a decoding failure.
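The guessing step (ii) can be sketched as follows. This is our own illustration, assuming, as described above, that each guess is one way of distributing the δ deletions among the blocks (the erasure-decoding and parity-checking steps are omitted); the number of such distributions is the number of weak compositions of δ into as many parts as there are blocks.

```python
from itertools import combinations_with_replacement
from math import comb

def deletion_distributions(num_blocks, delta):
    """Enumerate the decoder's guesses: all ways to distribute delta deletions
    among num_blocks blocks (only the per-block counts matter)."""
    for blocks in combinations_with_replacement(range(num_blocks), delta):
        counts = [0] * num_blocks
        for b in blocks:
            counts[b] += 1
        yield tuple(counts)

b, delta = 4, 2
guesses = list(deletion_distributions(b, delta))
# The number of guesses equals the number of weak compositions of delta
# into b parts, i.e., C(b + delta - 1, delta); here C(5, 2) = 10.
assert len(guesses) == comb(b + delta - 1, delta)
assert (2, 0, 0, 0) in guesses and (0, 1, 1, 0) in guesses
```

For constant δ this count is polynomial in the number of blocks, which is what makes the guessing phase tractable.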
The results in [1] show that the probability of decoding failure for a uniform iid message u, and any deletion pattern d, is upper bounded by the expression given in (2). For appropriate choices of the chunking length ℓ and the number of parity symbols c, this bound, and hence the probability of decoding failure, goes to zero as the message length k goes to infinity.
II-B List Decoding
As previously described, the GC decoder makes t guesses corresponding to all possible deletion patterns and decodes the received string based on these guesses. Note that depending on the runs of bits within the received string, different guesses may lead to the same decoded string. Therefore, the guessing phase results in an initial list of at most t decoded strings. Then, the decoder checks this list with the additional parities and eliminates the guesses that are not consistent with these parities. By the end of this procedure, the GC decoder is left with a smaller final list of candidate strings. Since the parities are protected by a repetition code, and the decoder goes over all possible deletion patterns, the final list is guaranteed to contain the actual codeword x. Hence, if the size of this list is one, we say that the decoder achieved successful unique decoding. GC codes can achieve successful unique decoding with high probability, since the probability of decoding failure given in (2) (i.e., the probability of obtaining a list of size strictly greater than one) vanishes asymptotically.
Since the size of the initial list is at most t, and t is upper bounded by a polynomial function of k given by (1), the initial list size is at most polynomial in k. In this paper, we are interested in studying the size of the final list obtained by the GC decoder. The list size is a deterministic function of the codeword x and the deletion pattern d, and hence can be represented by L(x, d). Since the codeword is a deterministic function of the message u, an equivalent definition of the list size is L(u, d). Throughout the paper, we drop the argument d and use L(u) to refer to the maximum list size over all possible deletion patterns for a message u of length k bits, i.e., L(u) = max_d L(u, d). (3)
Based on the decoding algorithm of GC codes, we have 1 ≤ L(u) ≤ t. To quantify the size of the final list, we define the following quantities:
The average value of the list size E[L], defined as the expectation of L(u) (4) for a uniform iid message u of length k bits.
The maximum value of the list size L_max, defined as the maximum of L(u) (5) over all messages u of length k bits.
We summarize the notation used in this paper in Table I.
|u|message|c|number of MDS parity symbols|
|k|length of the message in bits|ℓ|chunking length used for encoding|
|x|codeword|q|field size, given by q = 2^ℓ|
|n|length of the codeword in bits|t|total number of guesses given in (1)|
|d|deletion pattern|L|realization of the list size|
|δ|number of deletions|E[L]|average list size defined in (4)|
|d_i|position of the i-th deletion|L_max|maximum list size defined in (5)|
III Main Results
In this section, we present our main results. The proofs of these results are given in Section IV. Recall that the number of deletions δ and the number of parity symbols c are constants, i.e., independent of k.
Theorem 1 (Average list size).
For a uniform iid message u of length k bits, and any deletion pattern d, the average list size obtained by the Guess & Check (GC) decoder satisfies 1 ≤ E[L] ≤ 1 + (t − 1)·Pr(F), where t is the total number of guesses given in (1) and Pr(F) is the probability of decoding failure upper bounded in (2).
The next result follows from Theorem 1 and shows that for appropriate choices of the GC code parameters, E[L] approaches one asymptotically in the length of the message k.
For appropriate choices of the GC code parameters ℓ and c, the average list size satisfies E[L] → 1 as k → ∞.
The next theorem presents an upper bound on the maximum list size defined in (5), which is the main quantity of interest in the list decoding literature. Informally, the theorem states that for appropriate choices of the code parameters, the maximum list size is upper bounded by a constant.
Theorem 2 (Maximum list size).
Let the code parameters ℓ and c be chosen as in Corollary 1, and consider a sufficiently large message length k_0. There exists an infinite sequence of Guess & Check (GC) codes, indexed by the message lengths, whose maximum list size L_max, for any message u and any deletion pattern d, is upper bounded by a constant that is independent of k.
Theorem 2 says that there exists an infinite sequence of GC codes whose L_max is upper bounded by a constant. The restriction to a sequence of codes is a limitation of the proof technique. We conjecture that this result is true for all GC codes with arbitrary message length k. In Section V, we provide numerical simulations on the maximum list size, where we gradually increase the message length k for multiple values of δ. In the obtained empirical results, the value of the maximum list size does not increase with k.
III-B Discussion for the Case of ℓ = log k
Recall that the redundancy of GC codes is n − k. Let ℓ = log k be the chunking length used for encoding. In this case, the redundancy is O(log k), i.e., logarithmic in k. It follows from Theorems 1 and 2 that a logarithmic redundancy is sufficient for GC codes so that (i) E[L] → 1; and (ii) L_max is bounded by a constant for an infinite sequence of GC codes. It is easy to verify that a logarithmic redundancy corresponds to a code rate that is asymptotically optimal in k (the rate approaches one as k goes to infinity). Therefore, GC codes can achieve the list decoding properties given in Theorems 1 and 2 with a logarithmic redundancy and an asymptotically optimal code rate.
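The rate claim above can be sanity-checked numerically. The sketch below is our own illustration and assumes, for concreteness, a redundancy of the form c·(δ+1)·⌈log2 k⌉ bits (c parity symbols of ⌈log2 k⌉ bits each, each bit protected by a (δ+1)-fold repetition code); the exact constant is an assumption of this sketch and does not affect the asymptotics.

```python
from math import ceil, log2

def gc_rate(k, c, delta):
    """Code rate k/n of a GC code, under the assumed redundancy
    n - k = c * (delta + 1) * ceil(log2 k) bits."""
    redundancy = c * (delta + 1) * ceil(log2(k))
    return k / (k + redundancy)

# With c and delta constant, the redundancy is logarithmic in k,
# so the rate is increasing in k and approaches one.
rates = [gc_rate(k, c=5, delta=2) for k in (2**8, 2**12, 2**16, 2**20)]
assert all(r2 > r1 for r1, r2 in zip(rates, rates[1:]))
assert rates[-1] > 0.99
```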
IV-A Proof of Theorem 1
For brevity, we use L instead of L(u) throughout the proof. As previously mentioned, if L > 1, then we say that the decoder failed to decode uniquely. The probability of failure Pr(F) for a uniform iid binary message u of length k bits, and any deletion pattern d, is upper bounded by the expression given in (2). Recall that 1 ≤ L ≤ t, where t is the total number of cases checked by the GC decoder, given by (1). The lower bound on E[L] follows directly from the fact that L ≥ 1, and hence E[L] ≥ 1. To upper bound E[L], we write E[L] ≤ 1·Pr(L = 1) + t·Pr(L > 1) ≤ 1·(1 − Pr(F)) + t·Pr(F) = 1 + (t − 1)·Pr(F).
IV-B Proof of Corollary 1
Let B(k) ≜ (t − 1)·Pr(F). Based on Theorem 1 we have 1 ≤ E[L] ≤ 1 + B(k).
To prove that E[L] approaches one, we derive conditions on the code parameters under which B(k) is guaranteed to vanish asymptotically in k. B(k) goes to zero as k approaches infinity if the denominator in its mathematical expression grows faster than the numerator. This yields a first condition on the chunking length ℓ and a second condition on the number of parity symbols c. Therefore, for choices of ℓ and c satisfying these two conditions, E[L] approaches one as the message length k (or equivalently the block length n) goes to infinity.
IV-C Sketch of the Proof of Theorem 2
In order to express the variation of the list size with respect to the message length k, we use P_k(L = j) instead of P(L = j); P_k(L = j) refers to the probability that the list size is j for a uniform iid message of length k. Figure 2 illustrates the high-level idea of the proof of Theorem 2. The proof is based on the following two lemmas, whose proofs are given in the appendix.
Lemma 1 (informal).
For any fixed list size j > 1, there exists an infinitely increasing sequence of message lengths over which P_k(L = j) is non-increasing.
Lemma 2 (informal).
For any fixed message length k, if P_k(L = j) = 0, then P_k(L = j′) = 0 for all j′ > j.
The high-level idea of the proof is the following. Consider a sufficiently large fixed message length k_0. By definition, we have P_{k_0}(L = j) = 0 for all j > L_max(k_0), where L_max(k_0) is the maximum list size of any message of length k_0. Based on Lemma 1, for the fixed list size j = L_max(k_0) + 1, there exists an infinitely increasing sequence of message lengths {k_i}, such that P_k(L = j) is non-increasing over this set of message lengths. Therefore, P_{k_i}(L = j) = 0 for all i. Furthermore, based on Lemma 2, P_{k_i}(L = j′) = 0 for all j′ > j. By combining these results, we have L_max(k_i) ≤ L_max(k_0) for all i, and hence the maximum list size is upper bounded by the constant L_max(k_0).
IV-D Proof of Theorem 2
As previously mentioned, for a uniform iid message u of length k bits, Pr(L > 1) is the probability Pr(F) that the decoder fails to decode uniquely. Therefore, from (2), for any fixed j > 1 we have P_k(L = j) ≤ g(k), where g(k) denotes the upper bound in (2). It can be easily verified that g(k) is a strictly decreasing function of k for appropriate choices of the code parameters ℓ and c.
For any fixed j > 1, there exists an infinitely increasing sequence of message lengths {k_i}, such that P_{k_{i+1}}(L = j) ≤ P_{k_i}(L = j) for all i.
See Appendix A. ∎
For any fixed message length k, and any j > 1, if P_k(L = j) = 0, then P_k(L = j + 1) = 0.
See Appendix B. ∎
Consider a sufficiently large fixed message length k_0. Lemma 1 implies that, for any fixed j > 1, there exists an infinitely increasing sequence of message lengths over which P_k(L = j) is non-increasing. Let {k_i} be the sequence associated with the fixed list size j* = L_max(k_0) + 1, where L_max(k_0) is the maximum decoding list size for any message of length k_0 bits. By definition, we have P_{k_0}(L = j*) = 0.
Consider the set {k_i : k_i ≥ k_0}; since P_k(L = j*) is non-increasing over this set, we have P_{k_i}(L = j*) = 0 for all k_i in this set.
Furthermore, from Lemma 2 we have, for all j > j*, P_{k_i}(L = j) = 0.
This proves that there exists an infinite sequence of GC codes, indexed by {k_i}, whose maximum list size is upper bounded by a constant (L_max(k_0) is a constant that does not increase with k).
The result of Theorem 2 can be used to improve the upper bound in Theorem 1. Namely, based on Theorem 2, in (9) we can upper bound the list size by a constant instead of the total number of guesses t given by (1). However, in that case, the result on E[L] would be restricted to a sequence of GC codes, whereas Theorem 1 is a universal result for all GC codes.
We performed numerical simulations on the average list size and the maximum list size for a range of message lengths k, and for multiple values of the number of deletions δ. The empirical results are shown in Table II.
In each simulation run, δ bits are deleted based on a uniformly distributed deletion pattern, and the resulting string is decoded.
The results show that: (i) the average list size is very close to one, and its value approaches one further as k increases; and (ii) the maximum list size recorded does not increase with k. These numerical results are consistent with Theorems 1 and 2, which show that E[L] → 1, and that L_max is bounded by a constant for an infinite sequence of GC codes. Note that the redundancy used for the simulations is much smaller than the one suggested by Theorems 1 and 2. This is due to the fact that GC codes perform better than what the theoretical bounds indicate, as discussed in [1].
Appendix A Proof of Lemma 1
For any fixed j > 1, there exists an infinitely increasing sequence of message lengths {k_i}, such that P_{k_{i+1}}(L = j) ≤ P_{k_i}(L = j) for all i.
Lemma 1 follows from the fact that for any fixed j > 1, P_k(L = j) ≤ g(k), where g(k), the upper bound in (2), is a strictly decreasing function of k for appropriate choices of the code parameters ℓ and c. Therefore, for any fixed j > 1, P_k(L = j) → 0 as k → ∞. Hence, by the definition of limits, for every ε > 0 there exists a K such that P_k(L = j) < ε for all k > K. This allows us to extract an infinitely increasing sequence of message lengths {k_i} over which P_k(L = j) is non-increasing.
Appendix B Proof of Lemma 2
For any fixed message length k, and any j > 1, if P_k(L = j) = 0, then P_k(L = j + 1) = 0.
Lemma 2 follows from the construction of GC codes described in Section II. To prove Lemma 2, we first introduce some notation. Consider a message u of fixed length k and an arbitrarily fixed deletion pattern d. Let E be the GC encoding function (Fig. 1) that maps the message to the corresponding codeword. Let D be the deletion function that maps the transmitted codeword to the received string based on the deletion pattern d. Recall that the GC decoder generates t guesses, where each guess corresponds to a string that is decoded based on a certain assumption on the locations of the deletions. Let g_i be the function that decodes the received string based on the i-th guess, where i = 1, …, t.
Since the encoding and decoding algorithms of GC codes are deterministic, and the deletion pattern is fixed, the functions E, D, and g_i are deterministic functions of their inputs. Let f_i be the composition of these functions, given by f_i = g_i ∘ D ∘ E, for i = 1, …, t. Note that {f_i} is a family of distinct deterministic functions mapping the message to a decoded string.
For a given message u, let P_u ⊆ {1, …, t} be the set of indices of the functions f_i that result in different decoded strings. Namely, for all i, j ∈ P_u with i ≠ j, we have f_i(u) ≠ f_j(u). The guessing phase results in an initial list of decoded strings given by {f_i(u) : i ∈ P_u}.
In the checking phase, each string in the initial list is checked with the remaining MDS parities in order to eliminate the strings which are not consistent with these parities. Let G be the matrix consisting of the encoding vectors corresponding to the MDS parities, and let p be the vector containing the values of the parities corresponding to the encoding of u. The final list of candidate strings is then given by the strings f_i(u), i ∈ P_u, that satisfy G·f_i(u) = p.
For brevity, henceforth we refer to f_i(u) by f_i. For i ≠ j, consider the sets A = {u : f_i ≠ f_j} and B = {u : G·f_i = G·f_j}. Note that the condition in set B is equivalent to G·(f_i − f_j) = 0. (23) The next claim states that |A ∩ B| ≤ |A|/q, which is intuitive since the intersection with B introduces an additional equality constraint which is generally more restrictive than the "not equal" constraint of A.
Since A ∩ B ⊆ A, it suffices to show that at most a 1/q fraction of the messages in A lie in B. If A is empty, then it is trivial that |A ∩ B| ≤ |A|/q. If A is non-empty, then we consider the following cases:
Case 1 ((23) has no solutions): Then A ∩ B = ∅, and hence the inequality is satisfied.
Case 2 ((23) holds for all messages u): Since the functions E and D are fixed for all f_i, this would give i ≠ j satisfying (24). Since q is a prime power, (24) implies that f_i = f_j with i ≠ j, which contradicts the fact that {f_i} is a family of distinct functions. Therefore, this case is not possible.
Case 3 ((23) is satisfied by a proper subset of the messages): The intersection of set B with set A can be seen as introducing an additional equality constraint to the set A, given by (23). The constraint given by (23) corresponds to a set of linear congruences in GF(q), where the variables in these congruences are the bits of the message u. Since the messages in A ∩ B must satisfy these linear congruences, the number of degrees of freedom of the bits of u is decreased by at least ℓ. In other words, the size of the set is decreased by at least a factor of q = 2^ℓ, i.e., |A ∩ B| ≤ |A|/q.
Recall that L is the final list size defined in (3), and P(L = j) refers to the probability that the list size is exactly j for a uniform iid message u. Next, we prove that if P(L = j) = 0, then P(L = j + 1) = 0. Note that proving this statement is equivalent to proving Lemma 2.
The event L = j can result from multiple cases, where each case corresponds to a certain combination of the functions f_i satisfying equality constraints of the form of (23), whereas the remaining functions do not satisfy these constraints. The set A ∩ B represents one of these cases which lead to L = j. We index these cases by s, and refer to each case by its corresponding set A_s ∩ B_s. These sets are mutually disjoint because the conditions on u that define them are contradictory. Furthermore, it is easy to see that the result of Claim 1 applies to all of the cases, i.e., for every case s, we have |A_s ∩ B_s| ≤ |A_s|/q. Therefore,
(26) follows from the fact that the cases correspond to mutually disjoint sets. (27) follows from the fact that the message u is uniform iid. (28) follows from Claim 1. Note that for every case s, we have P(A_s ∩ B_s) ≤ P(A_s)/q. Hence, if P(L = j) = 0, then P(A_s) = 0 for all s, and from (28) we get P(L = j + 1) = 0. This proves that if P(L = j) = 0, then P(L = j + 1) = 0. Similarly, for any j′ > j, we can apply the same approach to get P(L = j′) = 0. Therefore, if P(L = j) = 0, then P(L = j′) = 0 for all j′ > j. ∎
[1] S. Kas Hanna and S. El Rouayheb, "Guess & check codes for deletions, insertions, and synchronization," IEEE Transactions on Information Theory, 2018 (IEEE Early Access).
[2] R. Venkataramanan, V. N. Swamy, and K. Ramchandran, "Low-complexity interactive algorithms for synchronization from deletions, insertions, and substitutions," IEEE Transactions on Information Theory, vol. 61, no. 10, pp. 5670–5689, 2015.
[3] R. G. Gallager, "Sequential decoding for binary channels with noise and synchronization errors," tech. rep., DTIC Document, 1961.
[4] R. Varshamov and G. Tenengol'ts, "Correction code for single asymmetric errors," Automat. Telemekh, vol. 26, no. 2, pp. 286–290, 1965.
[5] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," in Soviet Physics Doklady, vol. 10, p. 707, 1966.
[6] D. Cullina and N. Kiyavash, "An improvement to Levenshtein's upper bound on the cardinality of deletion correcting codes," IEEE Transactions on Information Theory, vol. 60, pp. 3862–3870, July 2014.
[7] A. S. Helberg and H. C. Ferreira, "On multiple insertion/deletion correcting codes," IEEE Transactions on Information Theory, vol. 48, no. 1, pp. 305–308, 2002.
[8] J. Brakensiek, V. Guruswami, and S. Zbarsky, "Efficient low-redundancy codes for correcting multiple deletions," IEEE Transactions on Information Theory, vol. 64, pp. 3403–3410, May 2018.
[9] R. Gabrys and F. Sala, "Codes correcting two deletions," arXiv preprint arXiv:1712.07222, 2017.
[10] L. J. Schulman and D. Zuckerman, "Asymptotically good codes correcting insertions, deletions, and transpositions," IEEE Transactions on Information Theory, vol. 45, no. 7, pp. 2552–2557, 1999.
[11] V. Guruswami and C. Wang, "Deletion codes in the high-noise and high-rate regimes," IEEE Transactions on Information Theory, vol. 63, pp. 1961–1970, April 2017.
[12] M. Abroshan, R. Venkataramanan, and A. G. i Fabregas, "Multilayer codes for synchronization from deletions," in 2017 IEEE Information Theory Workshop (ITW), pp. 449–453, Nov 2017.
[13] K. Tian, A. Fazeli, A. Vardy, and R. Liu, "Polar codes for channels with deletions," in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 572–579, Oct 2017.
[14] V. Guruswami and R. Li, "Coding against deletions in oblivious and online models," arXiv preprint arXiv:1612.06335, 2016.
[15] P. Elias, "List decoding for noisy channels," Quarterly Progress Report, Massachusetts Institute of Technology, 1957.
[16] J. M. Wozencraft, "List decoding," Quarterly Progress Report, Massachusetts Institute of Technology, vol. 48, pp. 90–95, 1958.
[17] V. Guruswami and M. Sudan, "Improved decoding of Reed-Solomon and algebraic-geometric codes," in Proceedings of the 39th Annual Symposium on Foundations of Computer Science, pp. 28–37, Nov 1998.
[18] R. Pellikaan and X.-W. Wu, "List decoding of q-ary Reed-Muller codes," IEEE Transactions on Information Theory, vol. 50, pp. 679–682, April 2004.
[19] I. Tal and A. Vardy, "List decoding of polar codes," IEEE Transactions on Information Theory, vol. 61, pp. 2213–2226, May 2015.
[20] A. Wachter-Zeh, "List decoding of insertions and deletions," IEEE Transactions on Information Theory, vol. 64, pp. 6297–6304, Sept 2018.