Consider a source sequence $X^n = (X_1, \ldots, X_n)$
with independent and identically distributed (i.i.d.) components having probability mass function $p_X$ on a finite alphabet $\mathcal{X}$. For simplicity we assume here that $p_X$ is a known distribution over $\mathcal{X} = \{0,1\}$. (This can be generalized to the scenario where
$p_X$ is unknown to the encoder and decoder by first estimating $p_X$ and then using this estimate to design the compression scheme as in [vatedka2019local]. Similarly, our results can be generalized to nonbinary alphabets.)
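A fixed-length compression scheme of rate $R$ consists of a pair of algorithms, the encoder $\mathrm{ENC}$ and the decoder $\mathrm{DEC}$. The encoder maintains for every $x^n \in \mathcal{X}^n$ a codeword $c^{nR} = \mathrm{ENC}(x^n) \in \{0,1\}^{nR}$ such that $\hat{x}^n = \mathrm{DEC}(c^{nR})$ is a good estimate of $x^n$. The probability of error of the compression scheme is defined as
$$P_e = \Pr\left[\mathrm{DEC}(\mathrm{ENC}(X^n)) \neq X^n\right].$$
From the source coding theorem, we know that there exist sequences of codes with rate arbitrarily close to the entropy $H(p_X)$ and error probability vanishing in $n$.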
Our goal is to design a fixed-length compression scheme that additionally supports local encoding and decoding. A locally decodable and updatable compression scheme consists of a global encoder and decoder pair and in addition, a local decoder and a local updater:
Local decoder: A local decoder is an algorithm which, given an index $i \in \{1, \ldots, n\}$, probes (possibly adaptively) a small number of bits of $c^{nR}$ to output $\hat{x}_i$. Here, $\hat{x}_i$ is the $i$th symbol of $\hat{x}^n = \mathrm{DEC}(c^{nR})$. The worst-case local decodability $d_{\mathrm{wc}}$ of the scheme is the maximum number of bits probed by the local decoder for any $i$. The average local decodability $d_{\mathrm{avg}}$ is the expected number of bits (averaged over the source distribution) probed to recover any $\hat{x}_i$.
Local updater: A local updater is an algorithm which, given an index $i$ and a new symbol $x_i'$, adaptively reads and modifies a small number of bits of $c^{nR}$ to give a codeword for the updated sequence $(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)$. In particular, the local updater has no prior knowledge of $x^n$ or $c^{nR}$ and must probe $c^{nR}$ to obtain such information.
The worst-case update efficiency $u_{\mathrm{wc}}$ is defined as the maximum of the sum of the bits read and written in order to update any $x_i$. For the average case, we assume that $x_i'$ is distributed according to $p_X$ and is independent of $x^n$. The average update efficiency $u_{\mathrm{avg}}$ is then the sum of the average number of bits probed and written in order to update any $x_i$.
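It was recently shown in [vatedka2019local] that average local decodability and average update efficiency $(d_{\mathrm{avg}}, u_{\mathrm{avg}}) = (O(1), O(1))$
are achievable. In that paper the authors also gave a separate compression scheme with a different worst-case locality trade-off. (Throughout the paper, we use logarithms to base two.)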
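In particular, the question of whether worst-case local decodability and update efficiency $(d_{\mathrm{wc}}, u_{\mathrm{wc}}) = (O(\log\log n), O(\log\log n))$
are simultaneously achievable was left open. In this paper we answer this question in the affirmative. We also show that, under certain additional assumptions on the local decoder and the local updater, this locality is order optimal.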
Our achievability proof is based on a novel succinct data structure for -sparse sequences of length in the bitprobe model which for any takes space while enabling local decode and update using at most and bit reads/writes respectively. Our restricted converse is based on an analysis of bipartite graphs that represent the encoding and decoding algorithms.
I-A Prior work
Local decoding and update for entropy-achieving compression schemes have been studied mostly in isolation. The problem of locally decodable source coding of random sequences has received attention very recently following [makhdoumi2013locally-arxiv, makhdoumi_onlocallydecsource]. Mazumdar et al. [mazumdar2015local] gave a fixed-length compressor of rate of with . They also provided a converse result for non-dyadic sources: for any compression scheme that achieves rate . Similar results are known for variable length compression [pananjady2018effect] and universal compression of sources with memory [tatwawadi18isit_universalRA]. Likewise, there are compressors that achieve [montanari2008smooth] rate and update efficiency .
In the computer science community, the literature has mostly focused on the word-RAM model [patrascu2008succincter, patrascu2014dynamic, makinen2006dynamic, sadakane2006squeezing, navarro2014optimal, viola2019howtostore], where each operation (read, write, or arithmetic) is on words of $\Theta(\log n)$ bits each, and the complexity is measured in terms of the number of word operations required for local decoding/update. However, in this case the number of bitprobes required is $\Omega(\log n)$. For random messages, it is almost trivial to obtain local decoding/update efficiency of $O(\log n)$ bitprobes by partitioning the message symbols into blocks of size $O(\log n)$ and compressing each block separately.
Before we present our main results we make a few observations aimed at justifying our model, and in particular the requirements we impose on the local decoder (see previous section).
Note that for any index $i$ the local decoder must output $\hat{x}_i$, the $i$th symbol of the global decoder's estimate. We could potentially relax this constraint by requiring that the local decoder produce an estimate $\tilde{x}_i$, potentially different from $\hat{x}_i$, with small local error probability $\Pr[\tilde{X}_i \neq X_i]$.
As we argue next, if we require this local error probability to be small (but non-vanishing), with either no global decoder or a separate global decoder that achieves vanishing error probability, then from a coding perspective the solution is essentially trivial.
II-1 Only local decodability and non-vanishing error
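If we only require the local error probability to be small, without constraints on the global decoder, then constant locality is easy to achieve at rates close to the entropy.
This can be obtained by partitioning the length-$n$ message into blocks of constant size $b$ and compressing each block using an entropy-achieving fixed-length compression scheme; notice that the probability of wrongly decoding any particular block vanishes exponentially with $b$. Hence, for any small but constant target error probability, a constant block size suffices, and both local decoding and local update touch only the $O(1)$ codeword bits of a single block.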
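The block-wise construction can be sketched in a few lines of Python. This is only an illustration under assumptions not made in the text: the source is binary, the "typical" blocks are taken to be those of Hamming weight at most `w`, and atypical blocks are simply mapped to index 0 (and hence decoded in error). The names `build_codebook`, `encode`, and `local_decode` are hypothetical.

```python
from itertools import combinations
from math import ceil, log2


def build_codebook(b, w):
    """List every length-b binary block of Hamming weight <= w (assumed typical set)."""
    blocks = []
    for weight in range(w + 1):
        for ones in combinations(range(b), weight):
            blocks.append(tuple(1 if j in ones else 0 for j in range(b)))
    return blocks


def encode(x, b, w):
    """Compress each length-b block independently into a fixed-length index."""
    book = build_codebook(b, w)
    index = {blk: i for i, blk in enumerate(book)}
    width = max(1, ceil(log2(len(book))))  # codeword bits per block
    code = []
    for start in range(0, len(x), b):
        blk = tuple(x[start:start + b])
        i = index.get(blk, 0)  # atypical block: stored as index 0, decoded in error
        code.extend(int(ch) for ch in format(i, f"0{width}b"))
    return code, width


def local_decode(code, width, b, w, i):
    """Recover symbol i by probing only the `width` bits of its own block."""
    book = build_codebook(b, w)
    blk_no, offset = divmod(i, b)
    bits = code[blk_no * width:(blk_no + 1) * width]
    idx = int("".join(map(str, bits)), 2)
    return book[idx][offset]
```

With `b = 4` and `w = 1` the codebook has 5 entries, so each block is stored in 3 bits and `local_decode` probes exactly those 3 bits, independently of the message length.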
II-2 Separate local and global decoders
Suppose that in addition to a small local error probability we also want a vanishing block error probability, using a separate global decoder to recover $x^n$. This can be obtained by using a low-density parity check (LDPC) code with constant maximum variable and check node degrees. The codeword consists of two parts, $c = (c^{(1)}, c^{(2)})$,
where $c^{(1)}$ is obtained as in the previous case by dividing the message into constant size blocks and separately encoding each, while
$c^{(2)}$ is obtained as the syndrome (of the LDPC code) of the (Hamming) error vector between $x^n$ and the decoding of $c^{(1)}$.
The local decoder only probes $c^{(1)}$, while the local updater needs to update both $c^{(1)}$ and $c^{(2)}$. Since we are using an LDPC code with constant variable node degree, $c^{(2)}$ can be updated using $O(1)$ bit modifications. Therefore, the local decodability and the update efficiency are both $O(1)$.
The global decoder decodes both $c^{(1)}$ and $c^{(2)}$ and can recover $x^n$ with vanishing probability of error.
As we see, the two settings above are essentially trivial cases from a coding perspective. Note also that the solution to the second setting generally depends on the rate of decay of the global error probability. Requiring the local decoder to output
$\hat{x}_i$ removes this degree of freedom (since a vanishing block error probability then implies a vanishing local error probability) and, as we argue below, is more interesting from a coding perspective. This setup is perhaps also more interesting in practice, since global decoding can be parallelized: with a large number of parallel processors, the runtime of global decoding can be made sublinear in $n$.
We will henceforth only consider compression schemes with local decoders that output $\hat{x}_i$.
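The main result of this article is the following: For any $\epsilon > 0$ there exists a compression scheme for Bernoulli($p$) sources that achieves rate $R = H(p) + \epsilon$ with worst-case local decodability and update efficiency $(d_{\mathrm{wc}}, u_{\mathrm{wc}}) = (O(\log\log n), O(\log\log n))$,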
and the overall computational complexity of global encoding/decoding is quasilinear in . The above theorem is formally proved in Section III-A. The proof of the above theorem is based on a novel dynamic succinct data structure for sparse sequences that achieves locality in the bitprobe model. Fix any . For every , there exists a dynamic succinct data structure for -length binary vectors of sparsity at most with the following properties. Any such vector occupies at most bits, has worst-case local decoding and worst-case update efficiency at most .
To prove a lower bound, we make three assumptions:
(A1) Global encoding and decoding using local algorithms: We assume that the codeword is obtained by running the local updater on each message symbol, and that the global estimate is obtained by running the local decoder for each bit. In other words, there is no separate global encoder or decoder.
(A2) Function assumption: The local update function for the $i$th update is a deterministic function of its current inputs, and is independent of the sequence of the previous updates. Likewise, the local decoder does not depend on the sequence of updates that have occurred previously.
(A3) Bounded average-to-worst case influence: For nonadaptive schemes, we can construct the corresponding local encoding graph $G_e$ and the local decoding graph $G_d$ as follows (see Fig. 1). A message symbol $x_i$ and a codeword bit $c_j$ are adjacent in $G_e$ if $c_j$ is a function of $x_i$. Likewise, a codeword bit $c_j$ and a decoded symbol $\hat{x}_i$ are adjacent in $G_d$ if $\hat{x}_i$ is a function of $c_j$. We assume that the ratio of the average to the worst-case degrees of the right vertices of $G_e$ is bounded from below by a constant independent of $n$, and similarly for the left vertices of $G_d$.
Remark: Under (A1) and (A2), any adaptive scheme (whether for update or decoding) achieving worst-case locality $t$ can be converted to a non-adaptive scheme with locality at most $2^t$. This is because the position probed by an adaptive scheme at each step is determined by the values read so far, so across all possible outcomes a scheme with locality $t$ may depend on up to $2^t$ different bits. Therefore a lower bound of $L$ on the locality of a non-adaptive scheme translates into a lower bound of $\log L$ on the locality of an adaptive scheme.
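The counting behind this remark can be checked with a short Python sketch. It enumerates every read history an adaptive prober could see in `t` rounds and collects the positions it might probe; a non-adaptive scheme can then simply read all of them. The interface `adaptive_step(history)` is a hypothetical model of an adaptive prober, not notation from the paper.

```python
def probe_set(adaptive_step, t):
    """Collect every position an adaptive prober might read in t rounds.

    `adaptive_step(history)` (hypothetical interface) receives the tuple of bits
    read so far and returns the next position to probe.
    """
    positions = set()
    frontier = [()]  # all possible read histories of the current length
    for _ in range(t):
        next_frontier = []
        for history in frontier:
            positions.add(adaptive_step(history))
            # The probed bit may turn out to be 0 or 1.
            next_frontier.extend([history + (0,), history + (1,)])
        frontier = next_frontier
    return positions
```

The decision tree has $2^t - 1$ internal nodes, so the returned set never has more than $2^t - 1$ elements; a prober that reads a different position at every node attains this count.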
For any adaptive scheme that satisfies (A1)–(A3) with , we have
As mentioned earlier, it is possible to achieve without assumptions (A1)–(A3). We conjecture that Theorem II-2 holds even without assumption (A3), but were unsuccessful in proving this. We also conjecture that it holds even without (A2). As we will see later, the data structure that leads to Lemma II-2 does not satisfy (A2).
The high-level structure of our scheme is inspired by the locally decodable compressor in [mazumdar2015local]. The idea in [mazumdar2015local] is to partition the set of message symbols into constant-sized blocks and use a fixed-length compressor for each block. The residual error vector is then encoded using the succinct data structure in [buhrman2002bitvectors] that occupies negligible space but allows local decoding of a single bit using bitprobes. However, this data structure is static, in the sense that it does not allow efficient updates and hence does not get us small .
A well-known dynamic data structure in the word-RAM model is the van Emde Boas tree [van1975preserving] which takes space but allows local retrieval, insert and delete in time (equivalently bitprobes). If we use the van Emde Boas tree for encoding the residual error vector, then we can achieve rate close to but a higher locality of .
Our main contribution is a novel dynamic succinct data structure (used to encode the residual error) which occupies roughly the same space as [buhrman2002bitvectors] but allows local decode and update using only bitprobes.
III-A Proof of Theorem II-2
We partition the -length message sequence into blocks of size each. Each block is further partitioned into subblocks () of symbols each. Each subblock is compressed independently using a fixed-length lossy compression scheme of rate and average per-letter distortion . Let denote this subcodeword for the th subblock. In addition, the error vectors (denoted and equal to if is atypical and otherwise) are concatenated and for each , is compressed using the scheme in Lemma II-2 to give with . The overall codeword is the concatenation of and .
As long as the sparsity of is less than for a suitably chosen , we can recover the th message block (or any symbol within it) without error.
We choose and .
Using Azuma’s inequality (and carefully choosing ), the probability that the distortion in each block is greater than falls as .
The worst-case local decoding is at most , while the worst-case update efficiency is . The compression rate is , and the overall probability of error (using the union bound over blocks) is . This completes the proof. ∎
All that remains is to prove Lemma II-2.
III-B A succinct data structure achieving locality for sparse sequences of length
The high-level idea in our data structure is to split the $n$ symbols into blocks of $b$ symbols each, and maintain a dynamic memory table where we only store the blocks with nonzero Hamming weight. Addressing is resolved by storing for each block a pointer which indicates the location in the memory table where the block is encoded.
Suppose that we split the $n$-length sequence $x^n$ into $n/b$ blocks of $b$ consecutive symbols each: $x^n = (x(1), \ldots, x(n/b))$. Let $k$ denote the maximum number of nonzero blocks that the structure must accommodate.
The data structure has the following components:
Status bits: These indicate whether each block is zero or not. We store $n/b$ bits $s_1, \ldots, s_{n/b}$, one for each block. The status bit $s_j$ is set to $1$ if the Hamming weight of $x(j)$ is greater than zero, and set to zero otherwise.
Memory table: The table stores information about the nonzero blocks. The table is organised into $k$ chunks of $b + \lceil \log(n/b) \rceil$ bits each. (The chunks refer to the virtual partition of the memory table into units of $b + \lceil \log(n/b) \rceil$ bits each. This is to distinguish them from the "blocks", which form a partition of the $n$-length input sequence.) The $m$th chunk is further split into two parts: $y_m$ having $b$ bits, and $r_m$ having $\lceil \log(n/b) \rceil$ bits. Here, $y_m$ is a vector which stores a nonzero block $x(j)$ for some (suitably defined later) $j$, while $r_m$ is a reverse pointer which encodes $j$ in $\lceil \log(n/b) \rceil$ bits.
Memory pointers: These indicate where each block is stored in the memory table. There are $n/b$ pointers $p_1, \ldots, p_{n/b}$ of $\lceil \log(k+1) \rceil$ bits each, one for each block.
Counter for number of nonzero blocks: The counter is a vector of length $\lceil \log(k+1) \rceil$ bits which stores the number of nonzero blocks in $x^n$.
The overall codeword is a bit sequence obtained by the concatenation of the status bits, memory table, memory pointers and the counter. The total space required is
$$\frac{n}{b} + k\left(b + \left\lceil \log\frac{n}{b} \right\rceil\right) + \frac{n}{b}\lceil \log(k+1) \rceil + \lceil \log(k+1) \rceil. \qquad (1)$$
The data structure is illustrated in Fig. 2.
III-B1 Initial encoding
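Let $k'$ denote the number of nonzero blocks in $x^n$.
Status bits: If block $x(j)$ has nonzero Hamming weight, then the status bit $s_j = 1$. Otherwise, it is set to zero.
Memory pointers: If $s_j = 0$, then the pointer $p_j = 0$. If not, and $x(j)$ is the $m$th nonzero block among $x(1), \ldots, x(n/b)$, then $p_j = m$ (or more precisely, the binary representation of $m$).
Counter for number of nonzero blocks: The counter is set to $k'$, the number of nonzero blocks.
Memory table: For every $j$, if $s_j = 1$ and $p_j = m$, then the block part of the $m$th chunk is $y_m = x(j)$ and its reverse pointer $r_m$ is equal to the binary representation of $j$.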
III-B2 Local decoding
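Suppose that we want to recover $x_i$, which happens to be the $l$th bit in the $j$th block (i.e., $i = (j-1)b + l$).
If the status bit $s_j = 0$, then output $0$. This is because $s_j = 0$ implies that the entire block is zero.
If not, then read the pointer $p_j$. If $p_j = m$, then output the $l$th bit in $y_m$, the block stored in the $m$th chunk.
The maximum number of bits probed is one status bit, plus the length of the pointer $p_j$, plus one bit of the chunk.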
III-B3 Local update
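Suppose that we want to update $x_i$ (which happens to be the $l$th bit in the $j$th block) with $x_i'$. The update algorithm works as follows:
Suppose that $x_i = 0$ and $x_i' = 1$. The updater first reads the status bit $s_j$.
If $s_j = 1$, then it reads the pointer $p_j$. Suppose that $p_j = m$. Then it writes $1$ into the $l$th location of $y_m$.
If $s_j = 0$, then it means that the block was originally a zero block. The updater sets $s_j$ to $1$, and increments the counter for the number of nonzero blocks by $1$. Suppose that after incrementing, the counter equals $m$. Then the updater sets $p_j = m$, writes the updated block (all zeros except for a $1$ in position $l$) into $y_m$, and sets the reverse pointer $r_m$ to $j$.
Suppose that $x_i = 1$ and $x_i' = 0$. The updater first reads $s_j$. Clearly, this should be equal to $1$. The updater reads $p_j$ (suppose that it is equal to $m$), and then $y_m$ to compute its Hamming weight.
If $y_m$ has Hamming weight greater than $1$, then it flips the $l$th bit of $y_m$.
If not, then it implies that the updated block is all zeros. The updater next sets $s_j$ to $0$. It then decrements the counter; suppose that the counter was equal to $m'$ before decrementing. It next overwrites the chunk $(y_m, r_m)$ with the contents of the chunk $(y_{m'}, r_{m'})$, and sets the pointer $p_{r_{m'}}$ to $m$. This consistently ensures that the first chunks of the memory table, as many as the counter indicates, always contain all the information about nonzero blocks.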
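The encoding, local decoding, and local update steps above can be put together in a small Python sketch. It is a simplified model rather than a bit-exact implementation: Python lists and integers stand in for the bit fields, indices are 0-based except for chunk numbers (1-based, with pointer 0 meaning "no chunk"), and the class name `SparseDynamicBits` and the capacity parameter `k_max` are hypothetical. Two details go beyond the description and are assumptions of this sketch: updates that do not change the bit are accepted, and the pointer of a block that becomes zero is reset to 0.

```python
class SparseDynamicBits:
    """Status bits + pointers + memory table + counter for a sparse bit vector."""

    def __init__(self, x, b, k_max):
        assert len(x) % b == 0
        self.b, self.k_max = b, k_max
        nblocks = len(x) // b
        self.status = [0] * nblocks                    # 1 iff the block is nonzero
        self.ptr = [0] * nblocks                       # chunk number, 0 = none
        self.blocks = [[0] * b for _ in range(k_max)]  # block part of each chunk
        self.rev = [0] * k_max                         # reverse pointer of each chunk
        self.count = 0                                 # number of nonzero blocks
        for j in range(nblocks):
            blk = list(x[j * b:(j + 1) * b])
            if any(blk):
                self._allocate(j)
                self.blocks[self.ptr[j] - 1] = blk

    def _allocate(self, j):
        """Give zero block j the next free chunk (first `count` chunks are in use)."""
        assert self.count < self.k_max, "more nonzero blocks than chunks"
        self.count += 1
        self.status[j] = 1
        self.ptr[j] = self.count
        self.blocks[self.count - 1] = [0] * self.b
        self.rev[self.count - 1] = j

    def decode(self, i):
        """Local decoding: status bit, then pointer, then one bit of the chunk."""
        j, l = divmod(i, self.b)
        if self.status[j] == 0:
            return 0
        return self.blocks[self.ptr[j] - 1][l]

    def update(self, i, bit):
        """Local update of symbol i to `bit`."""
        j, l = divmod(i, self.b)
        if bit == 1:
            if self.status[j] == 0:
                self._allocate(j)
            self.blocks[self.ptr[j] - 1][l] = 1
            return
        if self.status[j] == 0:
            return  # block already zero, nothing to do
        m = self.ptr[j]
        self.blocks[m - 1][l] = 0
        if any(self.blocks[m - 1]):
            return  # block is still nonzero
        # Block became zero: move the last used chunk into slot m.
        last = self.count
        if m != last:
            self.blocks[m - 1] = self.blocks[last - 1]
            self.rev[m - 1] = self.rev[last - 1]
            self.ptr[self.rev[m - 1]] = m
        self.blocks[last - 1] = [0] * self.b
        self.rev[last - 1] = 0
        self.status[j] = 0
        self.ptr[j] = 0
        self.count -= 1
```

Moving the last used chunk into the freed slot keeps the occupied chunks contiguous, which is why a single counter suffices to find free space; the reverse pointer is what lets the updater fix the pointer of the moved block without searching.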
The maximum number of bits that need to be read and written in order to update a single message bit is
III-B4 Proof of Lemma II-2
Let us now prove the statement. We use the above scheme with . From (1), the total space used is
The worst-case local decoding is equal to and the worst-case update efficiency is equal to . This completes the proof. ∎
The succinct data structure here satisfies (A1) but not (A2). Clearly, the order of the chunks in the memory table depends on the sequence of updates performed previously. For example, the first chunk could initially contain the data of the first block. After a number of updates (e.g., setting all bits of the first block to zero, inserting bits in other blocks, and then repopulating the first block with ones), the first block could be stored in a different chunk.
IV Lower bounds on simultaneous locality
To obtain lower bounds, we introduce an additional parameter that we might be interested in minimizing: the worst-case local encodability, , defined to be the maximum number of input symbols that any single codeword bit can depend on. Note that this is different from the update efficiency. This was studied in [mazumdar2017semisupervised], where the authors related this quantity to a problem of semisupervised learning. It has been established in the literature that separately, each of can be made for near-entropy compression. However, it is not known if
can be simultaneously achieved.
In this section, we assume (A1)–(A3) and that the scheme is nonadaptive.
IV-1 The local encoding and decoding graphs
Our derivation of the lower bound involves analyzing the connectivity properties of two bipartite graphs that describe the local encoding and decoding functions. As we will demonstrate, the various probabilities of error are influenced by the degrees of these bipartite graphs, whose values are governed by , and .
Under (A1) and (A2), the $j$th compressed bit can be written as $c_j = f_j(x_{N_e(j)})$ for some function $f_j$, where $N_e(j)$ is the set of message locations that $c_j$ can depend on.
The $i$th decoded bit can be written as $\hat{x}_i = g_i(c_{N_d(i)})$ for some function $g_i$, where $N_d(i)$ is the set of codeword locations that need to be probed in order to recover $\hat{x}_i$.
We can construct an encoder bipartite graph $G_e$ where $(x_i, c_j)$ is an edge only if $i \in N_e(j)$, and a decoder bipartite graph $G_d$ where $(c_j, \hat{x}_i)$ is an edge if $j \in N_d(i)$. See Fig. 1 for an illustration.
Let $|N_e(j)|$ be the degree of the $j$th right vertex in $G_e$. This is equal to the local encodability of the $j$th codeword symbol and is at most the worst-case local encodability.
This gives a natural lower bound on the update efficiency for the $i$th message symbol: it must be greater than or equal to the degree of the $i$th left vertex in $G_e$.
The average degree of a left vertex (corresponding to a message symbol) is equal to $R$ times the average degree of a right vertex (corresponding to a codeword bit), since the $n$ left vertices and the $nR$ right vertices are incident to the same set of edges. This implies that the worst-case update efficiency is lower bounded by $R$ times the arithmetic mean of the local encodabilities of the individual codeword bits.
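The double-counting step can be verified numerically. The sketch below is an illustration with hypothetical names (`degree_summary`, and an edge list of `(message index, codeword index)` pairs); it computes the average left and right degrees of a bipartite graph exactly, together with the maximum left degree.

```python
from collections import Counter
from fractions import Fraction


def degree_summary(edges, n, m):
    """Degrees of a bipartite graph with n left and m right vertices.

    `edges` is a set of pairs (i, j): message symbol i, codeword bit j.
    Returns (average left degree, average right degree, maximum left degree).
    """
    left = Counter(i for i, _ in edges)
    right = Counter(j for _, j in edges)
    avg_left = Fraction(sum(left.values()), n)
    avg_right = Fraction(sum(right.values()), m)
    return avg_left, avg_right, max(left.values(), default=0)
```

Since both sides count the same edges, `avg_left == Fraction(m, n) * avg_right` always holds, and the maximum left degree is at least `avg_left`; with `m = nR` this is the factor $R$ in the bound.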
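Likewise, consider the degree of the $j$th left vertex of $G_d$ (a codeword bit) and that of the $i$th right vertex of $G_d$ (a decoded symbol). Note that the latter, $|N_d(i)|$, is equal to the local decodability of $\hat{x}_i$ (the maximum number of codeword bits to be probed to recover $\hat{x}_i$). By definition, $|N_d(i)| \le d_{\mathrm{wc}}$ for all $i$.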
We also define the so-called effective neighbourhood of to be
The th decoded symbol is a function of only those symbols in , i.e., there exists a function such that . This is illustrated in Fig. 3.
Let us now obtain a lower bound on the bit error probability of any compression scheme satisfying the above properties. We would like to point out that the following Lemmas IV-1 and IV only make use of assumptions (A1) and (A2), and do not require (A3). The probability of bit error, satisfies
For every , the decoded symbol is a deterministic function (the composition of and ’s) of . But we have . The probability of error is given by (for an event $\mathcal{E}$, $1_{\mathcal{E}}$ is the indicator function, which takes value $1$ if $\mathcal{E}$ occurs and zero otherwise)
If there is even a single configuration of for which , then . ∎
Observe that Lemma IV-1 holds even without assumption (A). Also, if , then by the pigeonhole principle for a nonvanishing fraction of ’s. Since , we can conclude that any compression scheme satisfying (A)–(A) and having vanishing probability of error and nontrivial rate must satisfy . In other words, one cannot achieve .
A similar observation was made in [makhdoumi_onlocallydecsource] for linear locally decodable (but not updatable) compression schemes. If the encoder of the compressor must be linear, then if we desire a vanishingly small probability of error.
We can use the above lemma to prove a lower bound on simultaneous local encodability and decoding. Any fixed-length compression scheme achieving a vanishing probability of error, and satisfying assumptions (A1) and (A2) must necessarily satisfy
If we operate at a rate , then by the pigeonhole principle, it must be the case that at least bits have a nonzero bit error probability. Call this set of bits/locations .
We will now select a subset of many ’s whose probabilities of error are all nonzero but whose error events are statistically independent of each other. We can construct such an in a greedy manner.
For every , the corresponding is a deterministic function of . In other words, each depends on at most many ’s. We can start with the empty set, and iteratively put those ’s in which are in but not in . It is clear that
Moreover, by construction, we have that all the events are independent for all (since the corresponding sets of ’s they depend on are disjoint).
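The greedy selection can be sketched as follows. This is an illustration only: `neighbourhoods` is a hypothetical dictionary mapping the index of a decoded symbol with nonzero error probability to the set of message symbols its estimate depends on, and indices are scanned in insertion order.

```python
def greedy_disjoint(neighbourhoods):
    """Greedily pick indices whose dependency sets are pairwise disjoint."""
    chosen, used = [], set()
    for i, nbrs in neighbourhoods.items():
        if used.isdisjoint(nbrs):  # shares no message symbol with earlier picks
            chosen.append(i)
            used |= nbrs
    return chosen
```

Because the selected dependency sets do not overlap and the source symbols are independent, the corresponding error events are independent, which is what allows their probabilities to be multiplied in the block error bound.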
Let . Using Lemma IV-1, the probability of block error is bounded as
The above quantity is vanishing in only if ∎
IV-A Lower bound on and proof of Theorem II-2
We can in fact tighten the analysis above to prove an identical lower bound on .
Let . Let respectively denote the update efficiency for the th message symbol, the local encodability of the th codeword symbol, and the local decoding of the th message symbol. These are also respectively greater than or equal to the degrees of the th left vertex in , the th right vertex in , and the th right vertex in .
Consider any fixed-length compression scheme achieving vanishing probability of error and nontrivial compression rate . There exists a set of message symbols with such that for all ,
The proof is very similar to that of Lemma IV, so we only sketch the details. We select a subset of such that for all . We know that . A more careful rederivation of Lemma IV-1 gives us for all . Hence,
which is nonvanishing in if for any subset of of size . Therefore, there must exist a subset of size where for all . This completes the proof. ∎
This leads us to the following result: Any fixed-length compression scheme achieving vanishing probability of error and , and satisfying assumptions (A1)—(A3) and nonadaptive local algorithms must have
From Lemma IV-A, there exists a set of size such that for all , . However, . From the graph we also have
where denotes the degree of the th left vertex in .