This paper studies the fundamental limits of fixed-length lossless source coding in three scenarios:
Point-to-point: A single source is compressed and decompressed.
Multiple access: A fixed set of sources are separately compressed and jointly decompressed.
Random access: An arbitrary subset of a set of possible sources are separately compressed and jointly decompressed.
The information-theoretic limit in these three operational scenarios is the set of code sizes or rates at which a desired level of reconstruction error is achievable. Shannon’s theory 
analyzes this fundamental limit by taking an arbitrarily long encoding blocklength with a vanishing error probability. Since most real-world applications are delay- and computation-sensitive, it is of practical interest to analyze finite-blocklength fundamental limits. Following [2, 3, 4, 5], we allow a non-vanishing error probability and study refined asymptotics of the achievable rates as functions of the encoding blocklength $n$.
In point-to-point almost-lossless source coding, non-asymptotic bounds and asymptotic expansions of the minimum achievable rate have been given in several prior works. In particular, Kontoyiannis and Verdú have given a third-order-optimal characterization of the minimum achievable rate at blocklength $n$ and target error probability $\epsilon$ by analyzing the optimal code. For a finite-alphabet stationary memoryless source with single-letter distribution $P_X$, entropy $H(X)$, and varentropy $V(X) > 0$,
$$\log M^*(n, \epsilon) = nH(X) + \sqrt{nV(X)}\, Q^{-1}(\epsilon) - \frac{1}{2}\log n + O(1), \qquad (1)$$
with any higher-order term bounded by a constant; here $Q(\cdot)$
denotes the complementary Gaussian distribution function.
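The three-term expansion above is straightforward to evaluate numerically. The following sketch (illustrative only; the Bernoulli parameter and blocklength are our own choices, not from the paper) computes $nH(X) + \sqrt{nV(X)}\,Q^{-1}(\epsilon) - \frac{1}{2}\log n$ in nats for a Bernoulli source, with $Q^{-1}$ obtained by bisection:

```python
import math

def q_inv(eps, lo=-10.0, hi=10.0):
    # Inverse of the complementary Gaussian CDF Q, computed by bisection.
    def q(x):
        return 0.5 * math.erfc(x / math.sqrt(2.0))
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if q(mid) > eps:
            lo = mid   # Q is decreasing: Q(mid) too large means mid is too small
        else:
            hi = mid
    return (lo + hi) / 2.0

def normal_approx_log_M(n, eps, p):
    # Three-term expansion n*H + sqrt(n*V)*Q^{-1}(eps) - (1/2)*log(n)
    # for a Bernoulli(p) source, in nats.
    h = -p * math.log(p) - (1 - p) * math.log(1 - p)      # entropy H(X)
    i0, i1 = -math.log(1 - p), -math.log(p)               # information values
    v = (1 - p) * (i0 - h) ** 2 + p * (i1 - h) ** 2       # varentropy V(X)
    return n * h + math.sqrt(n * v) * q_inv(eps) - 0.5 * math.log(n)

# Example: blocklength 1000, target error 1e-3, Bernoulli(0.11).
approx = normal_approx_log_M(1000, 1e-3, 0.11)
```

For these parameters the expansion is on the order of 400 nats, i.e., roughly 0.41 nats per source symbol, noticeably above the entropy of about 0.35 nats because of the dispersion term.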
In multiple access lossless source coding, also known as the Slepian-Wolf (SW) setting, the object of interest is the set of achievable rate tuples, known as the rate region. The first-order-optimal rate region for general sources was studied in [11, 7]; the results in [11, 7] reduce to Slepian and Wolf's result for a stationary memoryless multiple source. The best prior asymptotic expansion of the SW rate region in terms of the encoding blocklength for a stationary memoryless multiple source is the second-order-optimal rate region, established independently in two prior works. The characterization by Tan and Kosut describes the rate region in a vector form that parallels the first two terms in (1). In this case, a quantity known as the entropy dispersion matrix plays a role similar to that of the varentropy $V(X)$. Their result suggests that the third-order term is bounded from above and from below by terms of order $\log n$.
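For a concrete sense of the entropy dispersion matrix, the sketch below computes, by direct enumeration, the mean and covariance of the vector of information densities $(\imath(x_1|x_2), \imath(x_2|x_1), \imath(x_1,x_2))$ for a small joint distribution. Note that this particular choice of information-density vector is our assumption about the Tan-Kosut-style formulation, and the doubly symmetric binary source used as input is purely illustrative:

```python
import math

def entropy_dispersion(p_joint):
    # p_joint: dict mapping (x1, x2) -> probability.
    # Returns the mean vector and covariance matrix (in nats) of
    # (i(x1|x2), i(x2|x1), i(x1, x2)) -- an assumed formulation of the
    # entropy dispersion matrix; the mean collects the conditional and
    # joint entropies.
    p1, p2 = {}, {}
    for (a, b), p in p_joint.items():
        p1[a] = p1.get(a, 0.0) + p
        p2[b] = p2.get(b, 0.0) + p
    vecs, probs = [], []
    for (a, b), p in p_joint.items():
        if p == 0:
            continue
        i12 = -math.log(p)                 # i(x1, x2)
        i1g2 = -math.log(p / p2[b])        # i(x1 | x2)
        i2g1 = -math.log(p / p1[a])        # i(x2 | x1)
        vecs.append((i1g2, i2g1, i12))
        probs.append(p)
    mean = [sum(pr * v[k] for pr, v in zip(probs, vecs)) for k in range(3)]
    cov = [[sum(pr * (v[i] - mean[i]) * (v[j] - mean[j])
                for pr, v in zip(probs, vecs)) for j in range(3)]
           for i in range(3)]
    return mean, cov

# Doubly symmetric binary source with crossover probability 0.1 (illustrative).
q = 0.1
pj = {(0, 0): 0.5 * (1 - q), (0, 1): 0.5 * q,
      (1, 0): 0.5 * q, (1, 1): 0.5 * (1 - q)}
mean, V = entropy_dispersion(pj)
```

Here `mean` recovers $(H(X_1|X_2), H(X_2|X_1), H(X_1,X_2))$, the corner points of the first-order SW region, while `V` is the corresponding dispersion matrix.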
In the setting of point-to-point almost-lossless source coding, our contribution is a precise characterization of the performance of random coding in terms of tight non-asymptotic bounds as well as their asymptotic expansions. By deriving the exact performance of random coding with the best possible threshold decoder, we conclude that random coding with threshold decoding cannot achieve the third-order term in (1), and thus is strictly suboptimal. We show that random coding with maximum likelihood decoding, however, achieves the first three terms in (1). We do this by deriving and carefully analyzing a source coding counterpart of the random-coding union (RCU) bound in channel coding [3, Th. 16]. The fact that our asymptotic expansion is achieved by a random code rather than by the optimal code has a number of important implications. First, it demonstrates that there is no loss (up to the third-order term) due to random coding, which implies the existence of a large number of codes with near-optimal performance. In particular, our RCU bound for source coding holds when restricted to linear compressors, implying that there are linear codes with near-optimal performance. Second, it enables generalization of this technique to source coding scenarios where the optimal code is not known; this is crucial since the point-to-point almost-lossless setting, where the optimal code is known, is quite exceptional.
While finding optimal SW codes is intractable in general, our derivation of the source coding RCU bound generalizes to multiple access scenarios. The resulting achievability bound and a converse result from [7, Lemma 7.2.2] together yield a third-order-optimal characterization of the SW rate region for a stationary memoryless multiple source (Theorem 9), revealing a third-order term of order $\log n$. This tightens the prior third-order bound, which grows linearly with the alphabet size and exponentially with the number of encoders. Our third-order-optimal characterization implies that for dependent sources, the SW code's independent encoders suffer no loss, up to the third-order term, relative to joint encoding with a point-to-point code.
The prior information theory literature studies multiple access source coding for scenarios where the number and identity of encoders are fixed and known. However, in applications like sensor networks, the internet of things, and random access communication, the number of transmitters communicating with a given access point may be unknown or time-varying. The information theory of random access channel coding is investigated in papers such as [14, 15, 16]. Here, we introduce the notion of random access source coding, which extends multiple access source coding to the scenario where the number of encoders is unknown a priori.
To begin our study, we first establish a probabilistic model for the object being compressed in random access source coding. We call that object a random access-discrete multiple source (RA-DMS), which consists of all possible collections of sources to be compressed. We then develop a robust coding scheme to achieve reliable compression of an arbitrary subset of the sources despite a lack of a priori knowledge of that subset. Since the SW rate region varies with the source set being compressed, each encoder must vary its coding rate accordingly. Considering that the encoders do not know that set, we achieve the desired rate using a rateless code, which accommodates variable decoding times. The encoders transmit their codewords symbol-by-symbol until they are informed to stop, which is realized by using sporadic feedback from the decoder. The decoder selects a decoding time from a set of predetermined potential decoding times based on which encoders it sees in the network. Single-bit feedback from the decoder at each potential decoding time informs all encoders of when the decoder is able to decode. Thus, unlike commonly considered rateless codes that allow arbitrary decoding times [17, 18, 19, 20], our coding scheme only allows a fixed set of decoding times, thereby requiring only sporadic feedback in operation.
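The feedback protocol described above can be made concrete with a toy simulation. The sketch below is a minimal, hypothetical rendering (the schedule of decoding times and the binary code symbols are placeholders, not the paper's construction): active encoders emit one code symbol per time step, and the decoder broadcasts its single "stop" bit only at the predetermined decoding time matching the number of active encoders it sees:

```python
import random

# Hypothetical schedule: potential decoding time n_k indexed by the number k
# of active encoders. These values are placeholders for illustration.
DECODING_TIMES = {1: 40, 2: 70, 3: 95}

def run_round(active_encoders, rng):
    # Simulate one round: encoders transmit symbol-by-symbol; the decoder,
    # which observes who is active, stops transmission at time n_k by
    # sending a single feedback bit at that (predetermined) time.
    k = len(active_encoders)
    n_k = DECODING_TIMES[k]
    transcript = {e: [] for e in active_encoders}
    for t in range(1, max(DECODING_TIMES.values()) + 1):
        for e in active_encoders:
            transcript[e].append(rng.randrange(2))  # next code symbol (toy)
        if t in DECODING_TIMES.values():
            # Feedback opportunity: the "stop" bit is 1 only when t == n_k.
            if t == n_k:
                return t, transcript
    raise RuntimeError("schedule exhausted")

rng = random.Random(0)
stop_time, transcript = run_round({"enc1", "enc2"}, rng)
```

With two active encoders, transmission halts at the second scheduled time (70 in this toy schedule), and feedback is needed only at the scheduled instants rather than at every symbol.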
In the asymptotic analysis of our proposed coding scheme, we focus on the class of stationary memoryless permutation-invariant RA-DMSs. In this case, we are able to reduce the design complexity by employing identical encoding at all encoders. We demonstrate (Theorem 20 in Section V below) that there exists a single deterministic code that simultaneously achieves, for all possible numbers of active encoders, the optimal performance (up to the third-order term) of the SW code. Since traditional random coding arguments are not sufficient to show the existence of a single deterministic code that meets every constraint in a collection of constraints, prior code designs for multiple-constraint scenarios employ shared randomness. Inspired by Tchamkerten and Telatar's work, we here propose an alternative to that approach, deriving a refined random coding argument (Lemma 21 in Section V-E) that can be used to demonstrate the existence of a single deterministic code; this technique can be applied more broadly.
Except where noted, the source coding results presented in this paper do not require finite source alphabets but only countable ones.
The organization of this paper is as follows. Section II defines notation. Sections III, IV, and V are devoted to (point-to-point) almost-lossless source coding, multiple access (Slepian-Wolf) source coding, and random access source coding, respectively. The contents of these three sections are organized in parallel:
In Section V-A, we define the random access-discrete multiple source and describe our random access coding scheme. Prior work related to random access source coding is discussed in Section V-B. In Sections V-C, V-D, and V-E, we analyze the proposed coding scheme and give both the achievability and the converse characterizations of its finite-blocklength performance on the class of permutation-invariant RA-DMSs. Extensions of our coding strategy to broader classes of RA-DMSs are discussed in Section V-F.
We give the concluding remarks in Section VI, with proofs of auxiliary results given in the appendices.
We use uppercase letters (e.g., $X$) to denote random variables, lowercase letters (e.g., $x$) to denote realizations of the random variables, calligraphic uppercase letters (e.g., $\mathcal{A}$) to denote subsets of a sample space (events) or index sets, and script uppercase letters to denote subsets of a Euclidean space. Vectors are denoted in boldface, with the all-ones vector denoted by $\mathbf{1}$. For any sequence $(x_i)_{i \ge 1}$ and any ordered index set $\mathcal{I} = (i_1, \ldots, i_k)$, we write $x_{\mathcal{I}} = (x_{i_1}, \ldots, x_{i_k})$. Matrices are denoted by bold uppercase letters (e.g., $\mathbf{V}$), and the $(i,j)$-th element of matrix $\mathbf{V}$ is denoted by $V_{i,j}$. Relations "$\le$" and "$\ge$" between two vectors of the same dimension correspond to elementwise inequalities. For a vector $u$ and a set $\mathcal{S}$, "$\mathcal{S} + u$" denotes the set formed by moving every point in $\mathcal{S}$ by the displacement specified by $u$ (the Minkowski sum of $\mathcal{S}$ and $\{u\}$). For any positive integers $i, j$ such that $i \le j$, $[j]$ denotes the set $\{1, \ldots, j\}$ and $[i:j]$ denotes $\{i, \ldots, j\}$; if $i > j$, $[i:j] = \emptyset$. All $\log$'s and $\exp$'s, if not specified, employ an arbitrary common base.
For two functions $f$ and $g$, $f(n) = O(g(n))$ if there exist constants $c$ and $n_0$ such that $|f(n)| \le c\, g(n)$ for all $n \ge n_0$. For a $d$-dimensional function $\mathbf{f}$, $\mathbf{f}(n) = O(g(n))\,\mathbf{1}$ for some function $g$ if $f_i(n) = O(g(n))$ for all $i \in [d]$.
The standard Gaussian cumulative distribution function is denoted by
$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt.$$
Function $Q(x) = 1 - \Phi(x)$ denotes the standard Gaussian complementary cumulative distribution function, and $Q^{-1}$ denotes the inverse function of $Q$. The standard Gaussian probability density function is denoted by $\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$.
For a distribution $P_X$ on a countable alphabet $\mathcal{X}$, the information (entropy density) and conditional information are defined as
$$\imath(x) \triangleq \log \frac{1}{P_X(x)}$$
for any $x \in \mathcal{X}$ with $P_X(x) > 0$, and
$$\imath(x \mid y) \triangleq \log \frac{1}{P_{X \mid Y}(x \mid y)}.$$
The corresponding (conditional) entropy, varentropy, and third centered moment are denoted by, respectively,
$$H(X) = \mathbb{E}[\imath(X)], \quad V(X) = \mathrm{Var}[\imath(X)], \quad T(X) = \mathbb{E}\left[\left|\imath(X) - H(X)\right|^3\right].$$
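These three moments of the information $\imath(X)$ can be computed by direct enumeration for any finite-support distribution. A minimal sketch (the example distribution is our own, purely illustrative):

```python
import math

def info_moments(pmf):
    # Entropy H, varentropy V, and third absolute centered moment T of the
    # information i(x) = log(1/P_X(x)), by direct enumeration (nats).
    items = [(x, p) for x, p in pmf.items() if p > 0]
    H = sum(p * -math.log(p) for _, p in items)
    V = sum(p * (-math.log(p) - H) ** 2 for _, p in items)
    T = sum(p * abs(-math.log(p) - H) ** 3 for _, p in items)
    return H, V, T

# Illustrative dyadic distribution on three symbols.
H, V, T = info_moments({"a": 0.5, "b": 0.25, "c": 0.25})
```

For the uniform distribution the information is constant, so the varentropy is zero, consistent with the non-redundant case discussed later.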
III. Random Coding in Almost-Lossless Source Coding
In point-to-point almost-lossless data compression, a discrete random variable $X$ defined on a finite or countably infinite alphabet $\mathcal{X}$ is encoded into a message taken from the set of codewords $\{1, \ldots, M\}$. A decoder subsequently reconstructs the source symbol from the compressed description. Due to the limitation on the code size $M$, an almost-lossless source code is often associated with a non-zero error probability. The following definitions formalize almost-lossless source codes and their fundamental limits.
Definition 1 (Almost-lossless source code).
An $(M, \epsilon)$ code for a random variable $X$ with discrete alphabet $\mathcal{X}$ comprises an encoding function $\mathsf{f} \colon \mathcal{X} \to \{1, \ldots, M\}$ and a decoding function $\mathsf{g} \colon \{1, \ldots, M\} \to \mathcal{X}$ such that the error probability satisfies $\mathbb{P}[\mathsf{g}(\mathsf{f}(X)) \neq X] \le \epsilon$.
The minimum achievable code size compatible with error probability $\epsilon$ is defined by
$$M^*(\epsilon) \triangleq \min\left\{M \colon \exists\ (M, \epsilon)\ \text{code}\right\}.$$
The generality of the setting in Definition 1 allows one to particularize any result derived for that setting to more specialized scenarios, such as the block code described in the next definition.
Definition 2 (Block almost-lossless source code).
An almost-lossless source code for a random vector $X^n$ defined on $\mathcal{X}^n$, the $n$-fold Cartesian product of the set $\mathcal{X}$, is called an $(n, M, \epsilon)$ code.
The minimum code size and rate achievable at blocklength $n$ and error probability $\epsilon$ are defined by, respectively,
$$M^*(n, \epsilon) \triangleq \min\left\{M \colon \exists\ (n, M, \epsilon)\ \text{code}\right\}, \qquad R^*(n, \epsilon) \triangleq \frac{1}{n} \log M^*(n, \epsilon).$$
Almost-lossless block codes were previously defined in, for example, [7, Chapter 1].
A discrete information source is a sequence of discrete random variables $X_1, X_2, \ldots$, which is specified by the transition probability kernels $P_{X_i \mid X^{i-1}}$ for each $i$. Many classes of sources, including sources with memory and non-stationary sources, conform to the setting of Definition 2. In our asymptotic analysis, we focus on the class of stationary memoryless sources, where $P_{X^n} = P_X \times \cdots \times P_X$ for all $n$, i.e., $X_1, X_2, \ldots$ are i.i.d.
Shannon's source coding theorem gives a fundamental limit on the asymptotically achievable performance of the codes for a stationary memoryless source with single-letter distribution $P_X$: for any $\epsilon \in (0, 1)$,
$$\lim_{n \to \infty} R^*(n, \epsilon) = H(X).$$
In the finite-blocklength regime, which is of more practical interest, Kontoyiannis and Verdú gave a lower and an upper bound on $\log M^*(n, \epsilon)$ that coincide in the first three terms. They also demonstrated an $O(1)$ gap in the fourth-order term. Recall that $V(X)$ and $T(X)$ denote the second and third absolute centered moments of $\imath(X)$ (see (7), (10)).
Theorem 1 (Kontoyiannis and Verdú ).
Consider a stationary memoryless source with a finite alphabet and single-letter distribution $P_X$ that satisfies $V(X) > 0$. The following bounds hold (the bounds are stated in the original work in a base-2 logarithmic scale, but they hold for any base; the base of the logarithm determines the information unit):
(achievability) for all $n$ and all $\epsilon$ (the achievability bound holds for any $\epsilon \in (0, 1)$; notice, however, that it only becomes meaningful when the resulting bound is nontrivial),
(converse) for all $n$ and all $\epsilon$ such that
Although Theorem 1 was stated only for a finite source alphabet, the proof shows that, provided that $T(X)$ is finite, the bounds in (16) and (18) still hold with the same first three terms for all $n$ and any countable source alphabet, with the bounds on the fourth-order terms replaced by constants (with dependency on $P_X$).
When $V(X) = 0$, the source is non-redundant; that is, it has a finite alphabet and a uniform distribution. In this case, $H(X) = \log |\mathcal{X}|$. The optimal code simply maps a fraction $1 - \epsilon$ of the possible source outcomes to unique codewords. So the minimum achievable code size satisfies
$$M^*(n, \epsilon) = \left\lceil (1 - \epsilon)\, |\mathcal{X}|^n \right\rceil. \qquad (19)$$
It follows immediately from (19) that
$$\log M^*(n, \epsilon) = n \log |\mathcal{X}| + \log(1 - \epsilon) + o(1). \qquad (20)$$
Although its dependency on $\epsilon$ is not explicitly noted, the fourth-order $O(1)$ term in (1) is indeed a function of $n$, $\epsilon$, and $P_X$. The characterization in (20) might lead one to suspect that $\log M^*(n, \epsilon)$ has a discontinuity at $P_X$ equal to the uniform distribution on $\mathcal{X}$, due to the missing third-order term (which otherwise appears in both (16) and (18)). However, this conclusion is flawed because the upper bound in (16) blows up for any finite $n$ as $V(X) \to 0$. Indeed, the Berry-Esseen type bounds are loose for small $V(X)$. See Figure 1. The discontinuity appears in the bounds on $\log M^*(n, \epsilon)$, but there is no discontinuity in $\log M^*(n, \epsilon)$ itself. (Note that unlike most non-asymptotic limits in information theory, $M^*(n, \epsilon)$ in almost-lossless source coding is directly computable.) The right way to interpret the results in Theorem 1 is to see that for any $P_X$ with $V(X) > 0$, there exists some $n_0$ such that for all $n \ge n_0$, $\log M^*(n, \epsilon)$ behaves like $-\frac{1}{2} \log n$ in the third-order term. For a smaller $V(X)$, the minimum $n_0$ needed becomes larger.
Kontoyiannis and Verdú obtained the bounds in Theorem 1, which coincide up to the third order, by analyzing the optimal code. That code encodes a cardinality-$M$ subset of $\mathcal{X}$ that has the largest probability. The decoder declares an error whenever a symbol outside this optimum set is produced by the source. With a few notable exceptions (e.g., a few scenarios of (almost) lossless data compression), characterizing the optimal code is elusive in most communication scenarios of interest. Thus, the random coding argument, first proposed by Shannon, has become a popular and powerful technique for deriving achievability results. Here we review the existing achievability bounds for almost-lossless compression based on random coding.
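The optimal code just described is directly computable for small instances: sort the outcomes by probability and count how many of the most probable ones are needed to capture mass $1 - \epsilon$. A minimal sketch (the Bernoulli parameters are illustrative choices of ours):

```python
import math
from itertools import product

def optimal_code_size(pmf, eps):
    # Exact minimum code size M*(eps): the optimal code assigns distinct
    # codewords to the M most probable outcomes and declares an error
    # otherwise, so M*(eps) is the smallest M whose top-M mass is >= 1 - eps.
    probs = sorted(pmf.values(), reverse=True)
    mass = 0.0
    for m, p in enumerate(probs, start=1):
        mass += p
        if mass >= 1 - eps:
            return m
    return len(probs)

# Blocklength-n product of a Bernoulli(0.11) source (illustrative).
p, n, eps = 0.11, 8, 0.1
pmf = {x: math.prod(p if b else 1 - p for b in x)
       for x in product((0, 1), repeat=n)}
M_star = optimal_code_size(pmf, eps)
```

Enumerating all $|\mathcal{X}|^n$ outcomes is only feasible for tiny blocklengths, which is one way to see why direct analysis of the optimal code does not extend to richer settings.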
There exists an $(M, \epsilon)$ code for a discrete random variable $X$ such that, for any $\gamma > 0$,
$$\epsilon \le \mathbb{P}[\imath(X) \ge \log \gamma] + \frac{\gamma}{M}. \qquad (21)$$
The bound in Theorem 2 is obtained by assigning source realizations to codewords independently and uniformly at random. The decoder uses a threshold decoding rule that decodes to $x$ if and only if $x$ is the unique source realization that (i) is compatible with the observed codeword under the given (random) code design, and (ii) has information $\imath(x)$ below $\log \gamma$. Particularizing (21) to a stationary memoryless source with single-letter distribution $P_X$ satisfying $V(X) > 0$ and $T(X) < \infty$, choosing $\gamma$ and $M$ optimally and applying the Berry-Esseen inequality, one obtains an asymptotic expansion of the bound: for all $n$,
$$\log M^*(n, \epsilon) \le n H(X) + \sqrt{n V(X)}\, Q^{-1}(\epsilon) + \frac{1}{2} \log n + O(1). \qquad (22)$$
The key question is whether the penalty in the third-order term exhibited in (22) is due to random coding or due to the choice of the decoding rule. The optimum decoding rule is maximum likelihood. Previously, Kontoyiannis and Verdú [4, Th. 8] gave an exact expression for the performance of random coding under i.i.d. uniform random codeword generation and maximum likelihood decoding. However, their result, which is derived for general sources, is not easy to analyze. In Theorem 4 in Section III-C below, we derive a new random-coding bound based on maximum likelihood decoding, and we demonstrate that random coding is capable of achieving third-order optimality for a stationary memoryless source.
III-C Main Results: New Achievability Bounds Based on Random Coding
In this section, we present two new achievability bounds for almost-lossless source coding of general sources. The first, called the dependence testing (DT) bound, parallels the DT bound in channel coding [3, Th. 17]. The second, called the random-coding union (RCU) bound, parallels the RCU bound in channel coding [3, Th. 16].
The DT bound tightens the prior bound based on threshold decoding presented in Theorem 2.
Theorem 3 (DT bound).
There exists an $(M, \epsilon)$ code for a discrete random variable $X$ such that
$$\epsilon \le \mathbb{E}\left[\exp\left\{-\left(\log M - \imath(X)\right)^+\right\}\right]. \qquad (23)$$
Note the following equality that holds for arbitrary $a$ and $b$ [3, Eq. (68)]:
$$\exp\{-(a - b)^+\} = \mathbb{1}\{a \le b\} + \exp\{b - a\}\,\mathbb{1}\{a > b\}. \qquad (24)$$
If we take $a = \log M$ and $b = \imath(X)$, and take the expectation of both sides of (24) with respect to $P_X$, we obtain the following equivalence
$$\mathbb{E}\left[\exp\left\{-(\log M - \imath(X))^+\right\}\right] = \mathbb{P}[\imath(X) \ge \log M] + \frac{1}{M}\, U[\imath(X) < \log M], \qquad (25)$$
where $\mathbb{P}$ denotes a probability with respect to $P_X$, and $U$ denotes a mass with respect to the counting measure defined on $\mathcal{X}$, which assigns unit weight to each $x \in \mathcal{X}$. Thus, it is sufficient to show that the right-hand side of (25) is achievable.
We appeal to the following auxiliary result: there exists an $(M, \epsilon)$ code for a discrete random variable $X$ such that for all $\gamma > 0$,
$$\epsilon \le \mathbb{P}[\imath(X) \ge \log \gamma] + \frac{1}{M}\, U[\imath(X) < \log \gamma]. \qquad (26)$$
The proof of (26) is based on random coding. Fix $\gamma > 0$. We draw the encoder output $\mathsf{f}(x)$ i.i.d. uniformly at random from $\{1, \ldots, M\}$ for each $x \in \mathcal{X}$. We adopt a threshold decoder: $\mathsf{g}(m)$ outputs $x$ if $x$ is the unique symbol satisfying $\mathsf{f}(x) = m$ and $\imath(x) < \log \gamma$; otherwise, an error is declared.
The error averaged over this random code construction is bounded by the probability of the union of two error events: $\{\imath(X) \ge \log \gamma\}$ and $\{\exists\, \bar{x} \neq X \colon \mathsf{f}(\bar{x}) = \mathsf{f}(X),\ \imath(\bar{x}) < \log \gamma\}$. By the random coding argument and the union bound, there exists an $(M, \epsilon)$ code such that (26) holds.
The inequality in (26) bounds the random coding performance of a threshold decoder with threshold $\log \gamma$. Paralleling the observation made in the context of channel coding, we notice that the right-hand side of (26) is equal to $\frac{M+1}{M}$ times the minimum measure of the error event in a Bayesian binary hypothesis test between $P_X$ with a priori probability $\frac{M}{M+1}$ and the counting measure $U$ with a priori probability $\frac{1}{M+1}$. See [22, Remark 5] for an observation that the Neyman-Pearson lemma generalizes to $\sigma$-finite measures such as our $U$ here. Thus, this measure of error is minimized by the test that compares the log likelihood ratio between $P_X$ and $U$ (i.e., $-\imath(x)$) to the log ratio of the two a priori probabilities (i.e., $-\log M$), i.e., the test decides in favor of $P_X$ if and only if $\imath(x) < \log M$.
Therefore, taking $\gamma = M$ minimizes the right-hand side of (26). Hence Theorem 3 gives the tightest possible bound for random coding with threshold decoding. Particularizing Theorem 3 to a stationary memoryless source with a single-letter distribution $P_X$ satisfying $V(X) > 0$ and $T(X) < \infty$, and invoking the Berry-Esseen theorem, we obtain an asymptotic expansion of the bound: for all $n$,
$$\log M^*(n, \epsilon) \le n H(X) + \sqrt{n V(X)}\, Q^{-1}(\epsilon) + \frac{1}{2} \log n + O(1). \qquad (36)$$
Unfortunately, (36) is also third-order-suboptimal. Thus, threshold-based decoding in random coding is not sufficient to achieve the best performance in the third-order term.
Next, we present the RCU bound. Unlike the random-coding bounds in Theorems 2 and 3, which employ threshold decoding, the RCU bound yields an asymptotic achievability result for stationary memoryless sources that is tight up to the third-order term. Therefore, the loss in the third-order term in both (22) and (36) is due to the sub-optimal decoder rather than the random encoder design.
Theorem 4 (RCU bound).
There exists an $(M, \epsilon)$ code for a discrete random variable $X$ such that
$$\epsilon \le \mathbb{E}\left[\min\left\{1,\ \frac{1}{M}\, U\big[\imath(\bar{X}) \le \imath(X),\ \bar{X} \neq X\big]\right\}\right], \qquad (37)$$
where $U\big[\imath(\bar{X}) \le \imath(X),\ \bar{X} \neq X\big]$ denotes the number of realizations $\bar{x} \neq X$ with $\imath(\bar{x}) \le \imath(X)$.
We begin our random code design by drawing the encoder output $\mathsf{f}(x)$ i.i.d. uniformly at random from $\{1, \ldots, M\}$ for each $x \in \mathcal{X}$. For decoding, we use the maximum likelihood decoder
$$\mathsf{g}(m) \in \arg\max_{x \colon \mathsf{f}(x) = m} P_X(x).$$
When there is more than one source symbol that has the maximal probability mass, the decoder chooses among them equiprobably at random.
The error probability averaged over this random code construction is bounded by the probability of the event
$$\left\{\exists\, \bar{x} \neq X \colon \mathsf{f}(\bar{x}) = \mathsf{f}(X),\ \imath(\bar{x}) \le \imath(X)\right\}.$$
To prove the existence of an $(M, \epsilon)$ code satisfying (37) using the random coding argument, we show that the probability of this event, where the probability measure is generated by both $X$ and the random encoding map $\mathsf{f}$, is bounded from above by the right-hand side of (37):
where (41) holds by the law of iterated expectation, (42) bounds the probability by the minimum of the union bound and 1, (43) holds because the encoder outputs are drawn i.i.d. uniformly at random and independently of , and (44) rewrites (43) in terms of the distribution . The proof is now complete, with (44) equal to the right-hand-side of (37). ∎
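The random-binning construction analyzed above is easy to simulate exhaustively for a toy source. The sketch below (parameters are our own illustrative choices) estimates the average error of random binning with maximum likelihood decoding by Monte Carlo, and compares it against the union-bound expression as reconstructed here, $\mathbb{E}[\min\{1, \frac{1}{M}\,\#\{\bar{x} \neq X : P(\bar{x}) \ge P(X)\}\}]$:

```python
import math
import random
from itertools import product

def rcu_bound(pmf, M):
    # Union-bound expression (our reconstruction of the RCU-style bound):
    # E[ min{1, (1/M) * #{xbar != X : P(xbar) >= P(X)}} ].
    xs = list(pmf)
    return sum(pmf[x] * min(1.0,
               sum(1 for y in xs if y != x and pmf[y] >= pmf[x]) / M)
               for x in xs)

def simulate_ml_binning(pmf, M, trials, rng):
    # Random binning: each outcome -> uniform bin in {0, ..., M-1};
    # the decoder returns the most probable outcome in the observed bin,
    # breaking ties uniformly at random.
    xs = list(pmf)
    errors = 0
    for _ in range(trials):
        bins = {x: rng.randrange(M) for x in xs}         # fresh random code
        x = rng.choices(xs, weights=[pmf[y] for y in xs])[0]
        cands = [y for y in xs if bins[y] == bins[x]]
        best = max(cands, key=lambda y: (pmf[y], rng.random()))
        errors += best != x
    return errors / trials

p, n, M = 0.2, 6, 16
pmf = {x: math.prod(p if b else 1 - p for b in x)
       for x in product((0, 1), repeat=n)}
rng = random.Random(1)
emp = simulate_ml_binning(pmf, M, 2000, rng)
bound = rcu_bound(pmf, M)
```

The empirical average error sits below the union-bound value, as expected, and the gap shrinks as the bound's `min{1, ...}` clipping becomes less active.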
Applying the argument employed in the proof of [9, Th. 9.5] to the above analysis, we can obtain the same RCU bound by randomizing over only linear encoding maps. Thus, there is no loss in performance when restricting to linear compressors.
The RCU bound in Theorem 4 provides a new proof of the asymptotic achievability result in Theorem 1. While the original proof analyzes the optimal code, our proof relies on a randomly designed encoder, showing that optimal code design is not necessary to achieve third-order-optimal performance. This observation is useful in scenarios such as multiple access source coding, where the optimal code is hard to find, as discussed in Section IV below.
Our asymptotic analysis in Theorem 5 relies on the following assumptions. Consider a stationary memoryless source with single-letter distribution $P_X$. We assume that $V(X) > 0$ and $T(X) < \infty$. Throughout, $C$ denotes the absolute constant in the Berry-Esseen inequality for i.i.d. random variables (see Theorem 6).
Theorem 5 (Third-order-optimal achievability via random coding). Under the assumptions above,
$$\log M^*(n, \epsilon) \le n H(X) + \sqrt{n V(X)}\, Q^{-1}(\epsilon) - \frac{1}{2} \log n + O(1).$$
The remainder term can be characterized more precisely as follows: for all $n$ and $\epsilon$,
and for all $n$ and $\epsilon$,
where the constant is defined in (45).
Before presenting our proof of the asymptotic bound in Theorem 5, we state two auxiliary results that turn out to be very useful in our analysis.
Theorem 6 (Berry-Esseen Inequality).
Consider a sequence of i.i.d. random variables $Z_1, \ldots, Z_n$ with a common distribution such that $\mu = \mathbb{E}[Z_1]$, $V = \mathrm{Var}[Z_1] > 0$, and $T = \mathbb{E}[|Z_1 - \mu|^3] < \infty$. Then for any real $t$ and any $n$,
$$\left| \mathbb{P}\left[ \frac{\sum_{i=1}^{n} Z_i - n\mu}{\sqrt{nV}} \le t \right] - \Phi(t) \right| \le \frac{C\, T}{V^{3/2} \sqrt{n}}.$$
We refer to $B \triangleq \frac{C\, T}{V^{3/2}}$ as the Berry-Esseen constant for the i.i.d. random variables $Z_1, \ldots, Z_n$.
The second result is [3, Lemma 47], developed by Polyanskiy et al. The bound originally given in [3, Lemma 47] only requires independence of the random variables. One can sharpen it for i.i.d. random variables by appealing to the Berry-Esseen inequality above. We state the modified version of the lemma below, which allows for a better numerical comparison between Theorem 5 and Theorem 1.
Lemma 7 (Modified from [3, Lemma 47]).
Let $Z_1, \ldots, Z_n$ be i.i.d. random variables with a common distribution such that $\mathrm{Var}[Z_1] > 0$ and $\mathbb{E}\left[|Z_1 - \mathbb{E}[Z_1]|^3\right] < \infty$. Then for any $A$,
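The Berry-Esseen guarantee of Theorem 6 is easy to check numerically for a toy case. The sketch below computes the exact Kolmogorov distance between a standardized Binomial CDF and the Gaussian CDF and compares it with the bound, using $C = 0.4748$, a constant known to be admissible for i.i.d. summands (the parameters are illustrative):

```python
import math

def berry_esseen_gap(p, n):
    # Exact Kolmogorov distance between the standardized Binomial(n, p) CDF
    # and the standard Gaussian CDF, alongside the Berry-Esseen bound
    # C * T / (V^{3/2} * sqrt(n)) with C = 0.4748 (known admissible constant).
    mu, V = p, p * (1 - p)
    T = p * (1 - p) ** 3 + (1 - p) * p ** 3       # E|Z1 - mu|^3 for Bernoulli
    phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        t = (k - n * mu) / math.sqrt(n * V)
        gap = max(gap, abs(cdf - phi(t)))          # just below the CDF jump
        cdf += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        gap = max(gap, abs(cdf - phi(t)))          # just above the CDF jump
    bound = 0.4748 * T / (V ** 1.5 * math.sqrt(n))
    return gap, bound

gap, bound = berry_esseen_gap(0.3, 50)
```

For these parameters the true Kolmogorov gap is well within the bound, and both shrink like $1/\sqrt{n}$ as the blocklength grows.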
Proof of Theorem 5.
We analyze the random-coding bound in Theorem 4. Denote for brevity
Each of the two sums involved is a sum of i.i.d. random variables. Substituting into Theorem 4, we note that there exists an $(n, M, \epsilon)$ code such that
where . Let
We now choose
Specifically, we have
IV. Multiple Access Source Coding
The discussion that follows focuses on multiple access source coding with two encoders. While this choice is expedient for the sake of notational brevity, all of the results discussed here generalize to scenarios with more than two encoders, as briefly noted in Remark 6 below.
In multiple access source coding, also known as Slepian-Wolf (SW) source coding, a pair of random variables $(X_1, X_2)$ with finite or countably infinite alphabets $\mathcal{X}_1$ and $\mathcal{X}_2$ are compressed separately. Each encoder observes only one of the random variables and independently maps it to one of the codewords in $\{1, \ldots, M_1\}$ or $\{1, \ldots, M_2\}$, respectively; a single decoder subsequently decodes the pair of codewords it receives to reconstruct $(X_1, X_2)$ jointly. As in Section III-A, we first present the definition of an SW code for an abstract random object, and then particularize it to the case where the random object observed by the encoders lives in an alphabet endowed with a Cartesian product structure.
Definition 3 (SW code).
An $(M_1, M_2, \epsilon)$ SW code for a pair of random variables $(X_1, X_2)$ with finite or countably infinite alphabets $\mathcal{X}_1$ and $\mathcal{X}_2$ comprises two separate encoding functions $\mathsf{f}_1 \colon \mathcal{X}_1 \to \{1, \ldots, M_1\}$ and $\mathsf{f}_2 \colon \mathcal{X}_2 \to \{1, \ldots, M_2\}$, and a decoding function $\mathsf{g} \colon \{1, \ldots, M_1\} \times \{1, \ldots, M_2\} \to \mathcal{X}_1 \times \mathcal{X}_2$ such that the error probability satisfies $\mathbb{P}[\mathsf{g}(\mathsf{f}_1(X_1), \mathsf{f}_2(X_2)) \neq (X_1, X_2)] \le \epsilon$.
A pair of code sizes $(M_1, M_2)$ is $\epsilon$-achievable if there exists an $(M_1, M_2, \epsilon)$ SW code.
In the conventional block setting, the encoders individually observe $X_1^n$ and $X_2^n$, drawn from a joint distribution $P_{X_1^n X_2^n}$ defined on $\mathcal{X}_1^n \times \mathcal{X}_2^n$. The block SW code is defined as follows.
Definition 4 (Block SW code).
Let $\mathcal{X}_1^n$ and $\mathcal{X}_2^n$ be the $n$-fold Cartesian products of the sets $\mathcal{X}_1$ and $\mathcal{X}_2$, respectively. An $(M_1, M_2, \epsilon)$ SW code for a pair of random vectors $(X_1^n, X_2^n)$ defined on $\mathcal{X}_1^n \times \mathcal{X}_2^n$ is called an $(n, M_1, M_2, \epsilon)$ SW code.
The finite-blocklength rates associated with this code are defined by
$$R_1 \triangleq \frac{1}{n} \log M_1, \qquad R_2 \triangleq \frac{1}{n} \log M_2.$$
Definition 5 ($(n, \epsilon)$-rate region).
A rate pair $(R_1, R_2)$ is $(n, \epsilon)$-achievable if there exists an $(n, M_1, M_2, \epsilon)$ SW code with $\frac{1}{n} \log M_1 \le R_1$ and $\frac{1}{n} \log M_2 \le R_2$. The $(n, \epsilon)$-rate region