1 Introduction
The problem of separate encoding and joint decoding of correlated sources, i.e., the well known Slepian–Wolf (SW) coding problem, has received a vast amount of attention ever since the landmark paper by Slepian and Wolf [18] was published, nearly five decades ago. Much less attention, however, has been given to the asynchronous version of this problem, where there is a relative delay between the two correlated sources; see, e.g., [15], [11], [16], [19], [20]. The motivation for the asynchronous setting is thoroughly discussed in [11]. For memoryless correlated sources, Willems [20] assumed that the relative delay is unknown to the encoders, but known to the decoder, and proved that the achievable rate region is the same as in synchronous SW coding. Under similar assumptions, Rimoldi and Urbanke [16], as well as Sun, Tian, Chen and Wong [19], have proposed SW data compression schemes that are based on the notion of source splitting. In all these studies, it was assumed that the decoder has the option to postpone the actual decoding until after having received all codewords associated with the data to be decoded. Such an assumption essentially neutralizes the negative effect of the relative delay, because the encoders and the decoder can still exploit the correlations between the two sources. As explained, however, by Matsuta and Uyematsu in their recent paper [11], this setup might be somewhat problematic in practice, especially when the relative delay is very large.
The main result provided by Matsuta and Uyematsu in [11] (see also [9] and [10]) is a worst–case result in spirit. They assumed that: (i) the joint probability distribution of the two corresponding correlated random variables, one from each source to be compressed, is only known to belong to a given subset of joint probability distributions; (ii) the relative delay between the sources is unknown, but known to be bounded between two limits; and (iii) the absolute value of the relative delay is allowed to scale linearly with the block length, such that the ratio between the two tends to a constant, which is only known to be upper bounded by a given number. They proved a coding theorem asserting that the achievable rate region is as follows: a rate pair is achievable if and only if it satisfies the following three inequalities at the same time:
(1)
(2)
(3)
This result of [11] is very interesting, but it is also extremely pessimistic. It is overly pessimistic, not only because it is a worst–case result, but even more importantly, because the above three suprema can potentially be achieved by three different sources, in general. Thus, for a single, given underlying joint source (with its relative delay), as bad as it may be, at least one of the above three inequalities can be improved, in general. Moreover, if the uncertainty set happens to be the entire simplex of probability distributions over the given alphabets (which is a very realistic special case), these suprema attain their trivial maximal values, rendering this coding theorem an uninteresting triviality, as it allows no compression at all. The fact of the matter is, however, that at least the weakness concerning the three different achievers of the suprema in (1)–(3) can be handled rather easily. Upon a careful inspection of the proof of the converse part in [11], one readily concludes that it actually supports the assertion that the achievable rate region is included in the following set:
(4) 
Similar comments apply to the analysis of the error probability in [11], which is a pessimistic analysis, carried out for the worst source in the uncertainty set and over all possible relative delay values, rather than for the actual error probability associated with a given underlying source.
In this paper, we tackle the problem in a different manner. Instead of a worst–case approach, our approach is the following: for a given rate pair, even if we knew the source and the relative delay, we could have handled only sources that satisfy the corresponding entropy inequalities at that rate pair and at the actual normalized relative delay. Now, the SW encoders are always simple random–binning encoders, regardless of the source parameters, so every uncertainty that is associated with the source parameters and the relative delay is confronted, and therefore must be handled, by the decoder. Owing to the earlier results on the minimum–entropy universal decoder for the SW encoding system (see, e.g., [1], [2], [3], [4], [5, Exercise 3.1.6], [7], [8], [12], [14], [17]), it is natural to set the goal of seeking a universal decoder that is asymptotically as good as the optimal maximum a posteriori (MAP) decoder for the given source, in the sense that the random coding error exponent is the same, and hence so is the achievable rate region. In other words, unlike in previous works on universal SW decoding, here universality is sought, not only with respect to (w.r.t.) the source distribution, but also w.r.t. the unknown relative delay between the two parts of the source. Although it is natural to think of the relative delay as yet another unknown parameter associated with the underlying source, it will be interesting to see that in our proposed universal decoder, the unknown delay is handled differently from the other unknown parameters. We will elaborate on this point later on.
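To make the random–binning/universal–decoding pipeline concrete, the following is a minimal Python sketch of the synchronous special case (zero relative delay, known alignment): each encoder assigns its block an independent, uniformly random bin index, and the decoder searches the two announced bins for the pair of blocks with the smallest empirical joint entropy, in the spirit of the minimum–entropy decoders cited above. All function names and the toy parameters are ours, for illustration only, and not from the paper.

```python
import itertools
import math
import random

def empirical_joint_entropy(x, y):
    """Empirical joint entropy (bits per symbol) of two equal-length tuples."""
    n = len(x)
    counts = {}
    for a, b in zip(x, y):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def make_binning(n, num_bins, rng):
    """Random binning: each binary n-vector gets an independent uniform bin index."""
    return {v: rng.randrange(num_bins)
            for v in itertools.product((0, 1), repeat=n)}

def me_decode(bx, by, bins_x, bins_y):
    """Minimum-entropy decoding: among all pairs of vectors consistent with the
    two announced bin indices, pick the pair of smallest empirical joint entropy."""
    cand_x = [v for v, b in bins_x.items() if b == bx]
    cand_y = [v for v, b in bins_y.items() if b == by]
    return min(((x, y) for x in cand_x for y in cand_y),
               key=lambda p: empirical_joint_entropy(*p))

rng = random.Random(0)
n, num_bins = 8, 64          # toy rates: 6/8 bit per symbol for each source
bins_x = make_binning(n, num_bins, rng)
bins_y = make_binning(n, num_bins, rng)

# A strongly correlated toy pair: y is x with a single symbol flipped.
x = tuple(rng.randrange(2) for _ in range(n))
y = x[:4] + (1 - x[4],) + x[5:]
xh, yh = me_decode(bins_x[x], bins_y[y], bins_x, bins_y)
```

Note that the decoder's output is, by construction, always consistent with the two bin indices, and its metric value never exceeds that of the true pair; decoding succeeds with high probability only when the rates exceed the relevant conditional entropies.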
Our main contributions, in this work, are the following:

We propose a universal decoder that allows uncertainty, not only regarding the source parameters, but also the relative delay.

We prove that our universal decoder achieves the same error exponent as the optimal MAP decoder that is cognizant of both the source parameters and the relative delay. This will be done by showing that our upper bound on the error probability of the universal decoder is of the same exponential order as a lower bound on the error probability of the MAP decoder.

We provide the Lagrange–dual form of the resulting error exponent, and thereby characterize the achievable rate region for achieving a prescribed random coding error exponent.

We provide an outline for a possible extension to sources with memory.
2 Notation Conventions
Throughout the paper, random variables will be denoted by capital letters, specific values they may take will be denoted by the corresponding lower case letters, and their alphabets will be denoted by calligraphic letters. Random vectors and their realizations will be denoted, respectively, by capital letters and the corresponding lower case letters, both in the bold face font. Their alphabets will be superscripted by their dimensions. For example, a random vector of a given positive-integer dimension may take a specific vector value in the corresponding Cartesian power of the alphabet of each of its components. Segments of vector components will be denoted by subscripts and superscripts that indicate the first and last indices of the segment. When the segment begins at the first component, the subscript will be omitted. By convention, a segment of zero length will be understood to be the empty string, whose probability is formally defined to be unity. Sources and channels will be denoted by the generic distribution letter, subscripted by the names of the relevant random variables/vectors and their conditionings, if applicable, following the standard notation conventions. When there is no room for ambiguity, these subscripts will be omitted. The probability of an event and the expectation operator with respect to (w.r.t.) a given probability distribution will be denoted in the customary manner, the latter subscripted by the underlying distribution. Again, the subscript will be omitted if the underlying probability distribution is clear from the context. The entropy of a random variable (RV) with a generic distribution will be subscripted by the name of that distribution. Similarly, other information measures will be denoted using the customary notation, subscripted by the name of the underlying distribution; for example, for a pair of RVs with a given joint distribution, the joint entropy, the conditional entropy of one given the other, and the mutual information will all be denoted accordingly. For two positive sequences, equality in the exponential scale will mean that the normalized logarithm of their ratio tends to zero, and the corresponding inequalities in the exponential scale are defined similarly. The indicator function of an event will be denoted in the standard manner. The empirical distribution of a sequence is the vector of relative frequencies of each symbol in the sequence. The type class of a sequence is the set of all sequences of the same length with the same empirical distribution.
When we wish to emphasize the dependence of the type class on the empirical distribution, we will denote it accordingly, with a slight abuse of notation. Information measures associated with empirical distributions will be denoted with ‘hats’ and will include the names of the vectors from which they are induced in parentheses. For example, the empirical entropy of a vector, which is the entropy associated with its empirical distribution, will be denoted with a hat. Similar conventions will apply to the joint empirical distribution, the joint type class, the conditional empirical distributions, and the conditional type classes associated with pairs of sequences of the same length. Accordingly, the joint empirical distribution, the joint type class, the conditional type class of one vector given the other, the empirical joint entropy, and the empirical conditional entropy will all be denoted with the corresponding hatted notation. Clearly, empirical information measures can be calculated, not only from the full vectors, but also from partial segments; in this case, the names of the segments will replace those of the full vectors in the above notations.
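As a concrete illustration of the empirical quantities just defined, the following Python sketch (our own illustrative naming, not the paper's notation) computes the empirical distribution and the empirical entropy of a sequence, and tests whether two sequences belong to the same type class:

```python
import math
from collections import Counter

def empirical_distribution(seq):
    """Relative frequency of each symbol in the sequence."""
    n = len(seq)
    return {a: c / n for a, c in Counter(seq).items()}

def empirical_entropy(seq):
    """Entropy (bits) of the empirical distribution of the sequence."""
    return -sum(p * math.log2(p) for p in empirical_distribution(seq).values())

def same_type_class(u, v):
    """Two sequences lie in the same type class iff they have the same length
    and the same empirical distribution, i.e., the same symbol counts."""
    return len(u) == len(v) and Counter(u) == Counter(v)
```

For instance, the binary sequences 0011 and 0101 share the empirical distribution (1/2, 1/2), hence the same type class and the same empirical entropy of one bit per symbol.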
3 Problem Formulation
Let there be given a pair of correlated discrete memoryless sources (DMSs) with a relative delay of a certain number of time units; that is, each symbol of one source is jointly distributed with its delayed counterpart from the other source according to a certain probability distribution, identically at every time instant, but the various delayed symbol pairs are mutually independent. In other words, the delayed pairs are i.i.d. across time. Neither the joint distribution nor the relative delay is known to the encoders and the decoder. Similarly as in [11], the two separate encoders that compress the two sources both operate on successive blocks of a fixed length, without any attempt to align them, because the relative delay is unknown and it may be arbitrarily large. These encoders are ordinary SW encoders, operating at their respective rates. In other words, each block in each of the two source spaces is mapped into a bin, which is selected independently at random for every vector in the respective source space. As always, the randomly selected mappings of both encoders are revealed to the decoder.
As already mentioned, both the joint distribution and the relative delay are unknown, but without essential loss of generality, it may be assumed that the relative delay is non-negative and no larger than the block length. Indeed, once the delay reaches the block length, the respective blocks concurrently encoded are statistically independent, and so, all larger delay values are actually equivalent from the viewpoints of the encoders and the decoder. The restriction to non-negative values is assumed for convenience only: negative values of the delay correspond to switching the roles of the two sources in the forthcoming results and discussions (see also [11]).¹ Our asymptotic regime will be defined as described in the Introduction: as the block length grows without bound, the relative delay will be asymptotically proportional to it, i.e., the ratio between the two tends to a limit. In view of the above discussion, this limit can be assumed to take on values in the interval between zero and one.
¹We could have allowed negative values of the delay in the formal problem setup to begin with, but this would make the notation more cumbersome.
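For concreteness, the following Python sketch (a toy construction of ours, not from the paper) simulates such a delayed pair: the second stream is a delayed copy of the first passed through a binary symmetric channel, so that the delayed symbol pairs are i.i.d. across time, and the encoders simply chop the two streams into unaligned blocks of a fixed length:

```python
import random

def delayed_dsc(num_steps, delay, flip_prob, rng):
    """Toy delayed binary pair: Y_t is X_{t-delay} passed through a BSC with
    crossover probability flip_prob; the pairs (X_{t-delay}, Y_t) are i.i.d.
    across t. The first `delay` symbols of Y are independent start-up symbols."""
    x = [rng.randrange(2) for _ in range(num_steps)]
    y = [(x[t - delay] ^ (rng.random() < flip_prob)) if t >= delay
         else rng.randrange(2)
         for t in range(num_steps)]
    return x, y

rng = random.Random(1)
x, y = delayed_dsc(1000, delay=7, flip_prob=0.1, rng=rng)

# Unaligned blocks of a fixed length n, as in the setup above: each encoder
# just chops its own stream, with no attempt to compensate for the delay.
n = 100
x_blocks = [tuple(x[i:i + n]) for i in range(0, len(x), n)]
y_blocks = [tuple(y[i:i + n]) for i in range(0, len(y), n)]
```

With these parameters, each pair of concurrently encoded blocks overlaps in all but a small fraction of positions, which is the regime where exploiting the correlation still pays off.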
The decoder receives the bin indices of the two compressed vectors, and it outputs a pair of estimates. The average probability of error is defined as
(5)
where both the randomness of the source and the randomness of the encoder mappings are taken into account. The respective error exponent is defined as
(6)
provided that the limit exists.
The optimal MAP decoder, which is cognizant of both the joint distribution and the relative delay, is given by
(7)  
(8) 
where
(9) 
and where all three factors admit product forms:
(10)  
(11)  
(12) 
The average probability of error associated with the MAP decoder, and the corresponding error exponent, will be denoted accordingly. A general metric decoder is of the form
(13) 
where the function involved will be referred to as the decoding metric. The average error probability of the decoder that is based on a given metric, and its error exponent (if existent), will be denoted accordingly.
In this paper, we propose a universal decoding metric that is independent of the unknown joint distribution and relative delay, yet its error exponent coincides with that of the MAP decoder, and hence it is asymptotically optimal in the random coding error–exponent sense.
4 Main Result
We define the following functions, for every value of the hypothesized relative delay:
(14)  
(15)  
(16)  
(17) 
and finally, the universal decoding metric is defined as
(18) 
If the relative delay is allowed to take on negative values as well, then the minimum in (18) should be extended to negative values of the hypothesized delay, where the corresponding functions are defined exactly as above, except that the roles of the two sources are interchanged (that is, one source block is shifted in the opposite direction relative to the other). For a pair of finite–alphabet RVs, let us define the Rényi entropies of the relevant orders as
(19)  
(20)  
(21)  
(22)  
(23) 
As the order tends to one, these quantities tend to the respective Shannon entropies.
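For readers who wish to verify this limiting behavior numerically, here is a small Python sketch (a helper of our own, with hypothetical naming) of the Rényi entropy of a probability vector, which reduces to the Shannon entropy as the order approaches one:

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (in bits) of a probability vector p.
    At alpha = 1 it is defined by continuity as the Shannon entropy."""
    if abs(alpha - 1.0) < 1e-12:
        return -sum(q * math.log2(q) for q in p if q > 0)
    return math.log2(sum(q ** alpha for q in p)) / (1.0 - alpha)

p = [0.5, 0.25, 0.25]
shannon = renyi_entropy(p, 1.0)   # Shannon entropy: 1.5 bits
```

For a non-uniform distribution, the Rényi entropy is strictly decreasing in the order, so orders above one (as appear in Gallager-style exponent formulas) yield values below the Shannon entropy.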
Our main result is the following.
Theorem 1
Under the assumptions formalized in Section 3, the following is true:

The error exponents of the universal decoder and of the MAP decoder both exist.

The two error exponents are equal to each other, where
(24) (25) (26)
Discussion. The remaining part of this section is devoted to a discussion on Theorem 1 and its significance.
Since the error exponents were defined under the condition that certain limits exist, part (a) of the theorem establishes the basic fact that they indeed exist. Part (b) is more quantitative: it tells us that the error exponents of the universal decoder and the MAP decoder are equal, thus rendering the universal decoder asymptotically optimal in the error exponent sense. Finally, part (b) also provides an exact single–letter expression for this error exponent, using a Gallager–style formula. Here, unlike in the synchronous case of zero relative delay, we also see unconditional Rényi entropies (weighted by the normalized relative delay), which correspond to the compression of the stray segments at the edges of the blocks; these segments are independent of each other and of all other pieces of data within the block, and hence no correlations can be exploited when compressing them. If the relative delay is fixed (or grows sub–linearly with the block length), the relative weight of these segments is asymptotically negligible, and there is no asymptotic loss compared to the synchronous case. The error exponent is given by the minimum among three error exponents: one corresponds to errors in the decoding of one source block while the other is decoded correctly, another designates the opposite type of error, and the third stands for erroneous decoding of both blocks. The smallest of the three dominates the overall error exponent.
The above relation between the error exponent and the coding rates can essentially be inverted, in order to answer the following question: what is the achievable rate region for achieving an error exponent at least as large as a prescribed value? Using the above error exponent formula, the answer is readily found (see the last part of Subsection 5.3) to be the following.
(27) 
where
(28)  
(29)  
(30) 
In the limit of a vanishing prescribed error exponent, which corresponds to a vanishing error probability, however slowly, the infima yield
(31)  
(32)  
(33) 
as expected in view of the results of [11].
In order to understand the decoding metric (18), consider the following observations. This decoding metric is given by the maximum of three different metrics, which are all in the spirit of the minimum entropy (ME) universal decoding metric,² but modified to address the dependence structure at hand. Each one of these metrics is ‘responsible’ for handling a different type of error: one is associated with errors in decoding both source blocks, another is for errors in one of them only, while the other is decoded correctly, and the third is meant for the opposite case. The maximum of all three metrics is meant to handle all three types of error at the same time. Every value of the minimization variable corresponds to a certain hypothesis concerning the relative delay. Note that this decoding metric is different from the one in [14], which relies on an encoding scheme that provides pointers to the type classes of the source blocks, in addition to their bin indices.
²The function defined above was mentioned also in [11, Section V, second paragraph] as a possible decoding metric, but it was not the decoding metric actually analyzed there, because the authors argued that it cannot be analyzed by the standard method of types.
Another observation regards the special stature of the relative delay parameter. On the face of it, it is natural to view the relative delay as yet another unknown parameter of the source, in addition to the other unknown parameters, namely those associated with the joint distribution. If the relative delay were known, and only the joint distribution were unknown, we could have interpreted the empirical entropies in the three metrics as negative logarithms of the maximum likelihood (ML) values of the various segments, or equivalently, as the minima of the negative log–likelihood values over the unknown distribution. In other words, the minima over the unknown distribution are taken before (i.e., more internally to) the maximum over the three metrics. By contrast, the minimum over the hypothesized relative delay is taken after (i.e., externally to) the maximum over the three metrics. Attempts were made to prove that the minimum over the hypothesized delay and the maximum among the three metrics can be commuted, but to no avail. Therefore, this point seems to be non–trivial.
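The structural point above — the minimum over the hypothesized delay taken outside the maximum over the per-error-type metrics — can be sketched in Python as follows. The combiner mirrors the form of (18); the single toy overlap metric we plug in below is only a stand-in of ours (a weighted mix of overlap and stray-segment empirical entropies), not the paper's exact functions:

```python
import math

def emp_entropy(samples):
    """Empirical entropy (bits per sample) of a list of hashable samples."""
    n = len(samples)
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def overlap_metric(x, y, k):
    """Toy stand-in for one per-error-type metric under hypothesized delay k:
    joint empirical entropy of the aligned overlap, plus the marginal empirical
    entropies of the two stray segments, weighted by their lengths."""
    n = len(x)
    val = (n - k) * emp_entropy(list(zip(x[:n - k], y[k:])))
    if k > 0:
        val += k * emp_entropy(list(y[:k])) + k * emp_entropy(list(x[n - k:]))
    return val / n

def universal_metric(x, y, delay_max, metrics):
    """Structure of the universal metric (18): for each hypothesized delay k,
    take the max over the per-error-type metrics, then minimize over k."""
    return min(max(m(x, y, k) for m in metrics) for k in range(delay_max + 1))
```

Note that the inner `max` runs over the list of metrics for a fixed hypothesized delay, and only then is the outer `min` over delay hypotheses applied; swapping the two operations would, in general, change the value.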
Finally, it is in order to say a few words concerning sources with memory. Consider the case where the underlying source is a first–order Markov source. In this case, the techniques of [12, Subsection 5.1] suggest that one can prove the universal asymptotic optimality of a similar universal decoder, where the various empirical entropies are replaced by the respective length functions associated with the Lempel–Ziv algorithm (LZ78) [22] and the conditional LZ78 algorithm [21]. It should be noted that the stray segments at the edges of the blocks are not realizations of Markov sequences, but rather realizations of a hidden Markov process, as their correlated counterparts are not available. Nonetheless, hidden Markov sources can still be accommodated in this framework (see, e.g., [13]).
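As a rough illustration of the kind of length functions involved (for the unconditional case only), here is a Python sketch of LZ78 incremental parsing together with a crude phrase-based code-length proxy; the conditional version of [21] is more involved and is not sketched here. The function names and the cost accounting are our own simplifications, not the exact length functions used in the analysis.

```python
import math
import random

def lz78_phrases(seq):
    """LZ78 incremental parsing: split seq into phrases, each equal to a
    previously seen phrase extended by one fresh symbol."""
    dictionary, phrases, cur = {(): 0}, [], ()
    for s in seq:
        cur = cur + (s,)
        if cur not in dictionary:
            dictionary[cur] = len(dictionary)
            phrases.append(cur)
            cur = ()
    if cur:
        phrases.append(cur)   # a possibly incomplete last phrase
    return phrases

def lz78_length(seq, alphabet_size=2):
    """Crude LZ78 code-length proxy (bits): each phrase costs an index into
    the current dictionary plus one raw symbol of the alphabet."""
    c = len(lz78_phrases(seq))
    return sum(math.ceil(math.log2(i + 1)) + math.ceil(math.log2(alphabet_size))
               for i in range(1, c + 1))

rng = random.Random(0)
random_bits = tuple(rng.randrange(2) for _ in range(200))
```

The proxy behaves as expected: a highly regular sequence is parsed into few, long phrases and gets a short description, whereas an incompressible random sequence is parsed into many short phrases.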
5 Proof of Theorem 1
The proof is based on a simple sandwich argument: we first derive an upper bound on the average error probability of the universal decoder that is based on the metric (18), and then a lower bound on the error probability of the MAP decoder. Both bounds turn out to be of the same exponential order. On the other hand, since the MAP decoder cannot be worse than the universal decoder, this exponential order must be exact for both decoders, and its single–letter expression is easily derived using the method of types. This will establish both part (a) of Theorem 1 and the first equality in part (b). The second equality in part (b) will be obtained by deriving the Lagrange dual of the original single–letter formula.
5.1 Upper Bound on the Error Probability of the Universal Decoder
The average probability of error of the proposed universal decoder is as follows.
(34)  
(35)  
(36) 
As for the last term, we have
(37)  
Now,
(38)  
Similarly,
(39)  