
Inducing information stability and applications thereof to obtaining information theoretic necessary conditions directly from operational requirements

This work constructs a discrete random variable that, when conditioned upon, ensures information stability of quasi-images. Using this construction, a new methodology is derived to obtain information theoretic necessary conditions directly from operational requirements. In particular, this methodology is used to derive new necessary conditions for keyed authentication over discrete memoryless channels and to establish the capacity region of the wiretap channel, subject to finite leakage and finite error, under two different secrecy metrics. These examples establish the usefulness of the proposed methodology.




I Introduction

Consider arbitrary discrete random variables (DRVs)

, which form a Markov chain in that order, where

, , and

for some conditional distribution . DRVs of this nature are often found in the information theory literature, starting when Shannon [1] considered one way communication over a discrete memoryless channel (DMC) where represents the message, the output of the channel encoder, and the input to the channel decoder. The purpose of this work is to provide a DRV such that

  • the cardinality of its alphabet, , grows sub-exponentially with ,

  • form a Markov chain in that order, and

  • induces information stability (see [2, 3]).

The last property above means that for probability that converges to unity with

, a chosen randomly according to will exhibit convergence in probability of


are conditional entropies. The construct of is similar to that of a DRV that determines the empirical distribution, or type as defined in [4, Chap. 2], of .

Such a property proves to be extremely useful in establishing information-theoretic necessary conditions directly from operational requirements. To understand the importance of such a capability, consider Fano’s inequality [5], which states that given DRVs and , if then

where is a Bernoulli random variable with parameter . A typical application of Fano’s inequality generally views the DRV and

as respectively being a message sent at the transmitter and an estimate of the message made at the receiver of a communication system. Thus Fano’s inequality gives us an upper bound on

from the operational requirement of maintaining a small transmission error probability, i.e., . While no one would argue against the utility of Fano’s inequality, it is clear that it can only provide a bound for one specific operational requirement, namely a small transmission error probability. In contrast, inducing information stability allows stochastic terms to be replaced with information theoretic averages directly in the operational quantities. Thus, such a method is applicable to all operational requirements that can be written as functions of the distributions of the DRVs involved. To demonstrate this methodology we provide three different examples: one way communication over a discrete memoryless channel; tighter bounds on the probability of intrusion in a generalization of a problem introduced by Lai et al. [6]; and establishing the capacity of the wire-tap channel with finite error and leakage under two different secrecy metrics. These problems are chosen so as to present a wide range of operational requirements for which this new methodology can extract information theoretic necessary conditions. Furthermore, while our first example is chosen simply to present the reader with a well-studied problem in information theory, the second and third examples establish new results.
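As a concrete aside, Fano’s inequality is easy to verify numerically. The following sketch checks the bound on the conditional entropy for an arbitrary illustrative joint distribution; the pmf and the 3-letter alphabet are our own choices, not taken from this paper.

```python
import math

# Numerical check of Fano's inequality on an arbitrary illustrative joint
# pmf over a 3-letter alphabet (rows index x, columns index xhat).
# Fano: P(X != Xhat) <= eps implies H(X|Xhat) <= h(eps) + eps*log2(|X|-1).

def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

joint = [[0.30, 0.02, 0.01],
         [0.02, 0.28, 0.02],
         [0.01, 0.02, 0.32]]

# Error probability: total mass off the diagonal.
eps = sum(joint[x][xh] for x in range(3) for xh in range(3) if x != xh)

# Conditional entropy H(X | Xhat) = -sum_{x,xh} p(x,xh) log2 p(x|xh).
cond_H = 0.0
for xh in range(3):
    p_xh = sum(joint[x][xh] for x in range(3))
    for x in range(3):
        if joint[x][xh] > 0:
            cond_H -= joint[x][xh] * math.log2(joint[x][xh] / p_xh)

bound = h2(eps) + eps * math.log2(3 - 1)
print(f"P(error) = {eps:.2f}, H(X|Xhat) = {cond_H:.3f} <= Fano bound = {bound:.3f}")
```

Note how close the two numbers are here: Fano’s bound can be nearly tight, which is part of why it is so widely used despite applying to only this one operational requirement.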

The rest of the paper is organized as follows. First we conclude the introduction by describing the notation used throughout the rest of the paper in Section I-A; particular attention should be given to the definition of a regular collection of DRVs (Definition 2), which delineates where the theorems can be applied. Following this, we highlight relevant work in Section II. We then present our main results in Section III, and applications thereof in Section IV. The proof of each main theorem is given its own treatment in Sections VI to VIII. This is done so that we may first present a suite of lemmas which serve to reduce the complexity of those proofs. Conclusions are found in Section IX, and some miscellaneous proofs are found in the appendix.

I-A Notation

Constants, random variables (RVs), and sets will be denoted by lower case, upper case, and script letters, respectively. Function returns the probability of the event in the predicate. We will always employ the corresponding script form of a letter to denote the support set of any DRV. That is, if is a DRV, then is the set of all for which . Functions will be lower case or upper case depending on whether or not they are random. Conditional DRVs and events will be denoted by , for example the DRV given the event is written .

The set of positive integers is written as , and the set of positive real numbers is written as . Furthermore, denotes the set of integers starting at and ending at , inclusive. We use to denote collections of constants, DRVs, etc. For instance, the collection of three DRVs . Throughout this paper . That is, denotes a sequence of possibly mutually dependent DRVs, and denotes a sequence of constants all from . Note that we have omitted the dependence on for simpler notation, and will continue to do so for the rest of the paper except when it is necessary to highlight the dependence. The support set of is clearly a subset of . Also, when , by convention this defines as some unspecified constant. Henceforth, we will only rarely need to refer back to the individual elements of the -length sequences and . As such, subscripts of DRVs will primarily be used to denote a collection of multiple -length DRV sequences, such as , for some .

Probability distributions, being deterministic functions over their support sets, will be denoted with lower case letters. Of particular importance will be , which will always denote a probability distribution, and when written with the subscript of DRVs, specifically denotes the associated probability distribution over said random variables. For instance, . With this notation is itself a RV, while is a fixed value. When the context is clear, we may drop the subscript entirely. Furthermore, for any . The set of all possible conditional distributions of the form , where and , is denoted . For DRVs and if , we will write or when clear . The empirical conditional distribution of is defined as for . The set of all valid empirical distributions for an -length sequence will be denoted . For empirical conditional distributions we shall use where , to denote the set of conditional empirical distributions for which is a valid distribution in .
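The empirical-distribution (type) notation above can be sketched in code. The helper names below are our own; exact rational arithmetic makes it visible that every type places its mass in multiples of 1/n.

```python
from collections import Counter
from fractions import Fraction

# Exact computation of types (empirical distributions); helper names are ours.
def empirical(seq):
    """Type of an n-length sequence: p(a) = (# of positions equal to a) / n."""
    n = len(seq)
    return {a: Fraction(c, n) for a, c in Counter(seq).items()}

def conditional_empirical(xseq, yseq):
    """Empirical conditional distribution p(y|x) for paired n-length sequences."""
    assert len(xseq) == len(yseq)
    joint = empirical(list(zip(xseq, yseq)))
    marginal = empirical(xseq)
    return {(x, y): p / marginal[x] for (x, y), p in joint.items()}

x, y = "aabab", "01011"
print(empirical(x))                  # p(a) = 3/5, p(b) = 2/5
print(conditional_empirical(x, y))   # p(1|a) = 2/3, p(0|b) = p(1|b) = 1/2
```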

Many of the results to be presented in later sections involve DRVs that satisfy specific sets of relationships and/or properties. For relationships between DRVs in particular, we will use the following two operators. First, if , then DRVs form a Markov chain in that order. In other words, for all . On the other hand, if , then can be written as a deterministic function of . For any DRVs , if then . To simplify the statements of our results, we will adopt standard set notation when describing DRVs satisfying a specific set of properties. For instance, the DRVs that satisfy the conditions that and that will be denoted by .

Information-theoretic quantities which are averages over probability distributions of DRVs will be denoted by blackboard bold letters. Specifically, the following quantities will receive heavy use:
For DRVs , and probability distributions and ,

It should be noted that while is a constant, is a RV. More specifically, is the expected value of over . Moreover, we will employ the concept of the entropy spectrum [7] in the development of some of our results. In particular, we will mostly consider the entropy spectrum frequency of , which is defined as

The conditioning notation will be omitted in the special cases where . Furthermore, the subscript may be omitted when the context is clear. Finally, we note that the exact bounds obtained in this paper quickly become unwieldy. This is unfortunate because it detracts from the elegance of the stated results. As a compromise, we introduce the following order terminology, which is similar in spirit to Bachmann-Landau notation but has a formal definition that is context sensitive.

Definition 1.

For any , we say if there exists a constant (that is possibly a function of the cardinalities of the alphabets involved) such that

Throughout the paper our results will be expressed in terms of , for some , with the acceptable values of being themselves a function of . The exact calculations of the order terms are cumbersome but trivial, and we will skip most of them except for a few particularly important ones.

Now, we restrict the DRVs that our main theorems are applicable to.

Definition 2.

(Regular collection of DRVs) For any arbitrary index set and any , DRVs form a regular collection if

  • and are finite,

  • and for all ,

  • ,

  • is distributed , where for all , and

  • .

Furthermore, to simplify notation we assume that for all , and when we will assume that

Note that neither of these assumptions is restrictive given the first requirement of the definition.

II Background

II-A Images and Quasi-images

The manipulation of images and quasi-images will play an important role in establishing our theorems. Let us define these concepts. For all discussions and results in this section, it is assumed that is a regular collection of DRVs.

Definition 3.

([4, Ch. 15]) Let . For any , a set is called an -image of (generated) by if

Furthermore denotes the minimum cardinality (size) of -images of by . That is,

Definition 4.

([4, Problem 15.13]) Let . For any , a set is called an -quasi image of by if

Furthermore, denotes the minimum cardinality (size) of -quasi-images of by .

Image sizes were originally introduced by Gács and Körner [8] and Ahlswede et al. [9], and found use in proving strong converses via the blowing up lemma, which the authors of [8] and [9] credit to Margulis [10]. In our paper’s context, the blowing up lemma will play an important role because of how it relates image sizes. Before pointing out the lemmas which will find use in this paper, we refer readers to [4, Chap. 5], [11], and [12, Chapter 3] for an information theoretic context of the blowing up lemma.

Lemma 5.

([4, Lemma 6.6]) Given , , , and , there exists , where such that

for every , and every distribution .

While it is possible to derive the theorems in Section III directly from Lemma 5, we now take the further step of providing an upper bound on . This can be done by combining a lemma which discusses the change in probability given a blow up (see Liu et al. [13] or Raginsky and Sason [12, Lemma 3.6.2]; the latter provides the same order for as can be obtained via [4, Lemmas 5.3, 5.4], but is a little sharper and much simpler to present) with an upper bound on the increase in the image size due to the blow up (see Ahlswede et al. [9, Lemma 3] or Csiszár and Körner [4, Lemma 5.1]).

Lemma 6.

For any and , we have

where is a Bernoulli DRV with parameter .

Remark 7.

The value of will play a pivotal role in the bounds to come. In fact, a tighter bound on the value of would directly lead to tighter bounds for multiple theorems in this paper. Because of this, we feel it necessary to bring forth recent work by Liu et al. [13, 14], who endeavor to provide an alternative to the blowing-up lemma which offers tighter bounds for certain information theoretic problems. By using functional inequalities and the reverse hypercontractivity of particular Markov semigroups instead of the blowing up lemma, they have been able to obtain order-tight bounds on the hypothesis testing problem. While hypothesis testing does not directly extend to determining minimum image and quasi-image sizes, it is clear that the two problems are closely related. In particular, the geometric interpretations of their work may lead to further insights which allow for an improvement of the term.

In terms of applications, Ahlswede [15] used the blowing up lemma to prove a local strong converse for maximal error codes over a two-terminal DMC, showing that all bad codes have a good subcode of almost the same rate. Using the same lemma, Körner and Marton [16] developed a general framework for determining the achievable rates of a number of source and channel networks. On the other hand, many of the strong converses for some of the most fundamental multi-terminal DMCs studied in the literature were proven using image size characterization techniques. Körner and Marton [17] employed such a technique to prove the strong converse of the discrete memoryless asymmetric broadcast channel. Later, Dueck [18] used these methods, combined with an ingenious “wringing technique,” to prove the strong converse of the discrete memoryless multiple access channel with independent messages.

II-B Other works of interest

Here we wish to briefly highlight a few of the methods by which information theoretic necessary conditions are generally obtained, first and foremost being Fano’s inequality [5]. Fano’s inequality and its generalizations (for instance, Han and Verdú [19]) directly provide information theoretic necessary conditions from probability of error requirements. One significant problem is that it requires the error probability to go to zero with in order to obtain tight bounds in certain scenarios. One such scenario is establishing bounds on the number of messages that can be reliably distinguished in one-way communication over a DMC, which we discuss in more detail in Section IV. While, as first claimed by Shannon and proven by Wolfowitz [20], this value does not change when allowing a finite error probability, the bound obtained from Fano’s inequality does increase with the error term.

In fact, this allowed Wolfowitz to introduce the concept of a capacity dependent upon error, usually denoted by . Because of this, there exists a demarcation between converses which are primarily independent of the error rate, and those which are tight only if the probability of error vanishes. Following the terminology of Csiszár and Körner [4, Pg. 93], a converse result showing for all is called a strong converse. Verdú and Han [21] showed that the stronger assertions, namely that this is true for all finite and that all larger rates must have error probability approaching unity, hold for all two-terminal DMCs. More recently, techniques such as the meta-converse of Polyanskiy et al. [22] have been able to establish tight necessary conditions as a function of the error probability up to the second order. The meta-converse leverages the idea that any decoder can be considered as a binary hypothesis test between the correct codeword set and the incorrect codeword set. By bounding the decoding error with the best binary hypothesis test, new bounds, which are relatively tight even for small values of , can be established.

Thus, for the single operational requirement of transmission error probability, multiple different methodologies have been derived in order to obtain increasingly strong results. While each of these methodologies can be applied to different channels, they all still require the probability-of-error operational requirement as a starting point. The only general methodologies that transcend this limitation are those related to the information spectrum, as first defined by Verdú and Han [21]. For an in-depth treatment of information spectrum methods, we point the reader to Han’s book [7]. The information-spectrum methods, in general, link operational quantities directly to information/entropy spectrum frequencies. Hence solving extremal problems of the information spectrum in turn determines the fundamental limits of these operational quantities. These methods are incredibly strong and universally applicable, but generally cannot easily relate back to the more traditional information theoretic quantities like entropy and mutual information. Our work takes this further step, but at the cost of having to restrict our attention to DRVs that form a regular collection. Since such DRVs are the most commonly used, we feel this trade-off is one worth pursuing, because many operational requirements beyond the transmission error probability are of recent interest.

III Main results

Given a regular collection , our primary goal is to “stabilize” , when conditioned on , where , in the sense that the entropy spectrum of is concentrated around a single frequency. More precisely, we want

for some , and and that vanish with increasing . It is easy to see that a statement such as the above is not true in general. But, as we will show, it can be achieved by introducing a particular stabilizing DRV. This stabilization will allow for the direct exchange of probabilities and entropy terms, thanks to the following lemma.

Lemma 8.

Given DRVs , and , if

for some , then

Corollary 9.


The lemma’s proof can be found in Appendix -A. Thus stabilizing has the added benefit that converges to in probability for large . From this exchange, we can easily create new necessary conditions for different information theoretic problems, as we demonstrate in Section IV.
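For intuition, in the i.i.d. special case this kind of concentration is simply the asymptotic equipartition property, which is easy to observe numerically. The sketch below is our own illustration (an arbitrary source pmf, not the paper’s construction): the normalized self-information of a sampled sequence collapses onto a single frequency as the block length grows.

```python
import math
import random

# Sample the normalized self-information of i.i.d. sequences to watch the
# entropy spectrum collapse to a single frequency (the AEP); the source pmf
# below is an arbitrary illustrative choice.
random.seed(0)
pmf = {'a': 0.5, 'b': 0.25, 'c': 0.25}
H = -sum(p * math.log2(p) for p in pmf.values())  # = 1.5 bits

def rate(n):
    """-(1/n) log2 p(X^n) for one sampled i.i.d. sequence of length n."""
    seq = random.choices(list(pmf), weights=list(pmf.values()), k=n)
    return -sum(math.log2(pmf[s]) for s in seq) / n

for n in (10, 100, 10000):
    samples = [rate(n) for _ in range(200)]
    print(f"n = {n:5d}: observed rates span "
          f"[{min(samples):.3f}, {max(samples):.3f}] around H = {H}")
```

The span of observed rates shrinks as n grows; the theorems of this section are about forcing this behavior, via conditioning, for sequences that are not information stable on their own.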

In order to construct the information stabilizing random variable, first for a given regular collection , we find a subset for which the quasi-image of by a specific is stable.

Theorem 10.

Given any regular collection and any , there exists both a set , where


and positive real numbers and such that


for all and .

The proof of Theorem 10 can be found in Section V. Next, we repeatedly use Theorem 10 to carve out different sets which induce stability. Building directly upon Theorem 10, we construct the following theorem.

Theorem 11.

(Information stabilizing partitions) For any regular collection and real number , we have


  • a DRV ,

  • a positive real number , and

  • for each DRV , and , there exists a set such that




    for all .

The proof can be found in Section VI. In and of itself, Theorem 11 provides a new methodology for obtaining information theoretic necessary conditions for certain problems. Still, the applicability of this methodology can be improved by also stabilizing .

Theorem 12.

For any DRVs , positive integer , and positive real numbers , we have:


  • a DRV

  • a real number , and

  • for each DRV and , there exist sets and such that


    for all , and


    for all .

Furthermore, if is uniform over , then



for all and .

Notice that providing stability to , and would then also provide stability to and . Providing stability to may be instantly recognizable to the reader as stabilizing a message given an observation.

The need for our second augmentation theorem arises from the fact that Theorem 11 cannot, in and of itself, simultaneously provide stable quasi-images for all product distributions in . Indeed, the reason is that there are an infinite number of such distributions, with even the number of conditional empirical distributions possible for symbols growing polynomially with . In turn, the support set of would have to grow exponentially with , which is something we are trying to avoid as it would make our results trivial. The following augmentation theorem rectifies this problem by providing a set which, if stabilized, guarantees stability for all .

First, a quick point of emphasis. For the upcoming theorem, we begin to adopt the notation outlined previously where and is distributed for .

Theorem 13.

For any real number , there exists a subset

with the following property:
Given a regular collection , for each there exists a such that if

for some , then


At this point, Theorems 11, 12, and 13 represent the main breadth of our contribution. But it is clear that these theorems are somewhat unwieldy. To simplify their use we will essentially combine Theorems 11, 12, and 13 into a single corollary which simultaneously stabilizes and for all and . Because of the tension between the accuracy of the stabilization and the support set of the stabilizing random variable, we will construct the following corollary to only stabilize such that , with the remaining being contained in their own set. Similar corollaries can be obtained with less accuracy on the stabilization but with a much larger range of stabilized values (e.g., stabilizing all such that ). While such a trade-off would be useful for scenarios such as ID coding, it would not be appropriate for the examples presented here.

In order to simplify analysis we introduce the following definition.

Definition 14.

For any regular collection and DRV , the -stable sets for , , and are


while the -saturated sets for , , and are


If is uniform over , replace in the above with .

Now if then . In addition, if , then . In that sense, and consist of the probability terms which are well described by information theoretic quantities. Combining Theorems 11, 12, and 13 with Lemma 30 allows us to establish the following result.

Corollary 15.

Let . For any regular collection and any DRV , there exists:


  • a DRV ,

  • positive real number , and

  • set such that


    and for each and either




    Furthermore if is uniform over , then (12) holds.

The proof can be found in Appendix -B. Note that the error term is primarily due to the result holding simultaneously for all distributions in . If this term is of importance in a potential application, and if only a finite and fixed number of quasi-images need to be stabilized, then the order term can be improved by simply combining Theorems 11 and 12.

IV Applications

In this section we will highlight a new methodology by which to obtain information theoretic necessary conditions. First we will apply this new methodology to a classical problem, in order to highlight how it works and how it differs from conventional approaches. In doing so we will provide extra commentary at each step in order to make the general application of the methodology plain. Next, we apply this methodology to establish new results for channels which require authentication and for wire-tap channels. These examples were chosen in order to demonstrate how this new methodology obtains information theoretic necessary conditions from a wide range of operational requirements.

IV-A One-way communication over a DMC

Here we consider a classical problem in information theory, channel coding over a DMC . In this model a source wants to send a message , which will be chosen at random according to some arbitrary distribution over

, to the destination. Connecting the source and destination is a DMC characterized by the conditional probability distribution

. To facilitate communication, the source and destination agree ahead of time upon a “code” consisting of an encoder and a decoder , both of which may be stochastic.
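As an illustrative stand-in for such a code, and not anything prescribed by this paper, the following sketch pairs a repetition encoder with a majority-vote decoder over a binary symmetric channel and estimates the transmission error probability by simulation. The names f, channel, and phi, as well as the channel parameters, are our own choices.

```python
import random

# Illustrative stand-ins (not from the paper) for the encoder f and decoder
# phi: a 1-bit message, an n-fold repetition encoder, a BSC(p) channel, and
# a majority-vote decoder. Monte Carlo estimates P(M != Mhat).
random.seed(1)
p, n = 0.1, 5  # assumed crossover probability and block length

def f(m):
    """Encoder: repeat the message bit n times."""
    return [m] * n

def channel(x):
    """BSC(p): flip each input symbol independently with probability p."""
    return [b ^ (random.random() < p) for b in x]

def phi(y):
    """Decoder: majority vote over the received symbols."""
    return int(sum(y) > n / 2)

trials, errors = 20000, 0
for _ in range(trials):
    m = random.randint(0, 1)
    errors += phi(channel(f(m))) != m
print(f"estimated P(M != Mhat) = {errors / trials:.4f}")
```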

For the code to be considered operational it must satisfy the following error probability criterion for some pre-arranged :


We note that the distribution of is induced by with . Since form a regular collection of DRVs, we can apply Corollary 15. Doing so will allow us to directly transform (14) into a set of information theoretic necessary conditions. Before demonstrating this, we shall describe how one would apply Fano’s inequality to attempt to achieve the same task, and what the shortcomings of doing so are.

IV-A1 Fano’s inequality

Without a uniform distribution over

, Fano’s inequality can only (essentially) provide


Now, if it were the case that was uniform over , then (15) reduces to


and if we were further to assume that , for some , then asymptotically we could say


These bounds can be further simplified using the data processing inequality and single-letterization techniques to show


which when substituted back into (17) yields


But notice the assumptions that had to be made to obtain this result. First, we had to assume was uniform, and second, we had to assume that the probability of error decays to zero with . The second assumption has already been the subject of much study, leading to the eventual demarcation between the strong and weak converse. With this in mind, we instead consider what happens when the first assumption is dropped. In fact, repeating these steps with a non-uniform gives


Notice that Equation (20) looks like a sufficient condition as well, and it actually is if is information stable. This condition is not, however, sufficient for general . Consider the following example to convince yourself of this fact. Let , , have the following distribution


This , on average, cannot be reliably transmitted over the channel. To see this, consider the case where any potential decoder is given side information that determines whether or . When the decoder is informed that , then clearly the probability of error of the decoder can be eliminated. On the other hand, when , the number of potential messages greatly exceeds capacity, and as a result the probability of error must be close to . This latter fact is a by-product of the strong converse for the DMC. Thus, even with this side information, the best possible decoder could only obtain a minimum probability of error of just below . At the same time, though, it is easy to calculate


which is less than for large enough as long as . As a consequence, Equation (20) cannot also provide a matching sufficient condition; in other words, Equation (20) only provides a loose necessary condition.
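A counterexample of this flavor can be checked numerically. In the sketch below the rates R1 < C < R2 with (R1 + R2)/2 < C are assumed values of our own choosing: the entropy rate of the mixture falls below capacity even though half of the probability mass sits at a rate exceeding capacity, so the error probability cannot fall below roughly one half.

```python
# Numeric check of a mixture counterexample, with assumed rates R1 < C < R2
# such that (R1 + R2)/2 < C. M is uniform on a set of size 2**(n*R1) with
# probability 1/2 and uniform on a disjoint set of size 2**(n*R2) otherwise.
C, R1, R2 = 0.5, 0.1, 0.8

def entropy_rate(n):
    """H(M)/n for the mixture: H(M) = 1 + (n*R1)/2 + (n*R2)/2 bits."""
    return (1 + n * R1 / 2 + n * R2 / 2) / n

for n in (100, 1000, 10000):
    print(f"n = {n:5d}: H(M)/n = {entropy_rate(n):.4f} < C = {C}, "
          f"yet rate R2 = {R2} > C carries half the probability mass")
```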

IV-A2 Information-stable partitions

Now we move on to our methodology, which, even without the assumption that is information stable, or that as a function of , yields


for some , as necessary to ensure (14). First shown by Han [7, Theorems 3.8.5 & 3.8.6], Equation (23) is not only necessary, but also has a matching sufficient condition (with the usual asymmetry in the sign of the negligible terms). We briefly discuss the sufficiency before completing the example. Observe that there can only be values of such that . We will refer to this set of messages as the transmissible set. It would be simple to construct a reliable channel code for the transmissible set, while mapping all messages not from the transmissible set to some fixed codeword. As a result, if is chosen from the transmissible set, the probability of error would be near , and otherwise . Hence there exists a coding scheme for which the error probability converges to the probability that is not chosen from the transmissible set. This previous statement is essentially Equation (23).
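The transmissible-set bookkeeping just described can be sketched with assumed numbers. Messages are grouped by a common probability; a group counts as transmissible when its self-information, -log2 p(m), is at most n times the capacity, and the achievable error floor is the mass of the remaining groups. The group sizes and capacity below are our own illustrative choices.

```python
import math

# Transmissible-set bookkeeping with assumed numbers: message groups are
# (count, per-message probability); a group is transmissible when its
# self-information -log2(p) is at most n*C.
n, C = 100, 0.5
groups = [(2 ** 10, 0.5 / 2 ** 10),   # -log2 p = 11 bits <= n*C = 50
          (2 ** 80, 0.5 / 2 ** 80)]   # -log2 p = 81 bits >  n*C = 50

transmissible_mass = sum(k * p for k, p in groups if -math.log2(p) <= n * C)
error_floor = 1 - transmissible_mass
print(f"transmissible mass = {transmissible_mass}, "
      f"achievable error floor = {error_floor}")
```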

Returning to establishing the necessary conditions: in general, our methodology looks to directly replace the operational requirement’s probability terms with information theoretic quantities. Here the operational requirement (Equation (14)) can be written as


Next, because constitute a regular collection of DRVs, there exists

  • a DRV ,

  • positive real number , and

  • such that


    such that


    and either




    for all ,

where by Corollary 15. The set is not considered because the random variable is trivially uniform by convention. Introducing into the LHS of (24) via the law of total probability yields


Now, let