I Introduction
Consider arbitrary discrete random variables (DRVs) , which form a Markov chain in that order, where , , and for some conditional distribution . DRVs of this nature are often found in the literature on information theory, starting when Shannon [1] considered one-way communication over a discrete memoryless channel (DMC), where represents the message, the output of the channel encoder, and the input to the channel decoder. The purpose of this work is to provide a DRV such that

the cardinality of its alphabet, , grows subexponentially with ,

form a Markov chain in that order, and
The last property above means that, with probability converging to unity in , a chosen randomly according to will exhibit convergence in probability of , where
are conditional entropies. The construction of is similar to that of a DRV that determines the empirical distribution, or type as defined in [4, Chap. 2], of .
Such a property proves to be extremely useful in establishing information-theoretic necessary conditions directly from operational requirements. To understand the importance of such a capability, consider Fano’s inequality [5], which states that given DRVs and , if then
where is a Bernoulli random variable with parameter . A typical application of Fano’s inequality views the DRVs and as, respectively, a message sent at the transmitter and an estimate of that message made at the receiver of a communication system. Thus Fano’s inequality gives us an upper bound on
from the operational requirement of maintaining a small transmission error probability, i.e., . While no one would argue against the utility of Fano’s inequality, it is clear that it can only provide a bound for one specific operational requirement, namely a small transmission error probability. In contrast, inducing information stability allows for the replacement of stochastic terms with information-theoretic averages directly in the operational quantities. Thus, such a method is applicable to all operational requirements that can be written as functions of the distributions of the DRVs involved. To demonstrate this methodology we provide three different examples: one-way communication over a discrete memoryless channel; tighter bounds on the probability of intrusion in a generalization of a problem introduced by Lai et al. [6]; and establishing the capacity of the wiretap channel with finite error and leakage under two different secrecy metrics. These problems were chosen so as to present a wide range of operational requirements for which this new methodology can extract information-theoretic necessary conditions. Furthermore, while our first example is chosen simply to present the reader with a well-studied problem in information theory, the second and third examples establish new results. The rest of the paper is organized as follows. First we conclude the introduction by describing the notation used throughout the rest of the paper in Section I-A; particular attention should be given to the definition of a regular collection of DRVs (Definition 2), which delimits where the theorems can be applied. Following this, we highlight relevant work in Section II. We then present our main results in Section III, and applications thereof in Section IV. The proof of each main theorem is given its own treatment in Sections VI to VIII. This is done so that we may first present a suite of lemmas which serve to reduce the complexity of the proofs.
Conclusions are found in Section IX, and some miscellaneous proofs are found in the appendix.
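As a concrete, entirely generic illustration of the Fano bound discussed above, the following sketch evaluates the standard form H(M | M̂) ≤ h_b(ε) + ε log₂(|ℳ| − 1). The message-set size and error probability below are arbitrary placeholders, not quantities defined in this paper.

```python
import math

def binary_entropy(eps):
    """h_b(eps) in bits; h_b(0) = h_b(1) = 0 by convention."""
    if eps in (0.0, 1.0):
        return 0.0
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)

def fano_bound(eps, m_card):
    """Standard Fano bound: H(M | M_hat) <= h_b(eps) + eps * log2(|M| - 1)."""
    return binary_entropy(eps) + eps * math.log2(m_card - 1)

# Placeholder numbers: 2^10 messages, error probability 1%.
print(fano_bound(0.01, 2 ** 10))   # about 0.18 bits of residual uncertainty
```

Note how the bound grows with ε; this is precisely why Fano-based converses loosen when the error probability is only bounded away from one rather than required to vanish.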
I-A Notation
Constants, random variables (RVs), and sets will be denoted by lower case, upper case, and script letters, respectively. The function returns the probability of the event in the predicate. We will always employ the corresponding script form of a letter to denote the support set of a DRV; that is, if is a DRV, then is the set of all for which . Functions will be lower case or upper case depending on whether or not they are random. Conditional DRVs and events will be denoted by ; for example, the DRV given the event is written .
The set of positive integers is written as , and the set of positive real numbers is written as . Furthermore, denotes the set of integers starting at and ending at , inclusive. We use to denote collections of constants, DRVs, etc.; for instance, the collection of three DRVs . Throughout this paper . That is, denotes a sequence of possibly mutually dependent DRVs, and denotes a sequence of constants, all from . Note that we have omitted the dependence on for simpler notation, and will continue to do so for the rest of the paper except when it is necessary to highlight the dependence. The support set of is clearly a subset of . Also, when , by convention this defines as some unspecified constant. Henceforth, we will only rarely need to refer back to the individual elements in the length sequences of and . As such, subscripts of DRVs will primarily be used to denote a collection of multiple length DRV sequences, such as , for some .
Probability distributions, being deterministic functions over their support sets, will be denoted with lower case letters. Of particular importance will be , which will always denote a probability distribution; when written with the subscript of DRVs, it specifically denotes the associated probability distribution over said random variables. For instance, . With this notation, is itself a RV, while is a fixed value. When the context is clear, we may drop the subscript entirely. Furthermore, for any . The set of all possible conditional distributions of the form , where and , is denoted . For DRVs and , if , we will write or, when clear, . The empirical conditional distribution of is defined as for . The set of all valid empirical distributions for a length sequence will be denoted . For empirical conditional distributions we shall use , where , to denote the set of conditional empirical distributions for which is a valid distribution in .
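To make the notion of an empirical (conditional) distribution concrete, here is a minimal sketch; the function names and the toy sequences are ours, chosen purely for illustration rather than taken from the paper.

```python
from collections import Counter
from fractions import Fraction

def empirical_distribution(seq):
    """The type of a finite sequence: each symbol's count divided by the length."""
    n = len(seq)
    return {sym: Fraction(k, n) for sym, k in Counter(seq).items()}

def empirical_conditional(xs, ys):
    """Empirical conditional distribution of y given x for paired sequences."""
    joint = Counter(zip(xs, ys))
    marginal = Counter(xs)
    return {(x, y): Fraction(k, marginal[x]) for (x, y), k in joint.items()}

print(empirical_distribution("aabba"))            # a -> 3/5, b -> 2/5
print(empirical_conditional("aabba", "01011"))    # e.g. ('a', '1') -> 2/3
```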
Many of the results to be presented in later sections involve DRVs that satisfy specific sets of relationships and/or properties. For relationships between DRVs in particular, we will use the following two operators. First, if , then the DRVs form a Markov chain in that order; in other words, for all . On the other hand, if , then can be written as a deterministic function of . For any DRVs , if then . To simplify the statements of our results, we will adopt standard set notation when describing DRVs satisfying a specific set of properties. For instance, the DRVs that satisfy the conditions that and that will be denoted by .
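For finitely supported joints, the Markov-chain relationship above can be checked numerically: p(x, y, z) p(y) = p(x, y) p(y, z) must hold for every triple. A small sketch, with all names and numbers our own:

```python
from itertools import product

def is_markov_chain(p_xyz, tol=1e-12):
    """True iff the joint {(x, y, z): prob} satisfies the chain X - Y - Z,
    i.e. p(x, y, z) * p(y) == p(x, y) * p(y, z) for every triple."""
    p_y, p_xy, p_yz = {}, {}, {}
    for (x, y, z), p in p_xyz.items():
        p_y[y] = p_y.get(y, 0.0) + p
        p_xy[(x, y)] = p_xy.get((x, y), 0.0) + p
        p_yz[(y, z)] = p_yz.get((y, z), 0.0) + p
    xs = {x for x, _, _ in p_xyz}
    ys = {y for _, y, _ in p_xyz}
    zs = {z for _, _, z in p_xyz}
    return all(abs(p_xyz.get((x, y, z), 0.0) * p_y.get(y, 0.0)
                   - p_xy.get((x, y), 0.0) * p_yz.get((y, z), 0.0)) <= tol
               for x, y, z in product(xs, ys, zs))

# Built as p(x) p(y|x) p(z|y): a Markov chain by construction.
chain = {(x, x, x ^ f): 0.5 * (0.9 if f == 0 else 0.1)
         for x in (0, 1) for f in (0, 1)}
print(is_markov_chain(chain))                             # True
print(is_markov_chain({(0, 0, 0): 0.5, (1, 0, 1): 0.5}))  # False: Z copies X past Y
```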
Information-theoretic quantities that are averages over probability distributions of DRVs will be denoted by blackboard-bold letters. Specifically, the following quantities will receive heavy use:
For DRVs , and probability distributions and ,
It should be noted that while is a constant, is a RV. More specifically, is the expected value of over . Moreover, we will employ the concept of the entropy spectrum [7] in the development of some of our results. In particular, we will mostly consider the entropy spectrum frequency of , which is defined as
The conditioning notation will be omitted in the special cases where . Furthermore, the subscript may be omitted when the context is clear. Finally, we note that the exact bounds obtained in this paper quickly become unwieldy, which detracts from the elegance of the stated results. As a compromise, we introduce the following order terminology, which is similar in spirit to Bachmann-Landau notation, but has a formal definition which must be context sensitive.
Definition 1.
For any , we say if there exists a constant (that is possibly a function of the cardinalities of the alphabets involved) such that
Throughout the paper our results will be expressed in terms of , for some , with the acceptable values of being themselves a function of . The exact calculations of the order terms are cumbersome but trivial, and we will skip most such calculations except for a few particularly important ones.
We now restrict the class of DRVs to which our main theorems apply.
Definition 2.
(Regular collection of DRVs) For any arbitrary index set and any , DRVs form a regular collection if

and are finite,

and for all ,

,

is distributed , where for all , and

.
Furthermore, to simplify notation we assume that for all , and when we will assume that
Note that neither of these assumptions is in the least restrictive, given the first requirement of the definition.
II Background
II-A Images and Quasi-Images
The manipulation of images and quasi-images will play an important role in establishing our theorems; let us now define these concepts. For all discussions and results in this section, it is assumed that is a regular collection of DRVs.
Definition 3.
([4, Ch. 15]) Let . For any , a set is called an image of (generated) by if
Furthermore denotes the minimum cardinality (size) of images of by . That is,
Definition 4.
([4, Problem 15.13]) Let . For any , a set is called an quasi-image of by if
Furthermore, denotes the minimum cardinality (size) of quasi-images of by .
Image sizes were originally introduced by Gács and Körner [8] and Ahlswede et al. [9], and found use in proving strong converses due to the blowing-up lemma, which the authors of [8] and [9] credit to Margulis [10]. In our paper’s context, the blowing-up lemma will play an important role because of how it relates image sizes. Before pointing out the lemmas which will find use in this paper, we refer readers to [4, Chap. 5], [11], and [12, Chapter 3] for an information-theoretic treatment of the blowing-up lemma.
Lemma 5.
([4, Lemma 6.6]) Given , , , and , there exists , where such that
for every , and every distribution .
While it is possible to derive the theorems in Section III directly from Lemma 5, we now take the further step of providing an upper bound on . This can be done by combining a lemma which quantifies the change in probability under a blow-up (see Liu et al. [13], or Raginsky and Sason [12, Lemma 3.6.2], the latter providing the same order for as can be obtained via [4, Lemmas 5.3, 5.4] while being a little sharper and much simpler to present) with an upper bound on the increase in the image size due to the blow-up (see Ahlswede et al. [9, Lemma 3] or Csiszár and Körner [4, Lemma 5.1]).
Lemma 6.
For any and , we have
where is a Bernoulli DRV with parameter .
Remark 7.
The value of will play a pivotal role in the bounds to come. In fact, a tighter bound on the value of would directly lead to tighter bounds for multiple theorems in this paper. Because of this, we feel it necessary to highlight recent work by Liu et al. [13, 14], who endeavor to provide an alternative to the blowing-up lemma that offers tighter bounds for certain information-theoretic problems. By using functional inequalities and the reverse hypercontractivity of particular Markov semigroups instead of the blowing-up lemma, they have been able to obtain order-tight bounds for the hypothesis testing problem. While hypothesis testing does not directly extend to determining minimum image and quasi-image sizes, it is clear that the two problems are closely related. Specifically, the geometric interpretations of their work may lead to further insight allowing for an improvement in the term.
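As a toy illustration of the blow-up operation underlying these lemmas (a sketch of the operation only, not of the lemma itself): the Hamming ℓ-neighbourhood of a set of binary sequences, and the probability gain it produces under an iid measure. The sequence length and parameters below are arbitrary.

```python
from itertools import combinations

def blow_up(A, n, ell):
    """Hamming ell-neighbourhood of a set A of length-n binary tuples."""
    out = set(A)
    for seq in A:
        for r in range(1, ell + 1):
            for idx in combinations(range(n), r):
                s = list(seq)
                for i in idx:
                    s[i] ^= 1          # flip r chosen coordinates
                out.add(tuple(s))
    return out

def prob(S, p1):
    """Probability of the set S under an iid Bernoulli(p1) measure."""
    return sum(p1 ** sum(s) * (1 - p1) ** (len(s) - sum(s)) for s in S)

n = 6
A = {(0,) * n}                        # a single all-zero sequence
G = blow_up(A, n, 1)
print(len(G))                         # 7: the point plus its n neighbours
print(prob(A, 0.3), prob(G, 0.3))     # the blow-up strictly increases probability
```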
In terms of applications, Ahlswede [15] used the blowing-up lemma to prove a local strong converse for maximal-error codes over a two-terminal DMC, showing that all bad codes have a good subcode of almost the same rate. Using the same lemma, Körner and Marton [16] developed a general framework for determining the achievable rates of a number of source and channel networks. On the other hand, many of the strong converses for some of the most fundamental multi-terminal DMCs studied in the literature were proven using image-size characterization techniques. Körner and Marton [17] employed such a technique to prove the strong converse of the discrete memoryless asymmetric broadcast channel. Later Dueck [18] used these methods, combined with an ingenious “wringing technique,” to prove the strong converse of the discrete memoryless multiple access channel with independent messages.
II-B Other works of interest
Here we wish to briefly highlight a few of the methods by which information-theoretic necessary conditions are generally obtained, first and foremost being Fano’s inequality [5]. Fano’s inequality and its generalizations (for instance, Han and Verdú [19]) directly provide information-theoretic necessary conditions from probability-of-error requirements. One significant problem is that, in certain scenarios, it requires the error probability to go to zero with in order to obtain tight bounds. One such scenario is establishing bounds on the number of messages that can be reliably distinguished in one-way communication over a DMC, which we discuss in more detail in Section IV. While, as first claimed by Shannon and proven by Wolfowitz [20], this value does not change when allowing a finite error probability, the bound obtained from Fano’s inequality does increase with the error term.
Indeed, this allowed Wolfowitz to introduce the concept of a capacity dependent upon the error, usually denoted by . Because of this there exists a demarcation between converses which are primarily independent of the error rate, and those which are tight only if the probability of error vanishes. Following the terminology of Csiszár and Körner [4, Pg. 93], a converse result showing for all is called a strong converse. Verdú and Han [21] showed that the stronger assertions (namely, that this holds for all finite , and that all larger rates must have error probability approaching unity) hold for all two-terminal DMCs. More recently, techniques such as the meta-converse of Polyanskiy et al. [22] have been able to establish tight necessary conditions as a function of the error probability up to the second order. The meta-converse leverages the idea that any decoder can be considered as a binary hypothesis test between the correct codeword set and the incorrect codeword set. By bounding the decoding error by the best binary hypothesis test, new bounds can be established which are relatively tight even for small values of .
Thus, for the single operational requirement of transmission error probability, multiple different methodologies have been derived in order to obtain increasingly strong results. While each of these methodologies can be applied to different channels, they all still require the probability-of-error operational requirement as a starting point. The only general methodologies that transcend this limitation are those related to the information spectrum, as first defined by Verdú and Han [21]. For an in-depth treatment of information-spectrum methods, we point the reader to Han’s book [7]. The information-spectrum methods, in general, link operational quantities directly to information/entropy spectrum frequencies. Hence solving extremal problems of the information spectrum in turn determines the fundamental limits of these operational quantities. These methods are incredibly strong and universally applicable, but generally cannot easily be related back to more traditional information-theoretic quantities such as entropy and mutual information. Our work takes this further step, but at the cost of having to restrict our attention to DRVs that form a regular collection. Since such DRVs are the ones most commonly used, we feel this trade-off is worth pursuing, because many operational requirements, in addition to the transmission error probability, are of recent interest.
III Main results
Given a regular collection , our primary goal is to “stabilize” , when conditioned on , where , in the sense that the entropy spectrum of is concentrated around a single frequency. More precisely, we want
for some , and and that vanish with increasing . It is easy to see that a statement such as the above is not true in general. But, as we will show, it can be achieved by introducing a particular stabilizing DRV. This stabilization will allow for the direct exchange of probabilities and entropy terms, thanks to the following lemma.
Lemma 8.
Given DRVs , and , if
for some , then
Corollary 9.
Furthermore
The lemma’s proof can be found in Appendix A. Thus stabilizing has the added benefit that converges to in probability for large . From this exchange, we can easily create new necessary conditions for different information-theoretic problems, as we demonstrate in Section IV.
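The exchange of probabilities for entropies rests on the normalized self-information −(1/n) log p(X^n) concentrating around an entropy. For iid sequences this is the classical AEP, which the following sketch demonstrates empirically; the distribution and sample sizes are arbitrary choices of ours, and the iid case stands in for the more general stabilized collections treated by the theorems.

```python
import math
import random

random.seed(0)

def empirical_info_rate(p, n):
    """Draw X^n iid from p and return -(1/n) log2 p(X^n)."""
    symbols, weights = zip(*p.items())
    xs = random.choices(symbols, weights=weights, k=n)
    return -sum(math.log2(p[x]) for x in xs) / n

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
H = -sum(q * math.log2(q) for q in p.values())    # exactly 1.5 bits
samples = [empirical_info_rate(p, 2000) for _ in range(200)]
within = sum(abs(s - H) < 0.05 for s in samples)
print(H, within / len(samples))    # nearly all samples land within 0.05 of H
```

The stabilizing DRV of the theorems plays an analogous role for non-iid collections: it concentrates the conditional entropy spectrum so that probability statements can be traded for entropy statements.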
In order to construct the information stabilizing random variable, first, for a given regular collection , we find a subset for which the quasi-image of by a specific is stable.
Theorem 10.
Given any regular collection and any , there exists both a set , where
(1) 
and positive real numbers and such that
(2) 
for all and .
The proof of Theorem 10 can be found in Section V. Next we repeatedly use Theorem 10 to carve out different sets which induce stability. Building directly upon Theorem 10, we construct the following theorem.
Theorem 11.
(Information stabilizing partitions) For any regular collection and real number , we have


a DRV ,

a positive real number , and

for each DRV , and , there exists a set such that
(3) and
(4) for all .
The proof can be found in Section VI. In and of itself, Theorem 11 provides a new methodology for deriving information-theoretic necessary conditions for certain problems. Still, the applicability of this methodology can be improved by also stabilizing .
Theorem 12.
For any DRVs , positive integer , and positive real numbers , we have:


a DRV

a real number , and

for each DRV and , there exist sets and such that
(5) (6) for all , and
(7) for all .
Furthermore, if is uniform over , then
and
(8) 
for all and .
Notice that providing stability to , and , would then also provide stability to and . Providing stability to may be instantly recognizable to the reader as stabilizing a message given an observation.
The need for our second augmentation theorem arises from the fact that Theorem 11 cannot, in and of itself, simultaneously provide stable quasi-images for all product distributions in . Indeed, the reason for this is that there are an infinite number of such distributions, with even the number of conditional empirical distributions possible for symbols growing polynomially with . In turn, the support set of would have to grow exponentially with , which we are trying to avoid, as it would make our results trivial. The following augmentation theorem rectifies this problem by providing a set which, if stabilized, guarantees stability for all .
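The polynomial growth of the number of empirical distributions invoked above is the standard type-counting bound |𝒫_n| ≤ (n + 1)^{|𝒳|}; the exact count is a binomial coefficient. A quick check, with the alphabet size chosen arbitrarily:

```python
from math import comb

def count_types(n, k):
    """Exact number of types (empirical distributions) of length-n sequences
    over a k-symbol alphabet: weak compositions of n into k parts."""
    return comb(n + k - 1, k - 1)

k = 4
for n in (10, 100, 1000):
    exact, bound = count_types(n, k), (n + 1) ** k
    assert exact <= bound
    # polynomial in n, in sharp contrast to the k ** n sequences themselves
    print(n, exact, bound)
```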
First, a quick point of emphasis. For the upcoming theorem, we begin to adopt the notation outlined previously where and is distributed for .
Theorem 13.
For any real number , there exists a subset
with the following property:
Given a regular collection
, for each
there exists a
such that if
for some , then
where
At this point, Theorems 11, 12, and 13 represent the main breadth of our contribution. But it is clear that these theorems are somewhat unwieldy. To simplify their use, we will essentially combine Theorems 11, 12, and 13 into a single corollary which simultaneously stabilizes and for all and . Because of the tension between the accuracy of the stabilization and the support set of the stabilizing random variable, we will construct the following corollary to only stabilize such that , with the remaining being contained in their own set. Similar corollaries can be obtained with less accuracy in the stabilization, but with a much larger range of stabilized values (e.g., stabilizing all such that ). While such a trade-off would be useful for scenarios such as ID coding, it would not be appropriate for the examples presented here.
In order to simplify analysis we introduce the following definition.
Definition 14.
For any regular collection and DRV , the stable sets for , , and are
(9) 
while the saturated sets for , , and are
(10) 
If is uniform over , replace in the above with .
Now if then . In addition, if , then . In that sense, and consist of the probability terms which are well described by information-theoretic quantities. Combining Theorems 11, 12, and 13 and Lemma 30 allows us to establish the following result.
Corollary 15.
Let . For any regular collection and any DRV , there exists:


a DRV ,

positive real number , and

set such that
(11) and for each and either
(12) or
(13) Furthermore if is uniform over , then (12) holds.
The proof can be found in Appendix B. Note that the error term is primarily due to the result holding simultaneously for all distributions in . If this term is of importance in a potential application, and if only a finite and fixed number of quasi-images need to be stabilized, then the order term can be improved by simply combining Theorems 11 and 12.
IV Applications
In this section we will highlight a new methodology by which to obtain information-theoretic necessary conditions. First we will apply this new methodology to a classical problem in order to highlight how it works, and how it differs from conventional approaches. In doing so we will provide extra commentary at each step so that the general application of the methodology is made plain. Next, we apply this methodology to establish new results for channels which require authentication, and for wiretap channels. These examples were chosen in order to demonstrate how this new methodology obtains information-theoretic necessary conditions from a wide range of operational requirements.
IV-A One-way communication over a DMC
Here we consider a classical problem in information theory: channel coding over a DMC . In this model a source wants to send a message , chosen at random according to some arbitrary distribution over , to the destination. Connecting the source and destination is a DMC characterized by the conditional probability distribution . To facilitate communications, the source and destination agree ahead of time upon a “code” consisting of an encoder and a decoder , both of which may be stochastic. For the code to be considered operational, it must satisfy the following error probability criterion for some prearranged :
(14) 
We note that the distribution of is induced by with . Since form a regular collection of DRVs, we can apply Corollary 15. Doing so allows us to directly transform (14) into a set of information-theoretic necessary conditions. Before demonstrating this, we describe how one would apply Fano’s inequality to attempt the same task, and what the shortcomings of doing so are.
IV-A1 Fano’s inequality
Without a uniform distribution over , Fano’s inequality can only (essentially) provide
(15) 
Now, if it were the case that was uniform over , then (15) reduces to
(16) 
and if we were further to assume that , for some , then asymptotically we could say
(17) 
These bounds can be further simplified using the data processing inequality and single-letterization techniques to show
(18) 
which when substituted back into (17) yields
(19) 
But notice the assumptions that had to be made to obtain this result. First we had to assume that was uniform, and second we had to assume that the probability of error decays to zero with . The second assumption has already been the subject of much study, leading to the eventual demarcation between the strong and weak converse. With this in mind, we instead consider what happens when the first assumption is dropped. In fact, repeating these steps with a non-uniform gives
(20) 
Notice that Equation (20) looks like a sufficient condition as well, and actually is one if is information stable. But this condition is not sufficient for general . Consider the following example to convince yourself of this fact. Let , , have the following distribution
(21) 
This , on average, cannot be reliably transmitted over the channel. To see this, consider a case where any potential decoder is given side information that determines whether or . When the decoder is informed that , then clearly the probability of error of the decoder can be eliminated. On the other hand, when , the number of potential messages greatly exceeds capacity and as a result the probability of error must be close to ; this latter fact is a byproduct of the strong converse for the DMC. Thus, even with this side information, the best possible decoder could only obtain a minimum probability of error of just below . At the same time, though, it is easy to calculate
(22) 
which is less than for large enough as long as . As a consequence, Equation (20) cannot provide a matching sufficient condition; in other words, Equation (20) only provides a loose necessary condition.
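The arithmetic behind this counterexample can be sketched with hypothetical numbers (the rates, capacity, and blocklength below are our own illustrative choices, not the exact distribution of (21)): with probability 1/2 the message is uniform over 2^{r₁n} values with r₁ below capacity, and with probability 1/2 uniform over 2^{r₂n} values with r₂ above it.

```python
def bimodal_entropy_rate(n, r1, r2):
    """H(M)/n for the hypothetical bimodal message distribution:
    H(M) = h_b(1/2) + (1/2) r1 n + (1/2) r2 n = 1 + n (r1 + r2) / 2 bits."""
    return (1.0 + n * (r1 + r2) / 2.0) / n

C = 0.5                # assumed channel capacity in bits per use
r1, r2 = 0.1, 0.8      # one mode below capacity, one above
rate = bimodal_entropy_rate(1000, r1, r2)
print(rate)            # ~0.451 < C, yet no decoder beats error ~1/2
```

Even though the entropy rate sits below capacity, the half of the probability mass living at rate r₂ > C is essentially undecodable, which is why the entropy condition of (20) cannot be matched by any achievability argument.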
IV-A2 Information-stable partitions
Now we move on to our methodology, which, without assuming either that is information stable or that as a function of , yields
(23) 
for some , as necessary to ensure (14). First shown by Han [7, Theorems 3.8.5 & 3.8.6], Equation (23) is not only necessary, but also has a matching sufficient condition (with the usual asymmetry in the sign of the negligible terms). We briefly discuss the sufficiency before completing the example. Observe that there can only be values of such that . We will refer to this set of messages as the transmissible set. It would be simple to construct a reliable channel code for the transmissible set, while mapping all messages not from the transmissible set to some fixed codeword. As a result, if is chosen from the transmissible set, the probability of error would be near , and otherwise near . Hence there exists a coding scheme for which the error probability converges to the probability that is not chosen from the transmissible set. This previous statement is essentially Equation (23).
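The sufficiency sketch above can be made concrete: keep only the messages whose self-information rate is below capacity, map everything else to a fixed codeword, and the achievable error probability approaches the mass outside the transmissible set. A hypothetical instance, with all names and numbers our own:

```python
import math

def transmissible_mass(p_m, n, C, delta):
    """Mass of the 'transmissible set' {m : -(1/n) log2 p(m) <= C - delta};
    one minus this is roughly the best achievable error probability."""
    return sum(p for p in p_m.values() if -math.log2(p) / n <= C - delta)

n, C, delta = 10, 0.5, 0.05
p_m = {}
for i in range(4):                    # 4 likely messages: rate 0.3 <= 0.45
    p_m[('low', i)] = 0.5 / 4
for j in range(4096):                 # 4096 unlikely ones: rate 1.3 > 0.45
    p_m[('high', j)] = 0.5 / 4096
print(transmissible_mass(p_m, n, C, delta))   # 0.5
```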
Returning to establishing the necessary conditions: in general, our methodology looks to directly replace the operational requirement’s probability terms with information-theoretic quantities. Here the operational requirement (Equation (14)) can be written as
(24) 
Next, because constitute a regular collection of DRVs, there exist

a DRV ,

positive real number , and

such that
(25) such that
(26) and either
(27) or
(28) for all ,
where by Corollary 15 (with ). The set is not considered because the random variable is trivially uniform by convention. Introducing into the LHS of (24) via the law of total probability yields
(29) 
Now, let