Most proofs of the sphere packing bound (SPB) have been either for the stationary channels with finite input sets [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] or for the stationary channels with a specific noise structure, e.g. Poisson, Gaussian, [13, 14, 15, 16, 17, 18]. The proofs using Augustin’s method are exceptions to this observation: [19, 20, 21] do not assume either the finiteness of the input set or a specific noise structure; nor do they assume the stationarity of the channel. However, , [20, §31],  establish the SPB for the product channels, rather than the memoryless channels; hence proofs of the SPB for the composition constrained codes¹ on the stationary channels [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] (which include the important special case of the cost constrained ones [15, 16, 17, 18, 13, 14]) are not subsumed by , [20, §31], or . In [20, §36], Augustin proved the SPB for the cost constrained (possibly non-stationary) memoryless channels assuming a bounded cost function. The framework of [20, Thm. 36.6] subsumes all previously considered models [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], except the Gaussian ones [15, 16, 17, 18].
¹ According to [11, p. 183], the SPB for the constant composition codes appears in  with an incomplete proof. The first complete proof of the SPB for the constant composition codes is provided in .
Theorem 2, presented in §III, establishes the SPB for a framework that subsumes all of the models considered in [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16, 17, 18, 13, 14, 19, 20, 21] by employing , which analyzes Augustin’s information measures. Our use of  and Augustin’s information measures is similar to the use of  and Rényi’s information measures in . For the product channels, [21, Thm. LABEL:B-thm:productexponent] improved the previous results by Augustin in , [20, §31] by establishing the SPB with a prefactor that is polynomial in the block length under the hypothesis that the order ½ Rényi capacity of the component channels are . For the cost constrained memoryless channels, Theorem 2 enhances the prefactor of [20, Thm. 36.6] in an analogous way, from to . The prefactor of Theorem 2, however, is inferior to the prefactors reported in [3, 4, 5] for various symmetric channels, in  for the stationary Gaussian channel, and in  for the constant composition codes on the stationary discrete product channels. Determination of the optimal prefactor, in the spirit of [3, 4], remains open for the general case. Similar to [20, Thm. 36.6], Theorem 2 holds for non-stationary channels, as well. Unlike [20, Thm. 36.6], Theorem 2 does not assume the cost functions to be bounded.
Stationarity is assumed in most of the previous derivations of the SPB, [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16, 17, 18, 13, 14]. Given a stationary product channel, one can obtain a stationary memoryless channel by imposing composition (i.e. type, empirical distribution) constraints on the input codewords. The cost constraints can be interpreted as a convex special case of these more general composition constraints. This interpretation, considered together with the composition based expurgations, is one of the main motivating factors behind the study of constant composition codes on the stationary memoryless channels with finite input sets. The composition based expurgations, however, are useful only when the input set of the channel is finite. Nevertheless, if the constraint set for the composition of the codewords is convex, then one can derive an SPB with a polynomial prefactor using Augustin’s information measures, see Theorem 1 in §III. The derivation of Theorem 1 relies on the Augustin center of the constraint set rather than the Augustin mean of the most populous composition of the code. Note that the most populous composition of the code might not even have more than one codeword when the input set is infinite. The framework of Theorem 1 is general enough to subsume the frameworks of all previous proofs of the SPB for the memoryless or product channels that we are aware of, except the frameworks of the proofs based on Augustin’s method [19, 20, 21]. Theorems 1 and 2 provide asymptotic SPBs; but they are proved using non-asymptotic SPBs presented in Lemmas 9 and 10.
The SPB implies that the exponential decay rate of the optimal error probability with the block length (i.e. the reliability function, the error exponent) is bounded from above by the sphere packing exponent (SPE). For the memoryless channels in consideration, Augustin’s variant of Gallager’s bound implies that the SPE bounds the reliability function from below, as well, provided that list decoding is allowed. Augustin’s variant is presented in §II-D. One can use standard results such as [24, 25] with minor modifications in order to establish the SPE as a lower bound to the reliability function for list decoding, as well. Thus Augustin’s variant is of interest to us not because of what it implies about the reliability function, but because of how it implies it. What is special about Augustin’s variant is that it establishes an achievability result in terms of the Augustin information rather than the Rényi information used in the standard form of Gallager’s bound . Augustin’s variant relies on the fixed point property of the Augustin mean described in (13) to do that. It is worth mentioning that  implicitly employs the same fixed point property, but in a different way.
Before starting our discussion in earnest, let us point out a subtlety about the derivations of the SPB that is usually overlooked.  claimed to prove the SPB for arbitrary stationary product channels, without using any constant composition arguments.² The derivation of [26, Thm. 19], however, establishes an upper bound on the reliability function that is strictly greater than the SPE in many channels. This has been demonstrated numerically in [12, p. 1594 and Appendix A]. An analytic confirmation of this observation is presented in Appendix -A. The problematic step in  is the application of Lagrange multiplier techniques, see [12, footnote 8]. The proof of [26, Thm. 19] invokes [26, Thm. 16] that is valid for the Lagrange multiplier associated with the satisfying . For an arbitrary , however, the associated Lagrange multiplier may or may not be equal to the one for the optimal . This is the reason why the upper bound to the reliability function established in [26, Thm. 19] is not equal to the SPE in general, contrary to the claim repeated in [27, Lemma 1] and [28, Thm. 10.1.4]. In a nutshell, the proof of [26, Thm. 19] tacitly asserts a minimax equality that does not hold in general. For stationary memoryless channels with finite input alphabets, one can avoid this issue using the constant composition arguments. However, in that case the proof presented in  becomes a mere reproduction of the one in .
² [26, p. 413] reads “An important feature of the lower bound, which will be derived, is that no assumption of constant-composition codewords is made, not even as an intermediate step.”
More recently,  proposed a derivation of the SPB for stationary channels with a single cost constraint using the approach presented in . Similar to , however, the proof in  asserts a minimax equality that does not hold in general. In particular, it is claimed that does not depend on in [29, (26)]. In order to assert that, one has to include a supremum over as the innermost optimization in both [29, (25) and (26)]. With the additional supremum, the explanation provided on [29, p. 931] is no longer valid. Considering Appendix -A, we do not believe that the proof in  can be salvaged without introducing major new ideas, such as composition based expurgations similar to  or codeword cost based expurgations similar to . In short, neither  nor  successfully proved the SPB for stationary memoryless channels, even for the finite input set case.
I-A Notational Conventions
For any two vectors and in their inner product, denoted by , is . For any , dimensional vector whose entries are all one is denoted by ; the dimension will be clear from the context. We denote the closure, interior, and convex hull of a set by , , and , respectively; the relevant topology or vector space structure will be evident from the context.
For any set , we denote the set of all probability mass functions that are non-zero only on finitely many members of by . For any , we call the set of all ’s in for which the support of and denote it by . For any measurable space , we denote the set of all probability measures on it by and set of all finite measures by . We denote the integral of a measurable function with respect to the measure by or . If the integral is on the real line and if it is with respect to the Lebesgue measure, we denote it by or , as well. If is a probability measure, then we also call the integral of with respect the expectation of or the expected value of and denote it by or .
Our notation will be overloaded for certain symbols; however, the relations represented by these symbols will be clear from the context. We denote the Cartesian product of sets [30, p. 38] by . We use to denote the absolute value of real numbers and the size of sets. The sign stands for the usual less than or equal to relation for real numbers and the corresponding point-wise inequality for functions and vectors. For two measures and on the measurable space , iff for all . We denote the product of topologies [30, p. 38], -algebras [30, p. 118], and measures [30, Thm. 4.4.4] by . We use the shorthand for the Cartesian product of sets and for the product of the -algebras .
I-B Channel Model
A channel is a function from the input set to the set of all probability measures on the output space :
is called the output set and is called the -algebra of the output events. We denote the set of all channels from the input set to the output space by . For any and , is the probability measure whose marginal on is and whose conditional distribution given is . The structure described in (1) is not sufficient on its own to ensure the existence of a unique with the desired properties for all , in general. The existence of a unique is guaranteed for all , if is a transition probability from to , i.e. a member of rather than .
A channel is called a discrete channel if both and are finite sets. For any and channels for , the length product channel is defined via the following relation:
A channel is called a length memoryless channel iff there exists a product channel satisfying both for all and . A product channel is stationary iff for all for some . For such a channel, we denote the composition (i.e. the empirical distribution, type) of each by , where .
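For a finite input alphabet, the composition of an input sequence can be computed directly. The following Python sketch is illustrative only; the function name `composition` and the use of exact fractions are our assumptions, not notation from the text.

```python
from collections import Counter
from fractions import Fraction


def composition(codeword):
    """Empirical distribution (type) of a finite input sequence.

    Returns a dict mapping each symbol that occurs in the sequence to its
    relative frequency, kept as an exact fraction.
    """
    n = len(codeword)
    if n == 0:
        raise ValueError("the codeword must have positive length")
    counts = Counter(codeword)
    return {symbol: Fraction(count, n) for symbol, count in counts.items()}


# Example: a length-5 sequence over the alphabet {a, b, c}.
example = composition("aabac")
print(example)  # a occurs with frequency 3/5, b and c with frequency 1/5 each
```

Two codewords have the same composition exactly when the returned dictionaries are equal, which is the property used when a code is restricted to a set of allowed compositions.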
For any , an dimensional cost function is a function from the input set to that is bounded from below, i.e. that is of the form for some . We assume without loss of generality that³
³ Augustin [20, §33] has an additional hypothesis, , which excludes certain important cases such as the Gaussian channels.
We denote the set of all cost constraints that can be satisfied by some member of by and the set of all cost constraints that can be satisfied by some member of by :
Then both and have non-empty interiors and is the convex hull of , i.e. .
A cost function on a product channel is said to be additive iff it can be written as the sum of cost functions defined on the component channels. Given and for , we denote the resulting additive cost function on for the channel by , i.e.
I-C Codes With List Decoding
The pair is an channel code on iff
The encoding function is a function from the message set to the input set .
The decoding function is a measurable function from the output space to the set .
Given an channel code on , the conditional error probability for and the average error probability are defined as
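For a discrete channel these quantities reduce to finite sums. The sketch below assumes the usual convention that an error occurs for a message when the decoded list does not contain it, and that the average is taken uniformly over the message set; the data layout (nested dictionaries for the channel, a dictionary for the encoder, a set-valued decoder) and all function names are hypothetical.

```python
def conditional_error_probability(channel, encoder, decoder, message):
    """Probability that the decoded list omits `message` when it is sent.

    channel[x][y] is the probability of output y given input x,
    encoder[message] is the transmitted input, and decoder(y) returns
    the set (list) of messages declared for output y.
    """
    x = encoder[message]
    return sum(prob for y, prob in channel[x].items() if message not in decoder(y))


def average_error_probability(channel, encoder, decoder):
    """Uniform average of the conditional error probabilities."""
    messages = list(encoder)
    total = sum(
        conditional_error_probability(channel, encoder, decoder, m) for m in messages
    )
    return total / len(messages)


# Example: binary symmetric channel with crossover probability 0.1.
bsc = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
identity_encoder = {0: 0, 1: 1}

# List size one: the decoder declares the received symbol.
print(average_error_probability(bsc, identity_encoder, lambda y: {y}))
# List size two: every message is always on the list, so no error occurs.
print(average_error_probability(bsc, identity_encoder, lambda y: {0, 1}))
```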
An encoding function , hence the corresponding code, is said to satisfy the cost constraint iff . An encoding function , hence the corresponding code, on a stationary product channel is said to satisfy an empirical distribution constraint iff the compositions of all of the codewords are in , i.e. iff for all .
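The additive cost of a codeword and the resulting constraint check can be sketched as follows. We assume here that the cost constraint is imposed on every codeword and compared component-wise, which is our reading of the definition; the function names and the energy cost used in the example are hypothetical.

```python
def additive_cost(component_costs, codeword):
    """Component-wise sum of the cost vectors of the symbols of a codeword.

    component_costs[t] maps the t-th input symbol to a sequence of real
    numbers (one entry per cost dimension).
    """
    dimension = len(component_costs[0](codeword[0]))
    total = [0.0] * dimension
    for cost_t, x_t in zip(component_costs, codeword):
        for i, value in enumerate(cost_t(x_t)):
            total[i] += value
    return total


def satisfies_cost_constraint(component_costs, codewords, constraint):
    """True iff every codeword has additive cost at most `constraint`.

    Assumption: the constraint applies to each codeword separately and the
    comparison is component-wise.
    """
    return all(
        all(c <= bound for c, bound in zip(additive_cost(component_costs, cw), constraint))
        for cw in codewords
    )


# Example: one-dimensional energy cost on a length-3 product channel.
energy = [lambda x: [x * x]] * 3
codebook = [(1, -1, 2), (0, 1, 1)]
print(additive_cost(energy, codebook[0]))                  # [6.0]
print(satisfies_cost_constraint(energy, codebook, [6.0]))  # True
print(satisfies_cost_constraint(energy, codebook, [5.0]))  # False
```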
The Rényi divergence, tilting, and Augustin’s information measures are central to the analysis we present in the following sections. We introduce these concepts in §II-A and §II-B; a more detailed discussion can be found in [22, 31]. In §II-C we define the SPE and derive widely known properties of it for our general channel model. In §II-D we derive Augustin’s variant of Gallager’s bound.
II-A The Rényi Divergence and Tilting
For any and , the order Rényi divergence between and is
where is any measure satisfying and .
For properties of the Rényi divergence, throughout the manuscript, we will refer to the comprehensive study provided by van Erven and Harremoës. Note that the order one Rényi divergence is the Kullback-Leibler divergence. For other orders, the Rényi divergence can be characterized in terms of the Kullback-Leibler divergence, as well, see [31, Thm. 30]. That characterization is related to another key concept for our analysis: the tilted probability measure.
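For probability mass functions on a finite set, the Rényi divergence takes its standard form: the logarithm of the sum of w^α q^(1-α), divided by α-1, with the Kullback-Leibler divergence at order one. A minimal numerical sketch with natural logarithms follows; the function name and the NumPy representation are our assumptions.

```python
import numpy as np


def renyi_divergence(alpha, w, q):
    """Order-alpha Renyi divergence (in nats) between pmfs w and q, alpha > 0."""
    w = np.asarray(w, dtype=float)
    q = np.asarray(q, dtype=float)
    if alpha == 1.0:
        # Order one: Kullback-Leibler divergence.
        support = w > 0
        if np.any(q[support] == 0):
            return np.inf
        return float(np.sum(w[support] * np.log(w[support] / q[support])))
    if alpha > 1.0 and np.any((w > 0) & (q == 0)):
        # w is not absolutely continuous with respect to q.
        return np.inf
    mask = (w > 0) & (q > 0)
    total = np.sum(w[mask] ** alpha * q[mask] ** (1.0 - alpha))
    if total == 0.0:
        # w and q are mutually singular.
        return np.inf
    return float(np.log(total) / (alpha - 1.0))


w = [0.5, 0.5]
q = [0.9, 0.1]
print(renyi_divergence(0.5, w, q))  # equals ln(5/4), about 0.2231
print(renyi_divergence(1.0, w, q))  # Kullback-Leibler divergence
```

The divergence is nondecreasing in the order, so the order one half value printed first does not exceed the Kullback-Leibler value printed second.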
For any and satisfying , the order tilted probability measure is
The conditional Rényi divergence and the tilted channel are straightforward generalizations of the Rényi divergence and the tilted probability measure that will allow us to express certain relations succinctly throughout our analysis.
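On a finite set, the tilted probability measure is proportional to w^α q^(1-α), and the normalizing factor is determined by the order α Rényi divergence between w and q. The sketch below assumes an order in (0,1) and pmfs given as arrays; the function name is hypothetical.

```python
import numpy as np


def tilted_pmf(alpha, w, q):
    """Order-alpha tilted pmf between w and q, for alpha in (0, 1).

    The result is proportional to w**alpha * q**(1 - alpha); it is defined
    only when w and q are not mutually singular.
    """
    w = np.asarray(w, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = (w > 0) & (q > 0)
    unnormalized = np.zeros_like(w)
    unnormalized[mask] = w[mask] ** alpha * q[mask] ** (1.0 - alpha)
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("w and q are mutually singular; the tilted pmf is undefined")
    return unnormalized / total


print(tilted_pmf(0.5, [0.5, 0.5], [0.9, 0.1]))  # approximately [0.75, 0.25]
```

As the order increases towards one the tilted pmf moves from q towards w, which is the sense in which tilting interpolates between the two measures.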
For any , , , and the order conditional Rényi divergence for the input distribution is
If such that for all , then we denote by .
For any , and , the order tilted channel is a function from to given by
If such that for all , then we denote by .
For any , , and , the order Augustin operator for the input distribution , i.e. , is given by
where and the tilted channel is defined in (9).
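For a discrete channel stored as a row-stochastic matrix, the conditional Rényi divergence, the tilted channel, and the Augustin operator can be sketched as below. We take the conditional divergence to be the p-average of the row-wise divergences and the Augustin operator to be the p-average of the rows of the tilted channel, which is our reading of the definitions. The sketch assumes an order in (0,1) and that no row of the channel is singular to q; all function names are hypothetical.

```python
import numpy as np


def renyi_divergence(alpha, w, q):
    """Order-alpha Renyi divergence (in nats) between pmfs, alpha in (0, 1)."""
    total = np.sum(w ** alpha * q ** (1.0 - alpha))
    return np.inf if total == 0.0 else float(np.log(total) / (alpha - 1.0))


def conditional_renyi_divergence(alpha, channel, q, p):
    """p-average of the divergences between the rows of `channel` and q."""
    return float(
        sum(p[x] * renyi_divergence(alpha, channel[x], q)
            for x in range(len(p)) if p[x] > 0)
    )


def tilted_channel(alpha, channel, q):
    """Row x is the order-alpha tilted pmf between channel[x] and q."""
    unnormalized = channel ** alpha * q ** (1.0 - alpha)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)


def augustin_operator(alpha, channel, p, q):
    """Output pmf obtained by averaging the tilted channel over p."""
    return p @ tilted_channel(alpha, channel, q)


W = np.array([[0.8, 0.2], [0.3, 0.7]])  # rows are output pmfs
p = np.array([0.4, 0.6])                # input distribution
q = p @ W                               # output distribution induced by p
print(conditional_renyi_divergence(0.5, W, q, p))
print(augustin_operator(0.5, W, p, q))
```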
II-B Augustin’s Information Measures
For any , , and the order Augustin information for the input distribution is
The infimum in (11) is achieved by a unique probability measure denoted by and called the order Augustin mean for the input distribution . Furthermore, the order Augustin mean satisfies the following identities:
These observations are established in [22, LemmaLABEL:C-lem:information-(LABEL:C-information:one,LABEL:C-information:zto,LABEL:C-information:oti)]; previously they were reported by Augustin [20, Lemma 34.2] for orders less than one. Throughout the manuscript, we refer to  for propositions about Augustin’s information measures. A more detailed account of the previous work on Augustin’s information measures can be found in , as well.
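The fixed point property suggests a simple numerical procedure for discrete channels: apply the Augustin operator repeatedly, starting from the output distribution induced by p. The sketch below assumes an order in (0,1), where this iteration is expected to converge, and a channel matrix with strictly positive entries; the tolerance, the iteration cap, and the function names are our choices rather than anything prescribed by the text.

```python
import numpy as np


def renyi_divergence(alpha, w, q):
    """Order-alpha Renyi divergence (in nats) between pmfs, alpha in (0, 1)."""
    total = np.sum(w ** alpha * q ** (1.0 - alpha))
    return np.inf if total == 0.0 else float(np.log(total) / (alpha - 1.0))


def augustin_operator(alpha, channel, p, q):
    """p-average of the order-alpha tilted channel between `channel` and q."""
    tilted = channel ** alpha * q ** (1.0 - alpha)
    tilted = tilted / tilted.sum(axis=1, keepdims=True)
    return p @ tilted


def augustin_mean(alpha, channel, p, tol=1e-12, max_iter=10_000):
    """Approximate fixed point of the Augustin operator by direct iteration."""
    q = p @ channel  # order-one mean: the output distribution induced by p
    for _ in range(max_iter):
        q_next = augustin_operator(alpha, channel, p, q)
        if np.abs(q_next - q).sum() < tol:
            return q_next
        q = q_next
    return q


def augustin_information(alpha, channel, p):
    """Conditional Renyi divergence evaluated at the (approximate) mean."""
    q = augustin_mean(alpha, channel, p)
    return float(
        sum(p[x] * renyi_divergence(alpha, channel[x], q) for x in range(len(p)))
    )


W = np.array([[0.8, 0.2], [0.3, 0.7]])
p = np.array([0.4, 0.6])
q_mean = augustin_mean(0.5, W, p)
print(q_mean, augustin_information(0.5, W, p))
```

Since the Augustin information is an infimum over output distributions, the value returned here should not exceed the conditional Rényi divergence evaluated at any other output pmf, which gives a cheap sanity check on the iteration.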
For any , , and , the order Augustin capacity of for the constraint set is
When the constraint set is the whole , we denote the order Augustin capacity by , i.e. .
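For a binary-input discrete channel, the unconstrained Augustin capacity can be approximated by a grid search over input distributions. The sketch below is a numerical illustration under several assumptions: an order in (0,1), a channel matrix with strictly positive entries, a fixed grid of input distributions, and a fixed-point iteration for the Augustin mean; every function name is hypothetical. For a binary symmetric channel the result can be compared with the value at the uniform input, which is optimal by symmetry and by the concavity of the Augustin information in the input distribution.

```python
import math

import numpy as np


def renyi_divergence(alpha, w, q):
    """Order-alpha Renyi divergence (in nats) between pmfs, alpha in (0, 1)."""
    total = np.sum(w ** alpha * q ** (1.0 - alpha))
    return np.inf if total == 0.0 else float(np.log(total) / (alpha - 1.0))


def augustin_information(alpha, channel, p, tol=1e-12, max_iter=10_000):
    """Augustin information via fixed-point iteration for the Augustin mean."""
    q = p @ channel
    for _ in range(max_iter):
        tilted = channel ** alpha * q ** (1.0 - alpha)
        tilted = tilted / tilted.sum(axis=1, keepdims=True)
        q_next = p @ tilted
        converged = np.abs(q_next - q).sum() < tol
        q = q_next
        if converged:
            break
    return float(
        sum(p[x] * renyi_divergence(alpha, channel[x], q) for x in range(len(p)))
    )


def augustin_capacity_binary_input(alpha, channel, grid_size=99):
    """Grid-search approximation of the supremum over binary input pmfs."""
    best = 0.0
    for t in np.linspace(0.01, 0.99, grid_size):
        p = np.array([t, 1.0 - t])
        best = max(best, augustin_information(alpha, channel, p))
    return best


delta = 0.1
bsc = np.array([[1 - delta, delta], [delta, 1 - delta]])
alpha = 0.5
numeric = augustin_capacity_binary_input(alpha, bsc)
# Value at the uniform input for the binary symmetric channel.
closed_form = math.log(2) + math.log((1 - delta) ** alpha + delta ** alpha) / (alpha - 1)
print(numeric, closed_form)
```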
Using the definitions of the Augustin information and capacity we get the following expression for
If is convex then the order of the supremum and the infimum can be changed as a result of [22, Thm. LABEL:C-thm:minimax]:
If in addition is finite, then [22, Thm. LABEL:C-thm:minimax] implies that there exists a unique probability measure , called the order Augustin center of for the constraint set , satisfying
We denote the set of all probability mass functions satisfying a cost constraint by , i.e.
For the constraint sets defined through cost constraints we use the symbol rather than with a slight abuse of notation. In order to be able to apply convex conjugation techniques without any significant modifications, we extend the definition of the Augustin capacity to the infeasible cost constraints, i.e. ’s outside , as follows:
In order to characterize through convex conjugation techniques, we first define Augustin-Legendre (A-L) information and capacity. These concepts were first introduced in [1, §III-A] and [22, §LABEL:C-sec:cost-AL], as an extension of the analogous concepts in [11, Ch. 8].
For any , channel of the form with a cost function , , and , the order Augustin-Legendre information for the input distribution and the Lagrange multiplier is
For any , channel of the form with a cost function , and the order Augustin-Legendre (A-L) capacity for the Lagrange multiplier is
Except for certain sign changes, is the convex conjugate of because of an analogous relation between and , see [22, (LABEL:C-eq:information-constrained)-(LABEL:C-eq:Linformation-conjugate), (LABEL:C-eq:Lcapacity-astheconjugate)].
Then can be expressed in terms of at least for the interior points of :
Furthermore, there exists a non-empty convex, compact set of ’s satisfying provided that is finite, by [22, Lemma LABEL:C-lem:Lcapacity].
On the other hand, using the definitions of , , and we get the following expression for .
satisfies a minimax relation similar to the one given in (16), see [22, Thm. LABEL:C-thm:Lminimax]. That minimax relation, however, is best understood via the concept of Augustin-Legendre radius defined in the following.
For any , channel with a cost function , and , the order Augustin-Legendre radius of for the Lagrange multiplier is
Then as a result of [22, Thm. LABEL:C-thm:Lminimax], for any , with , and we have
If in addition is finite, then there exists a unique , called the order Augustin-Legendre center of for the Lagrange multiplier , satisfying
The A-L information measures are defined through a standard application of the convex conjugation techniques. However, starting with [24, Thms. 8 and 10] (i.e. the cost constrained variants of Gallager’s bound) the Rényi-Gallager (R-G) information measures rather than the A-L information measures have been the customary tools for applying convex conjugation techniques in the error exponent calculations, see for example [16, 17, 18]. A brief discussion of the R-G information measures can be found in Appendix -B; for a more detailed discussion see .
II-C The Sphere Packing Exponent
For any , , and , the SPE is
We denote case by . Furthermore, with a slight abuse of notation, we denote case by and case by .
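Numerically, the SPE can be approximated once the capacity is available as a function of the order. The sketch below assumes the standard form of the exponent, namely the supremum over orders α in (0,1) of (1-α)/α times the gap between the order α capacity and the rate, and replaces the supremum by a maximum over a grid. The binary symmetric channel capacity used in the example is the value at the uniform input, which is optimal by symmetry; the function names and the grid size are hypothetical.

```python
import math


def bsc_augustin_capacity(alpha, delta):
    """Order-alpha capacity (in nats) of a binary symmetric channel.

    Evaluated at the uniform input distribution, for alpha in (0, 1) and
    crossover probability delta in (0, 1).
    """
    return math.log(2) + math.log((1 - delta) ** alpha + delta ** alpha) / (alpha - 1)


def sphere_packing_exponent(rate, capacity_of_order, grid_size=2000):
    """Grid approximation of sup over alpha in (0,1) of (1-alpha)/alpha * (C_alpha - rate).

    The running maximum starts at zero because the expression tends to zero
    as alpha tends to one whenever the capacity stays bounded.
    """
    best = 0.0
    for k in range(1, grid_size):
        alpha = k / grid_size
        value = (1 - alpha) / alpha * (capacity_of_order(alpha) - rate)
        best = max(best, value)
    return best


def capacity(alpha):
    return bsc_augustin_capacity(alpha, 0.1)


for rate in (0.1, 0.3, 0.4):
    print(rate, sphere_packing_exponent(rate, capacity))
```

For crossover probability 0.1 the order one capacity is roughly 0.368 nats, so the exponent is positive at rates 0.1 and 0.3 and zero at rate 0.4, in line with the monotonicity in the rate.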
For any , , is nonincreasing and convex in on , finite on , and continuous on where . In particular,
Lemma 1 follows from the continuity and the monotonicity properties of established in [22, Lemma LABEL:C-lem:capacityO]; a proof can be found in Appendix -C. The proof of Lemma 1 is analogous to that of [21, Lemma LABEL:B-lem:spherepackingexponent], which relies on [21, Lemma LABEL:B-lem:capacityO] instead of [22, Lemma LABEL:C-lem:capacityO].
One can express in terms of , using the definitions of , , and :
Lemma 1 holds for by definition, but it can be strengthened significantly for ’s in