I. Introduction
Communication over channels with feedback has been a longstanding problem in the information theory literature. Early works on discrete memoryless channels (DMCs) pointed to a negative answer as to whether feedback can increase the capacity [1]. Feedback, though, improves the channel's error exponent, the maximum attainable exponential rate of decay of the error probability. The improvements are obtained using variable-length codes (VLCs), where the communication length depends on the channel's realizations. In a seminal work, Burnashev [2] completely characterized the error exponent of a DMC with noiseless and causal feedback. This characterization has a simple, yet intuitive, form:
E(R) = C_1 (1 - R/C),  (1)
where R is the (average) rate of transmission, C is the capacity of the channel, and C_1 is the maximum exponent for binary hypothesis testing over the channel; it equals the maximal relative entropy between conditional output distributions. Burnashev's exponent can significantly exceed the sphere-packing exponent for communication without feedback, as it approaches capacity with nonzero slope. The use of VLCs is shown to be essential to establish these results, as no improvement is gained using fixed-length codes [3, 4, 5].
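As a concrete illustration of (1), the following sketch (our own illustrative snippet, assuming a binary symmetric channel BSC(p), which is not an example worked out above) computes the capacity C, the hypothesis-testing exponent C_1, and the resulting straight-line exponent E(R):

```python
import math

def kl(p, q):
    """Binary KL divergence D(Ber(p) || Ber(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bsc_capacity(p):
    """Capacity of a BSC(p) in nats: log 2 minus the binary entropy of p."""
    hb = -p * math.log(p) - (1 - p) * math.log(1 - p)
    return math.log(2) - hb

def burnashev_exponent(R, p):
    """E(R) = C_1 (1 - R/C) for a BSC(p); here C_1 = D(Ber(p) || Ber(1-p)),
    the maximal relative entropy between the two conditional output laws."""
    C = bsc_capacity(p)
    C1 = kl(p, 1 - p)
    return C1 * (1 - R / C)

p = 0.1
C = bsc_capacity(p)
print(C, burnashev_exponent(0.5 * C, p))
```

Note how E(R) hits zero exactly at R = C with a nonzero slope, unlike the sphere-packing exponent, which approaches capacity with zero slope.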
This result led to the question as to whether feedback improves the capacity or error exponent of more general channels, modeling nontraditional communications involving memory and intersymbol interference (ISI). Among such models are channels with states, where the transition probability of the channel varies depending on its state, which itself evolves based on the past inputs and state realizations. Depending on the variant of this formulation, the agents may have no knowledge of the state (e.g., arbitrarily varying channels) or may know the state exactly [6]. When the state is known at the transmitter and the receiver, feedback can improve the error exponent. In particular, Como et al. [7] extended the Burnashev-type exponent to finite-state ergodic Markov channels with known state and derived a form similar to (1), under some ergodicity assumptions. The error exponent for channels with more general state evolution is still unknown; only the feedback capacity of such channels when restricted to fixed-length codes is known [8].
This paper studies the feedback error exponent for channels with more general state evolution, allowing VLCs. More precisely, we study discrete channels with states, where the state evolves as an arbitrary stochastic process (not necessarily ergodic or Markov) depending on the past realizations. Furthermore, the realizations of the state are assumed to be unknown, but the transmitter or the receiver may know the underlying probability distribution governing the evolution of the state. However, the noiseless output is available at the transmitter with one unit of delay. The main contributions are twofold. First, we prove an upper bound on the error exponent of such channels which has the familiar form
where the directed relative entropy plays the role of the hypothesis-testing exponent, the directed mutual information plays the role of the capacity, and the optimization is over a collection of “feasible” probability distributions. As a special case, the bound simplifies to Burnashev's expression when the channel is a DMC. Second, we introduce an upper bound on the feedback capacity of VLCs for communication over these channels with stochastic states. This upper bound generalizes the results of Tatikonda and Mitter [8] and Permuter et al. [9], where fixed-length codes are studied. Our approach relies on analyzing the stochastic process defined by the entropy of the message given the past channel outputs. We analyze the drift of this entropy via tools from the theory of martingales.
Related works on the capacity and error exponent of channels with feedback are extensive. Starting with DMCs with feedback, Yamamoto and Itoh [10] introduced a two-phase iterative scheme for achieving the Burnashev exponent. The error exponent of DMCs with feedback and cost constraints is studied in [11]. Channels with state and feedback have also been studied under various frameworks, depending on the evolution model of the states and whether they are known at the transmitter or the receiver. On one extreme of such models are arbitrarily varying channels [12]; the feedback capacity of these channels for fixed-length codes is derived in [8]. Tchamkerten and Telatar [13] studied the universality of the Burnashev error exponent. They considered communication setups where the parties have no exact knowledge of the statistics of the channel but know that it belongs to a certain class of DMCs. The authors proved that no zero-rate coding scheme achieves the Burnashev exponent simultaneously for all the DMCs in the class. However, they showed positive results for two families of such channels (e.g., binary symmetric and Z) [14]. Another class of channels with state are Markov channels, which have been studied extensively for deriving their capacity [6, 15, 16] and error exponent using fixed-length codes [8]. A lower bound on the error exponent of unifilar channels, where the state is a deterministic function of the previous ones, is derived in [17]. Other variants of this problem have been studied, including continuous-alphabet channels [18, 19] and multiuser channels [20, 21].
II. Problem Formulation and Definitions
The formal definitions are presented in this section. For shorthand, we use x^n to denote the sequence (x_1, x_2, ..., x_n).
A discrete channel with stochastic state is specified by three finite sets X, Y, and S, representing the input, output, and state alphabets of the channel, respectively. Consider a collection of channels {P(y|x,s)}, indexed by the state s in S, where each element is the transition probability of the channel at state s. The states S_t, t ≥ 1, evolve according to a conditional probability distribution P(s_{t+1} | s^t, x^t) depending on the past inputs and state realizations. As a result, after n uses of the channel with x^n, s^n, and y^n being the channel's inputs, states, and outputs, the next output Y_{n+1} is generated according to the transition probability of the channel at state s_{n+1} with input x_{n+1}. Such evolution of the states induces memory over time, as it depends on the past inputs.
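To make the state-evolution model concrete, here is a toy simulation (a hypothetical two-state channel with made-up transition values, not an example from this paper) in which each state has its own crossover probability and the next state depends on the current state and input, so the channel acquires memory through its inputs:

```python
import random

# Hypothetical two-state channel: each state s has its own crossover
# probability, and the next state depends on (current state, current input).
CROSSOVER = {0: 0.05, 1: 0.4}          # P(Y != X | S = s)
# P(S_{t+1} = 1 | S_t = s, X_t = x): input-dependent state evolution
NEXT_STATE_P1 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.9}

def step(state, x, rng):
    """One channel use: emit a binary output, then evolve the state."""
    y = x ^ (rng.random() < CROSSOVER[state])
    next_state = int(rng.random() < NEXT_STATE_P1[(state, x)])
    return y, next_state

rng = random.Random(0)
state, ys = 0, []
for x in [0, 1, 1, 0, 1]:
    y, state = step(state, x, rng)
    ys.append(y)
print(ys, state)
```

Because NEXT_STATE_P1 depends on x, past inputs influence future channel behavior, which is exactly the memory mechanism described above.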
After each use of the channel, the output of the channel is available at the transmitter with one unit of delay. Moreover, we allow VLCs for communication, where neither the transmitter nor the receiver knows the state of the channel. More precisely, the setup is defined as follows.
Definition 1.
A VLC for communication over a channel with states and feedback is defined by

Encoding functions

Decoding functions

A stopping time τ with respect to (w.r.t.) the filtration {F_t}, where F_t is the σ-algebra generated by Y^t, for t ≥ 1. Furthermore, it is assumed that τ is almost surely bounded.
For technical reasons, we study a class of VLCs whose parameters grow subexponentially with the communication length, controlled by some fixed number. An example is a sequence of VLCs with fixed parameters.
In what follows, for any VLC, we define the average rate, error probability, and error exponent. Given a message W, the t-th output of the transmitter is a function of W and of the noiseless feedback Y^{t-1} received up to time t. Let Ŵ represent the estimate of the decoder about the message. Then, at the end of the stopping time τ, the decoder declares Ŵ as the decoded message. The average rate and (average) probability of error of a VLC are defined as R = log M / E[τ] and P_e = P(Ŵ ≠ W), where M is the number of messages.
Definition 2.
A rate R is achievable for a given channel with stochastic states if there exists a sequence of VLCs whose average rate approaches R and whose probability of error vanishes asymptotically. The feedback capacity is the convex closure of all achievable rates.
Naturally, the error exponent of a VLC with probability of error P_e and stopping time τ is defined as −log P_e / E[τ]. The following definition formalizes this notion.
Definition 3.
An error-exponent function E(·) is said to be achievable for a given channel if, for any rate R, there exists a sequence of VLCs whose average rate approaches R and whose probability of error decays with exponent at least E(R). The reliability function is the supremum of all achievable error-exponent functions.
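The exponent in Definition 3 can be evaluated numerically from an error probability and an expected stopping time; the snippet below (with purely illustrative numbers, using the natural-logarithm convention) does exactly this:

```python
import math

def error_exponent(p_e, expected_tau):
    """Empirical exponent -log(P_e) / E[tau] of a VLC (natural log)."""
    return -math.log(p_e) / expected_tau

# e.g. an error probability of 1e-6 achieved with an average
# decision time of 100 channel uses (illustrative values)
print(error_exponent(1e-6, 100.0))
```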
III. Main Results
We start by deriving an upper bound on the feedback capacity of channels with stochastic states when VLCs are allowed. The expressions are based on the directed information, introduced in [22] and defined as
I(X^n → Y^n) = Σ_{t=1}^{n} I(X^t; Y_t | Y^{t-1}).  (2)
We further extend this notion to variable-length sequences. Consider a stochastic process (X_t, Y_t), t ≥ 1, and let τ be a (bounded) stopping time w.r.t. an induced filtration. Then, the directed mutual information is defined as
I(X^τ → Y^τ) = E[ Σ_{t=1}^{τ} log ( P(Y_t | X^t, Y^{t-1}) / P(Y_t | Y^{t-1}) ) ].  (3)
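As a sanity check of definition (2) for a fixed horizon, the following sketch computes I(X^2 → Y^2) = I(X_1; Y_1) + I(X^2; Y_2 | Y_1) by brute-force enumeration for a hypothetical length-2 feedback policy over a BSC (all numerical values and the policy itself are our own illustrative assumptions):

```python
import itertools, math

def bsc(y, x, p):
    """BSC(p) transition probability P(y | x)."""
    return p if y != x else 1 - p

def joint_pmf(p=0.1):
    """Joint pmf of (X1, Y1, X2, Y2) for a length-2 feedback policy over a
    BSC(p). Policy (illustrative): X1 ~ Ber(1/2); X2 = X1 if Y1 disagreed
    with X1 (retransmit), else X2 = 1 - X1."""
    pmf = {}
    for x1, y1, y2 in itertools.product((0, 1), repeat=3):
        x2 = x1 if y1 != x1 else 1 - x1   # hypothetical feedback rule
        pmf[(x1, y1, x2, y2)] = 0.5 * bsc(y1, x1, p) * bsc(y2, x2, p)
    return pmf

def directed_information(pmf):
    """I(X^2 -> Y^2) = E[ sum_t log P(Y_t|X^t,Y^{t-1}) / P(Y_t|Y^{t-1}) ]."""
    def marg(keep):
        m = {}
        for k, v in pmf.items():
            kk = tuple(k[i] for i in keep)
            m[kk] = m.get(kk, 0.0) + v
        return m
    # index order in each key: (x1, y1, x2, y2)
    p_x1y1, p_x1, p_y1 = marg((0, 1)), marg((0,)), marg((1,))
    p_x1y1x2, p_y1y2 = marg((0, 1, 2)), marg((1, 3))
    total = 0.0
    for k, v in pmf.items():
        x1, y1, x2, y2 = k
        t1 = (p_x1y1[(x1, y1)] / p_x1[(x1,)]) / p_y1[(y1,)]
        t2 = (v / p_x1y1x2[(x1, y1, x2)]) / (p_y1y2[(y1, y2)] / p_y1[(y1,)])
        total += v * (math.log(t1) + math.log(t2))
    return total

print(directed_information(joint_pmf(0.1)))
```

For p = 1/2 the channel carries no information and the directed information vanishes, which is a useful consistency check on the computation.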
Now, we are ready to state an upper bound on the feedback capacity. For any integer n, let P_n be the set of all input distributions on X^n that factor as
P(x^n | y^{n-1}) = Π_{t=1}^{n} P(x_t | x^{t-1}, y^{t-1}).  (4)
Next, we have the following result on the capacity, with the proof given in Appendix A.
Theorem 1.
The feedback capacity of a channel with stochastic states is bounded as
where τ is a stopping time with respect to the output filtration.
Observe that for a trivial (deterministic) stopping time, the bound simplifies to that for fixed-length codes as given in [8].
III-A. Upper Bound on the Error Exponent
We need some notation to proceed. Consider a pair of random sequences (X_t, Y_t). Let X̂_t denote the MAP estimate of the input from the available observations, that is, the symbol maximizing the posterior probability. Also, consider the effective channel (averaged over possible states) from the transmitter's perspective at time t. With this notation, we define the directed KL-divergence as
Intuitively, this quantity measures the sum of the expected “distance” between the channel's output distribution conditioned on the MAP symbol and that conditioned on the worst symbol, across different times t.
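The flavor of this quantity at a single time step can be sketched numerically. The snippet below is an illustration under our own reading (with made-up numbers; the paper's exact definition involves the effective channel and stopping times): it takes the MAP input under a transmitter-side belief and measures the KL divergence from its output row to that of the symbol hardest to distinguish from it:

```python
import math

def kl(P, Q):
    """KL divergence between two distributions given as lists (nats)."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

# Hypothetical effective channel at some time t: rows indexed by input symbol.
W = {
    'a': [0.7, 0.2, 0.1],
    'b': [0.1, 0.8, 0.1],
    'c': [0.3, 0.3, 0.4],
}
posterior = {'a': 0.6, 'b': 0.3, 'c': 0.1}   # transmitter-side belief

x_map = max(posterior, key=posterior.get)    # MAP input symbol
# "Worst" symbol read here as the one hardest to distinguish from the
# MAP symbol, i.e. with the smallest divergence from the MAP row:
x_worst = min((x for x in W if x != x_map), key=lambda x: kl(W[x_map], W[x]))
print(x_map, x_worst, kl(W[x_map], W[x_worst]))
```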
Theorem 2.
The error exponent of a channel with stochastic states is bounded as
where the optimization variables are stopping times, and
In the next section, we present our proof techniques.
IV. Proof of Theorem 2
The proof follows from a careful study of the drift of the entropy of the message conditioned on the channel outputs at each time t. Define the following random process:
H_t ≜ H(W | F_t),  (5)
where F_t is the σ-algebra generated by Y^t. We show that the entropy process drifts in three phases: (i) a linear drift (data phase) until it reaches a small value; (ii) a fluctuation phase, with values around that small threshold; and (iii) a logarithmic drift (hypothesis-testing phase) until the end. We derive bounds on the expected slope of the drifts and prove that the length of the fluctuation phase is asymptotically negligible compared to the overall communication length (Fig. 1).
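The three-phase behavior can be observed in simulation. The toy scheme below is our own illustrative construction, not the paper's coding scheme: a bisection-style transmitter over a BSC with noiseless feedback, where we track the posterior entropy of the message after each channel use. The entropy falls roughly linearly at first and then slows down as the posterior concentrates on one message.

```python
import math, random

def entropy(p):
    """Entropy (nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def simulate(M=64, p=0.05, steps=60, seed=1):
    """Track H(W | Y^t) for a toy bisection-with-feedback scheme over a
    BSC(p). The transmitter knows the receiver's posterior (via noiseless
    feedback) and signals whether W lies in the half with the most mass."""
    rng = random.Random(seed)
    W = rng.randrange(M)
    post = [1.0 / M] * M
    traj = [entropy(post)]
    for _ in range(steps):
        order = sorted(range(M), key=lambda m: -post[m])
        top = set(order[:M // 2])          # half with largest posterior mass
        x = 1 if W in top else 0
        y = x ^ (rng.random() < p)         # BSC output, fed back noiselessly
        # Bayes update of the posterior given y
        like = [(1 - p) if ((m in top) == (y == 1)) else p for m in range(M)]
        z = sum(l * q for l, q in zip(like, post))
        post = [l * q / z for l, q in zip(like, post)]
        traj.append(entropy(post))
    return traj

traj = simulate()
print(traj[0], traj[-1])
```

Plotting `traj` exhibits the data phase (steady decline from log M) followed by the slow tail once the entropy is small, mirroring the drift phases described above.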
More precisely, we make the following argument by defining a pruned-time random process. First, we define the following random variables:
(6)  
(7) 
Then, the pruned-time process is defined as
(8) 
Note that the first of these is a stopping time with respect to the filtration, but this is not the case for the second.
Lemma 1.
Suppose a nonnegative random process has the following properties w.r.t. a filtration:
(9a)  
(9b)  
(9c)  
(9d) 
where are nonnegative numbers and for all . Given , and , let
where with . Further define as
Let the pruned time be as in (8) but w.r.t. this filtration. Lastly, define the random process as follows. Then, for small enough parameters, the process is a submartingale with respect to the time-pruned filtration.
Proof:
The objective is to prove that the submartingale inequality holds almost surely for all times. We prove the lemma by considering three cases.
Case (a): From the definition of the pruned process in (8), in this case the process is unchanged. Also, as the time has not yet reached the stopping point, the random process of interest equals
(10) 
As a result, the difference between and satisfies the following
where the first equality holds as and the second equality holds as is a stopping time, which implies that it is a function of the past observations. Next, from (10), the difference term above is bounded as
where the last inequality follows from (9a). As a result, we proved that .
Case (b): In this case, implying that and . Furthermore, since , then . Consequently, the random process equals
Note that this does not necessarily equal the logarithmic part. The reason is that the time is pruned as in (7); thus, it can be greater than the threshold. We proceed by bounding the difference. Note that, for small enough parameters, the following inequality holds
(11) 
Applying inequality (11) with , we can write that
(12) 
Consequently, the difference satisfies the following
(13) 
Next, we bound the first term above as
where in the first equality we add and subtract the intermediate terms . Next, we substitute the above terms in the right-hand side of (13). As , we obtain that
(14) 
where the inequality holds from (9a) and the fact that . Next, by factoring and the indicator function inside the expectation, we have the following chain of inequalities
where (a) is due to (9d), inequality (b) holds as , inequality (c) holds as , and lastly (d) holds as . To sum up, we proved that
Case (c): This is the last case. Note that if , then ; thus, the claim holds immediately, almost surely. Otherwise, if and , or if , then , and hence the claim follows. It remains to consider the case that and . Then, and . Furthermore, as and , then and , implying that we are in the logarithmic-drift phase. Therefore, we have that
Hence, to sum up the above subcases, we conclude that when , then
Note that from (9b), the following inequality holds
Therefore, the difference satisfies the following
Next, we provide an argument similar to the point-to-point (PtP) case. That is, we use Taylor's theorem. We only need to consider the case that and , implying that and . Using Taylor's theorem, we can write
where is between and and
As a result, we have that
where inequality (a) holds as . The last inequality holds for sufficiently small .
Lastly, combining all the cases (a) to (c), we obtain the desired inequality, which completes the proof. ∎
Now, we show that the entropy process as in (5) satisfies the conditions in Lemma 1. First, (9a) holds because of the following lemma.
Lemma 2.
Given any VLC, the following inequality holds almost surely for
(15) 
where with the induced .
Proof:
For any , we have that