Computing the Feedback Capacity of Finite State Channels using Reinforcement Learning

01/27/2020 ∙ by Ziv Aharoni, et al. ∙ Ben-Gurion University of the Negev

In this paper, we propose a novel method to compute the feedback capacity of channels with memory using reinforcement learning (RL). In RL, one seeks to maximize cumulative rewards collected in a sequential decision-making environment. This is done by collecting samples of the underlying environment and using them to learn the optimal decision rule. The main advantage of this approach is its computational efficiency, even in high dimensional problems. Hence, RL can be used to estimate numerically the feedback capacity of unifilar finite state channels (FSCs) with large alphabet size. The outcome of the RL algorithm sheds light on the properties of the optimal decision rule, which in our case, is the optimal input distribution of the channel. These insights can be converted into analytic, single-letter capacity expressions by solving corresponding lower and upper bounds. We demonstrate the efficiency of this method by analytically solving the feedback capacity of the well-known Ising channel with a ternary alphabet. We also provide a simple coding scheme that achieves the feedback capacity.




I Introduction

Computing the capacity of a finite state channel (FSC) is a difficult task that has been vigorously researched in recent decades [Gallager68]. In the presence of feedback, the feedback capacity of a FSC can be expressed using the directed information [permuter2009finite, TatikondaMitter_IT09]. Although the directed information is a multi-letter expression, it was shown that it can be formulated as a Markov decision process (MDP), which makes it computable using known MDP algorithms.


When formulated as a MDP, the feedback capacity of a FSC can be computed using a variety of methods, such as value and policy iteration. These algorithms have been proven very effective for channels with relatively small alphabets of the channel input, output and state [PermuterCuffVanRoyWeissman08, Ising_channel, Sabag_BEC, POSTchannel, Ising_artyom_IT, Sabag_BIBO_IT, trapdoor_generalized]. However, a principal drawback is that their computational complexity grows with the cardinality of the channel alphabet. Indeed, even for channel parameters from the ternary alphabet, these algorithms might be intractable.

We propose a machine learning (ML) approach to compute the capacity of such channels. ML has proven to be a useful tool with a great impact in many research fields. One example in communications is [deepcode], wherein a learning-based algorithm was applied to design a reliable code for the additive white Gaussian noise channel with feedback. The present work introduces a new role for ML in communications: the efficient computation of multi-letter capacity expressions using RL algorithms.

We propose a methodology that uses RL to compute the feedback capacity of unifilar FSCs. Initially, a RL algorithm, namely the deep deterministic policy gradient (DDPG), is used to numerically estimate the feedback capacity. Then, the outcome of the RL algorithm is used to conjecture the structure of the analytic solution, which is expressed by a directed graph. The conjectured graph, which is called a Q-graph, can be used to compute analytic lower and upper bounds on the feedback capacity [Q-UB]. The bounds are guaranteed to coincide with the feedback capacity when the extracted Q-graph is that of the analytic solution. Furthermore, the Q-graph can be used to derive a simple, capacity-achieving coding scheme for the channel. In our work, the proposed methodology enabled us to compute the feedback capacity of the Ising channel with a ternary alphabet (Ising3), and to derive a capacity-achieving coding scheme.

The remainder of the paper is organized as follows. Section II includes the notation and preliminaries. In Section III, we present our main results. Section IV provides background on RL and on the DDPG algorithm. In Section V, we estimate the feedback capacity of the Ising3 channel using the DDPG algorithm. In Section VI, we prove the feedback capacity of the Ising3 channel and present a simple capacity-achieving coding scheme. Section VII contains conclusions and a discussion of future work.

II Notation and Problem Definition

II-A Notation

Calligraphic letters, $\mathcal{X}$, denote alphabet sets, upper-case letters, $X$, denote random variables, and lower-case letters, $x$, denote sample values. A superscript, $x^t$, denotes the vector $(x_1, \dots, x_t)$. The probability distribution of a random variable, $X$, is denoted by $p_X$. We omit the subscript of the random variable when it and the argument have the same letter, e.g. $p(x \mid y) = p_{X \mid Y}(x \mid y)$. The binary entropy is denoted by $H_2(\cdot)$.

II-B Unifilar state channels

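A FSC is defined by the triplet $(\mathcal{X} \times \mathcal{S}, p(s', y \mid x, s), \mathcal{Y} \times \mathcal{S})$, where $X$ is the channel input, $Y$ is the channel output, $S$ is the channel state at the beginning of the transmission, and $S'$ is the channel state at the end of the transmission, where the cardinalities $|\mathcal{X}|, |\mathcal{Y}|, |\mathcal{S}|$ are assumed to be finite. At each time $t$, the channel has the memoryless property

$$p(s_t, y_t \mid x^t, s^{t-1}, y^{t-1}) = p(s_t, y_t \mid x_t, s_{t-1}).$$

A FSC is called unifilar if the new channel state, $s_t$, is a time-invariant function of the triplet $(x_t, y_t, s_{t-1})$, that is, $s_t = f(x_t, y_t, s_{t-1})$. For a FSC with feedback, the input $x_t$ is determined by the message $m$ and the feedback tuple $y^{t-1}$.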

The feedback capacity of a unifilar FSC is given by a multi-letter expression that is presented in the following theorem.

Theorem 1.

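[PermuterCuffVanRoyWeissman08, Thm 1] The feedback capacity of a strongly connected unifilar state channel, where the initial state is available to both the encoder and the decoder, can be expressed by

$$C_{\mathrm{FB}} = \lim_{N \to \infty} \sup_{\{p(x_t \mid s_{t-1}, y^{t-1})\}_{t=1}^{N}} \frac{1}{N} \sum_{t=1}^{N} I(X_t, S_{t-1}; Y_t \mid Y^{t-1}).$$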

II-C Ising3 channel

The Ising channel model was introduced as an information theory problem by Berger and Bonomi in [Berger90IsingChannel], 70 years after it was introduced as a problem in statistical mechanics by Lenz and his student, Ernst Ising [ising1925beitrag]. Berger and Bonomi studied the channel with a binary alphabet. We investigate a generalized version of the Ising channel, where the alphabets are not necessarily binary. The Ising channel is defined by

$$Y_t = \begin{cases} X_t, & \text{w.p. } 0.5, \\ S_{t-1}, & \text{w.p. } 0.5, \end{cases} \qquad S_t = X_t.$$

Hence, if $x_t = s_{t-1}$ then $y_t = x_t$ w.p. 1. Otherwise, $y_t$ is assigned one of the last two input symbols, $x_t$ or $x_{t-1}$, with equal probability.
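As a minimal sketch of the channel law described above, the following Python snippet simulates single uses of the Ising channel. The function names `ising_step` and `ising_transmit`, and the encoding of symbols as integers, are illustrative assumptions and do not come from the paper.

```python
import random


def ising_step(x, s, rng):
    """Simulate one use of the Ising channel.

    x is the current input and s is the channel state (the previous input).
    Returns the pair (y, next_state).
    """
    # The output equals the current input or the state, each with probability 1/2.
    y = x if rng.random() < 0.5 else s
    # The new state is the current input.
    return y, x


def ising_transmit(inputs, initial_state, rng):
    """Pass a sequence of inputs through the channel and collect the outputs."""
    outputs, s = [], initial_state
    for x in inputs:
        y, s = ising_step(x, s, rng)
        outputs.append(y)
    return outputs


rng = random.Random(0)
print(ising_transmit([1, 1, 2, 2, 0, 0], initial_state=1, rng=rng))
```

Whenever an input repeats the previous one, the corresponding output is noiseless, which is the property the coding scheme of this paper relies on.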

III Main Results

The following theorems constitute our main results.

Theorem 2.

The feedback capacity of a unifilar FSC can be estimated by a RL algorithm.

Remark 1.

Theorem 2 is a computational result. Specifically, while previous estimations of the capacity were constrained by the cardinality of the channel parameters, we show that the RL algorithm is dimension-free, in the sense that it is not subject to this cardinality constraint.

Using the numerical results from the RL algorithm, one can deduce the analytic solution structure by a Q-graph [Q-UB], which is used to compute the feedback capacity.

The following theorem is an instance of a known channel that we were able to solve using the numerical results from the RL algorithm.

Theorem 3.

The feedback-capacity of the Ising3 channel is given by


where for .

Furthermore, we derive a simple coding scheme that achieves the feedback capacity in Theorem 3.

Theorem 4.

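There exists a simple coding scheme for the Ising channel with general alphabet $\mathcal{X}$, with the following achievable rate:

$$R(\mathcal{X}) = \max_{p \in [0,1]} \frac{2\left(H_2(p) + (1-p)\log\left(|\mathcal{X}| - 1\right)\right)}{p+3}.$$

Note that for $|\mathcal{X}| = 3$, the coding scheme achieves the capacity in Theorem 3.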

The coding scheme is described by a repeated procedure that is given by the following:

Code construction and initialization:

  • The message is a stream of uniform bits.

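  • Transform the message into a stream of symbols from $\mathcal{X}$, denoted by $v_1, v_2, \dots$, with the following statistics:

$$p(v_i \mid v_{i-1}) = \begin{cases} p, & v_i = v_{i-1}, \\ \frac{1-p}{|\mathcal{X}|-1}, & v_i \neq v_{i-1}. \end{cases}$$

    In words, a new symbol equals the previous symbol with probability $p$ and, otherwise, it is randomly chosen from the remaining $|\mathcal{X}|-1$ symbols. This mapping can be done, for instance, by using enumerative coding, as shown in [1054929].

  • At the first time, the encoder transmits $v_1$ twice.

  • The decoder, upon receiving $y_2$, decodes $\hat{v}_1 = y_2$ and sets its symbol index to $i = 2$.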

The transmission procedure is given by the following:

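  1. If the new symbol equals the previous symbol, transmit it twice and move to the next symbol.

  2. If the new symbol differs from the previous symbol, transmit it once and view the last feedback.

    1. If the feedback equals the transmitted symbol, move to the next symbol.

    2. If the feedback differs from the transmitted symbol, transmit it again and move to the next symbol.

The decoding procedure is given by the following:

  1. If the channel output differs from the previously decoded symbol, then decode it as the new symbol and increment the symbol index.

  2. If the channel output equals the previously decoded symbol, then wait for the next output, set it as the new symbol, and increment the symbol index.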

In Section VI, we prove that the coding scheme yields a zero-error code and that its maximum rate equals the feedback capacity as given in Thm. 3.
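The procedure can be checked empirically with a short simulation. The sketch below is not the authors' implementation. It assumes the reading of the steps in which the encoder repeats a symbol when it equals the previous symbol or when the feedback differs from it, and the decoder accepts an output that differs from its last decoded symbol and otherwise waits for the next output. The names `markov_source` and `run_scheme`, the repeat probability `p=0.26`, and the initial channel state are hypothetical choices.

```python
import random


def ising_step(x, s, rng):
    """One use of the Ising channel: output is x or s w.p. 1/2 each; new state is x."""
    y = x if rng.random() < 0.5 else s
    return y, x


def markov_source(n, p, alphabet_size, rng):
    """Draw n symbols; each repeats the previous one w.p. p, else is uniform over the others."""
    symbols = [rng.randrange(alphabet_size)]
    for _ in range(n - 1):
        prev = symbols[-1]
        if rng.random() < p:
            symbols.append(prev)
        else:
            others = [a for a in range(alphabet_size) if a != prev]
            symbols.append(rng.choice(others))
    return symbols


def run_scheme(symbols, rng, initial_state=0):
    """Simulate encoder, channel and decoder; return (decoded_symbols, channel_uses)."""
    # Initialization: the first symbol is sent twice, so the second output is noiseless.
    s = initial_state
    _, s = ising_step(symbols[0], s, rng)
    y, s = ising_step(symbols[0], s, rng)
    decoded, uses = [y], 2
    for prev, v in zip(symbols, symbols[1:]):
        # Encoder: send the symbol once and observe the feedback.
        y, s = ising_step(v, s, rng)
        outputs = [y]
        # Encoder: repeat if the symbol equals the previous one or the feedback differs from it.
        if v == prev or y != v:
            y, s = ising_step(v, s, rng)
            outputs.append(y)
        uses += len(outputs)
        # Decoder: an output that differs from the last decoded symbol is the new symbol;
        # otherwise the decoder waits for the next output.
        if outputs[0] != decoded[-1]:
            decoded.append(outputs[0])
        else:
            decoded.append(outputs[1])
    return decoded, uses


rng = random.Random(0)
symbols = markov_source(20000, p=0.26, alphabet_size=3, rng=rng)
decoded, uses = run_scheme(symbols, rng)
print("zero error:", decoded == symbols)
print("average channel uses per symbol:", uses / len(symbols))
```

Under these assumptions, a repeated transmission always finds the channel state equal to the transmitted symbol, so the second output is noiseless and the decoder never errs.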

IV Reinforcement Learning

In this section, we provide the definition of the basic RL problem setting as presented in [sutton2018reinforcement] and elaborate on the DDPG algorithm.

IV-A Background

The RL field in ML comprises an agent that interacts with an unknown environment by taking sequential actions. Formally, at time $t$, the agent observes the environment's state $s_t$ (not to be confused with the channel state) and then takes an action $a_t$. This incurs an immediate reward $r_t$ and the agent's next state $s_{t+1}$, as shown in Fig. 1.

The environment is assumed to satisfy the Markov property,

$$p(r_t, s_{t+1} \mid s^t, a^t, r^{t-1}) = p(r_t, s_{t+1} \mid s_t, a_t). \tag{7}$$

Hence, it can be defined by the conditional probabilities $p(r_t \mid s_t, a_t)$ and $p(s_{t+1} \mid s_t, a_t)$. (One can show that these marginal probabilities are sufficient since the objective is to maximize additive rewards.)

Fig. 1: Depiction of the agent-environment interface in RL. The agent observes the environment state and chooses an action. In return, the environment draws an immediate reward and a next state according to its conditional probabilities.

The agent's policy is a sequence of actions $\pi = (a_1, a_2, \dots)$, and the cumulative rewards with respect to the policy from time $t$ onward are defined by $G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k$, where $\gamma$ is the discount factor. The agent's goal is to find an optimal policy $\pi^*$ such that

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[G_1\right].$$

The subscript of the expectation represents its dependence on the policy.

In the next section, we present the state-action value function that is used as a tool to find $\pi^*$.

IV-B State-action value function

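The state-action value function is defined as

$$Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s, a_t = a\right]. \tag{9}$$

That is, the expected cumulative rewards $G_t$ for taking action $a$ at state $s$ and thereafter following policy $\pi$. Using the Markov property (Eq. (7)) of the environment, one can decompose Eq. (9) to

$$Q_{\pi}(s, a) = \mathbb{E}\left[r_t + \gamma Q_{\pi}\left(s_{t+1}, \pi(s_{t+1})\right) \mid s_t = s, a_t = a\right]. \tag{10}$$

The decomposition in (10) is essential when estimating the function $Q_{\pi}$ when $\pi$ is fixed. Once the state-action value function is estimated, it forms the basis for the improvement of a given policy. That is, for each state $s$, the current action $\pi(s)$ can be improved to the action $\pi'(s)$ by choosing

$$\pi'(s) = \arg\max_{a} Q_{\pi}(s, a).$$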


IV-C Function approximation

The function approximators in RL are parameterized models for the policy and the state-action value function. The actor is defined by $A_{\mu}(s)$, a parametric model of $\pi(s)$, whose parameters are $\mu$. The critic is defined by $Q_{\omega}(s, a)$, a parametric model of the state-action value function that corresponds to $A_{\mu}$, whose parameters are $\omega$. Generally, the actor and critic are modeled by neural networks (NNs), as shown in Fig. 2.


Fig. 2: Depiction of actor and critic networks. The actor network comprises a NN that maps the state to an action. The critic NN maps a state-action pair to the estimated future cumulative rewards.

The function approximators are modeled by differentiable parametric models. Hence, learning can be done without visiting the entire state space: the approximation interpolates its estimates from observed states to unobserved states, which enables the algorithm to converge without exhaustive exploration. This constitutes the main difference between the RL approach and previous methods, such as dynamic programming (DP), and makes it a tractable solution for channels with high cardinality.

IV-D DDPG algorithm

The DDPG algorithm [ddpg] is a deep RL algorithm for deterministic policies and continuous state and action spaces. The training procedure comprises $M$ episodes, where each episode contains $T$ sequential steps. A single step of the algorithm comprises two parallel operations: (1) collecting experience from the environment, and (2) training the actor and critic networks to obtain the optimal policy.

In the first operation, the agent collects experience from the environment. Given the current state $s_t$, the agent chooses an action $a_t$ according to an $\epsilon$-greedy policy, where $\epsilon \in [0, 1]$. That is, with probability $1 - \epsilon$ the agent acts according to the actor, and with probability $\epsilon$ the agent takes a random action uniformly over the action space. The term $\epsilon$ denotes the exploration parameter, and it is crucial to encourage the agent to search the entire state and action spaces. Then the agent samples from the environment the incurred reward $r_t$ and the next state $s_{t+1}$. The transition tuple $(s_t, a_t, r_t, s_{t+1})$ is then stored in a replay buffer, a bank of experience, that is used to improve the actor and critic networks. Finally, the agent updates its current state to be $s_{t+1}$ and moves to the next step.
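The experience-collection operation can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' code: the class `ReplayBuffer`, the helpers `epsilon_greedy` and `collect_step`, the buffer capacity, and the toy environment are all hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)  # oldest entries are dropped first

    def store(self, transition):
        self.transitions.append(transition)

    def sample(self, batch_size, rng):
        """Draw a random batch without replacement (smaller if the buffer is short)."""
        k = min(batch_size, len(self.transitions))
        return rng.sample(list(self.transitions), k)


def epsilon_greedy(actor, state, epsilon, random_action, rng):
    """With probability epsilon explore with a random action, otherwise follow the actor."""
    if rng.random() < epsilon:
        return random_action(rng)
    return actor(state)


def collect_step(env_step, actor, state, epsilon, random_action, buffer, rng):
    """Take one step in the environment, store the transition, return the next state."""
    action = epsilon_greedy(actor, state, epsilon, random_action, rng)
    reward, next_state = env_step(state, action, rng)
    buffer.store((state, action, reward, next_state))
    return next_state


def toy_env_step(state, action, rng):
    """Hypothetical one-dimensional environment used only for illustration."""
    reward = -(action - state) ** 2
    next_state = rng.random()
    return reward, next_state


rng = random.Random(0)
buffer = ReplayBuffer(capacity=1000)
state = 0.5
for _ in range(10):
    state = collect_step(toy_env_step, lambda s: s, state, 0.1,
                         lambda r: r.random(), buffer, rng)
print(len(buffer.transitions))
```

Sampling past transitions at random, rather than in the order they were collected, reduces the correlation between the examples used in each training batch.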

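The second operation entails training the actor and critic networks. First, $N$ transitions $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ are drawn randomly from the replay buffer. Second, for each transition, we compute its target based on the right-hand side of Eq. (10),

$$b_i = r_i + \gamma Q_{\omega}\left(s'_i, A_{\mu}(s'_i)\right),$$

where $Q_{\omega}$ and $A_{\mu}$ denote the critic and the actor, respectively. Then we minimize the following objective with respect to the parameters of the critic network as given by

$$L(\omega) = \frac{1}{N} \sum_{i=1}^{N} \left(Q_{\omega}(s_i, a_i) - b_i\right)^2. \tag{13}$$

The aim of this update is to train the critic to comply with Eq. (10). Afterward, we train the actor to maximize the critic's estimation of future cumulative rewards. That is, we train the actor to choose actions that result in high cumulative rewards according to the critic's estimation. The actor update formula is given by

$$\mu \leftarrow \mu + \eta \, \frac{1}{N} \sum_{i=1}^{N} \nabla_{a} Q_{\omega}(s_i, a)\big|_{a = A_{\mu}(s_i)} \nabla_{\mu} A_{\mu}(s_i), \tag{14}$$

where $\eta$ is the step size.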


To conclude, the algorithm alternates between improving the critic's estimation of future cumulative rewards and training the actor to choose actions that maximize the critic's estimation. The algorithm workflow is depicted in Fig. 3.
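To make the alternation concrete, the sketch below performs one critic step and one actor step on a batch of transitions. It is a simplified illustration, not the paper's implementation: the actor and critic are small hand-written parametric models rather than NNs, the state and action are scalars, and the feature map, the parameter layout and the learning rates are hypothetical. The target networks of the published DDPG algorithm are omitted.

```python
def features(s, a):
    """Hypothetical critic features for a scalar state and a scalar action."""
    return [1.0, s, a, s * a, a * a]


def critic(omega, s, a):
    """Critic value: a linear function of the features."""
    return sum(w * f for w, f in zip(omega, features(s, a)))


def actor(mu, s):
    """Actor: an affine function of the state."""
    return mu[0] + mu[1] * s


def critic_loss(omega, batch, targets):
    """Mean squared error between the critic values and fixed targets."""
    n = len(batch)
    return sum((critic(omega, s, a) - b) ** 2
               for (s, a, _, _), b in zip(batch, targets)) / n


def ddpg_update(batch, omega, mu, gamma, lr_critic, lr_actor):
    """One critic step and one actor step on a batch of (s, a, r, s_next) transitions."""
    n = len(batch)
    # Targets: reward plus discounted critic value of the actor's action at the next state.
    targets = [r + gamma * critic(omega, s2, actor(mu, s2)) for (_, _, r, s2) in batch]
    # Critic: gradient descent on the mean squared error, with the targets held fixed.
    grad = [0.0] * len(omega)
    for (s, a, _, _), b in zip(batch, targets):
        err = critic(omega, s, a) - b
        for j, f in enumerate(features(s, a)):
            grad[j] += 2.0 * err * f / n
    new_omega = [w - lr_critic * g for w, g in zip(omega, grad)]
    # Actor: gradient ascent on the critic's value of the actor's own action (chain rule).
    new_mu = list(mu)
    for (s, _, _, _) in batch:
        a = actor(mu, s)
        dq_da = new_omega[2] + new_omega[3] * s + 2.0 * new_omega[4] * a
        new_mu[0] += lr_actor * dq_da / n       # d actor / d mu[0] = 1
        new_mu[1] += lr_actor * dq_da * s / n   # d actor / d mu[1] = s
    return new_omega, new_mu, targets


batch = [(0.1, 0.3, 1.0, 0.2), (0.5, -0.4, 0.5, 0.6), (-0.3, 0.8, 0.0, 0.1)]
omega, mu, targets = ddpg_update(batch, [0.0] * 5, [0.0, 0.0], 0.9, 0.05, 0.01)
print(omega, mu)
```

In a full implementation the two gradients would be obtained by automatic differentiation through the networks; the manual chain rule here only exposes the structure of the two updates.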

Fig. 3: Depiction of the workflow of the DDPG algorithm. At each time step, the agent samples a transition from the environment using an $\epsilon$-greedy policy and stores the transition in the replay buffer. Simultaneously, past transitions are drawn from the replay buffer and used to update the critic and actor NNs according to Equations (13) and (14), respectively.

V Estimating the capacity of the Ising3 channel using RL

In this section, we show the formulation of the feedback capacity as a RL problem, including details of the implementation of the RL algorithm and the experiments we conducted on various unifilar FSCs with feedback.

V-A Formulation of the feedback capacity as a RL problem

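We formulate the Ising3 feedback capacity as a RL problem, based on the formulation in [PermuterCuffVanRoyWeissman08]. We define the state by a two-dimensional vector that represents the posterior distribution of the channel state given the past outputs, $p(s_{t-1} \mid y^{t-1})$; two entries suffice since the three probabilities sum to one. The action is defined by the input distribution $p(x_t \mid s_{t-1}, y^{t-1})$, viewed as a stochastic matrix. The reward is defined by $I(X_t, S_{t-1}; Y_t \mid y^{t-1})$, which is a deterministic function of the state and the action. Hence, the conditional distribution of the reward is induced by the channel distribution Eq. (2). The next state distribution is given by the BCJR equation as given in Eq. (35) in [PermuterCuffVanRoyWeissman08]. Accordingly, the conditional distribution of the next state is induced by the channel distribution, Eq. (2) and the state evolution, Eq. (3).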
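One step of such an environment can be sketched as follows, assuming the formulation of [PermuterCuffVanRoyWeissman08] in which the state is the posterior belief over the channel state, the action is a stochastic matrix of input probabilities, and the reward is the mutual information between the pair (input, channel state) and the output. The function `mdp_step` and its argument layout are hypothetical. For readability, the belief is kept as a full probability vector rather than the two-dimensional vector mentioned in the text.

```python
import math


def ising_p_y(y, x, s):
    """Ising channel law p(y | x, s), where s is the previous input."""
    if x == s:
        return 1.0 if y == x else 0.0
    return 0.5 if y in (x, s) else 0.0


def entropy(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mdp_step(z, u):
    """Return (reward, p_y, next_states) for belief z[s] and action u[s][x].

    reward is I(X, S; Y) under the joint z[s] * u[s][x] * p(y | x, s);
    next_states[y] is the updated belief after observing output y.
    """
    k = len(z)
    p_y = [0.0] * k
    h_y_given_xs = 0.0
    # unnormalized[y][x] accumulates p(x, y); for the Ising channel the new state equals x.
    unnormalized = [[0.0] * k for _ in range(k)]
    for s in range(k):
        for x in range(k):
            w = z[s] * u[s][x]
            cond = [ising_p_y(y, x, s) for y in range(k)]
            h_y_given_xs += w * entropy(cond)
            for y in range(k):
                p_y[y] += w * cond[y]
                unnormalized[y][x] += w * cond[y]
    reward = entropy(p_y) - h_y_given_xs
    next_states = {
        y: [v / p_y[y] for v in unnormalized[y]] for y in range(k) if p_y[y] > 0.0
    }
    return reward, p_y, next_states


uniform_belief = [1.0 / 3] * 3
uniform_action = [[1.0 / 3] * 3 for _ in range(3)]
reward, p_y, next_states = mdp_step(uniform_belief, uniform_action)
print(reward, next_states[0])
```

The environment draws the output from `p_y` and moves to the corresponding entry of `next_states`, so both the reward and the transition law follow from the channel model alone.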

V-B Implementation of the RL algorithm

We model the actor and the critic with two NNs, each of which is composed of three fully connected hidden layers of 300 units separated by batch normalization layers. The actor network input is the state and its output is a matrix that represents the action. The critic network input is the state-action tuple and its output is a scalar, which is the estimate for the cumulative future rewards. In our experiments, we trained the networks for episodes. Each episode length is steps. For the exploration, we chose and decayed it by each episode.
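The constraint on the actor's output matrix is not recoverable from the text. Assuming that each row must be a probability distribution over the channel inputs, one common way to enforce this is a row-wise softmax on the raw network outputs, as in the sketch below. The function name `to_stochastic_matrix` and the flat output layout are hypothetical, and the paper may use a different parameterization.

```python
import math


def to_stochastic_matrix(raw_outputs, n_states, n_inputs):
    """Reshape a flat list of raw network outputs into a row-stochastic matrix.

    Row s is a softmax over n_inputs values and is read as p(x | s).
    """
    matrix = []
    for s in range(n_states):
        row = raw_outputs[s * n_inputs:(s + 1) * n_inputs]
        shift = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


raw = [0.0, 1.0, 2.0, 5.0, 5.0, 5.0, -1.0, 0.0, 1.0]
for row in to_stochastic_matrix(raw, n_states=3, n_inputs=3):
    print(row)
```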

V-C Experiments

We conducted several experiments to verify the effectiveness of our formulation. First, we focused on experimenting with channels whose analytic solution was proven in the past, such as the Trapdoor channel [PermuterCuffVanRoyWeissman08], the Ising channel with a binary alphabet [Ising_channel, Ising_artyom_IT], the Binary Erasure channel with input constraint [Sabag_BEC], and the Dicode channel [Q-UB]. The results showed that the obtained achievable rates were within of the feedback capacity for all channels.

Our aim is to solve a channel with large cardinality that previous methods have failed to solve due to computational complexity. We chose the Ising3 channel as a candidate and used our formulation to estimate its feedback capacity. We ran a simulation over the Ising3 channel with the same RL model as used in the previous experiments. By the end of training, we obtained a policy whose achievable rate is .

Fig. 4: State histogram of the optimal policy as obtained from RL. The histogram was generated by a Monte-Carlo evaluation of the estimated policy.

Another property of the obtained policy is that it visits only six discrete states as shown in Fig. 4. Furthermore, the transition between states is determined uniquely given the output of the channel. These transitions can be shown as a Q-graph, as depicted in Fig. 5.

Fig. 5: Q-graph showing the transitions between states as a function of the channel's output. Blue, red and green lines each correspond to one of the three channel output values. States with dashed lines and states with solid lines behave similarly.

In the next section we use the Q-graph we obtained from the estimated policy of the RL algorithm to solve the Ising3 channel.

VI Analytic Solution for the Ising3 Channel

In this section, we prove Theorem 3 concerning the feedback capacity of the Ising3. Specifically, we use the graphical structure in Fig. 5 to compute a tight upper bound, and analyze the rate of the proposed coding scheme.

VI-1 Bounds on the feedback capacity

The Q-graph method, introduced in [Q-UB], is a general technique that exploits the discrete histogram in Fig. 4 to provide upper and lower bounds on the capacity. The upper bound states that for any choice of a $Q$-graph,

$$C_{\mathrm{FB}} \le \sup_{p(x \mid s, q)} I(X, S; Y \mid Q), \tag{15}$$

where the joint distribution is $p(s, q, x, y) = \pi(s, q)\, p(x \mid s, q)\, p(y \mid x, s)$ and $\pi(s, q)$ is a stationary distribution. The upper bound is tight, that is, equal to the feedback capacity, when the maximizer of (15) satisfies the Markov chain

$$S^{+} - Q^{+} - (Q, Y),$$

where $S^{+}$ and $Q^{+}$ denote the new channel state and the new node on the $Q$-graph, respectively.

We use a convex optimization tool to compute the upper bound in (15) with respect to the $Q$-graph in Fig. 5. The result is used to conjecture a parameterized input as the optimal solution. Then, using the convexity of the upper bound (as a function of the entire joint distribution), one can show that the conjectured solution is optimal, and that the upper bound can be simplified to the expression in Theorem 3. The tightness of the upper bound is shown via the Markov chain above.
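The quantity inside the supremum can be evaluated numerically for a given input distribution. The sketch below does only that. It does not perform the maximization, which the paper carries out with a convex optimization tool, and it assumes the stationary distribution of the pair (channel state, graph node) can be approximated by power iteration. The function `q_graph_objective`, its argument layout and the single-node example graph are hypothetical; the actual graph of Fig. 5 is not reproduced here.

```python
import math


def entropy(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def q_graph_objective(p_x, p_y, f, g, n_s, n_q, n_x, n_y, iters=500):
    """Evaluate I(X, S; Y | Q) for one input distribution p_x[s][q][x].

    p_y(y, x, s) is the channel law, f(x, y, s) the next channel state and
    g(q, y) the next graph node. The stationary distribution of (s, q) is
    approximated by power iteration from the uniform distribution.
    """
    pi = [[1.0 / (n_s * n_q)] * n_q for _ in range(n_s)]
    for _ in range(iters):
        new = [[0.0] * n_q for _ in range(n_s)]
        for s in range(n_s):
            for q in range(n_q):
                for x in range(n_x):
                    for y in range(n_y):
                        w = pi[s][q] * p_x[s][q][x] * p_y(y, x, s)
                        new[f(x, y, s)][g(q, y)] += w
        pi = new
    # I(X, S; Y | Q) = H(Y | Q) - H(Y | X, S), since Y depends on Q only through (X, S).
    h_y_given_q, h_y_given_xs = 0.0, 0.0
    for q in range(n_q):
        joint_y = [0.0] * n_y  # p(q, y) for this node
        for s in range(n_s):
            for x in range(n_x):
                w = pi[s][q] * p_x[s][q][x]
                cond = [p_y(y, x, s) for y in range(n_y)]
                h_y_given_xs += w * entropy(cond)
                for y in range(n_y):
                    joint_y[y] += w * cond[y]
        p_q = sum(joint_y)
        if p_q > 0.0:
            h_y_given_q += p_q * entropy([v / p_q for v in joint_y])
    return h_y_given_q - h_y_given_xs


def ising_p_y(y, x, s):
    """Ising channel law p(y | x, s), where s is the previous input."""
    if x == s:
        return 1.0 if y == x else 0.0
    return 0.5 if y in (x, s) else 0.0


# Single-node graph with uniform inputs on the ternary Ising channel.
uniform_inputs = [[[1.0 / 3] * 3] for _ in range(3)]
print(q_graph_objective(uniform_inputs, ising_p_y,
                        lambda x, y, s: x, lambda q, y: 0, 3, 1, 3, 3))
```

Power iteration may fail to converge when the induced chain is periodic; in that case a linear solve for the stationary distribution would be the safer choice.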

VI-2 Coding scheme - Sketch of proof for Theorem 4

The coding scheme in Theorem 4 is a generalization of the optimal coding scheme for the binary alphabet, $|\mathcal{X}| = 2$, that was presented in [Ising_channel]. We analyze the achievable rate by computing the entropy rate of the input symbols, divided by the expected time until decoding a single symbol.

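Let $p$ denote the probability that a new symbol equals the previous one. The entropy rate can be computed from the symbol transition entropy:

$$H(V_i \mid V_{i-1}) = H_2(p) + (1-p)\log\left(|\mathcal{X}| - 1\right). \tag{16}$$

The expected time until decoding a single symbol is

$$\mathbb{E}[T] = 2p + \frac{3}{2}(1-p) = \frac{p+3}{2}. \tag{17}$$

This is because, when the new symbol equals the previous one, the symbol is sent twice, and when it differs, the symbol is sent once or twice with equal probability. The proof is completed by dividing Eq. (16) by Eq. (17) and taking a maximum over $p$.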
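The final maximization is one-dimensional and easy to carry out numerically. The sketch below assumes the rate expression that follows from the argument above: the symbol transition entropy divided by the expected number of channel uses per symbol, with the repeat probability as the free parameter. The function names and the grid resolution are hypothetical, and a simple grid search stands in for whatever optimizer one prefers.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits, with the convention that it is 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def scheme_rate(p, alphabet_size):
    """Rate of the coding scheme for repeat probability p, in bits per channel use."""
    # Entropy of a transition: repeat w.p. p, otherwise uniform over the other symbols.
    transition_entropy = binary_entropy(p) + (1.0 - p) * math.log2(alphabet_size - 1)
    # A repeated symbol takes 2 channel uses; a new symbol takes 1 or 2 with equal probability.
    expected_time = 2.0 * p + 1.5 * (1.0 - p)
    return transition_entropy / expected_time


def max_scheme_rate(alphabet_size, grid=100000):
    """Grid search over p in [0, 1]; returns (best_p, best_rate)."""
    best_p = max((i / grid for i in range(grid + 1)),
                 key=lambda p: scheme_rate(p, alphabet_size))
    return best_p, scheme_rate(best_p, alphabet_size)


for size in (2, 3, 4):
    print(size, max_scheme_rate(size))
```

For the binary alphabet, the search should agree with the known feedback capacity of the binary Ising channel, roughly 0.5755 bits per channel use, which serves as a sanity check on the expression.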

VII Conclusions

We derived an estimation algorithm of the feedback capacity of a unifilar FSC using RL. The RL approach addresses the cardinality constraint and establishes RL as a useful tool for channels with high cardinality. We provided an example over the Ising3 channel, where we used the insights provided by the numerical results to analytically compute its feedback capacity. Furthermore, we showed a simple capacity-achieving coding scheme for the Ising3 channel with feedback.

Additionally, our preliminary results suggest that we are able to solve the Ising channel for any alphabet size. Next, we plan to solve different channels numerically and, hopefully, establish methods to deduce their analytic solutions and capacity-achieving coding schemes. Furthermore, we plan to use the feedback capacity problem as a framework to improve the RL algorithms.