Many discrete-time continuous-alphabet communication channels involve correlated noise or inter-symbol interference (ISI). Two predominant communication scenarios over such channels arise according to whether feedback from the receiver back to the transmitter is or is not present. The fundamental rates of reliable communication over such channels are, respectively, the feedback (FB) and feedforward (FF) capacity. Starting from the latter, the FF capacity of an $n$-fold point-to-point channel $P_{Y^n|X^n}$, denoted $C_{FF}$, is given by
$$C_{FF} = \lim_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} I(X^n; Y^n). \tag{1}$$
In the presence of feedback, the FB capacity is
$$C_{FB} = \lim_{n\to\infty} \frac{1}{n} \sup_{P_{X^n\|Y^{n-1}}} I(X^n \to Y^n), \tag{2}$$
where
$$I(X^n \to Y^n) := \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1}) \tag{3}$$
is the directed information (DI) from the input sequence $X^n$ to the output sequence $Y^n$, and $P_{X^n\|Y^{n-1}}$ is the distribution of $X^n$ causally-conditioned on $Y^{n-1}$ (see [21, 24] for further details). Built on (3), for stationary processes, the DI rate is defined as
$$I(\mathcal{X} \to \mathcal{Y}) := \lim_{n\to\infty} \frac{1}{n} I(X^n \to Y^n). \tag{4}$$
It was previously proved that, when feedback is not present, the optimization problem (2) performed over the marginals $P_{X^n}$ is equivalent to the optimization in (1). This casts DI as a unifying information measure for representing both FF and FB capacities.
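To make the DI sum in (3) concrete, the following self-contained sketch (our own toy example, not part of the paper's method) computes the DI of a memoryless binary symmetric channel with i.i.d. uniform inputs by exhaustive enumeration; for such a feedback-free channel the DI reduces to $n \cdot I(X;Y)$:

```python
from itertools import product
from math import log2

def entropy(pmf):
    """Shannon entropy (bits) of a pmf given as {outcome: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def directed_information(joint, n):
    """I(X^n -> Y^n) = H(Y^n) - sum_i H(Y_i | X^i, Y^{i-1}),
    for a joint pmf over pairs of length-n tuples (x, y)."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    di = entropy(p_y)                                   # H(Y^n)
    for i in range(1, n + 1):
        p_xi_yi, p_xi_yprev = {}, {}
        for (x, y), p in joint.items():
            p_xi_yi[(x[:i], y[:i])] = p_xi_yi.get((x[:i], y[:i]), 0.0) + p
            p_xi_yprev[(x[:i], y[:i-1])] = p_xi_yprev.get((x[:i], y[:i-1]), 0.0) + p
        # H(Y_i | X^i, Y^{i-1}) = H(X^i, Y^i) - H(X^i, Y^{i-1})
        di -= entropy(p_xi_yi) - entropy(p_xi_yprev)
    return di

# Memoryless BSC(p) with i.i.d. uniform inputs, blocklength n = 3.
n, p = 3, 0.1
joint = {}
for x in product((0, 1), repeat=n):
    for e in product((0, 1), repeat=n):                 # noise pattern
        y = tuple(xi ^ ei for xi, ei in zip(x, e))
        pr = 0.5 ** n * p ** sum(e) * (1 - p) ** (n - sum(e))
        joint[(x, y)] = joint.get((x, y), 0.0) + pr

h2 = -p * log2(p) - (1 - p) * log2(1 - p)               # binary entropy
di = directed_information(joint, n)
# di matches n * (1 - h2) = n * I(X;Y) here, since there is no feedback
```

Exhaustive enumeration is of course only feasible for tiny discrete examples; the paper's point is precisely that continuous alphabets and memory require a different, sample-based approach.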
Computing $C_{FF}$ and $C_{FB}$ requires solving a multi-letter optimization problem. Closed-form solutions to this challenging task are known only in several special cases. Common examples for $C_{FF}$ are the Gaussian channel with memory and the ISI Gaussian channel. There are no known extensions of these solutions to the non-Gaussian case. For $C_{FB}$, a solution was found for the first-order moving average additive Gaussian noise (MA(1)-AGN) channel. Another closed-form characterization is available for auto-regressive moving-average (ARMA) AGN channels. To the best of our knowledge, these are the only two non-trivial examples of continuous channels with memory whose FB capacity is known in closed form. Furthermore, when the channel model is unknown, there is no efficient method for numerically approximating capacity.
Some recent progress on capacity computation was made using deep learning (DL) techniques [9, 19]. In a novel work, the mutual information neural estimator (MINE) was used to learn a modulation for a memoryless channel. In another work, a capacity estimator was proposed based on a reinforcement learning algorithm that iteratively estimates and maximizes the DI rate, but only for discrete-alphabet channels with a known channel model.
Inspired by the above, we develop a framework for estimating the FF and FB capacities of arbitrary continuous-alphabet channels, possibly with memory, without knowing the channel model. Our method does not require the channel transition kernel; we only assume that the channel is stationary and that its outputs can be sampled by feeding it inputs.
Central to our method are a new DI neural estimator (DINE), used to evaluate the communication rate,
and a neural distribution transformer (NDT), used to simulate input distributions. Together, the DINE and NDT lay the groundwork for our capacity estimation algorithm. In the remainder of this section, we describe DINE, NDT, and their integration into the capacity estimator.
I-A Directed Information Neural Estimation
The estimation of mutual information (MI) from samples using neural networks (NNs) is a recently proposed approach [2, 3]. It is especially effective when the involved random variables (RVs) are continuous. The concept originated from MINE [2]. The core idea is to represent MI using the Donsker-Varadhan (DV) variational formula
$$I(X;Y) = \sup_{T} \, \mathbb{E}_{P_{XY}}\big[T(X,Y)\big] - \log \mathbb{E}_{P_X \otimes P_Y}\Big[e^{T(X,Y)}\Big],$$
where $P_{XY}$ is the joint distribution of $(X,Y)$ and $P_X \otimes P_Y$ is the product of its marginals. The supremum is over all measurable functions $T$ for which both expectations are finite. Parameterizing $T$ by an NN and replacing expectations with empirical averages enables gradient ascent optimization to estimate $I(X;Y)$. A variant of MINE that goes through estimating the underlying entropy terms was proposed in [3]. The new estimators were shown empirically to perform extremely well, especially for continuous alphabets.
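As a quick numerical sanity check of the DV formula (ours, not from the paper), one can plug in the optimal critic $T^* = \log \frac{dP_{XY}}{d(P_X \otimes P_Y)}$ for a correlated Gaussian pair, for which the MI is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.6, 200_000

# Correlated standard Gaussian pair; analytic MI = -0.5 * log(1 - rho^2) nats.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y_ind = rng.permutation(y)          # approximate sample from P_X (x) P_Y

def t_star(x, y):
    """Optimal DV critic: log-density ratio of the joint vs. the product."""
    return (-0.5 * np.log(1 - rho**2)
            - (rho**2 * x**2 - 2 * rho * x * y + rho**2 * y**2)
              / (2 * (1 - rho**2)))

# DV value: E_P[T] - log E_Q[exp(T)]; with T = T* this equals the MI.
mi_dv = t_star(x, y).mean() - np.log(np.exp(t_star(x, y_ind)).mean())
mi_true = -0.5 * np.log(1 - rho**2)
```

In MINE the closed-form critic is of course unavailable and $T$ is learned by an NN; the example only verifies that the DV objective attains the MI at the optimum.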
Herein, we propose a new estimator for the DI rate $I(\mathcal{X} \to \mathcal{Y})$. The DI is factorized as
$$I(X^n \to Y^n) = h(Y^n) - h(Y^n \| X^n),$$
where $h(Y^n)$ is the differential entropy of $Y^n$ and $h(Y^n \| X^n) := \sum_{i=1}^{n} h(Y_i | Y^{i-1}, X^i)$ is the causally-conditional differential entropy. Applying the entropy-based estimation approach to these terms, we expand each as a Kullback-Leibler (KL) divergence and a cross-entropy (CE) residual and invoke the DV representation. To account for memory, we derive a formula valid for causally dependent data, which involves RNNs as function approximators (rather than the feedforward network used in the independently and identically distributed (i.i.d.) case). Thus, the DINE is an RNN-based estimator for the DI rate from $\mathcal{X}$ to $\mathcal{Y}$ based on their samples.
Earlier estimators upper bound the DI only in the special case of a jointly Markov process with finite memory. DINE, in contrast, is the first RNN-based method and hence does not assume a parametric model, discrete alphabets, or Markovity. Further details on the DINE algorithm are given in Section II-A.
I-B Neural Distribution Transformer and Capacity Estimation
DINE accounts for one of the two tasks involved in estimating capacity: it estimates the objective of (2). The remaining task is to optimize this objective over input distributions. Generally, sampling from an arbitrary distribution is a complex task. To overcome this, we design a deep generative model of the channel input distribution, namely the NDT. The idea is similar to ones used for generative modeling tasks, e.g., generative adversarial networks or variational autoencoders. The designed NDT maps i.i.d. noise into samples of the channel input distribution. For estimating FB capacity, in addition to the i.i.d. noise, the NDT also receives the channel feedback as input. Together, NDT and DINE form the overall system that estimates the capacity, as shown in Fig. 1.
The capacity estimation algorithm trains the DINE and NDT models together via an alternating maximization procedure. Namely, we iteratively train each model while keeping the (parameters of the) other one fixed. DINE estimates the communication rate of a fixed NDT input distribution, and the NDT is trained to increase its rate with respect to a fixed DINE model. Proceeding until convergence, this results in the capacity estimate, as well as an NDT generative model for the capacity-achieving input distribution.
We demonstrate our method on the MA(1)-AGN channel. Both $C_{FF}$ and $C_{FB}$ are estimated using the same algorithm, using the channel as a black box solely to generate samples. The estimation results are compared with the analytic solutions to show the effectiveness of the proposed approach.
We give a high-level description of the algorithm and its building blocks. Due to space limitations, full details are reserved for the extended version of this paper. The implementation is available at https://github.com/zivaharoni/capacity-estimator-via-dine.
II-A Directed Information Estimation Method
Each entropy term in the factorization $I(X^n \to Y^n) = h(Y^n) - h(Y^n \| X^n)$ is expanded with respect to a reference measure as
$$
\begin{aligned}
h(Y^n) &= h_{CE}\big(P_{Y^n},\, P_{Y^{n-1}} \otimes U\big) - D_{KL}\big(P_{Y^n} \,\big\|\, P_{Y^{n-1}} \otimes U\big),\\
h(Y^n \| X^n) &= h_{CE}\big(P_{Y^n \| X^n},\, P_{Y^{n-1} \| X^{n-1}} \otimes U\big) - D_{KL}\big(P_{Y^n \| X^n} \,\big\|\, P_{Y^{n-1} \| X^{n-1}} \otimes U\big),
\end{aligned}
\tag{II-A}
$$
where $h_{CE}(P,Q)$ and $D_{KL}(P\|Q)$ are, respectively, the cross entropy (CE) and KL divergence between $P$ and $Q$, and $U$ is a uniform reference measure over the support of the dataset. To simplify notation, we use the shorthands
$$D_Y^{(n)} := D_{KL}\big(P_{Y^n} \,\big\|\, P_{Y^{n-1}} \otimes U\big), \qquad D_{XY}^{(n)} := D_{KL}\big(P_{Y^n \| X^n} \,\big\|\, P_{Y^{n-1} \| X^{n-1}} \otimes U\big).$$
Subtracting the two decompositions in (II-A) and observing that the difference of CE terms equals the DI at the former time step, we have
$$I(X^n \to Y^n) = I(X^{n-1} \to Y^{n-1}) + D_{XY}^{(n)} - D_Y^{(n)}.$$
Note that the difference of KL divergences equals $I(X^n \to Y^n) - I(X^{n-1} \to Y^{n-1})$. For stationary data processes we take the limit and obtain
$$I(\mathcal{X} \to \mathcal{Y}) = \lim_{n\to\infty} \Big( D_{XY}^{(n)} - D_Y^{(n)} \Big).$$
Each KL divergence is expanded by its DV representation as:
$$D_{KL}(P \| Q) = \sup_{T} \, \mathbb{E}_{P}[T] - \log \mathbb{E}_{Q}\big[e^{T}\big]. \tag{11}$$
To maximize (11), each DV potential is parametrized by a modified LSTM, and expected values are estimated by empirical averages over the dataset $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}$. Thus, the optimization objectives are:
$$
\begin{aligned}
\hat{D}_Y(\theta_Y) &= \frac{1}{n}\sum_{i=1}^{n} T_{\theta_Y}\big(y_i \mid y^{i-1}\big) - \log \frac{1}{n}\sum_{i=1}^{n} e^{T_{\theta_Y}(\tilde{y}_i \mid y^{i-1})},\\
\hat{D}_{XY}(\theta_{XY}) &= \frac{1}{n}\sum_{i=1}^{n} T_{\theta_{XY}}\big(y_i \mid x^{i}, y^{i-1}\big) - \log \frac{1}{n}\sum_{i=1}^{n} e^{T_{\theta_{XY}}(\tilde{y}_i \mid x^{i}, y^{i-1})},
\end{aligned}
$$
where $T_{\theta_Y}$ and $T_{\theta_{XY}}$ are the parametrized potentials, and $\tilde{y}_i$ are i.i.d. samples from the uniform reference measure $U$.
The estimator is given by:
$$\hat{I}(\mathcal{X} \to \mathcal{Y}) = \hat{D}_{XY}(\theta_{XY}) - \hat{D}_Y(\theta_Y).$$
To capture the time dependencies in $\mathcal{D}_n$, we introduce a modified LSTM network model for function approximation. The LSTM is an RNN that receives a time series as input and, for each time step, performs a recursive non-linear transform to calculate its hidden state. We denote the LSTM function by $\mathrm{LSTM}(\cdot)$; its full characterization can be found in the original LSTM paper.
We modify the structure of the LSTM so that, at each time step $i$, the cell computes both the potential evaluated on the data sample, $T_{\theta_Y}(y_i \mid y^{i-1})$, and the potential evaluated on a reference sample, $T_{\theta_Y}(\tilde{y}_i \mid y^{i-1})$, from a shared hidden state. A similar modification yields $T_{\theta_{XY}}$ by substituting the input $y_i$ with the pair $(x_i, y_i)$ and $\tilde{y}_i$ with $(x_i, \tilde{y}_i)$.
A visualization of a modified LSTM cell (unrolled) is shown in Fig. 2. The LSTM cell's output sequence is fed into a fully-connected layer to obtain the potential values. As demonstrated by Algorithm 1 and Fig. 3, in each iteration we draw $\mathcal{D}_B$, a subset of $\mathcal{D}_n$ of size $B$. We feed the NN with $\mathcal{D}_B$ to acquire the potential values, which enter the NN loss function (II-A), and gradients are calculated to update the NN parameters.
II-B Neural Distribution Transformer
The DINE model is an effective approach to estimating the objective of (2). However, finding the capacity also requires maximizing the DI with respect to the input distribution. For this purpose we present the NDT model, which represents a general input distribution of the channel. At each time step $i$, the NDT maps an i.i.d. noise vector $N_i$ to a channel input variable $X_i$. When feedback is present, the NDT maps $(N_i, Y_{i-1})$ to $X_i$. Thus, the NDT is represented by an RNN with trainable parameters, as shown in Fig. 4. The NDT model is used to generate the channel input $X^n$, and the DINE estimates the DI rate between the channel input and output sequences.
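A minimal numpy sketch of the NDT idea follows (layer sizes, weight initialization, and the power normalization are our illustrative assumptions, not the paper's architecture): an RNN maps i.i.d. noise, and optionally fed-back channel outputs, to input samples satisfying an average power constraint.

```python
import numpy as np

class ToyNDT:
    """Toy NDT: a small tanh RNN mapping i.i.d. noise (plus optional channel
    feedback) to channel input samples, rescaled to meet an average power
    constraint. Weights are random and untrained; in the actual method these
    parameters are optimized to maximize the DINE estimate."""

    def __init__(self, hidden=8, power=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w_noise = rng.normal(scale=0.5, size=hidden)   # noise -> hidden
        self.w_fb = rng.normal(scale=0.5, size=hidden)      # feedback -> hidden
        self.w_hh = rng.normal(scale=0.5, size=(hidden, hidden))
        self.w_out = rng.normal(scale=0.5, size=hidden)     # hidden -> input
        self.power = power

    def generate(self, noise, feedback=None):
        fb = np.zeros_like(noise) if feedback is None else feedback
        h = np.zeros(len(self.w_out))
        xs = []
        for n_i, f_i in zip(noise, fb):
            h = np.tanh(self.w_noise * n_i + self.w_fb * f_i + self.w_hh @ h)
            xs.append(self.w_out @ h)
        x = np.array(xs)
        # rescale so that (1/n) * sum x_i^2 equals the power budget P
        return x * np.sqrt(self.power / np.mean(x**2))
```

Training would backpropagate through `generate` (e.g., in an autodiff framework); the numpy version only illustrates the noise-to-input mapping and the power projection.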
II-C Complete Architecture Layout
Combining the DINE and NDT models into a complete system enables capacity estimation. As shown in Fig. 1, the NDT model is fed with i.i.d. noise and outputs the channel input samples. These samples are fed into the channel to generate its output. The input-output samples are then fed to the DINE model, which outputs the DI rate estimate. To estimate the capacity, the DINE and NDT models are trained together. The training scheme, shown in Algorithm 2, is a variant of an alternating maximization procedure that iterates between updates of the DINE and NDT parameter sets: in each iteration, the parameters of one model are fixed while the other's are updated. At the end of training, a long Monte-Carlo evaluation is performed in order to estimate the expectations in (II-A) accurately.
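The alternating structure can be illustrated on a toy smooth objective (purely illustrative; in Algorithm 2 the two blocks are the DINE and NDT parameters and the objective is the estimated DI rate):

```python
def alternating_maximization(f, a, b, steps=300, lr=0.1, eps=1e-6):
    """Block-coordinate gradient ascent on f(a, b): update one block while
    the other is held fixed, mirroring the DINE/NDT training alternation.
    Gradients are taken numerically to keep the sketch dependency-free."""
    for _ in range(steps):
        grad_a = (f(a + eps, b) - f(a - eps, b)) / (2 * eps)
        a += lr * grad_a                      # "DINE-like" step, b fixed
        grad_b = (f(a, b + eps) - f(a, b - eps)) / (2 * eps)
        b += lr * grad_b                      # "NDT-like" step, a fixed
    return a, b

# Concave toy objective with a unique maximum at a = b = 1.
f = lambda a, b: -(a - b) ** 2 - (b - 1.0) ** 2
a_opt, b_opt = alternating_maximization(f, 0.0, 0.0)
```

For a concave objective this coordinate scheme converges to the maximizer; the DINE/NDT objective is non-concave, so in practice convergence is only to a local optimum.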
Applying this algorithm to channels with memory estimates their capacity without any specific knowledge of the channel's underlying distribution. Next, we demonstrate the effectiveness of this algorithm on continuous-alphabet channels.
III Numerical Results
We demonstrate the performance of Algorithm 2 on the AWGN channel and the first-order MA(1)-AGN channel. The numerical results are compared with the analytic solutions to verify the effectiveness of our method.
III-A AWGN channel
The power-constrained AWGN channel is investigated as an instance of a memoryless continuous-alphabet channel for which the analytic solution is known. The channel model is given by
$$Y_i = X_i + Z_i,$$
where $Z_i \sim \mathcal{N}(0, \sigma^2)$ are i.i.d. RVs and $X_i$ is the channel input sequence, bound to the power constraint $\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[X_i^2] \le P$. Its capacity is given by $C = \frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right)$. In our implementation, we fixed the noise variance and estimated the capacity for a range of SNR values. The numerical results are compared to the analytic solution in Fig. 5.
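For reference, the analytic AWGN curve used for comparison can be computed directly (a trivial helper; variable names are ours):

```python
import math

def awgn_capacity(snr):
    """Capacity 0.5 * log2(1 + SNR) of the power-constrained AWGN channel,
    in bits per channel use, with SNR = P / sigma^2."""
    return 0.5 * math.log2(1 + snr)

# Capacity grows logarithmically with SNR.
curve = [awgn_capacity(snr) for snr in (0.5, 1.0, 2.0, 4.0)]
```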
III-B Gaussian MA(1) channel
The calculation of the capacity of linear Gaussian channels with memory can be divided into two cases, feedback ($C_{FB}$) and feed-forward ($C_{FF}$) capacity. We focus on the MA(1) Gaussian channel model, which is given by:
$$Y_i = X_i + Z_i, \qquad Z_i = W_i + \alpha W_{i-1}, \tag{17}$$
where $W_i \sim \mathcal{N}(0, 1)$ are i.i.d., $X_i$ is the channel input sequence bound to the power constraint $P$, and $Y_i$ is the channel output.
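A black-box simulator of (17) takes only a few lines (our sketch; the seed handling and interface are illustrative):

```python
import numpy as np

def ma1_channel(x, alpha=0.5, rng=None):
    """Simulate Y_i = X_i + Z_i with MA(1) noise Z_i = W_i + alpha * W_{i-1},
    W_i ~ N(0, 1) i.i.d. Only input/output samples are exposed, matching the
    black-box access model assumed by the capacity estimator."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    w = rng.standard_normal(len(x) + 1)     # one extra sample for W_0
    z = w[1:] + alpha * w[:-1]
    return x + z
```

With zero input, the output noise has variance $1 + \alpha^2$ and lag-1 autocovariance $\alpha$, which is a quick way to validate the simulator.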
III-B1 Feed-forward capacity
III-B2 Feedback capacity
In general, $C_{FB}$ of the ARMA($k$) Gaussian channel can be formulated as a dynamic programming problem, which can be solved by an iterative algorithm. For the particular case of (17), $C_{FB}$ is given by $-\log x_0$, where $x_0$ is the unique positive root of a fourth-order polynomial equation. We applied Algorithm 2 for the feedback capacity to obtain an estimate of $C_{FB}$. The results are compared with the analytic solution in Fig. 7.
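Given the polynomial, the analytic benchmark reduces to one-dimensional root finding. The sketch below uses the root equation $P x^2 = (1 - x^2)(1 - |\alpha| x)^2$ as we recall it from Kim's characterization; the exact form and sign conventions should be verified against the original paper. The $\alpha = 0$ case recovers the memoryless capacity $\frac{1}{2}\log(1 + P)$, which serves as a sanity check either way.

```python
import math

def ma1_feedback_capacity(P, alpha):
    """C_FB = -log(x0) (nats), with x0 the root in (0, 1) of
    P x^2 = (1 - x^2)(1 - |alpha| x)^2  -- recalled form; verify vs. Kim 2006."""
    a = abs(alpha)
    f = lambda x: (1 - x * x) * (1 - a * x) ** 2 - P * x * x
    lo, hi = 0.0, 1.0                      # f(0) = 1 > 0, f(1) = -P < 0
    for _ in range(200):                   # bisection to machine precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return -math.log(0.5 * (lo + hi))

# alpha = 0: the polynomial gives x0 = 1/sqrt(1+P), i.e. C = 0.5*log(1+P).
```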
IV Conclusion and Future Work
We have presented a methodology for estimating the FF and FB capacities using the channel as a "black box". The estimator is built from a novel DI estimator (DINE) and an NDT model, both based on RNNs. The performance of the estimator is demonstrated on the AWGN and MA(1)-AGN channels, where the estimates agree with the analytic solutions.
We wish to further generalize our information rate estimation method to multi-user communication channels, a field with many unsolved problems, and to establish theoretical guarantees for the estimator. In addition, information theory (e.g., channel capacity) provides a rigorous mathematical framework in which analytic solutions are known from Shannon theory; hence, these problems form a good benchmark for evaluating machine learning approaches.
-  R. G. Gallager. Information theory and reliable communication. Vol. 2. New York: Wiley, 1968.
-  M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062. June 2018.
-  C. Chan, A. Al-Bashabshesh, H. P. Huang, M. Lim, D. S. H. Tam and C. Zhao. Neural Entropic Estimation: A faster path to mutual information estimation. arXiv preprint arXiv:1905.12957, May 2019.
-  M. Donsker and S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, IV. Communications on Pure and Applied Mathematics, 36(2):183-212. March 1983.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation 9(8): 1735-1780. November 1997.
-  A. M. Schäfer and H. G. Zimmermann. Recurrent neural networks are universal approximators. International Journal of Neural Systems 17.04: 253-263. 2007.
-  L. Breiman. The individual ergodic theorem of information theory. The Annals of Mathematical Statistics: 809-811. September 1957.
-  J. Massey, Causality, feedback, and directed information. Proc. Int. Symp. Inf. Theory Appl. , pp. 303–305. November 1990.
-  R. Fritschek, R. F. Schaefer, and G. Wunder. Deep Learning for Channel Coding via Neural Mutual Information Estimation. arXiv preprint arXiv:1903.02865 March 2019.
-  K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks 2.5 : 359-366. March 1989.
-  S. Yang, A. Kavcic, and S. Tatikonda. On the feedback capacity of power-constrained Gaussian noise channels with memory. IEEE Trans. Inf. Theory 53.3 : 929-954. March 2007.
-  Y. H. Kim. Feedback capacity of the first-order moving average Gaussian channel. IEEE Trans. Inf. Theory 52.7: 3063-3079. July 2006.
-  Y. H. Kim. Feedback capacity of stationary Gaussian channels. IEEE Trans. Inf. Theory 56.1: 57-85. January 2010.
-  T. M. Cover, and J. A. Thomas. Elements of information theory. John Wiley and Sons, 2012.
-  W. Hirt and J. L. Massey. Capacity of the discrete-time Gaussian channel with intersymbol interference. IEEE Trans. Inf. Theory 34.3: 380-388. May 1988.
-  J. Zhang, O. Simeone, Z. Cvetkovic, E. Abela, and M. Richardson. ITENE: Intrinsic Transfer Entropy Neural Estimator. arXiv preprint arXiv:1912.07277. January 2020.
-  Y. H. Kim. A coding theorem for a class of stationary channels with feedback. IEEE Trans. Inf. Theory 54.4: 1488-1499. April 2008.
-  S. Yang, A. Kavcic, and S. Tatikonda. Feedback Capacity of Stationary Sources over Gaussian Intersymbol Interference Channels. GLOBECOM, 2006.
-  Z. Aharoni, O. Sabag, and H. H. Permuter. Computing the Feedback Capacity of Finite State Channels using Reinforcement Learning. 2019 IEEE International Symposium on Information Theory (ISIT). IEEE, 2019.
-  S. Molavipour, G. Bassi, and M. Skoglund. Conditional Mutual Information Neural Estimator. arXiv preprint arXiv:1911.02277. November 2019.
-  H. H. Permuter, Y. H. Kim, and T. Weissman. Interpretations of directed information in portfolio theory, data compression, and hypothesis testing. IEEE Trans. Inf. Theory 57.6: 3248-3259. June 2011.
-  D. P. Kingma, M. Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114. 2013.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville & Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680). 2014.
-  G. Kramer. Directed information for channels with feedback. Hartung-Gorre, 1998.
-  L. Zhao, H. Permuter, Y. Kim, and T. Weissman. Universal estimation of directed information. IEEE Trans. Inf. Theory 59.10: 6220-6242. October 2013.
-  I. Kontoyiannis, and M. Skoularidou. Estimating the directed information and testing for causality. IEEE Transactions on Information Theory 62.11: 6053-6067. November 2016.
-  C. J. Quinn, T.P. Coleman, N. Kiyavash et al. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J Comput Neurosci 30, 17–44. June 2010.