I Introduction
When designing information transmission systems, the commonly adopted approach has been model-driven, assuming a known statistical channel model, namely the channel input-output conditional probability law. But in certain application scenarios, the underlying physical mechanism of the channel is not sufficiently understood to build a dependable channel model, or is known yet too complicated for a model-driven design, e.g., with a strongly nonlinear and high-dimensional input-output relationship. Such scenarios motivate us to raise the question: “How to learn to transmit information over a channel without using its statistical model?”
With a sufficient amount of channel input-output samples as training data, ideally one would expect that although not using the actual statistical channel model, the encoder and the decoder can eventually adjust their operation so as to reliably transmit messages at a rate close to the channel capacity. But this is by no means a trivial task. First, nonparametric estimation of mutual information with training data only (see, e.g., [2]) and maximization of mutual information with respect to input distribution are both notoriously difficult problems in general. Second, the decoder needs to compute its decoding metric according to the actual statistical channel model, since otherwise the decoding rule would be “mismatched” and the channel input-output mutual information would not be achievable.
We remark that another line of information-theoretic works, beyond the scope of this paper, considers type-based universal decoding algorithms such as the maximum empirical mutual information (MMI) decoder (see, e.g., [3] [4, Sec. IV-B4), p. 2168]), which can achieve the capacity and even the error exponent of a channel without utilizing its statistical model. But such universal decoding algorithms are less amenable to practical implementation, compared with decoding with a fixed metric, which will be considered in this paper.
In this paper, we adopt a less sophisticated approach with a more modest goal. We separate the learning phase and the information transmission phase. In the learning phase, given a number of channel input-output samples as training data, the encoder and the decoder learn some key characteristics of the statistical channel model; in the information transmission phase, the encoder and the decoder run a prescribed coding scheme, with code rate and key parameters in the decoding rule already learnt from training data during the learning phase. Therefore, our goal is not the channel capacity, but a rate under the specific coding scheme, for which the information-theoretic topic of mismatched decoding [4], in particular the so-called generalized mutual information (GMI), provides a convenient tool. We note that here the terminology of training is taken from machine learning, and is different from “channel training” in conventional wireless communications systems, where the statistical channel model is already known (typically a linear Gaussian channel with random channel fading coefficients) and the purpose of training is to estimate the channel fading coefficients based on pilot symbols.
Section II considers a general memoryless channel with known statistical model, under Gaussian codebook ensemble and nearest neighbor decoding. We allow the channel output to be processed by a memoryless function, before feeding it to the decoder. We show that in terms of GMI, the optimal channel output processing function is the minimum mean squared error (MMSE) estimator of the channel input upon observing the channel output, namely, the conditional expectation operator. This fact motivates the consideration that with training data only, the channel output processing function should also be in some sense “close” to the conditional expectation operator, and establishes a close connection with the classical topic of regression in machine learning.
Section III hence follows the above thoughts to formulate a learning problem. Unlike in regression problems where performance is measured in terms of generalization error, for our information transmission problem we are interested in code rate, which is chosen based upon training data only. In Section III-B, we propose two performance metrics: overestimation probability, which quantifies the probability that the chosen code rate, as a random variable, exceeds the GMI; and receding level, which quantifies the average relative gap between the chosen code rate and the optimal GMI. We develop an algorithm called LFIT (Learn For Information Transmission) to accomplish the aforementioned learning task, which is further assessed using numerical experiments.
Section IV discusses several potential extensions of the basic channel model in Section II, including channels with memory, channels with general inputs and general decoding metrics, and channels with state. Section V concludes this paper. In order to illustrate the analytical results in Section II, the Appendix presents case studies for several channels, some of which exhibit strong nonlinearity. These channels are also used in Section III for assessing the algorithm.
Throughout this paper, all rate expressions are in nats per channel use, and logarithms are to base e. In numerical experiments, rates are converted into bits by the conversion rule of 1 nat = log_2 e ≈ 1.44 bits.
Recently, a heightened research interest has been seen in applying machine learning techniques (notably deep neural networks and variants) in physical-layer communications, including end-to-end system design [5]-[8], channel decoding [9]-[11], equalization [12]-[14], symbol detection [15]-[17], channel estimation and sensing [18]-[22], molecular communications [23], and so on. Researchers have accumulated considerable experience about designing machine learning enabled communication systems. Instead of exploring the performance of specific machine learning techniques, our main interest in this paper is a general problem formulation for integrating basic ingredients of machine learning into information transmission models, so that within the problem formulation different machine learning techniques can be applied and compared.
II Channels with Known Statistical Model
In this section, we investigate a framework for information transmission over a memoryless channel with a real scalar input and a general (e.g., vector) output. We will discuss several potential extensions in Section IV. The central assumption throughout this section is that the statistical model of the channel, namely, its input-output conditional probability law, is known to the encoder and the decoder. The results developed in this section will shed key insights into our study of learning based information transmission problems in Section III, where this assumption is abandoned.
II-A Review of Generalized Mutual Information and An Achievable Rate Formula
It is well known in information theory that, given a memoryless channel with input-output conditional probability law , ,
, when the encoder uses a codebook where each symbol in each codeword is independently generated according to certain probability distribution
, and the decoder employs a maximum-likelihood decoding rule, the mutual information is an achievable rate, and by optimizing we achieve the channel capacity [24]. When the decoder employs a decoding rule which is no longer maximum-likelihood but mismatched to the channel conditional probability law, however, mutual information fails to characterize the achievable rate, and the corresponding analysis of mismatched decoding is highly challenging; see [4] [25] and references therein for a thorough overview. In fact, the ultimate performance limit of mismatched decoding, called the mismatched capacity, still remains an open problem to date, and we need to resort to its several known lower bounds; see, e.g., [26]-[30].
In this work, our main interest is not about exploring the fundamental limit of mismatched decoding, but rather about using it as a convenient tool for characterizing the achievable rate of a given information transmission model. Following the techniques in [32], a particular lower bound of the mismatched capacity has been derived in [31] when , under the following conditions:
(1) Under average power constraint , codeword length , and code rate (nats/channel use), the encoder uses an independent and identically distributed (i.i.d.) Gaussian codebook, which is a set of mutually independent dimensional random vectors, each of which is distributed. The ensemble of i.i.d. Gaussian codebooks is called the i.i.d. Gaussian ensemble.
(2) Given a length channel output block , the decoder employs a nearest neighbor decoding rule with a prescribed processing function and a scaling parameter to decide the transmitted message as
(1) 
and is the codeword corresponding to message . Note that the output alphabet is arbitrary, for example, multidimensional, like in a multi-antenna system. Geometrically, the right-hand side of (1) is the squared Euclidean distance between a scaled codeword point and the processed received signal point, in the dimensional Euclidean space.
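As an illustration of conditions (1) and (2), the following is a minimal sketch of an i.i.d. Gaussian codebook with nearest neighbor decoding. It assumes the decoding metric is the per-symbol average of the squared difference between the processed output g(y) and the scaled codeword symbol a x, as described geometrically above. The function names, the AWGN test channel, and all parameter values are hypothetical choices made for illustration only.

```python
import math
import random


def make_codebook(num_messages, n, power, rng):
    """Draw an i.i.d. Gaussian codebook: one length-n codeword per message."""
    std = math.sqrt(power)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(num_messages)]


def nearest_neighbor_decode(codebook, y, g, a):
    """Return the message index minimizing (1/n) * sum_k (g(y_k) - a * x_k(m))^2."""
    n = len(y)
    processed = [g(yk) for yk in y]  # memoryless processing of the channel output

    def metric(codeword):
        return sum((p - a * x) ** 2 for p, x in zip(processed, codeword)) / n

    return min(range(len(codebook)), key=lambda m: metric(codebook[m]))


def awgn(x, noise_std, rng):
    """Hypothetical test channel: y = x + z with z ~ N(0, noise_std^2)."""
    return [xk + rng.gauss(0.0, noise_std) for xk in x]


rng = random.Random(0)
codebook = make_codebook(num_messages=16, n=64, power=1.0, rng=rng)
sent = 5
y = awgn(codebook[sent], noise_std=0.5, rng=rng)
decoded = nearest_neighbor_decode(codebook, y, g=lambda v: v, a=1.0)
print("sent", sent, "decoded", decoded)
```

The processing function `g` and the scaling `a` are passed in as arguments, so the same decoder can be reused with any processing function, known or learnt.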
The lower bound derived in [31] is called the GMI under the codebook ensemble in condition (1) and the decoding rule in condition (2), given by the following result.
Proposition 1
For an information transmission model under conditions (1) and (2), the information rate
(2) 
is achievable, when the scaling parameter is set as
(3) 
Proof: This proposition has been stated in a slightly restricted form as [31, Prop. 1]. Here we briefly outline its proof, for completeness of exposition, and for easing the understanding of some technical development in Section III.
Due to the symmetry in the codebook design in condition (1), when considering the average probability of decoding error averaged over both the message set and the codebook ensemble, it suffices to assume that the message is selected for transmission, without loss of generality. Therefore the decoding metric for behaves like
(4) 
due to the strong law of large numbers, where the expectation is with respect to
. The GMI is the exponent of the probability that an incorrect codeword , , incurs a decoding metric no larger than , and hence is an achievable rate, due to a standard union bounding argument [31, Prop. 1] [32]:
(5) 
(6) 
The calculation of
is facilitated by the noncentral chi-squared distribution of
conditioned upon , following [31, App. A] (see also [32, Thm. 3.0.1]). We can express as
(7) 
Solving the maximization problem (7) as in [31, App. A] (see footnote 1), we arrive at (2). The corresponding optimal is given by (3), and the optimal is . Q.E.D.
Footnote 1: There is a minor error in passing from (78) to (80) in [31, App. A], but it can be easily rectified and does not affect the result.
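To make Proposition 1 concrete, here is a hedged numerical sketch. It assumes that (2) and (3) take the form stated in the docstring; this form is inferred from the later remarks that the quantity inside the GMI is the squared correlation coefficient between the channel input and the processed output, and that the scaling parameter coincides with that quantity for the (L)MMSE estimator. The symbol names and the AWGN test channel are assumptions for illustration.

```python
import math
import random


def gmi_from_samples(xs, ys, g, power):
    """Plug-in estimate of Delta_g, the scaling a, and the GMI (in nats).

    Assumed form of (2)-(3):
      Delta_g = (E[X g(Y)])^2 / (P * E[g(Y)^2]),
      a       = E[X g(Y)] / P,
      I_GMI   = 0.5 * log(1 / (1 - Delta_g)).
    """
    n = len(xs)
    gs = [g(y) for y in ys]
    exg = sum(x * v for x, v in zip(xs, gs)) / n  # estimate of E[X g(Y)]
    eg2 = sum(v * v for v in gs) / n              # estimate of E[g(Y)^2]
    delta = exg ** 2 / (power * eg2)
    a = exg / power
    gmi = 0.5 * math.log(1.0 / (1.0 - delta))
    return delta, a, gmi


rng = random.Random(1)
P, noise_var, n = 1.0, 1.0, 100_000
xs = [rng.gauss(0.0, math.sqrt(P)) for _ in range(n)]
ys = [x + rng.gauss(0.0, math.sqrt(noise_var)) for x in xs]  # AWGN test channel

delta, a, gmi = gmi_from_samples(xs, ys, g=lambda y: y, power=P)
capacity = 0.5 * math.log(1.0 + P / noise_var)
print(f"Delta={delta:.3f}  a={a:.3f}  GMI={gmi:.3f} nats  capacity={capacity:.3f} nats")
```

For the AWGN channel the estimated GMI should be close to the capacity, in line with the remark below that this coding scheme is capacity-achieving for linear Gaussian channels.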
As mentioned earlier in this subsection, there are several known lower bounds of the mismatched capacity, many of which actually outperform GMI in general. We employ the GMI under conditions (1) and (2) as the performance metric in our subsequent study, because first, its expression given in Proposition 1 is particularly neat; second, the combination of i.i.d. Gaussian codebook ensemble and nearest neighbor decoding rule provides a reasonable abstraction of many existing coding schemes (see, e.g., [32] [31] [33]) and is in fact capacity-achieving for linear Gaussian channels (see, e.g., the appendix); and finally, it also has a rigorous information-theoretic justification. In fact, the GMI is the maximally achievable information rate such that the average probability of decoding error asymptotically vanishes as the codeword length grows without bound, under the i.i.d. Gaussian codebook ensemble in condition (1) and the nearest neighbor decoding rule in condition (2); see, e.g., [32, pp. 1121-1122], for a discussion.
II-B Linear Output Processing
In this subsection, we restrict the processing function to be linear; that is, where is a column vector which combines the components of . We denote the dimension of and by . Noting that the GMI is increasing with , we aim at choosing the optimal so as to maximize . For this, we have the following result.
Proposition 2
Suppose that is invertible. The optimal linear output processing function is the linear MMSE estimator of upon observing , given by
(8) 
The resulting maximized and are
(9) 
and
(10) 
respectively. The corresponding scaling parameter is exactly in (9).
Proof: With a linear output processing function , we rewrite in (2) as
(11) 
noting that with a scalar . This is a generalized Rayleigh quotient [34]. With a transformation , we have
(12) 
When
is the eigenvector of the largest eigenvalue of
, is maximized as this largest eigenvalue divided by . Noting that this matrix has rank one, its largest eigenvalue is simply , and is achieved with , i.e., . The results in Proposition 2 then directly follow. Q.E.D.
Note that the denominator in the logarithm of in (10) is exactly the mean squared error of the linear MMSE estimator (8) (see, e.g., [35]), which may be conveniently denoted by . So we have
(13) 
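Proposition 2 can be checked numerically on a small example. The sketch below assumes the standard LMMSE form, i.e., a combining vector equal to the inverse output correlation matrix applied to the input-output cross-correlation vector, and applies it to a hypothetical two-antenna SIMO channel with additive Gaussian noise. The channel gains, the helper name `lmmse_delta_2d`, and the closed form used for comparison are assumptions made for this illustration.

```python
import math
import random


def lmmse_delta_2d(xs, ys, power):
    """Estimate the LMMSE combiner and Delta for a 2-dimensional output.

    Assumed form: beta = E[Y Y^T]^{-1} E[X Y] and
    Delta_LMMSE = E[X Y]^T E[Y Y^T]^{-1} E[X Y] / P.
    """
    n = len(xs)
    r0 = sum(x * y[0] for x, y in zip(xs, ys)) / n
    r1 = sum(x * y[1] for x, y in zip(xs, ys)) / n
    c00 = sum(y[0] * y[0] for y in ys) / n
    c01 = sum(y[0] * y[1] for y in ys) / n
    c11 = sum(y[1] * y[1] for y in ys) / n
    det = c00 * c11 - c01 * c01  # nonzero when E[Y Y^T] is invertible
    beta = ((c11 * r0 - c01 * r1) / det, (c00 * r1 - c01 * r0) / det)
    delta = (r0 * beta[0] + r1 * beta[1]) / power
    return beta, delta


rng = random.Random(5)
P, noise_var, n = 1.0, 1.0, 100_000
h = (1.0, 0.5)  # hypothetical channel gains
xs, ys = [], []
for _ in range(n):
    x = rng.gauss(0.0, math.sqrt(P))
    xs.append(x)
    ys.append(tuple(hk * x + rng.gauss(0.0, math.sqrt(noise_var)) for hk in h))

beta, delta = lmmse_delta_2d(xs, ys, P)
h2 = sum(hk * hk for hk in h)
delta_closed_form = P * h2 / (P * h2 + noise_var)  # closed form for this linear Gaussian example
gmi = 0.5 * math.log(1.0 / (1.0 - delta))
print(f"beta={beta}  Delta={delta:.3f} (closed form {delta_closed_form:.3f})  GMI={gmi:.3f} nats")
```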
In our prior works, we have examined several special cases of Proposition 2, including scalar Gaussian channels with one-bit output quantization and super-Nyquist sampling [31, Sec. VI], and fading Gaussian channels with multiple receive antennas and output quantization [36] [37]. Here Proposition 2 serves as a general principle.
For the special case of scalar output, there is an interesting connection between Proposition 2 and the so-called Bussgang’s decomposition approach to channels with nonlinearity. Bussgang’s decomposition has its idea originated from Bussgang’s theorem [38], which is a special case of Price’s theorem [39], for the cross-correlation function between a continuous-time stationary Gaussian input process and its induced output process passing a memoryless nonlinear device, and has been extensively applied to discrete-time communication channels as well (e.g., [40] [41] [33]). For our channel model Bussgang’s decomposition linearizes the channel output as
(14) 
where the residual is uncorrelated with . So if we formally calculate the “signal-to-noise ratio” of (14), we can verify that
(15) 
where is exactly (9) specialized to scalar output. Hence we have
(16) 
that is, the GMI result in Proposition 2 provides a rigorous information-theoretic interpretation of Bussgang’s decomposition.
II-C Optimal Output Processing
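The connection with Bussgang’s decomposition can be sketched numerically. The example below assumes the linearization has the form Y = alpha X + W, with alpha chosen so that W is uncorrelated with X, and compares the GMI computed from the resulting “signal-to-noise ratio” with the GMI computed from the squared correlation coefficient. The one-bit quantized AWGN channel and the closed form used in the check are assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(2)
P, noise_var, n = 1.0, 0.25, 100_000
xs = [rng.gauss(0.0, math.sqrt(P)) for _ in range(n)]
# Hypothetical nonlinear channel: one-bit quantization of an AWGN output.
ys = [1.0 if x + rng.gauss(0.0, math.sqrt(noise_var)) >= 0 else -1.0 for x in xs]

exy = sum(x * y for x, y in zip(xs, ys)) / n  # estimate of E[X Y]
ey2 = sum(y * y for y in ys) / n              # estimate of E[Y^2]

# Bussgang-style linearization Y = alpha * X + W with W uncorrelated with X.
alpha = exy / P
residual_power = ey2 - alpha ** 2 * P         # E[W^2] = E[Y^2] - alpha^2 * P
snr_bussgang = alpha ** 2 * P / residual_power

# Squared correlation coefficient under linear processing of a scalar output.
delta = exy ** 2 / (P * ey2)

gmi_from_snr = 0.5 * math.log(1.0 + snr_bussgang)
gmi_from_delta = 0.5 * math.log(1.0 / (1.0 - delta))
print(f"SNR={snr_bussgang:.3f}  Delta={delta:.3f}  GMI={gmi_from_delta:.3f} nats")
```

The two GMI values agree up to floating-point rounding, since the formal signal-to-noise ratio equals delta / (1 - delta) by construction.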
What is the optimal output processing function without any restriction? Interestingly, this problem is intimately related to a quantity called the correlation ratio which has been studied in a series of classical works by Pearson, Kolmogorov, and Rényi; see, e.g., [42]. The definition of the correlation ratio is as follows.
Definition 1
The following result is key to our development.
Lemma 1
Proof: The result is essentially a consequence of the Cauchy-Schwarz inequality, and has been given in [42, Thm. 1]. Q.E.D.
We can show that lies between zero and one, taking value one if and only if is a Borel-measurable function of , and taking value zero if (but not only if) and are independent.
Applying Lemma 1 to our information transmission model, we have the following result.
Proposition 3
The optimal output processing function is the MMSE estimator of upon observing , i.e., the conditional expectation,
(19) 
The resulting maximized and are
(20) 
and
(21) 
respectively. The corresponding scaling parameter is exactly in (20).
Proof: In our information transmission model, we recognize the channel input as and the channel output as in Lemma 1. According to (18),
(22) 
noting that under condition (1). Hence,
(23) 
On the other hand, from Definition 1,
(24) 
Therefore, (23) becomes
(25) 
and equality can be attained, by letting , because of Lemma 1 and the fact that . This establishes Proposition 3. Q.E.D.
Here we provide a geometric interpretation of Proposition 3. Inspecting the general expression of in (2), we recognize it as the squared correlation coefficient between the channel input and the processed channel output , i.e., the squared cosine of the “angle” between and . So choosing the processing function means that we process the channel output appropriately so as to “align” it with , and the best alignment is accomplished by the MMSE estimator, i.e., the conditional expectation operator.
Utilizing the orthogonality property of the MMSE estimator, (see, e.g., [35]), we can verify that the denominator in the logarithm of in (21) is exactly the mean squared error of the MMSE estimator (19), which may be conveniently denoted by . So we have (see footnote 2)
(26) 
Footnote 2: A side note is that (26) is consistent with the so-called estimation counterpart to Fano’s inequality [24, Cor. of Thm. 8.6.6]: , where is an arbitrary estimate of based upon . Under , we have , i.e., , thereby revisiting [24, Cor. of Thm. 8.6.6].
In Figure 1 we illustrate the transceiver structure suggested by Propositions 2 and 3. The key difference between these two propositions lies in the choice of the channel output processing function, and the effect is clearly seen by comparing (13) and (26). For certain channels, the MMSE estimator may substantially outperform the LMMSE estimator, and consequently the benefit in terms of GMI may be noticeable.
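The gap between linear and optimal processing can be illustrated with a hypothetical channel that is not among the case studies of this paper: Y = (X + Z)^3 with Gaussian Z. Since cubing is invertible, the conditional expectation of X given Y is proportional to the cube root of Y, which gives a closed-form MMSE processing function for comparison. The sketch assumes the squared-correlation form of the GMI expression; the closed-form values quoted in the comments follow from Gaussian moment formulas for this particular example.

```python
import math
import random


def delta_of(xs, gs, power):
    """Sample estimate of (E[X g])^2 / (P * E[g^2]) for processed outputs gs."""
    n = len(xs)
    exg = sum(x * v for x, v in zip(xs, gs)) / n
    eg2 = sum(v * v for v in gs) / n
    return exg ** 2 / (power * eg2)


def cbrt(v):
    """Real cube root, valid for negative arguments as well."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


rng = random.Random(3)
P, noise_var, n = 1.0, 1.0, 200_000
xs = [rng.gauss(0.0, math.sqrt(P)) for _ in range(n)]
# Hypothetical nonlinear channel: Y = (X + Z)^3 with Gaussian Z.
ys = [(x + rng.gauss(0.0, math.sqrt(noise_var))) ** 3 for x in xs]

delta_linear = delta_of(xs, ys, P)                   # linear processing, g(y) = y
delta_mmse = delta_of(xs, [cbrt(y) for y in ys], P)  # g(y) proportional to E[X | y]

gmi_linear = 0.5 * math.log(1.0 / (1.0 - delta_linear))
gmi_mmse = 0.5 * math.log(1.0 / (1.0 - delta_mmse))
# For this example the closed forms are Delta_MMSE = P / (P + noise_var)
# and Delta_LMMSE = 0.6 * P / (P + noise_var).
print(f"linear: Delta={delta_linear:.3f} GMI={gmi_linear:.3f} nats")
print(f"MMSE:   Delta={delta_mmse:.3f} GMI={gmi_mmse:.3f} nats")
```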
The data processing inequality asserts that for any channel, processing the channel output cannot increase the input-output mutual information [24], but Propositions 2 and 3 do not violate it. This is because in our information transmission model, the decoder structure is restricted to be a nearest neighbor rule, which may be mismatched to the channel.
In order to illustrate the analytical results in this section, we present in the Appendix case studies about a single-input-multiple-output (SIMO) channel without output quantization and with one-bit output quantization (with and without dithering). These channel models will also be used as examples in our study of learning based information transmission, in the next section.
III Learning Based Information Transmission
With the key insights gained in Section II, in this section we turn to the setting where the encoder and the decoder do not utilize the statistical channel model, and study how to incorporate machine learning ingredients into our problem.
III-A Analogy with Regression Problems
From the study in Section II, we see that with the codebook ensemble and the decoder structure fixed as in conditions (1) and (2), the key task is to choose an appropriate processing function so as to “align” the processed channel output with the channel input . When the statistical channel model is known, the optimal choice of is the conditional expectation operator. However, without utilizing the channel model knowledge, we need to learn a “good” choice of based on training data. This is where the theory of machine learning kicks in.
Our problem is closely related to the classical topic of regression in machine learning. In regression, we need to predict the value of a random variable upon observing another random variable (see footnote 3). Under quadratic loss, if the statistics of is known, then the optimal regression function is the conditional expectation operator. In the absence of statistical model knowledge, extensive studies have been devoted to design and analysis of regression functions that behave similarly to the conditional expectation operator; see, e.g., [43].
Footnote 3: In most of the machine learning literature (see, e.g., [43]), is used for representing the quantity to predict and for the observed, exactly contrary to our convention here. The reason for adopting our convention is that for information transmission problems is used for representing channel input and for channel output.
We note that, although our problem and the classical regression problem both boil down to designing processing functions that are in some sense “close” to the conditional expectation operator, there still exist some fundamental distinctions between the two problems. In short, the distinctions are due to the different purposes of the two problems. For a regression problem, we assess the quality of a predictor through its generalization error, which is the expected loss when applying the predictor to a new pair of , besides the training data set [43, Chap. 7]. For our information transmission problem, from a training data set, we not only need to form a processing function, but also need to determine a rate for transmitting messages. So the code rate is not known a priori, but needs to be training-data dependent. We assess the quality of our design through certain characteristics of the rate. The details of our problem formulation are in the next subsection.
III-B Problem Formulation
Before the information transmission phase, we have a learning phase. Suppose that we are given i.i.d. pairs of as the training data, according to the channel input-output joint probability distribution , which we denote by
(27) 
We have two tasks in the learning phase, with the training data set . First, we need to form a processing function and a scaling parameter , which will be used by the decoder to implement its decoding rule. Second, we need to provide a code rate so that the encoder and the decoder can choose their codebook to use during the information transmission phase. According to our discussion in Section III-A, we desire to make close to the conditional expectation operator.
From a design perspective, it makes sense to distinguish two possible scenarios:
(A) is unknown, and
(B) is too complicated to prefer an exact realization of , but is still known to the decoder so that it can simulate the channel accordingly.
To justify scenario (B), we note that for a given , it is usually easy to generate a random channel output for any given channel input according to , but very difficult to inversely compute for a given channel output because that generally involves marginalization which can be computationally intensive for high-dimensional .
We generate the training data set as follows. Under scenario (A), the encoder transmits i.i.d. training inputs , known to the decoder in advance, through the channel to the decoder, and the decoder thus obtains . Note that since we have control over the encoder, the input distribution is known. In contrast, under scenario (B), no actual transmission of training inputs is required, and the decoder simulates the channel by itself, according to , in an offline fashion, to obtain . We emphasize that, here in the learning phase, the input distribution need not be the same as that in the information transmission phase (i.e., Gaussian). Changing certainly will change the distribution of
, correspondingly the regression performance, and eventually the information transmission performance. The Gaussian distribution does not necessarily bear any optimality for a general statistical channel model. Nevertheless, in developing our proposed algorithm in Section III-C and conducting numerical experiments in Section III-D, we require the training inputs be Gaussian to generate , for technical reasons.
It is always the decoder who accomplishes the aforementioned learning task, and informs the encoder of the value of , possibly via a low-rate control link. In Figure 2 we illustrate the transceiver structure when the learning phase is taken into consideration.
Under scenario (A), we learn both and ; while under scenario (B), we learn , and can subsequently calculate the corresponding optimal scaling parameter , based upon Proposition 1. More details about how learning is accomplished will be given in the later part of this subsection and in Section III-C. The achieved GMIs under the two scenarios are different, as given by the following result.
Proposition 4
Consider the information transmission model under conditions (1) and (2), given a training data set and a certain learning algorithm.
Under scenario (A), denote the learnt pair by . The corresponding GMI is given by
(28) 
Proof: Consider scenario (A). We still follow the proof of Proposition 1, viewing as a specific choice in the decoding rule. With a fixed , the maximization problem (7) is with respect to only; that is,
(30) 
Rearranging terms, and making a change of variable , we obtain (28).
Consider scenario (B), where the decoder knows the statistical channel model . Therefore, according to Proposition 1, upon learning , it can set the optimal choice of the scaling parameter as , resulting in and in (29). Q.E.D.
It is clear that is no greater than , and their gap is due to the lack of the statistical channel model knowledge . It is natural to expect that when learning is effective, the gap will be small. The following corollary illustrates one such case.
Corollary 1
Suppose that a learning algorithm learns under both scenarios (A) and (B), and under scenario (A) also learns . Denote the gap between
(31) 
by . When satisfies , we have , i.e., the gap between the two GMIs is quadratic with .
Proof: Under the condition for , we have , and we can choose a specific in (28) to get a lower bound of as
(32) 
Via a Taylor expansion with respect to , we have that (32) behaves like , where is bounded as . Therefore, the gap between and is . Q.E.D.
Under scenario (A), a learning algorithm is a mapping . The resulting output processing function (called regression function or predictor in classical regression problems)
usually belongs to a certain prescribed function class, which may be linear (e.g., least squares, ridge regression) or nonlinear (e.g., kernel smoothing, neural networks). According to Proposition 4, we should set . This is, however, impossible since neither the encoder nor the decoder can calculate (28), without knowing . The situation is that the rate is achievable but its value is “hidden” by nature. We hence need to estimate it, again based on and its induced . We desire a highly asymmetric estimation; that is, should hold with high probability, since otherwise there would be no guarantee on the achievability of , resulting in decoding failures. Meanwhile, we also desire that is close to , which corresponds to the ideal situation where the statistical channel model is known to the encoder and the decoder.
The learnt are random due to the randomness of . In order to assess the performance of learning, based upon our discussion, we introduce the following two metrics to quantify the learning loss:

Overestimation probability:
(33) 
This may be understood as the “outage probability” corresponding to a learning algorithm.

Receding level:
(34) 
This is the averaged relative gap between the learnt code rate and the GMI under known channel and optimal output processing, conditioned upon the event that overestimation does not occur.
It is certainly desirable to have both and small.
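The two metrics can be estimated empirically over repeated learning trials. The following sketch follows the verbal definitions above: overestimation is counted when the chosen rate exceeds the GMI actually achieved by the learnt decoder, and the receding level averages the relative gap to the optimal GMI over the trials without overestimation. The function name, argument names, and the numbers in the usage example are hypothetical.

```python
def learning_loss_metrics(rates, achieved_gmis, gmi_mmse):
    """Empirical overestimation probability and receding level.

    rates[i]         : code rate chosen in trial i
    achieved_gmis[i] : GMI actually achieved by the learnt decoder in trial i
    gmi_mmse         : GMI under known channel and optimal (MMSE) processing
    """
    trials = list(zip(rates, achieved_gmis))
    # Rates from trials in which overestimation does not occur.
    safe = [r for r, gmi in trials if r <= gmi]
    p_over = 1.0 - len(safe) / len(trials)
    if not safe:
        return p_over, float("nan")
    # Average relative gap to the optimal GMI, over the safe trials only.
    receding = sum((gmi_mmse - r) / gmi_mmse for r in safe) / len(safe)
    return p_over, receding


# Made-up numbers for four trials, purely to show the call.
p_over, receding = learning_loss_metrics(
    rates=[0.3, 0.5, 0.2, 0.4],
    achieved_gmis=[0.4, 0.45, 0.25, 0.35],
    gmi_mmse=0.5,
)
print(p_over, receding)
```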
Under scenario (B), the situation is much simpler. A learning algorithm is a mapping , and based upon we can choose since it can be evaluated as shown in Proposition 4. So we do not need to consider overestimation, and the receding level is simply
(35) 
We illustrate the rates and loss metrics using a simple example of an additive white Gaussian noise (AWGN) channel, , , and . We use the LFIT algorithm proposed in Section III-C to accomplish the learning task. Figure 3 displays the cumulative distribution functions (CDFs) of the resulting
, , and . As suggested by Corollary 1, the gap between and is nearly negligible. Note that for AWGN channels, and is further equal to channel capacity (the dashed vertical line). Figure 4 displays the CDF of , so the negative part corresponds to overestimation events and the y-intercept is (3.16% in this example).
In Table I, we summarize a comparison among the problem formulations of the two scenarios we consider and classical regression problems (see, e.g., [43]).
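| | Classical regression | Scenario (A) | Scenario (B) |
|---|---|---|---|
| Learning algorithm | | | |
| Processing function | Regression function, linear or nonlinear | Output processing function, linear or nonlinear | Output processing function, linear or nonlinear |
| Ground truth of | MMSE estimator | MMSE estimator | MMSE estimator |
| Loss function | Generalization error | (33) and (34) | (35) |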
III-C Proposed Algorithm
There already exist numerous learning algorithms for classical regression problems to obtain , but under scenario (A) we further need to obtain and . Meanwhile, the learning loss metrics, namely overestimation probability and receding level, are also unconventional. We study these in this subsection.
We begin with a sketch of our approach. We use a collection of parameters to specify the structure used by . The exact choice of , which can be, for example, the complexity parameter in ridge regression, the width of kernel in kernel smoothing methods, the hyperparameters in deep neural networks, and so on, will be discussed in Section III-D. Fixing
, based upon , we learn , and then on top of it, estimate and .
We can provide a theoretical guarantee on the overestimation probability, as given by the following result, whose proof technique will also be useful for devising our proposed algorithm.
Proposition 5
Suppose that we have generated a length training data set subject to . We split it into two parts, of lengths and , respectively, for some , and we learn and solely based upon the length part. Then as , the following estimate of achieves an overestimation probability no greater than :
(36) 
(37) 
(38) 
Here denotes the empirical variance for i.i.d. random variables , i.e.,
(39) 
Proof: We start with the expression of (28) in Proposition 4. Since , it is clear that the maximum value of the right hand side of (28) will exceed only if (i) the estimate of is smaller than its true value, or (ii) the estimate of is larger (smaller) than its true value when is positive (negative). Therefore, in order to ensure an overestimation probability target , it suffices to require each of (i) and (ii) occurs with probability no larger than
, due to the union bound. Applying the central limit theorem (CLT) [44, Thm. 27.1] thus leads to (37) and (38), which ensure that in (36) does not exceed in (28) with probability no smaller than , as . Q.E.D.
We remark that, by imposing appropriate regularity conditions, we may replace the bias terms in (37) and (38) using results from concentration inequalities (e.g., Bernstein’s inequalities [45, Thm. 2.10]), which control the overestimation probability even for finite . In practice, we find that while both CLT and concentration inequalities control the overestimation probability well below its target, they lead to rather large receding level, unless the training data set size is extremely large. This appears to be because the bias terms in (37) and (38) tend to be overly conservative when plugged into the maximization (36). For this reason, in the following, we propose an algorithm to produce , which has better performance in numerical experiments, applying the idea of cross-validation (CV) [43, Chap. 7, Sec. 10], and drawing insights from Proposition 5.
The motivation of applying CV is as follows. For any reasonable size of , we need to utilize the training data economically. But if we simply use the same training data for both learning and estimating and (which involve ), a delicate issue is that the resulting estimates will usually be severely biased, a phenomenon already well known in classical regression problems when assessing generalization error using training data [43, Chap. 7, Sec. 2]. CV is an effective approach for alleviating such a problem, and the procedure is as follows.
We split into non-overlapping segments of the same size, indexed from to . Taking away the th segment, we learn using the remaining segments, and then use the th segment to estimate expectations needed for calculating and . As runs from to , we have estimates for each expectation of interest, and average them as the final estimate.
Denote the training data in the th segment as , and the learnt without using as . Note that , , are disjoint and .
With and , we estimate the following expectations via empirical means:
(40) 
Then we average them from to , to obtain their CV estimates:
(41) 
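The cross-validation procedure described above can be sketched as follows. Two assumptions are made here and are not taken from the equations of this paper: first, that the expectations estimated on each held-out segment are the cross moment of the input with the processed output and the second moment of the processed output, as suggested by the proof of Proposition 5; second, that the learner is scalar least squares through the origin, which is only a stand-in for whichever regression method is actually used. All function names are hypothetical.

```python
import math
import random


def fit_least_squares(pairs):
    """Stand-in learner: scalar least squares through the origin, g(y) = beta * y."""
    beta = sum(x * y for x, y in pairs) / sum(y * y for _, y in pairs)
    return lambda y: beta * y


def cv_estimates(pairs, num_segments, fit=fit_least_squares):
    """Cross-validated estimates of E[X g(Y)] and E[g(Y)^2]."""
    size = len(pairs) // num_segments
    exg_list, eg2_list = [], []
    for q in range(num_segments):
        held_out = pairs[q * size:(q + 1) * size]
        rest = pairs[:q * size] + pairs[(q + 1) * size:num_segments * size]
        g = fit(rest)                     # learn g without the q-th segment
        gs = [g(y) for _, y in held_out]  # evaluate only on the q-th segment
        exg_list.append(sum(x * v for (x, _), v in zip(held_out, gs)) / size)
        eg2_list.append(sum(v * v for v in gs) / size)
    # Average the per-segment estimates to obtain the CV estimates.
    return sum(exg_list) / num_segments, sum(eg2_list) / num_segments


rng = random.Random(4)
P, noise_var, n = 1.0, 1.0, 20_000
pairs = []
for _ in range(n):
    x = rng.gauss(0.0, math.sqrt(P))
    pairs.append((x, x + rng.gauss(0.0, math.sqrt(noise_var))))  # AWGN training data

exg_cv, eg2_cv = cv_estimates(pairs, num_segments=5)
p_hat = sum(x * x for x, _ in pairs) / n  # empirical second moment of the input
a_hat = exg_cv / p_hat                    # assumed rule for the scaling parameter
print(f"E[X g]~{exg_cv:.3f}  E[g^2]~{eg2_cv:.3f}  P~{p_hat:.3f}  a~{a_hat:.3f}")
```

Because each segment is evaluated with a function learnt without it, the averaged estimates avoid the bias that arises from reusing the same samples for both learning and estimation.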
We also use an empirical mean estimate of the second moment of
, , as . Based upon these, we choose as .
Now according to the proof of Proposition 5, we can affect the overestimation probability by biasing the estimates and . To implement this, we introduce two tunable scaling parameters which are typically close to one, in order to slightly bias these expectation estimates; that is, for prescribed , if and otherwise, we choose the code rate according to