The aim of any communication system is to reproduce at the receiver, as exactly as possible, the message sent by a transmitter through the channel between them. Because of the noise characteristics of the channel, the transmitted signal can get corrupted and exact reconstruction of the message may not be possible at the receiver. A robust communication system should be able to handle these channel-induced corruptions and reproduce the message at the receiver as faithfully as possible.
Traditional communication systems follow a block by block design, with each block optimized individually for maximal performance. However, such a design may not yield a globally optimal solution across all blocks. The complexity of the signaling systems, along with the unknown effect of the channel, makes it difficult to design a system that is optimal across all the blocks. Lately, deep learning has seen extraordinary success in learning complex tasks involving natural signals such as images and speech. This raises the question of whether we can leverage deep learning based techniques to design end to end communication systems that are optimized for the observed channel conditions. In , the authors proposed the idea of designing an end to end communication system based on the principles of autoencoders.
The generalization power of deep neural networks on a wide range of complex problems has enabled the application of learning techniques to a broad spectrum of communication challenges. A novel Sparse Code Multiple Access (SCMA) technique based on deep learning that achieves lower Bit Error Rate (BER) than conventional techniques is proposed in . By jointly minimizing BER and Peak to Average Power Ratio (PAPR) using an autoencoder structure,  proposed a solution that can outperform conventional methods. Channel estimation and signal detection have been cast as a deep learning problem in , and the results show robust performance over a wide range of Signal-to-Noise Ratio (SNR), competitive with traditional approaches. The problem of Multiple-Input-Multiple-Output (MIMO) detection using deep learning is discussed in , where the ability of neural networks to serve as MIMO detectors was demonstrated. By using autoencoder networks in OFDM systems with CP,  showed that synchronization issues can be mitigated and equalization over multipath channels simplified when designing end to end communication systems. Application of deep learning to obtain more accurate channel estimates at a DL based OFDM receiver is proposed in . A deep learning based pilot assignment scheme for Massive-MIMO systems is presented in . In order to deal with the quantization effects at OFDM receivers,  introduced a deep learning based method for channel estimation and for jointly learning the precoder and decoder. These works show the exciting applications of deep learning at the physical layer of communication systems. Interested readers are referred to  for a detailed overview of applications of deep learning at the physical layer.
End to end optimization of communication systems  will result in robust communication systems. However, to train the system end to end, channel knowledge is required for computing the weight updates during backpropagation. To overcome the problem of an unknown channel model,  proposed to train the network in two phases: in the first phase, both the transmitter and receiver networks are trained in simulation with a known channel model, and in the second phase, the network is deployed in the actual channel and the receiver network alone is fine-tuned. A practical approach to train systems end to end without any assumptions about the channel is proposed in  based on simultaneous perturbation stochastic approximations. Another method is proposed in  based on output perturbations at the transmitter. Approaches that approximate the channel distribution with a neural network and use it as a surrogate channel for backpropagation are proposed in [15, 16].
The above-mentioned methods for training end to end communication systems have succeeded in designing systems that are competitive with traditional hand-designed systems. However, most of them used training procedures borrowed directly from the deep learning community, casting symbol detection as a classification problem and hence using cross-entropy as the loss function to optimize the neural network. This raises the following questions:
Is cross-entropy the best loss function to use for end to end modeling of communication systems using deep learning?
How can deep learning methods be designed to leverage available prior information, if any, about the communication system to improve the training process?
What can be a systematic way to incorporate domain-specific information into the optimization objective?
In this work, we attempt to address the above questions in the context of end to end communication system design by deriving objective functions which can include prior information about the channel models or domain-specific information and be used to jointly train transmitter and receiver. We show that the proposed method is able to recover the objective functions used by previous works under appropriate assumptions. Our results show that the proposed method is able to produce consistently better-trained models with less variance between the trained models compared to previous works.
Bold face lower-case letters (e.g., x) denote column vectors. Script face letters (e.g., ) denote a set, and denotes the cardinality of the set . represents a function which takes in a vector x and has parameters . denotes the KL divergence between random variables and with distributions and respectively. represents a distribution with parameters . is the expectation operator with respect to distribution . represents an all zero vector of length , and represents the identity matrix of dimension . is a multivariate Gaussian with mean vector and covariance matrix . The trace of a matrix A is denoted by .
II. End to End Modeling of Communication Systems
A communication system can be seen as a model which recreates a copy of the message which is sent by the transmitter at the receiver end. Let be the information to be sent from the transmitter. Modern communication systems convert the data x to a representation which is suitable for transmission over a noisy channel. A corrupted version of z, denoted by is received at the destination. The receiver tries to recover the best possible reconstruction of x from the observed .
The transmitter can be viewed as a function which takes in the information x and computes the intermediate representation z as . The channel which corrupts z can be represented as . Here is a stochastic function which when applied on z gives output . Finally the receiver can be characterized as another function which computes the best possible reconstruction of x from as .
In , a design for a communication system trained end to end is proposed based on the concepts of autoencoders . The transmitter function is represented using a neural network parameterized by such that , and the receiver function is represented using another neural network parameterized by such that . However, the channel function is typically unknown in a communication system and is generally considered a stochastic mapping from z to . This channel function models both the hardware imperfections in the system and the channel impairments. Hence the communication system can be represented as
A schematic representation of the mentioned design using neural network function approximators is provided in Fig. 1.
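As a rough illustration of the transmitter, channel and receiver chain described above, the following sketch composes three plain functions. It is not the neural network design from the cited work: the fixed QPSK lookup table, the minimum-distance receiver, and all function names are assumptions made only to show how the three mappings compose.

```python
import random

# Hypothetical stand-in for the learned transmitter: a fixed QPSK lookup table
# (message index -> two real components of one complex channel use).
CONSTELLATION = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]


def transmitter(x):
    """Map message index x to its channel representation z."""
    return CONSTELLATION[x]


def awgn_channel(z, noise_std, rng):
    """Stochastic channel: add i.i.d. Gaussian noise to every component."""
    return tuple(zi + rng.gauss(0.0, noise_std) for zi in z)


def receiver(z_hat):
    """Stand-in for the learned receiver: pick the closest constellation point."""
    dists = [sum((a - b) ** 2 for a, b in zip(z_hat, c)) for c in CONSTELLATION]
    return dists.index(min(dists))


def end_to_end(x, noise_std, rng):
    """Reconstruction of x after transmitter, channel and receiver."""
    return receiver(awgn_channel(transmitter(x), noise_std, rng))


def block_error_rate(noise_std, trials=2000, seed=0):
    """Monte Carlo estimate of the fraction of wrongly reconstructed messages."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        x = rng.randrange(len(CONSTELLATION))
        errors += end_to_end(x, noise_std, rng) != x
    return errors / trials
```

In an autoencoder based design, `transmitter` and `receiver` would be replaced by trainable networks, while the channel remains a stochastic mapping between them.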
The goal of an end to end communication system design is to find the parameters and such that
 used one-hot encoding to represent the message symbols x, and the gain is calculated based on categorical cross-entropy over all the training samples. That is, , where corresponds to the normalized (to 1) score given to the message x from the output softmax layer.
In the following section, we discuss how to reformulate the problem of communication from a variational inference perspective and use the developments in the generative modeling capabilities of auto-encoder networks for simultaneously training the transmitter and the receiver.
III. Variational Inference Perspective
Efficient reconstruction of message x from the received representation at receiver can be achieved if full knowledge of channel is available. However, the stochastic nature of channel function and the lack of knowledge of the channel parameters makes this goal challenging. The joint density of the data that is transmitted x and the received signal can be represented as
where we assume that transmitter provides a deterministic mapping from x to z. However, in an unknown channel scenario, the conditional density , and in turn , is unknown.
III-A. Variational Inference
Variational Inference (VI) is a method from statistical learning for approximating difficult to compute probability densities. VI deals with finding the conditional distribution of latent variables given x. Considering the joint density , inference in a Bayesian model amounts to conditioning on data and computing the posterior . Variational Inference applies optimization techniques to approximate this conditional density.
Recent developments in deep learning proposed the use of variational inference for the purpose of generative modeling. Generative modeling refers to the process of producing valid samples from . Consider the graphical model given in Fig. 2. Here, samples of x are generated from a latent variable and associated parameters represented by . The solid lines denote the generative model . To generate valid samples of x, we first sample and then use and to generate x. The dashed lines represent the inference procedure with variational approximation of the intractable posterior .
In , a stochastic optimization based method is proposed that applies deep learning to first approximate the inference, with an appropriate prior on , using an encoder network . Then, a decoder network is used to compute the reconstruction of message x from . Here and are the neural network parameters that will be learned during the training phase. Given a high capacity model (i.e., neural networks with sufficient learning capability) and a good prior distribution , the model will approximate the posterior, i.e., . Because of the encoder-decoder structure present in the model, this method is generally known as Auto-encoding Variational Bayes (AVB).
The marginal likelihood of datapoint , under an encoding function, , can be computed as
is commonly referred to as the Evidence Lower Bound (ELBO) or Variational Lower Bound and
is the KL-divergence between the approximating and actual distributions. Please see Appendix A for details on (6). By re-arranging (6) and noting that for any two random variables , we can see that . Therefore, the likelihood of reconstruction is lower bounded by (7) (hence the name ELBO). Since it is difficult to compute the value of , VI instead maximizes the ELBO as a tractable surrogate.
Let be the approximation of obtained by maximizing ELBO (7). Then,
Noting that and observing that is constant with respect to , we have
Hence, the objective of maximizing ELBO (7) also reduces the KL-divergence between the posterior approximation and the intractable posterior .
Following from (7), the maximization objective ELBO can be re-arranged as
Hence, the objective of maximizing ELBO is equivalent to maximizing the penalized likelihood of reconstruction of x from , where the penalty is the KL-divergence between the inference density approximation and the assumed prior .
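For reference, the standard identities from the auto-encoding variational Bayes literature that underlie this argument can be sketched as below. The symbols are assumptions chosen for this sketch ($\hat{\mathbf{z}}$ for the latent variable, $q_\theta$ for the inference density, $p_\phi$ for the decoder, $p(\hat{\mathbf{z}})$ for the prior), and the lines are not meant to reproduce the numbered equations of this paper term by term.

```latex
\begin{align*}
\log p(\mathbf{x})
  &= \mathcal{L}(\theta,\phi;\mathbf{x})
   + D_{KL}\!\left(q_\theta(\hat{\mathbf{z}}\mid\mathbf{x}) \,\big\|\, p(\hat{\mathbf{z}}\mid\mathbf{x})\right), \\
\mathcal{L}(\theta,\phi;\mathbf{x})
  &= \mathbb{E}_{q_\theta(\hat{\mathbf{z}}\mid\mathbf{x})}
     \left[\log p_\phi(\mathbf{x},\hat{\mathbf{z}}) - \log q_\theta(\hat{\mathbf{z}}\mid\mathbf{x})\right] \\
  &= \mathbb{E}_{q_\theta(\hat{\mathbf{z}}\mid\mathbf{x})}
     \left[\log p_\phi(\mathbf{x}\mid\hat{\mathbf{z}})\right]
   - D_{KL}\!\left(q_\theta(\hat{\mathbf{z}}\mid\mathbf{x}) \,\big\|\, p(\hat{\mathbf{z}})\right).
\end{align*}
```

Because the KL-divergence in the first line is non-negative, $\log p(\mathbf{x}) \ge \mathcal{L}$, and since $\log p(\mathbf{x})$ does not depend on $\theta$, raising $\mathcal{L}$ with respect to $\theta$ lowers that KL-divergence by the same amount. The last line is the "reconstruction likelihood minus KL penalty" form.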
Our contribution lies in recasting the problem of end to end modeling of communication system as a variational inference problem, showing how any prior knowledge about the system can be incorporated and then utilizing the autoencoder architecture to design the system.
III-B. Communication as a VI Problem
Drawing parallels to the communication system model presented in Sec. II, we can infer that the transmitter along with the unknown channel does the encoding of data x to and the receiver performs the reconstruction of from received . The graphical model for this process is given in Fig. 3.
From Fig. 1, and are the only learnable parameters in this system and represents the unknown parameters of the channel along with the stochastic channel function . From the model presented, we note that and . In AVB, the encoder network is used to learn the parameters to compute z from given symbol x. Then, a stochastic channel function is applied on z to sample which is used by the decoder network to recreate x. Hence,
The combined effect of the encoder and the stochastic channel function, which together transform the message x into a representation that has suffered corruption from the channel, is approximated by . (Footnote 1: is an approximation because the stochastic channel is unknown in general, and is the best approximation of the combined behavior of the encoder and the channel.) The output of the decoder is a distribution over all the possible messages computed after observing and is represented as .
Finally, the objective of the optimization problem (4) to train end to end communication system having the model discussed above can be written as
where X is the set of available training points.
III-C. Reconstruction Likelihood
The first term in maximizing objective ELBO (9) accounts for the capability of the end to end system to successfully reproduce the intended message x at the receiver end. The exact expression for reconstruction likelihood depends on how the message x is represented in the system.
Previous works on end to end design of communication systems [1, 12, 13, 14] used one-hot encoding to represent each message . With , one-hot encoding uses a vector of length with all entries 0 except a 1 at the position corresponding to the message. The softmax output layer of the receiver also produces a length vector, which sums to 1. If this representation of x is used, the reconstruction term in (9) takes the form of negative categorical cross entropy and can be written as
where corresponds to the normalized (to 1) score given to the message x by the receiver's softmax output layer.
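The one-hot case can be illustrated with a few lines of plain Python. This is a minimal sketch with hypothetical helper names; it only shows that, for a one-hot target, the log-likelihood under the softmax output reduces to the log of the score assigned to the true message, which is the negative of the categorical cross entropy.

```python
import math


def one_hot(index, num_messages):
    """Vector of zeros with a single 1 at the position of the message."""
    return [1.0 if i == index else 0.0 for i in range(num_messages)]


def softmax(scores):
    """Normalize raw receiver scores so that they are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_log_likelihood(target_one_hot, probs):
    """Sum of target * log(prob): the negative of the categorical cross entropy."""
    return sum(t * math.log(p) for t, p in zip(target_one_hot, probs) if t > 0.0)
```

For example, `categorical_log_likelihood(one_hot(0, 4), softmax([2.0, 0.5, -1.0, 0.0]))` is simply the log of the first softmax output.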
Another way of representing the message is to directly use its binary representation. For , we need a block length of at least to represent the (uncoded) message x. Under this representation, x is a vector of length with multiple entries of 0s and 1s. The output layer of the decoder should also be modified to output the corresponding values. In this case, a popular choice for the output layer activation function is the sigmoid, which assigns a value between 0 and 1 to each of the entries in the reconstruction. Hence becomes a multivariate Bernoulli distribution of length with element probabilities computed from . The reconstruction likelihood becomes the negative of the binary cross entropy as in  and can be computed as,
While the one-hot representation with categorical cross entropy is a popular choice for classification tasks, the binary message representation with binary cross entropy scales to a very large number of messages. (Footnote 2: While one-hot encoding requires nodes at the input layer, the binary representation requires only nodes at the input.) One should select the appropriate representation for messages while keeping these constraints in mind. In Sec. IV, we show that by using (15) instead of (14) in (9), the models can be taught the concept of Gray coding without any other explicit criterion.
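The binary representation and the corresponding Bernoulli log-likelihood can be sketched in the same way. The helper names are hypothetical, and the likelihood is summed over bits here (deep learning frameworks often report the mean instead); the last helper compares the input widths required by the two representations.

```python
import math


def to_bits(index, num_bits):
    """Binary representation of a message index, most significant bit first."""
    return [(index >> (num_bits - 1 - k)) & 1 for k in range(num_bits)]


def bernoulli_log_likelihood(bits, probs, eps=1e-12):
    """Sum over bits of b*log(p) + (1-b)*log(1-p): minus the binary cross entropy."""
    total = 0.0
    for b, p in zip(bits, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        total += b * math.log(p) + (1 - b) * math.log(1.0 - p)
    return total


def input_widths(num_messages):
    """Input nodes needed by the one-hot and by the binary representation."""
    return num_messages, max(1, math.ceil(math.log2(num_messages)))
```

Under these assumptions, 256 messages need 256 input nodes with one-hot encoding but only 8 with the binary representation.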
III-D. KL Loss for AWGN Channel
The Additive White Gaussian Noise (AWGN) channel is a widely used channel model to represent the corruption incurred to transmitted signal in communication systems. For a z of dimensions , Gaussian corruption with noise power per component is modeled as
where with is an all zero vector of dimension and is an identity matrix of dimension . Taking a Gaussian prior of , the KL Loss in (9) for AWGN channel can be computed as
Please refer to Appendix B for the derivation. Depending on the representation used for symbols in the model, (17) can be combined with (14) (in the case of one-hot representation) or with (15) (in the case of binary representation) to get the appropriate objective function for training the model in an AWGN channel.
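As a sanity check on this kind of KL term, the sketch below evaluates the textbook closed form of the KL-divergence between two isotropic Gaussians, one centered at the transmit symbol with the noise variance and one centered at zero with the prior variance, and compares it with a Monte Carlo estimate. The function and argument names are assumptions for illustration, and the expression is the generic Gaussian result rather than a transcription of (17).

```python
import math
import random


def kl_awgn(z, noise_var, prior_var):
    """KL( N(z, noise_var*I) || N(0, prior_var*I) ) in closed form."""
    n = len(z)
    power = sum(zi * zi for zi in z)  # squared norm of the transmit symbol
    return 0.5 * (n * noise_var / prior_var + power / prior_var - n
                  + n * math.log(prior_var / noise_var))


def kl_monte_carlo(z, noise_var, prior_var, samples=20000, seed=0):
    """Estimate E_q[log q - log p] by sampling z_hat = z + noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        for zi in z:
            s = zi + rng.gauss(0.0, math.sqrt(noise_var))
            log_q = -0.5 * math.log(2 * math.pi * noise_var) - (s - zi) ** 2 / (2 * noise_var)
            log_p = -0.5 * math.log(2 * math.pi * prior_var) - s ** 2 / (2 * prior_var)
            total += log_q - log_p
    return total / samples
```

In this form the only term that depends on the transmit symbol is its squared norm divided by twice the prior variance, which is why such a penalty discourages high transmit power.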
As the noise power per component and the prior variance are constant in the problem, the final objective to maximize can be written as
The first term in the derived objective (19) is the negative of the categorical cross entropy. Previous works in [1, 12, 13, 14] considered only this term for optimization at a constant training SNR. (Footnote 3: The SNR in this case is defined as .) The second term connects the signal power and noise power to the design. At a specified noise power per component, maximization of the above objective brings in the concept of using less power for signaling. Hence, the derived objective optimizes the signaling such that a tradeoff is achieved between minimizing the transmit power and maximizing the reconstruction likelihood. If we assume a constant training scenario, the second term becomes a constant and we recover the objective used in [1, 12, 13, 14].
III-E. KL Loss for Rayleigh Block Fading Channel
One of the most widely used models for capturing fading effects during signal transmission is Rayleigh Block Fading. Under the Rayleigh Block Fading (RBF) model, the corrupted signal can be modeled as
where and or equivalently 
where J is the matrix defined by with is the square zero matrix of dimension and the identity matrix of dimension . (Footnote 4: Note that while implementing in a DNN, we split complex z into real and imaginary components and stack them into a column vector of dimension . Hence is always even in the model.) If the only knowledge we have about the channel is that it can be well modeled by a distribution with finite variance, then the prior choice should reflect this information. In this context, a normal prior is the maximum entropy prior. Hence, taking a prior of , the KL Loss in (9) for this case can be computed as
Please refer to Appendix C for the detailed derivation. Depending on the representation used for symbols in the model, (22) can be combined with (14) (in case of one-hot representation) or with (15) (in case of binary representation) to get appropriate objective function for training the model in RBF channel.
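The real-valued form of the fading operation can be checked numerically. The sketch below assumes that J is the block matrix with blocks (0, -I) over (I, 0) and that real parts are stacked above imaginary parts; both conventions are assumptions for illustration, since the exact definition is not reproduced here. Under them, applying the real part of the coefficient times the identity plus the imaginary part times J to the stacked vector is the same as multiplying every complex symbol by the complex fading coefficient.

```python
def stack(complex_symbols):
    """Stack real parts, then imaginary parts, into one real vector."""
    return [c.real for c in complex_symbols] + [c.imag for c in complex_symbols]


def apply_J(v):
    """Multiply by the assumed J = [[0, -I], [I, 0]]: (re, im) -> (-im, re)."""
    half = len(v) // 2
    re, im = v[:half], v[half:]
    return [-x for x in im] + list(re)


def fade_real(v, h_re, h_im):
    """Real-valued form of h * z, computed as (h_re * I + h_im * J) v."""
    jv = apply_J(v)
    return [h_re * a + h_im * b for a, b in zip(v, jv)]
```

For instance, with `z = [1 + 2j, -0.5 + 0.25j]` and `h = 0.3 - 0.7j`, `fade_real(stack(z), h.real, h.imag)` agrees with `stack([h * c for c in z])` up to rounding.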
Considering one-hot encoding and removing the constant terms in the problem, the final ELBO objective (9) to maximize for training an end to end communication system in an RBF channel can be written as
This objective is slightly different from the AWGN objective (19) due to an additional term similar to capacity. Similar to the case of AWGN channel objective, we can see that at constant SNR condition, we recover the objective function used in [1, 12, 13, 14]. Interestingly, in the special case of , the new term in this objective (the third term in (23)) is equivalent to the AWGN channel capacity. Maximizing this objective optimizes the system to improve the channel capacity (third term) while minimizing the signaling energy (second term) and at the same time improving reconstruction loss (first term). This intuitively fits with the objective of communication systems - maximize the capacity while using minimum signaling power.
In this section, we presented an approach for end to end designing of communication systems based on the principles of variational inference and the recent developments in generative modeling with deep neural networks. We showed how any prior information about the channel in the form of channel parameters or functional form of the channel can be appropriately incorporated for designing the objective function through (9). By taking the cases of AWGN channel and RBF channel models, we demonstrated how the inclusion of the prior knowledge about channel can be utilized for designing the learning objective. In the complete absence of channel knowledge, Gaussian prior with high can serve as a non-informative prior.
With the inclusion of additional constraints like constant SNR during training, our method recovers the objective functions used in prior works [1, 12, 13, 14], which assume such constraints at the design phase. Previous works had to include an additional normalization layer at the transmitter output to control the power of the transmit symbols, which can otherwise become very high. This is because the objective functions used by the learning agents in those works provide no incentive for controlling the transmit power. In contrast, our proposed method yields objective functions which implicitly take care of transmit power control and hence eliminate the need for an additional normalization layer at the transmitter output.
Generalizing beyond AWGN and RBF channel models, the method we proposed in this section can be applied to additive non-Gaussian noise channels as well as other generalized fading channel models using suitable prior. In the scenarios where such additional knowledge of the channel is available, the KL-loss in (9) has to be computed with appropriate prior to obtain the objective function for training.
This section presents the results of the proposed method for training end to end communication systems and compares them with existing methods, both traditional and deep learning based. For the purpose of evaluation, we consider three cases:
2 bit block with one complex channel use (). This scheme is similar to the QPSK scheme which uses one constellation point in complex channel plane to represent 2 bits.
4 bit block with two complex channel uses .
8 bit block with four complex channel uses .
All the schemes are evaluated in both AWGN and RBF channel models. We compare the performance of trained models with traditional methods of QAM and Agrell sphere packing  and deep learning based method proposed in . For deep learning based methods, models are trained and the results are reported.
We use two metrics to compare the capabilities of the schemes.
Block Error Rate (BLER): The block error rate performance over a wide range of SNR of the schemes will show the usefulness of the schemes in delivering the information over the channel.
Packing Density: Another metric to compare the efficiency of multiple signaling methods is to compare the packing density of the transmit signals over the dimensions specified by the number of channel uses. Normalized second moment () of the transmit symbols z is defined as 
where is the square of the minimum Euclidean distance between transmit points. This metric is insensitive to scaling and hence useful for comparing packing densities. The smaller the value of , the better the packing density.
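A small sketch of such a packing metric is given below. It assumes the normalized second moment is the mean symbol energy divided by the squared minimum Euclidean distance between constellation points; this is an assumed reading of the definition, and the function name is hypothetical.

```python
import itertools


def normalized_second_moment(points):
    """Mean symbol energy divided by the squared minimum pairwise distance."""
    m = len(points)
    mean_energy = sum(sum(c * c for c in p) for p in points) / m
    d2_min = min(sum((a - b) ** 2 for a, b in zip(p, q))
                 for p, q in itertools.combinations(points, 2))
    return mean_energy / d2_min
```

Scaling every point by the same factor multiplies both the numerator and the denominator by the square of that factor, which is why the value does not change under scaling.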
Please refer to Appendix D for more details about the simulation setup and the training procedure.
IV-A. DNN Architecture
We consider a feedforward autoencoder architecture with three hidden dense layers for encoder network and three hidden dense layers for decoder for all the experiments and both the DL methods under comparison for fairness. The network architecture details are given in Table I.
|Layer Name||Size||Activation Function|
|Transmit Layer|| ||Linear for Proposed; Linear + BN for |
|Channel|| ||(16) for AWGN channel; (20) for RBF channel|
The selection of activation functions for the network layers impacts both the quality of the solution and the convergence properties of the model. Traditional activation functions such as sigmoid and tanh restrict the activations to the ranges (0, 1) and (-1, 1) respectively, with saturating effects near the boundaries. These saturation effects can hinder gradient propagation through the layers. Recent works applying deep learning to communication systems modeling advocate the use of advanced activation functions like Rectified Linear Units (ReLU) [1, 12], Exponential Linear Units (ELU) , etc. We use ReLU for activation at our hidden layers, linear activation at the output of the encoder network and a softmax layer for the output of the decoder network. The works in [1, 12, 13, 14] used a Batch Normalization (BN) layer at the output of the transmitter to control the power of the transmitted constellation. If this layer is not included, the model will try to transmit at uncontrollably high power to minimize the cross-entropy loss. However, the objective functions presented in this work, (19) and (23), include an additional term to minimize the transmit power. Hence, the deep learning model is incentivized to perform power control during learning and will control the constellation power according to the noise it observes and the reconstruction likelihood during training. We used and while training the proposed model. Adam optimizer  with learning rate , and is used for training all models and each model is trained for epochs. The models using  are trained at an SNR of .
IV-B. Evaluation in AWGN Channel
The proposed method is evaluated in AWGN channel model given by (16). In this case, the objective function to optimize is given in (19). However, in a practical scenario, we would like to train the model without any assumptions on the channel model. To cover this case, we provide results using the objective function developed assuming RBF channel, (23), also. The BLER performance of models is given in Fig. 4.
Agrell , being the optimized sphere packing scheme found using search, performs better in all cases. Note that, in the case of one complex channel use, the Agrell and QAM schemes are exactly the same. As the number of channel uses increases, the dimension of the sphere packing problem also increases; it can be clearly seen that the QAM scheme does not perform as well as the other methods in comparison, and the gap between the performance of the Agrell scheme and the QAM scheme widens with the number of channel uses.
In all the cases, we can see that the deep learning methods perform better than traditional QAM methods and very close to the optimized Agrell schemes. Even with deep learning models, the performance gap relative to the Agrell scheme widens as the dimension increases. The proposed method provides better performance than the scheme proposed in , which can be attributed to the improved cost function. Interestingly, both (19) and (23) provide equally good BLER performance in the AWGN channel.
The distribution of the surrogate metric for packing density given by (24) for the trained models is shown in Fig. 5. In the case of a single channel use (), traditional QAM and Agrell schemes are the optimal sphere packing schemes (with ) and the DL methods are able to reach close to the optimum. In lower dimensions the objective in (19) consistently produces models with better . However, in higher dimensions (Fig. 4(c)), the objective in (23) produced better models. The proposed objective function (23) produces models with better than traditional QAM in approximately of the instances, while the procedure in  managed to do so only of the time. From all the results, we can conclude that even though (23) is developed for the RBF channel, it can be used in the AWGN channel as well.
IV-C. Evaluation in RBF Channel
For verifying the performance of the methods in RBF channel model, the model described in (20) is used. We provide the results of optimizing the DNN using both the objectives, (19) and (23), in Fig. 6 and Fig. 7.
We need to use pilot symbols to obtain an estimate of the channel coefficient , and equalization is done prior to decoding as in [16, 14]. The estimate of obtained from pilot symbols affects decoding performance through noisy equalization. We used the same power per component as the constellation points to transmit the pilot symbol, so that both the pilot components and the symbol components in the block experience the same SNR during transmission. Boosted pilot symbols at higher power could also be used to improve decoding performance. However, our aim is to compare the performance across different constellation schemes, and hence we chose to use the same power per component for both pilot and data points.
Traditional QAM and Agrell schemes are not optimized for RBF channels. As the number of channel uses increases, we can see that the DL methods perform better than QAM and Agrell. The improvement in the case of DL methods can be attributed to the function approximation power of neural networks, which learn to neutralize the effects of noisy channel equalization. Surprisingly, in the RBF channel, the models trained with the objective derived for the AWGN model (19) give performance close to the models using (23). However, the difference made by using (23) is visible in the packing density of the learned models. At higher dimension (Fig. 6(c)), (23) consistently produces better models than (19). Although the method in  produces models with less variation at higher dimensions, in lower dimensions (Fig. 6(a) and Fig. 6(b)) it suffers from high variability. From all these results, we can conclude that using the objective (23) derived for the RBF channel model can be expected to produce the desired results consistently across different dimensions.
IV-D. Effect of
We used a prior of with variance per component during the derivation of objective functions (19) and (23). It can be easily seen from these objective functions that affects the weight given to the transmit symbol power term . When the prior variance is low, more weight is given to the transmit symbol power control term to reduce the transmit power, and vice versa. Hence, with a low prior variance we obtain constellations with low transmit power requirements, and the higher the assumed prior variance, the higher the power at which the model learns to transmit. However, a very low value for will optimize the transmit power so aggressively that the learned constellations have transmit power close to . This affects the decoding process and increases the BLER. In our experiments, we found that setting a higher value for satisfied the tradeoff between reconstruction likelihood and transmit power. We used as used in traditional Variational Autoencoders . Later, we also show how a lower value of can cause unintended model behavior as the number of bits transmitted per channel use increases.
IV-E. Recovering Gray Codes
In all the experiments discussed above, we used one-hot encoding to represent the symbols as done in previous works [1, 12, 13, 14] for comparability. In order to study the structure of constellations, we trained models in the AWGN channel for and using the method in . (Footnote 5: We chose for simplicity of visualization, as is difficult to visualize in 2D. Even though advanced methods like t-SNE can be used for high dimensional visualization and for analyzing clustering behavior as done in [1, 14], it is a projection onto a 2D plane and may not faithfully convey the placement of points in the high dimensional space that we are trying to analyze here.) Two sample constellations learned by the method are given in Fig. 8. It can be observed that the symbols are well arranged in concentric circles, maintaining sufficient distance between constellation points. This type of design is useful for optimizing the BLER of the system. However, as close-by symbols differ in multiple bit positions, this may not be the best design if the system requirement is to improve BER. This constellation characteristic is the effect of choosing one-hot encoding to represent symbols: with one-hot encoding, there is no incentive for the model to place symbols that differ in only one bit near each other.
However, by using the binary representation of the symbols and the reconstruction likelihood introduced in (15), the models will be able to learn the concept of nearby symbols, as the penalization forces all the bit positions to be correct. In this case, the objective function to train models in the AWGN channel can be obtained by combining (15) and (17) and can be written as
As the input layer dimension is now reduced from to , we used a small network: the encoder has hidden layers with and nodes, and the decoder has hidden layers with and nodes, followed by an output layer of nodes with sigmoid activation. Training is done for epochs, with other settings similar to those used in the previous experiments. Sample constellations learned by this model are given in Fig. 9.
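The difference between the two symbol representations can be made concrete with a short sketch. It assumes that the reconstruction term for binary labels behaves like a per-bit Bernoulli negative log-likelihood (binary cross-entropy), which matches a sigmoid output layer but is an assumption rather than a restatement of (15). The helper names are hypothetical.

```python
import math

def to_bits(index, k):
    """k-bit binary label of a symbol index, most significant bit first."""
    return [(index >> (k - 1 - i)) & 1 for i in range(k)]

def one_hot(index, k):
    """One-hot vector of length 2^k for a symbol index."""
    vec = [0] * (2 ** k)
    vec[index] = 1
    return vec

def bit_nll(bits, probs, eps=1e-12):
    """Negative Bernoulli log-likelihood summed over bit positions."""
    return -sum(b * math.log(max(p, eps)) + (1 - b) * math.log(max(1.0 - p, eps))
                for b, p in zip(bits, probs))

target = to_bits(0b0110, 4)
one_bit_wrong = [0.1, 0.9, 0.9, 0.9]    # decoder output that errs only in the last bit
three_bits_wrong = [0.9, 0.1, 0.1, 0.1]  # decoder output that errs in the first three bits
print(bit_nll(target, one_bit_wrong), bit_nll(target, three_bits_wrong))
```

With binary labels, a decoding error towards a symbol that differs in one bit costs less than an error towards a symbol that differs in three bits. Any two distinct one-hot vectors differ in exactly two positions, so that representation cannot express such a preference.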
From the constellations given in Fig. 9, it can be easily observed that both models learned the concept of Gray coding. The symbols are placed in the constellation in such a way that nearby symbols differ in only one bit. After training multiple models, we observed that constellations with a concentric circle structure, as in Fig. 8(a), are the most commonly learned, while the traditional grid-like structure, as in Fig. 8(b), occurs rarely. This suggests that the loss function we use has many local minima corresponding to concentric structures and very few corresponding to a grid-like structure.
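Whether a learned constellation follows a Gray labelling can also be checked numerically rather than by inspection. The sketch below is one possible check, not the procedure used for the figures: for every symbol it finds the closest constellation points (ties included) and reports the largest number of differing bits. The function names and the QPSK examples are illustrative.

```python
import math

def hamming(a, b):
    """Number of differing bits between two integer labels."""
    return bin(a ^ b).count("1")

def worst_neighbour_hamming(points, tol=1e-9):
    """Largest Hamming distance from each label to any of its nearest neighbours.

    `points` maps an integer label to an (I, Q) coordinate pair.
    """
    worst = {}
    for label, p in points.items():
        dists = {l: math.dist(p, q) for l, q in points.items() if l != label}
        d_min = min(dists.values())
        worst[label] = max(hamming(label, l) for l, d in dists.items() if d <= d_min + tol)
    return worst

def is_gray_coded(points):
    """True if every nearest neighbour differs from its symbol in exactly one bit."""
    return all(h == 1 for h in worst_neighbour_hamming(points).values())

gray_qpsk = {0b00: (1, 1), 0b01: (-1, 1), 0b11: (-1, -1), 0b10: (1, -1)}
natural_qpsk = {0: (1, 1), 1: (-1, 1), 2: (-1, -1), 3: (1, -1)}
print(is_gray_coded(gray_qpsk), is_gray_coded(natural_qpsk))
```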
The use of explicit batch normalization for constraining constellation energy in  results in one symbol being placed at the origin, as visible in Fig. 8. This may cause practical difficulties during transmission, as a symbol close to the origin is similar to no signal at all. As the method proposed in this work builds the constellation energy constraint into the objective function (25), this problem is not observed in the trained models (as seen in Fig. 9). Placing a symbol at the origin would result in a constellation whose center symbol differs in multiple bit positions from the symbols in the first concentric circle, incurring a higher reconstruction penalty under (15). Hence the models learn to avoid such a placement and instead place all symbols on concentric circles following the Gray coding scheme.
Interestingly, when the number of symbols is increased while keeping the , the model learns to cheat the system by placing two symbols that differ by only one bit on top of each other, thereby maintaining two concentric circles in the constellation but suffering a higher BLER. A sample constellation when the model is trained using is given in Fig. 9(a).
In Fig. 9(a), the model learns to place symbols differing in the second-last bit on top of each other (e.g., ). We used a value of for this experiment. This cheating behavior can be attributed to the symbol energy control term in objective (25). As discussed before, a low prior variance gives more importance to limiting the constellation transmit power, and hence the model learns to place symbols on top of each other while sacrificing reconstruction likelihood and hence BLER. By adjusting the value of , the model learns to spread out the symbols while maintaining the Gray coding scheme, as shown in Fig. 9(b).
This behavior is observed when more bits are squeezed into each channel use. It can be inferred that the prior variance acts as an honesty parameter: when the model is forced to pack more bits per channel use, this parameter needs a high value to avoid the cheating behavior. As both and are hyperparameters to be chosen during the system specification, this behavior can be easily handled by appropriately setting the prior variance at the design phase.
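The cheating behavior described above can be detected automatically after training by looking for constellation points that nearly coincide. The following is a minimal sketch of such a check. The function name, the separation threshold and the example coordinates are hypothetical and not taken from the experiments.

```python
import math
from itertools import combinations

def collapsed_pairs(points, min_separation):
    """Label pairs whose constellation points lie closer than `min_separation`.

    `points` maps an integer label to an (I, Q) coordinate pair.
    """
    return [(a, b)
            for (a, pa), (b, pb) in combinations(sorted(points.items()), 2)
            if math.dist(pa, pb) < min_separation]

# Hypothetical learned constellation in which labels 000 and 010 have collapsed.
learned = {0b000: (1.0, 0.0), 0b010: (1.0, 0.001), 0b001: (0.0, 1.0), 0b011: (-1.0, 0.0)}
print(collapsed_pairs(learned, min_separation=0.05))
```

A non-empty result indicates that the prior variance should be raised and the model retrained.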
The main aim of this paper is to put forward the idea of using the principles of variational inference for end to end communication system design while leveraging the function approximation capabilities of deep neural networks to learn the optimal communication schemes. We used a model-based simulation to verify that the proposed method can result in improved system designs. By modeling the channel behavior properly and using desktop-class computing power, we were able to train the model in a matter of minutes. In this section, we discuss how a system can be trained in a real environment, without knowledge of the channel model, and how the learned model can be efficiently implemented on devices.
V-A Training models in real channel
Since we assumed knowledge of the channel and used a model-based simulation, we were able to train the system with actual gradients. However, in a real system, the channel impairments act as an unknown layer in the network, and hence backpropagation of gradients from receiver to transmitter is not possible using the traditional optimization techniques of the deep learning community. Specific to the wireless communication domain, a few practical techniques have been developed to mitigate this problem, and some of them are discussed below. The optimization objectives in the works described below can be replaced with the objective function given in (13), and any of the following techniques can then be used for model-free training with no further changes.
One of the solutions, proposed in , to avoid the problem of unavailable gradients at the transmitter is to first train the transmitter-receiver pair in a model based simulation and later fine-tune the receiver alone in the real channel. With sufficient training, this technique can provide an optimized receiver design with a near-optimized transmitter design.
Another solution, proposed in , is to first approximate the observed channel behavior with a tractable model and then use this model to provide a backpropagation pathway for loss gradients from receiver to transmitter. The problem of approximating the observed channel is formulated as a GAN learning problem . By iteratively training the GAN channel approximation and the receiver, and using the GAN as a channel surrogate to optimize the transmitter, an end to end system design can be achieved. A similar solution is also proposed in .
Rather than using actual gradient information,  proposed to use a stochastic perturbation technique based on simultaneous perturbation to compute approximate gradients of the transmitter network parameters. By using conventional stochastic gradient descent variants to optimize the receiver and the perturbation-based method to optimize the transmitter, it is shown that end to end system design can be achieved in a completely model-free way. The advantage of gradient-free optimization comes at the cost of a decreased convergence rate.
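The idea of simultaneous perturbation can be sketched in a few lines. The code below is a generic illustration of simultaneous perturbation stochastic approximation (SPSA) and not the exact procedure of the cited work. It uses constant step sizes for brevity, whereas SPSA is usually run with decaying gain sequences, and the quadratic `toy_loss` is a hypothetical stand-in for the loss measured through the real channel.

```python
import random

def spsa_gradient(loss, theta, c, rng):
    """One SPSA gradient estimate from two loss evaluations."""
    # Perturb all parameters at once with random +/-1 signs.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = (loss(plus) - loss(minus)) / (2.0 * c)
    return [diff / d for d in delta]

def spsa_minimize(loss, theta, steps=500, lr=0.05, c=0.1, seed=0):
    """Gradient descent driven only by loss evaluations (no backpropagation)."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = spsa_gradient(loss, theta, c, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def toy_loss(theta):
    """Hypothetical black-box loss with its minimum at (1, -2)."""
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

theta_hat = spsa_minimize(toy_loss, [0.0, 0.0])
print(theta_hat, toy_loss(theta_hat))
```

Each estimate needs only two loss evaluations regardless of the number of parameters. This is what makes the approach usable when the channel cannot be differentiated, at the cost of noisier updates.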
By perturbing (sampling) only the transmitter output,  showed that an end to end system can be trained without any knowledge of the channel. By relaxing the channel input to a random variable, the proposed method derives a surrogate gradient function for the transmitter parameters without knowledge of the channel model. The proposed reinforcement learning based method is able to train models with performance competitive with other model-aware solutions. The solution, however, comes with a high variance in performance during the training phase. A Gaussian distribution with constant non-zero variance is used to sample the transmitter outputs.
is the standard deviation of the Gaussian perturbation applied at the transmitter output. We can observe that the BLER performance (Fig. 10(a)) of models trained using  is almost the same as that of the models which require channel knowledge (trained using Adam). However, in the case of packing density (Fig. 10(b)), the models trained with  are slightly worse than the models trained using Adam. This could be explained by the added perturbation also acting as noise to the model. The only price we pay when training with  is the slower convergence of the models.
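The surrogate gradient obtained by perturbing the transmitter output can be illustrated with a score-function estimator. This is a generic sketch under stated assumptions and not the cited method itself. The transmitter output `x` is perturbed with Gaussian noise of standard deviation `sigma`, the black-box `loss` stands in for the channel and receiver, and a batch-mean baseline is subtracted to reduce variance, which is an addition made for this illustration.

```python
import random

def perturbation_gradient(loss, x, sigma, num_samples, rng):
    """Score-function estimate of d/dx E[ loss(x + sigma * w) ], with w ~ N(0, I)."""
    noises, losses = [], []
    for _ in range(num_samples):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        noises.append(w)
        losses.append(loss([xi + sigma * wi for xi, wi in zip(x, w)]))
    baseline = sum(losses) / num_samples  # variance-reduction baseline (assumption of this sketch)
    grad = [0.0] * len(x)
    for w, l in zip(noises, losses):
        for i, wi in enumerate(w):
            # d/dx log N(x_p; x, sigma^2 I) = (x_p - x) / sigma^2 = w / sigma
            grad[i] += (l - baseline) * wi / sigma
    return [g / num_samples for g in grad]

def toy_loss(y):
    """Hypothetical black-box loss; its true gradient at y is 2 * y."""
    return y[0] ** 2 + y[1] ** 2

grad = perturbation_gradient(toy_loss, [1.0, -2.0], sigma=0.1,
                             num_samples=20000, rng=random.Random(0))
print(grad)  # should be close to the true gradient [2, -4]
```

The estimate is unbiased up to the baseline correction but noisy, which is consistent with the high training variance and slow convergence noted above.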
V-B Deploying models in real devices
Previously, deep learning based solutions were limited by the compute capabilities of the devices. The works in [12, 14] already showed that implementing an end to end communication system trained using deep learning is possible with Software Defined Radios (SDRs). Even though comparatively high compute power is required during the training phase, the compute requirements at the inference phase are small. This is because training requires backpropagation, whereas inference requires only a forward pass.
Recent developments in the semiconductor industry, and in GPU technology in particular, have resulted in the availability of extremely power-efficient compute devices. The Google Coral lineup of devices , the Qualcomm SoC lineup with NPE (Neural Processing Engine) , and Mythic's IPU (Intelligent Processing Unit)  are a few examples of solutions aiming to bring the power of deep learning to small devices. The development of cheap and energy-efficient compute devices will enable future communication devices to have sufficient hardware capabilities to carry out inference on the device itself with high throughput.
VI Concluding Remarks
In this work, we presented a method to perform end to end modeling of communication systems from a variational inference perspective. We showed how to incorporate the prior information about the channel model into the objective function for training models with the special cases of AWGN and RBF channels. However, the method we presented in this paper can be applied in additive non-Gaussian noise channels as well as generalized fading models with appropriate prior distributions. Leveraging the generative modeling capabilities of autoencoders, we showed how to train the models using deep learning. Through extensive simulation, we showed that the proposed solution is able to produce models with competitive BLER and better packing density consistently when compared with previous works. We also showed how the objective function can be modified to teach the model the concept of Gray coding.
Appendix A Derivation of log-likelihood of data
This derivation is based on . Noting that is constant with respect to , we have
Appendix B Derivation of objective function for AWGN channel
The KL-divergence between two normal distributions with is given by
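For reference, the standard closed form for two univariate normal distributions is given below. The symbols $\mu_1, \sigma_1, \mu_2, \sigma_2$ are chosen for illustration and may differ from the notation used elsewhere in this paper.

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right)
  = \log\frac{\sigma_2}{\sigma_1}
  + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2}
  - \frac{1}{2}
```

For distributions with independent components, the total divergence is the sum of this expression over the components.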
For AWGN model, we have and . Hence the KL Loss term in (9) can be computed as
Appendix C Derivation of objective function for RBF channel
Noting that , and , we have