I Introduction
Communication theory addresses the problem of reliably transmitting information and data from one point to another. A communication system itself can be abstracted into three blocks: (1) an encoder at the transmitter side, which takes a message to transmit and encodes it into a codeword; (2) a noisy channel, which transforms the transmitted codeword in a certain way; and (3) a decoder at the receiver side, which estimates the transmitted message based on the noisy channel output. The channel is usually fixed and accounts, for example, for signal impairments and imperfections in real-life scenarios such as wireless communication. The communication channel of interest in this work is the additive white Gaussian noise (AWGN) channel. The aim is now to develop appropriate encoding and decoding schemes to combat the impairments of this channel. This usually involves adding redundancy and introducing sophisticated techniques which use the available communication dimensions (e.g., frequency and time) in an optimal way. Information theory studies the fundamental performance limits of reliable communication, and its main achievement is the characterization of the maximum possible transmission rate: the so-called capacity
[1]. However, such proofs usually rely on so-called random coding arguments, which only show the existence of suitable capacity-achieving encoders and decoders, but not how to actually construct them; cf. for example [2]. In fact, constructing capacity-achieving coding schemes is a highly nontrivial task even for very simple communication scenarios. It is therefore natural to search for solutions which can simplify this process, for example using deep learning. Recent developments have shown that a neural network (NN) can simultaneously learn encoding and decoding functions by implementing the communication chain as an autoencoder with a noise layer in the middle, see for example [3] and references therein. Surprisingly, [4] showed that those learned encoding-decoding systems come close to practical baseline techniques. The appeal of this idea is that complex encoding and decoding functions can be learned without extensive communication-theoretic analysis. It also enables the system to cope with new channel scenarios on the fly.
In this paper, we are interested not only in learning the encoding and decoding to account for reliable communication, but also in exploring the possibility to learn how the communication can be secured at the physical layer. To this end, we are interested in physical layer security approaches, which establish a secure transmission by utilizing the intrinsic channel properties and by employing information-theoretic principles as security measures. In particular, these approaches result in secrecy techniques which are independent of the computational power of unknown adversaries, in contrast to prevalent computational-complexity or cryptographic approaches, e.g. [5] and [6]. This yields a stricter notion of security, with the drawback that these schemes need to be designed for specific communication scenarios. The simplest scenario which involves reliable transmission and secrecy is the wiretap channel [7]. This refers to a three-node network with a legitimate transmitter-receiver link and an external eavesdropper which must be kept ignorant of the transmission. It has been shown that specific encoding and decoding techniques exploit an inherent asymmetry (of the additive noise) of the legitimate receiver and the eavesdropper to achieve physical layer secure communication. The secrecy capacity of the wiretap channel, i.e., the maximum transmission rate at which both reliability and secrecy are satisfied, is known. However, constructing suitable encoders and decoders which achieve the secrecy capacity remains a nontrivial challenge. There are several approaches to this problem and most of them are based on polar, LDPC or lattice codes, see [8]
. However, those techniques are not practical and only work for highly specialized cases/channels. Our motivation is therefore that a NN code design provides a way for on-the-fly code design, which is practical for any channel. The question at hand is now how to exploit and modify the autoencoder concept to also include physical layer secrecy measures, in order to obtain coding schemes for physical layer secure communication scenarios. In this paper, we demonstrate that this is indeed possible by creating a training environment where two NN decoders compete against each other. For that we define a novel security loss function, which is based on the cross-entropy. We then show the resulting constellations and the probability of error over an SNR range before and after secure coding.
Related work
The work of [3] introduced the idea of using an autoencoder NN to model the communication scenario. The main drawback of this method is that the channel model needs to be differentiable, which can become a problem with real-world channels. However, it was shown in [4] that the learned encoding and decoding rules provide a system which comes close to classical schemes and performs well when used on actual over-the-air transmissions. Moreover, [9]
showed that the training can be done without a mathematical channel model by including reinforcement learning. This shows that end-to-end learning of communication systems can be a viable technology. This also holds for fiber communication, as shown in
[10] and in [11], which utilized the autoencoder concept. Moreover, the concept can be used to learn advanced communication schemes such as orthogonal frequency division multiplexing (OFDM), which enables reliable transmission in multipath channel scenarios, as shown in [12]. The idea of using two competing NNs in a specific context is not new. One of the first works was [13] on the principle of predictability minimization. This principle is as follows: for each unit inside a NN there exists an adaptive predictor which predicts the unit based on all other units. The units are then trained to minimize this predictability, thereby enforcing independence between the units. Another popular instance of competing NNs are generative adversarial networks (GANs), as introduced in [14]. There, the two NNs consist of a generative model and a discriminative model, with the latter predicting the probability that a sample came from the former. The generative model is trained to maximize the error probability of the discriminative model, which introduces an adversarial process. Another recent work is [15], where a key is provided to Alice and Bob and the NN learns to use the key on the communication link in such a way that Eve cannot decipher the message (since she has no key). It is therefore a neural cryptography setting and differs from our approach, as our network learns to encode a message for direct transmission such that Eve cannot decode it.
Notation
We stick to the convention of upper case letters X for random variables and lower case letters x for realizations, i.e., p_X(x) = Pr[X = x], where p_X is the probability mass function of X. Moreover, we use upper case bold script for random vectors and constant matrices, while lower case bold script is used for constant vectors. We also use |S| to denote the cardinality of a set S.

II Physical layer security in wiretap channels
The wiretap channel is a threenode network in which a sender (Alice) transmits confidential information to a legitimate receiver (Bob) while keeping an external eavesdropper (Eve) ignorant of it. This setup can be seen as the simplest communication scenario that involves both tasks of reliable transmission and secrecy. Accordingly, this is the crucial building block of secure communication to be understood for more complex communication scenarios.
In this paper, we study the degraded Gaussian wiretap channel as depicted in Fig. 1. The legitimate channel between Alice and Bob is given by an additive white Gaussian noise (AWGN) channel as
Y_i = X_i + N_{B,i}   (1)
where Y_i is the received channel output at Bob, X_i is the channel input of Alice, and N_{B,i} ~ N(0, σ_B²) is the additive white Gaussian noise at Bob at time instant i. The eavesdropper channel to Eve is then given by
Z_i = Y_i + N_{E,i}   (2)
where Z_i is the received channel output at Eve and N_{E,i} ~ N(0, σ_E²) is the additive white Gaussian noise at Eve. This defines a degraded wiretap channel for which the eavesdropper channel output Z is strictly worse than the legitimate channel output Y.^1 (^1 Note that any Gaussian wiretap channel of the general form Y_i = h_B X_i + N_{B,i} and Z_i = h_E X_i + N_{E,i} with additive noises and multiplicative channel gains can be transformed into an equivalent degraded wiretap channel as in (1)-(2). This means that any Gaussian wiretap channel is inherently degraded, cf. for example [16, Sec. 5.1].)
The communication task is now as follows: to transmit a message m from the message set M, Alice encodes it into a codeword x^n = (x_1, ..., x_n) of block length n, where x_i ∈ X.^2 (^2 Usually, X = R in the Gaussian setting.) Moreover, we assume an average transmission power constraint (1/n) Σ_{i=1}^n x_i² ≤ P. At the receiver side, Bob obtains an estimate m̂ of the transmitted message by decoding his received channel output y^n. The transmission rate is then defined as R = (1/n) log|M|.
The secrecy of the transmitted message is ensured and measured by information-theoretic concepts. There are different criteria of information-theoretic secrecy, including weak secrecy [7] and strong secrecy [17]. In the end, all criteria have in common that the output at the eavesdropper should become statistically independent of the transmitted message, implying that no confidential information is leaked to the eavesdropper. For example, strong secrecy is defined as
lim_{n→∞} I(M; Z^n) = 0   (3)
with I(M; Z^n) the mutual information between the message M and Eve's channel output Z^n, cf. for example [2].
The secrecy capacity C_S now characterizes the maximum transmission rate at which Bob can reliably decode the transmitted message while Eve is kept in the dark, i.e., the secrecy criterion (3) is satisfied while achieving a vanishing error probability P(M̂ ≠ M) → 0 as n → ∞. The secrecy capacity of the Gaussian wiretap channel is known [16, 18] and is given by
C_S = C_B − C_E   (4)
    = (1/2) log_2(1 + P/σ_B²) − (1/2) log_2(1 + P/(σ_B² + σ_E²))   (5)
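To make the capacity difference in (4)-(5) concrete, the following sketch computes it numerically. The function names and the example noise variances are illustrative choices, not values from the paper.

```python
import math

def awgn_capacity(snr):
    """Capacity of a real AWGN channel in bits per channel use: (1/2) log2(1 + SNR)."""
    return 0.5 * math.log2(1 + snr)

def secrecy_capacity(p, sigma_b2, sigma_e2):
    """Secrecy capacity of the degraded Gaussian wiretap channel as in (4)-(5):
    Bob's capacity minus Eve's capacity, clipped at zero."""
    c_bob = awgn_capacity(p / sigma_b2)                 # legitimate link
    c_eve = awgn_capacity(p / (sigma_b2 + sigma_e2))    # degraded eavesdropper link
    return max(c_bob - c_eve, 0.0)
```

Note that the secrecy capacity grows with Eve's additional noise variance σ_E² and vanishes as σ_E² → 0, matching the intuition that secrecy comes from the channel asymmetry.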
Mutual information vs. cross-entropy
A straightforward approach to optimize a NN based on information-theoretic criteria would be an optimization based on the mutual information as in (3) and (4). In the security context, this would mean optimizing the encoder and decoder mappings to maximize the mutual information between the message and Bob's channel output, while minimizing the mutual information I(M; Z^n) leaked to Eve. However, estimating the mutual information from data samples is a nontrivial challenge due to the unknown underlying distribution. One approach is, for example, variational information maximization, introduced in [19], which was recently applied to GANs [20]. This technique computes a tractable lower bound to maximize the mutual information between two distributions. However, in our case we would need a technique to simultaneously upper and lower bound two connected mutual information terms. To circumvent this challenge, we adopt a secrecy criterion based on the cross-entropy, on which we elaborate further in the next section.
III Neural network implementation
III-A General model
As in the reference work [3], we implemented the communication scenario using an autoencoder-like network. An autoencoder is usually a NN which aims to copy the input of the network to its output. It consists of an encoder, which maps the input to some codewords, and a decoder, which aims to estimate the input from the codewords, see [21]. It therefore fits the communication problem naturally. Usually, these autoencoders are restricted in a certain way, for example such that the encoding function performs a dimensionality reduction. That way, the autoencoder can learn useful properties of the data which are needed for reconstruction. This is in contrast to the communication scenario, where the encoder aims to introduce redundancy, i.e., increase the dimensionality. Moreover, there is a noise function in between encoder and decoder. The NN therefore learns to represent the input in a higher-dimensional space to combat the noise corruption, such that the decoder can still estimate the correct output.
The general structure of our NN can be seen in Fig. 2. There, the message m gets one-hot encoded into the binary vector 1_m of length M, which can be viewed as a probability distribution indicating which message was sent, and is fed into the NN. The encoder comprises two fully connected/dense layers, where the first layer maps the input to M dimensions with a ReLU activation function, while the second layer maps its input to n dimensions with no activation function.^3 Here, n represents the codeword dimension and can be thought of as the number of time instants or channel uses. The last layer of the encoder is a normalization which imposes a unit power constraint on the codewords, either the classical average power constraint over the codeword dimension, i.e., (1/n) Σ_{i=1}^n x_i² ≤ 1, or a batch average power constraint, where the average is taken over the batch and the codeword dimension. Note that the normalization actually enforces that the resulting codewords have exactly the power specified, turning the inequalities into equalities. For the classical average power constraint, this yields a constellation where all points lie on a circle of radius √n. The channel itself is realized as in Section II, where we scale the noise variance according to the SNR (signal-to-noise ratio). Moreover, we have two receiver blocks, which are constructed identically. Each one has two fully connected layers, where the first one maps the channel output back to M dimensions with a ReLU activation function and the second one maps from M to M dimensions with a linear activation function. As the last step, we use the softmax function^4 and the cross-entropy as loss function. The softmax gives a probability distribution p̂ over all messages, which is fed into the cross-entropy

L(p, p̂) = − Σ_{i=1}^{M} p_i log p̂_i   (6)

as a loss function, which we then average over the batch sample size. This can be seen as a maximum likelihood estimation of the sent signal, see for example [21, Chap. 5]. The index of the element of p̂ with the highest probability is the decoded symbol m̂. The same loss function is applied to the receiver model of the wiretapper, i.e., Eve. However, training for security needs a different loss function.

^3 The first layer is given by f_1(x) = max(0, W_1 x + b_1) and the second layer by f_2(x) = W_2 x + b_2, where W_1, W_2 and b_1, b_2 represent the weight matrices and the bias vectors, respectively.
^4 The softmax function is defined as softmax(x)_i = e^{x_i} / Σ_j e^{x_j}.
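The forward pass described above can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions: randomly initialized weights stand in for trained parameters, the message-set size M = 16, codeword dimension n = 2, and the SNR-to-variance mapping are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 16, 2                      # hypothetical message-set size and codeword dimension
snr_db = 10.0
sigma2 = 10 ** (-snr_db / 10)     # assumed mapping from SNR (dB) to noise variance

# Randomly initialized weights stand in for trained parameters.
W1, b1 = rng.normal(size=(M, M)), np.full(M, 0.1)
W2, b2 = rng.normal(size=(M, n)), np.zeros(n)
W3, b3 = rng.normal(size=(n, M)), np.full(M, 0.1)
W4, b4 = rng.normal(size=(M, M)), np.zeros(M)

def encoder(onehot):
    h = np.maximum(onehot @ W1 + b1, 0.0)          # dense + ReLU
    x = h @ W2 + b2                                # dense, linear activation
    # normalization layer: enforce exactly unit average power per symbol,
    # i.e. (1/n) * ||x||^2 = 1, so all points lie on a circle of radius sqrt(n)
    return x / np.linalg.norm(x, axis=-1, keepdims=True) * np.sqrt(n)

def decoder(y):
    h = np.maximum(y @ W3 + b3, 0.0)               # dense + ReLU
    logits = h @ W4 + b4                           # dense, linear activation
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)       # softmax over the M messages

msgs = np.eye(M)                  # a batch containing every one-hot message once
x = encoder(msgs)                 # codewords, shape (M, n)
y = x + rng.normal(scale=np.sqrt(sigma2), size=x.shape)   # AWGN channel layer
p_hat = decoder(y)                # estimated message distributions
m_hat = p_hat.argmax(axis=-1)     # decoded symbol = most probable message
```

In a real training setup the weights would of course be optimized with the cross-entropy loss (6) rather than left random; the sketch only illustrates the data flow of Fig. 2.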
III-B The security loss function
Recall that information-theoretic security results in a difference of mutual information terms between the two links. This is in general hard to compute and accordingly difficult to use for NN optimization. We therefore focus on the cross-entropy and establish another way to define the security loss. A straightforward method would be to define the security loss as the difference between the cross-entropy losses of Bob's and Eve's receivers. The secure loss would therefore be
L_S = L(p, p̂_Bob) − L(p, p̂_Eve)   (7)
Here, p̂_Bob and p̂_Eve are the probability mass functions resulting from the softmax outputs in Bob's and Eve's decoders. Training the network with this loss would train the encoder (Alice) such that, for each symbol, Eve sees that symbol in some other random location. These locations depend only on the initial weight configuration of Eve's network (and are therefore highly dependent on the adversary), as the subtracted cross-entropy of Eve jumps to the next highest value and trains for it indefinitely. Moreover, the cross-entropy is unbounded from above, resulting in a logarithmically growing negative loss. This shows that the loss function above is fundamentally ill-suited for the problem. We therefore decided to mimic traditional wiretap-coding techniques. Theoretically, the usual approach is to map a specific message to a bin of codewords and then select a codeword randomly from this bin; intuitively, the randomness is used to confuse the eavesdropper. A more concrete method is to use coset codes, where the actual messages label the cosets and the particular codeword inside the coset is again chosen at random. This method goes back to the work of [7], and we refer the reader to [22, Appendix A] for an introduction. The idea is that Eve can only distinguish between clusters of codewords, whereas the messages themselves are hidden randomly in each cluster. The legitimate receiver, however, has a better channel and can also distinguish between codewords inside the clusters. This enables secure communication by trading a part of the communication rate for security.
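The classical coset-coding idea just described can be sketched as follows. The codebook size (16 codewords, 4 cosets) is a hypothetical example chosen for illustration.

```python
import random

# Hypothetical coset codebook: 16 codeword indices split into 4 cosets of
# 4 codewords each. The coset label carries the 2-bit secret message; the
# position inside the coset is drawn uniformly at random to confuse Eve.
COSETS = {m: list(range(4 * m, 4 * m + 4)) for m in range(4)}

def coset_encode(message):
    """Map a message to a uniformly random codeword of its coset."""
    return random.choice(COSETS[message])

def coset_decode(codeword_index):
    """Bob recovers the message as the coset label of the codeword."""
    return codeword_index // 4
```

Bob, with his better channel, resolves the exact codeword and hence the coset label; Eve, who can only resolve clusters of codewords, sees a uniformly random element and learns nothing about which label within the structure was meant.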
Our approach is now to train the network such that Eve sees the same probability for all codewords in a certain cluster. The security loss function is the sum of the cross-entropy of Bob and the cross-entropy of Eve's estimated distribution with respect to a modified input distribution. The input distribution is modified in such a way that all codewords of a cluster have the same probability. Consider for example the training vector batch (1, 2, 3, 4), resulting in the one-hot encoding matrix
B = [ 1 0 0 0
      0 1 0 0
      0 0 1 0
      0 0 0 1 ]   (8)
where the rows are the samples of the batch and the columns indicate the symbol.
Multiplying this one-hot input matrix from the right with the matrix T, see Algorithm 1, results in the equalized matrix B̃:
B̃ = B T = [ 1/2 1/2  0   0
            1/2 1/2  0   0
             0   0  1/2 1/2
             0   0  1/2 1/2 ]   (9)
One can see that in the resulting matrix B̃, the first and second symbol, and likewise the third and fourth symbol, have the same distribution. The advantage of this method is that we only need to calculate one M × M matrix, which can then be used with any input batch size. The new security loss can therefore be defined as
L_S = (1 − α) L(p, p̂_Bob) + α L(p̃, p̂_Eve)   (10)
where p̂_Eve is Eve's decoded distribution and p̃ stands for the equalized input symbol distribution. Furthermore, the parameter α controls the trade-off between security and communication rate on the legitimate channel. The loss function is then averaged over the batch size. Moreover, we chose the k-means algorithm for the clustering of the constellation points. This provides a clustering based on the Euclidean distance and fits the initial idea of coset codes when they are implemented as a lattice code with a nearest-neighbor decoder.
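The equalization matrix and the resulting loss can be sketched in NumPy. This is a hedged reading of the construction above, with an illustrative M = 4 and a hand-picked clustering (in the paper the clusters come from k-means on the learned constellation); the exact weighting in (10) is an assumption.

```python
import numpy as np

M = 4
# Hypothetical clustering of the M = 4 symbols: {0, 1} form one cluster
# and {2, 3} the other.
cluster_of = np.array([0, 0, 1, 1])

# Equalization matrix T: T[i, j] = 1/|C| if symbols i and j share a
# cluster C, else 0. Right-multiplying a one-hot batch by T spreads each
# label uniformly over its cluster.
T = np.zeros((M, M))
for i in range(M):
    members = np.flatnonzero(cluster_of == cluster_of[i])
    T[i, members] = 1.0 / len(members)

B = np.eye(M)        # one-hot batch containing each symbol once
B_eq = B @ T         # equalized labels: uniform inside each cluster

def cross_entropy(p, q, eps=1e-12):
    """Batch-averaged cross-entropy H(p, q)."""
    return float(-np.mean(np.sum(p * np.log(q + eps), axis=-1)))

def security_loss(p, p_hat_bob, p_hat_eve, alpha=0.5):
    # One hedged reading of the loss in (10): Bob's branch is trained on
    # the true labels, Eve's branch on the cluster-equalized labels, and
    # alpha trades secrecy against the legitimate rate.
    return (1 - alpha) * cross_entropy(p, p_hat_bob) \
        + alpha * cross_entropy(p @ T, p_hat_eve)
```

Note that T only depends on the clustering, so it is computed once per training phase and reused for every batch, as remarked above.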
IV Training phases and simulation results
All of the simulations were done in TensorFlow with the Adam optimizer [23] and a gradually decreasing learning rate. Moreover, we steadily increased the batch size over the epochs. The training was done with a channel layer on the direct link and an additional, noisier channel layer on Eve's link. All of the following figures for constellations and decision regions were produced with codeword dimension n = 2, such that they can be easily plotted in 2-D. Our training procedure is divided into three phases. In the first phase, we only train the encoder and the decoder of Bob with the standard cross-entropy, as done in [3]. Here, the NN learns a signal constellation that combats the noise in an efficient way, dependent on the power constraint and the signal-to-noise ratio. The resulting learned constellations (encoding rules) and learned decision regions (decoding rules) are shown in Figure 4 for an average power constraint over the whole batch and for an average power constraint per symbol.
In the second phase, we freeze the encoding layers and train Eve on decoding the previously learned encoding scheme with her cross-entropy loss and the normal input distribution. The reason behind freezing the encoder is that we assume that in real-life situations an attacker cannot influence the encoding of the signals. We therefore have an unchanged constellation, but Eve's NN learns decision regions for her channel model, i.e., with the additional noise term on her link.
In the third phase, we freeze the decoding layers of Eve and train the NN with the security loss function (10). This time the freezing is done because we cannot assume that a communication link has access to the decoding function of an attacker. For the equalization of the input clusters, we feed the network with an M × M identity matrix and calculate the clusters of the constellation points with the k-means algorithm. Figure 5 shows the results of the clustering algorithm on the learned constellations. We then use Algorithm 3 to calculate T and create the equalized batch label matrix. The resulting training effect is that the NN tries to pull codewords from the same cluster together, close enough that Eve cannot distinguish the symbols in a cluster, yet loose enough that Bob can still decode them. Figure 6 shows the learned secure encoding schemes. After the secure training phase, we train Bob's decoder and, in the last step, Eve's decoder once again. This is to make sure that Eve's neural network is able to train on decoding the newly encoded signals.
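The three-phase schedule with its alternating freezing can be summarized in a framework-agnostic sketch. All names here (parameter groups, phase labels, loss tags) are illustrative placeholders, not identifiers from the paper's implementation.

```python
# A hedged sketch of the three-phase training schedule described above:
# each phase lists which parameter groups receive gradient updates; all
# other groups are frozen during that phase.
params = {
    "encoder": ["enc_W1", "enc_W2"],   # Alice's two dense layers
    "bob":     ["bob_W1", "bob_W2"],   # Bob's decoder
    "eve":     ["eve_W1", "eve_W2"],   # Eve's decoder
}

phases = [
    ("reliability",   ["encoder", "bob"], "cross_entropy_bob"),   # phase 1
    ("eve_adapts",    ["eve"],            "cross_entropy_eve"),   # phase 2: encoder frozen
    ("secure_coding", ["encoder", "bob"], "security_loss_eq_10"), # phase 3: Eve frozen
]

def trainable_params(phase):
    """Return the flat list of weights updated during a phase."""
    _, groups, _ = phase
    return [w for g in groups for w in params[g]]
```

In a framework such as Keras, the same effect is typically achieved by toggling the layers' trainable flags between the compile steps of the phases.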
The training for the symbols gets more accurate and faster with increasing codeword dimension n, which suggests that the NN can find better constellations. The drawback is that the actual communication rate drops, since the system needs more time instants to transmit a bit. We have therefore taken a conservative approach with a moderate codeword dimension; we do not provide figures for these constellations due to the higher dimensionality. Moreover, we implemented a coset coding algorithm. For that we use a modified k-means algorithm which gives equal cluster sizes for our 16 constellation points. In our case, we use 4 clusters, so each cluster consists of 4 symbols. The symbols in each cluster get a designation as secure encoded symbol 1 to 4. We therefore have 4 clusters, and each cluster has 4 constellation points, one for each symbol. The actual secure transmission consists of only 4 symbols, i.e., two bits. The specific cluster is chosen at random. This randomness increases Eve's confusion about the transmission. Intuitively, we sacrifice a part of the transmission rate, i.e., we only use 4 instead of 16 symbols, to accommodate randomness that confuses Eve. Again we refer to [22, Appendix A], which discusses Eve's error probability and shows that coset coding does increase the confusion at the eavesdropper. The NN therefore learns a constellation which can be seen as a finite lattice-like structure, on which one can implement the idea of coset coding. For the actual simulation, we used a fixed direct SNR and an additional SNR offset on the adversary link during the training phase. We then evaluated the symbol error rate for decoding the symbols before the third training phase, i.e., before secure coding, and after all the training, which results in Figure 7. We note that both curves were tested with the same total sample size, and the testing was also done with the additional SNR offset on the adversary link. Moreover, we used a fixed factor α in the security loss function. One can see in the results that the NN learns a trade-off between communication rate and security.

V Conclusions and outlook
We have shown that the recently developed end-to-end learning of communication systems via autoencoder neural networks can be extended towards secure communication scenarios. For that we introduced a modified loss function which results in a trade-off between security and communication rate. In particular, we have shown that the neural network learns a clustering which resembles a finite constellation/lattice and which can be used for coset encoding, as demonstrated. This opens up the ongoing research on end-to-end learning of communication systems to the field of secure communication, as classical secure coding schemes can be learned and applied with a neural network. We think that our approach via the new loss function is a fruitful direction in that regard. However, an optimal way would be to tackle the problem by direct optimization of the mutual information terms, which remains a challenging research problem.
References
 [1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.
 [2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley & Sons, 2006.
 [3] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec 2017.
 [4] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep learning based communication over the air,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, Feb 2018.
 [5] W. Diffie and M. Hellman, “New directions in cryptography,” IEEE Transactions on Information Theory, vol. 22, no. 6, pp. 644–654, 1976.
 [6] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
 [7] A. D. Wyner, “The Wire-Tap Channel,” Bell System Technical Journal, vol. 54, no. 8, pp. 1355–1387, 1975.
 [8] Y. Wu, A. Khisti, C. Xiao, G. Caire, K. Wong, and X. Gao, “A survey of physical layer security techniques for 5G wireless networks and challenges ahead,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 4, pp. 679–695, April 2018.
 [9] F. A. Aoudia and J. Hoydis, “End-to-end learning of communications systems without a channel model,” CoRR, vol. abs/1804.02276, 2018.
 [10] S. Li, C. Häger, N. Garcia, and H. Wymeersch, “Achievable information rates for nonlinear fiber communication via end-to-end autoencoder learning,” arXiv preprint arXiv:1804.07675, 2018.
 [11] B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bülow, D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end deep learning of optical fiber communications,” arXiv preprint arXiv:1804.04097, 2018.
 [12] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, “OFDM-autoencoder for end-to-end learning of communications systems,” arXiv preprint arXiv:1803.05815, 2018.
 [13] J. Schmidhuber, “Learning factorial codes by predictability minimization,” Neural Computation, vol. 4, no. 6, pp. 863–879, 1992.
 [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 [15] M. Abadi and D. G. Andersen, “Learning to protect communications with adversarial neural cryptography,” arXiv preprint arXiv:1610.06918, 2016.
 [16] M. Bloch and J. Barros, Physical-Layer Security: From Information Theory to Security Engineering. Cambridge, UK: Cambridge University Press, 2011.
 [17] U. Maurer and S. Wolf, “Information-theoretic key agreement: From weak to strong secrecy for free,” in International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2000, pp. 351–368.
 [18] S. Leung-Yan-Cheong and M. Hellman, “The Gaussian wire-tap channel,” IEEE Transactions on Information Theory, vol. 24, no. 4, pp. 451–456, 1978.
 [19] D. Barber and F. Agakov, “The IM algorithm: A variational approach to information maximization,” in Proceedings of the 16th International Conference on Neural Information Processing Systems. MIT Press, 2003, pp. 201–208.
 [20] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
 [21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. The MIT Press, 2016.
 [22] F. Oggier, P. Solé, and J. C. Belfiore, “Lattice codes for the wiretap Gaussian channel: Construction and analysis,” IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5690–5708, Oct 2016.
 [23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.