Communication theory addresses the problem of reliably transmitting information from one point to another. A communication system can be abstracted into three blocks: (1) an encoder at the transmitter side, which takes a message and encodes it into a codeword; (2) a noisy channel, which transforms the transmitted codeword in a certain way; and (3) a decoder at the receiver side, which estimates the transmitted message based on its noisy channel output. The channel is usually fixed and accounts, for example, for signal impairments and imperfections in real-life scenarios such as wireless communication. The communication channel of interest in this work is the additive white Gaussian noise (AWGN) channel. The aim is now to develop appropriate encoding and decoding schemes to combat the impairments of this channel. This usually involves adding redundancy and introducing sophisticated techniques which use the available communication dimensions (e.g. frequency and time) in an optimal way. Information theory studies the fundamental performance limits of reliable communication, and its main achievement is the characterization of the maximum possible transmission rate, the so-called capacity. However, the corresponding proofs usually rely on so-called random coding arguments, which only show the existence of suitable capacity-achieving encoders and decoders but not how to actually construct them; cf. for example [1, 2]. In fact, constructing capacity-achieving coding schemes is a highly non-trivial task even for very simple communication scenarios. It is therefore natural to search for solutions which can simplify this process, for example using deep learning.
Recent developments have shown that a neural network (NN) can simultaneously learn encoding and decoding functions by implementing the communication channel as an autoencoder with a noise layer in the middle, see for example [3] and references therein. Surprisingly, [4] showed that those learned encoding-decoding systems come close to practical baseline techniques. The appeal of this idea is that complex encoding and decoding functions can be learned without extensive communication-theoretic analysis. It also enables the system to cope with new channel scenarios on the fly.
In this paper, we are interested not only in learning the encoding and decoding to account for reliable communication, but also in exploring the possibility to learn how the communication can be secured at the physical layer. To this end, we are interested in physical layer security approaches, which establish a secure transmission by utilizing the intrinsic channel properties and by employing information-theoretic principles as security measures. In particular, these approaches result in secrecy techniques which are independent of the computational power of unknown adversaries, which is in contrast to prevalent computational complexity or cryptographic approaches, e.g. [5] and [6]. This results in a stricter notion of security, with the drawback that these schemes need to be designed for specific communication scenarios. The simplest scenario which involves reliable transmission and secrecy is the wiretap channel [7]. This refers to a three-node network with a legitimate transmitter-receiver link and an external eavesdropper which must be kept ignorant of the transmission. It has been shown that specific encoding and decoding techniques can exploit an inherent asymmetry (of the additive noise) between the legitimate receiver and the eavesdropper to achieve physical layer secure communication. The secrecy capacity of the wiretap channel, i.e., the maximum transmission rate at which both reliability and secrecy are satisfied, is known. However, constructing suitable encoders and decoders which achieve the secrecy capacity remains a non-trivial challenge. There are several approaches to this problem, and most of them are based on polar, LDPC or lattice codes, see [8]. However, those techniques are not practical and only work for highly specialized cases or channels. Our motivation is therefore that an NN-based code design provides a way for on-the-fly code design, which is practical for any channel. The question at hand is now how to exploit and modify the autoencoder concept to also include physical layer secrecy measures, in order to obtain coding schemes for physical layer secure communication scenarios. In this paper, we demonstrate that this is indeed possible by creating a training environment where two NN decoders compete against each other. For that, we define a novel security loss function, which is based on the cross-entropy. We then show the resulting constellations and the probability of error over an SNR range before and after secure coding.
The work of [3] introduced the idea of using an autoencoder NN concept to model the communication scenario. The main drawback of this method is that the channel model needs to be differentiable, which can become a problem with real-world channels. However, it was shown in [4] that the learned encoding and decoding rules provide a system which comes close to classical schemes and performs well if used on actual over-the-air transmissions. Moreover, [9] showed that the training can be done without a mathematical channel model by including reinforcement learning. This shows that end-to-end learning of communication systems can be a viable technology. This also holds for fiber communication as shown in [10] and in [11], which utilized the autoencoder concept. Moreover, the concept can be used to learn advanced communication schemes such as orthogonal frequency division multiplexing (OFDM), which enables reliable transmission in multi-path channel scenarios as shown in [12].
The idea of using two competing NNs in a specific context is not new. One of the first works was for example [13] on the principle of predictability minimization. This principle is as follows: for each unit inside a NN there exists an adaptive predictor, which predicts the unit based on all other units. The units are now trained to minimize this predictability, therefore enforcing independence between the units. Another popular instance of competing NNs are generative adversarial networks (GANs) as introduced in [14]. There, the two NNs consist of a generative model and a discriminative model, with the latter predicting the probability that a sample came from the former model. The generative model is now trained to maximize the error probability of the discriminative model, which introduces an adversarial process. Another recent work is [15], where a key was provided to Alice and Bob and the NN learns to use the key on its communication link in such a way that Eve cannot decipher the message (since she has no key). It is therefore a neural cryptography setting and different from our approach, as our network learns to encode a message for direct transmission such that Eve cannot decode it.
We stick to the convention of upper case random variables $X$ and lower case realizations $x$, i.e. $X \sim p(x)$, where $p(x)$ is the probability mass function of $X$. Moreover, we use upper case bold script $\mathbf{X}$ for random vectors and constant matrices, while lower case bold script $\mathbf{x}$ is used for constant vectors. We also use $|\mathcal{X}|$ to denote the cardinality of a set $\mathcal{X}$.
II Physical layer security in wiretap channels
The wiretap channel is a three-node network in which a sender (Alice) transmits confidential information to a legitimate receiver (Bob) while keeping an external eavesdropper (Eve) ignorant of it. This setup can be seen as the simplest communication scenario that involves both tasks of reliable transmission and secrecy. Accordingly, this is the crucial building block of secure communication to be understood for more complex communication scenarios.
In this paper, we study the degraded Gaussian wiretap channel as depicted in Fig. 1. The legitimate channel between Alice and Bob is given by an additive white Gaussian noise (AWGN) channel as

$$Y_i = X_i + N_{B,i} \qquad (1)$$

where $Y_i$ is the received channel output at Bob, $X_i$ is the channel input of Alice, and $N_{B,i} \sim \mathcal{N}(0, \sigma_B^2)$ is the additive white Gaussian noise at Bob at time instant $i$. The eavesdropper channel to Eve is then given by

$$Z_i = Y_i + N_{E,i} \qquad (2)$$

where $Z_i$ is the received channel output at Eve and $N_{E,i} \sim \mathcal{N}(0, \sigma_E^2)$ is the additive white Gaussian noise at Eve. This defines a degraded wiretap channel for which the eavesdropper channel output $Z_i$ is strictly worse than the legitimate channel output $Y_i$. (Footnote 1: Note that any Gaussian wiretap channel of the general form $Y_i = h_B X_i + N_{B,i}$ and $Z_i = h_E X_i + N_{E,i}$ with $h_B$ and $h_E$ multiplicative channel gains can be transformed into an equivalent degraded wiretap channel as in (1)-(2). This means that any Gaussian wiretap channel is inherently degraded, cf. for example [16, Sec. 5.1].)
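The degraded channel described above can be sketched in a few lines of Python. This is a minimal illustration, assuming real-valued inputs and that Eve observes Bob's output plus independent extra Gaussian noise; the function names and the noise levels `sigma_bob` and `sigma_eve_extra` are hypothetical and not taken from the paper.

```python
import random

def degraded_wiretap_channel(x, sigma_bob, sigma_eve_extra, rng):
    """Return (y, z) for a degraded AWGN wiretap channel.

    y = x + N_B  (Bob's observation)
    z = y + N_E  (Eve sees Bob's output plus additional noise)
    The standard deviations are illustrative assumptions.
    """
    y = [xi + rng.gauss(0.0, sigma_bob) for xi in x]
    z = [yi + rng.gauss(0.0, sigma_eve_extra) for yi in y]
    return y, z

def empirical_noise_power(received, sent):
    """Average squared deviation of the received samples from the sent ones."""
    return sum((r - s) ** 2 for r, s in zip(received, sent)) / len(sent)

rng = random.Random(0)          # fixed seed for reproducibility
x = [1.0, -1.0] * 5000          # toy antipodal codeword symbols
y, z = degraded_wiretap_channel(x, sigma_bob=0.5, sigma_eve_extra=1.0, rng=rng)

noise_bob = empirical_noise_power(y, x)  # close to 0.5**2
noise_eve = empirical_noise_power(z, x)  # close to 0.5**2 + 1.0**2
```

Because the two noise terms are independent, their variances add at Eve, which is what makes her observation a noisier version of Bob's.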
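The communication task is now as follows: To transmit a message $m \in \mathcal{M}$, Alice encodes it into a codeword $x^n(m)$ of block length $n$, where $x^n \in \mathcal{X}^n$. (Footnote 2: Usually, $\mathcal{X} = \mathbb{R}$ in the Gaussian setting.) Moreover, we assume an average transmission power constraint $\frac{1}{n}\sum_{i=1}^{n} x_i^2 \leq P$. At the receiver side, Bob obtains an estimate $\hat{m}$ of the transmitted message by decoding its received channel output $y^n$ as $\hat{m} = g(y^n)$. The transmission rate is then defined as $R = \frac{\log |\mathcal{M}|}{n}$.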
The secrecy of the transmitted message is ensured and measured by information-theoretic concepts. There are different criteria of information-theoretic secrecy, including weak secrecy [7] and strong secrecy [17]. In the end, all criteria have in common that the output at the eavesdropper should become statistically independent of the transmitted message, implying that no confidential information is leaked to the eavesdropper. For example, strong secrecy is defined as

$$\lim_{n \to \infty} I(M; Z^n) = 0 \qquad (3)$$

with $I(M; Z^n)$ the mutual information between $M$ and $Z^n$.
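The secrecy capacity now characterizes the maximum transmission rate at which Bob can reliably decode the transmitted message while Eve is kept in the dark, i.e., the secrecy criterion (3) is satisfied while achieving a vanishing error probability $\Pr\{\hat{M} \neq M\} \to 0$. The secrecy capacity of the Gaussian wiretap channel is known [16, 18] and is given by

$$C_S = \frac{1}{2}\log\left(1 + \frac{P}{\sigma_B^2}\right) - \frac{1}{2}\log\left(1 + \frac{P}{\sigma_B^2 + \sigma_E^2}\right) \qquad (4)$$

where $P$ is the average transmission power, $\sigma_B^2$ is the variance of the noise at Bob in (1), and $\sigma_E^2$ is the variance of the additional noise at Eve in (2).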
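As a numerical illustration, the well-known secrecy capacity of the real-valued degraded Gaussian wiretap channel (the difference of the two AWGN capacities) can be evaluated as follows. This is a sketch under stated assumptions: rates are in bits per channel use, Eve's total noise variance is Bob's variance plus an additional term, and the function and parameter names are hypothetical.

```python
import math

def secrecy_capacity(power, var_bob, var_eve_extra):
    """Secrecy capacity (bits per channel use) of a degraded Gaussian wiretap channel.

    power          -- average transmit power P
    var_bob        -- noise variance on the legitimate link
    var_eve_extra  -- variance of the extra noise seen only by Eve (assumption)
    """
    capacity_bob = 0.5 * math.log2(1.0 + power / var_bob)
    capacity_eve = 0.5 * math.log2(1.0 + power / (var_bob + var_eve_extra))
    # The difference is non-negative for a degraded channel; max() guards rounding.
    return max(0.0, capacity_bob - capacity_eve)

# Example: P = 3, unit noise at Bob, extra noise of variance 2 at Eve.
example_rate = secrecy_capacity(power=3.0, var_bob=1.0, var_eve_extra=2.0)
```

When Eve has no extra noise (`var_eve_extra = 0`), the two terms coincide and the secrecy capacity is zero, which matches the intuition that secrecy here stems from the noise asymmetry between the two links.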
Mutual information vs. cross-entropy
A straightforward approach to optimize a NN based on information-theoretic criteria would be an optimization based on the mutual information as in (3) and (4). In the security context, this would mean optimizing the encoder and decoder mappings to maximize the mutual information $I(X; Y)$ on the legitimate link, while minimizing the mutual information $I(X; Z)$ on the eavesdropper link. However, estimating the mutual information from data samples is a non-trivial challenge, due to the unknown underlying distribution. One approach is for example variational information maximization, introduced in [19], which was recently applied to GANs [20]. This technique computes a tractable lower bound in order to maximize the mutual information between two distributions. However, in our case we would need a technique to simultaneously upper and lower bound two connected mutual information terms. To circumvent this challenge, we adopt a secrecy criterion based on the cross-entropy, on which we elaborate further in the next section.
III Neural network implementation
III-A General model
As in the reference work [3], we implemented the communication scenario using an autoencoder-like network setting. An autoencoder is usually a NN which aims to copy the input of the network to its output. It consists of an encoder, which maps the input to some codewords, and a decoder, which aims to estimate the input from the output, see [21]. It is therefore a natural fit for the communication problem. Usually, these autoencoders are restricted in a certain way, for example such that the encoding function performs a dimensionality reduction. That way, the autoencoder can learn useful properties of the data, which are needed for reconstruction. This is in contrast to the communication scenario, where the encoder aims to introduce redundancy, i.e. to increase the dimensionality. Moreover, there is a noise function in between encoder and decoder. The NN therefore learns to represent the input in a higher-dimensional space to combat the noise corruption, such that the decoder can still estimate the correct output.
The general structure of our NN can be seen in Fig. 2. There, the message $m \in \mathcal{M} = \{1, \dots, M\}$ gets one-hot encoded into the binary vector $\mathbf{s} \in \{0,1\}^M$, which can be viewed as a probability distribution that shows which message was sent and is fed into the NN. The encoder is comprised of two fully connected (dense) layers, where the first layer maps $\mathbf{s}$ to a real-valued vector with no activation function. (Footnote 3: The two layers are parameterized by weight matrices $\mathbf{W}$ and bias vectors $\mathbf{b}$, respectively.) Here, $n$ represents the codeword dimension and can be thought of as time instants or channel uses. The last layer of the encoder is a normalization, where the codewords get a unit power constraint, which is either the classical average power constraint over the codeword dimension, i.e. $\frac{1}{n}\sum_{i=1}^{n} x_i^2 \leq 1$, or a batch average power constraint $\frac{1}{Bn}\sum_{k=1}^{B}\sum_{i=1}^{n} x_{k,i}^2 \leq 1$, where the average is taken over the batch of size $B$ and the codeword dimension. Note that the normalization actually enforces that the resulting codewords have exactly the power specified, turning the inequalities into equalities. For the classical average power constraint, this will yield a constellation where all the points lie on a circle with radius $\sqrt{n}$. The channel itself is realized as in Section II. Here we scale the noise variance according to the SNR, where SNR means signal-to-noise ratio. Moreover, we have two receiver blocks, which are constructed identically. Each one has two fully connected layers, where the first one maps the channel output back to $\mathbb{R}^M$ with a ReLU activation function and the second one maps from $\mathbb{R}^M$ to $\mathbb{R}^M$ with a linear activation function. As the last step, we use the softmax function (Footnote 4: The softmax function is defined as $\mathrm{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$.) and the cross-entropy as loss function. The softmax gives a probability distribution $\hat{p}$ over all messages, which is fed into the cross-entropy

$$H(p, \hat{p}) = -\sum_{m \in \mathcal{M}} p(m) \log \hat{p}(m)$$

as a loss function, where $p$ is the one-hot input distribution, and which we then average over the batch sample size $B$. This can be seen as a maximum likelihood estimation of the sent signal, see for example [21, Chap. 5]. The index of the element of $\hat{p}$ with the highest probability will be the decoded symbol $\hat{m}$. The same loss function is applied to the receiver model of the wiretapper, i.e. Eve. However, training for security needs a different loss function.
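The per-codeword power normalization and the softmax/cross-entropy decoding step described above can be sketched without any deep learning framework. This is a minimal illustration in plain Python, assuming a unit average power per codeword dimension; the helper names are hypothetical and the paper's actual implementation uses TensorFlow layers.

```python
import math

def normalize_power(codeword):
    """Scale a codeword so that (1/n) * sum(x_i^2) equals exactly 1."""
    n = len(codeword)
    energy = sum(v * v for v in codeword)
    scale = math.sqrt(n / energy)
    return [v * scale for v in codeword]

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_m target(m) * log(predicted(m))."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, predicted))

codeword = normalize_power([3.0, 4.0])      # n = 2, lies on a circle of radius sqrt(2)
probabilities = softmax([2.0, 1.0, 0.1])    # decoder output for M = 3 messages
decoded = probabilities.index(max(probabilities))  # index of the most likely message
loss = cross_entropy([1.0, 0.0, 0.0], probabilities)
```

With a one-hot target, the cross-entropy reduces to the negative log-probability that the decoder assigns to the transmitted message, which is why minimizing it corresponds to a maximum likelihood criterion.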
III-B The security loss function
Remember that information-theoretic security results in a difference of mutual information terms between the links. This is in general hard to compute and accordingly difficult to use for NN optimization. We therefore focus on the cross-entropy and establish another way to define the security loss. A straightforward method would be to define a security loss function by considering the difference between the cross-entropy losses of Bob's and Eve's receivers. The secure loss would therefore be

$$L = H(p, \hat{p}_{\mathrm{Bob}}) - H(p, \hat{p}_{\mathrm{Eve}}).$$

Here, $p$ is the one-hot input distribution, and $\hat{p}_{\mathrm{Bob}}$ and $\hat{p}_{\mathrm{Eve}}$ are the resulting probability mass functions from the softmax output in Bob's and Eve's decoder. Training the network for each symbol would result in training the encoder (Alice) such that Eve sees that symbol in another random location. These locations are only based on the initial weight configurations of Eve's network (and therefore highly dependent on the adversary), as the subtracted cross-entropy of Eve jumps to the next highest value and trains for it indefinitely. Moreover, cross-entropy is unbounded from above, resulting in a logarithmically growing negative loss. This shows that the loss function above is fundamentally ill-suited for the problem. We therefore decided to mimic traditional wiretap-coding techniques. Theoretically, the usual approach is to map a specific message to a bin of codewords and then select a codeword randomly from this bin. There, intuitively, the randomness is used to confuse the eavesdropper. A more concrete method is to use coset codes, where the actual messages label the cosets, and the particular codeword inside the coset is again chosen at random. This method goes back to the work of [7] and we refer the reader to [22, Appendix A] for an introduction. The idea is that Eve can only distinguish between clusters of codewords, whereas the messages themselves are hidden randomly in each cluster. However, the legitimate receiver has a better channel and can also distinguish between codewords inside the clusters. This makes secure communication possible by trading a part of the communication rate for security.
Our approach is now that we train the network such that Eve sees the same probability for all codewords in a certain cluster. The security loss function is a sum of the cross-entropy of Bob and the cross-entropy of Eve's estimated distribution with a modified input distribution. The input distribution is modified in such a way that clusters of codewords have the same probability. Consider for example a training batch of messages with one-hot encoding matrix $\mathbf{S}$, where the rows are the samples of the batch and the columns indicate the symbol.
Multiplying this one-hot input matrix from the right with the $M \times M$ matrix $\mathbf{E}$, see Algorithm 1, results in the equalized matrix $\bar{\mathbf{S}} = \mathbf{S}\mathbf{E}$.
If, for instance, the first and second symbol form one cluster and the third and fourth symbol form another, then in the resulting matrix $\bar{\mathbf{S}}$ the first and second symbol, and the third and fourth symbol, have the same distribution. The advantage of this method is that we only need to calculate an $M \times M$ matrix, which can be used with any input batch size. The new security loss can therefore be defined as

$$L_{\mathrm{sec}} = (1 - \alpha)\, H(p, \hat{p}_{\mathrm{Bob}}) + \alpha\, H(\bar{p}, \hat{p}_{\mathrm{Eve}}) \qquad (10)$$

where $p$ is the one-hot input distribution, $\hat{p}_{\mathrm{Bob}}$ is Bob's decoded distribution, $\hat{p}_{\mathrm{Eve}}$ is Eve's decoded distribution and $\bar{p}$ stands for the equalized input symbol distribution. Furthermore, the parameter $\alpha$ controls the trade-off between security and communication rate on the legitimate channel. The loss function is then averaged over the batch size $B$. Moreover, we chose the $k$-means algorithm for the clustering of the constellation points. This provides us with a clustering based on the Euclidean distance and goes well with the initial idea of coset codes, when they are implemented as a lattice code with a nearest-neighbor decoder.
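The label equalization and the resulting loss can be illustrated with a small worked example. The sketch below is an assumption-laden illustration, not the paper's Algorithm 1: it assumes zero-based message indices, a uniform distribution inside each cluster, and a trade-off parameter (here called `alpha`) that enters as a convex combination of the two cross-entropy terms. All function names are hypothetical.

```python
import math

def one_hot(message, num_messages):
    return [1.0 if j == message else 0.0 for j in range(num_messages)]

def equalization_matrix(cluster_of):
    """Build the M x M matrix E; cluster_of[m] is the cluster label of message m.

    Row i spreads the probability mass of message i uniformly over its cluster.
    """
    sizes = {}
    for c in cluster_of:
        sizes[c] = sizes.get(c, 0) + 1
    M = len(cluster_of)
    return [[1.0 / sizes[cluster_of[i]] if cluster_of[i] == cluster_of[j] else 0.0
             for j in range(M)] for i in range(M)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cross_entropy(target, predicted, eps=1e-12):
    return -sum(t * math.log(q + eps) for t, q in zip(target, predicted))

def security_loss(p, p_bar, q_bob, q_eve, alpha):
    """Assumed form: (1 - alpha) * H(p, q_bob) + alpha * H(p_bar, q_eve)."""
    return (1.0 - alpha) * cross_entropy(p, q_bob) + alpha * cross_entropy(p_bar, q_eve)

cluster_of = [0, 0, 1, 1]            # messages 0,1 share a cluster; so do 2,3
batch = [0, 2, 1, 3]                 # toy batch of message indices
S = [one_hot(m, 4) for m in batch]   # one-hot label matrix (rows = samples)
E = equalization_matrix(cluster_of)
S_bar = matmul(S, E)                 # equalized labels used for Eve's term
```

Since `E` depends only on the clustering and not on the batch, it is computed once and reused for every batch. Under this loss, Eve's term is smallest when her output distribution is uniform within the true cluster, which is the behavior the training aims for.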
IV Training phases and simulation results
All of the simulations were done in TensorFlow with the Adam optimizer [23] and a gradually decreasing learning rate. Moreover, we steadily increased the batch size during the epochs. The training was done with a channel layer for the direct link at a given SNR (in dB), and on Eve's link at a separate given SNR (in dB). All of the following figures for constellations and decision regions were done with codeword dimension $n = 2$, such that they can be easily plotted in 2-d. Our training procedure is divided into three phases.
In the first phase, we only train the encoder and the decoder of Bob with the standard cross-entropy, as done in [3]. Here, the NN learns a signal constellation to combat the noise in an efficient way, dependent on the power constraint and the signal-to-noise ratio. The resulting learned constellations, i.e. encoding rules, and the learned decision regions, i.e. decoding rules, are shown in Figure 4 for an average power constraint over the whole batch and for an average power constraint per symbol.
In the second phase, we freeze the encoding layers and train Eve on decoding the previously learned encoding scheme with her cross-entropy and the normal input distribution. The reason behind the freezing of the encoder is that we assume that in real-life situations an attacker cannot influence the encoding of the signals. We therefore have an unchanged constellation, but Eve's NN learns a decision region for her channel model, i.e. with the additional noise term at the SNR of her link.
In the third phase, we freeze the decoding layers of Eve and train the NN with the loss function (10) and a chosen trade-off parameter $\alpha$. This time the freezing is done because we cannot assume that a communication link has access to the decoding function of an attacker. For the equalization of the input clusters, we feed the network with an identity matrix $\mathbf{I}_M$ and calculate the clusters of the constellation points with the $k$-means algorithm. Figure 5 shows the results of the clustering algorithm on the learned constellations. We then use Algorithm 3 to calculate $\mathbf{E}$ and create the equalized batch label matrix $\bar{\mathbf{S}} = \mathbf{S}\mathbf{E}$. The resulting training effect is that the NN tries to pull codewords from the same cluster together, close enough that Eve cannot distinguish the symbols in a cluster, and loose enough that Bob can still decode them. Figure 6 shows the learned secure encoding schemes.
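The clustering step on the learned constellation points can be sketched with a plain Lloyd-style $k$-means. This is a generic textbook variant written for illustration, assuming 2-d constellation points given as tuples and a naive initialization from the first `k` points; it is not the authors' implementation, and the toy points below are made up.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50):
    """Plain k-means on a list of coordinate tuples.

    Returns (labels, centers). Initialization uses the first k points,
    which is a simplifying assumption and can be poor in general.
    """
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy "constellation": two well-separated pairs of points.
constellation = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
labels, centers = kmeans(constellation, k=2)
```

The resulting `labels` list is exactly the kind of message-to-cluster assignment from which an equalization matrix can be built.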
After the secure training phase, we train Bob's decoder and, in the last step, Eve's decoder once again. This is to make sure that Eve's neural network is able to train to decode the newly encoded signals.
The training phase for $M = 16$ symbols gets more accurate and faster with increasing codeword dimension $n$, which suggests that the NN can find better constellations. The drawback is that the actual communication rate drops, since the system needs more time instants to transmit a bit. Therefore, we have taken a conservative approach in choosing $n$. We do not provide figures for these constellations due to the higher dimensionality. Moreover, we implemented a coset coding algorithm. For that, we use a modified $k$-means algorithm which gives equal cluster sizes for our 16 constellation points. In our case, we use 4 clusters, so each cluster consists of 4 symbols. The symbols in each cluster get a designation as one of the four secure encoded symbols. We therefore have 4 clusters, and each cluster has 4 constellation points, one for each secure symbol. The actual secure transmission consists of only 4 symbols, i.e. two bits. The specific cluster is chosen at random. This randomness increases Eve's confusion about the transmission. Intuitively, we sacrifice a part of the transmission rate, i.e. we only use 4 instead of 16 symbols, to accommodate randomness to confuse Eve. Again we refer to [22, Appendix A], which discusses Eve's error probability and shows that coset coding does increase confusion at the eavesdropper. The NN therefore learns a constellation which can be seen as a finite lattice-like structure, on which one can implement the idea of coset coding. For the actual simulation, we used a given SNR (in dB) on the direct link and an additional SNR (in dB) in the adversary link during the training phase. We then evaluated the symbol error rate, which approximates the error probability, for decoding the symbols before the third training phase, i.e. before secure coding, and after all the training, which results in Figure 7. We note that all results were evaluated with the same total sample size, while the testing was also done with the additional SNR in the adversary link. Moreover, we used a fixed factor $\alpha$ in the security loss function. One can see in the results that the NN learns a trade-off between communication rate and security.
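The coset coding step described above (4 clusters of 4 constellation points, 2 secure bits per transmission) can be sketched as a simple lookup. This is a minimal illustration assuming zero-based indices, cluster labels `0..3`, and that a secure symbol is assigned to each point by its order within its cluster; the function names are hypothetical.

```python
import random

def build_coset_table(cluster_of):
    """Label every constellation point with a secure symbol inside its cluster.

    cluster_of[point] is the cluster label of that constellation point.
    Returns (table, label_of): table maps (cluster, secure_symbol) -> point,
    label_of maps point -> secure_symbol.
    """
    table, label_of, counters = {}, {}, {}
    for point, cluster in enumerate(cluster_of):
        symbol = counters.get(cluster, 0)
        counters[cluster] = symbol + 1
        table[(cluster, symbol)] = point
        label_of[point] = symbol
    return table, label_of

def coset_encode(secure_symbol, table, num_clusters, rng):
    """Pick a cluster uniformly at random and send its point for the secure symbol."""
    cluster = rng.randrange(num_clusters)
    return table[(cluster, secure_symbol)]

def coset_decode(point, label_of):
    """Bob decodes the constellation point and reads off the secure symbol."""
    return label_of[point]

cluster_of = [i // 4 for i in range(16)]   # toy equal-size clustering: 4 clusters of 4
table, label_of = build_coset_table(cluster_of)
rng = random.Random(1)
sent_points = [coset_encode(s, table, 4, rng) for s in range(4)]
recovered = [coset_decode(p, label_of) for p in sent_points]
```

A receiver that can resolve individual points recovers the secure symbol regardless of which cluster was drawn, whereas a receiver that can only resolve clusters learns the random cluster index, which carries no information about the secure symbol.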
V Conclusions and outlook
We have shown that the recently developed end-to-end learning of communication systems via autoencoder neural networks can be extended towards secure communication scenarios. For that, we introduced a modified loss function which results in a trade-off between security and communication rate. In particular, we have shown that the neural network learns a clustering which resembles a finite constellation or lattice and which can be used for coset encoding, as demonstrated. This opens up the ongoing research on end-to-end learning of communication systems to the field of secure communication, as classical secure coding schemes can be learned and applied with a neural network. We think that our approach via the new loss function is a fruitful direction in that regard. However, an optimal way would be to tackle the problem by direct optimization via mutual information terms, which remains a challenging research problem.
References
[1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley & Sons, 2006.
[3] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec. 2017.
[4] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep learning based communication over the air,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, Feb. 2018.
[5] W. Diffie and M. Hellman, “New directions in cryptography,” IEEE Transactions on Information Theory, vol. 22, no. 6, pp. 644–654, 1976.
[6] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
[7] A. D. Wyner, “The wire-tap channel,” Bell System Technical Journal, vol. 54, no. 8, pp. 1355–1387, 1975.
[8] Y. Wu, A. Khisti, C. Xiao, G. Caire, K. Wong, and X. Gao, “A survey of physical layer security techniques for 5G wireless networks and challenges ahead,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 4, pp. 679–695, Apr. 2018.
[9] F. A. Aoudia and J. Hoydis, “End-to-end learning of communications systems without a channel model,” CoRR, vol. abs/1804.02276, 2018.
[10] S. Li, C. Häger, N. Garcia, and H. Wymeersch, “Achievable information rates for nonlinear fiber communication via end-to-end autoencoder learning,” arXiv preprint arXiv:1804.07675, 2018.
[11] B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bülow, D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end deep learning of optical fiber communications,” arXiv preprint arXiv:1804.04097, 2018.
[12] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, “OFDM-autoencoder for end-to-end learning of communications systems,” arXiv preprint arXiv:1803.05815, 2018.
[13] J. Schmidhuber, “Learning factorial codes by predictability minimization,” Neural Computation, vol. 4, no. 6, pp. 863–879, 1992.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[15] M. Abadi and D. G. Andersen, “Learning to protect communications with adversarial neural cryptography,” arXiv preprint arXiv:1610.06918, 2016.
[16] M. Bloch and J. Barros, Physical-Layer Security: From Information Theory to Security Engineering. Cambridge, UK: Cambridge University Press, 2011.
[17] U. Maurer and S. Wolf, “Information-theoretic key agreement: From weak to strong secrecy for free,” in International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2000, pp. 351–368.
[18] S. Leung-Yan-Cheong and M. Hellman, “The Gaussian wire-tap channel,” IEEE Transactions on Information Theory, vol. 24, no. 4, pp. 451–456, 1978.
[19] D. Barber and F. Agakov, “The IM algorithm: a variational approach to information maximization,” in Proceedings of the 16th International Conference on Neural Information Processing Systems. MIT Press, 2003, pp. 201–208.
[20] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. The MIT Press, 2016.
[22] F. Oggier, P. Solé, and J. C. Belfiore, “Lattice codes for the wiretap Gaussian channel: Construction and analysis,” IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5690–5708, Oct. 2016.
[23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.