I. Introduction
The application of machine learning techniques to communication systems has recently received increased attention [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Common to these approaches is the data/simulation-driven optimization of neural networks (NNs) to serve as various communication system components, in place of traditional approaches that are systematically driven by models and theory. The promise of such approaches is that learning could potentially overcome situations where limited models are inaccurate and complex theory is intractable. This can be viewed as part of a larger "deep learning" trend, in which the enthusiastic application of modern machine learning methods, revolving around deep neural networks, has widely impacted a variety of fields [13].

We consider an end-to-end, learning-based approach to optimize the modulation and signal detection for non-coherent, multiple-input, multiple-output (MIMO) systems, i.e., communication with multiple transmit and receive antennas, where the channel coefficients are unknown. The end-to-end aspect refers to the joint optimization of both the signal constellation and the decoder as they would interact with a simulated MIMO channel to transmit and receive messages. As noted in the literature [1, 2], this general concept is analogous to training an autoencoder, but with a noisy channel inserted between the encoder and decoder, which has led several works [1, 2, 3, 4, 6, 7, 8, 9, 11, 12] to use deep neural networks to realize both the encoder and decoder mappings. Related work [3] and [4] also considers the MIMO channel, although with channel state information (CSI) available, and the latter also examines a multi-user interference channel.

One aim of our paper is to reconsider the benefits of employing neural networks and to demonstrate an effective learning-based approach that eschews them altogether. Mapping from a finite message space to channel symbols does not require a neural network encoder, since a lookup table storing the signal constellation is sufficient. Non-coherent MIMO decoding theory [14] guides us to a simplified decoder architecture that avoids neural networks, while still retaining the ability to perform simulation-driven optimization. We evaluate this network-less approach against a neural network decoder and find that they perform comparably, although additional hyperparameter tuning could potentially further improve the performance and stability of both.
We also use our learning-based approach to demonstrate that non-coherent MIMO communication is feasible even at extremely short coherence windows, i.e., with the channel coefficients stable for as few as two time slots. Unlike various conventional approaches [14, 15, 16, 17] to MIMO modulation design, which impose constraints relating the number of time slots to the number of antennas, the learning-based approach is not so limited by analytical design constraints. Relaxing these constraints is also supported by the recent extension in [18] of MIMO capacity theory [19, 20], which shows that the conventional unitary, isotropically distributed inputs are no longer capacity-achieving when the number of antennas exceeds the number of time slots.
I-A Notation
We use uppercase/lowercase bold letters, e.g., $\mathbf{A}$ and $\mathbf{a}$, to denote matrices/vectors. A circularly-symmetric complex Gaussian distribution with zero mean and variance $\sigma^2$ is denoted by $\mathcal{CN}(0, \sigma^2)$. We write $\mathbf{A}^\dagger$ to denote the conjugate transpose of $\mathbf{A}$, and $\mathbf{I}_n$ to denote the $n \times n$ identity matrix.

II. Modulation Optimization for MIMO Systems
II-A Non-Coherent MIMO Channel
We consider transmission over a MIMO channel with $M$ transmit antennas and $N$ receive antennas. When transmitting a message using $T$ channel symbols, the received signal is a $T \times N$ complex matrix given by

$$\mathbf{Y} = \mathbf{X}\mathbf{H} + \mathbf{W},$$

where $\mathbf{X}$ is a $T \times M$ complex matrix representing the transmitted signal, $\mathbf{H}$ is the $M \times N$ complex, random channel matrix, and $\mathbf{W}$ is a $T \times N$ complex matrix representing Gaussian noise. We focus on the non-coherent case where the random channel matrix $\mathbf{H}$ is unknown (i.e., no CSI), but fixed over the $T$ channel uses. The elements of the channel $\mathbf{H}$ are i.i.d. $\mathcal{CN}(0, 1)$ and are independent of the noise $\mathbf{W}$, whose elements are i.i.d. $\mathcal{CN}(0, \sigma^2)$. We constrain the transmission to have average power $P$, such that the signal-to-noise ratio (SNR) is given by $P/\sigma^2$.
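As a concrete illustration, the block-fading model above can be simulated in a few lines. This is a minimal NumPy sketch under the assumption of unit average power (so that SNR $= 1/\sigma^2$); `simulate_channel` and its argument names are illustrative, not from the paper.

```python
import numpy as np

def simulate_channel(X, N, snr_db, rng=None):
    """Simulate one non-coherent MIMO block: Y = X H + W.

    X: (T, M) complex transmitted signal, assumed to have unit average power.
    H: (M, N) i.i.d. CN(0, 1) channel, unknown to the receiver (no CSI).
    W: (T, N) i.i.d. CN(0, sigma2) noise, with SNR = 1 / sigma2.
    """
    if rng is None:
        rng = np.random.default_rng()
    T, M = X.shape
    sigma2 = 10.0 ** (-snr_db / 10.0)  # unit power P = 1, so SNR = 1 / sigma2
    H = (rng.standard_normal((M, N))
         + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    W = np.sqrt(sigma2 / 2) * (rng.standard_normal((T, N))
                               + 1j * rng.standard_normal((T, N)))
    return X @ H + W
```

Note that the channel matrix is redrawn independently for every block, matching the assumption that the coherence window spans exactly the $T$ channel uses.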
II-B Encoder Parameterization
The encoder maps a $B$-bit message $\mathbf{b} \in \{0,1\}^B$ to a $T$-symbol transmission across $M$ antennas. This encoder mapping, $\{0,1\}^B \to \mathbb{C}^{T \times M}$, can be generally parameterized by a simple lookup table specified by a codebook matrix $\mathbf{C} \in \mathbb{C}^{2^B \times TM}$. For power efficiency, the mean row of $\mathbf{C}$ is subtracted from each row of $\mathbf{C}$ to produce the centered codebook matrix $\tilde{\mathbf{C}}$. Then, the average power constraint is enforced by scaling $\tilde{\mathbf{C}}$ to produce the centered and normalized codebook matrix $\bar{\mathbf{C}}$. To encode a message $\mathbf{b}$, the encoder mapping selects the row in $\bar{\mathbf{C}}$ indexed by the integer value of $\mathbf{b}$, and reshapes it to a $T \times M$ matrix to form the transmitted signal $\mathbf{X}$. Essentially, this entire procedure is just to allow the signal constellation $\{\mathbf{X}_1, \ldots, \mathbf{X}_{2^B}\}$, which is constrained in its average power, to be parameterized by the unconstrained variable $\mathbf{C}$.
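The centering, normalization, and lookup steps above can be sketched directly. This is an illustrative NumPy version; the function names are ours, and the exact power-normalization convention (average power per time slot) is an assumption.

```python
import numpy as np

def constellation_from_codebook(C, T, M, P=1.0):
    """Map an unconstrained complex codebook matrix C (2^B rows of length T*M)
    to a centered, power-normalized constellation of 2^B codewords of shape T x M."""
    C_centered = C - C.mean(axis=0, keepdims=True)   # subtract the mean row
    # Scale so the average codeword energy is T * P (power P per time slot;
    # this normalization convention is an assumption).
    scale = np.sqrt(T * P * C_centered.shape[0] / np.sum(np.abs(C_centered) ** 2))
    return (scale * C_centered).reshape(-1, T, M)

def encode(bits, constellation):
    """Lookup-table encoder: the integer value of the bit vector indexes a codeword."""
    idx = int("".join(str(b) for b in bits), 2)
    return constellation[idx]
```

Because the lookup table is differentiable in the codebook entries, gradient-based optimization can act on the unconstrained matrix while the constellation automatically satisfies the power constraint.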
II-C Decoder Realizations
We consider two parametric, soft-output decoders that approximate the unnormalized log-likelihoods for each possible message, and thus output a real vector of length $2^B$. For both decoders, the softmax operation is applied to the output vector (by exponentiating each element and then scaling to normalize the sum to one) to produce a stochastic vector, denoted by $\hat{q}(\cdot \mid \mathbf{Y})$, that approximates the posterior distribution $p(\mathbf{b} \mid \mathbf{Y})$. Note that applying the softmax operation to the vector of unnormalized log-likelihoods $\log p(\mathbf{Y} \mid \mathbf{b}) + c$, for some constant $c$, would yield the corresponding posterior distribution exactly.
II-C1 Pseudo-ML (pML) Decoder
If the codewords are orthonormal, that is, $\mathbf{X}_m^\dagger \mathbf{X}_m = \mathbf{I}_M$ for all messages $m$, then the ML decoding rule is shown in [14] to be

$$\hat{m} = \arg\max_{m} \left\| \mathbf{X}_m^\dagger \mathbf{Y} \right\|_F^2, \qquad (1)$$

since the terms $\|\mathbf{X}_m^\dagger \mathbf{Y}\|_F^2$ are proportional to $\log p(\mathbf{Y} \mid m) + c$, for some $c$ that is constant with respect to $m$. This decoder immediately inspires a soft-output decoder that simply scales the objective in (1) with a parameter $\alpha$ to output

$$\ell_m = \alpha \left\| \mathbf{X}_m^\dagger \mathbf{Y} \right\|_F^2. \qquad (2)$$
The parameter $\alpha$ both accounts for the fact that $\|\mathbf{X}_m^\dagger \mathbf{Y}\|_F^2$ is only proportional to the log-likelihood, and allows the confidence of the decoder to be tuned, which is particularly important since the decoder will be employed while enforcing the orthonormality constraint (i.e., $\mathbf{X}_m^\dagger \mathbf{X}_m = \mathbf{I}_M$) in only a soft manner. Hence, we call this the pseudo-ML (pML) decoder. A smaller/larger $\alpha$ indicates lower/higher confidence, as the corresponding posterior estimate $\hat{q}(\cdot \mid \mathbf{Y})$ (produced by applying the softmax operation) approaches the uniform distribution as $\alpha \to 0$ and certainty as $\alpha \to \infty$.

II-C2 Neural Network (NN) Decoder
Alternatively, a soft-output decoder can be realized with a neural network, which serves as a parametric approximation for the mapping

$$f_\theta(\mathbf{Y}) \approx \big(\log p(\mathbf{Y} \mid m) + c\big)_{m=1}^{2^B}, \qquad (3)$$

where $\theta$ denotes the parameters specifying the weights of the neural network layers. The network is applied to the received signal $\mathbf{Y}$ to yield an approximation of the log-likelihoods, to which the softmax operation is applied to produce the corresponding posterior estimate $\hat{q}(\cdot \mid \mathbf{Y})$.
The specific network architectures used in our experiments are detailed alongside the discussion of the results in Section III-A. In order to handle a complex-valued matrix as input, $\mathbf{Y}$ is simply decomposed into its real and imaginary components and vectorized, i.e., $\mathbf{Y}$ is represented as a real vector of length $2TN$.
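Both decoders share the same output stage: a score vector passed through a softmax. The pML branch can be sketched compactly in NumPy; `pml_posterior` is our illustrative name, and the code scores each codeword by the rule in (2) under the notation of this section.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())   # subtract max for numerical stability
    return e / e.sum()

def pml_posterior(Y, constellation, alpha):
    """pML decoder sketch: score each codeword X_m by alpha * ||X_m^dagger Y||_F^2,
    then softmax the scores into an approximate posterior over messages."""
    scores = np.array([alpha * np.sum(np.abs(Xm.conj().T @ Y) ** 2)
                       for Xm in constellation])
    return softmax(scores)
```

For orthonormal codewords and a noiseless channel, the transmitted codeword attains the largest score, recovering the ML decision of (1) as the argmax of the posterior.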
II-D Optimization Objective
The main optimization objective is to minimize the cross-entropy loss with respect to the encoder and decoder parameters, as given by

$$L_{\mathrm{CE}} = \mathbb{E}_{m, \mathbf{Y}}\left[-\log \hat{q}(m \mid \mathbf{Y})\right], \qquad (4)$$

where $\hat{q}(\cdot \mid \mathbf{Y})$ is produced by applying the softmax operation to the log-likelihoods produced by either decoder, given by (2) or (3), as described in Section II-C. Since the cross-entropy loss can be written as

$$L_{\mathrm{CE}} = H(m) - I(m; \mathbf{Y}) + \mathbb{E}_{\mathbf{Y}}\left[D_{\mathrm{KL}}\big(p(\cdot \mid \mathbf{Y}) \,\|\, \hat{q}(\cdot \mid \mathbf{Y})\big)\right],$$

the ideal optimization of the decoder should cause the estimated posterior $\hat{q}(\cdot \mid \mathbf{Y})$ to converge toward the true posterior $p(\cdot \mid \mathbf{Y})$, and the overall optimization is equivalent to maximizing the mutual information $I(m; \mathbf{Y})$ with respect to the signal constellation, since the message entropy $H(m)$ is constant.
As mentioned earlier, the pML decoder given by (2) is formulated assuming orthonormal codewords that satisfy $\mathbf{X}_m^\dagger \mathbf{X}_m = \mathbf{I}_M$ for all $m$. We enforce orthonormality as a soft constraint by introducing an additional orthonormal-loss term given by

$$L_{\mathrm{orth}} = \frac{1}{2^B} \sum_{m=1}^{2^B} \left\| \mathbf{X}_m^\dagger \mathbf{X}_m - \mathbf{I}_M \right\|_F^2.$$

The optimization objective that we use for the pML decoder is formed by combining this orthonormal loss with the primary cross-entropy loss as follows:
$$L = L_{\mathrm{CE}} \left(1 + w \, L_{\mathrm{orth}}\right), \qquad (5)$$
where $w$ is a weighting parameter that controls the impact of the orthonormal loss term. Note that rather than simply adding on the orthonormal loss term, i.e., using an objective of the form $L_{\mathrm{CE}} + w \, L_{\mathrm{orth}}$, the loss terms have been multiplicatively combined in (5). We found from experimentation that this improved the reliability of convergence, possibly because these loss terms can decay at very different rates, making it difficult to tune the hyperparameter $w$ in an additive combination.
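The orthonormal penalty is straightforward to compute from the constellation. The sketch below is illustrative: `orthonormal_loss` follows the definition above, while the exact multiplicative form in `combined_loss` is our assumption of one plausible way to combine the two terms, consistent with the description in the text.

```python
import numpy as np

def orthonormal_loss(constellation):
    """Mean squared Frobenius deviation of each Gram matrix X_m^dagger X_m from I."""
    M = constellation[0].shape[1]
    I = np.eye(M)
    return float(np.mean([np.sum(np.abs(Xm.conj().T @ Xm - I) ** 2)
                          for Xm in constellation]))

def combined_loss(ce_loss, orth_loss, w):
    """One plausible multiplicative combination (an assumption): the cross-entropy
    term is inflated whenever the codebook strays from orthonormality."""
    return ce_loss * (1.0 + w * orth_loss)
```

A multiplicative coupling of this kind keeps the penalty's influence proportional to the current cross-entropy, which is one way the relative decay rates of the two terms could be kept in balance.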
III. Experiments and Results
Our experiments evaluate communicating $B$ bits over $T$ channel uses. For the shortest coherence window of $T = 2$ time slots, we vary the number of receive antennas $N$, while keeping the number of transmit antennas $M$ fixed, since theory [19, 20] teaches that unilaterally increasing the number of transmit antennas does not increase capacity. We also tested increasing $M$ and found that it resulted in nearly identical performance. For the longer coherence window, we vary both the number of transmit and receive antennas. For each operating point (combination of the parameters $T$, $M$, $N$, and SNR), we evaluated both the pML and NN decoders by optimizing each across a variety of hyperparameters and selecting the best-performing codes. Further details about the network architectures and training procedures are given in Sections III-A and III-B.
The block error rate (BLER) performance results are shown in Figure 1 for the shorter coherence window and in Figure 2 for the longer one, with the NN decoder results appearing on the left and the pML decoder results on the right. Note that for several operating points (seven at one coherence window and two at the other), the pML results exhibit large error floors, while the NN results generally do not. At the other operating points, the results for NN and pML are similar (although sometimes slightly better or worse). Due to time constraints, we searched over six times fewer hyperparameter combinations (optimization instances) for the pML decoder experiments, which we believe plays a significant role in the optimization failing in some cases. For the NN experiments, there were similar optimization failures under other hyperparameter settings. Interestingly, despite the orthonormal loss term, only one operating point resulted in the codebook for the pML decoder converging to orthonormal codewords. However, we did find that the presence of the orthonormal loss term improved the optimization success rate. Two examples of learned signal constellations are shown in Figure 3.
III-A Neural Network Architectures
We use two well-known neural network architectures, the multi-layer perceptron (MLP) and the residual MLP (ResMLP) [21, 22], to realize the neural network-based decoders discussed in Section II-C.

In the MLP architecture, the input vector $\mathbf{x}$ is mapped to the output vector by applying a series of affine transformations and element-wise, nonlinear operations. The hidden (intermediate) layers and output layer (vector) of the network are given by

$$\mathbf{h}_i = \phi_i(\mathbf{W}_i \mathbf{h}_{i-1} + \mathbf{b}_i), \quad i = 1, \ldots, L, \qquad \mathbf{h}_0 = \mathbf{x},$$

where $(\mathbf{W}_i, \mathbf{b}_i)$ are the affine transformation parameters that define the network, and $\phi_i$ denotes the element-wise application of the activation function. For all of our MLP networks, we used the rectified linear unit (ReLU) for the hidden layers (i.e., $\phi_i(x) = \max(x, 0)$ for $i < L$) and the identity function for the output layer (i.e., $\phi_L(x) = x$). Note that the dimensions of the weight matrices $\mathbf{W}_i$ and bias vectors $\mathbf{b}_i$ are constrained by the desired input, output, and hidden layer dimensions.
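The layer recursion above amounts to a short loop. This is a minimal NumPy sketch of the MLP forward pass as described (ReLU hidden layers, identity output); `mlp_forward` is an illustrative name.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """MLP forward pass: ReLU on every hidden layer, identity on the output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)    # affine transform + element-wise ReLU
    return weights[-1] @ h + biases[-1]   # final affine layer, identity activation
```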
In the ResMLP architecture, the input vector $\mathbf{x}$ is first mapped to an initial hidden vector via an affine transformation, i.e.,

$$\mathbf{h}_0 = \mathbf{W}_0 \mathbf{x} + \mathbf{b}_0.$$

Then, over $K$ blocks, the hidden vector is updated according to

$$\mathbf{h}_k = \mathbf{h}_{k-1} + g_k(\mathbf{h}_{k-1}), \quad k = 1, \ldots, K,$$

where $g_k$ denotes the sequential application of batch normalization [23], an activation function, and an affine transform, as given by

$$g_k(\mathbf{h}) = \mathbf{W}_k \, \phi(\mathrm{BN}(\mathbf{h})) + \mathbf{b}_k.$$

Finally, the output is computed as

$$\mathbf{y} = \mathbf{W}_{K+1} \mathbf{h}_K + \mathbf{b}_{K+1}.$$
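The residual structure can likewise be sketched in NumPy. This is an illustrative simplification: batch normalization is shown without learned scale/shift parameters, and the batch is represented as rows of a matrix; the function names are ours.

```python
import numpy as np

def batch_norm(H, eps=1e-5):
    """Per-feature batch normalization (simplified: no learned scale/shift)."""
    return (H - H.mean(axis=0)) / np.sqrt(H.var(axis=0) + eps)

def resmlp_forward(X, W_in, b_in, blocks, W_out, b_out):
    """ResMLP sketch: initial affine map, then K residual blocks applying
    batch-norm -> ReLU -> affine, and a final affine output layer."""
    H = X @ W_in.T + b_in   # h_0, one row per batch example
    for W, b in blocks:
        H = H + np.maximum(batch_norm(H), 0.0) @ W.T + b   # h_k = h_{k-1} + g_k(h_{k-1})
    return H @ W_out.T + b_out
```

The additive skip connection means each block only has to learn a correction to its input, which is the property that makes deeper stacks of blocks easier to train [21].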
III-B Training Procedures
We perform the optimization of the objectives given in Section II-D with stochastic gradient descent (SGD), specifically the popular Adam variant [24], which adaptively adjusts learning rates based on moment estimates. For each iteration, the expectations are approximated by the empirical mean over a batch of uniformly sampled messages, drawn along with random channel matrices and noise for the transmission of each message. Training was performed for up to a fixed maximum number of iterations, with early stopping applied to halt training when the objective fails to improve, while saving the best snapshot in terms of BLER. We implemented these experiments using the Chainer deep learning framework [25].

For the NN decoder, we tried both the MLP and ResMLP architectures across combinations of the number of layers/blocks and hidden layer dimensions. For the pML decoder, the main hyperparameter is just the weight $w$ in the objective function given by (5), which we varied across several values. For both decoders, an additional hyperparameter is the SNR used during the training simulations, which we non-exhaustively varied from 10 dB to 30 dB in 5 dB increments, with one to three SNRs tried for each operating point.
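The per-iteration sampling step described above can be sketched as follows. This NumPy fragment is illustrative (the actual experiments used Chainer with Adam); `sample_batch` and its parameters are our names, and unit average transmit power is assumed.

```python
import numpy as np

def sample_batch(batch_size, n_messages, T, M, N, snr_db, rng):
    """Draw one training batch: uniform messages plus fresh channel and noise draws."""
    sigma2 = 10.0 ** (-snr_db / 10.0)   # unit average power assumed
    msgs = rng.integers(0, n_messages, size=batch_size)
    H = (rng.standard_normal((batch_size, M, N))
         + 1j * rng.standard_normal((batch_size, M, N))) / np.sqrt(2)
    W = np.sqrt(sigma2 / 2) * (rng.standard_normal((batch_size, T, N))
                               + 1j * rng.standard_normal((batch_size, T, N)))
    return msgs, H, W
```

Each batch thus provides an unbiased Monte Carlo estimate of the expectation in the loss, since messages, channels, and noise are all redrawn independently every iteration.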
IV. Discussion and Ongoing Work
Our experiments reevaluated the role of neural networks in learning-based approaches to communications. We demonstrated that neural networks can be avoided altogether while still realizing the fundamental concept of simulation-driven design optimization. We also used this approach to show the feasibility of non-coherent MIMO for coherence windows as short as two time slots.
Further experimentation and hyperparameter tuning are necessary, and part of ongoing work, in order to further confirm our experimental observations. In particular, we believe that the convergence stability (and possibly the performance) of the pML decoder could be further improved with more tuning, since, due to time constraints, we had to explore a much smaller hyperparameter space.
The generalized log-likelihood ratio test (GLRT) decoder given in [17] does not require the codewords to be orthonormal, which would obviate the need for an orthonormal loss term. Due to its somewhat increased implementation and computational complexity, applying this GLRT decoder remains ongoing work.
References
 [1] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning to communicate: Channel autoencoders, domain specific regularizers, and attention,” in Signal Processing and Information Technology (ISSPIT), 2016 IEEE International Symposium on. IEEE, 2016, pp. 223–228.
 [2] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.
 [3] T. J. O’Shea, T. Erpek, and T. C. Clancy, “Physical layer deep learning of encodings for the MIMO fading channel,” in Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on. IEEE, 2017, pp. 76–80.
 [4] T. Erpek, T. J. O’Shea, and T. C. Clancy, “Learning a physical layer scheme for the MIMO interference channel,” in 2018 IEEE International Conference on Communications (ICC). IEEE, 2018, pp. 1–5.
 [5] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. Viswanath, “Communication algorithms via deep learning,” arXiv preprint arXiv:1805.09317, 2018.
 [6] H. Kim, Y. Jiang, S. Kannan, S. Oh, and P. Viswanath, “Deepcode: Feedback codes via deep learning,” in Advances in Neural Information Processing Systems, 2018, pp. 9458–9468.
 [7] B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bülow, D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end deep learning of optical fiber communications,” arXiv preprint arXiv:1804.04097, 2018.
 [8] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep learning based communication over the air,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, 2018.
 [9] F. A. Aoudia and J. Hoydis, “End-to-end learning of communications systems without a channel model,” arXiv preprint arXiv:1804.02276, 2018.
 [10] H. Ye, G. Y. Li, and B.-H. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Communications Letters, vol. 7, no. 1, pp. 114–117, 2018.

 [11] V. Raj and S. Kalyani, “Backpropagating through the air: Deep learning at physical layer without channel models,” IEEE Communications Letters, vol. 22, no. 11, pp. 2278–2281, 2018.
 [12] T. J. O’Shea, T. Roy, N. West, and B. C. Hilburn, “Physical layer communications system design over-the-air using adversarial networks,” arXiv preprint arXiv:1803.03145, 2018.
 [13] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [14] B. M. Hochwald and T. L. Marzetta, “Unitary space-time modulation for multiple-antenna communications in Rayleigh flat fading,” IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 543–564, 2000.
 [15] B. M. Hochwald, T. L. Marzetta, T. J. Richardson, W. Sweldens, and R. Urbanke, “Systematic design of unitary space-time constellations,” IEEE Transactions on Information Theory, vol. 46, no. 6, pp. 1962–1973, 2000.
 [16] X.-B. Liang and X.-G. Xia, “Unitary signal constellations for differential space-time modulation with two transmit antennas: parametric codes, optimal designs, and bounds,” IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2291–2322, 2002.
 [17] T. Koike-Akino and P. Orlik, “High-order super-block GLRT for non-coherent Grassmann codes in MIMO-OFDM systems,” in Global Telecommunications Conference (GLOBECOM 2010), 2010 IEEE. IEEE, 2010, pp. 1–6.
 [18] W. Yang, G. Durisi, and E. Riegler, “On the capacity of large-MIMO block-fading channels,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 2, pp. 117–132, 2013.
 [19] E. Telatar, “Capacity of multi-antenna Gaussian channels,” European Transactions on Telecommunications, vol. 10, no. 6, pp. 585–595, 1999.
 [20] T. L. Marzetta and B. M. Hochwald, “Capacity of a mobile multiple-antenna communication link in Rayleigh flat fading,” IEEE Transactions on Information Theory, vol. 45, no. 1, pp. 139–157, 1999.

 [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [22] F. Huang, J. Ash, J. Langford, and R. Schapire, “Learning deep ResNet blocks sequentially using boosting theory,” arXiv preprint arXiv:1706.04964, 2017.
 [23] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
 [24] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015. [Online]. Available: https://arxiv.org/abs/1412.6980
 [25] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source framework for deep learning,” in Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.