Learning to Modulate for Non-coherent MIMO

03/09/2019 ∙ by Ye Wang, et al. ∙ MERL 0

The deep learning trend has recently impacted a variety of fields, including communication systems, where various approaches have explored the application of neural networks in place of traditional designs. Neural networks flexibly allow for data/simulation-driven optimization, but are often employed as black boxes detached from direct application of domain knowledge. Our work considers learning-based approaches addressing modulation and signal detection design for the non-coherent MIMO channel. We demonstrate that simulation-driven optimization can be performed while entirely avoiding neural networks, yet still perform comparably. Additionally, we show the feasibility of MIMO communications over extremely short coherence windows (i.e., channel coefficient stability period), with as few as two time slots.




I Introduction

The application of machine learning techniques to communication systems has recently received increased attention [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Common to these approaches is the data/simulation-driven optimization of neural networks (NN) to serve as various communication system components, instead of traditional approaches that are systematically driven by models and theory. The promise of such approaches is that learning could potentially overcome situations where limited models are inaccurate and complex theory is intractable. This can be viewed as part of a larger “deep learning” trend, where the enthusiastic application of modern machine learning methods, revolving around deep neural networks, has widely impacted a variety of fields [13].

We consider an end-to-end, learning-based approach to optimize the modulation and signal detection for non-coherent, multiple-input, multiple-output (MIMO) systems, i.e., communication with multiple transmit and receive antennas, where the channel coefficients are unknown. The end-to-end aspect refers to the joint optimization of both the signal constellation and decoder as it would interact with a simulated MIMO channel to transmit and receive messages. As noted in the literature [1, 2], this general concept is analogous to training an autoencoder, but with a noisy channel inserted between the encoder and decoder, which has led several works [1, 2, 3, 4, 6, 7, 8, 9, 11, 12] to use deep neural networks to realize both the encoder and decoder mappings. Related work [3] and [4] also consider the MIMO channel, although with channel state information (CSI) available, and the latter also examines a multi-user interference channel.

One aim of our paper is to reconsider the benefits of employing neural networks and demonstrate an effective learning-based approach that eschews them altogether. Mapping from a finite message space to channel symbols does not require a neural network encoder, since a lookup table storing the signal constellation is sufficient. Non-coherent MIMO decoding theory [14] guides us to a simplified decoder architecture that avoids employing neural networks, while still retaining the ability to perform simulation-driven optimization. We evaluate and compare this network-less approach versus employing a neural network decoder, and find that they perform comparably, although additional hyperparameter tuning for both could potentially further improve performance and stability.

We also use our learning-based approach to demonstrate that non-coherent MIMO communication is feasible even at extremely short coherence windows, i.e., with the channel coefficients stable for as few as two time slots. Unlike various conventional approaches [14, 15, 16, 17] to MIMO modulation design, which require limitations on time slots versus antennas, the learning-based approach is not so limited by analytical design constraints. Relaxing these constraints is also supported by the recent extension by [18] of MIMO capacity theory [19, 20], which shows that the conventional unitary, isotropically distributed inputs are no longer capacity achieving when antennas exceed time slots.

I-A Notation

We use uppercase/lowercase bold letters, e.g., and , to denote matrices/vectors. A circularly-symmetric Gaussian distribution with zero mean and variance is denoted by . We write to denote the conjugate transpose of , and to denote the identity matrix.

II Modulation Optimization for MIMO Systems

II-A Non-Coherent MIMO Channel

We consider transmission over a MIMO channel with transmitter antennas and receiver antennas. When transmitting a message using channel symbols, the received signal is an complex matrix given by

where is an complex matrix representing the transmitted signal, is the complex, random channel matrix, and is an complex matrix representing Gaussian noise. We focus on the non-coherent case where the random channel matrix is unknown (i.e., no CSI), but fixed over the channel uses. The elements of channel are i.i.d. and are independent of the noise , which is i.i.d. . We constrain the transmission to have average power , such that the signal-to-noise ratio (SNR) is given by .
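The channel model described above can be simulated in a few lines. The following is a minimal sketch rather than the implementation used in the experiments: the symbol names `T` (time slots), `M` (transmit antennas), and `N` (receive antennas), the ordering `Y = X H + Z`, and the unit-variance channel entries are all assumptions chosen for illustration.

```python
import numpy as np

def complex_gaussian(shape, variance, rng):
    """Draw i.i.d. circularly-symmetric complex Gaussian samples with the given variance."""
    scale = np.sqrt(variance / 2.0)  # half of the variance in each of real and imaginary parts
    return scale * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

def noncoherent_channel(X, N, noise_var, rng):
    """Pass a T x M transmitted block X through Y = X H + Z (assumed ordering).

    H (M x N) is drawn fresh for every block, held fixed over the T time slots,
    and never revealed to the receiver. Unit-variance channel entries are an
    assumption of this sketch.
    """
    T, M = X.shape
    H = complex_gaussian((M, N), 1.0, rng)
    Z = complex_gaussian((T, N), noise_var, rng)
    return X @ H + Z

rng = np.random.default_rng(0)
X = complex_gaussian((2, 1), 1.0, rng)   # hypothetical block: T = 2 time slots, M = 1 antenna
Y = noncoherent_channel(X, N=4, noise_var=0.1, rng=rng)
print(Y.shape)  # (2, 4)
```

The receiver only ever sees `Y`; the draw of `H` stays inside the function, which is what makes the setting non-coherent.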

II-B Encoder Parameterization

The encoder maps a -bit message to an symbol transmission across antennas. This encoder mapping, , can be generally parameterized by a simple lookup table specified by a codebook matrix . For power efficiency, the mean row of is subtracted from each row of to produce the centered codebook matrix . Then, the average power constraint is enforced by scaling to produce centered and normalized code matrix . To encode a message , the encoder mapping selects the row in indexed by the integer value of , and reshapes it to an matrix to form the transmitted signal . Essentially, this entire procedure is just to allow the signal constellation , which is constrained in its average power, to be parameterized by the unconstrained variable .
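The centering, normalization, and lookup steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the codebook is taken to be a real array whose columns hold real parts followed by imaginary parts, the power constraint is expressed as a target mean energy per codeword (`avg_energy`, a hypothetical parameter), and message bits are read most-significant-bit first. None of these conventions is confirmed by the text.

```python
import numpy as np

def build_constellation(theta, T, M, avg_energy):
    """Turn an unconstrained real codebook into a centered, power-normalized constellation.

    theta: real array of shape (2**k, 2*T*M); the first T*M columns hold real
    parts and the last T*M columns hold imaginary parts (an assumed layout).
    Returns a complex array of shape (2**k, T, M).
    """
    centered = theta - theta.mean(axis=0, keepdims=True)   # subtract the mean row
    energy = np.mean(np.sum(centered ** 2, axis=1))         # mean energy per codeword
    normalized = centered * np.sqrt(avg_energy / energy)    # enforce the average energy
    half = T * M
    complex_rows = normalized[:, :half] + 1j * normalized[:, half:]
    return complex_rows.reshape(-1, T, M)

def encode(constellation, bits):
    """Select the codeword indexed by the integer value of the bits (MSB first, assumed)."""
    index = int("".join(str(int(b)) for b in bits), 2)
    return constellation[index]

rng = np.random.default_rng(0)
k, T, M = 3, 2, 2                                  # hypothetical sizes
theta = rng.standard_normal((2 ** k, 2 * T * M))   # unconstrained parameters
constellation = build_constellation(theta, T, M, avg_energy=float(T))
X = encode(constellation, [1, 0, 1])               # codeword number 5, shape (T, M)
```

Because centering and scaling are differentiable functions of `theta`, a gradient-based optimizer can update `theta` freely while the constellation it induces always satisfies the average power constraint.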

II-C Decoder Realizations

We consider two parametric, soft-output decoders that approximate the unnormalized, log-likelihoods for each possible message, and thus output a real vector of length . For both decoders, the softmax operation is applied to the output vector (by exponentiating each element and then scaling to normalize the sum to one) to produce a stochastic vector, denoted by , that approximates the posterior distribution . Note that applying the softmax operation to the vector of unnormalized, log-likelihoods , for some constant , would yield the corresponding posterior distribution .
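The invariance of the softmax to an additive constant, which is why unnormalized log-likelihoods suffice, can be checked numerically. The values below are made up for illustration, and the interpretation of the result as a posterior assumes uniformly distributed messages.

```python
import numpy as np

def softmax(scores):
    """Exponentiate each element and normalize the sum to one (shifted for stability)."""
    weights = np.exp(scores - np.max(scores))
    return weights / np.sum(weights)

log_likelihoods = np.array([-3.2, -0.5, -7.1, -1.9])   # hypothetical decoder outputs
posterior = softmax(log_likelihoods)
shifted_posterior = softmax(log_likelihoods + 42.0)     # same vector plus a constant

print(np.allclose(posterior, shifted_posterior))  # True
```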

II-C1 Pseudo-ML (pML) Decoder

If the codewords are orthonormal, that is, for all messages , then the ML decoding rule is shown in [14] to be


since the terms are proportional to , for some that is constant with respect to . This decoder immediately inspires a soft-output decoder that simply scales the objective in (1) with a parameter to output


The parameter both accounts for the fact that is only proportional to , and allows the confidence of the decoder to be tuned, which is particularly important since it will be employed while enforcing the orthonormal constraint (i.e., ) in only a soft manner. Hence, we call this the pseudo-ML (pML) decoder. Smaller/larger indicates lower/higher confidence, as the corresponding posterior estimate (produced by applying the softmax operation) approaches uniform as and certainty as .
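The effect of the confidence parameter can be illustrated with a small sketch. It assumes the standard non-coherent ML metric for orthonormal codewords, namely the squared Frobenius norm of the received block projected onto each codeword, scaled by a parameter called `alpha` here. The symbol names and array layout are assumptions, and the exact expression in (2) may differ in normalization.

```python
import numpy as np

def softmax(scores):
    weights = np.exp(scores - np.max(scores))
    return weights / np.sum(weights)

def pml_scores(Y, constellation, alpha):
    """Return alpha * ||X_m^H Y||_F^2 for every candidate codeword X_m.

    Y: (T, N) received block; constellation: (num_messages, T, M).
    """
    # projections[m] = X_m^H Y, an M x N matrix for each message m
    projections = np.einsum("mtk,tn->mkn", constellation.conj(), Y)
    return alpha * np.sum(np.abs(projections) ** 2, axis=(1, 2))

# Two orthonormal codewords for T = 2 time slots and M = 1 transmit antenna.
constellation = np.array([[[1.0], [0.0]], [[0.0], [1.0]]], dtype=complex)
h = np.array([[0.6 + 0.8j]])      # one receive antenna, |h| = 1
Y = constellation[0] @ h          # noiseless reception of message 0

for alpha in (1e-6, 1.0, 1e3):
    print(alpha, softmax(pml_scores(Y, constellation, alpha)))
```

With a very small `alpha` the printed posterior is close to uniform, and with a very large `alpha` it places essentially all mass on message 0, matching the limiting behavior described above.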

II-C2 Neural Network (NN) Decoder

Alternatively, a soft-output decoder can be realized with a neural network, which serves as a parametric approximation for the mapping


where denotes the parameters specifying the weights of the neural network layers. The network is applied to the received signal to yield an approximation of the log-likelihoods, to which the softmax operation is applied to produce the corresponding posterior estimate .

The specific network architectures used in our experiments are detailed alongside discussion of the results in Section III-A. In order to handle a complex-valued matrix as input, is simply decomposed into its real and imaginary components and vectorized, i.e., is represented as a real vector of length .
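The real-valued input representation can be sketched as below. The ordering (all real parts first, then all imaginary parts) is an assumption; any fixed ordering would serve equally well as network input.

```python
import numpy as np

def to_real_vector(Y):
    """Stack the real parts, then the imaginary parts, of a complex matrix into one real vector."""
    return np.concatenate([Y.real.ravel(), Y.imag.ravel()])

Y = np.array([[1 + 2j, 3 - 4j], [0.5j, -1.0]])   # a hypothetical 2 x 2 received block
v = to_real_vector(Y)
print(v.shape)  # (8,)
```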

II-D Optimization Objective

The main optimization objective is to minimize the cross-entropy loss with respect to the encoder and decoder parameters, as given by


where is produced by applying the softmax operation to the log-likelihoods produced by either decoder given by (2) or (3), as described in Section II-C. Since the cross-entropy loss can be written as

the ideal optimization of the decoder should cause the estimated posterior to converge toward the true posterior , and the overall optimization is equivalent to maximizing the mutual information , with respect to the signal constellation, since is constant.
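For reference, the standard decomposition of an expected cross-entropy loss reads as follows. The notation is assumed for illustration: $m$ is the message, $\mathbf{Y}$ the received signal, $P$ the true posterior, and $Q$ the decoder's estimated posterior.

```latex
\begin{align}
\mathbb{E}\big[-\log Q(m \mid \mathbf{Y})\big]
  &= H(m \mid \mathbf{Y})
   + \mathbb{E}_{\mathbf{Y}}\Big[ D_{\mathrm{KL}}\big( P(\cdot \mid \mathbf{Y}) \,\big\|\, Q(\cdot \mid \mathbf{Y}) \big) \Big] \\
  &= H(m) - I(m; \mathbf{Y})
   + \mathbb{E}_{\mathbf{Y}}\Big[ D_{\mathrm{KL}}\big( P(\cdot \mid \mathbf{Y}) \,\big\|\, Q(\cdot \mid \mathbf{Y}) \big) \Big]
\end{align}
```

The KL term is non-negative and vanishes only when $Q$ matches $P$, so it is the part the decoder parameters can drive to zero. The entropy $H(m)$ is fixed by the message distribution, which leaves $-I(m; \mathbf{Y})$ as the part shaped by the signal constellation.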

As mentioned earlier, the pML decoder given by (2) is formulated assuming orthonormal codewords that satisfy for all . We enforce orthonormality as a soft constraint by introducing an additional orthonormal-loss term given by

The optimization objective that we use for the pML decoder is formed by combining this orthonormal loss with the primary cross-entropy loss as follows


where is a weighting parameter to control the impact of the orthonormal loss term. Note that rather than simply adding on the orthonormal loss term, i.e., using an objective of the form , the loss terms have been multiplicatively combined in (5). We found from experimentation that this improved the reliability of convergence, possibly since these loss terms might decay at very different rates, making it difficult to tune the hyperparameter in an additive combination.
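A soft orthonormality penalty of the kind described here can be sketched as follows. The exact expression for the orthonormal loss is not reproduced in this text, so the squared Frobenius deviation of each codeword's Gram matrix from the identity is an assumed, commonly used choice. The multiplicative combination in (5) is deliberately left out, since its exact form is not given here.

```python
import numpy as np

def orthonormal_penalty(constellation):
    """Mean over codewords of ||X_m^H X_m - I||_F^2 (an assumed form of the penalty).

    constellation: complex array of shape (num_messages, T, M).
    """
    # gram[m] = X_m^H X_m, an M x M matrix for each codeword
    gram = np.einsum("mtk,mtl->mkl", constellation.conj(), constellation)
    identity = np.eye(constellation.shape[2])
    return np.mean(np.sum(np.abs(gram - identity) ** 2, axis=(1, 2)))

# Two orthonormal codewords for T = 2 time slots and M = 1 transmit antenna.
orthonormal = np.array([[[1.0], [0.0]], [[0.0], [1.0]]], dtype=complex)
print(orthonormal_penalty(orthonormal))        # 0.0
print(orthonormal_penalty(2.0 * orthonormal))  # 9.0
```

Scaling the codewords by two gives a Gram matrix of 4 per codeword, hence a penalty of (4 - 1)^2 = 9, which shows that the penalty reacts to norm mismatch as well as to loss of orthogonality.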

Fig. 1: BLER performance comparison for , with (a) Neural Network Decoder, (b) Pseudo-ML Decoder.

Fig. 2: BLER performance comparison for , with (a) Neural Network Decoder, (b) Pseudo-ML Decoder.

III Experiments and Results

Our experiments evaluate communicating bits over channel uses. For time slots, we vary the number of receiver antennas , while keeping the number of transmit antennas fixed at , since theory [19, 20] teaches that unilaterally increasing transmit antennas does not increase capacity. We did also test increasing and found that it resulted in performance nearly identical to . For time slots, we vary both the number of transmit and receive antennas . For each operating point (combination of parameters ), we evaluated both the pML and NN decoders, by optimizing each across a variety of hyperparameters, and selecting the best performing codes. Further details about the network architectures and training procedures are given in Sections III-A and III-B.

Fig. 3: Examples of signal constellations learned with: (a) NN decoder, (b) pML decoder. Complex signal depicted for each antenna (across rows) and time slot (across columns).

The block error rate (BLER) performance results are shown in Figure 1 for and Figure 2 for , with the NN decoder results appearing on the left, and the pML decoder results appearing on the right. Note that for several operating points (seven for and two for ), the pML results exhibit large error floors, while the NN results generally do not. At other operating points, the NN and pML results are similar (although sometimes slightly better or worse). Due to time constraints, we searched over six times fewer hyperparameters (optimization instances) for the pML decoder experiments, which we believe plays a significant role in the optimization failing in some cases. For the NN experiments, there were similar optimization failures for other hyperparameters. Interestingly, despite the orthonormal loss term, only one operating point (, , ) resulted in the codebook for the pML decoder converging to orthonormal codewords. However, we did find that the presence of the orthonormal loss term improved the optimization success rate. Two examples of learned signal constellations are shown in Figure 3.

III-A Neural Network Architectures

We use two well-known neural network architectures, the multilayer perceptron (MLP) and the Residual MLP (ResMLP) [21, 22], to realize the neural network-based decoders discussed in Section II-C.

In the MLP architecture, the input vector is mapped to the output vector by applying a series of affine transformations and element-wise, nonlinear operations. The hidden (intermediate) layers and output layer (vector) of the network are given by

where are the affine transformation parameters that define the network, and denotes the element-wise application of the activation function . For all of our MLP networks, we used the rectified linear unit (ReLU) for the hidden layers (i.e., , for ) and the identity function for the output layer (i.e., ). Note that the dimensions of the weight matrices and bias vectors are constrained by the desired input, output, and hidden layer dimensions.
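The MLP forward pass described above amounts to the following sketch, with ReLU on the hidden layers and the identity on the output layer. The function and variable names are illustrative, and the tiny network is hand-picked so that the outputs can be verified by hand.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, weights, biases):
    """Apply affine maps with ReLU on hidden layers and the identity on the output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                    # hidden layer
    return weights[-1] @ h + biases[-1]        # output layer, no nonlinearity

# Tiny hand-checkable network: 2 inputs -> 1 hidden unit -> 1 output.
weights = [np.array([[1.0, -1.0]]), np.array([[2.0]])]
biases = [np.array([0.0]), np.array([1.0])]
print(mlp_forward(np.array([3.0, 1.0]), weights, biases))  # [5.]
print(mlp_forward(np.array([1.0, 3.0]), weights, biases))  # [1.]
```

In the second call the hidden pre-activation is negative, so ReLU zeroes it and only the output bias remains.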

In the ResMLP architecture, the input vector is first mapped to an initial hidden vector via an affine transformation, i.e.,

Then, over blocks, the hidden vector is updated according to


denotes the sequential application of batch-normalization [23], an activation function, and affine transform, as given by

Finally, the output is computed as
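A rough sketch of a ResMLP-style forward pass is given below. Only the ingredients named in the text are taken from it (an initial affine map, blocks built from batch-normalization, an activation, and an affine transform, and a final output computation). The exact update and output equations are not reproduced here, so the single skip connection per block, the ReLU activation, the plain affine readout, and the use of batch statistics in the normalization are all assumptions.

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    return gamma * (h - mean) / np.sqrt(var + eps) + beta

def residual_branch(h, params):
    """Batch-normalization, then ReLU, then an affine transform."""
    normalized = batch_norm(h, params["gamma"], params["beta"])
    return np.maximum(normalized, 0.0) @ params["W"].T + params["b"]

def resmlp_forward(x, W_in, b_in, blocks, W_out, b_out):
    h = x @ W_in.T + b_in                      # initial affine map to the hidden width
    for params in blocks:
        h = h + residual_branch(h, params)     # skip connection (assumed form)
    return h @ W_out.T + b_out                 # final affine readout (assumed form)

rng = np.random.default_rng(0)
batch, d_in, d_hidden, d_out = 16, 8, 32, 4    # hypothetical sizes
x = rng.standard_normal((batch, d_in))
W_in, b_in = 0.1 * rng.standard_normal((d_hidden, d_in)), np.zeros(d_hidden)
W_out, b_out = 0.1 * rng.standard_normal((d_out, d_hidden)), np.zeros(d_out)
blocks = [
    {"gamma": np.ones(d_hidden), "beta": np.zeros(d_hidden),
     "W": 0.1 * rng.standard_normal((d_hidden, d_hidden)), "b": np.zeros(d_hidden)}
    for _ in range(2)
]
out = resmlp_forward(x, W_in, b_in, blocks, W_out, b_out)
print(out.shape)  # (16, 4)
```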

III-B Training Procedures

We perform the optimization of the objectives given in Section II-D with stochastic gradient descent (SGD), specifically the popular Adam variant, which adaptively adjusts learning rates based on moment estimates. For each iteration, the expectations are approximated by the empirical mean over a batch of uniformly sampled messages, randomly drawn along with random channel matrices and noise for the transmission of each message. Training was performed for up to iterations, with early stopping applied to halt training when the objective fails to improve, while saving the best snapshot in terms of BLER. We implemented these experiments using the Chainer deep learning framework [25].
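The batch sampling described above (uniform messages, with a fresh channel matrix and noise draw for each one) can be sketched as a Monte Carlo BLER estimate. This is only the simulation half of a training iteration, written in NumPy rather than Chainer. The gradient step is omitted, and the channel ordering, unit-variance channel entries, hard-decision metric, and all names are assumptions.

```python
import numpy as np

def cn(shape, var, rng):
    """I.i.d. circularly-symmetric complex Gaussian samples with variance var."""
    return np.sqrt(var / 2.0) * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

def estimate_bler(constellation, N, noise_var, batch_size, rng):
    """Empirical block error rate over one batch of simulated transmissions."""
    num_messages, T, M = constellation.shape
    messages = rng.integers(num_messages, size=batch_size)   # uniform messages
    X = constellation[messages]                               # (B, T, M)
    H = cn((batch_size, M, N), 1.0, rng)                      # fresh channel per message
    Z = cn((batch_size, T, N), noise_var, rng)                # fresh noise per message
    Y = X @ H + Z                                             # (B, T, N)
    # scores[b, m] = ||X_m^H Y_b||_F^2, the assumed hard-decision metric
    projections = np.einsum("mtk,btn->bmkn", constellation.conj(), Y)
    scores = np.sum(np.abs(projections) ** 2, axis=(2, 3))
    decisions = np.argmax(scores, axis=1)
    return np.mean(decisions != messages)

# Two orthonormal codewords for T = 2 time slots and M = 1 transmit antenna.
constellation = np.array([[[1.0], [0.0]], [[0.0], [1.0]]], dtype=complex)
rng = np.random.default_rng(0)
bler_low_noise = estimate_bler(constellation, N=4, noise_var=1e-3, batch_size=2000, rng=rng)
bler_high_noise = estimate_bler(constellation, N=4, noise_var=100.0, batch_size=5000, rng=rng)
print(bler_low_noise, bler_high_noise)
```

With two equiprobable messages, the estimate should sit near zero when the noise is negligible and approach one half when the noise swamps the signal.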

For the NN decoder, we tried both the MLP and ResMLP architectures across the combination of layers/blocks and hidden layer dimensions. For the pML decoder, the main hyperparameter is just the weight in the objective function given by (5), which we varied across . For both decoders, an additional hyperparameter is the SNR used during training simulations, which we non-exhaustively varied from 10 dB to 30 dB in 5 dB increments, with one to three SNRs tried for each operating point.
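The training SNR grid translates into simulation noise levels as sketched below. The conversion assumes unit average signal power per symbol, so that the noise variance is the reciprocal of the linear SNR. The paper's own normalization may differ.

```python
import numpy as np

train_snr_db = np.arange(10, 31, 5)             # 10, 15, 20, 25, 30 dB
noise_var = 10.0 ** (-train_snr_db / 10.0)      # assumes unit signal power per symbol

for snr, var in zip(train_snr_db, noise_var):
    print(f"{snr} dB -> noise variance {var:.4f}")
```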

IV Discussion and Ongoing Work

Our experiments reevaluated the role of neural networks in learning-based approaches to communications. We demonstrated that neural networks can be avoided altogether while still realizing the fundamental concept of simulation-driven design optimization. We also used this approach to show the feasibility of non-coherent MIMO for coherence windows that are as short as two time slots.

Further experimentation and hyperparameter tuning are necessary to confirm our experimental observations, and are part of ongoing work. In particular, we believe that the convergence stability (and possibly the performance) of the pML decoder could be further improved with more tuning, since due to time constraints we had to explore a much smaller hyperparameter space.

The generalized log-likelihood ratio test (GLRT) decoder given by [17] does not require the codewords to be orthonormal, which would obviate the need for an orthonormal loss term. Due to the somewhat increased implementation and computational complexity, applying this GLRT decoder remains ongoing work.