I Introduction
In 2012, Alex Krizhevsky and his team presented a revolutionary deep neural network (DNN) at the ImageNet Large Scale Visual Recognition Challenge [11]. The network largely outperformed all competitors. This event triggered not only a revolution in the field of computer vision but has also affected many other engineering fields, including digital communications. In our specific area of interest, many new studies on machine learning for coding and communication theory have been published since 2016.
In our work, we address the case of multilevel symbol detection on multiple-input multiple-output (MIMO) channels via deep neural networks. Many algorithms exist to perform MIMO detection, with performance ranging from optimal to highly suboptimal. A first category of decoders includes sphere decoding methods based on lattice point enumeration and radius adaptation. The complexity of sphere decoding is far less prohibitive than an exhaustive search and is polynomial in the dimension for small dimensions. Detection based on sphere decoding is quasi-optimal and very competitive in terms of number of operations for dimensions less than 32 (up to 64 for non-dense MIMO lattices); however, it cannot be parallelized because of its sequential nature. Furthermore, the dynamic tree structure of sphere decoding makes it hardware-unfriendly.
In a second category we find linear receivers: the zero-forcing (ZF) detector and the minimum mean squared error (MMSE) detector.
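As a point of reference, both linear receivers can be sketched in a few lines. The function names, the unit-variance symbol assumption, and the final rounding step are illustrative choices, not the paper's exact formulation (row convention, y = xH + w):

```python
import numpy as np

def zf_detect(y, H):
    """Zero-forcing: invert the channel, then round to the integer symbol grid."""
    x_hat = y @ np.linalg.pinv(H)  # row convention: y = x H + w
    return np.rint(x_hat)

def mmse_detect(y, H, sigma2):
    """MMSE: regularized channel inverse that accounts for the noise variance.
    Assumes (for this sketch) zero-mean, unit-variance symbols."""
    n = H.shape[0]
    G = np.linalg.inv(H.T @ H + sigma2 * np.eye(n)) @ H.T
    return np.rint(y @ G)
```

As sigma2 goes to zero the MMSE filter degenerates into the ZF inverse; the regularization is what makes MMSE more robust on ill-conditioned channels.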
Finally, a non-exhaustive list of decoders with performance somewhere between these two categories includes the decision feedback equalizer (DFE), the K-best sphere decoder, message passing methods (e.g., belief propagation, approximate message passing, expectation propagation) and semidefinite relaxation.
While some of these algorithms are near-optimal in specific settings, their performance degrades significantly when these specific conditions are not met.
As a result, the problem of finding hardware-friendly, low-complexity methods exhibiting near-optimal performance in most settings remains open.
Neural-network-based implementations could offer new solutions.
MIMO detection with neural networks has already been investigated by several research groups. In [12][13], the quadratic form of the MIMO channel is used to build the network. In [15][9][8][6], suboptimal message passing iterative MIMO decoders are improved with the approach introduced in [7][10]. The main idea of these studies is to unfold the graph underlying an iterative algorithm to obtain improvements via learning. Simulations show that in most cases learning enhances the performance of the considered algorithm. Nonetheless, these results are almost never compared to optimal detection, so it is difficult to assess the real efficiency of such an approach. Additionally, most studies consider binary inputs only. In [13], one-hot encoding is used to address the case of non-binary inputs. Unfortunately, the number of output neurons increases significantly with the spectral efficiency, making this solution impractical.
II Problem settings and network used
In this paper we use row convention for vectors and matrices. We consider a symmetric flat quasi-static MIMO channel with n transmit antennas and n receive antennas. Let H be the n x n matrix representing the channel coefficients. For simplicity, it is assumed that H has real entries; any complex matrix of size n x n can be trivially transformed into a real matrix of size 2n x 2n. Let x be the channel input, i.e., x is the uncoded information sequence. The input message x yields the output y via the standard flat MIMO channel equation

y = xH + w,

where w is a Gaussian vector with i.i.d. components. The optimal decoder, also called the Bayes decoder in the machine learning community, implements the maximum a posteriori (MAP) criterion. A near-optimal neural network detector should implement a function that approximates the MAP criterion. In our settings, the MAP criterion is equivalent to finding the closest possible x, closest in the Euclidean sense, as expressed by the following equation:

x^ = arg min_{x in C} ||y - xH||^2,

where C is the finite MIMO constellation.
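For small dimensions, this criterion can be checked by brute force; a minimal sketch, where the function name and the integer alphabet are illustrative:

```python
import itertools
import numpy as np

def mld_bruteforce(y, H, levels, n):
    """Exhaustive MLD: enumerate the finite constellation and keep the
    point whose image under the channel is closest to y (Euclidean sense)."""
    best, best_dist = None, np.inf
    for candidate in itertools.product(levels, repeat=n):
        x = np.array(candidate, dtype=float)
        dist = np.sum((y - x @ H) ** 2)
        if dist < best_dist:
            best, best_dist = x, dist
    return best
```

The loop visits |levels|^n candidates, which is exactly the exponential cost that sphere decoding and the neural approach aim to avoid.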
Neurons in a regular DNN include a nonlinear activation function, such as the sigmoid function or the rectified linear unit [4]. In the sequel, the standard sigmoid function is employed.

II-A Architecture
In [12], the architecture of the network is inspired by the projected gradient descent:

x^_{k+1} = P(x^_k + d_k (y - x^_k H) H^T),

where P is a projection operator and d_k a step size. Our neural network embraces the same paradigm: it takes the form of an iterative algorithm where an estimate of the output is available after each iteration. It is illustrated in Figure 1. A generic iteration has two layers, as shown in the figure, where the network structure is derived from matrix equations of the same gradient-descent form. In the update expression, we can clearly recognize the terms used by the gradient descent, weighted by learned matrices instead of a scalar step size (the two other terms are a hidden variable and a bias term commonly used in neural networks). The intuition behind this expression is that the network will learn specific learning rates for each iteration and each component. The operation performed between one layer and the next can be interpreted as the projection operator. The activation function used is described in the next section.
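The gradient-descent update that each iteration builds on can be sketched as follows; the scalar step `delta` and the hard clipping stand in for the learned per-component weights and the trained projection nonlinearity, so all names here are illustrative:

```python
import numpy as np

def detector_iteration(x_hat, y, H, delta, lo, hi):
    """One projected-gradient-style step: move the estimate along the
    descent direction of ||y - xH||^2, then project onto the symbol range."""
    grad_step = (y - x_hat @ H) @ H.T   # descent direction for ||y - xH||^2
    x_hat = x_hat + delta * grad_step   # 'delta' mimics the learned step sizes
    return np.clip(x_hat, lo, hi)       # crude stand-in for the projection
```

In the trained network, the scalar step is replaced by iteration- and component-specific learned weights, and the clipping by the multilevel activation described below.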
In [12], the matter of how the estimate should be initialized for the first iteration of the neural network is not discussed. We address, and take advantage of, this question in the section on the twin-network.
II-B The multilevel activation function
The default approach to address a multiclass problem with neural networks is to use the so-called “one-hot encoding”. Namely, if the network should classify data into more than two categories, say k categories, it will have k output neurons, where the only legal combinations of values are those with a single neuron equal to 1 and all the others equal to 0. Unfortunately, this approach implies a large number of output neurons. In the network of Figure 1, if each component of the input message can take several levels, one-hot encoding requires one output neuron per level for each component (the output neurons of Figure 1) instead of one neuron per component in the binary case. This implies a greater complexity as well as longer training.

To address this issue we introduce a novel activation function: we adapt the nonlinearity in the output neurons to take into account non-binary symbols. Our customized sigmoid function shall be defined as a sum of standard sigmoids,

sigma_c(t) = sum_j sigma(t - tau_j) + kappa,

where the tau_j are sigmoid shifts and kappa is an overall translation. As an example, a customized sigmoid of this form is depicted in Figure 2.
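A direct implementation of this activation, with illustrative names for the shift and translation parameters:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def multilevel_sigmoid(t, shifts, translation=0.0):
    """Sum of shifted standard sigmoids: each shift contributes one 'step',
    so L-1 shifts yield a staircase spanning L output levels."""
    return sum(sigmoid(t - s) for s in shifts) + translation
```

Far below every shift the output sits near the translation value; far above, near translation + len(shifts). A single output neuron can therefore address a multilevel symbol.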
II-C The twin-network
To further improve our system, we considered the paradigm of a random forest [14]: “divide and conquer”. With a random forest, many decision trees are trained on a random subset of the training data with a randomly picked subset of dimensions. One decision tree alone tends to overfit heavily, but the random forest, based on the aggregation of the trees and a majority decision rule, yields very good and consistent results. The important idea is to introduce some randomness between the trees. The concept of a random forest is analogous to the extreme pruning successfully utilized by the cryptography community for sphere decoding [3]: trees with a low success rate are built, and the operation is repeated many times with different bases of the lattice. The authors observed that the complexity decreases much faster than the performance deteriorates. This concept was also successfully used in ordered statistics decoding two decades ago. Therefore, in case of suboptimality of the network, a solution can be to duplicate the network and introduce randomness instead of increasing the number of parameters in the DNN. An easy way to introduce randomness is to initialize the neural networks with distinct initial estimates obtained in different manners. An instance of such a system is illustrated in Figure 3. The first DNN is initialized with a random estimate, while the second DNN receives an initial estimate obtained by ZF.

III Training statistics
Only a limited number of studies discuss which training statistics should be used for efficient training of a neural-based decoder. In [5], the authors introduce the notion of Normalized Validation Error (NVE) to investigate which SNR is best suited for efficient training. They empirically observed that an SNR neither too high nor too low is the most efficient. In most papers, authors mix noisy data obtained at different SNRs to perform training, in the hope that the network is efficient at all those SNRs.
To the best of our knowledge, in all papers on neural networks for decoding, the input message x associated to a noisy received signal y is used as the label for training.
Regardless of the noise, the label that should be used for a given y is what would have been decoded by the optimal decoder, not the transmitted sequence. Consider for instance a simple BPSK. If the noise moves a point (e.g. +1) beyond the decoding threshold (e.g. to -0.2), one should not tell the neural network to try to recover the original point (here +1): it should decode the point associated to the region the received y belongs to (here -1).
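The BPSK example can be made concrete in a few lines (the numbers follow the text; the variable names are illustrative):

```python
# BPSK: symbols are +1/-1 and the optimal decision threshold is 0.
transmitted = +1.0
received = -0.2          # noise pushed the point across the threshold

naive_label = transmitted                       # common practice: use the input
bayes_label = +1.0 if received > 0.0 else -1.0  # what the optimal decoder returns

# Training should use bayes_label (-1 here), not naive_label (+1).
```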
Let us call C a given constellation/code/lattice that we want to train a network to decode, and x an element of C. Leaving aside the notion of SNR, the optimal decoder (which we could also call the Voronoi classifier) performs the following operation: given a point y (anywhere) in the space of C, it finds the x associated to the decoding (Voronoi) region where y is located. Moreover, if we want the network to learn the entire structure of C, the training samples should be composed of points sampled randomly in its space. Equivalently, one can randomly choose elements of C (with equiprobability) and add uniformly distributed noise.
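A sketch of this sampling procedure, where an explicit nearest-point search plays the role of the Voronoi classifier. The names and the noise spread are illustrative, and the exhaustive labeling is only affordable for small constellations:

```python
import numpy as np

rng = np.random.default_rng(0)

def training_batch(C, H, batch_size, spread):
    """Draw constellation points with equal probability, add uniform noise,
    and label each noisy point with its closest constellation point."""
    idx = rng.integers(len(C), size=batch_size)
    noise = rng.uniform(-spread, spread, size=(batch_size, H.shape[1]))
    y = C[idx] @ H + noise
    # Voronoi / optimal-decoder label: nearest point of C (through H) to y
    dists = np.sum((y[:, None, :] - (C @ H)[None, :, :]) ** 2, axis=2)
    labels = C[np.argmin(dists, axis=1)]
    return y, labels
```

With Gaussian instead of uniform noise, the same routine produces the training statistics discussed in the remainder of this section.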
Nevertheless, to get quasi-maximum-likelihood decoding (MLD) performance on the Gaussian channel, the network does not need to learn the entire structure of C but rather the most relevant decision boundaries around the points of C. Indeed, some regions along the boundaries are so far from x that the Gaussian noise almost never sends y to those regions. Therefore, a quasi-MLD network can potentially make many simplifications compared to a perfect MLD network and thus reduce its complexity. These simplifications can be learned by training the network with Gaussian noise.
Unfortunately, getting MLD labels can be very costly (especially compared to using the input message x): every sample has to be decoded with the optimal decoder and potentially stored. Hence, if we were to use x as the label for training due to limited resources, what SNR should be used on the Gaussian channel? In light of the above discussion, we want the network to learn the necessary structure of the code to get quasi-MLD performance (i.e., the SNR should not be too high), but the “noise” in the labels (i.e., messages that are wrongly labeled w.r.t. the optimal decoder) should not be too high either. Empirically, we observed that the SNR corresponding to an error probability of 10^-2 is a good trade-off: only one sample out of 100 is mislabeled, but the SNR is low enough to properly explore C.

IV Simulation results
In this section we present the neural network performance observed under several settings.
For each of these settings, the results we report correspond to the best complexity-performance trade-off we obtained, i.e., we decreased the network size as much as possible while keeping near-MLD performance.
For the first set of simulations, depicted in Figure 4, the settings are the following. We take a multilevel alphabet on each component of x. The MIMO channel is a static channel randomly sampled from an i.i.d. Gaussian matrix. The considered matrix instance has a high condition number (as a real lattice), i.e., this is a bad channel realization and an interesting challenge for our DNN. Additionally, we used the multilevel activation function. The training is done in a regular way with the Adam optimizer and a small batch size. The resulting twin-DNN is about 10 times smaller, in number of parameters, than the baseline.
We observe that the twin-network DNN performance is close to the MLD performance and clearly outperforms the single DNN (we show only the curve for the randomly initialized single DNN because it matches the one initialized with the ZF point). This means that, with different initializations, the two single DNNs are almost never wrong at the same time (except for cases that cannot be recovered by the optimal decoder). Hence, this approach can be beneficial to improve a suboptimal neural network.
The second set of simulations was performed under the same settings as described above, but a larger batch size is used to train the network and the size of the hidden layer is decreased. In Figure 5, we show a significant improvement of performance in the single-DNN case: within just three iterations and with a decreased network size we manage to get near-MLD performance.
We do not believe that the improvement is caused by the larger amount of data used to train the network: firstly, in the small-batch simulations we let the networks learn for a sufficiently long time; secondly, the convergence to quasi-MLD performance with a large batch size is very fast.
We rather believe that a non-noisy gradient is better suited for efficient learning in our settings.
In this work, we also aim at comparing the performance of multilevel activation functions and one-hot encoding. Note that one-hot encoding associated with the softmax activation function yields soft outputs. Hence, we modify the network used in Figure 5 by replacing each multilevel output neuron (i.e., the output neurons in Figure 1) by one neuron per level to get soft outputs. Moreover, we used 10 iterations. The result obtained is depicted in Figure 6. We observe that we do not manage to get quasi-optimal performance as in Figure 5. Additionally, the training phase of this network took significantly more time than the previous one and required much more fine-tuning of hyperparameters. To summarize, this network is more complex and harder to train.
Finally, we perform a last simulation on the MIMO channel used in [12]. The associated matrix is ill-conditioned, which makes it challenging for linear detectors but not necessarily for the sphere decoder. We use the multilevel activation function on the output neurons. We observe in Figure 7 that this situation is well handled by our neural network.
The complexity of the different models presented in this section is summarized in Figure 8.
We plot the number of parameters (number of edges) of the network as a function of the cardinality of the constellation (obtained as the number of levels raised to the dimension). We also write in blue the complexity of the network used in [12] for the MIMO matrix of [12].
We believe that the number of parameters indicated in blue could be reduced without degrading the performance if a larger batch size were used in training.
In light of these results, we may conclude that deep learning, with the proposed approach, is competitive for a large range of MIMO channels. However, deep learning in some extremal situations is difficult to set up, namely for specific channels where the function to be approximated is very challenging. For instance, if the MIMO channel is the generator matrix of a dense lattice [1], the function to learn is more complex (see next section), and even a neural network with a large number of iterations and an increased size for each layer fails to achieve near-MLD performance, as shown in Figure 9. Fortunately, these extremal communication channels are rarely encountered.
V Connection with (infinite) lattice decoding
Lattice modeling of the MIMO channel is not always successful because the finite number of levels induces a finite constellation: the MLD point in the (infinite) lattice can be outside the finite MIMO constellation. With the regular sphere decoder, it is possible to bound the number of states that each component of x can take and overcome this issue. However, if complexity reduction techniques such as basis reduction are used as preprocessing, this issue is difficult to avoid. Similarly, the hyperplane logical decoder (HLD) introduced in [2], a neural network based lattice decoder, cannot be used for MIMO detection (i.e., it leads to disappointing performance) because it can detect messages which are not in the finite constellation.

In this section, we present a new strategy to avoid this issue while using a lattice-based approach. Namely, we show how the detection can be performed in the fundamental parallelotope, given a quasi-Voronoi-reduced lattice basis (see [2]), while still detecting only messages belonging to the finite alphabet. This leads to both:

- A better understanding of the hardness of the problem that the neural network should solve.

- A new strategy for lattice-based multilevel MIMO detection with neural networks.
We present the approach in four steps. Consider that the i-th component of x is to be detected.

Step 1: Go into the fundamental parallelotope and consider only the first coordinates of the reduced point.

Step 2: Compute the decision boundary function (in pink in Figure 10).

Step 3: Go back to the original location, in the coordinate system defined by the lattice basis B.

Step 4: Apply the multilevel sigmoid function, with shifts matched to the symbol levels.
The main operational cost of this algorithm is due to the decision boundary function. It is closely related to the Boolean equation of the HLD and can be computed with a DNN.
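Step 1, the reduction of a received point into the fundamental parallelotope, can be sketched as follows (the row-vector basis convention and the function name are illustrative):

```python
import numpy as np

def reduce_to_parallelotope(y, B):
    """Map y into the fundamental parallelotope of the lattice with (row)
    basis B by removing the integer part of y's coordinates in that basis."""
    coords = y @ np.linalg.inv(B)        # coordinates of y in the basis B
    shift = np.floor(coords)             # integer part: a lattice translation
    return (coords - shift) @ B, shift   # reduced point, and the shift used
```

The reduced point feeds the boundary-function computation of Step 2, while the stored shift lets Step 3 map the decision back to the original location.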
References
 [1] J. Conway and N. Sloane. Sphere packings, lattices and groups. Springer-Verlag, New York, 3rd edition, 1999.
 [2] V. Corlay, J.J. Boutros, P. Ciblat, and L. Brunel, “Neural Lattice Decoders,” arXiv preprint arXiv:1807.00592, July 2018.
 [3] N. Gama, P.Q. Nguyen, and O. Regev, “Lattice Enumeration Using Extreme Pruning,” EUROCRYPT 2010, vol. 6110, Springer, 2010.
 [4] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. The MIT Press, 2016.
 [5] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learningbased channel decoding,” Conference on Information Sciences and Systems, March 2017.
 [6] H. He, C.-K. Wen, S. Jin, and G. Ye Li, “A Model-Driven Deep Learning Network for MIMO Detection,” arXiv preprint arXiv:1809.09336, Sept. 2018.
 [7] J.R. Hershey, J. Le Roux, and F. Weninger, “Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures,” arXiv preprint arXiv:1409.2574, Nov. 2014.
 [8] M. Imanishi, S. Takabe, and T. Wadayama, “Deep Learning-aided iterative detector for massive overloaded MIMO channels,” arXiv preprint arXiv:1806.10827, June 2018.
 [9] X. Liu and Y. Li, “Deep MIMO Detection Based on Belief Propagation,” IEEE Information Theory Workshop (ITW), Guangzhou, China, Nov. 2018.
 [10] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, Illinois, pp. 341-346, Sept. 2016.
 [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), vol. 1, pp. 1097-1105, Dec. 2012.
 [12] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” arXiv preprint arXiv:1706.01151, June 2017.
 [13] N. Samuel, T. Diskin, and A. Wiesel, “Learning to Detect,” arXiv preprint arXiv:1805.07631, May 2018.
 [14] S. ShalevShwartz and S. BenDavid. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
 [15] X. Tan, W. Xu, Y. Be’ery, Z. Zhang, X. You, and C. Zhang, “Improving Massive MIMO Belief Propagation Detector with Deep Neural Network,” arXiv preprint arXiv:1804.01002, Apr. 2018.