Multi-input multi-output (MIMO) is a key technology in the fifth-generation (5G) mobile communication system, where it is mainly used to improve spectrum efficiency and channel capacity. In a MIMO system, several transmitting and receiving antennas are used simultaneously at the transmitter and the receiver. Compared with the traditional single-input single-output (SISO) system, a MIMO system can make full use of space resources and increase the channel capacity without increasing the bandwidth, since each antenna at the receiver receives the signals transmitted from all the transmitting antennas simultaneously. It has therefore been widely applied in various wireless communication systems for its connection reliability and high transmission rate. However, recovering the true signals after MIMO transmission is very challenging due to noise and intersymbol interference. The MIMO detection problem is known to be NP-hard.
I-A Related Work
Many algorithms have been proposed to address the MIMO detection problem. Among these algorithms, maximum likelihood detection (MLD) is able to find the globally optimal solution, as it searches over all possible transmitted signals exhaustively. Its time complexity is thus prohibitively high. It therefore has little practical use, but it is commonly used as a baseline for measuring the performance of a detector.
To reduce the computational complexity, linear detection algorithms such as the matched filter (MF), zero-forcing (ZF) and minimum mean square error (MMSE) detectors have been developed. It is acknowledged that the accuracy of these detectors tends to be poor. Thus, many nonlinear detection algorithms have been proposed that trade off computational complexity against detection accuracy. Representative algorithms include sphere decoding (SD), semidefinite relaxation (SDR), and approximate message passing (AMP). While having lower complexity than the MLD, these algorithms achieve only suboptimal detection performance.
Owing to the development of advanced optimization algorithms and the rapid growth of computing power, machine learning has made great achievements. Machine learning algorithms have been widely applied in many research areas, such as global optimization [10, 11, 12] and systems biology, and in industrial products such as Apple Siri, to name a few. They have also been successfully applied to the MIMO detection problem. For example, Huang et al. proposed to convert the MIMO detection problem into a clustering problem, which is then solved by the expectation-maximization algorithm. Simulation experiments showed that this method can achieve near-MLD performance when the channel is fixed and perfectly known. However, it is not applicable to varying channels and hence has very limited practical use. Elgabli et al. reformulated the MIMO detection problem as a Lasso problem, which is then optimized by a two-stage ADMM algorithm. Experiments showed that it behaves well compared to classical detectors.
In the first category, deep learning techniques are used as feature extractors. Yan et al. proposed a detector called AE-ELM for the MIMO-OFDM (orthogonal frequency division multiplexing) system, in which an auto-encoder (AE) network is combined with an extreme learning machine (ELM) to recover the transmitted signal. The AE extracts features of the received signal, and the ELM classifies these features. Experiments showed that AE-ELM can achieve a detection performance similar to the MLD in MIMO-OFDM systems.
In deep learning, how to design the structure of a neural network is a key issue that requires great attention. Recent developments address this issue through the learning to learn approach. The main idea is to unfold an iterative algorithm (or a family of algorithms) to a fixed number of iterations. Each iteration is considered as a layer, and the unfolded structure is called a deep neural network. This model-driven deep learning approach can match or exceed the performance of the corresponding iterative algorithms, since the advantages of the model-driven and data-driven approaches complement each other.
A few learning to learn approaches have been developed for continuous and combinatorial optimization problems. Andrychowicz et al. proposed to learn the descent direction with a long short-term memory (LSTM) recurrent neural network (RNN) for continuous optimization problems with differentiable objective functions. Li et al. [25] proposed a learning to learn approach for black-box optimization problems, in which an LSTM is used to model the iterative change. The learned algorithm compares favorably against Bayesian optimization for hyper-parameter tuning. Dai et al. proposed to learn heuristics for combinatorial optimization problems such as the travelling salesman problem (TSP) and maximum cut. Experimental results showed that their algorithm performs well and can generalize to combinatorial optimization problems of different sizes.
In the area of MIMO detection, learning to learn approaches have also been developed. For example, Samuel et al. proposed a detector called DetNet, which is the unfolding of a projected gradient descent algorithm. The iteration of the projected gradient descent algorithm is of the following form:

$$\hat{\mathbf{x}}_{k+1} = \Pi\left[\hat{\mathbf{x}}_k - \delta_k \left.\frac{\partial \|\mathbf{y} - \mathbf{H}\mathbf{x}\|^2}{\partial \mathbf{x}}\right|_{\mathbf{x} = \hat{\mathbf{x}}_k}\right] = \Pi\left[\hat{\mathbf{x}}_k + 2\delta_k \mathbf{H}^T(\mathbf{y} - \mathbf{H}\hat{\mathbf{x}}_k)\right] \tag{1}$$

where $\mathbf{H}$ is the channel matrix, $\mathbf{y}$ is the received signal, $\hat{\mathbf{x}}_k$ is the estimate in the $k$-th iteration, $\delta_k$ is the step size, and $\Pi[\cdot]$ is a non-linear projection operator. A neural network is designed to approximate the projection operator. Simulation experiments show that DetNet achieves state-of-the-art performance while maintaining low computational complexity. However, in terms of detection accuracy, there is still a big gap between DetNet and the MLD. In fact, no detector in the literature is comparable with the MLD in the varying channel scenario. In addition, DetNet requires that the number of receiving antennas be larger than the number of transmitting antennas. This condition may not always hold in practice.
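The projected gradient iteration that DetNet unfolds can be sketched in a few lines. This is only an illustration under stated assumptions, not DetNet itself: the learned projection is replaced by a simple clipping onto the box $[-1, 1]$ followed by a final sign decision (which suits BPSK-like symbols), the step size is fixed to the reciprocal of the squared spectral norm of $\mathbf{H}$, and the function name `projected_gradient_detect` is hypothetical.

```python
import numpy as np

def projected_gradient_detect(H, y, n_iters=300):
    """Projected gradient descent on 0.5 * ||y - Hx||^2 for symbols in {-1, +1}.

    Assumption: the projection is a clip onto [-1, 1]; DetNet learns it instead.
    """
    step = 1.0 / np.linalg.norm(H, 2) ** 2      # 1 / (largest singular value)^2
    x = np.zeros(H.shape[1])
    for _ in range(n_iters):
        grad = H.T @ (H @ x - y)                # gradient of 0.5 * ||y - Hx||^2
        x = np.clip(x - step * grad, -1.0, 1.0) # gradient step, then projection
    return np.sign(x)                           # hard decision on the relaxed estimate

# Noiseless toy example with more receive than transmit dimensions.
rng = np.random.default_rng(0)
H = rng.standard_normal((16, 4))
x_true = np.array([1.0, -1.0, -1.0, 1.0])
y = H @ x_true
x_hat = projected_gradient_detect(H, y)
```

In DetNet, each pass through the loop body becomes one network layer with its own trainable step size and projection.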
Gao et al. proposed to simplify the structure of DetNet by reducing the input, changing the connection structure from fully connected to sparse, and modifying the loss function. These simplifications reduce the complexity and improve the detection accuracy to some extent. Corlay et al. proposed to replace the sigmoid activation function used in DetNet with a multi-plateau version and to use two networks with different initial values to detect the transmitted signals simultaneously. The detection performance can be improved by selecting the solution with the smaller loss. Similar to DetNet, OAMP-Net is designed by unfolding the orthogonal AMP (OAMP) algorithm. It is claimed that OAMP-Net requires a short training time and is able to adapt to varying channels. However, OAMP-Net assumes that the variance of the noise is known, which is not realistic.
For large-scale overloaded MIMO channels (i.e., the number of receiving antennas is less than the number of transmitting antennas), Imanishi et al. proposed a trainable iterative detector (TI-detector) based on the iterative soft thresholding algorithm (ISTA). In their algorithm, the sensing matrix in ISTA is replaced by the pseudo-inverse of the channel matrix. For low-order modulation, the TI-detector achieves detection performance comparable to the IW-SOAV (iterative weighted sum-of-absolute value) optimization algorithm with low complexity. However, the performance of the TI-detector deteriorates seriously as the modulation order becomes higher. Tan et al. proposed to use a deep neural network (DNN) to improve the performance of the belief propagation (BP) algorithm for MIMO detection. Their methods, named DNN-BP and DNN-MS, are the unfoldings of two modified BP detectors (damped BP and max-sum BP), respectively. It was claimed that DNN-BP and DNN-MS improve the detection accuracy of the BP algorithm.
Although those learning to learn approaches have achieved substantial improvement over classical detectors, there is still a big gap between these detectors and the MLD.
I-B Main contributions
In this paper, we propose a novel learning to learn method, named learning to learn iterative search algorithm (LISA), for MIMO detection. Different from existing learning to learn approaches, we do not unfold an existing iterative detection algorithm into a deep network. Rather, we first propose to iteratively construct a solution to the MIMO detection problem, taking deep neural networks as building blocks. The constructive algorithm is then unfolded into a shallow network. In the construction process, the symbols of the transmitted signal are recovered one by one. In detecting each symbol, soft decisions are provided to meet the requirements of communication systems.
Experimental results show that LISA has a strong generalization ability. Once trained, LISA can adapt to different models and signal-to-noise ratio (SNR) levels. It also performs well for both fixed and varying channels. Experimental results also show that LISA performs significantly better than existing learning to learn approaches for fixed and varying channel models at different modulation orders and SNRs. Surprisingly, LISA can achieve near-MLD performance in both fixed and varying channel scenarios with QPSK modulation.
The rest of the paper is organized as follows. In Section II, the MIMO detection problem and its reformulation by QL-decomposition are presented. In Section III, we present LISA for the fixed and varying channels, respectively. Experimental results are given in Section IV. Section V concludes the paper.
II The Problem
In this section, we describe the MIMO model and review the reformulation of MIMO detection based on QL-decomposition.
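Throughout the paper, we use lowercase letters to denote scalars (e.g., $x$), boldface lowercase letters to denote vectors (e.g., $\mathbf{x}$), and boldface uppercase letters to denote matrices (e.g., $\mathbf{X}$). $\mathbf{X}^T$ denotes the transpose of the matrix $\mathbf{X}$. $\mathcal{CN}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the complex Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. $\Re(\cdot)$ and $\Im(\cdot)$ denote the real and imaginary parts of a complex number, respectively.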
II-B The MIMO detection problem
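Consider a MIMO system with $N_t$ transmitting antennas at the transmitter and $N_r$ receiving antennas at the receiver. Assume that at some time step $t$, the transmitted symbol vector at the transmitter is $\bar{\mathbf{x}} = [\bar{x}_1, \ldots, \bar{x}_{N_t}]^T \in \bar{\mathcal{A}}^{N_t}$, where $\bar{x}_i$ represents the transmitted symbol from the $i$-th transmitting antenna, and $\bar{\mathcal{A}}$ is a finite alphabet related to the constellation.
The MIMO detection problem is to recover $\bar{\mathbf{x}}$ from the received symbol vector $\bar{\mathbf{y}} \in \mathbb{C}^{N_r}$ observed at the receiver. Here $\bar{\mathbf{y}}$ can be described as follows:

$$\bar{\mathbf{y}} = \bar{\mathbf{H}}\bar{\mathbf{x}} + \bar{\mathbf{n}} \tag{2}$$

where $\bar{\mathbf{H}} \in \mathbb{C}^{N_r \times N_t}$ represents the complex channel matrix, and each element $\bar{h}_{ij}$ is the path gain from the $j$-th transmitting antenna to the $i$-th receiving antenna. The value of $\bar{h}_{ij}$ is drawn from the i.i.d. complex Gaussian distribution with zero mean and unit variance. $\bar{\mathbf{n}}$ represents the additive white Gaussian noise (AWGN) at the receiver, i.e., $\bar{\mathbf{n}} \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I})$.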
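It is difficult to solve the complex-valued MIMO channel model directly. Without loss of generality, we use an equivalent real-valued channel model, obtained by separating the real and imaginary parts of the transmitted and received symbol vectors. The following equations show how to convert the complex-valued model to a real-valued one:

$$\mathbf{y} = \begin{bmatrix}\Re(\bar{\mathbf{y}}) \\ \Im(\bar{\mathbf{y}})\end{bmatrix},\quad \mathbf{x} = \begin{bmatrix}\Re(\bar{\mathbf{x}}) \\ \Im(\bar{\mathbf{x}})\end{bmatrix},\quad \mathbf{n} = \begin{bmatrix}\Re(\bar{\mathbf{n}}) \\ \Im(\bar{\mathbf{n}})\end{bmatrix},\quad \mathbf{H} = \begin{bmatrix}\Re(\bar{\mathbf{H}}) & -\Im(\bar{\mathbf{H}}) \\ \Im(\bar{\mathbf{H}}) & \Re(\bar{\mathbf{H}})\end{bmatrix} \tag{3}$$

In this way, the MIMO channel model can be written as:

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n} \tag{4}$$

where $\mathbf{y} \in \mathbb{R}^{2N_r}$, $\mathbf{H} \in \mathbb{R}^{2N_r \times 2N_t}$, $\mathbf{x} \in \mathcal{A}^{2N_t}$, $\mathbf{n} \in \mathbb{R}^{2N_r}$, and $\mathcal{A} = \Re(\bar{\mathcal{A}})$, i.e., the elements in $\mathcal{A}$ are the real parts of the elements in $\bar{\mathcal{A}}$. Without loss of generality, we assume the size of $\mathcal{A}$ is $M$. Its value depends on the modulation mode.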
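The conversion from the complex-valued model to the real-valued one can be checked numerically. The sketch below stacks real and imaginary parts and verifies that the real-valued product reproduces the complex one; the helper name `to_real_model` and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def to_real_model(H_c, y_c):
    """Stack real and imaginary parts to obtain the equivalent real-valued model."""
    H = np.block([[H_c.real, -H_c.imag],
                  [H_c.imag,  H_c.real]])
    y = np.concatenate([y_c.real, y_c.imag])
    return H, y

rng = np.random.default_rng(0)
n_r, n_t = 4, 2
# i.i.d. complex Gaussian channel with zero mean and unit variance.
H_c = (rng.standard_normal((n_r, n_t))
       + 1j * rng.standard_normal((n_r, n_t))) / np.sqrt(2)
# QPSK symbols: real and imaginary parts both drawn from {-1, +1}.
x_c = rng.choice([-1.0, 1.0], n_t) + 1j * rng.choice([-1.0, 1.0], n_t)
y_c = H_c @ x_c                                  # noiseless received vector

H, y = to_real_model(H_c, y_c)
x = np.concatenate([x_c.real, x_c.imag])         # real-valued symbol vector
```

For QPSK, the real-valued alphabet is simply $\{-1, +1\}$, so each complex symbol becomes two binary decisions.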
In the MIMO detection problem, we assume perfect channel state information. The goal is to recover the transmitted symbol vector $\mathbf{x}$ as accurately as possible given the received signal $\mathbf{y}$.
The best detection algorithm for the MIMO detection problem is the maximum likelihood detector (MLD), which solves the problem based on the maximum likelihood criterion:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathcal{A}^{2N_t}} \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_2^2 \tag{5}$$

The MLD searches all possible solutions in the solution space and selects the one that minimizes the objective function as the transmitted signal. It has a prohibitively high time complexity (exponential in the number of transmitting antennas $N_t$). Clearly, the MLD is not applicable for large $N_t$. We have found that no existing detector is able to match the MLD in terms of detection accuracy.
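The exhaustive search performed by the MLD can be sketched as follows, for the real-valued model and a small problem size. The function name `mld` and the toy dimensions are illustrative; the number of candidates grows as the alphabet size raised to the number of symbols, which is why this is only practical for tiny systems.

```python
import itertools
import numpy as np

def mld(H, y, alphabet):
    """Brute-force maximum likelihood detection over all candidate vectors."""
    best_x, best_cost = None, np.inf
    for cand in itertools.product(alphabet, repeat=H.shape[1]):
        x = np.array(cand)
        cost = np.sum((y - H @ x) ** 2)          # squared Euclidean distance
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 4))
x_true = np.array([1.0, -1.0, 1.0, 1.0])
y = H @ x_true                                   # noiseless observation
x_hat, cost = mld(H, y, alphabet=[-1.0, 1.0])    # examines 2**4 = 16 candidates
```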
II-C QL decomposition
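Our constructive algorithm is built upon the transformation of the detection problem through QL decomposition. Consider the QL-decomposition of the channel matrix $\mathbf{H}$, i.e., $\mathbf{H} = \mathbf{Q}\mathbf{L}$, where $\mathbf{Q}$ is an orthogonal matrix and $\mathbf{L}$ is a lower triangular matrix. Then Problem 5 can be converted to its equivalent form:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathcal{A}^{2N_t}} \|\tilde{\mathbf{y}} - \mathbf{L}\mathbf{x}\|_2^2 \tag{6}$$

where $\tilde{\mathbf{y}} = \mathbf{Q}^T\mathbf{y}$. Expanding the $\ell_2$-norm in Eq. 6, we get the following equivalent form:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathcal{A}^{2N_t}} \sum_{k=1}^{2N_t} f_k(x_1, \ldots, x_k) \tag{7}$$

where

$$f_k(x_1, \ldots, x_k) = \Big(\tilde{y}_k - \sum_{j=1}^{k} l_{kj} x_j\Big)^2 \tag{8}$$

and $l_{kj}$ is the element of $\mathbf{L}$ at $(k, j)$.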
Problem 7 shows that we can detect the symbols from $x_1$ to $x_{2N_t}$ one by one recursively. That is, we can recover $x_1$ first and calculate $f_1$; then recover $x_2$ based on the known $x_1$ and calculate $f_2$; and repeat this process until $x_{2N_t}$ is recovered. Along the search procedure, we need to compare $M^{2N_t}$ possible solutions in total, and the one that minimizes $\sum_{k} f_k$ is the recovered signal. Many detectors, such as ZF, ZF-DF, SD and SDR, are built upon QL decomposition.
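A QL decomposition is not offered directly by most numerical libraries, but it can be obtained from a QR decomposition of the column-reversed matrix. The sketch below does this with NumPy and checks that the per-symbol terms add up to the original objective up to a constant that does not depend on the candidate vector. The helper names `ql_decompose` and `stage_costs` are assumptions for illustration, and the reduced form is used, so $\mathbf{Q}$ has orthonormal columns rather than being square.

```python
import numpy as np

def ql_decompose(H):
    """Return Q (orthonormal columns) and lower-triangular L with H = Q @ L."""
    Q_r, R = np.linalg.qr(H[:, ::-1])       # QR of the column-reversed matrix
    return Q_r[:, ::-1], R[::-1, ::-1]      # flip back: upper -> lower triangular

def stage_costs(L, y_tilde, x):
    """Per-symbol terms (y_tilde_k - sum_{j<=k} l_kj x_j)^2 for every k."""
    return (y_tilde - L @ x) ** 2           # L is lower triangular, so row k uses x_1..x_k

rng = np.random.default_rng(2)
H = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
Q, L = ql_decompose(H)
y_tilde = Q.T @ y

# The full objective and the sum of stage costs differ by the same constant
# for every candidate, so both have the same minimizer.
x_a = np.array([1.0, -1.0, 1.0, -1.0])
x_b = np.array([-1.0, -1.0, 1.0, 1.0])
gap_a = np.sum((y - H @ x_a) ** 2) - stage_costs(L, y_tilde, x_a).sum()
gap_b = np.sum((y - H @ x_b) ** 2) - stage_costs(L, y_tilde, x_b).sum()
```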
III Learning to do iterative search
In this section, we first present the proposed constructive algorithm built upon QL decomposition, then present LISA.
III-A The Decision Making Problem
Note that Problem 7 can be visualized as a decision tree with $2N_t$ layers, $M$ branches stemming from each non-leaf node, and $M^{2N_t}$ leaf nodes. A cumulative metric is associated with each node, and a branch metric with each branch (except the root).
The detection of the transmitted signal can then be considered as a decision making problem. At each node, a decision has to be made: which branch shall the detector follow? In ZF, ZF-DF, SD and FCSD, different decision strategies are used. For example, in ZF-DF, $x_k$ is estimated as:

$$\hat{x}_k = \arg\min_{x_k \in \mathcal{A}} f_k(\hat{x}_1, \ldots, \hat{x}_{k-1}, x_k) \tag{9}$$

where $\hat{x}_1, \ldots, \hat{x}_{k-1}$ are the estimated signals. That is, ZF-DF chooses the branch that minimizes the cumulative metric. In the SD, the branches emanating from a node are skipped if the cumulative metric at that node exceeds a preset radius $r$, i.e., $\sum_{j=1}^{k} f_j > r^2$.
From a decision-tree perspective, ZF-DF searches just a single path down from the root. It is clearly not able to obtain a satisfactory performance. On the other hand, the MLD searches all the branches, which is clearly not economical. The SD tries to balance detection complexity and accuracy by setting a proper radius $r$. The value of $r$ determines how many branches the detector needs to search over. A smaller $r$ means searching over fewer branches, and hence a lower complexity but a lower accuracy. In case $r = \infty$, the SD degenerates to the MLD.
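The trade-off controlled by the radius can be illustrated with a small depth-first tree search. This is a simplified sketch with a fixed radius and no radius shrinking or branch ordering, so it is not a full sphere decoder; the function name `tree_search`, the matrix values and the visit counter are illustrative assumptions. With an infinite radius every node is visited, which corresponds to the MLD.

```python
import numpy as np

def tree_search(L, y_tilde, alphabet, radius=np.inf):
    """Depth-first search over the decision tree with fixed-radius pruning.

    Returns the best full path, its cumulative metric, and the number of
    branches whose metric was evaluated.
    """
    n = L.shape[0]
    best = {"x": None, "metric": np.inf, "visited": 0}

    def descend(k, partial, cum):
        if k == n:                                   # reached a leaf
            if cum < best["metric"]:
                best["x"], best["metric"] = np.array(partial), cum
            return
        for a in alphabet:
            best["visited"] += 1
            interference = np.dot(L[k, :k], np.array(partial))
            branch = (y_tilde[k] - interference - L[k, k] * a) ** 2
            if cum + branch <= radius ** 2:          # prune outside the sphere
                descend(k + 1, partial + [a], cum + branch)

    descend(0, [], 0.0)
    return best["x"], best["metric"], best["visited"]

L = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 1.2, 0.0, 0.0],
              [-0.4, 0.3, 0.9, 0.0],
              [0.2, -0.6, 0.1, 1.1]])
x_true = np.array([1.0, -1.0, -1.0, 1.0])
y_tilde = L @ x_true                                  # noiseless observation

x_full, m_full, visited_full = tree_search(L, y_tilde, [-1.0, 1.0])
x_pruned, m_pruned, visited_pruned = tree_search(L, y_tilde, [-1.0, 1.0], radius=0.5)
```

In this noiseless example both searches return the same vector, but the pruned search evaluates fewer branches.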
In this paper, rather than using a fixed strategy as in the aforementioned detectors, we propose to use a deep neural network to adaptively make the decision. At each node, the decision is made based on the output of a deep neural network, while the information collected so far is used as input to the neural network.
In this way, the detector does not need to search over all the branches of the tree, but makes decisions purely based on the learned neural network. This clearly accelerates detection, but the accuracy depends entirely on the quality of the learned neural network.
In this section, we present the proposed learning to learn iterative search algorithm (LISA). LISA consists of multiple blocks. Each block is composed of a full solution construction procedure. As seen in Problem 7, QL-decomposition introduces a sequential relationship between the symbols. Thus, the structure of the block is designed based on LSTM [23, 38], which is a well-known technique for processing sequential data in deep learning. Please see the Appendix for a brief introduction to LSTM. We designed two different MIMO detection networks for the fixed and varying channel scenarios, respectively.
III-B1 The solution construction procedure for varying channel
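In this case, we assume that signals come from different channels, i.e., the channel is time-varying. Consider the real-valued channel model in Eq. 4. It can be rewritten as the following equation when QL-decomposition is performed on the channel matrix $\mathbf{H}$:

$$\tilde{\mathbf{y}} = \mathbf{L}\mathbf{x} + \tilde{\mathbf{n}} \tag{10}$$

where $\tilde{\mathbf{n}} = \mathbf{Q}^T\mathbf{n}$. From Eq. 10, it can be seen that the following holds if the noise is ignored:

$$\begin{cases} \tilde{y}_1 = l_{11}x_1 \\ \tilde{y}_2 = l_{21}x_1 + l_{22}x_2 \\ \quad\vdots \\ \tilde{y}_{2N_t} = l_{2N_t,1}x_1 + \cdots + l_{2N_t,2N_t}x_{2N_t} \end{cases} \tag{11}$$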
The ZF solves Eq. 11 by Gaussian elimination: take $x_1 = \tilde{y}_1 / l_{11}$, then $x_2 = (\tilde{y}_2 - l_{21}x_1)/l_{22}$, and so on. $\hat{\mathbf{x}}$ is then obtained by projecting $\mathbf{x}$ onto the constellation $\mathcal{A}$. The ZF-DF differs from the ZF in that it projects each symbol onto the constellation at each elimination step, rather than afterwards. These methods tend to perform poorly because noise and intersymbol interference are not well handled.
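The difference between ZF and ZF-DF on the triangular system can be made concrete with a short sketch. Both functions below work on a lower triangular matrix and the rotated observation; the names `zf`, `zf_df` and `nearest`, as well as the example matrix, are illustrative assumptions.

```python
import numpy as np

def nearest(alphabet, v):
    """Project a real value onto the closest constellation point."""
    alphabet = np.asarray(alphabet, dtype=float)
    return alphabet[np.argmin(np.abs(alphabet - v))]

def zf(L, y_tilde, alphabet):
    """Forward substitution first, projection onto the alphabet afterwards."""
    n = L.shape[0]
    x = np.zeros(n)
    for k in range(n):
        x[k] = (y_tilde[k] - L[k, :k] @ x[:k]) / L[k, k]
    return np.array([nearest(alphabet, v) for v in x])

def zf_df(L, y_tilde, alphabet):
    """Project each symbol right away and feed the decision forward."""
    n = L.shape[0]
    x = np.zeros(n)
    for k in range(n):
        soft = (y_tilde[k] - L[k, :k] @ x[:k]) / L[k, k]
        x[k] = nearest(alphabet, soft)       # decision feedback
    return x

L = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.5, 0.0],
              [-0.3, 0.8, 1.0]])
x_true = np.array([1.0, -1.0, 1.0])
y_tilde = L @ x_true                          # noiseless observation
x_zf = zf(L, y_tilde, [-1.0, 1.0])
x_zfdf = zf_df(L, y_tilde, [-1.0, 1.0])
```

Without noise both detectors agree. With noise, ZF propagates the unprojected estimates through the later rows, while ZF-DF propagates hard decisions, so an early wrong decision affects all subsequent symbols.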
Note that, from Eq. 11, the recovered symbol $x_1$ is a function of $\tilde{y}_1$ and $l_{11}$; $x_2$ is a function of $\tilde{y}_2 - l_{21}x_1$ and $l_{22}$; and so forth. In the ZF and ZF-DF, these functions are considered to be linear. Solving them leads to the exact solution if there is no noise or intersymbol interference. However, they are intrinsically non-linear in practice due to the existence of noise and interference. We hence propose to use LSTM to capture the non-linear relationship between the transmitted signal and the observations.
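Basically, LSTM can be considered as a recursive function

$$(\mathbf{c}_t, \mathbf{h}_t) = \mathrm{LSTM}(\mathbf{c}_{t-1}, \mathbf{h}_{t-1}, \mathbf{z}_t; \boldsymbol{\theta}_t)$$

where $\mathbf{c}_t$ is called the cell state, which carries information along time, $\mathbf{h}_t$ is the output at time $t$, and $\mathbf{z}_t$ is the input (observation) at time $t$. $\boldsymbol{\theta}_t$ is the parameter of the LSTM. Usually $\boldsymbol{\theta}_t = \boldsymbol{\theta}$, i.e., the parameters of the LSTM are shared along time. Fig. 1 shows the LSTM structure.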
The basic structure of the solution construction procedure for varying channel is shown in Fig. 2.
From the figure, it is seen that when predicting $x_1$, the input to the LSTM consists of the current observed information and the corresponding entries of $\mathbf{L}$, and likewise for $x_2$, and so on. Given all the information, including $\tilde{\mathbf{y}}$ and $\mathbf{L}$, and the parameters of the LSTMs, a signal can be recovered for Problem 6 following Fig. 2. Such a construction process is called a ‘block’ in the sequel.
At each signal recovery step for the $k$-th symbol, we apply a softmax layer on the LSTM output for a soft decision. The observed information is then updated. It is initialized to be . Mathematically, for , we have the following recursive functions:
is a probability vector, whose element represents the probability that the $k$-th symbol equals an element in $\mathcal{A}$.
The softmax layer is applied to the LSTM output at each step. It is a function of the LSTM output with parameters , which can be stated as follows:
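A soft decision of this kind can be sketched as a linear map followed by a softmax over the alphabet. The weight matrix `W`, the bias `b` and the function names below are assumptions for illustration; the exact parameterization used in LISA is not reproduced here.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def soft_decision(h, W, b, alphabet):
    """Map a hidden output h to symbol probabilities and a hard decision."""
    p = softmax(W @ h + b)                   # one probability per alphabet element
    return p, alphabet[int(np.argmax(p))]

alphabet = [-1.0, 1.0]
h = np.array([1.0, 0.0])                      # toy hidden output of size 2
W = np.array([[0.0, 0.0],
              [np.log(3.0), 0.0]])            # hypothetical weights giving logits [0, ln 3]
b = np.zeros(2)
p, symbol = soft_decision(h, W, b, alphabet)  # p is approximately [0.25, 0.75]
```

The probability vector is the soft output; taking its largest entry gives the hard symbol decision.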
III-B2 The solution construction procedure for fixed channel
In this case, we assume that all signals come from a fixed channel known in advance. Compared with the varying channel scenario, this case is much easier to deal with. The basic structure of the solution construction procedure for fixed channel can thus be simplified.
There are two main changes in the fixed channel scenario in comparison with the varying case. First, when predicting a symbol, the input is only the observed information; the values of the lower triangular matrix $\mathbf{L}$ are not required. Second, a fully connected deep neural network (DNN) is used instead of LSTM. The prediction for can be summarized as follows:
The basic structure for the fixed channel model is shown in Fig. 3. Again the whole solution construction process is called a block.
III-C The architecture of LISA
In a block, a transmitted symbol vector can be recovered one by one to reach a solution of Problem 6. The obtained solution is not necessarily optimal or with high quality. To improve this solution, we propose to unfold the constructive algorithm to several blocks. The structure of LISA is shown in Fig. 4.
In the figure, represents the input, which contains the observation $\tilde{\mathbf{y}}$ and the lower triangular matrix $\mathbf{L}$. It is the same for all the blocks. is the matrix of the probability obtained by the -th block. is the output at the -th block. In the fixed channel scenario, the output is only , while in the varying channel scenario, the output includes and .
Note that each block outputs not only a solution to the detection problem, but also some additional information. This additional information, including and for varying channel structure, and for fixed channel structure, is helpful for improving the solution quality in the following blocks.
We should emphasize that the framework used for building LISA is totally different from the unfolding structure of DetNet and its variants. In DetNet and its variants, the projected gradient descent is unfolded to several steps (layers). A neural network is designed to approximate the projection operator in Eq. 1. The unfolding structure assembles the iterations of the projected gradient descent algorithm. Full solutions are updated along the unfolding.
In LISA, a series of neural networks are used to assemble a full solution. LISA is concerned more with how a full solution to the detection problem is gradually constructed. Domain knowledge about constructing a solution can be readily incorporated in such a framework. In contrast, the solution construction in each DetNet layer is limited to the design of neural networks.
III-E Training LISA and Testing
In our experiments, the data used for training LISA is generated in two steps. First, we sample a set of channel matrices $\mathbf{H}$. Each element of $\mathbf{H}$ is sampled from a normal distribution with zero mean and unit variance. Then, for each $\mathbf{H}$, we generate data by applying Eq. 4. The noise variance is randomly sampled so that the SNR is uniformly distributed over a preset range. Note that the sampled $\mathbf{H}$ is not guaranteed to be positive definite.
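The two-step data generation can be sketched as follows. Several details are assumptions made only for illustration: the SNR is taken as the ratio of the average received signal power to the noise variance per sample, the SNR range passed in the example is a placeholder and not the one used in the experiments, and the function name `generate_samples` is hypothetical.

```python
import numpy as np

def generate_samples(n_samples, n_rx, n_tx, alphabet, snr_db_range, rng):
    """Draw (x, y, H) triples from the real-valued model y = Hx + n."""
    X, Y, Hs = [], [], []
    for _ in range(n_samples):
        H = rng.standard_normal((n_rx, n_tx))          # zero mean, unit variance
        x = rng.choice(alphabet, size=n_tx)            # uniformly random symbols
        snr_db = rng.uniform(*snr_db_range)            # SNR uniform in dB
        signal_power = np.mean((H @ x) ** 2)
        # Assumed SNR definition: signal power over noise variance.
        sigma = np.sqrt(signal_power / 10 ** (snr_db / 10))
        y = H @ x + sigma * rng.standard_normal(n_rx)
        X.append(x)
        Y.append(y)
        Hs.append(H)
    return np.array(X), np.array(Y), np.array(Hs)

rng = np.random.default_rng(3)
alphabet = [-1.0, 1.0]
# Placeholder range of 5 to 15 dB, chosen only for this example.
X, Y, Hs = generate_samples(5, n_rx=8, n_tx=4, alphabet=alphabet,
                            snr_db_range=(5.0, 15.0), rng=rng)
```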
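Suppose the generated training data set is $\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}, \mathbf{H}^{(i)})\}_{i=1}^{S}$, where $S$ is the number of samples, $\mathbf{x}^{(i)}$ is the $i$-th transmitted symbol vector, and $\mathbf{y}^{(i)}$ is the observed symbol vector.
Before training, we perform QL-decomposition on the channel matrix of each sample:

$$\mathbf{H}^{(i)} = \mathbf{Q}^{(i)}\mathbf{L}^{(i)}$$

Multiplying $(\mathbf{Q}^{(i)})^T$ over $\mathbf{y}^{(i)}$, we obtain a converted data set $\{(\mathbf{x}^{(i)}, \tilde{\mathbf{y}}^{(i)}, \mathbf{L}^{(i)})\}_{i=1}^{S}$, where $\tilde{\mathbf{y}}^{(i)} = (\mathbf{Q}^{(i)})^T\mathbf{y}^{(i)}$.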
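The choice of the loss function is critical for neural network training. We regard the MIMO detection problem as a classification problem since all symbols take values from a discrete set. Therefore, the cross entropy loss is used to train LISA. Specifically, the loss function used in the training process is written as:

$$\mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{S}\sum_{k=1}^{2N_t} \mathrm{CE}\left(\mathbf{q}_k^{(i)}, \mathbf{p}_k^{(i)}\right)$$

where $\mathrm{CE}(\cdot,\cdot)$ represents the cross entropy function, $\mathbf{q}_k^{(i)}$ represents the true probability distribution of the symbol $x_k^{(i)}$, and $x_k^{(i)}$ denotes the $k$-th symbol of the $i$-th sample $\mathbf{x}^{(i)}$. The probability distribution can be expressed as:

$$\mathbf{q}_k^{(i)} = \left[\mathbb{I}(x_k^{(i)} = a_1), \ldots, \mathbb{I}(x_k^{(i)} = a_M)\right]^T$$

where $\mathbb{I}(\cdot)$ is the indicator function and $a_1, \ldots, a_M$ are the elements of $\mathcal{A}$. In other words, $\mathbf{q}_k^{(i)}$ is the one-hot encoding of the symbol. $\mathbf{p}_k^{(i)}$ is the probability distribution predicted by the last block for $x_k^{(i)}$, and $\boldsymbol{\theta}$ is the parameter of LISA.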
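The cross entropy between a one-hot target and a predicted distribution reduces to the negative log-probability assigned to the true symbol. The sketch below computes this for one sample; the function names and the small constant `eps` that guards the logarithm are illustrative assumptions.

```python
import numpy as np

def one_hot(symbols, alphabet):
    """One row per symbol, with a 1 at the position of the matching alphabet element."""
    symbols = np.asarray(symbols, dtype=float)
    alphabet = np.asarray(alphabet, dtype=float)
    return (symbols[:, None] == alphabet[None, :]).astype(float)

def cross_entropy_loss(probs, symbols, alphabet, eps=1e-12):
    """Sum of cross entropies between one-hot targets and predicted distributions."""
    q = one_hot(symbols, alphabet)
    return -np.sum(q * np.log(probs + eps))

alphabet = [-1.0, 1.0]
symbols = [-1.0, 1.0]                          # true symbols of one sample
probs = np.array([[0.9, 0.1],                  # predicted distribution for symbol 1
                  [0.2, 0.8]])                 # predicted distribution for symbol 2
loss = cross_entropy_loss(probs, symbols, alphabet)   # equals -(ln 0.9 + ln 0.8)
```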
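In the testing phase, suppose that the new observed signal is $\mathbf{y}$ and its corresponding channel matrix is $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_{2N_t}]$, where $\mathbf{h}_j$ denotes the $j$-th column of $\mathbf{H}$. To recover the transmitted signal, we first perform QL-decomposition on $\mathbf{H}$ to convert the received signal $\mathbf{y}$ to $\tilde{\mathbf{y}}$. We then feed the trained LISA with $\tilde{\mathbf{y}}$ and $\mathbf{L}$. The transmitted signal is recovered as

$$\hat{x}_k = a_{j^*}, \quad j^* = \arg\max_{j \in \{1, \ldots, M\}} p_{k,j}$$

where $\mathbf{p}_k = [p_{k,1}, \ldots, p_{k,M}]^T$ is the output of the $k$-th component of LISA and $a_j$ is the $j$-th element of $\mathcal{A}$.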
IV Experimental Results
In this section, LISA is compared against existing detectors in both the fixed channel and varying channel scenarios. The code is implemented in PyTorch. The computing platform has two 14-core 2.40GHz Intel Xeon E5-2680 CPUs (28 logical processors) and an NVIDIA Quadro P2000 GPU, running Windows 10. The Adam optimizer is used to train LISA.
The compared detectors include two linear detectors (ZF and MMSE), one machine learning based algorithm (two-stage ADMM), one learning to learn method (DetNet) and the best detector (MLD). The bit error rate (BER) is used to measure the performance of these detectors.
IV-A Results in the fixed channel scenario
In the fixed channel scenario, the DNN is a fully connected neural network with three layers, and the number of neurons in the hidden layer is 600. For a fixed $\mathbf{H}$, ten million samples are generated and used to train LISA. The training samples cover a range of SNR values, which are sampled uniformly. Following the standard deep neural network training procedure, we train LISA for 10 epochs. In each epoch, mini-batch training is adopted with batch size 1000. The detectors are compared in terms of BER on a test data set of size 100K, which is generated in the same way as the training data.
The comparison result for the fixed channel with QPSK modulation is shown in Fig. 5. It can be seen that LISA performs much better than the linear detectors (MMSE, ZF) and even achieves near-MLD performance. DetNet has very poor performance: it performs even worse than the linear detectors. The performance of the two-stage ADMM is a bit better than DetNet, but still worse than MMSE and ZF at high SNR levels.
IV-B Results in the varying channel scenario
In this experiment, 100 million training samples are used, generated over a range of SNR values. When generating the samples, we do not exclude channel matrices $\mathbf{H}$ that are ill-conditioned (with a large condition number) or even singular. The dimension of the LSTM hidden state is set to 600. Two blocks are used in LISA. In the training phase, we train LISA for 40 epochs with a batch size of 20K.
We found that the following post-processing procedure is able to further improve the performance of LISA. For each $\mathbf{H}$ and observation $\mathbf{y}$, signals are recovered in both the original order and the reverse order in the testing phase. The transmitted signal is then determined based on the two predictions.
In the original order, the signal is recovered from $x_1$ to $x_{2N_t}$. The lower triangular matrix $\mathbf{L}$ obtained by QL decomposition of $\mathbf{H}$, together with $\tilde{\mathbf{y}}$, is input to the learned LISA. In the reverse order, the signal is recovered from $x_{2N_t}$ to $x_1$. To do so, we first reverse the order of the columns of the channel matrix $\mathbf{H}$ to obtain $\mathbf{H}' = [\mathbf{h}_{2N_t}, \ldots, \mathbf{h}_1]$. The QL-decomposition of $\mathbf{H}'$ is then used to form the input to LISA.
Suppose the recovered signal in the original order is $\hat{\mathbf{x}}^{o}$ and that of the reversed order is $\hat{\mathbf{x}}^{r}$. Combining the two recovered signals, the transmitted signal is recovered as

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \{\hat{\mathbf{x}}^{o}, \hat{\mathbf{x}}^{r}\}} \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_2^2$$

That is, the signal with the smaller recovery error is taken as the final recovered signal.
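The two-order post-processing can be sketched with any sequential detector plugged in. Since a trained LISA network is not available here, the sketch uses ZF-DF on the QL-decomposed system as a stand-in; this substitution, the function names and the toy dimensions are assumptions for illustration only.

```python
import numpy as np

def ql(H):
    """QL decomposition via QR of the column-reversed matrix."""
    Q, R = np.linalg.qr(H[:, ::-1])
    return Q[:, ::-1], R[::-1, ::-1]

def detect(H, y, alphabet):
    """Stand-in sequential detector (ZF-DF); a trained LISA would be used here."""
    alphabet = np.asarray(alphabet, dtype=float)
    Q, L = ql(H)
    y_t = Q.T @ y
    x = np.zeros(H.shape[1])
    for k in range(len(x)):
        soft = (y_t[k] - L[k, :k] @ x[:k]) / L[k, k]
        x[k] = alphabet[np.argmin(np.abs(alphabet - soft))]
    return x

def detect_both_orders(H, y, alphabet):
    """Detect in original and reversed symbol order, keep the smaller residual."""
    x_fwd = detect(H, y, alphabet)
    x_rev = detect(H[:, ::-1], y, alphabet)[::-1]    # undo the column reversal
    candidates = [x_fwd, x_rev]
    errors = [np.sum((y - H @ x) ** 2) for x in candidates]
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(4)
H = rng.standard_normal((8, 4))
x_true = np.array([1.0, 1.0, -1.0, 1.0])
y = H @ x_true                                        # noiseless observation
x_hat = detect_both_orders(H, y, [-1.0, 1.0])
```

Reversing the columns changes which symbol is detected first, so the two passes can make different errors under noise; comparing residuals picks the more consistent of the two.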
Fig. 6 shows the BER performance of the compared algorithms in the varying channel scenario with QPSK modulation, where the complex channel matrix size is , i.e. and . From Fig. 6, it is seen that LISA performs extremely well in this case. Its BER is much lower than that of the compared algorithms and is even close to that of the MLD. To the best of our knowledge, no other detector is able to reach the MLD performance. Among the compared detectors, DetNet performs better only than MMSE, while the two-stage ADMM performs worse than MMSE and a bit better than ZF.
Fig. 7 shows the BER performance of LISA in a MIMO system with 16-QAM modulation. It is seen that when the modulation order becomes higher, the detection performance of DetNet becomes similar to that of MMSE, while LISA still achieves better performance than MMSE. The two-stage ADMM performs better than ZF, but worse than MMSE.
The BER performance of LISA with 64-QAM modulation in varying channel is shown in Fig. 8. From the figure, it is clear that DetNet performs the worst. Its performance is worse than ZF. Two-stage ADMM performs a little worse than MMSE. LISA still performs better than MMSE, and it performs clearly the best among the compared detectors.
LISA can also be used for massive MIMO detection. Fig. 9 shows the results of the compared detectors in the varying channel scenario with QPSK and 16-QAM modulation. It is seen from the figure that LISA performs the best among the detectors; the two-stage ADMM is worse than LISA, but better than DetNet with QPSK modulation. The performance of MMSE and ZF is very poor. With 16-QAM modulation, the performance of DetNet, MMSE and ZF is not comparable to that of LISA and the two-stage ADMM, hence they are not shown in the figure.
IV-C Further Results on Correlated Channel Matrices
So far all the results are obtained by training LISA on channel matrices with independently sampled normal entries. To further test the performance of LISA against known detectors, in this section, LISA is trained on channel matrices whose entries are correlated. The Kronecker model is applied to generate the correlated channel matrices. In the Kronecker model, the channel matrix is obtained as follows:

$$\mathbf{H} = \mathbf{R}_r^{1/2}\,\mathbf{U}\,\mathbf{R}_t^{1/2}$$

where $\mathbf{U}$ is an i.i.d. Rayleigh fading matrix and $\mathbf{R}_t$ ($\mathbf{R}_r$) is the covariance matrix at the transmitting (receiving) end, which is parameterized by the correlation coefficient $\rho$.
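A correlated channel of the Kronecker type can be sketched as below. The structure of the covariance matrices is an assumption here: the exponential correlation model, with entry $(i, j)$ equal to $\rho^{|i-j|}$, is used purely as an example and may differ from the covariance used in the experiments. Cholesky factors serve as the matrix square roots, and the function names are hypothetical.

```python
import numpy as np

def exp_corr(n, rho):
    """Exponential correlation matrix with entries rho^|i-j| (assumed structure)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def kronecker_channel(n_rx, n_tx, rho, rng):
    """Correlated channel: square-root factors applied on both sides of U."""
    shape = (n_rx, n_tx)
    # i.i.d. Rayleigh fading: complex Gaussian entries with unit variance.
    U = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
    R_r_half = np.linalg.cholesky(exp_corr(n_rx, rho))   # R_r = R_r_half @ R_r_half.T
    R_t_half = np.linalg.cholesky(exp_corr(n_tx, rho))   # R_t = R_t_half @ R_t_half.T
    return R_r_half @ U @ R_t_half.T

H_corr = kronecker_channel(4, 3, rho=0.5, rng=np.random.default_rng(5))
H_iid = kronecker_channel(4, 3, rho=0.0, rng=np.random.default_rng(5))
```

With a correlation coefficient of zero both covariance matrices become the identity and the model reduces to the i.i.d. channel used in the earlier experiments.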
Fig. 10 shows the BER performance of LISA against the compared detectors for different correlation coefficients. From the figure, it is clearly seen that LISA performs significantly better than the rest of the algorithms. In particular, for QPSK modulation, LISA obtains near-MLD performance. DetNet performs generally similarly to MMSE, while the two-stage ADMM performs even worse than ZF with 16-QAM modulation and is only slightly better than ZF with QPSK.
In summary, we may conclude that LISA works significantly better than the classical detectors (such as ZF and MMSE) and the machine learning based algorithms (DetNet and the two-stage ADMM). Further, comparing the frameworks used by DetNet and LISA, we may conclude that the LISA framework can perform better than the unfolding-based learning to learn approach.
IV-D Sensitivity Analysis
In LISA, one of the important architecture parameters is the number of neurons used in the LSTM. We investigate how this number affects the performance of LISA in the varying channel scenario. Fig. 11 shows the performance of LISA for different numbers of neurons. From the figure, it is seen that the performance of LISA improves as the number of neurons increases. However, the improvement becomes less significant as the number grows.
V Conclusion
In this paper, we proposed a learning to search approach, named LISA, for the MIMO detection problem. We first proposed a solution construction algorithm, based on QL-decomposition of the channel matrix, that uses a DNN and an LSTM as building blocks for fixed and varying channels, respectively. LISA is obtained by unfolding the solution construction algorithm. Experimental results showed that LISA achieves state-of-the-art performance. In particular, it is able to achieve near-MLD performance with low complexity for both fixed and varying channel scenarios with QPSK modulation. It performs significantly better than classical detectors and recently proposed machine/deep learning based detectors. Further, we showed that LISA performs significantly better than known detectors for complicated (correlated) channel matrices.
Appendix
Artificial neural networks (ANNs) are computing systems in which nodes are interconnected with each other. The nodes work much like neurons in the human brain. They are connected by weights that simulate the stimulus among neurons. Together with training algorithms, an ANN can recognize hidden patterns and correlations in data. Fig. 12 shows a simple feedforward neural network with 3 layers, including an input, a hidden and an output layer. This network receives an input and passes it on to the output.
A feedforward ANN cannot process sequential data, such as time series, in which dependencies exist along time. The recurrent neural network (RNN) has been developed to handle this. It is still a kind of neural network, but with a loop in it. Fig. 13 shows a recurrent neural network. In the left plot, $\mathbf{z}_t$ is the input to a neural network (A) and $\mathbf{h}_t$ is the output, with a loop in between. It can be unrolled as shown in the right plot.
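The long short-term memory network (LSTM) is a kind of RNN. It is capable of capturing long-term dependencies among input signals. There are a variety of LSTM variants. In our implementation, we choose the basic LSTM, which can be formulated as follows:

$$\begin{aligned}
\mathbf{f}_t &= \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{z}_t] + \mathbf{b}_f) \\
\mathbf{i}_t &= \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{z}_t] + \mathbf{b}_i) \\
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{z}_t] + \mathbf{b}_c) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \\
\mathbf{o}_t &= \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{z}_t] + \mathbf{b}_o) \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\end{aligned}$$

where $[\mathbf{h}_{t-1}, \mathbf{z}_t]$ means the concatenation of $\mathbf{h}_{t-1}$ and $\mathbf{z}_t$, $\odot$ means element-wise multiplication, and $\sigma$ and $\tanh$ are the sigmoid activation function and the tanh function, respectively:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

It is seen that the LSTM is a non-linear function due to the non-linearity of sigmoid and tanh. The parameters of the LSTM include the weight matrices $\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_c, \mathbf{W}_o$ and the bias vectors $\mathbf{b}_f, \mathbf{b}_i, \mathbf{b}_c, \mathbf{b}_o$. Training algorithms such as Adam are commonly used in the deep learning community to find parameters that best fit the training data.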
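One step of a basic LSTM cell can be written directly from the standard gate equations. The sketch below is a minimal NumPy version; the parameter dictionary layout, the names `W_f`, `b_f` and so on, and the symbol `z` for the input are assumptions for illustration, and in practice a deep learning framework would provide this cell.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(z, h_prev, c_prev, params):
    """One step of a basic LSTM cell: returns the new output h and cell state c."""
    v = np.concatenate([h_prev, z])                          # [h_{t-1}, z_t]
    f = sigmoid(params["W_f"] @ v + params["b_f"])           # forget gate
    i = sigmoid(params["W_i"] @ v + params["b_i"])           # input gate
    c_tilde = np.tanh(params["W_c"] @ v + params["b_c"])     # candidate cell state
    c = f * c_prev + i * c_tilde                             # new cell state
    o = sigmoid(params["W_o"] @ v + params["b_o"])           # output gate
    h = o * np.tanh(c)                                       # new output
    return h, c

hidden, n_in = 3, 2
# All-zero parameters make every gate equal to 0.5 and the candidate equal to 0.
params = {name: np.zeros((hidden, hidden + n_in)) for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: np.zeros(hidden) for name in ("b_f", "b_i", "b_c", "b_o")})

h, c = lstm_step(np.ones(n_in), np.zeros(hidden), np.ones(hidden), params)
```

With these zero parameters, the cell state is halved and the output equals $0.5\tanh(0.5)$ in every component, which gives a quick sanity check of the gate arithmetic.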
-  G. Forney, “Maximum-likelihood sequence estimation of digital sequences in the presence of intersymbol interference,” IEEE Transactions on Information Theory, vol. 18, no. 3, pp. 363–378, May 1972.
-  S. Yang and L. Hanzo, “Fifty years of MIMO detection: The road to large-scale MIMOs,” IEEE Communications Surveys & Tutorials, vol. 17, no. 4, pp. 1941–1988, Fourth Quarter 2015.
-  E. G. Larsson, “MIMO detection methods: How they work [lecture notes],” IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 91–95, May 2009.
-  E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in lattices,” IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2201–2214, Aug 2002.
-  M. O. Damen, H. El Gamal, and G. Caire, “On maximum-likelihood detection and the search for the closest lattice point,” IEEE Transactions on Information Theory, vol. 49, no. 10, pp. 2389–2402, Oct. 2003. [Online]. Available: https://doi.org/10.1109/TIT.2003.817444
-  J. Jalden and B. Ottersten, “The diversity order of the semidefinite relaxation detector,” IEEE Transactions on Information Theory, vol. 54, no. 4, pp. 1406–1422, April 2008.
-  C. Jeon, R. Ghods, A. Maleki, and C. Studer, “Optimality of large MIMO detection via approximate message passing,” in 2015 IEEE International Symposium on Information Theory (ISIT), June 2015, pp. 1227–1231.
-  S. Wu, L. Kuang, Z. Ni, J. Lu, D. Huang, and Q. Guo, “Low-complexity iterative detection for large-scale multiuser MIMO-OFDM systems using approximate message passing,” IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 5, pp. 902–915, Oct 2014.
-  T. M. Mitchell, Machine Learning. McGraw Hill Education, 2017.
-  J. Sun, H. Zhang, A. Zhou, Q. Zhang, K. Zhang, Z. Tu, and K. Ye, “Learning from a stream of non-stationary and dependent data in multiobjective evolutionary optimization,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 4, pp. 541–555, 2019.
-  J. Sun, H. Zhang, A. Zhou, Q. Zhang, and K. Zhang, “A new learning-based adaptive multi-objective evolutionary algorithm,” Swarm and Evolutionary Computation, vol. 44, pp. 304–319, 2019.
-  J. Shi, Q. Zhang, and J. Sun, “PPLS/D: Parallel Pareto local search based on decomposition,” IEEE Transactions on Cybernetics, vol. Early Access, pp. 1–12, 2019.
-  L. Band, D. Wells, A. Larrieu, J. Sun, A. Middleton, A. French, G. Brunoud, E. Sato, M. Wilson, B. Peret, M. Oliva, R. Swarup, I. Sairanen, G. Parry, K. Ljung, T. Beeckman, J. Garibaldi, M. Estelle, M. Owen, K. Vissenberg, T. Hodgman, T. Pridmore, J. King, T. Vernoux, and M. Bennett, “Root gravitropism is regulated by a transient lateral auxin gradient dependent on root angle,” PNAS, vol. 109, no. 12, pp. 4668–4673, 2012.
-  Y. Huang, P. P. Liang, Q. Zhang, and Y. Liang, “A machine learning approach to MIMO communications,” in 2018 IEEE International Conference on Communications (ICC), May 2018, pp. 1–6.
-  A. Elgabli, A. Elghariani, A. O. Al-Abbasi, and M. Bell, “Two-stage LASSO ADMM signal detection algorithm for large scale MIMO,” in 2017 51st Asilomar Conference on Signals, Systems, and Computers, Oct 2017, pp. 1660–1664.
-  R. Tibshirani, “Regression shrinkage and selection via the LASSO,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
-  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
-  I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
-  X. Yan, F. Long, J. Wang, N. Fu, W. Ou, and B. Liu, “Signal detection of MIMO-OFDM system based on auto encoder and extreme learning machine,” in 2017 International Joint Conference on Neural Networks (IJCNN), May 2017, pp. 1602–1606.
-  J. Sun and Z. Xu, “Model-driven deep-learning,” National Science Review, vol. 5, no. 1, pp. 22–24, 08 2017. [Online]. Available: https://doi.org/10.1093/nsr/nwx099
-  K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning, 2010.
-  M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas, “Learning to learn by gradient descent by gradient descent,” in Advances in Neural Information Processing Systems, 2016, pp. 3981–3989.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  K. Li and J. Malik, “Learning to optimize,” arXiv preprint arXiv:1606.01885, 2016.
-  Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, and N. de Freitas, “Learning to learn for global optimization of black box functions,” arXiv preprint arXiv:1611.03824v1, 2016.
-  J. Mockus, Bayesian approach to global optimization: theory and applications. Springer Science & Business Media, 2012.
-  H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song. (2017, May) Learning combinatorial optimization algorithms over graphs. [Online]. Available: arXiv:1704.01665v2
-  N. Samuel, T. Diskin, and A. Wiesel, “Learning to detect,” CoRR, vol. abs/1805.07631, 2018. [Online]. Available: http://arxiv.org/abs/1805.07631
-  G. Gao, C. Dong, and K. Niu, “Sparsely connected neural network for massive MIMO detection,” EasyChair Preprint no. 376, EasyChair, 2018.
-  V. Corlay, J. J. Boutros, P. Ciblat, and L. Brunel, “Multilevel MIMO detection with deep learning,” CoRR, vol. abs/1812.01571, 2018. [Online]. Available: http://arxiv.org/abs/1812.01571
-  H. He, C. Wen, S. Jin, and G. Y. Li, “A model-driven deep learning network for MIMO detection,” CoRR, vol. abs/1809.09336, 2018. [Online]. Available: http://arxiv.org/abs/1809.09336
-  S. Takabe, M. Imanishi, T. Wadayama, and K. Hayashi, “Deep Learning-Aided Projected Gradient Detector for Massive Overloaded MIMO Channels,” arXiv e-prints, Jun. 2018.
-  R. Hayakawa, K. Hayashi, H. Sasahara, and M. Nagahara, “Massive overloaded MIMO signal detection via convex optimization with proximal splitting,” in 2016 24th European Signal Processing Conference (EUSIPCO), Aug 2016, pp. 1383–1387.
-  X. Tan, W. Xu, Y. Be’Ery, Z. Zhang, X. You, and C. Zhang, “Improving massive MIMO belief propagation detector with deep neural network,” arXiv e-prints, 2018.
-  J. Yang, C. Zhang, X. Liang, S. Xu, and X. You, “Improved symbol-based belief propagation detection for large-scale MIMO,” in 2015 IEEE Workshop on Signal Processing Systems (SiPS), Oct 2015, pp. 1–6.
-  G. Golden, G. Foschini, R. Valenzuela, and P. Wolniansky, “Detection algorithm and initial laboratory results using V-BLAST space time communications architecture,” IEEE Electron. Lett., vol. 35, pp. 14–16, 1999.
-  L. Barbero and J. Thompson, “Fixing the complexity of the sphere decoder for MIMO detection,” IEEE Trans. Wireless Commun., vol. 7, pp. 2131–2142, 2008.
-  F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451–2471, Oct 2000.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  C. Oestges, “Validity of the Kronecker model for MIMO correlated channels,” in 2006 IEEE 63rd Vehicular Technology Conference, vol. 6, May 2006, pp. 2818–2822.