I Introduction
The fifth generation (5G) wireless communication techniques aim at connecting everything in the physical world as neurons in a neural network. To achieve this goal, 5G systems should be able to interact with environment and everincreasing numbers of key performance indicators (KPIs) such as user experiences of virtual reality or internet of things should be jointly optimized. When the network optimization problems are becoming more and more complicated, the air interface technology in 5G New Radio (NR) is also facing the same problem with a bunch of new weapons such as massive multipleinput multipleoutput (MIMO), nonorthogonal multiple access (NOMA), and Polar/LDPC codes. Fortunately, artificial intelligence (AI) techniques are born for dealing with sophisticated problems and becoming the state of the art in many fields. While many researchers have already used AI algorithms in air interface blocks, most of air interface technologies are perfectly modeled by information theory, hence their replacements with AI should be carefully checked
[1]. Other than replacement, we believe the AIenhanced Air Interface (AIeAI) technology will be promising, which combines the benefits of certainty and uncertainty.In modern radio systems, the most uncertain part of air interface is the wireless channel, the estimated channel quickly become outdated. Using the outdated information in adaptive transmission will degrade the system performance [2]. Obviously, channel prediction seems to be a promising candidate to combine with adaptive transmission. In practice, the wireless channel is a superposition of sinusoids contributed by changing reflectors and scatters, as a result of its timevarying and intangible nature, we could compare the wireless channel to speaking in a strange language, moreover, the speaking is not precisely recorded due to the estimation error caused by noise, interference and hardware imperfections, which makes it sounds like mumbling. To predict the mumble of wireless channel, we should understand it at the first.
Instead of complexvalued channel prediction, some researchers focus on power prediction. Though the power of channel coefficients is of interest for some applications, it’s beneficial to predict complexvalued channel coefficients and obtain power from the squared magnitude of the complex value [3]. Techniques used in previous channel prediction methods are various, such as autoregressive (AR) model [4], sinusoidal model [5], complexvalued neural networks [6] and so on. However, let alone their respective shortcoming, a common weakness of these techniques is that their expressive power is not sufficient to memorize and fit the realistic channel.
Recall the channel mumbling image in our mind. Nowadays, people can be very confident in learning a language model with the powerful neural networks, especially recurrent neural networks (RNN), which is talented in solving sequential problems. Moreover, sequencetosequence (seq2seq) models
[7] can solve translation problems by introducing RNN for encoder and decoder respectively. Normally, seq2seq models are trained to translate sequence from one domain (e.g., sentences in English) to sequence in another domain (e.g., voices in Mandarin) [8], while in this paper, we exploit seq2seq models in the channel prediction task, where the inputs are channel in the past, and the outputs are channel in the future. Different from previous channel prediction methods, first of all, we replace every sampled channel into a word, which turns the wireless channel into a sentence full of words, just like we imagined, the wireless channel is now mumbling. Besides, every word is further represented by a word embedding [9], that can improve the expressive power of this model to a great extent. The main contributions of this work are as follows:
The proposal of a neuralnetwork based complexvalued channel prediction algorithm. This algorithm creatively turns the wireless channel into words and learns the channel model as a language model. The numerical results indicate the trained model is reliable and robust, which shows a promising future in realistic channel prediction.

The demonstration of using seq2seq models and its variants in time series prediction. It turns out that the encoder and decoder of seq2seq models with different lengths can be perfect containers for different time spans of past and future signals.
Ii Channel Predictor
Learning a language model with neural networks is essential to modern natural language processing (NLP) tasks. Normally, a vocabulary should be extracted from the corpus to be learned, which comprises all highfrequency words in this corpus. In a RNNbased language model learning, the conditional probability of each word is computed after all the previous words passing through the RNN cell, where Long ShortTerm Memory (LSTM) or Gated Recurrent Unit (GRU) is typically implemented. Corresponding to each word in the vocabulary, a word embedding representing this word is made up of a vector of numbers, which are gradually changing during training until the meaning of this word is encoded in these numbers. Due to the high dimensional space of the vector, a word embedding can carry much more information than a single word.
As we see the wireless channel as mumbling with channel language, a vocabulary of this language should also be extracted from the channel. However, the combinations of amplitude and phase of channel coefficients are infinite, which means the wireless channel only obey a statistical distribution. Fortunately, the Channel Changes (CC) is finite with a practical precision, if we can predict CC from the past, we can predict future channel. Channel coefficients in a past time span can be expressed as , which contains the past sampled channel, its prediction target is the future samples, i.e., . In this work, CC are calculated in the first place, CC sequence can be expressed as , where , then is used to predict with future CC. At time , as , and is known information at past, can be obtained with . Analogously, the sample in the future can be calculated by with predicted , where .
As each CC in can be seen as a word, we introduce Vocabulary of Channel Changes (VCC) to transform into train set. VCC includes top frequently appearing CC in while the rest CC are out of vocabulary. By considering the numerical precision of , the size of corpora, the expressive power of model and GPU memories, an appropriate can be chosen. Small makes fitness inaccuracy while large brings slow convergence speed. Small introduces interference while large leads to too many unknown prediction. In this work, and are chosen while length of is in the order of tens to hundreds of millions. In VCC, an unique ID (usually an integer) is assigned to each CC, these integers instead of wireless channel are inputs of neural networks. Table I shows the top 10 most and least frequently occurring CC and their corresponding IDs in a VCC extracted from a realistic measured channel, where is the size of this vocabulary. From the top 10 most frequently occurring CC and their frequencies, we can observe pairs of CC, each pair of CC is symmetrical about the origin or axis, which indicates a specific statistical distribution this wireless channel obeys. In this VCC, CC with occurring frequency higher than 10 are saved, so that the least frequency of CC here is 11. An ‘unk’ token is usually added to vocabulary whose ID is zero for example, so that all occurrences of outofvocabulary CC can be replaced with this ‘unk’ token. As mentioned above, every CC has a length word embedding representing it, where is around 400 in this work. Therefore, number of parameters utilized to represent this channel model is without considering the weights in neural networks, though the model used in this paper may be overparameterized, this work provides a method to greatly enlarge the expressive power of channel model. By looking up the VCC, each CC in is replaced by its corresponding ID, then is transformed into a new sequence with integers, which can be fed into neural networks for training and predicting. The predicted integers are then transformed back into future using VCC. With future , in future can be obtained.
Top10 most in VCC  Top10 least in VCC  
ID  CC  Freq.  ID  CC  Freq. 
1  ’+0.020.02i’  538211  X9  ’+0.20.05i’  12 
2  ’0.02+0.02i’  536925  X8  ’0.04+0.2i’  12 
3  ’0.020.02i’  535761  X7  ’0.03+0.2i’  12 
4  ’+0.02+0.02i’  534726  X6  ’0.1+0.2i’  12 
5  ’0.02+0.01i’  373125  X5  ’0.060.2i’  11 
6  ’0.01+0.02i’  371946  X4  ’0.050.2i’  11 
7  ’+0.01+0.02i’  371856  X3  ’+0.01+0.2i’  11 
8  ’0.020.01i’  371778  X2  ’+0.2+0.01i’  11 
9  ’0.010.02i’  371682  X1  ’+0.05+0.2i’  11 
10  ’+0.010.02i’  371673  X  ’+0.2+0.05i’  11 
Iii Modeling and Predicting
To achieve the purpose of predicting the following CC given the
preceding CC, two solutions corresponding to natural language generation (NLG) and neural machine translation (NMT) are proposed. The NLG solution uses RNN combined with backpropagation through time (BPTT) algorithm. The NMT solution uses seq2seq models comprising two RNNs, one for encoder and another one for decoder. Normally, these two RNNs belong to different domains, e.g., two kinds of languages, that means they should have different vocabularies. However, in our solution, we use the same vocabulary for both RNNs as they are modeling the same channel model, while encoder is in past and decoder is in future. This NMT solution can be seen as NLG solution with another RNN added for decoder. In figure
1, a block diagram is shown to explain the NMT based channel predictor. At the encoder side, time series with integer IDs , representing their CC counterparts in VCC, are fed into the reusable RNN cell, which consists an embedding layer and stacked LSTM or GRU networks. The hidden states (solid lines) are transferred through time steps in forward and backward direction, this bidirectional variant can provide more information but should only stay at encoder side, because time order in future cannot be reversed. Normally, only the last hidden state is forwarded to decoder, while in attention variant, intermediate hidden states with different attention values (weights) are also sent to decoder (dashed lines). Apparently, the attention and bidirectional variants should work together to achieve the largest performance gain. Except the hidden states input, IDs at the last time step of encoder should also be sent to decoder (dashdotted lines), alternatively, forwarding number zero also works well as the hidden states already possess enough information, even if words in one time step are unknown. The output of decoder is a predicted sequence, by comparing it to ground truth sequence, loss or prediction error can be obtained. It is worth noting that the lengths of input and output sequence are variable and can be different. Furthermore, data amount of train set can be largely increased by sliding time series with a window. For example, the input can be when output isby sliding the original data with one time step,this is similar to the data augmentation idea in Convolutional Neural Network (CNN)
[10].Figure 2 shows the attention values of 1 and 10 predicted CC on the preceding 30 CC for hundreds of NMT based predictions with a welltrained network (route 10 in indoor experiment in Section IVB). Apparently, we can tell that the main attention of the first predicted CC focuses on the intermediate hidden state instead of the last or the first hidden state of preceding CC, which means that in bidirectional mode, the hidden state calculated from part of forward direction and other part of backward direction may contribute more valuable information for prediction than full forward or full backward. The idea of introducing attention values is similar to the application of autocorrelation values in linear prediction method based on the AR modeling, only this attention mechanism is more powerful because the attention target is a hidden state of neural network while the autocorrelation target is just a channel coefficient. Observed from this figure, attention values through different tests are similar, that is essential for a robust channel prediction model. For the 10 predicted CC, the main attention is on the last hidden state, which is more and more important when predicting samples further ahead.
To simplify, is used to represent the operation of utilizing preceding CC to predict following CC. In channel prediction task, regarding the time interval between sampled channel or Doppler shift, the unit of and can also be time span or wavelength. For example, task in figure 2 can be expressed as a 30:10 samples channel prediction task, as the time interval between sampled channel is 1 millisecond (ms), it’s also a 30:10 ms channel prediction task. As the user speed in this case is 3km/h, the highest Doppler frequency is about 10Hz at center frequency 3.45GHz, the distance traveled during the prediction can be measured in wavelengths, therefore this task can be expressed as a 0.3:0.1 wavelengths channel prediction task.
In this paper, each RNN cell has a twolayered LSTM/GRU with hidden layers of size 1000. For fair comparison, we train 2 epochs for every model with Adam as optimizer and same annealing recipe of learning rate. After training, the predictor is tested with an M:N predicting operation, then every
length segmented data in testing channel sequence is replaced with predicted samples, so that a new predicted channel sequence is obtained. The normalized mean square error (NMSE) of the prediction can be expressed as,(1) 
where is Frobenius norm. For radio system, what really matters is the final throughput. Therefore, in some of the following experiments, block error rate (BLER) performance is also calculated with the predicted channel coefficients.
Iv Numerical Results and Discussions
Performance evaluation of this channel predictor is carried out on two scenes: link level simulation (LLS) scene in section IVA and realistic measured scene in section IVB. Orthogonal Frequency Division Multiplexing (OFDM) system with 20MHz bandwidth is used for both scenes. All the data and results of LLS presented here are based on the tappeddelay line (TDL) channel model [11], specifically, TDLC, which is designed for nonlineofsight (NLOS) propagation. A long delay spread 300ns is set in the TDLC model and noisefree channel coefficients on 31 propagation paths are introduced for training. For realistic measured scene, indoor and outdoor measurements are carried out at 3.45GHz carrier frequency.
Iva Simulation
Channel predictor is investigated for LLS to avoid the impact of noise, interference and hardware imperfections. The NLG solution and NMT solution are compared at user speed 100km/h, the unit of is symbol, and 14 symbols equal to one millisecond. In this case, train set is from 31 paths in 100 seconds, while test set is obtained from another 10 seconds with a different random seed. In figure (a)a, performance of decoding with ideal channel coefficients, i.e., ideal channel estimation (ICE) is calculated as a baseline, NLG and NMT solution with varying are investigated. When is fixed to 14 symbols, larger introduces more preceding symbols for learning and predicting. From figure (a)a, we can tell that by increasing , better performance can be achieved, from figure (b)b, it is shown that the performance gains are result from more accurate prediction according to NMSE value. However, reducing prediction error by increasing will finally saturate at , which is around 30 samples shown in previous work [2]. Based on the neuralnetwork algorithm we used, the memory span of this channel predictor is six times larger with 196 samples derived from NLG solution.
Normally, the saturate point shows fitting and memorizing capability of the model together with its hyperparameters, while for datadriven models, the size of train set may become the bottleneck of this capability. NMT solution trained with the same data set as NLG solution is shown as solid line with plussign markers in figure
(b)b, we can see that when is larger than 98, prediction is getting worse. It means the same data set is not enough to define a model with large . Fortunately, the time series sequence is continuous, if we slide the original data set with a time offset, i.e., sliding window technique, an augmented data set is obtained. Though this new data set is all from the old sequence, it’s essential for NMT based channel predictor to understand that and samples are all from a continuous long sequence. In this experiment, a sliding window is chosen to make the date set five times larger, and the prediction result is shown as dashed line, we can see that prediction error is going down at 196 samples and nearly saturate under this amount of data set. In the comparison of NMT and NLG solutions, NMT solution with 14:14 symbols outperforms all NLG solutions on BLER performance and NMSE value, even more, NMT solution with 98:14 symbols can reach the ICE performance bound. It means the prediction error of 98:14 symbols NMT solution is below the critical value for perfectly decoding with imperfect channel coefficients. Compared to NLG channel prediction, the fitting ability of NMT solution is much stronger.Transfer learning is efficient when solving a problem while the knowledge of a related problem has already been learned. As we have already learned channel model at 100km/h user speed, a slower or faster timevariant channel should be efficiently predicted or learned by transferring knowledge. In this experiment, we work on a channel with 3km/h user speed, as the channel changes slower, we can predict a longer time span with where S is the sampling rate and the channel coefficients sequence is sampled followed by calculation of . It is worth noting that when using transfer learning, VCC used to transform should be the same as VCC used by the model being transferred. In this case, and symbols are applied, that is, a 30:30 ms NMT solution for channel prediction. The predicted
can be recovered to original time interval by interpolating operation. As a result of the sampling operation, train set is 30 times smaller than original data set. For datadriven models, the size of data set is essential, which we have already proved in the previous experiment. If we learn from scratch with this small train set, even after 20 epochs training, around 1dB performance loss at
is shown in figure 4. However, if the welltrained 100km/h channel model is directly used to predict sampled 3km/h channel, 0.8dB performance loss can be reduced. From these results, we can see that even without learning, predicting slowlyvarying channel with rapidlyvarying channel model is feasible. With transfer learning, the performance is further improved, where the 100km/h channel model is being modified to fit the sampled 3km/h channel according to the input data. Moreover, attention variant of seq2seq models can be utilized and further performance gains are achievable as shown in figure 4.IvB Measurement
In realistic scene, measured channel coefficients are forwarding to channel predictor. The indoor and outdoor channel measurements are conducted in an environment shown as a schematic diagram in figure 5, which includes a 20m 20m room at 4th floor for indoor measurement, a 50m 50m garden with surrounding buildings for outdoor measurement, and lines with arrowhead indicating routes and directions of moving user equipment (UE). In the room, floor to ceiling height is 3m and the heights of base station (BS) and UE are 2m and 1m. UE speed for routes 1 to 12 is 3km/h while route 13 is moving around behind pillar, and route 14 is moving around in the corridor. For outdoor scene, BS is pointed to the garden through a window 15 meters above the ground and routes 15 to 16 is moving in circle with 3km/h. UE uses 1 antenna while BS uses 1 antenna for outdoor, 2 antennas for indoor. Different from simulation scene, channel estimation and prediction are in the frequency domain as the accurate sinusoids of propagation channel are difficult to obtain due to estimation error, therefore, every channel coefficient in the time series is estimated from pilot symbols. The leastsquare estimated channel is truncated at time domain while significant taps are kept. The timevarying channel from different subcarriers can also be aligned one by one to form one long sequence, which can largely increase the data set for training and predicting.
For indoor scene, channel predictor with attention based 30:10 ms NMT solution is used and the generalization capability of this channel predictor is being investigated. Not only the channel from different subcarriers are trained together, channel from different ports corresponding to 2 transmitting antennas are also trained together, the goal is to generalize the overall channel environment from training on limited route and transferring the knowledge on predicting other routes. Imagine such a scenario, where the AIenhanced indoor distributed antenna system (IDAS) was trained with channel of an UE in a room and is capable to predict channel of other UEs in other rooms with similar layout. In this experiment, training is carried out with train set composed of data from routes 1 to 9, which are following the same route and are just collected at different times. Data from routes 10 to 14 are directly predicted using the welltrained model, i.e., transferring knowledge directly without transfer learning (see section IVA). The prediction error is shown in table II, for comparison, routes 8 and 9 are also predicted with this model they have contributed. We can tell from the table that prediction error of routes 10 to 13 are close to trained routes 8 and 9, moreover, though route 10 is following the same route with train set, route 12 and NLOS route 13 have better prediction accuracy than it, which shows the generalization capability of this model is good. These results also indicate that transferring channel model among different locations, ports and times is feasible. However, route 14 in the corridor seems following different properties of propagation as the NMSE is much larger than the routes inside the room. Fortunately, transfer learning can be easily done on partial data of route 14, it can be seen that the prediction error of the rest of data is reduced. However, to further reduce the prediction error, more data in the corridor should be collected and trained. For a building with AIenhanced IDAS, it would be easy to collect CSI in all the corridors having antennas in a very short time and generate a large enough data set for training a generalized corridor channel model.
Route  8  9  10  11 

NMSE  0.0061  0.0057  0.0062  0.0064 
Route  12  13  14  14+TL 
NMSE  0.0058  0.0054  0.0108  0.0075 
Compared to indoor scene, BLER performance of the channel predictor for outdoor scene is studied at various signaltonoise ratio (SNR) values by using attenuators. In these experiments, channel predictor with attention based 20:20 ms NMT is used. For comparison, channel sampling rate (CSR) of estimator, i.e., frequency of channel estimation with pilots, is designed to be 400Hz or 2kHz. Channel coefficients on 1320 subcarriers estimated with 400Hz CSR are used for training of channel predictor, and the total length of train set is around 0.2 billion. Routes 15 to 16 are following the same circling route, data from route 15 is used for training and route 16 for testing. Though channel predictor is trained with ms, we still can use this predictor to inference on different time spans, e.g., 10ms or 30ms.
The decoding performance with predicted and estimated channel coefficients for route 16 is shown in figure 6. We can tell that longer prediction time span introduces larger performance loss, and at the decoding performance of ms predictor is around 0.1dB inferior to decoding performance of estimator with 400Hz CSR. When ms and ms, at least one third pilot or CSI feedback resources can be saved, here is a tradeoff between improving BLER performance and saving timefrequency resources, both of them can finally increase the system throughput.
The second tradeoff is between improving BLER performance and saving computing resources. As shown in figure 6, improving BLER performance by increasing CSR is not easy due to the inherent estimation error (estimator 2). Fortunately, predictor can outperform estimator with the cost of some computing resources. In figure 7, channel amplitudes corresponding to subcarriers in a frame are compared for three predictors and estimator 2. In this specific frame, only channel of predictor 2 can decode this frame correctly while other predictors or estimators will fail, though channel predicted by predictor 2 is the most different from channel estimated by estimator 2, the randomness of prediction can provide more possibilities to overcome estimation error. By using multiple decoding (MD) with different predicted channel at each frame and selecting the decoder with lowest error rate, performance gain can be achieved. As shown in figure 6, decoding three times with predictor 1, 2 and 3 is an option for 1dB performance gain. When improving BLER performance is critical and computing resources are sufficient, decoding more times can be further utilized for larger gains.
In this experiment, is 20 ms, which can be increased for better prediction. However, the third tradeoff here is between decreasing prediction error and decreasing block error. The training target is to decrease prediction error while the radio system is willing to decrease the block error. If the channel predictor is accurate enough, performance of this predictor will be nearly the same with estimator, however, its ability of compensating estimation error is also weakened, which means the MD methods will be useless. Therefore, for a specific system, a specific strategy should be designed for channel predictor.
V Conclusions
In this paper, we have presented a neuralnetwork based channel predictor which takes the wireless channel model as a language model, the channel changes as words. Vocabulary of Channel Changes is proposed to support the training process and the expressive power of this model is proved to be sufficient to generalize channel conditions among different user speeds, subcarriers, times, ports and locations for simulation and realistic measured scenes. Further, transfer learning is investigated and we prove the feasibility of transferring knowledge from different channel conditions by using the same Vocabulary of Channel Changes. Experimental comparison of the NLG solution and NMT solution is considered as an evidence that predicting time series sequence with seq2seq models is an accurate and efficient method. For seq2seq models, we also prove the effectiveness of its bidirectional and attention variants on channel prediction task. While one channel predictor can provide comparable performance with channel estimator, the combination of predictors can introduce remarkable gains with multiple decoding method. As predictors can outperform estimator in decoding, to say nothing of replacing outdated CSI in adaptive transmission. These results show us a promising future for AIenhanced channel prediction.
References
 [1] X. You, C. Zhang, X. Tan, S. Jin, and H. Wu, “Ai for 5g: Research directions and paradigms,” arXiv preprint arXiv:1807.08671, 2018.
 [2] A. DuelHallen, “Fading channel prediction for mobile radio adaptive transmission systems,” Proceedings of the IEEE, vol. 95, no. 12, pp. 2299–2313, 2007.
 [3] A. DuelHallen, S. Hu, and H. Hallen, “Longrange prediction of fading signals,” IEEE Signal processing magazine, vol. 17, no. 3, pp. 62–75, 2000.
 [4] T. Ekman, “Prediction of mobile radio channels: modeling and design,” Ph.D. dissertation, Institutionen för materialvetenskap, 2002.
 [5] M. Chen, T. Ekman, and M. Viberg, “New approaches for channel prediction based on sinusoidal modeling,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, p. 049393, 2006.

[6]
T. Ding and A. Hirose, “Fading channel prediction based on combination of complexvalued neural networks and chirp ztransform,”
IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 9, pp. 1686–1695, 2014.  [7] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
 [8] J. Chung, K. Cho, and Y. Bengio, “A characterlevel decoder without explicit segmentation for neural machine translation,” arXiv preprint arXiv:1603.06147, 2016.

[9]
Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic
language model,”
Journal of machine learning research
, vol. 3, no. Feb, pp. 1137–1155, 2003. 
[10]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in
Advances in neural information processing systems, 2012, pp. 1097–1105.  [11] J. Meredith, “Study on channel model for frequency spectrum above 6 ghz,” 3GPP TR 38.900, Jun, Tech. Rep., 2016.
Comments
There are no comments yet.