I Introduction
To satisfy the demand of explosive wireless applications, e.g., diverse intelligent terminal access, autonomous driving, and the Internet of Things, the new generation of wireless communication systems is expected to handle massive data and to meet the requirements of both high reliability and low latency. However, existing communication systems, which are basically designed based on conventional communication theories, exhibit several inherent limitations in meeting these requirements, such as relying on accurate theoretical models, suffering from high-complexity algorithms, and being restricted to block-structured communication protocols [46]. Recently, intelligent communication has been recognized as a promising direction in future wireless communications. As a major branch of machine learning, deep learning (DL) has been applied in physical layer communications as a potential solution to deal with the massive data and the high complexity of wireless communication systems
[17, 35]. By merging DL into existing communication systems, remarkable progress has been made in various applications such as channel estimation [53, 30, 18, 12, 54, 3, 52, 15], data detection [55, 19], channel feedback [47, 45, 16], beamforming [29, 2, 51, 4], and hybrid precoding [28].

Compared with conventional communications that are based on statistics and information theories, DL-based communications benefit from both the excellent learning capability of deep neural networks (DNNs) and the impressive computational throughput of parallel processing architectures. Moreover, DL requires neither tractable mathematical models nor high-complexity computational operations. Among these advantages, the most valuable is the ability of DL to handle problems with imperfect models or without mathematical models. Interestingly, there exists out-of-band side information in communication systems that can be utilized to improve the system performance, including sub-6 GHz channels, user positions, and 3D scene images obtained by cameras. It should be noted that conventional communication techniques can hardly take advantage of such out-of-band side information due to the lack of tractable mathematical models. In fact, the development of techniques that utilize out-of-band side information to improve the system performance has been an emerging research trend in DL-based communications
[29, 2, 51]. For example, [2] proposed a sub-6 GHz channel information aided network for mmWave beam and blockage prediction, which could effectively reduce the overheads of both feedback and beam training. In [51], a novel 3D scene based beam selection architecture was developed for mmWave communications by using the surrounding 3D scene of the cellular coverage as the input of the networks.

Meanwhile, model aided DL has also made much progress recently. Instead of purely relying on training data, model aided DL benefits from the guidance of model information and can therefore achieve better performance [17]. For instance, [15] proposed an efficient DL-based channel estimator by learning the linear model between the least squares (LS) estimate and the linear minimum mean square error (LMMSE) estimate. In [19], the authors proposed a model-driven DL-based multiple-input multiple-output (MIMO) detector by unfolding an iterative algorithm, which can significantly outperform the corresponding traditional iterative detector.
Although a few DL-based works have tried to utilize multi-source sensing information (MSI), e.g., out-of-band side information and model information, to improve the system performance, none of them has yet investigated how to integrate and comprehensively utilize such MSI in communication systems. From the viewpoint of machine learning, data from various sources are referred to as multimodal sensory data, whereas data from one source are referred to as data of a single modality. Communication systems naturally work with multimodal data, and this clear advantage should not be squandered. Multimodal learning aims to build models that can fully exploit the constructive and complementary information lying in multimodal data, thus gaining performance advantages over methods that only use data of a single modality [36]. By combining DL architectures with multimodal learning methodologies, the concept of deep multimodal learning (DML) has been proposed in [33]
. Thanks to the excellent flexibility of DL in extracting hierarchical features of data, DML offers several advantages over conventional multimodal learning, such as learning based feature extraction, implicit dimensionality reduction, and easy scalability in the number of modalities [36]. The mainstream applications of DML include human action recognition, audio-visual speaker detection, and autonomous driving [33, 36, 14]. For example, [14] jointly exploited two modalities, i.e., image and optical flow, for human action recognition, which could obtain higher recognition accuracy than using image data alone.

This paper aims to develop a systematic framework for DML-based wireless communications. By using DML, the multimodal sensory data available in wireless communication systems can be fully exploited to provide constructive and complementary information for various tasks. The main contributions of this work can be summarized as follows:

We provide complete descriptions and analyses of the framework of DML. As opposed to [36, 33], which mainly studied DML in the computer vision, speech, and natural language processing areas, this is, to the best of the authors' knowledge, the first work that explores how DML technologies can be applied to wireless communications, and we also provide some heuristic understandings.

By investigating various modality combinations and fusion levels, we design several DML-based architectures for channel prediction in massive MIMO systems as a case study. The design process presents beneficial guidance on developing DML-based communication technologies, while the proposed architectures can be easily extended to other communication problems like beam prediction, channel feedback, and resource allocation.

Simulations based on ray-tracing software have been conducted and demonstrate that the proposed framework can effectively exploit the constructive and complementary information of multimodal sensory data in various scenarios.
The remainder of this paper is organized as follows. The motivation for exploring DML-based wireless communications is given in Section II. The design choices of DML are presented in Section III. As a case study, Section IV proposes several DML-based architectures for channel prediction in massive MIMO systems. Numerical results are provided in Section V, followed by our main conclusions in Section VI.
Notation:
Bold letters denote vectors or matrices. The notation $|\mathbf{a}|$ denotes the length of the vector $\mathbf{a}$. The notations $(\cdot)^T$ and $(\cdot)^H$ respectively denote the transpose and the conjugate transpose of a matrix or a vector. The notation $\mathbb{C}^{M}$ represents the complex vector space. The notation $\|\mathbf{a}\|$ denotes the norm of $\mathbf{a}$. The notation $\circ$ represents the composite mapping operation. The notations $\Re\{\cdot\}$ and $\Im\{\cdot\}$, respectively, denote the real and imaginary parts of matrices, vectors, or scalars. The notations $*$ and $\odot$ respectively represent the convolution operation and the matrix element-wise product. The notation $\mathbb{E}\{\cdot\}$ represents the expectation with respect to all random variables within the brackets.
II DML for Wireless Communications
MSI in communication systems, including out-of-band side information, model information, and other system information, is referred to as multimodality. Information from one source is referred to as one modality.
Multimodal sensory data in communication systems typically have varying confidence levels (the confidence level of one modality refers to the degree of contribution or reliability offered by that modality towards a certain task [5]) when accomplishing different tasks. Take the beam selection at the base station (BS) as an example; the optimal beam vector can be obtained in three ways: (1) finding the best beam based on known downlink channels [56], (2) learning from 3D scene images [51, 4], and (3) extrapolating from sub-6 GHz channels [2]. Among the three modalities, the downlink channels obviously have a higher confidence level than both the 3D scene images and the sub-6 GHz channels, while the confidence levels of the 3D scene images and the sub-6 GHz channels depend on the specific scenario. Nevertheless, even when we have access to the modality with the highest confidence level, there may exist modalities that provide complementary information to further improve the performance or robustness of single-modality based methods [34]. To better understand this, we can refer to maximum ratio combining (MRC) [42], a widely adopted technique for obtaining combining gains in multi-antenna communication systems. MRC can benefit even from antennas with worse channels, which provides a revelatory understanding of the gains brought by modalities with relatively lower confidence levels.
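The combining-gain intuition above can be made concrete with a small numerical sketch (illustrative only; the antenna gains and noise level below are hypothetical):

```python
import numpy as np

def mrc_combine(h, y):
    """Maximum ratio combining: weight branch i by conj(h_i), then normalize.

    h : (N,) complex channel gains of the N receive antennas
    y : (N,) complex received samples, y = h * s + noise
    Returns the combined estimate of the transmitted symbol s.
    """
    return np.vdot(h, y) / np.vdot(h, h).real

def post_mrc_snr(h, noise_var):
    """Post-combining SNR of MRC: the sum of the per-branch SNRs."""
    return np.vdot(h, h).real / noise_var

# A strong antenna plus a weak one: the weak branch still adds SNR,
# just as a low-confidence modality can still add useful information.
h = np.array([1.0 + 0.0j, 0.3 + 0.1j])   # hypothetical channel gains
print(post_mrc_snr(h, 1.0) > post_mrc_snr(h[:1], 1.0))  # True
```

The post-combining SNR is the sum of the per-branch SNRs, so discarding the weak branch can only lose gain; the analogy to low-confidence modalities is qualitative, not exact.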
Meanwhile, multimodal sensory data usually have different dimensionalities and structures. For example, in massive MIMO systems, the user position data could be a real-valued vector, the received signals could be a much longer complex-valued vector, and the transmitted signals could be a high-dimensional complex-valued matrix. Therefore, it is important to design architectures that can fuse these modalities efficiently.
Few existing works have investigated how to integrate and fuse multimodal sensory data in wireless communication problems. Motivated by this, we aim to develop a systematic framework for DML-based communications and to illustrate the methodology by investigating DML-based channel prediction in massive MIMO systems.
III Design Choices in DML
The framework of DML consists of three core parts: selection, extraction, and fusion. The selection process selects appropriate models as well as effective modalities for a certain task. The extraction process extracts information from the involved modalities. The fusion process fuses the extracted information to obtain a fused representation of the multimodal sensory data. To embrace the performance advantages offered by DML, several design choices need to be considered, including the selection of models, modalities, fusion levels, and fusion strategies, as will be illustrated in the following.
III-A Model Selection
DL models can generally be divided into two categories: discriminative and generative models. Discriminative models aim to learn the mapping function from the inputs to the outputs, and are typically used to solve regression and classification tasks. In other words, given the input $\mathbf{x}$ and the label $\mathbf{y}$, discriminative models learn the conditional probability $P(\mathbf{y}|\mathbf{x})$ by updating the model parameters. Since the majority of tasks in physical layer communications are to estimate $\mathbf{y}$ based on $\mathbf{x}$, such as channel estimation [53, 18, 12], data detection [55, 23, 19], and beamforming [2, 51, 4], existing DL techniques for physical layer communications mainly adopt discriminative models.

Generative models aim to learn the training data distribution, which can then be used to generate new data with a similar distribution. More specifically, generative models learn the joint probability $P(\mathbf{x}, \mathbf{y})$ in supervised learning problems, or the input data probability $P(\mathbf{x})$ in unsupervised or self-supervised learning problems. For DML problems, generative models are useful in the following three aspects: (1) extracting features from different modalities, which are then used to perform discriminative tasks, i.e., regression and classification tasks; (2) dealing with missing modalities during the test stage or a lack of labeled data [39, 22]; (3) providing good initialization points for discriminative models, such as DNNs [20].

III-B Modality Selection
Different modalities may provide complementary information and have varying confidence levels for a certain multimodal learning task. Nevertheless, too many modalities may lead to information redundancy and an excessively complex fusion process. Therefore, it is worthwhile to select the optimal modalities by comprehensively considering the performance gain and the fusion complexity. In the areas of computer vision and speech, the modality selection problem is generally considered as a trade-off optimization [48, 6, 27]. For example, [48] proposed to select the optimal modalities based on the trade-off between the feature dimensionality and the modality correlations. However, the authors did not take the confidence levels of the modalities into account, which may miss modalities with high confidence levels. In [6], the authors utilized a dynamic programming approach to find the optimal subset of modalities based on the three-fold trade-off between the performance gain, the overall confidence level of the selected subset, and the cost of the selected subset. In summary, [48, 24, 6, 27] provided heuristic solutions for the modality selection problem in the context of multimedia data, while there is hardly any literature studying modality selection for communication systems. A more direct way, adopted in most existing DML works, is to manually select modalities by intuition and experiments.
III-C Fusion Level Selection
In general, modality fusion can be performed at four levels: data fusion, feature fusion, decision fusion, or hybrid fusion.
III-C1 Data fusion
Data fusion concatenates the raw or preprocessed data of all the modalities into a single vector and then learns a joint multimodal representation based on the concatenated vector, as shown in Fig. 1 (a). Data fusion is simple to design and allows end-to-end training. A typical example of data fusion is [53], where the received signals, the transmitted pilots, the previous channels, and the LS estimates are directly concatenated as the inputs of networks that estimate channels in doubly selective fading scenarios. However, data fusion ignores the unique structure of different modalities, which may make it difficult to learn the complementary information among the modalities. In addition, simple concatenation of the multimodal sensory data leads to high-dimensional input vectors that may contain redundancies, especially when the number of modalities is large.
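As a minimal sketch (the modality shapes below are hypothetical), data-level fusion amounts to flattening each modality and concatenating the results, with complex-valued data first split into real and imaginary parts:

```python
import numpy as np

def complex_to_real(x):
    """Split complex data into stacked real and imaginary parts."""
    x = np.asarray(x)
    return np.concatenate([x.real.ravel(), x.imag.ravel()])

def data_fusion(modalities):
    """Data-level fusion: flatten every (real-valued) modality and
    concatenate the results into one input vector for the network."""
    return np.concatenate([np.asarray(m, dtype=float).ravel()
                           for m in modalities])

# Hypothetical modalities: received signals (complex), pilots, an LS estimate
y = complex_to_real(np.array([1 + 2j, 3 - 1j]))   # length 4
x_pilot = np.ones(6)                              # length 6
h_ls = np.zeros(8)                                # length 8
fused = data_fusion([y, x_pilot, h_ls])           # length 18
```

The growth of the fused vector with the number of modalities illustrates the redundancy problem noted above.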
III-C2 Feature fusion
Before introducing feature fusion, we first explain how to extract features from the raw or preprocessed data of one modality. The transform from the raw or preprocessed data to features is referred to as "feature extraction". Feature extraction algorithms can be generative or discriminative, linear or nonlinear, such as principal component analysis [11], linear discriminant analysis [38], and Laplacian eigenmaps [7]. In recent years, DNNs have become a popular technique for fusing modalities due to their excellent power and flexibility in extracting hierarchical features of the data. Specifically, each hidden layer of a network represents a hierarchical feature of the inputs. By changing the number of layers or choosing a proper architecture, DNNs can extract features at various levels or with various dimensions. For example, [47] proposed a deep autoencoder architecture for channel feedback, where the dimension of the learned compressed vector, i.e., the extracted feature that is used to reconstruct the original channel, can be adjusted according to the given compression ratio.
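As an illustrative sketch of classical (non-DNN) feature extraction, principal component analysis projects the data of one modality onto its top-$k$ principal directions; this is not the paper's method, merely an example of a linear extractor:

```python
import numpy as np

def pca_features(X, k):
    """Extract k-dimensional PCA features from n samples (rows of X).

    X : (n, d) real-valued data matrix of one modality
    k : target feature dimension, k <= d
    Returns the (n, k) projected features.
    """
    Xc = X - X.mean(axis=0)                       # center the data
    # right singular vectors = principal directions of the covariance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

A DNN hidden layer plays the same role as the projection here, but learns a nonlinear, task-driven transform instead of a fixed variance-maximizing one.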
Now we discuss feature fusion. As illustrated in Fig. 1 (b), feature fusion fuses higher-level features into a single hidden layer and then learns a joint multimodal representation for the output. By utilizing the extracted higher-level features, a model with feature fusion can learn higher-order correlations across modalities. Moreover, thanks to the flexibility of feature dimension reduction offered by DNNs, the feature fusion strategy may have more advantages than the data fusion strategy in learning multimodal representations [36].
III-C3 Decision fusion
Before introducing decision fusion, we first explain how to acquire a decision for one modality. The process of obtaining task results based on the data of one modality is referred to as "decision acquirement". Decision acquirement can be realized by either DL-based algorithms or conventional communication algorithms.
As shown in Fig. 1 (c), the decisions that are independently acquired from the involved modalities are fused to make a final decision, i.e., the output of the model. The disadvantage of decision fusion is that it cannot exploit the feature-level correlations among modalities. However, the decision fusion strategy also has several advantages over the feature fusion strategy:

When the involved modalities are completely uncorrelated or have very different dimensionality, it is much simpler and more reasonable to adopt decision fusion.

Decision fusion makes it possible to adopt the most suitable algorithm to make decisions for each modality. In particular, for modalities that can use accurate mathematical models to acquire decisions, algorithms based on conventional communication theories may be more suitable than DL-based algorithms.

The fusion task is easier to implement since the decisions of different modalities usually have similar data representations.
III-C4 Hybrid fusion
To embrace the merits of both the feature and the decision fusion strategies, hybrid fusion combines feature and decision fusion in a unified framework. Fig. 1 (d) displays an example of hybrid fusion where the decisions and features of three modalities are fused at two different depths of the model. It should be emphasized that the decisions or features of multiple modalities can either be fused into a single layer or be fused gradually, i.e., modalities can be fused at different depths of the model. The choice of the depth at which to fuse which modalities is based on intuition and experiments. Take channel estimation as an example. Given the three modalities, i.e., the pilots, the received signals, and the user position, we usually choose to first fuse the pilots and the received signals and then fuse the user position, because the pilots and the received signals are highly correlated and, based on conventional communication theories, the corresponding fusion should work well. Besides, the gradual fusion strategy can also avoid an overly large fusion vector, which partially alleviates the curse of dimensionality (the term "curse of dimensionality" was first proposed in [8], and refers to the phenomenon that, as the data dimensionality increases, the dimension of the feature space grows so fast that the available data become sparse and dissimilar in many ways; in this case, the amount of data required to support the data analysis often grows exponentially with the dimensionality).
It should be mentioned that the fusion level selection depends on the specific problem, and therefore the superiority of a fusion level strategy should be investigated for a specific problem rather than in an absolute sense.
III-D Fusion Strategy Selection
Various methods can be used to fuse different modalities, among which fixed-rule based fusion is the simplest, including the "max", "min", "average", and "majority voting" rules (see more rules in [26]). Besides, linear weighting is also a common fusion strategy, where the features or decisions of different modalities are combined with linear weights. One successful application of linear weighting is MRC, where the weights can be directly determined by the channels. However, linear weighted modality fusion is not as simple as MRC. The greatest challenge lies in the determination of the weights for each modality, especially when the data dimension is high. To solve this problem, DNN based fusion has been proposed and has gained growing attention in recent years [33, 14]. DNN based fusion can learn a nonlinear weighted mapping from the input to the output, and the weights can be adjusted by training with pre-acquired datasets instead of manual selection.
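The fixed rules and linear weighting mentioned above can be sketched as follows (illustrative only; the decision values are hypothetical, and a real system would fuse learned decisions or features):

```python
import numpy as np

def average_fusion(decisions):
    """Fixed 'average' rule: element-wise mean of the per-modality decisions."""
    return np.mean(np.asarray(decisions, dtype=float), axis=0)

def majority_vote(labels):
    """Fixed 'majority voting' rule over per-modality class labels."""
    labels = np.asarray(labels)
    return np.array([np.bincount(labels[:, i]).argmax()
                     for i in range(labels.shape[1])])

def linear_weighted_fusion(decisions, weights):
    """Linear weighted fusion; in MRC the weights come from the channels,
    while in general DML they must be chosen (or learned) per modality."""
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), np.asarray(decisions, dtype=float), axes=1)
```

A DNN based fusion network replaces the fixed weight vector with a trainable nonlinear mapping.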
IV Case Study: DML for Massive MIMO Channel Prediction
In this section, we first present the available modalities for channel prediction in massive MIMO systems and give brief descriptions of the involved network architectures. Then, we discuss the architecture designs for the BS and the user, respectively, followed by the detailed training steps of the proposed networks.
IV-A Available Modalities for Channel Prediction
Acquiring channel knowledge plays a critical role in massive MIMO, which is a promising technology for future wireless communication systems mainly due to its high power efficiency and spectrum efficiency [44, 50]. In this work, we consider a massive MIMO system where a BS equipped with $M$ antennas in the form of a uniform linear array (ULA) serves multiple single-antenna users (we adopt the ULA model for simpler illustration; nevertheless, the proposed approaches are not restricted to a specific array shape and are therefore applicable to arrays with arbitrary geometry). Note that the proposed approaches are applicable to uplink/downlink channel prediction in both TDD and FDD systems, while we take the downlink channel prediction in an FDD massive MIMO system as a typical example to illustrate the design and application of DML-based channel prediction.
System model: To discuss the available modalities in FDD massive MIMO systems, we first present the mathematical model for the downlink transmission. Denote $L_p$ as the pilot length. The received frequency domain signal of the $k$th user on the $m$th subcarrier is

$$\mathbf{y}_{k,m} = \mathbf{X}_m \mathbf{h}_{k,m} + \mathbf{n}_{k,m}, \qquad (1)$$

where $\mathbf{y}_{k,m} \in \mathbb{C}^{L_p}$ is the received signal, $\mathbf{X}_m \in \mathbb{C}^{L_p \times M}$ is the downlink pilot signal, and $\mathbf{n}_{k,m}$ is the additive white Gaussian noise. Moreover, $\mathbf{h}_{k,m} \in \mathbb{C}^{M}$ is the downlink channel that can be written as [1]

$$\mathbf{h}_{k,m} = \sum_{p=1}^{P} \alpha_p\, e^{\,j(\phi_p - 2\pi f_m \tau_p)}\, \mathbf{a}(\theta_p, \varphi_p), \qquad (2)$$

where $P$ is the path number, $f_m$ is the frequency of the $m$th downlink subcarrier, while $\alpha_p$, $\phi_p$, and $\tau_p$ are the attenuation, phase shift, and delay of the $p$th path, respectively. In addition, $\mathbf{a}(\theta, \varphi)$ is the array manifold vector defined as

$$\mathbf{a}(\theta, \varphi) = \left[1,\; e^{-j\frac{2\pi d}{\lambda}\cos(\varphi)\sin(\theta)},\; \ldots,\; e^{-j\frac{2\pi d}{\lambda}(M-1)\cos(\varphi)\sin(\theta)}\right]^T, \qquad (3)$$

where $\lambda = c/f_m$ is the wavelength, $d$ is the antenna spacing, $c$ is the speed of light, and $(\theta, \varphi)$ is the {azimuth, elevation} angle of arrival. We employ the accurate 3D ray-tracing simulator Wireless InSite [37] to obtain the channel parameters in Eq. (2), i.e., $\{\alpha_p, \phi_p, \tau_p, \theta_p, \varphi_p\}_{p=1}^{P}$. To simplify the notation, we drop the subcarrier index $m$ and the user index $k$ in the rest of the paper, e.g., we replace $\mathbf{y}_{k,m}$, $\mathbf{X}_m$, and $\mathbf{h}_{k,m}$ with $\mathbf{y}$, $\mathbf{X}$, and $\mathbf{h}$, respectively.
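The channel model of Eqs. (2) and (3) can be sketched numerically as follows (a minimal sketch; the sign and phase conventions and the half-wavelength default spacing are assumptions, since the equations were reconstructed from the surrounding text):

```python
import numpy as np

C = 3e8  # speed of light in m/s

def ula_manifold(az, el, f, M, d=None):
    """Array manifold vector of an M-antenna ULA, per Eq. (3).

    az, el : azimuth and elevation angles of arrival in radians
    f      : subcarrier frequency in Hz
    d      : antenna spacing (assumed half a wavelength by default)
    """
    lam = C / f
    if d is None:
        d = lam / 2
    phase = 2 * np.pi * d / lam * np.cos(el) * np.sin(az)
    return np.exp(-1j * phase * np.arange(M))

def multipath_channel(alphas, phis, taus, azs, els, f, M):
    """Downlink channel per Eq. (2): a sum of P weighted manifold vectors,
    with attenuation alpha_p, phase shift phi_p, and delay tau_p per path."""
    h = np.zeros(M, dtype=complex)
    for a, phi, tau, az, el in zip(alphas, phis, taus, azs, els):
        h += a * np.exp(1j * (phi - 2 * np.pi * f * tau)) * ula_manifold(az, el, f, M)
    return h
```

In the paper the per-path parameters come from the Wireless InSite ray tracer rather than being chosen by hand.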
Available modalities: The available modalities in an FDD massive MIMO system include the received signals, the pilots, the LS estimate, the downlink channels of previous coherence time periods, the uplink channel, the user location, and the partial downlink channel, as described in the following.
IV-A1 Received signals and pilots
Eq. (1) indeed reveals that there exists a mapping function from $\{\mathbf{X}, \mathbf{y}\}$ to $\mathbf{h}$, which indicates that the received signals and the pilots are two modalities that can be jointly utilized to predict the downlink channel $\mathbf{h}$.
IV-A2 LS estimate
When the number of pilots is sufficient (i.e., $L_p \ge M$), $\mathbf{h}$ can be estimated by LS [9], i.e., $\hat{\mathbf{h}}_{\mathrm{LS}} = (\mathbf{X}^H \mathbf{X})^{-1} \mathbf{X}^H \mathbf{y}$. In fact, the LS estimate can be regarded as one modality from model information.
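The LS estimate can be sketched in a few lines (a noiseless illustration; the pilot matrix below is randomly generated, not a practical pilot design):

```python
import numpy as np

def ls_estimate(X, y):
    """LS channel estimate h_LS = (X^H X)^{-1} X^H y, requiring L_p >= M.

    X : (L_p, M) downlink pilot matrix
    y : (L_p,)  received signal vector
    """
    XH = X.conj().T
    # solve the normal equations instead of forming the explicit inverse
    return np.linalg.solve(XH @ X, XH @ y)
```

With noise present, the same expression yields the familiar LS estimate whose error floor motivates learning-based refinements such as [15].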
IV-A3 Previous downlink channels
Denote the superscript $(t)$ as the index of coherence time periods. The downlink channels of previous coherence time periods, i.e., $\{\mathbf{h}^{(t-1)}, \mathbf{h}^{(t-2)}, \ldots\}$, are referred to as previous downlink channels for ease of exposition (since the downlink channel to be predicted and the other involved modalities are all data in the $t$th coherence time period, we omit the superscript of these real-time data for simplicity). In practical systems, there exist unknown time correlations among channels that cannot be exploited by conventional channel estimation algorithms, whereas such time correlations can be implicitly learned by DNNs and then used to improve the prediction accuracy.
IV-A4 User location
The user location can be obtained by various techniques, such as ultra-wideband, the global positioning system, and wireless fidelity. Many positioning works have revealed that there is a distinct link between the user's position and the channels [40, 43]. Define the location-to-channel mapping as $\Phi_f: \mathbf{p} \to \mathbf{h}$, where $\mathbf{p}$ is the 3D coordinate of the user and $f$ is the carrier frequency. Based on the universal approximation theorem [21] and the widely adopted assumption that $\Phi_f$ is a bijective deterministic mapping in massive MIMO systems [40, 43], we know that the mapping function $\Phi_f$ can be approximated arbitrarily well by a DNN under ideal conditions. Therefore, the modality of user location can be adopted to predict the downlink channel by using DNNs to learn the mapping $\Phi_f$.
IV-A5 Uplink channel
Since uplink channels are easier to obtain than downlink channels in massive MIMO systems, many studies utilize uplink channels to aid the downlink channel prediction [3, 52, 49]. With the assumption that $\Phi_f$ is a bijective deterministic mapping, the channel-to-location mapping exists and can be written as $\Phi_{f_{\mathrm{UL}}}^{-1}: \mathbf{h}_{\mathrm{UL}} \to \mathbf{p}$. Hence, the uplink-to-downlink mapping exists and can be written as follows [3]:

$$\mathbf{h} = \Phi_{f_{\mathrm{DL}}} \circ \Phi_{f_{\mathrm{UL}}}^{-1}\left(\mathbf{h}_{\mathrm{UL}}\right), \qquad (4)$$

where $f_{\mathrm{UL}}$ is the uplink frequency, and $\circ$ represents the composite mapping related to $\Phi_{f_{\mathrm{DL}}}$ and $\Phi_{f_{\mathrm{UL}}}^{-1}$. Therefore, the modality of uplink channel can also be adopted to predict the downlink channel by using DNNs to learn the mapping $\Phi_{f_{\mathrm{DL}}} \circ \Phi_{f_{\mathrm{UL}}}^{-1}$.
IV-A6 Partial downlink channel
Due to the high cost and power consumption of radio-frequency chains, massive MIMO systems usually adopt hybrid analog and digital transceivers that are operated with switchers [28]. Therefore, given the limited transmission period and pilot length, only a partial downlink channel can be obtained by the user and then fed back to the BS. Denote the partial downlink channel as $\mathbf{h}_{\mathrm{par}}$, whose length is smaller than $M$, and denote the vector consisting of the unknown elements of $\mathbf{h}$ as $\mathbf{h}_{\mathrm{unk}}$. Recalling Eq. (3) and Eq. (2), it is obvious that there exists a deterministic mapping from $\mathbf{h}_{\mathrm{par}}$ to $\mathbf{h}_{\mathrm{unk}}$. Therefore, we can predict the downlink channel by learning this mapping.
[Table I: the modalities for downlink channel prediction, marking each as available ("✓") or unavailable ("✗") at the BS side and at the user side, together with existing works that utilize each modality, e.g., [53, 15], [53, 30], [3, 52], and [3, 13].]
To facilitate the analysis, we list the modalities for downlink channel prediction in Tab. I, where "✓" and "✗" respectively represent the available and unavailable modalities for the BS or the user. In particular, the modalities of the user location and the uplink channel are available at the BS because they can be fed back to the BS by the user. The modality of the LS estimate is obtained from the received signals and the pilots. When the pilot length is sufficient for the LS estimator, i.e., $L_p \ge M$, it would be more efficient to directly feed back the downlink channel rather than the received signals to the BS. Therefore, we set the modality of the received signals and the pilots to be unavailable at the BS. Tab. I also displays the existing works that utilize the aforementioned modalities to predict channels. By trying and testing possible modality combinations and fusion level strategies, we can find the modalities with higher confidence levels and the modality combinations with better performance.
IV-B DNN Architectures
Based on the definition in Section III-A, downlink channel prediction is a typical discriminative regression task. Since discriminative models are naturally suitable for feature extraction and decision acquirement in discriminative tasks, we choose discriminative models for downlink CSI prediction. The selection of both the modalities and the fusion level strategies depends on the specific scenario. Besides, due to the excellent learning capability of DNNs, we adopt DNN based fusion for channel prediction rather than fixed-rule based fusion.
Loss function: A DNN architecture consists of the input $\mathbf{x}$, the label $\mathbf{y}$, the output $\hat{\mathbf{y}}$, the network parameters $\Theta$, the loss function $L(\Theta)$, a back-propagation learning algorithm, the activation functions, and the network layers. Specifically, the network parameters $\Theta$ include the weights and the biases of the network layers. The loss function adopted in this work is

$$L(\Theta) = \frac{1}{V} \sum_{v=1}^{V} \left\| \hat{\mathbf{y}}_{(v)} - \mathbf{y}_{(v)} \right\|^2,$$

where $V$ is the batch size, i.e., the number of samples in one training batch, and the subscript $(v)$ denotes the index of the $v$th training sample. The back-propagation learning algorithm adopted in this work is the adaptive moment estimation (ADAM) algorithm [25]. In the offline training stage, the network parameters $\Theta$ are updated by the ADAM algorithm to minimize the loss function on the training dataset. In the online testing stage, $\Theta$ is fixed, and the network directly outputs the estimates of the labels in the testing dataset with a rather small error.

Activation function: The activation functions, including leaky rectified linear units (LeakyReLU; we adopt LeakyReLU instead of the normal rectified linear units (ReLU) to avoid the "dead ReLU" phenomenon [31]), Sigmoid, and Tanh, apply element-wise nonlinear transformations to the outputs of the network layers. The functions LeakyReLU, Sigmoid, and Tanh can respectively be written as

$$\mathrm{LeakyReLU}(x) = \max(x, 0) + \rho \min(x, 0), \quad \mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}, \quad \mathrm{Tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},$$

where $\rho$ is a small positive slope.

Network layer: Fig. 2 depicts the structure of the network layers, including the dense, the convolution, and the LSTM layers. As shown in Fig. 2 (a), the dense layer can be mathematically expressed as $\mathbf{z} = f(\mathbf{W}\mathbf{x} + \mathbf{b})$, where $\mathbf{W}$ and $\mathbf{b}$ are the weight and the bias of the dense layer, respectively, and $f(\cdot)$ is the activation function. Compared with the dense layer, the convolution layer is more powerful in learning the spatial features of the inputs. As illustrated in Fig. 2 (b), the convolution layer can be mathematically expressed as $\mathbf{Z} = f(\mathbf{W} * \mathbf{X} + \mathbf{b})$, where $\mathbf{W}$ and $\mathbf{b}$ are the weight and the bias of the filter, respectively. Fig. 2 (c) depicts the structure of the LSTM layer, where each LSTM layer contains multiple LSTM units. In the $t$th ($1 \le t \le T$) LSTM unit, the relationships between the input and the output can be expressed with the following equations:
$$\mathbf{i}_t = \mathrm{Sigmoid}\left(\mathbf{W}_i [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i\right), \qquad (5a)$$
$$\mathbf{f}_t = \mathrm{Sigmoid}\left(\mathbf{W}_f [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f\right), \qquad (5b)$$
$$\mathbf{o}_t = \mathrm{Sigmoid}\left(\mathbf{W}_o [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o\right), \qquad (5c)$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathrm{Tanh}\left(\mathbf{W}_c [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_c\right), \qquad (5d)$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \mathrm{Tanh}(\mathbf{c}_t), \qquad (5e)$$

where $\mathbf{W}_{\{i,f,o,c\}}$ and $\mathbf{b}_{\{i,f,o,c\}}$ are respectively the weights and the biases of the LSTM unit, while $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ are respectively the input gate, the forget gate, and the output gate. Moreover, $\mathbf{c}_t$ is the cell state of the $t$th LSTM unit, and $\mathbf{h}_t$ is its output. Since the LSTM layer can effectively learn both the short-term and the long-term features through the memory cell and the gates, it has been recognized as a useful tool for time series related tasks.
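Eqs. (5a)-(5e) translate directly into a single LSTM step (a minimal sketch; the weight layout, with each gate weight acting on the concatenation of the previous output and the current input, is one common convention and an assumption here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM unit following Eqs. (5a)-(5e).

    W, b : dicts with keys 'i', 'f', 'o', 'c'; each W[k] has shape
           (n_hidden, n_hidden + n_input) and acts on [h_{t-1}; x_t]
    Returns the new output h_t and cell state c_t.
    """
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}; x_t]
    i = sigmoid(W['i'] @ z + b['i'])                     # (5a) input gate
    f = sigmoid(W['f'] @ z + b['f'])                     # (5b) forget gate
    o = sigmoid(W['o'] @ z + b['o'])                     # (5c) output gate
    c = f * c_prev + i * np.tanh(W['c'] @ z + b['c'])    # (5d) cell state
    h = o * np.tanh(c)                                   # (5e) output
    return h, c
```

An LSTM layer simply iterates this step over the $T$ time indices, carrying the output and cell state forward.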
IV-C Architecture Designs at the BS Side
Accurate downlink channels are crucial for the BS to obtain high beamforming gains. Here we consider the downlink channel prediction problem under two different scenarios, i.e., the feedback link is unavailable or available. Before coming to the specific architectures, we first present our main idea for designing fusion architectures as follows:


Design and train elementary networks, i.e., networks that adopt as few modalities as possible to independently predict downlink channels. In fact, all the modalities listed in Tab. I can independently predict downlink channels except the received signals and the pilots, which should be jointly utilized to obtain downlink channels. Note that the performance of the elementary networks can be used to measure the confidence levels of the corresponding modalities.

Design and train two-element based networks, i.e., networks that fuse two elementary networks. The performance of the two-element based networks can be used to measure the complementarity of the corresponding modality combinations. When we design fusion architectures with multiple modalities, we will preferentially fuse the modality combinations with better performance and then fuse the modalities with higher confidence levels, based on experiments and intuition [36, 32].
This idea is also applicable to the architecture designs at the user side, as will be shown in a later section.
IV-C1 Feedback link is unavailable
In this scenario, available modalities are the previous downlink channels , the user location , and the uplink channel . To investigate the confidence levels of the three modalities, we propose three networks, i.e., , , and to respectively predict the downlink channel based on the previous downlink channels, the user location, and the uplink channel. Fig. 3 (a) illustrates the network structure of . The input of is , where
and is the mapping between the complex and the real domains, i.e., . The label of is . The network is composed of several LSTM layers and one dense layer. Here we adopt the LSTM layer to predict the downlink channels for its superiority in time-series data analysis. Besides, we add the dense layer after the last LSTM layer to release the output of from the limited value range of the activation functions and , as indicated in Eq. (5e). Fig. 3 (b) shows the network structure of both and , where the network is composed of several dense layers, and each dense layer except for the output layer is followed by the LeakyReLU function. Note that and have the same network structure and the same label
, but they have different inputs and different hyperparameters, including the number of layers, the number of neurons in each layer, and the learning rates, etc.
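Since DNNs operate on real-valued tensors, the complex channel vectors must pass through a complex-to-real mapping before entering the network. A common choice, assumed here for illustration, stacks the real and imaginary parts:

```python
import numpy as np

def c2r(x):
    """Map a complex vector to a real vector by stacking [Re(x); Im(x)]."""
    return np.concatenate([x.real, x.imag])

def r2c(v):
    """Inverse mapping: rebuild the complex vector from the stacked halves."""
    n = v.size // 2
    return v[:n] + 1j * v[n:]

h = np.array([1 + 2j, 3 - 1j])   # toy channel vector
v = c2r(h)                       # real-valued network input of length 2M
```

The inverse mapping is applied to the network output to recover the predicted complex channel.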
To investigate the complementarities of the three modalities, i.e., , , and , we first propose , , and to respectively fuse two of the three modalities at the decision level. As shown in Fig. 3 (c), consists of , , and . The network , composed of several dense layers and LeakyReLU functions, concatenates the network outputs of and as its input vector. Note that the structures of and can be obtained similarly following the design of . Therefore, we omit the descriptions of these networks for simplicity. Then, we propose , , and to respectively fuse two of the three modalities at the feature level. As shown in Fig. 3 (d), the main difference between and is that concatenates the hidden layer outputs rather than the network outputs of and . Similarly, we omit the descriptions of and for simplicity. It should be explained that we do not consider data-level fusion for , , and since the three modalities have very different dimensions and data structures, which would result in inefficient data fusion.
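The distinction between decision-level and feature-level fusion reduces to which vectors are concatenated before being fed to the fusion network. A minimal NumPy sketch with hypothetical layer sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
leaky_relu = lambda x: np.maximum(x, 0.01 * x)

def tiny_net(x, W1, W2):
    """Two-layer dense net; returns (hidden features, network output)."""
    h = leaky_relu(W1 @ x)
    return h, W2 @ h

# Two elementary networks over different modalities (illustrative sizes).
x_a, x_b = rng.standard_normal(6), rng.standard_normal(4)
W1a, W2a = rng.standard_normal((8, 6)), rng.standard_normal((5, 8))
W1b, W2b = rng.standard_normal((8, 4)), rng.standard_normal((5, 8))
h_a, y_a = tiny_net(x_a, W1a, W2a)
h_b, y_b = tiny_net(x_b, W1b, W2b)

# Decision-level fusion: concatenate the network OUTPUTS (dim 5 + 5).
decision_in = np.concatenate([y_a, y_b])
# Feature-level fusion: concatenate the HIDDEN-layer outputs (dim 8 + 8).
feature_in = np.concatenate([h_a, h_b])
```

Either concatenated vector then becomes the input of the downstream fusion network; the feature-level variant exposes richer intermediate representations at the cost of a larger fusion input.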
Furthermore, we propose and to fuse all three modalities at the decision and the feature levels, respectively. As illustrated in Fig. 4 (a) and Fig. 4 (b), and are both composed of , , , and . The difference between and is that concatenates all the hidden layer outputs rather than the network outputs of , , and . Moreover, we propose and to fuse all three modalities at hybrid levels. As depicted in Fig. 4 (c), first uses to fuse the hidden layer outputs of and and then uses to fuse the network output of and the hidden layer output of . The only difference between and is that fuses the network outputs of both and while fuses the hidden layer outputs of both and . It should be mentioned that we choose to first fuse and at the feature level because outperforms the other proposed two-modality networks, as will be shown in the simulation section. This indicates that the fusion of and provides stronger complementarity and is therefore more suitable to be fused earlier. Note that the design and the testing for DML are not isolated but interoperable, which means that the testing results are needed to guide the network design. In other words, the excellent capability and flexibility of DML come at the cost of design complexity.
Remark 1: The channel prediction based on the three modalities, i.e., , , and , can also be referred to as channel extrapolation across the time, space, and frequency domains. The three-modality networks in Fig. 4 jointly exploit the complementarity of the time-space-frequency information to improve the performance of the channel extrapolation.
IV-C2 Feedback link is available
In this scenario, we need to investigate which modality is more efficient to feed back to the BS under a given feedback overhead. When the length of the vector to be fed back, denoted by , is greater than the number of BS antennas , it is obvious that we should directly feed back the downlink channel rather than the received signal . When is smaller than , we try various fusion networks for and respectively, and present the best-performing networks in the following.
We first consider the case where is fed back to the BS. Obviously, the length of the vector is . As shown in Fig. 5 (a), we propose to predict the unknown based on the known , i.e., to learn the mapping . The network structure of is the same as that of except that the input and the label of are and , respectively. As shown in Fig. 5 (b), we propose to fuse the network output of and the hidden layer output of . As presented in Fig. 5 (c), we propose to fuse the network output of and the hidden layer output of . As illustrated in Fig. 5 (d), we propose to fuse the network outputs of both and .
Then, we consider the case where is fed back to the BS. Since the length of the feedback vector is smaller than , i.e., , the LS estimator is not applicable due to rank deficiency. However, it is feasible for DNNs to learn the mapping from to , and thus we propose to predict based on . As shown in Fig. 6 (a), is the input data of the first dense layer while and are concatenated along a new axis as the input data of the first convolution layer. Each convolution layer is followed by the LeakyReLU and the average pooling functions. The average pooling functions downsample the data stream and help avoid overfitting [41]. After reshaping the output of the last convolution layer, we use to fuse the two data streams from the modalities and . The label of is . Moreover, we propose to fuse the network output of with the hidden layer outputs of both and , as depicted in Fig. 6 (b). We also propose to fuse the hidden layer output of with the network outputs of both and , as shown in Fig. 6 (c). As illustrated in Fig. 6 (d), we propose to fuse the network outputs of both and .
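The rank-deficiency argument can be checked numerically: with fewer fed-back observations than BS antennas, the normal equations are underdetermined, and LS can only return one of infinitely many solutions consistent with the observations. A sketch with illustrative dimensions and a generic random observation matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N_f = 64, 32                      # BS antennas vs. feedback length, N_f < M
h_true = rng.standard_normal(M)      # unknown downlink channel (real for brevity)
A = rng.standard_normal((N_f, M))    # generic observation/compression matrix
y = A @ h_true                       # fed-back observations

# The Gram matrix A^T A has rank at most N_f < M, so plain LS cannot
# identify h_true; lstsq returns only the minimum-norm solution.
rank = np.linalg.matrix_rank(A.T @ A)
h_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The recovered `h_ls` reproduces the observations exactly yet differs from `h_true`, which is why a learned mapping is needed in this regime.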
Remark 2: All the networks proposed in Section IV-C can be easily extended to other problems such as beam prediction and antenna selection. Specifically, by replacing the labels of all these networks with the optimal beam vectors, the proposed architectures can handle beam prediction at the BS side. Besides, the proposed architectures can deal with antenna selection by replacing the labels of all these networks with the optimal selection vectors. It is worth mentioning that the variant architectures for antenna selection do not require perfect downlink channels, which can significantly reduce the cost resulting from downlink channel prediction.
IV-D Architecture Designs at the User Side
We consider three different scenarios for downlink channel prediction at the user side, i.e., pilots being unavailable, insufficient or sufficient.
IV-D1 Pilots are unavailable
In this scenario, available modalities are the previous downlink channels and the user location . As described in Section IV-C1, we can use , , and to predict the downlink channels.
IV-D2 Pilots are insufficient
In this scenario, available modalities are , , , and . As described in Section IV-C2, we can use , and to predict the downlink channels.
IV-D3 Pilots are sufficient
When pilots are sufficient, the LS estimator can be used to estimate the downlink channel. Inspired by [53] and [15], we propose , consisting of several dense layers and LeakyReLU functions, to predict based on the LS estimate of the downlink channel , as illustrated in Fig. 6 (e). The input and label of are and , respectively. It should be emphasized that even when LS estimates have been obtained, the available modalities, i.e., and , can still be adopted to enhance the accuracy. Moreover, we propose , as shown in Fig. 6 (f), where the network input of and the network output of are fused by . We also propose to fuse the network output of and the hidden layer output of , as displayed in Fig. 6 (g).
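For reference, when the pilot length is at least the number of antennas, the LS estimate is well defined. The sketch below assumes a generic linear pilot model y = P h + n with an orthogonal DFT-based pilot matrix and the 25 dB SNR used later in the simulations; it is an illustration, not the paper's exact signal model:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 64                                   # BS antennas (unknowns)
L = 64                                   # pilot length, L >= M, so LS is well posed
# Orthogonal DFT pilot matrix: P^H P = L * I, hence well conditioned.
k, m = np.meshgrid(np.arange(L), np.arange(M), indexing="ij")
P = np.exp(-2j * np.pi * k * m / L)
h = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
snr_lin = 10 ** (25 / 10)                # 25 dB
noise = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * snr_lin)
y = P @ h + noise                        # received pilot observations

# LS estimate: h_ls = (P^H P)^{-1} P^H y.
h_ls = np.linalg.solve(P.conj().T @ P, P.conj().T @ y)
nmse = np.sum(np.abs(h_ls - h) ** 2) / np.sum(np.abs(h) ** 2)
```

With sufficient orthogonal pilots, the LS estimate is accurate, which is why the networks in this scenario start from rather than from raw received signals.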
Remark 3: The networks , , and can be easily extended to data detection. One simple approach, inspired by [55], is to first divide the transmitted signals into pilots and data signals. Then, the pilots are fed to the network as depicted in Fig. 6 while the data signals are adopted as the training labels. In this way, we do not need to collect the downlink channels as training labels, which significantly reduces the cost of label collection.
IV-E Training Steps
The detailed training steps of all the proposed fusion networks are given as follows:

Train the elementary networks, e.g., , , , , and (note that is trained in an end-to-end manner), independently to minimize the loss between each network's output and its label until the loss functions converge, and then fix these network parameters;

Train to minimize the loss between its output and the label until its loss function converges, and then fix its network parameters;

Following step 2), train and successively until their loss functions converge, and then fix their network parameters successively.
Obviously, the time and computation costs increase as the number of modalities increases. The balance between the cost and the performance should be considered in practical applications.
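The freeze-then-fuse procedure above can be illustrated with a toy linear example: two elementary predictors are trained and frozen (step 1), and only the fusion weights are then trained on their fixed outputs (steps 2 and 3). All dimensions and data here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Two synthetic "modalities", each linearly related to the target.
x1 = rng.standard_normal((n, 3))
x2 = rng.standard_normal((n, 3))
y = (x1 @ np.array([1.0, -2.0, 0.5]) + x2 @ np.array([0.3, 0.7, -1.0])
     + 0.01 * rng.standard_normal(n))

def train_linear(X, t, steps=500, lr=0.1):
    """Plain gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - t) / len(t)
    return w

# Step 1: train each elementary predictor independently, then freeze it.
w1 = train_linear(x1, y)
w2 = train_linear(x2, y)
out1, out2 = x1 @ w1, x2 @ w2            # frozen elementary outputs

# Steps 2-3: train only the fusion weights on the frozen outputs.
F = np.stack([out1, out2], axis=1)
w_fuse = train_linear(F, y)

mse1 = np.mean((out1 - y) ** 2)
mse2 = np.mean((out2 - y) ** 2)
fused_mse = np.mean((F @ w_fuse - y) ** 2)
```

Because the elementary parameters stay fixed, each later stage only optimizes the small fusion module, mirroring the cost-performance tradeoff noted above.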
V Simulation Results
In this section, we will first present the simulation scenario and default network parameters. Then, the performance of the proposed networks will be evaluated and analyzed.
V-A Simulation Setup
Dataset Generation: In the simulations, we consider an outdoor massive MIMO scenario constructed with the accurate 3D ray-tracing simulator Wireless InSite [37]. Unlike conventional statistical channel generation methods, the 3D ray-tracing simulator captures the dependence of channels on the environment geometry/materials and transmitter/receiver locations, and therefore provides more reliable datasets for training and testing. The scenario comprises one BS and a large number of randomly distributed users, and the BS is equipped with 64 antennas. The scenario covers an area of square metres. A partial view of the ray-tracing scenario is illustrated in Fig. 7. The uplink and downlink frequencies are set to 2.50 GHz and 2.62 GHz, respectively. Based on the environment setup, the 3D ray-tracing simulator outputs the uplink channel parameters, the downlink channel parameters, and the location of each user. With these outputs, we can construct the training and testing datasets of all the modalities. Specifically, we can obtain and for each user by using Eq. (2) and the channel parameters from the 3D ray-tracing simulator. With Eq. (1), we can generate the pilots and the received signals based on . We assume that the previous downlink channels are the channels of the user at adjacent positions and that the users move along the y-axis. Then, can be obtained by collecting the channels at adjacent positions. The partial downlink channel can be obtained by selecting elements out of , and the remaining elements constitute the vector . After obtaining all sample pairs, we randomly select 9000 samples as the training dataset, and select 1000 samples from the remaining sample pairs as the testing dataset (for more details about how to generate channels using Wireless InSite, please refer to the paper [1] and the codes [10]).
Since perfect channels are not available in practical situations, unless otherwise specified, all the sample pairs in the datasets are estimated by the LMMSE algorithm [9] when the signal-to-noise ratio (SNR) is 25 dB and the pilot length is 64.
Adopted Neural Networks: Unless otherwise specified, the parameters of the proposed networks are given in Tab. II, where “LSTM: 256, 256” means that the hidden layers in consist of two LSTM layers, each with 256 units. The numbers of units in the input and the output layers of all the proposed networks are consistent with the lengths of the input and the output data vectors, and are thus omitted in Tab. II. We choose the output of the middle hidden layer as the hidden layer output of the networks. The batch size of all proposed networks is 128. Let and represent the estimated and the true downlink channels, respectively (the notation denotes the inverse mapping of , which is given as ). The normalized mean-squared error (NMSE), defined as , is used to measure the prediction accuracy.
Network | Structure parameter | Learning rate
,  | LSTM: 256, 256 | 5e-4
 | Dense: 256, 256, 256, 256, 256 | 1e-4
 | Dense: 256, 256, 256 | 1e-3
 | Dense: 256, 256, 256 | 5e-4
 |  | 5e-4
 | Dense: 256 | 5e-4
 | Dense: 256, 256 | 5e-4
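The NMSE metric used throughout the simulations is a straightforward computation; a minimal implementation of the standard definition:

```python
import numpy as np

def nmse_db(h_hat, h):
    """NMSE between estimated and true channels, reported in dB."""
    err = np.sum(np.abs(h_hat - h) ** 2)
    return 10 * np.log10(err / np.sum(np.abs(h) ** 2))

h = np.array([1 + 1j, 2 - 1j, 0.5j])       # toy true channel
# A uniform 10% amplitude error gives 10*log10(0.1**2) = -20 dB.
print(round(nmse_db(1.1 * h, h), 2))        # -20.0
```

Lower (more negative) NMSE values in the figures below thus correspond to more accurate channel predictions.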
V-B BS Side
Fig. 8 displays the NMSE performance of the previous-downlink-channel related networks versus . A larger means that would learn the time correlation of downlink channels from previous downlink channels over longer time periods. It can be observed that the performance of all these networks first improves and then degrades as increases, which indicates that the channels within 3 time periods contribute positively to the downlink channel prediction while the channels beyond 3 time periods have a negative impact on it. Furthermore, always performs worse than the other fusion networks, and outperforms the other networks regardless of the value of in the considered setup. This implies that the modalities and do provide complementary information for . We set to 3 in the following simulations for better performance.
Fig. 9 shows the NMSE performance of all networks that are applicable to the BS without a feedback link, as discussed in Section IV-C1. As shown in Fig. 9, all the two-modality fusion networks outperform the corresponding single-modality networks, which implies that any two of the three modalities, i.e., , , and , can provide complementary information, thus enhancing the prediction accuracy. In particular, although performs worse than both and , the four two-modality fusion networks, i.e., , , , and , all perform better than both and . Besides, we notice that has the best performance among the two-modality fusion networks, and outperforms the other three-modality fusion networks. In fact, the structure of is inspired by that of . More specifically, since outperforms the other two-modality fusion networks, we choose to preferentially fuse and at the feature level.
Fig. 10 compares the NMSE performance of all the networks that are applicable to the BS with a feedback link, as discussed in Section IV-C2. As shown in Fig. 10, the performance of all the proposed networks improves as the feedback length increases. As indicated in the first enlarged view, it is better to feed the partial downlink channel back to the BS when is greater than 48; otherwise it is better to feed the received signal back. Furthermore, it can be observed from the second enlarged view that and consistently outperform while the gaps between the three networks shrink as increases. This indicates that when we choose to feed the partial downlink channel back, i.e., , we can adopt instead of the related fusion networks to reduce the training cost, since the gaps between them are negligible. Moreover, as shown in the third enlarged view, consistently outperforms , , and , while the gap between and becomes negligible when is larger than 36. This indicates that we can adopt for better prediction accuracy when is smaller than 36 and adopt for lower training cost when is between 36 and 48.
V-C User Side
Fig. 12 displays the NMSE performance of LS, , , , and versus the pilot length , where the SNR is 30 dB. As shown in Fig. 12, performs worse than , which means cannot provide complementary information for when the SNR is high and the number of pilots is sufficient, i.e., . In other words, when the number of pilots is sufficient and the SNR is high, the modalities have the highest confidence level and the other modalities can hardly provide complementary information to improve the performance. Moreover, the LS estimator outperforms when is greater than 80, which implies that model-based methods gain more advantages over network-based methods as increases. Furthermore, outperforms when is smaller than 72, which means the networks can learn extra features from to improve the performance when is smaller than 72, whereas when is greater than 72, provides redundant information to the network and results in worse performance. To obtain better performance at the user side, we can choose when is in and choose when is greater than 72.
V-D Impairments in Practical Applications
To collect offline training samples, we can obtain extremely accurate channels by increasing the SNR and the pilot length. However, in the online testing stage, low SNRs would impair the prediction accuracy of the proposed networks. Therefore, we investigate the impact of various SNRs on the performance of LS, , , , , , , and , where is 64. Fig. 12 shows the performance of these networks versus the SNR in the online testing stage. Notice that the performance of , , and becomes saturated when the SNR is higher than 15 dB, which means that the estimation errors of the input channels no longer impact the performance of the three networks when the SNR is higher than 15 dB. As indicated in Fig. 12, outperforms all other networks when the SNR is lower than 17 dB while outperforms when the SNR is higher than 17 dB. This is because the estimation based on pilots and received signals highly relies on the SNR while the prediction based on