Deep Multimodal Learning: Merging Sensory Data for Massive MIMO Channel Prediction

Existing work in intelligent communications has recently made preliminary attempts to utilize multi-source sensing information (MSI) to improve system performance. However, research on MSI aided intelligent communications has not yet explored how to integrate and fuse multimodal sensory data, which motivates us to develop a systematic framework for wireless communications based on deep multimodal learning (DML). In this paper, we first present complete descriptions and heuristic understandings of the framework of DML based wireless communications, where the core design choices are analyzed from the viewpoint of communications. Then, we develop several DML based architectures for channel prediction in massive multiple-input multiple-output (MIMO) systems that leverage various modality combinations and fusion levels. The case study of massive MIMO channel prediction offers an important example that can be followed in developing other DML based communication technologies. Simulation results demonstrate that the proposed DML framework can effectively exploit the constructive and complementary information of multimodal sensory data in various wireless communication scenarios.








I Introduction

To satisfy the demands of explosive wireless applications, e.g., diverse intelligent terminal access, autonomous driving, and the Internet of Things, the new generation of wireless communication systems is expected to handle massive data and to meet the requirements of both high reliability and low latency. However, existing communication systems, which are basically designed based on conventional communication theories, exhibit several inherent limitations in meeting these requirements, such as relying on accurate theoretical models, suffering from high-complexity algorithms, and being restricted to block-structure communication protocols [46]. Recently, intelligent communication has been recognized as a promising direction in future wireless communications. As a major branch of machine learning, deep learning (DL) has been applied in physical layer communications as a potential solution to deal with the massive data and the high complexity of wireless communication systems [17, 35]. By merging DL into existing communication systems, remarkable progress has been made in various applications such as channel estimation [53, 30, 18, 12, 54, 3, 52, 15], data detection [55, 19], channel feedback [47, 45, 16], beamforming [29, 2, 51, 4], and hybrid precoding [28].

Compared with conventional communications based on statistics and information theories, DL based communications benefit from both the excellent learning capability of deep neural networks (DNNs) and the impressive computational throughput of parallel processing architectures. Moreover, DL does not require tractable mathematical models or high-complexity computational operations. Among these superiorities, the most advantageous aspect of DL is its ability to handle problems with imperfect or absent mathematical models. Interestingly, there exists out-of-band side-information in communication systems that can be utilized to improve the system performance, including sub-6 GHz channels, user positions, and 3D scene images obtained by cameras. It should be noted that conventional communication techniques can hardly take advantage of such out-of-band side-information due to the lack of tractable mathematical models. In fact, developing techniques that utilize out-of-band side-information to improve system performance has become an emerging research trend in DL based communications [29, 2, 51]. For example, [2] proposed a sub-6 GHz channel information aided network for mmWave beam and blockage prediction, which could effectively reduce the overheads of both feedback and beam training. In [51], a novel 3D scene based beam selection architecture was developed for mmWave communications by using the surrounding 3D scene of the cellular coverage as the input of the networks.

Meanwhile, model aided DL has also made much progress recently. Instead of purely relying on training data, model aided DL benefits from the guidance of model information and can therefore achieve better performance [17]. For instance, [15] proposed an efficient DL based channel estimator by learning the linear model between the least squares (LS) estimate and the linear minimum mean square error (LMMSE) estimate. In [19], the authors proposed a model-driven DL based multiple-input multiple-output (MIMO) detector by unfolding an iterative algorithm, which can significantly outperform the corresponding traditional iterative detector.

Although a few DL based works have tried to utilize multi-source sensing information (MSI), e.g., out-of-band side-information and model information, to improve the system performance, none of them has yet investigated how to integrate and comprehensively utilize such MSI in communication systems. From the viewpoint of machine learning, data from various sources are referred to as multimodal sensory data, whereas data from one source are referred to as data of a single modality. Communication systems naturally work with multimodal data, and this clear advantage should not be squandered. Multimodal learning aims to build models that can fully exploit the constructive and complementary information lying in multimodal data, thus gaining performance advantages over methods that only use data of a single modality [36]. By combining DL architectures with multimodal learning methodologies, the concept of deep multimodal learning (DML) was proposed in [33]. Thanks to the excellent flexibility of DL in extracting hierarchical features of data, DML offers several advantages over conventional multimodal learning, such as learning based feature extraction, implicit dimensionality reduction, and easy scalability in the number of modalities [36]. The mainstream applications of DML include human action recognition, audio-visual speaker detection, and autonomous driving [33, 36, 14]. For example, [14] jointly exploited two modalities, i.e., image and optical flow, for human action recognition, which could obtain higher recognition accuracy than using image data alone.

This paper aims to develop a systematic framework for DML based wireless communications. By using DML, the multimodal sensory data available in wireless communication systems can be fully exploited to provide constructive and complementary information for various tasks. The main contributions of this work can be summarized as follows:

  • We provide complete descriptions and analyses of the framework of DML. As opposed to [36, 33], which mainly studied DML in the computer vision, speech, and natural language processing areas, this is, to the best of the authors' knowledge, the first work that explores how DML technologies can be applied to wireless communications, and we also provide some heuristic understandings.

  • By investigating various modality combinations and fusion levels, we design several DML based architectures for channel prediction in massive MIMO systems as a case study. The design process presents beneficial guidance on developing DML based communication technologies, while the proposed architectures can be easily extended to other communication problems such as beam prediction, channel feedback, and resource allocation.

  • Simulations based on ray-tracing software have been conducted and demonstrate that the proposed framework can effectively exploit the constructive and complementary information of multimodal sensory data in various scenarios.

The remainder of this paper is organized as follows. The motivation for exploring DML based wireless communications is given in Section II. The design choices of DML are presented in Section III. As a case study, Section IV proposes several DML based architectures for channel prediction in massive MIMO systems. Numerical results are provided in Section V, followed by our main conclusions in Section VI.


The bold letters denote vectors or matrices. The notation $|\mathbf{a}|$ denotes the length of the vector $\mathbf{a}$. The notations $(\cdot)^T$ and $(\cdot)^H$ respectively denote the transpose and the conjugate transpose of a matrix or a vector. The notation $\mathbb{C}^{M}$ represents the $M$-dimensional complex vector space. The notation $\|\cdot\|$ denotes the norm of a vector. The notation $\circ$ represents the composite mapping operation. The notations $\mathrm{Re}\{\cdot\}$ and $\mathrm{Im}\{\cdot\}$, respectively, denote the real and imaginary parts of matrices, vectors, or scalars. The notations $*$ and $\odot$ respectively represent the convolution operation and the matrix element-wise product. The notation $\mathbb{E}\{\cdot\}$ represents the expectation with respect to all random variables within the brackets.

II DML for Wireless Communications

MSI in communication systems, including out-of-band side-information, model information, and other system information, is collectively referred to as multimodal data, while information from one source is referred to as one modality.

Multimodal sensory data of communication systems typically have varying confidence levels when accomplishing different tasks. (The confidence level of one modality refers to the degree of contribution or reliability offered by that modality towards a certain task [5].) Take beam selection at the base station (BS) as an example; the optimal beam vector can be obtained in three ways: (1) finding the best beam based on known downlink channels [56], (2) learning from 3D scene images [51, 4], and (3) extrapolating from sub-6 GHz channels [2]. Among the three modalities, the downlink channels obviously have a higher confidence level than both the 3D scene images and the sub-6 GHz channels, while the confidence levels of the 3D scene images and the sub-6 GHz channels depend on the specific scenario. Nevertheless, even when we have access to the modality with the highest confidence level, there may exist modalities that provide complementary information to further improve the performance or robustness of single-modality based methods [34]. To better understand this, we can refer to maximum ratio combining (MRC) [42], a widely adopted technology for obtaining combining gains in multi-antenna communication systems. MRC can benefit even from the antennas with worse channels, which provides a revelatory understanding of the gains brought by modalities with relatively lower confidence levels.
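The MRC analogy can be made concrete with a small numpy sketch (illustrative only, not part of the paper's simulations): combining with conjugate-channel weights accumulates signal power from every antenna, including the weak ones.

```python
import numpy as np

def mrc_combine(h, y):
    """Maximum ratio combining: weight each antenna by its conjugate channel."""
    w = np.conj(h)            # MRC weights (up to a common scale factor)
    return w @ y              # combined scalar observation

rng = np.random.default_rng(0)
h = rng.standard_normal(4) + 1j * rng.standard_normal(4)   # 4-antenna channel
s = 1.0 + 0.0j                                             # transmitted symbol
y = h * s                                                  # noiseless received vector

# Every branch contributes |h_i|^2 to the combined signal, even weak ones.
z = mrc_combine(h, y)
```

Here the combined output equals $s \sum_i |h_i|^2$, so each additional branch, however weak, adds to the useful signal, which mirrors the gain from low-confidence modalities.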

Meanwhile, multimodal sensory data usually have different dimensionality and structures. For example, in massive MIMO systems, the user position data could be a real-valued vector, the received signals could be a much longer complex-valued vector, and the transmitted signals could be a high-dimensional complex-valued matrix. Therefore, it is important to design architectures that could fuse these modalities efficiently.

Little existing work has investigated how to integrate and fuse multimodal sensory data in wireless communication problems. Motivated by this, we aim to develop a systematic framework for DML based communications and illustrate the methodology by investigating DML based channel prediction in massive MIMO systems.

III Design Choices in DML

The framework of DML consists of three core parts: selection, extraction and fusion. The selection process is to select appropriate models as well as effective modalities for a certain task. The extraction process is to extract information from involved modalities. The fusion process is to fuse the extracted information in order to obtain a fused representation of the multimodal sensory data. To embrace the performance advantages offered by DML, there are several design choices to be considered, including the selections of models, modalities, fusion levels, and fusion strategies, as will be illustrated in the following.

III-A Model Selection

DL models can generally be divided into two categories: discriminative and generative models. Discriminative models aim to learn the mapping function from the inputs to the outputs, and are typically used to solve regression and classification tasks. In other words, given the input $\mathbf{x}$ and the label $\mathbf{y}$, discriminative models learn the conditional probability $P(\mathbf{y}|\mathbf{x})$ by updating the model parameters. Since the majority of tasks in physical layer communications are to estimate $\mathbf{y}$ based on $\mathbf{x}$, such as channel estimation [53, 18, 12], data detection [55, 23, 19], and beamforming [2, 51, 4], existing DL techniques for physical layer communications mainly adopt discriminative models.

Generative models aim to learn the training data distribution, which is required to generate new data with similar distributions. More specifically, generative models learn the joint probability $P(\mathbf{x}, \mathbf{y})$ in supervised learning problems, or learn the input data probability $P(\mathbf{x})$ in unsupervised or self-supervised learning problems. For DML problems, generative models are useful in the following three aspects: (1) extracting features from different modalities, which are then used to perform discriminative tasks, i.e., regression and classification tasks; (2) dealing with missing modalities during the test stage or a lack of labeled data [39, 22]; (3) providing good initialization points for discriminative models, such as DNNs [20].

III-B Modality Selection

Different modalities may provide complementary information and have varying confidence levels for a certain multimodal learning task. Nevertheless, too many modalities may lead to information redundancy and an excessively complex fusion process. Therefore, it is worthwhile to select the optimal modalities by comprehensively considering the performance gain and the fusion complexity. In the areas of computer vision and speech, the modality selection problem is generally treated as a tradeoff optimization [48, 6, 27]. For example, [48] proposed to select the optimal modalities based on the tradeoff between the feature dimensionality and the modality correlations. However, the authors did not take the confidence levels of the modalities into account, which may miss modalities with high confidence levels. In [6], the authors utilized a dynamic programming approach to find the optimal subset of modalities based on the three-fold tradeoff between the performance gain, the overall confidence level of the selected subset, and the cost of the selected subset. In summary, [48, 24, 6, 27] provided heuristic solutions for the modality selection problem in the context of multimedia data, while there is hardly any literature studying modality selection for communication systems. A more direct way, adopted in most existing DML works, is to manually select modalities by intuition and experiments.

III-C Fusion Level Selection

Fig. 1: Illustrations of various fusion levels for DML.  (a) data fusion. (b) feature fusion. (c) decision fusion. (d) hybrid fusion.

In general, we can perform the modality fusion at four levels: data fusion, feature fusion, decision fusion, and hybrid fusion.

III-C1 Data fusion

Data fusion concatenates the raw or preprocessed data of all the modalities into a single vector and then learns a joint multimodal representation based on the concatenated vector, as shown in Fig. 1 (a). Data fusion is simple to design and allows end-to-end training. A typical example of data fusion is [53], where the received signals, the transmitted pilots, the previous channels, and the LS estimates are directly concatenated as the inputs of the networks to estimate channels in doubly selective fading scenarios. However, data fusion ignores the unique structure of different modalities, which may make it difficult to learn the complementary information among the modalities. In addition, simple concatenation of the multimodal sensory data leads to high-dimensional input vectors that may contain redundancies, especially when the number of modalities is large.
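As a toy illustration of data-level fusion (the modality shapes below are hypothetical, not those of the paper's networks), complex-valued modalities can be flattened into real vectors and concatenated into one network input:

```python
import numpy as np

def to_real(x):
    """Stack real and imaginary parts so complex data fits a real-valued network."""
    x = np.asarray(x)
    if np.iscomplexobj(x):
        return np.concatenate([x.real.ravel(), x.imag.ravel()])
    return x.ravel().astype(float)

def data_fusion(*modalities):
    """Data-level fusion: concatenate all (preprocessed) modalities into one vector."""
    return np.concatenate([to_real(m) for m in modalities])

received = np.array([1 + 2j, 3 - 1j])        # received signals (complex vector)
pilots = np.array([[1j, 0], [0, 1j]])        # transmitted pilots (complex matrix)
position = np.array([10.0, 25.0, 1.5])       # user 3D position (real vector)

# 2*2 + 2*4 + 3 = 15 real input features for the joint network
x = data_fusion(received, pilots, position)
```

The example also shows the stated drawback: the input dimension grows linearly with every added modality, regardless of redundancy.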

III-C2 Feature fusion

Before we introduce feature fusion, we first explain how to extract features from the raw or preprocessed data of one modality. The transformation from the raw or preprocessed data to features is referred to as "feature extraction". Feature extraction algorithms are either generative or discriminative, linear or nonlinear, such as principal component analysis [11], linear discriminative analysis [38], and Laplacian eigenmaps [7]. In recent years, DNNs have been recognized as a popular technique for fusing modalities due to their excellent power and flexibility in extracting hierarchical features of the data. Specifically, each hidden layer of a network represents hierarchical features of the inputs. By changing the number of layers or choosing a proper architecture, DNNs can extract features at various levels or with various dimensions. For example, [47] proposed a deep autoencoder architecture for channel feedback, where the dimension of the learnt compressed vector, i.e., the extracted feature that is used to reconstruct the original channel, can be adjusted according to the given compression ratio.

Now, we discuss feature fusion. As illustrated in Fig. 1 (b), feature fusion fuses higher-level features into a single hidden layer and then learns a joint multimodal representation for the output. By utilizing the extracted higher-level features, a model with feature fusion can learn higher-order correlations across modalities. Moreover, thanks to the flexibility of feature dimension reduction offered by DNNs, the feature fusion strategy may have more advantages than the data fusion strategy in learning multimodal representations [36].
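A minimal sketch of feature-level fusion (the encoder weights and sizes below are arbitrary placeholders): each modality passes through its own small encoder before the features are concatenated, so raw inputs of very different sizes meet at a common feature dimension.

```python
import numpy as np

def encoder(x, W, b):
    """A one-layer feature extractor for a single modality (tanh features)."""
    return np.tanh(W @ x + b)

def feature_fusion(features):
    """Feature-level fusion: concatenate per-modality features into one layer."""
    return np.concatenate(features)

rng = np.random.default_rng(4)
x_sig = rng.standard_normal(16)    # e.g. flattened received signals
x_pos = rng.standard_normal(3)     # e.g. user 3D position

# Each modality gets its own encoder and a chosen feature size (8 here),
# so the fused vector stays small even if the raw inputs differ widely in size.
f_sig = encoder(x_sig, rng.standard_normal((8, 16)) * 0.1, np.zeros(8))
f_pos = encoder(x_pos, rng.standard_normal((8, 3)) * 0.1, np.zeros(8))

fused = feature_fusion([f_sig, f_pos])
```

In a trained model, the joint layers on top of `fused` can learn cross-modality correlations that raw concatenation makes hard to find.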

III-C3 Decision fusion

Before we introduce decision fusion, we first explain how to acquire a decision for one modality. The process of obtaining task results based on the modal data is referred to as “decision acquirement”. The decision acquirement can be realized by either DL based algorithms or conventional communication algorithms.

As shown in Fig. 1 (c), the decisions that are independently acquired from the involved modalities are fused to make the final decision, i.e., the output of the model. The disadvantage of decision fusion is that it cannot exploit the feature-level correlations among modalities. Nevertheless, the decision fusion strategy has several advantages over the feature fusion strategy:

  • When the involved modalities are completely uncorrelated or have very different dimensionality, it is much simpler and more reasonable to adopt decision fusion.

  • Decision fusion makes it possible to adopt the most suitable algorithm to make the decision for each modality. In particular, for modalities whose decisions can be acquired from accurate mathematical models, algorithms based on conventional communication theories would be more suitable than DL based algorithms.

  • The fusion task is easier to implement since the decisions of different modalities usually have similar data representations.
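A minimal numpy sketch of decision-level fusion (the estimates and confidence weights below are illustrative): per-modality decisions of identical shape can be fused by the fixed "average" rule or by a linear weighting that reflects per-modality confidence levels.

```python
import numpy as np

def decision_fusion(decisions, weights=None):
    """Fuse per-modality decisions (here: channel estimates of equal shape).

    With no weights this is the fixed "average" rule; otherwise a linear
    weighted combination reflecting per-modality confidence levels."""
    decisions = np.stack(decisions)
    if weights is None:
        weights = np.full(len(decisions), 1.0 / len(decisions))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize the confidences
    return np.tensordot(weights, decisions, axes=1)

h_true = np.array([1.0, -0.5, 0.25])
est_a = h_true + 0.10      # decision from a high-confidence modality
est_b = h_true - 0.40      # decision from a low-confidence modality

# Weighting toward the more reliable modality beats the weak decision alone.
fused = decision_fusion([est_a, est_b], weights=[0.7, 0.3])
```

Because the two decisions share one representation (a length-3 channel estimate), the fusion step itself is trivial, which is exactly the last advantage listed above.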

III-C4 Hybrid fusion

To embrace the merits of both the feature and the decision fusion strategies, hybrid fusion combines feature and decision fusion in a unified framework. Fig. 1 (d) displays an example of hybrid fusion where the decisions and features of three modalities are fused at two different depths of the model. It should be emphasized that the decisions or features of multiple modalities can either be fused into a single layer or be fused gradually, i.e., modalities can be fused at different depths of the model. The choice of at what depth to fuse which modalities is based on intuition and experiments. Take channel estimation as an example. Given the three modalities, i.e., the pilots, the received signals, and the user position, we usually choose to first fuse the pilots and the received signals and then fuse the user position, because the pilots and the received signals are highly correlated, and the corresponding fusion should work well based on conventional communication theories. Besides, the gradual fusion strategy can also avoid an overly large fusion vector, which partially alleviates the curse of dimensionality. (The term "curse of dimensionality" was first proposed in [8]; it refers to the phenomenon that when the data dimensionality increases, the dimension of the feature space grows so fast that the available data become sparse and dissimilar in many ways. In this case, the amount of data required to support the analysis often grows exponentially with the dimensionality.)

It should be mentioned that the fusion level selection depends on the specific problem; therefore, the superiority of a fusion level strategy should be investigated in the context of a specific problem rather than in an absolute sense.

III-D Fusion Strategy Selection

Various methods can be used to fuse different modalities, among which fixed-rule based fusion is the simplest, including the "max", "min", "average", and "majority voting" rules (see more rules in [26]). Besides, linear weighting is also a common fusion strategy, where the features or decisions of different modalities are combined with linear weights. One successful application of linear weighting is MRC, where the weights can be directly determined by the channels. However, linear weighted modality fusion is not as simple as MRC. The greatest challenge lies in the determination of the weights for each modality, especially when the data dimension is high. To solve this problem, DNN based fusion has been proposed and has gained growing attention in recent years [33, 14]. DNN based fusion can learn a nonlinear weighted mapping from the input to the output, and the weights can be adjusted by training with pre-acquired datasets instead of manual selection.

IV Case Study: DML for Massive MIMO Channel Prediction

In this section, we will first present the available modalities for channel prediction in massive MIMO systems. We will also give brief descriptions of the involved network architectures. Then, we will respectively discuss the architecture designs for the BS and the user, followed by detailed training steps of the proposed networks.

IV-A Available Modalities for Channel Prediction

Acquiring channel knowledge plays a critical role in massive MIMO, which is a promising technology for future wireless communication systems mainly due to its high power efficiency and spectrum efficiency [44, 50]. In this work, we consider a massive MIMO system where a BS equipped with $M$ antennas in the form of a uniform linear array (ULA) serves multiple single-antenna users. (We adopt the ULA model here for simpler illustration; nevertheless, the proposed approaches are not restricted to a specific array shape, and are therefore applicable to arrays with arbitrary geometry.) Note that the proposed approaches are applicable to uplink/downlink channel prediction in both TDD and FDD systems, while we take the downlink channel prediction in an FDD massive MIMO system as a typical example to illustrate the design and application of DML based channel prediction.

System model: To discuss the available modalities in FDD massive MIMO systems, we first present the mathematical model for the downlink transmission. Denote $T$ as the pilot length. The received frequency domain signal of the $u$-th user on the $k$-th subcarrier is

$$\mathbf{y}_{u,k} = \mathbf{X}_k \mathbf{h}_{u,k} + \mathbf{n}_{u,k}, \qquad (1)$$

where $\mathbf{y}_{u,k} \in \mathbb{C}^{T}$ is the received signal, $\mathbf{X}_k \in \mathbb{C}^{T \times M}$ is the downlink pilot signal, and $\mathbf{n}_{u,k} \in \mathbb{C}^{T}$ is the additive white Gaussian noise. Moreover, $\mathbf{h}_{u,k} \in \mathbb{C}^{M}$ is the downlink channel that can be written as [1]

$$\mathbf{h}_{u,k} = \sum_{l=1}^{L} \alpha_l e^{j\phi_l} e^{-j 2\pi f_k \tau_l} \mathbf{a}(\theta_l), \qquad (2)$$

where $L$ is the path number, $f_k$ is the frequency of the $k$-th downlink subcarrier, while $\alpha_l$, $\phi_l$, and $\tau_l$ are the attenuation, phase shift, and delay of the $l$-th path, respectively. In addition, $\mathbf{a}(\theta_l)$ is the array manifold vector defined as

$$\mathbf{a}(\theta_l) = \left[1, e^{-j\frac{2\pi d}{\lambda_k}\Omega_l}, \ldots, e^{-j\frac{2\pi d}{\lambda_k}(M-1)\Omega_l}\right]^T, \qquad (3)$$

where $\lambda_k = c/f_k$, $d$ is the antenna spacing, $c$ is the speed of light, $\theta_l = (\theta_l^{\mathrm{az}}, \theta_l^{\mathrm{el}})$ is the {azimuth, elevation} angle of arrival of the $l$-th path, and $\Omega_l = \cos\theta_l^{\mathrm{el}} \sin\theta_l^{\mathrm{az}}$ is the corresponding direction cosine along the array axis. We employ the accurate 3D ray-tracing simulator Wireless InSite [37] to obtain the channel parameters in Eq. (2), i.e., $\{\alpha_l, \phi_l, \tau_l, \theta_l\}_{l=1}^{L}$. To simplify the notation, we drop the subcarrier index $k$ and the user index $u$ in the rest of the paper, e.g., we replace $\mathbf{y}_{u,k}$, $\mathbf{X}_k$, and $\mathbf{h}_{u,k}$ with $\mathbf{y}$, $\mathbf{X}$, and $\mathbf{h}$, respectively.
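The channel model of Eqs. (2)-(3) can be prototyped in a few lines of numpy. The sketch below simplifies the manifold to an azimuth-only ULA response and uses arbitrary path parameters in place of ray-tracing outputs:

```python
import numpy as np

def array_manifold(theta, M, d, f, c=3e8):
    """ULA manifold a(theta), Eq. (3), simplified to a single azimuth angle."""
    m = np.arange(M)
    return np.exp(-1j * 2 * np.pi * (d * f / c) * m * np.cos(theta))

def channel(alphas, phis, taus, thetas, M, d, f):
    """Multipath channel h of Eq. (2): sum of steered per-path contributions."""
    h = np.zeros(M, dtype=complex)
    for a_l, phi_l, tau_l, th_l in zip(alphas, phis, taus, thetas):
        h += (a_l * np.exp(1j * phi_l) * np.exp(-1j * 2 * np.pi * f * tau_l)
              * array_manifold(th_l, M, d, f))
    return h

f = 2.4e9                      # example downlink carrier frequency [Hz]
d = 3e8 / f / 2                # half-wavelength antenna spacing
# Two hypothetical paths with arbitrary attenuation/phase/delay/angle values:
h = channel(alphas=[1.0, 0.3], phis=[0.0, 1.2], taus=[1e-7, 3e-7],
            thetas=[np.pi / 3, np.pi / 4], M=8, d=d, f=f)
```

In the paper, the per-path parameters are produced by the Wireless InSite ray tracer rather than chosen by hand as above.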

Available Modalities: The available modalities in an FDD massive MIMO system include the received signals, the pilots, the LS estimate, the downlink channels of previous coherent time periods, the uplink channel, the user location, and the partial downlink channel, as described in the following.

IV-A1 Received signals and pilots

Eq. (1) reveals that there exists a mapping function from $\{\mathbf{X}, \mathbf{y}\}$ to $\mathbf{h}$, which indicates that the received signals and the pilots are two modalities that can be jointly utilized to predict the downlink channel $\mathbf{h}$.

IV-A2 LS estimate

When the number of pilots is sufficient (i.e., $T \geq M$), $\mathbf{h}$ can be estimated by LS [9], i.e., $\hat{\mathbf{h}}_{\mathrm{LS}} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H\mathbf{y}$. In fact, the LS estimate can be regarded as one modality from model information.
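The LS estimate is straightforward to verify numerically; the sketch below (with arbitrary random pilots, T >= M) recovers the channel exactly in the noiseless case:

```python
import numpy as np

def ls_estimate(X, y):
    """Least-squares channel estimate: h_LS = (X^H X)^{-1} X^H y."""
    return np.linalg.solve(X.conj().T @ X, X.conj().T @ y)

rng = np.random.default_rng(1)
M, T = 4, 8                                # antennas, pilot length (T >= M)
X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))  # pilots
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)            # true channel
y = X @ h                                  # noiseless received pilots, Eq. (1)

h_ls = ls_estimate(X, y)
```

With noise added to `y`, `h_ls` becomes a noisy decision whose quality grows with the pilot length, which is why it serves as a useful model-information modality.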

IV-A3 Previous downlink channels

Denote the superscript $(n)$ as the index of coherent time periods. The downlink channels of previous coherent time periods, i.e., $\{\mathbf{h}^{(n-1)}, \mathbf{h}^{(n-2)}, \ldots\}$, are referred to as previous downlink channels for ease of exposition. (Since the downlink channel to be predicted and the other involved modalities are all data in the $n$-th coherent time period, we have omitted the superscript $(n)$ of these real-time data for simplicity.) In practical systems, there exist unknown time correlations among channels that cannot be exploited by conventional channel estimation algorithms, whereas such time correlations can be implicitly learned by DNNs and then be used to improve the prediction accuracy.

IV-A4 User location

The user location can be obtained by various techniques, such as ultra-wideband, the global positioning system, and wireless fidelity. Many positioning works have revealed that there is a distinct link between the user's position and the channels [40, 43]. Define the location-to-channel mapping as $\Phi_{f}: \mathbf{p} \rightarrow \mathbf{h}$, where $\mathbf{p}$ is the 3D coordinate of the user and $f$ is the carrier frequency. Based on the universal approximation theorem [21] and the widely adopted assumption that $\Phi_{f}$ is a bijective deterministic mapping in massive MIMO systems [40, 43], we know that the mapping function $\Phi_{f}$ could be approximated arbitrarily well by a DNN under ideal conditions. Therefore, the modality of the user location can be adopted to predict the downlink channel by using DNNs to learn the mapping $\Phi_{f}$.

IV-A5 Uplink channel

Since uplink channels are easier to obtain than downlink channels in massive MIMO systems, many studies utilize uplink channels to aid the downlink channel prediction [3, 52, 49]. With the assumption that $\Phi_{f}$ is a bijective deterministic mapping, the channel-to-location mapping exists and can be written as $\Phi_{f_{\mathrm{up}}}^{-1}: \mathbf{h}_{\mathrm{up}} \rightarrow \mathbf{p}$. Hence, the uplink-to-downlink mapping exists and can be written as follows [3]:

$$\mathbf{h} = \Phi_{f} \circ \Phi_{f_{\mathrm{up}}}^{-1}\left(\mathbf{h}_{\mathrm{up}}\right), \qquad (4)$$

where $f_{\mathrm{up}}$ is the uplink frequency, and $\circ$ represents the composite mapping related to $\Phi_{f}$ and $\Phi_{f_{\mathrm{up}}}^{-1}$. Therefore, the modality of the uplink channel can also be adopted to predict the downlink channel by using DNNs to learn the mapping $\Phi_{f} \circ \Phi_{f_{\mathrm{up}}}^{-1}$.

IV-A6 Partial downlink channel

Due to the high cost and power consumption of radio-frequency chains, massive MIMO systems usually adopt hybrid analog and digital transceivers that are operated with switchers [28]. Therefore, given the limited transmission period and pilot length, only a partial downlink channel can be obtained by the user and then be fed back to the BS. Denote the partial downlink channel as $\mathbf{h}_{\mathrm{p}} \in \mathbb{C}^{M_{\mathrm{p}}}$ with $M_{\mathrm{p}} < M$, and denote the vector consisting of the unknown elements of $\mathbf{h}$ as $\mathbf{h}_{\mathrm{r}}$. Recalling Eq. (3) and Eq. (2), it is obvious that there exists a deterministic mapping from $\mathbf{h}_{\mathrm{p}}$ to $\mathbf{h}_{\mathrm{r}}$, which can be written as $\Psi: \mathbf{h}_{\mathrm{p}} \rightarrow \mathbf{h}_{\mathrm{r}}$. Therefore, we can predict the downlink channel by learning the mapping $\Psi$.

Modality     {X, y}   LS estimate   previous channels   location p   uplink channel   partial channel
BS side        ✗          ✗               ✓                 ✓              ✓                ✓
User side      ✓          ✓               ✓                 ✓              ✗                ✓
TABLE I: Modalities involved in downlink channel prediction; related works include [53, 15], [53, 30], [3, 52], and [3, 13]

In order to facilitate the analysis, we list the modalities for downlink channel prediction in Tab. I, where "✓" and "✗" respectively represent the available and unavailable modalities for the BS or the user. In particular, the modalities $\mathbf{p}$ and $\mathbf{h}_{\mathrm{p}}$ are available at the BS because they can be fed back to the BS by the user. The modality $\hat{\mathbf{h}}_{\mathrm{LS}}$ is obtained based on $\{\mathbf{X}, \mathbf{y}\}$. When the pilot length is sufficient for the LS estimator, i.e., $T \geq M$, it would be more efficient to directly feed back the downlink channel estimate rather than $\{\mathbf{X}, \mathbf{y}\}$ to the BS. Therefore, we set the modality $\hat{\mathbf{h}}_{\mathrm{LS}}$ to be unavailable at the BS. Tab. I also displays the existing works that utilize the aforementioned modalities to predict channels. By trying and testing possible modality combinations and fusion level strategies, we can find the modalities with higher confidence levels and the modality combinations with better performance.

IV-B DNN Architectures

Based on the definition in Section III-A, downlink channel prediction is a typical discriminative regression task. Since discriminative models are naturally suitable for feature extraction and decision acquirement in discriminative tasks, we choose discriminative models for downlink CSI prediction. The selection of both the modalities and the fusion level strategies depends on the specific scenario. Besides, due to the excellent learning capability of DNNs, we adopt DNN based fusion for channel prediction rather than fixed-rule based fusion.

Loss function: A DNN architecture consists of the input $\mathbf{x}$, the label $\mathbf{y}$, the output $\hat{\mathbf{y}}$, the network parameters $\boldsymbol{\Theta}$, the loss function $\mathcal{L}(\boldsymbol{\Theta})$, a back-propagation learning algorithm, the activation functions, and the network layers. Specifically, the network parameters $\boldsymbol{\Theta}$ include the weights and the biases of the network layers. The loss function adopted in this work is

$$\mathcal{L}(\boldsymbol{\Theta}) = \frac{1}{V}\sum_{v=1}^{V} \left\|\hat{\mathbf{y}}_v - \mathbf{y}_v\right\|^2,$$

where $V$ is the batch size (i.e., the number of samples in one training batch) and the subscript $v$ denotes the index of the $v$-th training sample. The back-propagation learning algorithm adopted in this work is the adaptive moment estimation (ADAM) algorithm [25]. In the off-line training stage, the network parameters $\boldsymbol{\Theta}$ are updated by the ADAM algorithm to minimize the loss function on the training dataset, while in the on-line testing stage, $\boldsymbol{\Theta}$ is fixed and the network directly outputs the estimates of the labels in the testing dataset with a rather small error.
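The training procedure can be sketched with a hand-rolled ADAM update minimizing the batch MSE loss on a linear toy model (the hyperparameters below follow the commonly used ADAM defaults; the data are synthetic):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update: bias-corrected first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the batch MSE loss (1/V) * sum_v ||y_v - f(x_v)||^2 for f(x) = x.w.
rng = np.random.default_rng(2)
X = rng.standard_normal((64, 3))          # a batch of V = 64 training samples
w_true = np.array([0.5, -1.0, 2.0])
Y = X @ w_true                            # labels

w = np.zeros(3)
m = v = np.zeros(3)
for t in range(1, 2001):
    grad = 2 * X.T @ (X @ w - Y) / len(X)  # gradient of the batch MSE loss
    w, m, v = adam_step(w, grad, m, v, t)
```

A real channel-prediction network replaces the linear model with a deep network and obtains the gradient by back-propagation, but the parameter update is the same.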

Activation function: The activation functions, including the leaky rectified linear unit (LeakyReLU), Sigmoid, and Tanh, apply element-wise nonlinear transformations to the outputs of the network layers. (We adopt LeakyReLU instead of the normal rectified linear unit (ReLU) to avoid the "dead ReLU" phenomenon.) The functions LeakyReLU, Sigmoid, and Tanh can be respectively written as

$$\mathrm{LeakyReLU}(x) = \max(x, 0) + \alpha \min(x, 0), \quad \mathrm{Sigmoid}(x) = \frac{1}{1+e^{-x}}, \quad \mathrm{Tanh}(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}},$$

where $\alpha$ is a small positive slope.
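For reference, the three activation functions can be written directly in numpy (the LeakyReLU slope of 0.01 is an assumed value; the paper does not state one):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """LeakyReLU keeps a small slope for x < 0, avoiding "dead ReLU" units."""
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    """Sigmoid squashes inputs into (0, 1); used for the LSTM gates below."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Tanh squashes inputs into (-1, 1)."""
    return np.tanh(x)

x = np.array([-2.0, 0.0, 3.0])
out = leaky_relu(x)
```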

Fig. 2: Illustrations of the dense layer (a), the convolution layer (b), and the LSTM layer (c).

Network layer: Fig. 2 depicts the structures of the network layers, including the dense, the convolution, and the LSTM layers. As shown in Fig. 2 (a), the dense layer can be mathematically expressed as , where and are the weight and the bias of the dense layer, respectively. Compared with the dense layer, the convolution layer is more powerful in learning the spatial features of the inputs. As illustrated in Fig. 2 (b), the convolution layer can be mathematically expressed as , where and are the weight and the bias of the filter, respectively. Fig. 2 (c) depicts the structure of the LSTM layer, where each LSTM layer contains LSTM units. The output of the LSTM layer can be written as . In the -th () LSTM unit, the relationships between the input and the output can be expressed with the following equations:


where and are respectively the weights and the biases of the LSTM units, while , , and are respectively the input gate, the forget gate, and the output gate. Moreover, is the cell state of the -th LSTM unit. Since the LSTM layer can effectively learn both the short-term and the long-term features through the memory cell and the gates, it has been recognized as a useful tool for time-series related tasks.
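The gate equations above follow the standard LSTM formulation; a minimal numpy sketch of one unit step is given below (the dictionary-of-gates layout and the toy dimensions are illustrative choices, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_unit(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM unit (gate names as in the text).

    W, U, b are dicts keyed by 'i', 'f', 'o', 'c' for the input gate,
    forget gate, output gate, and candidate cell state.
    """
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])        # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])        # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])        # output gate
    c_tilde = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])  # candidate cell state
    c = f * c_prev + i * c_tilde   # gated mix of old memory and new content
    h = o * np.tanh(c)             # hidden output passed to the next unit/layer
    return h, c

# Toy step: 2-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((3, 2)) for k in 'ifoc'}
U = {k: rng.standard_normal((3, 3)) for k in 'ifoc'}
b = {k: np.zeros(3) for k in 'ifoc'}
h, c = lstm_unit(rng.standard_normal(2), np.zeros(3), np.zeros(3), W, U, b)
```

Note that the hidden output `h` is bounded by the Sigmoid and Tanh nonlinearities, which is why a dense layer is appended after the last LSTM layer in the architectures that follow.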

IV-C Architecture Designs at the BS Side

Accurate downlink channels are crucial for the BS to obtain high beamforming gains. Here we consider the downlink channel prediction problem under two different scenarios, i.e., the feedback link is either unavailable or available. Before coming to specific architectures, we first present our main idea for designing fusion architectures as follows:

  1. Design and train elementary networks, i.e., the networks that adopt as few modalities as possible to independently predict downlink channels. In fact, all the modalities listed in Tab. I can independently predict downlink channels except the two modalities that should be jointly utilized to obtain downlink channels. Note that the performance of the elementary networks can be used to measure the confidence levels of the corresponding modalities.

  2. Design and train two-element based networks, i.e., the networks that fuse two elementary networks. The performance of the two-element based networks can be used to measure the complementarity of the corresponding modality combinations. When we design fusion architectures with multiple modalities, we will preferentially fuse the modality combinations with better performance and then fuse the modalities with higher confidence levels based on experiments and intuition [36, 32].

The idea is also applicable to the architecture designs at the user side as will be shown in the later section.
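The distinction between fusing two elementary networks at the decision level versus the feature level reduces to which vectors are concatenated before the fusion layers; the sketch below makes this concrete (all dimensions and the one-layer fusion map are illustrative assumptions):

```python
import numpy as np

def fuse(vec_a, vec_b, W, bias):
    """A one-layer fusion network: concatenate two branches, then a dense map."""
    z = np.concatenate([vec_a, vec_b])
    return W @ z + bias

rng = np.random.default_rng(1)

# Elementary branch A: hidden-layer feature (dim 8) and network output (dim 4).
hid_a, out_a = rng.standard_normal(8), rng.standard_normal(4)
# Elementary branch B likewise.
hid_b, out_b = rng.standard_normal(8), rng.standard_normal(4)

# Decision-level fusion sees only the final outputs (4 + 4 inputs)...
dec = fuse(out_a, out_b, rng.standard_normal((4, 8)), np.zeros(4))
# ...while feature-level fusion sees the richer hidden features (8 + 8 inputs).
feat = fuse(hid_a, hid_b, rng.standard_normal((4, 16)), np.zeros(4))
```

Feature-level fusion gives the fusion network more information to exploit, at the cost of a larger input dimension and hence more parameters to train.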

Fig. 3: The network structures of (a), (b), (b), (c), and (d).

IV-C1 Feedback link is unavailable

In this scenario, available modalities are the previous downlink channels , the user location , and the uplink channel . To investigate the confidence levels of the three modalities, we propose three networks, i.e., , , and to respectively predict the downlink channel based on the previous downlink channels, the user location, and the uplink channel. Fig. 3 (a) illustrates the network structure of . The input of is , where

and is the mapping between the complex and the real domains, i.e., . The label of is . The network is composed of several LSTM layers and one dense layer. Here we adopt the LSTM layer to predict the downlink channels owing to its superiority in time series data analysis. Besides, we add the dense layer after the last LSTM layer to free the output of from the limited value ranges of the activation functions and , as indicated in Eq. (5e). Fig. 3 (b) shows the network structure of both and , where the network is composed of several dense layers, and each dense layer except for the output layer is followed by the LeakyReLU function. Note that and have the same network structure and the same label , but they have different inputs and different hyper-parameters, including the number of layers, the number of neurons in each layer, and the learning rates.

To investigate the complementarities of the three modalities, i.e., , , and , we first propose , , and to respectively fuse two of the three modalities at the decision level. As shown in Fig. 3 (c), consists of , , and . The network , composed of several dense layers and LeakyReLU functions, concatenates the network outputs of and as its input vector. Note that the structures of and can be similarly obtained following the design of . Therefore, we omit the descriptions of these networks for simplicity. Then, we propose , , and to respectively fuse two of the three modalities at the feature level. As shown in Fig. 3 (d), the main difference between and is that concatenates the hidden layer outputs rather than the network outputs of and . Similarly, we omit the descriptions of and for simplicity. It should be explained that we do not consider data level fusion for , , and since the three modalities have very different dimensions and data structures, which would result in inefficient data fusion.

Fig. 4: The network structures of (a), (b), (c), and (d).

Furthermore, we propose and to fuse all three modalities at the decision and the feature levels, respectively. As illustrated in Fig. 4 (a) and Fig. 4 (b), and are both composed of , , , and . The difference between and is that concatenates all the hidden layer outputs rather than the network outputs of , , and . Moreover, we propose and to fuse all three modalities at hybrid levels. As depicted in Fig. 4 (c), first uses to fuse the hidden layer outputs of and , and then uses to fuse the network output of and the hidden layer output of . The only difference between and is that fuses the network outputs of both and while fuses the hidden layer outputs of both and . It should be mentioned that we choose to first fuse and at the feature level because outperforms the other proposed two-modality based networks, as will be shown in the simulation section. This indicates that the fusion of and provides stronger complementarity and is therefore more suitable to be fused earlier. Note that the design and the testing of DML are not isolated but interoperable, which means that we need the testing results to guide the network design. In other words, the excellent capability and flexibility of DML come at the cost of design complexity.

Remark 1: The channel prediction based on the three modalities, i.e., , , and , can also be referred to as channel extrapolation across the time, space, and frequency domains. The three-modality based networks in Fig. 4 jointly exploit the complementarity of the time-space-frequency information to improve the performance of the channel extrapolation.

IV-C2 Feedback link is available

In this scenario, we need to investigate which modality would be more efficient to feed back to the BS under a given feedback overhead. When the length of the vector to be fed back, denoted by , is greater than the number of BS antennas , it is obvious that we should directly feed back the downlink channel rather than the received signal . When is smaller than , we try various fusion networks for and , respectively, and then present the networks with the best performance in the following.

Fig. 5: The network structures of (a), (b), (c), and (d).

We first consider the case where is fed back to the BS. Obviously, the length of the vector is . As shown in Fig. 5 (a), we propose to predict the unknown based on the known , i.e., to learn the mapping . The network structure of is the same as that of except that the input and the label of are and , respectively. As shown in Fig. 5 (b), we propose to fuse the network output of and the hidden layer output of . As presented in Fig. 5 (c), we propose to fuse the network output of and the hidden layer output of . As illustrated in Fig. 5 (d), we propose to fuse the network outputs of both and .

Then, we consider the case where is fed back to the BS. Since the length of the feedback vector is smaller than , i.e., , the LS estimator is not applicable due to rank deficiency. However, it is feasible for DNNs to learn the mapping from to , and thus we propose to predict based on . As shown in Fig. 6 (a), is the input data of the first dense layer while and are concatenated along a new axis as the input data of the first convolution layer. Each convolution layer is followed by the LeakyReLU and the average pooling functions. The average pooling functions are added to down-sample the data stream and avoid overfitting [41]. After reshaping the output of the last convolution layer, we use to fuse the two data streams from the modalities and . The label of is . Moreover, we propose to fuse the network output of with the hidden layer outputs of both and , as depicted in Fig. 6 (b). We also propose to fuse the hidden layer output of with the network outputs of both and , as shown in Fig. 6 (c). As illustrated in Fig. 6 (d), we propose to fuse the network outputs of both and .

Fig. 6: The network structures of (a), (b), (c), (d), (e), (f), and (g).

Remark 2: All the networks proposed in Section IV-C can be easily extended to other problems such as beam prediction and antenna selection. Specifically, by replacing the labels of all these networks with the optimal beam vectors, the proposed architectures can handle beam prediction at the BS side. Besides, the proposed architectures can deal with antenna selection by replacing the labels of all these networks with the optimal selection vectors. It is worth mentioning that the variant architectures for antenna selection do not require perfect downlink channels, which can significantly reduce the cost resulting from downlink channel prediction.

IV-D Architecture Designs at the User Side

We consider three different scenarios for downlink channel prediction at the user side, i.e., pilots being unavailable, insufficient or sufficient.

IV-D1 Pilots are unavailable

In this scenario, available modalities are the previous downlink channels and the user location . As described in Section IV-C1, we can use , , and to predict the downlink channels.

IV-D2 Pilots are insufficient

In this scenario, available modalities are , , , and . As described in Section IV-C2, we can use , and to predict the downlink channels.

IV-D3 Pilots are sufficient

When pilots are sufficient, the LS estimator can be used to estimate the downlink channel. Inspired by [53] and [15], we propose , consisting of several dense layers and LeakyReLU functions, to predict based on the LS estimate of the downlink channel , as illustrated in Fig. 6 (e). The input and label of are and , respectively. It should be emphasized that even when LS estimates have been obtained, the available modalities, i.e., and , can also be adopted to enhance the accuracy. Moreover, we propose , as shown in Fig. 6 (f), where the network input of and the network output of are fused by . We also propose to fuse the network output of and the hidden layer output of , as displayed in Fig. 6 (g).

Remark 3: The networks , , and can be easily extended to data detection. One simple way, inspired by [55], is to first divide the transmitted signals into pilots and data signals. Then, the pilots are fed to the network as depicted in Fig. 6 while the data signals are adopted as the training labels. In this way, we do not need to collect the downlink channels as training labels, which significantly reduces the cost of label collection.

IV-E Training Steps

The detailed training steps of all the proposed fusion networks are given as follows:

  1. Train the elementary networks, e.g., , , , , ( is trained in an end-to-end manner), and , independently to minimize the losses between their outputs and the labels until their loss functions converge, and then fix these network parameters;

  2. Train to minimize the loss between its output and the label until its loss function converges, and then fix its network parameters;

  3. Following step 2), train and successively until their loss functions converge, and then fix their network parameters successively.

Obviously, the time and computation costs increase as the number of modalities increases. The balance between the cost and the performance should be considered in practical applications.
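The stage-wise schedule above can be sketched as a freeze-then-train loop; the `Network` class and its method names below are hypothetical stand-ins for real trainable models, used only to show the ordering and the parameter-freezing discipline:

```python
class Network:
    """Minimal stand-in for a trainable network (illustrative only)."""
    def __init__(self, name: str):
        self.name = name
        self.trained = False
        self.frozen = False

    def train_until_converged(self):
        # A frozen network's parameters must not be updated again.
        assert not self.frozen, f"{self.name} is frozen and must not be retrained"
        self.trained = True

    def freeze(self):
        self.frozen = True

# Stage 1: train each elementary network independently, then fix its parameters.
elementary = [Network(n) for n in ("net_prev_channels", "net_location", "net_uplink")]
for net in elementary:
    net.train_until_converged()
    net.freeze()

# Stage 2: train the first fusion network on top of the frozen branches.
fusion_1 = Network("fusion_stage_1")
fusion_1.train_until_converged()
fusion_1.freeze()

# Stage 3: train later fusion stages successively, freezing each in turn.
fusion_2 = Network("fusion_stage_2")
fusion_2.train_until_converged()
fusion_2.freeze()
```

Freezing the earlier stages keeps their learned features stable, so each new fusion stage only trains its own parameters, which bounds the per-stage training cost.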

V Simulation Results

In this section, we will first present the simulation scenario and default network parameters. Then, the performance of the proposed networks will be evaluated and analyzed.

Fig. 7: A partial view of the ray-tracing scenario. The small green box represents the BS antennas. Each small red box represents a possible location of the user antenna, and the distance between adjacent red boxes is 1 m. The red line is aligned with the y-axis. This ray-tracing scenario is constructed using the Remcom Wireless InSite [37].

V-a Simulation Setup

Dataset Generation: In the simulations, we consider an outdoor massive MIMO scenario that is constructed based on the accurate 3D ray-tracing simulator Wireless InSite [37]. Unlike conventional statistical channel generation methods, the 3D ray-tracing simulator can capture the dependence of channels on the environment geometry/materials and transmitter/receiver locations, and can therefore provide more reliable datasets for training and testing. The scenario comprises one BS and a large number of randomly distributed user antennas, and each BS is equipped with 64 antennas. The scenario covers an area of square metres. A partial view of the ray-tracing scenario is illustrated in Fig. 7. The uplink and downlink frequencies are set to 2.50 GHz and 2.62 GHz, respectively. Based on the environment setup, the 3D ray-tracing simulator outputs the uplink channel parameters, the downlink channel parameters, and the location of each user. With these outputs, we can construct the training and testing datasets of all the modalities. Specifically, we can obtain and for each user by using Eq. (2) and the channel parameters from the 3D ray-tracing simulator. With Eq. (1), we can generate the pilots and the received signals based on . We assume that the previous downlink channels are the channels of the user at adjacent positions and that the users move along the y-axis. Then, can be obtained by collecting channels at adjacent positions. The partial downlink channel can be obtained by selecting elements out of , and the remaining elements then constitute the vector . After obtaining all sample pairs, we randomly select 9000 samples as the training dataset and select 1000 samples from the rest of the sample pairs as the testing dataset (for more details on how to generate channels using Wireless InSite, please refer to the paper [1] and the code [10]). Since perfect channels are not available in practical situations, unless otherwise specified, all the sample pairs in the datasets are estimated by the LMMSE algorithm [9] when the signal-to-noise ratio (SNR) is 25 dB and the pilot length is 64.
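The random train/test split described above (9000 training samples, 1000 disjoint testing samples) can be sketched as follows; the total of 10,000 sample pairs is an illustrative assumption, as the paper does not state the total count here:

```python
import numpy as np

rng = np.random.default_rng(42)
n_total = 10_000                  # assumed total number of sample pairs (illustrative)
n_train, n_test = 9_000, 1_000   # split sizes stated in the paper

# Shuffle all indices once, then carve out disjoint train/test sets.
perm = rng.permutation(n_total)
train_idx = perm[:n_train]                  # 9000 random training samples
test_idx = perm[n_train:n_train + n_test]   # 1000 samples drawn from the rest
```

Drawing the test set from the remaining indices guarantees the two sets are disjoint, so testing NMSE reflects generalization rather than memorization.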

Adopted Neural Networks: Unless otherwise specified, the parameters of the proposed networks are given in Tab. II, where “LSTM: 256, 256” means that the hidden layers in consist of two LSTM layers, each with 256 units. The numbers of units in the input and the output layers of all the proposed networks are consistent with the lengths of the input and the output data vectors, and are thus omitted in Tab. II. We choose the output of the middle hidden layer as the hidden layer output of the networks. The batch size of all proposed networks is 128. Let and represent the estimated and the true downlink channels, respectively (the notation denotes the inverse mapping of , given as ). The normalized mean-squared error (NMSE) is used to measure the prediction accuracy, which is defined as .
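The NMSE metric can be sketched as follows, assuming the common definition of per-sample error energy normalized by true-channel energy and averaged over the test set (the function name and toy channels are illustrative):

```python
import numpy as np

def nmse(h_hat: np.ndarray, h: np.ndarray) -> float:
    """Normalized mean-squared error, averaged over test samples (rows)."""
    err = np.sum(np.abs(h_hat - h) ** 2, axis=1)  # per-sample error energy
    ref = np.sum(np.abs(h) ** 2, axis=1)          # per-sample channel energy
    return float(np.mean(err / ref))

# Two toy test samples with 2-dimensional channels.
h     = np.array([[1.0, 0.0], [0.0, 2.0]])
h_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
print(nmse(h_hat, h))  # per-sample ratios 0 and 0.25 -> mean 0.125
```

Normalizing by the channel energy makes the metric comparable across users with very different path losses, unlike a raw MSE.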

Network | Structure parameters | Learning rate
, | LSTM: 256, 256 | 5e-4
| Dense: 256, 256, 256, 256, 256 | 1e-4
| Dense: 256, 256, 256 | 1e-3
| Dense: 256, 256, 256 | 5e-4
| Dense: 256, 128, 128; Filter: 16, 32, 8; Kernel: (5, 5); Dense: 256 | 5e-4
| Dense: 256, 256 | 5e-4
TABLE II: Default parameters for the proposed networks
Fig. 8: The NMSE performance of previous downlink channel related networks versus .

V-B BS Side

Fig. 8 displays the NMSE performance of the previous downlink channel related networks versus . A larger means that learns the time correlation of the downlink channels from previous downlink channels over a longer time period. It can be observed that the performance of all these networks first improves and then degrades as increases, which indicates that the channels within 3 time periods contribute positively to the downlink channel prediction while the channels beyond 3 time periods have a negative impact. Furthermore, always performs worse than the other fusion networks, and outperforms the other networks regardless of the value of in the considered setup. This implies that the modalities and do provide complementary information for . We set to 3 for better performance in the following simulations.

Fig. 9 shows the NMSE performance of all networks that are applicable to the BS without a feedback link, as discussed in Section IV-C1. As shown in Fig. 9, all the two-modality fusion networks outperform the corresponding single-modality networks, which implies that any two of the three modalities, i.e., , , and , can provide complementary information, thus enhancing the prediction accuracy. In particular, although has worse performance than both and , the four two-modality fusion networks, i.e., , , , and , all perform better than both and . Besides, we notice that has the best performance among the two-modality fusion networks and outperforms the other three-modality fusion networks. In fact, the structure of is inspired by that of . More specifically, since outperforms the other two-modality fusion networks, we choose to preferentially fuse and at the feature level.

Fig. 9: The NMSE performance of the networks that are applicable to the BS without a feedback link.

Fig. 10 compares the NMSE performance of all the networks that are applicable to the BS with a feedback link, as discussed in Section IV-C2. As shown in Fig. 10, the performance of all the proposed networks improves as the feedback length increases. As indicated in the first enlarged view, it is better to feed the partial downlink channel back to the BS when is greater than 48; otherwise, it is better to feed the received signal back to the BS. Furthermore, it can be observed from the second enlarged view that and consistently outperform while the gaps between the three networks shrink as increases. This indicates that when we choose to feed the partial downlink channel back, i.e., , we can adopt instead of the other related fusion networks to reduce the training cost, since the gaps between them are negligible. Moreover, as shown in the third enlarged view, consistently outperforms , , and while the gap between and becomes negligible when is larger than 36. This indicates that we can adopt for better prediction accuracy when is smaller than 36 and adopt for lower training cost when is greater than 36 and less than 48.

Fig. 10: The NMSE performance of all the networks that are applicable to the BS with a feedback link. The networks are trained separately for each value of .

V-C User Side

Fig. 11 displays the NMSE performance of LS, , , , and versus the pilot length , where the SNR is 30 dB. As shown in Fig. 11, has worse performance than , which means cannot provide complementary information for when the SNR is high and the number of pilots is sufficient, i.e., . In other words, when the number of pilots is sufficient and the SNR is high, the modalities have the highest confidence level and the other modalities can hardly provide complementary information to improve the performance. Moreover, the LS estimator outperforms when is greater than 80, which implies that model-based methods gain more advantages over network-based methods as increases. Furthermore, outperforms when is smaller than 72, which means the networks can learn extra features from to improve the performance when is smaller than 72, while when is greater than 72, provides redundant information to the network and results in worse performance. To obtain better performance at the user side, we can choose when is in and choose when is greater than 72.

Fig. 11: The NMSE performance of LS, , , , and versus the pilot length . The networks are trained separately for each value of .
Fig. 12: The NMSE performance of the proposed networks versus SNR, where is 64. The networks are trained separately for each value of SNR.

V-D Impairments in Practical Applications

To collect off-line training samples, we can obtain extremely accurate channels by increasing the SNR and the pilot length. However, in the on-line testing stage, low SNRs would impair the prediction accuracy of the proposed networks. Therefore, we investigate the impact of various SNRs on the performance of LS, , , , , , , and , where is 64. Fig. 12 shows the performance of these networks versus the SNR in the on-line testing stage. Notice that the performance of , , and becomes saturated when the SNR is higher than 15 dB, which means that the estimation errors of the input channels no longer impact the performance of the three networks when the SNR is higher than 15 dB. As indicated in Fig. 12, outperforms all other networks when the SNR is lower than 17 dB while outperforms when the SNR is higher than 17 dB. This is because the estimation based on pilots and received signals highly relies on the SNR while the prediction based on