Signal modulation is an essential process in wireless communication systems. Modulation recognition tasks are generally used for both signal detection and demodulation. The signal transmission can be smoothly processed only when the signal receiver demodulates the signal correctly. However, with the fast development of wireless communication techniques and more high-end requirements, the number of modulation methods and parameters used in wireless communication systems is increasing rapidly. The problem of how to recognize modulation methods accurately is hence becoming more challenging.
Traditional modulation recognition methods usually require prior knowledge of signal and channel parameters, which can be inaccurate under mild circumstances and need to be delivered through a separate control channel. Hence, the need for autonomous modulation recognition arises in wireless systems, where modulation schemes are expected to change frequently as the environment changes. This leads to considering new modulation recognition methods using deep neural networks.
Deep Neural Networks (DNNs) have played a significant role in video, speech, and image processing research over the past few years. Recently, deep learning has been introduced to the area of communications by applying convolutional neural networks (CNNs) to the task of radio modulation recognition.
The Convolutional Neural Network (CNN) has recently been identified as a powerful tool in image classification and voice signal processing. There have also been successful attempts to apply this method in other areas such as natural language processing and video detection. Building on its superior feature extraction performance, a simple CNN architecture was introduced for distinguishing between 10 different modulations. Simulation results show that the CNN not only achieves better accuracy, but also provides more flexibility than current expert-based approaches.
However, CNNs have been challenged with problems like vanishing or exploding gradients, and accuracy degradation after reaching a certain network depth. Attempts have been made to address these issues. Most notably, Residual Networks (ResNet) and Densely Connected Networks (DenseNet) were recently introduced to strengthen feature propagation in the neural network by creating shortcut paths between different layers in the network. A building block of a residual learning network can be expressed using the equation in Figure 1, where $x$ and $H(x)$ are the input and output of the block, respectively, and $F$ is the residual mapping function to be trained. Since it may be hard to learn the mapping $H(x) = x$, this block learns the residual mapping $F(x) = H(x) - x$, which can be easier to learn. By adding the bypass connection, an identity mapping is created, allowing the deep network to learn simple functions that would have required a shallower network to learn.
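As an illustration of the residual relation, in which the block output is the trained residual plus the bypassed input, the following minimal sketch uses plain Python lists in place of feature maps. The two residual functions are hypothetical stand-ins for trained layers and are not taken from the paper.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return H(x) = F(x) + x, where F is the (trainable) residual mapping."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the bypass sum")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x: Vector) -> Vector:
    """Hypothetical F that outputs zeros: the block reduces to the identity."""
    return [0.0 for _ in x]


def small_correction(x: Vector) -> Vector:
    """Hypothetical F that applies a small correction on top of the identity."""
    return [0.1 * xi for xi in x]


identity_out = residual_block([1.0, -2.0, 3.0], zero_residual)
corrected_out = residual_block([1.0, -2.0, 3.0], small_correction)
```

With a zero residual the block passes its input through unchanged, which is why an identity mapping is easy to represent once the bypass connection is present.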
Recently, a Convolutional Long Short-term Deep Neural Network (CLDNN) has been introduced, which combines the architectures of CNN and Long Short-Term Memory (LSTM) into a deep neural network by taking advantage of the complementarity of CNNs, LSTMs, and DNNs. The LSTM unit is a memory unit of a Recurrent Neural Network (RNN). RNNs are neural networks with memory that are suitable for learning sequence tasks such as speech recognition and handwriting recognition. LSTM mitigates the vanishing gradient problem in RNNs by using a forget gate in its memory cell, which enables the learning of long-term dependencies.
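To make the role of the forget gate concrete, here is a minimal sketch of one step of a scalar LSTM cell in its commonly used formulation. The weight dictionaries and their values are illustrative assumptions, not parameters from the paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell. `w` maps a gate name to (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much of the new candidate to write
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # new candidate content
    c = f * c_prev + i * g            # cell state update
    h = o * math.tanh(c)              # hidden state
    return h, c


def run(weights, steps=100, c0=1.0):
    """Feed zeros for `steps` time steps and return the final cell state."""
    h, c = 0.0, c0
    for _ in range(steps):
        h, c = lstm_step(0.0, h, c, weights)
    return c


# Hypothetical weights: only the gate biases differ between the two settings.
remember = {"forget": (0, 0, 10.0), "input": (0, 0, -10.0),
            "output": (0, 0, 0.0), "candidate": (0, 0, 0.0)}
forget = dict(remember, forget=(0, 0, -10.0))

c_remember = run(remember)  # forget gate near 1: cell state is preserved
c_forget = run(forget)      # forget gate near 0: cell state decays immediately
```

When the forget gate stays close to one, the cell state is carried across many time steps almost unchanged, which is the mechanism that lets an LSTM retain long-term dependencies.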
Because traditional models of the wireless channel may not be accurate, our experiments use the RadioML2016.10b dataset as the input dataset. The data is generated using GNU Radio in a way that captures various channel imperfections that are present in a real system. In this paper, we develop ResNet, DenseNet, and CLDNN architectures for the modulation recognition task. Using the same dataset, we achieve a roughly 13.5% accuracy improvement at high SNR over the state-of-the-art CNN architecture. We believe the accuracy improvements come from better spatial and temporal feature extraction.
II Simulation Setup
The RadioML2016.10b dataset contains 10 types of modulations: eight digital and two analog. The digital modulations are BPSK, QPSK, 8PSK, QAM16, QAM64, BFSK, CPFSK, and PAM4; the analog modulations are WB-FM and AM-DSB. For digital modulations, the entire Gutenberg works of Shakespeare in ASCII is used, with whitening randomizers applied to ensure equiprobable symbols and bits. For analog modulations, a continuous voice signal is used as input data, which consists primarily of acoustic voice speech with some interludes and off times. Each example in the dataset is a 128-sample complex time-domain vector generated in GNU Radio. The 160,000 examples are segmented into training and testing datasets through 128-sample rectangular windowing, which is similar to the windowing of a continuous acoustic voice signal in voice recognition tasks. The training examples, each consisting of 128 samples, are fed into the neural network as 2×128 vectors in which the real and imaginary parts of the complex time samples are separated. The labels in the input data include the SNR ground truth and the modulation type. The SNR of the samples is uniformly distributed from -20 dB to +18 dB. All training and testing are done in Keras using an NVIDIA M60 GPU. We use the Adam optimizer in Keras, with Theano as the back end.
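The input formatting described above can be sketched as follows: one 128-sample complex window becomes a 2×128 array whose rows hold the real and imaginary parts. The function names and the synthetic test tone are hypothetical, and the 60/40 split is inferred from the 96,000 and 64,000 example counts reported in the training complexity discussion.

```python
import cmath
from typing import List, Sequence, Tuple

WINDOW = 128  # samples per example, as stated in the text


def to_iq_rows(window: Sequence[complex]) -> List[List[float]]:
    """Split one complex window into a 2 x 128 list: [real parts, imaginary parts]."""
    if len(window) != WINDOW:
        raise ValueError(f"expected {WINDOW} samples, got {len(window)}")
    return [[s.real for s in window], [s.imag for s in window]]


def split_counts(total: int, train_percent: int) -> Tuple[int, int]:
    """Return (train, held-out) example counts using integer arithmetic."""
    n_train = total * train_percent // 100
    return n_train, total - n_train


# Synthetic stand-in for one window: a complex tone, not data from the dataset.
tone = [cmath.exp(2j * cmath.pi * k / WINDOW) for k in range(WINDOW)]
example = to_iq_rows(tone)

n_train, n_held_out = split_counts(160_000, 60)
```

Keeping the real and imaginary parts in separate rows lets a standard real-valued convolutional layer process complex baseband samples without any complex arithmetic.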
II-A Evaluation Network
We start with a convolutional neural network architecture similar to the CNN2 network, which performs blind temporal learning using a two-convolutional-layer deep neural network and achieves an accuracy of 75% at high SNR, outperforming current expert-based approaches. Our training is based on several neural network architectures: Convolutional Neural Network (CNN), Densely Connected Convolutional Network (DenseNet), Residual Network (ResNet), and Convolutional Long Short-Term Deep Neural Network (CLDNN). For CNN, we optimized the following hyper-parameters: learning rate, dropout rate, filter size, number of filters per layer, and network depth. We tried different combinations of convolutional layer sequences and filter numbers in each layer to get the best accuracy. We also developed deeper networks by adding more convolutional layers to the CNN2 model, and obtained the best accuracy from the architecture shown in Figure 2, where four convolutional layers are followed by two dense layers. The first parameter below each convolutional layer represents the number of filters in that layer, while the second and third numbers show the size of each filter. The two dense layers have 128 and 11 neurons, in order of their depth in the network.
Inspired by the winning architecture of ImageNet 2015, we apply the ResNet architecture and test architectures with an increasing number of convolutional layers, up to 8. We obtain the best classification accuracy from the four-convolutional-layer ResNet architecture shown in Figure 3. The output from the first layer is forwarded to the layer two levels deeper. This structure alleviates the vanishing gradient problem by explicitly letting every few stacked layers fit a residual mapping. The hyper-parameters are chosen according to the basic observation from simple CNN architectures that placing larger filters close to the input layer, followed by smaller filters close to the output layer, leads to a significant accuracy improvement.
DenseNet improves the information flow between layers further than ResNet does, as each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Our DenseNet architecture is illustrated in Figure 4, with four convolutional layers densely connected with each other and the output fed into two dense layers. We set the parameters of the convolutional layers to achieve the best accuracy.
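The effect of dense connectivity on layer inputs can be illustrated with a short bookkeeping sketch: if each layer receives the concatenation of the block input and all preceding feature maps, its input channel count grows with depth. The filter counts below are hypothetical, and the sketch assumes all feature maps share the same spatial size so that they can be concatenated along the channel axis.

```python
from typing import List, Tuple


def dense_block_input_channels(
    input_channels: int, filters_per_layer: List[int]
) -> Tuple[List[int], int]:
    """Return the input channel count seen by each layer and the block's output channels."""
    inputs = []
    available = input_channels  # channels available for concatenation so far
    for n_filters in filters_per_layer:
        inputs.append(available)
        available += n_filters  # this layer's feature maps join the pool
    return inputs, available


def plain_chain_input_channels(
    input_channels: int, filters_per_layer: List[int]
) -> List[int]:
    """Input channels per layer when each layer only sees its predecessor."""
    return [input_channels] + filters_per_layer[:-1]


filters = [256, 256, 80, 80]  # hypothetical filter counts for four layers
dense_inputs, dense_out = dense_block_input_channels(1, filters)
plain_inputs = plain_chain_input_channels(1, filters)
```

Under these assumed filter counts, the fourth layer of the dense block sees 593 input channels instead of 80 in a plain chain, which shows how features from early layers remain directly available to later ones.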
We finally propose a CLDNN architecture that includes long short-term memory units. CLDNNs are mainly used in voice processing tasks that involve raw time-domain waveforms. A CLDNN is a combination of CNNs, long short-term memory (LSTM), and deep neural networks (DNN). In our setting, we choose four convolutional layers in the CNN, followed by one LSTM layer with 50 computing units and two fully connected DNN layers (see Figure 5). We tested CLDNN architectures with different numbers of memory cells in the LSTM layer. Our experiments show that an LSTM layer with 50 cells gives the best accuracy compared to the other layer settings.
II-B Training Complexity
The computation time using one NVIDIA M60 GPU for 96,000 training examples and 64,000 validation and testing examples varies significantly across models. The simplest model, a CNN with only two convolutional layers, takes approximately 15 seconds per epoch, while the CNN with four convolutional layers takes approximately 400 seconds per epoch. We note that a high dropout rate may slow down training but reduces overfitting. In our setting, we set the dropout rate to 0.6, which is higher than the rate used in the original CNN model, and the activation function in each hidden layer is a Rectified Linear Unit (ReLU). We set patience, the number of epochs for which a non-improving validation loss is tolerated, to 10 when there are three or four convolutional layers, and get a total training time of around half an hour. When the network becomes deeper, it starts to take more than 10 training epochs for the validation loss to decrease, so setting patience to 20 produces a smaller validation loss, which means higher accuracy. To get better results, we set patience to 20 in the remaining models. Training takes approximately 1000 seconds per epoch in all three models. The total training time is approximately 70 hours for the DenseNet model, 20 hours for the ResNet model, and 50 hours for the CLDNN model.
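The role of patience can be sketched with a generic early-stopping rule: training stops once the validation loss has failed to improve for a given number of consecutive epochs. This is a simplified illustration rather than the exact stopping logic used in the experiments, and the loss curve is made up to show a plateau of 12 epochs followed by a late improvement.

```python
from typing import Sequence, Tuple


def early_stopping_epoch(val_losses: Sequence[float], patience: int) -> Tuple[int, int]:
    """Return (epoch at which training stops, epoch with the best validation loss).

    Epochs are zero-indexed. Training stops after `patience` consecutive
    epochs without a strict improvement in validation loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: a 12-epoch plateau, then a better minimum.
losses = [1.0, 0.8] + [0.85] * 12 + [0.6] + [0.65] * 25

stop_10, best_10 = early_stopping_epoch(losses, patience=10)
stop_20, best_20 = early_stopping_epoch(losses, patience=20)
```

With this made-up curve, a patience of 10 stops during the plateau and keeps the loss of 0.8, whereas a patience of 20 survives the plateau and reaches the later minimum of 0.6, at the cost of more training epochs.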
III-A Convolutional Neural Network
We start with a basic two-convolutional-layer neural network, in which two convolutional layers with 256 1x3 filters and 80 2x3 filters, respectively, are followed by two dense layers. We then explore the effect of different filter settings by exchanging filter settings between the two convolutional layers. The performances of networks with different filter settings demonstrate that layer architectures with larger filters in earlier convolutional layers and smaller filters in deeper convolutional layers optimize the accuracy result at high SNR.
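As a rough guide to the size of these two convolutional layers, the sketch below counts trainable parameters for the stated filter settings. It assumes a single input channel for the 2×128 input and one bias term per filter; neither assumption is stated in the text.

```python
def conv2d_params(n_filters: int, kernel_h: int, kernel_w: int,
                  in_channels: int, bias: bool = True) -> int:
    """Number of trainable parameters in a 2-D convolutional layer."""
    weights = n_filters * kernel_h * kernel_w * in_channels
    return weights + (n_filters if bias else 0)


# Layer 1: 256 filters of size 1x3 on an assumed single-channel input.
layer1_params = conv2d_params(256, 1, 3, in_channels=1)

# Layer 2: 80 filters of size 2x3, each spanning all 256 feature maps of layer 1.
layer2_params = conv2d_params(80, 2, 3, in_channels=256)

# Same filter settings with the two layers exchanged.
swapped_layer1 = conv2d_params(80, 2, 3, in_channels=1)
swapped_layer2 = conv2d_params(256, 1, 3, in_channels=80)
```

Under these assumptions the first layer has 1,024 parameters and the second has 122,960, so most of the convolutional capacity sits in the second layer because each of its filters spans every feature map produced by the first.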
Next, we explore the optimal depth of CNN by increasing the number of convolutional layers from 2 to 5. We find that the best accuracy at high SNR is approximately 83.8%, obtained with the four-convolutional-layer architecture shown in Figure 2. This is a significant improvement of 8.8% over the two-convolutional-layer model. Because lower loss corresponds to higher accuracy, a smoothly decreasing loss indicates that the network is learning well, as it does for the four-convolutional-layer model. When the neural network gets deeper, the validation loss becomes less likely to converge. For the five- and six-convolutional-layer models, large loss oscillations appear early during training, so the minimum losses achieved by these networks are larger than that of the four-convolutional-layer model, which leads to poorer classification performance.
III-B Residual Network
We find that combining a residual network with the original CNN architecture performs similarly to the pure CNN architecture. As with CNN, the best performance of 83.5% is achieved when we combine ResNet with a four-convolutional-layer neural network, as shown in Figure 3. Recognition accuracy also starts to decrease when we combine ResNet with a network architecture that has more than four convolutional layers.
III-C Densely Connected Network
Because more densely connected blocks require a deeper neural network, which in our experiments resulted in accuracy degradation, we implement DenseNet on CNN architectures with only one densely connected block. We start with a three-convolutional-layer DenseNet and keep adding convolutional layers to the network until the accuracy starts to decrease. We achieve a best accuracy of 86.6% (see Figure 7) at high SNR using the four-convolutional-layer architecture shown in Figure 4.
CLDNN has been widely used in recognition tasks that involve time domain signals like videos, speech, and images, as its inherent memory property helps it recognize temporal correlations in the input signal. Recent work has also suggested the use of CLDNN for modulation recognition tasks. However, neither the network architecture nor the obtained accuracy results were clearly specified in that work, so it was not feasible to reproduce those results and compare them with ours. We applied the CLDNN architecture and compared its performance with the results of ResNet and DenseNet. We added an LSTM unit to the network after the convolutional part. We believe that the cyclic connections extract more relevant temporal features from the signal. The results of CLDNN, shown in Figure 8, do outperform the other models. The accuracy at high SNR reaches 88.5%, the highest among all tested neural network architectures.
In Figure 9, we show the classification results of the highest SNR case in a confusion matrix. Apart from the clean diagonal, the matrix shows two main discrepancies: WBFM being misclassified as AM-DSB and QAM16 being misclassified as QAM64. Details of the misclassification effects on accuracy are listed in Table I, where the number in the percentage column represents the percentage of the left-hand-side modulation type that is misclassified as the modulation type on the right-hand side. A small portion of 8PSK samples are misclassified as QPSK and a small portion of WBFM samples are misclassified as GFSK; we expect that further optimizing the neural network architecture, and possibly increasing the depth, would help capture these subtle feature differences. We further notice that QAM16 and QAM64 are likely to be misclassified as each other, since their similarity in the constellation diagram makes the differentiation vulnerable to small noise in the signal. We expect that appropriate pre-processing of the input signal can help reduce these large misclassification percentages. A large discrepancy also exists in WBFM classification, where samples are likely to be recognized as AM-DSB. We believe that this discrepancy is probably due to the silence periods in the analog voice signal, during which only the carrier tone exists.
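The percentages of the kind listed in such a table can be derived from a confusion matrix by normalizing each row by its total. The sketch below shows that computation; the labels are a subset of the modulation types and the counts are made-up placeholders, not results from the experiments.

```python
from typing import List, Sequence, Tuple


def misclassification_rates(
    confusion: Sequence[Sequence[int]],
    labels: Sequence[str],
    threshold_pct: float,
) -> List[Tuple[str, str, float]]:
    """List (true label, predicted label, percent) for off-diagonal entries.

    Rows are true classes and columns are predicted classes. Only entries at
    or above `threshold_pct` are returned, largest first.
    """
    results = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            continue  # no samples of this class
        for j, count in enumerate(row):
            if i != j:
                pct = 100.0 * count / total
                if pct >= threshold_pct:
                    results.append((labels[i], labels[j], pct))
    return sorted(results, key=lambda r: -r[2])


labels = ["QAM16", "QAM64", "WBFM", "AM-DSB"]
# Hypothetical counts for illustration only.
confusion = [
    [60, 40, 0, 0],
    [20, 80, 0, 0],
    [0, 0, 50, 50],
    [0, 0, 0, 100],
]

rates = misclassification_rates(confusion, labels, threshold_pct=10.0)
```

Row normalization matters here because it expresses each error as a share of the true class, which is how the percentage column of the table is defined.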
By creating shortcuts between different layers, the ResNet and DenseNet architectures alleviate the vanishing gradient problem and promote feature reuse. Comparing the performances of ResNet and DenseNet in Figure 8, we notice that DenseNet performs significantly better than ResNet because it includes more shortcut connections in the network, which further strengthens feature propagation throughout the network.
Although the ResNet and DenseNet architectures also suffer from accuracy degradation when the network grows deeper than the optimal depth, our experiments still show that, at the same network depth, DenseNet and ResNet converge much faster than plain CNN architectures. Figure 10 shows the validation errors of ResNet, DenseNet, and CNN of the same network depth with respect to the number of training epochs. ResNet and DenseNet start at significantly lower validation errors and maintain a lower validation error throughout the whole training process, meaning that combining ResNet and DenseNet with a plain CNN architecture does make neural networks more efficient to train for the considered modulation classification task.
We finally applied the CLDNN architecture and obtained through it the best performance among all tested network architectures. We believe that the good performance of CLDNN is due to its long-term memory ability, which is suitable for the causality characteristic of time domain radio signals.
Multiple state-of-the-art deep neural networks were applied to the radio modulation recognition task. We explored signal feature extraction by adding convolutional layers, various kinds of residual layers, and recurrent layers to a deep neural network architecture. A Convolutional Long Short-term Deep Neural Network (CLDNN) was found to deliver the best classification accuracy, improving on the original CNN model by approximately 13.5%. We believe that the causality of radio time domain signals leads to this improvement, since a recurrent network is known to perform well for continuous acoustic signal processing tasks. The residual and densely connected networks (ResNet and DenseNet) also perform well, although their best accuracy is limited by the depth of the network. They nevertheless suggest that changing connections between layers, and especially creating shortcuts between non-consecutive layers, may produce better classification accuracy.
-  T. J. O’Shea and J. Corgan, “Convolutional radio modulation recognition networks,” CoRR, vol. abs/1602.04105, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
-  G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” CoRR, vol. abs/1608.06993, 2016.
-  T. N. Sainath, O. Vinyals, A. W. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4580–4584, 2015.
-  T. O’Shea and N. West, “Radio machine learning dataset generation with GNU Radio,” GNU Radio Conference, vol. 1, no. 1, 2016.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
-  N. E. West and T. J. O’Shea, “Deep architectures for modulation recognition,” CoRR, vol. abs/1703.09197, 2017.