In recent years, we have witnessed explosive growth in mobile data, most of which is generated by wireless devices (WDs) such as smartphones and internet-of-things (IoT) sensors. These massive data are usually uploaded to a cloud center to train artificial intelligence (AI) models. However, traditional cloud AI suffers from network congestion and does not support real-time applications. Mobile-edge computing (MEC), which places cloud-like functions at the network edge [e.g., at a base station (BS) or an access point (AP)], is an emerging technology that can overcome these shortcomings of cloud technology [7, 6, 11, 4].
Moreover, MEC enables edge AI, or edge machine learning, which runs machine learning at edge servers and thereby integrates wireless communication with machine learning. This integration can be divided into two directions: machine learning for wireless communication (ML-WC) and wireless communication for machine learning (WC-ML). Most existing works fall into ML-WC (see the survey papers [3, 12, 8, 2, 13, 9, 14, 15, 1] and references therein), i.e., they use machine learning as a tool to solve very complex problems or to handle mathematically intractable expressions in wireless communications. In WC-ML, edge servers use data transmitted from WDs to train machine learning models; in return, WDs download the trained models to respond quickly to real-time events. Since data processing and computing at edge servers can be made very fast, wireless communication becomes the bottleneck for fast learning in WC-ML: training data are usually large in scale, whereas radio resources are scarce. In this case, machine learning operations and performance hinge on radio resources and channel dynamics. This calls for efficient radio resource allocation for fast learning and opens up a research paradigm that is largely uncharted so far. This motivates us to consider how wireless communication can help machine learning at the edge.
In conventional communication systems, the goal is either reliable transmission or data-rate maximization, and all data bits are treated as equally important. This simplifies the system design. However, it cannot exploit a key feature of machine learning, namely that some data are more important than others; e.g., the data near the decision boundary of a classifier are more important than those far away. Motivated by this, in
In this article, we consider an edge machine learning system as shown in Fig. 1, consisting of one access point (AP) and multiple WDs. A machine learning model is trained at the AP using the data transmitted from the WDs. We propose two simple and easily implemented importance criteria to differentiate the importance of the WDs’ data, one for centralized edge machine learning and the other for distributed edge machine learning. More radio resources are then allocated to more important data to combat noise, with the aim of accelerating convergence and improving the accuracy of machine learning models. We evaluate the proposed schemes using real datasets and show that they achieve a performance gain over the traditional schemes. Note that our goal is to propose simple and easily implementable importance criteria, under which even a simple communication design (e.g., the retransmission used in this article) can improve the performance of machine learning. More sophisticated communication designs could further improve learning performance under the proposed importance criteria, but this is beyond the scope of this article.
The remainder of this article is organized as follows. Section II and Section III detail the proposed importance criteria for the centralized and distributed edge machine learning systems, respectively. Section IV presents experimental results for the proposed schemes. Section V finally concludes this article.
II Centralized Edge Machine Learning
In this section, we consider a centralized edge machine learning system as shown in Fig. 1(a), where the machine learning model is trained at the AP using samples received from the WDs. We consider supervised learning in this paper, so each sample (from any WD) consists of the data and the corresponding label. Note that the data usually has a much larger size (e.g., millions of coefficients) than the label (e.g., a single integer). Thus, it is assumed that the WDs’ labels are correctly received at the AP via a noiseless label channel (here “noiseless” means that the channel is without noise or that the noise can be neglected), while the WDs’ data are transmitted over a noisy and fading data channel. The data channel is assumed to be block fading, so that the channel gains remain unchanged during each resource block but vary from one resource block to another. The number of resource blocks is limited to a given maximum, and each resource block is used for transmitting one training sample. The additive white Gaussian noise (AWGN) on the data channel is assumed to be an independent circularly symmetric complex Gaussian random process. Moreover, we assume that global channel state information (CSI) is available at the AP.
In machine learning, the training data are required to be correct, as wrong data lead to incorrect learning. Thus, upon receiving a sample, the AP checks whether the data matches its corresponding label under the current model. We say that the received data is correctly classified if the label predicted by the current model matches the true label, and wrongly classified otherwise. For each wrongly classified data sample, there are two possible reasons: either the data channel is too noisy for the data to be received reliably, or the current model itself is wrong. For these reasons, we define the importance criterion of centralized edge machine learning as follows: data that are wrongly classified under the current model are regarded as more important than data that are correctly classified.
The intuition behind this criterion is that, for a given training data sample, if the current model’s judgement on the data does not match its label, the current model needs to be adjusted and the data is important to this adjustment. On the contrary, if the current model’s judgement on the data matches its label, the current model needs little or no adjustment, and the data is therefore less important.
To this end, we set a higher received SNR threshold for the more important data and a lower received SNR threshold for the less important data, as specified in (1), where the two thresholds are constant system parameters. Note that the two thresholds are equal in conventional communication systems, where data bits are equally important.
According to the importance criterion defined above, we adopt the classic automatic repeat-request (ARQ) policy to allocate resource blocks. That is, after the importance of a data sample is determined, the sample is allocated a new resource block for each retransmission. With maximal ratio combining (MRC) at the AP, the retransmission is repeated until the received SNR exceeds the threshold defined in (1). The goal of retransmission is to suppress fading and noise so as to increase data reliability and thus learning performance. Because the threshold in (1) is higher for more important data, the more important data are allocated more resource blocks than the less important data. After retransmission, each data sample, together with the previously received data, is used to update the current model into a new model.
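The retransmission rule above can be illustrated with a short simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `snr_threshold`, `arq_with_mrc`, `theta_low`, and `theta_high` are hypothetical, the threshold values are arbitrary, each resource block is assumed to see an independent Rayleigh-faded channel (so its power gain is exponential with unit mean), and MRC is modeled by adding the per-block received SNRs. The count returned includes the first transmission.

```python
import random

def snr_threshold(correctly_classified, theta_low, theta_high):
    """Importance rule: wrongly classified data get the higher threshold."""
    return theta_low if correctly_classified else theta_high

def arq_with_mrc(tx_snr, threshold, max_blocks, rng):
    """Send one sample repeatedly until the combined SNR exceeds the threshold.

    Assumption: i.i.d. Rayleigh block fading, so the power gain of each
    resource block is exponential with unit mean. MRC adds the per-block SNRs.
    Returns (blocks_used, combined_snr).
    """
    combined_snr = 0.0
    blocks_used = 0
    while blocks_used < max_blocks and combined_snr <= threshold:
        gain = rng.expovariate(1.0)      # |h|^2 of the current resource block
        combined_snr += tx_snr * gain    # MRC: SNRs of all copies add up
        blocks_used += 1
    return blocks_used, combined_snr

rng = random.Random(0)
tx_snr = 10 ** (4 / 10)                  # 4 dB in linear scale
for correct in (True, False):
    theta = snr_threshold(correct, theta_low=2.0, theta_high=8.0)
    blocks, snr = arq_with_mrc(tx_snr, theta, max_blocks=50, rng=rng)
    print(correct, blocks, round(snr, 2))
```

For a given sequence of channel gains, a higher threshold can never need fewer resource blocks than a lower one, which is how the more important data end up with more resources.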
Finally, the training procedure for centralized edge machine learning is given in Algorithm 1, which consists of three steps: data judgement, data retransmission, and model update. The training procedure ends when the model converges or when the total resource blocks are exhausted in the retransmission step.
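The three steps can be sketched as a training loop. This is an illustrative outline rather than a reproduction of Algorithm 1: the function and parameter names are hypothetical, the data are assumed real-valued with an effective noise variance of 1/SNR after combining, the classifier is supplied through generic `predict` and `update` callables (a toy one-dimensional threshold classifier stands in for the SVM), and the convergence check is omitted so that the loop stops only when the samples or the resource blocks run out.

```python
import random

def receive(x, snr, rng):
    """Assumed channel model: real-valued data, effective noise variance 1/snr."""
    sigma = (1.0 / max(snr, 1e-12)) ** 0.5
    return [v + rng.gauss(0.0, sigma) for v in x]

def train_centralized(samples, model, predict, update, budget,
                      theta_low, theta_high, tx_snr, rng):
    received, used = [], 0
    for x, label in samples:
        if used >= budget:                       # resource blocks exhausted
            break
        snr = tx_snr * rng.expovariate(1.0)      # first transmission
        used += 1
        # Step 1: data judgement under the current model
        correct = predict(model, receive(x, snr, rng)) == label
        threshold = theta_low if correct else theta_high
        # Step 2: retransmission, combined by MRC (per-block SNRs add up)
        while snr <= threshold and used < budget:
            snr += tx_snr * rng.expovariate(1.0)
            used += 1
        # Step 3: model update with all data received so far
        received.append((receive(x, snr, rng), label))
        model = update(model, received)
    return model, used

# Toy stand-in for the classifier: a decision threshold on one feature.
def predict(threshold_model, x):
    return 1 if x[0] > threshold_model else 0

def update(threshold_model, data):
    pos = [x[0] for x, y in data if y == 1]
    neg = [x[0] for x, y in data if y == 0]
    if not pos or not neg:
        return threshold_model
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

samples = [([1.0], 1), ([-1.0], 0)] * 50
model, used = train_centralized(samples, 0.5, predict, update, budget=200,
                                theta_low=1.0, theta_high=4.0, tx_snr=2.0,
                                rng=random.Random(1))
print(model, used)
```

Passing the classifier in as callables keeps the resource-allocation logic separate from the learning algorithm, so the same loop could be reused with a different model.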
III Distributed Edge Machine Learning
We now study the distributed edge machine learning case shown in Fig. 1(b). Unlike centralized edge machine learning, where the learning task resides at the AP while the WDs transmit raw data, distributed edge machine learning distributes the learning process over the WDs: the WDs transmit their individual local models, each trained on its local dataset, and the AP aggregates the local models into the global model.
Each WD holds a local dataset of a certain size. In the distributed edge machine learning framework, each WD transmits its trained local model to the AP instead of raw data and, at the AP side, the global model is obtained by aggregating or averaging the local models, as expressed in (2). In this traditional distributed learning framework, the local models have equal importance.
Note that the local models also experience independent fading and noise when they are transmitted to the AP. On the other hand, a local model trained on a larger dataset in general has higher accuracy. Based on this, we distinguish the importance of the local models and define the importance criterion of distributed edge machine learning as follows: a local model trained on a larger dataset is regarded as more important.
Therefore, following the idea of allocating more resource blocks to more important data, we adopt the widely used proportional allocation of resource blocks to the WDs. Specifically, as given in (3), each WD is allocated a number of resource blocks for retransmission of its model data in proportion to its dataset size: the total number of resource blocks is multiplied by the ratio of the WD’s dataset size to the sum of all WDs’ dataset sizes, and the floor operator is applied to the result. The data-importance aware aggregation in (4) then aggregates all received retransmissions of all local models into the global model.
In other words, each local model is retransmitted as many times as the number of resource blocks allocated to it. After the retransmission of each local model, all received copies of the local models are aggregated as in (4) to obtain the global model. Compared with the traditional aggregation with equal importance in (2), retransmission here has two effects: first, as in the centralized case, more resources are used to suppress fading and noise for the more important local models, which increases data reliability; second, the more important local models take a larger proportion in the aggregation. Both effects ultimately improve the performance of machine learning algorithms.
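The proportional allocation and the aggregation over retransmitted copies can be sketched in a few lines. The function names `allocate_blocks` and `aggregate_copies` are hypothetical, local models are represented as plain lists of floats, and the aggregation is assumed to be an average over all received copies, which is one reading of the description above rather than a confirmed reproduction of (3) and (4).

```python
def allocate_blocks(dataset_sizes, total_blocks):
    """Floor of total_blocks * D_k / sum(D) for each WD (integer arithmetic)."""
    total = sum(dataset_sizes)
    return [(total_blocks * d) // total for d in dataset_sizes]

def aggregate_copies(received_copies):
    """Average all received copies of all local models.

    received_copies[k] is the list of (noisy) copies of WD k's model vector,
    one per resource block allocated to WD k. A WD that was allocated zero
    blocks contributes nothing; at least one copy must exist overall.
    """
    all_copies = [copy for copies in received_copies for copy in copies]
    n = len(all_copies)
    return [sum(values) / n for values in zip(*all_copies)]

blocks = allocate_blocks([100, 300], total_blocks=8)
print(blocks)  # [2, 6]

# Noise-free copies, to isolate the weighting effect of retransmission.
copies = [[[1.0, 1.0]] * blocks[0], [[3.0, 3.0]] * blocks[1]]
print(aggregate_copies(copies))  # [2.5, 2.5]
```

With one copy per WD, the same function reduces to a plain average of the local models, so the extra copies are what shift the global model toward the WDs with larger datasets.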
IV Experimental Results
In this section, we evaluate the proposed data-importance criteria via simulations. The wireless channels are assumed to experience independent and identically distributed (i.i.d.) Rayleigh fading. A resource block corresponds to a time slot. For a fair comparison with [5], the AP adopts a support vector machine (SVM) as the machine learning algorithm for centralized edge machine learning, and all the WDs also use the SVM model for distributed edge machine learning. The SVM model uses a linear kernel with the default parameters.
We use the well-known Modified National Institute of Standards and Technology (MNIST) dataset of handwritten digits to train the SVM classifier. The dataset consists of two parts, a training set of 60,000 samples and a test set of 10,000 samples, and each set comprises data and labels. Each data sample in MNIST is a grayscale image of 28×28 pixels, so its dimension is 784, corresponding to the 784 columns of the data matrix, each row of which is one image. The images depict the handwritten digits 0 to 9, and these ten categories correspond to the 10 columns of the label matrix, each row of which labels the image in the same row of the data matrix. In every row, only the column of the category to which the image belongs is marked as 1, and the others are marked as 0. After the classifier is trained on the training set, the test set is used to evaluate its accuracy, in a way similar to the data judgement step in Algorithm 1. Specifically, the trained classifier judges each data sample in the test set: if the result matches its label, the judgement is correct, and it is wrong otherwise. The final accuracy of the classifier is averaged over all data in the test set.
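The label layout and the accuracy computation described above can be made concrete with a small sketch. The helper names `one_hot` and `accuracy` are hypothetical, and the predictions here are made-up class indices rather than the output of a trained SVM.

```python
def one_hot(label, num_classes=10):
    """One row of the label matrix: 1 in the column of the digit, 0 elsewhere."""
    row = [0] * num_classes
    row[label] = 1
    return row

def accuracy(predictions, label_rows):
    """Fraction of samples whose predicted class matches the one-hot label."""
    correct = sum(1 for p, row in zip(predictions, label_rows) if row[p] == 1)
    return correct / len(label_rows)

labels = [one_hot(3), one_hot(2)]
print(labels[0])                 # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(accuracy([3, 1], labels))  # 0.5
```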
IV-A Centralized Edge Machine Learning
We consider two benchmarks in the centralized edge machine learning case: in this subsection, Benchmark 1 denotes the scheme of [5] and Benchmark 2 denotes the traditional scheme with equal importance. To see the best performance of the three schemes, we exhaustively search the parameters, i.e., the received SNR thresholds (a single threshold for the traditional scheme and two thresholds for the proposed scheme) and the data-alignment probability in [5], and then choose the parameters achieving the best performance for each scheme for a fair comparison. It is worth noting that, in practical communication systems, the SNR thresholds are preset values that depend on the specific application or scenario and do not need to be searched when running the algorithms. In other words, the SNR thresholds are preset system parameters rather than optimization or search variables.
In Fig. 2, we study the accuracy of the trained model versus the number of resource blocks, where the transmitted SNR is fixed at 4 dB for all schemes. We observe that the performance of all schemes improves as the number of resource blocks increases. This is expected, since more available resource blocks help suppress noise and thus improve the quality of the received training data. It is also observed that the proposed scheme is superior to the two benchmark schemes, and that Benchmark 1 is better than Benchmark 2 when the number of resource blocks is small but worse when it is large.
The accuracy of the trained model versus the transmitted SNR is investigated in Fig. 3, where the number of resource blocks is fixed. We observe that increasing the transmitted SNR of the WDs also improves the quality of the training data and thus the learning performance. The figure also demonstrates that the proposed scheme has better learning performance than the two benchmark schemes. Moreover, the proposed scheme and Benchmark 1 achieve about the same performance in the high-SNR regime, e.g., above about 7 dB in our simulation.
We study the accuracy of the trained model versus the SNR thresholds in Fig. 4, where the number of resource blocks is fixed. For the proposed scheme, we fix one of its two SNR thresholds and vary the other in the same way as the single threshold of the traditional scheme with equal importance (i.e., Benchmark 2). We observe that, as the varied threshold increases, the accuracy of all three curves first increases and then decreases. This means that there always exists an optimal SNR threshold achieving the best performance for each scheme. Moreover, it shows that the proposed scheme can always outperform the traditional equal-importance scheme when proper SNR thresholds are selected.
IV-B Distributed Edge Machine Learning
We also consider two benchmark schemes for distributed edge machine learning. The first is the equal-importance scheme, in which each WD is allocated an equal share of the resource blocks, i.e., the total number of resource blocks divided by the number of WDs. In the second, only the WD with the largest dataset is chosen to upload its local model. Here the total number of resource blocks is fixed, and the SNR is fixed at 20 dB.
We first consider the impact of the dataset sizes of two WDs on the accuracy of the trained model, as shown in Fig. 5, where the whole dataset is split into two disjoint sub-datasets, one for each WD. First, we observe that the proposed scheme is superior to the two benchmark schemes over all ratios of the two dataset sizes. In addition, the accuracy of the proposed scheme decreases only slightly as the dataset sizes become imbalanced, which indicates that the proposed scheme is the most robust against imbalance of dataset sizes. The equal-importance scheme becomes worse as the dataset sizes become more imbalanced. Interestingly, this reveals that, in the traditional scheme, equal treatment of all local models lowers the accuracy of the global model. By contrast, the accuracy of the scheme selecting the largest dataset increases with data imbalance, and it achieves about the same performance as the proposed scheme when one WD has all the data.
Then we study the impact of the number of WDs on the accuracy of the global model in Fig. 6, where the whole dataset is randomly split into disjoint sub-datasets, one for each WD. The superiority of the proposed scheme over the two benchmark schemes is again validated. For the proposed scheme, the accuracy of the global model improves as the number of participating WDs increases, even though the total amount of data used is fixed because the whole dataset is given. This is because the dataset sizes of the WDs become more balanced as the number of WDs grows, since the dataset sizes are randomly generated from the given whole dataset.
V Concluding Remarks
In this article, we proposed two new data-importance criteria for mobile data transmission in centralized and distributed edge machine learning, respectively. The idea is to differentiate the importance of mobile data according to their effects on machine learning and to allocate radio resources accordingly, so as to improve data quality and thus machine learning performance. We showed that, under the proposed data-importance criteria, simple radio resource allocation schemes can efficiently improve the performance of machine learning.
References
[1] Artificial neural networks-based machine learning for wireless networks: a tutorial. IEEE Commun. Surveys Tuts. 21 (4), pp. 3039–3071.
[2] Machine learning for wireless communications in the internet of things: a comprehensive survey. arXiv: https://arxiv.org/pdf/1901.07947.pdf
[3] Machine learning paradigms for next-generation wireless networks. IEEE Wireless Commun. 24 (2), pp. 98–105, Apr. 2017.
[4] Multiuser computation offloading and downloading for edge computing with virtualization. IEEE Trans. Wireless Commun. 18 (9), pp. 4298–4311, Sep. 2019.
[5] Wireless data acquisition for edge learning: importance-aware retransmission. In IEEE SPAWC, Jul. 2019.
[6] Price-based distributed offloading for mobile-edge computing with computation capacity constraints. IEEE Wireless Commun. Lett. 7 (3), pp. 420–423, Jun. 2018.
[7] A survey on mobile edge computing: the communication perspective. IEEE Commun. Surveys Tuts. 19 (4), pp. 2322–2358, Fourth Quarter 2017.
[8] Wireless network intelligence at the edge. arXiv: https://arxiv.org/abs/1812.02858
[9] Deep learning in physical layer communications. IEEE Wireless Commun. 26 (2), pp. 93–99, Apr. 2019.
[10] Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (1), pp. 1–114, 2012.
[11] Edge computing: vision and challenges. IEEE Internet of Things Journal 3 (5), pp. 637–646.
[12] A very brief introduction to machine learning with applications to communication systems. IEEE Trans. Cognitive Commun. Netw. 4 (4), pp. 648–664, Dec. 2018.
[13] 6G: the next frontier: from holographic messaging to artificial intelligence using subterahertz and visible light communication. IEEE Veh. Technol. Mag. 14 (3), pp. 42–50, Sep. 2019.
[14] Thirty years of machine learning: the road to pareto-optimal next-generation wireless networks. Available: https://arxiv.org/pdf/1902.01946.pdf
[15] Deep learning in mobile and wireless networking: a survey. IEEE Commun. Surveys Tuts. 21 (3), pp. 2224–2287, Mar. 2019.