I Introduction
While smartphones have become indispensable platforms supporting our daily lives, billions of Internet-of-Things (IoT) devices are expected to be deployed in the near future to automate the operations of our societies. With the prevalence of such devices on the network edge, namely edge devices, people envision an incoming new world of ubiquitous computing and ambient intelligence. This vision motivates Internet companies and wireless operators to develop technologies for deploying machine learning on the (network) edge, giving rise to a new platform for supporting intelligent applications, called edge learning (see e.g., [1, 2, 3]). This trend aims at leveraging the enormous real-time data generated by billions of edge devices to train AI models. In return, downloading the learnt intelligence onto the devices will enable them to respond to real-time events with human-like capabilities. As a paradigm shift in computing, the impact of edge learning is not limited to individual convenience and productivity; more importantly, it can lead to breakthroughs in critical areas such as healthcare and disaster avoidance. In practice, a network virtualization architecture has been standardized by 3GPP, laying a platform for implementing edge computing and learning [3]. From the research perspective, edge learning crosses two disciplines, wireless communication and machine learning, and thus brings many new research challenges. Tackling such challenges by cross-disciplinary design defines the theme of this paper.
As data-processing speeds increase rapidly, wireless acquisition of high-dimensional training data from many edge devices has emerged as a bottleneck for fast edge learning (see e.g., [4]). The issue can be exacerbated in high-mobility applications (e.g., drone-mounted sensors). This calls for highly efficient resource allocation and multi-access methods tailored to wireless data acquisition. Conventional techniques have been developed based on the principle of decoupled communication and computing, and thus are incapable of exploiting the features of learning. In particular, data bits (or symbols) are assumed to be of equal importance, which simplifies the design criterion to rate maximization. In contrast, for learning, the importance distribution in a training dataset is non-uniform, namely, some samples are more important than others. For instance, in classification problems, samples near decision boundaries are more critical for training a classifier than those far away [5]. This fact motivates the new design principle of importance-aware resource allocation for efficient wireless data acquisition in edge learning.

In this work, we apply this principle to redesign the classic technique of automatic repeat-request (ARQ) to improve the efficiency of edge learning. The classic design targets only reliability: it essentially repeats the transmission of a data packet until the packet is reliably received. Edge learning, on the other hand, gives rise to two unique new design issues. The first is the mentioned non-uniform importance distribution over data samples. The second is the quality-vs-quantity tradeoff. To be specific, in each communication round, an edge server needs to make a binary decision on whether to improve the quality of a data sample by retransmission or to acquire a new sample. Though both the quality and the quantity of the training dataset are important for accurate learning, they need to be balanced given a limited transmission budget. To optimize the tradeoff, we propose a scheme called importance ARQ that adapts the retransmission decision to both data importance and reliability.
We consider an edge learning system where a classifier is trained at the edge based on a support vector machine (SVM), with data collected from distributed edge devices. The acquisition of high-dimensional data samples is bandwidth-consuming and relies on a high-rate noisy data channel. On the other hand, a dedicated low-rate noiseless channel is allocated for accurate delivery of the small-sized labels. A mismatch between the correctly received labels and the noisy data samples at the edge may lead to an incorrectly learnt model that fails to generalize to future data. To tackle the issue, retransmission with coherent combining is used to enhance data quality. The retransmission decision at the edge server is controlled by the proposed importance ARQ, which optimizes the said quality-vs-quantity tradeoff so as to efficiently utilize the radio resource. The main contributions of this work are summarized as follows.

We identify an important design problem in wireless data acquisition for edge learning: how to intelligently control the retransmission of data samples by exploiting the differences in their importance levels for learning. This problem reveals a unique quality-vs-quantity tradeoff in wireless data acquisition.

To solve the above problem, we propose a solution called importance ARQ. The scheme selectively retransmits a data sample based on its underlying importance for training a classifier model, which is estimated using the real-time model under training. The key technical contribution of the design is an elegant communication-learning relation between the two corresponding metrics, i.e., signal-to-noise ratio (SNR) and data importance, given a desired learning accuracy. This relation facilitates the design of a simple threshold-based policy for making retransmission decisions.
We evaluate the performance of the proposed importance ARQ via extensive experiments using a real dataset. The results demonstrate that the proposed method avoids the learning-performance degradation caused by channel noise while achieving faster convergence than conventional SNR-based ARQ.
II Related Work
II-1 Data Acquisition in Machine Learning
Data acquisition is an important area in machine learning. Two topics relevant to our study are active learning [5] and noisy-label learning [6], which are briefly introduced as follows.
Targeting the scenario in which unlabeled data are abundant but manual labeling is expensive, active learning aims to selectively label the informative data (by querying an oracle), such that a model can be quickly trained using only a few carefully selected data samples, thus reducing the cost of manual labeling. Roughly speaking, the informativeness of a sample is related to how uncertain the prediction for this sample is under the current model, i.e., the more uncertain the prediction, the more informative the sample is for model learning. One commonly used uncertainty measure is entropy, an information-theoretic concept [7]. A classic data selection scheme based on the entropy measure is uncertainty sampling, which was first proposed in [8]. Subsequent work [9]
found that purely sampling the data with the highest uncertainty may lead to undesired queries of outliers that fail to represent the data space. To cope with the issue, the authors proposed a so-called information-density sampling scheme that accounts for both the uncertainty and the representativeness of the data. Other popular data selection criteria, such as expected model change
[10] and expected error reduction [11], all take into account the entire data space and thus naturally avoid the outlier-sampling issue.

In contrast to active learning, noisy-label learning considers a completely different scenario where the acquisition of a data sample is expensive while attaining a label for the sample can be cheap through interaction with non-expert labellers. As the labellers are not professionals, the provided labels can be misleading. To avoid the performance degradation incurred by noisy labels, extensive research efforts have been dedicated to developing robust learning mechanisms. The existing solutions can be grouped into three directions: 1) request multiple labels from different labellers and use majority voting to control the label quality [12]; 2) reweight the importance of noisily labeled data in the learning algorithm [13]; and 3) adopt a more robust loss function for model training by exploiting statistical learning theory [14]. Note that the conventional data-acquisition problems focus only on the machine learning perspective and fail to consider potential noise corruption of the acquired data; they are thus fundamentally distinct from the wireless data-acquisition problem in edge learning. Nevertheless, the spirit of accounting for data importance in the acquisition process inspires us to develop an importance-aware mechanism to cope with the noisy-data issue, as elaborated in the sequel.

II-2 ARQ in Wireless Communication
ARQ is a classic topic in wireless communication and has been extensively studied [15, 16, 17, 18, 19, 20]. In terms of protocols, there are three basic schemes: stop-and-wait ARQ [15], Go-Back-N ARQ [16], and selective-repeat ARQ [17]. The three schemes differ in how transmission proceeds before and after receiving an acknowledgement (ACK), and thus achieve different reliability-latency tradeoffs. To avoid excessive network delay caused by frequent retransmission, advanced ARQ protocols often incorporate forward-error-correction coding such as block codes [18] and fountain codes [20], leading to an active research area called hybrid ARQ. Besides, ARQ can be further strengthened by applying diversity-combining techniques such as maximum-ratio combining (MRC) [21] or selective combining [22]. This allows early termination once decoding is made possible by coherently combining several noisy observations. Another vein of research focuses on redesigning ARQ protocols for more sophisticated communication systems such as multiple-input multiple-output (MIMO) [23] and energy-harvesting (EH) systems [24]. While existing designs of ARQ purely target reliable communication, retransmission-based wireless data acquisition in edge learning aims to optimize the learning performance. This renders existing ARQ designs inapplicable and introduces new challenges of cross-discipline design to be addressed in the current work.
III Communication and Learning Models
III-A Communication System Model
We consider an edge learning system as shown in Fig. 1, comprising a single edge server and multiple edge devices, each equipped with a single antenna. A classifier model at the server is trained using a labelled dataset distributed over the devices, denoted as $\{(\mathbf{x}_\ell, c_\ell)\}$ with $\mathbf{x}_\ell$ representing the $\ell$-th data sample and $c_\ell$ its label. To this end, the edge devices share the wireless channel in a time-division manner and take turns transmitting local data to the server. Note that a label has a much smaller size than a data sample (e.g., an integer versus a vector of a million real coefficients). Thus, two separate channels are planned: a low-rate label channel and a high-rate data channel. The former is assumed noiseless for simplicity. Reliable uploading of data samples over the noisy and fading channel is the bottleneck of wireless data acquisition and the focus of this work. Time is slotted into symbol durations, called slots. Transmission of one data sample requires a group of consecutive slots, called a symbol block.
At the beginning of each symbol block, the edge server makes a binary decision: either randomly select a device to acquire a new sample, or request the previously scheduled device to retransmit so as to improve sample quality. Retransmission is controlled by stop-and-wait ARQ. A positive or negative ACK is sent to the target device based on whether the currently received sample at the server satisfies a predefined quality requirement, as elaborated in the sequel. Each edge device is assumed to have backlogged data. Upon receiving a request from the server, a device either transmits a new sample randomly picked from its buffer or retransmits the previous sample.
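The acquisition loop described above can be sketched as follows; the helper names (`acquire_samples`, `needs_retx`) are ours, and the decision callback stands in for the quality test elaborated later:

```python
def acquire_samples(budget, needs_retx):
    """Server-side stop-and-wait acquisition loop (illustrative sketch).

    budget      -- total number of symbol blocks available.
    needs_retx  -- callback deciding, from the copies received so far,
                   whether a NACK (retransmission request) is issued.
    Returns a list of accepted samples, each a list of received copies.
    """
    acquired, current = [], []
    for block in range(budget):
        current.append(block)        # one more noisy copy of the sample
        if needs_retx(current):
            continue                 # NACK: same device retransmits
        acquired.append(current)     # ACK: sample accepted
        current = []                 # a new device is scheduled next
    return acquired
```

For example, a rule that always demands exactly one retransmission turns a budget of 6 blocks into 3 accepted samples of 2 copies each.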
The noisy data channel is assumed to follow block fading, where the channel coefficient remains static within a symbol block and is independent and identically distributed (i.i.d.) over different blocks. During the $k$-th symbol block, the device sends the data using linear analog modulation, yielding the received signal given by
$$\mathbf{y}^{(k)} = \sqrt{P}\, h^{(k)}\, \mathbf{x} + \mathbf{z}^{(k)} \qquad (1)$$

where $P$ is the transmit power constraint for a symbol block, the fading coefficient $h^{(k)}$ is a complex random variable (r.v.) assumed to have a unit variance, i.e., $\mathbb{E}[|h^{(k)}|^2] = 1$, without loss of generality, and $\mathbf{z}^{(k)}$ is the additive white Gaussian noise (AWGN) vector with entries following i.i.d. $\mathcal{CN}(0, \sigma^2)$ distributions. Analog uncoded transmission is assumed here to allow fast data transmission [25] and for a higher energy efficiency (compared with the digital counterpart), as pointed out in [26]. We assume that perfect channel state information (CSI) on $h^{(k)}$ is available at the server. This allows the server to compute the instantaneous SNR of the received data and make the retransmission decision.

Retransmission Combining: To exploit the time-diversity gain provided by multiple independent noisy observations of the same data sample from retransmissions, ARQ and the MRC combining technique are applied jointly to coherently combine all the observations for maximizing the receive SNR. To be specific, consider a data sample retransmitted $K$ times. All received copies, say from symbol block $m$ to $m+K$, can be combined by MRC to acquire the received sample, denoted as $\hat{\mathbf{x}}$, as follows:
$$\hat{\mathbf{x}} = \Re\left\{ \frac{\sum_{k=m}^{m+K} \left(h^{(k)}\right)^{*} \mathbf{y}^{(k)}}{\sqrt{P} \sum_{k=m}^{m+K} \left|h^{(k)}\right|^{2}} \right\} \qquad (2)$$

where $\mathbf{y}^{(k)}$ is given in (1). In (2), we extract the real part of the combined signal for further processing, since the data for machine learning are real-valued in general (e.g., photos, voice clips, or video clips). As a result, the receive SNR for the sample $\hat{\mathbf{x}}$ is given as

$$\mathrm{SNR} = \frac{2P \sum_{k=m}^{m+K} \left|h^{(k)}\right|^{2}}{\sigma^{2}} \qquad (3)$$

where the coefficient $2$ on the right-hand side arises from the fact that only the noise in the real dimension, with variance $\sigma^2/2$, affects the received data. The summation in (2) represents MRC, and its value grows as the number of retransmissions, $K$, increases. The SNR expression in (3) measures the reliability of a received data sample and serves as a criterion for making the retransmission decision, as discussed in Section V.
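The combining and SNR computation can be sketched in numpy as follows; the symbol names and the exact forms used here (unit-variance Rayleigh fading, complex noise of variance `sigma2`) reflect our reading of the model, so treat this as an illustrative sketch rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
P, sigma2, d = 1.0, 0.5, 4        # transmit power, noise variance, sample dimension
x = rng.standard_normal(d)        # ground-truth real-valued data sample

def transmit(x):
    """One noisy observation: y = sqrt(P) * h * x + z (cf. (1))."""
    h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    z = np.sqrt(sigma2 / 2) * (rng.standard_normal(d) + 1j * rng.standard_normal(d))
    return h, np.sqrt(P) * h * x + z

def mrc_combine(observations):
    """Coherently combine all copies and keep the real part (cf. (2))."""
    num = sum(np.conj(h) * y for h, y in observations)
    den = np.sqrt(P) * sum(abs(h) ** 2 for h, _ in observations)
    return np.real(num / den)

def receive_snr(observations):
    """Receive SNR after combining; grows with the accumulated gains (cf. (3))."""
    return 2 * P * sum(abs(h) ** 2 for h, _ in observations) / sigma2

observations = [transmit(x) for _ in range(8)]
x_hat = mrc_combine(observations)
```

Because the channel gains accumulate, the receive SNR is monotone in the number of combined copies, which is what makes the SNR a usable retransmission criterion.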
Transmission Budget Constraint: Due to the limited radio resource or a latency requirement for data acquisition, the transmission budget for a specific learning task is restricted to $N$ symbol blocks. Therefore, the total data-acquisition duration (in symbol blocks) is constrained by

$$\sum_{\ell=1}^{L} \left( 1 + K_{\ell} \right) \le N \qquad (4)$$

where $L$ denotes the number of data samples and $K_{\ell}$ the number of retransmissions for acquiring the $\ell$-th data sample.
III-B Learning Model
For the task of edge learning, we target supervised training of a classifier model by implementing a linear support vector machine (SVM). Prior to training, we assume that the edge server has a small set of clean observed samples. This allows the construction of a coarse initial classifier, which is used for making retransmission decisions at the beginning stage. The classifier is refined progressively using newly received data samples. As shown in Fig. 1
, the SVM seeks an optimal hyperplane as the decision boundary by maximizing its margin to the data points, i.e., the minimum distance between the hyperplane and any data sample. The points lying on the margin are referred to as support vectors, which directly determine the decision boundary. Let $(\mathbf{x}_{\ell}, c_{\ell})$ denote the $\ell$-th data-label pair in the training set, with $c_{\ell} \in \{-1, +1\}$. A convex formulation of the SVM problem is given by

$$\min_{\mathbf{w},\, b}\ \frac{1}{2} \|\mathbf{w}\|^{2} \qquad (5)$$

$$\text{s.t.}\quad c_{\ell} \left( \mathbf{w}^{T} \mathbf{x}_{\ell} + b \right) \ge 1, \quad \forall \ell. \qquad (6)$$
The original SVM works only for linearly separable datasets, which is hardly the case when the dataset is corrupted by channel noise, as in the current scenario. To make the algorithm robust to the potential outliers caused by noise, a variant of SVM called soft-margin SVM is adopted. The technique is widely used in practice to classify a noisy dataset that is not linearly separable, by allowing misclassification at the cost of an additional penalty in the objective in (5). The implementation details are omitted here for brevity; interested readers are referred to the classic literature [27].
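As a concrete illustration of the soft-margin idea, the following numpy sketch minimizes the hinge-loss objective (margin violations allowed but penalized by a constant C) with plain subgradient descent; it is a didactic stand-in, not the solver used in the paper:

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via subgradient descent.

    Minimizes 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)),
    so misclassified or in-margin points add a penalty instead of
    making the problem infeasible.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # margin violators
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Larger C punishes outliers more heavily, which is exactly the mechanism through which a noise-corrupted sample can drag the decision boundary.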
Once training is completed, the learnt SVM model can be used for predicting the label of a new sample $\mathbf{x}$ by computing its output score as follows:

$$e(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b. \qquad (7)$$

Then the sign of the output score yields the prediction of the binary label.
IV Wireless Data Acquisition by Retransmission
IV-A Why is Retransmission Needed?
Given a noisy data channel and a reliable label channel, the classifier model at the edge server is trained using noisy data samples with correct labels. The channel noise can cause a data sample to cross the ground-truth decision boundary, thereby resulting in a mismatch between the sample and its label, referred to as a data-label mismatch in the rest of the paper. The issue can cause incorrect learning, as illustrated in Fig. 2. Specifically, for the noiseless-transmission case in Fig. 2(a), the new data sample helps refine the current decision boundary to approach the ground-truth one. However, for the case of noisy transmission in Fig. 2(b), noise corrupts the new sample and causes it to be an outlier falling on the opposite (wrong) side of the decision boundary. The situation is exacerbated when the SVM classifier is used, since the outlier may be selected as a support vector (or may indirectly affect the decision boundary by increasing the penalty in soft-margin SVM).
Retransmission can provide a diversity gain to suppress channel noise so as to improve data reliability and hence the learning performance. To visualize the benefit of retransmission, we demonstrate in Fig. 3 the performance of classifiers learnt at the edge, trained using noise-corrupted datasets based on the channel model in (1) under varying numbers of retransmissions. Specifically, we consider the learning task of handwritten-digit recognition using the well-known MNIST dataset, which consists of 10 categories ranging from digit "0" to "9". The noise level of the received data is reflected by their average receive SNR. We train three classifiers with different fixed numbers of retransmissions. The curves of their test accuracy are shown in Fig. 3(a), with the corresponding received-dataset distributions visualized in Fig. 3(b) using the classic t-distributed stochastic neighbor embedding (t-SNE) algorithm for projecting the images onto a two-dimensional plane. It is observed that, in the case without retransmission, after receiving a certain number (i.e., 8000) of noisy data samples, the training of the classifier fails, as reflected in the dramatic drop in test accuracy. The reason is that the strong noise effect (see Fig. 3(a)) accumulates and causes the divergence of the model (see Fig. 3(b)). As the number of retransmissions increases, the noise effect is subdued to a sufficiently low level, ensuring that the class structure of the original dataset can be resolved, leading to a converged model and a high test accuracy. The experiment demonstrates the effectiveness of retransmission in edge learning. Nevertheless, given the transmission budget constraint in (4), the retransmission scheme should be carefully designed to improve the training performance, which is the main theme of this paper.
IV-B Two Main Design Issues
As discussed in Section IV-A, retransmission can enhance the quality of the received data. This comes, however, at the cost of reduced data-acquisition efficiency due to the redundant transmissions. The objective of our design is to adapt retransmission to data importance, besides the channel, so as to efficiently utilize the limited radio resource for improving the learning accuracy. To be specific, for each received data sample (new or retransmitted), the server needs to make a binary decision on whether retransmission is needed in the next symbol block.
The decision making is challenging, as it must address the two issues mentioned in the Introduction, elaborated as follows.

Quality-vs-Quantity Tradeoff: The learning performance can be improved either by increasing the reliability (quality) of the wirelessly transmitted training dataset through more retransmissions, or by increasing its size (quantity) through acquiring more data samples. The budget constraint in (4) introduces a tradeoff between the two aspects, called the quality-vs-quantity tradeoff.

Non-uniform Data Importance: In conventional data communication, bits are implicitly assumed to have equal importance. This is not the case for training a classifier, where data samples with higher uncertainty (i.e., near the decision boundary) are more critical than those with lower uncertainty (i.e., far from the decision boundary). Accounting for the non-uniform importance of data shifts the paradigm of radio resource allocation away from conventional rate-driven wireless transmission.
These issues are addressed via the proposed importance ARQ presented in the following sections.
V Principle of Importance-Aware Retransmission
In this section, an importance-aware retransmission scheme is proposed to regulate data-sample reliability according to importance. We first recall the traditional channel-aware retransmission design from the pure communications perspective, where the retransmission decision is determined only by an SNR threshold without considering data importance. Then, we introduce the concept of uncertainty to quantify the importance of a data sample for learning. This motivates us to propose an importance ARQ scheme, which accounts for both the SNR and the data uncertainty.
V-A Traditional Channel-Aware Retransmission
In wireless communications, SNR is commonly used to measure the received data reliability. Thus, it is natural to use SNR as the criterion in retransmission policy design. The traditional channelaware retransmission scheme is as follows.
Scheme 1 (Channel-Aware ARQ).
The edge server repeatedly requests the scheduled edge device to retransmit the same data sample until a required SNR is met: $\mathrm{SNR} \ge \theta_{0}$, where $\mathrm{SNR}$ is defined in (3) and $\theta_{0}$ is a pre-specified SNR threshold.
This scheme achieves equally high reliability for all data samples. Under a budget constraint, however, it can lead to inefficient utilization of the radio resource for data acquisition, as the obtained dataset may not contain sufficient samples for accurate learning. The efficiency of channel-aware retransmission can be improved by considering non-uniform data importance, as measured by data uncertainty.
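Scheme 1 can be sketched as a loop that accumulates MRC channel gains until a fixed SNR threshold is crossed; the Rayleigh fading draw and the SNR form `2*P*gain/sigma2` are our modeling assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def channel_aware_arq(snr_threshold, P=1.0, sigma2=0.5, max_tx=1000):
    """Scheme 1 sketch: retransmit one sample until the accumulated
    MRC receive SNR passes a fixed threshold. Returns the number of
    transmissions consumed for this sample."""
    gain, k = 0.0, 0
    while k < max_tx:
        h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        gain += abs(h) ** 2               # MRC accumulates channel gains
        k += 1
        if 2 * P * gain / sigma2 >= snr_threshold:
            break
    return k
```

A higher threshold buys reliability at the price of more symbol blocks per sample, which is exactly the quantity the importance-aware design economizes on.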
V-B Data Uncertainty
In the active-learning literature, the importance of a data sample is usually measured by its uncertainty, which characterizes how significantly the sample can contribute to model learning [5]. A typical measure is entropy [7], an information-theoretic notion, defined as follows:

$$H(\mathbf{x}) = -\sum_{c} P(c \mid \mathbf{x}; \boldsymbol{\theta}) \log P(c \mid \mathbf{x}; \boldsymbol{\theta}) \qquad (8)$$

where $\boldsymbol{\theta}$ denotes the set of model parameters to be learnt and $c$ ranges over the class labels. For the considered SVM classifier, the model parameters are $(\mathbf{w}, b)$ (see (5)). In active learning, to train a model using a large unlabelled dataset under a limited labelling budget, an active learner selects a subset of "important" data samples (with high uncertainty) to label by queries and then updates the model. In the same spirit, in an edge learning system, the server prefers data samples with high uncertainty for retransmission, as they are more "important". Besides entropy, there exist several alternative measures that are easier to compute but specific to the adopted learning model. One of them, related to the current work, is introduced as follows.
Uncertainty Measure for SVM: In general, the computation of the entropy in (8) is complex, as it involves computing a posterior distribution. For this reason, targeting SVM classification, we develop a simpler uncertainty measure by exploiting the clustering structure of the dataset. The definition is motivated by the geometric intuition that a classifier makes less confident inferences on a data sample located near the decision boundary. Based on this intuition, we propose measuring the uncertainty of a data sample by the inverse of its distance to the boundary. Given a data sample $\mathbf{x}$, the said distance can be readily computed from the absolute value of the output score (defined in (7)) as follows:

$$d(\mathbf{x}) = \frac{|e(\mathbf{x})|}{\|\mathbf{w}\|}. \qquad (9)$$

Then the distance-based uncertainty measure is given by

$$U(\mathbf{x}) = \frac{1}{d(\mathbf{x})}. \qquad (10)$$

One can observe that the measure grows towards infinity as a data sample approaches the decision boundary, and decreases as the sample moves away from the boundary.
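The distance-based measure is one line of code once the current SVM parameters are available; `w` and `b` below are the hyperplane parameters of the model under training:

```python
import numpy as np

def uncertainty(x, w, b):
    """Distance-based uncertainty: inverse distance of sample x to the
    hyperplane w.x + b = 0 (cf. (9)-(10))."""
    score = float(w @ x + b)                    # output score, cf. (7)
    distance = abs(score) / np.linalg.norm(w)
    return 1.0 / distance
```

Samples close to the boundary get a large uncertainty value and, under importance ARQ, a correspondingly high reliability requirement.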
With the concept of uncertainty in mind, one can see that data samples with higher uncertainty are more prone to noise corruption. To be specific, a small perturbation caused by channel noise can push a highly uncertain data sample across the decision boundary, thus inducing the aforementioned data-label mismatch illustrated in Fig. 2. To cope with the issue, highly uncertain data samples should be protected against noise by retransmission, such that the noise-induced perturbation on the received data is suppressed. However, naive implementations of this intuitive idea may not work, or are at best intractable. For instance, consider the following retransmission scheme: the edge server requests retransmission until the data uncertainty falls below a threshold. Unfortunately, this scheme may incur endless retransmissions if the threshold is lower than the uncertainty of the ground-truth data, or if the observed uncertainty grows over retransmissions. This calls for a more intelligent use of data uncertainty in the retransmission-policy design.
V-C Importance ARQ
Motivated by the issues discussed in the preceding subsections, we propose a novel retransmission scheme, called importance ARQ, which integrates the two key metrics in wireless data acquisition, namely the SNR from wireless communication and the data uncertainty from machine learning. Essentially, the design is inspired by the following observation. No error is incurred in learning if the transmitted and received data samples lie on the same side of the (ground-truth) decision hyperplane, so that they have the same predicted labels. This suggests that the key to tackling the data-label mismatch issue is to align the received and transmitted data samples in the same half-space divided by the decision hyperplane, which is referred to as noisy data alignment.
The finding motivates us to apply the probability of such alignment as the criterion in the retransmission-policy design. However, the ground-truth decision hyperplane cannot be known. One practical solution is to utilize the current model under training as a surrogate to check the consistency between the transmitted and received versions of a data sample. The consistency (alignment) can be translated into the event that the transmitted and received data samples have matching predicted labels or, equivalently, that their output scores defined in (7) have the same sign. Consider an arbitrary transmitted data sample $\mathbf{x}$ and its received version $\hat{\mathbf{x}}$ after retransmissions, as defined in (2). The event is

$$\left\{ e(\mathbf{x})\, e(\hat{\mathbf{x}}) > 0 \right\}. \qquad (11)$$
Then the probability of noisy data alignment, called the data alignment probability, is formally defined as follows.
Definition 1 (Data-Alignment Probability).
Conditioned on the received data sample, the data-alignment probability is

$$P_{a} = \Pr\left( e(\mathbf{x})\, e(\hat{\mathbf{x}}) > 0 \mid \hat{\mathbf{x}} \right). \qquad (12)$$
To calculate the probability defined above, the distribution of the transmitted-sample score conditioned on the received data sample is needed. This knowledge can be inferred from the conditional distribution of the transmitted sample, i.e., $p(\mathbf{x} \mid \hat{\mathbf{x}})$, as derived in the following lemma.
Lemma 1 (Conditional Distribution of Transmitted Sample).
Conditioned on the received sample $\hat{\mathbf{x}}$, the distribution of the transmitted sample $\mathbf{x}$ follows a Gaussian distribution:

$$\mathbf{x} \mid \hat{\mathbf{x}} \sim \mathcal{N}\left( \hat{\mathbf{x}},\; \frac{\sigma^{2}}{2P \sum_{k=m}^{m+K} |h^{(k)}|^{2}}\, \mathbf{I} \right) = \mathcal{N}\left( \hat{\mathbf{x}},\; \mathrm{SNR}^{-1}\, \mathbf{I} \right). \qquad (13)$$
Proof: See Appendix A.
With the conditional distribution in (13), the required distribution can be readily derived using the linear relation in (7). The derivation simply involves projecting the high-dimensional Gaussian distribution onto the particular direction specified by $\mathbf{w}$, which yields a univariate Gaussian distribution, as elaborated below.
Lemma 2 (Conditional Distribution of Transmitted Sample Score).
Conditioned on the received sample $\hat{\mathbf{x}}$, the distribution of the transmitted-sample score follows a univariate Gaussian distribution, given by

$$e(\mathbf{x}) \mid \hat{\mathbf{x}} \sim \mathcal{N}\left( e(\hat{\mathbf{x}}),\; \sigma_{s}^{2} \right) \qquad (14)$$

where $\sigma_{s}^{2} = \|\mathbf{w}\|^{2} / \mathrm{SNR}$.
Based on Lemma 2, we are ready to derive the data-alignment probability. The closed-form expression is presented in the following proposition.
Proposition 1 (Data-Alignment Probability).
Conditioned on the received sample $\hat{\mathbf{x}}$, the probability of noisy data alignment is given as

$$P_{a} = \frac{1}{2}\left[ 1 + \mathrm{erf}\left( \frac{|e(\hat{\mathbf{x}})|}{\sqrt{2}\, \sigma_{s}} \right) \right] \qquad (15)$$

where $\sigma_{s}$ follows from Lemma 2 and $\mathrm{erf}(\cdot)$ is the well-known error function defined as $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} e^{-t^{2}}\, dt$.
Proof: As shown in Fig. 4, the conditional distribution of the transmitted-sample score is Gaussian, and the probability of data alignment is equal to the area shaded in grey. Mathematically, the probability can be derived using Lemma 2 as follows:

$$P_{a} = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi}\, \sigma_{s}} \exp\left( -\frac{\left( t - |e(\hat{\mathbf{x}})| \right)^{2}}{2 \sigma_{s}^{2}} \right) dt. \qquad (16)$$

The integral can be further expressed using the error function, yielding (15).
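Under the conditional-Gaussian score model of Lemma 2, the alignment probability is a single error-function evaluation; `sigma_s` denotes the conditional standard deviation of the score (an assumed input here):

```python
import math

def alignment_probability(score_hat, sigma_s):
    """P( transmitted and received scores share a sign | received sample ),
    assuming the transmitted score is Gaussian around the received score
    score_hat with standard deviation sigma_s."""
    return 0.5 * (1.0 + math.erf(abs(score_hat) / (math.sqrt(2.0) * sigma_s)))
```

The probability is exactly 1/2 for a sample sitting on the decision boundary (zero score) and approaches one as the score moves away from the boundary or the SNR grows (smaller `sigma_s`).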
Remark 1.
(How Does Retransmission Affect Noisy Data Alignment?) Retransmission contributes to increasing the probability of noisy data alignment. Specifically, retransmission affects both the mean and the variance of the conditional distribution in (14). From the mean perspective, retransmission helps align the average of the retransmitted copies with the ground truth. To be specific, the received estimate approaches the ground-truth value as the number of retransmissions grows:

$$\lim_{K \to \infty} \hat{\mathbf{x}} = \mathbf{x} \quad \text{(almost surely)}. \qquad (17)$$

From the variance perspective, retransmission continuously reduces the variance by increasing the receive SNR, or equivalently the number of retransmissions $K$. Particularly, it follows from $\sigma_{s}^{2} = \|\mathbf{w}\|^{2} / \mathrm{SNR}$ that

$$\lim_{K \to \infty} \sigma_{s}^{2} = 0. \qquad (18)$$
Combining the two aspects, one can further apply the Chernoff bound to (16) and show that

$$1 - P_{a} \le \exp\left( -c_{0} K \right) \qquad (19)$$

where $c_{0}$ is a positive constant. As a result, the probability of noisy data alignment approaches one at an exponential rate as the number of retransmissions $K \to \infty$.
The above discussion suggests that the data-alignment probability can be controlled by retransmission. This motivates us to design a threshold-based retransmission scheme in which, for each acquired data sample, the edge server calls for retransmission until the said probability exceeds a pre-specified threshold $p_{0}$. Thereby, the scheme enforces noisy data alignment in a probabilistic sense, i.e.,

$$\Pr\left( e(\mathbf{x})\, e(\hat{\mathbf{x}}) > 0 \mid \hat{\mathbf{x}} \right) \ge p_{0}. \qquad (20)$$
To facilitate implementation, the requirement on the data-alignment probability in (20) can be transformed into an equivalent requirement on the SNR by using the inverse error function.
Proposition 2 (Data Alignment Requirement).
The received data sample is aligned with the transmitted sample (ground truth) with probability of at least $p_{0}$ if the SNR exceeds the following importance-based threshold:

$$\mathrm{SNR} \ge 2 \left[ \mathrm{erf}^{-1}\left(2 p_{0} - 1\right) \right]^{2} U^{2}(\hat{\mathbf{x}}) \qquad (21)$$

where $U(\hat{\mathbf{x}})$ is the uncertainty measure given in (10) and $\mathrm{erf}^{-1}(\cdot)$ is the inverse error function. The proof simply involves the monotonicity of the error function and is omitted here for brevity.
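Evaluating an importance-based threshold of this kind requires the inverse error function, which the Python standard library lacks; a dependency-free bisection on `math.erf` suffices, since the function is monotone:

```python
import math

def erfinv(y, tol=1e-12):
    """Inverse of math.erf on [0, 1) by bisection (erf is increasing)."""
    lo, hi = 0.0, 10.0                 # erf(10) is 1.0 to double precision
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In a production setting one would use a library routine (e.g., SciPy's `erfinv`) instead, but the bisection keeps the sketch self-contained.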
Based on the result, importance ARQ is proposed as follows.
Scheme 2 (Importance ARQ).
In importance ARQ, the edge server repeatedly requests the scheduled edge device to retransmit the same data sample until

$$\mathrm{SNR} \ge \min\left\{ 2 \left[ \mathrm{erf}^{-1}\left(2 p_{0} - 1\right) \right]^{2} U^{2}(\hat{\mathbf{x}}),\; \theta_{0} \right\} \qquad (22)$$

where $\theta_{0}$ is an additional fixed threshold (the same as that in Scheme 1) set to avoid excessive retransmissions for extremely uncertain samples.
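The resulting decision rule reduces to one comparison per received sample. In the sketch below, `alpha` stands for the constant determined by the target alignment probability, the quadratic dependence on the uncertainty `u` is an assumed form, and the fixed cap `snr_max` plays the role of the ceiling in (22):

```python
def needs_retransmission(snr, u, alpha, snr_max):
    """Importance ARQ decision sketch: request retransmission while the
    receive SNR is below an uncertainty-scaled threshold, capped at
    snr_max so that extremely uncertain samples cannot trigger
    endless retries."""
    return snr < min(alpha * u ** 2, snr_max)
```

Low-uncertainty samples thus pass with few (or no) retransmissions, freeing budget for the samples that actually shape the decision boundary.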
Remark 2 (Importance-Aware SNR Thresholding).
The proposed importance ARQ can be viewed as an importance-aware adaptive SNR-thresholding scheme. As observed from (22), the SNR threshold grows with the distance-based uncertainty of the data sample. The result is aligned with the intuition that a data sample of higher uncertainty should be more reliably received to ensure alignment between the transmitted and received samples. To better understand this result, a graphical illustration is provided in Fig. 4. For a pre-specified $p_{0}$, a highly uncertain sample near the decision hyperplane requires a slim distribution with small variance (corresponding to a high receive SNR) to make the data-alignment probability (the area shaded in grey) exceed $p_{0}$ (see Fig. 4(b)). On the other hand, for a less uncertain data sample, the requirement can be easily satisfied with a relatively flat distribution with large variance and low receive SNR (see Fig. 4(a)).
Remark 3 (Comparison with Conventional Retransmission Schemes).
The proposed importance ARQ has several advantages over conventional schemes. Compared with the conventional channel-aware ARQ in Scheme 1, importance ARQ makes more efficient use of the transmission budget by differentiating data samples by their uncertainty and acquiring them with proportional reliabilities, thereby improving the aforementioned quality-vs-quantity tradeoff. Furthermore, importance ARQ avoids practical issues, such as the excessive retransmissions arising in some naively designed schemes, e.g., the example mentioned in Section V-B. In particular, the data-alignment criterion in (20) can always be met by a finite number of retransmissions (see Remark 1). The performance gain of importance ARQ is evaluated by simulations in Section VII.
VI Implementation for Multi-Class Classification
In Section V, we presented the principle of importance ARQ assuming binary classification. In this section, the principle is generalized to multi-class classification. Note that a multi-class SVM classifier can be trained using the so-called one-versus-one implementation [28], which trains one binary SVM classifier for each pair of classes. As a result, for each input data sample, the multi-class SVM outputs a vector recording the output scores, as defined in (7), from the component binary SVMs. To map the output to one of the class indexes, a so-called reference coding matrix is built, where each row gives the “reference output pattern” corresponding to the associated class. An example of the reference coding matrix with four classes is provided as follows (a “learner” refers to a single binary SVM classifier):
With the coding matrix at hand, predicting the class index of an input sample involves simply comparing the Hamming distances between the output vector and the different rows of the coding matrix, and choosing the row index with the smallest distance as the predicted class index. In particular, the Hamming distance between the output vector and a given row of the coding matrix is defined by
(23) 
where the sign function takes a value from {−1, 0, +1} corresponding to a negative, zero, and positive argument, respectively. One can observe from the definition of the Hamming distance that not all the learners’ output scores have an active impact on predicting a particular class: a learner whose entry in the corresponding row of the coding matrix is zero is assigned a zero weight in computing the Hamming distance, and thus has no impact on that class. In other words, only the learners with nonzero entries in the row are active when the corresponding class is predicted.
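As an illustration of the one-versus-one decoding described above, the sketch below builds a reference coding matrix for four classes and predicts by a weighted Hamming distance. The pair ordering, the function names, and the exact distance expression are assumptions for illustration; the entry magnitudes `|M|` implement the zero weight assigned to inactive learners.

```python
import itertools
import numpy as np

def ovo_coding_matrix(num_classes):
    """Reference coding matrix for one-versus-one decoding: one column per
    pairwise learner, with entries in {+1, -1, 0}."""
    pairs = list(itertools.combinations(range(num_classes), 2))
    M = np.zeros((num_classes, len(pairs)), dtype=int)
    for j, (a, b) in enumerate(pairs):
        M[a, j] = +1   # learner j outputs a positive score for class a
        M[b, j] = -1   # and a negative score for class b
    return M

def predict_class(scores, M):
    """Pick the class whose reference row is closest, in the weighted
    Hamming sense, to the signs of the learner output scores."""
    s = np.sign(scores)
    # Per-(class, learner) disagreement cost; |M| zeroes out inactive learners.
    dist = np.sum(np.abs(M) * (1 - s[None, :] * M) / 2, axis=1)
    return int(np.argmin(dist))
```

For four classes this yields six learners, and a score vector whose signs match a row of the matrix on all active learners gives that row a distance of zero.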
Having obtained the predicted label, all the active learners determining the current predicted label should satisfy the requirement on the data alignment probability predefined in (20). Consequently, the single-threshold policy for importance ARQ defined in (22) can be extended to the multi-threshold policy defined below:
(24) 
VII Experimental Results
VII-A Experiment Setup
For the channel model, we assume the classic Rayleigh fading channel with channel coefficients following an i.i.d. complex Gaussian distribution. The average transmit SNR is by default set as dB.
For the learning model, binary and multi-class SVM models are trained on the MNIST dataset [29], which includes a training set of 60,000 samples and a test set of 10,000 samples. Each sample is a grey-valued image of 28 × 28 pixels, giving the sample dimension of 784. The training set used in the experiment is partitioned as follows. At the edge server, the initial small collection of clean observations is constructed by randomly selecting samples for each class. The remaining training data are assumed randomly assigned to different edge devices. Note that with random data distribution and scheduling, the number of devices has no effect on the learning performance. The learning accuracy is obtained by testing the learnt model on the entire test set. For binary classification, we choose the relatively less differentiable class pair of “3” and “5” (according to t-SNE visualization) for SVM model training. The maximum transmission budget is set to be 4000 and 40,000 for binary and multi-class classification, respectively. All results are averaged over 200 (for the binary-class case) and 20 (for the multi-class case) experiments.
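For reproducing the channel side of this setup, the following sketch draws per-transmission receive SNRs under the stated Rayleigh fading model; the function name and interface are hypothetical.

```python
import numpy as np

def receive_snr_samples(avg_snr_db, num_samples, rng=None):
    """Draw per-transmission receive SNRs for Rayleigh fading: the channel
    coefficient h follows CN(0, 1), so |h|^2 is exponentially distributed
    and the receive SNR equals (average SNR) * |h|^2."""
    rng = np.random.default_rng() if rng is None else rng
    avg_snr = 10 ** (avg_snr_db / 10)  # convert dB to linear scale
    h = (rng.standard_normal(num_samples)
         + 1j * rng.standard_normal(num_samples)) / np.sqrt(2)  # CN(0, 1)
    return avg_snr * np.abs(h) ** 2
```

The empirical mean of the returned samples converges to the linear-scale average SNR, which is the quantity a retransmission threshold would be compared against.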
VII-B Quality-vs-Quantity Tradeoff
To demonstrate the quality-vs-quantity tradeoff in wireless data acquisition, Fig. 5 displays the curves of learning accuracy versus transmission budget for both channel-aware ARQ and importance ARQ. In Fig. 5(a), we test the performance of channel-aware ARQ with three SNR thresholds, corresponding to low, medium, and high data-reliability requirements. Similar cases for importance ARQ are considered in Fig. 5(b), with the reliability requirements specified by the data-alignment probability. It is observed from both Fig. 5(a) and 5(b) that setting the threshold too low leads to a fast convergence rate, but at the cost of performance degradation as errors accumulate. In contrast, a threshold that is too high also leads to poor learning performance due to an insufficient number of acquired samples. This suggests that the retransmission threshold should be carefully designed to optimize the quality-vs-quantity tradeoff and thereby improve the learning performance. In the following experiments, we select thresholds based on the observations in this subsection so as to obtain good performance.
VII-C Learning Performance for Binary Classification
In Fig. 6, the learning performance of the proposed importance ARQ is compared with two baseline schemes, namely the channel-aware ARQ and the scheme without retransmission (maximum data quantity). Varying values of the average receive SNR are considered in Fig. 6(a). It is observed that, without retransmission, the performance of edge learning degrades dramatically after a sufficiently large number of noisy samples has been acquired. This is aligned with our previous observations from Fig. 3(a) and justifies the need for retransmission. Next, one can observe that importance ARQ outperforms the conventional channel-aware ARQ throughout the entire training duration. This confirms the performance gain from intelligent utilization of the radio resources for data acquisition. Furthermore, the performance gain of importance ARQ is almost the same across the SNR scenarios, demonstrating the robustness of the proposed scheme against hostile channel conditions.
In Fig. 6(b), we further investigate the underlying reason for the performance improvement of importance ARQ by plotting the distribution of the average number of retransmissions over a range of sample uncertainty (i.e., the sample's distance to the decision hyperplane). One can observe a close-to-uniform distribution for the conventional channel-aware ARQ, reflecting its independence of uncertainty. In contrast, for importance ARQ, retransmissions are concentrated in the high-uncertainty region. This is aligned with the scheme's design principle and shows its effectiveness in adapting retransmission to data importance.
VII-D Learning Performance for Multi-Class Classification
In Fig. 7(a), the learning performance of the proposed importance ARQ is compared with the two baseline schemes in the scenario of multi-class classification. Similar trends as in the binary-class scenario are observed, and importance ARQ is found to consistently outperform the benchmark schemes in this more challenging scenario. The relation between importance ARQ and the multi-cluster structure of the training dataset is illustrated in Fig. 7(b). In the left subfigure, samples in different classes in general have distinct average distances to their corresponding decision hyperplanes, and thus different uncertainties (the inverse of distance). In the right subfigure, one can observe that importance ARQ effectively adapts the average retransmission budget for each class to its uncertainty level. For example, class 5 has the shortest average distance to the hyperplane and is thus allocated the largest transmission budget to protect its receive quality. In contrast, class 1 has the longest distance and thus consumes less budget, as desired.
VIII Concluding Remarks
In this paper, we have proposed a novel retransmission scheme, importance ARQ, for wireless data acquisition in edge learning systems. It adapts retransmission to data-sample importance so as to enhance the learning performance under a limited transmission budget. The work contributes to shifting the paradigm of classic communication-centric radio resource allocation towards new designs targeting edge learning. Our initial investigation of this largely uncharted area opens up many interesting research directions, including importance-aware power and spectrum allocation, user scheduling, and congestion control.
A Proof of Lemma 1
Consequently, the transmitted sample is
where the equivalent noise after combining has its entries defined as
Since the noise entries are i.i.d., each entry of the combined noise is also i.i.d., with the distribution given as follows:
Taking the real part yields the following distribution:
which leads to the desired result in (13).
References
 [1] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Towards an intelligent edge: Wireless communication meets machine learning,” arXiv preprint arXiv:1809.00343, 2018.
 [2] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “When edge meets learning: Adaptive control for resource-constrained distributed machine learning,” in IEEE INFOCOM, 2018.
 [3] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 2017.
 [4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al., “Communication-efficient learning of deep networks from decentralized data,” in AISTATS, 2017.
 [5] B. Settles, “Active learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, 2012.
 [6] B. Frénay and M. Verleysen, “Classification in the presence of label noise: a survey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 5, pp. 845–869, 2014.
 [7] A. Holub, P. Perona, and M. C. Burl, “Entropy-based active learning for object recognition,” in IEEE CVPR Workshops’08, pp. 1–8, IEEE, 2008.
 [8] D. D. Lewis and J. Catlett, “Heterogeneous uncertainty sampling for supervised learning,” in ICML, pp. 148–156, 1994.
 [9] B. Settles and M. Craven, “An analysis of active learning strategies for sequence labeling tasks,” in EMNLP, pp. 1070–1079, Association for Computational Linguistics, 2008.
 [10] B. Settles, M. Craven, and S. Ray, “Multiple-instance active learning,” in NIPS, pp. 1289–1296, 2008.
 [11] N. Roy and A. McCallum, “Toward optimal active learning through Monte Carlo estimation of error reduction,” in ICML, pp. 441–448, 2001.
 [12] V. Sheng, J. Zhang, B. Gu, and X. Wu, “Majority voting and pairing with multiple noisy labeling,” IEEE Trans. Knowl. Data Eng. (Early Access), 2017.
 [13] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 3, pp. 447–461, 2016.
 [14] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” in NIPS, pp. 1196–1204, 2013.
 [15] M. Moeneclaey, H. Bruneel, I. Bruyland, and D.-Y. Chung, “Throughput optimization for a generalized stop-and-wait ARQ scheme,” IEEE Trans. Commun., vol. 34, no. 2, pp. 205–207, 1986.
 [16] Y.-D. Yao, “An effective go-back-N ARQ scheme for variable-error-rate channels,” IEEE Trans. Commun., vol. 43, no. 1, pp. 20–23, 1995.
 [17] E. Weldon, “An improved selective-repeat ARQ strategy,” IEEE Trans. Commun., vol. 30, no. 3, pp. 480–486, 1982.
 [18] L. Badia, M. Levorato, and M. Zorzi, “Markov analysis of selective repeat type II hybrid ARQ using block codes,” IEEE Trans. Commun., vol. 56, no. 9, 2008.
 [19] E. Visotsky, Y. Sun, V. Tripathi, M. L. Honig, and R. Peterson, “Reliability-based incremental redundancy with convolutional codes,” IEEE Trans. Commun., vol. 53, no. 6, pp. 987–997, 2005.
 [20] D. Sejdinovic, V. Ponnampalam, R. J. Piechocki, and A. Doufexi, “The throughput analysis of different IR-HARQ schemes based on fountain codes,” in IEEE WCNC’08, pp. 267–272, IEEE, 2008.
 [21] E. N. Onggosanusi, A. G. Dabak, Y. Hui, and G. Jeong, “Hybrid ARQ transmission and combining for MIMO systems,” in IEEE ICC, vol. 5, pp. 3205–3209, IEEE, 2003.
 [22] Q. Zhang and S. A. Kassam, “Hybrid ARQ with selective combining for fading channels,” IEEE J. Sel. Areas Commun., vol. 17, no. 5, pp. 867–880, 1999.
 [23] E. W. Jang, J. Lee, H.-L. Lou, and J. M. Cioffi, “On the combining schemes for MIMO systems with hybrid ARQ,” IEEE Trans. Wireless Commun., vol. 8, no. 2, pp. 836–842, 2009.
 [24] Y. Mao, J. Zhang, and K. B. Letaief, “ARQ with adaptive feedback for energy harvesting receivers,” in IEEE WCNC’16, pp. 1–6, IEEE, 2016.
 [25] T. L. Marzetta and B. M. Hochwald, “Fast transfer of channel state information in wireless systems,” IEEE Trans. Signal Process., vol. 54, no. 4, pp. 1268–1278, 2006.
 [26] S. Cui, J.-J. Xiao, A. J. Goldsmith, Z.-Q. Luo, and H. V. Poor, “Energy-efficient joint estimation in sensor networks: Analog vs. digital,” in IEEE ICASSP’05, vol. 4, pp. 745–748, 2005.
 [27] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, vol. 1. New York, NY, USA: Springer, 2001.
 [28] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin DAGs for multiclass classification,” in NIPS, pp. 547–553, 2000.
 [29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.