Wireless Data Acquisition for Edge Learning: Importance-Aware Retransmission

12/05/2018 · by Dongzhu Liu, et al. · The Hong Kong University of Science and Technology

By deploying machine learning algorithms at the network edge, edge learning has recently emerged as a promising framework for providing intelligent services. It can effectively leverage the rich data collected by abundant mobile devices and exploit proximate edge computing resources for low-latency execution. Edge learning crosses two disciplines, machine learning and wireless communication, and thereby gives rise to many new research issues. In this paper, we address a wireless data acquisition problem, which involves a retransmission decision in each communication round to optimize the data quality-vs-quantity tradeoff. A new retransmission protocol called importance-aware automatic-repeat-request (importance ARQ) is proposed. Unlike classic ARQ, which focuses merely on reliability, importance ARQ selectively retransmits a data sample based on its uncertainty, a measure of how much the sample helps learning that can be computed using the model under training. Underpinning the proposed protocol is an elegant communication-learning relation between two corresponding metrics, i.e., signal-to-noise ratio (SNR) and data uncertainty. This relation facilitates the design of a simple threshold-based policy for retransmission decisions. As demonstrated via experiments with real datasets, the proposed method avoids the learning performance degradation caused by channel noise while achieving faster convergence than conventional SNR-based ARQ.


I Introduction

While smartphones have become indispensable platforms supporting our daily lives, billions of Internet-of-Things (IoT) devices are expected to be deployed in the near future to automate the operations of our societies. With the prevalence of such devices on the network edge, namely edge devices, people envision an incoming new world of ubiquitous computing and ambient intelligence. This vision motivates Internet companies and wireless operators to develop technologies for deploying machine learning on the (network) edge, giving rise to a new platform for supporting intelligent applications, called edge learning (see e.g., [1, 2, 3]). This trend aims at leveraging the enormous real-time data generated by billions of edge devices to train AI models. In return, downloading the learnt intelligence onto the devices will enable them to respond to real-time events with human-like capabilities. As a paradigm shift in computing, the impact of edge learning is not limited to individual convenience and productivity; more importantly, it can lead to breakthroughs in critical areas such as healthcare and disaster avoidance. In practice, a network virtualization architecture has been standardized by 3GPP, laying a platform for implementing edge computing and learning [3]. From the research perspective, edge learning crosses two disciplines, wireless communication and machine learning, and thus brings many new research challenges. Tackling such challenges by cross-disciplinary design defines the theme of this paper.

As data-processing speeds increase rapidly, wireless acquisition of high-dimensional training data from many edge devices has emerged as a bottleneck for fast edge learning (see e.g., [4]). The issue can be exacerbated in high-mobility applications (e.g., drone-mounted sensors). This calls for highly efficient resource allocation and multi-access methods tailored to wireless data acquisition. Conventional techniques have been developed based on the principle of decoupled communication and computing, and thus are incapable of exploiting the features of learning. In particular, data bits (or symbols) are assumed to be of equal importance, which simplifies the design criterion to rate maximization. In contrast, for learning, the importance distribution in a training dataset is non-uniform, namely, some samples are more important than others. For instance, for classification problems, samples near decision boundaries are more critical for training a classifier than those far away [5]. This fact motivates the new design principle of importance-aware resource allocation for efficient wireless data acquisition in edge learning.

In this work, we apply this principle to redesign the classic technique of automatic repeat-request (ARQ) to improve the efficiency of edge learning. The classic design targets only reliability; it essentially repeats the transmission of a data packet until it is reliably received. Edge learning, on the other hand, gives rise to two new design issues. The first is the mentioned non-uniform importance distribution over data samples. The second is the quality-vs-quantity tradeoff. To be specific, in each communication round, an edge server needs to make a binary decision on whether to improve the quality of a data sample by retransmission or to acquire a new sample. Though both the quality and the quantity of the training dataset are important for accurate learning, they need to be balanced given a limited transmission budget. To optimize the tradeoff, we propose a scheme called importance ARQ that adapts the retransmission decision to both data importance and reliability.

We consider an edge learning system where a classifier is trained at the edge based on a support vector machine (SVM), with data collected from distributed edge devices. The acquisition of high-dimensional data samples is bandwidth-consuming and relies on a high-rate noisy data channel. On the other hand, a dedicated low-rate noiseless channel is allocated for accurate delivery of the small-sized labels. The mismatch between the correctly received labels and the noisy data samples at the edge may lead to an incorrectly learnt model that fails to generalize to future data. To tackle the issue, retransmission with coherent combining is used to enhance the data quality. The retransmission decision at the edge server is controlled by the proposed importance ARQ, which optimizes the said quality-vs-quantity tradeoff so as to efficiently utilize the radio resource. The main contributions of this work are summarized as follows.

  • We identify an important design problem in wireless data acquisition for edge learning. The problem is how to intelligently control retransmission of data samples by exploiting the differences in their importance levels for learning. This problem reveals a unique quality-vs-quantity tradeoff in wireless data acquisition.

  • To solve the above problem, we propose a solution called importance ARQ. The scheme selectively retransmits a data sample based on its underlying importance for training a classifier model, which is estimated using the real-time model under training. The key technical contribution in the design is an elegant communication-learning relation between the two corresponding metrics, i.e., signal-to-noise ratio (SNR) and data importance, given a desired learning accuracy. This relation facilitates the design of a simple threshold-based policy for making retransmission decisions.

  • We evaluate the performance of the proposed importance ARQ via extensive experiments using a real dataset. The results demonstrate that the proposed method avoids learning performance degradation caused by channel noise while achieving faster convergence than conventional SNR-based ARQ.

II Related Work

II-1 Data Acquisition in Machine Learning

Data acquisition is an important area in machine learning. Two topics relevant to our study are active learning [5] and noisy label learning [6] which are briefly introduced as follows.

Targeting the scenario in which unlabeled data are abundant but manual labeling is expensive, active learning aims to selectively label the informative data (by querying an oracle), such that a model can be quickly trained using only a few carefully selected data samples, thus reducing the cost of manual labelling. Roughly speaking, the informativeness of a sample is related to how uncertain the prediction on this sample is under the current model, i.e., the more uncertain the prediction, the more informative the sample is for model learning. One of the commonly used uncertainty measures is entropy, an information-theoretic concept [7]. A classic data selection scheme based on the entropy measure is called uncertainty sampling, which was first proposed in [8]. Subsequent work [9] found that purely sampling the data with the highest uncertainty may lead to undesired queries to outliers that fail to represent the data space. To cope with the issue, the authors proposed a so-called information density sampling scheme that accounts for both the uncertainty and the representativeness of the data. Other popular data selection criteria, such as expected model change [10] and expected error reduction [11], take into account the entire data space and thus can naturally avoid the outlier sampling issue.

In contrast to active learning, noisy label learning considers a completely different scenario where the acquisition of a data sample is expensive while attaining a label for the sample can be cheap through interaction with non-expert labellers. As the labellers are not professional, the provided labels can be misleading. To avoid the performance degradation incurred by the noisy labels, extensive research efforts have been devoted to developing robust learning mechanisms. The existing solutions can be grouped into the following three directions: 1) request multiple labels from different labellers and use majority voting to control the label quality [12]; 2) reweigh the importance of noisily labeled data in the learning algorithm [13]; and 3) adopt a more robust loss function for model training by exploiting statistical learning theory [14]. Note that the conventional data acquisition problems focus only on the machine learning perspective and fail to consider the potential noise corruption of the acquired data, and thus are fundamentally distinct from the wireless data acquisition problem in edge learning. Nevertheless, the spirit of accounting for data importance in the acquisition process inspires us to develop an importance-aware mechanism to cope with the noisy data issue, as elaborated in the sequel.

II-2 ARQ in Wireless Communication

ARQ is a classic topic in wireless communication, which has been extensively studied [15, 16, 17, 18, 19, 20]. In terms of ARQ protocols, there are three basic schemes: stop-and-wait ARQ [15], go-back-N ARQ [16], and selective-repeat ARQ [17]. The three schemes differ in how the transmission proceeds before and after receiving an acknowledgement (ACK), and thus achieve different reliability-latency tradeoffs. To avoid excessive network delay caused by frequent retransmission, advanced ARQ protocols often incorporate forward error correction coding, such as block codes [18] and fountain codes [20], leading to an active research area called hybrid ARQ. Besides, ARQ can be further strengthened by applying diversity combining techniques such as maximum-ratio combining (MRC) [21] or selective combining [22]. This allows early termination once decoding is made possible by coherently combining several noisy observations. Another vein of research focuses on redesigning ARQ protocols for more sophisticated communication systems such as multiple-input multiple-output (MIMO) [23] and energy harvesting (EH) systems [24]. While existing designs of ARQ purely target reliable communication, retransmission-based wireless data acquisition in edge learning aims to optimize the learning performance. This renders existing ARQ designs inapplicable and introduces new challenges of cross-disciplinary design, which are addressed in the current work.

III Communication and Learning Models

III-A Communication System Model

We consider an edge learning system as shown in Fig. 1, comprising a single edge server and multiple edge devices, each equipped with a single antenna. A classifier model at the server is trained using a labelled dataset distributed over the devices, where each entry consists of a high-dimensional data sample and its label. To this end, the edge devices share the wireless channel in a time-division manner and take turns to transmit local data to the server. Note that a label has a much smaller size than a data sample (e.g., an integer versus a vector of a million real coefficients). Thus two separate channels are planned: a low-rate label channel and a high-rate data channel. The former is assumed noiseless for simplicity. Reliable uploading of data samples over the noisy and fading channel is the bottleneck of wireless data acquisition and the focus of this work. Time is slotted into symbol durations, called slots. Transmission of a data sample occupies a block of consecutive slots, called a symbol block.

At the beginning of each symbol block, the edge server makes a binary decision on either randomly selecting a device for acquiring a new sample or requesting the previously scheduled device for retransmission to improve sample quality. Retransmission is controlled by stop-and-wait ARQ. The positive ACK or negative ACK is sent to the target device based on whether the currently received sample at the server satisfies a pre-defined quality requirement as elaborated in the sequel. Each edge device is assumed to have backlogged data. Upon receiving a request from the server, a device either transmits a randomly picked new sample from its buffer or retransmits the previous sample.

The noisy data channel is assumed to follow block fading, where the channel coefficient remains static within a symbol block and is independent and identically distributed (i.i.d.) over different blocks. During the $k$-th symbol block, the device sends the data sample $\mathbf{x}$ using linear analog modulation, yielding the received signal given by

$\mathbf{y}_k = \sqrt{P}\, h_k\, \mathbf{x} + \mathbf{z}_k,$   (1)

where $P$ is the transmit power constraint for a symbol block, the fading coefficient $h_k$ is a complex random variable (r.v.) assumed to have unit variance, i.e., $\mathbb{E}[|h_k|^2] = 1$, without loss of generality, and $\mathbf{z}_k$ is the additive white Gaussian noise (AWGN) vector with entries following i.i.d. $\mathcal{CN}(0, \sigma^2)$ distributions. Analog uncoded transmission is assumed here to allow fast data transmission [25] and for higher energy efficiency (compared with the digital counterpart), as pointed out in [26]. We assume that perfect channel state information (CSI) on $h_k$ is available at the server. This allows the server to compute the instantaneous SNR of the received data and make the retransmission decision.

Figure 1: Edge learning system.
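For concreteness, the block-fading analog transmission in (1) can be simulated as in the following Python/NumPy sketch; the transmit power, noise variance, and random seed are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def transmit_block(x, P=1.0, sigma2=0.1, rng=np.random.default_rng(0)):
    """Simulate one symbol block of analog uncoded transmission over a
    Rayleigh block-fading channel, following the model in (1):
        y_k = sqrt(P) * h_k * x + z_k."""
    # Rayleigh fading: complex Gaussian coefficient with unit variance.
    h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    # Complex AWGN with per-entry variance sigma2.
    z = np.sqrt(sigma2 / 2) * (rng.standard_normal(len(x)) + 1j * rng.standard_normal(len(x)))
    return np.sqrt(P) * h * np.asarray(x, dtype=float) + z, h
```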

Retransmission Combining: To exploit the time-diversity gain provided by multiple independent noisy observations of the same data sample from retransmissions, ARQ is applied together with the MRC combining technique to coherently combine all the observations so as to maximize the receive SNR. To be specific, consider a data sample $\mathbf{x}$ retransmitted $t$ times. All received copies, say from symbol block $m$ to $m+t$, can be combined by MRC to acquire the received sample, denoted as $\hat{\mathbf{x}}$, as follows:

$\hat{\mathbf{x}} = \Re\left\{ \dfrac{\sum_{k=m}^{m+t} h_k^{*}\, \mathbf{y}_k}{\sqrt{P} \sum_{k=m}^{m+t} |h_k|^2} \right\},$   (2)

where $\mathbf{y}_k$ is given in (1). In (2), we extract the real part of the combined signal for further processing, since the data for machine learning are real-valued in general (e.g., photos, voice clips or video clips). As a result, the receive SNR for the sample $\hat{\mathbf{x}}$ is given as

$\mathrm{SNR} = \dfrac{2P \sum_{k=m}^{m+t} |h_k|^2}{\sigma^2},$   (3)

where the coefficient $2$ at the right-hand side arises from the fact that only the noise in the real dimension, with variance $\sigma^2/2$, affects the received data. The summation in (2) represents MRC, and its value grows as the number of retransmissions, $t$, increases. The SNR expression in (3) measures the reliability of a received data sample and serves as a criterion for making the retransmission decision, as discussed in Section V.
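The MRC combining in (2) and the receive SNR in (3) then follow directly; the sketch below reuses `transmit_block` from the previous snippet, and the combiner normalization and the factor of two mirror the expressions above, so it should be read as an illustration under those assumptions.

```python
import numpy as np

def mrc_combine(copies, P=1.0, sigma2=0.1):
    """Coherently combine noisy copies of the same sample as in (2) and
    compute the receive SNR of the combined sample as in (3).
    `copies` is a list of (y_k, h_k) pairs returned by transmit_block()."""
    num = sum(np.conj(h) * y for y, h in copies)     # sum_k h_k^* y_k
    gain = sum(abs(h) ** 2 for _, h in copies)       # sum_k |h_k|^2
    x_hat = np.real(num / (np.sqrt(P) * gain))       # combined real-valued sample
    snr = 2.0 * P * gain / sigma2                    # only the real-dimension noise matters
    return x_hat, snr

# Usage: one initial transmission plus two retransmissions of the same sample.
# x = np.random.rand(784)
# x_hat, snr = mrc_combine([transmit_block(x) for _ in range(3)])
```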

Transmission Budget Constraint: Due to limited radio resources or a latency requirement for data acquisition, the transmission budget for a specific learning task is restricted to $T$ symbol blocks. Therefore, the total data-acquisition duration (in symbol blocks) is constrained by

$\sum_{n=1}^{N} (1 + t_n) \ \le\ T,$   (4)

where $N$ denotes the number of acquired data samples and $t_n$ the number of retransmissions for acquiring the $n$-th data sample.

III-B Learning Model

For the task of edge learning, we target supervised training of a classifier model by implementing a linear support vector machine (SVM). Prior to training, we assume that the edge server has a small set of clean observed samples. This allows the construction of a coarse initial classifier, which is used for making retransmission decisions at the beginning stage. The classifier is refined progressively using newly received data samples. As shown in Fig. 1, SVM seeks an optimal hyperplane as the decision boundary by maximizing its margin to the data points, i.e., the minimum distance between the hyperplane and any data sample. The points lying on the margin are referred to as support vectors, which directly determine the decision boundary. Let $(\mathbf{x}_i, \ell_i)$ denote the $i$-th data-label pair in the training set, with $\ell_i \in \{-1, +1\}$. A convex formulation of the SVM problem is given by

$\min_{\mathbf{w},\, b}\ \ \dfrac{1}{2}\|\mathbf{w}\|^2$   (5)
$\text{s.t.}\ \ \ell_i\,(\mathbf{w}^{\mathsf T}\mathbf{x}_i + b) \ \ge\ 1, \quad \forall i.$   (6)

The original SVM works only for linearly separable datasets, which is hardly the case when the dataset is corrupted by channel noise as in the current scenario. To make the algorithm robust and able to cope with the potential outliers caused by noise, a variant of SVM called soft-margin SVM is adopted. The technique is widely used in practice to classify a noisy dataset that is not linearly separable, by allowing misclassification but with an additional penalty in the objective in (5). The implementation details are omitted here for brevity. Interested readers are referred to the classic literature [27].

Once training is completed, the learnt SVM model can be used to predict the label of a new sample $\mathbf{x}$ by computing its output score as follows:

$s(\mathbf{x}) = \mathbf{w}^{\mathsf T}\mathbf{x} + b.$   (7)

Then the sign of the output score yields the prediction of the binary label.
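For reference, the soft-margin linear SVM and the output score in (7) can be realized with an off-the-shelf solver; the sketch below uses scikit-learn's LinearSVC with hinge loss, where the penalty parameter C = 1.0 and the choice of library are assumptions of this illustration rather than details from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_soft_margin_svm(X, labels, C=1.0):
    """Train a soft-margin linear SVM on (possibly noisy) samples X with
    binary labels in {-1, +1}; return the hyperplane parameters (w, b)."""
    clf = LinearSVC(C=C, loss="hinge", max_iter=10000)
    clf.fit(X, labels)
    return clf.coef_.ravel(), float(clf.intercept_[0])

def output_score(x, w, b):
    """Output score of a sample as in (7); its sign predicts the binary label."""
    return float(np.dot(w, x) + b)
```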

(a) Noiseless data.
(b) Noisy data.
Figure 2: Illustration of the data-label mismatch issue.

IV Wireless Data Acquisition by Retransmission

IV-A Why Is Retransmission Needed?

Given a noisy data channel and a reliable label channel, the classifier model at the edge server is trained using noisy data samples with correct labels. The channel noise can cause a data sample to cross the ground-truth decision boundary, thereby resulting in a mismatch between the sample and its label, referred to as a data-label mismatch in the rest of the paper. The issue can cause incorrect learning, as illustrated in Fig. 2. Specifically, for the noiseless transmission case in Fig. 2(a), the new data sample helps refine the current decision boundary to approach the ground-truth one. However, for the case of noisy transmission in Fig. 2(b), noise corrupts the new sample and causes it to become an outlier falling on the opposite (wrong) side of the decision boundary. The situation is exacerbated when the SVM classifier is used, since the outlier may be selected as a support vector (or may indirectly affect the decision boundary by increasing the penalty in soft-margin SVM).

Retransmission can provide a diversity gain to suppress channel noise, so as to improve data reliability and hence the learning performance. To visualize the benefit of retransmission, we demonstrate in Fig. 3 the performance of edge-learnt classifiers trained on a noise-corrupted dataset generated by the channel model in (1) under a varying number of retransmissions. Specifically, we consider the learning task of handwritten digit recognition using the well-known MNIST dataset, which consists of 10 categories ranging from digit "0" to "9". The noise level of the received data is reflected by their average receive SNR, which is fixed in this experiment. We train three classifiers with different fixed numbers of retransmissions per sample. The curves of their test accuracy are shown in Fig. 3(a), with the corresponding received dataset distribution visualized in Fig. 3(b) using the classic t-distributed stochastic neighbor embedding (t-SNE) algorithm for projecting the images onto a two-dimensional plane. It is observed that, for the case without retransmission, after receiving a certain number (i.e., 8000) of noisy data samples, the training of the classifier fails, as reflected in the dramatic drop in test accuracy. The reason is that the strong noise effect (see Fig. 3(a)) accumulates and causes the model to diverge (see Fig. 3(b)). As the number of retransmissions increases, the noise effect is subdued to a sufficiently low level, ensuring that the class structure of the original dataset can be resolved, leading to a converged model and high test accuracy. The experiment demonstrates the effectiveness of retransmission in edge learning. Nevertheless, given the transmission budget constraint in (4), the retransmission scheme should be carefully designed to improve the training performance, which is the main theme of this paper.

(a) Performance for different fixed numbers of retransmissions per sample.
(b) Visualization of training-set structures.
Figure 3: The effect of retransmission.

IV-B Two Main Design Issues

As discussed in Section IV-A, retransmission is able to enhance the quality of the received data. This comes, however, at the cost of reduced data-acquisition efficiency due to redundant transmissions. The objective of our design is to adapt retransmission to data importance, in addition to the channel state, so as to efficiently utilize the limited radio resource for improving the learning accuracy. To be specific, for each received data sample (new or retransmitted), the server needs to make a binary decision on whether retransmission is needed in the next symbol block.

The decision making is challenging as it should address the two issues mentioned in the Introduction, elaborated as follows.

  • Quality-vs-Quantity Tradeoff: The learning performance can be improved by either increasing the reliability (quality) of the wirelessly transmitted training dataset by more retransmissions, or its size (quantity) by acquiring more data samples. The budget constraint in (4) introduces a tradeoff between the two aspects, called the quality-vs-quantity tradeoff.

  • Non-uniform Data Importance: In conventional data communication, bits are implicitly assumed to have equal importance. This is not the case for training a classifier, where data samples with higher uncertainty (i.e., near the decision boundary) are more critical than those with lower uncertainty (i.e., far from the decision boundary). Accounting for the non-uniform importance of data shifts the paradigm of radio resource allocation away from conventional rate-driven wireless transmission.

These issues are addressed via the proposed importance ARQ presented in the following sections.

V Principle of Importance-Aware Retransmission

In this section, an importance-aware retransmission scheme is proposed to regulate the data-sample reliability according to its importance. We first recall the traditional channel-aware retransmission design from the pure communications perspective, where the retransmission decision is only determined by an SNR threshold without considering data importance. Then, we introduce the concept of uncertainty to quantify the importance of a data sample for learning. This then motivates us to propose an importance ARQ scheme, which accounts for both the SNR as well as data uncertainty.

V-A Traditional Channel-Aware Retransmission

In wireless communications, SNR is commonly used to measure the received data reliability. Thus, it is natural to use SNR as the criterion in retransmission policy design. The traditional channel-aware retransmission scheme is as follows.

Scheme 1 (Channel-Aware ARQ).

The edge server repeatedly requests the scheduled edge device to retransmit the same data sample until a required SNR is met: $\mathrm{SNR} \ge \theta_0$, where $\mathrm{SNR}$ is defined in (3) and $\theta_0$ is a pre-specified SNR threshold.

This scheme achieves equally high reliability for all data samples. Under a budget constraint, however, this can lead to inefficient utilization of the radio resource for data acquisition, as the obtained dataset may not contain sufficient samples for accurate learning. The efficiency of the channel-aware retransmission can be improved by considering non-uniform data importance as measured by data uncertainty.

V-B Data Uncertainty

In the active learning literature, the importance of a data sample is usually measured by its uncertainty, which characterizes how much it can contribute to model learning [5]. A typical measure is entropy [7], an information-theoretic notion, defined as follows:

$H(\mathbf{x}) = -\sum_{\ell} P(\ell \mid \mathbf{x}; \boldsymbol{\theta}) \log P(\ell \mid \mathbf{x}; \boldsymbol{\theta}),$   (8)

where $\boldsymbol{\theta}$ denotes the set of model parameters to be learnt. For the considered SVM classifier, the model parameters are $(\mathbf{w}, b)$ (see (5)). In active learning, to train a model using a large unlabelled dataset under a limited labelling budget, an active learner selects a subset of "important" data samples (with high uncertainty) to label by queries and then updates the model. In the same spirit, in an edge learning system, the server prefers data samples with high uncertainty for retransmission as they are more "important". Besides entropy, there exist several alternative measures that are easier to compute but specific to the learning model adopted. One of them, related to the current work, is introduced as follows.
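As a small numerical illustration of (8), the entropy can be computed from the model's posterior class probabilities; how those probabilities are obtained for an SVM (e.g., via Platt scaling) is left open and assumed available.

```python
import numpy as np

def entropy_uncertainty(class_probs, eps=1e-12):
    """Entropy-based uncertainty as in (8): H = -sum_y p(y|x) log p(y|x)."""
    p = np.clip(np.asarray(class_probs, dtype=float), eps, 1.0)
    return float(-np.sum(p * np.log(p)))

# A near-boundary sample is more uncertain than a confidently classified one:
# entropy_uncertainty([0.5, 0.5])   -> ~0.69
# entropy_uncertainty([0.99, 0.01]) -> ~0.06
```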

Uncertainty Measure for SVM: In general, the computation of the entropy in (8) is complex, involving the computation of a posterior distribution. For this reason, targeting SVM classification, we develop a simpler uncertainty measure by exploiting the clustering structure of the dataset. The definition is motivated by the geometric intuition that a classifier makes less confident inferences on a data sample located near the decision boundary. Based on this intuition, we propose measuring the uncertainty of a data sample by the inverse of its distance to the boundary. Given a data sample $\mathbf{x}$, the said distance can be readily computed from the absolute value of the output score (defined in (7)) as follows:

$d(\mathbf{x}) = \dfrac{|s(\mathbf{x})|}{\|\mathbf{w}\|}.$   (9)

Then the distance-based uncertainty measure is given by

$U(\mathbf{x}) = \dfrac{1}{d(\mathbf{x})}.$   (10)

One can observe that the measure grows towards infinity as a data sample approaches the decision boundary, but reduces as the sample moves away from the boundary.
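The distance-based measure in (9)-(10) is straightforward to evaluate from the current hyperplane parameters; a minimal sketch, reusing the (w, b) returned by the training routine above, is given below.

```python
import numpy as np

def distance_to_boundary(x, w, b):
    """Distance of a sample to the SVM decision hyperplane, as in (9)."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def distance_uncertainty(x, w, b, eps=1e-12):
    """Distance-based uncertainty, the inverse distance as in (10); it grows
    without bound as the sample approaches the decision boundary."""
    return 1.0 / max(distance_to_boundary(x, w, b), eps)
```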

With the concept of uncertainty in mind, one can see that data samples with higher uncertainty are more prone to noise corruption. To be specific, a small perturbation caused by channel noise can push a highly uncertain data sample across the decision boundary, thus inducing the aforementioned data-label mismatch illustrated in Fig. 2. To cope with the issue, highly uncertain data samples should be protected against noise by retransmission, such that the noise-induced perturbation on the received data can be suppressed. To implement this intuitive idea, naive schemes may not work, or may at least be intractable. For instance, consider the following retransmission scheme: the edge server requests retransmission until the data uncertainty is smaller than a threshold. Unfortunately, this scheme may incur endless retransmissions if the uncertainty of the ground-truth data exceeds the threshold or if the observed uncertainty grows over retransmissions. This calls for a more intelligent use of data uncertainty in the retransmission policy design.

V-C Importance ARQ

Motivated by the issues discussed in the preceding subsections, we propose a novel retransmission scheme, called importance ARQ, to integrate the two key metrics in wireless data acquisition, namely the SNR from wireless communication and data uncertainty from machine learning. Essentially, the design is inspired by the following observation. No error would be incurred in learning if the transmitted and received data samples lie on the same side of the (ground-truth) decision hyperplane, so that they have the same predicted labels. This suggests that the key to tackling the data-label mismatch issue is to align the received and transmitted data samples so that they fall in the same half-space defined by the decision hyperplane, which is referred to as noisy data alignment.

The finding motivates us to apply the probability of such alignment as the criterion in the retransmission policy design. However, the ground-truth decision hyperplane cannot be known. One practical solution is to utilize the current model under training as a surrogate to check the consistency between the transmitted and received versions of a data sample. The consistency (alignment) can be translated into the event that the transmitted and received data samples have matching labels, or equivalently, that their output scores defined in (7) have the same sign. Consider an arbitrary transmitted data sample $\mathbf{x}$ and its received version $\hat{\mathbf{x}}$ after retransmissions, as defined in (2). The event is

$\big\{ s(\mathbf{x})\, s(\hat{\mathbf{x}}) > 0 \big\}.$   (11)

Then the probability of noisy data alignment, called the data alignment probability, is formally defined as follows.

Definition 1 (Data-Alignment Probability).

Conditioned on the received data sample, the data-alignment probability is

$P_a(\hat{\mathbf{x}}) \ \triangleq\ \Pr\!\big( s(\mathbf{x})\, s(\hat{\mathbf{x}}) > 0 \ \big|\ \hat{\mathbf{x}} \big).$   (12)

To calculate the probability defined above, the distribution of the transmitted sample's score conditioned on the received data sample is needed. This knowledge can be inferred from the conditional distribution of the transmitted sample, as derived in the following lemma.

Lemma 1 (Conditional Distribution of Transmitted Sample).

Conditioned on the received sample $\hat{\mathbf{x}}$, the distribution of the transmitted sample $\mathbf{x}$ follows a Gaussian distribution:

$\mathbf{x} \mid \hat{\mathbf{x}} \ \sim\ \mathcal{N}\!\left(\hat{\mathbf{x}},\ \dfrac{1}{\mathrm{SNR}}\,\mathbf{I}\right),$   (13)

where $\mathrm{SNR}$ is given in (3) and $\mathbf{I}$ denotes the identity matrix.

Proof: See Appendix A.

With the conditional distribution in (13), the required distribution can be readily derived using the linear relationship in (7). The derivation simply involves projecting the high-dimensional Gaussian distribution onto the particular direction specified by $\mathbf{w}$, which yields a univariate Gaussian distribution, as elaborated below.

Lemma 2 (Conditional Distribution of Transmitted Sample Score).

Conditioned on the received sample $\hat{\mathbf{x}}$, the distribution of the transmitted sample's score follows a univariate Gaussian distribution, given by

$s(\mathbf{x}) \mid \hat{\mathbf{x}} \ \sim\ \mathcal{N}\!\left(s(\hat{\mathbf{x}}),\ \nu^2\right),$   (14)

where $\nu^2 = \dfrac{\|\mathbf{w}\|^2}{\mathrm{SNR}}$.

(a) Small uncertainty.
(b) Large uncertainty.
Figure 4: Illustration of the probability of noisy data alignment.

Based on Lemma 2, we are ready to derive the data-alignment probability. The closed-form expression is presented in the following proposition.

Proposition 1 (Data-Alignment Probability).

Conditioned on the received sample $\hat{\mathbf{x}}$, the probability of noisy data alignment is given as

$P_a(\hat{\mathbf{x}}) = \dfrac{1}{2}\left[1 + \operatorname{erf}\!\left(\dfrac{|s(\hat{\mathbf{x}})|}{\sqrt{2}\,\nu}\right)\right],$   (15)

where $\nu$ follows from Lemma 2 and $\operatorname{erf}(\cdot)$ is the well-known error function defined as $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^{z} e^{-u^2}\, \mathrm{d}u$.

Proof: As shown in Fig. 4, the conditional distribution of the transmitted data score is Gaussian and the probability of data alignment is equal to the area shaded in grey. Mathematically, the probability can be derived using Lemma 2 as follows:

$P_a(\hat{\mathbf{x}}) = \displaystyle\int_{0}^{\infty} \dfrac{1}{\sqrt{2\pi}\,\nu} \exp\!\left(-\dfrac{(u - |s(\hat{\mathbf{x}})|)^2}{2\nu^2}\right) \mathrm{d}u.$   (16)

The integral therein can be further expressed using the error function, which yields (15).  
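To make the mapping from the receive SNR and the output score to the alignment probability concrete, the following sketch evaluates the erf expression in (15) under the Gaussian score model of Lemma 2 with standard deviation ||w||/sqrt(SNR); these constants follow the reconstruction above and should be read as an assumption of the illustration.

```python
import math

def alignment_probability(score_received, w_norm, snr):
    """Probability that the transmitted and received samples fall on the same
    side of the current decision hyperplane (noisy data alignment), per (15)."""
    nu = w_norm / math.sqrt(snr)          # std. dev. of the conditional score
    return 0.5 * (1.0 + math.erf(abs(score_received) / (math.sqrt(2.0) * nu)))

# More retransmissions -> higher SNR -> alignment probability closer to one:
# alignment_probability(0.2, w_norm=1.0, snr=10)   -> ~0.74
# alignment_probability(0.2, w_norm=1.0, snr=100)  -> ~0.98
```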

Remark 1.

(How Does Retransmission Affect Noisy Data Alignment?)  Retransmission increases the probability of noisy data alignment. Specifically, retransmission affects both the mean and the variance of the conditional distribution in (14). From the mean perspective, retransmission helps align the average of the retransmitted copies with the ground truth. To be specific, the received estimate approaches the ground-truth value as the number of retransmissions grows:

$\lim_{t \to \infty} \hat{\mathbf{x}} = \mathbf{x}.$   (17)

From the variance perspective, retransmission continuously reduces the variance by increasing the receive SNR, or equivalently the number of retransmissions $t$. Particularly, it follows from (3) and (14) that

$\lim_{t \to \infty} \nu^2 = \lim_{t \to \infty} \dfrac{\|\mathbf{w}\|^2}{\mathrm{SNR}} = 0.$   (18)

Combining the two aspects, one can further apply the Chernoff bound to (16) and show that

$1 - P_a(\hat{\mathbf{x}}) \ \le\ \exp\!\left(-c\,\mathrm{SNR}\right),$   (19)

where $c$ is a positive constant. As a result, the probability of noisy data alignment approaches one at an exponential rate as the number of retransmissions $t \to \infty$.

The above discussion suggests that the data-alignment probability can be controlled by retransmission. This motivates us to design a threshold-based retransmission scheme in which, for each acquired data sample, the edge server calls for retransmission until the said probability exceeds a pre-specified threshold $p_0$. Thereby, the scheme enforces noisy data alignment in a probabilistic sense, i.e.,

$P_a(\hat{\mathbf{x}}) \ \ge\ p_0.$   (20)

To facilitate implementation, the requirement on the data-alignment probability in (20) can be transformed into an equivalent requirement on the SNR by using the inverse error function.

Proposition 2 (Data Alignment Requirement).

The received data sample is aligned with the transmitted sample (ground truth) with probability at least $p_0$ if the SNR is larger than the following importance-based threshold:

(21)

where $U(\hat{\mathbf{x}})$ is the uncertainty measure given in (10). The proof simply involves the monotonicity of the error function and is omitted here for brevity.

Based on the result, importance ARQ is proposed as follows.

Scheme 2 (Importance ARQ).

In importance ARQ, the edge server repeatedly requests the scheduled edge device to retransmit the same data sample until

$\mathrm{SNR} \ \ge\ \min\big\{\theta(\hat{\mathbf{x}}),\ \theta_0\big\},$   (22)

where $\theta(\hat{\mathbf{x}})$ denotes the importance-based threshold in (21) and $\theta_0$ is an additional fixed threshold (the same as that in Scheme 1), set to avoid excessive retransmissions for extremely uncertain samples.
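A minimal sketch of the per-block decision logic in Scheme 2 is given below; it enforces the data-alignment requirement (20) via `alignment_probability` from the previous sketch and caps retransmissions with a fixed SNR threshold, which is one plausible reading of the stopping rule in (22) rather than a verbatim implementation from the paper. The conventional channel-aware rule of Scheme 1 is included for comparison.

```python
def importance_arq_decision(score_received, w_norm, snr, p0=0.95, snr_cap=100.0):
    """Return True if another retransmission of the current sample should be
    requested, and False if the server should move on to a new sample.
    Retransmission continues while the data-alignment probability (20) is
    below the target p0, unless the receive SNR already exceeds a fixed cap
    guarding against endless retransmission of extremely uncertain samples."""
    if snr >= snr_cap:
        return False                      # reliability cap reached; stop
    return alignment_probability(score_received, w_norm, snr) < p0

def channel_aware_arq_decision(snr, snr_threshold=20.0):
    """Scheme 1 baseline: retransmit until the SNR alone exceeds a fixed
    threshold, irrespective of the sample's importance."""
    return snr < snr_threshold
```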

Remark 2 (Importance-Aware SNR Thresholding).

The proposed importance ARQ can be viewed as an importance-aware adaptive SNR-threshold scheme. As observed from (22), the SNR threshold is proportional to the distance-based uncertainty of the data sample. The result is aligned with the intuition that a data sample of higher uncertainty should be more reliably received to ensure the alignment between the transmitted and received samples. To better understand this result, a graphical illustration is provided in Fig. 4. For a pre-specified $p_0$, a highly uncertain sample near the decision hyperplane requires a slim distribution with small variance (corresponding to a high receive SNR) to meet the requirement that the data-alignment probability (area shaded in grey) be no smaller than $p_0$ (see Fig. 4(b)). On the other hand, for a less uncertain data sample, the requirement can be easily satisfied with a relatively flat distribution with large variance and low receive SNR (see Fig. 4(a)).

Remark 3 (Comparison with Conventional Retransmission Schemes).

The proposed importance ARQ has several advantages over conventional schemes. Compared with the conventional channel-aware ARQ in Scheme 1, importance ARQ makes more efficient use of the transmission budget by differentiating data samples by their uncertainty and acquiring them with proportional reliabilities. Thereby, it improves the aforementioned quality-vs-quantity tradeoff. Furthermore, importance ARQ avoids practical issues such as excessive retransmissions arising in some naively designed schemes, e.g., the example mentioned in Section V-B. In particular, the data-alignment criterion in (20) can always be met by a finite number of retransmissions (see Remark 1). The performance gain of importance ARQ is evaluated by simulations in Section VII.

VI Implementation for Multi-Class Classification

In Section V, we presented the principle of importance ARQ assuming binary classification. In this section, the principle is generalized to multi-class classification. Note that a $K$-class SVM classifier can be trained using the so-called one-versus-one implementation [28]. The implementation involves $K(K-1)/2$ binary SVM classifiers, each trained using the samples from the two concerned classes only. As a result, for each input data sample $\mathbf{x}$, a $K$-class SVM outputs a $K(K-1)/2$-dimensional vector, denoted by $\mathbf{s}(\mathbf{x})$, which records the output scores, as defined in (7), from the component binary SVMs. To map the output to one of the $K$ class indices, a so-called reference coding matrix of size $K \times K(K-1)/2$ is built and denoted by $\mathbf{M}$, where each row gives the "reference output pattern" corresponding to the associated class; for instance, with $K = 4$ the matrix has 4 rows and 6 columns (a "learner" refers to a single binary SVM classifier).

With $\mathbf{M}$ at hand, predicting the class index of $\mathbf{x}$ simply involves comparing the Hamming distances between $\mathbf{s}(\mathbf{x})$ and the different rows of $\mathbf{M}$, and choosing the row index with the smallest distance as the predicted class index. In particular, the Hamming distance between $\mathbf{s}(\mathbf{x})$ and the $c$-th row of $\mathbf{M}$ is defined by

(23)

where the sign function takes a value in $\{1, 0, -1\}$ according to whether its argument is positive, zero, or negative, respectively. One can observe from the definition of the Hamming distance that not all of the learners' output scores have an active impact on predicting a particular class: the learners whose reference entries for that class are zero receive zero weight in computing the Hamming distance and thus have no impact. In other words, only the learners that involve a given class are active when that class is predicted.

Having obtained the predicted label, all the active learners determining the current prediction should satisfy the requirement on the data-alignment probability defined in (20). Consequently, the single-threshold policy for importance ARQ defined in (22) can then be extended to a multi-threshold policy, as defined below:

(24)
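For the multi-class case, the one-versus-one coding matrix and the zero-weighted Hamming-distance decoding described above can be sketched as follows; the construction of M and the exact distance used are standard one-versus-one conventions and are assumptions of this illustration, since the paper's example matrix is not reproduced here.

```python
import numpy as np
from itertools import combinations

def ovo_coding_matrix(K):
    """Reference coding matrix M for one-versus-one decomposition: one row per
    class, one column per binary learner (class pair), entries in {+1, 0, -1}."""
    pairs = list(combinations(range(K), 2))
    M = np.zeros((K, len(pairs)), dtype=int)
    for j, (a, b) in enumerate(pairs):
        M[a, j] = +1      # learner j votes +1 for class a ...
        M[b, j] = -1      # ... and -1 for class b; all other classes get zero weight
    return M

def predict_class(scores, M):
    """Pick the class whose reference row is closest (in zero-weighted Hamming
    distance) to the sign pattern of the learners' output scores."""
    signs = np.sign(scores)
    # Learners with a zero entry in a row contribute nothing to that row's distance.
    dist = np.sum(np.abs(M) * (M * signs < 0), axis=1)
    return int(np.argmin(dist))

def active_learners(M, predicted_class):
    """Indices of the learners determining the predicted class; under the
    multi-threshold rule (24), each must meet the alignment target in (20)."""
    return np.flatnonzero(M[predicted_class] != 0)
```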

VII Experimental Results

VII-A Experiment Setup

For the channel model, we assume the classic Rayleigh fading channel with channel coefficients following an i.i.d. complex Gaussian distribution $\mathcal{CN}(0, 1)$. The average transmit SNR, defined as the ratio of the transmit power to the noise variance, is set to a fixed default value.

For the learning model, binary and multi-class SVM models are trained on the MNIST dataset [29], which includes a training set of 60,000 samples and a test set of 10,000 samples. Each sample is a grey-valued image of 28×28 pixels, giving a sample dimension of 784. The training set used in the experiment is partitioned as follows. At the edge server, the initial small collection of clean observations is constructed by randomly selecting a small number of samples from each class. The remaining training data are assumed to be randomly assigned to different edge devices. Note that with random data distribution and scheduling, the number of devices has no effect on the learning performance. The learning accuracy is obtained by testing the learnt model on the entire test set. For binary classification, we choose the relatively less differentiable class pair of "3" and "5" (according to the t-SNE visualization) for SVM model training. The maximum transmission budget is set to 4,000 and 40,000 symbol blocks for binary and multi-class classification, respectively. All results are averaged over 200 (for the binary-class case) and 20 (for the multi-class case) experiments.
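A sketch of the corresponding data preparation is given below: it loads MNIST, keeps the "3"-versus-"5" pair for the binary task, and carves out a small clean initial set for the server; the use of fetch_openml and the size of the initial set are assumptions of this illustration.

```python
import numpy as np
from sklearn.datasets import fetch_openml

def prepare_binary_mnist(n_clean_per_class=10, seed=0):
    """Load MNIST, keep digits '3' and '5' with labels in {-1, +1}, and split
    off a small clean set held by the edge server for the initial classifier."""
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    mask = np.isin(y, ["3", "5"])
    X, y = X[mask] / 255.0, np.where(y[mask] == "3", -1, 1)
    perm = np.random.default_rng(seed).permutation(len(y))
    clean_idx, device_idx = [], []
    for label in (-1, 1):
        idx = perm[y[perm] == label]
        clean_idx.extend(idx[:n_clean_per_class])    # clean samples at the server
        device_idx.extend(idx[n_clean_per_class:])   # remainder spread over devices
    return X[clean_idx], y[clean_idx], X[device_idx], y[device_idx]
```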

VII-B Quality-vs-Quantity Tradeoff

To demonstrate the quality-vs-quantity tradeoff in wireless data acquisition, Fig. 5 displays the curves of learning accuracy versus transmission budget for both channel-aware ARQ and importance ARQ. In Fig. 5(a), we test the performance of channel-aware ARQ with three SNR thresholds (in dB), ranging from low to high data-reliability requirements. Similar cases for importance ARQ are considered in Fig. 5(b), with the reliability requirements specified by three values of the data-alignment probability threshold. It is observed from both Fig. 5(a) and 5(b) that setting the threshold too low leads to a fast convergence rate but at the cost of performance degradation as errors accumulate. In contrast, setting the threshold too high also leads to poor learning performance due to an insufficient number of acquired samples. This suggests that the retransmission threshold should be carefully designed to optimize the quality-vs-quantity tradeoff and thereby improve the learning performance. In the following experiments, we select thresholds based on the observations in this subsection to obtain good performance.

(a) Channel-aware ARQ.
(b) Importance ARQ.
Figure 5: Quality-vs-quantity tradeoff in wireless data acquisition.

VII-C Learning Performance for Binary Classification

(a) Classification performance for different values of average receive SNR .
(b) Retransmission Distribution.
Figure 6: Performance of a binary classifier with wirelessly acquired data.

In Fig. 6, the learning performance of the proposed importance ARQ is compared with two baseline schemes, namely channel-aware ARQ and the scheme without retransmission (maximum data quantity). Varying values of the average receive SNR are considered in Fig. 6(a). It is observed that, without retransmission, the performance of edge learning degrades dramatically after acquiring a sufficiently large number of noisy samples. This is aligned with our previous observations from Fig. 3(a) and justifies the need for retransmission. Next, one can observe that importance ARQ outperforms the conventional channel-aware ARQ throughout the entire training duration. This confirms the performance gain from intelligent utilization of the radio resource for data acquisition. Furthermore, the performance gain of importance ARQ is almost the same across the different SNR scenarios. This demonstrates the robustness of the proposed scheme against hostile channel conditions.

In Fig. 6(b), we further investigate the underlying reason for the performance improvement of importance ARQ by plotting the distribution of the average number of retransmissions over a range of sample uncertainty (or, equivalently, distance to the decision hyperplane). One can observe a close-to-uniform distribution for conventional channel-aware ARQ, corresponding to uncertainty independence. In contrast, for importance ARQ, retransmission is concentrated in the high-uncertainty region. This is aligned with the scheme's design principle and shows its effectiveness in adapting retransmission to data importance.

VII-D Learning Performance for Multi-Class Classification

In Fig. 7(a), the learning performance of the proposed importance ARQ is compared with the two baseline schemes in the scenario of multi-class classification. Similar trends as in the binary-class scenario are observed, and importance ARQ is found to consistently outperform the benchmark schemes in this more challenging scenario. The relation between importance ARQ and the multi-cluster structure of the training dataset is illustrated in Fig. 7(b). In the left subfigure, samples in different classes in general have distinct average distances to their corresponding decision hyperplanes and thus different uncertainty levels (the inverse of distance). In the right subfigure, one can observe that importance ARQ effectively adapts the average retransmission budget of different classes to their uncertainty levels. For example, class 5 has the shortest average distance to the hyperplane and is thus allocated the largest transmission budget to protect its receive quality. In contrast, class 1 has the longest distance and thus consumes less budget, as desired.

(a) Performance evaluation.
(b) Retransmission vs. Uncertainty.
Figure 7: Performance evaluation in multi-class classification.

VIII Concluding Remarks

In this paper, we have proposed a novel retransmission scheme, importance ARQ, for wireless data acquisition in edge learning systems. It adapts retransmission to data-sample importance so as to enhance the learning performance with a limited transmission budget. The work contributes to shifting the paradigm of classic communication-centric radio resource allocation towards new designs targeting edge learning. Our initial investigation in this largely uncharted area opens up many interesting research directions, including importance-aware power and spectrum allocation, user scheduling, and congestion control.

Appendix A: Proof of Lemma 1

The received sample in (2) can be rewritten as

$\hat{\mathbf{x}} = \mathbf{x} + \Re\{\tilde{\mathbf{z}}\}.$

Consequently, the transmitted sample is

$\mathbf{x} = \hat{\mathbf{x}} - \Re\{\tilde{\mathbf{z}}\},$

where $\tilde{\mathbf{z}}$ is the equivalent noise after combining, with the entries defined as

$\tilde{z} = \dfrac{\sum_{k=m}^{m+t} h_k^{*}\, z_k}{\sqrt{P}\sum_{k=m}^{m+t} |h_k|^2}.$

Since $z_k$ follows i.i.d. $\mathcal{CN}(0, \sigma^2)$, the entries in $\tilde{\mathbf{z}}$ are i.i.d. (conditioned on the channel coefficients) and their distributions are given as follows:

$\tilde{z} \ \sim\ \mathcal{CN}\!\left(0,\ \dfrac{\sigma^2}{P\sum_{k=m}^{m+t}|h_k|^2}\right).$

Taking the real part of $\tilde{\mathbf{z}}$ yields the following distribution:

$\Re\{\tilde{z}\} \ \sim\ \mathcal{N}\!\left(0,\ \dfrac{\sigma^2}{2P\sum_{k=m}^{m+t}|h_k|^2}\right) = \mathcal{N}\!\left(0,\ \dfrac{1}{\mathrm{SNR}}\right),$

which leads to the desired result in (13).

References

  • [1] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Towards an intelligent edge: Wireless communication meets machine learning,” arXiv preprint arXiv:1809.00343, 2018.
  • [2] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “When edge meets learning: Adaptive control for resource-constrained distributed machine learning,” in IEEE INFOCOM, 2018.
  • [3] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 2017.
  • [4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al., “Communication-efficient learning of deep networks from decentralized data,” in AISTATS, 2017.
  • [5] B. Settles, "Active learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, 2012.
  • [6] B. Frénay and M. Verleysen, “Classification in the presence of label noise: a survey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 5, pp. 845–869, 2014.
  • [7] A. Holub, P. Perona, and M. C. Burl, “Entropy-based active learning for object recognition,” in IEEE CVPR Workshops’08, pp. 1–8, IEEE, 2008.
  • [8] D. D. Lewis and J. Catlett, "Heterogeneous uncertainty sampling for supervised learning," in ICML, pp. 148–156, 1994.
  • [9] B. Settles and M. Craven, “An analysis of active learning strategies for sequence labeling tasks,” in EMNLP, pp. 1070–1079, Association for Computational Linguistics, 2008.
  • [10] B. Settles, M. Craven, and S. Ray, “Multiple-instance active learning,” in NIPS, pp. 1289–1296, 2008.
  • [11] N. Roy and A. McCallum, “Toward optimal active learning through monte carlo estimation of error reduction,” in ICML, pp. 441–448, 2001.
  • [12] V. Sheng, J. Zhang, B. Gu, and X. Wu, “Majority voting and pairing with multiple noisy labeling,” IEEE Trans. Knowl. Data Eng. (Early Access), 2017.
  • [13] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 3, pp. 447–461, 2016.
  • [14] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” in NIPS, pp. 1196–1204, 2013.
  • [15] M. Moeneclaey, H. Bruneel, I. Bruyland, and D.-Y. Chung, “Throughput optimization for a generalized stop-and-wait ARQ scheme,” IEEE Trans. Commun., vol. 34, no. 2, pp. 205–207, 1986.
  • [16] Y.-D. Yao, “An effective go-back-N ARQ scheme for variable-error-rate channels,” IEEE Trans. Commun., vol. 43, no. 1, pp. 20–23, 1995.
  • [17] E. Weldon, “An improved selective-repeat ARQ strategy,” IEEE Trans. Commun., vol. 30, no. 3, pp. 480–486, 1982.
  • [18] L. Badia, M. Levorato, and M. Zorzi, “Markov analysis of selective repeat type II hybrid ARQ using block codes,” IEEE Trans. Commun., vol. 56, no. 9, 2008.
  • [19] E. Visotsky, Y. Sun, V. Tripathi, M. L. Honig, and R. Peterson, “Reliability-based incremental redundancy with convolutional codes,” IEEE Trans. Commun., vol. 53, no. 6, pp. 987–997, 2005.
  • [20] D. Sejdinovic, V. Ponnampalam, R. J. Piechocki, and A. Doufexi, “The throughput analysis of different IR-HARQ schemes based on fountain codes,” in IEEE WCNC’08, pp. 267–272, IEEE, 2008.
  • [21] E. N. Onggosanusi, A. G. Dabak, Y. Hui, and G. Jeong, “Hybrid ARQ transmission and combining for MIMO systems,” in IEEE ICC, vol. 5, pp. 3205–3209, IEEE, 2003.
  • [22] Q. Zhang and S. A. Kassam, “Hybrid ARQ with selective combining for fading channels,” IEEE J. Sel. Areas Commun., vol. 17, no. 5, pp. 867–880, 1999.
  • [23] E. W. Jang, J. Lee, H.-L. Lou, and J. M. Cioffi, “On the combining schemes for MIMO systems with hybrid ARQ,” IEEE Trans. Wireless Commun., vol. 8, no. 2, pp. 836–842, 2009.
  • [24] Y. Mao, J. Zhang, and K. B. Letaief, “ARQ with adaptive feedback for energy harvesting receivers,” in IEEE WCNC’06, pp. 1–6, IEEE, 2016.
  • [25] T. L. Marzetta and B. M. Hochwald, “Fast transfer of channel state information in wireless systems,” IEEE Trans. Signal Process., vol. 54, no. 4, pp. 1268–1278, 2006.
  • [26] S. Cui, J.-J. Xiao, A. J. Goldsmith, Z.-Q. Luo, and H. V. Poor, “Energy-efficient joint estimation in sensor networks: Analog vs. digital,” in IEEE ICASSP’05, vol. 4, pp. 745–748, 2005.
  • [27] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, vol. 1. New York, NY, USA: Springer, 2001.
  • [28] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin DAGs for multiclass classification,” in NIPS, pp. 547–553, 2000.
  • [29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.