I. Introduction
Machine intelligence is revolutionizing every branch of science and technology [1, 2]. For a machine to learn, it requires at least two ingredients: information and computation, which are usually separated from each other in machine-type communication (MTC) systems [3]. However, sending vast volumes of data from MTC devices to the cloud not only imposes a heavy communication burden but also increases the transmission latency. To address this challenge, a promising solution is edge machine learning [4, 5, 6, 7, 8, 9, 10, 11], which trains a machine learning model, or fine-tunes a pre-trained one, at the edge (i.e., a nearby radio access point with computation resources).
In general, there are two ways to implement edge machine learning: data sharing and model sharing. Data sharing uses the edge to collect data generated from MTC devices for machine learning [4, 5, 6, 7], while model sharing uses federated learning [8, 9, 10, 11] to exchange model parameters (instead of data) between the edge and users. Both approaches are recognized as key paradigms in the sixth generation (6G) wireless communications [12, 13, 14]. However, since the MTC devices often cannot process the data due to limited computation power, this paper focuses on data sharing.
I-A Motivation and Related Work
In contrast to conventional communication systems, edge machine learning systems aim to maximize the learning performance rather than the communication throughput. Consequently, edge resource allocation differs markedly from traditional resource allocation schemes that consider only the wireless channel conditions [15, 16, 17, 18]. For instance, the celebrated water-filling scheme allocates more resources to better channels for throughput maximization [15], and the max-min fairness scheme allocates more resources to cell-edge users to maintain a certain quality of service [16]. While these two schemes have proven very efficient in traditional wireless communication systems, they can lead to poor learning performance in edge learning systems because they do not account for machine learning factors such as model and dataset complexities. Imagine training a deep neural network (DNN) and a support vector machine (SVM) at the edge. Due to the much larger number of parameters in the DNN, the edge should allocate more resources to the MTC devices that upload data for the DNN than to those for the SVM.
Nonetheless, maximizing the learning performance requires a mathematical expression of the learning performance as a function of the number of samples, which, to the best of the authors' knowledge, does not exist. While the sample complexity of a learning task can be related to the Vapnik-Chervonenkis (VC) dimension [1], this theory only provides a loose estimate that is independent of the specific learning algorithm and data distribution. To better understand the learning performance, it has been proved in [19, 20] that the generalization error can be upper bounded by the sum of the bias between the main prediction and the optimal prediction, the variance due to training datasets, and the noise of the target example. With the bound being tight for certain loss functions (e.g., squared loss and zero-one loss), the bias-variance decomposition theory gives rise to an empirical nonlinear classification error model [21, 22, 23] that is also theoretically supported by the inverse power law derived via statistical mechanics [24].

I-B Summary of Results
In this paper, we adopt the above nonlinear model to approximate the learning performance, and a learning centric power allocation (LCPA) problem is formulated with the aim of minimizing the classification error subject to a total power budget constraint. Since the formulated resource allocation problem is nonconvex and nonsmooth, it is nontrivial to solve. Despite these two challenges, by leveraging the majorization minimization (MM) framework from optimization, an MM-based LCPA algorithm that converges to a Karush-Kuhn-Tucker (KKT) solution is proposed. To gain deeper insight into LCPA, an analytical solution is derived for the asymptotic case in which the number of antennas at the edge goes to infinity. The asymptotically optimal solution discloses that the transmit powers are inversely proportional to the channel gains and scale exponentially with the parameters of the classification error model. This result reveals that machine learning has a stronger impact than the wireless channels in LCPA. To keep the computation affordable when the number of MTC devices is extremely large, two variants of LCPA, called mirror-prox LCPA and accelerated LCPA, are proposed. Both algorithms are first-order methods (FOMs), implying that their complexities are linear in the number of users. Furthermore, the iteration complexity of the accelerated LCPA achieves the lower bound derived for any FOM, meaning that the accelerated LCPA is among the fastest FOMs for the considered problem. Extensive experimental results based on public datasets show that the proposed LCPA scheme achieves a higher classification accuracy than the sum-rate maximization and max-min fairness power allocation schemes. For the first time, the benefit brought by joint communication and learning design is quantitatively demonstrated in edge machine learning systems.
Our results also show that both the mirror-prox LCPA and the accelerated LCPA reduce the computation time by orders of magnitude compared with the MM-based LCPA, while still achieving satisfactory learning performance.
To sum up, the contributions of this paper are listed as follows.

A learning centric power allocation (LCPA) scheme is developed for the edge machine learning problem, which maximizes the learning accuracy instead of the communication throughput.

To understand how LCPA works, an asymptotically optimal solution to the edge machine learning problem is derived, which, for the first time, discloses that the transmit power obtained from LCPA grows linearly with the path loss and exponentially with the learning parameters.

To reduce the computation time of LCPA in the massive multiple-input multiple-output (MIMO) setting, two variants of LCPA based on FOMs are proposed, which enable the edge machine learning system to scale up to a large number of MTC users.

Extensive experimental results based on public datasets (e.g., MNIST, CIFAR-10, ModelNet40) show that the proposed LCPA achieves a higher accuracy than the sum-rate maximization and max-min fairness schemes.
I-C Outline
The rest of this paper is organized as follows. The system model and problem formulation are described in Section II. Classification error modeling is presented in Section III. The MM-based LCPA algorithm and the asymptotic solutions are derived in Sections IV and V, respectively. Finally, experimental results are presented in Section VI, and conclusions are drawn in Section VII.
Notation: Italic letters, lowercase bold letters, and uppercase bold letters represent scalars, vectors, and matrices, respectively. Calligraphic letters stand for sets, and |·| is the cardinality of a set. The operators (·)^T, (·)^H, and (·)^{-1} take the transpose, Hermitian, and inverse of a matrix, respectively. We use {·} to represent a sequence, [·] to represent a column vector, and ||·|| to represent the norm of a vector. The symbol I indicates the identity matrix, 1 indicates the vector with all entries equal to one, and CN(0, 1) stands for the complex Gaussian distribution with zero mean and unit variance. The functions exp(·) and log(·) denote the exponential function and the logarithm function, respectively. Finally, E[·] means the expectation of a random variable, and the indicator function equals one
if its condition holds and zero otherwise, and O(·) means the order of arithmetic operations. Important variables and parameters used in this paper are listed in Table I.

TABLE I: Important variables and parameters.
Type | Description
Variable | Transmit power (in W) at each user.
Variable | Number of training samples for each task.
Parameter | Total transmit power budget (in W).
Parameter | Communication bandwidth (in Hz).
Parameter | Transmission time (in s) of the data collection.
Parameter | Data size (in bits) per sample for each task.
Parameter | Initial number of samples at the edge for each task.
Parameter | Noise power (in W).
Parameter | Composite channel gain from a user to the edge when detecting that user's data.
Function | Classification error of a learning model as a function of the sample size.
Function | Empirical classification error model for each task with its tuning parameters.
II. System Model and Problem Formulation
We consider the edge machine learning system shown in Fig. 1, which consists of an intelligent edge equipped with multiple antennas and serving multiple users. The goal of the edge is to train classification models by collecting data observed at user groups (e.g., UAVs with camera sensors), with each group storing the data for training one model.^1

^1In the case where data from a particular user is used to train two different models, the corresponding user groups are allowed to share that common user.

For the classification models, without loss of generality, Fig. 1 depicts a convolutional neural network (CNN) and a support vector machine (SVM), but more user groups and other classification models are equally valid. It is assumed that the data are labeled at the edge. This can be supplemented by recent self-labeling techniques [25, 26], in which a classifier is trained with an initial small number of labeled examples and is then retrained with its own most confident predictions, thus enlarging its labeled training set. After training the classifiers, the edge can feed back the trained models to the users for subsequent use (e.g., object recognition). Notice that if the classifiers are pre-trained at the cloud and deployed at the edge, the task of edge machine learning is to fine-tune the pre-trained models at the edge using the local data generated by the MTC users. Based on the above description, the edge needs to collect the data and learn from them.
More specifically, each user transmits a signal with its allocated power. Accordingly, the received signal at the edge is the superposition of the user signals and noise, where each user has its own channel vector to the edge. By applying the well-known maximal ratio combining (MRC) receiver to the received signal, the data rate of each user is
(1) 
where represents the composite channel gain (including channel fading and MIMO processing) from user to the edge when detecting data of user :
(2) 
With the rate expression in (1), the amount of data received from each user is the product of the data rate and the transmission time, where the bandwidth (in Hz) assigned to the system is a constant (e.g., the bandwidth of a standard MTC system is specified in [27]), and the transmission time is measured in seconds. As a result, the total number of training samples collected at the edge for training each model is
(3) 
where the first term is the initial number of samples for the task at the edge, and the approximation is due to neglecting the rounding when the collected data volume is large. Notice that the per-sample data size is the number of bits in each data sample. For example, the handwritten digits in the MNIST dataset [28] are 28 x 28 grayscale images (each pixel has 8 bits), and a few extra bits per sample can be reserved for the labels of the 10 classes [28] in case the users also transmit labels. With the collected samples, the intelligent edge can train the models in the learning phase.
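As a concrete illustration of (3), the number of collected samples follows from the Shannon rate; below is a minimal Python sketch, where the function name, argument names, and numerical values are our own illustrative assumptions rather than the paper's notation.

```python
import math

def collected_samples(p, g, sigma2, bandwidth, T, bits_per_sample, v0=0):
    """Approximate number of training samples collected for one task, as in (3).

    p: transmit power of the uploading user (W)
    g: composite channel gain seen after MRC combining
    sigma2: noise power (W)
    bandwidth: system bandwidth (Hz); T: transmission time (s)
    bits_per_sample: data size of one sample (bits)
    v0: initial number of samples already stored at the edge
    """
    rate = bandwidth * math.log2(1.0 + p * g / sigma2)  # Shannon rate (bit/s)
    return v0 + math.floor(rate * T / bits_per_sample)

# Example: 1 MHz bandwidth, 1 s collection, 6272-bit samples (28 x 28, 8-bit).
n = collected_samples(p=1.0, g=1.0, sigma2=1.0, bandwidth=1e6, T=1.0,
                      bits_per_sample=6272, v0=10)
```

The example shows the main qualitative point of (3): the sample count grows only logarithmically with the transmit power but linearly with bandwidth and time.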
In the considered system, the design variables that can be controlled are the transmit powers of the different users and the sample sizes of the different models. Since the power costs at the users should not exceed the total budget, the transmit powers need to satisfy the corresponding sum constraint. With the power constraint in place, it is then crucial to minimize the classification errors (i.e., the number of incorrect predictions divided by the total number of predictions), which leads to the following learning centric power allocation (LCPA) problem:
(4a)  
(4b) 
where the objective involves the classification error of each learning model as a function of its sample size, and the min-max operation guarantees the worst-case learning performance. The key challenge is that these functions represent generalization errors and, to the best of the authors' knowledge, no exact expression for them currently exists. To address this issue, the following section adopts an empirical classification error model as an approximation.
Remark:
When data is nonuniformly distributed among users, the learning error would likely depend on how much data each user could contribute to the learning task. In this case, the proposed method can still be applied by adding constraints
(5) 
to , where is the maximum number of samples to be collected from user . For example, if the data from user is not important or there is not enough data at user , we can set a small .
III. Modeling of Classification Error
In general, the classification error is a nonlinear function of [21, 22, 19, 20, 23, 24]. Particularly, this nonlinear function should satisfy the following properties:

Since is a percentage, it is bounded as ;

Since more data would provide more information, is a monotonically decreasing function of [21];

As increases, the magnitude of derivative would gradually decrease and become zero when is sufficiently large [22], meaning that increasing sample size no longer helps machine learning.
Based on the properties (i)–(iii), the following nonlinear model [21, 22, 23] can be used to capture the shape of :
(6) 
where the nonnegative tuning parameters control the scale and the decay rate of the error. It can be seen that the model satisfies all the properties (i)-(iii). Moreover, the error tends to zero as the sample size goes to infinity, meaning that the error vanishes with infinite data.^2

^2We assume the model is powerful enough that, given an infinite amount of data, the error rate can be driven to zero.
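The symbols of (6) are elided in this text; assuming the common two-parameter inverse power form discussed in [21, 22, 23], a hedged sketch of the model is:

```python
def error_model(v, a, b):
    """Empirical classification error, assumed form Psi(v) = a * v**(-b).

    v: training sample size; a: model-complexity weight; b: decay exponent.
    The function is positive, monotonically decreasing in v, and its
    derivative vanishes as v grows, matching properties (i)-(iii).
    """
    return a * float(v) ** (-b)
```

The two parameters separate the roles described in the text: the weight reflects the model complexity, while the exponent reflects how efficiently additional (possibly non-IID) data reduces the error.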
Interpretation from Learning Theory. Apart from (i)-(iii), the model (6) corroborates the inverse power relationship between the learning performance and the amount of training data derived from the perspective of statistical mechanics [24]. The error model in (6) can also be explained by the bias-variance decomposition theory [19, 20]. In particular, it is known that the probability of incorrect classification is proportional to the sum of a bias term and a variance term [19]. The bias is independent of the training set, and is zero for a learner that always makes the optimal prediction [20]. The variance is independent of the true value of the predicted variable, and is asymptotically inversely proportional to the sample size for independent and identically distributed (IID) samples [21]. But since the datasets could be finite and non-IID, a tuning exponent is introduced to account for the non-IID dataset. Finally, by multiplying a weighting factor to account for the model complexity of the classifier, we immediately obtain the result in (6).

III-A Parameter Fitting of CNN and SVM Classifiers
We use the public MNIST dataset [28] as the input images, and train the CNN shown in Fig. 1 over a range of training sample sizes. In particular, the input image is sequentially fed into a convolution layer (with ReLU activation, 32 channels, and SAME padding), a max pooling layer, another convolution layer (with ReLU activation, 64 channels, and SAME padding), a max pooling layer, a fully connected layer (with ReLU activation), and a final softmax output layer. The training procedure is implemented via the Adam optimizer with a fixed learning rate and mini-batch size. After training, we test the trained model on a validation dataset of unseen samples and compute the corresponding classification error. By varying the sample size, we obtain the classification error at each of the points to be fitted. The parameters of the error model can then be found via the following nonlinear least squares fitting:

(7)
The above problem can be solved by a two-dimensional brute-force search or by the gradient descent method. Since the parameters for different tasks are obtained independently, the total complexity is linear in the number of tasks.
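A fit like (7) can also be sketched in closed form: taking logarithms of the assumed inverse power model linearizes it, so an ordinary least-squares fit in log-log space recovers the two parameters. The sample sizes and error values below are illustrative placeholders, not the paper's measurements.

```python
import numpy as np

# Illustrative (sample size, test error) pairs; replace with measured data.
v = np.array([100., 200., 400., 800., 1600., 3200.])
err = np.array([0.30, 0.22, 0.16, 0.12, 0.085, 0.062])

# Taking logs of err = a * v**(-b) gives log(err) = log(a) - b * log(v),
# so a linear fit in log-log space recovers (a, b). A nonlinear refinement
# (e.g., gradient descent on (7) itself) can then polish the estimate.
slope, intercept = np.polyfit(np.log(v), np.log(err), 1)
b_hat = -slope
a_hat = float(np.exp(intercept))
```

The linearized fit is a standard initialization; the exact problem (7) weights residuals differently, so a subsequent nonlinear refinement may move the estimates slightly.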
To demonstrate the versatility of the model, we also fit the nonlinear model to the classification error of a support vector machine (SVM) classifier with a penalty coefficient and a Gaussian kernel function [29]. The SVM classifier is trained on the digits dataset in the Scikit-learn Python machine learning toolbox, which contains small grayscale images of handwritten digits from 10 classes [29]. Out of all images, we train the SVM on the first portion of the samples with varying sample sizes, and use the remaining samples for testing. The parameters for the SVM are obtained following a procedure similar to (7).
The fitted classification error versus the sample size is shown in Fig. 2a. It is observed from Fig. 2a that, with the fitted parameters, the nonlinear classification error model in (6) matches the experimental data of the CNN very well. Likewise, the model in (6) fits the experimental data of the SVM.
III-B Practical Implementation
One may wonder how to obtain the fitted classification error model before the actual machine learning model is trained. There are two ways to address this issue.
1) Extrapolation. The error function can be obtained by training the machine learning model on an initial dataset at the edge, and the performance on a future larger dataset can then be predicted. This is called extrapolation [21]. For example, by fitting the error function to the first half of the experimental data of the CNN in Fig. 2b, the resulting curve predicts the errors at the larger sample sizes very well, as shown in Fig. 2b. Similarly, with parameters obtained from the first portion of the experimental data, the proposed model for the SVM matches the classification errors at the larger sample sizes. The fitting performance in Fig. 2b is slightly worse than that in Fig. 2a, since a smaller amount of pilot data is used. But since our goal is to distinguish different tasks rather than to predict the classification errors exactly, such an extrapolation method can still guide the resource allocation at the edge.
2) Approximation. We can pre-train a large number of commonly used models offline (not at the edge) and store the corresponding parameters in a lookup table at the edge. Then, by choosing a set of parameters from the table, the unknown error model at the edge can be approximated. This works because the error functions share the same trend for two similar tasks, e.g., classifying two similar digits with the SVM as shown in Fig. 2c. Notice that there may be a mismatch between the pre-training task and the real task at the edge; this is the case for the two dissimilar digits in Fig. 2c. As a result, it is necessary to carefully measure the similarity between two tasks before choosing the parameters.
IV. MM-Based LCPA Algorithm
Based on the results in Section III, we can directly approximate the true error function by the fitted model. However, to account for the approximation error between the model and the true error (e.g., due to noise in the samples or a slight mismatch between the data used for training and the data observed at the MTC devices), a weighting factor can be applied, where a higher value accounts for a larger approximation error. Then, by substituting the model for the true error and putting (4b) into the objective to eliminate the sample-size variables, the problem becomes:
(8) 
where
(9) 
It can be seen that the resulting problem is a nonlinear optimization problem due to the nonlinear classification error model (6). Moreover, the max operator introduces nonsmoothness, so the objective function is not differentiable. As a result, existing methods based on gradient descent [18] are not applicable.
To solve the problem, we propose to use the MM framework [30, 31, 32, 33], which constructs a sequence of upper bounds on the objective and replaces the objective in (8) with these surrogates. More specifically, given any feasible solution, we define the surrogate functions
(10) 
and the following proposition can be established.
Proposition 1.
The functions satisfy the following conditions:
(i) Upper bound condition: .
(ii) Convexity: is convex in .
(iii) Local condition: and .
Proof.
See Appendix A. ∎
With part (i) of Proposition 1, an upper bound is directly obtained if we replace the error functions by their surrogates around a feasible point. However, a tighter upper bound can be achieved if we treat the obtained solution as another feasible point and construct the next-round surrogate function around it. In particular, assuming the solution at the current iteration is given, the following problem is considered at the next iteration:
(11) 
Based on part (ii) of Proposition 1, the surrogate problem is convex and can be solved by off-the-shelf software packages (e.g., CVX with the Mosek solver [34]) for convex programming. Denoting its optimal solution as the next iterate, the process repeats with the updated surrogate problem. According to part (iii) of Proposition 1 and [30, Theorem 1], the sequence of iterates converges to a KKT solution for any feasible starting point. The entire procedure of MM-based LCPA is summarized in Fig. 3.
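The MM iteration can be sketched generically as follows. This is not the paper's exact surrogate (10); instead it uses a standard quadratic majorizer built from an assumed gradient Lipschitz constant `L`, with a simple feasibility step for the power budget, purely to illustrate the "majorize, then minimize" loop.

```python
import numpy as np

def mm_minimize(grad_f, p0, L, budget, iters=100):
    """Generic MM loop: at each step, minimize the convex quadratic upper
    bound f(p_t) + g.(q - p_t) + (L/2)||q - p_t||^2 over {q >= 0,
    sum(q) <= budget}. L is an assumed Lipschitz constant of grad_f; the
    paper's surrogate (10) is constructed differently."""
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        q = np.maximum(0.0, p - grad_f(p) / L)  # minimizer of the majorizer, clipped at 0
        if q.sum() > budget:
            q *= budget / q.sum()  # crude rescaling; an exact simplex projection could be used
        p = q
    return p

# Toy check: for f(p) = ||p - 1||^2, grad_f(p) = 2(p - 1), the loop should
# land on p = [1, 1] when the budget is not binding.
p_star = mm_minimize(lambda p: 2.0 * (p - 1.0), [0.0, 0.0], L=2.0, budget=10.0)
```

Each pass solves an easy convex problem whose minimizer never increases the true objective, which is exactly the monotonic-descent property guaranteed by part (i) of Proposition 1.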
In terms of computational complexity, each surrogate problem involves the power variables as primal variables together with the associated dual variables.^3 Therefore, the worst-case complexity for solving each surrogate problem is cubic in the number of users [35]. Consequently, the total complexity is this per-iteration cost multiplied by the number of iterations needed for the algorithm to converge.

^3The dual variables correspond to the constraints of the surrogate problem: those coming from the max operator, the nonnegative power constraints, and the power budget constraint.
V. Asymptotically Optimal Solution
Although a KKT solution has been derived in Section IV, the MM-based LCPA requires a cubic complexity with respect to the number of users. This leads to time-consuming computations when the number of users is in the range of hundreds or more. As a result, low-complexity large-scale optimization algorithms are indispensable. To this end, in this section we investigate the asymptotic case in which the number of antennas at the edge approaches infinity, which facilitates the algorithm design via the law of large numbers [36, 37, 38]. In this regime, the channels from different users to the edge become asymptotically orthogonal [37], and we have
(12) 
Based on such orthogonality feature, and putting for into in , the function is asymptotically equal to
(13) 
Therefore, the problem when is equivalent to
(14) 
Now it can be seen that, by adopting the asymptotic analysis, the function in (13) is much simpler than the original objective. However, the summation in the base of the power function in (13) still hinders us from computing the solution in closed form. To address this challenge, the following subsection derives the closed-form solution for the special case in which each group contains a single user. We then tackle the general case in Section V-B. Finally, an acceleration scheme that achieves faster convergence is discussed in Section V-C.
V-A Analytical LCPA with One User per Group
When (i.e., each user group has only one user), the summation can be dropped and we have . For notational simplicity, we denote the unique user in group as user . Then problem in (14) is rewritten as
(15a)  
(15b) 
where is a slack variable and has the interpretation of classification error level, and the following proposition gives the optimal solution to .
Proposition 2.
The optimal to is
(16) 
where satisfies .
Proof.
See Appendix B. ∎
To efficiently compute the classification error level, observe that the left-hand side of the defining equation is a decreasing function of the error level. Therefore, the error level can be obtained by solving this equation with the bisection method over a bracketing interval. More specifically, given the current lower and upper bounds, we evaluate the midpoint; if the function value at the midpoint is positive, we raise the lower bound to the midpoint; otherwise, we lower the upper bound to it. This procedure is repeated until the interval is shorter than the target tolerance. Since the bisection method has a linear convergence rate [39], and each iteration evaluates one scalar function per user, the overall complexity of the bisection method is modest.
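The bisection just described can be sketched as follows; here `h` stands in for the monotonically decreasing function whose root defines the optimal error level (the exact expression is given in Proposition 2).

```python
def bisect_error_level(h, lo, hi, tol=1e-9):
    """Find the root of a monotonically decreasing function h on [lo, hi].

    Assumes h(lo) > 0 > h(hi). Bisection halves the interval each step,
    i.e., linear convergence as noted in [39].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy check with h(x) = 1 - x, whose root is x = 1.
root = bisect_error_level(lambda x: 1.0 - x, 0.0, 2.0)
```

Because the interval halves each step, reaching tolerance `tol` takes about log2((hi - lo) / tol) iterations regardless of the function being bracketed.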
Scaling Law of Learning Centric Communication. According to Proposition 2, the user transmit power is inversely proportional to the wireless channel gain. However, it depends exponentially on the classification error level and on the learning parameters. Moreover, one of the learning parameters stands out as the most important factor, since it is involved in both the power and exponential functions. These observations disclose that, in edge machine learning systems, the learning parameters have a more significant impact on the physical-layer design than the wireless channels do. This result is a joint effect of Shannon information theory and learning theory.
Learning Centric versus Communication Centric Power Allocation. Notice that the result in (16) is fundamentally different from the most well-known resource allocation schemes (e.g., iterative water-filling [15] and max-min fairness [16]). For example, the water-filling solution for maximizing the system throughput is given by
(17) 
where is a constant chosen such that . On the other hand, the maxmin fairness solution under is given by
(18) 
It can be seen from (17) and (18) that the water-filling scheme allocates more power to better channels, whereas the max-min fairness scheme allocates more power to worse channels. But no matter which scheme is adopted, the only influencing factor is the channel condition.
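For comparison, the classic water-filling rule of (17) can be sketched as follows; the function name and the bisection search for the water level are our own (standard) choices rather than anything specified in the paper.

```python
import numpy as np

def waterfilling(gains, sigma2, budget, tol=1e-10):
    """Water-filling: p_k = max(0, mu - sigma2 / g_k), with the water
    level mu chosen by bisection so that sum(p_k) equals the budget."""
    g = np.asarray(gains, dtype=float)
    lo, hi = 0.0, budget + float(np.max(sigma2 / g))
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - sigma2 / g).sum() < budget:
            lo = mu  # water level too low, total power below budget
        else:
            hi = mu
    return np.maximum(0.0, 0.5 * (lo + hi) - sigma2 / g)

# Better channels receive more power: the opposite of max-min fairness,
# and neither scheme looks at the learning parameters at all.
p = waterfilling([4.0, 1.0], sigma2=1.0, budget=3.0)
```

Running the example allocates more power to the stronger channel, illustrating the contrast with both the max-min rule in (18) and the learning centric rule in (16).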
V-B Mirror-Prox LCPA with Multiple Users per Group
In the last subsection, we derived the closed-form solution for the case of one user per group. However, if the number of users in a group is larger than one, the max operator in (13) cannot be dropped, and no closed-form solution exists. In this general setting, the major challenge comes from the nonsmooth max operator in the objective function, which hinders us from computing gradients.
To deal with the nonsmoothness, we reformulate the problem as a smooth bilevel optimization problem with simplex constraints. Observing that the projection onto a simplex in Euclidean space requires a high computational complexity, a mirror-prox LCPA method working on a non-Euclidean manifold is proposed. In this way, the distance is measured by the Kullback-Leibler (KL) divergence, and the non-Euclidean projection has an analytical expression. Lastly, with an extra-gradient step, the proposed mirror-prox LCPA converges to the globally optimal solution with an iteration complexity inversely proportional to the target solution accuracy [40, 41, 42].
More specifically, we first equivalently transform into a smooth bilevel optimization problem. By defining set and introducing variables such that , is rewritten as
(19) 
It can be seen from that is differentiable with respect to either or , and the corresponding gradients are
(20a)  
(20b) 
where
(21) 
with its element being
(22) 
However, this is a bilevel problem, with both the upper-level variable and the lower-level variable involved in the simplex constraints. In order to facilitate the projection onto the simplex constraints, below we consider a non-Euclidean (Banach) space induced by the l1 norm. In such a space, the Bregman distance between two vectors is the KL divergence
(23) 
and the following proposition can be established.
Proposition 3.
If the classification error is upper bounded by , then is –smooth in Banach space induced by norm, where
(24a)  
(24b) 
with .
Proof.
See Appendix C. ∎
The smoothness result in Proposition 3 enables us to apply mirror descent (i.e., generalized gradient descent in a non-Euclidean space) to the minimization variable and mirror ascent to the maximization variable in this space [41]. This leads to the proposed mirror-prox LCPA, an iterative algorithm that involves i) a proximal step and ii) an extra-gradient step. In particular, the mirror-prox LCPA initially chooses a feasible starting point. Denoting the solution at the current iteration, the following equations are used to update the next-round solution [41]:
(25a)  
(25b)  
(25c)  
(25d) 
where the step size multiplies the gradient terms inside (25a)-(25d), which are obtained from (20a)-(20b). Notice that a small step size leads to slow convergence of the algorithm, while a large one causes the algorithm to diverge. According to [41], the step size should be chosen inversely proportional to the Lipschitz constant derived in Proposition 3. In this paper, we set the step size accordingly, which empirically provides fast convergence of the algorithm.
How Mirror-Prox LCPA Works. The major features of the mirror-prox LCPA are summarized as follows:

The formulas (25a)–(25b) update the variables along their gradient direction, while keeping the updated point close to the current point . This is achieved via the proximal operator that minimizes the distance (or ) plus a firstorder linear function. Moreover, since the KL divergence is the Bregman distance, the update (25a)–(25b) is a Bregman proximal step.

The signs for updating the upper level variable and the lower level variable are opposite, because the upper layer is a minimization problem while the lower layer is a maximization problem.
Lastly, by putting the Bregman distance in (23), the function in (13), the gradient in (21), and a proper into (25a)–(25b), the equations (25a)–(25b) are shown in Appendix D to be equivalent to
(26a)  
(26b) 
Following a similar procedure to Appendix D, the equations (25c)–(25d) can also be reduced to an explicit form.
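The closed-form non-Euclidean projection underlying updates of the type (26a)-(26b) is the entropic (KL) proximal step. A minimal sketch of one such update on the probability simplex is given below; the exact scalings in the paper's (26a)-(26b) may differ, so this is an illustration of the mechanism rather than the paper's formula.

```python
import numpy as np

def kl_mirror_step(w, grad, eta):
    """One Bregman proximal step under the KL divergence: the minimizer of
    eta * <grad, u> + KL(u || w) over the simplex is the multiplicative
    update u proportional to w * exp(-eta * grad), renormalized to sum to one."""
    u = np.asarray(w) * np.exp(-eta * np.asarray(grad))
    return u / u.sum()

# One step from the uniform point: mass shifts away from the coordinate
# with the larger gradient, and the result stays on the simplex.
w_next = kl_mirror_step(w=np.array([0.5, 0.5]), grad=np.array([1.0, 0.0]), eta=1.0)
```

Because the update is a componentwise multiplication followed by a normalization, its cost is linear in the dimension, which is exactly what makes the non-Euclidean projection cheap compared with a Euclidean simplex projection.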
According to Proposition 3 and [40], the mirror-prox LCPA algorithm is guaranteed to converge to the optimal solution. In practice, we can terminate the iterative procedure when the change between consecutive iterates is small enough. The entire procedure for computing the solution using the mirror-prox LCPA is summarized in Fig. 3b. In terms of computational complexity, computing (26a) has a per-iteration cost that is linear in the number of users. Since the number of iterations for the mirror-prox LCPA to converge scales as O(1/ε), where ε is the target solution accuracy, the total complexity is the product of these two quantities.
V-C Accelerated LCPA
For the mirror-prox LCPA method, while the per-iteration complexity is linear in the number of users, the number of iterations is O(1/ε) in the worst case, where ε is the target solution accuracy. However, it is known that the number of iterations for FOMs can potentially be reduced to O(1/√ε) if the objective function is smooth [43], and the gap between O(1/ε) and O(1/√ε) is significant [43, 44, 45, 46]. For example, if ε takes a common value of 10^-4, then O(1/ε) is on the order of 10^4 iterations but O(1/√ε) is on the order of 10^2 iterations.
In order to derive an FOM with O(1/√ε) iterations for the edge machine learning resource allocation problem, we consider a smooth variant of the problem, which minimizes the average classification error:
(27) 
where each term is defined in (13). In this case, the proposed accelerated gradient learning centric power allocation (accelerated LCPA) solves the problem via the following update at each iteration: