# Sparse Optimization for Green Edge AI Inference

With the rapid upsurge of deep learning tasks at the network edge, effective edge artificial intelligence (AI) inference becomes critical for providing low-latency intelligent services to mobile users by leveraging edge computing capability. In such scenarios, energy efficiency becomes a primary concern. In this paper, we present a joint inference task selection and downlink beamforming strategy to achieve energy-efficient edge AI inference by minimizing the overall power consumption, consisting of both computation and transmission power consumption, which yields a mixed combinatorial optimization problem. By exploiting the inherent connection between the set of task selection and the group sparsity structure of the transmit beamforming vector, we reformulate the optimization as a group sparse beamforming problem. To solve this challenging problem, we propose a log-sum function based three-stage approach. By adopting the log-sum function to enhance the group sparsity, a proximal iteratively reweighted algorithm is developed. Furthermore, we establish the global convergence analysis and provide the ergodic worst-case convergence rate for this algorithm. Simulation results demonstrate the effectiveness of the proposed approach for improving energy efficiency in edge AI inference systems.


## I Introduction

The availability of big data and computing power, along with advances in optimization algorithms, has triggered a booming era of artificial intelligence (AI). Notably, deep learning [2] is regarded as the most popular sector in modern AI and has achieved exciting breakthroughs in applications such as speech recognition and computer vision [3]. Benefiting from these achievements, AI is becoming a promising tool that streamlines people’s decision-making process and facilitates the development of diversified intelligent services (e.g., virtual personal assistants and recommendation systems). Meanwhile, with the proliferation of mobile computing as well as Internet-of-Things (IoT) devices, massive real-time data are generated locally [4]. However, it is widely acknowledged that traditional cloud-based computing [5, 6] faces challenges (e.g., latency, privacy and network congestion) in supporting ubiquitous AI-empowered applications on mobile devices [7].

In contrast, edge AI is a promising approach that can tackle the above concerns by fusing mobile edge computing [8] with AI-enabled techniques (e.g., deep neural networks (DNNs)). By pushing AI models to the network edge, it brings the edge servers close to the requesting mobile devices and thus enables low-latency and privacy-preserving inference. Notably, edge AI is envisioned as a key ingredient of future intelligent 6G networks [9, 10, 11, 12], fully unleashing the potential of mobile communication and computation. Typically, edge AI consists of two phases: edge training and edge inference. In particular, federated learning [13] is a key enabling technology for training machine learning models directly on mobile devices without uploading data to the cloud center. This paper mainly focuses on edge inference, i.e., deploying trained AI models and implementing model inference at the network edge. Following [7, 14], the edge AI inference architecture is generally classified into three major types:

• On-device inference: It performs the model inference directly on end devices where DNN models are deployed. While some enabling techniques (e.g., model compression [15, 16], hardware speedup [17]) have been proposed to facilitate the deployment of the DNN model, it still poses challenges for resource-limited (e.g., memory, power budget and computation) end devices [18]. To mitigate such concerns, on-device distributed computing is envisioned as a promising solution for on-device inference, which enables AI model inference across multiple distributed end devices [19].

• Joint device-edge inference: This mode carries out the AI model inference in a device-edge cooperation fashion [7] with model partition and model early-exit techniques [20]. While device-edge cooperation is flexible and enables low-latency edge inference, it may still place high resource requirements on end devices due to the resource-demanding nature of DNNs [21].

• Edge server inference: Such methods transfer the raw input data to edge servers for processing, which then return the inference results to end users [22, 23]. Edge server inference is particularly suitable for computation-intensive tasks. Nonetheless, the inference performance relies mainly on the channel bandwidth between the edge server and end devices. Cooperative transmission [24] thus becomes promising for communication-efficient delivery of inference results.

To support computation-intensive tasks on resource-limited end devices, edge server inference stands out as a viable solution to fulfill the key performance requirements. The main focus of this paper is on AI model inference for mobile devices with the edge server inference architecture. For the edge AI inference system, energy efficiency is a key performance indicator [14], which motivates us to focus on energy-efficient edge inference design. This is achieved by optimizing the overall network power consumption, including the computation power consumption for performing inference tasks and the transmission power consumption for returning inference results. In particular, cooperative transmission [24] is a widely recognized technique to reduce the downlink transmit power consumption and provide low-latency transmission services by exploiting high beamforming gains for edge AI inference. In this work, we thus consider that multiple edge base stations (BSs) collaboratively transmit the inference results to the end devices [22]. To enable transmission cooperation, we apply the computation replication principle [25], i.e., the inference tasks from end devices can be performed by several neighboring edge BSs to create multiple copies of the inference results. However, computation replication greatly increases the power consumption in performing inference tasks. Therefore, it is necessary to select the inference tasks to be performed by each edge BS to achieve an optimal balance between communication and computation power consumption.

In this paper, we propose a joint inference task selection and downlink beamforming strategy to achieve energy-efficient edge AI inference by optimizing the overall network power consumption, consisting of the computation power consumption and the transmission power consumption, under quality-of-service (QoS) constraints. However, the resulting formulation contains combinatorial variables and nonconvex constraints, which makes it computationally intractable. To address this issue, we observe that the transmit beamforming vector has an intrinsic connection with the set of inference task selection (i.e., the tasks that each edge server opts to execute). Based on this crucial observation, we present a group sparse beamforming (GSBF) reformulation, followed by a log-sum function based three-stage GSBF approach. In particular, in the first stage, we adopt a weighted log-sum function based relaxation to enhance the group sparsity of the structured solutions.

Nonetheless, the log-sum function minimization problem poses challenges in computation and analysis. To resolve these issues, we present a proximal iteratively reweighted algorithm, which solves a sequence of weighted convex subproblems. Moreover, we establish the global convergence analysis and the worst-case convergence rate analysis of the presented proximal iteratively reweighted algorithm. Specifically, by leveraging the Fréchet subdifferential [26], we characterize the first-order necessary optimality conditions of the formulated convex-constrained log-sum problem. We then show that the iterates generated by the proposed algorithm make the function values decrease monotonically, and prove that, for any initial feasible point, any cluster point of the generated sequence is a critical point of the original objective. Finally, we establish an ergodic worst-case convergence rate of the defined optimality residual in terms of the iteration counter.

The major contributions of this paper are summarized as follows.

• We propose a joint task selection and downlink beamforming strategy to optimize the trade-off between computation and communication power consumption for an energy-efficient edge AI inference system. In particular, task selection is achieved by controlling the group sparsity structure of the transmit beamforming vector, thereby formulating a group sparse beamforming problem under the target QoS constraints.

• To solve the resulting optimization problem, we propose a log-sum function based three-stage GSBF approach. In particular, we adopt a weighted log-sum approximation to enhance the group sparsity of the transmit beamforming vector in the first stage. Moreover, we propose a proximal iteratively reweighted algorithm to solve the log-sum minimization problem.

• For the presented proximal iteratively reweighted algorithm, we establish the global convergence analysis. We prove that every cluster point generated by the presented algorithm satisfies the first-order necessary optimality condition for the original nonconvex log-sum problem. Furthermore, a worst-case convergence rate is established for this algorithm in an ergodic sense.

• Numerical experiments are conducted to demonstrate the effectiveness and competitive performance of the log-sum function based three-stage GSBF approach for designing the green edge AI inference system.

### I-A Related Works

The study of inducing sparsity generally falls into the sparse optimization category [27, 28]. Sparse optimization has recently emerged as a powerful tool for the effective design of wireless networks, e.g., group sparse beamforming for energy-efficient cloud radio access networks [29, 30] and sparse signal processing for Internet-of-Things (IoT) networks [31, 32]. In particular, to induce the group sparsity structure of the beamforming vector, the works [22, 23] adopted the mixed $\ell_{1,2}$-norm. As illustrated in [27], the mixed $\ell_{1,p}$-norms ($p > 1$) can induce the group sparsity structure of the solution of interest. Moreover, the mixed $\ell_{1,2}$-norm and $\ell_{1,\infty}$-norm [33] are commonly adopted. However, the sparsity achieved by convex sparsity-inducing norms is often unsatisfactory, since some small nonzero elements typically remain in the obtained solutions [34]. In contrast, some works applied nonconvex sparsity-inducing functions to seek sparser solutions [35]. Notably, the work [34] reported the capability of the log-sum function for enhancing the sparsity of the solutions.

Motivated by their superior performance on inducing sparsity, we adopt log-sum functions to promote the sparsity pattern in the solutions. However, adopting the log-sum function to enhance sparsity usually makes the problem difficult to compute and analyze. In [34], the authors first proposed an iteratively reweighted algorithm (IRL1) for tackling the nonconvex and nonsmooth log-sum functions with linear constraints. Nonetheless, they did not further conduct the convergence analysis for the proposed method. Under reasonable assumptions, the work of [36] established the convergence results for a class of unconstrained nonconvex nonsmooth problems based on the limiting-subgradient tool. In particular, these results could apply to the log-sum model in an unconstrained setting. In [37], they proposed a proximal iteratively reweighted algorithm and proved that any accumulation point is a critical point. The work of [38] further showed that, for any starting feasible points, the sequence generated by their proximal iteratively reweighted algorithm could converge to a critical point under the Kurdyka-Łojasiewicz (KL) property [39]. However, these works focused on the unconstrained formulation or linearly constrained cases when the log-sum model is involved. The theoretical analysis for the log-sum function with general convex-set constraints has not been investigated.

### I-B Organization

The remainder of this paper is organized as follows. Section II presents the system model of the edge AI inference, followed by the problem formulation and analysis. Section II-C provides the group sparse beamforming formulation. The log-sum function based three-stage GSBF approach is proposed in Section III. Section IV provides the global convergence and convergence rate analysis of the proposed proximal iteratively reweighted algorithm. Section V demonstrates the performance of the proposed approach. Concluding remarks are given in Section VI. To keep the main text coherent and free of technical details, we defer most of the mathematical proofs to the Appendices.

### I-C Notation

Throughout this paper, we use the following notation. We use $\mathbb{C}^n$ and $\mathbb{R}^n$ to denote the complex vector space and the real Euclidean $n$-space, respectively. Boldface lower-case and upper-case letters represent vectors and matrices of appropriate size, respectively. The inner product between two vectors $\mathbf{x}$ and $\mathbf{y}$ is denoted as $\langle \mathbf{x}, \mathbf{y} \rangle$. $\|\cdot\|_1$ and $\|\cdot\|_2$ are the conventionally defined $\ell_1$-norm and $\ell_2$-norm of a vector, respectively. In addition, we use $(\cdot)^{\mathsf{H}}$ and $(\cdot)^{\mathsf{T}}$ to denote the Hermitian and transpose operators, respectively. $\mathfrak{R}(\cdot)$ is the real part of a complex scalar. $\mathbf{1}$ is a vector with all components equal to 1, and $\mathbf{0}$ denotes the zero vector of appropriate size. For a structured vector partitioned into groups, we also use the vector whose elements are the $\ell_2$-norms of the individual groups. We use $\circ$ to denote the composition of two functions, and $\odot$ denotes the elementwise product of two vectors.

For any closed convex set $\mathcal{C}$, we use $\delta_{\mathcal{C}}$ to denote the characteristic function associated with $\mathcal{C}$, which is defined as

$$\delta_{\mathcal{C}}(c) := \begin{cases} 0, & c \in \mathcal{C}, \\ +\infty, & c \notin \mathcal{C}. \end{cases}$$

Similarly, $\mathbb{I}(\cdot)$ defines an indicator function associated with a given condition, i.e., it returns the value 1 if the condition is met and the value 0 otherwise. Moreover, $\mathcal{CN}(\mu, \sigma^2)$ corresponds to the complex random variable with mean $\mu$ and variance $\sigma^2$.

## II System Model and Problem Formulation

This section describes the overall system model and power consumption model for performing intelligent tasks in the considered edge AI inference system, followed by the problem formulation and analysis.

### II-A System Model

We consider an edge computing system consisting of $N$ multi-antenna BSs collaboratively serving $K$ single-antenna mobile users (MUs), as illustrated in Fig. 1. These BSs are deployed as dedicated edge nodes and have access to enormous computation and storage resources [8]. For convenience, define $\mathcal{K} = \{1, \dots, K\}$ and $\mathcal{N} = \{1, \dots, N\}$ as the index sets of MUs and BSs, respectively. The MUs have inference computing tasks, and the results can be inferred from task-related DNNs: each MU $k$ supplies raw input data, from which the corresponding inference result is computed. As performing intelligent tasks on DNNs is typically resource-demanding, it is usually impractical to perform the tasks locally on resource-constrained mobile devices. In the proposed edge AI inference system, by exploiting computation replication [25], we consider the scenario in which each neighboring edge BS has collected the raw input data from all MUs. The edge BSs then process the data for model inference. After the edge BSs complete the model inference, the inference results are returned to the corresponding MUs via the downlink channels. We assume that all edge BSs have been equipped with the pre-trained deep network models for all inference tasks [23].

In the downlink transmission, the edge BSs that perform the inference task of the same MU cooperatively return the inference results to that MU. We assume that perfect channel state information (CSI) is available to all edge BSs to enable cooperative transmission of the inference results [24]. Let $\mathcal{A}_n \subseteq \mathcal{K}$ denote the index set of MUs whose tasks are selectively performed by BS $n$, and let $\mathcal{A}$, the collection of all $\mathcal{A}_n$, represent the task selection strategy.

Let $s_k$ denote the encoded scalar of the requested output for MU $k$, and $\mathbf{v}_{nk}$ be the transmit beamforming vector at BS $n$ for MU $k$. For convenience, and without loss of generality, we assume that $\mathbb{E}[|s_k|^2] = 1$, i.e., the power of $s_k$ is normalized to unity. The transmitted signal at BS $n$ can be expressed as

$$\mathbf{x}_n = \sum_{k \in \mathcal{A}_n} \mathbf{v}_{nk} s_k. \tag{1}$$

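Let $\mathbf{h}_{nk}$ be the propagation channel coefficient vector between BS $n$ and MU $k$. The received signal at MU $k$, denoted as $y_k$, is then given by

$$\begin{aligned} y_k &= \sum_{n \in \mathcal{N}} \mathbf{h}_{nk}^{\mathsf{H}} \mathbf{x}_n + z_k \\ &= \sum_{n \in \mathcal{N}} \mathbf{h}_{nk}^{\mathsf{H}} \sum_{l \in \mathcal{A}_n} \mathbf{v}_{nl} s_l + z_k \\ &= \sum_{n \in \mathcal{N}} \mathbf{h}_{nk}^{\mathsf{H}} \Big[ \mathbb{I}(k \in \mathcal{A}_n) \mathbf{v}_{nk} s_k + \sum_{l \in \mathcal{A}_n,\, l \neq k} \mathbf{v}_{nl} s_l \Big] + z_k, \end{aligned} \tag{2}$$

where $z_k \sim \mathcal{CN}(0, \sigma_k^2)$ is the complex additive white Gaussian noise.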

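We assume that all data symbols are mutually independent and independent of the noise. Based on (2), the signal-to-interference-plus-noise ratio (SINR) for MU $k$ is therefore given as

$$\mathrm{SINR}_k(\mathcal{A}) = \frac{\left| \sum_{n \in \mathcal{N}} \mathbb{I}(k \in \mathcal{A}_n) \mathbf{h}_{nk}^{\mathsf{H}} \mathbf{v}_{nk} \right|^2}{\sum_{l \neq k} \left| \sum_{n \in \mathcal{N}} \mathbb{I}(l \in \mathcal{A}_n) \mathbf{h}_{nk}^{\mathsf{H}} \mathbf{v}_{nl} \right|^2 + \sigma_k^2}. \tag{3}$$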
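To make (3) concrete, the following is a minimal NumPy sketch that evaluates the SINR of every MU for a given task selection. The array layout (`h[n, k]`, `v[n, k]`, a boolean `selected[n, k]` standing for $k \in \mathcal{A}_n$) and the toy numbers are assumptions made for illustration, not part of the system model.

```python
import numpy as np

def sinr(h, v, selected, noise_var):
    """SINR of each MU following (3).

    h[n, k]        : channel vector from BS n to MU k (last axis = antennas)
    v[n, k]        : beamforming vector at BS n for MU k
    selected[n, k] : True if k is in A_n (BS n performs the task of MU k)
    noise_var[k]   : noise variance sigma_k^2
    """
    # I(k in A_n) * v_nk: beamformers of unselected tasks do not transmit.
    v_eff = v * selected[:, :, None]
    # g[k, l] = sum_n h_nk^H v_nl: gain of the stream for MU l seen at MU k.
    g = np.einsum("nka,nla->kl", h.conj(), v_eff)
    power = np.abs(g) ** 2
    signal = np.diag(power)
    interference = power.sum(axis=1) - signal
    return signal / (interference + noise_var)

# Toy usage (hypothetical numbers): 2 single-antenna BSs, 2 MUs, unit channels.
h = np.ones((2, 2, 1), dtype=complex)
v = np.zeros((2, 2, 1), dtype=complex)
v[:, 0, 0] = 1.0                        # both BSs transmit to MU 0 only
selected = np.ones((2, 2), dtype=bool)
print(sinr(h, v, selected, np.ones(2)))
```

In this toy setting, removing MU 0 from one BS's selection lowers its SINR because the coherent combining gain is lost, which is the effect the task selection has to trade against computation power.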

#### II-A2 Power Consumption Model

The computation and transmission power consumption for model inference is generally large. Energy efficiency is therefore central to the design of the edge AI inference system, and the overall network power consumed in computation and communication at the edge BSs becomes our main interest. Specifically, we express the total transmission power of all edge BSs in the downlink as

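$$\begin{aligned} P_{\mathrm{trans}}(\mathcal{A}, \{\mathbf{v}_{nk}\}) &:= \sum_{n=1}^{N} \frac{1}{\eta_n} \mathbb{E}\Big[ \sum_{k \in \mathcal{A}_n} \|\mathbf{v}_{nk} s_k\|_2^2 \Big] \\ &= \sum_{n=1}^{N} \sum_{k \in \mathcal{A}_n} \frac{1}{\eta_n} \|\mathbf{v}_{nk}\|_2^2, \end{aligned} \tag{4}$$

where $\eta_n$ is the radio frequency power amplifier efficiency coefficient of edge BS $n$.

In addition to the downlink transmission power consumption, the power consumed in performing AI inference tasks should be taken into consideration as well, owing to the power-demanding nature of running DNNs. We use $P^{c}_{nk}$ to denote the computation power consumption of BS $n$ in performing the inference task of MU $k$. Then the computation power consumed by all BSs is given by

$$P_{\mathrm{comp}}(\mathcal{A}) := \sum_{n \in \mathcal{N}} \sum_{k \in \mathcal{A}_n} P^{c}_{nk}. \tag{5}$$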

For the estimation of the computation energy consumption in executing an inference task, the works [40, 41] stated that the energy consumption of a deep neural network layer for inference mainly includes computation energy consumption and data movement energy consumption. As a concrete example, we take GoogLeNet v1 [42] to illustrate the energy consumed by performing inference tasks. Specifically, we use GoogLeNet v1 to perform image classification tasks on the Eyeriss chip [43]. With the help of an online energy estimation tool [44], we are able to visualize the energy consumption breakdown of GoogLeNet v1, as illustrated in Fig. 2. We obtain the estimate of the computation power consumption by dividing the total energy consumption by the computation time. In particular, the computation time is determined by the total number of multiplication-and-accumulation (MAC) operations and the peak throughput of the Eyeriss chip.

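Therefore, the overall power consumption for edge AI inference, including transmission and computation power consumption, is calculated as

$$P_{\mathrm{overall}}(\mathcal{A}, \{\mathbf{v}_{nk}\}) := P_{\mathrm{trans}}(\mathcal{A}, \{\mathbf{v}_{nk}\}) + P_{\mathrm{comp}}(\mathcal{A}). \tag{6}$$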

### II-B Problem Formulation and Analysis

Note that there is a fundamental trade-off between transmission and computation power consumption. To be specific, more edge BSs performing the same task for MUs can significantly reduce the transmission power by exploiting higher transmit beamforming gains. However, this inevitably increases the computation power consumption for performing inference tasks. Therefore, the goal of an energy-efficient edge inference system can be achieved by minimizing the overall network power consumption to reach a balance between these two parts of power consumption.
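The trade-off can be illustrated numerically with a small sketch that evaluates the transmission power (4) and the computation power (5) for one MU served by one or two BSs. All numbers (efficiencies, computation powers, the required received gain) are hypothetical, and the helper `coherent_example` is introduced only for this illustration.

```python
import numpy as np

def overall_power(v, selected, eta, p_comp):
    """Return (transmission power (4), computation power (5)).

    v[n, k]        : beamforming vector at BS n for MU k
    selected[n, k] : True if k is in A_n
    eta[n]         : power amplifier efficiency of BS n
    p_comp[n, k]   : computation power P^c_nk
    """
    tx = float((selected * np.sum(np.abs(v) ** 2, axis=2) / eta[:, None]).sum())
    comp = float((selected * p_comp).sum())
    return tx, comp

def coherent_example(num_active, gain_target=4.0):
    """One MU, two single-antenna BSs with unit channels (hypothetical setup).

    The active BSs share the amplitude equally so that the coherent
    received gain |sum_n h_n v_n|^2 equals gain_target.
    """
    eta = np.array([0.5, 0.5])
    p_comp = np.array([[0.6], [0.6]])
    selected = np.zeros((2, 1), dtype=bool)
    selected[:num_active] = True
    v = np.zeros((2, 1, 1), dtype=complex)
    v[:num_active] = np.sqrt(gain_target) / num_active
    return overall_power(v, selected, eta, p_comp)

for m in (1, 2):
    tx, comp = coherent_example(m)
    print(m, tx, comp, tx + comp)
```

With these numbers, adding the second BS halves the transmission power while doubling the computation power; whether the total goes down depends on the relative size of $P^{c}_{nk}$, which is exactly what the task selection has to decide.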

Let $\gamma_k$ be the target SINR for MU $k$ to successfully receive reliable AI inference results in the downlink. In our proposed energy-efficient edge AI inference system, the overall power minimization problem is thus formulated as

$$\begin{aligned} \min_{\mathcal{A}, \{\mathbf{v}_{nk}\}} \quad & P_{\mathrm{overall}}(\mathcal{A}, \{\mathbf{v}_{nk}\}) \\ \text{s.t.} \quad & \mathrm{SINR}_k(\mathcal{A}) \geq \gamma_k, \ \forall k \in \mathcal{K}, \\ & \sum_{k \in \mathcal{K}} \|\mathbf{v}_{nk}\|_2^2 \leq P^{\max}_n, \ \forall n \in \mathcal{N}, \end{aligned} \tag{7}$$

where $P^{\max}_n$ denotes the maximum transmit power of edge BS $n$.

Unfortunately, problem (7) turns out to be a mixed combinatorial optimization problem due to the presence of the combinatorial variable $\mathcal{A}$, which makes it computationally intractable. In addition, the nonconvex SINR constraints pose further challenges for solving (7). To address these issues, we recast problem (7) into a tractable formulation by inducing the group sparsity of the beamforming vector in the following section.

### II-C A Group Sparse Beamforming Representation Framework

One naive approach to cope with the combinatorial variable $\mathcal{A}$ is exhaustive search. However, it is often computationally prohibitive owing to its exponential complexity. As a practical alternative, a critical observation is that such a combinatorial variable can be eliminated by exploiting the inherent connection between task selection and the group sparsity structure of the beamforming vectors. Specifically, if edge BS $n$ does not perform the inference task of MU $k$ (i.e., $k \notin \mathcal{A}_n$), then it will not deliver the inference result in the downlink transmission (i.e., $\mathbf{v}_{nk} = \mathbf{0}$). In other words, if $k \notin \mathcal{A}_n$, all coefficients in the beamforming vector $\mathbf{v}_{nk}$ are zero simultaneously. Mathematically, we have $\mathcal{A}_n = \{k \mid \mathbf{v}_{nk} \neq \mathbf{0}, k \in \mathcal{K}\}$ for all $n \in \mathcal{N}$, meaning that the task selection strategy $\mathcal{A}$ can be uniquely determined by the group sparsity structure of the beamforming vectors. In this respect, the overall network power consumption in problem (7) can be rewritten as

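$$P_{\mathrm{sparse}}(\{\mathbf{v}_{nk}\}) := \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{1}{\eta_n} \|\mathbf{v}_{nk}\|_2^2 + \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{I}(\mathbf{v}_{nk} \neq \mathbf{0})\, P^{c}_{nk}. \tag{8}$$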
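The correspondence between task selection and group sparsity can be sketched in a few lines: the set of tasks performed by BS $n$ is read off from which groups $\mathbf{v}_{nk}$ are nonzero. The array layout and the numerical threshold `tol` are assumptions of this sketch (in exact arithmetic the test is simply $\mathbf{v}_{nk} \neq \mathbf{0}$).

```python
import numpy as np

def task_selection(v, tol=1e-8):
    """Recover the selected task set of each BS from the group sparsity of v.

    v[n, k] is the beamforming vector at BS n for MU k. A group whose l2-norm
    is at most tol (an assumed numerical threshold) is treated as zero.
    Returns a list with one set of MU indices per BS.
    """
    group_norms = np.linalg.norm(v, axis=2)
    return [set(np.flatnonzero(row > tol).tolist()) for row in group_norms]

# Toy usage (hypothetical values): 2 BSs, 3 MUs, 2 antennas per BS.
v = np.zeros((2, 3, 2), dtype=complex)
v[0, 1] = [1.0, 0.0]
v[1, 0] = [0.0, 2.0j]
v[1, 2] = [1e-12, 0.0]      # numerically negligible, treated as switched off
print(task_selection(v))
```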

By considering the sparsity structure in the beamforming vectors, the SINR expression (3) is transformed into

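$$\begin{aligned} \mathrm{SINR}_k &= \frac{\left| \sum_{n \in \mathcal{N}} \mathbf{h}_{nk}^{\mathsf{H}} \mathbf{v}_{nk} \right|^2}{\sum_{l \neq k} \left| \sum_{n \in \mathcal{N}} \mathbf{h}_{nk}^{\mathsf{H}} \mathbf{v}_{nl} \right|^2 + \sigma_k^2} \\ &= \frac{|\mathbf{h}_k^{\mathsf{H}} \mathbf{v}_k|^2}{\sum_{l \neq k} |\mathbf{h}_k^{\mathsf{H}} \mathbf{v}_l|^2 + \sigma_k^2}, \ \forall k \in \mathcal{K}, \end{aligned} \tag{9}$$

where $\mathbf{h}_k = [\mathbf{h}_{1k}^{\mathsf{H}}, \dots, \mathbf{h}_{Nk}^{\mathsf{H}}]^{\mathsf{H}}$ and $\mathbf{v}_k = [\mathbf{v}_{1k}^{\mathsf{H}}, \dots, \mathbf{v}_{Nk}^{\mathsf{H}}]^{\mathsf{H}}$ are the aggregated channel vector and downlink transmit beamforming vector for MU $k$, respectively.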

On the other hand, since an arbitrary phase rotation of the transmit beamforming vectors does not affect the downlink SINR constraints and the objective function value, we can always find proper phases to equivalently transform the SINR constraints in (7) into convex second-order cone constraints [45]. We thus have the following convex-constrained sparse optimization framework for network power minimization

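$$\begin{aligned} \min_{\{\mathbf{v}_{nk}\}} \quad & P_{\mathrm{sparse}}(\{\mathbf{v}_{nk}\}) \\ \text{s.t.} \quad & \sum_{k \in \mathcal{K}} \|\mathbf{v}_{nk}\|_2^2 \leq P^{\max}_n, \ \forall n \in \mathcal{N}, \\ & \sqrt{\sum_{l \neq k} |\mathbf{h}_k^{\mathsf{H}} \mathbf{v}_l|^2 + \sigma_k^2} \leq \frac{1}{\sqrt{\gamma_k}} \mathfrak{R}(\mathbf{h}_k^{\mathsf{H}} \mathbf{v}_k), \ \forall k \in \mathcal{K}. \end{aligned} \tag{10}$$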
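The phase-rotation argument can be checked numerically. The sketch below rotates each aggregated beamformer so that $\mathbf{h}_k^{\mathsf{H}} \mathbf{v}_k$ becomes real and nonnegative, and then tests the second-order cone form of the SINR constraints in (10). The function names and the stacked array layout (`h_agg[k]`, `v_agg[k]`) are assumptions of this illustration.

```python
import numpy as np

def rotate_phases(h_agg, v_agg):
    """Rotate each v_k by a unit-modulus scalar so that h_k^H v_k is real and >= 0.

    h_agg[k], v_agg[k]: aggregated channel and beamformer of MU k.
    The rotation changes neither |h_k^H v_l| for any k, l nor ||v_k||_2.
    """
    g = np.einsum("ka,ka->k", h_agg.conj(), v_agg)      # h_k^H v_k
    return v_agg * np.exp(-1j * np.angle(g))[:, None]

def soc_satisfied(h_agg, v_agg, gamma, noise_var, tol=1e-9):
    """Check the second-order cone SINR constraints of (10) for every MU."""
    g = h_agg.conj() @ v_agg.T                          # g[k, l] = h_k^H v_l
    power = np.abs(g) ** 2
    interference = power.sum(axis=1) - np.diag(power)
    lhs = np.sqrt(interference + noise_var)
    rhs = np.real(np.diag(g)) / np.sqrt(gamma)
    return lhs <= rhs + tol

# Toy usage (hypothetical numbers): the SINR targets are met in magnitude,
# but the cone constraints only hold after the phases are aligned.
h_agg = np.eye(2, dtype=complex)
v_agg = np.array([[2j, 0], [0, -3]], dtype=complex)
gamma, noise_var = np.ones(2), np.ones(2)
print(soc_satisfied(h_agg, v_agg, gamma, noise_var))
print(soc_satisfied(h_agg, rotate_phases(h_agg, v_agg), gamma, noise_var))
```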

However, problem (10) is still nonconvex due to the indicator function in the objective function. As presented in [29, Proposition 1], a weighted mixed $\ell_{1,2}$-norm can serve as the tightest convex surrogate of the objective in (10), i.e.,

$$P(\{\mathbf{v}_{nk}\}) = 2 \sum_{n=1}^{N} \sum_{k=1}^{K} \sqrt{P^{c}_{nk} / \eta_n}\, \|\mathbf{v}_{nk}\|_2. \tag{11}$$
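One informal way to see where the weight $2\sqrt{P^{c}_{nk}/\eta_n}$ in (11) comes from (this is a heuristic remark, not the proof given in [29]) is the arithmetic-geometric mean inequality applied to the per-task cost of an active task, i.e., its transmission term from (4) plus its computation term from (5):

```latex
\frac{1}{\eta_n}\,\|\mathbf{v}_{nk}\|_2^2 + P^{c}_{nk}
  \;\ge\; 2\sqrt{\frac{1}{\eta_n}\,\|\mathbf{v}_{nk}\|_2^2 \cdot P^{c}_{nk}}
  \;=\; 2\sqrt{\frac{P^{c}_{nk}}{\eta_n}}\;\|\mathbf{v}_{nk}\|_2 ,
\qquad \text{with equality iff } \|\mathbf{v}_{nk}\|_2 = \sqrt{\eta_n P^{c}_{nk}} .
```

Each summand of (11) therefore lower-bounds the cost of the corresponding active task and vanishes when $\mathbf{v}_{nk} = \mathbf{0}$, which matches the zero cost of an unselected task.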

In this paper, we instead propose to adopt a new group sparsity inducing function for inference tasks selection via enhancing sparsity, thereby further reducing the network power consumption.

## III A Log-sum Function Based Three-stage Group Sparse Beamforming Framework

In this section, we shall propose to adopt the log-sum function to enhance the group sparsity of the beamforming vector, followed by describing the log-sum function based three-stage GSBF approach. In particular, we propose a proximal iteratively reweighted algorithm to address the log-sum minimization problem in the first stage.

### III-A Log-sum Function for Enhancing Group Sparsity

Let $\mathbf{v}$ denote the aggregated beamforming vector that stacks all $\mathbf{v}_{nk}$. To promote the group sparsity of the beamforming vector $\mathbf{v}$, we propose to use the following weighted nonconvex log-sum function as an approximation of the objective

$$\Omega(\mathbf{v}) := \sum_{n=1}^{N} \sum_{k=1}^{K} \rho_{nk} \log(1 + p \|\mathbf{v}_{nk}\|_2), \tag{12}$$

where $\rho_{nk} > 0$ is a weight coefficient and $p > 0$ is a tunable parameter. The main motivation for adopting such a log-sum penalty among various types of sparsity-inducing functions [27] is based on the following considerations:

• The mixed $\ell_{1,2}$-norm behaves like an $\ell_1$-norm applied to the vector of group norms and therefore offers the tightest convex relaxation of the $\ell_0$-norm. In contrast, it has been reported that the log-sum function can enhance the sparsity of the solution significantly more than the conventional $\ell_1$-norm [27, 34].

• From the perspective of performance and theoretical analysis of the designed algorithm, the log-sum function is more practical due to its coercivity and the boundedness of its first derivative.
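The first consideration can be illustrated with a small numerical sketch comparing the log-sum penalty (12) with a weighted mixed $\ell_{1,2}$-norm on two beamformers that carry the same total group norm, one spread over two groups and one concentrated in a single group. The function names and the test values are assumptions of this illustration.

```python
import numpy as np

def log_sum_penalty(v, rho, p):
    """Omega(v) in (12): sum over (n, k) of rho_nk * log(1 + p * ||v_nk||_2)."""
    return float(np.sum(rho * np.log1p(p * np.linalg.norm(v, axis=2))))

def mixed_l12_norm(v, rho):
    """Weighted mixed l_{1,2}-norm: sum over (n, k) of rho_nk * ||v_nk||_2."""
    return float(np.sum(rho * np.linalg.norm(v, axis=2)))

# Toy usage (hypothetical values): 1 BS, 2 MUs, 1 antenna.
rho = np.ones((1, 2))
dense = np.array([[[1.0], [1.0]]])     # two active groups of norm 1
sparse = np.array([[[2.0], [0.0]]])    # one active group of norm 2
print(mixed_l12_norm(dense, rho), mixed_l12_norm(sparse, rho))
print(log_sum_penalty(dense, rho, 10.0), log_sum_penalty(sparse, rho, 10.0))
```

The mixed norm cannot distinguish the two candidates, whereas the concave log-sum penalty assigns a strictly smaller value to the group-sparse one, since $\log(1+2p) < 2\log(1+p)$ for every $p > 0$.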

### III-B A Log-sum Function Based Three-stage Group Sparse Beamforming Approach

We now present the proposed log-sum based three-stage GSBF framework. Specifically, the first stage solves the log-sum convex-constrained problem via the proposed proximal iteratively reweighted algorithm to obtain a solution; the second stage prioritizes the tasks based on the obtained solution and the system parameters, and then determines the task selection strategy $\mathcal{A}^*$; with $\mathcal{A}^*$ fixed, the third stage refines the beamforming vectors. Details are given as follows.

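Stage 1: Log-sum Function Minimization. In this first stage, we obtain the group sparsity structure of the beamformer by solving the following nonconvex program

$$\min_{\mathbf{v}} \ \Omega(\mathbf{v}) \quad \text{s.t.} \quad \mathbf{v} \in \mathcal{C}, \tag{13}$$

where $\mathcal{C}$ denotes the set defined by the convex constraints in (10).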

However, the nonconvex and nonsmooth objective in (13) and the presence of the convex constraints usually pose challenges in computation and analysis. Inspired by the work of [34], we can iteratively minimize the objective by solving a sequence of tractable convex subproblems. The main idea of our presented algorithm is to solve a well-constructed convex surrogate subproblem instead of directly solving the original nonconvex problem.

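Let $z(\mathbf{v}_{nk}) = \|\mathbf{v}_{nk}\|_2$ and $f_p(z) = \log(1 + pz)$, so that $\Omega(\mathbf{v}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \rho_{nk} f_p(z(\mathbf{v}_{nk}))$. First observe that $f_p \circ z$ is a composite function with $z$ convex and $f_p$ nonconvex. At the $i$th iterate $\mathbf{v}^{[i]}$, for any feasible $\tilde{\mathbf{v}}$, we have

$$\begin{aligned} f_p(z(\tilde{\mathbf{v}}_{nk})) &\leq f_p(z(\mathbf{v}^{[i]}_{nk})) + \langle w^{[i]}, z(\tilde{\mathbf{v}}_{nk}) - z(\mathbf{v}^{[i]}_{nk}) \rangle \\ &\leq f_p(z(\mathbf{v}^{[i]}_{nk})) + \langle w^{[i]}, z(\tilde{\mathbf{v}}_{nk}) - z(\mathbf{v}^{[i]}_{nk}) \rangle + \frac{\beta}{2} \|\tilde{\mathbf{v}}_{nk} - \mathbf{v}^{[i]}_{nk}\|_2^2, \end{aligned} \tag{14}$$

where $w^{[i]}$ is the derivative of $f_p$ at $z(\mathbf{v}^{[i]}_{nk})$ and $\beta > 0$ is the prescribed proximity parameter. The first inequality holds because $f_p$ is concave on $[0, \infty)$ (equivalently, by the subgradient inequality applied to the convex function $-f_p$), and the second holds because the added proximal term is nonnegative. Hence, a convex subproblem is derived as an approximation of $\Omega$ at the current iterate $\mathbf{v}^{[i]}$, which reads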

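$$\min_{\{\mathbf{v}_{nk}\}} \ \sum_{n=1}^{N} \sum_{k=1}^{K} w^{[i]}_{nk} \|\mathbf{v}_{nk}\|_2 + \frac{\beta}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \|\mathbf{v}_{nk} - \mathbf{v}^{[i]}_{nk}\|_2^2 \quad \text{s.t.} \quad \mathbf{v} \in \mathcal{C} \tag{15}$$

with weights

$$w^{[i]}_{nk} = \rho_{nk} \cdot \partial\big(f_p(z(\mathbf{v}^{[i]}_{nk}))\big) = \frac{p \rho_{nk}}{p \|\mathbf{v}^{[i]}_{nk}\|_2 + 1}. \tag{16}$$

As presented in [34], a smaller $\|\mathbf{v}^{[i]}_{nk}\|_2$ causes a larger $w^{[i]}_{nk}$, which then drives the nonzero components of $\mathbf{v}_{nk}$ towards zero aggressively. Overall, to enhance the group sparsity structure of the beamforming vector, the proposed proximal iteratively reweighted algorithm is illustrated in Algorithm 1.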
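The reweighting mechanism can be sketched as follows. The weights follow (16) exactly; the proximal step, however, is a simplification in which the constraint $\mathbf{v} \in \mathcal{C}$ of (15) is dropped (an assumption of this sketch), so that the subproblem separates over groups and reduces to a closed-form group soft-thresholding. With the constraints in place, (15) would instead be handed to a convex solver.

```python
import numpy as np

def reweight(v, rho, p):
    """Weights (16): w_nk = p * rho_nk / (p * ||v_nk||_2 + 1)."""
    return p * rho / (p * np.linalg.norm(v, axis=2) + 1.0)

def prox_step_unconstrained(v, w, beta):
    """Minimize sum_nk w_nk ||u_nk||_2 + (beta / 2) ||u - v||_2^2 over u.

    ASSUMPTION: the constraint set C of (15) is ignored here, which leaves
    the closed-form group soft-thresholding u_nk = max(0, 1 - w_nk / (beta ||v_nk||_2)) v_nk.
    """
    norms = np.linalg.norm(v, axis=2)
    with np.errstate(divide="ignore", invalid="ignore"):
        scale = np.where(norms > 0,
                         np.maximum(0.0, 1.0 - w / (beta * norms)),
                         0.0)
    return v * scale[:, :, None]

# Toy usage (hypothetical values): one strong and one weak group.
v = np.array([[[2.0], [0.1]]], dtype=complex)
rho = np.ones((1, 2))
w = reweight(v, rho, p=10.0)
v_next = prox_step_unconstrained(v, w, beta=2.0)
print(w)
print(np.abs(v_next[..., 0]))
```

In this toy run the weak group receives a much larger weight and is set exactly to zero, while the strong group is only mildly shrunk, which is the sparsity-enhancing behavior described above.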

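Stage 2: Task Selection. In this second stage, an ordering guideline is applied to determine the priority of the inference tasks, guided by the solution obtained in Stage 1. For ease of notation, let the set of all tasks be indexed by the pairs $(n, k)$ with $n \in \mathcal{N}$ and $k \in \mathcal{K}$. By considering the key system parameters (e.g., $\mathbf{h}_{nk}$, $\eta_n$ and $P^{c}_{nk}$), the priority of task $(n, k)$ is heuristically given as

$$\theta_{nk} = \sqrt{\frac{\|\mathbf{h}_{nk}\|_2^2\, \eta_n}{P^{c}_{nk}}}\; \|\mathbf{v}_{nk}\|_2. \tag{17}$$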

Intuitively, if edge BS $n$ has a lower beamformer gain, a lower power amplifier efficiency and a lower channel power gain, but a higher computation power consumption for MU $k$, then task $(n, k)$ has a lower priority. A lower $\theta_{nk}$ indicates that the task of MU $k$ has lower priority and may not be performed by BS $n$. Thus, the tasks are arranged according to rule (17) in descending order. That is, the task priorities satisfy $\theta_{\pi(1)} \geq \theta_{\pi(2)} \geq \cdots \geq \theta_{\pi(NK)}$, where $\pi$ denotes the permutation of task indexes.
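The ordering rule (17) can be sketched in NumPy as below. The array layout and the helper names `task_priorities` and `priority_order` are assumptions of this illustration; ties are broken by the original index order, which the text does not specify.

```python
import numpy as np

def task_priorities(h, v, eta, p_comp):
    """theta_nk in (17) for every task (n, k)."""
    channel_gain = np.sum(np.abs(h) ** 2, axis=2)     # ||h_nk||_2^2
    beam_norm = np.linalg.norm(v, axis=2)             # ||v_nk||_2
    return np.sqrt(channel_gain * eta[:, None] / p_comp) * beam_norm

def priority_order(theta):
    """Task indices (n, k) sorted by descending priority (stable on ties)."""
    flat = np.argsort(-theta, axis=None, kind="stable")
    return [tuple(int(i) for i in np.unravel_index(f, theta.shape)) for f in flat]

# Toy usage (hypothetical values): 2 single-antenna BSs, 2 MUs, unit channels.
h = np.ones((2, 2, 1), dtype=complex)
v = np.array([[[1.0], [1.0]], [[3.0], [0.0]]], dtype=complex)
eta = np.array([1.0, 0.25])
p_comp = np.array([[1.0, 4.0], [1.0, 1.0]])
theta = task_priorities(h, v, eta, p_comp)
print(theta)
print(priority_order(theta))
```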

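We then solve a sequence of convex feasibility detection problems to obtain the task selection strategy $\mathcal{A}^*$,

$$\text{find} \ \mathbf{v} \quad \text{s.t.} \quad \mathbf{v}_{\pi(t)} = \mathbf{0}, \ \mathbf{v} \in \mathcal{C}, \tag{18}$$

where $\pi(t)$ indexes the tasks that are switched off at step $t$, and $t$ is increased until (18) becomes feasible. Here $\mathbf{v}_{\pi(t)} = \mathbf{0}$ are convex constraints, meaning that all beamformer coefficients are zero for the tasks indexed by $\pi(t)$. The support set of the beamformer $\mathbf{v}$ is the set of task indices $(n, k)$ with $\mathbf{v}_{nk} \neq \mathbf{0}$, from which the optimal task selection strategy $\mathcal{A}^*$ can be derived.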
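One possible reading of this search is sketched below: the tasks are kept in descending priority order, the $t$ highest-priority tasks are left active while all remaining beamformers are forced to zero, and $t$ grows until the feasibility problem (18) admits a solution. The feasibility test itself is a convex program and is represented here by a hypothetical callable `is_feasible`; the starting value `min_active` and the toy oracle are likewise assumptions of this sketch.

```python
def select_tasks(order, is_feasible, min_active=1):
    """Smallest prefix of the priority order for which (18) is feasible.

    order       : list of tasks (n, k) sorted by descending priority
    is_feasible : HYPOTHETICAL oracle; is_feasible(active) should report whether
                  some v in C exists with v_nk = 0 for every task not in `active`
    min_active  : assumed first value of t
    Returns the list of active tasks, or None if even all tasks are infeasible.
    """
    for t in range(min_active, len(order) + 1):
        active = order[:t]
        if is_feasible(active):
            return active
    return None

# Toy oracle (illustration only): "feasible" once every MU in {0, 1} is served
# by at least one BS. A real oracle would solve the convex problem (18).
def toy_oracle(active):
    return {k for _, k in active} == {0, 1}

order = [(1, 0), (0, 0), (0, 1), (1, 1)]
print(select_tasks(order, toy_oracle))
```

Because activating more tasks can only enlarge the feasible set, the first feasible prefix is also the one with the fewest active tasks along this ordering.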

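Stage 3: Solution Refinement. At this point, the task selection for each BS has been determined. With the obtained task index set fixed, we solve the following convex program to refine the beamforming vectors

$$\begin{aligned} \min_{\mathbf{v}} \quad & \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{1}{\eta_n} \|\mathbf{v}_{nk}\|_2^2 + \sum_{n=1}^{N} \sum_{k \in \mathcal{A}^*} P^{c}_{nk} \\ \text{s.t.} \quad & \mathbf{v}_{\pi(t)} = \mathbf{0}, \ \mathbf{v} \in \mathcal{C}. \end{aligned} \tag{19}$$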

Overall, our proposed log-sum function based three-stage GSBF framework for solving (7) can be presented in Algorithm 2.

## IV Global Convergence Analysis

In this section, we provide the global convergence for Algorithm 1. Specifically, we derive the first-order necessary optimality condition to characterize the optimal solutions. We then establish convergence results for a subsequence of the sequence generated by Algorithm 1. Furthermore, we show that for any initial feasible point, the entire sequence must have cluster points, and any cluster point satisfies the established first-order optimality condition. Finally, the ergodic worst-case convergence rate of the optimality residual is derived.

### IV-A First-order Necessary Optimality Condition

In this subsection, we derive the first-order necessary conditions to characterize the optimal solution of (13). Problem (13) is equivalently rewritten as

$$\min_{\mathbf{v}} \ J(\mathbf{v}) := \Omega(\mathbf{v}) + \delta_{\mathcal{C}}(\mathbf{v}). \tag{20}$$

Similarly, for the derived subproblem (15), we have

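$$\min_{\mathbf{v}} \ G(\mathbf{v}; \mathbf{v}^{[i]}) := \sum_{n=1}^{N} \sum_{k=1}^{K} w^{[i]}_{nk} \|\mathbf{v}_{nk}\|_2 + \frac{\beta}{2} \|\mathbf{v} - \mathbf{v}^{[i]}\|_2^2 + \delta_{\mathcal{C}}(\mathbf{v}). \tag{21}$$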

Due to the nonconvex and nonsmooth nature of the log-sum function, we make use of the Fréchet subdifferential as the major tool in our analysis. Its definition is introduced as follows.

###### Definition 1 (Fréchet subdifferential [26])

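Let $X$ be a real Banach space, let $X^*$ denote the corresponding topological dual, and let $f$ be a function from $X$ into the extended real line $\mathbb{R} \cup \{+\infty\}$, finite at $r$. The set

$$\partial_F f(r) = \left\{ r^* \in X^* \;\middle|\; \liminf_{u \to r} \frac{f(u) - f(r) - \langle r^*, u - r \rangle}{\|u - r\|_2} \geq 0 \right\}$$

is called the Fréchet subdifferential of $f$ at $r$. Its elements are referred to as Fréchet subgradients.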
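As a quick sanity check of Definition 1 (a worked scalar example that is not part of the original analysis), take $X = \mathbb{R}$, $f(u) = \log(1 + p|u|)$ with $p > 0$, and $r = 0$. Since $f(0) = 0$ and $\log(1 + p|u|)/|u| \to p$ as $u \to 0$, the defining condition becomes:

```latex
\liminf_{u \to 0} \frac{\log(1 + p|u|) - r^{*} u}{|u|}
  \;=\; \lim_{u \to 0} \frac{\log(1 + p|u|)}{|u|} \;-\; \limsup_{u \to 0} \frac{r^{*} u}{|u|}
  \;=\; p - |r^{*}| \;\ge\; 0
\quad\Longleftrightarrow\quad |r^{*}| \le p ,
\qquad\text{hence}\qquad \partial_F f(0) = [-p,\, p] .
```

The set is bounded even though $f$ is nonconvex and nonsmooth at the origin, and its radius $p$ is the derivative of $\log(1 + pz)$ at $z = 0$.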

Several important properties of the Fréchet subdifferential [26] are listed below, which are used to characterize the optimal solution of (13).

###### Proposition 1

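Let $\mathcal{C}$ be a closed and convex set. Then the following properties of Fréchet subdifferentials hold true.

• If $f$ is Fréchet subdifferentiable at $\bar{x}$ and attains a local minimum at $\bar{x}$, then $0 \in \partial_F f(\bar{x})$.

• Let $h$ be Fréchet subdifferentiable at $z(\bar{x})$ with $z$ being convex; then $h \circ z$ is Fréchet subdifferentiable at $\bar{x}$ such that

$$\bar{y}\, \partial z(\bar{x}) \subset \partial_F (h \circ z)(\bar{x})$$

for any $\bar{y} \in \partial_F h(z(\bar{x}))$.

• $\partial_F \delta_{\mathcal{C}}(\bar{x}) = N_{\mathcal{C}}(\bar{x})$, the normal cone of $\mathcal{C}$ at $\bar{x}$, for closed and convex sets $\mathcal{C}$.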

The following Fermat’s rule [46] describes the necessary optimality condition of problem (13).

###### Theorem 1 (Fermat’s rule)

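If (20) attains a local minimum at $\mathbf{v}$, then it holds true that

$$\mathbf{0} \in \partial_F J(\mathbf{v}) := \partial_F \Omega(\mathbf{v}) + N_{\mathcal{C}}(\mathbf{v}). \tag{22}$$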

We next investigate the properties of in the following Proposition 2, indicating that the Fréchet subdifferentials of at is bounded.

###### Proposition 2

If , then for any . In particular, is any element of .

To explore the behavior of the proposed proximal iteratively reweighted algorithm, based on Theorem 1 and Proposition 2, we define the optimality residual associated with (20) at a point as

$$r^{[i]} := w^{[i]} \odot x^{[i]} + u^{[i]}, \tag{23}$$

where and . Since , it implies that if then satisfies the first-order necessary optimality condition (22). We adopt to measure the convergence rate of our algorithm.

Moreover, we provide the first-order optimality condition of the subproblem (21) as follows

$$0 = \partial G(v; v^{[i]}) = \beta\,(v^{[i+1]} - v^{[i]}) + w^{[i]} \odot x^{[i+1]} + u^{[i+1]}, \tag{24}$$

where , and . Note that the existence of optimal solution to (21) simply follows from the convexity and the coercivity of the objective .

Now we show that an optimal solution of (21) also satisfies the first-order necessary optimality condition of (20) in the following lemma.

###### Lemma 1

$v^{[i]}$ satisfies the first-order necessary optimality condition of (20) if and only if

$$v^{[i]} = \operatorname*{arg\,min}_{v}\; G(v; v^{[i]}).$$

###### Proof:

Please refer to Appendix A for details.

Define the model reduction caused by $v^{[i+1]}$ at the point $v^{[i]}$ as

$$\Delta G(v^{[i+1]}; v^{[i]}) := G(v^{[i]}; v^{[i]}) - G(v^{[i+1]}; v^{[i]}). \tag{25}$$

The new iterate causes a decrease in the objective , and the model reduction (25) converges to zero in the limit. Both results are established in the following Lemma 2.

###### Lemma 2

Suppose is generated by of Algorithm 1 with . The following statements hold true

• .

• .

• is monotonically decreasing. Indeed,

$$\Delta G(v^{[i+1]}; v^{[i]}) \ge \frac{\beta}{2}\|v^{[i]} - v^{[i+1]}\|_2^2.$$

###### Proof:

Please refer to Appendix B for details.
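The statements of Lemma 2 can be checked numerically on a toy instance. The sketch below rests on two assumptions that are not from the paper: the log-sum objective is taken to be $\sum_g \rho_g \log(1 + p\|v_g\|_2)$, and the constraint set is dropped so that each subproblem has a closed-form group soft-thresholding solution. Without constraints the iterates simply shrink toward zero, so the example only illustrates the monotone decrease and the bound $\Delta G \ge \frac{\beta}{2}\|v^{[i]} - v^{[i+1]}\|_2^2$, not the behavior of the full algorithm. All function names are hypothetical.

```python
import numpy as np

def log_sum(v, rho, p):
    # Assumed objective: sum_g rho_g * log(1 + p * ||v_g||_2)
    return float(np.sum(rho * np.log1p(p * np.linalg.norm(v, axis=1))))

def reweight(v, rho, p):
    # Derivative of rho_g * log(1 + p * z) with respect to z at z = ||v_g||_2
    return rho * p / (1.0 + p * np.linalg.norm(v, axis=1))

def model_G(v, v_prev, w, beta):
    # Weighted group norm plus proximal term (constraint term omitted)
    group_term = np.sum(w * np.linalg.norm(v, axis=1))
    return float(group_term + 0.5 * beta * np.sum((v - v_prev) ** 2))

def prox_step(v_prev, w, beta):
    # Closed-form minimizer of model_G(., v_prev, w, beta): group soft-thresholding
    norms = np.linalg.norm(v_prev, axis=1)
    scale = np.maximum(0.0, 1.0 - w / (beta * np.maximum(norms, 1e-300)))
    return scale[:, None] * v_prev

def run(v0, rho, p, beta, iters):
    """Return a list of (objective, model reduction, squared step) per iteration."""
    v, history = np.asarray(v0, dtype=float), []
    for _ in range(iters):
        w = reweight(v, rho, p)
        v_next = prox_step(v, w, beta)
        delta_G = model_G(v, v, w, beta) - model_G(v_next, v, w, beta)
        step_sq = float(np.sum((v - v_next) ** 2))
        history.append((log_sum(v, rho, p), delta_G, step_sq))
        v = v_next
    return history

rng = np.random.default_rng(0)
history = run(rng.standard_normal((6, 4)), rho=np.ones(6), p=2.0, beta=1.0, iters=30)
for objective, delta_G, step_sq in history[:5]:
    print(objective, delta_G, 0.5 * 1.0 * step_sq)
```

Each printed row shows the objective value, the model reduction, and the lower bound $\frac{\beta}{2}\|v^{[i]} - v^{[i+1]}\|_2^2$. Up to floating-point error, the second column should never fall below the third.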

We now provide our main result in the following Theorem 2.

###### Theorem 2

Suppose is generated by Algorithm 1 with . It holds true that must be bounded and any cluster point of satisfies the first-order necessary optimality condition of (20).

###### Proof:

Please refer to Appendix C for details.

### IV-B Ergodic Worst-case Convergence Rate

In this subsection, we show that the presented proximal iteratively reweighted algorithm has ergodic worst-case convergence rate in terms of the optimality residual. The following lemma states that the optimality residual is bounded above in terms of the displacement of the iterates.

###### Lemma 3

The optimality residual associated with problem (20) satisfies

$$\|r^{[i+1]}\|_2^2 \le \left(\beta^2 + 2\beta\kappa p^2 + \kappa^2 p^4\right)\|v^{[i]} - v^{[i+1]}\|_2^2$$

with 111 denotes the maximum elements among for all , ..

###### Proof:

Please refer to Appendix D for details.

The subproblem (21) is referred to as the primal problem, and by exploiting the conjugate function [46], the associated Fenchel-Rockafellar dual is constructed as

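$$\begin{aligned} \max_{\lambda,\mu}\quad & Q(\lambda,\mu; v^{[i]}) \\ \text{s.t.}\quad & \|\lambda_{nk}\|_2 \le 1,\ \forall n \in \mathcal{N},\ k \in \mathcal{K}, \end{aligned} \tag{26}$$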

where the dual objective is given as , and the technical details of constructing (26) are provided in Appendix E.

The Fenchel-Rockafellar duality theorem [46] states that the optimal value of (26) provides a lower bound on the optimal value of (21). Moreover, the gap between the primal objective value of (21) and the corresponding dual objective value of (26) at the $i$-th iterate is defined as

$$g(v, \lambda, \mu; v^{[i]}) := G(v; v^{[i]}) - Q(\lambda, \mu; v^{[i]}). \tag{27}$$

If this gap is zero, then the strong duality holds. That is, at the optimal solution , we have

$$\Delta G(v^{[i+1]}; v^{[i]}) = g(v^{[i+1]}, \lambda^{[i+1]}, \mu^{[i+1]}; v^{[i]}). \tag{28}$$

We now show that the duality gap vanishes asymptotically in the following theorem.

###### Theorem 3

Let be the sequence generated by Algorithm 1 with . Then has ergodic worst-case convergence rate.

###### Proof:

Please refer to Appendix F for details.
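For intuition about where an ergodic rate can come from, the results stated above can be chained together as follows. This is only a sketch of one possible argument and may differ from the proof in Appendix F. It assumes that the constant in Lemma 3 reads $\beta^2 + 2\beta\kappa p^2 + \kappa^2 p^4 = (\beta + \kappa p^2)^2$, and it uses the sufficient decrease in Lemma 2 together with the telescoping bound (34), where $\tilde{J}$ is a lower bound of $J$.

```latex
\begin{aligned}
\frac{1}{t+1}\sum_{i=0}^{t} \|r^{[i+1]}\|_2^2
  &\le \frac{(\beta + \kappa p^2)^2}{t+1} \sum_{i=0}^{t} \|v^{[i]} - v^{[i+1]}\|_2^2
  && \text{(Lemma 3)} \\
  &\le \frac{2(\beta + \kappa p^2)^2}{\beta\,(t+1)} \sum_{i=0}^{t} \Delta G(v^{[i+1]}; v^{[i]})
  && \text{(Lemma 2)} \\
  &\le \frac{2(\beta + \kappa p^2)^2 \big(J(v^{[0]}) - \tilde{J}\big)}{\beta\,(t+1)}
  && \text{(by (34))}.
\end{aligned}
```

Under these assumptions, the running average of the squared optimality residual decays at least as fast as a constant divided by $t+1$.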

## V Numerical Experiments

In this section, we use numerical experiments to validate the effectiveness of our proposed algorithms and illustrate the presented theoretical results. We compare the log-sum function based three-stage GSBF approach with the coordinated beamforming (CB) approach [47] and the mixed GSBF approach (Mixed GSBF) [29]. These two baselines are described below:

• CB considers minimizing the total transmit power consumption. In other words, all BSs are required to perform the inference tasks from all MUs.

• Mixed GSBF considers adopting the mixed -norm (i.e., the objective function in (13) is replaced with ) to induce group sparsity of the beamforming vector in Stage 1 of Algorithm 2.

For the experimental setup, we consider the edge AI inference system with -antennas, and single-antenna MUs that are all uniformly and independently distributed in a km km square region. The channel between BS and MU is set as , where the path-loss model is given by and is the Euclidean distance between BS and MU , is the small-scale fading coefficient, i.e., . We set W and specify W, and . Furthermore, for the proposed log-sum function based three-stage GSBF approach, we set , and initialize . In particular, we terminate the proximal iteratively reweighted algorithm either when it reaches the predefined maximum number of iterations or when it satisfies

$$\|w^{[i+1]} - w^{[i]}\|_1 \le \epsilon, \tag{29}$$

where $\epsilon$ is a prescribed tolerance.
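The termination rule (29) combined with an iteration cap can be sketched as a small driver loop. In the sketch below, `update_weights` is a hypothetical callable standing in for one full pass of the algorithm (solve the subproblem, then recompute the weights); it is not part of the paper, and the toy update used in the usage example is chosen only because it has a known fixed point.

```python
import numpy as np

def reweighting_loop(update_weights, w0, eps, max_iter):
    """Iterate w <- update_weights(w) until ||w_new - w||_1 <= eps
    or until max_iter iterations have been performed.

    Returns (weights, iterations used, converged flag).
    """
    w = np.asarray(w0, dtype=float)
    for it in range(1, max_iter + 1):
        w_new = np.asarray(update_weights(w), dtype=float)
        # Stopping test (29): l1 distance between successive weight vectors
        if np.sum(np.abs(w_new - w)) <= eps:
            return w_new, it, True
        w = w_new
    return w, max_iter, False

# Toy usage: a contraction with fixed point w = 2 in every coordinate.
w_final, n_iter, converged = reweighting_loop(
    lambda w: 0.5 * w + 1.0, w0=np.zeros(2), eps=1e-6, max_iter=100
)
print(w_final, n_iter, converged)
```

Returning the convergence flag alongside the iterate makes it easy to tell whether the loop stopped because of the tolerance or because the iteration budget ran out.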

### V-A Convergence of the Proximal Iteratively Reweighted Algorithm

The goal of this subsection is to illustrate the convergence behavior of the proposed proximal iteratively reweighted algorithm. The presented result is obtained for a typical channel realization. Fig. 3 illustrates the convergence of the proximal iteratively reweighted algorithm. We can see that steadily decreases with the iterations, which is consistent with our analysis in Lemma 2. Interestingly, we observe that the objective value of drops quickly in the first few iterations (less iterations), which indicates that the proposed proximal iteratively reweighted algorithm converges very fast. In view of this, we suggest terminating Algorithm 1 early in practice to obtain an approximate solution, which speeds up the entire algorithm while maintaining the overall performance.

### V-B Effectiveness of the Proposed Approach

We evaluate the performance of the three algorithms in terms of the overall network power consumption, the transmit power consumption and the number of computation tasks. The presented results are averaged over randomly and independently generated channel realizations.

Fig. 4 depicts the overall network power consumption of the three approaches with different target SINRs. First, we observe that all three approaches have higher total power consumption as the required SINR becomes more stringent. This is because more edge BSs are required to transmit the inference results for higher QoS. In addition, we can see that the CB approach has the highest power consumption among the three approaches, and the relative power difference between CB and the other two approaches can reach approximately when SINR is dB and approximately when SINR is dB, indicating the effectiveness of the joint task selection strategy and group sparse beamforming approach in minimizing the overall network power consumption. On the other hand, we can see that the proposed log-sum function based three-stage GSBF approach outperforms the mixed GSBF approach, which demonstrates that enhancing the group sparsity further reduces the overall network power consumption. In particular, we also observe that the performance gap between the blue and the red curves remains at approximately when SINR ranges from dB to dB, which indicates that the proposed log-sum function based three-stage GSBF approach is still attractive in the high SINR regime.

Tables I and II further report the number of inference tasks performed by edge BSs and the transmission power consumption, respectively. To be specific, in Table I, we observe that the number of performed inference tasks differs among the three approaches under various SINRs, which shows the effect of the task selection strategy. Besides, it is observed that the log-sum function based three-stage GSBF approach always performs fewer inference tasks than the mixed GSBF approach for target SINRs, which indicates that the log-sum function based three-stage GSBF approach can enhance the group sparsity pattern in the beamforming vector. Meanwhile, as observed in Table II, the CB approach has the lowest transmission power compared to the other two approaches because the CB approach only optimizes the transmission power consumption while performing all inference tasks. On the other hand, the transmission power consumption of the log-sum function based three-stage GSBF approach is slightly higher than that of the mixed GSBF approach under most SINRs. This is because more edge BSs participate in performing inference tasks in the mixed GSBF approach, resulting in a higher transmit beamforming gain for reducing transmission power. In other words, performing fewer inference tasks further reduces the computation power consumption of edge BSs but increases the transmission power consumption. Taken together, Fig. 4 and Tables I-II indicate that the proposed joint task selection strategy and GSBF approach find a good balance between computation power consumption and transmission power consumption, yielding the lowest network power consumption.

## VI Conclusion

In this paper, we developed an energy-efficient edge AI inference system through the joint selection of the inference tasks and optimization of the transmit beamforming vectors for minimizing the computation power consumption and the downlink transmission power consumption, respectively. Based on the critical insight that inference task selection can be achieved by controlling the group sparsity structure of the transmit beamforming vectors, we developed a group sparse optimization framework for network power minimization, for which a log-sum function based three-stage group sparse beamforming algorithm was developed to enhance group sparsity in the solutions. To solve the resulting nonconvex and nonsmooth log-sum function minimization problem, we further proposed a proximal iteratively reweighted algorithm. Furthermore, the global convergence analysis was provided, and a worst-case convergence rate in an ergodic sense was derived for this algorithm.

## Appendix A Proof of Lemma 1

Let and . If , by (24), we have

$$0 = w^{[i]} \odot x^{[i]} + u^{[i]}. \tag{30}$$

We conclude that satisfies (22), indicating that is first-order optimal for (20).

Conversely, if satisfies (22), then satisfies (30) by Proposition 2. Thus must be the optimal solution to the subproblem (15). This completes the proof.

## Appendix B Proof of Lemma 2

First of all, and is convex, so that . Since is concave, we have,

$$f_p(z) \le f_p(z_0) + \langle \nabla f_p(z_0),\, z - z_0 \rangle, \quad \forall z, z_0 \in \mathbb{R}^{L}_{+}. \tag{31}$$

Therefore

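$$\begin{aligned} J(v^{[i+1]}) = \Omega(v^{[i+1]}) &\le \Omega(v^{[i]}) + \sum_{n=1}^{N}\sum_{k=1}^{K} w_{nk}^{[i]} \left( \|v_{nk}^{[i+1]}\|_2 - \|v_{nk}^{[i]}\|_2 \right) + \frac{\beta}{2}\|v^{[i+1]} - v^{[i]}\|_2^2 \\ &= J(v^{[i]}) + G(v^{[i+1]}; v^{[i]}) - G(v^{[i]}; v^{[i]}), \end{aligned} \tag{32}$$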

where the first inequality follows from (31). This completes the first statement .

On the other hand, by (32), we have

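$$\Delta G(v^{[i+1]}; v^{[i]}) \le J(v^{[i]}) - J(v^{[i+1]}). \tag{33}$$

Summing both sides of (33) over $i = 0, \ldots, t$ yields

$$0 \le \sum_{i=0}^{t} \Delta G(v^{[i+1]}; v^{[i]}) \le J(v^{[0]}) - J(v^{[t+1]}) \le J(v^{[0]}) - \tilde{J}, \tag{34}$$

where $\tilde{J}$ is a lower bound of $J$. Letting $t \to \infty$, we have

$$\lim_{i \to \infty} \Delta G(v^{[i+1]}; v^{[i]}) = 0. \tag{35}$$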

This completes the second statement .

For the last statement , we follow the line of proof presented in [37]. By (24), we have

 0=β(v[i+1]−v