I Introduction
To handle diverse use cases and business models, a new technology called network slicing has been investigated extensively for fifth generation (5G) networks [1]. In network slicing, network slice instances are orchestrated and chained from a set of network functions to provide customized services. By flexibly supporting various applications, network slicing benefits 5G networks in a cost-efficient way. As an important part of network slicing, slicing in radio access networks (RANs) has been studied to further improve end-to-end network performance [2].
Although network slicing is a promising way to meet service requirements in 5G, several challenges remain. First, traditional core network slicing methods are business-driven only and neglect the characteristics of the RAN. However, network slicing differs across network architectures, such as heterogeneous networks and cloud RANs (CRANs) [3, 4], so jointly considering the characteristics of RANs and network slicing can be beneficial. Second, the performance requirements of emerging applications are becoming more stringent. To achieve huge capacity, massive connections and ultra-low latency, resource allocation should be carefully designed, covering not only radio but also caching and computing resources. Third, as indicated in TS 38.300 [5], it should be possible for a single RAN node to support multiple slices. Because nodes differ in capability, the node association strategy in a sliced RAN becomes critical.
Meanwhile, fog-RANs (FRANs) have been considered a revolutionary paradigm for meeting performance requirements in 5G [6]. By exploiting edge caching and computing, capacity burdens on the fronthaul are alleviated and end-to-end latency is shortened. According to its desired performance, each user equipment (UE) in a FRAN can select a proper communication mode: CRAN mode, fog-radio access point (FAP) mode, or device-to-device (D2D) mode. With adaptive mode selection and interference suppression, services and applications such as the industrial Internet, health monitoring and the Internet of vehicles can be well supported.
To exploit the prospect of network slicing in FRANs, a hierarchical RAN slicing architecture is presented in this paper. The proposed architecture, shown in Fig. 1, takes full advantage of both FRANs and network slicing. Following the principle of decoupling the control and data planes, the high power node (HPN) in the network access layer executes the functions of the control plane, including the delivery of control signaling and system broadcast information to the accessed traditional UEs and fog-UEs (FUEs). Once radio resource control connections are established, the network slice selection assistance information [5] is used to help traditional UEs and FUEs select a network slice. Multiple modes are provided in the data plane for differentiated handling of traffic. Specifically, the remote radio heads (RRHs) cooperate with each other through the baseband unit (BBU) pool, which provides the CRAN mode in the data plane. Thanks to fog computing, FAPs are used for local collaborative radio signal processing, and the D2D mode can be further triggered to meet performance requirements.
In the RAN slicing architecture, mode selection and resource allocation are critical for improving the performance of network slices. To achieve a high data rate, UEs should associate with RRHs to leverage large-scale centralized signal processing in the BBU pool. To alleviate transmission burdens on the fronthaul and save system power, local data processing should be available, which is enabled by FAPs and FUEs. Consequently, advanced mode selection is needed for UEs with different performance requirements. Note that data transmission under different modes consumes not only radio but also computing resources. For example, in CRAN mode, centralized processing and large-scale collaborative transmission require global coordination, scheduling and control, whose computing complexity typically increases polynomially with the network size [6]. Hence, it is important to coordinate computing and radio resources. To meet the performance requirements of traditional UEs and FUEs, both multidimensional resource management and communication mode selection in sliced FRANs should be handled carefully. Because the two are coupled, a joint optimization of mode selection and resource allocation is essential. To determine the best mode selection and coordinate the multidimensional resources, intelligent decision-making mechanisms are promising; they consider the channel states of different modes, the computing load at each FAP, the performance requirements of traditional UEs and FUEs, and the total power consumption.
Based on the aforementioned characteristics of mode selection and resource allocation, this paper studies a joint optimization solution for system power minimization in sliced FRANs.
I-A Related Work
FRANs have emerged as a promising 5G RAN architecture that can satisfy diverse quality of service (QoS) requirements. By coordinating communication, computation and caching, QoS requirements such as high spectral efficiency, high energy efficiency and low latency can be met for different service types. Many studies on FRANs have been conducted, such as computation offloading in [7] and edge caching strategies in [8, 9]. In [7], the impact of fog computing on energy consumption and delay performance is investigated. With queuing models established, a multi-objective optimization problem considering energy consumption, execution delay and payment cost is formulated. Using the scalarization method and the interior point method, superior performance over existing schemes is achieved. In [8], a joint optimization of caching and user association is studied. By decomposing the original problem, a distributed algorithm based on the Hungarian method is proposed. Simulation results show that with an efficient caching policy, the average download delay can be significantly reduced. In [9], a new metric called economical energy efficiency is adopted. With cache status and fronthaul capacity considered, a resource allocation problem is formulated and solved by fractional programming. The advantages of the proposed algorithm, including improved system greenness, are confirmed.
There have also been numerous works on RAN slicing, which demands efficient resource allocation, resource isolation and sharing [10]. In [2], the application of network slicing in an ultra-dense RAN is studied. To improve the quality of computation experience for mobile devices, the design of computation offloading policies is investigated. Considering the time-varying communication qualities and computation resources, a stochastic computation offloading problem is formulated and a deep reinforcement learning (DRL) framework is proposed, which achieves a significant improvement in computation offloading performance compared with baseline policies. In [3], a dynamic radio resource slicing framework is presented for a two-tier heterogeneous wireless network. By partitioning radio spectrum resources into different bandwidth slices for sharing, the framework achieves differentiated QoS provisioning for services in the presence of network load dynamics. In [4], two typical 5G services in a CRAN are considered and specific slice instances are orchestrated. To maximize the cloud RAN operator’s revenue, efficient approaches including successive convex approximation and semidefinite relaxation are exploited. With acceptable time complexity, the proposed algorithm significantly reduces system power consumption. In [11], hierarchical radio resource allocation is studied for RAN slicing in FRANs, where a global radio resource manager performs centralized subchannel allocation while local radio resource managers allocate the assigned resources to UEs to facilitate slice customization. In [12], network slicing in multi-cell virtualized wireless networks is considered. To maximize the network sum rate, a joint BS assignment, subcarrier, and power allocation algorithm is developed. Simulation results demonstrate that under the minimum required rate constraint of each slice, the proposed iterative algorithm outperforms the traditional approach, especially in terms of coverage and spectrum efficiency. In [13], the combinatorial optimization of multidimensional resources in network slicing is investigated. To deal with the dilemma between the network provider and tenants, a real-time resource slicing framework based on a semi-Markov decision process is developed, which considers the long-term return of the network provider and the uncertainty of resource demands from tenants. Taking advantage of a deep dueling neural network, the proposed framework can improve system performance significantly. In [14], a novel spectral efficiency approach is proposed for allocating resource blocks to different services. By learning in advance whether resources are adequate to provide a service, unsuccessful allocation attempts are avoided. Simulations show that the approach significantly improves the spectral efficiency with respect to a single-slot based model.
Note that challenges remain in RAN slicing. For example, increasingly complicated configuration issues and rapidly emerging performance requirements would be challenging in 5G, since the network can deal only with predefined problems. To realize an intelligent implementation of network slicing, artificial intelligence has attracted particular attention. By enabling networks to interact with their environments, a network can automatically recognize a new type of application, infer an appropriate provisioning mechanism and establish the required network slice [15]. Meanwhile, with network scenarios becoming heterogeneous and complicated, cost-efficient and low-complexity algorithms based on machine learning can be developed for practical implementations [16]. With network patterns and user behaviors learned and predicted, an intelligent decision-making system can be established to improve network performance.
There have been numerous works on applications of artificial intelligence and machine learning in wireless networks [17]. In [18], resource allocation schemes for vehicle-to-vehicle (V2V) communications are investigated. To avoid the large transmission overhead of the traditional centralized method, a novel decentralized resource allocation mechanism based on deep reinforcement learning is proposed. Each V2V link or vehicle acts as an independent agent and finds the optimal subband and transmission power autonomously. Simulation results show that each agent can effectively learn to satisfy the stringent latency constraints on V2V links while minimizing the interference to vehicle-to-infrastructure communications. In [19], applications of machine learning to improve heterogeneous network traffic control are studied. Based on traffic patterns at the edge routers, a supervised deep learning system is trained. The proposed system outperforms the benchmark routing strategy in terms of signaling overhead, throughput, and delay. In [20], a DRL-assisted resource allocation method is designed for ultra-dense networks. The original multi-objective problem is decoupled into two parts based on the general theory of DRL. Spectrum efficiency (SE) maximization is used to build the deep neural network, while the residual objectives, such as energy efficiency (EE) and fairness, are considered as the rewards to train it. Simulation results show that the proposed method significantly outperforms existing resource allocation algorithms in terms of the tradeoff among SE, EE and fairness. In [21], the joint SE and EE optimization in cognitive radio networks is studied and a deep-learning inspired message passing algorithm is proposed. To learn the optimal parameters of the algorithm, a feed-forward neural network is devised and an analogous back propagation algorithm is developed. The simulation results show that the proposed algorithm achieves a lower power consumption for secondary users accessing the licensed spectrum while preserving the capacity of the primary users.
In this paper, we focus on mode selection and resource allocation in a sliced FRAN, which is formulated as a mixed integer programming problem. To deal with this NP-hard problem, reinforcement learning (RL) is adopted to generate an efficient solution. Combining the strengths of both supervised and unsupervised learning methods, RL techniques have been widely used in wireless networks [22]. In [23], mode selection and resource allocation in D2D-enabled CRANs are investigated and a distributed approach based on RL is proposed, where D2D pairs perform self-optimization without global channel state information. In [24], a decentralized and self-organizing mechanism based on RL techniques is introduced to reduce inter-tier interference and improve spectral efficiency. Simulation results show that the proposed mechanism possesses better convergence properties and incurs less overhead than existing techniques. To offload the traffic in a stochastic heterogeneous cellular network, an online RL framework is presented in [25]. By modeling the problem as a discrete-time Markov decision process, the energy-aware traffic offloading problem is solved by a centralized Q-learning algorithm with a compact state representation.
I-B Main Contributions
Motivated by the benefits of machine learning, this paper considers the uplink of a sliced FRAN. In particular, an optimization framework for RAN slicing is presented, which takes the queue stability of traditional UEs and the bit rate requirements of FUEs into consideration. Both orthogonal and multiplexed subchannel strategies are considered. The main contributions of the paper are:

The joint optimization of mode selection and resource allocation in the uplink of a sliced FRAN is investigated, where traditional UEs and FUEs are served by constructed network slice instances. Both the orthogonal and multiplexed subchannel strategies are presented. Under different UEs’ demands and limited computing resources, a system power minimization problem is formulated, which is a stochastic mixed-integer program. Using the general Lyapunov optimization framework, this nonconvex optimization problem is transformed into a minimization of the drift-plus-penalty function, which can be further reformulated as a deterministic mode selection and resource allocation problem at each slot.

RL-based approaches are proposed to solve the reformulated mode selection and resource allocation problem. Unlike previous work in [3, 4, 11], this paper applies RL techniques to solve the drift-plus-penalty minimization under different subchannel allocation strategies. Specifically, communication modes are selected based on learned policies. Afterwards, the transmission powers of traditional UEs and FUEs are derived by a generalized weighted minimum mean-square error (WMMSE) approach. Through the RL-based approaches, a long-term system performance optimization can be achieved.

The proposed approaches are evaluated under different conditions. The impacts of different parameters, such as the computing resource, are evaluated. Simulations show that the RL-based approach can provide near-optimal performance. By changing the value of the defined tradeoff parameter, the tradeoff between traditional UEs’ queuing delay and system power consumption can be controlled in a flexible and efficient way.
The remainder of this paper is organized as follows. Section II introduces the system model, including the communication model and computing model. In Section III, the system power minimization problem is formulated and transformed into a deterministic problem based on the general Lyapunov optimization framework. In Section IV, both the orthogonal and multiplexed subchannel strategies are considered, which enable different levels of slice isolation, and corresponding RL-based algorithms are designed to solve the deterministic problem. Section V evaluates the performance of the proposed algorithms, followed by the conclusions in Section VI.
II System model
The system model is elaborated in this section, including the considered FRAN model, communication model and computing model.
II-A The FRAN model
The scenario considered in this paper is illustrated in Fig. 1. It assumes an FRAN architecture consisting of a terminal layer, a network access layer and a cloud computing layer. In the cloud computing layer, the BBU pool provides centralized signal processing. In the network access layer, there are distributed RRHs connected with the BBU pool, each of which is single-antenna. There are also FAPs configured with antennas. Owing to fog computing, collaborative radio signal processing can be executed not only in the centralized BBU pool but also at distributed FAPs. We also assume that the network operates in slotted time, with the time dimension partitioned into decision slots indexed by
There are single-antenna traditional UEs and single-antenna FUEs in the terminal layer, whose sets are denoted as and , respectively. Examples of traditional UEs include agricultural field monitoring sensors and industrial monitoring devices, which desire low power consumption and have random bursty traffic arrivals. FUEs can be smartphones or laptops [6], which are always equipped with a large buffer. To provide a high data rate for each FUE, a network slice instance is constructed, which is composed of multiple modes and the corresponding physical resources. In the CRAN mode, RRHs cooperate for uplink data reception and the BBU pool provides centralized signal detection and baseband processing. Moreover, FAPs are deployed for local service to alleviate the burden on the fronthaul. Similarly, both CRAN mode and FAP mode are available in the network slice instance dedicated to traditional UEs. However, the objective there is to maintain low power consumption and a stable transmission delay for traditional UEs. In addition, FUEs can benefit both network slice instances via the D2D mode. Specifically, FUEs relay the data traffic of other FUEs, which extends the coverage of the slice instance for FUEs, while in the slice instance for traditional UEs, FUEs aggregate the data to allow more traditional UEs to be connected simultaneously.
There are subchannels to be allocated, each with bandwidth . In this paper, we consider both the orthogonal and multiplexed subchannel strategies. In the former, subchannel is allocated to at most one traditional UE or FUE , which enables hard isolation between slice instances. In the latter, subchannel can be shared among multiple traditional UEs and FUEs, and the isolation between the slice instances has to be guaranteed by sophisticated mode selection and resource allocation. Although slice isolation in current works is guaranteed mainly through an orthogonal subchannel allocation strategy, it is still necessary to investigate a multiplexed subchannel allocation strategy to achieve higher spectrum utilization.
II-B The communication model
To achieve the rate requirement , FUE should connect to the proper FAP/RRHs. Denote the communication mode selection of FUE at slot as , which equals when FAP () is selected and subchannel is allocated, and equals otherwise. For notational simplicity, we define that in the case that CRAN mode is selected (i.e., all RRHs are connected) and subchannel is allocated. Assuming that optimal linear detection, i.e., MMSE detection, is employed, the uplink rate of FUE at slot when is
(1) 
where is the transmission power of FUE on subchannel , is the channel vector between UE and the FAP on subchannel , is the MMSE detection vector, and is the noise power. Note that the channel vectors account for the antenna gain, path loss, shadow fading, and fast fading together.
Similarly, the rate of traditional UE can be obtained, . Besides guaranteeing a precise rate threshold , a stable queue backlog is also considered for traditional UE given its random traffic arrival characteristics. Let represent the queue backlog for traditional UE in slot . As shown in Fig. 1, we have the following expression for the dynamics of queue backlog ,
(2) 
where is the number of bits for traditional UE to be uploaded in time slot . Note that varies over time and we have . To minimize the average queue backlog and maintain stability, we seek to perform a queue-aware resource allocation. A definition of queue stability, which bounds the average queue backlog, is given in (3).
Definition 1
(Queue stability [26]). The queue backlog, which is a discrete-time process, is mean-rate stable if
(3) 
Besides RRHs and FAPs, an FUE can also be selected as the serving node of UE (). Taking advantage of its large buffer, an FUE can help upload the data of other FUEs and traditional UEs. For example, FUE in Fig. 1 is out of the coverage area, so its neighbor, FUE , is selected to deliver its data traffic. FUE acts as a relay for the data traffic from traditional UE to the FAP, since the maximum transmission power of traditional UE is limited. Thus, in addition to uploading bits at slot to guarantee its own rate requirement, FUE needs to relay the traffic of other UEs received in the previous slot. The bit rate requirement of FUE at slot is , where is an indicator function that equals when holds and equals otherwise.
II-C The computing model
Computing resource provision in the BBU pool and FAPs plays a key role in realizing the potential of FRANs. As shown in the communication model above, computing is required for baseband processing and MMSE detector generation. In this paper, we construct a computing model that follows the one in [27]; the details are as follows.

Baseband processing consists of the inverse fast Fourier transform (IFFT), demodulation and decoding. The IFFT consumes a constant amount of computing resource, which is assumed as , while the computing resource required by demodulation and decoding is approximated as .
For MMSE detector generation, the computational complexity depends on the number of antennas. Taking the case of as an example, we assume that the computing resource consumed by the calculation of is .
Overall, the computing resource consumption for UE is modeled as
(4) 
where and are the slopes. Considering the limited computing resource at FAPs, the number of UEs accessing FAPs should be under a threshold. Suppose is the computing resource available at FAP , we have the following constraint on computing resource consumption.
(5) 
According to the computing model (4), UEs will consume more computing resource in CRAN mode than FAP mode, since there are more antennas utilized (). Moreover, there is no computing resource consumption for the UEs choosing D2D mode.
III Problem formulation and Lyapunov Optimization
In this section, the optimization problem of interest is presented first. Then, with the Lyapunov framework, the original stochastic problem is reformulated as a deterministic problem at each slot.
III-A Problem formulation
For the concerned uplink FRAN, the system power consumption is incurred by fronthaul transmission and wireless transmission, which is given by
(6) 
where and are the efficiencies of the power amplifier at each traditional UE and FUE, respectively, is the constant power consumption caused by fronthaul transmission.
Besides the mean-rate stability constraint defined in C0 and the computing resource constraint defined in C1, there are also performance constraints to be considered. As stated in C2 and C3 below, the rate of traditional UE should be larger than its threshold , while for an FUE , its rate has to be large enough to upload all the bits in its buffer.
(7) 
To upload traditional UEs’ bits and maintain the required rate for FUEs, mode selection decisions should be made properly. Although offloading all the uploaded bits to FAPs can reduce system power consumption, the computing resources at FAPs are limited. In this paper, our aim is to perform efficient mode selection and resource allocation, which are described by a tuple . Combining the constraints and performance requirements, we formulate the system power optimization problem as below.
(8) 
subject to
where C0 is to achieve a stable queue backlog for each traditional UE, C1 is the computing resource constraint, C2 and C3 are to satisfy the rate requirement for traditional UEs and FUEs, respectively, and C4 means if subchannel is not allocated to UE , the transmission power has to be 0 and limited by the maximum transmission power otherwise. C5 is the communication mode selection constraint, C6 implies that at most one mode can be selected by UE on subchannel , and C7 means at most 1 subchannel can be allocated to UE .
Solving problem (8) is difficult for the following reasons. First, the problem with the aforementioned constraints is a nonlinear optimization problem and falls within the category of mixed integer programming. Traditional methods that can be applied, such as branch-and-bound and genetic algorithms, are centralized and result in high complexity. Second, the scale of the problem increases as the number of traditional UEs/FUEs grows. Third, the problem includes future information such as bit rates and queue backlogs, which vary over time and are hard to predict precisely. How to make mode selection and resource allocation decisions that adapt to dynamic traffic is a great challenge.
III-B General Lyapunov optimization
Fortunately, with Lyapunov optimization [26], the original optimization problem with the time-averaged constraints C0 can be transformed into a queue mean-rate stability problem, which can be solved based only on the observed channel state information and queue backlogs at each time slot. Let denote the set of queue backlogs. Taking advantage of Lyapunov optimization, a Lyapunov function is defined as a scalar metric of queue congestion:
(9) 
Then the Lyapunov drift is defined, which pushes the queue backlog to a lower congestion state and keeps queues stable,
(10) 
To combine the queue backlog and system power consumption, the drift-plus-penalty is defined, where is a nonnegative parameter controlling the tradeoff between the average system power and the average queue delay. Suppose that the expectation of is deterministically bounded by finite constants , i.e., . Let denote the theoretical optimal value of (8). The relationship between the drift-plus-penalty function and C0 is then established in Theorem 1.
Theorem 1
(Lyapunov optimization). Suppose there exist positive constants , and such that for all slots and all possible , the drift-plus-penalty function satisfies:
(11) 
Then C0 is satisfied and the average system power meets
(12) 
The average queue delay is defined as the average length of all queues, which satisfies
(13) 
Proof:
Since (11) holds for any slot, we can take expectations of both sides and we have
Summing over and using the law of telescoping sums yields
(14) 
Based on the fact that for all , we rearrange (14) to obtain
which could be furthermore rearranged according to definition of Lyapunov function
(15) 
Since holds for any , we have
(16) 
Dividing both sides by and taking the limit as , we have
(17) 
According to Definition 1, the queue of traditional UE is mean-rate stable. A similar proof can be applied to the queues of other traditional UEs, which indicates that constraint C0 is satisfied.
Moreover, the following inequality is obtained by rearranging the terms in (14)
(18) 
with some nonnegative terms neglected when appropriate. Dividing both sides of (18) by and taking the limit as , the inequality (12) is obtained based on the fact that .
Similarly, inequality (14) can also be rewritten as
(19) 
Theorem 1 suggests that by adjusting the value of parameter , a near-optimal solution can be obtained, which provides an average system power arbitrarily close to the optimum . Moreover, it also shows that there exists a tradeoff between the average system power and the average queue delay. As parameter increases, the achieved system power consumption becomes lower at the cost of a larger queuing delay. Therefore, a larger is suitable for delay-tolerant UEs to obtain the required performance.
Instead of minimizing the drift-plus-penalty directly, we aim to push the drift-plus-penalty’s upper bound to its minimum. Based on the queue dynamics of and the definition of Lyapunov drift in (10), the following lemma holds for the upper bound of the drift-plus-penalty.
Lemma 2
(Upper bound of Lyapunov drift-plus-penalty). At any time slot , with the observed queue state and parameter , there exists an upper bound for the drift-plus-penalty under any control policy:
(20) 
where is a finite constant which is larger than for any .
Proof:
Squaring both sides of (2) and combining the inequality , the following inequality can be obtained
(21) 
Summing (21) over , we obtain
Based on the concept of opportunistically minimizing an expectation, the policy that minimizes is the one that minimizes with the observation of during each slot. Since neither nor in (20) will be affected by the policy at slot , the upper bound minimization for the drift-plus-penalty can be accomplished by solving the following deterministic problem at slot :
(23) 
As shown in (23), the power-minus-rate function serving as the optimization target is not convex in either variable or variable .
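As a rough illustration of the tradeoff governed by the control parameter, the following Python sketch runs a single-queue drift-plus-penalty loop. It is not the paper's problem (23): the log-rate model, the discrete power set, the constant arrivals, and the names `simulate`, `V`, `gain` and `arrival` are all assumptions made for illustration. In each slot, the power that minimizes a one-user power-minus-rate cost `V*p - Q*r(p)` is chosen using only the observed backlog.

```python
import math


def rate(p, gain=1.0):
    """Bits served in one slot at power p (assumed log-rate model)."""
    return math.log2(1.0 + gain * p)


def simulate(V, slots=2000, arrival=1.0, powers=(0.0, 1.0, 2.0, 4.0), gain=1.0):
    """Toy single-queue drift-plus-penalty loop (illustrative assumptions only).

    Returns the time-averaged power and the time-averaged queue backlog.
    """
    q = 0.0
    total_power = 0.0
    total_backlog = 0.0
    for _ in range(slots):
        # Per-slot deterministic decision: depends only on the observed backlog q.
        p = min(powers, key=lambda x: V * x - q * rate(x, gain))
        total_power += p
        total_backlog += q
        # Assumed queue dynamics: serve, clip at zero, then add new arrivals.
        q = max(q - rate(p, gain), 0.0) + arrival
    return total_power / slots, total_backlog / slots


power_small_v, backlog_small_v = simulate(V=0.0)
power_large_v, backlog_large_v = simulate(V=10.0)
print(power_small_v, backlog_small_v)
print(power_large_v, backlog_large_v)
```

In this toy setting, the larger parameter value yields a lower average power and a larger average backlog, which matches the qualitative behavior stated in Theorem 1.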
IV Solution for Orthogonal and Multiplexed Subchannel Strategies
The nonconvex problem (23), which includes integer variables and continuous variables , is hard to solve. Although methods like branch-and-bound and genetic algorithms can be used to solve the integer part, these existing solutions incur a huge complexity when simultaneously considering all traditional UEs, FUEs, FAPs and RRHs. Moreover, the residual part of problem (23) is still nonconvex, because the rate term in the power-minus-rate function depends on the transmission power of traditional UEs and FUEs using the same subchannel .
In this section, we consider mode selection and resource allocation under the orthogonal and multiplexed subchannel strategies. To overcome the above challenges, a centralized approach based on Q-learning and softmax decision-making is proposed for the orthogonal subchannel strategy. For the multiplexed subchannel strategy, the limitations on subchannel allocation are relaxed. In this case, a distributed approach is developed, where each traditional UE or FUE needs to consider only its own mode selection possibilities.
IV-A Centralized RL-based solution for the orthogonal subchannel strategy
A centralized approach for mode selection is proposed based on Q-learning. In particular, the definition of states in Q-learning is related to the current mode selection of UEs. To decrease the dimensions of the Q table, the state is , in which implies that during the current iteration, only UE would reselect a mode according to the action, and the element denotes that subchannel has been allocated to UE connecting to FAP (namely ). Considering constraints C5-C7, we define the action as . With action selected, the element and corresponding in state change and the current state transits to the next state.
The Q-value in Q-learning is defined as the discounted cumulative reward starting from a state-action pair, and is updated as follows
(24) 
where is the learning rate, and is the reward resulting from taking action . Note that in the orthogonal subchannel strategy, a subchannel, for example , cannot be shared among UEs. Hence in given state , there is an element being . If the action is chosen and , the reward has to be 0 (). Otherwise, the reward is defined as a value between and that decreases when the power-minus-rate increases:
(25) 
where and . Note that the reward function is defined according to the UE's performance requirement. Since mean-rate stability is considered only for traditional UEs, the reward function of a traditional UE differs from that of an FUE.
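The update and reward rules described above can be sketched as follows. This is a minimal illustration, assuming the standard tabular Q-learning update; the names `reward`, `q_update`, `alpha`, `gamma` and `scale`, as well as the logistic shape of the reward, are hypothetical stand-ins and are not taken from (24) or (25).

```python
import math
from collections import defaultdict


def reward(power_minus_rate, conflict, scale=1.0):
    """Hypothetical reward in the spirit of (25).

    Returns 0 when the chosen subchannel is already occupied (orthogonal
    strategy); otherwise a value between 0 and 1 that decreases as the
    power-minus-rate grows. The logistic shape is an assumption.
    """
    if conflict:
        return 0.0
    # 0.5 * (1 - tanh(x / 2)) equals 1 / (1 + exp(x)) but cannot overflow.
    return 0.5 * (1.0 - math.tanh(power_minus_rate / (2.0 * scale)))


def q_update(q, state, action, r, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning step (assumed form of the update).

    q maps (state, action) pairs to values; alpha is the learning rate and
    gamma the discount factor.
    """
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Usage: states and actions only need to be hashable, for example tuples.
q_table = defaultdict(float)
q_update(q_table, "s0", "a0", reward(0.0, conflict=False), "s1", ["a0", "a1"])
```

Keeping the table in a dictionary keyed by state-action pairs means that only the pairs actually visited consume memory, which fits the stated aim of limiting the size of the Q-table.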
Here, the softmax selection policy[28] is used to determine the communication mode. The probability of a UE selecting an FAP on a subchannel is calculated as
(26) 
in which the Q-value is scaled by a temperature parameter. At the beginning, the temperature is high, which leads to a nearly equiprobable selection among the different modes. As the number of episodes increases, the temperature decreases and the differences among the selection probabilities grow: the larger the estimated Q-value of a mode, the higher its selection probability.
After the integer variables are identified via Q-learning, problem (23) is simplified into the following problem.
(27) 
Since each subchannel is allocated to at most one UE in the orthogonal subchannel allocation strategy, the interference part in (1) vanishes. The rate is then concave and monotonically increasing in the transmit power, so the power-minus-rate objective in (27) is convex. Consider the extreme point of this convex objective. When the extreme point lies in the feasible region defined by C1-C4, it is the optimal solution of problem (27). When it does not lie in the feasible region, we can find the optimal solution by the following iterative methods:
IV-B Distributed RL-based solution for the multiplexed subchannel allocation strategy
In the multiplexed subchannel allocation strategy, a distributed RL-based approach is proposed, in which UEs autonomously select their communication modes. The main advantage of distributed approaches is reduced complexity, since each UE needs to consider only its own selection possibilities. Note that the size of the Q-table can be decreased by considering only the neighbor nodes of each UE, which makes the storage of the Q-table affordable for each UE.
Whenever a UE selects RRHs, an FAP or an FUE together with a subchannel, the corresponding Q-value is updated as in (24). Unlike the special case of the orthogonal subchannel allocation strategy, a subchannel can be shared among multiple UEs in the multiplexed subchannel allocation strategy. The reward is set to 0 in the following cases: 1) an excessive load occurs at the FAP and there are not enough computing resources for the connected UEs, meaning that constraint C1 is not fulfilled; 2) the propagation conditions in the selected mode cannot guarantee the traditional UE's rate requirement, meaning that constraint C2 is not satisfied; 3) the propagation conditions in the selected mode cannot achieve the desired rate of the FUE, meaning that constraint C3 is not satisfied. If constraints C1, C2 and C3 are all satisfied, the reward is defined as in (25). Because the reward accounts for both C1-C3 and the power-minus-rate function, it reflects the degree to which the optimization target and the constraints are fulfilled.
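The per-UE decision loop described above can be sketched as follows: a Boltzmann (softmax) choice over the UE's own candidate modes, a temperature that decays over episodes, and a reward that is zeroed whenever any of C1-C3 is violated. This is only a sketch under stated assumptions; the function names, the geometric decay schedule and its default parameters are hypothetical and are not specified in the text.

```python
import math
import random


def softmax_probs(q_values, temperature):
    """Boltzmann selection probabilities for one UE's candidate modes."""
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_mode(q_values, temperature, rng=random):
    """Draw the index of a candidate mode according to softmax_probs."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def temperature_schedule(episode, t0=10.0, decay=0.95, t_min=0.05):
    """Hypothetical geometric cooling: high at first, then decreasing."""
    return max(t_min, t0 * decay ** episode)


def gated_reward(base_reward, c1_ok, c2_ok, c3_ok):
    """Reward is 0 unless constraints C1, C2 and C3 all hold."""
    return base_reward if (c1_ok and c2_ok and c3_ok) else 0.0


# Usage: early episodes explore almost uniformly, later ones exploit.
q_own = [0.2, 0.8, 0.5]  # hypothetical Q-values of one UE's three options
early = softmax_probs(q_own, temperature_schedule(0))
late = softmax_probs(q_own, temperature_schedule(200))
choice = select_mode(q_own, temperature_schedule(200), random.Random(0))
```

Because each UE evaluates only its own list of candidate modes, the cost per decision grows with the number of that UE's options rather than with the size of the whole network.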
Based on the communication modes output by distributed Q-learning, there is a fixed one-to-one mapping between and due to constraints C5-C7. Define the corresponding mode selection and subchannel allocation for UE as and , respectively. Note that when a subchannel is used by a single UE, the interference part is omitted, which makes the problem convex. When subchannel is reused, for example by UE and , we have . Problem (23) can now be simplified into the following problem at subchannel .
(28)  
where the SINR threshold corresponds to the desired rate in C2 and to the sum-rate threshold on the right-hand side of C3. The second-order cone constraint D2 is an equivalent transformation of C2 and C3.
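To illustrate how a rate requirement becomes a second-order cone constraint, the sketch below converts a target rate into an SINR threshold and checks the cone form of the SINR condition in terms of transmit amplitudes (square roots of powers). It assumes a Shannon-type rate of bandwidth times log2(1 + SINR); the function names, the gain matrix layout and the numerical values are hypothetical, and the exact form of D2 may differ.

```python
import math


def sinr_threshold(rate_bps, bandwidth_hz):
    """SINR needed for a target rate, assuming rate = B * log2(1 + SINR)."""
    return 2.0 ** (rate_bps / bandwidth_hz) - 1.0


def soc_constraint_holds(k, amplitudes, gains, noise_power, gamma):
    """Check SINR_k >= gamma in second-order cone form.

    gains[k][j] is the power gain from transmitter j to receiver k and
    amplitudes[j] is sqrt(p_j). The condition
        sqrt(gamma) * ||(interference amplitudes, sqrt(noise))||_2
            <= sqrt(gains[k][k]) * amplitudes[k]
    is linear on the right and a Euclidean norm on the left, hence a
    second-order cone constraint in the amplitudes.
    """
    vec = [math.sqrt(gains[k][j]) * amplitudes[j]
           for j in range(len(amplitudes)) if j != k]
    vec.append(math.sqrt(noise_power))
    lhs = math.sqrt(gamma) * math.sqrt(sum(x * x for x in vec))
    rhs = math.sqrt(gains[k][k]) * amplitudes[k]
    return lhs <= rhs


# Usage with two UEs reusing one subchannel (hypothetical numbers).
gains = [[1.0, 0.1], [0.1, 1.0]]
ok = soc_constraint_holds(0, [1.0, 1.0], gains, 0.1, sinr_threshold(2e6, 1e6))
```

Working with amplitudes rather than powers is what makes the constraint conic: the SINR ratio itself is not convex in the powers, whereas the norm inequality above is convex in the amplitudes.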
The target function in (28) is nonconvex when a subchannel is reused. Hence, a C-additive approximation of the drift-plus-penalty algorithm is presented, whose performance is within an additive constant of the infimum. The C-additive approximation[26] is defined as follows.
Definition 2
(C-additive approximation). For a given constant C, a C-additive approximation of the drift-plus-penalty algorithm chooses an action for which the conditional expected value of the right-hand side of the drift-plus-penalty expression, given the current state at each time slot, is within the constant C of the infimum over all possible control actions.
The C-additive approximation of the drift-plus-penalty algorithm is inspired by the equivalence between weighted sum-rate maximization and weighted minimum mean-square error (WMMSE) minimization[29] for the MIMO channel, which is extended here to solve problem (28). We state this equivalence as follows.
Proposition 3
(Equivalent WMMSE problem). Problem (28) has the same optimal solution as the following WMMSE problem:
(29) 
where each UE is assigned a mean-square error (MSE) weight and a receiver variable, and the corresponding MSE is defined as
(30) 
Note that WMMSE problem (29) is not jointly convex in the MSE weights, the receiver variables and the transmit powers, but it is convex with respect to each block of variables when the other two are fixed. Hence, the block coordinate descent (BCD) method is utilized to obtain a stationary point of problem (29). The BCD method is summarized as follows and described in Algorithm 2.

The optimal receiver under fixed MSE weights and transmit powers is given by
(31) 
The optimal MSE weight under fixed receivers and transmit powers is given by
(32) 
Note that the optimization problem for finding the optimal transmit power under fixed MSE weights and receivers is
(33) 
which is a second-order cone problem and can be solved efficiently when a convex feasible region exists. This region is defined jointly by constraints C4, D2 and D3. In particular, the new constraint D3 is derived from C1. In constraint C1, the computing resource consumption of a UE is calculated according to the resource allocation under a determined mode selection. In the presented BCD method, however, the resource allocation is determined iteratively. Hence, the computing resource consumption in D3 is calculated based on the power output by the previous iteration.
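The three-block structure of the BCD method can be illustrated with the classical WMMSE iteration for weighted sum-rate maximization on a scalar interference channel. This is a deliberately simplified stand-in and not problem (29): the objective is the plain weighted sum rate, the only constraint is a per-UE power cap, and the power step therefore has a closed form instead of the second-order cone problem (33) with constraints C4, D2 and D3. All names, the channel matrix `h` (amplitude gains) and the numbers are hypothetical.

```python
import math


def weighted_sum_rate(h, v, alpha, noise):
    """Weighted sum rate (in nats) for transmit amplitudes v."""
    n = len(v)
    total = 0.0
    for k in range(n):
        interf = sum((h[k][j] * v[j]) ** 2 for j in range(n) if j != k)
        sinr = (h[k][k] * v[k]) ** 2 / (interf + noise)
        total += alpha[k] * math.log(1.0 + sinr)
    return total


def wmmse_bcd(h, alpha, noise, p_max, iters=50):
    """Block coordinate descent over receivers u, MSE weights w, amplitudes v."""
    n = len(alpha)
    v = [math.sqrt(p_max)] * n  # start from full power
    history = [weighted_sum_rate(h, v, alpha, noise)]
    for _ in range(iters):
        # Block 1: MMSE receiver for fixed weights and amplitudes.
        u = []
        for k in range(n):
            rx_power = sum((h[k][j] * v[j]) ** 2 for j in range(n)) + noise
            u.append(h[k][k] * v[k] / rx_power)
        # Block 2: weight = inverse of the resulting MSE, 1 - u_k h_kk v_k.
        w = [1.0 / (1.0 - u[k] * h[k][k] * v[k]) for k in range(n)]
        # Block 3: amplitudes. The weighted MSE is a separable convex
        # quadratic in each v_k, so clipping its minimizer to
        # [0, sqrt(p_max)] is exact for the per-UE power cap.
        v_new = []
        for k in range(n):
            num = alpha[k] * w[k] * u[k] * h[k][k]
            den = sum(alpha[j] * w[j] * (u[j] * h[j][k]) ** 2 for j in range(n))
            target = num / den if den > 0.0 else 0.0
            v_new.append(min(math.sqrt(p_max), max(0.0, target)))
        v = v_new
        history.append(weighted_sum_rate(h, v, alpha, noise))
    return [x * x for x in v], history


# Usage: two UEs sharing one subchannel with strong cross interference.
powers, trace = wmmse_bcd([[1.0, 0.9], [0.9, 1.0]], [1.0, 1.0], 0.1, 1.0)
```

In this simplified setting each block has a closed-form minimizer, and the recorded weighted sum rate does not decrease from one iteration to the next, which is the property that lets BCD reach a stationary point.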