Mode Selection and Resource Allocation in Sliced Fog Radio Access Networks: A Reinforcement Learning Approach

by Hongyu Xiang, et al.

Mode selection and resource allocation in fog radio access networks (F-RANs) have been advocated as key techniques to improve spectral and energy efficiency. In this paper, we investigate the joint optimization of mode selection and resource allocation in uplink F-RANs, where both traditional user equipments (UEs) and fog UEs are served by constructed network slice instances. The optimization is formulated as a mixed-integer programming problem, and both orthogonal and multiplexed subchannel allocation strategies are proposed to guarantee slice isolation. Motivated by the development of machine learning, two reinforcement learning based algorithms are developed to solve the original high-complexity problem under the specific performance requirements of traditional and fog UEs. The basic idea of the proposals is to generate a good mode selection policy according to the immediate reward fed back by the environment. Simulation results validate the benefits of the proposed algorithms and show that a tradeoff between system power consumption and queue delay can be achieved.





I Introduction

To handle diverse use cases and business models, a new technology called network slicing has been investigated extensively for fifth generation (5G)[1]. In the concept of network slicing, the network slice instances are orchestrated and chained by a set of network functions to provide customized services. By enabling flexible support of various applications, network slicing benefits 5G networks in a cost-efficient way. As an important part of network slicing, network slicing in radio access networks (RANs) has been studied to further improve end-to-end network performance[2].

Although network slicing is a good solution to meet service requirements in 5G, remarkable challenges remain to be solved. First, traditional core network slicing methods are business-driven only and neglect the characteristics of the RAN. However, network slicing differs across network architectures, such as heterogeneous networks and cloud RANs (C-RANs)[3, 4], so jointly considering the characteristics of RANs and network slicing can be beneficial. Second, the performance requirements of emerging applications are becoming more stringent. To achieve huge capacity, massive connections and ultra-low latency, resource allocation should be elaborately designed, covering not only radio but also caching and computing resources. Third, as indicated in TS 38.300[5], it should be possible for a single RAN node to support multiple slices. Due to the differentiated capability of each node, the node association strategy in a sliced RAN becomes critical.

Meanwhile, fog-RANs (F-RANs) have been considered as a revolutionary paradigm to tackle performance requirements in 5G[6]. By exploiting edge caching and computing, capacity burdens on the fronthaul are alleviated and end-to-end latency is shortened. According to the desired performance, each user equipment (UE) in an F-RAN can select a proper communication mode, namely C-RAN mode, fog-radio access point (F-AP) mode, or device-to-device (D2D) mode. With adaptive mode selection and interference suppression, services and applications such as the industrial Internet, health monitoring and the Internet of vehicles can be well supported.

To exploit the prospect of network slicing in F-RANs, a hierarchical RAN slicing architecture is presented in this paper. The proposed architecture shown in Fig. 1 takes full advantage of both F-RANs and network slicing. According to the decomposition principle of the control and data planes, the high power node (HPN) in the network access layer executes the functions of the control plane, including the delivery of control signaling and system broadcasting information for accessed traditional UEs and fog-UEs (F-UEs). With radio resource control connections established, the network slice selection assistance information[5] is utilized to help traditional UEs and F-UEs select a network slice. Numerous modes are provided in the data plane for differentiated handling of traffic. Specifically, the remote radio heads (RRHs) cooperate with each other through the baseband unit (BBU) pool, which provides the C-RAN mode in the data plane. Thanks to fog computing, F-APs are used for local collaborative radio signal processing, and the D2D mode can be further triggered to meet performance requirements.

In the RAN slicing architecture, mode selection and resource allocation are critical for improving the performance of network slices. To achieve a high data rate, UEs should associate with RRHs to leverage large-scale centralized signal processing in the BBU pool. To alleviate transmission burdens on the fronthaul and save system power, local data processing should be available, which is enabled by F-APs and F-UEs. Consequently, for UEs with different performance requirements, advanced mode selection is needed. Note that data transmission under different modes consumes not only radio but also computing resources. For example, in C-RAN mode, centralized processing and large-scale collaborative transmission require global coordination, scheduling and control, of which the computing complexity typically increases polynomially with the network size[6]. Hence it is important to coordinate the computing and radio resources. To meet the performance requirements of traditional UEs and F-UEs, both multi-dimensional resource management and communication mode selection in sliced F-RANs should be tackled carefully. Considering their coupling, a joint optimization of mode selection and resource allocation is essential. To determine the best mode selection and coordinate the multi-dimensional resources, intelligent decision-making mechanisms are promising, as they can take into account the channel states of different modes, the computing load at each F-AP, the performance requirements of traditional UEs and F-UEs, and the total power consumption.

Based on the aforementioned characteristics of mode selection and resource allocation, this paper studies a joint optimization solution for system power minimization in sliced F-RANs.

I-A Related Work

F-RANs have emerged as a promising 5G RAN that can satisfy diverse quality of service (QoS) requirements in 5G. With coordination among communication, computation and caching, QoS requirements like high spectral efficiency, high energy efficiency and low latency for different service types can be met. Many studies on F-RANs have been conducted, like computation offloading in[7], and edge caching strategies in[8, 9]. In[7], the impact of fog computing on energy consumption and delay performance is investigated. With queuing models established, a multi-objective optimization problem considering energy consumption, execution delay and payment cost is formulated. Using the scalarization method and the interior point method, superior performance over the existing schemes is achieved. In[8], a joint optimization of caching and user association is studied. By decomposing the original problem, a distributed algorithm based on the Hungarian method is proposed. Simulation results show that with an efficient caching policy, the average download delay can be significantly reduced. In[9], a new metric called economical energy efficiency is adopted. With cache status and fronthaul capacity considered, a resource allocation problem is formulated and solved by using fractional programming. Advantages of the proposed algorithm, including system greenness improvement, are confirmed.

There have also been numerous works on RAN slicing that demands efficient resource allocation, resource isolation and sharing[10]. In[2], the application of network slicing in an ultra-dense RAN is studied. To improve the quality of computation experience for mobile devices, the design of computation offloading policies is investigated. Considering the time-varying communication qualities and computation resources, a stochastic computation offloading problem is formulated and then a deep reinforcement learning (DRL) framework is proposed, which achieves a significant improvement in computation offloading performance compared with baseline policies. In[3], a dynamic radio resource slicing framework is presented for a two-tier heterogeneous wireless network. By partitioning radio spectrum resources into different bandwidth slices for sharing, the framework achieves differentiated QoS provisioning for services in the presence of network load dynamics. In[4], two typical 5G services in a C-RAN are considered and specific slice instances are orchestrated. To maximize the cloud RAN operator’s revenue, efficient approaches including successive convex approximation and semidefinite relaxation are exploited. With acceptable time complexities, the proposed algorithm significantly saves system power consumption. In[11], hierarchical radio resource allocation is studied for RAN slicing in F-RANs, where a global radio resource manager performs a centralized subchannel allocation while local radio resource managers allocate assigned resources to UEs to facilitate slice customization. In[12], the network slicing in multi-cell virtualized wireless networks is considered. To maximize the network sum rate, a joint BS assignment, sub-carrier, and power allocation algorithm is developed. 
Simulation results demonstrate that under the minimum required rate constraint of each slice, the proposed iterative algorithm outperforms the traditional approach, especially with respect to coverage improvement and spectrum efficiency enhancement. In[13], the combinatorial optimization of multi-dimensional resources in network slicing is investigated. To deal with the dilemma between the network provider and tenants, a real-time resource slicing framework based on a semi-Markov decision process is developed, which considers the long-term return of the network provider and the uncertainty of resource demands from tenants. Taking advantage of a deep dueling neural network, the proposed framework can improve the performance of the system significantly. In[14], a novel spectrally efficient approach to the allocation of resource blocks for different services is proposed. By learning in advance whether resources are adequate to provide service, unsuccessful allocation processes are avoided. Simulations show that the approach significantly improves the spectral efficiency with respect to a single-slot based model.

Note that there still exist challenges in RAN slicing. For example, the increasingly complicated configuration issues and emerging performance requirements would be challenging in 5G, since only predefined problems can be dealt with by the network. To realize an intelligent implementation of network slicing, artificial intelligence has attracted particular attention. By enabling networks to interact with their environments, a network can automatically recognize a new type of application, infer an appropriate provisioning mechanism and establish a required network slice[15]. Meanwhile, with network scenarios becoming heterogeneous and complicated, cost-efficient and low-complexity algorithms based on machine learning can be developed for practical implementations[16]. With network patterns and user behaviors learned and predicted, an intelligent decision-making system can be established to improve the network performance.

There have been numerous works about applications of artificial intelligence and machine learning in wireless networks[17]. In[18], the resource allocation schemes for vehicle-to-vehicle (V2V) communications are investigated. To avoid the large transmission overhead in the traditional centralized method, a novel decentralized resource allocation mechanism based on deep reinforcement learning is proposed. Each V2V link or vehicle acts as an independent agent and finds the optimal sub-band and transmission power autonomously. Simulation results show that each agent can effectively learn to satisfy the stringent latency constraints on V2V links while minimizing the interference to vehicle-to-infrastructure communications. In[19], applications of machine learning to improve heterogeneous network traffic control are researched. Based on traffic patterns at the edge routers, a supervised deep learning system is trained. Compared with the benchmark routing strategy, the proposed system performs better in terms of signaling overhead, throughput, and delay. In[20], a DRL assisted resource allocation method is designed for ultra dense networks. The original multi-objective problem is decoupled into two parts based on the general theory of DRL. The spectrum efficiency (SE) maximization is utilized to build the deep neural network. The residual objectives, like energy efficiency (EE) and fairness, are considered as the rewards to train the deep neural network. Simulation results show that the proposed method significantly outperforms the existing resource allocation algorithms in terms of the tradeoff among the SE, EE and fairness. In[21], the joint SE and EE optimization in cognitive radio networks is studied and a deep-learning inspired message passing algorithm is proposed. To learn the optimal parameters of the algorithm, a feed-forward neural network is devised and an analogous back propagation algorithm is developed. The simulation results show that the proposed algorithm achieves a lower power consumption for secondary users accessing the licensed spectrum while preserving the capacity of the primary users.

In this paper, we focus on the mode selection and resource allocation in a sliced F-RAN, which is formulated as a mixed-integer programming problem. To deal with this NP-hard problem, RL is adopted to generate an efficient solution. Combining the strengths of both supervised and unsupervised learning methods, RL techniques have been widely used in wireless networks[22]. In[23], mode selection and resource allocation in D2D enabled C-RANs are investigated and a distributed approach based on RL is proposed, where D2D pairs perform self-optimization without global channel state information. In[24], a decentralized and self-organizing mechanism based on RL techniques is introduced to reduce inter-tier interference and improve spectral efficiency. Simulation results show that the proposed mechanism possesses better convergence properties and incurs less overhead than existing techniques. To offload the traffic in a stochastic heterogeneous cellular network, an online RL framework is presented in[25]. By modeling the system as a discrete-time Markov decision process, the energy-aware traffic offloading problem is solved by a centralized Q-learning algorithm with a compact state representation.

I-B Main Contributions

Motivated by the benefits of machine learning, this paper considers the uplink of a sliced F-RAN. In particular, an optimization framework for RAN slicing is presented, which takes the queue stability of traditional UEs and the bit rate requirements of F-UEs into consideration. Both orthogonal and multiplexed subchannel strategies are considered. The main contributions of the paper are:

  1. The joint optimization of mode selection and resource allocation in the uplink sliced F-RAN is investigated, where traditional UEs and F-UEs are served by constructed network slice instances. Both the orthogonal and multiplexed subchannel strategies are presented. Under different UEs’ demands and limited computing resources, a system power minimization problem is formulated, which is a stochastic mixed-integer programming problem. Using the general Lyapunov optimization framework, this nonconvex optimization problem is transformed into a minimization of the drift-plus-penalty function, which can be further reformulated as a deterministic mode selection and resource allocation problem at each slot.

  2. RL-based approaches are proposed to solve the reformulated mode selection and resource allocation problem. Unlike previous work in[3, 4, 11], this paper applies RL techniques to solve the drift-plus-penalty minimization under different subchannel allocation strategies. Specifically, communication modes are selected based on learned policies. Afterwards, the transmission powers of traditional UEs and F-UEs are derived by a generalized weighted minimum mean-square error (WMMSE) approach. Through the RL-based approaches, a long-term system performance optimization can be achieved.

  3. The proposed approaches are evaluated under different conditions. The impacts of different parameters, like the computing resource, are evaluated. The simulations show that the RL-based approach can provide near-optimal performance. By changing the value of the defined tradeoff parameter, the tradeoff between traditional UEs’ queuing delay and system power consumption can be controlled in a flexible and efficient way.

The remainder of this paper is organized as follows. Section II introduces the system model including the communication model and computing model. In Section III, the system power minimization problem is formulated and transformed into a deterministic problem based on the general Lyapunov optimization framework. In Section IV, both the orthogonal and multiplexed subchannel strategies are considered, which enable different levels of slice isolation. Corresponding RL-based algorithms are designed to solve the deterministic problem. Section V evaluates the performance of the proposed algorithms, followed by the conclusions in Section VI.

II System Model

The system model is elaborated in this section, including the considered F-RAN model, communication model and computing model.

II-A The F-RAN model

The scenario considered in this paper is illustrated in Fig. 1. It assumes an F-RAN architecture consisting of a terminal layer, a network access layer and a cloud computing layer. In the cloud computing layer, the BBU pool provides centralized signal processing. In the network access layer, there are distributed RRHs connected with the BBU pool, each of which is single-antenna. There are also F-APs configured with antennas. Owing to fog computing, collaborative radio signal processing can be executed not only in the centralized BBU pool but also at distributed F-APs. We also assume that the network operates in slotted time, with the time dimension partitioned into decision slots indexed by

Fig. 1: The single antenna system model of the RAN slicing architecture, wherein network slices for traditional UEs and F-UEs are constructed.

There are single-antenna traditional UEs and single-antenna F-UEs in the terminal layer, whose sets are denoted as and , respectively. Examples of traditional UEs include agricultural field monitoring sensors and industrial monitoring devices, which desire low power consumption and have random bursty traffic arrivals. F-UEs can be smartphones or laptops[6], which are always equipped with a large buffer. To provide a high data rate for each F-UE, a network slice instance is constructed, which is composed of multiple modes and the corresponding physical resources. In the C-RAN mode, RRHs cooperate for uplink data reception and the BBU pool provides centralized signal detection and baseband processing. Moreover, F-APs are deployed for local service to alleviate the burden on the fronthaul. Similarly, both C-RAN mode and F-AP mode are available in the network slice instance specific to traditional UEs. However, the objective there is to maintain a low power consumption and a stable transmission delay for traditional UEs. In addition, F-UEs can benefit both network slice instances via the D2D mode. Specifically, F-UEs relay the data traffic of other F-UEs, which extends the coverage of the slice instance for F-UEs, while in the slice instance for traditional UEs, F-UEs aggregate the data to allow more traditional UEs to be connected simultaneously.

There are subchannels to be allocated, each of which has bandwidth . In this paper, we consider both the orthogonal and multiplexed subchannel strategies. In the former, subchannel is allocated to at most one traditional UE or F-UE , which enables hard isolation between slice instances. In the latter, subchannel can be shared among multiple traditional UEs and F-UEs. In this strategy, the isolation between the slice instances has to be guaranteed by sophisticated mode selection and resource allocation. Although slice isolation in current works is guaranteed mainly through an orthogonal subchannel allocation strategy, it is still necessary to investigate a multiplexed subchannel allocation strategy to achieve higher spectrum utilization.

II-B The communication model

To achieve the rate requirement , F-UE should connect to the proper F-AP/RRHs. Denote the communication mode selection of F-UE at slot as , which equals when F-AP () is selected and subchannel is allocated, and equals otherwise. For notational simplicity, we define that in the case that C-RAN mode is selected (i.e., all RRHs are connected) and subchannel is allocated. Supposing that the optimal linear detection, i.e., MMSE detection, is employed, the uplink rate of F-UE at slot when is


where is the transmission power of F-UE on subchannel , is the channel vector between UE and the F-AP on subchannel , is the MMSE detection vector, and is the noise power. Note that the channel vectors account for the antenna gain, path loss, shadow fading, and fast fading together.
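As a minimal sketch of how such a rate is evaluated, the following Python snippet computes the post-detection SINR and the resulting rate for one UE, given a linear detection vector. The function names, the argument layout and the example values are illustrative assumptions and are not taken from the paper. The example uses a matched-filter detector, which coincides with the MMSE detector up to scaling only when no other UE shares the subchannel; in general an MMSE vector would be computed separately and passed in as `detector`.

```python
import math

def inner(u, v):
    """Hermitian inner product u^H v for lists of complex numbers."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def uplink_rate(bandwidth_hz, detector, channel, power, interferers, noise_power):
    """Rate in bit/s after linear detection.

    interferers: list of (power, channel) pairs for other UEs on the subchannel.
    """
    signal = power * abs(inner(detector, channel)) ** 2
    interference = sum(p * abs(inner(detector, h)) ** 2 for p, h in interferers)
    noise = noise_power * inner(detector, detector).real
    sinr = signal / (interference + noise)
    return bandwidth_hz * math.log2(1.0 + sinr)

# Hypothetical two-antenna receiver with normalized bandwidth and noise.
h = [1 + 0j, 0j]
rate_alone = uplink_rate(1.0, h, h, 1.0, [], 1.0)
rate_shared = uplink_rate(1.0, h, h, 1.0, [(1.0, [1 + 0j, 0j])], 1.0)
```

The second call illustrates why the multiplexed strategy couples the decisions of different UEs: an interferer on the same subchannel lowers the achievable rate.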

Similarly, the rate of traditional UE can be obtained, . Besides guaranteeing a precise rate threshold , a stable queue backlog is also considered for traditional UE given its random traffic arrival characteristics. Let represent the queue backlog for traditional UE in slot . As shown in Fig. 1, we have the following expression for the dynamics of queue backlog ,


where is the number of bits for traditional UE to be uploaded in time slot . Note that varies over time and we have . To minimize the average queue backlog and maintain stability, we seek to perform a queue-aware resource allocation. A definition of queue stability, which bounds the average queue backlog, is given in (3).
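The queue recursion can be illustrated with a short simulation. The sketch below assumes the standard form in which the backlog at the next slot equals the unserved backlog (floored at zero) plus the new arrivals, which is consistent with the inequality used later in the proof of Lemma 2. The function names and the arrival pattern are hypothetical.

```python
def update_queue(backlog, served, arrived):
    """One slot of the queue recursion: serve first, then add new arrivals."""
    return max(backlog - served, 0.0) + arrived

def simulate(arrivals, service_per_slot):
    """Return the backlog after each slot for a constant service amount."""
    backlog, trace = 0.0, []
    for arrived in arrivals:
        backlog = update_queue(backlog, service_per_slot, arrived)
        trace.append(backlog)
    return trace

# Bursty arrivals (0 or 4 bits) with mean 2, served at 3 bits per slot.
trace = simulate([0.0, 4.0] * 500, service_per_slot=3.0)
```

Because the mean arrival is below the service amount, the backlog stays bounded, so the backlog divided by the number of elapsed slots tends to zero, which is the behavior that mean-rate stability requires.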

Definition 1

(Queue stability[26]). The queue backlog , which is a discrete-time process, is mean-rate stable if


Besides RRHs and F-APs, F-UEs can also be selected as serving nodes of UE (). Taking advantage of a large buffer, an F-UE can help upload the data of other F-UEs and traditional UEs. For example, F-UE in Fig. 1 is out of the coverage area, and its neighbor, F-UE , is therefore selected to deliver the data traffic. F-UE acts as a relay for the data traffic from traditional UE to the F-AP, since the maximum transmission power of traditional UE is limited. Thus, in addition to uploading bits at slot to guarantee its own rate requirement, F-UE needs to relay the traffic of other UEs that was received in the last slot. The bit rate requirement of F-UE at slot is , where is an indicator function that equals when holds and equals otherwise.

II-C The computing model

Computing resource provision in the BBU pool and F-APs plays a key role in boosting the potential of F-RANs. As it is shown in the aforementioned communication model, there are baseband processing and MMSE detector generation. In this paper, we construct the computing model which follows that in[27] and corresponding details are as follows.

  • Baseband processing consists of the inverse fast Fourier transform (IFFT), demodulation and decoding. The IFFT consumes a constant computing resource, which is assumed as , while the computing resource required by demodulation and decoding is approximated as .

  • For MMSE detector generation, the computational complexity depends on the number of antennas. Taking the case of as an example, we assume that the computing resource consumed by the calculation of is .

Overall, the computing resource consumption for UE is modeled as


where and are the slopes. Considering the limited computing resource at F-APs, the number of UEs accessing F-APs should be under a threshold. Suppose is the computing resource available at F-AP , we have the following constraint on computing resource consumption.


According to the computing model (4), UEs will consume more computing resource in C-RAN mode than F-AP mode, since there are more antennas utilized (). Moreover, there is no computing resource consumption for the UEs choosing D2D mode.
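The structure of the computing model can be sketched as follows. The exact functional form of (4) is not recoverable here, so this sketch assumes a constant IFFT cost, a term linear in the rate for demodulation and decoding, and a detector term that grows polynomially with the number of antennas; the exponent, the parameter names and all numeric values are hypothetical.

```python
def computing_demand(mode, rate, n_antennas, ifft_cost, rate_slope,
                     antenna_slope, antenna_exponent=3):
    """Computing resource consumed by one UE (assumed form, see lead-in)."""
    if mode == "D2D":
        return 0.0  # D2D transmissions consume no computing resource
    baseband = ifft_cost + rate_slope * rate
    detector = antenna_slope * n_antennas ** antenna_exponent
    return baseband + detector

def fap_feasible(demands, capacity):
    """Check that the UEs attached to one F-AP fit within its capacity."""
    return sum(demands) <= capacity

params = dict(rate=2.0, ifft_cost=1.0, rate_slope=0.5, antenna_slope=0.1)
demand_fap = computing_demand("F-AP", n_antennas=2, **params)
demand_cran = computing_demand("C-RAN", n_antennas=8, **params)
demand_d2d = computing_demand("D2D", n_antennas=1, **params)
```

Under these assumptions the C-RAN demand exceeds the F-AP demand because more antennas are involved, and the feasibility check limits how many UEs a single F-AP can admit.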

III Problem Formulation and Lyapunov Optimization

In this section, the optimization problem is presented first. Then, with the Lyapunov framework, the original stochastic problem is reformulated as a deterministic problem at each slot.

III-A Problem formulation

For the concerned uplink F-RAN, the system power consumption is incurred by fronthaul transmission and wireless transmission, which is given by


where and are the efficiencies of the power amplifier at each traditional UE and F-UE, respectively, and is the constant power consumption caused by fronthaul transmission.

Besides the mean-rate stability constraint defined in C0 and the computing resource constraint defined in C1, there are also performance constraints to be considered. As stated in C2 and C3 below, the rate of traditional UE should be larger than its threshold , while for an F-UE , its rate has to be large enough to upload all the bits in its buffer.


To upload traditional UEs’ bits and maintain the required rate for F-UEs, a decision on mode selection should be properly made. Although offloading all the uploaded bits to F-APs can reduce system power consumption, the computing resource at F-APs is limited. In this paper, our aim is to perform efficient mode selection and resource allocation, which are described by a tuple . Combining the constraints and performance requirements, we formulate the system power optimization problem as below.


subject to

where C0 is to achieve a stable queue backlog for each traditional UE, C1 is the computing resource constraint, C2 and C3 are to satisfy the rate requirement for traditional UEs and F-UEs, respectively, and C4 means if subchannel is not allocated to UE , the transmission power has to be 0 and limited by the maximum transmission power otherwise. C5 is the communication mode selection constraint, C6 implies that at most one mode can be selected by UE on subchannel , and C7 means at most 1 subchannel can be allocated to UE .

Solving problem (8) is difficult for the following reasons. First, the problem with the aforementioned constraints is a nonlinear optimization problem and falls within the category of mixed-integer programming. Traditional methods that can be applied, like branch-and-bound and genetic algorithms, are centralized and result in high complexity. Second, the scale of the problem increases as the number of traditional UEs/F-UEs grows. Third, the problem includes future information like bit rates and queue backlogs, which vary over time and are hard to predict precisely. How to make decisions on to adapt to dynamic traffic is a great challenge.

III-B General Lyapunov optimization

Fortunately, with Lyapunov optimization[26], the original optimization problem with the time-averaged constraint C0 can be transformed into a queue mean-rate stability problem, which can be solved based only on the observed channel state information and queue backlogs at each time slot. Let denote the set of queue backlogs. Taking advantage of Lyapunov optimization, a Lyapunov function is defined as a scalar metric of queue congestion:


Then the Lyapunov drift is defined, which pushes the queue backlog to a lower congestion state and keeps queues stable,


To combine the queue backlog and system power consumption, the drift-plus-penalty is defined, where is a non-negative parameter controlling the tradeoff between the average system power and the average queue delay. Suppose that the expectation of is deterministically bounded by finite constants , i.e., . Let denote the theoretical optimal value of (8), and then the relationship between the drift-plus-penalty function and C0 is established in Theorem 1,

Theorem 1

(Lyapunov optimization). Suppose there exist positive constants , and such that for all slots and all possible , the drift-plus-penalty function satisfies:


Then C0 is satisfied and the average system power meets


The average queue delay is defined as the average length of all queues, which satisfies


Since (11) holds for any slot, we can take expectations of both sides and we have

Summing over and using the law of telescoping sums yields


Based on the fact that for all , we rearrange (14) to obtain

which could be furthermore rearranged according to definition of Lyapunov function


Noting that holds for any , we have


Dividing both sides by and taking the limit as , we have


According to Definition 1, the queue of traditional UE is mean-rate stable. A similar proof can be applied to the queues of the other traditional UEs, which indicates that constraint C0 is satisfied.

Moreover, the following inequality is obtained by rearranging the terms in (14)


with some non-negative terms neglected when appropriate. Dividing both sides of (18) by and taking the limit as , the inequality (12) is obtained based on the fact that .

Similarly, inequality (14) can also be re-written as


Dividing (19) by and taking the limit as , inequality (13) is obtained according to the fact that .

Theorem 1 suggests that by adjusting the value of parameter , a near-optimal solution can be obtained which provides an average system power arbitrarily close to the optimum . Moreover, it also shows that there exists a tradeoff between the average system power and the average queue delay. With an increase of parameter , the achieved system power consumption becomes lower at the cost of incurring a larger queuing delay. Therefore, a larger is suitable for delay-tolerant UEs to obtain the required performance.
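The tradeoff can be reproduced on a toy example. The sketch below is not the system model of this paper: it assumes a single queue, one unit of arrivals per slot, three hypothetical power levels, a normalized rate of log2(1 + power), and a per-slot rule that picks the power minimizing the tradeoff parameter times power minus backlog times rate. The symbol `v` stands for the tradeoff parameter.

```python
import math

POWER_LEVELS = [0.0, 1.0, 3.0]  # hypothetical transmit power options

def rate(power):
    """Normalized rate for a unit-gain channel with unit noise power."""
    return math.log2(1.0 + power)

def run(v, n_slots=1000, arrival=1.0):
    """Return (average power, average backlog) under the per-slot rule."""
    backlog, total_power, total_backlog = 0.0, 0.0, 0.0
    for _ in range(n_slots):
        # Choose the power that minimizes the power-minus-rate objective.
        power = min(POWER_LEVELS, key=lambda p: v * p - backlog * rate(p))
        total_power += power
        backlog = max(backlog - rate(power), 0.0) + arrival
        total_backlog += backlog
    return total_power / n_slots, total_backlog / n_slots

power_small_v, backlog_small_v = run(v=0.1)
power_large_v, backlog_large_v = run(v=1.0)
```

With the smaller value the rule transmits at the highest power and keeps the backlog short; with the larger value it waits until the backlog justifies the cheaper power level, so the average power drops while the average backlog grows.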

Instead of minimizing the drift-plus-penalty directly, we aim to push the drift-plus-penalty’s upper bound to its minimum. Based on the queue dynamics of and the definition of Lyapunov drift in (10), the following lemma holds for the upper bound of drift-plus-penalty.

Lemma 2

(Upper bound of Lyapunov drift-plus-penalty). At any time slot , with the observed queue state and parameter , there exists an upper bound for the drift-plus-penalty under any control policy:


where is a finite constant which is larger than for any .


Squaring both sides of (2) and combining the inequality , the following inequality can be obtained


Summing (21) over , we obtain

Taking conditional expectations of both sides, we have


By adding , we have (20).

Based on the concept of opportunistically minimizing an expectation, the policy that minimizes is the one that minimizes with the observation of during each slot. Since neither nor in (20) will be affected by the policy at slot , the upper bound minimization for the drift-plus-penalty can be accomplished by solving the following deterministic problem at slot :


As shown in (23), the power-minus-rate function that serves as the optimization target is not convex in either variable or variable .

IV Solution for Orthogonal and Multiplexed Subchannel Strategies

The non-convex problem (23), which includes integer variables and continuous variables , is hard to solve. Although methods such as branch-and-bound and genetic algorithms can be used to handle the integer part, these existing solutions incur very high complexity when all traditional UEs, F-UEs, F-APs and RRHs are considered simultaneously. Moreover, the residual part of problem (23) is still non-convex, because the rate term in the power-minus-rate function depends on the transmission power of the traditional UEs and F-UEs that use the same subchannel .

In this section, we consider the mode selection and resource allocation under orthogonal and multiplexed subchannel strategies. To overcome the above challenges, a centralized approach based on Q-learning and softmax decision-making is proposed for the orthogonal subchannel strategy. For the multiplexed subchannel strategy, limitations on the subchannel allocation are relaxed. In this case, a distributed approach is developed, where each traditional UE or F-UE needs to consider only its own mode selection possibilities.

IV-A Centralized RL-based solution for the orthogonal subchannel strategy

A centralized approach for mode selection is proposed based on Q-learning. In particular, the definition of states in Q-learning is related to the current mode selection of UEs. To decrease the dimensions of the Q-table, the state is , in which implies that during the current iteration, only UE reselects a mode according to the action, and the element denotes that subchannel has been allocated to UE connecting to F-AP (namely ). Considering constraints C5-C7, we define the action as . With action selected, the element and the corresponding in state change, and the current state transitions to the next state.

The Q-value in Q-learning is defined as the discounted cumulative reward obtained when starting from a state-action pair, and it is updated as follows


where is the learning rate, and is the reward resulting from taking action . Note that in the orthogonal subchannel strategy, a subchannel, for example , cannot be shared among UEs. Hence, in a given state , there is an element equal to . If the action is chosen and , the reward has to be 0 (). Otherwise, the reward is defined as a value between and that decreases as the power-minus-rate increases:


where and . Note that the reward function is defined according to the UE's performance requirement. Since mean-rate stability is required only for traditional UEs, the reward function of a traditional UE differs from that of an F-UE.
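
The tabular update can be sketched in code. This is a minimal sketch of the textbook Q-learning rule, not a reproduction of the paper's update (24): the discount factor `gamma`, the dictionary-based table and the string-valued states and actions are assumptions made for illustration.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one textbook Q-learning update and return the new Q-value.

    Assumption: Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    """
    # Best estimated value reachable from the next state (0 if no actions).
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


if __name__ == "__main__":
    table = defaultdict(float)  # unseen (state, action) pairs start at 0
    value = q_update(table, "s0", "a0", 1.0, "s1", ["a0", "a1"], alpha=0.5)
    print(value)
```

Storing the table as a dictionary keyed by (state, action) pairs keeps only visited entries in memory, which matters when the state encodes the mode selection of many UEs.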

Here, the softmax selection policy [28] is used to determine the communication mode. The probability of UE selecting F-AP on subchannel is calculated as


where is the temperature parameter. At the beginning, the temperature parameter is high, which leads to a nearly equiprobable selection among the different modes. As the number of episodes increases, the temperature parameter decreases and the differences among the selection probabilities grow. The larger the estimated value of is, the higher the probability is.
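
The effect of the temperature can be sketched with a generic softmax (Boltzmann) selection over Q-values. This is a minimal sketch under the assumption that the probability of a mode is proportional to `exp(Q / temperature)`; the function names and the flat list of Q-values are hypothetical.

```python
import math
import random


def softmax_probabilities(q_values, temperature):
    """Return Boltzmann selection probabilities for a list of Q-values."""
    scaled = [q / temperature for q in q_values]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def select_mode(q_values, temperature, rng=random):
    """Sample the index of a candidate mode according to the softmax policy."""
    probs = softmax_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    q_values = [1.0, 2.0, 3.0]
    for temperature in (1000.0, 1.0, 0.01):
        print(temperature, softmax_probabilities(q_values, temperature))
```

A high temperature flattens the distribution toward uniform exploration, while a low temperature concentrates almost all probability on the mode with the largest Q-value.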

After are identified via Q-learning, problem (23) is simplified into the following problem.


Since subchannel is allocated to at most one UE in the orthogonal subchannel allocation strategy, the interference part in (1) equals zero, and the rate in (27) is concave and monotonically increasing in the power , so the targeted power-minus-rate function is convex. Suppose is the extreme point of the targeted convex function. When is in the feasible region defined by C1-C4, is the optimal solution of problem (27). When is not in the feasible region, the optimal solution can be found by the following iterative method:

1:  Derive the partial derivative of the targeted optimization function in (27);
2:  Find the extreme point of the targeted convex function.
3:  Initialize and define a fixed step ;
4:  repeat
5:     With fixed, calculate the partial derivative ;
6:     Find the minimal one ;
7:     Update ;
8:  until  is in the feasible region.
Algorithm 1 An iterative method to find the optimal solution of problem (27).
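
For intuition, the one-dimensional special case has a closed form. The following is a minimal sketch and not Algorithm 1 itself: it assumes a single UE on an interference-free subchannel, an objective of the form `V * p - Q * log2(1 + gain * p)`, and a simple box constraint `p_min <= p <= p_max` in place of C1-C4. All names are hypothetical.

```python
import math


def optimal_power(queue, v_param, gain, p_min, p_max):
    """Minimize V * p - Q * log2(1 + gain * p) over p in [p_min, p_max].

    Assumption: v_param > 0 and gain > 0, so the objective is convex in p.
    """
    if queue <= 0.0:
        # With an empty queue the objective is increasing in p.
        return p_min
    # Stationary point: V = Q * gain / (ln(2) * (1 + gain * p)).
    stationary = queue / (v_param * math.log(2.0)) - 1.0 / gain
    # For a convex function of one variable, clipping the stationary point
    # to the interval gives the constrained minimizer.
    return min(p_max, max(p_min, stationary))


if __name__ == "__main__":
    print(optimal_power(queue=1.0, v_param=4.0, gain=4.0, p_min=0.0, p_max=1.0))
    print(optimal_power(queue=2.0, v_param=1.0, gain=4.0, p_min=0.0, p_max=1.0))
```

When the stationary point lies outside the interval, the minimizer sits on the nearest boundary, which is the one-dimensional analogue of moving from the extreme point back into the feasible region.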

IV-B Distributed RL-based solution for multiplexed subchannel allocation strategy

In the multiplexed subchannel allocation strategy, a distributed RL-based approach is proposed, in which UEs autonomously select their communication modes. The main advantage of distributed approaches is that they reduce complexity, since each UE needs to consider only its own selection possibilities. Note that the size of the Q-table can be decreased by considering only the neighbor nodes of UE , which makes the storage of the Q-table affordable for each UE.

Whenever RRHs (), an F-AP () or an F-UE () and subchannel have been selected by UE , the value of is updated as in (24). Unlike the special case in the orthogonal subchannel allocation strategy, a subchannel can be shared among multiple UEs in the multiplexed subchannel allocation strategy. We have to consider the following cases, in which is set to 0: 1) An excessive load occurs in F-AP and there are not enough computing resources for the connected UEs, meaning that constraint C1 is not fulfilled; 2) The propagation conditions in the selected mode do not allow guaranteeing the traditional UE's rate requirement, meaning that constraint C2 is not satisfied; 3) The propagation conditions in the selected mode do not allow achieving the desired rate of the F-UE, meaning that constraint C3 is not satisfied. If constraints C1, C2 and C3 are satisfied, the reward is defined as in (25). By defining the reward with both C1-C3 and the power-minus-rate function taken into account, the reward reflects the degree of fulfillment of the optimization target and the constraints.
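
The constraint-gated reward can be sketched as follows. This is a minimal sketch, not the paper's reward (25): the linear mapping, the assumed reward range of 0 to 1, and the names `gated_reward`, `pmr_low` and `pmr_high` are all hypothetical. Only the gating on C1, C2 and C3 and the decrease with the power-minus-rate value follow the text.

```python
def gated_reward(pmr, pmr_low, pmr_high, c1_ok, c2_ok, c3_ok):
    """Return a reward that is 0 on any constraint violation.

    Assumption: otherwise the reward falls linearly from 1 to 0 as the
    power-minus-rate value `pmr` grows from `pmr_low` to `pmr_high`.
    """
    if not (c1_ok and c2_ok and c3_ok):
        return 0.0  # C1 (computing load), C2 or C3 (rate targets) violated
    if pmr_high <= pmr_low:
        raise ValueError("pmr_high must exceed pmr_low")
    clipped = min(pmr_high, max(pmr_low, pmr))
    return (pmr_high - clipped) / (pmr_high - pmr_low)


if __name__ == "__main__":
    print(gated_reward(0.0, -10.0, 10.0, True, True, True))
    print(gated_reward(0.0, -10.0, 10.0, True, False, True))
```

Gating the reward on feasibility means that infeasible mode selections never accumulate Q-value, so the softmax policy gradually stops choosing them.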

Based on the communication modes output by distributed Q-learning, there is a fixed one-to-one mapping between and due to constraints C5-C7. Define the corresponding mode selection and subchannel allocation for UE as and , respectively. Note that when subchannel is used by a single UE, the interference part is omitted, which makes the problem convex. When subchannel is reused, for example by UE and , we have . Problem (23) can now be simplified into the following problem at subchannel .


where is the SINR corresponding to the desired rate in C2 and the sum rate threshold on the right-hand side of C3. The second-order cone constraint D2 is an equivalent transformation of C2 and C3.

The target function in (28) is non-convex when subchannel is reused. Hence, a C-additive approximation of the drift-plus-penalty algorithm is presented, the performance of which is within an additive constant of the infimum. The C-additive approximation [26] is defined as follows.

Definition 2

(C-additive approximation). For a given constant , a C-additive approximation of the drift-plus-penalty algorithm is to choose an action that yields a conditional expected value on the right-hand-side of the drift-plus-penalty under given at time slot , which is within a constant from the infimum over all possible control actions.

The C-additive approximation of the drift-plus-penalty algorithm is inspired by the equivalence between the weighted sum rate maximization and WMMSE[29] for the MIMO channel, which is extended to solve problem (28). We state this equivalence as follows.

Proposition 3

(Equivalent WMMSE problem). Problem (28) has the same optimal solution as the following WMMSE problem:


where denotes the mean-square error (MSE) weight for UE , is a receiver variable, and is the corresponding MSE defined as


Note that WMMSE problem (29) is not jointly convex in , and but convex with respect to each of the individual optimization variables when other individuals are fixed. Hence, the block coordinate descent (BCD) method is utilized to obtain a stationary point of problem (29). The BCD method is summarized as follows and described in Algorithm 2.

  • The optimal receiver under the fixed and is given by

  • The optimal MSE weight under the fixed and is given by

  • Note that the optimization problem for finding the optimal transmit power under the fixed and is


    which is a second-order cone problem and can be solved efficiently when a convex feasible region exists. Note that the convex region is defined jointly by constraints C4, D2 and D3. In particular, the new constraint D3 is derived from C1. In constraint C1, the computing resource consumption of UE is calculated according to the resource allocation under the determined mode selection. In the presented BCD method, however, the resource allocation is determined iteratively. Hence, in D3 is calculated based on the power output by the last iteration.
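
The three BCD blocks can be sketched for a scalar (single-antenna) interference channel. This is a minimal sketch under strong assumptions and not the paper's Algorithm 2: constraints D2 and D3 are dropped, so the power step has a closed form instead of requiring a second-order cone solver; rates are measured in nats; `q` stands in for the queue weights and `v_pen` for the penalty parameter; `h[k][j]` is a hypothetical amplitude gain from the transmitter of UE j to the receiver serving UE k.

```python
import math


def sinr(k, p, h, noise):
    """SINR of UE k for powers p, amplitude gains h and noise power."""
    interference = sum(h[k][j] ** 2 * p[j] for j in range(len(p)) if j != k)
    return h[k][k] ** 2 * p[k] / (interference + noise)


def objective(p, h, noise, q, v_pen):
    """Power-minus-rate value: V * sum(p) - sum_k q_k * ln(1 + SINR_k)."""
    rate = sum(q[k] * math.log(1.0 + sinr(k, p, h, noise)) for k in range(len(p)))
    return v_pen * sum(p) - rate


def wmmse_bcd(h, noise, q, v_pen, p_max, iters=50):
    """Run WMMSE-style block coordinate descent; return (powers, history)."""
    n = len(q)
    amp = [math.sqrt(p_max)] * n  # transmit amplitudes, power = amp ** 2
    history = []
    for _ in range(iters):
        # Block 1: MMSE receiver for fixed amplitudes.
        total = [sum(h[k][j] ** 2 * amp[j] ** 2 for j in range(n)) + noise
                 for k in range(n)]
        u = [h[k][k] * amp[k] / total[k] for k in range(n)]
        # Block 2: MSE weight, the inverse of the MSE at the MMSE receiver.
        w = [1.0 / (1.0 - u[k] * h[k][k] * amp[k]) for k in range(n)]
        # Block 3: amplitudes for fixed receivers and weights. The problem is
        # a separable quadratic, so clipping the stationary point is optimal.
        for k in range(n):
            denom = sum(q[j] * w[j] * u[j] ** 2 * h[j][k] ** 2
                        for j in range(n)) + v_pen
            candidate = q[k] * w[k] * u[k] * h[k][k] / denom
            amp[k] = min(math.sqrt(p_max), max(0.0, candidate))
        history.append(objective([a * a for a in amp], h, noise, q, v_pen))
    return [a * a for a in amp], history


if __name__ == "__main__":
    gains = [[1.0, 0.3], [0.2, 0.8]]
    powers, values = wmmse_bcd(gains, 0.1, [2.0, 1.0], 0.5, 1.0)
    print(powers, values[-1])
```

Each block minimizes the weighted MSE objective exactly in its own variables, so the recorded power-minus-rate value does not increase from one iteration to the next, although only a stationary point is reached in general.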

1:  For each slot , observe the current and , and then make the queue-aware power allocation according to the following steps:
2:  Initialize the precision , power and corresponding power-minus-rate function ;
3:  repeat
4:     Update and ;
5:     With fixed, compute according to (31);
6:     Compute the corresponding MSE according to (30) and set ;
7:     Find the optimal value of power by solving problem (33);
8:     Calculate the corresponding power-minus-rate function