I Introduction
The demand for and development of Internet of Things (IoT) services, a key challenge in 5G, have been rising continuously with the expanding diversity and density of IoT devices [1]. Cloud radio access networks (CRANs) [2] are regarded as a promising mobile network architecture to meet this challenge. Specifically, CRANs separate base stations into radio units, commonly referred to as remote radio heads (RRHs), and a centralized signal-processing baseband unit (BBU) pool. In a CRAN, the BBU pool can be placed in a convenient and easily accessible location, and RRHs can be deployed on poles or rooftops on demand. The CRAN architecture is expected to be an integral part of future deployments to enable efficient IoT services.
Dynamic resource allocation (DRA) for IoT in CRANs is indispensable to maintain acceptable performance. To obtain the optimal allocation strategy, several works have applied convex optimization methods, such as second-order cone programming (SOCP) in [3], semidefinite programming (SDP) in [4] and mixed-integer programming (MIP) in [5]. However, in real-time CRANs where the environment keeps changing, these methods struggle to find the optimal decision efficiently. Reinforcement learning (RL) has been used to increase the efficiency of the solution procedure in [3] [6].
RL has shown great advantages in solving DRA problems in wireless communication systems and for IoT. Existing approaches to the DRA problem in RANs generally model it as an RL problem [6][7][8], with different quantities chosen as the reward. For instance, the work in [6] regarded the successful transmission probability of the user requests as the reward, and the work in [8] set the sum of the average quality of service (QoS) and the average resource utilization of the slice as the reward. However, as allocation problems grow more complex, the search space of solutions tends to become infinite, which is hard to tackle.
With the combination of RL and deep neural networks (DNNs) [9], deep reinforcement learning (DRL) has been proposed and applied to address the above problems in [10][11][12]. By exploiting the ability of DNNs to extract useful features directly from a high-dimensional state space, DRL is able to perform end-to-end RL [9]. With the assistance of DNNs, large search spaces and continuous states are no longer insurmountable challenges.
To apply the DRL framework to DRA problems, the design of the reward, action and state becomes vital. The action set needs to be enumerable in most circumstances. The work in [3] used a two-step decision framework to guarantee enumerability by changing the state of one RRH at each epoch, which performs well in models with innumerable states.
Furthermore, in DRA problems, finding the optimal allocation strategy is in most cases eventually turned into another optimization problem, i.e., a convex optimization problem [13], which can be solved mathematically. Unfortunately, traditional algorithms [14][15][16] for solving such convex optimization problems, e.g., SOCP, still face significant limitations: they are time-consuming, which makes it hard to generate a policy for large-scale systems.
Recent works have achieved significant improvements in computational efficiency by applying DNN approximators [17][18][19][20] to DRA problems. However, the unstable behavior of DNNs in regression makes it hard to achieve good performance [21]. With a large number of hyperparameters, fine-tuning becomes even harder in practical systems. Some researchers have investigated this problem from the perspectives of computability theory and information theory, e.g., in [22].
Gradient boosting machine (GBM) [23] is a member of the boosting family of algorithms [24][25], a sub-branch of ensemble learning [26][27][28]. It has been firmly established as one of the state-of-the-art approaches in the machine learning (ML) community, and it has played a dominant role in data mining and machine learning competitions [29] due to its fast training and excellent performance. However, to the best of our knowledge, few works have applied this method to the DRA problem, or even to other regression problems in communication systems.
In this paper, to efficiently address the DRA problem for IoT in CRANs with innumerable states, one common form of DRL, namely the deep Q-network (DQN), is employed. Moreover, to obtain the reward in DQN with low latency, a tree-based GBM, i.e., the gradient boosting decision tree (GBDT), is utilized to approximate the solutions of SOCP. We then demonstrate the improvement of our method by comparing it with traditional methods in simulations.
The main contributions of this paper are as follows:

We first give the model of the dynamic resource allocation problem for IoT in the real-time CRAN. Then, we propose a GBDT-based regressor to approximate the SOCP solution of the optimal transmitting power consumption, which serves as the immediate reward needed in DQN. By doing so, there is no need to solve the original SOCP problem every time, and therefore considerable computational cost can be saved.

Next, we combine the GBDT-based regressor with a DQN to propose a new framework, where the immediate reward is obtained from the GBDT-based regressor instead of SOCP solutions, to generate the optimal policy for controlling the states of RRHs. The proposed framework can reduce the power consumption of the whole CRAN system for IoT.

We show the performance gain and complexity reduction of our proposed solution by comparing it with existing methods.
The remainder of this paper is organized as follows. Section II presents the related works, and the system model is given in Section III. Section IV introduces the proposed GBDT-based DQN framework. The simulation results are reported in Section V, followed by the conclusions in Section VI.
II Related Works
The resource allocation problem in CRANs is normally formulated as an optimization problem, where one needs to search the decision space to find an optimal combination of decisions that optimizes different goals [13] [30] [31] based on the current situation. Although numerous researchers have devoted their efforts to solving such optimization problems, most of them are still hard or impossible to tackle with traditional, purely mathematical methods. RL has recently been applied to address those problems.
In [32], a model-free RL model was adopted to solve the adaptive selection problem between backhaul and fronthaul transfer modes, aiming to minimize the long-term delivery latency in a fog radio access network (FRAN). Specifically, an online, on-policy, value-based strategy, State-Action-Reward-State-Action (SARSA), with linear approximation was applied in this system. Moreover, some works have proposed more efficient RL methods to overcome the slow convergence and scalability issues of traditional RL-based algorithms such as Q-learning. In [33], four methods, i.e., state space reduction techniques, convergence speed-up methods, demand forecasting combined with an RL algorithm, and DNNs, were proposed to handle the aforementioned problems, especially the huge state space.
Furthermore, as reported in [34], DQN achieved better performance on resource allocation problems than the traditional Q-learning based method. In practice, the state space may be very large or even infinite, which makes it impossible to traverse every state as required by traditional Q-learning. Approximation methods can address this kind of problem: they map the continuous and innumerable state space to a near-optimal Q-value space in a continuous setting, rather than relying on a Q-table. DNNs have shown their advantage in approximating high-dimensional spaces in many domains. Therefore, adopting a DNN to estimate the Q-value can improve system performance and computing efficiency, as reported in the simulation results of [34].
In [3], a two-step decision framework was adopted to solve the enumerability problem of the action space in CRANs. The DRL agent first determined which RRH to turn on or off, and then obtained the resource allocation solution by solving a convex optimization problem. Any other complex action can be decomposed into this two-step decision, reducing the action space significantly. Moreover, the work in [6] shows that the SA (i.e., Single BS Association) scheme is impractical even in a small-scale CRAN. Specifically, the SA scheme abandoned the collaboration among RRHs and supported only a few users. This research serves as guidance for our work.
The works in [6] and [8] both adopted DRL to solve resource allocation problems in RAN settings. In [8], the concept of intelligent allocation based on DRL was proposed to tackle the cache resource optimization problem in FRAN. To satisfy users’ QoS, the caching schemes should be intelligent, i.e., more effective and self-adaptive. Given the limited cache space, this requirement makes scheme design challenging and motivates the adoption of DRL.
As reported in [6], a DRL-based framework was used in a more complicated resource allocation setting, i.e., virtualized radio access networks. Based on the average QoS utility and resource utilization of users, the DQN-based autonomous resource management framework enables virtual operators to customize their own utility and objective functions according to different requirements.
In this paper, to improve system efficiency, we propose a novel gradient-boosting-based DQN framework for the resource allocation problem, which significantly improves system performance through offline training and online running.
To the best of our knowledge, few works have applied gradient boosting machines to approximate solutions of convex optimization problems in wireless communication, and we are the first to propose this framework.
III System Model
III-A Network Model
We consider a typical CRAN architecture with a single-cell model, comprising a set of RRHs and a set of users, which can be IoT devices. In the DRA for IoT in CRAN shown in Fig. 1, the current states, i.e., the state of each RRH and the demands of the IoT device users, are obtained from the network in each decision epoch. All the RRHs are connected to the centralized BBU pool, meaning that all information can be shared and processed by the DQN-based agent to make decisions, i.e., turning the RRHs on or off. We simplify the model by assuming that all RRHs and users are equipped with a single antenna; this can readily be generalized to the multi-antenna case by using the technique proposed in [35].
Then, the corresponding signaltointerferenceplusnoise ratio (SINR) at the receiver of th user can be given as:
(1) 
where denotes the channel gain vector and each element denotes the channel gain from RRH to user ; denotes the vector of all RRHs beamforming to user and each element denotes the weight of beamforming vector in RRH distributed to user and is the noise.
According to the Shannon formula, the data rate of user can be given as:
(2) 
where is the channel bandwidth and is the SINR margin depending on a couple of practical considerations, e.g., the modulation scheme.
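As an illustration of Equations (1) and (2), the following minimal Python sketch computes a per-user SINR and the corresponding data rate for single-antenna RRHs. The names `h`, `w`, `noise_power` and `sinr_margin`, the real-valued gains and all numbers in the usage note are assumptions made for illustration; they are not the paper's notation or data.

```python
import math


def sinr(h, w, user, noise_power):
    """SINR at `user`.

    h[u][r] is the channel gain from RRH r to user u (assumed layout),
    w[v][r] is the beamforming weight that RRH r assigns to user v.
    """
    def received_power(v):
        # Power that `user` receives from the beam intended for user v.
        amplitude = sum(h[user][r] * w[v][r] for r in range(len(h[user])))
        return abs(amplitude) ** 2

    signal = received_power(user)
    interference = sum(received_power(v) for v in range(len(w)) if v != user)
    return signal / (interference + noise_power)


def data_rate(bandwidth_hz, sinr_value, sinr_margin=1.0):
    """Shannon-type rate with an SINR margin dividing the SINR."""
    return bandwidth_hz * math.log2(1.0 + sinr_value / sinr_margin)
```

For example, with `h = [[1.0, 0.5], [0.5, 1.0]]`, `w = [[1.0, 0.0], [0.0, 1.0]]` and `noise_power = 0.25`, user 0 receives a signal power of 1 and an interference power of 0.25, so its SINR is 2.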
The relationship of the transmitting power and the power consumed by the base station can be approximated to be nearly linear, according to [36]. Then, we apply the linear power model for each RRH as:
(3) 
where is the transmitting power of RRH ; is a constant denoting the drain efficiency of the power amplifier; and is the power consumption of RRH when is active without transmitting signals. In the case of no need for transmission, can be set to the sleep mode, whose power can be given by . Thus, one has .
In addition, we take consideration of the power consumption for the state transition of RRHs, i.e. the power consumed to change RRHs’ states. We put the RRHs which reverse states in the current epoch to the set and use to denote the power to change the mode between and , i.e. we assume they share the same power consumption. Therefore, in the current epoch, the total power consumption of all RRHs can be written as:
(4) 
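The power model of Equations (3) and (4) can be sketched as follows. This is a minimal illustration that assumes the usual linear form (active power plus transmitting power divided by the amplifier efficiency); the function and parameter names are hypothetical, and the default values are taken from the simulation settings in Table I.

```python
def rrh_power(active, tx_power, p_active=6.8, p_sleep=4.3, efficiency=0.25):
    """Power drawn by one RRH in watts under the assumed linear model."""
    if not active:
        return p_sleep  # sleeping RRHs draw a constant power
    # Active RRH: static active power plus amplifier input power.
    return p_active + tx_power / efficiency


def total_power(states, tx_powers, switched, p_transition=2.0):
    """Total power of all RRHs in one epoch.

    states[r] is True if RRH r is active, tx_powers[r] is its transmitting
    power, and `switched` holds the indices of RRHs that changed state.
    """
    static_and_tx = sum(rrh_power(a, p) for a, p in zip(states, tx_powers))
    return static_and_tx + p_transition * len(switched)
```

With one active RRH transmitting 0.5 W and one RRH that has just been put to sleep, the total is 6.8 + 0.5 / 0.25 + 4.3 + 2.0 = 15.1 W.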
III-B CP-Beamforming
From Equation (4), one can see that the latter two parts are easy to calculate, since they consist of constants and rely only on the current state and action. To minimize the total power consumption, it is necessary to calculate the minimal transmitting power in each epoch, which depends on the allocation of beamforming weights in the active RRHs.
Therefore, this optimization problem can be expressed as:
Control Plane (CP)-Beamforming:
(5)  
(5.1)  
(5.2)  
(5.3) 
where the objective is to get the minimal total transmitting power given the states of RRHs and user demands. Also, the variables are distributive weights corresponding to beamforming power; is defined as the user demand; is given by Equation (1) and is the transmitting power constraint for RRH . Also, Constraint (5.2) ensures the demand of all users will be met, whereas Constraint (5.3) ensures the limitation of transmitting power in each RRH.
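For orientation, a problem of the kind described above can be written in the following generic form. The notation is assumed for illustration only ($\mathcal{A}$ for the set of active RRHs, $\mathcal{U}$ for the users, $w_{r,u}$ for the beamforming weights, $D_u$ for the demand of user $u$, $P_r^{\max}$ for the per-RRH power limit, $B$ and $\Gamma$ for the bandwidth and SINR margin); the paper's own symbols and exact constraint numbering may differ.

```latex
\begin{aligned}
\min_{\{w_{r,u}\}} \quad & \sum_{r \in \mathcal{A}} \sum_{u \in \mathcal{U}} |w_{r,u}|^2 \\
\text{s.t.} \quad & B \log_2\!\left(1 + \frac{\mathrm{SINR}_u}{\Gamma}\right) \ge D_u, \quad \forall u \in \mathcal{U}, \\
& \sum_{u \in \mathcal{U}} |w_{r,u}|^2 \le P_r^{\max}, \quad \forall r \in \mathcal{A}.
\end{aligned}
```

Under this form, the rate constraint is equivalent to the SINR threshold $\mathrm{SINR}_u \ge \Gamma\,(2^{D_u/B} - 1)$, which is the step that makes a second-order cone reformulation possible.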
As shown in [4], the above CP-Beamforming problem can be transformed into an SOCP problem. Therefore, we rewrite the above optimization as:
Modified CP-Beamforming:
(6)  
(6.1)  
(6.2)  
(6.3)  
(6.4)  
(6.5) 
where we apply variable to replace the optimization (5.1) by adding Constraint (6.3), which is a common method in transformation process [37]. We also rewrite Constraint (5.2) as Constraint (6.1) and apply some simple manipulations to get the above modified optimization.
Now, it is easy to see that the above Modified CP-Beamforming optimization is a standard SOCP problem. By using the iterative algorithm proposed in [38], we can obtain the optimal solution. It is worth noting that the CP-Beamforming optimization may have no feasible solution, which means that more RRHs should be activated to satisfy the user demands. In this case, we give a large negative reward to the DQN agent and exit the current training loop.
Then, we can calculate the total power consumption by applying Equation (4). In the following part, we propose the DQN-based framework to predict the states of RRHs and adopt GBDT to approximate the solutions of the aforementioned SOCP problems.
IV GBDT-Aided Deep Q-Network for DRA in CRANs
IV-A State, Action Space and Reward Function
Our goal in the aforementioned DRA problem is to generate a policy that minimizes the system’s power consumption in any state by taking the best action. Here, the best action is the one, among all available actions, that contributes the least to the overall power consumption in the long term while still satisfying user demands, system requirements and constraints. The fundamental idea of an RL-based method is to abstract an agent and an environment from the given problem to generate the environment model [39], and to employ the agent to find the optimal action in each state, so as to maximize the cumulative discounted reward by exploring the environment and receiving the immediate reward signalled by the environment.
To apply the RL method to our problem, we transform the system model defined in Section III into an RL model. The general assumption that future reward is discounted by a constant factor per timestep is made here. Then, the cumulative discounted reward from a given timestep can be expressed as:
(7.1) 
where denotes mathematical expectation; denotes the th reward; denotes the th state and denotes the discount factor. If tends to 0, the agent only considers the immediate reward; whereas if tends to 1, the agent focuses on the future reward. Moreover, the infinity over the summation sign indicates the endless sequence in DRA problem.
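The effect of the discount factor can be illustrated with a short Python sketch. It evaluates the discounted sum for a finite, already observed reward sequence, which is a simplification of the endless sequence considered in the text; the function name is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step applies one more discount factor.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With a discount factor of 0 only the first reward counts, whereas with a factor of 1 all rewards in the sequence are weighted equally.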
Following the common definition in Q-learning, the optimal action-value function is defined as the greatest expected cumulative discounted reward reached by taking a given action in a given state and then following an optimal policy, which guarantees the optimality of the cumulative future reward. The function obeys the Bellman equation, a well-known identity in optimality theory. In this model, the optimal action-value function, representing the maximum cumulative reward from a state with a given action, can be expressed as:
(7.2) 
where denotes the immediate reward received at state if action is taken; denotes the possible action in the next state , and other symbols are of the same meaning as Equation (7.1). The expression means that the agent takes action in the state , receiving the immediate reward , and then subsequently follows an optimal trajectory that leads to greatest value.
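In the standard notation of Q-learning, the Bellman optimality equation described above reads as follows. The symbols ($s$, $a$ for the current state and action, $s'$, $a'$ for the next state and action, $r(s,a)$ for the immediate reward, $\gamma$ for the discount factor) are the conventional ones and are assumed here; the paper's own symbols may differ.

```latex
Q^{*}(s, a) = \mathbb{E}_{s'}\!\left[\, r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]
```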
In a general view, demonstrates how promising the final expected cumulative reward will be if action is taken in state in a quantitative way. That is to say, in DRA problem, how much power consumption the CRAN can cut down if it decides to take the action , i.e switches on or off one selected RRH when observing the state , i.e. a set of user demands and the states (i.e. sleep/active) of RRHs. Since the true value of can never be known, our goal is to employ DNN to learn an approximation . For the following sections, just denotes the approximated and has all the same properties of .
The generic policy function defined in the context of RL is used here, which can be expressed as:
(7.3) 
where is the argmax of the action-value function over all possible actions in a specific state . The policy function leads to the action that maximizes the values in all states.
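A greedy policy of this kind reduces to an argmax over the estimated action values of the current state. The sketch below assumes the values are already available as a list indexed by action; the function name is hypothetical.

```python
def greedy_action(q_values):
    """Index of the largest action value; ties go to the lowest index."""
    return max(range(len(q_values)), key=lambda action: q_values[action])
```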
The state, action and reward defined in our problem are given as:

State: The state has two components that one is a set of states of RRHs and the other is a set of demands from users. Specifically, is defined as the set of all RRHs’ states, in which denotes the state of RRH . In the case of , RRH is in the sleep state, whereas means that it is in the active state. is defined as the set of all users’ demands, and denotes the demand of user , in which is the minimum of all demands and is the maximal demand. Thus, the state of RL is expressed as and its cardinality is .

Action: In each decision epoch, we enable the RL agent to determine the next state of one RRH. We use a set of to denote the action space, in which . If , it means RRH changes the state, otherwise the RRH keeps its current state in the next epoch. Then, the action space can be substantially reduced. It is noteworthy that we impose the constraint that , which means that at most one RRH alters its state, reducing the space to the size of .

Reward: To minimize the total power consumption, we define the immediate reward as the difference between the upper bound of the power consumption and the actual power consumption, which is expressed as:
where denotes the upper bound of the power consumption obtained from the system setting, and denotes the actual total power consumption of the system that is composed of three parts defined in Equation (4). To be more specific, the reward is defined to minimize the system power consumption under the condition of satisfying the user demands, which requires us to solve the optimization problem according to Equation (6), shown in Section III.
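The state, action and reward definitions above can be sketched in Python as follows. The encoding (0 for sleep, 1 for active, one action bit per RRH) and the function names are assumptions for illustration; the count of actions follows from the rule that at most one RRH changes state per epoch.

```python
def enumerate_actions(num_rrhs):
    """All actions that flip at most one RRH: num_rrhs + 1 bit vectors."""
    actions = [[0] * num_rrhs]  # the "change nothing" action
    for r in range(num_rrhs):
        flip = [0] * num_rrhs
        flip[r] = 1  # flip exactly RRH r
        actions.append(flip)
    return actions


def apply_action(rrh_states, action):
    """Toggle the state (0 = sleep, 1 = active) where the action bit is 1."""
    return [state ^ bit for state, bit in zip(rrh_states, action)]


def immediate_reward(power_upper_bound, actual_power):
    """Reward grows as the actual total power falls below the upper bound."""
    return power_upper_bound - actual_power
```

For 8 RRHs this yields 9 candidate actions per epoch, instead of every possible combination of on/off changes.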
To sum up, the policy mentioned in this work is a function that maps the current state , the set of user demand and RRHs status, to the best action , turning on or off one RRH, that minimizes the overall power consumption of the whole system.
IV-B Gradient Boosting Decision Tree
GBM is a gradient boosting framework that can be applied to any classifier or regressor. To be more specific, GBM is an aggregation of base estimators (i.e., classifiers or regressors), and any base estimator, such as k-nearest neighbor, neural network or naive Bayes estimators, can be fitted into the GBM. Better base estimators yield higher performance. Among all kinds of GBM, a prominent one is based on decision trees and is called the gradient boosting decision tree (GBDT); it has been gaining popularity for years due to its competitive performance in different areas. In our framework, GBDT is applied to the regression task due to its prominent performance.
The concept of GBDT is to optimize the empirical risk via steepest gradient descent in the hypothesis space by adding more base tree estimators. Considering the regression task in our work, given a dataset with entities of different states and their corresponding rewards generated by simulation and by solving SOCP, one can have
where denotes the state representation of system model, whereas denotes the corresponding solution of SOCP solver from Equation (6), in line with the definition of the Reward function. To optimize the empirical risk of regression is to minimize the expectation of a well-defined loss function over the given dataset , which can be expressed as:
(8) 
where denotes the model itself and is the final mapping to approximate , which is our fitting object, the power consumption. is the set of representing system model, and is the set of representing solution of SOCP solver. Here the first term is the model prediction loss, a differentiable convex function that measures the distance between the true and the estimated power consumption; the L2 loss (i.e., mean squared error) is applied in this task. The latter term is the regularization penalty applied to constrain model complexity, which helps produce a model with less overfitting and better generalization performance.
The choice of prediction loss and regularization penalty depends on the circumstances. The penalty function is given by:
where and are two hyperparameters, while and are the numbers of trees ensembled and weights owned by each tree, respectively. When the regularization parameter is set to zero, the loss function falls back to the traditional gradient tree boosting method [40].
GBDT starts with a weak model that simply predicts the mean value of the target at each leaf and improves the prediction by aggregating additive fixed-size decision trees as base estimators to predict the pseudo-residuals of previous results. The final prediction is a linear combination of the outputs of all regression trees. The final estimator function can be expressed as follows:
(9) 
where is the initial guess, is the base estimator at the iteration and is the weight for the estimator or a fixed learning rate. The product denotes the step at iteration .
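The additive procedure described above can be illustrated with a deliberately small, pure-Python sketch. It assumes a single scalar feature, squared-error loss (so the pseudo-residuals are simply the current errors), one-split trees (stumps) and a fixed learning rate; it is not the LightGBM implementation used in the simulations, and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split under squared error.

    Requires at least two distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbdt(xs, ys, num_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit stumps to the residuals."""
    initial = sum(ys) / len(ys)
    predictions = [initial] * len(ys)
    stumps = []
    for _ in range(num_trees):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return initial, learning_rate, stumps


def predict(model, x):
    """Initial guess plus the scaled sum of all stump outputs."""
    initial, learning_rate, stumps = model
    return initial + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )
```

For instance, `fit_gbdt([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0], num_trees=60, learning_rate=0.5)` starts from the mean 2.0 and converges towards predictions of 1.0 for the two small inputs and 3.0 for the two large ones.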
IV-C GBDT-based Deep Q-Network (DQN)
In this section, we show how to apply the GBDT-based DQN scheme to solve our DRA problem for IoT in real-time CRAN, using the previously defined states, actions and reward. Traditional RL methods, like Q-learning, compute and store the Q value for each state-action pair in a table. It is unrealistic to apply those methods to our problem, as the state-action pairs are countless and the user demands in a state are continuous variables. Therefore, DQN is considered the best solution for this problem. Similar to related works, e.g., [8][34], we also apply an experience replay buffer and fixed Q-targets in this work to estimate the action-value function.
In our framework, two stages are included, i.e., offline training and online decision making as well as regular training:

In the offline training stage, we pre-train the DQN to estimate the value of taking each action in any specific state. To achieve this, millions of system data samples are generated in terms of all RRHs’ states, user demands and the corresponding system power consumption by simulation and by solving the SOCP problem given in Equation (6). Then, the GBDT is employed to estimate the immediate reward, alleviating the expensive computation of solving the SOCP problem in further training and tuning.

For online decision making and regular tuning, we load the pre-trained DQN to generate the best action to take for our proposed DRA problem in real time. This is achieved by employing the policy function defined in (7.3), which selects the action with the maximum action value in the current state. To emphasize, the action-value function tells how much power consumption the system can cut down if it takes a given action when observing a given state. Then, the DQN observes the immediate reward obtained from the GBDT approximation and observes the next state. In the online regular tuning scheme, the DQN does not immediately update its model parameters when observing new states, but stores the new observations in a memory buffer. Then, under some given conditions, the DQN fine-tunes its parameters according to that buffer. This allows the DQN to dynamically adapt to new patterns regularly.
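The deferred-update behaviour of the online stage can be sketched with a bounded replay buffer. The trigger used here (a minimum number of stored samples and a fixed tuning period) is a hypothetical stand-in for the unspecified "given conditions", and the class and function names are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Draw without replacement, capped at the number of stored items.
        size = min(batch_size, len(self.memory))
        return rng.sample(list(self.memory), size)

    def __len__(self):
        return len(self.memory)


def should_fine_tune(buffer, step, min_samples=32, tune_every=100):
    """Hypothetical trigger: tune periodically once enough data is stored."""
    return len(buffer) >= min_samples and step % tune_every == 0
```

New observations are only stored at each step; parameters are updated from sampled mini-batches when the trigger fires, which keeps online decisions fast.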
The whole algorithm is given in Algorithm 1, whereas the framework of GBDT-based DQN is given in Fig. 2. The denotes the set of model parameters. The loss function is the L2 loss (i.e., mean squared error), which measures the difference between the Q target and the model output. S refers to the step in Algorithm 1.
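With fixed Q-targets, a loss of this kind is conventionally written as below. The notation is assumed for illustration: $\theta$ for the parameters of the network being trained, $\theta^{-}$ for the periodically synchronized target-network parameters, $\mathcal{D}$ for the replay buffer, and $r$ for the immediate reward, which in this framework would be supplied by the GBDT approximator.

```latex
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\!\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]
```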
In Fig. 2, the left side describes the DQN framework, illustrating the agent, the environment and how the reward is obtained. Specifically, the agent observes a new state from the environment after taking an action and then receives an immediate reward signalled by the reward function from the GBDT approximator. A traditional DQN obtains the reward by solving the SOCP optimization, which cannot run in real time, as explained before. In our architecture, we adopt GBDT regression (i.e., the right side of Fig. 2) to obtain the reward, which can operate online in real time.
We also give the training process of GBDT in the Appendix.
IV-D Error Tolerance Examination (ETE)
Our target is to use GBDT to approximate the typical SOCP problem in CRANs under the DQN framework. Thus, it is important to evaluate its practical performance. The error from GBDT or DNN will influence the optimality of the given scheme, and may even worsen the power consumption of the whole system. Therefore, examining the influence of this error is of vital significance. Considering its important role in the whole DRA problem, we emphasize the concept of error tolerance examination (ETE) here. Specifically, in the simulation, we first compare the optimal decision provided by the CP-Beamforming solution with the near-optimal decision from the GBDT or DNN approximation, and then evaluate its performance in the dynamic resource allocation settings.
V Simulation Results
In this section, we present the simulation settings and the performance of the proposed GBDT-based DQN solutions. We adopt the channel fading model from previous work [41]:
(4.1) 
where is the path loss with the distance of ; is the antenna gain; is the shadowing coefficient and is the smallscale fading coefficient. The simulation settings are summarized in Table I.
All training and testing processes are conducted in an environment equipped with 8 GB RAM, an Intel Core i7-6700HQ (2.6 GHz), Python 3.5.6, TensorFlow 1.13.1 and LightGBM 2.2.3.
Table I: Simulation settings
Parameter                          Value
Channel bandwidth                  10 MHz
Max transmit power                 1.0 W
Active power                       6.8 W
Sleep power                        4.3 W
Transition power                   2.0 W
Background noise                   -102 dBm
Antenna gain                       9 dBi
Log-normal shadowing               8 dB
Rayleigh small-scale fading  
Path loss with a distance of (km)  dB
Distance                           Uniformly distributed in m
Power amplifier efficiency         25%
W: watt; dB: decibel; dBm: decibel-milliwatts; dBi: dB (isotropic).
We compare our DQN-based solution containing the GBDT approximator (abbreviated as DQN) with two other schemes:
1) All RRHs Open (AO): all RRHs are turned on and can serve every user;
2) One RRH Closed (OC): one of the RRHs (chosen randomly) stays in the sleep state and cannot serve any user.
It is noteworthy that in previous work [42], another solution, in which only one random RRH is turned on, is also discussed for the dynamic resource allocation problem. However, it can hardly be applied to practical systems [3]. Therefore, we do not include it in the comparison in this paper.
V-A GBDT-based SOCP Approximator
V-A1 Computational Complexity
We compare the computational complexity of the GBDT approximator with that of the traditional SOCP solver in [33]. First, a test set of 1000 entities is randomly generated in terms of RRH states and user demands. Then, both the GBDT approximator and the traditional SOCP method are executed 10000 times to predict or compute the outputs for that test set. One can see from Table II that the GBDT approximator is roughly 100 to 570 times faster than the SOCP solver across the tested configurations, and that its time per input stays nearly constant while that of the SOCP solver grows with the system size, which demonstrates the efficiency of the GBDT approximator.
Table II: Average time per input
System input setup     GBDT      SOCP
6 RRHs and 3 users     0.00079   0.08281
8 RRHs and 4 users     0.00077   0.09387
12 RRHs and 6 users    0.00070   0.16240
18 RRHs and 9 users    0.00075   0.42803
The times in the above table are obtained by averaging over 1000 different system inputs, each of which is computed 10000 times by each of the two algorithms.
V-A2 Fitting Property
Then, we analyse the performance of the GBDT approximator in a specific situation with 8 RRHs and 4 IoT device users whose demands range from 20Mbps to 40Mbps. We compare it with a DNN approximator, a fully-connected network with 3 layers of 32, 64 and 1 neurons, respectively, whose activation function is the rectified linear unit (ReLU). First, in Fig. 3(a), we assume that all 8 RRHs are turned on. One can see from this figure that GBDT has better fitting performance than DNN. Then, we assume that one RRH is switched off. One can see from Fig. 3(b) that GBDT still fits the SOCP solutions very well. In Fig. 3(c), we assume that the states of all 8 RRHs are switched on or off randomly. As expected, GBDT shows a much better fit to the SOCP solutions.
V-B Training Effect of GBDT and DNN
We compare the training performance of the GBDT approximator and the DNN approximator in Fig. 4. Mean squared error (MSE) is used to calculate the loss. From Fig. 4, one can see that even when trained for far more time, the loss of DNN is still higher than that of GBDT. One also notices that GBDT has fewer parameters to adjust and therefore a quicker training process.
A detailed comparison is not presented here, as it is not the focus of this paper. Next, we examine the performance of the GBDT-based DQN solutions.
V-C System Performance
In this section, we consider 8 RRHs and 4 users, whose demands are randomly selected. We change the user demands every 100 ms. The performance of AO, OC and GBDT-based DQN is compared next.
V-C1 Instant Power
We examine the instant system power consumption in this subsection. In the top figures of Fig. 5(a) and Fig. 5(b), we compare the AO and DQN strategies, where all RRHs are initially switched on and remain active under the AO scheme. In the bottom figures of Fig. 5(a) and Fig. 5(b), we turn off one RRH randomly at the beginning for both OC and DQN, and that RRH stays switched off under the OC scheme. Moreover, user demands are selected randomly from 20Mbps to 40Mbps in Fig. 5(a), and from 20Mbps to 60Mbps in Fig. 5(b). One can see from all the figures in Fig. 5 that our proposed DQN always outperforms AO and OC. This is because DQN turns RRHs on and off depending on the current state of the system, whereas AO always keeps all RRHs on and OC randomly turns off one RRH, which may not be the optimal strategy and may lead to larger power consumption than DQN.
One can also see that when the upper limit of user demands increases from 40Mbps in Fig. 5(a) to 60Mbps in Fig. 5(b), the performance of DQN, OC and AO all becomes more unstable. However, our proposed DQN still performs best compared with AO and OC.
Moreover, one can see that although the GBDT approximator may introduce some errors, our proposed DQN framework still performs well, which shows the good error tolerance of our proposed solution.
V-C2 Average Power
In Fig. 6, we compare the long-term performance of the GBDT-based DQN, AO and OC. The DQN whose reward is obtained from the SOCP solver is also shown. The average system power consumption is computed by averaging the instant system power over all past time slots.
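The averaging described above amounts to a cumulative mean over time slots. The following is a minimal sketch of that computation; the function name `running_average_power` and the sample readings are hypothetical and serve only to illustrate the definition.

```python
from typing import Iterable, List


def running_average_power(instant_power: Iterable[float]) -> List[float]:
    """Return the mean of all instant power values seen up to each time slot."""
    averages: List[float] = []
    total = 0.0
    for slot_index, power in enumerate(instant_power, start=1):
        total += power
        averages.append(total / slot_index)
    return averages


if __name__ == "__main__":
    # Hypothetical instant power readings in Watts, one per 100 ms time slot.
    samples = [60.0, 62.0, 58.0, 64.0]
    print(running_average_power(samples))  # [60.0, 61.0, 60.0, 61.0]
```

Because each point averages over all past slots, the curve smooths out the slot-to-slot randomness of the user demands as time goes on.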
We first compare both DQN schemes (GBDT-based and SOCP-based) with the AO scheme for user demands below 40 Mbps. We switch all RRHs on, let the user demands change every 100 ms (one time slot), and run the simulation for 500 s. Fig. 6(a) shows that both DQN schemes outperform AO and save around 8 Watts per time slot. The slight fluctuation comes from the randomness of the user demands. Moreover, Fig. 6(a) shows that the DQN with GBDT performs similarly to the DQN with the SOCP solver, which demonstrates the error tolerance of the proposed solution.
We then switch one RRH off and analyse the average system power consumption under the DQN and OC schemes. Fig. 6(b) shows that both DQN schemes still outperform the OC scheme, as expected. As in Fig. 6(a), the DQN scheme with GBDT performs similarly to the DQN scheme with the SOCP solver.
V-C3 Overall Performance of GBDT-based DQN
To evaluate the overall performance of the GBDT-based DQN in different situations, we vary the user demands from 20 Mbps to 60 Mbps in steps of 10 Mbps and keep all other factors unchanged. Fig. 7(a) and Fig. 7(b) show that the power consumption of AO, OC and DQN increases with the user demands. The proposed GBDT-based DQN also performs much better than AO and OC, as expected, which demonstrates the effectiveness of our scheme.
VI Conclusion
In this paper, we presented a GBDT-based DQN framework to tackle the dynamic resource allocation problem for IoT in real-time CRANs. We first employed GBDT to approximate the solutions of the SOCP problem. Then, we built the DQN framework to generate an efficient resource allocation policy with respect to the status of the RRHs in CRANs. Furthermore, we demonstrated the offline training, online decision making and regular tuning processes. Lastly, we evaluated the proposed framework against two other methods, AO and OC, and examined its accuracy and error tolerance in comparison with the SOCP-based DQN scheme. Simulation results showed that the proposed GBDT-based DQN achieves much better power saving than the baseline solutions under the real-time setting. Future work is in progress to let the GBDT approximator meet the strict constraints of practical problems, so that it can be employed in a wide range of scenarios.
[Training and Predicting Process of GBDT]
The training process of GBDT is shown in Algorithm 2.
GBDT combines two concepts: the gradient and boosting. In the training process, the 0th tree is fitted to the given training dataset and predicts the mean value of $y_i$ in the training set regardless of the input; the predicted values of the 0th tree are denoted as $\hat{y}_i^{(0)}$. However, the predictions of the 0th tree still leave residuals $r_i^{(1)} = y_i - \hat{y}_i^{(0)}$ with respect to the true values $y_i$. Then, another additive tree is fitted to a new dataset whose inputs are the same as those of the 0th tree but whose fitting targets are the residuals $r_i^{(1)}$. The predictions of the GBDT are then the linear combination of the predictions of the 0th tree and the new additive tree, namely $\hat{y}_i^{(1)} = \hat{y}_i^{(0)} + \alpha_1 f_1(x_i)$, where $f_1$ is the new tree and $\alpha_1$ is the weight attributed to this tree. Next, another tree is fitted to the new residuals $r_i^{(2)} = y_i - \hat{y}_i^{(1)}$, following the same process as before.
From the above process, one can see that the boosting concept utilizes the residuals between the previous ensemble predictions and the true values. By learning from the residuals, the model improves as new trees are added. The gradient part of the concept means that the whole training process is supervised and guided by the gradient of the objective function, typically expressed as $L = \frac{1}{2}\sum_i (y_i - \hat{y}_i)^2$, whose negative derivative with respect to $\hat{y}_i$ is the pseudo-residual $y_i - \hat{y}_i$ between $y_i$ and $\hat{y}_i$.
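The residual-fitting loop described above can be sketched in a few lines. This is a simplified illustration and not the paper's Algorithm 2: it assumes one-dimensional inputs, depth-1 regression trees (stumps), a squared-error objective and a fixed tree weight, and all function names (`fit_stump`, `fit_gbdt`, `predict_gbdt`, `mse`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)
Model = Tuple[float, float, List[Stump]]  # (base prediction, tree weight, stumps)


def fit_stump(x: Sequence[float], r: Sequence[float]) -> Stump:
    """Fit a depth-1 regression tree to targets r by minimizing squared error."""
    mean_r = sum(r) / len(r)
    best: Stump = (float("inf"), mean_r, mean_r)  # fallback: constant prediction
    best_err = sum((ri - mean_r) ** 2 for ri in r)
    for threshold in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        if not right:
            continue  # splitting at the largest x leaves the right side empty
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((ri - left_mean) ** 2 for ri in left)
        err += sum((ri - right_mean) ** 2 for ri in right)
        if err < best_err:
            best_err = err
            best = (threshold, left_mean, right_mean)
    return best


def predict_stump(stump: Stump, xi: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def fit_gbdt(x: Sequence[float], y: Sequence[float],
             n_trees: int = 50, weight: float = 0.5) -> Model:
    """Boosting loop: each new stump is fitted to the current residuals."""
    base = sum(y) / len(y)  # the 0th tree predicts the mean of the targets
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_trees):
        # For squared error, the pseudo-residual is simply y - prediction.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + weight * predict_stump(stump, xi)
                 for pi, xi in zip(preds, x)]
    return base, weight, stumps


def predict_gbdt(model: Model, xi: float) -> float:
    base, weight, stumps = model
    return base + weight * sum(predict_stump(s, xi) for s in stumps)


def mse(y: Sequence[float], preds: Sequence[float]) -> float:
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [v * v for v in xs]  # toy targets, not data from the paper
    model = fit_gbdt(xs, ys)
    mean_only = [sum(ys) / len(ys)] * len(ys)
    fitted = [predict_gbdt(model, v) for v in xs]
    print("MSE of 0th tree only:", mse(ys, mean_only))
    print("MSE after boosting:  ", mse(ys, fitted))
```

On the toy data, the training error after boosting is lower than that of the mean-only 0th tree, which is the effect of repeatedly fitting new trees to the remaining residuals.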
References
 [1] J. Lin et al., “A survey on Internet of Things: Architecture, enabling technologies, security and privacy, and applications,” IEEE Internet Things J., vol. 4, no. 5, pp. 1125–1142, Oct. 2017.
 [2] A. Checko et al., “Cloud RAN for mobile networks–A technology overview,” IEEE Commun. Surveys Tuts., vol. 17, no. 1, pp. 405–426, Sep. 2014.
 [3] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, “A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs,” in Proc. IEEE Int. Conf. Commun. (ICC), pp. 1–6, 2017.
 [4] A. Wiesel, Y. C. Eldar, and S. Shamai, “Linear precoding via conic optimization for fixed MIMO receivers,” IEEE Trans. Signal Process., vol. 54, no. 1, pp. 161–176, 2006.
 [5] M. Gerasimenko et al., “Cooperative radio resource management in heterogeneous cloud radio access networks,” IEEE Access, vol. 3, pp. 397–406, 2015.
 [6] Y. Zhou et al., “Deep reinforcement learning based coded caching scheme in fog radio access networks,” 2018 IEEE/CIC International Conference on Communications in China (ICCC Workshops), pp. 309–313, 2018.
 [7] P. Rost et al., “Cloud technologies for flexible 5G radio access networks,” IEEE Commun. Mag., vol. 52, no. 5, pp. 68–76, 2014.
 [8] G. Sun et al., “Dynamic reservation and deep reinforcement learning based autonomous resource slicing for virtualized radio access networks,” in IEEE Access, vol. 7, pp. 45758–45772, 2019.
 [9] V. François-Lavet et al., “An introduction to deep reinforcement learning,” Foundations and Trends in Machine Learning, vol. 11, no. 3–4, pp. 219–354, 2018.
 [10] H. He et al., “Model-driven deep learning for physical layer communications,” arXiv preprint arXiv:1809.06059, 2019.
 [11] H. Zhu et al., “Caching transient data for Internet of Things: A deep reinforcement learning approach,” IEEE Internet Things J., vol. 6, no. 2, pp. 2074–2083, Apr. 2019.
 [12] H. Zhu, Y. Cao, W. Wang, T. Jiang, and S. Jin, “Deep reinforcement learning for mobile edge caching: Review new features and open issues,” IEEE Netw., vol. 32, no. 6, pp. 50–57, Nov. 2018.
 [13] D. Liu et al., “User association in 5G networks: A survey and an outlook,” IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1018–1044, 2nd Quart. 2015.
 [14] A. Domahidi, E. Chu, and S. Boyd, “ECOS: An SOCP solver for embedded systems,” 2013 European Control Conference (ECC), pp. 3071–3076, 2013.
 [15] E. Andersen and K. Andersen, “The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm,” High Performance Optimization, vol. 33, pp. 197–232, 2000.
 [16] J. F. Sturm, “Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones,” Optimization Methods and Software, vol. 11, no. 1–4, pp. 625–653, 1999.
 [17] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” Proceedings of the 27th International Conference on Machine Learning. Omnipress, pp. 399–406, 2010.
 [18] J. R. Hershey, J. Le Roux, and F. Weninger, “Deep unfolding: Model–based inspiration of novel deep architectures,” arXiv preprint arXiv:1409.2574, 2014.
 [19] C. Lu, W. Xu, S. Jin, and K. Wang, “Bit-level optimized neural network for multi-antenna channel quantization,” IEEE Commun. Lett. (Early Access), pp. 1–1, Sep. 2019.
 [20] C. Lu, W. Xu, H. Shen, J. Zhu, and K. Wang “MIMO channel information feedback using deep recurrent network,” IEEE Commun. Lett., vol. 23, no. 1, pp. 188–191, Jan. 2019.
 [21] Z. H. Zhou and J. Feng, “Deep forest: Towards an alternative to deep neural networks,” arXiv preprint arXiv:1702.08835, 2017.
 [22] H. Sun et al., “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct 2018.
 [23] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of Statistics, pp. 1189–1232, 2001.
 [24] L. Breiman, “Bias, variance, and arcing classifiers,” Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA, 1996.
 [25] Z. H. Zhou, “Ensemble methods: foundations and algorithms,” Chapman and Hall/CRC, 2012.
 [26] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical study,” Journal of Artificial Intelligence Research, pp. 169–198, 1999.
 [27] R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits Syst. Mag., vol. 6, no. 3, pp. 21–45, 2006.
 [28] L. Rokach, “Ensemble–based classifiers,” Artificial Intelligence Review, vol. 33, no. 1–2, pp. 1–39, 2010.
 [29] A. Natekin and A. Knoll, “Gradient boosting machines, a tutorial,” Frontiers in neurorobotics, vol. 7, no. 21, 2013.
 [30] T. P. Do and Y. H. Kim, “Resource allocation for a full-duplex wireless-powered communication network with imperfect self-interference cancelation,” IEEE Commun. Lett., vol. 20, no. 12, pp. 2482–2485, Dec. 2016.
 [31] J. Miao, Z. Hu, K. Yang, C. Wang, and H. Tian, “Joint power and bandwidth allocation algorithm with QoS support in heterogeneous wireless networks,” IEEE Commun. Lett., vol. 16, no. 4, pp. 479–481, 2012.
 [32] J. Moon et al., “Online reinforcement learning of X-Haul content delivery mode in fog radio access networks,” IEEE Signal Process. Lett., vol. 26, no. 10, pp. 1451–1455, 2019.
 [33] I. John, A. Sreekantan, and S. Bhatnagar, “Efficient adaptive resource provisioning for cloud applications using Reinforcement Learning,” 2019 IEEE 4th International Workshops on Foundations and Applications of Self* Systems (FAS*W), Umea, Sweden, pp. 271–272, 2019.
 [34] J. Li, H. Gao, T. Lv, and Y. Lu, “Deep reinforcement learning based computation offloading and resource allocation for MEC,” 2018 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6, April 2018.
 [35] B. Dai and W. Yu, “Energy efficiency of downlink transmission strategies for cloud radio access networks,” IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1037–1050, Apr. 2016.
 [36] G. Auer et al., “How much energy is needed to run a wireless network,” IEEE Wirel. Commun., vol. 18, no. 5, pp. 40–49, 2011.
 [37] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge university press, 2004.
 [38] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, “Applications of second-order cone programming,” Linear Algebra and its Applications, vol. 284, no. 1, pp. 193–228, 1998.
 [39] R. S. Sutton and A. G. Barto, Introduction to reinforcement learning, Cambridge: MIT press, 1998.
 [40] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016.
 [41] Y. Shi, J. Zhang, and K. B. Letaief, “Group sparse beamforming for green Cloud-RAN,” IEEE Trans. Wireless Commun., vol. 13, no. 5, pp. 2809–2823, May 2014.
 [42] B. Dai and W. Yu, “Energy efficiency of downlink transmission strategies for cloud radio access networks,” IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1037–1050, 2016.