I. Introduction
As a promising architecture, the fog radio access network (FRAN) can effectively support future internet of things (IoT) services with the help of edge caching and edge computing [1, 2]. These services include patient health monitoring [3], low-latency services [4], large-scale IoT data analytics [5], and so on. In FRANs, each user equipment (UE) can potentially operate in different communication modes, including cloud RAN (CRAN) mode, fog radio access point (FAP) mode, device-to-device (D2D) mode, and so on. In CRAN mode, UEs are served by multiple cooperative remote radio heads (RRHs), benefiting from centralized signal processing and resource allocation, while UEs in FAP mode and D2D mode are served locally by FAPs and by cache-equipped UEs, respectively. Recently, many studies have been conducted on FRANs, in terms of performance analysis [6], radio resource allocation [7], the joint design of cloud and edge processing [8], the impact of cache size [9], and so on.
Although significant progress has been achieved, resource management in FRANs still needs further investigation. Compared with resource management in traditional wireless networks, communication mode selection must be addressed as well because of its coupling with resource management, and meanwhile the dynamics of edge caching complicate the network environment; together, these lead to a more challenging problem. Specifically, from an optimization perspective, the communication mode selection problem is usually NP-hard [10]. To solve it, classical algorithms like branch and bound and particle swarm optimization can be adopted. Nevertheless, considering the network dynamics, the communication modes of UEs need to be updated frequently, which makes algorithms with high complexity less applicable.
On the other hand, owing to the rapid development of fast and massively parallel graphical processing units as well as the explosive growth of data, deep learning has attracted considerable attention and is widely adopted in speech recognition, image recognition, localization [11, 12], and so on. To help the computer learn the environment from high-dimensional raw input data and make intelligent decisions, the author in [13] proposes to combine deep learning with reinforcement learning, and the proposal is known as deep reinforcement learning (DRL). In DRL, the deep neural network (DNN) adopted as the Q-function approximator is called the deep Q network (DQN). Using a replay memory and a target DQN, the DRL algorithm can realize stable training.
Overall, applying DRL to wireless networks has the following considerable advantages. First, a DNN of moderate size can finish the prediction given the input in almost real time, since only a small number of simple operations is required for a forward pass [14]. This enables a DRL agent to make quick control decisions on networks based on the Q values output by the DQN. Second, the powerful representation capabilities of DNNs allow the DRL agent to learn directly from the high-dimensional raw collected network data instead of manually crafted inputs. Third, by distributing computation across multiple machines and multiple cores, the time to train the DQN can be greatly reduced [15]. Fourth, the DRL agent aims at optimizing a long-term performance, considering the impact of actions on future reward/cost. Fifth, DRL is a model-free approach, and hence does not rely on a specific system model that may be based on idealized assumptions. Finally, it is convenient for DRL based schemes to account for the cost incurred by system state transitions.
Motivated by the benefits of DRL, a DRL based joint mode selection and resource management approach is proposed, aiming at minimizing the long-term FRAN system power consumption. Using DRL, the controller can quickly control the communication modes of UEs and the on-off states of processors in the cloud in the face of the dynamics of edge caching states. After the controller makes the decision, precoding for UEs in CRAN mode is optimized subsequently under quality of service (QoS) constraints and the computing capability constraint in the cloud.
I-A. Related Work and Challenges
Up to now, some attention has been paid to radio resource management in fog radio access networks. In [7], a cloud radio access network with each RRH equipped with a cache is investigated, and a content-centric beamforming design is presented, where a group of users requesting the same content is served by an RRH cluster. A mixed-integer nonlinear programming problem is formulated to minimize the weighted sum of backhaul cost and transmit power under the QoS constraint of each user group. In [16], a similar network scenario is considered, aiming at minimizing the total system power consumption by RRH selection and load balancing. Specifically, the RRH operation power incurred by circuits and the cooling system is included, and a backhaul capacity constraint is involved. The author in [17] goes one step further by jointly optimizing RRH selection, data assignment, and multicast beamforming. The data assignment refers to whether the content requested by a user group is delivered to an RRH via backhaul, which is not handled in [7] and [16]. To solve the NP-hard network power consumption minimization problem, which consists of RRH power consumption and backhaul power consumption, a generalized layered group sparse beamforming modeling framework is proposed. Different from previous works that mainly optimize network cost or system power consumption, the author in [18] tries to minimize the total content delivery latency of the network, which is the sum of the wireless transmission latency and the backhaul latency caused by downloading uncached contents. Due to the fractional form and the ℓ0 norm in the objective function, the formulated problem is nonconvex, and it is then decomposed into a beamforming design problem and a data assignment problem.
Although the proposals in the above works achieve good performance, only the CRAN mode is taken into account. When a UE in FRANs is allowed to operate in different communication modes, the problem of user communication mode selection should be handled, which is the key to gaining the benefits of FRANs [19]. In [20], the author investigates a joint mode selection and resource allocation problem in a downlink FRAN, and particle swarm optimization is utilized to optimize user communication modes. Other approaches to mode selection problems include branch and bound [10] as well as Tabu search [21]. However, these optimization methods can incur high computational complexity. In [22], an evolutionary game is adopted to model the interaction of users for mode selection, in which the payoff of each user involves both the ergodic rate under a certain mode and the delay cost. Then, an algorithm based on replicator dynamics is proposed to achieve the evolutionary equilibrium. Nevertheless, the proposed algorithm can obtain only the proportion of users selecting each communication mode, and therefore accurate communication mode control cannot be realized. Moreover, the works [7, 16, 17, 18, 20] study resource management problems under a static environment where content availability at each cache is unchanged. This assumption is reasonable for content delivery via FAPs or RRHs with a cache, since cached contents are usually updated on a large time scale, and meanwhile FAPs and RRHs have stable power supplies to keep their caches operating normally. On the contrary, for content delivery via D2D transmission, cache state dynamics should be taken into account. That is, the local availability of the content requested by a UE at the cache of its paired UE can easily change with time, due to the autonomous and frequent cache update behavior of the UE holders, the dynamic battery level of the paired UE, time-varying user content requests, and so on. These dynamics make mode selection algorithms with high complexity inapplicable. Even worse, the interference between active D2D links and UEs in CRAN mode complicates the wireless environment as well.
Fortunately, DRL, as an emerging approach to complicated control problems, has the potential to provide efficient solutions for wireless network design. In [23], a DRL based communication link scheduling algorithm is developed for a cache-enabled opportunistic interference alignment wireless network. A Markov process is used to model the network dynamics, including the dynamics of cache states at the transmitter side and the channel state information (CSI). To extract features from the high-dimensional input composed of CSI and cache states, a DNN with several convolutional layers is used to learn the state representation. In [24], DRL is applied to mobility management, and a convolutional NN and a recurrent NN are responsible for feature extraction from the Received Signal Strength Indicator. The performance is evaluated on a practical testbed in a wireless local area network, and significant throughput improvement is observed. In [25], the author revisits the power consumption minimization problem in CRANs using DRL to control the activation of RRHs, where the power consumption caused by RRH on-off state transitions is considered as well. In addition, the author in [26] shows that DRL based on the Wolpertinger architecture is effective for cache management. Specifically, the request frequencies of each file over different time durations and the current file requests from users constitute the input state, and the action decides whether to cache the requested content.

I-B. Contributions and Organization
In this paper, a network power consumption minimization problem for a downlink FRAN is studied. Different from [7, 16, 17, 25], the power consumption induced by the running processors in the cloud for centralized signal processing is included as well. Owing to the caching capability of D2D transmitters, UEs can acquire the desired contents locally without accessing RRHs. This offloads traffic, which alleviates the burden on fronthaul on one hand and, on the other hand, allows turning off some processors in the cloud to save energy, since less computing resource is needed to support fewer UEs. Faced with the dynamic cache states at D2D transmitters and the interference between UEs in the same communication mode, a DRL based approach is proposed to help the network controller learn the environment from raw collected data and make intelligent and fast decisions on network operations to reduce system power consumption. To the best of our knowledge, this is the first work to adopt DRL to solve the joint communication mode selection and resource management problem while taking the dynamics of edge cache states into account to achieve a green FRAN. The main contributions of the paper are:

An energy minimization problem in a downlink FRAN with two potential communication modes, i.e., CRAN mode and D2D mode, is investigated. To make the system model more practical, the dynamics of cache states at D2D transmitters, which are modeled by a Markov process, and the power consumption caused by processors in the cloud are taken into account. Based on the system model, a Markov decision problem is formulated, where the network controller aims at minimizing the long-term system power consumption by controlling UE communication modes and processors' on-off states at each decision step, with the precoding for UEs in CRAN mode optimized subsequently.

For the precoding optimization under given UE communication modes and processors' on-off states, the corresponding problem is formulated as an RRH transmission power minimization problem under per-UE QoS constraints, per-RRH transmission power constraints, and the computing resource constraint in the cloud, which is solved by an iterative algorithm based on ℓ0-norm approximation. Then, a DRL based approach is proposed with UE communication modes, edge cache states, and processors' on-off states as input, which selects actions for communication mode and processor state control. After precoding optimization, the negative of the system power consumption is determined and fed back to the controller as the reward, based on which the controller updates the DQN.

The impacts of important learning parameters and the edge caching service capability on system performance are illustrated, and the proposal is compared with several other communication mode and processor state control schemes, including Q-learning and random control. Furthermore, the effectiveness of integrating transfer learning with DRL to accelerate the training process in a new but similar environment is demonstrated.
The remainder of this paper is organized as follows. Section II describes the downlink FRAN model. Section III formulates the concerned energy minimization problem, and the DRL based approach is specified in Section IV. Simulation results are illustrated in Section V, followed by the conclusion in Section VI.
II. System Model
The considered downlink FRAN system is shown in Fig. 1, which consists of one cloud, multiple RRHs, and multiple UEs with their paired D2D transmitters. The cloud contains multiple processors of heterogeneous computing capabilities, which are connected with each other via fiber links to achieve computing resource sharing, and the computing capability of each processor is measured in million operations per time slot (MOPTS) [27]. Each processor in the cloud has two states, i.e., an on state and an off state. Meanwhile, content servers provide large-scale caching capability, and the controller is used for network control such as resource management. The sets of processors, RRHs, and UEs are defined accordingly. Each RRH is equipped with multiple antennas, and each UE has a single antenna. In addition, RRHs communicate with the cloud via high-bandwidth fronthaul.
The paired D2D transmitter for each UE is chosen by comprehensively considering the social tie and the physical condition, as in [28]. In the considered scenario, each UE can operate either in D2D mode or CRAN mode, which is denoted by a binary indicator: one value means the UE operates in D2D mode, while the other means the UE is served by RRHs. Moreover, it is assumed that D2D transmission does not interfere with RRH transmission since they operate in different frequency bands, while all the UEs in the same communication mode share the same frequency band and hence interfere with each other. Finally, the high power node (HPN) with wide coverage is responsible for delivering control signalling and exchanges control information with the controller via backhaul [29]. In the following, the models for communication, computing, caching, and system energy consumption are elaborated.
II-A. Communication Model
By the collaborative transmission of RRHs, the received symbol of UE in CRAN mode is given by
(1) 
where is the message of UE , is the channel vector between RRH and UE , is the precoding vector of RRH for UE , and is the noise, which follows the distribution of . Then the data rate achieved by UE is given by
(2)
For UE in D2D mode, it is assumed that the D2D transmitter transmits at a constant power level, and the received symbol of UE is given by
(3) 
where is the transmit power of the D2D transmitter paired with UE , is the channel coefficient between UE and its D2D transmitter, is the channel coefficient between the D2D transmitter of UE and UE . Then the data rate achieved by UE in D2D mode is given by
(4) 
II-B. Computing Model
To complete the baseband processing and generate the transmitted signals for RRHs, computing resource provision in the cloud plays a key role, and its model follows that in [27]. Specifically, the computing resource consumed by coding and modulation for UE is given by
(5) 
where is the data rate of UE . Meanwhile, the computing resource consumed by calculating the transmit signal for UE depends on the number of nonzero elements in its network-wide precoding vector , which is modeled as
(6) 
Then, the computing resource consumption for the whole system is calculated as
(7) 
Note that only the UEs accessing RRHs consume computing resource; hence, the required computing resource may decrease by serving UEs locally via D2D communication, which further allows turning off some processors to save energy. Moreover, it should be highlighted that additional constant terms can be added to (7) to account for the computing resource consumed by other baseband operations, which has no impact on our proposal.
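To make the structure of (5)-(7) concrete, the following sketch totals the cloud computing load under a hypothetical linear model; the exact expressions and coefficients from [27] are not reproduced here, so `alpha`, `beta`, and the sample values are illustrative assumptions.

```python
import numpy as np

def compute_load(rates, precoders, alpha=1.0, beta=0.5):
    """Total cloud computing load (MOPTS) under a hypothetical linear model:
    a coding/modulation term that grows with each served UE's data rate, plus
    a transmit-signal term that grows with the number of nonzero entries in
    the UE's network-wide precoding vector. alpha and beta are assumed
    coefficients, not values from the paper."""
    total = 0.0
    for rate, w in zip(rates, precoders):
        total += alpha * rate                   # coding/modulation term, cf. (5)
        total += beta * np.count_nonzero(w)     # precoding computation term, cf. (6)
    return total

# Only UEs served by RRHs appear here; UEs served via D2D contribute nothing,
# which is what allows some cloud processors to be switched off.
rates = [2.0, 1.5]
precoders = [np.array([0.3, 0.0, 0.1]), np.array([0.0, 0.2, 0.0])]
load = compute_load(rates, precoders)
```

The sparser the precoding vectors become (see the ℓ0-norm approximation in Section IV), the smaller the second term, which directly couples the precoding design to the processor on-off decisions.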
II-C. Caching Model
We define the cache state at a D2D transmitter to be True only when the requested content is cached at the D2D transmitter and the transmitter's battery level is high enough that the holder is willing to share contents; the cache state is False otherwise. Note that the cache state at each D2D transmitter can be highly dynamic for the following reasons. First, although UEs are paired based on their social ties, this does not imply that the content requested by a UE must be cached by its partner, whose cached contents can be frequently updated by the device holder based on its own interest. Second, the UE battery level changes dynamically with time, and the user content requests are time-varying.
To characterize the dynamics of cache states, a Markov process is adopted as per [23], with the probability transition matrix given by
(8) 
where denotes the transition probability of the cache state at a D2D transmitter from True to False.
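The two-state cache dynamics above can be simulated directly; the transition probabilities `p_tf` and `p_ft` below are illustrative stand-ins for the paper's (elided) matrix entries.

```python
import random

def step_cache_state(state, p_tf, p_ft, rng):
    """Advance one cache state (True/False) through the two-state Markov chain.
    p_tf: probability of a True -> False transition; p_ft: probability of a
    False -> True transition (illustrative names; the paper's symbols were
    not recoverable)."""
    if state:
        return rng.random() >= p_tf   # stays True with probability 1 - p_tf
    return rng.random() < p_ft        # becomes True with probability p_ft

# The long-run fraction of time the cache state is True approaches
# p_ft / (p_tf + p_ft) for this chain.
rng = random.Random(0)
state, hits, steps = True, 0, 50000
for _ in range(steps):
    state = step_cache_state(state, p_tf=0.3, p_ft=0.2, rng=rng)
    hits += state
frac = hits / steps   # near 0.2 / 0.5 = 0.4
```

The stationary probability of the True state is what Section V later interprets as the caching service capability of a paired D2D transmitter.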
II-D. Energy Consumption Model
Overall, the energy consumption in the considered downlink FRAN includes the energy consumed by the running processors, fronthaul transmission, and wireless transmission. First, according to [30], the energy consumed by a processor in Watts is given by
(9) 
where is a parameter depending on the structure of the processor. Second, the wireless transmission power for UE is as follows.
(10) 
which is the transmission power when UE is either served by RRHs or served by its paired D2D transmitter, where and are the efficiencies of the power amplifier at each RRH and at each UE, respectively [30]. Third, the fronthaul energy consumption corresponding to UE is simply modeled as
(11) 
with a constant representing the energy consumption for delivering the processed signal of UE to its associated RRHs via fronthaul [16]. Then, the energy consumption of the whole system is given by
(12) 
It should be noted that modeling the caching state with a Markov process motivates the adoption of the Markov decision process (MDP) to formulate our problem. In addition, since our aim is to achieve a green FRAN under user QoS and computing resource constraints, the reward setting of the MDP will be closely related to the data rate, computing, and energy consumption models.
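As a sanity check of the energy model in (9)-(12), the following sketch sums the three components for a toy configuration; the amplifier efficiencies and the per-UE fronthaul constant are placeholder values, not the paper's.

```python
def system_power(proc_on, proc_power, modes, p_rrh_tx, p_d2d_tx,
                 eta_rrh=0.25, eta_ue=0.25, p_fh=5.0):
    """Total system power: running processors + wireless transmission
    (transmit power divided by the amplifier efficiency) + a per-UE
    fronthaul constant for UEs served by RRHs. modes[k] is 'cran' or 'd2d';
    eta_rrh, eta_ue, and p_fh are assumed placeholder values."""
    total = sum(p for on, p in zip(proc_on, proc_power) if on)
    for k, mode in enumerate(modes):
        if mode == 'cran':
            total += p_rrh_tx[k] / eta_rrh + p_fh   # RRH transmission + fronthaul
        else:
            total += p_d2d_tx[k] / eta_ue           # local D2D delivery
    return total

# One processor on, one UE in CRAN mode, one UE served locally via D2D.
p = system_power(proc_on=[True, False], proc_power=[21.6, 6.4],
                 modes=['cran', 'd2d'], p_rrh_tx=[0.5, 0.0], p_d2d_tx=[0.0, 0.1])
```

The negative of this quantity is exactly what the MDP formulation in Section III uses as the immediate reward.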
III. Problem Formulation and Decoupling
In this section, an optimization problem aiming at minimizing the energy consumption in a downlink FRAN is formulated from an MDP perspective. Specifically, the problem is decoupled into a joint control problem of processors' on-off states and UEs' communication modes and a precoding design problem under the computing resource constraint.
III-A. The Basics of MDP
From the energy consumption model, it can be seen that the cache state for each UE plays a key role in the following ways. On one hand, the cache state influences the set of UEs that can potentially exploit D2D communication, which directly affects the energy consumption incurred by fronthaul and wireless transmission. On the other hand, since UEs served by local caching no longer consume computing resource in the cloud, there is a chance to turn off some processors to save energy. Faced with the dynamics of the cache state for each UE pair, it is natural to formulate the energy minimization problem from an MDP perspective.
MDP provides a formalism for reasoning about planning and acting in the face of uncertainty, which can be defined using a tuple . is the set of possible states, is the set of available actions, gives the transition probabilities to each state if action is taken in state , and is the reward function. The process of an MDP is described as follows. At an initial state , the agent takes an action . Then the state of the system transits to the next state according to the transition probabilities , and the agent receives a reward . As the process continues, a state sequence is generated. The agent in an MDP aims to maximize a discounted accumulative reward when starting in state , which is called the state-value function and defined as
(13) 
where is the reward received at decision step , is a discount factor adjusting the effect of future rewards on current decisions, and the policy is a mapping from a state to a probability distribution over the actions that the agent can take in that state. The optimal state-value function is given by
(14) 
Then, if is available, the optimal policy is determined as
(15) 
where is the expected reward of taking action at state . To calculate , the value-iteration algorithm can be adopted. However, since the transition probabilities are not easy to acquire in many practical problems, reinforcement learning algorithms, especially Q-learning, are widely adopted to handle MDP problems, for which explicit transition probabilities and the reward function are not essential [31].
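For a small MDP with known transition probabilities, the value-iteration algorithm mentioned above can be sketched as follows; the two-state example is purely illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP. P[a] is the |S| x |S| transition
    matrix of action a and R[a] the expected reward vector; both are toy
    stand-ins for the elided notation in the text."""
    V = np.zeros(P[0].shape[0])
    while True:
        # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # optimal values and greedy policy
        V = V_new

# Two states, two actions: action 1 reliably steers the system to the
# rewarding state 1, so the optimal policy picks it everywhere.
P = [np.array([[0.9, 0.1], [0.9, 0.1]]),   # action 0: mostly end up in state 0
     np.array([[0.1, 0.9], [0.1, 0.9]])]   # action 1: mostly end up in state 1
R = [np.array([0.0, 1.0]), np.array([0.0, 1.0])]
V, policy = value_iteration(P, R)
```

In our setting the cache-state transition probabilities are not known to the controller, which is exactly why the text turns to Q-learning and DRL instead.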
In Q-learning, the Q function is defined as
(16) 
which is the expected accumulative reward when starting from state with action and then following policy . Similarly, we can define the optimal Q function as
(17) 
Q-learning is guaranteed to converge to the optimal Q values under certain conditions [31], and it is executed iteratively according to
(18) 
where is the learning rate. Once the optimal Q value for each state-action pair is obtained, the optimal policy can be determined as
(19) 
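As a concrete illustration of the update rule (18) and the greedy policy (19), a minimal tabular Q-learning loop on a toy problem might look like this:

```python
import random

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One step of (18): Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

# Toy problem: from state 0, action 1 earns reward 1 and the episode ends in
# the terminal state 1 (whose Q values stay at 0).
Q = [[0.0, 0.0], [0.0, 0.0]]
rng = random.Random(0)
for _ in range(500):
    a = rng.randrange(2)                  # pure exploration for this toy run
    r = 1.0 if a == 1 else 0.0
    q_update(Q, s=0, a=a, r=r, s_next=1)
greedy_action = max(range(2), key=lambda a: Q[0][a])   # the policy of (19)
```

The tabular storage shown here is precisely the limitation discussed in Section IV-B, where the table is replaced by a DQN.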
III-B. Problem Formulation
In this paper, the energy minimization problem for the considered downlink FRAN is formulated as an MDP problem, where the controller in the cloud tries to minimize the long-term system energy consumption by controlling the on-off states of processors and the communication mode of each UE, and by optimizing the precoding vectors for RRH transmission. More formally, our MDP problem is defined as follows.

State space: The state space is defined as a set of tuples . is a vector representing the current on-off states of all the processors, where the th element is . is a vector representing the current communication modes of all the UEs, where the th element is . is a vector consisting of the cache state at each D2D transmitter.

Action space: The action space is defined as a set of tuples . indicates turning a certain processor on or off, while indicates changing the communication mode of a certain UE. Note that the network controller controls the on-off state of only one processor and the communication mode of only one UE at each step to reduce the number of actions [25]. Moreover, the precoding design for RRH transmission is handled separately for the same reason [25].

Reward: The immediate reward is taken as the negative of the system energy consumption, which is the sum of the energy consumption incurred by the running processors, fronthaul transmission, and wireless transmission, as defined in (12). Hence, after communication mode and processor state control, followed by the precoding optimization for UEs in CRAN mode, the reward can be fully determined.
Note that due to the cache state dynamics, the state after control can transit to an infeasible state. In summary, three situations should be properly handled. The first one is that the controller selects D2D mode for a UE, but the cache state at its paired UE after transition is False. The second one is that the QoS of a UE in the CRAN mode is not met due to too many sleeping processors, and the third one is that the QoS of a UE in D2D mode is unsatisfied because of the strong interference among active D2D links. To deal with these situations and always guarantee the QoS of UEs, protecting operations will be performed. Specifically, the UE with QoS violation will inform the HPN over the control channel, and then the HPN sends protecting operation information to the controller that reactivates all the processors and switches each UE in D2D mode with QoS violation to RRHs. In addition, the precoding for UEs in CRAN mode will be reoptimized.
Motivated by the recent advances in artificial intelligence, DRL is utilized to control the on-off states of processors and the communication modes of UEs. The details of the DRL based approach will be introduced in the next section. After the controller takes an action using DRL, the precoding for RRH transmission is optimized, which is formulated as the following optimization problem:
(20) 
where is the network-wide precoding vector for UE , the first constraint meets the QoS demand of each UE, the second constraint is the transmission power constraint of each RRH, and the last constraint is the computing resource constraint in the cloud. Note that once the controller takes an action, the parameters and will be determined.
IV. DRL Based Mode Selection and Resource Management
In this section, the precoding optimization given the processors' on-off states and the communication modes of UEs is handled first, and then a DRL based algorithm is proposed to control the network in the face of the dynamics of caching states at D2D transmitters and the complex radio environment.
IV-A. Precoding Design with the Computing Resource Constraint
For problem (20), the main difficulty lies in the nonconvex constraint () as well as the ℓ0 norm and the sum rate in constraint (). Fortunately, the QoS constraint () can be transformed into a second-order cone constraint by a phase rotation of the precoding [30]. Moreover, the ℓ0-norm term in constraint () can be approximated by a reweighted ℓ1 norm as per [27] and [32]. Then, inspired by the proposal in [32], problem (20) can be solved iteratively, and the problem for each iteration is as follows:
(21) 
where is the channel vector from all the RRHs to UE , is the precoding of the th antenna of RRH for UE , is the data rate of UE calculated from the precoding output by the last iteration, and is the reweighted ℓ1-norm approximation of the ℓ0-norm term in constraint () of problem (20). is updated as
(22) 
with the precoding calculated by the last iteration and a small enough parameter. Note that problem (21) is a convex optimization problem that can be efficiently solved by CVX [33], and the proof of the convexity is given by the following proposition.
Proposition 1.
Problem (21) is a convex optimization problem.
Proof.
First, it has been shown in [34] that constraints () and () are convex, and meanwhile, the objective function as well as constraints () and () are also convex according to [35]. Constraint () can be reformulated as the following inequality:
(23)
where the right side is a constant and the left side is a convex reweighted ℓ1 norm [36]. Hence, it can be concluded that problem (21) is convex. ∎
The complete procedure is listed in Algorithm 1. First, the precoding is initialized, which can be obtained by solving a relaxed version of problem (20) without the computing resource constraint. Then, and the weight can be calculated, based on which problem (21) can be solved. At the end of each iteration, sufficiently small precoding entries are set to 0, and one possible criterion is comparing the value of with [16].
By updating iteratively, the above algorithm gradually sets the precoding to zero for UE-antenna links with low transmit power [32]. Meanwhile, utilizing the interior point method, the per-iteration complexity of the above algorithm is .
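The reweighted ℓ1 mechanics behind Algorithm 1 can be sketched as follows: the weight update (22) penalizes small-magnitude precoding entries, and the end-of-iteration cleanup fixes them to zero. The threshold value is an illustrative assumption.

```python
import numpy as np

def update_weights(w, eps=1e-6):
    """Reweighting step in the spirit of (22): small-magnitude precoding
    entries receive large weights, which pushes them toward exact zero in
    the next convex iteration. eps is a small regularizer."""
    return 1.0 / (np.abs(w) + eps)

def threshold_small(w, thresh=1e-3):
    """End-of-iteration cleanup: entries whose magnitude falls below the
    (illustrative) threshold are fixed to 0, mirroring the hard-zeroing
    step described in the text."""
    w = w.copy()
    w[np.abs(w) < thresh] = 0.0
    return w

w = np.array([0.8, 1e-5, 0.05, 2e-4])   # toy per-antenna precoding magnitudes
rho = update_weights(w)                  # rho[1] and rho[3] dominate
w_clean = threshold_small(w)             # [0.8, 0.0, 0.05, 0.0]
```

Each zeroed entry removes one term from the ℓ0-dependent computing cost in (6), which is how the iteration trades transmit power against cloud computing resource.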
IV-B. DRL Based Mode Selection and Resource Management
After the precoding design under fixed UE communication modes and processors' on-off states is handled, the remaining task is to find a way of reaching a good policy for the MDP formulated in Subsection III-B. As introduced before, Q-learning is a widely adopted algorithm in wireless network research for network control without knowing the transition probabilities and reward function in advance. However, Q-learning has three characteristics that can limit its application in future wireless networks. First, traditional Q-learning stores the Q values in a tabular form. Second, to achieve the optimal policy, Q-learning needs to revisit each state-action pair infinitely often [31]. Third, the state for Q-learning is often manually defined, as in [37]. These three characteristics make Q-learning impractical for large system state and action spaces [38]. In contrast, DRL, proposed in [13], can overcome these problems and has the potential to achieve better performance owing to the following facts. First, DRL uses the DQN to store learned Q values in the form of connection weights between different layers. Second, with the help of the replay memory and the generalization capability brought by NNs, DRL can achieve good performance with fewer interactions with complex environments. Third, with the DQN, DRL can directly learn the representation from high-dimensional raw network data, and hence manual input design is avoided.
Considering these benefits, the controller uses DRL to learn the control policy of UE communication modes and processors' on-off states by interacting with the dynamic environment, so as to minimize the discounted and accumulated system power consumption over the decision steps, that is, to maximize the long-term reward. The training procedure of the DRL is shown in Algorithm 2. Specifically, given the current system state composed of UE communication modes, cache states at D2D transmitters, and processor states, the controller takes this state as the input of the DQN to output the Q values corresponding to each action. Then, an action is selected based on the ε-greedy scheme, and the operational states of a certain processor and a certain UE are changed if needed. Afterward, the controller optimizes the precoding using Algorithm 1, and the cache state at each D2D transmitter transits according to the transition matrix. Once any QoS violation information from UEs is received by the HPN, the HPN helps those UEs in D2D mode with unsatisfied QoS access the CRAN, and the controller activates all the processors. Next, this interaction, containing the state transition, the action, and the negative of the system power consumption (the reward), is stored in the replay memory of the controller. After several interactions, the controller updates the DQN by training over a batch of interaction data randomly sampled from the replay memory, intending to minimize the mean-squared error between the target Q values and the Q values predicted by the DQN. In addition, at a longer period, the controller copies the weights of the DQN to the target DQN.
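The interplay of ε-greedy action selection, replay memory, minibatch updates, and the periodically synchronized target network in Algorithm 2 can be sketched with a linear Q approximator standing in for the DQN; all dimensions and hyperparameters below are illustrative, not the paper's settings.

```python
import random
from collections import deque
import numpy as np

class ReplayDQN:
    """Skeleton of the training loop: epsilon-greedy action selection, a
    replay memory, minibatch updates of a (here, linear) Q approximator,
    and a slower-updated target copy."""
    def __init__(self, state_dim, n_actions, capacity=5000, gamma=0.99, lr=0.01):
        self.W = np.zeros((n_actions, state_dim))   # online "DQN" weights
        self.W_target = self.W.copy()               # target network weights
        self.memory = deque(maxlen=capacity)
        self.gamma, self.lr, self.n_actions = gamma, lr, n_actions

    def act(self, s, eps, rng):
        if rng.random() < eps:                      # explore
            return rng.randrange(self.n_actions)
        return int(np.argmax(self.W @ s))           # exploit

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def train_batch(self, batch_size, rng):
        batch = rng.sample(list(self.memory), min(batch_size, len(self.memory)))
        for s, a, r, s_next in batch:
            target = r + self.gamma * np.max(self.W_target @ s_next)
            td_err = target - self.W[a] @ s
            self.W[a] += self.lr * td_err * s       # SGD on the squared TD error

    def sync_target(self):
        self.W_target = self.W.copy()

# Toy interaction loop: action 1 always earns reward 1, so after training the
# greedy action in the single state should be 1.
s = np.array([1.0])
rng = random.Random(0)
agent = ReplayDQN(state_dim=1, n_actions=2)
for t in range(300):
    a = agent.act(s, eps=0.5, rng=rng)
    r = 1.0 if a == 1 else 0.0
    agent.store(s, a, r, s)
    agent.train_batch(batch_size=16, rng=rng)
    if t % 50 == 0:
        agent.sync_target()
```

In Algorithm 2 the linear map would be the dense DQN described in Section V, the state would concatenate UE modes, cache states, and processor states, and the reward would be the negative system power from (12).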
In addition to the proposal in [13], researchers have subsequently made several enhancements to DRL. To more effectively reuse the experienced transitions in the replay memory, prioritized replay has been proposed. Moreover, double DRL has been introduced to overcome the optimistic Q-value estimation involved in the calculation of the target value, while dueling DRL has been proposed to learn effectively in situations where the state value deserves more attention. Furthermore, DRL with the deep deterministic policy gradient has been introduced to address continuous control problems. All these new DRL approaches take advantage of the ideas of replay memory and target DQN in [13], and their details can be found in [38]. Although only the proposal in [13] is adopted for our communication mode selection and resource management in this paper, these advances can be utilized as well without affecting the core idea and the main conclusions of the paper.

V. Simulation Results and Analysis
The simulation scenario is illustrated in Fig. 2, where the distance between each pair of RRHs is 800 m, and four UEs are randomly distributed within a disk of radius 100 m whose center coincides with that of the RRHs. Each UE has a corresponding potential D2D transmitter that is randomly located within 20 m of the UE. Each RRH is equipped with two antennas, and each UE is equipped with one antenna. The channel coefficient of each UE-antenna link consists of the distance-dependent fading modeled by , shadow fading of 8 dB, and small-scale fading modeled by , while the channel coefficients among UEs are only related to distance. The maximum transmission power of each RRH is set to 1.5 W, and the constant transmission power of each D2D transmitter is set to 100 mW. The QoS requirement of each UE is 5 dB. There are six processors with heterogeneous power consumptions and computing capabilities. The power consumptions of these six processors are 21.6 W, 6.4 W, 5 W, 8 W, 12.5 W, and 12.5 W, and their corresponding computing capabilities are 6 MOPTS, 4 MOPTS, 1 MOPTS, 2 MOPTS, 5 MOPTS, and 5 MOPTS. It is assumed that for UE , where can be interpreted as the caching service capability of UE 's paired D2D transmitter. The adopted DQN is a dense NN consisting of an input layer, two hidden layers, and an output layer. The number of neurons in the input layer is 14, while that in the output layer is 96. There are 24 neurons in each hidden layer, and ReLU is used as the activation function. All other parameters in the simulation are listed in Table
I.

Parameter                                                                    Value
The learning rate of the Adam optimizer                                      0.0001
RRH power efficiency
The capacity of the replay memory                                            5000
UE power efficiency
The number of steps to update the target DQN                                 480
Discount factor                                                              0.99
The number of steps to update the DQN                                        3
Noise power                                                                  W
The number of steps for linearly annealing from 1 to 0.01                    3000
Fronthaul transmission power for each UE                                     5 W
Batch size for each DQN update                                               32
The initial steps to populate the replay memory by random action selection   1000
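The exploration schedule in Table I, linear annealing from 1 to 0.01 over 3000 steps, can be sketched as follows (a minimal Python illustration; the function and variable names are ours):

```python
EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.01, 3000  # values from Table I

def epsilon(step):
    """Exploration rate: linearly annealed, then held at EPS_END."""
    frac = min(step / ANNEAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

# With probability epsilon(step) a random action is taken; otherwise
# the action maximizing the DQN output is chosen.
```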
V-A The Impacts of Learning Parameters
In this subsection, we investigate the impacts of the learning rate and the batch size on the performance of our proposal by training the DRL model for 32000 epochs. The initial state for each epoch in this section is that all the UEs operate in CRAN mode with all processors turned on, and the cache state at each D2D transmitter is False. In Fig. 3, the discounted accumulative system power consumption is evaluated under different batch sizes with , . It can be seen that a batch size of 32 performs best; a possible explanation is as follows. With a small batch size, the gradient is only a rough approximation of the true gradient, and hence a long time may be needed to achieve a good policy. On the contrary, if the batch size is too large, although the calculated gradient is more accurate, the learning process may become trapped in a local optimum. Under a batch size of 32, a simulation is conducted to select an appropriate learning rate, as shown in Fig. 4. It can be observed that with a learning rate of 0.00001, which is too small, the learning process of DRL is slow, while a larger learning rate of 0.001 results in a local optimum. Hence, we set the learning rate to 0.0001.
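As context for these parameter choices, the core training step of [13], mini-batches sampled uniformly from the replay memory and temporal-difference targets computed with a periodically synchronized target DQN, can be sketched as follows. This is a minimal numpy illustration in which a hypothetical linear map stands in for the dense DQN; it is not the simulation code itself:

```python
import random
from collections import deque

import numpy as np

STATE_DIM, N_ACTIONS, GAMMA = 14, 96, 0.99  # sizes from the simulation setup

class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s_next) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    def __len__(self):
        return len(self.buffer)

def q_values(weights, state):
    # Toy linear "Q-network" standing in for the dense DQN.
    return state @ weights

def td_targets(target_weights, batch):
    """DQN targets y = r + gamma * max_a Q_target(s', a) for a mini-batch."""
    return np.array([
        r + GAMMA * np.max(q_values(target_weights, s_next))
        for (s, a, r, s_next) in batch
    ])

rng = np.random.default_rng(0)
weights = rng.normal(size=(STATE_DIM, N_ACTIONS))
target_weights = weights.copy()           # target DQN starts as a copy

memory = ReplayMemory(capacity=5000)      # capacity from Table I
for _ in range(1000):                     # initial random-action transitions
    s, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
    memory.push((s, int(rng.integers(N_ACTIONS)), -float(rng.uniform(5, 30)), s_next))

batch = memory.sample(32)                 # batch size from Table I
y = td_targets(target_weights, batch)     # a gradient step on `weights` toward y follows
target_weights = weights.copy()           # periodic target-DQN synchronization
```

Here the reward is taken as the negative system power consumption, so maximizing return corresponds to minimizing long-term power.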
V-B The Impact of Edge Caching Service Capability
To demonstrate the influence of edge caching on system performance, we let , , and vary the value of . Fig. 5 shows the evolution of the long-term system power consumption under different . It can be seen that a smaller leads to higher system power consumption. This is because when the edge caches have poorer service capability, more UEs need to be served by RRHs, which causes larger processor and fronthaul power consumption. In addition, Fig. 6 intuitively shows the expectation of the long-term system performance. The expected performance for each is estimated by using the corresponding model trained in Fig. 5 to perform tests over 10000 epochs and then taking the average.
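The expectation plotted in Fig. 6 is a Monte Carlo estimate: each test epoch yields a discounted accumulative power consumption, and the average over 10000 epochs approximates the expectation. A schematic numpy sketch, with a hypothetical random power trace standing in for the trained policy running in the simulator:

```python
import numpy as np

GAMMA, N_EPOCHS, STEPS = 0.99, 10000, 50   # 10000 test epochs as in Fig. 6
rng = np.random.default_rng(1)

def discounted_sum(powers, gamma=GAMMA):
    """Discounted accumulative power: sum_t gamma^t * P_t over one epoch."""
    return float(np.dot(gamma ** np.arange(len(powers)), powers))

# Hypothetical per-step system power traces (W); in the actual evaluation
# these come from executing the trained DRL policy in the simulator.
epoch_returns = [discounted_sum(rng.uniform(5.0, 30.0, size=STEPS))
                 for _ in range(N_EPOCHS)]
expected_long_term_power = float(np.mean(epoch_returns))
```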
V-C The Effectiveness of Integrating Transfer Learning
To help the DRL model quickly adapt to a new environment in which the cache state transition matrix at each D2D transmitter changes, transfer learning can be adopted; it is expected to accelerate the learning process by transferring the knowledge learned in a source task to a different but similar task. Since the learned knowledge of the DRL is stored in the form of the connection weights of the DQN, we propose to initialize a new DRL model with the weights of a well-trained one to avoid training from scratch. To verify this idea, the weights of the DRL model trained with , , are used for the weight initialization of the DRL models to be trained in two different environments with , , and , , respectively. The results shown in Fig. 7 and Fig. 8 indicate that transfer learning can effectively help DRL achieve performance similar to that achieved by training from scratch but with much less training time. Nevertheless, transfer learning can lead to negative guidance on the target task when the similarity between the source task and the target task is low [39].
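Because the learned knowledge is held entirely in the DQN connection weights, the proposed transfer reduces to weight initialization: copy the source model's trained weights into the new model before it interacts with the changed environment. A minimal sketch with hypothetical weight matrices matching the layer sizes of the adopted DQN (14-24-24-96); a framework implementation would copy the corresponding layer tensors instead:

```python
import numpy as np

LAYER_SHAPES = [(14, 24), (24, 24), (24, 96)]  # dense DQN from the setup
rng = np.random.default_rng(2)

def fresh_weights(rng):
    """Randomly initialized weights, as used when training from scratch."""
    return [rng.normal(scale=0.1, size=shape) for shape in LAYER_SHAPES]

source_model = fresh_weights(rng)   # stands in for the well-trained model
target_model = fresh_weights(rng)   # new model for the changed cache dynamics

# Transfer learning step: start from the source weights rather than
# from the random initialization above.
target_model = [w.copy() for w in source_model]
```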
V-D Performance with Other Baselines
To verify the superiority of our proposal, the following baseline schemes are adopted in our simulation study:

D2D mode always: In this scheme, the controller progressively makes all UEs operate in D2D mode and turns off all the processors.

DRL based, CRAN mode only: In this scheme, all the UEs operate in CRAN mode, and the controller uses DRL to control the on-off states of the processors only.

Q-learning based control: In this scheme, the controller controls the UE communication modes and the processors' states using iterative Q-learning based on equation (18).

Random control: In this scheme, the controller selects each action with equal probability.
Note that in the above comparison baselines, after communication mode selection and processor state control are finished, precoding is optimized using Algorithm 1 if needed, and the protecting operation still applies to guarantee the QoS of UEs. The comparison result is illustrated in Fig. 9, where a more general heterogeneous caching service capability at each D2D transmitter is considered. Specifically, we set , , , and . It can be found that our proposal performs the best, which shows its effectiveness for network control in a dynamic and complex wireless environment. Specifically, due to the cache state dynamics and the interference among active D2D links, the D2D mode always scheme leads to more frequent D2D communication failures than our proposal and hence more frequent protecting operations. As for the DRL based, CRAN mode only scheme, although it does not suffer from the dynamic environment since all the UEs access RRHs, delivering all the traffic via RRH transmission induces high fronthaul and processor power consumption. Moreover, compared with Q-learning, since the replay memory helps DRL review historical interactions and the DQN can generalize learned knowledge to new situations, our proposal achieves better performance with the same number of interactions with the environment.
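For contrast with the DQN, the Q-learning baseline maintains an explicit table with one entry per state-action pair and updates a single entry after each interaction. A generic sketch of the standard tabular update (the concrete form used is equation (18) in the main text; the state/action sizes and learning rate here are placeholders):

```python
import numpy as np

N_STATES, N_ACTIONS = 8, 4        # placeholder sizes
ALPHA, GAMMA = 0.1, 0.99          # placeholder learning rate; discount from Table I

Q = np.zeros((N_STATES, N_ACTIONS))

def q_learning_update(Q, s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])
    return Q

# One interaction: reward is the negative system power consumption.
Q = q_learning_update(Q, s=0, a=1, r=-10.0, s_next=2)
```

Unlike the DQN, this table cannot generalize across unseen states, which is consistent with the performance gap observed in Fig. 9.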
VI Conclusion
In this article, a deep reinforcement learning (DRL) based approach has been developed for a fog radio access network (FRAN) to minimize the long-term system power consumption under the dynamics of edge caching states. Specifically, given the current system state, the network controller can make a quick and intelligent decision on the user equipment (UE) communication modes and the processors' on-off states using the well-trained DRL model, and the precoding for UEs in cloud RAN mode is then optimized based on an iterative algorithm under per-UE quality of service constraints, per-RRH transmission power constraints, and the computing capability constraint in the cloud. Simulations have shown the impacts of the learning rate and the batch size. Moreover, the impact of the edge caching service capability on system power consumption has been demonstrated, and the superiority of the DRL based approach over the baselines is significant. Finally, transfer learning has been integrated with DRL, which reaches performance similar to the case without transfer learning but requires far fewer interactions with the environment. In the future, it would be interesting to incorporate power control of device-to-device UEs, subchannel allocation, and fronthaul resource allocation into DRL based resource management to achieve better FRAN performance and make the system model more practical.
References
 [1] M. Peng, S. Yan, K. Zhang, and C. Wang, “Fog computing based radio access networks: Issues and challenges,” IEEE Netw., vol. 30, no. 4, pp. 46–53, Jul. 2016.
 [2] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
 [3] P. Verma and S. Sood, “Fog assisted-IoT enabled patient health monitoring in smart homes,” IEEE Internet Things J., vol. 5, no. 3, pp. 1789–1796, Jun. 2018.
 [4] A. Yousefpour, G. Ishigaki, R. Gour, and J. P. Jue, “On reducing IoT service delay via fog offloading,” IEEE Internet Things J., vol. 5, no. 2, pp. 998–1010, Apr. 2018.
 [5] J. He et al., “Multitier fog computing with large-scale IoT data analytics for smart cities,” IEEE Internet Things J., vol. 5, no. 2, pp. 677–686, Apr. 2018.
 [6] J. Liu, M. Sheng, T. Q. S. Quek, and J. Li, “D2D enhanced coordinated multipoint in cloud radio access networks,” IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 4248–4262, Jun. 2016.
 [7] M. Tao, E. Chen, H. Zhou, and W. Yu, “Content-centric sparse multicast beamforming for cache-enabled cloud RAN,” IEEE Trans. Wireless Commun., vol. 15, no. 9, pp. 6118–6131, Sep. 2016.
 [8] J. Kang, O. Simeone, J. Kang, and S. Shamai, “Joint optimization of cloud and edge processing for fog radio access networks,” IEEE Trans. Wireless Commun., vol. 15, no. 11, pp. 7621–7632, Nov. 2016.
 [9] M. A. Maddah-Ali and U. Niesen, “Cache-aided interference channels,” in Proceedings of ISIT, Hong Kong, China, Jun. 2015, pp. 809–813.
 [10] G. Yu et al., “Joint mode selection and resource allocation for device-to-device communications,” IEEE Trans. Commun., vol. 62, no. 11, pp. 3814–3824, Nov. 2014.
 [11] X. Wang, X. Wang, and S. Mao, “RF sensing for Internet of Things: A general deep learning framework,” IEEE Communications, to appear.
 [12] X. Wang, L. Gao, S. Mao, and S. Pandey, “CSIbased fingerprinting for indoor localization: A deep learning approach,” IEEE Trans. Veh. Tech., vol. 66, no. 1, pp. 763–776, Jan. 2017.
 [13] V. Mnih et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
 [14] H. Sun et al., “Learning to optimize: Training deep neural networks for wireless resource management,” arXiv:1705.09412v2, Oct. 2017, accessed on Apr. 15, 2018.
 [15] K. Chavez, H. Ong, and A. Hong, “Distributed deep Q-learning,” arXiv:1508.04186, Oct. 2015, accessed on Apr. 15, 2018.
 [16] D. Chen and V. Kuehn, “Adaptive radio unit selection and load balancing in the downlink of fog radio access network,” in Proceedings of GLOBECOM, Washington, DC, USA, Dec. 2016, pp. 1–7.
 [17] X. Peng, Y. Shi, J. Zhang, and K. B. Letaief, “Layered group sparse beamforming for cache-enabled green wireless networks,” IEEE Trans. Commun., vol. 65, no. 12, pp. 5589–5603, Dec. 2017.
 [18] X. Yang, J. Zheng, Z. Fei, and B. Li, “Optimal file dissemination and beamforming for cache-enabled CRANs,” IEEE Access, vol. 6, pp. 6390–6399, Nov. 2017.
 [19] M. Mukherjee, L. Shu, and D. Wang, “Survey of fog computing: Fundamental, network applications, and research challenges,” IEEE Commun. Surveys Tuts., Mar. 2018, doi: 10.1109/COMST.2018.2814571, submitted for publication.
 [20] H. Xiang, M. Peng, Y. Cheng, and H. Chen, “Joint mode selection and resource allocation for downlink fog radio access networks supported D2D,” in Proceedings of QSHINE, Taipei, China, Aug. 2015, pp. 177–182.
 [21] H. Zhou, Y. Ji, J. Li, and B. Zhao, “Joint mode selection, MCS assignment, resource allocation and power control for D2D communication underlaying cellular networks,” in Proceedings of WCNC, Istanbul, Turkey, Apr. 2014, pp. 1667–1672.
 [22] S. Yan, M. Peng, M. Abana, and W. Wang, “An evolutionary game for user access mode selection in fog radio access networks,” IEEE Access, vol. 5, pp. 2200–2210, Jan. 2017.
 [23] Y. He et al., “Deep reinforcement learning-based optimization for cache-enabled opportunistic interference alignment wireless networks,” IEEE Trans. Veh. Tech., vol. 66, no. 11, pp. 10433–10445, Nov. 2017.
 [24] G. Cao, Z. Lu, X. Wen, T. Lei, and Z. Hu, “AIF: An artificial intelligence framework for smart wireless network management,” IEEE Commun. Lett., vol. 22, no. 2, pp. 400–403, Feb. 2018.
 [25] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. Gursoy, “A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs,” in Proceedings of ICC, Paris, France, May 2017, pp. 1–6.
 [26] C. Zhong, M. Gursoy, and S. Velipasalar, “A deep reinforcement learning-based framework for content caching,” arXiv:1712.08132v1, Dec. 2017, accessed on Apr. 15, 2018.
 [27] Y. Liao, L. Song, Y. Li, and Y. Zhang, “How much computing capability is enough to run a cloud radio access network?” IEEE Commun. Lett., vol. 21, no. 1, pp. 104–107, Jan. 2017.
 [28] D. Wu, L. Zhou, Y. Cai, H. Chao, and Y. Qian, “Physical-social-aware D2D content sharing networks: A provider-demander matching game,” IEEE Trans. Veh. Tech., Apr. 2018, doi: 10.1109/TVT.2018.2825366, submitted for publication.
 [29] M. Peng, Y. Li, J. Jiang, J. Li, and C. Wang, “Heterogeneous cloud radio access networks: A new perspective for enhancing spectral and energy efficiencies,” IEEE Wireless Commun., vol. 21, no. 6, pp. 126–135, Dec. 2014.
 [30] J. Tang, W. P. Tay, T. Q. S. Quek, and B. Liang, “System cost minimization in cloud RAN with limited fronthaul capacity,” IEEE Trans. Wireless Commun., vol. 16, no. 5, pp. 3371–3384, May 2017.
 [31] R. Sutton and A. Barto, Reinforcement learning: An introduction, Cambridge, MA: MIT Press, 1998.
 [32] B. Dai and W. Yu, “Sparse beamforming and usercentric clustering for downlink cloud radio access network,” IEEE Access, vol. 2, pp. 1326–1339, Oct. 2014.
 [33] M. Grant, S. Boyd, and Y. Ye, “CVX: Matlab software for disciplined convex programming,” Jun. 2015. [Online]. Available: http://cvxr.com/cvx/
 [34] A. G. Gotsis and A. Alexiou, “Spatial resources optimization in distributed MIMO networks with limited data sharing,” in Globecom Workshop, Atlanta, USA, Dec. 2013, pp. 789–794.
 [35] Y. Shi, J. Zhang, and K. B. Letaief, “Group sparse beamforming for green cloud radio access networks,” in Proceedings of Globecom, Atlanta, USA, Dec. 2013, pp. 4662–4667.
 [36] E. Candes, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted ℓ1 minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877–905, Oct. 2008.
 [37] M. Simsek et al., “Learning based frequency- and time-domain inter-cell interference coordination in HetNets,” IEEE Trans. Veh. Tech., vol. 64, no. 10, pp. 4589–4602, Oct. 2015.
 [38] Y. Li, “Deep reinforcement learning: An overview,” arXiv:1701.07274v5, Sep. 2017, accessed on Apr. 15, 2018.
 [39] R. Li, Z. Zhao, X. Chen, J. Palicot, and H. Zhang, “TACT: A transfer actorcritic learning framework for energy saving in cellular radio access networks,” IEEE Trans. Wireless Commun., vol. 13, no. 4, pp. 2000–2011, Apr. 2014.