The most important function of mobile edge computing is providing low-delay services for mobile terminals. In [2], we proposed maximizing the system revenues using reinforcement learning in cloud-fog computing systems. However, we did not consider delay constraints or wireless bandwidth resources. In this paper, we solve the system revenue optimization problem with delay constraints using a multi-timescale actor-critic reinforcement learning algorithm. Compared with the original constrained actor-critic algorithm [3], we use neural networks to approximate the parameterized policy and the parameterized state value function, which avoids hand-crafting feature functions. In addition, we use eligibility traces in both the actor and the critic. In the future, we will present experimental results and more technical details.
II System Model
We consider a mobile edge computing system, as shown in Fig. 1. Two computing servers are considered: cloud server and edge server. The edge server is in an access point (AP). The access point connects the cloud server and user terminals through wired and wireless links, respectively. The computing resources are quantified as virtual machines (VMs). The wireless bandwidth is quantified as subchannels. We assume that the cloud server has sufficient computing resources and the number of VMs in the edge server is . We assume that the number of subchannels of the wireless link is . We assume that the user terminals have priority services. Different priority services require different numbers of VMs and subchannels, and different delay constraints.
The considered mobile edge computing system is an event-triggered decision making system. The decision maker is in the AP. When a service request arrives, the decision maker decides whether to accept it, where to process it, and how many VMs and subchannels to be allocated for it. When a service completes, the decision maker spares the VMs and subchannels and transmits the processed data to the user terminal. The objective of the decision making system is to optimize the average system revenues to guarantee the average delay constraints of different priority services.
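The event-triggered admission logic described above can be sketched as follows. This is a minimal illustration only: the names (`Request`, `decide`) and the simple feasibility rules are our own assumptions, not the paper's decision policy, which is learned rather than hand-coded.

```python
from dataclasses import dataclass

@dataclass
class Request:
    priority: int
    vms_needed: int        # VMs required by this priority class
    subch_needed: int      # subchannels required by this priority class

def decide(req, free_vms, free_subchannels):
    """Return where to serve the request: 'edge', 'cloud', or 'reject'."""
    if free_vms >= req.vms_needed and free_subchannels >= req.subch_needed:
        return "edge"      # edge server has both VMs and bandwidth
    if free_subchannels >= req.subch_needed:
        return "cloud"     # cloud VMs are assumed sufficient; only bandwidth matters
    return "reject"        # not enough wireless bandwidth for this request

print(decide(Request(1, 2, 1), free_vms=4, free_subchannels=3))
```

In the paper, this rule is replaced by a learned policy that also chooses how many VMs and subchannels to allocate; the sketch only shows the shape of the decision.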
III Constrained SMDP Formulation
We use a constrained semi-Markov decision process (SMDP) to model the constrained optimization problem mentioned in the previous section. In general, a constrained SMDP can be formulated as a 6-tuple consisting of the decision epochs, the state space, the action space, the transition probability, the immediate reward of the objective, and the immediate cost vector of the constraints. In the rest of this section, we construct the detailed constrained SMDP for the considered problem.
III-A Decision Epoch
As mentioned in the previous section, the mobile edge computing system is event-triggered. We define the decision epoch as the time instant when a service request arrival or departure occurs.
III-B State Space and Action Space
The total number of th priority ongoing services that occupy subchannels and VMs in the edge server is . The number of th priority ongoing services occupying subchannels in the cloud server is . We assume that the maximal number of VMs a service can occupy is , where , and that the maximal number of subchannels a service can occupy is , where . We assume that the cloud server allocates sufficient VMs to each service. and have to satisfy the following constraints:
We define an event . Here, represents the arrival event of the th priority service request, represents the departure event of the service which occupies subchannels and VMs in the edge server, represents the departure event of the service occupying subchannels in the cloud server.
We only consider the state at the decision epoch. The state space is . In our considered constrained SMDP problem, the decision maker makes a decision only when a service request arrives. When a service departs, the decision maker simply releases the bandwidth and computing resources.
The action space is . Here, represents releasing the computing and bandwidth resources when a service departs, represents rejecting a service request, represents an action that the edge server accepts the th priority service request and allocates subchannels and VMs, and represents an action that the cloud server accepts the th priority service request and allocates subchannels.
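The structure of this action set can be enumerated concretely. The caps `N_MAX` (VMs per service) and `M_MAX` (subchannels per service) below are hypothetical values standing in for the paper's per-service limits, whose symbols are not rendered in this version:

```python
# Hypothetical per-service caps (the paper's actual symbols are not shown here).
N_MAX, M_MAX = 3, 2

actions = [("release",), ("reject",)]
# Edge-server acceptance: allocate m subchannels and n VMs.
actions += [("edge", m, n) for m in range(1, M_MAX + 1)
                           for n in range(1, N_MAX + 1)]
# Cloud-server acceptance: allocate m subchannels (cloud VMs assumed sufficient).
actions += [("cloud", m) for m in range(1, M_MAX + 1)]

print(len(actions))  # 2 + M_MAX*N_MAX + M_MAX
```

With these toy caps the action set has 2 + 2·3 + 2 = 10 elements, which illustrates why the soft-max parameterization over action preferences introduced later is convenient for a discrete action space of this form.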
III-C Transition Probability
We define the transition probability as , where is the next state. The state transition probability is
The expected time interval between adjacent decision epochs is defined as , whose formulation is as follows:
III-D Immediate Reward and Constraints
The immediate reward we consider is the net reward, which consists of the reward obtained at the decision epoch and the cost incurred between the last two decision epochs. The immediate reward is formulated as follows:
where and represent the rewards when the service request is accepted by the cloud server and the edge server, respectively. is the penalty of rejecting a service request. is the time interval between the states and . is the loss ratio after taking action at state , as follows:
where and represent the costs of running a VM per unit time in the cloud server and the edge server, respectively.
According to Little's Law, we can use the length of the queue to represent the delay . The immediate constraint of the priority service is as follows:
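As a toy numeric check of Little's Law, L = λW (mean queue length equals arrival rate times mean delay), with made-up numbers chosen only for illustration:

```python
# Little's Law: L = lambda * W, so the mean delay is W = L / lambda.
# Toy numbers (illustrative only): 4 service arrivals per second and a
# mean queue length of 10 imply a mean delay of 2.5 seconds. Bounding the
# average queue length therefore bounds the average delay, which is why
# the constraint can be written on the queue length instead of the delay.
arrival_rate = 4.0     # services per second
mean_queue_len = 10.0  # average number of services in the system
mean_delay = mean_queue_len / arrival_rate
print(mean_delay)  # 2.5
```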
where and represent the weights.
III-E Policy Optimization with Delay Constraints
We define the policy as a map from a state to an action. From an initial state , the objective of this paper is to find an optimal policy to satisfy the following optimization problem:
We denote the stationary probability of state under policy as . In a constrained MDP, the optimal policy is in general randomized. We denote the probability of taking action in state as . Thus, the constrained optimization problem (8) can be formulated as:
IV Actor-Critic Algorithm for Constrained SMDP
We formulate the constrained optimization problem (9) as the following Lagrangian:
where is the vector of Lagrange multipliers, .
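The Lagrangian's display is not rendered in this version; a standard form for such a Lagrangian relaxation of a constrained average-reward problem is sketched below. The symbols here are our own notation, not the paper's: $J(\pi)$ denotes the average objective reward, $C(\pi)$ the vector of average constraint costs, and $\mathbf{c}_{\max}$ the constraint bounds.

```latex
L(\pi, \boldsymbol{\gamma})
  \;=\; J(\pi) \;-\; \boldsymbol{\gamma}^{\top}\bigl( C(\pi) - \mathbf{c}_{\max} \bigr),
  \qquad \boldsymbol{\gamma} \ge \mathbf{0}
```

Maximizing over the policy and minimizing over the multipliers yields the saddle point sought by the optimization: a positive multiplier penalizes a violated delay constraint, driving the learned policy back into the feasible region.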
The objective of the Lagrangian is to find and to satisfy the following expression:
Given the Lagrange multiplier , the optimal policy satisfies the following Bellman equation:
According to the Poisson equation , for any and , the following equation holds:
We use an actor-critic reinforcement learning algorithm to find the optimal policy . First, we parameterize the randomized policy and the state value function as follows:
We use soft-max in action preferences for the policy parameterization. We set the parameterized numerical preferences to . The parameterized policy can be formulated as
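A minimal sketch of such a soft-max policy over action preferences, with a tiny neural network producing the preferences. The network architecture, sizes, and weights below are illustrative assumptions only; the paper's actual networks are not specified in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4)) * 0.1   # hidden-layer weights (part of theta)
W2 = rng.normal(size=(10, 8)) * 0.1  # preference head: one row per action

def policy(state):
    """pi(a|s) = exp(h(s,a)) / sum_b exp(h(s,b)), with h from a small net."""
    hidden = np.tanh(W1 @ state)
    prefs = W2 @ hidden        # numerical preferences h(s, .)
    prefs -= prefs.max()       # subtract the max to stabilize the exponentials
    probs = np.exp(prefs)
    return probs / probs.sum()

p = policy(np.ones(4))
```

The soft-max guarantees a valid probability distribution over the discrete action set for any preference values, which is why it is a natural choice here.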
We use neural networks to represent and .
We use a multi-timescale stochastic approximation (MTSA) algorithm to find the optimal , w, and . We set the multi-timescale step sizes as , , and , satisfying the following conditions:
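The step-size conditions can be illustrated with concrete schedules. The exponents below are our own assumptions, chosen so that each sequence is square-summable but not summable, and so that the ratio of a slower step size to a faster one tends to zero (critic fastest, then actor, then the Lagrange multiplier), which is the usual timescale-separation requirement in MTSA:

```python
# Illustrative multi-timescale step-size schedules (exponents are assumptions).
# Each sequence 1/k^p with 0.5 < p <= 1 satisfies: sum_k a_k = inf (not
# summable) and sum_k a_k^2 < inf (square-summable).
def critic_step(k):      # fastest timescale
    return 1.0 / k**0.6

def actor_step(k):       # intermediate timescale
    return 1.0 / k**0.8

def multiplier_step(k):  # slowest timescale
    return 1.0 / k

k = 10**6
# Timescale separation: slower/faster ratios decay like k**(-0.2) -> 0.
print(actor_step(k) / critic_step(k))       # ~0.063
print(multiplier_step(k) / actor_step(k))   # ~0.063
```

On the fast timescale the critic sees a nearly fixed policy and multiplier; on the slow timescales the actor and multiplier see a nearly converged critic, which is what makes the coupled updates analyzable.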
where is a positive constant.
is the estimate of in the th step. We set the trace-decay rates as and . We initialize the average Lagrange reward as .
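The eligibility-trace updates for both the critic and the actor can be sketched as below. This is a hedged illustration: linear features stand in for the paper's neural networks, and all names, step sizes, and trace-decay values are our own assumptions.

```python
import numpy as np

dim = 4
w = np.zeros(dim)            # critic parameters
theta = np.zeros(dim)        # actor parameters
z_w = np.zeros(dim)          # critic eligibility trace
z_theta = np.zeros(dim)      # actor eligibility trace
lam_w, lam_theta = 0.9, 0.9  # trace-decay rates (assumed values)
alpha, beta = 0.1, 0.01      # critic/actor step sizes (critic is faster)
avg_reward = 0.0             # running estimate of the average reward

def step(phi, phi_next, reward, grad_log_pi):
    """One average-reward actor-critic update with traces in both parts."""
    global w, theta, z_w, z_theta, avg_reward
    delta = reward - avg_reward + phi_next @ w - phi @ w  # TD error
    avg_reward += 0.01 * delta                 # slow average-reward tracking
    z_w = lam_w * z_w + phi                    # accumulate critic trace
    z_theta = lam_theta * z_theta + grad_log_pi  # accumulate actor trace
    w += alpha * delta * z_w                   # critic update along its trace
    theta += beta * delta * z_theta            # actor update along its trace

step(np.ones(dim), np.zeros(dim), 1.0, np.ones(dim))
```

The traces let a single TD error update all recently visited features and recently taken actions, which typically speeds credit assignment in the event-triggered setting where epochs have irregular lengths.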
where and ensure that the policy parameters and Lagrange multipliers remain in the feasible region.
-  Y. C. Hu, M. Patel, D. Sabella, N. Sprecher, and V. Young, “Mobile edge computing: A key technology towards 5G,” Eur. Telecommun. Standards Inst., Sophia Antipolis, France, White Paper, 2015, p. 11.
-  Q. Li, L. Zhao, J. Gao, H. Liang, L. Zhao and X. Tang, “SMDP-based coordinated virtual machine allocations in cloud-fog computing systems,” IEEE Internet Things J., vol. 5, no. 3, pp. 1977-1988, Jun. 2018.
-  S. Bhatnagar and K. Lakshmanan, “An online actor-critic algorithm with function approximation for constrained Markov decision processes,” J. Optim. Theory Appl., vol. 153, no. 3, pp. 688-708, Jun. 2012.
-  C. Comaniciu and H. V. Poor, “Jointly optimal power and admission control for delay sensitive traffic in CDMA networks with LMMSE receivers,” IEEE Trans. Signal Process., vol. 51, no. 8, pp. 2031-2042, Aug. 2003.