Meta-Reinforcement Learning for Reliable Communication in THz/VLC Wireless VR Networks

01/29/2021 ∙ by Yining Wang, et al. ∙ Princeton University ∙ King's College London ∙ Virginia Polytechnic Institute and State University ∙ The Chinese University of Hong Kong, Shenzhen

In this paper, the problem of enhancing the quality of virtual reality (VR) services is studied for an indoor terahertz (THz)/visible light communication (VLC) wireless network. In the studied model, small base stations (SBSs) transmit high-quality VR images to VR users over THz bands and light-emitting diodes (LEDs) provide accurate indoor positioning services for them using VLC. Here, VR users move in real time and their movement patterns change over time according to their applications. Both THz and VLC links can be blocked by the bodies of VR users. To control the energy consumption of the studied THz/VLC wireless VR network, VLC access points (VAPs) must be selectively turned on so as to ensure accurate and extensive positioning for VR users. Based on the user positions, each SBS must generate corresponding VR images and establish THz links without body blockage to transmit the VR content. The problem is formulated as an optimization problem whose goal is to maximize the average number of successfully served VR users by selecting the appropriate VAPs to be turned on and controlling the user association with SBSs. To solve this problem, a meta policy gradient (MPG) algorithm that enables the trained policy to quickly adapt to new user movement patterns is proposed. In order to solve the problem for VR scenarios with a large number of users, a low-complexity dual method based MPG algorithm (D-MPG) is proposed. Simulation results demonstrate that, compared to a baseline trust region policy optimization algorithm (TRPO), the proposed MPG and D-MPG algorithms yield improvements of about 38.2% and 33.8%, respectively, in the average number of successfully served users, as well as gains of about 75% and 87.5% in convergence speed.


I Introduction

Deploying virtual reality (VR) applications over wireless networks provides new opportunities for VR to offer a seamless user experience [12]. However, the scarce bandwidth of sub-6 GHz bands limits the ability of wireless networks to satisfy the stringent quality-of-service (QoS) requirements of VR applications in terms of high data rates, low latency, and high reliability. A promising solution is to deliver VR services over high frequency bands with abundant bandwidth, such as terahertz (THz) and millimeter wave (mmWave) frequencies. Currently, 5G supports mmWave frequency bands to provide basic wireless VR services. However, as discussed by industry in [14], in order to support ultimate VR services that must integrate vision with perception, an uncompressed bit rate of up to 2 Tbit/s is strictly required. Hence, it is necessary to study the use of frequency bands beyond mmWave for future wireless networks. In addition, although the use of beamforming enables mmWave beams to focus on a small area, interference between neighboring users is still difficult to control in a dense room. Therefore, THz frequencies are viewed as a natural candidate to provide unprecedentedly high data rates for VR content transmission due to the large available bandwidth. Moreover, THz bands can achieve very narrow pencil beamforming (narrower than mmWave) that spatially aligns narrow THz beams to VR users, thereby significantly reducing the interference [15, 4, 34]. However, THz frequencies are highly prone to blockage and their transmission distance is short [4, 34, 6]. In indoor VR scenarios, although short distances enable high-rate VR image transmission at THz frequencies, the mobile users’ bodies may lead to dynamic blockages over the THz links, thus negatively affecting the immersive VR experience.
In addition, to ensure a seamless interaction between the users and the virtual world, it is necessary to accurately locate VR users in real time for VR image generation and transmission. Therefore, deploying THz-enabled wireless networks to offer high-reliability VR services faces many challenges such as user positioning, reduction of link blockage, user association, and reliability assurance.

Recently, several works such as in [9, 26, 19, 21, 10, 20, 6, 5] studied a number of problems related to wireless VR networks. In [9], the authors studied the use of both edge fog computing and caching to satisfy the low latency requirement of VR users. The authors in [26] proposed a novel mobile edge computing-based mobile VR delivery framework that can cache the field of views (FOVs) of VR images. The work in [19] studied the problem of resource management in wireless VR networks to minimize the VR interaction latency. However, the works in [9, 26, 19] sacrificed the quality of delivered VR videos (e.g., by reducing the resolution of VR videos or only displaying the FOV of VR images) to meet the low latency constraints. This challenge can be addressed by using high frequency bands (e.g., mmWave and THz) with abundant bandwidth to transmit high-quality VR images. The authors in [21] investigated the use of the mmWave bands to maximize the quality of the delivered video chunks in a wireless VR network. In [10], the authors introduced a multi-connectivity (MC)-enabled mmWave network for providing low-latency VR services. The work in [20] studied the use of mmWave bands to meet the high bandwidth requirements of panoramic VR video streaming. However, the works in [21, 10, 20] did not study how to use mmWave and high frequency bands to provide reliable VR services in a dense VR scenario. In [6], the authors studied the use of THz bands to provide VR services in a dense VR network. The authors in [5] studied the use of THz-based reconfigurable intelligent surfaces (RISs) to serve VR users in a wireless network. However, the works in [6] and [21, 10, 20, 5] did not consider the mobility of users that can significantly affect VR network performance, particularly for THz-enabled wireless VR networks whose transmission links can be blocked by mobile users. 
Moreover, all of the existing works in [9, 26, 19, 21, 10, 20, 6, 5] ignored the requirement of accurate user localization that is needed to generate users’ VR images. Therefore, in a THz-enabled VR system, it is necessary to consider the time-varying user positions that are used to generate VR images and avoid dynamic blockages of THz links.

A number of existing works such as in [7, 16, 17, 18] studied the problem of positioning applied in a VR system. In [7], the authors used machine learning (ML) algorithms to predict the locations and orientation of VR users. However, the position prediction accuracy of ML algorithms depends on the training data and cannot adapt to different users’ movement patterns. The authors in [16] and [17] studied the use of ultrawideband signals and ultrasonic waves to achieve decimeter-level VR user positioning, respectively. The work in [18] proposed a mobile laser scanning (MLS) positioning system for indoor VR applications. Although the positioning accuracy of an MLS system can reach the centimeter-level accuracy, such a laser system is expensive. Moreover, the existing works in [16, 17, 18] require equipping VR systems with additional positioning devices, thus increasing energy consumption and deployment costs. The work in [2] showed that THz has the potential for indoor positioning. However, since THz bands require very narrow pencil beamforming in dense indoor VR scenarios, one can only passively adjust the beam direction or user association after the user moves, which can detach the users from their virtual world. Visible light communication (VLC) based on light-emitting diodes (LEDs) can provide an alternative and accurate positioning service [35]. In [33, 23, 25], the authors proved that using three LEDs that are in the line of sight (LoS) of the receiver can provide a centimeter-level three-dimensional (3-D) position. However, none of these works in [33, 23, 25] considered the dynamic selection of LEDs according to the user mobility so as to provide inclusive positioning services while ensuring acceptable brightness in a multi-user VR scenario. To this end, we propose to use a THz/VLC-enabled wireless VR network that jointly considers the VLC access points (VAPs) selection and user association in order to provide reliable positioning and high data rate VR content transmission services for VR users.

The main contribution of this work is, thus, a novel framework that jointly uses VLC and THz to service VR users. In particular, we study a dynamic THz/VLC-enabled VR network that can accurately locate VR users in real time using VLC and build THz links to transmit high-quality VR images based on the users’ positions. In the studied network, only a subset of the VAPs can be turned on to locate VR users due to the users’ limited tolerance for brightness. Based on the obtained user positions, each small base station (SBS) must determine the user association to generate corresponding VR images and build THz links to avoid blockages caused by the user bodies. The problem is formulated as a reliability maximization problem that jointly considers the VAP selection, user association with THz SBSs, and time varying users’ movement patterns. The reliability of VR networks is defined as the average number of successfully served VR users. To solve this problem, we propose a meta-policy gradient (MPG) algorithm to find the locally optimal policy for VAP selection and user association. Compared to traditional reinforcement learning (RL) algorithms that can only be trained for a fixed environment in which each user has a fixed movement pattern, the proposed algorithm enables the trained policy to quickly adapt to new users’ movement patterns. To reduce the computational complexity of the MPG algorithm, we propose a dual method based MPG algorithm that uses dual method to assist the MPG algorithm to determine user association based on the selected VAPs. Simulation results show that, compared to a baseline trust region policy optimization algorithm (TRPO), the proposed MPG algorithm and the dual method based MPG algorithm yield a performance improvement of about 38.2% and 33.8% in terms of the average number of successfully served users as well as about 75% and 87.5% gains in the convergence speed, respectively. 
Simulation results also show that the proposed dual method based MPG algorithm achieves up to 88.76% reduction in the training time compared to the MPG algorithm. To the best of our knowledge, this paper is the first to study the joint use of THz and VLC for reliability maximization while considering dynamic VR users’ movement patterns.

The rest of this paper is organized as follows. The system model and the problem formulation are described in Section II. The use of MPG algorithm for VAP selection and user association is introduced in Section III. The dual method based MPG algorithm is presented in Section IV. In Section V, the numerical results are presented and discussed. Finally, conclusions are drawn in Section VI.

II System Model and Problem Formulation

Consider an indoor wireless network that consists of a set of B SBSs and a set of V VAPs. All the VAPs and SBSs are managed by a central controller. The SBSs are evenly distributed in an indoor area to serve a set of U VR users over THz frequencies, as shown in Fig. 1. In the studied model, accurate locations of the users are required by the SBSs so as to build LoS THz links and generate the VR images requested by users [26]. Each VAP provides accurate indoor positioning and tracking services for VR users using VLC. Here, we consider dual-mode user equipments (UEs) that are able to access both THz and VLC bands. In the studied multi-user VR network, at each time slot , each SBS can only serve one user with a narrow beam while each VAP can locate all the users that are not blocked in its FOV. To control the system energy consumption, the central controller selects a group of VAPs at the beginning of each time slot to locate VR users. Here, not all users can be accurately localized due to the user body blockage over the VLC links [32]. Hence, based on the obtained user positions, the central controller determines the SBSs associated with the successfully localized users, and then SBSs transmit the corresponding VR images to those users using wireless THz links. In our model, each time period consists of time slots. A successful transmission implies that the request of a given VR user is successfully completed within a time period.

Fig. 1: Illustration of the considered THz/VLC-enabled wireless VR network.

II-A User Blockage Model

In the studied model, the LoS links (VLC or THz links) between user and a transmitter (a VAP or an SBS) can be blocked by other VR users’ bodies [6]. For a given user located at at time slot in time period and a transmitter located at , we define a binary variable that indicates whether LoS links exist between user and transmitter , as follows:

(1)

where is the set of all possible points in the transmission link between transmitter and user , is the space occupied by the body of user at time slot in time period and as is true, , otherwise. Equation (1) indicates that the LoS link between transmitter and user at time slot exists only if none of the other users () blocks the transmission link, as shown in Fig. 1. In (1), implies that the link between transmitter and user is blocked at time slot in period ; otherwise, we have . Here, we assume that the positions of the VR users remain unchanged during each time slot .
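The LoS test described above can be illustrated with a small geometric check. The following is only a sketch under stated assumptions: each blocking body is approximated by a vertical cylinder standing on the floor, the link is tested at a finite number of sample points, and the names `has_los`, `body_radius`, `body_height`, and `n_samples` are hypothetical and do not come from the paper.

```python
import numpy as np

def has_los(tx, rx, blockers, body_radius=0.25, body_height=1.8, n_samples=200):
    """Return 1 if no blocker body intersects the segment tx -> rx, else 0.

    Assumption of this sketch: each blocker is a vertical cylinder of the
    given radius and height, standing on the floor at its (x, y) position.
    """
    tx, rx = np.asarray(tx, float), np.asarray(rx, float)
    # Sample points strictly between the two endpoints of the link.
    s = np.linspace(0.0, 1.0, n_samples + 2)[1:-1, None]
    points = tx + s * (rx - tx)
    for bx, by in blockers:
        horizontal = np.hypot(points[:, 0] - bx, points[:, 1] - by)
        inside = (horizontal <= body_radius) & (points[:, 2] <= body_height)
        if inside.any():
            return 0  # at least one point of the link lies inside a body
    return 1

# Ceiling-mounted transmitter at 3 m, headset at 1.5 m.
tx, rx = (0.0, 0.0, 3.0), (4.0, 0.0, 1.5)
blocked = has_los(tx, rx, blockers=[(3.5, 0.0)])   # body close to the receiver
clear = has_los(tx, rx, blockers=[(1.0, 0.0)])     # beam passes above the head
```

A closed-form segment/cylinder intersection would be more efficient; sampling is used here only because it mirrors the "set of all possible points in the transmission link" wording.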

II-B VLC Indoor Positioning

We assume that the three-dimensional (3D) location of each user is determined by three VAPs from three different orientations [35], where and are the coordinates of user in the room and is the height of user . VAPs are selectively turned on to provide stable and acceptable brightness as well as accurate user positioning services simultaneously. At each time slot in time period , a set of three VAPs is turned on to broadcast their location information to users. When each user receives the location information of VAP , and , it can calculate the incidence angle [33]. Note that, user can receive the location information sent by VAP at time slot only when the following conditions are satisfied: a) VAP is in the FOV of user , as shown in Fig. 1, and b) the VLC link between VAP and user is LoS (i.e. ). Then, the set of VAPs available for providing the positioning service to user can be given by

(2)

where is the receiver FOV semi-angle.

Based on three different incidence angles and the corresponding VAP locations, each user can calculate its own location at time slot in period using a triangulation algorithm [33]. Then, the positioning state of user at time slot in period will be

(3)

where represents the number of VAPs that can serve user .
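The availability conditions and the positioning state can be sketched as follows. This is an illustration, not the paper's implementation: it assumes ceiling-mounted VAPs and a receiver whose normal points straight up, so the incidence angle is measured from the vertical, and it treats a user as localized when at least three active VAPs are available. All function and parameter names are hypothetical.

```python
import numpy as np

def incidence_angle(vap_pos, user_pos):
    """Angle (rad) between the upward receiver normal and the direction to the VAP."""
    d = np.asarray(vap_pos, float) - np.asarray(user_pos, float)
    return np.arccos(d[2] / np.linalg.norm(d))

def available_vaps(user_pos, active_vaps, los, fov_semi_angle):
    """Indices of active VAPs that are inside the receiver FOV and have a LoS link."""
    return [v for v, pos in active_vaps.items()
            if los[v] == 1 and incidence_angle(pos, user_pos) <= fov_semi_angle]

def positioning_state(user_pos, active_vaps, los, fov_semi_angle):
    """1 if three VAPs can serve the user (enough for triangulation), else 0."""
    return int(len(available_vaps(user_pos, active_vaps, los, fov_semi_angle)) >= 3)

vaps = {0: (0.0, 0.0, 3.0), 1: (1.0, 0.0, 3.0), 2: (0.0, 1.0, 3.0)}
user = (0.0, 0.0, 1.5)
state = positioning_state(user, vaps, {0: 1, 1: 1, 2: 1}, np.deg2rad(60))
```

With exactly three VAPs turned on per slot, the "at least three" test coincides with requiring that all three selected VAPs are available to the user.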

Once the position of user is successfully calculated at time slot in period (i.e. ), user transmits its own location to the central controller and requests the corresponding VR image. Here, we do not consider the uplink transmission links that are used to transmit positions from each VR user to the central controller and, thus, we ignore the delay associated with these transmissions due to the small data size of position information. Based on the obtained user positions, the central controller can determine the user-SBS association and, then, the SBSs can generate corresponding VR images and serve the associated users over the THz band.

II-C Transmission Model

Due to the extremely narrow pencil beamforming (narrower than mmWave) for THz [34], we assume that each user can only be associated with one SBS and each SBS can only serve one user at each time slot. In time period , let be the index of the link between SBS and user at time slot , i.e., implies that user is associated with SBS ; otherwise, we have . Then, we have

(4)

At time slot in period , given an SBS located at and its associated user located at , the path loss of the THz link between SBS and user can be given by [6]

(5)

where is the distance between SBS and user , is the speed of light, is the operating frequency, and represents the transmittance of the medium following the Beer-Lambert law with being the overall absorption coefficient of the medium at THz frequency [22]. The total noise power at each UE that is generated by thermal agitation of electrons and molecular absorption is [22]

(6)

where is the transmit power of each SBS, represents the Johnson-Nyquist noise generated by thermal agitation of electrons in conductors with and being Boltzmann constant and the temperature in Kelvin, respectively, and is the sum of molecular absorption noise caused by the transmit power of any SBS . Here, we assume that each user will not be subject to interference by other SBSs due to the narrow beam. The data rate of VR image transmission from SBS to its associated user at time slot in period can be expressed as

(7)

where is the bandwidth of the THz band.

Given the data size of the VR image requested by user at time slot in period , the transmission delay will be

(8)

where . Note that the data size of a VR image only depends on the image resolution which remains unchanged during service. Since the user position will change at the next time slot, the VR image requested by user can be successfully transmitted only when the transmission delay is within the time duration of a time slot . Then, in time period , the transmission state of user at time slot can be expressed as

(9)

From (9), we can see that, whether the requested VR image of user is successfully transmitted at time slot or not depends on the user’s locations, user association, and blockages between SBS and user .
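The chain from path loss to transmission state can be sketched numerically. The sketch below assumes the commonly used THz form in which the free-space spreading loss is divided by the Beer-Lambert transmittance, and a Shannon-type rate without interference; the noise power is passed in as a number rather than derived. The function names and argument names are hypothetical.

```python
import math

C = 3.0e8  # speed of light in m/s

def thz_path_loss(distance, freq, absorption_coeff):
    """Spreading loss divided by the Beer-Lambert transmittance (assumed form)."""
    spreading = (4.0 * math.pi * freq * distance / C) ** 2
    transmittance = math.exp(-absorption_coeff * distance)
    return spreading / transmittance

def data_rate(bandwidth, tx_power, path_loss, noise_power):
    """Shannon rate in bit/s for an interference-free link."""
    return bandwidth * math.log2(1.0 + tx_power / (path_loss * noise_power))

def transmission_state(image_bits, rate, slot_duration, associated, los):
    """1 if the user is associated, unblocked, and the image fits in one slot."""
    if not (associated and los) or rate <= 0.0:
        return 0
    return int(image_bits / rate <= slot_duration)

# Illustrative numbers only; they are not the paper's simulation parameters.
loss = thz_path_loss(distance=2.0, freq=1.0e12, absorption_coeff=0.05)
rate = data_rate(bandwidth=10e9, tx_power=1.0, path_loss=loss, noise_power=1e-12)
ok = transmission_state(image_bits=1e8, rate=rate, slot_duration=0.01,
                        associated=True, los=True)
```

The sketch makes the dependence noted above explicit: the state can fail because the user is not associated, because the link is blocked, or because the distance-dependent rate is too low for the slot duration.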

II-D Reliability Model

As mentioned earlier, in our model, the reliability of the THz/VLC-enabled wireless VR network refers to the average number of successfully served VR users. At each time slot , a successfully served user must satisfy two conditions: a) user is successfully localized and b) the VR image requested by user is transmitted within . In order to enable a seamless and immersive wireless VR experience, we assume that the waiting delay is limited to a time period that consists of time slots. In other words, each user should be successfully served at least once in a time period. Therefore, in time period , the service state of user until time slot based on the selected and will be

(10)

where and represents the logical “or” operation. The newly served users at time slot will be

(11)

Then, the number of successfully served users in each time period can be given by

(12)

where and .
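The bookkeeping behind (10)-(12) can be sketched with binary arrays. This is a minimal illustration that assumes the per-slot positioning and transmission states are already available as arrays of shape (number of slots, number of users); the function name and array layout are hypothetical.

```python
import numpy as np

def served_users_in_period(positioned, transmitted):
    """Return (service_state, newly_served, total_served) for one time period.

    positioned, transmitted: binary arrays of shape (num_slots, num_users).
    """
    positioned = np.asarray(positioned, dtype=bool)
    transmitted = np.asarray(transmitted, dtype=bool)
    # A user is served in a slot only if it is localized AND its image arrives in time.
    success = positioned & transmitted
    # Logical "or" over all slots so far: served at least once until this slot.
    service_state = np.logical_or.accumulate(success, axis=0)
    previous = np.vstack([np.zeros((1, success.shape[1]), dtype=bool),
                          service_state[:-1]])
    # Users whose service state switches from 0 to 1 in this slot.
    newly_served = service_state & ~previous
    return service_state.astype(int), newly_served.astype(int), int(newly_served.sum())

positioned = [[1, 0, 1], [1, 1, 1], [0, 1, 1]]
transmitted = [[1, 1, 0], [1, 0, 0], [1, 1, 0]]
state, new, total = served_users_in_period(positioned, transmitted)
```

In this example the first user is served in the first slot, the second user in the third slot, and the third user is never served because its image never arrives in time, so two users count toward the period's reliability.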

II-E Problem Formulation

Given the defined system model, our goal is to effectively select the subset of VAPs to provide accurate positioning services and, then, determine the user-SBS association based on the obtained user positions so as to maximize the reliability of the studied VR network. Then, the reliability maximization problem is formulated as follows:

(13)
(13a)
(13b)
(13c)
(13d)

where is the total number of all time periods. Constraint (13a) captures the fact that only three VAPs are selected at each time slot to provide positioning service. Constraints (13b), (13c), and (13d) indicate that each user can only be associated with one SBS and each SBS can only serve one user at each time slot. From (13), we can see that the reliability depends on the selected VAPs and the user association with SBSs. Meanwhile, the VAP selection and the user-SBS association depend on the positions of the VR users. However, the users’ positions continuously change as time elapses. Therefore, real-time user positions are needed by the central controller so as to generate corresponding VR images and build THz links without blockages. Moreover, due to the time-varying nature of VR applications, the users’ movement pattern varies over different time periods [29]. Here, we define a position transition matrix as the users’ movement pattern during time period , in which each element is the probability of the user moving from one position to another. Note that the studied THz/VLC-enabled VR network has no knowledge of the users’ movement patterns. Due to the non-convexity and the unpredictability of the users’ movement patterns, (13) cannot be solved by the traditional optimization algorithms, such as dynamic programming or nonlinear programming. Moreover, traditional RL algorithms, such as Q-learning [28] or deep Q-network [5], can only solve optimization problems in static and known environments, and, thus, they are also not suitable to solve the problem in (13). Hence, we propose an RL algorithm based on a meta-learning framework that adapts to dynamic users’ movement patterns so as to determine the VAP selection and the user association in advance. We next introduce a meta-reinforcement learning algorithm to proactively determine the VAP selection and the user association.
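The role of the position transition matrix can be illustrated by sampling a user trajectory from it. The sketch assumes that positions are indexed as a finite set of cells and that each row of the matrix is a probability distribution over the next cell; `simulate_positions` and its arguments are hypothetical names.

```python
import numpy as np

def simulate_positions(transition, start_cell, num_slots, rng=None):
    """Sample one user's cell index per time slot from a row-stochastic matrix."""
    rng = np.random.default_rng() if rng is None else rng
    transition = np.asarray(transition, float)
    cells = [start_cell]
    for _ in range(num_slots - 1):
        # Row cells[-1] holds the probabilities of moving to every other cell.
        cells.append(int(rng.choice(len(transition), p=transition[cells[-1]])))
    return cells

# A deterministic "walk in a circle" pattern over three cells.
pattern = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [1.0, 0.0, 0.0]]
trajectory = simulate_positions(pattern, start_cell=0, num_slots=5)
```

A different time period would use a different matrix, which is what makes each period a separate task for the learning agent: the agent only observes the sampled positions, never the matrix itself.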

III Meta-Learning for VAP Selection and User Association

Next, we introduce a policy gradient-based RL algorithm [27] using meta-learning framework [11], called meta policy gradient (MPG), that can effectively solve problem (13). Traditional policy gradient algorithms can only determine the VAP selection and user association in a fixed environment (i.e., the fixed users’ movement patterns). Meta-learning is a novel learning approach that can integrate the prior reliability-enhancing experience with information collected from the new users’ movement patterns, thus training a rapidly adaptive learning model. Therefore, the proposed MPG can obtain the VAP selection and user association policies that can be quickly updated to adapt to new users’ movement patterns using only a few further training steps. Compared with the meta-trained value decomposition-based RL algorithm [13] that uses each agent’s local observation of the environment to estimate the rewards resulting from the actions, the proposed MPG algorithm enables the agent to directly obtain the reward of a chosen action from the global environment. Hence, the proposed MPG algorithm can effectively find a better action that results in a higher reliability compared to the RL algorithm in [13]. The VAP selection aims to obtain the positions of as many users as possible under the limitation of energy consumption. Then, the user-SBS association is determined based on the user positions in a way to avoid blockages of THz links and meet the transmission delay constraints. Next, we first introduce the components of the MPG algorithm for VAP selection and user association. Then, we explain the entire procedure of using our MPG algorithm to select VAPs and determine the user association with SBSs.

III-A Components of MPG

An MPG algorithm consists of six components: a) agent, b) actions, c) states, d) policy, e) reward, and f) tasks, which are specified as follows:

  • Agent: Our agent is a central controller that can obtain the user positions and simultaneously control the VAPs and the SBSs.

  • Actions: The action of the agent at each time slot in period is a vector that jointly considers the VAP selection and the user association. The action space is the set of all optional actions.

  • States: The state at time slot in time period is defined as that consists of: 1) the user position , where depends on and the movement pattern in time period , which is unknown to the central controller, and 2) the service state vector that indicates whether each user has been successfully served until time slot . The state space is the set of all possible states.

  • Policy: The policy is the probability of the agent choosing each action at a given state. The MPG algorithm uses a parameterized deep neural network to map the input state to the output action. Then, the policy can be expressed as . Based on the policy , an execution process in a time period can be defined as a trajectory .

  • Reward: The benefit of choosing action at state is . Therefore, the reward of a trajectory during a time period is . Note that the reward function is equivalent to the number of successfully served users defined in (12), that is . The objective function of problem (13) that the agent aims to optimize is the average reward function of all time periods .

  • Tasks : We use a task to refer to the reliability maximization problem in each time period . A task is thus defined as . For each task , the trajectory and the corresponding reward are affected by the users’ movement pattern that is unknown to the agent. However, the policy is shared by all tasks. Therefore, the agent must find the effective policy that can quickly adapt to new users’ movement patterns.

III-B MPG for Optimization of Reliability

Next, we introduce the entire procedure of training the proposed MPG algorithm. The purpose of training the MPG is to find the optimal policy that maximizes the reliability of the THz/VLC-enabled wireless VR network over different time periods. The MPG algorithm enables the trained policy to quickly adapt to the time-varying users’ movement patterns. The intuition behind the proposed MPG is that some of its parameters are task-sensitive while other parameters are broadly applicable to all tasks. Therefore, the training process of MPG has two steps: 1) task learning step and 2) meta-learning step. The task learning step enables the MPG to execute the policy gradient on task-sensitive parameters so as to make rapid progress on each new task. The meta-learning step aims to find the broadly applicable parameters that can improve the performance of all tasks. The proposed MPG model is trained offline, which means that the MPG model is trained using the trajectories and the corresponding rewards sampled in historical tasks. Using the historical trajectories and rewards, the MPG model can learn the distribution of the tasks and thus quickly adapt to a new task. In particular, the trained fast-adaptive MPG model only requires a few iterations of the task learning step to learn the new users’ movement pattern so as to solve the new task. Hence, the proposed algorithm can maximize the reliability of the studied VR network in each specific new time period. Specifically, the task learning step and meta-learning step can be given as follows:

  1. Task learning step: For each task , the agent first collects trajectories based on a given policy . The set of collected trajectories of task is , where is the trajectory of task . To evaluate the policy for maximizing the reliability of the VR network, we define the expected reward of the trajectories in as

    (14)

    where . is the probability of state transitioning to state after taking action , which depends on the movement pattern . The goal of optimizing the policy for each task is to maximize the number of successfully served users in time period , that is

    (15)

    For each task , the policy is updated using the standard gradient ascent method

    (16)

    where is the learning rate that is equal for all tasks and the policy gradient is

    (17)

    Finally, the agent collects trajectories for each task using the corresponding updated policy . Each trajectory set is used to optimize the broadly applicable parameters in the next meta-learning step so as to increase the average number of successfully served users for all tasks.

  2. Meta-learning step: The agent first computes the expected rewards of each trajectory set based on each updated policy . To solve the reliability maximization problem (13), we only need to solve the following optimization problem

    (18)

    Substituting (16) into (18), we have

    (19)

    Then, to improve the average number of successfully served users for all tasks, the policy will be updated by

    (20)

    where is the learning rate for meta-learning. Here, note that the meta-learning step is performed over the parameters instead of the parameters updated in the previous task learning step.

By iteratively running the task learning and the meta-learning step, a locally optimal policy for determining the VAP selection and user association under different users’ movement patterns can be obtained. The specific training process of the proposed MPG algorithm is summarized in Algorithm 1.

1:  Input: The set of VAPs , the set of SBSs , the user positions , and the transition matrix .
2:  Initialize: is initially generated randomly, , task learning rate , meta-learning rate , and the number of iterations .
3:  for  do
4:      for all task  do
5:          Collect trajectories using .
6:          Compute using based on (17).
7:          Compute parameters of the adapted policy based on (16).
8:          Collect trajectories using .
9:      end for
10:      Compute using each .
11:      Update the parameters of the policy based on (20).
12:  end for
Algorithm 1 MPG algorithm for VAP selection and user association.

The optimization problem (13) is solved once the locally optimal policy of the proposed MPG model that is used to determine the VAP selection and user association is obtained. Since the meta-learning step tends to optimize the broadly applicable parameters for all tasks, the proposed MPG algorithm enables the trained policy to quickly adapt to new tasks. This means that, for a new task with a new users’ movement pattern, using the trained policy as initialization, the agent can quickly find the locally optimal policy by only executing the task learning step with a few trajectories.
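The two-step structure of Algorithm 1 can be illustrated on a deliberately small toy problem. The sketch below is not the paper's implementation: it replaces the VR environment with one-step tasks in which each action has a fixed task-specific reward, uses a softmax policy whose expected-reward gradient is available in closed form instead of sampled trajectories, and uses a first-order approximation of the meta-gradient that ignores the second-derivative term arising when (16) is substituted into (18). All names and hyperparameters are hypothetical.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_reward(theta, rewards):
    """Expected reward of the softmax policy on a one-step task."""
    return float(softmax(theta) @ rewards)

def policy_gradient(theta, rewards):
    """Exact gradient of the expected reward with respect to theta."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def train_mpg(tasks, alpha=0.1, beta=0.1, iterations=200):
    """First-order meta policy gradient over a set of tasks (rows of `tasks`)."""
    theta = np.zeros(tasks.shape[1])
    for _ in range(iterations):
        meta_grad = np.zeros_like(theta)
        for rewards in tasks:
            # Task learning step: one gradient-ascent step on this task.
            adapted = theta + alpha * policy_gradient(theta, rewards)
            # First-order meta-gradient: gradient evaluated at the adapted parameters.
            meta_grad += policy_gradient(adapted, rewards)
        # Meta-learning step: update the shared initial parameters.
        theta = theta + beta * meta_grad
    return theta

# Two tasks that prefer different actions, standing in for two movement patterns.
tasks = np.array([[1.0, 0.0, 0.2],
                  [0.0, 1.0, 0.2]])
theta = train_mpg(tasks)
new_task = tasks[0]
adapted = theta + 0.1 * policy_gradient(theta, new_task)
```

The last two lines mirror the adaptation described above: the meta-trained parameters serve as the initialization, and a single task learning step moves the policy toward the action preferred by the new task.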

III-C Complexity and Overhead of MPG

Next, we analyze the computational complexity of the proposed MPG algorithm for VAP selection and user association optimization. The complexity of the MPG algorithm depends on the number of policy parameters , which depends on the size of action space and the size of state space . The action space is a set of all possible VAP selections and user associations. The number of optional combinations of three VAPs from VAPs is . The number of possible user-SBS associations depends on the number of users and the number of SBSs , which is . Hence, the size of will be . The state space consists of continuous user locations as well as discrete service state and newly served users . To ensure a finite state space, we discretize the continuous user positions . In particular, the considered indoor space is divided into small grids and the position of the user in each grid is represented by the center of the grid. The size of service state space is . Then, the size of state space is . Therefore, the computational complexity of the proposed MPG algorithm can be given as

(21)

where is the number of neurons in layer of the deep neural network used to train the policy. From (21) we can see that, due to the combinatorial user associations (i.e., ), the complexity of the MPG algorithm becomes unacceptably large as the number of users increases. To this end, we propose a dual method based MPG (D-MPG) solution in which an action only determines the VAP selection. Given the VAP selection, the user association can be determined by the dual method, thus reducing the size of the action space of the original MPG algorithm. Here, we need to note that the MPG and D-MPG algorithms have their own advantages and drawbacks. MPG can converge to a locally optimal solution but D-MPG cannot. However, D-MPG converges faster than MPG. Therefore, one must select the solution (MPG or D-MPG) based on the implementation requirements such as training time or performance. Next, we introduce the D-MPG algorithm.
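The growth of the joint action space can be made concrete with a small counting sketch. The number of ways to pick three VAPs is a binomial coefficient. For the user-SBS associations, the sketch counts one-to-one partial matchings (each SBS serves at most one user and each user has at most one SBS); this counting rule is an assumption made here for illustration and may differ from the expression used in the text. The function names are hypothetical.

```python
from math import comb, factorial

def num_vap_selections(num_vaps):
    """Number of ways to turn on exactly three VAPs."""
    return comb(num_vaps, 3)

def num_associations(num_sbs, num_users):
    """One-to-one partial matchings between SBSs and users (assumed counting rule)."""
    return sum(comb(num_sbs, k) * comb(num_users, k) * factorial(k)
               for k in range(min(num_sbs, num_users) + 1))

def joint_action_space(num_vaps, num_sbs, num_users):
    """Actions when VAP selection and user association are chosen jointly."""
    return num_vap_selections(num_vaps) * num_associations(num_sbs, num_users)

for users in (2, 4, 8):
    print(users, num_vap_selections(10), joint_action_space(10, 4, users))
```

Under this counting rule, the VAP-only action space stays fixed as users are added while the joint action space keeps growing, which is the motivation given above for letting the action decide only the VAP selection.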

IV Dual Method Based Meta-Learning

The components of the D-MPG algorithm are defined as follows:

  • Agent: The agent of the D-MPG is also the central controller.

  • Actions: The action of the agent at each time slot in period is the subset of VAPs to turn on, which is . The action space is .

  • States: The state at time slot in time period is and the state space is . Note that, given the VAP selection , the service state vector depends on the user association. The determination of the user association using the dual method will be specified in Section IV-A.

  • Policy: The policy is used to build the relationship between the input state and output action, where is the parameter of the deep neural network used to learn the policy. The trajectory during a time period based on the policy can be given as .

  • Reward: The reward of a trajectory in time period is , where is the optimal user association based on the chosen action . The average reward function of all time periods that the agent aims to optimize is , which is also the objective function of the reliability maximization problem (13). Here, we can see that, to maximize the average reward function , we need to determine the optimal user association at each time slot .

  • Tasks : Task is the reliability maximization problem in each time period .

From the above definitions, we can see that the only difference between the MPG algorithm and the D-MPG algorithm is the action. In particular, an action of the original MPG jointly determines VAP selection and user association, while an action of the D-MPG determines only VAP selection. Therefore, the D-MPG significantly shrinks the action space, which reduces training complexity and speeds up convergence. Next, we will specify the dual method for user-SBS association optimization.
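The split between the agent's action and the inner association step can be sketched as follows. This is only a structural illustration: `vap_actions`, `dmpg_step`, and `toy_associate` are hypothetical names, and `toy_associate` is a trivial stand-in (at most one user per SBS, in user order), not the dual method of Section IV-A. The reward here is simply the number of users the inner step manages to serve.

```python
from itertools import combinations

def vap_actions(num_vaps, k=3):
    """Enumerate the D-MPG action space: all subsets of k VAPs to turn on."""
    return list(combinations(range(num_vaps), k))

def toy_associate(vaps, users, num_sbs=2):
    """HYPOTHETICAL stand-in for the inner solver: one user per SBS, in user order."""
    return {user: sbs for sbs, user in zip(range(num_sbs), users)}

def dmpg_step(action, users, associate):
    """One D-MPG decision: the action fixes the VAPs, association is delegated."""
    association = associate(action, users)
    reward = len(association)  # number of users served in this slot
    return association, reward

actions = vap_actions(4)
association, reward = dmpg_step(actions[0], [7, 8, 9], toy_associate)
```

Because the policy only ever outputs an element of `vap_actions`, its output layer does not grow with the number of users. All user-dependent combinatorics are pushed into the `associate` callable.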

IV-A Optimization of User-SBS Association and Reliability

Once VAPs are selected, the user-SBS association can be determined based on the user positions to avoid blockages of THz links and meet the transmission delay constraints by solving the optimization problem defined in (13). Substituting (12) into (13), the user-SBS association and reliability maximization problem with fixed VAP selection can be expressed as

(22)
(22a)
(22b)
(22c)

From (11), we can see that users can experience immersive VR services as long as each one of them is successfully served once in each time period. This means that serving a VR user multiple times in a period cannot improve the reliability of the studied VR network. Therefore, a problem equivalent to (22) is

(23)
(23a)
(23b)

where (23b) indicates that each VR user can be served at most once in a time period. Based on (23b), the newly served user at each time slot defined in (11) can be simplified to

(24)

This is because and must always be satisfied simultaneously with the additional service constraint (23b). Then, substituting (10) into (24), we have

(25)

Here, due to (23b), the service state at time slot must be satisfied if we have at time slot . Hence, we obtain (25), which represents the number of users served at each time slot given the selected VAPs.
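The effect of the at-most-once constraint (23b) can be checked with a small sketch. It assumes that a user counts as newly served in a slot if it is served in that slot and was not served earlier in the same period, which is how the text describes (11). The function name and the schedules are illustrative only.

```python
def newly_served_per_slot(served_by_slot):
    """Count, for each slot, the users served for the first time in the period."""
    seen = set()
    counts = []
    for slot in served_by_slot:
        new = set(slot) - seen   # served now and not served earlier
        counts.append(len(new))
        seen |= new
    return counts

# Users 2 and 3 are served twice: the repeats add nothing to the count.
repeated = [[1, 2], [2, 3], [3]]
# Same users, each served at most once, as (23b) requires.
at_most_once = [[1, 2], [3], []]

counts_repeated = newly_served_per_slot(repeated)
counts_once = newly_served_per_slot(at_most_once)
```

Both schedules yield the same per-period total of three served users. Under the at-most-once schedule, the newly served count in each slot equals the number of users served in that slot, which is the simplification that (24) and (25) rely on.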

Since the user association in each time period can be optimized independently, problem (23) can be decoupled into multiple subproblems. In addition, due to the binary variable , the optimization problem in (23) is hard to solve. Hence, we temporarily adopt the fractional user association relaxation, where the association variable can take on any real value in . As proved in [30], although the feasible region of is relaxed to be continuous, the optimal solution of the relaxed problem also meets the integer constraint. Therefore, the relaxation does not cause any loss of optimality in the final solution of problem (23). Then, for each time period , the reliability maximization subproblem can be formulated as follows:

(26)
(26a)
(26b)
(26c)
(26d)
(26e)

where represents the set of users that are successfully localized by the set of VAPs at time slot in time period and is the number of users in . Note that problem (26) becomes convex after the binary variable is relaxed. Here, we ignore the blockages of THz links caused by the users that are not successfully localized by the set of VAPs (i.e., ).

Due to constraint (26c), all time slots are coupled in problem (