Vehicle-to-infrastructure (V2I) communication over millimeter-wave (mmWave) links is vital for the successful operation of fifth-generation (5G) intelligent transportation systems (ITS). Moreover, as the demand for wireless spectrum in ITS has increased enormously, roadside mmWave base stations that form small-cell coverage regions are deployed in multi-tier heterogeneous vehicular networks (HetVNets) to enhance spectrum efficiency and to offload the heavy traffic burden. The additional dense deployment of small cells in a HetVNet meets the requirements of ITS vehicles for high data rates and enlarged coverage. However, as the number of base stations and vehicular user equipments (VUEs) grows dramatically, it becomes challenging for the radio access technology (RAT) of VUEs in a HetVNet to optimize wireless resource utilization. Radio resource management in a HetVNet for improving the quality of service (QoS) of VUEs is NP-hard and computationally intractable. Furthermore, because mmWave wireless channels are highly directional and usable only within a short range (i.e., approximately a hundred meters), many mmWave base stations must be densely deployed to support mmWave wireless communication services [3, 4, 5, 6, 7]. Therefore, decision making with respect to association, channel selection, and occasional beam alignment of VUEs imposes a heavy computational burden in ensuring QoS-aware wireless communication in a HetVNet.
There have been many research results on the cell association and resource allocation (CARA) problem. The resource allocation problem of mmWave-enabled networks has been considered in prior work, and the joint CARA problem was studied in [12, 13, 14, 15]. However, because of the NP-hard and combinatorial nature of the joint CARA problem, it is challenging to achieve a globally optimal solution. Several approaches have been proposed to solve the CARA problem, such as graph theory, integer programming, matching games, and stochastic geometry. These approaches are still limited in solving the joint CARA problem because they require nearly precise information, such as full knowledge of channel state information (CSI) or fading models. In practice, such accurate information may not be available, so computing the optimal point of the joint CARA problem is intractable. In this regard, a multi-agent deep reinforcement learning (DRL) approach is proposed to solve the joint CARA problem in a way that improves the downlink (DL) throughput of VUEs in a HetVNet.
Reinforcement learning has been widely applied to solve various types of complex decision-making problems in wireless networks, such as interference alignment in cache-enabled networks and dynamic duty cycle selection in unlicensed spectrum. Unlike the existing approaches, reinforcement learning needs only a little information to operate, such as the possible action space of the learning agent. Based on the interaction between the agent and its environment, the reinforcement learning agent observes state transitions and learns how to act well by updating its policy. The agent estimates the expected total reward of each possible action for a given state and decides how to act in a sequential decision-making process. A DRL-based power-efficient resource allocation framework for cloud radio access networks (RANs) has also been proposed. Elsayed et al. proposed a DRL-based latency reduction scheme for mission-critical services in next-generation wireless networks. They combined long short-term memory (LSTM) and Q-learning to minimize the delay of mission-critical services.
However, traditional DRL-based learning approaches mostly assume single-agent systems, which are hard to apply in practice. A user may fail to learn an optimal policy because the environment is partially observable and non-stationary, which is caused by the actions of neighboring users. In this paper, we solve the joint CARA problem of a HetVNet with a QoS-guaranteeing approach based on the multi-agent deep deterministic policy gradient (MADDPG). With the MADDPG strategy, multiple VUEs can learn their own policies to solve the CARA problem in a cooperative manner. Simulation results show that the proposed method outperforms other DRL methods in terms of DL throughput.
The rest of this paper is organized as follows. The HetVNet system model and the CARA problem are presented in Sec. II. Based on the network architecture and problem definition, a multi-agent DRL-based solution for the joint CARA problem is proposed in Sec. III. Sec. IV shows the results of the performance evaluation. Lastly, Sec. V concludes this paper.
II System Model and Problem Definition
This section presents the HetVNet system model shown in Fig. 1 and defines the joint CARA problem. The 3-tier HetVNet system consists of a macro-cell base station (MaBS), micro-cell base stations (MiBSs), and mmWave-enabled pico-cell base stations (PBSs). The CARA problem is solved in such a way that each VUE cooperatively associates with a base station and the wireless resources are efficiently allocated to each VUE, so that the downlink throughput requirement of the VUEs in the HetVNet is satisfied.
II-A System Model
The 3-tier HetVNet consists of MaBSs, MiBSs, and PBSs, and it serves a set of VUEs. For notational convenience, the set of all base stations is split into the set of MaBSs and MiBSs and the set of PBSs. In addition, each VUE is equipped with one antenna and can associate with only one base station during a time slot. Suppose that a binary vector represents the cell association information of each VUE, with one entry per base station. If the VUE associates with a given base station, the corresponding entry is set to 1. Otherwise, the entry is set to 0. Then, the vector can be represented as:
In addition, each VUE can utilize carrier aggregation, which combines multiple subchannels of the associated base station. However, for fair resource access, the number of channels that each VUE can utilize is bounded by an upper limit. We assume that the MaBSs and MiBSs share a set of orthogonal channels and that the PBSs operate on a separate set of mmWave channels. The resource allocation between a VUE and an MaBS or MiBS is represented by a binary vector with one entry per orthogonal channel. If the VUE uses a given channel, the corresponding entry is set to 1. Otherwise, it is set to 0. If the VUE associates with a PBS instead, the resource allocation vector between the VUE and the PBS is defined in the same way over the set of mmWave channels. Then, the resource allocation toward each VUE can be denoted as:
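The association and allocation vectors described above can be illustrated with a minimal sketch. The function names, the index-based encoding, and the parameter `max_channels` (standing in for the per-VUE channel limit) are assumptions made for illustration and are not the paper's notation.

```python
from typing import List, Sequence


def association_vector(num_bs: int, bs_index: int) -> List[int]:
    """One-hot vector: the entry of the serving base station is 1, all others 0."""
    if not 0 <= bs_index < num_bs:
        raise ValueError("bs_index out of range")
    return [1 if j == bs_index else 0 for j in range(num_bs)]


def allocation_vector(num_channels: int, chosen: Sequence[int],
                      max_channels: int) -> List[int]:
    """Binary vector over the channels of the associated base station.

    max_channels is a hypothetical name for the fairness limit on how many
    channels a single VUE may aggregate.
    """
    chosen_set = set(chosen)
    if len(chosen_set) > max_channels:
        raise ValueError("VUE exceeds its channel limit")
    if any(not 0 <= k < num_channels for k in chosen_set):
        raise ValueError("channel index out of range")
    return [1 if k in chosen_set else 0 for k in range(num_channels)]


# Example: a VUE served by base station 2 of 4, aggregating channels 0 and 3 of 6.
assoc = association_vector(num_bs=4, bs_index=2)
alloc = allocation_vector(num_channels=6, chosen=[0, 3], max_channels=3)
```

Because the association vector is one-hot, the constraint that a VUE associates with only one base station per time slot holds by construction.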
Because the MiBSs coexist in the coverage of the MaBS, co-channel interference must be taken into account. In practice, the transmit power takes one of a finite number of levels. Suppose that a power level vector gives the possible transmit power levels of an MaBS or MiBS on the shared spectrum, i.e., the transmit power on each channel. In addition, each VUE is assumed to measure the instantaneous channel gain toward each base station. Then, the signal-to-interference-plus-noise ratio (SINR) at a VUE that is associated with a base station and uses an orthogonal or mmWave channel is as follows:
Here, the SINR depends on the bandwidth of a channel and on the noise power. Based on Eq. (3), the DL throughput of each VUE can be:
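A minimal sketch of the SINR and throughput computation is given below. It assumes the common textbook forms, namely received signal power over interference plus noise and a Shannon-capacity throughput summed over the allocated channels. The function names, argument names, and the modeling of noise as a power spectral density multiplied by the bandwidth are assumptions, and the paper's Eq. (3) and Eq. (4) may differ in detail.

```python
import math
from typing import Sequence


def sinr(p_serving: float, g_serving: float,
         p_interferers: Sequence[float], g_interferers: Sequence[float],
         bandwidth_hz: float, noise_psd_w_per_hz: float) -> float:
    """Linear SINR on one channel; powers in watts, channel gains linear."""
    signal = p_serving * g_serving
    # Co-channel interference from other base stations using the same channel.
    interference = sum(p * g for p, g in zip(p_interferers, g_interferers))
    noise = bandwidth_hz * noise_psd_w_per_hz
    return signal / (interference + noise)


def throughput_bps(bandwidth_hz: float, sinr_values: Sequence[float]) -> float:
    """Shannon-capacity throughput summed over the channels allocated to a VUE."""
    return sum(bandwidth_hz * math.log2(1.0 + s) for s in sinr_values)


# Example with arbitrary illustrative numbers (not taken from the paper).
example_sinr = sinr(2.0, 0.5, [1.0], [0.25], 100.0, 0.0075)
example_rate = throughput_bps(100.0, [example_sinr])
```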
II-B CARA Problem Formulation
Based on the aforementioned system model, the joint CARA problem can be defined such that each VUE satisfies a minimum QoS baseline while the VUEs cooperatively associate with base stations and utilize wireless resources, i.e.,
Considering the transmit power of a base station toward a VUE, the power-aware cost can be expressed as:
where the cost is scaled by the cost of a unit power level. Overall, the total revenue of each VUE in the HetVNet system can be formulated as:
where the channel capacity is weighted by a positive profit per unit of capacity. Hence, the objective of the joint CARA problem is to optimize the expected total return of Eq. (7) under the constraint of Eq. (5). The expected total return of the revenue of each VUE can be denoted as follows, i.e.,
where the discounting factor of reinforcement learning represents the uncertainty of future revenue. Throughout Eq. (1) to Eq. (8), the VUEs and base stations dynamically change their resource utilization states and actions, which makes the problem highly combinatorial and intractable to optimize. In this regard, an optimization-based multi-agent DRL approach is derived to solve the joint CARA problem of the HetVNet.
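The role of the discounting factor can be illustrated with a short sketch that computes the discounted return of one sampled revenue sequence over a finite horizon. The expected total return in the text is the expectation of this quantity over trajectories. The function name and the finite horizon are assumptions for illustration.

```python
from typing import Sequence


def discounted_return(revenues: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * revenue_t for t = 0, 1, ..., len(revenues) - 1."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for revenue in reversed(revenues):
        total = revenue + gamma * total
    return total


# Three slots of unit revenue with gamma = 0.5 give 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A smaller discounting factor weights near-term revenue more heavily, which matches the interpretation of the factor as uncertainty about future revenue.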
III Multi-agent DRL for Cooperative CARA Problem in HetVNet
Throughout multiple interactions with the HetVNet (i.e., the environment), each VUE accumulates its own experiences, each of which is a tuple of the local observation of the VUE at a time slot, the action of the VUE, and the temporal difference reward of the VUE. The aforementioned traditional single-agent approaches to the joint CARA problem are not capable of learning a cooperative spectrum access policy for the VUEs because of the non-stationary environment. The reward may differ for the same observation and action in the set of experience tuples, because the observation of a VUE only contains local information. That is, the VUE only has local information about the HetVNet, so the states and actions of other VUEs, which affect the VUE’s reward, may differ even for the same local observation and action of the VUE. Thus, to solve the joint CARA problem with multiple VUEs, the policy update of a VUE should take into account the actions of other VUEs, rather than updating the policy only with its own action. Therefore, the multi-agent approach is more suitable for optimizing the policies of VUEs to solve the joint CARA problem in the HetVNet.
III-A Preliminaries of Reinforcement Learning
The reinforcement learning agent learns how to act (i.e., its policy) in a sequential decision-making problem through interactions with its environment. The decision-making problem can be modeled as a Markov decision process (MDP), which is defined by a tuple that includes an observation space and an action space. The observation space is the set of possible observations of the VUEs and the action space is the set of their possible actions. The agent aims to optimize its policy, which is parameterized by a set of parameters. The policy update changes the parameters such that the expected total return of the agent under the policy is improved for a given state. The value of an action for sequential observations (i.e., a state) is measured with the action-value function, or Q-function, which evaluates the expected total return per action. The Q-function can be denoted as follows:
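For reference, the standard textbook form of the action-value function is sketched below. The symbols $s$, $a$, $r_t$, $\gamma$, and $\pi$ are generic reinforcement learning notation assumed here, and the paper's own symbols may differ.

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ a_{0} = a \right]
```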
III-B MADDPG Approach on CARA
Here, the multi-agent deep reinforcement learning strategy is presented to solve the joint CARA problem. In DRL, a deep neural network (DNN) is utilized to build the learning agent. The DNN serves as a non-linear approximator to obtain the optimal policies of the VUEs. Consider the set of all agent policies and the corresponding set of policy parameters. Based on the estimated Q-function of each possible action, the VUEs update their own policies. MADDPG is a policy-gradient-based, off-policy actor-critic algorithm whose objective function is the expected reward. That is, the optimal policy of each VUE is the one that maximizes this objective. To optimize the objective function, its gradient is calculated with respect to the policy parameters as:
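Here, the gradient is evaluated with a centralized action-value function over samples drawn from a replay buffer. The replay buffer contains transition tuples that record the states, actions, rewards, and next states of all VUEs. The centralized action-value function is updated by minimizing the loss function in Eq. (12):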
Here, the target value is computed with the target policies, which use delayed parameters. In addition, MADDPG is an actor-critic algorithm, in which the actor makes sequential decisions over time slots while the critic evaluates the behavior of the actor. Each VUE agent consists of an actor and a critic, each with a behavior network and a target network. The actor updates the behavior policy network by gradient ascent on the objective with Eq. (11) and periodically updates the target policy network. Similarly, the critic updates the behavior Q-function in a way that minimizes the loss function in Eq. (12) and periodically updates the target Q-function. The VUEs use the actor to optimize their own policies so that they behave cooperatively, while they update the critic’s Q-function to reasonably evaluate the actions. To be more specific, the optimization objective of this policy gradient approach is to update the parameters of the target network, which determines how the VUEs actually act. The weights of the actor’s target network are fixed for a number of iterations, while the weights of the actor’s behavior network are updated.
That is, the multiple agents in the HetVNet observe their local information and aim to act in a way that maximizes their expected total return. They can stably update the policy parameters even though each VUE has only local information about the interactions between the other VUEs and the HetVNet. In other words, the environment is stationary even as the policies change. Specifically, when the actions of all VUEs are known, the state transition probability from a state to the next state is the same for any pair of policies, even if the behavior policy and the target policy are mutually different.
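For reference, the policy gradient, the critic loss, and the stationarity property discussed in this subsection take the following form in the original MADDPG formulation by Lowe et al. The notation below ($\mu_i$, $\theta_i$, $o_i$, $x$, $\mathcal{D}$, $N$) follows that work and is an assumption with respect to this paper's own symbols.

```latex
\begin{align}
\nabla_{\theta_i} J(\mu_i) &= \mathbb{E}_{x, a \sim \mathcal{D}}\!\left[ \nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_N) \big|_{a_i = \mu_i(o_i)} \right], \\
\mathcal{L}(\theta_i) &= \mathbb{E}_{x, a, r, x'}\!\left[ \left( Q_i^{\mu}(x, a_1, \ldots, a_N) - y \right)^{2} \right], \quad
y = r_i + \gamma\, Q_i^{\mu'}(x', a_1', \ldots, a_N') \big|_{a_j' = \mu_j'(o_j)}, \\
P(s' \mid s, a_1, \ldots, a_N, \pi_1, \ldots, \pi_N) &= P(s' \mid s, a_1, \ldots, a_N) = P(s' \mid s, a_1, \ldots, a_N, \pi_1', \ldots, \pi_N') \quad \text{for any } \pi_i \neq \pi_i'.
\end{align}
```

The last identity is what makes the centralized critic's learning target stationary: once the actions of all agents are given, the transition no longer depends on which policies produced them.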
III-B1 State Space
The state space of each VUE in the HetVNet is twofold: QoS satisfaction and accumulated DL throughput variation. The QoS state of a VUE is set to 1 if its QoS baseline is satisfied, or to 0 otherwise. In addition, the DL throughput of the current time slot is compared with that of the previous one to decide the throughput state, which is set to 1 if the DL throughput of the current time slot is higher than that of the previous one, or to 0 otherwise. Therefore, the state of each VUE is the pair of these two binary indicators.
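The two-part binary state can be sketched as follows. The function name, the argument `qos_metric` (whatever quantity is compared against the QoS baseline), and the use of a non-strict comparison for QoS satisfaction are assumptions made for illustration.

```python
from typing import Tuple


def vue_state(qos_metric: float, qos_baseline: float,
              throughput_now: float, throughput_prev: float) -> Tuple[int, int]:
    """Return (QoS indicator, throughput-improvement indicator) for one VUE."""
    # Assumed: QoS counts as satisfied when the metric reaches the baseline.
    qos_ok = 1 if qos_metric >= qos_baseline else 0
    # 1 only if throughput strictly increased relative to the previous slot.
    improved = 1 if throughput_now > throughput_prev else 0
    return (qos_ok, improved)


state = vue_state(qos_metric=8.0, qos_baseline=7.0,
                  throughput_now=3.0e6, throughput_prev=2.5e6)
```

With two binary indicators, each VUE has only four possible local states, which keeps the observation compact even though the joint state of all VUEs is large.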
III-B2 Action Space
In every time slot, each VUE decides which actions to take. It first decides which kind of base station to associate with, an MaBS/MiBS or a PBS. In addition, it chooses which channels to utilize for communication. Thus, the action of a VUE consists of its cell association and its channel selection. As the number of PBSs increases, the action space grows exponentially, so it is intractable to solve the joint CARA problem with traditional approaches.
III-B3 Reward Structure
The immediate reward of each VUE is computed based on the interaction between the VUEs and the HetVNet. Then, the reward can be:
Note that the failure penalty of a VUE is taken into account in the calculation of the reward when the VUE fails to associate with a base station or cannot access any wireless spectrum.
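One possible reading of this reward structure is sketched below. It assumes that a served VUE receives its revenue (a profit proportional to throughput minus a cost proportional to transmit power, following Sec. II-B) and that an unserved VUE receives the negative failure penalty. The exact functional form in the paper may differ, and all names here are hypothetical.

```python
def vue_reward(throughput: float, tx_power: float, served: bool,
               profit_per_capacity: float, cost_per_power: float,
               failure_penalty: float) -> float:
    """Immediate reward of one VUE under the assumed structure."""
    if not served:
        # No association or no accessible spectrum: apply the failure penalty.
        return -failure_penalty
    # Revenue: profit from the achieved capacity minus the power-aware cost.
    return profit_per_capacity * throughput - cost_per_power * tx_power


reward_ok = vue_reward(10.0, 2.0, True, 1.0, 0.5, 5.0)
reward_fail = vue_reward(0.0, 0.0, False, 1.0, 0.5, 5.0)
```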
III-C Algorithm Description
The MADDPG-based algorithm for solving the joint CARA problem is presented in this section. The detailed description of the algorithm is as follows:
First, the parameters of the actor and critic networks, which generate and evaluate the actions of the VUEs, are initialized (lines 1–3).
Next, over the training iterations, the following procedures are conducted to update the target network parameters of the VUEs. Given the initial state x, each VUE selects its action based on the exploration noise and its own policy (line 5). After the actions of all VUEs are selected, the actions are executed by the VUEs (line 6). Next, the HetVNet interacts with the VUEs and returns the corresponding rewards and next states (line 7). Then, each VUE observes the state transition tuple and stores it in the replay buffer, which contains the experiences of the VUEs (line 8). Then, the state x is updated to the next state (line 9).
Throughout the episode, each VUE conducts the following procedures to update its actor and critic networks. First, the VUE samples a random minibatch of transitions from the replay buffer (line 11). Note that a superscript denotes the VUE’s approximation of the other VUEs. Then, the target value of the Q-function is set (line 12). By minimizing the difference between the target value and the estimated Q-value over the samples, the behavior critic is updated (line 13). Similarly, the parameters of the behavior actor are updated with the gradient to optimize the policy (line 14). Note that the policy update is based on gradient ascent.
Lastly, after all VUEs update their behavior networks, the target network parameters are updated with a soft update (line 16).
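The bookkeeping steps of the procedure above (storing transitions, sampling a minibatch, forming the target value, and the soft update) can be sketched with the standard library alone. This is not the full algorithm: the neural networks and gradient steps are omitted, and the class name, function names, tuple layout, and the mixing coefficient `tau` are assumptions for illustration.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# Assumed layout of one joint transition: (x, actions, rewards, next_x).
Transition = Tuple[tuple, tuple, tuple, tuple]


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity: int) -> None:
        self._data: Deque[Transition] = deque(maxlen=capacity)

    def store(self, transition: Transition) -> None:
        self._data.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement, capped at the buffer size.
        return random.sample(list(self._data), min(batch_size, len(self._data)))

    def __len__(self) -> int:
        return len(self._data)


def td_target(reward: float, gamma: float, next_q: float) -> float:
    """Target value: reward plus the discounted target-critic estimate."""
    return reward + gamma * next_q


def soft_update(target: List[float], behavior: List[float], tau: float) -> List[float]:
    """Move each target parameter a fraction tau toward the behavior parameter."""
    return [tau * b + (1.0 - tau) * t for t, b in zip(target, behavior)]


buffer = ReplayBuffer(capacity=2)
for step in range(3):
    buffer.store(((step,), (0,), (1.0,), (step + 1,)))
batch = buffer.sample(batch_size=5)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

A small `tau` makes the target networks trail the behavior networks slowly, which is the usual motivation for soft updates in DDPG-style methods.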
IV Performance Evaluation
In this section, we describe the performance evaluation setting of the MADDPG algorithm for the joint CARA problem. We consider 1 MaBS, 10 MiBSs, 50 PBSs, and 100 VUEs in the HetVNet, as in Fig. 1. Regarding the cell coverage region of each base station, the radius of the MaBS is set to 3000 m, while those of the MiBS and PBS are set to 500 m and 100 m, respectively. The transmit powers of the MaBS, MiBS, and PBS are set to 40 dBm, 35 dBm, and 20 dBm, respectively. The is set to 30, while the is set to 5. The channel bandwidth of the MaBS/MiBS is set to 180 kHz and the DL center frequency is 2 GHz. Meanwhile, the channel bandwidth of the PBS is set to 800 MHz and the DL center frequency is 28 GHz. The path loss of MaBS and MiBS is set to and the PBS’s one is set to . All the failure cost of is set to and the base line of QoS is set to 7 dBm. The noise power is set to -175 dBm/Hz and the is set toI.
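The decibel quantities in this setting can be converted to linear units with a short sketch. The function names are assumptions. The conversions themselves are the standard dBm definitions.

```python
import math


def dbm_to_watt(p_dbm: float) -> float:
    """Convert a power in dBm to watts (0 dBm = 1 mW)."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)


def noise_power_dbm(noise_psd_dbm_per_hz: float, bandwidth_hz: float) -> float:
    """Total noise power in dBm over a channel of the given bandwidth."""
    return noise_psd_dbm_per_hz + 10.0 * math.log10(bandwidth_hz)


mabs_power_w = dbm_to_watt(40.0)
sub6_noise_dbm = noise_power_dbm(-175.0, 180e3)
```

Under these definitions, the 40 dBm MaBS transmit power corresponds to 10 W, and the noise power over one 180 kHz channel is roughly -122.4 dBm.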
First, the learning curves of MADDPG for the joint CARA problem are shown in Fig. 2. The convergence points of the learning models differ slightly from each other. The figure shows that the number of episodes required to reach converged performance decreases as the learning rate becomes smaller.
Next, the performance of the MADDPG strategy is compared with other policy gradient (PG) algorithms, i.e., the vanilla actor–critic and DDPG approaches. Note that the vanilla actor–critic is a baseline among PG algorithms. As shown in Fig. 3, the vanilla actor–critic approach almost fails to solve the joint CARA problem: each VUE greedily accesses the wireless spectrum and suffers from collisions, whereas the DDPG and MADDPG strategies show much higher performance. Moreover, because DDPG suffers from the non-stationarity problem, the total reward of VUEs trained by MADDPG is higher than that of DDPG.
Finally, the average DL throughput of the VUEs in the HetVNet is shown in Fig. 4. Considering Fig. 3 and Fig. 4, the MADDPG strategy learned policies with which the VUEs cooperatively associate with base stations and utilize the wireless spectrum (high total reward of VUEs in Fig. 3 and high DL throughput in Fig. 4). Although the DDPG-based solution shows somewhat lower performance than MADDPG, it still learns cooperative policies, as shown in Fig. 3. However, the vanilla actor–critic approach shows the lowest total reward of VUEs in Fig. 3 and the lowest DL throughput in Fig. 4, which indicates that it learned selfish association and resource utilization policies under the non-stationary environment setting. In conclusion, the MADDPG strategy successfully learned cooperative policies to solve the joint CARA problem in the considered HetVNet.
V Conclusion
In this paper, we proposed a multi-agent DRL approach to solve the joint CARA problem in a HetVNet. Because of the non-stationarity problem and the NP-hard property, traditional approaches, including single-agent RL methods, are limited in solving the problem. In contrast, the proposed MADDPG strategy reached a near-optimal solution within a small number of iterations and achieved better DL throughput performance than other reinforcement learning methods.
This research was supported by the National Research Foundation of Korea (2019R1A2C4070663) and by an IITP grant funded by MSIT (No. 2018-0-00170, Virtual Presence in Moving Objects through 5G). J. Kim is the corresponding author of this paper.
- [1] G. A. Akpakwu, B. J. Silva, G. P. Hancke, and A. M. Abu-Mahfouz, “A survey on 5G networks for the Internet of Things: Communication technologies and challenges,” IEEE Access, 6:3619–3647, Dec. 2017.
- [2] C. Campolo, A. Molinaro, A. Iera, and F. Menichella, “5G network slicing for vehicle-to-everything services,” IEEE Wireless Communications, 24(6):38–45, Dec. 2017.
- [3] J. Kim, Y. Tian, S. Mangold, and A. F. Molisch, “Joint scalable coding and routing for 60 GHz real-time live HD video streaming applications,” IEEE Trans. on Broadcast., 59(3):500–512, Sept. 2013.
- [4] J. Kim and A. F. Molisch, “Fast millimeter-wave beam training with receive beamforming,” Journal of Communications and Networks, 16(5):512–522, Oct. 2014.
- [5] J. Kim, S.-C. Kwon, and G. Choi, “Performance of video streaming in infrastructure-to-vehicle telematic platforms with 60-GHz radiation and IEEE 802.11ad baseband,” IEEE Trans. on Vehicular Technology, 65(12):10111–10115, Dec. 2016.
- [6] J. Kim, L. Xian, and A. S. Sadri, “Numerical simulation study for frequency sharing between micro-cellular systems and fixed service systems in millimeter-wave bands,” IEEE Access, 4:9847–9859, 2016.
- [7] J. Kim and W. Lee, “Feasibility study of 60 GHz millimeter-wave technologies for hyperconnected fog computing applications,” IEEE Internet of Things J., 4(5):1165–1173, Oct. 2017.
- [8] K. Zheng, Q. Zheng, P. Chatzimisios, W. Xiang, and Y. Zhou, “Heterogeneous vehicular networking: A survey on architecture, challenges, and solutions,” IEEE Communications Surveys and Tutorials, 17(4):2377–2396, Jun. 2015.
- [9] T. S. Rappaport, G. R. MacCartney, M. K. Samimi, and S. Sun, “Wideband millimeter-wave propagation measurements and channel models for future wireless communication system design,” IEEE Trans. on Communications, 63(9):3029–3056, May 2015.
- [10] S. A. Busari, K. M. S. Huq, G. Felfel, and J. Rodriguez, “Adaptive resource allocation for energy-efficient millimeter-wave massive MIMO networks,” in Proc. IEEE GLOBECOM, 2018.
- [11] Z. Shi, Y. Wang, L. Huang, and T. Wang, “Dynamic resource allocation in mmWave unified access and backhaul network,” in Proc. IEEE PIMRC, pp. 2260–2264, 2015.
- [12] Y. Liu, L. Lu, G. Y. Li, Q. Cui, and W. Han, “Joint user association and spectrum allocation for small cell networks with wireless backhauls,” IEEE Wireless Communications Letters, 5(5):496–499, Jul. 2016.
- [13] N. Wang, E. Hossain, and V. K. Bhargava, “Joint downlink cell association and bandwidth allocation for wireless backhauling in two-tier HetNets with large-scale antenna arrays,” IEEE Trans. on Wireless Communications, 15(5):3251–3268, Jan. 2016.
- [14] Q. Kuang, W. Utschick, and A. Dotzler, “Optimal joint user association and resource allocation in heterogeneous networks via sparsity pursuit,” arXiv preprint arXiv:1408.5091, 2014.
- [15] Y. Lin, W. Bao, W. Yu, and B. Liang, “Optimizing user association and spectrum allocation in HetNets: A utility perspective,” IEEE Journal on Selected Areas in Communications, 33(6):1025–1039, Mar. 2015.
- [16] Y. Chen, J. Li, W. Chen, Z. Lin, and B. Vucetic, “Joint user association and resource allocation in the downlink of heterogeneous networks,” IEEE Trans. on Vehicular Technology, 65(7):5701–5706, Jul. 2015.
- [17] J. Ortín, J. R. Gállego, and M. Canales, “Joint cell selection and resource allocation games with backhaul constraints,” Pervasive and Mobile Computing, 35:125–145, 2017.
- [18] T. LeAnh, N. H. Tran, W. Saad, L. B. Le, D. Niyato, T. M. Ho, and C. S. Hong, “Matching theory for distributed user association and resource allocation in cognitive femtocell networks,” IEEE Trans. on Vehicular Technology, 66(9):8413–8428, Mar. 2017.
- [19] W. Bao and B. Liang, “Structured spectrum allocation and user association in heterogeneous cellular networks,” in Proc. IEEE INFOCOM, pp. 1069–1077, 2014.
- [20] Y. He, Z. Zhang, F. R. Yu, N. Zhao, H. Yin, V. C. Leung, and Y. Zhang, “Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks,” IEEE Trans. on Vehicular Technology, 66(11):10433–10445, Sept. 2017.
- [21] N. Rupasinghe and İ. Güvenç, “Reinforcement learning for licensed-assisted access of LTE in the unlicensed spectrum,” in Proc. IEEE WCNC, pp. 1279–1284, 2015.
- [22] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “A brief survey on deep reinforcement learning,” arXiv preprint arXiv:1708.05866, 2017.
- [23] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, “A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs,” in Proc. IEEE ICC, 2017.
- [24] M. Elsayed and M. Erol-Kantarci, “Deep reinforcement learning for reducing latency in mission critical services,” in Proc. IEEE GLOBECOM, 2018.
- [25] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 9(8):1735–1780, 1997.
- [26] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, 8(3):279–292, 1992.
- [27] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proc. NIPS, pp. 6379–6390, 2017.