Artificial intelligence (AI) has developed rapidly over the past decade and has been applied in various contexts, including many mission-critical and real-time communication services, such as the Internet of Vehicles (IoV). This capability is largely owing to edge and endpoint AI (e.g., in smartphones and vehicles), which aims to reduce communication latency and preserve data privacy by moving AI closer to users. However, due to the limited capabilities of edges and endpoints (e.g., sensing, computing, and storage), it is still challenging for a single computing device to perform complicated tasks, such as autonomous driving, alone. One promising approach to addressing this challenge is cooperative intelligence (CI), which integrates the capabilities of multiple devices to achieve joint goals and shared intentions. Specifically, CI is expected to facilitate next-generation networks by establishing collaboration among various communication-related intelligent equipment. In recent years, CI has been applied in various communication scenarios, such as adaptive routing, mobility management for communication devices, network control, and resource allocation.
Although CI provides several benefits in communication, various challenges emerge when it is deployed in the real world. The first challenge is to achieve coordination among intelligent devices. Notably, cooperative learning, such as cooperative multi-agent reinforcement learning (MARL), has been exploited to address this challenge. Next, privacy is a prominent concern because existing cooperative learning schemes rely on data and information sharing. For example, in aerial base station control, drones need to move cooperatively to achieve given goals (e.g., offering emergency communication). Due to privacy concerns, drones may be unwilling to share their data/information (e.g., in-drone videos and movement trajectories) directly for cooperation. Another example is collaborative network management with edge controllers, in which edge information (e.g., customer locations, requirements, and feedback) is private. As a result, privacy issues become a major obstacle to the deployment of CI. Lastly, cooperative learning needs to be efficient with low overheads despite the practical constraints in communication (e.g., limited bandwidth and stringent latency requirements).
Several approaches exploit cooperative learning with (partial) privacy protection. However, they either lack efficiency by introducing high overheads or provide inadequate privacy protection (e.g., being vulnerable to inference attacks). In this paper, to make CI deployable in practical communications, we propose PP-MARL, a novel cooperative learning scheme that
enables CI with privacy protection, in which the privacy-sensitive data (e.g., agents’ observations and actions) is stored locally by introducing hierarchical critics and memories to MARL;
ensures the effectiveness of agents’ collaboration through centralized learning by sharing less information than existing approaches;
reduces overheads in communication and computation compared with existing schemes.
Moreover, we apply and evaluate PP-MARL in two representative use cases: mobility management in drone-assisted communication and network control with edge intelligence. Simulation results show that PP-MARL can achieve better collaboration than existing schemes with lower overheads and better privacy protection.
II MARL for CI in Communications
In this section, we first introduce MARL for agents’ coordination in CI. Next, we analyze the challenges of applying MARL in communications through use case studies. Finally, we introduce existing privacy-preserving schemes for MARL and analyze their applicability to CI in communications.
II-A MARL for CI
CI is an integral function in multi-agent and distributed systems, where multiple intelligent agents seek collaboration to improve their welfare jointly. An intelligent agent, a concept that originated in AI, is a computational unit that can complete a task or a part of a joint task autonomously, flexibly, and interactively. As illustrated in Fig. 1, many kinds of entities can be regarded as intelligent agents in communication scenarios, including local, edge, and cloud agents.
One prominent way of coordinating agents to achieve CI in the long term is cooperative MARL, which is designed to find cooperative policies of agents by jointly evaluating their performance. The cooperative policies have advantages over independent strategies, especially when the agents can obtain only partial information (i.e., partial observations). MARL with partial observation can be formulated as a decentralized partially observable Markov decision process (DEC-POMDP). A DEC-POMDP is defined by a five-tuple comprising states, observations, actions, a transition function, and rewards. In a DEC-POMDP, each agent observes the environment and obtains an observation, which is partial information about the state of the environment. Next, agents choose actions based on their observations to maximize their expected cumulative joint rewards. The next state is produced according to the transition function, which specifies a probability distribution over states. The data stored by each agent in this process are called experiences and comprise observations, actions, rewards, and successive observations.
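To make the notion of an experience concrete, the following minimal Python sketch runs one step of a toy two-agent process and records one experience per agent. The `Experience` container, the `ToyEnv` environment, its reward rule, and the policies are all illustrative assumptions and are not taken from the schemes discussed here.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Observation = Tuple[float, ...]


@dataclass(frozen=True)
class Experience:
    """One stored transition of a single agent (hypothetical container)."""
    observation: Observation
    action: int
    reward: float
    next_observation: Observation


class ToyEnv:
    """Hypothetical environment: the state is an integer in {0, 1, 2}, and
    agent i only sees whether the state exceeds its own index i."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.state = 0

    def observe(self, agent_id: int) -> Observation:
        # Partial observation: a single bit derived from the hidden state.
        return (1.0 if self.state > agent_id else 0.0,)

    def step(self, joint_action: List[int]) -> float:
        # Joint reward: 1 if all agents chose the same action, else 0.
        reward = 1.0 if len(set(joint_action)) == 1 else 0.0
        # Stochastic transition to the next state.
        self.state = self.rng.randint(0, 2)
        return reward


def collect_step(env: ToyEnv, policies: List[Callable[[Observation], int]]) -> List[Experience]:
    """Run one step and return one experience per agent."""
    obs = [env.observe(i) for i in range(len(policies))]
    actions = [policy(o) for policy, o in zip(policies, obs)]
    reward = env.step(actions)
    next_obs = [env.observe(i) for i in range(len(policies))]
    return [Experience(o, a, reward, n) for o, a, n in zip(obs, actions, next_obs)]


policies = [lambda o: int(o[0]), lambda o: int(o[0])]
experiences = collect_step(ToyEnv(seed=0), policies)
```

Each agent ends up with its own tuple of observation, action, shared reward, and successive observation, which is exactly the data whose sharing raises the privacy question discussed in this paper.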
In MARL, three cooperation modes have been proposed to incentivize agents’ cooperation by sharing information, negotiating, and learning cooperatively, as depicted in Fig. 2. Firstly, the centralized mode, used by schemes such as MADDPG, relies on a central node to coordinate the training and even the execution of agents. The central node should belong to a trusted authority that makes and enforces decisions on behalf of a swarm of agents; otherwise, agents will not trust the central node, and their cooperation will be obstructed. Therefore, the centralized mode suffers from a single point of failure and privacy issues (i.e., sharing data with a central node). Secondly, in the decentralized mode, used by schemes such as DDPG, agents are independent learners and workers that exchange no explicit information with each other. Strictly speaking, this mode is non-collaborative, and its non-stationary characteristics in partially observed environments may prevent the learning from converging. Lastly, the networked mode, used by schemes such as FedQ, is a decentralized mode with networked agents, in which agents are connected via a communication network so that their local information can spread across the network. However, this mode scales poorly with an increasing number of agents because its communication overheads increase dramatically. In addition, information sharing among agents poses privacy issues.
II-B Use Case Studies
To understand the challenges of applying MARL-assisted CI in communication, we study two typical multi-agent use cases as illustrated in Fig. 1.
II-B1 Mobility management in drone-assisted communication
In this use case, drones, working as aerial communication infrastructure, provide network services to targets (e.g., vehicles and users). Each drone has limited coverage and cannot serve all targets anytime and anywhere. Therefore, multiple drones need to cooperate to cover targets while avoiding collisions. Drones, also working as agents, can observe the positions of targets and other drones in range and then cooperatively decide their movements to maximize the joint rewards (e.g., the number of covered targets). Given that each drone has only a partial observation of the environment, it is difficult to make a wise movement without cooperation. However, data sharing to support cooperation brings a privacy risk (e.g., revealing the locations of vehicles) and introduces additional communication overheads (e.g., in terms of bandwidth and energy consumption).
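The joint reward mentioned above (the number of covered targets) and the collision constraint can be sketched as follows. This is only an illustration under simplifying assumptions: targets are treated as points rather than areas, and the function names, coordinates, coverage radius, and minimum separation are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def covered_targets(drones: List[Point], targets: List[Point], coverage_radius: float) -> int:
    """Count targets lying within the coverage radius of at least one drone."""
    count = 0
    for tx, ty in targets:
        if any(math.hypot(tx - dx, ty - dy) <= coverage_radius for dx, dy in drones):
            count += 1
    return count


def has_collision(drones: List[Point], min_separation: float) -> bool:
    """Return True if any two drones are closer than the minimum separation."""
    for i in range(len(drones)):
        for j in range(i + 1, len(drones)):
            (x1, y1), (x2, y2) = drones[i], drones[j]
            if math.hypot(x1 - x2, y1 - y2) < min_separation:
                return True
    return False


# Example with made-up coordinates in metres.
drones = [(0.0, 0.0), (100.0, 0.0)]
targets = [(30.0, 0.0), (100.0, 40.0), (300.0, 300.0)]
joint_reward = covered_targets(drones, targets, coverage_radius=50.0)
collision = has_collision(drones, min_separation=10.0)
```

Note that computing this reward centrally requires every drone and target position, which is the kind of data sharing that creates the privacy risk described above.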
II-B2 Network control with edge intelligence
This use case is CI for network control in hierarchical software-defined IoV (HSD-IoV). HSD-IoV consists of edge controllers located in base stations (BSs) and a core controller in the cloud. The edge controllers are close to vehicles and offer low-latency network control, while the core controller can coordinate the learning of edge controllers in the centralized mode. The problem is how to assign edge controllers to vehicles dynamically and cooperatively to minimize the latency of responses. The edge controllers, working as agents, observe the demands and locations of vehicles in their observation range. Then, based on their observations, they make local decisions on the edge controller assignment policy (i.e., finding a good matching between edge controllers and vehicles); these decisions can be viewed as actions. Due to agents’ limited observations, they need to assign edge controllers to vehicles cooperatively to minimize the overall latency. The detailed formulations can be found in our previous work. Similarly, data/information sharing for cooperation also brings a risk of privacy leakage, such as of vehicles’ locations and demands. In addition to bandwidth overheads, agent cooperation during action-making (i.e., execution) results in delays that cannot be ignored, as network control is a delay-sensitive task.
Challenges. According to these use case studies, CI has the potential to improve communications (e.g., in mobility management and network control). However, it also brings new challenges in privacy protection and overheads. Given the practical constraints (e.g., privacy concerns and limited bandwidth), the CI used for communication problems needs to be both efficient and privacy-protected, which includes 1) privacy protection over agents’ experiences, 2) low overheads for cooperative learning and execution (e.g., bandwidth, energy, latency, and computation), and 3) ensuring the gains of collaboration.
II-C Existing Privacy-Preserving Schemes for MARL
According to the use case analysis in §II-B, we find that agent cooperation encounters a practical challenge of privacy protection: most existing MARL schemes share experience directly for learning (e.g., MADDPG), leading to potential leaks of agents’ data. Recently, some state-of-the-art MARL schemes have been proposed to offer (partial) privacy protection.
II-C1 Independent learner
This straightforward scheme has been used in proposals like DDPG, where all agents execute and train independently. Strictly speaking, this scheme does not involve any data sharing and is non-collaborative. The lack of collaboration poses a convergence problem in agents’ learning because the partially observed environment without data sharing is non-stationary.
II-C2 Federated learning (FL)
FL can protect the privacy of local data by allowing multiple agents to train on their respective local data and build a shared model by sharing model parameters or gradients. Solutions integrating FL into MARL have also been proposed, such as FedQ, which builds global models in a federated manner using the networked cooperation mode. However, FedQ incurs higher communication overheads and delays in agents’ execution because its training and execution rely on information sent from neighbouring agents. Therefore, the performance of FedQ is not acceptable in time-sensitive scenarios.
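As a rough illustration of building a shared model from model parameters rather than raw data, the sketch below averages the parameters of several agents element-wise. It shows plain unweighted parameter averaging only; it is not FedQ's aggregation rule, and the parameter names and values are made up.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def average_parameters(local_params: List[Params]) -> Params:
    """Element-wise unweighted mean of the agents' parameter vectors."""
    if not local_params:
        raise ValueError("need parameters from at least one agent")
    n = len(local_params)
    averaged: Params = {}
    for name in local_params[0]:
        # Group the k-th entry of this parameter across all agents.
        columns = zip(*(params[name] for params in local_params))
        averaged[name] = [sum(column) / n for column in columns]
    return averaged


# Each agent trains locally and only uploads its parameters.
agent_a = {"w": [1.0, 2.0], "b": [0.0]}
agent_b = {"w": [3.0, 4.0], "b": [1.0]}
shared_model = average_parameters([agent_a, agent_b])
```

Even in this simple form, every aggregation round requires each agent to transmit a full parameter set, which hints at why communication overheads grow with the number of agents.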
II-C3 Data encryption
This approach converts raw data into protected data, for example through homomorphic encryption (HE) and differential privacy (DP). HE has been introduced for data encryption in single-agent reinforcement learning. However, efficiency is the main challenge in deploying HE for MARL because HE incurs significant computational overheads for encrypting the shared data. The overhead is especially high for traditional MARL schemes due to their high volume of shared data (e.g., all experiences in MADDPG and observations in FACMAC). DP is another technique, which protects data by injecting noise. DP has lower overheads than HE, but it brings a trade-off between cooperation performance and privacy protection.
III Privacy-preserving MARL for Communication
As mentioned above, the existing schemes cannot address all the challenges mentioned in §II-B simultaneously as they either lack collaboration, have inadequate privacy protection or introduce high overheads. In this section, we propose PP-MARL to offer efficient privacy protection for CI given the practical challenges in communications.
III-A Overview of PP-MARL
In a nutshell, PP-MARL is designed to allow decentralized execution with localized actors and centralized training with hierarchical critics and memories, as presented in Fig. 3. Decentralized execution offers low latency, which meets the requirements of latency-sensitive applications in communication; centralized training, with a central node coordinating the training, guarantees convergence; hierarchical critics and memories support efficient privacy protection.
The execution is an interactive process between agents and the environment, during which agents obtain and store data (i.e., experiences) for learning. The execution is designed to be decentralized in PP-MARL, enabling low-latency actions by removing dependencies on other agents. Firstly, agents independently (and possibly partially) observe the environment and obtain local observations (e.g., the positions of vehicles within the observation range). Next, the actor in each agent makes an action decision given its local observation. All the actions are aggregated into a joint action, which then acts on the environment (e.g., drones move to target places and offer services to vehicles). Then, the environment gives feedback (i.e., rewards) to the agents given their actions and the environment state, and the environment transitions to a new state. Lastly, agents’ experiences are hierarchically stored for privacy protection. The global memory stores public data (i.e., memory identifiers), and each agent stores its private experiences, including local observations, individual actions, rewards, and successive observations, in its local memory.
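The hierarchical memory described above can be sketched as follows: the central node keeps only memory identifiers, while each agent keeps its own experiences keyed by those identifiers. The class and method names are hypothetical, and the sketch assumes that all agents store one experience per identifier.

```python
import random
from typing import Dict, List, Tuple


class LocalMemory:
    """Private store kept by one agent; its contents never leave the agent."""

    def __init__(self) -> None:
        self._store: Dict[int, Tuple] = {}

    def add(self, identifier: int, observation, action, reward, next_observation) -> None:
        self._store[identifier] = (observation, action, reward, next_observation)

    def get(self, identifiers: List[int]) -> List[Tuple]:
        # Look up private experiences for the identifiers sampled centrally.
        return [self._store[i] for i in identifiers]


class GlobalMemory:
    """Public store on the central node; it holds identifiers only."""

    def __init__(self, seed: int = 0) -> None:
        self._identifiers: List[int] = []
        self._rng = random.Random(seed)

    def new_identifier(self) -> int:
        identifier = len(self._identifiers)
        self._identifiers.append(identifier)
        return identifier

    def sample(self, batch_size: int) -> List[int]:
        # Sample distinct identifiers; agents resolve them locally.
        return self._rng.sample(self._identifiers, batch_size)


global_memory = GlobalMemory(seed=0)
local_memories = [LocalMemory(), LocalMemory()]
for step in range(5):
    identifier = global_memory.new_identifier()
    for agent_id, memory in enumerate(local_memories):
        # Dummy private data: (step, agent_id) stands in for an observation.
        memory.add(identifier, (step, agent_id), agent_id, 1.0, (step + 1, agent_id))

batch_ids = global_memory.sample(3)
local_batches = [memory.get(batch_ids) for memory in local_memories]
```

Because every agent resolves the same identifiers, the local batches stay aligned across agents without any observation, action, or reward being sent to the central node.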
The training in PP-MARL is designed to be centralized by introducing hierarchical critics to estimate joint action-values. A hierarchical critic consists of the local critics of the involved agents and a global critic for local value integration. The local critic in each agent evaluates local performance based on the local observations and actions, and then gives local critic values (abbreviated as q values) to the global critic. Strictly speaking, a local critic is a utility function rather than a value function because it cannot estimate expected rewards by itself; therefore, q values cannot be used for updating the actors directly. The global critic is located in the central node and designed as a deep neural network. It combines the q values of all involved agents to evaluate the joint actions based on the joint observations and gives global critic values (abbreviated as Q values). As agents may have different reward functions, multiple global critics can be introduced in the central node for different agents or teams.
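The data flow of a hierarchical critic can be sketched with toy models: each local critic maps a private observation and action to a single q value, and the global critic only ever sees those q values. The linear local critics, the one-hidden-layer global critic, and all weights below are illustrative assumptions, not the networks used in PP-MARL.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class LocalCritic:
    """Toy local critic: a linear utility over the agent's own observation and action."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.weights = list(weights)
        self.bias = bias

    def q_value(self, observation: Sequence[float], action: Sequence[float]) -> float:
        features = list(observation) + list(action)
        return dot(self.weights, features) + self.bias


class GlobalCritic:
    """Toy global critic: one ReLU hidden layer over the agents' q values."""

    def __init__(self, hidden_weights: List[List[float]], hidden_bias: List[float],
                 output_weights: List[float], output_bias: float = 0.0) -> None:
        self.hidden_weights = hidden_weights
        self.hidden_bias = hidden_bias
        self.output_weights = output_weights
        self.output_bias = output_bias

    def joint_value(self, q_values: Sequence[float]) -> float:
        hidden = [max(0.0, dot(row, q_values) + b)
                  for row, b in zip(self.hidden_weights, self.hidden_bias)]
        return dot(self.output_weights, hidden) + self.output_bias


# Agent side: private observations and actions stay local.
local_critics = [LocalCritic([1.0, 0.5]), LocalCritic([0.5, -1.0])]
private_data = [((2.0,), (1.0,)), ((4.0,), (1.0,))]
q_values = [critic.q_value(obs, act) for critic, (obs, act) in zip(local_critics, private_data)]

# Central node side: only the q values are received.
global_critic = GlobalCritic([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0], [1.0, 2.0])
joint_q = global_critic.joint_value(q_values)
```

In this toy setting each agent uploads one scalar per sample instead of its full observation and action, which is the source of the reduced sharing that the text attributes to hierarchical critics.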
The hierarchical critic differs from a decentralized critic (e.g., DDPG) and a centralized critic (e.g., MADDPG). A decentralized critic estimates the local action-values based only on local observations and actions for each agent. On the other hand, centralized critics can use the information of all agents to estimate the joint action-values based on the joint observations (e.g., MADDPG) or environment states (e.g., FACMAC). However, the centralized critics introduce high communication overheads and privacy issues. For example, the critics in MADDPG are trained with experiences that include observations, actions, and rewards of all agents. To avoid high overheads and privacy leakage in training, PP-MARL introduces hierarchical critics so that only q values are shared to estimate the joint action-values. The centralized training process for actors and hierarchical critics in PP-MARL is described in detail below.
III-A1 Hierarchical critic updates
A hierarchical critic can be viewed as a tree-like neural network whose parameters are updated to minimize the loss function over Q values. Firstly, the central node provides sampled identifiers to all involved agents for memory extraction. Secondly, local critics give the corresponding q values to the global critic. Next, the central node calculates the estimated rewards of each agent as the difference between the Q values and the discounted target critic values (i.e., the estimated future accumulated rewards) of the next step. The estimated rewards are then sent back to the agents. After receiving the estimated rewards, each agent calculates the loss, defined as the mean squared error between the estimated rewards and the locally stored rewards (i.e., the ground truth), and sends the loss back to the central node for updating the hierarchical critic. Finally, the hierarchical critic is updated by backpropagation.
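The message exchange in this update can be sketched numerically. The sketch assumes that the estimated reward for a sample is the Q value minus the discounted target critic value of the next step, which makes the resulting loss the usual squared temporal-difference error; the function names, the discount factor, and the numbers are illustrative.

```python
from typing import List


def estimated_rewards(q_now: List[float], target_q_next: List[float], gamma: float) -> List[float]:
    """Central node: estimated reward = Q(t) - gamma * target Q(t+1), per sample."""
    return [q - gamma * q_next for q, q_next in zip(q_now, target_q_next)]


def agent_loss(estimated: List[float], stored_rewards: List[float]) -> float:
    """Agent: mean squared error against the locally stored (private) rewards."""
    squared_errors = [(e - r) ** 2 for e, r in zip(estimated, stored_rewards)]
    return sum(squared_errors) / len(squared_errors)


# Central node side: works on Q values only.
q_now = [1.0, 2.0]
target_q_next = [1.0, 2.0]
r_hat = estimated_rewards(q_now, target_q_next, gamma=0.5)

# Agent side: the true rewards never leave the agent; only the loss is returned.
stored_rewards = [0.5, 0.0]
loss = agent_loss(r_hat, stored_rewards)
```

Since (Q - gamma * Q' - r)^2 equals (r + gamma * Q' - Q)^2, the agent-side loss coincides with the standard critic loss even though the rewards are never shared with the central node.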
III-A2 Actor-network updates
The parameters of the actor-networks are updated using the sampled policy gradient. Policy gradient methods optimize the parameterized actors to maximize the Q values by gradient ascent (equivalently, gradient descent on the negated Q values). The Q value is obtained by merging the q values with the global critic network. The local critic of the agent being updated is fed with actions produced by its actor-network and local observations from its local memory. The other local critics are fed with local observations and actions from their local memories.
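A toy version of this actor update is sketched below. The agent being updated feeds its own local critic with the action produced by its actor, while the other agents' q values are constants computed from their stored data. The one-parameter actor, the quadratic local critic, the additive global critic, and the step size are all assumptions chosen so that the gradient can be written by hand.

```python
from typing import List


def actor(theta: float, observation: float) -> float:
    """Hypothetical one-parameter deterministic actor."""
    return theta * observation


def local_critic(observation: float, action: float) -> float:
    """Hypothetical local utility, highest when action == 2 * observation."""
    return -(action - 2.0 * observation) ** 2


def global_critic(q_values: List[float]) -> float:
    """Hypothetical additive global critic."""
    return sum(q_values)


def joint_value(theta: float, own_obs: float, other_q_values: List[float]) -> float:
    # Own q value uses the actor's current action; others come from memory.
    own_q = local_critic(own_obs, actor(theta, own_obs))
    return global_critic([own_q] + list(other_q_values))


def actor_gradient(theta: float, own_obs: float) -> float:
    """Chain rule dQ/dtheta = dQ/dq * dq/da * da/dtheta for the toy models above."""
    action = actor(theta, own_obs)
    dQ_dq = 1.0
    dq_da = -2.0 * (action - 2.0 * own_obs)
    da_dtheta = own_obs
    return dQ_dq * dq_da * da_dtheta


theta, own_obs, other_q_values = 0.0, 1.0, [0.3]
initial_value = joint_value(theta, own_obs, other_q_values)
for _ in range(200):
    # Gradient ascent on the joint value with respect to the actor parameter.
    theta += 0.05 * actor_gradient(theta, own_obs)
final_value = joint_value(theta, own_obs, other_q_values)
```

The other agents' q values do not depend on the updated actor's parameter, so they contribute to the joint value but not to this agent's gradient.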
III-B Privacy Protection in PP-MARL
Through PP-MARL architectural design, we are able to reduce the amount of information for sharing and transform agents’ experiences to q values for privacy protection. Furthermore, to avoid inferring private data through the q values, we introduce HE and DP to further improve privacy protection, as depicted in Fig. 3. PP-MARL has an HE-friendly architecture because the amount of shared data (i.e., q values) to be encrypted is much less than existing schemes. In contrast, if HE is used in MADDPG, it needs to encrypt over observations, actions, and rewards, and FACMAC needs to encrypt over q values and states.
Deploying HE in neural networks is challenging because HE schemes support only homomorphic arithmetic operations, such as homomorphic addition and multiplication. However, most of the popular and standard activation functions in neural networks, such as rectified linear units (ReLU) and sigmoid, are non-arithmetic. To address this issue, we introduce interactive HE, in which non-arithmetic operations (i.e., activation functions) are executed locally by agents or a trusted device holding the secret key (see the green dashed lines in Fig. 3). Interactive HE operations include encrypting q values, decrypting the outputs of middle layers for activation, encrypting the data after activation, and decrypting predicted rewards to calculate losses. Clearly, this process brings additional overheads (e.g., bandwidth and energy) for data encryption, decryption, and transmission. One way to mitigate this is to replace the non-arithmetic activation function with a low-order polynomial or a linear activation function. With such a mitigation, the overheads in PP-MARL are much lower because there is no need to decrypt, encrypt, and transfer the outputs of hidden layers (i.e., the processes marked with green dashed lines). However, there is a trade-off between cooperative learning performance and overheads. Besides HE, DP is another potential technique for privacy protection. By adding noise to the q values, it can also improve privacy protection for PP-MARL. However, DP methods often have to trade off the degree of privacy protection for user data against the model’s accuracy, as also confirmed by our evaluation results (see Table I). The detailed comparison is discussed in §IV.
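The restriction to additions and multiplications can be illustrated with a mock ciphertext type. The `MockCiphertext` class below performs no encryption at all; it is a hypothetical stand-in that allows only `+` and `*`, which is enough to show why a polynomial activation (here a square, as used in CryptoNets) can be evaluated on protected values while ReLU cannot.

```python
from typing import List, Sequence, Union

Number = Union[float, "MockCiphertext"]


def _plain(value: Number) -> float:
    return value._value if isinstance(value, MockCiphertext) else value


class MockCiphertext:
    """Stand-in for an HE ciphertext: supports + and * only (no real encryption)."""

    def __init__(self, value: float) -> None:
        self._value = value

    def __add__(self, other: Number) -> "MockCiphertext":
        return MockCiphertext(self._value + _plain(other))

    __radd__ = __add__

    def __mul__(self, other: Number) -> "MockCiphertext":
        return MockCiphertext(self._value * _plain(other))

    __rmul__ = __mul__

    def decrypt(self) -> float:
        # In a real scheme this would require the secret key.
        return self._value


def dense(inputs: Sequence[Number], weights: Sequence[float], bias: float) -> Number:
    total: Number = bias
    for w, x in zip(weights, inputs):
        total = total + w * x
    return total


def square_layer(inputs, weights, bias):
    pre = dense(inputs, weights, bias)
    return pre * pre  # polynomial activation: only a multiplication


def relu_layer(inputs, weights, bias):
    pre = dense(inputs, weights, bias)
    return max(0.0, pre)  # needs a comparison, which ciphertexts do not offer


encrypted_q: List[MockCiphertext] = [MockCiphertext(1.0), MockCiphertext(2.0)]
encrypted_output = square_layer(encrypted_q, [1.0, 1.0], -1.0)

try:
    relu_layer(encrypted_q, [1.0, 1.0], -1.0)
    relu_supported = True
except TypeError:
    relu_supported = False
```

In the interactive variant described above, the failing comparison is what would be delegated to an agent or trusted device that decrypts the value, applies the activation, and re-encrypts the result.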
In brief, PP-MARL offers better privacy protection than existing approaches because more private data (i.e., local observations, actions, and rewards) are stored and used locally. Furthermore, HE and DP are introduced in PP-MARL to improve privacy protection against data inference attacks further. Next, PP-MARL is more efficient as the exchanged messages between agents and the central node are less than the existing approaches. The specific comparisons are described in §IV.
IV Evaluation and Analysis
This section shows the performance comparison of PP-MARL with existing approaches in the two use cases discussed in §II-B in terms of the learning performance, privacy protection degree, and overheads in communication and computation.
IV-A Drone-assisted Communication
In this experiment, three drones are employed to cover three target places to provide communication services. The drones serve in a rectangular space of 1 × 1 km. The positions of drones and targets are initialized randomly. Each target place is a circular space with a radius of 40 m, and the coverage radius of drones is 50 m. Each episode is divided into 10 identical time intervals (i.e., steps). The drones aim to cooperatively cover all the targets as soon as possible.
Fig. 4 shows the learning performance comparison with four representative algorithms. The y-axis is the mean of the total number of target places covered by drones in each episode, and the x-axis is the agents’ observation range, from low to high. We can observe that cooperative learning schemes (i.e., PP-MARL, FACMAC, FedQ, and MADDPG) generally achieve better performance than decentralized learning schemes without cooperation (i.e., DDPG). The former operate in either the centralized mode or the networked mode to share information (i.e., q values in PP-MARL and FedQ, global states in FACMAC, and agents’ experiences in MADDPG) in training, while the latter lacks cooperation in learning, resulting in worse training performance. While individual intelligence alone may be limited, collaboration in learning can bring together agents’ intelligence, leading to better performance for the application. Generally, PP-MARL achieves better performance than the others, except for the high observation range, where FACMAC has a limited advantage over PP-MARL. Furthermore, by comparing the different observation ranges, we find that a broader observation range does not necessarily lead to better performance for all MARL algorithms. This phenomenon is shown by PP-MARL and DDPG, for which the medium observation range is the best. A wider observation range provides more information for the agents when acting but makes training more difficult owing to the larger input dimension.
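Table I: Comparison of the schemes in terms of privacy protection, gains, and overheads.
| Scheme | Mode | Private data | Shared data | Privacy (RMSE) | Gain | Bandwidth | Energy (execution, training) | Computation | Delay |
|---|---|---|---|---|---|---|---|---|---|
| DDPG | Decentral | o, a, r | | 0.23 | 5.5 | 0 | 1x, 1x | 1x | 1x |
| MADDPG | Central | | o, a, r | 0 | 6.9 | 0.34 | 1x, 14x | 1.33x | 1x |
| FedQ | Networked | o, a | r, q, w | 0.056 | 7.0 | 0.32 | 8.3x, 13.3x | 1.63x | 2.2x |
| FACMAC | Central | a | o, r, q | 0.030 | 8.0 | 0.19 | 1x, 9.28x | 1.48x | 1x |
| PP-MARL | Central | o, a, r | q | 0.064 | 9.4 | 0.03 | 1x, 1.96x | 1.01x | 1x |
| MADDPG-HE | Central | | õ, ã, r̃ | 0.23 | 6.9 | 3.3 | 1x, 128x | (1.3+0.55h)x | 1x |
| PP-MARL-HE (interactive) | Central | o, a, r | q̃ | 0.23 | 9.4 | 1.03 | 1x, 40x | (1.0+0.16h)x | 1x |
| PP-MARL-HE (linear activation) | Central | o, a, r | q̃ | 0.23 | 8.8 | 0.03 | 1x, 2x | (1.0+0.02h)x | 1x |
| PP-MARL-DP | Central | o, a, r | q̃ | 0.20 | 8.3 | 0.03 | 1x, 2x | 1.01x | 1x |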
IV-B Network Control with Edge Intelligence
For HSD-IoV, we use the dataset of vehicle mobility traces from SNDlib. This dataset offers real-time position data reported by buses in Rio de Janeiro (7 days in 24-hour format). The region is a 10 × 10 km map of Rio de Janeiro, Brazil. There are 4 BSs equipped with edge controllers, each with a coverage radius of 4 km. The observations of the edge controllers vary based on their locations, and their service coverage may overlap. The problem is how to assign edge controllers adaptively to the loads of vehicles in the overlap zone and minimize the response latency. We show simulation results of PP-MARL compared with FACMAC, MADDPG, DDPG, and two non-intelligent approaches (i.e., DB and RC). DB is a distance-based greedy policy. RC is centralized intelligence using a cloud controller.
Fig. 5 presents the mean control delays of vehicles over 24 hours. All the curves show significant changes over time, such as longer delays during rush hours from 7:00 am to 10:00 am. Among all schemes, PP-MARL has the shortest delay, i.e., 8.5%, 12.1%, 21.5%, 43%, and 72.6% shorter than MADDPG, DDPG, FACMAC, DB, and RC on average, respectively. Comparing RC with the other methods demonstrates that edge intelligence has less control delay than centralized intelligence (i.e., RC). DB has a higher delay than the MARL algorithms, as it only considers distance and ignores the load of edge controllers. DDPG is limited in delay reduction compared with the cooperative MARL algorithms because it lacks cooperation. In summary, CI can provide flexible network management, and PP-MARL can further reduce control delays through efficient cooperation.
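As one possible reading of a distance-based greedy policy such as DB, the sketch below assigns every vehicle to its nearest edge controller and then counts the resulting loads. The baseline's exact definition is not given here, so the function names, identifiers, and coordinates are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict, Tuple

Point = Tuple[float, float]


def distance_based_assignment(vehicles: Dict[str, Point],
                              controllers: Dict[str, Point]) -> Dict[str, str]:
    """Assign each vehicle to the nearest controller, ignoring controller load."""
    assignment: Dict[str, str] = {}
    for vehicle_id, (vx, vy) in vehicles.items():
        nearest = min(
            controllers,
            key=lambda cid: math.hypot(vx - controllers[cid][0], vy - controllers[cid][1]),
        )
        assignment[vehicle_id] = nearest
    return assignment


def controller_loads(assignment: Dict[str, str]) -> Dict[str, int]:
    """Number of vehicles served by each controller."""
    return dict(Counter(assignment.values()))


# Made-up positions in kilometres.
controllers = {"bs0": (0.0, 0.0), "bs1": (10.0, 0.0)}
vehicles = {"v1": (1.0, 0.0), "v2": (2.0, 1.0), "v3": (9.0, 0.0)}
assignment = distance_based_assignment(vehicles, controllers)
loads = controller_loads(assignment)
```

A purely distance-based rule can concentrate vehicles on one controller, because the load is never consulted when making the assignment.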
IV-C Privacy Protection and Overheads
Firstly, the privacy protection degree is defined as the root mean square error (RMSE) between the inferred data and the actual data. A higher RMSE implies that it is more difficult to infer private data from public data and thus indicates better privacy protection. The results reveal that DDPG provides the best protection because it is non-cooperative without information sharing, and the inferred data are predicted from zero vectors as inputs. However, as our experimental results in §IV-A and §IV-B show, non-cooperative MARL (i.e., DDPG) can be less effective, with lower gains. PP-MARL provides better privacy protection than the other cooperative methods because it only shares q values. Although value-decomposition methods (e.g., FACMAC) can protect the privacy of actions, private data may still be inferred much more easily from their public data. FedQ provides marginally less privacy protection than PP-MARL owing to reward sharing.
Secondly, we compare PP-MARL with the baselines in terms of bandwidth, energy, computation, and delay overheads. The results show that PP-MARL has clearly lower overheads than the other cooperative MARL schemes. The results are consistent with our previous analysis in §II-C: DDPG has no additional overheads as it is non-cooperative; MADDPG requires more bandwidth due to more data sharing between agents and the central node; FedQ is at a disadvantage in supporting low-latency applications because of higher delays and energy cost in execution, and it clearly introduces high communication and computation overheads; FACMAC has lower overheads than FedQ and MADDPG, but they are still significantly higher than those of PP-MARL due to shared states.
Thirdly, we show the performance and overheads of HE and DP deployed in PP-MARL, as shown in Table I. The results show that introducing interactive HE into MARL brings clearly high overheads during training. For example, MADDPG with HE introduces around 10 times higher bandwidth cost and 3 times more energy consumption during training. In PP-MARL, the increment caused by interactive HE cannot be ignored, even though PP-MARL is designed to be HE-friendly by reducing the native communication data volume (only sharing q values) and decomposing the critic neural network into a hierarchy. Furthermore, we considered a linear activation function as a trade-off scheme for PP-MARL-HE. PP-MARL-HE with linear activation has clearly lower overheads than PP-MARL-HE with interactive HE, but at the cost of a lost gain in cooperation. PP-MARL-DP, which adds Laplacian-distributed noise to the q values, shows decreases in gains, privacy protection, and computational overheads. Hence, we prefer HE to DP to improve PP-MARL privacy protection, given that computing and energy are not strictly limited at the central node. Moreover, we prefer interactive HE only when the agents (e.g., edge and end devices) are powerful.
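The RMSE-based privacy degree defined above can be computed as in the following sketch. The inferred and actual values are invented numbers; how the inference attack produces its estimates is outside the scope of the snippet.

```python
import math
from typing import Sequence


def rmse(inferred: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error between inferred and actual private values."""
    if len(inferred) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(i - a) ** 2 for i, a in zip(inferred, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual_private = [3.0, 3.0, 3.0, 3.0]
poor_inference = [1.0, 1.0, 1.0, 1.0]   # attacker is far off
exact_inference = [3.0, 3.0, 3.0, 3.0]  # attacker recovers the data

privacy_when_poor = rmse(poor_inference, actual_private)
privacy_when_exact = rmse(exact_inference, actual_private)
```

A value of zero means that the attacker reconstructs the private data exactly, while larger values indicate that the shared data reveal less.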
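Adding Laplacian-distributed noise to q values can be sketched with the standard Laplace mechanism, in which the noise scale is the sensitivity divided by the privacy budget epsilon. The sensitivity and epsilon values below are placeholders, since the settings used in the evaluation are not stated here, and the function names are hypothetical.

```python
import random
from typing import List


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_q_values(q_values: List[float], sensitivity: float,
                       epsilon: float, rng: random.Random) -> List[float]:
    """Perturb each q value before it is shared with the central node."""
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    return [q + laplace_noise(scale, rng) for q in q_values]


rng = random.Random(0)
q_values = [0.8, -0.2, 1.5]
noisy_q_values = privatize_q_values(q_values, sensitivity=1.0, epsilon=1.0, rng=rng)


def mean_abs_noise(epsilon: float, samples: int = 2000) -> float:
    """Average noise magnitude for a given epsilon (sensitivity fixed to 1)."""
    local_rng = random.Random(1)
    return sum(abs(laplace_noise(1.0 / epsilon, local_rng)) for _ in range(samples)) / samples


strong_privacy_noise = mean_abs_noise(epsilon=0.1)
weak_privacy_noise = mean_abs_noise(epsilon=10.0)
```

The noise magnitude grows as epsilon shrinks, so the global critic receives less accurate q values under stronger privacy, which is the trade-off between privacy protection and gains noted in the text.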
From the above comparison, we can draw the following conclusions. PP-MARL provides the best privacy protection with the lowest overheads among all the cooperative MARL schemes. Besides, we introduce PP-MARL-HE and PP-MARL-DP, variants of PP-MARL, which further improve privacy protection at the cost of higher overheads in communication and computation than PP-MARL. In addition, PP-MARL proves to be an HE-friendly architecture, introducing less overhead than the other MARL schemes with HE.
V Conclusion and Discussion
In this paper, we exploited CI in facilitating next-generation networks. We showed that the collaboration among agents was achieved by data and information sharing via communication. However, given the privacy concern and practical constraints in communication, the deployment of CI was obstructed. To address this issue, we proposed a privacy-preserving MARL scheme, called PP-MARL, which leverages an HE-friendly architecture. We evaluated PP-MARL’s performance in two representative use cases: mobility management in drone-assisted communication and network control with cooperative edge intelligence. Simulation results revealed that cooperative learning promotes collaboration among agents, and PP-MARL can provide better privacy protection with lower overhead than state-of-the-art approaches.
Privacy protection for CI in communications remains challenging in environments with large-scale agents or a variable number of agents involved in the collaboration. In future research, we will extend PP-MARL to support communication problems with dynamically varying numbers of large-scale agents, such as network control with mobile edge controllers.
This work has been partly funded by EU H2020 RISE COSAFE project (No. 824019) and the Alexander von Humboldt Foundation.
- (2020) Homomorphic encryption systems statement: trends and challenges. Computer Science Review 36, pp. 100235.
- (2021) Cooperative AI: machines must learn to find common ground. Nature 593 (7857), pp. 33–36.
- (2020) Open problems in cooperative AI. In NeurIPS.
- (2016) Cryptonets: applying neural networks to encrypted data with high throughput and accuracy. In ICML, pp. 201–210.
- (2019) Differential privacy techniques for cyber physical systems: a survey. IEEE Communications Surveys & Tutorials 22 (1), pp. 746–789.
- (2021) Differential privacy for industrial internet of things: opportunities, applications, and challenges. IEEE Internet of Things Journal 8 (13), pp. 10430–10451.
- (2020) Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine 37 (3), pp. 50–60.
- (2016) Continuous control with deep reinforcement learning. In ICLR.
- (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS, pp. 6379–6390.
- (2021) FACMAC: factored multi-agent centralised policy gradients. In NeurIPS.
- (2021) SARSA (0) reinforcement learning over fully homomorphic encryption. In ISCS, pp. 1–7.
- (2020) Dynamic controller assignment in software defined internet of vehicles through multi-agent deep reinforcement learning. IEEE Transactions on Network and Service Management 18 (1), pp. 585–596.
- (2021) Harnessing UAVs for fair 5G bandwidth allocation in vehicular communication via deep reinforcement learning. IEEE Transactions on Network and Service Management 18 (4), pp. 4063–4074.
- (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pp. 321–384.
- (2019) Federated reinforcement learning. arXiv:1901.08277.