I Introduction
Quantum computing (QC) has been posited as a means of achieving computational superiority for certain tasks that classical computers struggle to solve Nielsen and Chuang (2010). Despite this potential, the lack of error correction in current quantum computers makes it challenging to implement complex quantum circuits effectively on these "noisy intermediate-scale quantum" (NISQ) devices Preskill (2018). To harness the quantum advantages offered by NISQ devices, specialized quantum circuit architectures must be developed.
Recent advances in the hybrid quantum-classical computing framework Bharti et al. (2022) make use of both classical and quantum computing. Under this approach, computational tasks that are expected to benefit from quantum processing are executed on a quantum computer, while others, such as gradient calculations, are performed on classical computers. This hybrid approach aims to take advantage of the strengths of both types of computing to address a wide range of tasks. Hybrid algorithms that utilize variational quantum circuits (VQCs) have proven effective in a variety of machine learning tasks. VQCs are a subclass of quantum circuits that possess tunable parameters, and their incorporation into quantum machine learning (QML) models has demonstrated success in a wide range of tasks Bharti et al. (2022); Cerezo et al. (2021).
Reinforcement learning (RL) is a branch of machine learning that deals with sequential decision-making tasks. Deep neural network-based RL has achieved remarkable results in complicated tasks, with human-level Mnih et al. (2015) or superhuman performance Silver et al. (2017). However, quantum RL is a developing field with many unresolved issues and challenges. The majority of existing quantum RL models are based on VQCs Chen et al. (2020); Lockwood and Si (2020); Skolik et al. (2022); Jerbi et al. (2021); Hsiao et al. (2022). Although these models have been shown to perform well in a variety of benchmark tasks, training them requires a significant amount of computational resources. The long training time limits the exploration of quantum RL's broad application possibilities. In this paper, we propose an asynchronous training framework for quantum RL agents. We focus on the asynchronous training of advantage actor-critic quantum policies using multiple instances of agents running in parallel. We show, using numerical simulations, that quantum models may match or outperform classical models in the benchmark tasks considered. Furthermore, the suggested training approach has the practical advantage of requiring significantly less training time, which opens the way to more quantum RL applications.
The structure of this paper is as follows: In Section II, we provide an overview of relevant prior work and compare our proposal to these approaches. In Section III, we provide a brief overview of the necessary background in reinforcement learning. In Section IV, we introduce the concept of variational quantum circuits (VQCs), which serve as the building blocks of our quantum reinforcement learning agents. In Section V, we present our proposed quantum A3C framework. In Section VI, we describe our experimental setup and present our results. Finally, in Section VII, we offer some concluding remarks.
II Relevant Works
The work that gave rise to quantum reinforcement learning (QRL) Meyer et al. (2022b) may be traced back to Dong et al. (2008). However, that framework demands a quantum environment, a requirement that may not be met in most real-world situations. Further studies utilizing Grover-like methods include Wiedemann et al. (2022); Sannia et al. (2022). Quantum linear system solvers have also been used to implement quantum policy iteration Cherrat et al. (2022). We concentrate on recent advancements in VQC-based QRL dealing with classical environments. The first VQC-based QRL Chen et al. (2020), which is the quantum version of deep Q-learning (DQN), considers discrete observation and action spaces in testing environments such as FrozenLake and CognitiveRadio. Later, more sophisticated efforts in the area of quantum DQN take into account continuous observation spaces such as CartPole Lockwood and Si (2020); Skolik et al. (2022). A further development in this direction is the use of quantum recurrent neural networks such as QLSTM as the value function approximator Chen (2022) to tackle challenges such as partial observability or environments requiring longer memory of previous steps. Other methods, such as a hybrid quantum-classical linear solver, have been developed to find value functions CHEN et al. (2020). Improvements to DQN that aid agent convergence, such as Double DQN (DDQN), have also been implemented within the VQC framework Heimann et al. (2022), where the authors apply QRL to a robot navigation task.
Recent advances in QRL have led to frameworks that aim to learn the policy function, denoted as $\pi$, directly. These frameworks are able to learn the optimal policy for a given problem, in addition to learning value functions such as the $Q$-function. For example, Jerbi et al. (2021) describes quantum policy gradient RL based on the REINFORCE algorithm. Hsiao et al. (2022) then considers an improved policy gradient algorithm called PPO with VQCs and shows that, even with a small number of parameters, quantum models can outperform their classical counterparts. Provable quantum advantages of policy gradient are shown in Jerbi et al. (2022). Additional research, such as Meyer et al. (2022a), has explored the impact of various post-processing methods for VQCs on the performance of quantum policy gradients. Several improved quantum policy gradient algorithms have been proposed in recent years, including actor-critic Schenk et al. (2022) and soft actor-critic (SAC) Lan (2021); Acuto et al. (2022). These modifications seek to further improve the efficiency and effectiveness of QRL methods. QRL has also been applied to quantum control Sequeira et al. (2022) and has been extended to the multi-agent setting Yun et al. (2022a); Yan et al. (2022); Yun et al. (2022b). Chen et al. (2022b) was the first work to explore evolutionary optimization for QRL. In that work, multiple agents were initialized and run in parallel, with the top-performing agents being selected as parents to generate the next generation of agents. In Wu et al. (2020), the authors studied advanced quantum policy gradient methods, such as the deep deterministic policy gradient (DDPG) algorithm, for QRL in continuous action spaces.
In this work, we extend previous research on quantum policy gradient Jerbi et al. (2021); Hsiao et al. (2022); Schenk et al. (2022) by introducing an asynchronous training method for quantum policy learning. While previous approaches have employed single-threaded training, our method is asynchronous, which may offer practical benefits such as reduced training time through the use of multi-core CPU computing resources and the potential for utilizing multiple quantum processing units (QPUs) in the future. Our approach shares some similarities with the evolutionary QRL method presented in Chen et al. (2022b), which also utilizes parallel computing resources. However, our approach differs in that individual agents share their gradients directly with the shared global model asynchronously, rather than waiting for all agents to finish before calculating fitness and creating the next generation of agents. This characteristic may further improve the efficiency of the training process.
III Reinforcement Learning
Reinforcement learning (RL) is a machine learning framework in which an agent learns to accomplish a given goal by interacting with an environment in discrete time steps Sutton and Barto (2018). The agent observes a state $s_t$ at each time step $t$ and then chooses an action $a_t$ from the action space $\mathcal{A}$ based on its current policy $\pi$. The policy is a mapping from a specific state $s_t$ to the probabilities of choosing each of the actions in $\mathcal{A}$. After performing the action $a_t$, the agent receives a scalar reward $r_t$ and the state of the following time step $s_{t+1}$ from the environment. For episodic tasks, the procedure is repeated across a number of time steps until the agent reaches the terminal state or the maximum number of steps permitted. Observing the states $s_t$ along the training process, the agent aims to maximize the expected return, which can be expressed as the value function at state $s$ under policy $\pi$, $V^{\pi}(s) = \mathbb{E}\left[R_t \mid s_t = s\right]$, where $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the return, the total discounted reward from time step $t$. The value function can be further expressed as $V^{\pi}(s) = \sum_{a \in \mathcal{A}} Q^{\pi}(s, a)\, \pi(a \mid s)$, where the action-value function or Q-value function $Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a\right]$ is the expected return of choosing an action $a$ in state $s$ according to the policy $\pi$. Q-learning is an RL algorithm that optimizes $Q(s, a)$ via the following update, with learning rate $\alpha$ and discount factor $\gamma$:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]. \tag{1}$$
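The Q-learning update can be illustrated with a small tabular sketch. This is only a toy illustration, not the method used in this paper: the state labels, the two-action set, and the values of the learning rate `alpha` and discount factor `gamma` are assumptions chosen for the example.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Bootstrap from the best next action unless the episode has ended.
    best_next = 0.0 if terminal else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-state example: unseen (state, action) pairs start at 0.
q_table = defaultdict(float)
actions = [0, 1]
q_table[("s1", 1)] = 2.0  # pretend the next state already looks promising

new_value = q_learning_update(q_table, "s0", 0, reward=1.0,
                              next_state="s1", actions=actions)
```

Here the temporal-difference target is the reward plus the discounted best next value, and the stored estimate moves a fraction `alpha` of the way toward it.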
In contrast to value-based reinforcement learning techniques, such as Q-learning, which rely on learning a value function and using it to guide decision-making at each time step, policy gradient methods focus on directly optimizing a policy function, denoted as $\pi(a \mid s; \theta)$, parametrized by $\theta$. The parameters $\theta$ are updated through a gradient ascent procedure on the expected total return, $\mathbb{E}[R_t]$. A notable example of a policy gradient algorithm is the REINFORCE algorithm, introduced in Williams (1992). In the standard REINFORCE algorithm, the parameters $\theta$ are updated along the direction $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, R_t$, which is an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$. However, this policy gradient estimate often suffers from high variance, making training difficult. To reduce the variance of this estimate while keeping it unbiased, a term known as the baseline can be subtracted from the return. This baseline, denoted as $b_t(s_t)$, is a learned function of the state $s_t$. The resulting update becomes $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta) \left(R_t - b_t(s_t)\right)$. A common choice for the baseline in RL is an estimate of the value function $V^{\pi}(s_t)$. Using this choice for the baseline often results in a lower-variance estimate of the policy gradient Sutton and Barto (2018). The quantity $R_t - b_t = A(s_t, a_t)$ can be interpreted as the advantage of action $a_t$ at state $s_t$. Intuitively, the advantage can be thought of as the "goodness or badness" of action $a_t$ relative to the average value at state $s_t$. This approach is known as the advantage actor-critic (A2C) method, where the policy $\pi$ is the actor and the baseline, which is the value function $V$, is the critic Sutton and Barto (2018).
The asynchronous advantage actor-critic (A3C) algorithm Mnih et al. (2016) is a variant of the A2C method that employs multiple concurrent actors to learn the policy through parallelization. Asynchronous training of RL agents involves executing multiple agents on multiple instances of the environment, allowing the agents to encounter diverse states at any given time step. This diminished correlation between states or observations enhances the numerical stability of on-policy RL algorithms such as actor-critic Mnih et al. (2016). Furthermore, asynchronous training does not require the maintenance of a large replay memory, thus reducing memory requirements Mnih et al. (2016). By harnessing the advantages and gradients computed by a pool of actors, A3C exhibits impressive sample efficiency and robust learning performance, making it a prevalent choice in reinforcement learning.
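To make the return and advantage quantities concrete, the following minimal sketch computes discounted returns for a short rollout and subtracts a critic estimate from each. The reward values, the critic values, and the discount factor are made up for illustration; the function names are hypothetical and do not come from the paper's code.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted returns R_t for a short rollout, computed backwards in time."""
    returns = []
    running = bootstrap_value  # use 0.0 if the rollout ended in a terminal state
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage estimates A_t = R_t - V(s_t), with V(s_t) as the baseline."""
    return [ret - val for ret, val in zip(returns, values)]


rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 1.0]  # critic estimates V(s_t), invented for this example
rets = n_step_returns(rewards, bootstrap_value=0.0, gamma=0.5)
advs = advantages(rets, values)
```

A positive advantage would scale up the log-probability gradient of the chosen action, and a negative one would scale it down, which is the role the baseline plays in the actor update.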
IV Variational Quantum Circuit
Variational quantum circuits (VQCs), also referred to as parameterized quantum circuits (PQCs), are a class of quantum circuits that contain tunable parameters. These parameters can be optimized using various techniques from classical machine learning, including gradient-based and non-gradient-based methods. A generic illustration of a VQC is shown in the central part of Figure 1.
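As a rough picture of what a tunable quantum circuit computes, the sketch below simulates a single-qubit toy circuit in plain Python: one rotation encodes a classical input, a second rotation carries the trainable parameter, and the output is a Pauli-Z expectation value. The choice of RY gates, the single qubit, and the function names are assumptions made for illustration and do not reproduce the 8-qubit architecture used in this work. The gradient is obtained with the parameter-shift rule, which for this gate uses shifts of plus and minus pi/2.

```python
import math


def ry(angle):
    """2x2 matrix of a single-qubit RY rotation."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


def apply(gate, state):
    """Multiply a 2x2 gate into a 2-component (real) state vector."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]


def vqc_expectation(x, theta):
    """Encode x, apply the trainable rotation, return the Pauli-Z expectation."""
    state = [1.0, 0.0]               # start in |0>
    state = apply(ry(x), state)      # encoding step (toy choice)
    state = apply(ry(theta), state)  # variational step (toy choice)
    return state[0] ** 2 - state[1] ** 2


def parameter_shift_grad(x, theta):
    """Derivative of the expectation with respect to theta via two shifted runs."""
    shift = math.pi / 2
    return 0.5 * (vqc_expectation(x, theta + shift)
                  - vqc_expectation(x, theta - shift))


x, theta = 0.3, 0.7
value = vqc_expectation(x, theta)
grad = parameter_shift_grad(x, theta)
```

For this toy circuit the expectation equals cos(x + theta), so the parameter-shift result can be checked against the analytic derivative -sin(x + theta). On hardware the expectation would be estimated from repeated shots rather than computed exactly.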
The three primary components of a VQC are the encoding circuit, the variational circuit, and the quantum measurement layer. The encoding circuit, denoted as $U(\mathbf{x})$, transforms classical values into a quantum state, while the variational circuit, denoted as $V(\boldsymbol{\theta})$, serves as the learnable part of the VQC. The quantum measurement layer is used to extract information from the circuit. It is common practice to execute the circuit repeatedly (each execution is known as a "shot") in order to estimate the expectation value of each qubit. A common choice is to use the Pauli-$Z$ expectation values. These values are real numbers rather than binary outcomes. Other components, such as additional VQCs or classical components such as DNNs, can further process the values obtained from the circuit.
The VQC can operate with other classical components such as tensor networks (TN) Chen et al. (2022b, 2021); Qi et al. (2021) and deep neural networks (NN) to perform data preprocessing such as dimensionality reduction or postprocessing such as scaling. We call such a VQC a dressed VQC, as shown in Figure 1. The whole model can be trained in an end-to-end manner via gradient-based Chen et al. (2021); Qi et al. (2021) or gradient-free methods Chen et al. (2022b). For the gradient-based methods, the whole model can be represented as a directed acyclic graph (DAG), to which backpropagation can then be applied. The success of such end-to-end optimization relies on the ability to calculate quantum gradients, for example with the parameter-shift rule Mitarai et al. (2018). VQC-based QML models have shown success in areas such as classification Mitarai et al. (2018); Qi et al. (2021); Chehimi and Saad (2022); Chen and Yoo (2021); Chen et al. (2021); Yang et al. (2021, 2022); Di Sipio et al. (2022) and sequence modeling Chen et al. (2022d, a).
V Quantum A3C
The proposed quantum asynchronous advantage actor-critic (QA3C) framework consists of two main components: a global shared memory and process-specific memories for each agent. The global shared memory maintains the dressed VQC policy and value parameters, which are modified when an individual process uploads its own gradients for parameter updates. Each agent has its own process-specific memory that maintains local dressed VQC policy and value parameters. These local models are used to generate actions during an episode within individual processes. When certain criteria are met, the gradients of the local model parameters are uploaded to the global shared memory, and the global model parameters are modified accordingly. The updated global model parameters are then immediately downloaded to the local agent that just uploaded its gradients. The overall concept of QA3C is depicted in Figure 2.
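The upload and download cycle between local and global parameters can be sketched with a toy example. Everything here is an assumption made for illustration: a single scalar parameter stands in for the dressed VQC parameters, a quadratic loss stands in for the actor-critic loss, plain gradient descent replaces the optimizer, and Python threads with a lock stand in for the separate processes and shared memory described above.

```python
import threading

TARGET = 3.0            # toy objective: minimize (w - TARGET)^2
LEARNING_RATE = 0.05    # illustrative value, not taken from the paper
STEPS_PER_WORKER = 300
NUM_WORKERS = 4

global_params = {"w": 0.0}  # stands in for the global shared memory
lock = threading.Lock()


def worker():
    # Start from a copy of the global parameters (process-specific memory).
    with lock:
        local_w = global_params["w"]
    for _ in range(STEPS_PER_WORKER):
        # Gradient of the toy loss, evaluated on the local copy only.
        grad = 2.0 * (local_w - TARGET)
        with lock:
            # Upload: apply the local gradient to the global parameters.
            global_params["w"] -= LEARNING_RATE * grad
            # Download: refresh the local copy right after uploading.
            local_w = global_params["w"]


threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_w = global_params["w"]
```

No worker waits for the others before uploading, so a gradient may be computed from slightly stale parameters. In this toy setting the global parameter still settles near the target, which mirrors the asynchronous design choice in spirit only.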
We construct the quantum policy and value function with the dressed VQC shown in Figure 1, in which the VQC follows the architecture shown in Figure 3. This VQC architecture has been studied in works on quantum recurrent neural networks (QRNN) Chen et al. (2022d), quantum recurrent RL Chen (2022), quantum convolutional neural networks Chen et al. (2022c), and federated quantum classification Chen and Yoo (2021), and has demonstrated superior performance over classical counterparts under certain conditions. In addition, we employ classical DNNs before and after the VQC to reduce the dimensionality of the data and to fine-tune the outputs of the VQC, respectively. The neural network components in this hybrid architecture consist of single-layer networks for dimensionality conversion. Specifically, the network preceding the VQC is a linear layer with an input dimension equal to the size of the observation vector and an output dimension equal to the number of qubits in the VQC. The networks following the VQC are linear layers with input dimensions equal to the number of qubits in the VQC and output dimensions equal to the number of actions (for the actor function $\pi$) or 1 (for the critic function $V$). These layers convert the output of the VQC for use in the actor-critic algorithm. The policy and value function are updated after a fixed number of steps or when the agent reaches the terminal state. The details of the algorithm, such as the gradient update formulas, are presented in Algorithm 1.
VI Experiments and Results
VI.1 Testing Environments
VI.1.1 Acrobot
The Acrobot environment from OpenAI Gym Brockman et al. (2016) consists of a system with two linearly connected links, with one end fixed. The joint connecting the two links can be actuated by applying torques. The goal is to swing the free end of the chain over a predetermined height, starting from a downward hanging position, using as few steps as possible. The observation in this environment is a six-dimensional vector comprising the sine and cosine values of the two rotational joint angles, as well as their angular velocities. The agent can take one of three actions: applying $-1$, $0$, or $+1$ torque to the actuated joint. An action resulting in the free end reaching the target height receives a reward of $0$ and terminates the episode. Any action that does not lead to the desired height receives a reward of $-1$. The reward threshold is $-100$.
VI.1.2 CartPole
CartPole is a commonly used evaluation environment for simple RL models and is a standard example within OpenAI Gym Brockman et al. (2016) (see Figure 5). In this environment, a fixed junction connects a pole to a cart traveling horizontally over a frictionless track. The pendulum initially stands upright, and the aim is to keep it as near to its starting position as possible by moving the cart left and right. At each time step, the RL agent learns to produce the right action based on the observation it receives. The observation in this environment is a four-dimensional vector containing the cart position, cart velocity, pole angle, and pole velocity at the tip. Every time step in which the pole remains close to upright yields a reward of $+1$. An episode ends if the pole tilts beyond a fixed angle from vertical or the cart moves more than 2.4 units away from the center.
VI.1.3 MiniGrid-SimpleCrossing
The MiniGrid-SimpleCrossing environment Chevalier-Boisvert et al. (2018) is more sophisticated, with a much larger observation input for the RL agent. In this scenario, the RL agent receives a 147-dimensional vector as its observation and must choose an action from the action space $\mathcal{A}$, which offers six options. It is important to note that the 147-dimensional vector is a compact and efficient representation of the environment rather than the real pixels. The six actions, $0, \dots, 5$, are turn left, turn right, move forward, pick up an object, drop the object being carried, and toggle. Only the first three have any effect in this environment, and the agent is expected to learn this fact. The agent receives a reward of 1 upon reaching the goal, reduced by a penalty according to the formula $1 - 0.9 \times (\text{number of steps} / \text{max steps allowed})$, where the maximum number of steps allowed is defined as $4 \times n \times n$ and $n$ is the grid size Chevalier-Boisvert et al. (2018). In this work, $n$ is set to 9. This reward scheme presents a challenge because it is sparse: the agent does not receive any reward until it reaches the goal. As shown in Figure 6, the agent (shown as a red triangle) is expected to find the shortest path from the starting point to the goal (shown in green). We consider three cases in this environment: MiniGrid-SimpleCrossingS9N1-v0, MiniGrid-SimpleCrossingS9N2-v0 and MiniGrid-SimpleCrossingS9N3-v0. Here $N$ represents the number of valid crossings across walls from the starting position to the goal.
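The sparse goal reward can be written out as a short calculation. This sketch assumes the MiniGrid convention that the reward on reaching the goal is 1 - 0.9 × (steps taken / maximum steps), with the maximum number of steps equal to 4 × n × n for grid size n; the function names are hypothetical.

```python
def max_steps(grid_size):
    """Step budget for a square grid of the given size."""
    return 4 * grid_size * grid_size


def goal_reward(step_count, grid_size=9):
    """Reward received on reaching the goal after step_count steps."""
    return 1.0 - 0.9 * (step_count / max_steps(grid_size))


budget = max_steps(9)
fast = goal_reward(20)      # reaching the goal quickly keeps most of the reward
slow = goal_reward(budget)  # reaching it on the last allowed step keeps about 0.1
```

For the grid size used here the step budget is 324, so each extra step costs a small, constant amount of reward. This is why shorter paths to the goal score higher.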
VI.2 Hyperparameters and Model Size
In the proposed QA3C, we use the Adam optimizer with learning rate , and . The local agents will update the parameters with the global shared memory every steps. The discount factor is set to be . For the VQC, we set the number of qubits to 8 and use two variational layers. Each VQC therefore has 48 quantum parameters. The actor and the critic each have their own VQC, so the total number of quantum parameters is 96. The VQC architecture is the same across all testing environments considered in this work. As described in Section V, single-layer networks are used before and after the VQC to convert the dimensions of the data. The networks preceding the VQC have input dimensions determined by the environment that the agent is to solve. For the classical benchmarks, we consider models that closely resemble the dressed VQC model. Specifically, we keep the architecture of the classical model similar to the one presented in Figure 1 but replace the 8-qubit VQC with a single layer whose input and output dimensions both equal 8. This makes the architecture very similar to the quantum model, and the numbers of parameters are also very close. We summarize the number of parameters in Table 1.
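Table 1: Number of model parameters.
| Environment | QA3C (classical) | QA3C (quantum) | QA3C (total) | Classical (total) |
|---|---|---|---|---|
| Acrobot | 148 | 96 | 244 | 292 |
| CartPole | 107 | 96 | 203 | 251 |
| SimpleCrossing | 2431 | 96 | 2527 | 2575 |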
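The entries of Table 1 can be re-derived from the layer dimensions described in the text. The sketch below rests on several assumptions that are inferred because they reproduce the table, not because the text states them: every linear layer has a bias, the actor and critic each have their own input layer, each qubit carries three rotation angles per variational layer, the SimpleCrossing observation has 147 entries, and CartPole has two actions.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def classical_dressing_params(obs_dim, n_actions, width=8):
    """Classical layers around the two VQCs (actor and critic)."""
    input_layers = 2 * linear_params(obs_dim, width)  # one per VQC
    actor_head = linear_params(width, n_actions)
    critic_head = linear_params(width, 1)
    return input_layers + actor_head + critic_head


def quantum_params(n_qubits=8, n_layers=2, angles_per_qubit=3):
    """Assumption: three rotation angles per qubit per variational layer."""
    return 2 * n_qubits * n_layers * angles_per_qubit  # actor + critic


def classical_baseline_total(obs_dim, n_actions, width=8):
    """Baseline: each VQC is swapped for a width -> width linear layer."""
    return (classical_dressing_params(obs_dim, n_actions, width)
            + 2 * linear_params(width, width))


# (observation size, number of actions) per environment, partly assumed.
environments = {"Acrobot": (6, 3), "CartPole": (4, 2), "SimpleCrossing": (147, 6)}
table = {
    name: (classical_dressing_params(obs, act),
           quantum_params(),
           classical_dressing_params(obs, act) + quantum_params(),
           classical_baseline_total(obs, act))
    for name, (obs, act) in environments.items()
}
```

Each tuple in `table` holds the QA3C classical, QA3C quantum, QA3C total, and classical-baseline total counts, in that order. The quantum part stays at 96 parameters regardless of the environment, while the classical dressing grows with the observation size.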
We utilize the open-source PennyLane package Bergholm et al. (2018) to construct the quantum circuit models and PyTorch as the overall machine learning framework. The number of CPU cores, and hence the number of parallel agents, is 80 in this work. We present simulation results in which the scores from the past 100 episodes are averaged.
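Averaging over the most recent episodes can be sketched as a rolling mean. How the first episodes are handled, before a full window of 100 scores exists, is an assumption here: the sketch averages over however many scores are available so far. The function name is hypothetical.

```python
from collections import deque


def rolling_mean(scores, window=100):
    """Mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    averages = []
    for score in scores:
        if len(recent) == window:
            running_sum -= recent[0]  # oldest score is about to be evicted
        recent.append(score)
        running_sum += score
        averages.append(running_sum / len(recent))
    return averages


smoothed = rolling_mean([1, 2, 3, 4], window=2)
```

With a window of 2 the example yields one averaged value per episode, each covering at most the last two scores. The same function with the default window of 100 matches the smoothing described above under the stated assumption.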
VI.3 Results
VI.3.1 Acrobot
We begin by evaluating the performance of our models on the Acrobot environment. The simulation results of this experiment are presented in Figure 7. The total number of episodes was 100,000. As shown in the figure, the quantum model exhibits a gradual improvement during the early training episodes, while the classical model struggles to improve its policy. In terms of average score, the quantum model demonstrates superior performance compared to the classical model. Furthermore, the quantum model exhibits a more stable convergence pattern, without significant fluctuations or collapses after reaching optimal scores. These results suggest that the quantum model may be more robust and reliable in this environment.
VI.3.2 CartPole
The next experiment was conducted in the CartPole environment. The total number of episodes was 100,000. As illustrated in Figure 8, the quantum model achieved significantly higher scores than the classical model. While the classical model demonstrated faster learning in the early training episodes, the quantum model eventually surpassed it and reached superior scores. These results suggest that the quantum model may be more effective in this environment.
VI.3.3 MiniGrid-SimpleCrossing
The final experiment was conducted in the MiniGrid-SimpleCrossing environment, with a total of 100,000 episodes. As depicted in Figure 9, the quantum model outperformed the classical model in two of the three scenarios, MiniGrid-SimpleCrossingS9N2-v0 and MiniGrid-SimpleCrossingS9N3-v0, demonstrating faster convergence and higher scores. In the remaining scenario, MiniGrid-SimpleCrossingS9N1-v0, the difference in performance between the two models was minor.
VII Conclusion
In this study, we demonstrate the effectiveness of an asynchronous training framework for quantum RL agents. Through numerical simulations, we show that in the benchmark tasks considered, advantage actor-critic quantum policies trained asynchronously can outperform or match the performance of classical models with similar architecture and sizes. This technique offers a strategy for expediting the training of quantum RL agents through parallelization, and may have potential applications in various real-world scenarios.
Acknowledgements.
The views expressed in this article are those of the authors and do not represent the views of Wells Fargo. This article is for informational purposes only. Nothing contained in this article should be construed as investment advice. Wells Fargo makes no express or implied warranties and expressly disclaims all legal, tax, and accounting implications related to this article.
Appendix A Algorithms
A.1 Quantum-A3C
References
Variational quantum soft actor-critic for robotic arm control. arXiv preprint arXiv:2212.11681.
PennyLane: automatic differentiation of hybrid quantum-classical computations. arXiv preprint arXiv:1811.04968.
Noisy intermediate-scale quantum algorithms. Reviews of Modern Physics 94 (1), pp. 015004.
OpenAI Gym. arXiv preprint arXiv:1606.01540.
Variational quantum algorithms. Nature Reviews Physics 3 (9), pp. 625–644.
Quantum federated learning with quantum data. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8617–8621.
Hybrid quantum-classical Ulam-von Neumann linear solver-based quantum dynamic programing algorithm. Proceedings of the Annual Conference of JSAI JSAI2020, pp. 2K6ES203.
Reservoir computing via quantum recurrent neural networks. arXiv preprint arXiv:2211.02612.
Variational quantum reinforcement learning via evolutionary optimization. Machine Learning: Science and Technology 3 (1), pp. 015025.
An end-to-end trainable hybrid classical-quantum classifier. Machine Learning: Science and Technology 2 (4), pp. 045021.
Quantum convolutional neural networks for high energy physics data analysis. Physical Review Research 4 (1), pp. 013231.
Variational quantum circuits for deep reinforcement learning. IEEE Access 8, pp. 141007–141024.
Quantum long short-term memory. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8622–8626.
Federated quantum machine learning. Entropy 23 (4), pp. 460.
Quantum deep recurrent reinforcement learning. arXiv preprint arXiv:2210.14876.
Quantum reinforcement learning via policy iteration. arXiv preprint arXiv:2203.01889.
Minimalistic gridworld environment for OpenAI Gym. GitHub. Note: https://github.com/maximecb/gym-minigrid
The dawn of quantum natural language processing. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8612–8616.
Quantum reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38 (5), pp. 1207–1220.
Quantum deep reinforcement learning for robot navigation tasks. arXiv preprint arXiv:2202.12180.
Unentangled quantum reinforcement learning agents in the OpenAI Gym. arXiv preprint arXiv:2203.14348.
Quantum policy gradient algorithms. arXiv preprint arXiv:2212.09328.
Variational quantum policies for reinforcement learning. arXiv preprint arXiv:2103.05577.
Variational quantum soft actor-critic. arXiv preprint arXiv:2112.11921.
Reinforcement learning with quantum variational circuit. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 16, pp. 245–251.
Quantum policy gradient algorithm with optimized action decoding. arXiv preprint arXiv:2212.06663.
A survey on quantum reinforcement learning. arXiv preprint arXiv:2211.03464.
Quantum circuit learning. Physical Review A 98 (3), pp. 032309.
Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
Quantum computation and quantum information.
Quantum computing in the NISQ era and beyond. Quantum 2, pp. 79.
QTN-VQC: an end-to-end learning framework for quantum neural networks. arXiv preprint arXiv:2110.03861.
A hybrid classical-quantum approach to speed-up Q-learning. arXiv preprint arXiv:2205.07730.
Hybrid actor-critic algorithm for quantum reinforcement learning at CERN beam lines. arXiv preprint arXiv:2209.11044.
Variational quantum policy gradients with an application to quantum control. arXiv preprint arXiv:2203.10591.
Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354–359.
Quantum agents in the gym: a variational quantum algorithm for deep Q-learning. Quantum 6, pp. 720.
Reinforcement learning: an introduction. MIT Press.
Quantum policy iteration via amplitude estimation and Grover search–towards quantum advantage for reinforcement learning. arXiv preprint arXiv:2206.04741.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
Quantum reinforcement learning in continuous action space. arXiv preprint arXiv:2012.10711.
A multi-agent quantum deep reinforcement learning method for distributed frequency control of islanded microgrids. IEEE Transactions on Control of Network Systems 9 (4), pp. 1622–1632.
Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6523–6527.
When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8602–8606.
Quantum multi-agent reinforcement learning via variational quantum circuit design. arXiv preprint arXiv:2203.10443.
Quantum multi-agent meta reinforcement learning. arXiv preprint arXiv:2208.11510.