I Introduction
Humans have invented technologies with transformative impact on society. One such example is the internet, which significantly influences our everyday life. The quantum internet Kimble08 ; Wehner:qinternetreview could become the next generation of such a world-spanning network, and promises applications that go beyond its classical counterpart. This includes e.g. distributed quantum computation, secure communication or distributed quantum sensing. Quantum technologies are now on the brink of being commercially used, and the quantum internet is conceived as one of the key applications in this context. Such quantum technologies are based on the invention of a number of central protocols and schemes, for instance quantum cryptography ShorQKD ; RennerPhd ; ZhaoQKD ; GottesmanQKD ; LoQKD and teleportation teleportation . Additional schemes that solve fundamental problems such as the accumulation of channel noise and decoherence have been discovered and have also shaped future research. This includes e.g. entanglement purification bbpssw ; dejmps ; eppreview and the quantum repeater Br98 that allow for scalable long-distance quantum communication. These schemes are considered key results whose discovery represents breakthroughs in the field of quantum information processing. But to what extent are human minds required to find such schemes?
Here we show that many of these central quantum protocols can in fact be found using machine learning by phrasing the problem in a reinforcement learning (RL) framework RusselNorvig2003 ; sutton1998reinforcement ; wiering2012reinforcement , the framework at the forefront of modern artificial intelligence human2015mnih ; mastering2016silver ; silver2018general . By using projective simulation (PS) briegel2012projective , a physically motivated framework for RL, we show that teleportation, entanglement swapping, and entanglement purification are found by a PS agent. We equip the agent with a universal gate set, and specify the desired task via a reward scheme. With certain specifications of the structure of the action and percept spaces, RL then leads to the rediscovery of the desired protocols. Based on these elementary schemes, we then show that such an artificial agent can also learn more complex tasks and discover long-distance communication protocols, the so-called quantum repeaters Br98 . The usage of elementary protocols learned previously is of central importance in this case. We also equip the agent with the possibility to call subagents, thereby allowing for a design of a hierarchical scheme simon2016 ; simon2017skill that offers the flexibility to deal with various environmental situations. The proper combination of optimized block actions discovered by the subagents is the central element at this learning stage, which allows the agent to find a scalable, efficient scheme for long-distance communication. We are aware that we make use of existing knowledge in the specific design of the challenges. Rediscovering existing protocols under such guidance is naturally very different from the original achievement (by humans) of conceiving of and proposing them in the first place, an essential part of which includes the identification of relevant concepts and resources. However, the agent does not only rediscover known protocols and schemes, but can go beyond known solutions. In particular, we find that in asymmetric situations, where channel noise and decoherence are non-uniform, the schemes found by the agent outperform human-designed schemes that are based on known solutions for symmetric cases.
From a technical perspective, the agent is situated in stochastic environments RusselNorvig2003 ; sutton1998reinforcement ; universal2013orseau , as measurements with random outcomes are central elements of some of the schemes considered. This requires learning proper reactions to all measurement outcomes, e.g., the required correction operations in a teleportation protocol depending on outcomes of (Bell) measurements. Additional elements are abort operations, as not all measurement outcomes lead to a situation where the resulting state can be further used. This happens for instance in entanglement purification, where the process needs to be restarted in some cases as the resulting state is no longer entangled. The overall scheme is thus probabilistic. These are new challenges that have not been treated in projective simulation before, but the PS agent can in fact deal with such challenges. Another interesting element is the usage of block actions that have been learned previously. This is a mechanism similar to hierarchical skill learning in robotics simon2016 ; simon2017skill , and to clip composition in PS briegel2012projective ; briegel2012creative ; melnikov2018active , where previously learned tasks are used to solve more complex challenges and problems. Here we use this concept for long-distance communication schemes. The initial situation is a quantum channel that has been subdivided by multiple repeater stations that share entangled pairs with their neighboring stations. Previously learned protocols, namely entanglement swapping and entanglement purification, are used as new primitives. Additionally, the agent is allowed to employ subagents that operate in the same way but deal with a problem at a smaller scale, i.e. they find optimized block actions for shorter distances that the main agent can employ at the larger scale. This allows the agent to deal with big systems, and rediscover the quantum repeater with its favorable scaling. The ability to delegate is of special importance in asymmetric situations as such block actions need to be learned separately for different initial states of the environment – in our case the fidelity of the elementary pairs might vary drastically either because they correspond to segments with different channel noise, or they are of different length. In this case, the agent outperforms human-designed protocols that are tailored to symmetric situations.
The paper is organized as follows. In Sec. II we provide background information on reinforcement learning and projective simulation, and discuss our approach on how to apply these techniques on problems in quantum communication. In Sec. III, we show that the PS agent can find solutions to elementary quantum protocols, thereby rediscovering teleportation, entanglement swapping, entanglement purification and the elementary repeater cycle. In Sec. IV we present results for the scaling repeater in a symmetric and asymmetric setting, and summarize and conclude in Sec. V.
II Projective Simulation for quantum communication tasks
In this paper the process of designing quantum communication protocols is viewed as learning by trial and error. This process is visualized in Fig. 1 as an interaction between an RL agent and its environment: by trial and error, the agent manipulates quantum states, thereby constructing communication protocols. At each interaction step the RL agent perceives the current state of the protocol (environment) and chooses one of the available operations (actions). This action modifies the previous version of the protocol and the interaction step ends. In addition to the state of the protocol the agent gets feedback at each interaction step. This feedback is specified by a reward function, which depends on the specific quantum communication task a)-d) in Fig. 1. A reward is interpreted by the RL agent and its memory is updated.
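The interaction loop described above can be sketched in a few lines of Python. This is only an illustration, not the environment used in this work: the class `ToyProtocolEnv`, the action names, and the target sequence are hypothetical. The percept is the sequence of operations applied so far, and a reward is given only when the full target protocol has been built, which mimics the sparse rewards typical for these tasks.

```python
import random


class ToyProtocolEnv:
    """Hypothetical stand-in for a protocol-design environment.

    The percept is the sequence of operations applied so far; a reward of 1
    is given only when that sequence equals a fixed target sequence.
    """

    def __init__(self, actions, target, max_steps=10):
        self.actions = list(actions)
        self.target = tuple(target)
        self.max_steps = max_steps
        self.protocol = ()

    def reset(self):
        self.protocol = ()
        return self.protocol

    def step(self, action):
        self.protocol = self.protocol + (action,)
        solved = self.protocol == self.target
        # The trial ends on success or once the step budget is exhausted.
        done = solved or len(self.protocol) >= self.max_steps
        return self.protocol, (1.0 if solved else 0.0), done


def run_trial(env, choose_action):
    """Run one trial; return (total reward, number of operations used)."""
    percept = env.reset()
    total, done = 0.0, False
    while not done:
        action = choose_action(percept, env.actions)
        percept, reward, done = env.step(action)
        total += reward
    return total, len(percept)


if __name__ == "__main__":
    rng = random.Random(0)
    env = ToyProtocolEnv(["H", "CNOT", "measure"], ["CNOT", "H", "measure"])
    # A random policy only explores; a learning agent would also update its memory.
    results = [run_trial(env, lambda p, acts: rng.choice(acts)) for _ in range(200)]
    print("solved trials:", sum(r for r, _ in results))
```

A learning agent would replace the random policy passed to `run_trial` and use the returned reward to update its memory after each step.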
The described RL approach is used for two reasons. First, there is a similarity between a target quantum communication protocol and a typical RL target. A target quantum communication protocol is a sequence of elementary operations leading to a desired quantum state, whereas a target of an RL agent is a sequence of actions that maximizes the achievable reward. In both cases the solution is therefore a sequence, which makes it natural to assign each elementary quantum operation a corresponding action, and to assign each desired state a reward. Second, the way the described targets are achieved is similar in RL and quantum communication protocols. In both cases an initial search (exploration) over a large number of operation (or action) sequences is needed. This search space can be viewed as a network, where states of a quantum communication environment are vertices, and basic quantum operations are edges. The structure of a complex network, formed in the described way, is similar to the one observed in quantum experiments melnikov2018active , which makes the search problem equivalent to navigation in mazes – a reference problem in RL sutton1998reinforcement ; sutton1990integrated ; mirowski2016learning ; hierarchical2016maze .
It should also be noted that the role of the RL agent goes beyond mere parameter estimation for the following reasons. First, using simple search methods (e.g., a brute-force or a guided search) would fail for the considered problem sizes: e.g. in the teleportation task discussed in section III.1, the number of possible states of the communication environment is at least (footnote 1: in the teleportation task the shortest possible sequence of gates is equal to , and at each step of this sequence there are at least possible gates that can be applied). Second, the RL agent learns in the space of its memory parameters, which is not the case for optimization techniques (e.g., genetic algorithms, simulated annealing, or gradient descent algorithms) that would search directly in the parameter space of communication protocols. Optimizing directly in the space of protocols, which consist of both actions and stochastic environment responses, can only be efficient if the space is sufficiently small sutton1998reinforcement . An additional complication is that reward signals are often sparse in quantum communication tasks, hence the reward gradient is almost always zero, giving optimization algorithms no direction for parameter change. Third, using an optimization technique for constructing an optimal action sequence, ignoring stochastic environment responses, is usually not possible in quantum communication tasks. Because different responses are needed depending on measurement outcomes, there is no single action sequence that achieves an optimal protocol, i.e. there is no single optimal point in the parameter space that an optimization technique could converge to. Nevertheless, there is at least one point in the RL agent’s memory parameter space that achieves an optimal protocol as the RL agent can choose an action depending on the current state of the environment rather than a whole action sequence.
As a learning agent that operates within the RL framework shown in Fig. 1 we use the PS agent briegel2012projective ; mautner2013projective . PS is a physically motivated approach to learning and decision making, which is based on deliberation in the episodic and compositional memory (ECM). The ECM is organized as an adjustable network of memory units, which provides flexibility in constructing different concepts in learning, e.g., meta-learning makmal2016meta and generalization melnikov2017projective . The deliberation within the ECM is based on a random walk process that is not computationally demanding and that, in addition, can be sped up via a quantum walk process kempe2003quantum ; venegasandraca2012quantum , leading to a quadratic speedup in deliberation time paparo2014quantum , which makes the PS model conceptually attractive. Physical implementations of the quantum-enhanced PS agent were proposed by using trapped ions dunjko2015quantum or superconducting circuits friis2015coherent . The quantum-enhanced deliberation was recently implemented, as a proof-of-principle, in a small-scale quantum information processor based on trapped ions sriarunothai2018speeding .
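To make the role of the ECM more concrete, the following is a minimal sketch of a basic two-layer PS agent, in which each percept clip is connected directly to the action clips and the random walk reduces to a single hop. It uses the standard PS rules from the PS literature (action probabilities proportional to the edge weights h, and the update h ← h − γ(h − 1) + g·λ with glow g and reward λ). The class name, parameter defaults, and method split are assumptions made for illustration; this is not the code used in this work.

```python
import random


class BasicPSAgent:
    """Two-layer projective simulation agent (percept clips -> action clips)."""

    def __init__(self, actions, gamma=0.0, eta=1.0, seed=None):
        self.actions = list(actions)
        self.gamma = gamma      # forgetting (damping) parameter
        self.eta = eta          # glow damping parameter
        self.h = {}             # h-values of edges (percept, action)
        self.glow = {}          # glow values of recently used edges
        self.rng = random.Random(seed)

    def _h_row(self, percept):
        # Unseen percepts start with h = 1 on every edge (uniform policy).
        return [self.h.setdefault((percept, a), 1.0) for a in self.actions]

    def probabilities(self, percept):
        row = self._h_row(percept)
        total = sum(row)
        return [h / total for h in row]

    def act(self, percept):
        action = self.rng.choices(self.actions, weights=self._h_row(percept))[0]
        # Damp all glow values, then mark the edge that was just used.
        for edge in self.glow:
            self.glow[edge] *= 1.0 - self.eta
        self.glow[(percept, action)] = 1.0
        return action

    def learn(self, reward):
        # h <- h - gamma * (h - 1) + glow * reward, applied to every known edge.
        for edge, h in self.h.items():
            g = self.glow.get(edge, 0.0)
            self.h[edge] = h - self.gamma * (h - 1.0) + g * reward
```

With `eta=1.0` only the most recently used edge is rewarded; smaller values let a delayed reward reinforce earlier steps of an action sequence, which matters when the reward arrives only at the end of a protocol.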
The use of PS in the design of quantum communication protocols has further advantages compared to other approaches, such as standard tabular RL models, or deep RL networks. First, the PS agent was shown to perform well on problems that, from an RL perspective, are conceptually similar to designing communication networks. In the problems that can be mapped to a navigation problem melnikov2018benchmarking , such as the design of quantum experiments melnikov2018active and the optimization of quantum error correction codes nautrup2018optimizing , PS outperformed methods that were practically used for those problems (and were not based on machine learning). In standard navigation problems, such as the grid world and the mountain car problem, the PS agent shows a performance qualitatively similar to standard tabular RL models of SARSA and Q-learning melnikov2018benchmarking . Second, as was shown in Ref. melnikov2018benchmarking , the computational effort is one to two orders of magnitude lower compared to tabular approaches. The reason for this is a low model complexity: in static task environments the simple PS agent has only one relevant model parameter. This makes it easy to set up the agent for a new complex environment, such as the quantum communication network, where model parameter optimization is costly because of the runtime of the simulations. Third, by construction, the PS decision making can be explained by analyzing graph properties of its ECM. Because of this intrinsic interpretability of the PS model, we are able to properly analyze the outcomes of the learning process.
Next, we show how the PS agent learns quantum communication protocols. The code of the PS agent used in this context is a derivative of a publicly available Python code PScode .
III Learning elementary protocols
We let the agent interact with various environments where the initial states and goals correspond to well-known quantum information protocols. For each of the protocols we will first explain our formulation of the environment and the techniques we used. Then we discuss the solutions the agent finds before finally comparing them to the established protocols. A detailed description of the environments together with additional results can be found in the Appendix.
III.1 Quantum teleportation
The agent is tasked with finding a way to transmit quantum information without directly sending the quantum system to the recipient. As an additional resource a maximally entangled state shared between sender and recipient is available. The agent can apply operations from a (universal) gate set locally. This task challenges the agent, without any prior knowledge, to find the best (shortest) sequence of operations out of a large number of possible action sequences, which grows exponentially with the sequence length.
We describe the learning task as follows: There are two qubits and at the sender’s station and one qubit at the recipient’s station. Initially, the qubits and are in a maximally entangled state and is in an arbitrary input state . The setup is depicted in Fig. 2a. For this setup we consider two cases: the agent is equipped with either a Clifford gate set or a universal gate set. In both cases the agent can perform single-qubit measurements, but multi-qubit operations can only be applied on qubits at the same station (in this case, only between and ). The task is considered to be successfully solved if the qubit at is in state . In order to ensure that this works for all possible input states, instead of using random input states, we make use of the Jamiołkowski fidelity jamiolkowski ; jamfid to evaluate if the protocol proposed by the agent is successful. This means we require that the overlap of the Choi-Jamiołkowski state jamiolkowski corresponding to the effective map generated by the suggested protocol with the Choi-Jamiołkowski state corresponding to the optimal protocol is equal to .
In Fig. 2b the learning curves, i.e., the number of operations the agent applies to reach a solution at each trial, are shown. If no solution is found, the PS agent tries a maximum of operations before the environment is reset and a new trial is started. We see that the average number of operations decreases below , which means that the PS agent finds solutions. The average lengths of these solutions decrease over time as the agent keeps finding better and better solutions based on its experience. We observe that the learning curve converges to some average number of operations in both cases, using a Clifford (blue) and a universal (green) gate set. However, the mean squared deviation does not go to zero. This can be explained by looking at the individual learning curves of two example agents in Fig. 2c: the agent does not arrive at a single solution for this problem setup, but rather four different solutions. These solutions can be summarized as follows (up to different orders of commuting operations):

Apply , where H is the Hadamard gate and CNOT is the controlled-NOT operation.

Measure qubits and in the computational basis.

Depending on the measurement outcomes, either apply , , or (decomposed to the elementary gates of the used gate set) on qubit .
We see four different solutions in Fig. 2c as four horizontal lines, which appear because of the probabilistic nature of the quantum communication environment. The agent learns different sequences of gates because different operations are needed, depending on measurement outcomes the agent has no control over. Four appropriate correction operations of different length (as seen in Fig. 2c), which are needed in order for the agent to successfully transmit quantum information at each trial, complete the protocol. This protocol found by the agent is identical to the well-known quantum teleportation protocol teleportation .
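The structure of this protocol, with one correction operation per measurement outcome, can be checked with a short state-vector simulation of standard teleportation. This is a minimal sketch and not the learning environment itself: the qubit ordering (input, sender half, receiver half), the function names, and the use of a single test state instead of the Jamiołkowski fidelity are assumptions made for brevity.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
# Projectors |0><0| and |1><1| for computational-basis measurements.
P = [np.diag([1, 0]).astype(complex), np.diag([0, 1]).astype(complex)]


def kron(*ops):
    """Tensor product of several operators, leftmost factor first."""
    out = np.eye(1, dtype=complex)
    for op in ops:
        out = np.kron(out, op)
    return out


# CNOT with qubit 0 as control and qubit 1 as target; qubit 2 is untouched.
CNOT_01 = kron(P[0], I2, I2) + kron(P[1], X, I2)


def teleport(psi):
    """Return {(m0, m1): (probability, corrected state of qubit 2)}."""
    bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
    state = np.kron(psi, bell)  # qubit order: input, sender half, receiver half
    state = kron(H, I2, I2) @ (CNOT_01 @ state)
    results = {}
    for m0 in (0, 1):
        for m1 in (0, 1):
            branch = kron(P[m0], P[m1], I2) @ state
            prob = np.vdot(branch, branch).real
            # Receiver's qubit: amplitudes with the measured qubits fixed to (m0, m1).
            out = branch.reshape(2, 2, 2)[m0, m1, :] / np.sqrt(prob)
            # Outcome-dependent correction Z^m0 X^m1 (identity, X, Z, or ZX).
            correction = np.linalg.matrix_power(Z, m0) @ np.linalg.matrix_power(X, m1)
            results[(m0, m1)] = (prob, correction @ out)
    return results
```

Each of the four branches occurs with probability 1/4 and, after its correction, returns the input state on the receiver's qubit. The corrections differ in length, which is consistent with the four distinct solution lengths discussed above.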
Note that because we used the Jamiołkowski fidelity to verify that the protocol implements the teleportation channel for all possible input states, it follows that the same protocol can be used for entanglement swapping if the input qubit at is part of an entangled state.
III.2 Entanglement purification
Noise and imperfections are a fundamental obstacle to distributing entanglement over long distances, so a strategy to deal with these is needed. One possible idea is to use a larger amount of entanglement in the form of multiple Bell pairs, each of which may have been affected by noise during the initial distribution, and try to obtain fewer, less noisy pairs from them. The agent again has to rely on using only local operations at the two different stations that are connected by the Bell pairs.
Specifically, we provide the agent with two noisy Bell pairs as input, where is of the form of . Here and denote the standard Bell basis and is the fidelity with respect to . This starting situation is depicted in Fig. 3a. The agent is tasked with finding a protocol that probabilistically outputs one copy with increased fidelity. However, it is desirable to obtain a protocol that does not only result in an increased fidelity when applied once, but consistently increases the fidelity when applied recurrently, i.e. on two pairs that have been obtained from the previous round of the protocol. In order to make such a recurrent application possible while dealing with probabilistic measurements, identifying the branches that should be reused is an integral part.
To this end, a different technique than before is employed. Rather than simply obtaining a random measurement outcome every time it picks a measurement action, the agent needs to provide potentially different actions for all possible outcomes. The actions taken on all the different branches of the protocol are then evaluated as a whole. This makes it possible to calculate the result of the recurrent application of that protocol separately for each trial. The agent is rewarded according to both the overall success probability of the protocol and the obtained increase in fidelity.
The agent is provided with a Clifford gate set and single-qubit measurements. Qubits labeled are held by one party and those labeled are held by another party. Multi-qubit operations can only be applied on qubits at the same station. The output of each of the branches is enforced to be a state with one qubit on side and one on side along with a decision by the agent whether to consider that branch success or failure for the purpose of iterating the protocol. Since this naturally needs two single-qubit measurements, with two possible outcomes each, there are four branches that need to be considered.
In Fig. 3b we see reward values that agents obtained for the protocols applied to initial states with fidelities of . The reward is normalized such that the entanglement purification protocol presented in Ref. dejmps would obtain a reward of . All the successful protocols found start the same way (up to permutations of commuting operations): they apply followed by measuring qubits and in the computational basis. In some of the protocols two of the previously discussed four branches are marked as successful, while others only mark one particular combination of measurement outcomes. The latter therefore have a smaller probability of success, which is reflected in the reward. However, looking closely at the distribution in Fig. 3b we can see that these cases correspond to two variants with slightly different rewards. Those variants differ in the operations that are applied on the output copies before the next purification step. The variant with slightly lower reward applies the Hadamard gate on both qubits: . The protocol that obtains the full reward of applies and is depicted in Fig. 3c. This protocol is equivalent to the well-known DEJMPS protocol dejmps for an even number of recurrence steps, but requires a shorter action sequence for the gate set provided to the agent. We discuss this solution in more detail, as well as an additional variant of the environment with automatic depolarization after each recurrence step, in Appendix B.
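The common first part of these protocols (a bilateral CNOT followed by computational-basis measurements of one pair, keeping the branches with coinciding outcomes) can be simulated directly on density matrices. The sketch below is an illustration under assumptions that are not taken from this work: Werner states are used as a convenient example input, the qubit order is (A1, B1, A2, B2), gates are perfect, and the additional single-qubit operations applied to the output pair are omitted. All function names are hypothetical.

```python
import numpy as np


def werner(F):
    """Two-qubit Werner state with fidelity F with respect to |Phi+>."""
    phi = np.array([1, 0, 0, 1]) / np.sqrt(2)
    proj = np.outer(phi, phi)
    return F * proj + (1 - F) / 3 * (np.eye(4) - proj)


def cnot(n, control, target):
    """Permutation matrix of a CNOT on n qubits (qubit 0 is the leftmost)."""
    dim = 2 ** n
    U = np.zeros((dim, dim))
    for b in range(dim):
        bits = [(b >> (n - 1 - k)) & 1 for k in range(n)]
        if bits[control]:
            bits[target] ^= 1
        b_out = sum(bit << (n - 1 - k) for k, bit in enumerate(bits))
        U[b_out, b] = 1
    return U


def purification_step(rho1, rho2):
    """Bilateral CNOT, measure the second pair, keep coinciding outcomes.

    Qubit order is (A1, B1, A2, B2). Returns (success probability, output
    state of the first pair).
    """
    rho = np.kron(rho1, rho2)
    U = cnot(4, 1, 3) @ cnot(4, 0, 2)      # A1 -> A2 and B1 -> B2
    rho = U @ rho @ U.T
    r = rho.reshape(4, 4, 4, 4)            # indices: pair1, pair2, pair1', pair2'
    kept = r[:, 0, :, 0] + r[:, 3, :, 3]   # second pair measured as 00 or 11
    p = np.trace(kept).real
    return p, kept / p


def fidelity(rho):
    """Overlap of a two-qubit state with |Phi+>."""
    phi = np.array([1, 0, 0, 1]) / np.sqrt(2)
    return float(phi @ rho @ phi)
```

For two Werner pairs of fidelity 0.7 the step succeeds with probability 0.68 and raises the fidelity to about 0.735. Marking only one of the two coinciding outcomes as successful would halve the success probability, which illustrates why such variants obtain a lower reward.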
III.3 Quantum repeater
Entanglement purification alone certainly increases the distance over which one can distribute an entangled state of sufficiently high fidelity. However, the reachable distance is limited because at some point too much noise will accumulate such that the initial states will no longer have the minimal fidelity required for the entanglement purification protocol. Hence, one splits up the channels into smaller segments. Now the agent has to deal with two such channel segments that distribute noisy Bell pairs with a common station in the middle as depicted in Fig. 4a. In this scenario the challenge for the agent is to use the protocols of the previous sections in order to distribute an entangled state over the whole distance. To this end the agent may use the previously discovered protocols for teleportation/entanglement swapping and entanglement purification as elementary actions, rather than individual gates.
The task is to find a protocol for distributing an entangled state between the two outer stations with a threshold fidelity of at least , all the while using as few initial states as possible. The initial Bell pairs are considered to have initial fidelities of . Furthermore, the CNOT gates used for entanglement purification are considered to be imperfect, which we model as local depolarizing noise with reliability parameter acting on the two qubits involved followed by the perfect CNOT operation eppreview . The effective map is given by:
(1) 
where denotes the local depolarizing noise channel with reliability parameter acting on the th qubit:
(2) 
with , , denoting the Pauli matrices acting on the th qubit.
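For orientation, one common way to write a noise model matching the verbal description above is given below. The symbol names (p for the reliability parameter, D^i for the depolarizing channel on qubit i, and M_CNOT for the effective map) are assumptions chosen here, and the form is the standard one in which, with probability 1 − p, the affected qubit is replaced by the maximally mixed state; it should be checked against Eqs. (1) and (2) of the original text.

```latex
\mathcal{M}_{\mathrm{CNOT}}(p)\,\rho
  = \mathrm{CNOT}\,\bigl[D^{1}(p)\,D^{2}(p)\,\rho\bigr]\,\mathrm{CNOT}^{\dagger},
\qquad
D^{i}(p)\,\rho
  = p\,\rho + \frac{1-p}{4}\Bigl(\rho
    + \sigma_x^{i}\,\rho\,\sigma_x^{i}
    + \sigma_y^{i}\,\rho\,\sigma_y^{i}
    + \sigma_z^{i}\,\rho\,\sigma_z^{i}\Bigr)
```

In this form p = 1 corresponds to a perfect gate, while p = 0 completely depolarizes both qubits before the CNOT is applied.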
While the point of such an approach only begins to show for much longer distances, which we take a look at in Sec. IV, some key concepts can already be observed at small scales.
The agent naturally tends to find solutions that use a small number of actions in an environment that is similar to a navigation problem. However, this is not necessarily desirable here because the amount of resources used, i.e. the number of initial Bell pairs, is the figure of merit in this scenario rather than the number of actions. Therefore an appropriate reward function for this environment takes the used resources into account.
In Fig. 4b the learning curve of the best of 128 agents in terms of resources used is depicted. Looking at the best solutions, the key insight is that it is beneficial to purify the short-distance pairs a few times before connecting them via entanglement swapping even though this way more actions need to be performed by the agent. This solution is in line with the idea of the established quantum repeater protocol Br98 .
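The trade-off between fidelity and resources can be illustrated with a simplified resource count. The sketch below is not the environment or reward function used here; it assumes Werner states, perfect gates, and a return to Werner form after every purification step, and it uses the textbook recurrence and swapping formulas for that case. The function names are hypothetical.

```python
def purify(F):
    """One recurrence step on two Werner pairs of fidelity F (perfect gates).

    Returns (output fidelity, success probability).
    """
    q = (1 - F) / 3
    p = F**2 + 2 * F * q + 5 * q**2
    return (F**2 + q**2) / p, p


def swap(F1, F2):
    """Fidelity after entanglement swapping of two Werner pairs (perfect gates)."""
    return F1 * F2 + (1 - F1) * (1 - F2) / 3


def purify_then_swap(F, rounds):
    """Purify each of two links `rounds` times, then swap.

    Returns (final fidelity, expected number of elementary pairs consumed).
    """
    cost = 1.0
    for _ in range(rounds):
        F, p = purify(F)
        # Each attempt consumes two input pairs; on average 1/p attempts are needed.
        cost = 2 * cost / p
    return swap(F, F), 2 * cost


if __name__ == "__main__":
    for rounds in range(4):
        fid, pairs = purify_then_swap(0.9, rounds)
        print(f"rounds={rounds}  fidelity={fid:.4f}  expected pairs={pairs:.1f}")
```

Each additional purification round raises the final fidelity but more than doubles the expected number of elementary pairs, so a reward based on resources has to balance the two. With imperfect gates, as considered in the text, the achievable fidelity is additionally bounded.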
IV Scaling quantum repeater
The point of the quantum repeater lies in its scaling behavior which only starts to show when considering longer distances than just two links. This means we have to consider starting situations of variable length as depicted in Fig. 1d using the same error model as described in section III.3. In order to distribute entanglement over varying distances, the agent needs to come up with a scalable scheme. However, both the action space and the length of action sequences required to find a solution would quickly grow unmanageable with increasing distances. Furthermore, an RL agent learns a solution for a particular situation and problem size rather than finding a universal concept that can be transferred to similar starting situations and larger scales.
To overcome these restrictions, we provide the agent with the ability to effectively outsource finding solutions for distributing an entangled pair over a short distance and reuse them as elementary actions for the larger setting. This means that, as a single action, the agent can instruct multiple subagents to come up with a solution for a small distance and then pick the best action sequence among those solutions. This process is illustrated in Fig. 3a.
Again, the aim is to come up with a protocol that distributes an entangled pair over a long distance with sufficiently high fidelity, while using as few resources as possible.
IV.1 Symmetric protocols
First, we take a look at a symmetric variant of this setup: The initial situation is symmetric and the agent is only allowed to perform actions in a symmetric way. If it applies one step of an entanglement purification protocol on one of the initial pairs, all the other pairs need to be treated in the same way. Similarly, entanglement swapping is always performed at every second station that is still connected to other stations. In Fig. 3bc the results for various lengths of Bell pairs with an initial fidelity of are shown. We compare the solutions that the agent found with a strategy that repeatedly purifies all pairs up to a chosen working fidelity followed by entanglement swapping (see Appendix D.3). For lengths greater than 8 repeater links, the agent still finds a solution with desirable scaling behavior while only using slightly more resources.
IV.2 Asymmetric setup
The more interesting scenario is when the initial Bell pairs are subjected to different levels of noise, e.g. when the physical channels between stations are of different length or quality. In this scenario symmetric protocols are not optimal.
We consider the following scenario: 9 repeater stations connected via links that can distribute Bell pairs of different initial fidelities . In Fig. 3d the learning curve in terms of resources for the agent that can delegate work to subagents is shown. The gate reliability of the CNOT gates used in the entanglement purification protocol is . The obtained solution is compared to the resources needed for a protocol that does not take into account the asymmetric nature of this situation and that is also used as an initial guess for the reward function (see Appendix D.3 for additional details of that approach). Clearly the solution found by the RL agent is preferable to the protocol tailored to symmetric situations. Fig. 3e shows how that advantage scales for different gate reliability parameters .
V Summary and Outlook
We have demonstrated that reinforcement learning can serve as a highly versatile and useful tool in the context of quantum communication. When provided with a sufficiently structured task environment including an appropriately chosen reward function, the learning agent will retrieve (effectively rediscover) basic quantum communication protocols like teleportation, entanglement purification, and the quantum repeater. We have developed methods to state challenges that occur in quantum communication as RL problems in a way that offers very general tools to the agent while ensuring that relevant figures of merit are optimized.
We have shown that stating the considered challenges as an RL problem is beneficial and offers advantages over using optimization techniques as discussed in section II.
Regarding the question to what extent programs can help us in finding genuinely new schemes for quantum communication, it has to be emphasized that a significant part of the work consists in asking the right questions and identifying the relevant resources, both of which are central to the formulation of the task environment and are provided by researchers. However, it should also be noted that not every aspect of designing the environment is necessarily a crucial addition and many details of the implementation are simply an acknowledgment of practical limitations like computational runtimes. When provided with a properly formulated task, a learning agent can play a helpful, assisting role in exploring the possibilities.
In fact, we used the PS agent in this way to demonstrate that the application of machine learning techniques to quantum communication is not limited to rediscovering existing protocols. The PS agent finds adapted and optimized solutions in situations that lack certain symmetries assumed by the basic protocols, such as equal qualities of the physical channels connecting different stations. We extended the PS model to include the concept of delegating parts of the solution to other agents, which allows the agent to effectively deal with problems of larger size. Using this new capability for long-distance quantum repeaters with asymmetrically distributed channel noise, the agent comes up with novel and practically relevant solutions.
We are confident that the presented approach can be extended to more complex scenarios. We believe that reinforcement learning can become a practical tool for quantum communication problems that do not have a rich spectrum of existing protocols, such as designing quantum networks, especially if the underlying network structure is irregular.
Acknowledgments
J.W. and W.D. were supported by the Austrian Science Fund (FWF) through Grants No. P28000N27 and P30937N27. A.A.M. and H.J.B. were supported by the FWF through the SFB BeyondC P02. A.A.M. acknowledges funding by the Swiss National Science Foundation (SNSF), through the Grant PP00P2179109 and by the Army Research Laboratory Center for Distributed Quantum Information via the project SciNet. H.J.B. was supported by the Ministerium für Wissenschaft, Forschung, und Kunst Baden-Württemberg (AZ: 337533.3010/41/1).
References
 (1) H. J. Kimble, “The quantum internet,” Nature, vol. 453, no. 7198, pp. 1023–1030, 2008.
 (2) S. Wehner, D. Elkouss, and R. Hanson, “Quantum internet: A vision for the road ahead,” Science, vol. 362, no. 6412, 2018.
 (3) P. W. Shor and J. Preskill, “Simple Proof of Security of the BB84 Quantum Key Distribution Protocol,” Phys. Rev. Lett., vol. 85, pp. 441–444, 2000.
 (4) R. Renner, “Security of Quantum Key Distribution,” PhD thesis, ETH Zurich, 2005.
 (5) Y.B. Zhao and Z.Q. Yin, “Apply current exponential de Finetti theorem to realistic quantum key distribution,” International Journal of Modern Physics: Conference Series, vol. 33, p. 1460370, 2014.
 (6) D. Gottesman and H.K. Lo, “Proof of security of quantum key distribution with two-way classical communications,” IEEE Transactions on Information Theory, vol. 49, no. 2, pp. 457–475, 2003.
 (7) H.K. Lo, “A simple proof of the unconditional security of quantum key distribution,” Journal of Physics A: Mathematical and General, vol. 34, no. 35, p. 6957, 2001.
 (8) C. H. Bennett, G. Brassard, C. Crépeau, R. Jozsa, A. Peres, and W. K. Wootters, “Teleporting an unknown quantum state via dual classical and EinsteinPodolskyRosen channels,” Phys. Rev. Lett., vol. 70, pp. 1895–1899, 1993.
 (9) C. H. Bennett, G. Brassard, S. Popescu, B. Schumacher, J. A. Smolin, and W. K. Wootters, “Purification of Noisy Entanglement and Faithful Teleportation via Noisy Channels,” Phys. Rev. Lett., vol. 76, pp. 722–725, 1996.
 (10) D. Deutsch, A. Ekert, R. Jozsa, C. Macchiavello, S. Popescu, and A. Sanpera, “Quantum Privacy Amplification and the Security of Quantum Cryptography over Noisy Channels,” Phys. Rev. Lett., vol. 77, pp. 2818–2821, 1996.
 (11) W. Dür and H. J. Briegel, “Entanglement purification and quantum error correction,” Rep. Prog. Phys., vol. 70, no. 8, p. 1381, 2007.
 (12) H.-J. Briegel, W. Dür, J. I. Cirac, and P. Zoller, “Quantum Repeaters: The Role of Imperfect Local Operations in Quantum Communication,” Phys. Rev. Lett., vol. 81, pp. 5932–5935, 1998.
 (13) S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. New Jersey: Prentice Hall, 3rd ed., 2010.
 (14) R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT press, 2nd ed., 2017.
 (15) M. Wiering and M. van Otterlo, eds., Reinforcement learning: State of the Art. Adaptation, Learning, and Optimization, vol. 12, Berlin, Germany: Springer, 2012.
 (16) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
 (17) D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7597, pp. 484–489, 2016.
 (18) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
 (19) H. J. Briegel and G. De las Cuevas, “Projective simulation for artificial intelligence,” Sci. Rep., vol. 2, p. 400, 2012.
 (20) S. Hangl, E. Ugur, S. Szedmak, and J. Piater, “Robotic playing for hierarchical complex skill learning,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 2799–2804, 2016.
 (21) S. Hangl, V. Dunjko, H. J. Briegel, and J. Piater, “Skill learning by autonomous robotic playing using active learning and creativity,” arXiv:1706.08560, 2017.
 (22) L. Orseau, T. Lattimore, and M. Hutter, “Universal knowledge-seeking agents for stochastic environments,” in Algorithmic Learning Theory (S. Jain, R. Munos, F. Stephan, and T. Zeugmann, eds.), pp. 158–172, Springer Berlin Heidelberg, 2013.
 (23) H. J. Briegel, “On creative machines and the physical origins of freedom,” Sci. Rep., vol. 2, p. 522, 2012.
 (24) A. A. Melnikov, H. Poulsen Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, “Active learning machine learns to create new quantum experiments,” Proc. Natl. Acad. Sci. U.S.A., vol. 115, no. 6, pp. 1221–1226, 2018.
 (25) R. S. Sutton, “Integrated architectures for learning, planning, and reacting based on approximating dynamic programming,” in Proceedings of the 7th International Conference on Machine Learning, pp. 216–224, 1990.
 (26) P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell, “Learning to navigate in complex environments,” arXiv:1611.03673, 2016.
 (27) T. Mannucci and E.-J. van Kampen, “A hierarchical maze navigation algorithm with reinforcement learning and mapping,” in Proc. IEEE Symposium Series on Computational Intelligence, 2016.
 (28) In the teleportation task the shortest possible sequence of gates is equal to , and at each step of this sequence there are at least possible gates that can be applied.
 (29) J. Mautner, A. Makmal, D. Manzano, M. Tiersch, and H. J. Briegel, “Projective simulation for classical learning agents: a comprehensive investigation,” New Gener. Comput., vol. 33, no. 1, pp. 69–114, 2015.
 (30) A. Makmal, A. A. Melnikov, V. Dunjko, and H. J. Briegel, “Meta-learning within projective simulation,” IEEE Access, vol. 4, pp. 2110–2122, 2016.
 (31) A. A. Melnikov, A. Makmal, V. Dunjko, and H. J. Briegel, “Projective simulation with generalization,” Sci. Rep., vol. 7, p. 14430, 2017.
 (32) J. Kempe, “Quantum random walks: An introductory overview,” Contemp. Phys., vol. 44, no. 4, pp. 307–327, 2003.
 (33) S. E. Venegas-Andraca, “Quantum walks: a comprehensive review,” Quantum Information Processing, vol. 11, no. 5, pp. 1015–1106, 2012.
 (34) G. D. Paparo, V. Dunjko, A. Makmal, M. A. Martin-Delgado, and H. J. Briegel, “Quantum speedup for active learning agents,” Phys. Rev. X, vol. 4, p. 031002, 2014.
 (35) V. Dunjko, N. Friis, and H. J. Briegel, “Quantum-enhanced deliberation of learning agents using trapped ions,” New J. Phys., vol. 17, no. 2, p. 023006, 2015.
 (36) N. Friis, A. A. Melnikov, G. Kirchmair, and H. J. Briegel, “Coherent controlization using superconducting qubits,” Sci. Rep., vol. 5, p. 18036, 2015.
 (37) T. Sriarunothai, S. Wölk, G. S. Giri, N. Friis, V. Dunjko, H. J. Briegel, and C. Wunderlich, “Speeding-up the decision making of a learning agent using an ion trap quantum processor,” Quantum Sci. Technol., vol. 4, no. 1, p. 015014, 2018.
 (38) A. A. Melnikov, A. Makmal, and H. J. Briegel, “Benchmarking projective simulation in navigation problems,” IEEE Access, vol. 6, pp. 64639–64648, 2018.
 (39) H. Poulsen Nautrup, N. Delfosse, V. Dunjko, H. J. Briegel, and N. Friis, “Optimizing quantum error correction codes with reinforcement learning,” arXiv:1812.08451, 2018.
 (40) “Projective simulation GitHub repository.” github.com/qicibk/projectivesimulation. Accessed: 2019-02-23.
 (41) A. Jamiołkowski, “Linear transformations which preserve trace and positive semidefiniteness of operators,” Rep. Math. Phys., vol. 3, no. 4, pp. 275–278, 1972.
 (42) A. Gilchrist, N. K. Langford, and M. A. Nielsen, “Distance measures to compare real and ideal quantum processes,” Phys. Rev. A, vol. 71, p. 062310, 2005.
 (43) D. Gottesman, “Theory of fault-tolerant quantum computation,” Phys. Rev. A, vol. 57, pp. 127–137, 1998.
 (44) P. O. Boykin, T. Mor, M. Pulver, V. Roychowdhury, and F. Vatan, “On universal and fault-tolerant quantum computing: A novel basis and a new constructive proof of universality for Shor’s basis,” in Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS ’99, (Washington, DC, USA), pp. 486–, IEEE Computer Society, 1999.
Appendix A Quantum teleportation
A.1 Environment description
Fig. 2a depicts the setup. Qubit is initialized as part of an entangled state in order to facilitate measuring the Jamiołkowski fidelity later on. Qubits and are in a state.
Goal: The state of should be teleported to . We measure this by calculating the Jamiołkowski fidelity of the effective channel applied by the action sequences. That means we calculate the overlap of with the reduced state to determine whether the goal has been reached.
Actions: The following actions are allowed:

Depending on the specification of the task either for a Clifford gate set or for a universal gate set, on each qubit. (3 actions)

The Hadamard gate on each qubit (3 actions)

The CNOT gate on the two qubits at location A. (1 action)

Z-measurements on each qubit. (3 actions)
In total: 10 actions. Note that the Clifford group is generated by , and CliffordGenerators and replacing with makes it a universal gate set universalgateset . The measurements are modeled as destructive measurements, which means that operations acting on that qubit are no longer available, thereby reducing the number of actions that the agent can choose.
Percepts: The agent only uses the previous actions of the current trial as a percept. Z-measurements with different outcomes will produce different percepts.
Reward: If Goal is reached, and the trial ends. Else .
A.2 Discussion
One central complication in this scenario is that the entanglement is a limited resource. If the entanglement is destroyed without accomplishing anything (e.g., measuring qubit as the first action), then the goal can no longer be reached no matter what the agent tries afterwards. This is a feature that distinguishes this setup and other quantum environments with irreversible operations from a simple navigation problem in a maze. Instead this is more akin to a navigational problem, where there are numerous cliffs that the agent can fall down, but never get back up again, which means that the agent could be permanently separated from the goal.
An alternative formulation that would make the goal reachable at each trial even if a wrong irreversible action was taken would be to provide the agent with a reset action that resets the environment to the initial state. A different class of percepts would need to be used in this case.
With prior knowledge of the quantum teleportation protocol it is easy to understand why this problem structure favors an RL approach. The shortest solution for one particular combination of measurement outcomes takes only four actions and these actions are to be performed regardless of which correction operation needs to be applied. That means once this simple solution is found it significantly reduces the complexity of finding the other solutions, as now only the correction operations need to be found.
Compare this to searching for this solution by brute-forcing action sequences. For the universal gate set we know that the most complicated of the four solutions takes at least actions. Ignoring the measurement actions for now, as they only reduce the number of available actions, there are possible action sequences. So, we would have to try at least sequences, which is vastly more than the few hundred thousand trials needed by the agent.
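The structure of the teleportation solution discussed above can be checked with a small state-vector simulation. The following is a minimal sketch, not the environment used for the agent: the qubit ordering, the function names, and the choice of the phase gate P = diag(1, i) as the single-qubit generator are assumptions made for illustration. It applies the four-action core (CNOT, Hadamard, two Z-measurements), adds the outcome-dependent corrections built only from H and P, and evaluates the Jamiołkowski fidelity as the overlap of the reference-plus-receiver state with a maximally entangled state.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
P = np.diag([1, 1j])                         # phase gate, assumed generator
BELL = np.array([1, 0, 0, 1]) / np.sqrt(2)   # maximally entangled state |Phi+>

# Illustrative qubit order: 0 = reference, 1 = input qubit at A,
# 2 = Bell-pair half at A, 3 = Bell-pair half at B.
N = 4

def apply_1q(state, gate, q):
    """Apply a single-qubit gate to qubit q of an N-qubit state vector."""
    psi = np.tensordot(gate, state.reshape([2] * N), axes=([1], [q]))
    return np.moveaxis(psi, 0, q).reshape(-1)

def apply_cnot(state, control, target):
    """Flip the target qubit on the subspace where the control is 1."""
    psi = state.reshape([2] * N).copy()
    idx = [slice(None)] * N
    idx[control] = 1
    axis = target if target < control else target - 1
    psi[tuple(idx)] = np.flip(psi[tuple(idx)], axis=axis).copy()
    return psi.reshape(-1)

def measure_z(state, q, outcome):
    """Project qubit q onto a Z outcome; return (normalized state, probability)."""
    psi = state.reshape([2] * N).copy()
    idx = [slice(None)] * N
    idx[q] = 1 - outcome
    psi[tuple(idx)] = 0
    prob = float(np.vdot(psi, psi).real)
    return psi.reshape(-1) / np.sqrt(prob), prob

def teleport(m_a, m_b):
    """Run the protocol for fixed outcomes (m_a, m_b); return (fidelity, probability)."""
    state = np.kron(BELL, BELL).astype(complex)
    state = apply_cnot(state, control=1, target=2)
    state = apply_1q(state, H, 1)
    state, p1 = measure_z(state, 1, m_a)
    state, p2 = measure_z(state, 2, m_b)
    if m_b:                      # X correction written as H P P H
        for gate in (H, P, P, H):
            state = apply_1q(state, gate, 3)
    if m_a:                      # Z correction written as P P
        for gate in (P, P):
            state = apply_1q(state, gate, 3)
    reduced = state.reshape([2] * N)[:, m_a, m_b, :].reshape(-1)
    return abs(np.vdot(BELL, reduced)) ** 2, p1 * p2

for outcome in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(outcome, teleport(*outcome))
```

For the outcome (0, 0) no correction is needed, which corresponds to the four-action solution; the other three branches differ only in the corrections applied at the receiving side.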
Appendix B Entanglement purification
B.1 Environment description
Fig. 3a shows the initial setup with and sharing two Bell pairs with initial fidelity .
Goal: Find an action sequence that results in a protocol that improves the fidelity of the Bell pair when applied recurrently. This means that two copies of the two-qubit state resulting from one successful application of the protocol are taken as the input states for the next application of the protocol.
Actions: The following actions are available:

on each qubit (4 actions)

, the Hadamard gate on each qubit (4 actions)

The CNOT gates and on qubits at the same location. (2 actions)

Z-measurements on each qubit (4 actions)

Accept/Reject (2 actions)
In total: 16 actions. Note that these gates generate the Clifford group. (We tried different variants of generating the gate set, as the choice of basis is not fundamental; the one with gave the best results.) The measurements are modeled as destructive measurements, which means that operations acting on that qubit are no longer available, thereby reducing the number of actions that the agent can choose. In order to reduce the huge action space further, the requirement that the final state of one sequence of gates needs to be a two-qubit state shared between and is enforced by removing actions that would destructively measure all qubits on one side. The accept and reject actions are essential because they allow identifying successful branches.
Percepts: The agent only uses the previous actions of the current trial as a percept. Z-measurements with different outcomes will produce different percepts.
Reward: The protocol suggested by the agent is performed recurrently ten times. This is done to ensure that the solution found is a viable protocol for recurrent application, because it is possible that a single step of the protocol might increase the fidelity but further applications of the protocol could undo that improvement. The reward function is given by where is the success probability (i.e. the combined probability of the accepted branches) of the th step, is the increase in fidelity after ten steps and the constant is chosen such that the known protocols bbpssw or dejmps would receive a reward of .
Problem-specific techniques: To evaluate the performance of an entanglement purification protocol that is applied in a recurrent fashion it is necessary to know which actions are performed and especially whether the protocol should be considered successful for all possible measurement outcomes. Therefore, it is not sufficient to use the same approach as for the teleportation challenge and simply consider one particular measurement outcome for each trial. Instead, the agent is required to choose actions for all possible measurement outcomes every time it chooses a measurement action. This means we keep track of multiple separate branches (and the associated probabilities) with different states of the environment. The average density matrix of the branches that the agent decides to keep is the state that is used for the next purification step. We choose to do it this way because it allows us to obtain a complete protocol that can be evaluated at each trial and the agent is rewarded according to the performance of the whole protocol.
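The branch bookkeeping described above can be illustrated with a short sketch. It assumes that each branch is stored as a triple of probability, normalized density matrix, and an accept flag, and that the "average" is the probability-weighted mixture of the accepted branches; both the data layout and the name combine_accepted are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def combine_accepted(branches):
    """Combine measurement branches into the input state for the next step.

    branches: iterable of (probability, density_matrix, accepted) triples,
    where each density matrix is normalized to trace 1 (assumed layout).
    Returns (averaged density matrix, success probability); the state is
    None if no branch was accepted.
    """
    kept = [(p, rho) for p, rho, accepted in branches if accepted]
    p_success = sum(p for p, _ in kept)
    if p_success == 0:
        return None, 0.0
    # Probability-weighted mixture, renormalized by the total accepted weight.
    rho_out = sum(p * rho for p, rho in kept) / p_success
    return rho_out, p_success

# Usage: four equally likely branches, two of which are accepted.
rho_a = np.diag([0.8, 0.2, 0.0, 0.0])
rho_b = np.diag([0.6, 0.0, 0.4, 0.0])
rho_c = np.eye(4) / 4
branches = [(0.25, rho_a, True), (0.25, rho_b, True),
            (0.25, rho_c, False), (0.25, rho_c, False)]
rho_next, p_success = combine_accepted(branches)
print(p_success, np.trace(rho_next))
```

The success probability returned here is the combined probability of the accepted branches, which is the quantity that enters the reward at each recurrence step.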
B.2 Discussion
As discussed in the main text, the agent found an entanglement purification protocol that is equivalent to the DEJMPS protocol dejmps for an even number of purification steps.
Let us briefly recap how the DEJMPS protocol works: Initially we have two copies of a state that is diagonal in the Bell basis and can be written with coefficients :
(3)  
The effect of the multilateral CNOT operation followed by measurements in the computational basis on and and postselected for coinciding measurement results is:
(4)  
where denote the new coefficients after the procedure and is a normalization constant and also the probability of success. Without any additional intervention, applying this map recurrently amplifies not only the desired coefficient (the fidelity), but both and .
To avoid this and only amplify the fidelity with respect to , the DEJMPS protocol calls for the application of on both copies of before applying the multilateral CNOTs and performing the measurements. The effect of this operation is to exchange the two coefficients and thus preventing the unwanted amplification of . So the effective map at each entanglement purification step is the following:
(5)  
with .
In contrast, the solution found by the agent calls for to be applied, which exchanges two different coefficients and instead for an effective map:
(6)  
and . Note that the maps (5) and (6) are identical except that the roles of and are exchanged. It is clear that applying the agent’s map twice will have the same effect as applying the DEJMPS protocol twice, which means that for an even number of recurrence steps they are equivalent.
As a side note, the other protocol that was found by the agent as described in the main text, applies such an additional operation before each entanglement purification step as well: Applying on exchanges and . This also yields a successful entanglement purification protocol, however with a slightly worse performance.
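The difference between the bare CNOT-and-measure map and the DEJMPS recurrence recapped above can be made concrete numerically. The sketch below uses the standard recurrence for Bell-diagonal states with noiseless operations; the ordering of the four weights as (Φ+, Φ−, Ψ+, Ψ−) and the function names are assumptions chosen for this illustration and need not match the notation of Eqs. (3) to (5). The agent's variant of the map is not modeled here.

```python
def cnot_coincidence_map(c):
    """Bilateral CNOT plus Z-measurement with coinciding results, on two copies.

    c = (phi_plus, phi_minus, psi_plus, psi_minus): Bell-diagonal weights
    (ordering assumed for this sketch). Returns (new weights, success probability).
    """
    a, b, x, y = c
    norm = (a + b) ** 2 + (x + y) ** 2
    new = (a * a + b * b, 2 * a * b, x * x + y * y, 2 * x * y)
    return tuple(v / norm for v in new), norm

def dejmps_step(c):
    """DEJMPS step: exchange the phi_minus and psi_minus weights, then apply the map."""
    a, b, x, y = c
    return cnot_coincidence_map((a, y, x, b))

def iterate(step, c, n):
    """Apply a recurrence step n times and return the final weights."""
    for _ in range(n):
        c, _p = step(c)
    return c

werner = (0.7, 0.1, 0.1, 0.1)
print("bare map:", iterate(cnot_coincidence_map, werner, 10)[0])
print("DEJMPS:  ", iterate(dejmps_step, werner, 10)[0])
```

Under these assumptions, the bare map drives the state toward an equal mixture of two Bell states, so the fidelity stalls near one half, while the DEJMPS step lets the fidelity approach one.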
B.3 Automatic depolarization variant
We also investigated a variant where after each purification step, the state is automatically depolarized before the protocol is applied again. That means if the first step brought the state up to the new fidelity it is then brought to the form: . This can always be achieved without changing the fidelity bbpssw .
In Fig. 6 the obtained reward for 100 agents for this alternative scenario is shown. The successful protocols consist of applying followed by measuring qubits and in the computational basis. Optionally, some additional local operations that do not change the fidelity itself can be added as the effect of those is undone by the automatic depolarization. Similar to the scenario in the main text, there are some solutions that only accept one branch as successful, which means they only get half the reward as the success probability at each step is halved (center peak in Fig. 6). The protocols for which two relevant branches are accepted are equivalent to the entanglement purification protocol presented in bbpssw .
Appendix C Quantum repeater
C.1 Environment description
The setup is depicted in Fig. 4a. The repeater stations share entangled states via noisy channels with their neighbors, which results in pairs with an initial fidelity of . The previous two protocols now are available as the elementary actions for this more complex scenario.
Goal: Entangled pair between the leftmost and the rightmost station with fidelity above threshold fidelity .
Actions:

Purify a pair with one entanglement purification step. (Pair 1, Pair 2, the pair that arises from entanglement swapping at station I)

Entanglement swapping at the middle station
We use the protocol in bbpssw for this as it is computationally easier to handle. For a practical application it would be advisable to use a more efficient entanglement purification protocol.
Percepts: Current position of the pairs and fidelity of each pair.
Reward function: . The whole path is rewarded in full. The reward constant is obtained from an initial guess using the working-fidelity strategy described in D.3.
Appendix D Scaling repeater
D.1 Environment description
In addition to the elementary actions from the distance-2 quantum repeater discussed above, we provide the agent with the ability to delegate solving smaller-scale problems of the same type to other agents, thereby splitting the problem into smaller parts. Then, the found sequence of actions is applied as one block action as illustrated in Fig. 3a.
Goal: Entangled pair between the leftmost and the rightmost station with fidelity above threshold fidelity .
Actions:

Purify a pair with one entanglement purification step.

Entanglement swapping at a station

Block actions of shorter lengths
So for the setup with repeater links, initially there are purification actions and entanglement swapping actions. Of course, the purification actions have to be adjusted every time an entanglement swapping is performed to include the new, longer-distance pair. The block actions can be applied at different locations, e.g., a length-two block action can initially be applied at different positions (which also have to be adjusted to include longer-distance pairs as entanglement swapping actions are chosen). So it is easy to see how the action space quickly gets much larger as increases.
Percepts: Current position of the pairs and fidelity of each pair.
Reward: Again we use the resource-based reward function as this is the metric we would like to optimize. . The whole path is rewarded in full. The reward constant is obtained from an initial guess (see D.3) and adjusted downward once a better solution is found such that the maximum possible reward from one trial is .
Comment on block actions: The main agent can use block actions for a wide variety of situations at different stages of the protocol. This means the subagents are tasked with finding block actions for a wide variety of initial fidelities, so a new problem needs to be solved for each new situation. In order to speed up the trials we save situations that have already been solved by subagents in a big table and reuse the found action sequence if a similar situation arises.
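The table of solved situations can be pictured as a cache keyed by the (coarse-grained) fidelities of the pairs a block action is applied to. The sketch below is only an illustration: the class name, the solver callback, and in particular the rule that two situations count as "similar" when their fidelities agree after rounding to a fixed resolution are assumptions, since the similarity criterion is not specified here.

```python
class BlockActionCache:
    """Reuse action sequences found by sub-agents for similar situations."""

    def __init__(self, solver, resolution=0.01):
        self._solver = solver          # hypothetical callback: fidelities -> action sequence
        self._resolution = resolution  # assumed notion of "similar": same rounded fidelities
        self._table = {}

    def _key(self, fidelities):
        # Coarse-grain each fidelity so that nearby situations share one entry.
        return tuple(round(f / self._resolution) for f in fidelities)

    def lookup(self, fidelities):
        """Return a stored action sequence, calling the solver only on a cache miss."""
        key = self._key(fidelities)
        if key not in self._table:
            self._table[key] = self._solver(fidelities)
        return self._table[key]

# Usage with a stand-in solver that records how often it is called.
calls = []

def dummy_solver(fidelities):
    calls.append(tuple(fidelities))
    return ["purify pair 1", "purify pair 2", "swap"]

cache = BlockActionCache(dummy_solver, resolution=0.01)
cache.lookup([0.90, 0.70])
cache.lookup([0.9004, 0.7003])   # treated as the same situation, no new solver call
cache.lookup([0.93, 0.70])       # different situation, solver is called again
print(len(calls))
```

A coarser resolution trades solution quality for fewer sub-agent training runs; the value used above is arbitrary.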
D.1.1 Symmetric variant
We force a symmetric protocol by modifying the actions as follows:
Actions:

Purify all pairs with one entanglement purification step.

Entanglement swapping at every second active station

Block actions of shorter lengths, that have been obtained in the same, symmetrized manner.
D.2 Additional results and discussion
We also investigated different starting situations for this setup. Here we discuss two of them:
First, we applied the agent that is not restricted to symmetric protocols to a symmetric starting situation. The results for initial fidelities can be found in Fig. 7ab. In general, the agent finds solutions that are very close but not equal to the working-fidelity strategy described in D.3. Remarkably, for some reliability parameters the agent even finds a slightly better solution, either by slightly reordering the operations or by exploiting a threshold effect, where omitting an entanglement purification step on one of the pairs is still enough to reach the desired threshold fidelity.
Second, we looked at a situation that is highly asymmetric with starting fidelities (0.95, 0.9, 0.6, 0.9, 0.95, 0.95, 0.9, 0.6). Thus there are high-quality links on most connections, but two links suffer from very high levels of noise. The results depicted in Fig. 7cd show that the advantage over a working-fidelity strategy is even more pronounced.
D.3 Working-fidelity strategy
This is the strategy we use to determine the reward constants for the quantum repeater environments; it was presented in Br98 . This strategy leads to a resource requirement per repeater station that grows logarithmically with the distance.
For repeater lengths with links it is a fully nested scheme and can therefore be stated easily:

1. Pick a working fidelity .

2. Purify all pairs until their fidelity is .

3. Perform entanglement swapping at every second active station such that there are half as many repeater links left.

4. Repeat from step 2 until only one pair remains (and therefore the outermost stations are connected).
We then optimize the choice of such that the resources are minimized for the given scenario.
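The nested strategy and the optimization over the working fidelity can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to obtain the reward constants: local operations are taken to be noiseless, all pairs are treated as Werner states, purification uses the BBPSSW recurrence, resources are counted as the expected number of elementary pairs, and the candidate working fidelities are scanned on a fixed grid. All function names are hypothetical.

```python
def purify(f):
    """One BBPSSW step on two Werner pairs of fidelity f (noiseless gates assumed)."""
    g = (1.0 - f) / 3.0
    p_success = f * f + 2.0 * f * g + 5.0 * g * g
    return (f * f + g * g) / p_success, p_success

def swap(f1, f2):
    """Fidelity after ideal entanglement swapping of two Werner pairs."""
    return f1 * f2 + (1.0 - f1) * (1.0 - f2) / 3.0

def working_fidelity_strategy(fidelities, f_work, max_steps=100):
    """Nested scheme for a power-of-two number of links.

    Returns (final fidelity, expected number of elementary pairs), or None if
    the working fidelity cannot be reached.
    """
    pairs = [(f, 1.0) for f in fidelities]   # (fidelity, expected resources)
    while len(pairs) > 1:
        purified = []
        for f, cost in pairs:                # step 2: purify up to f_work
            steps = 0
            while f < f_work:
                if steps == max_steps or f <= 0.5:
                    return None
                f, p = purify(f)
                cost = 2.0 * cost / p        # two input pairs per attempt
                steps += 1
            purified.append((f, cost))
        pairs = [                            # step 3: swap at every second station
            (swap(purified[i][0], purified[i + 1][0]),
             purified[i][1] + purified[i + 1][1])
            for i in range(0, len(purified), 2)
        ]
    return pairs[0]

def optimize_working_fidelity(fidelities, f_target):
    """Scan a grid of working fidelities; keep the cheapest one reaching f_target."""
    best = None
    for k in range(80):
        f_work = 0.6 + 0.005 * k
        result = working_fidelity_strategy(fidelities, f_work)
        if result is None:
            continue
        f_final, cost = result
        if f_final >= f_target and (best is None or cost < best[2]):
            best = (f_work, f_final, cost)
    return best

print(optimize_working_fidelity([0.9, 0.9, 0.9, 0.9], f_target=0.8))
```

In this reading of the strategy the loop ends as soon as a single pair remains, so the final pair is not purified again; whether the target is met is checked on the fidelity after the last swap.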
As we are dealing with repeater lengths that are not a power of 2 as part of the delegated subsystems discussed in the main text, the strategy is adjusted as follows for those cases.

1. Pick a working fidelity .

2. Purify all pairs until their fidelity is .

3. Perform entanglement swapping at the station with the smallest combined distance of its left and right pair (e.g. 2 links + 3 links). If multiple stations are equal in that regard, pick the leftmost station.

4. Repeat from step 2 until only one pair remains (and therefore the outermost stations are connected).
Then, we again optimize the choice of such that the resources are minimized for the given scenario.
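The station-selection rule of the adjusted strategy can be written down compactly. The sketch below only tracks which pairs are connected and in which order; purification between swaps is omitted, and the representation of a repeater chain as a list of pair lengths (in units of elementary links) as well as the function names are illustrative assumptions.

```python
def pick_swap_station(pair_lengths):
    """Choose the station whose left and right pairs have the smallest combined length.

    pair_lengths[i] is the number of elementary links spanned by pair i;
    station i sits between pair i and pair i + 1. Ties go to the leftmost station.
    """
    combined = [pair_lengths[i] + pair_lengths[i + 1]
                for i in range(len(pair_lengths) - 1)]
    return combined.index(min(combined))   # index() returns the leftmost minimum

def swap_order(num_links):
    """Return the list of pair lengths after each swap until one pair remains."""
    lengths = [1] * num_links
    history = []
    while len(lengths) > 1:
        s = pick_swap_station(lengths)
        lengths[s:s + 2] = [lengths[s] + lengths[s + 1]]   # merge the two pairs
        history.append(list(lengths))
    return history

# Usage: a chain of five elementary links.
for configuration in swap_order(5):
    print(configuration)
```

For five links the last swap joins a 2-link pair with a 3-link pair, as in the example given in step 3.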