I Introduction
The rise of machine learning Murphy (2012) has led to intense interest in using machine learning in physics, and in particular in combining it with quantum information technology Dunjko and Briegel (2018); Mehta et al. (2019). Recent success stories include discriminating phases of matter Carrasquilla and Melko (2017); Broecker et al. (2017); Van Nieuwenburg et al. (2017) and efficient representation of many-body quantum states Carleo and Troyer (2017); Carleo et al. (2018); Gao and Duan (2017).
In physics, many problems can be described within control theory, which is concerned with finding a way to steer a system to achieve a goal Leigh (2004). The search for optimal control can naturally be formulated as reinforcement learning (RL) Kaelbling et al. (1996); Sutton and Barto (2018); Sutton et al. (1992); Chen et al. (2013); Palittapongarnpim et al. (2017); Fösel et al. (2018); Bukov et al. (2018); Albarrán-Arriagada et al. (2018); Niu et al. (2019), a discipline of machine learning. RL has been used in the context of quantum control Bukov et al. (2018), to design experiments in quantum optics Melnikov et al. (2018), and to automatically generate sequences of gates and measurements for quantum error correction Fösel et al. (2018); Sweke et al. (2018); Andreasson et al. (2018).
RL has also been applied to control problems in quantum metrology Dunjko and Briegel (2018): In the context of global parameter estimation, i.e., when the parameter is a priori unknown, the problem of optimizing single-photon adaptive phase estimation was investigated Hentschel and Sanders (2010, 2011); Lovett et al. (2013). There, the goal is to estimate an unknown phase difference between the two arms of a Mach–Zehnder interferometer. After each measurement, an additional controllable phase in the interferometer can be adjusted depending on the already acquired measurement outcomes. The optimization with respect to policies, i.e., mappings from measurement outcomes to controlled phase shifts, can be formulated as an RL problem and tackled with particle swarm Hentschel and Sanders (2010, 2011); Sergeevich and Bartlett (2012); Stenberg et al. (2016) or differential evolution Lovett et al. (2013); Palittapongarnpim et al. (2016) algorithms, where the results of the former were recently applied in an experiment Lumino et al. (2018).
Also in the regime of local parameter estimation, where the parameter is already known to high precision (typically from previous measurements), actor-critic and proximal-policy-optimization RL algorithms were used to find policies to control the dynamics of quantum sensors Liu and Yuan (2017a, b); Xu et al. (2019). There, the estimation of the precession frequency of a dissipative spin particle was improved by adding a linear control to the dynamics in the form of an additional controlled magnetic field Xu et al. (2019).
Recently it was shown theoretically that the sensitivity (in the regime of local parameter estimation) of existing quantum sensors based on precession dynamics, such as spin-precession magnetometers, can be increased by adding nonlinear control to their dynamics in such a way that the dynamics becomes non-regular or (quantum-)chaotic Fiderer and Braun (2018, 2019). The nonlinear kicks (described by a “nonlinear” Hamiltonian compared to the “linear” precession Hamiltonian where , , are the spin angular momentum operators) lead to a torsion, i.e., a precession with a rotation angle depending on the state of the spins.
Adding nonlinear kicks to the otherwise regular dynamics comes with a large number of new degrees of freedom that have so far remained unexplored: Rather than kicking the system periodically with always the same strength and the same preferred axis as in Ref. Fiderer and Braun (2018), one can try to optimize each kick individually, i.e., vary its timing, strength, or rotation axis. The number of parameters increases linearly with the total measurement time (assuming a fixed upper bound of kicks per unit time), and rapidly becomes too large for brute-force optimization.
In this work, we use cross-entropy RL to optimize the kicking strengths and times in order to maximize the quantum Fisher information, whose inverse constitutes a lower bound on the measurement precision. The cross-entropy method is used to train a neural network that takes the current state as input and gives an action on the current state (the nonlinear kicks) as output. In this way, the neural network generates a sequence of kicks that represents the policy for steering the dynamics.
This represents an offline, model-free approach aimed at long-term performance, i.e., the optimization is done based on numerical simulations, without being restricted to a specific class of policies, and with the goal of maximizing the quantum Fisher information only after a given time and not, as would be the case for greedy algorithms, at each time step. We show that this can lead to largely enhanced sensitivity even compared to the already enhanced sensitivity of the quantum-chaotic sensor with constant periodic kicks Fiderer and Braun (2018).
II Quantum metrology
The standard tool for evaluating the sensitivity with which a parameter can be measured is the quantum Cramér-Rao bound Helstrom (1976); Holevo (1982); Braunstein and Caves (1994). It gives the smallest uncertainty with which a parameter encoded in a quantum state (density matrix) can be estimated. The bound is optimized over all possible measurements (POVM = positive operator-valued measure; including but not limited to standard projective von Neumann measurements of quantum observables), and over all possible data-analysis schemes in the sense of using arbitrary unbiased estimator functions of the obtained measurement results. It can be saturated in the limit of a large number of measurements, and hence gives the ultimate sensitivity that can be reached once technical noise has been eliminated and only the intrinsic fluctuations due to the quantum state itself remain.
III The system
We consider a spin model based on the angular momentum algebra, with spin operators , and , where and are angular momentum quantum numbers. Note that the model can be implemented not only with physical spins but with any physical system with quantum mechanical operators that fulfill the angular momentum algebra. The Hamiltonian of our model is given by
(3) 
The first summand describes a precession about the axis with precession frequency . The second summand describes the nonlinear kicks, i.e., a torsion about the axis, see Fig. 1. This corresponds to a precession about the axis with a precession angle proportional to the component. The time defines a time scale such that and measure time in units of . The th kick is applied at time where quantifies its kicking strength (in units of a frequency).
In an atomic spin-precession magnetometer, as discussed in Ref. Fiderer and Braun (2018), the first summand corresponds to a Larmor precession characterized by the Larmor frequency with Landé g-factor , Bohr magneton , and magnetic field strength , which is the parameter that one wants to estimate. The nonlinear kicks can, for example, be generated with off-resonant light pulses exploiting the ac Stark effect. We introduce a dimensionless kicking strength as and, for the sake of simplicity, we set and .
For a pure state, the unitary time evolution of the system between kicks at time and is given by
(4) 
where the unitary transformation propagates the state according to the Hamiltonian (3), from time [directly after the th kick] to [directly after the th kick], as indicated by the index [in order to simplify notation, the index of not only labels the kicking strength at time but also refers to the propagation from to of ]. We have
(5) 
where denotes timeordering. Since the kicks are assumed to be instantaneous, this leads to
(6) 
i.e., a precession for time followed by a kick of strength . The kick occurs at the end of the time interval .
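As a concrete illustration of one propagation step (a precession followed by an instantaneous kick), the following minimal sketch builds spin matrices and the corresponding unitary. The function names `spin_operators` and `kicked_step`, the convention hbar = 1, and the kick generator `kappa * A**2 / (2 * spin)` are illustrative assumptions and are not taken from the text; because the kick axis is not fixed by this sketch, the kick operator `A` is passed in as an argument.

```python
import numpy as np

def spin_operators(spin):
    """Return (Jx, Jy, Jz) in the basis |spin, m>, m = spin, ..., -spin (hbar = 1)."""
    m = spin - np.arange(int(round(2 * spin)) + 1)
    dim = len(m)
    jz = np.diag(m).astype(complex)
    # Raising operator: J+ |m> = sqrt(j(j+1) - m(m+1)) |m+1>
    jp = np.zeros((dim, dim), dtype=complex)
    for col in range(1, dim):
        jp[col - 1, col] = np.sqrt(spin * (spin + 1) - m[col] * (m[col] + 1))
    jm = jp.conj().T
    jx = 0.5 * (jp + jm)
    jy = -0.5j * (jp - jm)  # (J+ - J-) / (2i)
    return jx, jy, jz

def kicked_step(jz, kick_op, spin, omega, dt, kappa):
    """Precession about z for a time dt, then a kick of strength kappa (assumed form)."""
    # Jz is diagonal, so its exponential is a diagonal matrix of phases.
    precession = np.diag(np.exp(-1j * omega * dt * np.diag(jz).real))
    # Exponentiate the nonlinear kick through the eigenbasis of the kick operator.
    evals, evecs = np.linalg.eigh(kick_op)
    phases = np.exp(-1j * kappa * evals**2 / (2 * spin))
    torsion = (evecs * phases) @ evecs.conj().T
    return torsion @ precession
```

For example, `kicked_step(jz, jy, 1.5, 0.7, 1.0, 2.0)` returns a 4 x 4 unitary for spin 3/2 with the kick taken (as an assumption) about the y axis; setting `kappa = 0` reduces the step to a pure precession.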
For the standard kicked top (KT), , see Fig. 1, the kicking strengths are constant, , and kicking times are given by , with . Dynamics of the standard KT is nonintegrable for and has a well defined classical limit that shows a transition from regular to chaotic dynamics when is increased. In Ref. Fiderer and Braun (2018) the behavior of the QFI for regular and chaotic dynamics was studied in this transition regime (for and ) which manifests itself by a mixed classical phase space between regular and chaotic dynamics. Quantum chaos is defined as quantum dynamics that becomes chaotic in the classical limit. In contrast to classical chaos, quantum chaos does not exhibit exponential sensitivity to changes of initial conditions due to the properties of unitary quantum evolution, but can be very sensitive to parameters of the evolution Peres (2006). The kicked top has been realized with atomic spins in a cold gas Chaudhury et al. (2009) and with a pair of spin nuclei using NMR techniques Krithika et al. (2019). Here, we generalize the standard KT to kicks of strength at arbitrary times as given in Eq. (6), see also Fig. 1.
Any new quantum metrology method needs to demonstrate its viability in the presence of noise and decoherence. We study two different versions of the KT which differ in the decoherence model used: phase damping and superradiant damping. Both can be described by Markovian master equations and are well studied models for open quantum systems Dicke (1954); Gross et al. (1976); Gross and Haroche (1982); Braun (2001). While phase damping conserves the energy and only leads to decoherence in the basis, superradiant damping leads in addition to a relaxation to the ground state . Its combination with periodic kicking in the chaotic regimes is known to give rise to a nonequilibrium steady state in the form of a smearedout strange attractor Braun (2001) that still conserves information about the parameter , whereas without the kicking the system in presence of superradiant damping simply decays to the ground state. The master equations for both processes have the Kossakowski–Lindblad form Kossakowski (1972); Lindblad (1976), with
(7) 
for phase damping, where , and
(8) 
for superradiant damping, where are the ladder operators, and and denote the decoherence rates. With the generator , defined by , one has in both cases the formal solution with the continuous-time propagator . The solution of Eq. (7) in the basis, where , is immediate,
(9) 
Also for Eq. (8) a formally exact solution has been found Bonifacio et al. (1971) and efficient semiclassical (for large ) expressions are available Braun et al. (1998a, b). For our purposes it was simplest to solve Eq. (8) numerically by diagonalization of . Combining these decoherence mechanisms with the unitary evolution, the transformation reads
(10) 
because in both cases the dissipative generator commutes with the precession.
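The closed-form solution for phase damping can be illustrated with a short sketch: in the eigenbasis of the precession generator, populations are untouched and each coherence decays exponentially. The function name `phase_damp` and the rate convention (the coherence between levels m and m' decays as `exp(-gamma * (m - m')**2 * t)`) are assumptions made for illustration; the prefactor depends on how the master equation is normalized.

```python
import numpy as np

def phase_damp(rho, m_values, gamma, t):
    """Apply phase damping for a time t to a density matrix given in the J_z eigenbasis.

    Assumed convention: element (m, m') is multiplied by exp(-gamma * (m - m')**2 * t).
    """
    m_values = np.asarray(m_values, dtype=float)
    diff = m_values[:, None] - m_values[None, :]
    return rho * np.exp(-gamma * diff**2 * t)
```

Because the damping factor depends only on the difference m - m', it acts as a fixed elementwise mask that commutes with the precession, which multiplies the same matrix element by a pure phase.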
As initial state we use an SU(2) coherent state, which can be seen as the most classical state of a spin Giraud et al. (2008, 2010), and is usually easy to prepare (for instance by optically polarizing the atomic spins in a SERF magnetometer). Also, it is equivalent to a symmetric state of spin pointing all in the same direction. With respect to the basis it reads
(11) 
We choose , .
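A spin coherent state can be generated numerically from binomial amplitudes. The sketch below assumes the common convention in which the polar angle `theta` is measured from the positive z axis and the azimuthal angle `phi` from the positive x axis, with basis ordering m = j, ..., -j; the function name `spin_coherent_state` is hypothetical, and the phase convention may differ from the one used in Eq. (11).

```python
import numpy as np
from math import comb

def spin_coherent_state(spin, theta, phi):
    """Amplitudes of an SU(2) coherent state in the basis |spin, m>, m = spin, ..., -spin."""
    two_j = int(round(2 * spin))
    k = np.arange(two_j + 1)  # k = spin - m, so k = 0 corresponds to m = +spin
    binom = np.array([comb(two_j, int(i)) for i in k], dtype=float)
    return (np.sqrt(binom)
            * np.cos(theta / 2) ** (two_j - k)
            * np.sin(theta / 2) ** k
            * np.exp(1j * k * phi))
```

With this convention the state is normalized for any angles, and the expectation value of J_z equals `spin * cos(theta)`, which gives a quick sanity check of the construction.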
IV The kicked top as a control problem and reinforcement learning
We consider the kicked top as a control problem and discretize the kicking strengths and times . The precise parameters of the discretized control problem vary between the following examples and are summarized in Appendix A. In the following, denotes a discrete time step (measured in units of ), is a discrete step of kicking strength, the RL agent optimizes the QFI at time , and we bound the total accumulated kicking strength which is never reached in optimized policies though. The frequency , that we want to estimate, is set to induce a rotation of the state by ( is measured in units of ).
Possible control policies are simply given by a vector of kicking strengths with . To each policy corresponds a QFI value, calculated from the resulting state , which quantifies how well the policy performs. To tackle this type of problem, various numerical algorithms are available, each with its own advantages and drawbacks Dunjko and Briegel (2018); Mehta et al. (2019); Palittapongarnpim et al. (2017). We pursue the relatively unexplored (in the context of physics) route of cross-entropy RL.
The system, the kicked top, will be called “environment”, and we imagine an “RL agent” interacting with the environment by applying nonlinear kicks (“actions”) and getting in response information about the current state of the environment (“observation”, which is in our case the full density matrix of the current state), see Fig. 2. The RL agent repeatedly has to decide whether to increase the kicking strength (by ) or to go on from the current position in time to . After each decision, it obtains an observation and, only after the total time , a “reward” (the quantum Fisher information of ), that it seeks to maximize. This concludes one “episode”, after which the environment is reset [i.e., the spin is reinitialized in the coherent state at , , see Eq. (11)] and the next episode starts.
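The decision loop described above maps a sequence of binary actions to a vector of kicking strengths, one entry per discrete time step. The following sketch shows that mapping only; the function name `actions_to_policy`, the action labels, and the parameters `n_steps` and `d_kappa` are hypothetical, and the bound on the total accumulated kicking strength as well as the state evolution and reward are deliberately left out.

```python
def actions_to_policy(actions, n_steps, d_kappa):
    """Translate a sequence of 'kick' / 'go on' actions into kicking strengths per time step."""
    counts = [0] * n_steps
    step = 0
    for action in actions:
        if step >= n_steps:
            break  # total time reached: the episode ends here
        if action == "kick":
            counts[step] += 1  # increase the strength at the current time by one increment
        elif action == "go on":
            step += 1  # advance to the next discrete time step
        else:
            raise ValueError(f"unknown action: {action!r}")
    return [c * d_kappa for c in counts]
```

For instance, the actions kick, kick, go on, go on, kick, go on with three time steps place two increments on the first step, none on the second, and one on the third.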
In our case, a neural network represents the RL agent: The observation is given to the neural network’s input neurons while each output neuron represents one possible action, i.e., we have two output neurons for “kick” and “go on”. The activation of these output neurons determines the probability of executing that action. The policy, however, is not given by the neural network directly. Since the environment is deterministic (i.e., the state evolves deterministically for a given policy of kicking strengths) there is no point in choosing a stochastic policy such as a neural network. Instead, a single choice of kicking strengths represents the policy, which is obtained by first generating a few episodes with several trained neural networks and then picking the episode with the largest QFI. The kicking strengths applied in that episode represent the policy (see Appendix B) [footnote 1: In comparison, Sanders et al. Hentschel and Sanders (2010, 2011); Lovett et al. (2013) restricted their policy search for adaptive single-photon interferometry in such a way that their search space corresponds to points in , making it similar to our problem. However, in their case the observations from the environment are probabilistic measurement outcomes while in our case the observation is the deterministic state .].
The RL cross-entropy method De Boer et al. (2005) we use works as follows: We produce a set of episodes with the neural network, and then we reinforce the actions of the episodes with the highest reward. This is done by choosing the best of episodes and using the pairs of observations and actions of these episodes to train the neural network with the stochastic gradient descent method called Adam Kingma and Ba (2014) (see Appendix for details). As a result of this training the weights of the neural network are adjusted, i.e., the agent learns from its experience. Future actions taken by the agent are then influenced not only by randomness but also by this experience. The whole process of generating episodes and training the network is iterated. For the parameters of the training process see Appendix B. In Appendix C we study the learning success for different numbers of episodes and iterations.
V Results
We compare the QFI for different models: (i) the top (simple precession without kicks), (ii) the standard kicked top, as studied in Ref. Fiderer and Braun (2018), with periodic kicks (period , i.e., a precession angle of for one period, and kicking strength ), and (iii) the generalized kicked top optimized with RL. In the case of superradiant damping (phase damping) we denote the top by SRT (PDT), the standard kicked top by SRKT (PDKT), and the RL-optimized generalized kicked top by SRGKT (PDGKT). Details on the training and the optimization of the RL results are provided in Appendix B.
Let us first consider superradiant damping with results presented in Fig. 3. The QFI for the SRT exhibits a characteristic growth quadratic in time. However, due to decoherence, the QFI does not maintain this growth but starts to decay rapidly towards zero. The time when the QFI reaches its maximum was found to decay roughly as with spin size and damping rate Fiderer and Braun (2018).
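The quadratic growth can be checked directly in the decoherence-free limit, which is an assumption of this sketch and not the damped model shown in Fig. 3. For a pure state evolving under a precession about z, the QFI equals 4 t^2 Var(J_z), and for a spin coherent state on the equator Var(J_z) = j/2, so the QFI is 2 j t^2. The function names, the finite-difference step `eps`, and the convention hbar = 1 are illustrative choices.

```python
import numpy as np
from math import comb

def equatorial_coherent_state(spin, phi=0.0):
    """Spin coherent state on the equator, basis |spin, m> with m = spin, ..., -spin."""
    two_j = int(round(2 * spin))
    k = np.arange(two_j + 1)
    binom = np.array([comb(two_j, int(i)) for i in k], dtype=float)
    return np.sqrt(binom / 2.0**two_j) * np.exp(1j * k * phi)

def pure_state_qfi(spin, omega, t, eps=1e-6):
    """QFI with respect to omega for pure precession about z, via a central difference."""
    m = spin - np.arange(int(round(2 * spin)) + 1)
    psi0 = equatorial_coherent_state(spin)

    def evolve(w):
        return np.exp(-1j * w * t * m) * psi0  # exp(-i w t Jz) acting on psi0

    psi = evolve(omega)
    dpsi = (evolve(omega + eps) - evolve(omega - eps)) / (2 * eps)
    # Pure-state QFI: 4 * (<dpsi|dpsi> - |<psi|dpsi>|^2)
    return 4 * (np.vdot(dpsi, dpsi).real - abs(np.vdot(psi, dpsi)) ** 2)
```

Doubling the time quadruples the returned value, which is the quadratic scaling that decoherence eventually cuts off in the damped case.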
The situation changes with the introduction of nonlinear kicks. There, the QFI for the SRKT shows the interesting behavior of not decaying to zero for large times. Instead it reaches a plateau value which was found to take surprisingly high values for specific choices of and dissipation rates Fiderer and Braun (2018), in particular, for . The system loses energy through superradiant damping while the nonlinear kicks add energy. This prevents the state from decaying to the ground state, which is an eigenstate of the precession and would lead to a vanishing QFI. From this perspective, the plateau results from a dynamical equilibrium established by the interplay of superradiant damping and kicks.
However, the full potential of exploiting such effects and increasing the QFI with the help of nonlinear kicks is not achieved with constant periodic kicks. Instead, the RL agent finds policies that make the QFI of the SRGKT increase further even when the QFI of the SRT has already decayed to zero and the QFI of the SRKT has reached its plateau value [footnote 2: The training of one RL agent takes about eight hours on a desktop computer].
Examples for and are presented in Fig. 3. The QFI of the SRGKT is optimized for a total time which is the largest time plotted in each example. At , the plateau value of the SRKT for is relatively low and the RL-optimized policy achieves an improvement in sensitivity (associated with ) of more than an order of magnitude. Panels (a) and (b) show continuous growth of the QFI through an optimized kicking policy. Only if the time (the QFI is optimized to be maximal at ) is increased further does the growth of the QFI finally break down. Instead of increasing , we choose to increase superradiant damping while keeping constant, which has a similar effect. In that case, see panels (c) and (d), the RL agent chooses a policy which makes the QFI oscillate at a relatively high level before the time is reached.
The superiority of the policies found by the RL agent can be understood by taking a look at the evolution of the quantum state, see Fig. 4: We represent the quantum state in the space of where and, due to the conservation of angular momentum, which restricts the space to a sphere. This is represented in Fig. 4 with either a sphere parametrized with , , and , or in a plane (the phase space) spanned by the coordinate and the azimuthal angle such that corresponds to the positive axis, , to the positive axis, and with arbitrary to the positive (negative) axis.
The quantum state can be represented in the phase space with the help of the Husimi or the Wigner distributions, which are quasi-probability distributions of the quantum state. The first two rows of panels in Fig. 4 depict the Wigner distribution of the initial quantum state (left column) and the quantum states of the SRKT (middle column, with kicking strength ) and SRGKT (right column) evolved for a time with damping rate . The plotted cases for the SRKT and SRGKT correspond to the QFI given in panel (b) of Fig. 3, where one can also see the corresponding RL-optimized distribution of kicks.
Due to the small spin size of , we are deep in the quantum mechanical regime, which manifests itself in an uncertainty of the initial spin coherent state that is relatively large compared to the total size of the phase space. The distributions of the states evolved under dissipative dynamics exhibit remarkable differences for periodic and RL-optimized kicks:
In the case of periodic kicks, we find that the initially localized distribution spreads over the phase space. It exhibits a maximum on the negative axis, see panels (b) and (b) in Fig. 4. This is reminiscent of the dissipative evolution in the absence of kicks, where the state is driven towards the ground state which is centered around . The ground state is an eigenstate of the precession and, thus, insensitive to changes in the frequency we want to estimate. Similarly, we interpret the part of the state distribution of the SRKT that is centered around the negative axis as insensitive. However, the distribution also exhibits nonvanishing parts distributed over the remainder of the phase space that can be understood as being sensitive to changes of and therefore explain the nonzero QFI of the SRKT.
The state corresponding to RL-optimized kicks looks like a strongly squeezed state that almost encircles the whole sphere. Similar to spin squeezing, which is typically applied to the initial state as a part of the state preparation, we interpret the squeezed distribution as particularly sensitive with respect to the precession dynamics. This is due to the reduced uncertainty along the precession trajectories, i.e., with respect to the coordinate. In the Supplemental Material [footnote 3: The clips are available at https://doi.org/10.6084/m9.figshare.c.4640051.v3], we provide clips of the evolution over time of the state distributions that illustrate how the RL agent generates the squeezed state. In particular, the squeezed state distribution can be seen as a feature the RL agent is aiming for with its policy. The distribution of RL-optimized kicks is shown in Fig. 3 (in Appendix E, we provide a finer resolution of the distribution of kicks): It is roughly periodic with period corresponding to a precession angle of . Also note that for the SRGKT the Wigner distribution has negative contributions, which is associated with nonclassicality of the quantum state Agarwal (2012).
An advantage of the superradiant dynamics lies in its well-defined simple classical limit Braun (2001), see also Appendix D. The lower two rows of panels in Fig. 4 depict the corresponding classical limit where the quantum state is represented by a cloud of phase space points (distributed according to the Husimi distribution of the initial spin coherent state) that are propagated according to the classical equations of motion. One of the reasons why the evolved classical distributions differ from the Wigner distributions is the absence of quantum uncertainty in the classical dynamics; in principle, over the course of the dynamics all classical phase space points can be concentrated in an arbitrarily small region of the phase space. In the case of the SRKT, the phase space points are distributed over the whole phase space, reminiscent of classical chaos. However, the distribution is not completely uniform but exhibits a spiral density inhomogeneity. The plots as in Fig. 4 but for are shown in Appendix E.
Fig. 5 shows the gains of the RL-optimized SRGKT over the SRT. The gain is defined as the ratio of the RL-optimized QFI at time and the maximum QFI for the SRT. A broad damping regime is found where gains can be achieved: In the regime of small decoherence rates , the RL agent can fight decoherence in such a way that the QFI exhibits a continuous growth over the total time [see panels (a) and (b) in Fig. 3]. In comparison with the SRT, the RL agent benefits from stronger damping in this regime and, therefore, the gain increases with the dissipation rate . For larger decoherence rates, the RL agent can no longer fight decoherence in the same manner [see panels (c) and (d) in Fig. 3], which manifests itself in a reduction of gains for large decoherence rates. In panel (b) of Fig. 5, we can see the (even larger) gain in QFI compared to the plateau value reached by the SRKT.
The RL-optimized QFI is associated with a lower bound on the sensitivity [see Eq. (1)] for a given measurement time . If measurement time can be chosen arbitrarily, sensitivity is associated with Fiderer and Braun (2018). This sensitivity represents the standard quantity reported for experimental parameter estimation because it takes time into account as a valuable resource; sensitivity is given in units of the parameter to be estimated per square root of Hertz. With RL we try to maximize with respect to policies.
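The step from a fixed measurement time to a time-normalized sensitivity can be made explicit with the standard form of the quantum Cramér-Rao bound. The notation below is assumed for illustration: I_ω(T) is the QFI after a single run of duration T, M is the number of independent repetitions, and T_tot = M T is the total time, with preparation and readout times neglected.

```latex
% Assumed notation: I_\omega(T) = QFI after one run of duration T,
% M = number of repetitions, T_tot = M T = total measurement time.
\begin{align}
  \delta\omega &\ge \frac{1}{\sqrt{M\, I_\omega(T)}}, \qquad M = \frac{T_{\mathrm{tot}}}{T}, \\
  \delta\omega\,\sqrt{T_{\mathrm{tot}}} &\ge \frac{1}{\sqrt{I_\omega(T)/T}} .
\end{align}
```

Under these assumptions, minimizing the uncertainty per square root of total time amounts to maximizing the QFI divided by the duration of a single run, which explains the units of parameter per square root of Hertz.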
Fig. 6 compares the SRT with the SRGKT where the latter was optimized with RL in order to maximize the rescaled QFI. Note that the initial spin coherent state is centered around the positive axis, which means it is an eigenstate of the nonlinear kicks; kicks cannot induce spin squeezing at the very beginning of the dynamics. This changes when the spin precesses away from the axis. Therefore, it makes sense that the RL agent applies the strongest kick only after a precession by about . The actions that the RL agent takes after the rescaled QFI has reached its maximum are irrelevant and can be attributed to random noise generated by the RL algorithm.
As we have seen, the interplay of nonlinear kicks and superradiant damping is very special. However, the QFI can also be increased significantly for other decoherence models, for instance in the case of an alkali-vapor magnetometer Fiderer and Braun (2018). To demonstrate the performance of the RL agent in connection with another decoherence model, we take a look at phase damping, see Fig. 7. The behavior of the QFI of the PDT is qualitatively similar to superradiant damping. The introduction of kicks, however, has a qualitatively different effect on the QFI. The RL agent can achieve improvements of the QFI for the PDGKT at time (the largest time plotted in each panel of Fig. 7) compared with the QFI of the PDT at the same time. Compared to the superradiant case, improvements are rather small. Notably, the policies applied by the RL agent are also different from superradiant damping; for instance, the RL agent avoids kicks for large parts of the dynamics.
VI Discussion
This work builds on recent results on quantum-chaotic sensors Fiderer and Braun (2018). We find that reinforcement learning (RL) techniques can be used to optimize the dynamical control that was used in Ref. Fiderer and Braun (2018) to render the sensor dynamics chaotic. The control policies found with RL are tailored to boundary conditions such as the initial state, the targeted measurement time, and the decoherence model under consideration. Using the example of superradiant damping, we demonstrate improvements in measurement precision and an improved robustness with respect to decoherence. A drawback of RL often lies in the expensive hyperparameter tuning of the algorithm. However, here we demonstrate that a basic reinforcement algorithm (the cross-entropy method) can be used for several choices of boundary conditions with practically no hyperparameter tuning (no hyperparameter search was necessary; only parameters that directly influence the computation time were chosen conveniently). Another drawback of RL is its black-box character: while the results achieve a good performance, the underlying reasons and mechanisms remain hidden. In the example of superradiant damping, we were able to unveil the approach taken by RL by visualizing the quantum dynamics with the help of the Wigner distribution of the quantum state. This revealed that RL favors a policy that is reminiscent of spin squeezing. However, instead of squeezing the state only at the beginning of the dynamics, the squeezing is refreshed and enhanced in roughly periodic cycles in order to fight against the superradiant damping. In the spirit of Ref. Fiderer and Braun (2018), these findings emphasize the potential that lies in the optimization of the measurement dynamics. We are optimistic that reinforcement learning will be used in other quantum metrological settings in order to achieve maximum measurement precision with limited quantum resources.
Appendix A Control problem and optimisation parameters of the examples
Table 1 shows the parameters of the control problem and of the optimization used in each example. We train RL agents for iterations with episodes in each iteration. Each episode is simulated until a total time is reached. Then we produce sample episodes of each trained RL agent and choose the best episode to plot the sample policies and gains.
Figure  

Samples with superradiant damping (Fig. 3)  5  500  50  20  0.2  0.05  100 
Gains of superradiant damping (Fig. 5)  20  300  40  20  1.0  0.10  100 
Samples of rescaled QFI (Fig. 6)  2  500  50  20  0.1  0.10  50 
Samples with phase damping (Fig. 7)  1  1,000  100  1  1.0  0.10  100 
Appendix B Cross entropy reinforcement learning
Here we give further information on the neural network, the cross-entropy method, and the pseudocode for the cross-entropy method with discrete actions. The code implementation is based on an example by Jan Schaffranek [footnote 4: https://www.udemy.com/artificialintelligenceundreinforcementlearninginpython].
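The two ingredients of the method, a small softmax policy network and the selection of the best episodes as training data, can be sketched in plain NumPy as follows. This is an illustrative sketch and not the implementation referred to above: the function names, the weight initialization, and the `elite_share` parameter are assumptions, and the actual training step (fitting the network to the selected observation-action pairs with a categorical cross-entropy cost and the Adam optimizer) is left to a deep-learning framework.

```python
import numpy as np

def init_network(n_inputs, n_hidden=300, n_actions=2, rng=None):
    """Random weights for a fully connected network with one hidden layer."""
    rng = np.random.default_rng() if rng is None else rng
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(n_inputs), (n_inputs, n_hidden)),
        "b1": np.zeros(n_hidden),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(n_hidden), (n_hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def action_probabilities(params, observation):
    """Forward pass: ReLU hidden layer followed by a softmax output layer."""
    hidden = np.maximum(0.0, observation @ params["w1"] + params["b1"])
    logits = hidden @ params["w2"] + params["b2"]
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def select_elite(episodes, rewards, elite_share):
    """Collect (observation, action) pairs from the best-rewarded episodes.

    episodes: list of episodes, each a list of (observation, action) pairs.
    """
    n_elite = max(1, int(round(elite_share * len(episodes))))
    best = np.argsort(rewards)[::-1][:n_elite]
    observations = [obs for i in best for obs, _ in episodes[i]]
    actions = [act for i in best for _, act in episodes[i]]
    return np.array(observations), np.array(actions)
```

One iteration of the method would sample actions from `action_probabilities` to generate a batch of episodes, call `select_elite` on them, and then fit the network to the returned pairs before generating the next batch.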
The input layer of the neural network is defined by the observation. The output layer is determined by the number of actions (two) and we choose 300 neurons in the hidden layer. The layers are fully connected. The hidden layer has the rectified linear unit (ReLU) as its activation function and the output layer has the softmax function as its activation function Nielsen (2015). As a cost function we choose the categorical cross entropy Nielsen (2015). The share of best episodes is always . The number of iterations and number of episodes vary for different settings, see Table 1 for detailed information. For training we use the Adam optimizer Kingma and Ba (2014) with learning rate .
Appendix C Learning curve and stability of the algorithm
Using the example of the superradiant decoherence model, we study the learning behavior of the cross-entropy reinforcement learning algorithm for different training lengths (i.e., numbers of iterations) and different numbers of episodes per iteration. The results are summarized in Fig. 8. Spin size is and dissipation rate is .
In order to see the influence of the number of iterations, we set the number of episodes to 100 and let 20 different RL agents (with different random seeds) train for various numbers of iterations. The training of a single RL agent takes about one hour at most (for the higher number of iterations) on a desktop computer. We then use each RL agent to produce 20 episodes, giving us 400 episodes for each data point in Fig. 8. We use those episodes to calculate the mean and standard deviation of the reward. The results are shown in panel (a) of Fig. 8. In order to see the influence of the number of episodes in each iteration, we fix the number of iterations to 500 and follow the same procedure as before. The results are shown in panel (b) of Fig. 8.
We can see that the standard deviation over policies decreases with the number of iterations while the mean QFI increases. The same is true for the number of episodes (panel (b)), where for 32 episodes a stable plateau of the QFI is reached such that increasing the number of episodes does not achieve any further improvements. Overall, these results demonstrate the stability of the algorithm if the number of episodes and iterations is chosen sufficiently large.
Appendix D Classical equations of motion
The kicked top with superradiant damping has a well-defined classical limit. It is obtained from the quantum equations of motion by taking the limit where and . The rescaled angular momentum operator then becomes the classical coordinate vector and with the unit sphere becomes the classical phase space with azimuthal angle and coordinate as canonical variables. The equations of motion are found to be Braun (2001)
(12)  
(13)  
(14) 
for the precession about the axis by an angle ,
(15)  
(16)  
(17) 
for the kicks about the axis with kicking strength , and, with azimuthal angle (see main text)
(18) 
(19) 
(20) 
(21) 
for the superradiant damping, where
(22) 
for a time , spin size , and superradiant decoherence rate .
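As a rough numerical illustration of the first two steps of this map, the sketch below implements the precession and the kick as rotations of a unit vector. Since the axis labels are not recoverable here, it assumes that the precession is a rigid rotation about the z axis and that the kick is a rotation about the y axis by an angle proportional to the y coordinate (the usual kicked-top torsion). The function names, the sign conventions, and the omission of the damping step are choices of this sketch and are not taken from Eqs. (12) to (22).

```python
import numpy as np

def precess_about_z(r, alpha):
    """Rigid rotation of r = (x, y, z) about the z axis by angle alpha.

    Assumption: z is the precession axis and the rotation is right-handed.
    """
    x, y, z = r
    return np.array([x * np.cos(alpha) - y * np.sin(alpha),
                     x * np.sin(alpha) + y * np.cos(alpha),
                     z])

def kick_about_y(r, k):
    """Nonlinear kick: rotation about the y axis by the angle k * y.

    Assumption: y is the kick axis; the rotation angle depends on the
    state itself, which is what makes the map nonlinear.
    """
    x, y, z = r
    angle = k * y
    return np.array([x * np.cos(angle) + z * np.sin(angle),
                     y,
                     z * np.cos(angle) - x * np.sin(angle)])

# One precession-plus-kick step for an arbitrary point on the unit sphere.
r0 = np.array([1.0, 2.0, 2.0]) / 3.0
r1 = kick_about_y(precess_about_z(r0, alpha=np.pi / 2), k=3.0)
```

Both steps are rotations, so the point stays on the unit sphere. The damping step is the only part of the full map that is not norm-preserving in the quantum description and is left out here.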
Appendix E A closer look at the kicks set by the reinforcement learning agent
Here we take a closer look at the kicks chosen by the RL agent in the examples with superradiant damping, considered in Fig. 3 in the main text.
In the case of , for both and , we find a relatively similar distribution of kicks, see panel (a) in Fig. 9. The most striking difference between the two policies for and is the comparatively strong kicks at the beginning of the sequence. By observing the time evolution of the Wigner function (see Supplemental Material), we find that these kicks essentially rotate the state by an additional angle about the axis. This leads to a phase shift of between the two policies [see panels (d) and (d) of Fig. 10] compared to the initial state [see panels (a) and (a) of Fig. 10].
For , the policies are even more similar, with several kicks increasing in strength with a period of , see panel (b) in Fig. 9.
Fig. 10 is analogous to Fig. 4 in the main text but for instead of . The only qualitative difference compared to the , is the periodically kicked top: The combination of periodic kicks with and seems to be a special configuration. The classical phase space is comparable to the case, but there is much less structure in the Wigner function. Instead, the state concentrates on the south pole and exhibits a slightly squeezed shape (although this is difficult to judge from Fig. 10). The rather high value of the QFI for and , is best explained by this squeezing. When choosing other kicking strengths, we observed a Wigner function similar to the case of .
References
 Murphy (2012) K. P. Murphy, Machine Learning: A Probabilistic Perspective (MIT press, 2012).
 Dunjko and Briegel (2018) V. Dunjko and H. J. Briegel, Reports on Progress in Physics 81, 074001 (2018).
 Mehta et al. (2019) P. Mehta, M. Bukov, C.-H. Wang, A. G. Day, C. Richardson, C. K. Fisher, and D. J. Schwab, Physics Reports (2019).
 Carrasquilla and Melko (2017) J. Carrasquilla and R. G. Melko, Nature Physics 13, 431 (2017).
 Broecker et al. (2017) P. Broecker, F. F. Assaad, and S. Trebst, arXiv preprint arXiv:1707.00663 (2017).
 Van Nieuwenburg et al. (2017) E. P. Van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Nature Physics 13, 435 (2017).
 Carleo and Troyer (2017) G. Carleo and M. Troyer, Science 355, 602 (2017).
 Carleo et al. (2018) G. Carleo, Y. Nomura, and M. Imada, Nature communications 9, 5322 (2018).
 Gao and Duan (2017) X. Gao and L.-M. Duan, Nature communications 8, 662 (2017).
 Leigh (2004) J. R. Leigh, Control Theory, Vol. 64 (Iet, 2004).

 Kaelbling et al. (1996) L. P. Kaelbling, M. L. Littman, and A. W. Moore, Journal of artificial intelligence research 4, 237 (1996).
 Sutton and Barto (2018) R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT press, 2018).
 Sutton et al. (1992) R. S. Sutton, A. G. Barto, and R. J. Williams, IEEE Control Systems Magazine 12, 19 (1992).
 Chen et al. (2013) C. Chen, D. Dong, H.-X. Li, J. Chu, and T.-J. Tarn, IEEE transactions on neural networks and learning systems 25, 920 (2013).
 Palittapongarnpim et al. (2017) P. Palittapongarnpim, P. Wittek, E. Zahedinejad, S. Vedaie, and B. C. Sanders, Neurocomputing 268, 116 (2017).
 Fösel et al. (2018) T. Fösel, P. Tighineanu, T. Weiss, and F. Marquardt, Physical Review X 8, 031084 (2018).
 Bukov et al. (2018) M. Bukov, A. G. Day, D. Sels, P. Weinberg, A. Polkovnikov, and P. Mehta, Physical Review X 8, 031086 (2018).
 Albarrán-Arriagada et al. (2018) F. Albarrán-Arriagada, J. C. Retamal, E. Solano, and L. Lamata, Physical Review A 98, 042315 (2018).
 Niu et al. (2019) M. Y. Niu, S. Boixo, V. N. Smelyanskiy, and H. Neven, in AIAA Scitech 2019 Forum (2019) p. 0954.
 Melnikov et al. (2018) A. A. Melnikov, H. P. Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, Proceedings of the National Academy of Sciences 115, 1221 (2018).
 Sweke et al. (2018) R. Sweke, M. S. Kesselring, E. P. van Nieuwenburg, and J. Eisert, arXiv preprint arXiv:1810.07207 (2018).
 Andreasson et al. (2018) P. Andreasson, J. Johansson, S. Liljestrand, and M. Granath, arXiv preprint arXiv:1811.12338 (2018).
 Hentschel and Sanders (2010) A. Hentschel and B. C. Sanders, in 2010 Seventh International Conference on Information Technology: New Generations (IEEE, 2010) pp. 506–511.
 Hentschel and Sanders (2011) A. Hentschel and B. C. Sanders, Physical review letters 107, 233601 (2011).
 Lovett et al. (2013) N. B. Lovett, C. Crosnier, M. Perarnau-Llobet, and B. C. Sanders, Physical review letters 110, 220501 (2013).

 Sergeevich and Bartlett (2012) A. Sergeevich and S. D. Bartlett, in 2012 IEEE Congress on Evolutionary Computation (IEEE, 2012) pp. 1–3.
 Stenberg et al. (2016) M. P. Stenberg, O. Köhn, and F. K. Wilhelm, Physical Review A 93, 012122 (2016).
 Palittapongarnpim et al. (2016) P. Palittapongarnpim, P. Wittek, and B. C. Sanders, in 24th European Symposium on Artificial Neural Networks, Bruges, April 27–29, 2016 (2016) pp. 327–332.
 Lumino et al. (2018) A. Lumino, E. Polino, A. S. Rab, G. Milani, N. Spagnolo, N. Wiebe, and F. Sciarrino, Physical Review Applied 10, 044033 (2018).
 Liu and Yuan (2017a) J. Liu and H. Yuan, Physical Review A 96, 012117 (2017a).
 Liu and Yuan (2017b) J. Liu and H. Yuan, Physical Review A 96, 042114 (2017b).
 Xu et al. (2019) H. Xu, J. Li, L. Liu, Y. Wang, H. Yuan, and X. Wang, arXiv preprint arXiv:1904.11298 (2019).
 Fiderer and Braun (2018) L. J. Fiderer and D. Braun, Nature communications 9, 1351 (2018).
 Fiderer and Braun (2019) L. J. Fiderer and D. Braun, in Optical, Opto-Atomic, and Entanglement-Enhanced Precision Metrology, Vol. 10934 (International Society for Optics and Photonics, 2019) p. 109342S.
 Helstrom (1976) C. W. Helstrom, Quantum Detection and Estimation Theory (Academic press, 1976).
 Holevo (1982) A. S. Holevo, Probabilistic and Statistical Aspects of Quantum Theory (North-Holland, Amsterdam, 1982).
 Braunstein and Caves (1994) S. L. Braunstein and C. M. Caves, Phys. Rev. Lett. 72, 3439 (1994).
 Paris (2009) M. G. A. Paris, International Journal of Quantum Information 7, 125 (2009).
 Peres (2006) A. Peres, Quantum Theory: Concepts and Methods, Vol. 57 (Springer Science & Business Media, 2006).
 Chaudhury et al. (2009) S. Chaudhury, A. Smith, B. Anderson, S. Ghose, and P. S. Jessen, Nature 461, 768 (2009).
 Krithika et al. (2019) V. Krithika, V. Anjusha, U. T. Bhosale, and T. Mahesh, Physical Review E 99, 032219 (2019).
 Dicke (1954) R. H. Dicke, Phys. Rev. 93, 99 (1954).
 Gross et al. (1976) M. Gross, C. Fabre, P. Pillet, and S. Haroche, Phys. Rev. Lett. 36, 1035 (1976).
 Gross and Haroche (1982) M. Gross and S. Haroche, Phys. Rep. 93, 301 (1982).
 Braun (2001) D. Braun, Dissipative Quantum Chaos and Decoherence, Springer Tracts in Modern Physics, Vol. 172 (Springer, 2001).
 Kossakowski (1972) A. Kossakowski, Rep. Math. Phys. 3, 247 (1972).
 Lindblad (1976) G. Lindblad, Commun. Math. Phys. 48, 119 (1976).
 Bonifacio et al. (1971) R. Bonifacio, P. Schwendimann, and F. Haake, Physical Review A 4, 302 (1971).
 Braun et al. (1998a) P. A. Braun, D. Braun, and F. Haake, Eur. Phys. J. D 3, 1 (1998a).
 Braun et al. (1998b) P. A. Braun, D. Braun, F. Haake, and J. Weber, Eur. Phys. J. D 2, 165 (1998b).
 Giraud et al. (2008) O. Giraud, P. Braun, and D. Braun, Phys. Rev. A 78, 042112 (2008).
 Giraud et al. (2010) O. Giraud, P. Braun, and D. Braun, New Journal of Physics 12, 063005 (2010).
 De Boer et al. (2005) P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, Annals of operations research 134, 19 (2005).
 Kingma and Ba (2014) D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014).
 Agarwal (2012) G. S. Agarwal, Quantum Optics (Cambridge University Press, 2012).

 Nielsen (2015) M. A. Nielsen, Neural Networks and Deep Learning, Vol. 25 (Determination Press, San Francisco, CA, USA, 2015).