Improving the dynamics of quantum sensors with reinforcement learning

08/22/2019 ∙ by Jonas Schuff, et al. ∙ 0

Recently proposed quantum-chaotic sensors achieve quantum enhancements in measurement precision by applying nonlinear control pulses to the dynamics of the quantum sensor while using classical initial states that are easy to prepare. Here, we use the cross-entropy method of reinforcement learning to optimize the strength and position of control pulses. Compared to quantum-chaotic sensors in the presence of superradiant damping, we find that decoherence can be fought even better and measurement precision can be enhanced further by optimizing the control. In some examples, we find enhancements in sensitivity by more than an order of magnitude. By visualizing the evolution of the quantum state, the mechanism exploited by the reinforcement learning method is identified as a kind of spin-squeezing strategy that is adapted to the superradiant damping.




I Introduction

The rise of machine learning Murphy (2012) has led to intense interest in using machine learning in physics, and in particular in combining it with quantum information technology Dunjko and Briegel (2018); Mehta et al. (2019). Recent success stories include discriminating phases of matter Carrasquilla and Melko (2017); Broecker et al. (2017); Van Nieuwenburg et al. (2017) and the efficient representation of many-body quantum states Carleo and Troyer (2017); Carleo et al. (2018); Gao and Duan (2017).

In physics, many problems can be described within control theory, which is concerned with finding a way to steer a system to achieve a goal Leigh (2004). The search for optimal control can naturally be formulated as reinforcement learning (RL) Kaelbling et al. (1996); Sutton and Barto (2018); Sutton et al. (1992); Chen et al. (2013); Palittapongarnpim et al. (2017); Fösel et al. (2018); Bukov et al. (2018); Albarrán-Arriagada et al. (2018); Niu et al. (2019), a discipline of machine learning. RL has been used in the context of quantum control Bukov et al. (2018), to design experiments in quantum optics Melnikov et al. (2018), and to automatically generate sequences of gates and measurements for quantum error correction Fösel et al. (2018); Sweke et al. (2018); Andreasson et al. (2018).

RL has also been applied to control problems in quantum metrology Dunjko and Briegel (2018): In the context of global parameter estimation, i.e., when the parameter is a priori unknown, the problem of optimizing single-photon adaptive phase estimation was investigated Hentschel and Sanders (2010, 2011); Lovett et al. (2013). There, the goal is to estimate an unknown phase difference between the two arms of a Mach–Zehnder interferometer. After each measurement, an additional controllable phase in the interferometer can be adjusted depending on the already acquired measurement outcomes. The optimization with respect to policies, i.e., mappings from measurement outcomes to controlled phase shifts, can be formulated as an RL problem and tackled with particle swarm Hentschel and Sanders (2010, 2011); Sergeevich and Bartlett (2012); Stenberg et al. (2016) or differential evolution Lovett et al. (2013); Palittapongarnpim et al. (2016) algorithms; results of the former approach were recently applied in an experiment Lumino et al. (2018).

Also in the regime of local parameter estimation, where the parameter is already known to high precision (typically from previous measurements), actor-critic and proximal-policy-optimization RL algorithms were used to find policies to control the dynamics of quantum sensors Liu and Yuan (2017a, b); Xu et al. (2019). There, the estimation of the precession frequency of a dissipative spin-1/2 particle was improved by adding a linear control to the dynamics in the form of an additional controlled magnetic field Xu et al. (2019).

Recently it was shown theoretically that the sensitivity (in the regime of local parameter estimation) of existing quantum sensors based on precession dynamics, such as spin-precession magnetometers, can be increased by adding nonlinear control to their dynamics in such a way that the dynamics becomes non-regular or (quantum-)chaotic Fiderer and Braun (2018, 2019). The nonlinear kicks (described by a “nonlinear” Hamiltonian proportional to J_x², compared to the “linear” precession Hamiltonian proportional to J_z, where J_x, J_y, J_z are the spin angular momentum operators) lead to a torsion, a precession with a rotation angle that depends on the state of the spins.

Adding nonlinear kicks to the otherwise regular dynamics comes along with a large number of new degrees of freedom that have so far remained unexplored: Rather than kicking the system periodically with always the same strength and with the same preferred axis as in Ref. Fiderer and Braun (2018), one can try to optimize each kick individually, i.e., vary its timing, strength, or rotation axis. The number of parameters increases linearly with the total measurement time (assuming a fixed upper bound of kicks per unit time), and rapidly becomes too large for brute-force optimization.

In this work, we use cross-entropy RL to optimize the kicking strengths and times in order to maximize the quantum Fisher information, whose inverse constitutes a lower bound on the measurement precision. The cross-entropy method is used to train a neural network that takes the current state as input and gives an action on the current state (the nonlinear kicks) as output. In this way, the neural network generates a sequence of kicks that represents the policy for steering the dynamics.

This represents an offline, model-free approach which is aimed at long-term performance, i.e., the optimization is done based on numerical simulations, without being restricted to a specific class of policies, and with the goal of maximizing the quantum Fisher information only after a given time and not, as would be the case for greedy algorithms, at each time step. We show that this can lead to largely enhanced sensitivity even compared to the already enhanced sensitivity of the quantum-chaotic sensor with constant periodic kicks Fiderer and Braun (2018).

II Quantum metrology

The standard tool for evaluating the sensitivity with which a parameter can be measured is the quantum Cramér–Rao bound Helstrom (1976); Holevo (1982); Braunstein and Caves (1994). It gives the smallest uncertainty with which a parameter encoded in a quantum state (density matrix) ρ_ω can be estimated. The bound is optimized over all possible POVM (positive operator valued measure) measurements (including but not limited to standard projective von Neumann measurements of quantum observables), and all possible data-analysis schemes in the sense of using arbitrary unbiased estimator functions of the obtained measurement results. It can be saturated in the limit of a large number of measurements, and hence gives the ultimate sensitivity that can be reached once technical noise has been eliminated and only the intrinsic fluctuations due to the quantum state itself remain.

Figure 1: Schematic representation of parameter encoding in quantum metrology. Panel (a) shows the standard protocol: the parameter ω is encoded in the initial state through the dynamics, the resulting state is measured, and the parameter is inferred by (classical) post-processing of the measurement outcomes. In panel (b), the dynamics is given by the kicked top model: the encoding of the parameter through linear precession about the z-axis is periodically disrupted by parameter-independent, nonlinear, controlled kicks (green triangles) with kicking strength κ that can render the dynamics chaotic. In panel (c), the dynamics is given by a generalized kicked top model: the kicking strengths κ_n and the times t_n between kicks are optimized in order to maximize the sensitivity with which ω can be inferred (varying κ_n are indicated by the different sizes of the green triangles). Variation of the kicking axis is possible but beyond the scope of this work.

The quantum Cramér–Rao bound for the smallest possible variance of the estimate of ω reads

(Δω)² ≥ 1 / (M F_Q[ρ_ω]),    (1)

where M is the number of independent measurements and F_Q[ρ_ω] is the quantum Fisher information.

For a state given in diagonalized form, ρ_ω = ∑_{n=1}^{d} p_n |ψ_n⟩⟨ψ_n|, where d is the dimension of the Hilbert space, the quantum Fisher information (QFI) is given by Paris (2009)

F_Q[ρ_ω] = 2 ∑_{n,m} |⟨ψ_n| ∂_ω ρ_ω |ψ_m⟩|² / (p_n + p_m),    (2)

where the sum runs over all n, m such that p_n + p_m ≠ 0, and ∂_ω ≡ ∂/∂ω.
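This formula translates directly into code; a small numpy sketch (the function name is ours) that diagonalizes ρ and sums over the terms with non-vanishing p_n + p_m:

```python
import numpy as np

def qfi(rho, drho, tol=1e-12):
    """Quantum Fisher information from rho and its parameter derivative d rho / d omega."""
    p, U = np.linalg.eigh(rho)            # rho = U diag(p) U^dagger
    A = U.conj().T @ drho @ U             # derivative expressed in the eigenbasis
    F = 0.0
    for n in range(len(p)):
        for m in range(len(p)):
            if p[n] + p[m] > tol:         # skip terms with vanishing denominator
                F += 2 * abs(A[n, m]) ** 2 / (p[n] + p[m])
    return F
```

For a pure spin-1/2 state on the equator precessing under J_z, this reproduces the textbook value F_Q = 4 Var(J_z) = 1.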

III The system

We consider a spin model based on the angular momentum algebra, with spin operators J_x, J_y, and J_z and basis states |j, m⟩, where j and m are the angular momentum quantum numbers. Note that the model can be implemented not only with physical spins but with any physical system with quantum mechanical operators that fulfill the angular momentum algebra. The Hamiltonian of our model is given by

H(t) = ħω J_z + (ħτ/2j) ∑_n k_n J_x² δ(t − t_n).    (3)

The first summand describes a precession about the z-axis with precession frequency ω. The second summand describes the nonlinear kicks, i.e., a torsion about the x-axis, see Fig. 1. This corresponds to a precession about the x-axis with a precession angle proportional to the J_x-component. The time τ defines a time scale such that t and t_n measure time in units of τ. The nth kick is applied at time t_n, where k_n quantifies its kicking strength (in units of a frequency).

In an atomic spin-precession magnetometer, as discussed in Ref. Fiderer and Braun (2018), the first summand corresponds to a Larmor precession characterized by the Larmor frequency ω = g μ_B B/ħ with Landé g-factor g, Bohr magneton μ_B, and magnetic field strength B, which is the parameter that one wants to estimate. The nonlinear kicks can, for example, be generated with off-resonant light pulses exploiting the ac Stark effect. We introduce a dimensionless kicking strength κ_n = k_n τ and, for the sake of simplicity, we set ħ = 1 and τ = 1.

For a pure state, the unitary time evolution of the system between kicks at time t_n and t_{n+1} is given by

|ψ(t_{n+1})⟩ = U_{n+1} |ψ(t_n)⟩,    (4)

where the unitary transformation U_{n+1} propagates the state according to the Hamiltonian (3), from time t_n [directly after the nth kick] to t_{n+1} [directly after the (n+1)th kick], as indicated by the index n+1 [in order to simplify notation, the index n+1 of U_{n+1} not only labels the kicking strength κ_{n+1} at time t_{n+1} but also refers to the propagation from t_n to t_{n+1}]. We have

U_{n+1} = T exp[−(i/ħ) ∫_{t_n}^{t_{n+1}} H(t) dt],    (5)

where T denotes time-ordering. Since the kicks are assumed to be instantaneous, this leads to

U_{n+1} = exp[−i κ_{n+1} J_x²/(2j)] exp[−i ω (t_{n+1} − t_n) J_z],    (6)

i.e., a precession for the time t_{n+1} − t_n followed by a kick of strength κ_{n+1}. The kick occurs at the end of the time interval.

For the standard kicked top (KT), see Fig. 1, the kicking strengths are constant, κ_n = κ, and the kicking times are given by t_n = nτ, with n = 1, 2, …. The dynamics of the standard KT is non-integrable for κ > 0 and has a well-defined classical limit that shows a transition from regular to chaotic dynamics when κ is increased. In Ref. Fiderer and Braun (2018) the behavior of the QFI for regular and chaotic dynamics was studied in this transition regime (for intermediate kicking strengths), which manifests itself in a mixed classical phase space with both regular and chaotic dynamics. Quantum chaos is defined as quantum dynamics that becomes chaotic in the classical limit. In contrast to classical chaos, quantum chaos does not exhibit exponential sensitivity to changes of initial conditions, due to the properties of unitary quantum evolution, but can be very sensitive to parameters of the evolution Peres (2006). The kicked top has been realized with atomic spins in a cold gas Chaudhury et al. (2009) and with a pair of spin-1/2 nuclei using NMR techniques Krithika et al. (2019). Here, we generalize the standard KT to kicks of strength κ_n at arbitrary times t_n as given in Eq. (6), see also Fig. 1.
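One propagation step of the form of Eq. (6) can be sketched directly in numpy (a minimal illustration under our reconstruction of the model, with ħ = τ = 1; the torsion is exponentiated by diagonalizing J_x², and the helper name is ours):

```python
import numpy as np

def step_unitary(j, omega, dt, kappa):
    """One step of the generalized kicked top: precession about z for a time dt,
    followed by an instantaneous torsional kick about x of strength kappa.
    Basis: |j, m> with m = j, ..., -j; hbar = tau = 1."""
    m = np.arange(j, -j - 1, -1)
    c = np.sqrt(j * (j + 1) - m[1:] * (m[1:] + 1))   # ladder-operator coefficients
    Jx = (np.diag(c, 1) + np.diag(c, -1)) / 2        # (J+ + J-)/2, real symmetric
    U_prec = np.diag(np.exp(-1j * omega * dt * m))   # exp(-i omega dt Jz), Jz diagonal
    w, V = np.linalg.eigh(Jx @ Jx / (2 * j))         # diagonalize the torsion generator
    U_kick = V @ np.diag(np.exp(-1j * kappa * w)) @ V.conj().T
    return U_kick @ U_prec                           # kick ends the time interval
```

For κ = 0 this reduces to a pure precession, and the returned matrix is unitary for any parameters.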

Any new quantum metrology method needs to demonstrate its viability in the presence of noise and decoherence. We study two different versions of the KT which differ in the decoherence model used: phase damping and superradiant damping. Both can be described by Markovian master equations and are well-studied models for open quantum systems Dicke (1954); Gross et al. (1976); Gross and Haroche (1982); Braun (2001). While phase damping conserves the energy and only leads to decoherence in the J_z basis, superradiant damping leads in addition to a relaxation to the ground state |j, −j⟩. Its combination with periodic kicking in the chaotic regimes is known to give rise to a non-equilibrium steady state in the form of a smeared-out strange attractor Braun (2001) that still conserves information about the parameter ω, whereas without the kicking the system in the presence of superradiant damping simply decays to the ground state. The master equations for both processes have the Kossakowski–Lindblad form Kossakowski (1972); Lindblad (1976), with

Λ_pd ρ = Γ_pd [ J_z ρ J_z − (J_z² ρ + ρ J_z²)/2 ]    (7)

for phase damping, and

Λ_sr ρ = Γ_sr [ J_− ρ J_+ − (J_+ J_− ρ + ρ J_+ J_−)/2 ]    (8)

for superradiant damping, where J_± = J_x ± i J_y are the ladder operators, and Γ_pd and Γ_sr denote the decoherence rates. With the generator Λ, defined by dρ/dt = Λρ, one has in both cases the formal solution ρ(t) = e^{Λt} ρ(0) with the continuous-time propagator e^{Λt}. The solution of Eq. (7) in the J_z basis, where ρ_{mm′} = ⟨j, m|ρ|j, m′⟩, is immediate,

ρ_{mm′}(t) = e^{−Γ_pd (m − m′)² t/2} ρ_{mm′}(0).    (9)
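Because the phase-damping generator is diagonal in the |j, m⟩ basis, its propagator acts element-wise on the density matrix; a minimal numpy sketch (assuming the Lindblad convention with rate Γ_pd used above; the function name is ours):

```python
import numpy as np

def phase_damp(rho, gamma, t, j):
    """Apply the phase-damping propagator element-wise in the |j, m> basis.

    Assumes d rho/dt = Gamma (Jz rho Jz - {Jz^2, rho}/2), for which
    rho_{mm'}(t) = exp(-Gamma (m - m')^2 t / 2) rho_{mm'}(0).
    Basis ordering: m = j, j-1, ..., -j.
    """
    m = np.arange(j, -j - 1, -1)
    decay = np.exp(-gamma * (m[:, None] - m[None, :]) ** 2 * t / 2)
    return rho * decay
```

The diagonal populations are left untouched, consistent with the energy conservation mentioned in the text; only coherences between different m decay.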
Also for Eq. (8) a formally exact solution has been found Bonifacio et al. (1971), and efficient semiclassical (for large j) expressions are available Braun et al. (1998a, b). For our purposes it was simplest to solve Eq. (8) numerically by diagonalization of Λ. Combining these decoherence mechanisms with the unitary evolution, the transformation reads

ρ(t_{n+1}) = U_{n+1} [ e^{Λ (t_{n+1} − t_n)} ρ(t_n) ] U_{n+1}†,    (10)

because in both cases the dissipative generator commutes with the precession.
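The numerical route described here (diagonalizing the generator Λ) can be sketched by vectorizing the master equation; a minimal numpy illustration of the superradiant generator under our reconstruction of Eq. (8) (row-major vectorization, helper names ours):

```python
import numpy as np

def ladder_ops(j):
    """J_+ and J_- in the |j, m> basis with m = j, ..., -j."""
    m = np.arange(j, -j - 1, -1)
    c = np.sqrt(j * (j + 1) - m[1:] * (m[1:] + 1))
    Jp = np.diag(c, k=1).astype(complex)
    return Jp, Jp.conj().T

def superradiant_propagator(j, gamma, t):
    """Matrix e^{Lambda t} acting on the row-major vectorized density matrix.

    Uses vec(A rho B) = (A kron B^T) vec(rho); assumes Lambda is
    diagonalizable (np.linalg.eig, since Lambda is not Hermitian).
    """
    Jp, Jm = ladder_ops(j)
    d = int(2 * j + 1)
    I = np.eye(d)
    JpJm = Jp @ Jm
    L = gamma * (np.kron(Jm, Jm.conj())          # J_- rho J_+ term (J_+^T = conj(J_-))
                 - 0.5 * np.kron(JpJm, I)        # -(1/2) J_+ J_- rho
                 - 0.5 * np.kron(I, JpJm.T))     # -(1/2) rho J_+ J_-
    w, V = np.linalg.eig(L)
    return V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V)

def dissipate(rho, P):
    """Apply a vectorized propagator P to a density matrix rho."""
    d = rho.shape[0]
    return (P @ rho.reshape(d * d)).reshape(d, d)
```

For spin 1/2 and an initially excited state, this reproduces the exponential decay of the excited population at rate Γ_sr.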

As initial state we use an SU(2) coherent state, which can be seen as the most classical state of a spin Giraud et al. (2008, 2010), and is usually easy to prepare (for instance by optically polarizing the atomic spins in a SERF magnetometer). Also, it is equivalent to a symmetric state of 2j spin-1/2 particles all pointing in the same direction. With respect to the |j, m⟩ basis it reads

|θ, φ⟩ = ∑_{m=−j}^{j} [C(2j, j+m)]^{1/2} cos^{j+m}(θ/2) sin^{j−m}(θ/2) e^{i(j−m)φ} |j, m⟩,    (11)

with the binomial coefficient C(2j, j+m). We choose θ = π/2, φ = 0, i.e., a coherent state centered on the positive x-axis.
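The coherent-state amplitudes can be generated directly (a sketch assuming the standard rotated-|j, j⟩ construction; phase conventions differ between references, and the function name is ours):

```python
import numpy as np
from math import comb

def coherent_state(j, theta, phi):
    """SU(2) coherent state amplitudes in the |j, m> basis (m = j, ..., -j)."""
    ms = np.arange(j, -j - 1, -1)
    return np.array([
        np.sqrt(comb(int(2 * j), int(round(j + m))))
        * np.cos(theta / 2) ** (j + m)
        * np.sin(theta / 2) ** (j - m)
        * np.exp(1j * (j - m) * phi)
        for m in ms
    ])
```

The state is normalized for any (θ, φ) by the binomial theorem, and θ = 0 recovers the maximal-weight state |j, j⟩.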

IV The kicked top as a control problem and reinforcement learning

We consider the kicked top as a control problem and discretize the kicking strengths κ_n and times t_n. The precise parameters of the discretized control problem vary between the following examples and are summarized in Appendix A. In the following, Δt denotes a discrete time step (measured in units of τ), Δκ is a discrete step of the kicking strength, the RL agent optimizes the QFI at the final time T, and we bound the total accumulated kicking strength, a bound that is never reached in optimized policies though. The frequency ω that we want to estimate is set to induce a fixed rotation angle of the state per time step (ω is measured in units of 1/τ).

Possible control policies are simply given by a vector of kicking strengths (κ_1, κ_2, …), with one entry per discrete time step. To each policy corresponds a QFI value, calculated from the resulting state ρ(T), which quantifies how well the policy performs. To tackle this type of problem, various numerical algorithms are available, each with its own advantages and drawbacks Dunjko and Briegel (2018); Mehta et al. (2019); Palittapongarnpim et al. (2017). We pursue the (in the context of physics) relatively unexplored route of cross-entropy RL.

The system, the kicked top, will be called “environment”, and we imagine an “RL agent” interacting with the environment by applying nonlinear kicks (“actions”) and receiving in response information about the current state of the environment (“observation”, which is in our case the full density matrix of the current state), see Fig. 2. The RL agent repeatedly has to decide whether to increase the kicking strength (by Δκ) or to go on from the current position in time t to t + Δt. After each decision, it obtains an observation and, only after the total time T, a “reward” [the quantum Fisher information of ρ(T)] that it seeks to maximize. This concludes one “episode”, after which the environment is reset [i.e., the spin is reinitialized in the coherent state with θ = π/2, φ = 0, see Eq. (11)] and the next episode starts.
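The episode structure just described can be sketched as a tiny environment class (a hypothetical illustration of the kick/go-on action encoding; the dynamics and reward are injected as callables and are stand-ins, not the authors' implementation):

```python
class KickedTopEnv:
    """Minimal sketch of the control loop: action 0 ("kick") raises the strength
    of the upcoming kick by dk; action 1 ("go on") advances one time step by
    applying the accumulated kick together with the propagation. The reward
    (the QFI of the final state in the paper; here an injected callable) is
    granted only at the end of the episode."""
    KICK, GO_ON = 0, 1

    def __init__(self, n_steps, dk, propagate, reward_fn, initial_state):
        self.n_steps, self.dk = n_steps, dk
        self.propagate, self.reward_fn = propagate, reward_fn
        self.initial_state = initial_state
        self.reset()

    def reset(self):
        self.state = self.initial_state.copy()
        self.t_index, self.pending_kick = 0, 0.0
        return self.state

    def step(self, action):
        if action == self.KICK:            # accumulate kicking strength
            self.pending_kick += self.dk
        else:                              # apply kick + one step of dynamics
            self.state = self.propagate(self.state, self.pending_kick)
            self.pending_kick = 0.0
            self.t_index += 1
        done = self.t_index == self.n_steps
        reward = self.reward_fn(self.state) if done else 0.0
        return self.state, reward, done
```

In the paper's setting, `propagate` would implement Eq. (10) and `reward_fn` the QFI; here they are placeholders that make the control flow explicit.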

Figure 2: Typical setup in reinforcement learning: the RL agent acts upon the environment which in return gives the RL agent an observation and a reward. In our case the RL agent is a neural network and the environment is the generalized kicked top.

In our case, a neural network represents the RL agent: The observation is given to the neural network’s input neurons while each output neuron represents one possible action, i.e., we have two output neurons for “kick” and “go on”. The activation of these output neurons determines the probability of executing that action. The policy, however, is not given by the neural network directly. Since the environment is deterministic (i.e., the state evolves deterministically for a given sequence of kicking strengths) there is no point in choosing a stochastic policy such as a neural network. Instead, a single choice of kicking strengths represents the policy, which is obtained by first generating a few episodes with several trained neural networks and then picking the episode with the largest QFI. The kicking strengths applied in that episode represent the policy (see Appendix B). [In comparison, Hentschel and Sanders (2010, 2011) and Lovett et al. (2013) restricted their policy search for adaptive single-photon interferometry in such a way that their search space likewise corresponds to a vector of real parameters, making it similar to our problem. However, in their case the observations from the environment are probabilistic measurement outcomes, while in our case the observation is the deterministic state.]

The RL cross-entropy method De Boer et al. (2005) we use works as follows: We produce a set of episodes with the neural network, and then we reinforce the actions of the episodes with the highest reward. This is done by choosing the best fraction of episodes and using the pairs of observations and actions of these episodes to train the neural network with the stochastic gradient descent method called Adam (see Appendix for details) Kingma and Ba (2014). As a result of this training, the weights of the neural network are adjusted, i.e., the agent learns from its experience. Future actions taken by the agent are then influenced not only by randomness but also by this experience. The whole process of generating episodes and training the network is iterated. For the parameters of the training process see Appendix B. In Appendix C we study the learning success for different numbers of episodes and iterations.
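The sample-select-refit loop can be illustrated on a toy problem; in this self-contained sketch the neural network is replaced by an independent Bernoulli kick probability per time step, and the reward is a toy stand-in for the QFI (all names, numbers, and the smoothing factor are illustrative choices, not the paper's hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy_optimize(reward_fn, n_steps, n_episodes=200, elite_frac=0.2, n_iters=30):
    """Cross-entropy method: sample episodes, keep the elite fraction with the
    highest reward, and refit the sampling distribution to the elite actions."""
    p = np.full(n_steps, 0.5)                              # kick probability per step
    for _ in range(n_iters):
        episodes = rng.random((n_episodes, n_steps)) < p   # sampled action sequences
        rewards = np.array([reward_fn(e) for e in episodes])
        n_elite = max(1, int(elite_frac * n_episodes))
        elite = episodes[np.argsort(rewards)[-n_elite:]]   # highest-reward episodes
        p = 0.9 * p + 0.1 * elite.mean(axis=0)             # smoothed refit (a common
    return p                                               # stabilization choice)

# Toy reward: prefer kicks on even steps and none on odd steps.
target = np.arange(10) % 2 == 0
p_opt = cross_entropy_optimize(lambda e: -float(np.sum(e ^ target)), 10)
```

After a few dozen iterations the kick probabilities concentrate on the rewarded pattern; in the paper the same selection principle drives gradient updates of the network weights instead of a direct refit.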

V Results

Figure 3: Examples of the policy adopted by the RL agent for superradiant damping. We plot the accumulated kicking strength on the left axes as red dots and, on the right axes in blue, the quantum Fisher information for the top (solid line), for the periodically kicked top with kicking strength chosen as in Ref. Fiderer and Braun (2018) (dashed line), and the QFI that corresponds to the policy of the RL agent (crosses). We additionally plot red vertical lines at the times where the RL agent decides to set a kick. The heights of the lines correspond to the kicking strengths in arbitrary units and are not on the scale of the left axis. There is a regime where the RL agent manages to increase the QFI with each time step [panels (a) and (b)], and a regime where the RL agent makes the QFI oscillate [panels (c) and (d)].

We compare the QFI for different models: (i) the top (simple precession without kicks), (ii) the standard kicked top, as studied in Ref. Fiderer and Braun (2018), with periodic kicks of fixed period (i.e., a fixed precession angle per period) and constant kicking strength, and (iii) the generalized kicked top optimized with RL. In the case of superradiant damping (phase damping) we denote the top by SR-T (PD-T), the standard kicked top by SR-KT (PD-KT), and the RL-optimized generalized kicked top by SR-GKT (PD-GKT). Details on the training and the optimization of the RL results are provided in Appendix B.

Let us first consider superradiant damping, with results presented in Fig. 3. The QFI for the SR-T exhibits a characteristic growth quadratic in time. However, due to decoherence, the QFI does not maintain this growth but starts to decay rapidly towards zero. The time when the QFI reaches its maximum was found to decrease roughly with increasing spin size j and damping rate Γ_sr Fiderer and Braun (2018).

The situation changes with the introduction of nonlinear kicks. There, the QFI for the SR-KT shows the interesting behavior of not decaying to zero for large times. Instead, it reaches a plateau value which was found to take surprisingly high values for specific choices of the kicking strength and dissipation rate Fiderer and Braun (2018). The system loses energy through superradiant damping while the nonlinear kicks add energy. This prevents the state from decaying to the ground state, which is an eigenstate of the precession and would lead to a vanishing QFI. From this perspective, the plateau results from a dynamical equilibrium established by the interplay of superradiant damping and kicks.

However, the full potential of exploiting such effects and increasing the QFI with the help of nonlinear kicks is not achieved with constant periodic kicks. Instead, the RL agent (whose training takes about eight hours on a desktop computer) finds policies that make the QFI of the SR-GKT increase further even when the QFI of the SR-T has already decayed to zero and the QFI of the SR-KT has reached its plateau value.

Examples for different spin sizes and damping rates are presented in Fig. 3. The QFI of the SR-GKT is optimized for a total time T, which is the largest time plotted in each example. Where the plateau value of the SR-KT is relatively low, the RL-optimized policy achieves an improvement in sensitivity (associated with the inverse of the QFI) of more than an order of magnitude. Panels (a) and (b) show continuous growth of the QFI through an optimized kicking policy. Only if the time T (the QFI is optimized to be maximal at T) is increased further does the impressive growth of the QFI finally break down. Instead of increasing T, we choose to increase the superradiant damping while keeping T constant, which has a similar effect. In that case, see panels (c) and (d), the RL agent chooses a policy which makes the QFI oscillate at a relatively high level until the time T is reached.

Figure 4: Illustration of kicked superradiant dynamics with Wigner functions and its classical limit. The spin size and dissipation rate are those of panel (b) of Fig. 3. The left column (a) corresponds to the initial spin coherent state at t = 0. The middle and right columns correspond to the state at time T generated with periodic kicks [middle column (b)] and with kicks optimized with reinforcement learning [right column (c); the corresponding QFI is shown in panel (b) of Fig. 3]. The top two rows show the Wigner functions of the density matrix, the bottom two rows show the classical phase space, populated by points initially distributed according to the Husimi distribution of the initial spin coherent state and then propagated according to the classical equations of motion.

The superiority of the policies found by the RL agent can be understood by taking a look at the evolution of the quantum state, see Fig. 4: We represent the quantum state in the space of the rescaled angular momentum components (x, y, z) = (⟨J_x⟩, ⟨J_y⟩, ⟨J_z⟩)/j, where, due to the conservation of the total angular momentum, the space is restricted to a sphere. This is represented in Fig. 4 with either a sphere parametrized with x, y, and z, or in a plane (the phase space) spanned by the z-coordinate and the azimuthal angle φ, such that φ = 0 corresponds to the positive x-axis, φ = π/2 to the positive y-axis, and z = 1 (z = −1) with arbitrary φ to the positive (negative) z-axis.

The quantum state can be represented in the phase space with the help of the Husimi or the Wigner distribution, which are quasi-probability distributions of the quantum state. The first two rows of panels in Fig. 4 depict the Wigner distribution of the initial quantum state (left column) and of the quantum states of the SR-KT (middle column) and SR-GKT (right column) evolved for a time T. The plotted cases for the SR-KT and SR-GKT correspond to the QFI given in panel (b) of Fig. 3, where one can also see the corresponding RL-optimized distribution of kicks.

Due to the small spin size, we are deep in the quantum mechanical regime, which manifests itself in an uncertainty of the initial spin coherent state that is relatively large compared to the total size of the phase space. The distributions of the states evolved under dissipative dynamics exhibit remarkable differences for periodic and RL-optimized kicks:

In the case of periodic kicks, we find that the initially localized distribution gets spread over the phase space. It exhibits a maximum on the negative z-axis, see column (b) in Fig. 4. This is reminiscent of the dissipative evolution in the absence of kicks, where the state is driven towards the ground state, which is centered around the negative z-axis. The ground state is an eigenstate of the precession and, thus, insensitive to changes in the frequency ω we want to estimate. Similarly, we interpret the part of the state distribution of the SR-KT that is centered around the negative z-axis as insensitive. However, the distribution also exhibits non-vanishing parts distributed over the remainder of the phase space that can be understood as being sensitive to changes of ω and therefore explain the non-zero QFI of the SR-KT.

The state corresponding to RL-optimized kicks looks like a strongly squeezed state that almost encircles the whole sphere. Similar to spin squeezing, which is typically applied to the initial state as part of the state preparation, we interpret the squeezed distribution as particularly sensitive with respect to the precession dynamics. This is due to the reduced uncertainty along the precession trajectories, i.e., with respect to the azimuthal coordinate φ. In the Supplemental Material, we provide clips of the evolution over time of the state distributions that illustrate how the RL agent generates the squeezed state. In particular, the squeezed state distribution can be seen as a feature the RL agent is aiming for with its policy. The distribution of RL-optimized kicks is shown in Fig. 3 (in Appendix E, we provide a finer resolution of the distribution of kicks): it is roughly periodic, with a period corresponding to a fixed precession angle. Also note that for the SR-GKT the Wigner distribution has negative contributions, which is associated with non-classicality of the quantum state Agarwal (2012).

An advantage of the superradiant dynamics lies in its well-defined, simple classical limit Braun (2001), see also Appendix D. The lower two rows of panels in Fig. 4 depict the corresponding classical limit, where the quantum state is represented by a cloud of phase space points (distributed according to the Husimi distribution of the initial spin coherent state) that are propagated according to the classical equations of motion. One of the reasons why the evolved classical distributions differ from the Wigner distributions is the absence of quantum uncertainty in the classical dynamics; in principle, over the course of the dynamics all classical phase space points can be concentrated into an arbitrarily small region of the phase space. In the case of the SR-KT, the phase space points are distributed over the whole phase space, reminiscent of classical chaos. However, the distribution is not completely uniform but exhibits a spiral density inhomogeneity. The plots as in Fig. 4 but for further parameter choices are shown in Appendix E.

Figure 5: Improvement in the quantum Fisher information due to reinforcement learning for superradiant damping. The improvement in panel (a) is the ratio of the quantum Fisher information at time T (100 discretized time steps) optimized with reinforcement learning and the maximum QFI of the top (no kicks). In panel (b) we plot the ratio of the QFI optimized with reinforcement learning and the plateau values achieved by periodic kicking for fixed spin size and kicking strength. In panel (b), one parameter choice is omitted due to its very small plateau values. The discretization is coarser than in the previous examples.

Fig. 5 shows the gains of the RL-optimized SR-GKT over the SR-T. The gain is defined as the ratio of the RL-optimized QFI at time T and the maximum QFI for the SR-T. A broad damping regime is found where gains can be achieved: In the regime of small decoherence rates, the RL agent can fight decoherence in such a way that the QFI exhibits continuous growth over the total time T [see panels (a) and (b) in Fig. 3]. In comparison with the SR-T, the RL agent benefits from stronger damping in this regime and, therefore, the gain increases with the dissipation rate. For larger decoherence rates, the RL agent can no longer fight decoherence in the same manner [see panels (c) and (d) in Fig. 3], which manifests itself in a reduction of gains for large decoherence rates. In panel (b) of Fig. 5, we can see the (even larger) gain in QFI compared to the plateau value reached by the SR-KT.

Figure 6: Examples of the policy adopted by the RL agent for maximizing the rescaled quantum Fisher information with superradiant damping. We plot the accumulated kicking strength on the left axis as red dots and, on the right axis, the rescaled quantum Fisher information for the top (blue solid line) and for the generalized kicked top optimized with reinforcement learning (blue crosses). In both examples the strongest kick is applied only after an initial rotation of the state.
Figure 7: Examples of the strategy adopted by the RL agent for phase damping. All data are for a fixed spin size, with damping rates increasing from panel (a) to (d). We plot the accumulated kicking strength on the left axis as red dots and, on the right axis, the quantum Fisher information for the top (blue solid line) and for the generalized kicked top optimized with reinforcement learning (blue crosses). We additionally plot red vertical lines at the times when the RL agent sets a kick. The lengths of the lines correspond to the kicking strengths in arbitrary units (independent of the scale of the left axis). Note that the RL agent aims to maximize the QFI at the final time T and outperforms the top in all examples.

The RL-optimized QFI is associated with a lower bound on the sensitivity [see Eq. (1)] for a given measurement time T. If the measurement time can be chosen arbitrarily, sensitivity is associated with the rescaled QFI F_Q/T Fiderer and Braun (2018). This rescaled quantity represents the standard figure reported for experimental parameter estimation because it takes time into account as a valuable resource; sensitivity is given in units of the parameter to be estimated per square root of Hertz. With RL we try to maximize F_Q/T with respect to policies.

Fig. 6 compares the SR-T with the SR-GKT, where the latter was optimized with RL in order to maximize the rescaled QFI. Note that the initial spin coherent state is centered around the positive x-axis, which means it is an eigenstate of the nonlinear kicks; kicks cannot induce spin squeezing at the very beginning of the dynamics. This changes when the spin precesses away from the x-axis. Therefore, it makes sense that the RL agent applies the strongest kick only after the state has precessed by a finite angle. The actions that the RL agent takes after the rescaled QFI has reached its maximum are irrelevant and can be attributed to random noise generated by the RL algorithm.

As we have seen, the interplay of nonlinear kicks and superradiant damping is very special. However, also for other decoherence models the QFI can be increased significantly, for instance in the case of an alkali-vapor magnetometer Fiderer and Braun (2018). To demonstrate the performance of the RL agent in connection with another decoherence model, we take a look at phase damping, see Fig. 7. The behavior of the QFI of the PD-T is qualitatively similar to superradiant damping. The introduction of kicks, however, has a qualitatively different effect on the QFI. The RL agent can achieve improvements of the QFI for the PD-GKT at time T (the largest time plotted in each panel of Fig. 7) compared with the QFI of the PD-T at the same time. Compared to the superradiant case, the improvements are rather small. Notably, the policies applied by the RL agent are also different from those for superradiant damping; for instance, the RL agent avoids kicks for large parts of the dynamics.

VI Discussion

This work builds on recent results on quantum-chaotic sensors Fiderer and Braun (2018). We find that reinforcement learning (RL) techniques can be used to optimize the dynamical control that was used in Ref. Fiderer and Braun (2018) to render the sensor dynamics chaotic. The control policies found with RL are tailored to boundary conditions such as the initial state, the targeted measurement time, and the decoherence model under consideration. Using the example of superradiant damping, we demonstrate improvements in measurement precision and improved robustness with respect to decoherence. A drawback of RL often lies in the expensive hyperparameter tuning of the algorithm. Here, however, we demonstrate that a basic reinforcement learning algorithm (the cross entropy method) can be used for several choices of boundary conditions with practically no hyperparameter tuning (no hyperparameter search was necessary; only parameters that directly influence the computation time were chosen conveniently). Another drawback of RL is its black-box character: while the results achieve a good performance, the underlying reasons and mechanisms remain hidden. In the example of superradiant damping, we were able to unveil the approach taken by RL by visualizing the quantum dynamics with the help of the Wigner distribution of the quantum state. This revealed that RL favors a policy that is reminiscent of spin squeezing. However, instead of squeezing the state only at the beginning of the dynamics, the squeezing is refreshed and enhanced in roughly periodic cycles in order to fight the superradiant damping. In the spirit of Ref. Fiderer and Braun (2018), these findings emphasize the potential that lies in the optimization of the measurement dynamics. We are optimistic that reinforcement learning will be used in other quantum metrological settings in order to achieve maximum measurement precision with limited quantum resources.

Appendix A Control problem and optimization parameters of the examples

Table 1 shows the parameters of the control problem and of the optimization used in each example. We train RL agents for a given number of iterations, with a given number of episodes in each iteration. Each episode is simulated until the total time is reached. We then produce sample episodes for each trained RL agent and choose the best episode to plot the sample policies and gains.

Samples with superradiant damping (Fig. 3): 5, 500, 50, 20, 0.2, 0.05, 100
Gains of superradiant damping (Fig. 5): 20, 300, 40, 20, 1.0, 0.10, 100
Samples of rescaled QFI (Fig. 6): 2, 500, 50, 20, 0.1, 0.10, 50
Samples with phase damping (Fig. 7): 1, 1000, 100, 1, 1.0, 0.10, 100
Table 1: Hyperparameters used for the examples in the main text.

Appendix B Cross entropy reinforcement learning

Here we give further information on the neural network, the cross entropy method, and the pseudocode for the cross entropy method with discrete actions. The code implementation is based on an example by Jan Schaffranek.

The input layer of the neural network is defined by the observation. The output layer is determined by the number of actions (two), and we choose 300 neurons in the hidden layer. The layers are fully connected. The hidden layer uses the rectified linear unit (ReLU) as its activation function, and the output layer uses the softmax function Nielsen (2015). As the cost function we choose the categorical cross entropy Nielsen (2015). The share of best episodes is always . The number of iterations and the number of episodes vary between settings; see Table 1 for details. For training we use the Adam optimizer Kingma and Ba (2014) with learning rate .
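A minimal sketch of the two ingredients described above: the policy network (one fully connected hidden layer of 300 ReLU units, two-action softmax output) and the elite-selection step at the heart of the cross entropy method. This is an illustration in plain NumPy, not the authors' code; the observation dimension `obs_dim`, the weight initialization, and the elite fraction `elite_frac` are hypothetical choices for the sketch.

```python
import numpy as np

def init_params(obs_dim, hidden=300, n_actions=2, seed=0):
    """Random initialization for the fully connected layers (illustrative)."""
    rng = np.random.default_rng(seed)
    return (rng.normal(0, 0.1, (obs_dim, hidden)), np.zeros(hidden),
            rng.normal(0, 0.1, (hidden, n_actions)), np.zeros(n_actions))

def policy_forward(obs, params):
    """Forward pass: ReLU hidden layer, softmax over the discrete actions."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, obs @ W1 + b1)     # ReLU activation
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

def select_elite_episodes(episodes, rewards, elite_frac=0.2):
    """Keep the best share of episodes; their state-action pairs become the
    supervised targets of the next cross-entropy iteration."""
    n_elite = max(1, int(len(episodes) * elite_frac))
    best_first = np.argsort(rewards)[::-1]
    return [episodes[i] for i in best_first[:n_elite]]
```

In each iteration one would run a batch of episodes with the current stochastic policy, select the elites by reward, and fit the network to the elite state-action pairs using the categorical cross-entropy loss and the Adam optimizer.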

Appendix C Learning curve and stability of the algorithm

Figure 8: Learning behavior of the algorithm for the example of superradiant damping. Panel (a) shows the mean QFI and its standard deviation with respect to different runs of the algorithm for various numbers of iterations, with the number of episodes fixed to 100. In panel (b) the number of episodes is varied and the number of iterations is fixed to 500.

Using the example of the superradiant decoherence model, we study the learning behavior of the cross entropy reinforcement learning algorithm for different training lengths (i.e., numbers of iterations) and different numbers of episodes per iteration. The results are summarized in Fig. 8. The spin size and the dissipation rate are fixed throughout.

To see the influence of the number of iterations, we set the number of episodes to 100 and let 20 different RL agents (with different random seeds) train for various numbers of iterations. The training of a single RL agent takes at most about one hour (for the larger numbers of iterations) on a desktop computer. We then use each RL agent to produce 20 episodes, giving 400 episodes for each data point in Fig. 8. We use these episodes to calculate the mean and standard deviation of the reward. The results are shown in panel (a) of Fig. 8. To see the influence of the number of episodes per iteration, we fix the number of iterations to 500 and follow the same procedure. The results are shown in panel (b) of Fig. 8.
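The pooling of episodes into each data point can be sketched as follows (illustrative NumPy only; the reward values here are random placeholders standing in for the measured QFI values, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
# 20 independently trained agents, 20 evaluation episodes each
# (placeholder rewards; in the paper these are the rewards/QFI values)
rewards = rng.normal(loc=100.0, scale=5.0, size=(20, 20))

pooled = rewards.ravel()                  # 400 episodes per data point
mean, std = pooled.mean(), pooled.std()   # statistics plotted in Fig. 8
```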

We can see that the standard deviation over policies decreases with the number of iterations while the mean QFI increases. The same holds for the number of episodes [panel (b)], where a stable plateau of the QFI is reached at 32 episodes, such that increasing the number of episodes further does not bring any improvement. Overall, these results demonstrate the stability of the algorithm if the numbers of episodes and iterations are chosen sufficiently large.

Appendix D Classical equations of motion

The kicked top with superradiant damping has a well-defined classical limit. It is obtained from the quantum equations of motion in the limit of large spin. The rescaled angular momentum operator then becomes the classical coordinate vector, and the unit sphere becomes the classical phase space, with the azimuthal angle and the -coordinate as canonical variables. The equations of motion are found to be Braun (2001)


for the precession about the -axis by an angle ,


for the kicks about the -axis with kicking strength , and, with azimuthal angle (see main text)


for the superradiant damping, where


for a time , spin size , and superradiant decoherence rate .
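For orientation, the classical kicked-top maps in a common parametrization look as follows. This is a standard textbook form on the unit sphere \((X,Y,Z)\); the axis conventions, the symbols \(p\), \(k\), \(\gamma_{\mathrm{cl}}\), and the mean-field form of the damping are generic assumptions and need not match the exact conventions of Braun (2001):

```latex
% Linear precession about the y-axis by an angle p:
\[
  (X,Y,Z) \mapsto \bigl(X\cos p + Z\sin p,\; Y,\; -X\sin p + Z\cos p\bigr),
\]
% nonlinear kick (torsion) about the z-axis with kicking strength k,
% i.e. a rotation about z by an angle proportional to Z itself:
\[
  (X,Y,Z) \mapsto \bigl(X\cos(kZ) - Y\sin(kZ),\; X\sin(kZ) + Y\cos(kZ),\; Z\bigr),
\]
% and, between kicks, superradiant damping in mean-field approximation,
% which drives Z toward the south pole at constant azimuthal angle:
\[
  \dot{Z} = -\gamma_{\mathrm{cl}}\,\bigl(1 - Z^2\bigr), \qquad \dot{\varphi} = 0 .
\]
```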

Appendix E A closer look at the kicks set by the reinforcement learning agent

Here we take a closer look at the kicks chosen by the RL agent in the examples with superradiant damping, considered in Fig. 3 in the main text.

In the case of , the distributions of kicks are relatively similar for both policies, see panel (a) in Fig. 9. The most striking difference between the two policies is the comparatively strong kicks at the beginning of the sequence. By observing the time evolution of the Wigner function (see Supplemental Material), we find that these kicks essentially rotate the state by an additional angle about the -axis. This leads to a phase shift between the two policies [see panels (d) of Fig. 10] compared to the initial state [see panels (a) of Fig. 10].

For , the policies are even more similar, with several kicks increasing in strength with a period length of , see panel (b) in Fig. 9.

Fig. 10 is the analog of Fig. 4 in the main text but for instead of . The only qualitative difference compared to the case of is the periodically kicked top: the combination of periodic kicks with and seems to be a special configuration. The classical phase space is comparable with the case, but there is much less structure in the Wigner function. Instead, the state concentrates on the south pole and exhibits a slightly squeezed shape (although this is difficult to judge from Fig. 10). The rather high value of the QFI for and is best explained by this squeezing. When choosing other kicking strengths, we observed a Wigner function similar to the case of .



Figure 9: Kicks set by the RL agent for the SR-GKT. Panel (a) shows the case and panel (b) the case . Shown in red on the left axis are the kicking strengths for (crosses) and (circles). To illustrate the precession, the right axis shows in grey the component of an unkicked spin coherent state without decoherence.
Figure 10: The same data as in Fig. 4 but for instead of : illustration of the kicked superradiant dynamics with Wigner functions and its classical limit. The dissipation rate is . Panels in the left column (a) correspond to the initial spin coherent state at . The middle and right columns correspond to the state at time generated with periodic kicks [middle column (b), ] and with kicks optimized with reinforcement learning [right column (c); the corresponding QFI is shown in panel (b) of Fig. 3]. The top two rows show the Wigner functions of the density matrix; the bottom two rows show the classical phase space, populated by points initially distributed according to the Husimi distribution of the initial spin coherent state and then propagated according to the classical equations of motion.


  • Murphy (2012) K. P. Murphy, Machine Learning: A Probabilistic Perspective (MIT press, 2012).
  • Dunjko and Briegel (2018) V. Dunjko and H. J. Briegel, Reports on Progress in Physics 81, 074001 (2018).
  • Mehta et al. (2019) P. Mehta, M. Bukov, C.-H. Wang, A. G. Day, C. Richardson, C. K. Fisher,  and D. J. Schwab, Physics Reports  (2019).
  • Carrasquilla and Melko (2017) J. Carrasquilla and R. G. Melko, Nature Physics 13, 431 (2017).
  • Broecker et al. (2017) P. Broecker, F. F. Assaad,  and S. Trebst, arXiv preprint arXiv:1707.00663  (2017).
  • Van Nieuwenburg et al. (2017) E. P. Van Nieuwenburg, Y.-H. Liu,  and S. D. Huber, Nature Physics 13, 435 (2017).
  • Carleo and Troyer (2017) G. Carleo and M. Troyer, Science 355, 602 (2017).
  • Carleo et al. (2018) G. Carleo, Y. Nomura,  and M. Imada, Nature Communications 9, 5322 (2018).
  • Gao and Duan (2017) X. Gao and L.-M. Duan, Nature Communications 8, 662 (2017).
  • Leigh (2004) J. R. Leigh, Control Theory, Vol. 64 (IET, 2004).
  • Kaelbling et al. (1996) L. P. Kaelbling, M. L. Littman,  and A. W. Moore, Journal of Artificial Intelligence Research 4, 237 (1996).
  • Sutton and Barto (2018) R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT press, 2018).
  • Sutton et al. (1992) R. S. Sutton, A. G. Barto,  and R. J. Williams, IEEE Control Systems Magazine 12, 19 (1992).
  • Chen et al. (2013) C. Chen, D. Dong, H.-X. Li, J. Chu,  and T.-J. Tarn, IEEE Transactions on Neural Networks and Learning Systems 25, 920 (2013).
  • Palittapongarnpim et al. (2017) P. Palittapongarnpim, P. Wittek, E. Zahedinejad, S. Vedaie,  and B. C. Sanders, Neurocomputing 268, 116 (2017).
  • Fösel et al. (2018) T. Fösel, P. Tighineanu, T. Weiss,  and F. Marquardt, Physical Review X 8, 031084 (2018).
  • Bukov et al. (2018) M. Bukov, A. G. Day, D. Sels, P. Weinberg, A. Polkovnikov,  and P. Mehta, Physical Review X 8, 031086 (2018).
  • Albarrán-Arriagada et al. (2018) F. Albarrán-Arriagada, J. C. Retamal, E. Solano,  and L. Lamata, Physical Review A 98, 042315 (2018).
  • Niu et al. (2019) M. Y. Niu, S. Boixo, V. N. Smelyanskiy,  and H. Neven, in AIAA Scitech 2019 Forum (2019) p. 0954.
  • Melnikov et al. (2018) A. A. Melnikov, H. P. Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger,  and H. J. Briegel, Proceedings of the National Academy of Sciences 115, 1221 (2018).
  • Sweke et al. (2018) R. Sweke, M. S. Kesselring, E. P. van Nieuwenburg,  and J. Eisert, arXiv preprint arXiv:1810.07207  (2018).
  • Andreasson et al. (2018) P. Andreasson, J. Johansson, S. Liljestrand,  and M. Granath, arXiv preprint arXiv:1811.12338  (2018).
  • Hentschel and Sanders (2010) A. Hentschel and B. C. Sanders, in 2010 Seventh International Conference on Information Technology: New Generations (IEEE, 2010) pp. 506–511.
  • Hentschel and Sanders (2011) A. Hentschel and B. C. Sanders, Physical Review Letters 107, 233601 (2011).
  • Lovett et al. (2013) N. B. Lovett, C. Crosnier, M. Perarnau-Llobet,  and B. C. Sanders, Physical Review Letters 110, 220501 (2013).
  • Sergeevich and Bartlett (2012) A. Sergeevich and S. D. Bartlett, in 2012 IEEE Congress on Evolutionary Computation (IEEE, 2012) pp. 1–3.
  • Stenberg et al. (2016) M. P. Stenberg, O. Köhn,  and F. K. Wilhelm, Physical Review A 93, 012122 (2016).
  • Palittapongarnpim et al. (2016) P. Palittapongarnpim, P. Wittek,  and B. C. Sanders, in 24th European Symposium on Artificial Neural Networks, Bruges, April 27–29, 2016 (2016) pp. 327–332.
  • Lumino et al. (2018) A. Lumino, E. Polino, A. S. Rab, G. Milani, N. Spagnolo, N. Wiebe,  and F. Sciarrino, Physical Review Applied 10, 044033 (2018).
  • Liu and Yuan (2017a) J. Liu and H. Yuan, Physical Review A 96, 012117 (2017a).
  • Liu and Yuan (2017b) J. Liu and H. Yuan, Physical Review A 96, 042114 (2017b).
  • Xu et al. (2019) H. Xu, J. Li, L. Liu, Y. Wang, H. Yuan,  and X. Wang, arXiv preprint arXiv:1904.11298  (2019).
  • Fiderer and Braun (2018) L. J. Fiderer and D. Braun, Nature Communications 9, 1351 (2018).
  • Fiderer and Braun (2019) L. J. Fiderer and D. Braun, in Optical, Opto-Atomic, and Entanglement-Enhanced Precision Metrology, Vol. 10934 (International Society for Optics and Photonics, 2019) p. 109342S.
  • Helstrom (1976) C. W. Helstrom, Quantum Detection and Estimation Theory (Academic press, 1976).
  • Holevo (1982) A. S. Holevo, Probabilistic and Statistical Aspects of Quantum Theory (North-Holland, Amsterdam, 1982).
  • Braunstein and Caves (1994) S. L. Braunstein and C. M. Caves, Phys. Rev. Lett. 72, 3439 (1994).
  • Paris (2009) M. G. A. Paris, International Journal of Quantum Information 7, 125 (2009).
  • Peres (2006) A. Peres, Quantum Theory: Concepts and Methods, Vol. 57 (Springer Science & Business Media, 2006).
  • Chaudhury et al. (2009) S. Chaudhury, A. Smith, B. Anderson, S. Ghose,  and P. S. Jessen, Nature 461, 768 (2009).
  • Krithika et al. (2019) V. Krithika, V. Anjusha, U. T. Bhosale,  and T. Mahesh, Physical Review E 99, 032219 (2019).
  • Dicke (1954) R. H. Dicke, Phys. Rev. 93, 99 (1954).
  • Gross et al. (1976) M. Gross, C. Fabre, P. Pillet,  and S. Haroche, Phys. Rev. Lett. 36, 1035 (1976).
  • Gross and Haroche (1982) M. Gross and S. Haroche, Phys. Rep. 93, 301 (1982).
  • Braun (2001) D. Braun, Dissipative Quantum Chaos and Decoherence, Springer Tracts in Modern Physics, Vol. 172 (Springer, 2001).
  • Kossakowski (1972) A. Kossakowski, Rep. Math. Phys. 3, 247 (1972).
  • Lindblad (1976) G. Lindblad, Commun. Math. Phys. 48, 119 (1976).
  • Bonifacio et al. (1971) R. Bonifacio, P. Schwendimann,  and F. Haake, Physical Review A 4, 302 (1971).
  • Braun et al. (1998a) P. A. Braun, D. Braun,  and F. Haake, Eur. Phys. J. D 3, 1 (1998a).
  • Braun et al. (1998b) P. A. Braun, D. Braun, F. Haake,  and J. Weber, Eur. Phys. J. D 2, 165 (1998b).
  • Giraud et al. (2008) O. Giraud, P. Braun,  and D. Braun, Phys. Rev. A 78, 042112 (2008).
  • Giraud et al. (2010) O. Giraud, P. Braun,  and D. Braun, New Journal of Physics 12, 063005 (2010).
  • De Boer et al. (2005) P.-T. De Boer, D. P. Kroese, S. Mannor,  and R. Y. Rubinstein, Annals of Operations Research 134, 19 (2005).
  • Kingma and Ba (2014) D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980  (2014).
  • Agarwal (2012) G. S. Agarwal, Quantum Optics (Cambridge University Press, 2012).
  • Nielsen (2015) M. A. Nielsen, Neural Networks and Deep Learning (Determination Press, 2015).