Control of Stochastic Quantum Dynamics with Differentiable Programming

01/04/2021 ∙ by Frank Schäfer, et al. ∙ Universität Basel 5

Controlling stochastic dynamics of a quantum system is an indispensable task in fields such as quantum information processing and metrology. Yet, there is no general ready-made approach to design efficient control strategies. Here, we propose a framework for the automated design of control schemes based on differentiable programming (∂P). We apply this approach to state preparation and stabilization of a qubit subjected to homodyne detection. To this end, we formulate the control task as an optimization problem where the loss function quantifies the distance from the target state and we employ neural networks (NNs) as controllers. The system's time evolution is governed by a stochastic differential equation (SDE). To implement efficient training, we backpropagate the gradient information from the loss function through the SDE solver using adjoint sensitivity methods. As a first example, we feed the quantum state to the controller and focus on different methods to obtain gradients. As a second example, we directly feed the homodyne detection signal to the controller. The instantaneous value of the homodyne current contains only very limited information on the actual state of the system, covered in unavoidable photon-number fluctuations. Despite the resulting poor signal-to-noise ratio, we can train our controller to prepare and stabilize the qubit to a target state with a mean fidelity around 85 solutions found by the NN to a hand-crafted control strategy.



There are no comments yet.


page 12

page 13

page 30

Code Repositories


Repository for the Control of Stochastic Quantum Dynamics with Differentiable Programming paper.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The ability to precisely prepare and manipulate quantum degrees of freedom is a prerequisite of most applications of quantum mechanics in sensing, computation, simulation and general information processing. Many relevant tasks in this area can be formulated as optimal control problems, and therefore,

quantum control is a rich and very active research field, see [1, 2] for two recent textbooks that provide theoretical background and [3] for a recent review of important issues.

A typical goal of quantum control is to find a sequence of operations or parameter values (e.g. external field amplitudes) such that the quantum system under consideration is maintained in a certain target state or evolves in a desired fashion subject to additional boundary conditions (e.g. most rapid evolution for a prescribed maximal strength of the control fields). In the case of feedback control the control sequence is determined based on a signal coming from the system [3, 4]. The control task and its boundary conditions are typically specified by a loss function. To cast the optimal control problem into an optimization task, one then introduces a parametric ansatz for feedback schemes, also known as controllers, and explores the parameter space to minimize the loss function.

Reinforcement learning (RL) [5, 6] has been proposed as a suitable framework to develop control strategies. In this framework, the controller (or agent), optimizes its strategy (the policy) based on the loss function (the rewards) obtained from repetitive interactions with the system to be controlled. In the context of quantum physics, RL has proven useful, e.g. for reducing the error rate in quantum gate synthesis [7], for autonomous preparation of Floquet-engineered states [8], and for optimal manipulation of many-qubit systems [9]. RL is a black-box setting, i.e. the agent has no prior knowledge on the structure of the system it interacts with, and has to develop its policy (explore the space of control parameters) only relying on its past interactions with the system. This makes RL very versatile but requires a large number of training episodes to find a good strategy.

Optimal control design is rarely done on an (unknown) system in situ. Instead one trains the controller on a physical model of the real system. This means that rather than learning how to interact with an unknown environment, one actually starts with a lot of prior scientific knowledge about the system, namely its precise dynamical model. In the simplest cases, one can even solve the optimal control problem for this model analytically. Generally, using prior information leads to more data-efficient training [10, 11]. In the context of the present paper, we use the physical model of the system to efficiently compute the loss function’s gradients with respect to the parameters of the controller ansatz. Naturally, having access to the gradients of the loss function can streamline the optimal control design tremendously.

In the case of quantum control the model consist of a quantum state space and a parametric equation of motion. The dynamics of the system can be solved for fixed values of the parameters of the model and of the controller. Usually, this is done numerically. Then, the most naive way to obtain the loss function’s gradients is to solve the dynamics for a set of parameter values in the neighborhood of each point and use finite difference quotients. Yet, this method is unfeasible if the number of parameters is large, it suffers from floating-point errors, and it may be numerically unstable. To circumvent these issues, automatic differentiation (AD) has been proposed as another approach to calculate gradients numerically [12, 13], a paradigm also known as differentiable programming (P) [14, 15]. By backpropagating the loss function’s sensitivity through the numerical simulation, one can compute the gradients with similar time complexity as solving the system’s dynamics [16]. Recently, these techniques have been merged with deep neural networks (NNs) as ansatz for the controllers [10, 17, 18]

. This is possible because the training of deep NNs is based on stochastic gradient descent through a large parameter space and becomes efficient when used in conjunction with

P-compatible solvers for the system’s dynamics.

In this work, we develop such a physics-informed reinforcement learning framework based on P and NNs to control the stochastic dynamics of a quantum system under continuous monitoring [2, 19]. Continuous measurements, such as photon counting and homodyne detection, allow one to gain information on the random evolution of a dissipative quantum system [2]

. This information can be used to estimate the state of the quantum system

[20, 21, 22], to implement feedback protocols [23, 24, 25, 26, 27], to generate nonclassical states [28, 29, 30, 31], and to implement teleportation protocols [32, 33]. Continuous homodyne detection can be realized experimentally in the microwave [34, 35] and optical regime [20, 36]. The time evolution of a monitored quantum system is described by quantum trajectories, which are solutions of differential equations driven by a Lévy process.

To illustrate our framework, we focus on a qubit subjected to continuous homodyne detection [34, 37] described by a stochastic Schrödinger equation. We engineer a controller which provides a control scheme to fulfil a given state preparation task based on the measured homodyne current. This situation extends an earlier study [10], where it has been demonstrated that

P can be efficiently used for quantum control in the context of (unitary) closed-system dynamics, i.e. when the dynamics follows an ordinary differential equation. The stochastic nature of the problem analyzed here renders the control task more challenging because the controller must adapt to the random evolution of the quantum state in each trajectory. Moreover, the instantaneous value of the homodyne current does not determine the actual state of the qubit. It is correlated only to the projection of the state onto the

-axis and this signal is hidden in the noise which dominates the measured homodyne current [38, 39]. Thus, the information about the state of the qubit at a given time must be filtered out from the time series of measurement results.

This paper is organized as follows: In Section 2, we describe the proposed setup of a qubit in a leaky cavity subjected to homodyne detection and derive the stochastic Schrödinger equation that describes its dynamics. We then discuss two ways to use the record of the homodyne detection signal in a feedback scheme to engineer a drive that can be applied to the qubit to perform a desired control task, e.g. state preparation or stabilization. We also introduce the concept of adjoint sensitivity methods that we use to efficiently compute gradients with respect to solutions of SDEs. Section 3 describes the first feedback scheme in detail: Here, we assume that the controller has direct access to the quantum state, e.g. through an appropriate filtering procedure applied to the measurement records. We compare three strategies, viz. a hand-crafted control scheme, one in which a neural network continuously updates the control drive based on the knowledge of the state, and a numerically less demanding one with a piecewise-constant control drive. Section 4 presents the second feedback scheme where directly the measurement record of the homodyne current is fed to the NN representing the controller. In this setup, the NN must first learn how to filter the data to obtain information on the state of the system. Only after that it can propose an efficient control strategy. We conclude in Section 5 and discuss potential future applications.

2 Theoretical background

2.1 A qubit under homodyne detection

Figure 1: (a) Sketch of the considered setup. A qubit coupled to a leaky cavity is continuously monitored by a homodyne detection measurement. The radiation emitted by the qubit is mixed with a local oscillator laser with complex amplitude at a beamsplitter. The signal , which is the difference of the photocurrents of the two detectors, is fed into a control agent, which applies a drive to the qubit to generate and stabilize a target state. (b) Different structures of the controller considered in Secs. 3 and 4, respectively. (c) Example of a typical signal of the homodyne detection measurement. (d) The homodyne detection signal is proportional to the quadrature , which is hidden in the noise of the measurement [note the different axis scaling in (c) and (d)]. The data has been integrated over a measurement interval . The parameters of the physical model are , .

We consider a driven two-level system with states and . In a rotating frame, its Hamiltonian reads


where are Pauli matrices, is the Rabi frequency of the drive laser, and is the detuning between the qubit and the laser. The qubit can spontaneously decay into a photon field via the interaction Hamiltonian


where is the decay rate. The field operators and satisfy the commutation relation . We assume that the field is initially in the vacuum state, . Physical examples of such a system are a two-level atom inside a leaky single-mode cavity that can be adiabatically eliminated, or an artificial atom, e.g. a superconducting qubit, coupled to a waveguide. The radiation emitted from the two-level system is monitored with a continuous homodyne measurement, as depicted in Fig. 1(a).

We show in A (see also [2, 19]) that the evolution of the qubit is governed by a stochastic Schrödinger equation


where denotes an unnormalized qubit state. The instantaneous value of the measured homodyne current

is a random variable satisfying


Here, is the expectation value of at time , and

is a stochastic white-noise term satisfying

, which stems from the shot noise of the local oscillator. Heuristically,

can be considered as the derivative of a stochastic Wiener increment, , such that the contribution of the noise to the current integrated over a short time interval is described by a Wiener process,


The ensemble averages of the stochastic Wiener increment satisfy and . Several remarks are in order to better understand the dynamics of the system.

First, Eq. (5) implies that the value of the current over a short interval contains only very little information about the state of the qubit. This can be seen from the (heuristically stated) vanishing signal-to-noise ratio


If was a constant signal, one could simply integrate the current over a time interval longer than to increase the signal-to-noise ratio. However, this is not possible because relaxation will change the state of the two-level system on the time scale . Thus, the low signal-to-noise ratio is an intrinsic feature of this homodyne detection scheme. This is illustrated in Figs. 1(c) and (d), where we show a simulation of the homodyne current together with the respective value for a single quantum trajectory.

Second, Eq. (3) shows that the infinitesimal time evolution and thus the quantum trajectory is fully determined by the record of the measured homodyne current and the values of the applied drive

, which are vectors containing the respective values of

and from the start time until the time . In A.4, we derive a closed-form expression of the operator , which gives the mapping


between the states of the qubit at times and . The operator can be interpreted as a filter determining the state of the qubit at time from the values of the measured homodyne current and the applied drive.

Finally, Eq. (3) does not preserve the norm of the state , as indicated by the tilde. This is no problem if one integrates Eq. (3) numerically, since one can renormalize the state after each time step. For analytical calculations, it is useful to add some correction terms (see A.5 or [2, 19]) such that the norm of the state is preserved up to second order in ,


Here, the nonlinear drift and diffusion terms are


Note that Eq. (8) is a stochastic Schrödinger equation in the Itô form with multiplicative scalar noise.

2.2 Feedback control overview

In our control protocol, the results of the homodyne detection measurement determine the drive to be applied to the qubit. We will consider two different schemes which are sketched in Fig. 1(b).

In the first scheme, discussed in Section 3, we filter the homodyne signal to extract the system’s exact state at time . Subsequently, the controller receives this state as an input and determines the drive to be applied next. Equations (7) and (55) give an explicit filtering procedure to determine the state of the qubit at time from the records of homodyne measurements and the drive , which are known in the experiment. Since we train the agent on simulated trajectories, we know the system’s state at each time step. Therefore, we can skip the filter in Fig. 1(b) and directly feed back the solution of the SDE solver at each time step to our controller. In other words we assume a perfect filtering at any time. Thus, this situation corresponds to the feedback used in Ref. [10] but the deterministic evolution is replaced by an SDE. We will use this control scheme in Section 3 to test different backpropagation methods and to compare different control strategies.

In the second scheme, discussed in Section 4, the controller obtains at time the homodyne current record measured over some time interval . Now, the NN forming the controller must simultaneously learn how to filter the signal from the noise and to predict the next action . Such an implementation of the control protocol based only on is a challenging task because the signal of the system quadrature is hidden in the noise as discussed in Section 2.1.

2.3 Workflow

Figure 2: Workflow of the learning scheme discussed in Section 2.3 to train the controller. In the forward pass, a controller, which is in this work implemented by a neural network, maps the present quantum state (see Section 3) or a measurement of the homodyne current (see Section 4) to a drive . Then, an SDE is solved to determine the subsequent state and homodyne detection current . A loss function modeling the state preparation objective and possible constraints is evaluated based on a quantum trajectory, i.e. a sequence of states. In the backward pass, the gradient of the loss function with respect to the parameters of the controller is evaluated by (adjoint) sensitivity methods (see Section 2.4

). This step incorporates physical knowledge of the system into the training process and is numerically more efficient than a model-blind gradient estimation. The gradient of the loss function with respect to the parameters of the controller is used to update the control strategy in a series of training epochs.

The learning scheme to control the stochastic dynamics of the continuously monitored qubit based on P consists of three building blocks as sketched in Fig. 2: a parametrized controller , which will be formed by a NN, a model of the dynamics, expressed as an SDE, and a loss function.

At the beginning of each run, we initialize the system in an arbitrary state on the Bloch sphere,


where and are the excited and ground states of the qubit in the -basis, respectively. To ensure that the controller will perform optimally for any initial state, we sample the angles and uniformly from their intervals and , respectively.

Depending on the chosen control scheme, cf. Fig. 1(b), the controller receives as an input either the quantum state (Section 3) or the last homodyne detection record in form of a vector gathered over some time interval (Section 4). Moreover, in Section 4 the controller also receives a vector of the last control actions applied prior to the time .

The controller then maps this input to the next value of the drive, . Given and , we use the Runge-Kutta Milstein solver from the StochasticDiffEq.jl package [40, 41, 42] to calculate the next state according to the SDE (8). This loop between control agent and SDE solver is iterated for all time steps. We store the quantum states and the drive values at uniformly spaced time steps to be able to evaluate the loss function. Hereafter, the set is called checkpoints.

We minimize a loss function of the form [10, 12, 13, 43]


where the terms encode case-specific objectives of the optimization process, see, e.g. Ref. [12]. Their relative importance can be controlled by the weights . We choose the weights

empirically but, if necessary, they could also be tuned by means of hyperparameter optimization techniques 

[10]. To enforce a large fidelity with respect to the target state over the whole control interval, we include into the loss function the average infidelity of the checkpoints with respect to the target state ,


This form of the loss function also leads to a time-optimal performance of the controller. To focus on specific time intervals, like the last few steps for example, the sum in Eq. (13) can be straightforwardly adjusted. We will consider the target state in the following. In addition to , we include the term


in the loss function to favor smaller amplitudes of the drive . In Section 4, we will find that this term is important to suppress the collapse of the NN towards a strategy where constant maximal pulses are applied during the training.

At this stage, we have calculated a quantum trajectory and evaluated its value of the loss function. In the next step, we must update the control strategy to decrease the value of the loss function. The derivative of the loss function with respect to the parameters of the neural network provides a meaningful update rule towards a better control strategy. Thus, we need to calculate the gradient . This can be computed efficiently using sensitivity methods for SDEs discussed next.

2.4 Adjoint sensitivity methods

The loss function , defined in Eq. (12), is a scalar function which explicitly depends on the checkpoints and implicitly on the parameters of the controller. In contrast to score-function estimators [44, 45, 46], such as the REINFORCE algorithm [47], we will incorporate the physical model into the gradient computation as sketched in Fig. 2.

Automatic differentiation (AD) is a powerful tool to evaluate gradients of numeric functions at machine precision [48]. A key concept to AD is the computational graph [49], also known as Wengert trace, which is a directed acyclic graph that represents the sequence of elementary operations that a computer program applies to its input values to calculate its output values. The nodes of the computational graph are the elementary computational steps of the program called primitives. The outcome of each primitive operation, called an intermediate variable, is evaluated in the forward pass through the graph.

In forward-mode AD, one associates with each intermediate variable the value of the derivative


with respect to a parameter of interest. The derivatives are calculated together with the associated intermediate values in the forward pass, i.e. the gradient is pushed forward through the graph. This procedure must be repeated for each parameter , therefore, forward-mode AD scales poorly in computation time with increasing number of parameters .

In contrast, reverse-mode AD traverses the computation graph backwards from the loss function to the parameters by defining an adjoint process:


which is the sensitivity of the loss function with respect to changes in the intermediate variable . Reverse-mode AD is very efficient in terms of the number of input parameters because one needs just a single backward pass after the forward pass to obtain the gradient with respect to all parameters . Thus, we always implement the AD of the NN in reverse mode. However, reverse-mode AD might be very memory expensive because all intermediate variables from the forward pass need to be stored for the backward pass. Therefore, reverse-mode AD scales poorly in memory if the number of steps and parameters of the SDE solver increases. Whether forward-mode or reverse-mode AD is more efficient to calculate gradients of the loss function depends on the specific details of the control loop shown in Fig. 2.

In fact, it is possible to combine different AD methods on different parts of the computational graph as we illustrate now. In Secs. 3.3 and 4, we will use Algorithm 1, where the drive is piecewise constant, i.e. a constant value of is applied over successive time steps between two checkpoints. In this case, the memory consumption of the reverse-mode AD of the NN is moderate since the number of parameters grows only with the number of checkpoints rather than with the number of time steps. In contrast, in the SDE solver the evaluation of the time evolution between two checkpoints and only depends on the single parameter , which makes forward-mode AD very efficient [50]. Therefore, we nest both methods and use an inner forward-mode AD through the SDE solver and an outer reverse-mode AD for the remaining parts, i.e. the NN and the computation of the loss function.

Input: , , Result: for  do        compute and store checkpoints:               solve(,) for SDE (8) in end for loss() Algorithm 1 Piecewise constant Input: , , Result: compute and store checkpoints: solve(,) for SDE (8) in loss() Algorithm 2 Continuously updated

Restricting oneself to a piecewise-constant control drive, however, prevents the controller from reacting instantaneously to changes of the state. To implement a fast control loop, the controller has to be placed in the drift term of the SDE (8). Thus, the parameters entering the solver will be the NN parameters , see Algorithm 2. The continuous adjoint sensitivity method [51, 52, 53] circumvents the resulting memory issues by introducing a new primitive for the whole SDE in the backward pass of the code. This new primitive is determined by the solution of another SDE problem, the adjoint SDE. Formally, defining the adjoint process as


the adjoint SDE problem in the Itô sense satisfies the differential equation


with the initial condition


where the standard conversion factor in Eq. (18) accounts for the required transformation from the Itô to the Stratonovich sense and vice versa [54]. The gradients of the loss function with respect to are then determined by the integration of


with the initial condition . This continuous adjoint sensitivity method will be used in Section 3.2. Further details regarding the adjoint method and our Julia implementation [55, 56] within the SciML ecosystem [11, 16, 40] are discussed in C.2.

3 SDE control based on full knowledge of the state

We first investigate the scenario in which the controller maps the quantum state to a new control parameter, . From a learning perspective, this is a major simplification because the NN does not need to learn how to filter the homodyne current to determine the state . From a practical perspective of controller design, this approach assumes that there is already a filter module, which allows one to predict the state of the system from the measurement record and past control actions. We discuss in Section 2.1 and A.4 the implementation of such a filter in our case of a qubit subjected to homodyne detection. Note that if the initial state of the qubit is unknown, only a mixed state can be obtained if the detection record is too short. Nevertheless, it is always possible to obtain a pure state estimate and recover our scenario, e.g. by projecting onto the surface of the Bloch sphere. Alternatively, one can straightforwardly generalize our approach to the case of a stochastic quantum master equation describing the evolution of the system’s density matrix in the presence of homodyne detection, cf. A.5.

For all numerical experiments we fix the parameters of the physical model as and .

3.1 Hand-crafted strategy

Figure 3: Preparation of the state for the continuously monitored qubit described by the SDE (8

). (a) Mean fidelity (blue solid line) and corresponding standard deviation (shaded) for the hand-crafted strategy summarized in Algorithm 

3 as a function of the time steps at which checkpoints are stored. (b) Corresponding drive amplitudes. The black lines in (a,b) visualize the time evolution of the initial state . Stereographic projection, Eq. (21), of the applied drive with respect to states of the southern hemisphere of the Bloch sphere for (c) the hand-crafted strategy and (d) the optimized NN described in Section 3.2. Note that panel (d) is the stereographic projection of the Bloch sphere of the inset of Fig. 4(a).

The control operator in Eq. (1) induces a rotation about the -axis. Therefore, a simple but very intuitive strategy to move the state upwards in each time step is to compute the expectation value and to choose the direction of rotation depending on its sign. Specifically, if (or ), the controller should rotate (counter-) clockwise about the -axis, as summarized in Algorithm 3.

if  then
end if
Algorithm 3 Hand-crafted controller

Figure 3(a,b) shows the fidelity and the associated drive for this control scheme. Though conceptually straightforward, this hand-crafted control function is very efficient. The mean fidelity over the whole control interval is . We visualize the control strategy in Fig. 3(c), where we show the applied drive as a function of the state on the Bloch sphere. The Bloch sphere [with spherical coordinates , see also Eq. (11)] is mapped onto the tangential plane at , described by polar coordinates , using a stereographic projection


Apparently, the stereographic projection maps the south pole to and the north pole to . Throughout this paper we truncate the value of to an interval , so we plot the applied drive values on the states of the southern hemisphere, cf. Fig. 3(c,d).

3.2 Continuously updated control drive

Figure 4: Preparation of the state for the continuously monitored qubit described by the SDE (8), based on (a-c) continuously updated control parameters and (d-f) piecewise-constant control parameters. We compute the gradients of the loss function with respect to the parameters of the NN using the continuous adjoint sensitivity method in (a-c) and the nested discrete AD method in (d-f), see Section 2.4. (a,d) Smooth evolution of the loss as a function of training epochs. (b-f) Performance of the trained NN when it is applied to a set of 256 randomly sampled initial states on the Bloch sphere. The solid blue [red] line in (b,e) [(c,f)] shows the mean fidelity [control drive ] as a function of the time steps at which checkpoints are stored. The shaded regions represent the corresponding standard deviation which is large in panels (c) and (f) because the controller will choose a different sequence of controls for each quantum state. The black lines in (b-f) visualize the results for an initial state . Inset in Panel (a): Visualization of the control parameters (in color code) as a function of the current state on the Bloch sphere. The color palette indicates rotation clockwise (blue) or anti-clockwise (red) about the -axis. The yellow points indicate the trajectory starting at the initial state where the first 50 points are connected by a line. The hyperparameters are listed in Tab. 1.

Now we apply a NN as the controller. We use a controller which changes in every time step of the solver based on the current state . This implements a high-frequency feedback loop but renders discrete AD methods very inefficient since all parameters of the NN contribute to the feedback loop. Consequently, we use the continuous adjoint sensitivity methods described in Section 2.4 to calculate gradients of the loss function.

Figure 4(a) shows the smooth evolution of the loss function throughout the training of the (fully-connected) neural network, which converges to a configuration able to reach quickly fidelities with mean value of about with modest standard deviation, see Fig. 4(b). The mean fidelity over the whole control interval is . Figure 4(c) illustrates the applied drive during this time evolution. The inset of Fig. 4(a) visualizes the control strategy on the Bloch sphere. The same is depicted in Fig. 3(d) using the stereographic projection, see Eq. (21). The controlled evolution of the initial state , corresponding to the black lines in Fig. 4(b,c), is marked by the yellow points. These points show how the controller first transfers the state from the south pole to the north pole region and then stabilizes it in vicinity of the target state .

3.3 Piecewise-constant control drive

Now we reduce the control frequency and assume that the controller changes the action only every time steps. This case is crucial in many physical situations, where the control loop is not fast enough to follow high-frequency changes in the physical system. In practice this means that we evolve the state for substeps between the checkpoints with fixed value of . The resulting piecewise-constant control scheme allows us to use the discrete forward-mode adjoint sensitivity method through the SDE solver combined with an outer reverse-mode AD through the rest of the control loop, as described in Section 2.4. Although we restricted the rate at which can change, we find that the NN converges to a similar learning behavior with similarly large fidelities as in Section 3.2, see Fig. 4(e). The mean fidelity over the whole control interval is .

3.4 Comparison of the control strategies

Based on the results from a set of 256 trajectories, all three control strategies perform nearly equally well in terms of their average fidelities. The piecewise-constant controller slightly outperforms the other two approaches by having the smallest relative dispersion of the mean fidelity , which we attribute to the larger number of training epochs and the larger NN, see Table 1.

The hand-crafted strategy achieves a large average fidelity, but generates sudden jumps in the drive , as shown in Fig. 3(b). Such a drive is experimentally hardly feasible. We find that NNs with moderate depths provide a smooth mapping between the input states and the drive, see Figs. 3(c) vs. 3(d), while keeping high fidelity in the control interval. The signals generated by these protocols are experimentally more accessible, see Figs. 4(d) and (f). When necessary, specific terms can be added to in Eq. (12) to strengthen various requirements on the controller’s performance (e.g. the smoothness of the drive or bounds on the power input) as discussed in Section 2.2. This is not the case for the hand-crafted strategies where an efficient implementation of these constraints might be impossible. Furthermore, in some cases, e.g. the mountain car problem, it is not straightforward to come up with any hand-crafted strategy to start with. In contrast, the RL and P approach is easily adjustable to different physical systems.

4 SDE control based on homodyne current

Figure 5: Preparation of the state using the piecewise-constant control scheme with the discrete AD method to compute the gradients , see Section 2.4, where only the homodyne current is used as an input to the controller. (a) Sketch of the acquisition process of the input data for the NN controller, see Section 4. (b) Evolution of the loss function as a function of the training epochs. (c) and (d) Performance of the NN after 400 training epochs [in the first plateau of in panel (b)] for a set 256 trajectories: (c) Mean fidelity (blue solid lines) and corresponding standard deviation (shaded area), (d) Corresponding drive amplitudes. (e) and (f) Same quantities as in panels (c) and (d), but for the fully trained NN, i.e. after 14000 training epochs. Black lines in panels (c-f) show the results for an initial state . The hyperparameters are listed in Tab. 1 in B.
Input: , ,
for  do
       compute and store checkpoints
       solve(,) for SDE (8) in
end for
Algorithm 4 Homodyne current

In this section, we will construct a controller which directly obtains the (noisy) measurement record of the homodyne current and determines the optimal control field in each time interval . We consider a controller formed by a slightly augmented NN architecture with fully connected layers (see Fig. 6). The acquisition of input data for the NN consists of the following steps:

  1. The controller generates a piecewise-constant drive between two checkpoints at times and , as described in Section 3.3. In the time window , we integrate the SDE (8) using substeps of length as sketched in Fig. 5(a). We label these substeps with . The input for the NN to predict the next is a vector where is the length of the time interval over which we gather the data. Experimentally, can be interpreted as the detection time window of the photodetectors. According to Eqs. (4) and (5), the homodyne measurement in the th substep is


    The first term corresponds to the quadrature signal measured over the detection window , which we assume to be approximately constant on time scales of the order of . The second term is the Wiener increment during the th substep. Note that the quantity is dimensionless and solely represents the number of measured detected photons, as discussed in A.

  2. In addition, we provide the NN with the information on the last control parameters . This equips the NN with a “memory” of its own actions, such that it can take into account how a sequence of the last control drive amplitudes affected the performance. We empirically choose such that the length of the input vector corresponds to of the length of .

Given these inputs, the task for the controller is to provide an optimal mapping . Figure 5(b) shows the evolution of the loss function during the learning phase as a function of the training epochs. Note that there are two distinct plateaus. First, after a couple of hundred epochs, the NN develops a general strategy consisting of the application of a periodic drive to prevent the qubit from decaying to the ground state. Figures 5(c,d) show examples of the performance of the NN at epoch to illustrate this phase of the training. This strategy is state-independent, yet it manages to keep the mean fidelity of the simulated trajectories around over the whole control interval. Around epoch , after some transition phase where oscillates substantially, the NN starts to provide state-sensitive control fields. The final performance of the controller in Fig. 5(e,f) reaches the mean fidelity over the whole control interval. During the last 50 time steps in the control interval, which we specifically focus on during the training using the adjusted loss function term of the type (13) for these steps, the average fidelity is .

At this stage, we infer that the NN has learnt how to extract the signal from the noisy data before it attains a similar learning strategy as seen in Section 3. In contrast to the filter mentioned in Section 2.1 and A.4, this universal approach based on P will also work if some parameters of the model were a priori unknown. For example, if the detuning was unknown, one could train the controller on an ensemble of randomly chosen parameters , such that it learns how to deal with the general situation of arbitrary detuning. In this case a straightforward filtering of the signals to obtain the state is unfeasible, as this would require to solve the filter for all possible values of model parameters, cf. Eq. (55). Other options for signal filtering would be (recursive) backward filtering methods [57, 58]

or recurrent neural networks because their structure allows them to capture temporal correlations in the data 

[39]. Note that such filter methods are compatible with the two-step control approach described in Section 3.

5 Discussion

In this work we proposed a framework based on differentiable programming (P) to automatically design feedback control protocols for stochastic quantum dynamics. As a test bed, we used a qubit subjected to homodyne detection, whose dynamics is given by a stochastic Schrödinger equation. Note, however, that our method can straightforwardly be applied to different physical systems and that it can be generalized to the case of stochastic quantum master equations.

In Section 3, we demonstrated that a controller, formed by a NN, can be trained to prepare and stabilize a target state starting from an arbitrary initial state of the system when the NN obtains the full knowledge of the instantaneous state at any time. The method generates a smooth drive while maintaining a high fidelity with the target state over the whole control interval. Additional constraints on the performance of the controller can be implemented by adding further terms to the loss function. This makes the P approach more versatile than tailoring control functions manually, which requires a unique approach for every new system and can even be infeasible for large quantum systems.

The key feature of our P framework is to include an explicit model of the system’s dynamics into the computation of the gradients of the loss function with respect to the parameters of the NN, i.e. the controller. Specifically, in Section 3.2

, we employed the recently developed continuous adjoint sensitivity method for the gradient estimation through an SDE solver, which is memory efficient and, thus, allows us to study a high-frequency controller. F.S. implemented these new continuous adjoint sensitivity methods in the DiffEqSensitivity.jl package within the open-source SciML ecosystem 


In Section 4, we showed that the feedback control scheme can be based directly on providing the NN with a record of homodyne current measurements without the need to filter the information on the actual state beforehand. Therefore, the NN must first learn how to filter the input data (with poor signal-to-noise ratio) before it can predict optimal state-dependent values of the control drive. Ultimately, the trained NN was able to reach fidelities above in a target time interval for random initial states.

In future studies, the optimization of the loss function based on stochastic trajectories using adjoint sensitivity methods could be compared to alternative approaches. First, the solution to the stochastic optimal control problem in the specific case of Markovian feedback (as in Section 3) is a Hamilton-Jacobi-Bellman equation [11, 54]

. The solution of this partial differential equation, with same dimension as the original SDE, may directly give the optimal drive 

[59]. However, solving this partial differential equation with a mesh-based technique is computationally demanding and mesh-free methods, e.g. based on NNs also require a (potentially costly) training procedure [60]. Second, the expected values of the loss functions could be optimized by leveraging the Koopman expectation for direct computation of expected values from stochastic and uncertain models [61]

. Additionally, one could approach this control problem by using an SDE moment expansion to generate ordinary differential equations for the moments and apply a closure relationship

[62]. Additional research is required to ascertain the efficiency of these approaches in comparison to our method.

The results reported in this paper imply that

P is a powerful tool for the automated design of quantum control protocols. Further experimental needs, e.g. finite time lag between the measurement and the applied drive, finite-temperature effects, or imperfect homodyne detection, can be incorporated straightforwardly into this method. Thus, our work introduces a new perspective on how prior physical knowledge can be encoded into machine learning tools to construct a universal control framework. Besides the control application demonstrated here, the

P paradigm can be also adopted to solve other inverse problems such as estimating model parameters from experimental data [63, 64, 65]. An interesting perspective for future work is to extend our framework to control-assisted quantum sensing and metrology.


We would like to thank Niels Lörch, Eliska Greplova, Moritz Schauer, and Chris Rackauckas for helpful discussions. We acknowledge financial support from the Swiss National Science Foundation (SNSF) and the NCCR Quantum Science and Technology. Parts of the computations were performed at sciCORE ( scientific computing core facility at University of Basel.

Appendix A Continuous homodyne detection

a.1 Quantum trajectories: monitoring the spontaneous emission of a qubit

Consider a two-level atom interacting with a free photon. In the case of discrete photon modes, the interaction is given by the usual Hamiltonian with and being the bosonic creation and annihilation operators, respectively, fulfilling the commutation relation , and being a coupling constant with dimension of energy. In contrast to the notation used in the main text, in this appendix, we will mark operators by a hat to distinguish them from scalars. This Hamiltonian results from the dipole interaction written in the form and application of the rotating wave approximation. However, when addressing a decay to the continuum, the photons are represented by quantum field operators with commutation relation , thus having units of square root of energy (or, equivalently, inverse square root of time). The interaction Hamiltonian takes the form


where is the decay rate, again, with dimension of energy like .

The field propagator in the interaction picture can be formally written as , where is the time-ordering operator. Expanding for short times , we find


Here, the second-order terms give an important contribution. Higher-order terms only contribute as and can be safely neglected, and we also assume that is short on the timescale of internal qubit dynamics. Assuming the free field is originally in the vacuum state we can write


is a properly normalized state of the field, and denotes the excited qubit state.

The simplest way to monitor the qubit state

is then to continuously measure the intensity of the outgoing field. This defines a Poissonian process, as the probability to detect a photon in a time interval



and the probability of no detection is respectively. Thus, we obtain the standard unraveling for the spontaneous decay process


a.2 Weak homodyning

An alternative to photon-counting detection is to mix the outgoing light with coherent light from a local-oscillator laser on a beamsplitter, and to measure the intensities at both output ports of the beamsplitter. As a warm-up we first consider the situation where the intensity of the local-oscillator field is comparable to the emitted light. It is given by the expansion of a coherent state using only vacuum and single-photon states


where the single-photon state of the -mode is defined identically to the mode . Mixing the two states on a 50:50 beamsplitter gives


Introducing the photocurrent variable , which measures the intensity difference of the two outputs, we see that there are three possibilities, . The respective probabilities of the three outcomes are given by the norms of the three different terms in Eq. (29),


The three possible states of the system after the measurement are proportional to


By setting (i.e. no mixing with a local-oscillator field), we recover the previous case where the qubit simply has a chance to decay


and . The only difference is that the emitted photon is randomly split between the two detectors, with no information about the state of the qubit contained in the sign of .

a.3 Strong homodyning

Finally, we consider the situation treated in the main text. The standard homodyne measurement requires mixing the signal with a strong coherent local-oscillator field


Recall that here the Fock states are defined with the creation operator , describing the field impinging on the lower port of the beamsplitter during the interval , see Fig. 1(a). For the modes and to match, the local-oscillator laser has to be in resonance with the drive laser because, in the lab frame, the field picks up the phase of rotating at the frequency of the drive laser.

When the modes match, the 50:50 beamsplitter transformation takes the form


In particular, this implies , i.e. a coherent state of the mode is split into two coherent states of half the intensity when mixed with the vacuum state of the mode on a 50:50 beamsplitter. Using the last two expressions, we can write down the overall state of the qubit, the spontaneously emitted radiation and the homodyne laser field after the beamsplitter. To do so, let us compute


which is the operator mapping the state of the qubit at time to the state of the qubit and the two detected modes at time . We can further simplify this expression by noting that


where is the photon number operator. Labeling the photon number operator for the mode as , we can then rewrite Eq. (35) in a very intuitive form


Again, directly gives us the state of the qubit and the detected modes, while


gives the conditional state of the qubit together with the probability to detect and photons respectively. In the measurement process, superpositions of states with different photon numbers collapse such that the state after the post-measurement is a classical statistical mixture of the possible outcomes,


One easily sees from Eq. (37) that , i.e. only the difference of the photon counts reveals information on the state of the qubit, while the sum only describes the shot noise of the local oscillator. It is therefore sufficient to keep the difference and discard the sum, which defines the state


In a slight abuse of notation we can formally introduce the joint quantum state of the qubit and the count difference as with


, which gives rise to the same post-measurement state. Here,


where is the modified Bessel function of the first kind.

Next, we use that the distribution of on the right-hand side of Eq. (42

) is well approximated by the normal distribution

in the limit . Therefore, we can replace the state of an integer-valued in Eq. (41) with


of a continuously valued and normally distributed , and we also introduce a continuously valued operator . In the regime of interest the actual value of does not play a role. To get rid of it, recall that is the intensity of the local-oscillator laser in the time window , so it is more convenient to work with the laser power , which is independent of the choice of the time window . Then, we can rescale the photon count difference to


to get rid of the laser intensity. Equation (41) now reads


where the initial state of the rescaled can be obtained form Eq. (43) and satisfies


This also allows us to define the homodyne current of the main text , as the photon count difference per time. At this point note that Eq. (45) already implies Eq. (3). Let us now show how the measured value of is distributed. The marginal distribution of after the measurement reads