A hypothesis in neuroscience claims that several neurological diseases, such as Parkinson’s111Parkinson’s disease is the second most common neurodegenerative disorder after Alzheimer’s. It affects approximately seven million people globally and 1–2 per 1000 of the population at any time. Its prevalence is increasing with age affecting 1% of the population above 60 years ., originate from the networks of pathologically synchronous neurons in the brain. These malicious ensembles of neurons can collectively generate signals in a synchronized manner, debatably leading to the “macro” symptoms such as tremor, rigidity, bradykinesia, postural instability, and other movement abnormalities [16, 11, 8]. To overcome these collective signals (or ’modes’) in advanced stages of a disease, doctors often resort to high-frequency open-loop pulse stimulation of certain brain regions via implanted micro-electrodes – a technology called deep brain stimulation (DBS) [2, 1, 18].
Today, DBS systems have no feedback algorithms embedded into their circuitry, with doctors simply adjusting the electrode currents according to the symptomatic observations . Although the new generations of DBS promise to provide the feedback functionality, the difficulty of conducting experimentation with live human brains still makes it hard to find the best stimulation algorithm experimentally. Moreover, a large network of interacting neurons is a complex non-linear system, which, considering limitations of the hardware and the unknown biological pathway of the illness itself, calls for additional modeling effort.
As such, there appeared a demand for synthetic physical modeling to mimic the collective signaling patterns of neuronal ensembles [12, 9, 10]. The aim of several open-loop  and of the more recent closed-loop feedback-based control approaches [25, 26, 23, 19]
is to desynchronize the large network of neurons, without suppressing the very oscillatory activity of individual neurons. In such physical synthetic models, the output of neurons is typically described either by several sets of ordinary differential equations (ODE), by partial differential equations (PDE), or by a map-based definition.
At the same time, the explosive development of RL 
in recent years has offered a new data-driven methodology that could operate completely unaware of the physical world or of the underlying neuronal model. The Machine Learning (ML) techniques are now extensively used for analysis and prediction of complex systems[13, 21, 34, 24, 7, 32, 33] and it seems natural to propose this framework for the purposes of control in deep brain stimulation as well. RL is often difficult to apply to real-world applications because of the necessary exploration, which implies a large number of trial and errors, potentially with dramatic consequences, before being able to improve the policy. Nevertheless, DBS is a setting where those drawbacks are absent. Its action space can easily be constrained to ensure that the agent’s actions are harmless to the patient, and, depending on the DBS device, the frequency of decision making ranges from 60 Hz to 150 kHz , meaning that 1 million transitions may be collected in less than 2 to 5 hours on a single patient.
In this paper, we report creation of a convenient gym environment  for developing and comparing the interaction of RL agents with several types of neuronal models developed in computational neuroscience and physics. The ODEs or descriptor maps are wrapped into the framework as individual environments, allowing to switch easily between environments, to use various RL models, and potentially multiple agents. Using this framework, we demonstrate successful suppression of the collective mode in three different types of oscillatory ensembles, using various policy-based approaches , and show the first demonstration of synchrony suppression using a pair of RL agents trained using PPO. The suppression workflow proposed here is universal and could be used to create benchmarks among different physical models, to create different control algorithms, and to pave the way towards the clinical realization of deep brain stimulation via RL. The policy gradient algorithm PPO used below can provide a robust data-driven control, agnostic of the neuronal model and promises pathways for integration with current clinical DBS systems.
2 The model
In this work, we train RL agents with proximal policy optimization [27, PPO] (see diagram of Fig. 1). Classically, training involves five main blocks for the control problem: Environment, Action, State, Reward, and Agent. The flow works as follows: the agent observes a state, then takes an action, next, the environment responds with a reward signal and the agent observes the new state of the environment, which closes the loop of interaction. We now describe each block, its characteristics, and its function in detail.
Fig. 1 conceptually shows which components contribute to the model of our RL “environment”. Each configuration, such as the model and the number of neurons in an inter-connected ensemble, type of their links, strength and the model of connectivity within the “brain”, can be tuned to simulate certain pathological signalling patterns. Well studied in the physical sciences, such models of pathological brain networks include (ranging from simple to complex): a globally coupled ensemble, interacting groups of excitatory and inhibitory neurons, including spatially-structured ones, detailed models of involved brain regions, and other more complex models.
Within these models, individual neurons could be described by (from simple to complex): map-based models (e.g. Rulkov), integrate-and-fire models, conductance-based models (simple 2D models of spiking dynamics, e.g. Bonhoeffer-van der Pol or Morris-Lecar; 3D models of spiking/bursting (Hindmarsh-Rose), high-dimensional biophysically motivated models (Hodgkin-Huxley), multi-compartment models, distributed-parameter models, and many others. Connections between such individual neurons include simple coupling, excitatory, and inhibitory synaptic connections, etc.
We refer readers to Ref. for the overview of the possible systems mentioned above. Herein, however, we will consider two particularly popular neuronal models [4, 15] with the sole goal of mimicking various realistic signalling patterns of collective neuronal activity qualitatively: namely, regular, chaotic, and bursting signalling regimes.
Bonhoeffer–van der Pol oscillators. As our first basic model, we consider a population of regularly oscillating neurons, known as Bonhoeffer–van der Pol or FitzHugh–Nagumo oscillators, globally coupled via the mean field X. See Fig. 4(a) for an illustration of its oscillatory behavior (for ). The equations governing the model are:
where is the index of the neuron, where is the mean field, and where is the action. The neurons are not identical: the currents
are drawn from a Gaussian distribution with a mean of
and a standard deviation of. The strength of the global coupling is determined by .
This model has two properties that make the control problem non-trivial. First, for very low values of the coupling , the mean fields are , , i.e. the fixed point to which the system should converge is not the origin and is a priori unknown. Second, the model exhibits chaotic collective dynamics for certain values of (Chaotic model, see the broadened trajectory in Fig. 2(b)).
Bursting Hindmarsh–Rose neuronal model. The other type of oscillators considered is an ensemble of Hindmarsh-Rose  neurons in a bursting regime:
2.2 Action and State
The action and the state are respectively the input to and the resulting response output from the environment, produced with a sampling rate .
We consider idealistic -shaped pulse actions, with a constant interval between each pulse and an amplitude limited by a value . The action is tuned at each time step, with and , . For treatment of more realistic pulses, encountered in the DBS systems, see . For convenience, below we omit the index for the discrete time . Naturally, smaller values of are commonly sought after in biological applications, such as DBS for Parkinson’s disease, as the system should be as little invasive as possible. The total “energy” supplied to the ensemble from an external source, , is thus another measure that one aims to minimize in practice. The action affects all neurons similarly, its precise effect is represented by the letter A in Eqns. (1) and (2).
The state is based on the current value of the mean field, , extracted using a Runge–Kutta-based solver for Eqns. (1) or (2). The solver is implemented in the gym environment we developed. This provides feedback from the system, after application of action . To account for the oscillatory behavior of the model, the state consists of the most recent values of .
Some of our experiments will introduce some noise in the action: the executed action is the one selected by the agent plus a white noise term. Similarly, to mimic real-world conditions, we will also introduce some noise at the state perception level.
For a given action and a given observation at time , we propose the following class of reward functions for synchrony suppression tasks:
where the first term rewards convergence of the system to an average of the mean field over previous values, , and the second term favors smaller values of the action . The coefficient allows to introduce a bias towards a desired outcome (e.g., a more accurate convergence to a particular value of the mean field vs. a smaller amplitude of the suppression pulse).
We trained our RL agent using the Proximal Policy Optimization algorithm [27, PPO]. We briefly describe the method below. As usual in RL, we wish to maximize the expected return, defined as the discounted sum of rewards:
where is the expectation over visited states following a given policy , is a discount factor that controls the trade-off between long-term and immediate rewards (set to 0.99 in our experiments), and is the reward received at time , specified by Eq. (3).
The policy is parameterized by a neural network with parameters
, encoding the probability of taking actionwhen the current state is :
is optimized using PPO to maximize the expected return given by Eq. (4). In our experiments, we used two-hidden layers MLPs with 64 neurons, trained using the Stable Baselines library , with the default parameters for PPO. Generally speaking, the nonlinear nature of Eqns. (1) and (2) will make the feedback highly sensitive to the amplitude of the input. To handle this sensitivity, we opted for the use of two agents trained for different values of neuronal spiking activities. Given the small size of the networks, training was performed on CPU 222The code is available at https://github.com/cviaai/RL-DBS/. The training reward (Eq. 4) and PPO loss are plotted for the Regular and Bursting environments in Figure 3.
3.1 Synchrony suppression in the environments
We first test our agent on an ensemble of self-sustained Bonhoeffer-van der Pol neurons oscillating around a non-zero equilibrium point and globally coupled with (Fig. 4(a)). At , we initiate synchrony suppression by sending action pulses according to our trained PPO agent. This confirms that the reward function described by Eq. (3) for leads to convergence to the natural equilibrium point, with a non-zero average . At , i.e. when suppression is activated, the action amplitudes spike slightly, for about 200 time steps, and then quickly reduce to . As a point of comparison, we study below the impact of constant actions. We wish to emphasize that each individual neuron maintains its output; it is the desynchronization of the entire ensemble that causes the mean field to decrease.
RL can also suppress synchronization in the Bonhoeffer-van der Pol ensemble when the collective mode is chaotic (, Fig. 4(b)). Although the oscillatory dynamics is now irregular, our PPO agent performs here similarly to the non-chaotic regimes, with , the same order of magnitude for the required action amplitudes (), and a total stimuli energy required for suppression only larger than in the regular regime.
The bursting output of Hindmarsh-Rose neurons, Eq.(2) for and can also be suppressed (Fig. 4(c)). The bursting pattern and the high synchrony of the oscillators occurs at the beginning until a series of action pulses is applied at . Interestingly, immediately after the stimuli are applied, the mean field spikes above its anterior value, which portrays a transient regime where the system undergoes a temporal increase of synchrony. As the PPO agent continues to adapt to the current state, the synchrony of oscillations vanishes, at which point (around ) the mean field converges to the special point .
The convergence of the ensemble to the special point is best monitored in the phase space , shown in Fig. 2(c). As the agent acts on the collective oscillation, the trajectories bend towards the fixed point. Broadening of the trajectory in the chaotic and in the bursting ensembles have particular signatures indicating intricate signalling regimes.
3.2 Multiple PPO agents
Dynamical nonlinear systems containing large populations of coupled neurons are especially hard to control because of their very different responses to weak and strong stimuli. This is where another modern direction of RL, entailing multiple agents, could be beneficial for the task at hand. We propose to use multiple auxiliary PPO agents, trained on various neuronal patterns, e.g. during the transient ones occurring immediately after ( in our experiments) or during the suppressed regime. As such, the primary agent would “see” only the strong stimuli, whereas the auxiliary agent would “see” only the signal that has already been partially suppressed and is, therefore, weaker. Figure 5 demonstrates that when this secondary model overtakes the control at , it further reduces the amplitude of the mean field and desynchronizes the ensemble beyond the performance of a single model.
Indeed, the response to a stimulus is determined by the corresponding phase response curve that does not depend on the stimuli amplitude only in the limit of an infinitely small action ; with a finite action, the response will always be pronounced as dependent on the amplitude of the input. Long-term, one could envision a library of such ANNs pre-trained at different amplitude levels, at different values of sampling rate , and at different pulse skipping rates – all to be embedded into the software controlling a DBS device. This promises a personalized approach to the patients with different signaling patterns and at different progression stages of the disease, regardless of its etiology. Characterization of the full nonlinear response of these strongly interconnected ensembles and engaging three or more such agents will be studied in future work. Deep architectures, alternatively, are also expected to fit the nonlinear response curve better than the small networks we used in our study, albeit with the associated lack of physical interpretability.
3.3 Quantitative analysis
We now proceed to characterize the RL-based suppression as a function of various parameters of the system and of the stimulation. The major factor that determines the amplitude of the collective oscillation is the coupling strength , which we thoroughly varied. The results for the Bonhoeffer–van der Pol ensemble, Eq. (1), are shown in Fig. 6. For the unperturbed system, the dependence of the standard deviation of the collective mode, std, on the coupling strength follows a threshold-like curve, Fig. 6(a). The value for std was taken when the PPO agent reached the best possible level of synchrony suppression. As can be seen in Fig. 4(b), this final “steady stage” of the control is achieved soon after the stimuli application is switched on, at about , and is preserved until the control is switched off. The corresponding value for the Hindmarsh-Rose model is , see Fig. 4(c).
In the suppressed steady state the mean field continues to fluctuate due to the final size of the ensemble ( ). The amplitude of the action also fluctuates but the pulse sequence now has a uniform variance and a diminished range of amplitudes required to keep the control active. We speculate that there is an additional source of fluctuations emerging from the probabilistic uncertainty inherent to the ANNs. Despite not reaching the theoretical limit, the RL algorithm is actually more pertinent to the real experimental data because this uncertainty can indirectly train the model to accommodate noisy signals.
The extent of the mean field suppression can be quantified by the following suppression coefficient
where (resp. ) represents the mean field values before (resp. after) the stimuli application. The fluctuations of the suppressed field do not depend much on , but the amplitude of the collective mode of the unperturbed field grows with , see Fig. 6(a). The suppression coefficient is maximal for strongly synchronized systems and achieves in that case.
3.3.1 Study of skipping pulses
Next, of great importance for future RL-based DBS devices, is the minimization of total energy sent via stimuli to the brain. We analyzed the dependence of on a skip parameter , defined as follows. We trained a PPO agent as though to send a stimuli every time step , but only allowed it to send pulses to the environment every time steps. The rationale behind this test is to look for the optimal frequency of action pulses in order to minimize the energy of the perturbation sent to the system while still suppressing synchrony. The resulting fall in the suppression efficiency shown in Fig. 6(b) can be deemed as a classic example of trade-off when e.g. a limited stimuli energy must be used or an incomplete suppression is desired. Figure 4(c) shows the time dependence of the mean field immediately after the stimuli are initiated at for the case of and and for different values of . As we can see, for , suppression is still rather efficient and comes with a smaller total energy supplied to the system (circle diameters in Fig. 6(b)).
3.3.2 Study of response to constant stimuli
A standard test of an RL environment is to explore the efficiency of constant stimulation (here, de-synchronization is no longer the task). To study such a response we use our simulator for the Bonhoeffer–van der Pol model and predict the evolution of for constant values of ranging from to with a step size of . Outside that range, the effects are simply more pronounced, and less desirable. Results are shown in Fig. 7(a). For relatively negative values of the action, we do observe suppression of the oscillations, which implies that the very individual neurons cease to oscillate. In these stable cases, the mean value of is around and the applied pulses are larger than in absolute value. In contrast, using our trained agent, we achieve the same level of suppression with a mean of and an average action smaller than in absolute values (with a standard deviation of ): the RL agent is far less invasive, and sends far less energy to the system. Finally, for constant actions that are smaller than in absolute value, we see that suppression is very limited.
3.3.3 Study of Action-State noise stability
Another essential condition for the deployment of RL agents to real-world scenarios is their stability to noise. Observations of the actual state will never be accurate, nor will the stimuli applied to the brain be exactly the one required by the agent. For these reasons, we ran suppression experiments in a noisy setting. For the three types of environments (regular Bonhoeffer–van der Pol, chaotic Bonhoeffer–van der Pol, and bursting Hindmarsh–Rose), we added some white noise to the state observed by the RL agent (at each time step, drawn independently from ). Similarly, the action performed in the environment was the action selected by the agent with some additive noise (drawn from . Fig. 7 shows the suppression coefficient at the end of training for various values of and , each point corresponding to an average over seeds. We first observe that the state noise has a limited effect on the efficiency of the trained agent.
Noisy actions have a far more significant impact on the efficiency of the agent. For the Bonhoeffer–van der Pol environments considered in this section, the mean action is approximately (depending on and the randomness of the run). Applying a noise of the same order of magnitude, the agent reaches a similar suppression coefficient . For larger noise levels, we observe a steady degradation in performance.
The same conclusions can be drawn for the bursting model, although noise levels need to be larger to observe a decreased performance. Indeed, in the Hindmarsh–Rose environment, the mean action is with a standard deviation of , it is thus more robust to small perturbations of the applied stimuli. Overall, these experiments allow the definitions of thresholds below which stability to noise is guaranteed.
4 Discussion and State-of-the-art
The speed of suppression and the residual synchrony in the time series curves in Fig. 6(c) portray the trade-off between supplied energy and the extent of residual synchronization mentioned above. The fact that the five-fold reduction of the stimuli frequency still allows achieving a satisfactory degree of suppression naturally suggests the following pathway for future work. We speculate that the most efficient application of stimuli should actually be non-uniform pulse trains in time and that the frequency of it should be adapted according to the patient’s symptoms.
However, as mentioned above, the cause-effect relationship between the synchrony and the pathology is still an unproven hypothesis in neurobiology and in computational neuroscience. Nonetheless, machine learning methods could be proposed for the optimization of the stimulation parameters regardless of the etiology of the disease, and – as we studied on the synthetic data – RL could be considered as the ideal candidate for integration with a real DBS device. Pre-clinical approbation  could be a logical continuation to test both the cause-effect hypothesis and to optimize device settings experimentally prior to proceeding to human studies.
But perhaps more importantly, the community needs to standardize and honestly compare various control algorithms apple-to-apple - something that is not possible to accomplish as of today. As of now, the schemes proposed in the literature exploited delayed or non-delayed, linear and nonlinear control loops, continuous or pulsatile stimulation, specialized pulses that preserve total charge, adaptive tuning of the feedback parameters. And recently, ML-based approaches started to appear. Having different input parameters, different underlying models’ assumptions, and different criteria to define successful suppression, the current state of affairs suggests that our gym environment holds the potential to become particularly useful and to provide a unified platform to evaluate various methods.
In our work, we supply a potentially large and diverse collection of RL environments within a single framework. Pulsatile, continuous, or purposefully optimized agents could interact with these environments effectively enabling the parameter search for a particular configuration of a DBS device.
Having introduced a clear metric (Eq. 6 and that of a total supplied energy ) as a criterion for efficient suppression, and having characterized basic collective behavior seen in neuronal ensembles (regular, chaotic, bursting), we aspire to enable a “gym research” effort that is easy to set up and use. The proposed framework should make it easy to reproduce published results across physics and computer science publications and to compare results from different papers. Clear metrics and synthetic data can also become a sound platform for various AI competitions.
To conclude, we presented a new RL gym framework for the synchrony suppression task in a strongly interconnected oscillatory network that is believed to be the cause of tremor and other systemic neurological symptoms. Considering limit-cycle Bonhoeffer-van der Pol oscillators and Hindmarsh-Rose neurons as the test models, we demonstrated successful synchrony suppression for regular, chaotic, and bursting collective oscillation, without having knowledge about the ensemble model.
An important advantage of the RL-based suppression method is that it is data-driven and universal. It could be readily implemented in an experimental setting if one takes the measuring/stimulating equipment characteristics and limitations into account. The suppression workflow proposed in the diagram of Fig. 1 is universal and can be exploited for a variety of practical tasks. We find Reinforced Learning to be an ideal candidate for clinical approbation as a “smart” control algorithm to be embedded into deep brain stimulation devices.
-  (2009) Deep brain stimulation of the subthalamic nucleus for the treatment of Parkinson’s disease. Lancet Neurol. 8 (1), pp. 67–81. Cited by: §1.
-  (1991) Long-term suppression of tremor by chronic stimulation of the ventral intermediate thalamic nucleus. Lancet 337 (), pp. 403–406. Cited by: §1.
-  (2012) Animal models of Parkinson’s disease. The FEBS Journal 279 (7), pp. 1156–1166. External Links: Cited by: §4.
-  (1948-09) Activation of passive iron as a model for the extraction of nerve. The Journal of General Physiology 32 (1), pp. 69–91. External Links: Cited by: §2.1.
-  (2016) OpenAI gym. External Links: Cited by: §1.
-  (2006) Phase response curve. Scholarpedia 1 (12), pp. 1332. Note: revision #27615 External Links: Cited by: §3.2.
Inferring the dynamics of oscillatory systems using recurrent neural networks. Chaos 29 (), pp. 063128. Cited by: §1.
-  (2010) Deep brain stimulation mechanisms: beyond the concept of local functional inhibition. European Journal of Neuroscience 32 (7), pp. 1080–1091. Cited by: §1.
-  (2001) Preface to volume 4 neuro-informatics and neural modelling. In Neuro-Informatics and Neural Modelling, F. Moss and S. Gielen (Eds.), Handbook of Biological Physics, Vol. 4, pp. ix – xi. External Links: Cited by: §1.
-  (2001) Chapter 21. In Neuro-Informatics and Neural Modelling, F. Moss and S. Gielen (Eds.), Handbook of Biological Physics, Vol. 4, pp. 887 – 968. External Links: Cited by: §1.
-  (2009) Optical deconstruction of Parkinsonian neural circuitry. Science 324 (5925), pp. 354–359. Cited by: §1.
-  (1992-02) Synchronization and computation in a chaotic neural network. Phys. Rev. Lett. 68, pp. 718–721. External Links: Cited by: §1.
-  (2018) Data-driven modeling and prediction of complex spatio-temporal dynamics in excitable media. Frontiers in Applied Mathematics and Statistics 4, pp. 60. Cited by: §1.
-  (2018) Stable baselines. GitHub. Note: https://github.com/hill-a/stable-baselines Cited by: §2.4.
-  (1984) A model for neuronal bursting using three coupled first order differential equations. Proc. Roy. Soc. London Ser. B 221, pp. 87. Cited by: §2.1, §2.1.
-  (2008-04-01) Mechanisms and targets of deep brain stimulation in movement disorders. Neurotherapeutics 5 (2), pp. 294–308. Cited by: §1.
-  (2019) Reinforcement learning for suppression of collective activity in oscillatory ensembles. (1909.12154). Cited by: §2.2.
-  (2017) Innovations in deep brain stimulation methodology. Mov. Disorders. 32 (1), pp. 11. Cited by: §1, §1.
-  (2013-04) Oscillation suppression and synchronization: frequencies determine the role of control with time delays. EPL (Europhysics Letters) 102 (2), pp. 20003. Cited by: §1.
-  (2002-01) Spiking neuron models: single neurons, populations, plasticity. External Links: Cited by: §2.1.
-  (2018) Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Phys. Rev. Lett. 120, pp. 024102. Cited by: §1.
-  (1999) Finite-size effects in a population of interacting oscillators. Phys. Rev. E 59 (2), pp. 1633–1636. Cited by: §3.3.
-  (2005) Effective desynchronization by nonlinear delayed feedback. Phys. Rev. Lett. 94 (), pp. 164102. Cited by: §1.
-  (2018) Sparse identification of nonlinear dynamics for rapid model recovery. Chaos 28, pp. 063116. Cited by: §1.
-  (2004) Controlling synchrony in ensemble of globally coupled oscillators. PRL 92 (), pp. 114102. Cited by: §1.
-  (2004) Delayed feedback control of collective synchrony: An approach to suppression of pathological brain rhythms. Phys. Rev. E. 70, pp. 041904. Cited by: §1.
-  (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: Cited by: §2.4, §2.
-  (2018) Frequency-dependent effects of subthalamic deep brain stimulation on motor symptoms in Parkinson’s disease: a meta-analysis of controlled trials. Scientific reports 8 (1), pp. 1–9. Cited by: §1.
-  (2018) Reinforcement learning: an introduction. MIT Press. External Links: Cited by: §1, §1.
-  (2001) Effective desynchronization by means of double-pulse phase resetting. Europhys Lett. 53 (1), pp. 15–21. Cited by: §1.
-  (2017-08-01) Epidemiology of Parkinson’s disease. Journal of Neural Transmission 124 (8), pp. 901–905. External Links: Cited by: footnote 1.
-  (2019) Synchronization of chaotic systems and their machine-learning models. Phys. Rev. E 99, pp. 042203. Cited by: §1.
-  (2019) Deep learning algorithm for data-driven simulation of noisy dynamical system. J. of Comp. Physics 376 (), pp. 1212–1231. Cited by: §1.
-  (2018) Observing spatio-temporal dynamics of excitable media using reservoir computing. Chaos 28 (4), pp. 043118. Cited by: §1.