
Dynamically writing coupled memories using a reinforcement learning agent, meeting physical bounds

by   Théo Jules, et al.

Traditional memory writing operations proceed one bit at a time, where e.g. an individual magnetic domain is force-flipped by a localized external field. One way to increase material storage capacity would be to write several bits at a time in the bulk of the material. However, the manipulation of bits is commonly done through quasi-static operations. While simple to model, this method is known to reduce memory capacity. In this paper, we demonstrate how a reinforcement learning agent can exploit the dynamical response of a simple multi-bit mechanical system to restore its memory to full capacity. To do so, we introduce a model framework consisting of a chain of bi-stable springs, which is manipulated on one end by the external action of the agent. We show that the agent manages to learn how to reach all available states for three springs, even though some states are not reachable through adiabatic manipulation, and that both the training speed and convergence within physical parameter space are improved using transfer learning techniques. Interestingly, the agent also points to an optimal design of the system in terms of writing time. In fact, it appears to learn how to take advantage of the underlying physics: the control time exhibits a non-monotonic dependence on the internal dissipation, reaching a minimum at a cross-over shown to verify a mechanically motivated scaling relation.




At first sight, memory seems like a fragile property of carefully crafted devices. However, upon closer inspection, various forms of information retention are present in a wide array of disordered systems [keim2019memory]. Governed by a very rugged and complex energy landscape, such systems usually display history-dependent dynamical [kovacs1963glass, prados2014kovacs, jules2020plasticity] and static [Matan2002, Diani2009] responses. Fundamentally, even a single hysteresis cycle can be seen as an embryonic form of memory, as theorized by the Preisach model [preisach1935magnetische, mayergoyz1986mathematical]. Combining these cycles into larger structures generates multi-stable systems showing memory capacity, such as the well-known return point memory where the system can remember and return to previously visited states [Barker1983a, Deutsch2004, mungan2019structure, keim2020global, Keim2021a]. This basic model for memory has drawn recent interest in various areas of physics including spin ice [libal2012hysteresis], cellular automata [goicoechea1994hysteresis], crumpled sheets [bense2021complex], glassy [lindeman2021multiple], plastic [puglisi2002mechanism, regev2021topology] and granular [keim2020global, Keim2021a] systems, and origami bellows [yasuda2017origami, jules2022delicate]. Interestingly, this property indicates that information can be written to and read from the underlying system, making the internal memory mechanism adequate for information storage. However, the Preisach model is grounded in a quasi-static framework, valid for adiabatic transformations. As a result, the specific characteristics of each hysteretic cycle [terzi2020state] or the addition of internal coupling [VanHecke2021a, bense2021complex, jules2022delicate] can considerably shrink the set of reachable states and thus the storage capacity of the device.
In this paper, we show that a controller, in the form of a reinforcement learning agent, can take advantage of the dynamics to reach all stable states, including adiabatically inaccessible states, effectively restoring the memory of the system to its full capacity. We base our study on a model framework akin to that introduced in [puglisi2002rate, jules2022delicate], i.e., a chain of bi-stable springs with three coupled units and internal dissipation. After successfully training the agent on a specified set of physical parameters, we demonstrate that transfer learning [pan2009survey, taylor2009transfer] accelerates the training on different parameters and extends the region of parameter space that leads to learning convergence. Finally, we investigate the change of the dynamical protocol proposed by the trained control process for a single transition between two states as a function of the dissipation’s amplitude. The transition duration presents a minimum for a critical value of the dissipation that appears to verify a physically motivated scaling relation, pointing to the fact that the agent learns how to harness the physics of the system to its advantage.

Model for a chain of bi-stable springs

Figure 1: Model for a chain of three coupled bi-stable spring-mass units. a) Schematic view of the model. The first unit is attached to a fixed wall and an external force is applied to the last one. b) Deformation of all three bi-stable springs under external load with the specific choice of disorder (0.050, 0.050, 0.040, 0.020, 0.030, 0.045) m. The switching fields are defined through Eq. (4). c) Transition graph where the nodes represent the stable configurations and the arrows the quasi-statically achievable transitions between them. With this choice of disorder, two of the states are "Garden of Eden" states (see main text).

We consider a one-dimensional multi-stable mechanical system composed of N identical masses m connected by bi-stable springs in series, as shown in Fig. 1 a). For each spring i, we set a reference length l_i such that the deformation of the spring reads

e_i(t) = x_i(t) − x_{i−1}(t) − l_i,    (1)

where x_i(t) is the position of the i-th mass at time t. The tunable bi-stability of each spring is achieved through a generic quartic potential. We thus obtain the cubic force equations

F_i(e_i) = −k (e_i − e_i^0)(e_i − e_i^1)(e_i − e_i^b),    (2)

where k is the stiffness of the spring, e_i^0 and e_i^1 correspond to the metastable configurations 0 and 1, and e_i^b corresponds to an unstable equilibrium. Other works on similar systems also considered trilinear forms [puglisi2002rate, puglisi2002mechanism]. Still, our choice generates a smooth mechanical response and keeps its amplitude moderate for low deformation. This behavior plays a crucial role in the scaling analysis that we will present later. The first spring of the chain is attached to a fixed wall, a condition that imposes x_0 = 0, and we apply an external force F_ext to the last mass. Finally, we consider that the system is bathed in an environmental fluid, resulting in a viscous force being exerted on each mass such that

F_i^v = −ν ẋ_i,    (3)

where the dot indicates the derivative with respect to time and ν is the viscous coefficient.
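As an illustration, the cubic force law can be checked numerically. The well and barrier deformations below (e0 = 0, e1 = 0.10 m, eb = 0.04 m) and the stiffness are illustrative values of ours, not the paper's exact disorder:

```python
def cubic_force(e, k=88.7, e0=0.0, e1=0.10, eb=0.04):
    """Cubic restoring force derived from a quartic potential; it vanishes
    at the two stable deformations e0, e1 and at the unstable barrier eb."""
    return -k * (e - e0) * (e - e1) * (e - eb)

# The force vanishes at the three equilibria...
for e in (0.0, 0.10, 0.04):
    assert abs(cubic_force(e)) < 1e-12

# ...and pushes back toward e0 and e1, but away from the barrier eb.
eps = 1e-3
assert cubic_force(0.0 + eps) < 0 < cubic_force(0.10 - eps)
assert cubic_force(0.04 + eps) > 0
```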

Let us first consider the case where the system is a single bi-stable spring attached to a mass. The solutions for mechanical equilibrium, schematized in Fig. 1 b), display two critical amplitudes F^± such that

F^± = −F(e^±),    (4)

where e^± are both solutions of

dF/de (e^±) = 0.    (5)

Note that F^− is negative and F^+ is positive since we defined both e^b and e^1 to be positive. These two forces are essential to describe the stability of the system. Indeed, we observe two branches of stable configurations, which we call 0 (for e < e^b) and 1 (for e > e^b). When F_ext remains in the range [F^−, F^+], both branches present a solution for the mechanical equilibrium: the system has two stable states. However, as soon as F_ext gets beyond this range, either the branch 0 (for F_ext > F^+) or the branch 1 (for F_ext < F^−) vanishes.

These dynamical properties lead to a hysteresis cycle that can be described with a simple experiment. We start with the spring at rest and F_ext = 0, i.e. e = e^0, and slowly pull on its free end, increasing F_ext. As a response, the system slightly stretches and e increases. This continues until F_ext = F^+, where the branch 0 disappears. At this point, if we continue to increase F_ext, the system necessarily jumps to the branch 1, with e > e^b, in order to maintain mechanical equilibrium. If we now decrease F_ext, e decreases accordingly until F_ext = F^−, where the system has to jump back to the original branch 0. If we only consider the stability branches, the described hysteresis cycle corresponds to a so-called hysteron, the basic block of the Preisach model [Preisach1935]. In a Preisach model, a set of independent hysterons is actuated through an external field. Each hysteron has two states, 0 and 1, and two switching fields, F^+ and F^−, that characterize when the hysteron switches between states. As long as the switching fields are unique and the hysterons are independent, i.e. the switching fields do not change with the state of the system, the Preisach model is able to predict the possible transitions between the configurations. Our system, bi-stable spring/mass units in series under quasi-static actuation, fulfills the local mechanical equilibrium and independence assumptions.
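The hysteresis cycle described above can be reproduced with a short quasi-static sweep: the spring is relaxed with overdamped dynamics at each slowly varied load, so it follows a stability branch until that branch disappears. The parameter values are illustrative assumptions, not the paper's:

```python
import numpy as np

def spring(e, k=88.7, e0=0.0, e1=0.10, eb=0.04):
    """Cubic force of a single bi-stable spring (illustrative parameters)."""
    return -k * (e - e0) * (e - e1) * (e - eb)

def quasi_static_sweep(F_values, e_start=0.0, dt=1e-3, relax_steps=5000):
    """Overdamped relaxation e' = spring(e) + F at each slowly varied load,
    tracing the branch the spring actually follows."""
    e, trace = e_start, []
    for F in F_values:
        for _ in range(relax_steps):
            e += dt * (spring(e) + F)
        trace.append(e)
    return np.array(trace)

ramp = np.linspace(0.0, 0.05, 60)                    # pull: F = 0 -> 0.05 N
up = quasi_static_sweep(ramp)                        # starts on branch 0
down = quasi_static_sweep(ramp[::-1], e_start=up[-1])

# Hysteresis: pulling hard enough flips 0 -> 1, and the spring stays on
# branch 1 when the load is released (the back-switching field is negative).
assert up[0] < 0.04 < up[-1]
assert down[-1] > 0.04
```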

To get a better visual representation of the model's predictions, the quasi-statically achievable transitions between states are modeled as the edges of a directed graph, where the nodes represent the stable configurations. The allowed transitions correspond to the switch of a single hysteron. The exact topology of the transition diagram depends on the relative values of the switching fields [terzi2020state]. If the ordering of the F^+ switching fields matches the ordering of the F^− switching fields across the coupled hysterons, the system can reach every configuration from any other one. Otherwise, there exist isolated configurations which can never be reached again once left. As is customary, we call these unreachable configurations "Garden of Eden" (GoE) states (see [jules2022delicate] for a historical account). An illustration for the three-spring case is given in Fig. 1 c), where two of the eight states are GoE states. The existence of GoE states severely limits the total number of reachable states and, as a result, the amount of information that can be stored in the system.
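A minimal sketch of this construction: enumerate the 2^N hysteron states, add the single-flip transitions dictated by the ordering of the switching fields, and flag as GoE the states with no incoming edge. The switching-field values here are hypothetical, chosen so that the up and down orderings differ:

```python
from itertools import product

def transition_graph(f_plus, f_minus):
    """Directed graph of quasi-static single-flip transitions for independent
    hysterons (Preisach picture). From each state, raising the field first
    flips the OFF hysteron with the smallest F+, and lowering it first flips
    the ON hysteron with the largest F-."""
    n = len(f_plus)
    edges = {}
    for s in product((0, 1), repeat=n):
        targets = []
        off = [i for i in range(n) if s[i] == 0]
        on = [i for i in range(n) if s[i] == 1]
        if off:
            i = min(off, key=lambda j: f_plus[j])
            targets.append(s[:i] + (1,) + s[i+1:])
        if on:
            i = max(on, key=lambda j: f_minus[j])
            targets.append(s[:i] + (0,) + s[i+1:])
        edges[s] = targets
    return edges

def garden_of_eden(edges):
    """States with no incoming edge: unreachable once left."""
    reachable = {t for ts in edges.values() for t in ts}
    return sorted(s for s in edges if s not in reachable)

edges = transition_graph(f_plus=[1.0, 2.0, 3.0], f_minus=[-3.0, -1.0, -2.0])
print(garden_of_eden(edges))
```

With these hypothetical fields, the states (0,1,0) and (0,1,1) have no incoming edge: once left, they can never be revisited quasi-statically.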

A solution to overcome this limitation is to break the quasi-static assumption and take advantage of the dynamics as a means to reach GoE configurations. However, this approach foregoes a major upside of the Preisach model, its simplicity. Indeed, applying Newton's second law of motion to each mass yields a description of the dynamical evolution of the system through a system of coupled non-linear differential equations

m ẍ_i = F_i(e_i) − F_{i+1}(e_{i+1}) − ν ẋ_i   for i < N,
m ẍ_N = F_N(e_N) + F_ext − ν ẋ_N.    (6)

While providing a clear experimental protocol to make the system change configuration is straightforward in the quasi-static regime, the non-linear response of the springs and the coupling between the equations make a similar analysis a hefty challenge in the dynamical case. In the following, we demonstrate how the use of artificial neural networks and reinforcement learning unlocks this feat.
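These coupled equations can be integrated with the same fourth-order Runge-Kutta scheme used later in the Methods. The sketch below assumes identical springs and illustrative parameter values, and its sign conventions are our reconstruction rather than the authors' code:

```python
import numpy as np

K, M, NU = 88.7, 1.0, 2.0              # stiffness, mass, viscosity (illustrative)
E0, E1, EB = 0.0, 0.10, 0.04           # well and barrier deformations (assumed)

def spring_force(e):
    """Cubic force of each bi-stable spring (identical springs for brevity)."""
    return -K * (e - E0) * (e - E1) * (e - EB)

def rhs(t, y, f_ext):
    """Newton's equations for the chain, y = [x_1..x_N, v_1..v_N]; the first
    spring is tied to the wall, the load f_ext(t) acts on the last mass."""
    n = y.size // 2
    x, v = y[:n], y[n:]
    f = spring_force(np.diff(np.concatenate(([0.0], x))))
    a = np.empty(n)
    a[:-1] = (f[:-1] - f[1:] - NU * v[:-1]) / M
    a[-1] = (f[-1] + f_ext(t) - NU * v[-1]) / M
    return np.concatenate((v, a))

def rk4_step(t, y, dt, f_ext):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(t, y, f_ext)
    k2 = rhs(t + dt / 2, y + dt / 2 * k1, f_ext)
    k3 = rhs(t + dt / 2, y + dt / 2 * k2, f_ext)
    k4 = rhs(t + dt, y + dt * k3, f_ext)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# A rest state with all springs in well 0 is a fixed point without load...
y = np.zeros(6)
for i in range(100):
    y = rk4_step(0.01 * i, y, 0.01, lambda t: 0.0)
assert np.allclose(y, 0.0)

# ...while a saturating pull sets the chain in motion.
y = np.zeros(6)
for i in range(200):
    y = rk4_step(0.01 * i, y, 0.01, lambda t: 1.0)
assert y[2] > 0.0   # position of the last mass
```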

Reinforcement learning to control the dynamics

Reinforcement Learning (RL) is a computational paradigm that consists in optimizing, through trial-and-error, the actions of an agent that interacts with an environment. RL has been shown to be an effective method to control multistable systems with nonlinear dynamics [gadaleta1999optimal, gadaleta2001learning, wang2021constrained, pisarchik2014control]. In RL, the optimization aims to maximize the cumulative reward associated with the accomplishment of a given task. At each step t, the environment is described by an observable state s_t. The agent uses this information, in combination with a policy π, to decide the action a_t to be taken, i.e. a_t = π(s_t). This action brings the environment to a new state s_{t+1} and grants the agent a reward r_t quantifying its success with respect to the final objective. An episode ends when that goal is reached or, if not, after a finite time. The goal of training is to learn a policy that maximizes the agent's cumulative reward over an episode. When the control space is continuous, one can resort to an actor-critic architecture [konda1999actor], which is based on two Artificial Neural Networks (ANN) learning in tandem. One network, called the actor, generates a sensory-motor representation of the problem in the form of a mapping of its parameter space into the space of policies, such that a_t = π_θ(s_t). This operation makes the exploration of the continuous spectrum of choices computationally tractable. The optimization of the actor necessitates an estimation of the expected reward at long times. This is the role of the second ANN, the critic, which learns to evaluate the decisions of the actor and how it should adjust them. This is done through the same bootstrapping of the Bellman equation as that used in Q-learning [grondman2011efficient]. The successive trials, resulting in multiple episodes, are stacked in a finite memory queue (FIFO), or replay buffer, and after each trial the ANN tandem is trained on that buffer, thus progressively improving the decision policy and the quality of the memory. The precise architecture of the algorithm we used is based on Twin Delayed Deep Deterministic Policy Gradient [TD3] and is detailed in the Materials and Methods section.

In our system, the environment consists of the positions and velocities of the masses, an action is a choice for the value of the external force F_ext applied to the last mass in the chain at time t, and the goal is to bring the system close to a given meta-stable memory state in a given time, close enough that it cannot switch states if left free to evolve. At the start of an episode, the environment is randomly initialized and a random target state is set. The information provided as an input to the networks includes the position and velocity of all the masses in addition to the one-hot encoded target configuration. Then the policy decides on the force applied to reach the next step. We reduce the space of possible actions by limiting the amplitude of the force to at most one newton (|F_ext| ≤ F_max = 1 N). The evolution of the system is simulated by solving the differential equations (6) numerically with a Runge-Kutta method of order 4. The reward from this action is computed relative to the newly reached state: we give a penalty with an amplitude proportional to the velocity of the masses and to the distance of the masses from their target rest positions. After every step, the replay buffer is updated with the corresponding data. The critic is optimized with a batch of data every step, while the actor is optimized every two steps. The episode stops if the system is sufficiently close to rest in the correct configuration, in which case a large positive reward is granted, or after a fixed maximum number of steps. Then a new episode is started, and the algorithm is repeated for a predefined number of episodes. More details are available in the Materials and Methods.
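A minimal sketch of the per-step reward logic (not the authors' implementation): the success thresholds follow Table 1, while the velocity-penalty coefficient is an assumption since its value is not reproduced here:

```python
import numpy as np

SUCCESS_R = 50.0
POS_TOL, VEL_TOL = 0.005, 0.01      # success thresholds (Table 1)
C_POS, C_VEL = 1.0, 1.0             # penalty coefficients (C_VEL assumed)

def reward(x, v, x_target):
    """Sketch of the per-step reward: penalties grow with the distance from
    the target rest positions and with the speeds; reaching rest in the
    target configuration earns a large bonus and ends the episode."""
    dx = np.abs(np.asarray(x) - np.asarray(x_target))
    dv = np.abs(np.asarray(v))
    done = bool((dx < POS_TOL).all() and (dv < VEL_TOL).all())
    r = SUCCESS_R if done else -(C_POS * dx.sum() + C_VEL * dv.sum())
    return r, done

r, done = reward([0.1, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
assert not done and r < 0
r, done = reward([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
assert done and r == 50.0
```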

Using the described method, we trained our ANNs on a chain of three bi-stable springs specifically designed to display GoE states, as detailed in the Materials and Methods section and illustrated in Fig. 1 c). Interestingly, the networks achieve a 100% success rate at reaching any target state, including the GoE states, in less than 10,000 episodes, as shown in Fig. 2 a). We thus accomplished our initial objective and designed a reliable method that produces protocols for the transitions to any stable configuration, restoring the memory of the device to its full capacity.

Figure 2: Training dynamics of the RL agent on the model presented in Fig. 1 for different values of the viscous coefficient ν. a) Evolution of the success rate during training. The success rate is defined as the fraction of the last 100 episodes in which the agent succeeded in reaching the target configuration. For the blue, green, and orange curves, the ANNs were initialized randomly. For the purple curve, the ANNs were initialized using the weights of a previously trained model with ν = 4 kg/s. b) Learning time with respect to the viscous coefficient ν, with and without Transfer Learning. The learning time is defined as the number of episodes it takes for the success rate to reach the threshold value 0.8 for the first time.

In order to evaluate the robustness of the decision-making process of the policy and gain insights into the mechanisms involved, we study how the physical parameters of the system influence the designed solutions. We chose to focus on a quantity that deeply affects the dynamics of the masses and has a simple qualitative interpretation: the viscous coefficient ν. To observe its effect on the designed policy, we trained agents with random initialization of the weights for a range of ν while keeping all other physical parameters fixed. The learning time, defined as the number of episodes before the success rate reaches 0.8 during training, is shown in Fig. 2 b) as a function of ν. Even though the algorithm manages to learn the transitions for a wide range of ν, the learning time varies significantly. Notably, the learning time gets longer for very low viscous coefficients but also seems to diverge at very high ν.

Due to the continuous nature of the system, we expect small modifications of the physical parameters not to catastrophically change the dynamics of the system. With this assumption, we employed Transfer Learning (TL) techniques [pan2009survey, taylor2009transfer] between runs at different ν to accelerate the learning phase. In TL, the ANNs are not initialized with random weights, but with the weights of ANNs already trained on a similar physical system with a slightly different ν. The expectation is that some physical principles learned by the algorithm remain applicable for solving the new problem. We thus slowly increase ν from 2 kg/s up to 10 kg/s and decrease it from 2 kg/s to 0 kg/s, applying TL at each increment. We observe in Fig. 2 b) that TL is very effective, dividing the learning time by up to 30 at very high viscosity and allowing convergence in otherwise non-converging regions. This acceleration of training also allows for a finer discretization of the viscous-coefficient exploration while keeping computation time reasonable.

By mixing RL and TL, we generated an algorithm that quickly produces precise transition protocols to any stable state for chains of bi-stable springs, including GoE states. In the next section, we analyze the properties of the force signals produced by the ANN and investigate how they relate to the dynamics of the system as the viscous coefficient is varied.

How damping affects the control strategy

The intensity of the damping impacts the dynamical response of the system, which significantly affects the actuation protocol proposed by the ANN. To study these variations, we selected a unique transition (111 → 001), recorded the signal of the force generated by the agent, and computed the corresponding mechanical energy injected into the system at each time step for different values of ν, as shown in Fig. 3. Note that the state 001 is a GoE state. We observe that the relation between ν and the time it takes to reach the target state is non-monotonic. We identify the minimum episode duration, corresponding to the most efficient actuation, with a critical viscous coefficient ν_c. Interestingly, this minimum also marks the transition between two qualitatively different behaviors of the control force: a high-viscosity regime (ν > ν_c) and a low-viscosity regime (ν < ν_c), as shown in Fig. 3 c) and d). In the high-viscosity regime, the force always saturates at its limit value F_max and only changes sign a few times per episode. Remarkably, each sign change occurs roughly when a mass is placed at the correct position. In contrast, the force signal in the low-viscosity regime appears less structured, with large fluctuations between consecutive steps.

In order to qualitatively explain these different behaviors, we focus our analysis on the energy transfer between the external controller and the system. The starting and final states are stable configurations at rest. Consequently, they both correspond to local energy minima, and the agent has to provide mechanical energy to the system in an effort to overcome the energy barriers between these configurations. After crossing the barriers, the surplus of kinetic energy has to be removed to slow down the masses and trap them in the well associated with the targeted minimum. In the low-viscosity regime, the internal energy dissipated through viscosity is small. As a result, the protocols require phases where the agent actively drains energy from the system. After a short initial phase of a few steps, where much energy is introduced into the system by setting the external load to its maximal value, a substantial fraction of the remainder of the episode involves careful adjustments to remove the kinetic energy. In the high-viscosity regime, on the other hand, the viscosity is able to dissipate the extra energy without further intervention. As ν increases, the dynamics slow down, which translates into an increasing episode duration.

While we established the characteristics of the designed protocols in both regimes, we have yet to define a quantitative estimation of the crossover between regimes.

Figure 3: Analysis of the protocol proposed by the RL agent for the transition 111 → 001 with different values of the viscous coefficient. a) Force signal and b) injected energy during the transition. The injected energy is computed by multiplying the chain's elongation with the value of the external force. c) Force signal and e) deformation of each bi-stable spring during the transition for ν = 0 kg/s (low-viscosity regime). The colors blue, orange, and green correspond to the deformation of the first, second, and third spring, respectively. The subplots for the deformations all have the same height, with edge values [-0.07, 0.19] m. The dashed lines represent the stable equilibria e^0 (black) and e^1 (red). d) Force signal and f) deformation of each bi-stable spring during the transition for ν = 9 kg/s (high-viscosity regime). The subplots for the deformations all have the same height, with edge values [-0.14, 0.24] m.

Drawing inspiration from these protocols, we consider a simplified situation where the external force saturates the constraint, |F_ext| = F_max, and where the switching fields are very small, i.e. F^± ≈ 0.
Since all the masses start in a stable equilibrium, the mechanical response of the chain to the external load is very soft, at least for small enough displacements at the start of the episode.
Thus, we can approximate the dynamics of the last mass by the ordinary differential equation

m ẍ_N = F_max − ν ẋ_N,    (7)

which solves into

ẋ_N(t) = v_s (1 − e^{−t/τ}),    (8)

with a relaxation time τ = m/ν and a saturation velocity v_s = F_max/ν. The relaxation time corresponds to the time it takes for dissipation to take over inertia. This transition is also associated with a length scale ℓ_ν such that

ℓ_ν = v_s τ = m F_max / ν².    (9)

Let us now define more precisely what we mean by small displacement. With our assumptions, and due to the asymptotic shape of the potential, the typical relative distance δ for which the mechanical response becomes of the same magnitude as the external load verifies

k δ³ ∼ F_max.    (10)

It is thus clear that if ℓ_ν ≪ δ, the system will be dominated by dissipation and converge to equilibrium without further oscillations. On the other hand, if ℓ_ν ≫ δ, neighboring masses will rapidly feel differential forces and inertia will dominate. Interestingly, equating these two length scales points to a critical dissipation at the frontier of these two regimes,

ν_c ∼ (m F_max / δ)^{1/2} ∼ m^{1/2} k^{1/6} F_max^{1/3}.    (11)
To test this prediction, we investigate how the damping crossover observed in the designed policies varies as the mass m and the maximum force F_max are varied, exploring more than two orders of magnitude for both parameters. As shown in Fig. 4, the results present an excellent agreement with the proposed scaling argument, Eq. (11).
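The scaling can be checked directly. Under the stated assumption that the typical displacement scales as δ ~ (F_max/k)^(1/3), the critical viscosity grows as the square root of the mass and the cube root of the maximum force:

```python
def nu_c(m, f_max, k=88.7):
    """Critical viscosity from equating the dissipative length m*F_max/nu^2
    with the typical displacement delta ~ (F_max/k)^(1/3) (scaling sketch)."""
    delta = (f_max / k) ** (1 / 3)
    return (m * f_max / delta) ** 0.5

# Doubling the mass multiplies nu_c by sqrt(2); doubling the maximum
# force multiplies it by 2^(1/3).
base = nu_c(1.0, 1.0)
assert abs(nu_c(2.0, 1.0) / base - 2 ** 0.5) < 1e-9
assert abs(nu_c(1.0, 2.0) / base - 2 ** (1 / 3)) < 1e-9
```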

Figure 4: Evolution of the critical viscous coefficient ν_c with respect to a) the mass m and b) the maximum amplitude of the external load F_max. The lines show the scaling behavior predicted by Eq. (11).

Discussion and conclusion

We have shown a proof of concept of general memory writing operations in a strongly non-linear system of coupled bi-stable springs by a reinforcement learning agent. In particular, we found that this technique allows reaching otherwise unreachable memory states dynamically. Interestingly, the agent appears to learn how to harness the physics underlying the behavior of the system: its control strategy changes qualitatively as the viscous coefficient is varied, from a relatively simple actuation in the large-dissipation regime to a jerky dynamical behavior aimed at extracting the excess energy in the small-dissipation regime. This transition coincides with a change in the system's internal response, from an over-damped to an inertial response. As such, not only did the agent discover this transition by itself, pointing to physically relevant characteristic length scales, but it also signals an optimal design to achieve the most efficient memory manipulation. In that sense, it was able to gather and share with the authors some insightful knowledge about the physics of the memory system, thus displaying some form of intelligence in understanding the challenges it was asked to tackle. Key stakes of future work will consist in identifying the cognitive structures established by the agent to complete the learned tasks, i.e. rationalizing its neural activity and learning dynamics, and in using this knowledge to learn transitions for a higher number of coupled units. Indeed, while we managed to successfully train networks on a system of four springs (see the Materials and Methods section), training on larger systems did not converge. Finally, it would be interesting to verify whether the principles we discovered in this simulated model remain relevant in real-life situations.


Research by T.J. was supported in part by the Raymond and Beverly Sackler Post-Doctoral Scholarship.


Description of our Reinforcement Learning setup

Deep Deterministic Policy Gradient

The aim of an RL agent is to choose an action that will allow it to reach a specific goal in the future. Many RL algorithms learn an approximate Q-function in order to have access to the optimal actions to choose to reach the goal. The Q-function is the expectation of the future (discounted) reward for every action available to the agent given the state of the environment. The output of the Q-function is called the Q-value:

Q(s_t, a_t) = E[ Σ_{k≥0} γ^k r_{t+k} ],    (12)

with r_t the reward received at step t and γ a discount factor between 0 and 1.
The optimal Q-function satisfies the Bellman equation

Q*(s, a) = E[ r(s, a) + γ max_{a′} Q*(s′, a′) ],    (13)

where r(s, a) is the reward obtained if the agent takes the action a while the state of the environment is s, and a′ and s′ are respectively the next action and the next state.
There is a simple relation between the optimal Q-function and the optimal action/policy:

a*(s) = argmax_a Q*(s, a).    (14)

In Q-learning algorithms, an iterative, self-consistent scheme is written to approximate the Bellman equation. The algorithm learns this approximation on the accumulated memory of its previous trials:

Q(s, a) ← Q(s, a) + α [ r(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ].    (15)
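In tabular form, this bootstrapped update can be sketched as:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One self-consistent Q-learning step on a tabular Q(s, a): move the
    entry toward the bootstrapped target r + gamma * max_a' Q(s', a')."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

Q = np.zeros((2, 2))
q_update(Q, 0, 1, 1.0, 1)   # reward 1 observed for action 1 in state 0
assert np.isclose(Q[0, 1], 0.1)
```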


However, in our case, the agent chooses its actions in a continuous interval, creating an infinite number of possible actions and making Eq. (14) impractical. The Deep Deterministic Policy Gradient (DDPG) agent we chose addresses this problem by separately learning a Q-function and a policy. It approximates the optimal Q-function by a neural network, called the critic, and approximates the optimal policy (i.e. the optimal action) by a second neural network, called the actor (see Fig. 5). The critic takes as input the environment's state and the action chosen by the actor and outputs a Q-value. The actor takes the environment's state as input and outputs a selected action. While playing, the agent stacks its experiences into a replay buffer and samples them randomly to update the Q-function and the policy.

Figure 5: Actor and critic architectures.

Twin-Delayed DDPG

The DDPG agent learns the optimal Q-function and the optimal policy concurrently. The policy indeed appears in the loss used to learn the model approximating the Q-function, and the Q-function appears in the loss of the neural network approximating the policy. An unwanted drawback is that errors in the learned Q-function propagate to the policy. We therefore preferred the so-called Twin-Delayed DDPG [TD3], which addresses this issue. Twin-Delayed DDPG learns two Q-functions and uses the smaller of the two in the update targets of its Q-functions. Noise is also added to the target policy used in the Q-function loss, and the policy is updated less frequently than the Q-functions. These additional mechanisms improve performance by limiting over-estimation and reducing per-update errors.
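The target value used to train both critics can be sketched as follows (illustrative hyper-parameters; the clipping of the smoothing noise is a standard TD3 ingredient assumed here):

```python
import numpy as np

def td3_target(r, s_next, q1, q2, policy, gamma=0.99,
               noise_std=0.1, noise_clip=0.5, f_max=1.0, rng=None):
    """Twin-Delayed DDPG target (sketch): smooth the target policy with
    clipped Gaussian noise, then bootstrap on the SMALLER of the two
    critics to limit over-estimation."""
    rng = rng or np.random.default_rng(0)
    noise = np.clip(rng.normal(0.0, noise_std), -noise_clip, noise_clip)
    a_next = np.clip(policy(s_next) + noise, -f_max, f_max)
    return r + gamma * min(q1(s_next, a_next), q2(s_next, a_next))

# With constant dummy critics, the target bootstraps on the smaller one.
val = td3_target(0.0, None, lambda s, a: 2.0, lambda s, a: 1.0, lambda s: 0.0)
assert abs(val - 0.99) < 1e-12
```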

Gym package

Gym is a tool developed by OpenAI to deploy reinforcement learning algorithms [gym]. Gym offers a wide variety of challenging environments and makes it easy to create custom ones. We therefore used the Gym package to create the environment for our agent. In the Gym formalism, training is divided into episodes, themselves divided into steps. A step corresponds to each time the agent chooses an action. At the beginning of each episode, the state of the environment is randomly initialized, and the agent possesses a defined number of steps to reach the goal of the task.

Description of our setup

We used the Twin-Delayed DDPG agent implemented in [JMLR:v22:20-376] to solve our control problem. The task involves reaching a stable configuration close to rest, starting from random initial conditions. At the beginning of an episode, a target state is randomly chosen, and the initial positions and velocities are randomly sampled within ±0.2 m of the rest positions and within [−0.1, 0.1] m/s, respectively. We begin training with random initial policy and Q-function parameters. The weights and biases of each layer are sampled from U(−1/√n, 1/√n), where U is the uniform distribution and n is the size of the input of the layer. For 10,000 steps, actions are sampled uniformly from [−F_max, F_max], F_max being equal to 1 N, without consulting the policy or the Q-functions. Once this fully exploratory phase is completed, the agent starts using the ANNs. At each time step t, the agent observes the current state of the system (composed of the position and the velocity of each mass and the target state) and accordingly chooses a force in the interval [−F_max, F_max]. The selected force, to which is added a noise taken from a Gaussian distribution of mean 0 and standard deviation 0.1, brings the system to a new state computed by a Runge-Kutta method of order 4. At each time step (constant dt), the RK4 integration is done through 10 successive iterations for a total duration of 0.1 s. Each of those iterations changes the state of the system. Once the numerical resolution is completed, the agent receives a reward given by Eq. (16),

r_{t+1} = successr · S − (1 − S) Σ_i ( penaltypos |Δx_i| + penaltyvel |ẋ_i| ),    (16)

where S is a variable equal to 0 or 1 indicating success, and Δx_i and ẋ_i are respectively the displacement from the target rest position and the velocity of the i-th mass at time t+1.
The experience above is stored in the replay buffer, which has a finite maximal size of 10^6 experiences. Each new experience overwrites the oldest stored one when the buffer is full. This process allows the continuous improvement of the available training dataset during training. The Q-functions and the policy are then updated. The Q-functions are updated at every step, while the policy is updated every two steps. Both the Q-functions and the policy are updated using the Adam algorithm [kingma2014adam] with a learning rate of 0.001 and a batch size of 100 experiences randomly sampled from the replay buffer. The operation goes on until either the agent reaches the goal, at which point it receives a reward of 50, or 200 steps are exceeded. At this stage, the environment is reset, giving way to a new episode. This algorithm is repeated for a predefined number of episodes.
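The initialization and the exploration noise described above can be sketched as follows (the clipping of the noisy action back into [−F_max, F_max] is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Sample weights and biases from U(-1/sqrt(n_in), 1/sqrt(n_in)),
    the layer initialization described above."""
    bound = 1.0 / np.sqrt(n_in)
    return (rng.uniform(-bound, bound, (n_out, n_in)),
            rng.uniform(-bound, bound, n_out))

def explore(action, f_max=1.0, sigma=0.1):
    """Gaussian exploration noise added to the selected force, clipped to
    the admissible force interval."""
    return float(np.clip(action + rng.normal(0.0, sigma), -f_max, f_max))

W, b = init_layer(400, 300)
assert W.shape == (300, 400) and np.abs(W).max() <= 1 / 20
assert -1.0 <= explore(0.95) <= 1.0
```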

Figure 6: Architecture of the learning protocol.

The fixed parameters of the environment, the hyper-parameters of the agent, and the architecture of the neural networks of the policy and the Q-function are summarized in Tables 1, 2 and 3.

Parameter Value
F_max 1 (N)
m 1 (kg)
k 88.7 (N/m)
e_i^0, e_i^1 (10^-2 (m))
dt 0.1 (s)
nres 10
maxepisodelen 200
successpos 0.005 (m)
successvel 0.01 (m/s)
successr 50
penaltypos 1
Table 1: Environment parameters. dt: discretization time, nres: number of iterations for the numerical resolution, maxepisodelen: maximum number of steps per episode, successpos: success condition on the position, successvel: success condition on the velocity, successr: success reward, penaltypos: penalty coefficient on the position, penaltyvel: penalty coefficient on the velocity.
Hyper-parameter Value
Policy optimizer Adam
Policy learning rate 0.001
Q-function optimizer Adam
Q-function learning rate 0.001
Replay buffer size 1e6
Exploration time 10 000 (steps)
Batch size 100
Policy update interval 2 (steps)
Q-function update interval 1 (step)
Table 2: TD3 agent hyperparameters.

Policy layer Policy activation
Linear (400) ReLU
Linear (300) ReLU
Linear (1) Tanh
Q-function layer Q-function activation
Linear (400) ReLU
Linear (300) ReLU
Linear (1) None
Table 3: Policy and Q-function models.

Four coupled bi-stable spring-mass units

We trained a multistable chain composed of four bi-stable spring-mass units with the specific choice of disorder (0.050, 0.050, 0.040, 0.020, 0.030, 0.045, 0.055, 0.055) m. This choice leads to four GoE states. The training was done with ν = 2 kg/s. All other physical parameters, hyperparameters, and training protocols were kept identical to the three-spring training. The training dynamics and the deformation of the springs during the transition 0000 → 1011 proposed by the trained networks are shown in Fig. 7.

Figure 7: Learning dynamics of the RL agent on the model composed of four coupled bi-stable spring-mass units for ν = 2 kg/s. a) Evolution of the success rate during training. The ANNs were initialized randomly. b) Deformation of each bi-stable spring during the transition 0000 → 1011. The colors blue, orange, green, and purple correspond to the deformation of the first, second, third, and fourth spring, respectively. The subplots for the deformations all have the same height, with edge values [-0.14, 0.36] m. The dashed lines represent the stable equilibria e^0 (black) and e^1 (red).

Data Availability

The code used to produce the results of this study is available online [RepoGit].