
Cooperative multi-agent reinforcement learning for high-dimensional nonequilibrium control

Experimental advances enabling high-resolution external control create new opportunities to produce materials with exotic properties. In this work, we investigate how a multi-agent reinforcement learning approach can be used to design external control protocols for self-assembly. We find that a fully decentralized approach performs remarkably well even with a "coarse" level of external control. More importantly, we see that a partially decentralized approach, in which we include information about the local environment, allows us to better control our system towards a target distribution. We explain this by analyzing our approach as a partially observed Markov decision process. With a partially decentralized approach, the agent is able to act more presciently than with a fully decentralized approach, both by preventing the formation of undesirable structures and by better stabilizing target structures.



1 Context and related work

Designing nanoscale structures that are tuned to have specific material properties or dynamical behavior is a longstanding goal in the molecular sciences yin_colloidal_2005; ma_inverse_2019; gadelrab_inverting_2017; ronellenfitsch_inverse_2019. While advances in nanofabrication allow for increasingly intricate design, these approaches are often limited to specific materials and require homogeneity of the molecular components. Many nanoscale systems, however, “self-assemble” in suitable conditions or in the presence of appropriate external driving forces cameron_biogenesis_2013; sigl_programmable_2021; rotskoff_robust_2018. Exploiting self-assembly to manufacture nanoscale components presents its own challenges, as the system must be designed to spontaneously and rapidly form a given target structure. The starting materials can become kinetically trapped, forming unfavorable structures if left to self-assemble without external control rotskoff_robust_2018; whitelam_statistical_2015. Here we examine the problem of guiding self-assembly toward a high yield of the target structure via external nonequilibrium driving forces, whose schedule we refer to as a “protocol”. Reinforcement learning (RL) offers a promising toolkit for computing optimal protocols, but there has been little systematic investigation of approaches based on RL in this context.

Here, we consider two minimal models of molecular self-assembly in which particles evolve according to a stochastic, nonequilibrium dynamics. In both models, clusters of particles form spontaneously under appropriate external conditions, and we seek to design protocols that optimize the size of particle clusters to a given target. The external driving forces (e.g., temperature, light intensity) that we consider can be modulated as a function of both space and time (Fig. 1). In the systems considered here, the protocol is consequently high-dimensional because the external field is modulated on a grid with high spatial resolution. Because of the exponentially large action space, it is natural to employ multi-agent reinforcement learning (MARL) in this context. Briefly, MARL extends the RL paradigm of an agent interacting with an environment to one in which multiple agents interact with the same environment concurrently (and possibly with each other).
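To make the size of the joint action space concrete, the following sketch compares a single centralized agent with one agent per region. The 15x15 grid matches the grid used later for thermal annealing; the number of discrete field strengths per region (`n_levels = 3`) is a hypothetical value chosen only for illustration.

```python
# Illustrative arithmetic only: n_levels is an assumed value, not taken from the paper.
n_regions = 15 * 15   # a 15x15 control grid
n_levels = 3          # hypothetical number of discrete field strengths per region

# A single centralized agent must choose one field strength for every region at once.
joint_actions = n_levels ** n_regions

# In the multi-agent setting, each agent chooses a field strength for its own region only.
per_agent_actions = n_levels

print(f"joint action space has {len(str(joint_actions))} decimal digits")
print(f"per-agent action space has {per_agent_actions} actions")
```

The joint action count grows exponentially with the number of regions, while the per-agent count stays fixed, which is the motivation for the multi-agent decomposition.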

Several works have investigated reinforcement learning to control condensed phase dynamics, mostly in equilibrium settings rechtsman_optimized_2005; chen_patchy_18; romano_designing_2020; Ma_magnetic_self_assembly_2020. These approaches modulate the inter-particle interaction potentials between constituents—fundamentally altering the constituent material—rather than externally modulating them as we do here. Recently, Ref. falk_learning_2021 investigated controlling dynamics of a nonequilibrium active matter system with a low-dimensional protocol using standard RL techniques. Multi-agent reinforcement learning has a substantial literature tan_multi-agent_1993; busoniu_comprehensive_2008; strategies that treat each cooperative agent independently are among the most relevant to the current study, including rashid_qmix_2018; son_qtran_2019; sunehag_value-decomposition_2018.

Main contributions:

  • We argue that, within the framework of a partially observed Markov decision process, multi-agent reinforcement learning with improved state estimates leads to improved value.

  • We demonstrate the effectiveness of incorporating local state and/or reward information in a MARL approach for controlling self-assembly in two systems: active, nonequilibrium colloidal particles and equilibrium Lennard-Jones particles undergoing thermal annealing.

2 Multi-Agent Reinforcement Learning

We consider a partially observed Markov decision process (POMDP), cf. Ref. roy_finding_2005, specified as a tuple consisting of the state space, the action space, the space of observations, the observable function, the reward, the state-action transition operator, the discount factor, and the belief update. As depicted in Fig. 1, we consider an external control function that can independently tune the strength of an external field on a specific region. We want to dynamically assemble clusters with a user-specified cluster size distribution, and we take a measure of the discrepancy between the observed and target cluster size distributions as our cost (or inverse reward) function (1). We discuss in detail how we compute this reward in Computational approach.
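As a rough illustration of a cost of this kind, the sketch below assumes the cost is a Kullback-Leibler divergence between the empirical and target cluster size distributions, which is the quantity reported in the figures. The function name `kl_cost`, the smoothing constant `eps`, and the direction of the divergence are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

def kl_cost(counts, target, eps=1e-12):
    """Assumed cost: D_KL(p_hat || p_target) over cluster sizes.

    counts: number of observed clusters of each size (unnormalized).
    target: target weight for each cluster size (need not be normalized).
    eps:    small additive smoothing so that empty bins do not produce log(0).
    """
    counts = np.asarray(counts, dtype=float) + eps
    target = np.asarray(target, dtype=float) + eps
    p_hat = counts / counts.sum()   # empirical cluster size distribution
    p_tgt = target / target.sum()   # target cluster size distribution
    return float(np.sum(p_hat * np.log(p_hat / p_tgt)))

# The cost vanishes when the observed distribution matches the target
# and is positive otherwise.
matched = kl_cost([1, 2, 1], [1, 2, 1])
mismatched = kl_cost([4, 0, 0], [1, 2, 1])
```

A reward for a standard RL library could then be taken as the negative of this cost.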

The observable function takes a local particle configuration and simply counts the number of clusters of each size; these counts constitute the observable, and normalizing them gives the empirical distribution of cluster sizes. Due to this mapping, the state information is not perfectly resolved by the agent; rather, the input information is partially observed. We decompose the system into a fixed number of independent regions, each of which is controlled by an individual agent. Because each agent shares the same goal, the optimization is “fully cooperative” tan_multi-agent_1993. As a result, all agents share a common cost function (1) and Q-function, allowing for high data efficiency as the experience of each agent is shared with all other agents.
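A minimal sketch of such a cluster-counting observable is given below. It assumes 2D particle positions, a distance cutoff that defines which particles are bonded, and no periodic boundaries; the name `cluster_size_counts` and the parameters `cutoff` and `max_size` are hypothetical and not taken from the paper.

```python
from collections import Counter
import numpy as np

def cluster_size_counts(positions, cutoff, max_size):
    """Return counts where counts[n - 1] is the number of clusters with n members.

    Two particles belong to the same cluster if they are connected by a chain of
    pairs separated by at most `cutoff` (an assumed clustering criterion).
    Clusters larger than `max_size` are placed in the last bin.
    """
    positions = np.asarray(positions, dtype=float).reshape(-1, 2)
    n = len(positions)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(positions[i] - positions[j]) <= cutoff:
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(n)).values()
    counts = np.zeros(max_size, dtype=int)
    for size in sizes:
        counts[min(size, max_size) - 1] += 1
    return counts

# Three particles in a chain plus one isolated particle:
# one cluster of size 3 and one cluster of size 1.
example = cluster_size_counts([(0, 0), (1, 0), (2, 0), (10, 10)], cutoff=1.1, max_size=4)
empirical = example / example.sum()   # empirical cluster size distribution
```

Because many distinct particle configurations map to the same counts, an agent that sees only this observable cannot fully resolve the underlying state.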

This algorithm follows the MARL paradigm of centralized training with fully decentralized execution. However, fully decentralized execution poses a limitation, as spatially proximal information is crucial for the self-assembly of clusters. Because of the high spatial resolution of control, the presence or absence of particles and clusters in neighboring grids influences cluster formation in a given grid. Incorporating more local information into the observation of the current state leads to an improved belief about the current state, which we summarize in the following proposition:

Proposition 1 (Low entropy belief improves expected reward).

Consider a POMDP and the same POMDP augmented with local state information, in which the observation function is extended to include that information, leading to a distinct evolution of the belief probabilities. Consider a state of the system with a unique optimal action. If the augmented belief is concentrated entirely on that state, i.e. it perfectly resolves the state, then the value under the augmented belief is at least as large as the value under the belief of the original POMDP.

To prove this proposition, we simply write the Bellman optimality equation assuming the current state is the one with the unique optimal action, for which we know that this action attains the maximal value, using the transition operator on the belief vector induced by the action and the observation. For a perfectly resolved belief, the result is equivalent to the fully observed MDP value function. If the observation increases the uncertainty of the state relative to the perfectly resolved belief vector, then there exists at least one other state that has nonzero belief. In other words, the value under the less certain belief cannot exceed the value under the perfectly resolved belief, because the expected reward for any other state is suboptimal by assumption.
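The idea behind the proposition can be illustrated numerically in a one-step setting. The sketch below is not the proof: it uses a hypothetical problem with two hidden states and two actions, where each state has a unique optimal action, and compares the best expected reward when acting on a blurred belief with the reward when the state is resolved. The reward matrix and belief values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical one-step problem: reward[s, a] for 2 hidden states and 2 actions.
# Each state has a unique optimal action (action 0 for state 0, action 1 for state 1).
reward = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

def value_under_belief(belief):
    """Best expected one-step reward when a single action is chosen from the belief alone."""
    return float(np.max(belief @ reward))

def value_if_resolved(belief):
    """Expected reward if the state is revealed before acting (optimal action per state)."""
    return float(belief @ reward.max(axis=1))

resolved = np.array([1.0, 0.0])    # belief concentrated on state 0
uncertain = np.array([0.6, 0.4])   # blurred belief over both states

v_resolved = value_under_belief(resolved)     # 1.0
v_uncertain = value_under_belief(uncertain)   # 0.6
v_informed = value_if_resolved(uncertain)     # 1.0
```

With the blurred belief, any single action is suboptimal for the state that carries the remaining probability mass, so the achievable value drops; resolving the state recovers the fully observed value.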

Computational approach

We utilize a fully cooperative multi-agent Clipped Deep Q-Learning approach mnih-atari-2013; fujimoto_td3_18 to learn optimized protocols that minimize the cost (1). For the particle configurations in a given region, we define a state and a cost for that region. Additionally, we consider a discrete action space, representing the strength of the externally modulated field on that region. Proposition 1 suggests that incorporating additional local information should benefit the optimization, so long as that information improves the accuracy of the belief about the current state. To do so, we consider a hierarchy of local region sizes: a plaquette consisting of the current region and its nearest neighbors, two grids of different sizes, and the global system. Given the particle configurations in a local region, we likewise define the state of the local region and the cost of the local region.
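For orientation, one common discrete-action form of a clipped double Q-learning target is sketched below. The text does not spell out the exact update rule, so the variant shown (taking the elementwise minimum of two target networks before maximizing over actions), the reward convention (reward as the negative cost), and the default `gamma` are all assumptions.

```python
import numpy as np

def clipped_double_q_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Assumed target: y = r + gamma * max_a min(Q1'(s', a), Q2'(s', a)).

    reward:  scalar reward for the transition (e.g. the negative of the cost).
    q1_next: Q-values of the first target network at the next state, one per action.
    q2_next: Q-values of the second target network at the next state, one per action.
    """
    # Taking the minimum of the two estimates counteracts overestimation bias.
    q_min = np.minimum(q1_next, q2_next)
    bootstrap = 0.0 if done else gamma * float(np.max(q_min))
    return reward + bootstrap

# Example with two actions: the per-action minimum is [1.0, 0.5],
# so the target is -0.5 + 0.9 * 1.0.
target = clipped_double_q_target(-0.5, [1.0, 2.0], [1.5, 0.5], gamma=0.9)
```

Since all agents share one Q-function, transitions collected by every agent could be pooled when regressing onto such targets.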

With this in mind, we extend our approach to include centralized training but partially decentralized execution. This corresponds to sharing information among the agents, as we include information about the state and/or cost of local regions. When including state information about the local region, we consider the joint state of a region and its surrounding local region. Additionally, when including cost information about the surrounding region, we consider the average of the cost of the region and the cost of its local region. We indicate whether or not we included surrounding state and/or cost information with a subscript S for cooperative state and a subscript C for cooperative cost.
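The construction of the cooperative state (S) and cooperative cost (C) can be sketched as follows for the plaquette case. Several details here are assumptions made for illustration: periodic boundaries, a local state formed by summing per-region cluster counts over the plaquette, and a local cost formed by averaging per-region costs. The function name `cooperative_inputs` is hypothetical.

```python
import numpy as np

def cooperative_inputs(states, costs, i, j, use_state=True, use_cost=True):
    """Build the input and cost for the agent controlling region (i, j).

    states: array of shape (rows, cols, n_features), e.g. cluster-size counts per region.
    costs:  array of shape (rows, cols), one cost per region.
    """
    rows, cols = costs.shape
    # Plaquette: the current region and its four nearest neighbors (periodic wrap assumed).
    plaquette = [(i, j),
                 ((i - 1) % rows, j), ((i + 1) % rows, j),
                 (i, (j - 1) % cols), (i, (j + 1) % cols)]
    local_state = np.sum([states[r, c] for r, c in plaquette], axis=0)
    local_cost = float(np.mean([costs[r, c] for r, c in plaquette]))

    state = states[i, j]
    if use_state:
        # Cooperative state (S): joint state of the region and its local region.
        state = np.concatenate([state, local_state])
    cost = float(costs[i, j])
    if use_cost:
        # Cooperative cost (C): average of the region cost and the local region cost.
        cost = 0.5 * (cost + local_cost)
    return state, cost

states = np.arange(3 * 3 * 2, dtype=float).reshape(3, 3, 2)
costs = np.ones((3, 3))
costs[1, 1] = 3.0
joint_state, joint_cost = cooperative_inputs(states, costs, 1, 1)
plain_state, plain_cost = cooperative_inputs(states, costs, 1, 1,
                                             use_state=False, use_cost=False)
```

Setting both flags to `False` recovers the fully decentralized inputs, so the same code path covers every variant in the hierarchy.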

3 Active Colloids

Figure 1: Active Colloids. In the plots above, both the KL divergence from the instantaneous cluster size distribution to the target distribution and the absolute deviation of the mean instantaneous cluster size from the mean target size are shown as a function of time for the optimized protocols of various partially decentralized approaches and a fully decentralized approach (the protocols are fully trained in these curves). In the illustrations below, we compare the predicted action in the fully decentralized case (no gray shading) and the best performing partially decentralized case (with gray shading). Higher light-induced activities (red) promote cluster formation.

Fig. 1 summarizes numerical results for a 2D colloidal particle system driven by externally controlled light palacci_living_2013. In this system, when the Péclet number is sufficiently large (i.e. when the light-induced activity is high relative to the temperature), the system exhibits a nonequilibrium phase separation called motility-induced phase separation (MIPS) cates_motility-induced_2015.

To explore the limits of an activity-inducing external control, we sought to maintain a steady state distribution consisting of clusters of particles much smaller than the macroscopic aggregate that forms when there is constant activity. We specified a Gaussian target distribution of cluster sizes and applied control on a grid. Consistent with Prop. 1, including information about surrounding grids improves the performance of the optimal protocol. The best performing protocol includes cluster size information and cost information in a local region. Surprisingly, including just the cost information of this local region actually hindered the performance compared to the fully decentralized approach. On the other hand, when we included information about only the cost of the entire global system, the protocol performed remarkably well. This underscores the importance of carefully selecting relevant centralized or semi-centralized information to improve performance. Acting in a partially decentralized manner not only can incur an additional computational cost but can also result in a decrease in performance.

Including surrounding information in the best performing approach was important in two cases: when the current region was empty or when it contained a cluster that was at or near the mean of the target distribution. In both these cases, in a fully decentralized setting, the trained agent always selects an action with the highest activity. However, in the partially decentralized setting, the agent enforces a lower activity when the surrounding region contains clusters that are above the mean of the target distribution, preventing the formation of a cluster far larger than the mean target size.

4 Thermal Annealing

Figure 2: Thermal Annealing. See Fig. 1 caption for descriptions of plots. In the cartoons below, we compare the predicted action in the fully decentralized case (no gray shading) and a partially decentralized case (with gray shading). Lower temperatures (more blue) promote cluster formation.

Thermal annealing is widely used to improve the yield of nanoscopic materials makrides_temperature_2012; dey_dna_2021. Annealing is limited as a mechanism for control because there is essentially only one parameter that can be tuned: the rate at which the temperature is decreased. We examine an alternative paradigm that exercises more localized control with measurement-guided feedback to design an annealing schedule. Rather than globally tuning a temperature, we locally update the temperature on a grid (see Fig. 2). As above, we fixed a target cluster size distribution, chosen to be a Gamma distribution with a fixed variance and mean, and considered a 15x15 grid of control. Here, the size and density of the system are far lower and we sought to form clusters of a much smaller size compared to the active system. This allowed us to define individual grids that were only slightly larger than an individual particle, providing much finer levels of control. Note that the interaction potential for these particles is distinct from the one that we use to model the active system and there are no nonequilibrium dynamics. The fully decentralized and top performing partially decentralized approaches differed in their behavior when the cluster size was around the mean. In the fully decentralized case, the agent was far more conservative when a cluster was of size 4 or greater, and would impose a high temperature, breaking apart the cluster. On the other hand, in one of the top performing partially decentralized approaches, the agent allowed clusters of size between 3 and 5 to persist by ensuring a lower temperature, given that there were no clusters in the vicinity.
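To illustrate what locally updating the temperature on a grid means for individual particles, the sketch below looks up the field value of the grid cell that contains each particle. It assumes a square, periodic simulation box with coordinates in the range from 0 to `box_length`; the function name `local_field` and its parameters are hypothetical.

```python
import numpy as np

def local_field(positions, field, box_length):
    """Return the field value (e.g. temperature) felt by each particle.

    positions:  array of shape (n_particles, 2) with coordinates in [0, box_length).
    field:      array of shape (n, n), one value per grid cell, set by the agents.
    box_length: side length of the square simulation box (periodic wrap assumed).
    """
    positions = np.asarray(positions, dtype=float)
    n = field.shape[0]
    # Map each coordinate to a cell index and wrap indices at the box boundary.
    idx = np.floor(positions / box_length * n).astype(int) % n
    return field[idx[:, 0], idx[:, 1]]

# A 2x2 grid of temperatures on a unit box, with three particles in three different cells.
grid = np.arange(4.0).reshape(2, 2)
felt = local_field([[0.1, 0.1], [0.9, 0.1], [0.1, 0.9]], grid, box_length=1.0)
```

In a simulation loop, the per-particle values returned here would set the noise strength (or activity) used to advance each particle until the agents next update the grid.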

Broader Impact

In this work, we investigated how a partially decentralized reinforcement learning approach could be used to learn protocols that externally drive a system towards a nonequilibrium steady state. This approach can be used as a framework for learning protocols for self-assembly in more complex systems. For example, in the design of suprastructures, each agent (or a local grouping of agents) can be used to design individual substructures, which then self-assemble to form larger structures. At present, the key limitation is that the paradigm is most naturally suited to control with spatial resolution. Ultimately, the ability to design protocols for self-assembly of nanoscopic structures can have wide-ranging impacts across a number of disciplines ranging from medicine to materials science. Potential negative impacts could, in principle, arise from the design of harmful or toxic materials.

Data and Code Availability:

The data that support the findings of this study are available from the corresponding author upon reasonable request. Our code is available at