I Introduction
Free energy estimation is of central importance in the natural sciences. Accurate estimation of free energies, however, is challenging, as many systems are out of reach of experimental methods and analytic theory. Computerbased estimation has thus emerged as a valuable alternative. Successful application areas of insilico free energy estimation span industry and scientific research, including drug discovery Shirts, Mobley, and Brown (2010), condensed matter physics Auer and Frenkel (2001), materials science Damasceno, Engel, and Glotzer (2012), structural biology Curk et al. (2018), and the effects of mutagenesis Hauser et al. (2018). Because of its importance and wide ranging applications, computerbased free energy estimation has been an active field of research for decades Chipot and Pohorille (2007).
At the core of many stateoftheart estimators Shirts and Chodera (2008) lies the free energy perturbation (FEP) identity introduced by Zwanzig Zwanzig (1954) in 1954:
(1) 
Here is the Helmholtz free energy difference between two thermodynamic states and , each connected to a thermal reservoir at inverse temperature and specified by a setting of external parameters and . We denote as a point in the system’s configuration space, and define
(2) 
with and the energy functions associated with and . By we denote an expectation under equilibrium distribution (and similarly for ). A procedure for computation of via Eq. (1) is then as follows: A number of samples are first drawn from , for example via a Monte Carlo chain or Molecular Dynamics simulation. The change in energy, , associated with instantaneously switching is then computed, from which an exponential average is taken. From a statistical point of view, FEP is an application of Importance Sampling (IS), where serves as the proposal distributionNeal (2001); Arnaud, de Freitas, and Gordon (2001); Liu (2008).
While Eq. (1) is exact, the convergence of this estimator for a finite number of samples strongly depends on the degree to which and overlap in configuration space Pohorille, Jarzynski, and Chipot (2010). Indeed, the dominant contributions to the above expectation will come from samples of that are typical under , and such contributions become increasingly rare with decreasing overlap Jarzynski (2006).
There exist multiple strategies for mitigating the overlap requirement. Arguably the most common strategy is a multistaged approach, also known as stratification, in which a sequence of intermediate thermodynamic states is defined between and (Fig. 1a). Here the increased convergence is facilitated by demanding that neighboring pairs of states be chosen to contain sufficient overlap. The quantity of interest, , is then recovered as a sum over the pairwise differences Chipot and Pohorille (2007). The Multistage Bennett Acceptance Ratio Shirts and Chodera (2008) (MBAR) estimator is a prominent example of an estimator that follows this strategy. However, multistaged approaches require samples from multiple states as well as a suitable order parameter to define intermediate stages. Furthermore, it is unclear a priori how best to discretize the order parameter (for example how many stages to use).
An alternative, elegant strategy to increasing overlap is by incorporating configuration space maps. Jarzynski developed Targeted Free Energy Perturbation Jarzynski (2002) (TFEP), a generalization of FEP whereby an invertible mapping defined on configuration space transports points sampled from to a new distribution, (see Fig. 1b). Jarzynski showed that a generalized FEP identity can be applied to this process (Eq. (7) below) from which the free energy difference can be recovered. Importantly, if the mapping is chosen wisely, an effective overlap can be increased, leading to quicker convergence of the TFEP estimator. Hahn and Then extended TFEP to the bidirectional setting Hahn and Then (2009), whereby the mapping and its inverse are applied to samples from and , respectively. Lowererror free energy estimates can then be obtained via the statisticallyoptimal BAR estimator Bennett (1976) (Eq. (11) below).
Whether in the unidirectional or bidirectional case, the main challenge for targeted approaches is crafting a mapping that is capable of increasing overlap. Unfortunately for most realworld problems the physical intuition needed to develop such a technique is simply lacking. Modernday machine learning (ML) techniques, however, seem perfectly suited for this task.
Since the introduction of TFEP in 2002, research in ML has made remarkable progress in fields of image classification Krizhevsky, Sutskever, and Hinton (2012); He et al. (2016), playing video Mnih et al. (2015) and board games Silver et al. (2016, 2018), and generative modelling of images Brock, Donahue, and Simonyan (2019); Karras, Laine, and Aila (2019). ML has also enabled advances in the natural sciences, including state of the art protein structure prediction Senior et al. (2020), neuralnetwork based force fields for water Morawietz et al. (2016), generative modelling of lattice field theories Albergo, Kanwar, and Shanahan (2019), new paradigms for sampling equilibrium distributions of molecules, and variational free energy estimates Wu, Wang, and Zhang (2019); Noé et al. (2019).
In this work, we turn targeted free energy estimation into a learning problem. In lieu of a handcrafted mapping, we represent our mapping by a deep neural network whose parameters are optimized so as to maximize overlap. Once trained, the free energy can then be computed by evaluating the targeted estimator with our learned mapping. Below we will consider both unidirectional and bidirectional settings, and will refer to them as Learned Free Energy Perturbation (LFEP) and Learned Bennett Acceptance Ratio (LBAR), respectively.
The rest of our manuscript is structured as follows. In §II below we summarize the previouslydeveloped targeted free energy estimators that we will be making use of. This is followed by development of suitable training objectives to maximize overlap (§III). We then demonstrate our method by applying it to a solvation system. In §IV, we describe the experimental setup and discuss inherent symmetries that are exploited to devise a model with the correct inductive biases (§V). Finally, we present experimental results in §VI and discuss our findings in §VII.
Ii Targeted estimators
In the following we will refer to and as the thermodynamic states, defined by equilibrium densities and , where and are the normalization constants (partition functions). In the targeted scheme, configurations are drawn from and mapped to new configurations via an invertible, userspecified mapping . The set of mapped configurations can be thought of as samples from a new state :
(3) 
Similarly, we also consider the reverse case where configurations are drawn from and mapped to via the inverse
(4) 
We refer to this pair of prescriptions as the “forward” and “reverse” process, respectively. For each process we denote generalized energy differences as
(5)  
(6) 
where and are the Jacobian determinants associated with the mappings. As originally shown by Jarzynski, an identity exists which relates to an ensemble of realizations of :
(7) 
Eq. (7) can be regarded as a generalization of FEP, as it holds for any invertible , and reduces to Eq. (1) if is the identity. An analogous equation holds for the reverse process. Derivations of Eq. (7) can be found in Refs. Jarzynski, 2002; Hahn and Then, 2009 or Appendix A.
Hahn and Then extended the above result to the bidirectional case Hahn and Then (2009), showing a fluctuation theorem (FT) exists between the forward and reverse processes:
(8) 
The functions
(9) 
and
(10) 
can be thought of as generalized work distributions associated with the mapping processes. With these bidirectional estimates, Bennett’s Acceptance Ratio (BAR) method Bennett (1976) can be employed as an alternative estimator of Hahn and Then (2009). BAR estimation of can be formulated as a selfconsistent iteration of the equation
(11) 
where is the Fermi function. For simplicity in Eq. (11) we restrict ourselves to the case where the number of samples in the forward and reverse directions are equal, but more general formulations exist. BAR has a statistical advantage over FEP as it has been shown to be the minimum variance free energy estimator for any asymptoticallyunbiased method Shirts et al. (2003). Because of this property, BAR is generally the method of choice when samples from both and are available.
Crucially, the targeted estimators above hold for every invertible mapping. That is, given an infinite number of samples, any invertible choice of will produce a consistent estimate of . Of course, the finitesample convergence properties are of more practical importance and will strongly depend on the choice of .
Iii Training objective
In a distributional sense, the forward and reverse processes act to transform and into and , as depicted in Fig 1. In what follows we will refer to the distribution of mapped configurations as the “images” (i.e. and ), and the distributions we want them mapped towards as the “targets” (i.e. () for the forward (reverse) process). Due to the deterministic mapping, the bases and images are related by the change of variable formula,
(12)  
(13) 
The crucial consideration for convergence of our estimators (Eq. (7) and Eq. (11)) is the overlap between the image and target distributions Jarzynski (2002). Indeed, in the limit that images and targets coincide, Jarzynski showed Jarzynski (2002) that . This implies that the convergence of Eq. (7) is immediate (i.e. only one sample is needed). In this limit, it is also the case that implying that the expectation values on either side of Eq. (11) converge immediately. This overlap argument is further reinforced in the Appendix A, where we show that the TFEP estimator can be interpreted as a FEP estimator between and .
We now turn our attention to the construction of a loss function that accurately judges the quality of
. Guided by the considerations of overlap, we consider the Kullback–Leibler (KL) divergence between the image and target. For the forward process we have:(14) 
In the above derivation, we invoked a change of variable formula in going from the first to third lines, and used the identity to get to the last. An analogous equation can be derived for the reverse process yielding
(15) 
While from Eqs. (14–15) it is clear that the KL cannot be accurately estimated unless is known, in terms of optimizing, and can be disregarded as they are constants.
Below we consider two separate training regimes for our model. In the unidirectional case, the model was trained only on the forward process, with a loss function
(16) 
In the bidirectional case, the model was trained using both forward and reverse processes with the loss
(17) 
When samples from both states are available, bidirectional training is preferable. Unlike the unidirectional loss, explicitly encourages both and to be masscovering (Minka, 2005), which is important for good performance of importancesampling estimators Neal (2005).
Iv Experimental setup
To test our method, we consider a system similar to the one used by Jarzynski Jarzynski (2002) consisting of a repulsive solute immersed in a bath of identical solvent particles. The task is to calculate the free energy change associated with growing the solute radius from to radius (see Fig. 2). In contrast to a hardsphere solute as used in Ref. Jarzynski, 2002, we modeled our solute as a soft sphere. This is because any finite particle overlap would lead to infinite forces, which our training method cannot handle.
Intuitively, an effective mapping should push solvent particles away from the center to avoid highenergy steric clashes with the expanding solute. Jarzynski followed this intuition, defining a mapping that uniformly compresses the solvent particles amidst an expanding repulsive solute. Although this mapping gave a significant convergence gains when applied to a hard solute Jarzynski (2002), it is not directly applicable to soft solutes. This is because the phase space compression results in a transformed density whose support is not equal to that of the target density, violating the assumption of invertibility.
Below we demonstrate that we no longer need to rely on physical intuition to handcraft a tractable mapping; this process can be fully automated using the general framework proposed in this work. In order to learn an effective mapping, however, it is crucial that the model be compatible with the inherent symmetries of the underlying physical system.
iv.1 Energy
Periodic boundary conditions (PBCs) confine the particle system to a dimensional torus,
, where each coordinate of the position vector
of particle lives in the interval (see Fig. 2). The total energy can then be decomposed into a sum of pairwise contributions according to(18) 
where subscript denotes the state and . The quantity is a difference vector whose components correspond to the shortest, signed distance in the respective dimension, and the function round is applied elementwise giving the nearest integral number. The radially symmetric functions and represent the LennardJones Jones and Chapman (1924) (LJ) and Weeks–Chandler–Andersen Weeks and Chandler (1971) (WCA) potentials. In practice, we truncate LJ interactions using a radial cutoff of and shift the potential to be zero at the cutoff Frenkel and Smit (2002).
iv.2 Training data
The data used for training and evaluation was generated via molecular dynamics (MD) simulations using the package LAMMPS Plimpton (1995). Simulation frames were generated via Langevin dynamics and were saved infrequently enough to ensure decorrelated samples (as judged by the potential energy decorrelation time), giving a total of frames per simulation. A total of independent simulation trajectories were generated for each thermodynamic state, starting from randomlyintialized configurations and momenta. Ten independent training and evaluation datasets were then constructed from these simulations by first concatenating and shuffling configurations, and then partitioning them into training and test samples for each dataset. The value of was chosen such that the baseline BAR estimator, when evaluated on data points, yields a reasonably accurate free energy difference. Below we train and evaluate our model on these datasets so as to compute statistical convergence properties of the estimators across independent runs. Further simulation details are summarized in Appendix §B.
iv.3 Symmetries
The system under consideration exhibits properties that are widely encountered in the atomistic simulation community: PBCs and permutation invariance. PBCs are usually employed to reduce finitesize effects. Permutation invariance arises as a consequence of the energy being invariant to particle permutations—a condition that is satisfied by the solvent particles as they are all identical.
PBCs are a choice of geometry for the space in which particles live: a 3D torus. Without them, the system (including the solute) would admit rigid rotation, reflection and rigid translation symmetries. With them, only the translation symmetries remain, and a discrete set of rotations/reflections. The original rigid translations in become translations by elements of the torus , leaving the energy invariant. Because of this symmetry, we can fix the solute at the origin of the simulation box without affecting the ratio (as shown in Fig. 2). The remaining set of discrete rotations/reflections is the group of symmetries of a cube (the octahedral group), which contains only elements.
Our choice of model architecture respects PBCs and permutation invariance, as we will discuss in the next section in detail. We also tested other architectures, and found that respecting PBCs and permutation invariance led to best performance (data not shown). Our chosen model architecture does not obey the octahedral symmetries, but we do not expect this to have a big impact because of the octahedral group’s relatively small size compared to (infinite) and to the group of particle permutations (of size ).
V Model
We implement the mapping using a deep neural network, parameterized by a set of learnable parameters . In designing the architecture of the network, we take into account the following considerations.

[label = (), leftmargin=*]

The mapping must be bijective, and the inverse mapping should be efficient to compute, for any setting of the parameters .

The Jacobian determinant should be efficient to compute for any setting of .

The network should be flexible enough to represent complex mappings.

The transformed distributions and should respect the boundary conditions and symmetries of the physical system.
The first three requirements are satisfied by a class of deep neural networks known as normalizing flows Papamakarios et al. (2019), which are invertible networks with efficient Jacobian determinants. Since bijectivity is a closed property under function composition, multiple normalizing flows (or “layers”) can be composed into a deeper flow, yielding a model with increased flexibility. We implement as a normalizing flow composed of invertible layers, that is,
(19) 
Each layer is of the same architecture but it has its own learnable parameters , and the learnable parameters of are simply .
Our implementation of is based on the architecture proposed by Dinh et al. Dinh, SohlDickstein, and Bengio (2017), which is often referred to as the coupling layer. Let , , be the spatial coordinates of the particle with position vector . To simplify notation, we will also use to denote the inputs to layer (that is, the particles transformed by ), and drop the dependence on . Let be a subset of the indices associated with layer . Then, is defined as follows:
(20) 
where
(21) 
By we refer to the set of coordinates of that are indexed by , and is the output of for coordinate of particle . The functions and are implemented by neural networks. The parameters of associated with coordinate of particle are denoted by , and are computed by . The parameters of are the learnable parameters of the layer, .
In simple terms, works as follows. We partition the coordinates into two sets; the coordinates indexed by are left invariant, whereas the remaining coordinates are transformed elementwise. Each transformed coordinate undergoes a different transformation depending on the value of , which itself is a function of all the coordinates that remain invariant. To ensure that each coordinate gets the opportunity to be transformed as a function of every other coordinate, each layer uses a different partition , and we cycle through the partitions , , , , and across layers.
For the mapping to be bijective, a sufficient condition is that be strictly increasing for any setting of . In that case, the inverse is obtained by simply replacing with in Eq. (20), and the Jacobian determinants can be computed efficiently as follows:
(22) 
Finally, the inverse and Jacobian determinant of the composite mapping can be computed by
(23)  
(24) 
To ensure that the transformed distributions and obey the required boundary conditions, the implementation of must reflect the fact that and are identified as the same point. For this to be the case, a sufficient set of conditions is the following:
(25)  
(26) 
for any setting of the parameters . To satisfy the above conditions, we implement using circular splines, which were recently proposed by Rezende et al. Jimenez Rezende et al. (2020) and are based on the rationalquadratic spline flows of Durkan et al. Durkan et al. (2019). Our implementation of the circular splines is detailed in Appendix §C.
In addition to the above, we also need to make sure that the parameters in Eq. (21) are periodic functions of the invariant coordinates . This can be easily achieved by the following feature transformation:
(27) 
where and act elementwise. The above feature transformation is injective, so no information about is lost. We apply this feature transformation to each particle at the input layer of network .
Finally, to ensure that the transformed distributions and are invariant to particle permutations, it is necessary that the Jacobian determinant also be invariant to particle permutations. In our architecture, this can be achieved by taking to be equivariant to particle permutations. Specifically, let be a permutation of the set . We say that is equivariant with respect to if
(28) 
that is, if permuting the particles has the effect of permuting the parameter outputs in exactly the same way. From Eq. (22), we can easily see that the above property implies that leaves , and hence , invariant, because we sum over all particles and the sum is permutation invariant. Previous studies have made similar observations Bender et al. (2020); Köhler, Klein, and Noé (2019). Our implementation of is based on the architecture proposed by Vaswani et al. Vaswani et al. (2017), often referred to as the transformer, which we use in a permutationequivariant configuration. The implementation details of our transformer architecture are in Appendix §C.
Vi Results
In this section, we evaluate the performance of our method for the solvation system illustrated in Fig. 2. We focus on the bidirectional BAR and LBAR estimators in the main text, due to their advantages over unidirectional approaches as discussed in §II. We refer to Appendix §D for a discussion of the unidirectional counterparts.
To capture statistical variation, our training and analysis procedure was performed times, each using independent training and evaluation datasets. Our loss profiles and free energy estimates below report averages as well as statistical variation across these runs.
We first report training results in Fig. 3, where the fullbatch loss is plotted as a function of the number of training steps. We observe a pattern commonly encountered in ML: after an initial decrease of both training and test loss, the latter develops a minimum. At around the minimum, the model stops generalizing and starts to overfit to the training data. We therefore employ a technique called early stopping Caruana, Lawrence, and Giles (2001) and use the model parameters corresponding to the minimum test loss for all further evaluations. It is worth emphasizing, however, that the precise location of the minimum does not have an appreciable effect on the quality of the free energy estimates reported for the bidirectional estimator (results not presented). We also note the small variation among the independent runs, suggesting that there is no significant dependence of the performance on a particular dataset.
Next, we probe the overlap resulting from our learned mapping. In Fig. 4a we plot the distributions of learned work values for the forward and reverse directions compared against their unmapped counterparts for which the mapping is the identity. These distributions are typically unimodal Chipot and Pohorille (2007) and satisfy the inequalities for any invertible mapping. The free energy difference corresponds precisely to the point of intersection of the forward and reverse distributions. Both of these facts are a consequence of the fluctuation theorem [Eq. (8)]. Intuitively, it is clear then that an effective mapping should increase the overlap between the distributions to facilitate locating the intersection. This is precisely the enhancement we observe for the mapped work values. The two modes are shifted towards each other and share a significantly larger overlap than the unmapped distributions. We also see that the mapping strongly reduces the variance of the distributions. In fact, with a perfect mapping we would expect the forward and reverse distributions to collapse onto a single delta distribution located at .
We now turn to the statistical convergence of the free energy estimates. In Figure 4b we plot a running average of the estimate as a function of the number of evaluated samples per stage. The solid lines report averages of the estimate over the independent runs, and shaded regions represent one standard deviation of the runs. To test the correctness of our implementation, we first compare our estimates against a converged multistage MBAR estimator which employed stages and more samples in total. From the figure we see that the variation of our estimate overlaps nicely with the MBAR error estimates. When comparing to the baseline BAR estimator, we see in Fig. 4b clear variance reduction of LBAR across a wide range of evaluated sample sizes. The fullbatch LBAR standard deviation we observe is approximately of that reported by BAR. Moreover, training and evaluation of the model occurred on the same dataset, demonstrating that an effective mapping can be learned in a dataefficient manner. This is an important practical consideration but not at all obvious a priori. We could, in principle, even combine samples from the training and test datasets for estimation of but have used the test set only to detect overfitting.
Vii Discussion
In this work, we turn TFEP into a learning problem by combining it with stateoftheart ML techniques. TFEP previously required handcrafting a tractable mapping on configurationspace, a significant challenge for many realistic systems. We proposed to represent the mapping by a suitable neural network, and identified training objectives for unidirectional and bidirectional cases. We then tested their performance on a fully periodic system. This type of boundary condition is routinely used in atomistic simulations but poses a challenge for ML models, as it requires careful design of the flow architecture. Our experimental results suggest that both LFEP and LBAR estimators lead to a significant variance reduction compared to their respective baselines. Furthermore, we found that the LBAR method is even competitive with a multistaged MBAR estimator while requiring only a small fraction of the data. Our results therefore clearly highlight the potential of this approach and suggest that ML could eliminate the need for staging altogether.
Improving TFEP via learned mappings relates to the general idea of improving importance sampling by learning the proposal distribution, which has been explored substantially in machine learning and statistics. For instance, recent works in machine learning have proposed training a flexible deeplearning model of the proposal distribution to improve importance sampling
Müller et al. (2019) or more sophisticated variants such as bridge sampling Papamakarios and Murray (2015) and sequential Monte Carlo Gu, Ghahramani, and Turner (2015); Paige and Wood (2016); Le, Baydin, and Wood (2017). In turn, these approaches can be traced back to methods for adaptive importance sampling Cappé et al. (2008) and adaptive sequential Monte Carlo Cornebise, Moulines, and Olsson (2008) in statistics. One recent instance of these approaches that relates closely to our work is Neural Importance Sampling Müller et al. (2019), which uses expressive normalizing flows to learn good proposal distributions for importance sampling. Many of the above works have noted that the choice of loss function is important, with the forward KL divergence and chisquared divergence being standard choices. These observations are in line with our observations of the differences between the unidirectional and bidirectional training losses.Finally, we note that the ideas presented in this work can also be combined with Boltzmann Generators Noé et al. (2019). Since our estimators are asymptotically unbiased, they could yield much more accurate free energy estimates than the variational approximations reported in Ref. Noé et al., 2019. Merging both ideas would be an interesting extension for future work.
Acknowledgements.
We would like to thank our colleagues Shakir Mohamed, Michael Figurnov, Ulrich Paquet, James Kirkpatrick, Daan Wierstra, Craig Donner, John Jumper, Amber NicklinClark and Steph HughesFitt for their help and for stimulating discussions.Appendix A Alternate derivation of TFEP and interpretation
In this section we derive and interpret the unidirectional TFEP estimator as a multistaged FEP estimator. This interpretation allows us to reason about TFEP using intuition from FEP.
Given explicit densities for , and , we can formally decompose into a sum over two terms,
(29) 
which can each be computed separately using the FEP estimator Eq. (1). We next define the energy of as
(30a)  
(30b) 
such that , and where we have used Eq. (12) in going from the first to second line. Conveniently, one of these stages comes for free, as :
(31) 
In going from the first to second line we have used Eq. (30a). Combining Eqs. (29–31), as well as the FEP estimator Eq. (1), we arrive at our final result:
(32) 
where
(33) 
is the energy difference between and . Although Eq. (32) is an explicit estimate between and , it is just a reformulation of the TFEP estimator [Eq. (7) above] as can be seen by the equivalence of and [compare Eqs. (5) and (33)]. This novel interpretation of TFEP allows us to apply the intuition on convergence we have built for FEP. Specifically, we can accelerate convergence if shares large overlap with .
Appendix B System
To generate the training data, we performed molecular dynamics (MD) simulations of the system illustrated in Fig. 2 using the simulation package LAMMPS Plimpton (1995). The system is similar to the one studied in Ref. Jarzynski, 2002 but we replaced hard solutesolvent interactions by a Weeks–Chandler–Andersen Weeks and Chandler (1971) (WCA) potential. Below, we represent our energy, length, and mass units in terms of the LJ well depth , the LJ diameter , and the solvent particle mass Frenkel and Smit (2002). From this our unit of time is defined as . Quantities expressed in these reduced units are denoted with an asterisk. We used a cubic simulation box with edge length , employed cutoff radii of and for LJ and WCA interactions, where labels the state. The solute radii were taken to be and . Both LJ and WCA potentials shared the same value for , and we set .
To simulate a specific state, we first assigned random positions to all solvent particles. The solute was placed at the origin of the box and kept stationary throughout the simulation. After performing energy minimization, the system was equilibrated in the canonical ensemble for a period of . The equations of motion were integrated using the velocity Verlet algorithm Swope et al. (1982) with a timestep . We employed a Langevin thermostat with a relaxation time of to keep the system at a temperature of . To prevent drift of the center of mass motion, we set the total random force to zero. During the long production run, configurations were sampled every reduced time units, yielding a total of samples. For each state, such simulations were generated starting from random initializations and random number seeds. The resulting samples were then partitioned as described in the main text.
We followed the same protocol to generate samples for MBAR. In addition to the two states corresponding to and , we considered intermediate states with a constant radial increment . Using two different random seeds, we obtained samples for each state and evaluated all energy functions on each sample. MBAR estimates of were then computed from this combined energy matrix using the package pymbar Shirts and Chodera (2008).
Appendix C Model implementation details
We implement using circular splines Jimenez Rezende et al. (2020) so that satisfies the boundary conditions in Eqs. (25–26). Briefly, is a piecewise function that consists of segments. The parameters are a set of triplets , where defines the domain of segment , defines its image, and
define its slopes at the endpoints (all slopes are required to be positive). Each segment is a strictly increasing rationalquadratic function, constructed using the interpolation method of Gregory and Delbourgo
Gregory and Delbourgo (1982). The conditions in Eqs. (25–26) are satisfied by setting , and . We can increase the flexibility of by increasing the number of segments . With a large enough , circular splines can approximate arbitrarily well any strictly increasing function from to itself that satisfies Eqs. (25–26).We implement using the transformer architecture of Vaswani et al. Vaswani et al. (2017) so that satisfies the permutationequivariance condition in Eq. (28). Briefly, the network architecture is as follows. Each input undergoes the feature transformation in Eq. (27), and is then mapped to a fixedsized vector via an affine transformation identically for each . Then, is processed by a sequence of transformer blocks, each of which is composed of two residual layers He et al. (2016). The first residual layer is
(34) 
where is a multiheaded selfattention layer as described by Vaswani et al. Vaswani et al. (2017), and is its th output. The second residual layer is
(35) 
where is a standard
multilayer perceptron
Goodfellow, Bengio, and Courville (2016) that is applied identically for each . After repeating the transformer block a specified number of times (each time with different learnable parameters), each vector is mapped to the output via an affine transformation identically for each . To help with generalization and stability, we applied layer normalization Ba, Kiros, and Hinton (2016) to the inputs of and . Because is permutationequivariant and every and affine transformation is applied identically for each , it follows that is permutationequivariant, and therefore the transformed distribution is permutationinvariant as desired.All models were trained using the Adam optimizer Kingma and Ba (2015)
. A summary of the hyperparameters is provided in Table
1.Transformer  
Number of blocks  4 
Number of heads  4 
Value dimension  16 
Key dimension  16 
Embedding dimension  16 
Circular spline flow  
Number of segments ()  16 
Adam optimizer  
Batch size  256 
Learning rate  
0.9  
0.999 
Appendix D Results for LFEP
In this section, we discuss the free energy estimates obtained with LFEP, using unidirectional training with the loss [Eq. (16)] as a proxy for [Eq. (14)]. To estimate the KL, we replaced in Eq. (14) with the LFEP estimate of that quantity. Figure 5a shows the evolution of both quantities during training. The results feature an interesting behaviour in the initial training regime. While test and training loss are still decreasing, the KL already exhibits a pronounced minimum due to a drift of the estimate towards lower values. This is further illustrated in Fig. 5b, which compares the convergence of for three different mappings corresponding to specific training steps (colored symbols in Fig. 5a). We see that the variance is reduced significantly in all three cases. However, only the first mapping (training step ) agrees with the MBAR baseline, while the other two mappings exhibit a noticeable bias.
This is problematic in that the minimum of the KL would be a natural point to stop training but does not yield the lowest bias. This is also in contrast to our findings for the bidirectional case, where the quality of the mapping was fairly insensitive to the stopping point. One possible explanation for these observations is the wellknown zeroforcing property Minka (2005) of . That is, does not encourage the transformed distribution to be masscovering, which is known to negatively impact the performance of importancesampling estimators Neal (2005).
References
 Shirts, Mobley, and Brown (2010) M. R. Shirts, D. L. Mobley, and S. P. Brown, “Freeenergy calculations in structurebased drug design,” in Drug Design: Structure and LigandBased Approaches, edited by K. M. Merz, Jr, D. Ringe, and C. H. Reynolds (Cambridge University Press, 2010) p. 61–86.
 Auer and Frenkel (2001) S. Auer and D. Frenkel, “Prediction of absolute crystalnucleation rate in hardsphere colloids,” Nature 409, 1020–1023 (2001).
 Damasceno, Engel, and Glotzer (2012) P. F. Damasceno, M. Engel, and S. C. Glotzer, “Predictive selfassembly of polyhedra into complex structures,” Science 337, 453–457 (2012).
 Curk et al. (2018) T. Curk, P. Wirnsberger, J. Dobnikar, D. Frenkel, and A. Šarić, “Controlling cargo trafficking in multicomponent membranes,” Nano Lett. 18, 5350–5356 (2018).
 Hauser et al. (2018) K. Hauser, C. Negron, S. K. Albanese, S. Ray, T. Steinbrecher, R. Abel, J. D. Chodera, and L. Wang, “Predicting resistance of clinical Abl mutations to targeted kinase inhibitors using alchemical freeenergy calculations,” Commun. Biol. 1, 70 (2018).
 Chipot and Pohorille (2007) C. Chipot and A. Pohorille, eds., Free Energy Calculations: Theory and Applications in Chemistry and Biology, Springer Series in Chemical Physics, Vol. 86 (SpringerVerlag, Berlin, 2007).
 Shirts and Chodera (2008) M. R. Shirts and J. D. Chodera, “Statistically optimal analysis of samples from multiple equilibrium states,” J. Chem. Phys. 129, 124105 (2008).
 Zwanzig (1954) R. W. Zwanzig, “High‐temperature equation of state by a perturbation method. I. Nonpolar gases,” J. Chem. Phys. 22, 1420 (1954).
 Neal (2001) R. M. Neal, “Annealed importance sampling,” Stat. Comput. 11, 1573–1375 (2001).
 Arnaud, de Freitas, and Gordon (2001) D. Arnaud, N. de Freitas, and N. Gordon, Sequential Monte Carlo methods in practice, Information Science and Statistics (Springer New York, 2001).
 Liu (2008) J. S. Liu, Monte Carlo strategies in scientific computing (Springer Science & Business Media, 2008).
 Pohorille, Jarzynski, and Chipot (2010) A. Pohorille, C. Jarzynski, and C. Chipot, “Good practices in freeenergy calculations,” J. Phys. Chem. B 114, 10235–10253 (2010).
 Jarzynski (2006) C. Jarzynski, “Rare events and the convergence of exponentially averaged work values,” Phys. Rev. E 73, 046105 (2006).
 Jarzynski (2002) C. Jarzynski, “Targeted free energy perturbation,” Phys. Rev. E 65, 046122 (2002).
 Hahn and Then (2009) A. M. Hahn and H. Then, “Using bijective maps to improve freeenergy estimates,” Phys. Rev. E 79, 011113 (2009).
 Bennett (1976) C. H. Bennett, “Efficient estimation of free energy differences from Monte Carlo data,” J. Comp. Phys. 22, 245–268 (1976).

Krizhevsky, Sutskever, and Hinton (2012)
A. Krizhevsky, I. Sutskever, and G. E. Hinton,
“ImageNet classification with deep convolutional neural networks,”
Advances in Neural Information Processing Systems (2012).  He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE Conference on Computer Vision and Pattern Recognition (2016).

Mnih et al. (2015)
V. Mnih, K. Kavukcuoglu,
D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen,
C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis,
“Humanlevel control through deep reinforcement learning,”
Nature 518, 529–533 (2015).  Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature 529, 484–489 (2016).
 Silver et al. (2018) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “A general reinforcement learning algorithm that masters chess, shogi, and Go through selfplay,” Science 362, 1140–1144 (2018).
 Brock, Donahue, and Simonyan (2019) A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” International Conference on Learning Representations (2019).
 Karras, Laine, and Aila (2019) T. Karras, S. Laine, and T. Aila, “A stylebased generator architecture for generative adversarial networks,” IEEE Conference on Computer Vision and Pattern Recognition (2019).
 Senior et al. (2020) A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D. T. Jones, D. Silver, K. Kavukcuoglu, and D. Hassabis, “Improved protein structure prediction using potentials from deep learning,” Nature 577, 706–710 (2020).
 Morawietz et al. (2016) T. Morawietz, A. Singraber, C. Dellago, and J. Behler, “How van der Waals interactions determine the unique properties of water,” Proc. Natl. Acad. Sci. U.S.A. 113, 8368–8373 (2016).

Albergo, Kanwar, and Shanahan (2019)
M. S. Albergo, G. Kanwar, and P. E. Shanahan,
“Flowbased generative models for Markov chain Monte Carlo in lattice field theory,”
Phys. Rev. D 100, 034515 (2019).  Wu, Wang, and Zhang (2019) D. Wu, L. Wang, and P. Zhang, “Solving statistical mechanics using variational autoregressive networks,” Phys. Rev. Lett. 122, 080602 (2019).
 Noé et al. (2019) F. Noé, S. Olsson, J. Köhler, and H. Wu, “Boltzmann generators: Sampling equilibrium states of manybody systems with deep learning,” Science 365, (2019).
 Shirts et al. (2003) M. R. Shirts, E. Bair, G. Hooker, and V. S. Pande, “Equilibrium free energies from nonequilibrium measurements using maximumlikelihood methods,” Phys. Rev. Lett. 91, 140601 (2003).
 Minka (2005) T. Minka, “Divergence measures and message passing,” Tech. Rep. MSRTR2005173 (Microsoft Research Cambridge, 2005).
 Neal (2005) R. M. Neal, “Estimating ratios of normalizing constants using linked importance sampling,” Tech. Rep. 0511 (Department of Statistics, University of Toronto, 2005).
 Jones and Chapman (1924) J. E. Jones and S. Chapman, “On the determination of molecular fields. –II. from the equation of state of a gas,” Proc. R. Soc. Lond. A 106, 463–477 (1924).
 Weeks and Chandler (1971) J. D. Weeks and D. Chandler, “Role of repulsive forces in determining the equilibrium structure of simple liquids,” J. Chem. Phys. 54, 5237 (1971).
 Frenkel and Smit (2002) D. Frenkel and B. Smit, Understanding Molecular Simulation, 2nd ed. (Academic Press, San Diego, 2002).
 Plimpton (1995) S. Plimpton, “Fast parallel algorithms for shortrange molecular dynamics,” J. Comp. Phys. 117, 1–19 (1995).
 Papamakarios et al. (2019) G. Papamakarios, E. Nalisnick, D. Jimenez Rezende, S. Mohamed, and B. Lakshminarayanan, “Normalizing flows for probabilistic modeling and inference,” arXiv preprint arXiv:1912.02762 (2019).
 Dinh, SohlDickstein, and Bengio (2017) L. Dinh, J. SohlDickstein, and S. Bengio, “Density estimation using Real NVP,” International Conference on Learning Representations (2017).
 Jimenez Rezende et al. (2020) D. Jimenez Rezende, G. Papamakarios, S. Racanière, M. S. Albergo, G. Kanwar, P. E. Shanahan, and K. Cranmer, “Normalizing flows on tori and spheres,” arXiv preprint arXiv:2002.02428 (2020).
 Durkan et al. (2019) C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, “Neural spline flows,” Advances in Neural Information Processing Systems (2019).
 Bender et al. (2020) C. Bender, K. O’Connor, Y. Li, J. J. Garcia, M. Zaheer, and J. Oliva, “Exchangeable generative models with flow scans,” AAAI Conference on Artificial Intelligence (2020).
 Köhler, Klein, and Noé (2019) J. Köhler, L. Klein, and F. Noé, “Equivariant flows: sampling configurations for multibody systems with symmetric energies,” arXiv preprint arXiv:1910.00753 (2019).
 Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (2017).

Caruana, Lawrence, and Giles (2001)
R. Caruana, S. Lawrence, and L. Giles,
“Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,”
Advances in Neural Information Processing Systems (2001).  Müller et al. (2019) T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák, “Neural importance sampling,” ACM Trans. Graph. 38, (2019).
 Papamakarios and Murray (2015) G. Papamakarios and I. Murray, “Distilling intractable generative models,” Probabilistic Integration Workshop at Advances in Neural Information Processing Systems (2015).
 Gu, Ghahramani, and Turner (2015) S. Gu, Z. Ghahramani, and R. E. Turner, “Neural adaptive sequential Monte Carlo,” Advances in Neural Information Processing Systems (2015).
 Paige and Wood (2016) B. Paige and F. Wood, “Inference networks for sequential Monte Carlo in graphical models,” International Conference on Machine Learning (2016).
 Le, Baydin, and Wood (2017) T. A. Le, A. G. Baydin, and F. Wood, “Inference compilation and universal probabilistic programming,” International Conference on Artificial Intelligence and Statistics (2017).
 Cappé et al. (2008) O. Cappé, R. Douc, A. Guillin, J.M. Marin, and C. P. Robert, “Adaptive importance sampling in general mixture classes,” Stat. Comput. 18, 447–459 (2008).
 Cornebise, Moulines, and Olsson (2008) J. Cornebise, É. Moulines, and J. Olsson, “Adaptive methods for sequential importance sampling with application to state space models,” Stat. Compt. 18, 461–480 (2008).
 Swope et al. (1982) W. C. Swope, H. C. Andersen, P. H. Berens, and K. R. Wilson, “A computer simulation method for the calculation of equilibrium constants for the formation of physical clusters of molecules: Application to small water clusters,” J. Chem. Phys. 76, 637 (1982).
 Gregory and Delbourgo (1982) J. A. Gregory and R. Delbourgo, “Piecewise rational quadratic interpolation to monotonic data,” IMA Journal of Numerical Analysis 2, 123–130 (1982).
 Goodfellow, Bengio, and Courville (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
 Ba, Kiros, and Hinton (2016) J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450 (2016).
 Kingma and Ba (2015) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations (2015).