A continuation of stochastic-modeling, examining Normalizing Flows and probability density approximation
Analyzing and interpreting time-dependent stochastic data requires accurate and robust density estimation. In this paper we extend the concept of normalizing flows to so-called temporal Normalizing Flows (tNFs) to estimate time dependent distributions, leveraging the full spatio-temporal information present in the dataset. Our approach is unsupervised, does not require an a-priori characteristic scale and can accurately estimate multi-scale distributions of vastly different length scales. We illustrate tNFs on sparse datasets of Brownian and chemotactic walkers, showing that the inclusion of temporal information enhances density estimation. Finally, we speculate how tNFs can be applied to fit and discover the continuous PDE underlying a stochastic process.
Density estimation from sparse time series data is ubiquitous in the interpretation of probabilistic or stochastic phenomena in quantitative science, e.g. in econometrics [zambom_review_2012], variational inference [rezende_variational_2015] and the biological sciences [manzo_review_2015]. In the latter application, single particle tracking (SPT) has become the method of choice to investigate the dynamics, structure and interaction of many molecules in a cellular context, allowing the observation of single molecule trafficking on the nanoscale throughout the cell. The obtained trajectories are typically interpreted as random walks and analyzed in terms of their mean squared displacement (MSD). This analysis provides insight into the underlying transport processes and has revealed their non-ergodic and anomalous diffusive properties [manzo_review_2015]. The main difficulty in SPT is linking the particles between frames to create a trajectory; particles cross, thus exchanging their identity, or stop fluorescing completely [manzo_review_2015]. Alternatively, there exists a rich mathematical literature studying trajectories in terms of walker densities [klafter_first_2011]. This perspective provides a way to extract transport properties from experimental data without the need to link the particles between frames. Key to this approach is accurately inferring the evolving particle density, particularly when data is sparse.
The classical approach to density estimation is binning. It provides an accurate density estimate when the sample size is large, but becomes sensitive to the location of the bins when data becomes sparse. This method is also subject to a bias-variance trade-off; small-scale features are not captured when using oversized bins, whereas undersized bins will lead to a very noisy estimate. Alternatively, one can use continuous methods such as the Kernel Density Estimate (KDE). In KDE, a kernel is placed on each particle and the mean over all kernels then gives an estimate of the density. The resulting estimate is highly sensitive to the width of the kernel and although several automated estimators exist [turlach_bandwidth_nodate, zambom_review_2012], choosing the right width is a non-trivial task.
While these techniques are firmly established, inferring the distribution of a random variable evolving through time remains challenging. Explicitly including the temporal axis suppresses natural variations in the estimate by exploiting temporal correlations; samples taken closely together in time are likely to be only slightly different. To our knowledge, KDE and binning cannot include temporal dynamics under the constraint of particle conservation. In this paper we propose a novel technique based on normalizing flows, which is capable of handling these constraints.
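The two classical estimators discussed above can be illustrated with a minimal NumPy sketch. It is not taken from the experiments in this paper: the sample size, bin count and the two bandwidths are hypothetical values chosen only to make the sensitivity to the kernel width visible.

```python
import numpy as np

def histogram_density(samples, n_bins, x_range):
    """Binned density estimate; returns bin centers and normalized densities."""
    counts, edges = np.histogram(samples, bins=n_bins, range=x_range, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts

def gaussian_kde(samples, x_eval, bandwidth):
    """Mean over Gaussian kernels of width `bandwidth`, one kernel per sample."""
    diff = (x_eval[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diff**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=50)   # sparse, hypothetical dataset
x_eval = np.linspace(-6.0, 6.0, 601)

centers, hist = histogram_density(samples, n_bins=20, x_range=(-6.0, 6.0))
kde_narrow = gaussian_kde(samples, x_eval, bandwidth=0.05)  # undersized kernel: noisy
kde_wide = gaussian_kde(samples, x_eval, bandwidth=2.0)     # oversized kernel: oversmoothed
```

Both KDE curves integrate to approximately one, yet they differ strongly in shape, which is the bandwidth sensitivity referred to in the text.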
Normalizing Flows (NFs) learn an arbitrarily complex probability distribution by applying a series of transformations to a known distribution in a latent space. NFs originated in the field of machine learning and were initially applied to infer posterior distributions in the context of variational inference [rezende_variational_2015]. They have been successfully applied as generative models (e.g. to generate novel faces [kingma_glow:_2018]) and many papers showcase their capability as density estimators for 2D, time independent toy problems [huang_neural_2018, chen_continuous-time_2017, dinh_density_2016]. NFs have several advantages as a density estimation technique: they are unsupervised, do not require an a-priori length scale such as the bin or kernel width, and can naturally accommodate several such length scales across a dataset, a notoriously hard problem. Here we extend NFs to include temporal dynamics and hence name our approach temporal Normalizing Flows (tNFs).
The rest of the paper is organized as follows. In Section 2, we introduce and implement tNFs. Section 3 presents the application of tNFs on a multi-scale toy problem and datasets of Brownian and chemotactic particles. In Section 4 we present some further perspectives of this approach, in particular its potential as a physics informed density estimator and its ability to perform accurate density estimations on finite domains.
Consider a set of samples $\{x_i\}_{i=1}^{N}$ taken from an unknown distribution $p(x)$. Given some model $p_\theta(x)$, we can estimate $p(x)$ by minimizing the negative log-likelihood of the model on the data,
$$\mathcal{L} = -\sum_{i=1}^{N} \log p_\theta(x_i). \tag{1}$$
To obtain an accurate estimate of $p(x)$, the model needs to be flexible enough. Normalizing flows [chen_neural_2018, noe_boltzmann_2019, rezende_variational_2015] allow the construction of an arbitrary model by applying an invertible transformation to a known probability density. Consider a random variable $z$ distributed by pdf $\pi(z)$. Given an invertible transformation $z = f(x)$, $x$ is then distributed by $p(x)$, which is given by,
$$p(x) = \pi(z)\left|\frac{\partial z}{\partial x}\right|. \tag{2}$$
Typically, $x$ is referred to as the real space and $z$ as the latent space and $\pi(z)$ is usually a Gaussian. Normalizing flows learn the real to latent space mapping $f$ and consequently the density $p(x)$ by minimizing the negative log likelihood,
$$\mathcal{L} = -\sum_{i=1}^{N}\left[\log \pi\big(f(x_i)\big) + \log\left|\frac{\partial f}{\partial x}\right|_{x = x_i}\right]. \tag{3}$$
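As a hedged illustration of the change-of-variables rule and the negative log-likelihood objective, the sketch below uses the simplest possible flow, an affine map with a standard normal latent density. The function names, the affine form and the sample parameters are assumptions made for this example; a practical flow would use a far more flexible transformation.

```python
import numpy as np

def log_standard_normal(z):
    """Log-density of the standard normal latent distribution."""
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

def affine_flow(x, shift, log_scale):
    """Invertible map z = (x - shift) * exp(-log_scale) and its log|dz/dx|."""
    z = (x - shift) * np.exp(-log_scale)
    log_jac = -log_scale * np.ones_like(x)
    return z, log_jac

def negative_log_likelihood(x, shift, log_scale):
    """Sum over samples of -[log pi(z) + log|dz/dx|]."""
    z, log_jac = affine_flow(x, shift, log_scale)
    return -np.sum(log_standard_normal(z) + log_jac)

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=200)   # hypothetical data

# For an affine flow the minimizer is known in closed form (sample mean and std).
shift_hat, log_scale_hat = x.mean(), np.log(x.std())
nll_fit = negative_log_likelihood(x, shift_hat, log_scale_hat)
nll_identity = negative_log_likelihood(x, 0.0, 0.0)   # untrained identity map
```

The fitted map yields a lower negative log-likelihood than the identity map, which is the quantity a gradient-based optimizer would drive down for a more expressive transformation.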
The normalizing flows as presented in the previous section cannot account for temporal dynamics. Nonetheless, our starting point for deriving the temporal NF is the N-dimensional equivalent of eq. 2,
$$p(\mathbf{x}) = \pi(\mathbf{z})\left|\det J\right|, \tag{4}$$
where $J$ is the Jacobian of $f$. As we cannot write a conservation relation for the temporal axis, i.e. $\int p(x,t)\,\mathrm{d}t \neq 1$, we cannot include it as an additional dimension in eq. 4, explaining why NFs cannot account for temporal dynamics. However, assume for now that such a construction is possible. The determinant of the Jacobian for a 1D temporally-varying distribution can then be written as,
$$\det J = \frac{\partial z}{\partial x}\frac{\partial \tau}{\partial t} - \frac{\partial z}{\partial t}\frac{\partial \tau}{\partial x}, \tag{5}$$
where $z$ is the latent spatial coordinate and $\tau$ the latent temporal coordinate. Note that both are dependent on $x$ and $t$, i.e. $z = z(x,t)$ and $\tau = \tau(x,t)$. As the transformation of $t$ is not allowed, the latent time $\tau$ must be equal to the real time $t$, so that $\partial\tau/\partial t = 1$ and $\partial\tau/\partial x = 0$, and the determinant of the Jacobian becomes,
$$\det J = \frac{\partial z}{\partial x}. \tag{6}$$
The 1D temporal Normalizing Flow can then be written as,
$$p(x,t) = \pi(z,t)\left|\frac{\partial z}{\partial x}\right|. \tag{7}$$
We show a graphical interpretation of this in figure 2. While the temporal axis is not stretched or compressed, all frames are coupled through the mapping $z(x,t)$. Using a single mapping for the whole dataset prevents overfitting and suppresses natural variations in the estimate, as we will show in the results.
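The conservation property of this construction can be checked numerically. The sketch below assumes a hypothetical, hand-written monotonic map (it is not a learned mapping from this paper) and a standard normal latent density. Because only the spatial coordinate is transformed, the resulting density integrates to one at every time.

```python
import numpy as np

def mapping(x, t):
    """Hypothetical monotonic map z(x, t); the time axis itself is left untouched."""
    return x + 0.5 * t * np.tanh(x)

def dz_dx(x, t):
    """Analytical derivative of the map with respect to x (positive for t >= 0)."""
    return 1.0 + 0.5 * t / np.cosh(x) ** 2

def density(x, t):
    """Real-space density: latent Gaussian evaluated at z, times |dz/dx|."""
    z = mapping(x, t)
    latent = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return latent * np.abs(dz_dx(x, t))

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
times = np.array([0.0, 1.0, 2.0])

# Total probability in each frame, approximated by a Riemann sum over x.
masses = np.array([np.sum(density(x, t)) * dx for t in times])
```

Each entry of `masses` is close to one, even though the shape of the density changes from frame to frame.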
The prime challenge of implementing NFs and tNFs in practice is finding a flexible yet invertible transformation. NFs are generally applied as generative models on high-dimensional data, requiring a computationally efficient method to evaluate the determinant of the Jacobian (see e.g. FFJORD [grathwohl_ffjord:_2018], Autoregressive flows [papamakarios_masked_2017] or GLOW [kingma_glow:_2018]). Spatio-temporal density estimation contains up to four dimensions, such that calculating the determinant of the Jacobian is not a computational constraint. This allows us to propose a relatively simple implementation.
Wehenkel et al. [wehenkel_unconstrained_2019] recently introduced a method for the construction of monotonic neural networks, independent of the networks' specific architecture. Building on the observation that a function is monotonic if its derivative is positive, they propose to constrain a neural network to positive outputs only and numerically integrate over the output to obtain a monotonic function. We slightly modify their approach and use an unconstrained feed-forward neural network to model the log Jacobian, naturally leading to a monotonic and hence invertible mapping $z(x,t)$. This leads to the following implementation for the tNF,
$$z(x,t) = \int_{x_0}^{x} \exp\big(g_\theta(x',t)\big)\,\mathrm{d}x' + c_\theta(t), \qquad \log\left|\frac{\partial z}{\partial x}\right| = g_\theta(x,t). \tag{8}$$
Here $g_\theta(x,t)$ is the network output modeling the log Jacobian, $x_0$ is the lower edge of the integration domain and $c_\theta(t)$ is a time dependent offset function. Both $g_\theta$ and $c_\theta$ are modeled by unconstrained neural networks with a tanh activation function (one contains 3 hidden layers of 30 neurons and the other contains 1 hidden layer of 100 neurons). In the remainder of this work we choose a time independent Gaussian as latent distribution, i.e. $\pi(z,t) = \pi(z)$. We perform the integration in eq. 8 over a regular grid rather than integrating over the particles' positions. This approach scales with the size of the grid, rather than with the number of particles, works well when data is sparse and scales to higher dimensions.
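A rough sketch of this construction in plain NumPy is given below. It is not the authors' implementation: the networks carry random, untrained weights, the names `log_jac_net` and `offset_net` are hypothetical, and the assignment of the two layer layouts (3 hidden layers of 30 neurons, 1 hidden layer of 100 neurons) to the two networks is an assumption. The point is only to show that exponentiating an unconstrained output and integrating it over a regular grid yields a monotonic map.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small tanh network (untrained stand-in for a model)."""
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, inputs):
    """Feed-forward pass with tanh hidden layers and an unconstrained linear output."""
    h = inputs
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b

log_jac_net = init_mlp([2, 30, 30, 30, 1])   # (x, t) -> log dz/dx  (assumed layout)
offset_net = init_mlp([1, 100, 1])           # t -> offset c(t)     (assumed layout)

def tnf_on_grid(x_grid, t):
    """Return z(x, t) and log dz/dx on a regular spatial grid at time t."""
    xt = np.stack([x_grid, np.full_like(x_grid, t)], axis=1)
    log_jac = mlp(log_jac_net, xt)[:, 0]
    jac = np.exp(log_jac)                    # positive by construction
    dx = x_grid[1] - x_grid[0]
    # Cumulative trapezoidal integral of dz/dx, starting at zero on the left edge.
    integral = np.concatenate([[0.0], np.cumsum(0.5 * (jac[1:] + jac[:-1]) * dx)])
    offset = mlp(offset_net, np.array([[t]]))[0, 0]
    return integral + offset, log_jac

def log_density_at(x_samples, x_grid, t):
    """Interpolate grid values to particle positions; Gaussian latent density."""
    z_grid, log_jac_grid = tnf_on_grid(x_grid, t)
    z = np.interp(x_samples, x_grid, z_grid)
    log_jac = np.interp(x_samples, x_grid, log_jac_grid)
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi) + log_jac

x_grid = np.linspace(-5.0, 5.0, 201)
z_grid, _ = tnf_on_grid(x_grid, t=0.5)
log_p = log_density_at(np.array([-1.0, 0.0, 2.0]), x_grid, t=0.5)
```

The cost of one evaluation is set by the grid size, not by the number of particles, which only enter through the final interpolation step. Training would minimize the negative sum of `log_p` over all particles and frames with an automatic differentiation framework.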
We now demonstrate tNFs on three datasets:
(i) A multi-scale toy problem to show tNFs can accommodate different length scales in a single distribution;
(ii) A dataset of Brownian motion to show how tNFs enhance density estimation for sparse datasets;
(iii) A dataset of chemotactic walkers to show that tNFs can correctly estimate a multi-modal, non-Gaussian density.
A key problem in density estimation is inferring an accurate distribution when vastly different length scales are present within a single dataset. Classical approaches such as binning and KDE require a single characteristic length scale, prohibiting an accurate estimate of a multi-scale distribution. We now show that normalizing flows, and by extension tNFs, are capable of accurately inferring such a distribution.
Brownian motion is the most basic and ubiquitous random walk and thus an ideal test case to assess the performance of tNFs, comparing them to time independent NFs and classical binning. We generate a single trajectory for a Brownian random walker by the recursive relation $x_{n+1} = x_n + \sqrt{2D\Delta t}\,\xi_n$, with $\xi_n$ drawn from a standard normal distribution. Here $n$ is the step number with $x_0$ the initial position, $D$ the diffusion coefficient and $\Delta t$ the time step. In the limit of an infinite number of walkers, the walker density $p(x,t)$ is described by the diffusion equation, $\partial_t p = D\,\partial_x^2 p$.
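A dataset of this kind can be generated with a few lines of NumPy. The sketch below assumes Gaussian increments of variance 2DΔt per step and a Gaussian initial condition; all numerical values are hypothetical placeholders and are not the parameters of the dataset used in this paper.

```python
import numpy as np

def simulate_brownian(n_walkers, n_steps, diffusion, dt, x0_mean, x0_std, rng):
    """Return walker positions with shape (n_steps + 1, n_walkers)."""
    x = np.empty((n_steps + 1, n_walkers))
    x[0] = rng.normal(x0_mean, x0_std, size=n_walkers)
    # Independent Gaussian increments with variance 2 * D * dt per step.
    steps = np.sqrt(2.0 * diffusion * dt) * rng.normal(size=(n_steps, n_walkers))
    x[1:] = x[0] + np.cumsum(steps, axis=0)
    return x

def analytic_variance(t, diffusion, x0_std):
    """Variance of the spreading Gaussian that solves the diffusion equation."""
    return x0_std**2 + 2.0 * diffusion * t

rng = np.random.default_rng(0)
trajectories = simulate_brownian(n_walkers=500, n_steps=100, diffusion=1.0,
                                 dt=0.01, x0_mean=0.0, x0_std=0.5, rng=rng)
```

Each row of `trajectories` is one frame. For a large number of walkers the sample variance of a frame approaches `analytic_variance` at the corresponding time.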
Our dataset consists of walkers with , with snapshots being taken every for frames. The initial positions were sampled from a Gaussian centered at with width ; in this case, the diffusion equation can be solved exactly and the solution behaves as a spreading Gaussian in time. We show the estimated density at and in figure 4(a) and (b) for the tNF, the time independent NF and binning. The tNF provides a significantly better density estimate than the time independent NF, illustrated by the difference in error; for the tNF and for the NF, averaged over frames 15 and 85.
Normalizing flows are based on neural networks and hence prone to overfitting. We analyze the effect of overfitting in Appendix I and show that NFs overfit more strongly than tNFs and perform worse in terms of the error. We mainly attribute this improvement to the temporal correlations in the dataset, which suppress the natural frame-to-frame variations in the density estimate. Nonetheless, tNFs are not immune to overfitting and we speculate that performance could be enhanced by applying techniques such as early stopping.
For the diffusion equation the true mapping can be trivially derived. We compare it to the learned mapping in figure 4(c). It shows perfect agreement at , but deviates from the true curve for at . As can be seen in figure 4(a), no samples were present in this domain, explaining the deviation. Nonetheless, it implies that the network does not generalize well outside the sampling domain. We speculate that techniques such as batch normalization or a different architecture for the network (a recurrent network, for example) might further improve performance.
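For a Gaussian initial condition and a standard normal latent distribution, one valid form of the true mapping is the rescaling by the time dependent width of the spreading Gaussian. The sketch below writes this map down and checks by finite differences that the resulting density satisfies the diffusion equation. The parameter values are hypothetical, and the sign convention of the map is an assumption (the mirrored map is equally valid).

```python
import numpy as np

DIFFUSION, X0, SIGMA0 = 1.0, 0.0, 0.5   # hypothetical parameters

def true_mapping(x, t):
    """z(x, t) = (x - x0) / sigma(t) and its spatial derivative dz/dx."""
    sigma_t = np.sqrt(SIGMA0**2 + 2.0 * DIFFUSION * t)
    return (x - X0) / sigma_t, 1.0 / sigma_t

def density(x, t):
    """Density obtained from the latent Gaussian and the mapping."""
    z, dz_dx = true_mapping(x, t)
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi) * dz_dx

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
t, dt = 1.0, 1e-3

p = density(x, t)
# Central differences in time and space on the interior grid points.
dp_dt = (density(x, t + dt) - density(x, t - dt)) / (2.0 * dt)
d2p_dx2 = (p[2:] - 2.0 * p[1:-1] + p[:-2]) / dx**2
residual = dp_dt[1:-1] - DIFFUSION * d2p_dx2
```

The residual stays at the level of the discretization error, confirming that a map which is linear in x, with a slope that decreases in time, reproduces the spreading Gaussian.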
The Brownian motion presented in the previous section was a linear problem with a uni-modal, Gaussian solution. We now apply tNFs to so-called chemotactic walkers, a non-linear problem with a multi-modal solution. Bacteria and other micro-organisms sense gradients of chemicals throughout their environment and use this to guide their motion towards a food source. This effect is known as chemotaxis and is typically modelled by a random walker with a superimposed drift; $x_{n+1} = x_n + \sqrt{2D_p\Delta t}\,\xi_n + \chi\,\partial_x c\,\Delta t$, where $\xi_n$ is a standard normal random number, $\Delta t$ the time step, $c(x,t)$ the chemical density and $\chi$ the chemotactic sensitivity, which controls the interaction between the chemical and the bacteria. In the infinite walker limit, the walker and chemical density are given by the Keller-Segel model: $\partial_t p = D_p\,\partial_x^2 p - \chi\,\partial_x\left(p\,\partial_x c\right)$ and $\partial_t c = D_c\,\partial_x^2 c - k c$. Here $D_p$ and $D_c$ are the diffusion coefficients of the bacteria and the chemical respectively and a decay set by $k$ has been added to the chemical density.
Our dataset consisted of walkers with and we sampled the initial position from a Gaussian centred at . The food source was modelled by a Gaussian with diffusion coefficient , centred at ; the walkers will thus drift towards the food source over time. Figure 5 shows a comparison of the time independent NF, tNF and the binning method. In figure 5(a) and (b) we find that the tNF leads to a significantly more accurate density estimation, illustrated by the difference in error ( for the NF versus for the tNF, averaged over and ). The tNF captures the multi-modal distribution at excellently, without overfitting, in contrast to the time independent NF. The mapping, as shown in figure 5(c), is non-linear, in contrast to the mapping obtained for the Brownian motion.
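A chemotactic dataset of this type can be sketched as a random walk with an added drift along the chemical gradient. The example below makes several simplifying assumptions that are not taken from this paper: the chemical evolves independently of the walkers, it starts as a normalized Gaussian so that its diffusing and decaying profile is known in closed form, and all parameter values are hypothetical.

```python
import numpy as np

def chemical_gradient(x, t, x_c, sigma_c, d_c, decay):
    """dc/dx for a Gaussian food source that diffuses and decays (assumed closed form)."""
    var_t = sigma_c**2 + 2.0 * d_c * t
    c = (np.exp(-decay * t) * np.exp(-0.5 * (x - x_c) ** 2 / var_t)
         / np.sqrt(2.0 * np.pi * var_t))
    return -(x - x_c) / var_t * c

def simulate_chemotaxis(x_init, n_steps, dt, d_walker, chi, rng, **chem):
    """Random walk with a drift chi * dc/dx * dt added to every step."""
    x = np.empty((n_steps + 1, x_init.size))
    x[0] = x_init
    for n in range(n_steps):
        noise = np.sqrt(2.0 * d_walker * dt) * rng.normal(size=x_init.size)
        drift = chi * chemical_gradient(x[n], n * dt, **chem) * dt
        x[n + 1] = x[n] + noise + drift
    return x

rng = np.random.default_rng(2)
x_init = rng.normal(0.0, 0.5, size=2000)
trajectories = simulate_chemotaxis(x_init, n_steps=200, dt=0.01, d_walker=0.1,
                                   chi=2.0, rng=rng,
                                   x_c=2.0, sigma_c=1.0, d_c=0.05, decay=0.1)
mean_shift = trajectories[-1].mean() - trajectories[0].mean()
```

With the food source placed to the right of the initial cloud, `mean_shift` is positive: the walkers drift towards the source while still spreading diffusively.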
Density estimation near boundaries is often problematic [botev_kernel_2010, malec_nonparametric_nodate], as they introduce discontinuities in the profile. Applying KDE in such situations leads to non-zero probabilities past the boundary. We show here that NFs are less prone to these artifacts. In figure 6 we compare binning, KDE and tNF for 1000 random walkers between two reflective boundaries at . We show the corresponding Jacobian and latent density in figure 6b. At the boundaries, the latent density approaches zero, which must be compensated by the Jacobian to obtain the non-zero density of the true profile. How well the network is able to do this determines the quality of the estimate at the boundary and might lead to artifacts. To improve the density estimate near the boundary, we propose to use a latent distribution with finite support, e.g., the Epanechnikov kernel [epanechnikov_non-parametric_1969]. However, this introduces a discontinuity in the cost function, leading to training issues.
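The leakage of a Gaussian KDE past a reflective boundary can be quantified directly. The sketch below is an illustration under assumed values: the wall position, bandwidth, diffusion coefficient and step count are hypothetical and not the settings used for figure 6. It simulates reflected walkers and computes the fraction of KDE probability mass that lies outside the walls.

```python
import numpy as np
from math import erf

def reflect(x, wall):
    """Mirror positions that crossed the walls at +wall or -wall back inside."""
    x = np.where(x > wall, 2.0 * wall - x, x)
    return np.where(x < -wall, -2.0 * wall - x, x)

def simulate_reflected_walkers(n_walkers, n_steps, diffusion, dt, wall, rng):
    """Brownian walkers starting at the origin between two reflective walls."""
    x = np.zeros(n_walkers)
    for _ in range(n_steps):
        step = np.sqrt(2.0 * diffusion * dt) * rng.normal(size=n_walkers)
        x = reflect(x + step, wall)
    return x

def kde_mass_outside(samples, bandwidth, wall):
    """Fraction of Gaussian-KDE probability mass lying beyond the walls."""
    cdf = np.vectorize(lambda u: 0.5 * (1.0 + erf(u / np.sqrt(2.0))))
    right = 1.0 - cdf((wall - samples) / bandwidth)
    left = cdf((-wall - samples) / bandwidth)
    return float(np.mean(left + right))

rng = np.random.default_rng(3)
positions = simulate_reflected_walkers(n_walkers=1000, n_steps=1000, diffusion=1.0,
                                       dt=1e-3, wall=1.0, rng=rng)
leaked = kde_mass_outside(positions, bandwidth=0.2, wall=1.0)
```

All walkers remain inside the walls, yet a noticeable fraction of the KDE mass ends up outside. A flow-based estimate avoids this only to the extent that the Jacobian can compensate the vanishing latent density at the walls.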
Physics Informed Neural Networks (PINNs) [raissi_physics_2017] have emerged as a powerful yet simple method to include physical constraints in neural networks. They have been applied to (i) solve PDEs [lu_deepxde:_2019], (ii) infer parameters of a known equation [raissi_inferring_2017] and (iii) perform model discovery [both_deepmod:_2019]. Here, we propose Physics Informed Normalizing Flows (PINFs) to directly fit continuous models to single particle data. In contrast to PINNs, PINFs do not require an estimate of the density before fitting and explicitly conserve energy, mass or probability densities. By including the fitting in the cost function, PINFs form an end-to-end differentiable model to fit continuous models to discrete data. We construct it by adding the continuous model to the log-likelihood, analogously to a PINN,
$$\mathcal{L} = -\sum_{i}\log p(x_i,t_i) + \lambda\,\mathrm{MSE}\big(\partial_t p - \mathcal{F}(p, \partial_x p, \partial_x^2 p, \ldots)\big), \tag{9}$$
where $\partial_t p = \mathcal{F}(p, \partial_x p, \partial_x^2 p, \ldots)$ denotes the continuous model. Here, $\lambda$ is a constant and sets the relative strength of the fitting term. The two terms in eq. 9 are of different origin (i.e. a likelihood term vs an MSE term) and hence are typically of different orders of magnitude. Consequently, training is more complex than for PINNs, but preliminary testing on random walkers confirmed that PINFs are indeed capable of inferring the parameters of the PDE directly from the positional data. Further research however is required to improve the performance of these PINFs.
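The structure of such a combined cost function can be sketched as follows. This is a toy illustration under strong assumptions and not the authors' implementation: the flow is replaced by a one-parameter spreading Gaussian (`model_density`, hypothetical), the continuous model is the diffusion equation, and the PDE residual is evaluated by finite differences on a regular grid instead of by automatic differentiation.

```python
import numpy as np

def model_density(x, t, diffusion, x0=0.0, sigma0=0.5):
    """Stand-in for the flow density: a spreading Gaussian (assumption)."""
    var_t = sigma0**2 + 2.0 * diffusion * t
    return np.exp(-0.5 * (x - x0) ** 2 / var_t) / np.sqrt(2.0 * np.pi * var_t)

def pinf_loss(x_samples, t_samples, x_grid, t_grid, model_diffusion, pde_diffusion, lam):
    """Negative log-likelihood on the particles plus lam * MSE of the PDE residual."""
    nll = -np.sum(np.log(model_density(x_samples, t_samples, model_diffusion)))
    tt, xx = np.meshgrid(t_grid, x_grid, indexing="ij")
    p = model_density(xx, tt, model_diffusion)
    dp_dt = np.gradient(p, t_grid, axis=0, edge_order=2)
    dp_dx = np.gradient(p, x_grid, axis=1, edge_order=2)
    d2p_dx2 = np.gradient(dp_dx, x_grid, axis=1, edge_order=2)
    residual = dp_dt - pde_diffusion * d2p_dx2   # diffusion equation as the model
    mse = np.mean(residual**2)
    return nll + lam * mse, nll, mse

rng = np.random.default_rng(4)
t_samples = rng.uniform(0.5, 1.5, size=300)
x_samples = rng.normal(0.0, np.sqrt(0.5**2 + 2.0 * 1.0 * t_samples))  # true D = 1
x_grid = np.linspace(-6.0, 6.0, 241)
t_grid = np.linspace(0.5, 1.5, 51)

_, nll_true, mse_true = pinf_loss(x_samples, t_samples, x_grid, t_grid, 1.0, 1.0, lam=1.0)
_, _, mse_wrong = pinf_loss(x_samples, t_samples, x_grid, t_grid, 1.0, 0.5, lam=1.0)
```

The residual term is orders of magnitude smaller when the PDE coefficient matches the density than when it does not, while the likelihood term is of order the number of particles. This difference in scale between the two terms is what makes the choice of the weighting constant delicate.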
In this paper we have introduced temporal Normalizing Flows (tNFs), an extension of normalizing flows to estimate a time-varying probability density. We demonstrate that tNFs can naturally accommodate different length scales in a problem and outperform binning and time-independent normalizing flows, even when the density is non-Gaussian and multi-modal. tNFs use the full time series data to perform density estimation, rather than inferring the density one frame at a time. This exploits the temporal correlations in the data, which improves the performance of the neural network used to model the mapping. The use of an unconstrained monotonic neural network opens up the possibility of applying techniques such as batching and batch normalization, or even completely different architectures, e.g. RNNs.
We provide two perspectives, building on this work: (i) density estimation on a finite domain and (ii) discovering and fitting a PDE to the data. (i): Boundaries typically lead to discontinuous density profiles. In this situation, tNFs can provide a more accurate estimate of the true profile, compared to e.g. KDE. While the discontinuous density profile cannot be strictly modeled using a Gaussian latent distribution, we speculate that using a distribution with finite support could capture such discontinuities. (ii) Typically a continuous PDE can be derived for a time-dependent distribution. tNFs can be used to fit the corresponding PDE directly to positional data by simultaneously making an estimate of the density and fitting a PDE to the data. Rather than inferring parameters, we speculate that PINFs can also be used as PDE solvers (similar to [lu_deepxde:_2019]) where energy or mass conservation is required. Our initial results with these physics informed normalizing flows are encouraging, but much work remains to be done, especially optimizing the training scheme.
Our work fits in the wider context of temporal reasoning in machine learning. When applying generative modeling to a time series of images for example, the temporal axis must also be treated differently. Approaches based on modeling the latent time as a Gaussian process [casale_gaussian_nodate] or as a Linear Gaussian State Space Model [fraccaro_disentangled_nodate] have recently been brought forward. We propose that temporal normalizing flows could be used for similar time dependent applications.
Access to the temporal dynamics of a process has both theoretical and practical benefits. Analysis and modeling of experimental data is often limited to equilibrium processes [noe_boltzmann_2019], restricting the potential of the data at hand. Being able to study the temporal dynamics of such systems in terms of the underlying probability distribution or PDE opens up many opportunities in out-of-equilibrium science. We thus believe that tNFs can greatly aid the study of out-of-equilibrium processes.
In this appendix we study the effect of overfitting on the density estimation by comparing the error with the log-likelihood as a function of the training epoch in figure 7a and b. Here we performed a density estimate for 500 Brownian walkers with parameters identical to those selected in the main text. We delineate the minimum error with a black dashed line; note that this occurs after roughly 7000 epochs and that the error with respect to the analytical solution increases upon training further. The negative log likelihood keeps decreasing however, corresponding to overfitting the solution. We found empirically that the minimum error occurs roughly at the elbow of the cost function, which for all cases considered is roughly at epochs so all NFs and tNFs have been trained for 10000 epochs.
We show that tNFs are less prone to overfitting than NFs by comparing the log-likelihood and error for a single representative frame in figure 8 for both approaches. The log-likelihood of the time-independent NF keeps decreasing, leading to overfitting and an increased error. On the other hand, the tNF likelihood saturates and no longer decreases significantly after 10000 epochs, and neither does the error.